# sklearn for NLP

## Count Vectorizer
Calculates the term-frequency (TF) matrix for a set of documents.

Suppose our document space is listed below:
```
Train Document Set:

d1: The sky is blue.
d2: The sun is bright.

Test Document Set:

d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.
```

After lowercasing the text, removing unnecessary stop words, and similar preprocessing, we can build the vocabulary dictionary shown below:
```
{'blue': 0, 'bright': 1, 'sky': 2, 'sun': 3}
```

For d3, blue occurs 0 times, bright once, sky once, and sun once, so the TF feature vector for d3 is:
```
[0, 1, 1, 1]
```

For d4, blue occurs 0 times, bright once, sky 0 times, and sun twice, so the TF feature vector for d4 is:
```
[0, 1, 0, 2]
```
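
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: `STOP_WORDS` is a small hypothetical stand-in for a full stop-word list, the helper names `analyze`, `build_vocabulary`, and `count_vector` are made up for this example, and the token rule (words of two or more characters) is chosen to resemble the default token pattern.

```python
import re

# Hypothetical, tiny stop-word set; a real list (e.g. NLTK's) is much longer.
STOP_WORDS = {"the", "is", "in", "we", "can"}

def analyze(text):
    """Lowercase, extract tokens of 2+ word characters, drop stop words."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocabulary(documents):
    """Map each distinct term to a column index, in alphabetical order."""
    terms = sorted({t for doc in documents for t in analyze(doc)})
    return {term: index for index, term in enumerate(terms)}

def count_vector(text, vocabulary):
    """Count occurrences of each vocabulary term; unseen terms are ignored."""
    counts = [0] * len(vocabulary)
    for token in analyze(text):
        if token in vocabulary:
            counts[vocabulary[token]] += 1
    return counts

train = ["The sky is blue.", "The sun is bright."]
vocabulary = build_vocabulary(train)
d3 = count_vector("The sun in the sky is bright.", vocabulary)
d4 = count_vector("We can see the shining sun, the bright sun.", vocabulary)
print(vocabulary)  # {'blue': 0, 'bright': 1, 'sky': 2, 'sun': 3}
print(d3)          # [0, 1, 1, 1]
print(d4)          # [0, 1, 0, 2]
```

Words such as "see" and "shining" never appear in the training set, so they have no column and are silently dropped from the test vectors.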


In [2]:
train_data = ("The sky is blue.", "The sun is bright.")
test_data = ("The sun in the sky is bright.", "We can see the shining sun, the bright sun.")
print('train_data:')
print(train_data)
print('test_data:')
print(test_data)

train_data:
('The sky is blue.', 'The sun is bright.')
test_data:
('The sun in the sky is bright.', 'We can see the shining sun, the bright sun.')


In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

vectorizer = CountVectorizer(stop_words=stopwords.words('english'))

print(vectorizer)
print()
vectorizer.fit_transform(train_data)
print(vectorizer.vocabulary_)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs',... 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"],
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

{'sky': 2, 'blue': 0, 'sun': 3, 'bright': 1}


By default, `CountVectorizer` uses `analyzer='word'`, which lowercases the text, extracts tokens, and filters stop words when a list is supplied; accents are stripped only if `strip_accents` is set. The printed parameters above show these settings. With the vocabulary fitted, we can transform the test documents:

In [4]:
smatrix = vectorizer.transform(test_data)
print(smatrix)

  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (1, 1)	1
  (1, 3)	2


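Note that `smatrix` is a SciPy sparse matrix. Printing it lists only the nonzero entries as `(row, column)  value` pairs; scikit-learn returns the matrix in compressed sparse row (CSR) format rather than coordinate (COO) format. You can convert it to a dense array: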

In [5]:
tf_features = smatrix.toarray()
print(tf_features)

[[0 1 1 1]
 [0 1 0 2]]
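
To make the relationship between the printed nonzero entries and the dense array concrete, here is a minimal sketch in plain Python lists rather than SciPy. The helper `to_dense` is hypothetical and only mimics what `toarray()` produces for this small example.

```python
# Nonzero entries as (row, column, value), copied from the printed smatrix.
entries = [(0, 1, 1), (0, 2, 1), (0, 3, 1), (1, 1, 1), (1, 3, 2)]
n_rows, n_cols = 2, 4  # two test documents, four vocabulary terms

def to_dense(entries, n_rows, n_cols):
    """Expand (row, column, value) triples into a list of rows."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in entries:
        dense[row][col] = value
    return dense

dense = to_dense(entries, n_rows, n_cols)
print(dense)  # [[0, 1, 1, 1], [0, 1, 0, 2]]
```

Every position that is not listed stays at zero, which is why the sparse form is much smaller for large vocabularies.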


## TfidfVectorizer

TF stands for term frequency, and tf-idf stands for term frequency times inverse document frequency:

$\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)$.



With the default settings, `TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)`, the term frequency (the number of times a term occurs in a given document) is multiplied by the idf component, which is computed as

$\text{idf}(t, D) = \log{\frac{1 + N}{1+\text{df}(t)}} + 1$,

where $N$ is the total number of documents and $\text{df}(t)$ is the number of documents that contain token $t$.

$\text{tf}(t, d)$ is the number of times token $t$ appears in document $d$.
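
As a quick numeric check of the smoothed idf formula, the sketch below evaluates it for a few document frequencies. The corpus size of 8 and the helper name `smoothed_idf` are assumptions made for illustration; the logarithm is the natural logarithm.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed idf: log((1 + n_docs) / (1 + doc_freq)) + 1, natural log."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 8  # size of a hypothetical corpus
for doc_freq in (1, 4, 8):
    print(doc_freq, round(smoothed_idf(n_docs, doc_freq), 4))
# 1 2.5041
# 4 1.5878
# 8 1.0
```

A term that appears in every document gets an idf of 1 rather than 0, so it is down-weighted but not discarded.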

The resulting tf-idf vectors are then normalized by the Euclidean norm:

$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}$
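
The normalization step can be sketched as follows. The helper `l2_normalize` and the example weights are hypothetical; the zero-vector guard is an assumption added so the sketch never divides by zero.

```python
import math

def l2_normalize(vector):
    """Divide each component by the vector's Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)  # an all-zero vector is left unchanged
    return [x / norm for x in vector]

# Hypothetical tf-idf weights where every term has the same idf of 1.5.
weights = [0.0, 1.5, 0.0, 3.0]
unit = l2_normalize(weights)
print([round(x, 4) for x in unit])  # [0.0, 0.4472, 0.0, 0.8944]
```

Normalization divides out any common factor, so a document whose terms all share the same idf ends up with the same unit vector as its normalized raw counts.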

See the [scikit-learn feature extraction documentation](https://scikit-learn.org/stable/modules/feature_extraction.html) for details.

In [6]:
# Train and test data. Both the full documents and their labels ("Sports" vs "Non Sports")
train_data = ['Football: a great sport', 'The referee has been very bad this season', 'Our team scored 5 goals', 'I love tennis',
              'Politics is in decline in the UK', 'Brexit means Brexit', 'The parlament wants to create new legislation',
              'I so want to travel the world']
train_labels = ["Sports","Sports","Sports","Sports", "Non Sports", "Non Sports", "Non Sports", "Non Sports"]

test_data = ['Swimming is a great sport', 
             'A lot of policy changes will happen after Brexit', 
             'The table tennis team will travel to the UK soon for the European Championship']
test_labels = ["Sports","Non Sports","Sports"]

### Case: TfidfVectorizer

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))

print(vectorizer)

vectorizer.fit_transform(train_data)
print(vectorizer.vocabulary_)
print(len(vectorizer.vocabulary_))

smatrix = vectorizer.transform(test_data)
test_features = smatrix.toarray()
print(test_features)
print(test_features.shape)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs',... 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"],
        strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
{'football': 4, 'great': 6, 'sport': 16, 'referee': 13, 'bad': 0, 'season': 15, 'team': 17, 'scored': 14, 'goals': 5, 'love': 8, 'tennis': 18, 'politics': 12, 



### Case: Implementing it ourselves

In [8]:
def tokenize(sentence):
    # remove the punctuation
    import string
    remove_punctuation_map = dict((ord(char),None) for char in string.punctuation)
    sentence_no_punctuation = sentence.translate(remove_punctuation_map)
        
    # lower
    sentence_no_punctuation = sentence_no_punctuation.lower()
        
    # word_tokenize
    from nltk import word_tokenize
    words = word_tokenize(sentence_no_punctuation)
        
    # remove stopwords
    from nltk.corpus import stopwords
    filtered_words = [word for word in words if word not in stopwords.words('english')]
        
    # stem
    from nltk.stem import SnowballStemmer
    snowball_stemmer = SnowballStemmer("english")
    words_stemed = [snowball_stemmer.stem(word) for word in filtered_words]
    
    return words_stemed

def tf(word, document):
    word = tokenize(word)[0]
    words = tokenize(document)
    return sum(1 for word1 in words if word1 == word)

def idf(word, documents):
    # smoothed idf, matching the formula above: log((1 + N) / (1 + df)) + 1
    import math
    tokens_list = [tokenize(document) for document in documents]
    token = tokenize(word)[0]
    doc_freq = sum(1 for tokens in tokens_list if token in tokens)
    return math.log((1 + len(tokens_list)) / (1 + doc_freq)) + 1

def tf_idf(word, document, documents):
    return tf(word, document) * idf(word, documents)

vocabulary = vectorizer.vocabulary_
vocab_size = len(vocabulary)

import numpy as np

tf_idf_features = np.zeros((len(test_data), vocab_size), np.float32)
# compute tf-idf features; idf comes from the training documents,
# as it does when TfidfVectorizer is fitted on train_data
for i in range(len(test_data)):
    for token, index in vocabulary.items():
        tf_idf_features[i, index] = tf_idf(token, test_data[i], train_data)
    # L2-normalize each row
    tf_idf_features[i] = tf_idf_features[i] / np.sqrt(np.sum(tf_idf_features[i] ** 2))
print(tf_idf_features)

[[0.         0.         0.         0.         0.         0.
  0.70710677 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.70710677 0.
  0.         0.         0.         0.         0.         0.        ]
 [0.         1.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.5
  0.5        0.5        0.5        0.         0.         0.        ]]


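For these test documents, the result matches what sklearn's `TfidfVectorizer` produces. The agreement is not guaranteed in general: this implementation stems tokens with NLTK, which `TfidfVectorizer` does not do by default, so the two can diverge on other inputs.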