# sklearn for NLP

## Count Vectorizer
Calculates the term-frequency (TF) matrix for a set of documents.

Suppose our document space is listed below:
```
Train Document Set:

d1: The sky is blue.
d2: The sun is bright.

Test Document Set:

d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.
```

After lowercasing the text and removing unnecessary stop words, we can build the vocabulary dictionary from the training documents, as shown below:
```
{'blue': 0, 'bright': 1, 'sky': 2, 'sun': 3}
```

For d3, blue occurs 0 times, bright occurs once, sky occurs once, and sun occurs once, so the TF feature vector of d3 is:
```
[0, 1, 1, 1]
```

For d4, blue occurs 0 times, bright occurs once, sky occurs 0 times, and sun occurs twice, so the TF feature vector of d4 is:
```
[0, 1, 0, 2]
```
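The counting step described above can be sketched in plain Python. This is only an illustration of the arithmetic, not how scikit-learn implements it; the helper name `tf_vector` is hypothetical, and the vocabulary is copied from the example.

```python
import re
from collections import Counter

# Vocabulary from the example above: term -> column index.
vocabulary = {'blue': 0, 'bright': 1, 'sky': 2, 'sun': 3}

def tf_vector(document, vocabulary):
    """Count how often each vocabulary term occurs in `document`."""
    # Lowercase and split into word tokens; punctuation is discarded.
    tokens = re.findall(r"\w+", document.lower())
    counts = Counter(tokens)
    vector = [0] * len(vocabulary)
    # Terms outside the vocabulary (including stop words) are simply ignored.
    for term, index in vocabulary.items():
        vector[index] = counts[term]
    return vector

d3 = "The sun in the sky is bright."
d4 = "We can see the shining sun, the bright sun."
print(tf_vector(d3, vocabulary))  # [0, 1, 1, 1]
print(tf_vector(d4, vocabulary))  # [0, 1, 0, 2]
```

Words such as "shining" in d4 do not appear in the vocabulary, so they contribute nothing to the vector.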


In [2]:
train_data = ("The sky is blue.", "The sun is bright.")
test_data = ("The sun in the sky is bright.", "We can see the shining sun, the bright sun.")
print('train_data:')
print(train_data)
print('test_data:')
print(test_data)

train_data:
('The sky is blue.', 'The sun is bright.')
test_data:
('The sun in the sky is bright.', 'We can see the shining sun, the bright sun.')


In [9]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

vectorizer = CountVectorizer(stop_words=stopwords.words('english'))

print(vectorizer)
print()
# Learn the vocabulary from the training documents (the returned matrix is not used here).
vectorizer.fit_transform(train_data)
print(vectorizer.vocabulary_)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs',... 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"],
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

{'sky': 2, 'blue': 0, 'sun': 3, 'bright': 1}


By default, CountVectorizer uses the word analyzer (analyzer='word' in the output above; early scikit-learn releases named it WordNGramAnalyzer). It lowercases the text, extracts tokens, and filters stop words when a stop_words list is given. Accents are stripped only if strip_accents is set. Printing the vectorizer, as done above, shows these settings.
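As a rough sketch of what the word analyzer does, the snippet below lowercases a string, tokenizes it with the same token_pattern that appears in the printed parameters, and drops stop words. The function name `analyze` and the tiny `STOP_WORDS` set are assumptions made for illustration; the NLTK list used in the notebook is much longer, and the real analyzer has more options.

```python
import re

# Same regular expression as the token_pattern printed above:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "is", "in", "we", "can"}

def analyze(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize, then filter stop words."""
    lowered = text.lower()
    tokens = TOKEN_PATTERN.findall(lowered)
    return [token for token in tokens if token not in stop_words]

print(analyze("The sun in the sky is bright."))
# ['sun', 'sky', 'bright']
```

Because the pattern requires at least two word characters, single-letter tokens such as "a" never reach the stop-word filter.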

In [10]:
# Encode the test documents using the vocabulary learned from train_data.
smatrix = vectorizer.transform(test_data)
print(smatrix)

  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (1, 1)	1
  (1, 3)	2


Note that smatrix is a SciPy sparse matrix. Printing it lists only the nonzero entries as (row, column) count pairs, although scikit-learn returns the matrix in CSR (compressed sparse row) format rather than coordinate format. You can convert it to a dense array:

In [11]:
tf_features = smatrix.toarray()
print(tf_features)

[[0 1 1 1]
 [0 1 0 2]]
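To make the relationship between the two printouts concrete, here is a minimal plain-Python sketch that expands the (row, column) count entries into a dense table. It is not how SciPy's toarray works internally; the helper `to_dense` is hypothetical, and the entries and shape are copied from the output above.

```python
# Nonzero entries as printed for smatrix: (row, column) -> count.
entries = {(0, 1): 1, (0, 2): 1, (0, 3): 1, (1, 1): 1, (1, 3): 2}
n_rows, n_cols = 2, 4  # two test documents, four vocabulary terms

def to_dense(entries, n_rows, n_cols):
    """Fill a zero matrix with the stored nonzero counts."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (row, col), value in entries.items():
        dense[row][col] = value
    return dense

print(to_dense(entries, n_rows, n_cols))
# [[0, 1, 1, 1], [0, 1, 0, 2]]
```

Every cell not listed in the sparse printout is an implicit zero, which is why the sparse form is more compact for large vocabularies.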
