# CountVectorizer

CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.
Every vectorizer is built around two important methods:
<ol>
<li><b>Fit: </b>learns a vocabulary from one or more documents.</li>
<li><b>Transform: </b>encodes one or more documents as vectors, using the learned vocabulary.</li>
</ol>
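
To make the two steps concrete, here is a minimal pure-Python sketch of what fit and transform do conceptually. It is not the scikit-learn implementation: the functions `fit` and `transform` below are hypothetical helpers, and the regular expression only mimics CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters).

```python
import re

TOKEN_PATTERN = r"\b\w\w+\b"  # assumption: mimics the default word tokenizer


def fit(documents):
    """Learn a vocabulary: map every distinct token to a column index."""
    tokens = set()
    for doc in documents:
        tokens.update(re.findall(TOKEN_PATTERN, doc.lower()))
    # Indices follow the alphabetical order of the tokens.
    return {token: index for index, token in enumerate(sorted(tokens))}


def transform(documents, vocabulary):
    """Encode each document as a list of counts, one per vocabulary entry."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for token in re.findall(TOKEN_PATTERN, doc.lower()):
            if token in vocabulary:  # tokens outside the vocabulary are skipped
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows


documents = ["the cat is on the table"]
vocabulary = fit(documents)
encoded = transform(documents, vocabulary)
print(vocabulary)  # {'cat': 0, 'is': 1, 'on': 2, 'table': 3, 'the': 4}
print(encoded)     # [[1, 1, 1, 1, 2]]
```
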
## First Example
Let's start with a simple CountVectorizer.

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

text = ["the cat is on the table and plays with a ball"]
vectorizer = CountVectorizer()
vectorizer.fit(text)
vectorizer.vocabulary_

{u'and': 0,
 u'ball': 1,
 u'cat': 2,
 u'is': 3,
 u'on': 4,
 u'plays': 5,
 u'table': 6,
 u'the': 7,
 u'with': 8}

The <b>fit</b> function learns the vocabulary of the input: every distinct word is assigned an index.
Let's look at the <b>transform</b> function.

In [7]:
vector = vectorizer.transform(text)
print(vector.toarray())

[[1 1 1 1 1 1 1 2 1]]


The i-th entry of the vector is the number of times the i-th word of the vocabulary appears in the document. Indeed, every word appears once, except "the", which appears twice.
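
To read the vector more easily, the counts can be paired with their words. The sketch below is one possible way to do it: it sorts `vocabulary_` by column index and zips the result with the counts. It also uses `fit_transform`, which performs both steps in one call. The variable names `words` and `counts` are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["the cat is on the table and plays with a ball"]
vectorizer = CountVectorizer()
vector = vectorizer.fit_transform(text)  # fit and transform in one step

# Sort the vocabulary by column index so the words line up with the counts.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = dict(zip(words, vector.toarray()[0].tolist()))
print(counts)
```

Note that the one-letter word "a" does not appear among the keys: with the default settings, tokens shorter than two characters are dropped.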

## Second Example
Now an example with more than one document.

In [9]:
text = ["the cat is on the table", "the cat plays with the ball","Please get on the television"]
vectorizer = CountVectorizer()
vectorizer.fit(text)
vectorizer.vocabulary_

{u'ball': 0,
 u'cat': 1,
 u'get': 2,
 u'is': 3,
 u'on': 4,
 u'plays': 5,
 u'please': 6,
 u'table': 7,
 u'television': 8,
 u'the': 9,
 u'with': 10}

In [10]:
vector = vectorizer.transform(text)
print(vector.toarray())

[[0 1 0 1 1 0 0 1 0 2 0]
 [1 1 0 0 0 1 0 0 0 2 1]
 [0 0 1 0 1 0 1 0 1 1 0]]


In this case every column represents a word and every row represents a document: each entry is the number of times that word appears in that document.
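
Once the vocabulary is fitted, it can also encode documents that were not part of the fit. The following sketch, which uses a made-up sentence, suggests what happens to a word the vectorizer has never seen ("dog"): it has no column, so it is simply left out of the counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["the cat is on the table", "the cat plays with the ball", "Please get on the television"]
vectorizer = CountVectorizer()
vectorizer.fit(text)

# Hypothetical new document: "dog" is not in the fitted vocabulary.
new_document = ["the dog plays on the table"]
new_vector = vectorizer.transform(new_document).toarray()
print(new_vector.shape)  # one row, one column per vocabulary word
print(new_vector)
```
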

## Third Example
Let's try some of the parameters of CountVectorizer.

In [11]:
text = ["the cat is on the table", "the cat plays with the ball","Please get on the television"]
vectorizer = CountVectorizer(analyzer="char")
vectorizer.fit(text)
vectorizer.vocabulary_

{u' ': 0,
 u'a': 1,
 u'b': 2,
 u'c': 3,
 u'e': 4,
 u'g': 5,
 u'h': 6,
 u'i': 7,
 u'l': 8,
 u'n': 9,
 u'o': 10,
 u'p': 11,
 u's': 12,
 u't': 13,
 u'v': 14,
 u'w': 15,
 u'y': 16}

With the parameter <i>analyzer</i> set to <i>char</i>, the features are individual characters rather than words.
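
The counts follow the same logic as before, only per character. A very small sketch on a made-up string:

```python
from sklearn.feature_extraction.text import CountVectorizer

char_vectorizer = CountVectorizer(analyzer="char")
char_counts = char_vectorizer.fit_transform(["abca"]).toarray()

char_vocabulary = dict(char_vectorizer.vocabulary_)
print(char_vocabulary)  # one entry per distinct character
print(char_counts)      # "a" occurs twice, "b" and "c" once
```
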

Suppose we want to keep only a particular set of words, for example those with more than three letters.
We need to build a custom tokenizer.

In [16]:
class Parse():
    def __call__(self, text):
        return [x for x in text.split(" ") if len(x) > 3]

text = ["the cat is on the table.", "the cat plays with the ball.","Please get on the television."]
vectorizer = CountVectorizer(analyzer="word", tokenizer=Parse())
vectorizer.fit(text)
vectorizer.vocabulary_

{u'ball.': 0,
 u'plays': 1,
 u'please': 2,
 u'table.': 3,
 u'television.': 4,
 u'with': 5}

<i>ball</i>, <i>table</i> and <i>television</i> keep the period at the end. This can be a problem when the same word appears both in the middle and at the end of a sentence.

In [18]:
text = ["the cat is on the table.", "the table is brown"]
vectorizer = CountVectorizer(analyzer="word", tokenizer=Parse())
vectorizer.fit(text)
vectorizer.vocabulary_

{u'brown': 0, u'table': 1, u'table.': 2}

The same word is counted as two different entries.
To solve this problem, we can strip the punctuation from every token:

In [19]:
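import re
class Parse():
    def __call__(self, text):
        # Strip punctuation first, then keep words with more than three letters
        words = [re.sub(r'[^\w\s]', '', x) for x in text.split(" ")]
        return [w for w in words if len(w) > 3]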

text = ["the cat is on the table.", "the cat plays with the ball.","Please get on the television.","the table is brown"]
vectorizer = CountVectorizer(analyzer="word", tokenizer=Parse())
vectorizer.fit(text)
vectorizer.vocabulary_

{u'ball': 0,
 u'brown': 1,
 u'plays': 2,
 u'please': 3,
 u'table': 4,
 u'television': 5,
 u'with': 6}
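
A quick way to check a tokenizer is to call it directly on a string, without going through CountVectorizer. The sketch below uses a hypothetical helper, `strip_then_filter`, which removes punctuation before applying the length filter. The order matters: if the length were checked first, a three-letter word followed by a period (such as "cat.") would count as four characters and slip through.

```python
import re


def strip_then_filter(text, min_length=4):
    """Hypothetical helper: remove punctuation, then keep words of at least min_length letters."""
    words = [re.sub(r"[^\w\s]", "", word) for word in text.split(" ")]
    return [word for word in words if len(word) >= min_length]


tokens = strip_then_filter("the table. is near the cat.")
print(tokens)  # ['table', 'near']
```
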

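Finally, <i>min_df</i>, <i>max_df</i> and <i>max_features</i> restrict the vocabulary. They keep, respectively, the words that appear in at least a given number of documents, the words that appear in at most a given number of documents, and the given number of words with the highest total count.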

In [36]:
text = ["the cat is on the table.", "the cat plays with the ball.","Please get on the television.","the table is brown"]
vectorizer = CountVectorizer(min_df=2)
vectorizer.fit(text)
vectorizer.vocabulary_
vector = vectorizer.transform(text)
vector.toarray()

array([[1, 1, 1, 1, 2],
       [1, 0, 0, 0, 2],
       [0, 0, 1, 0, 1],
       [0, 1, 0, 1, 1]])

Every remaining word appears in at least 2 documents: each column has at least two nonzero entries.
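
The filter is based on document frequency, that is, the number of documents a word appears in, and not on the total count. As a rough check, the document frequencies can be computed by hand from the unfiltered count matrix and compared with what `min_df=2` keeps. The names `document_frequency` and `kept_by_hand` are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["the cat is on the table.", "the cat plays with the ball.",
        "Please get on the television.", "the table is brown"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(text).toarray()
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Document frequency: in how many documents does each word occur at least once?
document_frequency = dict(zip(words, (counts > 0).sum(axis=0).tolist()))

kept_by_hand = sorted(w for w, df in document_frequency.items() if df >= 2)
kept_by_min_df = sorted(CountVectorizer(min_df=2).fit(text).vocabulary_)
print(kept_by_hand)
print(kept_by_min_df)
```
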

In [37]:
text = ["the cat is on the table.", "the cat plays with the ball.","Please get on the television.","the table is brown"]
vectorizer = CountVectorizer(max_df=2)
vectorizer.fit(text)
vectorizer.vocabulary_
vector = vectorizer.transform(text)
vector.toarray()

array([[0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0],
       [1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
       [0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]])

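Every remaining word appears in at most 2 documents: each column has at most two nonzero entries. Only <i>the</i>, which appears in all four documents, is dropped.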
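
Both <i>min_df</i> and <i>max_df</i> also accept a float between 0.0 and 1.0, which is read as a proportion of the documents instead of an absolute number. With four documents, `max_df=0.5` should therefore behave like `max_df=2`, as this small sketch suggests:

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["the cat is on the table.", "the cat plays with the ball.",
        "Please get on the television.", "the table is brown"]

# Integer: absolute number of documents. Float: proportion of documents.
vocabulary_absolute = sorted(CountVectorizer(max_df=2).fit(text).vocabulary_)
vocabulary_proportion = sorted(CountVectorizer(max_df=0.5).fit(text).vocabulary_)

print(vocabulary_absolute)
print(vocabulary_absolute == vocabulary_proportion)
```
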

In [38]:
text = ["the cat is on the table.", "the cat plays with the ball.","Please get on the television.","the table is brown"]
vectorizer = CountVectorizer(max_features=3)
vectorizer.fit(text)
vectorizer.vocabulary_
vector = vectorizer.transform(text)
vector.toarray()

array([[1, 1, 2],
       [1, 0, 2],
       [0, 0, 1],
       [0, 1, 1]])

The result keeps only the three words with the highest total count across all documents.
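
Unlike <i>min_df</i> and <i>max_df</i>, this ranking uses the total number of occurrences in the whole collection. The sketch below computes these totals from the unfiltered matrix. In this data "the" has by far the highest total, while four words share the second place, so which two of them complete the top three depends on how ties are resolved; the sketch makes no assumption about that.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["the cat is on the table.", "the cat plays with the ball.",
        "Please get on the television.", "the table is brown"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(text).toarray()
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Total occurrences of each word across all documents (column sums).
totals = dict(zip(words, counts.sum(axis=0).tolist()))
top_words = sorted(CountVectorizer(max_features=3).fit(text).vocabulary_)

print(totals)
print(top_words)
```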