# Count Vectorizer

Count vectorization, also known as the bag-of-words model, is commonly used in document classification: the number of times each word occurs in a document is used as a feature for training.

We create two text documents as follows:

In [13]:
text1 = "I love my cat but the cat sat on my face"
text2 = "I love my dog but the dog sat on my bed"

## Word Tokenization

In [14]:
words1 = text1.split(" ")
words2 = text2.split(" ")

In [15]:
print(words1)

['I', 'love', 'my', 'cat', 'but', 'the', 'cat', 'sat', 'on', 'my', 'face']


## Combining the Words into a Single Set

In [16]:
corpus = set(words1).union(set(words2))
print(corpus)

{'dog', 'but', 'love', 'face', 'I', 'my', 'bed', 'cat', 'sat', 'the', 'on'}
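Before turning to scikit-learn, it may help to see how the combined set of words can be turned into count vectors by hand. The following is a minimal sketch, not the library's implementation; the names `vocabulary` and `count_vector` are introduced here for illustration only. The set is sorted so that every document uses the same column order.

```python
from collections import Counter

text1 = "I love my cat but the cat sat on my face"
text2 = "I love my dog but the dog sat on my bed"

words1 = text1.split(" ")
words2 = text2.split(" ")

# Sort the combined set so each word has a fixed column position.
vocabulary = sorted(set(words1) | set(words2))


def count_vector(words, vocabulary):
    """Return how often each vocabulary word occurs in `words` (hypothetical helper)."""
    counts = Counter(words)
    # Counter returns 0 for words that are missing from the document.
    return [counts[term] for term in vocabulary]


vector1 = count_vector(words1, vocabulary)
vector2 = count_vector(words2, vocabulary)

print(vocabulary)
print(vector1)
print(vector2)
```

Each vector has one entry per word in the combined set, so documents of different lengths become feature vectors of the same size.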


## Count Vectorization

In [17]:
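import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love my cat but the cat sat on my face",
    "I love my dog but the dog sat on my bed"
]

# Learn the vocabulary and build the sparse document-term count matrix.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() requires scikit-learn 1.0 or later.
feature_names = vectorizer.get_feature_names_out()

# Transpose so that rows are words and columns are documents.
df = pd.DataFrame(X.T.toarray(), index=feature_names, columns=corpus)
df.style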

| | I love my cat but the cat sat on my face | I love my dog but the dog sat on my bed |
|---|---|---|
| bed | 0 | 1 |
| but | 1 | 1 |
| cat | 2 | 0 |
| dog | 0 | 2 |
| face | 1 | 0 |
| love | 1 | 1 |
| my | 2 | 2 |
| on | 1 | 1 |
| sat | 1 | 1 |
| the | 1 | 1 |


The word counts are exported to a DataFrame, with one row per word and one column per document, which can be used as features for further analysis.
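The counts contain no row for `I`, even though both documents start with it. This is because `CountVectorizer` by default lowercases the text and keeps only tokens of two or more word characters. The sketch below reproduces that behaviour with the standard `re` module; the pattern is taken from the documented default of the `token_pattern` parameter, and `default_style_tokens` is a hypothetical helper, not part of scikit-learn.

```python
import re

# Assumption: this mirrors the documented default `token_pattern` of
# CountVectorizer, which matches runs of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_style_tokens(text):
    """Lowercase the text, then keep tokens of two or more characters (hypothetical helper)."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())


tokens = default_style_tokens("I love my cat but the cat sat on my face")
print(tokens)
# The single-character token "i" is dropped; every other word is kept.
```

If single-character words matter for the task, passing a pattern such as `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` should keep them, at the cost of a slightly larger vocabulary.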