# TF-IDF Vectorizer

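In simple terms, TF-IDF (term frequency-inverse document frequency) is a numerical statistic that reflects how important a word is to one document relative to a collection of documents. A word scores high when it occurs often in that document but in few of the others.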
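
As a sketch of what the statistic looks like, the formulas below give the variant that scikit-learn's `TfidfVectorizer` computes with its default settings (smoothed IDF followed by L2 normalization). Other references define IDF slightly differently, so treat this as one convention rather than the only one. Here $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in document $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Each document's vector of tf-idf values is then divided by its Euclidean norm, so every document ends up with unit length.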

We create two text documents as follows:

In [54]:
text1 = "I love my cat but the cat sat on my face"
text2 = "I love my dog but the dog sat on my bed"

## Word Tokenization

In [55]:
words1 = text1.split(" ")
words2 = text2.split(" ")

In [56]:
print(words1)

['I', 'love', 'my', 'cat', 'but', 'the', 'cat', 'sat', 'on', 'my', 'face']
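
Splitting on spaces keeps case and single-letter words such as 'I'. As a rough sketch of how scikit-learn's `TfidfVectorizer` tokenizes with its default settings, the snippet below lowercases the text and keeps only tokens of two or more word characters. The `tokenize` helper and the `TOKEN_PATTERN` name are illustrative, not part of the library.

```python
import re

text1 = "I love my cat but the cat sat on my face"

# Regex matching scikit-learn's default token_pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

tokens1 = tokenize(text1)
print(tokens1)
```

With this rule the one-letter word 'I' is dropped, so it never becomes a feature.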


## Combining the Words into a Single Set

In [57]:
vocabulary = set(words1).union(set(words2))
print(vocabulary)

{'bed', 'but', 'sat', 'love', 'cat', 'my', 'face', 'the', 'I', 'dog', 'on'}
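
The combined set is the starting point for the "inverse document frequency" part of TF-IDF: for each unique word, count how many documents contain it. A minimal sketch, where `documents` and `doc_freq` are names introduced here for illustration:

```python
words1 = "I love my cat but the cat sat on my face".split(" ")
words2 = "I love my dog but the dog sat on my bed".split(" ")
documents = [words1, words2]

# All unique words across both documents
vocabulary = set(words1) | set(words2)

# Document frequency: the number of documents in which each word appears at least once
doc_freq = {
    word: sum(word in doc for doc in documents)
    for word in sorted(vocabulary)
}

for word, count in doc_freq.items():
    print(word, count)
```

Words that occur in both documents (document frequency 2) help little in telling the documents apart, while 'cat', 'face', 'dog' and 'bed' each occur in only one.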


## TF-IDF Vectorization

In [60]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Each string is one document in the corpus
corpus = ["I love my cat but the cat sat on my face", "I love my dog but the dog sat on my bed"]

# Learn the vocabulary and IDF weights, then build the document-term matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# get_feature_names_out() requires scikit-learn >= 1.0
feature_names = vectorizer.get_feature_names_out()

# Transpose so that rows are terms and columns are documents
df = pd.DataFrame(X.T.toarray(), index=feature_names, columns=corpus)
df.style

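| term | I love my cat but the cat sat on my face | I love my dog but the dog sat on my bed |
|------|------------------------------------------|-----------------------------------------|
| bed  | 0.0      | 0.323487 |
| but  | 0.230164 | 0.230164 |
| cat  | 0.646975 | 0.0      |
| dog  | 0.0      | 0.646975 |
| face | 0.323487 | 0.0      |
| love | 0.230164 | 0.230164 |
| my   | 0.460328 | 0.460328 |
| on   | 0.230164 | 0.230164 |
| sat  | 0.230164 | 0.230164 |
| the  | 0.230164 | 0.230164 |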


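In the first sentence, 'cat' (0.647), 'my' (0.460) and 'face' (0.323) carry the highest weights; in the second sentence, 'dog', 'my' and 'bed' do. 'cat', 'face', 'dog' and 'bed' score high because each appears in only one document. 'my' appears in both documents, so it gets no IDF boost; it ranks high only because it occurs twice in each sentence.

The word 'I' is missing from the table because TfidfVectorizer lowercases the text and, by default, ignores single-character tokens.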
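
To see where the numbers in the table come from, the sketch below recomputes them with the standard library only. It assumes scikit-learn's default settings (lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization). The helper names `tokenize` and `tfidf` are illustrative, not library functions.

```python
import math
import re
from collections import Counter

docs = [
    "I love my cat but the cat sat on my face",
    "I love my dog but the dog sat on my bed",
]

def tokenize(text):
    # Assumed to mirror TfidfVectorizer's defaults: lowercase, tokens of 2+ word characters
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)
vocab = sorted({t for tokens in tokenized for t in tokens})

# Document frequency and smoothed inverse document frequency per term
doc_freq = {t: sum(t in tokens for tokens in tokenized) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf(tokens):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    counts = Counter(tokens)
    raw = {t: counts[t] * idf[t] for t in vocab}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

weights = [tfidf(tokens) for tokens in tokenized]
for term in vocab:
    print(f"{term:5s} {weights[0][term]:.6f} {weights[1][term]:.6f}")
```

Terms shared by both documents get an IDF of exactly 1, while terms unique to one document get about 1.405. This is why 'cat', with two occurrences and the higher IDF, outranks 'my', which also has two occurrences but the lower IDF.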