# TfidfTransformer & TfidfVectorizer

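Scikit-learn’s TfidfTransformer and TfidfVectorizer have the same goal: to produce a matrix of TF-IDF features for a collection of documents. TfidfVectorizer takes the raw documents directly, while TfidfTransformer takes a matrix of term counts such as the one produced by CountVectorizer.

So what is the difference in practice?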

## TfidfTransformer Usage

**First, create a dummy dataset and apply both classes to it to understand how each one is used.**

In [3]:
import pandas as pd 
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
docs=["the house had a tiny little mouse", 
"the cat saw the mouse", 
"the mouse ran away from the house", 
"the cat finally ate the mouse", 
"the end of the mouse story"
]

**Check our dataset.**

In [5]:
docs

['the house had a tiny little mouse',
 'the cat saw the mouse',
 'the mouse ran away from the house',
 'the cat finally ate the mouse',
 'the end of the mouse story']

**To use TfidfTransformer, you first have to create a CountVectorizer to count the words in each document (the term frequencies). This is also the place to apply preprocessing (for example stemming or lemmatization through a custom tokenizer), remove stop words, limit the vocabulary size, etc.**

In [6]:
cv=CountVectorizer() 
word_count_vector=cv.fit_transform(docs)

**Now, let’s check the shape. We should have 5 rows (one per document) and 16 columns (the 16 unique words, not counting single-character words such as “a”).**

In [8]:
word_count_vector.shape

(5, 16)
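To see where the 16 columns come from, the fitted vocabulary can be inspected directly. The following is a minimal sketch that rebuilds the same `docs` list so it runs on its own; the variable names `vocab` and `counts_doc2` are illustrative and not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

cv = CountVectorizer()
word_count_vector = cv.fit_transform(docs)

# vocabulary_ maps each term to its column index; order the terms by that index
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(len(vocab))
print(vocab)

# Non-zero counts for the second document ("the cat saw the mouse")
row = word_count_vector[1].toarray().ravel()
counts_doc2 = {term: int(count) for term, count in zip(vocab, row) if count > 0}
print(counts_doc2)
```

Each column of `word_count_vector` corresponds to one entry of `vocab`, and each row holds the raw counts for one document.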

**Now we compute the IDF values by calling tfidf_transformer.fit(word_count_vector) on the word counts.**

In [9]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True) 
tfidf_transformer.fit(word_count_vector)

TfidfTransformer()

In [11]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=["idf_weights"])
df_idf.sort_values(by=['idf_weights'])

|  | idf_weights |
|---|---|
| mouse | 1.0 |
| the | 1.0 |
| cat | 1.693147 |
| house | 1.693147 |
| ate | 2.098612 |
| away | 2.098612 |
| end | 2.098612 |
| finally | 2.098612 |
| from | 2.098612 |
| had | 2.098612 |


**Notice that the words ‘mouse’ and ‘the’ have the lowest IDF values. This is expected, as these words appear in every document in our collection. The lower the IDF value of a word, the less unique it is to any particular document.**

**Once you have the IDF values, you can compute the tf-idf scores.**
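The IDF weights can be reproduced by hand. The sketch below assumes the smoothed formula that scikit-learn documents for `smooth_idf=True`, namely `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing term `t`. The names `doc_freq` and `manual_idf` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

cv = CountVectorizer()
counts = cv.fit_transform(docs)

n_docs = counts.shape[0]
# Document frequency: in how many documents does each term occur at least once?
doc_freq = np.asarray((counts > 0).sum(axis=0)).ravel()

# Smoothed IDF, as documented for smooth_idf=True
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

sk_idf = TfidfTransformer(smooth_idf=True, use_idf=True).fit(counts).idf_
print(np.allclose(manual_idf, sk_idf))
```

With 5 documents, a term found in all of them gets `ln(6/6) + 1 = 1.0`, which is why ‘mouse’ and ‘the’ sit at the bottom of the table.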

In [12]:
count_vector=cv.transform(docs) 
tf_idf_vector=tfidf_transformer.transform(count_vector)

**Internally, this multiplies each term frequency by the term’s IDF value and then, with the default norm='l2', scales each document vector to unit length.**
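As a rough check of that computation, the scores for the first document can be rebuilt by hand. This sketch assumes the default `norm="l2"` setting of TfidfTransformer, so the raw tf * idf products are divided by their Euclidean length; `raw` and `manual` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

cv = CountVectorizer()
counts = cv.fit_transform(docs)
transformer = TfidfTransformer(smooth_idf=True, use_idf=True).fit(counts)

# Raw tf * idf products for the first document
tf = counts[0].toarray().ravel()
raw = tf * transformer.idf_

# Default norm="l2": scale the vector to unit Euclidean length
manual = raw / np.linalg.norm(raw)

sk_scores = transformer.transform(counts[0]).toarray().ravel()
print(np.allclose(manual, sk_scores))
print(round(float(manual[cv.vocabulary_["had"]]), 6))
```

The normalization step is the reason the scores fall between 0 and 1 even though the IDF weights themselves are 1.0 or larger.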

In [21]:
feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
 
#get tfidf vector for first document 
first_document_vector=tf_idf_vector[0] 
 
#print the scores 
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidftransformer"]) 
df.sort_values(by=["tfidftransformer"],ascending=False)

|  | tfidftransformer |
|---|---|
| had | 0.493562 |
| little | 0.493562 |
| tiny | 0.493562 |
| house | 0.398203 |
| mouse | 0.235185 |
| the | 0.235185 |
| ate | 0.0 |
| away | 0.0 |
| cat | 0.0 |
| end | 0.0 |


**Notice that only certain words have scores. Our first document is “the house had a tiny little mouse”, so every word in this document has a tf-idf score and everything else shows up as zero. Notice also that the word “a” is missing from this list. This is because CountVectorizer’s default token_pattern only matches tokens of two or more alphanumeric characters, so single-character words are dropped.**

The scores above make sense: the more common a word is across documents, the lower its score, and the more unique a word is to our first document (e.g. ‘had’ and ‘tiny’), the higher its score.
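The effect of the tokenizer on single-character words can be checked in isolation. This is a small sketch, assuming you want to keep one-character tokens; the alternative pattern `r"(?u)\b\w+\b"` and the names `default_cv` and `keep_short_cv` are illustrative choices, not something the original notebook uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["the house had a tiny little mouse"]

# Default token_pattern is r"(?u)\b\w\w+\b": tokens need at least two characters
default_cv = CountVectorizer()
default_cv.fit(doc)

# A looser pattern that also keeps one-character tokens such as "a"
keep_short_cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
keep_short_cv.fit(doc)

print("a" in default_cv.vocabulary_)
print("a" in keep_short_cv.vocabulary_)
print(len(default_cv.vocabulary_), len(keep_short_cv.vocabulary_))
```

Whether keeping such tokens is useful depends on the task; in English they are mostly stop words, which is likely why the default drops them.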

## TfidfVectorizer Usage

Now we use the same 5 documents to do the same thing we did with TfidfTransformer, which is to get the tf-idf scores of a set of documents. Notice how much shorter this is.

With TfidfVectorizer you compute the word counts, IDF values, and tf-idf scores all at once. It’s really simple.

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer 

tfidf_vectorizer=TfidfVectorizer(use_idf=True) 
tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(docs)

In [22]:
first_vector_tfidfvectorizer=tfidf_vectorizer_vectors[0] 
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df['tfidfvectorize'] = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names_out())
df.sort_values(by=["tfidfvectorize"],ascending=False)

|  | tfidftransformer | tfidfvectorize |
|---|---|---|
| had | 0.493562 | 0.493562 |
| little | 0.493562 | 0.493562 |
| tiny | 0.493562 | 0.493562 |
| house | 0.398203 | 0.398203 |
| mouse | 0.235185 | 0.235185 |
| the | 0.235185 | 0.235185 |
| ate | 0.0 | 0.0 |
| away | 0.0 | 0.0 |
| cat | 0.0 | 0.0 |
| end | 0.0 | 0.0 |
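The table compares only the first document. A quick way to confirm that the two approaches agree on every document is to compare the full matrices. This is a minimal sketch with default settings for both classes; `two_step`, `one_step`, `same_vocab`, and `same_scores` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# Two-step route: counts first, then tf-idf weighting
cv = CountVectorizer()
two_step = TfidfTransformer(smooth_idf=True, use_idf=True).fit_transform(
    cv.fit_transform(docs)
)

# One-step route: raw documents straight to tf-idf
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
one_step = tfidf_vectorizer.fit_transform(docs)

same_vocab = cv.vocabulary_ == tfidf_vectorizer.vocabulary_
same_scores = np.allclose(two_step.toarray(), one_step.toarray())
print(same_vocab, same_scores)
```

The match holds here because both routes use the same tokenization and weighting defaults; changing a parameter on only one side would break it.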


# TfidfTransformer vs. TfidfVectorizer
In summary, the main differences between the two classes are as follows:

With TfidfTransformer, you first compute the word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the tf-idf scores.

With TfidfVectorizer, by contrast, you do all three steps at once. Under the hood, it computes the word counts, IDF values, and tf-idf scores, all from the same dataset.
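Both routes can also score documents that were not part of the fitted collection. The sketch below chains the separate steps with `make_pipeline` and compares the result with TfidfVectorizer on a hypothetical unseen sentence; `new_docs` and the other variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]
new_docs = ["the cat chased the mouse"]  # hypothetical unseen document

# Separate steps chained explicitly: counts, then IDF weighting
pipeline = make_pipeline(CountVectorizer(), TfidfTransformer())
pipeline.fit(docs)

# All steps in one object
vectorizer = TfidfVectorizer().fit(docs)

pipeline_scores = pipeline.transform(new_docs).toarray()
vectorizer_scores = vectorizer.transform(new_docs).toarray()

print(np.allclose(pipeline_scores, vectorizer_scores))
# "chased" was never seen during fit, so it has no column and is ignored
print("chased" in vectorizer.vocabulary_)
```

Keeping the steps separate is mainly useful when the raw counts are needed elsewhere, for example to feed another model; otherwise the one-step class is the shorter choice.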

In [24]:
df
df.sort_values(by=["tfidfvectorize"],ascending=False)

|  | tfidftransformer | tfidfvectorize |
|---|---|---|
| had | 0.493562 | 0.493562 |
| little | 0.493562 | 0.493562 |
| tiny | 0.493562 | 0.493562 |
| house | 0.398203 | 0.398203 |
| mouse | 0.235185 | 0.235185 |
| the | 0.235185 | 0.235185 |
| ate | 0.0 | 0.0 |
| away | 0.0 | 0.0 |
| cat | 0.0 | 0.0 |
| end | 0.0 | 0.0 |
