Reference: https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.XuwXWmhKiM8

In [78]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer 

## CountVectorizer

In [35]:
docs = ["Henry is man", "Henry is strong"]

In [36]:
#instantiate CountVectorizer()
cv=CountVectorizer()

In [37]:
# scikit-learn's CountVectorizer transforms a corpus of text into a matrix of term / token counts.
word_count_vector = cv.fit_transform(docs)

In [38]:
word_count_vector.shape # 2 docs, 4 vocab

(2, 4)

In [39]:
# word_count_vector is a 2 x 4 sparse matrix (2 documents, 4 unique vocab in docs)

# Each column in the matrix represents a unique word in the vocabulary, 
# while each row represents the document in our dataset.

# The values in each cell are the word counts (term frequency of term t in doc d). 
# Note that with this representation, counts of some words could be 0 if the word 
# did not appear in the corresponding document.

In [40]:
# With CountVectorizer we are converting raw text to a numerical vector representation of words and n-grams. 
# This makes it easy to directly use this representation as features (signals) 
# in Machine Learning tasks such as for text classification and clustering.

In [41]:
# show resulting vocabulary; the numbers are not counts, they are the position in the sparse vector.
cv.vocabulary_

{'henry': 0, 'is': 1, 'man': 2, 'strong': 3}

## Compute the term frequency (tf)

In [44]:
# My corpus with 5 documents and 16 vocabs
docs=["the house had a tiny little mouse",
      "the cat saw the mouse",
      "the mouse ran away from the house",
      "the cat finally ate the mouse",
      "the end of the mouse story"
     ]

In [45]:
#instantiate CountVectorizer()
cv=CountVectorizer()
# this step generates word counts for the words in your docs
word_count_vector=cv.fit_transform(docs)

In [48]:
# shape of the sparse matrix, contains the tf of each term t in doc d
word_count_vector.shape

(5, 16)

In [49]:
# show resulting vocabulary; the numbers are not counts, they are the position in the sparse vector.
cv.vocabulary_

{'the': 14,
 'house': 7,
 'had': 6,
 'tiny': 15,
 'little': 8,
 'mouse': 9,
 'cat': 2,
 'saw': 12,
 'ran': 11,
 'away': 1,
 'from': 5,
 'finally': 4,
 'ate': 0,
 'end': 3,
 'of': 10,
 'story': 13}

In [53]:
cv.get_feature_names_out().tolist() # show the vocab sorted by position in the sparse vector (get_feature_names() was removed in scikit-learn 1.2)

['ate',
 'away',
 'cat',
 'end',
 'finally',
 'from',
 'had',
 'house',
 'little',
 'mouse',
 'of',
 'ran',
 'saw',
 'story',
 'the',
 'tiny']

## Compute the idf values

In [50]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector) # use word_count_vector (sparse matrix) as argument

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [54]:
tfidf_transformer.idf_.shape  # the idf weights

(16,)

In [57]:
tfidf_transformer.idf_  # an array of the idf weights for each of the 16 vocab

array([2.09861229, 2.09861229, 1.69314718, 2.09861229, 2.09861229,
       2.09861229, 2.09861229, 1.69314718, 2.09861229, 1.        ,
       2.09861229, 2.09861229, 2.09861229, 2.09861229, 1.        ,
       2.09861229])

In [61]:
# print idf values
idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=["idf_weights"])
# sort ascending by idf weights
idf.sort_values(by=['idf_weights'])

| term | idf_weights |
|---|---|
| mouse | 1.0 |
| the | 1.0 |
| cat | 1.693147 |
| house | 1.693147 |
| ate | 2.098612 |
| away | 2.098612 |
| end | 2.098612 |
| finally | 2.098612 |
| from | 2.098612 |
| had | 2.098612 |


Notice that the words ‘mouse’ and ‘the’ have the lowest IDF values. This is expected, as these words appear in every document in our collection. The lower the IDF value of a word, the less unique it is to any particular document and the less it contributes to that document's tf-idf score.
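These weights are consistent with the smoothed formula that scikit-learn documents for smooth_idf=True: idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The sketch below recomputes a few weights by hand. The helper `smoothed_idf` is a hypothetical name, and splitting on whitespace is a simplification that happens to be sufficient for these terms.

```python
import math

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]
n_docs = len(docs)


def smoothed_idf(term):
    """Hypothetical helper: idf(t) = ln((1 + n) / (1 + df(t))) + 1."""
    # df(t): number of documents that contain the term at least once.
    doc_freq = sum(term in doc.split() for doc in docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


for term in ("mouse", "cat", "ate"):
    print(term, round(smoothed_idf(term), 6))
```

A term found in all five documents gets ln(6/6) + 1 = 1, a term found in two documents gets ln(6/3) + 1, and a term found in one document gets ln(6/2) + 1, which lines up with the three distinct values in the table.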

## Compute the TFIDF score for your documents
- tf-idf(t,d) is defined for each term t in document d

The first line of the cell below gets the word counts for the documents in sparse matrix form. We could have reused word_count_vector from above. In practice, however, you will often compute tf-idf scores for new, unseen documents, in which case you first call cv.transform(your_new_docs) to generate the matrix of word counts.

In [69]:
# Use the fitted tfidf_transformer to turn the count matrix into tf-idf scores for each term t in each doc d.
# Internally, tfidf_transformer.transform(count_vector) multiplies each term frequency by its IDF weight
# and then, with the default norm='l2', scales each document row to unit length.
count_vector=cv.transform(docs)
tf_idf_vector=tfidf_transformer.transform(count_vector)
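The same two calls apply to documents that were not part of the fit. The following is a sketch under stated assumptions: `new_docs` and its single sentence are made up for illustration, and the word "chased" is deliberately chosen because it is absent from the fitted vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

cv = CountVectorizer()
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(cv.fit_transform(docs))

# Hypothetical unseen document; "chased" is not in the fitted vocabulary.
new_docs = ["the cat chased the mouse"]

# transform() (not fit_transform()) reuses the fitted vocabulary and IDF weights.
new_counts = cv.transform(new_docs)
new_tfidf = tfidf_transformer.transform(new_counts)

print(new_counts.shape)  # same number of columns as the training matrix
print(new_tfidf.nnz)     # number of non-zero tf-idf scores
```

Out-of-vocabulary words are silently ignored, so only the, cat, and mouse receive scores here.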

In [70]:
# Now, let’s print the tf-idf values of the first document to see if it makes sense.
feature_names = cv.get_feature_names_out().tolist()

In [71]:
feature_names

['ate',
 'away',
 'cat',
 'end',
 'finally',
 'from',
 'had',
 'house',
 'little',
 'mouse',
 'of',
 'ran',
 'saw',
 'story',
 'the',
 'tiny']

In [72]:
#get tfidf vector for first document
first_document_vector=tf_idf_vector[0]

In [73]:
first_document_vector

<1x16 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [74]:
#print the scores
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

| term | tfidf |
|---|---|
| had | 0.493562 |
| little | 0.493562 |
| tiny | 0.493562 |
| house | 0.398203 |
| mouse | 0.235185 |
| the | 0.235185 |
| ate | 0.0 |
| away | 0.0 |
| cat | 0.0 |
| end | 0.0 |


Notice that only certain words have scores. Our first document is “the house had a tiny little mouse”, so the words in this document have a tf-idf score and every other vocabulary term shows up as zero. Notice that the word “a” is missing from this list. This is because CountVectorizer's default token pattern only matches tokens of two or more alphanumeric characters, so single-character words are dropped during tokenization.
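The effect of the tokenizer on single-character words can be checked directly. In this sketch, `default_cv` and `single_char_cv` are hypothetical names, and the loosened pattern is just one possible way to keep one-character tokens, not a recommendation.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the house had a tiny little mouse"]

# Default token_pattern is r"(?u)\b\w\w+\b": at least two word characters.
default_cv = CountVectorizer()
default_cv.fit(docs)

# Assumption for illustration: accept tokens of one or more word characters.
single_char_cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
single_char_cv.fit(docs)

print("a" in default_cv.vocabulary_)
print("a" in single_char_cv.vocabulary_)
```

With the default pattern the word "a" never reaches the vocabulary; with the loosened pattern it does.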

The scores above make sense. The more common a word is across documents, the lower its score; the more unique a word is to our first document (e.g. ‘had’ and ‘tiny’), the higher its score.
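The scores can also be reproduced by hand, which shows where the numbers come from: raw counts are multiplied by the IDF weights, and each document row is then scaled to unit L2 length (the default norm='l2'). This is a sketch rather than the library's actual implementation; `raw`, `manual`, and `library` are names introduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

cv = CountVectorizer()
counts = cv.fit_transform(docs)
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True).fit(counts)

# Step 1: weight each raw term count by the term's IDF (broadcast over rows).
raw = counts.toarray() * tfidf_transformer.idf_

# Step 2: divide each row by its Euclidean length (L2 normalisation).
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Compare against the scores produced by the fitted transformer.
library = tfidf_transformer.transform(counts).toarray()
print(np.allclose(manual, library))
```

Because of the normalisation step, a score such as the one for ‘had’ depends on every other term in the same document, not only on its own count and IDF.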

## TfidfVectorizer
- With TfidfVectorizer you compute the word counts, IDF values, and tf-idf scores all at once.

In [76]:
docs

['the house had a tiny little mouse',
 'the cat saw the mouse',
 'the mouse ran away from the house',
 'the cat finally ate the mouse',
 'the end of the mouse story']

In [81]:
# settings that you use for count vectorizer will go here
tfidf_vectorizer=TfidfVectorizer(use_idf=True)

# just send in all your docs here
tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(docs)

Now let’s print the tf-idf values for the first document from our collection. Notice that these values are identical to the ones from TfidfTransformer; the only difference is that they are computed in just two steps.

In [82]:
# get the first vector out (for the first document)
first_vector_tfidfvectorizer=tfidf_vectorizer_vectors[0]
 
# place tf-idf values in a pandas data frame
df = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names_out(), columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

| term | tfidf |
|---|---|
| had | 0.493562 |
| little | 0.493562 |
| tiny | 0.493562 |
| house | 0.398203 |
| mouse | 0.235185 |
| the | 0.235185 |
| ate | 0.0 |
| away | 0.0 |
| cat | 0.0 |
| end | 0.0 |
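Rather than comparing the two tables by eye, the equivalence can be checked numerically for the whole corpus. A minimal sketch, assuming default settings on both sides; `two_step`, `one_step`, and `same_scores` are names introduced for this check.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# Route 1: counts first, then tf-idf weighting.
count_vectorizer = CountVectorizer()
two_step = TfidfTransformer().fit_transform(count_vectorizer.fit_transform(docs))

# Route 2: everything in a single estimator.
tfidf_vectorizer = TfidfVectorizer()
one_step = tfidf_vectorizer.fit_transform(docs)

# Both routes should share a vocabulary and yield the same score matrix.
same_vocab = count_vectorizer.vocabulary_ == tfidf_vectorizer.vocabulary_
same_scores = np.allclose(two_step.toarray(), one_step.toarray())
print(same_vocab, same_scores)
```

The check only holds when both routes use matching settings (tokenization, smooth_idf, norm, and so on); changing one side alone would make the matrices diverge.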


**Calling fit and transform separately**

In [84]:
tfidf_vectorizer=TfidfVectorizer(use_idf=True)
 
# just send in all your docs here
fitted_vectorizer=tfidf_vectorizer.fit(docs)
tfidf_vectorizer_vectors=fitted_vectorizer.transform(docs)

With TfidfTransformer, you first compute word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the tf-idf scores.

With TfidfVectorizer, by contrast, you do all three steps at once. Under the hood, it computes the word counts, IDF values, and tf-idf scores all using the same dataset.