https://www.kdnuggets.com/2018/08/wtf-tf-idf.html

TF-IDF, which stands for term frequency-inverse document frequency, is a scoring measure widely used in information retrieval (IR) and summarization. It is intended to reflect how relevant a term is to a given document.

The intuition is that if a word occurs many times in a document, we should boost its relevance, since it is likely to be more meaningful than words that appear less often (TF). At the same time, if a word occurs many times in a document but also appears in many other documents, it is probably just a common word rather than a relevant or meaningful one (IDF).
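To make the two factors concrete, here is a minimal sketch that assumes the textbook formulation, tf × log(N / df), with whitespace tokenization. The function name `tf_idf` and the toy corpus are made up for illustration, and real libraries usually apply smoothing and normalization on top of this.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # TF is the term's share of the document; IDF shrinks common terms.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Toy corpus, invented for this sketch.
toy_docs = ["the cat sat", "the dog sat", "the cat purred"]
toy_scores = tf_idf(toy_docs)
print(toy_scores[2])
```

In the last toy document, “the” scores zero because it appears in every document, while “purred”, which appears in only one, scores highest.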

Defining what a “relevant word” means

We can come up with a more or less subjective definition driven by our intuition: a word’s relevance is proportional to the amount of information that it gives about its context (a sentence, a document or a full dataset). That is, the most relevant words are those that would help us, as humans, to better understand a whole document without reading it all.

As pointed out, relevant words are not necessarily the most frequent words since stopwords like “the”, “of” or “a” tend to occur very often in many documents.

There is another caveat: if we want to summarize a document relative to a whole dataset about a specific topic (say, movie reviews), some words other than stopwords, such as “character” or “plot”, may occur many times in the document as well as in many other documents. These words are not useful for summarizing a document because they have little discriminating power; they say very little about what the document contains compared to the other documents.
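A small sketch of this caveat, assuming the plain log(N / df) form of IDF. The three reviews are invented for illustration, and `idf` is a hypothetical helper rather than a library function.

```python
import math

# Hypothetical mini-corpus of movie reviews (invented for illustration).
reviews = [
    "the plot is slow but the heist is clever",
    "a plot full of dragons and the usual prophecy",
    "the plot follows a courtroom drama",
]

# One set of distinct words per review.
tokenized = [set(review.split()) for review in reviews]

def idf(term):
    """Plain inverse document frequency: log(N / df)."""
    df = sum(term in tokens for tokens in tokenized)
    return math.log(len(tokenized) / df)

for term in ("plot", "the", "heist", "dragons"):
    print(f"{term:8s} idf = {idf(term):.3f}")
```

Here “plot” is not a stopword, yet it gets the same zero IDF as “the” because every review mentions it, while “heist” and “dragons” each identify a single review.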

Let’s go through some examples to better illustrate how TF-IDF works.

 

Search engine example
 
Let’s suppose we have a database with thousands of cat descriptions and a user wants to search for furry cats, so they issue the query “the furry cat”. As a search engine, we have to decide which documents from our database should be returned.

If some documents match the query exactly, there is no doubt, but what if we have to decide between partial matches? For simplicity, let’s say we have to choose between these two descriptions:

“the lovely cat”
“a furry kitten”
The first description contains 2 of the 3 query words and the second matches just 1 of 3, so by a simple word count we would pick the first description. How can TF-IDF help us choose the second description instead?

Each word occurs exactly once in its description, so the TF is the same for every word and makes no difference here. However, we can expect the terms “cat” and “kitten” to appear in many documents (a large document frequency implies a low IDF), while the term “furry” appears in fewer documents (a larger IDF). So the TF-IDF of “cat” and “kitten” is low whereas the TF-IDF of “furry” is larger; that is, in our database the word “furry” has more discriminative power than “cat” or “kitten”.

Conclusion

If we use TF-IDF to weight the words that matched the query, “furry” becomes more relevant than “cat”, and so we could ultimately choose “a furry kitten” as the best match.
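The ranking can be sketched as follows. The corpus size and every document frequency below are assumed numbers, chosen only to mirror the intuition that “cat” and “kitten” are common and “furry” is rare; `idf` and `score` are hypothetical helpers, not part of any library.

```python
import math

N_DOCS = 1000
# Assumed document frequencies (invented for this sketch).
doc_freq = {"the": 1000, "a": 950, "cat": 900,
            "kitten": 800, "lovely": 100, "furry": 50}

def idf(term):
    """Plain inverse document frequency: log(N / df)."""
    return math.log(N_DOCS / doc_freq[term])

def score(query, description):
    """Sum the IDF of every query term found in the description (TF is 1 here)."""
    query_terms = set(query.split())
    return sum(idf(term) for term in set(description.split())
               if term in query_terms)

query = "the furry cat"
candidates = ["the lovely cat", "a furry kitten"]
for description in candidates:
    print(f"{description!r}: {score(query, description):.3f}")

best = max(candidates, key=lambda d: score(query, d))
print("best match:", best)
```

Under these assumed frequencies, the two matched words in “the lovely cat” contribute about 0.105 in total, while the single match on “furry” contributes about 2.996, so “a furry kitten” ranks first.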

In [1]:
docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(use_idf=True)
vectors = tfidf.fit_transform(docs)

In [14]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(tfidf.get_feature_names_out().tolist())
print(vectors.toarray())

['ate', 'away', 'cat', 'end', 'finally', 'from', 'had', 'house', 'little', 'mouse', 'of', 'ran', 'saw', 'story', 'the', 'tiny']
[[0.         0.         0.         0.         0.         0.
  0.49356209 0.39820278 0.49356209 0.23518498 0.         0.
  0.         0.         0.23518498 0.49356209]
 [0.         0.         0.48334378 0.         0.         0.
  0.         0.         0.         0.28547062 0.         0.
  0.59909216 0.         0.57094124 0.        ]
 [0.         0.45709287 0.         0.         0.         0.45709287
  0.         0.36877965 0.         0.2178072  0.         0.45709287
  0.         0.         0.43561441 0.        ]
 [0.51392301 0.         0.41462985 0.         0.51392301 0.
  0.         0.         0.         0.24488707 0.         0.
  0.         0.         0.48977413 0.        ]
 [0.         0.         0.         0.49175319 0.         0.
  0.         0.         0.         0.23432303 0.49175319 0.
  0.         0.49175319 0.46864606 0.        ]]
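To see where these numbers come from, the sketch below recomputes them by hand using only the standard library. It assumes the `TfidfVectorizer` defaults: lowercasing, tokens of two or more word characters (which is why “a” is missing from the vocabulary), raw term counts, the smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper `tfidf_row` is a made-up name for this illustration.

```python
import math
import re
from collections import Counter

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# Keep tokens of 2+ word characters, so the single-letter "a" is dropped.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
n_docs = len(docs)

# Document frequency and smoothed IDF: ln((1 + n) / (1 + df)) + 1.
df = Counter(term for tokens in tokenized for term in set(tokens))
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
       for term, count in df.items()}

def tfidf_row(tokens):
    """Raw count times IDF, then scale the row to unit L2 norm."""
    counts = Counter(tokens)
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

first_row = tfidf_row(tokenized[0])
for term, value in sorted(first_row.items()):
    print(f"{term:8s} {value:.8f}")
```

Because “the” and “mouse” appear in all five documents, their smoothed IDF is exactly 1, the lowest possible value, which is why they get the smallest weights in the first row even though they are not zero.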
