TF-IDF (Term Frequency-Inverse Document Frequency) is a technique for finding which words carry the meaning of a sentence or document. It makes up for the shortcomings of the Bag of Words (BoW) technique, which is good for text classification or for helping a machine read words as numbers, but blows up in your face when you ask it to understand the meaning of a sentence or document.




I highly suggest you read about BoW before you go through this article, to get some context.



---



**So what is it? Do you want to understand it using an example?**



Let’s say a machine is trying to understand the meaning of this:


```
Today is a beautiful day

```


What do you focus on here? Tell me as a human, not a machine.

This sentence talks about today, and it tells us that today is a beautiful day. The mood is happy/positive. Anything else, cowboy?

‘Beautiful’ is clearly the adjective here. In a BoW approach, all words are broken down into counts and frequencies with no preference for any word in particular. All words have the same frequency here (1 in this case), so the machine places no emphasis on ‘beautiful’ or on the positive mood.

The words are just broken down, and if we were talking about importance, ‘a’ is as important as ‘day’ or ‘beautiful’.

But does ‘a’ really tell you more about the context of a sentence than ‘beautiful’?

No, and that’s why Bag of Words needed an upgrade.

Another major drawback: say a document has 200 words, out of which ‘a’ comes up 20 times, ‘the’ comes up 15 times, and so on.

Words that are repeated again and again are given more importance in the final feature building, and we miss out on the context of less repeated but important words like rain, beauty, subway, or names.

So a machine can easily miss what the writer meant. That is the problem TF-IDF solves, and now we know why we use TF-IDF.
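
To make the drawback concrete, here is a minimal sketch of a Bag of Words count for the example sentence. It only uses Python’s standard library, and the variable names are illustrative rather than taken from any particular BoW implementation. Every word ends up with the same count, so nothing marks ‘beautiful’ as more informative than ‘a’.

```python
from collections import Counter

sentence = "Today is a beautiful day"

# Bag of Words: lower-case the text, split on whitespace, count each token.
bow = Counter(sentence.lower().split())

# Every token occurs exactly once, so the counts carry no notion of importance.
print(bow)
```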

---

**Let’s now see how it works, okay?**




TF-IDF solves the major drawbacks of Bag of Words by introducing an important concept called inverse document frequency.

It’s a score the machine keeps: it evaluates each word used in a document and measures its usage against the whole collection of documents. In other words, it’s a score that highlights how relevant each word is across the entire collection. (In the example further down, each sentence plays the role of a document and the full text is the collection.) It’s calculated as:



```
IDF = log[(Number of documents) / (Number of documents containing the word)]

TF = (Number of repetitions of the word in a document) / (Number of words in the document)
```

Okay, for now let’s just say that TF answers questions like: how often is ‘beauty’ used in a given document, as a share of all its words? IDF answers questions like: how important is the word ‘beauty’ across the entire list of documents? Is it a common theme in all of them, or specific to a few?

So, using TF and IDF together, the machine picks out the words that matter within one document and discounts the words that are common across all documents.
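
Here is a small sketch of the two formulas at work. The three-sentence corpus and the helper names `tf` and `idf` are made up for illustration, and the natural logarithm is an assumption (the formula above does not state a base). The point to notice is that ‘a’ appears in every document, so its IDF is log(1) = 0 and its TF-IDF score vanishes, while ‘beautiful’ keeps a positive score.

```python
import math

# Hypothetical toy corpus: each string is one document.
docs = [
    "today is a beautiful day",
    "a train is late today",
    "a cat sat on a mat",
]
tokenized = [d.split() for d in docs]

def tf(word, doc_tokens):
    # Share of the document's tokens that are `word`.
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, all_docs):
    # log(N / number of documents containing `word`), natural log assumed.
    # Assumes `word` occurs in at least one document (otherwise division by zero).
    containing = sum(1 for tokens in all_docs if word in tokens)
    return math.log(len(all_docs) / containing)

for word in ("a", "beautiful"):
    score = tf(word, tokenized[0]) * idf(word, tokenized)
    print(f"{word}: tf-idf in document 1 = {score:.3f}")
```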

**Answer me this**

Imagine there’s a document full of sentences. What is the best way to break it up so that a machine can make some sense of it?

1. Break it into words.

2. Break it into letters.

3. Break it into sentences.

4. Break it into bytes.

In [38]:
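import re

import nltk
import numpy as np
import pandas as pd
from google.colab import output  # Colab-only helper, used here to clear the download logs
from nltk.stem import WordNetLemmatizer

# Tokenizer and lemmatizer data used by the cells below
nltk.download('punkt')
nltk.download('wordnet')
output.clear()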

How do we find the TF-IDF of a document?
The process of finding the meaning of documents using TF-IDF is very similar to Bag of Words:

1. Clean data / Preprocessing: clean the data (standardise it), normalize it (all lower case), and lemmatize it (reduce all words to root words).

2. Tokenize words with frequency

3. Find TF for words

4. Find IDF for words

5. Vectorize vocab

In [117]:
text = "It is going to rain today. Today I am not going outside. I am going to watch the season premiere."

1. Clean data / Preprocessing: clean the data (standardise it), normalize it (all lower case), and lemmatize it (reduce all words to root words).


In [124]:
# Tokenize into sentences; each sentence is treated as one document
sents = nltk.sent_tokenize(text)

print("tokenize sentenses : ", "\n")
for index, sent in enumerate(sents):
  print(f"{index + 1} --- > {sent}")

print("*" * 20)

# Normalize: convert every sentence to lower case
print("clean data with all lower case  : ", "\n")
for index, sent in enumerate(sents):
  sents[index] = sent.lower()
  print(f"{index + 1} --- > {sents[index]}")

print("*" * 20)

# Remove punctuation: replace every non-word character with a space
# (raw string so that \W is not read as a string escape)
print("#remove Punctuation marks  : ", "\n")
for index, sent in enumerate(sents):
  sents[index] = re.sub(r"\W", " ", sents[index])
  print(f"{index + 1} --- > {sents[index]}")

print("*" * 20)

# Tokenize each sentence into words
print("tokenize words  : ", "\n")
for index, sent in enumerate(sents):
  sents[index] = nltk.word_tokenize(sent)
  print(f"{index + 1} --- > {sents[index]}")

print("*" * 20)

# Lemmatize (reduce every word to its root form).
# Note: collecting the lemmas in a set drops repeated words within a sentence.
# That is harmless here because no sentence repeats a word, but it would
# undercount terms in texts that do.
lemmatizer = WordNetLemmatizer()
print("find lemmatize data ( all words to root words )  : ", "\n")
words = []
for index, sent in enumerate(sents):
  lemmas = set()
  for word in sent:
    lemmas.add(lemmatizer.lemmatize(word))
  sents[index] = list(lemmas)
  words.extend(sents[index])
  print(f"{index + 1} --- > {sents[index]}")

print("*" * 20)

# Vocabulary: the unique words across all sentences
words = list(set(words))
print("set of words : ", "\n")
print(words)

tokenize sentenses :  

1 --- > It is going to rain today.
2 --- > Today I am not going outside.
3 --- > I am going to watch the season premiere.
********************
clean data with all lower case  :  

1 --- > it is going to rain today.
2 --- > today i am not going outside.
3 --- > i am going to watch the season premiere.
********************
#remove Punctuation marks  :  

1 --- > it is going to rain today 
2 --- > today i am not going outside 
3 --- > i am going to watch the season premiere 
********************
tokenize words  :  

1 --- > ['it', 'is', 'going', 'to', 'rain', 'today']
2 --- > ['today', 'i', 'am', 'not', 'going', 'outside']
3 --- > ['i', 'am', 'going', 'to', 'watch', 'the', 'season', 'premiere']
********************
find lemmatize data ( all words to root words )  :  

1 --- > ['to', 'is', 'rain', 'it', 'today', 'going']
2 --- > ['i', 'today', 'outside', 'going', 'not', 'am']
3 --- > ['to', 'the', 'i', 'watch', 'going', 'season', 'premiere', 'am']
******************

In [119]:
count_words = dict()

for word in words:
  count = 0
  for sent in sents:
    count += sent.count(word)
  count_words[word] = count

for word, count in count_words.items():
  print(f"{word} ----- > {count}")

to ----- > 2
is ----- > 1
rain ----- > 1
the ----- > 1
i ----- > 2
watch ----- > 1
it ----- > 1
today ----- > 2
outside ----- > 1
premiere ----- > 1
going ----- > 3
season ----- > 1
not ----- > 1
am ----- > 2
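
As an aside, the same counts can be produced with `collections.Counter` from the standard library. This is only a sketch of an alternative, not the notebook’s method; the token lists are copied from the "tokenize words" output above, and the print order will differ from the cell above because `most_common()` sorts by count.

```python
from collections import Counter

# Token lists as printed by the preprocessing cell above.
sents = [
    ["it", "is", "going", "to", "rain", "today"],
    ["today", "i", "am", "not", "going", "outside"],
    ["i", "am", "going", "to", "watch", "the", "season", "premiere"],
]

# Counter.update adds the counts of each sentence's tokens in one pass.
count_words = Counter()
for sent in sents:
    count_words.update(sent)

for word, count in count_words.most_common():
    print(f"{word} ----- > {count}")
```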


**Step 1 Find TF**



```
TF = (Number of repetitions of the word in a document) / (Number of words in the document)
```
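
A quick worked check of the formula on the third sentence, which has 8 words. It also explains a detail of the table further down: the raw value 1/8 = 0.125 shows up as 0.12 rather than 0.13, because Python’s built-in `round` rounds exact halves to the nearest even digit. The variable names here are illustrative.

```python
# Third sentence from the example text, already tokenized.
doc3 = ["i", "am", "going", "to", "watch", "the", "season", "premiere"]

raw_tf = doc3.count("watch") / len(doc3)  # 1 occurrence / 8 words
print(raw_tf)            # 0.125
print(round(raw_tf, 2))  # 0.12 (round-half-to-even)
```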




In [120]:
def create_tf(sents, words):
  # Build a table with one row per vocabulary word and one column per document
  tf = pd.DataFrame()
  tf["Words"] = words
  for index, sent in enumerate(sents):
    name_columns = "documents" + str(index + 1)
    # TF = occurrences of the word in this document / number of words in it
    count = [round(sent.count(word) / len(sent), 2) for word in words]
    tf[name_columns] = count

  tf = tf.set_index("Words")
  return tf
      


print("TF tables : ", "\n")
tf = create_tf(sents, words)
tf

TF tables :  



Words,documents1,documents2,documents3
to,0.17,0.0,0.12
is,0.17,0.0,0.0
rain,0.17,0.0,0.0
the,0.0,0.0,0.12
i,0.0,0.17,0.12
watch,0.0,0.0,0.12
it,0.17,0.0,0.0
today,0.17,0.17,0.0
outside,0.0,0.17,0.0
premiere,0.0,0.0,0.12
going,0.17,0.17,0.12
season,0.0,0.0,0.12
not,0.0,0.17,0.0
am,0.0,0.17,0.12


**Step 2 Find IDF**




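Find the IDF of each word across the documents. (Usually we do this only for the feature names, i.e. the vocabulary words left after removing stop words. This notebook does not remove stop words, so they appear in the tables too.)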

```
IDF = log[(Number of documents) / (Number of documents containing the word)]
```
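
To see what the formula gives on our three sentences, here is a small worked sketch. The formula writes "Log" without a base; the code cell below uses `math.log`, which is the natural logarithm, so the sketch does the same. The dictionary names are illustrative.

```python
import math

n_docs = 3  # the three sentences of the example text

# Number of sentences containing each word, read off the token lists above.
doc_freq = {"going": 3, "today": 2, "rain": 1}

# math.log with one argument is the natural logarithm.
idf_values = {word: round(math.log(n_docs / df), 2) for word, df in doc_freq.items()}

for word, value in idf_values.items():
    print(f"{word}: {value}")
```

A word that appears in every sentence ("going") gets an IDF of 0, so it contributes nothing to the final score, while a word that appears in only one sentence ("rain") gets the highest value.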


In [121]:
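import math as m

def create_idf(sents, words):
  # Build a table with one IDF value per vocabulary word
  idf = pd.DataFrame()
  idf["Words"] = words
  count = []
  for word in words:
    # Number of documents that contain the word
    counter = sum(1 for sent in sents if word in sent)
    # IDF = natural log of (number of documents / documents containing the word)
    count.append(round(m.log(len(sents) / counter), 2))
  idf["IDF - VALUE"] = count
  idf = idf.set_index("Words")
  return idf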
    

print("IDF tables : ", "\n")
idf = create_idf(sents, words)
idf

IDF tables :  



Words,IDF - VALUE
to,0.41
is,1.1
rain,1.1
the,1.1
i,0.41
watch,1.1
it,1.1
today,0.41
outside,1.1
premiere,1.1
going,0.0
season,1.1
not,1.1
am,0.41


**Step 3 Compare results and use table to ask questions**


In [122]:
def create_tf_idf(tf, idf):
  # Multiply each document's TF column by the IDF column (aligned on the word index)
  result = tf.copy()
  idf_value = idf["IDF - VALUE"]
  for col in tf.columns:
    result[col] = (idf_value * tf[col]).apply(lambda number: round(number, 2))
  # Transpose so that each row is a document and each column is a word
  result = result.T
  return result

print("TF-IDF tables : ", "\n")
tf_idf = create_tf_idf(tf, idf)
tf_idf

TF-IDF tables :  



Words,to,is,rain,the,i,watch,it,today,outside,premiere,going,season,not,am
documents1,0.07,0.19,0.19,0.0,0.0,0.0,0.19,0.07,0.0,0.0,0.0,0.0,0.0,0.0
documents2,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.07,0.19,0.0,0.0,0.0,0.19,0.07
documents3,0.05,0.0,0.0,0.13,0.05,0.13,0.0,0.0,0.0,0.13,0.0,0.13,0.0,0.05


In [123]:
np_tf_idf = np.array(tf_idf)
print("TF-IDF tables : ", "\n")
np_tf_idf

TF-IDF tables :  



array([[0.07, 0.19, 0.19, 0.  , 0.  , 0.  , 0.19, 0.07, 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.07, 0.  , 0.  , 0.07, 0.19, 0.  , 0.  ,
        0.  , 0.19, 0.07],
       [0.05, 0.  , 0.  , 0.13, 0.05, 0.13, 0.  , 0.  , 0.  , 0.13, 0.  ,
        0.13, 0.  , 0.05]])
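
One simple question to ask of this table is: which words score highest in each document? The sketch below is only an illustration in plain Python. The rows are copied by hand from the TF-IDF table above, and `top_words` is a hypothetical helper, not part of the notebook.

```python
# TF-IDF rows copied from the table above (word order as printed there).
words = ["to", "is", "rain", "the", "i", "watch", "it", "today",
         "outside", "premiere", "going", "season", "not", "am"]
tf_idf_rows = {
    "documents1": [0.07, 0.19, 0.19, 0.0, 0.0, 0.0, 0.19, 0.07, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "documents2": [0.0, 0.0, 0.0, 0.0, 0.07, 0.0, 0.0, 0.07, 0.19, 0.0, 0.0, 0.0, 0.19, 0.07],
    "documents3": [0.05, 0.0, 0.0, 0.13, 0.05, 0.13, 0.0, 0.0, 0.0, 0.13, 0.0, 0.13, 0.0, 0.05],
}

def top_words(row):
    # Return every word that ties for the highest score in this document.
    best = max(row)
    return [w for w, score in zip(words, row) if score == best]

top = {doc: top_words(row) for doc, row in tf_idf_rows.items()}
for doc, best_words in top.items():
    print(doc, best_words)
```

Notice that "going", which appears in all three sentences, scores 0 everywhere, which is exactly the effect IDF is meant to have. With a corpus this tiny and no stop-word removal, words like "is", "it" and "the" still tie for the top spot simply because each appears in only one sentence.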