# Introduction to TF-IDF

In this notebook, we will compute the **Term Frequency-Inverse Document Frequency (TF-IDF)** of a set of text documents. TF-IDF is a numerical statistic that reflects the importance of a word within a document, relative to a collection of documents (corpus).

- **Term Frequency (TF)**: Measures how frequently a term (word) appears in a document.
- **Inverse Document Frequency (IDF)**: Measures how rare a term is across the whole corpus. It gives less weight to terms that appear in many documents.

We will first compute TF-IDF manually, and then use the `TfidfVectorizer` from `scikit-learn` to compute it more efficiently.


In [1]:
import numpy as np



In [2]:
documents_str = [
    "The cat sat on the mat.",
    "A dog sat on the mat.",
    "The bird flew over the tree."
]

In [3]:
# Scratch check: a TF of 1/6 times an unsmoothed IDF of ln(3/1)
(1/6)*np.log(3/1)

0.1831020481113516

In [4]:
def tokenize(sent):
    # Lowercase and split on whitespace; punctuation stays attached ("mat.").
    return sent.lower().split()

In [5]:
vocab = {}  # token -> total number of occurrences in the corpus
for doc in documents_str:
    tokens = tokenize(doc)
    for token in tokens:
        vocab[token] = vocab.get(token, 0) + 1
    print(doc, tokens)
    

The cat sat on the mat. ['the', 'cat', 'sat', 'on', 'the', 'mat.']
A dog sat on the mat. ['a', 'dog', 'sat', 'on', 'the', 'mat.']
The bird flew over the tree. ['the', 'bird', 'flew', 'over', 'the', 'tree.']


In [6]:
print(vocab)

{'the': 5, 'cat': 1, 'sat': 2, 'on': 2, 'mat.': 2, 'a': 1, 'dog': 1, 'bird': 1, 'flew': 1, 'over': 1, 'tree.': 1}


In [7]:
word2token = {word: idx for idx, word in enumerate(vocab)}

In [8]:
word2token

{'the': 0,
 'cat': 1,
 'sat': 2,
 'on': 3,
 'mat.': 4,
 'a': 5,
 'dog': 6,
 'bird': 7,
 'flew': 8,
 'over': 9,
 'tree.': 10}

In [9]:
token2word = {idx: word for word, idx in word2token.items()}

In [10]:
token2word

{0: 'the',
 1: 'cat',
 2: 'sat',
 3: 'on',
 4: 'mat.',
 5: 'a',
 6: 'dog',
 7: 'bird',
 8: 'flew',
 9: 'over',
 10: 'tree.'}

### 1 Term Frequency (TF)

TF measures the frequency of a word within a document. It is calculated as:

$$
TF(w, D) = \frac{\text{Number of times word } w \text{ appears in document } D}{\text{Total number of words in document } D}
$$

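For example, in the document `['the', 'cat', 'sat', 'on', 'the', 'mat.']`, the term frequency of "the" is $\frac{2}{6} \approx 0.333$. The final token is `'mat.'` rather than `'mat'` because the whitespace tokenizer keeps punctuation attached.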

In [11]:
sent=documents_str[0]
tokens=tokenize(sent)
print(tokens)

['the', 'cat', 'sat', 'on', 'the', 'mat.']


In [12]:
TF=np.zeros(len(vocab))

In [13]:
for token in tokens:
    TF[word2token[token]]+=1

In [14]:
TF/=len(tokens)

In [15]:
TF

array([0.33333333, 0.16666667, 0.16666667, 0.16666667, 0.16666667,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        ])
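The TF cells above can be folded into a single helper. This is a minimal, self-contained sketch: the name `term_frequency` is hypothetical (it is not defined elsewhere in this notebook), and it assumes the same lowercase, whitespace-split tokenization.

```python
from collections import Counter

import numpy as np


def term_frequency(tokens, word_to_index):
    """Return the relative frequency of each vocabulary word in `tokens`."""
    tf = np.zeros(len(word_to_index))
    for word, count in Counter(tokens).items():
        tf[word_to_index[word]] = count
    return tf / len(tokens)


tokens = "The cat sat on the mat.".lower().split()
# Index words in order of first appearance, as the notebook does.
word_to_index = {word: i for i, word in enumerate(dict.fromkeys(tokens))}
tf = term_frequency(tokens, word_to_index)
print(tf)
```

Because every token is counted once and the counts are divided by the document length, the entries of `tf` always sum to 1.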


### 2 Inverse Document Frequency (IDF)

IDF measures how rare a word is across the whole corpus: the fewer documents contain it, the higher its IDF. This notebook uses a smoothed form of the formula:

$$
IDF(w) = \log \left( \frac{1+N}{1+\text{df}(w)} \right)
$$

Where:
- $N$ is the total number of documents.
- $\text{df}(w)$ is the number of documents containing the word $w$.
- $\log$ is the natural logarithm (`np.log` in the code).

For example, the word "the" appears in every document of the corpus, so its IDF is the lowest in the vocabulary.
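As a worked example for this three-document corpus ($N = 3$), assuming the natural logarithm, the formula gives the following values for a word found in all three documents, in two of them, and in only one:

$$
\begin{aligned}
IDF(\text{the}) &= \log\left(\frac{1+3}{1+3}\right) = \log(1) = 0 \\
IDF(\text{sat}) &= \log\left(\frac{1+3}{1+2}\right) = \log\left(\tfrac{4}{3}\right) \approx 0.288 \\
IDF(\text{cat}) &= \log\left(\frac{1+3}{1+1}\right) = \log(2) \approx 0.693
\end{aligned}
$$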


In [16]:
DF=np.zeros(len(vocab))
for doc in documents_str:
    tokens=set(tokenize(doc))
    for token in tokens:
        DF[word2token[token]]+=1
    

In [17]:
DF

array([3., 1., 2., 2., 2., 1., 1., 1., 1., 1., 1.])

In [18]:
IDF=np.zeros(len(vocab))
for i in range(len(vocab)):
    if DF[i]>0:
        IDF[i]=np.log((1+len(documents_str))/(1+DF[i]))

In [19]:
IDF

array([0.        , 0.69314718, 0.28768207, 0.28768207, 0.28768207,
       0.69314718, 0.69314718, 0.69314718, 0.69314718, 0.69314718,
       0.69314718])

### 3 TF-IDF Computation

The final TF-IDF value is obtained by multiplying the TF and IDF values for each word in a document. A high TF-IDF value means that the word is frequent in that document and rare in the rest of the corpus.


In [20]:
TF_IDF=np.zeros((len(documents_str), len(vocab)))
for idx_sent, sent in enumerate(documents_str):
    tokens=tokenize(sent)
    # TF(w, d): relative frequency of each vocabulary word in this document
    TF=np.zeros(len(vocab))
    for token in tokens:
        TF[word2token[token]]+=1
    TF/=len(tokens)
    
    
    for i in range(len(vocab)):
        TF_IDF[idx_sent,i]=IDF[i]*TF[i]
    print(TF_IDF[idx_sent])

[0.         0.11552453 0.04794701 0.04794701 0.04794701 0.
 0.         0.         0.         0.         0.        ]
[0.         0.         0.04794701 0.04794701 0.04794701 0.11552453
 0.11552453 0.         0.         0.         0.        ]
[0.         0.         0.         0.         0.         0.
 0.         0.11552453 0.11552453 0.11552453 0.11552453]
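The same matrix can also be computed without the inner loops by letting NumPy broadcast the IDF vector across the rows of a TF matrix. The sketch below is self-contained and rebuilds the corpus; the variable names (`counts`, `tf_idf`, and so on) are illustrative and do not come from the cells above.

```python
import numpy as np

docs = [
    "The cat sat on the mat.",
    "A dog sat on the mat.",
    "The bird flew over the tree.",
]
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary in order of first appearance, matching the notebook's indexing.
vocab = list(dict.fromkeys(tok for toks in tokenized for tok in toks))
index = {word: i for i, word in enumerate(vocab)}

# counts[d, i] = number of times word i occurs in document d
counts = np.zeros((len(docs), len(vocab)))
for d, toks in enumerate(tokenized):
    for tok in toks:
        counts[d, index[tok]] += 1

tf = counts / counts.sum(axis=1, keepdims=True)  # row-wise relative frequency
df = (counts > 0).sum(axis=0)                    # documents containing each word
idf = np.log((1 + len(docs)) / (1 + df))         # smoothed IDF from section 2
tf_idf = tf * idf                                # idf broadcasts across rows
print(np.round(tf_idf, 4))
```

Each row of `tf_idf` should match the corresponding row printed by the loop version.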


### 4 Relationship Between TF-IDF and Stop Words

1. **High Term Frequency (TF) but Low IDF**: Stop words appear very frequently within documents, so their term frequency is high. Because they also appear in almost every document in the corpus, their IDF is low, and the product is a low TF-IDF score.

2. **Impact on TF-IDF Calculation**: A low score means stop words contribute little to telling documents apart. In this corpus, "the" occurs in all three documents, so its IDF is exactly 0 and it adds nothing to any TF-IDF vector.



In [21]:
word2token["the"]

0

In [22]:
IDF[0]

0.0
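To see the effect across the whole vocabulary, the words can be ranked by IDF. This is a small self-contained sketch using the same smoothed formula and whitespace tokenization; the names `idf` and `ranked` are illustrative.

```python
import math

docs = [
    "The cat sat on the mat.",
    "A dog sat on the mat.",
    "The bird flew over the tree.",
]
tokenized = [set(doc.lower().split()) for doc in docs]
n_docs = len(docs)
vocab = set().union(*tokenized)

# Smoothed IDF: log((1 + N) / (1 + df(w)))
idf = {
    word: math.log((1 + n_docs) / (1 + sum(word in toks for toks in tokenized)))
    for word in vocab
}

# Lowest IDF first (ties broken alphabetically), so corpus-wide words lead.
ranked = sorted(idf.items(), key=lambda item: (item[1], item[0]))
for word, value in ranked:
    print(f"{word:6s} {value:.3f}")
```

Note that "a" is also a stop word, but it occurs in only one of the three documents, so it receives the highest IDF. With a corpus this small, IDF alone does not identify every stop word.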

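### 5 How to Use TF-IDF to Compute Sentence Similarity
To compute the similarity between two sentences, use **cosine similarity**, which measures the cosine of the angle between the two sentence vectors. In general, cosine similarity ranges from -1 (vectors pointing in opposite directions) to 1 (vectors pointing in the same direction). Because TF-IDF weights are never negative, the similarity between two TF-IDF vectors lies between 0 (no shared weighted terms) and 1. The formula for cosine similarity between two vectors $A$ and $B$ is: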

   $$
   \text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}
   $$
   Where:
   - $A \cdot B$ is the dot product of vectors $A$ and $B$.
   - $\|A\|$ and $\|B\|$ are the magnitudes (lengths) of the vectors.

By applying TF-IDF and cosine similarity, we can compare sentences based on their content, taking into account the relative importance of each word. This method is particularly useful in tasks like **document clustering**, **information retrieval**, and **text classification**.

In [23]:
from numpy.linalg import norm

In [24]:
def cosine_sim(A, B):
    return np.dot(A, B) / (norm(A) * norm(B))


In [41]:
word2token

{'the': 0,
 'cat': 1,
 'sat': 2,
 'on': 3,
 'mat.': 4,
 'a': 5,
 'dog': 6,
 'bird': 7,
 'flew': 8,
 'over': 9,
 'tree.': 10}

In [37]:
TF_IDF[0]

array([0.        , 0.11552453, 0.04794701, 0.04794701, 0.04794701,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        ])

In [39]:
TF_IDF[2]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.11552453, 0.11552453, 0.11552453,
       0.11552453])

In [25]:
cosine_sim(TF_IDF[0],TF_IDF[1])

0.2644932986430687

In [26]:
cosine_sim(TF_IDF[1],TF_IDF[2])

0.0
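The pairwise calls above can be replaced by one matrix product over row-normalized vectors. The sketch below is an illustration, not part of the original notebook: `pairwise_cosine` is a hypothetical helper, the TF-IDF values are copied (rounded) from the output of section 3, and the guard for all-zero rows is an added assumption about how such rows should be handled.

```python
import numpy as np


def pairwise_cosine(matrix):
    """Cosine similarity between every pair of rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Leave all-zero rows as zeros instead of dividing by zero.
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit = matrix / safe_norms
    return unit @ unit.T


# TF-IDF rows as printed in section 3 (values rounded to 8 decimals).
a, b = 0.11552453, 0.04794701
tf_idf = np.array([
    [0, a, b, b, b, 0, 0, 0, 0, 0, 0],
    [0, 0, b, b, b, a, a, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, a, a, a, a],
])
sims = pairwise_cosine(tf_idf)
print(np.round(sims, 4))
```

Sentences 1 and 2 share "sat", "on", and "mat.", so they have a non-zero similarity. Sentence 3 shares only "the" with the others, and since that word has an IDF of 0, its similarity to both is 0.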

### 6 Using scikit-learn to Compute TF-IDF
Now, let's use the `TfidfVectorizer` from the `scikit-learn` library to compute the TF-IDF values automatically.

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample sentences
documents_str = [
    "The cat sat on the mat.",
    "A dog sat on the mat.",
    "The bird flew over the tree."
]

# Initialize the TfidfVectorizer; with smooth_idf=False the IDF is ln(N / df) + 1
vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=False, lowercase=True)

# Fit and transform the documents to compute the TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(documents_str)

# Compute cosine similarity between all pairs of sentences
cos_sim_matrix = cosine_similarity(tfidf_matrix)

# Show the cosine similarity matrix
import pandas as pd
cos_sim_df = pd.DataFrame(cos_sim_matrix, columns=["Sentence 1", "Sentence 2", "Sentence 3"], 
                          index=["Sentence 1", "Sentence 2", "Sentence 3"])
cos_sim_df


| | Sentence 1 | Sentence 2 | Sentence 3 |
|---|---|---|---|
| Sentence 1 | 1.0 | 0.622028 | 0.227269 |
| Sentence 2 | 0.622028 | 1.0 | 0.127796 |
| Sentence 3 | 0.227269 | 0.127796 | 1.0 |


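* These similarities are higher than the manual ones because scikit-learn weights terms differently. Whatever the value of `smooth_idf`, it adds 1 to the IDF, so the weight is effectively tf * (idf + 1): with `smooth_idf=False` the stored value is $\ln(N / \text{df}(w)) + 1$, and with `smooth_idf=True` it is $\ln((1+N)/(1+\text{df}(w))) + 1$. A word that appears in every document, such as "the", therefore keeps an IDF of 1 instead of 0.
* `TfidfVectorizer` also uses raw counts for tf, L2-normalizes each row by default, and its default tokenizer drops punctuation and one-character tokens such as "a".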

In [55]:
vectorizer.idf_[8]  # index 8 is "the"

1.0

In [53]:
vectorizer.vocabulary_["tree"]

9

In [32]:
vectorizer.get_feature_names_out()

array(['bird', 'cat', 'dog', 'flew', 'mat', 'on', 'over', 'sat', 'the',
       'tree'], dtype=object)
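To check where the scikit-learn numbers come from, the same weighting can be rebuilt with NumPy alone. This is a sketch under stated assumptions rather than a copy of the library's code: the regular expression is meant to mirror the vectorizer's default token pattern (two or more word characters), the IDF is the `smooth_idf=False` form $\ln(N / \text{df}(w)) + 1$, tf is the raw count, and each row is L2-normalized.

```python
import re

import numpy as np

docs = [
    "The cat sat on the mat.",
    "A dog sat on the mat.",
    "The bird flew over the tree.",
]
# Lowercase, then keep runs of two or more word characters,
# which drops punctuation and the one-letter token "a".
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
index = {word: i for i, word in enumerate(vocab)}

counts = np.zeros((len(docs), len(vocab)))
for d, toks in enumerate(tokenized):
    for tok in toks:
        counts[d, index[tok]] += 1

df = (counts > 0).sum(axis=0)
idf = np.log(len(docs) / df) + 1   # smooth_idf=False form, shifted by 1
weighted = counts * idf            # raw counts, not relative frequencies
unit = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2 rows
sims = unit @ unit.T
print(vocab)
print(np.round(sims, 6))
```

The off-diagonal entries should agree with the cosine similarity table above, and the IDF of "the" comes out as 1.0, as in `vectorizer.idf_[8]`.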