# Understanding TF-IDF: Term Frequency-Inverse Document Frequency

In this notebook, we will explore **TF-IDF**, a powerful technique for text representation in Natural Language Processing (NLP). TF-IDF measures the importance of a term in a document relative to a collection of documents (corpus).

## Objectives
By the end of this notebook, you will:
1. Understand the concept of TF-IDF and its components: Term Frequency (TF) and Inverse Document Frequency (IDF).
2. Learn how to compute TF-IDF using `scikit-learn`.
3. Visualize the results in a tabular format.
4. Understand the calculations behind TF-IDF for specific terms.

---

## What is TF-IDF?
TF-IDF combines two metrics:
1. **Term Frequency (TF)**: How often a term appears in a document.
2. **Inverse Document Frequency (IDF)**: How rare a term is across the corpus; terms found in fewer documents receive a higher weight.

The formula for TF-IDF is:
$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
$$
Where:
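- $t$: Term
- $d$: Document
- $\text{IDF}(t) = \log\left(\frac{N}{1 + \text{DF}(t)}\right)$, where $N$ is the total number of documents and $\text{DF}(t)$ is the number of documents containing the term $t$.

This is one common textbook definition. By default, scikit-learn's `TfidfVectorizer` uses a smoothed variant, $\text{IDF}(t) = \ln\left(\frac{1 + N}{1 + \text{DF}(t)}\right) + 1$, and then scales each document vector to unit Euclidean (L2) length. The values computed in this notebook follow that variant, so they differ from the textbook formula.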

In this notebook, we will compute the TF-IDF values for a small set of sample documents.
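
As a rough check on the IDF definition, the sketch below evaluates the smoothed variant that scikit-learn's `TfidfVectorizer` applies by default, $\ln\frac{1 + N}{1 + \text{DF}(t)} + 1$, for a corpus of three documents. The helper name `smoothed_idf` is made up for this illustration and is not part of any library.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + N) / (1 + DF)) + 1 (hypothetical helper)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 3
# IDF for a term found in 1, 2, or all 3 documents
idf_by_df = {df: smoothed_idf(n_docs, df) for df in (1, 2, 3)}
for df, idf in idf_by_df.items():
    print(f"DF={df}: IDF={idf:.6f}")
```

With three documents, a term that appears in every document gets an IDF of exactly 1, and rarer terms get larger values.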


# Preliminaries: Import Libraries and Prepare Data

We will use the following libraries:
1. **`numpy`**: For numerical operations.
2. **`sklearn.feature_extraction.text.TfidfVectorizer`**: To calculate TF-IDF values.
3. **`pandas`**: For creating DataFrames to visualize the results.

Let’s start by defining our sample documents.


In [51]:
# Import required libraries
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Sample documents
documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The mat was on the floor."
]

# Step 1: Compute TF-IDF Matrix

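We will use the `TfidfVectorizer` class from `scikit-learn` to calculate the TF-IDF matrix for our sample documents. Each row of the matrix is a document, each column is a term, and each entry is the weight of that term in that document. With the default settings, the vectorizer lowercases the text, ignores punctuation and single-character tokens, and scales every row to unit L2 length.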


In [52]:
# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Calculate TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (terms)
feature_names = vectorizer.get_feature_names_out()

# Convert the matrix to a DataFrame for better visualization
df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=feature_names,
    index=['Doc 1', 'Doc 2', 'Doc 3']
)

print("TF-IDF Matrix:")
print(df)


TF-IDF Matrix:
            cat    chased       dog     floor       mat        on       sat  \
Doc 1  0.374207  0.000000  0.000000  0.000000  0.374207  0.374207  0.492038   
Doc 2  0.381519  0.501651  0.501651  0.000000  0.000000  0.000000  0.000000   
Doc 3  0.000000  0.000000  0.000000  0.468699  0.356457  0.356457  0.000000   

            the       was  
Doc 1  0.581211  0.000000  
Doc 2  0.592567  0.000000  
Doc 3  0.553642  0.468699  
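
The matrix above can be reproduced step by step, which shows what `TfidfVectorizer` computes with its default settings. The sketch below is an illustration that assumes those defaults (`smooth_idf=True`, `norm='l2'`): it counts terms with `CountVectorizer`, applies the smoothed IDF, and scales each row to unit length. The variable names are chosen for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The mat was on the floor.",
]

# Raw term counts: one row per document, one column per term
counts = CountVectorizer().fit_transform(documents).toarray()

# Document frequency: number of documents containing each term
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF, then TF * IDF, then L2-normalize each row
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against scikit-learn's result
sklearn_tfidf = TfidfVectorizer().fit_transform(documents).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Both vectorizers order their columns alphabetically by term, which is why the two arrays can be compared entry by entry.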


# Step 2: Calculate and Visualize IDF Values

The **Inverse Document Frequency (IDF)** measures how unique a term is across the documents. Terms that appear in many documents will have lower IDF values, while rare terms will have higher IDF values.

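We can use the `idf_` attribute of the fitted `TfidfVectorizer` to extract the IDF values for each term. With scikit-learn's default smoothing, a term that appears in every document gets an IDF of 1 rather than 0.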


In [53]:
# Extract IDF values
idf_values = vectorizer.idf_

# Create a DataFrame for IDF values
idf_df = pd.DataFrame(
    {'Term': feature_names, 'IDF': idf_values}
).sort_values(by='IDF', ascending=False)

print("\nIDF Values:")
print(idf_df)



IDF Values:
     Term       IDF
1  chased  1.693147
2     dog  1.693147
3   floor  1.693147
6     sat  1.693147
8     was  1.693147
0     cat  1.287682
4     mat  1.287682
5      on  1.287682
7     the  1.000000


# Step 3: Explain TF-IDF for a Specific Term in a Document

To understand the calculations, we will compute the **TF**, **IDF**, and **TF-IDF** for a specific term in a document.

For example, we can calculate the TF-IDF of the term "cat" in the first document. This helps us break down the importance of the term.


In [54]:
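# Function to explain the TF-IDF calculation for a specific term in a document
def explain_tfidf(term, doc_index):
    # Find the column index of the term in the fitted vocabulary
    term_index = list(feature_names).index(term)

    # Calculate TF (Term Frequency): raw count of the term among the
    # document's tokens, using the vectorizer's own tokenization
    tokens = vectorizer.build_analyzer()(documents[doc_index])
    tf = tokens.count(term)

    # Get IDF (Inverse Document Frequency) learned during fitting
    idf = vectorizer.idf_[term_index]

    # Calculate the unnormalized TF-IDF
    tfidf = tf * idf

    # Look up the value stored in the matrix, where each document row
    # is scaled to unit L2 length after the TF * IDF weighting
    tfidf_normalized = vectorizer.transform([documents[doc_index]]).toarray()[0][term_index]

    # Display results
    print(f"\nExplanation for term '{term}' in Document {doc_index + 1}:")
    print(f"TF (Term Frequency): {tf}")
    print(f"IDF (Inverse Document Frequency): {idf:.4f}")
    print(f"TF-IDF (unnormalized): {tfidf:.4f}")
    print(f"TF-IDF (L2-normalized, as in the matrix): {tfidf_normalized:.4f}")


# Example: Explain TF-IDF for the Term "cat"

Let’s use the function to explain the TF-IDF calculation for the term "cat" in the first document. "cat" occurs once in that document, so its unnormalized TF-IDF equals its IDF; dividing by the L2 norm of the document's weighted vector gives the value stored in the matrix.


In [55]:
# Example explanation
explain_tfidf("cat", 0)  # Explain 'cat' in the first document


Explanation for term 'cat' in Document 1:
TF (Term Frequency): 1
IDF (Inverse Document Frequency): 1.2877
TF-IDF (unnormalized): 1.2877
TF-IDF (L2-normalized, as in the matrix): 0.3742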


# Exercises: Hands-on Practice

1. **Visualize Top TF-IDF Terms**:
   Modify the DataFrame to display the top 3 terms with the highest TF-IDF values for each document.

2. **Add New Documents**:
   Add new documents to the corpus and observe how the IDF values change.

3. **Explain Multiple Terms**:
   Extend the `explain_tfidf` function to display calculations for multiple terms at once.

4. **Custom Preprocessing**:
   Modify the `TfidfVectorizer` to include custom preprocessing, such as lowercasing, removing stopwords, or stemming.

5. **Analyze the Impact of Parameters**:
   Experiment with the `TfidfVectorizer` parameters, such as `max_df`, `min_df`, and `ngram_range`. Observe their effect on the TF-IDF matrix.

6. **Compare with Raw Term Frequencies**:
   Use `CountVectorizer` to calculate raw term frequencies and compare them with the TF-IDF values.

7. **Real Dataset**:
   Apply TF-IDF to a real-world dataset, such as movie reviews or news articles. Visualize the most important terms in each document.


In [56]:
# Top 3 terms by IDF (idf_df is already sorted in descending order)
top3 = idf_df.head(3)
top3

     Term       IDF
1  chased  1.693147
2     dog  1.693147
3   floor  1.693147


### Corpus

In [46]:
added_documents = [
    "The sun was shining brightly in the sky.",
    "The dog ran quickly around the corner.",
    "The cat purred contentedly on my lap.",
    "The book fell off the table.",
    "The baby laughed at the silly clown.",
    "The flowers bloomed in the garden.",
    "The car drove down the highway.",
    "The teacher wrote on the blackboard.",
    "The student solved the math problem.",
    "The phone rang loudly in the room.",
    "The clock struck midnight.",
    "The baby cried loudly in the night.",
    "The cat chased the mouse.",
    "The dog wagged its tail.",
    "The sun set slowly in the west.",
    "The stars twinkled in the sky.",
    "The moon glowed brightly in the night.",
    "The river flowed gently through the valley.",
    "The mountain rose steeply into the air.",
    "The tree swayed gently in the breeze.",
    "The bird sang sweetly in the morning.",
    "The fish swam quickly through the water.",
    "The boat sailed smoothly across the lake.",
    "The plane flew high in the sky.",
    "The train chugged along the tracks.",
    "The bus drove down the road.",
    "The bicycle rode smoothly down the hill.",
    "The car stopped at the red light.",
    "The pedestrian crossed the street.",
    "The bicycle fell off the rack.",
    "The bookshelf was filled with books.",
    "The desk was covered with papers.",
    "The chair was placed in the corner.",
    "The table was set for dinner.",
    "The vase was filled with flowers.",
    "The picture hung on the wall.",
    "The clock was ticking loudly.",
    "The radio played softly in the background.",
    "The TV was turned off.",
    "The computer beeped loudly.",
    "The phone was ringing in the other room.",
    "The door was open.",
    "The window was closed.",
    "The curtain was drawn.",
    "The bed was made.",
    "The pillow was fluffed.",
    "The blanket was pulled up.",
    "The chair was pushed back.",
    "The table was cleared.",
    "The floor was swept.",
    "The room was tidy.",
    "The house was quiet.",
    "The garden was blooming.",
    "The flowers were in bloom.",
    "The trees were in leaf.",
    "The grass was green.",
    "The sky was blue.",
    "The sun was shining.",
    "The clouds were white.",
    "The wind was blowing.",
    "The rain was falling.",
    "The snow was falling.",
    "The ice was melting.",
    "The water was flowing.",
    "The river was flowing.",
    "The lake was calm.",
    "The ocean was vast.",
    "The mountain was tall.",
    "The valley was green.",
    "The forest was dense.",
    "The trees were tall.",
    "The flowers were colorful.",
    "The birds were singing.",
    "The bees were buzzing.",
    "The butterflies were fluttering.",
    "The ants were marching.",
    "The bees were busy.",
    "The flowers were blooming.",
    "The trees were blooming.",
    "The grass was growing.",
    "The sky was changing.",
    "The clouds were moving.",
    "The wind was blowing.",
    "The rain was falling.",
    "The snow was falling.",
    "The ice was melting.",
    "The water was flowing.",
    "The river was flowing.",
    "The lake was calm.",
    "The ocean was vast.",
    "The mountain was tall.",
    "The valley was green.",
    "The forest was dense.",
    "The trees were tall.",
    "The flowers were colorful.",
    "The birds were singing.",
    "The bees were buzzing.",
    "The butterflies were fluttering.",
    "The ants were marching.",
    "The bees were busy.",
    "The flowers were blooming.",
    "The trees were blooming.",
    "The grass was growing.",
    "The sky was changing.",
    "The clouds were moving.",
    "The wind was blowing.",
    "The rain was falling.",
    "The snow was falling.",
    "The ice was melting.",
    "The water was flowing.",
    "The river was flowing.",
    "The lake was calm.",
    "The ocean was vast.",
    "The mountain was tall.",
    "The valley was green.",
    "The forest was dense.",
    "The trees were tall.",
    "The flowers were colorful.",
    "The birds were singing.",
    "The bees were buzzing.",
    "The butterflies were fluttering.",
    "The ants were marching.",
    "The bees were busy.",
    "The flowers were blooming.",
    "The trees were blooming.",
    "The grass was growing.",
    "The sky was changing.",
    "The clouds were moving.",
    "The wind was blowing.",
    "The rain was falling.",
    "The snow was falling.",
    "The ice was melting.",
    "The water was flowing.",
    "The river was flowing.",
    "The lake was calm.",
    "The ocean was vast.",
    "The mountain was tall.",
    "The valley was green.",
    "The forest was dense.",
    "The trees were tall.",
    "The flowers were colorful.",
    "The birds were singing.",
    "The bees were buzzing.",
    "The butterflies were fluttering.",
    "The ants were marching.",
    "The bees were busy.",
    "The flowers were blooming.",
    "The trees were blooming.",
    "The grass was growing.",
    "The sky was changing.",
    "The clouds were moving.",
    "The wind was blowing.",
    "The rain was falling.",
    "The snow was falling.",
    "The ice was melting.",
    "The water was flowing.",
    "The river was flowing.",
    "The lake was calm.",
    "The ocean was vast.",
    "The mountain was tall.",
    "The valley was green.",
    "The forest was dense.",
    "The trees were tall.",
    "The flowers were colorful.",
    "The birds were singing.",
    "The bees were buzzing.",
    "The butterflies were fluttering.",
    "The ants were marching.",
    "The bees were busy.",
    "The flowers were blooming.",
    "The trees were blooming.",
    "The grass was growing.",
    "The sky was changing.",
    "The clouds were moving.",
    "The wind was blowing.",
    "The rain was falling.",
    "The snow was falling.",
    "The ice was melting.",
    "The water was flowing.",
    "The river was flowing.",
    "The lake was calm.",
    "The ocean was vast.",
    "The mountain was tall.",
    "The valley was green.",
    "The forest was dense.",
    "The trees were tall.",
    "The flowers were colorful.",
    "The birds were singing.",
    "The bees were buzzing.",
    "The butterflies were fluttering.",
    "The ants were marching.",
    "The bees were busy.",
    "The flowers were blooming.",
    "The trees were blooming.",
    "The grass was growing.",
    "The sky was changing.",
    "The clouds were moving.",
    "The wind was blowing.",
    "The rain was falling.",
    "The snow was falling.",
    "The ice was melting.",
    "The water was flowing.",
    "The river was flowing.",
    "The lake was calm.",
    "The ocean was vast.",
    "The mountain was tall.",
    "The valley was green.",
    "The forest was dense.",
    "The trees were tall.",
    "The flowers were colorful.",
    "The birds were singing.",
    "The bees were buzzing.",
    "The butterflies were fluttering.",
    "The ants were marching.",
    "The bees were busy.",
    "The flowers were blooming.",
    "The trees were blooming.",
    "The grass was growing.",
    "The sky was changing.",
    "The clouds were moving.",
    "The wind was blowing.",
    "The rain was falling.",
    "The snow was falling.",
    "The ice was melting.",
    "The water was flowing.",
    "The river was flowing.",
    "The lake was calm.",
    "The ocean was vast.",
    "The mountain was tall.",
    "The valley was green.",
    "The forest was dense.",
    "The trees were tall.",
    "The flowers were colorful.",
    "The birds were singing.",
    "The bees were buzzing.",
    "The butterflies were fluttering.",
    "The ants were marching.",
    "The bees were busy.",
    "The flowers were blooming.",
    "The trees were blooming.",
    "The grass was growing.",
    "The sky was changing.",
    "The clouds were moving.",
    "The wind was blowing.",
    "The rain was falling.",
    "The snow was falling.",
    "The ice was melting.",
    "The water was flowing.",
    "The river was flowing.",
    "The lake was calm.",
    "The ocean was vast.",
    "The mountain was tall.",
    "The valley was green.",
    "The forest was dense.",
    "The trees were tall.",
    "The flowers were colorful.",
    "The birds were singing.",
    "The bees were buzzing.",
    "The butterflies were fluttering.",
    "The ants were marching.",
    "The bees were busy.",
    "The flowers were blooming.",
    "The trees were blooming.",
    "The grass was growing.",
    "The sky was changing.",
    "The clouds were moving.",
    "The wind was blowing.",
    "The rain was falling.",
    "The snow was falling."]


In [47]:
corpus = documents + added_documents
corpus

['The cat sat on the mat.',
 'The dog chased the cat.',
 'The mat was on the floor.',
 'The sun was shining brightly in the sky.',
 'The dog ran quickly around the corner.',
 'The cat purred contentedly on my lap.',
 'The book fell off the table.',
 'The baby laughed at the silly clown.',
 'The flowers bloomed in the garden.',
 'The car drove down the highway.',
 'The teacher wrote on the blackboard.',
 'The student solved the math problem.',
 'The phone rang loudly in the room.',
 'The clock struck midnight.',
 'The baby cried loudly in the night.',
 'The cat chased the mouse.',
 'The dog wagged its tail.',
 'The sun set slowly in the west.',
 'The stars twinkled in the sky.',
 'The moon glowed brightly in the night.',
 'The river flowed gently through the valley.',
 'The mountain rose steeply into the air.',
 'The tree swayed gently in the breeze.',
 'The bird sang sweetly in the morning.',
 'The fish swam quickly through the water.',
 'The boat sailed smoothly across the lake.',
 'T

In [48]:
# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Calculate TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(corpus)

# Get feature names (terms)
feature_names = vectorizer.get_feature_names_out()

# Convert the matrix to a DataFrame for better visualization
df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=feature_names,
    # No index is passed, so rows use the default integer index 0..len(corpus)-1
)

print("TF-IDF Matrix:")
print(df)

TF-IDF Matrix:
     across  air  along  ants    around   at  baby  back  background  bed  \
0       0.0  0.0    0.0   0.0  0.000000  0.0   0.0   0.0         0.0  0.0   
1       0.0  0.0    0.0   0.0  0.000000  0.0   0.0   0.0         0.0  0.0   
2       0.0  0.0    0.0   0.0  0.000000  0.0   0.0   0.0         0.0  0.0   
3       0.0  0.0    0.0   0.0  0.000000  0.0   0.0   0.0         0.0  0.0   
4       0.0  0.0    0.0   0.0  0.464792  0.0   0.0   0.0         0.0  0.0   
..      ...  ...    ...   ...       ...  ...   ...   ...         ...  ...   
267     0.0  0.0    0.0   0.0  0.000000  0.0   0.0   0.0         0.0  0.0   
268     0.0  0.0    0.0   0.0  0.000000  0.0   0.0   0.0         0.0  0.0   
269     0.0  0.0    0.0   0.0  0.000000  0.0   0.0   0.0         0.0  0.0   
270     0.0  0.0    0.0   0.0  0.000000  0.0   0.0   0.0         0.0  0.0   
271     0.0  0.0    0.0   0.0  0.000000  0.0   0.0   0.0         0.0  0.0   

     ...  wall       was  water      were  west  white      

In [49]:
# Extract IDF values
idf_values = vectorizer.idf_

# Create a DataFrame for IDF values
idf_df = pd.DataFrame(
    {'Term': feature_names, 'IDF': idf_values}
).sort_values(by='IDF', ascending=False)

print("\nIDF Values:")
print(idf_df)


IDF Values:
           Term       IDF
0        across  5.916325
120     problem  5.916325
112      papers  5.916325
113  pedestrian  5.916325
115     picture  5.916325
..          ...       ...
60      falling  3.564949
67      flowers  3.518429
185        were  2.066177
183         was  1.625865
166         the  1.000000

[192 rows x 2 columns]
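
The jump in the highest IDF values (from about 1.69 in the 3-document corpus to about 5.92 here) follows from the corpus size. As a rough check, assuming scikit-learn's default smoothed formula $\ln\frac{1 + N}{1 + \text{DF}(t)} + 1$, the sketch below evaluates it for a term found in a single document and for a term found in every document. The function name `smoothed_idf` is made up for this example.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + N) / (1 + DF)) + 1 (hypothetical helper)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# (rare, common) IDF pairs for the small and the extended corpus sizes
results = {
    n_docs: (smoothed_idf(n_docs, 1), smoothed_idf(n_docs, n_docs))
    for n_docs in (3, 272)
}
for n_docs, (rare, common) in results.items():
    print(f"N={n_docs}: rare term IDF={rare:.6f}, common term IDF={common:.6f}")
```

A term that appears in every document keeps an IDF of 1 regardless of corpus size, while the IDF of a term that appears only once grows with the number of documents.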


In [None]:
# Refit the vectorizer on the original sample documents
vectorizer = TfidfVectorizer()

# Calculate TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (terms)
feature_names = vectorizer.get_feature_names_out()

# Create one row label per document, so the index length always matches
doc_index = [f'Doc {i+1}' for i in range(len(documents))]

# Convert the matrix to a DataFrame for better visualization
df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=feature_names,
    index=doc_index  # Use the generated index
)

print("TF-IDF Matrix:")
print(df)

In [None]:
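# Explain TF-IDF for multiple terms in a document
def explain_tfidf_multiple(terms, doc_index):
    # Tokenize the document once, using the vectorizer's own tokenization
    tokens = vectorizer.build_analyzer()(documents[doc_index])

    # Row of the TF-IDF matrix for this document (L2-normalized values)
    row = vectorizer.transform([documents[doc_index]]).toarray()[0]

    for term in terms:
        # Find the index of the term
        term_index = list(feature_names).index(term)

        # Calculate TF (Term Frequency) as a raw count
        tf = tokens.count(term)

        # Get IDF (Inverse Document Frequency)
        idf = vectorizer.idf_[term_index]

        # Calculate the unnormalized TF-IDF
        tfidf = tf * idf

        # Display results
        print(f"\nExplanation for term '{term}' in Document {doc_index + 1}:")
        print(f"TF (Term Frequency): {tf}")
        print(f"IDF (Inverse Document Frequency): {idf:.4f}")
        print(f"TF-IDF (unnormalized): {tfidf:.4f}")
        print(f"TF-IDF (L2-normalized, as in the matrix): {row[term_index]:.4f}")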


In [None]:
explain_tfidf_multiple(["cat", "dog", "mat"], 0)