Imagine you have a library of books. If you're looking for information about 'dinosaurs', a document that mentions 'dinosaur' many times is likely about dinosaurs. However, if the word 'the' appears just as frequently, it's not very informative. TF-IDF helps us distinguish between these two scenarios. It assigns a higher score to 'dinosaur' because it's likely specific to that document, while 'the' would receive a very low score because it appears in almost every document.

##### The TF-IDF score is a product of two components:
- **Term Frequency (TF)**: This measures how frequently a term appears in a document.
- **Inverse Document Frequency (IDF)**: This measures how rare a term is across all documents in the corpus.

By multiplying these two values, TF-IDF effectively highlights words that are both frequent within a document and relatively uncommon across the entire collection. This makes it a powerful tool for identifying keywords, understanding document similarity, and improving the performance of various NLP tasks.

## Calculating Term Frequency (TF): How Often Does a Word Appear?

### Term Frequency (TF) Variants
To account for document length, TF is often normalized. There are several common ways to calculate Term Frequency:
- **Raw Count**
  - Number of times a term appears in a document

- **Normalized TF**
  - Raw count ÷ total number of words in the document
  - Prevents bias toward longer documents

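- **Augmented Frequency**
  - Raw count ÷ count of the most frequent term in the document, rescaled as 0.5 + 0.5 × (that ratio)
  - Reduces bias toward longer documents; terms present in the document score between 0.5 and 1

- **Logarithmic Scaling**
  - Uses 1 + log(raw count) when the count is positive, and 0 otherwise
  - Dampens the effect of extremely high frequencies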
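
As a rough illustration of the variants listed above, the following sketch computes each one for a single term. The helper name `tf_variants` is made up for this example, and the augmented and logarithmic forms are the common textbook definitions; other sources use slightly different constants.

```python
import math
from collections import Counter

def tf_variants(tokens, term):
    """Return four common TF weightings for `term` in a tokenized document."""
    counts = Counter(tokens)
    raw = counts[term]                      # raw count
    total = len(tokens)                     # document length in tokens
    max_count = max(counts.values()) if counts else 0
    return {
        "raw": raw,
        "normalized": raw / total if total else 0.0,
        # 0.5 floor plus half of the ratio to the most frequent term
        "augmented": 0.5 + 0.5 * raw / max_count if max_count else 0.0,
        # 1 + ln(count) for terms that occur, 0 for absent terms
        "log": 1 + math.log(raw) if raw > 0 else 0.0,
    }

tokens = "the quick brown fox jumps over the lazy dog".split()
print(tf_variants(tokens, "the"))
print(tf_variants(tokens, "fox"))
```

Here "the" is the most frequent token, so its augmented value is 1.0, while "fox" (one occurrence) gets 0.75.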


In [1]:
import pandas as pd
from collections import Counter

In [2]:
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog barks loudly.",
    "The fox is quick and brown.",
    "A quick brown dog is a happy dog."
]

In [3]:
def calculate_tf(document, term):
    # Tokenize on whitespace (simple split for demonstration).
    # Punctuation stays attached to words, so "dog." is not counted as "dog".
    words = document.lower().split()
    raw_frequency = words.count(term)   # Raw frequency of the term
    total_words = len(words)            # Total number of tokens in the document
    
    # Calculate normalized TF
    if total_words == 0:
        return 0.0
    return raw_frequency / total_words

In [4]:
for doc_index, doc in enumerate(corpus, start=1):
    print(f"--- Document {doc_index} ---")
    print(f"Document: '{doc}'")

    # Calculate and print the TF of three sample terms
    for term in ('quick', 'the', 'dog'):
        print(f"TF('{term}'): {calculate_tf(doc, term):.4f}")

--- Document 1 ---
Document: 'The quick brown fox jumps over the lazy dog.'
TF('quick'): 0.1111
TF('the'): 0.2222
TF('dog'): 0.0000
--- Document 2 ---
Document: 'The lazy dog barks loudly.'
TF('quick'): 0.0000
TF('the'): 0.2000
TF('dog'): 0.2000
--- Document 3 ---
Document: 'The fox is quick and brown.'
TF('quick'): 0.1667
TF('the'): 0.1667
TF('dog'): 0.0000
--- Document 4 ---
Document: 'A quick brown dog is a happy dog.'
TF('quick'): 0.1250
TF('the'): 0.0000
TF('dog'): 0.1250
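
Note that TF('dog') is 0 for Document 1 in the output above even though the sentence ends with "dog.": splitting on whitespace leaves the period attached, so the token is 'dog.' rather than 'dog'. One possible fix, assuming a simple regular-expression tokenizer is acceptable, is sketched below. `tokenize` and `calculate_tf_clean` are hypothetical names, not part of the original code.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

def calculate_tf_clean(document, term):
    """Normalized TF computed on punctuation-free tokens."""
    tokens = tokenize(document)
    return tokens.count(term) / len(tokens) if tokens else 0.0

doc = "The quick brown fox jumps over the lazy dog."
print(doc.lower().split()[-1])                    # 'dog.' (period still attached)
print(tokenize(doc)[-1])                          # 'dog'
print(round(calculate_tf_clean(doc, "dog"), 4))   # 0.1111
print(calculate_tf_clean("A quick brown dog is a happy dog.", "dog"))  # 0.25
```

With this tokenizer both occurrences of "dog" in the fourth sentence are counted, giving 2/8 instead of 1/8.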


In [6]:
print("--- Using Pandas for a more structured approach ---")

df = pd.DataFrame({'document': corpus})
# Same whitespace tokenization as calculate_tf, so punctuation stays attached (e.g. 'dog.')
df['tokens'] = df['document'].apply(lambda x: x.lower().split())
df['total_words'] = df['tokens'].apply(len)

def calculate_term_tf(row, term):
    raw_frequency = row['tokens'].count(term)
    if row['total_words'] == 0:
        return 0.0
    return raw_frequency / row['total_words']

# Calculate TF for specific terms across all documents
df['tf_quick'] = df.apply(lambda row: calculate_term_tf(row, 'quick'), axis=1)
df['tf_the'] = df.apply(lambda row: calculate_term_tf(row, 'the'), axis=1)
df['tf_dog'] = df.apply(lambda row: calculate_term_tf(row, 'dog'), axis=1)

print(df[['document', 'tf_quick', 'tf_the', 'tf_dog']])

--- Using Pandas for a more structured approach ---
                                       document  tf_quick    tf_the  tf_dog
0  The quick brown fox jumps over the lazy dog.  0.111111  0.222222   0.000
1                    The lazy dog barks loudly.  0.000000  0.200000   0.200
2                   The fox is quick and brown.  0.166667  0.166667   0.000
3             A quick brown dog is a happy dog.  0.125000  0.000000   0.125


## Calculating Inverse Document Frequency (IDF): How Rare is a Word?

IDF lowers the weight of terms that occur in many documents. The function below uses the smoothed form ln((N + 1) / (df + 1)) + 1, where N is the number of documents and df is the number of documents that contain the term. Adding 1 inside the ratio avoids division by zero for unseen terms, and the trailing + 1 keeps terms that occur in every document from dropping to zero.

In [9]:
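import math

def calculate_idf(term, documents):
    # `documents` must be an iterable of token lists; with raw strings,
    # `term in doc` would match substrings instead of whole tokens.
    N = len(documents)
    doc_count = sum(1 for doc in documents if term in doc)   # document frequency
    return math.log((N + 1) / (doc_count + 1)) + 1   # smoothed IDF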

In [10]:
# Terms to analyze
terms = ['quick', 'the', 'dog']

# Calculate IDF values
idf_values = {term: calculate_idf(term, df['tokens']) for term in terms}

In [11]:
for term, value in idf_values.items():
    print(f"IDF('{term}') = {value:.4f}")

IDF('quick') = 1.2231
IDF('the') = 1.2231
IDF('dog') = 1.5108
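
In the output above, 'quick' and 'the' receive the same IDF because each is found in three of the four token lists, while 'dog' is found in only two (the first document's token is 'dog.', which does not match). The sketch below, with hypothetical helper names, compares the smoothed formula with the unsmoothed log(N / df) for N = 4 to show what the smoothing changes.

```python
import math

def idf_plain(n_docs, doc_freq):
    # Classic IDF: undefined for doc_freq == 0, and exactly 0 when a term is in every document
    return math.log(n_docs / doc_freq)

def idf_smoothed(n_docs, doc_freq):
    # Same form as the smoothed IDF used in calculate_idf
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1

N = 4
for doc_freq in (1, 2, 3, 4):
    print(doc_freq, round(idf_plain(N, doc_freq), 4), round(idf_smoothed(N, doc_freq), 4))
```

For a document frequency of 3 the smoothed value is about 1.2231, and for 2 it is about 1.5108, matching the printed results. A term present in all four documents gets 0 from the plain formula but 1.0 from the smoothed one.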


## TF-IDF Together

In [12]:
# Add TF-IDF columns
for term in terms:
    df[f'tfidf_{term}'] = df[f'tf_{term}'] * idf_values[term]

In [13]:
print("\nTF and TF-IDF values:")
print(df[['document', 'tf_quick', 'tf_the', 'tf_dog',
          'tfidf_quick', 'tfidf_the', 'tfidf_dog']])


TF and TF-IDF values:
                                       document  tf_quick    tf_the  tf_dog  \
0  The quick brown fox jumps over the lazy dog.  0.111111  0.222222   0.000   
1                    The lazy dog barks loudly.  0.000000  0.200000   0.200   
2                   The fox is quick and brown.  0.166667  0.166667   0.000   
3             A quick brown dog is a happy dog.  0.125000  0.000000   0.125   

   tfidf_quick  tfidf_the  tfidf_dog  
0     0.135905   0.271810   0.000000  
1     0.000000   0.244629   0.302165  
2     0.203857   0.203857   0.000000  
3     0.152893   0.000000   0.188853  
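
The table above scores only three hand-picked terms. A minimal sketch of scoring every term and listing the highest-scoring ones per document might look like this. It assumes a regex tokenizer that strips punctuation (so the numbers differ slightly from the table) and the same smoothed IDF; `top_terms` is a hypothetical helper.

```python
import math
import re
from collections import Counter

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog barks loudly.",
    "The fox is quick and brown.",
    "A quick brown dog is a happy dog.",
]

# Tokenize without punctuation, then count in how many documents each term occurs
docs = [re.findall(r"\w+", text.lower()) for text in corpus]
n_docs = len(docs)
doc_freq = Counter(term for tokens in docs for term in set(tokens))
idf = {term: math.log((n_docs + 1) / (df + 1)) + 1 for term, df in doc_freq.items()}

def top_terms(tokens, k=3):
    """Return the k terms with the highest (normalized TF * smoothed IDF) score."""
    counts = Counter(tokens)
    scores = {term: (count / len(tokens)) * idf[term] for term, count in counts.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

for index, tokens in enumerate(docs):
    print(index, [(term, round(score, 3)) for term, score in top_terms(tokens)])
```

On a corpus this small, function words such as "the" and "a" can still come out on top, which is one reason stop-word removal is often combined with TF-IDF.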


### Implementing TF-IDF with Scikit-learn's TfidfVectorizer

TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features. With its default settings, it performs the following steps internally:
1. **Tokenization**: It lowercases the text and splits it into tokens of two or more word characters, so punctuation and single-character words such as "a" are dropped.
2. **Vocabulary Building**: It creates a vocabulary of all unique tokens found in the corpus.
3. **Term Frequency (TF) Calculation**: It counts how often each term occurs in each document (raw counts, not length-normalized).
4. **Document Frequency (DF) Calculation**: It counts the number of documents each term appears in.
5. **Inverse Document Frequency (IDF) Calculation**: It computes the smoothed IDF, ln((1 + N) / (1 + DF)) + 1, for each term.
6. **TF-IDF Score Calculation**: It multiplies TF and IDF for each term in each document.
7. **Normalization**: It scales each document's vector to unit Euclidean (L2) length, which is why the values in the resulting matrix lie between 0 and 1.
8. **Matrix Formation**: It outputs a sparse matrix where rows represent documents and columns represent terms, with the values being the normalized TF-IDF scores.
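
The steps above can be sketched in plain Python for a small corpus. This is an approximation under stated assumptions, not scikit-learn's actual implementation: it assumes the default settings (lowercasing, tokens of two or more word characters, raw counts as TF, smoothed IDF, and L2 normalization of each row), and all variable names are illustrative.

```python
import math
import re
from collections import Counter

corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The cat is happy.",
    "The dog is playful.",
]

# Tokenize (lowercase, 2+ word characters) and build a sorted vocabulary
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
vocab = sorted({term for tokens in docs for term in tokens})

# Document frequency and smoothed IDF per term
n_docs = len(docs)
doc_freq = Counter(term for tokens in docs for term in set(tokens))
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_row(tokens):
    """Raw count * IDF for every vocabulary term, scaled to unit L2 length."""
    counts = Counter(tokens)
    weights = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

matrix = [tfidf_row(tokens) for tokens in docs]
row0 = dict(zip(vocab, matrix[0]))
print(vocab)
print(round(row0["the"], 4), round(row0["cat"], 4))
```

For these four sentences, the first row comes out at about 0.4922 for "the" and 0.3010 for "cat".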


In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [15]:
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The cat is happy.",
    "The dog is playful."
]

In [16]:
vectorizer = TfidfVectorizer()
# Fit the vectorizer to the corpus and transform the corpus into a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(corpus)

# Convert the sparse TF-IDF matrix to a dense array for display (fine for a small corpus)
tfidf_array = tfidf_matrix.toarray()
feature_names = vectorizer.get_feature_names_out()
df_tfidf = pd.DataFrame(tfidf_array, columns=feature_names)

print(df_tfidf)

        cat    chased       dog     happy        is       mat        on  \
0  0.301002  0.000000  0.000000  0.000000  0.000000  0.471578  0.471578   
1  0.361459  0.566295  0.446473  0.000000  0.000000  0.000000  0.000000   
2  0.420753  0.000000  0.000000  0.659191  0.519714  0.000000  0.000000   
3  0.000000  0.000000  0.497096  0.000000  0.497096  0.000000  0.000000   

    playful       sat       the  
0  0.000000  0.471578  0.492178  
1  0.000000  0.000000  0.591032  
2  0.000000  0.000000  0.343993  
3  0.630504  0.000000  0.329023  
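
One use of this matrix, mentioned at the start, is comparing documents. A minimal sketch using scikit-learn's `cosine_similarity` on the same four sentences follows; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The cat is happy.",
    "The dog is playful.",
]

tfidf_matrix = TfidfVectorizer().fit_transform(corpus)

# Rows are L2-normalized by default, so cosine similarity here
# is the dot product between document vectors.
similarity = cosine_similarity(tfidf_matrix)
print(similarity.round(3))
```

Documents 0 and 1, which share both "cat" and "the", should score higher than documents 0 and 3, which share only "the".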


In [19]:
# Use a separate name so the idf_values dict from the manual example is not overwritten
sklearn_idf = vectorizer.idf_
df_idf_scores = pd.DataFrame({'term': feature_names, 'idf': sklearn_idf})
print("--- IDF Scores calculated by TfidfVectorizer ---")
print(df_idf_scores.sort_values(by='idf', ascending=False))

--- IDF Scores calculated by TfidfVectorizer ---
      term       idf
1   chased  1.916291
3    happy  1.916291
8      sat  1.916291
5      mat  1.916291
7  playful  1.916291
6       on  1.916291
4       is  1.510826
2      dog  1.510826
0      cat  1.223144
9      the  1.000000


In [20]:
print("--- TfidfVectorizer with stop_words='english' and ngram_range=(1, 2) ---")
vectorizer_advanced = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
tfidf_matrix_advanced = vectorizer_advanced.fit_transform(corpus)
tfidf_array_advanced = tfidf_matrix_advanced.toarray()
feature_names_advanced = vectorizer_advanced.get_feature_names_out()

df_tfidf_advanced = pd.DataFrame(tfidf_array_advanced, columns=feature_names_advanced)

print("TF-IDF Matrix (with stop words removed and bigrams):")
print(df_tfidf_advanced)

--- TfidfVectorizer with stop_words='english' and ngram_range=(1, 2) ---
TF-IDF Matrix (with stop words removed and bigrams):
        cat  cat happy  cat sat    chased  chased cat       dog  dog chased  \
0  0.304035   0.000000  0.47633  0.000000    0.000000  0.000000    0.000000   
1  0.317993   0.000000  0.00000  0.498197    0.498197  0.392784    0.498197   
2  0.411378   0.644503  0.00000  0.000000    0.000000  0.000000    0.000000   
3  0.000000   0.000000  0.00000  0.000000    0.000000  0.486934    0.000000   

   dog playful     happy      mat   playful      sat  sat mat  
0     0.000000  0.000000  0.47633  0.000000  0.47633  0.47633  
1     0.000000  0.000000  0.00000  0.000000  0.00000  0.00000  
2     0.000000  0.644503  0.00000  0.000000  0.00000  0.00000  
3     0.617614  0.000000  0.00000  0.617614  0.00000  0.00000  
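
The bigram "sat mat" in the columns above does not occur in the original sentence ("sat on the mat"): stop words are removed before n-grams are formed, so the remaining tokens become neighbours. A quick way to inspect this, assuming the same settings as `vectorizer_advanced`, is the vectorizer's `build_analyzer()` method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same settings as vectorizer_advanced. build_analyzer() returns the function
# that turns one raw string into the list of features the vectorizer counts.
analyze = TfidfVectorizer(stop_words='english', ngram_range=(1, 2)).build_analyzer()

features = analyze("The cat sat on the mat.")
print(features)
```

The returned list should contain the unigrams "cat", "sat", and "mat" together with the bigrams "cat sat" and "sat mat", and nothing involving "the" or "on".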


### Comparing TF-IDF with Bag-of-Words (BoW): Why TF-IDF Wins

- **BoW Limitation**
  - Treats all words equally
  - High frequency ≠ high importance
  - Common words dominate (e.g., *the*, *is*)

- **TF-IDF Advantage**
  - Down-weights common words using **IDF**
  - Highlights informative, rare terms
  - Better reflects how important a term is to a specific document

- **Better Model Performance**
  - Often improves results in text classification, search, and clustering

- **Still Simple & Efficient**
  - Easy to compute
  - Works well as a baseline NLP technique

**Conclusion:**  
TF-IDF usually provides a more discriminative text representation than raw BoW counts, although on very small corpora common words can still receive sizeable weights.
