# Building TF-IDF from Scratch
In this notebook, we will implement **Term Frequency - Inverse Document Frequency (TF-IDF)** from scratch using Python.
This is a fundamental technique in Natural Language Processing (NLP) for converting text data into numerical vectors.

### Goals
1. Implement **TF (Term Frequency)**.
2. Implement **IDF (Inverse Document Frequency)**.
3. Combine them to create **TF-IDF**.
4. Compare our results with `sklearn`.


In [1]:
import pandas as pd
import math
import numpy as np

# Sample Corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are great"
]
print("Corpus:", corpus)

Corpus: ['the cat sat on the mat', 'the dog sat on the log', 'cats and dogs are great']


## 1. Term Frequency (TF)

**Term Frequency** measures how frequently a term appears in a document.

### Formula
$$
TF(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{total number of terms in document } d}
$$

### Exercise 1
Complete the function `compute_tf` below.


In [3]:
def compute_tf(document):
    """
    Computes TF for a single document (string).
    Returns a dictionary: {term: tf_value}
    """
    # Split the document into words (tokens)
    # Using .split() handles whitespace automatically
    words = document.lower().split()
    total_words = len(words)

    # 1. Count the frequency of each word
    word_counts = {}
    for word in words:
        # Increment count if word exists, else start at 1
        word_counts[word] = word_counts.get(word, 0) + 1

    # 2. Calculate TF for each word
    tf_dict = {}
    for word, count in word_counts.items():
        # Formula: TF = (count of term) / (total terms)
        tf_dict[word] = count / total_words

    return tf_dict

# Example usage (assuming corpus is defined)
# corpus = ["The cat sat on the mat", "The dog barked"]
# print("TF for doc 0:", compute_tf(corpus[0]))
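
As a quick cross-check on `compute_tf`, the same quantity can be sketched with `collections.Counter`. This is only an illustrative alternative: the helper name `tf_with_counter` is hypothetical and not part of the notebook, and it assumes the same lowercase, whitespace-based tokenization used above.

```python
from collections import Counter

def tf_with_counter(document):
    """Hypothetical helper: relative term frequencies for one document."""
    tokens = document.lower().split()   # same tokenization as compute_tf
    counts = Counter(tokens)            # term -> raw count
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tf_doc0 = tf_with_counter("the cat sat on the mat")
print(tf_doc0)
```

Because every count is divided by the document length, the TF values of a single document sum to 1, which is a handy sanity check.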

## 2. Inverse Document Frequency (IDF)

**IDF** measures how informative a term is across the whole corpus. TF alone treats every term as equally important; IDF down-weights terms that appear in many documents (like "the", "is") and up-weights rare ones.

### Formula
$$
IDF(t) = \log \left( \frac{N}{DF(t)} \right)
$$
Where:
* $N$ = Total number of documents.
* $DF(t)$ = Number of documents containing term $t$.

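*Note: Either `math.log10` or `math.log` (natural log) works, because the base only rescales every IDF value by the same constant. The solution below uses the natural log, which is also what `sklearn` uses.*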
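
To make the formula concrete, here is a small sketch that plugs in numbers from the sample corpus: `N = 3`, "the" appears in two documents and "cat" in one. The variable names are illustrative only. It also compares the natural log with base 10 to show that the choice of base changes the scale but not the ratio between terms.

```python
import math

N = 3        # documents in the sample corpus
df_the = 2   # "the" appears in documents 0 and 1
df_cat = 1   # "cat" appears only in document 0

# Natural log, as in math.log
idf_the = math.log(N / df_the)
idf_cat = math.log(N / df_cat)
print(f"ln:    the={idf_the:.4f}  cat={idf_cat:.4f}")

# Base-10 log gives smaller numbers with the same relative ordering
idf_the_10 = math.log10(N / df_the)
idf_cat_10 = math.log10(N / df_cat)
print(f"log10: the={idf_the_10:.4f}  cat={idf_cat_10:.4f}")

# The ratio between two terms does not depend on the base
ratio_ln = idf_cat / idf_the
ratio_10 = idf_cat_10 / idf_the_10
print(f"ratio cat/the: ln={ratio_ln:.4f}  log10={ratio_10:.4f}")
```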

### Exercise 2
Complete the function `compute_idf`.


In [5]:
import math

def compute_idf(corpus):
    """
    Computes IDF for the entire corpus.
    Returns a dictionary: {term: idf_value}
    """
    N = len(corpus)
    all_words_df = {}

    # 1. Calculate Document Frequency (DF)
    for doc in corpus:
        # We split the doc into words and use set()
        # so "apple apple" in one doc only counts as 1 for DF
        words = set(doc.lower().split())

        for word in words:
            # Increment the count of how many documents contain this word
            all_words_df[word] = all_words_df.get(word, 0) + 1

    # 2. Calculate IDF
    idf_dict = {}
    for word, df_count in all_words_df.items():
        # Formula: IDF(t) = log(N / DF(t))
        # Note: If N/df_count is 1 (word is in all docs), IDF will be 0
        idf_dict[word] = math.log(N / df_count)

    return idf_dict

# Test it
# Assuming 'corpus' is defined from your previous cells
idf_result = compute_idf(corpus)
print("IDF Result:", idf_result)

IDF Result: {'cat': 1.0986122886681098, 'mat': 1.0986122886681098, 'the': 0.4054651081081644, 'on': 0.4054651081081644, 'sat': 0.4054651081081644, 'log': 1.0986122886681098, 'dog': 1.0986122886681098, 'and': 1.0986122886681098, 'are': 1.0986122886681098, 'cats': 1.0986122886681098, 'dogs': 1.0986122886681098, 'great': 1.0986122886681098}


## 3. TF-IDF

Now we multiply them together:
$$
TF\text{-}IDF = TF(t, d) \times IDF(t)
$$
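
As a worked example on the sample corpus, assuming the natural log used in `compute_idf`: document 0 ("the cat sat on the mat") has 6 terms, "cat" occurs once and appears in 1 of the 3 documents, while "the" occurs twice and appears in 2 of the 3 documents.

$$
TF\text{-}IDF(\text{cat}, d_0) = \frac{1}{6} \times \ln\left(\frac{3}{1}\right) \approx 0.1667 \times 1.0986 \approx 0.1831
$$

$$
TF\text{-}IDF(\text{the}, d_0) = \frac{2}{6} \times \ln\left(\frac{3}{2}\right) \approx 0.3333 \times 0.4055 \approx 0.1352
$$

Even though "the" occurs twice as often as "cat" in this document, its lower IDF gives it the smaller score.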

### Exercise 3
Create the full TF-IDF matrix for our corpus.


In [6]:
def compute_tfidf(corpus):
    idf_dict = compute_idf(corpus)
    vectors = []

    for doc in corpus:
        tf_dict = compute_tf(doc)

        # Calculate TF-IDF for each word in the doc
        doc_tfidf = {}
        for word, tf_val in tf_dict.items():
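            # TF-IDF = TF(t, d) * IDF(t)
            doc_tfidf[word] = tf_val * idf_dict[word]

        vectors.append(doc_tfidf)

    return vectors

# Run the pipeline
tfidf_vectors = compute_tfidf(corpus)

# Display as a DataFrame for better visibility
df = pd.DataFrame(tfidf_vectors)
df = df.fillna(0)  # Fill NaN with 0 (words not in document)
print("My TF-IDF Matrix:")
print(df)


My TF-IDF Matrix:
        the       cat       sat        on       mat       dog       log  \
0  0.135155  0.183102  0.067578  0.067578  0.183102  0.000000  0.000000   
1  0.135155  0.000000  0.067578  0.067578  0.000000  0.183102  0.183102   
2  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   

       cats       and      dogs       are     great  
0  0.000000  0.000000  0.000000  0.000000  0.000000  
1  0.000000  0.000000  0.000000  0.000000  0.000000  
2  0.219722  0.219722  0.219722  0.219722  0.219722  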


## 4. Comparison with Scikit-Learn
Let's see how our results compare with a professional library.
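Note: `sklearn` uses a slightly different formula. It takes the raw term count as TF (no division by document length), adds 1 to the IDF (`ln(N / DF) + 1` when `smooth_idf=False`), and by default also smooths the IDF and L2-normalizes each row. The numbers therefore won't be identical to ours, but they should be correlated.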


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# By default TfidfVectorizer applies L2 normalization and IDF smoothing;
# turn both off here to stay close to our simple formula
vectorizer = TfidfVectorizer(norm=None, smooth_idf=False)

sklearn_tfidf = vectorizer.fit_transform(corpus)
df_sklearn = pd.DataFrame(sklearn_tfidf.toarray(), columns=vectorizer.get_feature_names_out())

print("Sklearn TF-IDF Matrix:")
print(df_sklearn)


Sklearn TF-IDF Matrix:
        and       are       cat      cats       dog      dogs     great  \
0  0.000000  0.000000  2.098612  0.000000  0.000000  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  2.098612  0.000000  0.000000   
2  2.098612  2.098612  0.000000  2.098612  0.000000  2.098612  2.098612   

        log       mat        on       sat      the  
0  0.000000  2.098612  1.405465  1.405465  2.81093  
1  2.098612  0.000000  1.405465  1.405465  2.81093  
2  0.000000  0.000000  0.000000  0.000000  0.00000  
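
The sklearn numbers above can be reproduced by hand, which makes the difference from our version explicit. The sketch below assumes that, with `norm=None` and `smooth_idf=False`, the vectorizer computes `raw count * (ln(N / DF) + 1)`. The helper name `sklearn_style_tfidf` is hypothetical, and plain whitespace splitting is used as a stand-in for sklearn's tokenizer, which happens to produce the same tokens for this corpus.

```python
import math

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are great",
]

def sklearn_style_tfidf(corpus):
    """Hypothetical helper: raw count * (ln(N / DF) + 1), no normalization."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)

    # Document frequency: number of documents containing each term
    df = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1

    rows = []
    for tokens in docs:
        counts = {}
        for term in tokens:
            counts[term] = counts.get(term, 0) + 1  # raw count, not a ratio
        rows.append({
            term: count * (math.log(n_docs / df[term]) + 1)
            for term, count in counts.items()
        })
    return rows

rows = sklearn_style_tfidf(corpus)
print(round(rows[0]["cat"], 6), round(rows[0]["the"], 6), round(rows[0]["on"], 6))
```

For document 0 this gives roughly 2.098612 for "cat", 2.81093 for "the" and 1.405465 for "on", matching the first row of the sklearn matrix. Note that "the" now scores highest in that document, because the raw count and the `+ 1` term both favor frequent words more than our length-normalized version does.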
