# NLP Foundations Workshop: From Preprocessing to tf-idf
**Course:** PROG8245 – Machine Learning Programming  
**Team:** Team 5  
**Team Members:**
- Mandeep Singh (ID: 8989367)  
- Kumari Nikitha Singh (ID: 9053016)  
- Krishna (ID: 905861)

## Step 1: Document Collection
We use the 20 Newsgroups dataset provided by Scikit-learn and extract the first 20 documents. This corpus represents real-world newsgroup posts and is commonly used in NLP tasks.

In [1]:
# Step 1: Document Collection
from sklearn.datasets import fetch_20newsgroups

# Load dataset and keep only raw text (remove headers/footers/quotes)
newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# Use the first 20 documents to keep it simple
documents = newsgroups.data[:20]

# Show total documents
print(f"Total documents loaded: {len(documents)}")


Total documents loaded: 20


**Talking Point:**  
The 20 Newsgroups dataset reflects noisy, diverse real-world text — ideal for simulating realistic NLP pipelines.

---

## Step 2: Tokenization
We implement a basic tokenizer using regular expressions. It:
- Converts text to lowercase
- Extracts word tokens (runs of letters, digits, and underscores) with the pattern `\b\w+\b`


In [2]:
def view_document(documents, doc_id):
    if 0 <= doc_id < len(documents):
        print(f"\n--- Document {doc_id} ---\n")
        print(documents[doc_id][:500] + "...\n")
    else:
        print("Invalid document ID")

# Example: View Document 0
view_document(documents, 0)



--- Document 0 ---

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail....



**Talking Point:**  
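Regex-based tokenization is more reliable than whitespace splitting. Splitting on whitespace leaves punctuation attached to words (for example, "car," and "car" become different tokens), while the `\b\w+\b` pattern yields clean tokens for indexing and frequency calculations. The trade-off is that hyphenated forms such as "e-mail" are split into "e" and "mail".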
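
To make the difference concrete, here is a minimal sketch that compares whitespace splitting with the regex tokenizer on a made-up sentence (the sample text is an assumption, loosely based on Document 0):

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of word characters."""
    return re.findall(r"\b\w+\b", text.lower())

# Hypothetical sample sentence, used only for illustration
sample = "It was a 2-door sports car, please e-mail."

whitespace_tokens = sample.lower().split()
regex_tokens = tokenize(sample)

print(whitespace_tokens)
# ['it', 'was', 'a', '2-door', 'sports', 'car,', 'please', 'e-mail.']
print(regex_tokens)
# ['it', 'was', 'a', '2', 'door', 'sports', 'car', 'please', 'e', 'mail']
```

Whitespace splitting keeps "car," and "e-mail." with their punctuation, so they would not match the plain tokens "car" or "mail" in an index.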

---

## Step 3: Normalization (Stopword Removal + Stemming)
We remove common stop words using NLTK’s English stop word list and apply stemming with the Porter Stemmer to reduce words to their root form.


In [3]:
import re

def tokenize(text):
    r"""
    Tokenizes text using regex:
    - Lowercases all words
    - Extracts word tokens using \b\w+\b
    (Raw docstring so that \b and \w are not read as escape sequences.)
    """
    return re.findall(r'\b\w+\b', text.lower())

# Example: Tokenize Document 0
tokens_doc0 = tokenize(documents[0])
print("First 20 tokens:", tokens_doc0[:20])


First 20 tokens: ['i', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'i', 'saw', 'the', 'other', 'day', 'it', 'was']


**Talking Point:**  
Normalization reduces redundancy and focuses on meaningful content. Stemming helps group variants like “running” and “runs” under “run.”
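
As a small illustration of that grouping, the sketch below stems a few hand-picked tokens. It assumes NLTK is installed (`PorterStemmer` needs no corpus download), and the tiny stop list is a stand-in for NLTK's full English list, not the list used in the workshop.

```python
from nltk.stem import PorterStemmer

# Tiny illustrative stop list (an assumption; the workshop uses NLTK's English list)
STOP_WORDS = {"she", "he", "and", "was", "the"}

stemmer = PorterStemmer()

def normalize(tokens):
    """Drop stop words, then reduce each remaining token to its Porter stem."""
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

tokens = ["she", "runs", "and", "he", "was", "running"]
normalized = normalize(tokens)
print(normalized)  # ['run', 'run']
```

Stems are not always dictionary words: the normalized output for Document 0 contains forms such as "anyon" and "earli", which is expected behavior for a rule-based stemmer.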

---

## Step 4: Term-Document Incidence Matrix
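Using `CountVectorizer(binary=True)`, we create a binary matrix showing whether each term is present (1) or absent (0) in each document. Note that `CountVectorizer` applies its own default preprocessing (lowercasing and keeping tokens of two or more word characters), so the vocabulary here is neither stemmed nor stop-word filtered.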


In [4]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download NLTK resources (only once)
nltk.download('stopwords')

# Initialize stopwords and stemmer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def normalize_tokens(tokens):
    """
    Removes stop words and applies stemming.
    """
    filtered = [t for t in tokens if t not in stop_words]
    stemmed = [stemmer.stem(t) for t in filtered]
    return stemmed

# Example: Normalize tokens from Document 0
normalized_tokens_doc0 = normalize_tokens(tokens_doc0)
print("Normalized tokens (first 20):", normalized_tokens_doc0[:20])


Normalized tokens (first 20): ['wonder', 'anyon', 'could', 'enlighten', 'car', 'saw', 'day', '2', 'door', 'sport', 'car', 'look', 'late', '60', 'earli', '70', 'call', 'bricklin', 'door', 'realli']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kittu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Talking Point:**  
This matrix supports Boolean queries. It allows us to check which documents contain all query terms and enables basic rule-based retrieval systems.
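
A Boolean AND query over an incidence matrix can be sketched in a few lines of plain Python. The three documents and the helper name `boolean_and` are hypothetical, chosen only to show the idea on a corpus small enough to check by hand.

```python
import re

# Hypothetical toy corpus
docs = {
    "Doc1": "The car had small doors.",
    "Doc2": "Machine learning needs data.",
    "Doc3": "This machine is a car engine.",
}

def tokenize(text):
    return set(re.findall(r"\b\w+\b", text.lower()))

doc_terms = {name: tokenize(text) for name, text in docs.items()}
vocabulary = sorted(set().union(*doc_terms.values()))

# incidence[doc][term] is 1 if the term occurs in the document, else 0
incidence = {
    name: {term: int(term in terms) for term in vocabulary}
    for name, terms in doc_terms.items()
}

def boolean_and(query):
    """Return the documents that contain every query term."""
    query_terms = query.lower().split()
    return [
        name for name, row in incidence.items()
        if all(row.get(term, 0) == 1 for term in query_terms)
    ]

print(boolean_and("machine car"))       # ['Doc3']
print(boolean_and("machine learning"))  # ['Doc2']
print(boolean_and("machine zebra"))     # []
```

One design choice to note: this sketch treats a term that is missing from the vocabulary as absent everywhere, so the query returns no documents. The `phrase_presence` function in the notebook instead drops unknown terms and matches on the remaining ones.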



## Step 5: Term Frequency (TF)
We compute both raw term frequencies and normalized TF values for a sample document. Normalized TF is the term count divided by total words in the document.


In [5]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# --- Create Binary Term-Document Matrix ---
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(documents)

# Create labeled DataFrame
incidence_matrix = pd.DataFrame(
    X.toarray(),
    index=[f"Doc{i+1}" for i in range(len(documents))],
    columns=vectorizer.get_feature_names_out()
)

print("🔎 Term-Document Incidence Matrix:")
display(incidence_matrix)

# --- Term Presence Checker (Boolean AND over query terms, not an exact phrase match) ---
def phrase_presence(query, matrix, vectorizer):
    terms = query.lower().split()
    # Query terms missing from the vocabulary are dropped, so the result may cover only some of the terms
    valid_terms = [term for term in terms if term in vectorizer.vocabulary_]
    
    if not valid_terms:
        return f"No terms from '{query}' found in vocabulary."
    
    # Keep the documents that contain every remaining query term
    doc_matches = matrix[valid_terms].all(axis=1)
    result = matrix[valid_terms][doc_matches]
    
    if result.empty:
        return f"No documents contain all terms from: '{query}'"
    else:
        return result

# Example query
phrase_result = phrase_presence("machine learning", incidence_matrix, vectorizer)
print("\n📌 Documents where all terms appear:")
display(phrase_result)


🔎 Term-Document Incidence Matrix:


Unnamed: 0,000,0320,0826,10,100,1000,1000yds,100k,10mb,11,...,yesterday,yet,yhwh,yo,york,you,young,your,yrs,zoom
Doc1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
Doc2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
Doc3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
Doc4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
Doc5,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
Doc6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
Doc7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Doc8,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,1,0,0
Doc9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Doc10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0



📌 Documents where all terms appear:


| | machine |
|---|---|
| Doc3 | 1 |
| Doc8 | 1 |


**Talking Point:**  
TF reveals the most frequent and possibly most relevant terms in a document. Normalizing the values supports better comparison across documents of varying lengths.
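
The effect of length normalization is easiest to see on two documents of different sizes. The token lists below are hypothetical; the sketch only assumes the definition given above (term count divided by the total number of tokens).

```python
from collections import Counter

def normalized_tf(tokens):
    """Return term count divided by the total number of tokens in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# Hypothetical documents: one short, one long
short_doc = ["car", "door", "car"]           # "car" appears 2 times out of 3 tokens
long_doc = ["car"] * 4 + ["engine"] * 16     # "car" appears 4 times out of 20 tokens

tf_short = normalized_tf(short_doc)
tf_long = normalized_tf(long_doc)

print(round(tf_short["car"], 3))  # 0.667
print(round(tf_long["car"], 3))   # 0.2
```

The long document has the higher raw count for "car" (4 versus 2), but after normalization the short document gives the term more weight, because "car" makes up a larger share of its tokens.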

---

## Step 6: Log Frequency Weighting
To limit the impact of very frequent terms, we apply log frequency weighting: a term with raw count tf > 0 receives the weight 1 + log10(tf), and an absent term receives 0. This transformation compresses large counts while preserving their ordering.


In [6]:
from collections import Counter

# Choose one document (e.g., Doc 1)
doc = documents[0]

# Tokenize it
tokens = tokenize(doc)

# Count raw term frequency
tf_raw = Counter(tokens)

# Total terms in doc
total_terms = len(tokens)

# Normalize TF values
tf_normalized = {term: count / total_terms for term, count in tf_raw.items()}

# Show raw TF
print("🔢 Raw Term Frequencies:")
display(pd.DataFrame(tf_raw.items(), columns=["Term", "Raw TF"]))

# Show normalized TF
print("\n📏 Normalized Term Frequencies:")
display(pd.DataFrame(tf_normalized.items(), columns=["Term", "Normalized TF"]))


🔢 Raw Term Frequencies:


| | Term | Raw TF |
|---|---|---|
| 0 | i | 3 |
| 1 | was | 4 |
| 2 | wondering | 1 |
| 3 | if | 2 |
| 4 | anyone | 2 |
| ... | ... | ... |
| 63 | funky | 1 |
| 64 | looking | 1 |
| 65 | please | 1 |
| 66 | e | 1 |



📏 Normalized Term Frequencies:


| | Term | Normalized TF |
|---|---|---|
| 0 | i | 0.032258 |
| 1 | was | 0.043011 |
| 2 | wondering | 0.010753 |
| 3 | if | 0.021505 |
| 4 | anyone | 0.021505 |
| ... | ... | ... |
| 63 | funky | 0.010753 |
| 64 | looking | 0.010753 |
| 65 | please | 0.010753 |
| 66 | e | 0.010753 |


**Talking Point:**  
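Log weighting reduces the dominance of highly repeated terms and stabilizes the input for statistical models. In Document 0, for example, “the” occurs 6 times but receives a weight of only 1.778, compared with 1.0 for a term that occurs once.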
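
The compression is easy to verify numerically. This short sketch assumes the weighting 1 + log10(tf) for tf > 0 and 0 otherwise; the helper name `log_tf_weight` is introduced here only for illustration.

```python
import math

def log_tf_weight(count):
    """Log frequency weight: 1 + log10(count) for count > 0, else 0."""
    return 1 + math.log10(count) if count > 0 else 0.0

for count in (0, 1, 4, 6, 10, 100, 1000):
    print(count, round(log_tf_weight(count), 6))
# 0 0.0
# 1 1.0
# 4 1.60206
# 6 1.778151
# 10 2.0
# 100 3.0
# 1000 4.0
```

A term that occurs 1000 times ends up with only four times the weight of a term that occurs once, while the ordering of the counts is preserved.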


## Step 7: Document Frequency (DF)
We count how many documents contain each term. Terms with high DF appear in many documents and are less useful for distinguishing content.


In [7]:
import numpy as np

# Using the raw TF from Step 5
log_weighted_tf = {
    term: 1 + np.log10(freq) if freq > 0 else 0
    for term, freq in tf_raw.items()
}

# Display results
log_tf_df = pd.DataFrame({
    "Term": tf_raw.keys(),
    "Raw TF": tf_raw.values(),
    "Log TF Weight": log_weighted_tf.values()
})

print("📉 Log Frequency Weighted Term Frequencies:")
display(log_tf_df.sort_values(by="Log TF Weight", ascending=False))


📉 Log Frequency Weighted Term Frequencies:


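| | Term | Raw TF | Log TF Weight |
|---|---|---|---|
| 14 | the | 6 | 1.778151 |
| 11 | this | 4 | 1.602060 |
| 1 | was | 4 | 1.602060 |
| 12 | car | 4 | 1.602060 |
| 18 | a | 3 | 1.477121 |
| ... | ... | ... | ... |
| 32 | doors | 1 | 1.000000 |
| 33 | were | 1 | 1.000000 |
| 35 | small | 1 | 1.000000 |
| 36 | in | 1 | 1.000000 |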


**Talking Point:**  
Document frequency highlights terms that are common vs. specific. Low DF terms tend to be more informative for identifying document topics.
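
Document frequency counts documents, not occurrences, which is a common source of confusion. The sketch below uses a hypothetical three-document corpus and converts each document to a set of tokens before counting, so repeated words inside one document are counted once.

```python
import re
from collections import Counter

# Hypothetical toy corpus
docs = [
    "the car has small doors",
    "the engine of the car",
    "the weather is cold",
]

def tokenize(text):
    return re.findall(r"\b\w+\b", text.lower())

document_frequency = Counter()
for text in docs:
    # set() ensures each term is counted at most once per document
    document_frequency.update(set(tokenize(text)))

print(document_frequency["the"])      # 3
print(document_frequency["car"])      # 2
print(document_frequency["weather"])  # 1
```

"the" appears four times in the corpus but has a document frequency of 3, because it is present in three documents.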


## Step 8: Inverse Document Frequency (IDF)
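We calculate IDF using the formula log10(N / df_t), where N is the number of documents in the corpus and df_t is the number of documents that contain term t. Rare terms receive higher scores, helping us focus on distinctive words.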


In [8]:
from sklearn.feature_extraction.text import CountVectorizer

# Re-use the full 20-document corpus
vectorizer_df = CountVectorizer()
X_df = vectorizer_df.fit_transform(documents)

# Extract terms and term-document matrix
terms = vectorizer_df.get_feature_names_out()
X_array = X_df.toarray()

# Document frequency: count how many docs each term appears in
df_counts = (X_array > 0).sum(axis=0)

# Format results
df_table = pd.DataFrame({
    "Term": terms,
    "Document Frequency (df_t)": df_counts
}).sort_values("Document Frequency (df_t)", ascending=False)

print("📊 Document Frequency Table:")
display(df_table.head(10))  # Top 10 by DF


📊 Document Frequency Table:


| | Term | Document Frequency (df_t) |
|---|---|---|
| 1373 | to | 18 |
| 1342 | the | 17 |
| 658 | have | 16 |
| 176 | and | 16 |
| 957 | of | 15 |
| 237 | be | 15 |
| 713 | in | 15 |
| 1358 | this | 15 |
| 758 | it | 13 |
| 1540 | you | 13 |


**Talking Point:**  
IDF downweights generic terms and emphasizes words that carry more discriminative power across the corpus.
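
As a worked example, the sketch below applies log10(N / df_t) to a few document frequencies reported in this notebook's output (N = 20; "to" = 18, "the" = 17, "you" = 13, "oil" = 1). The dictionary is typed in by hand for illustration rather than computed from the corpus.

```python
import math

N = 20  # number of documents in the workshop corpus

# Document frequencies copied from the tables in this notebook
document_frequency = {"to": 18, "the": 17, "you": 13, "oil": 1}

idf = {term: math.log10(N / df_t) for term, df_t in document_frequency.items()}

for term, value in idf.items():
    print(f"{term:>4}  df={document_frequency[term]:>2}  idf={value:.5f}")
```

A term found in 18 of 20 documents gets an IDF close to 0, while a term found in a single document gets log10(20), about 1.30103. A term that appeared in all 20 documents would get exactly 0 and so would contribute nothing to a TF-IDF score.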


## Step 9: TF-IDF Weighting
We compute the TF-IDF score for each term in each document using log-scaled TF multiplied by IDF. This forms the basis for vector-based search systems.


In [9]:
# Total number of documents
N = len(documents)

# Compute IDF
idf_values = np.log10(N / df_counts)

# Display as a DataFrame
idf_table = pd.DataFrame({
    "Term": terms,
    "Document Frequency (df_t)": df_counts,
    "IDF (log10(N / df_t))": idf_values
}).sort_values("IDF (log10(N / df_t))", ascending=False)

print("📉 Inverse Document Frequency Table:")
display(idf_table.head(10))  # Top 10 most informative (highest IDF)


📉 Inverse Document Frequency Table:


| | Term | Document Frequency (df_t) | IDF (log10(N / df_t)) |
|---|---|---|---|
| 0 | 000 | 1 | 1.30103 |
| 973 | optimize | 1 | 1.30103 |
| 963 | oklahoma | 1 | 1.30103 |
| 962 | ok | 1 | 1.30103 |
| 961 | oil | 1 | 1.30103 |
| 960 | office | 1 | 1.30103 |
| 959 | offer | 1 | 1.30103 |
| 956 | oct | 1 | 1.30103 |
| 955 | occurs | 1 | 1.30103 |
| 954 | obtained | 1 | 1.30103 |


**Talking Point:**  
TF-IDF captures both relevance within a document and uniqueness across the corpus. It's the foundation of classical IR, search ranking, and document classification.
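
The two factors can be combined on a count matrix small enough to check by hand. This is a minimal sketch on a hypothetical three-document, three-term matrix, assuming log-scaled TF and log10 IDF; absent terms are given a TF weight of 0 so that their TF-IDF is 0 as well.

```python
import numpy as np

# Hypothetical raw counts: rows are documents, columns are terms
terms = ["car", "the", "engine"]
counts = np.array([
    [2, 1, 0],
    [0, 1, 3],
    [0, 1, 0],
])

N = counts.shape[0]
df = (counts > 0).sum(axis=0)   # documents containing each term: [1, 3, 1]
idf = np.log10(N / df)          # "the" occurs in every document, so its IDF is 0

# Log-scaled TF, computed only where the term is present (avoids log10(0))
tf_log = np.zeros(counts.shape, dtype=float)
present = counts > 0
tf_log[present] = 1 + np.log10(counts[present])

tfidf = tf_log * idf
print(np.round(tfidf, 3))
```

Only two cells end up non-zero: "car" in the first document and "engine" in the second. The "the" column is zero everywhere because its IDF is 0, and every absent term stays at 0 because its TF weight is 0.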


## Manual TF-IDF and Cosine Similarity
We convert a query into a TF-IDF vector and calculate the cosine similarity between it and each document vector. This ranks documents by how closely their weighted terms match the query.


In [10]:
# Use CountVectorizer to get term frequencies
vectorizer_tfidf = CountVectorizer()
X_tfidf = vectorizer_tfidf.fit_transform(documents)
terms = vectorizer_tfidf.get_feature_names_out()
X_array = X_tfidf.toarray()

# Document frequencies
df = (X_array > 0).sum(axis=0)
idf = np.log10(N / df)

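# Apply log-scaled TF: 1 + log10(tf) where tf > 0, and 0 where the term is absent.
# (Adding 1 outside np.where would give every absent term a weight of 1.)
tf_log = np.zeros(X_array.shape)
present = X_array > 0
tf_log[present] = 1 + np.log10(X_array[present])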

# Compute TF-IDF
tfidf = tf_log * idf

# Final DataFrame
tfidf_df = pd.DataFrame(tfidf, columns=terms, index=[f"Doc{i+1}" for i in range(N)])

print("📊 TF-IDF Matrix (Rounded to 3 decimals):")
display(tfidf_df.round(3))


📊 TF-IDF Matrix (Rounded to 3 decimals):


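Note: the values below come from evaluating `1 + np.where(X_array > 0, np.log10(X_array), 0)`, which adds 1 to every cell, including cells for terms that are absent from a document. That is why no cell is 0 and most cells simply equal the term's IDF. With absent terms weighted 0, those cells would be 0.


| | 000 | 0320 | 0826 | 10 | 100 | 1000 | 1000yds | 100k | 10mb | 11 | ... | yesterday | yet | yhwh | yo | york | you | young | your | yrs | zoom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc1 | 1.301 | 1.301 | 1.301 | 0.824 | 1.0 | 1.0 | 1.301 | 1.301 | 1.301 | 1.301 | ... | 1.301 | 1.0 | 1.301 | 1.301 | 1.301 | 0.187 | 1.301 | 0.699 | 1.301 | 1.301 |
| Doc2 | 1.301 | 1.301 | 1.301 | 0.824 | 1.0 | 1.0 | 1.301 | 1.301 | 1.301 | 1.301 | ... | 1.301 | 1.0 | 1.301 | 1.301 | 1.301 | 0.187 | 1.301 | 0.699 | 1.301 | 1.301 |
| Doc3 | 1.301 | 1.301 | 1.301 | 0.824 | 1.0 | 1.0 | 1.301 | 1.301 | 1.301 | 1.301 | ... | 1.301 | 1.0 | 1.301 | 1.301 | 1.301 | 0.187 | 1.301 | 0.699 | 1.301 | 1.301 |
| Doc4 | 1.301 | 1.301 | 1.301 | 0.824 | 1.0 | 1.0 | 1.301 | 1.301 | 1.301 | 1.301 | ... | 1.301 | 1.0 | 1.301 | 1.301 | 1.301 | 0.187 | 1.301 | 0.699 | 1.301 | 1.301 |
| Doc5 | 1.301 | 1.301 | 1.301 | 0.824 | 1.0 | 1.0 | 1.301 | 1.301 | 1.301 | 1.301 | ... | 1.301 | 1.0 | 1.301 | 1.301 | 1.301 | 0.187 | 1.301 | 0.699 | 1.301 | 1.301 |
| Doc6 | 1.301 | 1.301 | 1.301 | 0.824 | 1.0 | 1.0 | 1.301 | 1.301 | 1.301 | 1.301 | ... | 1.301 | 1.0 | 1.301 | 1.301 | 1.301 | 0.243 | 1.301 | 0.699 | 1.301 | 1.301 |
| Doc7 | 1.301 | 1.301 | 1.301 | 0.824 | 1.0 | 1.0 | 1.301 | 1.301 | 1.301 | 1.301 | ... | 1.301 | 1.0 | 1.301 | 1.301 | 1.301 | 0.187 | 1.301 | 0.699 | 1.301 | 1.301 |
| Doc8 | 1.301 | 1.301 | 1.301 | 0.824 | 1.0 | 1.0 | 1.301 | 1.301 | 2.084 | 1.301 | ... | 1.301 | 1.0 | 1.301 | 1.301 | 1.301 | 0.243 | 1.301 | 0.699 | 1.301 | 1.301 |
| Doc9 | 1.301 | 1.301 | 1.301 | 0.824 | 1.0 | 1.0 | 1.301 | 1.301 | 1.301 | 1.301 | ... | 1.301 | 1.0 | 1.301 | 1.301 | 1.301 | 0.187 | 1.301 | 0.699 | 1.301 | 1.301 |
| Doc10 | 1.301 | 1.301 | 1.301 | 0.824 | 1.0 | 1.0 | 1.301 | 1.301 | 1.301 | 1.301 | ... | 1.301 | 1.0 | 1.301 | 1.301 | 1.301 | 0.187 | 1.301 | 0.699 | 1.301 | 1.301 |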


**Talking Point:**  
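Cosine similarity lets us turn the TF-IDF matrix into a simple ranked-retrieval engine: documents are ordered by the angle between their vector and the query vector, so the mix of weighted terms matters more than document length. The match is lexical rather than semantic, since a document scores only on terms it shares with the query.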
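
The ranking step can be sketched with NumPy on a handful of made-up vectors. The TF-IDF weights, the three-term vocabulary, and the helper name `cosine_similarity` below are hypothetical and serve only to show the calculation; they are not values from the workshop corpus.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical TF-IDF vectors over the vocabulary ["car", "engine", "weather"]
doc_vectors = {
    "Doc1": np.array([0.6, 0.0, 0.0]),
    "Doc2": np.array([0.0, 0.7, 0.2]),
    "Doc3": np.array([0.3, 0.3, 0.0]),
}

# Hypothetical query vector for the single-term query "car"
query_vector = np.array([1.0, 0.0, 0.0])

ranking = sorted(
    ((name, cosine_similarity(query_vector, vec)) for name, vec in doc_vectors.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranking:
    print(f"{name}: {score:.3f}")
# Doc1: 1.000
# Doc3: 0.707
# Doc2: 0.000
```

Doc1 ranks first because it points in exactly the same direction as the query, Doc3 shares one of its two terms with the query, and Doc2 shares none, so its similarity is 0.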

---

# Conclusion

In this lab, we implemented an NLP pipeline that walks through the foundational stages of Information Retrieval, from raw document collection to ranked term queries using TF-IDF and cosine similarity.

Each step—tokenization, normalization, frequency analysis, and weighting—contributes to building structured representations of unstructured text. These representations are essential for powering modern search engines, recommendation systems, and intelligent agents.

By exploring and applying key concepts such as Term Frequency, Document Frequency, IDF, and TF-IDF, we gained practical insight into how text can be converted into meaningful numerical data. We also extended this understanding by running phrase queries against the TF-IDF matrix to simulate how search engines rank documents based on relevance.

This end-to-end process prepares us for the next phase of NLP and IR development, where we transition into **vector space models**, **clustering**, and **deep learning-based language models**.

---

**Key Takeaway:**  
Building a robust IR system starts with mastering the basics—clean tokenization, thoughtful normalization, and weighted representation. These remain the building blocks for more advanced NLP systems in production today.
