# Assignment 4: Representation Robustness under Noise
## CNG463 - Introduction to Natural Language Processing
### METU NCC Computer Engineering | Fall 2025-26

**Student Name:**  
**Student ID:**  
**Due Date:** 26 December 2025 (Friday) before midnight

---

## Overview

This assignment focuses on:
1. Building **TF-IDF** and **word embedding-based** document representations
2. Implementing **noise injection** (token deletion, stopword manipulation)
3. Applying **PCA** for dimensionality reduction
4. Building a **cosine similarity-based information retrieval** system
5. Evaluating robustness using **Precision@k, Recall@k, and MAP**

**Note:** This assignment deliberately avoids deep learning, transformers, and large language models. You will use the BBC News Dataset with approximately 2,200 news articles across five topics.

**Grading:**
- Data Loading and Preprocessing: **8 pts**
- TF-IDF and Word2Vec Representation: **8 pts**
- Noise Injection (2 types): **14 pts**
- PCA Dimensionality Reduction: **20 pts**
- Information Retrieval System: **20 pts**
- Evaluation Metrics: **20 pts**
- Written Questions (5 × 2 pts): **10 pts**
- **Total: 100 pts**

---

## Pre-Submission Checklist

- [ ] Name and student ID at top
- [ ] No cells are added or removed
- [ ] All TODO sections completed
- [ ] All questions answered
- [ ] Code runs without errors
- [ ] Results tables included
- [ ] Run All before saving

## Setup and Imports

In [None]:
!pip install gensim

# Standard libraries
import numpy as np
import pandas as pd
import random
from collections import defaultdict, Counter
import os

# Scikit-learn for TF-IDF and PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# Gensim for Word2Vec
from gensim.models import Word2Vec

# NLTK for preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Matplotlib for PCA explained variance curve
import matplotlib.pyplot as plt

# Set random seed for reproducibility
seed = 42
np.random.seed(seed)
random.seed(seed)

---

# Task 1: Data Loading and Preprocessing (8 points)

Load the BBC News Dataset and apply standard preprocessing.

## 1.1: Load the BBC News Dataset (4 points)

Download the dataset and load all documents with their category labels.
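
The directory walk can be illustrated on a throwaway folder tree, as in the sketch below. It is not the required solution: the helper name `load_corpus`, the `*.txt` pattern, and decoding with `errors="replace"` are all assumptions made for this example, and the temporary directory only stands in for the real `bbc` folder.

```python
import tempfile
from pathlib import Path


def load_corpus(root):
    """Read every .txt file under root/<category>/ and return (documents, labels)."""
    documents, labels = [], []
    # Sort folders and files so the document order is the same on every run.
    for category_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for file_path in sorted(category_dir.glob("*.txt")):
            # errors="replace" is an assumption: it avoids crashes on stray non-UTF-8 bytes.
            documents.append(file_path.read_text(encoding="utf-8", errors="replace"))
            labels.append(category_dir.name)
    return documents, labels


# Build a tiny stand-in directory tree instead of using the real 'bbc' folder.
with tempfile.TemporaryDirectory() as tmp:
    for category, text in [("sport", "Late goal wins the cup."), ("tech", "New phone released.")]:
        folder = Path(tmp) / category
        folder.mkdir()
        (folder / "001.txt").write_text(text, encoding="utf-8")
    toy_documents, toy_labels = load_corpus(tmp)

print(toy_labels)
```

Using the folder name as the label keeps `documents` and `labels` aligned by index, which the later retrieval and evaluation steps rely on.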

In [None]:
# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Download BBC News Dataset
!wget -q http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip
!unzip -q bbc-fulltext.zip

# TODO: Load all documents from the BBC dataset
#
# Steps:
# 1. Navigate through the 'bbc' directory structure
# 2. For each category folder (business, entertainment, politics, sport, tech):
#    - Read all text files
#    - Store document text and category label
# 3. Create lists: documents (raw text) and labels (category names)
#
# Expected output:
# - documents: list of strings (raw document text)
# - labels: list of strings (category names)


print(f"Total documents loaded: {len(documents)}")
print(f"Categories: {set(labels)}")
print(f"\nCategory distribution:")
for cat in sorted(set(labels)):
    print(f"  {cat}: {labels.count(cat)}")

## 1.2: Preprocessing Function (2 points)

Implement a preprocessing function that applies tokenisation, lowercasing, and optional stopword removal.
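
A rough sketch of the same pipeline is shown below. It deliberately swaps in a regular expression for `word_tokenize` and a five-word hard-coded stopword set for NLTK's English list, so that it runs without any downloads; `toy_preprocess` and `TOY_STOPWORDS` are hypothetical names used only here.

```python
import re

# Tiny illustrative stopword set; the assignment uses NLTK's English list instead.
TOY_STOPWORDS = {"the", "over", "a", "an", "of"}


def toy_preprocess(text, remove_stopwords=False):
    """Lowercase, keep alphabetic runs only, optionally drop stopwords."""
    # Regex stand-in for word_tokenize followed by a str.isalpha() filter.
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    return tokens


sentence = "The quick brown fox jumps over the lazy dog!"
print(toy_preprocess(sentence))
print(toy_preprocess(sentence, remove_stopwords=True))
```

The regex and `word_tokenize` can disagree on contractions and hyphenated words, so this sketch should not be expected to reproduce the NLTK-based output exactly.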

In [None]:
# Recent NLTK releases also need the 'punkt_tab' resource for word_tokenize
nltk.download('punkt_tab')

def preprocess_text(text, remove_stopwords=False):
    """
    Preprocess text: tokenise, lowercase, and optionally remove stopwords.
    
    Args:
        text: Raw document text
        remove_stopwords: Whether to remove stopwords (default False)
    
    Returns:
        List of processed tokens
    """
    # TODO: Implement preprocessing
    #
    # Steps:
    # 1. Tokenise using word_tokenize
    # 2. Convert to lowercase
    # 3. Keep only alphabetic tokens (use str.isalpha())
    # 4. If remove_stopwords is True, filter out stopwords
    # 5. Return list of processed tokens
    
    pass

# Test preprocessing
sample_text = "The quick brown fox jumps over the lazy dog!"
print("Original:", sample_text)
print("Processed:", preprocess_text(sample_text))
print("Without stopwords:", preprocess_text(sample_text, remove_stopwords=True))

## 1.3: Apply Preprocessing to All Documents (2 points)

Preprocess all documents and store both tokenised and string versions.

In [None]:
# TODO: Preprocess all documents
#
# Create two versions:
# 1. processed_docs: list of token lists (for Word2Vec)
# 2. processed_docs_str: list of strings (for TF-IDF)
#
# Use remove_stopwords=False for now

print(f"Preprocessed {len(processed_docs)} documents")
print(f"\nExample preprocessed document (first 20 tokens):")
print(processed_docs[0][:20])

---

# Task 2: Representation (8 points)

Build document representations using TF-IDF and Word2Vec.

## 2.1: Build TF-IDF Vectors (2 points)

Create TF-IDF document vectors using scikit-learn.

In [None]:
# TODO: Build TF-IDF representation
#
# Steps:
# 1. Initialise TfidfVectorizer with appropriate parameters:
#    - min_df: minimum document frequency (try 2 or 3)
#    - max_df: maximum document frequency (try 0.8 or 0.9)
#    - ngram_range: experiment with (1,1) or (1,2)
# 2. Fit and transform the processed_docs_str
# 3. Store the resulting matrix as tfidf_matrix
# 4. Print shape and number of features

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"Number of features: {len(vectorizer.get_feature_names_out())}")

**Question 2.1:** Explain the TF-IDF parameters you chose (min_df, max_df, ngram_range) and why. (2 points, 2-3 sentences)

**[YOUR ANSWER HERE]**

## 2.2: Train Word2Vec Model (2 points)

Train a Word2Vec model on the BBC News corpus.

In [None]:
# TODO: Train Word2Vec model
#
# Steps:
# 1. Initialise Word2Vec with parameters:
#    - vector_size: embedding dimension (try 100 or 200)
#    - window: context window size (try 5)
#    - min_count: minimum word frequency (try 2)
#    - sg: 0 for CBOW, 1 for Skip-gram (try both)
#    - seed: use the seed variable for reproducibility (training is only deterministic with workers=1)
# 2. Train on processed_docs
# 3. Print vocabulary size

w2v_model = None  # YOUR CODE HERE

print(f"Word2Vec vocabulary size: {len(w2v_model.wv)}")
print(f"Vector dimension: {w2v_model.wv.vector_size}")

## 2.3: Create Document Embeddings (4 points)

Aggregate word vectors to create document embeddings using simple averaging.
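
The averaging step can be sketched without a trained model. In the example below a plain dictionary of made-up 3-dimensional vectors stands in for the Word2Vec vocabulary; `toy_vectors` and `toy_doc_embedding` are hypothetical names, and the numbers carry no meaning.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for trained word vectors.
toy_vectors = {
    "market": np.array([1.0, 0.0, 2.0]),
    "shares": np.array([3.0, 2.0, 0.0]),
}


def toy_doc_embedding(tokens, vectors, dim=3):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)


print(toy_doc_embedding(["market", "shares", "unknownword"], toy_vectors))
print(toy_doc_embedding(["unknownword"], toy_vectors))
```

Out-of-vocabulary tokens are simply skipped, so a document made up entirely of unknown words maps to the zero vector.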

In [None]:
def create_doc_embedding(tokens, w2v_model):
    """
    Create document embedding by averaging word vectors.
    
    Args:
        tokens: List of tokens in document
        w2v_model: Trained Word2Vec model
    
    Returns:
        Document embedding vector (numpy array)
    """
    # TODO: Implement document embedding
    #
    # Steps:
    # 1. Initialise empty list for word vectors
    # 2. For each token in the document:
    #    - If token is in model vocabulary, add its vector
    # 3. If no valid vectors found, return zero vector
    # 4. Otherwise, return mean of all word vectors
    
    pass

# TODO: Create document embeddings for all documents

# Convert to numpy array
doc_embeddings = np.array(doc_embeddings)

print(f"Document embeddings shape: {doc_embeddings.shape}")

---

# Task 3: Noise Injection (14 points)

Implement controlled noise injection to test representation robustness.

## 3.1: Random Token Deletion (4 points)

Implement random token deletion noise.
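
One way to picture per-token deletion is the sketch below. The name `toy_token_deletion` and the optional `rng` argument are assumptions added so the example is repeatable; the fallback to a single randomly chosen token is just one possible way to guarantee a non-empty result.

```python
import random


def toy_token_deletion(tokens, deletion_rate=0.1, rng=None):
    """Drop each token independently with probability deletion_rate."""
    rng = rng or random.Random()
    kept = [t for t in tokens if rng.random() > deletion_rate]
    # Guard against an empty document: fall back to one randomly chosen token.
    if not kept and tokens:
        kept = [rng.choice(tokens)]
    return kept


toy_tokens = "the quick brown fox jumps over the lazy dog".split()
print(toy_token_deletion(toy_tokens, 0.3, rng=random.Random(42)))
```

Because every token is tested independently, the share actually removed only approximates `deletion_rate`, and it can vary noticeably for short documents.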

In [None]:
def apply_token_deletion(tokens, deletion_rate=0.1):
    """
    Randomly delete tokens from document.
    
    Args:
        tokens: List of tokens
        deletion_rate: Probability of deleting each token (default 0.1, i.e. roughly 10% of tokens)
    
    Returns:
        List of tokens after deletion
    """
    # TODO: Implement token deletion
    #
    # Steps:
    # 1. For each token, generate random number
    # 2. Keep token only if random number > deletion_rate
    # 3. Ensure at least one token remains
    # 4. Return filtered tokens
    
    pass

# Test token deletion
sample_tokens = preprocess_text("The quick brown fox jumps over the lazy dog")
print("Original:", sample_tokens)
print("10% deletion:", apply_token_deletion(sample_tokens, 0.1))
print("30% deletion:", apply_token_deletion(sample_tokens, 0.3))

## 3.2: Stopword Manipulation (3 points)

Implement stopword removal as another noise type.

In [None]:
def apply_stopword_removal(tokens):
    """
    Remove all stopwords from document.
    
    Args:
        tokens: List of tokens
    
    Returns:
        List of tokens without stopwords
    """
    # TODO: Implement stopword removal
    #
    # Steps:
    # 1. Get English stopwords from NLTK
    # 2. Filter out tokens that are stopwords
    # 3. Ensure at least one token remains
    # 4. Return filtered tokens
    
    pass

# Test stopword removal
sample_tokens = preprocess_text("The quick brown fox jumps over the lazy dog")
print("Original:", sample_tokens)
print("Without stopwords:", apply_stopword_removal(sample_tokens))

## 3.3: Apply Noise to All Documents (7 points)

Create noisy versions of the dataset for evaluation.

In [None]:
# TODO: Create noisy versions of documents
#
# Create four versions:
# 1. noisy_docs_10: 10% token deletion
# 2. noisy_docs_20: 20% token deletion
# 3. noisy_docs_30: 30% token deletion
# 4. noisy_docs_stopword: stopword removal
#
# For each, store both token list and string versions

noisy_versions = {
    '10_del': {'tokens': [], 'strings': []},
    '20_del': {'tokens': [], 'strings': []},
    '30_del': {'tokens': [], 'strings': []},
    'stopword': {'tokens': [], 'strings': []}
}

# YOUR CODE HERE

print("Noisy versions created:")
for name, data in noisy_versions.items():
    print(f"  {name}: {len(data['tokens'])} documents")

---

# Task 4: PCA Dimensionality Reduction (20 points)

Apply PCA to reduce dimensionality of document representations.
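
The variance-threshold idea can be sketched on synthetic data as follows. The matrix `toy_X` is random and only stands in for a dense document matrix, and `components_for` is a hypothetical helper. A sparse TF-IDF matrix may first need converting to a dense array, depending on the installed scikit-learn version.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a dense document-by-feature matrix (200 docs, 30 features).
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 5))
toy_X = latent @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(200, 30))

# Fit a full PCA once, then read off the cumulative explained variance curve.
full_pca = PCA().fit(toy_X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)


def components_for(threshold, cumulative_variance):
    """Smallest number of components whose cumulative variance reaches threshold."""
    return int(np.searchsorted(cumulative_variance, threshold) + 1)


n80 = components_for(0.80, cumulative)
n90 = components_for(0.90, cumulative)

# Refit with the chosen dimensionality to obtain the reduced representation.
reduced_80 = PCA(n_components=n80).fit_transform(toy_X)
print(n80, n90, reduced_80.shape)
```

The `cumulative` array is also what would be plotted against the component index to obtain the explained variance curve.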

In [None]:
# TODO: PCA with variance-threshold-based dimensionality selection
#
# For each representation (TF-IDF and Word Embeddings):
# - Plot cumulative explained variance. (8 points)
# - Determine the minimum number of components required to retain
#   at least 80% and 90% of the variance. (4 points)
# - Apply PCA using these two dimensionalities. (4 points)
# - Report the exact value of explained variance and store the reduced representations. (4 points)


**Question 4.1:** How does PCA affect the information retained in TF-IDF compared to Word2Vec representations? Compare the number of components required to retain 80% and 90% of the variance. (2 points, 2-3 sentences)

**[YOUR ANSWER HERE]**

---

# Task 5: Information Retrieval System (20 points)

Build a cosine similarity-based retrieval system.

## 5.1: Define Queries (10 points)

Define 10-15 queries targeting different BBC News categories.

In [None]:
# TODO: Define queries
#
# Create a list of dictionaries with:
# - 'query': 2-5 content words
# - 'target_category': primary BBC News category
#
# Ensure queries cover all five categories

queries = [
    # Business queries
    {'query': 'stock market economy', 'target_category': 'business'},
    # Add more queries here
    # YOUR CODE HERE
]

print(f"Total queries defined: {len(queries)}")
print("\nQueries by category:")
for cat in ['business', 'entertainment', 'politics', 'sport', 'tech']:
    count = sum(1 for q in queries if q['target_category'] == cat)
    print(f"  {cat}: {count}")

## 5.2: Implement Retrieval Function (10 points)

Implement cosine similarity-based document retrieval.
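
The ranking step on its own can be sketched with a handful of made-up vectors. The values in `toy_doc_vectors` and `toy_query_vector` are arbitrary, and `toy_rank` is a hypothetical helper that covers only the similarity and top-k part, not the query preprocessing or vectorisation.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical document vectors (rows) and a query vector in the same space.
toy_doc_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query_vector = np.array([[1.0, 0.2, 0.0]])


def toy_rank(query_vector, doc_vectors, top_k=2):
    """Return indices of the top_k most similar documents, best first."""
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    # argsort is ascending, so negate the scores to rank from highest to lowest.
    return np.argsort(-scores, kind="stable")[:top_k].tolist()


print(toy_rank(toy_query_vector, toy_doc_vectors, top_k=3))
```

Document 1 ranks above document 0 here because cosine similarity depends on direction rather than magnitude, and the query points slightly towards the second axis.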

In [None]:
def retrieve_documents(query_text, doc_vectors, vectorizer=None, w2v_model=None, top_k=10):
    """
    Retrieve top-k documents for a query using cosine similarity.
    
    Args:
        query_text: Query string
        doc_vectors: Document representation matrix
        vectorizer: TfidfVectorizer (if using TF-IDF)
        w2v_model: Word2Vec model (if using embeddings)
        top_k: Number of documents to retrieve
    
    Returns:
        List of document indices ranked by similarity
    """
    # TODO: Implement retrieval
    #
    # Steps:
    # 1. Preprocess query text
    # 2. Convert query to same representation as documents:
    #    - If vectorizer: use vectorizer.transform
    #    - If w2v_model: use create_doc_embedding
    # 3. Compute cosine similarity with all documents
    # 4. Get top-k document indices by similarity
    # 5. Return ranked indices
    
    pass

---

# Task 6: Evaluation Metrics (20 points)

Implement and compute IR evaluation metrics.

## 6.1: Implement Evaluation Metrics (8 points)

Implement Precision@k, Recall@k, and MAP.
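
A small worked example may make the three metrics concrete. The sketch below uses hypothetical `toy_` function names and a made-up ranking. Its average precision divides by the number of relevant documents found in the retrieved list; dividing by the total number of relevant documents is another common convention and gives lower values when the list is short.

```python
def toy_precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / k


def toy_recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)


def toy_average_precision(retrieved, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, precisions = 0, []
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def toy_map(query_results):
    """Mean of the per-query average precision values."""
    if not query_results:
        return 0.0
    return sum(toy_average_precision(r, rel) for r, rel in query_results) / len(query_results)


ranking, relevant_set = [3, 7, 1, 9, 4], {3, 1, 4, 8}
print(toy_precision_at_k(ranking, relevant_set, 5))   # 3 hits in 5 -> 0.6
print(toy_recall_at_k(ranking, relevant_set, 5))      # 3 of 4 relevant -> 0.75
print(toy_average_precision(ranking, relevant_set))   # (1/1 + 2/3 + 3/5) / 3
```

In this ranking the relevant documents sit at ranks 1, 3, and 5, so the precision values being averaged are 1/1, 2/3, and 3/5.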

In [None]:
def precision_at_k(retrieved_docs, relevant_docs, k):
    """
    Calculate Precision@k.
    
    Args:
        retrieved_docs: List of retrieved document indices
        relevant_docs: Set of relevant document indices
        k: Number of top documents to consider
    
    Returns:
        Precision@k score
    """
    # TODO: Implement Precision@k
    pass

def recall_at_k(retrieved_docs, relevant_docs, k):
    """
    Calculate Recall@k.
    
    Args:
        retrieved_docs: List of retrieved document indices
        relevant_docs: Set of relevant document indices
        k: Number of top documents to consider
    
    Returns:
        Recall@k score
    """
    # TODO: Implement Recall@k
    pass

def average_precision(retrieved_docs, relevant_docs):
    """
    Calculate Average Precision.
    
    Args:
        retrieved_docs: List of retrieved document indices
        relevant_docs: Set of relevant document indices
    
    Returns:
        Average Precision score
    """
    # TODO: Implement Average Precision
    #
    # For each relevant document in retrieved list:
    # - Calculate precision at that position
    # - Average all precision values
    pass

def mean_average_precision(queries_results):
    """
    Calculate Mean Average Precision across all queries.
    
    Args:
        queries_results: List of (retrieved_docs, relevant_docs) tuples
    
    Returns:
        MAP score
    """
    # TODO: Implement MAP
    pass

## 6.2: Run Complete Evaluation (12 points)

Evaluate all representations (clean and noisy) using all metrics.
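
One possible shape for the results table is sketched below. Everything in it is a stand-in: the labels, the second query, the hard-coded rankings, and the column names are invented for illustration, and only a single representation and noise condition is shown.

```python
import pandas as pd

toy_labels = ["business", "sport", "business", "tech", "sport"]
toy_queries = [
    {"query": "stock market economy", "target_category": "business"},
    {"query": "football match result", "target_category": "sport"},
]
# Hypothetical rankings; in the notebook these would come from the retrieval step.
toy_rankings = {"stock market economy": [0, 1, 2], "football match result": [4, 3, 1]}

rows = []
for q in toy_queries:
    # A document counts as relevant when its label equals the query's target category.
    relevant = {i for i, label in enumerate(toy_labels) if label == q["target_category"]}
    top = toy_rankings[q["query"]]
    hits = sum(1 for d in top if d in relevant)
    rows.append({
        "representation": "toy",
        "noise": "clean",
        "query": q["query"],
        "precision@3": hits / 3,
        "recall@3": hits / len(relevant),
    })

results = pd.DataFrame(rows)
# Average over queries to get one row per (representation, noise) combination.
summary = results.groupby(["representation", "noise"])[["precision@3", "recall@3"]].mean()
print(summary)
```

Keeping one row per query and aggregating afterwards makes it easy to add further representations and noise conditions as extra rows rather than extra tables.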

In [None]:
# TODO: Complete evaluation framework
#
# For each representation type:
# - TF-IDF (original, PCA-80%, PCA-90%)
# - Word Embeddings (original, PCA-80%, PCA-90%)
#
# For each noise condition:
# - Clean (no noise)
# - 10% deletion
# - 20% deletion
# - 30% deletion
# - Stopword removal
#
# For each query:
# - Retrieve top-10 documents
# - Identify relevant documents (same category)
# - Calculate Precision@5, Recall@5, AP
#
# Aggregate results and compute MAP
#
# Store results in structured format (e.g., DataFrame)

# YOUR CODE HERE

# Display results table
print("="*80)
print("EVALUATION RESULTS")
print("="*80)
# Display your results here

## 6.3: Analysis Questions

Answer the following questions based on your results.

**Question 6.1:** Which representation (TF-IDF or word embeddings) is more robust to token deletion noise? Support your answer with specific metric values. (2 points, 2-3 sentences)

**[YOUR ANSWER HERE]**

**Question 6.2:** How does stopword removal affect retrieval performance compared to token deletion? Which noise type is more damaging? (2 points, 2-3 sentences)

**[YOUR ANSWER HERE]**

**Question 6.3:** Does PCA improve or hurt robustness to noise? Compare performance with and without PCA under noisy conditions. (2 points, 2-3 sentences)

**[YOUR ANSWER HERE]**

---

# Convert Your Colab Notebook to PDF

### Step 1: Download Your Notebook
- Go to **File → Download → Download .ipynb**
- Save the file to your computer

### Step 2: Upload to Colab
- Click the **📁 folder icon** on the left sidebar
- Click the **upload button**
- Select your downloaded .ipynb file
- Wait for the upload to complete

### Step 3: Run the Code Below
- **Uncomment the code in the cell below**, then run the cell
- This will take about 1-2 minutes to install required packages

### Step 4: Enter Notebook Name
- When prompted, type your notebook name (e.g. `gs_000000_as4.ipynb`)
- Press Enter

### The PDF will automatically download to your computer

In [None]:
# # Install required packages (this takes about 30 seconds)
# print("Installing PDF converter... please wait...")
# !apt-get update -qq
# !apt-get install -y texlive-xetex texlive-fonts-recommended texlive-plain-generic pandoc > /dev/null 2>&1
# !pip install -q nbconvert

# print("\n" + "="*50)
# print("COLAB NOTEBOOK TO PDF CONVERTER")
# print("="*50)
# print("\nSTEP 1: Download your notebook")
# print("- Go to File → Download → Download .ipynb")
# print("- Save it to your computer")
# print("\nSTEP 2: Upload it here")
# print("- Click the folder icon on the left (📁)")
# print("- Click the upload button and select your .ipynb file")
# print("- Wait for upload to complete")
# print("\nSTEP 3: Enter the filename below")
# print("="*50)

# # Get notebook name from user
# notebook_name = input("\nEnter your notebook name: ")

# # Add .ipynb if missing
# if not notebook_name.endswith('.ipynb'):
#     notebook_name += '.ipynb'

# import os
# notebook_path = f'/content/{notebook_name}'

# # Check if file exists
# if not os.path.exists(notebook_path):
#     print(f"\n⚠ Error: '{notebook_name}' not found in /content/")
#     print("\nMake sure you uploaded the file using the folder icon (📁) on the left!")
# else:
#     print(f"\n✓ Found {notebook_name}")
#     print("Converting to PDF... this may take 1-2 minutes...\n")

#     # Convert the notebook to PDF
#     !jupyter nbconvert --to pdf "{notebook_path}"

#     # Download the PDF
#     from google.colab import files
#     pdf_name = notebook_name.replace('.ipynb', '.pdf')
#     pdf_path = f'/content/{pdf_name}'

#     if os.path.exists(pdf_path):
#         print("✓ SUCCESS! Downloading your PDF now...")
#         files.download(pdf_path)
#         print("\n✓ Done! Check your downloads folder.")
#     else:
#         print("⚠ Error: Could not create PDF")