---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: AI Engineering

### 📋 **Homework 4**: Embeddings & Semantic Search

### 📅 **Due Date**: Day of Lecture 5, 11:59 PM


**Note**: You may not share the contents of this notebook with anyone outside this class without written permission from the professor.

---

In this homework, you'll build on Homework 3 (BM25 search) by adding **embedding-based semantic search**.

You will:
1. **Generate embeddings** using both local (Hugging Face) and API (OpenAI) models
2. **Implement cosine similarity** from scratch
3. **Implement semantic search** from scratch
4. **Compare BM25 vs semantic search** using Recall@k
5. **Compare different embedding models** and analyze their differences

**Total Points: 95**

---

## Instructions

- Complete all tasks by filling in code where you see `# YOUR CODE HERE`
- You may use ChatGPT, Claude, documentation, Stack Overflow, etc.
- When using external resources, briefly cite them in a comment
- Run all cells before submitting to ensure they work

**Submission:**
1. Create a branch called `homework-4`
2. Commit and push your work
3. Create a PR and merge to main
4. Submit the `.ipynb` file on Blackboard

---

## Task 1: Environment Setup (10 points)

### 1a. Imports (5 pts)

Import the required libraries and load the WANDS data.

In [1]:
import os
from pathlib import Path

print("CWD:", os.getcwd())
print("Here:", Path(".").resolve())
print("Parent:", Path("..").resolve())


CWD: c:\Users\mohit\ai-engineering-fordham\ce\homework
Here: C:\Users\mohit\ai-engineering-fordham\ce\homework
Parent: C:\Users\mohit\ai-engineering-fordham\ce


In [2]:
from pathlib import Path

p = (Path("..") / "scripts" / "helpers.py").resolve()
print("helpers should be at:", p)
print("exists?", p.exists())


helpers should be at: C:\Users\mohit\ai-engineering-fordham\ce\scripts\helpers.py
exists? True


In [3]:
from pathlib import Path

base = Path.cwd().resolve().parents[1]  # repo root, two levels above ce/homework
print("Base:", base)

found = list(base.rglob("helpers.py"))
print("Found count:", len(found))
for p in found[:20]:
    print(p)


Base: C:\Users\mohit\ai-engineering-fordham
Found count: 7
C:\Users\mohit\ai-engineering-fordham\ce\scripts\helpers.py
C:\Users\mohit\ai-engineering-fordham\.venv\Lib\site-packages\aiohttp\helpers.py
C:\Users\mohit\ai-engineering-fordham\.venv\Lib\site-packages\pyparsing\helpers.py
C:\Users\mohit\ai-engineering-fordham\.venv\Lib\site-packages\setuptools\tests\integration\helpers.py
C:\Users\mohit\ai-engineering-fordham\.venv\Lib\site-packages\jedi\api\helpers.py
C:\Users\mohit\ai-engineering-fordham\.venv\Lib\site-packages\jedi\inference\helpers.py
C:\Users\mohit\ai-engineering-fordham\.venv\Lib\site-packages\aiohttp\_websocket\helpers.py


In [4]:
# ruff: noqa: E402

# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings("ignore")

# Import ONLY data loading from helpers
import sys
sys.path.append('../scripts')
from helpers import load_wands_products, load_wands_queries, load_wands_labels

# Embedding libraries - we use these directly
from sentence_transformers import SentenceTransformer
import litellm

# Load environment variables for API keys
from dotenv import load_dotenv
load_dotenv()

pd.set_option('display.max_colwidth', 80)
print("All imports successful!")

All imports successful!


In [5]:
# Load the WANDS dataset
products = load_wands_products()
queries = load_wands_queries()
labels = load_wands_labels()

print(f"Products: {len(products):,}")
print(f"Queries: {len(queries):,}")
print(f"Labels: {len(labels):,}")

Products: 42,994
Queries: 480
Labels: 233,448


### 1b. Copy BM25 functions from HW3 (5 pts)

Copy your BM25 implementation from Homework 3. We'll use it to compare against semantic search.

In [None]:
# Sanity check: verify the API key works before the embedding tasks
response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say 'API working!' and nothing else."}],
    max_tokens=20
)
print(response.choices[0].message.content)


API working!


In [None]:
# Data loading functions (provided)
# Note: Data from WANDS (Wayfair ANnotation Dataset)
# Source: https://github.com/wayfair/WANDS

def load_wands_products(data_dir: str = "../data") -> pd.DataFrame:
    """
    Load WANDS products from local file.
    
    Args:
        data_dir: Path to the data directory containing wayfair-products.csv
        
    Returns:
        DataFrame with product information including product_id, product_name,
        product_class, category_hierarchy, product_description, etc.
    """
    filepath = Path(data_dir) / "wayfair-products.csv"
    products = pd.read_csv(filepath, sep='\t')
    products = products.rename(columns={'category hierarchy': 'category_hierarchy'})
    return products

def load_wands_queries(data_dir: str = "../data") -> pd.DataFrame:
    """
    Load WANDS queries from local file.
    
    Args:
        data_dir: Path to the data directory containing wayfair-queries.csv
        
    Returns:
        DataFrame with query_id and query columns
    """
    filepath = Path(data_dir) / "wayfair-queries.csv"
    queries = pd.read_csv(filepath, sep='\t')
    return queries

def load_wands_labels(data_dir: str = "../data") -> pd.DataFrame:
    """
    Load WANDS relevance labels from local file.
    
    Args:
        data_dir: Path to the data directory containing wayfair-labels.csv
        
    Returns:
        DataFrame with query_id, product_id, label (Exact/Partial/Irrelevant),
        and grade (2/1/0) columns
    """
    filepath = Path(data_dir) / "wayfair-labels.csv"
    labels = pd.read_csv(filepath, sep='\t')
    grade_map = {'Exact': 2, 'Partial': 1, 'Irrelevant': 0}
    labels['grade'] = labels['label'].map(grade_map)
    return labels

print("Loading functions defined!")

Loading functions defined!


In [None]:
import Stemmer
stemmer = Stemmer.Stemmer("english")


In [None]:
# Stemming tokenizer carried over from HW3
import string  # needed below for string.punctuation

stemmer = Stemmer.Stemmer("english")

def stemming_tokenize(text: str):
    # 1. lowercase
    text = text.lower()
    
    # 2. remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    
    # 3. split into tokens
    tokens = text.split()
    
    # 4. stem tokens
    tokens = stemmer.stemWords(tokens)
    
    return tokens



In [None]:
products = load_wands_products()


In [None]:
import string


In [None]:
import math
from collections import Counter, defaultdict


In [None]:
import re


In [None]:
# --- Task 1b: BM25 functions from HW3 (self-contained cell) ---

import string
import numpy as np
import pandas as pd
from collections import Counter
import Stemmer  # PyStemmer

# Stemmer + punctuation translation
stemmer = Stemmer.Stemmer("english")
punct_trans = str.maketrans({key: " " for key in string.punctuation})

def snowball_tokenize(text: str) -> list[str]:
    """Tokenize text with punctuation removal + lowercasing + stemming."""
    if pd.isna(text) or text is None:
        return []
    text = str(text).translate(punct_trans)
    tokens = text.lower().split()
    return [stemmer.stemWord(token) for token in tokens]

def build_index(docs: list[str], tokenizer) -> tuple[dict, list[int]]:
    """
    Build an inverted index from a list of documents.
    Returns (index, doc_lengths).
    """
    index: dict[str, dict[int, int]] = {}
    doc_lengths: list[int] = []

    for doc_id, doc in enumerate(docs):
        tokens = tokenizer(doc)
        doc_lengths.append(len(tokens))
        term_counts = Counter(tokens)

        for term, count in term_counts.items():
            if term not in index:
                index[term] = {}
            index[term][doc_id] = count

    return index, doc_lengths

def get_df(term: str, index: dict) -> int:
    """Document frequency of term."""
    return len(index.get(term, {}))

def bm25_idf(df: int, num_docs: int) -> float:
    """BM25 IDF."""
    return np.log((num_docs - df + 0.5) / (df + 0.5) + 1)

def bm25_tf(tf: int, doc_len: int, avg_doc_len: float, k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 TF normalization."""
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

def score_bm25(
    query: str,
    index: dict,
    num_docs: int,
    doc_lengths: list[int],
    tokenizer,
    k1: float = 1.2,
    b: float = 0.75
) -> np.ndarray:
    """Return BM25 scores for all docs."""
    query_tokens = tokenizer(query)
    scores = np.zeros(num_docs)
    avg_doc_len = float(np.mean(doc_lengths)) if doc_lengths else 1.0

    for token in query_tokens:
        df = get_df(token, index)
        if df == 0:
            continue

        idf = bm25_idf(df, num_docs)

        postings = index.get(token, {})
        for doc_id, tf in postings.items():
            tf_norm = bm25_tf(tf, doc_lengths[doc_id], avg_doc_len, k1, b)
            scores[doc_id] += idf * tf_norm

    return scores

def search_products(
    query: str,
    products_df: pd.DataFrame,
    index: dict,
    doc_lengths: list[int],
    tokenizer,
    k: int = 10
) -> pd.DataFrame:
    """Return top-k rows from products_df with BM25 scores."""
    scores = score_bm25(query, index, len(products_df), doc_lengths, tokenizer)
    top_k_idx = np.argsort(-scores)[:k]

    results = products_df.iloc[top_k_idx].copy()
    results["score"] = scores[top_k_idx]
    results["rank"] = range(1, len(results) + 1)
    return results

# --- Wrapper to match HW4 expected function name/signature ---
def search_bm25(
    query: str,
    index: dict,
    products_df: pd.DataFrame,
    doc_lengths: list[int],
    tokenizer=snowball_tokenize,
    k: int = 10
) -> pd.DataFrame:
    return search_products(query, products_df, index, doc_lengths, tokenizer, k)

print("BM25 functions ready: snowball_tokenize, build_index, score_bm25, search_bm25")


BM25 functions ready: snowball_tokenize, build_index, score_bm25, search_bm25


In [None]:
# Task 1b: Build BM25 index + run a test query (auto-picks text column)

# 0) Pick the best text column automatically
cols = products.columns.tolist()
candidates = [c for c in cols if any(k in c.lower() for k in ["title", "name", "desc", "text"])]
print("Text column candidates:", candidates)

if not candidates:
    raise ValueError(f"No obvious text column found. Available columns: {cols}")

TEXT_COL = candidates[0]  # pick the first reasonable match
print("Using TEXT_COL =", TEXT_COL)

# 1) Extract docs
docs = products[TEXT_COL].fillna("").astype(str).tolist()

# 2) Build index
index, doc_lengths = build_index(docs, snowball_tokenize)

# 3) Test query
q = queries.iloc[0]["query"] if hasattr(queries, "iloc") else queries[0]
results = search_bm25(q, index, products, doc_lengths, k=10)

results.head()


In [None]:
# 1. Extract the text column (WANDS uses product_name; there is no product_title column)
docs = products["product_name"].fillna("").astype(str).tolist()

# 2. Build index
index, doc_lengths = build_index(docs, snowball_tokenize)

# 3. Test query
q = queries.iloc[0]["query"]
results = search_bm25(q, index, products, doc_lengths, k=10)

results.head()


In [28]:
from helpers import build_index, search_bm25


In [31]:
q = queries.iloc[0]["query"] if hasattr(queries, "iloc") else queries[0]
# search_bm25 also requires the products DataFrame and the document lengths
results = search_bm25(q, index, products_df=products, doc_lengths=doc_lengths, k=10)
results[:3]


---

## Task 2: Understanding Embeddings (15 points)

### 2a. Load a local model and generate embeddings (5 pts)

Use `sentence-transformers` to load a local embedding model and generate embeddings for a list of short product phrases.

In [None]:
# Load the all-MiniLM-L6-v2 model using SentenceTransformer
# Then generate embeddings for each phrase in the `words` list
words = ["wooden coffee table", "oak dining table", "red leather sofa", "blue area rug", "kitchen sink"]

# YOUR CODE HERE

# Print the number of embeddings you generated and the dimension of the embeddings

### 2b. Implement cosine similarity and create a similarity matrix (5 pts)

Implement cosine similarity from scratch:

$$\text{cosine\_similarity}(a, b) = \frac{a \cdot b}{\|a\| \times \|b\|}$$

In [None]:
# Implement cosine similarity from scratch

# Create similarity matrix

# Display as DataFrame
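
As a reference point for the formula above, here is a minimal pure-Python sketch of cosine similarity and a pairwise similarity matrix. The toy vectors are made up for illustration and do not come from any embedding model; returning 0.0 for a zero vector is a convention chosen here, not something the assignment prescribes.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (plain Python lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: a zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Pairwise cosine similarities, returned as a list of rows."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy 3-dimensional "embeddings" (hypothetical values, not model output)
toy = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
matrix = similarity_matrix(toy)
```

With real embeddings the same matrix can be wrapped in a `pd.DataFrame` whose index and columns are the input phrases, which makes the pairwise values easier to read.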


### 2c. Embed using OpenAI API (5 pts)

Use `litellm` to get embeddings from OpenAI's API and compare dimensions.

In [None]:
# Use litellm to get an embedding from OpenAI's text-embedding-3-small model
# Compare the dimension with the local model

---

## Task 3: Batch Embedding Products (20 points)

### 3a. Embed a product sample (10 pts)

Create a combined text field and embed 5,000 products using the local model.

In [None]:
# Get a consistent sample


In [None]:
# Create a combined text field (product_name + product_class)
# Then embed all products using model.encode()

# YOUR CODE HERE


### 3b. Save and load embeddings (5 pts)

Save embeddings to a `.npy` file so you don't have to recompute them.

In [None]:
# Save embeddings to ../temp/hw4_embeddings.npy
# Save products_sample to ../temp/hw4_products.csv
# Then load them back and verify they match
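
The save/load round trip can be sketched on a small random array first. This example writes to a temporary directory rather than `../temp/`, and the array shape and file name are placeholders chosen for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np

# Stand-in for real embeddings: 4 vectors of dimension 8 (made-up data)
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(4, 8)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_embeddings.npy"  # hypothetical file name
    np.save(path, toy_embeddings)
    loaded = np.load(path)

# Verify that shape, dtype, and values all survive the round trip
roundtrip_ok = (
    loaded.shape == toy_embeddings.shape
    and loaded.dtype == toy_embeddings.dtype
    and np.array_equal(loaded, toy_embeddings)
)
```

For the products sample, reloading the CSV and comparing `product_id` order against the rows of the embedding array is a useful extra check, since the two files are only meaningful if their rows stay aligned.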

### 3c. Cost estimation (5 pts)

Estimate the cost to embed all 43K products using OpenAI's API.

**Pricing**: text-embedding-3-small costs ~$0.02 per 1 million tokens.

In [None]:
# Use tiktoken to count actual tokens in the sample
# Then extrapolate to estimate cost for the full dataset
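
The extrapolation itself is simple arithmetic, sketched below. The sample token count of 60,000 is a made-up placeholder, not a measured value; in the actual task it would come from counting tokens with `tiktoken`. The function name and parameters are hypothetical.

```python
def estimate_embedding_cost(sample_tokens, sample_size, total_items, price_per_million_tokens):
    """Extrapolate a sample token count to the full dataset and price it."""
    avg_tokens_per_item = sample_tokens / sample_size
    estimated_total_tokens = avg_tokens_per_item * total_items
    cost = estimated_total_tokens / 1_000_000 * price_per_million_tokens
    return estimated_total_tokens, cost


# Hypothetical input: 60,000 tokens counted across a 5,000-product sample,
# extrapolated to the 42,994 products in WANDS at $0.02 per 1M tokens.
total_tokens, cost = estimate_embedding_cost(60_000, 5_000, 42_994, 0.02)
```

Under these assumed numbers the estimate is roughly 516K tokens, or about one cent. The real figure depends on the measured token count and on which text fields are embedded.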


---

## Task 4: Semantic Search (25 points)

### 4a. Implement semantic search (15 pts)

Implement a semantic search function from scratch.

In [None]:
# Implement batch cosine similarity for efficiency


In [None]:
# Implement semantic search


In [None]:
# Test semantic search
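
One way to structure the search step is sketched below on toy 2-dimensional vectors: score every document against the query in a single matrix operation, then take the indices of the top-k scores. The function names are hypothetical, and the toy arrays stand in for real embeddings; in the actual task the query vector would come from `model.encode()` and the returned indices would select rows of the products sample.

```python
import numpy as np


def batch_cosine_similarity(query_vec, doc_matrix):
    """Cosine similarity between one query vector and every row of doc_matrix."""
    dots = doc_matrix @ query_vec
    denom = np.linalg.norm(query_vec) * np.linalg.norm(doc_matrix, axis=1)
    # Guard against zero vectors: their similarity is left at 0.
    return np.divide(dots, denom, out=np.zeros_like(dots, dtype=float), where=denom > 0)


def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    k = min(k, len(scores))
    return np.argsort(-scores)[:k]


# Toy data (made up): four "documents" and one "query" in 2 dimensions
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([1.0, 0.1])

scores = batch_cosine_similarity(toy_query, toy_docs)
ranking = top_k_indices(scores, k=2)
```

Here the first document points almost exactly along the query and ranks first, the diagonal document ranks second, and the document pointing the opposite way gets a negative score.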

### 4b. Evaluate and compare BM25 vs semantic search (10 pts)

Implement Recall@k and compare the two search methods.

In [None]:
# Implement Recall@k
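
A minimal sketch of Recall@k, assuming the retrieved results and the relevant set are both given as product IDs. Which grades count as relevant (Exact only, or Exact plus Partial) is a choice left to the evaluation; returning 0.0 when a query has no relevant products is an assumed convention, and skipping such queries is an equally reasonable alternative.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of relevant items that appear in the top-k retrieved items."""
    relevant = set(relevant_ids)
    if not relevant:
        # Assumed convention: no relevant items means recall is reported as 0.
        return 0.0
    hits = set(retrieved_ids[:k]) & relevant
    return len(hits) / len(relevant)


# Toy example with made-up product IDs
retrieved = [7, 3, 9, 1]
relevant = [3, 1, 5]
recall_top3 = recall_at_k(retrieved, relevant, k=3)
recall_top4 = recall_at_k(retrieved, relevant, k=4)
```

In the toy example only product 3 is found in the top 3, giving 1/3; extending to the top 4 also finds product 1, giving 2/3.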


In [None]:
# Build BM25 index for comparison

# Filter queries to those with products in our sample


In [None]:
# Evaluate both BM25 and semantic search on all queries
# Calculate Recall@10 for each method

In [None]:
# Visualize comparison


---

## Task 5: Compare Embedding Models (20 points)

### 5a. Embed products with two different models (10 pts)

Compare embeddings from:
- `BAAI/bge-base-en-v1.5`
- `sentence-transformers/all-mpnet-base-v2`

In [None]:
# Load the two embedding models

In [None]:
# Embed products with both models


### 5b. Compare search results between models (10 pts)

Evaluate both models on the same queries and analyze differences.

In [None]:
# Compare results for specific queries
test_queries = ["comfortable sofa", "star wars rug", "modern coffee table"]
# add more!

In [None]:
# Visualize model comparison with a scatter plot
# X-axis: BGE Recall@10, Y-axis: MPNet Recall@10
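
Alongside the scatter plot, a per-query summary can make the comparison concrete. The sketch below assumes two equal-length lists of per-query Recall@10 values, one per model; the function name is hypothetical and the numbers are invented purely to show the mechanics, not measured results.

```python
def compare_recalls(recalls_a, recalls_b):
    """Summarize two aligned lists of per-query recall values."""
    if len(recalls_a) != len(recalls_b) or not recalls_a:
        raise ValueError("Expected two non-empty lists of equal length")
    n = len(recalls_a)
    pairs = list(zip(recalls_a, recalls_b))
    return {
        "mean_a": sum(recalls_a) / n,
        "mean_b": sum(recalls_b) / n,
        "a_wins": sum(1 for a, b in pairs if a > b),
        "b_wins": sum(1 for a, b in pairs if b > a),
        "ties": sum(1 for a, b in pairs if a == b),
    }


# Made-up per-query Recall@10 values for four queries
toy_bge = [0.5, 0.2, 1.0, 0.0]
toy_mpnet = [0.4, 0.2, 0.5, 0.3]
summary = compare_recalls(toy_bge, toy_mpnet)
```

Queries where one model wins by a large margin are good candidates for the manual inspection in the `test_queries` cell.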


---

## Task 6: Git Submission (5 points)

Submit your work using the Git workflow:

- [ ] Create a new branch called `homework-4`
- [ ] Commit your work with a meaningful message
- [ ] Push to GitHub
- [ ] Create a Pull Request
- [ ] Merge the PR to main
- [ ] Submit the `.ipynb` file on Blackboard

The TA will verify your submission by checking the merged PR on GitHub.