<a href="https://colab.research.google.com/github/nhibb262/-ISYS574-ML-Group-Project/blob/main/Notebook/05_model_comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 05 - Model Comparison

**Author:** [Your Name]  
**Date:** [YYYY-MM-DD]  
**Purpose:** Compare TF-IDF vs CountVectorizer vs Semantic Embeddings

---

## Course Requirement
> "Implement at least two different modeling approaches for comparison"

This notebook compares:
1. **TF-IDF** (our primary model)
2. **CountVectorizer** (simpler baseline)
3. **Sentence Transformers** (semantic embeddings - optional upgrade)

---

## 1. Setup

In [1]:
# Mount Drive
from google.colab import drive
drive.mount('/content/drive')

import os
PROJECT_PATH = '/content/drive/MyDrive/sf-events-explorer'

Mounted at /content/drive


In [2]:
# Install sentence-transformers (for semantic search)
!pip install sentence-transformers -q

In [3]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import time



In [4]:
# Load data
df = pd.read_csv(f'{PROJECT_PATH}/data/processed/events_cleaned.csv')
corpus = df['search_text'].fillna('').tolist()
print(f"Loaded {len(df)} events")

Loaded 1874 events


## 2. Define Test Queries & Ground Truth

To compare models fairly, we need:
1. Test queries representing real user searches
2. Ground truth: what results SHOULD be returned (approximated here by expected keywords and, where given, an expected category)

In [5]:
# Test queries with expected categories/keywords
TEST_QUERIES = [
    {
        'query': 'fun activities for kids',
        'expected_keywords': ['children', 'kids', 'youth', 'family'],
        'expected_category': None  # Any category OK
    },
    {
        'query': 'free art classes',
        'expected_keywords': ['art', 'creative', 'painting', 'drawing'],
        'expected_category': 'Arts'
    },
    {
        'query': 'sports recreation',
        'expected_keywords': ['sports', 'recreation', 'fitness', 'athletic'],
        'expected_category': 'Sports'
    },
    {
        'query': 'computer coding workshop',
        'expected_keywords': ['coding', 'computer', 'programming', 'tech'],
        'expected_category': 'Education'
    },
    {
        'query': 'music performance',
        'expected_keywords': ['music', 'concert', 'performance', 'band'],
        'expected_category': 'Arts'
    },
    {
        'query': 'swimming lessons',
        'expected_keywords': ['swim', 'pool', 'aquatic', 'water'],
        'expected_category': 'Sports'
    }
]

print(f"Defined {len(TEST_QUERIES)} test queries")

Defined 6 test queries
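One caveat on the keyword lists: the relevance check in Section 5 tests `kw in search_text`, which is substring matching, so a short keyword such as `'art'` also matches inside "party" or "start". The sketch below contrasts that behavior with a stricter word-start match. The helper name `keyword_hit` and the sample text are hypothetical, introduced only for illustration; they are not part of the notebook.

```python
import re

def keyword_hit(text, keywords, whole_word=True):
    """Return True if any keyword occurs in text.

    whole_word=False mimics a plain substring test (`kw in text`);
    whole_word=True only matches keywords that begin at a word boundary.
    """
    text = text.lower()
    for kw in keywords:
        kw = kw.lower()
        if whole_word:
            # \b before the keyword; suffixes are allowed so "swim" still matches "swimming"
            if re.search(r"\b" + re.escape(kw), text):
                return True
        elif kw in text:
            return True
    return False

sample = "Birthday party at the park"  # hypothetical event text
print(keyword_hit(sample, ["art"], whole_word=False))  # True: "art" is inside "party"
print(keyword_hit(sample, ["art"], whole_word=True))   # False: no word starts with "art"
```

Whether the stricter match is preferable depends on the data; it would lower precision scores for every model wherever the current matches are accidental.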


## 3. Train All Models

In [6]:
# Model 1: TF-IDF
print("Training TF-IDF...")
start = time.time()

tfidf_vec = TfidfVectorizer(max_features=3000, ngram_range=(1,2), stop_words='english')
tfidf_matrix = tfidf_vec.fit_transform(corpus)

tfidf_time = time.time() - start
print(f"  Done in {tfidf_time:.2f}s | Shape: {tfidf_matrix.shape}")

Training TF-IDF...
  Done in 0.71s | Shape: (1874, 3000)


In [7]:
# Model 2: CountVectorizer
print("Training CountVectorizer...")
start = time.time()

count_vec = CountVectorizer(max_features=3000, ngram_range=(1,2), stop_words='english')
count_matrix = count_vec.fit_transform(corpus)

count_time = time.time() - start
print(f"  Done in {count_time:.2f}s | Shape: {count_matrix.shape}")

Training CountVectorizer...
  Done in 0.57s | Shape: (1874, 3000)


In [8]:
# Model 3: Sentence Transformers (Semantic Embeddings)
print("Training Sentence Transformers (this takes longer)...")
start = time.time()

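# Use a lightweight pretrained model (nothing is trained here; the corpus is only encoded)
# Note: the timer below also covers the one-time model download and load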
st_model = SentenceTransformer('all-MiniLM-L6-v2')
st_embeddings = st_model.encode(corpus, show_progress_bar=True)

st_time = time.time() - start
print(f"  Done in {st_time:.2f}s | Shape: {st_embeddings.shape}")

Training Sentence Transformers (this takes longer)...


  Done in 109.90s | Shape: (1874, 384)
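The 109.90s above covers everything between the two `time.time()` calls: downloading and loading the model as well as encoding the corpus. To report those stages separately, one option is a small timing helper like the sketch below. The `timed` helper and the stage names are hypothetical, and the workloads are stand-ins for the `SentenceTransformer(...)` constructor and `st_model.encode(corpus)`.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds

@contextmanager
def timed(stage):
    """Record wall-clock time for the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Stand-in workloads; in the notebook these would be the model
# constructor and the corpus encoding call, respectively.
with timed("load_model"):
    time.sleep(0.01)
with timed("encode_corpus"):
    sum(i * i for i in range(10_000))

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.3f}s")
```

Timing the stages separately would make the comparison with TF-IDF and CountVectorizer fairer on reruns, when the model is already cached.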


## 4. Define Search Functions

In [9]:
def search_tfidf(query, k=10):
    """Search using TF-IDF."""
    query_vec = tfidf_vec.transform([query])
    scores = cosine_similarity(query_vec, tfidf_matrix).flatten()
    top_idx = scores.argsort()[-k:][::-1]
    return top_idx, scores[top_idx]

def search_count(query, k=10):
    """Search using CountVectorizer."""
    query_vec = count_vec.transform([query])
    scores = cosine_similarity(query_vec, count_matrix).flatten()
    top_idx = scores.argsort()[-k:][::-1]
    return top_idx, scores[top_idx]

def search_semantic(query, k=10):
    """Search using Sentence Transformers."""
    query_emb = st_model.encode([query])
    scores = cosine_similarity(query_emb, st_embeddings).flatten()
    top_idx = scores.argsort()[-k:][::-1]
    return top_idx, scores[top_idx]
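All three functions share the same ranking step: `scores.argsort()[-k:][::-1]` takes the indices of the k largest scores and reverses them so the best match comes first. A dependency-free sketch of the same idea, using a hypothetical helper `top_k_indices` and made-up scores:

```python
def top_k_indices(scores, k=10):
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

scores = [0.10, 0.80, 0.05, 0.55, 0.30]  # hypothetical cosine similarities
print(top_k_indices(scores, k=3))  # → [1, 3, 4]
```

Tie order may differ from NumPy's `argsort`. Like the `argsort` version, this always returns k indices even when every score is 0 (for example, a query with no in-vocabulary terms), so a minimum-score threshold may be worth considering.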

## 5. Evaluation Metrics

In [10]:
def is_relevant(row, test_case):
    """Check if a result is relevant based on keywords/category."""
    search_text = str(row.get('search_text', '')).lower()
    category = str(row.get('events_category', '')).lower()

    # Check keywords
    keyword_match = any(kw.lower() in search_text for kw in test_case['expected_keywords'])

    # Check category (if specified)
    if test_case['expected_category']:
        category_match = test_case['expected_category'].lower() in category
        return keyword_match or category_match

    return keyword_match

def precision_at_k(indices, test_case, k=10):
    """Calculate Precision@K."""
    relevant = sum(1 for idx in indices[:k] if is_relevant(df.iloc[idx], test_case))
    return relevant / k

def mean_reciprocal_rank(indices, test_case):
    """Reciprocal rank of the first relevant result for one query (averaged over queries later to give MRR)."""
    for rank, idx in enumerate(indices, 1):
        if is_relevant(df.iloc[idx], test_case):
            return 1.0 / rank
    return 0.0
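To make the two metrics concrete, here is a minimal worked example on a list of relevance flags for one query. The function names and the flags are hypothetical; the arithmetic mirrors `precision_at_k` and `mean_reciprocal_rank` above, with the `is_relevant` lookup replaced by precomputed booleans.

```python
def precision_at_k_flags(relevant_flags, k=10):
    """Fraction of the top-k results that are relevant (divides by k, as in the notebook)."""
    return sum(relevant_flags[:k]) / k

def reciprocal_rank_flags(relevant_flags):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, flag in enumerate(relevant_flags, 1):
        if flag:
            return 1.0 / rank
    return 0.0

# Hypothetical relevance judgments for one query's top-10 results
flags = [False, False, True, True, False, True, False, False, False, False]
print(precision_at_k_flags(flags))   # 3 relevant out of 10 → 0.3
print(reciprocal_rank_flags(flags))  # first relevant result at rank 3 → 0.333...
```

Precision@K rewards how many of the top results are relevant, while the reciprocal rank only looks at how early the first relevant one appears, so the two can disagree for the same ranking.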

## 6. Run Comparison

In [11]:
# Evaluate all models on all queries
results = []

for test_case in TEST_QUERIES:
    query = test_case['query']

    # TF-IDF
    idx_tfidf, _ = search_tfidf(query)
    p_tfidf = precision_at_k(idx_tfidf, test_case)
    mrr_tfidf = mean_reciprocal_rank(idx_tfidf, test_case)

    # CountVectorizer
    idx_count, _ = search_count(query)
    p_count = precision_at_k(idx_count, test_case)
    mrr_count = mean_reciprocal_rank(idx_count, test_case)

    # Semantic
    idx_sem, _ = search_semantic(query)
    p_sem = precision_at_k(idx_sem, test_case)
    mrr_sem = mean_reciprocal_rank(idx_sem, test_case)

    results.append({
        'query': query,
        'tfidf_p@10': p_tfidf,
        'tfidf_mrr': mrr_tfidf,
        'count_p@10': p_count,
        'count_mrr': mrr_count,
        'semantic_p@10': p_sem,
        'semantic_mrr': mrr_sem
    })

results_df = pd.DataFrame(results)
results_df

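| | query | tfidf_p@10 | tfidf_mrr | count_p@10 | count_mrr | semantic_p@10 | semantic_mrr |
|---|-------|------------|-----------|------------|-----------|---------------|--------------|
| 0 | fun activities for kids | 0.7 | 1.0 | 0.6 | 0.333333 | 0.1 | 0.5 |
| 1 | free art classes | 0.9 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | sports recreation | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | computer coding workshop | 0.4 | 1.0 | 0.1 | 0.125 | 0.8 | 1.0 |
| 4 | music performance | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 5 | swimming lessons | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |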


In [12]:
# Aggregate results
print("\n" + "="*60)
print("MODEL COMPARISON SUMMARY")
print("="*60)

summary = {
    'TF-IDF': {
        'Mean P@10': results_df['tfidf_p@10'].mean(),
        'Mean MRR': results_df['tfidf_mrr'].mean(),
        'Training Time': f"{tfidf_time:.2f}s"
    },
    'CountVectorizer': {
        'Mean P@10': results_df['count_p@10'].mean(),
        'Mean MRR': results_df['count_mrr'].mean(),
        'Training Time': f"{count_time:.2f}s"
    },
    'Semantic (SBERT)': {
        'Mean P@10': results_df['semantic_p@10'].mean(),
        'Mean MRR': results_df['semantic_mrr'].mean(),
        'Training Time': f"{st_time:.2f}s"
    }
}

summary_df = pd.DataFrame(summary).T
print(summary_df.to_string())


MODEL COMPARISON SUMMARY
                 Mean P@10  Mean MRR Training Time
TF-IDF            0.833333       1.0         0.71s
CountVectorizer   0.783333  0.743056         0.57s
Semantic (SBERT)  0.816667  0.916667       109.90s


## 7. Detailed Analysis

In [13]:
# Side-by-side comparison for one query
test_query = "coding workshop"

print(f"Query: '{test_query}'\n")
print("-"*80)

for name, search_fn in [('TF-IDF', search_tfidf), ('CountVec', search_count), ('Semantic', search_semantic)]:
    idx, scores = search_fn(test_query, k=5)
    print(f"\n{name} Results:")
    for i, (ix, sc) in enumerate(zip(idx, scores), 1):
        print(f"  {i}. [{sc:.3f}] {df.iloc[ix]['event_name'][:50]}")

Query: 'coding workshop'

--------------------------------------------------------------------------------

TF-IDF Results:
  1. [0.595] Workshop: District 7 Affordable Housing
  2. [0.595] Workshop: Pricing for Profitability in 2026
  3. [0.559] Workshop: Sones Mexicanas
  4. [0.555] Workshop: Westside Affordable Housing Richmond &am
  5. [0.423] Postponed: Workshop: Hot Glue Embroidery

CountVec Results:
  1. [0.603] Workshop: District 7 Affordable Housing
  2. [0.555] Workshop: Pricing for Profitability in 2026
  3. [0.555] Workshop: Westside Affordable Housing Richmond &am
  4. [0.516] Workshop: Sones Mexicanas
  5. [0.471] Workshop: Memoir Writing with Jing Li

Semantic Results:
  1. [0.472] Activity: Scratch Coding
  2. [0.461] Basic Sewing Workshop
  3. [0.450] Workshop: Tween Creative Writing Group
  4. [0.445] Workshop: Makerspace for Kids
  5. [0.433] Artist Workshop


## 8. Why TF-IDF?

### Model Comparison Table

| Aspect | TF-IDF | CountVectorizer | Semantic |
|--------|--------|-----------------|----------|
| **Speed** | Fast (0.71s) | Fast (0.57s) | Slow (109.90s, incl. model download) |
| **Specific queries** (P@10 on "computer coding workshop") | Fair (0.4) | Poor (0.1) | Good (0.8) |
| **Synonym handling** | Poor | Poor | Good (not directly measured here) |
| **Interpretability** | High | High | Low |
| **Dependencies** | sklearn only | sklearn only | sentence-transformers |

### Key Findings

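1. **TF-IDF vs CountVectorizer:**
   - TF-IDF handles specific queries better: on "computer coding workshop" it scores P@10 = 0.4 and MRR = 1.0, versus 0.1 and 0.125 for CountVectorizer
   - CountVectorizer uses raw term counts with no inverse-document-frequency weighting, so frequent terms such as "workshop" can drown out rarer, more specific ones such as "coding"
   
2. **TF-IDF vs Semantic:**
   - Semantic embeddings can match synonyms and related wording ("kids" ≈ "children"): for "coding workshop", only the semantic model surfaces "Activity: Scratch Coding" in its top 5
   - On "fun activities for kids", however, it scores only P@10 = 0.1; because `is_relevant` is keyword-based, it may undercount results that are related in meaning but lack the expected keywords
   - Much slower to set up: 109.90s to load the model and encode the corpus (including a one-time model download), versus 0.71s for TF-IDF; query latency was not measured here
   - For our use case, TF-IDF + rule-based boosting is sufficient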
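The weighting difference between raw counts and TF-IDF can be illustrated on a toy corpus. The sketch below is plain Python rather than scikit-learn; the four documents are made up (they are not the notebook's data), and the IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1`, which is scikit-learn's default for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical mini-corpus: "workshop" is common, "coding" is rare
docs = [
    "workshop affordable housing",
    "workshop pricing profitability",
    "workshop memoir writing",
    "scratch coding activity",
]
query = "coding workshop"

tokenized = [d.split() for d in docs]
n_docs = len(docs)
doc_freq = Counter(term for toks in tokenized for term in set(toks))

def idf(term):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def vectorize(tokens, use_idf):
    # Sparse vector as {term: weight}; out-of-vocabulary terms are ignored
    counts = Counter(t for t in tokens if t in doc_freq)
    return {t: c * (idf(t) if use_idf else 1.0) for t, c in counts.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def scores(use_idf):
    q = vectorize(query.split(), use_idf)
    return [cosine(q, vectorize(toks, use_idf)) for toks in tokenized]

count_scores = scores(use_idf=False)
tfidf_scores = scores(use_idf=True)
for doc, c, t in zip(docs, count_scores, tfidf_scores):
    print(f"{doc:32s} count={c:.3f}  tfidf={t:.3f}")
```

With raw counts, the coding document ties with every workshop document, since each shares exactly one query term. With IDF weighting, the rarer term "coding" carries more weight, so the coding document ranks first. This is a toy illustration only; on the real corpus the outcome also depends on which terms survive `max_features` and stop-word filtering.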

### Recommendation

**Use TF-IDF** for the class project because:
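- Best balance of performance and simplicity: highest mean P@10 (0.83) and mean MRR (1.0) of the three models on our six test queries, with scikit-learn as the only dependency
- Fast: fitting on all 1,874 events takes 0.71s, which should be quick enough for real-time search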
- Can be enhanced with rule-based boosting

**Future enhancement:** Add semantic search for better synonym handling.

In [14]:
# Save comparison results
results_df.to_csv(f'{PROJECT_PATH}/data/processed/model_comparison_results.csv', index=False)
print("Saved results to model_comparison_results.csv")

Saved results to model_comparison_results.csv
