<a href="https://colab.research.google.com/github/nhibb262/-ISYS574-ML-Group-Project/blob/main/Notebook/04_tfidf_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 04 - TF-IDF Model Implementation

**Author:** [Your Name]  
**Date:** [YYYY-MM-DD]  
**Purpose:** Build the TF-IDF search model for event discovery

---

## Table of Contents
1. Setup & Load Data
2. Understanding TF-IDF
3. Train TF-IDF Vectorizer
4. Build Search Function
5. Add Rule-Based Feature Boosting
6. Test the Search
7. Save the Model

## 1. Setup & Load Data

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

import os
PROJECT_PATH = '/content/drive/MyDrive/sf-events-explorer'
MODEL_PATH = f'{PROJECT_PATH}/models'
os.makedirs(MODEL_PATH, exist_ok=True)

Mounted at /content/drive


In [2]:
# Imports
import pandas as pd
import numpy as np
import re
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
# Load cleaned data
df = pd.read_csv(f'{PROJECT_PATH}/data/processed/events_cleaned.csv')
print(f"Loaded {len(df)} events")

Loaded 1874 events


## 2. Understanding TF-IDF

**TF-IDF = Term Frequency × Inverse Document Frequency**

```
TF-IDF(t, d) = TF(t, d) × log(N / df(t))

Where:
- TF(t, d) = frequency of term t in document d
- N = total number of documents
- df(t) = number of documents containing term t
```

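**Intuition:** Terms that appear in few documents get higher weights, while terms that appear in almost every document contribute little. Note that scikit-learn's `TfidfVectorizer` defaults to a smoothed variant, `IDF(t) = ln((1 + N) / (1 + df(t))) + 1`, and L2-normalizes each document vector, so its numbers differ from the textbook formula above.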

In [4]:
# Quick TF-IDF demo
demo_docs = ["coding workshop for kids", "art class for kids", "coding class for adults"]
demo_vec = TfidfVectorizer()
demo_vec.fit(demo_docs)

print("Word IDF scores (higher = rarer = more distinctive):")
for word, score in sorted(zip(demo_vec.get_feature_names_out(), demo_vec.idf_), key=lambda x: -x[1]):
    print(f"  {word}: {score:.3f}")

Word IDF scores (higher = rarer = more distinctive):
  adults: 1.693
  art: 1.693
  workshop: 1.693
  class: 1.288
  coding: 1.288
  kids: 1.288
  for: 1.000
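
The scores printed above can be reproduced by hand. The plain `log(N / df)` formula would give `log(3 / 1) ≈ 1.099` for "art", not 1.693; the printed values are consistent with scikit-learn's default smoothed IDF. The following is a minimal sketch using only the standard library. The helper name `smoothed_idf` is illustrative, and whitespace splitting stands in for the vectorizer's tokenizer, which is equivalent only for simple text like this demo.

```python
import math
from collections import Counter

demo_docs = ["coding workshop for kids", "art class for kids", "coding class for adults"]

def smoothed_idf(docs):
    """Recompute IDF as ln((1 + N) / (1 + df)) + 1, the smoothed form sklearn uses by default."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term at least once
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

idf = smoothed_idf(demo_docs)
for term, score in sorted(idf.items(), key=lambda x: (-x[1], x[0])):
    print(f"  {term}: {score:.3f}")
# adults, art, workshop -> 1.693; class, coding, kids -> 1.288; for -> 1.000
```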


## 3. Train TF-IDF Vectorizer

In [5]:
# Configuration
TFIDF_CONFIG = {
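    'max_features': 3000,      # Cap the vocabulary at the 3000 most frequent terms
    'ngram_range': (1, 2),     # Unigrams + bigrams
    'stop_words': 'english',   # Drop scikit-learn's built-in English stop words
    'min_df': 2,               # Drop terms that appear in fewer than 2 events
    'max_df': 0.95             # Drop terms that appear in more than 95% of events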
}

print("TF-IDF Config:")
for k, v in TFIDF_CONFIG.items():
    print(f"  {k}: {v}")

TF-IDF Config:
  max_features: 3000
  ngram_range: (1, 2)
  stop_words: english
  min_df: 2
  max_df: 0.95
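
To make the effect of `min_df` and `max_df` concrete, here is a rough standard-library sketch of the pruning rule as scikit-learn documents it: an integer `min_df` is a minimum document count, and a float `max_df` is a maximum fraction of documents. The function name `prune_vocabulary` and the toy corpus are assumptions for illustration. The sketch leaves out stop words, bigrams, and the `max_features` cap.

```python
from collections import Counter

def prune_vocabulary(docs, min_df=2, max_df=0.95):
    """Keep terms with document frequency >= min_df (count) and <= max_df (fraction of docs)."""
    n_docs = len(docs)
    # Count each term once per document, regardless of how often it repeats inside it
    doc_freq = Counter(term for doc in docs for term in set(doc.lower().split()))
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_df * n_docs)

toy_docs = [
    "free art class",
    "free coding class",
    "free yoga",
    "free art walk",
]
kept = prune_vocabulary(toy_docs)
print(kept)  # ['art', 'class']
```

Here "free" is dropped because it appears in every document (more than 95%), and "coding", "yoga", and "walk" are dropped because each appears in only one.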


In [6]:
# Train vectorizer
vectorizer = TfidfVectorizer(**TFIDF_CONFIG)
corpus = df['search_text'].fillna('').tolist()
tfidf_matrix = vectorizer.fit_transform(corpus)

print(f"TF-IDF Matrix: {tfidf_matrix.shape}")
print(f"  {tfidf_matrix.shape[0]} events")
print(f"  {tfidf_matrix.shape[1]} vocabulary terms")
print(f"  Sparsity: {100*(1 - tfidf_matrix.nnz/(tfidf_matrix.shape[0]*tfidf_matrix.shape[1])):.1f}%")

TF-IDF Matrix: (1874, 3000)
  1874 events
  3000 vocabulary terms
  Sparsity: 98.3%


## 4. Build Search Function

In [7]:
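def search_events(query, top_k=10):
    """Return the top_k events most similar to the query by TF-IDF cosine similarity.

    Relies on the notebook-level `vectorizer`, `tfidf_matrix`, and `df`.
    """
    # Project the query into the fitted vocabulary (terms outside it are ignored)
    query_vec = vectorizer.transform([query])

    # Cosine similarity between the query and every event
    scores = cosine_similarity(query_vec, tfidf_matrix).flatten()

    # Indices of the top_k highest scores, best first
    top_idx = scores.argsort()[-top_k:][::-1]

    results = df.iloc[top_idx].copy()
    results['score'] = scores[top_idx]
    return results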

In [8]:
# Test basic search
results = search_events("art classes for kids")
print("Search: 'art classes for kids'\n")
for i, row in results.head(5).iterrows():
    print(f"[{row['score']:.3f}] {row['event_name']}")

Search: 'art classes for kids'

[0.289] Saturday Afternoon Drop-in Art
[0.289] Saturday Afternoon Drop-in Art
[0.262] Workshop: Makerspace for Kids
[0.260] Playing With Art
[0.260] Playing With Art
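
For readers who want to see what `cosine_similarity` computes, the sketch below implements it from scratch on sparse vectors stored as dictionaries. The function name `cosine`, the event names, and the weights are toy values chosen for illustration, not output from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector (e.g. a query with no known terms) matches nothing
    return dot / (norm_u * norm_v)

query = {"art": 1.0, "kids": 1.0}
events = {
    "Drop-in Art": {"art": 2.0, "drop": 1.0},
    "Makerspace for Kids": {"kids": 1.0, "makerspace": 1.0},
    "Basketball": {"basketball": 1.0},
}

# Rank events by similarity to the query, best first
ranked = sorted(events, key=lambda name: cosine(query, events[name]), reverse=True)
for name in ranked:
    print(f"[{cosine(query, events[name]):.3f}] {name}")
```

One consequence worth keeping in mind: if none of the query terms are in the vocabulary, every score is zero, and the "top" results are then arbitrary rather than relevant.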


## 5. Add Rule-Based Feature Boosting

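TF-IDF only matches text. Preferences stated in the query can also be detected with simple keyword rules and used to boost events whose metadata matches:
- Age group (kids, teens, families)
- Cost (free vs paid)
- Time (morning, afternoon, evening)
- Day (weekend vs weekday)

The boosting function in this section acts on the kids, teens, free, and weekend rules; families and time of day are detected but not yet used for boosting.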

In [9]:
# Feature extraction patterns
FEATURE_PATTERNS = {
    'kids': r'\b(kid|kids|child|children|toddler)\b',
    'teens': r'\b(teen|teens|youth|teenager)\b',
    'families': r'\b(family|families)\b',
    'free': r'\b(free)\b',
    'morning': r'\b(morning)\b',
    'afternoon': r'\b(afternoon)\b',
    'evening': r'\b(evening|night)\b',
    'weekend': r'\b(weekend|saturday|sunday)\b'
}

def extract_features(query):
    """Extract structured features from query."""
    query_lower = query.lower()
    features = {}
    for name, pattern in FEATURE_PATTERNS.items():
        features[name] = bool(re.search(pattern, query_lower))
    return features

# Test
print(extract_features("free art classes for kids on saturday"))

{'kids': True, 'teens': False, 'families': False, 'free': True, 'morning': False, 'afternoon': False, 'evening': False, 'weekend': True}
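
The `\b` word boundaries in these patterns matter, and they have a limitation. The short check below reuses the `free` pattern on a few made-up queries (the query strings are assumptions for illustration): it avoids matching inside longer words, but a hyphen counts as a boundary, so compound words can trigger a false positive.

```python
import re

free_pattern = r'\b(free)\b'  # same pattern as FEATURE_PATTERNS['free']

sample_queries = ["free concert", "freedom day parade", "gluten-free baking class"]
detected = {q: bool(re.search(free_pattern, q)) for q in sample_queries}

for q, hit in detected.items():
    print(f"{q!r}: {hit}")
# 'free concert': True
# 'freedom day parade': False
# 'gluten-free baking class': True
```

If false positives like the last one turn out to matter in practice, one option would be to exclude matches preceded by a hyphen, for example with a negative lookbehind such as `(?<!-)`.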


In [10]:
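def search_with_boost(query, top_k=10, boost_weight=0.15):
    """Search with TF-IDF plus rule-based boosting.

    Returns a tuple (results, features): the top_k events with their
    tfidf_score, boost, and final_score, and the features detected in the query.
    """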
    # Base TF-IDF search
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, tfidf_matrix).flatten()

    # Extract features from query
    features = extract_features(query)

    # Apply boosts
    boost = np.zeros(len(df))

    # Age group boosts
    if features['kids'] and 'age_group_eligibility_tags' in df.columns:
        kids_mask = df['age_group_eligibility_tags'].str.contains('Children|Pre-Teens', case=False, na=False)
        boost[kids_mask] += boost_weight

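    if features['teens'] and 'age_group_eligibility_tags' in df.columns:
        # Negative lookbehind so 'Pre-Teens' (already covered by the kids rule) does not count as 'Teens'
        teens_mask = df['age_group_eligibility_tags'].str.contains(r'(?<!Pre-)Teens', case=False, na=False)
        boost[teens_mask] += boost_weight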

    # Free events boost
    if features['free'] and 'fee' in df.columns:
        free_mask = df['fee'].astype(str).str.lower().isin(['false', 'no', '0'])
        boost[free_mask] += boost_weight

    # Weekend boost
    if features['weekend'] and 'days_of_week' in df.columns:
        weekend_mask = df['days_of_week'].str.contains('Sa|Su|Sat|Sun', case=False, na=False)
        boost[weekend_mask] += boost_weight * 0.5

    # Combine scores
    final_scores = scores + boost

    # Get top results
    top_idx = final_scores.argsort()[-top_k:][::-1]

    results = df.iloc[top_idx].copy()
    results['tfidf_score'] = scores[top_idx]
    results['boost'] = boost[top_idx]
    results['final_score'] = final_scores[top_idx]

    return results, features
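
Because the boost is added to every matching event regardless of its TF-IDF score, an event with no text overlap can still outrank a weak text match. One possible variant, shown here as a sketch and not as the notebook's behavior, applies boosts only where the text score is positive. The helper name `combine_scores`, its `require_text_match` flag, and the numbers are all assumptions for illustration.

```python
import numpy as np

def combine_scores(text_scores, boost, require_text_match=True):
    """Add boosts to text scores, optionally only where the text score is positive."""
    text_scores = np.asarray(text_scores, dtype=float)
    boost = np.asarray(boost, dtype=float)
    if require_text_match:
        # Events with zero TF-IDF similarity keep a final score of zero
        return np.where(text_scores > 0, text_scores + boost, 0.0)
    return text_scores + boost

text_scores = np.array([0.12, 0.00, 0.25])  # toy values, not real model output
boost = np.array([0.15, 0.30, 0.00])        # e.g. one rule vs. two rules matched

plain = combine_scores(text_scores, boost, require_text_match=False)
guarded = combine_scores(text_scores, boost)

print(plain.argmax())    # 1: the event with no text match ranks first
print(guarded.argmax())  # 0: the best event that also matches the text
```

Whether the guard is desirable depends on the product goal: the unguarded sum surfaces events that fit the stated preferences even when wording differs, at the cost of occasionally ranking off-topic events highly.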

## 6. Test the Search

In [11]:
# Test queries
test_queries = [
    "free art classes for kids",
    "basketball",
    "coding workshop",
    "family activities on weekend",
    "music performance"
]

for query in test_queries:
    print(f"\n{'='*60}")
    print(f"Query: '{query}'")
    print(f"{'='*60}")

    results, features = search_with_boost(query, top_k=5)

    active_features = [k for k, v in features.items() if v]
    if active_features:
        print(f"Detected features: {active_features}")

    print(f"\nTop 5 Results:")
    for i, row in results.iterrows():
        print(f"  [{row['final_score']:.3f}] {row['event_name'][:60]}")


Query: 'free art classes for kids'
Detected features: ['kids', 'free']

Top 5 Results:
  [0.532] Workshop: Makerspace for Kids
  [0.504] Activity: Silhouette Art
  [0.502] Workshop: Engineering for Kids
  [0.484] Activity: Art Club
  [0.475] Activity: Craft Club

Query: 'basketball'

Top 5 Results:
  [0.563] Basketball - Pee Wee
  [0.551] Basketball - Pee Wee
  [0.475] Drop-in: Basketball
  [0.475] Drop-in: Basketball
  [0.472] Drop-in: Basketball

Query: 'coding workshop'

Top 5 Results:
  [0.595] Workshop: District 7 Affordable Housing
  [0.595] Workshop: Pricing for Profitability in 2026
  [0.559] Workshop: Sones Mexicanas
  [0.555] Workshop: Westside Affordable Housing Richmond & Sunset
  [0.423] Postponed: Workshop: Hot Glue Embroidery

Query: 'family activities on weekend'
Detected features: ['families', 'weekend']

Top 5 Results:
  [0.503] TR Family Days
  [0.454] Social: Monday Fun Day Family Dance Party
  [0.454] Social: Monday Fun Day Family Dance Party
  [0.364] Social:

## 7. Save the Model

In [12]:
# Save vectorizer
with open(f'{MODEL_PATH}/tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
print(f"Saved: {MODEL_PATH}/tfidf_vectorizer.pkl")

# Save TF-IDF matrix
with open(f'{MODEL_PATH}/tfidf_matrix.pkl', 'wb') as f:
    pickle.dump(tfidf_matrix, f)
print(f"Saved: {MODEL_PATH}/tfidf_matrix.pkl")

# Save config
with open(f'{MODEL_PATH}/tfidf_config.pkl', 'wb') as f:
    pickle.dump(TFIDF_CONFIG, f)
print(f"Saved: {MODEL_PATH}/tfidf_config.pkl")

Saved: /content/drive/MyDrive/sf-events-explorer/models/tfidf_vectorizer.pkl
Saved: /content/drive/MyDrive/sf-events-explorer/models/tfidf_matrix.pkl
Saved: /content/drive/MyDrive/sf-events-explorer/models/tfidf_config.pkl


In [13]:
# Verify saved models
print("\nSaved model files:")
for f in os.listdir(MODEL_PATH):
    size = os.path.getsize(f'{MODEL_PATH}/{f}') / 1024
    print(f"  {f}: {size:.1f} KB")


Saved model files:
  tfidf_vectorizer.pkl: 119.5 KB
  tfidf_matrix.pkl: 1115.3 KB
  tfidf_config.pkl: 0.1 KB
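
A later notebook or app would need to load these files back before searching. The sketch below shows one way that round trip might look. It uses a toy corpus and a temporary directory in place of the real data and `MODEL_PATH` (both are assumptions so the example runs anywhere), but keeps the same file names.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the notebook's corpus and MODEL_PATH (assumptions for this sketch)
toy_corpus = ["art class for kids", "basketball drop-in", "coding workshop for teens"]
model_dir = tempfile.mkdtemp()

vec = TfidfVectorizer()
matrix = vec.fit_transform(toy_corpus)

# Save with the same file names the notebook uses
for name, obj in [("tfidf_vectorizer.pkl", vec), ("tfidf_matrix.pkl", matrix)]:
    with open(os.path.join(model_dir, name), "wb") as f:
        pickle.dump(obj, f)

# Later: load both artifacts and score a query against the stored matrix
with open(os.path.join(model_dir, "tfidf_vectorizer.pkl"), "rb") as f:
    loaded_vec = pickle.load(f)
with open(os.path.join(model_dir, "tfidf_matrix.pkl"), "rb") as f:
    loaded_matrix = pickle.load(f)

scores = cosine_similarity(loaded_vec.transform(["basketball"]), loaded_matrix).flatten()
best = int(scores.argmax())
print(toy_corpus[best])  # basketball drop-in
```

Two practical points follow from this. The matrix rows only make sense alongside the event table they were built from, so the loader should read the same `events_cleaned.csv` in the same row order. Pickled scikit-learn objects are also best loaded with the same scikit-learn version that saved them, and only from sources you trust.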


## Summary

### What We Built
1. **TF-IDF Vectorizer** trained on the event corpus (`search_text` column)
2. **Cosine similarity** search function
3. **Rule-based boosting** for age group, cost, and weekend preferences

### Model Stats
- Events: 1,874
- Vocabulary: 3,000 terms
- Features: unigrams + bigrams

### Next Steps
- Compare with CountVectorizer (Notebook 05)
- Evaluate with test queries (Notebook 06)
- Build Streamlit app