<a href="https://colab.research.google.com/github/nhibb262/-ISYS574-ML-Group-Project/blob/main/Notebook/03_failed_approaches.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 03 - Failed Approaches (Lessons Learned)

**Author:** [Your Name]  
**Date:** [YYYY-MM-DD]  
**Purpose:** Document approaches that were attempted but didn't work, and why

---

## Why Document Failures?

> "The only real mistake is the one from which we learn nothing." - Henry Ford

This notebook documents approaches that seemed reasonable but ultimately didn't fit our problem. This is valuable because:

1. **Future reference:** Avoid repeating the same mistakes
2. **Project documentation:** Shows the exploration process
3. **Learning:** Understanding *why* something doesn't work deepens knowledge
4. **Course requirement:** Demonstrates critical thinking

---

## Table of Contents
1. [Setup](#1-setup)
2. [Failed Approach 1: IQR Outlier Detection](#2-failed-approach-1-iqr-outlier-detection)
3. [Failed Approach 2: Linear Regression](#3-failed-approach-2-linear-regression)
4. [Failed Approach 3: Date/Time Feature Engineering](#4-failed-approach-3-datetime-feature-engineering)
5. [Failed Approach 4: Collaborative Filtering](#5-failed-approach-4-collaborative-filtering)
6. [Key Lessons Learned](#6-key-lessons-learned)

---

## 1. Setup

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

import os

import numpy as np
import pandas as pd

PROJECT_PATH = '/content/drive/MyDrive/sf-events-explorer'

# Load cleaned data
df = pd.read_csv(os.path.join(PROJECT_PATH, 'data', 'processed', 'events_cleaned.csv'))
print(f"Loaded {len(df)} records")

Mounted at /content/drive
Loaded 1874 records


---

## 2. Failed Approach 1: IQR Outlier Detection

### What We Tried
We attempted to use Interquartile Range (IQR) outlier detection to identify and remove anomalous events based on:
- Event duration
- Admission price
- Geographic coordinates (bounding box for SF)

### The Code

In [2]:
# IQR Outlier Detection Attempt

def detect_outliers_iqr(df, column):
    """Detect outliers using IQR method."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

# Try to apply to duration_min column
try:
    outliers, lb, ub = detect_outliers_iqr(df, 'duration_min')
    print(f"Found {len(outliers)} outliers")
except KeyError as e:
    print(f"ERROR: {e}")
    print("The 'duration_min' column doesn't exist in our dataset!")

ERROR: 'duration_min'
The 'duration_min' column doesn't exist in our dataset!


In [3]:
# Check what numeric columns actually exist
print("Numeric columns in dataset:")
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
for col in numeric_cols:
    non_null = df[col].notna().sum()
    print(f"  {col}: {non_null} non-null values")

Numeric columns in dataset:
  admission_price: 0 non-null values
  latitude: 1769 non-null values
  longitude: 1769 non-null values
  supervisor_district: 1769 non-null values


In [4]:
# Try with admission_price
if 'admission_price' in df.columns:
    non_null_prices = df['admission_price'].notna().sum()
    print(f"admission_price has {non_null_prices} non-null values out of {len(df)}")
    print(f"That's {non_null_prices/len(df)*100:.1f}% of the data")

    if non_null_prices < 10:
        print("\nNot enough data for meaningful outlier detection!")

admission_price has 0 non-null values out of 1874
That's 0.0% of the data

Not enough data for meaningful outlier detection!


### Why It Failed

| Issue | Details |
|-------|--------|
| **Missing column** | `duration_min` doesn't exist - the dataset doesn't include event duration as a numeric field |
| **Empty data** | `admission_price` is entirely null: 0 of 1,874 rows have a value |
| **Wrong problem type** | Outlier removal mainly matters when fitting a model to numeric features; our problem is search/retrieval over text |
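
Of the three checks listed under "What We Tried", only the geographic one has data behind it: `latitude` and `longitude` are populated for 1,769 of 1,874 rows. Below is a minimal sketch of a bounding-box check. The box limits are rough, assumed values for San Francisco, and the helper name `outside_bbox` and the `event_name` column in the toy data are hypothetical, so adjust them before relying on the result.

```python
import pandas as pd

# Approximate San Francisco bounding box (assumed values, not taken from the dataset).
SF_BBOX = {"lat_min": 37.70, "lat_max": 37.84, "lon_min": -122.53, "lon_max": -122.35}

def outside_bbox(df, lat_col="latitude", lon_col="longitude", bbox=SF_BBOX):
    """Return rows whose coordinates are present but fall outside the box.

    Rows with missing coordinates are not flagged; they need separate handling.
    """
    has_coords = df[lat_col].notna() & df[lon_col].notna()
    inside = (
        df[lat_col].between(bbox["lat_min"], bbox["lat_max"])
        & df[lon_col].between(bbox["lon_min"], bbox["lon_max"])
    )
    return df[has_coords & ~inside]

# Toy data for illustration only.
demo = pd.DataFrame({
    "event_name": ["In the city", "Far away", "No coordinates"],
    "latitude": [37.77, 34.05, None],
    "longitude": [-122.42, -118.24, None],
})
flagged = outside_bbox(demo)
print(flagged["event_name"].tolist())
```

Unlike the IQR rule, a fixed box does not depend on the distribution of the data, which suits a sanity check where the valid range is known in advance.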

### Lesson Learned
> **Always validate column existence and data availability BEFORE writing analysis code.**
>
> Run `df.columns` and `df.describe()` first to understand what data you actually have.
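
One way to make this check routine is a small report that says, for each column an analysis depends on, whether it exists and how many values it holds. This is a sketch rather than code from the project: the function name `column_report` is hypothetical, and the default threshold of 10 simply mirrors the `non_null_prices < 10` check used above.

```python
import pandas as pd

def column_report(df, columns, min_non_null=10):
    """Summarize whether each requested column exists and has enough values."""
    rows = []
    for col in columns:
        exists = col in df.columns
        non_null = int(df[col].notna().sum()) if exists else 0
        rows.append({
            "column": col,
            "exists": exists,
            "non_null": non_null,
            "usable": exists and non_null >= min_non_null,
        })
    return pd.DataFrame(rows)

# Toy frame that mimics the situation above: one missing column,
# one all-null column, and one mostly populated column.
demo = pd.DataFrame({
    "admission_price": [None] * 12,
    "latitude": [37.77] * 11 + [None],
})
report = column_report(demo, ["duration_min", "admission_price", "latitude"])
print(report)
```

Running a report like this before writing `detect_outliers_iqr` would have shown that neither `duration_min` nor `admission_price` was usable.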

---

## 3. Failed Approach 2: Linear Regression

### What We Tried
We attempted to build a regression model to predict:
- Event "popularity" or "fit score"
- Attendance numbers
- User engagement

### The Code

In [5]:
# Linear Regression Attempt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# What would we predict?
print("Looking for a target variable to predict...")
print("\nPotential target columns:")

potential_targets = ['popularity', 'attendance', 'rating', 'engagement', 'views', 'clicks']
for col in potential_targets:
    if col in df.columns:
        print(f"  ✓ {col} EXISTS")
    else:
        print(f"  ✗ {col} NOT FOUND")

Looking for a target variable to predict...

Potential target columns:
  ✗ popularity NOT FOUND
  ✗ attendance NOT FOUND
  ✗ rating NOT FOUND
  ✗ engagement NOT FOUND
  ✗ views NOT FOUND
  ✗ clicks NOT FOUND


In [6]:
# The fundamental problem: No target variable!
print("\nFUNDAMENTAL PROBLEM:")
print("="*50)
print("Linear regression requires a continuous target variable to predict.")
print("")
print("Our dataset contains:")
print("  - Event descriptions (text)")
print("  - Categories (categorical)")
print("  - Locations (categorical/geographic)")
print("  - Times (temporal)")
print("")
print("What it DOESN'T contain:")
print("  - Popularity scores")
print("  - Attendance numbers")
print("  - User ratings")
print("  - Any numeric outcome to predict")
print("")
print("CONCLUSION: Regression is not applicable to this dataset.")


FUNDAMENTAL PROBLEM:
Linear regression requires a continuous target variable to predict.

Our dataset contains:
  - Event descriptions (text)
  - Categories (categorical)
  - Locations (categorical/geographic)
  - Times (temporal)

What it DOESN'T contain:
  - Popularity scores
  - Attendance numbers
  - User ratings
  - Any numeric outcome to predict

CONCLUSION: Regression is not applicable to this dataset.


### Why It Failed

| Issue | Details |
|-------|--------|
| **No target variable** | The dataset has no numeric outcome to predict (no popularity, attendance, ratings) |
| **Wrong problem framing** | Our actual problem is **search/retrieval**, not **prediction** |
| **Misunderstanding the use case** | Users want to FIND events, not PREDICT something about events |

### Lesson Learned
> **Identify your problem type BEFORE choosing algorithms.**
>
> - **Regression:** Predict a continuous number (price, score, quantity)
> - **Classification:** Predict a category (spam/not spam, positive/negative)
> - **Clustering:** Group similar items together
> - **Information Retrieval:** Find relevant items given a query ← **OUR PROBLEM**
>
> Our problem is Information Retrieval, which uses techniques like TF-IDF, not regression.

---

## 4. Failed Approach 3: Date/Time Feature Engineering

### What We Tried
We built a comprehensive date/time parsing pipeline to extract:
- `event_year`, `event_month`, `event_day`
- `event_weekday`, `event_hour`
- `is_weekend`
- `event_duration`

### The Code

In [7]:
# Date/Time Feature Engineering Attempt

# Check what date columns exist
print("Date-related columns in dataset:")
date_cols = [col for col in df.columns if 'date' in col.lower() or 'time' in col.lower()]
for col in date_cols:
    print(f"  {col}: {df[col].dtype}")
    print(f"    Sample: {df[col].iloc[0]}")

Date-related columns in dataset:
  event_start_date: object
    Sample: 2026-01-06
  event_end_date: object
    Sample: 2026-03-07
  start_time: object
    Sample: 00:00:00
  end_time: object
    Sample: 00:00:00
  time_of_day: object
    Sample: morning


In [8]:
# Try the original code that failed
try:
    # This was the original approach
    df['event_year'] = pd.to_datetime(df['start_date']).dt.year
    print("Success!")
except KeyError as e:
    print(f"ERROR: {e}")
    print("")
    print("The column name is different than expected!")
    print("Expected: 'start_date'")
    print(f"Actual columns: {[c for c in df.columns if 'date' in c.lower()]}")

ERROR: 'start_date'

The column name is different than expected!
Expected: 'start_date'
Actual columns: ['event_start_date', 'event_end_date']


In [9]:
# Even with correct column names, the features aren't useful for search
print("\nBUT WAIT - Even if we extract these features...")
print("")
print("How would 'event_year = 2025' help a user searching for 'kids art classes'?")
print("")
print("These features might be useful for:")
print("  - Time-based filtering (show only weekend events)")
print("  - Analytics (how many events per month)")
print("")
print("They are NOT useful for:")
print("  - Semantic search (understanding query intent)")
print("  - Text matching (finding relevant events)")


BUT WAIT - Even if we extract these features...

How would 'event_year = 2025' help a user searching for 'kids art classes'?

These features might be useful for:
  - Time-based filtering (show only weekend events)
  - Analytics (how many events per month)

They are NOT useful for:
  - Semantic search (understanding query intent)
  - Text matching (finding relevant events)


### Why It Failed

| Issue | Details |
|-------|--------|
| **Column name mismatch** | Expected `start_date`, actual column was `event_start_date` |
| **Inconsistent date formats** | Raw data had multiple date format variations |
| **Features not useful for search** | Extracted calendar features don't help with semantic search |

### What We Did Instead
For time-based features, we use **rule-based extraction** at query time:
- If the query mentions "weekend", boost events with Saturday or Sunday in `days_of_week`
- If the query mentions "morning", boost events whose `start_time` is before noon
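
The two rules can be sketched as a small scoring helper. Several details here are assumptions, not taken from the project code: the function name `time_boost`, the additive boost of 0.2, and the idea that `days_of_week` holds comma-separated day names. The `start_time` format (`HH:MM:SS`) matches the sample values printed earlier.

```python
from datetime import time

WEEKEND_DAYS = {"saturday", "sunday"}

def time_boost(query, event, boost=0.2):
    """Return an additive score boost based on time words found in the query."""
    words = set(query.lower().split())
    score = 0.0
    if "weekend" in words:
        # Assumed format: "Saturday, Sunday"
        days = {d.strip().lower() for d in event.get("days_of_week", "").split(",")}
        if days & WEEKEND_DAYS:
            score += boost
    if "morning" in words:
        start = event.get("start_time")  # expected "HH:MM:SS"
        if start and time.fromisoformat(start) < time(12, 0):
            score += boost
    return score

event = {"days_of_week": "Saturday, Sunday", "start_time": "10:00:00"}
weekend_morning = time_boost("weekend morning art", event)
no_time_words = time_boost("kids art classes", event)
print(weekend_morning, no_time_words)
```

Note that a `start_time` of `00:00:00`, as in the sample row above, counts as morning under this rule. If that value is a placeholder for "unknown", it should be excluded before boosting.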

### Lesson Learned
> **Validate column names before writing code.**
>
> Always run `df.columns` and check exact names. Column naming conventions vary between datasets.
>
> Also: **Match features to your use case.** Date features are useful for filtering, not for semantic search.
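
As an illustration of dates used for filtering rather than ranking, the sketch below keeps only events whose date range covers a requested day. It assumes ISO `YYYY-MM-DD` strings, as in the sample values of `event_start_date` and `event_end_date` shown earlier. The function name `active_on` and the `name` key in the toy records are hypothetical.

```python
from datetime import date

def active_on(events, day):
    """Keep events whose [start, end] date range includes `day`."""
    kept = []
    for event in events:
        try:
            start = date.fromisoformat(event["event_start_date"])
            end = date.fromisoformat(event["event_end_date"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or unparseable dates
        if start <= day <= end:
            kept.append(event)
    return kept

# Toy records for illustration only.
events = [
    {"name": "A", "event_start_date": "2026-01-06", "event_end_date": "2026-03-07"},
    {"name": "B", "event_start_date": "2026-04-01", "event_end_date": "2026-04-02"},
    {"name": "C", "event_start_date": "not a date", "event_end_date": "2026-04-02"},
]
matches = active_on(events, date(2026, 2, 1))
print([e["name"] for e in matches])
```

A filter like this narrows the candidate set before text ranking runs, so the date columns still earn their place without being turned into model features.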

---

## 5. Failed Approach 4: Collaborative Filtering

### What We Tried
Collaborative filtering recommends items based on what similar users liked. We considered:
- User-item interaction matrix
- Matrix factorization (SVD)
- Finding similar users based on event preferences

### Why We Couldn't Even Start

In [10]:
# Collaborative Filtering Requirements Check

print("COLLABORATIVE FILTERING REQUIREMENTS")
print("="*50)
print("")
print("Required data:")
print("  1. User IDs (who interacted)")
print("  2. Item IDs (what they interacted with)")
print("  3. Interaction data (clicks, ratings, saves, etc.)")
print("")
print("Our dataset has:")

user_cols = [col for col in df.columns if 'user' in col.lower()]
interaction_cols = [col for col in df.columns if any(x in col.lower() for x in ['click', 'view', 'save', 'rating', 'interaction'])]

print(f"  User columns: {user_cols if user_cols else 'NONE'}")
print(f"  Interaction columns: {interaction_cols if interaction_cols else 'NONE'}")
print("")
print("CONCLUSION: Cannot do collaborative filtering without user interaction data.")

COLLABORATIVE FILTERING REQUIREMENTS

Required data:
  1. User IDs (who interacted)
  2. Item IDs (what they interacted with)
  3. Interaction data (clicks, ratings, saves, etc.)

Our dataset has:
  User columns: NONE
  Interaction columns: NONE

CONCLUSION: Cannot do collaborative filtering without user interaction data.


### Why It Failed

| Issue | Details |
|-------|--------|
| **No user data** | The Our415 dataset is a catalog of events, not a record of user interactions |
| **No interaction history** | No clicks, views, saves, or ratings to learn from |
| **Cold start problem** | Even if we had some data, new events would have no interactions |

### What We Did Instead
**Content-based filtering using TF-IDF:**
- Match events to queries based on text similarity
- No user history required
- Works for all events immediately
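
To make the idea concrete, here is a dependency-free sketch of TF-IDF retrieval with cosine similarity. It is not the project's implementation, which this notebook does not show. The function names, the tokenizer, the smoothed IDF formula, and the toy event descriptions are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def vectorize(tokens, idf):
    """Build an L2-normalized TF-IDF vector, ignoring terms unknown to the index."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def build_index(docs):
    """Compute IDF weights and one TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: terms that occur in every document keep a small positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + n)) + 1 for t, n in doc_freq.items()}
    return idf, [vectorize(tokens, idf) for tokens in tokenized]

def search(query, docs, idf, vectors, top_k=3):
    """Rank documents by cosine similarity to the query, dropping zero scores."""
    q = vectorize(tokenize(query), idf)
    scored = []
    for i, vec in enumerate(vectors):
        score = sum(w * vec.get(t, 0.0) for t, w in q.items())
        if score > 0:
            scored.append((score, docs[i]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy event descriptions for illustration only.
docs = [
    "Kids art classes at the community center",
    "Outdoor yoga in the park",
    "Family art workshop for kids and parents",
]
idf, vectors = build_index(docs)
results = search("kids art classes", docs, idf, vectors)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

Because both vectors are normalized, the dot product is the cosine similarity. The first description shares three query terms and ranks above the third, which shares two; the yoga event shares none and is dropped.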

### Future Enhancement
If we deploy the app and collect user interaction data (clicks, saves), we could add collaborative filtering as a hybrid approach.
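
A hybrid could be as simple as a weighted blend that falls back to the content score while an event has too few interactions. The sketch below is purely illustrative: the function name, the weight `alpha`, the `min_interactions` threshold, and the assumption that both scores lie in [0, 1] are not part of the current project.

```python
def hybrid_score(content_score, collab_score=None, n_interactions=0,
                 min_interactions=5, alpha=0.7):
    """Blend content and collaborative scores, falling back to content alone.

    alpha is the weight on the content score; both scores are assumed to be in [0, 1].
    """
    if collab_score is None or n_interactions < min_interactions:
        return content_score  # cold start: no reliable interaction signal yet
    return alpha * content_score + (1 - alpha) * collab_score

new_event = hybrid_score(0.6)  # no interaction data yet
established_event = hybrid_score(0.6, collab_score=0.9, n_interactions=20)
print(new_event, established_event)
```

The fallback branch addresses the cold start issue listed in the table above: new events are still ranked, using text similarity only.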

### Lesson Learned
> **Check data requirements BEFORE choosing an algorithm.**
>
> - Collaborative filtering needs: User-item interaction data
> - Content-based filtering needs: Item descriptions/features ← **What we have**
>
> Choose the algorithm that fits your available data.

---

## 6. Key Lessons Learned

### Summary Table

| Approach | Why It Failed | Lesson |
|----------|--------------|--------|
| IQR Outlier Detection | Missing columns, empty data | Validate data availability first |
| Linear Regression | No target variable | Identify problem type before choosing algorithm |
| Date/Time Features | Column name mismatch, not useful for search | Check exact column names; match features to use case |
| Collaborative Filtering | No user interaction data | Check data requirements before choosing algorithm |

### The Right Approach: TF-IDF

After these failed attempts, we realized:

1. **Our problem is Information Retrieval**, not prediction or clustering
2. **Our data is text-based** (event names, descriptions)
3. **We don't have user interaction data**

TF-IDF is the right tool because:
- Works directly with text data ✓
- Solves the retrieval problem ✓
- No user data required ✓
- Fast and interpretable ✓

### Meta-Lesson

> **Sometimes the "sophisticated" solution isn't the right one.**
>
> The right solution is the one that:
> 1. Fits your actual data
> 2. Solves your actual problem
> 3. Delivers what your users actually need

In [11]:
# Final summary
print("APPROACH SELECTION CHECKLIST")
print("="*50)
print("")
print("Before choosing an ML approach, ask:")
print("")
print("1. What is my problem type?")
print("   [ ] Regression (predict a number)")
print("   [ ] Classification (predict a category)")
print("   [ ] Clustering (group similar items)")
print("   [✓] Information Retrieval (find relevant items)")
print("")
print("2. What data do I have?")
print("   [✓] Text descriptions")
print("   [✓] Categorical features")
print("   [ ] Numeric target variable")
print("   [ ] User interaction history")
print("")
print("3. What does the user need?")
print("   [✓] Find events matching their query")
print("   [ ] Predict event popularity")
print("   [ ] See personalized recommendations")
print("")
print("CONCLUSION: TF-IDF + content-based filtering is the right approach.")

APPROACH SELECTION CHECKLIST

Before choosing an ML approach, ask:

1. What is my problem type?
   [ ] Regression (predict a number)
   [ ] Classification (predict a category)
   [ ] Clustering (group similar items)
   [✓] Information Retrieval (find relevant items)

2. What data do I have?
   [✓] Text descriptions
   [✓] Categorical features
   [ ] Numeric target variable
   [ ] User interaction history

3. What does the user need?
   [✓] Find events matching their query
   [ ] Predict event popularity
   [ ] See personalized recommendations

CONCLUSION: TF-IDF + content-based filtering is the right approach.
