# ML Exploration: Day 2 - Feature Engineering Prep

This notebook covers the preparation for machine learning feature engineering:
- Feature extraction planning
- Train/test split design
- Helper similarity functions

## Setup and Imports

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from scipy.spatial.distance import jaccard

# Import our feature engineering module
import sys
sys.path.append('../src')
from feature_engineering import (
    encode_genres_onehot,
    vectorize_tags_tfidf,
    create_year_features,
    FeaturePipeline
)

---

# Part 1: Feature Extraction Planning

## Overview

For our movie recommendation system, we will extract several types of features:

### 1.1 Genre Features (Categorical → Binary)
- **Source**: `genres` column (e.g., "Action|Comedy|Drama")
- **Encoding**: One-hot (multi-label) encoding
- **Rationale**: Genres are the primary content descriptor and form the basis of content-based filtering
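
The multi-label encoding described above can be sketched as follows. This is a minimal illustration that assumes scikit-learn's `MultiLabelBinarizer`; the helper name `encode_genres_sketch` is hypothetical, and the project's own `encode_genres_onehot` may be implemented differently.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def encode_genres_sketch(df: pd.DataFrame, column: str = "genres"):
    """Hypothetical helper: split pipe-delimited genres and multi-hot encode them."""
    # Turn "Action|Comedy" into ["Action", "Comedy"]; missing values become []
    genre_lists = df[column].fillna("").apply(
        lambda s: [g for g in s.split("|") if g]
    )
    encoder = MultiLabelBinarizer()
    encoded = encoder.fit_transform(genre_lists)
    # One column per genre, one row per movie
    features = pd.DataFrame(encoded, columns=encoder.classes_, index=df.index)
    return features, encoder


demo = pd.DataFrame(
    {"genres": ["Animation|Children|Comedy", "Adventure|Children|Fantasy"]}
)
genre_demo, genre_demo_encoder = encode_genres_sketch(demo)
print(genre_demo)
```

Each movie can carry several genres at once, so a row may contain more than one `1`, unlike strict one-hot encoding.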

### 1.2 Tag Features (Text → Numeric)
- **Source**: User-generated tags
- **Encoding**: TF-IDF vectorization
- **Rationale**: Tags capture nuanced aspects of movies that genres miss (e.g., "twist ending", "dark humor")

### 1.3 Temporal Features
- **Source**: Year extracted from movie title (e.g., "Toy Story (1995)")
- **Derived Features**:
  - Release year (numeric)
  - Decade (categorical/numeric)
- **Rationale**: User preferences often correlate with movie eras
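
One possible way to derive these features is a regular expression over the title, sketched below. The pattern assumes the year appears as four digits in parentheses at the end of the title; titles without a year yield `NaN`. This is an illustration and not necessarily how `create_year_features` works.

```python
import pandas as pd

titles = pd.Series(["Toy Story (1995)", "Jumanji (1995)", "Untitled Film"])

# Capture a four-digit year in parentheses at the end of the title
year_text = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)
year = pd.to_numeric(year_text)  # NaN where no year was found

# Floor to the start of the decade, e.g. 1995 -> 1990
decade = (year // 10) * 10

year_demo = pd.DataFrame({"title": titles, "year": year, "decade": decade})
print(year_demo)
```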

### 1.4 Feature Extraction Strategy

```
Raw Data → Preprocessing → Feature Extraction → Feature Matrix

movies.csv (genres) ───> Genre One-Hot ─────────────┐
                                                    │
tags.csv ──────────────> Tag TF-IDF ────────────────┼───> Combined Feature Matrix
                                                    │
movies.csv (titles) ───> Year/Decade Extraction ────┘
```
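
The final "Combined Feature Matrix" step in the diagram can be sketched as a column-wise concatenation. The three blocks below are small placeholder arrays, not real features, and the use of `scipy.sparse.hstack` is an assumption about how the pieces might be joined.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Placeholder feature blocks for two movies (values are illustrative)
genre_block = csr_matrix(np.array([[1, 0, 1], [0, 1, 1]]))    # multi-hot genres
tag_block = csr_matrix(np.array([[0.6, 0.8], [1.0, 0.0]]))    # TF-IDF weights
year_block = csr_matrix(np.array([[0.5], [0.9]]))             # scaled release year

# Stack the blocks side by side: rows stay aligned by movie
combined = hstack([genre_block, tag_block, year_block]).tocsr()
print(combined.shape)
```

Every block must have the same number of rows in the same movie order; scaling the year column keeps it from dominating distance-based metrics.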

In [None]:
# Example: Feature extraction plan demonstration

# Sample data for demonstration
sample_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Heat (1995)'],
    'genres': ['Animation|Children|Comedy', 'Adventure|Children|Fantasy', 'Action|Crime|Thriller'],
    'tags': ['pixar animated fun', 'jungle adventure board game', 'heist robbery deniro pacino']
})

print("Sample Movies Data:")
print(sample_movies)
print("\n" + "="*60)

In [None]:
# Demonstrate genre one-hot encoding
genre_features, genre_encoder = encode_genres_onehot(sample_movies, 'genres')
print("Genre One-Hot Encoding:")
print(genre_features)
print(f"\nGenre classes: {genre_encoder.classes_}")

In [None]:
# Demonstrate year extraction
year_features = create_year_features(sample_movies, 'title')
print("Year Features:")
print(year_features)

---

# Part 2: Train/Test Split Design

## 2.1 Split Strategy Considerations

For recommendation systems, we must carefully consider how to split data:

### Random Split
- **Use Case**: Standard evaluation of model performance
- **Pros**: Simple, unbiased sample
- **Cons**: Ignores time order, so the model can train on ratings made after the ones it is tested on (temporal leakage)

### Temporal Split
- **Use Case**: Simulating real-world deployment
- **Pros**: More realistic evaluation
- **Cons**: The test set skews toward recent items, some of which may have little or no training history

### User-based Split
- **Use Case**: Evaluating cold-start scenarios
- **Pros**: Tests generalization to new users
- **Cons**: Held-out users may behave differently from training users, and models that rely on per-user history cannot score them directly
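
A user-based split is not implemented in this notebook. One way to sketch it, assuming scikit-learn's `GroupShuffleSplit` with `userId` as the grouping key, is shown below; the data is a small made-up ratings table.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
    "movieId": [1, 2, 3, 1, 2, 4, 2, 3, 4, 1],
    "rating": [4.0, 3.5, 5.0, 4.5, 3.0, 4.0, 5.0, 4.5, 3.5, 4.0],
})

# Hold out 25% of users (not 25% of rows); all ratings of a user stay together
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(ratings, groups=ratings["userId"]))

train_users = set(ratings.iloc[train_idx]["userId"])
test_users = set(ratings.iloc[test_idx]["userId"])
print(f"Train users: {sorted(train_users)}, Test users: {sorted(test_users)}")
```

Because no user appears on both sides, test performance reflects how the model handles users it has never seen.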

## 2.2 Split Configuration

In [None]:
# Train/Test Split Configuration

SPLIT_CONFIG = {
    'test_size': 0.2,           # 20% for testing
    'validation_size': 0.1,     # 10% of the full dataset for validation
    'random_state': 42,         # For reproducibility
    'stratify': True,           # Stratify by rating distribution
}

print("Split Configuration:")
for key, value in SPLIT_CONFIG.items():
    print(f"  {key}: {value}")

In [None]:
def create_train_val_test_split(
    df: pd.DataFrame,
    test_size: float = 0.2,
    val_size: float = 0.1,
    random_state: int = 42,
    stratify_column: str = None
):
    """
    Create train/validation/test splits for the dataset.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataframe to split
    test_size : float
        Proportion of data for test set
    val_size : float
        Proportion of the full dataset for the validation set
        (rescaled internally to account for the rows already held out for testing)
    random_state : int
        Random seed for reproducibility
    stratify_column : str, optional
        Column to use for stratified splitting
        
    Returns:
    --------
    train_df, val_df, test_df : tuple of DataFrames
    """
    stratify = df[stratify_column] if stratify_column and stratify_column in df.columns else None
    
    # First split: separate test set
    train_val_df, test_df = train_test_split(
        df,
        test_size=test_size,
        random_state=random_state,
        stratify=stratify
    )
    
    # Second split: separate validation from training
    stratify_val = train_val_df[stratify_column] if stratify_column and stratify_column in df.columns else None
    
    # Adjust validation size relative to remaining data
    val_size_adjusted = val_size / (1 - test_size)
    
    train_df, val_df = train_test_split(
        train_val_df,
        test_size=val_size_adjusted,
        random_state=random_state,
        stratify=stratify_val
    )
    
    return train_df, val_df, test_df

In [None]:
def create_temporal_split(
    df: pd.DataFrame,
    timestamp_column: str,
    test_ratio: float = 0.2,
    val_ratio: float = 0.1
):
    """
    Create train/validation/test splits based on timestamp.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataframe with timestamp column
    timestamp_column : str
        Name of the timestamp column
    test_ratio : float
        Proportion of most recent data for test set
    val_ratio : float
        Proportion of data, immediately preceding the test period,
        for validation set
        
    Returns:
    --------
    train_df, val_df, test_df : tuple of DataFrames
    """
    df_sorted = df.sort_values(timestamp_column)
    n = len(df_sorted)
    
    test_start_idx = int(n * (1 - test_ratio))
    val_start_idx = int(n * (1 - test_ratio - val_ratio))
    
    train_df = df_sorted.iloc[:val_start_idx]
    val_df = df_sorted.iloc[val_start_idx:test_start_idx]
    test_df = df_sorted.iloc[test_start_idx:]
    
    return train_df, val_df, test_df

In [None]:
# Demonstrate splits with sample data

sample_ratings = pd.DataFrame({
    'userId': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
    'movieId': [1, 2, 3, 1, 2, 4, 2, 3, 4, 1],
    'rating': [4.0, 3.5, 5.0, 4.5, 3.0, 4.0, 5.0, 4.5, 3.5, 4.0],
    'timestamp': [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
})

print("Sample Ratings Data:")
print(sample_ratings)
print("\n" + "="*60)

# Random split
train, val, test = create_train_val_test_split(sample_ratings, test_size=0.2, val_size=0.1)
print(f"\nRandom Split Sizes: Train={len(train)}, Val={len(val)}, Test={len(test)}")

---

# Part 3: Helper Similarity Functions

## 3.1 Similarity Metrics Overview

For content-based filtering, we need to compute similarity between items:

| Metric | Best For | Range |
|--------|----------|-------|
| Cosine | Sparse vectors (TF-IDF) | [-1, 1]; [0, 1] when all features are non-negative |
| Jaccard | Binary features (genres) | [0, 1] |
| Euclidean | Dense numeric features | [0, ∞) as a distance (lower means more similar) |
| Pearson | Rating patterns | [-1, 1] |
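
The table lists Pearson correlation, but no helper for it is defined in this notebook. A minimal sketch using `np.corrcoef` is shown below; the function name `compute_pearson_similarity_sketch` is hypothetical, and the example assumes a small dense matrix with no missing ratings, which real rating data rarely satisfies.

```python
import numpy as np


def compute_pearson_similarity_sketch(rating_matrix: np.ndarray) -> np.ndarray:
    """Hypothetical helper: Pearson correlation between rows (items)."""
    # np.corrcoef treats each row as one variable by default
    return np.corrcoef(rating_matrix)


# Rows are items, columns are users (illustrative values)
ratings_by_item = np.array([
    [5.0, 3.0, 1.0],
    [4.0, 3.0, 2.0],
    [1.0, 3.0, 5.0],
])

pearson_sim = compute_pearson_similarity_sketch(ratings_by_item)
print(np.round(pearson_sim, 3))
```

Rows with constant ratings have zero variance and produce `NaN`, so a production version would need to handle that case and missing ratings explicitly.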

In [None]:
def compute_cosine_similarity(feature_matrix: np.ndarray) -> np.ndarray:
    """
    Compute pairwise cosine similarity between all items.
    
    Parameters:
    -----------
    feature_matrix : np.ndarray
        Matrix of shape (n_items, n_features)
        
    Returns:
    --------
    similarity_matrix : np.ndarray
        Matrix of shape (n_items, n_items) with pairwise similarities
    """
    return cosine_similarity(feature_matrix)

In [None]:
def compute_jaccard_similarity(binary_matrix: np.ndarray) -> np.ndarray:
    """
    Compute pairwise Jaccard similarity for binary features.
    
    Parameters:
    -----------
    binary_matrix : np.ndarray
        Binary matrix of shape (n_items, n_features)
        
    Returns:
    --------
    similarity_matrix : np.ndarray
        Matrix of shape (n_items, n_items) with pairwise Jaccard similarities
    """
    n_items = binary_matrix.shape[0]
    similarity_matrix = np.zeros((n_items, n_items))
    
    for i in range(n_items):
        for j in range(i, n_items):
            # Jaccard = intersection / union
            intersection = np.sum(np.logical_and(binary_matrix[i], binary_matrix[j]))
            union = np.sum(np.logical_or(binary_matrix[i], binary_matrix[j]))
            
            if union == 0:
                sim = 0.0
            else:
                sim = intersection / union
            
            similarity_matrix[i, j] = sim
            similarity_matrix[j, i] = sim
    
    return similarity_matrix
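
The nested loop above runs in Python for every pair of items, which can be slow for large catalogs. A vectorized alternative, sketched below under the same convention (similarity 0 when both rows are empty), computes all intersections with a single matrix product. The function name is hypothetical.

```python
import numpy as np


def jaccard_similarity_vectorized(binary_matrix: np.ndarray) -> np.ndarray:
    """Hypothetical vectorized variant of pairwise Jaccard similarity."""
    X = np.asarray(binary_matrix).astype(bool).astype(int)
    # |A ∩ B| for every pair of rows
    intersection = X @ X.T
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    row_sums = X.sum(axis=1)
    union = row_sums[:, None] + row_sums[None, :] - intersection
    # Leave 0.0 wherever the union is empty to avoid division by zero
    return np.divide(
        intersection,
        union,
        out=np.zeros(union.shape, dtype=float),
        where=union > 0,
    )


demo_matrix = np.array([[1, 1, 0], [0, 1, 1], [0, 0, 0]])
jaccard_demo = jaccard_similarity_vectorized(demo_matrix)
print(np.round(jaccard_demo, 3))
```

This trades memory for speed: it materializes the full `(n_items, n_items)` intersection matrix at once.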

In [None]:
def compute_euclidean_similarity(
    feature_matrix: np.ndarray,
    normalize: bool = True
) -> np.ndarray:
    """
    Compute similarity based on Euclidean distance.
    
    Parameters:
    -----------
    feature_matrix : np.ndarray
        Matrix of shape (n_items, n_features)
    normalize : bool
        If True, map distances to similarities in (0, 1] via 1 / (1 + distance);
        if False, return negated distances, so values lie in (-inf, 0]
        
    Returns:
    --------
    similarity_matrix : np.ndarray
        Matrix of shape (n_items, n_items) with pairwise similarities
    """
    distances = euclidean_distances(feature_matrix)
    
    if normalize:
        # Convert distance to similarity: sim = 1 / (1 + distance)
        similarity_matrix = 1 / (1 + distances)
    else:
        similarity_matrix = -distances  # Negative distance as similarity
    
    return similarity_matrix

In [None]:
def get_top_similar_items(
    similarity_matrix: np.ndarray,
    item_idx: int,
    top_n: int = 10,
    exclude_self: bool = True
) -> list:
    """
    Get the top N most similar items to a given item.
    
    Parameters:
    -----------
    similarity_matrix : np.ndarray
        Precomputed similarity matrix
    item_idx : int
        Index of the query item
    top_n : int
        Number of similar items to return
    exclude_self : bool
        Whether to exclude the item itself from results
        
    Returns:
    --------
    similar_items : list of tuples
        List of (item_idx, similarity_score) tuples
    """
    similarities = similarity_matrix[item_idx]
    
    # Rank every item from most to least similar
    ranked_indices = np.argsort(similarities)[::-1]
    
    if exclude_self:
        # Drop the query item so it never appears in the results,
        # even when top_n exceeds the number of other items
        ranked_indices = ranked_indices[ranked_indices != item_idx]
    
    top_indices = ranked_indices[:top_n]
    
    return [(idx, similarities[idx]) for idx in top_indices]

In [None]:
def compute_weighted_similarity(
    feature_matrices: list,
    weights: list,
    similarity_fn=cosine_similarity
) -> np.ndarray:
    """
    Compute weighted combination of multiple similarity matrices.
    
    Parameters:
    -----------
    feature_matrices : list of np.ndarray
        List of feature matrices (one per feature type)
    weights : list of float
        Non-negative weights for each feature type (normalized internally to sum to 1)
    similarity_fn : callable
        Similarity function to use
        
    Returns:
    --------
    combined_similarity : np.ndarray
        Weighted combination of similarity matrices
    """
    if len(feature_matrices) != len(weights):
        raise ValueError("Number of matrices must match number of weights")
    
    # Normalize weights
    weights = np.array(weights) / np.sum(weights)
    
    combined_similarity = None
    
    for matrix, weight in zip(feature_matrices, weights):
        sim_matrix = similarity_fn(matrix)
        
        if combined_similarity is None:
            combined_similarity = weight * sim_matrix
        else:
            combined_similarity += weight * sim_matrix
    
    return combined_similarity
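
To make the weighting concrete, the sketch below reproduces the same arithmetic inline for two small placeholder blocks: raw weights of 3 and 1 are normalized to 0.75 and 0.25, then applied to the per-block cosine similarity matrices. The arrays and weights are illustrative only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder feature blocks for three movies
genre_block = np.array([[1, 1, 0], [1, 0, 1], [0, 0, 1]], dtype=float)
tag_block = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

# Normalize raw weights so they sum to 1: [3, 1] -> [0.75, 0.25]
raw_weights = np.array([3.0, 1.0])
w = raw_weights / raw_weights.sum()

combined_demo = (
    w[0] * cosine_similarity(genre_block)
    + w[1] * cosine_similarity(tag_block)
)
print(np.round(combined_demo, 3))
```

Because each block's self-similarity is 1 and the weights sum to 1, the diagonal of the combined matrix stays at 1.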

In [None]:
# Demonstrate similarity computations

# Use genre features from earlier
print("Genre Features Matrix:")
print(genre_features.values)
print("\n" + "="*60)

# Compute cosine similarity
cosine_sim = compute_cosine_similarity(genre_features.values)
print("\nCosine Similarity Matrix:")
print(np.round(cosine_sim, 3))

# Compute Jaccard similarity
jaccard_sim = compute_jaccard_similarity(genre_features.values)
print("\nJaccard Similarity Matrix:")
print(np.round(jaccard_sim, 3))

In [None]:
# Find most similar movies to Toy Story (index 0)
print("Movies most similar to 'Toy Story':")
similar_to_toy_story = get_top_similar_items(cosine_sim, item_idx=0, top_n=2)

for idx, score in similar_to_toy_story:
    print(f"  {sample_movies.iloc[idx]['title']}: similarity = {score:.3f}")

---

## Summary

This notebook established the foundation for ML feature engineering:

1. **Feature Extraction Planning**: Defined our approach for extracting genre, tag, and temporal features
2. **Train/Test Split Design**: Implemented both random and temporal split strategies
3. **Similarity Functions**: Created helper functions for computing various similarity metrics

### Next Steps

- [ ] Load actual MovieLens data and apply feature extraction
- [ ] Evaluate different similarity metrics on real data
- [ ] Build content-based recommendation models
- [ ] Implement collaborative filtering approach

In [None]:
# Summary of helper functions available:

print("Feature Engineering Functions:")
print("  - encode_genres_onehot(): One-hot encode movie genres")
print("  - vectorize_tags_tfidf(): TF-IDF vectorize movie tags")
print("  - create_year_features(): Extract year and decade from titles")
print("  - FeaturePipeline: Complete feature extraction pipeline")
print("")
print("Split Functions:")
print("  - create_train_val_test_split(): Random train/val/test split")
print("  - create_temporal_split(): Time-based train/val/test split")
print("")
print("Similarity Functions:")
print("  - compute_cosine_similarity(): Cosine similarity for sparse features")
print("  - compute_jaccard_similarity(): Jaccard similarity for binary features")
print("  - compute_euclidean_similarity(): Euclidean distance-based similarity")
print("  - get_top_similar_items(): Get top-N similar items")
print("  - compute_weighted_similarity(): Combine multiple similarity matrices")