# GRU4Rec Reproduction Study: End-to-End Walkthrough

**A Technical Deep-Dive into Session-Based Recommendations**

---

This notebook is a **learning artifact** designed to help you understand, modify, and experiment with the entire GRU4Rec reproduction pipeline. By the end, you'll understand:

- Why session-based recommendations matter
- How to build reproducible ML evaluation pipelines
- The importance of proper baselines and evaluation protocols
- How to extend and modify this system for your own research

**Runtime:** ~2 minutes on CPU with synthetic data

---

## 1. Project Overview

### 1.1 The Problem: Anonymous User Recommendations

Traditional recommender systems (Netflix, Amazon) work great when you know the user. But what about:

- **E-commerce visitors** who haven't logged in?
- **News readers** browsing anonymously?
- **First-time users** with no history?

**Commonly cited industry figures (approximate; they vary widely by site and sector):**
- 70-80% of e-commerce visitors are anonymous
- Anonymous users convert at 1-2% vs 3-5% for returning users
- The gap is often framed as a 20-40% potential revenue loss

### 1.2 The Solution: Session-Based Recommendations

Instead of user history, we use **session history**:

```
Traditional:     User Profile (weeks/months of data) → Recommendations
Session-Based:   Current Session (minutes of clicks) → Recommendations
```

**GRU4Rec** (Hidasi et al., 2016) pioneered using Recurrent Neural Networks for this task, treating each session as a sequence and predicting the next item.

### 1.3 What This Project Delivers

| Output | Description |
|--------|-------------|
| **Reproducible Pipeline** | One-command execution from data to results |
| **Baseline Models** | Popularity and Markov Chain for comparison |
| **Evaluation Framework** | Full-ranking metrics (Recall@K, MRR@K) |
| **Visualizations** | 11 publication-quality figures |
| **Documentation** | Bilingual technical and executive docs |

---

## 2. Project Structure & Mental Model

### 2.1 Repository Layout

In [None]:
# Let's visualize the project structure
import os
from pathlib import Path

# Navigate to project root (works from notebooks/ directory)
PROJECT_ROOT = Path(os.getcwd()).parent if Path(os.getcwd()).name == 'notebooks' else Path(os.getcwd())
os.chdir(PROJECT_ROOT)

print(f"Project root: {PROJECT_ROOT}")
print("\nKey directories:")
for item in sorted(PROJECT_ROOT.iterdir()):
    if item.name.startswith('.') or item.name == '__pycache__':
        continue
    icon = "[DIR]" if item.is_dir() else "[FILE]"
    print(f"  {icon} {item.name}")

### 2.2 Understanding the Pipeline Stages

The pipeline follows this flow:

```
┌─────────┐    ┌────────────┐    ┌─────────┐    ┌──────────┐    ┌──────────┐
│  FETCH  │───▶│  GENERATE  │───▶│  SPLIT  │───▶│  TRAIN   │───▶│ EVALUATE │
│ GRU4Rec │    │   Data     │    │ Temporal│    │  Models  │    │  Metrics │
└─────────┘    └────────────┘    └─────────┘    └──────────┘    └──────────┘
     │               │                │               │               │
     ▼               ▼                ▼               ▼               ▼
  vendor/     synth_sessions.tsv   train.tsv    model.pt      results.json
                                   test.tsv     baselines      figures/
```

**Key Design Decisions:**
1. **Fetch on-demand**: Official GRU4Rec is not redistributed (licensing)
2. **Temporal split**: No data leakage - earlier sessions for training, later for testing
3. **Full ranking evaluation**: Production-realistic metrics (not inflated sampled metrics)

### 2.3 Entry Points

| Entry Point | Purpose |
|-------------|----------|
| `Makefile` | Main orchestrator - run `make help` for all commands |
| `scripts/` | Individual pipeline steps (fetch, preprocess, train, eval) |
| `src/` | Reusable modules (baselines, metrics, visualizations) |

---

## 3. Setup & Configuration

### 3.1 Imports

In [None]:
# Standard library
import sys
import json
import warnings
from pathlib import Path

# Data processing
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt

# Add src to path for importing project modules
sys.path.insert(0, str(PROJECT_ROOT / 'src'))
sys.path.insert(0, str(PROJECT_ROOT / 'scripts'))

# Project modules
from baselines import PopularityBaseline, MarkovBaseline
from metrics import recall_at_k, mrr_at_k, ndcg_at_k

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print("Imports successful!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

### 3.2 Configuration Parameters

Here are the key parameters that control the pipeline. Understanding these is crucial for experimentation.

In [None]:
# =============================================================================
# CONFIGURATION - Modify these to experiment!
# =============================================================================

CONFIG = {
    # Data generation
    'n_sessions': 1000,      # Number of sessions to generate
    'n_items': 500,          # Unique items in catalog
    'min_session_len': 2,    # Minimum items per session
    'max_session_len': 20,   # Maximum items per session
    
    # Data split
    'train_ratio': 0.8,      # 80% train, 20% test
    'filter_unseen': True,   # Remove unseen items from test
    
    # Evaluation
    'cutoffs': [5, 10, 20],  # K values for Recall@K and MRR@K
    
    # Reproducibility
    'random_seed': 42,       # For reproducible results
}

# Set random seed globally
np.random.seed(CONFIG['random_seed'])

print("Configuration loaded:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

### 3.3 Why These Defaults?

| Parameter | Default | Rationale |
|-----------|---------|-----------|
| `n_sessions` | `1000` | Small for fast iteration; real datasets have millions |
| `n_items` | `500` | Balances complexity vs speed; production has 10K-1M items |
| `train_ratio` | `0.8` | Standard ML split; temporal ordering prevents leakage |
| `cutoffs` | `[5, 10, 20]` | Standard in RecSys literature; @20 is typical production limit |
| `random_seed` | `42` | Reproducibility - same data every run |

---

## 4. Data Ingestion / Generation

### 4.1 Why Synthetic Data?

We use synthetic data for several reasons:
1. **No licensing issues** - real session data is proprietary
2. **Fast iteration** - small data for quick experiments
3. **Controlled properties** - we know the ground truth

The synthetic data mimics real e-commerce patterns:
- **Power-law item popularity** (few items get most clicks)
- **Variable session lengths** (2-20 items)
- **Temporal structure** (realistic timestamps)
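
A minimal sketch of how data with these three properties could be generated, assuming Zipf-like item weights and uniformly drawn session lengths. The function `sketch_sessions` and all of its parameters are hypothetical; the project's `generate_synthetic_sessions` may work differently internally.

```python
import numpy as np
import pandas as pd


def sketch_sessions(n_sessions=100, n_items=50, min_len=2, max_len=20,
                    zipf_exponent=1.0, seed=42):
    """Hypothetical generator: Zipf-weighted items, uniform session lengths."""
    rng = np.random.default_rng(seed)

    # The item at popularity rank r gets weight 1 / r**zipf_exponent (power law).
    ranks = np.arange(1, n_items + 1)
    weights = 1.0 / ranks ** zipf_exponent
    probs = weights / weights.sum()

    rows = []
    start = 1_700_000_000  # arbitrary Unix timestamp for the first session
    for session_id in range(1, n_sessions + 1):
        length = int(rng.integers(min_len, max_len + 1))  # upper bound is exclusive
        items = rng.choice(ranks, size=length, p=probs)
        # Clicks within a session are spaced 1-120 seconds apart.
        gaps = rng.integers(1, 121, size=length)
        times = start + np.cumsum(gaps)
        rows.extend(zip([session_id] * length, items, times))
        start = int(times[-1]) + 60  # the next session starts after this one ends
    return pd.DataFrame(rows, columns=["SessionId", "ItemId", "Time"])


demo = sketch_sessions()
print(demo.groupby("SessionId").size().describe())
```

Because sessions are generated one after another on a shared clock, timestamps in this sketch increase across the whole table, which makes a later temporal split unambiguous.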

In [None]:
# Import the data generation function from our scripts
from make_synth_data import generate_synthetic_sessions

# Generate synthetic session data
df = generate_synthetic_sessions(
    n_sessions=CONFIG['n_sessions'],
    n_items=CONFIG['n_items'],
    min_session_len=CONFIG['min_session_len'],
    max_session_len=CONFIG['max_session_len'],
    seed=CONFIG['random_seed']
)

print(f"Generated {len(df):,} interactions")
print(f"\nDataFrame shape: {df.shape}")
print(f"\nSchema:")
print(df.dtypes)

### 4.2 Data Schema Explained

| Column | Type | Description |
|--------|------|-------------|
| `SessionId` | int | Unique session identifier (one per user visit) |
| `ItemId` | int | Product/item identifier that was clicked |
| `Time` | int | Unix timestamp (seconds since 1970-01-01) |

In [None]:
# Preview the data
print("First 10 rows:")
df.head(10)

### 4.3 Basic Statistics

In [None]:
# Compute key statistics
n_sessions = df['SessionId'].nunique()
n_items = df['ItemId'].nunique()
n_interactions = len(df)

session_lengths = df.groupby('SessionId').size()

print("=" * 50)
print("DATA SUMMARY")
print("=" * 50)
print(f"Total interactions:     {n_interactions:,}")
print(f"Unique sessions:        {n_sessions:,}")
print(f"Unique items:           {n_items:,}")
print(f"Avg items/session:      {n_interactions/n_sessions:.1f}")
print(f"Session length range:   {session_lengths.min()} - {session_lengths.max()}")
print(f"Median session length:  {session_lengths.median():.0f}")
print("=" * 50)

### 4.4 Sanity Checks

In [None]:
# Check for data quality issues
print("SANITY CHECKS:")
print(f"  Null values:          {df.isnull().sum().sum()} (should be 0)")
print(f"  Duplicate rows:       {df.duplicated().sum()}")
print(f"  Min session length:   {session_lengths.min()} (should be >= 2)")
print(f"  ItemId range:         [{df['ItemId'].min()}, {df['ItemId'].max()}]")

# Verify timestamps are ordered within sessions
is_sorted = df.groupby('SessionId')['Time'].apply(lambda x: x.is_monotonic_increasing).all()
print(f"  Time sorted/session:  {is_sorted} (should be True)")

### 4.5 Item Popularity Distribution

Real e-commerce data follows a **power law**: a few items get most of the clicks.
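
One rough way to check for power-law behaviour is to fit a straight line to log(count) against log(rank). The sketch below assumes nothing about the project code; `loglog_slope` is a hypothetical helper, and a straight-line fit is only a quick diagnostic, not a rigorous statistical test.

```python
import numpy as np


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank)."""
    counts = np.sort(np.asarray(counts, dtype=float))[::-1]  # descending order
    counts = counts[counts > 0]                              # log needs positive values
    ranks = np.arange(1, len(counts) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(counts), deg=1)
    return slope


# Idealised Zipf counts: count(rank) = 1000 / rank, so the slope should be close to -1.
ideal_counts = 1000.0 / np.arange(1, 201)
print(f"slope = {loglog_slope(ideal_counts):.2f}")
```

On sampled data the tail of rarely clicked items is noisy, so the fitted slope is only indicative.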

In [None]:
# Analyze item popularity
item_counts = df['ItemId'].value_counts()

print("Item Popularity Distribution:")
print(f"  Most popular item:  {item_counts.iloc[0]:,} clicks")
print(f"  Least popular item: {item_counts.iloc[-1]:,} clicks")
print(f"  Median popularity:  {item_counts.median():.0f} clicks")

# Top 20% of items account for what % of interactions?
top_20_pct = int(n_items * 0.2)
top_20_clicks = item_counts.iloc[:top_20_pct].sum()
print(f"\n  Top 20% items ({top_20_pct} items) account for {100*top_20_clicks/n_interactions:.1f}% of clicks")

In [None]:
# Visualize item popularity (log-log scale reveals power law)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Linear scale
axes[0].bar(range(min(50, len(item_counts))), item_counts.values[:50], color='steelblue')
axes[0].set_xlabel('Item Rank')
axes[0].set_ylabel('Number of Clicks')
axes[0].set_title('Top 50 Items by Popularity')

# Log-log scale (power law appears as straight line)
ranks = np.arange(1, len(item_counts) + 1)
axes[1].loglog(ranks, item_counts.values, 'o', alpha=0.5, markersize=3)
axes[1].set_xlabel('Item Rank (log)')
axes[1].set_ylabel('Number of Clicks (log)')
axes[1].set_title('Power Law Distribution (Log-Log)')

plt.tight_layout()
plt.show()

print("An approximately straight line on the log-log plot is consistent with a power-law (Zipf) distribution.")

---

## 5. Feature Engineering / Preprocessing

### 5.1 The Critical Step: Temporal Train/Test Split

**Why temporal split matters:**

- **Random split** = Data leakage! Future information is used to predict the past.
- **Temporal split** = Realistic! Train on past, evaluate on future.

```
Timeline:  ──────────────────────────────────────────────────▶
           [   TRAIN (80%)   ][  TEST (20%)  ]
           Earlier sessions    Later sessions
```
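
The idea in the timeline above can be sketched in a few lines of pandas. This is a hypothetical `temporal_split_sketch`, assuming whole sessions are assigned by the time of their first click; the project's `temporal_split` may use a different cut rule.

```python
import pandas as pd


def temporal_split_sketch(df, train_ratio=0.8):
    """Hypothetical split: the earliest `train_ratio` of sessions go to train."""
    # Order sessions by the time of their first click.
    session_start = df.groupby("SessionId")["Time"].min().sort_values()
    n_train = int(len(session_start) * train_ratio)
    train_ids = session_start.index[:n_train]
    is_train = df["SessionId"].isin(train_ids)
    return df[is_train], df[~is_train]


toy = pd.DataFrame({
    "SessionId": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "ItemId":    [7, 8, 7, 9, 8, 9, 7, 7, 9, 8],
    "Time":      [10, 11, 20, 21, 30, 31, 40, 41, 50, 51],
})
train_toy, test_toy = temporal_split_sketch(toy)
print(sorted(train_toy["SessionId"].unique().tolist()),
      sorted(test_toy["SessionId"].unique().tolist()))
```

Splitting by session start keeps each session intact. If sessions overlap in time, a long training session can still end after the first test session begins, which is why an explicit timestamp check is worth running after the split.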

In [None]:
# Import the preprocessing function
from preprocess_sessions import temporal_split

# Perform temporal split
train_df, test_df = temporal_split(
    df,
    train_ratio=CONFIG['train_ratio'],
    filter_unseen_items=CONFIG['filter_unseen']
)

print(f"Training set:  {len(train_df):,} interactions, {train_df['SessionId'].nunique():,} sessions")
print(f"Test set:      {len(test_df):,} interactions, {test_df['SessionId'].nunique():,} sessions")

### 5.2 Verifying No Data Leakage

In [None]:
# Verify temporal ordering
train_max_time = train_df['Time'].max()
test_min_time = test_df['Time'].min()

print("TEMPORAL INTEGRITY CHECK:")
print(f"  Latest training timestamp:   {train_max_time}")
print(f"  Earliest test timestamp:     {test_min_time}")
print(f"  Gap (seconds):               {test_min_time - train_max_time}")
print(f"  No overlap:                  {train_max_time <= test_min_time}")

# Verify no session overlap
train_sessions = set(train_df['SessionId'])
test_sessions = set(test_df['SessionId'])
overlap = train_sessions & test_sessions
print(f"  Session overlap:             {len(overlap)} (should be 0)")

### 5.3 Understanding Item Filtering

When `filter_unseen_items=True`, we remove items from test that never appeared in training.

**Why?** We can only recommend items we've seen. Evaluating on "impossible" targets is misleading.
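
A minimal sketch of such a filter, assuming the same `SessionId`/`ItemId` columns. `filter_unseen_sketch` is hypothetical, and dropping sessions that become too short afterwards is an assumption, not confirmed behaviour of the project's `temporal_split`.

```python
import pandas as pd


def filter_unseen_sketch(train_df, test_df, min_len=2):
    """Hypothetical filter: drop test clicks on items absent from training."""
    seen = set(train_df["ItemId"])
    kept = test_df[test_df["ItemId"].isin(seen)]
    # A session shortened below `min_len` cannot form a (history, target) pair.
    lengths = kept.groupby("SessionId")["ItemId"].transform("count")
    return kept[lengths >= min_len]


train_toy = pd.DataFrame({"SessionId": [1, 1, 1], "ItemId": [1, 2, 3]})
test_toy = pd.DataFrame({
    "SessionId": [10, 10, 10, 11, 11],
    "ItemId":    [1, 99, 2, 3, 98],   # items 99 and 98 never occur in training
})
filtered = filter_unseen_sketch(train_toy, test_toy)
print(filtered)
```

In the toy data, session 10 keeps items 1 and 2, while session 11 is left with a single click and is removed.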

In [None]:
# Check item overlap
train_items = set(train_df['ItemId'])
test_items = set(test_df['ItemId'])

print("ITEM COVERAGE:")
print(f"  Items in training:    {len(train_items)}")
print(f"  Items in test:        {len(test_items)}")
print(f"  Items in both:        {len(train_items & test_items)}")
print(f"  Test-only items:      {len(test_items - train_items)} (filtered if enabled)")

### 5.4 Session Length After Split

In [None]:
# Compare session length distributions
train_lens = train_df.groupby('SessionId').size()
test_lens = test_df.groupby('SessionId').size()

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(train_lens, bins=range(2, 22), alpha=0.7, label='Train', color='steelblue')
axes[0].axvline(train_lens.mean(), color='red', linestyle='--', label=f'Mean: {train_lens.mean():.1f}')
axes[0].set_xlabel('Session Length')
axes[0].set_ylabel('Count')
axes[0].set_title('Training Set Session Lengths')
axes[0].legend()

axes[1].hist(test_lens, bins=range(2, 22), alpha=0.7, label='Test', color='darkorange')
axes[1].axvline(test_lens.mean(), color='red', linestyle='--', label=f'Mean: {test_lens.mean():.1f}')
axes[1].set_xlabel('Session Length')
axes[1].set_ylabel('Count')
axes[1].set_title('Test Set Session Lengths')
axes[1].legend()

plt.tight_layout()
plt.show()

---

## 6. Core Models / Algorithms

### 6.1 The Importance of Baselines

Before running complex neural networks, we need baselines to answer:

> "Is the neural network actually better than simple methods?"

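Simple baselines are often surprisingly competitive. As a rule of thumb they can reach **60-70%** of neural network performance, though the ratio depends heavily on the dataset and on how well each model is tuned.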

### 6.2 Baseline 1: Popularity

**Algorithm:** Always recommend the globally most popular items.

**Pros:**
- Near-zero inference cost (precomputed list)
- Surprisingly effective (popular items are popular for a reason)

**Cons:**
- No personalization
- Ignores session context entirely
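
The algorithm is small enough to sketch in full. The class below is a hypothetical stand-in that works on plain lists of item IDs; it is not the project's `PopularityBaseline`, which is fitted on a DataFrame.

```python
from collections import Counter


class PopularitySketch:
    """Hypothetical stand-in for a popularity recommender."""

    def fit(self, sessions):
        counts = Counter(item for session in sessions for item in session)
        # most_common() returns items sorted by count, highest first.
        self.top_items = [item for item, _ in counts.most_common()]
        return self

    def predict(self, session, k=10):
        # The session argument is ignored: every visitor gets the same list.
        return self.top_items[:k]


toy_model = PopularitySketch().fit([[1, 2, 3], [2, 3], [3]])
print(toy_model.predict([1], k=2))
print(toy_model.predict([2, 3], k=2))
```

Both calls return the same list, which is exactly the "no personalization" limitation noted above.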

In [None]:
# Train popularity baseline
pop_baseline = PopularityBaseline()
pop_baseline.fit(train_df)

print("Popularity Baseline trained!")
print(f"  Learned popularity ranking for {len(pop_baseline.top_items)} items")
print(f"  Top 5 most popular items: {pop_baseline.top_items[:5]}")

In [None]:
# Demo prediction
example_session = [10, 25, 42]  # A user clicked items 10, 25, 42
pop_predictions = pop_baseline.predict(example_session, k=10)

print(f"Session history: {example_session}")
print(f"Top-10 recommendations: {pop_predictions}")
print("\nNote: Recommendations are the same regardless of session (no personalization)")

### 6.3 Baseline 2: First-Order Markov Chain

**Algorithm:** Recommend based on what typically follows the last clicked item.

$$P(\text{next item} | \text{session}) \approx P(\text{next item} | \text{last item})$$

**Pros:**
- Captures sequential patterns
- Fast inference (lookup table)

**Cons:**
- Only considers last item (ignores full history)
- Cold start for unseen items
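
A minimal sketch of the transition-count idea, assuming sessions are plain lists of item IDs. `fit_transitions` and `predict_next` are hypothetical helpers, not the project's `MarkovBaseline`.

```python
from collections import Counter, defaultdict


def fit_transitions(sessions):
    """Count how often item b immediately follows item a."""
    transitions = defaultdict(Counter)
    for session in sessions:
        for prev_item, next_item in zip(session, session[1:]):
            transitions[prev_item][next_item] += 1
    return transitions


def predict_next(transitions, session, k=10):
    """Rank candidates by how often they followed the session's last item."""
    followers = transitions.get(session[-1], Counter())
    return [item for item, _ in followers.most_common(k)]


toy_sessions = [[1, 2, 3], [1, 2, 4], [5, 2, 3]]
toy_transitions = fit_transitions(toy_sessions)
print(predict_next(toy_transitions, [9, 2]))   # ranks the items that followed item 2
print(predict_next(toy_transitions, [1, 7]))   # item 7 has no transitions: empty list
```

The second call shows the cold-start problem: with no observed transitions there is nothing to rank, which is one motivation for blending in popularity.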

In [None]:
# Train Markov baseline
markov_baseline = MarkovBaseline(alpha=0.0)  # Pure Markov, no popularity blending
markov_baseline.fit(train_df)

print("Markov Baseline trained!")
print(f"  Learned transitions from {len(markov_baseline.transitions)} unique items")

# Show example transitions
example_item = list(markov_baseline.transitions.keys())[0]
transitions = markov_baseline.transitions[example_item]
top_3 = sorted(transitions.items(), key=lambda x: -x[1])[:3]
print(f"\nExample: After item {example_item}, users often click:")
for item, count in top_3:
    print(f"    Item {item}: {count} times")

In [None]:
# Demo prediction - now session-aware (depends on the last clicked item)
example_session = [10, 25, 42]
markov_predictions = markov_baseline.predict(example_session, k=10)

print(f"Session history: {example_session}")
print(f"Last item seen: {example_session[-1]}")
print(f"Top-10 recommendations: {markov_predictions}")

# Different session = different recommendations
different_session = [100, 200, 300]
different_preds = markov_baseline.predict(different_session, k=10)
print(f"\nDifferent session {different_session} (last item {different_session[-1]}) → {different_preds}")

### 6.4 GRU4Rec: The Neural Network Approach

**Algorithm:** Use a Gated Recurrent Unit (GRU) to encode the full session history.

```
Session: [item1] → [item2] → [item3] → ???
            ↓          ↓          ↓
        ┌──────┐  ┌──────┐  ┌──────┐
        │ GRU  │──│ GRU  │──│ GRU  │──▶ Hidden State
        └──────┘  └──────┘  └──────┘         ↓
                                        Score all items
                                             ↓
                                      Top-K recommendations
```

**Pros:**
- Considers full session history
- Learns complex sequential patterns

**Cons:**
- Requires training (time, compute)
- More complex to deploy
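
To make the diagram concrete, here is a NumPy-only forward pass through an embedding, a single GRU layer, and an output layer. It is a sketch with random, untrained weights, assuming 0-based item indices and an embedding size equal to the hidden size; `TinyGRUScorer` is hypothetical and is not the official GRU4Rec implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyGRUScorer:
    """Untrained embedding -> GRU -> output scorer with random weights."""

    def __init__(self, n_items, hidden=16, seed=0):
        rng = np.random.default_rng(seed)

        def init(*shape):
            return rng.normal(0.0, 0.1, size=shape)

        self.embedding = init(n_items, hidden)   # item index -> dense vector
        self.W = init(3, hidden, hidden)         # input weights: update, reset, candidate
        self.U = init(3, hidden, hidden)         # recurrent weights, same order
        self.b = np.zeros((3, hidden))
        self.output = init(hidden, n_items)      # hidden state -> one score per item
        self.hidden = hidden

    def step(self, x, h):
        z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])           # update gate
        r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])           # reset gate
        cand = np.tanh(self.W[2] @ x + self.U[2] @ (r * h) + self.b[2])  # candidate state
        return (1.0 - z) * h + z * cand

    def score(self, session):
        h = np.zeros(self.hidden)
        for item in session:                     # consume the session in click order
            h = self.step(self.embedding[item], h)
        return h @ self.output

    def predict(self, session, k=10):
        return np.argsort(-self.score(session))[:k]


scorer = TinyGRUScorer(n_items=50)
top5 = scorer.predict([10, 25, 42], k=5)
print(top5)
```

Unlike the Markov baseline, the hidden state depends on every click and on their order, so reversing the session generally changes the scores. Training, which this sketch omits, is what makes those scores meaningful.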

In [None]:
# Note: GRU4Rec training requires the official implementation
# For this tutorial, we focus on baselines (GRU4Rec can be trained via `make train_tiny`)

print("GRU4Rec Model Architecture:")
print("  Input:   Session as sequence of item IDs")
print("  Layer 1: Embedding (ItemId → dense vector)")
print("  Layer 2: GRU (sequence → hidden state)")
print("  Layer 3: Output (hidden state → scores for all items)")
print("  Loss:    Cross-entropy or BPR-max")
print("\nTo train GRU4Rec, run: make fetch && make train_tiny")

---

## 7. Prediction / Scoring / Inference

### 7.1 Understanding the Prediction Task

**Next-item prediction:** Given a session prefix, predict the next item.

```
Session:     [A] → [B] → [C] → [?]
Model sees:  [A] → [B] → [C]
Model predicts: Ranked list of all items
Ground truth: The actual next item the user clicked
```
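
Each test session therefore yields several prediction events, one per position after the first click. A small sketch of that expansion (the helper name `prefix_target_pairs` is hypothetical):

```python
def prefix_target_pairs(session):
    """Yield (history, target) for every position after the first click."""
    for t in range(1, len(session)):
        yield session[:t], session[t]


for history, target in prefix_target_pairs(["A", "B", "C", "D"]):
    print(history, "->", target)
```

A session of length n produces n - 1 events, so a single-click session contributes nothing to evaluation.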

In [None]:
# Get a sample session from test set
sample_session_id = test_df['SessionId'].iloc[0]
sample_session = test_df[test_df['SessionId'] == sample_session_id]['ItemId'].values

print(f"Sample test session {sample_session_id}:")
print(f"  Full sequence: {list(sample_session)}")
print(f"  Length: {len(sample_session)} items")

In [None]:
# Simulate next-item prediction for each position
print("Next-Item Predictions (Popularity Baseline):")
print("-" * 60)

for t in range(len(sample_session) - 1):
    # .tolist() converts NumPy integers to plain Python ints for clean printing
    history = sample_session[:t+1].tolist()
    target = sample_session[t+1]
    predictions = pop_baseline.predict(history, k=10)
    
    hit = "HIT" if target in predictions else "miss"
    rank = list(predictions).index(target) + 1 if target in predictions else ">10"
    
    # Lists do not accept width format specs, so convert to str before aligning
    print(f"  History: {str(history[-3:]):>20} → Target: {target:>4} | Rank: {rank:>4} [{hit}]")

### 7.2 Comparing Model Predictions

In [None]:
# Compare predictions from both baselines
history = list(sample_session[:-1])
target = sample_session[-1]

pop_preds = pop_baseline.predict(history, k=20)
markov_preds = markov_baseline.predict(history, k=20)

print(f"Session history (last 5): {history[-5:]}")
print(f"Target item: {target}")
print()
print(f"Popularity Top-10:  {list(pop_preds[:10])}")
print(f"Markov Top-10:      {list(markov_preds[:10])}")
print()
print(f"Target in Popularity@20: {target in pop_preds}")
print(f"Target in Markov@20:     {target in markov_preds}")

---

## 8. Evaluation & Validation

### 8.1 Metrics Explained

**Recall@K:** Did the target appear in the top-K recommendations?
$$\text{Recall@K} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\text{target}_i \in \text{top-K}_i]$$

**MRR@K (Mean Reciprocal Rank):** How high was the target ranked?
$$\text{MRR@K} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i} \cdot \mathbb{1}[\text{rank}_i \leq K]$$
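
Both formulas translate directly into a loop over prediction events. The sketch below uses hypothetical helpers (`rank_of`, `recall_and_mrr`) that operate on whole batches; they are not the project's `recall_at_k` and `mrr_at_k`, whose signatures may differ.

```python
def rank_of(predictions, target):
    """1-based rank of target in a ranked list, or None if absent."""
    for position, item in enumerate(predictions, start=1):
        if item == target:
            return position
    return None


def recall_and_mrr(all_predictions, targets, k):
    """Average hit indicator and reciprocal rank over N prediction events."""
    hits, reciprocal = 0.0, 0.0
    for predictions, target in zip(all_predictions, targets):
        rank = rank_of(predictions[:k], target)
        if rank is not None:          # the target is inside the top-K
            hits += 1.0
            reciprocal += 1.0 / rank
    n = len(targets)
    return hits / n, reciprocal / n


ranked = [[10, 20, 30, 40, 50]] * 3
targets = [20, 99, 50]                # ranks: 2, absent, 5
recall5, mrr5 = recall_and_mrr(ranked, targets, k=5)
print(f"Recall@5 = {recall5:.3f}, MRR@5 = {mrr5:.3f}")
```

Here two of three targets are found, so Recall@5 is 2/3, and MRR@5 is (1/2 + 0 + 1/5) / 3. Lowering K to 3 drops the rank-5 hit from both metrics.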

In [None]:
# Demonstrate metrics with examples
print("METRIC EXAMPLES:")
print("=" * 50)

# Example predictions and targets
predictions_1 = np.array([10, 20, 30, 40, 50])  # Top-5 predictions
target_1 = 20  # Target is at position 2

predictions_2 = np.array([10, 20, 30, 40, 50])
target_2 = 99  # Target not in predictions

print(f"\nExample 1: predictions={list(predictions_1)}, target={target_1}")
print(f"  Recall@5:  {recall_at_k(predictions_1, target_1)}  (target found)")
print(f"  MRR@5:     {mrr_at_k(predictions_1, target_1):.3f}  (1/rank = 1/2 = 0.5)")

print(f"\nExample 2: predictions={list(predictions_2)}, target={target_2}")
print(f"  Recall@5:  {recall_at_k(predictions_2, target_2)}  (target NOT found)")
print(f"  MRR@5:     {mrr_at_k(predictions_2, target_2):.3f}  (rank > K → 0)")

### 8.2 Full Evaluation: Why Full Ranking Matters

**Sampled Evaluation (Common but Misleading):**
- Score target against 100 random negatives
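- Fast, but inflates metrics, by roughly 2-3x as a rule of thumb; the exact factor depends on catalog size and the number of sampled negatives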

**Full Ranking Evaluation (Production-Realistic):**
- Score target against ALL items
- Slower, but gives realistic performance estimates
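
The inflation follows from simple counting: a target can only be outranked by candidates that are actually scored. The sketch below uses random numbers as stand-in model scores (an assumption for illustration only) and compares the two protocols for one prediction event.

```python
import numpy as np

rng = np.random.default_rng(0)
catalog_size = 500
scores = rng.normal(size=catalog_size)   # stand-in model scores for one event
target = 123                             # arbitrary target item index


def rank_among(candidate_scores, target_score):
    """1-based rank: one plus the number of candidates scored strictly higher."""
    return 1 + int(np.sum(candidate_scores > target_score))


# Full ranking: compare the target with every other item in the catalog.
others = np.delete(np.arange(catalog_size), target)
full_rank = rank_among(scores[others], scores[target])

# Sampled ranking: compare the target with only 100 random negatives.
negatives = rng.choice(others, size=100, replace=False)
sampled_rank = rank_among(scores[negatives], scores[target])

print(f"full rank: {full_rank}, sampled rank: {sampled_rank}")
```

Because the negatives are a subset of the catalog, the sampled rank can never be worse than the full rank, so Recall@K and MRR@K computed from sampled ranks are upper bounds on their full-ranking counterparts.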

In [None]:
# Run full evaluation on baselines
print("Evaluating baselines (this may take a moment)...")
print()

# Evaluate popularity baseline
pop_results = pop_baseline.evaluate(test_df, k=CONFIG['cutoffs'])

print("POPULARITY BASELINE RESULTS:")
for metric, value in pop_results.items():
    if metric != 'n_predictions':
        print(f"  {metric}: {value:.4f}")
print(f"  (Evaluated on {pop_results['n_predictions']:,} predictions)")

In [None]:
# Evaluate Markov baseline
markov_results = markov_baseline.evaluate(test_df, k=CONFIG['cutoffs'])

print("MARKOV BASELINE RESULTS:")
for metric, value in markov_results.items():
    if metric != 'n_predictions':
        print(f"  {metric}: {value:.4f}")
print(f"  (Evaluated on {markov_results['n_predictions']:,} predictions)")

### 8.3 Results Comparison

In [None]:
# Create comparison table
results_df = pd.DataFrame({
    'Metric': [f'Recall@{k}' for k in CONFIG['cutoffs']] + [f'MRR@{k}' for k in CONFIG['cutoffs']],
    'Popularity': [pop_results[f'Recall@{k}'] for k in CONFIG['cutoffs']] + 
                  [pop_results[f'MRR@{k}'] for k in CONFIG['cutoffs']],
    'Markov': [markov_results[f'Recall@{k}'] for k in CONFIG['cutoffs']] + 
              [markov_results[f'MRR@{k}'] for k in CONFIG['cutoffs']]
})

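# Add winner column (equal scores are reported as a tie)
def pick_winner(row):
    if row['Popularity'] == row['Markov']:
        return 'Tie'
    return 'Popularity' if row['Popularity'] > row['Markov'] else 'Markov'

results_df['Winner'] = results_df.apply(pick_winner, axis=1)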

print("BASELINE COMPARISON:")
print("=" * 60)
print(results_df.to_string(index=False))

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Recall comparison
recall_data = results_df[results_df['Metric'].str.contains('Recall')]
x = np.arange(len(recall_data))
width = 0.35

axes[0].bar(x - width/2, recall_data['Popularity'], width, label='Popularity', color='#2ecc71')
axes[0].bar(x + width/2, recall_data['Markov'], width, label='Markov', color='#3498db')
axes[0].set_xlabel('K')
axes[0].set_ylabel('Recall@K')
axes[0].set_title('Recall@K Comparison')
axes[0].set_xticks(x)
axes[0].set_xticklabels(CONFIG['cutoffs'])
axes[0].legend()
axes[0].set_ylim(0, max(recall_data[['Popularity', 'Markov']].max()) * 1.2)

# MRR comparison
mrr_data = results_df[results_df['Metric'].str.contains('MRR')]
axes[1].bar(x - width/2, mrr_data['Popularity'], width, label='Popularity', color='#2ecc71')
axes[1].bar(x + width/2, mrr_data['Markov'], width, label='Markov', color='#3498db')
axes[1].set_xlabel('K')
axes[1].set_ylabel('MRR@K')
axes[1].set_title('MRR@K Comparison')
axes[1].set_xticks(x)
axes[1].set_xticklabels(CONFIG['cutoffs'])
axes[1].legend()
axes[1].set_ylim(0, max(mrr_data[['Popularity', 'Markov']].max()) * 1.2)

plt.tight_layout()
plt.show()

### 8.4 Interpreting Results

**What do these numbers mean?**

| Metric | Interpretation |
|--------|---------------|
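| Recall@20 = 0.35 | 35% of the time, the item the user clicked is in our top-20 recommendations |
| MRR@20 = 0.13 | The mean of 1/rank over all predictions is 0.13, with targets ranked beyond 20 counted as 0. This is not an average position: 1/0.13 ≈ 7.7 is a harmonic-style summary dominated by hits near the top |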

**Is this good or bad?**
- For production: These are baseline numbers to beat
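- Neural models are often reported to improve on strong baselines by roughly 10-30% in relative terms, but gains vary by dataset and tuning effort
- GRU4Rec results on real data vary widely with the dataset and preprocessing; Recall@20 in the 0.4-0.6 range is a rough reference point, not a guarantee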
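
Reading MRR as a position needs care. A short worked example with made-up ranks shows that 1/MRR and the mean rank can differ a lot:

```python
ranks = [1, 1, 10, 20]                 # illustrative ranks of four targets
reciprocal = [1 / r for r in ranks]

mrr = sum(reciprocal) / len(ranks)     # (1 + 1 + 0.1 + 0.05) / 4
mean_rank = sum(ranks) / len(ranks)    # (1 + 1 + 10 + 20) / 4

print(f"MRR = {mrr:.4f}, 1/MRR = {1 / mrr:.2f}, mean rank = {mean_rank:.1f}")
```

The two top-ranked hits dominate the MRR, so 1/MRR is below 2 while the mean rank is 8. Targets that miss the top-K contribute 0 to MRR@K and pull it down further without having any defined position.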

---

## 9. Outputs & Artifacts

### 9.1 What the Pipeline Produces

In [None]:
# List typical output artifacts
artifacts = {
    'data/': {
        'synth_sessions.tsv': 'Full synthetic dataset',
        'train.tsv': 'Training split (80%)',
        'test.tsv': 'Test split (20%)'
    },
    'results/': {
        'model_tiny.pt': 'Trained GRU4Rec model weights',
        'model_tiny.config.json': 'Model configuration',
        '*.results.json': 'Evaluation metrics'
    },
    'figures/': {
        '*.png': '11 publication-quality visualizations',
        '*.svg': 'Vector versions for papers'
    }
}

print("PIPELINE ARTIFACTS:")
print("=" * 50)
for directory, files in artifacts.items():
    print(f"\n{directory}")
    for filename, description in files.items():
        print(f"  {filename:25} {description}")

In [None]:
# Save current results
results_dir = PROJECT_ROOT / 'results'
results_dir.mkdir(exist_ok=True)

# Combine results for saving
all_results = {
    'Popularity': {k: v for k, v in pop_results.items() if k != 'n_predictions'},
    'Markov': {k: v for k, v in markov_results.items() if k != 'n_predictions'}
}

results_file = results_dir / 'notebook_baselines.json'
with open(results_file, 'w') as f:
    json.dump(all_results, f, indent=2)

print(f"Results saved to: {results_file}")
print("\nContents:")
print(json.dumps(all_results, indent=2))

### 9.2 How These Outputs Are Used

| Artifact | Use Case |
|----------|----------|
| `model.pt` | Deploy to production inference service |
| `results.json` | Compare experiments, track progress |
| `figures/` | Include in papers, presentations, portfolio |
| `train/test.tsv` | Reproduce experiments, debugging |

---

## 10. How to Modify & Experiment

### 10.1 Configuration Changes

The `CONFIG` dictionary at the top controls key parameters:

In [None]:
# Show modifiable configuration
print("MODIFIABLE CONFIGURATION:")
print("=" * 60)
print()
print("Data Generation:")
print("  CONFIG['n_sessions'] = 1000  # Increase for more realistic data")
print("  CONFIG['n_items'] = 500      # Increase for harder task")
print()
print("Data Split:")
print("  CONFIG['train_ratio'] = 0.8  # Try 0.9 for more training data")
print("  CONFIG['filter_unseen'] = True  # Try False to see impact")
print()
print("Evaluation:")
print("  CONFIG['cutoffs'] = [5, 10, 20]  # Add [1, 3, 50] for more granularity")

### 10.2 Model Parameter Changes
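
The Markov baseline takes an `alpha` parameter that mixes transition scores with global popularity. The sketch below shows one way such a blend could work, assuming a simple linear interpolation of two probability tables. `blend_scores` and `top_k` are hypothetical, and the project's `MarkovBaseline` may combine the scores differently.

```python
def blend_scores(markov_probs, popularity_probs, alpha):
    """Linear blend: alpha=0 is pure Markov, alpha=1 is pure popularity."""
    items = set(markov_probs) | set(popularity_probs)
    return {
        item: (1.0 - alpha) * markov_probs.get(item, 0.0)
        + alpha * popularity_probs.get(item, 0.0)
        for item in items
    }


def top_k(scores, k):
    # Sort by score descending, breaking ties by item ID for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ordered[:k]]


markov_probs = {3: 0.7, 4: 0.3}                        # P(next | last item), toy values
popularity_probs = {1: 0.5, 2: 0.3, 3: 0.1, 4: 0.1}    # global click shares, toy values

for alpha in (0.0, 0.5, 1.0):
    print(alpha, top_k(blend_scores(markov_probs, popularity_probs, alpha), k=2))
```

With these toy numbers the top-2 list moves from the Markov favourites at `alpha=0.0` to the globally popular items at `alpha=1.0`, with a mixture in between.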

In [None]:
# Markov baseline has a tunable parameter
print("MARKOV BASELINE TUNING:")
print("=" * 60)
print()
print("alpha parameter controls popularity blending:")
print("  alpha=0.0  → Pure Markov (transition probabilities only)")
print("  alpha=0.5  → 50% Markov, 50% popularity")
print("  alpha=1.0  → Pure popularity (ignores transitions)")
print()
print("Example: MarkovBaseline(alpha=0.3)")

In [None]:
# Quick experiment: Compare different alpha values
print("Quick Experiment: Markov alpha sweep")
print("-" * 50)

for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    model = MarkovBaseline(alpha=alpha)
    model.fit(train_df)
    results = model.evaluate(test_df, k=[20])
    print(f"  alpha={alpha:.2f} → Recall@20: {results['Recall@20']:.4f}, MRR@20: {results['MRR@20']:.4f}")

### 10.3 GRU4Rec Parameter Changes

If you want to experiment with GRU4Rec itself, modify these in `scripts/run_gru4rec.py` or pass as command-line arguments:

```bash
# Larger model
python scripts/run_gru4rec.py train --layers 256 --epochs 20

# Different loss function
python scripts/run_gru4rec.py train --loss bpr-max

# GPU training
python scripts/run_gru4rec.py train --device cuda:0
```

---

## 11. Guided Exercises

These exercises will deepen your understanding of the system. Work through them to build intuition.

### Exercise 1: Impact of Dataset Size

In [None]:
# EXERCISE 1: How does dataset size affect performance?
# 
# WHAT TO DO: Generate datasets of different sizes and compare performance
# WHAT TO OBSERVE: Does more data always help? Diminishing returns?
# WHY IT MATTERS: Determines data collection requirements for production

print("EXERCISE 1: Dataset Size Impact")
print("=" * 60)

for n_sess in [100, 500, 1000, 2000]:
    # Generate data
    df_exp = generate_synthetic_sessions(n_sessions=n_sess, n_items=500, seed=42)
    train_exp, test_exp = temporal_split(df_exp, train_ratio=0.8)
    
    # Train and evaluate
    pop = PopularityBaseline()
    pop.fit(train_exp)  # fit() is called separately, as in Section 6.2
    results = pop.evaluate(test_exp, k=[20])
    
    print(f"  n_sessions={n_sess:5} → Recall@20: {results['Recall@20']:.4f}")

print("\nObservation: _____________________________________")
print("(Fill in what you notice about the trend)")

### Exercise 2: Train/Test Ratio Impact

In [None]:
# EXERCISE 2: How does train/test split ratio affect results?
#
# WHAT TO DO: Try different train_ratio values
# WHAT TO OBSERVE: Performance vs statistical reliability tradeoff
# WHY IT MATTERS: Affects confidence in your evaluation numbers

print("EXERCISE 2: Train/Test Ratio Impact")
print("=" * 60)

# Use the original full dataset
for ratio in [0.5, 0.7, 0.8, 0.9]:
    train_exp, test_exp = temporal_split(df, train_ratio=ratio)
    
    pop = PopularityBaseline()
    pop.fit(train_exp)  # fit() is called separately, as in Section 6.2
    results = pop.evaluate(test_exp, k=[20])
    
    print(f"  ratio={ratio:.1f} → Train: {len(train_exp):5}, Test: {len(test_exp):4} | Recall@20: {results['Recall@20']:.4f}")

print("\nObservation: _____________________________________")

### Exercise 3: Item Filtering Impact

In [None]:
# EXERCISE 3: What happens if we don't filter unseen items?
#
# WHAT TO DO: Compare with and without filter_unseen_items
# WHAT TO OBSERVE: How much does filtering affect metrics?
# WHY IT MATTERS: Determines if your evaluation is realistic

print("EXERCISE 3: Item Filtering Impact")
print("=" * 60)

# With filtering (default)
train_filtered, test_filtered = temporal_split(df, filter_unseen_items=True)
pop_filtered = PopularityBaseline()
pop_filtered.fit(train_filtered)  # fit() is called separately, as in Section 6.2
results_filtered = pop_filtered.evaluate(test_filtered, k=[20])

# Without filtering
train_unfiltered, test_unfiltered = temporal_split(df, filter_unseen_items=False)
pop_unfiltered = PopularityBaseline()
pop_unfiltered.fit(train_unfiltered)  # fit() is called separately, as in Section 6.2
results_unfiltered = pop_unfiltered.evaluate(test_unfiltered, k=[20])

print(f"  With filtering:    Recall@20 = {results_filtered['Recall@20']:.4f} (test size: {len(test_filtered)})")
print(f"  Without filtering: Recall@20 = {results_unfiltered['Recall@20']:.4f} (test size: {len(test_unfiltered)})")

print("\nQuestion: Why is the unfiltered Recall@20 lower?")
print("Answer: _____________________________________")

### Exercise 4: Catalog Size Impact

In [None]:
# EXERCISE 4: How does catalog size affect difficulty?
#
# WHAT TO DO: Generate data with different n_items values
# WHAT TO OBSERVE: Harder task with more items?
# WHY IT MATTERS: Real catalogs have 10K-1M items

print("EXERCISE 4: Catalog Size Impact")
print("=" * 60)

for n_items_exp in [100, 500, 1000, 2000]:
    # A distinct loop variable avoids overwriting the global n_items from Section 4.3
    df_exp = generate_synthetic_sessions(n_sessions=1000, n_items=n_items_exp, seed=42)
    train_exp, test_exp = temporal_split(df_exp)
    
    pop = PopularityBaseline()
    pop.fit(train_exp)
    results = pop.evaluate(test_exp, k=[20])
    
    print(f"  n_items={n_items_exp:5} → Recall@20: {results['Recall@20']:.4f}")

print("\nObservation: More items makes the task _____________")

### Exercise 5: Markov Alpha Tuning

In [None]:
# EXERCISE 5: Find the optimal alpha for Markov baseline
#
# WHAT TO DO: Sweep alpha from 0 to 1 in small increments
# WHAT TO OBSERVE: Is there a sweet spot?
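# WHY IT MATTERS: Shows value of blending approaches
# CAVEAT: alpha is selected on the test set here for simplicity; in real work,
#         tune it on a separate validation split to avoid optimistic results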

print("EXERCISE 5: Optimal Markov Alpha")
print("=" * 60)

alphas = np.linspace(0, 1, 11)
recall_scores = []

for alpha in alphas:
    model = MarkovBaseline(alpha=alpha)
    model.fit(train_df)
    results = model.evaluate(test_df, k=[20])
    recall_scores.append(results['Recall@20'])

# Find best
best_idx = np.argmax(recall_scores)
print(f"  Best alpha: {alphas[best_idx]:.2f} with Recall@20: {recall_scores[best_idx]:.4f}")

# Plot
plt.figure(figsize=(8, 4))
plt.plot(alphas, recall_scores, 'o-', linewidth=2, markersize=8)
plt.axvline(alphas[best_idx], color='red', linestyle='--', label=f'Best: {alphas[best_idx]:.2f}')
plt.xlabel('Alpha (0=Markov, 1=Popularity)')
plt.ylabel('Recall@20')
plt.title('Markov-Popularity Blend Optimization')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

### Exercise 6: Session Length Analysis

In [None]:
# EXERCISE 6: Does performance vary by session length?
#
# WHAT TO DO: Segment test data by session length and evaluate separately
# WHAT TO OBSERVE: Are longer sessions easier to predict?
# WHY IT MATTERS: Informs whether to prioritize short vs long sessions

print("EXERCISE 6: Performance by Session Length")
print("=" * 60)

# Group sessions by length
test_session_lens = test_df.groupby('SessionId').size()

# Define length buckets
buckets = [(2, 4), (5, 8), (9, 15), (16, 20)]

for min_len, max_len in buckets:
    # Filter sessions in this length range
    valid_sessions = test_session_lens[(test_session_lens >= min_len) & (test_session_lens <= max_len)].index
    if len(valid_sessions) == 0:
        continue
    test_subset = test_df[test_df['SessionId'].isin(valid_sessions)]
    
    # Evaluate on subset
    results = pop_baseline.evaluate(test_subset, k=[20])
    
    print(f"  Length {min_len}-{max_len}: {len(valid_sessions):4} sessions → Recall@20: {results['Recall@20']:.4f}")

print("\nObservation: _____________________________________")

---

## 12. Limitations & Next Steps

### 12.1 What This Project Does NOT Handle

In [None]:
limitations = [
    ("Real-time inference", "No serving infrastructure; batch evaluation only"),
    ("Item features", "Only uses item IDs; no product metadata, images, text"),
    ("User features", "Purely session-based; no demographics or history"),
    ("Multi-objective", "Only click prediction; no revenue/diversity optimization"),
    ("A/B testing", "Offline evaluation only; no online experimentation"),
    ("Scale", "Tested on 1K sessions; real systems have millions"),
]

print("KNOWN LIMITATIONS:")
print("=" * 60)
for limitation, description in limitations:
    print(f"\n  {limitation}")
    print(f"    → {description}")

### 12.2 Logical Next Extensions

In [None]:
extensions = [
    ("Real data", "RetailRocket, RecSys Challenge datasets", "High"),
    ("More baselines", "SKNN, STAN, SASRec", "Medium"),
    ("Hyperparameter tuning", "Optuna integration for GRU4Rec", "Medium"),
    ("Item features", "Incorporate product categories, prices", "Medium"),
    ("Serving API", "FastAPI endpoint for real-time inference", "High"),
    ("MLflow tracking", "Experiment tracking and model registry", "Low"),
]

print("POSSIBLE EXTENSIONS:")
print("=" * 60)
print(f"{'Extension':<25} {'Description':<35} {'Priority'}")
print("-" * 60)
for ext, desc, priority in extensions:
    print(f"{ext:<25} {desc:<35} {priority}")

### 12.3 Quick Reference Commands

In [None]:
print("""
QUICK REFERENCE - Common Operations:
================================================================================

# Run full demo pipeline (30 seconds)
make demo

# Generate larger synthetic dataset
python scripts/make_synth_data.py --n_sessions 10000 --n_items 1000

# Train GRU4Rec (GPU recommended for speed)
make fetch
python scripts/run_gru4rec.py train --device cuda:0 --epochs 10

# Generate all visualizations
make visualize

# Run tests
make test

# Clean all generated files
make clean
""")

---

## Summary

In this notebook, you've learned:

1. **The Problem**: Session-based recommendations for anonymous users
2. **The Pipeline**: Data → Split → Train → Evaluate (with temporal integrity)
3. **The Baselines**: Popularity and Markov Chain as competitive benchmarks
4. **The Metrics**: Recall@K and MRR@K with full ranking evaluation
5. **The Experiments**: How to modify parameters and observe effects

**Key Takeaways:**
- Simple baselines are often surprisingly strong (roughly 60-70% of neural network performance as a rule of thumb)
- Temporal splits prevent data leakage
- Full ranking evaluation gives realistic estimates
- Reproducibility requires discipline (seeds, versions, documentation)

---

*This notebook is part of the GRU4Rec Reproduction Study portfolio project.*

*Repository: https://github.com/oscgonz19/gru4rec-reproduction-and-audit*