# SVD vs ALS (NMF) — Recommendation System Comparison

This notebook:
1. Loads and cleans the Amazon review dataset
2. Prepares a shared train/test split
3. Trains an **SVD** model (explicit rating prediction)
4. Trains an **NMF** model (non-negative matrix factorization, used here as a stand-in for ALS)
5. Compares both models on **RMSE, MAE, Precision@10, Recall@10**

## 1. Import Libraries

In [None]:
import pandas as pd
import json
import os
import numpy as np
from collections import defaultdict

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from scipy.sparse import csr_matrix

from surprise import SVD, NMF, Dataset, Reader, accuracy, Trainset
from surprise.model_selection import train_test_split as surprise_split

---
## 2. Load & Explore Data

In [None]:
def load_json_lines(filepath):
    records = []
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return pd.DataFrame(records)

FILE_PATH = 'reco_dataset.json'

df = load_json_lines(FILE_PATH)
print(f"Loaded {len(df):,} reviews")
print(f"Shape: {df.shape}")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df['overall'].describe()

---
## 3. Data Cleaning

### 3.1 Select Relevant Columns

In [None]:
df_cf = df[['reviewerID', 'asin', 'overall']].copy()

print(f"Users  : {df_cf['reviewerID'].nunique():,}")
print(f"Items  : {df_cf['asin'].nunique():,}")
print(f"Ratings: {len(df_cf):,}")

### 3.2 Check for Missing Values

In [None]:
print("Missing values per column:")
print(df_cf.isnull().sum())

### 3.3 Remove Duplicates

In [None]:
print(f"Before dedup: {len(df_cf):,} rows")

df_cf = df_cf.drop_duplicates(subset=['reviewerID', 'asin'], keep='last')

print(f"After dedup : {len(df_cf):,} rows")

### 3.4 Drop Missing Values

In [None]:
df_cf = df_cf.dropna(subset=['reviewerID', 'asin', 'overall'])

print(f"After dropping nulls: {len(df_cf):,} rows")


### 3.5 Validate Rating Range

In [None]:
print("Rating distribution before filter:")
print(df_cf['overall'].value_counts().sort_index())

# Keep only valid ratings
df_cf = df_cf[df_cf['overall'].between(1.0, 5.0)]

print(f"\nAfter rating filter: {len(df_cf):,} rows")

### 3.6 Filter Cold-Start Users & Items

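This is the most important cleaning step for matrix factorization.
Users and items with only a handful of ratings give the model too little signal to learn reliable latent factors, so both must have at least 5 ratings.
The filter runs once, users first and then items, so dropping items can still leave a few users with fewer than 5 ratings.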

In [None]:
MIN_USER_RATINGS = 5   # user must have rated at least 5 items
MIN_ITEM_RATINGS = 5   # item must have been rated at least 5 times

# Filter users
user_counts = df_cf['reviewerID'].value_counts()
valid_users = user_counts[user_counts >= MIN_USER_RATINGS].index
df_cf = df_cf[df_cf['reviewerID'].isin(valid_users)]

# Filter items
item_counts = df_cf['asin'].value_counts()
valid_items = item_counts[item_counts >= MIN_ITEM_RATINGS].index
df_cf = df_cf[df_cf['asin'].isin(valid_items)]

print("After cold-start filter:")
print(f"  Users  : {df_cf['reviewerID'].nunique():,}")
print(f"  Items  : {df_cf['asin'].nunique():,}")
print(f"  Ratings: {len(df_cf):,}")
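
Because removing items can push some users back under the threshold, the filter can be repeated until nothing changes. A minimal sketch of that idea in plain Python, assuming ratings are held as `(user, item, rating)` tuples; the name `k_core_filter` and the toy data are hypothetical and not part of the notebook:

```python
from collections import Counter

def k_core_filter(ratings, min_user=5, min_item=5):
    """Drop users and items below the thresholds, repeating until stable.

    `ratings` is a list of (user, item, rating) tuples (assumed input format).
    """
    current = list(ratings)
    while True:
        user_counts = Counter(u for u, _, _ in current)
        item_counts = Counter(i for _, i, _ in current)
        kept = [
            (u, i, r) for u, i, r in current
            if user_counts[u] >= min_user and item_counts[i] >= min_item
        ]
        # Stop once a full pass removes nothing
        if len(kept) == len(current):
            return kept
        current = kept

# Toy data, filtered with thresholds of 2 to keep the example small
toy = [
    ("u1", "a", 5.0), ("u1", "b", 4.0),
    ("u2", "a", 3.0), ("u2", "b", 2.0),
    ("u3", "b", 1.0), ("u3", "c", 5.0),
]
filtered = k_core_filter(toy, min_user=2, min_item=2)
print(f"Kept {len(filtered)} of {len(toy)} ratings")
```

In the toy data, item `c` is dropped in the first pass, which leaves `u3` with a single rating, so `u3` is dropped in the second pass.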

### 3.7 Final Sanity Check

In [None]:
print("=== Final Clean Dataset ===")
print(f"Shape       : {df_cf.shape}")
print(f"Users       : {df_cf['reviewerID'].nunique():,}")
print(f"Items       : {df_cf['asin'].nunique():,}")
print(f"Ratings     : {len(df_cf):,}")
print(f"Missing vals: {df_cf.isnull().sum().sum()}")
print(f"Duplicates  : {df_cf.duplicated().sum()}")
print("\nRating distribution:")
print(df_cf['overall'].value_counts().sort_index())

df_cf.head()

---
## 4. Feature Engineering

### 4.1 Encode User & Item IDs to Integers

In [None]:
user_enc = LabelEncoder()
item_enc = LabelEncoder()

df_cf['user_id'] = user_enc.fit_transform(df_cf['reviewerID'])
df_cf['item_id'] = item_enc.fit_transform(df_cf['asin'])

n_users = df_cf['user_id'].nunique()
n_items = df_cf['item_id'].nunique()

print(f"Number of users : {n_users:,}")
print(f"Number of items : {n_items:,}")
print(f"\nSample encoding:")
df_cf[['reviewerID', 'user_id', 'asin', 'item_id', 'overall']].head()

### 4.2 Build the Sparse User-Item Matrix

In [None]:
sparse_matrix = csr_matrix(
    (df_cf['overall'].astype(float),
     (df_cf['user_id'], df_cf['item_id'])),
    shape=(n_users, n_items)
)

# Check sparsity
total_cells = n_users * n_items
filled_cells = len(df_cf)
sparsity = 1 - (filled_cells / total_cells)

print(f"Matrix shape : {sparse_matrix.shape}")
print(f"Filled cells : {filled_cells:,}")
print(f"Total cells  : {total_cells:,}")
print(f"Sparsity     : {sparsity:.4%}")

### 4.3 Save Lookup Dictionaries

In [None]:
# Map integer → original ID
user_id_to_reviewer = {i: label for i, label in enumerate(user_enc.classes_)}
item_id_to_asin     = {i: label for i, label in enumerate(item_enc.classes_)}

# Map original ID → integer
reviewer_to_user_id = {v: k for k, v in user_id_to_reviewer.items()}
asin_to_item_id     = {v: k for k, v in item_id_to_asin.items()}

print(f"Sample user mapping: {list(user_id_to_reviewer.items())[:3]}")
print(f"Sample item mapping: {list(item_id_to_asin.items())[:3]}")
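
`LabelEncoder` assigns integer codes in sorted order of the unique labels, so the two lookups behave like the plain-Python sketch below. The ASIN values and the `_demo` names are made up for illustration:

```python
# Hypothetical raw ASINs in arrival order (not taken from the dataset)
raw_asins = ["B003", "B001", "B002", "B001"]

# Sorted unique labels play the role of LabelEncoder.classes_
classes = sorted(set(raw_asins))

# Forward and inverse lookups, mirroring the dictionaries above
item_id_to_asin_demo = dict(enumerate(classes))
asin_to_item_id_demo = {asin: idx for idx, asin in item_id_to_asin_demo.items()}

# Encode, then decode back to the original labels
encoded = [asin_to_item_id_demo[a] for a in raw_asins]
decoded = [item_id_to_asin_demo[i] for i in encoded]
print(encoded, decoded)
```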

### 4.4 Feature Engineering Summary

In [None]:
print("=== Feature Engineering Summary ===")
print(f"df_cf columns     : {list(df_cf.columns)}")
print(f"Sparse matrix type: {type(sparse_matrix)}")
print(f"Matrix shape      : {sparse_matrix.shape}")
print(f"Stored values     : {sparse_matrix.nnz:,}")
df_cf.head()

---
## 5. Shared Train/Test Split

Both models (SVD and NMF) will be evaluated on the **exact same** 
train/test split for a fair comparison.

### 5.1 Sklearn Split (for sparse matrices)

In [None]:
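# Stratifying on user_id keeps each user's ratings split roughly 80/20.
# It requires every user to have at least 2 ratings, which the single-pass
# filter in 3.6 does not strictly guarantee.
train_data, test_data = train_test_split(
    df_cf,
    test_size=0.2,
    random_state=42,
    stratify=df_cf['user_id']
)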

print(f"Train size : {len(train_data):,} ratings")
print(f"Test size  : {len(test_data):,} ratings")
print(f"Train users: {train_data['user_id'].nunique():,}")
print(f"Test users : {test_data['user_id'].nunique():,}")
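
A stratified split raises an error when any user has only one rating. One alternative, sketched here in plain Python rather than pandas, holds out a fraction of each user's ratings while always leaving at least one in training. The function name `per_user_holdout` and the toy tuples are assumptions for illustration, not part of the notebook:

```python
import random
from collections import defaultdict

def per_user_holdout(ratings, test_fraction=0.2, seed=42):
    """Split (user, item, rating) tuples per user, keeping >= 1 train rating each."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    train, test = [], []
    for rows in by_user.values():
        rows = rows[:]          # copy so the caller's data is not shuffled
        rng.shuffle(rows)
        # Never hold out every rating of a user
        n_test = min(round(len(rows) * test_fraction), len(rows) - 1)
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test

# Toy data: u1 has five ratings, u2 has only one
toy = [("u1", f"item{n}", 4.0) for n in range(5)] + [("u2", "item0", 5.0)]
train_demo, test_demo = per_user_holdout(toy)
print(f"train={len(train_demo)} test={len(test_demo)}")
```

With this scheme every test user is guaranteed to appear in training, at the cost of a test share that is only approximately 20%.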

### 5.2 Build Sparse Matrices

In [None]:
train_matrix = csr_matrix(
    (train_data['overall'].astype(float),
     (train_data['user_id'], train_data['item_id'])),
    shape=(n_users, n_items)
)

test_matrix = csr_matrix(
    (test_data['overall'].astype(float),
     (test_data['user_id'], test_data['item_id'])),
    shape=(n_users, n_items)
)

print(f"Train matrix shape : {train_matrix.shape}")
print(f"Train matrix nnz   : {train_matrix.nnz:,}")
print(f"Test matrix shape  : {test_matrix.shape}")
print(f"Test matrix nnz    : {test_matrix.nnz:,}")

### 5.3 Sanity Check — No Overlap

In [None]:
train_pairs = set(zip(train_data['user_id'], train_data['item_id']))
test_pairs  = set(zip(test_data['user_id'],  test_data['item_id']))

overlap = train_pairs & test_pairs
print(f"Overlapping user-item pairs: {len(overlap)}")

### 5.4 Convert to Surprise Format

Both SVD and NMF (from the Surprise library) need data in Surprise's internal format. 
We convert the **same** train/test split so both models see identical data.

In [None]:
reader = Reader(rating_scale=(1, 5))

# Build the Surprise trainset from train_data only, so no test ratings leak in
train_surprise = Dataset.load_from_df(
    train_data[['reviewerID', 'asin', 'overall']], reader
)
trainset = train_surprise.build_full_trainset()

# Build the testset from test_data as a list of (user, item, rating) tuples
testset = list(
    test_data[['reviewerID', 'asin', 'overall']].itertuples(index=False, name=None)
)

print(f"Surprise trainset: {trainset.n_ratings:,} ratings, {trainset.n_users:,} users, {trainset.n_items:,} items")
print(f"Surprise testset : {len(testset):,} ratings")

---
## 6. Evaluation Helpers

In [None]:
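def precision_recall_at_k(predictions, k=10, threshold=4.0):
    """Compute mean Precision@K and Recall@K from Surprise predictions.

    An item is relevant when its true rating is >= threshold. Only items in
    the test set are ranked, so a user may have fewer than k candidates.
    """
    # Group (estimated, true) rating pairs by user
    user_est_true = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions, recalls = {}, {}
    for uid, user_ratings in user_est_true.items():
        # Rank this user's test items by estimated rating, highest first
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        top_k = user_ratings[:k]
        n_relevant = sum(1 for (_, true_r) in user_ratings if true_r >= threshold)
        n_hits     = sum(1 for (_, true_r) in top_k if true_r >= threshold)
        # Divide by the number of items actually ranked rather than k, so
        # users with fewer than k test ratings are not penalized
        precisions[uid] = n_hits / len(top_k)
        recalls[uid]    = n_hits / n_relevant if n_relevant > 0 else 0

    avg_precision = sum(precisions.values()) / len(precisions)
    avg_recall    = sum(recalls.values()) / len(recalls)
    return avg_precision, avg_recall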

print("✅ Evaluation helper defined.")
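
To make the two ranking metrics concrete, here is a small worked example for a single user. The rating pairs are hypothetical, and `k=2` is chosen only to keep the example short:

```python
# (estimated rating, true rating) pairs for one hypothetical user
user_ratings = [(4.6, 5.0), (3.1, 4.0), (4.2, 2.0), (2.5, 1.0)]
k, threshold = 2, 4.0

# Rank by estimated rating, highest first, and keep the top k
ranked = sorted(user_ratings, key=lambda pair: pair[0], reverse=True)
top_k = ranked[:k]

# Relevant = true rating at or above the threshold
n_relevant = sum(1 for _, true_r in ranked if true_r >= threshold)
n_hits = sum(1 for _, true_r in top_k if true_r >= threshold)

# top_k holds exactly k items here, so dividing by k gives the same value
precision_at_k = n_hits / len(top_k)
recall_at_k = n_hits / n_relevant if n_relevant else 0.0
print(precision_at_k, recall_at_k)
```

The model ranks a 2-star item second, so only one of the two relevant items makes the top 2: both precision and recall come out at 0.5.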

---
## 7. Model 1 — SVD (Explicit Rating Prediction)

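SVD in the Surprise library is the matrix factorization algorithm popularized by Simon Funk during the Netflix Prize, not a classical singular value decomposition.
It predicts explicit ratings by learning a global mean, user and item biases, and user and item latent factor vectors from the 1–5 star ratings with stochastic gradient descent.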
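
The prediction rule behind this model can be sketched in a few lines. Surprise documents the estimate as the global mean plus user and item biases plus the dot product of the two factor vectors, updated by SGD on the prediction error. The function names and numbers below are illustrative assumptions; only `lr` and `reg` reuse the values passed to `SVD` in this section:

```python
def predict(mu, b_u, b_i, p_u, q_i, low=1.0, high=5.0):
    """Estimate a rating as mu + b_u + b_i + p_u . q_i, clipped to the scale."""
    raw = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    return min(high, max(low, raw))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.1):
    """Apply one regularized SGD update for a single observed rating."""
    err = r_ui - (mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i)))
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the values from before the step
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# Hypothetical parameters for one user-item pair with two latent factors
mu, b_u, b_i = 3.5, 0.2, -0.1
p_u, q_i = [0.5, -0.2], [0.4, 0.1]

before = predict(mu, b_u, b_i, p_u, q_i)
b_u2, b_i2, p_u2, q_i2 = sgd_step(5.0, mu, b_u, b_i, p_u, q_i)
after = predict(mu, b_u2, b_i2, p_u2, q_i2)
print(f"before={before:.4f} after={after:.4f}")
```

The observed rating is 5.0 and the initial estimate is lower, so a single step moves the estimate upward by a small amount.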

In [None]:
model_svd = SVD(
    n_factors=50,
    n_epochs=20,
    lr_all=0.005,
    reg_all=0.1,
    random_state=42
)

print("Training SVD model...")
model_svd.fit(trainset)
print("✅ SVD model trained successfully!")

### 7.1 SVD Evaluation

In [None]:
# Predict on the test set
svd_predictions = model_svd.test(testset)

# RMSE & MAE
svd_rmse = accuracy.rmse(svd_predictions, verbose=True)
svd_mae  = accuracy.mae(svd_predictions, verbose=True)

# Precision@10 & Recall@10
svd_precision, svd_recall = precision_recall_at_k(svd_predictions, k=10, threshold=4.0)

print(f"\nSVD Precision@10 : {svd_precision:.4f} ({svd_precision*100:.2f}%)")
print(f"SVD Recall@10    : {svd_recall:.4f} ({svd_recall*100:.2f}%)")

---
## 8. Model 2 — NMF (ALS-Style Matrix Factorization)

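NMF (Non-negative Matrix Factorization) from Surprise decomposes the user-item matrix into non-negative user and item latent factors.
It is not ALS: Surprise fits it with stochastic gradient descent, using a step size that keeps the factors non-negative. It stands in for ALS in this comparison.
It uses the **same** train/test split as SVD for a fair comparison.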
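
The non-negativity constraint can be illustrated with a multiplicative update for one user's factor vector. This follows the update rule given in the Surprise documentation as understood here, so treat it as a sketch rather than the library's implementation. The function name and toy numbers are hypothetical:

```python
def update_user_factors(p_u, item_factors, ratings, reg=0.1):
    """One multiplicative update of a single user's non-negative factors.

    item_factors[j] is the factor vector of the j-th item this user rated,
    and ratings[j] is the corresponding observed rating.
    """
    n_rated = len(ratings)
    # Current estimates: dot product of user and item factors (no biases)
    preds = [sum(p * q for p, q in zip(p_u, q_i)) for q_i in item_factors]

    new_p = []
    for f, p_uf in enumerate(p_u):
        numer = sum(q_i[f] * r for q_i, r in zip(item_factors, ratings))
        denom = sum(q_i[f] * est for q_i, est in zip(item_factors, preds))
        denom += reg * n_rated * p_uf
        # Scaling by a non-negative ratio keeps the factor non-negative
        new_p.append(p_uf * numer / denom if denom > 0 else p_uf)
    return new_p

# Toy example: one user, two rated items, two latent factors
p_u = [0.5, 1.0]
item_factors = [[1.0, 0.5], [0.2, 1.5]]
ratings = [4.0, 5.0]
new_p = update_user_factors(p_u, item_factors, ratings)
print([round(v, 4) for v in new_p])
```

Because each factor is only ever rescaled by a ratio of non-negative sums, factors that start non-negative stay non-negative without any explicit clipping.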

In [None]:
model_nmf = NMF(
    n_factors=50,
    n_epochs=20,
    reg_pu=0.1,       # user factor regularization
    reg_qi=0.1,       # item factor regularization
    random_state=42
)

print("Training NMF (ALS-style) model...")
model_nmf.fit(trainset)
print("✅ NMF model trained successfully!")

### 8.1 NMF Evaluation

In [None]:
# Predict on the test set
nmf_predictions = model_nmf.test(testset)

# RMSE & MAE
nmf_rmse = accuracy.rmse(nmf_predictions, verbose=True)
nmf_mae  = accuracy.mae(nmf_predictions, verbose=True)

# Precision@10 & Recall@10
nmf_precision, nmf_recall = precision_recall_at_k(nmf_predictions, k=10, threshold=4.0)

print(f"\nNMF Precision@10 : {nmf_precision:.4f} ({nmf_precision*100:.2f}%)")
print(f"NMF Recall@10    : {nmf_recall:.4f} ({nmf_recall*100:.2f}%)")

---
## 9. Model Comparison — SVD vs NMF (ALS)

In [None]:
print("=" * 55)
print("   ALGORITHM COMPARISON ON SAME DATASET")
print("=" * 55)
print(f"{'Metric':<20} {'SVD':>12} {'NMF (ALS)':>12}")
print("-" * 55)
print(f"{'RMSE':<20} {svd_rmse:>12.4f} {nmf_rmse:>12.4f}")
print(f"{'MAE':<20} {svd_mae:>12.4f} {nmf_mae:>12.4f}")
print(f"{'Precision@10':<20} {svd_precision:>12.4f} {nmf_precision:>12.4f}")
print(f"{'Recall@10':<20} {svd_recall:>12.4f} {nmf_recall:>12.4f}")
print("=" * 55)
print()
print("NOTES:")
print("  Lower RMSE/MAE  = better rating prediction accuracy")
print("  Higher Prec/Rec = better at ranking relevant items")
print("  Both models trained on the SAME data split for fair comparison")
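
For reference, the two error metrics in the table reduce to a few lines of arithmetic. The ratings below are hypothetical and chosen so the numbers are easy to check by hand:

```python
import math

# Hypothetical true and predicted ratings for three test pairs
true_ratings = [5.0, 3.0, 4.0]
predicted = [4.0, 3.0, 2.0]

errors = [t - p for t, p in zip(true_ratings, predicted)]

# MAE averages absolute errors; RMSE squares them first, so large misses weigh more
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(f"MAE={mae:.4f} RMSE={rmse:.4f}")
```

The single 2-star miss pulls RMSE above MAE, which is why a model can rank differently on the two metrics.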

### Interpretation

| Model | Strengths | Best For |
|-------|-----------|----------|
| **SVD** | Learns biases, handles explicit ratings well | Rating prediction tasks |
| **NMF (ALS)** | Non-negative factors, interpretable | Explicit ratings when factors should be non-negative |