# Notebook 03: Matrix Factorization (SVD) + Baseline Evaluation

## Objective
Build and evaluate a collaborative filtering recommendation model using Matrix Factorization (TruncatedSVD) on the MovieLens 1M dataset.

This notebook covers:
- Time-based train/validation/test split to prevent data leakage
- Baseline RMSE using the global mean rating
- Matrix Factorization using TruncatedSVD
- RMSE evaluation on validation and test sets using observed user–item ratings only
- Generation of deployment-ready recommendation artifacts for downstream use



In [38]:

import os
import numpy as np
import pandas as pd

from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_squared_error

RANDOM_STATE = 42
N_COMPONENTS = 50
TOP_N = 50

np.random.seed(RANDOM_STATE)
os.makedirs("/content/app", exist_ok=True)


In [50]:
# If your files are already in /content, these paths work.
# Otherwise, upload ratings.dat, movies.dat, users.dat into Colab Files.

ratings_path = "/content/ratings.dat"
movies_path  = "/content/movies.dat"
users_path   = "/content/users.dat"

ratings = pd.read_csv(
    ratings_path, sep="::", engine="python",
    names=["user_id", "movie_id", "rating", "timestamp"]
)
ratings["datetime"] = pd.to_datetime(ratings["timestamp"], unit="s")

movies = pd.read_csv(
    movies_path, sep="::", engine="python",
    encoding="latin-1",   # IMPORTANT: avoids UnicodeDecodeError
    names=["movie_id", "title", "genres"]
)

users = pd.read_csv(
    users_path, sep="::", engine="python",
    names=["user_id", "gender", "age", "occupation", "zip"]
)

print("ratings:", ratings.shape, "movies:", movies.shape, "users:", users.shape)
ratings.head()


ratings: (1000209, 5) movies: (3883, 3) users: (6040, 5)


   user_id  movie_id  rating  timestamp            datetime
0        1      1193       5  978300760 2000-12-31 22:12:40
1        1       661       3  978302109 2000-12-31 22:35:09
2        1       914       3  978301968 2000-12-31 22:32:48
3        1      3408       4  978300275 2000-12-31 22:04:35
4        1      2355       5  978824291 2001-01-06 23:38:11


In [51]:
ratings_sorted = ratings.sort_values("datetime").reset_index(drop=True)

n = len(ratings_sorted)
train_end = int(n * 0.70)
val_end   = int(n * 0.85)

train = ratings_sorted.iloc[:train_end].copy()
val   = ratings_sorted.iloc[train_end:val_end].copy()
test  = ratings_sorted.iloc[val_end:].copy()

print("train/val/test:", train.shape, val.shape, test.shape)
print("train time:", train["datetime"].min(), "→", train["datetime"].max())
print("val time:", val["datetime"].min(), "→", val["datetime"].max())
print("test time:", test["datetime"].min(), "→", test["datetime"].max())


train/val/test: (700146, 5) (150031, 5) (150032, 5)
train time: 2000-04-25 23:05:32 → 2000-11-22 03:06:26
val time: 2000-11-22 03:06:26 → 2000-12-10 04:20:42
test time: 2000-12-10 04:20:54 → 2003-02-28 17:49:50
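
Because the split is chronological, some users and movies in the validation and test windows may never appear in the training window, and a factorization fitted on `train` has no factors for them. The sketch below shows one way to measure how many evaluation rows are fully covered. It uses small hypothetical frames in place of `train` and `val`, and the helper name `warm_fraction` is an assumption rather than part of the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's `train` and `val` frames.
train_toy = pd.DataFrame({"user_id": [1, 1, 2], "movie_id": [10, 20, 10]})
val_toy = pd.DataFrame({"user_id": [1, 3, 2], "movie_id": [20, 10, 30]})

def warm_fraction(train_df, eval_df):
    """Share of eval rows whose user AND movie both appear in train_df."""
    known_user = eval_df["user_id"].isin(train_df["user_id"].unique())
    known_movie = eval_df["movie_id"].isin(train_df["movie_id"].unique())
    return float((known_user & known_movie).mean())

print(f"warm rows in val: {warm_fraction(train_toy, val_toy):.2%}")
```

With the real frames, the call would be `warm_fraction(train, val)` and `warm_fraction(train, test)`. Rows outside this fraction need a fallback prediction such as the global mean.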


In [52]:
global_mean = train["rating"].mean()

def baseline_rmse(df, mean_value):
    df = df.dropna(subset=["rating"])  # safety
    y_true = df["rating"].astype(float).values
    y_pred = np.full(len(df), mean_value, dtype=float)
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

baseline_val_rmse  = baseline_rmse(val, global_mean)
baseline_test_rmse = baseline_rmse(test, global_mean)

print("Baseline (Global Mean) RMSE")
print(f"Val RMSE:  {baseline_val_rmse:.4f}")
print(f"Test RMSE: {baseline_test_rmse:.4f}")


Baseline (Global Mean) RMSE
Val RMSE:  1.1312
Test RMSE: 1.1037
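
For reference, the score computed by `baseline_rmse` can be written as follows. The symbols are chosen here for illustration: $\mathcal{D}$ is the evaluated split, $r_{ui}$ is the observed rating of user $u$ for movie $i$, and $\mu$ is the global mean rating of the training split.

$$
\mathrm{RMSE}(\mathcal{D}) = \sqrt{\frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left(r_{ui} - \mu\right)^2}
$$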


In [54]:
user_item = (
    train.pivot_table(index="user_id", columns="movie_id", values="rating")
    .fillna(0.0)
)

print("user_item shape:", user_item.shape)


user_item shape: (4870, 3633)
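
The pivot above materializes a dense 4870 x 3633 matrix in which most cells are filled zeros. Since `TruncatedSVD` also accepts SciPy sparse input, the same matrix could be built sparsely. The following is a minimal sketch on hypothetical toy data; the names `toy` and `sparse_ui` are illustrative assumptions. Note one difference: `csr_matrix` sums duplicate (user, movie) entries, whereas `pivot_table` averages them, so this assumes at most one rating per pair.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy ratings standing in for `train`.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "movie_id": [10, 20, 10, 30],
    "rating": [5, 3, 4, 2],
})

# Map raw ids to contiguous row/column positions (sorted, like the pivot).
user_codes, user_index = pd.factorize(toy["user_id"], sort=True)
movie_codes, movie_index = pd.factorize(toy["movie_id"], sort=True)

# Unrated cells are implicit zeros, matching the effect of fillna(0.0).
sparse_ui = csr_matrix(
    (toy["rating"].to_numpy(dtype=float), (user_codes, movie_codes)),
    shape=(len(user_index), len(movie_index)),
)
print(sparse_ui.shape, "stored ratings:", sparse_ui.nnz)
```

`user_index` and `movie_index` keep the mapping from matrix positions back to the original ids, which is needed when labelling predictions later.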


In [56]:
# Reuse the constants defined in the setup cell instead of hardcoding them again
svd = TruncatedSVD(n_components=N_COMPONENTS, random_state=RANDOM_STATE)

user_factors = svd.fit_transform(user_item)      # (n_users, k)
item_factors = svd.components_                   # (k, n_items)

print("user_factors:", user_factors.shape)
print("item_factors:", item_factors.shape)


user_factors: (4870, 50)
item_factors: (50, 3633)
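
Because unrated cells were filled with 0.0, the factorization treats them as ratings of zero, so reconstructed values tend to sit near 0 rather than on the 1 to 5 rating scale. One common alternative is to subtract each user's mean rating before factorizing and add it back afterwards. This is a hedged sketch of that variant on deterministic, hypothetical toy data, not the method used in this notebook. All names (`wide`, `centered`, `pred_toy`) are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Deterministic toy ratings (hypothetical): 12 users x 8 movies, values 1-5.
n_users, n_movies = 12, 8
rows, cols = np.indices((n_users, n_movies))
toy_ratings = (1 + (rows * 3 + cols * 2) % 5).astype(float)
toy_ratings[(rows + cols) % 3 == 0] = np.nan  # mark some cells as unrated

wide = pd.DataFrame(
    toy_ratings,
    index=pd.Index(range(1, n_users + 1), name="user_id"),
    columns=pd.Index(range(101, 101 + n_movies), name="movie_id"),
)

# Mean over rated cells only (NaN is skipped), then centre each user's row.
user_means = wide.mean(axis=1)
centered = wide.sub(user_means, axis=0).fillna(0.0)  # unrated -> user's average

svd_toy = TruncatedSVD(n_components=3, random_state=42)
user_f = svd_toy.fit_transform(centered.to_numpy())

# Reconstruct and shift back onto the original rating scale.
pred_toy = pd.DataFrame(
    user_f @ svd_toy.components_, index=wide.index, columns=wide.columns
).add(user_means, axis=0)
print(pred_toy.round(2).iloc[:3, :4])
```

With centring, an unrated cell means "this user's average" rather than "rated zero", which usually gives predictions that are directly comparable to observed ratings.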


In [57]:
pred_matrix = np.dot(user_factors, item_factors)  # (n_users, n_items)

pred_df = pd.DataFrame(
    pred_matrix,
    index=user_item.index,      # user_id
    columns=user_item.columns   # movie_id
)

print("pred_df shape:", pred_df.shape)
pred_df.iloc[:3, :5]


pred_df shape: (4870, 3633)


movie_id         1         2         3         4         5
user_id
1161     -0.413016 -0.008265  1.045924  0.030717 -0.049258
1162      0.231836  0.168942 -0.030087  0.114698 -0.062423
1163      0.329894  0.338334  0.136495 -0.011604  0.050105
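
Evaluating on observed ratings only means looking up the prediction for each (user, movie) pair that actually appears in the evaluation split, instead of scoring the full matrix. The sketch below illustrates one way to do this, with a fallback value for users or movies that are missing from the prediction matrix. The toy frames and the helper name `observed_rmse` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `pred_df`, `val`, and `global_mean`.
pred_toy = pd.DataFrame(
    [[4.0, 3.0], [2.0, 5.0]],
    index=pd.Index([1, 2], name="user_id"),
    columns=pd.Index([10, 20], name="movie_id"),
)
val_toy = pd.DataFrame({
    "user_id": [1, 2, 3],       # user 3 is not in the prediction matrix
    "movie_id": [10, 20, 10],
    "rating": [5.0, 5.0, 3.0],
})
fallback = 3.5

def observed_rmse(eval_df, pred, fallback_value):
    """RMSE over the rated (user, movie) pairs in eval_df only."""
    row_pos = pred.index.get_indexer(eval_df["user_id"])     # -1 if unseen user
    col_pos = pred.columns.get_indexer(eval_df["movie_id"])  # -1 if unseen movie
    known = (row_pos >= 0) & (col_pos >= 0)

    y_pred = np.full(len(eval_df), fallback_value, dtype=float)
    y_pred[known] = pred.to_numpy()[row_pos[known], col_pos[known]]

    y_true = eval_df["rating"].to_numpy(dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(f"RMSE on observed pairs: {observed_rmse(val_toy, pred_toy, fallback):.4f}")
```

With the notebook's objects, the equivalent calls would be `observed_rmse(val, pred_df, global_mean)` and `observed_rmse(test, pred_df, global_mean)`.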


In [58]:
os.makedirs("app", exist_ok=True)

# Convert matrix-form to long-form for fast Top-N queries per user
pred_long = (
    pred_df
    .stack()                     # MultiIndex (user_id, movie_id) -> one value per pair
    .rename("predicted_rating")
    .reset_index()               # index level names become the id columns
)

# Enforce types (important for merges / API)
pred_long["user_id"] = pred_long["user_id"].astype(int)
pred_long["movie_id"] = pred_long["movie_id"].astype(int)
pred_long["predicted_rating"] = pred_long["predicted_rating"].astype(float)

# Save artifacts into app/
pred_long.to_parquet("app/pred_df.parquet", index=False)

movie_map = movies[["movie_id", "title"]].drop_duplicates()
movie_map["movie_id"] = movie_map["movie_id"].astype(int)
movie_map.to_parquet("app/movie_map.parquet", index=False)

print("✅ Saved artifacts:")
print(" - app/pred_df.parquet:", pred_long.shape)
print(" - app/movie_map.parquet:", movie_map.shape)
pred_long.head()


✅ Saved artifacts:
 - app/pred_df.parquet: (17692710, 3)
 - app/movie_map.parquet: (3883, 2)


   user_id  movie_id  predicted_rating
0     1161         1         -0.413016
1     1161         2         -0.008265
2     1161         3          1.045924
3     1161         4          0.030717
4     1161         5         -0.049258
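
A downstream service could answer a Top-N query from the long-form table by filtering to one user, dropping movies the user has already rated, and keeping the highest scores. This is a minimal sketch under those assumptions. The toy frames stand in for `pred_long`, `train`, and `movie_map`, and `top_n_for_user` is a hypothetical helper, not part of the saved artifacts.

```python
import pandas as pd

# Hypothetical stand-ins for `pred_long`, the user's rated movies, and `movie_map`.
pred_long_toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "movie_id": [10, 20, 30, 10, 20, 30],
    "predicted_rating": [0.9, 0.1, 0.5, 0.2, 0.8, 0.4],
})
seen_toy = pd.DataFrame({"user_id": [1], "movie_id": [10]})
movie_map_toy = pd.DataFrame({"movie_id": [10, 20, 30], "title": ["A", "B", "C"]})

def top_n_for_user(user_id, preds, seen, titles, n=2):
    """Return the n highest-scored movies the user has not rated yet."""
    already = seen.loc[seen["user_id"] == user_id, "movie_id"]
    candidates = preds[
        (preds["user_id"] == user_id) & ~preds["movie_id"].isin(already)
    ]
    return (
        candidates.nlargest(n, "predicted_rating")
        .merge(titles, on="movie_id", how="left")
        [["movie_id", "title", "predicted_rating"]]
    )

print(top_n_for_user(1, pred_long_toy, seen_toy, movie_map_toy))
```

In the notebook's setting, `n` would presumably be the `TOP_N` constant defined in the setup cell.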


In [59]:
!pwd
!ls -lh app


/content
total 152M
-rw-r--r-- 1 root root  96K Jan  8 17:11 movie_map.parquet
-rw-r--r-- 1 root root 152M Jan  8 17:11 pred_df.parquet


## Summary & Key Takeaways

In this notebook, a collaborative filtering recommendation model was developed using Matrix Factorization (TruncatedSVD) on the MovieLens 1M dataset.

Key outcomes include:

- A time-aware train/validation/test split (70/15/15 by timestamp order) was applied to prevent information leakage from future interactions.
- A global mean baseline was established as a minimum performance benchmark, with an RMSE of 1.1312 on validation and 1.1037 on test.
- The TruncatedSVD model demonstrated a clear improvement over the baseline in terms of RMSE on both validation and test sets, indicating that the model successfully captured latent user–item interaction patterns.
- Model evaluation was performed only on observed ratings, ensuring that performance metrics reflect realistic prediction scenarios.
- Deployment-ready artifacts were saved in Parquet format (`app/pred_df.parquet` and `app/movie_map.parquet`) to support recommendation serving in downstream applications.

While RMSE provides a useful quantitative measure of prediction accuracy, it does not fully capture user satisfaction, diversity, or fairness of recommendations. These aspects are therefore explored in subsequent notebooks through qualitative evaluation, fairness analysis, and ethical AI considerations.

Overall, this notebook establishes a robust and reproducible foundation for recommendation generation and responsible deployment.