# Notebook 03: Matrix Factorization (SVD) + Baseline Evaluation

## Objective
Build and evaluate a collaborative filtering recommendation model using Matrix Factorization (TruncatedSVD) on MovieLens 1M.

This notebook covers:
- Time-based train/validation/test split (to avoid data leakage)
- Baseline RMSE using global mean rating (must-have)
- Matrix Factorization using TruncatedSVD
- RMSE evaluation on validation and test sets


In [8]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import TruncatedSVD

In [9]:
# Upload ratings.dat from MovieLens 1M
from google.colab import files
uploaded = files.upload()

Saving ratings.dat to ratings (2).dat


In [10]:
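# MovieLens 1M uses "::" as its field separator. Multi-character separators
# are not supported by the C parser, hence engine="python".
# Note: this reads "ratings.dat" even if Colab saved a re-upload under a new name.
ratings = pd.read_csv(
    "ratings.dat",
    sep="::",
    engine="python",
    names=["user_id", "movie_id", "rating", "timestamp"]  # the file has no header row
)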

ratings["datetime"] = pd.to_datetime(ratings["timestamp"], unit="s")

ratings.head()


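|   | user_id | movie_id | rating | timestamp | datetime |
|---|---------|----------|--------|-----------|----------|
| 0 | 1 | 1193 | 5 | 978300760 | 2000-12-31 22:12:40 |
| 1 | 1 | 661 | 3 | 978302109 | 2000-12-31 22:35:09 |
| 2 | 1 | 914 | 3 | 978301968 | 2000-12-31 22:32:48 |
| 3 | 1 | 3408 | 4 | 978300275 | 2000-12-31 22:04:35 |
| 4 | 1 | 2355 | 5 | 978824291 | 2001-01-06 23:38:11 |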


In [14]:
print("Rows:", len(ratings))
print("Unique users:", ratings["user_id"].nunique())
print("Unique movies:", ratings["movie_id"].nunique())
print("Rating scale:", ratings["rating"].min(), "to", ratings["rating"].max())


Rows: 1000209
Unique users: 6040
Unique movies: 3706
Rating scale: 1 to 5


In [17]:
ratings_sorted = ratings.sort_values("datetime").reset_index(drop=True)

n = len(ratings_sorted)
train_end = int(0.8 * n)
val_end = int(0.9 * n)

train = ratings_sorted.iloc[:train_end].copy()
val   = ratings_sorted.iloc[train_end:val_end].copy()
test  = ratings_sorted.iloc[val_end:].copy()

print("Train:", train.shape, "Val:", val.shape, "Test:", test.shape)
print("Train last date:", train["datetime"].max())
print("Val last date:", val["datetime"].max())
print("Test last date:", test["datetime"].max())


Train: (800167, 5) Val: (100021, 5) Test: (100021, 5)
Train last date: 2000-12-02 14:52:18
Val last date: 2000-12-29 23:42:47
Test last date: 2003-02-28 17:49:50


## Why time-based splitting?

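A recommender deployed today can only learn from ratings that already exist, so the evaluation should not let the model train on “future” interactions.
A chronological split trains on the earliest 80% of ratings and evaluates on the later 20%, which mirrors deployment and prevents temporal data leakage.
The trade-off is that users and movies that first appear after the training cutoff have no training history: here the training matrix covers 5,400 of the 6,040 users.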
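
A minimal sketch of the idea, using made-up toy data rather than MovieLens. The helper name `chronological_split` and the toy values are assumptions for illustration only. It orders rows by time, slices them, and then checks that no training row is later than any validation or test row. The last lines measure how many validation rows belong to users never seen in training.

```python
import pandas as pd

def chronological_split(df, time_col, train_frac=0.8, val_frac=0.1):
    """Hypothetical helper: oldest rows go to train, newest to test."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    n = len(ordered)
    train_end = int(train_frac * n)
    val_end = train_end + int(val_frac * n)
    return (
        ordered.iloc[:train_end],
        ordered.iloc[train_end:val_end],
        ordered.iloc[val_end:],
    )

# Toy ratings (invented for illustration, not real MovieLens rows).
toy = pd.DataFrame({
    "user_id":   [1, 2, 1, 3, 2, 4, 1, 5, 3, 6],
    "movie_id":  [10, 10, 11, 12, 11, 10, 12, 13, 10, 11],
    "rating":    [5, 3, 4, 2, 4, 5, 3, 1, 4, 2],
    "timestamp": [100, 50, 300, 200, 400, 700, 600, 900, 800, 1000],
})

toy_train, toy_val, toy_test = chronological_split(toy, "timestamp")

# Leakage check: every training row precedes every later-split row.
assert toy_train["timestamp"].max() <= toy_val["timestamp"].min()
assert toy_val["timestamp"].max() <= toy_test["timestamp"].min()

# Share of validation rows whose user has no training history (cold start).
seen_users = set(toy_train["user_id"])
val_cold_share = (~toy_val["user_id"].isin(seen_users)).mean()
print("Cold-start share in val:", val_cold_share)
```

Running the same cold-start check on the real `train` and `val` frames would show how much of the validation set the model cannot personalize at all.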


In [18]:
global_mean = train["rating"].mean()

def baseline_rmse(df, mean_value):
    y_true = df["rating"].values
    y_pred = np.full(len(df), mean_value)
    return np.sqrt(mean_squared_error(y_true, y_pred))

baseline_val_rmse = baseline_rmse(val, global_mean)
baseline_test_rmse = baseline_rmse(test, global_mean)

print("Baseline (Global Mean) RMSE")
print(f"Val RMSE:  {baseline_val_rmse:.4f}")
print(f"Test RMSE: {baseline_test_rmse:.4f}")


Baseline (Global Mean) RMSE
Val RMSE:  1.1191
Test RMSE: 1.0893


In [19]:
user_item = train.pivot_table(
    index="user_id",
    columns="movie_id",
    values="rating"
)

user_item.shape


(5400, 3662)

In [21]:
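# TruncatedSVD cannot handle NaNs, so unrated user-movie pairs are imputed with the training global mean
user_item_filled = user_item.fillna(global_mean)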


In [22]:
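n_components = 50  # number of latent factors; tune later (e.g. 20, 50, 100)
svd = TruncatedSVD(n_components=n_components, random_state=42)

# fit_transform returns each user's coordinates in the latent space (U * Sigma);
# components_ holds the movie-side factors (V^T), shape (n_components, n_movies)
user_factors = svd.fit_transform(user_item_filled)
item_factors = svd.components_

# Low-rank reconstruction of the filled matrix: one predicted rating per user-movie pair
pred_matrix = np.dot(user_factors, item_factors)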

pred_df = pd.DataFrame(
    pred_matrix,
    index=user_item_filled.index,
    columns=user_item_filled.columns
)
pred_df.shape


(5400, 3662)

In [23]:
def rmse_from_pred_df(df, pred_df):
    # keep only rows where user and movie exist in train matrix
    mask = df["user_id"].isin(pred_df.index) & df["movie_id"].isin(pred_df.columns)
    d = df.loc[mask, ["user_id", "movie_id", "rating"]].copy()

    # vectorized prediction lookup
    y_true = d["rating"].values
    y_pred = pred_df.to_numpy()[
        pred_df.index.get_indexer(d["user_id"]),
        pred_df.columns.get_indexer(d["movie_id"])
    ]

    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return rmse, len(d)


In [24]:
svd_val_rmse, val_n = rmse_from_pred_df(val, pred_df)
svd_test_rmse, test_n = rmse_from_pred_df(test, pred_df)

print("Matrix Factorization (TruncatedSVD) RMSE")
print(f"Val RMSE:  {svd_val_rmse:.4f}  (n={val_n})")
print(f"Test RMSE: {svd_test_rmse:.4f}  (n={test_n})")

print("\nImprovement vs Baseline")
print(f"Val improvement:  {baseline_val_rmse - svd_val_rmse:.4f}")
print(f"Test improvement: {baseline_test_rmse - svd_test_rmse:.4f}")


Matrix Factorization (TruncatedSVD) RMSE
Val RMSE:  1.0573  (n=23841)
Test RMSE: 1.0490  (n=80608)

Improvement vs Baseline
Val improvement:  0.0618
Test improvement: 0.0402
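
The SVD RMSE above is computed only on rows whose user and movie both appear in the training matrix (n=23841 for validation, n=80608 for test), while the baseline RMSE uses all 100021 rows of each split. One way to put both models on the same rows is to score every row and fall back to a constant when the model has no prediction. The sketch below is an assumption about how that could be done, not part of the original notebook. The function name `rmse_with_fallback` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def rmse_with_fallback(df, pred_df, fallback_value):
    """RMSE over every row of df; unseen users or movies get fallback_value."""
    # get_indexer returns -1 for labels that are missing from the index
    rows = pred_df.index.get_indexer(df["user_id"])
    cols = pred_df.columns.get_indexer(df["movie_id"])
    known = (rows >= 0) & (cols >= 0)

    y_pred = np.full(len(df), fallback_value, dtype=float)
    y_pred[known] = pred_df.to_numpy()[rows[known], cols[known]]

    y_true = df["rating"].to_numpy(dtype=float)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    coverage = float(known.mean())  # share of rows with a model prediction
    return rmse, coverage

# Toy prediction matrix and evaluation rows (invented for illustration).
toy_pred = pd.DataFrame([[4.0, 3.0], [2.0, 5.0]], index=[1, 2], columns=[10, 11])
toy_eval = pd.DataFrame({
    "user_id":  [1, 2, 9],   # user 9 is not in the prediction matrix
    "movie_id": [10, 11, 10],
    "rating":   [4, 4, 3],
})

toy_rmse, toy_coverage = rmse_with_fallback(toy_eval, toy_pred, fallback_value=3.0)
print(f"RMSE: {toy_rmse:.4f}  coverage: {toy_coverage:.2f}")
```

In the notebook, calling it as `rmse_with_fallback(val, pred_df, global_mean)` would give an SVD score over the same rows as `baseline_val_rmse`.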


In [25]:
explained = svd.explained_variance_ratio_.sum()
print("Explained variance ratio (sum):", explained)


Explained variance ratio (sum): 0.24924648898899054


## Conclusion

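- The global-mean baseline sets the benchmark to beat: RMSE 1.1191 on validation and 1.0893 on test.
- TruncatedSVD with 50 components learns latent user/movie factors and produces personalized predictions.
- The SVD model reaches RMSE 1.0573 on validation and 1.0490 on test, an improvement of 0.0618 and 0.0402 over the baseline, which suggests it captures some user-item structure.
- The comparison is not strictly like-for-like: SVD is scored only on rows whose user and movie both appear in the training matrix (23,841 of 100,021 validation rows and 80,608 of 100,021 test rows), while the baseline is scored on every row.
- The improvement is small. Likely causes are sparsity (unrated pairs are filled with the global mean), the limited number of latent factors (the 50 components explain about 25% of the variance in the filled matrix), and the lack of user/item bias terms. Tuning `n_components` and using MF methods that fit only the observed ratings can help.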
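
To make the "bias terms" point concrete, here is a minimal sketch of a bias-only predictor: global mean plus a per-movie offset plus a per-user offset. The offsets are estimated as plain averages of residuals, which is one simple choice among several and not something the notebook above does. The function names and toy data are hypothetical.

```python
import pandas as pd

def fit_bias_baseline(train_df):
    """Estimate global mean, per-movie offsets, and per-user offsets."""
    mu = train_df["rating"].mean()
    # Movie offset: how far the movie's ratings sit from the global mean on average
    item_bias = (train_df["rating"] - mu).groupby(train_df["movie_id"]).mean()
    # User offset: what remains on average after removing mu and the movie offset
    resid = train_df["rating"] - mu - train_df["movie_id"].map(item_bias)
    user_bias = resid.groupby(train_df["user_id"]).mean()
    return mu, user_bias, item_bias

def predict_bias_baseline(df, mu, user_bias, item_bias):
    """Predict mu + b_u + b_i; unseen users or movies get an offset of 0."""
    b_u = df["user_id"].map(user_bias).fillna(0.0)
    b_i = df["movie_id"].map(item_bias).fillna(0.0)
    return (mu + b_u + b_i).clip(1, 5)  # keep predictions on the 1-5 rating scale

# Toy data (invented for illustration).
toy_bias_train = pd.DataFrame({
    "user_id":  [1, 1, 2, 2],
    "movie_id": [10, 11, 10, 11],
    "rating":   [5, 3, 3, 1],
})
toy_bias_eval = pd.DataFrame({
    "user_id":  [1, 2, 9, 1],     # user 9 is unseen
    "movie_id": [10, 11, 10, 99], # movie 99 is unseen
})

mu, user_bias, item_bias = fit_bias_baseline(toy_bias_train)
toy_bias_pred = predict_bias_baseline(toy_bias_eval, mu, user_bias, item_bias)
print(toy_bias_pred.tolist())
```

A predictor like this is a stronger baseline than the global mean alone, and its offsets can also be subtracted from the ratings before factorization so that the latent factors only need to model what the biases leave unexplained.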
