# MovieLens Linear Models Demo

This notebook demonstrates a clean, **deterministic** baseline for predicting movie average ratings using simple linear models:

- **Download** the MovieLens *ml-latest-small* dataset
- Build **movie-level features** (counts, years, one-hot genres)
- **Random train/test** split (80/20, with a fixed seed)
- Train **Linear**, **Ridge** (α=0.1), **Lasso** (α=0.001) — *no grid search*
- **Interpret Coefficients** to understand feature effects
- **Time-based Train/Test Split** to show a more realistic scenario

> Tip: This is a *movie-level* demo. Real recommenders are user–movie level and benefit from collaborative features (user/movie factors).


## 0. Setup & Utilities


In [4]:
import os, io, zipfile, math, re, json, textwrap
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_squared_error
from datetime import datetime

np.set_printoptions(suppress=True)
pd.set_option("display.max_columns", 200)

def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def r2_rmse(y_true, y_pred):
    return r2_score(y_true, y_pred), rmse(y_true, y_pred)


## 1. Download MovieLens (ml-latest-small)
If files already exist locally, this step will skip the download.


In [5]:
import requests

DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)
ZIP_URL = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
ZIP_PATH = DATA_DIR / "ml-latest-small.zip"
EXTRACT_DIR = DATA_DIR / "ml-latest-small"

if not EXTRACT_DIR.exists():
    if not ZIP_PATH.exists():
        print("Downloading:", ZIP_URL)
        r = requests.get(ZIP_URL, timeout=60)
        r.raise_for_status()
        ZIP_PATH.write_bytes(r.content)
        print("Saved:", ZIP_PATH)

    print("Extracting:", ZIP_PATH)
    with zipfile.ZipFile(ZIP_PATH, "r") as zf:
        zf.extractall(DATA_DIR)
    print("Extracted to:", EXTRACT_DIR)
else:
    print("Already available:", EXTRACT_DIR)

# Load CSVs
ratings = pd.read_csv(EXTRACT_DIR / "ratings.csv")
movies  = pd.read_csv(EXTRACT_DIR / "movies.csv")

print("ratings:", ratings.shape, "| movies:", movies.shape)
display(ratings.head())
display(movies.head())


Already available: data\ml-latest-small
ratings: (100836, 4) | movies: (9742, 3)


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


## 2. Build Movie-Level Features

We aggregate **ratings** per `movieId` to compute:
- `avg_rating` — mean rating
- `n_ratings` — number of ratings
- `first_year`, `last_year` — min/max rating year (from timestamps)

We also one-hot encode the **genres** from `movies.csv`.


In [6]:
# Convert timestamps to years for first/last rating year
ratings["year"] = pd.to_datetime(ratings["timestamp"], unit="s").dt.year

agg = ratings.groupby("movieId").agg(
    avg_rating=("rating", "mean"),
    n_ratings=("rating", "size"),
    first_year=("year", "min"),
    last_year=("year", "max"),
).reset_index()

# Genres one-hot
def split_genres(g):
    if pd.isna(g) or g == "(no genres listed)":
        return []
    return g.split("|")

genre_sets = movies["genres"].apply(split_genres)
unique_genres = sorted(set(g for s in genre_sets for g in s))
# Remove empty label if present
unique_genres = [g for g in unique_genres if g and g != "(no genres listed)"]

genre_mat = {f"genre_{g}": [] for g in unique_genres}
for gs in genre_sets:
    gs_set = set(gs)
    for g in unique_genres:
        genre_mat[f"genre_{g}"].append(g in gs_set)

genre_df = pd.DataFrame(genre_mat)
movies_expanded = pd.concat([movies[["movieId", "title"]], genre_df], axis=1)

# Merge features
movie_features = agg.merge(movies_expanded, on="movieId", how="left")

# Optional: extract release year from title "(YYYY)"
def extract_release_year(title):
    m = re.search(r"\((\d{4})\)", str(title))
    return int(m.group(1)) if m else np.nan

movie_features["release_year"] = movie_features["title"].apply(extract_release_year)

display(movie_features.head())
print(movie_features.shape)


Unnamed: 0,movieId,avg_rating,n_ratings,first_year,last_year,title,genre_Action,genre_Adventure,genre_Animation,genre_Children,genre_Comedy,genre_Crime,genre_Documentary,genre_Drama,genre_Fantasy,genre_Film-Noir,genre_Horror,genre_IMAX,genre_Musical,genre_Mystery,genre_Romance,genre_Sci-Fi,genre_Thriller,genre_War,genre_Western,release_year
0,1,3.92093,215,1996,2018,Toy Story (1995),False,True,True,True,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,1995.0
1,2,3.431818,110,1996,2018,Jumanji (1995),False,True,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,1995.0
2,3,3.259615,52,1996,2017,Grumpier Old Men (1995),False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,1995.0
3,4,2.357143,7,1996,2009,Waiting to Exhale (1995),False,False,False,False,True,False,False,True,False,False,False,False,False,False,True,False,False,False,False,1995.0
4,5,3.071429,49,1996,2018,Father of the Bride Part II (1995),False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,1995.0


(9724, 26)
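The explicit loop in the cell above makes the encoding easy to follow. As a minimal sketch of a more compact alternative, pandas' `Series.str.get_dummies` can split a delimited string column into indicator columns in one call. The tiny `demo_movies` frame and the `genre_demo` name are illustrative assumptions, not part of the notebook's data.

```python
import pandas as pd

# Hypothetical three-row stand-in for movies.csv (illustration only)
demo_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Adventure|Comedy", "Comedy|Romance", "(no genres listed)"],
})

# Split on "|" and build one 0/1 indicator column per distinct label
genre_demo = demo_movies["genres"].str.get_dummies(sep="|")

# Drop the placeholder label so unlabeled movies become all-zero rows
genre_demo = genre_demo.drop(columns=["(no genres listed)"], errors="ignore")

# Match the notebook's column naming and boolean dtype
genre_demo = genre_demo.add_prefix("genre_").astype(bool)

print(genre_demo)
```

The result has the same shape as `genre_df`: one boolean column per genre, with unlabeled movies encoded as all `False`.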


## 3. Feature Matrix (X) and Target (y)

We use numeric predictors:
- `n_ratings_log1p` (log-transformed count)
- `first_year`
- `last_year`

…and all one-hot `genre_*` columns.

Target: **`avg_rating`**.


In [7]:
target_col = "avg_rating"
num_cols = ["n_ratings", "first_year", "last_year"]
genre_cols = [c for c in movie_features.columns if c.startswith("genre_")]

X = movie_features[num_cols + genre_cols].copy()
X["n_ratings_log1p"] = np.log1p(X["n_ratings"])
feature_cols = ["n_ratings_log1p", "first_year", "last_year"] + genre_cols
X = X[feature_cols]
y = movie_features[target_col].copy()

print("X shape:", X.shape, "| y shape:", y.shape)
display(X.head())


X shape: (9724, 22) | y shape: (9724,)


Unnamed: 0,n_ratings_log1p,first_year,last_year,genre_Action,genre_Adventure,genre_Animation,genre_Children,genre_Comedy,genre_Crime,genre_Documentary,genre_Drama,genre_Fantasy,genre_Film-Noir,genre_Horror,genre_IMAX,genre_Musical,genre_Mystery,genre_Romance,genre_Sci-Fi,genre_Thriller,genre_War,genre_Western
0,5.375278,1996,2018,False,True,True,True,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False
1,4.70953,1996,2018,False,True,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False
2,3.970292,1996,2017,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False
3,2.079442,1996,2009,False,False,False,False,True,False,False,True,False,False,False,False,False,False,True,False,False,False,False
4,3.912023,1996,2018,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False
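To make the `n_ratings_log1p` transform concrete, the sketch below applies `np.log1p` (that is, log(1 + x)) to the five `n_ratings` values from the Section 2 preview. The variable names are illustrative.

```python
import numpy as np

# n_ratings for the first five movies in the Section 2 preview
n_ratings = np.array([215, 110, 52, 7, 49])

# log1p(x) = log(1 + x): compresses large counts and is defined at x = 0
n_ratings_log1p = np.log1p(n_ratings)

print(n_ratings_log1p.round(6))
```

These values match the first five rows of `n_ratings_log1p` in the preview above. A roughly 30-fold gap in raw counts (215 vs 7) becomes about a 2.6-fold gap after the transform, which keeps a few heavily rated movies from dominating the fit.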


## 4. Random Train/Test Split

- Random `train/test` split (80/20).
- Train **Linear**, **Ridge (α=0.1)**, **Lasso (α=0.001)** — fixed alphas for a simple, reproducible baseline.
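As a rough illustration of how the two penalties differ, the sketch below fits Ridge and Lasso on synthetic data in which only the first feature matters. The data, the seed, and the deliberately large `alpha` values are assumptions chosen to make the effect visible; they are not the settings used in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)

# Synthetic data: 10 features, only feature 0 drives the target
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(size=200)

# Large alphas (illustrative only) exaggerate the penalties
ridge_demo = Ridge(alpha=10.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Ridge (L2) shrinks coefficients toward zero but rarely to exactly zero;
# Lasso (L1) can set weak coefficients to exactly zero
print("Ridge zero coefficients:", int(np.sum(ridge_demo.coef_ == 0)))
print("Lasso zero coefficients:", int(np.sum(lasso_demo.coef_ == 0)))
```

With the notebook's small alphas (0.1 and 0.001), both penalties are weak, which is consistent with the three models scoring almost identically in the results below.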


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np, pandas as pd

# --- split only into train/test ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# --- columns: scale numeric only; passthrough genre_* ---
num_cols = ["n_ratings_log1p", "first_year", "last_year"]
cat_cols = [c for c in X.columns if c.startswith("genre_")]

pre = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),  # scale numeric
        ("cat", "passthrough", cat_cols),     # keep booleans as-is
    ],
    remainder="drop",
)

# --- fixed alphas ---
RIDGE_ALPHA = 0.1
LASSO_ALPHA = 0.001

# --- build pipelines & fit on TRAIN ---
lin_full   = Pipeline([("pre", pre), ("model", LinearRegression())]).fit(X_train, y_train)
ridge_full = Pipeline([("pre", pre), ("model", Ridge(alpha=RIDGE_ALPHA))]).fit(X_train, y_train)
lasso_full = Pipeline([("pre", pre), ("model", Lasso(alpha=LASSO_ALPHA, max_iter=20000))]).fit(X_train, y_train)

def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def metrics_row(pipe, Xtr, ytr, Xte, yte):
    ytr_p = pipe.predict(Xtr)
    yte_p = pipe.predict(Xte)
    return {
        "R2_train": r2_score(ytr, ytr_p),
        "RMSE_train": rmse(ytr, ytr_p),
        "R2_test": r2_score(yte, yte_p),
        "RMSE_test": rmse(yte, yte_p),
    }

# --- single-row metric tables per model (one fixed alpha each, so not a true sweep) ---
ridge_m = metrics_row(ridge_full, X_train, y_train, X_test, y_test)
ridge_table = pd.DataFrame([{"Alpha": RIDGE_ALPHA, **ridge_m}])

lasso_m = metrics_row(lasso_full, X_train, y_train, X_test, y_test)
lasso_table = pd.DataFrame([{"Alpha": LASSO_ALPHA, **lasso_m}])

# --- summary table (test only) ---
lin_m = metrics_row(lin_full, X_train, y_train, X_test, y_test)
summary = pd.DataFrame([
    {"Model": "Linear",                   "Alpha": None,
     "R2_test": lin_m["R2_test"],   "RMSE_test": lin_m["RMSE_test"]},
    {"Model": f"Ridge (α={RIDGE_ALPHA})", "Alpha": RIDGE_ALPHA,
     "R2_test": ridge_m["R2_test"], "RMSE_test": ridge_m["RMSE_test"]},
    {"Model": f"Lasso (α={LASSO_ALPHA})", "Alpha": LASSO_ALPHA,
     "R2_test": lasso_m["R2_test"], "RMSE_test": lasso_m["RMSE_test"]},
])

print("— Ridge alpha sweep —")
display(ridge_table.round(4))
print("— Lasso alpha sweep —")
display(lasso_table.round(4))
print("— Summary (Test) —")
display(summary.sort_values("RMSE_test").round(4))


— Ridge alpha sweep —


|   | Alpha | R2_train | RMSE_train | R2_test | RMSE_test |
|---|-------|----------|------------|---------|-----------|
| 0 | 0.1   | 0.121    | 0.8233     | 0.1238  | 0.7825    |


— Lasso alpha sweep —


|   | Alpha | R2_train | RMSE_train | R2_test | RMSE_test |
|---|-------|----------|------------|---------|-----------|
| 0 | 0.001 | 0.1204   | 0.8236     | 0.124   | 0.7824    |


— Summary (Test) —


|   | Model           | Alpha | R2_test | RMSE_test |
|---|-----------------|-------|---------|-----------|
| 2 | Lasso (α=0.001) | 0.001 | 0.124   | 0.7824    |
| 1 | Ridge (α=0.1)   | 0.1   | 0.1238  | 0.7825    |
| 0 | Linear          |       | 0.1238  | 0.7825    |
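A test R² near 0.12 is easier to read against a baseline. The sketch below, on synthetic numbers, checks the identity `R2 = 1 - MSE_model / MSE_mean`, where `MSE_mean` is the error of always predicting the mean of the evaluated targets. All names and data here are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(42)

# Synthetic targets and imperfect predictions (illustration only)
y_true = rng.normal(loc=3.3, scale=0.8, size=500)
y_pred = y_true + rng.normal(scale=0.75, size=500)

# Baseline: always predict the mean of the targets being evaluated
baseline = np.full_like(y_true, y_true.mean())

mse_model = mean_squared_error(y_true, y_pred)
mse_base = mean_squared_error(y_true, baseline)

# R2 is the fraction of the baseline's squared error removed by the model
r2_manual = 1.0 - mse_model / mse_base

print("R2 (sklearn):", round(r2_score(y_true, y_pred), 4))
print("R2 (manual): ", round(r2_manual, 4))
print("RMSE model vs baseline:", round(np.sqrt(mse_model), 4), round(np.sqrt(mse_base), 4))
```

Read this way, R² ≈ 0.124 means the models remove about 12% of the squared error of a constant prediction. Since RMSE scales with sqrt(1 - R²), that is only about a 6% reduction in RMSE.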


## 5. Interpret Coefficients

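Coefficient magnitudes reflect feature influence **under the model’s assumptions**.
The pipeline in Section 4 standardizes the numeric features (`n_ratings_log1p`, `first_year`, `last_year`), so each of their coefficients is the change in predicted rating per one standard deviation of that feature.
The `genre_*` booleans pass through unscaled, so each genre coefficient is the change in predicted rating when that genre is present versus absent. Because the two groups are in different units, compare directions and relative sizes with care.

> Standardizing numeric features also keeps the Ridge/Lasso penalty from depending on each feature's raw units.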


In [13]:
def coef_df(pipeline, cols):
    # For pipelines, we need to access the model inside the pipeline
    if hasattr(pipeline, 'named_steps'):
        # It's a pipeline - get the model from the 'model' step
        model = pipeline.named_steps['model']
    else:
        # It's a direct model
        model = pipeline
    
    coef = getattr(model, "coef_", None)
    if coef is None:
        return pd.DataFrame({"feature": cols, "coef": np.nan})
    return pd.DataFrame({"feature": cols, "coef": coef}).sort_values("coef", ascending=False)

cols = X.columns.tolist()

print("Top positive coefficients (Linear):")
display(coef_df(lin_full, cols).head(10))

print("Top negative coefficients (Linear):")
display(coef_df(lin_full, cols).tail(10))


Top positive coefficients (Linear):


|    | feature           | coef     |
|----|-------------------|----------|
| 9  | genre_Documentary | 0.658918 |
| 5  | genre_Animation   | 0.471264 |
| 12 | genre_Film-Noir   | 0.36782  |
| 10 | genre_Drama       | 0.277678 |
| 0  | n_ratings_log1p   | 0.262867 |
| 20 | genre_War         | 0.244753 |
| 21 | genre_Western     | 0.211233 |
| 1  | first_year        | 0.166606 |
| 17 | genre_Romance     | 0.079528 |
| 16 | genre_Mystery     | 0.078933 |


Top negative coefficients (Linear):


|    | feature         | coef      |
|----|-----------------|-----------|
| 4  | genre_Adventure | 0.00374   |
| 11 | genre_Fantasy   | -0.001804 |
| 7  | genre_Comedy    | -0.05645  |
| 14 | genre_IMAX      | -0.071736 |
| 19 | genre_Thriller  | -0.07836  |
| 18 | genre_Sci-Fi    | -0.084432 |
| 2  | last_year       | -0.10294  |
| 13 | genre_Horror    | -0.182525 |
| 3  | genre_Action    | -0.219125 |
| 6  | genre_Children  | -0.286296 |


In [17]:
# Get the model from the pipeline
linear_model = lin_full.named_steps['model']
intercept = linear_model.intercept_
coefficients = linear_model.coef_

# Create the formula string.
# Note: the pipeline standardizes the numeric features, so their coefficients
# apply to z-scores rather than raw values.
formula = f"{intercept:.4f}"  # Start with intercept

for feature, coef in zip(cols, coefficients):
    if coef != 0:  # Only include non-zero coefficients
        sign = "+" if coef > 0 else "-"
        formula += f" {sign} {abs(coef):.4f} * {feature}"

print("Linear Regression Formula:")
print(f"avg_rating = {formula}")



Linear Regression Formula:
avg_rating = 3.1634 + 0.2629 * n_ratings_log1p + 0.1666 * first_year - 0.1029 * last_year - 0.2191 * genre_Action + 0.0037 * genre_Adventure + 0.4713 * genre_Animation - 0.2863 * genre_Children - 0.0564 * genre_Comedy + 0.0612 * genre_Crime + 0.6589 * genre_Documentary + 0.2777 * genre_Drama - 0.0018 * genre_Fantasy + 0.3678 * genre_Film-Noir - 0.1825 * genre_Horror - 0.0717 * genre_IMAX + 0.0368 * genre_Musical + 0.0789 * genre_Mystery + 0.0795 * genre_Romance - 0.0844 * genre_Sci-Fi - 0.0784 * genre_Thriller + 0.2448 * genre_War + 0.2112 * genre_Western


## 6. Time-based Train/Test Split

Instead of a random split, we split by **first rating year** (`first_year`) to simulate training on *earlier* movies and testing on *later* ones.
We'll pick a cutoff year as the **80th percentile** of `first_year` (you can change this).

> Note: because `avg_rating` aggregates *all* ratings, there is some leakage from post-cutoff ratings into the target. For a truly time-causal setup, compute targets from **pre-cutoff ratings only**; that’s more involved, so we keep it simple here for demonstration.
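As a minimal sketch of the leak-free variant described in the note, the training target can be aggregated from ratings made on or before the cutoff year only. The small `demo_ratings` frame and the `demo_cutoff` value are hypothetical; in the notebook, the same filter would apply to `ratings` before the `groupby`.

```python
import pandas as pd

# Hypothetical ratings: movie 1 is rated both before and after the cutoff
demo_ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2],
    "rating":  [4.0, 5.0, 1.0, 3.0, 2.0],
    "year":    [2014, 2015, 2017, 2016, 2018],
})
demo_cutoff = 2015  # illustrative; the notebook derives its cutoff from a quantile

# Target built from all ratings (includes post-cutoff information)
target_all = demo_ratings.groupby("movieId")["rating"].mean()

# Target built only from ratings made on or before the cutoff year
pre_cutoff = demo_ratings[demo_ratings["year"] <= demo_cutoff]
target_pre = pre_cutoff.groupby("movieId")["rating"].mean()

print("All ratings:\n", target_all)
print("Pre-cutoff only:\n", target_pre)
```

Movie 1's training target moves from about 3.33 to 4.5 once the 2017 rating is excluded. Movie 2, first rated after the cutoff, has no pre-cutoff target at all, so it belongs on the test side.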


In [10]:
# Choose cutoff as 80th percentile of first_year among movies with complete features
mask_complete = (~movie_features["first_year"].isna())
cutoff = int(movie_features.loc[mask_complete, "first_year"].quantile(0.8))
cutoff


2015

In [11]:
# Time-based split
mask_train_time = movie_features["first_year"] <= cutoff
mask_test_time  = movie_features["first_year"]  > cutoff

X_train_time = X.loc[mask_train_time].copy()
y_train_time = y.loc[mask_train_time].copy()
X_test_time  = X.loc[mask_test_time].copy()
y_test_time  = y.loc[mask_test_time].copy()

print("Cutoff year:", cutoff)
print("Train_time:", X_train_time.shape, "| Test_time:", X_test_time.shape)

# Fit fixed-alpha models on the raw (unscaled) features; unlike Section 4, no scaling pipeline is used here
lin_t   = LinearRegression().fit(X_train_time, y_train_time)
ridge_t = Ridge(alpha=0.1).fit(X_train_time, y_train_time)
lasso_t = Lasso(alpha=0.001, max_iter=20000).fit(X_train_time, y_train_time)

# Evaluate
def eval_row(name, model, Xtr, ytr, Xte, yte):
    ytr_p = model.predict(Xtr)
    yte_p = model.predict(Xte)
    return {
        "Model": name,
        "Train RMSE": rmse(ytr, ytr_p),
        "Train R2": r2_score(ytr, ytr_p),
        "Test RMSE": rmse(yte, yte_p),
        "Test R2": r2_score(yte, yte_p),
    }

rows_time = [
    eval_row("Linear (time)", lin_t, X_train_time, y_train_time, X_test_time, y_test_time),
    eval_row("Ridge α=0.1 (time)", ridge_t, X_train_time, y_train_time, X_test_time, y_test_time),
    eval_row("Lasso α=0.001 (time)", lasso_t, X_train_time, y_train_time, X_test_time, y_test_time),
]
pd.DataFrame(rows_time).round(4)


Cutoff year: 2015
Train_time: (7789, 22) | Test_time: (1935, 22)


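|   | Model                | Train RMSE | Train R2 | Test RMSE | Test R2 |
|---|----------------------|------------|----------|-----------|---------|
| 0 | Linear (time)        | 0.7224     | 0.1686   | 1.1236    | 0.0079  |
| 1 | Ridge α=0.1 (time)   | 0.7224     | 0.1686   | 1.1236    | 0.0079  |
| 2 | Lasso α=0.001 (time) | 0.7226     | 0.168    | 1.1236    | 0.0079  |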
