# MovieLens 32M – Matrix Factorization (SVD)

In this notebook, we build a **matrix factorization–based recommendation model** using the MovieLens 32M ratings data.

We will:

- Prepare the ratings data for use with the `surprise` library  
- Train an SVD-based collaborative filtering model  
- Evaluate the model using RMSE on a test set  
- Generate movie recommendations for a given user based on predicted ratings


### Why We Use a Sample of the Data

The full MovieLens 32M dataset contains about 32 million ratings (32,000,204 rows in `ratings.csv`).  
Training matrix factorization (SVD) with `surprise` on all of them is slow and memory-hungry, and can exceed the memory and compute limits of a standard Colab environment.

To keep training efficient while still capturing meaningful patterns, we use a **random sample** of the ratings.  
This is a common and practical approach in real-world recommendation pipelines when prototyping or working under resource constraints.


In [1]:
!pip install scikit-surprise




In [4]:
import pandas as pd
import numpy as np

from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy


In [3]:
!ls


sample_data


In [5]:
!wget https://files.grouplens.org/datasets/movielens/ml-32m.zip
!unzip ml-32m.zip -d ml32m


--2025-12-04 05:29:47--  https://files.grouplens.org/datasets/movielens/ml-32m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.96.204
Connecting to files.grouplens.org (files.grouplens.org)|128.101.96.204|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 238950008 (228M) [application/zip]
Saving to: ‘ml-32m.zip’


2025-12-04 05:29:55 (32.2 MB/s) - ‘ml-32m.zip’ saved [238950008/238950008]

Archive:  ml-32m.zip
   creating: ml32m/ml-32m/
  inflating: ml32m/ml-32m/tags.csv   
  inflating: ml32m/ml-32m/links.csv  
  inflating: ml32m/ml-32m/README.txt  
  inflating: ml32m/ml-32m/checksums.txt  
  inflating: ml32m/ml-32m/ratings.csv  
  inflating: ml32m/ml-32m/movies.csv  


In [6]:
ratings_path = "ml32m/ml-32m/ratings.csv"
movies_path  = "ml32m/ml-32m/movies.csv"

ratings = pd.read_csv(ratings_path)
movies  = pd.read_csv(movies_path)

ratings.head(), ratings.shape


(   userId  movieId  rating  timestamp
 0       1       17     4.0  944249077
 1       1       25     1.0  944250228
 2       1       29     2.0  943230976
 3       1       30     5.0  944249077
 4       1       32     5.0  943228858,
 (32000204, 4))

## 1. Sampling the Ratings Data

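We draw a random 10% sample of the ratings (about 3.2 million rows), with a fixed `random_state` so the draw is reproducible. This is the sample that `surprise_data` is built from in the next section; a smaller fixed-size sample of 150,000 ratings is drawn later in the notebook.  
Sampling lets the SVD model train efficiently while still learning the underlying user–item structure.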
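
As a minimal sketch of the two sampling styles that `DataFrame.sample` supports, the snippet below uses a small synthetic ratings frame (the `toy` frame and its values are made up for illustration and are not MovieLens data). `frac` draws a proportion of the rows, `n` draws a fixed count, and a fixed `random_state` makes the draw repeatable.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for ratings.csv (hypothetical values, illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "userId": rng.integers(1, 51, size=1000),
    "movieId": rng.integers(1, 201, size=1000),
    "rating": rng.choice(np.arange(0.5, 5.5, 0.5), size=1000),
})

# Proportional sample: 10% of the rows
by_frac = toy.sample(frac=0.1, random_state=42)

# Fixed-size sample: exactly 150 rows
by_n = toy.sample(n=150, random_state=42)

# Same random_state on the same frame selects the same rows again
by_n_again = toy.sample(n=150, random_state=42)

print(len(by_frac), len(by_n), by_n.index.equals(by_n_again.index))
```

A proportional sample scales with the dataset, while a fixed-size sample keeps training time roughly constant regardless of how large the source file is.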


In [7]:
sampled_ratings = ratings.sample(frac=0.1, random_state=42)
sampled_ratings.shape


(3200020, 4)

## 2. Preparing Data for the Surprise Library

`Dataset.load_from_df` in the `surprise` library expects a DataFrame with exactly three columns, in this order (user, item, rating):

- `userId`
- `movieId`
- `rating`

We convert our sampled data into this structure and define a rating scale for the model.
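
The sketch below shows a few sanity checks that could be run on a frame before it is handed to `Dataset.load_from_df`. The helper `check_ratings_frame` is hypothetical (it is not part of `surprise`), and the `toy` rows are illustrative; it only assumes the three-column layout and the 0.5 to 5.0 rating scale described above.

```python
import pandas as pd

def check_ratings_frame(df, rating_scale=(0.5, 5.0)):
    """Hypothetical helper: sanity-check a (user, item, rating) frame."""
    if df.shape[1] != 3:
        raise ValueError(f"expected 3 columns, got {df.shape[1]}")
    ratings_col = df.iloc[:, 2]  # third column is treated as the rating
    if ratings_col.isna().any():
        raise ValueError("rating column contains missing values")
    low, high = rating_scale
    if not ratings_col.between(low, high).all():
        raise ValueError(f"ratings fall outside the scale {rating_scale}")
    return True

# Illustrative rows with an extra timestamp column
toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [17, 25, 17],
    "rating": [4.0, 1.0, 3.5],
    "timestamp": [944249077, 944250228, 943230976],
})

# Select the three columns in (user, item, rating) order, dropping timestamp
frame = toy[["userId", "movieId", "rating"]]
ok = check_ratings_frame(frame)
print(ok)
```

Passing the unselected four-column frame to the helper raises a `ValueError`, which is the kind of mistake the column selection step is meant to prevent.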


In [8]:
data_for_surprise = sampled_ratings[['userId', 'movieId', 'rating']]
reader = Reader(rating_scale=(0.5, 5.0))
surprise_data = Dataset.load_from_df(data_for_surprise, reader)


## 3. Training the SVD Model

Matrix Factorization (MF) aims to learn:

- A latent representation of users  
- A latent representation of movies  

so that unseen ratings can be predicted based on these hidden factors.

We train a lightweight SVD model (50 factors and 10 epochs, compared with the `surprise` defaults of 100 factors and 20 epochs) to keep computation fast while keeping accuracy reasonable.
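
To make the idea of latent factors concrete, here is a minimal NumPy sketch of a biased matrix factorization trained with stochastic gradient descent. The predicted rating is the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, which is the same form of prediction rule that `surprise` documents for its `SVD` class. The six ratings, the learning rate, and the regularization value are all made up for illustration; this is not the library's implementation.

```python
import numpy as np

# Tiny made-up dataset of (user index, item index, rating) triples
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, n_factors = 3, 3, 2

rng = np.random.default_rng(0)
mu = np.mean([r for _, _, r in ratings])        # global mean rating
bu = np.zeros(n_users)                          # user biases
bi = np.zeros(n_items)                          # item biases
P = rng.normal(0, 0.1, (n_users, n_factors))    # user latent factors
Q = rng.normal(0, 0.1, (n_items, n_factors))    # item latent factors

def predict(u, i):
    """Predicted rating: global mean + biases + factor dot product."""
    return mu + bu[u] + bi[i] + Q[i] @ P[u]

def rmse():
    """Training RMSE over the toy ratings."""
    errors = [(r - predict(u, i)) ** 2 for u, i, r in ratings]
    return float(np.sqrt(np.mean(errors)))

lr, reg = 0.01, 0.02   # illustrative learning rate and L2 penalty
rmse_before = rmse()
for _ in range(200):   # epochs
    for u, i, r in ratings:
        err = r - predict(u, i)
        bu[u] += lr * (err - reg * bu[u])
        bi[i] += lr * (err - reg * bi[i])
        pu, qi = P[u].copy(), Q[i].copy()   # pre-update values for both steps
        P[u] += lr * (err * qi - reg * pu)
        Q[i] += lr * (err * pu - reg * qi)
rmse_after = rmse()
print(f"training RMSE: {rmse_before:.3f} -> {rmse_after:.3f}")
```

In this picture, `n_factors` sets the length of each latent vector and the number of passes over the data corresponds to the epoch count, so reducing either one trades some fitting capacity for speed.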


In [12]:
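# Take a smaller fixed-size sample (this overwrites the 10% sample in sampled_ratings).
# Note: surprise_data was already built from the 10% sample above; rebuild it with
# Dataset.load_from_df if the model should train on these rows instead.
sample_size = 150_000  # you can reduce to 100k if needed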

if len(ratings) > sample_size:
    sampled_ratings = ratings.sample(n=sample_size, random_state=42)
else:
    sampled_ratings = ratings.copy()

sampled_ratings.shape


(150000, 4)

### Model Evaluation (RMSE)

RMSE (Root Mean Square Error) measures the typical size of the gap between predicted and actual ratings on the held-out test set, expressed in rating units (stars).  
A lower RMSE indicates better predictive accuracy.

Although RMSE is not the only metric for recommendation systems, it provides a useful benchmark for MF models.
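
As a small worked example of the metric itself, the sketch below computes RMSE by hand: square each error, average the squares, and take the square root. The `(actual, predicted)` pairs are made up for illustration, and the standalone `rmse` function is a hypothetical helper rather than the `surprise.accuracy.rmse` call used in this notebook.

```python
import math

def rmse(pairs):
    """Root mean square error over (actual, predicted) rating pairs."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up (actual, predicted) pairs for illustration
example = [(4.0, 3.5), (2.0, 3.0), (5.0, 5.0)]
score = rmse(example)
print(round(score, 4))
```

Because the errors are squared before averaging, the one-star miss in the second pair contributes four times as much as the half-star miss in the first.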


In [14]:
trainset, testset = train_test_split(surprise_data, test_size=0.1, random_state=42)

svd_model = SVD(
    n_factors=50,
    n_epochs=10,
    random_state=42
)

svd_model.fit(trainset)

predictions = svd_model.test(testset)
rmse = accuracy.rmse(predictions)
rmse


RMSE: 0.8850


0.8849630800565446

In [15]:
# For convenience: which movies each user has already rated (using the sampled data)
user_rated_movies = sampled_ratings.groupby('userId')['movieId'].apply(set)
all_movie_ids = sampled_ratings['movieId'].unique()


## 4. Generating Recommendations with SVD

Once the SVD model is trained, we can use it to:

- Predict ratings for movies a user has not yet rated  
- Rank movies based on predicted ratings  
- Recommend the top-N most relevant movies for each user

The function below implements this logic.



In [16]:
def recommend_for_user_svd(user_id, top_n=10):
    """
    Recommend movies for a given user using the trained SVD model.

    - user_id: a userId present in sampled_ratings
    - top_n: number of movie recommendations to return
    """
    if user_id not in user_rated_movies.index:
        print("This user is not present in the sampled dataset.")
        return None

    # Movies the user has already rated
    rated = user_rated_movies[user_id]

    # Candidate movies = all movies in sample that user has NOT rated
    candidate_movie_ids = [mid for mid in all_movie_ids if mid not in rated]

    preds = []
    for mid in candidate_movie_ids:
        pred = svd_model.predict(user_id, mid)
        preds.append((mid, pred.est))

    # Sort by predicted rating (high to low)
    preds_sorted = sorted(preds, key=lambda x: x[1], reverse=True)[:top_n]
    top_movie_ids = [mid for (mid, _) in preds_sorted]
    top_scores = [score for (_, score) in preds_sorted]

    # Get movie titles
    recs = movies[movies['movieId'].isin(top_movie_ids)][['movieId', 'title', 'genres']]

    # Preserve order of top_movie_ids
    recs = recs.set_index('movieId').loc[top_movie_ids].reset_index()

    recs['predicted_rating'] = top_scores

    return recs


In [17]:
sampled_user_ids = sampled_ratings['userId'].unique()
sampled_user_ids[:10]


array([ 66954,   9877,  38348, 101952, 140400, 173400,  74417, 195523,
         1953,  82682])

In [18]:
recommend_for_user_svd(sampled_user_ids[0], top_n=10)


   movieId                             title            genres  predicted_rating
0   170705           Band of Brothers (2001)  Action|Drama|War          4.861842
1   148881          World of Tomorrow (2015)  Animation|Comedy          4.857954
2      858             Godfather, The (1972)       Crime|Drama          4.842371
3   159817               Planet Earth (2006)       Documentary          4.832504
4   179135             Blue Planet II (2017)       Documentary          4.826340
5   198185                 Twin Peaks (1989)     Drama|Mystery          4.812997
6   171011            Planet Earth II (2016)       Documentary          4.809271
7   202439                   Parasite (2019)      Comedy|Drama          4.779922
8      318  Shawshank Redemption, The (1994)       Crime|Drama          4.770156
9     1203               12 Angry Men (1957)             Drama          4.755207
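
The candidate-filtering and ranking steps inside `recommend_for_user_svd` can also be sketched in isolation. The version below is one possible variant, not the function used above: it takes a plain dictionary of predicted ratings (made-up values keyed by hypothetical movie ids) and uses `heapq.nlargest`, which avoids sorting the full candidate list when only the top few items are needed.

```python
import heapq

def top_n_unseen(predicted, already_rated, top_n=3):
    """Return the top_n (item, score) pairs among items not in already_rated."""
    candidates = ((item, score) for item, score in predicted.items()
                  if item not in already_rated)
    # nlargest returns the results ordered from highest to lowest score
    return heapq.nlargest(top_n, candidates, key=lambda pair: pair[1])

# Made-up predicted ratings keyed by a hypothetical movie id
predicted = {10: 4.2, 11: 3.1, 12: 4.8, 13: 4.5, 14: 2.0}
already_rated = {12}   # the user has already rated movie 12

top = top_n_unseen(predicted, already_rated, top_n=3)
print(top)
```

Movie 12 has the highest predicted score but is excluded because the user has already rated it, which mirrors the `mid not in rated` filter in the notebook's function.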


## 5. Summary

In this notebook, we implemented a **matrix factorization–based recommendation model** using the MovieLens 32M dataset and the `surprise` library (SVD).

Key steps:

- Loaded the ratings and movies data
- Sampled a subset of ratings to keep training efficient in a Colab environment
- Prepared the data for the `surprise` framework
- Trained an SVD model and evaluated it using RMSE on a held-out test set
- Implemented `recommend_for_user_svd(user_id, top_n)`, which predicts ratings for movies the user has not rated and returns the top-N recommendations for that user

This SVD model offers a latent-factor approach that is typically more expressive than simple popularity-based or pure neighborhood-based collaborative filtering, although this notebook does not benchmark it against those baselines.
