# Movie Recommendation System – Item–Item and User–User Collaborative Filtering

In this notebook, we build personalized recommendation models on top of the MovieLens 32M dataset:

- Item–Item Collaborative Filtering (similar movies)
- User–User Collaborative Filtering (similar users)

We will:
- Prepare a suitable user–item interaction matrix
- Apply filtering to avoid memory issues
- Implement functions to generate recommendations for a given movie or user


In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split


In [None]:
!ls


ml32m  ml-32m.zip  ml-32m.zip.1  ml-32m.zip.2  sample_data


In [None]:
!wget https://files.grouplens.org/datasets/movielens/ml-32m.zip


--2025-12-04 04:56:46--  https://files.grouplens.org/datasets/movielens/ml-32m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.96.204
Connecting to files.grouplens.org (files.grouplens.org)|128.101.96.204|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 238950008 (228M) [application/zip]
Saving to: ‘ml-32m.zip.3’


2025-12-04 04:56:50 (74.1 MB/s) - ‘ml-32m.zip.3’ saved [238950008/238950008]



In [None]:
!unzip ml-32m.zip -d ml32m


Archive:  ml-32m.zip
replace ml32m/ml-32m/tags.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: ml32m/ml-32m/tags.csv   
  inflating: ml32m/ml-32m/links.csv  
  inflating: ml32m/ml-32m/README.txt  
  inflating: ml32m/ml-32m/checksums.txt  
  inflating: ml32m/ml-32m/ratings.csv  
  inflating: ml32m/ml-32m/movies.csv  


In [None]:
!ls


ml32m  ml-32m.zip  ml-32m.zip.1  ml-32m.zip.2  ml-32m.zip.3  sample_data


In [None]:
!ls ml32m


ml-32m


In [None]:
!ls ml32m/ml-32m


checksums.txt  links.csv  movies.csv  ratings.csv  README.txt  tags.csv


In [None]:
movies_path = "ml32m/ml-32m/movies.csv"
ratings_path = "ml32m/ml-32m/ratings.csv"

movies  = pd.read_csv(movies_path)
ratings = pd.read_csv(ratings_path)

movies.head(), ratings.head()

(   movieId                               title  \
 0        1                    Toy Story (1995)   
 1        2                      Jumanji (1995)   
 2        3             Grumpier Old Men (1995)   
 3        4            Waiting to Exhale (1995)   
 4        5  Father of the Bride Part II (1995)   
 
                                         genres  
 0  Adventure|Animation|Children|Comedy|Fantasy  
 1                   Adventure|Children|Fantasy  
 2                               Comedy|Romance  
 3                         Comedy|Drama|Romance  
 4                                       Comedy  ,
    userId  movieId  rating  timestamp
 0       1       17     4.0  944249077
 1       1       25     1.0  944250228
 2       1       29     2.0  943230976
 3       1       30     5.0  944249077
 4       1       32     5.0  943228858)

In [None]:
print("Movies:", movies.shape)
print("Ratings:", ratings.shape)

ratings['rating'].describe()


Movies: (87585, 3)
Ratings: (32000204, 4)


Unnamed: 0,rating
count,32000200.0
mean,3.540396
std,1.058986
min,0.5
25%,3.0
50%,3.5
75%,4.0
max,5.0


## 1. Data Filtering for Collaborative Filtering

The full MovieLens 32M dataset (32,000,204 ratings, with 87,585 titles listed in `movies.csv`) is too large to pivot into a dense user–item matrix in a typical Colab environment.

To avoid memory issues while still keeping meaningful structure, we:

- Keep only **active users** (who have rated many movies)
- Keep only **popular movies** (with many ratings)

This reduces the size of the matrix while preserving useful patterns for collaborative filtering.
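
As a rough back-of-the-envelope check on the memory concern, the sketch below estimates the size of a dense `float64` user–item matrix. The helper name `dense_matrix_gib` and the user count of 200,000 are illustrative assumptions (the notebook never prints the number of distinct users); 87,585 is the row count of `movies.csv` shown above.

```python
def dense_matrix_gib(n_users: int, n_movies: int, bytes_per_cell: int = 8) -> float:
    """Approximate size in GiB of a dense n_users x n_movies matrix."""
    return n_users * n_movies * bytes_per_cell / 2**30

# Assumed user count (hypothetical round number); movie count taken from movies.shape
full_gib = dense_matrix_gib(n_users=200_000, n_movies=87_585)

# A small 500 x 500 subset for comparison
subset_gib = dense_matrix_gib(n_users=500, n_movies=500)

print(f"full matrix:    ~{full_gib:.0f} GiB")
print(f"500x500 subset: ~{subset_gib * 1024:.2f} MiB")
```

On the real data, `ratings['userId'].nunique()` and `ratings['movieId'].nunique()` would give the actual dimensions to plug in.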


In [None]:
# Count ratings per user and per movie
user_counts = ratings['userId'].value_counts()
movie_counts = ratings['movieId'].value_counts()

# Thresholds (you can tune these)
min_user_ratings = 100   # user must have rated at least this many movies
min_movie_ratings = 200  # movie must have at least this many ratings

active_users = user_counts[user_counts >= min_user_ratings].index
popular_movies = movie_counts[movie_counts >= min_movie_ratings].index

filtered_ratings = ratings[
    ratings['userId'].isin(active_users) &
    ratings['movieId'].isin(popular_movies)
]

ratings.shape, filtered_ratings.shape


((32000204, 4), (25227282, 4))

## 2. Item–Item Collaborative Filtering

We first build an **item–item (movie–movie) collaborative filtering model**.

High-level idea:
- Create a user–item matrix where rows = users and columns = movies
- Use cosine similarity between movie columns
- For a given movie, find the most similar movies based on user rating patterns.
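
To make the idea concrete, here is a minimal sketch of cosine similarity between movie columns on a toy matrix. The matrix `R` and the helper `column_cosine_similarity` are made up for illustration; the result should match what `cosine_similarity` from scikit-learn returns when given the transposed matrix.

```python
import numpy as np

# Toy user-item matrix: rows = users, columns = movies, 0 = not rated
R = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [0.0, 1.0, 5.0],
])

def column_cosine_similarity(matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of columns."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0          # avoid division by zero for all-zero columns
    unit = matrix / norms            # scale each column to unit length
    return unit.T @ unit             # (n_movies, n_movies)

sim = column_cosine_similarity(R)
print(np.round(sim, 3))
```

Movies 0 and 1 are rated highly by the same users, so their similarity comes out much higher than the similarity between movies 0 and 2.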


### Note on Dataset Size

The threshold filter from Section 1 still leaves about 25.2 million ratings,  
which is too many to pivot into a dense user–item matrix in a typical Colab environment.

To keep the computation tractable while preserving meaningful collaborative filtering patterns,  
we restrict the data further to:

- The top-N most active users  
- The top-N most rated movies  

Working with the most densely rated part of the data allows us to:
- Build similarity models efficiently  
- Keep a well-populated interaction matrix  
- Avoid runtime crashes in limited-resource environments

The trade-off is that only users and movies inside this subset can appear in the recommendations.


In [None]:
# Count ratings per user and per movie
user_counts = ratings['userId'].value_counts()
movie_counts = ratings['movieId'].value_counts()

# Pick a SMALL, safe number of users and movies
top_n_users = 500
top_n_movies = 500

active_users = user_counts.head(top_n_users).index
popular_movies = movie_counts.head(top_n_movies).index

filtered_ratings = ratings[
    ratings['userId'].isin(active_users) &
    ratings['movieId'].isin(popular_movies)
]

ratings.shape, filtered_ratings.shape, len(active_users), len(popular_movies)


((32000204, 4), (194159, 4), 500, 500)

In [None]:
user_item_matrix = filtered_ratings.pivot_table(
    index='userId',
    columns='movieId',
    values='rating',
    fill_value=0
)

user_item_matrix.shape


(2000, 2000)

In [None]:
item_matrix = user_item_matrix.T  # movies x users
item_similarity = cosine_similarity(item_matrix)


In [None]:
# item_similarity was already computed in the previous cell; inspect its shape
item_similarity.shape

(2000, 2000)

In [None]:
# List of movieIds in the same order as item_matrix rows
movie_ids = item_matrix.index.tolist()

# Map movieId -> index
movie_index_map = {movie_id: idx for idx, movie_id in enumerate(movie_ids)}

# For looking up titles by movieId
movies_lookup = movies.set_index('movieId')


### 2.1 Similar Movies by Item–Item Collaborative Filtering

Using the filtered user–item matrix, we:

- Compute cosine similarity between movie rating vectors  
- For a given movie, retrieve the top-N most similar movies  
- Use this method to build a "more like this" recommender (similar to Netflix’s interface)

The function below,  
**`recommend_similar_movies(movie_title, top_n=10)`**,  
returns the top-N movies whose rating patterns are most similar to the selected movie.
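
The core of the lookup is a top-N selection over one row of the similarity matrix. The sketch below shows one way to do it that excludes the query movie by index rather than assuming it sorts to the first position (ties at similarity 1.0 would break that assumption). The helper name `top_n_similar` and the example scores are hypothetical.

```python
import numpy as np

def top_n_similar(sim_row: np.ndarray, self_idx: int, top_n: int = 3) -> list[int]:
    """Indices of the top_n highest scores in sim_row, excluding self_idx."""
    order = np.argsort(sim_row)[::-1]    # indices sorted from high to low similarity
    order = order[order != self_idx]     # drop the query item explicitly
    return order[:top_n].tolist()

# Hypothetical similarity row for item 2 (its self-similarity is 1.0)
sim_row = np.array([0.20, 0.90, 1.00, 0.75, 0.10])
print(top_n_similar(sim_row, self_idx=2, top_n=3))
```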


In [None]:
def recommend_similar_movies(movie_title, top_n=10):
    """
    Recommend movies similar to the given movie title,
    based on item-item cosine similarity.
    """
    # Find the movieId for this title
    match = movies[movies['title'].str.lower() == movie_title.lower()]
    if match.empty:
        print(f"No movie found with title: {movie_title}")
        return None

    movie_id = match.iloc[0]['movieId']

    # Check if this movie is in our filtered subset
    if movie_id not in movie_index_map:
        print("This movie is not in the filtered subset used for similarity.")
        return None

    # Get the index in the similarity matrix
    idx = movie_index_map[movie_id]

    # Similarity scores for this movie with all others
    sim_scores = item_similarity[idx]

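    # Sort by similarity (high to low) and drop the query movie itself;
    # filtering on idx is safer than assuming it sorts to position 0
    similar_indices = np.argsort(sim_scores)[::-1]
    similar_indices = similar_indices[similar_indices != idx][:top_n]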

    # Get the corresponding movieIds
    similar_movie_ids = [movie_ids[i] for i in similar_indices]

    # Build a result DataFrame
    recommendations = movies_lookup.loc[similar_movie_ids][['title', 'genres']].copy()
    recommendations = recommendations.reset_index()
    recommendations['similarity_score'] = sim_scores[similar_indices]

    return recommendations


In [None]:
movies.head(20)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [None]:
recommend_similar_movies("Toy Story (1995)", top_n=10)


Unnamed: 0,movieId,title,genres,similarity_score
0,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,0.931553
1,1270,Back to the Future (1985),Adventure|Comedy|Sci-Fi,0.926393
2,480,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller,0.92247
3,1198,Raiders of the Lost Ark (Indiana Jones and the...,Action|Adventure,0.921536
4,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,0.918791
5,593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,0.915676
6,4306,Shrek (2001),Adventure|Animation|Children|Comedy|Fantasy|Ro...,0.915315
7,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,0.915245
8,2762,"Sixth Sense, The (1999)",Drama|Horror|Mystery,0.913946
9,1196,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi,0.913591


## 3. User–User Collaborative Filtering

In this section, we implement a first, simple User–User collaborative filtering model that averages the ratings given by a user's nearest neighbors.

High-level idea:
- Find users who have similar rating patterns  
- Use their ratings to recommend new movies  
- Predict which movies a target user may like  

We use the same filtered dataset (top-N active users and popular movies) to stay within memory limits while preserving meaningful rating structure.
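
Before comparing users, it often helps to mean-center each user's ratings so that generous and strict raters become comparable. The sketch below shows one way to center only the entries a user actually rated, assuming (as in this notebook's pivot table) that 0 means "not rated". The helper name `center_rated_entries` and the toy matrix are illustrative.

```python
import numpy as np

# Toy ratings: rows = users, columns = movies, 0 = not rated
R = np.array([
    [5.0, 3.0, 0.0],
    [0.0, 2.0, 4.0],
])

def center_rated_entries(ratings: np.ndarray) -> np.ndarray:
    """Subtract each user's mean over rated entries; unrated entries stay 0."""
    rated = ratings > 0
    counts = rated.sum(axis=1, keepdims=True)
    sums = ratings.sum(axis=1, keepdims=True)   # zeros contribute nothing to the sum
    means = sums / np.maximum(counts, 1)        # guard against users with no ratings
    return np.where(rated, ratings - means, 0.0)

centered = center_rated_entries(R)
print(centered)
```

Leaving unrated entries at exactly 0 keeps them neutral in the cosine similarity; shifting them by the user mean would make every missing rating look like a strong dislike.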


In [None]:
# Treat 0 as "no rating" so unrated entries are not shifted by the mean
# (replace returns a new DataFrame, so the original matrix is untouched)
user_item_norm = user_item_matrix.replace(0, np.nan)

# Subtract each user's mean rating (center the ratings); unrated entries go back to 0
user_means = user_item_norm.mean(axis=1)  # mean per user over rated movies only
user_item_norm = user_item_norm.sub(user_means, axis=0).fillna(0)

user_item_norm.shape


(2000, 2000)

In [None]:
# Compute user-user similarity (cosine)
user_similarity = cosine_similarity(user_item_norm)

user_similarity.shape


(2000, 2000)

In [None]:
user_ids = user_item_norm.index.tolist()
user_index_map = {uid: idx for idx, uid in enumerate(user_ids)}


In [None]:
def recommend_for_user(target_user, top_n=10):
    """
    Recommend movies to a user using User–User collaborative filtering.
    """
    if target_user not in user_index_map:
        print("This user is not in the filtered subset.")
        return None

    # Get index
    u_idx = user_index_map[target_user]

    # Get similarity scores to all other users
    sim_scores = user_similarity[u_idx]

    # Sort users by similarity (high to low) and drop the target user itself
    similar_users_idx = np.argsort(sim_scores)[::-1]
    similar_users_idx = similar_users_idx[similar_users_idx != u_idx][:50]  # top 50 similar users

    # Get their userIds
    similar_users = [user_ids[i] for i in similar_users_idx]

    # Get ratings by similar users
    similar_ratings = filtered_ratings[filtered_ratings['userId'].isin(similar_users)]

    # Aggregate mean rating per movie
    movie_scores = similar_ratings.groupby('movieId')['rating'].mean().sort_values(ascending=False)

    # Take top N movieIds
    top_movie_ids = movie_scores.head(top_n).index

    # Get titles
    return movies[movies['movieId'].isin(top_movie_ids)][['movieId', 'title', 'genres']]


In [None]:
user_ids[:10]


[28, 188, 265, 527, 655, 661, 1048, 1140, 1411, 1421]

In [None]:
recommend_for_user(user_ids[0], top_n=10)


Unnamed: 0,movieId,title,genres
292,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
314,318,"Shawshank Redemption, The (1994)",Crime|Drama
729,745,Wallace & Gromit: A Close Shave (1995),Animation|Children|Comedy
1168,1198,Raiders of the Lost Ark (Indiana Jones and the...,Action|Adventure
1182,1213,Goodfellas (1990),Crime|Drama
1351,1387,Jaws (1975),Action|Horror
1666,1732,"Big Lebowski, The (1998)",Comedy|Crime
3021,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy
5905,6016,City of God (Cidade de Deus) (2002),Action|Adventure|Crime|Drama|Thriller
15240,80463,"Social Network, The (2010)",Drama


## 3. User–User Collaborative Filtering

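In this section, we refine the User–User collaborative filtering model: neighbors' ratings are weighted by similarity, and movies the target user has already rated are excluded.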

Idea:

- Each user has a vector of ratings across movies
- We measure similarity between users based on these rating patterns
- For a target user, we:
  - Find similar users
  - Use their ratings to recommend movies the target user has not rated yet

We use the same filtered user–item matrix as in the item–item model, so we stay within memory limits.


In [None]:
# Treat 0 as "no rating" so unrated entries are excluded from centering
# (replace returns a new DataFrame, so the original matrix is untouched)
user_item_norm = user_item_matrix.replace(0, np.nan)

# Compute each user's mean rating over rated movies only
user_means = user_item_norm.mean(axis=1)

# Subtract user mean from each rating (centered ratings); unrated entries go back to 0
user_item_norm = user_item_norm.sub(user_means, axis=0).fillna(0)

user_item_norm.shape


(2000, 2000)

In [None]:
# Compute cosine similarity between users
user_similarity = cosine_similarity(user_item_norm)

user_similarity.shape


(2000, 2000)

In [None]:
# List of userIds in the same order as user_item_norm
user_ids = user_item_norm.index.tolist()

# Map userId -> index in the similarity matrix
user_index_map = {uid: idx for idx, uid in enumerate(user_ids)}


### 3.1 Recommend Movies for a User

We now define a function that:

- Takes a `userId` as input  
- Finds the most similar users  
- Aggregates their ratings to score movies  
- Recommends movies the target user has not rated yet
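
The aggregation step can be sketched as a similarity-weighted average. Note that this is a common variant rather than the exact formula used in this notebook: the notebook's `recommend_for_user` divides by the sum of all neighbor similarities, whereas the version below normalizes each movie only by the neighbors who actually rated it, so missing ratings do not pull the score toward 0. The helper name `weighted_scores` and the toy data are assumptions.

```python
import numpy as np

def weighted_scores(neighbor_ratings: np.ndarray, neighbor_sims: np.ndarray) -> np.ndarray:
    """
    Similarity-weighted average rating per movie.

    neighbor_ratings : (k, n_movies) array, 0 = not rated
    neighbor_sims    : (k,) similarity of each neighbor to the target user
    """
    rated = (neighbor_ratings > 0).astype(float)
    weighted_sum = neighbor_sims @ neighbor_ratings
    # Per-movie similarity mass of the neighbors who rated that movie
    sim_mass = np.abs(neighbor_sims) @ rated
    # Movies nobody rated keep a score of 0 instead of dividing by zero
    return np.divide(weighted_sum, sim_mass,
                     out=np.zeros_like(weighted_sum), where=sim_mass > 0)

ratings = np.array([
    [5.0, 0.0, 3.0],   # neighbor A
    [4.0, 2.0, 0.0],   # neighbor B
])
sims = np.array([0.8, 0.2])
scores = weighted_scores(ratings, sims)
print(scores)
```

With this normalization, predicted scores stay on the original rating scale, which makes them easier to interpret than scores diluted by unrated zeros.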


In [None]:
def recommend_for_user(target_user, top_n=10, neighbors_k=50):
    """
    Recommend movies to a user using User–User collaborative filtering.

    target_user : userId from the filtered data
    top_n       : how many movie recommendations to return
    neighbors_k : how many similar users to consider
    """
    if target_user not in user_index_map:
        print("This user is not in the filtered subset.")
        return None

    # Index of the target user
    u_idx = user_index_map[target_user]

    # Similarity scores between this user and all other users
    sim_scores = user_similarity[u_idx]

    # Get indices of the most similar users (skip itself at u_idx)
    similar_indices = np.argsort(sim_scores)[::-1]
    similar_indices = [idx for idx in similar_indices if idx != u_idx][:neighbors_k]

    # Get their userIds and similarity weights
    neighbor_ids = [user_ids[i] for i in similar_indices]
    neighbor_sims = np.array([sim_scores[i] for i in similar_indices])

    # Get their ratings from the user-item matrix
    neighbor_ratings = user_item_matrix.loc[neighbor_ids]

    # Compute weighted sum of ratings for each movie
    # (movies are columns)
    weighted_sum = neighbor_ratings.T.dot(neighbor_sims)
    sim_sum = neighbor_sims.sum()

    # Avoid division by zero
    if sim_sum == 0:
        print("No similar users found with non-zero similarity.")
        return None

    predicted_scores = weighted_sum / sim_sum

    # Remove movies the target user has already rated
    target_user_ratings = user_item_matrix.loc[target_user]
    already_rated = target_user_ratings[target_user_ratings > 0].index
    predicted_scores = predicted_scores.drop(already_rated, errors='ignore')

    # Get top N movieIds
    top_movie_ids = predicted_scores.sort_values(ascending=False).head(top_n).index

    # Build a result DataFrame with titles
    recommendations = movies[movies['movieId'].isin(top_movie_ids)][['movieId', 'title', 'genres']]
    recommendations = recommendations.set_index('movieId').loc[top_movie_ids].reset_index()

    # Add the predicted score
    recommendations['predicted_score'] = predicted_scores.loc[top_movie_ids].values

    return recommendations


In [None]:
user_ids[:10]


[28, 188, 265, 527, 655, 661, 1048, 1140, 1411, 1421]

In [None]:
recommend_for_user(user_ids[0], top_n=10)


Unnamed: 0,movieId,title,genres,predicted_score
0,1258,"Shining, The (1980)",Horror,3.631893
1,3176,"Talented Mr. Ripley, The (1999)",Drama|Mystery|Thriller,3.113928
2,48385,Borat: Cultural Learnings of America for Make ...,Comedy,2.975574
3,1230,Annie Hall (1977),Comedy|Romance,2.974606
4,1219,Psycho (1960),Crime|Horror,2.95516
5,2324,Life Is Beautiful (La Vita è bella) (1997),Comedy|Drama|Romance|War,2.871441
6,2391,"Simple Plan, A (1998)",Crime|Drama|Thriller,2.69584
7,2288,"Thing, The (1982)",Action|Horror|Sci-Fi|Thriller,2.660383
8,1183,"English Patient, The (1996)",Drama|Romance|War,2.65872
9,70,From Dusk Till Dawn (1996),Action|Comedy|Horror|Thriller,2.649091


The table above shows example recommendations generated using User–User collaborative filtering.  
These recommendations are based on ratings from users with similar rating patterns to the target user.


## 4. Summary

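In this notebook, we built two collaborative filtering models, **item–item (movie–movie)** and **user–user**, on a subset of the MovieLens 32M dataset.

Steps we took:

- Loaded the `movies` and `ratings` data
- Filtered the dataset to:
  - The most active users
  - The most frequently rated movies  
  (to avoid memory issues in a Colab environment)
- Constructed a user–item rating matrix
- Computed **cosine similarity** between movies based on user rating patterns
- Implemented `recommend_similar_movies(movie_title, top_n)`, which returns the movies most similar to a given movie
- Mean-centered each user's ratings and computed **cosine similarity** between users
- Implemented `recommend_for_user(target_user, top_n, neighbors_k)`, which scores movies the user has not rated using the similarity-weighted ratings of the most similar users

The item–item model can power a “More like this” feature, where a user clicks on a movie and sees a list of similar titles based on collective rating behavior. The user–user model complements it with recommendations personalized to an individual user's rating history.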
