# Building a Hybrid Movie Recommender System

Author: Mohamed Oussama Naji

Date: March 29, 2024

## Introduction

In this notebook, we build a hybrid movie recommender system that combines content-based filtering and collaborative filtering. The system recommends movies to a user based on that user's own ratings and on rating patterns learned from all users. We use the MovieLens 20M dataset, which contains movie ratings and user-applied tags, to train the models and generate sample recommendations.


## Table of Contents
<a href="#installation-setup">1. Installation and Setup</a><br>
<a href="#data-loading-preprocessing">2. Data Loading and Preprocessing</a><br>
<a href="#popularity-based-recommendation">3. Popularity-based Recommendation</a><br>
<a href="#content-based-filtering">4. Content-based Filtering</a><br>
<a href="#collaborative-filtering">5. Collaborative Filtering</a><br>
<a href="#hybrid-recommender-function">6. Hybrid Recommender Function</a><br>
<a href="#testing-evaluation">7. Testing and Evaluation</a><br>
<a href="#conclusion">8. Conclusion</a><br>

## Installation and Setup <a id="installation-setup"></a>

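Installing the Surprise library, which provides the matrix-factorization model used for collaborative filtering. The notebook also imports CuPy, which needs a CUDA-capable GPU and is assumed to be available in the runtime already.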

In [None]:
!pip install scikit-surprise

## Data Loading and Preprocessing <a id="data-loading-preprocessing"></a>

Downloading and loading the MovieLens 20M dataset

In [None]:
!wget https://files.grouplens.org/datasets/movielens/ml-20m.zip
!unzip ml-20m.zip

Importing the necessary libraries

In [None]:
import numpy as np
import pickle

from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Dataset, Reader, SVD

import cupy as cp
import pandas as pd

Loading and preprocessing the relevant data

In [None]:
movies = pd.read_csv('ml-20m/movies.csv')
ratings = pd.read_csv('ml-20m/ratings.csv')
tags = pd.read_csv('ml-20m/tags.csv')

tags['tag'] = tags['tag'].astype(str)
tags = tags.groupby('movieId')['tag'].apply(' '.join).reset_index()
movies = movies.merge(tags, on='movieId', how='left').fillna('')
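
To make the tag preprocessing above concrete, here is a minimal sketch on toy data. The frames `toy_movies` and `toy_tags` and their values are made up for illustration; only the `groupby`/`join`/`merge` pattern mirrors the cell above.

```python
import pandas as pd

# Toy stand-ins for movies.csv and tags.csv (values are invented)
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie C'],
})
toy_tags = pd.DataFrame({
    'movieId': [1, 1, 2],
    'tag': ['space', 'robots', 'heist'],
})

# One row per movie, with all of its tags joined into a single string
tag_docs = toy_tags.groupby('movieId')['tag'].apply(' '.join).reset_index()

# A left merge keeps movies that have no tags; fillna('') gives them an empty document
toy_merged = toy_movies.merge(tag_docs, on='movieId', how='left').fillna('')

print(toy_merged['tag'].tolist())
```

Movies without any tags end up with an empty string, which later becomes an all-zero TF-IDF row.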

Data cleaning

In [None]:
all_movie_ids = np.unique(np.concatenate([movies['movieId'].unique(), ratings['movieId'].unique()]))

valid_ratings = ratings[ratings['movieId'].isin(all_movie_ids)]
valid_ratings = valid_ratings.dropna(subset=['userId', 'movieId'])
valid_ratings = valid_ratings.drop_duplicates(subset=['userId', 'movieId'])

valid_user_ids = valid_ratings['userId'].unique()

## Popularity-based Recommendation <a id="popularity-based-recommendation"></a>

In [None]:
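# Count ratings per movie and rank movies from most to least rated
movie_ratings = valid_ratings.groupby('movieId')['rating'].count().reset_index().sort_values('rating', ascending=False).reset_index(drop=True)
# Keep the IDs of the 20 most-rated movies
popular_movies = movie_ratings[:20]['movieId'].tolist()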
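
The popularity list holds movie IDs only. As a small sketch of the same count-based ranking on toy data, including a lookup of the corresponding titles, consider the following. The frames `toy_ratings` and `toy_titles` are hypothetical and their values are invented.

```python
import pandas as pd

# Toy ratings: movie 10 has three ratings, movie 20 has two, movie 30 has one
toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 3, 1, 2, 1],
    'movieId': [10, 10, 10, 20, 20, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0, 3.0, 4.5],
})
toy_titles = pd.DataFrame({
    'movieId': [10, 20, 30],
    'title': ['Movie A', 'Movie B', 'Movie C'],
})

# Rank movies by how many ratings they received
counts = toy_ratings.groupby('movieId')['rating'].count().sort_values(ascending=False)
top_ids = counts.index[:2].tolist()

# Look up titles in ranking order
top_titles = toy_titles.set_index('movieId').loc[top_ids, 'title'].tolist()
print(top_titles)
```

Note that counting ratings measures how often a movie was rated, not how well it was rated.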

## Content-based Filtering <a id="content-based-filtering"></a>


In [None]:
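# Build a TF-IDF matrix from each movie's concatenated tags (movies without tags become all-zero rows)
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['tag'])
# Reduce the sparse TF-IDF matrix to 10 latent dimensions per movie
svd = TruncatedSVD(n_components=10)
latent_matrix_1 = svd.fit_transform(tfidf_matrix)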
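
Once every movie has a latent vector, similar movies can be found by cosine similarity. The sketch below shows the idea with NumPy on a tiny hand-written matrix. The helper `top_k_similar` and the values in `toy_latent` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def top_k_similar(latent, query_row, k=2):
    """Return the k row indices most cosine-similar to latent[query_row], with their scores."""
    norms = np.linalg.norm(latent, axis=1, keepdims=True)
    unit = latent / np.maximum(norms, 1e-12)   # all-zero rows stay all-zero
    sims = unit @ unit[query_row]              # cosine similarity to the query row
    sims[query_row] = -np.inf                  # exclude the query itself
    order = np.argsort(-sims)[:k]
    return order.tolist(), sims[order].tolist()

toy_latent = np.array([
    [1.0, 0.0],   # query movie
    [2.0, 0.1],   # points in almost the same direction as the query
    [1.0, 1.0],   # 45 degrees away from the query
    [0.0, 0.0],   # movie with no tags
])

neighbours, scores = top_k_similar(toy_latent, query_row=0)
print(neighbours)
```

Normalizing the rows first matters: a raw dot product would favour vectors with large magnitude rather than vectors pointing in a similar direction.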

## Collaborative Filtering <a id="collaborative-filtering"></a>


In [None]:
from cupyx.scipy import sparse as cp_sparse

num_users = len(valid_user_ids)
num_movies = len(all_movie_ids)

# Map raw userId/movieId values to contiguous row/column positions.
# MovieLens IDs are not contiguous, so a raw movieId cannot be used directly as a column index.
row_idx = pd.Index(valid_user_ids).get_indexer(valid_ratings['userId'])
col_idx = pd.Index(all_movie_ids).get_indexer(valid_ratings['movieId'])

# Build the user-by-movie matrix directly in sparse form; unrated entries are implicit zeros.
ratings_matrix = csr_matrix(
    (valid_ratings['rating'].to_numpy(dtype=np.float32), (row_idx, col_idx)),
    shape=(num_users, num_movies),
)

# Copy the sparse matrix to the GPU.
ratings_matrix_gpu = cp_sparse.csr_matrix(ratings_matrix)

# MovieLens 20M ratings run from 0.5 to 5.0 in half-star steps
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(valid_ratings[['userId', 'movieId', 'rating']], reader)
model_svd = SVD()
model_svd.fit(data.build_full_trainset())
latent_matrix_2 = model_svd.pu
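
For intuition about what the fitted model computes, the Surprise documentation describes the SVD prediction as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, clipped to the rating scale by default. The sketch below reproduces that formula with NumPy. The function `predict_rating` and all numbers are hypothetical; in a fitted model the corresponding values live in attributes such as `pu`, `qi`, `bu`, and `bi`.

```python
import numpy as np

def predict_rating(mu, b_u, b_i, p_u, q_i, low=0.5, high=5.0):
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u, clipped to [low, high]."""
    est = mu + b_u + b_i + float(np.dot(q_i, p_u))
    return min(high, max(low, est))

# Invented values for one user and one movie
mu = 3.5                      # global mean rating
b_u, b_i = 0.2, -0.1          # user and item biases
p_u = np.array([0.5, -1.0])   # user factors
q_i = np.array([1.0, 0.5])    # item factors

est = predict_rating(mu, b_u, b_i, p_u, q_i)

# An estimate above the top of the scale is clipped
clipped = predict_rating(4.8, 0.5, 0.0, p_u, q_i)
print(est, clipped)
```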

## Hybrid Recommender Function <a id="hybrid-recommender-function"></a>

In [None]:
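def hybrid_recommender(user_id, movies, ratings, latent_matrix_1, latent_matrix_2, ratings_matrix_gpu, popular_movies, batch_size=100):
    # latent_matrix_2 and ratings_matrix_gpu are kept in the signature but not used here;
    # collaborative scores come from the globally fitted model_svd.
    user_ratings = ratings[ratings['userId'] == user_id]
    movie_ids = movies['movieId'].to_numpy()
    watched_mask = np.isin(movie_ids, user_ratings['movieId'].to_numpy())
    watched_pos = np.flatnonzero(watched_mask)
    unwatched_pos = np.flatnonzero(~watched_mask)

    # Cold start: with no rating history, fall back to the most-rated movies (the score is a placeholder).
    if len(watched_pos) == 0:
        return [(movie_id, 0.0) for movie_id in popular_movies[:10]]

    # Normalize rows so that dot products are cosine similarities (all-zero rows stay zero).
    latent_gpu = cp.asarray(latent_matrix_1)
    latent_gpu = latent_gpu / cp.maximum(cp.linalg.norm(latent_gpu, axis=1, keepdims=True), 1e-12)
    watched_gpu = latent_gpu[cp.asarray(watched_pos)]

    content_recommendations = []
    for start_idx in range(0, len(unwatched_pos), batch_size):
        batch_pos = unwatched_pos[start_idx:start_idx + batch_size]
        # Score each candidate by its highest similarity to any movie the user has rated.
        content_sim_scores = cp.asnumpy(cp.dot(latent_gpu[cp.asarray(batch_pos)], watched_gpu.T).max(axis=1))
        content_recommendations.extend(zip(movie_ids[batch_pos].tolist(), content_sim_scores.tolist()))
    cp.get_default_memory_pool().free_all_blocks()

    content_recommendations.sort(key=lambda x: x[1], reverse=True)
    top_content_recommendations = content_recommendations[:10]

    # Predicted rating from the matrix-factorization model for every unrated movie.
    mf_recommendations = [(movie_id, model_svd.predict(user_id, movie_id).est) for movie_id in movie_ids[unwatched_pos].tolist()]
    mf_recommendations.sort(key=lambda x: x[1], reverse=True)
    top_mf_recommendations = mf_recommendations[:10]

    def rescale(recs):
        # Min-max scale to [0, 1]: similarities and predicted ratings are on different scales.
        if not recs:
            return []
        scores = np.array([score for _, score in recs], dtype=float)
        span = scores.max() - scores.min()
        scaled = (scores - scores.min()) / span if span > 0 else np.ones_like(scores)
        return [(movie_id, float(s)) for (movie_id, _), s in zip(recs, scaled)]

    # Merge both lists, keeping one entry per movie with its higher rescaled score.
    best_scores = {}
    for movie_id, score in rescale(top_content_recommendations) + rescale(top_mf_recommendations):
        best_scores[movie_id] = max(score, best_scores.get(movie_id, 0.0))
    combined_recommendations = sorted(best_scores.items(), key=lambda x: x[1], reverse=True)

    return combined_recommendations[:10]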
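
Content similarities and predicted ratings live on different scales, so sorting them together lets one source dominate. The sketch below shows one simple way to merge two ranked lists: min-max rescale each list to [0, 1], then keep one entry per movie with its higher score. The helpers `min_max` and `merge_ranked` and the toy scores are illustrative assumptions; other blending schemes, such as a weighted sum, are equally plausible.

```python
def min_max(recs):
    """Rescale the scores in a list of (movie_id, score) pairs to [0, 1]."""
    if not recs:
        return []
    scores = [score for _, score in recs]
    low, high = min(scores), max(scores)
    span = high - low
    return [(movie_id, (score - low) / span if span > 0 else 1.0) for movie_id, score in recs]

def merge_ranked(content_recs, mf_recs, top_n=10):
    """Merge two ranked lists, keeping one entry per movie with its higher rescaled score."""
    best = {}
    for movie_id, score in min_max(content_recs) + min_max(mf_recs):
        best[movie_id] = max(score, best.get(movie_id, 0.0))
    return sorted(best.items(), key=lambda x: x[1], reverse=True)[:top_n]

# Invented scores: cosine similarities on the left, predicted ratings on the right
content = [(1, 0.9), (2, 0.5), (3, 0.1)]
mf = [(2, 4.8), (4, 4.0), (5, 3.2)]

merged = merge_ranked(content, mf, top_n=3)
print([movie_id for movie_id, _ in merged])
```

Movie 2 appears in both lists but only once in the merged result, with the higher of its two rescaled scores.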

## Testing and Evaluation <a id="testing-evaluation"></a>

Generating and displaying personalized movie recommendations for a specific user.

In [None]:
user_id = 1
recommendations = hybrid_recommender(user_id, movies, ratings, latent_matrix_1, latent_matrix_2, ratings_matrix_gpu, popular_movies)
print(f"Recommendations for user {user_id}:")
for movie_id, score in recommendations:
    print(f"{movies[movies['movieId'] == movie_id]['title'].values[0]}: {score:.2f}")
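
Printing recommendations for one user is a qualitative check. As a sketch of what a quantitative check might look like, the function below computes precision at k for a single user: the fraction of the top-k recommended movies that appear in a held-out set of movies the user actually liked. The function `precision_at_k`, the held-out set, and the toy IDs are hypothetical; the notebook itself does not split the data.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended movie IDs that are in the relevant set."""
    if k <= 0:
        return 0.0
    top_k = recommended[:k]
    hits = sum(1 for movie_id in top_k if movie_id in relevant)
    return hits / k

# Invented example: ranked recommendations and held-out movies the user rated highly
recommended_ids = [10, 20, 30, 40]
held_out_liked = {20, 40, 50}

p_at_3 = precision_at_k(recommended_ids, held_out_liked, k=3)
print(round(p_at_3, 3))
```

Averaging this value over many users, with ratings held out before training, would give a first measure of recommendation quality.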


## Conclusion <a id="conclusion"></a>

In this notebook, we built a hybrid movie recommender system that combines content-based filtering and collaborative filtering techniques. We used the MovieLens 20M dataset to train the models and generate sample recommendations.

The content-based filtering approach utilized TF-IDF vectorization and Singular Value Decomposition (SVD) to create latent representations of movies based on their tags. The collaborative filtering approach employed matrix factorization using the Surprise library's SVD algorithm to learn latent factors for users and movies.

The hybrid recommender function merges the top 10 recommendations from the content-based and collaborative filtering approaches into a single ranked list. We also computed the 20 most-rated movies as a simple popularity-based list.

The testing and evaluation section generated personalized movie recommendations for a single user (user 1). This is a qualitative check only: the notebook does not compute metrics on held-out ratings, so the quality of the recommendations has not yet been measured.

To further improve the recommender system, we can explore additional techniques such as item-based collaborative filtering, incorporating user and item metadata, and applying more advanced machine learning algorithms. Additionally, we can conduct more extensive evaluation and validation to assess the quality and effectiveness of the recommendations.

Overall, this notebook serves as a foundation for building a hybrid movie recommender system and can be extended and customized based on specific requirements and available data.