# Kemboi Elly Kiplimo
## DS-FT05
## Phase 4 Project
# MovieLens: Building a Model for Personalized Movie Recommendations Based on User Ratings

## Project Overview
Movie recommendation systems are becoming increasingly important in the age of personalized content. In this project, we build a model that provides the top 5 movie recommendations to a user based on their ratings of other movies. We use the open-source MovieLens dataset from GroupLens, which contains about 100K user ratings of various movies.

## a. Data Analytic Question
How can we provide top 5 movie recommendations to a user based on their ratings of other movies?

###  Problem Statement
The goal of this project is to build a movie recommendation system on the MovieLens dataset, published by the GroupLens research lab at the University of Minnesota, that provides personalized movie recommendations to users based on their ratings of other movies. The system should also handle the cold start problem, where a new user or movie has few or no ratings to learn from, and still provide accurate recommendations.

###  Main Objectives
1. Build a movie recommendation system using collaborative filtering
2. Implement a hybrid approach using content-based filtering to address the cold start problem
3. Provide top 5 movie recommendations to a user based on their ratings of other movies

### Specific Objectives
1. Preprocess and clean the MovieLens dataset
2. Build a collaborative filtering model to provide movie recommendations
3. Implement a content-based filtering approach to address the cold start problem
4. Evaluate the performance of the recommendation system using regression metrics such as RMSE and MAE

## b. Metric of Success
The success of the project will be measured by the accuracy of the movie recommendations provided by the system. We will evaluate the performance of the system using regression metrics such as RMSE and MAE. The system should be able to provide accurate recommendations to users and handle the cold start problem effectively.
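
As a quick illustration of the two metrics named above, here is a minimal sketch in plain Python. The ratings and predictions are made-up values, not results from this project. RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical true ratings and model predictions on the 0.5-5 star scale.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print(rmse(actual, predicted))  # 0.75
print(mae(actual, predicted))   # 0.625
```

Both values are in rating units (stars), so an MAE of 0.625 means the predictions are off by a little over half a star on average.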

## c. Experimental Design
## d. Data Understanding




### 1. Import the necessary libraries

In [36]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from surprise import Dataset, Reader, SVD
from surprise.model_selection import cross_validate, train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


### 2. Loading the Data

In [16]:
links = pd.read_csv('data/ml-latest-small/links.csv')
movie = pd.read_csv('data/ml-latest-small/movies.csv')
rating = pd.read_csv('data/ml-latest-small/ratings.csv')
tags = pd.read_csv('data/ml-latest-small/tags.csv')

In [17]:
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [18]:
movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [34]:
movie.isna().sum()

movieId     0
title       0
genres      0
imdbId_x    0
tmdbId_x    0
imdbId_y    0
tmdbId_y    0
dtype: int64

In [19]:
rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [20]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [21]:
rating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [22]:
reader = Reader(rating_scale=(0.5,5))

data = Dataset.load_from_df(rating[['userId','movieId','rating']], reader)

In [24]:
links.isna().sum()

movieId    0
imdbId     0
tmdbId     8
dtype: int64

In [25]:
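# Drop the 8 rows with no TMDB id, then store the ids as integers.
# .copy() keeps the assignment below from triggering SettingWithCopyWarning.
links = links.dropna(subset=['tmdbId']).copy()
links['tmdbId'] = links['tmdbId'].astype('int')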
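
Dropping the rows with a missing `tmdbId` also removes those movies from the later inner merge. One possible alternative, sketched here on a toy frame with hypothetical missing values, is pandas' nullable `Int64` dtype, which stores integers and missing values side by side. This is only an option to consider, not what the notebook does.

```python
import pandas as pd

# Toy stand-in for links.csv; the missing tmdbId is invented for illustration.
links_demo = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [114709, 113497, 113228],
    "tmdbId": [862.0, None, 15602.0],
})

kept = links_demo.copy()
# Nullable integer dtype: whole-number floats become ints, NaN becomes <NA>.
kept["tmdbId"] = kept["tmdbId"].astype("Int64")

print(kept.dtypes["tmdbId"])        # Int64
print(len(kept), kept["tmdbId"].isna().sum())
```

All three rows survive, and the missing id stays explicitly missing instead of forcing the column to float.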

In [29]:
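# Attach the IMDb/TMDB ids to each movie. Run this cell once: merging links
# into movie a second time duplicates the id columns as imdbId_x/imdbId_y
# and tmdbId_x/tmdbId_y.
movie = movie.merge(links, on='movieId')

# One row per (movie, user rating) pair.
movie_and_ratings = movie.merge(rating, on='movieId')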

In [31]:
movie_and_ratings

Unnamed: 0,movieId,title,genres,imdbId_x,tmdbId_x,imdbId_y,tmdbId_y,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862,114709,862,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862,114709,862,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862,114709,862,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862,114709,862,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862,114709,862,17,4.5,1305696483
...,...,...,...,...,...,...,...,...,...,...
100818,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,5476944,432131,5476944,432131,184,4.0,1537109082
100819,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,5914996,445030,5914996,445030,184,3.5,1537109545
100820,193585,Flint (2017),Drama,6397426,479308,6397426,479308,184,3.5,1537109805
100821,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,8391976,483455,8391976,483455,184,3.5,1537110021
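
The `imdbId_x`/`imdbId_y` and `tmdbId_x`/`tmdbId_y` columns in this output are what pandas produces when a frame that already contains the link columns is merged with `links` again, which is what happens if the merge cell is executed twice. The sketch below reproduces the effect on toy frames and shows one way to make the step safe to re-run. `merge_links_once` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for movies.csv and links.csv (values taken from the heads shown earlier).
movie_demo = pd.DataFrame({"movieId": [1, 2],
                           "title": ["Toy Story (1995)", "Jumanji (1995)"]})
links_demo = pd.DataFrame({"movieId": [1, 2],
                           "imdbId": [114709, 113497],
                           "tmdbId": [862, 8844]})

once = movie_demo.merge(links_demo, on="movieId")
twice = once.merge(links_demo, on="movieId")  # same effect as re-running the cell

print(list(once.columns))
print(list(twice.columns))

def merge_links_once(movies, links):
    """Hypothetical helper: merge only the link columns that are not present yet."""
    new_cols = [c for c in links.columns if c not in movies.columns]
    if not new_cols:
        return movies
    return movies.merge(links[["movieId"] + new_cols], on="movieId")

safe = merge_links_once(once, links_demo)
print(list(safe.columns))
```

Assigning the merged result to a new name instead of overwriting `movie` would have the same protective effect.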


# Collaborative Filtering

In [33]:
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8739  0.8741  0.8676  0.8710  0.8842  0.8742  0.0055  
MAE (testset)     0.6713  0.6728  0.6628  0.6704  0.6809  0.6716  0.0058  
Fit time          1.03    1.02    1.09    1.02    1.04    1.04    0.03    
Test time         0.18    0.21    0.13    0.12    0.13    0.15    0.03    


{'test_rmse': array([0.87386766, 0.87409401, 0.86764428, 0.87101403, 0.88423367]),
 'test_mae': array([0.67126038, 0.67275383, 0.66282445, 0.67038939, 0.68086948]),
 'fit_time': (1.0321087837219238,
  1.018277883529663,
  1.0901145935058594,
  1.022503137588501,
  1.0431993007659912),
 'test_time': (0.17942190170288086,
  0.20947909355163574,
  0.12865996360778809,
  0.121673583984375,
  0.12962651252746582)}

# Content-Based Filtering

In [38]:
tfidf = TfidfVectorizer(stop_words='english')
movie['genres'] = movie['genres'].fillna('') 
tfidf_matrix = tfidf.fit_transform(movie['genres'])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
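
Two details of this cell are easy to miss. First, `TfidfVectorizer` L2-normalizes each row by default, so the plain dot product from `linear_kernel` equals cosine similarity. Second, the default tokenizer splits on punctuation, so a genre such as `Sci-Fi` becomes the two tokens `sci` and `fi`. The sketch below checks both on a few genre strings. The fourth row and the pipe-splitting tokenizer are assumptions added for illustration, not the notebook's configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

genres = [
    "Adventure|Animation|Children|Comedy|Fantasy",  # Toy Story (1995)
    "Adventure|Children|Fantasy",                    # Jumanji (1995)
    "Comedy|Romance",                                # Grumpier Old Men (1995)
    "Action|Sci-Fi",                                 # hypothetical row
]

# Default tokenization: "Sci-Fi" is split into "sci" and "fi".
default_tfidf = TfidfVectorizer()
default_tfidf.fit(genres)

# Alternative: treat each pipe-separated genre as one token.
pipe_tfidf = TfidfVectorizer(tokenizer=lambda s: s.split("|"), token_pattern=None)
X = pipe_tfidf.fit_transform(genres)

print(sorted(default_tfidf.vocabulary_))
print(sorted(pipe_tfidf.vocabulary_))

# Rows are unit length, so the linear kernel is the cosine similarity.
sim = linear_kernel(X, X)
print(np.allclose(sim, cosine_similarity(X, X)))
```

With the pipe tokenizer, Toy Story is closer to Jumanji (three shared genres) than to Grumpier Old Men (one shared genre), which matches intuition.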

In [42]:
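# Map each title to its row position in `movie`. If a title occurs more than
# once, keep its first row so that indices[title] is always a single integer.
indices = pd.Series(movie.index, index=movie['title'])
indices = indices[~indices.index.duplicated(keep='first')]

def content_based_recommend_movies(title, n=10):
    """Return the n titles whose genre TF-IDF vectors are most similar to `title`."""
    index = indices[title]
    # Pair every movie position with its similarity to the query movie.
    sim_scores = list(enumerate(cosine_sim[index]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Drop the query movie itself, then keep the n best matches.
    sim_scores = [s for s in sim_scores if s[0] != index][:n]
    movie_indices = [i[0] for i in sim_scores]
    return movie['title'].iloc[movie_indices]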

In [43]:
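def combined_collab_and_content_based_recs(user_id, title, n=10):
    """Re-rank the n most genre-similar movies by the SVD-predicted rating for user_id."""
    # Content-based step: the n titles closest to `title` by genre.
    cont_based = content_based_recommend_movies(title, n).to_frame()
    cont_based.columns = ['title']

    # Look up movieId for each candidate; only movies with at least one rating survive the merge.
    content_based_recommend = cont_based.merge(movie_and_ratings, on='title')
    content_based_recommend = content_based_recommend.drop_duplicates(subset=['title'], keep='first')

    # Collaborative step: predicted rating of each candidate for this user.
    content_based_recommend['est'] = content_based_recommend['movieId'].apply(lambda x: svd.predict(user_id, x).est)

    content_based_recommend = content_based_recommend.sort_values(by='est', ascending=False)

    return content_based_recommend.head(n)['title']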
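
The hybrid function scores only the n content-based candidates, so the SVD step can reorder them but never swap in a different movie, and it does not check whether the user has already rated a candidate. The sketch below shows one possible refinement: start from a wider candidate pool, drop movies the user has already rated, and keep the top n by predicted rating. `rerank_unseen`, the toy frames, and the stub scores are all hypothetical. In the notebook, `predict` could be `lambda u, m: svd.predict(u, m).est`.

```python
import pandas as pd

def rerank_unseen(candidates, ratings, user_id, predict, n=5):
    """Hypothetical helper: drop movies the user already rated, then rank the
    rest by the predicted rating returned by predict(user_id, movie_id)."""
    seen = set(ratings.loc[ratings["userId"] == user_id, "movieId"])
    unseen = candidates[~candidates["movieId"].isin(seen)].copy()
    unseen["est"] = unseen["movieId"].apply(lambda m: predict(user_id, m))
    return unseen.sort_values("est", ascending=False).head(n)

# Toy candidate pool (wider than n) and toy ratings.
candidates = pd.DataFrame({"movieId": [10, 20, 30, 40],
                           "title": ["A", "B", "C", "D"]})
ratings = pd.DataFrame({"userId": [2, 2, 7],
                        "movieId": [20, 99, 10],
                        "rating": [4.0, 3.0, 5.0]})

# Stub predictor standing in for the fitted SVD model.
stub_scores = {10: 3.1, 30: 4.4, 40: 3.9}
top = rerank_unseen(candidates, ratings, 2, lambda u, m: stub_scores[m], n=2)
print(top["title"].tolist())  # ['C', 'D']
```

User 2 has already rated movie 20, so it is removed before scoring, and the remaining candidates are ordered by the stub's predicted rating.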

In [44]:
user_id = 2
title = "Toy Story (1995)"
n = 5

rec = combined_collab_and_content_based_recs(user_id, title, n)

recommended_movies = rec.tolist()

print(f"Top {n} recommended movies for user {user_id} based on {title}:")

for i, movie_title in enumerate(recommended_movies, start=1):
    print(f"{i}. {movie_title}")

Top 5 recommended movies for user 2 based on Toy Story (1995):
1. Toy Story 2 (1999)
2. Monsters, Inc. (2001)
3. Emperor's New Groove, The (2000)
4. Antz (1998)
5. Adventures of Rocky and Bullwinkle, The (2000)
