<center>
  <h1 align="center"> Movie Recommender System </h1>
</center>

- A movie recommendation system is an automated system that can suggest movies to users based on their preferences. 
- The system uses an algorithm to analyze user data such as past movie ratings, genre preferences, and other related criteria to provide personalized movie recommendations. 
- The system can also provide general recommendations for users who don't have any past data or preferences. 

## Import

In [10]:
conda install -c conda-forge scikit-surprise


Note: you may need to restart the kernel to use updated packages.


In [1]:
import pandas as pd
import numpy as np

from scipy import stats
import random

from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel,cosine_similarity

from surprise import Reader, Dataset, SVD,accuracy
from surprise.model_selection import GridSearchCV

import warnings 
warnings.filterwarnings("ignore")

## Data Loading
- Source: MovieLens (GroupLens), https://grouplens.org/datasets/movielens/

In [2]:
Rating = pd.read_csv('C:/Users/vijay/OneDrive/Desktop/Datascience/Wecareer/Portfolio/Portfolio/Movie recomendation/ratings.csv')
Movies = pd.read_csv('C:/Users/vijay/OneDrive/Desktop/Datascience/Wecareer/Portfolio/Portfolio/Movie recomendation/movies.csv')
Tags = pd.read_csv('C:/Users/vijay/OneDrive/Desktop/Datascience/Wecareer/Portfolio/Portfolio/Movie recomendation/tags.csv')

In [4]:
Rating = pd.read_csv('/Movie recomendation/ratings.csv')
Movies = pd.read_csv('/Movie recomendation/movies.csv')
Tags = pd.read_csv('/Movie recomendation/tags.csv')

## EDA

In [3]:
Movies.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
Rating.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
Tags.head(5)

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [6]:
Tags['movieId'].value_counts(ascending=False)

296     181
2959     54
924      41
293      35
7361     34
       ... 
3307      1
3310      1
3317      1
830       1
2719      1
Name: movieId, Length: 1572, dtype: int64

### Check Null values

In [7]:
print('Movies:\n',Movies.isnull().sum())
print('Rating:\n',Rating.isnull().sum())
print('Tags:\n',Tags.isnull().sum())

Movies:
 movieId    0
title      0
genres     0
dtype: int64
Rating:
 userId       0
movieId      0
rating       0
timestamp    0
dtype: int64
Tags:
 userId       0
movieId      0
tag          0
timestamp    0
dtype: int64


### Merge tables

In [8]:
print('Movies:',Movies.shape)
print('Rating:',Rating.shape)
print('Tags:',Tags.shape)

Movies: (9742, 3)
Rating: (100836, 4)
Tags: (3683, 4)


#### Group by movieId and compute the mean rating

In [9]:
rating_df = Rating.groupby(by='movieId',as_index=False).agg({'rating':'mean'})
rating_df.head(5)

Unnamed: 0,movieId,rating
0,1,3.92093
1,2,3.431818
2,3,3.259615
3,4,2.357143
4,5,3.071429


In [10]:
rating_df.shape

(9724, 2)

#### Merge table ratings and movies

In [11]:
movies_df = pd.merge(rating_df, Movies, how='left',on='movieId')
movies_df.head(5)

Unnamed: 0,movieId,rating,title,genres
0,1,3.92093,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,3.431818,Jumanji (1995),Adventure|Children|Fantasy
2,3,3.259615,Grumpier Old Men (1995),Comedy|Romance
3,4,2.357143,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,3.071429,Father of the Bride Part II (1995),Comedy


## Content based recommendation system
- Content-based methods rely on the similarity of movie attributes: if a user watches one movie, movies with similar attributes are recommended.
- Here we build two content-based recommenders, one from movie genres and one from movie tags.

#### Genre

In [21]:
# Define a TF-IDF vectorizer object
tfidf = TfidfVectorizer()
# Construct the TF-IDF matrix by fitting and transforming the genre strings
tfidf_matrix = tfidf.fit_transform(movies_df['genres'])
# Compute the cosine similarity matrix; TF-IDF rows are L2-normalised by default,
# so the dot product from linear_kernel equals the cosine similarity
cos_similarity = linear_kernel(tfidf_matrix, tfidf_matrix)
# Map each title to its row position in movies_df
indices = pd.Series(movies_df.index, index=movies_df['title'])

titles = movies_df['title']
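
The genre cell calls `linear_kernel` while describing the result as cosine similarity. A minimal sketch of why the two agree, using three made-up genre strings rather than the real `movies_df`:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy genre strings in the same pipe-separated style as the dataset (illustrative only).
toy_genres = [
    "Crime|Drama|Horror|Mystery|Thriller",
    "Crime|Horror|Mystery|Thriller",
    "Comedy|Romance",
]

# TfidfVectorizer L2-normalises each row by default (norm="l2").
toy_matrix = TfidfVectorizer().fit_transform(toy_genres)

dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = cosine_similarity(toy_matrix, toy_matrix)

# On unit-length rows the plain dot product is already the cosine similarity.
same = np.allclose(dot_products, cosines)
print(same)
print(dot_products.round(2))
```

The first two toy movies share four genre tokens and score high against each other, while the comedy shares none and scores zero. If the vectorizer were created with `norm=None`, the two functions would no longer agree.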



In [23]:
titles.shape

(9724,)

In [27]:
titles

0                                Toy Story (1995)
1                                  Jumanji (1995)
2                         Grumpier Old Men (1995)
3                        Waiting to Exhale (1995)
4              Father of the Bride Part II (1995)
                          ...                    
9719    Black Butler: Book of the Atlantic (2017)
9720                 No Game No Life: Zero (2017)
9721                                 Flint (2017)
9722          Bungo Stray Dogs: Dead Apple (2018)
9723          Andrew Dice Clay: Dice Rules (1991)
Name: title, Length: 9724, dtype: object

In [25]:
indices.shape

(9724,)

### Function movie_recommender:
- Takes as input the title of the movie for which we want similar-movie recommendations.
- Finds the index for that title and picks up the similarity scores for that index.
- Sorts the scores and returns the titles with the highest similarity. The slice `[1:10]` skips the first entry and keeps the next 9, so the function returns 9 titles rather than 10.
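
Skipping the first sorted entry only removes the query movie when no other movie ties with it at the top. The steps above can also be sketched with the query excluded explicitly. Everything here (the toy titles, the one-hot features and the name `recommend_excluding_query`) is hypothetical and stands in for the notebook's TF-IDF objects:

```python
import numpy as np
import pandas as pd

# Hypothetical toy catalogue; the one-hot genre columns stand in for TF-IDF features.
toy_titles = pd.Series(["A (1995)", "B (1996)", "C (1997)", "D (1998)"])
toy_features = np.array([
    [1.0, 1.0, 0.0],   # A: crime, thriller
    [1.0, 1.0, 0.0],   # B: same genres as A
    [1.0, 0.0, 0.0],   # C: crime only
    [0.0, 0.0, 1.0],   # D: comedy
])

# Cosine similarity: normalise rows to unit length, then take dot products.
unit_rows = toy_features / np.linalg.norm(toy_features, axis=1, keepdims=True)
toy_similarity = unit_rows @ unit_rows.T

# Title -> row position lookup, built the same way as `indices` in the notebook.
toy_indices = pd.Series(toy_titles.index, index=toy_titles)


def recommend_excluding_query(title, top_n=2):
    """Return the top_n most similar titles, never the query itself."""
    query = toy_indices[title]
    # Stable argsort on the negated scores ranks from most to least similar.
    ranked = np.argsort(-toy_similarity[query], kind="stable")
    # Drop the query by position instead of assuming it is ranked first.
    ranked = ranked[ranked != query][:top_n]
    return toy_titles.iloc[ranked]


result = recommend_excluding_query("B (1996)")
print(result.tolist())
```

In this toy data, movie A has exactly the same genres as B, so a "skip the first entry" rule applied to B would drop A and keep B itself; filtering by position avoids that.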

In [26]:
def movie_recommender(title):
    # Get the index of the movie that matches the title
    index = indices[title]
    # Pair every movie index with its similarity to the query movie
    similarity_scores = list(enumerate(cos_similarity[index]))
    # Sort the movies by similarity score, highest first
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (normally the query movie itself) and keep the next 9
    similarity_scores = similarity_scores[1:10]
    # Map the positions back to titles
    movie_indices = [i[0] for i in similarity_scores]
    return titles.iloc[movie_indices]

In [29]:
movie_recommender('Copycat (1995)')

2637                               American Psycho (2000)
2960                Book of Shadows: Blair Witch 2 (2000)
3549                                     From Hell (2001)
4321                                      Identity (2003)
4480                                  House of Wax (1953)
5262    Testament of Dr. Mabuse, The (Das Testament de...
6055    Bird with the Crystal Plumage, The (Uccello da...
7156                                        Saw VI (2009)
5635         American Psycho II: All American Girl (2002)
Name: title, dtype: object

In [31]:
movies_df[movies_df['movieId']==22]

Unnamed: 0,movieId,rating,title,genres
21,22,3.222222,Copycat (1995),Crime|Drama|Horror|Mystery|Thriller


In [32]:
movies_df[movies_df['title']=='Book of Shadows: Blair Witch 2 (2000)']

Unnamed: 0,movieId,rating,title,genres
2960,3973,1.125,Book of Shadows: Blair Witch 2 (2000),Crime|Horror|Mystery|Thriller


In [33]:
movies_df[movies_df['title']=='American Psycho (2000)']

Unnamed: 0,movieId,rating,title,genres
2637,3535,3.788136,American Psycho (2000),Crime|Horror|Mystery|Thriller


#### Tags

In [50]:
tags_df = pd.merge(Tags, Movies, how='left',on='movieId')
tags_df.head(5)

Unnamed: 0,userId,movieId,tag,timestamp,title,genres
0,2,60756,funny,1445714994,Step Brothers (2008),Comedy
1,2,60756,Highly quotable,1445714996,Step Brothers (2008),Comedy
2,2,60756,will ferrell,1445714992,Step Brothers (2008),Comedy
3,2,89774,Boxing story,1445715207,Warrior (2011),Drama
4,2,89774,MMA,1445715200,Warrior (2011),Drama


In [51]:
tags_df.shape

(3683, 6)

In [52]:
# Define a TF-IDF vectorizer object
tfidf2 = TfidfVectorizer()

# Construct the TF-IDF matrix by fitting and transforming the tag strings
tfidf_matrix2 = tfidf2.fit_transform(tags_df['tag'])

# Compute the cosine similarity matrix (rows are L2-normalised, so the dot product is the cosine)
cos_similarity2 = linear_kernel(tfidf_matrix2, tfidf_matrix2)

# tags_df has one row per (user, movie, tag), so a title can appear several times in this index
indices2 = pd.Series(tags_df.index, index=tags_df['title'])

titles2 = tags_df['title']

### Function movie_recommender_tags:
- Takes as input the title of the movie for which we want similar-movie recommendations.
- Finds the index for that title and picks up the similarity scores for that index.
- Sorts the scores and returns the titles with the highest similarity. The slice `[1:10]` skips the first entry and keeps the next 9, so the function returns 9 titles rather than 10.
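
Because the tags table holds one row per tag, a title with several tags maps to several rows. One possible alternative, not the approach used in this notebook, is to join each movie's tags into a single document before vectorizing. The sketch below uses made-up rows, and `tag_docs`, `tag_matrix`, `tag_similarity` and `tag_indices` are illustrative names:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy rows shaped like tags_df: one row per (movie, tag).
toy_tags = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B", "Movie C", "Movie C"],
    "tag": ["serial killer", "twist ending", "serial killer", "funny", "quotable"],
})

# Join all tags of a movie into one document so every title appears exactly once.
tag_docs = toy_tags.groupby("title", as_index=False).agg({"tag": " ".join})

# Same TF-IDF + dot-product pipeline as before, now with one row per movie.
tag_matrix = TfidfVectorizer().fit_transform(tag_docs["tag"])
tag_similarity = linear_kernel(tag_matrix, tag_matrix)

# The title -> row lookup is now unambiguous.
tag_indices = pd.Series(tag_docs.index, index=tag_docs["title"])
print(tag_docs)
print(tag_similarity.round(2))
```

With one row per movie, a lookup by title returns a single index, and the same movie cannot show up twice in the recommendations.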

In [53]:
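def movie_recommender_tags(title):
    # Row index of the tag entry for this title (assumes the title has a single tag row)
    index = indices2[title]
    similarity_scores = list(enumerate(cos_similarity2[index]))
    # Sort the tag rows by similarity score, highest first
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry and keep the next 9
    similarity_scores = similarity_scores[1:10]
    movie_indices = [i[0] for i in similarity_scores]
    return titles2.iloc[movie_indices]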

In [54]:
movie_recommender_tags('Copycat (1995)')

993                                        Copycat (1995)
1015                          Seven (a.k.a. Se7en) (1995)
1085                                    Virtuosity (1995)
1377                                             M (1931)
1890                                     Manhunter (1986)
2259                                       Monster (2003)
2962    Man Bites Dog (C'est arrivé près de chez vous)...
423                                      John Wick (2014)
3479                                  Pulp Fiction (1994)
Name: title, dtype: object

In [59]:
tags_df[tags_df['title'] == 'Copycat (1995)']

Unnamed: 0,userId,movieId,tag,timestamp,title,genres
993,474,22,serial killer,1137375496,Copycat (1995),Crime|Drama|Horror|Mystery|Thriller


In [58]:
tags_df[tags_df['title'] == 'Seven (a.k.a. Se7en) (1995)']

Unnamed: 0,userId,movieId,tag,timestamp,title,genres
697,424,47,mystery,1457842470,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
698,424,47,twist ending,1457842458,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
1015,474,47,serial killer,1137206452,Seven (a.k.a. Se7en) (1995),Mystery|Thriller


In [55]:
tags_df[tags_df['title'] == 'American Psycho (2000)']

Unnamed: 0,userId,movieId,tag,timestamp,title,genres


In [56]:
# Note: this title omits the "(2000)" suffix, so the exact-match filter returns no rows
tags_df[tags_df['title'] == 'Book of Shadows: Blair Witch 2']


Unnamed: 0,userId,movieId,tag,timestamp,title,genres


## Collaborative recommender system
- Recommendations are based on users' past interactions (here, their ratings).
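
The next cell reads from `df_ratings`, a table with `userId`, `movieId`, `title` and `rating` columns. A minimal sketch of one way such a table could be assembled, assuming it is simply the ratings joined with the movie titles. The toy frames and the name `df_ratings_sketch` are illustrative:

```python
import pandas as pd

# Tiny stand-ins for the Rating and Movies tables (illustrative values only).
ratings_toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 1],
    "rating": [4.0, 4.0, 3.5],
})
movies_toy = pd.DataFrame({
    "movieId": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
})

# Assumption: attach each movie's title to its rating rows via a left join on movieId.
df_ratings_sketch = pd.merge(ratings_toy, movies_toy, how="left", on="movieId")
print(df_ratings_sketch[["userId", "movieId", "title", "rating"]])
```

A left join keeps every rating row even if a movie were missing from the titles table, in which case its `title` would be null.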

In [87]:
rating_user = df_ratings[['userId','movieId','title','rating']].groupby(by=['movieId','userId'], as_index=False )\
.agg({'rating':'mean','title':'first'})
rating_user

Unnamed: 0,movieId,userId,rating,title
0,1,1,4.0,Toy Story (1995)
1,1,5,4.0,Toy Story (1995)
2,1,7,4.5,Toy Story (1995)
3,1,15,2.5,Toy Story (1995)
4,1,17,4.5,Toy Story (1995)
...,...,...,...,...
100831,193581,184,4.0,Black Butler: Book of the Atlantic (2017)
100832,193583,184,3.5,No Game No Life: Zero (2017)
100833,193585,184,3.5,Flint (2017)
100834,193587,184,3.5,Bungo Stray Dogs: Dead Apple (2018)


In [90]:
rating_user.isnull().sum()

movieId    0
userId     0
rating     0
title      0
dtype: int64

In [93]:
reader = Reader()
data = Dataset.load_from_df(rating_user[['userId', 'movieId', 'rating']], reader)
#trainset = data.build_full_trainset()

In [94]:
raw_ratings = data.raw_ratings

# Shuffle the ratings so the split is random
random.shuffle(raw_ratings)

# Train = 80% of the data, test = 20% of the data
threshold = int(0.8 * len(raw_ratings))
train_raw_ratings = raw_ratings[:threshold]
test_raw_ratings = raw_ratings[threshold:]

data.raw_ratings = train_raw_ratings  # data now holds only the training ratings

# Select your best algo with grid search.
print("Grid Search...")
param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005]}
grid_search = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=3)
grid_search.fit(data)

print(f'RMSE Best Parameters: {grid_search.best_params["rmse"]}')
print(f'RMSE Best Score:      {grid_search.best_score["rmse"]}\n')

Grid Search...
RMSE Best Parameters: {'n_epochs': 10, 'lr_all': 0.005}
RMSE Best Score:      0.8901336504823513
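
The RMSE reported by the grid search is a cross-validation score on the 80% training part; the 20% in `test_raw_ratings` is set aside but not scored in the cells shown. As a rough sketch of what scoring a held-out split involves, the example below uses made-up ratings and a global-mean predictor as a stand-in for the tuned SVD model, so its number says nothing about the real data:

```python
import math
import random

# Made-up (user, item, rating) triples standing in for raw_ratings.
toy_ratings = [(u, i, float(1 + (u * 3 + i * 5) % 5)) for u in range(10) for i in range(10)]

rng = random.Random(0)          # fixed seed so the split is reproducible
rng.shuffle(toy_ratings)

threshold = int(0.8 * len(toy_ratings))   # same 80/20 rule as the split above
train_part = toy_ratings[:threshold]
test_part = toy_ratings[threshold:]

# Stand-in model: predict the mean training rating for every held-out pair.
global_mean = sum(r for _, _, r in train_part) / len(train_part)

# RMSE = square root of the mean squared difference between true and predicted ratings.
squared_errors = [(r - global_mean) ** 2 for _, _, r in test_part]
holdout_rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(len(train_part), len(test_part), round(holdout_rmse, 3))
```

With Surprise, the predictions would come from the fitted algorithm's `test` method rather than a constant, but the error formula is the same.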



In [135]:
trainset = data.build_full_trainset()

### Function recommendation_SVD:
- Takes as input the userID and get_recommend, the number of recommendations required.
- Default values: userID=1, get_recommend=10.
- Uses a singular value decomposition (SVD) model.
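
For intuition, Surprise's `SVD` estimates a rating as the global mean plus a user bias, an item bias and the dot product of the user and item factor vectors. A small numeric sketch of that rule, with made-up parameter values rather than anything fitted on this dataset:

```python
import numpy as np

# Made-up parameters for one user and one movie (not fitted values).
global_mean = 3.5                     # mu: mean of all training ratings
user_bias = 0.2                       # b_u
item_bias = -0.1                      # b_i
user_factors = np.array([0.5, -0.2])  # p_u
item_factors = np.array([0.4, 0.1])   # q_i

# Biased matrix-factorisation estimate: mu + b_u + b_i + q_i . p_u
raw_estimate = global_mean + user_bias + item_bias + item_factors @ user_factors

# Keep the estimate inside the rating scale (Reader() defaults to 1..5).
estimate = float(np.clip(raw_estimate, 1.0, 5.0))
print(round(estimate, 2))
```

In the function below, `n_factors=50` sets the length of these factor vectors, and `reg_all` and `lr_all` control how the biases and factors are learned.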


In [139]:
# SVD

def recommendation_SVD(userID=1, get_recommend=10):
    # n_epochs and lr_all are the best values found by the grid search above
    model = SVD(n_factors=50, n_epochs=10, lr_all=0.005, reg_all=0.2)
    model.fit(trainset)

    # Predict ratings for all (user, item) pairs that are not in the trainset
    testset = trainset.build_anti_testset()
    predictions = model.test(testset)
    predictions_df = pd.DataFrame(predictions)

    # Get the top get_recommend predictions for userID
    predictions_userID = predictions_df[predictions_df['uid'] == userID].\
                         sort_values(by="est", ascending=False).head(get_recommend)

    # Look up the titles; isin() returns them in Movies order, not ranked by estimate
    recommendations = Movies.loc[Movies['movieId'].isin(list(predictions_userID['iid']))]['title']

    return recommendations

In [140]:
# Find the top 10 movie recommendations for userId 6
recommendations = recommendation_SVD(6,10)
recommendations

46                             Usual Suspects, The (1995)
602     Dr. Strangelove or: How I Learned to Stop Worr...
659                                 Godfather, The (1972)
686                                    Rear Window (1954)
863                Monty Python and the Holy Grail (1975)
903     Good, the Bad and the Ugly, The (Buono, il bru...
906                             Lawrence of Arabia (1962)
914                                     Goodfellas (1990)
922                        Godfather: Part II, The (1974)
3622    Amelie (Fabuleux destin d'Amélie Poulain, Le) ...
Name: title, dtype: object