# Movie Recommendation System
*Building a movie recommendation system in three ways:*

 * 1) Popularity based recommendation system
 * 2) Content based recommendation system
 * 3) Collaborative filtering based recommendation system

# Popularity based Recommender

The easiest way to build a recommendation system is to rank movies by popularity. The basic idea behind this recommender is that movies that are more popular have a higher probability of being liked by the average audience.

This model does not give personalized recommendations based on the user.

# Import Libraries

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Get the Data

In [2]:
movies = pd.read_csv('movie_dataset.csv')
movies.head(5)

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton


In [3]:
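# Select the features needed to build a simple recommender
# .copy() makes this an independent DataFrame, so adding a column later does not write to a slice of `movies`
movies_useful_features = movies[['id','title','vote_average','vote_count']].copy()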
movies_useful_features.head()

Unnamed: 0,id,title,vote_average,vote_count
0,19995,Avatar,7.2,11800
1,285,Pirates of the Caribbean: At World's End,6.9,4500
2,206647,Spectre,6.3,4466
3,49026,The Dark Knight Rises,7.6,9106
4,49529,John Carter,6.1,2124


In [4]:
#checking nan values
movies_useful_features.isna().sum()

id              0
title           0
vote_average    0
vote_count      0
dtype: int64

## Using a Weighted Average for Each Movie's Average Rating

We could use each movie's average rating as its score, but that would not be fair: a movie with a 9.0 average rating from only 3 votes cannot be considered better than a movie with a 7.8 average rating from 40 votes. So I'll use IMDB's weighted rating (WR) formula, which is given as:


Weighted Rating (WR) = $\left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$

where,
* *v* is the number of votes for the movie
* *m* is the minimum votes required to be listed in the chart
* *R* is the average rating of the movie
* *C* is the mean vote across the whole report
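
To make the formula concrete, here is a minimal sketch that applies it to the two movies from the example above. The values of `C` and `m` are assumptions chosen for illustration, not values computed from the dataset, and `weighted_rating` is a hypothetical helper name.

```python
def weighted_rating(R, v, m, C):
    """IMDB-style weighted rating: pulls R toward C when v is small relative to m."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Assumed values, for illustration only
C = 6.0   # mean rating across all movies
m = 50    # minimum votes required to be listed

few_votes = weighted_rating(R=9.0, v=3, m=m, C=C)    # 9.0 average from 3 votes
more_votes = weighted_rating(R=7.8, v=40, m=m, C=C)  # 7.8 average from 40 votes

print(round(few_votes, 2))   # 6.17
print(round(more_votes, 2))  # 6.8
```

With these assumed values, the 3-vote movie is pulled almost all the way down to the mean, while the 40-vote movie keeps more of its own average and ends up ranked higher.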

In [5]:
# Compute each component of the IMDB formula
v=movies_useful_features['vote_count']
R=movies_useful_features['vote_average']
C=movies_useful_features['vote_average'].mean()
# m is set to the 70th percentile of vote counts
m=movies_useful_features['vote_count'].quantile(0.70)

movies_useful_features['weighted_average']=((R*v)+ (C*m))/(v+m)

In [6]:
#Finally, let's sort the DataFrame based on the weighted_average score
popular_movies = movies_useful_features.sort_values(by='weighted_average',ascending=False)

*Popular movie recommendations for all users*

In [7]:
popular_movies.head(10)

Unnamed: 0,id,title,vote_average,vote_count,weighted_average
1881,278,The Shawshank Redemption,8.5,8205,8.340775
3337,238,The Godfather,8.4,5893,8.192887
662,550,Fight Club,8.3,9413,8.171648
3232,680,Pulp Fiction,8.3,8428,8.157615
65,155,The Dark Knight,8.2,12002,8.102674
809,13,Forrest Gump,8.2,7927,8.056059
1818,424,Schindler's List,8.3,4329,8.038748
3865,244786,Whiplash,8.3,4254,8.034695
96,27205,Inception,8.1,13752,8.018611
1990,1891,The Empire Strikes Back,8.2,5879,8.010426
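
The steps above can be wrapped into one reusable function. This is only a sketch on a made-up four-row DataFrame: `top_chart` and the toy data are hypothetical, and the column names are assumed to match the ones used in this notebook.

```python
import pandas as pd

def top_chart(df, n=10, quantile=0.70):
    """Return the n highest-scoring rows by IMDB-style weighted rating."""
    v = df['vote_count']
    R = df['vote_average']
    C = R.mean()                 # mean rating across all movies
    m = v.quantile(quantile)     # vote-count threshold
    # assign() returns a new DataFrame, so the input is left unchanged
    scored = df.assign(weighted_average=(R * v + C * m) / (v + m))
    return scored.sort_values('weighted_average', ascending=False).head(n)

# Toy data, for illustration only
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'vote_average': [9.0, 7.8, 6.0, 5.0],
    'vote_count': [3, 400, 150, 1000],
})
chart = top_chart(toy, n=2)
print(chart[['title', 'weighted_average']])
```

Here movie B ranks first even though movie A has the higher raw average, because A's 3 votes carry very little weight.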


# Content Based Recommender

The recommender we built in the previous section suffers from some severe limitations. For one, it gives the same recommendations to everyone, regardless of the user's personal taste. A person who loves romantic movies (and hates action) probably wouldn't like most of the movies in our Top 10 chart.

To personalise our recommendations more, I am going to build an engine that computes the similarity between movies based on certain metrics and suggests the movies most similar to a particular movie that a user liked. Since we will be **using movie metadata (or content) to build this engine**, this is also known as **Content Based Filtering.**
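
Before running this on the real data, the idea of "similarity between movies based on metadata" can be sketched in plain Python. The three metadata strings are made up, and the whitespace tokenization is a simplification of what scikit-learn's `CountVectorizer` does.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token-count mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical metadata strings (keywords + genres + director), not from the dataset
docs = {
    'Movie A': 'hero city crime action nolan',
    'Movie B': 'hero city villain action nolan',
    'Movie C': 'love paris wedding romance',
}
# Bag of words: count how often each token appears in each string
counts = {title: Counter(text.lower().split()) for title, text in docs.items()}

sim_ab = cosine(counts['Movie A'], counts['Movie B'])
sim_ac = cosine(counts['Movie A'], counts['Movie C'])
print(round(sim_ab, 2), round(sim_ac, 2))  # 0.8 0.0
```

Movies A and B share four of their five tokens, so they score 0.8. Movie C shares none with A, so it scores 0.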

In [8]:
# Import libraries
from sklearn.metrics.pairwise import cosine_similarity

In [9]:
# Same DataFrame that was loaded earlier
movies.head()

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton


In [10]:
# Select features
# If any of these columns contain null values, fill them with empty strings
feature = ['keywords','cast','genres','director']

for i in feature:
    movies[i] = movies[i].fillna('')

In [11]:
# Concatenate all four columns into one separate column
movies["combined_features"] = movies['keywords'] +' '+ movies['cast'] +' '+ movies['genres'] +' '+ movies['director']

In [12]:
movies["combined_features"].head()

0    culture clash future space war space colony so...
1    ocean drug abuse exotic island east india trad...
2    spy based on novel secret agent sequel mi6 Dan...
3    dc comics crime fighter terrorist secret ident...
4    based on novel mars medallion space travel pri...
Name: combined_features, dtype: object

In [13]:
#create count matrix from this new combined column
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
count_matrix = cv.fit_transform(movies["combined_features"])

In [14]:
# Compute the cosine similarity based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(count_matrix)

In [15]:
# Function that takes a movie title as input and prints the most similar movies

# Construct a reverse map from movie title to row index
indices = pd.Series(movies.index, index=movies['title'])
# Keep the first row when a title appears more than once, so the lookup returns a single index
indices = indices[~indices.index.duplicated(keep='first')]

def get_recomandation_contentBase(title):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores,key=lambda x:x[1],reverse=True)

    # Keep the top 16 entries: normally the movie itself plus its 15 most similar movies
    sim_scores = sim_scores[0:16]

    for i in sim_scores:
        movie_index = i[0]
        print(movies['title'].iloc[movie_index])

In [16]:
# Now let's get recommendations
get_recomandation_contentBase('The Dark Knight Rises')

The Dark Knight Rises
Batman Begins
The Dark Knight
Amidst the Devil's Wings
The Killer Inside Me
The Prestige
Batman Returns
Batman
Batman & Robin
Kick-Ass
RockNRolla
Kick-Ass 2
Harry Brown
In Too Deep
Defendor
Point Blank
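
The list above starts with the query movie itself, because every movie has similarity 1.0 with itself. The sketch below is one possible variant that leaves the query movie out, returns a list instead of printing, and returns an empty list for an unknown title. `top_similar` and the small similarity matrix are hypothetical; this is not the function used above.

```python
import numpy as np

def top_similar(title, titles, sim_matrix, n=3):
    """Return the n titles most similar to `title`, excluding the title itself."""
    if title not in titles:
        return []
    idx = titles.index(title)
    order = np.argsort(sim_matrix[idx])[::-1]   # row positions, highest score first
    return [titles[i] for i in order if i != idx][:n]

# Toy similarity matrix, for illustration only
titles = ['A', 'B', 'C', 'D']
sim = np.array([
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.0],
    [0.4, 0.3, 0.0, 1.0],
])
print(top_similar('A', titles, sim, n=2))  # ['B', 'D']
```

Returning a list rather than printing makes the result easier to reuse, for example to merge it with the popularity scores from the first section.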


# Collaborative filtering

Here the recommendations are based on user behavior, so the user's history plays an important role. For example, if user 'A' likes 'Coldplay', 'Linkin Park' and 'Britney Spears' while user 'B' likes 'Coldplay', 'Linkin Park' and 'Taylor Swift', then they have similar interests. So there is a good chance that user 'A' would like 'Taylor Swift' and user 'B' would like 'Britney Spears'. This is how collaborative filtering works.
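
The example above can be sketched in a few lines of plain Python. Jaccard overlap is used as the similarity measure here only because it is simple for sets of liked items. It is an assumption for this sketch, not the measure used later in this notebook. User 'C' and the helper names are also hypothetical.

```python
likes = {
    'A': {'Coldplay', 'Linkin Park', 'Britney Spears'},
    'B': {'Coldplay', 'Linkin Park', 'Taylor Swift'},
    'C': {'Metallica'},   # hypothetical third user with different taste
}

def jaccard(s, t):
    """Share of items two users have in common, out of all items either one likes."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0

def recommend(user, likes):
    """Find the most similar other user and suggest what they like that `user` has not liked yet."""
    others = [u for u in likes if u != user]
    nearest = max(others, key=lambda u: jaccard(likes[user], likes[u]))
    return nearest, sorted(likes[nearest] - likes[user])

print(recommend('A', likes))  # ('B', ['Taylor Swift'])
print(recommend('B', likes))  # ('A', ['Britney Spears'])
```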

* In general, collaborative filtering (CF) is more commonly used than content-based filtering because it usually gives better results and is relatively easy to understand (from an overall implementation perspective).

**Collaborative filters can further be classified into two types:**

* 1) user-user filtering

* 2) item-item filtering

Note: The dataset we used before has no `userId` column, which collaborative filtering requires, so let's load another dataset.

# Get the Data

In [17]:
ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')

In [18]:
#We can merge them together
data = pd.merge(ratings,movies,on='movieId')
data.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [19]:
# Select the useful features
data = data[['movieId','title','userId','rating']]
data.head()

Unnamed: 0,movieId,title,userId,rating
0,1,Toy Story (1995),1,4.0
1,1,Toy Story (1995),5,4.0
2,1,Toy Story (1995),7,4.5
3,1,Toy Story (1995),15,2.5
4,1,Toy Story (1995),17,4.5


Now let's take a quick look at the number of unique users and movies.

In [20]:
n_users = data['userId'].nunique()
n_items = data['movieId'].nunique()

print('Num. of Users: '+ str(n_users))
print('Num of Movies: '+str(n_items))

Num. of Users: 610
Num of Movies: 9724


In [21]:
# Group by title and count how many users rated each movie; reset the index so that title stays a regular column
movie_rating_count = pd.DataFrame(data.groupby('title')['rating'].count().reset_index())
movie_rating_count = movie_rating_count.rename(columns={'rating':'total rating count'})
movie_rating_count.head()

Unnamed: 0,title,total rating count
0,'71 (2014),1
1,'Hellboy': The Seeds of Creation (2004),1
2,'Round Midnight (1986),2
3,'Salem's Lot (2004),1
4,'Til There Was You (1997),2


In [22]:
#adding new column called total rating count
rating_with_totalRatingCount = pd.merge(data,movie_rating_count,on='title')
rating_with_totalRatingCount.head()

Unnamed: 0,movieId,title,userId,rating,total rating count
0,1,Toy Story (1995),1,4.0,215
1,1,Toy Story (1995),5,4.0,215
2,1,Toy Story (1995),7,4.5,215
3,1,Toy Story (1995),15,2.5,215
4,1,Toy Story (1995),17,4.5,215


In [23]:
# Keep only the movies whose total rating count is at least 50
rating_popular_movie = rating_with_totalRatingCount[rating_with_totalRatingCount['total rating count']>=50]

Now let's create a matrix that has the user IDs on one axis and the movie titles on the other. Each cell holds the rating that user gave to that movie. Most people have not seen most of the movies, so the pivot produces many NaN values, which are filled with 0 here.

In [24]:
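# Create the pivot table; unrated movies become 0
# Note: this uses the unfiltered rating_with_totalRatingCount, so rating_popular_movie is not used here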
movie_feature_df = rating_with_totalRatingCount.pivot_table(index='userId',columns='title',values='rating').fillna(0)
movie_feature_df.head()

userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Similarity can be computed with Pearson correlation or cosine similarity. I am using **Pearson correlation in user-to-user CF** and **cosine similarity in item-to-item CF**.
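
The two measures are closely related: Pearson correlation is the cosine similarity of the mean-centered vectors. The sketch below shows how they can disagree, using made-up rating vectors and hypothetical helper names.

```python
import numpy as np

def cosine(x, y):
    """Cosine of the angle between two rating vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson(x, y):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    return cosine(x - x.mean(), y - y.mean())

# Hypothetical ratings given to three movies by the same five users
movie_x = np.array([5.0, 4.0, 5.0, 4.0, 5.0])
movie_y = np.array([2.0, 1.0, 2.0, 1.0, 2.0])   # same pattern as x, lower scale
movie_z = np.array([4.0, 5.0, 4.0, 5.0, 4.0])   # similar scale to x, opposite pattern

print(round(pearson(movie_x, movie_y), 3))
print(round(pearson(movie_x, movie_z), 3))
print(round(cosine(movie_x, movie_z), 3))
```

Pearson looks only at the pattern of ups and downs, so `movie_x` and `movie_y` correlate perfectly despite their different scales, and `movie_x` and `movie_z` are perfectly anti-correlated. Plain cosine similarity still rates `movie_x` and `movie_z` as very similar (about 0.98), because both vectors consist of large positive numbers.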

# **User-User collaborative filtering**

In [25]:
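# DataFrame.corr() computes the pairwise Pearson correlation between columns.
# The columns here are movie titles, so this matrix is indexed by title on both axes.
user_similarity = movie_feature_df.corr()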

In [26]:
# Function that takes a movie title and the rating a user gave it, and returns a score for every movie
def get_recomandation(movie_name,ratings):
    # Center the rating on 2.5: ratings above it reward similar movies, ratings below it penalize them
    similar_score = user_similarity[movie_name]*(ratings-2.5)
    similar_score = similar_score.sort_values(ascending=False)
    
    return similar_score

In [27]:
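# Now let's make predictions. A rating of 2 is below 2.5, so the scores flip sign
# and the top results are the movies most negatively correlated with Toy Story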
get_recomandation('Toy Story (1995)',2).head(10)

title
Man from Earth, The (2007)                       0.040539
Wild Tales (2014)                                0.039545
Intouchables (2011)                              0.037117
Hunt, The (Jagten) (2012)                        0.033501
Incendies (2010)                                 0.032223
Planet Earth II (2016)                           0.031829
The Lair of the White Worm (1988)                0.028170
Departures (Okuribito) (2008)                    0.025979
Sacrifice, The (Offret - Sacraficatio) (1986)    0.025034
Single Man, A (2009)                             0.025027
Name: Toy Story (1995), dtype: float64

# **Item-Item collaborative filtering**

In [28]:
item_similarity = cosine_similarity(movie_feature_df.T)
item_similarity_df = pd.DataFrame(item_similarity,index=movie_feature_df.columns,columns=movie_feature_df.columns)

In [29]:
# Function that takes a movie title and the rating a user gave it, and returns a score for every movie
def get_recomandation2(movie_name,ratings):
    # Center the rating on 2.5: ratings above it reward similar movies, ratings below it penalize them
    similar_score = item_similarity_df[movie_name]*(ratings-2.5)
    similar_score = similar_score.sort_values(ascending=False)
    
    return similar_score

In [30]:
# Now let's make predictions. A rating of 3 is above 2.5, so the most similar movies come first
get_recomandation2('Toy Story (1995)',3).head(10)

title
Toy Story (1995)                                     0.500000
Toy Story 2 (1999)                                   0.286301
Jurassic Park (1993)                                 0.282818
Independence Day (a.k.a. ID4) (1996)                 0.282131
Star Wars: Episode IV - A New Hope (1977)            0.278694
Forrest Gump (1994)                                  0.273548
Lion King, The (1994)                                0.270573
Star Wars: Episode VI - Return of the Jedi (1983)    0.270545
Mission: Impossible (1996)                           0.269456
Groundhog Day (1993)                                 0.267084
Name: Toy Story (1995), dtype: float64
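
Both functions above score movies from a single rated movie. One possible extension, sketched here under assumptions, is to sum the same `similarity * (rating - 2.5)` term over every movie a user has rated and then drop the movies they have already seen. The small similarity matrix stands in for `item_similarity_df`. Its values and the `recommend_for_user` helper are hypothetical.

```python
import pandas as pd

# Hypothetical movie-by-movie similarity matrix (stand-in for item_similarity_df)
titles = ['Toy Story', 'Toy Story 2', 'Alien', 'Aliens']
sim_df = pd.DataFrame(
    [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.1, 0.1],
     [0.1, 0.1, 1.0, 0.8],
     [0.2, 0.1, 0.8, 1.0]],
    index=titles, columns=titles,
)

def recommend_for_user(user_ratings, sim_df, midpoint=2.5):
    """Sum similarity * (rating - midpoint) over every movie the user rated."""
    total = pd.Series(0.0, index=sim_df.index)
    for title, rating in user_ratings:
        total = total + sim_df[title] * (rating - midpoint)
    seen = [title for title, _ in user_ratings]
    # Remove already-rated movies and rank the rest, best first
    return total.drop(labels=seen).sort_values(ascending=False)

scores = recommend_for_user([('Toy Story', 5), ('Alien', 1)], sim_df)
print(scores)
```

A user who loved 'Toy Story' and disliked 'Alien' gets 'Toy Story 2' with a positive score and 'Aliens' with a negative one.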

# Great Job!