# Movie Recommendation System

Movie recommendation systems and similar algorithms are now commonly seen data science applications across all kinds of apps and interfaces. Netflix recommends shows based on what you've seen, Tiktok customizes your for you page based on your likes, and Instagram changes the posts you see based on what you click into. In this notebook, I am to follow the tutorial provided [here](https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system/notebook)
to create a movie recommendation system.

In [7]:
## Import necessary libraries
import pandas as pd
import numpy as np
import sklearn

In [9]:
## Read in csv from kaggle dataset
df1 = pd.read_csv('tmdb_data/tmdb_5000_credits.csv')
df2 = pd.read_csv('tmdb_data/tmdb_5000_movies.csv')

In [11]:
## Join datasets together on id
df1.columns = ['id','tittle','cast','crew']
df2= df2.merge(df1,on='id')

In [13]:
## Take a look at what our dataframe looks like
df2.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,tittle,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [15]:
df1.head()

Unnamed: 0,id,tittle,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


## Demographic Filtering
To get started, we will first look at demographic filtering. This is a method of utilizing a weighted ratings of a movie to judge it's performance rather than simply looking at average rating. This is a more representative method of rating movie performance since it accounts for not just the average rating of a movie, but also the number of votes for a movie. We will be using the formula provided below in our calculations.

Weighted Rating (WR) = ($\frac{v}{v + m}$ * R) + ($\frac{m}{v + m}$ * C)
- v: Number of votes
- m: minimum votes required to be listed
- R: Average rating
- C: Mean vote across entire report

In [20]:
## Calculate the average vote
C = df2['vote_average'].mean()
C

6.092171559442016

Now we must decide what we want "m" to be. Since our purpose here is to recommend movies users might like, we want to recommend movies that are highly rated. Thus, choosing movies that have received more votes than 90% of all movies seems to suffice.

In [25]:
## Calculate the 90th percentile of votes
m = df2['vote_count'].quantile(0.9)
m

1838.4000000000015

In [27]:
## Only select movies that meet our requirement of being above the 90th percentile
qualifying_movies = df2.copy().loc[df2['vote_count'] >= m]
qualifying_movies.shape

(481, 23)

Now we have defined all the variables necessary for using our formula. We can now define a function to calculate the weighted rating of each movie. 

In [34]:
## Function for calculating the weighted rating
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v + m) * R) + (m/ (m + v) * C)

In [38]:
## Create a new column, 'score', with weighted rating of each movie
qualifying_movies['score'] = qualifying_movies.apply(weighted_rating, axis = 1)
qualifying_movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,spoken_languages,status,tagline,title,vote_average,vote_count,tittle,cast,crew,score
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",7.050669
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",6.665696
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",6.239396
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",7.346721
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",6.096368


In [40]:
## Sort by descending score and filter for relevant columns
qualifying_movies = qualifying_movies.sort_values('score', ascending = False)
qualifying_movies[['title', 'vote_count', 'vote_average', 'score']].head()

Unnamed: 0,title,vote_count,vote_average,score
1881,The Shawshank Redemption,8205,8.5,8.059258
662,Fight Club,9413,8.3,7.939256
65,The Dark Knight,12002,8.2,7.92002
3232,Pulp Fiction,8428,8.3,7.904645
96,Inception,13752,8.1,7.863239


In [42]:
## Sort by popularity
popular = qualifying_movies.sort_values('popularity', ascending = False)
popular[['title', 'vote_count', 'vote_average', 'score', 'popularity']].head()

Unnamed: 0,title,vote_count,vote_average,score,popularity
546,Minions,4571,6.4,6.311706,875.581305
95,Interstellar,10867,8.1,7.809479,724.247784
788,Deadpool,10995,7.4,7.212652,514.569956
94,Guardians of the Galaxy,9742,7.9,7.613005,481.098624
127,Mad Max: Fury Road,9427,7.2,7.019214,434.278564


## Content Based Filtering

We now move on to a more complex method of recommendation. For content based filtering, we recommend movies based off of it's contents (much like the name). We will consider factors such as overview, cast, crew, and keyword, among others, to assign a similarity score among movies. Then movies that are judged to be most similar will be recommended to the user.

In [45]:
## Look at what the overview column contains
df2['overview'].head()

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
Name: overview, dtype: object

We will now use sklearn's TfidVectorizer to process the overview. The relative frequency of a word in a document will be calculated as term instances / total instances, with inverse document frequency being calculated as log(number of documents / documents with term). The overall importance of a word will be equal to TF * IDF.

Computing each word as such will give us a matrix where columns represents words in the overview, with each row being a different movie. Ultimately this will allow us to reduce the importance of words that occur frequenty in overviews and reduce their contribution to a similarity score.

In [47]:
## Convert each overview to a Term Frequency-Inverse Document Frequency vector

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words = 'english')
df2['overview'] = df2['overview'].fillna('')
tfidf_matrix = tfidf.fit_transform(df2['overview'])
tfidf_matrix.shape

(4803, 20978)

Now that we have created a matrix representation of each movie's overview, we can compute a similarity score. We will be doing so using the cosine similarity score, calculated with the formula below. 

similarity = cos($\theta$) = $\frac{A * B}{||A|| ||B||}$ = $$\frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i ^2}{\sqrt{\sum_{i=1}^n B_i ^2}}}$$

After utilizing the TF-IDF vectorizer, we can determine the cosine similarity using dot product with sklearn's linear_kernel()

In [52]:
## Import linear kernel
from sklearn.metrics.pairwise import linear_kernel

## Compute similarity scores
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [54]:
## Create an index to identify movie given title
indices = pd.Series(df2.index, index=df2['title']).drop_duplicates()

We now have a method of computing similarity score as well as a method of identifying movie instances given a title. With these tools in our repertoire, we can create a function to recommend movies. 

In [56]:
## Function to find most similar movies
def get_recommendations(title, cosine_sim = cosine_sim):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    return df2['title'].iloc[movie_indices]

In [22]:
get_recommendations('The Dark Knight Rises')

65                              The Dark Knight
299                              Batman Forever
428                              Batman Returns
1359                                     Batman
3854    Batman: The Dark Knight Returns, Part 2
119                               Batman Begins
2507                                  Slow Burn
9            Batman v Superman: Dawn of Justice
1181                                        JFK
210                              Batman & Robin
Name: title, dtype: object

In [23]:
get_recommendations("The Avengers")

7               Avengers: Age of Ultron
3144                            Plastic
1715                            Timecop
4124                 This Thing of Ours
3311              Thank You for Smoking
3033                      The Corruptor
588     Wall Street: Money Never Sleeps
2136         Team America: World Police
1468                       The Fountain
1286                        Snowpiercer
Name: title, dtype: object

Our recommendation system now looks at movie overviews and recommends the most similar ones! However, even among movies with similar overviews there may be some differences. For example, The Dark Knight Rises has a high similarity score with all other Batman movies, but those who enjoyed the movie may enjoy other movies with the same actors/director. 

To account for this, we will now look at the top 3 actors, directors, related genres, and movie plot when making our recommendations.

In [60]:
## Parse string features into python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(literal_eval)

In [62]:
## Function to identify the director
def get_director(x): 
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [64]:
## Function to identify top 3 elements in a lits
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
    if len(names) > 3:
        names = names[:3]
    return names

    return []

In [66]:
## Create a director, cast, genre column as needed
df2['director'] = df2['crew'].apply(get_director)
features = ['cast', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(get_list)

In [70]:
## Looking at our new table
df2[['title', 'cast', 'director', 'keywords', 'genres']].head()

Unnamed: 0,title,cast,director,keywords,genres
0,Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",James Cameron,"[culture clash, future, space war]","[Action, Adventure, Fantasy]"
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley]",Gore Verbinski,"[ocean, drug abuse, exotic island]","[Adventure, Fantasy, Action]"
2,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",Sam Mendes,"[spy, based on novel, secret agent]","[Action, Adventure, Crime]"
3,The Dark Knight Rises,"[Christian Bale, Michael Caine, Gary Oldman]",Christopher Nolan,"[dc comics, crime fighter, terrorist]","[Action, Crime, Drama]"
4,John Carter,"[Taylor Kitsch, Lynn Collins, Samantha Morton]",Andrew Stanton,"[based on novel, mars, medallion]","[Action, Adventure, Science Fiction]"


In [72]:
## Function to convert strings to lowercase and remove spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''


In [74]:
## Running the function on the dataset
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    df2[feature] = df2[feature].apply(clean_data)

In [76]:
## Create new column with keywords, cast, director, and genre concatenated together
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' '
    + x['director'] + ' ' + ' '.join(x['genres'])
df2['soup'] = df2.apply(create_soup, axis=1)

In [78]:
## Use CountVectorizer to later calculate similarity 
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df2['soup'])

In [80]:
## Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [82]:
df2 = df2.reset_index()
indices = pd.Series(df2.index, index=df2['title'])

In [84]:
get_recommendations("The Dark Knight Rises", cosine_sim2)

65                         The Dark Knight
119                          Batman Begins
1196                          The Prestige
1246                     Quest for Camelot
1775                         The Statement
2460                            The Unborn
317                     The Flowers of War
2793                  The Killer Inside Me
3172                         The Contender
9       Batman v Superman: Dawn of Justice
Name: title, dtype: object

In [90]:
get_recommendations("The Godfather", cosine_sim2)

867     The Godfather: Part III
4124         This Thing of Ours
2401             City of Ghosts
2649          The Son of No One
4147           Small Apartments
499               Jack and Jill
3069              Danny Collins
20       The Amazing Spider-Man
277               Casino Royale
286                      Eraser
Name: title, dtype: object

## Collaborative Filtering

Lastly we build a recommendation system based on collaborative filtering. A limitation of content based filtering is that it can only recommend similar movies. It cannot capture tastes and provide cross genre recommendations.

In this section, we will implement a new technique that takes into account similarity to other users, and the preferences of similar users. We will use item-based collaborative filtering for this recommendation system.

In [88]:
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
reader = Reader()
ratings = pd.read_csv('ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [38]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
model = SVD()
results = cross_validate(model, data, measures=['RMSE', 'MAE'], cv=5, verbose = True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8801  0.8737  0.8710  0.8675  0.8756  0.8736  0.0043  
MAE (testset)     0.6774  0.6720  0.6702  0.6650  0.6725  0.6714  0.0040  
Fit time          0.60    0.61    0.61    0.60    0.61    0.61    0.00    
Test time         0.12    0.06    0.11    0.06    0.11    0.09    0.03    


In [39]:
trainset = data.build_full_trainset()
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x146cfef30>

In [40]:
ratings[ratings['userId'] == 1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
227,1,3744,4.0,964980694
228,1,3793,5.0,964981855
229,1,3809,4.0,964981220
230,1,4006,4.0,964982903


In [41]:
model.predict(1, 302, 3)

Prediction(uid=1, iid=302, r_ui=3, est=4.181884060623315, details={'was_impossible': False})

## Conclusion

With this, we have implemented 3 different approaches to movie recommendation. Each method has it's own pro's and con's, and tackles the problem in a slightly different way. To further improve upon the recommendations, a hybrid system can be implemented where two approaches compliment one another. 