# SIMPLE MOVIE RECOMMENDER

**The output of this notebook is the Top 10 movies of any specified genre, ranked by year of release, popularity or weighted rating.**

We will first begin by importing the necessary tools and modules :

In [1]:
import pandas as pd
import numpy as np
from ast import literal_eval
import warnings
warnings.simplefilter('ignore')  # Suppresses warnings so they are not shown in the output.

Reading the "movies_metadata" CSV file :

In [2]:
smr = pd.read_csv("movies_metadata.csv")
smr.shape

(45466, 24)

In [3]:
smr.head(5)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In the "genres" column, each entry is a list of dictionaries, stored as a string. We need to reduce each entry to a plain list of genre names, dropping the "id" of each genre.

To do that, we first handle NaN values by filling them with an empty list, passed as the string `'[]'`. Because every entry is a string, we use `literal_eval()` from the `ast` module to parse it into an actual list. Finally, we drop the "id" and keep only the genre name, using a list comprehension :

In [4]:
smr['genres'] = smr['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
smr['genres']

0         [Animation, Comedy, Family]
1        [Adventure, Fantasy, Family]
2                   [Romance, Comedy]
3            [Comedy, Drama, Romance]
4                            [Comedy]
                     ...             
45461                 [Drama, Family]
45462                         [Drama]
45463       [Action, Drama, Thriller]
45464                              []
45465                              []
Name: genres, Length: 45466, dtype: object
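
As a rough illustration of the parsing step above, the sketch below applies the same idea to a single cell using only the standard library. The helper name `parse_genres` and the sample string are hypothetical, and the extra error handling is an assumption added for malformed cells, not something the notebook does :

```python
from ast import literal_eval


def parse_genres(raw):
    """Turn a stringified list of {'id', 'name'} dicts into a list of names."""
    if not isinstance(raw, str):
        return []  # NaN or any other non-string cell
    try:
        parsed = literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # the cell is not a valid Python literal
    if not isinstance(parsed, list):
        return []
    # Keep only the genre name and drop the id.
    return [d['name'] for d in parsed if isinstance(d, dict) and 'name' in d]


sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(parse_genres(sample))
print(parse_genres('[]'))
print(parse_genres(float('nan')))
```

With pandas, such a helper could be passed to `smr['genres'].apply(...)` in place of the chained `fillna`/`apply` calls.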

Storing the vote counts and the vote averages as integers, after filtering out the NaN values. Note that `astype(int)` truncates, so a `vote_average` of 7.7 becomes 7 :

In [5]:
vote_counts = smr[smr['vote_count'].notnull()]['vote_count'].astype(int)
vote_averages = smr[smr['vote_average'].notnull()]['vote_average'].astype(int)

Setting variables for the weighted rating :

In [6]:
mean_votes = vote_averages.mean()
min_votes_req_for_charts = vote_counts.quantile(0.95)
# 0.95 is the 95th percentile: a movie needs at least as many votes as 95% of the movies in the dataset to be featured in the Top Charts.

min_votes_req_for_charts

434.0

_This means that a movie requires a minimum of `434.0` votes to be featured in the Top Charts._
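
To make the percentile cutoff concrete, here is a small sketch on made-up vote counts. The values 1 to 100 are illustrative only and are not taken from the dataset :

```python
import pandas as pd

# Hypothetical vote counts: one movie each with 1, 2, ..., 100 votes.
votes = pd.Series(range(1, 101))

# The 95th percentile, using pandas' default linear interpolation.
cutoff = votes.quantile(0.95)

# Share of movies that meet the cutoff and would enter the charts.
share_qualified = (votes >= cutoff).mean()

print(cutoff, share_qualified)
```

Roughly the top 5% of movies by vote count clear the bar, which is the effect the `0.95` argument has in the cell above.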

Creating a column "year" which stores the year from the "release_date" column :

In [7]:
# A check such as `x != np.nan` is always True, so it is omitted here; unparseable dates end up as the string 'NaT'.
smr['year'] = pd.to_datetime(smr['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0])
smr['year']

0        1995
1        1995
2        1995
3        1995
4        1995
         ... 
45461     NaT
45462    2011
45463    2003
45464    1917
45465    2017
Name: year, Length: 45466, dtype: object
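
The year can also be read directly from the parsed timestamps instead of splitting their string form. The sketch below is one possible alternative, not the notebook's method; the sample dates are made up, and the nullable `Int64` dtype is an assumption chosen so that missing years stay missing rather than becoming text :

```python
import pandas as pd

# Hypothetical release dates, including an invalid and a missing value.
release_dates = pd.Series(['1995-10-30', 'not a date', None])

# Invalid or missing dates are coerced to NaT.
parsed = pd.to_datetime(release_dates, errors='coerce')

# .dt.year yields NaN for NaT; Int64 keeps real years as integers.
years = parsed.dt.year.astype('Int64')

print(years)
```

With this representation, rows without a year can be dropped with `dropna()` before any conversion to a plain integer.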

Creating a table of the movies whose vote count is at least the minimum number of votes required :

In [8]:
top_movies = smr[(smr['vote_count']>=min_votes_req_for_charts) & (smr['vote_count'].notnull()) & (smr['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
top_movies.shape

(2274, 6)

In [9]:
top_movies

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres
0,Toy Story,1995,5415.0,7.7,21.946943,"[Animation, Comedy, Family]"
1,Jumanji,1995,2413.0,6.9,17.015539,"[Adventure, Fantasy, Family]"
5,Heat,1995,1886.0,7.7,17.924927,"[Action, Crime, Drama, Thriller]"
9,GoldenEye,1995,1194.0,6.6,14.686036,"[Adventure, Action, Thriller]"
15,Casino,1995,1343.0,7.8,10.137389,"[Drama, Crime]"
...,...,...,...,...,...,...
44624,What Happened to Monday,2017,598.0,7.3,60.581223,"[Science Fiction, Thriller]"
44632,Atomic Blonde,2017,748.0,6.1,14.455104,"[Action, Thriller]"
44678,Dunkirk,2017,2712.0,7.5,30.938854,"[Action, Drama, History, Thriller, War]"
44842,Transformers: The Last Knight,2017,1440.0,6.2,39.186819,"[Action, Science Fiction, Thriller, Adventure]"


In [10]:
top_movies['vote_count'].dtypes
top_movies['vote_average'].dtypes

dtype('float64')

Converting `vote_count` and `vote_average` from float to int :

In [11]:
top_movies['vote_count'] = top_movies['vote_count'].astype(int)
top_movies['vote_average'] = top_movies['vote_average'].astype(int)

In [12]:
top_movies['vote_count'].dtypes
top_movies['vote_average'].dtypes

dtype('int32')

IMDb's Weighted Rating formula is as follows :

$$\text{Weighted Rating} = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

where

v = Number of votes for the movie<br>
m = Minimum number of votes required by a movie to enter the Top Charts<br>
R = Average rating of the movie<br>
C = Mean rating across all movies (`mean_votes` in the code)<br>
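
A short worked example may help show how the formula behaves. The numbers below (m = 434, R = 8, C = 5) are illustrative assumptions, and the standalone function is a sketch separate from the notebook's own implementation :

```python
def weighted_rating_value(v, r, m, c):
    """Blend a movie's own rating r with the global mean c, weighted by votes."""
    return v / (v + m) * r + m / (v + m) * c


m, c = 434, 5  # hypothetical cutoff and global mean rating

no_votes = weighted_rating_value(0, 8, m, c)         # falls back to the mean c
at_cutoff = weighted_rating_value(434, 8, m, c)      # halfway between r and c
many_votes = weighted_rating_value(43400, 8, m, c)   # close to the movie's own r

print(no_votes, at_cutoff, many_votes)
```

The more votes a movie has relative to m, the more its weighted rating reflects its own average rather than the overall mean.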

We will now create a function that implements the IMDB's Weighted Rating Formula on our dataset :

In [13]:
def weighted_rating(x):
    num_of_votes = x['vote_count']
    average_rating = x['vote_average']
    return (num_of_votes/(num_of_votes + min_votes_req_for_charts)*average_rating) + (min_votes_req_for_charts/(min_votes_req_for_charts + num_of_votes)*mean_votes)

Creating a column for weighted rating of each movie, and sorting them all :

In [14]:
top_movies['weighted_rating'] = top_movies.apply(weighted_rating, axis=1)
top_movies = top_movies.sort_values('weighted_rating', ascending=False).head(250)
top_movies.head(5)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,weighted_rating
15480,Inception,2010,14075,8,29.108149,"[Action, Thriller, Science Fiction, Mystery, A...",7.917588
12481,The Dark Knight,2008,12269,8,123.167259,"[Drama, Action, Crime, Thriller]",7.905871
22879,Interstellar,2014,11187,8,32.213481,"[Adventure, Drama, Science Fiction]",7.897107
2843,Fight Club,1999,9678,8,63.869599,[Drama],7.881753
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.070725,"[Adventure, Fantasy, Action]",7.871787
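
The row-wise `apply` used above can also be expressed as plain column arithmetic, which pandas evaluates for all rows at once. This is a sketch under assumptions: the helper name `add_weighted_rating`, the toy frame and the values of `m` and `c` are all hypothetical :

```python
import pandas as pd


def add_weighted_rating(df, m, c):
    """Return a copy of df with a 'weighted_rating' column computed column-wise."""
    v = df['vote_count']
    r = df['vote_average']
    out = df.copy()
    out['weighted_rating'] = v / (v + m) * r + m / (v + m) * c
    return out


# Toy data, not taken from the dataset.
toy = pd.DataFrame({'vote_count': [434, 4340], 'vote_average': [8, 8]})
scored = add_weighted_rating(toy, m=434, c=5)

print(scored)
```

Passing `m` and `c` as arguments, rather than reading them from globals, also lets the same helper be reused with per-genre values.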


Splitting the "genres" list of each row so that every genre gets its own row, stored in a new "genre" column :

In [15]:
genre = smr.apply(lambda x: pd.Series(x['genres']), axis=1).stack().reset_index(level=1, drop=True)
genre.name = 'genre'
smr_genre = smr.drop('genres', axis=1).join(genre)
smr_genre['genre'].head(5)

0    Animation
0       Comedy
0       Family
1    Adventure
1      Fantasy
Name: genre, dtype: object
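
On pandas 0.25 or newer, `DataFrame.explode` offers a more direct way to get one row per genre. The sketch below uses a made-up three-row frame and is an alternative to the `apply`/`stack`/`join` approach above, not a description of it :

```python
import pandas as pd

# Toy data, not taken from the dataset.
movies = pd.DataFrame({
    'title': ['Toy Story', 'Heat', 'Untagged'],
    'genres': [['Animation', 'Comedy'], ['Action'], []],
})

# One row per (movie, genre); the original index is repeated per genre,
# and a movie with an empty list gets a single row with NaN.
movies_long = movies.explode('genres').rename(columns={'genres': 'genre'})

print(movies_long)
```

As with the join above, movies without any genre are kept with a missing `genre` value.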

Creating a function that takes a genre and a sort key (`weighted_rating`, `popularity` or `year`, or a list of these) as input, plus an optional cutoff percentile for the charts (default 0.85), and returns up to 250 top movies for the specified genre :

In [16]:
def top_charts(gen, sort_by, perc=0.85):
    df = smr_genre[smr_genre['genre']==gen]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype(int)
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype(int)
    c = vote_averages.mean()
    m = vote_counts.quantile(perc)
    qualified = df[(df['vote_count']>=m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype(int)
    qualified['vote_average'] = qualified['vote_average'].astype(int)
    qualified['popularity'] = qualified['popularity'].astype(float)
    qualified['year'] = qualified['year'].astype(int)
    qualified['weighted_rating'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m + x['vote_count']) * c), axis=1)
    qualified = qualified.sort_values(sort_by, ascending=False).head(250)
    return qualified
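
Because `sort_by` is handed straight to `sort_values`, it may be a single column name or a list of names. With a list, rows are ordered by the first column and ties are broken by the next one. A minimal sketch on made-up rows :

```python
import pandas as pd

# Toy data, not taken from the dataset.
df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'year': [2017, 2016, 2017],
    'popularity': [10.0, 99.0, 50.0],
})

# Newest year first; within the same year, most popular first.
ranked = df.sort_values(['year', 'popularity'], ascending=False)

print(ranked['title'].tolist())
```

This is why a call such as `top_charts('Horror', ['year', 'popularity'])` lists the most recent releases first, even when older titles are more popular.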

Different types of genres in our dataset :

In [17]:
all_genres = smr_genre['genre'].unique()
all_genres

array(['Animation', 'Comedy', 'Family', 'Adventure', 'Fantasy', 'Romance',
       'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'History',
       'Science Fiction', 'Mystery', 'War', 'Foreign', nan, 'Music',
       'Documentary', 'Western', 'TV Movie', 'Carousel Productions',
       'Vision View Entertainment', 'Telescene Film Group Productions',
       'Aniplex', 'GoHands', 'BROSTA TV',
       'Mardock Scramble Production Committee', 'Sentai Filmworks',
       'Odyssey Media', 'Pulser Productions', 'Rogue State', 'The Cartel'],
      dtype=object)
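
The array above also contains `nan` and several entries that look like production-company names rather than genres. One possible way to filter out such stray values is to keep only genres that occur often enough. The sketch below uses made-up data, and the threshold of 2 is an arbitrary assumption that would need tuning on the real dataset :

```python
import pandas as pd

# Toy genre column with one stray value and one missing value.
genres = pd.Series(['Drama', 'Drama', 'Comedy', 'Comedy', 'Aniplex', None])

# value_counts ignores missing values by default.
counts = genres.value_counts()

# Hypothetical rule: a real genre appears at least twice.
valid_genres = counts[counts >= 2].index
cleaned = genres[genres.isin(valid_genres)]

print(sorted(cleaned.unique()))
```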

**Testing out the Top Charts for movie recommendations :**

In [18]:
top_charts('Action', 'popularity').head(10)

Unnamed: 0,title,year,vote_count,vote_average,popularity,weighted_rating
33356,Wonder Woman,2017,5025,7,294.337037,6.92096
43644,Baby Driver,2017,2083,7,228.032744,6.820297
24455,Big Hero 6,2014,6289,7,213.849907,6.936292
26564,Deadpool,2016,11444,7,187.860492,6.964431
26566,Guardians of the Galaxy Vol. 2,2017,4858,7,185.330992,6.918364
14551,Avatar,2009,12114,7,185.070892,6.966363
24351,John Wick,2014,5499,7,183.870374,6.927503
26567,Captain America: Civil War,2016,7462,7,145.882135,6.946011
26560,Pirates of the Caribbean: Dead Men Tell No Tales,2017,2814,6,133.82782,5.938156
12481,The Dark Knight,2008,12269,8,123.167259,7.94861


In [19]:
top_charts('Adventure', 'weighted_rating').head(10)

Unnamed: 0,title,year,vote_count,vote_average,popularity,weighted_rating
15480,Inception,2010,14075,8,29.108149,7.906526
22879,Interstellar,2014,11187,8,32.213481,7.883426
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.070725,7.854939
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.324358,7.843867
5814,The Lord of the Rings: The Two Towers,2002,7641,8,29.423537,7.832647
256,Star Wars,1977,6778,8,42.149697,7.812801
1225,Back to the Future,1985,6239,8,25.778509,7.797828
1154,The Empire Strikes Back,1980,5998,8,19.470959,7.790329
5481,Spirited Away,2001,3968,8,41.048867,7.695056
9698,Howl's Moving Castle,2004,2049,8,16.136048,7.465435


In [20]:
top_charts('Horror', ['year', 'popularity']).head(10)

Unnamed: 0,title,year,vote_count,vote_average,popularity,weighted_rating
42902,Alien: Covenant,2017,2677,5,72.884078,4.991358
45202,Wish Upon,2017,127,5,59.578823,4.903564
39694,47 Meters Down,2017,548,5,52.854103,4.96398
45014,The Dark Tower,2017,688,5,50.903593,4.97019
42169,Get Out,2017,2978,7,36.894806,6.912248
42853,Life,2017,1959,6,36.263803,5.92885
42096,Rings,2017,1075,4,24.535733,4.083231
43289,It Comes at Night,2017,357,5,20.504587,4.949677
41973,The Bye Bye Man,2017,262,5,19.225832,4.937292
43296,Resident Evil: Vendetta,2017,160,6,13.875911,5.47815


In [21]:
top_charts('Romance', ['year', 'weighted_rating', 'popularity']).head(10)

Unnamed: 0,title,year,vote_count,vote_average,popularity,weighted_rating
45437,In a Heartbeat,2017,146,8,20.82178,7.003959
43220,"Everything, Everything",2017,670,7,12.108048,6.809355
41488,The Space Between Us,2017,564,7,11.824404,6.778497
42205,The Big Sick,2017,249,7,23.424794,6.573221
42222,Beauty and the Beast,2017,5530,6,287.253654,5.990364
42355,Fifty Shades Darker,2017,2341,6,29.130443,5.977728
43157,The Discovery,2017,312,6,6.803535,5.865569
42308,A Ghost Story,2017,95,6,24.339781,5.708649
43291,My Cousin Rachel,2017,91,5,12.855228,5.201968
44271,The Bad Batch,2017,160,5,78.8072,5.146424


In [22]:
top_charts('Science Fiction', ['weighted_rating', 'popularity']).head(10)

Unnamed: 0,title,year,vote_count,vote_average,popularity,weighted_rating
15480,Inception,2010,14075,8,29.108149,7.939069
22879,Interstellar,2014,11187,8,32.213481,7.923728
256,Star Wars,1977,6778,8,42.149697,7.876106
1225,Back to the Future,1985,6239,8,25.778509,7.865868
1154,The Empire Strikes Back,1980,5998,8,19.470959,7.860722
1163,A Clockwork Orange,1971,3432,8,17.112594,7.764533
1901,Metropolis,1927,666,8,14.487867,7.078592
14551,Avatar,2009,12114,7,185.070892,6.952299
17818,The Avengers,2012,12000,7,89.887648,6.951856
23753,Guardians of the Galaxy,2014,10014,7,53.291601,6.942571
