This system used overall TMDB vote counts and vote averages to build top-movie charts, both overall and for specific genres. The IMDB weighted rating formula was used to compute the scores on which the final sorting was performed.
- Movie overview and taglines based,
- Cast, crew, genre and keywords based. A simple filter is added to give greater preference to movies with more votes and higher ratings.
Used the Surprise module to build a collaborative filter based on singular value decomposition (SVD). The RMSE obtained was less than 1, and the engine gave estimated ratings for a given user and movie.
Combined content-based and collaborative filtering to build an engine that gave movie suggestions to a particular user based on the estimated ratings that it had internally calculated for that user.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
#from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD
import warnings; warnings.simplefilter('ignore')
md = pd.read_csv('dataset/movie/movies_metadata.csv')
md.head()
index | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 10/30/1995 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | NaN | Toy Story | False | 7.7 | 5415.0 |
1 | False | NaN | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | NaN | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 12/15/1995 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 |
2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | NaN | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | 12/22/1995 | 0.0 | 101.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 |
3 | False | NaN | 16000000 | [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... | NaN | 31357 | tt0114885 | en | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | ... | 12/22/1995 | 81452156.0 | 127.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Friends are the people who let you be yourself... | Waiting to Exhale | False | 6.1 | 34.0 |
4 | False | {'id': 96871, 'name': 'Father of the Bride Col... | 0 | [{'id': 35, 'name': 'Comedy'}] | NaN | 11862 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | ... | 2/10/1995 | 76578911.0 | 106.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Just When His World Is Back To Normal... He's ... | Father of the Bride Part II | False | 5.7 | 173.0 |
5 rows × 24 columns
Generalized recommendation based on movie popularity and genre to every user. For example, movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience.
Sort movies based on ratings (The Movie Database (TMDb) Ratings) and popularity, then display the top movies list.
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
Next, determine an appropriate value for m, the minimum number of votes required to be listed in the chart. The 95th percentile is used as the cutoff: to feature in the chart, a movie must have at least as many votes as 95% of the movies in the list.
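For reference, the IMDB-style weighted rating that m feeds into can be written as below. The symbols follow the variable names used in the code of this section: v is a movie's vote count, R its vote average, and C the mean vote average over all movies.

```latex
WR = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C
```

A movie with few votes (v small relative to m) is pulled towards the global mean C, while a movie with many votes keeps a score close to its own average R.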
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C
5.244896612406511
m = vote_counts.quantile(0.95)
m
434.0
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)
qualified = md[(md['vote_count'] >= m) & (md['vote_count'].notnull()) & (md['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape
(2274, 6)
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)
qualified['wr'] = qualified.apply(weighted_rating, axis=1)
#Choose top 250
qualified = qualified.sort_values('wr', ascending=False).head(250)
len(qualified)
250
qualified.head(15)
index | title | year | vote_count | vote_average | popularity | genres | wr |
---|---|---|---|---|---|---|---|
15480 | Inception | 2010 | 14075 | 8 | 29.108149 | [Action, Thriller, Science Fiction, Mystery, A... | 7.917588 |
12481 | The Dark Knight | 2008 | 12269 | 8 | 123.167259 | [Drama, Action, Crime, Thriller] | 7.905871 |
22878 | Interstellar | 2014 | 11187 | 8 | 32.213481 | [Adventure, Drama, Science Fiction] | 7.897107 |
2843 | Fight Club | 1999 | 9678 | 8 | 63.869599 | [Drama] | 7.881753 |
4863 | The Lord of the Rings: The Fellowship of the Ring | 2001 | 8892 | 8 | 32.070725 | [Adventure, Fantasy, Action] | 7.871787 |
292 | Pulp Fiction | 1994 | 8670 | 8 | 140.950236 | [Thriller, Crime] | 7.868660 |
314 | The Shawshank Redemption | 1994 | 8358 | 8 | 51.645403 | [Drama, Crime] | 7.864000 |
7000 | The Lord of the Rings: The Return of the King | 2003 | 8226 | 8 | 29.324358 | [Adventure, Fantasy, Action] | 7.861927 |
351 | Forrest Gump | 1994 | 8147 | 8 | 48.307194 | [Comedy, Drama, Romance] | 7.860656 |
5814 | The Lord of the Rings: The Two Towers | 2002 | 7641 | 8 | 29.423537 | [Adventure, Fantasy, Action] | 7.851924 |
256 | Star Wars | 1977 | 6778 | 8 | 42.149697 | [Adventure, Action, Science Fiction] | 7.834205 |
1225 | Back to the Future | 1985 | 6239 | 8 | 25.778509 | [Adventure, Comedy, Science Fiction, Family] | 7.820813 |
834 | The Godfather | 1972 | 6024 | 8 | 41.109264 | [Drama, Crime] | 7.814847 |
1154 | The Empire Strikes Back | 1980 | 5998 | 8 | 19.470959 | [Adventure, Action, Science Fiction] | 7.814099 |
46 | Se7en | 1995 | 5915 | 8 | 18.457430 | [Crime, Mystery, Thriller] | 7.811669 |
The chart indicates a strong bias among TMDB users towards particular genres and directors (notably Christopher Nolan).
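As a sanity check, the top row of the chart can be reproduced by hand. The sketch below copies C and m from the outputs shown earlier and applies the same formula to Inception's vote count and truncated vote average. The function signature differs from the notebook's row-based `weighted_rating` and is only for illustration.

```python
# Values copied from the outputs shown above.
C = 5.244896612406511   # mean (integer-truncated) vote average across all movies
m = 434.0               # 95th-percentile vote count


def weighted_rating(v, R, m=m, C=C):
    """IMDB-style weighted rating for vote count v and vote average R."""
    return (v / (v + m)) * R + (m / (v + m)) * C


# Inception: 14075 votes, vote average truncated to 8 by the int cast.
inception = weighted_rating(v=14075, R=8)
print(round(inception, 6))
```

The printed value should agree with the 7.917588 listed for Inception in the `wr` column.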
def build_chart(genre, percentile=0.85):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)
    return qualified
For the genre-specific charts, use the 85th percentile instead, and split each movie with multiple genres into separate rows, one per genre.
s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = md.drop('genres', axis=1).join(s)
gen_md
index | adult | belongs_to_collection | budget | homepage | id | imdb_id | original_language | original_title | overview | popularity | ... | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count | year | genre |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | 21.946943 | ... | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | NaN | Toy Story | False | 7.7 | 5415.0 | 1995 | Animation |
0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | 21.946943 | ... | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | NaN | Toy Story | False | 7.7 | 5415.0 | 1995 | Comedy |
0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | 21.946943 | ... | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | NaN | Toy Story | False | 7.7 | 5415.0 | 1995 | Family |
1 | False | NaN | 65000000 | NaN | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | 17.015539 | ... | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 | 1995 | Adventure |
1 | False | NaN | 65000000 | NaN | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | 17.015539 | ... | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 | 1995 | Fantasy |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
45460 | False | NaN | 0 | NaN | 67758 | tt0303758 | en | Betrayal | When one of her hits goes wrong, a professiona... | 0.903007 | ... | 90.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | A deadly game of wits. | Betrayal | False | 3.8 | 6.0 | 2003 | Action |
45460 | False | NaN | 0 | NaN | 67758 | tt0303758 | en | Betrayal | When one of her hits goes wrong, a professiona... | 0.903007 | ... | 90.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | A deadly game of wits. | Betrayal | False | 3.8 | 6.0 | 2003 | Drama |
45460 | False | NaN | 0 | NaN | 67758 | tt0303758 | en | Betrayal | When one of her hits goes wrong, a professiona... | 0.903007 | ... | 90.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | A deadly game of wits. | Betrayal | False | 3.8 | 6.0 | 2003 | Thriller |
45461 | False | NaN | 0 | NaN | 227506 | tt0008536 | en | Satana likuyushchiy | In a small town live two brothers, one a minis... | 0.003503 | ... | 87.0 | [] | Released | NaN | Satan Triumphant | False | 0.0 | 0.0 | 1917 | NaN |
45462 | False | NaN | 0 | NaN | 461257 | tt6980792 | en | Queerama | 50 years after decriminalisation of homosexual... | 0.163015 | ... | 75.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | NaN | Queerama | False | 0.0 | 0.0 | 2017 | NaN |
93536 rows × 25 columns
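The row-wise `apply(pd.Series)` followed by `stack` and `join` can be slow on a frame of this size. As a possible alternative, `DataFrame.explode` (available in pandas 0.25 and later) performs the same one-row-per-genre split. The sketch below uses a small made-up frame standing in for `md`. It should yield the same rows, although the column order differs from the `join` version.

```python
import pandas as pd

# Toy stand-in for `md`; the real frame has many more columns.
toy = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", "Satan Triumphant"],
    "genres": [["Animation", "Comedy", "Family"], ["Adventure", "Fantasy"], []],
})

# One row per (movie, genre). The original index label is repeated for each
# genre, and an empty genre list becomes a single row with NaN.
toy_gen = toy.explode("genres").rename(columns={"genres": "genre"})
print(toy_gen)
```

As in the table above, a movie without genres survives as one row whose `genre` is NaN.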
build_chart('Action').head(15)
index | title | year | vote_count | vote_average | popularity | wr |
---|---|---|---|---|---|---|
15480 | Inception | 2010 | 14075 | 8 | 29.108149 | 7.955099 |
12481 | The Dark Knight | 2008 | 12269 | 8 | 123.167259 | 7.948610 |
4863 | The Lord of the Rings: The Fellowship of the Ring | 2001 | 8892 | 8 | 32.070725 | 7.929579 |
7000 | The Lord of the Rings: The Return of the King | 2003 | 8226 | 8 | 29.324358 | 7.924031 |
5814 | The Lord of the Rings: The Two Towers | 2002 | 7641 | 8 | 29.423537 | 7.918382 |
256 | Star Wars | 1977 | 6778 | 8 | 42.149697 | 7.908327 |
1154 | The Empire Strikes Back | 1980 | 5998 | 8 | 19.470959 | 7.896841 |
4135 | Scarface | 1983 | 3017 | 8 | 11.299673 | 7.802046 |
9430 | Oldboy | 2003 | 2000 | 8 | 10.616859 | 7.711649 |
1910 | Seven Samurai | 1954 | 892 | 8 | 15.017770 | 7.426145 |
43187 | Band of Brothers | 2001 | 725 | 8 | 7.903731 | 7.325485 |
1215 | M | 1931 | 465 | 8 | 12.752421 | 7.072073 |
14551 | Avatar | 2009 | 12114 | 7 | 185.070892 | 6.966363 |
17818 | The Avengers | 2012 | 12000 | 7 | 89.887648 | 6.966049 |
26563 | Deadpool | 2016 | 11444 | 7 | 187.860492 | 6.964431 |
Personalized recommendations - Computes similarity between movies based on certain metrics and suggests movies that are most similar to a particular movie that a user liked.
# links_small.csv contains the reference between imdbId and tmdbId
links_small = pd.read_csv('dataset/movie/links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')
links_small
0 862
1 8844
2 15602
3 31357
4 11862
...
9120 402672
9121 315011
9122 391698
9123 137608
9124 410803
Name: tmdbId, Length: 9112, dtype: int32
md['id'] = md['id'].astype('int')
small_md = md[md['id'].isin(links_small)]
small_md.shape
(9099, 25)
small_md
index | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count | year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [Animation, Comedy, Family] | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | NaN | Toy Story | False | 7.7 | 5415.0 | 1995 |
1 | False | NaN | 65000000 | [Adventure, Fantasy, Family] | NaN | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 | 1995 |
2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [Romance, Comedy] | NaN | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | 0.0 | 101.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 | 1995 |
3 | False | NaN | 16000000 | [Comedy, Drama, Romance] | NaN | 31357 | tt0114885 | en | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | ... | 81452156.0 | 127.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Friends are the people who let you be yourself... | Waiting to Exhale | False | 6.1 | 34.0 | 1995 |
4 | False | {'id': 96871, 'name': 'Father of the Bride Col... | 0 | [Comedy] | NaN | 11862 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | ... | 76578911.0 | 106.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Just When His World Is Back To Normal... He's ... | Father of the Bride Part II | False | 5.7 | 173.0 | 1995 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
40221 | False | NaN | 15000000 | [Action, Adventure, Drama, Horror, Science Fic... | NaN | 315011 | tt4262980 | ja | シン・ゴジラ | From the mind behind Evangelion comes a hit la... | ... | 77000000.0 | 120.0 | [{'iso_639_1': 'it', 'name': 'Italiano'}, {'is... | Released | A god incarnate. A city doomed. | Shin Godzilla | False | 6.6 | 152.0 | 2016 |
40500 | False | NaN | 0 | [Documentary, Music] | http://www.thebeatlesliveproject.com/ | 391698 | tt2531318 | en | The Beatles: Eight Days a Week - The Touring Y... | The band stormed Europe in 1963, and, in 1964,... | ... | 0.0 | 99.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | The band you know. The story you don't. | The Beatles: Eight Days a Week - The Touring Y... | False | 7.6 | 92.0 | 2016 |
44818 | False | {'id': 34055, 'name': 'Pokémon Collection', 'p... | 16000000 | [Adventure, Fantasy, Animation, Action, Family] | http://movies.warnerbros.com/pk3/ | 10991 | tt0235679 | ja | Pokémon 3: The Movie | When Molly Hale's sadness of her father's disa... | ... | 68411275.0 | 93.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Pokémon: Spell of the Unknown | Pokémon: Spell of the Unknown | False | 6.0 | 144.0 | 2000 |
44823 | False | {'id': 34055, 'name': 'Pokémon Collection', 'p... | 0 | [Adventure, Fantasy, Animation, Science Fictio... | http://www.pokemon.com/us/movies/movie-pokemon... | 12600 | tt0287635 | ja | 劇場版ポケットモンスター セレビィ 時を越えた遭遇(であい) | All your favorite Pokémon characters are back,... | ... | 28023563.0 | 75.0 | [{'iso_639_1': 'ja', 'name': '日本語'}] | Released | NaN | Pokémon 4Ever: Celebi - Voice of the Forest | False | 5.7 | 82.0 | 2001 |
45262 | False | NaN | 0 | [Comedy, Drama] | NaN | 265189 | tt2121382 | sv | Turist | While holidaying in the French Alps, a Swedish... | ... | 1359497.0 | 118.0 | [{'iso_639_1': 'fr', 'name': 'Français'}, {'is... | Released | NaN | Force Majeure | False | 6.8 | 255.0 | 2014 |
9099 rows × 25 columns
Movie descriptions and taglines based recommender
small_md['tagline'] = small_md['tagline'].fillna('')
small_md['description'] = small_md['overview'] + small_md['tagline']
small_md['description'] = small_md['description'].fillna('')
TF-IDF Vectorizer
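# min_df=1 is the default and keeps every term; newer scikit-learn versions may reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')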
tfidf_matrix = tf.fit_transform(small_md['description'])
tfidf_matrix.shape
(9099, 268123)
Cosine Similarity
Cosine similarity will be used to compute a numeric score that denotes the similarity between two movies.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim[0]
array([1. , 0.00680476, 0. , ..., 0. , 0.00344913,
0. ])
Every pair of movies in the dataset now has a cosine similarity score.
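The code above calls `linear_kernel`, which computes plain dot products, rather than `cosine_similarity`. The two agree here because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), and the dot product of two unit vectors is their cosine. A minimal pure-Python sketch of that identity, using made-up term weights:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale vec to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Two hypothetical term-weight vectors (not taken from the dataset).
a = [0.0, 2.0, 1.0, 3.0]
b = [1.0, 0.0, 1.0, 2.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)
# Both numbers should match up to floating-point error.
print(cosine(a, b), dot(a_unit, b_unit))
```

Skipping the normalization step is what makes `linear_kernel` cheaper than `cosine_similarity` on a matrix that is already normalized.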
small_md = small_md.reset_index()
titles = small_md['title']
indices = pd.Series(small_md.index, index=small_md['title'])
indices
title
Toy Story 0
Jumanji 1
Grumpier Old Men 2
Waiting to Exhale 3
Father of the Bride Part II 4
...
Shin Godzilla 9094
The Beatles: Eight Days a Week - The Touring Years 9095
Pokémon: Spell of the Unknown 9096
Pokémon 4Ever: Celebi - Voice of the Forest 9097
Force Majeure 9098
Length: 9099, dtype: int64
The following function returns the 30 movies most similar to the input movie, ranked by cosine similarity score.
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]
get_recommendations('The Dark Knight')
7931 The Dark Knight Rises
132 Batman Forever
1113 Batman Returns
8227 Batman: The Dark Knight Returns, Part 2
7565 Batman: Under the Red Hood
524 Batman
7901 Batman: Year One
2579 Batman: Mask of the Phantasm
2696 JFK
8165 Batman: The Dark Knight Returns, Part 1
6144 Batman Begins
7933 Sherlock Holmes: A Game of Shadows
5511 To End All Wars
4489 Q & A
7344 Law Abiding Citizen
7242 The File on Thelma Jordon
3537 Criminal Law
2893 Flying Tigers
1135 Night Falls on Manhattan
8680 The Young Savages
8917 Batman v Superman: Dawn of Justice
1240 Batman & Robin
6740 Rush Hour 3
1652 The Shaggy D.A.
6667 Fracture
4028 The Rookie
8371 Justice League: Crisis on Two Earths
8719 By the Gun
3730 Dr. Mabuse, the Gambler
4160 The Master of Disguise
Name: title, dtype: object
As observed, the system draws on the description and tagline of The Dark Knight and recommends the other Batman movies first, followed by detective, superhero, and crime titles.
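One detail of `get_recommendations` is worth noting: the slice `[1:31]` assumes the queried movie itself sorts into position 0. If another movie ties with a similarity of 1.0 (for example, an identical description), the queried movie may land elsewhere and end up in its own recommendations. A small sketch that excludes the queried index explicitly; the helper name `top_similar` and the sample row are hypothetical:

```python
def top_similar(sim_row, idx, n=30):
    """Return the indices of the n highest scores in sim_row, excluding idx itself."""
    scored = [(i, s) for i, s in enumerate(sim_row) if i != idx]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:n]]


# Hypothetical similarity row for movie 2; movie 0 has an identical description.
row = [1.0, 0.1, 1.0, 0.4, 0.0]
print(top_similar(row, idx=2, n=3))
```

With this row, slicing from position 1 would keep movie 2 in the results, whereas filtering by index does not.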
credits = pd.read_csv('dataset/movie/credits.csv')
keywords = pd.read_csv('dataset/movie/keywords.csv')
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
md['id'] = md['id'].astype('int')
md.shape
(45463, 25)
md = md.merge(credits, on='id')
md = md.merge(keywords, on='id')
sub_md = md[md['id'].isin(links_small)]
sub_md.shape
(9219, 28)
sub_md.head(5)
index | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | status | tagline | title | video | vote_average | vote_count | year | cast | crew | keywords |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [Animation, Comedy, Family] | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | Released | NaN | Toy Story | False | 7.7 | 5415.0 | 1995 | [{'cast_id': 14, 'character': 'Woody (voice)',... | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | [{'id': 931, 'name': 'jealousy'}, {'id': 4290,... |
1 | False | NaN | 65000000 | [Adventure, Fantasy, Family] | NaN | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 | 1995 | [{'cast_id': 1, 'character': 'Alan Parrish', '... | [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... | [{'id': 10090, 'name': 'board game'}, {'id': 1... |
2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [Romance, Comedy] | NaN | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 | 1995 | [{'cast_id': 2, 'character': 'Max Goldman', 'c... | [{'credit_id': '52fe466a9251416c75077a89', 'de... | [{'id': 1495, 'name': 'fishing'}, {'id': 12392... |
3 | False | NaN | 16000000 | [Comedy, Drama, Romance] | NaN | 31357 | tt0114885 | en | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | ... | Released | Friends are the people who let you be yourself... | Waiting to Exhale | False | 6.1 | 34.0 | 1995 | [{'cast_id': 1, 'character': "Savannah 'Vannah... | [{'credit_id': '52fe44779251416c91011acb', 'de... | [{'id': 818, 'name': 'based on novel'}, {'id':... |
4 | False | {'id': 96871, 'name': 'Father of the Bride Col... | 0 | [Comedy] | NaN | 11862 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | ... | Released | Just When His World Is Back To Normal... He's ... | Father of the Bride Part II | False | 5.7 | 173.0 | 1995 | [{'cast_id': 1, 'character': 'George Banks', '... | [{'credit_id': '52fe44959251416c75039ed7', 'de... | [{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n... |
5 rows × 28 columns
To keep things simple:
- Crew will be represented by the director only.
- Only the top 3 actors will be chosen to represent the cast.
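The two rules above can be sketched on a single made-up record. The raw strings below mimic the shape of the `crew` and `cast` cells (stringified lists of dicts). The names are invented, and this version returns `None` rather than NaN when no director is listed.

```python
from ast import literal_eval

# Hypothetical raw cells, shaped like the `crew` and `cast` columns.
raw_crew = "[{'job': 'Producer', 'name': 'P. One'}, {'job': 'Director', 'name': 'D. Two'}]"
raw_cast = "[{'name': 'A'}, {'name': 'B'}, {'name': 'C'}, {'name': 'D'}]"

crew = literal_eval(raw_crew)   # string -> list of dicts
cast = literal_eval(raw_cast)

# First crew member whose job is 'Director', or None if there is none.
director = next((c['name'] for c in crew if c['job'] == 'Director'), None)

# Slicing already copes with casts shorter than three.
top_cast = [c['name'] for c in cast][:3]

print(director, top_cast)
```

Because `x[:3]` returns the whole list when it has fewer than three entries, no explicit length check is needed.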
sub_md['cast'] = sub_md['cast'].apply(literal_eval)
sub_md['crew'] = sub_md['crew'].apply(literal_eval)
sub_md['keywords'] = sub_md['keywords'].apply(literal_eval)
sub_md['cast_size'] = sub_md['cast'].apply(lambda x: len(x))
sub_md['crew_size'] = sub_md['crew'].apply(lambda x: len(x))
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan
sub_md['director'] = sub_md['crew'].apply(get_director)
sub_md['cast'] = sub_md['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
sub_md['cast'] = sub_md['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)
sub_md['keywords'] = sub_md['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
sub_md.head(10)
index | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | video | vote_average | vote_count | year | cast | crew | keywords | cast_size | crew_size | director |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [Animation, Comedy, Family] | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | False | 7.7 | 5415.0 | 1995 | [Tom Hanks, Tim Allen, Don Rickles] | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | [jealousy, toy, boy, friendship, friends, riva... | 13 | 106 | John Lasseter |
1 | False | NaN | 65000000 | [Adventure, Fantasy, Family] | NaN | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | False | 6.9 | 2413.0 | 1995 | [Robin Williams, Jonathan Hyde, Kirsten Dunst] | [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... | [board game, disappearance, based on children'... | 26 | 16 | Joe Johnston |
2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [Romance, Comedy] | NaN | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | False | 6.5 | 92.0 | 1995 | [Walter Matthau, Jack Lemmon, Ann-Margret] | [{'credit_id': '52fe466a9251416c75077a89', 'de... | [fishing, best friend, duringcreditsstinger, o... | 7 | 4 | Howard Deutch |
3 | False | NaN | 16000000 | [Comedy, Drama, Romance] | NaN | 31357 | tt0114885 | en | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | ... | False | 6.1 | 34.0 | 1995 | [Whitney Houston, Angela Bassett, Loretta Devine] | [{'credit_id': '52fe44779251416c91011acb', 'de... | [based on novel, interracial relationship, sin... | 10 | 10 | Forest Whitaker |
4 | False | {'id': 96871, 'name': 'Father of the Bride Col... | 0 | [Comedy] | NaN | 11862 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | ... | False | 5.7 | 173.0 | 1995 | [Steve Martin, Diane Keaton, Martin Short] | [{'credit_id': '52fe44959251416c75039ed7', 'de... | [baby, midlife crisis, confidence, aging, daug... | 12 | 7 | Charles Shyer |
5 | False | NaN | 60000000 | [Action, Crime, Drama, Thriller] | NaN | 949 | tt0113277 | en | Heat | Obsessive master thief, Neil McCauley leads a ... | ... | False | 7.7 | 1886.0 | 1995 | [Al Pacino, Robert De Niro, Val Kilmer] | [{'credit_id': '52fe4292c3a36847f802916d', 'de... | [robbery, detective, bank, obsession, chase, s... | 65 | 71 | Michael Mann |
6 | False | NaN | 58000000 | [Comedy, Romance] | NaN | 11860 | tt0114319 | en | Sabrina | An ugly duckling having undergone a remarkable... | ... | False | 6.2 | 141.0 | 1995 | [Harrison Ford, Julia Ormond, Greg Kinnear] | [{'credit_id': '52fe44959251416c75039da9', 'de... | [paris, brother brother relationship, chauffeu... | 57 | 53 | Sydney Pollack |
7 | False | NaN | 0 | [Action, Adventure, Drama, Family] | NaN | 45325 | tt0112302 | en | Tom and Huck | A mischievous young boy, Tom Sawyer, witnesses... | ... | False | 5.4 | 45.0 | 1995 | [Jonathan Taylor Thomas, Brad Renfro, Rachael ... | [{'credit_id': '52fe46bdc3a36847f810f797', 'de... | [] | 7 | 4 | Peter Hewitt |
8 | False | NaN | 35000000 | [Action, Adventure, Thriller] | NaN | 9091 | tt0114576 | en | Sudden Death | International action superstar Jean Claude Van... | ... | False | 5.5 | 174.0 | 1995 | [Jean-Claude Van Damme, Powers Boothe, Dorian ... | [{'credit_id': '52fe44dbc3a36847f80ae0f1', 'de... | [terrorist, hostage, explosive, vice president] | 6 | 9 | Peter Hyams |
9 | False | {'id': 645, 'name': 'James Bond Collection', '... | 58000000 | [Adventure, Action, Thriller] | http://www.mgm.com/view/movie/757/Goldeneye/ | 710 | tt0113189 | en | GoldenEye | James Bond must unmask the mysterious head of ... | ... | False | 6.6 | 1194.0 | 1995 | [Pierce Brosnan, Sean Bean, Izabella Scorupco] | [{'credit_id': '52fe426ec3a36847f801e14b', 'de... | [cuba, falsely accused, secret identity, compu... | 20 | 46 | Martin Campbell |
10 rows × 31 columns
Strip spaces from all features and convert them to lowercase.
sub_md['cast'] = sub_md['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
Mention Director 3 times to give it more weight relative to the entire cast.
sub_md['director'] = sub_md['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
sub_md['director'] = sub_md['director'].apply(lambda x: [x,x, x])
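To see why repeating the director token adds weight, consider two hypothetical movies that share a director but no actors. The sketch below computes cosine similarity over unigram counts in pure Python. It is a simplification of the notebook's setup, which also counts bigrams, and all token names are made up.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of tokens (unigram counts)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


# Two hypothetical movies that share only their director.
cast_a = ['actor1', 'actor2', 'actor3']
cast_b = ['actor4', 'actor5', 'actor6']
director = 'somedirector'

once = cosine(cast_a + [director], cast_b + [director])
thrice = cosine(cast_a + [director] * 3, cast_b + [director] * 3)
print(once, thrice)
```

Mentioning the director three times triples its count in each vector, so the shared term dominates the dot product and the similarity rises.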
Calculate the frequency count of every keyword that appears in the dataset.
s = sub_md.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'
s = s.value_counts()
s[:5]
independent film 610
woman director 550
murder 399
duringcreditsstinger 327
based on novel 318
Name: keyword, dtype: int64
Remove keywords that occur only once.
s = s[s > 1]
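The same frequency filter can be illustrated without pandas. This sketch uses `collections.Counter` on a few made-up keyword lists and keeps only keywords that occur more than once across all movies.

```python
from collections import Counter

# Hypothetical keyword lists, one per movie.
keywords = [
    ['murder', 'detective', 'rain'],
    ['murder', 'heist'],
    ['detective', 'based on novel'],
]

# Count every keyword across all movies, then keep those seen more than once.
counts = Counter(k for movie in keywords for k in movie)
kept = {k for k, n in counts.items() if n > 1}

filtered = [[k for k in movie if k in kept] for movie in keywords]
print(filtered)
```

A keyword that appears in a single movie cannot link that movie to any other, so dropping it shrinks the vocabulary without losing similarity signal.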
Convert every word to its stem so that words such as heroes and hero are considered the same.
stemmer = SnowballStemmer('english')
stemmer.stem('heroes')
'hero'
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words
sub_md['keywords'] = sub_md['keywords'].apply(filter_keywords)
sub_md['keywords'] = sub_md['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
sub_md['keywords'] = sub_md['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
Create a metadata dump for every movie which consists of genres, director, main actors and keywords.
sub_md['metadata'] = sub_md['keywords'] + sub_md['cast'] + sub_md['director'] + sub_md['genres']
sub_md['metadata'] = sub_md['metadata'].apply(lambda x: ' '.join(x))
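Use Count Vectorizer to create the count matrix. Then, calculate the cosine similarities and return the movies that are most similar.
# min_df=1 is the default and keeps every term; newer scikit-learn versions may reject an integer min_df of 0
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')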
count_matrix = count.fit_transform(sub_md['metadata'])
cosine_sim = cosine_similarity(count_matrix, count_matrix)
sub_md = sub_md.reset_index()
titles = sub_md['title']
indices = pd.Series(sub_md.index, index=sub_md['title'])
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]
get_recommendations('The Dark Knight').head(10)
8031 The Dark Knight Rises
6218 Batman Begins
6623 The Prestige
2085 Following
7648 Inception
4145 Insomnia
3381 Memento
8613 Interstellar
7659 Batman: Under the Red Hood
1134 Batman Returns
Name: title, dtype: object
As observed, more of Christopher Nolan's movies made it onto the list. These movies also appear to share genres and to demand more thought from the viewer.
Take the top 25 movies based on similarity scores and calculate the vote count at the 60th percentile of that group. Then, using this as the value of m, calculate the weighted rating of each movie using IMDB's formula.
def improved_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    movies = sub_md.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    # Note: weighted_rating reads the module-level m and C, not the local ones above,
    # so the local m only sets the qualification cutoff here.
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified
improved_recommendations('The Dark Knight')
index | title | vote_count | vote_average | year | wr |
---|---|---|---|---|---|
7648 | Inception | 14075 | 8 | 2010 | 7.917588 |
8613 | Interstellar | 11187 | 8 | 2014 | 7.897107 |
6623 | The Prestige | 4510 | 8 | 2006 | 7.758148 |
3381 | Memento | 4168 | 8 | 2000 | 7.740175 |
8031 | The Dark Knight Rises | 9263 | 7 | 2012 | 6.921448 |
6218 | Batman Begins | 7511 | 7 | 2005 | 6.904127 |
1134 | Batman Returns | 1706 | 6 | 1992 | 5.846862 |
132 | Batman Forever | 1529 | 5 | 1995 | 5.054144 |
9024 | Batman v Superman: Dawn of Justice | 7189 | 5 | 2016 | 5.013943 |
1260 | Batman & Robin | 1447 | 4 | 1997 | 4.287233 |
Predict ratings from similar users' data, using Surprise and the Singular Value Decomposition (SVD) algorithm.
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
reader = Reader()
ratings = pd.read_csv('dataset/movie/ratings_small.csv')
ratings.head()
index | userId | movieId | rating | timestamp |
---|---|---|---|---|
0 | 1 | 31 | 2.5 | 1260759144 |
1 | 1 | 1029 | 3.0 | 1260759179 |
2 | 1 | 1061 | 3.0 | 1260759182 |
3 | 1 | 1129 | 2.0 | 1260759185 |
4 | 1 | 1172 | 4.0 | 1260759205 |
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)  # 5-fold cross-validation
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 0.8997 0.8943 0.8981 0.9034 0.8894 0.8970 0.0048
MAE (testset) 0.6924 0.6879 0.6923 0.6949 0.6866 0.6908 0.0031
Fit time 5.50 5.39 5.82 5.68 6.19 5.72 0.28
Test time 0.14 0.26 0.16 0.15 0.27 0.20 0.06
{'test_rmse': array([0.89971588, 0.89433794, 0.89810361, 0.90339806, 0.88944449]),
'test_mae': array([0.69242199, 0.68794254, 0.69231266, 0.6948946 , 0.68659872]),
'fit_time': (5.496105194091797,
5.386150598526001,
5.822059869766235,
5.684120178222656,
6.186817169189453),
'test_time': (0.14284086227416992,
0.2593069076538086,
0.15757989883422852,
0.1506178379058838,
0.26779866218566895)}
The mean Root Mean Square Error (RMSE) across the 5 folds is 0.8970, which is below 1 and a reasonably good result.
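For readers unfamiliar with the two error measures, the following sketch computes RMSE and MAE by hand on four made-up rating pairs, then averages the five per-fold RMSE values printed above. The helper names `rmse` and `mae` and the toy ratings are assumptions for illustration; Surprise computes these measures internally.

```python
import math
from statistics import mean

def rmse(actual, predicted):
    """Root mean square error: penalises large errors more heavily."""
    return math.sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))

def mae(actual, predicted):
    """Mean absolute error: average size of the error."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))

# Made-up true ratings and predictions
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
toy_rmse = rmse(actual, predicted)
toy_mae = mae(actual, predicted)

# Per-fold RMSE values copied from the cross-validation output above
fold_rmse = [0.8997, 0.8943, 0.8981, 0.9034, 0.8894]
mean_rmse = mean(fold_rmse)
print(toy_rmse, toy_mae, round(mean_rmse, 4))
```

An RMSE of about 0.9 means that predictions are typically off by a little less than one rating point.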
trainset = data.build_full_trainset()
svd.fit(trainset)
<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1a0f52151c8>
ratings[ratings['userId'] == 1]
index | userId | movieId | rating | timestamp |
---|---|---|---|---|
0 | 1 | 31 | 2.5 | 1260759144 |
1 | 1 | 1029 | 3.0 | 1260759179 |
2 | 1 | 1061 | 3.0 | 1260759182 |
3 | 1 | 1129 | 2.0 | 1260759185 |
4 | 1 | 1172 | 4.0 | 1260759205 |
5 | 1 | 1263 | 2.0 | 1260759151 |
6 | 1 | 1287 | 2.0 | 1260759187 |
7 | 1 | 1293 | 2.0 | 1260759148 |
8 | 1 | 1339 | 3.5 | 1260759125 |
9 | 1 | 1343 | 2.0 | 1260759131 |
10 | 1 | 1371 | 2.5 | 1260759135 |
11 | 1 | 1405 | 1.0 | 1260759203 |
12 | 1 | 1953 | 4.0 | 1260759191 |
13 | 1 | 2105 | 4.0 | 1260759139 |
14 | 1 | 2150 | 3.0 | 1260759194 |
15 | 1 | 2193 | 2.0 | 1260759198 |
16 | 1 | 2294 | 2.0 | 1260759108 |
17 | 1 | 2455 | 2.5 | 1260759113 |
18 | 1 | 2968 | 1.0 | 1260759200 |
19 | 1 | 3671 | 3.0 | 1260759117 |
svd.predict(1, 302, 3)
Prediction(uid=1, iid=302, r_ui=3, est=2.575412635819967, details={'was_impossible': False})
For user ID = 1 and movie ID = 302, the model returns an estimated rating of 2.5754, based on how other users have rated the movie. The third argument (3) is passed as the true rating r_ui and does not influence the estimate.
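As a rough picture of where such an estimate comes from, Surprise's SVD predicts a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, clipped to the rating scale. The sketch below reproduces that arithmetic with made-up numbers; the function name `svd_estimate` and all values are hypothetical and are not taken from the fitted model.

```python
def svd_estimate(global_mean, user_bias, item_bias,
                 user_factors, item_factors, rating_scale=(1.0, 5.0)):
    """Estimate = mu + b_u + b_i + q_i . p_u, clipped to the rating scale."""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + dot
    low, high = rating_scale
    return min(high, max(low, raw))

# Made-up parameters for one user and one movie (2 latent factors)
est = svd_estimate(
    global_mean=3.5,
    user_bias=-0.4,             # this user rates lower than average
    item_bias=0.1,              # this movie is rated slightly above average
    user_factors=[0.2, -0.1],
    item_factors=[0.5, 0.3],
)

# An estimate above the scale is clipped to the maximum rating
clipped = svd_estimate(4.8, 0.5, 0.3, [0.0, 0.0], [0.0, 0.0])
print(est, clipped)
```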
def convert_int(x):
    # Convert a value to int; return NaN when it is missing or not numeric
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan
def hybrid(userId, title):
    idx = indices[title]
    tmdbId = id_map.loc[title]['id']  # TMDB id of the seed movie (not used below)
    movie_id = id_map.loc[title]['movieId']  # movieId of the seed movie (not used below)
    # Content-based step: keep the 25 movies most similar to the seed
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    movies = sub_md.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']]
    # Collaborative step: estimate this user's rating for each candidate and sort by it
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)
id_map = pd.read_csv('dataset/movie/links_small.csv')[['movieId', 'tmdbId']]
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId', 'id']
id_map = id_map.merge(sub_md[['title', 'id']], on='id').set_index('title')
#id_map = id_map.set_index('tmdbId')
indices_map = id_map.set_index('id')
hybrid(1, 'The Dark Knight')
index | title | vote_count | vote_average | year | id | est |
---|---|---|---|---|---|---|
3381 | Memento | 4168.0 | 8.1 | 2000 | 77 | 3.518882 |
7648 | Inception | 14075.0 | 8.1 | 2010 | 27205 | 3.253708 |
6623 | The Prestige | 4510.0 | 8.0 | 2006 | 1124 | 3.230209 |
8613 | Interstellar | 11187.0 | 8.1 | 2014 | 157336 | 3.094515 |
6218 | Batman Begins | 7511.0 | 7.5 | 2005 | 272 | 3.030609 |
5943 | Thursday | 84.0 | 7.0 | 1998 | 9812 | 3.003159 |
8031 | The Dark Knight Rises | 9263.0 | 7.6 | 2012 | 49026 | 2.864251 |
7362 | Gangster's Paradise: Jerusalema | 16.0 | 6.8 | 2008 | 22600 | 2.856210 |
5098 | The Enforcer | 21.0 | 7.4 | 1951 | 26712 | 2.817666 |
7561 | Harry Brown | 351.0 | 6.7 | 2009 | 25941 | 2.729585 |
hybrid(500, 'The Dark Knight')
index | title | vote_count | vote_average | year | id | est |
---|---|---|---|---|---|---|
6623 | The Prestige | 4510.0 | 8.0 | 2006 | 1124 | 3.823928 |
3381 | Memento | 4168.0 | 8.1 | 2000 | 77 | 3.601578 |
8613 | Interstellar | 11187.0 | 8.1 | 2014 | 157336 | 3.519702 |
7648 | Inception | 14075.0 | 8.1 | 2010 | 27205 | 3.330051 |
5943 | Thursday | 84.0 | 7.0 | 1998 | 9812 | 3.322717 |
7561 | Harry Brown | 351.0 | 6.7 | 2009 | 25941 | 3.210024 |
2448 | Nighthawks | 87.0 | 6.4 | 1981 | 21610 | 3.034214 |
2131 | Superman | 1042.0 | 6.9 | 1978 | 1924 | 2.998992 |
2085 | Following | 363.0 | 7.2 | 1998 | 11660 | 2.974581 |
8031 | The Dark Knight Rises | 9263.0 | 7.6 | 2012 | 49026 | 2.967465 |
As observed, different recommendation lists were offered to different users for the same movie, indicating that the recommendations are tailored and personalized for each user.
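The personalization comes entirely from the re-ranking step: the content-based candidates are the same for everyone, but each user's estimated ratings order them differently. A minimal sketch of that idea, using a hypothetical `hybrid_rank` function and made-up estimates in place of `svd.predict`:

```python
def hybrid_rank(candidates, estimates, user_id, top_n=3):
    """Re-rank content-based candidates by one user's estimated ratings."""
    scored = [(title, estimates[(user_id, title)]) for title in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [title for title, _ in scored[:top_n]]

# Same candidate list for both users (output of the content-based step)
candidates = ["Movie A", "Movie B", "Movie C", "Movie D"]

# Made-up estimated ratings keyed by (user id, title)
toy_estimates = {
    (1, "Movie A"): 3.5, (1, "Movie B"): 3.2, (1, "Movie C"): 2.9, (1, "Movie D"): 2.7,
    (2, "Movie A"): 3.0, (2, "Movie B"): 3.8, (2, "Movie C"): 3.6, (2, "Movie D"): 2.5,
}

user1_top = hybrid_rank(candidates, toy_estimates, user_id=1)
user2_top = hybrid_rank(candidates, toy_estimates, user_id=2)
print(user1_top, user2_top)
```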