
Automated recommendation system that fits data through multiple recommenders to obtain the best result.


louisyuzhe/recommender_system


Types of Recommendation Systems

1. Simple Recommender:

This system uses overall TMDB vote counts and vote averages to build top-movie charts, both in general and for a specific genre. The IMDB weighted rating formula is used to compute the score on which the final sort is performed.

2. Content Based Recommender:

  1. Movie overview and taglines based,
  2. Cast, crew, genre and keywords based. A simple filter is added to give greater preference to movies with more votes and higher ratings.

3. Collaborative Filtering:

Uses the Surprise module to build a collaborative filter based on singular value decomposition (SVD). The RMSE obtained was less than 1, and the engine gives estimated ratings for a given user and movie.

4. Hybrid Engine:

Combines content-based and collaborative filtering to build an engine that gives movie suggestions to a particular user based on the estimated ratings that it has internally calculated for that user.

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
#from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD

import warnings; warnings.simplefilter('ignore')
md = pd.read_csv('dataset/movie/movies_metadata.csv')
md.head()
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... release_date revenue runtime spoken_languages status tagline title video vote_average vote_count
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... ... 10/30/1995 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Toy Story False 7.7 5415.0
1 False NaN 65000000 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... NaN 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... ... 12/15/1995 262797249.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Roll the dice and unleash the excitement! Jumanji False 6.9 2413.0
2 False {'id': 119050, 'name': 'Grumpy Old Men Collect... 0 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... NaN 15602 tt0113228 en Grumpier Old Men A family wedding reignites the ancient feud be... ... 12/22/1995 0.0 101.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Still Yelling. Still Fighting. Still Ready for... Grumpier Old Men False 6.5 92.0
3 False NaN 16000000 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... NaN 31357 tt0114885 en Waiting to Exhale Cheated on, mistreated and stepped on, the wom... ... 12/22/1995 81452156.0 127.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Friends are the people who let you be yourself... Waiting to Exhale False 6.1 34.0
4 False {'id': 96871, 'name': 'Father of the Bride Col... 0 [{'id': 35, 'name': 'Comedy'}] NaN 11862 tt0113041 en Father of the Bride Part II Just when George Banks has recovered from his ... ... 2/10/1995 76578911.0 106.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Just When His World Is Back To Normal... He's ... Father of the Bride Part II False 5.7 173.0

5 rows × 24 columns

Simple Recommender

Gives every user the same generalized recommendations, based on movie popularity and genre. The underlying assumption is that movies that are more popular and more critically acclaimed have a higher probability of being liked by the average audience.

Sort movies based on ratings (The Movie Database (TMDb) Ratings) and popularity, then display the top movies list.

md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

Now, determine an appropriate value for m, the minimum number of votes required to be listed in the chart. The 95th percentile of vote counts is used as the cutoff: to feature in the chart, a movie must have at least as many votes as 95% of the movies in the list.

vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C
5.244896612406511

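The average rating for a movie on TMDB is about 5.24 on a scale of 10. Note that vote_average is truncated to an integer by astype('int') before the mean is taken, so this figure is lower than the mean of the raw ratings.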

m = vote_counts.quantile(0.95)
m
434.0

As shown, a movie needs at least 434 votes on TMDB to be considered for the chart.

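# Extract the release year; unparseable dates become NaT and map to NaN.
# (The original test `x != np.nan` is always True, so pd.notnull is used instead.)
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)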
qualified = md[(md['vote_count'] >= m) & (md['vote_count'].notnull()) & (md['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape
(2274, 6)

2274 movies qualify for the top chart.

IMDB's weighted rating formula is used.

def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)
qualified['wr'] = qualified.apply(weighted_rating, axis=1)

#Choose top 250
qualified = qualified.sort_values('wr', ascending=False).head(250)
len(qualified)
250
qualified.head(15)
       title                                              year  vote_count  vote_average  popularity  genres                                                   wr
15480  Inception                                          2010       14075             8   29.108149  [Action, Thriller, Science Fiction, Mystery, A...  7.917588
12481  The Dark Knight                                    2008       12269             8  123.167259  [Drama, Action, Crime, Thriller]                   7.905871
22878  Interstellar                                       2014       11187             8   32.213481  [Adventure, Drama, Science Fiction]                7.897107
2843   Fight Club                                         1999        9678             8   63.869599  [Drama]                                            7.881753
4863   The Lord of the Rings: The Fellowship of the Ring  2001        8892             8   32.070725  [Adventure, Fantasy, Action]                       7.871787
292    Pulp Fiction                                       1994        8670             8  140.950236  [Thriller, Crime]                                  7.868660
314    The Shawshank Redemption                           1994        8358             8   51.645403  [Drama, Crime]                                     7.864000
7000   The Lord of the Rings: The Return of the King      2003        8226             8   29.324358  [Adventure, Fantasy, Action]                       7.861927
351    Forrest Gump                                       1994        8147             8   48.307194  [Comedy, Drama, Romance]                           7.860656
5814   The Lord of the Rings: The Two Towers              2002        7641             8   29.423537  [Adventure, Fantasy, Action]                       7.851924
256    Star Wars                                          1977        6778             8   42.149697  [Adventure, Action, Science Fiction]               7.834205
1225   Back to the Future                                 1985        6239             8   25.778509  [Adventure, Comedy, Science Fiction, Family]       7.820813
834    The Godfather                                      1972        6024             8   41.109264  [Drama, Crime]                                     7.814847
1154   The Empire Strikes Back                            1980        5998             8   19.470959  [Adventure, Action, Science Fiction]               7.814099
46     Se7en                                              1995        5915             8   18.457430  [Crime, Mystery, Thriller]                         7.811669
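
As a sanity check, the weighted rating in the first row can be reproduced by hand. The sketch below restates the formula as a standalone function and plugs in Inception's vote count and truncated vote average together with the m and C values printed earlier. The name `weighted_rating_value` and its explicit `m` and `C` parameters are illustrative and not part of the notebook.

```python
def weighted_rating_value(v, R, m, C):
    """IMDB-style weighted rating: blend a movie's own average R with the overall mean C.

    v: number of votes for the movie
    R: the movie's average rating
    m: minimum number of votes required to be listed
    C: mean rating across all movies
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# Values taken from the outputs shown earlier in this notebook.
m = 434.0
C = 5.244896612406511

# Inception: 14075 votes, vote_average truncated to 8 by astype('int').
wr_inception = weighted_rating_value(14075, 8, m, C)
print(round(wr_inception, 6))
```

The result should agree with the wr value listed for Inception (7.917588). The formula pulls movies with few votes towards C, while a movie with many votes keeps a score close to its own average R.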

The chart indicates a strong bias of TMDB users towards particular genres and directors (Christopher Nolan, for example).

Generate top chart based on genre

def build_chart(genre, percentile=0.85):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)
    
    return qualified

Use the 85th percentile instead, and split each movie with multiple genres into separate rows, one per genre. The build_chart function above reads from gen_md, which is built here.

s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = md.drop('genres', axis=1).join(s)
gen_md
adult belongs_to_collection budget homepage id imdb_id original_language original_title overview popularity ... runtime spoken_languages status tagline title video vote_average vote_count year genre
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... 21.946943 ... 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Toy Story False 7.7 5415.0 1995 Animation
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... 21.946943 ... 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Toy Story False 7.7 5415.0 1995 Comedy
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... 21.946943 ... 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Toy Story False 7.7 5415.0 1995 Family
1 False NaN 65000000 NaN 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... 17.015539 ... 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Roll the dice and unleash the excitement! Jumanji False 6.9 2413.0 1995 Adventure
1 False NaN 65000000 NaN 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... 17.015539 ... 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Roll the dice and unleash the excitement! Jumanji False 6.9 2413.0 1995 Fantasy
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45460 False NaN 0 NaN 67758 tt0303758 en Betrayal When one of her hits goes wrong, a professiona... 0.903007 ... 90.0 [{'iso_639_1': 'en', 'name': 'English'}] Released A deadly game of wits. Betrayal False 3.8 6.0 2003 Action
45460 False NaN 0 NaN 67758 tt0303758 en Betrayal When one of her hits goes wrong, a professiona... 0.903007 ... 90.0 [{'iso_639_1': 'en', 'name': 'English'}] Released A deadly game of wits. Betrayal False 3.8 6.0 2003 Drama
45460 False NaN 0 NaN 67758 tt0303758 en Betrayal When one of her hits goes wrong, a professiona... 0.903007 ... 90.0 [{'iso_639_1': 'en', 'name': 'English'}] Released A deadly game of wits. Betrayal False 3.8 6.0 2003 Thriller
45461 False NaN 0 NaN 227506 tt0008536 en Satana likuyushchiy In a small town live two brothers, one a minis... 0.003503 ... 87.0 [] Released NaN Satan Triumphant False 0.0 0.0 1917 NaN
45462 False NaN 0 NaN 461257 tt6980792 en Queerama 50 years after decriminalisation of homosexual... 0.163015 ... 75.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Queerama False 0.0 0.0 2017 NaN

93536 rows × 25 columns
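
On pandas 0.25 or later, `DataFrame.explode` offers a more direct way to produce one row per genre. The following is a minimal sketch on a toy frame (the three movies and their genre lists are made up for illustration). It is an alternative to the apply/stack/join approach used above, not the notebook's own code.

```python
import pandas as pd

# Toy stand-in for md: each movie carries a list of genre names.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": [["Action", "Drama"], ["Comedy"], []],
})

# explode() repeats the other columns once per list element and keeps the
# original index, so the index still identifies the source movie.
# An empty list becomes a single row with NaN, which keeps the movie in the frame.
toy_gen = toy.explode("genres").rename(columns={"genres": "genre"})
print(toy_gen)
```

As in the gen_md output above, a movie without genres stays in the frame with a NaN genre, and the repeated index values show which rows came from the same movie.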

build_chart('Action').head(15)
       title                                              year  vote_count  vote_average  popularity        wr
15480  Inception                                          2010       14075             8   29.108149  7.955099
12481  The Dark Knight                                    2008       12269             8  123.167259  7.948610
4863   The Lord of the Rings: The Fellowship of the Ring  2001        8892             8   32.070725  7.929579
7000   The Lord of the Rings: The Return of the King      2003        8226             8   29.324358  7.924031
5814   The Lord of the Rings: The Two Towers              2002        7641             8   29.423537  7.918382
256    Star Wars                                          1977        6778             8   42.149697  7.908327
1154   The Empire Strikes Back                            1980        5998             8   19.470959  7.896841
4135   Scarface                                           1983        3017             8   11.299673  7.802046
9430   Oldboy                                             2003        2000             8   10.616859  7.711649
1910   Seven Samurai                                      1954         892             8   15.017770  7.426145
43187  Band of Brothers                                   2001         725             8    7.903731  7.325485
1215   M                                                  1931         465             8   12.752421  7.072073
14551  Avatar                                             2009       12114             7  185.070892  6.966363
17818  The Avengers                                       2012       12000             7   89.887648  6.966049
26563  Deadpool                                           2016       11444             7  187.860492  6.964431

Content Based Recommender

Personalized recommendations: computes the similarity between movies based on certain metrics and suggests the movies that are most similar to a particular movie that a user liked.

# links_small.csv contains the mapping between imdbId and tmdbId
links_small = pd.read_csv('dataset/movie/links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')
links_small
0          862
1         8844
2        15602
3        31357
4        11862
         ...  
9120    402672
9121    315011
9122    391698
9123    137608
9124    410803
Name: tmdbId, Length: 9112, dtype: int32
md['id'] = md['id'].astype('int')
small_md = md[md['id'].isin(links_small)]
small_md.shape
(9099, 25)
small_md
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... revenue runtime spoken_languages status tagline title video vote_average vote_count year
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [Animation, Comedy, Family] http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... ... 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Toy Story False 7.7 5415.0 1995
1 False NaN 65000000 [Adventure, Fantasy, Family] NaN 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... ... 262797249.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Roll the dice and unleash the excitement! Jumanji False 6.9 2413.0 1995
2 False {'id': 119050, 'name': 'Grumpy Old Men Collect... 0 [Romance, Comedy] NaN 15602 tt0113228 en Grumpier Old Men A family wedding reignites the ancient feud be... ... 0.0 101.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Still Yelling. Still Fighting. Still Ready for... Grumpier Old Men False 6.5 92.0 1995
3 False NaN 16000000 [Comedy, Drama, Romance] NaN 31357 tt0114885 en Waiting to Exhale Cheated on, mistreated and stepped on, the wom... ... 81452156.0 127.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Friends are the people who let you be yourself... Waiting to Exhale False 6.1 34.0 1995
4 False {'id': 96871, 'name': 'Father of the Bride Col... 0 [Comedy] NaN 11862 tt0113041 en Father of the Bride Part II Just when George Banks has recovered from his ... ... 76578911.0 106.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Just When His World Is Back To Normal... He's ... Father of the Bride Part II False 5.7 173.0 1995
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
40221 False NaN 15000000 [Action, Adventure, Drama, Horror, Science Fic... NaN 315011 tt4262980 ja シン・ゴジラ From the mind behind Evangelion comes a hit la... ... 77000000.0 120.0 [{'iso_639_1': 'it', 'name': 'Italiano'}, {'is... Released A god incarnate. A city doomed. Shin Godzilla False 6.6 152.0 2016
40500 False NaN 0 [Documentary, Music] http://www.thebeatlesliveproject.com/ 391698 tt2531318 en The Beatles: Eight Days a Week - The Touring Y... The band stormed Europe in 1963, and, in 1964,... ... 0.0 99.0 [{'iso_639_1': 'en', 'name': 'English'}] Released The band you know. The story you don't. The Beatles: Eight Days a Week - The Touring Y... False 7.6 92.0 2016
44818 False {'id': 34055, 'name': 'Pokémon Collection', 'p... 16000000 [Adventure, Fantasy, Animation, Action, Family] http://movies.warnerbros.com/pk3/ 10991 tt0235679 ja Pokémon 3: The Movie When Molly Hale's sadness of her father's disa... ... 68411275.0 93.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Pokémon: Spell of the Unknown Pokémon: Spell of the Unknown False 6.0 144.0 2000
44823 False {'id': 34055, 'name': 'Pokémon Collection', 'p... 0 [Adventure, Fantasy, Animation, Science Fictio... http://www.pokemon.com/us/movies/movie-pokemon... 12600 tt0287635 ja 劇場版ポケットモンスター セレビィ 時を越えた遭遇(であい) All your favorite Pokémon characters are back,... ... 28023563.0 75.0 [{'iso_639_1': 'ja', 'name': '日本語'}] Released NaN Pokémon 4Ever: Celebi - Voice of the Forest False 5.7 82.0 2001
45262 False NaN 0 [Comedy, Drama] NaN 265189 tt2121382 sv Turist While holidaying in the French Alps, a Swedish... ... 1359497.0 118.0 [{'iso_639_1': 'fr', 'name': 'Français'}, {'is... Released NaN Force Majeure False 6.8 255.0 2014

9099 rows × 25 columns

Movie Description Based Recommender

A recommender based on movie descriptions and taglines.

small_md['tagline'] = small_md['tagline'].fillna('')
small_md['description'] = small_md['overview'] + small_md['tagline']
small_md['description'] = small_md['description'].fillna('')

TF-IDF Vectorizer

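# min_df=1 keeps every term, the same effect as the original min_df=0
# (recent scikit-learn versions validate min_df and reject the integer 0).
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')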
tfidf_matrix = tf.fit_transform(small_md['description'])
tfidf_matrix.shape
(9099, 268123)

Cosine Similarity

Cosine similarity will be used to calculate a numeric quantity that denotes the similarity between two movies.

cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim[0]
array([1.        , 0.00680476, 0.        , ..., 0.        , 0.00344913,
       0.        ])
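
The reason `linear_kernel` can stand in for cosine similarity here is that `TfidfVectorizer` L2-normalizes each row by default, so the plain dot product of two rows already equals their cosine. The sketch below checks this identity with NumPy on a small made-up matrix; the numbers are illustrative only and do not come from the dataset.

```python
import numpy as np

# Made-up, unnormalized term weights for three documents over four terms.
X = np.array([
    [1.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 0.0, 0.0],   # same direction as row 0
    [0.0, 0.0, 3.0, 1.0],   # shares no terms with rows 0 and 1
])

# Cosine similarity from its definition: dot product divided by the norms.
norms = np.linalg.norm(X, axis=1)
cosine = (X @ X.T) / np.outer(norms, norms)

# L2-normalize the rows first; a plain dot product then gives the same matrix.
X_unit = X / norms[:, None]
dot_of_unit_rows = X_unit @ X_unit.T

print(np.allclose(cosine, dot_of_unit_rows))
```

Skipping the division by the norms is what makes the dot product cheaper to compute on an already normalized matrix.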

We now have a pairwise cosine similarity matrix for all the movies in the dataset.

small_md = small_md.reset_index()
titles = small_md['title']
indices = pd.Series(small_md.index, index=small_md['title'])
indices
title
Toy Story                                                0
Jumanji                                                  1
Grumpier Old Men                                         2
Waiting to Exhale                                        3
Father of the Bride Part II                              4
                                                      ... 
Shin Godzilla                                         9094
The Beatles: Eight Days a Week - The Touring Years    9095
Pokémon: Spell of the Unknown                         9096
Pokémon 4Ever: Celebi - Voice of the Forest           9097
Force Majeure                                         9098
Length: 9099, dtype: int64

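Function that returns the 30 most similar movies based on the cosine similarity score of the input movie. The slice starts at position 1 to skip the input movie itself, which normally ranks first with a similarity of 1.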

def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]
get_recommendations('The Dark Knight')
7931                      The Dark Knight Rises
132                              Batman Forever
1113                             Batman Returns
8227    Batman: The Dark Knight Returns, Part 2
7565                 Batman: Under the Red Hood
524                                      Batman
7901                           Batman: Year One
2579               Batman: Mask of the Phantasm
2696                                        JFK
8165    Batman: The Dark Knight Returns, Part 1
6144                              Batman Begins
7933         Sherlock Holmes: A Game of Shadows
5511                            To End All Wars
4489                                      Q & A
7344                        Law Abiding Citizen
7242                  The File on Thelma Jordon
3537                               Criminal Law
2893                              Flying Tigers
1135                   Night Falls on Manhattan
8680                          The Young Savages
8917         Batman v Superman: Dawn of Justice
1240                             Batman & Robin
6740                                Rush Hour 3
1652                            The Shaggy D.A.
6667                                   Fracture
4028                                 The Rookie
8371       Justice League: Crisis on Two Earths
8719                                 By the Gun
3730                    Dr. Mabuse, the Gambler
4160                     The Master of Disguise
Name: title, dtype: object
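
One caveat: `indices[title]` returns a Series rather than a single position when two movies share a title, and it raises a `KeyError` for an unknown title. A more defensive lookup could look like the sketch below. The function name `top_similar`, the toy titles, and the toy similarity matrix are all hypothetical; only the ranking logic mirrors `get_recommendations`.

```python
import numpy as np

def top_similar(title, titles, sim, n=2):
    """Return the n titles most similar to `title`, excluding the movie itself.

    titles: list of movie titles, aligned with the rows of `sim`
    sim:    square similarity matrix (NumPy array)
    """
    matches = [i for i, t in enumerate(titles) if t == title]
    if not matches:
        raise KeyError(f"unknown title: {title!r}")
    idx = matches[0]  # assumption: take the first movie when titles collide

    # Sort positions by descending similarity, then drop the query movie.
    order = np.argsort(-sim[idx], kind="stable")
    order = [i for i in order if i != idx][:n]
    return [titles[i] for i in order]


toy_titles = ["A", "B", "C", "A"]  # note the duplicated title "A"
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.0],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.0, 0.4, 1.0],
])
print(top_similar("A", toy_titles, toy_sim))
```

Removing the query by position, rather than assuming it sorts first, also covers the case where another movie ties with it at a similarity of 1.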

As observed, the system takes the description and tagline of The Dark Knight into consideration and recommends mostly other Batman movies, along with detective, superhero, and crime titles.

Metadata Based Recommender

Based on genre, keywords, cast and crew

credits = pd.read_csv('dataset/movie/credits.csv')
keywords = pd.read_csv('dataset/movie/keywords.csv')
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
md['id'] = md['id'].astype('int')
md.shape
(45463, 25)
md = md.merge(credits, on='id')
md = md.merge(keywords, on='id')
sub_md = md[md['id'].isin(links_small)]
sub_md.shape
(9219, 28)
sub_md.head(5)
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... status tagline title video vote_average vote_count year cast crew keywords
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [Animation, Comedy, Family] http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... ... Released NaN Toy Story False 7.7 5415.0 1995 [{'cast_id': 14, 'character': 'Woody (voice)',... [{'credit_id': '52fe4284c3a36847f8024f49', 'de... [{'id': 931, 'name': 'jealousy'}, {'id': 4290,...
1 False NaN 65000000 [Adventure, Fantasy, Family] NaN 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... ... Released Roll the dice and unleash the excitement! Jumanji False 6.9 2413.0 1995 [{'cast_id': 1, 'character': 'Alan Parrish', '... [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... [{'id': 10090, 'name': 'board game'}, {'id': 1...
2 False {'id': 119050, 'name': 'Grumpy Old Men Collect... 0 [Romance, Comedy] NaN 15602 tt0113228 en Grumpier Old Men A family wedding reignites the ancient feud be... ... Released Still Yelling. Still Fighting. Still Ready for... Grumpier Old Men False 6.5 92.0 1995 [{'cast_id': 2, 'character': 'Max Goldman', 'c... [{'credit_id': '52fe466a9251416c75077a89', 'de... [{'id': 1495, 'name': 'fishing'}, {'id': 12392...
3 False NaN 16000000 [Comedy, Drama, Romance] NaN 31357 tt0114885 en Waiting to Exhale Cheated on, mistreated and stepped on, the wom... ... Released Friends are the people who let you be yourself... Waiting to Exhale False 6.1 34.0 1995 [{'cast_id': 1, 'character': "Savannah 'Vannah... [{'credit_id': '52fe44779251416c91011acb', 'de... [{'id': 818, 'name': 'based on novel'}, {'id':...
4 False {'id': 96871, 'name': 'Father of the Bride Col... 0 [Comedy] NaN 11862 tt0113041 en Father of the Bride Part II Just when George Banks has recovered from his ... ... Released Just When His World Is Back To Normal... He's ... Father of the Bride Part II False 5.7 173.0 1995 [{'cast_id': 1, 'character': 'George Banks', '... [{'credit_id': '52fe44959251416c75039ed7', 'de... [{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...

5 rows × 28 columns

To keep things simple:

  1. The crew will be represented by the director.
  2. Only the top 3 actors will be chosen to represent the cast.
sub_md['cast'] = sub_md['cast'].apply(literal_eval)
sub_md['crew'] = sub_md['crew'].apply(literal_eval)
sub_md['keywords'] = sub_md['keywords'].apply(literal_eval)
sub_md['cast_size'] = sub_md['cast'].apply(lambda x: len(x))
sub_md['crew_size'] = sub_md['crew'].apply(lambda x: len(x))
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan
sub_md['director'] = sub_md['crew'].apply(get_director)
sub_md['cast'] = sub_md['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
sub_md['cast'] = sub_md['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)
sub_md['keywords'] = sub_md['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
sub_md.head(10)
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... video vote_average vote_count year cast crew keywords cast_size crew_size director
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [Animation, Comedy, Family] http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... ... False 7.7 5415.0 1995 [Tom Hanks, Tim Allen, Don Rickles] [{'credit_id': '52fe4284c3a36847f8024f49', 'de... [jealousy, toy, boy, friendship, friends, riva... 13 106 John Lasseter
1 False NaN 65000000 [Adventure, Fantasy, Family] NaN 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... ... False 6.9 2413.0 1995 [Robin Williams, Jonathan Hyde, Kirsten Dunst] [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... [board game, disappearance, based on children'... 26 16 Joe Johnston
2 False {'id': 119050, 'name': 'Grumpy Old Men Collect... 0 [Romance, Comedy] NaN 15602 tt0113228 en Grumpier Old Men A family wedding reignites the ancient feud be... ... False 6.5 92.0 1995 [Walter Matthau, Jack Lemmon, Ann-Margret] [{'credit_id': '52fe466a9251416c75077a89', 'de... [fishing, best friend, duringcreditsstinger, o... 7 4 Howard Deutch
3 False NaN 16000000 [Comedy, Drama, Romance] NaN 31357 tt0114885 en Waiting to Exhale Cheated on, mistreated and stepped on, the wom... ... False 6.1 34.0 1995 [Whitney Houston, Angela Bassett, Loretta Devine] [{'credit_id': '52fe44779251416c91011acb', 'de... [based on novel, interracial relationship, sin... 10 10 Forest Whitaker
4 False {'id': 96871, 'name': 'Father of the Bride Col... 0 [Comedy] NaN 11862 tt0113041 en Father of the Bride Part II Just when George Banks has recovered from his ... ... False 5.7 173.0 1995 [Steve Martin, Diane Keaton, Martin Short] [{'credit_id': '52fe44959251416c75039ed7', 'de... [baby, midlife crisis, confidence, aging, daug... 12 7 Charles Shyer
5 False NaN 60000000 [Action, Crime, Drama, Thriller] NaN 949 tt0113277 en Heat Obsessive master thief, Neil McCauley leads a ... ... False 7.7 1886.0 1995 [Al Pacino, Robert De Niro, Val Kilmer] [{'credit_id': '52fe4292c3a36847f802916d', 'de... [robbery, detective, bank, obsession, chase, s... 65 71 Michael Mann
6 False NaN 58000000 [Comedy, Romance] NaN 11860 tt0114319 en Sabrina An ugly duckling having undergone a remarkable... ... False 6.2 141.0 1995 [Harrison Ford, Julia Ormond, Greg Kinnear] [{'credit_id': '52fe44959251416c75039da9', 'de... [paris, brother brother relationship, chauffeu... 57 53 Sydney Pollack
7 False NaN 0 [Action, Adventure, Drama, Family] NaN 45325 tt0112302 en Tom and Huck A mischievous young boy, Tom Sawyer, witnesses... ... False 5.4 45.0 1995 [Jonathan Taylor Thomas, Brad Renfro, Rachael ... [{'credit_id': '52fe46bdc3a36847f810f797', 'de... [] 7 4 Peter Hewitt
8 False NaN 35000000 [Action, Adventure, Thriller] NaN 9091 tt0114576 en Sudden Death International action superstar Jean Claude Van... ... False 5.5 174.0 1995 [Jean-Claude Van Damme, Powers Boothe, Dorian ... [{'credit_id': '52fe44dbc3a36847f80ae0f1', 'de... [terrorist, hostage, explosive, vice president] 6 9 Peter Hyams
9 False {'id': 645, 'name': 'James Bond Collection', '... 58000000 [Adventure, Action, Thriller] http://www.mgm.com/view/movie/757/Goldeneye/ 710 tt0113189 en GoldenEye James Bond must unmask the mysterious head of ... ... False 6.6 1194.0 1995 [Pierce Brosnan, Sean Bean, Izabella Scorupco] [{'credit_id': '52fe426ec3a36847f801e14b', 'de... [cuba, falsely accused, secret identity, compu... 20 46 Martin Campbell

10 rows × 31 columns

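Strip spaces and convert all features to lowercase, so that each name (for example tomhanks) is treated as a single token rather than as separate first and last names.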

sub_md['cast'] = sub_md['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

Mention Director 3 times to give it more weight relative to the entire cast.

sub_md['director'] = sub_md['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
sub_md['director'] = sub_md['director'].apply(lambda x: [x,x, x])
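
To see why repeating the director changes the ranking, compare the cosine similarity of two token bags that share only a director, with and without the repetition. This is a simplified sketch: it counts single tokens only, whereas the CountVectorizer used later also counts bigrams, and the token names are made up.

```python
from collections import Counter
from math import sqrt

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of tokens, using raw counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


# Two movies with different casts but the same (hypothetical) director.
cast_a, cast_b = ["actor1", "actor2"], ["actor3", "actor4"]
director = "somedirector"

once = cosine(cast_a + [director], cast_b + [director])
thrice = cosine(cast_a + [director] * 3, cast_b + [director] * 3)
print(once, thrice)
```

With a single mention the shared director yields a similarity of 1/3; with three mentions it rises to 9/11, so movies by the same director move up the list.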

Calculate the frequency counts of every keyword that appears in the dataset.

s = sub_md.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'
s = s.value_counts()
s[:5]
independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
Name: keyword, dtype: int64

Remove keywords that occur only once.

s = s[s > 1]

Convert every word to its stem so that words such as heroes and hero are considered the same.

stemmer = SnowballStemmer('english')
stemmer.stem('heroes')
'hero'
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words
sub_md['keywords'] = sub_md['keywords'].apply(filter_keywords)
sub_md['keywords'] = sub_md['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
sub_md['keywords'] = sub_md['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

Create a metadata dump for every movie which consists of genres, director, main actors and keywords.

sub_md['metadata'] = sub_md['keywords'] + sub_md['cast'] + sub_md['director'] + sub_md['genres']
sub_md['metadata'] = sub_md['metadata'].apply(lambda x: ' '.join(x))

Use CountVectorizer to create the count matrix. Then, calculate the cosine similarities and return the movies that are most similar.

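# min_df=1 keeps every term, the same effect as the original min_df=0
# (recent scikit-learn versions validate min_df and reject the integer 0).
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')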
count_matrix = count.fit_transform(sub_md['metadata'])
cosine_sim = cosine_similarity(count_matrix, count_matrix)
sub_md = sub_md.reset_index()
titles = sub_md['title']
indices = pd.Series(sub_md.index, index=sub_md['title'])
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]
get_recommendations('The Dark Knight').head(10)
8031         The Dark Knight Rises
6218                 Batman Begins
6623                  The Prestige
2085                     Following
7648                     Inception
4145                      Insomnia
3381                       Memento
8613                  Interstellar
7659    Batman: Under the Red Hood
1134                Batman Returns
Name: title, dtype: object

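As observed, more of Christopher Nolan's movies made the list this time, which is consistent with the extra weight given to the director. These movies also appear to share genres with The Dark Knight and tend to demand more thought from the viewer.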

Adding Popularity and Ratings to recommender

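Take the top 25 movies based on similarity scores and compute the 60th percentile of their vote counts. Movies with at least that many votes qualify, and the qualifying movies are ranked by weighted rating using IMDB's formula. Note that weighted_rating, as defined earlier, reads the global m and C, so the local m and C computed inside this function only affect which movies qualify; this is why the wr values below match those in the global chart.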

def improved_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = sub_md.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    # .copy() so the column assignments below modify an independent frame
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())].copy()
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified
improved_recommendations('The Dark Knight')
                                   title  vote_count  vote_average  year        wr
7648                           Inception       14075             8  2010  7.917588
8613                        Interstellar       11187             8  2014  7.897107
6623                        The Prestige        4510             8  2006  7.758148
3381                             Memento        4168             8  2000  7.740175
8031               The Dark Knight Rises        9263             7  2012  6.921448
6218                       Batman Begins        7511             7  2005  6.904127
1134                      Batman Returns        1706             6  1992  5.846862
132                       Batman Forever        1529             5  1995  5.054144
9024  Batman v Superman: Dawn of Justice        7189             5  2016  5.013943
1260                      Batman & Robin        1447             4  1997  4.287233

Collaborative Filtering

Predict ratings from the data of similar users, using Surprise and the Singular Value Decomposition (SVD) algorithm.

from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

reader = Reader()
ratings = pd.read_csv('dataset/movie/ratings_small.csv')
ratings.head()
   userId  movieId  rating   timestamp
0       1       31     2.5  1260759144
1       1     1029     3.0  1260759179
2       1     1061     3.0  1260759182
3       1     1129     2.0  1260759185
4       1     1172     4.0  1260759205
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5 , verbose=True) #5 splits
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8997  0.8943  0.8981  0.9034  0.8894  0.8970  0.0048  
MAE (testset)     0.6924  0.6879  0.6923  0.6949  0.6866  0.6908  0.0031  
Fit time          5.50    5.39    5.82    5.68    6.19    5.72    0.28    
Test time         0.14    0.26    0.16    0.15    0.27    0.20    0.06    





{'test_rmse': array([0.89971588, 0.89433794, 0.89810361, 0.90339806, 0.88944449]),
 'test_mae': array([0.69242199, 0.68794254, 0.69231266, 0.6948946 , 0.68659872]),
 'fit_time': (5.496105194091797,
  5.386150598526001,
  5.822059869766235,
  5.684120178222656,
  6.186817169189453),
 'test_time': (0.14284086227416992,
  0.2593069076538086,
  0.15757989883422852,
  0.1506178379058838,
  0.26779866218566895)}

The mean Root Mean Square Error across the five folds is 0.8970, which is pretty good.
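
For readers unfamiliar with the two metrics, the sketch below shows how RMSE and MAE are computed from (true rating, estimated rating) pairs, and averages the five fold RMSE values printed above. The toy pairs are made up, and this is a plain-Python illustration rather than Surprise's own code.

```python
import math

def rmse(pairs):
    """Root mean square error over (true, estimated) rating pairs."""
    return math.sqrt(sum((r - e) ** 2 for r, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(r - e) for r, e in pairs) / len(pairs)

# Toy predictions (hypothetical values).
pairs = [(3.0, 2.5), (4.0, 4.5), (2.0, 2.0)]
print(rmse(pairs), mae(pairs))

# Per-fold test RMSE values copied from the cross-validation output.
fold_rmse = [0.89971588, 0.89433794, 0.89810361, 0.90339806, 0.88944449]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(f"{mean_rmse:.4f}")
```

RMSE penalizes large errors more heavily than MAE, which is why it is the larger of the two in the table.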

trainset = data.build_full_trainset()
svd.fit(trainset)
<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1a0f52151c8>
ratings[ratings['userId'] == 1]
    userId  movieId  rating   timestamp
0        1       31     2.5  1260759144
1        1     1029     3.0  1260759179
2        1     1061     3.0  1260759182
3        1     1129     2.0  1260759185
4        1     1172     4.0  1260759205
5        1     1263     2.0  1260759151
6        1     1287     2.0  1260759187
7        1     1293     2.0  1260759148
8        1     1339     3.5  1260759125
9        1     1343     2.0  1260759131
10       1     1371     2.5  1260759135
11       1     1405     1.0  1260759203
12       1     1953     4.0  1260759191
13       1     2105     4.0  1260759139
14       1     2150     3.0  1260759194
15       1     2193     2.0  1260759198
16       1     2294     2.0  1260759108
17       1     2455     2.5  1260759113
18       1     2968     1.0  1260759200
19       1     3671     3.0  1260759117
svd.predict(1, 302, 3)
Prediction(uid=1, iid=302, r_ui=3, est=2.575412635819967, details={'was_impossible': False})

For user ID 1 and movie ID 302, the model returns an estimated rating of 2.5754, based on how other users have rated the movie.
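
As far as the Surprise documentation describes it, the SVD algorithm estimates a rating as the global mean plus a user bias, an item bias and the dot product of the user and item factor vectors, then clips the result to the rating scale. The sketch below illustrates that rule with made-up numbers. The function name `predict_rating`, the parameter values and the (1, 5) scale are assumptions for the example, not values read from the fitted model.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i, scale=(1.0, 5.0)):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    est = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    low, high = scale
    return min(high, max(low, est))

# Hypothetical learned parameters for one user and one movie.
mu = 3.5             # global mean rating
b_u = -0.4           # this user rates lower than average
b_i = 0.1            # this movie is rated slightly above average
p_u = [0.2, -0.1]    # user factors
q_i = [0.5, 0.3]     # item factors

print(predict_rating(mu, b_u, b_i, p_u, q_i))
```

In the `svd.predict(1, 302, 3)` call, the third argument is the true rating `r_ui`, which is stored in the returned `Prediction` for reference and does not influence the estimate.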

Hybrid Recommender

Content based + collaborative filter based recommender

Input: User ID and the Title of a Movie

Output: Similar movies sorted on the basis of expected ratings by that particular user.

def convert_int(x):
    # Return NaN for missing or non-numeric IDs instead of raising.
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan
    
def hybrid(userId, title):
    idx = indices[title]
    tmdbId = id_map.loc[title]['id']
    #print(idx)
    movie_id = id_map.loc[title]['movieId']
    
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = sub_md.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']]
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)
id_map = pd.read_csv('dataset/movie/links_small.csv')[['movieId', 'tmdbId']]
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId', 'id']
id_map = id_map.merge(sub_md[['title', 'id']], on='id').set_index('title')
#id_map = id_map.set_index('tmdbId')
indices_map = id_map.set_index('id')
hybrid(1, 'The Dark Knight')
                                title  vote_count  vote_average  year      id       est
3381                          Memento      4168.0           8.1  2000      77  3.518882
7648                        Inception     14075.0           8.1  2010   27205  3.253708
6623                     The Prestige      4510.0           8.0  2006    1124  3.230209
8613                     Interstellar     11187.0           8.1  2014  157336  3.094515
6218                    Batman Begins      7511.0           7.5  2005     272  3.030609
5943                         Thursday        84.0           7.0  1998    9812  3.003159
8031            The Dark Knight Rises      9263.0           7.6  2012   49026  2.864251
7362  Gangster's Paradise: Jerusalema        16.0           6.8  2008   22600  2.856210
5098                     The Enforcer        21.0           7.4  1951   26712  2.817666
7561                      Harry Brown       351.0           6.7  2009   25941  2.729585
hybrid(500, 'The Dark Knight')
                      title  vote_count  vote_average  year      id       est
6623           The Prestige      4510.0           8.0  2006    1124  3.823928
3381                Memento      4168.0           8.1  2000      77  3.601578
8613           Interstellar     11187.0           8.1  2014  157336  3.519702
7648              Inception     14075.0           8.1  2010   27205  3.330051
5943               Thursday        84.0           7.0  1998    9812  3.322717
7561            Harry Brown       351.0           6.7  2009   25941  3.210024
2448             Nighthawks        87.0           6.4  1981   21610  3.034214
2131               Superman      1042.0           6.9  1978    1924  2.998992
2085              Following       363.0           7.2  1998   11660  2.974581
8031  The Dark Knight Rises      9263.0           7.6  2012   49026  2.967465

As observed, different users who watched the same movie received different recommendation lists, indicating that the recommendations are tailored and personalized for each user.
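
The hybrid idea can be reduced to two steps: collect candidates that are similar in content, then re-rank them by the rating a collaborative model predicts for the given user. The sketch below shows only that re-ranking step. The lookup table `toy_estimates` stands in for a fitted model, and all names and numbers in it are hypothetical.

```python
def hybrid_rank(candidates, predict, user_id, top_n=3):
    """Order content-based candidates by the user's predicted rating."""
    scored = [(title, predict(user_id, title)) for title in candidates]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_n]

# Made-up per-user estimates standing in for a trained collaborative model.
toy_estimates = {
    "u1": {"Movie A": 3.5, "Movie B": 3.2, "Movie C": 3.1},
    "u2": {"Movie A": 3.6, "Movie B": 3.3, "Movie C": 3.8},
}

def toy_predict(user_id, title):
    return toy_estimates[user_id][title]

# The same content-based candidates for both users.
candidates = ["Movie A", "Movie B", "Movie C"]

print(hybrid_rank(candidates, toy_predict, "u1"))
print(hybrid_rank(candidates, toy_predict, "u2"))
```

Both users start from the same candidate list, yet the ordering differs because the predicted ratings are user-specific.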
