Types of Recommendation Systems

1. Simple Recommender:

This system used overall TMDB vote counts and vote averages to build top-movie charts, both in general and for a specific genre. The IMDB weighted rating formula was used to compute the scores on which the final sort was performed.

2. Content Based Recommender:

Two content-based engines were built: one based on movie overviews and taglines, and one based on cast, crew, genre and keywords. A simple filter was added to give greater preference to movies with more votes and higher ratings.

3. Collaborative Filtering:

Used the Surprise library to build a collaborative filter based on singular value decomposition (SVD). The RMSE obtained was less than 1, and the engine gave estimated ratings for a given user and movie.

4. Hybrid Engine:

Combined content-based and collaborative filtering to build an engine that gave movie suggestions to a particular user based on the estimated ratings it had internally calculated for that user.

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
#from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD

import warnings; warnings.simplefilter('ignore')

md = pd.read_csv('dataset/movie/movies_metadata.csv')
md.head()

	adult	belongs_to_collection	budget	genres	homepage	id	imdb_id	original_language	original_title	overview	...	release_date	revenue	runtime	spoken_languages	status	tagline	title	video	vote_average	vote_count
0	False	{'id': 10194, 'name': 'Toy Story Collection', ...	30000000	[{'id': 16, 'name': 'Animation'}, {'id': 35, '...	http://toystory.disney.com/toy-story	862	tt0114709	en	Toy Story	Led by Woody, Andy's toys live happily in his ...	...	10/30/1995	373554033.0	81.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	NaN	Toy Story	False	7.7	5415.0
1	False	NaN	65000000	[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...	NaN	8844	tt0113497	en	Jumanji	When siblings Judy and Peter discover an encha...	...	12/15/1995	262797249.0	104.0	[{'iso_639_1': 'en', 'name': 'English'}, {'iso...	Released	Roll the dice and unleash the excitement!	Jumanji	False	6.9	2413.0
2	False	{'id': 119050, 'name': 'Grumpy Old Men Collect...	0	[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...	NaN	15602	tt0113228	en	Grumpier Old Men	A family wedding reignites the ancient feud be...	...	12/22/1995	0.0	101.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Still Yelling. Still Fighting. Still Ready for...	Grumpier Old Men	False	6.5	92.0
3	False	NaN	16000000	[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...	NaN	31357	tt0114885	en	Waiting to Exhale	Cheated on, mistreated and stepped on, the wom...	...	12/22/1995	81452156.0	127.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Friends are the people who let you be yourself...	Waiting to Exhale	False	6.1	34.0
4	False	{'id': 96871, 'name': 'Father of the Bride Col...	0	[{'id': 35, 'name': 'Comedy'}]	NaN	11862	tt0113041	en	Father of the Bride Part II	Just when George Banks has recovered from his ...	...	2/10/1995	76578911.0	106.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Just When His World Is Back To Normal... He's ...	Father of the Bride Part II	False	5.7	173.0

5 rows × 24 columns

Simple Recommender

The simple recommender gives every user the same generalized recommendations, based on movie popularity and genre. The underlying idea is that movies that are more popular and more critically acclaimed have a higher probability of being liked by the average audience.

Movies are sorted by their The Movie Database (TMDb) ratings and popularity, and the top of the resulting list is displayed.

# Parse the stringified list of genre dicts into a plain list of genre names
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

Next, determine an appropriate value for m, the minimum number of votes required to be listed in the chart. The 95th percentile is used as the cutoff: to feature in the chart, a movie must have more votes than at least 95% of the movies in the list.

vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C

5.244896612406511

The mean rating for a movie on TMDB, computed after truncating each vote average to an integer, is about 5.24 on a scale of 10.

m = vote_counts.quantile(0.95)
m

434.0

So, to be considered for the chart, a movie must have at least 434 votes on TMDB.
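To make the cutoff concrete, here is a small dependency-free sketch of the linear-interpolation rule that pandas' `Series.quantile` applies by default, run on a made-up list of vote counts. The helper name `quantile_linear` and the toy data are illustrative assumptions, not part of the notebook.

```python
def quantile_linear(values, q):
    """Return the q-th quantile using linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q      # fractional position in the sorted list
    lower = int(pos)                  # index of the order statistic just below
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


# Hypothetical vote counts for 21 movies: most have few votes, a handful have many.
toy_vote_counts = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                   12, 15, 20, 30, 50, 80, 120, 200, 400, 5000]

m_toy = quantile_linear(toy_vote_counts, 0.95)
qualified_toy = [v for v in toy_vote_counts if v >= m_toy]
print(m_toy, qualified_toy)
```

Only the movies at or above the cutoff survive, which is why a high percentile produces a short chart dominated by heavily voted titles.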

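# Extract the release year; unparseable dates become NaN (x != np.nan is always True, so test with pd.notnull)
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)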

qualified = md[(md['vote_count'] >= m) & (md['vote_count'].notnull()) & (md['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(2274, 6)

2274 movies qualify for the top chart.

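IMDB's weighted rating formula is used: WR = (v / (v + m)) * R + (m / (v + m)) * C, where v is the movie's vote count, R is its average rating, m is the minimum number of votes required, and C is the mean rating across all movies.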

def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)
qualified['wr'] = qualified.apply(weighted_rating, axis=1)
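As a sanity check on the formula, the following standalone sketch plugs in the values reported above (m = 434, C ≈ 5.2449) together with the vote count and truncated rating of Inception. The function name `weighted_rating_value` is a hypothetical stand-in that takes plain numbers instead of a DataFrame row.

```python
def weighted_rating_value(v, R, m, C):
    """IMDB-style weighted rating: blend the movie's own rating R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C


m_chart = 434.0                 # 95th-percentile vote count reported above
C_chart = 5.244896612406511     # mean (integer-truncated) vote average reported above

# Inception: 14075 votes, vote average truncated to 8
wr_inception = weighted_rating_value(14075, 8, m_chart, C_chart)

# A hypothetical movie with the same rating but only 434 votes is pulled
# halfway towards the global mean.
wr_few_votes = weighted_rating_value(434, 8, m_chart, C_chart)

print(round(wr_inception, 6), round(wr_few_votes, 6))
```

The more votes a movie has relative to m, the closer its weighted rating stays to its own average; with few votes, the score shrinks towards C.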

#Choose top 250
qualified = qualified.sort_values('wr', ascending=False).head(250)

len(qualified)

qualified.head(15)

	title	year	vote_count	vote_average	popularity	genres	wr
15480	Inception	2010	14075	8	29.108149	[Action, Thriller, Science Fiction, Mystery, A...	7.917588
12481	The Dark Knight	2008	12269	8	123.167259	[Drama, Action, Crime, Thriller]	7.905871
22878	Interstellar	2014	11187	8	32.213481	[Adventure, Drama, Science Fiction]	7.897107
2843	Fight Club	1999	9678	8	63.869599	[Drama]	7.881753
4863	The Lord of the Rings: The Fellowship of the Ring	2001	8892	8	32.070725	[Adventure, Fantasy, Action]	7.871787
292	Pulp Fiction	1994	8670	8	140.950236	[Thriller, Crime]	7.868660
314	The Shawshank Redemption	1994	8358	8	51.645403	[Drama, Crime]	7.864000
7000	The Lord of the Rings: The Return of the King	2003	8226	8	29.324358	[Adventure, Fantasy, Action]	7.861927
351	Forrest Gump	1994	8147	8	48.307194	[Comedy, Drama, Romance]	7.860656
5814	The Lord of the Rings: The Two Towers	2002	7641	8	29.423537	[Adventure, Fantasy, Action]	7.851924
256	Star Wars	1977	6778	8	42.149697	[Adventure, Action, Science Fiction]	7.834205
1225	Back to the Future	1985	6239	8	25.778509	[Adventure, Comedy, Science Fiction, Family]	7.820813
834	The Godfather	1972	6024	8	41.109264	[Drama, Crime]	7.814847
1154	The Empire Strikes Back	1980	5998	8	19.470959	[Adventure, Action, Science Fiction]	7.814099
46	Se7en	1995	5915	8	18.457430	[Crime, Mystery, Thriller]	7.811669

The chart indicates a strong bias among TMDB users towards particular genres and directors (for example, Christopher Nolan).

Generate a top chart for a specific genre

def build_chart(genre, percentile=0.85):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)
    
    return qualified

For the genre charts, the 85th percentile is used instead, and each movie with multiple genres is split into separate rows, one per genre.

s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = md.drop('genres', axis=1).join(s)
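The row-splitting step can be easier to follow on a tiny example. The sketch below is a dependency-free illustration of the same idea on made-up records; `explode_genres` and `toy_movies` are hypothetical names. In recent pandas versions, `DataFrame.explode` is a possible one-call alternative to the apply/stack/join pattern, though the notebook does not use it.

```python
# Hypothetical miniature of the metadata: each movie carries a list of genres.
toy_movies = [
    {"title": "Toy Story", "genres": ["Animation", "Comedy", "Family"]},
    {"title": "Heat", "genres": ["Action", "Crime", "Drama", "Thriller"]},
    {"title": "Untitled", "genres": []},
]


def explode_genres(movies):
    """Return one row per (movie, genre) pair; movies without genres get genre None."""
    rows = []
    for movie in movies:
        base = {k: v for k, v in movie.items() if k != "genres"}
        for genre in movie["genres"] or [None]:
            rows.append({**base, "genre": genre})
    return rows


toy_gen_md = explode_genres(toy_movies)
action_rows = [row["title"] for row in toy_gen_md if row["genre"] == "Action"]
print(len(toy_gen_md), action_rows)
```

As in the notebook's joined frame, a movie with no genres is kept as a single row with a missing genre, so it never matches a genre filter.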

gen_md

	adult	belongs_to_collection	budget	homepage	id	imdb_id	original_language	original_title	overview	popularity	...	runtime	spoken_languages	status	tagline	title	video	vote_average	vote_count	year	genre
0	False	{'id': 10194, 'name': 'Toy Story Collection', ...	30000000	http://toystory.disney.com/toy-story	862	tt0114709	en	Toy Story	Led by Woody, Andy's toys live happily in his ...	21.946943	...	81.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	NaN	Toy Story	False	7.7	5415.0	1995	Animation
0	False	{'id': 10194, 'name': 'Toy Story Collection', ...	30000000	http://toystory.disney.com/toy-story	862	tt0114709	en	Toy Story	Led by Woody, Andy's toys live happily in his ...	21.946943	...	81.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	NaN	Toy Story	False	7.7	5415.0	1995	Comedy
0	False	{'id': 10194, 'name': 'Toy Story Collection', ...	30000000	http://toystory.disney.com/toy-story	862	tt0114709	en	Toy Story	Led by Woody, Andy's toys live happily in his ...	21.946943	...	81.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	NaN	Toy Story	False	7.7	5415.0	1995	Family
1	False	NaN	65000000	NaN	8844	tt0113497	en	Jumanji	When siblings Judy and Peter discover an encha...	17.015539	...	104.0	[{'iso_639_1': 'en', 'name': 'English'}, {'iso...	Released	Roll the dice and unleash the excitement!	Jumanji	False	6.9	2413.0	1995	Adventure
1	False	NaN	65000000	NaN	8844	tt0113497	en	Jumanji	When siblings Judy and Peter discover an encha...	17.015539	...	104.0	[{'iso_639_1': 'en', 'name': 'English'}, {'iso...	Released	Roll the dice and unleash the excitement!	Jumanji	False	6.9	2413.0	1995	Fantasy
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
45460	False	NaN	0	NaN	67758	tt0303758	en	Betrayal	When one of her hits goes wrong, a professiona...	0.903007	...	90.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	A deadly game of wits.	Betrayal	False	3.8	6.0	2003	Action
45460	False	NaN	0	NaN	67758	tt0303758	en	Betrayal	When one of her hits goes wrong, a professiona...	0.903007	...	90.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	A deadly game of wits.	Betrayal	False	3.8	6.0	2003	Drama
45460	False	NaN	0	NaN	67758	tt0303758	en	Betrayal	When one of her hits goes wrong, a professiona...	0.903007	...	90.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	A deadly game of wits.	Betrayal	False	3.8	6.0	2003	Thriller
45461	False	NaN	0	NaN	227506	tt0008536	en	Satana likuyushchiy	In a small town live two brothers, one a minis...	0.003503	...	87.0	[]	Released	NaN	Satan Triumphant	False	0.0	0.0	1917	NaN
45462	False	NaN	0	NaN	461257	tt6980792	en	Queerama	50 years after decriminalisation of homosexual...	0.163015	...	75.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	NaN	Queerama	False	0.0	0.0	2017	NaN

93536 rows × 25 columns

build_chart('Action').head(15)

	title	year	vote_count	vote_average	popularity	wr
15480	Inception	2010	14075	8	29.108149	7.955099
12481	The Dark Knight	2008	12269	8	123.167259	7.948610
4863	The Lord of the Rings: The Fellowship of the Ring	2001	8892	8	32.070725	7.929579
7000	The Lord of the Rings: The Return of the King	2003	8226	8	29.324358	7.924031
5814	The Lord of the Rings: The Two Towers	2002	7641	8	29.423537	7.918382
256	Star Wars	1977	6778	8	42.149697	7.908327
1154	The Empire Strikes Back	1980	5998	8	19.470959	7.896841
4135	Scarface	1983	3017	8	11.299673	7.802046
9430	Oldboy	2003	2000	8	10.616859	7.711649
1910	Seven Samurai	1954	892	8	15.017770	7.426145
43187	Band of Brothers	2001	725	8	7.903731	7.325485
1215	M	1931	465	8	12.752421	7.072073
14551	Avatar	2009	12114	7	185.070892	6.966363
17818	The Avengers	2012	12000	7	89.887648	6.966049
26563	Deadpool	2016	11444	7	187.860492	6.964431

Content Based Recommender

Personalized recommendations: this engine computes the similarity between movies based on certain metrics and suggests the movies that are most similar to a particular movie that a user liked.

# links_small.csv contains the mapping between imdbId and tmdbId
links_small = pd.read_csv('dataset/movie/links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

links_small

0          862
1         8844
2        15602
3        31357
4        11862
         ...  
9120    402672
9121    315011
9122    391698
9123    137608
9124    410803
Name: tmdbId, Length: 9112, dtype: int32

md['id'] = md['id'].astype('int')
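# .copy() so that the column assignments on small_md further down act on an independent frame
small_md = md[md['id'].isin(links_small)].copy()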
small_md.shape

(9099, 25)

small_md

	adult	belongs_to_collection	budget	genres	homepage	id	imdb_id	original_language	original_title	overview	...	revenue	runtime	spoken_languages	status	tagline	title	video	vote_average	vote_count	year
0	False	{'id': 10194, 'name': 'Toy Story Collection', ...	30000000	[Animation, Comedy, Family]	http://toystory.disney.com/toy-story	862	tt0114709	en	Toy Story	Led by Woody, Andy's toys live happily in his ...	...	373554033.0	81.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	NaN	Toy Story	False	7.7	5415.0	1995
1	False	NaN	65000000	[Adventure, Fantasy, Family]	NaN	8844	tt0113497	en	Jumanji	When siblings Judy and Peter discover an encha...	...	262797249.0	104.0	[{'iso_639_1': 'en', 'name': 'English'}, {'iso...	Released	Roll the dice and unleash the excitement!	Jumanji	False	6.9	2413.0	1995
2	False	{'id': 119050, 'name': 'Grumpy Old Men Collect...	0	[Romance, Comedy]	NaN	15602	tt0113228	en	Grumpier Old Men	A family wedding reignites the ancient feud be...	...	0.0	101.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Still Yelling. Still Fighting. Still Ready for...	Grumpier Old Men	False	6.5	92.0	1995
3	False	NaN	16000000	[Comedy, Drama, Romance]	NaN	31357	tt0114885	en	Waiting to Exhale	Cheated on, mistreated and stepped on, the wom...	...	81452156.0	127.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Friends are the people who let you be yourself...	Waiting to Exhale	False	6.1	34.0	1995
4	False	{'id': 96871, 'name': 'Father of the Bride Col...	0	[Comedy]	NaN	11862	tt0113041	en	Father of the Bride Part II	Just when George Banks has recovered from his ...	...	76578911.0	106.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Just When His World Is Back To Normal... He's ...	Father of the Bride Part II	False	5.7	173.0	1995
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
40221	False	NaN	15000000	[Action, Adventure, Drama, Horror, Science Fic...	NaN	315011	tt4262980	ja	シン・ゴジラ	From the mind behind Evangelion comes a hit la...	...	77000000.0	120.0	[{'iso_639_1': 'it', 'name': 'Italiano'}, {'is...	Released	A god incarnate. A city doomed.	Shin Godzilla	False	6.6	152.0	2016
40500	False	NaN	0	[Documentary, Music]	http://www.thebeatlesliveproject.com/	391698	tt2531318	en	The Beatles: Eight Days a Week - The Touring Y...	The band stormed Europe in 1963, and, in 1964,...	...	0.0	99.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	The band you know. The story you don't.	The Beatles: Eight Days a Week - The Touring Y...	False	7.6	92.0	2016
44818	False	{'id': 34055, 'name': 'Pokémon Collection', 'p...	16000000	[Adventure, Fantasy, Animation, Action, Family]	http://movies.warnerbros.com/pk3/	10991	tt0235679	ja	Pokémon 3: The Movie	When Molly Hale's sadness of her father's disa...	...	68411275.0	93.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Pokémon: Spell of the Unknown	Pokémon: Spell of the Unknown	False	6.0	144.0	2000
44823	False	{'id': 34055, 'name': 'Pokémon Collection', 'p...	0	[Adventure, Fantasy, Animation, Science Fictio...	http://www.pokemon.com/us/movies/movie-pokemon...	12600	tt0287635	ja	劇場版ポケットモンスターセレビィ時を越えた遭遇（であい）	All your favorite Pokémon characters are back,...	...	28023563.0	75.0	[{'iso_639_1': 'ja', 'name': '日本語'}]	Released	NaN	Pokémon 4Ever: Celebi - Voice of the Forest	False	5.7	82.0	2001
45262	False	NaN	0	[Comedy, Drama]	NaN	265189	tt2121382	sv	Turist	While holidaying in the French Alps, a Swedish...	...	1359497.0	118.0	[{'iso_639_1': 'fr', 'name': 'Français'}, {'is...	Released	NaN	Force Majeure	False	6.8	255.0	2014

9099 rows × 25 columns

Movie Description Based Recommender

A recommender based on movie overviews and taglines.

small_md['tagline'] = small_md['tagline'].fillna('')
small_md['description'] = small_md['overview'] + small_md['tagline']
small_md['description'] = small_md['description'].fillna('')

TF-IDF Vectorizer

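# min_df=1 keeps every term, as min_df=0 did; some recent scikit-learn releases reject an integer min_df below 1
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')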
tfidf_matrix = tf.fit_transform(small_md['description'])
tfidf_matrix.shape

(9099, 268123)

Cosine Similarity

Cosine similarity will be used to compute a numeric score that denotes the similarity between two movies.

cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim[0]

array([1.        , 0.00680476, 0.        , ..., 0.        , 0.00344913,
       0.        ])

Every pair of movies in the dataset now has a cosine similarity score, stored in the matrix cosine_sim.
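`linear_kernel` computes plain dot products. That is sufficient here because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the dot product of two rows equals their cosine similarity. The sketch below checks this equivalence on two made-up term-weight vectors; all names and numbers in it are illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed from raw (unnormalized) vectors."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


# Hypothetical term-weight vectors for two movie descriptions.
doc_a = [0.0, 2.0, 1.0, 0.0]
doc_b = [1.0, 1.0, 0.0, 3.0]

plain_dot_of_normalized = dot(l2_normalize(doc_a), l2_normalize(doc_b))
full_cosine = cosine(doc_a, doc_b)
print(plain_dot_of_normalized, full_cosine)
```

Skipping the explicit normalization is what makes the dot-product route cheaper on a matrix of this size.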

small_md = small_md.reset_index()
titles = small_md['title']
indices = pd.Series(small_md.index, index=small_md['title'])

indices

title
Toy Story                                                0
Jumanji                                                  1
Grumpier Old Men                                         2
Waiting to Exhale                                        3
Father of the Bride Part II                              4
                                                      ... 
Shin Godzilla                                         9094
The Beatles: Eight Days a Week - The Touring Years    9095
Pokémon: Spell of the Unknown                         9096
Pokémon 4Ever: Celebi - Voice of the Forest           9097
Force Majeure                                         9098
Length: 9099, dtype: int64

The function below returns the 30 movies most similar to the input movie, ranked by cosine similarity score.

def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]
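Dropping the first element of the sorted list assumes the query movie always sorts first. That may not hold when another movie has an identical score, for instance two entries with the same description. A more defensive variant, sketched here on a made-up similarity matrix, excludes the query index explicitly. The name `top_n_similar` and the toy matrix are hypothetical.

```python
def top_n_similar(sim_row, idx, n=30):
    """Return the indices of the n highest-scoring items, excluding idx itself."""
    scored = [(i, score) for i, score in enumerate(sim_row) if i != idx]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:n]]


# Hypothetical 4x4 similarity matrix; items 0 and 2 have identical descriptions.
toy_sim = [
    [1.0, 0.2, 1.0, 0.0],
    [0.2, 1.0, 0.2, 0.5],
    [1.0, 0.2, 1.0, 0.0],
    [0.0, 0.5, 0.0, 1.0],
]

neighbours_of_2 = top_n_similar(toy_sim[2], idx=2, n=2)
print(neighbours_of_2)
```

With positional slicing, the tie between items 0 and 2 would discard item 0 and keep item 2 as its own neighbour; filtering by index avoids that.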

get_recommendations('The Dark Knight')

7931                      The Dark Knight Rises
132                              Batman Forever
1113                             Batman Returns
8227    Batman: The Dark Knight Returns, Part 2
7565                 Batman: Under the Red Hood
524                                      Batman
7901                           Batman: Year One
2579               Batman: Mask of the Phantasm
2696                                        JFK
8165    Batman: The Dark Knight Returns, Part 1
6144                              Batman Begins
7933         Sherlock Holmes: A Game of Shadows
5511                            To End All Wars
4489                                      Q & A
7344                        Law Abiding Citizen
7242                  The File on Thelma Jordon
3537                               Criminal Law
2893                              Flying Tigers
1135                   Night Falls on Manhattan
8680                          The Young Savages
8917         Batman v Superman: Dawn of Justice
1240                             Batman & Robin
6740                                Rush Hour 3
1652                            The Shaggy D.A.
6667                                   Fracture
4028                                 The Rookie
8371       Justice League: Crisis on Two Earths
8719                                 By the Gun
3730                    Dr. Mabuse, the Gambler
4160                     The Master of Disguise
Name: title, dtype: object

As observed, the system takes the description and tagline of The Dark Knight into consideration and recommends mostly other Batman movies, along with detective, superhero and crime titles.

Metadata Based Recommender

This recommender is based on genre, keywords, cast and crew.

credits = pd.read_csv('dataset/movie/credits.csv')
keywords = pd.read_csv('dataset/movie/keywords.csv')

keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
md['id'] = md['id'].astype('int')
md.shape

(45463, 25)

md = md.merge(credits, on='id')
md = md.merge(keywords, on='id')
# .copy() so that the column assignments on sub_md further down act on an independent frame
sub_md = md[md['id'].isin(links_small)].copy()
sub_md.shape

(9219, 28)

sub_md.head(5)

	adult	belongs_to_collection	budget	genres	homepage	id	imdb_id	original_language	original_title	overview	...	status	tagline	title	video	vote_average	vote_count	year	cast	crew	keywords
0	False	{'id': 10194, 'name': 'Toy Story Collection', ...	30000000	[Animation, Comedy, Family]	http://toystory.disney.com/toy-story	862	tt0114709	en	Toy Story	Led by Woody, Andy's toys live happily in his ...	...	Released	NaN	Toy Story	False	7.7	5415.0	1995	[{'cast_id': 14, 'character': 'Woody (voice)',...	[{'credit_id': '52fe4284c3a36847f8024f49', 'de...	[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...
1	False	NaN	65000000	[Adventure, Fantasy, Family]	NaN	8844	tt0113497	en	Jumanji	When siblings Judy and Peter discover an encha...	...	Released	Roll the dice and unleash the excitement!	Jumanji	False	6.9	2413.0	1995	[{'cast_id': 1, 'character': 'Alan Parrish', '...	[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...	[{'id': 10090, 'name': 'board game'}, {'id': 1...
2	False	{'id': 119050, 'name': 'Grumpy Old Men Collect...	0	[Romance, Comedy]	NaN	15602	tt0113228	en	Grumpier Old Men	A family wedding reignites the ancient feud be...	...	Released	Still Yelling. Still Fighting. Still Ready for...	Grumpier Old Men	False	6.5	92.0	1995	[{'cast_id': 2, 'character': 'Max Goldman', 'c...	[{'credit_id': '52fe466a9251416c75077a89', 'de...	[{'id': 1495, 'name': 'fishing'}, {'id': 12392...
3	False	NaN	16000000	[Comedy, Drama, Romance]	NaN	31357	tt0114885	en	Waiting to Exhale	Cheated on, mistreated and stepped on, the wom...	...	Released	Friends are the people who let you be yourself...	Waiting to Exhale	False	6.1	34.0	1995	[{'cast_id': 1, 'character': "Savannah 'Vannah...	[{'credit_id': '52fe44779251416c91011acb', 'de...	[{'id': 818, 'name': 'based on novel'}, {'id':...
4	False	{'id': 96871, 'name': 'Father of the Bride Col...	0	[Comedy]	NaN	11862	tt0113041	en	Father of the Bride Part II	Just when George Banks has recovered from his ...	...	Released	Just When His World Is Back To Normal... He's ...	Father of the Bride Part II	False	5.7	173.0	1995	[{'cast_id': 1, 'character': 'George Banks', '...	[{'credit_id': '52fe44959251416c75039ed7', 'de...	[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...

5 rows × 28 columns

To keep things simple:

The crew will be represented by the director only.
Only the top 3 actors will be chosen to represent the cast.

sub_md['cast'] = sub_md['cast'].apply(literal_eval)
sub_md['crew'] = sub_md['crew'].apply(literal_eval)
sub_md['keywords'] = sub_md['keywords'].apply(literal_eval)
sub_md['cast_size'] = sub_md['cast'].apply(lambda x: len(x))
sub_md['crew_size'] = sub_md['crew'].apply(lambda x: len(x))

def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan
sub_md['director'] = sub_md['crew'].apply(get_director)

sub_md['cast'] = sub_md['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
sub_md['cast'] = sub_md['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)

sub_md['keywords'] = sub_md['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

sub_md.head(10)

	adult	belongs_to_collection	budget	genres	homepage	id	imdb_id	original_language	original_title	overview	...	video	vote_average	vote_count	year	cast	crew	keywords	cast_size	crew_size	director
0	False	{'id': 10194, 'name': 'Toy Story Collection', ...	30000000	[Animation, Comedy, Family]	http://toystory.disney.com/toy-story	862	tt0114709	en	Toy Story	Led by Woody, Andy's toys live happily in his ...	...	False	7.7	5415.0	1995	[Tom Hanks, Tim Allen, Don Rickles]	[{'credit_id': '52fe4284c3a36847f8024f49', 'de...	[jealousy, toy, boy, friendship, friends, riva...	13	106	John Lasseter
1	False	NaN	65000000	[Adventure, Fantasy, Family]	NaN	8844	tt0113497	en	Jumanji	When siblings Judy and Peter discover an encha...	...	False	6.9	2413.0	1995	[Robin Williams, Jonathan Hyde, Kirsten Dunst]	[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...	[board game, disappearance, based on children'...	26	16	Joe Johnston
2	False	{'id': 119050, 'name': 'Grumpy Old Men Collect...	0	[Romance, Comedy]	NaN	15602	tt0113228	en	Grumpier Old Men	A family wedding reignites the ancient feud be...	...	False	6.5	92.0	1995	[Walter Matthau, Jack Lemmon, Ann-Margret]	[{'credit_id': '52fe466a9251416c75077a89', 'de...	[fishing, best friend, duringcreditsstinger, o...	7	4	Howard Deutch
3	False	NaN	16000000	[Comedy, Drama, Romance]	NaN	31357	tt0114885	en	Waiting to Exhale	Cheated on, mistreated and stepped on, the wom...	...	False	6.1	34.0	1995	[Whitney Houston, Angela Bassett, Loretta Devine]	[{'credit_id': '52fe44779251416c91011acb', 'de...	[based on novel, interracial relationship, sin...	10	10	Forest Whitaker
4	False	{'id': 96871, 'name': 'Father of the Bride Col...	0	[Comedy]	NaN	11862	tt0113041	en	Father of the Bride Part II	Just when George Banks has recovered from his ...	...	False	5.7	173.0	1995	[Steve Martin, Diane Keaton, Martin Short]	[{'credit_id': '52fe44959251416c75039ed7', 'de...	[baby, midlife crisis, confidence, aging, daug...	12	7	Charles Shyer
5	False	NaN	60000000	[Action, Crime, Drama, Thriller]	NaN	949	tt0113277	en	Heat	Obsessive master thief, Neil McCauley leads a ...	...	False	7.7	1886.0	1995	[Al Pacino, Robert De Niro, Val Kilmer]	[{'credit_id': '52fe4292c3a36847f802916d', 'de...	[robbery, detective, bank, obsession, chase, s...	65	71	Michael Mann
6	False	NaN	58000000	[Comedy, Romance]	NaN	11860	tt0114319	en	Sabrina	An ugly duckling having undergone a remarkable...	...	False	6.2	141.0	1995	[Harrison Ford, Julia Ormond, Greg Kinnear]	[{'credit_id': '52fe44959251416c75039da9', 'de...	[paris, brother brother relationship, chauffeu...	57	53	Sydney Pollack
7	False	NaN	0	[Action, Adventure, Drama, Family]	NaN	45325	tt0112302	en	Tom and Huck	A mischievous young boy, Tom Sawyer, witnesses...	...	False	5.4	45.0	1995	[Jonathan Taylor Thomas, Brad Renfro, Rachael ...	[{'credit_id': '52fe46bdc3a36847f810f797', 'de...	[]	7	4	Peter Hewitt
8	False	NaN	35000000	[Action, Adventure, Thriller]	NaN	9091	tt0114576	en	Sudden Death	International action superstar Jean Claude Van...	...	False	5.5	174.0	1995	[Jean-Claude Van Damme, Powers Boothe, Dorian ...	[{'credit_id': '52fe44dbc3a36847f80ae0f1', 'de...	[terrorist, hostage, explosive, vice president]	6	9	Peter Hyams
9	False	{'id': 645, 'name': 'James Bond Collection', '...	58000000	[Adventure, Action, Thriller]	http://www.mgm.com/view/movie/757/Goldeneye/	710	tt0113189	en	GoldenEye	James Bond must unmask the mysterious head of ...	...	False	6.6	1194.0	1995	[Pierce Brosnan, Sean Bean, Izabella Scorupco]	[{'credit_id': '52fe426ec3a36847f801e14b', 'de...	[cuba, falsely accused, secret identity, compu...	20	46	Martin Campbell

10 rows × 31 columns

Strip spaces and convert all our features to lowercase

sub_md['cast'] = sub_md['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

Mention the director 3 times to give it more weight relative to the entire cast.

sub_md['director'] = sub_md['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
sub_md['director'] = sub_md['director'].apply(lambda x: [x,x, x])
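The effect of repeating the director token can be checked on a toy case. The sketch below compares two hypothetical movies that share a director but no cast members, using raw token counts and cosine similarity. It counts single tokens only, whereas the notebook's vectorizer also counts bigrams, so the numbers are indicative rather than a reproduction.

```python
import math
from collections import Counter


def count_cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of tokens, using raw counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical metadata: same director, entirely different casts.
cast_a = ["actorone", "actortwo", "actorthree"]
cast_b = ["actorfour", "actorfive", "actorsix"]
director = "somedirector"

sim_once = count_cosine(cast_a + [director], cast_b + [director])
sim_thrice = count_cosine(cast_a + [director] * 3, cast_b + [director] * 3)
print(round(sim_once, 3), round(sim_thrice, 3))
```

Repeating the token raises the similarity between movies that share a director, which is the weighting effect the notebook is after.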

Calculate the frequency counts of every keyword that appears in the dataset

s = sub_md.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'
s = s.value_counts()
s[:5]

independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
Name: keyword, dtype: int64

Remove keywords that occur only once

s = s[s > 1]

Convert every word to its stem so that words such as heroes and hero are considered the same.

stemmer = SnowballStemmer('english')
stemmer.stem('heroes')

'hero'

def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words
sub_md['keywords'] = sub_md['keywords'].apply(filter_keywords)
sub_md['keywords'] = sub_md['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
sub_md['keywords'] = sub_md['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

Create a metadata dump for every movie which consists of genres, director, main actors and keywords.

sub_md['metadata'] = sub_md['keywords'] + sub_md['cast'] + sub_md['director'] + sub_md['genres']
sub_md['metadata'] = sub_md['metadata'].apply(lambda x: ' '.join(x))

Use CountVectorizer to create a count matrix. Then, calculate the cosine similarities and return the movies that are most similar.

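# min_df=1 keeps every term, as min_df=0 did; some recent scikit-learn releases reject an integer min_df below 1
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')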
count_matrix = count.fit_transform(sub_md['metadata'])
cosine_sim = cosine_similarity(count_matrix, count_matrix)

sub_md = sub_md.reset_index()
titles = sub_md['title']
indices = pd.Series(sub_md.index, index=sub_md['title'])

def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

get_recommendations('The Dark Knight').head(10)

8031         The Dark Knight Rises
6218                 Batman Begins
6623                  The Prestige
2085                     Following
7648                     Inception
4145                      Insomnia
3381                       Memento
8613                  Interstellar
7659    Batman: Under the Red Hood
1134                Batman Returns
Name: title, dtype: object

As observed, more of Christopher Nolan's movies made it onto the list. These movies also appear to share the same genres and to demand more thought from the viewer.

Adding Popularity and Ratings to recommender

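Take the top 25 movies based on similarity scores and compute the 60th-percentile vote count among them. This value is used as the cutoff m for filtering the candidates, after which each remaining movie is scored with IMDB's weighted rating formula. Note that weighted_rating, as defined earlier, reads the global m and C, so the scores below are computed with the chart-wide values (m = 434, C ≈ 5.24) rather than the local ones.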

def improved_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = sub_md.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
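    # Keep only movies at or above the local vote-count cutoff; .copy() avoids
    # pandas' SettingWithCopyWarning on the assignments below
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())].copy()
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    # weighted_rating reads the global m and C, not the local values computed above
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)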
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified

improved_recommendations('The Dark Knight')

	title	vote_count	vote_average	year	wr
7648	Inception	14075	8	2010	7.917588
8613	Interstellar	11187	8	2014	7.897107
6623	The Prestige	4510	8	2006	7.758148
3381	Memento	4168	8	2000	7.740175
8031	The Dark Knight Rises	9263	7	2012	6.921448
6218	Batman Begins	7511	7	2005	6.904127
1134	Batman Returns	1706	6	1992	5.846862
132	Batman Forever	1529	5	1995	5.054144
9024	Batman v Superman: Dawn of Justice	7189	5	2016	5.013943
1260	Batman & Robin	1447	4	1997	4.287233

Collaborative Filtering

Predict ratings from the ratings of similar users, using the Surprise library and the Singular Value Decomposition (SVD) algorithm.

from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

reader = Reader()

ratings = pd.read_csv('dataset/movie/ratings_small.csv')
ratings.head()

	userId	movieId	rating	timestamp
0	1	31	2.5	1260759144
1	1	1029	3.0	1260759179
2	1	1061	3.0	1260759182
3	1	1129	2.0	1260759185
4	1	1172	4.0	1260759205

data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)  # 5-fold cross-validation

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8997  0.8943  0.8981  0.9034  0.8894  0.8970  0.0048  
MAE (testset)     0.6924  0.6879  0.6923  0.6949  0.6866  0.6908  0.0031  
Fit time          5.50    5.39    5.82    5.68    6.19    5.72    0.28    
Test time         0.14    0.26    0.16    0.15    0.27    0.20    0.06    





{'test_rmse': array([0.89971588, 0.89433794, 0.89810361, 0.90339806, 0.88944449]),
 'test_mae': array([0.69242199, 0.68794254, 0.69231266, 0.6948946 , 0.68659872]),
 'fit_time': (5.496105194091797,
  5.386150598526001,
  5.822059869766235,
  5.684120178222656,
  6.186817169189453),
 'test_time': (0.14284086227416992,
  0.2593069076538086,
  0.15757989883422852,
  0.1506178379058838,
  0.26779866218566895)}

The mean Root Mean Square Error across the five folds is 0.8970, which means the predicted ratings are typically off by less than one rating point.
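For reference, the two reported metrics can be computed by hand from pairs of true and predicted ratings. The sketch below is a minimal illustration with made-up ratings; the function names `rmse` and `mae` are defined here and are not taken from Surprise.

```python
import math


def rmse(true_ratings, predicted_ratings):
    """Root mean square error: penalises large errors more heavily."""
    squared = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared) / len(squared))


def mae(true_ratings, predicted_ratings):
    """Mean absolute error: average size of the error, ignoring its sign."""
    absolute = [abs(t - p) for t, p in zip(true_ratings, predicted_ratings)]
    return sum(absolute) / len(absolute)


# Made-up ratings, purely for illustration
true_r = [4.0, 3.0, 5.0, 2.0]
pred_r = [3.5, 3.0, 4.0, 3.0]

print(rmse(true_r, pred_r))  # 0.75
print(mae(true_r, pred_r))   # 0.625
```

RMSE is never smaller than MAE on the same data, which is consistent with the cross-validation table above (about 0.90 versus 0.69).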

trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1a0f52151c8>

ratings[ratings['userId'] == 1]

	userId	movieId	rating	timestamp
0	1	31	2.5	1260759144
1	1	1029	3.0	1260759179
2	1	1061	3.0	1260759182
3	1	1129	2.0	1260759185
4	1	1172	4.0	1260759205
5	1	1263	2.0	1260759151
6	1	1287	2.0	1260759187
7	1	1293	2.0	1260759148
8	1	1339	3.5	1260759125
9	1	1343	2.0	1260759131
10	1	1371	2.5	1260759135
11	1	1405	1.0	1260759203
12	1	1953	4.0	1260759191
13	1	2105	4.0	1260759139
14	1	2150	3.0	1260759194
15	1	2193	2.0	1260759198
16	1	2294	2.0	1260759108
17	1	2455	2.5	1260759113
18	1	2968	1.0	1260759200
19	1	3671	3.0	1260759117

svd.predict(1, 302, 3)

Prediction(uid=1, iid=302, r_ui=3, est=2.575412635819967, details={'was_impossible': False})

For user ID 1 and movie ID 302, the model returns an estimated rating of 2.5754, based on how other users have rated the movie. The third argument (3) is only a reference true rating, stored as r_ui, and does not affect the estimate.
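Surprise's SVD documentation describes the estimate as the global mean plus a user bias, an item bias and the dot product of the user and item latent factor vectors, with the result clipped to the rating scale by default. The sketch below illustrates that formula only; the function name `predict_rating` and every number in it are made up for illustration and are not values learned by the model above.

```python
import numpy as np


def predict_rating(mu, b_u, b_i, p_u, q_i, scale=(1.0, 5.0)):
    """Biased matrix-factorisation estimate: mu + b_u + b_i + q_i . p_u.

    mu:  global mean rating
    b_u: user bias, b_i: item bias
    p_u: user latent factors, q_i: item latent factors
    The estimate is clipped to the rating scale.
    """
    est = mu + b_u + b_i + float(np.dot(q_i, p_u))
    return float(np.clip(est, scale[0], scale[1]))


# Hypothetical parameters, not taken from the trained model
mu, b_u, b_i = 3.5, -0.4, 0.2
p_u = np.array([0.1, -0.3])
q_i = np.array([0.5, 0.2])

print(round(predict_rating(mu, b_u, b_i, p_u, q_i), 2))  # 3.29
```

A user who tends to rate below average (negative b_u) therefore gets lower estimates across the board, before the latent factors adjust for their taste in this particular movie.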

Hybrid Recommender

Content based + collaborative filter based recommender

Input: User ID and the Title of a Movie

Output: Similar movies sorted on the basis of expected ratings by that particular user.

def convert_int(x):
    # Convert an ID to int; return NaN when the value is missing or malformed
    try:
        return int(x)
    except (ValueError, TypeError, OverflowError):
        return np.nan
    
def hybrid(userId, title):
    idx = indices[title]
    
    # Content-based step: the 25 movies most similar to the input title
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = sub_md.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']]
    # Collaborative step: map each TMDB id to its MovieLens movieId and estimate this user's rating
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)

id_map = pd.read_csv('dataset/movie/links_small.csv')[['movieId', 'tmdbId']]
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId', 'id']
id_map = id_map.merge(sub_md[['title', 'id']], on='id').set_index('title')
#id_map = id_map.set_index('tmdbId')
indices_map = id_map.set_index('id')

hybrid(1, 'The Dark Knight')

	title	vote_count	vote_average	year	id	est
3381	Memento	4168.0	8.1	2000	77	3.518882
7648	Inception	14075.0	8.1	2010	27205	3.253708
6623	The Prestige	4510.0	8.0	2006	1124	3.230209
8613	Interstellar	11187.0	8.1	2014	157336	3.094515
6218	Batman Begins	7511.0	7.5	2005	272	3.030609
5943	Thursday	84.0	7.0	1998	9812	3.003159
8031	The Dark Knight Rises	9263.0	7.6	2012	49026	2.864251
7362	Gangster's Paradise: Jerusalema	16.0	6.8	2008	22600	2.856210
5098	The Enforcer	21.0	7.4	1951	26712	2.817666
7561	Harry Brown	351.0	6.7	2009	25941	2.729585

hybrid(500, 'The Dark Knight')

	title	vote_count	vote_average	year	id	est
6623	The Prestige	4510.0	8.0	2006	1124	3.823928
3381	Memento	4168.0	8.1	2000	77	3.601578
8613	Interstellar	11187.0	8.1	2014	157336	3.519702
7648	Inception	14075.0	8.1	2010	27205	3.330051
5943	Thursday	84.0	7.0	1998	9812	3.322717
7561	Harry Brown	351.0	6.7	2009	25941	3.210024
2448	Nighthawks	87.0	6.4	1981	21610	3.034214
2131	Superman	1042.0	6.9	1978	1924	2.998992
2085	Following	363.0	7.2	1998	11660	2.974581
8031	The Dark Knight Rises	9263.0	7.6	2012	49026	2.967465

As observed, different users are offered different recommendation lists for the same movie, indicating that the recommendations are tailored to each user.
