In [3]:
import pandas as pd 
import numpy as np 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity


This notebook builds two movie recommenders: a popularity-based recommender and a similar-movie recommender.
The first uses the TMDB dataset from Kaggle and IMDB's weighted rating formula. The idea is that more popular movies are more likely to be liked by the average viewer. This recommender returns the same list to every user; it is not personalized.
The second recommends movies similar to a given title, based on the overview text in the dataset.

In [9]:
# loading the data
df1 = pd.read_csv('/Users/nurdauletzhuzbay/Desktop/tmdb_5000_credits.csv')
df2 = pd.read_csv('/Users/nurdauletzhuzbay/Desktop/tmdb_5000_movies.csv')

# join the two datasets on the 'id' column
df1.columns = ['id','tittle','cast','crew']
df = df2.merge(df1,on='id')


df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,tittle,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


# **Simple Popular Movie Recommender** 

IMDB's Weighted Rating (wr): $wr = \frac{v}{v+m} R + \frac{m}{v+m} C$

* v - the number of votes for the movie
* m - the minimum votes required to be listed in the chart
* R - the average rating of the movie
* C - mean vote across the whole report
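
As a minimal sketch of how the formula behaves, the snippet below evaluates it for a movie with few votes and for one with many votes. The values of `m_demo`, `C_demo`, and both movies are made up for illustration and are not taken from the dataset; `weighted_rating_example` is a hypothetical helper name.

```python
def weighted_rating_example(v, R, m, C):
    """IMDB-style weighted rating: a vote-weighted blend of R and C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative values only (assumptions, not from the dataset)
m_demo, C_demo = 1000, 6.0

few_votes = weighted_rating_example(v=50, R=9.0, m=m_demo, C=C_demo)
many_votes = weighted_rating_example(v=50000, R=9.0, m=m_demo, C=C_demo)

print(round(few_votes, 3))   # stays close to C: few votes carry little weight
print(round(many_votes, 3))  # moves close to R: many votes dominate
```

Both movies have the same average rating, yet the one with few votes is pulled toward the global mean `C`, which is what keeps sparsely rated movies from topping the chart.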


In [10]:
vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C

5.6529252550489275

We set a vote-count cutoff for the chart: to be listed, a movie must have at least as many votes as the 90th percentile of the dataset, i.e. more votes than roughly 90% of the movies.

In [18]:
vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
m = vote_counts.quantile(0.9)
m

1838.4000000000015
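
To make the cutoff concrete, here is a small sketch of `quantile(0.9)` on a toy series. The vote counts below are hypothetical and not taken from the dataset; the names `toy_counts`, `m_toy`, and `qualified` are introduced only for this example.

```python
import pandas as pd

# Hypothetical vote counts for 11 movies (not from the dataset)
toy_counts = pd.Series([5, 10, 50, 100, 200, 400, 800, 1600, 3200, 6400, 12800])

# 90th percentile; pandas interpolates linearly between data points by default
m_toy = toy_counts.quantile(0.9)

# Movies that meet the cutoff
qualified = toy_counts[toy_counts >= m_toy]
print(m_toy, len(qualified))
```

Linear interpolation is also why the cutoff computed on the real data need not be a whole number.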

In [53]:
q = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'vote_count', 'vote_average', 'popularity', 'genres']]
q['vote_count'] = q['vote_count'].astype('int')
q['vote_average'] = q['vote_average'].astype('int')
q.shape

(481, 5)

Thus, to be considered for the chart, a movie needs a vote count of at least m ≈ 1838.4, i.e. 1839 votes or more. The mean rating C is about 5.65; note that it is computed after truncating each rating to an integer with `astype('int')`. 481 movies qualify for the chart.

In [55]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)


In [56]:
q['wr'] = q.apply(weighted_rating, axis=1)

In [57]:
q = q.sort_values('wr', ascending=False).head(250)

In [58]:
q.head(15)

Unnamed: 0,title,vote_count,vote_average,popularity,genres,wr
96,Inception,13752,8,167.58371,"[{""id"": 28, ""name"": ""Action""}, {""id"": 53, ""nam...",7.723236
65,The Dark Knight,12002,8,187.322927,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 28, ""name...",7.688242
95,Interstellar,10867,8,724.247784,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 18, ""...",7.660391
662,Fight Club,9413,8,146.757391,"[{""id"": 18, ""name"": ""Drama""}]",7.616504
262,The Lord of the Rings: The Fellowship of the Ring,8705,8,138.049577,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",7.590752
3232,Pulp Fiction,8428,8,121.463076,"[{""id"": 53, ""name"": ""Thriller""}, {""id"": 80, ""n...",7.57971
1881,The Shawshank Redemption,8205,8,136.747729,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 80, ""name...",7.570378
329,The Lord of the Rings: The Return of the King,8064,8,123.630332,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",7.564261
809,Forrest Gump,7927,8,138.133331,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",7.558148
330,The Lord of the Rings: The Two Towers,7487,8,106.914973,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",7.5373
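
As a sanity check, the first row of the table can be recomputed by hand. This sketch only plugs the values of `C` and `m` printed earlier, and Inception's `vote_count` and truncated `vote_average`, into the same formula; the variable names are introduced here for illustration.

```python
# Values printed earlier in this notebook
C_printed = 5.6529252550489275   # mean of the integer-truncated ratings
m_printed = 1838.4000000000015   # 90th-percentile vote count

# Inception's row in the table: vote_count and (truncated) vote_average
v, R = 13752, 8

wr = (v / (v + m_printed)) * R + (m_printed / (m_printed + v)) * C_printed
print(round(wr, 6))
```

The result should agree with the `wr` column for Inception up to rounding.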


# **Movie description based recommender**

This recommender compares movie overviews and returns the movies whose overviews are most similar to that of a given title.



In [32]:
df["overview"]

0       In the 22nd century, a paraplegic Marine is di...
1       Captain Barbossa, long believed to be dead, ha...
2       A cryptic message from Bond’s past sends him o...
3       Following the death of District Attorney Harve...
4       John Carter is a war-weary, former military ca...
                              ...                        
4798    El Mariachi just wants to play his guitar and ...
4799    A newlywed couple's honeymoon is upended by th...
4800    "Signed, Sealed, Delivered" introduces a dedic...
4801    When ambitious New York attorney Sam is sent t...
4802    Ever since the second grade when he first saw ...
Name: overview, Length: 4803, dtype: object

To turn each overview into numerical features, we use Term Frequency-Inverse Document Frequency (TF-IDF) vectors from scikit-learn. First, let's construct the TF-IDF matrix.

In [42]:
# Build TF-IDF vectors from the overviews, dropping English stop words
v = TfidfVectorizer(analyzer='word', stop_words='english')
# Replace missing overviews with an empty string and fit only once
tfidf_matrix = v.fit_transform(df['overview'].fillna(''))

In [43]:
tfidf_matrix.shape

(4803, 20978)
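
To see what such a matrix contains, the sketch below runs the same vectorizer settings on three made-up overviews. The sentences and the names `toy_overviews`, `toy_vectorizer`, `toy_matrix`, and `row_norms` are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy overviews (made up for illustration)
toy_overviews = [
    "A marine is sent to a distant moon",
    "A captain sails to the end of the world",
    "A spy receives a cryptic message",
]

toy_vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_overviews)

# One row per overview, one column per term kept in the vocabulary
print(toy_matrix.shape)

# Stop words such as "the" are dropped from the vocabulary
print('the' in toy_vectorizer.vocabulary_)

# Each non-empty row has unit L2 norm (the default is norm='l2')
row_norms = np.asarray(np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()
print(row_norms)
```

The real matrix has the same layout: one row per movie and one column per distinct term that survives stop-word removal.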

Then we will use cosine similarity to measure how similar two movies are. It is defined as $\text{cosine}(x, y) = \frac{x \cdot y^\intercal}{\lVert x \rVert \, \lVert y \rVert}$

Because `TfidfVectorizer` L2-normalizes each row by default, the dot product of two rows is already their cosine similarity. That is why we use `linear_kernel`, which computes only the dot products and is faster.
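
The equivalence between the dot product and cosine similarity can be checked on a toy corpus. This is a minimal sketch with made-up sentences; `toy_overviews`, `dot_products`, and `cosines` are names introduced only for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy overviews (made up for illustration); the first two share some terms
toy_overviews = [
    "A ship sinks in the ocean",
    "A ghost ship drifts across the ocean",
    "Two friends open a bakery",
]

# Rows are L2-normalized by default (norm='l2')
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_overviews)

dot_products = linear_kernel(toy_matrix, toy_matrix)   # plain dot products
cosines = cosine_similarity(toy_matrix, toy_matrix)    # normalizes, then dot products

# Both approaches agree up to floating-point error
print(np.allclose(dot_products, cosines))
```

If the vectorizer were created with `norm=None`, the rows would no longer have unit length and `linear_kernel` would stop being a substitute for `cosine_similarity`.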

In [59]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [60]:
cosine_sim[0]

array([1., 0., 0., ..., 0., 0., 0.])

In [61]:
df = df.reset_index()
titles = df['title']
indices = pd.Series(df.index, index=df['title'])

In [62]:
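def get_recommendations(title):
    # Row position of the requested movie
    idx = indices[title]
    # Pair every movie position with its similarity to the requested movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Most similar first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the movie itself) and keep the next 30
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]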
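
The ranking step can also be written with NumPy. The sketch below is an alternative, not the method used in this notebook: `top_k_similar` is a hypothetical helper and `sim_row` is a made-up similarity row. It removes the query movie by position instead of assuming it sorts first.

```python
import numpy as np

def top_k_similar(sim_row, self_idx, k=10):
    """Return positions of the k largest scores in sim_row, excluding self_idx."""
    order = np.argsort(sim_row)[::-1]      # positions sorted by descending score
    order = order[order != self_idx]       # drop the query movie itself
    return order[:k]

# Made-up similarity row for a movie at position 0
sim_row = np.array([1.0, 0.1, 0.7, 0.0, 0.4])
print(top_k_similar(sim_row, self_idx=0, k=3))
```

Excluding the query movie by position may be more robust when another movie has an identical overview and therefore also scores 1.0.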

In [63]:
get_recommendations('Titanic').head(10)

1269                                  Raise the Titanic
2143                                         Ghost Ship
2287                         I Can Do Bad All By Myself
770                                       Event Horizon
4287                                            Niagara
3212                                           The Rose
2902                                           Triangle
4228                        The Ballad of Jack and Rose
171     Master and Commander: The Far Side of the World
104                                            Poseidon
Name: title, dtype: object

**Conclusion** 
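We implemented two recommenders. The first ranks movies by IMDB's weighted rating and returns the same popular titles to every user. The second uses TF-IDF vectors of the movie overviews and cosine similarity to return movies similar to a given title.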