I will implement a few recommendation algorithms (content-based and collaborative filtering) and try to build an ensemble of these models to arrive at our final recommendation system.

To build our standard metadata based content recommender, we will need to merge our current dataset with the crew and the keyword datasets.

In [28]:
import pandas as pd 
import numpy as np 
import ast
from IPython.display import Image, HTML
import seaborn as sns
from scipy import sparse

In [29]:
movie_MetaData = pd.read_csv("movie_MetaData_cleanData.csv")
movie_MetaData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 21 columns):
belongs_to_collection    4494 non-null object
budget                   8890 non-null float64
genres                   45466 non-null object
id                       45466 non-null object
imdb_id                  45449 non-null object
original_language        45455 non-null object
overview                 44512 non-null object
popularity               45461 non-null object
production_companies     45463 non-null object
production_countries     45463 non-null object
release_date             45376 non-null object
revenue                  7408 non-null float64
runtime                  45203 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null object
tagline                  20412 non-null object
title                    45460 non-null object
vote_average             45460 non-null float64
vote_count               45460 non-null floa

In [30]:
movie_CreditData = pd.read_csv("credits.csv")
movie_CreditData.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [31]:
movie_KeywordsData = pd.read_csv("keywords.csv")
movie_KeywordsData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
id          46419 non-null int64
keywords    46419 non-null object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB


# Simple Recommender

First, I will build a simple recommendation system. The Simple Recommender offers generalized recommendations to every user based on movie popularity and genre. The basic idea behind this recommender is that movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience. This model does not give personalized recommendations based on the user.

The implementation of this model is extremely trivial. All we have to do is sort our movies based on ratings and popularity and display the top movies of our list. As an added step, we can pass in a genre argument to get the top movies of a particular genre.

I use the TMDB Ratings to come up with our Top Movies Chart. I will use IMDB's weighted rating formula to construct my chart. For a movie to feature in the charts, it must have more votes than at least 80% of the movies in the list.

I will build our overall Top 250 Chart and will define a function to build charts for a particular genre.
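IMDB's weighted rating is commonly written as WR = (v / (v + m)) * R + (m / (v + m)) * C, where v is the movie's vote count, R its average rating, m the minimum number of votes required to be listed, and C the mean rating across all movies. A minimal sketch of that formula, assuming these symbol names (the helper `weighted_rating_score` is illustrative and not part of the notebook):

```python
def weighted_rating_score(v, R, m, C):
    """IMDB-style weighted rating: shrink a movie's average R toward the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Values taken from this notebook: m = 50 votes (80th percentile), C ~ 5.2449.
score = weighted_rating_score(v=661, R=9, m=50.0, C=5.244896612406511)
print(round(score, 4))
```

A movie with few votes stays close to C, while a heavily voted movie approaches its own average R.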

In [32]:
movie_MetaData['genres'] = movie_MetaData['genres'].fillna('[]').apply(ast.literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [33]:
vote_counts = movie_MetaData[movie_MetaData['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = movie_MetaData[movie_MetaData['vote_average'].notnull()]['vote_average'].astype('int')
avg_vote = vote_averages.mean()
print("Voting average is ",avg_vote) 


Voting average is  5.244896612406511


In [34]:
vote_quantile=vote_counts.quantile(0.80)
print("Vote count 80% quantile is ",vote_quantile)

Vote count 80% quantile is  50.0


In [35]:
movie_MetaData['year'] = pd.to_datetime(movie_MetaData['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)


In [36]:
qualified = movie_MetaData[(movie_MetaData['vote_count'] >= vote_quantile) & (movie_MetaData['vote_count'].notnull()) & (movie_MetaData['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(9151, 6)

Therefore, to qualify for the chart, a movie needs at least 50 votes on TMDB. The mean rating for a movie on TMDB is about 5.24 on a scale of 10 (computed after each movie's average is truncated to an integer, as in the code above). 9,151 movies qualify to be on our chart.

In [37]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+vote_quantile) * R) + (vote_quantile/(vote_quantile+v) * avg_vote)

In [38]:
qualified['weighted_Rating'] = qualified.apply(weighted_rating, axis=1)

In [39]:
qualified = qualified.sort_values('weighted_Rating', ascending=False).head(250)

We now have the Top 250 movies based on TMDB ratings. Let us look at the top 10.

In [40]:
qualified.head(10)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,weighted_Rating
10309,Dilwale Dulhania Le Jayenge,1995,661,9,34.457,"[Comedy, Drama, Romance]",8.735928
15480,Inception,2010,14075,8,29.1081,"[Action, Thriller, Science Fiction, Mystery, A...",7.990247
12481,The Dark Knight,2008,12269,8,123.167,"[Drama, Action, Crime, Thriller]",7.988818
22879,Interstellar,2014,11187,8,32.2135,"[Adventure, Drama, Science Fiction]",7.987741
2843,Fight Club,1999,9678,8,63.8696,[Drama],7.985839
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.0707,"[Adventure, Fantasy, Action]",7.984595
292,Pulp Fiction,1994,8670,8,140.95,"[Thriller, Crime]",7.984202
314,The Shawshank Redemption,1994,8358,8,51.6454,"[Drama, Crime]",7.983616
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.3244,"[Adventure, Fantasy, Action]",7.983355
351,Forrest Gump,1994,8147,8,48.3072,"[Comedy, Drama, Romance]",7.983194


It is interesting to see three Christopher Nolan films, Inception, The Dark Knight and Interstellar, near the very top of our chart. The chart also suggests a strong bias among TMDB users towards particular genres and directors.

Let us now construct our function that builds charts for particular genres.

In [41]:
genre_list = movie_MetaData.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
genre_list.name = 'genre'
gen_md = movie_MetaData.drop('genres', axis=1).join(genre_list)
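The reshaping above turns each list of genres into one row per (movie, genre) pair. As a rough illustration of the same idea on toy data, here is a sketch using `DataFrame.explode`, which assumes pandas 0.25 or newer; the frame `toy` and its values are made up for the example:

```python
import pandas as pd

# Toy stand-in for movie_MetaData; the real frame has many more columns.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["Drama", "Crime"], ["Comedy"], []],
})

# One row per (movie, genre). An empty list becomes NaN and is dropped here,
# since such a row can never match a genre filter.
gen_toy = (
    toy.explode("genres")
       .rename(columns={"genres": "genre"})
       .dropna(subset=["genre"])
)
print(gen_toy)
```

Unlike the join used in the notebook, this sketch discards movies that have no genre at all.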

In [42]:
def build_chart(genre, percentile=0.95):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['weighted_Rating'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('weighted_Rating', ascending=False).head(250)
    
    return qualified

Let us see our method in action by displaying the Top 10 Drama movies. As we saw, Drama is the most popular movie genre. Let us also build a chart for the 'Mystery' genre, which is less popular according to our analysis.

In [43]:
build_chart('Drama').head(10)

Unnamed: 0,title,year,vote_count,vote_average,popularity,weighted_Rating
12481,The Dark Knight,2008,12269,8,123.167,7.924623
22879,Interstellar,2014,11187,8,32.2135,7.917574
2843,Fight Club,1999,9678,8,63.8696,7.905213
314,The Shawshank Redemption,1994,8358,8,51.6454,7.890901
351,Forrest Gump,1994,8147,8,48.3072,7.888202
834,The Godfather,1972,6024,8,41.1093,7.851163
24860,The Imitation Game,2014,5895,8,31.5959,7.848105
359,The Lion King,1994,5520,8,21.6058,7.838458
18465,The Intouchables,2011,5410,8,16.0869,7.835391
22841,The Grand Budapest Hotel,2014,4644,8,14.442,7.810313


In [44]:
build_chart('Mystery').head(10)

Unnamed: 0,title,year,vote_count,vote_average,popularity,weighted_Rating
15480,Inception,2010,14075,8,29.1081,7.856221
46,Se7en,1995,5915,8,18.4574,7.682311
11354,The Prestige,2006,4510,8,16.9456,7.598743
4099,Memento,2000,4168,8,15.4508,7.571292
9430,Oldboy,2003,2000,8,10.6169,7.243008
877,Rear Window,1954,1531,8,17.9113,7.092712
896,Citizen Kane,1941,1244,8,15.8119,6.967234
876,Vertigo,1958,1162,8,18.2082,6.924746
14825,Shutter Island,2010,6559,7,15.8136,6.822468
23675,Gone Girl,2014,6023,7,154.801,6.808585


The recommender we built in the previous section suffers from some severe limitations. For one, it gives the same recommendations to everyone, regardless of the user's personal taste. If a person who loves romantic movies (and hates action) were to look at our Top 10 Chart, s/he probably wouldn't like most of the movies. Even if s/he went one step further and looked at our charts by genre, s/he still wouldn't be getting the best recommendations.

# Content Based Recommender

It goes without saying that the quality of our recommender would be increased with the usage of better metadata. That is exactly what we are going to do in this section. We are going to build a recommender based on the following metadata: the 3 top actors, the director, related genres and the movie plot keywords.

In [45]:
def clean_numeric(x):
    # Convert to float; values that cannot be parsed become NaN
    try:
        return float(x)
    except (TypeError, ValueError):
        return np.nan

In [46]:
movie_MetaData['id'] = movie_MetaData['id'].apply(clean_numeric).astype('float')

In [47]:
movie_MetaData['id'] = movie_MetaData['id'].fillna(0).astype('int')

In [48]:
movie_MetaData = movie_MetaData.merge(movie_CreditData, on='id')

In [49]:
movie_MetaData = movie_MetaData.merge(movie_KeywordsData, on='id')

In [50]:
movie_MetaData.shape

(46628, 24)

I will use a subset of all the movies available because of my limited computing power. This small dataset comprises 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users.

In [51]:
links_small = pd.read_csv('links_small.csv')[['movieId', 'tmdbId']]
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')


In [52]:
small_movie_MetaData = movie_MetaData[movie_MetaData['id'].isin(links_small)]
small_movie_MetaData.shape

(9219, 24)

The required data is currently stored as "stringified" lists, so we need to convert it into a safe and usable structure.
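To show what this conversion does, here is a minimal sketch on a single made-up cell value. The string `raw` imitates the shape of a `keywords` entry; its ids are illustrative rather than taken from the dataset.

```python
import ast

# A shortened, made-up example of how a 'keywords' cell is stored on disk.
raw = "[{'id': 1, 'name': 'jealousy'}, {'id': 2, 'name': 'toy'}]"

parsed = ast.literal_eval(raw)   # evaluates Python literals only, never arbitrary code
names = [item["name"] for item in parsed]
print(names)
```

`ast.literal_eval` is preferable to `eval` here because it refuses anything that is not a plain literal.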

In [53]:

features = ['cast', 'crew', 'keywords']
for feature in features:
    small_movie_MetaData[feature] = small_movie_MetaData[feature].fillna('[]').apply(ast.literal_eval)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
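The warning above appears because `small_movie_MetaData` was created by filtering another DataFrame, so pandas cannot tell whether the assignment should affect the original. One common way to avoid it, sketched here on toy data (the names `full`, `wanted_ids` and `subset` are illustrative), is to take an explicit copy when building the subset:

```python
import pandas as pd

# Toy stand-ins for movie_MetaData and the tmdbId filter used above.
full = pd.DataFrame({"id": [1, 2, 3], "cast": ["[]", "[]", "[]"]})
wanted_ids = pd.Series([1, 3])

# .copy() makes the subset an independent DataFrame, so later column
# assignments are unambiguous and leave the original untouched.
subset = full[full["id"].isin(wanted_ids)].copy()
subset["cast"] = subset["cast"].apply(lambda s: [])

print(subset.shape)
```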


Next, I create a metadata dump for every movie, consisting of genres, director, main actors and keywords. I then use a CountVectorizer to create our count matrix.

In [54]:
# Get the director's name from the crew feature. If director is not listed, return NaN
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [55]:
# Return the first 3 elements of the list, or the entire list if it has fewer than 3.
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [56]:
# Define new director, cast, genres and keywords features that are in a suitable form.
small_movie_MetaData['director'] = small_movie_MetaData['crew'].apply(get_director)






In [57]:

features = ['cast', 'keywords']
for feature in features:
    small_movie_MetaData[feature] = small_movie_MetaData[feature].apply(get_list)



In [58]:
# Print the new features of the first 3 films
small_movie_MetaData[['title', 'cast','director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy]","[Animation, Comedy, Family]"
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[Adventure, Fantasy, Family]"
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger]","[Romance, Comedy]"


The next step would be to convert the names and keyword instances into lowercase and strip all the spaces between them. This is done so that our vectorizer doesn't count the Johnny of "Johnny Depp" and "Johnny Galecki" as the same.
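The effect of stripping spaces can be seen in a small sketch. It splits on whitespace as a rough approximation of how a bag-of-words vectorizer tokenizes text; the helper `tokens` is hypothetical and only for illustration.

```python
def tokens(names, strip_spaces):
    """Split a list of names into the word tokens a bag-of-words model would see."""
    out = set()
    for name in names:
        text = name.lower()
        if strip_spaces:
            text = text.replace(" ", "")
        out.update(text.split())
    return out

a, b = ["Johnny Depp"], ["Johnny Galecki"]
shared_raw = tokens(a, False) & tokens(b, False)    # both contain the token 'johnny'
shared_clean = tokens(a, True) & tokens(b, True)    # 'johnnydepp' vs 'johnnygalecki'
print(shared_raw, shared_clean)
```

Without the cleaning step the two actors share a token and would look partially similar; with it they share none.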

In [59]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [60]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    small_movie_MetaData[feature] = small_movie_MetaData[feature].apply(clean_data)



The director is repeated 3 times to give it more weight relative to the cast.

In [61]:
small_movie_MetaData['director'] = small_movie_MetaData['director'].apply(lambda x: [x,x, x])



We are now in a position to create our "metadata soup", which is a string that contains all the metadata that we want to feed to our vectorizer (namely keywords, actors, director and genres).

In [62]:
small_movie_MetaData['soup'] = small_movie_MetaData['keywords'] + small_movie_MetaData['cast'] + small_movie_MetaData['director'] + small_movie_MetaData['genres']
small_movie_MetaData['soup'] = small_movie_MetaData['soup'].apply(lambda x: ' '.join(x))



We will use cosine similarity to compute a numeric score for how similar two movies are. Cosine similarity is independent of vector magnitude and is relatively easy and fast to calculate. We use CountVectorizer() rather than a TF-IDF weighting because we do not want to down-weight an actor or director simply because he or she has acted in or directed relatively more movies.
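To make the scoring concrete, here is a minimal pure-Python sketch of cosine similarity on raw token counts. The three "soups" are made up for illustration, and the helper `cosine` is not part of the notebook, which relies on scikit-learn for this step.

```python
import math
from collections import Counter

def cosine(soup_a, soup_b):
    """Cosine similarity between the token-count vectors of two strings."""
    ca, cb = Counter(soup_a.split()), Counter(soup_b.split())
    dot = sum(ca[t] * cb[t] for t in ca)          # Counter returns 0 for missing tokens
    norm_a = math.sqrt(sum(c * c for c in ca.values()))
    norm_b = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

x = "christophernolan christophernolan christophernolan christianbale crime"
y = "christophernolan christophernolan christophernolan leonardodicaprio heist"
z = "johnlasseter johnlasseter johnlasseter tomhanks toy"

print(cosine(x, y), cosine(x, z))
```

Because the director token appears three times, a shared director dominates the score: `x` and `y` share only the director yet score 9/11, while `x` and `z` share nothing and score 0. Repeating a soup twice does not change its similarity to anything, which is the magnitude independence mentioned above.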

In [63]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(small_movie_MetaData['soup'])


In [64]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(count_matrix, count_matrix)

We are now in a good position to define our recommendation function. We will follow these steps:

1. Get the index of the movie given its title.
2. Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position and the second is the similarity score.
3. Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.
4. Get the top 10 elements of this list. Ignore the first element as it refers to self (the movie most similar to a particular movie is the movie itself).
5. Return the titles corresponding to the indices of the top elements.

In [65]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return small_movie_MetaData['title'].iloc[movie_indices]

The function above takes a movie title as input and returns the 10 most similar movies. Before calling it, we need a reverse mapping from movie titles to DataFrame positions. In other words, we need a way to find the row of a movie in our metadata DataFrame, given its title.

In [66]:
# Reset index of our main DataFrame and construct reverse mapping 
small_movie_MetaData = small_movie_MetaData.reset_index()
indices = pd.Series(small_movie_MetaData.index, index=small_movie_MetaData['title'])
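One caveat with this mapping, offered as an observation rather than something the notebook addresses: if two movies share a title, looking that title up returns a Series of positions instead of a single integer. A minimal sketch of one way to guard against that, keeping only the first occurrence of each title (an assumption about the desired behaviour, shown on toy titles):

```python
import pandas as pd

titles = pd.Series(["Hamlet", "Heat", "Hamlet"])  # toy titles with a duplicate

indices_raw = pd.Series(titles.index, index=titles)
lookup = indices_raw["Hamlet"]                    # two matches, so this is a Series

# Keep the first position for each title so lookups return one integer.
indices_unique = indices_raw[~indices_raw.index.duplicated(keep="first")]
print(type(lookup).__name__, int(indices_unique["Hamlet"]))
```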

In [67]:
get_recommendations('The Dark Knight', cosine_sim)

8031    The Dark Knight Rises
6218            Batman Begins
6623             The Prestige
2085                Following
4145                 Insomnia
7648                Inception
3381                  Memento
8613             Interstellar
6645              Harsh Times
6902                   Hitman
Name: title, dtype: object

The recommender has recognized other Christopher Nolan movies (due to the extra weight given to the director) and ranked them highest. Eight of the 10 recommendations are Nolan films.

One thing that we notice about our recommendation system is that it recommends movies regardless of ratings and popularity. Let us add a mechanism to remove bad movies and return movies which are popular and have had a good critical response.

In [68]:
def improved_recommendations(title,cosine_sim=cosine_sim):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = small_movie_MetaData.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
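    # Note: weighted_rating reads the global vote_quantile and avg_vote,
    # so the local C is unused and m only filters the candidates.
    qualified['weighted_rating'] = qualified.apply(weighted_rating, axis=1)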
    qualified = qualified.sort_values('weighted_rating', ascending=False).head(10)
    return qualified

In [69]:
improved_recommendations('The Dark Knight',cosine_sim)



Unnamed: 0,title,vote_count,vote_average,year,weighted_rating
7648,Inception,14075,8,2010,7.990247
8613,Interstellar,11187,8,2014,7.987741
6623,The Prestige,4510,8,2006,7.969791
3381,Memento,4168,8,2000,7.967341
8031,The Dark Knight Rises,9263,7,2012,6.990577
6218,Batman Begins,7511,7,2005,6.988394
2839,American Psycho,2128,7,2000,6.959708
4145,Insomnia,1181,6,2002,5.96933
7912,Takers,399,6,2010,5.915913
6902,Hitman,982,5,2007,5.011865


Our content based engine suffers from some severe limitations. It is only capable of suggesting movies which are close to a certain movie. That is, it is not capable of capturing tastes and providing recommendations across genres.

Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who she/he is.

Therefore, in this section, we will use a technique called Collaborative Filtering to make recommendations to Movie Watchers.

# Collaborative Recommender

We will use the scipy library in Python to compute a truncated Singular Value Decomposition (SVD) of the user-movie rating matrix and predict ratings from it.

In [70]:
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
reader = Reader()
ratings = pd.read_csv('ratings_small.csv')
users = ratings['userId'].unique() #list of all users
movies = ratings['movieId'].unique() #list of all movies
print("Number of users", len(users))
print("Number of movies", len(movies))
ratings = ratings.rename(columns={"userId":"userID","movieId": "MovieID", "rating": "Rating"})
print(ratings.head())

Number of users 671
Number of movies 9066
   userID  MovieID  Rating   timestamp
0       1       31     2.5  1260759144
1       1     1029     3.0  1260759179
2       1     1061     3.0  1260759182
3       1     1129     2.0  1260759185
4       1     1172     4.0  1260759205


In this dataset movies are rated on a scale of 5

In [71]:
movies_df = small_movie_MetaData[['id', 'title', 'genres']]
movies_df = movies_df.rename(columns={"id": "MovieID", "title": "Title","genres": "Genre"})
movies_df.head()

Unnamed: 0,MovieID,Title,Genre
0,862,Toy Story,"[animation, comedy, family]"
1,8844,Jumanji,"[adventure, fantasy, family]"
2,15602,Grumpier Old Men,"[romance, comedy]"
3,31357,Waiting to Exhale,"[comedy, drama, romance]"
4,11862,Father of the Bride Part II,[comedy]


Pivot Ratings into Movie Features

To gain a better interpretation of the data, we pivot the dataframe to have userId as rows and movieId as columns, filling the null values with 0.0.

In [72]:
movie_features = ratings.pivot(index = 'userID', columns ='MovieID', values = 'Rating').fillna(0)
movie_features.head()


userID \ MovieID,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
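Most cells in this table are zero because each user rates only a small fraction of the movies. A minimal sketch of the same pivot on toy ratings, which also measures how many cells actually hold a rating (the frame `toy_ratings` and its values are made up):

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "userID":  [1, 1, 2, 3],
    "MovieID": [10, 20, 10, 30],
    "Rating":  [4.0, 3.0, 5.0, 2.5],
})

wide = toy_ratings.pivot(index="userID", columns="MovieID", values="Rating")
density = wide.notna().to_numpy().mean()   # share of user/movie cells with a rating
filled = wide.fillna(0)                    # unrated cells become 0, as in the notebook
print(wide.shape, round(density, 3))
```

Note that after `fillna(0)` an unrated movie is indistinguishable from one rated 0, which is worth keeping in mind when interpreting the predictions.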




Applying Singular Value Decomposition

Here, we will be using the scipy library in Python to implement SVD.

In [74]:
R = movie_features.to_numpy()  # as_matrix() is deprecated and was removed in pandas 1.0
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)


In [75]:
from scipy.sparse.linalg import svds

U, sigma, Vt = svds(R_demeaned, k = 50)
# Note that the sigma returned by svds is a 1-D array of singular values rather than a diagonal matrix.
# Since the predictions below rely on matrix multiplication,
# convert it to diagonal matrix form.
sigma = np.diag(sigma)
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
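The demean, decompose and reconstruct steps can be followed on a tiny matrix. This sketch uses `numpy.linalg.svd` and truncates by hand instead of calling scipy's `svds`, so it is an illustration of the idea rather than the notebook's exact computation; the matrix values are made up.

```python
import numpy as np

# Toy user x movie matrix (0 = unrated), standing in for movie_features.
R = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 5.0, 1.0, 0.0],
              [0.0, 1.0, 5.0, 4.0]])

user_mean = R.mean(axis=1, keepdims=True)
R_demeaned = R - user_mean

# Full SVD, then keep only the k largest singular values
# (numpy returns them in descending order).
U, s, Vt = np.linalg.svd(R_demeaned, full_matrices=False)
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + user_mean

print(approx.round(2))
```

Keeping all singular values reproduces `R` exactly; a smaller `k` forces the model to generalize, and the values it fills in for unrated cells are what get used as predicted ratings.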

In [76]:
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = movie_features.columns)

In [77]:
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations=5):
    
    # Get and sort the user's predictions
    user_row_number = userID - 1 # UserID starts at 1, not 0
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)
    
    # Get the user's data and merge in the movie information.
    user_data = original_ratings_df[original_ratings_df.userID == (userID)]
    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'MovieID', right_on = 'MovieID').
                     sort_values(['Rating'], ascending=False)
                 )

    print ('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print ('Recommending the highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies_df[~movies_df['MovieID'].isin(user_full['MovieID'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'MovieID',
               right_on = 'MovieID').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations

already_rated, predictions = recommend_movies(preds_df,44, movies_df, ratings, 10)

User 44 has already rated 25 movies.
Recommending the highest 10 predicted ratings movies not already rated.


#### User 44 has already rated 25 movies.
#### Below are 10 movies which user 44 has already rated.


In [78]:
already_rated.dropna().head(10)

Unnamed: 0,userID,MovieID,Rating,timestamp,Title,Genre
19,44,780,5.0,858707138,The Passion of Joan of Arc,"[drama, history]"
7,44,62,5.0,858707138,2001: A Space Odyssey,"[sciencefiction, mystery, adventure]"
23,44,805,4.0,858707310,Rosemary's Baby,"[horror, drama, mystery]"
9,44,104,4.0,858707248,Run Lola Run,"[action, drama, thriller]"
10,44,135,4.0,858707310,Dont Look Back,"[documentary, music]"
14,44,628,3.0,858707310,Interview with the Vampire,"[horror, romance]"
22,44,802,3.0,858707310,Lolita,"[drama, romance]"
21,44,788,3.0,858707248,Mrs. Doubtfire,"[comedy, drama, family]"
20,44,786,3.0,858707194,Almost Famous,"[drama, music]"
16,44,648,3.0,858707138,Beauty and the Beast,"[drama, fantasy, romance]"


#### Recommending the highest 10 predicted ratings movies not already rated by user 44.

In [79]:
predictions

Unnamed: 0,MovieID,Title,Genre
4168,608,Men in Black II,"[action, adventure, comedy, sciencefiction]"
2172,1073,Arlington Road,"[drama, thriller, mystery]"
1025,832,M,"[drama, action, thriller, crime]"
3211,708,The Living Daylights,"[action, adventure, thriller]"
1103,653,Nosferatu,"[fantasy, horror]"
5024,79,Hero,"[drama, adventure, action, history]"
923,762,Monty Python and the Holy Grail,"[adventure, comedy, fantasy]"
5437,673,Harry Potter and the Prisoner of Azkaban,"[adventure, fantasy, family]"
6287,647,Final Fantasy VII: Advent Children,"[action, adventure, animation, fantasy]"
6590,86,The Elementary Particles,"[drama, romance]"
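A point worth checking, stated as an observation about the data rather than a confirmed fix: `ratings_small.csv` identifies movies by MovieLens `movieId`, while `movies_df['MovieID']` was renamed from the TMDB `id` column, so the merges in `recommend_movies` may be pairing unrelated movies. A minimal sketch of translating the ids through the link table first, using toy data and a hypothetical helper `add_tmdb_id`:

```python
import pandas as pd

def add_tmdb_id(ratings, links):
    """Attach the TMDB id to each rating via a MovieLens -> TMDB link table."""
    links = links.dropna(subset=["tmdbId"]).astype({"tmdbId": "int64"})
    merged = ratings.merge(links, left_on="MovieID", right_on="movieId", how="inner")
    return merged.drop(columns=["movieId"])

# Toy data; the ids are illustrative.
toy_ratings = pd.DataFrame({"userID": [1, 1], "MovieID": [1, 2], "Rating": [4.0, 3.5]})
toy_links = pd.DataFrame({"movieId": [1, 2, 3], "tmdbId": [862.0, 8844.0, None]})

mapped = add_tmdb_id(toy_ratings, toy_links)
print(mapped)
```

With the `tmdbId` column in place, the pivot and the later merges could be keyed on TMDB ids so that ratings and metadata refer to the same movies.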


In [80]:
improved_recommendations('2001: A Space Odyssey',cosine_sim)



Unnamed: 0,title,vote_count,vote_average,year,weighted_rating
8613,Interstellar,11187,8,2014,7.987741
1029,The Shining,3890,8,1980,7.965037
979,A Clockwork Orange,3432,8,1971,7.960438
995,Full Metal Jacket,2595,7,1987,6.966822
7284,Moon,1831,7,2009,6.953347
8132,Prometheus,5152,6,2012,5.992742
7907,Transformers: Dark of the Moon,3351,6,2011,5.988899
7764,TRON: Legacy,2895,6,2010,5.98718
1497,Armageddon,2540,6,1998,5.985423
1349,Starship Troopers,1584,6,1997,5.976894
