# Recommender systems with different methods

* 12/20/2019
* 12/22/2019
(1)Add some-written code for the content-based recommendation
(2)Split the movie data into train and test, do 5-fold CV on the train data to select the optimal hyperparameters.
(3)Add the evaluation metrics, such as precision@k and recall@k.

There are basically 3 types of recommender systems:-

> *  **Popularity Filtering**- They offer generalized recommendations to every user, based on movie popularity/vote. The System recommends the same movies to users. Since each user is different , this approach is considered to be too simple. The basic idea behind this system is that movies that are more popular and critically acclaimed will have a higher probability of being liked by the average audience.



> *  **Content Based Filtering**- They suggest similar items based on the similarities between the attributes or the properties of the items. This system uses item metadata (data about data), such as genre, director, plot description, actors, etc. for movies to make these recommendations. The general idea behind these recommender systems is that if a person liked a particular item, he or she will also like an item that is similar to it.

> *  **Collaborative Filtering**- This system matches persons with similar interests and provides recommendations based on this matching. Collaborative filters do not require item metadata like its content-based counterparts. This approach uses the memory of previous users interactions to compute users similarities based on items they've interacted (user-based approach) or compute items similarities based on the users that have interacted with them (item-based approach).

In [1]:
import numpy as np
import pandas as pd

# 1. Popularity Filtering

When we have a cold start, or a new user whose information we know nothing about, this is not a bad way to start. As the popularity accounts for the "wisdom of the crowds", it usually provides good recommendations, generally interesting for most people.

The dataset we are going to use is from Kaggle website: TMDB 5000 Movie Dataset, which has Metadata on ~5,000 movies from TMDb. Below is the link: https://www.kaggle.com/tmdb/tmdb-movie-metadata/version/2.

In [2]:
df1 = pd.read_csv('tmdb-movie-metadata/tmdb_5000_credits.csv')
df2 = pd.read_csv('tmdb-movie-metadata/tmdb_5000_movies.csv')

In [3]:
df1.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


The first dataset contains the following features:-

* movie_id - A unique identifier for each movie.
* cast - The name of lead and supporting actors.
* crew - The name of Director, Editor, Composer, Writer etc.

The second dataset has the following features:- 

* budget - The budget in which the movie was made.
* genre - The genre of the movie, Action, Comedy ,Thriller etc.
* homepage - A link to the homepage of the movie.
* id - This is infact the movie_id as in the first dataset.
* keywords - The keywords or tags related to the movie.
* original_language - The language in which the movie was made.
* original_title - The title of the movie before translation or adaptation.
* overview - A brief description of the movie.
* popularity - A numeric quantity specifying the movie popularity.
* production_companies - The production house of the movie.
* production_countries - The country in which it was produced.
* release_date - The date on which it was released.
* revenue - The worldwide revenue generated by the movie.
* runtime - The running time of the movie in minutes.
* status - "Released" or "Rumored".
* tagline - Movie's tagline.
* title - Title of the movie.
* vote_average -  average ratings the movie recieved.
* vote_count - the count of votes recieved.

Let's join the two dataset on the 'id' column


In [4]:
df1.columns = ['id','title_','cast','crew']
df = df2.merge(df1, on='id')

df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,title_,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


Before getting started with this  -
* we need a metric to score or rate movie 
* Calculate the score for every movie 
* Sort the scores and recommend the best rated movie to the users.

We can use the average ratings of the movie as the score but using this won't be fair enough since a movie with 8.9 average rating and only 3 votes cannot be considered better than the movie with 7.8 as as average rating but 40 votes.
So, I'll be using IMDB's weighted rating (wr) which is given as :-

![](https://image.ibb.co/jYWZp9/wr.png)
where,
* v is the number of votes for the movie;
* m is the minimum votes required to be listed in the chart;
* R is the average rating of the movie;
* C is the mean vote across the whole report

We already have v(**vote_count**) and R (**vote_average**) and C can be calculated as 

In [5]:
C = df2['vote_average'].mean()
print("The average rating for all the movies is:", C)

The average rating for all the movies is: 6.092171559442016


So, the mean rating for all the movies is approx 6 on a scale of 10.The next step is to determine an appropriate value for m, the minimum votes required to be listed in the chart. We will use 90th percentile as our cutoff. In other words, for a movie to feature in the charts, it must have more votes than at least 90% of the movies in the list.

In [7]:
m = df['vote_count'].quantile(0.9)
m

1838.4000000000015

In [8]:
# now we can filter out the movies that has enough votes...
df_small = df[df['vote_count'] >= m]
df_small.shape

(481, 23)

In [9]:
# we will define a function, weighted_rating() and define a new feature score, of which we'll calculate 
# the value by applying this function to our DataFrame of qualified movies

def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

In [10]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
df_small['score'] = df_small.apply(weighted_rating, axis=1)

#Sort movies based on score calculated above
df_small.sort_values('score', ascending=False)

#Print the top 10 movies
df_small[['id','title', 'vote_count', 'vote_average', 'score']].head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,id,title,vote_count,vote_average,score
0,19995,Avatar,11800,7.2,7.050669
1,285,Pirates of the Caribbean: At World's End,4500,6.9,6.665696
2,206647,Spectre,4466,6.3,6.239396
3,49026,The Dark Knight Rises,9106,7.6,7.346721
4,49529,John Carter,2124,6.1,6.096368
5,559,Spider-Man 3,3576,5.9,5.96525
6,38757,Tangled,3330,7.4,6.934805
7,99861,Avengers: Age of Ultron,6767,7.3,7.041968
8,767,Harry Potter and the Half-Blood Prince,5293,7.4,7.062856
9,209112,Batman v Superman: Dawn of Justice,7004,5.7,5.781535


***Now something to keep in mind is that these demographic recommender provide a general chart of recommended movies to all the users. They are not sensitive to the interests and tastes of a particular user. This is when we move on to a more refined system- Content Basesd Filtering.***

# 2. Content-based Filtering

This is robust to the cold start problem. As long as the user inputs some information about his likes or dislikes, we can base on those to make recommendations.

In this recommender system the content of the movie (overview, cast, crew, keyword, tagline etc) is used to find its similarity with other movies. Then the movies that are most likely to be similar are recommended.

## 2.1 Movie plot description based recommendation
We will compute pairwise similarity scores for all movies based on their plot descriptions and recommend movies based on that similarity score. The plot description is given in the **overview** feature of our dataset. 
Let's take a look at the data.

In [12]:
df['overview'].head(5)

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
Name: overview, dtype: object

we need to convert each word into vector for processing. A technique called ***Term Frequency-Inverse 
Document Frequency (TF-IDF)*** is very helpful here.
https://en.wikipedia.org/wiki/Tf%E2%80%93idf

### TF-IDF
This technique converts unstructured text into a vector structure, where each word is represented by a position in the vector, and the value measures how relevant a given word is for an article. As all items will be represented in the same Vector Space Model, it is to compute similarity between articles.

Now if you are wondering what is term frequency , it is the relative frequency of a word in a document and is given as
   **(term instances/total instances)**.
Inverse Document Frequency is the relative count of documents containing the term is given as 
   **log(number of documents/documents with term)**
This is done to reduce the importance of words that occur frequently in plot overviews and therefore, their significance in computing the final similarity score.
The overall importance of each word to the documents in which they appear is equal to **TF * IDF**

This will give you a matrix where each ***column*** represents a word in the overview vocabulary (all the words that appear in at least one document) and each ***row*** represents a movie, as before.


In [13]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
df2['overview'] = df2['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df2['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(4803, 20978)

We see that over 20,000 different words were used to describe the 4800 movies in our dataset.

With this matrix in hand, we can now compute a similarity score. There are several candidates for this; such as the euclidean, the Pearson and the [cosine similarity scores](https://en.wikipedia.org/wiki/Cosine_similarity). 

Pearson correlation and cosine similarity are invariant to scaling, i.e. multiplying all elements by a nonzero constant. Pearson correlation is also invariant to adding any constant to all elements. It is also easy to see that Pearson Correlation Coefficient and Cosine Similarity are equivalent when X and Y have means of 0, so we can think of Pearson Correlation Coefficient as demeaned version of Cosine Similarity.

There is no right answer to which score is the best. Different scores work well in different scenarios and it is often a good idea to experiment with different metrics. 

We will be using the cosine similarity to calculate a numeric quantity that denotes the similarity between two movies. We use the cosine similarity score since it is independent of magnitude and is relatively easy and fast to calculate.

Since we have used the TF-IDF vectorizer, calculating the dot product will directly give us the cosine similarity score. Therefore, we will use sklearn's linear_kernel() instead of cosine_similarities() since it is faster.

In [12]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [13]:
cosine_sim.shape

(4803, 4803)

We are going to define a function that takes in a movie title as an input and outputs a list of the 10 most similar movies. Firstly, for this, we need a reverse mapping of movie titles and DataFrame indices. In other words, we need a mechanism to identify the index of a movie in our metadata DataFrame, given its title.

In [14]:
#Construct a reverse map of indices and movie titles
indices = pd.Series(df2.index, index=df2['title']).drop_duplicates()

Steps:
* Get the index of the movie given its title.
* Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the 1st element is its position and 2nd the similarity score.
* Sort the aforementioned list of tuples based on the similarity scores.
* Get the top 10 elements of this list. Ignore the first element as it refers to self (the movie most similar to a particular movie is the movie itself).
* Return the titles corresponding to the indices of the top elements.

In [15]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim = cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwsie similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df2['title'].iloc[movie_indices]

In [16]:
get_recommendations('The Dark Knight Rises')

65                              The Dark Knight
299                              Batman Forever
428                              Batman Returns
1359                                     Batman
3854    Batman: The Dark Knight Returns, Part 2
119                               Batman Begins
2507                                  Slow Burn
9            Batman v Superman: Dawn of Justice
1181                                        JFK
210                              Batman & Robin
Name: title, dtype: object

While our system has done a decent job of finding movies with similar plot descriptions, the quality of recommendations is not that great. "The Dark Knight Rises" returns all Batman movies while it is more likely that the people who liked that movie are more inclined to enjoy other Christopher Nolan movies. This is something that cannot be captured by the present system.

## 2.2 Actors, Genres, Keywords, production companies based recommendation

We are going to build a recommender based on the following metadata: the 3 top actors, the director, related genres and the movie plot keywords.

In [28]:
# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres', 'production_companies']
for feature in features:
    df[feature] = df[feature].apply(literal_eval)

In [29]:
# Get the director's name from the crew feature. If director is not listed, return NaN
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [30]:
# Returns the list top 3 elements or entire list; whichever is more.
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [31]:
# Define new director, cast, genres and keywords features that are in a suitable form.
df['director'] = df['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres','production_companies']
for feature in features:
    df[feature] = df[feature].apply(get_list)

In [32]:
# Print the new features of the first 5 films
df[['title', 'cast', 'director', 'keywords', 'genres','production_companies']].head(5)

Unnamed: 0,title,cast,director,keywords,genres,production_companies
0,Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",James Cameron,"[culture clash, future, space war]","[Action, Adventure, Fantasy]","[Ingenious Film Partners, Twentieth Century Fo..."
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley]",Gore Verbinski,"[ocean, drug abuse, exotic island]","[Adventure, Fantasy, Action]","[Walt Disney Pictures, Jerry Bruckheimer Films..."
2,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",Sam Mendes,"[spy, based on novel, secret agent]","[Action, Adventure, Crime]","[Columbia Pictures, Danjaq, B24]"
3,The Dark Knight Rises,"[Christian Bale, Michael Caine, Gary Oldman]",Christopher Nolan,"[dc comics, crime fighter, terrorist]","[Action, Crime, Drama]","[Legendary Pictures, Warner Bros., DC Entertai..."
4,John Carter,"[Taylor Kitsch, Lynn Collins, Samantha Morton]",Andrew Stanton,"[based on novel, mars, medallion]","[Action, Adventure, Science Fiction]",[Walt Disney Pictures]


In [33]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [34]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres', 'production_companies']

for feature in features:
    df[feature] = df[feature].apply(clean_data)

In [35]:
df[['title', 'cast', 'director', 'keywords', 'genres', 'production_companies']].head(5)

Unnamed: 0,title,cast,director,keywords,genres,production_companies
0,Avatar,"[samworthington, zoesaldana, sigourneyweaver]",jamescameron,"[cultureclash, future, spacewar]","[action, adventure, fantasy]","[ingeniousfilmpartners, twentiethcenturyfoxfil..."
1,Pirates of the Caribbean: At World's End,"[johnnydepp, orlandobloom, keiraknightley]",goreverbinski,"[ocean, drugabuse, exoticisland]","[adventure, fantasy, action]","[waltdisneypictures, jerrybruckheimerfilms, se..."
2,Spectre,"[danielcraig, christophwaltz, léaseydoux]",sammendes,"[spy, basedonnovel, secretagent]","[action, adventure, crime]","[columbiapictures, danjaq, b24]"
3,The Dark Knight Rises,"[christianbale, michaelcaine, garyoldman]",christophernolan,"[dccomics, crimefighter, terrorist]","[action, crime, drama]","[legendarypictures, warnerbros., dcentertainment]"
4,John Carter,"[taylorkitsch, lynncollins, samanthamorton]",andrewstanton,"[basedonnovel, mars, medallion]","[action, adventure, sciencefiction]",[waltdisneypictures]


We are now in a position to create our "metadata soup", which is a string that contains all the metadata that we want to feed to our vectorizer (namely actors, director and keywords).

In [36]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres']) + ' ' + ' '.join(x['production_companies'])

In [37]:
df['soup'] = df.apply(create_soup, axis=1)

In [38]:
df['soup'][2]

'spy basedonnovel secretagent danielcraig christophwaltz léaseydoux sammendes action adventure crime columbiapictures danjaq b24'

The next steps are the same as what we did with our plot description based recommender. One important difference is that we use the CountVectorizer() instead of TF-IDF. This is because we do not want to down-weight the presence of an actor/director if he or she has acted or directed in relatively more movies. It doesn't make much intuitive sense.

In [39]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df['soup'])

In [40]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [41]:
# Reset index of our main DataFrame and construct reverse mapping as before
df2 = df2.reset_index()
indices = pd.Series(df2.index, index=df2['title'])

In [42]:
get_recommendations('The Dark Knight Rises', cosine_sim2)

NameError: name 'get_recommendations' is not defined

In [None]:
get_recommendations('The Hobbit: The Battle of the Five Armies', cosine_sim2)

# 3. **Collaborative Filtering**

Our content based engine suffers from some severe limitations. It is only capable of suggesting movies which are close to a certain movie. That is, it is not capable of capturing tastes and providing recommendations across genres. (No novelty!!!)

Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who she/he is.

Therefore, in this section, we will use a technique called Collaborative Filtering to make recommendations to Movie Watchers. 

*One important thing to keep in mind is that in an approach based purely on collaborative filtering, the similarity is not calculated using factors like the age of users, genre of the movie, or any other data about users or items. It is calculated only on the basis of the rating (explicit or implicit) a user gives to an item. For example, two users can be considered similar if they give the same ratings to ten movies despite there being a big difference in their age.

It is basically of two types: neighborhood based and method based.

In [14]:
from surprise import Reader, Dataset
from surprise.model_selection import cross_validate

In this project, I will build a movie recommendation systems based on the data "MovieLens" provided by GroupLens at: https://grouplens.org/datasets/movielens/

The datasets it provides has several versions with different data sizes. We start with the smallest one, which has 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users.

Let's first import the dataset and combine them.

In [15]:
movie_info = pd.read_csv("ml-latest-small/movies.csv")
print(movie_info.shape)

(9742, 3)


In [16]:
# Load the movielens-100k dataset (download it if needed).
ratings = pd.read_csv("ml-latest-small/ratings.csv")
print(ratings.shape)

(100836, 4)


In [17]:
ratings = ratings.merge(movie_info, on = "movieId")

In [18]:
# extract the unique movie ids
movie_IDs = list(ratings['movieId'].unique())
movie_IDs.sort()

In [19]:
ratings.head(10)

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
5,18,1,3.5,1455209816,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
6,19,1,4.0,965705637,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
7,21,1,3.5,1407618878,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
8,27,1,3.0,962685262,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
9,31,1,5.0,850466616,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [390]:
from collections import defaultdict

def precision_recall_at_k(predictions, k = 10, threshold = 3.5):
    """
    This is to calculate the performance of the recommendation system using precision/recall at the top K recommendations.
    We go through each user, look at his ratings for items, pick a rating threshold (like 3.5), to divide the items
    he/she rates as relevant (rating > 3.5), and not relevant(rating < 3.5).
    
    # Relevant items are already known in the data set
        Relevant item: Has a True/Actual rating >= 3.5
        Irrelevant item: Has a True/Actual rating < 3.5
    
    
    The recomemndations we make to him can also be splitted into 2 groups:
    # Recommended items are generated by recommendation algorithm
        Recommended item: has a predicted rating >= 3.5
        Not recommended item: Has a predicted rating < 3.5
        
    Precision@k = (# of recommended items @k that are relevant) / (# of recommended items @k)
    
    Recall@k = (# of recommended items @k that are relevant) / (total # of relevant items)

    predictions: a list of predictions made for each user + each movie.
    
    """

    users = ratings_test['userId'].unique().tolist()
    
    print("there are {} unique users in the testing data".format(len(users)))
    
    
    # retrive the results and assign them to each user
    
    user_ratings = defaultdict(list)
    
    
    for uid, iid, true_rating, est_rating, _ in predictions:
        user_ratings[uid].append( (est_rating, true_rating) )
        
        precisions = dict()
        recalls = dict()
        
        for uid, ratings in user_ratings.items():
            
            # sort user ratings by estimated ratings, from high to low
            ratings = sorted(ratings, key = lambda x: x[0], reverse = True)
            
            # Number of relevant items
            n_rel = sum((true_r >= threshold) for (_, true_r) in ratings)
            
            # Number of recommended items in top k
            n_rec_k = sum((est >= threshold) for (est, _) in ratings[:k])
            
            # Number of relevant and recommended items in top k
            n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold)) for (est, true_r) in ratings[:k])
            
            # Precision@K: Proportion of recommended items that are relevant
            precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1

            # Recall@K: Proportion of relevant items that are recommended
            recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1
            
    print("Precision@{} = {}, Recall@{} = {}".format(k, np.mean(list(precisions.values())), k, np.mean(list(recalls.values()))))

    return precisions, recalls

## 3.1 Memory or neighborhood based methods

https://realpython.com/build-recommendation-engine-collaborative-filtering/


The two approaches are mathematically quite similar, but there is a conceptual difference between the two. Here’s how the two compare:

* **User-based**: For a user U, with a set of similar users determined based on rating vectors consisting of given item ratings, the rating for an item I, which hasn’t been rated, is found by picking out N users from the similarity list who have rated the item I and calculating the rating based on these N ratings.

* **Item-based**: For an item I, with a set of similar items determined based on rating vectors consisting of received user ratings, the rating by a user U, who hasn’t rated it, is found by picking out N items from the similarity list that have been rated by U and calculating the rating based on these N ratings.

Item-based collaborative filtering was developed by Amazon. In a system where there are more users than items, item-based filtering is faster and more stable than user-based. It is effective because usually, the average rating received by an item doesn’t change as quickly as the average rating given by a user to different items. It’s also known to perform better than the user-based approach when the ratings matrix is sparse.

*  **User based filtering**-  These systems recommend products to a user that similar users have liked. 

To find the rating R that a user U would give to an item I, the approach includes:

* Finding users similar to U who have rated the item I
* Calculating the rating R based the ratings of users found in the previous step

For measuring the similarity between two users we can either use pearson correlation or cosine similarity.
This filtering technique can be illustrated with an example. In the following matrixes, each row represents a user, while the columns correspond to different movies except the last one which records the similarity between that user and the target user. Each cell represents the rating that the user gives to that movie. 

Although computing user-based CF is very simple, it suffers from several problems. One main issue is that users’ preference can change over time. It indicates that precomputing the matrix based on their neighboring users may lead to bad performance. To tackle this problem, we can apply item-based CF.

* **Item Based Collaborative Filtering** - Instead of measuring the similarity between users, the item-based CF recommends items based on their similarity with the items that the target user rated. Likewise, the similarity can be computed with Pearson Correlation or Cosine Similarity. The major difference is that, with item-based collaborative filtering, we fill in the blank vertically, as oppose to the horizontal manner that user-based CF does. 

It successfully avoids the problem posed by dynamic user preference as item-based CF is more static. However, several problems remain for this method. First, the main issue is ***scalability***. The computation grows with both the customer and the product. The worst case complexity is O(mn) with m users and n items. In addition, ***sparsity*** is another concern. Take a look at the above table again. Although there is only one user that rated both Matrix and Titanic rated, the similarity between them is 1. In extreme cases, we can have millions of users and the similarity between two fairly different movies could be very high simply because they have similar rank for the only user who ranked them both.

### 3.1.1 Using the package Surprise

In [397]:
from surprise import KNNWithMeans

In [401]:
ratings_train, ratings_test = train_test_split(ratings[['userId', 'movieId', 'rating']],
                                              stratify = ratings['userId'],
                                               test_size = 0.1,
                                               random_state = 23)

print("Shape of training data:", ratings_train.shape)
print("Shape of testing data:", ratings_test.shape)

ratings_train.reset_index(drop = True)
ratings_test.reset_index(drop = True)

Shape of training data: (90752, 3)
Shape of testing data: (10084, 3)


Unnamed: 0,userId,movieId,rating
0,68,89774,3.0
1,219,3248,2.0
2,239,1673,3.0
3,221,5225,4.0
4,260,1094,2.5
...,...,...,...
10079,446,165,3.0
10080,249,4776,4.0
10081,448,3174,3.0
10082,580,5608,4.0


In [402]:
reader = Reader()
train_data = Dataset.load_from_df(ratings_train, reader)
train_data

<surprise.dataset.DatasetAutoFolds at 0x1a37511290>

In [398]:
# To use item-based cosine similarity
sim_options = {
    "name": "pearson",
    "user_based": True,  # Compute similarities between items
}
algo = KNNWithMeans(sim_options=sim_options)

In [403]:
# Run 5-fold cross-validation and print results.
cross_validate(algo, train_data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8988  0.9130  0.8915  0.9065  0.9064  0.9032  0.0074  
MAE (testset)     0.6866  0.6939  0.6834  0.6880  0.6909  0.6886  0.0036  
Fit time          0.36    0.39    0.38    0.39    0.38    0.38    0.01    
Test time         1.32    1.15    1.16    1.15    1.17    1.19    0.07    


{'test_rmse': array([0.89882983, 0.91299248, 0.89148067, 0.90654902, 0.90638875]),
 'test_mae': array([0.6865713 , 0.6938894 , 0.68339325, 0.68803527, 0.69092609]),
 'fit_time': (0.36472511291503906,
  0.3905017375946045,
  0.3784811496734619,
  0.39151716232299805,
  0.3770749568939209),
 'test_time': (1.323214054107666,
  1.1520802974700928,
  1.1634440422058105,
  1.1543197631835938,
  1.1668930053710938)}

In [406]:
from surprise.model_selection import GridSearchCV


sim_options = {
    "name": ["msd", "cosine", "pearson"],
    "min_support": [3, 4, 5],
    "user_based": [False],
}

param_grid = {"sim_options": sim_options}

gs = GridSearchCV(KNNWithMeans, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(train_data)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

# 0.9046721359174695
# {'sim_options': {'name': 'msd', 'min_support': 3, 'user_based': True}}

# 0.9146750351637483
# {'sim_options': {'name': 'msd', 'min_support': 3, 'user_based': False}}

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix.

* user based model

In [407]:
trainset = train_data.build_full_trainset()

sim_options = {'name': 'msd', 
               'min_support': 3, 
               'user_based': True}

algo = KNNWithMeans(sim_options=sim_options)

algo.fit(trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1a373b6b90>

In [408]:

predictions = algo.test(testset)

len(predictions)

10084

In [410]:
precisions, recalls = precision_recall_at_k(predictions, k = 10, threshold = 3.5)

there are 610 unique users in the testing data
Precision@10 = 0.889403511542815, Recall@10 = 0.705329901635628


In [411]:
np.mean([p for p in list(precisions.values()) if p!=1.0])

0.7036014109347443

In [412]:
np.mean([p for p in list(recalls.values()) if p!=1.0])

0.45158620582186315

* Item-based model

In [413]:
trainset = train_data.build_full_trainset()

sim_options = {'name': 'msd', 
               'min_support': 3, 
               'user_based': False}

algo = KNNWithMeans(sim_options=sim_options)

algo.fit(trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1a37401b90>

In [414]:

predictions = algo.test(testset)

len(predictions)

10084

In [415]:
precisions, recalls = precision_recall_at_k(predictions, k = 10, threshold = 3.5)

there are 610 unique users in the testing data
Precision@10 = 0.8820053173287002, Recall@10 = 0.715719989433332


In [416]:
np.mean([p for p in list(precisions.values()) if p!=1.0])

0.6675196558374129

In [417]:
np.mean([p for p in list(recalls.values()) if p!=1.0])

0.4821122466111758

# 3.1.2 Self-written functions

In [223]:
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
import sklearn.preprocessing as pp

import heapq

def cosine_similarities(mat):
    '''
    calculate the cosine similarities between columns of the sparse matrix
    
    '''
    col_normed_mat = pp.normalize(mat.tocsc(), axis=0)
    return col_normed_mat.T * col_normed_mat

In [224]:
ratings[['userId', 'movieId', 'rating']].head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,5,1,4.0
2,7,1,4.5
3,15,1,2.5
4,17,1,4.5


In [225]:
# map the movieId to index

movieId_to_index = pd.Series(ratings.index, index = ratings['movieId'])

In [226]:
rates = ratings[['userId', 'movieId', 'rating']].copy()

#Creating a sparse pivot table with users in rows and items in columns
users_items_pivot_matrix_df = rates.pivot(index='userId', 
                                        columns='movieId', 
                                        values='rating').fillna(0)

users_items_pivot_matrix_df.tail(10)

# print(users_items_pivot_matrix_df.shape)
# (610, 9724)a

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
601,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
602,0.0,4.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
603,4.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
604,3.0,5.0,0.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
605,4.0,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
606,2.5,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,2.5,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
609,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
610,5.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [227]:
users_items_pivot_sparse_matrix = csr_matrix(users_items_pivot_matrix_df.values)

In [228]:
simi_matrix = cosine_similarities(users_items_pivot_sparse_matrix)
simi_matrix 

<9724x9724 sparse matrix of type '<class 'numpy.float64'>'
	with 26325068 stored elements in Compressed Sparse Row format>

In [229]:
print(simi_matrix)

  (0, 9372)	0.08507019059648767
  (0, 9371)	0.08507019059648767
  (0, 9324)	0.08507019059648767
  (0, 9312)	0.08507019059648767
  (0, 9307)	0.08507019059648767
  (0, 9274)	0.08507019059648767
  (0, 9225)	0.06922446149029846
  (0, 9213)	0.08507019059648767
  (0, 9157)	0.08507019059648767
  (0, 9141)	0.047188451416789345
  (0, 9138)	0.08507019059648767
  (0, 9136)	0.08507019059648767
  (0, 9135)	0.08507019059648767
  (0, 9109)	0.08507019059648767
  (0, 9103)	0.08507019059648767
  (0, 9093)	0.0601537086476085
  (0, 9059)	0.08507019059648767
  (0, 9047)	0.06922446149029846
  (0, 9045)	0.08507019059648767
  (0, 9026)	0.08507019059648767
  (0, 9017)	0.05601911250004021
  (0, 9016)	0.08507019059648767
  (0, 8995)	0.08507019059648767
  (0, 8978)	0.08507019059648767
  (0, 8974)	0.08507019059648767
  :	:
  (9723, 2912)	0.18139890032691966
  (9723, 2903)	0.12531464857602115
  (9723, 2884)	0.13867504905630726
  (9723, 2832)	0.05708417215969288
  (9723, 2803)	0.1377219446414666
  (9723, 2802)	0.124

In [201]:
# user based, the similarities between rows.

# find the users who have rated item iid, and find the K nearest neighbors among these users with uid, use weighted average 
# of their ratings for this item as the estimate for the rating of user uid for the iten iid. The weights are the
# similarities numbers between the users.


# item based, the similarities between columns.

# find the items that user uid has rated, and find the K nearest neighbors among these items with iid, use weighted average 
# of their ratings for this item as the estimate for the rating of user uid for the iten iid. The weights are the
# similarities numbers between the items.

In [230]:
def get_ratings(uid, iid, ratings, users_items_pivot_sparse_matrix, K = 10, user_based = True):
    
    num_rows, num_cols = ratings.shape 
    
    if user_based:
        
        # mapping from the index in the sparse matrix to the real id of user.
        uid_to_index = dict()

        for index, user_id in enumerate(users_items_pivot_matrix_df.T.columns.values):
            uid_to_index[user_id] = index
        
        # calcluate the similarities between the rows.
        simi_matrix = cosine_similarities(users_items_pivot_sparse_matrix.T)
        
        # collect the users who have rates item iid
        
        users = (ratings[(ratings["rating"] != 0.0) & (ratings["movieId"] == iid)]["userId"]).tolist()
        
        # collect the similarities between these users
        
        similarity = []
        
        for u in users:
            similarity.append( (simi_matrix[(uid_to_index[uid], uid_to_index[u])], u) )  # tuple of (similarity, userId)
        
        # find the K nearest neighbors
        heapq.heapify(similarity)
        
        kNN = heapq.nlargest(K, similarity)  
        print(kNN)
        
        # find the weighted ratings of a movie
        
        numer = 0
        denom = 0
        
        for sim, u in kNN:
#             print(sim, u, ratings[(ratings["userId"] == u) & (ratings["movieId"] == iid)]["rating"])
            if u != uid:
                numer += sim * ratings[(ratings["userId"] == u) & (ratings["movieId"] == iid)]["rating"].values[0]
                denom += sim
            
        return numer/denom
        
        
    else:
        
        # mapping from the index in the sparse matrix to the real id of item.
        iid_to_index = dict()

        for index, movie_id in enumerate(users_items_pivot_matrix_df.columns.values):
            iid_to_index[movie_id] = index
        
        
        # calcluate the similarities between the columns.
        simi_matrix = cosine_similarities(users_items_pivot_sparse_matrix)
        
        # collect the items that user uid has rated
        
        movies = (ratings[(ratings["rating"] != 0.0) & (ratings["userId"] == uid)]["movieId"]).tolist()
        
        # collect the similarities between these users
        
        similarity = []
        
        for m in movies:
            similarity.append( (simi_matrix[(iid_to_index[iid], iid_to_index[m])], m) )  # tuple of (similarity, movieId)
        
        # find the K nearest neighbors
        heapq.heapify(similarity)
        
        kNN = heapq.nlargest(K, similarity)  
        print(kNN)
        
        # find the weighted ratings of a movie
        
        numer = 0
        denom = 0
        
        for sim, m in kNN:
#             print(sim, u, ratings[(ratings["userId"] == u) & (ratings["movieId"] == iid)]["rating"])
            if m != iid:
                numer += sim * ratings[(ratings["userId"] == uid) & (ratings["movieId"] == m)]["rating"].values[0]
                denom += sim
            
        return numer/denom
        
    
    

In [234]:
uid = 100
iid = 2000

In [235]:
get_ratings(uid, iid, ratings, users_items_pivot_sparse_matrix, K = 10, user_based = True)

[(0.2539580703325401, 480), (0.2497813234242547, 597), (0.24338055016326507, 42), (0.23419675500232737, 144), (0.2329483970492169, 68), (0.22643856760973166, 414), (0.22392341964626722, 489), (0.22170029468930014, 590), (0.2184831190290502, 57), (0.20493851741430238, 288)]


3.7653267509211066

In [236]:
get_ratings(uid, iid, ratings, users_items_pivot_sparse_matrix, K = 10, user_based = False)

[(0.560926699240374, 2716), (0.5418191364362658, 2406), (0.5086121171043441, 1580), (0.5053491684617659, 1968), (0.4918556463755713, 1265), (0.48366076647295114, 1270), (0.4793651026636947, 2028), (0.4554690898443342, 1220), (0.45433598434594685, 1101), (0.4404134004455199, 592)]


3.944156752214165

## 3.2 Model based methods (matrix factorisation)

One way to handle the scalability and sparsity issue created by CF is to leverage a **latent factor model** to capture the similarity between users and items. Essentially, we want to turn the recommendation problem into an optimization problem. We can view it as how good we are in predicting the rating for items given a user. One common metric is Root Mean Square Error (RMSE). **The lower the RMSE, the better the performance**.

Now talking about latent factor you might be wondering what is it ?It is a broad idea which describes a property or concept that a user or an item have. For instance, for music, latent factor can refer to the genre that the music belongs to. SVD decreases the dimension of the utility matrix by extracting its latent factors. Essentially, we map each user and each item into a latent space with dimension r. 

The number of latent factors affects the recommendations in a manner where the greater the number of factors, the more personalized the recommendations become. But too many factors can lead to overfitting in the model.



Therefore, it helps us better understand the relationship between users and items as they become directly comparable. The below figure illustrates this idea.

In [297]:
from sklearn.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

from surprise import Reader, Dataset
from surprise.model_selection import cross_validate

In [298]:
ratings_train, ratings_test = train_test_split(ratings[['userId', 'movieId', 'rating']],
                                              stratify = ratings['userId'],
                                               test_size = 0.1,
                                               random_state = 23)

print("Shape of training data:", ratings_train.shape)
print("Shape of testing data:", ratings_test.shape)

ratings_train.reset_index(drop = True)
ratings_test.reset_index(drop = True)


Shape of training data: (90752, 3)
Shape of testing data: (10084, 3)


Unnamed: 0,userId,movieId,rating
0,68,89774,3.0
1,219,3248,2.0
2,239,1673,3.0
3,221,5225,4.0
4,260,1094,2.5
...,...,...,...
10079,446,165,3.0
10080,249,4776,4.0
10081,448,3174,3.0
10082,580,5608,4.0


In [299]:
def recommend_for_user(uid, algo, movie_IDs, raw_ratings, N = 10):
    """
    uid: the user id whom we want to recommend movie to
    algo: the algorithm we have trained to make recommendations
    movie_IDs: the pool of movie IDs that we select
    raw_ratings: we want to make sure we don't want to recommend the same movies the user has watched.
    N: how many we want to recommend
    """
    rates = []

    for m_id in movie_IDs:
        pred = algo.predict(uid, m_id, verbose=False)
        rates.append( (pred[3], m_id) )   # append a tuple (predicted rating, movie ID)

    # sort by the predicted rating by the algorithm
    rates = sorted(rates, key = lambda x: x[0], reverse = True)

    # movies that the user has rated/watched
    watched = set(ratings[ratings['userId'] == uid]['movieId'])

    # recommend 10 movies from the predicted rating score, among which the user hasn't rated before.
    recommend = []
    count = 0
    for rate, m_id in rates:
        if m_id not in watched:
            recommend.append( (rate, m_id, raw_ratings[raw_ratings['movieId'] == m_id]['title'].values[0]) )
            count += 1

            if count == N:
                break

    return recommend

### 3.2.1 SVD solved by stochastic gradient descent

A good website for explaining how SVD works in this case.
https://sifter.org/~simon/journal/20061211.html

In [300]:
from surprise import SVD

In [301]:
# load in the training data and do CV to select the hyperparameters.

In [311]:
reader = Reader()
train_data = Dataset.load_from_df(ratings_train, reader)
train_data

<surprise.dataset.DatasetAutoFolds at 0x1a374c0c10>

Conduct 5-fold CV to search for the optimal hyperparameters 

In [304]:
param_grid = {'lr_all': [0.002, 0.01],    # learning rate
              'reg_all': [0.02, 0.1, 0.2]}       # regularization coefficient

gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv = 5)

gs.fit(train_data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

# 0.8665810922463116
# {'lr_all': 0.01, 'reg_all': 0.1}


# best RMSE score
print(gs.best_score['mae'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['mae'])

# 0.6654614848689339
# {'lr_all': 0.01, 'reg_all': 0.1}

0.8665810922463116
{'lr_all': 0.01, 'reg_all': 0.1}


In [306]:
svd = SVD(lr_all = 0.01,  reg_all = 0.1)

# Run 5-fold cross-validation and print results.
# cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

We get a mean Root Mean Sqaure Error of 0.87 approx which is more than good enough for our case. Let us now train the model again using all the data and arrive at predictions.

In [312]:
trainset = train_data.build_full_trainset()

svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1a3739b150>

In [314]:
# try the algorithm on one user and one item

uid = 30  # raw user id (as in the ratings file). 
iid = 1036  # raw item id (as in the ratings file). 

# get a prediction for specific users and items.
pred = svd.predict(uid, iid, 
                   r_ui = ratings[(ratings['userId'] == uid) & (ratings['movieId'] == iid)]['rating'].values[0], 
                   verbose=True)

pred[3]

user: 30         item: 1036       r_ui = 4.00   est = 4.51   {'was_impossible': False}


4.514769801737049

In [315]:
# given a user ID, we score ratings for all the movies and recommend the top 10 for the user.

uid = 2000  # raw user id (as in the ratings file). 

recommend_for_user(uid, svd, movie_IDs, ratings, N = 10)

[(4.439986897667283, 1248, 'Touch of Evil (1958)'),
 (4.421933929731393,
  177593,
  'Three Billboards Outside Ebbing, Missouri (2017)'),
 (4.390415921070395, 3451, "Guess Who's Coming to Dinner (1967)"),
 (4.35390116741904, 3468, 'Hustler, The (1961)'),
 (4.353805777575018, 1104, 'Streetcar Named Desire, A (1951)'),
 (4.335382568817432, 1178, 'Paths of Glory (1957)'),
 (4.313632975879529, 1223, 'Grand Day Out with Wallace and Gromit, A (1989)'),
 (4.305299123451166, 3030, 'Yojimbo (1961)'),
 (4.291315953355015, 1237, 'Seventh Seal, The (Sjunde inseglet, Det) (1957)'),
 (4.2706635017629795, 1217, 'Ran (1985)')]

One of the approaches to measure the accuracy of your result is the Root Mean Square Error (RMSE), in which you predict ratings for a test dataset of user-item pairs whose rating values are already known. The difference between the known value and the predicted value would be the error.

### Evaluation of performance

We also want to use other metrics, such as precision @k and recall@k to evaluate the performance of the methods.

* https://medium.com/@m_n_malaeb/recall-and-precision-at-k-for-recommender-systems-618483226c54

* http://sdsawtelle.github.io/blog/output/mean-average-precision-MAP-for-recommender-systems.html

In [313]:
predictions = svd.test(testset)

len(predictions)

10084

In [391]:
precisions, recalls = precision_recall_at_k(predictions, k = 5, threshold = 3.5)

there are 610 unique users in the testing data
Precision@5 = 0.8927584300718627, Recall@5 = 0.6063441640570226


In [392]:
np.mean([p for p in list(precisions.values()) if p!=1.0])

0.5983436853002071

In [393]:
np.mean([p for p in list(recalls.values()) if p!=1.0])

0.37035949847847377

### 3.2.2 Non-negative Matrix Factorization

In [170]:
from surprise import NMF
nmf = NMF(n_factors = 10)

In [171]:
# Run 5-fold cross-validation and print results.
cross_validate(nmf, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9329  0.9354  0.9418  0.9364  0.9414  0.9376  0.0035  
MAE (testset)     0.7280  0.7286  0.7344  0.7288  0.7319  0.7303  0.0024  
Fit time          4.62    4.64    4.67    4.61    4.62    4.63    0.02    
Test time         0.11    0.11    0.11    0.12    0.11    0.11    0.00    


{'test_rmse': array([0.93292034, 0.93541309, 0.94176827, 0.93638589, 0.94137109]),
 'test_mae': array([0.72802939, 0.72859693, 0.73438118, 0.72879147, 0.73187805]),
 'fit_time': (4.62404990196228,
  4.635416030883789,
  4.6712141036987305,
  4.6081321239471436,
  4.6219258308410645),
 'test_time': (0.11307907104492188,
  0.11388301849365234,
  0.11384701728820801,
  0.11981511116027832,
  0.11321330070495605)}

We get a mean Root Mean Sqaure Error of 0.92 approx which is more than good enough for our case. Let us now train the model again using all the data and arrive at predictions.

In [122]:
trainset = data.build_full_trainset()
nmf.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.NMF at 0x128feb690>

In [123]:
ratings[ratings['userId'] == 30]

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
863,30,110,5.0,1500370456,Braveheart (1995),Action|Drama|War
1580,30,260,5.0,1500370339,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
6788,30,1196,5.0,1500370341,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi
7145,30,1198,5.0,1500370343,Raiders of the Lost Ark (Indiana Jones and the...,Action|Adventure
7567,30,1210,5.0,1500370347,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Sci-Fi
8327,30,1240,3.5,1500370449,"Terminator, The (1984)",Action|Sci-Fi|Thriller
9072,30,1291,5.0,1500370351,Indiana Jones and the Last Crusade (1989),Action|Adventure
12655,30,2571,5.0,1500370345,"Matrix, The (1999)",Action|Sci-Fi|Thriller
16310,30,318,5.0,1500370344,"Shawshank Redemption, The (1994)",Crime|Drama
17075,30,58559,5.0,1500370398,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX


In [124]:
# try the algorithm on one user and one item

uid = 30  # raw user id (as in the ratings file). 
iid = 1036  # raw item id (as in the ratings file). 

# get a prediction for specific users and items.
pred = nmf.predict(uid, iid, 
                   r_ui = ratings[(ratings['userId'] == uid) & (ratings['movieId'] == iid)]['rating'].values[0], 
                   verbose=True)

pred[3]

user: 30         item: 1036       r_ui = 4.00   est = 4.52   {'was_impossible': False}


4.521038059500823

In [133]:
# given a user ID, we score ratings for all the movies and recommend the top 10 for the user.

uid = 239  # raw user id (as in the ratings file). 

recommend_for_user(uid, nmf, movie_IDs, ratings, N = 10)

[(5, 99, 'Heidi Fleiss: Hollywood Madam (1995)'),
 (5, 213, 'Burnt by the Sun (Utomlyonnye solntsem) (1994)'),
 (5, 391, "Jason's Lyric (1994)"),
 (5, 942, 'Laura (1944)'),
 (5, 945, 'Top Hat (1935)'),
 (5, 951, 'His Girl Friday (1940)'),
 (5, 1041, 'Secrets & Lies (1996)'),
 (5,
  1201,
  'Good, the Bad and the Ugly, The (Buono, il brutto, il cattivo, Il) (1966)'),
 (5, 1223, 'Grand Day Out with Wallace and Gromit, A (1989)'),
 (5, 1236, 'Trust (1990)')]