# Recommender Systems

### Learning Objectives:
- [Introduction: Simple Recommender Systems](#Introduction:-Simple-Recommeder-Systems)
- [Offline & Online Evaluation](#Offline-&-Online-Evaluation)
- [Content-based Recommenders](#Content\-based-Recommenders)
- [Collaborative-filtering](#Collaborative\-filtering)
- [Hybrid Systems](#Hybrid-Systems)


# Introduction: Simple Recommender Systems

__Recommender systems__, also referred to as __recommendation systems__, are filtering systems used by many different companies world-wide to be able to recommend products (e.g. movies, clothes, etc) based on user preferences. Unlike __ranking algorithms,__ recommender systems aim to provide recommendations without an explicit input from the user (such as a search query). We obviously cannot recommend _exactly_ what a user wants as we cannot access or process all the information in their brain at the same time. Instead, we can take advantage or users' past ratings, choices and preferences to __predict__ the products the user will most probably like.

How do these systems do what they do? This is question that has become a large topic of research and the current answer is that there are mutliple ways to create recommender systems: each working under different assumptions and algorithms. There are two main broad classifications that we will cover shortly: __content-based recommendation__ (item-centred) and __collaborative filtering__ (user-centred).

<img width="500px;" height="500px" src="https://www.researchgate.net/profile/Lionel_Ngoupeyou_Tondji/publication/323726564/figure/fig5/AS:631605009846299@1527597777415/Content-based-filtering-vs-Collaborative-filtering-Source.png">

Before we cover the implementation of these, we will cover what is informally referred to as a simple recommender system: a system that uses the weighted average rating from all users to make recommendations on the "best" options. This is also referred to as a type of __demographic recommender system.__ Throughout this notebook, we will use the "Movies Dataset" from [Kaggle](https://www.kaggle.com/rounakbanik/the-movies-dataset), where the full version contains information on over 45,000 movies with 26 million ratings from  270,000 users. We will be using the small version, as shown below:

In [33]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go

In [34]:
# Importing ratings
ratings = pd.read_csv("../DATA/ratings_small.csv")
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [35]:
# Importing movie metadata
metadata = pd.read_csv("../DATA/movies_metadata.csv")
# metadata.head()
metadata.nlargest(n=10, columns=["vote_average"])

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
186,False,,0,"[{'id': 14, 'name': 'Fantasy'}, {'id': 35, 'na...",,58372,tt0114241,en,Reckless,"On Christmas eve, a relentlessly cheerful woma...",...,1995-11-17,0.0,91.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The most twisted Christmas ever.,Reckless,False,10.0,1.0
394,False,,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 12, ...",,278939,tt0113173,en,Girl in the Cadillac,A runaway teenage girl and a drifter rob a ban...,...,1995-10-24,0.0,89.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,On our first date... we robbed a bank,Girl in the Cadillac,False,10.0,1.0
706,False,,0,"[{'id': 99, 'name': 'Documentary'}]",,73183,tt0113270,en,"The Haunted World of Edward D. Wood, Jr.",The strange life and the wonderfully awful fil...,...,1996-05-01,0.0,112.0,[],Released,Can your MIND stand the SHOCKING TRUTH?,"The Haunted World of Edward D. Wood, Jr.",False,10.0,1.0
738,False,,0,"[{'id': 99, 'name': 'Documentary'}]",,255546,tt0109381,pt,Carmen Miranda: Bananas Is My Business,A biography of the Portuguese-Brazilian singer...,...,1995-04-13,0.0,91.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,,Carmen Miranda: Bananas Is My Business,False,10.0,1.0
1634,False,,0,"[{'id': 18, 'name': 'Drama'}]",,64562,tt0119845,en,Other Voices Other Rooms,Truman Capote's semi-autobiographical first no...,...,1995-09-15,0.0,0.0,[],Released,,Other Voices Other Rooms,False,10.0,1.0
1761,False,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,78373,tt0118925,en,"Dancer, Texas Pop. 81","Four guys, best friends, have grown up togethe...",...,1998-05-01,565592.0,97.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,in the middle of nowhere they had everything,"Dancer, Texas Pop. 81",False,10.0,1.0
2114,False,,0,"[{'id': 18, 'name': 'Drama'}]",http://www.thefarmerswifefilm.co.uk/,143750,tt2140519,en,The Farmer's Wife,"As her surroundings are invaded by outsiders, ...",...,2012-06-20,0.0,18.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,The Farmer's Wife,False,10.0,1.0
2653,False,,0,"[{'id': 35, 'name': 'Comedy'}]",,89861,tt0120210,en,Stiff Upper Lips,Stiff Upper Lips is a broad parody of British ...,...,1998-06-12,0.0,99.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Stiff Upper Lips,False,10.0,1.0
2948,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,124853,tt0114008,en,Ten Benny,The story of a young shoe salesman whose overe...,...,1995-04-09,0.0,108.0,[],Released,,Ten Benny,False,10.0,1.0
3160,False,,0,"[{'id': 99, 'name': 'Documentary'}]",,49477,tt0192069,en,Gendernauts: A Journey Through Shifting Identi...,Monika Treut explores the worlds and thoughts ...,...,1999-06-12,0.0,87.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Gendernauts: A Journey Through Shifting Identi...,False,10.0,2.0


For our simple recommender system, we will use the IMDB's known __weighted average formula__ used for their Top Movies Chart, given as follows:

$$ R_{W} = (\frac{v}{v + m})R + (\frac{m}{v + m})C  $$

Where:
- $R_{W}$ is the weighted average movie rating
- $v$ is the number of votes for that movie title
- $m$ is the minimum number of votes required to be in the top Chart
- $R$ is the average rating of that movie title
- $C$ is the mean vote rating across all movies

We can now begin our calculations to construct our simple recommender:

In [36]:
# Computing mean vote count across all movies
vote_counts = metadata[metadata['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = metadata[metadata['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C

5.244896612406511

We must now choose a value for the minimum number. In this case, we will choose a value $m$ that gives us movies that have received more votes than 95% of the other remaining movies.

In [37]:
# Computing minimum number of votes required
m = vote_counts.quantile(0.95)
print(m)

434.0


We can now extract the movies that are considered to be canditates for the top charts in our recommender system given our computed 'm'.

In [38]:
# Extracting all movies that have a votecount that is greater than our m value
qualified = metadata[(metadata['vote_count'] >= m) & (metadata['vote_count'].notnull()) & (metadata['vote_average'].notnull())] \
                 [['title', 'release_date', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(2274, 6)

In [39]:
# Computing weighted average and determining top 250 chart
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

qualified['weighted_average'] = qualified.apply(weighted_rating, axis=1)
qualified = qualified.sort_values('weighted_average', ascending=False).head(250)
qualified

Unnamed: 0,title,release_date,vote_count,vote_average,popularity,genres,weighted_average
15480,Inception,2010-07-14,14075,8,29.1081,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",7.917588
12481,The Dark Knight,2008-07-16,12269,8,123.167,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",7.905871
22879,Interstellar,2014-11-05,11187,8,32.2135,"[{'id': 12, 'name': 'Adventure'}, {'id': 18, '...",7.897107
2843,Fight Club,1999-10-15,9678,8,63.8696,"[{'id': 18, 'name': 'Drama'}]",7.881753
4863,The Lord of the Rings: The Fellowship of the Ring,2001-12-18,8892,8,32.0707,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",7.871787
...,...,...,...,...,...,...,...
2006,Indiana Jones and the Temple of Doom,1984-05-23,2841,7,15.8023,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",6.767415
16129,The King's Speech,2010-09-06,2817,7,11.2604,"[{'id': 18, 'name': 'Drama'}, {'id': 36, 'name...",6.765698
895,Sunset Boulevard,1950-08-10,533,8,11.7098,"[{'id': 18, 'name': 'Drama'}]",6.763480
9888,Sin City,2005-04-01,2755,7,15.0105,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",6.761143


Great, so now we have the top charts. This chart can be carried further to become a simple recommender system by recommending the top movies in the charts to all users. Is this a good recommender system? Not particularly. By taking a global weighted average we are able to determine which ones are considered the best on average, but we are unable to account for the individual preferences of the users. For instance, if I was a fan of exclusively Romcoms, I would be recommended only movies I dislike from the list above. The same would happen to people who really don't like action and thrillers. Therefore, we must use our data to instead be able to account for individual preferences! Therefore, simple recommender systems like this that generate global recommendations are generally only used for users who the system has collected little data from.

By accounting for individual user preferences, we would likely achieve a higher score. But how can we determine which system is better? This leads us to the two methods of recommender system evaluation: __offline evaluation__ and __online evaluation.__

# Offline & Online Evaluation

How can we tell that our recommender system is doing what it is supposed to? There are two different approaches to evaluating our system:

- __Offline evaluation:__ Use data we already have and evaluation metrics to compute numeric efectiveness measures that can be tuned for and/or compared. These are the same evaluation metrics which we have encountered and used to assess the performance of our models
- __Online evaluation:__  involves using a live system, and tracking user-related behaviors such as dwell-times, click-through rates, and purchase conversions

When carrying offline evaluation, we can split our data into a training and a test dataset just as we have seen before to ensure that we are tuning our systems appropriately. On the other hand, online evaluation enables us to capture aspects of the performance of our system that offline methods cannot. Whether offline evaluation, online evaluation or a combination of both is the best method to evaluate our system's performance still remains a topic of research. For the purposes of this notebook, we will only be covering simplistic forms of offline evaluation.

# Content-based Recommenders
We can now begin to understand the first sub-class of recommendation systems: __content-based recommenders.__ Let us look at the recommendation problem in the context of our movies dataset. It is intuitive to say that we would like to recommend romance movies to someone that has rated other romantic movies highly as opposed to action, or to recommend older films to users to who are fans of old classics, or even Batman movies to a Batman fan. In this context, we are looking at the characteristics (content) of each movie, and recommending movies that are similar to the previously highly rated movies by the same user. 

There are multiple approaches for the machinery of content-based recommenders. Most will either use the features of movies to predict whether you like or dislike a movie (classification) or to predict the rating the user would give to a movie they have not yet seen (__model-based__). Some might even use the features of a movie you have just watched and recommend the most similar movies to that given movie given their respective features (__memory-based__). We will be creating our own algorithm to predict the ratings of unseen movies and recommend those that are rated the highest.

In modelling terms, we can frame the problem of recommendation as using the __features__ of movies watched by a user and the ratings given to each movie to __predict__ the rating the user would give to a movie not yet watched based on the movie's features. This is why this approach is referred to as item-centred. In other words, if we have enough data, we train a model for each user based on the previously watched movies and their features. The features we have chosen to use in our model are given genres, vote average, release date and runtime. While release date and vote average are available, we have to extract and process the genres for each movie. Given that we have a list of genres for each movie, we will have to dummy encode it as follows:
- Determine how many genres there are and make each genre a feature of the model
- Assign to each genre a zero if it is not present in the movie's list of genres, or 1 for each genre given each movie 

In [40]:
# Importing our data
links_small = pd.read_csv("../DATA/links_small.csv")
ratings_small = pd.read_csv("../DATA/ratings_small.csv")
md = pd.read_csv("../DATA/movies_metadata.csv")

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


In [41]:
# Displaying our data
# md
# ratings_small
#links_small

In [42]:
# Only considering movies with the upper 15% most votes
m = md["vote_count"].quantile(0.85)
md = md[md["vote_count"] > m]

In [43]:
# Making sure imdb_id matches between links and metadata
md["imdb_id"] = md["imdb_id"].str.strip("tt")

# Removing all movies without a genre, release_date or runtime
md["genres"].replace('[]', np.nan, inplace=True)
md.dropna(subset=["genres", "release_date", "imdb_id", "vote_average"], inplace=True)

# Converting release_date to a POSIX timestamp float
md["release_date"] = pd.to_datetime(md["release_date"], infer_datetime_format=True)
md["release_date"] = md["release_date"].apply(lambda x:x.timestamp())

# Converting imdb_id to int
md["imdb_id"] = md["imdb_id"].astype('int64')

In [44]:
# Converting "genres" column from a dictionary from a list of strings, containing the respective genres
def extract_genres(x):
    genre_string = ''
    x = eval(x) # executes expression inside of string
    for dictionary in x:
        genre_string += dictionary["name"] + '|'
    return genre_string # include all but last one
md["genres"] = md["genres"].apply(extract_genres)

Now that we have preprocessed the data we are going to use for this model, we can extract the three columns: genre, release_data and budget. We will now also determine the unique features present in the dataset and use each as a feature. Note that this model assumes that all possible genres are included in the dataset.

In [45]:
# Initialising our features matrix
FEATURES = md[["imdb_id", "original_title", "release_date", "vote_average"]]

# Finding unique genre names
GENRES = md["genres"]
unique_genres = list(set(GENRES.sum().split('|')[:-1])) # Don't include last element
print(unique_genres)

# Removing '|' from the end of each string
GENRES = GENRES.apply(lambda x:x[:-1])

['Science Fiction', 'War', 'Animation', 'Adventure', 'Crime', 'Horror', 'Comedy', 'Action', 'Family', 'Drama', 'Mystery', 'Documentary', 'History', 'Western', 'Fantasy', 'TV Movie', 'Music', 'Romance', 'Thriller']


In [46]:
# Adding each genre as a feature
extended_features = GENRES.str.get_dummies()

# Horizontally stack our extended features and the original features
FEATURES = FEATURES.merge(extended_features, left_index=True, right_index=True)

In [47]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalizing non-categorical features
scaler = MinMaxScaler()
FEATURES[['release_date', 'vote_average']] = scaler.fit_transform(FEATURES[['release_date', 'vote_average']])

Now that our features for each movie are set, we need to create a list of users, where each user contains the features and ratings of each of the movies they rated. To link the users and their ratings to the movies, we will need to use the intermediary "links" table. Be careful! We have dropped a few of the movies in the original dataset when the required feature was not available. First we will merge the appropriate dataframes.

In [48]:
# First Join: links JOIN FEATURES ON imdbId
FEATURES = FEATURES.rename(columns={'imdb_id':'imdbId'}) # making column names match
first_join = links_small.merge(FEATURES, on="imdbId")

# Second Join: ratings JOIN first_join on movieId
data_matrix = ratings_small.merge(first_join, on="movieId")

# Delete unnecessary columns
data_matrix.drop(['movieId', 'timestamp', 'imdbId', 'tmdbId'],axis='columns', inplace=True)

After merging our dataframes, we will create a list of users, each with all the ratings provided by each user and features of the corresponding rated movie.

In [49]:
# Creating unique user list
unique_users = list(set(data_matrix["userId"]))
user_data = {}

# Adding movie ratings and features to the list of the corresponding user for users that have rated more than 15 movies
data_copy = data_matrix.copy()
tr = 50 # at least n reviews
for user_id in unique_users:
    current_data = data_copy[data_copy["userId"] == user_id]
    if current_data.shape[0] >= tr:
        user_data[user_id] = current_data
    data_copy = data_copy[data_copy["userId"] != user_id]

In [50]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import mean_absolute_error as MAE, mean_squared_error as MSE
import random
import time

# Train regression model on each user!
user_models = {}
scores = []
labels = []
total_predictions = []
start = time.time()
for user_id in user_data.keys():
    # Get user data
    data = user_data[user_id]
    Y = data["rating"]
    X = data.drop(["userId", "rating", "original_title"], axis=1)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
    # Train model and predict on test data
    reg = RandomForestRegressor(n_estimators=40, max_depth=3, random_state=0)
    fitted_reg = reg.fit(X_train, Y_train)
    predictions = fitted_reg.predict(X_test)
    user_models[user_id] = fitted_reg # storing estimator for this user!
    labels.extend(Y_test)
    total_predictions.extend(predictions)

print("elapsed time:", time.time() - start)

elapsed time: 40.27305817604065


In [51]:
# Displaying metrics that measure our models performance

print("MAE:", MAE(labels, total_predictions))
print("RMSE:", MSE(labels, total_predictions, squared=False))

MAE: 0.6405155074422084
RMSE: 0.8419787355832018


Now that we have trained a Random Forest Regressor model on every user, we can now pick any random user and recommend them the movies that receive the highest ratings according to our model.

In [54]:
import random

max_user_id = random.choice(list(user_data.keys()))

# Check the predictions of the recommender
model = user_models[max_user_id]
movies = FEATURES.iloc[:, 1:] #pd.concat([FEATURES.iloc[:, 1:], user_data[max_user_id].iloc[:, 2:]]).drop_duplicates(keep=False)

# Rating all movies with the model trained on the one user
predictions = pd.DataFrame({"rating": model.predict(movies.iloc[:, 1:])})
rated_movies = predictions.merge(movies, left_index=True, right_index=True)

# Diplay top 10 rated and top 10 recommendations!
top_rated = user_data[max_user_id].nlargest(columns="rating", n=10)
top_rated

Unnamed: 0,userId,rating,original_title,release_date,vote_average,Action,Adventure,Animation,Comedy,Crime,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
1608,562,5.0,The Usual Suspects,0.818333,0.84127,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
2954,562,5.0,Clerks,0.811374,0.730159,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3755,562,5.0,Pulp Fiction,0.811306,0.873016,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
6173,562,5.0,Jurassic Park,0.801036,0.761905,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
7909,562,5.0,Terminator 2: Judgment Day,0.785023,0.777778,1,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
8599,562,5.0,The Silence of the Lambs,0.781644,0.84127,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
9095,562,5.0,The Shawshank Redemption,0.811599,0.904762,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
9797,562,5.0,The Princess Bride,0.753896,0.761905,0,1,0,1,0,...,0,0,0,0,1,0,0,0,0,0
10009,562,5.0,Return of the Jedi,0.718333,0.809524,1,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
10710,562,5.0,Saving Private Ryan,0.843131,0.809524,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0


In [55]:
# Top 10 recommendations
rated_movies.nlargest(columns="rating", n=10)

Unnamed: 0,rating,original_title,release_date,vote_average,Action,Adventure,Animation,Comedy,Crime,Documentary,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
1292,4.704317,Bride of Frankenstein,0.322748,0.730159,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
1944,4.701893,Honey I Blew Up the Kid,0.793604,0.31746,0,1,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
121,4.701669,重慶森林,0.81,0.809524,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
477,4.701669,Killing Zoe,0.803559,0.52381,1,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
435,4.670065,Dave,0.800248,0.555556,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
490,4.65018,愛のコリーダ,0.660586,0.587302,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1155,4.642563,The Princess Bride,0.753896,0.761905,0,1,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
489,4.625298,Executive Decision,0.823739,0.47619,1,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2178,4.583523,Edward Scissorhands,0.780338,0.746032,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
350,4.533467,The Flintstones,0.808896,0.349206,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


From the evaluation metrics, we can see that the prediction made by our model deviates on average by 0.63 from the true rating, which is not a large deviation, and is generally able to distinguish between 'okay', 'great' and 'terrible' movies in the opinion of the user. However, if we would like to recommend 10 movies out of roughly 45,000, our rating predictions need to be even better. Therefore, this model may serve as a strong baseline model that can be improved by accounting for other features such as cast, directors, plot, amongst others.

### Limitations:
As you have seen, we only considered users that had reviewed 100 movies or more, as we do not have enough data to accurately train a model for that given user for a large number of features. What happens in the real-world if we need to make recommendations for users who have rated few movies? What if they are a completely new user? This is known as the __cold-start__ problem and makes it so that models such as our content-based one struggle to make recommendations for users whom we have little data on, making it an issue of __data sparsity__. Additionally, this approach requires feature extraction, which can sometimes be cumbersome.

# Collaborative-filtering

The second classification of algorithms we are now going to go over is known as __collaborative-filtering__, which, as the name implies, requires the collaboration of all the different members of our user-base. It works under the assumption that users that have had similar preferences/choices in the past have similar tastes. If Susan and Tom both love the same movies, if Susan loves _He's Just Not That Into You_ then we should probably recommend that movie to Tom! Unlike content-based recommender systems, collaborative filtering does not require any manual feature extraction, hence why it's referred to as user-centred.

<img src="https://miro.medium.com/max/676/1*_x4VqIeV9L6fxXm5PeAYsg.png" width="500px" height="500px">

As before, there are different approaches to how we can create a collaborative filtering system:
- We can either look at the preferences/ratings of users who have similar interests and predict the preferences/ratings of this user (memory-based)
- We can use the data we have on the preferences/ratings of all users in our dataset to __learn__ and/or __predict__ the features of individual different items and use these features to predict preferences/ratings for given items (model-based)

As we have covered the model-based approach in the content-based recommendation section, we will be implementing a memory-based approach for collaborative filtering by using some of the unsupervised learning techniques we have encountered so far in the course. If you are interested in looking into an implementation that uses feature learning, check out this [video](https://www.youtube.com/watch?v=9AP-DgFBNP4&ab_channel=ArtificialIntelligence-AllinOne).

More specifically, we will be creating a ratings vector for each user and using a KNN with means algorithm to compute mean rating of the user's $K$ nearest neighbours for movies not yet rated by a given user. We will be using the __Surprise library__, a scikit for building and analyzing recommender systems that deal with explicit rating data. Before we use our data we will have to clean it just as in the case for the content-based system. For the Surprise 'cross-validate' method, we need to have our data as three columns in the following order: userId, movieId, rating. 

In [29]:
import pandas as pd
import numpy as np

# Importing our data
links_small = pd.read_csv("../DATA/links_small.csv")
ratings_small = pd.read_csv("../DATA/ratings_small.csv")
md = pd.read_csv("../DATA/movies_metadata.csv")

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


In [30]:
# Filtering movie data by vote count
md = md[md["vote_count"] >= 50]
md.dropna(subset=["imdb_id", "title"], inplace=True)
md.drop_duplicates(subset=["imdb_id"], inplace=True)
md["imdb_id"] = md["imdb_id"].str.strip("tt").astype('int64')

# Merging our data
movies = md[["imdb_id", "title"]]
movies.rename(columns={'imdb_id':'imdbId'}, inplace=True) # making column names match
first_join = links_small.merge(movies, on="imdbId")
joined_data = ratings_small.merge(first_join, on="movieId")
joined_data.drop(['title', 'timestamp', 'tmdbId', 'imdbId'], axis='columns', inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(


In [31]:
from surprise.prediction_algorithms.knns import  KNNWithMeans
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate

reader = Reader(rating_scale=(0, 5))
# joined_data

data = Dataset.load_from_df(joined_data, reader)
algo = KNNWithMeans()
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9008  0.9053  0.8937  0.8979  0.9159  0.9027  0.0076  
MAE (testset)     0.6910  0.6951  0.6831  0.6878  0.6993  0.6913  0.0056  
Fit time          0.32    0.24    0.24    0.22    0.23    0.25    0.04    
Test time         2.57    3.62    2.34    2.60    2.28    2.68    0.48    


{'test_rmse': array([0.90081819, 0.90528994, 0.89367232, 0.89785879, 0.91592172]),
 'test_mae': array([0.69103222, 0.69513847, 0.68309132, 0.68784471, 0.69932142]),
 'fit_time': (0.32130908966064453,
  0.23941397666931152,
  0.23502588272094727,
  0.2167050838470459,
  0.22664189338684082),
 'test_time': (2.573817729949951,
  3.616605281829834,
  2.3375608921051025,
  2.595590829849243,
  2.2797369956970215)}

What is this algorithm actually doing? In memory-based approaches with algorithms such as KNN, we generally construct a __user-item matrix,__ as seen in the diagram above. This is a 2D representation of users vs items, such that each row contains all the ratings (and missing ratings) a user gave to every movie. Cells representing movies that have not been rated by the respective users are filled with zeros, NaNs or other placeholders. In the case of this surprise algorithm, they created a __similarity matrix__ containing the measure of cosine similarity between each user pair and uses that similarity as weights when computing the average rating of all its neighbours.

Some of the most popular frameworks for collaborative filtering algorithms use __matrix factorization__ approaches, which include using a variant of SVD (as SVD cannot be applied to a matrix with missing data), known as SVD++. In fact, this is the algorithm popularized by Simon Funk during the Netflix prize for recommender systems! The theory of this algorithm extends beyond the scope of this course, but you can read more about it [here](https://www.hindawi.com/journals/mpe/2017/1975719/). Luckily this algorithm is also implemented by the Surprise library!

[Documentation](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)

In [32]:
from surprise import SVD

# CF Algorithm using SVD++
algo = SVD()
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8804  0.8886  0.8784  0.8931  0.8866  0.8854  0.0054  
MAE (testset)     0.6779  0.6824  0.6740  0.6883  0.6833  0.6812  0.0049  
Fit time          6.10    6.22    6.07    6.33    7.20    6.38    0.42    
Test time         0.27    0.20    0.22    0.22    0.24    0.23    0.02    


{'test_rmse': array([0.8804147 , 0.88855195, 0.87836022, 0.89305936, 0.88664292]),
 'test_mae': array([0.6778687 , 0.68237288, 0.6740175 , 0.68826835, 0.68325636]),
 'fit_time': (6.098623991012573,
  6.22326397895813,
  6.069719076156616,
  6.332957029342651,
  7.197875022888184),
 'test_time': (0.26703405380249023,
  0.2044239044189453,
  0.21656298637390137,
  0.2223351001739502,
  0.24431896209716797)}

### Limitations
Just like with the content-based recommender system, a collaborative filtering system also suffers from a cold start problem. In fact, since it is solely based on the ratings/preferences and not on any metadata, it requires much more data. It can also sometimes be difficult to combine and weight the preferences of user neighbors. 

In the real-world, you will rarely see a model that is purely content-based or purely collaborative filtering. Generally, multiple models that capture the predicted preferences differently are combined.

# Hybrid Systems

As the name implies __hybrid systems__ are essentially ensemble models of different recommender systems together. As each approach generally focuses on a different characteristic and suffers from individual limitations, hybrid systems are generally able to give the optimal result. Arguably one of the most successful recommender systems, the one used by _Spotify_, works with this principle. It combines the predictions of the following three key systems:

1. Collaborative filtering to compare an individual's behaviour to other people's taste
2. Natural language processing to analyse the text in each song and find songs of similar lyrics (content-based)
3. Audio modelling of songs' raw audio (content-based)

Unlike other recommendation systems, Spotify uses __implicit feedback__ instead of ratings, which are things such as stream count. This first model gives their system a way to recommend songs based on similar users. The NLP and audio modelling content-based models enable the system to find similarities between items with very few streams which would not otherwise be recommended by the collaborative filtering model due to data sparsity. This is how Spotify is able to recommend even relatively unpopular songs based on a user's personal preferences, which is what made their 'Discover Weekly' feature so popular!

If you would like to practice developing your own recommender systems, a great place to find datasets that can be used for recommender systems is [here](https://github.com/caserec/Datasets-for-Recommender-Systems).

# Congratulations!