# Recommender Systems

### Learning Objectives:
- [Introduction: Simple Recommender Systems](#Introduction:-Simple-Recommeder-Systems)
- [Offline & Online Evaluation](#Offline-&-Online-Evaluation)
- [Content-based Recommenders](#Content\-based-Recommenders)
- [Collaborative-filtering](#Collaborative\-filtering)
- [Hybrid Systems](#Hybrid-Systems)


# Introduction: Simple Recommender Systems

__Recommender systems__, also referred to as __recommendation systems__, are filtering systems used by many different companies world-wide to be able to recommend products (e.g. movies, clothes, etc) based on user preferences. Unlike __ranking algorithms,__ recommender systems aim to provide recommendations without an explicit input from the user (such as a search query). We obviously cannot recommend _exactly_ what a user wants as we cannot access or process all the information in their brain at the same time. Instead, we can take advantage or users' past ratings, choices and preferences to __predict__ the products the user will most probably like.

How do these systems do what they do? This is question that has become a large topic of research and the current answer is that there are mutliple ways to create recommender systems: each working under different assumptions and algorithms. There are two main broad classifications that we will cover shortly: __content-based recommendation__ (item-centred) and __collaborative filtering__ (user-centred).

<img width="500px;" height="500px" src="https://www.researchgate.net/profile/Lionel_Ngoupeyou_Tondji/publication/323726564/figure/fig5/AS:631605009846299@1527597777415/Content-based-filtering-vs-Collaborative-filtering-Source.png">

Before we cover the implementation of these, we will cover what is informally referred to as a simple recommender system: a system that uses the weighted average rating from all users to make recommendations on the "best" options. Throughout this notebook, we will use the "Movies Dataset" from [Kaggle](https://www.kaggle.com/rounakbanik/the-movies-dataset), where the full version contains information on over 45,000 movies with 26 million ratings from  270,000 users. We will be using the small version, as shown below:

In [1]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go

In [2]:
# Importing ratings
ratings = pd.read_csv("../DATA/ratings_small.csv")
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [3]:
# Importing movie metadata
metadata = pd.read_csv("../DATA/movies_metadata.csv")
# metadata.head()
metadata.nlargest(n=10, columns=["vote_average"])

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
186,False,,0,"[{'id': 14, 'name': 'Fantasy'}, {'id': 35, 'na...",,58372,tt0114241,en,Reckless,"On Christmas eve, a relentlessly cheerful woma...",...,1995-11-17,0.0,91.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The most twisted Christmas ever.,Reckless,False,10.0,1.0
394,False,,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 12, ...",,278939,tt0113173,en,Girl in the Cadillac,A runaway teenage girl and a drifter rob a ban...,...,1995-10-24,0.0,89.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,On our first date... we robbed a bank,Girl in the Cadillac,False,10.0,1.0
706,False,,0,"[{'id': 99, 'name': 'Documentary'}]",,73183,tt0113270,en,"The Haunted World of Edward D. Wood, Jr.",The strange life and the wonderfully awful fil...,...,1996-05-01,0.0,112.0,[],Released,Can your MIND stand the SHOCKING TRUTH?,"The Haunted World of Edward D. Wood, Jr.",False,10.0,1.0
738,False,,0,"[{'id': 99, 'name': 'Documentary'}]",,255546,tt0109381,pt,Carmen Miranda: Bananas Is My Business,A biography of the Portuguese-Brazilian singer...,...,1995-04-13,0.0,91.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,,Carmen Miranda: Bananas Is My Business,False,10.0,1.0
1634,False,,0,"[{'id': 18, 'name': 'Drama'}]",,64562,tt0119845,en,Other Voices Other Rooms,Truman Capote's semi-autobiographical first no...,...,1995-09-15,0.0,0.0,[],Released,,Other Voices Other Rooms,False,10.0,1.0
1761,False,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,78373,tt0118925,en,"Dancer, Texas Pop. 81","Four guys, best friends, have grown up togethe...",...,1998-05-01,565592.0,97.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,in the middle of nowhere they had everything,"Dancer, Texas Pop. 81",False,10.0,1.0
2114,False,,0,"[{'id': 18, 'name': 'Drama'}]",http://www.thefarmerswifefilm.co.uk/,143750,tt2140519,en,The Farmer's Wife,"As her surroundings are invaded by outsiders, ...",...,2012-06-20,0.0,18.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,The Farmer's Wife,False,10.0,1.0
2653,False,,0,"[{'id': 35, 'name': 'Comedy'}]",,89861,tt0120210,en,Stiff Upper Lips,Stiff Upper Lips is a broad parody of British ...,...,1998-06-12,0.0,99.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Stiff Upper Lips,False,10.0,1.0
2948,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,124853,tt0114008,en,Ten Benny,The story of a young shoe salesman whose overe...,...,1995-04-09,0.0,108.0,[],Released,,Ten Benny,False,10.0,1.0
3160,False,,0,"[{'id': 99, 'name': 'Documentary'}]",,49477,tt0192069,en,Gendernauts: A Journey Through Shifting Identi...,Monika Treut explores the worlds and thoughts ...,...,1999-06-12,0.0,87.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Gendernauts: A Journey Through Shifting Identi...,False,10.0,2.0


For our simple recommender system, we will use the IMDB's known __weighted average formula__ used for their Top Movies Chart, given as follows:

$$ R_{W} = (\frac{v}{v + m})R + (\frac{m}{v + m})C  $$

Where:
- $R_{W}$ is the weighted average movie rating
- $v$ is the number of votes for that movie title
- $m$ is the minimum number of votes required to be in the top Chart
- $R$ is the average rating of that movie title
- $C$ is the mean vote rating across all movies

We can now begin our calculations to construct our simple recommender:

In [4]:
# Computing mean vote count across all movies
vote_counts = metadata[metadata['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = metadata[metadata['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C

5.244896612406511

We must now choose a value for the minimum number. In this case, we will choose a value $m$ that gives us movies that have received more votes than 95% of the other remaining movies.

In [5]:
# Computing minimum number of votes required
m = vote_counts.quantile(0.95)
print(m)

434.0


We can now extract the movies that are considered to be canditates for the top charts in our recommender system given our computed 'm'.

In [6]:
# Extracting all movies that have a votecount that is greater than our m value
qualified = metadata[(metadata['vote_count'] >= m) & (metadata['vote_count'].notnull()) & (metadata['vote_average'].notnull())] \
                 [['title', 'release_date', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(2274, 6)

In [7]:
# Computing weighted average and determining top 250 chart
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

qualified['weighted_average'] = qualified.apply(weighted_rating, axis=1)
qualified = qualified.sort_values('weighted_average', ascending=False).head(250)
qualified

Unnamed: 0,title,release_date,vote_count,vote_average,popularity,genres,weighted_average
15480,Inception,2010-07-14,14075,8,29.1081,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",7.917588
12481,The Dark Knight,2008-07-16,12269,8,123.167,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",7.905871
22879,Interstellar,2014-11-05,11187,8,32.2135,"[{'id': 12, 'name': 'Adventure'}, {'id': 18, '...",7.897107
2843,Fight Club,1999-10-15,9678,8,63.8696,"[{'id': 18, 'name': 'Drama'}]",7.881753
4863,The Lord of the Rings: The Fellowship of the Ring,2001-12-18,8892,8,32.0707,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",7.871787
...,...,...,...,...,...,...,...
2006,Indiana Jones and the Temple of Doom,1984-05-23,2841,7,15.8023,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",6.767415
16129,The King's Speech,2010-09-06,2817,7,11.2604,"[{'id': 18, 'name': 'Drama'}, {'id': 36, 'name...",6.765698
895,Sunset Boulevard,1950-08-10,533,8,11.7098,"[{'id': 18, 'name': 'Drama'}]",6.763480
9888,Sin City,2005-04-01,2755,7,15.0105,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",6.761143


Great, so now we have the top charts. This chart can be carried further to become a simple recommender system by recommending the top movies in the charts to all users. Is this a good recommender system? Not particularly. By taking a global weighted average we are able to determine which ones are considered the best on average, but we are unable to account for the individual preferences of the users. For instance, if I was a fan of exclusively Romcoms, I would be recommended only movies I dislike from the list above. The same would happen to people who really don't like action and thrillers. Therefore, we must use our data to instead be able to account for individual preferences! Therefore, simple recommender systems like this that generate global recommendations are generally only used for users who the system has collected little data from.

By accounting for individual user preferences, we would likely achieve a higher score. But how can we determine which system is better? This leads us to the two methods of recommender system evaluation: __offline evaluation__ and __online evaluation.__

# Offline & Online Evaluation

How can we tell that our recommender system is doing what it is supposed to? There are two different approaches to evaluating our system:

- __Offline evaluation:__ Use data we already have and evaluation metrics to compute numeric efectiveness measures that can be tuned for and/or compared. These are the same evaluation metrics which we have encountered and used to assess the performance of our models
- __Online evaluation:__  involves using a live system, and tracking user-related behaviors such as dwell-times, click-through rates, and purchase conversions

When carrying offline evaluation, we can split our data into a training and a test dataset just as we have seen before to ensure that we are tuning our systems appropriately. On the other hand, online evaluation enables us to capture aspects of the performance of our system that offline methods cannot. Whether offline evaluation, online evaluation or a combination of both is the best method to evaluate our system's performance still remains a topic of research. For the purposes of this notebook, we will only be covering simplistic forms of offline evaluation.

# Content-based Recommenders
We can now begin to understand the first sub-class of recommendation systems: __content-based recommenders.__ Let us look at the recommendation problem in the context of our movies dataset. It is intuitive to say that we would like to recommend romance movies to someone that has rated other romantic movies highly as opposed to action, or to recommend older films to users to who are fans of old classics, or even Batman movies to a Batman fan. In this context, we are looking at the characteristics (content) of each movie, and recommending movies that are similar to the previously highly rated movies by the same user. 

There are multiple approaches for the machinery of content-based recommenders. Most will either use the features of movies to predict whether you like or dislike a movie (classification) or to predict the rating the user would give to a movie they have not yet seen (__model-based__). Some might even use the features of a movie you have just watched and recommend the most similar movies to that given movie given their respective features (__memory-based__). We will be creating our own algorithm to predict the ratings of unseen movies and recommend those that are rated the highest.

In modelling terms, we can frame the problem of recommendation as using the __features__ of movies watched by a user and the ratings given to each movie to __predict__ the rating the user would give to a movie not yet watched based on the movie's features. This is why this approach is referred to as item-centred. In other words, if we have enough data, we train a model for each user based on the previously watched movies and their features. The features we have chosen to use in our model are given genres, vote average, release date and runtime. While release date and vote average are available, we have to extract and process the genres for each movie. Given that we have a list of genres for each movie, we will have to dummy encode it as follows:
- Determine how many genres there are and make each genre a feature of the model
- Assign to each genre a zero if it is not present in the movie's list of genres, or 1 for each genre given each movie 

In [8]:
# Importing our data
links_small = pd.read_csv("../DATA/links_small.csv")
ratings_small = pd.read_csv("../DATA/ratings_small.csv")
md = pd.read_csv("../DATA/movies_metadata.csv")

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


In [9]:
# Displaying our data
# md
# ratings_small
#links_small

In [10]:
# Only considering movies with the upper 15% most votes
m = md["vote_count"].quantile(0.85)
md = md[md["vote_count"] > m]

In [11]:
# Making sure imdb_id matches between links and metadata
md["imdb_id"] = md["imdb_id"].str.strip("tt")

# Removing all movies without a genre, release_date or runtime
md["genres"].replace('[]', np.nan, inplace=True)
md.dropna(subset=["genres", "release_date", "imdb_id", "vote_average"], inplace=True)

# Converting release_date to a POSIX timestamp float
md["release_date"] = pd.to_datetime(md["release_date"], infer_datetime_format=True)
md["release_date"] = md["release_date"].apply(lambda x:x.timestamp())

# Converting imdb_id to int
md["imdb_id"] = md["imdb_id"].astype('int64')

In [12]:
# Converting "genres" column from a dictionary from a list of strings, containing the respective genres
def extract_genres(x):
    genre_string = ''
    x = eval(x) # executes expression inside of string
    for dictionary in x:
        genre_string += dictionary["name"] + '|'
    return genre_string # include all but last one
md["genres"] = md["genres"].apply(extract_genres)

Now that we have preprocessed the data we are going to use for this model, we can extract the three columns: genre, release_data and budget. We will now also determine the unique features present in the dataset and use each as a feature. Note that this model assumes that all possible genres are included in the dataset.

In [13]:
# Initialising our features matrix
FEATURES = md[["imdb_id", "original_title", "release_date", "vote_average"]]

# Finding unique genre names
GENRES = md["genres"]
unique_genres = list(set(GENRES.sum().split('|')[:-1])) # Don't include last element
print(unique_genres)

# Removing '|' from the end of each string
GENRES = GENRES.apply(lambda x:x[:-1])

['Crime', 'Documentary', 'Comedy', 'War', 'TV Movie', 'Music', 'Science Fiction', 'Animation', 'Drama', 'Romance', 'Action', 'Mystery', 'Adventure', 'Horror', 'Family', 'Western', 'History', 'Thriller', 'Fantasy']


In [14]:
# Adding each genre as a feature
extended_features = GENRES.str.get_dummies()

# Horizontally stack our extended features and the original features
FEATURES = FEATURES.merge(extended_features, left_index=True, right_index=True)

In [15]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalizing non-categorical features
scaler = StandardScaler()
FEATURES[['release_date', 'vote_average']] = scaler.fit_transform(FEATURES[['release_date', 'vote_average']])

Now that our features for each movie are set, we need to create a list of users, where each user contains the features and ratings of each of the movies they rated. To link the users and their ratings to the movies, we will need to use the intermediary "links" table. Be careful! We have dropped a few of the movies in the original dataset when the required feature was not available. First we will merge the appropriate dataframes.

In [16]:
# First Join: links JOIN FEATURES ON imdbId
FEATURES = FEATURES.rename(columns={'imdb_id':'imdbId'}) # making column names match
first_join = links_small.merge(FEATURES, on="imdbId")

# Second Join: ratings JOIN first_join on movieId
data_matrix = ratings_small.merge(first_join, on="movieId")

# Delete unnecessary columns
data_matrix.drop(['movieId', 'timestamp', 'imdbId', 'tmdbId'],axis='columns', inplace=True)

# Converting ratings from float to ints for Logistic Regression
# data_matrix["rating"] = data_matrix["rating"].apply(lambda x: 2*x).astype("int64")

After merging our dataframes, we will create a list of users, each with all the ratings provided by each user and features of the corresponding rated movie.

In [17]:
# Creating unique user list
unique_users = list(set(data_matrix["userId"]))
user_data = {}

# Adding movie ratings and features to the list of the corresponding user for users that have rated more than 15 movies
data_copy = data_matrix.copy()
tr = 100 # at least n reviews
for user_id in unique_users:
    current_data = data_copy[data_copy["userId"] == user_id]
    if current_data.shape[0] >= tr:
        user_data[user_id] = current_data
    data_copy = data_copy[data_copy["userId"] != user_id]

In [19]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE, mean_absolute_error as MAE, r2_score
import random

# Train regression model on each user!
user_models = {}
rmse = []
mae = []
r2 = []
n_reviews = []
for user_id in user_data.keys():
    # Get user data
    data = user_data[user_id]
    Y = data["rating"]
    X = data.drop(["userId", "rating", "original_title"], axis=1)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
    # Train model and predict on test data
    reg = RandomForestRegressor(n_estimators=80, max_depth=3, random_state=0).fit(X_train, Y_train)
#     reg2 = LinearRegression().fit(X_train, Y_train)
#     reg = LogisticRegression(max_iter=1000, random_state=0).fit(X_train, Y_train)
#     reg = RandomForestClassifier(n_estimators=80, max_depth=3, random_state=0).fit(X_train, Y_train)
#     reg = LinearRegression().fit(X_train, Y_train)
#     predictions = reg.predict(X_test)
#     predictions2 = reg2.predict(X_test)
    predictions = reg.predict(X_test)
    
    # Store RMSE and model
    rmse.append(MSE(Y_test, predictions, squared=False))
    mae.append(MAE(Y_test, predictions))
    r2.append(r2_score(Y_test, predictions))
    n_reviews.append(Y.shape[0])
    user_models[user_id] = reg # storing estimator for this user!
#     print("Y_test:", Y_test, "Predictions:", predictions)

In [20]:
# import matplotlib.pyplot as plt
# print("Random Forest Regressor:")
# print("R2:", r2_score(Y_test, predictions))
# print("MAE", MAE(Y_test, predictions))
# print("RMSE:", MSE(Y_test, predictions, squared=False))
# print("Number of reviews:", len(Y_test)*5)
# # print()
# # print("Linear Regression:")
# # print("R2:", r2_score(Y_test, predictions2))
# # print("MAE", MAE(Y_test, predictions2))
# # print("RMSE:", MSE(Y_test, predictions2, squared=False))
# # print()

# fig = plt.figure()
# plt.scatter(Y_test, predictions)
# # plt.scatter(Y_test, predictions2)
# plt.xlabel("Labels")
# plt.ylabel("Predictions")
# plt.legend(("Random Forest", "Linear Regressor"))
# ax = plt.gca()
# ax.set_ylim([0,10])
# ax.set_xlim([0,10])
# plt.show()

In [21]:
import random
# Pick user with the largest number of samples to look at individually
# max_user_id = -1
# for user_id in user_data.keys():
#     if max_user_id == -1:
#         max_user_id = user_id
#     elif (user_data[max_user_id].shape[0] < user_data[user_id].shape[0]):
#         max_user_id = user_id
max_user_id = random.choice(list(user_data.keys()))

# Check the predictions of the recommender
model = user_models[max_user_id]
movies = FEATURES.iloc[:, 1:] #pd.concat([FEATURES.iloc[:, 1:], user_data[max_user_id].iloc[:, 2:]]).drop_duplicates(keep=False)

# Rating all movies with the model trained on the one user
predictions = pd.DataFrame({"rating": model.predict(movies.iloc[:, 1:])})
rated_movies = predictions.merge(movies, left_index=True, right_index=True)

# Diplay top 10 rated and top 10 recommendations!
top_rated = user_data[max_user_id].nlargest(columns="rating", n=10)
top_rated

Unnamed: 0,userId,rating,original_title,release_date,vote_average,Action,Adventure,Animation,Comedy,Crime,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
1621,602,5.0,The Usual Suspects,-0.38829,1.960805,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
2963,602,5.0,Clerks,-0.442465,1.150921,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3777,602,5.0,Pulp Fiction,-0.442991,2.1922,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
3892,602,5.0,Quiz Show,-0.441939,0.803828,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4938,602,5.0,The Lion King,-0.456842,1.845107,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
5933,602,5.0,In the Line of Fire,-0.518205,0.456735,1,0,0,0,1,...,0,0,0,1,0,0,0,1,0,0
6325,602,5.0,Much Ado About Nothing,-0.529075,0.919525,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
6913,602,5.0,Schindler's List,-0.492958,2.1922,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
7140,602,5.0,The Nightmare Before Christmas,-0.5019,1.382316,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
7686,602,5.0,Aladdin,-0.557653,1.150921,0,1,1,1,0,...,0,0,0,0,1,0,0,0,0,0


In [22]:
# Top 10 recommendations
rated_movies.nlargest(columns="rating", n=10)

Unnamed: 0,rating,original_title,release_date,vote_average,Action,Adventure,Animation,Comedy,Crime,Documentary,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
1347,4.681487,Sneakers,-0.571153,0.341037,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
5529,4.670167,The Rules of Attraction,0.074743,-0.121754,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
1274,4.665057,An American Werewolf in London,-1.278939,0.919525,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
5446,4.665057,Captain Ron,-0.569575,-0.931638,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4359,4.659681,Cocktail,-0.834666,-0.81594,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1292,4.65942,Bride of Frankenstein,-4.246134,1.150921,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
3782,4.657447,Under Suspicion,-0.056225,-0.006056,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
170,4.656458,Judge Dredd,-0.391621,-1.163033,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
990,4.650383,"20,000 Leagues Under the Sea",-2.986251,0.572432,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2178,4.647777,Edward Scissorhands,-0.684062,1.266618,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [23]:
print("Mean RMSE over all users:", np.mean(rmse))
print("Mean MAE over all users:", np.mean(mae))
print("Mean R-squared over all users:", np.mean(r2))

Mean RMSE over all users: 0.8068141804470971
Mean MAE over all users: 0.631894226650515
Mean R-squared over all users: 0.17979772695336643


From the evaluation metrics, we can see that the prediction made by our model deviates on average by 0.63 from the true rating, which is not a large deviation, and is generally able to distinguish between 'okay', 'great' and 'terrible' movies in the opinion of the user. However, if we would like to recommend 10 movies out of roughly 45,000, our rating predictions need to be even better. Therefore, this model may serve as a strong baseline model that can be improved by accounting for other features such as cast, directors, plot, amongst others.

### Limitations:
As you have seen, we only considered users that had reviewed 100 movies or more, as we do not have enough data to accurately train a model for that given user for a large number of features. What happens in the real-world if we need to make recommendations for users who have rated few movies? What if they are a completely new user? This is known as the __cold-start__ problem and makes it so that models such as our content-based one struggle to make recommendations for users whom we have little data on, making it an issue of __data sparsity__. Additionally, this approach requires feature extraction, which can sometimes be cumbersome.

# Collaborative-filtering

The second classification of algorithms we are now going to go over is known as __collaborative-filtering__, which, as the name implies, requires the collaboration of all the different members of our user-base. It works under the assumption that users that have had similar preferences/choices in the past have similar tastes. If Susan and Tom both love the same movies, if Susan loves _He's Just Not That Into You_ then we should probably recommend that movie to Tom! Unlike content-based recommender systems, collaborative filtering does not require any manual feature extraction, hence why it's referred to as user-centred.

As before, there are different approaches to how we can create a collaborative filtering system:
- We can either look at the preferences/ratings of users who have similar interests and predict the preferences/ratings of this user (memory-based)
- We can use the data we have on the preferences/ratings of all users in our dataset to __learn__ and/or __predict__ the features of individual different items and use these features to predict preferences/ratings for given items (model-based)

As we have covered the model-based approach in the content-based recommendation section, we will be implementing a memory-based approach for collaborative filtering by using some of the unsupervised learning techniques we have encountered so far in the course. If you are interested in looking into an implementation that uses feature learning, check out this [video](https://www.youtube.com/watch?v=9AP-DgFBNP4&ab_channel=ArtificialIntelligence-AllinOne).

More specifically, we will be creating a ratings vector for each user and using a KNN regression algorithm to compute mean rating of the user's $K$ nearest neighbours.

# Hybrid Systems