# Recommender systems

In this notebook we will go through various examples of recommender systems. 

The code in the notebook is based on the following [DataCamp tutorial](https://www.datacamp.com/tutorial/recommender-systems-python) and uses [The Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/data) from Kaggle, which is data from IMDB about movies and users. We will only use part of the data, which is available on Moodle as the zip file "TheMovieDataset.zip".

## Simple recommender system

**First, we will build a simple recommender system that just recommends the Top 250 movies.** 

For this to work, we have to decide how to rank the movies, which in turn means deciding how to assign a score to each movie.

For this, let us first look at the metadata about the movies.

In [3]:
#pip install pandas

In [4]:
# Import Pandas
import pandas as pd

# Load Movies Metadata
metadata = pd.read_csv('../Notebooks and data-13/movies_metadata.csv', low_memory=False)

# Print the first three rows
metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [5]:
metadata.shape

(45466, 24)

In [6]:
#metadata = metadata.iloc[0:30000, :]

In [7]:
metadata.shape

(45466, 24)

Considerations to take into account: The score should not only be based on the average vote, but also on how many people have actually voted on the movie. (Otherwise, a single high vote could make a movie the highest scoring.) Thus, we want a weighted score. For instance:
\begin{equation}
\text{Weighted Rating } (\mathbf{WR}) = \left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)
\end{equation}
where $v$ is the number of votes for the movie (`vote_count`), $m$ is the minimum number of votes required to be listed in the chart, $R$ is the average rating of the movie (`vote_average`), and $C$ is the mean vote across the whole dataset.

$v$ and $R$ we already have in the metadata dataset, and $C$ we can calculate from it. However, $m$ is a hyperparameter we have to choose ourselves.

First let us calculate $C$:

In [8]:
# Calculate mean of vote average column
C = metadata['vote_average'].mean()
print(C)

5.618207215134185


For $m$, we will use the 90th percentile of the number of votes. In that way, we only consider the movies that are in the top 10% with regard to number of votes.

In [9]:
# Calculate the minimum number of votes required to be in the chart, m
m = metadata['vote_count'].quantile(0.90)
print(m)

160.0


We will make a new dataframe `q_movies` that only contains the movies that have at least $m$ (160) votes.

In [10]:
# Filter out all qualified movies into a new DataFrame
q_movies = metadata.copy().loc[metadata['vote_count'] >= m]
q_movies.shape

(4555, 24)

We will now calculate a weighted ranking of the movies based on the formula above and store it in a new column called `score`. 

In [11]:
# Function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

In [12]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

In [13]:
q_movies

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,score
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,7.640253
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,6.820293
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,5.660700
5,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,187436818.0,170.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0,7.537201
8,False,,35000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,9091,tt0114576,en,Sudden Death,International action superstar Jean Claude Van...,...,64350171.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Terror goes into overtime.,Sudden Death,False,5.5,174.0,5.556626
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45177,False,"{'id': 442352, 'name': 'Brice Collection', 'po...",0,"[{'id': 35, 'name': 'Comedy'}]",,375798,tt5029602,fr,Brice 3,"Brice is back. The world has changed, but not ...",...,0.0,95.0,"[{'iso_639_1': 'fr', 'name': 'Français'}]",Released,,Brice 3,False,4.3,160.0,4.959104
45204,False,,0,"[{'id': 35, 'name': 'Comedy'}]",,417870,tt3564472,en,Girls Trip,Four girlfriends take a trip to New Orleans fo...,...,0.0,122.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"""Forgive us in advance for this wild weekend""",Girls Trip,False,7.1,393.0,6.671272
45258,False,"{'id': 466463, 'name': 'Descendants Collection...",0,"[{'id': 10770, 'name': 'TV Movie'}, {'id': 107...",,417320,tt5117876,en,Descendants 2,When the pressure to be royal becomes too much...,...,0.0,111.0,"[{'iso_639_1': 'da', 'name': 'Dansk'}]",Released,Long live evil.,Descendants 2,False,7.5,171.0,6.590372
45265,False,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,265189,tt2121382,sv,Turist,"While holidaying in the French Alps, a Swedish...",...,1359497.0,118.0,"[{'iso_639_1': 'fr', 'name': 'Français'}, {'is...",Released,,Force Majeure,False,6.8,255.0,6.344369
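
As a sanity check on the formula, the scores in the table can be reproduced by hand. The sketch below uses the values of `C` and `m` printed earlier in the notebook; the scalar helper `weighted_rating_value` is a hypothetical name introduced only for this illustration.

```python
# Values printed earlier in the notebook
C = 5.618207215134185  # mean of vote_average over all movies
m = 160.0              # 90th percentile of vote_count

def weighted_rating_value(v, R, m=m, C=C):
    """Weighted rating for one movie with v votes and average rating R."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Toy Story: 5415 votes, average 7.7
toy_story = weighted_rating_value(v=5415, R=7.7)
# Dilwale Dulhania Le Jayenge: 661 votes, average 9.1
ddlj = weighted_rating_value(v=661, R=9.1)

print(round(toy_story, 6))  # 7.640253
print(round(ddlj, 6))       # 8.421453
```

The second movie shows the shrinkage effect: with only 661 votes, its weight on its own average is 661/821 (about 0.81), so the 9.1 average is pulled noticeably toward the global mean $C$.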


Let us sort the dataframe on this new `score` and print the top 20

In [14]:
#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

#Print the top 20 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(20)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.445869
834,The Godfather,6024.0,8.5,8.425439
10309,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
12481,The Dark Knight,12269.0,8.3,8.265477
2843,Fight Club,9678.0,8.3,8.256385
292,Pulp Fiction,8670.0,8.3,8.251406
522,Schindler's List,4436.0,8.3,8.206639
23673,Whiplash,4376.0,8.3,8.205404
5481,Spirited Away,3968.0,8.3,8.196055
2211,Life Is Beautiful,3643.0,8.3,8.187171


We can now recommend new movies to a user based on this `score` by recommending the top-scoring movies that the user has not watched yet.

## Content-based filtering recommender systems

In this section, we will look at content-based filtering. That is, we will try to recommend movies that are similar in content to movies the user has already watched. The key here is to find a way to represent "content" and a way to measure the distance between "content".

First, we will take the content to be the plot description that we have in the data. As the distance measure, we will use cosine similarity. That is, **we will recommend movies to the user whose plot descriptions are similar (measured by cosine similarity) to the plot descriptions of movies the user has already watched.**

The plot description is available in the variable `overview` of the metadata dataset. Let us look at an example.

In [15]:
#Print plot overviews of the first 5 movies.
metadata['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [16]:
metadata['overview'][0]

"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."

These plot descriptions are plain text strings and cannot be put directly into a machine learning algorithm. Thus, we have to pre-process the `overview` variable. As when we looked at IMDB reviews that were labelled as positive or negative in connection with deep learning, we can use one-hot encoding. That is, we can make a column for each of the most common words and put a 1 if the word is in the plot description and a 0 if it is not.

This would work, but it is a crude encoding. We can do a bit better by replacing the 1 with a score that represents the importance of that word. One such importance score is *Term Frequency-Inverse Document Frequency* (TF-IDF). This score relates how often the word appears in the given plot description to how often it occurs overall in all the plot descriptions. 

By "term" we just mean word and by "document" we mean a plot description. Then we can first calculate the *relative term frequency* of a term in a document - that is, how often a word occurs in a particular plot description. The formula for this is:
$$
tf(t, d) = \frac{f_{t, d}}{len(d)} 
$$
where $t$ is the term, $d$ is the document, $f_{t, d}$ is the count of how many times the term $t$ appears in the document $d$, and $len(d)$ is the total count of terms in $d$. 

In addition, we can define the *inverse document frequency* by the formula:
$$
idf(t, D) = \log {\frac {\# D}{\# D_t}}
$$
where $D$ is the set of all documents (in our case all the plot descriptions), $D_t$ is the set of documents that contain the term $t$, $\#D$ the number of documents in $D$, and $\# D_t$ is the number of documents that contain $t$.

With relative term frequency and inverse document frequency defined, we can finally define *TF-IDF* as their product:
$$
\text{TF-IDF}(t, d, D) = tf(t, d) \cdot idf(t, D)
$$
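
To make the definitions concrete, here is a minimal hand calculation on a hypothetical toy corpus of three tokenised "plot descriptions". It computes the relative term frequency, the inverse document frequency, and their product. The corpus and the function names are made up for illustration, and the sketch assumes that the term occurs in at least one document.

```python
import math

# Hypothetical toy corpus: each document is a list of terms
docs = [
    ["toy", "cowboy", "toy", "space"],
    ["space", "alien", "ship"],
    ["cowboy", "horse"],
]

def tf(term, doc):
    """Relative term frequency: count of term in doc divided by len(doc)."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(#D / #D_t). Assumes #D_t > 0."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """TF-IDF as the product of tf and idf."""
    return tf(term, doc) * idf(term, docs)

# "toy" is frequent in the first document and rare in the corpus
toy_score = tf_idf("toy", docs[0], docs)
# "space" is less frequent in the first document and more common overall
space_score = tf_idf("space", docs[0], docs)
print(toy_score, space_score)
```

Here "toy" gets a higher score than "space" in the first document, because it occurs more often there and in fewer documents overall. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed idf and normalises each row, so its numbers will differ from this hand calculation.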

Luckily, we do not have to calculate these things manually, but can use the built-in functionality of scikit-learn.

In [51]:
#We first replace missing values with an empty string
metadata['overview'] = metadata['overview'].fillna('')

#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

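#Cast the matrix to uint8 to reduce memory use.
#Caution: TF-IDF weights lie between 0 and 1, so this integer cast truncates almost all of them to 0.
#Keep the default float dtype if memory allows.
tfidf_matrix = tfidf_matrix.astype("uint8")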

In [43]:
tfidf_matrix.toarray()[1, :]

MemoryError: Unable to allocate 3.21 GiB for an array with shape (45466, 75827) and data type uint8

In [47]:
tfidf_matrix

<Compressed Sparse Row sparse matrix of dtype 'uint8'
	with 1210882 stored elements and shape (45466, 75827)>

Now that we have each movie represented as a vector of length 75827 (the rows), we just need a way to measure the distance between two such vectors (movies/rows). For this, we will use cosine similarity, which is commonly used for tasks like this. Cosine similarity measures "the angle" between two vectors. If the vectors are proportional (have the same direction) the cosine similarity is 1, if the vectors are orthogonal it is 0, and if the vectors are pointing in completely opposite directions it is -1. (Because all entries in our rows are non-negative, we will never get negative cosine similarity values.) Cosine similarity is also fast to compute for sparse rows like the ones we have here (most values are 0). The formula for cosine similarity is:
$$
\cos(A, B) = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2} \cdot \sqrt{\sum_{i} b_i^2}}
$$
where $A$ and $B$ are vectors (in our case rows), $a_i$ is the $i$'th element of the vector $A$, and $b_i$ is the $i$'th element of the vector $B$.
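
The three reference cases mentioned above (same direction, orthogonal, opposite direction) can be checked with a direct implementation of the formula. This is a plain-Python sketch for illustration only; the function name `cosine` and the example vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])  # proportional vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])                # no shared component
opposite = cosine([1.0, 0.0], [-1.0, 0.0])                 # opposite directions

print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0, and -1.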

We calculate the cosine similarity between any two movies. We will store this in a matrix (2D array) of shape 45466 x 45466, where each column and row corresponds to a movie. In this way, each row will correspond to a movie, and its values will be the cosine similarities between that movie and all 45466 movies (including itself). (The tutorial argues for using a linear kernel to calculate cosine similarities faster, but we might as well just use the `cosine_similarity` function from scikit-learn - it is often fast enough.)

In [48]:
from sklearn.metrics.pairwise import cosine_similarity

In [54]:
%%time 
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

MemoryError: Unable to allocate 2.04 GiB for an array with shape (546860044,) and data type int32

The linear kernel actually turned out to be slower in this case! Let us remove this matrix (`cosine_simLK`), as it is quite big and takes up memory.
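
A dense 45466 x 45466 similarity matrix of 64-bit floats would need roughly 16.5 GB, which explains the memory pressure. One possible workaround (an assumption on my part, not what the tutorial does) is to never store the full matrix and instead compute the similarities for one movie at a time. The sketch below illustrates the idea in plain Python, with each sparse row stored as a hypothetical `{term: weight}` dictionary.

```python
import math

def sparse_cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty description is treated as similar to nothing
    return dot / (norm_a * norm_b)

def top_k_similar(query_idx, rows, k=10):
    """Return the k (index, similarity) pairs most similar to rows[query_idx]."""
    scores = [
        (j, sparse_cosine(rows[query_idx], row))
        for j, row in enumerate(rows)
        if j != query_idx  # skip the query movie itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Hypothetical sparse rows for four movies (the last has an empty overview)
rows = [
    {"toy": 0.8, "cowboy": 0.6},
    {"toy": 0.6, "space": 0.8},
    {"jungle": 1.0},
    {},
]
result = top_k_similar(0, rows, k=2)
print(result)
```

Only one row of similarities exists in memory at any time, at the cost of recomputing it for every query.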

In [None]:
cosine_sim

In [None]:
cosine_sim.shape

This matrix is symmetric in the sense that `cosine_sim[0, 1]` tells us how similar the first movie (index 0) is to the second movie (index 1), which is the exact same value as `cosine_sim[1, 0]`.

In [None]:
cosine_sim[0, 1]

In [None]:
cosine_sim[1, 0]

To get an idea of whether this makes sense, we can look up the corresponding titles in the original metadata dataset. For later use, let us make a reverse map from movie titles to row indices.

In [None]:
#Construct a reverse map of indices and movie titles
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

In [None]:
indices[0:10]

We can see that the similarity `cosine_sim[0, 1]` is the similarity between "Toy Story" and "Jumanji".

We can now define a recommender function, that is, we can define a function that takes in a movie title as input and returns a list of the 10 most similar movies to the input movie.

In [None]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]
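
The ranking logic of the function can be tried without loading the dataset. The sketch below uses a hypothetical 4 x 4 similarity matrix and made-up titles; `recommend` is an illustrative stand-in, not the function above. One deliberate difference: it removes the query movie by index instead of slicing off the first sorted entry, which is an assumption-free way to skip it even when another movie ties with a similarity of 1.

```python
# Hypothetical titles and symmetric similarity matrix
titles = ["A", "B", "C", "D"]
sim = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
]

def recommend(title, k=2):
    """Return the k titles most similar to the given title."""
    idx = titles.index(title)
    # Pair every movie index with its similarity to the query movie
    scored = sorted(enumerate(sim[idx]), key=lambda pair: pair[1], reverse=True)
    # Drop the query movie itself, then keep the k best
    scored = [pair for pair in scored if pair[0] != idx][:k]
    return [titles[i] for i, _ in scored]

print(recommend("A"))       # ['C', 'B']
print(recommend("D", k=1))  # ['B']
```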

We can now try it out.

In [None]:
get_recommendations('Toy Story')

In [None]:
get_recommendations('The Dark Knight Rises')

This recommender is not completely off, but it is still not perfect, of course.

## Improved content-based filtering

We can **improve the recommender by considering more metadata about the movies, such as starring actors, the director, related genres, and keywords**. First, we load this additional data and merge it with our original metadata.

In [None]:
# Load keywords and credits
credits = pd.read_csv('../Notebooks and data-13/credits.csv')
keywords = pd.read_csv('../Notebooks and data-13/keywords.csv')

# Remove rows with bad IDs.
metadata = metadata.drop([19730, 29503, 35587])

# Convert IDs to int. Required for merging
credits['id'] = credits['id'].astype('int')
keywords['id'] = keywords['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')

In [None]:
metadata.head()

We can see that our new columns `cast`, `crew`, and `keywords` are in a strange format - it looks like JSON stored in a string.

In [None]:
metadata.cast[0]

We can decode it a bit using the `literal_eval` function.

In [None]:
# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)
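
To see what `literal_eval` does on a single value, the sketch below parses one `genres` string in the format shown in the metadata output. It also shows why a JSON parser is not an option here: the strings use single quotes, which are valid in Python literals but not in JSON.

```python
import json
from ast import literal_eval

# A genres value in the same format as in the metadata
raw = "[{'id': 35, 'name': 'Comedy'}]"

# literal_eval safely evaluates a string containing a Python literal
parsed = literal_eval(raw)
print(type(parsed).__name__, parsed[0]["name"])  # list Comedy

# json.loads rejects the same string because of the single quotes
try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False
print(json_ok)  # False
```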

In [None]:
metadata.cast[0]

In [None]:
metadata.crew[0]

We can now build a function that fetches the director, for instance.

In [None]:
import numpy as np

def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [None]:
get_director(metadata.crew[0])

For `cast`, `keywords`, and `genres` we are just going to retrieve the first 3 (top 3) elements. We can also make a function for that. 

In [None]:
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [None]:
get_list(metadata.cast[0])

In [None]:
get_list(metadata.keywords[0])

In [None]:
get_list(metadata.genres[0])

With these helper functions, we can now define new features for director, cast, genres and keywords.

In [None]:
# Define new director, cast, genres and keywords features that are in a suitable form.
metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(get_list)

In [None]:
# Print the new features of the first 3 films
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

This new metadata about the movies is still text data, so we need to pre-process it to make it suitable for further analysis. There are several options for this, but essentially we want to vectorize the data, and to do this it can sometimes be beneficial to combine the data into one string ("soup" - I am not sure if this is a commonly used term!) before vectorizing it. The tutorial does this by replacing upper case letters with lower case letters and removing blank spaces, before concatenating the text strings into one string.

In [None]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [None]:
clean_data(metadata.cast[0])

We then apply that function to all the relevant columns.

In [None]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(clean_data)

In [None]:
# Print the new features of the first 3 films
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

In [None]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
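
The effect of cleaning and joining can be seen on a single record. The sketch below uses an illustrative, hand-written record and a simplified, hypothetical helper `clean` (lists only), joining the fields in the same order as `create_soup`. Stripping the spaces matters: it turns each name into one token, so that "Tom Hanks" does not share a token "tom" with every other Tom.

```python
def clean(names):
    """Lower-case each name and strip its spaces (simplified, lists only)."""
    return [name.replace(" ", "").lower() for name in names]

# Illustrative record, written by hand for this example
record = {
    "keywords": ["jealousy", "toy"],
    "cast": ["Tom Hanks", "Tim Allen"],
    "director": "John Lasseter",
    "genres": ["Animation", "Comedy"],
}

# Same field order as create_soup: keywords, cast, director, genres
soup = " ".join(
    clean(record["keywords"])
    + clean(record["cast"])
    + clean([record["director"]])
    + clean(record["genres"])
)
print(soup)  # jealousy toy tomhanks timallen johnlasseter animation comedy
```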

In [None]:
# Create a new soup feature
metadata['soup'] = metadata.apply(create_soup, axis=1)

In [None]:
metadata.soup[0]

For vectorization we will use something other than TF-IDF, since we are not dealing with traditional text documents: we do not want to down-weight an actor or director just because they appear in many movies. Thus, we will use the count vectorizer.

In [None]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['soup'])

In [None]:
count_matrix.shape

We will again use cosine similarity to measure how similar the resulting vectors are. Be aware that this will put a high load on the memory (and CPU)!

In [None]:
# Compute the Cosine Similarity matrix based on the count_matrix
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [None]:
# Reset index of your main DataFrame and construct reverse mapping as before
metadata = metadata.reset_index()
indices = pd.Series(metadata.index, index=metadata['title'])

In [None]:
get_recommendations('Toy Story', cosine_sim2)

In [None]:
get_recommendations('The Dark Knight Rises', cosine_sim2)

## An example of User-Based Collaborative Filtering

In this section, we will look at collaborative filtering. More specifically, we will build a user-based collaborative filtering recommender based on data about the users (their ratings of the movies). The example is based on the same movie dataset and the following Kaggle notebook: [https://www.kaggle.com/code/yagizcapa/user-based-recommender](https://www.kaggle.com/code/yagizcapa/user-based-recommender)

First, we read in the rating dataset.

In [None]:
ratings = pd.read_csv("ratings_small.csv")

In [None]:
df = metadata.merge(ratings, how="left", left_on="id", right_on="movieId")
df.head()

In [None]:
df.shape

In [None]:
df["title"].nunique()

In [None]:
df["userId"].nunique()

We see that there are 88822 user ratings by 671 users (of 42276 unique movies - it is not given that all movies have ratings).

Let us look at how many users rated the most-rated movies.

In [None]:
rating_counts = pd.DataFrame(df["title"].value_counts())

rating_counts.head(10)

You might wonder what happens to all the rows (and movies) that did not have a rating. We can remove those to ensure that the above count does not include rows where the rating is missing. By doing this, we also learn that 2794 movies have been rated by the 671 users.

In [None]:
df[["title", "rating"]].dropna().drop(columns=["rating"]).value_counts()

In [None]:
user_movie_df = df[["userId", "title", "rating"]]

In [None]:
user_movie_df

We create a dataframe with the user ratings only

In [None]:
user_movie_df = user_movie_df.pivot_table(index=["userId"], columns=["title"], values="rating")

In [None]:
user_movie_df.head()

In [None]:
user_movie_df.shape

Let us select a random user as an example case

In [None]:
random_user = np.array(user_movie_df.sample(random_state = 50).index)[0]
random_user

We get the movies that the random user has rated.

In [None]:
random_user_df = user_movie_df[user_movie_df.index == random_user]
random_user_df

In [None]:
random_user_movies_watched = random_user_df.columns[random_user_df.notna().any()].tolist()
random_user_movies_watched

In [None]:
len(random_user_movies_watched)

We select only those movies when looking for similar users.

In [None]:
movies_watched_df = user_movie_df[random_user_movies_watched]

In [None]:
movies_watched_df

In [None]:
movies_watched_df.shape

For each user, we now count how many of the movies rated by the random user they have also rated.

In [None]:
user_movie_count = movies_watched_df.T.notnull().sum()

In [None]:
user_movie_count

In [None]:
user_movie_count = user_movie_count.reset_index()
user_movie_count.columns = ["userId", "movie_count"]
user_movie_count

We select those users that have rated more than 70% of the movies the random user has rated.

In [None]:
user_same_movies = user_movie_count[user_movie_count["movie_count"] > (len(random_user_movies_watched)*70)/100]["userId"]
user_same_movies
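
The counting and thresholding steps can be followed on a small example. The sketch below uses hypothetical ratings stored as plain dictionaries instead of the pivoted dataframe; the user names, titles, and variable names are made up for illustration.

```python
# Hypothetical ratings: user -> {movie title: rating}
ratings_by_user = {
    "u1": {"A": 4.0, "B": 3.5, "C": 5.0, "D": 2.0},
    "u2": {"A": 5.0, "B": 3.0, "C": 4.5},
    "u3": {"A": 2.0, "E": 4.0},
    "u4": {"A": 4.0, "B": 4.0, "C": 4.0, "D": 3.0, "E": 1.0},
}

target = "u1"
rate_ratio = 0.70

# Movies the target user has rated
watched = set(ratings_by_user[target])

# For every user, count how many of those movies they have rated too
overlap = {user: len(watched & set(rated)) for user, rated in ratings_by_user.items()}

# Keep users who rated more than 70% of the target user's movies
threshold = len(watched) * rate_ratio
similar_users = [user for user, count in overlap.items() if count > threshold]
print(overlap)        # {'u1': 4, 'u2': 3, 'u3': 1, 'u4': 4}
print(similar_users)  # ['u1', 'u2', 'u4']
```

As in the notebook, the target user passes the threshold too and has to be removed in a later step.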

We create a data frame with the ratings of only these users.

In [None]:
final_df = movies_watched_df[movies_watched_df.index.isin(user_same_movies)]
final_df

We now calculate the correlation between all the users, that is, the correlation between the rows. As the `.corr` method on data frames calculates the correlations between columns, we have to transpose the data frame first.

In [None]:
corr_df = final_df.T.corr()
corr_df
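
To illustrate what a single entry of this correlation matrix means, here is a plain-Python sketch of the Pearson correlation between two users, computed only over the movies both have rated. As far as I know, this mirrors how pandas handles missing values in `.corr` (pairwise, using only complete pairs), but the function `pearson` and the rating vectors are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation over the items both users rated (None = not rated)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")  # not enough co-rated movies
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant rater has no defined correlation
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ratings of the same four movies by three users
u1 = [4.0, 3.0, 5.0, 2.0]
u2 = [5.0, 4.0, None, 3.0]  # always one point above u1 where both rated
u3 = [2.0, 4.0, 1.0, 5.0]   # roughly the opposite taste of u1

r12 = pearson(u1, u2)
r13 = pearson(u1, u3)
print(r12, r13)
```

User `u2` rates consistently one point higher than `u1` and is therefore perfectly correlated with `u1`, while `u3` has a strongly negative correlation.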

In [None]:
user_corr = corr_df[random_user].reset_index()
user_corr = user_corr.rename(columns={random_user: 'correlation'})
user_corr = user_corr.sort_values(by="correlation", ascending=False)
user_corr = user_corr.loc[user_corr["userId"] != random_user]
user_corr = user_corr.reset_index(drop=True)
user_corr

Now let us merge it with all the ratings of these users.

In [None]:
top_users_ratings = user_corr.merge(ratings[["userId", "movieId", "rating"]], how="inner")
top_users_ratings

We can now create ratings that are weighted with respect to the correlation.

In [None]:
top_users_ratings["weighted_rating"] = top_users_ratings["correlation"] * top_users_ratings["rating"]
top_users_ratings

For each movie, we can now take the average of the weighted ratings to get a final rating for all the movies (as a recommendation for the selected random user).

In [None]:
recommendation_df = top_users_ratings.groupby("movieId").agg({"weighted_rating": "mean"}).sort_values(by = "weighted_rating", ascending = False)
recommendation_df = recommendation_df.reset_index()
recommendation_df
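
The weighting and averaging steps can be traced on a handful of numbers. The sketch below uses hypothetical correlations and ratings for two similar users; all names and values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical correlations of two similar users with the target user
correlation = {"u2": 0.9, "u4": 0.5}

# Hypothetical ratings by those users: (user, movie, rating)
neighbour_ratings = [
    ("u2", "E", 4.0),
    ("u2", "F", 2.0),
    ("u4", "E", 5.0),
    ("u4", "G", 4.0),
]

# Weight every rating by the rater's correlation with the target user
weighted = defaultdict(list)
for user, movie, rating in neighbour_ratings:
    weighted[movie].append(correlation[user] * rating)

# Average the weighted ratings per movie and rank the movies
scores = {movie: sum(values) / len(values) for movie, values in weighted.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['E', 'G', 'F']
```

Movie "E" wins because both similar users rated it highly. Note that this sketch, like the cell above, does not exclude movies that the target user has already rated.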

In [None]:
movies_to_be_recommended = recommendation_df.merge(metadata[["id", "title"]], left_on="movieId", right_on="id").drop(columns=["id"])
movies_to_be_recommended = movies_to_be_recommended.head()
movies_to_be_recommended

We can now put it all together into a recommender function.

In [None]:
def user_based_recommender(input_user, user_movie_df, rate_ratio=0.70, num_recommendations=5):
    # Creating a list of movies the input user have rated
    input_user_df = user_movie_df[user_movie_df.index == input_user]
    input_user_movies_watched = input_user_df.columns[input_user_df.notna().any()].tolist()

    # Creating a dataframe with the user rating of the movies the input user have rated
    movies_watched_df = user_movie_df[input_user_movies_watched]

    # Counting how many movies other users have rated that the input user have also rated
    user_movie_count = movies_watched_df.T.notnull().sum()
    user_movie_count = user_movie_count.reset_index()
    user_movie_count.columns = ["userId", "movie_count"]
    
    # Selecting similar users based on a threshold on the share of co-rated movies
    user_same_movies = user_movie_count[user_movie_count["movie_count"] > (len(input_user_movies_watched)*rate_ratio)]["userId"]

    # Creating a correlation matrix based on ratings
    final_df = movies_watched_df[movies_watched_df.index.isin(user_same_movies)]
    corr_df = final_df.T.corr()

    # Creating a list of the users most correlated with the input user
    user_corr = corr_df[input_user].reset_index()
    user_corr = user_corr.rename(columns={input_user: 'correlation'})
    user_corr = user_corr.sort_values(by="correlation", ascending=False)
    user_corr = user_corr.loc[user_corr["userId"] != input_user]
    user_corr = user_corr.reset_index(drop=True)

    # Creating correlated weighting of rating
    top_users_ratings = user_corr.merge(ratings[["userId", "movieId", "rating"]], how="inner")
    top_users_ratings["weighted_rating"] = top_users_ratings["correlation"] * top_users_ratings["rating"]

    # Creating a recommendation dataframe
    recommendation_df = top_users_ratings.groupby("movieId").agg({"weighted_rating": "mean"}).sort_values(by = "weighted_rating", ascending = False)
    recommendation_df = recommendation_df.reset_index()

    # Creating the final recommendations
    movies_to_be_recommended = recommendation_df.merge(metadata[["id", "title"]], left_on="movieId", right_on="id").drop(columns=["id"])
    movies_to_be_recommended = movies_to_be_recommended.head(num_recommendations)

    return movies_to_be_recommended["title"]

In [None]:
user_movie_df

In [None]:
user_based_recommender(455, user_movie_df)

In [None]:
random_user_movies_watched