# Recommender systems

In this notebook we will go through various examples of recommender systems. 

The code in the notebook is based on the following [DataCamp tutorial](https://www.datacamp.com/tutorial/recommender-systems-python) and uses [The Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/data) from Kaggle, which contains movie metadata collected from TMDB and user ratings from MovieLens. We will only use part of the data, which is available on Moodle as the zip file "TheMovieDataset.zip".

## Simple recommender system

**First, we will build a simple recommender system that just recommends the Top 250 movies.**

For this to work, we have to decide how to rank the movies, which in turn means deciding how to assign a score to each movie.

For this, let us first look at the metadata about the movies.

In [3]:
#pip install pandas

In [1]:
# Import Pandas
import pandas as pd

# Load Movies Metadata
metadata = pd.read_csv('../Notebooks and data-13/movies_metadata.csv', low_memory=False)

# Print the first three rows
metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [3]:
metadata.shape

(45466, 24)

In [4]:
#metadata = metadata.iloc[0:30000, :]

In [5]:
metadata.shape

(45466, 24)

Considerations to take into account: the score should be based not only on the average vote, but also on how many people have actually voted on the movie. (Otherwise, a single high vote could make a movie the highest scoring.) Thus, we want a weighted score. For instance:
\begin{equation}
\text{Weighted Rating } (WR) = \left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)
\end{equation}
where $v$ is the number of votes for the movie (`vote_count`), $m$ is the minimum number of votes required to be listed in the chart, $R$ is the average rating of the movie (`vote_average`), and $C$ is the mean vote across all movies in the dataset.

$v$ and $R$ we already have in the metadata dataset, and $C$ we can calculate from it. However, $m$ is a hyperparameter we have to choose ourselves.
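
To get a feel for how the formula behaves, here is a minimal sketch with made-up numbers (the values of `C`, `m`, and the two movies are illustrative assumptions, not taken from the dataset). A movie with very few votes is pulled almost all the way to the overall mean $C$, while a movie with many votes keeps a score close to its own average $R$.

```python
def weighted_rating(v, R, m, C):
    """Weighted rating: blend the movie's own average R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative values only (assumptions, not computed from the dataset)
C = 5.6   # mean vote across all movies
m = 160   # minimum number of votes required

few_votes = weighted_rating(v=3, R=10.0, m=m, C=C)      # 3 votes, perfect average
many_votes = weighted_rating(v=5000, R=8.5, m=m, C=C)   # 5000 votes, lower average

print(round(few_votes, 2))   # ends up close to C
print(round(many_votes, 2))  # ends up close to R
```

The larger $m$ is chosen, the more votes a movie needs before its own average dominates the score.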

First let us calculate $C$:

In [7]:
# Calculate mean of vote average column
C = metadata['vote_average'].mean()
print(C)

5.618207215134185


For $m$ we will use the 90th percentile of the number of votes. That way, we only consider the movies that are in the top 10% with regard to number of votes.

In [9]:
# Calculate the minimum number of votes required to be in the chart, m
m = metadata['vote_count'].quantile(0.90)
print(m)

160.0


We will make a new dataframe `q_movies` that only contains the movies that have at least $m$ (160) votes.

In [11]:
# Filter out all qualified movies into a new DataFrame
q_movies = metadata.copy().loc[metadata['vote_count'] >= m]
q_movies.shape

(4555, 24)

We will now calculate a weighted ranking of the movies based on the formula above and store it in a new column called `score`. 

In [13]:
# Function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

In [14]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

In [15]:
q_movies

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,score
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,7.640253
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,6.820293
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,5.660700
5,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,187436818.0,170.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0,7.537201
8,False,,35000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,9091,tt0114576,en,Sudden Death,International action superstar Jean Claude Van...,...,64350171.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Terror goes into overtime.,Sudden Death,False,5.5,174.0,5.556626
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45177,False,"{'id': 442352, 'name': 'Brice Collection', 'po...",0,"[{'id': 35, 'name': 'Comedy'}]",,375798,tt5029602,fr,Brice 3,"Brice is back. The world has changed, but not ...",...,0.0,95.0,"[{'iso_639_1': 'fr', 'name': 'Français'}]",Released,,Brice 3,False,4.3,160.0,4.959104
45204,False,,0,"[{'id': 35, 'name': 'Comedy'}]",,417870,tt3564472,en,Girls Trip,Four girlfriends take a trip to New Orleans fo...,...,0.0,122.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"""Forgive us in advance for this wild weekend""",Girls Trip,False,7.1,393.0,6.671272
45258,False,"{'id': 466463, 'name': 'Descendants Collection...",0,"[{'id': 10770, 'name': 'TV Movie'}, {'id': 107...",,417320,tt5117876,en,Descendants 2,When the pressure to be royal becomes too much...,...,0.0,111.0,"[{'iso_639_1': 'da', 'name': 'Dansk'}]",Released,Long live evil.,Descendants 2,False,7.5,171.0,6.590372
45265,False,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,265189,tt2121382,sv,Turist,"While holidaying in the French Alps, a Swedish...",...,1359497.0,118.0,"[{'iso_639_1': 'fr', 'name': 'Français'}, {'is...",Released,,Force Majeure,False,6.8,255.0,6.344369


Let us sort the dataframe by this new `score` and print the top 20.

In [17]:
#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

#Print the top 20 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(20)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.445869
834,The Godfather,6024.0,8.5,8.425439
10309,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
12481,The Dark Knight,12269.0,8.3,8.265477
2843,Fight Club,9678.0,8.3,8.256385
292,Pulp Fiction,8670.0,8.3,8.251406
522,Schindler's List,4436.0,8.3,8.206639
23673,Whiplash,4376.0,8.3,8.205404
5481,Spirited Away,3968.0,8.3,8.196055
2211,Life Is Beautiful,3643.0,8.3,8.187171


We can now recommend new movies to a user based on this `score` - recommending the top movies according to this `score` that the user has not watched yet.

## Content-based filtering recommender systems

In this section, we will look at content-based filtering. That is, we will try to recommend movies that are similar in content to movies the user has already watched. The key here is to find a way to represent "content" and a way to measure the distance between "content".

First, we will take the content to be the plot description that we already have in the data. As the distance measure, we will use cosine similarity. That is, **we will recommend movies to the user whose plot descriptions are similar (measured by cosine similarity) to the plot descriptions of movies the user has already watched.**

The plot description is available in the variable `overview` of the metadata dataset. Let us look at an example.

In [22]:
#Print plot overviews of the first 5 movies.
metadata['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [23]:
metadata['overview'][0]

"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."

These plot descriptions are plain text strings and cannot be fed directly into a machine learning algorithm. Thus, we have to pre-process the `overview` variable. As when we looked at IMDB reviews that were labelled as positive or negative in connection with deep learning, we can use one-hot encoding. That is, we can make a column for each of the most common words and put a 1 if the word is in the plot description and a 0 if it is not.

This would work, but it is a crude encoding. We can do a bit better: instead of a 1, we can put a score between 0 and 1 that represents the importance of that word. One such importance score is *Term Frequency-Inverse Document Frequency* (TF-IDF). This score weighs how often the word appears in the given plot description against how many of the plot descriptions it occurs in overall.

By "term" we just mean word and by "document" we mean a plot description. Then we can first calculate the *relative term frequency* of a term in a document - that is, how often a word occurs in a particular plot description. The formula for this is:
$$
tf(t, d) = \frac{f_{t, d}}{len(d)} 
$$
where $t$ is the term, $d$ is the document, $f_{t, d}$ is the count of how many times the term $t$ appears in the document $d$, and $len(d)$ is the total count of terms in $d$. 

In addition, we can define the *inverse document frequency* by the formula:
$$
idf(t, D) = \log {\frac {\# D}{\# D_t}}
$$
where $D$ is the set of all documents (in our case all the plot descriptions), $D_t$ is the set of documents that contain the term $t$, $\#D$ the number of documents in $D$, and $\# D_t$ is the number of documents that contain $t$.

With relative term frequency and inverse document frequency defined, we can finally define *TF-IDF* as their product:
$$
\text{TF-IDF}(t, d, D) = tf(t, d) \cdot idf(t, D)
$$
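
As a minimal sketch of these definitions, the following computes TF-IDF by hand for a tiny made-up corpus (the three "plot descriptions" are illustrative assumptions, and whitespace splitting stands in for proper tokenization). It takes TF-IDF to be the product of the relative term frequency and the inverse document frequency.

```python
import math

# A tiny made-up corpus of three "plot descriptions" (illustrative only)
corpus = [
    "toy cowboy toy room".split(),
    "space ranger toy".split(),
    "cowboy horse ranch".split(),
]

def tf(term, doc):
    """Relative term frequency: share of the words in doc that equal term."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency; assumes term occurs in at least one document."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """TF-IDF as the product of tf and idf."""
    return tf(term, doc) * idf(term, docs)

first = corpus[0]
score_toy = tf_idf("toy", first, corpus)    # frequent in this document, but common overall
score_room = tf_idf("room", first, corpus)  # rarer in this document, but unique to it
print(score_toy, score_room)
```

Here "room" gets the higher score even though "toy" occurs twice, because "room" appears in no other document. Note that scikit-learn's `TfidfVectorizer` uses a slightly different variant by default (raw counts for the term frequency, a smoothed inverse document frequency, and L2-normalised rows), so its numbers will not match this sketch exactly.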

Luckily, we do not have to calculate these things manually, but can use built-in functionality of scikit-learn.

In [25]:
# We first replace missing values with an empty string
metadata['overview'] = metadata['overview'].fillna('')

#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

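#Cast the matrix to uint8. NOTE: this truncates every TF-IDF weight below 1 to 0,
#which is why the outputs below contain only zeros; remove this line to keep the float weights
tfidf_matrix = tfidf_matrix.astype("uint8")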

In [26]:
tfidf_matrix.toarray()[1, :]

array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)

In [27]:
tfidf_matrix

<45466x75827 sparse matrix of type '<class 'numpy.uint8'>'
	with 1210882 stored elements in Compressed Sparse Row format>
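
One thing worth checking at this point is the data type. TF-IDF weights are floats between 0 and 1, and casting floats to an unsigned integer type discards the fractional part. The following minimal sketch, using made-up weights, illustrates the effect of `astype("uint8")` in NumPy:

```python
import numpy as np

# Made-up TF-IDF-like weights (illustrative values only)
weights = np.array([0.0, 0.12, 0.57, 0.99, 1.0])

# Casting to uint8 drops the fractional part of each value
as_uint8 = weights.astype("uint8")
print(as_uint8)

# float32 uses half the memory of float64 while keeping the weights usable
as_float32 = weights.astype("float32")
print(as_float32)
```

If memory is the concern, a float type such as `float32` may be a safer choice than `uint8`, since it preserves the weights.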

Now that each movie is represented as a vector of length 75827 (a row of the matrix), we just need a way to measure the distance between two such vectors (movies/rows). For this, we will use cosine similarity, which is commonly used for tasks like this. Cosine similarity measures "the angle" between two vectors. If the vectors have the same direction (one is a positive multiple of the other) the cosine similarity is 1, if the vectors are orthogonal it is 0, and if the vectors point in completely opposite directions it is -1. (Because all entries in our rows are non-negative, we will never get negative cosine similarity values.) Cosine similarity is also fast to compute for sparse rows like the ones we have here (most values are 0). The formula for cosine similarity is:
$$
\cos(A, B) = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2} \, \sqrt{\sum_{i} b_i^2}}
$$
where $A$ and $B$ are vectors (in our case rows), $a_i$ is the i'th element of the vector $A$, and $b_i$ is the i'th element of the vector $B$.
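
The three reference cases mentioned above can be checked with a minimal sketch in plain Python (the vectors are made up for illustration; the helper `cosine` is a hypothetical name, not the scikit-learn function):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (neither may be all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine([1, 2, 0], [2, 4, 0])   # proportional vectors
orthogonal = cosine([1, 0, 0], [0, 3, 0])       # no shared non-zero position
opposite = cosine([1, 2, 0], [-1, -2, 0])       # opposite directions

print(same_direction, orthogonal, opposite)
```

Two movies whose descriptions share no words have no shared non-zero position, so they fall in the orthogonal case. A vector of only zeros has no direction at all, which is why the sketch excludes it.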

We calculate the cosine similarity between every pair of movies and store the result in a matrix (2D array) of shape 45466 x 45466, where each row and each column corresponds to a movie. In this way, row $i$ holds the cosine similarity between movie $i$ and all 45466 movies (including itself). (The tutorial argues for using a linear kernel to calculate cosine similarities faster, but we might as well just use the `cosine_similarity` function from scikit-learn - it is often fast enough.)

In [30]:
from sklearn.metrics.pairwise import cosine_similarity

In [31]:
%%time 
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

CPU times: total: 8.55 s
Wall time: 8.58 s


The linear kernel actually turned out to be slower in this case! Let us remove the resulting matrix (`cosine_simLK`), as it is quite big and takes up memory.

In [33]:
cosine_sim

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [34]:
cosine_sim.shape

(45466, 45466)

This matrix is symmetric in the sense that `cosine_sim[0, 1]` tells us how similar the first movie (index 0) is to the second movie (index 1), which is exactly the same value as `cosine_sim[1, 0]`.

In [36]:
cosine_sim[0, 1]

0.0

In [37]:
cosine_sim[1, 0]

0.0

To get an idea of whether this makes sense, we can look up the corresponding titles in the original metadata dataset. For later use, let us make a reverse map from titles to indices.

In [39]:
#Construct a reverse map of indices and movie titles
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

In [40]:
indices[0:10]

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Heat                           5
Sabrina                        6
Tom and Huck                   7
Sudden Death                   8
GoldenEye                      9
dtype: int64

We can see that the similarity `cosine_sim[0, 1]` is the similarity between "Toy Story" and "Jumanji".

We can now define a recommender function, that is, we can define a function that takes in a movie title as input and returns a list of the 10 most similar movies to the input movie.
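
The core of such a function is a ranking step: take one row of the similarity matrix, sort it in descending order, and leave out the query movie itself. A minimal sketch on a made-up 4x4 similarity matrix (the values and the helper name `top_k_similar` are illustrative assumptions):

```python
# Made-up symmetric similarity matrix for four movies (illustrative values only)
sim = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
]

def top_k_similar(idx, sim, k):
    """Return the indices of the k movies most similar to movie idx."""
    scores = list(enumerate(sim[idx]))
    # Sort (index, score) pairs by score, highest first
    scores = sorted(scores, key=lambda pair: pair[1], reverse=True)
    # Leave out the query movie by index, then keep the first k
    return [i for i, _ in scores if i != idx][:k]

ranked = top_k_similar(0, sim, k=2)

# With an all-zero matrix every score is tied, and the stable sort keeps row order
all_zero = [[0.0] * 4 for _ in range(4)]
tied = top_k_similar(0, all_zero, k=2)

print(ranked, tied)
```

Filtering by index does not rely on the query movie being ranked first, which slicing off the first element does. Because Python's `sorted` is stable, tied scores keep their original order, so a similarity matrix without any signal simply returns the first rows of the dataset.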

In [43]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]

We can now try it out.

In [45]:
get_recommendations('Toy Story')

1                         Jumanji
2                Grumpier Old Men
3               Waiting to Exhale
4     Father of the Bride Part II
5                            Heat
6                         Sabrina
7                    Tom and Huck
8                    Sudden Death
9                       GoldenEye
10         The American President
Name: title, dtype: object

In [46]:
get_recommendations('The Dark Knight Rises')

1                         Jumanji
2                Grumpier Old Men
3               Waiting to Exhale
4     Father of the Bride Part II
5                            Heat
6                         Sabrina
7                    Tom and Huck
8                    Sudden Death
9                       GoldenEye
10         The American President
Name: title, dtype: object

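Note that both calls return exactly the same ten movies, which are simply rows 1 to 10 of the dataset. Every similarity score in `cosine_sim` shown above is 0, so the sort has nothing to rank by and keeps the original row order. The cause is the `uint8` cast of the TF-IDF matrix, which truncates the weights to 0. With the float weights kept, the recommendations should reflect the plot descriptions, although such a recommender is, of course, still not perfect.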

## Improved content-based filtering

We can **improve the recommender by considering more metadata about the movies, such as starring actors, the director, related genres, and keywords**. First, we load this additional data and merge it with our original metadata.

In [49]:
# Load keywords and credits
credits = pd.read_csv('../Notebooks and data-13/credits.csv')
keywords = pd.read_csv('../Notebooks and data-13/keywords.csv')

# Remove rows with bad IDs.
metadata = metadata.drop([19730, 29503, 35587])

# Convert IDs to int. Required for merging
credits['id'] = credits['id'].astype('int')
keywords['id'] = keywords['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')

In [50]:
metadata.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


We can see that our new columns `cast`, `crew`, and `keywords` are in a strange format - they look like JSON stored in a string (strictly speaking, they are Python literals with single-quoted strings).

In [52]:
metadata.cast[0]

"[{'cast_id': 14, 'character': 'Woody (voice)', 'credit_id': '52fe4284c3a36847f8024f95', 'gender': 2, 'id': 31, 'name': 'Tom Hanks', 'order': 0, 'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'}, {'cast_id': 15, 'character': 'Buzz Lightyear (voice)', 'credit_id': '52fe4284c3a36847f8024f99', 'gender': 2, 'id': 12898, 'name': 'Tim Allen', 'order': 1, 'profile_path': '/uX2xVf6pMmPepxnvFWyBtjexzgY.jpg'}, {'cast_id': 16, 'character': 'Mr. Potato Head (voice)', 'credit_id': '52fe4284c3a36847f8024f9d', 'gender': 2, 'id': 7167, 'name': 'Don Rickles', 'order': 2, 'profile_path': '/h5BcaDMPRVLHLDzbQavec4xfSdt.jpg'}, {'cast_id': 17, 'character': 'Slinky Dog (voice)', 'credit_id': '52fe4284c3a36847f8024fa1', 'gender': 2, 'id': 12899, 'name': 'Jim Varney', 'order': 3, 'profile_path': '/eIo2jVVXYgjDtaHoF19Ll9vtW7h.jpg'}, {'cast_id': 18, 'character': 'Rex (voice)', 'credit_id': '52fe4284c3a36847f8024fa5', 'gender': 2, 'id': 12900, 'name': 'Wallace Shawn', 'order': 4, 'profile_path': '/oGE6JqPP2xH4t

We can decode it a bit using the `literal_eval` function.
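
To see why `literal_eval` fits here better than a JSON parser, consider this minimal sketch with a made-up string in the same style as the columns (the ids and names are illustrative assumptions):

```python
import json
from ast import literal_eval

# Made-up value in the same style as the `keywords` column (illustrative only)
raw = "[{'id': 1, 'name': 'friendship'}, {'id': 2, 'name': 'rivalry'}]"

# JSON requires double-quoted strings, so a JSON parser rejects this text
try:
    json.loads(raw)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# literal_eval safely evaluates Python literals (lists, dicts, strings, numbers)
parsed = literal_eval(raw)
names = [item["name"] for item in parsed]

print(parsed_as_json, names)
```

Unlike the built-in `eval`, `literal_eval` only accepts literal structures and does not execute arbitrary code, which makes it a reasonable choice for data read from a file.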

In [54]:
# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)

In [55]:
metadata.cast[0]

[{'cast_id': 14,
  'character': 'Woody (voice)',
  'credit_id': '52fe4284c3a36847f8024f95',
  'gender': 2,
  'id': 31,
  'name': 'Tom Hanks',
  'order': 0,
  'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'},
 {'cast_id': 15,
  'character': 'Buzz Lightyear (voice)',
  'credit_id': '52fe4284c3a36847f8024f99',
  'gender': 2,
  'id': 12898,
  'name': 'Tim Allen',
  'order': 1,
  'profile_path': '/uX2xVf6pMmPepxnvFWyBtjexzgY.jpg'},
 {'cast_id': 16,
  'character': 'Mr. Potato Head (voice)',
  'credit_id': '52fe4284c3a36847f8024f9d',
  'gender': 2,
  'id': 7167,
  'name': 'Don Rickles',
  'order': 2,
  'profile_path': '/h5BcaDMPRVLHLDzbQavec4xfSdt.jpg'},
 {'cast_id': 17,
  'character': 'Slinky Dog (voice)',
  'credit_id': '52fe4284c3a36847f8024fa1',
  'gender': 2,
  'id': 12899,
  'name': 'Jim Varney',
  'order': 3,
  'profile_path': '/eIo2jVVXYgjDtaHoF19Ll9vtW7h.jpg'},
 {'cast_id': 18,
  'character': 'Rex (voice)',
  'credit_id': '52fe4284c3a36847f8024fa5',
  'gender': 2,
  'id': 12900,
 

In [56]:
metadata.crew[0]

[{'credit_id': '52fe4284c3a36847f8024f49',
  'department': 'Directing',
  'gender': 2,
  'id': 7879,
  'job': 'Director',
  'name': 'John Lasseter',
  'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f4f',
  'department': 'Writing',
  'gender': 2,
  'id': 12891,
  'job': 'Screenplay',
  'name': 'Joss Whedon',
  'profile_path': '/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f55',
  'department': 'Writing',
  'gender': 2,
  'id': 7,
  'job': 'Screenplay',
  'name': 'Andrew Stanton',
  'profile_path': '/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f5b',
  'department': 'Writing',
  'gender': 2,
  'id': 12892,
  'job': 'Screenplay',
  'name': 'Joel Cohen',
  'profile_path': '/dAubAiZcvKFbboWlj7oXOkZnTSu.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f61',
  'department': 'Writing',
  'gender': 0,
  'id': 12893,
  'job': 'Screenplay',
  'name': 'Alec Sokolow',
  'profile_path': '/v79vlRYi94BZUQnkkyzn

We can now build a function that fetches the director, for instance.

In [58]:
import numpy as np

def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [59]:
get_director(metadata.crew[0])

'John Lasseter'

For `cast`, `keywords`, and `genres` we are just going to retrieve the first 3 (top 3) elements. We can also make a function for that.

In [61]:
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [62]:
get_list(metadata.cast[0])

['Tom Hanks', 'Tim Allen', 'Don Rickles']

In [63]:
get_list(metadata.keywords[0])

['jealousy', 'toy', 'boy']

In [64]:
get_list(metadata.genres[0])

['Animation', 'Comedy', 'Family']

With these helper functions, we can now define new features for director, cast, genres and keywords.

In [66]:
# Define new director, cast, genres and keywords features that are in a suitable form.
metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(get_list)

In [67]:
# Print the new features of the first 3 films
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy]","[Animation, Comedy, Family]"
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[Adventure, Fantasy, Family]"
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger]","[Romance, Comedy]"


This new metadata about movies is still text data, so we need to pre-process it to make it fit for further analysis. There are several options for this, but essentially we want to vectorize the data, and to do this it can sometimes be beneficial to combine the data into one string ("soup" - I am not sure if this is a commonly used term!) before vectorizing it. The tutorial does this by replacing upper case letters with lower case letters and removing blank spaces, before concatenating the text strings into one string.
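
A likely reason for removing the spaces is that a vectorizer splits text into separate words, so two different people who share a first name would otherwise appear to have a word in common. A minimal sketch of the effect (the helper `tokens` is hypothetical, and whitespace splitting stands in for the vectorizer's tokenization):

```python
def tokens(names, strip_spaces):
    """Turn a list of names into the set of tokens a word-based vectorizer would see."""
    result = set()
    for name in names:
        name = name.lower()
        if strip_spaces:
            result.add(name.replace(" ", ""))   # one token per person
        else:
            result.update(name.split())         # one token per word
    return result

cast_a = ["Tom Hanks"]
cast_b = ["Tom Cruise"]

# Without stripping, the two casts appear to share the token 'tom'
shared_raw = tokens(cast_a, strip_spaces=False) & tokens(cast_b, strip_spaces=False)

# With stripping, 'tomhanks' and 'tomcruise' are distinct tokens
shared_clean = tokens(cast_a, strip_spaces=True) & tokens(cast_b, strip_spaces=True)

print(shared_raw, shared_clean)
```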

In [69]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [70]:
clean_data(metadata.cast[0])

['tomhanks', 'timallen', 'donrickles']

We then apply that function to all the relevant columns.

In [72]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(clean_data)

In [73]:
# Print the new features of the first 3 films
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[tomhanks, timallen, donrickles]",johnlasseter,"[jealousy, toy, boy]","[animation, comedy, family]"
1,Jumanji,"[robinwilliams, jonathanhyde, kirstendunst]",joejohnston,"[boardgame, disappearance, basedonchildren'sbook]","[adventure, fantasy, family]"
2,Grumpier Old Men,"[waltermatthau, jacklemmon, ann-margret]",howarddeutch,"[fishing, bestfriend, duringcreditsstinger]","[romance, comedy]"


In [74]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [75]:
# Create a new soup feature
metadata['soup'] = metadata.apply(create_soup, axis=1)

In [76]:
metadata.soup[0]

'jealousy toy boy tomhanks timallen donrickles johnlasseter animation comedy family'

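For vectorization we will use something other than TF-IDF, since we are not dealing with traditional text documents: TF-IDF would down-weight an actor or director simply for appearing in many movies, which is not what we want here. Thus we will use the count vectorizer.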
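
As a minimal sketch of what count vectorization amounts to, the following builds count vectors for two soups by hand and compares them. The first soup is the one shown above; the second is made up for illustration. Whitespace splitting is a simplification here, since scikit-learn's `CountVectorizer` applies its own tokenization and stop-word removal.

```python
from collections import Counter
import math

# First soup is taken from the output above; the second is a made-up example
soup_a = "jealousy toy boy tomhanks timallen donrickles johnlasseter animation comedy family"
soup_b = "toy friendship tomhanks timallen johnlasseter animation comedy family"

counts_a = Counter(soup_a.split())
counts_b = Counter(soup_b.split())

# The shared vocabulary, in a fixed order, defines the vector positions
vocabulary = sorted(set(counts_a) | set(counts_b))
vec_a = [counts_a[word] for word in vocabulary]
vec_b = [counts_b[word] for word in vocabulary]

# Cosine similarity of the two count vectors
dot = sum(x * y for x, y in zip(vec_a, vec_b))
norm_a = math.sqrt(sum(x * x for x in vec_a))
norm_b = math.sqrt(sum(y * y for y in vec_b))
similarity = dot / (norm_a * norm_b)

print(len(vocabulary), dot, round(similarity, 3))
```

Every shared actor, director, keyword, or genre adds 1 to the dot product, regardless of how many other movies that token appears in.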

In [78]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['soup'])

In [79]:
count_matrix.shape

(46628, 73881)

We will again use cosine similarity to measure how similar the resulting vectors are. Be aware that this will put a high load on the memory (and CPU)!

In [81]:
# Compute the Cosine Similarity matrix based on the count_matrix
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [82]:
# Reset index of your main DataFrame and construct reverse mapping as before
metadata = metadata.reset_index()
indices = pd.Series(metadata.index, index=metadata['title'])

In [83]:
get_recommendations('Toy Story', cosine_sim2)

3012                       Toy Story 2
15444                      Toy Story 3
29156                  Superstar Goofy
25951       Toy Story That Time Forgot
22064             Toy Story of Terror!
3324                 Creature Comforts
25949                  Partysaurus Rex
27560                            Anina
43059    Dexter's Laboratory: Ego Trip
27959                    Radiopiratene
Name: title, dtype: object

In [84]:
get_recommendations('The Dark Knight Rises', cosine_sim2)

12541      The Dark Knight
10170        Batman Begins
9271                Shiner
9834       Amongst Friends
7732              Mitchell
516      Romeo Is Bleeding
11411         The Prestige
24040            Quicksand
24984             Deadfall
41043                 Sara
Name: title, dtype: object

## An example of User-Based Collaborative Filtering

In this section, we will look at collaborative filtering. More specifically, we will build a user-based collaborative filter based on data about the users (their ratings of the movies). The example is based on the same movie dataset and the following Kaggle notebook: [https://www.kaggle.com/code/yagizcapa/user-based-recommender](https://www.kaggle.com/code/yagizcapa/user-based-recommender)

First we read in the rating dataset.

In [122]:
ratings = pd.read_csv("../Notebooks and data-13/ratings_small.csv")

In [124]:
df = metadata.merge(ratings, how="left", left_on="id", right_on="movieId")
df.head()

Unnamed: 0,index,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,vote_count,cast,crew,keywords,director,soup,userId,movieId,rating,timestamp
0,0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[animation, comedy, family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,...,5415.0,"[tomhanks, timallen, donrickles]","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[jealousy, toy, boy]",johnlasseter,jealousy toy boy tomhanks timallen donrickles ...,,,,
1,1,False,,65000000,"[adventure, fantasy, family]",,8844,tt0113497,en,Jumanji,...,2413.0,"[robinwilliams, jonathanhyde, kirstendunst]","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[boardgame, disappearance, basedonchildren'sbook]",joejohnston,boardgame disappearance basedonchildren'sbook ...,,,,
2,2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[romance, comedy]",,15602,tt0113228,en,Grumpier Old Men,...,92.0,"[waltermatthau, jacklemmon, ann-margret]","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[fishing, bestfriend, duringcreditsstinger]",howarddeutch,fishing bestfriend duringcreditsstinger walter...,,,,
3,3,False,,16000000,"[comedy, drama, romance]",,31357,tt0114885,en,Waiting to Exhale,...,34.0,"[whitneyhouston, angelabassett, lorettadevine]","[{'credit_id': '52fe44779251416c91011acb', 'de...","[basedonnovel, interracialrelationship, single...",forestwhitaker,basedonnovel interracialrelationship singlemot...,,,,
4,4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[comedy],,11862,tt0113041,en,Father of the Bride Part II,...,173.0,"[stevemartin, dianekeaton, martinshort]","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[baby, midlifecrisis, confidence]",charlesshyer,baby midlifecrisis confidence stevemartin dian...,,,,
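One detail of this merge is worth double-checking before trusting the titles. As far as we know, `movieId` in `ratings_small.csv` is a MovieLens id, while `id` in `movies_metadata.csv` is a TMDB id, and the Kaggle dataset provides `links_small.csv` (columns `movieId`, `imdbId`, `tmdbId`) to translate between the two. The outputs in this notebook are shown as produced by the direct merge above. The sketch below only illustrates, on small made-up frames, how such a translation step could look; all `toy_*` names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for ratings_small.csv, links_small.csv
# and movies_metadata.csv (all values are invented for illustration).
toy_ratings = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [10, 20, 10], "rating": [4.0, 3.5, 5.0]}
)
toy_links = pd.DataFrame({"movieId": [10, 20], "tmdbId": [710.0, 9087.0]})
toy_metadata = pd.DataFrame(
    {"id": [710, 9087, 5], "title": ["Movie A", "Movie B", "Movie C"]}
)

# Translate the ratings' movieId into the id space used by the metadata.
ratings_tmdb = toy_ratings.merge(toy_links, on="movieId", how="inner")
ratings_tmdb["tmdbId"] = ratings_tmdb["tmdbId"].astype("int64")

# Join on the translated id instead of the raw movieId.
toy_df = toy_metadata.merge(ratings_tmdb, how="left", left_on="id", right_on="tmdbId")
print(toy_df[["title", "userId", "rating"]])
```

In this toy case "Movie A" receives both of its ratings, and "Movie C" stays in the result with missing values because nobody rated it.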


In [125]:
df.shape

(88822, 34)

In [126]:
df["title"].nunique()

42276

In [127]:
df["userId"].nunique()

671

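The merged data frame has 88822 rows, 671 distinct users and 42276 unique movie titles. Because we used a left join, movies without any rating are kept as rows with missing values in the rating columns, so not every row is a rating and not every movie has been rated.

Let us look at how often the most-rated movies appear.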
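To see how many rows of a left join actually carry a rating, one option is the `indicator=True` argument of `merge`, which adds a `_merge` column telling where each row came from. This is a minimal sketch on hypothetical toy frames (`toy_movies` and `toy_ratings` are not part of the notebook):

```python
import pandas as pd

# Three movies, but only movie 1 has ratings (two of them).
toy_movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame({"userId": [7, 8], "movieId": [1, 1], "rating": [4.0, 2.0]})

merged = toy_movies.merge(
    toy_ratings, how="left", left_on="id", right_on="movieId", indicator=True
)

n_rows = len(merged)                                          # all rows of the left join
n_rated_rows = int((merged["_merge"] == "both").sum())        # rows that are real ratings
n_unrated_movies = int((merged["_merge"] == "left_only").sum())  # movies without ratings
print(n_rows, n_rated_rows, n_unrated_movies)
```

Here the join has four rows, of which only two are ratings; the other two are unrated movies kept by the left join.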

In [130]:
rating_counts = pd.DataFrame(df["title"].value_counts())

rating_counts.head(10)

Unnamed: 0_level_0,count
title,Unnamed: 1_level_1
Terminator 3: Rise of the Machines,324
The Million Dollar Hotel,311
Solaris,305
The 39 Steps,293
Monsoon Wedding,274
Once Were Warriors,244
Three Colors: Red,228
Men in Black II,224
The Passion of Joan of Arc,218
Silent Hill,215


You might wonder what happens to the rows (and movies) that do not have a rating. The count above includes them, so we remove them to ensure that only rows with an actual rating are counted. By doing this, we also learn that 2794 movies have been rated by the 671 users.

In [132]:
df[["title", "rating"]].dropna().drop(columns=["rating"]).value_counts()

title                             
Terminator 3: Rise of the Machines    324
The Million Dollar Hotel              311
Solaris                               305
The 39 Steps                          291
Monsoon Wedding                       274
                                     ... 
Kaiji 2: The Ultimate Gambler           1
K-PAX                                   1
Just Call Me Nobody                     1
Junebug                                 1
Şaban Oğlu Şaban                        1
Name: count, Length: 2794, dtype: int64
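An alternative that avoids the explicit `dropna` is to group by title and use `count`, which skips missing values. The sketch below uses a hypothetical toy frame to show the difference between counting rows and counting ratings:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "A", "B", "C", "C", "C"],
    "rating": [4.0, 3.0, np.nan, 5.0, np.nan, 2.0],
})

# Counts rows per title, including rows where the rating is missing.
rows_per_title = toy["title"].value_counts()

# Counts only the non-missing ratings per title.
ratings_per_title = toy.groupby("title")["rating"].count()

# Titles with at least one rating.
rated_titles = ratings_per_title[ratings_per_title > 0]
print(rows_per_title, ratings_per_title, sep="\n")
```

Unlike the `dropna` version, `groupby(...).count()` keeps unrated titles with a count of 0, so they have to be filtered out when only rated movies are of interest.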

In [133]:
user_movie_df = df[["userId", "title", "rating"]]

In [134]:
user_movie_df

Unnamed: 0,userId,title,rating
0,,Toy Story,
1,,Jumanji,
2,,Grumpier Old Men,
3,,Waiting to Exhale,
4,,Father of the Bride Part II,
...,...,...,...
88817,,Subdue,
88818,,Century of Birthing,
88819,,Betrayal,
88820,,Satan Triumphant,


We now create a data frame with one row per user and one column per movie title, where each cell holds the user's rating of that movie.

In [136]:
user_movie_df = user_movie_df.pivot_table(index=["userId"], columns=["title"], values="rating")

In [137]:
user_movie_df.head()

title,!Women Art Revolution,'Gator Bait,'Twas the Night Before Christmas,...And God Created Woman,00 Schneider - Jagd auf Nihil Baxter,10 Items or Less,10 Things I Hate About You,"10,000 BC",11'09''01 - September 11,12 Angry Men,...,Zodiac,Zombie Flesh Eaters,Zombie Holocaust,Zozo,eXistenZ,xXx,¡Three Amigos!,À nos amours,Ödipussi,Şaban Oğlu Şaban
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1.0,,,,,,,,,,,...,,,,,,,,,,
2.0,,,,,,,,,,,...,,,,,,,,,,
3.0,,,,,,,,,,,...,,,,,,,,,,
4.0,,,,,,,,,,,...,,,,,,,,,,
5.0,,,,,,,,,,,...,,,,,,,,,,


In [138]:
user_movie_df.shape

(671, 2794)
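The shape matches the counts above: `pivot_table` drops the rows without a `userId`, so only the 671 users and the 2794 rated titles remain. If the same user has several ratings for the same title, they are averaged, because `mean` is the default aggregation. A small sketch on a hypothetical long-format frame:

```python
import numpy as np
import pandas as pd

toy_long = pd.DataFrame({
    "userId": [1.0, 1.0, 2.0, 2.0, np.nan],
    "title": ["A", "B", "A", "A", "C"],
    "rating": [4.0, 3.0, 5.0, 3.0, np.nan],
})

# Rows = users, columns = titles, cells = (mean) rating.
toy_matrix = toy_long.pivot_table(index="userId", columns="title", values="rating")
print(toy_matrix)
```

User 2 rated "A" twice (5.0 and 3.0), so the cell holds the mean 4.0, and the unrated title "C" does not appear as a column.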

Let us select a random user as an example case

In [140]:
random_user = user_movie_df.sample(random_state=50).index[0]
random_user

455.0

Next, we get the movies the random user has rated.

In [142]:
random_user_df = user_movie_df[user_movie_df.index == random_user]
random_user_df

title,!Women Art Revolution,'Gator Bait,'Twas the Night Before Christmas,...And God Created Woman,00 Schneider - Jagd auf Nihil Baxter,10 Items or Less,10 Things I Hate About You,"10,000 BC",11'09''01 - September 11,12 Angry Men,...,Zodiac,Zombie Flesh Eaters,Zombie Holocaust,Zozo,eXistenZ,xXx,¡Three Amigos!,À nos amours,Ödipussi,Şaban Oğlu Şaban
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
455.0,,,,,,,,,,,...,,,,,,,,,,


In [143]:
random_user_movies_watched = random_user_df.columns[random_user_df.notna().any()].tolist()
random_user_movies_watched

['A Nightmare on Elm Street',
 'Back to the Future Part II',
 'Bang, Boom, Bang',
 'Batman Returns',
 'Beauty and the Beast',
 'Belle Époque',
 'Breaking the Waves',
 'Don Juan DeMarco',
 'Frankenstein Conquers the World',
 'Harry Potter and the Prisoner of Azkaban',
 'Monsoon Wedding',
 'Sissi',
 'Terminator 3: Rise of the Machines',
 'The Bourne Supremacy',
 'The Conversation',
 'The Passion of Joan of Arc',
 'The Third Man']

In [144]:
len(random_user_movies_watched)

17
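The same list can also be obtained by selecting the user's row with `.loc` and dropping the missing values. This is only an alternative formulation, sketched on a hypothetical toy matrix, and it gives the same result as the boolean-mask approach used above:

```python
import numpy as np
import pandas as pd

toy_matrix = pd.DataFrame(
    {"A": [4.0, np.nan], "B": [np.nan, 3.0], "C": [2.0, 5.0]},
    index=pd.Index([1.0, 2.0], name="userId"),
)
user = 1.0

# Approach used in the notebook: boolean mask over the columns.
user_df = toy_matrix[toy_matrix.index == user]
rated_via_mask = user_df.columns[user_df.notna().any()].tolist()

# Alternative: take the user's row and drop the missing ratings.
rated_via_loc = toy_matrix.loc[user].dropna().index.tolist()
print(rated_via_loc)
```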

We keep only those movies when looking for similar users.

In [146]:
movies_watched_df = user_movie_df[random_user_movies_watched]

In [147]:
movies_watched_df

title,A Nightmare on Elm Street,Back to the Future Part II,"Bang, Boom, Bang",Batman Returns,Beauty and the Beast,Belle Époque,Breaking the Waves,Don Juan DeMarco,Frankenstein Conquers the World,Harry Potter and the Prisoner of Azkaban,Monsoon Wedding,Sissi,Terminator 3: Rise of the Machines,The Bourne Supremacy,The Conversation,The Passion of Joan of Arc,The Third Man
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
1.0,,,,,,,,,,,,,,,,,
2.0,3.0,3.0,,3.0,,,,,,,4.0,3.0,4.0,,5.0,,
3.0,2.5,,,,,,,,,,,,4.5,,3.0,,
4.0,,,,5.0,,,,,,,5.0,,5.0,,,,
5.0,4.0,,3.5,4.0,,,,,,,,,,3.5,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667.0,3.0,3.0,,,,,,,,,4.0,3.0,5.0,,2.0,,
668.0,,,,,,,,,,,,,5.0,,,,
669.0,,,,,,,,,,,3.0,,,,,,
670.0,,,,,,,,,,,,4.0,,,,,


In [148]:
movies_watched_df.shape

(671, 17)

For every user, we now count how many of the movies rated by the random user they have rated as well.

In [150]:
user_movie_count = movies_watched_df.T.notnull().sum()

In [151]:
user_movie_count

userId
1.0      0
2.0      7
3.0      3
4.0      3
5.0      4
        ..
667.0    6
668.0    1
669.0    1
670.0    1
671.0    4
Length: 671, dtype: int64
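Transposing and summing is one way to count the non-missing values per user. Summing along `axis=1` gives the same result without the transpose, as this sketch on a hypothetical toy frame suggests:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"A": [4.0, np.nan, 3.0], "B": [np.nan, np.nan, 2.0]},
    index=pd.Index([1.0, 2.0, 3.0], name="userId"),
)

# As in the notebook: transpose, then count non-missing values per column.
count_transposed = toy.T.notnull().sum()

# Equivalent: count non-missing values per row directly.
count_rowwise = toy.notna().sum(axis=1)
print(count_rowwise)
```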

In [152]:
user_movie_count = user_movie_count.reset_index()
user_movie_count.columns = ["userId", "movie_count"]
user_movie_count

Unnamed: 0,userId,movie_count
0,1.0,0
1,2.0,7
2,3.0,3
3,4.0,3
4,5.0,4
...,...,...
666,667.0,6
667,668.0,1
668,669.0,1
669,670.0,1


We select the users who have rated more than 70% of the movies the random user has rated.

In [154]:
user_same_movies = user_movie_count[user_movie_count["movie_count"] > (len(random_user_movies_watched)*70)/100]["userId"]
user_same_movies

14      15.0
18      19.0
29      30.0
72      73.0
87      88.0
118    119.0
129    130.0
133    134.0
149    150.0
164    165.0
211    212.0
246    247.0
294    295.0
305    306.0
354    355.0
383    384.0
387    388.0
451    452.0
454    455.0
456    457.0
460    461.0
467    468.0
517    518.0
546    547.0
561    562.0
563    564.0
573    574.0
579    580.0
606    607.0
623    624.0
653    654.0
663    664.0
Name: userId, dtype: float64
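To make the threshold concrete: 70% of 17 movies is 11.9, and because the comparison is strict, a user needs at least 12 movies in common to be selected. The random user (455) also passes this test, which is why they appear in the list above. A short sketch of the arithmetic, with a hypothetical `toy_counts` frame:

```python
import math
import pandas as pd

n_movies = 17                        # movies rated by the example user
threshold = n_movies * 70 / 100      # same expression as in the cell above
min_common = math.floor(threshold) + 1   # smallest integer count strictly above the threshold

# Hypothetical counts: only users strictly above the threshold are kept.
toy_counts = pd.DataFrame({"userId": [1.0, 2.0, 3.0], "movie_count": [17, 12, 11]})
selected = toy_counts[toy_counts["movie_count"] > threshold]["userId"].tolist()
print(threshold, min_common, selected)
```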

We create a data frame with the ratings of only these users.

In [156]:
final_df = movies_watched_df[movies_watched_df.index.isin(user_same_movies)]
final_df

title,A Nightmare on Elm Street,Back to the Future Part II,"Bang, Boom, Bang",Batman Returns,Beauty and the Beast,Belle Époque,Breaking the Waves,Don Juan DeMarco,Frankenstein Conquers the World,Harry Potter and the Prisoner of Azkaban,Monsoon Wedding,Sissi,Terminator 3: Rise of the Machines,The Bourne Supremacy,The Conversation,The Passion of Joan of Arc,The Third Man
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
15.0,4.0,3.0,2.0,4.0,4.0,,3.5,2.0,3.0,,3.0,5.0,5.0,5.0,4.0,2.5,2.5
19.0,3.0,3.0,3.0,3.0,3.0,,4.0,,,,4.0,4.0,5.0,,4.0,3.0,1.0
30.0,4.0,4.0,2.0,3.0,,4.0,,,5.0,,4.0,5.0,5.0,,4.0,4.0,4.0
73.0,3.0,3.5,2.0,5.0,3.0,3.0,3.0,3.5,3.5,3.0,4.0,4.0,5.0,4.5,4.0,3.5,
88.0,3.0,3.0,2.0,4.0,3.5,,,0.5,4.0,,3.0,2.5,3.5,4.0,3.0,3.5,
119.0,2.0,,,3.0,3.0,4.0,4.0,,2.0,,4.0,3.0,5.0,4.0,4.0,3.0,
130.0,3.0,2.0,,3.5,2.0,3.5,,1.5,,,3.0,4.0,,4.5,3.5,1.5,3.0
134.0,3.5,4.0,2.5,3.0,4.5,,1.0,,,,5.0,4.5,4.5,4.0,4.0,4.5,
150.0,3.5,3.5,3.5,3.0,3.5,,2.5,3.0,,,3.5,4.0,4.5,3.5,3.0,4.0,
165.0,3.0,5.0,4.0,3.5,3.5,,3.5,,3.5,1.5,3.0,3.0,3.5,3.0,4.0,4.0,2.5


We now calculate the correlation between all pairs of users, that is, the correlation between the rows. As the `.corr` method on data frames calculates the correlations between columns, we have to transpose the data frame first.

In [158]:
corr_df = final_df.T.corr()
corr_df

userId,15.0,19.0,30.0,73.0,88.0,119.0,130.0,134.0,150.0,165.0,...,518.0,547.0,562.0,564.0,574.0,580.0,607.0,624.0,654.0,664.0
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
15.0,1.0,0.6063144,0.5527708,0.616017,0.5167299,0.359888,0.795996,0.2467548,0.31069,-0.283594,...,0.1687966,0.191955,0.441971,0.070656,0.606001,0.2916644,0.754216,0.3408594,0.4998236,0.611448
19.0,0.606314,1.0,0.3953017,0.515935,-6.914925000000001e-17,0.868599,0.270765,0.1412993,0.265543,0.187729,...,0.3984095,0.059339,0.511163,0.074261,0.439498,0.180009,0.470087,0.2100903,0.3214653,0.44762
30.0,0.552771,0.3953017,1.0,0.447796,0.3766218,0.0625,0.158114,0.8243568,0.585239,-0.243916,...,0.3539192,0.743665,0.537331,0.382546,0.745234,0.4128375,0.841079,0.6048584,-0.02913583,0.11547
73.0,0.616017,0.5159347,0.4477955,1.0,0.4280457,0.408127,0.527048,0.3993377,0.252907,-0.012535,...,0.1723455,0.564169,0.792948,0.626935,0.703646,0.6431884,0.517099,0.319313,0.620155,0.498451
88.0,0.51673,-6.914925000000001e-17,0.3766218,0.428046,1.0,-0.045361,0.421111,0.2236068,0.266132,-0.183827,...,0.07444375,0.266379,0.244575,0.16875,0.593895,-2.141144e-16,0.430331,-3.0733000000000004e-17,0.4487647,0.202786
119.0,0.359888,0.868599,0.0625,0.408127,-0.04536092,1.0,0.377964,0.02827749,0.113228,0.104793,...,0.1336306,0.118864,0.83205,0.266131,0.707107,0.5281521,0.566529,0.2638224,0.4429812,0.589369
130.0,0.795996,0.2707652,0.1581139,0.527048,0.421111,0.377964,1.0,-0.2600157,-0.021402,-0.528613,...,0.02070788,-0.316736,0.290565,0.465778,0.272118,0.3727889,0.493179,0.1118034,0.5259006,0.174928
134.0,0.246755,0.1412993,0.8243568,0.399338,0.2236068,0.028277,-0.260016,1.0,0.700649,-0.1022,...,0.4504687,0.678079,0.599486,0.431764,0.266583,0.4431424,0.481125,-0.1153624,-0.09622504,-0.059761
150.0,0.31069,0.2655425,0.585239,0.252907,0.2661321,0.113228,-0.021402,0.700649,1.0,-0.072932,...,0.7051102,0.64247,0.047619,0.302765,0.563146,0.4020832,0.62361,0.3417995,0.1641276,0.398049
165.0,-0.283594,0.1877293,-0.2439164,-0.012535,-0.1838267,0.104793,-0.528613,-0.1022002,-0.072932,1.0,...,-0.6313604,-0.048901,-0.410411,0.21266,0.212748,0.4296916,-0.201688,0.08555937,0.2398804,0.351791
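By default, `.corr` computes the Pearson correlation and, for each pair of users, uses only the movies that both have rated. With few movies in common the value can therefore rest on very little data; the `min_periods` argument can be used to require a minimum overlap. The following sketch uses a hypothetical toy frame with three users and four movies:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "A": [5.0, 4.0, 1.0],
        "B": [3.0, 2.0, 3.0],
        "C": [4.0, 3.0, np.nan],
        "D": [1.0, np.nan, 5.0],
    },
    index=pd.Index([1.0, 2.0, 3.0], name="userId"),
)

# Transpose so that users become columns, then correlate the columns.
toy_corr = toy.T.corr()

# Require at least 3 movies in common; pairs with less overlap become NaN.
toy_corr_min3 = toy.T.corr(min_periods=3)
print(toy_corr)
print(toy_corr_min3)
```

Users 1 and 2 always differ by exactly one point on their three common movies, so their correlation is 1. Users 2 and 3 share only two movies, so their correlation is replaced by a missing value when `min_periods=3`.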


In [159]:
user_corr = corr_df[random_user].reset_index()
user_corr = user_corr.rename(columns={random_user: 'correlation'})
user_corr = user_corr.sort_values(by="correlation", ascending=False)
user_corr = user_corr.loc[user_corr["userId"] != random_user]
user_corr = user_corr.reset_index(drop=True)
user_corr

Unnamed: 0,userId,correlation
0,295.0,0.868487
1,134.0,0.619052
2,150.0,0.514666
3,654.0,0.501507
4,664.0,0.489481
5,119.0,0.488773
6,580.0,0.480598
7,247.0,0.462952
8,574.0,0.22182
9,518.0,0.163858


Now let us merge the correlations with all the ratings given by these users.

In [161]:
top_users_ratings = user_corr.merge(ratings[["userId", "movieId", "rating"]], how="inner")
top_users_ratings

Unnamed: 0,userId,correlation,movieId,rating
0,295.0,0.868487,6,4.5
1,295.0,0.868487,10,4.5
2,295.0,0.868487,39,4.0
3,295.0,0.868487,47,4.5
4,295.0,0.868487,50,4.5
...,...,...,...,...
24678,30.0,-0.381674,6436,5.0
24679,30.0,-0.381674,6440,4.0
24680,30.0,-0.381674,6452,5.0
24681,30.0,-0.381674,6473,4.0


We can now create ratings that are weighted with respect to the correlation.

In [163]:
top_users_ratings["weighted_rating"] = top_users_ratings["correlation"] * top_users_ratings["rating"]
top_users_ratings

Unnamed: 0,userId,correlation,movieId,rating,weighted_rating
0,295.0,0.868487,6,4.5,3.908192
1,295.0,0.868487,10,4.5,3.908192
2,295.0,0.868487,39,4.0,3.473949
3,295.0,0.868487,47,4.5,3.908192
4,295.0,0.868487,50,4.5,3.908192
...,...,...,...,...,...
24678,30.0,-0.381674,6436,5.0,-1.908371
24679,30.0,-0.381674,6440,4.0,-1.526697
24680,30.0,-0.381674,6452,5.0,-1.908371
24681,30.0,-0.381674,6473,4.0,-1.526697


For each movie, we can now take the average of the weighted ratings to get a final score for every movie (used as the recommendation score for the selected random user).

In [165]:
recommendation_df = top_users_ratings.groupby("movieId").agg({"weighted_rating": "mean"}).sort_values(by = "weighted_rating", ascending = False)
recommendation_df = recommendation_df.reset_index()
recommendation_df

Unnamed: 0,movieId,weighted_rating
0,6776,3.473949
1,8884,3.473949
2,53953,3.095262
3,3654,2.931642
4,4833,2.785736
...,...,...
6536,3038,-1.617554
6537,5960,-1.908371
6538,4088,-1.908371
6539,4617,-1.908371
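Note that the mean of `correlation * rating` is a ranking score rather than a predicted rating: it shrinks with the correlations and can become negative. A common alternative, which is not what this notebook does, divides the sum of the weighted ratings by the sum of the absolute correlations, which keeps the score on the rating scale when the weights are positive. The sketch below compares both on hypothetical toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "userId": [1, 2, 1, 2],
    "correlation": [0.8, 0.4, 0.8, 0.4],
    "movieId": [10, 10, 20, 20],
    "rating": [5.0, 2.0, 3.0, 4.0],
})
toy["weighted_rating"] = toy["correlation"] * toy["rating"]
toy["abs_correlation"] = toy["correlation"].abs()

grouped = toy.groupby("movieId")

# Score used in the notebook: plain mean of the weighted ratings.
mean_score = grouped["weighted_rating"].mean()

# Alternative (assumption, for comparison only): normalised weighted average.
normalised_score = grouped["weighted_rating"].sum() / grouped["abs_correlation"].sum()
print(mean_score, normalised_score, sep="\n")
```

Both scores rank movie 10 above movie 20 in this toy case, but only the normalised one (4.0 and about 3.33) can be read as a rating.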


In [166]:
movies_to_be_recommended = recommendation_df.merge(metadata[["id", "title"]], left_on="movieId", right_on="id").drop(columns=["id"])
movies_to_be_recommended = movies_to_be_recommended.head()
movies_to_be_recommended

Unnamed: 0,movieId,weighted_rating,title
0,8884,3.473949,Franklyn
1,53953,3.095262,The Tooth Fairy
2,1824,2.443866,50 First Dates
3,26,2.443866,Walk on Water
4,26865,2.162691,X: The Unknown


We can now put it all together into a recommender function.

In [266]:
def user_based_recommender(input_user, user_movie_df, rate_ratio=0.70, num_recommendations=5):
    # Note: uses the `ratings` and `metadata` data frames loaded earlier in the notebook

    # Create a list of the movies the input user has rated
    input_user_df = user_movie_df[user_movie_df.index == input_user]
    input_user_movies_watched = input_user_df.columns[input_user_df.notna().any()].tolist()

    # Keep only the ratings of the movies the input user has rated
    movies_watched_df = user_movie_df[input_user_movies_watched]

    # Count how many of those movies each user has rated
    user_movie_count = movies_watched_df.T.notnull().sum()
    user_movie_count = user_movie_count.reset_index()
    user_movie_count.columns = ["userId", "movie_count"]

    # Select similar users: those who rated more than `rate_ratio` of the input user's movies
    user_same_movies = user_movie_count[user_movie_count["movie_count"] > (len(input_user_movies_watched)*rate_ratio)]["userId"]

    # Create a correlation matrix between the selected users based on their ratings
    final_df = movies_watched_df[movies_watched_df.index.isin(user_same_movies)]
    corr_df = final_df.T.corr()

    # Rank the other users by their correlation with the input user
    user_corr = corr_df[input_user].reset_index()
    user_corr = user_corr.rename(columns={input_user: 'correlation'})
    user_corr = user_corr.sort_values(by="correlation", ascending=False)
    user_corr = user_corr.loc[user_corr["userId"] != input_user]
    user_corr = user_corr.reset_index(drop=True)

    # Weight every rating by the correlation of the user who gave it
    top_users_ratings = user_corr.merge(ratings[["userId", "movieId", "rating"]], how="inner")
    top_users_ratings["weighted_rating"] = top_users_ratings["correlation"] * top_users_ratings["rating"]

    # Average the weighted ratings per movie and sort from best to worst
    recommendation_df = top_users_ratings.groupby("movieId").agg({"weighted_rating": "mean"}).sort_values(by = "weighted_rating", ascending = False)
    recommendation_df = recommendation_df.reset_index()

    # Look up the titles and keep the top recommendations
    movies_to_be_recommended = recommendation_df.merge(metadata[["id", "title"]], left_on="movieId", right_on="id").drop(columns=["id"])
    movies_to_be_recommended = movies_to_be_recommended.head(num_recommendations)

    return movies_to_be_recommended["title"]

In [169]:
user_movie_df

title,!Women Art Revolution,'Gator Bait,'Twas the Night Before Christmas,...And God Created Woman,00 Schneider - Jagd auf Nihil Baxter,10 Items or Less,10 Things I Hate About You,"10,000 BC",11'09''01 - September 11,12 Angry Men,...,Zodiac,Zombie Flesh Eaters,Zombie Holocaust,Zozo,eXistenZ,xXx,¡Three Amigos!,À nos amours,Ödipussi,Şaban Oğlu Şaban
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1.0,,,,,,,,,,,...,,,,,,,,,,
2.0,,,,,,,,,,,...,,,,,,,,,,
3.0,,,,,,,,,,,...,,,,,,,,,,
4.0,,,,,,,,,,,...,,,,,,,,,,
5.0,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667.0,,,,,,,,,,,...,,,,,,,,,,
668.0,,,,,,,,,,,...,,,,,,,,,,
669.0,,,,,,,,,,,...,,,,,,,,,,
670.0,,,,,,,,,,,...,,,,,,,,,,


In [170]:
user_based_recommender(455, user_movie_df)

0           Franklyn
1    The Tooth Fairy
2     50 First Dates
3      Walk on Water
4     X: The Unknown
Name: title, dtype: object
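The function above reads `ratings` and `metadata` from the notebook's global scope. As a possible variation, the sketch below passes all data in as arguments and runs on a tiny made-up dataset. The function name `recommend_for_user`, its parameters and the toy values are hypothetical, and the code is a compact restatement of the steps above rather than a replacement for the notebook's function.

```python
import pandas as pd


def recommend_for_user(user, user_movie, ratings, titles, rate_ratio=0.7, n=5):
    """Hypothetical variant of the recommender that takes all data as arguments."""
    # Movies the input user has rated
    watched = user_movie.loc[user].dropna().index
    overlap = user_movie[watched]

    # Users who rated more than `rate_ratio` of those movies
    counts = overlap.notna().sum(axis=1)
    similar = overlap[counts > len(watched) * rate_ratio]

    # Correlation of every similar user with the input user
    corr = similar.T.corr()[user].drop(user).rename("correlation").reset_index()

    # Correlation-weighted ratings, averaged per movie
    weighted = corr.merge(ratings, on="userId")
    weighted["weighted_rating"] = weighted["correlation"] * weighted["rating"]
    scores = (
        weighted.groupby("movieId")["weighted_rating"]
        .mean()
        .sort_values(ascending=False)
        .reset_index()
    )

    # Attach titles and return the top n
    return scores.merge(titles, left_on="movieId", right_on="id")["title"].head(n)


# Made-up toy data: three users, four movies.
toy_titles = pd.DataFrame({"id": [1, 2, 3, 4], "title": ["A", "B", "C", "D"]})
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "movieId": [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4],
    "rating":  [5.0, 1.0, 4.0, 4.0, 2.0, 5.0, 5.0, 1.0, 5.0, 2.0, 1.0],
})
toy_user_movie = toy_ratings.merge(toy_titles, left_on="movieId", right_on="id").pivot_table(
    index="userId", columns="title", values="rating"
)

top = recommend_for_user(1, toy_user_movie, toy_ratings, toy_titles, n=2)
print(top.tolist())
```

In this toy case user 2 agrees with user 1 and user 3 disagrees, so movie "D", which user 2 liked and user 3 disliked, ends up on top. Like the notebook's function, the sketch does not remove movies the input user has already rated.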

In [171]:
random_user_movies_watched

['A Nightmare on Elm Street',
 'Back to the Future Part II',
 'Bang, Boom, Bang',
 'Batman Returns',
 'Beauty and the Beast',
 'Belle Époque',
 'Breaking the Waves',
 'Don Juan DeMarco',
 'Frankenstein Conquers the World',
 'Harry Potter and the Prisoner of Azkaban',
 'Monsoon Wedding',
 'Sissi',
 'Terminator 3: Rise of the Machines',
 'The Bourne Supremacy',
 'The Conversation',
 'The Passion of Joan of Arc',
 'The Third Man']