# **MOVIE RECOMMENDER SYSTEM**

The goal of this project is to build 3 types of recommender systems:

- Popularity based

- Item-based with correlation

- User-based with cosine similarity

# **1.Popularity Based Recommendations**

## 1.1.Reading Data & First Glance

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# import data
url = "https://drive.google.com/file/d/18TReZs3uJmJh0hIofeOXDzjOq-bnywYT/view?usp=share_link"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
movies = pd.read_csv(path)

url = "https://drive.google.com/file/d/19A69kCZ33oTc_1oF8TX3XymJ5AGj2APC/view?usp=share_link"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
ratings = pd.read_csv(path)

url = "https://drive.google.com/file/d/12KAAKmRT4l9QZEh4b3FAIToKCeFtlwCe/view?usp=share_link"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
tags = pd.read_csv(path)

url = "https://drive.google.com/file/d/1MU1eYadkdX739KM2JZ_zn1HJad39XiaQ/view?usp=share_link"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
links = pd.read_csv(path)
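The four loading blocks above repeat the same link conversion. As a minimal sketch, the conversion could be wrapped in a small helper. The function name `drive_download_url` and the `share_links` dictionary are hypothetical, and the sketch assumes every share link has the form `.../file/d/<FILE_ID>/view?...`, as the links above do.

```python
def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive share link into a direct-download link.

    Assumes the file id is the second-to-last path segment of the link.
    """
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?export=download&id=" + file_id


# Two of the share links from the cell above, keyed by table name.
share_links = {
    "movies": "https://drive.google.com/file/d/18TReZs3uJmJh0hIofeOXDzjOq-bnywYT/view?usp=share_link",
    "ratings": "https://drive.google.com/file/d/19A69kCZ33oTc_1oF8TX3XymJ5AGj2APC/view?usp=share_link",
}

download_urls = {name: drive_download_url(url) for name, url in share_links.items()}
print(download_urls["movies"])

# Reading the files needs network access and pandas (imported as pd above):
# movies = pd.read_csv(download_urls["movies"])
```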

In [2]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [3]:
ratings.sample(5)

Unnamed: 0,userId,movieId,rating,timestamp
70614,451,32,5.0,854089163
99736,610,3578,5.0,1493844672
9147,62,176601,5.0,1525795252
68339,443,260,4.0,1501722465
29996,208,2427,3.0,940639513


In [4]:
tags

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
...,...,...,...,...
3678,606,7382,for katie,1171234019
3679,606,7936,austere,1173392334
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978


In [5]:
links

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
9737,193581,5476944,432131.0
9738,193583,5914996,445030.0
9739,193585,6397426,479308.0
9740,193587,8391976,483455.0


In [6]:
movies_ratings = movies.merge(ratings, on="movieId")
movies_ratings

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483
...,...,...,...,...,...,...
100831,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184,4.0,1537109082
100832,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184,3.5,1537109545
100833,193585,Flint (2017),Drama,184,3.5,1537109805
100834,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184,3.5,1537110021


Let's group the ratings by movie and look at each movie's average rating. The rating a user gives is an explicit rating.

In [7]:
rating=pd.DataFrame(movies_ratings.groupby('movieId')['rating'].mean())
rating.sort_values('rating',ascending=False).head()

Unnamed: 0_level_0,rating
movieId,Unnamed: 1_level_1
88448,5.0
100556,5.0
143031,5.0
143511,5.0
143559,5.0


The top-rated movies have a perfect score of 5/5. But how many ratings do these movies have?

In [8]:
movies_ratings.query("movieId==88448")

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
93261,88448,Paper Birds (Pájaros de papel) (2010),Comedy|Drama,483,5.0,1315437602


Looks like only one person rated this movie, so its perfect score tells us very little.

We can also look at how many times each movie has received a rating. The rating count works as an implicit rating: it tells us how many users chose to watch and rate the movie.

In [9]:
rating['rating_count']=movies_ratings.groupby('movieId')['rating'].count()
rating.sort_values('rating_count',ascending=False).head()

Unnamed: 0_level_0,rating,rating_count
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
356,4.164134,329
318,4.429022,317
296,4.197068,307
593,4.16129,279
2571,4.192446,278


Some movies have been rated around 300 times. They are more popular than the top-rated movies, but received lower explicit ratings.

Let's locate the most popular movie, and get some info about it:

In [10]:
# movieId of the most popular movie
top_popular_movieId=rating.sort_values('rating_count',ascending=False).head(1).index[0]

# all ratings of the most popular movie
movies_ratings[movies_ratings['movieId']==top_popular_movieId]

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
10019,356,Forrest Gump (1994),Comedy|Drama|Romance|War,1,4.0,964980962
10020,356,Forrest Gump (1994),Comedy|Drama|Romance|War,6,5.0,845553200
10021,356,Forrest Gump (1994),Comedy|Drama|Romance|War,7,5.0,1106635915
10022,356,Forrest Gump (1994),Comedy|Drama|Romance|War,8,3.0,839463527
10023,356,Forrest Gump (1994),Comedy|Drama|Romance|War,10,3.5,1455301685
...,...,...,...,...,...,...
10343,356,Forrest Gump (1994),Comedy|Drama|Romance|War,605,3.0,1277097509
10344,356,Forrest Gump (1994),Comedy|Drama|Romance|War,606,4.0,1171231370
10345,356,Forrest Gump (1994),Comedy|Drama|Romance|War,608,3.0,1117162603
10346,356,Forrest Gump (1994),Comedy|Drama|Romance|War,609,4.0,847220869


## 1.2.Create a dataframe

Let's build a hybrid score to sort movies, so that we can recommend movies that are both highly rated and popular.

- Popularity: count of ratings
- Quality: mean of ratings
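To see why combining the two measures helps, here is a minimal sketch on hypothetical toy data (the `toy_ratings` table is invented for illustration). It uses the product rule `quality * popularity * 0.25` that this notebook applies in section 1.3; other ways of combining the two are possible.

```python
import pandas as pd

# Hypothetical toy ratings: movie 1 has a single perfect rating,
# movie 2 has four good ratings, movie 3 has two poor ones.
toy_ratings = pd.DataFrame({
    "movieId": [1, 2, 2, 2, 2, 3, 3],
    "userId":  [10, 10, 11, 12, 13, 11, 12],
    "rating":  [5.0, 4.0, 4.5, 4.0, 3.5, 2.0, 2.0],
})

toy_scores = toy_ratings.groupby("movieId").agg(
    popularity=("userId", "count"),  # how many users rated the movie
    quality=("rating", "mean"),      # average explicit rating
)

# Product rule from section 1.3 of this notebook.
toy_scores["overall_rating"] = toy_scores["quality"] * toy_scores["popularity"] * 0.25

by_quality = toy_scores.sort_values("quality", ascending=False).index.tolist()
by_overall = toy_scores.sort_values("overall_rating", ascending=False).index.tolist()
print(by_quality)  # movie 1 comes first when sorting by quality alone
print(by_overall)  # movie 2 comes first once popularity is taken into account
```

Sorting by quality alone puts the movie with a single 5.0 rating on top, while the hybrid score favours the movie that many users rated well.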

In [11]:
# find "count of rating" and "mean of rating"
r = ratings.groupby(["movieId"]).agg({"userId": 'count', "rating": "mean"})
r

Unnamed: 0_level_0,userId,rating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,215,3.920930
2,110,3.431818
3,52,3.259615
4,7,2.357143
5,49,3.071429
...,...,...
193581,1,4.000000
193583,1,3.500000
193585,1,3.500000
193587,1,3.500000


In [12]:
# merge movies and updated ratings tables
df = pd.merge(movies, r, on="movieId", how="left")
df

Unnamed: 0,movieId,title,genres,userId,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215.0,3.920930
1,2,Jumanji (1995),Adventure|Children|Fantasy,110.0,3.431818
2,3,Grumpier Old Men (1995),Comedy|Romance,52.0,3.259615
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,7.0,2.357143
4,5,Father of the Bride Part II (1995),Comedy,49.0,3.071429
...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,1.0,4.000000
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,1.0,3.500000
9739,193585,Flint (2017),Drama,1.0,3.500000
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,1.0,3.500000


In [13]:
# rename columns >> popularity: count of ratings, quality: mean of ratings
df2 = df.rename(columns={'userId': 'popularity', 'rating': 'quality'})
df2

Unnamed: 0,movieId,title,genres,popularity,quality
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215.0,3.920930
1,2,Jumanji (1995),Adventure|Children|Fantasy,110.0,3.431818
2,3,Grumpier Old Men (1995),Comedy|Romance,52.0,3.259615
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,7.0,2.357143
4,5,Father of the Bride Part II (1995),Comedy,49.0,3.071429
...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,1.0,4.000000
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,1.0,3.500000
9739,193585,Flint (2017),Drama,1.0,3.500000
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,1.0,3.500000


## 1.3.Popularity Based Recommender

In [14]:
# hybrid score to sort movies, so that we recommend movies that are both highly rated and popular
df2['overall_rating'] = df2["quality"] * (df2["popularity"] * 0.25)
df2.sort_values(by="overall_rating", ascending=False).head(10)

Unnamed: 0,movieId,title,genres,popularity,quality,overall_rating
277,318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022,351.0
314,356,Forrest Gump (1994),Comedy|Drama|Romance|War,329.0,4.164134,342.5
257,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,307.0,4.197068,322.125
1939,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,278.0,4.192446,291.375
510,593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,279.0,4.16129,290.25
224,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,251.0,4.231076,265.5
97,110,Braveheart (1995),Action|Drama|War,237.0,4.031646,238.875
2226,2959,Fight Club (1999),Action|Crime|Drama|Thriller,218.0,4.272936,232.875
461,527,Schindler's List (1993),Drama|War,220.0,4.225,232.375
418,480,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller,238.0,3.75,223.125


In [15]:
# popularity: count of ratings, quality: mean of ratings

def top_5_movies(movies_ratings):
    top_5_movies =(
        movies_ratings
        .groupby(['title','genres'])
        .agg( quality = ('rating','mean'), popularity = ('userId','count'))
        .sort_values('popularity', ascending=False)
        .assign(quality = lambda x: round(x['quality'], 2))
        .head(5)
    )
    return top_5_movies

In [16]:
top_5_movies(movies_ratings)

Unnamed: 0_level_0,Unnamed: 1_level_0,quality,popularity
title,genres,Unnamed: 2_level_1,Unnamed: 3_level_1
Forrest Gump (1994),Comedy|Drama|Romance|War,4.16,329
"Shawshank Redemption, The (1994)",Crime|Drama,4.43,317
Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,4.2,307
"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,4.16,279
"Matrix, The (1999)",Action|Sci-Fi|Thriller,4.19,278


## 1.4.Popularity Chat Bot

In [17]:
def popularity_chat_bot(movies_ratings):
    print("Hi! I'm your personal recommender, let me recommend you some movies!")
    # top 5 movies, with title and genres moved from the index back into columns
    rec = top_5_movies(movies_ratings).reset_index()
    return rec['title']

In [18]:
popularity_chat_bot(movies_ratings)

Hi! I'm your personal recommender, let me recommend you some movies!


0                 Forrest Gump (1994)
1    Shawshank Redemption, The (1994)
2                 Pulp Fiction (1994)
3    Silence of the Lambs, The (1991)
4                  Matrix, The (1999)
Name: title, dtype: object

# **2.Item Based Recommendations**

### 2.1.Prepare data for correlation

We will look for movies that are similar to the most popular movie, "Forrest Gump (1994)". "Similarity" is defined by how well the ratings of other movies correlate with the ratings of "Forrest Gump (1994)" in the user-item matrix.

In this matrix, all the users are in the rows and all the movies are in the columns. It has many NaNs because most users have rated only a small fraction of the movies. We call this a sparse matrix.
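As a small illustration of the reshaping, here is a sketch on a hypothetical five-row ratings table (the `toy` data is invented). It pivots the long table into a user-item matrix and measures how many cells are empty.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) rating.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Users in the rows, movies in the columns, NaN where a user has not rated a movie.
toy_matrix = pd.pivot_table(data=toy, values="rating", index="userId", columns="movieId")
print(toy_matrix)

# Share of user/movie cells that hold no rating.
sparsity = toy_matrix.isna().to_numpy().mean()
print(f"{sparsity:.0%} of the cells are empty")
```

Even in this tiny example almost half of the cells are empty; in the real matrix the share of empty cells is much higher.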

In [19]:
# preparing data for correlation

# movies_crosstab: user-item matrix
movies_crosstab = pd.pivot_table(data=movies_ratings, values='rating', index='userId', columns='movieId')
movies_crosstab.head(10)

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
6,,4.0,5.0,3.0,5.0,4.0,4.0,3.0,,3.0,...,,,,,,,,,,
7,4.5,,,,,,,,,,...,,,,,,,,,,
8,,4.0,,,,,,,,2.0,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,,,,,,,,,,


Let's look at the users that have rated "Forrest Gump (1994)":

In [20]:
# Forrest Gump (1994)	
top_popular_movieId = 356

In [21]:
# ratings given to the movie by each user, excluding NaNs
fg_ratings = movies_crosstab[top_popular_movieId]
fg_ratings.dropna() # exclude NaNs

userId
1      4.0
6      5.0
7      5.0
8      3.0
10     3.5
      ... 
605    3.0
606    4.0
608    3.0
609    4.0
610    3.0
Name: 356, Length: 329, dtype: float64

### 2.2.Find similarities with correlation

In [22]:
# find similar movies
# we get warnings because some movies share too few ratings with the target movie to compute
# a Pearson correlation coefficient (their result is NaN), but the other results are still ok
similar_to_fg = movies_crosstab.corrwith(fg_ratings)
similar_to_fg

  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)


movieId
1         0.303465
2         0.367247
3         0.534682
4         0.388514
5         0.349541
            ...   
193581         NaN
193583         NaN
193585         NaN
193587         NaN
193609         NaN
Length: 9724, dtype: float64

In [23]:
# getting the correlation scores and dropping NaNs
corr_fg = pd.DataFrame(similar_to_fg, columns=['PearsonR'])
corr_fg.dropna(inplace=True)
corr_fg.head(10)

Unnamed: 0_level_0,PearsonR
movieId,Unnamed: 1_level_1
1,0.303465
2,0.367247
3,0.534682
4,0.388514
5,0.349541
6,0.137421
7,0.106567
8,0.65602
9,0.0
10,0.217441


In [24]:
movies_ratings.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483


In [25]:
rating = pd.DataFrame(movies_ratings.groupby('movieId')['rating'].mean())
rating['rating_count'] = movies_ratings.groupby('movieId')['rating'].count()

In [26]:
rating['rating_count']

movieId
1         215
2         110
3          52
4           7
5          49
         ... 
193581      1
193583      1
193585      1
193587      1
193609      1
Name: rating_count, Length: 9724, dtype: int64

In [27]:
# joining correlation scores and rating count
fg_corr_summary = corr_fg.join(rating['rating_count'])
fg_corr_summary.drop(top_popular_movieId, inplace=True) # drop fg itself
fg_corr_summary

Unnamed: 0_level_0,PearsonR,rating_count
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.303465,215
2,0.367247,110
3,0.534682,52
4,0.388514,7
5,0.349541,49
...,...,...
185585,-1.000000,2
187541,1.000000,4
187593,-0.203519,12
187595,0.870388,5


Let's filter out movies with a rating count below 10.

Then, take the top 10 movies in terms of similarity to Forrest Gump:
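The filter matters because a correlation computed from very few ratings is unreliable: in the table above, a movie with only 2 ratings has a correlation of exactly -1. The sketch below reproduces that effect on a hypothetical toy matrix (movies "A", "B" and "C" are invented). With only two shared ratings, the Pearson correlation can only be +1 or -1.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix: rows are users, columns are movies.
toy_matrix = pd.DataFrame(
    {
        "A": [5.0, 4.0, 2.0, 1.0, 3.0],
        "B": [4.5, 4.0, 2.5, 1.0, np.nan],        # shares 4 raters with A
        "C": [np.nan, np.nan, 3.0, 4.0, np.nan],  # shares only 2 raters with A
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

target = toy_matrix["A"]
pearson = toy_matrix.corrwith(target)  # NaNs are skipped pair by pair
rating_count = toy_matrix.count()      # number of ratings per movie

summary = pd.DataFrame({"PearsonR": pearson, "rating_count": rating_count}).drop("A")
print(summary)

# Keep only movies with enough ratings (threshold of 3 chosen for this toy example).
reliable = summary[summary["rating_count"] >= 3]
print(reliable)
```

Movie "C" gets a perfect negative correlation from just two points, so it is dropped by the count filter, while "B" stays.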

In [28]:
# select only movies with at least 10 ratings, sort by correlation (highest first) and keep the top 10
top10 = fg_corr_summary[fg_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(10)
top10

Unnamed: 0_level_0,PearsonR,rating_count
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1295,0.932958,11
6793,0.885253,11
328,0.881682,10
4954,0.865633,11
911,0.850591,13
55721,0.799415,10
195,0.786428,10
181,0.785661,17
80906,0.782601,12
150548,0.776636,10


In [29]:
data = movies[['movieId', 'title']]

In [30]:
top10 = top10.merge(data,left_index=True, right_on="movieId")
top10

Unnamed: 0,PearsonR,rating_count,movieId,title
993,0.932958,11,1295,"Unbearable Lightness of Being, The (1988)"
4573,0.885253,11,6793,Beethoven (1992)
286,0.881682,10,328,Tales from the Crypt Presents: Demon Knight (1...
3607,0.865633,11,4954,Ocean's Eleven (a.k.a. Ocean's 11) (1960)
693,0.850591,13,911,Charade (1963)
6607,0.799415,10,55721,Elite Squad (Tropa de Elite) (2007)
165,0.786428,10,195,Something to Talk About (1995)
153,0.785661,17,181,Mighty Morphin Power Rangers: The Movie (1995)
7436,0.782601,12,80906,Inside Job (2010)
9193,0.776636,10,150548,Sherlock: The Abominable Bride (2016)


### 2.3. Item Based Recommender

Create a function that takes a movie id and a number (n) as input, and outputs the titles of the n movies most similar to the input movie.

You can assume that the user-item matrix (movies_crosstab) is already created.

In [31]:
def top_n_movie(movie_id, n):

    # ratings given to the input movie by each user (NaN where a user has not rated it)
    movie_ratings = movies_crosstab[movie_id]

    # correlate every movie's ratings with the input movie's ratings
    similar_to_movie = movies_crosstab.corrwith(movie_ratings)

    # keep the correlation scores and drop NaNs
    corr_movie = pd.DataFrame(similar_to_movie, columns=['PearsonR'])
    corr_movie.dropna(inplace=True)

    # join correlation scores and rating counts
    corr_summary = corr_movie.join(rating['rating_count'])
    corr_summary.drop(movie_id, inplace=True) # drop the input movie itself

    # keep movies with at least 10 ratings, sort by correlation (highest first) and take the top n
    top_n = corr_summary[corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(n)
    top_n = top_n.merge(data, left_index=True, right_on="movieId")

    # return the top n titles as a list
    return list(top_n["title"])

In [32]:
import warnings
warnings.filterwarnings('ignore')

In [33]:
rating.sort_values(by="rating_count", ascending=False).head(12)

Unnamed: 0_level_0,rating,rating_count
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
356,4.164134,329
318,4.429022,317
296,4.197068,307
593,4.16129,279
2571,4.192446,278
260,4.231076,251
480,3.75,238
110,4.031646,237
589,3.970982,224
527,4.225,220


In [34]:
top_n_movie(356, 10)

['Unbearable Lightness of Being, The (1988)',
 'Beethoven (1992)',
 'Tales from the Crypt Presents: Demon Knight (1995)',
 "Ocean's Eleven (a.k.a. Ocean's 11) (1960)",
 'Charade (1963)',
 'Elite Squad (Tropa de Elite) (2007)',
 'Something to Talk About (1995)',
 'Mighty Morphin Power Rangers: The Movie (1995)',
 'Inside Job (2010)',
 'Sherlock: The Abominable Bride (2016)']

# **3.User Based Recommendations**

We create the user similarity matrix in 3 simple steps:

- Create the big users-items table

- Replace NaNs with zeros

- Compute pairwise cosine similarities
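The three steps can be sketched on a hypothetical toy table before running them on the real data. The sketch computes the cosine similarity by hand with NumPy (dot product divided by the product of the vector lengths), which is the same quantity that scikit-learn's `cosine_similarity` returns for each pair of rows. The `toy` data is invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 2.0, 2.0, 1.0, 5.0],
})

# Step 1: users-items table (users as rows, movies as columns).
toy_users_items = pd.pivot_table(data=toy, values="rating", index="userId", columns="movieId")

# Step 2: replace NaNs with zeros.
toy_users_items = toy_users_items.fillna(0)

# Step 3: pairwise cosine similarity between the user rows.
values = toy_users_items.to_numpy()
norms = np.linalg.norm(values, axis=1)
toy_similarities = pd.DataFrame(
    values @ values.T / np.outer(norms, norms),
    index=toy_users_items.index,
    columns=toy_users_items.index,
)
print(toy_similarities.round(3))
```

Users 1 and 2 get a similarity of 1 even though user 2 rates everything half as high, because cosine similarity compares the direction of the rating vectors and not their length. Users 1 and 3 have no movie in common, so their similarity is 0.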

### 3.1.Create the big users-items table.
We are just reshaping (pivoting) the data, so that we have users as rows and movies as columns. We need the data to be in this shape to compute similarities between users in the next step.

In [35]:
users_items = pd.pivot_table(data=movies_ratings,
                             values='rating',
                             index='userId', 
                             columns='movieId')

users_items.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


### 3.2.Replace NaNs with zeros
Cosine similarity can't be computed with NaNs, so we replace them with zeros. Note that a zero now stands for "not rated", not for a rating of 0.

In [36]:
users_items.fillna(0, inplace=True)
users_items.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 3.3.Compute user similarities

In [37]:
from sklearn.metrics.pairwise import cosine_similarity

user_similarities = pd.DataFrame(cosine_similarity(users_items),
                                 columns=users_items.index, 
                                 index=users_items.index)
user_similarities.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.027283,0.05972,0.194395,0.12908,0.128152,0.158744,0.136968,0.064263,0.016875,...,0.080554,0.164455,0.221486,0.070669,0.153625,0.164191,0.269389,0.291097,0.093572,0.145321
2,0.027283,1.0,0.0,0.003726,0.016614,0.025333,0.027585,0.027257,0.0,0.067445,...,0.202671,0.016866,0.011997,0.0,0.0,0.028429,0.012948,0.046211,0.027565,0.102427
3,0.05972,0.0,1.0,0.002251,0.00502,0.003936,0.0,0.004941,0.0,0.0,...,0.005048,0.004892,0.024992,0.0,0.010694,0.012993,0.019247,0.021128,0.0,0.032119
4,0.194395,0.003726,0.002251,1.0,0.128659,0.088491,0.11512,0.062969,0.011361,0.031163,...,0.085938,0.128273,0.307973,0.052985,0.084584,0.200395,0.131746,0.149858,0.032198,0.107683
5,0.12908,0.016614,0.00502,0.128659,1.0,0.300349,0.108342,0.429075,0.0,0.030611,...,0.068048,0.418747,0.110148,0.258773,0.148758,0.106435,0.152866,0.135535,0.261232,0.060792


## 4.Building the recommender step by step:
Let's focus on one example user (user 5) and compute the recommendations only for this user. Then, we will build a function that can compute recommendations for any user. We will follow these steps:

- Compute the weights.

- Find the movies user 5 has not rated.

- Compute the ratings user 5 would give to those unrated movies.

- Find the top 5 movies from the rating predictions.
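Before walking through the real data, the four steps can be sketched on a hypothetical three-user example. The table `toy_users_items` and the similarity values in `sim_to_user_1` are invented for illustration; user 1 plays the role of user 5.

```python
import pandas as pd

# Hypothetical users-items table, NaNs already replaced by zeros (0 = not rated).
toy_users_items = pd.DataFrame(
    {10: [5.0, 4.0, 0.0], 20: [0.0, 5.0, 4.0], 30: [0.0, 2.0, 5.0]},
    index=pd.Index([1, 2, 3], name="userId"),
)
toy_users_items.columns.name = "movieId"

# Hypothetical cosine similarities of user 1 to users 2 and 3.
sim_to_user_1 = pd.Series({2: 0.6, 3: 0.2})

# Step 1: weights = similarities rescaled so that they sum to 1.
weights = sim_to_user_1 / sim_to_user_1.sum()

# Step 2: movies user 1 has not rated, as rated by the other users.
unrated = toy_users_items.loc[toy_users_items.index != 1, toy_users_items.loc[1] == 0]

# Step 3: predicted rating = weighted average of the other users' ratings.
predicted = unrated.T.dot(weights)

# Step 4: sort the predictions, best first.
top = predicted.sort_values(ascending=False)
print(top)
```

User 2 is three times as similar to user 1 as user 3 is, so user 2's ratings dominate the prediction and movie 20 ends up ahead of movie 30.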

### 4.1.Compute the weights
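We take user 5's similarities to all other users (excluding user 5 itself with .query()) and divide them by their sum, so that the weights add up to 1.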

In [38]:
user_id = 5

weights = (
    user_similarities.query("userId!=@user_id")[user_id] / sum(user_similarities.query("userId!=@user_id")[user_id])
          )
weights.head(6)

userId
1    0.001729
2    0.000223
3    0.000067
4    0.001724
6    0.004024
7    0.001451
Name: 5, dtype: float64

In [39]:
weights.sum()

1.0000000000000013

### 4.2.Find movies user 5 has not rated.
A zero in user 5's row marks a movie they have not rated. We keep only those movies and drop user 5's own row, since user 5 is not part of the weights.

In [40]:
users_items.loc[user_id,:]==0

movieId
1         False
2          True
3          True
4          True
5          True
          ...  
193581     True
193583     True
193585     True
193587     True
193609     True
Name: 5, Length: 9724, dtype: bool

In [41]:
# select movies that the input user has not rated (and drop the user's own row)
not_watched_movies = users_items.loc[users_items.index!=user_id, users_items.loc[user_id,:]==0]
not_watched_movies.T

userId,1,2,3,4,6,7,8,9,10,11,...,601,602,603,604,605,606,607,608,609,610
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
6,4.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,5.0,...,0.0,3.0,4.0,3.0,0.0,0.0,0.0,0.0,0.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193583,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193585,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193587,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 4.3.Compute the ratings user 5 would give to those unrated movies.

In [42]:
# dot product between not_watched_movies and the weights
weighted_averages = pd.DataFrame(not_watched_movies.T.dot(weights), columns=["predicted_rating"])
weighted_averages.head()

Unnamed: 0_level_0,predicted_rating
movieId,Unnamed: 1_level_1
2,0.922408
3,0.3761
4,0.060716
5,0.373809
6,0.881502


### 4.4.Find the top 5 movies from the rating predictions

In [43]:
data=movies[['movieId', 'title']]

In [44]:
recommendations = weighted_averages.merge(data, left_index=True, right_on="movieId")
recommendations.sort_values("predicted_rating", ascending=False).head()

Unnamed: 0,predicted_rating,movieId,title
314,2.987118,356,Forrest Gump (1994)
510,2.530083,593,"Silence of the Lambs, The (1991)"
418,2.281672,480,Jurassic Park (1993)
43,1.881849,47,Seven (a.k.a. Se7en) (1995)
334,1.699053,377,Speed (1994)


### 4.5.User Based Recommender

In [45]:
def weighted_user_rec(user_id, n):

    # compute the weights for one user
    weights = (user_similarities.query("userId!=@user_id")[user_id] / sum(user_similarities.query("userId!=@user_id")[user_id]))

    # select movies that the input user has not rated
    not_watched_movies = users_items.loc[users_items.index!=user_id, users_items.loc[user_id,:]==0]

    # dot product between not_watched_movies and the weights
    weighted_averages = pd.DataFrame(not_watched_movies.T.dot(weights), columns=["predicted_rating"])

    # find the top n movies from the rating predictions
    recommendations = weighted_averages.merge(data, left_index=True, right_on="movieId")
    top_recommendations = recommendations.sort_values("predicted_rating", ascending=False).head(n)

    return top_recommendations

In [46]:
weighted_user_rec(5, 10)

Unnamed: 0,predicted_rating,movieId,title
314,2.987118,356,Forrest Gump (1994)
510,2.530083,593,"Silence of the Lambs, The (1991)"
418,2.281672,480,Jurassic Park (1993)
43,1.881849,47,Seven (a.k.a. Se7en) (1995)
334,1.699053,377,Speed (1994)
31,1.582715,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
138,1.565097,165,Die Hard: With a Vengeance (1995)
224,1.551673,260,Star Wars: Episode IV - A New Hope (1977)
1939,1.548122,2571,"Matrix, The (1999)"
615,1.411351,780,Independence Day (a.k.a. ID4) (1996)
