# Making Recommendations Based on Users

User-based recommenders determine how similar users are by comparing their ratings of the same items. Similarity can be calculated in several ways, for example with cosine similarity, Pearson correlation, or Jaccard similarity. We can then make recommendations for one user based on the preferences of similar users. For example, if User A and User B watched and enjoyed all the same Disney movies, we could say that they have similar tastes. Therefore, when User B watches and enjoys The Simpsons Movie, it could be recommended to User A as something they might also enjoy.
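
To make the three similarity measures concrete, here is a minimal sketch on two made-up rating vectors. The values, the helper names (`cosine_sim`, `pearson_sim`, `jaccard_sim`) and the "liked means a rating of at least 4" threshold used for Jaccard are all assumptions for illustration; they are not part of the dataset or of the notebook below.

```python
import numpy as np

# Hypothetical ratings given by two users to the same five movies
user_a = np.array([5.0, 4.0, 5.0, 1.0, 2.0])
user_b = np.array([4.0, 5.0, 3.0, 2.0, 1.0])

def cosine_sim(u, v):
    """Cosine of the angle between the two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson_sim(u, v):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    return float(np.corrcoef(u, v)[0, 1])

def jaccard_sim(u, v, liked=4.0):
    """Overlap between the sets of movies each user liked (rating >= liked)."""
    liked_u = set(np.flatnonzero(u >= liked).tolist())
    liked_v = set(np.flatnonzero(v >= liked).tolist())
    union = liked_u | liked_v
    return len(liked_u & liked_v) / len(union) if union else 0.0

print(cosine_sim(user_a, user_b))
print(pearson_sim(user_a, user_b))
print(jaccard_sim(user_a, user_b))
```

The three measures can disagree: cosine similarity looks at the raw ratings, Pearson correlation first removes each user's average rating, and Jaccard similarity only looks at which movies were liked.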

In [2]:
import numpy as np
import pandas as pd

In [3]:
# import data
movies = pd.read_csv('data/ml-latest-small/movies.csv')
ratings = pd.read_csv('data/ml-latest-small/ratings.csv')
# join each rating to its movie via the shared movieId column
movies_ratings = movies.merge(ratings, on='movieId')

## Create the similarity matrix

In 3 simple steps:

1. Create the big users-items table

2. Replace NaNs with zeros

3. Compute pairwise cosine similarities

### 1. Create the big users-items table.

We are just reshaping (pivoting) the data, so that we have users as rows and movies as columns. We need the data to be in this shape to compute similarities between users in the next step.

In [4]:
users_items = pd.pivot_table(data=movies_ratings,
                             values='rating',
                             index='userId', 
                             columns='movieId')

users_items.head()

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
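
The reshaping step is easier to follow on a table small enough to read in full. The sketch below applies the same `pd.pivot_table` call to a hypothetical five-row ratings table (the name `toy_ratings` and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.5],
})

# Users become rows, movies become columns, ratings fill the cells
toy_users_items = pd.pivot_table(data=toy_ratings,
                                 values="rating",
                                 index="userId",
                                 columns="movieId")
print(toy_users_items)
```

Every (user, movie) pair without a rating becomes a NaN in the wide table. If a pair appeared more than once, `pivot_table` would average the duplicates by default.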


### 2. Replace NaNs with zeros
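Cosine similarity cannot be computed on missing values, so we replace every NaN with 0. From this point on, a 0 in the table means that the user has not rated the movie.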

In [5]:
users_items.fillna(0, inplace=True)
users_items.head()

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
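
The following sketch illustrates why the NaNs have to go: a single missing value turns the whole similarity into NaN. The two-user table `toy` and the helper `cosine` are hypothetical and only serve this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical users-items table: user 2 has not rated movie 20
toy = pd.DataFrame({10: [4.0, 3.0], 20: [5.0, np.nan]}, index=[1, 2])

def cosine(u, v):
    """Plain cosine similarity between two 1-D arrays."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# With the NaN in place, the dot product and the norm are both NaN
with_nan = cosine(toy.loc[1].to_numpy(), toy.loc[2].to_numpy())

# After filling, the missing rating simply contributes nothing to the dot product
filled = toy.fillna(0)
without_nan = cosine(filled.loc[1].to_numpy(), filled.loc[2].to_numpy())

print(with_nan, without_nan)
```

One consequence to keep in mind is that, numerically, an unrated movie is now indistinguishable from a movie rated 0.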


### 3. Compute pairwise cosine similarities

In [8]:
from sklearn.metrics.pairwise import cosine_similarity
user_similarities = pd.DataFrame(cosine_similarity(users_items),
                                 columns=users_items.index, 
                                 index=users_items.index)
user_similarities.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,1.0,0.027283,0.05972,0.194395,0.12908,0.128152,0.158744,0.136968,0.064263,0.016875,...,0.080554,0.164455,0.221486,0.070669,0.153625,0.164191,0.269389,0.291097,0.093572,0.145321
2,0.027283,1.0,0.0,0.003726,0.016614,0.025333,0.027585,0.027257,0.0,0.067445,...,0.202671,0.016866,0.011997,0.0,0.0,0.028429,0.012948,0.046211,0.027565,0.102427
3,0.05972,0.0,1.0,0.002251,0.00502,0.003936,0.0,0.004941,0.0,0.0,...,0.005048,0.004892,0.024992,0.0,0.010694,0.012993,0.019247,0.021128,0.0,0.032119
4,0.194395,0.003726,0.002251,1.0,0.128659,0.088491,0.11512,0.062969,0.011361,0.031163,...,0.085938,0.128273,0.307973,0.052985,0.084584,0.200395,0.131746,0.149858,0.032198,0.107683
5,0.12908,0.016614,0.00502,0.128659,1.0,0.300349,0.108342,0.429075,0.0,0.030611,...,0.068048,0.418747,0.110148,0.258773,0.148758,0.106435,0.152866,0.135535,0.261232,0.060792
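
To show what a pairwise cosine similarity matrix contains, here is a NumPy-only sketch on a hypothetical 3 x 4 matrix (`toy_matrix` is made up). It scales each row to unit length and then takes all pairwise dot products, which should agree with `cosine_similarity` as long as no row consists only of zeros.

```python
import numpy as np

# Hypothetical zero-filled users-items matrix: 3 users x 4 movies
toy_matrix = np.array([
    [4.0, 0.0, 5.0, 0.0],
    [4.0, 0.0, 5.0, 0.0],
    [0.0, 3.0, 0.0, 2.0],
])

# Scale every row to unit length, then take all pairwise dot products
row_norms = np.linalg.norm(toy_matrix, axis=1, keepdims=True)
unit_rows = toy_matrix / row_norms
toy_similarities = unit_rows @ unit_rows.T

print(toy_similarities)
```

The first two users rated the same movies identically, so their similarity is 1. The third user shares no rated movie with them, so those similarities are 0. The diagonal is always 1, because every user is perfectly similar to themselves.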


## Building the recommender step by step:

Let's focus on a single user (user `5`) and compute recommendations only for this user, as an example. Then we will build a function that can compute recommendations for any user. We will follow these steps:

1. Compute the weights.

2. Find movies user `5` has not rated.

3. Compute the ratings user `5` would give to those unrated movies.

4. Find the top 5 movies from the rating predictions.

### 1. Compute the weights

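The weights are the similarities between user `5` and every other user, divided by their sum so that they add up to 1. We exclude user `5` themselves using `.query()`.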

In [13]:
user_id = 5

# similarities between the chosen user and every other user
similarities = user_similarities.query("userId != @user_id")[user_id]
# normalise so that the weights sum to 1
weights = similarities / similarities.sum()
weights.head(6)

userId
1    0.001729
2    0.000223
3    0.000067
4    0.001724
6    0.004024
7    0.001451
Name: 5, dtype: float64

In [14]:
weights.sum()

1.0000000000000013
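
The normalisation can be checked by hand on a small, made-up similarity vector (`toy_sims` is hypothetical). The sketch also hints at why the sum printed above is not exactly 1: adding up many floating-point numbers accumulates tiny rounding errors, so comparing against 1 with a tolerance is safer than testing for equality.

```python
import math
import pandas as pd

# Hypothetical similarities between one user and four other users
toy_sims = pd.Series([0.3, 0.1, 0.0, 0.4], index=[1, 2, 3, 4], name="similarity")

# Divide by the total so that the weights add up to 1
toy_weights = toy_sims / toy_sims.sum()
print(toy_weights)

# Compare with a tolerance instead of ==, because of floating-point rounding
print(math.isclose(toy_weights.sum(), 1.0))
```

A user with similarity 0 gets weight 0 and therefore has no influence on the predictions.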

### 2. Find movies user `5` has not rated.

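In `users_items`, a rating of 0 means the movie has not been rated, so we look for the zeros in user `5`'s row. When selecting the ratings, we also drop user `5`'s own row, because the weights only cover the other users.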

In [16]:
users_items.loc[user_id,:]==0

movieId
1         False
2          True
3          True
4          True
5          True
          ...  
193581     True
193583     True
193585     True
193587     True
193609     True
Name: 5, Length: 9724, dtype: bool

In [17]:
# select the movies that the given user has not rated (rows: all other users)
not_watched_movies = users_items.loc[users_items.index!=user_id, users_items.loc[user_id,:]==0]
not_watched_movies.T

movieId \ userId,1,2,3,4,6,7,8,9,10,11,...,601,602,603,604,605,606,607,608,609,610
2,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
6,4.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,5.0,...,0.0,3.0,4.0,3.0,0.0,0.0,0.0,0.0,0.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193583,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193585,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193587,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 3. Compute the ratings user `5` would give to those unrated movies.

In [19]:
# dot product between the not_watched_movies ratings and the weights
weighted_averages = pd.DataFrame(not_watched_movies.T.dot(weights), columns=["predicted_rating"])
weighted_averages.head()

movieId,predicted_rating
2,0.922408
3,0.3761
4,0.060716
5,0.373809
6,0.881502
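
The dot product above is a weighted sum over all other users, including those who have not rated the movie and therefore contribute a 0. The sketch below reproduces this on hypothetical data and contrasts it with one possible alternative that renormalises by the weight of the users who actually rated each movie. The alternative is an assumption added for comparison, not the method used in this notebook, and all names and values are made up.

```python
import pandas as pd

# Hypothetical data: rows are three other users, columns are two unrated movies
toy_unrated = pd.DataFrame({101: [5.0, 0.0, 4.0], 102: [0.0, 0.0, 3.0]},
                           index=[1, 2, 3])
toy_weights = pd.Series([0.5, 0.3, 0.2], index=[1, 2, 3])

# Same computation as above: zeros (unrated) pull the score down
score_all_users = toy_unrated.T.dot(toy_weights)

# Alternative (assumption): divide by the total weight of the users
# who actually rated each movie
rated_mask = (toy_unrated > 0).astype(float)
weight_of_raters = rated_mask.T.dot(toy_weights)
score_raters_only = (score_all_users / weight_of_raters).where(weight_of_raters > 0)

print(score_all_users)
print(score_raters_only)
```

With the first approach, movies rated by many similar users score higher, so the result mixes predicted rating and popularity. This may explain why the predicted ratings above are well below the ratings users typically give.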


### 4. Find the top 5 movies from the rating predictions

In [23]:
data = movies[['movieId', 'title']]

In [24]:
recommendations = weighted_averages.merge(data, left_index=True, right_on="movieId")
recommendations.sort_values("predicted_rating", ascending=False).head()

,predicted_rating,movieId,title
314,2.987118,356,Forrest Gump (1994)
510,2.530083,593,"Silence of the Lambs, The (1991)"
418,2.281672,480,Jurassic Park (1993)
43,1.881849,47,Seven (a.k.a. Se7en) (1995)
334,1.699053,377,Speed (1994)


## Make a function that recommends the top `n` movies for a given `user_id`

In [25]:
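def weighted_user_rec(user_id, n):
    """Return the top `n` movies for `user_id`, ranked by predicted rating."""
    # weights: similarities to all other users, normalised to sum to 1
    similarities = user_similarities.query("userId != @user_id")[user_id]
    weights = similarities / similarities.sum()
    # ratings given by the other users to the movies `user_id` has not rated
    not_watched_movies = users_items.loc[users_items.index != user_id, users_items.loc[user_id, :] == 0]
    # predicted rating = weighted sum of the other users' ratings
    weighted_averages = pd.DataFrame(not_watched_movies.T.dot(weights), columns=["predicted_rating"])
    # attach the titles (uses the `data` table defined above) and keep the n best
    recommendations = weighted_averages.merge(data, left_index=True, right_on="movieId")
    top_recommendations = recommendations.sort_values("predicted_rating", ascending=False).head(n)
    return top_recommendations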

In [27]:
weighted_user_rec(5, 10)

,predicted_rating,movieId,title
314,2.987118,356,Forrest Gump (1994)
510,2.530083,593,"Silence of the Lambs, The (1991)"
418,2.281672,480,Jurassic Park (1993)
43,1.881849,47,Seven (a.k.a. Se7en) (1995)
334,1.699053,377,Speed (1994)
31,1.582715,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
138,1.565097,165,Die Hard: With a Vengeance (1995)
224,1.551673,260,Star Wars: Episode IV - A New Hope (1977)
1939,1.548122,2571,"Matrix, The (1999)"
615,1.411351,780,Independence Day (a.k.a. ID4) (1996)
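
The function above reads `user_similarities`, `users_items` and `data` from the notebook's global scope. As a sketch of one possible variation, the version below takes the tables as arguments so it can be tried on a tiny example. The function name `recommend_for_user`, its parameters and the toy data are all hypothetical, and the similarities are computed with NumPy to keep the example self-contained.

```python
import numpy as np
import pandas as pd

def recommend_for_user(user_id, n, users_items, user_similarities, titles):
    """Top-n unrated movies for user_id (hypothetical variant without globals)."""
    others = users_items.index != user_id
    similarities = user_similarities.loc[others, user_id]
    weights = similarities / similarities.sum()
    unrated = users_items.loc[others, users_items.loc[user_id, :] == 0]
    scores = unrated.T.dot(weights).rename("predicted_rating").to_frame()
    ranked = scores.merge(titles, left_index=True, right_on="movieId")
    return ranked.sort_values("predicted_rating", ascending=False).head(n)

# Hypothetical toy data: 3 users x 3 movies, 0 = not rated
toy_users_items = pd.DataFrame(
    [[5.0, 0.0, 0.0], [5.0, 4.0, 0.0], [1.0, 0.0, 5.0]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)
toy_titles = pd.DataFrame({"movieId": [10, 20, 30],
                           "title": ["Movie A", "Movie B", "Movie C"]})

# Pairwise cosine similarities between the toy users
values = toy_users_items.to_numpy()
unit_rows = values / np.linalg.norm(values, axis=1, keepdims=True)
toy_similarities = pd.DataFrame(unit_rows @ unit_rows.T,
                                index=toy_users_items.index,
                                columns=toy_users_items.index)

toy_top = recommend_for_user(1, 2, toy_users_items, toy_similarities, toy_titles)
print(toy_top)
```

In this toy example user 1 is much closer to user 2 than to user 3, so the movie rated by user 2 ranks above the one rated by user 3.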
