
# Intro


**Notes**

The bulk of the material comes from https://developers.google.com/machine-learning/recommendation/overview/candidate-generation. If you want to go further later, you can take a look at http://nicolas-hug.com/blog/matrix_facto_3. You are not expected to read either link for the interviews or to complete the test.

**Context**: 

We want to build a movie recommender to find new movies to watch during the lockdown. We will base our work on a variation of the MovieLens dataset.
The data consists of the movies seen by the users, some information about the movies, and some information about the users. The problem is to predict which movies a given user might like.

We first present a naive approach so that you can familiarize yourself with the problem and see how it might be solved.

**Task**:

The code presented is a first implementation, but it has a number of shortcomings in its structure and features (more on that in the conclusion). Your task is to refactor it, so as to be one step closer to "clean" code.

**Evaluation**:

Our goals here are to:
- See how you understand a problem and adapt to an existing approach to tackle it.
- See how you design new features.
- See how you work with Python code: understanding it, ideas for refactoring, etc.

The projects will be evaluated on the quality of the source code produced.

# The data

First, let's load some data.

In [1]:
import pandas as pd

users = pd.read_csv("data/users.csv")
print(users.shape)
users.head()

(6040, 5)


Unnamed: 0,user_id,gender,age,occupation,zip_code
0,0,F,1,10,48067
1,1,M,56,16,70072
2,2,M,25,15,55117
3,3,M,45,7,2460
4,4,M,25,20,55455


In [2]:
movies = pd.read_csv("data/movies.csv")
movies.head()

Unnamed: 0,movie_id,title,year,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,...,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0,Toy Story,1995,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Jumanji,1995,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,Grumpier Old Men,1995,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,Waiting to Exhale,1995,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,Father of the Bride Part II,1995,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [3]:
ratings = pd.read_csv("data/ratings.csv")
ratings.head()

Unnamed: 0,user_id,movie_id,rating
0,0,1192,5
1,0,660,3
2,0,913,3
3,0,3407,4
4,0,2354,5


# Content-based Filtering

Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback. It does not use information about other users.

For example, if user `A` liked `Harry Potter 1`, they will probably like `Harry Potter 2`.

In [4]:
%%html
<img src='https://miro.medium.com/max/1642/1*BME1JjIlBEAI9BV5pOO5Mg.png' height="300" width="250"/>

# Scalability to other datasets

The main obstacle to the code's scalability is that it works directly with the three data tables (movies, ratings, users). Machine learning algorithms usually expect a single dataset as input. Working with three tables can also be an issue when defining the training and test sets.

We mainly work with the ratings table, but we use the movies table to get the genres, and we could use the users table to add new features.

Here, we define one dataset that includes all the columns we deem useful for the recommendation. This global dataset plays the role of a 'view', independent of the actual schema of the data tables.

We use right joins between the tables so that rows from 'users' and 'movies' are kept even if they do not appear in the ratings table. Note that the second join keeps every movie but drops any user row that has no rating; an outer join would keep both.
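The effect of a right join can be seen on a toy example. The two small tables below are made up for illustration (they are not the MovieLens files), and `merge` is used instead of `join` only for brevity. This is a sketch of the join behaviour, not the notebook's code.

```python
import pandas as pd

# Toy tables, invented for this example.
ratings_demo = pd.DataFrame({
    "user_id": [0, 0, 1],
    "movie_id": [10, 11, 10],
    "rating": [5, 3, 4],
})
movies_demo = pd.DataFrame({
    "movie_id": [10, 11, 12],
    "title": ["A", "B", "C"],
})

# Right join: every row of movies_demo is kept, including movie 12,
# which nobody has rated.
joined = ratings_demo.merge(movies_demo, on="movie_id", how="right")

# Rows without a matching rating get NaN in the columns coming from ratings_demo.
unrated = joined[joined["rating"].isna()]
print(unrated[["movie_id", "title"]])
```

The NaN values introduced this way are the reason the numeric columns of the joined table become floats.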

In [5]:
import numpy as np

In [6]:
join_user_ratings = ratings.join(users.set_index('user_id'), how='right', on='user_id')
dataframe = join_user_ratings.join(movies.set_index('movie_id'), how='right', on='movie_id')


In [7]:
dataframe

Unnamed: 0,user_id,movie_id,rating,gender,age,occupation,zip_code,title,year,Animation,...,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
40.0,0.0,0,5.0,F,1.0,10.0,48067,Toy Story,1995,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
469.0,5.0,0,4.0,F,50.0,9.0,55117,Toy Story,1995,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
581.0,7.0,0,4.0,M,25.0,12.0,11413,Toy Story,1995,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
711.0,8.0,0,5.0,M,25.0,17.0,61614,Toy Story,1995,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
837.0,9.0,0,5.0,F,35.0,1.0,95370,Toy Story,1995,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
964309.0,5811.0,3951,4.0,F,25.0,7.0,92120,"Contender, The",2000,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
967122.0,5830.0,3951,3.0,M,25.0,1.0,92120,"Contender, The",2000,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
968244.0,5836.0,3951,4.0,M,25.0,7.0,60607,"Contender, The",2000,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
982282.0,5926.0,3951,1.0,M,35.0,14.0,10003,"Contender, The",2000,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
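# Encode gender as an integer; missing values map to 0.
dataframe['gender'] = dataframe['gender'].map({np.nan: 0, 'F': 1, 'M': 2}).astype(int)

# Replace each distinct zip code with an arbitrary integer label, starting at 1.
zip_codes = dataframe['zip_code'].unique()
codes_map = {code: i + 1 for i, code in enumerate(zip_codes)}
dataframe['zip_code'] = dataframe['zip_code'].map(codes_map).astype(int)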

In [9]:
dataframe

Unnamed: 0,user_id,movie_id,rating,gender,age,occupation,zip_code,title,year,Animation,...,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
40.0,0.0,0,5.0,1,1.0,10.0,1,Toy Story,1995,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
469.0,5.0,0,4.0,1,50.0,9.0,2,Toy Story,1995,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
581.0,7.0,0,4.0,2,25.0,12.0,3,Toy Story,1995,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
711.0,8.0,0,5.0,2,25.0,17.0,4,Toy Story,1995,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
837.0,9.0,0,5.0,1,35.0,1.0,5,Toy Story,1995,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
964309.0,5811.0,3951,4.0,1,25.0,7.0,1517,"Contender, The",2000,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
967122.0,5830.0,3951,3.0,2,25.0,1.0,1517,"Contender, The",2000,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
968244.0,5836.0,3951,4.0,2,25.0,7.0,781,"Contender, The",2000,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
982282.0,5926.0,3951,1.0,2,35.0,14.0,459,"Contender, The",2000,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Features

We define the features that help us compute the similarity between movies or between users. We also define the subsets of column names that let us easily select the movie-related or user-related columns.

In [10]:
genre_cols = ["Animation", "Children's",
              'Comedy', 'Adventure', 'Fantasy', 'Romance', 'Drama',
              'Action', 'Crime', 'Thriller', 'Horror', 'Sci-Fi', 'Documentary', 'War',
              'Musical', 'Mystery', 'Film-Noir', 'Western']
user_info_cols = ['gender', 'age', 'occupation', 'zip_code']
movies_cols = movies.columns
users_cols = users.columns

# Code refactoring

We redefine the methods introduced in the original notebook in separate files to facilitate reuse and maintainability.

**movies.py**: includes get_movie_id, get_movie_name and get_movie_year. It is used to manipulate movies.

**users.py**: includes get_user_index and get_user_ID. It is used to manipulate users.

**scalability_tools.py**: contains get_subset, a function that reconstructs a sub-dataframe from the global one, given a subset of columns and an ID column. It can also hold any future methods for data scalability.

**similarity_model_tools.py**: contains the methods for content-based filtering and for collaborative filtering (based on users).

**evaluation.py**: defines the evaluation metrics.
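To make the role of get_subset concrete, here is a minimal sketch of what such a helper could look like. The name `get_subset_sketch` and its body are assumptions made for illustration; the actual function in scalability_tools.py may differ.

```python
import pandas as pd

def get_subset_sketch(df: pd.DataFrame, columns, id_col: str) -> pd.DataFrame:
    """Hypothetical stand-in for get_subset: keep the requested columns
    and one row per distinct value of id_col."""
    cols = [c for c in columns if c in df.columns]
    return df[cols].drop_duplicates(subset=id_col)

# Toy 'global view' with one row per rating (invented data).
view = pd.DataFrame({
    "user_id": [0, 1, 0],
    "movie_id": [10, 10, 11],
    "rating": [5, 4, 3],
    "title": ["A", "A", "B"],
})

movies_demo = get_subset_sketch(view, ["movie_id", "title"], "movie_id")
print(movies_demo)
```

The result keeps the index of the global view, which is why a `reset_index(drop=True)` is useful afterwards.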


In [11]:
from content_based_filtering.helpers.movies import get_movie_id, get_movie_name, get_movie_year
from content_based_filtering.helpers.users import *
from content_based_filtering.helpers.scalability_tools import *
from content_based_filtering.helpers.similarity_model_tools import *


We use get_subset to reconstruct 'movies' and 'users' for an easier implementation


In [12]:
movies = get_subset(dataframe, movies_cols, 'movie_id').reset_index(drop=True)
users = get_subset(dataframe, users_cols, 'user_id').reset_index(drop=True)

In [13]:
movies

Unnamed: 0,movie_id,title,year,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,...,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0,Toy Story,1995,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Jumanji,1995,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,Grumpier Old Men,1995,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,Waiting to Exhale,1995,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,Father of the Bride Part II,1995,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3878,3947,Meet the Parents,2000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3879,3948,Requiem for a Dream,2000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3880,3949,Tigerland,2000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3881,3950,Two Family House,2000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
print("User_ID at index 6040 :", int(get_user_ID(users,6040)))
print("Index of user_ID 5726 :", get_user_index(users,5726))

User_ID at index 6040 : 5726
Index of user_ID 5726 : 6040


### Similarity Matrix

We define one method that computes the similarity matrix between movies and another that computes the similarity matrix between users.

In [15]:
movies_similarity=get_movies_similarity_matrix(movies[genre_cols])

In [16]:
similarity_with_toy_story = movies_similarity[0] # 0 is Toy Story
similarity_with_toy_story

array([3., 1., 1., ..., 0., 0., 0.])

In [17]:
for i in range(10):
    print(f"Similarity between Toy story and {movies.iloc[i]['title']} (index {i}) is {similarity_with_toy_story[i]}")

Similarity between Toy story and Toy Story (index 0) is 3.0
Similarity between Toy story and Jumanji (index 1) is 1.0
Similarity between Toy story and Grumpier Old Men (index 2) is 1.0
Similarity between Toy story and Waiting to Exhale (index 3) is 1.0
Similarity between Toy story and Father of the Bride Part II (index 4) is 1.0
Similarity between Toy story and Heat (index 5) is 0.0
Similarity between Toy story and Sabrina (index 6) is 1.0
Similarity between Toy story and Tom and Huck (index 7) is 1.0
Similarity between Toy story and Sudden Death (index 8) is 0.0
Similarity between Toy story and GoldenEye (index 9) is 0.0
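The values above are consistent with a dot product of the binary genre vectors, which counts the genres two movies share (Toy Story has three genres, hence 3.0 with itself). The sketch below assumes that definition; it is not necessarily how get_movies_similarity_matrix is implemented, and the genre matrix is invented.

```python
import numpy as np

# One row per movie, one column per genre (toy data).
genres = np.array([
    [1, 1, 1, 0],  # movie 0: three genres
    [0, 1, 0, 1],  # movie 1: shares one genre with movie 0
    [0, 0, 0, 1],  # movie 2: shares no genre with movie 0
], dtype=float)

# similarity[i, j] = number of genres shared by movies i and j.
similarity = genres @ genres.T

print(similarity[0])  # similarity of movie 0 with every movie
```

With this definition, movies tagged with many genres tend to score higher; dividing by the vector norms (cosine similarity) would be one way to compensate.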


#### Similarity between users
The similarity between two users is defined as the sum of the absolute values of the differences between their features. This is an L1 (Manhattan) distance, so a smaller value means the two users are more alike.

In [19]:
users_similarity=get_users_similarity_matrix(users[user_info_cols])
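The sum of absolute feature differences can be computed for all pairs of users at once with NumPy broadcasting. This is a sketch under the definition given above, with invented feature values; get_users_similarity_matrix itself may be written differently.

```python
import numpy as np

# One row per user: gender code, age, occupation code (toy data).
user_features = np.array([
    [1, 25, 10],
    [1, 25, 17],
    [2, 56, 16],
], dtype=float)

# distance[i, j] = sum over features of |x[i] - x[j]| (L1 distance).
diff = user_features[:, None, :] - user_features[None, :, :]
distance = np.abs(diff).sum(axis=2)

# Users sorted from closest to farthest from user 0 (user 0 itself comes first).
closest_to_user_0 = np.argsort(distance[0], kind="stable")
print(distance[0], closest_to_user_0)
```

The intermediate `diff` array has shape (n_users, n_users, n_features), which becomes large for thousands of users; `scipy.spatial.distance.cdist` with `metric="cityblock"` computes the same matrix without it.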

We use get_most_similar_movies and get_most_similar_users to obtain the most similar movies or users, as follows:

In [20]:
get_most_similar_movies(movies_similarity, movies, 'Toy Story')  

[(667, 'Space Jam', 3.0),
 (3685, 'Adventures of Rocky and Bullwinkle, The', 3.0),
 (3682, 'Chicken Run', 3.0),
 (2009, 'Jungle Book, The', 3.0),
 (2011, 'Lady and the Tramp', 3.0),
 (2012, 'Little Mermaid, The', 3.0),
 (2033, 'Steamboat Willie', 3.0),
 (2072, 'American Tail, An', 3.0),
 (2073, 'American Tail: Fievel Goes West, An', 3.0)]

In [21]:
get_most_similar_users(users_similarity,users,0)

[(4765, nan, nan),
 (6, 18.0, 7.0),
 (333, 978.0, 7.0),
 (412, 1219.0, 7.0),
 (18, 50.0, 18.0),
 (24, 74.0, 23.0),
 (1537, 4516.0, 26.0),
 (1355, 3952.0, 28.0),
 (2519, 4951.0, 28.0)]

# Recommendation

We obtain recommendations based either on the movies watched by a user or on the similarity between users.

In [22]:
get_collaborative_recommendations(dataframe,movies,users,users_similarity,0)

Unnamed: 0,movie_id,title,similarity
0,0,Toy Story,7.0
1,1277,Real Genius,7.0
2,1996,"Purple Rose of Cairo, The",7.0
3,1993,"Governess, The",7.0
4,1981,Herbie Goes Bananas,7.0
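The content-based route mentioned above (recommending from the movies a user has already watched) can be sketched as follows. The function name, the scoring rule (summed similarity to the watched movies) and the toy data are assumptions for illustration, not the implementation of get_content_based_recommendations.

```python
import numpy as np

def recommend_content_based(similarity, seen, k=5):
    """Rank unseen movies by their summed similarity to the movies in `seen`."""
    scores = similarity[seen].sum(axis=0).astype(float)
    scores[seen] = -np.inf  # never recommend a movie already watched
    top = np.argsort(-scores, kind="stable")[:k]
    return [(int(i), float(scores[i])) for i in top]

# Toy genre matrix and the corresponding dot-product similarity.
genres = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
], dtype=float)
similarity = genres @ genres.T

recommendations = recommend_content_based(similarity, seen=[0], k=2)
print(recommendations)
```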


# Evaluation

We will design an evaluation metric that lets us compare the various solutions and determine whether a new feature improves performance.


A naive evaluation would be to take the average similarity of the recommended movies.


Another solution would be to split, for each user, the movies they rated into a training subset and a smaller evaluation subset. From the evaluation subset, we keep the movies with the highest ratings (call this set M). Let R be the top movies returned by the recommendation method. We evaluate the recommendation by counting the movies that appear in both M and R.
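Counting the movies common to M and R amounts to a set intersection. The helper names below are hypothetical and the ids are invented; this is only a sketch of the metric described above.

```python
def hit_count(recommended, relevant):
    """Number of movies present in both the recommendations R and the held-out set M."""
    return len(set(recommended) & set(relevant))

def hit_rate(recommended, relevant):
    """Hit count divided by the number of recommendations (0.0 if there are none)."""
    if not recommended:
        return 0.0
    return hit_count(recommended, relevant) / len(recommended)

R = [1, 5, 9, 12]  # hypothetical recommended movie ids
M = [5, 12, 40]    # hypothetical top-rated held-out movie ids
print(hit_count(R, M), hit_rate(R, M))
```

Dividing by the number of recommendations makes scores comparable between methods that return lists of different lengths.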


A variation would be to compare the recommendations to a random subset of the movies rated by the user (not necessarily with high ratings) and use the rating as a weight.
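One possible reading of this weighted variation is to sum the ratings of the held-out movies that were recommended. The function below is a sketch under that assumption, with invented data; the version in evaluation.py may weight differently.

```python
def weighted_hit_score(recommended, held_out_ratings, use_ratings=True):
    """Sum, over recommended movies found in the held-out set, of the user's
    rating (or of 1.0 per hit when use_ratings is False)."""
    total = 0.0
    for movie_id in recommended:
        if movie_id in held_out_ratings:
            total += held_out_ratings[movie_id] if use_ratings else 1.0
    return total

recommended = [1, 5, 9]                 # hypothetical recommended movie ids
held_out_ratings = {5: 4, 9: 2, 40: 5}  # hypothetical movie_id -> rating

weighted = weighted_hit_score(recommended, held_out_ratings)
unweighted = weighted_hit_score(recommended, held_out_ratings, use_ratings=False)
print(weighted, unweighted)
```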

If we are only interested in the 'watchability' of a movie, i.e. whether the user will watch it regardless of how they rate it, we can use the second evaluation metric without the ratings as weights.

We define both functions in **evaluation.py**.


To evaluate performance on the whole dataset, we average the metric over all users.

We apply the similarity score metric to evaluate the performance of the content-based recommendation.

In [270]:
movies= get_subset(dataframe,movies_cols,'movie_id')
score=[]
for user in dataframe['user_id'].unique():
    recommendations=get_content_based_recommendations(dataframe,movies,movies_similarity,user) 
    score.append(np.mean(recommendations['similarity'].values))
print(np.mean(score))

3.399999999999999


In [None]:
movies= get_subset(dataframe,movies_cols,'movie_id')
score=[]
for user in dataframe['user_id'].unique():
    recommendations=get_collaborative_recommendations(dataframe,movies,users,users_similarity,user)
    score.append(np.mean(recommendations['similarity'].values))
print(np.mean(score))

# Test

We split our dataset into a training set and a test set.
If we want to use the recommendation_score, we have to split the dataset horizontally: for each user in the ratings dataset, we use one subset of their movies for training and another for evaluation (the target).
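Such a per-user split can be sketched with a grouped sample. The function name, the default fraction and the toy ratings are assumptions; the sketch also assumes the frame has a unique index.

```python
import pandas as pd

def split_per_user(ratings, test_fraction=0.5, seed=42):
    """Hold out a fraction of each user's ratings; assumes a unique index."""
    test = ratings.groupby("user_id").sample(frac=test_fraction, random_state=seed)
    train = ratings.drop(test.index)
    return train, test

# Toy ratings: two users with four ratings each (invented data).
ratings_demo = pd.DataFrame({
    "user_id":  [0, 0, 0, 0, 1, 1, 1, 1],
    "movie_id": [10, 11, 12, 13, 10, 14, 15, 16],
    "rating":   [5, 3, 4, 2, 4, 5, 1, 3],
})

train, test = split_per_user(ratings_demo)
print(test["user_id"].value_counts().to_dict())
```

Unlike a global random split, every user then appears in both sets, so recommendations built from the training part can be checked against the held-out part.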

For now, we will only work with the similarity score.

To be able to test our solution, we must have learned something during the training phase. In our case, training produces the similarity matrix: given a training set, we build the similarity matrix from the movies it contains. The challenge is the scenario where a new user is introduced in the test dataset with movies that did not exist when we trained the model.
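If the similarity is a dot product of genre vectors (an assumption, as is everything named below), a movie unseen at training time can be added to the matrix from its genres alone, without recomputing the existing entries. This is one possible way to approach the problem, not the notebook's solution.

```python
import numpy as np

def add_movie(genres, similarity, new_genres):
    """Append one movie to a genre matrix and its dot-product similarity matrix."""
    row = genres @ new_genres          # similarity to every known movie
    self_sim = new_genres @ new_genres # similarity of the new movie with itself
    top = np.hstack([similarity, row[:, None]])
    bottom = np.append(row, self_sim)[None, :]
    return np.vstack([genres, new_genres]), np.vstack([top, bottom])

# Toy 'training' movies and their similarity matrix.
genres = np.array([[1, 1, 0],
                   [0, 1, 1]], dtype=float)
similarity = genres @ genres.T

# A movie that appears only at test time.
new_movie = np.array([1, 0, 1], dtype=float)
genres_ext, similarity_ext = add_movie(genres, similarity, new_movie)
print(similarity_ext)
```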

For now, we only test with the same similarity matrix applied to both the training and test sets.

In [303]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(dataframe , test_size=0.33, random_state=42)

In [318]:
movies= get_subset(dataframe,movies_cols,'movie_id')
movies_similarity=get_movies_similarity_matrix(movies[genre_cols])

score=[]
for user in df_train['user_id'].unique():
    recommendations=get_content_based_recommendations(df_train,movies,movies_similarity,user)
    score.append(np.mean(recommendations['similarity'].values))
print(np.mean(score))


In [319]:
score=[]
for user in df_test['user_id'].unique():
    recommendations=get_content_based_recommendations(df_test,movies,movies_similarity,user)
    score.append(np.mean(recommendations['similarity'].values))
print(np.mean(score))

2.0
