
# Intro


**Notes**

The bulk of the material comes from https://developers.google.com/machine-learning/recommendation/overview/candidate-generation. If you want to go further later, you can take a look at http://nicolas-hug.com/blog/matrix_facto_3. You are not expected to read either of these links for the interviews or to complete the test.

**Context**: 

We want to build a movie recommender in order to find new movies to watch during the lockdown. We will base our work on a variation of the MovieLens dataset.
The data consists of the movies seen by the users, some information about the movies, and some information about the users. The problem consists in predicting which movies a given user might like.

We first present a naive approach so that you can familiarize yourself with the problem and see how it might be solved.

**Task**:

The code presented is a first implementation but has a number of shortcomings in its structure and features (more on that in the conclusion). Your task consists in refactoring it, so as to be one step closer to "clean" code.

**Evaluation**:

Our goal here is threefold:
- See how you understand a problem and adapt to an already given approach to tackle it.
- See how you can design new features.
- See how you manipulate Python code: understanding, ideas for refactoring, etc.

The projects will be evaluated on the quality of the source code produced.

# The data

First, let's load some data.

In [1]:
import pandas as pd

users = pd.read_csv("data/users.csv")
print(users.shape)
users.head()

(6040, 5)


Unnamed: 0,user_id,gender,age,occupation,zip_code
0,0,F,1,10,48067
1,1,M,56,16,70072
2,2,M,25,15,55117
3,3,M,45,7,2460
4,4,M,25,20,55455


In [2]:
movies = pd.read_csv("data/movies.csv")
movies.head()

Unnamed: 0,movie_id,title,year,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,...,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0,Toy Story,1995,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Jumanji,1995,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,Grumpier Old Men,1995,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,Waiting to Exhale,1995,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,Father of the Bride Part II,1995,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [3]:
ratings = pd.read_csv("data/ratings.csv")
ratings.head()

Unnamed: 0,user_id,movie_id,rating
0,0,1176,5
1,0,655,3
2,0,902,3
3,0,3339,4
4,0,2286,5


# Content-based Filtering

Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback. We don't use other users' information!

For example, if user `A` liked `Harry Potter 1`, they will probably like `Harry Potter 2`.

In [4]:
%%html
<img src='https://miro.medium.com/max/1642/1*BME1JjIlBEAI9BV5pOO5Mg.png' height="300" width="250"/>

What are similar movies? To answer this question, we need to build a similarity measure.

## Features

This measure will operate on the characteristics (**features**) of the movies to determine which ones are close. In our case, we have access to the genres of the movies. For example, the genres of `Toy Story` are: `Animation`, `Children's` and `Comedy`. This is represented as follows in our dataset:

In [5]:
genre_cols = ["Animation", "Children's", 
       'Comedy', 'Adventure', 'Fantasy', 'Romance', 'Drama',
       'Action', 'Crime', 'Thriller', 'Horror', 'Sci-Fi', 'Documentary', 'War',
       'Musical', 'Mystery', 'Film-Noir', 'Western']

genre_and_title_cols = ['title'] + genre_cols 

movies[genre_and_title_cols].head()

Unnamed: 0,title,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,Toy Story,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Jumanji,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Grumpier Old Men,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Waiting to Exhale,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Father of the Bride Part II,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Similarity

Now that we have some features, we will look for a function that measures similarity. The similarity function takes two items (two lists of features) and returns a number that increases with their similarity.

In what follows, we will consider that the similarity between two movies is the number of genres they have in common.

Here is an example with `Toy Story` and `E.T.`:

In [6]:
toy_story_genres = movies[genre_and_title_cols].loc[movies.title == 'Toy Story'][genre_cols].iloc[0]
toy_story_genres

Animation      1.0
Children's     1.0
Comedy         1.0
Adventure      0.0
Fantasy        0.0
Romance        0.0
Drama          0.0
Action         0.0
Crime          0.0
Thriller       0.0
Horror         0.0
Sci-Fi         0.0
Documentary    0.0
War            0.0
Musical        0.0
Mystery        0.0
Film-Noir      0.0
Western        0.0
Name: 0, dtype: float64

In [7]:
et_genres = movies[genre_and_title_cols].loc[movies.title == 'E.T. the Extra-Terrestrial'][genre_cols].iloc[0]
et_genres

Animation      0.0
Children's     1.0
Comedy         0.0
Adventure      0.0
Fantasy        1.0
Romance        0.0
Drama          1.0
Action         0.0
Crime          0.0
Thriller       0.0
Horror         0.0
Sci-Fi         1.0
Documentary    0.0
War            0.0
Musical        0.0
Mystery        0.0
Film-Noir      0.0
Western        0.0
Name: 1081, dtype: float64

In [8]:
et_genres.values * toy_story_genres

Animation      0.0
Children's     1.0
Comedy         0.0
Adventure      0.0
Fantasy        0.0
Romance        0.0
Drama          0.0
Action         0.0
Crime          0.0
Thriller       0.0
Horror         0.0
Sci-Fi         0.0
Documentary    0.0
War            0.0
Musical        0.0
Mystery        0.0
Film-Noir      0.0
Western        0.0
Name: 0, dtype: float64

In [9]:
(et_genres.values * toy_story_genres).sum() # scalar product

1.0

So our similarity measure returns `1.0` for these two movies. 

Let's see another example where we compare `Toy Story` and `Pocahontas`:

In [10]:
pocahontas_genres = movies[genre_and_title_cols].loc[movies.title == 'Pocahontas'][genre_cols].iloc[0]
(pocahontas_genres.values * toy_story_genres).sum()

2.0

This tells us that `Pocahontas` is closer to `Toy Story` than `E.T.` is, which makes sense.
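The computation used in the cells above can be wrapped in a small function. The following is only a minimal sketch and is not part of the original code: the name `genre_similarity` is hypothetical, and the hand-written vectors are restricted to six genres for brevity, with values copied from the outputs above.

```python
import numpy as np

# Hypothetical genre order, for illustration only (a subset of the dataset's genres).
GENRES = ["Animation", "Children's", "Comedy", "Fantasy", "Drama", "Sci-Fi"]


def genre_similarity(a, b):
    """Return the number of genres shared by two binary genre vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both vectors must describe the same genres")
    # With 0/1 entries, the dot product counts positions where both are 1.
    return float(np.dot(a, b))


# Genre vectors restricted to GENRES, in that order.
toy_story = [1, 1, 1, 0, 0, 0]
et = [0, 1, 0, 1, 1, 1]

print(genre_similarity(toy_story, et))  # 1.0: the only shared genre is Children's
```

Comparing a movie with itself returns its number of genres, so the measure favors movies with many genres.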


## Scaling up

OK, that's a nice measure. Now we are going to scale it up to all the movies in our dataset. To do so smartly, let's look at the operation we just did from a mathematical point of view. We will think of the list of features of a movie as a row vector `V`. Then, our similarity measure between `Toy Story` and `E.T.` becomes:
$ V_{ToyStory} \cdot V_{ET}^{T}$

More generally, the similarity measure between a movie `i` and another movie `j` is: $ V_{i} \cdot V_{j}^{T}$
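Written out component by component, this product is a sum over genres. The notation below is introduced here for illustration only: $K$ is the number of genres and $V_{i,k}$ is 1 if movie `i` has genre `k`, and 0 otherwise.

$$ V_{i} \cdot V_{j}^{T} = \sum_{k=1}^{K} V_{i,k} \, V_{j,k} $$

Each term equals 1 only when both movies have genre `k`, so the sum counts the genres they share. For `Toy Story` and `E.T.`, the only non-zero term is the one for `Children's`, which gives the value 1 obtained above.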

Now we can think of `movies` as a matrix containing all the feature vectors describing the movies. Here is how our similarity measure looks in this context:

![](imgs/dot_product_matrices.png)

To obtain the similarity between all the movies of our dataset, we have to compute the dot product of the `movies` matrix with the transpose of the `movies` matrix.

In [11]:
similarity = movies[genre_cols].values.dot(movies[genre_cols].values.T)
similarity.shape

(3883, 3883)
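To see what this product contains, here is a minimal sketch on a tiny, made-up feature matrix (3 movies, 4 genres). The matrix and the name `toy_similarity` are assumptions for illustration, not data from the dataset.

```python
import numpy as np

# Hypothetical feature matrix: 3 movies (rows) x 4 genres (columns).
features = np.array([
    [1, 1, 1, 0],  # movie 0
    [0, 1, 0, 1],  # movie 1
    [1, 1, 0, 0],  # movie 2
], dtype=float)

# Entry (i, j) is the dot product of row i with row j.
toy_similarity = features @ features.T

print(toy_similarity.shape)  # (3, 3)
print(toy_similarity[0, 1])  # 1.0: movies 0 and 1 share one genre
print(toy_similarity[0, 2])  # 2.0: movies 0 and 2 share two genres

# The matrix is symmetric, and its diagonal holds each movie's genre count.
print(np.allclose(toy_similarity, toy_similarity.T))  # True
print(np.diag(toy_similarity))  # [3. 2. 2.]
```

The diagonal entries are the similarity of each movie with itself, which is why a movie always ranks at least as high as any other movie in its own row.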

We can now get the similarity between `Toy Story` and any other movie of our dataset

In [12]:
similarity_with_toy_story = similarity[0] # 0 is Toy Story
similarity_with_toy_story

array([3., 1., 1., ..., 0., 0., 0.])

In [13]:
for i in range(1,10):
    print("Similarity between Toy story and ", movies.iloc[i]['title'], " is ", similarity_with_toy_story[i])

('Similarity between Toy story and ', 'Jumanji', ' is ', 1.0)
('Similarity between Toy story and ', 'Grumpier Old Men', ' is ', 1.0)
('Similarity between Toy story and ', 'Waiting to Exhale', ' is ', 1.0)
('Similarity between Toy story and ', 'Father of the Bride Part II', ' is ', 1.0)
('Similarity between Toy story and ', 'Heat', ' is ', 0.0)
('Similarity between Toy story and ', 'Sabrina', ' is ', 1.0)
('Similarity between Toy story and ', 'Tom and Huck', ' is ', 1.0)
('Similarity between Toy story and ', 'Sudden Death', ' is ', 0.0)
('Similarity between Toy story and ', 'GoldenEye', ' is ', 0.0)


## A bit of polishing

### Helpers:

We also built some helpers to handle the movies dataset:

In [14]:
from content_based_filtering.helpers.movies import Movies

listMovies = Movies(movies)

print (listMovies.get_movie_id('Toy Story'))
print (listMovies.get_movie_id('Die Hard'))

print (listMovies.get_movie_name(0))
print (listMovies.get_movie_name(1000))
print (listMovies.get_movie_year(1000))

0
1023
Toy Story
Parent Trap, The
1961


### Finding similar movies:
Here is a function returning the movies most similar to a given movie:

In [15]:
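def get_most_similar(similarity, movie_name, year=None, top=10):
    index_movie = listMovies.get_movie_id(movie_name, year)
    # Indices of all movies, sorted by decreasing similarity with the given movie.
    best = similarity[index_movie].argsort()[::-1]
    # Keep the `top` best entries; the movie itself is dropped if it appears among them.
    return [(ind, listMovies.get_movie_name(ind), similarity[index_movie, ind]) for ind in best[:top] if ind != index_movie]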

In [16]:
get_most_similar(similarity, 'Toy Story')

[(667, 'Space Jam', 3.0),
 (3685, 'Adventures of Rocky and Bullwinkle, The', 3.0),
 (3682, 'Chicken Run', 3.0),
 (2009, 'Jungle Book, The', 3.0),
 (2011, 'Lady and the Tramp', 3.0),
 (2012, 'Little Mermaid, The', 3.0),
 (2033, 'Steamboat Willie', 3.0),
 (2072, 'American Tail, An', 3.0),
 (2073, 'American Tail: Fievel Goes West, An', 3.0)]

In [17]:
get_most_similar(similarity, 'Psycho', 1960) 

[(3593, "Puppet Master III: Toulon's Revenge", 2.0),
 (2923, 'Rawhead Rex', 2.0),
 (1312, 'Believers, The', 2.0),
 (3407, "Jacob's Ladder", 2.0),
 (1957, 'Disturbing Behavior', 2.0),
 (1927, 'Poltergeist III', 2.0),
 (1926, 'Poltergeist II: The Other Side', 2.0),
 (1925, 'Poltergeist', 2.0),
 (732, 'Thinner', 2.0),
 (69, 'From Dusk Till Dawn', 2.0)]
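Note that `get_most_similar` removes the query movie after truncating to `top` entries, which is why the `Toy Story` call returns 9 results while the `Psycho` call returns 10. As a possible variation (a sketch only, with the hypothetical name `top_k_similar` and a made-up matrix), the query can be removed before truncating so that exactly `k` indices come back whenever enough movies exist:

```python
import numpy as np


def top_k_similar(similarity, index, k=10):
    """Return the indices of the k rows most similar to `index`, excluding itself."""
    scores = np.asarray(similarity[index], dtype=float)
    # A stable sort on the negated scores ranks the highest scores first
    # and breaks ties by lowest index.
    order = np.argsort(-scores, kind="stable")
    # Drop the query before truncating so that up to k items are returned.
    order = order[order != index]
    return order[:k].tolist()


# Hypothetical 4x4 similarity matrix, for illustration only.
toy_similarity = np.array([
    [3.0, 1.0, 2.0, 0.0],
    [1.0, 2.0, 1.0, 1.0],
    [2.0, 1.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
])

print(top_k_similar(toy_similarity, 0, k=2))  # [2, 1]
```

The tie-breaking order in this sketch differs from the reversed `argsort` used above, so movies with equal scores may come out in a different order.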

### Giving a recommendation:

And finally, let's find some movies to recommend based on previously liked movies:

In [18]:
def get_recommendations(user_id):
    # The 3 movies the user rated highest.
    top_movies = ratings[ratings['user_id'] == user_id].sort_values(by='rating', ascending=False).head(3)['movie_id']
    index = ['movie_id', 'title', 'similarity']

    # Collect the movies most similar to each of them.
    most_similars = []
    for top_movie in top_movies:
        most_similars += get_most_similar(similarity, listMovies.get_movie_name(top_movie), listMovies.get_movie_year(top_movie))

    # Merge the candidates and keep the 5 with the highest similarity.
    return pd.DataFrame(most_similars, columns=index).drop_duplicates().sort_values(by='similarity', ascending=False).head(5)

get_recommendations(0)


Unnamed: 0,movie_id,title,similarity
13,773,"Hunchback of Notre Dame, The",3.0
14,1526,Hercules,3.0
27,2072,"American Tail, An",3.0
26,2033,Steamboat Willie,3.0
25,2012,"Little Mermaid, The",3.0


In [19]:
get_recommendations(999)

Unnamed: 0,movie_id,title,similarity
0,166,First Knight,2.0
2,1451,Smilla's Sense of Snow,2.0
3,503,"Perfect World, A",2.0
3,3197,Man Bites Dog (C'est arrivé près de chez vous),2.0
5,1458,"Devil's Own, The",2.0
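As written, `get_recommendations` does not remove movies the user has already rated, and `drop_duplicates()` compares whole rows, so the same movie can appear twice with different similarity values. The sketch below shows one possible way to merge candidates; the data, the `already_rated` set and the variable names are all made up for illustration.

```python
import pandas as pd

# Hypothetical candidates, shaped like the (movie_id, title, similarity)
# tuples returned by get_most_similar for several liked movies.
candidates = pd.DataFrame(
    [(10, "Movie A", 3.0), (11, "Movie B", 2.0), (10, "Movie A", 1.0), (12, "Movie C", 2.0)],
    columns=["movie_id", "title", "similarity"],
)
already_rated = {11}  # hypothetical ids of movies the user has rated

recommendations = (
    candidates[~candidates["movie_id"].isin(already_rated)]
    .sort_values(by="similarity", ascending=False)
    .drop_duplicates(subset="movie_id")  # keep the best score per movie
    .head(5)
    .reset_index(drop=True)
)
print(recommendations)
```

Sorting before deduplicating on `movie_id` means the row kept for each movie is the one with its highest similarity.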


# Conclusion:

The code presented is a first implementation but has a number of shortcomings that prevent multiple MLEs and Data Scientists from collaborating on it:
- It is not easy to introduce new features, mainly because the code is just a bunch of functions in one file.
- The code cannot be scaled to other datasets or variations of the task.
- There is no evaluation of the performance.
- There is no testing.
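To make the last point concrete, here is a minimal sketch of what tests for the similarity computation might look like. The helper `build_similarity` is hypothetical (it is not part of the code above), and the test functions only follow the pytest naming convention as one possible choice.

```python
import numpy as np


def build_similarity(features):
    """Hypothetical helper: pairwise shared-genre counts for a binary feature matrix."""
    features = np.asarray(features, dtype=float)
    return features @ features.T


def test_similarity_is_symmetric():
    features = [[1, 1, 0], [0, 1, 1]]
    sim = build_similarity(features)
    assert np.array_equal(sim, sim.T)


def test_diagonal_counts_genres():
    features = [[1, 1, 0], [0, 1, 1], [0, 0, 0]]
    sim = build_similarity(features)
    assert sim.diagonal().tolist() == [2.0, 2.0, 0.0]


# The functions can be collected by a test runner or simply called directly.
test_similarity_is_symmetric()
test_diagonal_counts_genres()
```

Tests on small hand-written matrices like these are easy to check by hand, which makes them a reasonable first safety net before refactoring.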

Additionally, we could think of some features to add. For example, what about looking at similar users to find a recommendation for our targeted user?
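As a starting point for that idea, the same dot-product trick could be applied to users instead of movies. The sketch below is only an assumption about how this might be done: the ratings are made up, unrated movies are treated as 0, and the raw dot product is a crude measure that favors users with many ratings.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings in the same (user_id, movie_id, rating) layout as ratings.csv.
toy_ratings = pd.DataFrame({
    "user_id":  [0, 0, 1, 1, 2],
    "movie_id": [0, 1, 0, 1, 2],
    "rating":   [5, 3, 4, 3, 5],
})

# Users as rows, movies as columns, 0 where a user has not rated a movie.
user_movie = toy_ratings.pivot_table(
    index="user_id", columns="movie_id", values="rating", fill_value=0
)

# Same dot-product idea as for movies, applied to rating vectors.
matrix = user_movie.to_numpy(dtype=float)
user_similarity = matrix @ matrix.T
print(user_similarity)
```

In this toy example, users 0 and 1 rated the same movies and get a high score, while user 2 shares no rated movie with them and gets a score of 0.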