# Recommendation - Data Preparation 🎬

---

<img src="https://cdn-images-1.medium.com/max/1200/0*ePGWILY6GyplT-nn" />

---

In the next few challenges, you will build a powerful **movie recommender**.

We will use the open-source library [LightFM](https://github.com/lyst/lightfm), which provides an easy-to-use Python implementation of **hybrid** recommendation engines.

In this first part, we will prepare the data so that the model can be trained efficiently.

Start by loading the `movies` and `ratings` data, downloaded from the **small** [movielens dataset](https://grouplens.org/datasets/movielens/).



In [5]:
### TODO: Load the movies and ratings datasets
import pandas as pd
import numpy as np
movies = pd.read_csv("ml-latest-small/movies.csv")
ratings = pd.read_csv("ml-latest-small/ratings.csv")

In [6]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [7]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


**Q1**. What are the different types of recommendation models? Explain briefly with your own words the differences between them.

The different types of recommendation models include:

- Content-based filtering: The model recommends items similar to those a user has already engaged with. It relies on item features (descriptions of the items) and on a user profile built from that user's own activity (likes, shares, follows, etc.). Pros: it does not need interaction data from other users. Cons: recommendations stay close to what the user already knows, so they are quite limited in variety.

- Collaborative filtering: The model uses similarities between users and between items, computed from interaction data, to make recommendations. It requires a similarity score between each pair of users or items. Pros: it can uncover complex relationships or patterns that are not obvious. Cons: it suffers from the cold-start problem (new users or items have no interactions yet) and requires a large amount of user interaction data.

- Hybrid filtering: The model combines collaborative and content-based filtering. Pros: it can provide more accurate and diverse recommendations. Cons: it is more complex to implement.
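To make the difference between the first two families concrete, here is a minimal sketch in plain Python. The genre columns are a subset of the genres listed in `movies`; the user interactions and the helper `cosine` are made up for illustration. The same similarity measure is applied to two different descriptions of a movie: its own features (content-based) and the users who interacted with it (collaborative).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Content-based view: each movie is described by its own features.
# Columns: [Adventure, Children, Comedy, Romance]
genre_features = {
    "Toy Story": [1, 1, 1, 0],
    "Jumanji": [1, 1, 0, 0],
    "Grumpier Old Men": [0, 0, 1, 1],
}

# Collaborative view: each movie is described by who interacted with it.
# Columns: hypothetical users A, B, C, D (1 = interacted). Made-up data.
interactions = {
    "Toy Story": [1, 1, 0, 1],
    "Jumanji": [0, 0, 1, 0],
    "Grumpier Old Men": [1, 1, 0, 1],
}

content_sim = cosine(genre_features["Toy Story"], genre_features["Jumanji"])
collab_sim = cosine(interactions["Toy Story"], interactions["Jumanji"])

print(f"content-based similarity: {content_sim:.3f}")
print(f"collaborative similarity: {collab_sim:.3f}")
```

In this toy data the two movies look alike by genre but share no users, so the two approaches disagree. A hybrid model has access to both signals.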


**Q1bis**. What data is expected by the LightFM `fit` method? In particular, how should the training data be organized, and what should be the type of the training dataset?

The training data should be a sparse matrix, typically a `scipy.sparse` matrix (e.g., `scipy.sparse.coo_matrix`). This sparse matrix represents interactions between users and items: rows are users, columns are items, and each entry (i, j) holds the interaction between user i and item j. The values can be binary (indicating whether an interaction happened or not) or real-valued (indicating the strength of the interaction, such as a rating).
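As an illustration of that layout, the sketch below builds a small users x items matrix with SciPy. The indices and values are made up, and `float32` is chosen here on the assumption that it matches what LightFM documents for its interactions matrix; the real matrix for this project comes from `df_to_matrix` later on.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy interactions given as parallel arrays: (user index, item index, value).
user_idx = np.array([0, 0, 1, 2])
item_idx = np.array([0, 2, 1, 2])
values = np.array([4.0, 5.0, 3.0, 1.0], dtype=np.float32)

# 3 users x 4 items. Cells that are not listed are implicit zeros and are not stored.
toy_interactions = coo_matrix((values, (user_idx, item_idx)), shape=(3, 4))

print(toy_interactions.shape)  # (3, 4)
print(toy_interactions.nnz)    # 4 stored entries

# COO is convenient for construction; CSR supports element lookup by [row, col].
toy_csr = toy_interactions.tocsr()
print(toy_csr[0, 2])
```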

**Q2**. Explore `movies` and `ratings`, what do those datasets contain? How are they organized?

In [8]:
print(movies.shape)
movies.head()

(9742, 3)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


The `movies` dataset has 9742 rows and three columns: `movieId`, `title`, and `genres`. Each row is one movie, and its genres are stored as a single `|`-separated string.

In [9]:
print(ratings.shape)
ratings.head()

(100836, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


The `ratings` dataset has 100836 rows and four columns: `userId`, `movieId`, `rating`, and `timestamp`. Each row is one rating given by one user to one movie.

---

### Q3 & Q4 are optional
> you can come back to it if you have time after having finished the whole project of the day

We created a few utility functions for you in the `utils.py` script. In particular:
- `threshold_interactions_df`:
> Limit interactions df to minimum row and column interactions

**Q3**. Open `src/utils.py` file, and have a look at the documentation of this function to understand its goal and how it works.

Have a look at the code to fully understand how it works. You should be familiar with everything.

What does the variable `sparsity` represent? What range of values can it take?
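One plausible reading, to be checked against `src/utils.py`, is that `sparsity` is the percentage of user/movie cells that actually contain a rating, which would bound it between 0 and 100. The sketch below uses a hypothetical helper `interaction_density_pct` with the counts reported in this notebook (100836 ratings, 610 users, 9724 rated movies); the result is consistent with the starting figure the function prints.

```python
def interaction_density_pct(n_interactions, n_rows, n_cols):
    """Percentage of cells in an n_rows x n_cols grid that hold an interaction."""
    return 100.0 * n_interactions / (n_rows * n_cols)

# Counts taken from this notebook's outputs.
pct = interaction_density_pct(100836, 610, 9724)
print(f"{pct:.3f}%")  # 1.700%
```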

**Q4**. Create a new DataFrame `ratings_thresh`, that filters `ratings` with only:
- users that rated strictly more than 4 movies
- movies that have been rated at least 10 times

How many users/movies remain in this new dataset?

In [13]:
from utils import threshold_interactions_df

# 5 and 10 are the minimum-interaction thresholds from Q4
ratings_thresh = threshold_interactions_df(ratings, 'userId', 'movieId', 5, 10)
# Number of distinct users and movies left after filtering
print(
    ratings_thresh.userId.nunique(),
    ratings_thresh.movieId.nunique()
)

Starting interactions info
Number of rows: 610
Number of cols: 9724
Sparsity: 1.700%
Ending interactions info
Number of rows: 610
Number of columns: 3650
Sparsity: 4.055%
610 3650
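For comparison, the filtering goal stated in Q4 can be sketched directly in pandas. The helper `filter_min_interactions` below is hypothetical and is not the `threshold_interactions_df` from `utils.py`, whose exact behavior and argument order should be checked in its docstring. The loop repeats because removing users can push a movie below its threshold, and vice versa.

```python
import pandas as pd

def filter_min_interactions(df, user_col, item_col, user_min, item_min):
    """Keep rows whose user has >= user_min rows and whose item has >= item_min rows.

    The two filters are re-applied until the DataFrame stops shrinking.
    """
    while True:
        n_before = len(df)
        # Number of rows per user, aligned with df's index
        per_user = df.groupby(user_col)[item_col].transform("count")
        df = df[per_user >= user_min]
        # Number of rows per item, computed on what is left
        per_item = df.groupby(item_col)[user_col].transform("count")
        df = df[per_item >= item_min]
        if len(df) == n_before:
            return df

# Toy data (made up): user 3 has a single rating and is filtered out.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 10],
})
toy_thresh = filter_min_interactions(toy_ratings, "userId", "movieId", user_min=2, item_min=2)
print(toy_thresh.userId.nunique(), toy_thresh.movieId.nunique())  # 2 2
```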


**Q5**. In order to fit a [LightFM](https://lyst.github.io/lightfm/docs/home.html) model, we need to transform our DataFrame into a sparse matrix (see below). This is not straightforward, so we included the function `df_to_matrix` in `utils.py`.

> 🔦 **Hint**:  Sparse matrices are just **big matrices with a lot of zeros or empty values**.
> 
> Existing tools (pandas DataFrames or NumPy arrays, for example) are not well suited to this kind of data, because they store every zero explicitly. So we will use [Scipy sparse matrices](https://docs.scipy.org/doc/scipy-0.14.0/reference/sparse.html).
>
> There are many different "types" of sparse matrices (CSC, CSR, COO, DIA, etc.). You don't need to know them all. Just know that they are different storage formats, each with its own methods for manipulation, slicing, indexing, etc.

> 🔦 **Hint 2**:  By going from a DataFrame to a sparse matrix, you lose the ids (`userId` and `movieId`) and only deal with indices (row number and column number). Therefore, the `df_to_matrix` function also returns dictionaries mapping between ids and indices (e.g. `uid_to_idx` maps a `userId` to its row index in the matrix).


Have a look at the util function documentation, and use it to create 5 new variables:
- a final sparse matrix `ratings_matrix` (this will be the data used to train the model)
- the following utils mappers:
    - `uid_to_idx`
    - `idx_to_uid`
    - `mid_to_idx`
    - `idx_to_mid`
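The idea behind these mappers can be sketched in a few lines of plain Python. The ids below are made up, the variable names are prefixed with `toy_` to avoid clashing with the real mappers, and assigning indices in sorted order is an arbitrary choice for this sketch; `df_to_matrix` may order them differently.

```python
# Toy ids: real userId / movieId values need not be contiguous or start at 0.
user_ids = [4, 610, 1, 4, 1]

# Give each distinct id a row index, then build the reverse lookup.
toy_uid_to_idx = {uid: idx for idx, uid in enumerate(sorted(set(user_ids)))}
toy_idx_to_uid = {idx: uid for uid, idx in toy_uid_to_idx.items()}

print(toy_uid_to_idx)     # {1: 0, 4: 1, 610: 2}
print(toy_idx_to_uid[2])  # 610
```

Going through `uid_to_idx` before indexing the matrix, and through `idx_to_uid` when reading results back, keeps rows tied to the original ids.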

In [14]:
from utils import df_to_matrix

ratings_matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = df_to_matrix(ratings, 'userId', 'movieId')
ratings_matrix

<610x9724 sparse matrix of type '<class 'numpy.float64'>'
	with 100836 stored elements in Compressed Sparse Row format>

In [15]:
ratings[ratings.userId==4]

Unnamed: 0,userId,movieId,rating,timestamp
300,4,21,3.0,986935199
301,4,32,2.0,945173447
302,4,45,3.0,986935047
303,4,47,2.0,945173425
304,4,52,3.0,964622786
...,...,...,...,...
511,4,4765,5.0,1007569445
512,4,4881,3.0,1007569445
513,4,4896,4.0,1007574532
514,4,4902,4.0,1007569465


In [22]:
for mid in [1, 2, 21, 32, 126]:
    print('For MID:', mid)
    # Translate ids to matrix indices before looking up the cell
    print('Matrix value (for user 4) is:', ratings_matrix[uid_to_idx[4], mid_to_idx[mid]])

For MID: 1
Matrix value (for user 4) is: 0.0
For MID: 2
Matrix value (for user 4) is: 0.0
For MID: 21
Matrix value (for user 4) is: 1.0
For MID: 32
Matrix value (for user 4) is: 1.0
For MID: 126
Matrix value (for user 4) is: 1.0


**Q5**. Now that you have a `ratings_matrix` in the correct format, let's save it in pickle format:
- Create a variable `dst_dir` corresponding to the path of the folder `data/netflix` located at the root of the repository
- **Verify that this is the correct path**
- Save the ratings_matrix in pickle (as `ratings_matrix.pkl`) in this corresponding directory

In [25]:
import os
import pickle
dst_dir = "./data/netflixApp"
file_path_rm = os.path.join(dst_dir, "ratings_matrix.pkl")
os.makedirs(dst_dir, exist_ok=True)
with open(file_path_rm, "wb") as file:
    pickle.dump(ratings_matrix, file)

print(f"ratings_matrix saved to {file_path_rm}")

ratings_matrix saved to ./data/netflixApp/ratings_matrix.pkl
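A relative path such as `./data/netflixApp` is resolved from the kernel's current working directory, which is not necessarily the root of the repository. One way to verify where the file actually lands is to print the absolute path, as in this small sketch using `pathlib` (the name `dst_path` is introduced here for illustration):

```python
from pathlib import Path

dst_path = Path("./data/netflixApp")  # same relative path as in the cell above
resolved = dst_path.resolve()         # absolute path, based on the working directory

print("Working directory:", Path.cwd())
print("Resolved target:  ", resolved)
print("Exists as folder: ", resolved.is_dir())
```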


**Q6**. Also save all the mapping objects as pickle files (`idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`), as they will be useful later.

In [28]:
file_path_itm = os.path.join(dst_dir, "idx_to_mid.pkl")
with open(file_path_itm, "wb") as file:
    pickle.dump(idx_to_mid, file)

print(f"idx_to_mid saved to {file_path_itm}")

idx_to_mid saved to ./data/netflixApp/idx_to_mid.pkl


In [32]:
file_path_mti = os.path.join(dst_dir, "mid_to_idx.pkl")
with open(file_path_mti, "wb") as file:
    pickle.dump(mid_to_idx, file)

print(f"mid_to_idx saved to {file_path_mti}")

mid_to_idx saved to ./data/netflixApp/mid_to_idx.pkl


In [33]:
file_path_uti = os.path.join(dst_dir, "uid_to_idx.pkl")
with open(file_path_uti, "wb") as file:
    pickle.dump(uid_to_idx, file)

print(f"uid_to_idx saved to {file_path_uti}")

uid_to_idx saved to ./data/netflixApp/uid_to_idx.pkl


In [34]:
file_path_itu = os.path.join(dst_dir, "idx_to_uid.pkl")
with open(file_path_itu, "wb") as file:
    pickle.dump(idx_to_uid, file)

print(f"idx_to_uid saved to {file_path_itu}")

idx_to_uid saved to ./data/netflixApp/idx_to_uid.pkl


In [35]:
file_path_m = os.path.join(dst_dir, "movie.pkl")
with open(file_path_m, "wb") as file:
    pickle.dump(movies, file)  # the DataFrame loaded at the top is named `movies`

print(f"movie saved to {file_path_m}")

movie saved to ./data/netflixApp/movie.pkl
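The save cells above all follow the same pattern, so they could also be written as one loop, followed by a reload to check that each file round-trips. The sketch below uses small stand-in dictionaries and a temporary folder so that it runs on its own; in the notebook, the dictionary would hold the real objects and the folder would be `dst_dir`.

```python
import os
import pickle
import tempfile

# Stand-in objects (made up); in the notebook these would be the real mappings.
objects_to_save = {
    "uid_to_idx": {1: 0, 4: 1},
    "idx_to_uid": {0: 1, 1: 4},
}

with tempfile.TemporaryDirectory() as tmp_dir:  # throwaway folder instead of dst_dir
    # Save each object under <name>.pkl
    for name, obj in objects_to_save.items():
        with open(os.path.join(tmp_dir, f"{name}.pkl"), "wb") as file:
            pickle.dump(obj, file)

    # Reload each file to confirm it matches what was saved
    reloaded = {}
    for name in objects_to_save:
        with open(os.path.join(tmp_dir, f"{name}.pkl"), "rb") as file:
            reloaded[name] = pickle.load(file)

print(reloaded == objects_to_save)  # True
```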


On to the next challenge! 🍿