# Recommendation - Data Preparation 🎬

---

<img src="https://cdn-images-1.medium.com/max/1200/0*ePGWILY6GyplT-nn" />

---

In the next few challenges, you will build a powerful **movie recommender**.

We will use the open-source library [LightFM](https://github.com/lyst/lightfm), which provides an easy-to-use Python implementation of **hybrid** recommendation engines.

In this first part, we will prepare the data so that the model can be trained efficiently.

Start by loading the `movies` and `ratings` data, downloaded from the **small** [MovieLens dataset](https://grouplens.org/datasets/movielens/).



In [1]:
import numpy as np
import pandas as pd

df_movies = pd.read_csv("movies.csv")
df_ratings = pd.read_csv("ratings.csv")

In [2]:
df_movies

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [2]:
# MovieLens timestamps are seconds since the Unix epoch, so the unit must be given explicitly.
df_ratings["date_time"] = pd.to_datetime(df_ratings["timestamp"], unit="s")
df_ratings.head()

,userId,movieId,rating,timestamp,date_time
0,1,1,4.0,964982703,2000-07-30 18:45:03
1,1,3,4.0,964981247,2000-07-30 18:20:47
2,1,6,4.0,964982224,2000-07-30 18:37:04
3,1,47,5.0,964983815,2000-07-30 19:03:35
4,1,50,5.0,964982931,2000-07-30 18:48:51


**Q1**. What are the different types of recommendation models? Briefly explain the differences between them in your own words.

There are three main types of recommendation systems. The first is content-based recommendation, which computes how similar items are to one another and recommends items similar to those the target user already rated highly. It relies only on the target user's own ratings and not on other users, which is what distinguishes it from the other models.
The second is rating-based recommendation (often called collaborative filtering), which starts from users' previous ratings or purchases and predicts a rating for each unseen item, either user by user or item by item. Items are then ranked according to the preferences of users with similar tastes.
The third (although other types exist) is the clustering model. It is similar to rating-based recommendation, but users or items are first grouped into clusters before the rating-based recommendation is performed.
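
To make the difference concrete, here is a minimal sketch on made-up data. The genre features and ratings below are hypothetical toy values, and cosine similarity is just one common choice of similarity measure. The same pair of movies can look similar through their attributes (content-based) and unrelated through the ratings they received (rating-based), or the other way round.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Content-based view: movies are compared through their own attributes.
# Columns: Comedy, Romance, Action (hypothetical toy features).
item_features = np.array([
    [1.0, 1.0, 0.0],  # movie A: Comedy + Romance
    [1.0, 0.0, 0.0],  # movie B: Comedy
    [0.0, 0.0, 1.0],  # movie C: Action
])
content_sim_ab = cosine_similarity(item_features[0], item_features[1])
content_sim_ac = cosine_similarity(item_features[0], item_features[2])

# Rating-based view: movies are compared through the ratings users gave them.
# Rows are users, columns are movies A, B, C; 0 means "not rated".
ratings = np.array([
    [5.0, 0.0, 4.0],
    [4.0, 0.0, 5.0],
    [0.0, 3.0, 0.0],
])
rating_sim_ab = cosine_similarity(ratings[:, 0], ratings[:, 1])
rating_sim_ac = cosine_similarity(ratings[:, 0], ratings[:, 2])

print(f"content-based: sim(A,B)={content_sim_ab:.2f}, sim(A,C)={content_sim_ac:.2f}")
print(f"rating-based:  sim(A,B)={rating_sim_ab:.2f}, sim(A,C)={rating_sim_ac:.2f}")
```

In this toy case, A and B share a genre but no common raters, while A and C share no genre but are liked by the same users. A hybrid model tries to use both signals.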

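**Q1bis**. What data does the LightFM `fit` method expect? In particular, how should the training data be organized, and what type should the training dataset be?

The interactions will be passed as a SciPy sparse matrix of shape `(n_users, n_items)`, where each row is a user, each column is an item, and each non-zero entry marks an interaction between that user and that item.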
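
As an illustration, the sketch below builds a small interactions matrix of that kind with SciPy. The indices and sizes are made-up toy values, and the LightFM call is left as a comment so the snippet runs without the library installed.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy interactions as (user_index, item_index) pairs; the indices are made up.
user_idx = np.array([0, 0, 1, 2])
item_idx = np.array([0, 2, 1, 2])
values = np.ones(len(user_idx), dtype=np.float32)

# Rows are users, columns are items.
n_users, n_items = 3, 4
interactions = coo_matrix((values, (user_idx, item_idx)), shape=(n_users, n_items))

print(interactions.shape)
print(interactions.dtype)
print(interactions.toarray())

# With LightFM installed, a matrix like this is what `fit` receives, e.g.:
# model = LightFM(loss="warp")
# model.fit(interactions, epochs=10)
```

Only the four listed pairs are stored; every other cell is an implicit zero.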

In [11]:
pip install lightfm

Collecting lightfm
  Downloading lightfm-1.16.tar.gz (310 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: lightfm
  Building wheel for lightfm (setup.py) ... done
  Created wheel for lightfm: filename=lightfm-1.16-cp38-cp38-macosx_10_9_x86_64.whl size=443676 sha256=a930fa0c2d32b38bb34918e537582ac49dcd2f2ebd48399396a3747674ac5f9f
  Stored in directory: /Users/laravaroni/Library/Caches/pip/wheels/ec/bb/51/9c487d021c1373b691d13cadca0b65b6852627b1f3f43550fa
Successfully built lightfm
Installing collected packages: lightfm
Successfully installed lightfm-1.16
You should consider upgrading via the '/opt/anaconda3/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [3]:
import numpy as np
import lightfm as lf
from lightfm import LightFM
from lightfm.data import Dataset




**Q2**. Explore `movies` and `ratings`. What do those datasets contain? How are they organized?

In [5]:
df_movies.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [13]:
df_ratings.info()
len(df_ratings["userId"].unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   userId     100836 non-null  int64         
 1   movieId    100836 non-null  int64         
 2   rating     100836 non-null  float64       
 3   timestamp  100836 non-null  int64         
 4   date_time  100836 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(3)
memory usage: 3.8 MB


610

The ratings table contains 100836 rows, one per rating, but only 610 unique users. Together, these users rated 9724 different movies.

In [14]:
len(df_ratings["movieId"].unique())

9724

---

### Q3 & Q4 are optional
> You can come back to them if you have time after finishing the whole project of the day.

We created a few utility functions for you in the `utils.py` script, in particular:
- `threshold_interactions_df`:
> Limit interactions df to minimum row and column interactions

**Q3**. Open `src/utils.py` file, and have a look at the documentation of this function to understand its goal and how it works.

Have a look at the code to fully understand how it works. You should be familiar with everything in it.

What does the variable `sparsity` represent? What range of values can it take?

Here, `sparsity` is the percentage of cells in the user-movie matrix that actually contain a rating: the number of ratings divided by (number of users × number of movies), times 100. It therefore ranges from 0% (no ratings at all) to 100% (every user has rated every movie). A low value means that most user-movie pairs are empty and would appear as 0 in the matrix.
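
As a quick check, the short sketch below recomputes the starting value from the counts shown earlier in this notebook (100836 ratings, 610 users, 9724 movies). The helper name `filled_percentage` is hypothetical, and the formula is an assumption about what `utils.py` computes; it does reproduce the 1.700% printed by `threshold_interactions_df`.

```python
def filled_percentage(n_interactions, n_rows, n_cols):
    """Share of matrix cells that hold an interaction, as a percentage."""
    return 100.0 * n_interactions / (n_rows * n_cols)

# Counts taken from the `df_ratings` exploration above.
start = filled_percentage(100836, 610, 9724)
print(f"Sparsity: {start:.3f}%")

# The two extremes of the range.
print(filled_percentage(0, 10, 10))    # empty matrix
print(filled_percentage(100, 10, 10))  # every cell filled
```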

**Q4**. Create a new DataFrame `ratings_thresh`, that filters `ratings` with only:
- users that rated strictly more than 4 movies
- movies that have been rated at least 10 times

How many users/movies remain in this new dataset?

In [15]:
from utils import threshold_interactions_df

# Keep users with at least 5 ratings (strictly more than 4)
# and movies with at least 10 ratings.
ratings_thresh = threshold_interactions_df(
    df=df_ratings,
    row_name="userId",
    col_name="movieId",
    row_min=5,
    col_min=10,
)

Starting interactions info
Number of rows: 610
Number of cols: 9724
Sparsity: 1.700%
Ending interactions info
Number of rows: 610
Number of columns: 3650
Sparsity: 4.055%
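
For intuition, here is a minimal sketch of how this kind of thresholding can be done with pandas. It is not the code in `utils.py`: the function name `threshold_sketch` and the toy data are hypothetical, and the loop reflects the assumption that removing rare movies can push some users below their own minimum, so the filter is repeated until nothing changes.

```python
import pandas as pd

def threshold_sketch(df, row_name, col_name, row_min, col_min):
    """Drop rare columns, then rare rows, and repeat until the frame is stable."""
    while True:
        n_before = len(df)
        # Number of ratings received by the movie of each row.
        col_counts = df.groupby(col_name)[row_name].transform("count")
        df = df[col_counts >= col_min]
        # Number of ratings given by the user of each remaining row.
        row_counts = df.groupby(row_name)[col_name].transform("count")
        df = df[row_counts >= row_min]
        if len(df) == n_before:
            return df

toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 10, 30],
})
filtered = threshold_sketch(toy, "userId", "movieId", row_min=2, col_min=2)
print(filtered)
```

In the toy data, movie 30 is dropped first because it has a single rating, which leaves user 3 with only one rating, so user 3 is dropped as well.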


**Q5**. In order to fit a [LightFM](https://lyst.github.io/lightfm/docs/home.html) model, we need to transform our DataFrame into a sparse matrix (see below). This is not straightforward, so we included the function `df_to_matrix` in `utils.py`.

> 🔦 **Hint**:  Sparse matrices are just **big matrices with a lot of zeros or empty values**.
> 
> Existing tools (Pandas DataFrames and NumPy arrays, for example) are not well suited to manipulating this kind of data, so we will use [Scipy sparse matrices](https://docs.scipy.org/doc/scipy-0.14.0/reference/sparse.html).
>
> There are many different "types" of sparse matrices (CSC, CSR, COO, DIA, etc.). You don't need to know them all. Just know that they correspond to different storage formats, each with its own methods for manipulation, slicing, indexing, etc.

> 🔦 **Hint 2**: By going from a DataFrame to a sparse matrix, you lose the ids (userId and movieId) and only deal with indices (row number and column number). Therefore, the `df_to_matrix` function also returns dictionaries that map ids to indices and back (e.g. `uid_to_idx` maps a userId to its row index in the matrix).
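
The sketch below shows one way such a conversion could work, on a tiny made-up DataFrame. It is a hypothetical stand-in, not the actual `df_to_matrix` from `utils.py`: it assumes a binary matrix (1.0 where a rating exists), sorted ids, and the CSR format so that single cells can be indexed.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

def df_to_matrix_sketch(df, row_name, col_name):
    """Build a binary interaction matrix plus id <-> index mappers."""
    row_ids = sorted(df[row_name].unique())
    col_ids = sorted(df[col_name].unique())
    rid_to_idx = {rid: i for i, rid in enumerate(row_ids)}
    idx_to_rid = {i: rid for rid, i in rid_to_idx.items()}
    cid_to_idx = {cid: j for j, cid in enumerate(col_ids)}
    idx_to_cid = {j: cid for cid, j in cid_to_idx.items()}

    # Translate ids into matrix coordinates.
    rows = df[row_name].map(rid_to_idx).to_numpy()
    cols = df[col_name].map(cid_to_idx).to_numpy()
    data = np.ones(len(df), dtype=np.float32)
    matrix = coo_matrix((data, (rows, cols)), shape=(len(row_ids), len(col_ids)))
    # CSR supports cell and row indexing, which COO does not.
    return matrix.tocsr(), rid_to_idx, idx_to_rid, cid_to_idx, idx_to_cid

toy = pd.DataFrame({"userId": [4, 4, 7], "movieId": [32, 126, 32]})
matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = df_to_matrix_sketch(
    toy, "userId", "movieId"
)

# From ids to a matrix cell.
print(matrix[uid_to_idx[4], mid_to_idx[126]])
print(matrix[uid_to_idx[7], mid_to_idx[126]])

# From a matrix row back to movie ids.
row = matrix[uid_to_idx[4]]
rated_by_4 = sorted(int(idx_to_mid[j]) for j in row.indices)
print(rated_by_4)
```

The mappers are what let you move between the ids in the DataFrame and the positions in the matrix, in both directions.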


Have a look at the util function documentation, and use it to create 5 new variables:
- a final sparse matrix `ratings_matrix` (this will be the data used to train the model)
- the following utils mappers:
    - `uid_to_idx`
    - `idx_to_uid`
    - `mid_to_idx`
    - `idx_to_mid`

In [35]:
from utils import df_to_matrix

ratings_matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = df_to_matrix(
    df=df_ratings,
    row_name="userId",
    col_name="movieId",
)

print(ratings_matrix)

  (0, 0)	1.0
  (0, 1)	1.0
  (0, 2)	1.0
  (0, 3)	1.0
  (0, 4)	1.0
  (0, 5)	1.0
  (0, 6)	1.0
  (0, 7)	1.0
  (0, 8)	1.0
  (0, 9)	1.0
  (0, 10)	1.0
  (0, 11)	1.0
  (0, 12)	1.0
  (0, 13)	1.0
  (0, 14)	1.0
  (0, 15)	1.0
  (0, 16)	1.0
  (0, 17)	1.0
  (0, 18)	1.0
  (0, 19)	1.0
  (0, 20)	1.0
  (0, 21)	1.0
  (0, 22)	1.0
  (0, 23)	1.0
  (0, 24)	1.0
  :	:
  (609, 9699)	1.0
  (609, 9700)	1.0
  (609, 9701)	1.0
  (609, 9702)	1.0
  (609, 9703)	1.0
  (609, 9704)	1.0
  (609, 9705)	1.0
  (609, 9706)	1.0
  (609, 9707)	1.0
  (609, 9708)	1.0
  (609, 9709)	1.0
  (609, 9710)	1.0
  (609, 9711)	1.0
  (609, 9712)	1.0
  (609, 9713)	1.0
  (609, 9714)	1.0
  (609, 9715)	1.0
  (609, 9716)	1.0
  (609, 9717)	1.0
  (609, 9718)	1.0
  (609, 9719)	1.0
  (609, 9720)	1.0
  (609, 9721)	1.0
  (609, 9722)	1.0
  (609, 9723)	1.0


**Q6**.
- On the one hand, find which movies userId 4 rated.

- On the other hand, what is the value of `ratings_matrix` for:
    - userId = 4 and movieId=1
    - userId = 4 and movieId=2
    - userId = 4 and movieId=21
    - userId = 4 and movieId=32
    - userId = 4 and movieId=126

Conclude on the meaning of the values in `ratings_matrix`.

In [22]:
df_ratings[df_ratings.userId == 4]

,userId,movieId,rating,timestamp,date_time
300,4,21,3.0,986935199,2001-04-10 20:39:59
301,4,32,2.0,945173447,1999-12-14 12:10:47
302,4,45,3.0,986935047,2001-04-10 20:37:27
303,4,47,2.0,945173425,1999-12-14 12:10:25
304,4,52,3.0,964622786,2000-07-26 14:46:26
...,...,...,...,...,...
511,4,4765,5.0,1007569445,2001-12-05 16:24:05
512,4,4881,3.0,1007569445,2001-12-05 16:24:05
513,4,4896,4.0,1007574532,2001-12-05 17:48:52
514,4,4902,4.0,1007569465,2001-12-05 16:24:25


In [36]:
uid = 4

# Look up each (user, movie) pair through the id-to-index mappers.
for mid in [1, 2, 21, 32, 126]:
    print("Movie ID", mid)
    print(ratings_matrix[uid_to_idx[uid], mid_to_idx[mid]])
    print("")

Movie ID 1
0.0

Movie ID 2
0.0

Movie ID 21
1.0

Movie ID 32
1.0

Movie ID 126
1.0



**Q7**. Now that you have a `ratings_matrix` in the correct format, let's save it in pickle format:
- Create a variable `dst_dir` corresponding to the path of the folder `data/netflix` located at the root of the repository
- **Verify that this is the correct path**
- Save `ratings_matrix` as a pickle file (`ratings_matrix.pkl`) in this directory

In [39]:
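import os
import pickle

directory = "./data"

# Use a context manager so the file is flushed and closed after writing.
with open(os.path.join(directory, "ratings_matrix.pkl"), "wb") as f:
    pickle.dump(ratings_matrix, f)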
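
One possible way to verify a destination path before writing to it is sketched below. It uses a temporary folder as a hypothetical stand-in for the repository root, since the real location depends on where the notebook is run from.

```python
import os
import tempfile

# A temporary folder plays the role of the repository root (assumption for the demo).
with tempfile.TemporaryDirectory() as repo_root:
    dst_dir = os.path.join(repo_root, "data", "netflix")

    # Check whether the folder exists, and create it if it does not.
    exists_before = os.path.isdir(dst_dir)
    os.makedirs(dst_dir, exist_ok=True)
    exists_after = os.path.isdir(dst_dir)

    # Printing the absolute path makes it easy to confirm it is the intended one.
    print(os.path.abspath(dst_dir))
    print(exists_before, exists_after)
```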

**Q8**. Also save all the mapping objects as pickle files (`idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`), as they will be useful later.

In [40]:
# Save each mapping under its own name, closing every file after writing.
mappings = {
    "idx_to_mid": idx_to_mid,
    "mid_to_idx": mid_to_idx,
    "uid_to_idx": uid_to_idx,
    "idx_to_uid": idx_to_uid,
}
for name, mapping in mappings.items():
    with open(directory + "/" + name + ".pkl", "wb") as f:
        pickle.dump(mapping, f)
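
To check that a saved object can be read back unchanged, a round trip like the one sketched below can help. The dictionary is a made-up stand-in for `uid_to_idx`, and a temporary folder is used so that nothing is written to the real data directory.

```python
import os
import pickle
import tempfile

# Hypothetical toy mapping standing in for `uid_to_idx`.
toy_uid_to_idx = {1: 0, 4: 1, 7: 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "uid_to_idx.pkl")

    # Write the mapping to disk.
    with open(path, "wb") as f:
        pickle.dump(toy_uid_to_idx, f)

    # Read it back and compare with the original.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_uid_to_idx)
```

The same pattern, with `"rb"` and `pickle.load`, is how the saved files can be loaded again in a later notebook.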

On to the next challenge now! 🍿