# Recommendation - Data Preparation 🎬

---

<img src="https://cdn-images-1.medium.com/max/1200/0*ePGWILY6GyplT-nn" />

---

In the next few challenges, you will build a powerful **movie recommender**.

We will use the open-source library [LightFM](https://github.com/lyst/lightfm), which provides an easy-to-use Python implementation of **hybrid** recommendation engines.

In this first part, we will prepare the data so that the model can be trained efficiently.

We let you load the `movies` and `ratings` data, downloaded from the **small** [MovieLens dataset](https://grouplens.org/datasets/movielens/).



In [1]:
import pandas as pd

### TODO: Load the movies and ratings datasets
movies = pd.read_csv("./ml-latest-small/movies.csv")
ratings = pd.read_csv("./ml-latest-small/ratings.csv")

In [2]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


**Q1**. What are the different types of recommendation models? Explain briefly with your own words the differences between them.

In [4]:
print("Recommender systems include content-based recommendation, rating-based recommendation, and clustering-based recommendation.")
print("Content-based recommender systems rely on the similarity between user features (personal) and item features (to be recommended). They match the user's interests to the description of the items to be recommended.")
print("Rating-based recommender systems, also called collaborative filtering, rely on the rating matrix. They recommend only on the basis of users' past behaviour or history.")
print("Clustering-based recommender systems rely on clusters found in the rating matrix. Any user assigned to a cluster receives recommendations based on that group.")
print("There are also hybrid recommender systems, which use both ratings and content.")

Recommender systems include content-based recommendation, rating-based recommendation, and clustering-based recommendation.
Content-based recommender systems rely on the similarity between user features (personal) and item features (to be recommended). They match the user's interests to the description of the items to be recommended.
Rating-based recommender systems, also called collaborative filtering, rely on the rating matrix. They recommend only on the basis of users' past behaviour or history.
Clustering-based recommender systems rely on clusters found in the rating matrix. Any user assigned to a cluster receives recommendations based on that group.
There are also hybrid recommender systems, which use both ratings and content.


**Q1bis**. What data does the LightFM `fit` method expect? In particular, how should the training data be organized, and what should be the type of the training dataset?

In [19]:

print("The fit method expects data in the form of a sparse matrix where rows represent users and columns represent items. Each cell contains a value indicating an interaction.")
print("Therefore, the training data is a (n_users, n_items) sparse matrix, with 1s denoting positive and -1s denoting negative interactions.")


The fit method expects data in the form of a sparse matrix where rows represent users and columns represent items. Each cell contains a value indicating an interaction.
Therefore, the training data is a (n_users, n_items) sparse matrix, with 1s denoting positive and -1s denoting negative interactions.
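
To make the expected layout concrete, here is a minimal sketch that builds a small user-by-item sparse matrix with SciPy. The toy indices and the names `user_idx`, `item_idx`, and `interactions` are assumptions made for illustration; only the shape convention (users as rows, items as columns) comes from the answer above. The LightFM call itself is left out so that the block runs with NumPy and SciPy alone.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy interaction triplets (hypothetical data): user index, item index, value.
user_idx = np.array([0, 0, 1, 2])
item_idx = np.array([0, 2, 1, 2])
values = np.array([1.0, 1.0, 1.0, 1.0], dtype=np.float32)

n_users, n_items = 3, 4

# Rows are users, columns are items; pairs that are not listed are implicit zeros.
interactions = coo_matrix((values, (user_idx, item_idx)), shape=(n_users, n_items))

print(interactions.shape)  # (3, 4)
print(interactions.dtype)  # float32
print(interactions.nnz)    # 4 stored interactions
```

A matrix built this way is the kind of object that would presumably be passed as the first argument of `fit`.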


**Q2**. Explore `movies` and `ratings`, what do those datasets contain? How are they organized?

In [6]:
movies.shape

(9742, 3)

In [7]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [8]:
ratings.shape

(100836, 4)

In [9]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [10]:
len(ratings["userId"].unique())

610

---

### Q3 & Q4 are optional
> You can come back to them if you have time after finishing the whole project of the day.

We created a few utility functions for you in the `utils.py` script, in particular:
- `threshold_interactions_df`:
> Limit interactions df to minimum row and column interactions

**Q3**. Open the `src/utils.py` file, and have a look at the documentation of this function to understand its goal and how it works.

Have a look at the code to fully understand how it works. You should be familiar with everything.

What does the variable `sparsity` represent? What range of values can it take?

In [11]:

print("Sparsity represents the percentage of non-zero entries in the matrix. It is the percentage of existing ratings in the total interactions possible. \nRange: 0% (completely empty) to 100% (completely filled). \nCalculated as: sparsity = (number of interactions) / (total possible interactions) * 100.")


Sparsity represents the percentage of non-zero entries in the matrix. It is the percentage of existing ratings in the total interactions possible. 
Range: 0% (completely empty) to 100% (completely filled). 
Calculated as: sparsity = (number of interactions) / (total possible interactions) * 100.
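
The formula above can be checked by hand with the figures reported in this notebook (100836 ratings, 610 users, 9724 rated movies). The helper name `density_percent` is hypothetical; it is only a sketch of the calculation described in the answer, not the code from `utils.py`.

```python
def density_percent(n_interactions: int, n_rows: int, n_cols: int) -> float:
    """Share of filled (row, column) cells, expressed as a percentage."""
    return n_interactions / (n_rows * n_cols) * 100


# Figures taken from the notebook outputs: ratings, users, rated movies.
value = density_percent(100836, 610, 9724)
print(f"Sparsity: {value:.3f}%")  # Sparsity: 1.700%
```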


**Q4**. Create a new DataFrame `ratings_thresh`, that filters `ratings` with only:
- users that rated strictly more than 4 movies
- movies that have been rated at least 10 times

How many users/movies remain in this new dataset?

In [12]:
from utils import threshold_interactions_df

ratings_thresh = threshold_interactions_df(ratings, 'userId', 'movieId', row_min=5, col_min=10)

remaining_users = ratings_thresh['userId'].nunique()
remaining_movies = ratings_thresh['movieId'].nunique()

print(f"\nNumber of remaining users: {remaining_users}")
print(f"Number of remaining movies: {remaining_movies}")

Starting interactions info
Number of rows: 610
Number of cols: 9724
Sparsity: 1.700%
Ending interactions info
Number of rows: 610
Number of columns: 3650
Sparsity: 4.055%

Number of remaining users: 610
Number of remaining movies: 3650
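
For intuition, here is one way such a filter could be written with pandas. This is a hypothetical re-implementation on toy data, not the code of `threshold_interactions_df`, which may differ (for example in the order of the checks). The loop is there because dropping rare movies can push a user below the row threshold, and vice versa.

```python
import pandas as pd


def threshold_sketch(df, row_name, col_name, row_min, col_min):
    """Keep only rows/columns with at least row_min/col_min interactions.

    Hypothetical sketch; the real helper in utils.py may behave differently.
    """
    while True:
        start = len(df)
        # Drop columns (movies) that have fewer than col_min interactions.
        col_counts = df.groupby(col_name)[row_name].transform("count")
        df = df[col_counts >= col_min]
        # Drop rows (users) that have fewer than row_min interactions.
        row_counts = df.groupby(row_name)[col_name].transform("count")
        df = df[row_counts >= row_min]
        if len(df) == start:  # nothing was removed: both constraints hold
            return df


toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
})
filtered = threshold_sketch(toy, "userId", "movieId", row_min=2, col_min=2)
print(filtered["userId"].nunique(), filtered["movieId"].nunique())  # 2 2
```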


In [13]:
ratings_thresh.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


**Q5**. In order to fit a [LightFM](https://lyst.github.io/lightfm/docs/home.html) model, we need to transform our DataFrame into a sparse matrix (cf. below). This is not straightforward, so we included the function `df_to_matrix` in `utils.py`.

> 🔦 **Hint**:  Sparse matrices are just **big matrices with a lot of zeros or empty values**.
> 
> Existing tools (pandas DataFrames or NumPy arrays, for example) are not suitable for manipulating this kind of data. So we will use [SciPy sparse matrices](https://docs.scipy.org/doc/scipy-0.14.0/reference/sparse.html).
>
> There are many different "types" of sparse matrices (CSC, CSR, COO, DIA, etc.). You don't need to know them. Just know that they correspond to different storage formats, each with its own methods for manipulation, slicing, indexing, etc.

> 🔦 **Hint 2**:  By going from a DataFrame to a sparse matrix, you will lose the information of the ids (userId and movieId), you will only deal with indices (row number and column number). Therefore, the `df_to_matrix` function also returns dictionaries mapping indexes to ids (ex: uid_to_idx mapping userId to index of the matrix) 
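
The id-to-index mappers mentioned in Hint 2 can be pictured as plain dictionaries. The sketch below is an assumption about their general shape, with a hypothetical helper `build_mappers`; the actual `df_to_matrix` may assign indices in a different order (for instance by order of appearance rather than sorted order).

```python
def build_mappers(ids):
    """Map each distinct id to a 0-based index, and build the inverse mapping."""
    id_to_idx = {raw_id: idx for idx, raw_id in enumerate(sorted(set(ids)))}
    idx_to_id = {idx: raw_id for raw_id, idx in id_to_idx.items()}
    return id_to_idx, idx_to_id


# Movie ids are not contiguous, so they cannot be used directly as column numbers.
movie_ids = [1, 3, 6, 47, 50, 3, 1]
mid_to_idx_demo, idx_to_mid_demo = build_mappers(movie_ids)

print(mid_to_idx_demo)     # {1: 0, 3: 1, 6: 2, 47: 3, 50: 4}
print(idx_to_mid_demo[3])  # 47
```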


Have a look at the util function documentation, and use it to create 5 new variables:
- a final sparse matrix `ratings_matrix` (this will be the data used to train the model)
- the following utils mappers:
    - `uid_to_idx`
    - `idx_to_uid`
    - `mid_to_idx`
    - `idx_to_mid`

In [14]:
from utils import df_to_matrix

# Create the sparse matrix and mappers
ratings_matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = df_to_matrix(ratings, 'userId', 'movieId')

print(ratings_matrix)
print(uid_to_idx)
print(idx_to_uid)
print(mid_to_idx)
print(idx_to_mid)

  (0, 0)	1.0
  (0, 1)	1.0
  (0, 2)	1.0
  (0, 3)	1.0
  (0, 4)	1.0
  (0, 5)	1.0
  (0, 6)	1.0
  (0, 7)	1.0
  (0, 8)	1.0
  (0, 9)	1.0
  (0, 10)	1.0
  (0, 11)	1.0
  (0, 12)	1.0
  (0, 13)	1.0
  (0, 14)	1.0
  (0, 15)	1.0
  (0, 16)	1.0
  (0, 17)	1.0
  (0, 18)	1.0
  (0, 19)	1.0
  (0, 20)	1.0
  (0, 21)	1.0
  (0, 22)	1.0
  (0, 23)	1.0
  (0, 24)	1.0
  :	:
  (609, 9699)	1.0
  (609, 9700)	1.0
  (609, 9701)	1.0
  (609, 9702)	1.0
  (609, 9703)	1.0
  (609, 9704)	1.0
  (609, 9705)	1.0
  (609, 9706)	1.0
  (609, 9707)	1.0
  (609, 9708)	1.0
  (609, 9709)	1.0
  (609, 9710)	1.0
  (609, 9711)	1.0
  (609, 9712)	1.0
  (609, 9713)	1.0
  (609, 9714)	1.0
  (609, 9715)	1.0
  (609, 9716)	1.0
  (609, 9717)	1.0
  (609, 9718)	1.0
  (609, 9719)	1.0
  (609, 9720)	1.0
  (609, 9721)	1.0
  (609, 9722)	1.0
  (609, 9723)	1.0
{1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7, 9: 8, 10: 9, 11: 10, 12: 11, 13: 12, 14: 13, 15: 14, 16: 15, 17: 16, 18: 17, 19: 18, 20: 19, 21: 20, 22: 21, 23: 22, 24: 23, 25: 24, 26: 25, 27: 26, 28: 27

In [15]:
ratings[ratings.userId==4]

Unnamed: 0,userId,movieId,rating,timestamp
300,4,21,3.0,986935199
301,4,32,2.0,945173447
302,4,45,3.0,986935047
303,4,47,2.0,945173425
304,4,52,3.0,964622786
...,...,...,...,...
511,4,4765,5.0,1007569445
512,4,4881,3.0,1007569445
513,4,4896,4.0,1007574532
514,4,4902,4.0,1007569465


**Q6**.
- On the one hand, find which movies userId 4 rated.

- On the other hand, what is the value of `ratings_matrix` for:
    - userId = 4 and movieId=1
    - userId = 4 and movieId=2
    - userId = 4 and movieId=21
    - userId = 4 and movieId=32
    - userId = 4 and movieId=126

Conclude on the meaning of the values in `ratings_matrix`.

In [16]:
user_4_ratings = ratings[ratings['userId'] == 4]
print(user_4_ratings)

# Get values from the ratings_matrix
user_4_idx = uid_to_idx[4]
movie_ids = [1, 2, 21, 32, 126]
for movie_id in movie_ids:
    movie_idx = mid_to_idx[movie_id]
    rating_value = ratings_matrix[user_4_idx, movie_idx]
    print(f"Rating for userId=4 and movieId={movie_id}: {rating_value}")

     userId  movieId  rating   timestamp
300       4       21     3.0   986935199
301       4       32     2.0   945173447
302       4       45     3.0   986935047
303       4       47     2.0   945173425
304       4       52     3.0   964622786
..      ...      ...     ...         ...
511       4     4765     5.0  1007569445
512       4     4881     3.0  1007569445
513       4     4896     4.0  1007574532
514       4     4902     4.0  1007569465
515       4     4967     5.0  1007569424

[216 rows x 4 columns]
Rating for userId=4 and movieId=1: 0.0
Rating for userId=4 and movieId=2: 0.0
Rating for userId=4 and movieId=21: 1.0
Rating for userId=4 and movieId=32: 1.0
Rating for userId=4 and movieId=126: 1.0
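
The output above suggests that an entry is 1.0 when the user rated the movie and 0.0 otherwise, whatever the rating was (userId 4 gave 3.0 to movieId 21 and 2.0 to movieId 32, yet both are stored as 1.0). The following sketch reproduces that behaviour on toy data; the mappers, the matrix, and the helper `has_rated` are hypothetical stand-ins, not the objects returned by `df_to_matrix`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy (userId, movieId) pairs; the rating values are dropped on purpose.
pairs = [(4, 21), (4, 32), (7, 1)]

# Hypothetical mappers from ids to matrix indices.
uid_to_idx_demo = {4: 0, 7: 1}
mid_to_idx_demo = {1: 0, 2: 1, 21: 2, 32: 3}

rows = [uid_to_idx_demo[u] for u, _ in pairs]
cols = [mid_to_idx_demo[m] for _, m in pairs]

# Every observed pair is stored as 1.0; everything else is an implicit 0.
matrix = csr_matrix((np.ones(len(pairs)), (rows, cols)), shape=(2, 4))


def has_rated(user_id, movie_id):
    """Translate ids to indices, then read the binary interaction flag."""
    return bool(matrix[uid_to_idx_demo[user_id], mid_to_idx_demo[movie_id]] == 1.0)


print(has_rated(4, 21))  # True
print(has_rated(4, 1))   # False
```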


**Q7**. Now that you have a `ratings_matrix` in the correct format, let's save it in pickle format:
- Create a variable `dst_dir` corresponding to the path of the folder `data/netflix` located at the root of the repository
- **Verify that this is the correct path**
- Save the ratings_matrix in pickle (as `ratings_matrix.pkl`) in this corresponding directory

In [17]:
import os
import pickle

# Define the directory where the data will be saved
dst_dir = "data/netflix"

# Make sure the directory exists (create it if needed)
os.makedirs(dst_dir, exist_ok=True)

# Save ratings_matrix
with open(os.path.join(dst_dir, 'ratings_matrix.pkl'), 'wb') as f:
    pickle.dump(ratings_matrix, f)

**Q8**. Also save all the mapping objects in pickle format (`idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`), as they will be useful later.

In [18]:
#save mappers
mappers = {
    'uid_to_idx': uid_to_idx,
    'idx_to_uid': idx_to_uid,
    'mid_to_idx': mid_to_idx,
    'idx_to_mid': idx_to_mid
}

for name, mapper in mappers.items():
    with open(os.path.join(dst_dir, f'{name}.pkl'), 'wb') as f:
        pickle.dump(mapper, f)
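
A quick way to gain confidence in the saved files is to load one back and compare it with the original object. The sketch below does this round trip in a temporary directory with a toy dictionary standing in for `uid_to_idx`, so it does not touch `data/netflix`; the same pattern should apply to the real files.

```python
import os
import pickle
import tempfile

# Toy stand-in for one of the mappers (hypothetical values).
mapper = {1: 0, 2: 1, 3: 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "uid_to_idx.pkl")
    # Write the object to disk...
    with open(path, "wb") as f:
        pickle.dump(mapper, f)
    # ...then read it back while the temporary directory still exists.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == mapper)  # True
```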

On to the next challenge now! 🍿