# Recommendation - Data Preparation 🎬

---

<img src="https://cdn-images-1.medium.com/max/1200/0*ePGWILY6GyplT-nn" />

---

In the next few challenges, you will build a powerful **movie recommender**.

We will use the open-source library [LightFM](https://github.com/lyst/lightfm), which provides an easy-to-use Python implementation of **hybrid** recommendation engines.

In this first part, we will prepare the data so that the model can be trained efficiently.

Load the `movies` and `ratings` data, downloaded from the **small** [movielens dataset](https://grouplens.org/datasets/movielens/).



In [1]:
conda install -c conda-forge lightfm


In [1]:
### TODO: Load the movies and ratings datasets

import numpy as np
import pandas as pd
from lightfm import LightFM

movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')



**Q1**. What are the different types of recommendation models? Briefly explain, in your own words, the differences between them.

Content-based filtering and collaborative filtering are the two most prominent types of recommender systems. Hybrid models, the kind LightFM implements, combine both.

1) Content-Based: 
This approach relies on the information available about the items themselves. The algorithm suggests items that resemble the ones a user has already expressed a preference for. The similarity, typically measured with cosine similarity, is computed from the item features and the user's previous preferences.

2) Collaborative-Based: 
This approach, called collaborative filtering, recommends items to a user based on the behaviour of the whole group of users: the recommendation is derived from the preferences of others. A simple example is suggesting a film to a user because other users with similar tastes enjoyed it. The algorithm depends heavily on the past interactions of all users, and each user's actions influence the recommendations that others receive.

The main difference lies in the data used. Collaborative filtering builds its suggestions from the interactions of all users with the items. Content-based filtering only takes into account the specific user's own history, together with the item features.
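
To make the content-based idea concrete, here is a minimal sketch of cosine similarity between movies described by their genres. The genre vectors are hand-built from three rows of the MovieLens `movies` table, and the helper `cosine_similarity` is a hypothetical name written for this illustration, not part of LightFM or `utils.py`.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot genre vectors, columns in this order:
# [Adventure, Animation, Children, Comedy, Fantasy, Romance]
toy_story = np.array([1, 1, 1, 1, 1, 0])
jumanji = np.array([1, 0, 1, 0, 1, 0])
grumpier_old_men = np.array([0, 0, 0, 1, 0, 1])

sim_jumanji = cosine_similarity(toy_story, jumanji)
sim_grumpier = cosine_similarity(toy_story, grumpier_old_men)

print(f"Toy Story vs Jumanji:          {sim_jumanji:.3f}")
print(f"Toy Story vs Grumpier Old Men: {sim_grumpier:.3f}")
```

Under this toy encoding, a user who liked Toy Story would be pointed to Jumanji before Grumpier Old Men, because the first pair shares three genres and the second only one.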

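**Q1bis**. What data is expected by the LightFM `fit` method? In particular, how should the training data be organized, and what type should the training dataset be?

The `fit` method expects the training data as a SciPy sparse matrix of user-item interactions, of shape `[n_users, n_items]`: each row is a user, each column is an item (here, a movie), and each stored value is an interaction between that user and that item. Item metadata such as genres is not part of this matrix; it can be passed separately through the optional `item_features` argument (and `user_features` for users). In this challenge, the interactions matrix is built from `ratings` alone.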
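
As a rough illustration of that layout, the sketch below builds a tiny interactions matrix with SciPy only. The user and item indices are hypothetical toy values, and the call to `fit` is left as a comment because it depends on a trained setup that this notebook has not reached yet.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy interactions, one entry per (user, item) pair
user_idx = np.array([0, 0, 1, 2])   # row index of each interaction
item_idx = np.array([0, 2, 1, 2])   # column index of each interaction
values = np.ones(len(user_idx), dtype=np.float32)

n_users, n_items = 3, 4
interactions = coo_matrix(
    (values, (user_idx, item_idx)), shape=(n_users, n_items)
)

print(interactions.shape)      # rows are users, columns are items
print(interactions.nnz)        # number of stored interactions
print(interactions.toarray())  # dense view, only sensible for toy sizes

# A matrix shaped like this is what would be handed to the model,
# e.g. model.fit(interactions, epochs=...) on a LightFM instance.
```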

**Q2**. Explore `movies` and `ratings`, what do those datasets contain? How are they organized?

In [6]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


---

### Q3 & Q4 are optional
> You can come back to them if you have time after finishing the day's project.

We created a few utility functions for you in the `utils.py` script. In particular:
- `threshold_interactions_df`:
> Limit interactions df to minimum row and column interactions

**Q3**. Open the `src/utils.py` file, and have a look at the documentation of this function to understand its goal and how it works.

Have a look at the code to fully understand how it works. You should be familiar with everything.

What does the variable `sparsity` represent? What range of values can it take?

**Q4**. Create a new DataFrame `ratings_thresh`, that filters `ratings` with only:
- users that rated strictly more than 4 movies
- movies that have been rated at least 10 times

How many users/movies remain in this new dataset?
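
One way to reason about this filter without the helper is sketched below in plain pandas, on a hypothetical toy DataFrame with smaller thresholds. The function name `filter_once` is made up for this illustration and is not the `threshold_interactions_df` function from `utils.py`, whose signature is not shown here. On the real data, "strictly more than 4 movies" would translate to a minimum of 5 ratings per user, and the movie minimum would be 10.

```python
import pandas as pd

# Hypothetical toy ratings
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.0, 5.0],
})

def filter_once(df, min_user_ratings, min_movie_ratings):
    """Keep rows whose user and movie both reach the minimum counts."""
    # Number of ratings given by the user of each row
    user_counts = df.groupby("userId")["movieId"].transform("count")
    # Number of ratings received by the movie of each row
    movie_counts = df.groupby("movieId")["userId"].transform("count")
    keep = (user_counts >= min_user_ratings) & (movie_counts >= min_movie_ratings)
    return df[keep]

filtered = filter_once(toy, min_user_ratings=2, min_movie_ratings=2)

print(filtered)
print("users left: ", filtered["userId"].nunique())
print("movies left:", filtered["movieId"].nunique())
```

Note that a single pass can leave users or movies that fall below a threshold once other rows are gone, so the counts should be checked again after filtering and the filter re-applied if needed.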

**Q5**. In order to fit a [LightFM](https://lyst.github.io/lightfm/docs/home.html) model, we need to transform our DataFrame into a sparse matrix (see below). This is not straightforward, so we included the function `df_to_matrix` in `utils.py`.

> 🔦 **Hint**:  Sparse matrices are just **big matrices with a lot of zeros or empty values**.
> 
> Dense containers (Pandas DataFrames or NumPy arrays, for example) are not suitable for manipulating this kind of data, because they store every zero explicitly. So we will use [Scipy sparse matrices](https://docs.scipy.org/doc/scipy-0.14.0/reference/sparse.html).
>
> There are many different "types" of sparse matrices (CSC, CSR, COO, DIA, etc.). You don't need to know them. Just know that they are different storage formats, each with its own methods for manipulation, slicing, indexing, etc.

> 🔦 **Hint 2**:  By going from a DataFrame to a sparse matrix, you lose the ids (userId and movieId) and only deal with indices (row number and column number). Therefore, the `df_to_matrix` function also returns dictionaries mapping between ids and indices (e.g. `uid_to_idx` maps a userId to its row index in the matrix).


Have a look at the utility function's documentation, and use it to create 5 new variables:
- a final sparse matrix `ratings_matrix` (this will be the data used to train the model)
- the following utils mappers:
    - `uid_to_idx`
    - `idx_to_uid`
    - `mid_to_idx`
    - `idx_to_mid`
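
To see why the mappers are needed, here is a minimal sketch of how such a conversion could work. It is an assumption about the general technique, not the actual code of `df_to_matrix`: the function name `to_matrix`, the toy data, and the choice to store 1.0 for every interaction are all made up for this illustration.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Hypothetical toy interactions
toy = pd.DataFrame({
    "userId":  [4, 4, 7, 9],
    "movieId": [21, 32, 21, 126],
})

def to_matrix(df, user_col, item_col):
    """Build a user x item sparse matrix plus id <-> index mappers."""
    user_ids = sorted(df[user_col].unique())
    item_ids = sorted(df[item_col].unique())

    # ids are arbitrary numbers; indices are consecutive row/column positions
    uid_to_idx = {uid: i for i, uid in enumerate(user_ids)}
    mid_to_idx = {mid: j for j, mid in enumerate(item_ids)}
    idx_to_uid = {i: uid for uid, i in uid_to_idx.items()}
    idx_to_mid = {j: mid for mid, j in mid_to_idx.items()}

    rows = df[user_col].map(uid_to_idx).to_numpy()
    cols = df[item_col].map(mid_to_idx).to_numpy()
    data = [1.0] * len(df)  # assumption: mark each interaction with 1.0

    matrix = coo_matrix(
        (data, (rows, cols)), shape=(len(user_ids), len(item_ids))
    ).tocsr()
    return matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid

matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = to_matrix(
    toy, "userId", "movieId"
)

# Look up a cell by ids, going through the mappers
print(matrix[uid_to_idx[4], mid_to_idx[32]])
print(matrix[uid_to_idx[7], mid_to_idx[32]])
```

The matrix itself only knows positions 0, 1, 2, so every lookup by userId or movieId has to go through `uid_to_idx` and `mid_to_idx`, and every result read from the matrix has to be translated back with `idx_to_uid` and `idx_to_mid`.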

In [8]:
from utils import df_to_matrix
ratings_matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = df_to_matrix(ratings, 'userId', 'movieId')
ratings_matrix

<610x9724 sparse matrix of type '<class 'numpy.float64'>'
	with 100836 stored elements in Compressed Sparse Row format>
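
The summary above is enough to work out how full the matrix is. The short calculation below uses the shape and the number of stored elements from that output. It uses the common convention where density is the share of stored cells and sparsity is its complement; the `sparsity` variable in `utils.py` may follow a different convention.

```python
# Figures read from the matrix summary: 610 x 9724, 100836 stored elements
n_users, n_movies = 610, 9724
n_stored = 100836

density = n_stored / (n_users * n_movies)  # share of cells that hold a value
sparsity = 1 - density                     # share of empty cells

print(f"density:  {density:.2%}")   # 1.70%
print(f"sparsity: {sparsity:.2%}")  # 98.30%
```

Fewer than 2 cells in 100 hold a value, which is why a sparse format is used instead of a dense array.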

**Q6**.
- On the one hand, find which movies userId 4 rated.

- On the other hand, what is the value of `ratings_matrix` for:
    - userId = 4 and movieId=1
    - userId = 4 and movieId=2
    - userId = 4 and movieId=21
    - userId = 4 and movieId=32
    - userId = 4 and movieId=126

What can you conclude about the meaning of the values in `ratings_matrix`?

In [10]:
user_4_ratings = ratings[ratings['userId'] == 4]
user_4_ratings

Unnamed: 0,userId,movieId,rating,timestamp
300,4,21,3.0,986935199
301,4,32,2.0,945173447
302,4,45,3.0,986935047
303,4,47,2.0,945173425
304,4,52,3.0,964622786
...,...,...,...,...
511,4,4765,5.0,1007569445
512,4,4881,3.0,1007569445
513,4,4896,4.0,1007574532
514,4,4902,4.0,1007569465


In [17]:
# merging on movieId
merged_df = pd.merge(user_4_ratings, movies, on='movieId')

# Print titles of movies rated by userId 4
print(merged_df['title'])

0                                      Get Shorty (1995)
1              Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
2                                      To Die For (1995)
3                            Seven (a.k.a. Se7en) (1995)
4                                Mighty Aphrodite (1995)
                             ...                        
211                                        L.I.E. (2001)
212                     Man Who Wasn't There, The (2001)
213    Harry Potter and the Sorcerer's Stone (a.k.a. ...
214    Devil's Backbone, The (Espinazo del diablo, El...
215                                 No Man's Land (2001)
Name: title, Length: 216, dtype: object


In [18]:
user_id = 4
movie_id_list = [1, 2, 21, 32, 126]

# Translate ids to matrix indices before reading a cell
user_idx = uid_to_idx[user_id]
for movie_id in movie_id_list:
    movie_idx = mid_to_idx[movie_id]
    rating = ratings_matrix[user_idx, movie_idx]
    print(f"Rating for userId {user_id} and movieId {movie_id}: {rating}")

Rating for userId 4 and movieId 1: 0.0
Rating for userId 4 and movieId 2: 0.0
Rating for userId 4 and movieId 21: 1.0
Rating for userId 4 and movieId 32: 1.0
Rating for userId 4 and movieId 126: 1.0

Conclusion: `ratings_matrix` does not store the rating itself. A value of 1.0 means that the user rated the movie (userId 4 gave 3.0 to movieId 21 and 2.0 to movieId 32, yet both appear as 1.0), and 0.0 means that no rating exists for that user and movie.


**Q7**. Now that you have a `ratings_matrix` in the correct format, let's save it in pickle format:
- Create a variable `dst_dir` corresponding to the path of the folder `data/netflix` located at the root of the repository
- **Verify that this is the correct path**
- Save the ratings_matrix in pickle (as `ratings_matrix.pkl`) in this corresponding directory

In [22]:
import os
import pickle

# Destination folder (replace with your actual path)
dst_dir = '/Users/manish/Desktop/Data Analysis/Workshop 9/netflixApp'

# Verify that the folder exists before writing to it
assert os.path.isdir(dst_dir), f"Folder not found: {dst_dir}"

# Save ratings_matrix
with open(os.path.join(dst_dir, 'ratings_matrix.pkl'), 'wb') as f:
    pickle.dump(ratings_matrix, f)

**Q8**. Also save all the mapping objects as pickle files (`idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`), as they will be useful later.

In [23]:
import os

# Save each mapper as <name>.pkl in dst_dir (defined in the previous cell)
mappers = {
    'uid_to_idx': uid_to_idx,
    'idx_to_uid': idx_to_uid,
    'mid_to_idx': mid_to_idx,
    'idx_to_mid': idx_to_mid,
}
for name, mapper in mappers.items():
    with open(os.path.join(dst_dir, f'{name}.pkl'), 'wb') as f:
        pickle.dump(mapper, f)

print("Ratings matrix and mappers saved successfully!")

Ratings matrix and mappers saved successfully!
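
Before moving on, it can be worth checking that the pickled objects reload correctly. The sketch below shows one way to do such a round trip. It uses a temporary folder and small stand-in objects so that it runs anywhere; the stand-in matrix and mapping are hypothetical and would be replaced by the real `ratings_matrix`, the mappers, and your own `dst_dir`.

```python
import os
import pickle
import tempfile

from scipy.sparse import csr_matrix

# Hypothetical stand-ins for ratings_matrix and one of the mappers
matrix = csr_matrix([[1.0, 0.0], [0.0, 1.0]])
mapping = {4: 0, 7: 1}

with tempfile.TemporaryDirectory() as dst_dir:
    # Check the destination folder exists before writing
    assert os.path.isdir(dst_dir)

    objects = {"ratings_matrix": matrix, "uid_to_idx": mapping}
    for name, obj in objects.items():
        with open(os.path.join(dst_dir, f"{name}.pkl"), "wb") as f:
            pickle.dump(obj, f)

    # Reload and compare with the originals
    with open(os.path.join(dst_dir, "ratings_matrix.pkl"), "rb") as f:
        reloaded_matrix = pickle.load(f)
    with open(os.path.join(dst_dir, "uid_to_idx.pkl"), "rb") as f:
        reloaded_mapping = pickle.load(f)

# Two sparse matrices are equal when no cell differs
same_matrix = (reloaded_matrix != matrix).nnz == 0
print("matrix identical: ", same_matrix)
print("mapping identical:", reloaded_mapping == mapping)
```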


On to the next challenge! 🍿