# Recommendation - Model 🍿

---

<img src="https://visithrastnik.si/uploads/tic/public/generic_list_item/6-kulturna_prireditev_v_avli_kulturnega_centra_zagorje_ob_savi.jpg" />

---

Now, time for the exciting part! We will train a Machine Learning model based on our previous **ratings** sparse matrix, so that it creates a recommendation engine automatically! 

First, load again the dataframe `movies` and `ratings`

In [24]:
import numpy as np
import pandas as pd

df_movies = pd.read_csv("movies.csv")
df_ratings = pd.read_csv("ratings.csv")

df_ratings["date_time"] = pd.to_datetime(df_ratings["timestamp"])
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,date_time
0,1,1,4.0,964982703,1970-01-01 00:00:00.964982703
1,1,3,4.0,964981247,1970-01-01 00:00:00.964981247
2,1,6,4.0,964982224,1970-01-01 00:00:00.964982224
3,1,47,5.0,964983815,1970-01-01 00:00:00.964983815
4,1,50,5.0,964982931,1970-01-01 00:00:00.964982931


**Q1**. Start by loading all the pickle you saved during last challenge: `ratings_matrix`, `idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`

In [1]:
import pickle

ratings_matrix = pickle.load(open("./data/ratings_matrix.pkl", "rb"))
idx_to_mid = pickle.load(open("./data/idx_to_mid.pkl", "rb"))
mid_to_idx = pickle.load(open("./data/mid_to_idx.pkl", "rb"))
uid_to_idx = pickle.load(open("./data/uid_to_idx.pkl", "rb"))
idx_to_uid = pickle.load(open("./data/idx_to_uid.pkl", "rb"))


**Q2**. Because the dataset is slightly different from what we have been used to (X as features, y as target), the usual `train_test_split` method from scikit-learn does not apply.

Hopefully, `lightfm` comes with a `random_train_test_split` located into `cross_validation` dedicated to this usecase 🙂

Split the data randomly into a `train` matrix and a `test` matrix with 20% of interactions into the test set.

In [6]:
!pip install lightfm

You should consider upgrading via the '/opt/anaconda3/bin/python -m pip install --upgrade pip' command.[0m[33m
[0m

In [8]:
from lightfm.cross_validation import random_train_test_split

train,test = random_train_test_split(ratings_matrix, test_percentage=0.2)

**Q3**. Train a LightFM model for 10 epochs. You can use the parameter `loss="warp"`.

In [11]:
from lightfm import LightFM

model = LightFM(no_components = 10, loss="warp")
model.fit(train, epochs = 10, verbose = True)

Epoch: 100%|██████████| 10/10 [00:01<00:00,  9.13it/s]


<lightfm.lightfm.LightFM at 0x7ff7a50743d0>

**Q4**. Evaluate your model on your test set. You can use the `precision_at_k` metric implemented in the LightFM library.

In [14]:
from lightfm.evaluation import precision_at_k

k = 1

precision_score = precision_at_k(model, test, train, k=k)
print(precision_score.mean())

0.28688523


**Q5**. What does the attribute `item_embeddings` of `model` contains?  This will be the heart of your recommendation engine! 💟 So make sure you understand fully what it contains.

In [16]:
model.item_embeddings.shape

(9724, 10)

**Q6**. We just trained a model that factorized our ratings matrix into a U matrix of shape (n_users, no_components) : `model.user_embeddings` ; and V matrix of shape (n_movies, no_components) : `model.item_embeddings`).

Now we want to compute **similarity between each pair of movies**.

> 🔦 **Hint**: For the similarity distance we can either use `cosine_similarity` function or `pearson_similarity`:
> - **Cosine similarity** between two vectors, or matrices X and Y is given by:
> ``` python
> from sklearn.metrics.pairwise import cosine_similarity
> cosine_similarity(X, Y)
> ```
> - **Pearson similarity** between two vectors, or matrices X and Y is given by:
> ``` python
> import numpy as np
> np.corrcoef(X, Y)
> ```

Compute the `similarity_scores` of size (n_movies, n_movies), containing for each element (i, j) the similarity between movie of index i and movie of index j.

In [18]:
from sklearn.metrics.pairwise import cosine_similarity

sim_scores = cosine_similarity(model.item_embeddings)
sim_scores

array([[ 0.99999994,  0.5695318 ,  0.9126428 , ..., -0.92900145,
        -0.9224953 , -0.747296  ],
       [ 0.5695318 ,  1.0000001 ,  0.6184753 , ..., -0.68877697,
        -0.3842673 , -0.8129381 ],
       [ 0.9126428 ,  0.6184753 ,  1.0000002 , ..., -0.9299779 ,
        -0.77516943, -0.85041064],
       ...,
       [-0.92900145, -0.68877697, -0.9299779 , ...,  1.0000001 ,
         0.76746285,  0.8736551 ],
       [-0.9224953 , -0.3842673 , -0.77516943, ...,  0.76746285,
         1.0000001 ,  0.5748418 ],
       [-0.747296  , -0.8129381 , -0.85041064, ...,  0.8736551 ,
         0.5748418 ,  1.0000001 ]], dtype=float32)

**Q7**. For movie of idx 20, what are the idx of the 10 most similar movies?

In [29]:
df_movies.index = df_movies["movieId"]

In [30]:
import numpy as np

idx = mid_to_idx[1]
similarity_row = sim_scores[idx]
ranked_row = np.argsort(-similarity_row)

print(ranked_row[:10])
print("Convert idx to movie id")

ranked_mid = [idx_to_mid[index] for index in ranked_row]
print(ranked_row[:10])
print("convert mid to movie title")

ranked_titles = [df_movies.loc[mid]["title"] for mid in ranked_mid]
print(ranked_titles[:10])


[   0  478 1645 1776  170   26    7   16  140  979]
Convert idx to movie id
[   0  478 1645 1776  170   26    7   16  140  979]
convert mid to movie title
['Toy Story (1995)', 'Terminator 2: Judgment Day (1991)', 'Deep Blue Sea (1999)', 'Mask of Zorro, The (1998)', 'Mummy, The (1999)', 'Jurassic Park (1993)', 'Braveheart (1995)', 'Pulp Fiction (1994)', 'NeverEnding Story, The (1984)', "Dante's Peak (1997)"]


**Q8**. Let's now test our engine! Suppose we have an user that likes **Toy Story** 🧸 (movie_id = 1). Which movies would you recommend to that user? In other words, which movies are the most similar to the movie Toy Story 

> ⚠️ **Warning**: Remember that your `similarity_scores` works with `idx` and you have the `movie_id` associated to your movie.

Retrieve the **top 5 recommendations**.

In [26]:
similarity_scores = sim_scores

pickle.dump(similarity_scores, open("./data/similarity_scores.pkl", "wb"))
pickle.dump(df_movies, open("./data/movies.pkl", "wb"))

As the next step is to **deploy your model**, you need now to: 

**Q9**. Save your `similarity_scores` into pickle format. Save also `movies` DataFrame into pickle format. Save them at the `data/netflix` directory at the root of the repository.

**Q10**. Encapsulate the previous code into functions, especially you will need:
- `get_sim_scores(mid)` function that returns the vector of the similarity scores `sims` between a movie `mid` and all the other movies
- `get_ranked_recos(sims)` that returns for a vector of similarity scores `sims` the list of all ranked recommendations (n_movies) (from most recommended to least recommended) - in the format list of (mid, score, name) tuple.

If you have extra time, feel free now to improve your recommendation engine!