# Recommendation - Model 🍿

---

<img src="https://visithrastnik.si/uploads/tic/public/generic_list_item/6-kulturna_prireditev_v_avli_kulturnega_centra_zagorje_ob_savi.jpg" />

---

Now for the exciting part! We will train a machine learning model on the sparse **ratings** matrix from the previous challenge and use it as the basis of a recommendation engine.

First, reload the `movies` and `ratings` DataFrames.

In [46]:
### TODO: load the movies and ratings datasets
import pandas as pd
movies=pd.read_csv('movies.csv')
ratings=pd.read_csv('ratings.csv')

**Q1**. Start by loading all the pickles you saved during the last challenge: `ratings_matrix`, `idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`

In [48]:
import pickle
with open('ratings_matrix.pk1', 'rb') as f:
    ratings_matrix=pickle.load(f)
with open('idx_to_mid.pk1', 'rb') as f:
    idx_to_mid=pickle.load(f)
with open('mid_to_idx.pk1', 'rb') as f:
    mid_to_idx=pickle.load(f)
with open('uid_to_idx.pk1', 'rb') as f:
    uid_to_idx=pickle.load(f)
with open('idx_to_uid.pk1', 'rb') as f:
    idx_to_uid=pickle.load(f)
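
The five near-identical `with open(...)` blocks above could be collapsed into a loop. Here is a minimal sketch, assuming every object was saved as `<name>.pk1` in a single directory. The helper name `load_pickles` is hypothetical and the demo uses toy dictionaries rather than the real notebook data:

```python
import pickle
import tempfile
from pathlib import Path


def load_pickles(directory, names, suffix=".pk1"):
    """Load `<directory>/<name><suffix>` for each name and return a dict."""
    loaded = {}
    for name in names:
        with open(Path(directory) / f"{name}{suffix}", "rb") as f:
            loaded[name] = pickle.load(f)
    return loaded


# Demo on a temporary directory with toy objects (not the real data).
with tempfile.TemporaryDirectory() as tmp:
    toy = {"mid_to_idx": {1: 0, 20: 1}, "idx_to_mid": {0: 1, 1: 20}}
    for name, obj in toy.items():
        with open(Path(tmp) / f"{name}.pk1", "wb") as f:
            pickle.dump(obj, f)
    restored = load_pickles(tmp, toy.keys())

print(restored["mid_to_idx"][20])
```

With the real files, something like `objs = load_pickles(".", ["ratings_matrix", "idx_to_mid", "mid_to_idx", "uid_to_idx", "idx_to_uid"])` would return all five objects in one dictionary.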

**Q2**. Because the dataset is slightly different from what we have been used to (X as features, y as target), the usual `train_test_split` method from scikit-learn does not apply.

Fortunately, `lightfm` provides a `random_train_test_split` function in its `cross_validation` module, dedicated to this use case 🙂

Split the data randomly into a `train` matrix and a `test` matrix, with 20% of the interactions in the test set.

In [50]:
import numpy as np
from lightfm.cross_validation import random_train_test_split

# Hold out 20% of the interactions for testing (0.2 is also the default)
train, test = random_train_test_split(ratings_matrix, test_percentage=0.2)
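
To see what "splitting interactions" means, here is a conceptual sketch in plain NumPy. Each observed (user, movie, rating) entry is assigned to either train or test, so both parts describe the same users and movies. The function `split_interactions` is a hypothetical illustration on toy triplets, not LightFM's actual implementation:

```python
import numpy as np


def split_interactions(rows, cols, vals, test_percentage=0.2, seed=0):
    """Randomly assign each observed interaction to train or test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(vals))
    n_test = int(round(test_percentage * len(vals)))
    cutoff = len(vals) - n_test
    train_ids, test_ids = order[:cutoff], order[cutoff:]
    train = (rows[train_ids], cols[train_ids], vals[train_ids])
    test = (rows[test_ids], cols[test_ids], vals[test_ids])
    return train, test


# Ten toy interactions stored as (user_idx, movie_idx, rating) triplets.
rows = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])
cols = np.array([0, 1, 2, 0, 3, 1, 2, 3, 0, 2])
vals = np.array([5, 3, 4, 4, 2, 5, 1, 3, 4, 5], dtype=np.float32)

toy_train, toy_test = split_interactions(rows, cols, vals)
print(len(toy_train[2]), len(toy_test[2]))
```

No interaction appears in both parts, which is why a row-wise `train_test_split` on users would not be the right tool here.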

**Q3**. Train a LightFM model for 10 epochs. You can use the parameter `loss="warp"`.

In [52]:
from lightfm import LightFM
model = LightFM(no_components=100, loss="warp")
model.fit(train, epochs=10)

<lightfm.lightfm.LightFM at 0x7fc5a8921890>

**Q4**. Evaluate your model on your test set. You can use the `precision_at_k` metric implemented in the LightFM library.

In [54]:
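from lightfm.evaluation import precision_at_k
k = 5
# Use a distinct name so the imported `precision_at_k` function is not overwritten
test_precision = precision_at_k(model, test, train_interactions=train, k=k).mean()
print(test_precision)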

0.28336078
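
To make the metric concrete: precision at k is the fraction of a user's top-k recommended items that appear among that user's test interactions, averaged over users. Below is a minimal sketch on dense toy arrays. `precision_at_k_dense` is a hypothetical helper, not a LightFM function, and it ignores details that LightFM handles, such as excluding train interactions from the ranking:

```python
import numpy as np


def precision_at_k_dense(scores, relevant, k):
    """Mean fraction of each user's top-k scored items that are relevant."""
    top_k = np.argsort(-scores, axis=1)[:, :k]          # (n_users, k) item indices
    hits = np.take_along_axis(relevant, top_k, axis=1)  # 1 where a top-k item is relevant
    return hits.mean(axis=1).mean()


# Toy predicted scores for 2 users x 4 items, and their held-out positives.
scores = np.array([[0.9, 0.1, 0.8, 0.3],
                   [0.2, 0.7, 0.1, 0.6]])
relevant = np.array([[1, 0, 0, 1],
                     [0, 1, 0, 1]])

toy_precision = precision_at_k_dense(scores, relevant, k=2)
print(toy_precision)
```

In this toy case the first user gets one relevant item out of two and the second user gets two out of two, so the average is 0.75.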


**Q5**. What does the `item_embeddings` attribute of `model` contain? This will be the heart of your recommendation engine! 💟 So make sure you fully understand what it contains.

In [56]:
# item_embeddings holds one row per movie and one column per latent component
print(model.item_embeddings.shape)

(4180, 100)


**Q6**. We just trained a model that factorized our ratings matrix into a U matrix of shape (n_users, no_components), `model.user_embeddings`, and a V matrix of shape (n_movies, no_components), `model.item_embeddings`.
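
As a rough illustration of this factorization, the sketch below builds small random stand-ins for U and V and combines them into a score matrix. This is a simplification under stated assumptions: the sizes are arbitrary toy values, and the bias terms that LightFM also learns are left out:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, no_components = 6, 4, 3

U = rng.normal(size=(n_users, no_components))   # stands in for model.user_embeddings
V = rng.normal(size=(n_movies, no_components))  # stands in for model.item_embeddings

# One score per (user, movie) pair: the dot product of their embeddings.
scores = U @ V.T
single = float(U[2] @ V[1])  # score of user 2 for movie 1

print(scores.shape)
```

Two movies whose rows in V point in similar directions receive similar scores from every user, which is what makes V useful for movie-to-movie similarity.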

Now we want to compute **similarity between each pair of movies**.

> 🔦 **Hint**: To measure similarity, we can use either cosine similarity or Pearson correlation:
> - **Cosine similarity** between two vectors, or matrices X and Y is given by:
> ``` python
> from sklearn.metrics.pairwise import cosine_similarity
> cosine_similarity(X, Y)
> ```
> - **Pearson similarity** between two vectors, or matrices X and Y is given by:
> ``` python
> import numpy as np
> np.corrcoef(X, Y)
> ```

Compute the `similarity_scores` matrix of shape (n_movies, n_movies), where element (i, j) is the similarity between the movies at indices i and j.

In [58]:
from sklearn.metrics.pairwise import cosine_similarity
similarity_scores=cosine_similarity(model.item_embeddings)
similarity_scores

array([[ 1.0000001 ,  0.4633051 ,  0.38324606, ..., -0.47083575,
        -0.27950346, -0.13566007],
       [ 0.4633051 ,  1.0000001 ,  0.39329863, ..., -0.15544859,
        -0.33119377, -0.06014879],
       [ 0.38324606,  0.39329863,  0.99999994, ..., -0.15517738,
        -0.18778114,  0.00758363],
       ...,
       [-0.47083575, -0.15544859, -0.15517738, ...,  0.99999994,
         0.19830498,  0.15208532],
       [-0.27950346, -0.33119377, -0.18778114, ...,  0.19830498,
         1.0000001 ,  0.1629575 ],
       [-0.13566007, -0.06014879,  0.00758363, ...,  0.15208532,
         0.1629575 ,  0.9999999 ]], dtype=float32)
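
The two similarity measures from the hint can be written out by hand, which shows how they relate: Pearson correlation is the cosine similarity of mean-centered vectors. A minimal NumPy sketch on a toy matrix, assuming no row has zero norm (the function names are illustrative, not library functions):

```python
import numpy as np


def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T


def pearson_sim_matrix(X):
    """Pairwise Pearson correlation between the rows of X."""
    centered = X - X.mean(axis=1, keepdims=True)
    return cosine_sim_matrix(centered)


# Three toy "embeddings" of length 3.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.5],
              [3.0, 1.0, 0.5]])

cos = cosine_sim_matrix(X)
pear = pearson_sim_matrix(X)
print(np.allclose(pear, np.corrcoef(X)))
```

Both matrices are symmetric with ones on the diagonal, up to floating-point rounding, which matches the values close to 1 seen on the diagonal of `similarity_scores`.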

**Q7**. For the movie with idx 20, what are the idx values of the 10 most similar movies?

In [77]:
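# Note: 20 is treated here as a movieId and mapped to its matrix index;
# set idx = 20 directly to query by matrix index instead.
idx = mid_to_idx[20]
similarity_idx = similarity_scores[idx]
# Sort indices from most to least similar (the movie itself comes first)
ranked_idx = np.argsort(-similarity_idx)
ranked_mid = [idx_to_mid[x] for x in ranked_idx]
for mid in ranked_mid[:10]:
    print(movies[movies.movieId == mid]["title"])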

19    Money Train (1995)
Name: title, dtype: object
5    Heat (1995)
Name: title, dtype: object
307    Clear and Present Danger (1994)
Name: title, dtype: object
138    Die Hard: With a Vengeance (1995)
Name: title, dtype: object
412    In the Line of Fire (1993)
Name: title, dtype: object
134    Crimson Tide (1995)
Name: title, dtype: object
249    Natural Born Killers (1994)
Name: title, dtype: object
253    Outbreak (1995)
Name: title, dtype: object
275    Stargate (1994)
Name: title, dtype: object
22    Assassins (1995)
Name: title, dtype: object
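
Note that the first result above is the query movie itself, since every movie has similarity 1 with itself. One possible way to exclude it is sketched below on a toy similarity matrix. `top_k_similar` is a hypothetical helper, not part of the original notebook:

```python
import numpy as np


def top_k_similar(similarity_scores, idx, k):
    """Indices of the k items most similar to item `idx`, excluding `idx` itself."""
    sims = similarity_scores[idx].astype(float)  # astype returns a copy
    sims[idx] = -np.inf                          # never recommend the query item
    return np.argsort(-sims)[:k]


toy_sims = np.array([[1.0, 0.2, 0.9, 0.5],
                     [0.2, 1.0, 0.1, 0.7],
                     [0.9, 0.1, 1.0, 0.3],
                     [0.5, 0.7, 0.3, 1.0]])

neighbours = top_k_similar(toy_sims, idx=0, k=2)
print(neighbours)
```

Working on a copy keeps the original similarity matrix intact, so later queries still see the diagonal values.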


**Q8**. Let's now test our engine! Suppose we have a user who likes **Toy Story** 🧸 (movie_id = 1). Which movies would you recommend to that user? In other words, which movies are the most similar to Toy Story?

> ⚠️ **Warning**: Remember that `similarity_scores` is indexed by `idx`, whereas you only know the `movie_id` of your movie.

Retrieve the **top 5 recommendations**.

In [79]:
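idx = mid_to_idx[1]  # convert movie_id 1 (Toy Story) to its matrix index
similarity_idx = similarity_scores[idx]
ranked_idx = np.argsort(-similarity_idx)
ranked_mid = [idx_to_mid[x] for x in ranked_idx]
# The top-ranked entry is Toy Story itself; use ranked_mid[1:6] to skip it
for mid in ranked_mid[:5]:
    print(movies[movies.movieId == mid]["title"])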

0    Toy Story (1995)
Name: title, dtype: object
32    Babe (1995)
Name: title, dtype: object
224    Star Wars: Episode IV - A New Hope (1977)
Name: title, dtype: object
314    Forrest Gump (1994)
Name: title, dtype: object
506    Aladdin (1992)
Name: title, dtype: object
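
The conversion between `movie_id` and `idx` matters because movie IDs are not contiguous, so a movie's ID generally differs from its row in the matrix. The sketch below shows one way such mappings might be built. The construction is an assumption for illustration, since the real dictionaries come from the previous challenge:

```python
# A few non-contiguous movie IDs, for illustration only.
movie_ids = [1, 34, 260, 356, 588]

# Assumed construction: position in the sorted ID list becomes the matrix index.
mid_to_idx_toy = {mid: idx for idx, mid in enumerate(sorted(movie_ids))}
idx_to_mid_toy = {idx: mid for mid, idx in mid_to_idx_toy.items()}

idx = mid_to_idx_toy[260]  # movie_id -> row/column of the similarity matrix
mid = idx_to_mid_toy[idx]  # matrix index -> movie_id, used to look up the title
print(idx, mid)
```

Indexing `similarity_scores` directly with a `movie_id` would silently return the scores of a different movie, or raise an `IndexError` for large IDs.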


Since the next step is to **deploy your model**, you now need to:

**Q9**. Save your `similarity_scores` in pickle format. Also save the `movies` DataFrame in pickle format. Save both in the `data/netflix` directory at the root of the repository.

In [86]:
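import os
import pickle
# Target directory; point this at the repository's `data/netflix` folder
directory = "."
with open(os.path.join(directory, "similarity_scores.pk1"), "wb") as f:
    pickle.dump(similarity_scores, f)
with open(os.path.join(directory, "movies.pk1"), "wb") as f:
    pickle.dump(movies, f)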

**Q10**. Encapsulate the previous code into functions. In particular, you will need:
- `get_sim_scores(mid)`, which returns the vector of similarity scores `sims` between a movie `mid` and all the other movies
- `get_ranked_recos(sims)`, which returns, for a vector of similarity scores `sims`, the list of all n_movies recommendations ranked from most to least recommended, as a list of `(mid, score, name)` tuples.

In [110]:
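def get_sim_scores(mid):
    """Return the similarity scores between movie `mid` and every movie."""
    idx = mid_to_idx[mid]
    similarity_idx = similarity_scores[idx]
    return similarity_idx

def get_ranked_recos(sims, movies):
    """Return (mid, score, name) tuples, from most to least similar."""
    recos = []
    for idx in np.argsort(-sims):
        mid = idx_to_mid[idx]
        score = sims[idx]
        # Look up the title of this movie in the `movies` DataFrame
        name = movies.loc[movies.movieId == mid].title.values[0]
        recos.append((mid, score, name))
    return recos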
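
`get_ranked_recos` filters the `movies` DataFrame once per movie, which can become slow when there are thousands of movies. One possible alternative is to look titles up in a precomputed dictionary. The sketch below uses toy stand-ins for the notebook objects; the name `get_ranked_recos_fast` and its extra arguments are hypothetical:

```python
import numpy as np


def get_ranked_recos_fast(sims, idx_to_mid, mid_to_title):
    """Return (mid, score, name) tuples sorted from most to least similar."""
    recos = []
    for idx in np.argsort(-sims):
        mid = idx_to_mid[int(idx)]
        recos.append((mid, float(sims[idx]), mid_to_title[mid]))
    return recos


# Toy stand-ins for the notebook objects.
toy_idx_to_mid = {0: 1, 1: 34, 2: 260}
toy_titles = {1: "Toy Story (1995)", 34: "Babe (1995)",
              260: "Star Wars: Episode IV - A New Hope (1977)"}
toy_scores = np.array([1.0, 0.76, 0.74])

ranked = get_ranked_recos_fast(toy_scores, toy_idx_to_mid, toy_titles)
print(ranked[1])
```

With the real data, the title dictionary could be built once with `dict(zip(movies.movieId, movies.title))` and reused for every query.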

In [112]:
print(get_sim_scores(1))
print(get_ranked_recos(get_sim_scores(1),movies))

[ 1.0000001   0.4633051   0.38324606 ... -0.47083575 -0.27950346
 -0.13566007]
[(1, 1.0000001, 'Toy Story (1995)'), (34, 0.7583057, 'Babe (1995)'), (260, 0.73587614, 'Star Wars: Episode IV - A New Hope (1977)'), (356, 0.71557885, 'Forrest Gump (1994)'), (588, 0.7149673, 'Aladdin (1992)'), (1210, 0.71419287, 'Star Wars: Episode VI - Return of the Jedi (1983)'), (2355, 0.71205324, "Bug's Life, A (1998)"), (593, 0.71152216, 'Silence of the Lambs, The (1991)'), (364, 0.71071696, 'Lion King, The (1994)'), (586, 0.69934773, 'Home Alone (1990)'), (1307, 0.698324, 'When Harry Met Sally... (1989)'), (1207, 0.6971826, 'To Kill a Mockingbird (1962)'), (1198, 0.6877872, 'Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)'), (344, 0.684189, 'Ace Ventura: Pet Detective (1994)'), (551, 0.6814106, 'Nightmare Before Christmas, The (1993)'), (595, 0.67910093, 'Beauty and the Beast (1991)'), (500, 0.6780167, 'Mrs. Doubtfire (1993)'), (318, 0.673604, 'Shawshank Redemption, The 

If you have extra time, feel free to improve your recommendation engine!