<a href="https://colab.research.google.com/github/mayaw00d/3803ict-workshops/blob/main/02-Recommendation-Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommendation - Model 🍿

---

<img src="https://visithrastnik.si/uploads/tic/public/generic_list_item/6-kulturna_prireditev_v_avli_kulturnega_centra_zagorje_ob_savi.jpg" />

---

Now for the exciting part! We will train a machine learning model on the sparse **ratings** matrix built in the previous challenge and use it as the basis of a recommendation engine.

First, reload the `movies` and `ratings` DataFrames.

In [7]:
import pandas as pd

# Load the movies and ratings datasets
movies = pd.read_csv("./ml-latest-small/movies.csv")
ratings = pd.read_csv("./ml-latest-small/ratings.csv")

**Q1**. Start by loading all the pickles you saved during the last challenge: `ratings_matrix`, `idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`

In [8]:
import pickle

root = "./data/netflix/"

def load_pickle(name):
    """Load one pickled object from the data directory and close the file afterwards."""
    with open(root + name, "rb") as f:
        return pickle.load(f)

ratings_matrix = load_pickle("ratings_matrix.pkl")
idx_to_mid = load_pickle("idx_to_mid.pkl")
mid_to_idx = load_pickle("mid_to_idx.pkl")
uid_to_idx = load_pickle("uid_to_idx.pkl")
idx_to_uid = load_pickle("idx_to_uid.pkl")
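
The mapping dictionaries translate between raw IDs and positions in the ratings matrix. As a minimal sketch of how such a pair of mappings behaves, here is a toy example with made-up movie IDs (the `toy_` names and values are hypothetical, not the contents of your pickles):

```python
# Hypothetical raw movie IDs, in the order they appear in the matrix.
raw_movie_ids = [1, 3, 6, 47, 50]

# Matrix position -> raw ID, and the inverse lookup.
toy_idx_to_mid = dict(enumerate(raw_movie_ids))
toy_mid_to_idx = {mid: idx for idx, mid in toy_idx_to_mid.items()}

# The two dictionaries should be exact inverses of each other.
assert all(toy_idx_to_mid[toy_mid_to_idx[mid]] == mid for mid in raw_movie_ids)

print(toy_mid_to_idx[47])  # position of movie 47 in this toy example: 3
```

A similar round-trip check on your own `mid_to_idx` and `idx_to_mid` is a cheap way to confirm the pickles were loaded consistently.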


**Q2**. Because the dataset is slightly different from what we have been used to (X as features, y as target), the usual `train_test_split` method from scikit-learn does not apply.

Fortunately, `lightfm` provides a `random_train_test_split` function in its `cross_validation` module, dedicated to this use case 🙂

Split the data randomly into a `train` matrix and a `test` matrix, with 20% of the interactions in the test set.

In [9]:
!pip install lightfm
import numpy as np
from lightfm.cross_validation import random_train_test_split

train, test = random_train_test_split(ratings_matrix, test_percentage=0.2, random_state=np.random.RandomState(0))

train.shape, test.shape



((610, 9724), (610, 9724))
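
Both matrices keep the full (n_users, n_movies) shape because the split divides the individual interactions, not the rows or columns. The sketch below illustrates that idea on a small dense NumPy array; it is only an illustration under that assumption, not LightFM's implementation, and all `toy_` names are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy ratings matrix: 4 users x 5 movies, 0 means "no interaction".
toy_ratings = np.array([
    [5, 0, 3, 0, 0],
    [0, 4, 0, 0, 2],
    [1, 0, 0, 5, 0],
    [0, 0, 4, 0, 3],
])

# Coordinates of every observed interaction, visited in shuffled order.
rows, cols = np.nonzero(toy_ratings)
order = rng.permutation(len(rows))
n_test = int(round(0.2 * len(rows)))  # 20% of the interactions go to test
test_pos, train_pos = order[:n_test], order[n_test:]

# Copy each interaction into exactly one of the two matrices.
toy_train = np.zeros_like(toy_ratings)
toy_test = np.zeros_like(toy_ratings)
toy_train[rows[train_pos], cols[train_pos]] = toy_ratings[rows[train_pos], cols[train_pos]]
toy_test[rows[test_pos], cols[test_pos]] = toy_ratings[rows[test_pos], cols[test_pos]]

print(toy_train.shape, toy_test.shape)  # both stay (4, 5)
```

Every interaction ends up in exactly one of the two matrices, so adding them back together recovers the original ratings.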

**Q3**. Train a LightFM model for 10 epochs. You can use the parameter `loss="warp"`.

In [10]:
from lightfm import LightFM

# 100 latent dimensions; the WARP loss optimises the ranking of positive items
light_model = LightFM(no_components=100, loss='warp', random_state=0)
# Let any training error surface directly instead of printing it and carrying on
light_model.fit(train, epochs=10, verbose=True)

Epoch: 100%|██████████| 10/10 [00:04<00:00,  2.10it/s]


**Q4**. Evaluate your model on your test set. You can use the `precision_at_k` metric implemented in the LightFM library.

In [11]:
from lightfm.evaluation import precision_at_k

k = 5
# Passing train keeps known training positives out of the ranked predictions
pre_k = precision_at_k(light_model, test, train, k=k).mean()

print("Precision at K =", k)
print(pre_k)

Precision at K = 5
0.2680921
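
Precision at k is the fraction of the top-k ranked items that are known positives in the test set, averaged over users. The function below is a hand-rolled sketch of that definition for a single user, assuming plain NumPy scores; `precision_at_k_for_user` and the toy data are hypothetical and are not part of LightFM.

```python
import numpy as np

def precision_at_k_for_user(scores, test_items, train_items, k):
    """Fraction of the top-k ranked items that are test positives for one user."""
    scores = np.asarray(scores, dtype=float).copy()
    # Items already seen in training are not eligible for recommendation.
    seen = np.array(sorted(train_items), dtype=int)
    scores[seen] = -np.inf
    top_k = np.argsort(-scores)[:k]
    hits = sum(1 for item in top_k if int(item) in test_items)
    return hits / k

# Hypothetical predicted scores for one user over 6 movies.
toy_scores = [0.9, 0.1, 0.8, 0.7, 0.3, 0.6]
toy_precision = precision_at_k_for_user(toy_scores, test_items={2, 5}, train_items={0}, k=3)
print(toy_precision)  # 2 of the top 3 (items 2, 3, 5) are test positives
```

Here the top-3 list after masking item 0 is items 2, 3 and 5, of which two are test positives, so the precision is 2/3.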


**Q5**. What does the `item_embeddings` attribute of the model contain? This will be the heart of your recommendation engine! 💟 So make sure you fully understand what it contains.

In [12]:
light_model.item_embeddings.shape

(9724, 100)

In [13]:
print("item_embeddings holds one 100-dimensional vector per movie")

item_embeddings holds one 100-dimensional vector per movie


In [14]:
light_model.item_embeddings[0]

array([-0.3142591 , -0.2041731 ,  0.08945829, -0.08405492,  0.34734696,
       -0.05399103,  0.11160227, -0.20095833, -0.11371228, -0.39413986,
        0.362561  ,  0.05316165, -0.4583187 , -0.34357777, -0.18797755,
       -0.00553624,  0.28137276,  0.43153363,  0.15660733, -0.40245408,
        0.40595597,  0.01247011, -0.27369824, -0.14441921,  0.20417038,
        0.02513273,  0.2091002 , -0.15891656,  0.2149729 ,  0.14988837,
       -0.21464817,  0.23834325, -0.2145693 ,  0.24521717, -0.37559548,
        0.22957776, -0.12345305,  0.40108737, -0.36697736,  0.22383201,
        0.223479  ,  0.34414145,  0.2924009 , -0.08213998, -0.16994385,
        0.14942707,  0.3083662 , -0.29354784,  0.2813535 ,  0.19445191,
       -0.09893961,  0.18049172,  0.2639099 , -0.2928732 ,  0.22133411,
        0.3848023 , -0.3763623 , -0.39090097,  0.1398391 , -0.19342977,
        0.2217506 , -0.25360197, -0.04459268, -0.24561936, -0.24732517,
       -0.2852928 , -0.3145209 ,  0.23152035, -0.08398732,  0.23

**Q6**. We just trained a model that factorizes our ratings matrix into a matrix U of shape (n_users, no_components), `model.user_embeddings`, and a matrix V of shape (n_movies, no_components), `model.item_embeddings`.
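
To make the factorization concrete, the sketch below builds two small random matrices standing in for U and V and scores every (user, movie) pair with a dot product. The arrays are hypothetical stand-ins, and the sketch leaves out the per-user and per-item bias terms that LightFM also learns.

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_movies, no_components = 3, 4, 2

# Hypothetical stand-ins for model.user_embeddings and model.item_embeddings.
U = rng.normal(size=(n_users, no_components))
V = rng.normal(size=(n_movies, no_components))

# One score per (user, movie) pair: row u of U dotted with row i of V.
scores = U @ V.T
print(scores.shape)  # (3, 4)
```

Each row of V is what the similarity computation works on: two movies whose rows point in similar directions get similar scores from every user.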

Now we want to compute **similarity between each pair of movies**.

> 🔦 **Hint**: For the similarity measure, we can use either cosine similarity or Pearson correlation:
> - **Cosine similarity** between two vectors, or matrices X and Y is given by:
> ``` python
> from sklearn.metrics.pairwise import cosine_similarity
> cosine_similarity(X, Y)
> ```
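> - **Pearson similarity** between the rows of X and Y (`np.corrcoef` treats each row as a variable and returns the correlations for the rows of X stacked on the rows of Y) is given by: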
> ``` python
> import numpy as np
> np.corrcoef(X, Y)
> ```
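
The two measures are closely related: Pearson correlation is the cosine similarity of mean-centred vectors. The sketch below checks this on random data, assuming only NumPy; `cosine_rows` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

def cosine_rows(X):
    """Cosine similarity between every pair of rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.RandomState(0)
X = rng.normal(size=(5, 8))  # 5 hypothetical item vectors of dimension 8

# Pearson correlation equals cosine similarity of the mean-centred rows.
X_centred = X - X.mean(axis=1, keepdims=True)
pearson_via_cosine = cosine_rows(X_centred)

print(np.allclose(pearson_via_cosine, np.corrcoef(X)))  # True
```

When the vectors already have a mean close to zero, the two measures give nearly the same values.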

Compute `similarity_scores`, a matrix of shape (n_movies, n_movies) whose element (i, j) is the similarity between the movie at index i and the movie at index j.

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

similar_score = cosine_similarity(light_model.item_embeddings)  # method 1: cosine similarity
similar_score

array([[ 0.9999999 ,  0.35686016,  0.5590585 , ..., -0.25982952,
        -0.3204423 , -0.3208219 ],
       [ 0.35686016,  1.        ,  0.4236118 , ..., -0.29630297,
        -0.30853668, -0.2569683 ],
       [ 0.5590585 ,  0.4236118 ,  1.        , ..., -0.27644306,
        -0.18271977, -0.15962668],
       ...,
       [-0.25982952, -0.29630297, -0.27644306, ...,  1.        ,
         0.7492958 ,  0.6513411 ],
       [-0.3204423 , -0.30853668, -0.18271977, ...,  0.7492958 ,
         0.9999997 ,  0.8015574 ],
       [-0.3208219 , -0.2569683 , -0.15962668, ...,  0.6513411 ,
         0.8015574 ,  1.0000001 ]], dtype=float32)

In [16]:
similar_score_np = np.corrcoef(light_model.item_embeddings)  # method 2: Pearson correlation
similar_score_np

array([[ 1.        ,  0.35021119,  0.55637834, ..., -0.26238036,
        -0.33058201, -0.32973479],
       [ 0.35021119,  1.        ,  0.4183909 , ..., -0.30211544,
        -0.32621165, -0.27207356],
       [ 0.55637834,  0.4183909 ,  1.        , ..., -0.27883183,
        -0.19122061, -0.16694876],
       ...,
       [-0.26238036, -0.30211544, -0.27883183, ...,  1.        ,
         0.75087505,  0.65203931],
       [-0.33058201, -0.32621165, -0.19122061, ...,  0.75087505,
         1.        ,  0.80003545],
       [-0.32973479, -0.27207356, -0.16694876, ...,  0.65203931,
         0.80003545,  1.        ]])

**Q7**. For the movie at idx 20, what are the idx values of the 10 most similar movies?

In [17]:
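idx = 20  # matrix index, not a raw movieId
similar_idx = similar_score[idx]       # similarities between this movie and every movie
ranked_idx = np.argsort(-similar_idx)  # indices sorted from most to least similar
ranked_mid = [idx_to_mid[x] for x in ranked_idx]

[movies[movies.movieId == mid]["title"] for mid in ranked_mid[:10]]  # titles of the top 10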


[314    Forrest Gump (1994)
 Name: title, dtype: object,
 277    Shawshank Redemption, The (1994)
 Name: title, dtype: object,
 257    Pulp Fiction (1994)
 Name: title, dtype: object,
 224    Star Wars: Episode IV - A New Hope (1977)
 Name: title, dtype: object,
 0    Toy Story (1995)
 Name: title, dtype: object,
 461    Schindler's List (1993)
 Name: title, dtype: object,
 418    Jurassic Park (1993)
 Name: title, dtype: object,
 43    Seven (a.k.a. Se7en) (1995)
 Name: title, dtype: object,
 325    Mask, The (1994)
 Name: title, dtype: object,
 510    Silence of the Lambs, The (1991)
 Name: title, dtype: object]

**Q8**. Let's now test our engine! Suppose we have a user who likes **Toy Story** 🧸 (movie_id = 1). Which movies would you recommend to that user? In other words, which movies are the most similar to Toy Story?

> ⚠️ **Warning**: Remember that `similarity_scores` is indexed by `idx`, whereas you only know the `movie_id` of your movie, so convert it first.

Retrieve the **top 5 recommendations**.

In [18]:
idx = mid_to_idx[1]  # convert movie_id 1 (Toy Story) to its matrix index
similar_idx = similar_score[idx]
ranked_idx = np.argsort(-similar_idx)  # most similar first
ranked_mid = [idx_to_mid[x] for x in ranked_idx]

[movies[movies.movieId == mid]["title"] for mid in ranked_mid[:5]]  # titles of the top 5

[0    Toy Story (1995)
 Name: title, dtype: object,
 314    Forrest Gump (1994)
 Name: title, dtype: object,
 277    Shawshank Redemption, The (1994)
 Name: title, dtype: object,
 224    Star Wars: Episode IV - A New Hope (1977)
 Name: title, dtype: object,
 257    Pulp Fiction (1994)
 Name: title, dtype: object]
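
Note that the first result is Toy Story itself, because every movie is most similar to itself. One possible way to handle this is to drop the query index before taking the top k. The helper below is a hypothetical sketch on a toy similarity row, not part of the workshop code.

```python
import numpy as np

def top_k_similar(sim_row, query_idx, k):
    """Indices of the k most similar items, skipping the query item itself."""
    ranked = np.argsort(-sim_row)
    ranked = ranked[ranked != query_idx]
    return ranked[:k]

# Hypothetical similarity row for item 0 (its self-similarity is 1.0).
toy_row = np.array([1.0, 0.2, 0.9, -0.1, 0.5])
print(top_k_similar(toy_row, query_idx=0, k=3))  # [2 4 1]
```

Filtering by index is safer than skipping the first result, since floating-point rounding means the self-similarity is not guaranteed to be strictly the largest value.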

The next step is to **deploy your model**, so you now need to:

**Q9**. Save your `similarity_scores` in pickle format. Also save the `movies` DataFrame in pickle format. Save both in the `data/netflix` directory at the root of the repository.

In [19]:
# root already ends with "/", so append the file name directly
with open(root + 'similarity_scores.pkl', 'wb') as f:
    pickle.dump(similar_score, f)

with open(root + 'movies.pkl', 'wb') as f:
    pickle.dump(movies, f)
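
Before deploying, it can be worth checking that a saved file loads back unchanged. The sketch below shows the round trip on a small hypothetical object written to a temporary directory, so it does not touch your real `data/netflix` files. The same pattern should apply to NumPy arrays and DataFrames, using the comparison appropriate to each type.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical small object standing in for the similarity scores.
toy_scores = [[1.0, 0.5], [0.5, 1.0]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "similarity_scores.pkl"
    # Write the object, then read it back from the same file.
    with open(path, "wb") as f:
        pickle.dump(toy_scores, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_scores)  # True
```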

**Q10**. Encapsulate the previous code into functions. In particular, you will need:
- a `get_sim_scores(mid)` function that returns the vector of similarity scores `sims` between a movie `mid` and all the other movies
- a `get_ranked_recos(sims)` function that takes a vector of similarity scores `sims` and returns the list of all n_movies recommendations, ranked from most to least recommended, as a list of (mid, score, name) tuples.

In [23]:
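def get_movie_name(mid, movies):
    """Return the title for movie ID `mid`, or 'unknown' if it is not in `movies`."""
    try:
        name = movies.loc[movies.movieId == mid].title.values[0]
    except IndexError:
        # No row matched this movie ID
        name = 'unknown'
    return name

def get_sim_scores(mid):
    """Return the similarity scores between movie `mid` and every movie."""
    idx = mid_to_idx[mid]
    sim_score = similar_score[idx]
    return sim_score

def get_ranked_recos(sim_scores):
    """Return (mid, score, name) tuples sorted from most to least similar."""
    recos = []
    for idx in np.argsort(-sim_scores):
        mid = idx_to_mid[idx]
        name = get_movie_name(mid, movies)
        score = sim_scores[idx]
        recos.append((mid, score, name))

    return recos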


In [24]:
sims = get_sim_scores(3)
recos = get_ranked_recos(sims)[:10]

recos

[(3, 1.0, 'Grumpier Old Men (1995)'),
 (65, 0.69877565, 'Bio-Dome (1996)'),
 (432, 0.6745255, "City Slickers II: The Legend of Curly's Gold (1994)"),
 (234, 0.65149, 'Exit to Eden (1994)'),
 (415, 0.6485228, 'Another Stakeout (1993)'),
 (553, 0.6474647, 'Tombstone (1993)'),
 (880, 0.6424499, 'Island of Dr. Moreau, The (1996)'),
 (1405, 0.62317276, 'Beavis and Butt-Head Do America (1996)'),
 (1049, 0.621908, 'Ghost and the Darkness, The (1996)'),
 (466, 0.6159928, 'Hot Shots! Part Deux (1993)')]

If you have extra time, feel free to improve your recommendation engine!