# Recommendation - Model 🍿

---

<img src="https://visithrastnik.si/uploads/tic/public/generic_list_item/6-kulturna_prireditev_v_avli_kulturnega_centra_zagorje_ob_savi.jpg" />

---

Now for the exciting part! We will train a machine learning model on the sparse **ratings** matrix built previously and use it as the basis of a recommendation engine.

First, reload the `movies` and `ratings` DataFrames.

In [5]:
import pandas as pd

### TODO: Load the movies and ratings datasets
movies = pd.read_csv("./ml-latest-small/movies.csv")
ratings = pd.read_csv("./ml-latest-small/ratings.csv")

**Q1**. Start by loading all the pickle files you saved during the last challenge: `ratings_matrix`, `idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`

In [6]:
import pickle

root = "./data/netflix/"

def load_pickle(name):
    # Use a context manager so the file handle is closed after loading
    with open(root + name, "rb") as f:
        return pickle.load(f)

ratings_matrix = load_pickle("ratings_matrix.pkl")
idx_to_mid = load_pickle("idx_to_mid.pkl")
mid_to_idx = load_pickle("mid_to_idx.pkl")
uid_to_idx = load_pickle("uid_to_idx.pkl")
idx_to_uid = load_pickle("idx_to_uid.pkl")


**Q2**. Because this dataset is not in the form we are used to (X as features, y as target), the usual `train_test_split` function from scikit-learn does not apply.

Fortunately, `lightfm` provides a `random_train_test_split` function in its `cross_validation` module, dedicated to this use case 🙂

Split the data randomly into a `train` matrix and a `test` matrix, with 20% of the interactions in the test set.

In [4]:
import numpy as np
from lightfm.cross_validation import random_train_test_split

train, test = random_train_test_split(ratings_matrix, test_percentage=0.2, random_state=np.random.RandomState(0))

train.shape, test.shape



((610, 9724), (610, 9724))
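
To see why both matrices keep the full (n_users, n_movies) shape, here is a minimal sketch of an interaction-level split on a toy matrix. It only illustrates the idea and is not LightFM's implementation; the helper name `split_interactions` and the toy data are made up for this example.

```python
import numpy as np
from scipy.sparse import coo_matrix

def split_interactions(matrix, test_fraction=0.2, seed=0):
    """Randomly assign each stored interaction to either train or test."""
    coo = coo_matrix(matrix)
    rng = np.random.default_rng(seed)
    order = rng.permutation(coo.nnz)              # shuffle the stored interactions
    n_test = int(round(test_fraction * coo.nnz))
    test_pos, train_pos = order[:n_test], order[n_test:]

    def subset(pos):
        # Same shape as the input: only the stored entries differ.
        return coo_matrix(
            (coo.data[pos], (coo.row[pos], coo.col[pos])), shape=coo.shape
        )

    return subset(train_pos), subset(test_pos)

# Hypothetical toy matrix: 4 users x 5 movies, 10 ratings (0 means "no rating").
toy = np.array([
    [5, 0, 3, 0, 1],
    [0, 4, 0, 1, 0],
    [2, 0, 0, 5, 0],
    [0, 3, 4, 0, 2],
])

toy_train, toy_test = split_interactions(toy, test_fraction=0.2, seed=0)
print(toy_train.shape, toy_test.shape)   # both (4, 5)
print(toy_train.nnz, toy_test.nnz)       # 8 and 2 interactions
```

Each rating ends up in exactly one of the two matrices, so adding them back together recovers the original matrix.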

**Q3**. Train a LightFM model for 10 epochs. You can use the parameter `loss="warp"`.

In [None]:
from lightfm import LightFM

light_model = LightFM(no_components=100, loss='warp', random_state=0)
try:
    light_model.fit(train, epochs=10, verbose=True)
except Exception as e:
    print(f"An error occurred: {e}")

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]


**Q4**. Evaluate your model on your test set. You can use the `precision_at_k` metric implemented in the LightFM library.

In [7]:
from lightfm.evaluation import precision_at_k

k=5
pre_k = precision_at_k(light_model, test, train, k=k).mean()

print("Precision at K =", k)
print(pre_k)



NameError: name 'light_model' is not defined
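
To make the metric concrete, precision at k can be worked out by hand on a toy example. The sketch below is an assumption-laden illustration, not LightFM's code: the function name `precision_at_k_manual` and all the data are hypothetical, and it mirrors what I understand to be the library's default behaviour (training interactions are excluded from the ranking, and users without test interactions are skipped).

```python
import numpy as np

def precision_at_k_manual(scores, test_pos, train_pos, k=5):
    """Mean over users of (number of test positives in the top k) / k.

    scores    : (n_users, n_items) predicted scores, higher is better
    test_pos  : (n_users, n_items) boolean, held-out interactions
    train_pos : (n_users, n_items) boolean, interactions seen in training
    """
    # Items already seen during training must not be recommended again.
    masked = np.where(train_pos, -np.inf, scores)
    top_k = np.argsort(-masked, axis=1)[:, :k]            # best k items per user
    hits = np.take_along_axis(test_pos, top_k, axis=1)    # True where a top-k item is a test positive
    per_user = hits.sum(axis=1) / k
    # Assumption: only users with at least one test interaction are evaluated.
    has_test = test_pos.any(axis=1)
    return per_user[has_test].mean()

# Hypothetical toy data: 2 users, 5 items.
scores = np.array([[0.9, 0.8, 0.1, 0.7, 0.2],
                   [0.3, 0.6, 0.9, 0.1, 0.5]])
train_pos = np.array([[True, False, False, False, False],
                      [False, False, True, False, False]])
test_pos = np.array([[False, True, False, False, True],
                     [False, False, False, True, False]])

result = precision_at_k_manual(scores, test_pos, train_pos, k=2)
print(result)   # 0.25
```

User 0 gets one hit out of two recommendations (0.5) and user 1 gets none (0.0), hence the mean of 0.25.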

**Q5**. What does the `item_embeddings` attribute of your model contain? This will be the heart of your recommendation engine! 💟 So make sure you fully understand what it contains.

In [None]:
light_model.item_embeddings.shape

In [8]:
print("item_embeddings holds one 100-dimensional vector per movie")

item_embeddings holds one 100-dimensional vector per movie


In [None]:
light_model.item_embeddings[0]

**Q6**. We just trained a model that factorized our ratings matrix into a U matrix of shape (n_users, no_components), `model.user_embeddings`, and a V matrix of shape (n_movies, no_components), `model.item_embeddings`.
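
The shapes involved are easier to see on a toy example. The sketch below uses random stand-ins for the two embedding matrices (the sizes and values are made up) and scores every (user, movie) pair with a plain dot product. As far as I know, LightFM's real scores also include user and item bias terms, which this sketch leaves out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, no_components = 4, 6, 3   # toy sizes, not the real dataset

U = rng.normal(size=(n_users, no_components))    # stand-in for model.user_embeddings
V = rng.normal(size=(n_movies, no_components))   # stand-in for model.item_embeddings

# One score per (user, movie) pair: the dot product of their latent vectors.
scores = U @ V.T
print(scores.shape)   # (4, 6)

# The score of user 2 for movie 5 is the dot product of row 2 of U and row 5 of V.
assert np.isclose(scores[2, 5], U[2] @ V[5])
```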

Now we want to compute **similarity between each pair of movies**.

> 🔦 **Hint**: As the similarity measure, we can use either cosine similarity or Pearson correlation:
> - **Cosine similarity** between two vectors or matrices X and Y is given by:
> ``` python
> from sklearn.metrics.pairwise import cosine_similarity
> cosine_similarity(X, Y)
> ```
> - **Pearson similarity** between two vectors or matrices X and Y is given by:
> ``` python
> import numpy as np
> np.corrcoef(X, Y)
> ```

Compute the `similarity_scores` matrix of shape (n_movies, n_movies), where each element (i, j) is the similarity between the movie at index i and the movie at index j.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

similar_score = cosine_similarity(light_model.item_embeddings)  # method 1: cosine similarity
similar_score

In [None]:
similar_score_np = np.corrcoef(light_model.item_embeddings)  # method 2: Pearson correlation
similar_score_np
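
The two measures are closely related, which a short NumPy sketch can show on toy data. The helper `cosine_rows` and the random matrix `E` are hypothetical stand-ins for the embeddings, and the sketch assumes no row is all zeros (otherwise the normalization would divide by zero).

```python
import numpy as np

def cosine_rows(E):
    """Pairwise cosine similarity between the rows of E."""
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)   # scale each row to length 1
    return unit @ unit.T

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 8))   # toy stand-in for item_embeddings: 5 movies, 8 dimensions

cos = cosine_rows(E)
# Pearson correlation of two rows equals the cosine similarity of the mean-centered rows.
pearson = cosine_rows(E - E.mean(axis=1, keepdims=True))

assert cos.shape == (5, 5)
assert np.allclose(np.diag(cos), 1.0)         # every movie is maximally similar to itself
assert np.allclose(cos, cos.T)                # the matrix is symmetric
assert np.allclose(pearson, np.corrcoef(E))   # matches NumPy's row-wise correlation
```

Because every movie has similarity 1 with itself, the diagonal has to be skipped when ranking recommendations.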

**Q7**. For the movie at idx 20, what are the idx of the 10 most similar movies?

In [None]:
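idx = 20  # matrix index of the query movie (not a movieId)
similar_idx = similar_score[idx]
ranked_idx = np.argsort(-similar_idx)  # most similar first
ranked_idx = ranked_idx[ranked_idx != idx][:10]  # drop the movie itself, keep the top 10
ranked_mid = [idx_to_mid[x] for x in ranked_idx]

[movies[movies.movieId == mid]["title"] for mid in ranked_mid]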

**Q8**. Let's now test our engine! Suppose we have a user who likes **Toy Story** 🧸 (movie_id = 1). Which movies would you recommend to that user? In other words, which movies are the most similar to Toy Story?

> ⚠️ **Warning**: Remember that your `similarity_scores` matrix is indexed by `idx`, whereas what you know about the movie is its `movie_id`.

Retrieve the **top 5 recommendations**.

In [None]:
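idx = mid_to_idx[1]  # convert the movieId of Toy Story to its matrix index
similar_idx = similar_score[idx]
ranked_idx = np.argsort(-similar_idx)  # most similar first
ranked_idx = ranked_idx[ranked_idx != idx][:5]  # drop Toy Story itself, keep the top 5
ranked_mid = [idx_to_mid[x] for x in ranked_idx]

[movies[movies.movieId == mid]["title"] for mid in ranked_mid]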
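
The id-to-index round trip is the easiest part to get wrong, so here is a self-contained sketch on toy data. Everything in it is hypothetical (the ids, the similarity values and the helper `top_n_similar`); it only illustrates converting a movie id to a row, ranking that row, and converting the result back to ids.

```python
import numpy as np

# Hypothetical toy data: 4 movies with non-contiguous ids.
toy_idx_to_mid = {0: 1, 1: 10, 2: 50, 3: 99}
toy_mid_to_idx = {mid: idx for idx, mid in toy_idx_to_mid.items()}
toy_sims = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.7],
    [0.9, 0.1, 1.0, 0.3],
    [0.5, 0.7, 0.3, 1.0],
])

def top_n_similar(movie_id, n):
    """Return the ids of the n movies most similar to movie_id, excluding itself."""
    idx = toy_mid_to_idx[movie_id]         # movie id -> matrix row
    ranked = np.argsort(-toy_sims[idx])    # indices, most similar first
    ranked = ranked[ranked != idx][:n]     # drop the query movie itself
    return [toy_idx_to_mid[int(i)] for i in ranked]   # matrix row -> movie id

print(top_n_similar(1, 2))   # [50, 99]
```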

As the next step is to **deploy your model**, you now need to:

**Q9**. Save your `similarity_scores` in pickle format. Also save the `movies` DataFrame in pickle format. Save both in the `data/netflix` directory at the root of the repository.

In [None]:
# root already ends with "/", so no extra separator is needed
with open(root + 'similarity_scores.pkl', 'wb') as f:
    pickle.dump(similar_score, f)

with open(root + 'movies.pkl', 'wb') as f:
    pickle.dump(movies, f)
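
Before deploying, it can be worth checking that a pickled object loads back unchanged. The sketch below does a round trip in a temporary directory with a toy array; the file name and data are placeholders, not the real artifacts.

```python
import os
import pickle
import tempfile

import numpy as np

toy_scores = np.array([[1.0, 0.3], [0.3, 1.0]])   # toy stand-in for the similarity matrix

with tempfile.TemporaryDirectory() as tmp_dir:
    # os.path.join avoids handling the "/" separator by hand
    path = os.path.join(tmp_dir, "similarity_scores.pkl")

    with open(path, "wb") as f:
        pickle.dump(toy_scores, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

# The reloaded array is identical to the one that was saved.
assert np.array_equal(restored, toy_scores)
```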

**Q10**. Encapsulate the previous code into functions. In particular, you will need:
- `get_sim_scores(mid)`, which returns the vector `sims` of similarity scores between a movie `mid` and all the other movies
- `get_ranked_recos(sims)`, which takes a vector of similarity scores `sims` and returns the list of all ranked recommendations (n_movies), from most to least recommended, as a list of (mid, score, name) tuples.

In [9]:
def get_movie_name(mid, movies):
    try:
        name = movies.loc[movies.movieId == mid].title.values[0]
    except IndexError:
        # No row matches this movieId
        name = 'unknown'
    return name

def get_sim_scores(mid):
    idx = mid_to_idx[mid]
    sim_score = similar_score[idx]
    return sim_score

def get_ranked_recos(sim_scores):
    recos = []
    for idx in np.argsort(-sim_scores):
        mid = idx_to_mid[idx]
        name = get_movie_name(mid, movies)
        score = sim_scores[idx]
        recos.append((mid, score, name))
    
    return recos
    

In [10]:
sims = get_sim_scores(3)
recos = get_ranked_recos(sims)[:10]

recos

NameError: name 'similar_score' is not defined

If you have extra time, feel free now to improve your recommendation engine!