# Recommendation - Model 🍿

---

<img src="https://visithrastnik.si/uploads/tic/public/generic_list_item/6-kulturna_prireditev_v_avli_kulturnega_centra_zagorje_ob_savi.jpg" />

---

Now for the exciting part! We will train a machine learning model on the sparse **ratings** matrix from the previous challenge and use it to build a recommendation engine automatically!

First, reload the `movies` and `ratings` DataFrames.

In [9]:
!pip install lightfm

Collecting lightfm
  Using cached lightfm-1.17.tar.gz (316 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: lightfm
  Building wheel for lightfm (setup.py): started
  Building wheel for lightfm (setup.py): finished with status 'done'
  Created wheel for lightfm: filename=lightfm-1.17-cp38-cp38-win_amd64.whl size=418926 sha256=bda52f9c3ef5160cc2e712cda14906a732a36ba3ca7d71e3e79d182416942d5a
  Stored in directory: c:\users\nspen\appdata\local\pip\cache\wheels\72\da\9d\3c44c37f7f99adc331ded85b386e51ae01c53bcf33c2122317
Successfully built lightfm
Installing collected packages: lightfm
Successfully installed lightfm-1.17


In [1]:
### TODO: load the movies and ratings datasets
import pandas as pd
movies = pd.read_csv("ml-latest-small/movies.csv")
ratings = pd.read_csv("ml-latest-small/ratings.csv")

print(movies.head())
print(ratings.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


**Q1**. Start by loading all the pickles you saved during the last challenge: `ratings_matrix`, `idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`

In [2]:
import pickle

def load_pickle(path):
    # Read one pickle file and close it afterwards
    with open(path, "rb") as f:
        return pickle.load(f)

ratings_matrix = load_pickle("data/ratings_matrix.pkl")
idx_to_mid = load_pickle("data/idx_to_mid.pkl")
mid_to_idx = load_pickle("data/mid_to_idx.pkl")
uid_to_idx = load_pickle("data/uid_to_idx.pkl")
idx_to_uid = load_pickle("data/idx_to_uid.pkl")

**Q2**. Because the dataset is slightly different from what we have been used to (X as features, y as target), the usual `train_test_split` method from scikit-learn does not apply.

Fortunately, `lightfm` comes with a `random_train_test_split` function in its `cross_validation` module, dedicated to this use case 🙂

Split the data randomly into a `train` matrix and a `test` matrix, with 20% of the interactions in the test set.

In [3]:
import numpy as np
from lightfm.cross_validation import random_train_test_split as rtts

train, test = rtts(ratings_matrix, test_percentage = 0.2, random_state = np.random.RandomState(0))
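
To see what such a split does conceptually, here is a minimal sketch that shuffles the stored entries of a toy sparse matrix and sends 20% of them to the test set. It uses only NumPy and SciPy. The helper name `split_interactions` and the toy data are made up for illustration, and this is not LightFM's actual implementation.

```python
import numpy as np
from scipy.sparse import coo_matrix

def split_interactions(matrix, test_percentage=0.2, seed=0):
    """Randomly split the stored entries of a sparse matrix into train and test.

    Hypothetical helper for illustration; both outputs keep the input shape.
    """
    coo = coo_matrix(matrix)
    rng = np.random.RandomState(seed)
    order = rng.permutation(coo.nnz)              # shuffle the interaction indices
    n_test = int(round(test_percentage * coo.nnz))
    test_sel, train_sel = order[:n_test], order[n_test:]

    def subset(sel):
        # Keep only the selected interactions, in a matrix of the same shape
        return coo_matrix(
            (coo.data[sel], (coo.row[sel], coo.col[sel])), shape=coo.shape
        )

    return subset(train_sel), subset(test_sel)

# Toy ratings matrix: 4 users x 5 movies, 10 observed ratings.
toy = coo_matrix(np.array([
    [5, 0, 3, 0, 1],
    [0, 4, 0, 2, 0],
    [1, 0, 0, 5, 4],
    [0, 3, 2, 0, 0],
], dtype=np.float32))

toy_train, toy_test = split_interactions(toy, test_percentage=0.2)
print(toy_train.nnz, toy_test.nnz)  # 8 2
```

Each rating ends up in exactly one of the two matrices, so adding them back together recovers the original matrix.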



**Q3**. Train a LightFM model for 10 epochs. You can use the parameter `loss="warp"`.

In [4]:
from lightfm import LightFM

model = LightFM(no_components = 5, loss = "warp", random_state = np.random.RandomState(0))

#model.fit(train, epochs = 10, verbose = True) # Can't run this line... kernel keeps dying!


**Q4**. Evaluate your model on your test set. You can use the `precision_at_k` metric implemented in the LightFM library.

In [5]:
from lightfm.evaluation import precision_at_k as pak

k = 5
precision_k = pak(model, test, train, k=k).mean()

print(f"Precision at k: {k} is {precision_k}")

ValueError: You must fit the model before trying to obtain predictions.
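
For intuition about the metric: precision at k for one user is the fraction of that user's k highest-scored items that appear among the held-out test interactions. Below is a minimal NumPy sketch on made-up scores. `precision_at_k_manual` is a hypothetical helper, and it skips details such as excluding interactions already seen in training.

```python
import numpy as np

def precision_at_k_manual(scores, test_positives, k=5):
    """Per-user precision at k.

    scores: (n_users, n_items) array of predicted scores (higher is better).
    test_positives: (n_users, n_items) boolean array of held-out interactions.
    """
    top_k = np.argsort(-scores, axis=1)[:, :k]             # k best items per user
    hits = np.take_along_axis(test_positives, top_k, axis=1)
    return hits.sum(axis=1) / k

toy_scores = np.array([
    [0.9, 0.1, 0.8, 0.3],   # user 0: top-2 items are 0 and 2
    [0.2, 0.7, 0.1, 0.6],   # user 1: top-2 items are 1 and 3
])
toy_positives = np.array([
    [True, False, False, True],   # user 0 actually liked items 0 and 3
    [False, True, False, True],   # user 1 actually liked items 1 and 3
])

per_user = precision_at_k_manual(toy_scores, toy_positives, k=2)
print(per_user)         # [0.5 1. ]
print(per_user.mean())  # 0.75
```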

**Q5**. What does the attribute `item_embeddings` of `model` contain? This will be the heart of your recommendation engine! 💟 So make sure you fully understand what it contains.

In [6]:
print(model.item_embeddings.shape)
print("item_embeddings holds one learned latent vector of length no_components per movie, i.e. an array of shape (n_movies, no_components); it does not store the raw ratings")

AttributeError: 'NoneType' object has no attribute 'shape'
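
To make the shape of the embeddings concrete, the sketch below uses random stand-ins for the two embedding matrices (not a trained model; all names and sizes are made up). Each row of the item matrix is the latent vector of one movie, and a user-movie score is the dot product of the two corresponding vectors. LightFM also adds bias terms, which are omitted here.

```python
import numpy as np

n_users, n_movies, no_components = 4, 6, 5
rng = np.random.RandomState(0)

# Random stand-ins for model.user_embeddings and model.item_embeddings.
toy_user_embeddings = rng.normal(size=(n_users, no_components))
toy_item_embeddings = rng.normal(size=(n_movies, no_components))

print(toy_item_embeddings.shape)     # (6, 5): one row per movie
print(toy_item_embeddings[2].shape)  # (5,): the latent vector of the movie at idx 2

# Without bias terms, the matrix of predicted affinities is U @ V.T.
toy_predictions = toy_user_embeddings @ toy_item_embeddings.T
print(toy_predictions.shape)         # (4, 6): one score per (user, movie) pair
```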

**Q6**. We just trained a model that factorized our ratings matrix into a U matrix of shape (n_users, no_components), `model.user_embeddings`, and a V matrix of shape (n_movies, no_components), `model.item_embeddings`.

Now we want to compute **similarity between each pair of movies**.

> 🔦 **Hint**: For the similarity measure, we can use either cosine similarity (`cosine_similarity`) or Pearson correlation (`np.corrcoef`):
> - **Cosine similarity** between two vectors, or matrices X and Y is given by:
> ``` python
> from sklearn.metrics.pairwise import cosine_similarity
> cosine_similarity(X, Y)
> ```
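> - **Pearson similarity** between two vectors, or matrices X and Y is given by the following (note that `np.corrcoef` treats each row as a variable and returns the correlation matrix of the rows of X and Y stacked together):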
> ``` python
> import numpy as np
> np.corrcoef(X, Y)
> ```

Compute the `similarity_scores` matrix of shape (n_movies, n_movies), where element (i, j) is the similarity between the movies at indices i and j.

In [7]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_scores = cosine_similarity(model.item_embeddings)
print(similarity_scores.shape)
similarity_scores

ValueError: Expected 2D array, got scalar array instead:
array=nan.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
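
As a sanity check on what the pairwise similarity matrix should look like, here is a minimal NumPy sketch of cosine similarity on a hand-made embedding matrix. `cosine_similarity_manual` is a hypothetical helper written for illustration; on a fitted model, scikit-learn's `cosine_similarity` does the same job.

```python
import numpy as np

def cosine_similarity_manual(embeddings):
    """Pairwise cosine similarity between the rows of a 2D array."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms          # assumes no all-zero row
    return unit @ unit.T

toy_embeddings = np.array([
    [1.0, 0.0],
    [2.0, 0.0],    # same direction as row 0
    [0.0, 3.0],    # orthogonal to row 0
    [-1.0, 0.0],   # opposite to row 0
])

toy_sims = cosine_similarity_manual(toy_embeddings)
print(toy_sims.shape)  # (4, 4)
print(toy_sims[0])     # [ 1.  1.  0. -1.]
```

The result is symmetric with ones on the diagonal, since every vector is perfectly similar to itself. Pearson correlation is the same computation applied after subtracting each row's mean.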

**Q7**. For the movie at idx 20, what are the idx values of the 10 most similar movies?

In [8]:
idx = 20
sims_idx = similarity_scores[idx]
# Sort by decreasing similarity and drop the movie itself
ranked_idx = [i for i in np.argsort(-sims_idx) if i != idx]
ranked_mid = [idx_to_mid[x] for x in ranked_idx]
for mid in ranked_mid[:10]:
    print(movies[movies.movieId == mid]['title'])

NameError: name 'similarity_scores' is not defined
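
The ranking step can be checked in isolation on a small hand-made similarity matrix. The sketch below is an illustration under that assumption; `top_k_similar` and the toy values are hypothetical.

```python
import numpy as np

def top_k_similar(similarity_matrix, idx, k=10):
    """Return the indices of the k rows most similar to row idx, excluding idx."""
    order = np.argsort(-similarity_matrix[idx])   # descending similarity
    order = order[order != idx]                   # drop the movie itself
    return order[:k]

toy_similarity = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.7],
    [0.9, 0.1, 1.0, 0.3],
    [0.5, 0.7, 0.3, 1.0],
])

print(top_k_similar(toy_similarity, idx=0, k=2))  # [2 3]
```

Negating the scores before `np.argsort` turns its ascending order into a descending one.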

**Q8**. Let's now test our engine! Suppose we have a user who likes **Toy Story** 🧸 (movie_id = 1). Which movies would you recommend to that user? In other words, which movies are the most similar to Toy Story?

> ⚠️ **Warning**: Remember that your `similarity_scores` matrix is indexed by `idx`, while you only know the `movie_id` of your movie.

Retrieve the **top 5 recommendations**.

In [9]:
idx = mid_to_idx[1]
sims_idx = similarity_scores[idx]
# Sort by decreasing similarity and drop Toy Story itself
ranked_idx = [i for i in np.argsort(-sims_idx) if i != idx]
ranked_mid = [idx_to_mid[x] for x in ranked_idx]
for mid in ranked_mid[:5]:
    print(movies[movies.movieId == mid]['title'])

NameError: name 'similarity_scores' is not defined
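
The mapping matters because `movieId` values are not contiguous (the first ratings shown earlier use ids 1, 3, 6, 47 and 50), whereas matrix rows are numbered 0, 1, 2, and so on. Here is a small sketch of how such a pair of dictionaries behaves. The toy names are hypothetical, and the real dictionaries from the previous challenge may have been built differently.

```python
# Hypothetical movie ids, listed in the order of the similarity matrix rows.
toy_movie_ids = [1, 3, 6, 47, 50]

toy_idx_to_mid = dict(enumerate(toy_movie_ids))
toy_mid_to_idx = {mid: idx for idx, mid in toy_idx_to_mid.items()}

print(toy_mid_to_idx[47])  # 3: movie_id 47 lives in row 3
print(toy_idx_to_mid[3])   # 47: and row 3 maps back to movie_id 47
```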

As the next step is to **deploy your model**, you now need to:

**Q9**. Save your `similarity_scores` in pickle format, along with the `movies` DataFrame. Save both in the `data/netflix` directory at the root of the repository.

In [None]:
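directory = "data/netflix"  # Q9 target; adjust the relative path to reach the repository root
with open(directory + "/similarity_scores.pkl", "wb") as f:
    pickle.dump(similarity_scores, f)
movies.to_pickle(directory + "/movies.pkl")  # file name chosen here; any name works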
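
Before deploying, it is worth checking that a saved object reloads unchanged. The sketch below does a pickle round trip on a small stand-in array, writing to a temporary directory rather than the real output path. The names `toy_matrix` and `reloaded` are made up for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

toy_matrix = np.eye(3)  # stand-in for similarity_scores

# A temporary directory stands in for the real output directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "similarity_scores.pkl")

    with open(path, "wb") as f:   # the file is closed when the block exits
        pickle.dump(toy_matrix, f)

    with open(path, "rb") as f:
        reloaded = pickle.load(f)

print(np.array_equal(toy_matrix, reloaded))  # True
```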

**Q10**. Encapsulate the previous code into functions. In particular, you will need:
- a `get_sim_scores(mid)` function that returns the vector of similarity scores `sims` between a movie `mid` and all the other movies
- a `get_ranked_recos(sims)` function that takes a vector of similarity scores `sims` and returns all n_movies recommendations, ranked from most to least recommended, as a list of (mid, score, name) tuples.

In [10]:
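def get_movie_name(mid, movies):
    # Look up the title for a movie id; fall back to 'Unknown' if it is missing
    try:
        name = movies.loc[movies.movieId == mid].title.values[0]
    except IndexError:
        name = 'Unknown'
    return name

def get_sim_scores(mid):
    # Row of the similarity matrix for the movie with id `mid`
    idx = mid_to_idx[mid]
    sims = similarity_scores[idx]
    return sims

def get_ranked_recos(sims):
    # All movies ranked from most to least similar, as (mid, score, name) tuples
    recos = []
    for idx in np.argsort(-sims):
        mid = idx_to_mid[idx]
        name = get_movie_name(mid, movies)
        score = sims[idx]
        recos.append((mid, score, name))
    return recos

def get_rec(mid, movies, k):
    # Top-k recommendations for the movie with id `mid`
    sim_scores = get_sim_scores(mid)
    return get_ranked_recos(sim_scores)[:k]
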

In [11]:
get_rec(3, movies, 10)

NameError: name 'similarity_scores' is not defined

If you have extra time, feel free to improve your recommendation engine!