# Recommendation - Model 🍿

---

<img src="https://visithrastnik.si/uploads/tic/public/generic_list_item/6-kulturna_prireditev_v_avli_kulturnega_centra_zagorje_ob_savi.jpg" />

---

Now for the exciting part! We will train a machine learning model on the sparse **ratings** matrix from the previous challenge and use it to build a recommendation engine.

First, reload the `movies` and `ratings` DataFrames.

### TODO: load the movies and ratings datasets
import pandas as pd
import pickle
movies = pd.read_csv('ml-latest-small/movies.csv')
ratings = pd.read_csv('ml-latest-small/ratings.csv')
print(movies.head())

print(ratings.head())

**Q1**. Start by loading all the pickles you saved during the last challenge: `ratings_matrix`, `idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`

In [9]:
import pickle

data_dir = './data'  # named `data_dir` rather than `dir`, which shadows the Python built-in

def load_pickle(name):
    """Load `<data_dir>/<name>.pkl`, closing the file once it has been read."""
    with open(f'{data_dir}/{name}.pkl', 'rb') as f:
        return pickle.load(f)

ratings_mat = load_pickle('ratings_mat')
idx_to_mid = load_pickle('idx_to_mid')
mid_to_idx = load_pickle('mid_to_idx')
uid_to_idx = load_pickle('uid_to_idx')
idx_to_uid = load_pickle('idx_to_uid')

import pandas as pd

movies = pd.read_csv('./ml-latest-small/movies.csv')

ratings = pd.read_csv('./ml-latest-small/ratings.csv')

**Q2**. Because the dataset is slightly different from what we have been used to (X as features, y as target), the usual `train_test_split` method from scikit-learn does not apply.

Fortunately, `lightfm` provides a `random_train_test_split` function in its `cross_validation` module, dedicated to this use case 🙂

Split the data randomly into a `train` matrix and a `test` matrix, with 20% of the interactions in the test set.
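
To build intuition for what a random interaction split does, here is a minimal NumPy sketch. It is not LightFM's implementation: the helper name `split_interactions` and the toy (user, movie, rating) triplets are assumptions made for illustration. It shuffles the stored interactions and sets a share of them aside, so the train and test parts hold disjoint interactions.

```python
import numpy as np

def split_interactions(rows, cols, vals, test_percentage=0.2, seed=0):
    """Randomly split COO-style (row, col, value) triplets into train and test parts."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(vals))                 # shuffled interaction indices
    n_test = int(round(test_percentage * len(vals)))   # number of held-out interactions
    test_idx, train_idx = order[:n_test], order[n_test:]
    train_part = (rows[train_idx], cols[train_idx], vals[train_idx])
    test_part = (rows[test_idx], cols[test_idx], vals[test_idx])
    return train_part, test_part

# Toy data: 10 ratings given by 3 users to 5 movies (hypothetical values)
rows = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
cols = np.array([0, 1, 2, 0, 3, 4, 1, 2, 3, 4])
vals = np.array([5.0, 3.0, 4.0, 4.0, 2.0, 5.0, 1.0, 4.5, 3.5, 4.0])

toy_train, toy_test = split_interactions(rows, cols, vals, test_percentage=0.2)
print(len(toy_train[2]), 'train interactions,', len(toy_test[2]), 'test interactions')
```

Note that the split is done per interaction, not per user, so a given user can appear in both parts.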

In [4]:
import numpy as np
from lightfm.cross_validation import random_train_test_split

train, test = random_train_test_split(ratings_mat, test_percentage=0.2, random_state=np.random.RandomState(0)) 

In [7]:
test

<610x3650 sparse matrix of type '<class 'numpy.float64'>'
	with 18055 stored elements in COOrdinate format>

**Q3**. Train a LightFM model for 10 epochs. You can use the parameter `loss="warp"`.

In [8]:
from lightfm import LightFM

model = LightFM(no_components=100, loss='warp', random_state=0)


In [9]:
model.fit(train, epochs=10, verbose=True)



**Q4**. Evaluate your model on your test set. You can use the `precision_at_k` metric implemented in the LightFM library.
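
As a reminder of what the metric measures, precision at k is the fraction of the k highest-ranked items that are known positives for a user. The sketch below computes it by hand for a single user with NumPy; the function name `precision_at_k_for_user` and the toy scores are assumptions made for illustration and are not part of LightFM.

```python
import numpy as np

def precision_at_k_for_user(scores, positive_items, k=5):
    """Fraction of the k highest-scored items that appear in positive_items."""
    top_k = np.argsort(-scores)[:k]                  # indices of the k best-scored items
    hits = sum(1 for item in top_k if int(item) in positive_items)
    return hits / k

toy_scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7, 0.2])   # predicted scores for 6 movies
toy_positives = {0, 4, 5}                                # movies the user rated in the test set

p_at_3 = precision_at_k_for_user(toy_scores, toy_positives, k=3)
print('Precision at 3:', p_at_3)
```

Here the three best-scored movies are 0, 2 and 4, of which two are positives. LightFM's `precision_at_k` returns one such value per user, which is why the result is averaged.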

In [1]:
from lightfm.evaluation import precision_at_k
k = 5  # evaluate the top-k recommendations of each user
# Passing `train` keeps interactions already seen during training out of the ranking
precision_k = precision_at_k(model, test, train, k=k)  # NumPy array with one value per user

print(f'Precision at k={k}: {precision_k.mean()}')




**Q5**. What does the `item_embeddings` attribute of `model` contain? This will be the heart of your recommendation engine! 💟 So make sure you fully understand what it contains.

In [2]:
print(model.item_embeddings.shape)
print('...')


**Q6**. We just trained a model that factorized our ratings matrix into a matrix U of shape (n_users, no_components), available as `model.user_embeddings`, and a matrix V of shape (n_movies, no_components), available as `model.item_embeddings`.
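
For intuition, the snippet below mimics this factorization with random toy matrices (they stand in for the trained embeddings and are not produced by LightFM). The predicted affinity between a user and a movie is sketched as the dot product of their latent vectors; LightFM additionally adds user and item bias terms, which are left out here for brevity.

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_movies, no_components = 4, 6, 3       # toy sizes, chosen arbitrarily

U = rng.normal(size=(n_users, no_components))    # stands in for model.user_embeddings
V = rng.normal(size=(n_movies, no_components))   # stands in for model.item_embeddings

# One score per (user, movie) pair: the dot product of their latent vectors
toy_predictions = U @ V.T
print(toy_predictions.shape)

# Movies ranked for user 0, from highest to lowest predicted score
print(np.argsort(-toy_predictions[0]))
```

Each row of V is the latent vector of one movie, so two movies with similar rows are treated similarly by the model. That is the property the similarity computation relies on.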

Now we want to compute **similarity between each pair of movies**.

> 🔦 **Hint**: As a similarity measure, we can use either cosine similarity (`cosine_similarity` from scikit-learn) or Pearson correlation (`np.corrcoef` from NumPy):
> - **Cosine similarity** between two vectors, or matrices X and Y is given by:
> ``` python
> from sklearn.metrics.pairwise import cosine_similarity
> cosine_similarity(X, Y)
> ```
> - **Pearson similarity** between two vectors, or matrices X and Y is given by:
> ``` python
> import numpy as np
> np.corrcoef(X, Y)
> ```

Compute the `similarity_scores` of size (n_movies, n_movies), containing for each element (i, j) the similarity between movie of index i and movie of index j.
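
To make the cosine formula concrete, here is a small NumPy sketch that computes it by hand on toy vectors. The helper name `cosine_similarity_matrix` is hypothetical, and the sketch assumes no row is all zeros (otherwise the division by the norm fails).

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms          # each row rescaled to length 1
    return unit @ unit.T               # entry (i, j) = cosine between rows i and j

toy_embeddings = np.array([
    [1.0, 0.0],    # movie 0
    [2.0, 0.0],    # movie 1: same direction as movie 0
    [0.0, 3.0],    # movie 2: orthogonal to movie 0
])
toy_sims = cosine_similarity_matrix(toy_embeddings)
print(toy_sims)
```

Movies 0 and 1 point in the same direction and get a similarity of 1 despite their different lengths, while movie 2 is orthogonal to both and gets 0.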

In [3]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_scores = cosine_similarity(model.item_embeddings)  # cosine similarity between every pair of movies
print(similarity_scores.shape)  # (n_movies, n_movies); a heatmap could be used to visualize it
similarity_scores


**Q7**. For the movie at index 20, what are the indices of the 10 most similar movies?
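
The ranking step can be sketched on a toy similarity row as follows (the values are made up). `np.argsort` sorts in ascending order, so the row is negated to obtain the most similar movies first. The query movie is then dropped, since a movie is always most similar to itself.

```python
import numpy as np

toy_row = np.array([0.2, 1.0, 0.9, -0.1, 0.5])   # similarities of movie 1 to movies 0..4
query_idx = 1

order = np.argsort(-toy_row)                      # indices from most to least similar
order = order[order != query_idx]                 # drop the movie itself
top_2 = order[:2]
print(top_2)
```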

In [4]:
idx = 20
# Row `idx` holds the similarity between this movie and every movie
similarity_idx = similarity_scores[idx]

ranked_idx = np.argsort(-similarity_idx)    # most similar first
ranked_idx = ranked_idx[ranked_idx != idx]  # drop the movie itself
ranked_mid = [idx_to_mid[x] for x in ranked_idx]

print(ranked_idx[:10])
for mid in ranked_mid[:10]:
    print(movies.loc[movies.movieId == mid, 'title'].to_string(index=False))



**Q8**. Let's now test our engine! Suppose we have a user who likes **Toy Story** 🧸 (movie_id = 1). Which movies would you recommend to that user? In other words, which movies are the most similar to Toy Story?

> ⚠️ **Warning**: Remember that `similarity_scores` is indexed by `idx`, whereas what you know about your movie is its `movie_id`.

Retrieve the **top 5 recommendations**.
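
The snippet below illustrates why the conversion matters, using toy mappings. It assumes the mappings are dictionaries built by enumerating the movie ids; the ids and the `toy_` names are hypothetical, so check how your own `mid_to_idx` and `idx_to_mid` were built in the previous challenge.

```python
toy_movie_ids = [1, 2, 6, 47, 50]   # hypothetical, non-contiguous movie ids

# movie_id -> row index in the matrix, and the reverse mapping
toy_mid_to_idx = {mid: idx for idx, mid in enumerate(toy_movie_ids)}
toy_idx_to_mid = {idx: mid for mid, idx in toy_mid_to_idx.items()}

row = toy_mid_to_idx[47]            # row of the similarity matrix for movie id 47
print(row, toy_idx_to_mid[row])
```

In this toy setup, movie id 1 lives at row 0, so indexing the similarity matrix with the raw movie id would silently return the scores of a different movie.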

In [5]:
idx = mid_to_idx[1]  # Toy Story has movie_id 1; convert it to its row index first
# Row `idx` holds the similarity between Toy Story and every movie
similarity_idx = similarity_scores[idx]

ranked_idx = np.argsort(-similarity_idx)    # most similar first
ranked_idx = ranked_idx[ranked_idx != idx]  # drop Toy Story itself
ranked_mid = [idx_to_mid[x] for x in ranked_idx]

for mid in ranked_mid[:5]:
    print(movies.loc[movies.movieId == mid, 'title'].to_string(index=False))



The next step is to **deploy your model**, so you now need to:

**Q9**. Save your `similarity_scores` in pickle format. Also save the `movies` DataFrame in pickle format. Save both in the `data/netflix` directory at the root of the repository.
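
As a quick sanity check of the save and load round trip, here is a self-contained sketch. It writes a toy array to a temporary directory instead of the real target directory; `toy_similarity` and the temporary path are placeholders for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

toy_similarity = np.eye(3)                       # stand-in for similarity_scores
out_dir = tempfile.mkdtemp()                     # temporary directory for the demo
path = os.path.join(out_dir, 'similarity_scores.pkl')

with open(path, 'wb') as f:                      # the file is closed when the block exits
    pickle.dump(toy_similarity, f)

with open(path, 'rb') as f:
    reloaded = pickle.load(f)

print(np.array_equal(toy_similarity, reloaded))
```

Opening the files in `with` blocks ensures they are flushed and closed before the deployment code tries to read them.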

In [6]:
with open('./data/similarity_scores.pkl', 'wb') as f:
    pickle.dump(similarity_scores, f)
with open('./data/movies.pkl', 'wb') as f:
    pickle.dump(movies, f)  # the movies DataFrame


**Q10**. Encapsulate the previous code into functions. In particular, you will need:
- a `get_sim_scores(mid)` function that returns the vector of similarity scores `sims` between a movie `mid` and all the other movies
- a `get_ranked_recos(sims)` function that, given a vector of similarity scores `sims`, returns the list of all n_movies recommendations ranked from most to least recommended, as a list of (mid, score, name) tuples.

In [7]:
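def get_movie_name(mid, movies):
    """Return the title of movie `mid`, or 'Unknown' if it is not in `movies`."""
    try:
        name = movies.loc[movies.movieId == mid].title.values[0]
    except IndexError:
        name = 'Unknown'
    return name

def get_sim_scores(mid):
    """Return the similarity scores between movie `mid` and all movies."""
    idx = mid_to_idx[mid]
    sims = similarity_scores[idx]
    return sims

def get_ranked_recos(sims):
    """Return all movies as (mid, score, name) tuples, most similar first."""
    recos = []
    for idx in np.argsort(-sims):
        mid = idx_to_mid[idx]
        name = get_movie_name(mid, movies)
        score = sims[idx]
        recos.append((mid, score, name))
    return recos

def get_reccomendations(mid, movies, k):
    """Return the top `k` recommendations for movie `mid`."""
    sim_scores = get_sim_scores(mid)
    return get_ranked_recos(sim_scores)[:k]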

In [8]:
get_reccomendations(2, movies, 10)


If you have extra time, feel free to improve your recommendation engine!