## What to do?

- <s>Recommend movies to users based on implicit recommend function</s>
- Talk about evaluation metrics (link IR metrics PDF)
- Choose p@k
- Implement train test split
- Choose Popular movies as reference
- Optimize ALS parameters

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import implicit

import scipy.sparse as sparse

%matplotlib inline

In [2]:
ratings = pd.read_csv("data/ml-latest-small/ratings.csv")

In [25]:
ratings = ratings[["userId", "movieId", "rating"]]

users = list(np.sort(ratings.userId.unique())) # Unique user IDs, sorted
movies = list(ratings.movieId.unique()) # Unique movie IDs, in order of first appearance
rating = list(ratings.rating) # All ratings

# Row indices: position of each rating's user in `users`
# (astype('category', categories=...) is no longer accepted by recent pandas)
rows = pd.Categorical(ratings.userId, categories=users).codes
# Column indices: position of each rating's movie in `movies`
cols = pd.Categorical(ratings.movieId, categories=movies).codes
user_item = sparse.csr_matrix((rating, (rows, cols)), shape=(len(users), len(movies)))

matrix_size = user_item.shape[0]*user_item.shape[1] # Number of possible user-movie pairs
num_ratings = len(user_item.nonzero()[0]) # Number of nonzero ratings stored
sparsity = 100*(1 - (1.0*num_ratings/matrix_size)) # Percentage of empty cells
print(sparsity)

user_item

98.3560858391


<671x9066 sparse matrix of type '<type 'numpy.float64'>'
	with 100004 stored elements in Compressed Sparse Row format>
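
The mapping between raw IDs and matrix indices used in the cell above can be checked on a toy example. This is only a sketch: the `toy` DataFrame and its values are made up for illustration, and the variable names simply mirror the ones in the notebook.

```python
import pandas as pd
import scipy.sparse as sparse

# Toy ratings with non-contiguous IDs, standing in for ratings.csv (made-up values)
toy = pd.DataFrame({
    "userId": [10, 10, 42, 7],
    "movieId": [500, 31, 31, 1029],
    "rating": [4.0, 2.5, 5.0, 3.0],
})

users = sorted(toy.userId.unique())  # row index -> user ID
movies = list(toy.movieId.unique())  # column index -> movie ID (order of first appearance)

# Categorical codes give each ID its position in the corresponding list
rows = pd.Categorical(toy.userId, categories=users).codes
cols = pd.Categorical(toy.movieId, categories=movies).codes

user_item = sparse.csr_matrix((toy.rating.values, (rows, cols)),
                              shape=(len(users), len(movies)))

# Round trip: find the row of user 42, then map its nonzero columns back to movie IDs
row = users.index(42)
rated = [movies[c] for c in user_item[row].nonzero()[1]]
print(rated)
```

Keeping `users` and `movies` as plain lists makes `users.index(...)` available for the reverse lookup, at the cost of a linear scan per call.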

## Recommending Movies to users

In [4]:
model = implicit.als.AlternatingLeastSquares(factors=10, 
                                             iterations=20, 
                                             regularization=0.1, 
                                             num_threads=4)
model.fit(user_item.T)

First, let's write a function that returns the movies that a particular user has rated.

In [11]:
ids = user_item[0].nonzero()[1]
ids

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19], dtype=int32)

In [12]:
movieTableIDs = [movies[item] for item in ids]
movieTableIDs

[31,
 1029,
 1061,
 1129,
 1172,
 1263,
 1287,
 1293,
 1339,
 1343,
 1371,
 1405,
 1953,
 2105,
 2150,
 2193,
 2294,
 2455,
 2968,
 3671]

In [33]:
users.index(1)

0

In [34]:
def get_rated_movies_ids(user_id, user_item, users, movies):
    """
    Input
    -----
    
    user_id: int
        User ID
        
    user_item: scipy.sparse matrix
        User item interaction matrix
        
    users: python list
        Maps a row index of the user item matrix to a user ID
        
    movies: python list
        Maps a column index of the user item matrix to a movie ID
        
    Output
    -----
    
    movieTableIDs: python list
        List of movie IDs that the user has rated
    
    """
    user_id = users.index(user_id)
    # Get matrix ids of rated movies by selected user
    ids = user_item[user_id].nonzero()[1]
    # Convert matrix ids to movies IDs
    movieTableIDs = [movies[item] for item in ids]
    
    return movieTableIDs

In [35]:
movieTableIDs = get_rated_movies_ids(1, user_item, users, movies)

In [36]:
rated_movies = pd.DataFrame(movieTableIDs, columns=['movieId'])
rated_movies

Unnamed: 0,movieId
0,31
1,1029
2,1061
3,1129
4,1172
5,1263
6,1287
7,1293
8,1339
9,1343


In [18]:
movies_table = pd.read_csv("data/ml-latest-small/movies.csv")
movies_table.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [20]:
rated_movies = pd.merge(rated_movies, movies_table, on='movieId', how='left')
rated_movies

Unnamed: 0,movieId,title,genres
0,31,Dangerous Minds (1995),Drama
1,1029,Dumbo (1941),Animation|Children|Drama|Musical
2,1061,Sleepers (1996),Thriller
3,1129,Escape from New York (1981),Action|Adventure|Sci-Fi|Thriller
4,1172,Cinema Paradiso (Nuovo cinema Paradiso) (1989),Drama
5,1263,"Deer Hunter, The (1978)",Drama|War
6,1287,Ben-Hur (1959),Action|Adventure|Drama
7,1293,Gandhi (1982),Drama
8,1339,Dracula (Bram Stoker's Dracula) (1992),Fantasy|Horror|Romance|Thriller
9,1343,Cape Fear (1991),Thriller


In [46]:
def get_movies(movieTableIDs, movies_table):
    """
    Input
    -----
    
    movieTableIDs: python list
        List of movie IDs to look up (rated or recommended movies)
        
    movies_table: pd.DataFrame
        DataFrame of movies info
        
    Output
    -----
    
    rated_movies: pd.DataFrame
        DataFrame with the info of the requested movies
    
    """
    
    rated_movies = pd.DataFrame(movieTableIDs, columns=['movieId'])
    
    rated_movies = pd.merge(rated_movies, movies_table, on='movieId', how='left')
    
    return rated_movies

In [47]:
movieTableIDs = get_rated_movies_ids(1, user_item, users, movies)
df = get_movies(movieTableIDs, movies_table)
df

Unnamed: 0,movieId,title,genres
0,31,Dangerous Minds (1995),Drama
1,1029,Dumbo (1941),Animation|Children|Drama|Musical
2,1061,Sleepers (1996),Thriller
3,1129,Escape from New York (1981),Action|Adventure|Sci-Fi|Thriller
4,1172,Cinema Paradiso (Nuovo cinema Paradiso) (1989),Drama
5,1263,"Deer Hunter, The (1978)",Drama|War
6,1287,Ben-Hur (1959),Action|Adventure|Drama
7,1293,Gandhi (1982),Drama
8,1339,Dracula (Bram Stoker's Dracula) (1992),Fantasy|Horror|Romance|Thriller
9,1343,Cape Fear (1991),Thriller


In [40]:
user_id = users.index(1)
user_id

0

In [41]:
recommendations = model.recommend(user_id, user_item, N=5)
recommendations

[(187, 0.11486098321374655),
 (210, 0.10961007407185175),
 (445, 0.10327012493755082),
 (1229, 0.10219776867107168),
 (176, 0.10070906542395243)]

In [42]:
recommendations = [item[0] for item in recommendations]
recommendations

[187, 210, 445, 1229, 176]

In [43]:
movies_ids = [movies[ids] for ids in recommendations]
movies_ids

[1220, 1374, 1394, 5060, 1127]

In [44]:
def recommend_movie_ids(user_id, model, user_item, users, movies, N=5):
    """
    Input
    -----
    
    user_id: int
        User ID
        
    model: ALS model
        Trained ALS model
    
    user_item: scipy.sparse matrix
        User item interaction matrix so that we do not recommend already rated movies
        
    users: python list
        Maps a row index of the user item matrix to a user ID
        
    movies: python list
        Maps a column index of the user item matrix to a movie ID
        
    N: int (default = 5)
        Number of recommendations
        
    Output
    -----
    
    movies_ids: python list
        List of movie IDs
    """
    
    user_id = users.index(user_id)
    
    recommendations = model.recommend(user_id, user_item, N=N)
    
    recommendations = [item[0] for item in recommendations]
    
    movies_ids = [movies[ids] for ids in recommendations]
    
    return movies_ids

In [45]:
movies_ids = recommend_movie_ids(1, model, user_item, users, movies, N=5)
movies_ids

[1220, 1374, 1394, 5060, 1127]

In [48]:
movies_rec = get_movies(movies_ids, movies_table)
movies_rec

Unnamed: 0,movieId,title,genres
0,1220,"Blues Brothers, The (1980)",Action|Comedy|Musical
1,1374,Star Trek II: The Wrath of Khan (1982),Action|Adventure|Sci-Fi|Thriller
2,1394,Raising Arizona (1987),Comedy
3,5060,M*A*S*H (a.k.a. MASH) (1970),Comedy|Drama|War
4,1127,"Abyss, The (1989)",Action|Adventure|Sci-Fi|Thriller


In [49]:
df

Unnamed: 0,movieId,title,genres
0,31,Dangerous Minds (1995),Drama
1,1029,Dumbo (1941),Animation|Children|Drama|Musical
2,1061,Sleepers (1996),Thriller
3,1129,Escape from New York (1981),Action|Adventure|Sci-Fi|Thriller
4,1172,Cinema Paradiso (Nuovo cinema Paradiso) (1989),Drama
5,1263,"Deer Hunter, The (1978)",Drama|War
6,1287,Ben-Hur (1959),Action|Adventure|Drama
7,1293,Gandhi (1982),Drama
8,1339,Dracula (Bram Stoker's Dracula) (1992),Fantasy|Horror|Romance|Thriller
9,1343,Cape Fear (1991),Thriller


## Add posters data

In [50]:
metadata = pd.read_csv('data/movies_metadata.csv')

image_data = metadata[['imdb_id', 'poster_path']]

links = pd.read_csv("data/links.csv")

links = links[['movieId', 'imdbId']]

image_data = image_data[~ image_data.imdb_id.isnull()]

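def app(x):
    """Convert an IMDb ID string (prefix 'tt' followed by digits) to an int."""
    try:
        return int(x[2:])  # Drop the two-character 'tt' prefix
    except ValueError:
        # Malformed ID: print it and return None so the row is dropped below
        print(x)
        return None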
        
image_data['imdbId'] = image_data.imdb_id.apply(app)
image_data = image_data[~ image_data.imdbId.isnull()]
image_data.imdbId = image_data.imdbId.astype(int)
image_data = image_data[['imdbId', 'poster_path']]


posters = pd.merge(image_data, links, on='imdbId', how='left')

posters = posters[['movieId', 'poster_path']]

posters = posters[~ posters.movieId.isnull()]

posters.movieId = posters.movieId.astype(int)

movies_table = pd.merge(movies_table, posters, on='movieId', how='left')
movies_table.head()



0
0
0


Unnamed: 0,movieId,title,genres,poster_path
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg
1,2,Jumanji (1995),Adventure|Children|Fantasy,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
2,3,Grumpier Old Men (1995),Comedy|Romance,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg
4,5,Father of the Bride Part II (1995),Comedy,/e64sOI48hQXyru7naBFyssKFxVd.jpg


In [53]:
from IPython.display import HTML
from IPython.display import display

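def display_posters(df):
    """Display the posters of the movies in df side by side."""
    images = ''
    for ref in df.poster_path:
        # After the left merge, movies without a poster have NaN here, so skip non-strings
        if isinstance(ref, str) and ref != '':
            link = 'http://image.tmdb.org/t/p/w185/' + ref
            images += ("<img style='width: 120px; margin: 0px; "
                       "float: left; border: 1px solid black;' src='%s' />" % link)
    display(HTML(images))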

In [52]:
movies_rec = get_movies(movies_ids, movies_table)
movies_rec

Unnamed: 0,movieId,title,genres,poster_path
0,1220,"Blues Brothers, The (1980)",Action|Comedy|Musical,/6hesUNBkpVRqBTBw2HlTg0h8b56.jpg
1,1374,Star Trek II: The Wrath of Khan (1982),Action|Adventure|Sci-Fi|Thriller,/7VKpj4Xl3hTzgAS3xpVuOyqNnSv.jpg
2,1394,Raising Arizona (1987),Comedy,/jsBg2bhvbSncyezo9sMntMzBuy6.jpg
3,5060,M*A*S*H (a.k.a. MASH) (1970),Comedy|Drama|War,/eOslMOtaPXgQEgVJ93U3KOLogGD.jpg
4,1127,"Abyss, The (1989)",Action|Adventure|Sci-Fi|Thriller,/kRP5dGXDhKt7bDpXX4YBa4dRwlL.jpg


In [54]:
display_posters(movies_rec)

In [55]:
movies_ids = recommend_movie_ids(100, model, user_item, users, movies, N=7)
movies_rec = get_movies(movies_ids, movies_table)
display_posters(movies_rec)

Now that we can recommend movies to users and find movies similar to a selected movie, we want to know how good our recommendations are. This implies that we need to define an evaluation scheme.
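
One possible evaluation scheme is to hide a fraction of the known ratings, train on the rest, and check whether the hidden movies show up in the recommendations. The following is a minimal sketch of such a split, assuming a purely random hold-out over the stored entries. The helper `train_test_split` and its parameters `test_fraction` and `seed` are hypothetical names introduced here; they are not part of the notebook or of the `implicit` library.

```python
import numpy as np
import scipy.sparse as sparse


def train_test_split(user_item, test_fraction=0.2, seed=42):
    """Hold out a random fraction of the stored interactions for testing.

    Hypothetical helper: returns (train, test), two CSR matrices with the
    same shape as user_item and no stored entry in common.
    """
    coo = user_item.tocoo()
    rng = np.random.RandomState(seed)
    # Boolean mask over the stored entries: True means "goes to the test set"
    is_test = rng.rand(coo.nnz) < test_fraction

    def build(mask):
        return sparse.csr_matrix(
            (coo.data[mask], (coo.row[mask], coo.col[mask])),
            shape=coo.shape,
        )

    return build(~is_test), build(is_test)


# Toy 3x4 matrix standing in for the real user item matrix (made-up values)
toy = sparse.csr_matrix(np.array([[5, 0, 3, 0],
                                  [0, 4, 0, 1],
                                  [2, 0, 0, 5]], dtype=float))
train, test = train_test_split(toy, test_fraction=0.5, seed=0)
print(train.nnz, test.nnz)
```

A random split like this can leave some users with no training ratings at all; a per-user split would avoid that, at the cost of slightly more code.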

## Evaluation metrics

Traditional ML models can be evaluated with metrics such as RMSE for regression problems, or accuracy and AUC for classification problems. Evaluating recommendations is trickier because we are trying to recommend movies that are highly personalized to a particular user.
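
Precision at k (p@k), the metric listed in the plan at the top of this notebook, can be sketched as follows. This is an illustrative implementation, not the one the notebook ends up using: the function name `precision_at_k` is hypothetical, the `held_out` list is made up, and the sketch assumes the common convention of dividing by k even when fewer than k items are recommended.

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended items that are relevant.

    recommended: ranked list of item IDs, best first
    relevant: iterable of item IDs the user actually interacted with (held out)
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(recommended)[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    # Convention assumed here: always divide by k, not by len(top_k)
    return hits / float(k)


# Movie IDs recommended to user 1 earlier in this notebook
recommended = [1220, 1374, 1394, 5060, 1127]
# Hypothetical held-out movies for that user (made up for illustration)
held_out = [1374, 5060, 260]

score = precision_at_k(recommended, held_out, k=5)
print(score)  # 2 hits out of 5 -> 0.4
```

Averaging this score over all users with at least one held-out movie gives a single number that can be compared against a baseline.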