# Collaborative Based Recommender Systems Pt. 1

A collaborative based recommender system (also known as collaborative filtering) recommends items based on user-item ratings. Similar users are found by comparing their ratings, and items are then recommended based on the preferences of those similar users. One example is Amazon's "Customers who bought this item also bought" feature.

In this notebook, we'll be looking at a dataset containing movie ratings. This dataset is a small part of the MovieLens dataset (https://grouplens.org/datasets/movielens/). For this example, we will use only the user-movie ratings to develop a recommendation engine.

In [1]:
# import packages we'll be using
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist

In [4]:
# read the data into a DataFrame
ratings_data = pd.read_csv('../data/ratings.csv')
movies_data = pd.read_csv('../data/movies.csv')

ratings_data.head()

,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [6]:
movies_data.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
# count number of unique users and unique movies
n_users = ratings_data['userId'].nunique()
n_movies = ratings_data['movieId'].nunique()

print(str(n_users) + ' users')
print(str(n_movies) + ' movies')

671 users
9066 movies


In [9]:
# create user-item matrix
user_item_matrix = ratings_data.pivot(index='userId',
                                      columns='movieId',
                                      values='rating')

# convert ratings to likes (>=4 is a like (1), <4 is not (0))
user_item_matrix[user_item_matrix < 4] = 0
user_item_matrix[user_item_matrix >= 4] = 1
user_item_matrix = user_item_matrix.fillna(0)

user_item_matrix.head()

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
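To make the pivot and the rating-to-like conversion easier to follow, here is a minimal sketch on a tiny hand-made ratings table. The `toy_ratings` data and the variable names are made up for illustration. It uses a single comparison in place of the three assignment steps above; because `NaN >= 4` evaluates to `False`, unrated movies end up as 0, just as they do after `fillna(0)`.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [10, 20, 10, 30, 20],
    'rating':  [4.5, 2.0, 3.5, 5.0, 4.0],
})

# One row per user, one column per movie; unrated pairs become NaN
toy_matrix = toy_ratings.pivot(index='userId',
                               columns='movieId',
                               values='rating')

# Ratings >= 4 become 1.0; lower ratings and NaN (unrated) become 0.0
toy_likes = (toy_matrix >= 4).astype(float)

print(toy_likes)
```

Note that after this step a movie the user rated below 4 and a movie the user never rated look identical (both are 0), which is a simplification to keep in mind when interpreting the results.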


Now that we have the user-item matrix, we can use it to find similar users. The process is to first pick a user that needs recommendations. We then find the top K similar users based on their preferences and recommend movies that were liked by those K users.

In [11]:
# select a random user
user_id = user_item_matrix.sample(1).index
random_user = user_item_matrix.loc[user_id]

# calculate similarity between that user and all other users
metric = 'jaccard' # what other distance metrics can we use?
distances = cdist(user_item_matrix, random_user, metric=metric).squeeze()
distances = pd.Series(data=distances,
                      index=user_item_matrix.index)

# find the most similar K users, excluding the selected user
# (whose distance to themselves is 0)
K = 10
similar_users = distances.drop(user_id).sort_values()[:K].index

In [12]:
similar_users

Int64Index([170, 327, 422, 219, 344, 647, 41, 574, 646, 177], dtype='int64', name=u'userId')
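To make the Jaccard distance concrete, and as one possible answer to the question in the code comment about other metrics, here is a small sketch on made-up like vectors. The arrays `user_a` and `others` are hypothetical. The vectors are cast to boolean before calling `cdist`, since Jaccard is defined on sets of liked movies; cosine and Hamming are shown only as examples of alternatives that `cdist` supports, not as recommendations.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical like vectors over five movies (True = liked)
user_a = np.array([[1, 1, 0, 0, 1]], dtype=bool)
others = np.array([
    [1, 1, 0, 0, 0],   # shares two of user_a's three likes
    [0, 0, 1, 1, 0],   # shares none of them
    [1, 1, 0, 0, 1],   # identical likes
], dtype=bool)

# Jaccard distance = 1 - |liked by both| / |liked by either|
intersection = (others & user_a).sum(axis=1)
union = (others | user_a).sum(axis=1)
manual = 1 - intersection / union

# The same metric via cdist, plus two alternatives
jaccard = cdist(others, user_a, metric='jaccard').squeeze()
cosine = cdist(others.astype(float), user_a.astype(float), metric='cosine').squeeze()
hamming = cdist(others.astype(float), user_a.astype(float), metric='hamming').squeeze()

print(manual)
print(jaccard)
print(cosine)
print(hamming)
```

One difference worth noticing: Hamming counts movies that neither user liked as agreement, so on a sparse matrix like ours most users would look similar to each other. Jaccard and cosine ignore those shared zeros.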

After finding the most similar K users, we find movies that were liked by those users, and recommend the top M most liked movies.

In [14]:
# extract rows of most similar K users from user-item matrix
similar_user_ratings = user_item_matrix.loc[similar_users,:]

# remove movies already liked by user
similar_user_ratings = similar_user_ratings.loc[:, (random_user==0).squeeze()]

# add up the number of likes for each movie
overall_movie_ratings = similar_user_ratings.sum()

# find the top M movies
M = 5
movie_recs_ids = overall_movie_ratings.sort_values(ascending=False)[:M].index
movie_recs = movies_data[movies_data['movieId'].isin(movie_recs_ids)]['title'].tolist()

In [15]:
movie_recs

['Toy Story (1995)',
 'Star Wars: Episode IV - A New Hope (1977)',
 'Forrest Gump (1994)',
 'Star Wars: Episode V - The Empire Strikes Back (1980)',
 'Back to the Future (1985)']
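One thing to be aware of: filtering with `isin` returns titles in the order they appear in `movies_data`, not in order of how many similar users liked them. If the ranking matters, one option is to look the titles up by `movieId` instead. This is a minimal sketch on hypothetical stand-in data (`toy_movies` and `ranked_ids` are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for movies_data and the ranked recommendation ids
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
})
ranked_ids = [3, 1, 4]  # most liked first

# isin() keeps matching rows in the table's own order
unordered = toy_movies[toy_movies['movieId'].isin(ranked_ids)]['title'].tolist()

# Indexing by movieId and selecting with .loc follows the ranked order
ordered = toy_movies.set_index('movieId').loc[ranked_ids, 'title'].tolist()

print(unordered)
print(ordered)
```

Unlike `isin`, the `.loc` lookup raises a `KeyError` if an id is missing from the movies table, so it assumes every rated movie has a matching row.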

Let's wrap these steps in a movie recommender function.

In [18]:
def movie_recommender(user_id, num_recs=5, num_similar_users=10):
    # user_id must be list-like (e.g. the Index returned by .sample(1).index)
    # so that .loc returns a one-row DataFrame, which cdist needs as 2-D input
    random_user = user_item_matrix.loc[user_id]

    # calculate distance between this user and all users
    metric = 'jaccard' # what other distance metrics can we use?
    distances = cdist(user_item_matrix, random_user, metric=metric).squeeze()
    distances = pd.Series(data=distances,
                          index=user_item_matrix.index)

    # find the most similar users, excluding the user themselves
    similar_users = distances.drop(user_id).sort_values()[:num_similar_users].index
    similar_user_ratings = user_item_matrix.loc[similar_users,:]

    # remove movies already liked by user, then count likes per movie
    similar_user_ratings = similar_user_ratings.loc[:, (random_user==0).squeeze()]
    overall_movie_ratings = similar_user_ratings.sum()

    # find the top num_recs movies
    movie_recs_ids = overall_movie_ratings.sort_values(ascending=False)[:num_recs].index
    movie_recs = movies_data[movies_data['movieId'].isin(movie_recs_ids)]['title'].tolist()

    # look up the titles of movies the user already likes
    user_likes_ids = random_user.loc[:, (random_user==1).squeeze()].columns
    user_likes = movies_data[movies_data['movieId'].isin(user_likes_ids)]['title'].tolist()

    return user_likes, movie_recs
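
The function indexes the matrix with `user_item_matrix.loc[user_id]`, so the type of `user_id` matters: `cdist` expects two 2-D inputs, and selecting a row with a scalar label returns a 1-D Series instead. The sketch below illustrates the difference on a made-up three-user matrix (`toy` and its labels are hypothetical); wrapping the label in a list is one way to get a one-row DataFrame for a specific user.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical like matrix: three users (7, 8, 9) and three movies
toy = pd.DataFrame([[1, 0, 1],
                    [0, 1, 1],
                    [1, 1, 0]],
                   index=pd.Index([7, 8, 9], name='userId'),
                   columns=[10, 20, 30]).astype(bool)

as_series = toy.loc[8]     # scalar label: 1-D Series of shape (3,)
as_frame = toy.loc[[8]]    # list-like label: 2-D DataFrame of shape (1, 3)

# The one-row DataFrame can be passed to cdist directly
distances = cdist(toy, as_frame, metric='jaccard').squeeze()

print(as_series.shape, as_frame.shape)
print(distances)
```

This is also why the notebook passes `user_item_matrix.sample(1).index` (an Index, which is list-like) into the function.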

Run the function and print the user's likes along with their recommendations.

In [22]:
user_id = user_item_matrix.sample(1).index
user_likes, movie_recs = movie_recommender(user_id)

print('------------')
print('USER\'S LIKES')
print('------------')
print(user_likes)
print('')
print('---------------')
print('RECOMMENDATIONS')
print('---------------')
print(movie_recs)

------------
USER'S LIKES
------------
['Jumanji (1995)', 'Heat (1995)', 'American President, The (1995)', 'Money Train (1995)', 'Copycat (1995)', 'Assassins (1995)', 'Now and Then (1995)', 'Dangerous Minds (1995)', 'Eye for an Eye (1996)', "Mr. Holland's Opus (1995)", 'Fair Game (1995)', 'Juror, The (1996)', 'Nick of Time (1995)', 'Broken Arrow (1996)', 'Bridges of Madison County, The (1995)', 'Braveheart (1995)', 'Up Close and Personal (1996)', 'Apollo 13 (1995)', 'Rob Roy (1995)', 'Casper (1995)', 'Crimson Tide (1995)', 'Die Hard: With a Vengeance (1995)', 'First Knight (1995)', 'Net, The (1995)', 'Under Siege 2: Dark Territory (1995)', 'Walk in the Clouds, A (1995)', 'Waterworld (1995)', 'Disclosure (1994)', 'Drop Zone (1994)', 'Dolores Claiborne (1995)', 'French Kiss (1995)', 'Forget Paris (1995)', 'Hideaway (1995)', 'I.Q. (1994)', 'Interview with the Vampire: The Vampire Chronicles (1994)', 'Just Cause (1995)', 'Legends of the Fall (1994)', 'Losing Isaiah (1995)', 'Miracle on 34t