# Recommendation system using collaborative filtering and the MovieLens dataset

This movie dataset contains 100,836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018.  This dataset was generated on September 26, 20.

Obtained from: http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic informationis included.  Each user is represented by an id, and no other information is provided. The data are contained in the  files links.csv, movies.csv, ratings.csv and tags.csv

In [2]:
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error

In [3]:
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('data/ratings.csv')

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_csv('data/movies.csv')

tnames = ['user_id', 'movie_id', 'tags', 'timestamp']
tags = pd.read_csv('data/tags.csv')

In [4]:
ratings[:5]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
movies[:5]

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
tags[:5]

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


Merging ratings and movies tables

In [7]:
movie_ratings = pd.merge(ratings,movies)
movie_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


Create unique integer encoding for each user and each movie

In [8]:

user_ids = movie_ratings["userId"].unique().tolist()
# use this dictionary to get unique integer encoding from original userid
user_encoding = {x: i for i, x in enumerate(user_ids)}
# use this dictionary to get original userid from integer encoding value
user_id_lookup = {i: x for i, x in enumerate(user_ids)}

# Same process for movies
movie_ids = movie_ratings["movieId"].unique().tolist()
# use this dict to get unique integer encoding from original movie id
movie_encoding = {x: i for i, x in enumerate(movie_ids)}
# use this dict to get original movieid from integer encoding value
movie_id_lookup = {i: x for i, x in enumerate(movie_ids)}

# Add new columns for unique integer encodings
movie_ratings["user_encoding"] = movie_ratings["userId"].map(user_encoding)
movie_ratings["movie_encoding"] = movie_ratings["movieId"].map(movie_encoding)


movie_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,user_encoding,movie_encoding
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,0
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,0
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,2,0
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3,0
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4,0


Converting table to a ratings matrix using user_encoding as unique rows and movie_encoding as unique columns. We do this so we can use the index of the np array to represent a particular user-item pair. ie. ratings_matrix[2,4] represents user 2's movie rating for movie 4. This would not work if we used the default userid and movieid since there are missing ids in this dataset. 

In [9]:
user_ratings = movie_ratings.pivot_table('rating', index='user_encoding', columns='movie_encoding')

user_ratings = user_ratings.fillna(0)

ratings_matrix = user_ratings.to_numpy()
ratings_matrix

array([[4. , 4. , 4. , ..., 0. , 0. , 0. ],
       [4. , 0. , 0. , ..., 0. , 0. , 0. ],
       [4.5, 0. , 0. , ..., 0. , 0. , 0. ],
       ...,
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ]])

Now that we have the proper format for our data matrix, we will split the data into training and testing sets. We will take 15 ratings from each user and use it for the test set

In [10]:
def train_test_split(ratings):
    # initialize test array with all 0s. Will be replaced by randomly selected ratings
    train = ratings.copy()
    test = np.zeros(ratings.shape)

    # go through each user
    for i in range(len(ratings_matrix)):
        
        # Take 15 random ratings that are not 0 and place them in the test set
        test_values = np.random.choice(ratings[i].nonzero()[0], size=15)
        
        # remove 15 ratings from training set
        train[i, test_values] = 0.
        
        # Add 15 rating to test set
        test[i, test_values] = ratings[i, test_values]
        
    
    return train, test

train, test = train_test_split(ratings_matrix)


In [12]:
train[0]

array([4., 4., 4., ..., 0., 0., 0.])

Next we will calculate similarity between users using cosine similary measure

In [47]:
user_sim = cosine_similarity(train, dense_output=False)
print(user_sim)

[[1.         0.10026552 0.13083992 ... 0.         0.         0.        ]
 [0.10026552 1.         0.13148037 ... 0.0548392  0.         0.        ]
 [0.13083992 0.13148037 1.         ... 0.10694364 0.03097405 0.00771644]
 ...
 [0.         0.0548392  0.10694364 ... 1.         0.         0.        ]
 [0.         0.         0.03097405 ... 0.         1.         0.        ]
 [0.         0.         0.00771644 ... 0.         0.         1.        ]]


Use the similarities to make predictions on unknown ratings

In [51]:
def make_ratings_predictions(ratings, user_similarity):
    # set default array with all 0s. Will be replaced by predicted ratings
    preds = np.zeros(ratings.shape)
    
    # number of users
    n = len(user_similarity)
    # number of ratings 
    m = len(ratings[0])
    
    for i in range(n):
        for j in range(m):
            # For each prediction we take the weighted sum of ratings using the similarity between users as the weights.
            # Using numpy dot method we can multiply the entire row in the user similarity matrix by the entire column in
            # the ratings matrix and add all those values up. Then we divide by the sum of the same row in the
            # user similarities matrix (ie. sum of similarity weights for respective user).
            weighted_sum = user_similarity[i].dot(ratings[:, j]) / np.sum(user_similarity[i])
            preds[i,j] = weighted_sum
    return preds

predictions_matrix = make_ratings_predictions(train, user_sim)

# Compare all nonzero values in prediction matrix with all nonzero values in test set
predictions = predictions_matrix[test.nonzero()]
actual_ratings = test[test.nonzero()]
mse = mean_squared_error(predictions, actual_ratings)

print("Mean Squared Error for user-based collaborative filtering: {:.3f}".format(mse))

Mean Squared Error for user-based collaborative filtering: 9.943


Currently, we are taking into account similarities between all users despite most being not very similar at all. We can try using only the top k most similar users to see if we can get better results.

In [99]:
# Same method but only using top k similar users
def make_ratings_predictions_top_k(ratings, user_similarity, k):
    preds = np.zeros(ratings.shape)
    
    # number of users
    n = len(user_similarity)
    # number of ratings 
    m = len(ratings[0])
    
    inverse_k = n-k+1
    
    for i in range(n):
        
        # Get k most similar users to user i from user_similarity matrix
        k_users = np.argsort(user_similarity[:,i])[:inverse_k:-1]
        
        for j in range(m):
            
            # index only k most similar users from similarit matrix
            weighted_sum = user_similarity[i][k_users].dot(ratings[:, j][k_users]) / np.sum(user_similarity[i][k_users])
            preds[i,j] = weighted_sum
    return preds


k = 30
predictions_matrix = make_ratings_predictions_top_k(train, user_sim, k)

# Compare all nonzero values in prediction matrix with all nonzero values in test set
predictions = predictions_matrix[test.nonzero()]
actual_ratings = test[test.nonzero()]
mse = mean_squared_error(predictions, actual_ratings)

print("Mean Squared Error for user-based collaborative filtering using top {} most similar users: {:.3f}".format(k, mse))

Mean Squared Error for user-based collaborative filtering using top 30 most similar users: 8.320


A function to request a specific rating on a movie, given the movie ID and user ID.

In [67]:
def get_movie_prediction(userid, movieid, predictions):
    # Use user lookup dict we made before
    u_encoded = user_id_lookup[userid]
    
    # Use movie lookup dict we made before
    m_encoded = movie_id_lookup[movieid]
    
    # Grab prediction from np array    
    return predictions[u_encoded, m_encoded]

mid = 54
uid = 100
p = get_movie_prediction(uid, mid, predictions_matrix)

print("Predicted rating for user {} on movie {} is: {}".format(uid,mid,p))

Predicted rating for user 100 on movie 54 is: 2.11220263598941
