# Final Project - Personalization Theory

Authors: *Bertrand Thia-Thiong-Fat, Jeremy Yao, Paul Doan*

In this notebook, we build a user-based collaborative filtering model to predict the last rating of each user in our dataset. It will serve as one of the baselines of our study, alongside the classic baseline model that always returns the global average rating of the dataset.

## Metrics used

Throughout our study, since we focus on the accuracy of the predicted last rating of each user, we work with the *Mean Absolute Error* (MAE) and *Root Mean Square Error* (RMSE) metrics. Another metric we need to take into account is the *running time* of our algorithms. Indeed, Yelp has a lot of data: our whole dataset contains $6M$ rows, for example. The training time of a model is thus critical in such settings: we need a fast and efficient algorithm that can process real-time data.
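
As a small illustration of the two accuracy metrics, here is a minimal sketch on made-up numbers (the arrays `y_true` and `y_pred` are hypothetical and not taken from our dataset). It shows that a single large error weighs more on the RMSE than on the MAE:

```python
import numpy as np

# Hypothetical true ratings and predictions, for illustration only
y_true = np.array([5.0, 4.0, 3.0, 1.0])
y_pred = np.array([4.0, 4.0, 4.0, 4.0])

errors = y_true - y_pred  # [1, 0, -1, -3]

# MAE: average magnitude of the errors
mae = np.mean(np.abs(errors))

# RMSE: square root of the average squared error
rmse = np.sqrt(np.mean(errors ** 2))

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
```

Because the errors are squared before averaging, the RMSE is always at least as large as the MAE, and the gap widens when a few predictions are far off.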

# Loading the data 

In [1]:
# Importing the libraries
import heapq

import numpy as np
import pandas as pd
import sklearn.model_selection
from scipy import sparse
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

%matplotlib inline

In [2]:
ratings = pd.read_csv('dataset.csv').drop(columns='Unnamed: 0')
ratings.head()

,user_id,business_id,rating,date
0,keBv05MsMFBd0Hu98vXThQ,JDZ6_yycNQFTpUZzLIKHUg,5.0,2018-11-14 18:05:34
1,hZ_ElhGO3sQDVvM8ZrQetA,zfyGTLKOZuVY8aRoInyx9Q,4.0,2018-11-14 17:59:07
2,y5zqSJE-rFihsKmUZRbbRg,evdJO0v9rvVixieNEnaeJg,5.0,2018-11-14 17:57:01
3,ozUsNrw9QlEtz_JqN5PlMw,u1fa8SE-Rzea_xWbk_B-Zw,3.0,2018-11-14 17:49:52
4,sHY6JcgWOHLP4vR836Esmw,urSuLlkYXXI5uwtKIxl9ew,5.0,2018-11-14 17:25:32


In [3]:
print('There are {} active users. \nThey add up to a total of {} unique ratings. \nThere are {} different businesses'.format(ratings["user_id"].nunique(), ratings.shape[0], ratings["business_id"].nunique()))

There are 30750 active users. 
They add up to a total of 317153 unique ratings. 
There are 4996 different businesses


Here is the average global rating of our dataset: 

In [4]:
# Mean of the per-user mean ratings (each user carries the same weight)
ratings_mean = ratings.groupby('user_id', as_index=False)['rating'].mean()
average_global_rating = round(ratings_mean['rating'].mean(), 2)
print('Global mean rating:', average_global_rating)

Global mean rating: 3.8


# Baseline: Collaborative Filtering

As a first step, we are going to implement a user-based collaborative filtering model. It will be our baseline for the rest of our study.

## Implementation

Let us split our dataset into a training set and a testing set. The testing set contains the last rating of each user, which we want to predict, while the training set contains the rest of our data:

In [5]:
# Creating testing and training sets
ratings = ratings.sort_values('date', ascending=False)
test = ratings.drop_duplicates(subset='user_id', keep='first')
train = ratings[~(ratings.index).isin(test.index)]
print('train size: {} \ntest size: {}'.format(len(train), len(test)))

train size: 286403 
test size: 30750


Splitting this way avoids the bias that arises when all the reviews of one user fall into either the training set or the testing set. For example, if all the reviews of user A are put into the training set, there is no test data for this user at validation time, so the user contributes no error to the test RMSE. On the other hand, if all the reviews are put into the testing set, the model has seen no review from this user during training, which inflates the RMSE for this user. In our case, we know that each user has at least 5 reviews. Hence, holding out each user's last rating ensures that every user appears in both the training and the testing sets. It also gives us a train/test split of roughly $90$%/$10$%.
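
The same split can be checked on a tiny example. This is only a sketch: the `toy` DataFrame is hypothetical, although its column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy ratings, for illustration only
toy = pd.DataFrame({
    'user_id':     ['a', 'a', 'a', 'b', 'b'],
    'business_id': ['x', 'y', 'z', 'x', 'z'],
    'rating':      [5.0, 3.0, 4.0, 2.0, 4.0],
    'date': pd.to_datetime(['2018-01-01', '2018-03-01', '2018-02-01',
                            '2018-05-01', '2018-04-01']),
})

# Most recent rating first, then keep the first row of each user as test data
toy = toy.sort_values('date', ascending=False)
toy_test = toy.drop_duplicates(subset='user_id', keep='first').copy()
toy_train = toy.drop(toy_test.index)

# Every user of the test set also has ratings in the training set
assert set(toy_test['user_id']) <= set(toy_train['user_id'])

print(toy_test[['user_id', 'business_id', 'date']])
print('train size:', len(toy_train), '- test size:', len(toy_test))
```

The final assertion only holds when every user has at least two ratings, which is why the minimum of 5 reviews per user matters here.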

In [6]:
# Creating the adjusted rating column in order to apply adjusted cosine similarity
# (selecting 'rating' before .mean() keeps the non-numeric columns out of the average)
train_ratings_mean = (train.groupby('user_id', as_index=False)['rating'].mean()
                           .rename(columns={'rating': 'rating_mean'}))
train = pd.merge(train, train_ratings_mean, on='user_id', how='left', sort=False)
train['adjusted_rating'] = train['rating'] - train['rating_mean']
train.head()

,user_id,business_id,rating,date,rating_mean,adjusted_rating
0,sHY6JcgWOHLP4vR836Esmw,seYvTmOZGJ2IAMdfQa8pkg,2.0,2018-11-14 17:17:17,4.258065,-2.258065
1,KuZ3T71dIxKSNNQ8Wh1eOg,OD61KMut0q5t2XBMZazFeA,4.0,2018-11-14 03:32:03,3.714286,0.285714
2,mgwX8cv9h59ukLarvLbxQA,ByFMv3p5X1aNeZhU61rDcA,5.0,2018-11-14 03:03:47,4.0,1.0
3,mgwX8cv9h59ukLarvLbxQA,ITowfgmzdwsxdX1lKFYpEA,3.0,2018-11-14 03:00:14,4.0,-1.0
4,mgwX8cv9h59ukLarvLbxQA,RwMLuOkImBIqqYj4SSKSPg,4.0,2018-11-14 02:55:10,4.0,0.0


In [7]:
# Creating the rating matrices
rating_matrix = train.pivot_table(values='rating', index='user_id', columns='business_id') 
adjusted_rating_matrix = train.pivot_table(values='adjusted_rating', index='user_id', columns='business_id').fillna(0)
adjusted_rating_matrix.head()

business_id,--7zmmkVg-IMGaXbuVd0SQ,--I7YYLada0tSLkORTHb5Q,-01XupAWZEXbdNbxNg5mEg,-0t6o9LXMVQ3aV8CHhRTnA,-0tgMGl7D9B10YjSN2ujLA,-1m9o3vGRA8IBPNvNqKLmA,-2C96suwzrE_cqI1U69cLA,-2QGc6Lb0R027lz0DpWN1A,-2ToCaDFpTNmmg3QFzxcWg,-2qfXy3hcWG0GEcE8QUKQw,...,zvQIEpJUmLLmMMffNntHXQ,zw4Legbcu018p5WcZ74iWA,zwNC-Ow4eIMan2__bS9-rg,zwmps5SXn30g-f5wqg_r9A,zy6Hzao6lTmuJMZo4BemIg,zyPGYeXF4XKCqNN1pjFWhg,zyrF7je6qVbaJMPfMG75xg,zzSYBWuv_fXGtSgsO-6_1g,zzsOLFhgUw8gnjLTVVItFA,zzwaS0xn1MVEPEf0hNLjew
---1lKK3aKOuomHnwAkAow,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--2HUmLkcNHZp0xw6AMBPg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--2vR0DIsmQ6WfcSzKWigw,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--BumyUHiO_7YsHurb9Hkw,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--CIuK7sUpaNzalLAlHJKA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


As we saw in the data preprocessing part, our dataset is very sparse. The rating matrix above shows it again: almost all of the displayed entries are zero-filled missing ratings.
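
To put a number on sparsity, one can compute the density of the rating matrix, i.e. the share of user-business cells that actually hold a rating. The sketch below does this on a hypothetical toy matrix (`toy` and `toy_matrix` are illustrative names, not objects from this notebook):

```python
import pandas as pd

# Hypothetical toy ratings, for illustration only
toy = pd.DataFrame({
    'user_id':     ['a', 'a', 'b', 'c'],
    'business_id': ['x', 'y', 'x', 'z'],
    'rating':      [5.0, 3.0, 4.0, 2.0],
})

# Same pivot as in the notebook: users as rows, businesses as columns
toy_matrix = toy.pivot_table(values='rating', index='user_id', columns='business_id')

# Density = observed cells / total cells
n_observed = toy_matrix.notna().sum().sum()
density = n_observed / toy_matrix.size

print(f"{n_observed} observed ratings out of {toy_matrix.size} cells")
print(f"density: {density:.1%}")
```

With the counts printed earlier (317153 ratings, 30750 users, 4996 businesses), the same ratio comes out at about 0.2% for the full dataset, ignoring repeated ratings of the same business by the same user.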

In [8]:
# Computing the adjusted cosine similarity matrix
adjusted_rating_matrix_sparse = sparse.csr_matrix(adjusted_rating_matrix)
similarities = cosine_similarity(adjusted_rating_matrix_sparse)
similarities = pd.DataFrame(similarities,
                            index=adjusted_rating_matrix.index,
                            columns=adjusted_rating_matrix.index)
similarities.head()

user_id,---1lKK3aKOuomHnwAkAow,--2HUmLkcNHZp0xw6AMBPg,--2vR0DIsmQ6WfcSzKWigw,--BumyUHiO_7YsHurb9Hkw,--CIuK7sUpaNzalLAlHJKA,--HCoE1ghaAlcaAfshICgw,--Nnm_506G_p8MxAOQna5w,--Qh8yKWAvIP4V4K8ZPfHA,--RlSfc-QmcHFGHyX6aVjA,--YhjyV-ce1nFLYxP49C5A,...,zyFN0KU7Y9MfTZIROrsKhA,zyMB50UtmEuQf-1VUKxZxQ,zyebSPCZLUZHapi-dSHU5Q,zyg4-MFtfPWmwucVazSjfw,zz9rIM0FmlWYvAzr6dGwVQ,zzHSF3oCjXqgsYuVJPgdvw,zzMJLd3cWBM6HT8eMbOGbA,zzPvEodjvLPe-5NvV0U_cg,zzjQuRnY_Kgr_zaw6A-ACQ,zzo--VpSQh8PpsGVeMC1dQ
---1lKK3aKOuomHnwAkAow,1.0,0.0,0.0,0.0,0.0,0.065945,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--2HUmLkcNHZp0xw6AMBPg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--2vR0DIsmQ6WfcSzKWigw,0.0,0.0,1.0,0.0,0.04035,0.0,0.0,0.0,0.0,0.0,...,-0.027524,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--BumyUHiO_7YsHurb9Hkw,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-0.045966,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--CIuK7sUpaNzalLAlHJKA,0.0,0.0,0.04035,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
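
The similarity computation above can be reproduced on a small hand-checkable example. This is a sketch under assumptions: the matrix `R` is hypothetical, and the centering is done with NumPy rather than with the pandas merge used in this notebook.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: rows = users, columns = businesses, NaN = not rated
R = np.array([
    [5.0, 3.0, np.nan],   # user 0
    [4.0, 2.0, np.nan],   # user 1: same taste as user 0, but stricter
    [1.0, np.nan, 5.0],   # user 2: disagrees with user 0 on the first business
    [3.0, 3.0, 3.0],      # user 3: gives the same rating everywhere
])

# Center each user's ratings on that user's own mean, then fill the gaps with 0
user_means = np.nanmean(R, axis=1, keepdims=True)
centered = np.nan_to_num(R - user_means)

# Cosine similarity between the centered user vectors
sim = cosine_similarity(sparse.csr_matrix(centered))
print(np.round(sim, 3))
```

Users 0 and 1 end up with a similarity of 1 even though their raw ratings differ, which is the point of centering. User 3 has an all-zero centered vector, so every similarity in that row is 0, including the diagonal; this is consistent with the `0.0` self-similarity visible for one user in the output above.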


In [9]:
# User-based CF function: predicts the rating for a given user - business pair
def user_based_collab(user_id, business_id, k=10):

    # If the user or the business is not in the training set, we have no
    # similarities or ratings to rely on: return the average training rating.
    if user_id not in rating_matrix.index or business_id not in rating_matrix.columns:
        return train['rating'].mean()

    # Get the similarities between the user and the other users
    user_similarities = similarities[user_id]

    # Get all the other users' ratings for this business
    other_users_ratings = rating_matrix[business_id]

    # Remove the NaN from the users' ratings and from the similarity vector
    nan_index = other_users_ratings[other_users_ratings.isnull()].index
    other_users_ratings = other_users_ratings.dropna()
    user_similarities = user_similarities.drop(nan_index)

    # Take the k nearest neighbors (neighbors tied with the k-th value are all kept)
    k_index = user_similarities[user_similarities.isin(heapq.nlargest(k, user_similarities))].index
    other_users_ratings = other_users_ratings[k_index]
    user_similarities = user_similarities[k_index]

    # Compute the predicted rating: similarity-weighted average of the neighbors' ratings
    s = user_similarities.abs().sum()
    if s > 0:
        return np.dot(user_similarities, other_users_ratings) / s

    # All neighbors have zero similarity: return the average training rating
    return train['rating'].mean()
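
The prediction step of the function above is a similarity-weighted average of the neighbors' raw ratings, normalized by the sum of the absolute similarities. A minimal worked example, with hypothetical similarities and ratings chosen for illustration:

```python
import numpy as np

# Hypothetical neighbors who rated the target business
neighbor_sims = np.array([0.8, 0.4, -0.2])
neighbor_ratings = np.array([5.0, 3.0, 1.0])

# Same formula as in the function: dot product over the sum of |similarity|
weights_total = np.abs(neighbor_sims).sum()
prediction = np.dot(neighbor_sims, neighbor_ratings) / weights_total
print(f"prediction: {prediction:.3f}")

# Edge case: when the only neighbor has a negative similarity,
# the formula returns a value outside the 1-5 rating scale.
edge_sims = np.array([-0.5])
edge_ratings = np.array([5.0])
edge_prediction = np.dot(edge_sims, edge_ratings) / np.abs(edge_sims).sum()
print(f"edge case:  {edge_prediction:.3f}")
```

Because the formula weights raw ratings by signed similarities, negatively correlated neighbors can push a prediction below the rating scale. This may be one contributor to the errors measured below, although we did not isolate its effect.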

## Training and testing

In [10]:
# Computing predictions for the last rating of each user
predictions_user_based = []
for row in tqdm(test.index):
    user_id = test.loc[row, 'user_id']
    business_id = test.loc[row, 'business_id']
    prediction = user_based_collab(user_id, business_id)
    predictions_user_based.append(prediction)

# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
test = test.assign(prediction=predictions_user_based)

100%|██████████| 30750/30750 [08:21<00:00, 61.29it/s]


In [11]:
test

,user_id,business_id,rating,date,prediction
0,keBv05MsMFBd0Hu98vXThQ,JDZ6_yycNQFTpUZzLIKHUg,5.0,2018-11-14 18:05:34,4.214417
1,hZ_ElhGO3sQDVvM8ZrQetA,zfyGTLKOZuVY8aRoInyx9Q,4.0,2018-11-14 17:59:07,3.642952
2,y5zqSJE-rFihsKmUZRbbRg,evdJO0v9rvVixieNEnaeJg,5.0,2018-11-14 17:57:01,4.328149
3,ozUsNrw9QlEtz_JqN5PlMw,u1fa8SE-Rzea_xWbk_B-Zw,3.0,2018-11-14 17:49:52,3.790886
4,sHY6JcgWOHLP4vR836Esmw,urSuLlkYXXI5uwtKIxl9ew,5.0,2018-11-14 17:25:32,3.614968
...,...,...,...,...,...
317036,iw_6JAAw9FAIv2YkbLcMFA,zkhBU5qW_zCy0q4OEtIrsA,3.0,2006-02-06 01:23:27,3.704044
317070,06xIElWq6CAvmtk1xHJWNA,VxCnyVYn-FFgv6F1EqbdKA,4.0,2006-01-10 08:50:38,3.543193
317103,OXIJVK12GNTdHlmKJR7xbg,SvHAczHxIJ00QlJ9ovQR8g,5.0,2005-12-25 01:28:45,3.507845
317108,VnlUnPrfae-fbqDoXuvDbw,MjAuGkgPjNLyMrN74_0YbA,4.0,2005-12-15 03:43:44,4.005412


In [12]:
# Accuracy
print('RMSE:', np.sqrt(mean_squared_error(test['rating'], test['prediction'])))
print('MAE:', mean_absolute_error(test['rating'], test['prediction']))

RMSE: 1.5331582321487554
MAE: 1.1067444610701216


## Another baseline

Another common baseline always predicts the global average rating:

In [13]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
test = test.assign(average_global_rating=average_global_rating)
print('RMSE:', np.sqrt(mean_squared_error(test['rating'], test['average_global_rating'])))
print('MAE:', mean_absolute_error(test['rating'], test['average_global_rating']))

RMSE: 1.296948262941883
MAE: 1.0714861788617884


# Conclusion

We can observe that the RMSE obtained with our user-based collaborative filtering model is high (about $1.53$), and the MAE (about $1.11$) confirms that the model does not predict the last rating of each user accurately. The other baseline, which always returns the global average rating of the dataset, actually gives better results (RMSE of about $1.30$, MAE of about $1.07$). Generating the $30750$ predictions also took more than eight minutes. It seems that the user-based collaborative filtering model we implemented is not well suited to this task.

We will see how other, more sophisticated models perform and how they compare with these baselines.