# Final Project - Personalization Theory

Authors: *Bertrand Thia-Thiong-Fat, Jeremy Yao, Paul Doan*

In this notebook, we will implement a content-based model to predict the last ratings of the users in our dataset. 

## Loading the data 

In [1]:
# Importing the libraries

import pandas as pd
from tqdm import tqdm
import json
import numpy as np
import sklearn.model_selection 
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
import heapq

In [2]:
ratings = pd.read_csv('dataset.csv').drop(columns='Unnamed: 0')
ratings.head()

Unnamed: 0,user_id,business_id,rating,date
0,keBv05MsMFBd0Hu98vXThQ,JDZ6_yycNQFTpUZzLIKHUg,5.0,2018-11-14 18:05:34
1,hZ_ElhGO3sQDVvM8ZrQetA,zfyGTLKOZuVY8aRoInyx9Q,4.0,2018-11-14 17:59:07
2,y5zqSJE-rFihsKmUZRbbRg,evdJO0v9rvVixieNEnaeJg,5.0,2018-11-14 17:57:01
3,ozUsNrw9QlEtz_JqN5PlMw,u1fa8SE-Rzea_xWbk_B-Zw,3.0,2018-11-14 17:49:52
4,sHY6JcgWOHLP4vR836Esmw,urSuLlkYXXI5uwtKIxl9ew,5.0,2018-11-14 17:25:32


In [3]:
print('There are {} active users. \nThey add up to a total of {} unique ratings. \nThere are {} different businesses'.format(ratings["user_id"].nunique(), ratings.shape[0], ratings["business_id"].nunique()))

There are 30750 active users. 
They add up to a total of 317153 unique ratings. 
There are 4996 different businesses


# First Model: Content Based

The motivation behind this model is to understand the different sources of information available and to weight the relevant features in our model. We can then leverage this information to make predictions. In this study, we will use features describing the businesses in our dataset.

## 1. Implementation

Let's take a look at the attributes of our businesses:

In [4]:
# Importing the business data
from tqdm import tqdm

business = pd.read_json('yelp_dataset/business.json', lines=True)
business = business[business['business_id'].isin(ratings['business_id'])]
business.set_index('business_id', inplace=True)
attributes = business.columns
business.head()

business_id,address,attributes,categories,city,hours,is_open,latitude,longitude,name,postal_code,review_count,stars,state
jScBTQtdAt-8RshaiBEHgw,"1770 W Horizon Ridge, Ste 100","{'DriveThru': 'False', 'RestaurantsAttire': ''...","Ethnic Food, American (New), Burgers, Food, Re...",Henderson,"{'Monday': '0:0-0:0', 'Tuesday': '9:0-15:0', '...",1,36.010745,-115.064803,Served,89012,664,4.5,NV
6fPQJq4f_yiq1NHn0fd11Q,3655 Las Vegas Blvd S,"{'RestaurantsTakeOut': 'True', 'RestaurantsDel...","French, Restaurants, Creperies",Las Vegas,"{'Monday': '7:0-23:0', 'Tuesday': '7:0-23:0', ...",1,36.112527,-115.171351,La Creperie,89109,535,3.5,NV
k-dDZvTeLysoJvjHI-qr9g,2411 W Sahara Ave,"{'RestaurantsDelivery': 'False', 'RestaurantsT...","Buffets, Restaurants",Las Vegas,"{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W...",1,36.142116,-115.174252,Feast Buffet,89102,287,3.0,NV
MhnihE0alud0ereVInSt8Q,"2765 N Scottsdale Rd, Ste 105","{'OutdoorSeating': 'False', 'RestaurantsGoodFo...","Chinese, Restaurants",Scottsdale,"{'Monday': '11:0-21:30', 'Tuesday': '11:0-21:3...",1,33.478754,-111.925484,Yummy Yummy Chinese Restaurant,85257,188,3.0,AZ
i6hWP3si97eKQl_JyK8L3w,145 Richmond Street W,"{'RestaurantsPriceRange2': '3', 'WiFi': ''free...","Hotels, Event Planning & Services, Hotels & Tr...",Toronto,"{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W...",1,43.649612,-79.385447,Hilton Toronto,M5H 2L2,131,3.5,ON


Our assumption is that studying the features of the different businesses helps us understand user preferences, which in turn drives insights and recommendations.

In [5]:
# Splitting for each single unique category
unique_category = []

for l in business['categories'].unique():
    for cat in l.split(', '):
        if cat not in unique_category:
            unique_category.append(cat)

In [6]:
# Creating category multidimensional space
for category in tqdm(unique_category):
    business[category] = 0
    for bus in business.index:
        if category in business.loc[bus, 'categories']:
            business.loc[bus, category] = 1

100%|██████████| 758/758 [00:48<00:00, 15.58it/s]
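
The loop above tests membership with `in` on the raw category string, which is a substring match: a label such as `'Food'` is also found inside `'Fast Food'`. A minimal sketch of an exact-match encoding, assuming labels are separated by `', '` as in the split used earlier; the toy frame `demo` and its business ids are hypothetical:

```python
import pandas as pd

# Hypothetical toy data mirroring the format of the 'categories' column.
demo = pd.DataFrame(
    {"categories": ["Fast Food, Burgers", "Food, Bakeries", "Sushi Bars"]},
    index=["b1", "b2", "b3"],
)

# str.get_dummies splits each string on the separator and creates
# one 0/1 column per exact label.
one_hot = demo["categories"].str.get_dummies(sep=", ")

# Substring matching, as in the loop above, for comparison.
substring_food = demo["categories"].apply(lambda s: int("Food" in s))

print(one_hot["Food"].tolist())   # exact match on the label 'Food'
print(substring_food.tolist())    # substring match on 'Food'
```

With exact matching, `b1` (tagged `'Fast Food'`) is not counted under `'Food'`, whereas the substring test counts it. This vectorized call also avoids the nested Python loop.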


In [7]:
# Creating the category matrix
categories = business.drop(attributes, axis=1)
business = business.drop(categories.columns, axis=1)
categories.head()

business_id,Ethnic Food,American (New),Burgers,Food,Restaurants,Asian Fusion,Specialty Food,Mexican,Sandwiches,Breakfast & Brunch,...,Pharmacy,Funeral Services & Cemeteries,Popcorn Shops,Maternity Wear,Hotel bar,Venezuelan,Gold Buyers,Cigar Bars,Junk Removal & Hauling,Climbing
jScBTQtdAt-8RshaiBEHgw,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
6fPQJq4f_yiq1NHn0fd11Q,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
k-dDZvTeLysoJvjHI-qr9g,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
MhnihE0alud0ereVInSt8Q,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
i6hWP3si97eKQl_JyK8L3w,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
print('Number of categories:', categories.shape[1])

Number of categories: 758


We have increased the dimensionality of our dataset. To reduce it, we could cluster the categories into bigger buckets and change the granularity of the features. However, this proves to be intractable and difficult to implement. We will see how the model behaves under this curse of dimensionality.
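
Clustering categories is hard to automate, but a simpler reduction is to drop categories that occur in very few businesses. The sketch below illustrates that frequency filter; it is not the clustering described above and is not used in this notebook. The function name `drop_rare_categories` and the threshold `min_support` are hypothetical.

```python
import pandas as pd

def drop_rare_categories(one_hot: pd.DataFrame, min_support: int = 2) -> pd.DataFrame:
    """Keep only the category columns set to 1 for at least `min_support` businesses."""
    support = one_hot.sum(axis=0)  # number of businesses per category
    return one_hot.loc[:, support >= min_support]

# Hypothetical toy one-hot matrix: rows are businesses, columns are categories.
toy = pd.DataFrame(
    {"Restaurants": [1, 1, 1], "Buffets": [0, 1, 0], "Chinese": [0, 0, 1], "French": [1, 0, 0]},
    index=["b1", "b2", "b3"],
)

reduced = drop_rare_categories(toy, min_support=2)
print(reduced.columns.tolist())
```

A suitable threshold would have to be chosen by validation, since rare categories can still be informative for some users.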

In [9]:
# building similarity matrix for all business
similarities = cosine_similarity(categories)
similarities = pd.DataFrame(similarities, 
                                 index=categories.index, 
                                 columns=categories.index)
similarities.head()

business_id,jScBTQtdAt-8RshaiBEHgw,6fPQJq4f_yiq1NHn0fd11Q,k-dDZvTeLysoJvjHI-qr9g,MhnihE0alud0ereVInSt8Q,i6hWP3si97eKQl_JyK8L3w,-ucQnELMVRIUOi3-Kv5r0Q,mofOjB6flg-eAWOFbOkHfQ,QAqm1ubKgPYYqjZKxfi87A,NX1281ugzs2navHAX5X9cQ,2tE_n3ws4Vn1byejSZZIsQ,...,wHn5jKZc3lt_Cu8uoBblgw,QkDhw-fQi_IijlqSnV3eIg,JKmhHlmboFEsSSxPcYwxww,P_1ojkLpCsM8cpuiKlZnAg,Lhl72icGvaW2rFClTy-hog,6Sd4KBcAwWKrpUEv4M_oIg,HqvNxjGpLjfv9KgF-0OqPg,P8uECqGqXWTwEndkh-6bQw,2JsLzYF8rUalwpm5LDEcog,shIPnFoXrL3dFo5HLH1_HA
jScBTQtdAt-8RshaiBEHgw,1.0,0.174078,0.213201,0.213201,0.0,0.0,0.13484,0.0,0.26968,0.0,...,0.0,0.0,0.0,0.213201,0.13484,0.0,0.0,0.26968,0.174078,0.0
6fPQJq4f_yiq1NHn0fd11Q,0.174078,1.0,0.408248,0.408248,0.0,0.0,0.258199,0.0,0.258199,0.0,...,0.0,0.0,0.0,0.0,0.258199,0.0,0.0,0.258199,0.333333,0.0
k-dDZvTeLysoJvjHI-qr9g,0.213201,0.408248,1.0,0.5,0.0,0.0,0.632456,0.0,0.316228,0.0,...,0.0,0.0,0.0,0.0,0.316228,0.0,0.0,0.316228,0.408248,0.0
MhnihE0alud0ereVInSt8Q,0.213201,0.408248,0.5,1.0,0.0,0.0,0.632456,0.0,0.316228,0.0,...,0.0,0.0,0.0,0.0,0.316228,0.0,0.0,0.316228,0.408248,0.0
i6hWP3si97eKQl_JyK8L3w,0.0,0.0,0.0,0.0,1.0,0.471405,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
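
To make these numbers concrete, here is a small worked example that recomputes two entries of the table by hand. It assumes the category lists shown earlier for La Creperie (French, Restaurants, Creperies), Feast Buffet (Buffets, Restaurants) and Yummy Yummy Chinese Restaurant (Chinese, Restaurants) are complete. The helper `cosine` and the reduced five-column space are illustrative only.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors: u.v / (|u| |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Columns: Restaurants, French, Creperies, Buffets, Chinese
la_creperie   = [1, 1, 1, 0, 0]
feast_buffet  = [1, 0, 0, 1, 0]
yummy_chinese = [1, 0, 0, 0, 1]

# One shared category out of 3 and 2: 1 / sqrt(3 * 2)
sim_creperie_buffet = cosine(la_creperie, feast_buffet)
# One shared category out of 2 and 2: 1 / sqrt(2 * 2)
sim_buffet_chinese = cosine(feast_buffet, yummy_chinese)

print(round(sim_creperie_buffet, 6), round(sim_buffet_chinese, 6))
```

For binary vectors, the similarity is the number of shared categories divided by the geometric mean of the two category counts, so businesses that share only a generic label such as Restaurants still receive a sizeable score.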


In [10]:
# Creating category based prediction function
def category_based(user_id, business_id, k=5):

    # Build similarity scores between last business and all the businesses the user already rated
    # (iloc[1:] skips the user's first row, i.e. the rating being predicted;
    # this assumes ratings is sorted most recent first)
    other_business_rated = ratings[ratings['user_id'] == user_id].iloc[1:]['business_id']
    business_sim = pd.Series(index=other_business_rated.index, dtype=float)
    for i in other_business_rated.index:
        business_sim.loc[i] = similarities.loc[business_id, other_business_rated[i]]

    # Get all the other businesses ratings for the user
    other_business_ratings = ratings[ratings['user_id'] == user_id].iloc[1:]['rating']

    # Take k neighbors. Note: this sorts by rating, so it selects the user's
    # k highest-rated businesses; sorting business_sim instead would select
    # the k most similar ones.
    k_index = other_business_ratings.sort_values(ascending=False).iloc[:k].index

    # Compute the predicted rating as a similarity-weighted average
    simTotal, weightedSum = 0, 0

    for neighbor in k_index:
        simTotal += abs(business_sim[neighbor])
        weightedSum += business_sim[neighbor] * other_business_ratings[neighbor]
    if simTotal > 0:
        return weightedSum / simTotal
    else:
        # Fall back to the global ratings mean when no neighbor is similar
        return ratings['rating'].mean()
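
The prediction is a similarity-weighted average, $\hat{r} = \sum_i s_i r_i / \sum_i |s_i|$, taken over the selected neighbors. The sketch below shows this formula with neighbors chosen by similarity, which differs from the selection by rating in the function above. The name `weighted_prediction` and the `fallback` value are hypothetical, and this variant was not used to produce the results in this notebook.

```python
import pandas as pd

def weighted_prediction(sims: pd.Series, user_ratings: pd.Series,
                        k: int = 5, fallback: float = 3.0) -> float:
    """Similarity-weighted average of the ratings of the k most similar businesses."""
    top = sims.sort_values(ascending=False).iloc[:k]  # k most similar businesses
    sim_total = top.abs().sum()
    if sim_total == 0:
        return fallback  # assumption: the caller supplies e.g. the global mean
    return float((top * user_ratings.loc[top.index]).sum() / sim_total)

# Hypothetical toy inputs, both indexed by business.
sims = pd.Series({"a": 0.5, "b": 0.0, "c": 0.25})
user_ratings = pd.Series({"a": 4.0, "b": 1.0, "c": 2.0})

# Top 2 by similarity are a and c: (0.5 * 4 + 0.25 * 2) / 0.75
pred = weighted_prediction(sims, user_ratings, k=2)
print(pred)
```

Selecting by similarity keeps the prediction tied to businesses that resemble the target, whereas selecting the highest-rated businesses tends to pull predictions toward the user's top ratings.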

Let us now use our model to predict the last ratings of the users in our dataset:

In [11]:
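# keep='first' retains each user's first row; .copy() avoids SettingWithCopyWarning
# when the prediction column is added later
last_ratings = ratings.drop_duplicates(subset='user_id', keep='first').copy()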
last_ratings.head()

Unnamed: 0,user_id,business_id,rating,date
0,keBv05MsMFBd0Hu98vXThQ,JDZ6_yycNQFTpUZzLIKHUg,5.0,2018-11-14 18:05:34
1,hZ_ElhGO3sQDVvM8ZrQetA,zfyGTLKOZuVY8aRoInyx9Q,4.0,2018-11-14 17:59:07
2,y5zqSJE-rFihsKmUZRbbRg,evdJO0v9rvVixieNEnaeJg,5.0,2018-11-14 17:57:01
3,ozUsNrw9QlEtz_JqN5PlMw,u1fa8SE-Rzea_xWbk_B-Zw,3.0,2018-11-14 17:49:52
4,sHY6JcgWOHLP4vR836Esmw,urSuLlkYXXI5uwtKIxl9ew,5.0,2018-11-14 17:25:32


In [12]:
# Computing predictions for last rating of each user 
predictions_user_based =[]
for row in tqdm(last_ratings.index):        
    user_id = last_ratings.loc[row, 'user_id']
    business_id = last_ratings.loc[row, 'business_id']
    prediction = category_based(user_id, business_id)
    predictions_user_based.append(prediction)

last_ratings['prediction'] = predictions_user_based

# Running time: 18'12

100%|██████████| 30750/30750 [15:10<00:00, 33.78it/s]


In [13]:
last_ratings

Unnamed: 0,user_id,business_id,rating,date,prediction
0,keBv05MsMFBd0Hu98vXThQ,JDZ6_yycNQFTpUZzLIKHUg,5.0,2018-11-14 18:05:34,5.000000
1,hZ_ElhGO3sQDVvM8ZrQetA,zfyGTLKOZuVY8aRoInyx9Q,4.0,2018-11-14 17:59:07,5.000000
2,y5zqSJE-rFihsKmUZRbbRg,evdJO0v9rvVixieNEnaeJg,5.0,2018-11-14 17:57:01,5.000000
3,ozUsNrw9QlEtz_JqN5PlMw,u1fa8SE-Rzea_xWbk_B-Zw,3.0,2018-11-14 17:49:52,3.794692
4,sHY6JcgWOHLP4vR836Esmw,urSuLlkYXXI5uwtKIxl9ew,5.0,2018-11-14 17:25:32,5.000000
5,SVC0CajvmYfH5uAq4JnGvg,OSF8Iy5Xq-hAN4-zhDVS4w,5.0,2018-11-14 17:23:06,5.000000
7,xqUn2yqxQRq5MrthbRb-7Q,syBbYNE5-rWDMOs-MkFRQw,5.0,2018-11-14 16:54:06,5.000000
8,3iaQRn8gPiuarpYmv9dOJA,G8vA6pq4p8KslaURm-q65Q,5.0,2018-11-14 16:36:49,3.743222
9,1Ay0chgeSlxCNcT2PLSBOg,-v8Z3mdbbPs1ljLziHr2DA,5.0,2018-11-14 16:22:29,5.000000
10,khC-KFL_z77NwaQXyoDBGQ,mwE5uNVkxCXvEuVa1KQ_3g,5.0,2018-11-14 16:10:22,5.000000


In [14]:
# Accuracy
print('RMSE:', np.sqrt(mean_squared_error(last_ratings['rating'], last_ratings['prediction'])))
print('MAE:', mean_absolute_error(last_ratings['rating'], last_ratings['prediction']))

RMSE: 1.4202680669143533
MAE: 1.0387913730607607


We can observe that the **RMSE obtained is lower than that of our baseline on the same dataset**. One factor that may affect performance is the curse of dimensionality: the one-hot encoding yielded a high-dimensional dataset with 758 categories. Many of these categories could be highly correlated, which may degrade the model's performance.

# Conclusion

We can observe that the RMSE obtained is relatively lower than that of our baselines. Some predictions remain inaccurate. To improve our results, we could refine the category data by clustering categories into bigger buckets, or by removing those that carry little information or are highly correlated. Moreover, expanding the set of relevant attributes may also help.