# Yelp Reviews Rating Prediction

Bangwei Zhou <br>
James Jungsuk Lee <br>
Ujjwal Peshin <br>
Zhongling Jiang <br>

## A. Abstract

We compare the merits of n different recommendation algorithms side by side.

* Biased Baseline
* Collaborative Filtering (CF - Baseline)
* Collective Matrix Factorization (CMF)
* Content-Based Filtering (CBF)
* Field-aware Factorization Machine (FFM) 
* Deep Learning Model

* Hybrid Approach <br>

Comparing the results of the models against our baseline shows that Collective Matrix Factorization Algorithm, DeepFM Algorithm, and Wide and Deep Algorithms outperform the baseline model. The strength of deep learning approaches were showcased as we increased the size of the dataset, beating the baselines initially by .02 RMSE in 20% subsampled data but later beating them by .04 RMSE in 100% of the data. Another key advantage of the deep learning approaches were the short training time. Using 100% of the dataset, they finished training in just 800 seconds on local environment while the CMF took an equivalent amount of time for dataset subsampled down to 50% size. Additionally, the weaknesses of the baseline models were exposed when metrics that measure user experiences such as ranking, coverage, and novelty were investigated. 
While our predictive algorithms didn’t provide drastic improvements in accuracy, our top deep learning methods providing about 3% improvement versus the baseline, given the fast training time as well as its ability to achieve superior measures on metrics that affect positive user experiences such as coverage, ranking, and novelty metrics, there is a strong reason to believe that deep learning models can provide value to Yelp users in a meaningful manner. In this report, we discuss the merits of various algorithms we explored and finally propose a complete recommendation system, that employs an ensemble of data retrieval system based on user location, cold-start recommendations using the bias-based model as well as DeepFM for general recommendation.


## B. Business Case and Objective

Our goal through this exercise is to figure out the best methodology for recommending restaurants to our users. This will be largely measured by using RMSE, which measures our forecast’s error relative to the ground truth as well as ranking metrics. Additionally, we want to measure metrics that can help us assess the sanity of our recommendation and experiences of the users. We measure these using coverage and novelty metrics to ensure that there is enough diversity in our recommendations and that our model isn’t resorting to recommending the most popular restaurants as opposed to creating a truly personalized experience. 
 
There are additional considerations such as model computation time, which affects long term overhead of maintenance of the product, and model comprehension. These are important considerations for the business as inefficient solutions will add technology debt and burden to the company’s systems. Also, employing models that are well understood can help the team in getting buy-in from stakeholders. Therefore, these factors will be critically analyzed when assessing the merits of our algorithms. 


## C. Data

We use the publicly available yelp dataset which consists roughly of 7 million reviews from 1.6 million users across 190 thousand businesses in 10 metropolitan areas. We are additionally provided with business and user metadata as well as check-in information, tips and photos provided by the reviewers. 

In [2]:
# !pip install tqdm

In [1]:
import pandas as pd
import tarfile
from tqdm import tqdm, tqdm_notebook, tnrange
import json
import numpy as np
import time
from copy import deepcopy

In [14]:
# Note the decompress step takes a while and large disk space.
# Run it only once
zf = tarfile.open('yelp_dataset.tar') 
zf.extract('review.json')
zf.extract('business.json')
zf.extract('user.json')

In [2]:
line_count = len(open("review.json").readlines())
user_ids, business_ids, stars, dates, texts = [], [], [], [], []
with open("review.json") as f:
    for line in tqdm(f, total=line_count):
        blob = json.loads(line)
        user_ids += [blob["user_id"]]
        business_ids += [blob["business_id"]]
        stars += [blob["stars"]]
        dates += [blob["date"]]
        texts += [blob["text"]]
ratings_ = pd.DataFrame(
    {"user_id": user_ids, "business_id": business_ids, "rating": stars, "date": dates, "text": texts}
)
user_counts = ratings_["user_id"].value_counts()
active_users = user_counts.loc[user_counts >= 5].index.tolist()
ratings_ = ratings_.loc[ratings_.user_id.isin(active_users)]

100%|██████████| 6685900/6685900 [01:15<00:00, 88385.36it/s] 


In [4]:
ratings_.head()

Unnamed: 0,user_id,business_id,rating,date,text
0,hG7b0MtEbXx5QzbzE6C_VA,ujmEBvifdJM6h6RLv4wQIg,1.0,2013-05-07 04:34:36,Total bill for this horrible service? Over $8G...
2,n6-Gk65cPZL6Uz8qRm3NYw,WTqjgwHlXbSFevF32_DJVw,5.0,2016-11-09 20:09:03,I have to say that this office really has it t...
6,jlu4CztcSxrKx56ba1a5AQ,3fw2X5bZYeW9xCz_zGhOHg,3.0,2016-05-07 01:21:02,Tracy dessert had a big name in Hong Kong and ...
7,d6xvYpyzcfbF_AZ8vMB7QA,zvO-PJCpNk4fgAVUnExYAA,1.0,2010-10-05 19:12:35,This place has gone down hill. Clearly they h...
8,sG_h0dIzTKWa3Q6fmb4u-g,b2jN2mm9Wf3RcrZCgfo1cg,2.0,2015-01-18 14:04:18,I was really looking forward to visiting after...


In [3]:
n_users = len(ratings_.user_id.unique())
n_restaurants = len(ratings_.business_id.unique())
print('Unique Users: {0}, unique restaurants: {1}'.format(n_users, n_restaurants))

Unique Users: 286130, unique restaurants: 185723


Load in User attributes set and Restaurant attributes set 

In [8]:
# load user.json
line_count = len(open("user.json").readlines())
users, names, review_counts, since, friends, useful, \
            funny, cool, n_fans, years_elite, average_stars = [], [], [], [], [], [], [], [], [], [], []
with open("user.json") as f:
    for line in tqdm(f, total=line_count):
        blob = json.loads(line)
        users += [blob["user_id"]]
        names += [blob["name"]]
        review_counts += [blob["review_count"]]
        since += [blob["yelping_since"]]
        friends += [blob["friends"]]
        useful += [blob["useful"]]
        funny += [blob["funny"]]
        cool += [blob["cool"]]
        n_fans += [blob["fans"]]
        years_elite += [blob["elite"]]
        average_stars += [blob["average_stars"]]
        
users_ = pd.DataFrame(
    {"user_id": users, 
     "user_name": names,
     "user_review_count": review_counts,
     "user_yelp_since": since,
     "friends": friends,
     "useful_reviews": useful,
     "funny_reviews": funny,
     "cool_reviews": cool,
     "n_fans": n_fans,
     "years_elite": years_elite,
     "average_stars": average_stars
    }
)

100%|██████████| 1637138/1637138 [00:32<00:00, 49990.93it/s]


In [9]:
# load business.json
line_count = len(open("business.json").readlines())
business_ids, names, addresses, cities, states, latitudes, longitudes, stars, \
        review_counts, is_open, categories = [], [], [], [], [], [], [], [], [], [], []
with open("business.json") as f:
    for line in tqdm(f, total=line_count):
        blob = json.loads(line)
        business_ids += [blob["business_id"]]
        names += [blob["name"]]
        addresses += [blob["address"]]
        cities += [blob["city"]]
        states += [blob["state"]]
        latitudes += [blob["latitude"]]
        longitudes += [blob["longitude"]]
        stars += [blob["stars"]]
        review_counts += [blob["review_count"]]
        is_open += [blob["is_open"]]
        categories += [blob["categories"]]
        
business_ = pd.DataFrame(
    {"business_id": business_ids, 
     "business_name": names,
     "business_address": addresses,
     "business_city": cities, 
     "business_state": states, 
     "business_latitude": latitudes, 
     "business_longitude": longitudes, 
     "stars": stars, 
     "review_counts": review_counts, 
     "is_open": is_open,
     "categories": categories}
)

100%|██████████| 192609/192609 [00:03<00:00, 62578.53it/s]


In [10]:
users_ = users_.loc[users_.user_id.isin(active_users)]

In [11]:
users_.head()

Unnamed: 0,user_id,user_name,user_review_count,user_yelp_since,friends,useful_reviews,funny_reviews,cool_reviews,n_fans,years_elite,average_stars
0,l6BmjZMeQD3rDxWUbiAiow,Rashmi,95,2013-10-08 23:11:33,"c78V-rj8NQcQjOI8KP3UEA, alRMgPcngYSCJ5naFRBz5g...",84,17,25,5,201520162017,4.03
1,4XChL029mKr5hydo79Ljxg,Jenna,33,2013-02-21 22:29:06,"kEBTgDvFX754S68FllfCaA, aB2DynOxNOJK9st2ZeGTPg...",48,22,16,4,,3.63
2,bc8C_eETBWL0olvFSJJd0w,David,16,2013-10-04 00:16:10,"4N-HU_T32hLENLntsNKNBg, pSY2vwWLgWfGVAAiKQzMng...",28,8,10,0,,3.71
4,MM4RJAeH6yuaN8oZDSt0RA,Nancy,361,2013-10-23 07:02:50,"mbwrZ-RS76V1HoJ0bF_Geg, g64lOV39xSLRZO0aQQ6DeQ...",1114,279,665,39,2015201620172018,4.08
6,TEtzbpgA2BFBrC0y0sCbfw,Keane,1122,2006-02-15 18:29:35,"RJQTcJVlBsJ3_Yo0JSFQQg, GWt_h78k1CBBkE1NpThGfQ...",13311,19356,15319,696,20062007200820092010201120122013,4.39


In [12]:
business_.head()

Unnamed: 0,business_id,business_name,business_address,business_city,business_state,business_latitude,business_longitude,stars,review_counts,is_open,categories
0,1SWheh84yJXfytovILXOAQ,Arizona Biltmore Golf Club,2818 E Camino Acequia Drive,Phoenix,AZ,33.522143,-112.018481,3.0,5,0,"Golf, Active Life"
1,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,30 Eglinton Avenue W,Mississauga,ON,43.605499,-79.652289,2.5,128,1,"Specialty Food, Restaurants, Dim Sum, Imported..."
2,gnKjwL_1w79qoiV3IC_xQQ,Musashi Japanese Restaurant,"10110 Johnston Rd, Ste 15",Charlotte,NC,35.092564,-80.859132,4.0,170,1,"Sushi Bars, Restaurants, Japanese"
3,xvX2CttrVhyG2z1dFg_0xw,Farmers Insurance - Paul Lorenz,"15655 W Roosevelt St, Ste 237",Goodyear,AZ,33.455613,-112.395596,5.0,3,1,"Insurance, Financial Services"
4,HhyxOkGAM07SRYtlQ4wMFQ,Queen City Plumbing,"4209 Stuart Andrew Blvd, Ste F",Charlotte,NC,35.190012,-80.887223,4.0,4,1,"Plumbing, Shopping, Local Services, Home Servi..."


## E. Subsampling and Train-Test Split
#### E.1. FIltering, Undersampling and Dataset Size Considerations
The idea behind undersampling is to develop a smaller dataset that is representable, or partially representative of the entire dataset, and expedites the model development cycle. We utilize 20% dataset size generally across the board for fast development iteration and for computation intensive tasks such as hyperparameter tuning. However, it is critical to analyze model performance across dataset sizes since the models’ capability to handle large dataset is an important consideration if it were to be utilized in production. Therefore we investigate two other dataset sizes as well, 50% and 100% of the data size. Whereas 20% of the dataset size is used generally, larger datasets are employed to analyze certain models’ capability to handle large data as well as measure how their performance changes with larger datasets. Finally, we filter our dataset to only active users who have at least 5 reviews since there needs to be at least some data about the user for the model to perform. If a user does not have at least 5 reviews, we will be building out recommendations using cold-start methods.

#### E.2 Train Test Split

Train-Test split is done in a very straightforward way. We take all the users that made through our filters as described above. Then we simply take the last two reviews of the users. The most recent review becomes our test set and the other second most recent review becomes our validation set. 


In [4]:
SAMPLING_RATE = 1/5

In [5]:
# Downsample by users
user_id_unique = ratings_.user_id.unique()
user_id_sample = pd.DataFrame(user_id_unique, columns=['unique_user_id']) \
                    .sample(frac= SAMPLING_RATE, replace=False, random_state=1)

ratings_sample = ratings_.merge(user_id_sample, left_on='user_id', right_on='unique_user_id') \
                    .drop(['unique_user_id'], axis=1)
print(ratings_sample.head())
print(ratings_sample.shape)

                  user_id             business_id  rating  \
0  n6-Gk65cPZL6Uz8qRm3NYw  WTqjgwHlXbSFevF32_DJVw     5.0   
1  n6-Gk65cPZL6Uz8qRm3NYw  hk5wpV-_pi5jmDDVPeG8DA     5.0   
2  n6-Gk65cPZL6Uz8qRm3NYw  30Q5xBagQHmkwp8Q9I1FCg     5.0   
3  n6-Gk65cPZL6Uz8qRm3NYw  UtWngqS-WloIY_A53W5K-Q     5.0   
4  n6-Gk65cPZL6Uz8qRm3NYw  dU-Nt1-LjV9mAgFOVcdAJw     5.0   

                  date                                               text  
0  2016-11-09 20:09:03  I have to say that this office really has it t...  
1  2018-09-14 18:50:19  I highly recommend Arizona Pet Mortuary, David...  
2  2018-02-03 23:27:43  First time at this restaurant our server "Ramo...  
3  2016-02-18 06:42:16  Such an amazing hospital with friendly staff, ...  
4  2018-08-15 22:14:18  Went for my yearly GYN exam and was seen by Lo...  
(918368, 5)


In [6]:
# hold out last review
ratings_user_date = ratings_sample.loc[:, ['user_id', 'date']]
ratings_user_date.date = pd.to_datetime(ratings_user_date.date)
index_holdout = ratings_user_date.groupby(['user_id'], sort=False)['date'].transform(max) == ratings_user_date['date']
ratings_holdout_ = ratings_sample[index_holdout]
ratings_traincv_ = ratings_sample[~index_holdout]

ratings_user_date = ratings_traincv_.loc[:, ['user_id', 'date']]
index_holdout = ratings_user_date.groupby(['user_id'], sort=False)['date'].transform(max) == ratings_user_date['date']
ratings_cv_ = ratings_traincv_[index_holdout]
ratings_train_ = ratings_traincv_[~index_holdout]

# remove the user that has fake reviews 

cv_users_del = set(ratings_cv_.user_id) - set(ratings_train_.user_id)
holdout_users_del = set(ratings_holdout_.user_id) - set(ratings_train_.user_id)
ratings_cv_ = ratings_cv_[~ratings_cv_.user_id.isin(cv_users_del)]
ratings_holdout_ = ratings_holdout_[~ratings_holdout_.user_id.isin(holdout_users_del)]

# ratings_cv_ = ratings_cv_[~ratings_cv_.user_id.isin(['HiT9sg9pvDiEVMFHJYihXg'])]
# ratings_holdout_ = ratings_holdout_[~ratings_holdout_.user_id.isin(['HiT9sg9pvDiEVMFHJYihXg'])]

print('There are {0} rows, {1} columns in training set.'.format(ratings_train_.shape[0], ratings_train_.shape[1]))
print('There are {0} rows, {1} columns in training set.'.format(ratings_cv_.shape[0], ratings_cv_.shape[1]))
print('There are {0} rows, {1} columns in holdout set.'.format(ratings_holdout_.shape[0], ratings_holdout_.shape[1]))

There are 803897 rows, 5 columns in training set.
There are 57229 rows, 5 columns in training set.
There are 57223 rows, 5 columns in holdout set.


In [7]:
# check if we have a enough user sample size (> 50000)
number_of_unique_users = len(ratings_train_.user_id.unique())
print(number_of_unique_users)

57223


## F. Evaluation Metrics

A list of evaluation metrics the team uses are:

#### F.1. Regression Metrics
Regression metrics measure We present four different regression metrics that are measured but we primarily use the RMSE. 
Root Mean Square Error (RMSE) calculates square rooted sum of square residuals of predictions. It measures numerical difference between all ground-truth ratings and actual ratings in test set.
Mean Absolute Error (MAE) calculates sum of absolute residuals. It measures numerical difference between all ground-truth ratings and actual ratings in test set, and more robust to outliers.
R-squared measures what percentage of variance in target that is explained by predictions. The higher the value, the better the predictions.

#### F.2. Ranking Metrics
We first make top 10 recommendations of restaurants for each user and see how relevant the rankings were. First we discuss the metrics then discuss a few caveats to the approach we took to measure them.
Inclusion of Last Review in Top 10 Recommendations: We train our model on the training set and make top 10 recommendations to the users. We then look at the test set and see if the user’s latest review was included in that top 10 recommendations. We calculate the proportion of users who received such recommendations. In other words we have,

$$\frac{1}{n} * \sum_j^n \sum_{i \in j's top 10}^{10}Rel(i)$$

where a recommendation is relevant if it is the latest visited restaurant of the user and n is the total number of users measured. 
Average Ranking of Latest Restaurant: We train our model on the training set and make a prediction on every single business  and user combination. We then measure the average rankings on that prediction of the latest business that the user visited. In other words we have,

$$\frac{1}{n} * \sum_j^n\sum_i^m Rank(i)(1_{i=latest business of j})$$

Where the indicator function is one if the restaurant is the last visited restaurant of user j, n is the total number of users and m is the total number of businesses.
Some caveats of our approach is that instead of making a prediction on the entire business universe, we make predictions on the businesses that are in the same city as the business of the user’s last review. This makes sense since it would be futile to recommend a restaurant in Los Angeles to a user who is in New York City. This also allows us to reduce the computation time of our recommendation, which is another key advantage we want to have when we serve our model to the users. Additionally, we measure the above two metrics on a subsample of users as opposed to all the users to save computation time, and we also only take subsample of ratings that were positive ratings since we want to measure how good our recommendations are. 
#### F.3. Coverage
We measure the coverage of our recommendations by looking at the proportion of our recommendations that are distinct. In other words, we measure, <br>

$$\frac{number of distinct businesses in all recommendations to the subsample} 
{number of all recommendations to the subsample}$$

 
#### F.4. Novelty
We measure the novelty of our recommendations by simply taking the proportion of businesses in our top ten recommendations that the user hasn’t been to. This is crucial since we don’t want our recommendations to be filled with restaurants that the user has already been to. We measure novelty as simply. 

#### F.5. Runtime 
We measure the runtime of the model’s training as well as its prediction time on validation set using Python’s time library. 

#### Metrics on Coverage and Serendipity

First subsample a group of users that we will measure these metrics from

Methodology:
    We sample 5 users from each city where the user made the latest review.
    These cities must have at least 100 unique businesses
    These users must also have made a postive review(above their historical average)to those restaurants.
        1. We recommend 10 restaurants to each user
        2. We see if their latest restaurant makes it into the top 10 list (Ranking Metric)
        3. We see for those 10 x 5 recommendations, how many of them are distinct businesses (Coverage)
        4. We see for those top 10 recommendations, how many of them are restaurants they have not visited (Serendipity)
    
    Additionally, we measure what our ranking was for the latest restaurant that the user visited(Ranking Metric
    


Criteria: 

In [13]:
def process(df):
    df['date']  = pd.to_datetime(df['date'])
    df['week_day'] = df['date'].dt.weekday
    df['month'] = df['date'].dt.month
    df['hour'] = df['date'].dt.hour
    df = df.merge(users_, on = 'user_id')
    df = df.merge(business_, on = 'business_id')
    rename_dict = {'business_longitude': 'longitude', 'business_latitude': 'latitude',
                  'business_state':'state','business_city':'city', 'business_address': 'address'}
    df = df.rename(columns = rename_dict)
    return df

ratings_train = process(ratings_train_.copy())
ratings_holdout = process(ratings_holdout_.copy())
ratings_val = process(ratings_cv_.copy())

In [14]:
ratings_train_final = ratings_train.append(ratings_val)
ratings_entire_df = ratings_train.append(ratings_val).append(ratings_holdout)

In [15]:
# ratings_holdout.columns

In [16]:
unique_city_businesses = ratings_entire_df[['city','business_id']].drop_duplicates()
unique_cities = unique_city_businesses.groupby('city').count()['business_id']
unique_cities = unique_cities[unique_cities > 100]
out = pd.DataFrame()
for city in unique_cities.index:
    tmp = ratings_holdout[(ratings_holdout['city'] ==city) &
                              (ratings_holdout['rating'] >ratings_holdout['average_stars'])]
    if len(tmp['user_id'].unique())>4:
        
        ###this weird sampling technique is to ensure we dont' sample the same user twice in a same city
        five_users = np.random.choice(tmp['user_id'].unique(),5, replace = False)
        row = tmp[tmp['user_id'].isin(five_users)].groupby('user_id', group_keys=False).apply(lambda df: df.sample(1))
        out = out.append(row)

In [17]:
predict_df = out[['user_id','city','state']]
predict_df = predict_df.merge(unique_city_businesses, on = 'city')
predict_df.to_csv('data/metric_sample.csv')

In [18]:
# random initialization, needs to be substituted by actual predictions later.
predict_df['predictions'] = 2.5

In [19]:
def get_all_metrics(predict_df, validation_subsample, ratings_train_final):
    top_10_recs = predict_df.groupby(['user_id','city'])['predictions'].nlargest(10).reset_index()
    out = validation_subsample
    cnt =0
    serendipity = 0
    
    
    for row in out.iterrows():
        row_values = row[1]
        top_10 = predict_df.loc[top_10_recs[top_10_recs['user_id'] == row_values['user_id']].level_2]['business_id']
        ###In top 10
        if row_values['business_id'] in top_10.values:
            cnt+=1
        user_history = ratings_train_final[ratings_train_final['user_id'] == row_values['user_id']]    
        been_there = [i for i in top_10.values if i in  user_history.business_id.values]
        serendipity += 1-len(been_there)/10
    
    top_10 = cnt/len(out)
    serendipity = serendipity/len(out)
    
    predict_df = predict_df.reset_index()
    
    analysis_df = predict_df.merge(top_10_recs, left_on = ['user_id','city','index'], \
                                   right_on = ['user_id','city','level_2'])
    
    coverage = (analysis_df.groupby('city')['business_id'].nunique()/50).values.mean()
    
    predict_df['rankings']=predict_df.groupby(['city','user_id'])['predictions']. \
                                                        rank(method="first",ascending = False)
    running_rankings =0
    for row in out.iterrows():
        row_values = row[1]
        user_recs = predict_df[(predict_df['user_id']==row_values['user_id'])
                            &(predict_df['city']==row_values['city'])
                             & (predict_df['business_id']==row_values['business_id'])
                              ]
        assert len(user_recs)==1
        running_rankings += user_recs['rankings'].sum()

    avg_rank = running_rankings / len(out)
    print(top_10, coverage, serendipity, avg_rank)
    
    return top_10, coverage, serendipity, avg_rank

## F. Methods

A list of models that the team attempts are: 

* Bias Baseline
* Collaborative Filtering Baseline: SVD
* Collective Matrix Factorization (CMF)
* Content-Based Filtering (CBF)
* Field-aware Factorization Machine (FFM) 
* Deep Learning Model

_**Note**_: the team has run algorithms on **20%, 50%, 100%** training data respectively. But for readability purpose, only result on 20% data is displayed in this notebook.

### Baseline models

### Baseline 1: Bias Baseline

$\sum_{r_{ui} \in R_{train}} \left(r_{ui} - (\mu + b_u + b_i)\right)^2 +
\lambda \left(b_u^2 + b_i^2 \right)$.

#### Hyperparameter Tuning

In [28]:
!pip install scikit-surprise

Collecting scikit-surprise
[?25l  Downloading https://files.pythonhosted.org/packages/f5/da/b5700d96495fb4f092be497f02492768a3d96a3f4fa2ae7dea46d4081cfa/scikit-surprise-1.1.0.tar.gz (6.4MB)
[K    100% |████████████████████████████████| 6.5MB 6.2MB/s eta 0:00:01
Building wheels for collected packages: scikit-surprise
  Running setup.py bdist_wheel for scikit-surprise ... [?25ldone
[?25h  Stored in directory: /Users/zhonglingjiang/Library/Caches/pip/wheels/cc/fa/8c/16c93fccce688ae1bde7d979ff102f7bee980d9cfeb8641bcf
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.0
[33mYou are using pip version 10.0.1, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [46]:
from surprise import SVD
from surprise import accuracy
from surprise import Reader
from surprise.model_selection import GridSearchCV
from surprise import Dataset
from surprise import BaselineOnly

from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error

In [45]:
def process(df):
    df['date']  = pd.to_datetime(df['date'])
    df['week_day'] = df['date'].dt.weekday
    df['month'] = df['date'].dt.month
    df['hour'] = df['date'].dt.hour
    df = df.merge(users_, on = 'user_id')
    df = df.merge(business_, on = 'business_id')
    rename_dict = {'business_longitude': 'longitude', 'business_latitude': 'latitude',
              'business_state':'state','business_city':'city', 'business_address': 'address'}
    df = df.rename(columns = rename_dict)
    return df

ratings_train = process(ratings_train_.copy())
ratings_test = process(ratings_holdout_.copy())
ratings_val = process(ratings_cv_.copy())

In [47]:
# remove observations that may cause cold-start problem, which breaks the model.
ratings_test = ratings_test.loc[ratings_test.business_id.isin(ratings_train.business_id)]
ratings_val = ratings_val.loc[ratings_val.business_id.isin(ratings_train.business_id)]

In [48]:
trainset = ratings_train.loc[:,['user_id', 'business_id', 'rating']]
trainset.columns = ['userID', 'itemID','rating']
valset = ratings_val.loc[:, ['user_id', 'business_id', 'rating']]
valset.columns = ['userID', 'itemID','rating']
testset = ratings_holdout.loc[:, ['user_id', 'business_id', 'rating']]
testset.columns = ['userID', 'itemID','rating']

In [49]:
reader = Reader(rating_scale = (0.0, 5.0))
train_data = Dataset.load_from_df(trainset[['userID','itemID','rating']], reader)
val_data = Dataset.load_from_df(valset[['userID','itemID','rating']], reader)
test_data = Dataset.load_from_df(testset[['userID','itemID','rating']], reader)

train_sr = train_data.build_full_trainset()
val_sr_before = val_data.build_full_trainset()
val_sr = val_sr_before.build_testset()
test_sr_before = test_data.build_full_trainset()
test_sr = test_sr_before.build_testset()

In [50]:
bsl_options = {'method': 'als', 'n_epochs':3}
bias_baseline = BaselineOnly(bsl_options)
bias_baseline.fit(train_sr)
predictions = bias_baseline.test(val_sr)
accuracy.rmse(predictions)

Estimating biases using als...
RMSE: 1.3322


1.3322167380674812

In [51]:
bsl_options = {'method': 'als', 'n_epochs':5}
bias_baseline = BaselineOnly(bsl_options)
bias_baseline.fit(train_sr)
predictions = bias_baseline.test(val_sr)
accuracy.rmse(predictions)

Estimating biases using als...
RMSE: 1.3323


1.332251167562718

In [52]:
bsl_options = {'method': 'als', 'n_epochs':9}
bias_baseline = BaselineOnly(bsl_options)
bias_baseline.fit(train_sr)
predictions = bias_baseline.test(val_sr)
accuracy.rmse(predictions)

Estimating biases using als...
RMSE: 1.3323


1.3322543575081625

**Observation**: It seems different hyperparameters all performs the same result; the team just uses default.

#### Evaluation

**Note**: the team has evaluated this and the following algorithms on **20%, 50%, 100%** training data respectively. For readability, only result on 20% data is displayed. For full result, please refer to a result table enclosed.

In [30]:
# runtime on 20% data
start_time = time.time()
bias_baseline.fit(train_sr)
print("--- %s seconds ---" % (time.time() - start_time))

Estimating biases using als...
--- 2.340609073638916 seconds ---


In [53]:
# Regression metrics 
bbase_p = bias_baseline.test(test_sr)
start_time = time.time()
bbase_20_df = pd.DataFrame(bbase_p, columns = ['userId','itemId','rating','pred_rating','x'])
accuracy.rmse(bbase_p)
print('R^2 (with 20% data): ', r2_score(bbase_20_df.rating , bbase_20_df.pred_rating))
print('MAE (with 20% data): ', mean_absolute_error(bbase_20_df.rating, bbase_20_df.pred_rating))
print("--- %s seconds ---" % (time.time() - start_time))

RMSE: 1.3811
R^2 (with 20% data):  0.1661144204822098
MAE (with 20% data):  1.1598083269638433
--- 0.17791104316711426 seconds ---


In [45]:
# Ranking + coverage + novelty metrics
predict_df_baseline = predict_df.copy()
eval_set = Dataset.load_from_df(predict_df_baseline[['user_id','business_id','predictions']], reader)

bias_basline = BaselineOnly({'method': 'als', 'n_epochs':9})
eval_before = eval_set.build_full_trainset()
eval_sr = eval_before.build_testset()
bias_basline.fit(train_sr)
eval_pred = bias_basline.test(eval_sr)
#accuracy.rmse(predictions_50)
baseline_20 = pd.DataFrame(eval_pred, columns = ['userId','itemId','rating','pred_rating','x'])
predict_df_baseline['predictions'] = baseline_20.pred_rating

Estimating biases using als...


In [46]:
top_10, coverage, serendipity, avg_rank = get_all_metrics(predict_df_baseline, out, ratings_train_final)

0.12325581395348838 0.20441860465116274 0.9795348837209297 526.606976744186


### Baseline 2: Collaborative Filtering via SVD

Matrix factorization is a class of collaborative filtering algorithms. The general idea behind matrix factorization is that there can exist a lower dimensional latent space of features in which users and items can be represented such that the interaction between them can be obtained by simply dot producing the corresponding dense vectors in that space. In short, it decomposes a m*n user-item interaction matrix into two m*k and k*n matrices, sharing a joint latent vector space, where m represents the number of users, and n represents the number of items. In terms of its outcome, we are likely to observe that close users in terms of preferences as well as close items in terms of characteristics can have close representations in the latent space.

The mathematical overview is as follows:
Given a n*m matrix, such that . X is the user matrix where rows represent the n users and Y is the item matrix where rows represent the m items. We want to search for the dot product of matrices X and Y that best approximate the existing interactions; i.e., we want to find X and Y that minimize the “rating reconstruction error”:

$$ (X,Y) = argmin_{X,Y} \sum_{(i,j) \in E} [(X_i)(Y_j)^T − M_{ij}]^2$$

Adding a regularization term, we can also get:

$$(X,Y) = argmin_{X,Y} ½ \sum_{(i,j) \in E} [(X_i)(Y_j)^T − M_{ij}]^2 + \lambda/2(\sum_{i,k}(X_{ik})^2 + \sum_{j,k}(Y_{jk})^2)$$

In general, we obtain the matrices X and Y following a gradient descent optimization process. And once the matrices are obtained, we can predict the ratings simply by multiplying the user vector by any item vector.

In this Yelp Rating Challenge, we used the python surprise package to implement MF. The MF algorithm there uses the SVD approach, which is essentially 

$$ P_{m * n} = U_{m * m} \sum_{m * n} V_{n * n}$$

There, the prediction is
 $$\hat(r_{ui}) = \mu + b_u + b_i + (q_i)^T p_u $$
 
and the regularized squared error that needs to be minimized is 

$$\sum_{r_{ui} \in R_{train}} (r_{ui} − \hat(r_{ui}))^2 + \lambda(b^2_{i} + b^2_{u} + ||q_i||^2 + ||p_u||^2)$$

As the way the package is designed, we tuned on n_epochs, lr_all and leg_all to get an optimal hyperparameter set, where n_epochs is the number of iterations of the SGD (stochastic gradient descent) procedure, lr_all is the learning rate for all parameters, and reg_all is the regularization term for all parameters.  


In [26]:
import matplotlib.pyplot as plt
import json
from tqdm import tqdm

In [54]:
# This will follow the same process as the above model so can be skipped
def process(df):
    df['date']  = pd.to_datetime(df['date'])
    df['week_day'] = df['date'].dt.weekday
    df['month'] = df['date'].dt.month
    df['hour'] = df['date'].dt.hour
    df = df.merge(users_, on = 'user_id')
    df = df.merge(business_, on = 'business_id')
    rename_dict = {'business_longitude': 'longitude', 'business_latitude': 'latitude',
              'business_state':'state','business_city':'city', 'business_address': 'address'}
    df = df.rename(columns = rename_dict)
    return df
ratings_train = process(ratings_train_.copy())
ratings_test = process(ratings_holdout_.copy())
ratings_val = process(ratings_cv_.copy())

In [55]:
ratings_test = ratings_test.loc[ratings_test.business_id.isin(ratings_train.business_id)]
ratings_val = ratings_val.loc[ratings_val.business_id.isin(ratings_train.business_id)]

In [56]:
trainset = ratings_train.loc[:,['user_id', 'business_id', 'rating']]
trainset.columns = ['userID', 'itemID','rating']
valset = ratings_val.loc[:, ['user_id', 'business_id', 'rating']]
valset.columns = ['userID', 'itemID','rating']
testset = ratings_holdout.loc[:, ['user_id', 'business_id', 'rating']]
testset.columns = ['userID', 'itemID','rating']

In [57]:
reader = Reader(rating_scale = (0.0, 5.0))
train_data = Dataset.load_from_df(trainset[['userID','itemID','rating']], reader)
val_data = Dataset.load_from_df(valset[['userID','itemID','rating']], reader)
test_data = Dataset.load_from_df(testset[['userID','itemID','rating']], reader)

train_sr = train_data.build_full_trainset()
val_sr_before = val_data.build_full_trainset()
val_sr = val_sr_before.build_testset()
test_sr_before = test_data.build_full_trainset()
test_sr = test_sr_before.build_testset()

#### Hyperparameter Tuning

In [68]:
RMSE_tune = {}
n_epochs = [10, 20, 30]  # the number of iteration of the SGD procedure
lr_all = [0.001, 0.003, 0.005] # the learning rate for all parameters
reg_all =  [0.02, 0.05, 0.1, 0.4, 0.5] # the regularization term for all parameters

for n in n_epochs:
    for l in lr_all:
        for r in reg_all:
            print('Fitting n: {0}, l: {1}, r: {2}'.format(n, l, r))
            algo = SVD(n_epochs = n, lr_all = l, reg_all = r)
            algo.fit(train_sr)
            predictions = algo.test(val_sr)
            RMSE_tune[n,l,r] = accuracy.rmse(predictions)

Fitting n: 10, l: 0.001, r: 0.02
RMSE: 1.4159
Fitting n: 10, l: 0.001, r: 0.05
RMSE: 1.4153
Fitting n: 10, l: 0.001, r: 0.1
RMSE: 1.4150
Fitting n: 10, l: 0.001, r: 0.4
RMSE: 1.4172
Fitting n: 10, l: 0.001, r: 0.5
RMSE: 1.4174
Fitting n: 10, l: 0.003, r: 0.02
RMSE: 1.3687
Fitting n: 10, l: 0.003, r: 0.05
RMSE: 1.3690
Fitting n: 10, l: 0.003, r: 0.1
RMSE: 1.3691
Fitting n: 10, l: 0.003, r: 0.4
RMSE: 1.3727
Fitting n: 10, l: 0.003, r: 0.5
RMSE: 1.3737
Fitting n: 10, l: 0.005, r: 0.02
RMSE: 1.3448
Fitting n: 10, l: 0.005, r: 0.05
RMSE: 1.3454
Fitting n: 10, l: 0.005, r: 0.1
RMSE: 1.3447
Fitting n: 10, l: 0.005, r: 0.4
RMSE: 1.3493
Fitting n: 10, l: 0.005, r: 0.5
RMSE: 1.3508
Fitting n: 20, l: 0.001, r: 0.02
RMSE: 1.3868
Fitting n: 20, l: 0.001, r: 0.05
RMSE: 1.3871
Fitting n: 20, l: 0.001, r: 0.1
RMSE: 1.3873
Fitting n: 20, l: 0.001, r: 0.4
RMSE: 1.3902
Fitting n: 20, l: 0.001, r: 0.5
RMSE: 1.3904
Fitting n: 20, l: 0.003, r: 0.02
RMSE: 1.3378
Fitting n: 20, l: 0.003, r: 0.05
RMSE: 1.3379


In [69]:
import operator
min(RMSE_tune.items(), key=operator.itemgetter(1))[0]

(30, 0.005, 0.1)

**Observation**: The best combination is when n_epochs = 30, lr_all = 0.005, reg_all = 0.1, and the RMSE score on validation set is 1.3098

#### Evaluation

In [70]:
# train and test on the optimal parameter
start_time = time.time()
algo_real = SVD(n_epochs = 30, lr_all = 0.005, reg_all = 0.1)
algo_real.fit(train_sr)
predictions = algo_real.test(test_sr)

In [71]:
print("--- %s seconds ---" % (time.time() - start_time))

--- 83.35147786140442 seconds ---


In [72]:
accuracy.rmse(predictions)

RMSE: 1.3595


1.359467895040839

In [73]:
accuracy.mae(predictions)

MAE:  1.1170


1.1169825502216897

In [84]:
r2_score([t[2] for t in predictions], [t[3] for t in predictions])

0.19207896312764172

In [85]:
# To evaluate coverage and serendipity metrics, use evaluation set created earlier.
predict_df_20 = pd.read_csv('data/metric_sample.csv', index_col=0)
predict_df_20['predictions'] = 2.5 # fill in this value temporally
eval_20 = Dataset.load_from_df(predict_df_20[['user_id','business_id','predictions']], reader)

In [86]:
eval_before_20 = eval_20.build_full_trainset()
eval_sr_20 = eval_before_20.build_testset()
eval_pred_20 = algo_real.test(eval_sr_20)

baseline_20 = pd.DataFrame(eval_pred_20, columns = ['userId','itemId','rating','pred_rating','x'])
predict_df_20['predictions'] = baseline_20.pred_rating

In [87]:
top_10, coverage, serendipity, avg_rank = get_all_metrics(predict_df_20, out, ratings_train_final)

0.11395348837209303 0.476046511627907 0.9672093023255807 541.146511627907
