# Algorithm Selection for a Single-Criterion Recommender System

## 1. Selection of the algorithm

Let's use the structure of the benchmark script at https://github.com/NicolasHug/Surprise/blob/master/examples/benchmark.py to evaluate different algorithms for the recommender system on the data we have.

In any case, I want to compare item-based and user-based approaches where possible. I believe SVD does not offer that choice.

Once that is done, I will implement the selected algorithm in the recommender.


In [1]:
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)
import time
import datetime
import random
import os

import numpy as np
import six
from tabulate import tabulate

from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise.model_selection import KFold
from surprise import NormalPredictor
from surprise import BaselineOnly
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNBaseline
from surprise import SVD
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering


The KNN algorithms are **user-based** by default; if we want **item-based** similarities we need to set `user_based` to `False` in `sim_options`.

The default similarity is **MSD**.
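
To make the MSD (mean squared difference) similarity concrete, here is a minimal pure-Python sketch. It assumes the usual definition, where the similarity is `1 / (msd + 1)` over the items two users have in common; the function name and the ratings are made up for illustration, and this is not Surprise's own implementation.

```python
def msd_similarity(ratings_a, ratings_b):
    """MSD similarity between two users given as {item: rating} dicts.

    Hypothetical helper for illustration only.
    """
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0  # no co-rated items: treat the pair as not similar
    msd = sum((ratings_a[k] - ratings_b[k]) ** 2 for k in common) / len(common)
    return 1.0 / (msd + 1.0)


# Made-up ratings, keyed by hotel name
alice = {"hotel a": 5.0, "hotel b": 3.0, "hotel c": 4.0}
bob = {"hotel a": 4.0, "hotel b": 3.0, "hotel d": 2.0}

print(msd_similarity(alice, bob))
```

Identical ratings on the common items give a similarity of 1, and the value shrinks towards 0 as the ratings diverge.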

Let's look at cross-validation performance first:
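
As a reminder of what k-fold cross-validation does with the ratings, here is a small stdlib-only sketch. It assumes 5 shuffled folds; `k_fold_indices` is a hypothetical helper written for this note, not the Surprise `KFold` class.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed: same folds on every run
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(10):
    print(len(train), len(test))
```

Fixing the seed plays the same role as `random_state=0` in the benchmark: every algorithm is scored on the same folds, so the comparison is fair.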

In [2]:
classes = {'SVD':SVD, 'SVD++':SVDpp, 'NMF':NMF, 'SlopeOne':SlopeOne, 'KNNBasic':KNNBasic, 'KNNWithMeans':KNNWithMeans, 
           'KNNBaseline':KNNBaseline,'CoClustering':CoClustering, 'BaselineOnly':BaselineOnly, 
           'NormalPredictor':NormalPredictor}

# set RNG
np.random.seed(0)
random.seed(0)

file_path = os.path.expanduser('/home/jonas/Desktop/SpringBoard_Capstone_1/FIRST ATTEMPT/generated_ratings_1_reduced.csv')
reader = Reader(line_format='item rating user', sep=',')
data = Dataset.load_from_file(file_path, reader=reader)

kf = KFold(random_state=0)  # folds will be the same for all algorithms.

table = []

for name, klass in classes.items():
    start = time.time()
    out = cross_validate(klass(), data, ['rmse', 'mae', 'fcp'], kf, verbose=False)
    cv_time = str(datetime.timedelta(seconds=int(time.time() - start)))
    mean_rmse = '{:.3f}'.format(np.mean(out['test_rmse']))
    mean_mae = '{:.3f}'.format(np.mean(out['test_mae']))
    mean_fcp = '{:.3f}'.format(np.mean(out['test_fcp']))
    
    new_line = [name, mean_rmse, mean_mae, mean_fcp, cv_time]
    table.append(new_line)
print('\n\n User-based recommenders \n')
header = ['Name',
          'RMSE',
          'MAE',
          'FCP',
          'Time'
          ]
print(tabulate(table, header, tablefmt="pipe"))


Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity ma

We can also look at the performance on unseen data by splitting the dataset and retraining everything:
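
For reference, this is roughly what the two error metrics measure on a held-out set. The sketch below uses made-up `(true rating, estimate)` pairs and hypothetical helper names; FCP is left out because it needs per-user pair counting.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimate) pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)


# Made-up held-out predictions: (true rating, estimated rating)
held_out = [(5.0, 4.5), (3.0, 3.5), (4.0, 4.0), (2.0, 3.0)]

print("RMSE:", round(rmse(held_out), 4))
print("MAE: ", round(mae(held_out), 4))
```

RMSE penalises large misses more than MAE does, which is why the two can rank algorithms differently.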

In [6]:
from surprise.model_selection import train_test_split
from surprise import accuracy

trainset, testset = train_test_split(data, test_size=.25)

classes2 = {'SVD':SVD(), 'SVD++':SVDpp(), 'NMF':NMF(), 'SlopeOne':SlopeOne(), 'KNNBasic':KNNBasic(), 'KNNWithMeans':KNNWithMeans(), 
           'KNNBaseline':KNNBaseline(),'CoClustering':CoClustering(), 'BaselineOnly':BaselineOnly(), 
           'NormalPredictor':NormalPredictor()}

table = []

for name, klass in classes2.items():
    print(name)
    start = time.time()
    klass.fit(trainset)
    predictions = klass.test(testset)
    cv_time = str(datetime.timedelta(seconds=int(time.time() - start)))
    
    mean_rmse = '{:.3f}'.format(accuracy.rmse(predictions))
    mean_mae = '{:.3f}'.format(accuracy.mae(predictions))
    mean_fcp = '{:.3f}'.format(accuracy.fcp(predictions))
    
    new_line = [name, mean_rmse, mean_mae, mean_fcp, cv_time]
    table.append(new_line)
print('\n\n User-based recommenders \n')
header = ['Name',
          'RMSE',
          'MAE',
          'FCP',
          'Time'
          ]
print(tabulate(table, header, tablefmt="pipe"))

SVD
RMSE: 0.5647
MAE:  0.4256
FCP:  0.5378
SVD++
RMSE: 0.5512
MAE:  0.4158
FCP:  0.5321
NMF
RMSE: 0.6318
MAE:  0.4690
FCP:  0.5375
SlopeOne
RMSE: 0.6945
MAE:  0.4665
FCP:  0.4245
KNNBasic
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.6227
MAE:  0.4335
FCP:  0.4537
KNNWithMeans
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.6987
MAE:  0.4965
FCP:  0.4654
KNNBaseline
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.5450
MAE:  0.3979
FCP:  0.5284
CoClustering
RMSE: 0.7977
MAE:  0.6254
FCP:  0.5245
BaselineOnly
Estimating biases using als...
RMSE: 0.6190
MAE:  0.4580
FCP:  0.5336
NormalPredictor
RMSE: 1.1388
MAE:  0.8968
FCP:  0.4906


 User-based recommenders 

| Name            |   RMSE |   MAE |   FCP | Time    |
|:----------------|-------:|------:|------:|:--------|
| SVD             |  0.565 | 0.426 | 0.538 | 0:00:00 |
| SVD++           |  0.551 | 0.416 |

As we can see, KNNBaseline is still the best-performing algorithm, with the lowest RMSE (0.545) and MAE (0.398). There doesn't seem to be any overfitting, since the RMSE and MAE values are only slightly larger than with cross-validation. SVD++ is the slowest one and, due to the small sample size, none of the others has an appreciable runtime.
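
For reference, the user-based KNNBaseline prediction has, as far as I recall from the Surprise documentation, the following form. Treat it as a reminder rather than a specification:

$$
\hat{r}_{ui} = b_{ui} + \frac{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)\,(r_{vi} - b_{vi})}{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)}, \qquad b_{ui} = \mu + b_u + b_i
$$

Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases (the "Estimating biases using als..." step in the output), and $N_i^k(u)$ is the set of at most $k$ users most similar to $u$ who rated item $i$. The neighbours therefore only correct the baseline estimate, which may be why this variant holds up well on sparse data.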

Now, we can look at item-based algorithms:

In [7]:
sim_options = {'name': 'msd', 'user_based': False}
classes = {'KNNBasic':KNNBasic(sim_options=sim_options), 
           'KNNWithMeans':KNNWithMeans(sim_options=sim_options), 
           'KNNBaseline':KNNBaseline(sim_options=sim_options)}
# only the KNN-based algorithms accept sim_options; the others do not
np.random.seed(0)
random.seed(0)

file_path = os.path.expanduser('/home/jonas/Desktop/SpringBoard_Capstone_1/FIRST ATTEMPT/generated_ratings_1_reduced.csv')
reader = Reader(line_format='item rating user', sep=',')
data = Dataset.load_from_file(file_path, reader=reader)

kf = KFold(random_state=0)  # folds will be the same for all algorithms.

table = []

for name, klass in classes.items():
    start = time.time()
    out = cross_validate(klass, data, ['rmse', 'mae', 'fcp'], kf, verbose=False)
    cv_time = str(datetime.timedelta(seconds=int(time.time() - start)))
    mean_rmse = '{:.3f}'.format(np.mean(out['test_rmse']))
    mean_mae = '{:.3f}'.format(np.mean(out['test_mae']))
    mean_fcp = '{:.3f}'.format(np.mean(out['test_fcp']))
    
    new_line = [name, mean_rmse, mean_mae, mean_fcp, cv_time]
    table.append(new_line)
print('\n\n Item-based recommenders \n')
header = ['Name',
          'RMSE',
          'MAE',
          'FCP',
          'Time'
          ]
print(tabulate(table, header, tablefmt="pipe"))

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity ma

Indeed, it seems that KNNBaseline performs a bit better. We cannot see a difference in run time on this small dataset, but going item-based is more expensive, so we may want to avoid it.

What about another similarity measure, or other KNN parameters? Let's see if we observe any difference:

In [8]:
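# NOTE: 'msd' is also the default, so this run repeats the user-based MSD setup;
# Surprise also offers 'cosine', 'pearson' and 'pearson_baseline' to compare.
sim_options = {'name': 'msd'}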
classes = {'KNNBasic':KNNBasic(sim_options=sim_options), 
           'KNNWithMeans':KNNWithMeans(sim_options=sim_options), 
           'KNNBaseline':KNNBaseline(sim_options=sim_options)}

# set RNG
np.random.seed(0)
random.seed(0)

file_path = os.path.expanduser('/home/jonas/Desktop/SpringBoard_Capstone_1/FIRST ATTEMPT/generated_ratings_1_reduced.csv')
reader = Reader(line_format='item rating user', sep=',')
data = Dataset.load_from_file(file_path, reader=reader)

kf = KFold(random_state=0)  # folds will be the same for all algorithms.

table = []

for name, klass in classes.items():
    start = time.time()
    out = cross_validate(klass, data, ['rmse', 'mae', 'fcp'], kf, verbose=False)
    cv_time = str(datetime.timedelta(seconds=int(time.time() - start)))
    mean_rmse = '{:.3f}'.format(np.mean(out['test_rmse']))
    mean_mae = '{:.3f}'.format(np.mean(out['test_mae']))
    mean_fcp = '{:.3f}'.format(np.mean(out['test_fcp']))
    
    new_line = [name, mean_rmse, mean_mae, mean_fcp, cv_time]
    table.append(new_line)
print('\n\n User-based recommenders (MSD similarity) \n')
header = ['Name',
          'RMSE',
          'MAE',
          'FCP',
          'Time'
          ]
print(tabulate(table, header, tablefmt="pipe"))

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity ma

In [9]:
from surprise.model_selection import GridSearchCV

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.6433716300066304
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}


Even after tuning, SVD (best RMSE of about 0.643 on this grid) doesn't seem able to improve on KNNBaseline. So we will go ahead and use KNNBaseline from now on.
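
To get a feel for how much work the grid search above does, the sketch below expands the same `param_grid` into its individual combinations. `expand_grid` is a hypothetical helper; it only enumerates the settings and does not train anything.

```python
from itertools import product

# Same grid as in the SVD search above
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}


def expand_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


combos = expand_grid(param_grid)
n_folds = 3  # matches cv=3 above
print(len(combos), "combinations,", len(combos) * n_folds, "model fits")
```

The count grows multiplicatively with every parameter added, so a wider grid gets expensive quickly on a larger dataset.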

For the deployment, we want to run the code on demand, so the next section describes how this is done and shows some results.

***

## 2. Example of implementation

Before we get hands on, a few considerations:

* For an existing user with some ratings, what would be the best hotels? How do we find them?
* What are similar hotels to a specific one?
* What if a user doesn't have any ratings? Should we just suggest the most popular hotels for the area?

Let's start again by training the algorithm. Then, I will choose both a user with ratings and a user without ratings, and look at the output. Finally, let's try to look at hotels similar to a specific one.

In [10]:
# import specific libraries
import numpy as np
import pandas as pd
import os

from surprise import Dataset
from surprise import Reader
from surprise import KNNBaseline

# read data
file_path = os.path.expanduser('/home/jonas/Desktop/SpringBoard_Capstone_1/FINALE/generated_ratings_1_reduced.csv')
reader = Reader(line_format='item rating user', sep=',')
data = Dataset.load_from_file(file_path, reader=reader)

# train kNN-Baseline on the whole collection (both, user and item-wise)
trainset = data.build_full_trainset()

# Build an algorithm, and train it.
algo = KNNBaseline()
algo.fit(trainset)
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo_items = KNNBaseline(sim_options=sim_options)
algo_items.fit(trainset)

######################################################################
# Best hotels for user XX
# list of hotels...
hoteldf = pd.read_csv('/home/jonas/Desktop/SpringBoard_Capstone_1/FINALE/generated_ratings_1_reduced.csv', header=None, names=['item', 'rating','user'])
hotels = hoteldf['item'].unique().tolist()

# case 1
user1 = '3'
hot_ratings_user = {}
# loop to find ratings
for hot in hotels:
    pred = algo.predict(user1, hot)
    hot_ratings_user[hot] = pred.est
# the whole dictionary should be done now... I want the top N = 10
print('\n\nTop 10 hotels for user', user1, ':')
sorted_hot_ratings_user = sorted(hot_ratings_user, key=hot_ratings_user.get, reverse=True)[:10]
for key in sorted_hot_ratings_user:
    print(key, ':', hot_ratings_user[key])
    
# case 1.5
user1 = '3'
hot_ratings_user = {}
# loop to find ratings
for hot in hotels:
    pred = algo_items.predict(user1, hot)
    hot_ratings_user[hot] = pred.est
# the whole dictionary should be done now... I want the top N = 10
print('\n\nTop 10 hotels for user', user1, ' (item-based):')
sorted_hot_ratings_user = sorted(hot_ratings_user, key=hot_ratings_user.get, reverse=True)[:10]
for key in sorted_hot_ratings_user:
    print(key, ':', hot_ratings_user[key])
    
# case 2
user2 = '2000' # there's 1789 users with ratings...
hot_ratings_user = {}
# loop to find ratings
for hot in hotels:
    pred = algo.predict(user2, hot)
    hot_ratings_user[hot] = pred.est
# the whole dictionary should be done now... I want the top N = 10
print('\n\nTop 10 hotels for user', user2, ' (who has no reviews whatsoever):')
sorted_hot_ratings_user = sorted(hot_ratings_user, key=hot_ratings_user.get, reverse=True)[:10]
for key in sorted_hot_ratings_user:
    print(key, ':', hot_ratings_user[key])
    
# case 3
user2 = '2100' # there's 1789 users with ratings...
hot_ratings_user = {}
# loop to find ratings
for hot in hotels:
    pred = algo.predict(user2, hot)
    hot_ratings_user[hot] = pred.est
# the whole dictionary should be done now... I want the top N = 10
print('\n\nTop 10 hotels for user', user2, ' (who has no reviews whatsoever either):')
sorted_hot_ratings_user = sorted(hot_ratings_user, key=hot_ratings_user.get, reverse=True)[:10]
for key in sorted_hot_ratings_user:
    print(key, ':', hot_ratings_user[key])
    
# case 3.5
user2 = '2100' # there's 1789 users with ratings...
hot_ratings_user = {}
# loop to find ratings
for hot in hotels:
    pred = algo_items.predict(user2, hot)
    hot_ratings_user[hot] = pred.est
# the whole dictionary should be done now... I want the top N = 10
print('\n\nTop 10 hotels for user', user2, ' (who has no reviews whatsoever either)(item-based):')
sorted_hot_ratings_user = sorted(hot_ratings_user, key=hot_ratings_user.get, reverse=True)[:10]
for key in sorted_hot_ratings_user:
    print(key, ':', hot_ratings_user[key])

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


Top 10 hotels for user 3 :
blue moon hotel : 5
homewood suites las vegas airport : 5
holiday inn express hotel and suites las vegas 215 beltway : 4.996907332713799
comfort inn midtown manhattan : 4.996907332713798
allerton hotel : 4.9849054020480885
serrano hotel a kimpton hotel : 4.9574337074161665
four seasons hotel san francisco : 4.948217391789254
the bowery hotel : 4.937430263528778
hotel vitale : 4.937430263528778
the peninsula chicago : 4.933398035411475


Top 10 hotels for user 3  (item-based):
kitano new york : 5
best western hospitality house : 4.920003316334549
omni san francisco hotel : 4.842450737561286
columbus motor inn : 4.82953612873261
plaza athenee hotel : 4.692137807975122
hilton club new york : 4.669006284896156
w san francisco : 4.58492345350793
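
The five near-identical loops in the cell above could be folded into one helper. The sketch below is one way to do it, under the assumption that a prediction function returning a numeric estimate is passed in; `top_n`, `fake_predict` and the estimates are hypothetical stand-ins so that the example runs without a trained model.

```python
def top_n(predict, user, items, n=10):
    """Return the n (item, estimate) pairs with the highest estimates."""
    scored = [(item, predict(user, item)) for item in items]
    # list.sort is stable, so ties keep the original item order
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Made-up estimates standing in for a trained model
fake_estimates = {"hotel a": 4.2, "hotel b": 4.9, "hotel c": 3.1, "hotel d": 4.9}


def fake_predict(user, item):
    """Stand-in predictor: ignores the user and looks up a fixed estimate."""
    return fake_estimates[item]


for name, est in top_n(fake_predict, '3', list(fake_estimates), n=3):
    print(name, ':', est)
```

With the models trained above, the call would be something like `top_n(lambda u, h: algo.predict(u, h).est, '3', hotels)`. Like the original loops, this ranks every hotel, including ones the user has already rated.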

In [11]:
########################################################################
# Hotels similar to XX
hotel = 'courtyard by marriott new york manhattan upper east side'

# need to convert the hotel name (raw id) to its inner id
h_inner_id = algo_items.trainset.to_inner_iid(hotel)

# inner ids of the 10 nearest neighbours
hotel_neighbors = algo_items.get_neighbors(h_inner_id, k=10)

# take them back to names
hotel_neighbors = (algo_items.trainset.to_raw_iid(inner_id) for inner_id in hotel_neighbors)

# print the similar hotels
for hotl in hotel_neighbors:
    print(hotl)




the kimberly hotel
sofitel chicago water tower
the peninsula chicago
lowell hotel
the plaza
the gem hotel chelsea
greenwich hotel
sofitel new york
the carlyle a rosewood hotel
the ritz carlton
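
Conceptually, the neighbour lookup above picks the `k` largest entries in one row of the item similarity matrix, skipping the item itself. A minimal sketch with a made-up 4x4 matrix follows; `nearest_neighbors` is a hypothetical helper, not the library's code.

```python
import heapq


def nearest_neighbors(sim, idx, k):
    """Return the indices of the k items most similar to item idx."""
    others = [(j, sim[idx][j]) for j in range(len(sim)) if j != idx]
    best = heapq.nlargest(k, others, key=lambda pair: pair[1])
    return [j for j, _ in best]


# Made-up symmetric similarity matrix for four items
sim = [
    [1.0, 0.2, 0.9, 0.4],
    [0.2, 1.0, 0.1, 0.8],
    [0.9, 0.1, 1.0, 0.3],
    [0.4, 0.8, 0.3, 1.0],
]

print(nearest_neighbors(sim, 0, 2))
```

The indices here play the role of Surprise's inner ids, which is why the cell above converts the hotel name to an inner id first and maps the neighbours back to names afterwards.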


In [13]:
# most popular hotels? From the original dataset, we want the highest average rating given a minimum number of reviews

raw = pd.read_csv('manually_corrected_data/Hotels_clean_merged.csv')

city = 'san francisco'
sfo = raw[raw['city']==city]
sfo = sfo[sfo['num_reviews']>9]  # keep hotels with at least 10 reviews
sfo_sorted = sfo.sort_values(by='overall_ratingsource', axis=0, ascending=False)
sfo_sorted = sfo_sorted.head(10)
sfo_sorted
# sfo_sorted['hotel_name'].tolist()

The code seems to work well. For an existing user we can make some recommendations; for new users the recommendations fall back to a default. We can see similar hotels to existing ones (maybe there's room for improvement there, but we'd need better data), as well as the best hotels by city.
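
For users without any ratings, the "most popular hotels in the area" fallback could look like the sketch below. It mirrors the pandas cell above in plain Python and reuses its column names (`city`, `num_reviews`, `overall_ratingsource`, `hotel_name`); the function name and the rows are made up for illustration.

```python
def popular_hotels(rows, city, min_reviews=10, n=10):
    """Return up to n hotel names in a city, best average rating first."""
    eligible = [r for r in rows
                if r['city'] == city and r['num_reviews'] >= min_reviews]
    eligible.sort(key=lambda r: r['overall_ratingsource'], reverse=True)
    return [r['hotel_name'] for r in eligible[:n]]


# Made-up rows standing in for the hotel dataset
rows = [
    {'hotel_name': 'hotel a', 'city': 'san francisco', 'num_reviews': 120, 'overall_ratingsource': 4.5},
    {'hotel_name': 'hotel b', 'city': 'san francisco', 'num_reviews': 4, 'overall_ratingsource': 5.0},
    {'hotel_name': 'hotel c', 'city': 'san francisco', 'num_reviews': 35, 'overall_ratingsource': 4.8},
    {'hotel_name': 'hotel d', 'city': 'chicago', 'num_reviews': 80, 'overall_ratingsource': 4.9},
]

print(popular_hotels(rows, 'san francisco'))
```

The minimum-review threshold keeps a hotel with a handful of perfect scores from outranking well-reviewed ones.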

### Future considerations

Additional improvements are out of scope, but here's a very interesting option: using a multi-criteria recommender system. That requires working on the core code instead of using existing packages, but it is rather easy and could possibly make a big difference.