# <font color='darkred'>Yelp Recommendation System</font> 
### Team 14: Hikarun Murase (hikarum) & Tweesha Joshi (tweeshaj)
#### December 13, 2018


## <font color='brightpink'>Table of Contents</font> 
1. [Problem Domain](#Problemdomain)
2. [Data](#data)
    1. [How to Download Data](#download)
    2. [Review Data](#review)
    3. [User Data](#user)
    4. [Business (restaurant) Data](#business)
    5. [Data Cleaning](#datacleaning)
3. [Methodology](#method)
4. [Machine Learning](#ML)
    1. [User-based Recommender](#userbased)
    2. [Item-based Recommender](#itembased)
    3. [Content-based Recommender](#contentbased)
5. [Performance Evaluation](#performance)
6. [Contribution and Future Tasks](#cont)

## <font color='brightpink'>Problem Domain</font> <a name="Problemdomain"></a>

The type of machine learning task we conducted is recommendation. We constructed a personalized recommendation system for Yelp users. More specifically, we created a model that shows, for each user, personalized ratings of the restaurants which the user has not visited, based on other users’ ratings data.  
  
With reference to the inequality P(T, E+ΔE) > P(T, E), the task (T) in our domain is to provide a personalized star rating of restaurants for each user, based on features such as the similarity among users or among restaurants. The experience (E) is user rating data: the star ratings that Yelp users have given to restaurants they visited. In terms of learning, we aimed to demonstrate that adding more user rating data improves the performance measure. The performance (P) is measured by the error of our model's predicted personalized ratings relative to the actual ratings, so a lower error means better performance. In the course of making recommendations, we obtain a predicted rating of every restaurant for every user, which we can compare against the ratings users actually gave. We therefore assessed our work by computing the root mean square error (RMSE) and the mean absolute error (MAE) between predicted and actual ratings, either by splitting the data into training and test sets or by performing k-fold cross-validation. In that sense, our task can also be broadly categorized as prediction.  
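As a minimal sketch of the two error measures named above, the following uses plain Python on a handful of made-up ratings. The function names `rmse` and `mae` and the sample values are illustrative assumptions, not the evaluation code used later in this report.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences of ratings."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences of ratings."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Hypothetical held-out star ratings and the model's predictions for them
actual = [4, 3, 5, 2]
predicted = [3.5, 3.0, 4.0, 3.0]
print('RMSE:', rmse(actual, predicted))
print('MAE: ', mae(actual, predicted))
```

RMSE penalizes large individual errors more heavily than MAE, which is one reason to report both.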
    
Yelp is an Internet search and review service for various businesses, but predominantly restaurants, where users can provide star ratings and text reviews for restaurants they have visited. Potential users (restaurant customers) can make use of the overall ratings and reviews posted by other users when they choose restaurants to visit. However, as of now, Yelp does not offer personalized star ratings of restaurants, or recommendation services, to users like other online platforms such as Amazon or Netflix. According to Yao et al., “the current rating system only provides an average value without considering any personalized information of the individual user. Thus, the efficacy of the rating system is diminished severely.”  In that context, we believe that constructing a recommender system would be useful for Yelp and could contribute to increasing their business value. Additionally, we were specifically interested in performing a recommendation task since we recently learned this in class and wanted to apply our new skills to a real-world business problem.  

## <font color='brightpink'>Data</font> <a name="data"></a>

We used the data available on the Yelp Dataset Challenge website (https://www.yelp.com/dataset/challenge). It consists of several JSON files, including user rating data (review.json), user data (user.json), and restaurant data (business.json). 
  
According to the Yelp Dataset JSON documentation, review.json contains “full review text data including the user_id that wrote the review and the business_id the review is written for.” We initially planned to use only review.json and prepare a data table as shown in the last cells of this section. In the course of our analysis, we found it useful to merge user and restaurant information into the table, so we also used user.json and business.json. 
 

#### <font color='brightpink'>How to Download the Data</font> <a name="download"></a>

*These three JSON files (review.json, user.json, and business.json) are included in the submission folder (Google Drive). If you can download the JSON files from there, you can skip the directions below. Otherwise, follow these directions to download the files directly from Yelp's website.*  

1. Access the Yelp Dataset Challenge page (https://www.yelp.com/dataset/challenge) and click "Download Dataset".
2. Fill out your information and click "Download", then download the .tar file on the left.
3. Extract the downloaded .tar file (using 7-Zip or similar software), then extract the resulting archive once more.
4. The extracted folder contains several JSON files, including review.json, business.json, and user.json.

*If you cannot download these JSON files, you can instead load the CSV files (included in the submission files) in the following chunks.*

In [11]:
import json
from collections import OrderedDict
import pprint
import pandas as pd
import numpy as np
import math  # for sqrt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit

import similarity as sim
import matrix_factorization_utilities as m
# you need to copy similarity.py and matrix_factorization_utilities.py
# into the folder where this .ipynb file exists

In [None]:
# To graders: you do not need to execute this chunk.
# This chunk is necessary for us to display the following plots. 
# You will need to have an account of "plotly" to run the following codes for plotting.
# Therefore, for plotting parts, you can just refer to the HTML version of the report. 
import plotly.plotly as py
import plotly.graph_objs as go

### <font color='brightpink'>Review Data</font> <a name="review"></a>
The review data contains approximately 6 million (5,996,996) rows, one per review. Of its fields, we mainly use three: "user_id" (the Yelp user who wrote the review), "business_id" (the restaurant reviewed), and "stars" (the rating).  

Before any analysis, we performed some exploratory data analysis on this data.

In [12]:
# 1. Review data exploration
# load JSON file and store it into a list
# (this might take a few minutes)
with open('yelp_academic_dataset_review.json',  encoding="utf8") as f: 
    reviews = [json.loads(line) for line in f]
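Loading every field of every review keeps the long review texts in memory even though they are dropped later. A possible alternative, sketched below under the assumption that only a few fields are needed, is to discard the unused fields while reading. The helper name `load_json_lines` is hypothetical and is not used elsewhere in this notebook.

```python
import json

def load_json_lines(path, keep_fields):
    """Read a JSON-lines file, keeping only the listed fields of each record."""
    records = []
    with open(path, encoding="utf8") as f:
        for line in f:
            record = json.loads(line)
            # drop everything except the requested fields (e.g. the review text)
            records.append({key: record[key] for key in keep_fields})
    return records

# Hypothetical usage with the review file loaded above:
# reviews = load_json_lines('yelp_academic_dataset_review.json',
#                           ['review_id', 'user_id', 'business_id', 'stars', 'date'])
```

This does not change the number of rows, only the amount of data held per row.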

In [13]:
# a sample of review data
reviews[0]

{'business_id': 'iCQpiavjjPzJ5_3gPD5Ebg',
 'cool': 0,
 'date': '2011-02-25',
 'funny': 0,
 'review_id': 'x7mDIiDB3jEiPGPHOmDzyw',
 'stars': 2,
 'text': "The pizza was okay. Not the best I've had. I prefer Biaggio's on Flamingo / Fort Apache. The chef there can make a MUCH better NY style pizza. The pizzeria @ Cosmo was over priced for the quality and lack of personality in the food. Biaggio's is a much better pick if youre going for italian - family owned, home made recipes, people that actually CARE if you like their food. You dont get that at a pizzeria in a casino. I dont care what you say...",
 'useful': 0,
 'user_id': 'msQe1u7Z_XuqjGoqhB0J5g'}

In [14]:
# total # of reviews in the dataset
len(reviews)

5996996

In [15]:
# convert into Pandas Data Frame and drop unnecessary columns 
# (this might take more than 10 minutes)
reviews_df = pd.DataFrame.from_dict(reviews)
reviews_df = reviews_df[['review_id', 'user_id', 'business_id', 'stars', 'date']]
reviews_df.head(5)

|  | review_id | user_id | business_id | stars | date |
|---|---|---|---|---|---|
| 0 | x7mDIiDB3jEiPGPHOmDzyw | msQe1u7Z_XuqjGoqhB0J5g | iCQpiavjjPzJ5_3gPD5Ebg | 2 | 2011-02-25 |
| 1 | dDl8zu1vWPdKGihJrwQbpw | msQe1u7Z_XuqjGoqhB0J5g | pomGBqfbxcqPv14c3XH-ZQ | 5 | 2012-11-13 |
| 2 | LZp4UX5zK3e-c5ZGSeo3kA | msQe1u7Z_XuqjGoqhB0J5g | jtQARsP6P-LbkyjbO1qNGg | 1 | 2014-10-23 |
| 3 | Er4NBWCmCD4nM8_p1GRdow | msQe1u7Z_XuqjGoqhB0J5g | elqbBhBfElMNSrjFqW3now | 2 | 2011-02-25 |
| 4 | jsDu6QEJHbwP2Blom1PLCA | msQe1u7Z_XuqjGoqhB0J5g | Ums3gaP2qM3W1XcA5r6SsQ | 5 | 2014-09-05 |


In [10]:
# Save as a CSV file (we can use the csv file next time)
reviews_df.to_csv('reviews_df.csv')

In [16]:
# when loading, use the following code:
reviews_df = pd.read_csv("reviews_df.csv", index_col=0)





We aggregated the review data by "stars" (user ratings). As the counts and the bar chart below show, 5-star ratings are the most common, followed by 4-star ratings. Stars on Yelp appear to be integers from 1 to 5; the single record with 0 stars in the output below is presumably a data error.

In [17]:
# descriptive statistics for 'stars' used to plot graph 
ratings = reviews_df['stars'].value_counts()
ratings

5    2641880
4    1335957
1     858139
3     673206
2     487813
0          1
Name: stars, dtype: int64

In [8]:
# plot of star ratings 
# To graders: you do not need to execute this chunk.
# Please refer to the HTML version of the report. 
ratings_plot = go.Bar(x=ratings.keys(),
                    y=ratings.values,
                     marker=dict(color='rgba(222,45,38,0.5)'))

data = [ratings_plot]

layout = go.Layout(
    title='Distribution of Ratings',
        xaxis = dict(title = 'Star Rating'),
    yaxis = dict(title = 'Number of Ratings')
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

We also aggregated the review data by year to examine trends in Yelp ratings over the last 15 years. From 2004 through 2017, the number of reviews increases every year, which presumably reflects Yelp's growth (as well as the Internet's growth). The count for 2018 is lower, as shown in the bar chart below, presumably because the dataset covers only part of that year.

In [18]:
# descriptive statistics for 'date' (in year)
# used to plot graph below
review_date = reviews_df['date'].apply(lambda x: int(x[:4])).value_counts()
review_date

2017    1194269
2016    1066894
2015     919171
2014     681665
2018     677379
2013     470970
2012     349939
2011     286343
2010     175177
2009      94224
2008      54390
2007      20745
2006       4953
2005        864
2004         13
Name: date, dtype: int64

In [54]:
# To graders: You do not need to execute this chunk.
# Please refer to the HTML version of the report. 
review_plot = go.Bar(x=review_date.keys(),
                    y=review_date.values,
                     marker=dict(color='rgba(0,128,128, 0.8)'))

data = [review_plot]

layout = go.Layout(
    title='Distribution of Ratings by Year',
        xaxis = dict(title = 'Year'),
    yaxis = dict(title = 'Number of Ratings')
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

### <font color='brightpink'>User Data</font> <a name="user"></a>
Next, we explored the user data. The primary key of the data is 'user_id'. User names (presumably first names) are also given in this data. 'review_count' in this data indicates how many times each user has rated a restaurant. There are 1,518,169 unique users in the original data.

In [19]:
# 2. User data exploration
# if you come across MemoryError, it is recommended to restart your machine
with open('yelp_academic_dataset_user.json',  encoding="utf8") as f:
    users = [json.loads(line) for line in f]

In [20]:
# a sample of user data
users[0]

{'average_stars': 2.0,
 'compliment_cool': 0,
 'compliment_cute': 0,
 'compliment_funny': 0,
 'compliment_hot': 0,
 'compliment_list': 0,
 'compliment_more': 0,
 'compliment_note': 0,
 'compliment_photos': 0,
 'compliment_plain': 0,
 'compliment_profile': 0,
 'compliment_writer': 0,
 'cool': 0,
 'elite': 'None',
 'fans': 0,
 'friends': 'None',
 'funny': 0,
 'name': 'Susan',
 'review_count': 1,
 'useful': 0,
 'user_id': 'lzlZwIpuSWXEnNS91wxjHw',
 'yelping_since': '2015-09-28'}

In [21]:
# total # of users in the dataset
len(users)

1518169

In [22]:
# convert into Pandas Data Frame
users_df = pd.DataFrame.from_dict(users)
users_df = users_df[['user_id', 'name', 'review_count', 'average_stars', 'yelping_since']]

In [32]:
# Save as a CSV file (we can use the csv file next time)
users_df.to_csv('users_df.csv')

In [23]:
# when loading, use the following code:
users_df = pd.read_csv("users_df.csv", index_col=0)





In [24]:
users_df.head(5)

|  | user_id | name | review_count | average_stars | yelping_since |
|---|---|---|---|---|---|
| 0 | lzlZwIpuSWXEnNS91wxjHw | Susan | 1 | 2.0 | 2015-09-28 |
| 1 | XvLBr-9smbI0m_a7dXtB7w | Daipayan | 2 | 5.0 | 2015-09-05 |
| 2 | QPT4Ud4H5sJVr68yXhoWFw | Andy | 1 | 4.0 | 2016-07-21 |
| 3 | i5YitlHZpf0B3R0s_8NVuw | Jonathan | 19 | 4.05 | 2014-08-04 |
| 4 | s4FoIXE_LSGviTHBe8dmcg | Shashank | 3 | 3.0 | 2017-06-18 |


The output and line graph below show the number of users who have rated N restaurants or more. Of more than 1.5 million Yelp users, approximately half have rated 5 or more restaurants, while users who have rated 1,000 restaurants or more make up only about 0.1% of all users. For our analysis, we decided to use the 1,365 users who have written 1,000 or more reviews, so that we can provide comprehensive recommendations for them.

In [25]:
# How many users have reviewed N times or more? (N=1,2,5,10,30,100,500,1000)
for n in (1, 2, 5, 10, 30, 100, 500, 1000):
    print('N >= %d' % n, len(users_df[users_df['review_count'] >= n]))

N >= 1 1517743
N >= 2 1233092
N >= 5 793174
N >= 10 522482
N >= 30 221362
N >= 100 68637
N >= 500 6400
N >= 1000 1365


In [53]:
# To graders: You do not need to execute this chunk.
# Please refer to the HTML version of the report. 
thresholds = (1, 5, 10, 30, 100, 500, 1000)
user_ratings_plot = go.Scatter(
    x=thresholds,
    # number of users with at least n reviews, for each threshold n
    y=[len(users_df[users_df['review_count'] >= n]) for n in thresholds],
    # hover labels use ">=" to match the "N or more" counts plotted
    text=['>={} restaurants'.format(n) for n in thresholds],
    mode='lines+markers',
    marker=dict(color='rgba(0,0,255, 0.8)'))
data = [user_ratings_plot]
layout = go.Layout(
    title='Number of Users who have Rated N Restaurants or More',
    xaxis = dict(title = 'Number of Restaurants Rated'),
    yaxis = dict(title = 'Number of Users')
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)



In [26]:
# we will choose only users who have reviewed 1,000 times or more.
users_df2 = users_df[users_df['review_count'] >=1000]
len(users_df2) # display the size (# of users who have reviewed 1000 times or more)

1365

### <font color='brightpink'>Business (restaurant) Data</font> <a name="business"></a>
Finally, we explored the business (restaurant) data. As with the user data, restaurant names are provided. We can identify the rough location of each restaurant because location information such as the state is given. As far as we could confirm, the data mainly contains restaurants across the United States and Canada. "business_id" serves as the primary key for restaurants, and "review_count" is the number of reviews that users have posted for each restaurant. There are 188,593 unique restaurants in the original data.

In [27]:
# 3. Restaurant (business) data exploration
with open('yelp_academic_dataset_business.json',  encoding="utf8") as f:
    businesses = [json.loads(line) for line in f]

In [28]:
# a sample of business data
businesses[0]

{'address': '1314 44 Avenue NE',
 'attributes': {'BikeParking': 'False',
  'BusinessAcceptsCreditCards': 'True',
  'BusinessParking': "{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}",
  'GoodForKids': 'True',
  'HasTV': 'True',
  'NoiseLevel': 'average',
  'OutdoorSeating': 'False',
  'RestaurantsAttire': 'casual',
  'RestaurantsDelivery': 'False',
  'RestaurantsGoodForGroups': 'True',
  'RestaurantsPriceRange2': '2',
  'RestaurantsReservations': 'True',
  'RestaurantsTakeOut': 'True'},
 'business_id': 'Apn5Q_b6Nz61Tq4XzPdf9A',
 'categories': 'Tours, Breweries, Pizza, Restaurants, Food, Hotels & Travel',
 'city': 'Calgary',
 'hours': {'Friday': '11:0-21:0',
  'Monday': '8:30-17:0',
  'Saturday': '11:0-21:0',
  'Thursday': '11:0-21:0',
  'Tuesday': '11:0-21:0',
  'Wednesday': '11:0-21:0'},
 'is_open': 1,
 'latitude': 51.0918130155,
 'longitude': -114.031674872,
 'name': 'Minhas Micro Brewery',
 'neighborhood': '',
 'postal_code': 'T2E 6L6',
 'review_

In [29]:
# total # of restaurants in the dataset
len(businesses)

188593

In [30]:
# convert into Pandas Data Frame
businesses_df_original = pd.DataFrame.from_dict(businesses)
businesses_df = businesses_df_original[['business_id', 'name', 'review_count', 'stars', 'state']]

In [24]:
# Save as a CSV file (we can use the csv file next time)
businesses_df_original.to_csv('businesses_df_original.csv')
businesses_df.to_csv('businesses_df.csv')

In [31]:
# when loading, use the following code:
businesses_df_original = pd.read_csv("businesses_df_original.csv", index_col=0)
businesses_df = pd.read_csv("businesses_df.csv", index_col=0)

In [32]:
businesses_df.head(5)

|  | business_id | name | review_count | stars | state |
|---|---|---|---|---|---|
| 0 | Apn5Q_b6Nz61Tq4XzPdf9A | Minhas Micro Brewery | 24 | 4.0 | AB |
| 1 | AjEbIBw6ZFfln7ePHha9PA | CK'S BBQ & Catering | 3 | 4.5 | NV |
| 2 | O8S5hYJ1SMc8fA4QBtVujA | La Bastringue | 5 | 4.0 | QC |
| 3 | bFzdJJ3wp3PZssNEsyU23g | Geico Insurance | 8 | 1.5 | AZ |
| 4 | 8USyCYqpScwiNEb58Bt6CA | Action Engine | 4 | 2.0 | AB |


The following code shows the number of restaurants that have been reviewed N times or more. More than half of the restaurants have 5 or more reviews, slightly under half (about 47%) have 10 or more, and restaurants with 500 or more reviews make up less than 1% of the data.

In [33]:
# How many restaurants have review counts of N or more? (N=1,2,5,10,30,100,500,1000)
for n in (1, 2, 5, 10, 30, 100, 500, 1000):
    print('N >= %d' % n, len(businesses_df[businesses_df['review_count'] >= n]))

N >= 1 188593
N >= 2 188593
N >= 5 135578
N >= 10 89041
N >= 30 39652
N >= 100 12058
N >= 500 1099
N >= 1000 309


In [52]:
# To graders: You do not need to execute this chunk.
# Please refer to the HTML version of the report. 
thresholds = (1, 5, 10, 30, 100, 500, 1000)
rest_ratings_plot = go.Scatter(
    x=thresholds,
    # number of restaurants with at least n reviews, for each threshold n
    y=[len(businesses_df[businesses_df['review_count'] >= n]) for n in thresholds],
    # hover labels use ">=" to match the "N or more" counts plotted
    text=['>={} reviews'.format(n) for n in thresholds],
    mode='lines+markers',
    marker=dict(color='rgba(0,255,0, 0.8)'))
data = [rest_ratings_plot]
layout = go.Layout(
    title='Number of Restaurants that are Reviewed N times or More',
    xaxis = dict(title = 'Number of Reviews'),
    yaxis = dict(title = 'Number of Restaurants')
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)


Next, we aggregated restaurants by state. Most of the restaurants in the data are located in Arizona, Nevada, Ontario (Canada), North Carolina, Ohio, and Pennsylvania.

In [34]:
# Distribution of states among restaurants
businesses_df['state'].value_counts().head(25)

AZ     56495
NV     35688
ON     32393
NC     14359
OH     13664
PA     10966
QC      8756
AB      7670
WI      5042
IL      1937
SC       770
NYK      163
NI       134
IN       101
OR        72
BY        60
ST        45
CO        43
C         34
HE        32
XGM       23
NLK       23
RP        19
NY        19
01        11
Name: state, dtype: int64

In [35]:
# Top 10 states/provinces, each with more than 1,000 restaurants on Yelp (we used the top 6 in our analysis)
# used to plot graph below
bus_state = businesses_df['state'].value_counts().head(10)

In [55]:
# To graders: You do not need to execute this chunk.
# Please refer to the HTML version of the report. 
bus_state_plot = go.Bar(x=bus_state.keys(),
                    y=bus_state.values)

data = [bus_state_plot]

layout = go.Layout(
    title='Top 10 States with the Most Restaurants',
    xaxis = dict(title = 'State'),
    yaxis = dict(title = 'Number of Restaurants (on Yelp)')
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

In [36]:
# We will choose restaurants which have 1,000 or more reviews and are located
# in AZ, NV, ON (Canada), NC, OH, or PA (the top 6 states)
top_states = ['AZ', 'NV', 'ON', 'NC', 'OH', 'PA']
businesses_df2 = businesses_df[(businesses_df['review_count'] >= 1000) &
                               businesses_df['state'].isin(top_states)]
businesses_df2.head(10)
len(businesses_df2) # display the size (# of restaurants)

304

### <font color='brightpink'>Data Join and Data Cleaning</font> <a name="datacleaning"></a>
For our project, we filtered the roughly 6 million reviews using the following three criteria:  
1. Users who have reviewed 1,000 or more restaurants
2. Restaurants which have 1,000 or more reviews
3. Restaurants which are located in Arizona, Nevada, Ontario, North Carolina, Ohio, or Pennsylvania

One of the reasons to do this is to avoid a memory error. We first tried to use the entire original data; however, we could not even merge and clean the data due to a memory error. To avoid this issue, it was necessary for us to reduce the size of the data to use.   
  
Another reason is to increase the proportion of non-null values in the user-item matrix, which we will use to make recommendations (predictions) for users. As we saw above, some users have rated only a few restaurants, and likewise some restaurants have been rated by only a few users. Such data contributes little to a recommendation because the data contains no sufficiently similar users or restaurants. Thus, it is more efficient to remove such users and restaurants from the data.  
  
To achieve this, we merged the review data with the restaurant data and user data in the following chunk.

In [37]:
# Perform the following inner joins:
#    Review JOIN Restaurant_df2 USING (business_id)
#           JOIN User_df2       USING (user_id)
# The output contains reviews from the entire period of the data (2004 to 2018; no date filter is applied)
# for restaurants which have 1,000 or more reviews in AZ, NV, ON (Canada), NC, OH, and PA
# by users who have reviewed 1,000 times or more 

review_business = pd.merge(reviews_df, businesses_df2, on='business_id', how='inner', suffixes=('_review', '_business'))
print(len(review_business))

review_business_user = pd.merge(review_business, users_df2, on='user_id', how='inner', suffixes=('_business', '_user'))
print(len(review_business_user))

review_business_user.head(10)

531882
14711


|  | review_id | user_id | business_id | stars_review | date | name_business | review_count_business | stars_business | state | name_user | review_count_user | average_stars | yelping_since |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MFILe8F1qzSK_22zQSukxw | LyrAjv8V6HWkceuTB4Xtkw | iCQpiavjjPzJ5_3gPD5Ebg | 4 | 2012-05-14 | Secret Pizza | 4078 | 4.0 | NV | Duke | 1170 | 3.76 | 2008-08-18 |
| 1 | WaCsltXdV1ag_FKY3tR7jQ | LyrAjv8V6HWkceuTB4Xtkw | f4x1YBxkLrZg652xt2KR5g | 4 | 2011-01-17 | Hash House A Go Go | 5382 | 4.0 | NV | Duke | 1170 | 3.76 | 2008-08-18 |
| 2 | uSxPoQQ7xQj2CH6Oyxu2Zg | LyrAjv8V6HWkceuTB4Xtkw | DkYS3arLOhA8si5uUEmHOw | 4 | 2011-04-27 | Earl of Sandwich | 4981 | 4.5 | NV | Duke | 1170 | 3.76 | 2008-08-18 |
| 3 | B_qmwryaFoXNHUSqtN3pKA | LyrAjv8V6HWkceuTB4Xtkw | K7lWdNUhCbcnEvI0NhGewg | 4 | 2011-07-13 | Wicked Spoon | 6446 | 3.5 | NV | Duke | 1170 | 3.76 | 2008-08-18 |
| 4 | Zodfk1QxukafgD3-jIGbsQ | LyrAjv8V6HWkceuTB4Xtkw | NvKNe9DnQavC9GstglcBJQ | 3 | 2012-07-30 | Grand Lux Cafe | 2670 | 4.0 | NV | Duke | 1170 | 3.76 | 2008-08-18 |
| 5 | pc9hrm9LVKPCe41j-CcdvA | LyrAjv8V6HWkceuTB4Xtkw | MpmFFw0GE_2iRFPdsRpJbA | 5 | 2011-07-12 | XS Nightclub | 2915 | 4.0 | NV | Duke | 1170 | 3.76 | 2008-08-18 |
| 6 | mispDRX2DZTZuGGOznkNIg | LyrAjv8V6HWkceuTB4Xtkw | ujHiaprwCQ5ewziu0Vi9rw | 3 | 2010-05-20 | The Buffet at Bellagio | 4091 | 3.5 | NV | Duke | 1170 | 3.76 | 2008-08-18 |
| 7 | cUo2TvljZu3eo0Sbu0vNXQ | LyrAjv8V6HWkceuTB4Xtkw | rcaPajgKOJC2vo_l3xa42A | 3 | 2016-08-04 | Bouchon at the Venezia Tower | 3743 | 4.0 | NV | Duke | 1170 | 3.76 | 2008-08-18 |
| 8 | HddoeXBzmLBvrNpljFZ8Ag | LyrAjv8V6HWkceuTB4Xtkw | LNGBEEelQx4zbfWnlc66cw | 5 | 2011-09-27 | Studio B Buffet | 2155 | 4.0 | NV | Duke | 1170 | 3.76 | 2008-08-18 |
| 9 | 9cykx5-k4UiEAid_3LhF4g | LyrAjv8V6HWkceuTB4Xtkw | 7fxebHYUwIF6CakxSr70iQ | 4 | 2010-08-27 | Carnevino | 1347 | 3.5 | NV | Duke | 1170 | 3.76 | 2008-08-18 |


The following table displays the first ten rows of the merged and cleaned data. To make a recommendation, we will use "review_id", "user_id", "business_id", and "stars."

In [38]:
# data cleaning (drop unnecessary columns)
data = review_business_user[['review_id','user_id', 'business_id', 'stars_review', 'name_business', 'state', 'review_count_business','name_user','review_count_user']]
data.columns = ['review_id','user_id', 'business_id', 'stars', 'business_name', 'state', 'business_review_count', 'user_name','user_review_count']
print('Data size is', len(data))
data.head(10)

Data size is 14711


|  | review_id | user_id | business_id | stars | business_name | state | business_review_count | user_name | user_review_count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | MFILe8F1qzSK_22zQSukxw | LyrAjv8V6HWkceuTB4Xtkw | iCQpiavjjPzJ5_3gPD5Ebg | 4 | Secret Pizza | NV | 4078 | Duke | 1170 |
| 1 | WaCsltXdV1ag_FKY3tR7jQ | LyrAjv8V6HWkceuTB4Xtkw | f4x1YBxkLrZg652xt2KR5g | 4 | Hash House A Go Go | NV | 5382 | Duke | 1170 |
| 2 | uSxPoQQ7xQj2CH6Oyxu2Zg | LyrAjv8V6HWkceuTB4Xtkw | DkYS3arLOhA8si5uUEmHOw | 4 | Earl of Sandwich | NV | 4981 | Duke | 1170 |
| 3 | B_qmwryaFoXNHUSqtN3pKA | LyrAjv8V6HWkceuTB4Xtkw | K7lWdNUhCbcnEvI0NhGewg | 4 | Wicked Spoon | NV | 6446 | Duke | 1170 |
| 4 | Zodfk1QxukafgD3-jIGbsQ | LyrAjv8V6HWkceuTB4Xtkw | NvKNe9DnQavC9GstglcBJQ | 3 | Grand Lux Cafe | NV | 2670 | Duke | 1170 |
| 5 | pc9hrm9LVKPCe41j-CcdvA | LyrAjv8V6HWkceuTB4Xtkw | MpmFFw0GE_2iRFPdsRpJbA | 5 | XS Nightclub | NV | 2915 | Duke | 1170 |
| 6 | mispDRX2DZTZuGGOznkNIg | LyrAjv8V6HWkceuTB4Xtkw | ujHiaprwCQ5ewziu0Vi9rw | 3 | The Buffet at Bellagio | NV | 4091 | Duke | 1170 |
| 7 | cUo2TvljZu3eo0Sbu0vNXQ | LyrAjv8V6HWkceuTB4Xtkw | rcaPajgKOJC2vo_l3xa42A | 3 | Bouchon at the Venezia Tower | NV | 3743 | Duke | 1170 |
| 8 | HddoeXBzmLBvrNpljFZ8Ag | LyrAjv8V6HWkceuTB4Xtkw | LNGBEEelQx4zbfWnlc66cw | 5 | Studio B Buffet | NV | 2155 | Duke | 1170 |
| 9 | 9cykx5-k4UiEAid_3LhF4g | LyrAjv8V6HWkceuTB4Xtkw | 7fxebHYUwIF6CakxSr70iQ | 4 | Carnevino | NV | 1347 | Duke | 1170 |


The following table summarizes the number of unique values in each column of our data table. After merging, the data contains 14,711 reviews by 1,175 unique users of 304 unique restaurants. No restaurants in Ohio appear in the final data: all 304 restaurants selected above are still present and they span only five states, so no Ohio restaurant met the 1,000-review threshold.

In [39]:
# Show the # of unique users, businesses, and reviews 
data[['user_id', 'business_id', 'review_id','state']].describe()

|  | user_id | business_id | review_id | state |
|---|---|---|---|---|
| count | 14711 | 14711 | 14711 | 14711 |
| unique | 1175 | 304 | 14711 | 5 |
| top | PKEzKWv_FktMm2mGPjwd0Q | FaHADZARwnY4yvlvpnsfGA | lmRSI0gSBt6Gtb7tldPdAg | NV |
| freq | 159 | 371 | 1 | 13180 |
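To illustrate how sparse the resulting user-item matrix still is, the sketch below computes the fraction of filled cells from the counts in the summary above. It assumes each review corresponds to a distinct user-restaurant pair, which may slightly overstate the density if some users reviewed the same restaurant more than once. The function name `matrix_density` is hypothetical.

```python
def matrix_density(n_ratings, n_users, n_items):
    """Fraction of user-item cells that contain an observed rating."""
    return n_ratings / (n_users * n_items)

# Counts taken from the summary table above
density = matrix_density(n_ratings=14711, n_users=1175, n_items=304)
print('{:.1%} of the user-item cells are filled'.format(density))
```

Even after filtering, only a few percent of the cells hold a rating, which is why some users and restaurants end up with no usable neighbors.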


In [40]:
# Show a distribution of states
data['state'].value_counts()

NV    13180
AZ     1159
NC      242
ON       66
PA       64
Name: state, dtype: int64

In [26]:
# Save as a CSV file (we can use the csv file next time)
data.to_csv('data_final.csv')

In [41]:
# when loading, use the following code:
data = pd.read_csv("data_final.csv", index_col=0)

In [42]:
data = data.reset_index(drop=True) # need to reset index for creating a dictionary

Now, as described above, we define "E" and "ΔE" (Delta E). To obtain E, we intentionally drop half of the data obtained above at random. Since the entire data we decided to use contains 14,711 reviews, E + ΔE corresponds to the entire data. Therefore, the size of E is 7,356 reviews, and the size of ΔE is 7,355 reviews. We will see whether increasing the data (i.e. using "E + ΔE" instead of "E") improves the performance.

In [43]:
# Define "E" and "Delta E"
# We will use "train_test_split" function

# "E" is a subset of "data"
# We will intentionally extract a random half of the samples from data (14,711 samples)
data_e, _ = train_test_split(data, train_size=0.5, random_state=95885)
data_e.reset_index(drop=True, inplace=True) # need to reset index for the following chunk (creating dictionary)

# "E + DeltaE" equals "data"





## <font color='brightpink'>Methodology</font> <a name="method"></a>

Using the cleaned table (combinations of user_id, business_id, and rating), we created a user rating matrix in which the rows represent users (user_id) and the columns represent restaurants (business_id). After that, we constructed user-based, item-based, and content-based recommendation systems.   
  
For the user-based and item-based recommendation systems, we used Euclidean distance and the Pearson correlation coefficient to calculate similarity. Then, we computed similarity-weighted average ratings for each user or item (restaurant) and displayed recommendations for a given user.   
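The two similarity measures can be sketched as follows for a nested dictionary of the form `{user: {restaurant: stars}}`. This is an illustration only: the functions in the `similarity` module used in this notebook may differ in details, and the `1 / (1 + distance)` conversion, the handling of users with no shared restaurants, and the toy ratings are all assumptions.

```python
import math

def euclidean_similarity(prefs, a, b):
    """Similarity in (0, 1] derived from Euclidean distance over co-rated items."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    if not shared:
        return 0.0
    dist = math.sqrt(sum((prefs[a][i] - prefs[b][i]) ** 2 for i in shared))
    return 1.0 / (1.0 + dist)

def pearson_similarity(prefs, a, b):
    """Pearson correlation of two users' ratings over co-rated items."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    n = len(shared)
    if n == 0:
        return 0.0
    mean_a = sum(prefs[a][i] for i in shared) / n
    mean_b = sum(prefs[b][i] for i in shared) / n
    cov = sum((prefs[a][i] - mean_a) * (prefs[b][i] - mean_b) for i in shared)
    var_a = sum((prefs[a][i] - mean_a) ** 2 for i in shared)
    var_b = sum((prefs[b][i] - mean_b) ** 2 for i in shared)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined when one user's ratings are constant
    return cov / math.sqrt(var_a * var_b)

# Hypothetical ratings: {user: {restaurant: stars}}
prefs = {
    'u1': {'r1': 5, 'r2': 3, 'r3': 4},
    'u2': {'r1': 4, 'r2': 2, 'r3': 3},
    'u3': {'r1': 1, 'r2': 5},
}
print(euclidean_similarity(prefs, 'u1', 'u2'))
print(pearson_similarity(prefs, 'u1', 'u2'))  # u2 rates like u1, one star lower
print(pearson_similarity(prefs, 'u1', 'u3'))  # u3 has the opposite preference
```

In this toy data, u2 rates consistently one star below u1, so the Pearson coefficient treats them as perfectly similar while the Euclidean measure does not. This is the main practical difference between the two measures.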
  
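For what we call the content-based recommendation system, we applied matrix factorization to find latent features and compute predicted ratings for all users and all items. (Matrix factorization is usually classified as a model-based collaborative filtering technique rather than a content-based one; we keep the label "content-based" throughout this report.) After that, we made recommendations for a given user based on the predicted ratings.   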
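The idea of matrix factorization can be sketched with a small stochastic gradient descent loop over the observed ratings. This is not the `matrix_factorization_utilities` module used in this notebook; the function names, hyperparameters (`k`, `steps`, `lr`, `reg`), and toy ratings below are assumptions chosen only for illustration.

```python
import random

def factorize(ratings, n_users, n_items, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Learn user factors P and item factors Q from (user, item, rating) triples by SGD."""
    rng = random.Random(seed)
    P = [[rng.uniform(0, 1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0, 1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # gradient step on squared error with L2 regularization
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Predicted rating is the dot product of user and item factor vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

# Hypothetical toy data: (user index, restaurant index, stars)
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
P, Q = factorize(ratings, n_users=3, n_items=3)
print(round(predict(P, Q, 0, 0), 2))  # an observed cell; should be close to 5
print(round(predict(P, Q, 0, 2), 2))  # an unobserved cell, filled in by the model
```

Because every user and every item receives a factor vector, a prediction exists for every cell of the matrix, including cells with no observed rating.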
  
Additionally, we tried an ensemble method to combine these recommendation systems. More specifically, for each user-item pair we averaged the ratings predicted by the user-based recommender (using Euclidean distance and the Pearson coefficient, respectively), the item-based recommender (using Euclidean distance and the Pearson coefficient, respectively), and the content-based recommender, and used the average as the final predicted rating.  
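A minimal sketch of this averaging step is shown below. How missing predictions are treated is an assumption here: the sketch simply averages whichever of the five predictions are available for a pair. The function name `ensemble_average` and the sample values are hypothetical.

```python
def ensemble_average(predictions):
    """Average the available predictions for one user-item pair; None marks a missing one."""
    available = [p for p in predictions if p is not None]
    if not available:
        return None
    return sum(available) / len(available)

# Hypothetical predictions for one user-restaurant pair, in the order:
# user-based (Euclidean), user-based (Pearson), item-based (Euclidean),
# item-based (Pearson), matrix factorization
print(ensemble_average([4.0, None, 3.5, 4.5, 4.0]))
```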
  
As described above, we tested our algorithms by comparing predicted ratings with actual ratings for existing ratings. To achieve this, we performed 10-fold cross-validation. In each fold, we held out 10% of the data as a test dataset. Then, we made predictions (recommendations for all users) with each method using the remaining 90% of the data (i.e. the training data), and we compared the predicted ratings to the actual ratings in the test data.
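The fold construction can be sketched in plain Python as follows. This is an illustration of the general k-fold idea under assumed names (`k_fold_indices`) and a toy sample size, not necessarily the splitting code used in this notebook.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # deal the shuffled indices into k folds of (nearly) equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_indices(n_samples=20, k=10):
    print(len(train), len(test))  # 18 training rows and 2 test rows per fold
```

Each row is used for testing exactly once, so the error averaged over the folds reflects every rating in the data.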
  
Even though we removed users and restaurants with few ratings, some users and restaurants still have no similar users or restaurants, so it was not possible to predict ratings for them. Therefore, the user-based and item-based recommenders did not produce predictions for every user. (We were able to make predictions for approximately 60% of the cells of the user-item matrix.) On the other hand, the content-based recommender produced predicted ratings for all users and all items, since factorization yields a value for every cell.     
  
Following Yao et al. and Sawant, we used RMSE and MAE as the performance measures. We also examined whether performance improves when more data is added (i.e. whether using data "E + ΔE" yields better performance than using data "E") to confirm that learning has happened.
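
For reference, the two measures can be written as follows. The notation is our own shorthand: $T$ is the set of test (user, restaurant) pairs for which a prediction exists, $r_{ui}$ is the actual rating, and $\hat{r}_{ui}$ is the predicted rating.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \left(\hat{r}_{ui} - r_{ui}\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{|T|} \sum_{(u,i) \in T} \left|\hat{r}_{ui} - r_{ui}\right|
```

RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging.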
  

## <font color='brightpink'>Performing Machine Learning</font> <a name="ML"></a>

### <font color='brightpink'>(1) User-based Recommendation</font> <a name="userbased"></a>

In [44]:
# Prepare a dictionary for the similarity calculation.
# The dictionary has the form {user: {item: rating, item: rating, ...}, ...}
# so that we can apply the code provided in class.
def convert_dict(data):
    data_dict = {}
    data_dict_size = 0
    # iterate over the three columns together instead of looking up each row with .loc
    for user, item, stars in zip(data['user_id'], data['business_id'], data['stars']):
        data_dict.setdefault(user, {})[item] = stars
        data_dict_size += 1
    print('# of users in the data', len(data_dict))
    print('# of reviews in the data:', data_dict_size)
    return data_dict

In [45]:
# from lecture note
def user_based_recommendations( prefs, person, similarity=sim.pearson ):
    totals={}
    simSums={}
    for other in prefs:
        # don't compare me to myself
        if other==person: continue
        # named sim_score so the local value does not shadow the sim module
        sim_score=similarity(prefs,person,other)
  
        # ignore scores of zero or lower
        if sim_score<=0: continue
        for item in prefs[other]:
            # only score restaurants this user has not rated yet
            if item not in prefs[person] or prefs[person][item]==0:
                # Similarity * Score
                totals.setdefault(item,0)
                totals[item]+=prefs[other][item]*sim_score
                # Sum of similarities
                simSums.setdefault(item,0)
                simSums[item]+=sim_score
  
    # Create the normalized list
    rankings=[(total/simSums[item],item) for item,total in totals.items()]
  
    # Return the sorted list
    rankings.sort()
    rankings.reverse()
    return rankings

In [46]:
# apply the function of converting a dictionary
data_dict = convert_dict(data)

# of users in the data 1175
# of reviews in the data: 14711


#### A Recommendation for a certain user
First, we made a user-based recommendation for a single user, as follows. The first result is the raw recommendation, which contains only predicted ratings and business IDs (`business_id`); the second is a more readable version in which restaurant information (name, category, city, and state) is joined.

In [47]:
# Recommendation for a certain user (Top-N items)
user_based_recommendations(data_dict,'bLbSNkLggFnqwNNzzq-Ijw', sim.pearson)[:5]

[(5.0, 'JLbgvGM4FXh9zNP4O5ZWjQ'),
 (4.945856331657512, 'GmdujALb1Nq2RHGr7jhCaA'),
 (4.914212833329477, 'Xg5qEQiB-7L6kGJ5F4K3bQ'),
 (4.866936986736669, 'e4NQLZynhSmvwl38hC4m-A'),
 (4.859813199809862, 'vHz2RLtfUMVRPFmd7VBEHA')]

In [48]:
# Display a user's name and items' names
def user_name(user_id):
    return list(users_df[users_df['user_id'] == user_id]['name'])[0]

def item_name(item_id):
    return list(businesses_df[businesses_df['business_id'] == item_id]['name'])[0]

def item_category(item_id):
    return list(businesses_df_original[businesses_df_original['business_id'] == item_id]['categories'])[0]

def item_location(item_id):
    city = list(businesses_df_original[businesses_df_original['business_id'] == item_id]['city'])[0]
    state = list(businesses_df_original[businesses_df_original['business_id'] == item_id]['state'])[0]
    return '%s, %s' % (city, state)

In [49]:
# Let's display top-5 recommendations for user 'bLbSNkLggFnqwNNzzq-Ijw'
test_user = 'bLbSNkLggFnqwNNzzq-Ijw'

user_rec = pd.DataFrame.from_dict(user_based_recommendations(data_dict,test_user))

print('Recommendations for user %s' %user_name(test_user))
for i,j in user_based_recommendations(data_dict,test_user)[:5]:
    print('%s: %s (%s; %s)' %(round(i,2), item_name(j), item_category(j), item_location(j)))

Recommendations for user Stefany
5.0: Meat & Potatoes (Meat Shops, Food, Gastropubs, Restaurants, Specialty Food, American (New), Steakhouses; Pittsburgh, PA)
4.95: Hwaro (Buffets, Barbeque, Restaurants, Korean; Spring Valley, NV)
4.91: Little Miss BBQ (Barbeque, Restaurants; Phoenix, AZ)
4.87: Backyard Taco (Restaurants, Food, Mexican; Mesa, AZ)
4.86: Gordon Ramsay Hell's Kitchen (Restaurants, American (New), Breakfast & Brunch, Burgers; Las Vegas, NV)


#### Performing 10-fold Cross-Validation
Second, to test the performance of the user-based recommendation system, we performed 10-fold cross-validation for each similarity measure.

In [50]:
# Define a function to perform k-fold cross-validation (user-based recommendation)
def user_based_cv(data, k=10, test_size=0.1, similarity=sim.pearson):
    matrix_size = len(data['user_id'].unique()) * len(data['business_id'].unique())
    rmse_values = []
    mae_values = []
    rmse2_values = []
    mae2_values = []
    cv_num = 0
    ss = ShuffleSplit(n_splits = k, test_size = test_size, random_state = 95885)
    
    for train_index, test_index in ss.split(data):
        # below is each validation process (in this for loop, we will iterate this process k times)
        cv_num += 1
    
        data_train, data_test= data.loc[train_index], data.loc[test_index] # split data into training and test
        data_train.reset_index(drop=True, inplace=True) 
        data_test.reset_index(drop=True, inplace=True) 
    
        ### compute similarity, prediction (recommendation) based on training data
        # create a dictionary for training data ({user: {(item: ratings), (item, ratings), ...}, ...})
        train_dict = {}
        train_dict_size = 0
        for i in range(len(data_train)):
            if data_train['user_id'].loc[i] not in train_dict:
                train_dict[data_train['user_id'].loc[i]] = {data_train['business_id'].loc[i]:
                                                            data_train['stars'].loc[i]}
                train_dict_size += 1
            else:
                train_dict[data_train['user_id'].loc[i]].update({data_train['business_id'].loc[i]:
                                                                 data_train['stars'].loc[i]})
                train_dict_size += 1
        
        # create a dictionary for predicted data (recommendations for all users) 
        pred_dict = {}
        pred_dict_size = 0
        for user in train_dict.keys():
            # skip user_ids that start with '_' to avoid a math domain error
            if user.startswith('_'):
                continue
            # compute the recommendations once per user and reuse them
            for pred, item in user_based_recommendations(train_dict, user, similarity):
                pred_dict.setdefault(user, {})[item] = pred
                pred_dict_size += 1
        
        ### compute RMSE and MAE by comparing predicted ratings to test data
        # create a dictionary for test data
        test_dict = {}
        test_dict_size = 0
        for i in range(len(data_test)):
            if data_test['user_id'].loc[i] not in test_dict:
                test_dict[data_test['user_id'].loc[i]] = {data_test['business_id'].loc[i]:
                                                          data_test['stars'].loc[i]}
                test_dict_size += 1
            else:
                test_dict[data_test['user_id'].loc[i]].update({data_test['business_id'].loc[i]:
                                                               data_test['stars'].loc[i]})
                test_dict_size += 1

        # compare test data to predicted data and calculate errors
        sum_squared_error = 0
        sum_absolute_error = 0
        sum_squared_error2 = 0
        sum_absolute_error2 = 0
        error_count = 0
        count = 0
        for user in test_dict.keys():
            if user in pred_dict.keys():
                for item in test_dict[user].keys():
                    if item in pred_dict[user].keys():
                        sum_squared_error += (pred_dict[user][item] - test_dict[user][item]) ** 2
                        sum_absolute_error += abs(pred_dict[user][item] - test_dict[user][item])
                        # Adjusted SE and AE (predicted ratings will be rounded because actual ratings are integer)
                        sum_squared_error2 += (round(pred_dict[user][item],0) - test_dict[user][item]) ** 2
                        sum_absolute_error2 += abs(round(pred_dict[user][item],0) - test_dict[user][item])
                        if (round(pred_dict[user][item],0) - test_dict[user][item]) != 0:
                            error_count += 1
                        count += 1

        rmse = math.sqrt(sum_squared_error / count)
        mae = sum_absolute_error / count
        rmse2 = math.sqrt(sum_squared_error2 / count)
        mae2 = sum_absolute_error2 / count
        print(cv_num)
        print('User-Item Matrix size', matrix_size)
        print('Review data size collected for calculating prediction (training data)', train_dict_size)
        print('Review data size predicted (recommendations made)', pred_dict_size)
        print('Review data size for testing (test data size)', test_dict_size)
        print('Review data size used for testing (test data compared to predictions): ', count)
        print('# of errors: ', error_count)
        print ('RMSE: ', rmse)
        print ('MAE: ', mae)
        print('Adjusted RMSE: ', rmse2)
        print('Adjusted MAE: ', mae2)
        print ()
        rmse_values.append(rmse)
        mae_values.append(mae)
        rmse2_values.append(rmse2)
        mae2_values.append(mae2)

    mean_rmse = np.mean(rmse_values)
    mean_mae = np.mean(mae_values)
    mean_rmse2 = np.mean(rmse2_values)
    mean_mae2 = np.mean(mae2_values)
    print ('Overall performance')
    print ('Mean RMSE: %s' % round(mean_rmse,4))
    print ('Mean MAE: %s' % round(mean_mae,4))
    print ('Mean Adjusted RMSE: %s' % round(mean_rmse2,4))
    print ('Mean Adjusted MAE: %s' % round(mean_mae2,4))
    return mean_rmse, mean_mae

#### Performing 10-fold Cross-Validation - using Euclidean Distance

In [51]:
# k-fold Cross Validation - Euclidean Distance
# "E"
user_based_e1_r, user_based_e1_a = user_based_cv(data_e, k=10, test_size=0.1, similarity=sim.euclidean_distance)

1
User-Item Matrix size 319504
Review data size collected for calculating prediction (training data) 6619
Review data size predicted (recommendations made) 254494
Review data size for testing (test data size) 736
Review data size used for testing (test data compared to predictions):  710
# of errors:  421
RMSE:  0.9251828014012471
MAE:  0.73634669304642
Adjusted RMSE:  0.982236596953283
Adjusted MAE:  0.7112676056338029

2
User-Item Matrix size 319504
Review data size collected for calculating prediction (training data) 6619
Review data size predicted (recommendations made) 255773
Review data size for testing (test data size) 736
Review data size used for testing (test data compared to predictions):  700
# of errors:  399
RMSE:  0.8917628758392866
MAE:  0.6998026145389744
Adjusted RMSE:  0.933503385868831
Adjusted MAE:  0.6628571428571428

3
User-Item Matrix size 319504
Review data size collected for calculating prediction (training data) 6619
Review data size predicted (recommendation

In [60]:
# k-fold Cross Validation - Euclidean Distance
# "E + Delta E"
user_based_e2_r, user_based_e2_a = user_based_cv(data, k=10, test_size=0.1, similarity=sim.euclidean_distance)

1
User-Item Matrix size 357200
Review data size collected for calculating prediction (training data) 13239
Review data size predicted (recommendations made) 324072
Review data size for testing (test data size) 1472
Review data size used for testing (test data compared to predictions):  1442
# of errors:  779
RMSE:  0.9004538702625948
MAE:  0.6947860735385905
Adjusted RMSE:  0.9220672667342669
Adjusted MAE:  0.6352288488210819

2
User-Item Matrix size 357200
Review data size collected for calculating prediction (training data) 13239
Review data size predicted (recommendations made) 323218
Review data size for testing (test data size) 1472
Review data size used for testing (test data compared to predictions):  1439
# of errors:  823
RMSE:  0.9273070680586617
MAE:  0.7247329608532421
Adjusted RMSE:  0.9526671531316754
Adjusted MAE:  0.6768589298123697

3
User-Item Matrix size 357200
Review data size collected for calculating prediction (training data) 13239
Review data size predicted (rec

#### Performing 10-fold Cross-Validation - using Pearson Coefficient

In [52]:
# k-fold Cross Validation - Pearson Coefficient
# "E"
user_based_p1_r, user_based_p1_a = user_based_cv(data_e, k=10, test_size=0.1, similarity=sim.pearson)

1
User-Item Matrix size 319504
Review data size collected for calculating prediction (training data) 6619
Review data size predicted (recommendations made) 89910
Review data size for testing (test data size) 736
Review data size used for testing (test data compared to predictions):  499
# of errors:  287
RMSE:  0.9798744123974474
MAE:  0.7515958367035621
Adjusted RMSE:  1.0079841586693
Adjusted MAE:  0.7114228456913828

2
User-Item Matrix size 319504
Review data size collected for calculating prediction (training data) 6619
Review data size predicted (recommendations made) 91078
Review data size for testing (test data size) 736
Review data size used for testing (test data compared to predictions):  518
# of errors:  314
RMSE:  1.0159066483316244
MAE:  0.7769875986381115
Adjusted RMSE:  1.0554144275589632
Adjusted MAE:  0.7586872586872587

3
User-Item Matrix size 319504
Review data size collected for calculating prediction (training data) 6619
Review data size predicted (recommendations

In [61]:
# k-fold Cross Validation - Pearson Coefficient
# "E + Delta E"
user_based_p2_r, user_based_p2_a = user_based_cv(data, k=10, test_size=0.1, similarity=sim.pearson)

1
User-Item Matrix size 357200
Review data size collected for calculating prediction (training data) 13239
Review data size predicted (recommendations made) 225295
Review data size for testing (test data size) 1472
Review data size used for testing (test data compared to predictions):  1374
# of errors:  774
RMSE:  0.930348815535313
MAE:  0.7198373692897118
Adjusted RMSE:  0.9595152827729312
Adjusted MAE:  0.6746724890829694

2
User-Item Matrix size 357200
Review data size collected for calculating prediction (training data) 13239
Review data size predicted (recommendations made) 226752
Review data size for testing (test data size) 1472
Review data size used for testing (test data compared to predictions):  1387
# of errors:  809
RMSE:  0.9565317137653999
MAE:  0.7532257793271337
Adjusted RMSE:  0.9832776288809904
Adjusted MAE:  0.702956020187455

3
User-Item Matrix size 357200
Review data size collected for calculating prediction (training data) 13239
Review data size predicted (recom

Overall, when we use the entire data ("E + Delta E"), the user-based recommender achieves an RMSE of approximately 0.9 and an MAE of approximately 0.7. The adjusted RMSE and MAE are computed after rounding the predicted ratings to the nearest integer (for example, a predicted rating of 4.6 becomes 5), because Yelp ratings are integers. Comparing the rounded predictions with the actual ratings, we found that a little under half of the predictions exactly matched the actual ratings in the test data. Rounding did not improve RMSE, however: in the folds shown above, the adjusted RMSE is consistently higher than the unrounded RMSE, although the adjusted MAE is slightly lower.
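
To see how rounding can move the two measures in opposite directions, consider the small sketch below. The three (predicted, actual) pairs are invented for illustration, and the helper name `rmse_mae` is hypothetical.

```python
import math

def rmse_mae(pairs):
    """pairs: list of (predicted, actual) ratings."""
    errors = [p - a for p, a in pairs]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae

# Two near misses and one large miss (invented values)
toy_pairs = [(4.6, 5), (3.4, 3), (2.4, 4)]

raw = rmse_mae(toy_pairs)
rounded = rmse_mae([(round(p), a) for p, a in toy_pairs])
print('raw     RMSE, MAE:', raw)
print('rounded RMSE, MAE:', rounded)
```

Rounding turns the two near misses into exact hits, which lowers the MAE, but it pushes the large miss from 1.6 to 2 stars, and the squared error of that single pair is enough to raise the RMSE.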

### <font color='brightpink'>(2) Item-based Recommendation</font> <a name="itembased"></a>

Next, we developed an item-based recommendation system.

In [53]:
# from the lecture material
def transpose( users ):
    result={}
    for person in users:
        for item in users[person]:
            result.setdefault(item,{})        
            # Flip item and person
            result[item][person] = users[person][item]
    return result

In [54]:
def most_similar( users, person, n=5, similarity=sim.pearson ):
    sims=[(similarity(users, person, other), other) for other in users if other!=person]
    sims.sort()
    sims.reverse()
    # note: n is currently unused; the full sorted list is returned
    # instead of only the top n entries (sims[0:n])
    return sims

In [55]:
# from the lecture material
def similarity_between_items( prefs, n=10, similarity=sim.pearson ):
    result={}
    # Invert the preference matrix to be item-centric
    item_prefs = transpose( prefs )
    len_item_prefs = len(item_prefs)
    c=0
    for item in item_prefs:
        # Status updates for large datasets
        c+=1
#        if c%100==0: print( "%d / %d" % (c,len_item_prefs) )
        # Find the most similar items to this one
        scores = most_similar( item_prefs, item, n, similarity )
        result[item] = scores
    return result

In [56]:
# from the lecture material
def item_based_recommendations( prefs, itemMatch, user ):
    userRatings=prefs[user]
    scores={}
    totalSim={}
    # Loop over items rated by this user
    for (item,rating) in userRatings.items( ):

        # Loop over items similar to this one
        for (similarity,item2) in itemMatch[item]:

            # ignore scores of zero or lower (this line is added to the original)
            if similarity<=0: continue

            # Ignore if this user has already rated this item
            if item2 in userRatings: continue
            # Weighted sum of rating times similarity
            scores.setdefault(item2,0)
            scores[item2]+=similarity*rating
            # Sum of all the similarities
            totalSim.setdefault(item2,0)
            totalSim[item2]+=similarity
  
    # Divide each total score by total weighting to get an average
    rankings=[(score/totalSim[item],item) for item,score in scores.items( )]
  
    # Return the rankings from highest to lowest
    rankings.sort( )
    rankings.reverse( )
    return rankings

#### A recommendation for a certain user
First, we made an item-based recommendation for a single user, as follows. The first result is the raw recommendation, which contains only predicted ratings and business IDs (`business_id`); the second is a more readable version in which restaurant information (name, category, city, and state) is joined.

In [57]:
# from lecture note
from pprint import pprint as pp
def pt( expr ):
    print()
    print("=== " + expr)
    val = eval(expr)
    pp(val)
    return val

In [58]:
# For user_id 'bLbSNkLggFnqwNNzzq-Ijw'
itemsim = similarity_between_items( data_dict, similarity = sim.pearson )
if itemsim is None: itemsim = similarity_between_items( data_dict )
test_user = 'bLbSNkLggFnqwNNzzq-Ijw'

item_rec = pd.DataFrame.from_dict(item_based_recommendations(data_dict, itemsim, test_user))
#pt( "item_based_recommendations( data_dict, itemsim, 'bLbSNkLggFnqwNNzzq-Ijw' )[:5]" )
item_based_recommendations(data_dict, itemsim, test_user)[:5]

[(4.178039500119352, 'gG9z6zr_49LocyCTvSFg0w'),
 (3.9910769726489415, 'sJNcipFYElitBrtiJx0ezQ'),
 (3.9662568472203255, 'T2tEMLpTeSMxLKpxwFdS3g'),
 (3.902307108867529, 'mnwRtuVQEsIUomBchu0gwg'),
 (3.8588734981107162, 'FogTa-wmjhVnJCoTiaxvZA')]

In [59]:
# Let's display top-5 recommendations for user 'bLbSNkLggFnqwNNzzq-Ijw'
test_user = 'bLbSNkLggFnqwNNzzq-Ijw'

print('Recommendations for user %s' %user_name(test_user))
for i,j in item_based_recommendations(data_dict, itemsim, test_user)[:5]:
    print('%s: %s (%s; %s)' %(round(i,2), item_name(j), item_category(j), item_location(j)))

Recommendations for user Stefany
4.18: Amélie's French Bakery & Café (Bakeries, Patisserie/Cake Shop, Restaurants, Breakfast & Brunch, Food, Cafes, Coffee & Tea; Charlotte, NC)
3.99: Arizona Wilderness Brewing (Food, Breweries, American (New), Burgers, Restaurants; Gilbert, AZ)
3.97: Cabo Fish Taco (Latin American, Seafood, Restaurants, Mexican; Charlotte, NC)
3.9: OHSO Brewery- Arcadia (American (Traditional), American (New), Breweries, Breakfast & Brunch, Restaurants, Food, Gluten-Free; Phoenix, AZ)
3.86: Postino Central (Wine Bars, Nightlife, Italian, Bars, Restaurants, Breakfast & Brunch; Phoenix, AZ)


#### Performing 10-fold Cross-Validation
Second, we performed 10-fold cross-validation to examine the performance of the recommendation system.

In [60]:
# Define a function to perform k-fold cross-validation (item-based recommendation)
def item_based_cv(data, k=10, test_size=0.1, similarity=sim.pearson):
    matrix_size = len(data['user_id'].unique()) * len(data['business_id'].unique())
    rmse_values = []
    mae_values = []
    rmse2_values = []
    mae2_values = []
    cv_num = 0
    ss = ShuffleSplit(n_splits=k, test_size=test_size, random_state=95885)
    for train_index, test_index in ss.split(data):
        # below is each validation process (in this for loop, we will iterate this process k times)
        cv_num += 1
    
        data_train, data_test= data.loc[train_index], data.loc[test_index] # split data into training and test
        data_train.reset_index(drop=True, inplace=True) 
        data_test.reset_index(drop=True, inplace=True) 
    
        ### compute similarity, prediction (recommendation) based on training data
        train_dict = {}
        train_dict_size = 0
        for i in range(len(data_train)):
            if data_train['user_id'].loc[i] not in train_dict:
                train_dict[data_train['user_id'].loc[i]] = {data_train['business_id'].loc[i]:
                                                            data_train['stars'].loc[i]}
                train_dict_size += 1
            else:
                train_dict[data_train['user_id'].loc[i]].update({data_train['business_id'].loc[i]:
                                                                 data_train['stars'].loc[i]})
                train_dict_size += 1

        itemsim = similarity_between_items( train_dict, similarity= similarity )
        pred_dict = {}
        pred_dict_size = 0
        for user in train_dict.keys():
            # skip user_ids that start with '_' to avoid a math domain error
            if user.startswith('_'):
                continue
            # compute the recommendations once per user and reuse them
            for pred, item in item_based_recommendations(train_dict, itemsim, user):
                if pred is None: continue
                pred_dict.setdefault(user, {})[item] = pred
                pred_dict_size += 1
        
        ### compute RMSE and MAE by comparing predicted ratings to test data
        test_dict = {}
        test_dict_size = 0
        for i in range(len(data_test)):
            if data_test['user_id'].loc[i] not in test_dict:
                test_dict[data_test['user_id'].loc[i]] = {data_test['business_id'].loc[i]:
                                                          data_test['stars'].loc[i]}
                test_dict_size += 1
            else:
                test_dict[data_test['user_id'].loc[i]].update({data_test['business_id'].loc[i]:
                                                               data_test['stars'].loc[i]})
                test_dict_size += 1

        sum_squared_error = 0
        sum_absolute_error = 0
        sum_squared_error2 = 0
        sum_absolute_error2 = 0
        error_count = 0
        count = 0
        for user in test_dict.keys():
            if user in pred_dict.keys():
                for item in test_dict[user].keys():
                    if item in pred_dict[user].keys():
                        sum_squared_error += (pred_dict[user][item] - test_dict[user][item]) ** 2
                        sum_absolute_error += abs(pred_dict[user][item] - test_dict[user][item])
                        # Adjusted SE and AE (predicted ratings will be rounded because actual ratings are integer)
                        sum_squared_error2 += (round(pred_dict[user][item],0) - test_dict[user][item]) ** 2
                        sum_absolute_error2 += abs(round(pred_dict[user][item],0) - test_dict[user][item])
                        if (round(pred_dict[user][item],0) - test_dict[user][item]) != 0:
                            error_count += 1
                        count += 1

        rmse = math.sqrt(sum_squared_error / count)
        mae = sum_absolute_error / count
        rmse2 = math.sqrt(sum_squared_error2 / count)
        mae2 = sum_absolute_error2 / count
        print(cv_num)
        print('User-Item Matrix size', matrix_size)
        print('Review data size collected for calculating prediction (training data)', train_dict_size)
        print('Review data size predicted (recommendations made)', pred_dict_size)
        print('Review data size for testing (test data size)', test_dict_size)
        print('Review data size used for testing (test data compared to predictions): ', count)
        print('# of errors: ', error_count)
        print ('RMSE: ', rmse)
        print ('MAE: ', mae)
        print('Adjusted RMSE: ', rmse2)
        print('Adjusted MAE: ', mae2)
        print ()
        rmse_values.append(rmse)
        mae_values.append(mae)
        rmse2_values.append(rmse2)
        mae2_values.append(mae2)

    mean_rmse = np.mean(rmse_values)
    mean_mae = np.mean(mae_values)
    mean_rmse2 = np.mean(rmse2_values)
    mean_mae2 = np.mean(mae2_values)
    print ('Overall performance')
    print ('Mean RMSE: %s' % round(mean_rmse,4))
    print ('Mean MAE: %s' % round(mean_mae,4))
    print ('Mean Adjusted RMSE: %s' % round(mean_rmse2,4))
    print ('Mean Adjusted MAE: %s' % round(mean_mae2,4))
    return mean_rmse, mean_mae

#### Performing 10-fold Cross-Validation - using Euclidean Distance

In [61]:
# k-fold Cross Validation - Euclidean Distance
# "E"
item_based_e1_r, item_based_e1_a = item_based_cv(data_e, k=10, test_size=0.1, similarity=sim.euclidean_distance)

1
User-Item Matrix size 319504
Review data size collected for calculating prediction (training data) 6619
Review data size predicted (recommendations made) 254494
Review data size for testing (test data size) 736
Review data size used for testing (test data compared to predictions):  710
# of errors:  450
RMSE:  1.0105765786917964
MAE:  0.7949506902614089
Adjusted RMSE:  1.0467925524607653
Adjusted MAE:  0.7774647887323943

2
User-Item Matrix size 319504
Review data size collected for calculating prediction (training data) 6619
Review data size predicted (recommendations made) 255773
Review data size for testing (test data size) 736
Review data size used for testing (test data compared to predictions):  700
# of errors:  399
RMSE:  0.9355632983329635
MAE:  0.7337087069157181
Adjusted RMSE:  0.972478424292429
Adjusted MAE:  0.6885714285714286

3
User-Item Matrix size 319504
Review data size collected for calculating prediction (training data) 6619
Review data size predicted (recommendat

In [62]:
# k-fold Cross Validation - Euclidean Distance
# "E + Delta E"
item_based_e2_r, item_based_e2_a = item_based_cv(data, k=10, test_size=0.1, similarity=sim.euclidean_distance)

1
User-Item Matrix size 357200
Review data size collected for calculating prediction (training data) 13239
Review data size predicted (recommendations made) 324072
Review data size for testing (test data size) 1472
Review data size used for testing (test data compared to predictions):  1442
# of errors:  823
RMSE:  0.980408896910211
MAE:  0.7519187176266195
Adjusted RMSE:  1.011034816824909
Adjusted MAE:  0.7059639389736477

2
User-Item Matrix size 357200
Review data size collected for calculating prediction (training data) 13239
Review data size predicted (recommendations made) 323218
Review data size for testing (test data size) 1472
Review data size used for testing (test data compared to predictions):  1439
# of errors:  858
RMSE:  0.975683197779086
MAE:  0.7668984978068507
Adjusted RMSE:  1.028435191789405
Adjusted MAE:  0.7407922168172342

3
User-Item Matrix size 357200
Review data size collected for calculating prediction (training data) 13239
Review data size predicted (recomme

#### Performing 10-fold Cross-Validation - using Pearson Coefficient

In [62]:
# k-fold Cross Validation - Pearson Coefficient
# "E"
item_based_p1_r, item_based_p1_a= item_based_cv(data_e, k=10, test_size=0.1, similarity=sim.pearson)

1
User-Item Matrix size 319504
Review data size collected for calculating prediction (training data) 6619
Review data size predicted (recommendations made) 112938
Review data size for testing (test data size) 736
Review data size used for testing (test data compared to predictions):  539
# of errors:  344
RMSE:  1.1286065269881032
MAE:  0.8653687168403592
Adjusted RMSE:  1.1605769149479943
Adjusted MAE:  0.849721706864564

2
User-Item Matrix size 319504
Review data size collected for calculating prediction (training data) 6619
Review data size predicted (recommendations made) 115555
Review data size for testing (test data size) 736
Review data size used for testing (test data compared to predictions):  566
# of errors:  350
RMSE:  1.0278246371131576
MAE:  0.7967155447101802
Adjusted RMSE:  1.074109048164954
Adjusted MAE:  0.7826855123674912

3
User-Item Matrix size 319504
Review data size collected for calculating prediction (training data) 6619
Review data size predicted (recommendati

In [63]:
# k-fold Cross Validation - Pearson Coefficient
# "E + Delta E"
item_based_p2_r, item_based_p2_a= item_based_cv(data, k=10, test_size=0.1, similarity=sim.pearson)

1
User-Item Matrix size 357200
Review data size collected for calculating prediction (training data) 13239
Review data size predicted (recommendations made) 255436
Review data size for testing (test data size) 1472
Review data size used for testing (test data compared to predictions):  1402
# of errors:  815
RMSE:  1.0008772097173766
MAE:  0.7708764785746939
Adjusted RMSE:  1.038488971436182
Adjusted MAE:  0.7289586305278174

2
User-Item Matrix size 357200
Review data size collected for calculating prediction (training data) 13239
Review data size predicted (recommendations made) 255493
Review data size for testing (test data size) 1472
Review data size used for testing (test data compared to predictions):  1407
# of errors:  843
RMSE:  1.0057382613533727
MAE:  0.7824120341989728
Adjusted RMSE:  1.0495878637466403
Adjusted MAE:  0.7562189054726368

3
User-Item Matrix size 357200
Review data size collected for calculating prediction (training data) 13239
Review data size predicted (reco

Overall, when we use the entire data ("E + Delta E"), the item-based recommender achieves an RMSE of approximately 1.0 and an MAE of 0.77-0.78. As before, the adjusted RMSE and MAE are computed after rounding the predicted ratings to the nearest integer. Comparing the rounded predictions with the actual ratings, we found that somewhat under half of the predictions (roughly 40% in the folds shown above) exactly matched the actual ratings in the test data. As with the user-based recommender, rounding did not improve RMSE: in the folds shown above, the adjusted RMSE is consistently higher than the unrounded RMSE, although the adjusted MAE is slightly lower.

### <font color='brightpink'>(3) Content-based Recommendation</font> <a name="contentbased"></a>

#### Make a recommendation for a certain user
First, we made a content-based recommendation (based on latent features of restaurants) for a single user, as follows. The first result is the raw recommendation, which contains only predicted ratings and business IDs (`business_id`); the second is a more readable version in which restaurant information (name, category, city, and state) is joined.

In [63]:
# User-item matrix (this will take a few minutes)
user_ratings_mat = pd.pivot_table(data[['user_id', 'business_id', 'stars']], 
                                  index='user_id', columns='business_id',
                                  aggfunc=np.max) # , fill_value=0) # aggfunc= lambda x: s

# Apply matrix factorization to find the latent features (This may take a few minutes)
# Convert the dataframe "matrix" to a real 2D numpy matrix
mat = user_ratings_mat.as_matrix()
U, M = m.low_rank_matrix_factorization(mat, num_features=15, regularization_amount=0.1)


Method .as_matrix will be removed in a future version. Use .values instead.



         Current function value: 515.447127
         Iterations: 3000
         Function evaluations: 4464
         Gradient evaluations: 4464


In [64]:
# Compute the predicted ratings matrix by multiplying the latent factor matrices
predicted_ratings_np = np.matmul(U, M)

# Convert the numpy matrix into a dataframe for easy viewing
predicted_ratings_df = pd.DataFrame(index     = user_ratings_mat.index,
                                    columns   = user_ratings_mat.columns,
                                    data      = predicted_ratings_np)
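
The factorization step above relies on our helper `m.low_rank_matrix_factorization`. The following is only a rough, dependency-free sketch of the underlying idea, not that helper itself: it fits two small matrices `U` and `M` to the observed ratings by stochastic gradient descent with L2 regularization. The function names, the learning rate, the number of epochs, and the toy ratings are all assumptions made for illustration.

```python
import random

def factorize(ratings, n_users, n_items, num_features=2,
              reg=0.1, lr=0.01, epochs=3000, seed=0):
    """Fit U (users x features) and M (features x items) to the observed ratings.

    ratings: dict mapping (user_index, item_index) -> stars, observed entries only.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 1.0) for _ in range(num_features)] for _ in range(n_users)]
    M = [[rng.uniform(0.1, 1.0) for _ in range(n_items)] for _ in range(num_features)]
    for _ in range(epochs):
        for (u, i), stars in ratings.items():
            err = stars - predict(U, M, u, i)
            for f in range(num_features):
                u_f, m_f = U[u][f], M[f][i]
                # Gradient step on the squared error plus the L2 penalty
                U[u][f] += lr * (err * m_f - reg * u_f)
                M[f][i] += lr * (err * u_f - reg * m_f)
    return U, M

def predict(U, M, u, i):
    """Predicted rating = dot product of user u's row in U and item i's column in M."""
    return sum(U[u][f] * M[f][i] for f in range(len(M)))

# Hypothetical toy data: 4 users x 3 restaurants, 8 observed ratings out of 12
observed = {(0, 0): 5, (0, 1): 4, (1, 0): 4, (1, 2): 2,
            (2, 1): 5, (2, 2): 3, (3, 0): 3, (3, 2): 2}
U, M = factorize(observed, n_users=4, n_items=3)

# Error on the ratings the model was fitted to
train_mae = sum(abs(predict(U, M, u, i) - r) for (u, i), r in observed.items()) / len(observed)
# Prediction for a pair with no observed rating (user 0, restaurant 2)
unseen = predict(U, M, 0, 2)
print(round(train_mae, 3), round(unseen, 3))
```

The key point is that only observed ratings enter the loss, so the product of `U` and `M` also yields values for the unrated user-restaurant pairs. Those values are what we use as predicted ratings.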

In [65]:
# Set the below to a userID
user_id_to_search = 'bLbSNkLggFnqwNNzzq-Ijw'

In [66]:
reviewed_business_df = data[data['user_id'] == user_id_to_search]
user_ratings = predicted_ratings_df.loc[user_id_to_search].reset_index(level=0, drop=True)

already_reviewed = reviewed_business_df['business_id'] # business_ids that have already been reviewed
recommended_df = user_ratings[~user_ratings.index.isin(already_reviewed)] # extract only unreviewed items
recommended_df = recommended_df.sort_values(ascending=False) # sort by predicted ratings

# if predicted_stars is above 5.0 (such as 6.16), convert it into 5.0.
recommended_df[recommended_df>5] = 5
# if predicted_stars is below 1 (such as -0.4), convert it into 1.0
recommended_df[recommended_df<1] = 1

recommended_df = recommended_df.reset_index() # convert index column into an ordinary column 
recommended_df.columns = ['business_id','predicted_stars'] # change the second column name
content_rec = recommended_df[['predicted_stars', 'business_id']] # This will be used later

# print the output
print("This is a recommendation for user %s" %user_id_to_search)
recommended_df.head(15)

This is a recommendation for user bLbSNkLggFnqwNNzzq-Ijw


| | business_id | predicted_stars |
|---|---|---|
| 0 | 5shgJB7a-2_gdnzc0gsOtg | 5.0 |
| 1 | J4CATH00YZrq8Bne2S4_cw | 5.0 |
| 2 | oVrvzUJczq0e2JzVxSTyag | 5.0 |
| 3 | KmuLLSyZJdNoarwcIERHcw | 5.0 |
| 4 | o7AiTlyWUrBSzdz6oMHj5w | 5.0 |
| 5 | r5PLDU-4mSbde5XekTXSCA | 4.776193 |
| 6 | 77h11eWv6HKJAgojLx8G4w | 4.748403 |
| 7 | GI-CAiZ_Gg3h21PwrANB4Q | 4.743096 |
| 8 | pSQFynH1VxkfSmehRXlZWw | 4.734429 |
| 9 | IWKtGvVg4hqc9rWHjW8KoA | 4.6488 |


In [67]:
# Let's display top-5 recommendations for user 'bLbSNkLggFnqwNNzzq-Ijw'
print('Recommendations for user %s' %user_name(user_id_to_search))
for i in range(5):
    item = recommended_df.loc[i]['business_id']
    pred = recommended_df.loc[i]['predicted_stars']
    print('%s: %s (%s; %s)' %(round(pred,2), item_name(item), item_category(item), item_location(item)))

Recommendations for user Stefany
5.0: Firefly (Restaurants, Tapas/Small Plates, Tapas Bars; Las Vegas, NV)
5.0: CUT by Wolfgang Puck (Bars, Nightlife, Restaurants, Lounges, Steakhouses; Las Vegas, NV)
5.0: Cirque du Soleil - Zumanity (Adult Entertainment, Nightlife, Performing Arts, Arts & Entertainment; Las Vegas, NV)
5.0: Cirque du Soleil - Mystère (Performing Arts, Arts & Entertainment; Las Vegas, NV)
5.0: Excalibur Hotel (Event Planning & Services, Casinos, Hotels, Resorts, Arts & Entertainment, Hotels & Travel; Las Vegas, NV)


#### Performing 10-fold Cross-Validation
Next, we tested the performance of the content-based recommendation system over 10 random 90/10 train/test splits (generated with `ShuffleSplit`), which we refer to here as 10-fold cross-validation.

In [68]:
# Define a function to perform k-fold cross-validation (content-based recommendation)
def content_based_cv(data, k = 10, test_size = 0.1):
    matrix_size = len(data['user_id'].unique()) * len(data['business_id'].unique())
    rmse_values = []
    mae_values = []
    rmse2_values = []
    mae2_values = []
    rmse3_values = []
    mae3_values = []
    cv_num = 0
    ss = ShuffleSplit(n_splits=k, test_size=test_size, random_state=95885)

    for train_index, test_index in ss.split(data):
        # below is each validation process (in this for loop, we will iterate this process k times)
        cv_num += 1
    
        data_train, data_test= data.loc[train_index], data.loc[test_index] # split data into training and test
        data_train.reset_index(drop=True, inplace=True) 
        data_test.reset_index(drop=True, inplace=True) 
    
        ### make predictions (recommendation) based on training data by factorization
        # User-item matrix
        user_ratings_mat_train = pd.pivot_table(data_train[['user_id', 'business_id', 'stars']], 
                                          index='user_id', columns='business_id',
                                          aggfunc=np.max) # , fill_value=0)   
        # Apply matrix factorization to find the latent features
        # Now, we convert the dataframe "matrix" to a real 2D numpy matrix
        mat = user_ratings_mat_train.as_matrix()
        # Factorize into U (users x latent features) and M (latent features x businesses)
        U, M = m.low_rank_matrix_factorization(mat, num_features=15, regularization_amount=0.1)
        predicted_ratings_np = np.matmul(U, M)
        # Convert the numpy matrix into a dataframe for easy viewing
        predicted_ratings_df = pd.DataFrame(index     = user_ratings_mat_train.index,
                                            columns   = user_ratings_mat_train.columns,
                                            data      = predicted_ratings_np)
    
        ### compute RMSE and MAE by comparing predicted ratings to test data
        # convert test data into User-item matrix
        user_ratings_mat_test = pd.pivot_table(data_test[['user_id', 'business_id', 'stars']], 
                                          index='user_id', columns='business_id',
                                          aggfunc=np.max) # , fill_value=0) 
        # Compare test data to predicted data and calculate RMSE and MAE
        sum_squared_error = 0
        sum_absolute_error = 0
        sum_squared_error2 = 0
        sum_absolute_error2 = 0
        sum_squared_error3 = 0
        sum_absolute_error3 = 0
        count = 0
        error_count = 0
        error_count2 = 0
        for i in range(len(user_ratings_mat_test)): # loop in the test data to look for actual ratings
            for j in range(len(user_ratings_mat_test.columns)):
                test_stars = user_ratings_mat_test.loc[user_ratings_mat_test.index[i]].values[j]
                if test_stars > 0 : # to eliminate np.nan values
                    user = user_ratings_mat_test.index[i]
                    business = user_ratings_mat_test.columns[j][1]
                    if user in predicted_ratings_df.index: # exclude users that do not exist in the training data
                        pred_stars = predicted_ratings_df.loc[user]['stars'][business] # look up the corresponding predicted rating
                        # Simple SE and AE
                        error = pred_stars - test_stars
                        sum_squared_error += error ** 2
                        sum_absolute_error += abs(error)
                        # Adjusted SE and AE (predicted ratings will be rounded because actual ratings are integer)
                        sum_squared_error2 += (round(pred_stars, 0) - test_stars) ** 2
                        sum_absolute_error2 += abs(round(pred_stars, 0) - test_stars)
                        if (round(pred_stars, 0) - test_stars) != 0:
                            error_count += 1
                        # Adjusted SE and AE ver.2 (predicted ratings > 5 or < 1 will be converted into 5 and 1, respectively)
                        if pred_stars > 5:
                            pred_stars = 5
                        if pred_stars < 1:
                            pred_stars = 1
                        error = pred_stars - test_stars
                        sum_squared_error3 += error ** 2
                        sum_absolute_error3 += abs(error)
                        if (round(pred_stars, 0) - test_stars) != 0:
                            error_count2 += 1
                        count += 1
        rmse = math.sqrt(sum_squared_error / count)
        mae = sum_absolute_error / count
        rmse2 = math.sqrt(sum_squared_error2 / count)
        mae2 = sum_absolute_error2 / count
        rmse3 = math.sqrt(sum_squared_error3 / count)
        mae3 = sum_absolute_error3 / count
        print(cv_num)
        print('Test data size used: ', count)
        print('# of errors: ', error_count)
        print('# of errors (with UL and LL): ', error_count2)
        print('RMSE: ', rmse)
        print('MAE: ', mae)
        print('Adjusted RMSE: ', rmse2)
        print('Adjusted MAE: ', mae2)
        print('RMSE (with UL and LL): ', rmse3)
        print('MAE (with UL and LL): ', mae3)
        rmse_values.append(rmse)
        mae_values.append(mae)
        rmse2_values.append(rmse2)
        mae2_values.append(mae2)
        rmse3_values.append(rmse3)
        mae3_values.append(mae3)
    
    # report the final results
    mean_rmse = np.mean(rmse_values)
    mean_mae = np.mean(mae_values)
    mean_rmse2 = np.mean(rmse2_values)
    mean_mae2 = np.mean(mae2_values)
    mean_rmse3 = np.mean(rmse3_values)
    mean_mae3 = np.mean(mae3_values)
    print ('Overall performance')
    print ('Mean RMSE: %s' % round(mean_rmse,4))
    print ('Mean MAE: %s' % round(mean_mae,4))
    print ('Mean Adjusted RMSE: %s' % round(mean_rmse2,4))
    print ('Mean Adjusted MAE: %s' % round(mean_mae2,4))
    print ('Mean RMSE (with UL and LL): %s' % round(mean_rmse3,4))
    print ('Mean MAE (with UL and LL): %s' % round(mean_mae3,4))
    return mean_rmse3, mean_mae3

In [69]:
# "E"
# This may take more than 10 minutes mostly because factorization takes some time
content_based_f1_r, content_based_f1_a = content_based_cv(data_e)


Method .as_matrix will be removed in a future version. Use .values instead.



         Current function value: 257.849864
         Iterations: 3000
         Function evaluations: 4455
         Gradient evaluations: 4455
1
Test data size used:  725
# of errors:  516
# of errors (with UL and LL):  513
RMSE:  1.28801544450036
MAE:  1.026657865018809
Adjusted RMSE:  1.3282085419972582
Adjusted MAE:  1.016551724137931
RMSE (with UL and LL):  1.269447300350302
MAE (with UL and LL):  1.0085969471630296
         Current function value: 258.096917
         Iterations: 3000
         Function evaluations: 4473
         Gradient evaluations: 4473
2
Test data size used:  718
# of errors:  492
# of errors (with UL and LL):  487
RMSE:  1.2373434423658634
MAE:  0.9949136299237505
Adjusted RMSE:  1.2765290683707688
Adjusted MAE:  0.9693593314763231
RMSE (with UL and LL):  1.2170190552396714
MAE (with UL and LL):  0.9743832346199363
         Current function value: 257.693728
         Iterations: 3000
         Function evaluations: 4473
         Gradient evaluations: 4473
3
Test 

In [64]:
# "E + Delta E"
# This may take more than 20 minutes partly because factorization takes some time
content_based_f2_r, content_based_f2_a = content_based_cv(data)



         Current function value: 413.959488
         Iterations: 3000
         Function evaluations: 4493
         Gradient evaluations: 4493
1
Test data size used:  1456
# of errors:  1082
# of errors (with UL and LL):  1039
RMSE:  1.6567707838277357
MAE:  1.2790509255431364
Adjusted RMSE:  1.6760268297571892
Adjusted MAE:  1.2554945054945055
RMSE (with UL and LL):  1.4104134132767392
MAE (with UL and LL):  1.109626596856603
         Current function value: 413.385389
         Iterations: 3000
         Function evaluations: 4482
         Gradient evaluations: 4482
2
Test data size used:  1454
# of errors:  1092
# of errors (with UL and LL):  1057
RMSE:  1.671512066361165
MAE:  1.2986549096528055
Adjusted RMSE:  1.7022115237322366
Adjusted MAE:  1.2840440165061897
RMSE (with UL and LL):  1.4592043258588772
MAE (with UL and LL):  1.1485097409202598
         Current function value: 419.148504
         Iterations: 3000
         Function evaluations: 4477
         Gradient evaluations: 447

Overall, when we use the entire data ("E + Delta E"), the performance of content-based recommendations is approximately 1.7 for RMSE and 1.3 for MAE. Adjusted RMSE and MAE here are the RMSE and MAE computed after rounding the predicted ratings. As with user-based recommendations, we compared the rounded predicted ratings with the actual ratings and found that fewer than half of the predictions matched the actual ratings in the test data. RMSE and MAE "with UL and LL" (upper limit and lower limit) are the RMSE and MAE calculated from clipped predicted ratings. Since Yelp ratings range from 1 to 5, it is reasonable to replace predicted ratings above 5 and below 1 with 5 and 1, respectively. (Factorization predicts ratings so as to minimize the error on the existing ratings, and it does not know that Yelp ratings must be between 1 and 5. Thus, some predicted ratings are likely to fall below 1.0 or above 5.0.) As a result, RMSE and MAE "with UL and LL" are approximately 1.4 and 1.1, respectively, in the folds shown above. We used these values as the performance of the content-based recommendations.
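
The effect of the upper and lower limits can be illustrated with a small sketch. The predicted values 6.16 and -0.4 echo the examples in our code comments; the remaining numbers and the helper name `clip` are hypothetical and used only for illustration.

```python
def clip(value, lower=1.0, upper=5.0):
    """Restrict a predicted rating to the valid Yelp range [lower, upper]."""
    return max(lower, min(upper, value))

def mae(pred, actual):
    """Mean absolute error between two equally long lists."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)

# Hypothetical raw predictions from factorization and the matching actual ratings
predicted = [6.16, -0.4, 3.2, 4.8]
actual = [5, 1, 4, 5]

clipped = [clip(p) for p in predicted]

raw_mae = mae(predicted, actual)
clipped_mae = mae(clipped, actual)
print(clipped, raw_mae, clipped_mae)
```

Because every actual rating lies between 1 and 5, moving an out-of-range prediction to the nearest limit can only bring it closer to the actual rating, never further away. This is why the metrics "with UL and LL" are never worse than the ordinary ones. With NumPy arrays, `np.clip(predictions, 1, 5)` performs the same operation.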
    




### (4) Recommendation Based on Ensemble Method

Finally, we combined all of the methods above into an ensemble to predict the rating of each restaurant for each user. As described above, for each user and item we averaged the predicted ratings from the user-based (using Euclidean distance and Pearson coefficient), item-based (using Euclidean distance and Pearson coefficient), and content-based recommendation systems, and we defined this average as the predicted rating of the ensemble method.  
  
In general, ensemble learning improves performance because the errors of the individual models tend to average out, so we adopted this model as our final model for making recommendations.
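
The averaging step can be sketched as follows. This is a simplified illustration with plain dictionaries rather than the DataFrame merges we actually use; the business ids, the ratings, and the function name `ensemble_average` are hypothetical.

```python
def ensemble_average(predictions_by_method):
    """Average predicted ratings over all methods.

    predictions_by_method: list of dicts {business_id: predicted_stars}.
    Only businesses predicted by every method are kept (like an inner join).
    """
    common = set(predictions_by_method[0])
    for preds in predictions_by_method[1:]:
        common &= set(preds)
    n_methods = len(predictions_by_method)
    return {b: sum(preds[b] for preds in predictions_by_method) / n_methods
            for b in common}

# Hypothetical predictions for one user from three of the methods
user_based = {'A': 4.5, 'B': 3.0, 'C': 4.0}
item_based = {'A': 4.0, 'B': 3.5}
content_based = {'A': 5.0, 'B': 2.5, 'D': 4.0}

averaged = ensemble_average([user_based, item_based, content_based])

# Rank businesses by averaged rating, highest first
ranking = sorted(averaged.items(), key=lambda pair: pair[1], reverse=True)
print(ranking)
```

Businesses 'C' and 'D' are dropped because not every method produced a prediction for them. The same restriction explains why the ensemble makes fewer recommendations than the individual methods.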

#### Make a recommendation for a certain user
First, we made a recommendation for a certain user. The following table displays the recommendation containing the predicted ratings and restaurant names.

In [70]:
# Load each recommendation for the same user based on each method
test_user = 'bLbSNkLggFnqwNNzzq-Ijw' # user_id to make a recommendation for 
user_rec_e = pd.DataFrame.from_dict(user_based_recommendations(data_dict, test_user, sim.euclidean_distance))
user_rec_p = pd.DataFrame.from_dict(user_based_recommendations(data_dict, test_user, sim.pearson))
item_rec_e = pd.DataFrame.from_dict(item_based_recommendations(data_dict, similarity_between_items(data_dict, similarity = sim.euclidean_distance), test_user))
item_rec_p = pd.DataFrame.from_dict(item_based_recommendations(data_dict, similarity_between_items(data_dict, similarity = sim.pearson), test_user))
user_rec_e.columns = ['stars_1', 'business_id']
user_rec_p.columns = ['stars_2', 'business_id']
item_rec_e.columns = ['stars_3', 'business_id']
item_rec_p.columns = ['stars_4', 'business_id']

# Join on "business_id"
df12 = pd.merge(user_rec_e, user_rec_p, on='business_id', how='inner', suffixes=('_1', '_2'))
df123 = pd.merge(df12, item_rec_e, on='business_id', how='inner', suffixes=('_12', '_3'))
df1234 = pd.merge(df123, item_rec_p, on='business_id', how='inner', suffixes=('_123', '_4'))
merged_rec = pd.merge(df1234, content_rec, on='business_id', how='inner', suffixes=('_1234', '_5'))

merged_rec = merged_rec.dropna() # pick up only rows that have predicted ratings based on all methods
merged_rec['avg'] = merged_rec[['stars_1','stars_2','stars_3','stars_4','predicted_stars']].mean(axis=1)    
merged_rec = merged_rec.sort_values(by='avg', ascending=False) # sort by averaged ratings
merged_rec['business_id'] = merged_rec['business_id'].apply(lambda s:item_name(s))
merged_rec_rank = merged_rec[['avg', 'business_id']]
merged_rec_rank = merged_rec_rank.reset_index(drop=True)
print ('A recommendation for user %s' %user_name(test_user))
merged_rec_rank.head(10)

A recommendation for user Stefany


| | avg | business_id |
|---|---|---|
| 0 | 4.242929 | Little Miss BBQ |
| 1 | 4.161305 | Citizen Public House |
| 2 | 4.112718 | Defalco's Italian Grocery |
| 3 | 4.063193 | Cirque du Soleil - Mystère |
| 4 | 4.040231 | Mastro's Ocean Club |
| 5 | 4.021675 | Cirque du Soleil - Zumanity |
| 6 | 4.014777 | Four Peaks Brewing |
| 7 | 4.004855 | CUT by Wolfgang Puck |
| 8 | 3.984515 | Le Reve - The Dream |
| 9 | 3.975148 | Hwaro |


#### Performing k-fold Cross-Validation
Then, we tested the performance of our ensemble method. Since this method takes a long time to run, we set k = *1*, that is, we evaluated it on a single random 90/10 train/test split instead of the 10 splits used for the other methods.
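
For readers unfamiliar with this kind of split, the sketch below shows what a single shuffled hold-out split looks like. It is a dependency-free illustration, not scikit-learn's `ShuffleSplit`: it will not reproduce the exact rows that `ShuffleSplit` selects, and the function name `single_shuffle_split` and the row count of 100 are assumptions.

```python
import random

def single_shuffle_split(n_rows, test_size=0.1, seed=95885):
    """Return (train_indices, test_indices) for one random hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# Hypothetical data set with 100 reviews
train_idx, test_idx = single_shuffle_split(n_rows=100)
print(len(train_idx), len(test_idx))
```

With k = 1 the reported RMSE and MAE come from this one split only, so they are not averaged over several test sets and are more sensitive to which reviews happen to land in the test set.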

In [71]:
def ensemble_cv(data, k = 10, test_size = 0.1):
    matrix_size = len(data['user_id'].unique()) * len(data['business_id'].unique())
    rmse_values = []
    mae_values = []
    rmse2_values = []
    mae2_values = []
    cv_num = 0
    ss = ShuffleSplit(n_splits=k, test_size=test_size, random_state=95885)

    for train_index, test_index in ss.split(data):
        # below is each validation process (in this for loop, we will iterate this process k times)
        cv_num += 1
        print('-- # of CV set: %s --' %cv_num)    
        data_train, data_test= data.loc[train_index], data.loc[test_index] # split data into training and test
        data_train.reset_index(drop=True, inplace=True) 
        data_test.reset_index(drop=True, inplace=True) 
        
        ### User_based Recommendation using Euclidean Distance
        ### User_based Recommendation using Pearson Coefficient
        ### Item_based Recommendation using Euclidean Distance
        ### Item_based Recommendation using Pearson Coefficient
        # compute similarity, prediction (recommendation) based on training data
        # create a dictionary for training data ({user: {item: rating, item: rating, ...}, ...})
        train_dict = convert_dict(data_train) 
        
        # create a dictionary for predicted data (recommendations for all users) 
        pred_dict_4 = {} # this will be {user: {item: pred_ratings (based on 4 metrics), ...}, ...}
        print ('-- started user-based and item-based prediction --')
        for user in train_dict.keys():
            if user.startswith('_'): # skip user_ids that start with '_' because they cause a math domain error
                pass
            else:
                user_rec_e = pd.DataFrame.from_dict(user_based_recommendations(train_dict, user, sim.euclidean_distance))
                user_rec_p = pd.DataFrame.from_dict(user_based_recommendations(train_dict, user, sim.pearson))
                # Join on "business_id"
                if len(user_rec_e) > 0 and len(user_rec_p) > 0:
                    user_rec_e.columns = ['stars_1', 'business_id']
                    user_rec_p.columns = ['stars_2', 'business_id']
                    df12 = pd.merge(user_rec_e, user_rec_p, on='business_id', how='inner', suffixes=('_1', '_2'))
                    item_rec_e = pd.DataFrame.from_dict(item_based_recommendations(train_dict, similarity_between_items(train_dict,similarity = sim.euclidean_distance),user))
                    if len(item_rec_e) > 0:
                        item_rec_e.columns = ['stars_3', 'business_id']
                        df123 = pd.merge(df12, item_rec_e, on='business_id', how='inner', suffixes=('_12', '_3'))
                        item_rec_p = pd.DataFrame.from_dict(item_based_recommendations(train_dict, similarity_between_items(train_dict,similarity = sim.pearson),user))
                        if len(item_rec_p) >0: 
                            item_rec_p.columns = ['stars_4', 'business_id']
                            merged_rec = pd.merge(df123, item_rec_p, on='business_id', how='inner', suffixes=('_123', '_4'))
                            merged_rec = merged_rec.dropna() # pick up only rows that have predicted ratings based on all methods
                            merged_rec['sum'] = merged_rec[['stars_1','stars_2','stars_3','stars_4']].sum(axis=1)    
                            pred_rec = merged_rec[['business_id','sum']]
                            #pred_rec.set_index('business_id')['sum']
                            if len(pred_rec) > 0: # ignore users for whom no recommendation was made (based on all methods)
                                user_pred_dict = {}
                                for _, item in pred_rec.iterrows():
                                    user_pred_dict.update({item['business_id']:item['sum']})
                                pred_dict_4[user] = user_pred_dict # it is assumed that "user" is not in pred_dict_4
        
        ### Content-based Recommendation
        ### make predictions (recommendation) based on training data by factorization
        # User-item matrix
        print ('-- started content-based prediction --')
        user_ratings_mat_train = pd.pivot_table(data_train[['user_id', 'business_id', 'stars']], 
                                          index='user_id', columns='business_id',
                                          aggfunc=np.max) # , fill_value=0)   
        # Apply matrix factorization to find the latent features
        # Now, we convert the dataframe "matrix" to a real 2D numpy matrix
        mat = user_ratings_mat_train.as_matrix()
        # Factorize into U (users x latent features) and M (latent features x businesses)
        U, M = m.low_rank_matrix_factorization(mat, num_features=15, regularization_amount=0.1)
        predicted_ratings_np = np.matmul(U, M)
        # Convert the numpy matrix into a dataframe for easy viewing
        predicted_ratings_df = pd.DataFrame(index     = user_ratings_mat_train.index,
                                            columns   = user_ratings_mat_train.columns,
                                            data      = predicted_ratings_np)
        predicted_ratings_df[predicted_ratings_df>5] = 5 # restrict upper limit
        predicted_ratings_df[predicted_ratings_df<1] = 1 # restrict lower limit
        
        ### Average predicted ratings based on ALL 5 metrics 
        pred_dict = {} 
        pred_dict_size = 0
        for user in pred_dict_4.keys():
            for item in pred_dict_4[user]:
                # add the total predicted ratings based on 4 methods to the predicted ratings based on content-based recommendation
                # and divide by 5; calculate average predicted ratings
                sum_ratings = pred_dict_4[user][item]
                #print(user, item, sum_ratings) # just for debugging
                content_ratings = predicted_ratings_df.loc[user]['stars'][item]
                if content_ratings >= 1: # make sure predicted ratings based on content-based is not null
                    pred = (sum_ratings + content_ratings) / 5
                    if user not in pred_dict:
                        pred_dict[user] = {item: pred}
                        pred_dict_size += 1
                    else:
                        pred_dict[user].update({item: pred})
                        pred_dict_size += 1
                else: print('error:', user, item, content_ratings)
                    
        ### compute RMSE and MAE by comparing predicted ratings to test data
        # Compare test data to predicted data and calculate RMSE and MAE
        # (this process is the same as user-based or item-based recommendation k-fold CV)
        print ('-- started validation --')      
        test_dict = {}
        test_dict_size = 0
        for i in range(len(data_test)):
            if data_test['user_id'].loc[i] not in test_dict:
                test_dict[data_test['user_id'].loc[i]] = {data_test['business_id'].loc[i]:
                                                          data_test['stars'].loc[i]}
                test_dict_size += 1
            else:
                test_dict[data_test['user_id'].loc[i]].update({data_test['business_id'].loc[i]:
                                                               data_test['stars'].loc[i]})
                test_dict_size += 1
        
        sum_squared_error = 0
        sum_absolute_error = 0
        sum_squared_error2 = 0
        sum_absolute_error2 = 0
        count = 0
        error_count = 0
        
        for user in test_dict.keys():
            if user in pred_dict.keys():
                for item in test_dict[user].keys():
                    if item in pred_dict[user].keys():
                        sum_squared_error += (pred_dict[user][item] - test_dict[user][item]) ** 2
                        sum_absolute_error += abs(pred_dict[user][item] - test_dict[user][item])
                        # Adjusted SE and AE (predicted ratings will be rounded because actual ratings are integer)
                        sum_squared_error2 += (round(pred_dict[user][item],0) - test_dict[user][item]) ** 2
                        sum_absolute_error2 += abs(round(pred_dict[user][item],0) - test_dict[user][item])
                        if (round(pred_dict[user][item],0) - test_dict[user][item]) != 0:
                            error_count += 1
                        count += 1
        
        rmse = math.sqrt(sum_squared_error / count)
        mae = sum_absolute_error / count
        rmse2 = math.sqrt(sum_squared_error2 / count)
        mae2 = sum_absolute_error2 / count
        print(cv_num)
        print('User-Item Matrix size', matrix_size)
        print('Review data size predicted (recommendations made)', pred_dict_size)
        print('Test data size used: ', count)
        print('# of errors: ', error_count)
        print('RMSE: ', rmse)
        print('MAE: ', mae)
        print('Adjusted RMSE: ', rmse2)
        print('Adjusted MAE: ', mae2)
        rmse_values.append(rmse)
        mae_values.append(mae)
        rmse2_values.append(rmse2)
        mae2_values.append(mae2)
    
    # report the final results
    mean_rmse = np.mean(rmse_values)
    mean_mae = np.mean(mae_values)
    mean_rmse2 = np.mean(rmse2_values)
    mean_mae2 = np.mean(mae2_values)
    print ('Overall performance')
    print ('Mean RMSE: %s' % round(mean_rmse,4))
    print ('Mean MAE: %s' % round(mean_mae,4))
    print ('Mean Adjusted RMSE: %s' % round(mean_rmse2,4))
    print ('Mean Adjusted MAE: %s' % round(mean_mae2,4))
    return mean_rmse, mean_mae

In [72]:
# "E"
# Note: this takes more than 30 minutes.
ensemble_1_r, ensemble_1_a = ensemble_cv(data_e, k =1, test_size=0.1)

-- # of CV set: 1 --
# of users in the data 1040
# of reveiws in the data: 6619
-- started user-based and item-based prediction --
-- started content-based prediction --



Method .as_matrix will be removed in a future version. Use .values instead.



         Current function value: 257.849864
         Iterations: 3000
         Function evaluations: 4455
         Gradient evaluations: 4455
-- started validation --
1
User-Item Matrix size 319504
Review data size predicted (recommendations made) 64951
Test data size used:  449
# of errors:  261
RMSE:  0.908037231925645
MAE:  0.7276548600958623
Adjusted RMSE:  0.9367526716955391
Adjusted MAE:  0.6770601336302895
Overall performance
Mean RMSE: 0.908
Mean MAE: 0.7277
Mean Adjusted RMSE: 0.9368
Mean Adjusted MAE: 0.6771


In [120]:
# "E + Delta E"
# Note: this takes more than 30-60 minutes
ensemble_2_r, ensemble_2_a = ensemble_cv(data, k=1, test_size=0.1)

-- # of CV set: 1 --
# of users in the data 1160
# of reveiws in the data: 13239
-- started user-based and item-based prediction --
-- started content-based prediction --




         Current function value: 413.959488
         Iterations: 3000
         Function evaluations: 4493
         Gradient evaluations: 4493
-- started validation --
1
User-Item Matrix size 357200
Review data size predicted (recommendations made) 208129
Test data size used:  1357
# of errors:  780
RMSE:  0.905788160738991
MAE:  0.706885805261568
Adjusted RMSE:  0.9609164134936213
Adjusted MAE:  0.681650700073692
Overall performance
Mean RMSE: 0.9058
Mean MAE: 0.7069
Mean Adjusted RMSE: 0.9609
Mean Adjusted MAE: 0.6817


Overall, when we use the entire data ("E + Delta E"), the performance of the ensemble recommendations is approximately 0.9 for RMSE and 0.7 for MAE. Adjusted RMSE and MAE are the RMSE and MAE computed after rounding the predicted ratings. As with user-based recommendations, we compared the rounded predicted ratings with the actual ratings and found that a little under half of the predictions matched the actual ratings in the test data. Interestingly, the number of errors is slightly smaller than for user-based recommendations, which perform the best of the three individual methods. Therefore, we can confirm the efficacy of ensemble learning here.

## <font color='brightpink'>Performance Evaluation</font> <a name="performance"></a>
We summarized the performances of our recommendation systems as shown below.

### Recommendation Comparison
First, we merged the recommendations for a certain user made based on user-based (using Pearson coefficient as a similarity calculation method), item-based (using Pearson coefficient as a similarity calculation method), content-based, and ensemble-based recommendation systems so that we can compare how close or different each recommendation is.

In [73]:
user_item = pd.concat([user_rec, item_rec], axis=1)
user_item_content = pd.concat([user_item, content_rec], axis=1)
comparison = pd.concat([user_item_content, merged_rec_rank], axis=1)
comparison.columns = ['stars_u','user_based','stars_i','item_based','stars_c','content_based','stars_e','ensemble']

# Round predicted ratings
comparison['stars_u'] = comparison['stars_u'].apply(lambda f:round(f,4))
comparison['stars_i'] = comparison['stars_i'].apply(lambda f:round(f,4))
comparison['stars_c'] = comparison['stars_c'].apply(lambda f:round(f,4))
comparison['stars_e'] = comparison['stars_e'].apply(lambda f:round(f,4))

# Convert business_id into restaurant name
comparison['user_based'] = comparison['user_based'].apply(lambda s:item_name(s))
comparison['item_based'] = comparison['item_based'][~comparison['item_based'].isnull()].apply(lambda s:item_name(s)) # skip null values (this column contains 2 of them)
comparison['content_based'] = comparison['content_based'].apply(lambda s:item_name(s))

# Index ranks
comparison['rank'] = [i+1 for i in range(len(comparison))]
comparison.index = comparison['rank']
comparison = comparison.drop('rank', axis=1)
comparison.to_csv('comparison.csv') # if you want to save this table

# Display Top N
N = 15
print ('Recommendations for user %s' %user_name(test_user))
comparison.head(N)

Recommendations for user Stefany


| rank | stars_u | user_based | stars_i | item_based | stars_c | content_based | stars_e | ensemble |
|---|---|---|---|---|---|---|---|---|
| 1 | 5.0 | Meat & Potatoes | 4.178 | Amélie's French Bakery & Café | 5.0 | Firefly | 4.2429 | Little Miss BBQ |
| 2 | 4.9459 | Hwaro | 3.9911 | Arizona Wilderness Brewing | 5.0 | CUT by Wolfgang Puck | 4.1613 | Citizen Public House |
| 3 | 4.9142 | Little Miss BBQ | 3.9663 | Cabo Fish Taco | 5.0 | Cirque du Soleil - Zumanity | 4.1127 | Defalco's Italian Grocery |
| 4 | 4.8669 | Backyard Taco | 3.9023 | OHSO Brewery- Arcadia | 5.0 | Cirque du Soleil - Mystère | 4.0632 | Cirque du Soleil - Mystère |
| 5 | 4.8598 | Gordon Ramsay Hell's Kitchen | 3.8589 | Postino Central | 5.0 | Excalibur Hotel | 4.0402 | Mastro's Ocean Club |
| 6 | 4.822 | Bobby Q | 3.8545 | Paradise Valley Burger Company | 4.7762 | Defalco's Italian Grocery | 4.0217 | Cirque du Soleil - Zumanity |
| 7 | 4.7803 | Cherryblossom Noodle Cafe | 3.7703 | Postino Arcadia | 4.7484 | Eggslut | 4.0148 | Four Peaks Brewing |
| 8 | 4.7796 | Fogo de Chão Brazilian Steakhouse | 3.7558 | Citizen Public House | 4.7431 | Mastro's Ocean Club | 4.0049 | CUT by Wolfgang Puck |
| 9 | 4.7693 | Tacos El Gordo | 3.7555 | Taco Guild | 4.7344 | Pizzeria Bianco | 3.9845 | Le Reve - The Dream |
| 10 | 4.7382 | Citizen Public House | 3.7516 | Pai Northern Thai Kitchen | 4.6488 | M Resort Spa Casino | 3.9751 | Hwaro |


This table displays the top 15 recommendations for the user Stefany under each method. Within the top 15, the recommended restaurants differ substantially among the three individual methods; only a few, such as KINKA IZAKAYA ORIGINAL and Citizen Public House, appear in more than one list, and at different ranks.  
However, the ensemble's recommendations include many restaurants that are also ranked by the individual methods. (This is to be expected, since the ensemble combines their predictions.)

### Performance Comparison

In [72]:
performance_mat = pd.DataFrame({'Recommendation': ['User-based', 'User-based', 'Item-based', 'Item-based', 'Content-based'],
                                'Similarity_Metrics': ['Euclidean Distance', 'Pearson Coefficient', 'Euclidean Distance', 'Pearson Coefficient', ''],
                                'RMSE(E)': [user_based_e1_r, user_based_p1_r, item_based_e1_r, item_based_p1_r, content_based_f1_r],
                                'RMSE(E+ΔE)': [user_based_e2_r, user_based_p2_r, item_based_e2_r, item_based_p2_r, content_based_f2_r],
                                'MAE(E)': [user_based_e1_a, user_based_p1_a, item_based_e1_a, item_based_p1_a, content_based_f1_a],
                                'MAE(E+ΔE)': [user_based_e2_a, user_based_p2_a, item_based_e2_a, item_based_p2_a, content_based_f2_a],
                                })

performance_mat[['RMSE(E)', 'RMSE(E+ΔE)', 'MAE(E)', 'MAE(E+ΔE)']] = performance_mat[['RMSE(E)', 'RMSE(E+ΔE)', 'MAE(E)', 'MAE(E+ΔE)']].apply(lambda f:round(f,4))
performance_mat.to_csv('performance_mat.csv') # if you want to save this table
performance_mat = performance_mat.set_index(['Recommendation', 'Similarity_Metrics']) 
performance_mat

| Recommendation | Similarity_Metrics | RMSE(E) | RMSE(E+ΔE) | MAE(E) | MAE(E+ΔE) |
|---|---|---|---|---|---|
| User-based | Euclidean Distance | 0.936 | 0.8927 | 0.7287 | 0.6941 |
| User-based | Pearson Coefficient | 1.0274 | 0.9235 | 0.7843 | 0.7187 |
| Item-based | Euclidean Distance | 0.9949 | 0.9756 | 0.7761 | 0.759 |
| Item-based | Pearson Coefficient | 1.0785 | 1.0019 | 0.8271 | 0.7769 |
| Content-based | | 1.262 | 1.4431 | 1.0033 | 1.1338 |


The RMSE and MAE values here are the standard (unadjusted) metrics, not the adjusted RMSE and MAE mentioned above. For content-based recommendation, they are computed from the regularized predicted ratings described above. 
  
Overall, the user-based recommendation systems performed better than the item-based and content-based systems: both RMSE and MAE are lower, which means that the magnitude of the errors in the user-based systems is smaller.  
  
We expected user-based and item-based recommendation to perform similarly, but user-based recommendation performed slightly better. We think this is because there are far fewer restaurants (items) than users. To compute a predicted rating for a user, we take a weighted average: we sum the ratings of similar users (or items), each multiplied by its similarity, and divide by the number of similar users (or items) for user-based (or item-based) recommendation. In general, the larger the sample, the more reliable the average, because its variance is smaller. In that sense, the weighted averages from user-based recommendation are more reliable than those from item-based recommendation, which likely explains why user-based recommendation performed better overall.  
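
The similarity-weighted average described above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the function name `predict_rating` is hypothetical, and the sketch normalizes by the sum of the similarities (a common convention), which may differ from the normalization used in our code.

```python
def predict_rating(neighbor_ratings, similarities):
    """Similarity-weighted average of the neighbours' ratings.

    Assumption: only neighbours with positive similarity are used, and the
    weighted sum is normalised by the total similarity.
    Returns None when no usable neighbour exists.
    """
    pairs = [(r, s) for r, s in zip(neighbor_ratings, similarities) if s > 0]
    total_similarity = sum(s for _, s in pairs)
    if total_similarity == 0:
        return None
    return sum(r * s for r, s in pairs) / total_similarity


# Two neighbours rated the restaurant 5 and 3 stars.
print(predict_rating([5, 3], [1.0, 1.0]))    # equal similarity: plain mean, 4.0
print(predict_rating([5, 3], [0.75, 0.25]))  # closer neighbour dominates: 4.5
```

With more usable neighbours, each individual rating has less influence on the result, which is the intuition behind the variance argument above.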
  
Content-based recommendation probably performed worse than the other methods because factorization produces a predicted rating for every user-item combination, whereas the user-based and item-based systems ignore non-similar users/items and users/items with few actual ratings when predicting. The content-based predictions therefore include unreliable ones (for such users), and their larger variance can increase the magnitude of the errors. Another possible reason is the number of latent features. We set it to 15, which may not have been sufficient for a 1500 * 300 matrix. We could have tuned this hyperparameter, but because matrix factorization requires a lot of time and computing power, we were not able to try other values.  
  
Comparing the two metrics, the RMSE values are larger than the MAE values. This is expected: RMSE is the square root of the sum of the squared errors divided by the square root of the number of samples, while MAE is the sum of the absolute errors divided by the number of samples. Squaring gives large errors more weight, so RMSE is never smaller than MAE.  
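
As a small worked example of the two definitions, the sketch below computes both metrics on made-up ratings. The helper names `rmse` and `mae` and the sample values are illustrative assumptions, not taken from our notebook.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: sqrt(sum of squared errors / n)."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: sum of absolute errors / n."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


actual = [5, 4, 3, 2]
predicted = [4, 4, 3, 4]  # errors: 1, 0, 0, -2

print(mae(actual, predicted))   # (1 + 0 + 0 + 2) / 4 = 0.75
print(rmse(actual, predicted))  # sqrt((1 + 0 + 0 + 4) / 4) = sqrt(1.25), about 1.118
```

The single 2-star miss contributes four of the five units of squared error, which is why RMSE ends up well above MAE here. The two metrics coincide only when every error has the same magnitude.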

For user-based and item-based recommendation, Euclidean distance performed better than the Pearson coefficient as a similarity measure. We believe this is mostly because more data is used to predict ratings with Euclidean distance. As the code above indicates, when we calculated similarity with the Pearson coefficient, we explicitly removed user (or item) pairs with negative coefficients (because they are not similar at all), which left fewer samples for prediction.     
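
The effect of dropping negative coefficients can be illustrated with a short sketch. All names here are hypothetical, and the distance-to-similarity mapping `1 / (1 + d)` is an assumption chosen for illustration; it is not necessarily the formula used in our notebook.

```python
import math


def euclidean_similarity(a, b):
    """Map Euclidean distance to a similarity in (0, 1] (assumed form: 1 / (1 + d))."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


def pearson_similarity(a, b):
    """Pearson correlation of two equally long rating lists (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def usable_neighbors(similarities):
    """Keep only neighbours whose similarity is positive."""
    return {name: s for name, s in similarities.items() if s > 0}


target = [5, 4, 1]  # ratings on three co-rated restaurants
others = {"similar_user": [4, 5, 2], "opposite_user": [1, 2, 5]}

by_euclid = usable_neighbors({n: euclidean_similarity(target, r) for n, r in others.items()})
by_pearson = usable_neighbors({n: pearson_similarity(target, r) for n, r in others.items()})

print(sorted(by_euclid))   # both neighbours remain (similarity is always positive)
print(sorted(by_pearson))  # only the positively correlated neighbour remains
```

Under these assumptions, the Euclidean variant always has at least as many neighbours to average over as the Pearson variant, which is consistent with the explanation above.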

Comparing the results on "E" with those on "E + ΔE", performance is clearly better with "E + ΔE" for the user-based and item-based systems, which means that learning has happened. 
    
Content-based recommendation is the only exception: for this system, adding data reduced performance. It is unclear why this happened, but one hypothesis is that with more data the system predicts ratings even for users who have no similar users, or for items that similar users have not rated, which increases the magnitude of the errors. Another hypothesis is that the larger dataset needs many more latent features to be predicted as accurately as the smaller one. For example, 15 latent features might be enough for the smaller dataset while the larger one needs 60. (If this hypothesis is true, content-based recommendation would not be well suited to large datasets.) We used 15 latent features for both datasets; we could have tuned this hyperparameter but, for the reason mentioned above, were not able to try other values. These hypotheses parallel the reasons why the content-based recommendation system performed worse than the other systems.    
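
To make the role of the latent-feature count concrete, here is a minimal sketch of matrix factorization trained by stochastic gradient descent, with the number of latent features exposed as `k`. It is an illustration under assumptions, not our notebook's implementation: the function names, the learning rate, the regularization strength, and the toy ratings are all hypothetical, and the sketch reports training error only.

```python
import math
import random


def factorize(ratings, n_users, n_items, k=15, steps=500, lr=0.01, reg=0.02, seed=0):
    """Fit user factors P and item factors Q so that dot(P[u], Q[i]) approximates r."""
    rng = random.Random(seed)
    P = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - sum(p * q for p, q in zip(P[u], Q[i]))
            for f in range(k):
                p, q = P[u][f], Q[i][f]
                # Gradient step on the squared error with L2 regularisation.
                P[u][f] += lr * (err * q - reg * p)
                Q[i][f] += lr * (err * p - reg * q)
    return P, Q


def train_rmse(ratings, P, Q):
    """RMSE of the factorization on the ratings it was trained on."""
    squared = [(r - sum(p * q for p, q in zip(P[u], Q[i]))) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


# (user index, item index, stars) for a toy 3 x 3 rating matrix with gaps.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]

for k in (1, 2, 4):
    P, Q = factorize(ratings, n_users=3, n_items=3, k=k)
    print(k, round(train_rmse(ratings, P, Q), 3))
```

A real tuning run would compare held-out error rather than training error across candidate values of `k`, since a larger `k` can fit the training ratings more closely without predicting unseen ratings any better.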

In [123]:
performance_mat2 = pd.DataFrame({'Recommendation': ['User-based', 'User-based', 'Item-based', 'Item-based', 'Content-based', 'Ensemble'],
                                 'Similarity_Metrics': ['Euclidean Distance', 'Pearson Coefficient', 'Euclidean Distance', 'Pearson Coefficient', '', ''],
                                 'RMSE(E)': [user_based_e1_r, user_based_p1_r, item_based_e1_r, item_based_p1_r, content_based_f1_r, ensemble_1_r],
                                 'RMSE(E+ΔE)': [user_based_e2_r, user_based_p2_r, item_based_e2_r, item_based_p2_r, content_based_f2_r, ensemble_2_r],
                                 'MAE(E)': [user_based_e1_a, user_based_p1_a, item_based_e1_a, item_based_p1_a, content_based_f1_a, ensemble_1_a],
                                 'MAE(E+ΔE)': [user_based_e2_a, user_based_p2_a, item_based_e2_a, item_based_p2_a, content_based_f2_a, ensemble_2_a],
                                 })

performance_mat2[['RMSE(E)', 'RMSE(E+ΔE)', 'MAE(E)', 'MAE(E+ΔE)']] = performance_mat2[['RMSE(E)', 'RMSE(E+ΔE)', 'MAE(E)', 'MAE(E+ΔE)']].apply(lambda f:round(f,4))
performance_mat2.to_csv('performance_mat2.csv') # if you want to save this table
performance_mat2 = performance_mat2.set_index(['Recommendation', 'Similarity_Metrics']) 
performance_mat2

| Recommendation | Similarity_Metrics | RMSE(E) | RMSE(E+ΔE) | MAE(E) | MAE(E+ΔE) |
|---|---|---|---|---|---|
| User-based | Euclidean Distance | 0.936 | 0.8927 | 0.7287 | 0.6941 |
| User-based | Pearson Coefficient | 1.0274 | 0.9235 | 0.7843 | 0.7187 |
| Item-based | Euclidean Distance | 0.9949 | 0.9756 | 0.7761 | 0.759 |
| Item-based | Pearson Coefficient | 1.0785 | 1.0019 | 0.8271 | 0.7769 |
| Content-based | | 1.262 | 1.4431 | 1.0033 | 1.1338 |
| Ensemble | | 0.908 | 0.9058 | 0.7277 | 0.7069 |


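Additionally, we compared the performance of the ensemble method with that of the other methods. As we expected, our final model, the ensemble method, performed well overall: on "E" it achieved the lowest RMSE and MAE of all methods, and on "E + ΔE" it was second only to user-based recommendation with Euclidean distance (RMSE 0.9058 vs. 0.8927, MAE 0.7069 vs. 0.6941).  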
  
Using the entire dataset, RMSE is approximately 0.9 and MAE is approximately 0.7; that is, the average magnitude of the error is approximately 0.7 stars. The difference between RMSE and MAE is as discussed above. As far as we could find, prior studies using similar Yelp datasets generally report RMSE of around 1.0 to 1.2. Therefore, by this measure, we developed a better-performing recommendation model for Yelp. (However, the prior studies used more data than we did, for example 180,000 reviews compared with 14,700 in our analysis, so this comparison may not be entirely fair.)  
    
We can also conclude that learning has happened, because the ensemble model's performance improved when we used more data.  
  
In addition, one of the most notable characteristics of the ensemble model is that its performance with less data (only "E") improved drastically, which indicates that it does not need as much data as the other models to predict as accurately.

## <font color='brightpink'>Contributions and Future Tasks</font> <a name="cont"></a>
Our work can contribute to the problem mentioned at the beginning of the report in the following ways:
- By utilizing and improving this recommendation system, Yelp can offer a recommendation service to its customers, which will enhance customers' satisfaction and, in turn, Yelp's value. Netflix's Chief Product Officer Neil Hunt has credited "the combined effect of personalization and recommendations" with saving Netflix a large amount of money; a similar effect could save Yelp money by attracting new users, retaining current users, and encouraging more reviews and ratings. 
- Since Yelp earns most of its revenue from advertising, bringing more customers to the website creates more opportunities for ad clicks, which will result in Yelp's revenue growth. 
- Our ensemble model performs better than the prior models we reviewed, which indicates that Yelp could offer a more effective and reliable recommendation system. In other words, Yelp can recommend restaurants that users are likely to be satisfied with.  
- As a side effect, restaurants can also benefit from this system for free (since they do not pay Yelp for it). The personalized recommendation system would make them more likely to acquire potentially loyal customers. According to Alessandro Vitale's article, on Netflix, "75% of what people watch is from some sort of recommendation." This suggests great potential for restaurants on Yelp.
  
To achieve our aim of helping Yelp enhance its value, we and Yelp need to address the following challenges in the future:
- From the business perspective, Yelp should make an effort to collect more data on users' preferences and personalities as consumers in order to provide personalized recommendations for restaurants and other services. (Yelp seems to have been collecting such data over the years, though.) This matters because one likely reason why Yelp has not provided a recommendation system so far is “sparsity”: it may be extremely difficult to predict ratings for light users (and unpopular businesses). Our algorithm, with future improvements, should help with this issue, but Yelp also needs to work on it. 
- From the technical perspective, we need machines that can handle much larger data so that we can include users/items with fewer than 1,000 reviews and tune hyperparameters smoothly.
- Also, the ensemble method's weights for averaging the predicted ratings of each method need to be tuned; in this project, we simply weighted the predicted ratings evenly. Similarly, we could tune the number of latent features in the content-based recommendation.
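
As a rough sketch of what tunable ensemble weights could look like, the function below averages per-method predictions with optional weights and falls back to equal weights. The name `ensemble_rating`, the dictionary keys, and the choice to skip methods with no prediction are assumptions made for illustration, not a description of our notebook's code.

```python
def ensemble_rating(predictions, weights=None):
    """Weighted mean of per-method predicted ratings.

    predictions: dict mapping method name to a predicted rating, or None
                 when that method could not produce a prediction.
    weights:     dict mapping method name to a non-negative weight;
                 equal weights are assumed when omitted.
    """
    if weights is None:
        weights = {name: 1.0 for name in predictions}
    usable = {name: p for name, p in predictions.items() if p is not None}
    total_weight = sum(weights[name] for name in usable)
    if total_weight == 0:
        return None
    return sum(weights[name] * p for name, p in usable.items()) / total_weight


preds = {"user_based": 5.0, "item_based": 4.0, "content_based": 3.0}

print(ensemble_rating(preds))  # equal weights: 4.0
print(ensemble_rating(preds, {"user_based": 2, "item_based": 1, "content_based": 1}))  # 4.25
```

Candidate weight vectors could then be compared by their RMSE on a held-out set, in the same way as the methods were compared above.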

## Reference

- Sumedh Sawant, “Yelp Food Recommendation System”, http://cs229.stanford.edu/proj2013/SawantPai-YelpFoodRecommendationSystem.pdf  
- Yinuo Yao, Fangmingyu Yang, and Xin Niu, “CS229 Project Final Report A Personalized Recommendation System for Yelp Users”, http://cs229.stanford.edu/proj2016/report/YaoNiuYang-A%20Personalized%20Recommendation%20System%20for%20Yelp%20Users-report.pdf 
- “Yelp Dataset Challenge”, Yelp, https://www.yelp.com/dataset_challenge   
- “Yelp Dataset JSON”, Yelp, https://www.yelp.com/dataset/documentation/main   
- Panagiotis Karagiannis, Juraj Juraska, Noujan Pashanasangi, and Konstantinos Zampetakis, “Restaurant Recommender System for Yelp”, https://pdfs.semanticscholar.org/60f1/6d5cbed28a1df409d4122aeea7ea20d4dd02.pdf  
- “A Preference-Based Restaurant Recommendation System for Individuals and Groups. Team Size: 3”, https://www.cs.cornell.edu/~rahmtin/Files/YelpClassProject.pdf  
- "MAE and RMSE — Which Metric is Better?", https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d
- Alessandro Vitale, "The importance of filtering in Recommender Systems", https://medium.com/@alevitale/the-importance-of-filtering-in-recommender-systems-8040c0e16516
- Naomi Carrillo et al., "Recommender Systems Designed for Yelp.com", https://www.math.uci.edu/icamp/summer/research/student_research/recommender_systems_slides.pdf