This dataset covers only a subset of Yelp restaurants from a few selected states. Among them, only Arizona, Nevada, Ohio, North Carolina and Pennsylvania each have a catalog of more than 5,000 restaurants.

Only the top two states, Arizona and Nevada, have more than 10,000 restaurants.

In [1]:
import pandas as pd
import numpy as np

business = pd.read_csv('business_clean.csv')  # business data, including location data, attributes and categories
review = pd.read_csv('review_clean.csv')  # full review text, including the user_id that wrote each review and the business_id it was written for



# 1. Calculate Geodesic distance between two points on the globe

In [2]:
# calculate the great-circle distance between two points on a globe, see https://janakiev.com/blog/gps-points-distance-python/ for more details
# alternatively, geopy.distance from the geopy package can calculate either geodesic or great-circle distance, https://github.com/geopy/geopy/blob/master/geopy/distance.py

def great_circle_mile(lat1, lon1, lat2, lon2):
    """
    Compute the great-circle distance (a spherical approximation of the geodesic distance)
    between two points given their coordinates in degrees.
    The function returns the distance in miles.
    Note: 1. The calculation uses the earth's mean radius of 6371.009 km.
    2. The central subtended angle is calculated by the formula:
    alpha = arccos[sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2)*cos(lon1-lon2)]
    """
    
    from math import sin, cos, acos, radians
    
    lat1, lon1, lat2, lon2 = radians(lat1), radians(lon1), radians(lat2), radians(lon2) # convert degrees to radians
    earth_radius = 6371.009  # use earth's mean radius in kilometers
    cos_alpha = sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2)*cos(lon1-lon2)
    cos_alpha = max(-1.0, min(1.0, cos_alpha)) # clamp to [-1, 1]; rounding can push the value just outside, which makes acos raise ValueError
    alpha = acos(cos_alpha) # alpha is in radians
    dis_km = alpha * earth_radius
    dis_mile = dis_km * 0.621371   # convert kilometer to mile
    
    return dis_mile

In [3]:
pos1 = (51.5073219, -0.1276474) # London
pos2 = (52.5170365, 13.3888599) # Berlin
pos3 = (-33.8548157,151.2164539) # Sydney

In [4]:
%%timeit
# great_circle distance
distance_12 = great_circle_mile(pos1[0], pos1[1], pos2[0], pos2[1])
distance_12

2.54 µs ± 581 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [5]:
from geopy.distance import distance

In [6]:
%%timeit
# geodesic distance
distance2_12 = distance(pos1, pos2).miles
distance2_12

231 µs ± 16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [7]:
distance_12 = great_circle_mile(pos1[0], pos1[1], pos2[0], pos2[1])
distance2_12 = distance(pos1, pos2).miles
error_12 = (distance_12 - distance2_12)/distance2_12

In [8]:
distance_13 = great_circle_mile(pos1[0], pos1[1], pos3[0], pos3[1])
distance2_13 = distance(pos1, pos3).miles
error_13 = (distance_13 - distance2_13)/distance2_13

In [9]:
# print errors, 1-2 is the distance between London and Berlin, 1-3 is the distance between London and Sydney
print("absolute error (miles):", (distance_12-distance2_12), (distance_13-distance2_13)) 
print("relative error:", error_12, error_13)

absolute error (miles): -1.8326838373754981 2.892945931140275
relative error: -0.0031598293601707256 0.0002740520023695907


### Quick summary: 
Great-circle distance is reasonably accurate compared with the geodesic distance calculated from an ellipsoidal model of the earth: the relative error is about 0.3% for London to Berlin and about 0.03% for London to Sydney. Given that the great-circle calculation was roughly 90 times faster in the timings above (2.54 µs vs. 231 µs per call), great-circle distance will be used in this project when calculating distances on the map.
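Since distances will eventually be needed from one location of interest to many restaurants, the same formula can also be evaluated on whole arrays at once. The following is a minimal sketch, assuming NumPy; the function name `great_circle_mile_vec` is hypothetical and not part of the notebook above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.009  # same mean radius as great_circle_mile
KM_TO_MILE = 0.621371       # same conversion factor as great_circle_mile


def great_circle_mile_vec(lat, lon, lat0, lon0):
    """Great-circle distance in miles from each (lat, lon) pair to one point (lat0, lon0).

    Hypothetical helper: lat and lon are array-likes in degrees, lat0 and lon0 are scalars in degrees.
    """
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    lat0, lon0 = np.radians(lat0), np.radians(lon0)
    cos_alpha = np.sin(lat) * np.sin(lat0) + np.cos(lat) * np.cos(lat0) * np.cos(lon - lon0)
    # clip guards against rounding slightly outside [-1, 1], where arccos returns nan
    alpha = np.arccos(np.clip(cos_alpha, -1.0, 1.0))
    return alpha * EARTH_RADIUS_KM * KM_TO_MILE


# distances from London to Berlin and to Sydney, using the coordinates from the cells above
lats = [52.5170365, -33.8548157]
lons = [13.3888599, 151.2164539]
print(great_circle_mile_vec(lats, lons, 51.5073219, -0.1276474))
```

If the business dataframe has coordinate columns (called `latitude` and `longitude` here purely as an assumption), the returned array could be assigned directly to a distance column.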

# 2. Adjusted rating

Here, a single metric is introduced as a substitute for the original average restaurant rating (the 'stars' column of the business dataframe). Ideally, the new metric should take into consideration: <br>
1) the average rating of the restaurant (indicates how good the restaurant is, but does not consider popularity) <br>
2) the number of ratings by users (indicates popularity, but does not imply that the restaurant is good) <br>
3) the age of the ratings (indicates their relevance, as outdated ratings might fail to reflect the current quality)<br>

The proposed new score is: 
$$\text{score}_i = \frac{\sum_u r_{ui} + k\mu}{n_i+k}$$

where $r_{ui}$ is the rating on item $i$ by user $u$, $n_i$ is the number of ratings on item $i$, $\mu$ is the global mean of ratings over all businesses and all users, and $k$ is the strength of the damping term.

As the equation shows, the adjusted score is a damped mean: it pulls the score of a restaurant with only a few, possibly extreme, ratings toward the global mean. k controls the strength of the damping effect: the larger k is, the more actual ratings are required to move the score away from the global mean.

In this case, k is set to 4 (which is the 10% quantile of the review counts for all businesses), but it can be tuned according to various business considerations.  
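To make the damping effect concrete, the sketch below evaluates the formula for a restaurant whose ratings average 5 stars at several review counts. The helper name `damped_mean` is hypothetical, and the global mean is rounded to 3.7278 for illustration.

```python
def damped_mean(mean_rating, n_ratings, global_mean, k=4):
    """Damped mean: (sum of ratings + k * global mean) / (number of ratings + k).

    The sum of ratings is written as mean_rating * n_ratings.
    """
    return (mean_rating * n_ratings + k * global_mean) / (n_ratings + k)


mu = 3.7278  # global mean rating, rounded for illustration

# a 5-star average with few reviews is pulled strongly toward mu;
# with many reviews the score stays close to 5
for n in (3, 7, 50, 1746):
    print(n, round(damped_mean(5.0, n, mu), 4))

# a larger k requires more ratings to move the score away from mu
for k in (1, 4, 20):
    print(k, round(damped_mean(5.0, 7, mu, k=k), 4))
```

With k = 4, a restaurant needs considerably more than a handful of reviews before its adjusted score approaches its raw average.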

Note:<br> 
The current version of the proposed metric does not adjust for the age of the ratings. This is because 90% of the reviews/ratings are from 2011 or later and are considered still relevant, so the need to adjust for the age of the ratings is not strong.

In [10]:
# compute the global mean rating over all businesses, weighting each business's average rating by its review count
globe_mean = ((business.stars * business.review_count).sum())/(business.review_count.sum())
print("global mean rating is:", globe_mean)

global mean rating is: 3.7277814701620127


In [11]:
print(business.review_count.quantile([0.1,0.25,0.5,0.75,0.9]))
k = 4 # set strength k to 4, which is the 10% quantile of the review counts for all businesses
business['adjusted_score'] = (business.review_count * business.stars + k * globe_mean)/(business.review_count + k)
print("\nrank by the adjusted score in descending order:")
print(business[['stars','review_count','adjusted_score']].sort_values('adjusted_score', ascending=False).head(5))
print("\nrank by the original score in descending order:")
print(business[['stars','review_count','adjusted_score']].sort_values('stars', ascending=False).head(5))
print("\nrank by the least number of reviews:")
print(business[['stars','review_count','adjusted_score']].sort_values('review_count', ascending=True).head(5))

0.10      4.0
0.25      8.0
0.50     22.0
0.75     66.0
0.90    172.0
Name: review_count, dtype: float64

rank by the adjusted score in descending order:
       stars  review_count  adjusted_score
7464     5.0          1746        4.997092
31910    5.0          1380        4.996323
45401    5.0           547        4.990764
7784     5.0           520        4.990288
28162    5.0           472        4.989309

rank by the original score in descending order:
       stars  review_count  adjusted_score
22115    5.0             7        4.537375
23114    5.0             5        4.434570
42990    5.0             5        4.434570
42989    5.0            16        4.745556
12778    5.0             3        4.273018

rank by the least number of reviews:
       stars  review_count  adjusted_score
0        4.5             3        4.058732
5707     4.0             3        3.844447
16594    4.0             3        3.844447
5699     3.5             3        3.630161
16605    3.5             3  

# Create recommender class

In [12]:
# initialize a new column in the business dataframe to store the distance from each restaurant to the location of interest
business['distance'] = np.nan

In [15]:
class Recommender:
    
    toplist = []  # store the list of indexes of the recommended restaurants
    original_score = False # boolean indicating whether the original average rating or the adjusted score is used, default is False
    n = 10 # number of recommendations to make, default is 10
    zipcode = None
    city = None
    state = None
    distance = 10 # max distance between the restaurants and the location of interest, default is 10
    cuisine = None
    style = None
    user_id = None # user_id for personalized recommendation using either content-based or collaborative filtering
    
    def filter_by_cuisine(self, df):
        # to be added
        return df
    
    def filter_by_style(self, df):  
        # to be added
        return df
    
    def keyword(self):
        recomm = business
        if (self.zipcode is None) and (self.city is None) and (self.state is None):
            if self.cuisine is not None:
                recomm = self.filter_by_cuisine(recomm)  # methods must be called through self
            if self.style is not None:
                recomm = self.filter_by_style(recomm)
                # to be continued
        recomm = recomm.sort_values('adjusted_score', ascending=False)[:self.n]  # keep the top n by adjusted score
        return list(recomm.index)
    
    def content(self):
        # to be added
        raise NotImplementedError
    
    def collaborative(self):
        # to be added
        raise NotImplementedError

# Functions to control API interfaces

In [16]:
# TODO: provide a list of top cuisines or styles for the user to choose from in the user interface

def display_recommendation(toplist=None):
    if toplist is None or len(toplist) == 0:
        print("Sorry, there are no recommendations yet.")
    else: 
        recomm = business.loc[toplist,['state','city','name','address','distance','attributes.RestaurantsPriceRange2','cuisine','style','review_count','stars','adjusted_score']]
        print(recomm)

### Join the business dataset with the review dataset

In [17]:
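# 'stars' exists in both dataframes, so the merge renames them: stars_x is the business average rating, stars_y is the rating of an individual review
busi_review = business[['business_id','stars','review_count']].merge(review[['business_id','review_id','stars']], on='business_id')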

In [18]:
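# compare the consistency of review counts from the business dataset vs the review dataset
# in the output, the 'business_id' column is the number of reviews found in the review dataset and 'review_count' is the count recorded in the business dataset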
compare_review = busi_review.groupby('business_id')[['business_id','review_count']].agg({'business_id':'count','review_count':'mean'})
print(compare_review.head())
print(len(compare_review))
print(compare_review[compare_review.business_id != compare_review.review_count].head())
print(len(compare_review[compare_review.business_id != compare_review.review_count]))

                        business_id  review_count
business_id                                      
--7zmmkVg-IMGaXbuVd0SQ           54            54
--9e1ONYQuAa-CB_Rrw7Tw         1546          1546
--DdmeR16TRb3LsjG0ejrQ            5             5
--FBCX-N37CMYDfs790Bnw          125           125
--GM_ORV2cYS-h38DSaCLw            8             8
47553
                        business_id  review_count
business_id                                      
--cjBEbXMI2obtaRHNSFrA           64            63
-6h3K1hj0d4DRcZNUtHDuw          488           489
-6tvduBzjLI1ISfs3F_qTg         1074          1075
-8O8sVCnaIKHP-596zN9UA          177           176
-8iwcXhLnyqbLgvcrJGgaw          171           170
1661


In [19]:
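# compare the consistency of the average rating from the business dataset vs the review dataset
# stars_y_round is the mean review rating rounded to the nearest half star (exact ties round up)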
compare_rating = busi_review.groupby('business_id')[['stars_x','stars_y']].mean()
compare_rating['stars_y_round'] = (compare_rating.stars_y//0.5)*0.5 + ((compare_rating.stars_y % 0.5)//0.25)*0.5
print(compare_rating.head())
print(len(compare_rating))
print(compare_rating[compare_rating.stars_x != compare_rating.stars_y_round].head())
print(len(compare_rating[compare_rating.stars_x != compare_rating.stars_y_round]))

                        stars_x   stars_y  stars_y_round
business_id                                             
--7zmmkVg-IMGaXbuVd0SQ      4.0  3.870370            4.0
--9e1ONYQuAa-CB_Rrw7Tw      4.0  4.102846            4.0
--DdmeR16TRb3LsjG0ejrQ      3.0  3.200000            3.0
--FBCX-N37CMYDfs790Bnw      4.0  3.768000            4.0
--GM_ORV2cYS-h38DSaCLw      4.0  4.250000            4.5
47553
                        stars_x   stars_y  stars_y_round
business_id                                             
--GM_ORV2cYS-h38DSaCLw      4.0  4.250000            4.5
-7d3UqQYYcBxbDH2do86sg      3.0  3.250000            3.5
-AGdGGCeTS-njB_8GkUmjQ      4.0  4.250000            4.5
-G7MPSNBpxRJmtrJxdwt7A      3.5  3.227273            3.0
-J6xWAvDJJW4zb7J9YpYOA      2.0  2.250000            2.5
942
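The expression used for `stars_y_round` above rounds the mean review rating to the nearest half star, with exact ties (x.25 and x.75) rounded up. A shorter way to write the same rule is sketched below; `round_half_star` is a hypothetical helper, not part of the original notebook. Four of the five mismatches printed above are exact .25 ties, so the choice of tie-breaking rule matters for this comparison.

```python
import numpy as np


def round_half_star(x):
    """Round to the nearest 0.5, with exact ties (x.25, x.75) rounded up."""
    x = np.asarray(x, dtype=float)
    return np.floor(2 * x + 0.5) / 2


def round_half_star_original(x):
    """The expression used in the notebook, kept here for comparison."""
    x = np.asarray(x, dtype=float)
    return (x // 0.5) * 0.5 + ((x % 0.5) // 0.25) * 0.5


# mean review ratings taken from the printed output above
sample = [3.870370, 4.102846, 3.2, 3.768, 4.25, 3.227273, 2.25]
print(round_half_star(sample))
print(round_half_star_original(sample))
```

Both helpers should agree on the sample values; the first form only avoids the separate floor-division and modulo steps.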
