Only a subset of Yelp restaurants from a few selected states is available in this dataset. Among them, only Arizona, Nevada, Ohio, North Carolina and Pennsylvania have a rich catalog of over 5000 restaurants. 

Only the top two states, Arizona and Nevada, have over 10000 restaurants. 

In [2]:
import pandas as pd
import numpy as np
#import string

business = pd.read_csv('business_clean.csv')  # contains business data including location data, attributes and categories
user = pd.read_csv('user_clean.csv') # contains users data including the user's friend mapping and all the metadata associated with the user
review = pd.read_csv('review_clean.csv') # contains full review text data including the user_id that wrote the review and the business_id the review is written for
tip = pd.read_csv('tip_clean.csv') # tips written by a user on a business, tips are shorter than reviews and tend to convey quick suggestions
checkin = pd.read_csv('checkin_clean.csv') # checkins on a business

# 1. Calculate Geodesic distance between two points on the globe

In [3]:
# calculate geodesic distances between two points on a globe, see https://janakiev.com/blog/gps-points-distance-python/ for more details
# alternatively, one can use geopy.distance from geopy package to calculate either geodesic or great circle distance, https://github.com/geopy/geopy/blob/master/geopy/distance.py

def great_circle_mile(lat1, lon1, lat2, lon2):
    """
    Compute the great-circle distance between two points given their coordinates in degrees. 
    The function returns the distance in miles. 
    Note: 1. The calculation uses the earth's mean radius of 6371.009 km, 
    2. The central subtended angle is calculated by the spherical law of cosines: 
    alpha = arccos[sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2)*cos(lon1-lon2)]
    """
    
    from math import sin, cos, acos, radians
    
    lat1, lon1, lat2, lon2 = radians(lat1), radians(lon1), radians(lat2), radians(lon2) # convert degrees to radians
    earth_radius = 6371.009  # use earth's mean radius in kilometers
    cos_alpha = sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2)*cos(lon1-lon2)
    cos_alpha = max(-1.0, min(1.0, cos_alpha)) # clamp rounding error so that acos never raises a domain error (e.g. for identical points)
    alpha = acos(cos_alpha) # alpha is in radians
    dis_km = alpha * earth_radius
    dis_mile = dis_km * 0.621371   # convert kilometer to mile
    
    return dis_mile

In [4]:
pos1 = (51.5073219, -0.1276474) # London
pos2 = (52.5170365, 13.3888599) # Berlin
pos3 = (-33.8548157,151.2164539) # Sydney

In [5]:
%%timeit
# great_circle distance
distance_12 = great_circle_mile(pos1[0], pos1[1], pos2[0], pos2[1])
distance_12

1.75 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [6]:
from geopy.distance import distance

In [7]:
%%timeit
# geodesic distance
distance2_12 = distance(pos1, pos2).miles
distance2_12

210 µs ± 5.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [8]:
distance_12 = great_circle_mile(pos1[0], pos1[1], pos2[0], pos2[1])
distance2_12 = distance(pos1, pos2).miles
error_12 = (distance_12 - distance2_12)/distance2_12

distance_13 = great_circle_mile(pos1[0], pos1[1], pos3[0], pos3[1])
distance2_13 = distance(pos1, pos3).miles
error_13 = (distance_13 - distance2_13)/distance2_13

# print errors, 1-2 is distance between London and Berlin, 1-3 is distance between London and Sydney
print("absolute error:", (distance_12-distance2_12), (distance_13-distance2_13)) 
print("percent error:", error_12, error_13)

absolute error: -1.8326838373754981 2.892945931140275
percent error: -0.0031598293601707256 0.0002740520023695907


### Quick summary: 
Great-circle distance is reasonably accurate compared with the geodesic distance calculated from an ellipsoidal model of the earth: the relative error is about 0.3% for London to Berlin and about 0.03% for London to Sydney. Given that the great-circle calculation is more than 100 times faster on average (1.75 µs versus 210 µs per call), great-circle distance is used in this project whenever a distance on the map is needed.
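
The scalar function above is called once per restaurant. As a minimal sketch, assuming the coordinates are available as NumPy arrays or pandas columns, the same great-circle distance can be computed for many points in a single call. The helper name `haversine_mile` is hypothetical and not part of this notebook; it uses the haversine form of the formula rather than the law of cosines, which behaves better numerically for very short distances.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.009  # same mean radius as great_circle_mile
KM_TO_MILE = 0.621371       # same conversion factor as great_circle_mile

def haversine_mile(lat, lon, lat0, lon0):
    """Great-circle distance in miles from each (lat, lon) to one reference point (lat0, lon0).

    lat and lon may be scalars, NumPy arrays or pandas Series, all in degrees.
    """
    lat, lon = np.radians(lat), np.radians(lon)
    lat0, lon0 = np.radians(lat0), np.radians(lon0)
    # haversine of the central angle between the two points
    a = np.sin((lat - lat0) / 2) ** 2 + np.cos(lat) * np.cos(lat0) * np.sin((lon - lon0) / 2) ** 2
    a = np.clip(a, 0.0, 1.0)  # guard against rounding error before the square root and arcsin
    alpha = 2 * np.arcsin(np.sqrt(a))  # central angle in radians
    return alpha * EARTH_RADIUS_KM * KM_TO_MILE

# distances from London to Berlin and to Sydney in one call
lats = np.array([52.5170365, -33.8548157])
lons = np.array([13.3888599, 151.2164539])
print(haversine_mile(lats, lons, 51.5073219, -0.1276474))
```

With a dataframe, the call would look like `haversine_mile(df.latitude, df.longitude, lat0, lon0)`, which avoids a row-by-row `apply`.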

# 2. Adjusted rating

Here, a single metric is introduced as a substitute for the original average restaurant rating ('stars' column of the business dataframe). Ideally, the new metric should take into consideration: <br>
1) average rating of the restaurant (indicates the goodness of the restaurant, but does not consider popularity) <br>
2) # of ratings by users (indicates popularity, but does not imply the goodness of the restaurant) <br>
3) age of the rating (indicates the relevance of the rating, as outdated ratings might fail to indicate the actual quality)<br>

The proposed new score is: 
$$score_i = \frac{\sum_u r_{ui} + k*\mu}{n_i+k}$$

where $r_{ui}$ is the rating on item i by user u, $n_i$ is the number of ratings on item i, $\mu$ is the global mean of ratings over all businesses and all users, and k is the strength of the damping term.

As the equation shows, the adjusted score uses the mechanism of the damped mean to regulate the extreme cases of having only a few extreme ratings. k controls the strength of the damping effect: the larger k is, the more actual ratings are required to overcome the global mean.

In this case, k is set to 22 (the 50% quantile, i.e. the median, of the review counts over all businesses), but it can be tuned according to various business considerations.  

Note:<br> 
The current version of the proposed metric does not adjust for the age of the ratings. This is because 90% of the reviews/ratings are from 2011 or later and are considered relevant, so the need for an age adjustment is weak.
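
To make the damping effect concrete, the following minimal sketch evaluates the formula for two 5-star restaurants with very different numbers of ratings. The function name `damped_mean` is hypothetical; the values k = 22 and a global mean of about 3.7278 are the ones computed in this notebook's code cells, and the sum of ratings is written as the average rating times the number of ratings.

```python
def damped_mean(avg_rating, n_ratings, k, global_mean):
    """Damped mean: pull the average rating towards the global mean with strength k."""
    return (avg_rating * n_ratings + k * global_mean) / (n_ratings + k)

GLOBAL_MEAN = 3.7278  # rounded global mean rating computed in this notebook
K = 22                # damping strength used in this notebook's code

few = damped_mean(5.0, 3, K, GLOBAL_MEAN)      # 5 stars from only 3 ratings
many = damped_mean(5.0, 1746, K, GLOBAL_MEAN)  # 5 stars from 1746 ratings
print(round(few, 2), round(many, 2))
```

The restaurant with only 3 ratings is pulled down to about 3.88, close to the global mean, while the one with 1746 ratings keeps a score of about 4.98.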

In [9]:
# compute the global mean rating over all businesses, weighted by the number of reviews of each business
globe_mean = ((business.stars * business.review_count).sum())/(business.review_count.sum())
print("global mean rating is:", globe_mean)

global mean rating is: 3.7277814701620127


In [106]:
print(business.review_count.quantile([0.1,0.25,0.5,0.75,0.9]))
k = 22 # set strength k to 22, which is the 50% quantile of the review counts for all businesses
business['adjusted_score'] = (business.review_count * business.stars + k * globe_mean)/(business.review_count + k)
print("\nrank by the adjusted score in descending order:")
print(business[['review_count','stars','adjusted_score']].sort_values('adjusted_score', ascending=False).head(5))
print("\nrank by the original score in descending order:")
print(business[['review_count','stars','adjusted_score']].sort_values('stars', ascending=False).head(5))
print("\nrank by the least number of reviews:")
print(business[['review_count','stars','adjusted_score']].sort_values('review_count', ascending=True).head(5))

0.10      4.0
0.25      8.0
0.50     22.0
0.75     66.0
0.90    172.0
Name: review_count, dtype: float64

rank by the adjusted score in descending order:
       review_count  stars  adjusted_score
7464           1746    5.0        4.984169
31910          1380    5.0        4.980037
45401           547    5.0        4.950811
7784            520    5.0        4.948360
28162           472    5.0        4.943342

rank by the original score in descending order:
       review_count  stars  adjusted_score
22115             7    5.0        4.034869
23114             5    5.0        3.963377
42990             5    5.0        3.963377
42989            16    5.0        4.263452
12778             3    5.0        3.880448

rank by the least number of reviews:
       review_count  stars  adjusted_score
0                 3    4.5        3.820448
5707              3    4.0        3.760448
16594             3    4.0        3.760448
5699              3    3.5        3.700448
16605             3    3.5  

# 3. Create recommender class

In [103]:
# update data type of the 'postal_code' column of business dataframe to string type
business['postal_code'] = business.postal_code.astype(str)

In [108]:
class Recommender:
    
    def __init__(self, n=5, original_score=False):
        """initiate a Recommender object by passing the desired number of recommendations to make, the default number is 5.
        By default, the adjusted score will be used for ranking; to rank by the original average rating of the restaurant, pass original_score=True
        """
        self.n = n # number of recommendations to make, default is 5
        self.original_score = original_score # boolean indicating whether the original average rating or the adjusted score is used
        
        # initiate the list of recommendations to be the entire catalog of the 'business' dataframe sorted by the score of interest
        if self.original_score:  # set sorting criteria to the original star rating
            score = 'stars'
        else:  # set sorting criteria to the adjusted score
            score = 'adjusted_score'
        self.recomm = business.sort_values(score, ascending=False)
        # make sure the column used by display_recommendation exists before any keyword query has been run
        self.recomm['distance_to_interest'] = np.nan
        
    def _filter_by_location(self):
        """Filter and update the dataframe of recommendations by the matching location of interest.
        A combination of state, city and zipcode is used as the location information, partially missing information can be handled. 
        Matching restaurant is defined as the restaurant within the acceptable distance (max_distance) of the location of interest.
        note: this hidden method should only be called within the method 'keyword'
        """       
        from geopy.geocoders import Nominatim
        from geopy.exc import GeocoderTimedOut
        geolocator = Nominatim(user_agent="yelp_recommender") # use geopy.geocoders to make geolocation queries
        
        address = [self.city, self.state, self.zipcode]
        address = ",".join([str(i) for i in address if i is not None])
        # use geolocate query to find the coordinate for the location of interest
        try:
            location = geolocator.geocode(address, timeout=10) 
        except GeocoderTimedOut as e:
            print("Error: geocode timed out for the address of interest {} with message {}".format(address, e))
            location = None
        if location is None: # the query timed out or the address could not be resolved
            self.recomm = self.recomm.iloc[0:0] # no restaurant can be matched without a reference coordinate
            return
        # calculate the great-circle distance between each restaurant and the location of interest
        self.recomm['distance_to_interest'] = self.recomm.apply(lambda row: great_circle_mile(row.latitude, row.longitude, location.latitude, location.longitude), axis=1)
        # filter by the desired distance
        self.recomm = self.recomm[self.recomm.distance_to_interest <= self.max_distance]

    def _filter_by_state(self):
        """ Filter and update the dataframe of recommendations by the matching state.
        note: this hidden method should only be called within the method 'keyword'
        """
        self.recomm = self.recomm[self.recomm.state == self.state.upper()]
    
    def _filter_by_cuisine(self):
        """ Filter and update the dataframe of recommendations by the matching cuisine of interest. 
        note: this hidden method should only be called within the method 'keyword'
        """                         
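        idx = []
        for i in self.recomm.index: 
            if pd.notna(self.recomm.loc[i,'cuisine']): # skip restaurants without cuisine information
                entries = [e.strip() for e in self.recomm.loc[i,'cuisine'].split(',')] # strip the space that follows each comma
                if self.cuisine in entries:
                    idx.append(i)
        self.recomm = self.recomm.loc[idx]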
    
    def _filter_by_style(self):  
        """ Filter and update the dataframe of recommendations by the matching style of interest. 
        note: this hidden method should only be called within the method 'keyword'
        """
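        idx = []
        for i in self.recomm.index: 
            if pd.notna(self.recomm.loc[i,'style']): # skip restaurants without style information
                entries = [e.strip() for e in self.recomm.loc[i,'style'].split(',')] # strip the space that follows each comma
                if self.style in entries:
                    idx.append(i)
        self.recomm = self.recomm.loc[idx]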
    
    def display_recommendation(self):
        """ Display the list of top n recommended restaurants
        """
        # limit the list of recommendation to only top n at max
        if self.n < len(self.recomm):
            self.recomm = self.recomm.iloc[:self.n]
        if len(self.recomm) == 0:
            print("Sorry, there are no matching recommendations.")
        else: 
            print("The top {} recommended restaurants matching your keywords are".format(len(self.recomm)))
            display = self.recomm[['state','city','name','address','distance_to_interest','attributes.RestaurantsPriceRange2','cuisine','style','review_count','stars','adjusted_score']]
            print(display)
    
    # non-personalized keyword filtering-based recommendation module
    def keyword(self, zipcode=None, city=None, state=None, max_distance=10, cuisine=None, style=None):
        """Non-personalized recommendation by keyword filtering: 
        Support filtering by the desired distance and location (zipcode, city, state) of interest, 
        by the desired cuisine of interest and by the desired style of interest.
        Every time this method is called, a new list of recommendations is created regardless of prior history.
        ---
        Note:
        state: needs to be the upper case of the state abbreviation, e.g.: 'NV', 'CA'
        max_distance: the max acceptable distance between the restaurant and the location of interest, unit is in miles, default is 10
        ---
        """
        self.recomm = business.copy() # start with a copy of the entire 'business' catalog every time the module is called, so the shared dataframe is never modified
        # add a new column 'distance_to_interest' for storing the distance computed from the restaurant to the location of interest
        self.recomm['distance_to_interest'] = np.nan # initiate the entire column to np.nan every time the module is called
        
        self.zipcode = zipcode
        self.city = city
        self.state = state 
        self.max_distance = max_distance
        self.cuisine = cuisine
        self.style = style
                            
        # filter by restaurant location
        if (self.zipcode != None) or (self.city != None) or (self.state != None):      
            if (self.zipcode != None) or (self.city != None): # use zipcode and/or city whenever available
                self._filter_by_location()
            else: # filter by state if state is the only location information available 
                self._filter_by_state()
            if len(self.recomm) == 0:
                print("no restaurant found for the matching location of interest.")
                return []
        
        # filter by restaurant 'cuisine'
        if self.cuisine != None:
            self._filter_by_cuisine()
            if len(self.recomm) == 0:
                print("no restaurant found for the matching cuisine of {}".format(self.cuisine))
                return []
        
        # filter by restaurant 'style'
        if self.style != None:
            self._filter_by_style() 
            if len(self.recomm) == 0:
                print("no restaurant found for the matching style of {}".format(self.style))
                return []
        
        # sort the matching list of restaurants by the score of interest
        if self.original_score:  # set sorting criteria to the original star rating
            score = 'stars'
        else:  # set sorting criteria to the adjusted score
            score = 'adjusted_score'
        self.recomm = self.recomm.sort_values(score, ascending=False)
        
        # display the list of top n recommendations
        self.display_recommendation()
        
        return self.recomm
    
    # personalized content-based filtering recommender module
    def content(self, user_id=None):
        """Passing of user_id is required if personalized recommendation is desired.
        """
        self.recomm = business # start with the entire 'business' catalog every time the module is called
                           
        self.user_id = user_id  # user_id for personalized recommendation using content_based filtering 
                          
        if self.user_id is None:
            print("no user_id is provided")
            return
        if self.user_id not in user.user_id.values: # 'in' on a Series tests the index, so compare against the values
            print("No data available for this user_id")
            return
        
        # to be continued
        
        # display the list of recommendations
        self.display_recommendation()
    
    # personalized collaborative-based filtering recommender module
    def collaborative(self, user_id=None):
        """Passing of user_id is required if personalized recommendation is desired.
        """
        self.recomm = business # start with the entire 'business' catalog every time the module is called
                           
        self.user_id = user_id # user_id for personalized recommendation using collaborative filtering 

        if self.user_id is None:
            print("no user_id is provided")
            return
        if self.user_id not in user.user_id.values: # 'in' on a Series tests the index, so compare against the values
            print("No data available for this user_id")
            return
            
        # to be continued
        
        # display the list of recommendations
        self.display_recommendation()
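
The cuisine and style filters in the class loop over the index row by row. As a minimal sketch, assuming the column holds comma-separated strings such as 'desserts, bubble tea' with missing values for some rows, the same whole-entry matching can be written with pandas string methods. The helper name `filter_by_tag` and the toy dataframe are hypothetical and only serve as an illustration.

```python
import pandas as pd

def filter_by_tag(df, column, tag):
    """Keep the rows whose comma-separated `column` contains `tag` as a whole entry."""
    # missing values become an empty string, so they never match any tag
    entries = df[column].fillna('').str.split(',')
    # strip the space that follows each comma before comparing
    mask = entries.apply(lambda tags: tag in [t.strip() for t in tags])
    return df[mask]

toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'cuisine': ['barbeque', 'desserts, bubble tea', None],
})
print(filter_by_tag(toy, 'cuisine', 'bubble tea'))
```

Matching whole entries rather than substrings means that a query for 'tea' does not return a 'bubble tea' restaurant; whether that is the desired behavior is a design choice.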

### Testing on the non-personalized keyword filtering module

In [115]:
%%time
# initiate a Recommender object
kw = Recommender(n=3)

# test0: display only (same as no keywords)
print("------\nresult from test0 (display only): ")
kw.display_recommendation()

# test1: no keywords
print("------\nresult from test1 (no keywords): ")
kw.keyword();

# test 2: a combination of city, state and zipcode
print("------\nresult from test2 (a combination of city, state and zipcode): ")
kw.keyword(city='Phoenix', state='AZ', zipcode='85023');

# test 3: a combination of cuisine and style
print("------\nresult from test3 (a combination of cuisine and style): ")
kw.keyword(cuisine='barbeque', style='restaurants');

# test 4: a combination of state, cuisine and style
print("------\nresult from test4 (a combination of state, cuisine and style): ")
kw.keyword(state='NV', cuisine='desserts', style='restaurants');

# test 5: no matching location
print("------\nresult from test5 (no matching location): ")
kw.keyword(city='milpitas', zipcode='95035');

# test 6: no matching 'cuisine'
print("------\nresult from test6 (no matching cuisine): ")
kw.keyword(cuisine='abc');

# test 7: no matching 'style'
print("------\nresult from test7 (no matching style): ")
kw.keyword(style='abc');

# test 8: a combination of location, cuisine and style
print("------\nresult from test8 (a combination of location, cuisine and style): ")
kw.keyword(city='Phoenix', zipcode='85023',cuisine='barbeque', style='restaurants');

# test 9: use the original average rating and return top 10 recommendations
print("------\nresult from test9 (top 10 recommendations ranked by original average rating): ")
kw2 = Recommender(n=10, original_score=True)
kw2.keyword(city='Phoenix', zipcode='85023',cuisine='barbeque', style='restaurants');

------
result from test0 (display only): 
The top 3 recommended restaurants matching your keywords are
      state       city             name                       address  \
7464     AZ    Phoenix  Little Miss BBQ          4301 E University Dr   
31910    NV  Las Vegas     Brew Tea Bar  7380 S Rainbow Blvd, Ste 101   
45401    NV  Las Vegas       Gelatology  7910 S Rainbow Blvd, Ste 110   

       distance_to_interest  attributes.RestaurantsPriceRange2  \
7464                    NaN                                2.0   
31910                   NaN                                1.0   
45401                   NaN                                1.0   

                                   cuisine               style  review_count  \
7464                              barbeque         restaurants          1746   
31910                 desserts, bubble tea  cafes, restaurants          1380   
45401  ice cream & frozen yogurt, desserts                 NaN           547   

       stars  adju

As shown, the 9 keyword queries (tests 1 to 9) take a total CPU time of 8 seconds and an elapsed time of 12 seconds. This averages to roughly 1 second per query, which is reasonable in practice.

# Functions to control API interfaces

### Business dataset joins review dataset

In [45]:
busi_review = pd.merge(business[['business_id','stars','review_count']], review[['business_id','review_id','stars']], how='left', on='business_id')

In [46]:
# compare the consistency of review counts from business dataset vs review dataset
compare_review = busi_review.groupby('business_id')[['business_id','review_count']].agg({'business_id':'count','review_count':'mean'})
print(compare_review.head())
print(len(compare_review))
print(compare_review[compare_review.business_id != compare_review.review_count].head())
print(len(compare_review[compare_review.business_id != compare_review.review_count]))

                        business_id  review_count
business_id                                      
--7zmmkVg-IMGaXbuVd0SQ           54            54
--9e1ONYQuAa-CB_Rrw7Tw         1546          1546
--DdmeR16TRb3LsjG0ejrQ            5             5
--FBCX-N37CMYDfs790Bnw          125           125
--GM_ORV2cYS-h38DSaCLw            8             8
47553
                        business_id  review_count
business_id                                      
--cjBEbXMI2obtaRHNSFrA           64            63
-6h3K1hj0d4DRcZNUtHDuw          488           489
-6tvduBzjLI1ISfs3F_qTg         1074          1075
-8O8sVCnaIKHP-596zN9UA          177           176
-8iwcXhLnyqbLgvcrJGgaw          171           170
1661


In [47]:
# compare the consistency of average rating from business dataset vs review dataset
compare_rating = busi_review.groupby('business_id')[['stars_x','stars_y']].mean()
compare_rating['stars_y_round'] = (compare_rating.stars_y//0.5)*0.5 + ((compare_rating.stars_y % 0.5)//0.25)*0.5
print(compare_rating.head())
print(len(compare_rating))
print(compare_rating[compare_rating.stars_x != compare_rating.stars_y_round].head())
print(len(compare_rating[compare_rating.stars_x != compare_rating.stars_y_round]))

                        stars_x   stars_y  stars_y_round
business_id                                             
--7zmmkVg-IMGaXbuVd0SQ      4.0  3.870370            4.0
--9e1ONYQuAa-CB_Rrw7Tw      4.0  4.102846            4.0
--DdmeR16TRb3LsjG0ejrQ      3.0  3.200000            3.0
--FBCX-N37CMYDfs790Bnw      4.0  3.768000            4.0
--GM_ORV2cYS-h38DSaCLw      4.0  4.250000            4.5
47553
                        stars_x   stars_y  stars_y_round
business_id                                             
--GM_ORV2cYS-h38DSaCLw      4.0  4.250000            4.5
-7d3UqQYYcBxbDH2do86sg      3.0  3.250000            3.5
-AGdGGCeTS-njB_8GkUmjQ      4.0  4.250000            4.5
-G7MPSNBpxRJmtrJxdwt7A      3.5  3.227273            3.0
-J6xWAvDJJW4zb7J9YpYOA      2.0  2.250000            2.5
942
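
Several of the mismatches listed above are exact ties (a mean of x.25), where the result depends on the tie-breaking rule. The sketch below compares two rules for rounding to the nearest 0.5 on the `stars_y` values from the table. It is only an illustration: the assumption that the business-level `stars` could come from a different tie-breaking rule is not documented by the dataset, and the function names are hypothetical.

```python
import numpy as np

def round_half_up(x):
    """Round to the nearest 0.5, sending exact ties (x.25, x.75) upwards."""
    return np.floor(np.asarray(x) * 2 + 0.5) / 2

def round_half_even(x):
    """Round to the nearest 0.5 using NumPy's round-half-to-even rule for exact ties."""
    return np.round(np.asarray(x) * 2) / 2

means = np.array([4.25, 3.25, 2.25, 3.227273])  # stars_y values from the mismatch table
print(round_half_up(means))    # same rule as stars_y_round
print(round_half_even(means))  # alternative tie-breaking rule
```

For the three tie cases, round-half-to-even gives 4.0, 3.0 and 2.0, which agrees with `stars_x`, whereas the last row (3.227273 versus 3.5) differs under both rules and therefore cannot be explained by tie-breaking.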
