**Recommender system:**<br>
* Module 1 - non-personalized keyword filtering recommender:<br>
Build a keyword search-based restaurant recommender module that filters the inventory by keyword. Keywords can include, for instance, location-based information (zip code, longitude, latitude) and restaurant feature-based information (cuisine, style). 
The restaurant inventory is filtered by keywords first, then ranked either by average rating or by a weighted "smart" rating that also accounts for popularity (depending on the user's choice). The top-k restaurants in the ranked list are returned as the recommendations.<br>
* Module 2 - personalized content-based filtering recommender:<br>
Given a user ID and the restaurants' metadata, build a content-based filtering recommender module that recommends restaurants similar to the user's preferences, as inferred from the user's past ratings. More specifically, restaurants are represented by feature vectors extracted using CountVectorizer or TfidfVectorizer, pairwise similarity scores (e.g. cosine similarity) are computed between them, and restaurants are recommended by ranking the weighted similarity scores. The important restaurant metadata to consider includes categories, attributes and location.<br>
* Module 3 - personalized collaborative filtering recommender:<br>
Given the user_id x restaurant_id rating matrix, build a collaborative filtering recommender module. Recall that the dataset has a total of 1,518,169 users, 188,593 businesses and 5,996,995 reviews, so the user_id x business_id matrix is very sparse (roughly 0.002% non-empty). Therefore, matrix factorization algorithms will be used to complete the highly sparse matrix and generate recommendations.<br>
* Metrics chosen for evaluating and optimizing the ‘goodness’ of the algorithms:<br>
a) prediction accuracy: RMSE (root mean squared error)<br>
b) ranking effectiveness: MAP (mean average precision) and NDCG (normalized discounted cumulative gain)<br>
* Integration - combine the above modules to build a hybrid recommendation engine:<br>
To combine the above modules, a few simple interactive questions will be added:<br>
a) “Want customized recommendations based on your user history by providing your user ID?”  If no, activate the simple recommender module to provide base-case recommendations using location information and/or optional keywords<br>
b) If yes, prompt to ask follow up question: “do you want to try something new based on people like you?” If yes, activate the collaborative filtering module to recommend new restaurants based on similar peers; otherwise, use content filter module to recommend similar restaurants. <br>
* Other improvements:<br>
Optimize restaurant ranking by weighting the average rating by the total number of ratings (popularity), weighting individual ratings by their recency, etc. A quick interactive question, “want smart rating instead?”, lets the user switch to this alternative ranking based on the weighted scores instead of the simple average rating.<br>
* Potential caveats - cold start problem:<br>
a) new restaurant → content-based recommendation module will be able to use the features (metadata) of the new restaurant and include it when generating recommendations.<br>
b) new user → will be treated as if no user ID were provided (in both cases there is no user history), and the simple recommender module will be used to recommend restaurants based on location, keywords, popularity, etc. 
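
The evaluation metrics listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the notebook's evaluation code: the function names (`rmse`, `average_precision`, `ndcg_at_k`) and the toy inputs are assumptions made for the example. Average precision is normalized here by the number of relevant items in the ranked list, which is one common variant, and MAP is the mean of `average_precision` over users.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted ratings."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def average_precision(relevant_in_rank_order):
    """Average precision for one user; input is 0/1 relevance flags in recommended order."""
    rel = np.asarray(relevant_in_rank_order, float)
    if rel.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_i * rel).sum() / rel.sum())

def ndcg_at_k(gains_in_rank_order, k):
    """NDCG@k for one user; input is graded relevance (e.g. stars) in recommended order."""
    gains = np.asarray(gains_in_rank_order, float)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # 1/log2(rank + 1) for ranks 1..k

    def dcg(g):
        g = g[:k]
        return float((g * discounts[:len(g)]).sum())

    ideal = dcg(np.sort(gains)[::-1])  # best achievable ordering of the same gains
    return dcg(gains) / ideal if ideal > 0 else 0.0

print(rmse([4, 3, 5], [3.5, 3, 4]))
print(average_precision([1, 0, 1, 0]))
print(ndcg_at_k([5, 3, 4], k=3))
```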

**Note:**<br> 
Only a subset of Yelp restaurants from a few selected states are available in this dataset. Among them, only Arizona, Nevada, Ohio, North Carolina and Pennsylvania have a rich catalog of over 5000 restaurants. 

Only the top two states, Arizona and Nevada have over 10000 restaurants. 

In [5]:
import pandas as pd
import numpy as np
import pickle

business = pd.read_csv('business_clean.csv')  # contains business data including location data, attributes and categories
#user = pd.read_csv('user_clean.csv') # contains users data including the user's friend mapping and all the metadata associated with the user
review = pd.read_csv('review_clean.csv') # contains full review text data including the user_id that wrote the review and the business_id the review is written for
#tip = pd.read_csv('tip_clean.csv') # tips written by a user on a business, tips are shorter than reviews and tend to convey quick suggestions
#checkin = pd.read_csv('checkin_clean.csv') # checkins on a business



# 1. Calculate Geodesic distance between two points on the globe

In [6]:
# calculate geodesic distances between two points on a globe, see https://janakiev.com/blog/gps-points-distance-python/ for more resource
# alternatively, one can use geopy.distance from geopy package to calculate either geodesic or great circle distance, https://github.com/geopy/geopy/blob/master/geopy/distance.py

def great_circle_mile(lat1, lon1, lat2, lon2):
    """
    Compute the great-circle distance between two points given their coordinates in degrees. 
    The function returns the distance in miles. 
    Note: 1. The calculation uses the earth's mean radius of 6371.009 km. 
    2. The central angle is calculated with the spherical law of cosines: 
    alpha = arccos[sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2)*cos(lon1-lon2)]
    """
    
    from math import sin, cos, acos, radians
    
    lat1, lon1, lat2, lon2 = radians(lat1), radians(lon1), radians(lat2), radians(lon2) # convert degrees to radians
    earth_radius = 6371.009  # use earth's mean radius in kilometers
    cos_alpha = sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2)*cos(lon1-lon2)
    cos_alpha = max(-1.0, min(1.0, cos_alpha)) # guard against rounding just outside [-1, 1], which would make acos raise ValueError
    alpha = acos(cos_alpha) # alpha is in radians
    dis_km = alpha * earth_radius
    dis_mile = dis_km * 0.621371   # convert kilometer to mile
    
    return dis_mile

In [7]:
pos1 = (51.5073219, -0.1276474) # London
pos2 = (52.5170365, 13.3888599) # Berlin
pos3 = (-33.8548157,151.2164539) # Sydney

In [8]:
%%timeit
# great_circle distance
distance_12 = great_circle_mile(pos1[0], pos1[1], pos2[0], pos2[1])
distance_12

3.72 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [9]:
from geopy.distance import distance

In [10]:
%%timeit
# geodesic distance
distance2_12 = distance(pos1, pos2).miles
distance2_12

286 µs ± 29.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [11]:
distance_12 = great_circle_mile(pos1[0], pos1[1], pos2[0], pos2[1])
distance2_12 = distance(pos1, pos2).miles
error_12 = (distance_12 - distance2_12)/distance2_12

distance_13 = great_circle_mile(pos1[0], pos1[1], pos3[0], pos3[1])
distance2_13 = distance(pos1, pos3).miles
error_13 = (distance_13 - distance2_13)/distance2_13

# print errors, 1-2 is distance between London and Berlin, 1-3 is distance between London and Sydney
print("absolute error:", (distance_12-distance2_12), (distance_13-distance2_13)) 
print("percent error:", error_12, error_13)

absolute error: -1.8326838373754981 2.892945931140275
percent error: -0.0031598293601707256 0.0002740520023695907


### Quick summary: 
Great-circle distance is reasonably accurate when compared with the geodesic distance calculated from an ellipsoidal model of the earth: the relative error is about 0.3% for London-Berlin and about 0.03% for London-Sydney. Given that the great-circle calculation is roughly 75 times faster in the timings above (3.72 µs vs. 286 µs per call), great-circle distance will be used in this project when calculating distances on the map.
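
Because the distance is later computed for every restaurant in the catalog, the same spherical law of cosines can also be applied to whole arrays at once. The sketch below is an assumption-level illustration, not the notebook's implementation: `great_circle_mile_vec` is a hypothetical name, and it reuses the mean earth radius and the kilometer-to-mile factor from the function above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.009   # earth's mean radius, same value as in great_circle_mile
KM_TO_MILE = 0.621371

def great_circle_mile_vec(lat, lon, lat0, lon0):
    """Great-circle distance in miles from each (lat, lon) to one reference point.

    lat and lon are array-likes in degrees; lat0 and lon0 are scalars in degrees.
    """
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    lat0, lon0 = np.radians(lat0), np.radians(lon0)
    cos_alpha = np.sin(lat) * np.sin(lat0) + np.cos(lat) * np.cos(lat0) * np.cos(lon - lon0)
    # clip guards against rounding slightly outside [-1, 1] before arccos
    alpha = np.arccos(np.clip(cos_alpha, -1.0, 1.0))
    return alpha * EARTH_RADIUS_KM * KM_TO_MILE

# distances from London to Berlin, Sydney and London itself
lats = [52.5170365, -33.8548157, 51.5073219]
lons = [13.3888599, 151.2164539, -0.1276474]
dist = great_circle_mile_vec(lats, lons, 51.5073219, -0.1276474)
print(dist)
```

On a dataframe with `latitude` and `longitude` columns, the two columns could be passed directly in place of `lats` and `lons`, which avoids a row-by-row `apply`.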

# 2. Adjusted rating

Here, a single metric is introduced as a substitute of the original average restaurant rating ('stars' column of the business dataframe). Ideally, the new metric should take into consideration: <br>
1) average rating of the restaurant (indicates the goodness of the restaurant, but does not consider popularity) <br>
2) # of ratings by users (indicates popularity, but does not imply the goodness of the restaurant) <br>
3) age of the rating (indicates the relevance of the rating, as outdated ratings might fail to indicate the actual quality)<br>

The proposed new score is: 
$$score_i = \frac{\sum_u r_{ui} + k*\mu}{n_i+k}$$

where $r_{ui}$ is the rating on item i by user u, $n_i$ is the number of ratings on item i, $\mu$ is the global mean of ratings over all businesses and all users, and k is the strength of the damping term.

As the equation shows, the adjusted score uses the mechanism of the damped mean to regulate the extreme cases of having only a few extreme ratings. k controls the strength of the damping effect: the larger k is, the more actual ratings are required to overcome the global mean.

In this case, k is set to 22 (which is the 50% quantile, i.e. the median, of the review counts for all businesses), but it can be tuned according to various business considerations.  

Note:<br> 
Here, the age of the rating is not adjusted in the current version of the proposed metric. This is because 90% of the reviews/ratings are from year 2011 and later, and are considered quite relevant. Therefore, the need of adjusting for the age of the ratings is not strong.
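
To make the damping effect concrete, the score can be evaluated for a few 5-star restaurants with different review counts. This is a small sketch: `damped_score` is an illustrative helper name, the global mean is the value printed by the next code cell, and k = 22 is the value used in the code cell that computes `adjusted_score`.

```python
def damped_score(mean_rating, n_ratings, global_mean, k):
    """Damped mean: pull an item's average rating toward the global mean with k pseudo-ratings."""
    return (mean_rating * n_ratings + k * global_mean) / (n_ratings + k)

mu = 3.7277814701620127  # global mean rating of the dataset

# the same 5-star average is trusted more as the number of ratings grows
for n in (3, 7, 472, 1746):
    print(n, round(damped_score(5.0, n, mu, k=22), 6))
```

A 5-star restaurant with only 3 ratings stays close to the global mean, while one with 1746 ratings keeps almost all of its 5-star average.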

In [12]:
# compute the global mean rating over all businesses, weighting each business's average stars by its review count
globe_mean = ((business.stars * business.review_count).sum())/(business.review_count.sum())
print("global mean rating is:", globe_mean)

global mean rating is: 3.7277814701620127


In [13]:
print(business.review_count.quantile([0.1,0.25,0.5,0.75,0.9]))
k = 22 # set strength k to 22, which is the 50% quantile of the review counts for all businesses
business['adjusted_score'] = (business.review_count * business.stars + k * globe_mean)/(business.review_count + k)
print("\nrank by the adjusted score in descending order:")
print(business[['review_count','stars','adjusted_score']].sort_values('adjusted_score', ascending=False).head(5))
print("\nrank by the original score in descending order:")
print(business[['review_count','stars','adjusted_score']].sort_values('stars', ascending=False).head(5))
print("\nrank by the least number of reviews:")
print(business[['review_count','stars','adjusted_score']].sort_values('review_count', ascending=True).head(5))

0.10      4.0
0.25      8.0
0.50     22.0
0.75     66.0
0.90    172.0
Name: review_count, dtype: float64

rank by the adjusted score in descending order:
       review_count  stars  adjusted_score
7464           1746    5.0        4.984169
31910          1380    5.0        4.980037
45401           547    5.0        4.950811
7784            520    5.0        4.948360
28162           472    5.0        4.943342

rank by the original score in descending order:
       review_count  stars  adjusted_score
22115             7    5.0        4.034869
23114             5    5.0        3.963377
42990             5    5.0        3.963377
42989            16    5.0        4.263452
12778             3    5.0        3.880448

rank by the least number of reviews:
       review_count  stars  adjusted_score
0                 3    4.5        3.820448
5707              3    4.0        3.760448
16594             3    4.0        3.760448
5699              3    3.5        3.700448
16605             3    3.5  

# 3. Building hybrid recommendation engine

## 3.1 non-personalized keyword filtering module

In [14]:
# update data type of the 'postal_code' column of business dataframe to string type
business['postal_code'] = business.postal_code.astype(str)

In [30]:
class Recommender:
    
    def __init__(self, n=5, original_score=False):
        """initiate a Recommender object by passing the desired number of recommendations to make, the default number is 5.
        By default, the adjusted score will be used for ranking; to rank by the original average rating of the restaurant, pass original_score=True
        """
        self.n = n # number of recommendations to make, default is 5
        self.original_score = original_score # boolean indicating whether the original average rating or the adjusted score is used
        # initiate a list of column names to display in the recommendation results
        self.column_to_display = ['state','city','name','address','attributes.RestaurantsPriceRange2','cuisine','style','review_count','stars','adjusted_score']
        
        # initiate the list of recommendations to be the entire catalog of the 'business' dataframe sorted by the score of interest
        if self.original_score:  # set sorting criteria to the original star rating
            score = 'stars'
        else:  # set sorting criteria to the adjusted score
            score = 'adjusted_score'
        self.recomm = business.sort_values(score, ascending=False)
        
    def _filter_by_location(self):
        """Filter and update the dataframe of recommendations by the matching location of interest.
        A combination of state, city and zipcode is used as the location information, partially missing information can be handled. 
        Matching restaurant is defined as the restaurant within the acceptable distance (max_distance) of the location of interest.
        note: this hidden method should only be called within the method 'keyword'
        """       
        from geopy.geocoders import Nominatim
        from geopy.exc import GeocoderTimedOut
        geolocator = Nominatim(user_agent="yelp_recommender") # use geopy.geocoders to make geolocation queries
        address = [self.city, self.state, self.zipcode]
        address = ",".join([str(i) for i in address if i is not None])
        # use geolocate query to find the coordinate for the location of interest
        try:
            location = geolocator.geocode(address, timeout=10) 
        except GeocoderTimedOut as e:
            print("Error: geocode timed out for the address of interest {} with message {}".format(address, e))
            location = None
        if location is None:  # the query timed out or geocode() found no match for the address
            self.recomm = self.recomm.iloc[0:0]  # no reference point, so leave an empty recommendation list
            return

        # calculate the great-circle distance between each restaurant and the location of interest and add as a new column 'distance_to_interest'
        self.recomm['distance_to_interest'] = self.recomm.apply(lambda row: great_circle_mile(row.latitude, row.longitude, location.latitude, location.longitude), axis=1)
        # add the new column 'distance_to_interest' to the list of columns to display in the recommendation result
        self.column_to_display.insert(0, 'distance_to_interest')
        # filter by the desired distance
        self.recomm = self.recomm[self.recomm.distance_to_interest <= self.max_distance]

    def _filter_by_state(self):
        """ Filter and update the dataframe of recommendations by the matching state.
        note: this hidden method should only be called within the method 'keyword'
        """
        self.recomm = self.recomm[self.recomm.state == self.state.upper()]
    
    def _filter_by_cuisine(self):
        """ Filter and update the dataframe of recommendations by the matching cuisine of interest. 
        note: this hidden method should only be called within the method 'keyword'
        """                         
        idx = []
        for i in self.recomm.index: 
            if pd.notna(self.recomm.loc[i,'cuisine']):
                entries = [e.strip() for e in self.recomm.loc[i,'cuisine'].split(',')]  # strip spaces so 'desserts, bubble tea' yields 'bubble tea'
                if self.cuisine in entries:
                    idx.append(i)
        self.recomm = self.recomm.loc[idx]
    
    def _filter_by_style(self):  
        """ Filter and update the dataframe of recommendations by the matching style of interest. 
        note: this hidden method should only be called within the method 'keyword'
        """
        idx = []
        for i in self.recomm.index: 
            if pd.notna(self.recomm.loc[i,'style']):
                entries = [e.strip() for e in self.recomm.loc[i,'style'].split(',')]  # strip spaces so 'cafes, restaurants' yields 'restaurants'
                if self.style in entries:
                    idx.append(i)
        self.recomm = self.recomm.loc[idx]
    
    def display_recommendation(self):
        """ Display the list of top n recommended restaurants
        """
        if len(self.recomm) == 0:
            print("Sorry, there are no matching recommendations.")
        elif self.n < len(self.recomm):  # display only the top n from the recommendation list
            print("Below is a list of the top {} recommended restaurants for you: ".format(self.n))
            print(self.recomm.iloc[:self.n][self.column_to_display])
        else:  # display all if # of recommendations is less than self.n
            print("Below is a list of the top {} recommended restaurants for you: ".format(len(self.recomm)))
            print(self.recomm[self.column_to_display])
    
    # non-personalized keyword filtering-based recommendation module
    def keyword(self, df=business, zipcode=None, city=None, state=None, max_distance=10, cuisine=None, style=None):
        """Non-personalized recommendation by keyword filtering: 
        Support filtering by the desired distance and location (zipcode, city, state) of interest, 
        by the desired cuisine of interest and by the desired style of interest.
        ---
        Note:
        df: the default restaurant catalog is all the restaurants in the 'business' dataframe, 
            if a subset is preferred, e.g. previous filtered result, the subset can be passed to df
        state: needs to be the upper case of the state abbreviation, e.g.: 'NV', 'CA'
        max_distance: the max acceptable distance between the restaurant and the location of interest, unit is in miles, default is 10
        ---
        """
        # re-initiate the following variables every time the module is called so that the recommendation starts fresh
        self.recomm = df # start with the desired restaurant catalog
        self.recomm['distance_to_interest'] = np.nan # reset the distance between each restaurant and the location of interest
        self.column_to_display = ['state','city','name','address','attributes.RestaurantsPriceRange2','cuisine','style','review_count','stars','adjusted_score'] # reset the columns to display
        
        # assign variables based on user's keyword inputs
        self.zipcode = zipcode
        self.city = city
        self.state = state 
        self.max_distance = max_distance
        self.cuisine = cuisine
        self.style = style
             
        # filter by restaurant location
        if (self.zipcode != None) or (self.city != None) or (self.state != None):      
            if (self.zipcode != None) or (self.city != None): # use zipcode and/or city whenever available
                self._filter_by_location()
            else: # filter by state if state is the only location information available 
                self._filter_by_state()
            if len(self.recomm) == 0:
                print("no restaurant found for the matching location of interest.")
                return None
        
        # filter by restaurant 'cuisine'
        if self.cuisine != None:
            self._filter_by_cuisine()
            if len(self.recomm) == 0:
                print("no restaurant found for the matching cuisine of {}".format(self.cuisine))
                return None
    
        # filter by restaurant 'style'
        if self.style != None:
            self._filter_by_style() 
            if len(self.recomm) == 0:
                print("no restaurant found for the matching style of {}".format(self.style))
                return None
        
        # sort the matching list of restaurants by the score of interest
        if self.original_score:  # set sorting criteria to the original star rating
            score = 'stars'
        else:  # set sorting criteria to the adjusted score
            score = 'adjusted_score'
        self.recomm = self.recomm.sort_values(score, ascending=False)
        
        # display the list of top n recommendations
        self.display_recommendation()
        
        return self.recomm
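
The cuisine and style filters above compare the keyword against a comma-separated cell such as 'desserts, bubble tea'. The standalone sketch below isolates that matching step. `has_keyword` is a hypothetical helper, and stripping whitespace and ignoring case are assumptions about the desired behavior rather than something the notebook states.

```python
import math

def has_keyword(cell, keyword):
    """Return True if keyword equals one of the comma-separated entries in cell."""
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return False  # a missing value (NaN) never matches
    entries = [entry.strip().lower() for entry in str(cell).split(',')]
    return keyword.strip().lower() in entries

print(has_keyword('desserts, bubble tea', 'bubble tea'))                # entry after the comma
print(has_keyword('ice cream & frozen yogurt, desserts', 'ice cream'))  # partial entry
print(has_keyword(float('nan'), 'desserts'))                            # missing value
```

Matching whole entries rather than substrings keeps 'ice cream' from matching 'ice cream & frozen yogurt'; whether that is desirable depends on how the keywords are meant to be used.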

### Testing of the non-personalized keyword filtering module

In [31]:
%%time
# initiate a Recommender object
kw = Recommender(n=3)

# test0: display only (same as no keywords)
print("------\nresult from test0 (display only): ")
kw.display_recommendation()

# test1: no keywords
print("------\nresult from test1 (no keywords): ")
kw.keyword();

# test 2: a combination of city, state and zipcode
print("------\nresult from test2 (a combination of city, state and zipcode): ")
kw.keyword(city='Phoenix', state='AZ', zipcode='85023');

# test 3: a combination of cuisine and style
print("------\nresult from test3 (a combination of cuisine and style): ")
kw.keyword(cuisine='barbeque', style='restaurants');

# test 4: a combination of state, cuisine and style
print("------\nresult from test4 (a combination of state, cuisine and style): ")
kw.keyword(state='NV', cuisine='desserts', style='restaurants');

# test 5: no matching location
print("------\nresult from test5 (no matching location): ")
kw.keyword(city='milpitas', zipcode='95035');

# test 6: no matching 'cuisine'
print("------\nresult from test6 (no matching cuisine): ")
kw.keyword(cuisine='abc');

# test 7: no matching 'style'
print("------\nresult from test7 (no matching style): ")
kw.keyword(style='abc');

# test 8: a combination of location, cuisine and style
print("------\nresult from test8 (a combination of location, cuisine and style): ")
kw.keyword(city='Phoenix', zipcode='85023',cuisine='barbeque', style='restaurants');

# test 9: use the original average rating and return top 10 recommendations
print("------\nresult from test9 (top 10 recommendations ranked by original average rating): ")
kw2 = Recommender(n=10, original_score=True)
kw2.keyword(city='Phoenix', zipcode='85023',cuisine='barbeque', style='restaurants');

------
result from test0 (display only): 
Below is a list of the top 3 recommended restaurants for you: 
      state       city             name                       address  \
7464     AZ    Phoenix  Little Miss BBQ          4301 E University Dr   
31910    NV  Las Vegas     Brew Tea Bar  7380 S Rainbow Blvd, Ste 101   
45401    NV  Las Vegas       Gelatology  7910 S Rainbow Blvd, Ste 110   

       attributes.RestaurantsPriceRange2                              cuisine  \
7464                                 2.0                             barbeque   
31910                                1.0                 desserts, bubble tea   
45401                                1.0  ice cream & frozen yogurt, desserts   

                    style  review_count  stars  adjusted_score  
7464          restaurants          1746    5.0        4.984169  
31910  cafes, restaurants          1380    5.0        4.980037  
45401                 NaN           547    5.0        4.950811  
------
result fro

As shown, the 9 keyword queries (tests 1-9, plus the display-only test0) take a total CPU time of 8 seconds and an elapsed time of 12 seconds. This averages to roughly 1-2 seconds per query, which is reasonable in practice.

## 3.2 adding the personalized collaborative filtering module

In [22]:
# create a reduced copy by removing the duplicated user, restaurant rating combinations
review_r = review[~review.duplicated(['user_id','business_id'], keep='first')]
review_r.reset_index(inplace=True, drop=True)
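
The collaborative module defined next scores every restaurant for one user with a biased matrix factorization prediction: global mean plus user bias plus item bias plus the dot product of the user and item latent vectors. The sketch below reproduces that computation on toy arrays. The shapes and values are made up for illustration and are not taken from `svd_trained_info.pkl`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 4, 6, 3

# toy stand-ins for the trained parameters (random values, for illustration only)
user_latent = rng.normal(size=(n_users, n_factors))   # one latent vector per user
item_latent = rng.normal(size=(n_items, n_factors))   # one latent vector per restaurant
user_bias = rng.normal(scale=0.1, size=n_users)
item_bias = rng.normal(scale=0.1, size=n_items)
r_mean = 3.7                                          # toy global mean rating

u_idx = 2  # row index of the user of interest
# predicted rating of every restaurant for this user, as one vector of length n_items
pred = r_mean + user_bias[u_idx] + item_bias + np.dot(user_latent[u_idx, :], item_latent.T)

top = np.argsort(-pred)[:3]  # indices of the 3 highest predicted ratings
print(pred.round(2), top)
```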

In [72]:
class Recommender:
    
    def __init__(self, n=5, original_score=False):
        """initiate a Recommender object by passing the desired number of recommendations to make, the default number is 5.
        By default, the adjusted score will be used for ranking; to rank by the original average rating of the restaurant, pass original_score=True
        """
        self.n = n # number of recommendations to make, default is 5
        self.original_score = original_score # boolean indicating whether the original average rating or the adjusted score is used
        # initiate a list of column names to display in the recommendation results
        self.column_to_display = ['state','city','name','address','attributes.RestaurantsPriceRange2','cuisine','style','review_count','stars','adjusted_score']
        
        # initiate the list of recommendations to be the entire catalog of the 'business' dataframe sorted by the score of interest
        if self.original_score:  # set sorting criteria to the original star rating
            score = 'stars'
        else:  # set sorting criteria to the adjusted score
            score = 'adjusted_score'
        self.recomm = business.sort_values(score, ascending=False)
        
    def _filter_by_location(self):
        """Filter and update the dataframe of recommendations by the matching location of interest.
        A combination of state, city and zipcode is used as the location information, partially missing information can be handled. 
        Matching restaurant is defined as the restaurant within the acceptable distance (max_distance) of the location of interest.
        note: this hidden method should only be called within the method 'keyword'
        """       
        from geopy.geocoders import Nominatim
        from geopy.exc import GeocoderTimedOut
        geolocator = Nominatim(user_agent="yelp_recommender") # use geopy.geocoders to make geolocation queries
        address = [self.city, self.state, self.zipcode]
        address = ",".join([str(i) for i in address if i is not None])
        # use geolocate query to find the coordinate for the location of interest
        try:
            location = geolocator.geocode(address, timeout=10) 
        except GeocoderTimedOut as e:
            print("Error: geocode timed out for the address of interest {} with message {}".format(address, e))
            location = None
        if location is None:  # the query timed out or geocode() found no match for the address
            self.recomm = self.recomm.iloc[0:0]  # no reference point, so leave an empty recommendation list
            return

        # calculate the great-circle distance between each restaurant and the location of interest and add as a new column 'distance_to_interest'
        self.recomm['distance_to_interest'] = self.recomm.apply(lambda row: great_circle_mile(row.latitude, row.longitude, location.latitude, location.longitude), axis=1)
        # add the new column 'distance_to_interest' to the list of columns to display in the recommendation result
        self.column_to_display.insert(0, 'distance_to_interest')
        # filter by the desired distance
        self.recomm = self.recomm[self.recomm.distance_to_interest <= self.max_distance]

    def _filter_by_state(self):
        """ Filter and update the dataframe of recommendations by the matching state.
        note: this hidden method should only be called within the method 'keyword'
        """
        self.recomm = self.recomm[self.recomm.state == self.state.upper()]
    
    def _filter_by_cuisine(self):
        """ Filter and update the dataframe of recommendations by the matching cuisine of interest. 
        note: this hidden method should only be called within the method 'keyword'
        """                         
        idx = []
        for i in self.recomm.index: 
            if pd.notna(self.recomm.loc[i,'cuisine']):
                entries = [e.strip() for e in self.recomm.loc[i,'cuisine'].split(',')]  # strip spaces so 'desserts, bubble tea' yields 'bubble tea'
                if self.cuisine in entries:
                    idx.append(i)
        self.recomm = self.recomm.loc[idx]
    
    def _filter_by_style(self):  
        """ Filter and update the dataframe of recommendations by the matching style of interest. 
        note: this hidden method should only be called within the method 'keyword'
        """
        idx = []
        for i in self.recomm.index: 
            if pd.notna(self.recomm.loc[i,'style']):
                entries = [e.strip() for e in self.recomm.loc[i,'style'].split(',')]  # strip spaces so 'cafes, restaurants' yields 'restaurants'
                if self.style in entries:
                    idx.append(i)
        self.recomm = self.recomm.loc[idx]
    
    def display_recommendation(self):
        """ Display the list of top n recommended restaurants
        """
        if len(self.recomm) == 0:
            print("Sorry, there are no matching recommendations.")
        elif self.n < len(self.recomm):  # display only the top n from the recommendation list
            print("Below is a list of the top {} recommended restaurants for you: ".format(self.n))
            print(self.recomm.iloc[:self.n][self.column_to_display])
        else:  # display all if # of recommendations is less than self.n
            print("Below is a list of the top {} recommended restaurants for you: ".format(len(self.recomm)))
            print(self.recomm[self.column_to_display])
     
    #---------------------------------------------------------------
    # non-personalized keyword filtering-based recommender module
    def keyword(self, df=business, zipcode=None, city=None, state=None, max_distance=10, cuisine=None, style=None, personalized=False):
        """Non-personalized recommendation by keyword filtering: 
        Support filtering by the desired distance and location (zipcode, city, state) of interest, 
        by the desired cuisine of interest and by the desired style of interest.
        ---
        Note:
        df: the default restaurant catalog is all the restaurants in the 'business' dataframe, 
            if a subset is preferred, e.g. previous filtered result, the subset can be passed to df
        state: needs to be the upper case of the state abbreviation, e.g.: 'NV', 'CA'
        max_distance: the max acceptable distance between the restaurant and the location of interest, unit is in miles, default is 10
        ---
        """
        # re-initiate the following variables every time the module is called so that the recommendation starts fresh
        self.recomm = df # start with the desired restaurant catalog
        self.recomm['distance_to_interest'] = np.nan # reset the distance between each restaurant and the location of interest
        self.column_to_display = ['state','city','name','address','attributes.RestaurantsPriceRange2','cuisine','style','review_count','stars','adjusted_score'] # reset the columns to display
        
        # assign variables based on user's keyword inputs
        self.zipcode = zipcode
        self.city = city
        self.state = state 
        self.max_distance = max_distance
        self.cuisine = cuisine
        self.style = style
        
        # check personalized 'predicted_stars' is available for ranking and displaying personalized recommendations
        if personalized:
            if 'predicted_stars' not in self.recomm.columns:
                print("no personalized list of recommendations is generated yet!")
                print("please first run the collaborative filtering module or content-based filtering module for a personalized recommendations.")
                return None
        
        # filter by restaurant location
        if (self.zipcode != None) or (self.city != None) or (self.state != None):      
            if (self.zipcode != None) or (self.city != None): # use zipcode and/or city whenever available
                self._filter_by_location()
            else: # filter by state if state is the only location information available 
                self._filter_by_state()
            if len(self.recomm) == 0:
                print("no restaurant found for the matching location of interest.")
                return None
        
        # filter by restaurant 'cuisine'
        if self.cuisine != None:
            self._filter_by_cuisine()
            if len(self.recomm) == 0:
                print("no restaurant found for the matching cuisine of {}".format(self.cuisine))
                return None
    
        # filter by restaurant 'style'
        if self.style != None:
            self._filter_by_style() 
            if len(self.recomm) == 0:
                print("no restaurant found for the matching style of {}".format(self.style))
                return None
        
        # sort the matching list of restaurants by the score of interest
        if personalized:
            score = 'predicted_stars'
            self.column_to_display.insert(0, 'predicted_stars')  # add 'predicted_stars' to the list of columns to display
        elif self.original_score:  # set sorting criteria to the original star rating
            score = 'stars'
        else:  # set sorting criteria to the adjusted score
            score = 'adjusted_score'
        self.recomm = self.recomm.sort_values(score, ascending=False)
        
        # display the list of top n recommendations
        self.display_recommendation()
        
        return self.recomm
    
    #------------------------------------------------------------
    # personalized collaborative filtering recommender module
    def collaborative(self, user_id=None):
        """A user_id must be passed to obtain personalized recommendations.
        """
        
        self.user_id = user_id # user_id for personalized recommendation using collaborative filtering 
        if self.user_id is None:
            print("no user_id is provided!")
            return None
        if len(self.user_id) != 22:
            print("invalid user id!")
            return None
        
        # initiate every time the module is called
        self.recomm = business.copy() # start with a copy of the entire 'business' catalog so the in-place drop below never modifies the global dataframe
        self.column_to_display = ['state','city','name','address','attributes.RestaurantsPriceRange2',
                                  'cuisine','style','review_count','stars','adjusted_score'] # reset the columns to display
        if 'predicted_stars' in self.recomm.columns:
            self.recomm.drop('predicted_stars', axis=1, inplace=True) # delete the column of 'predicted_stars' if already present
        
        # load and extract the necessary info from the trained matrix factorization algorithm
        with open('svd_trained_info.pkl', 'rb') as f:
            svd_trained_info = pickle.load(f)
        user_latent = svd_trained_info['user_latent']
        item_latent = svd_trained_info['item_latent']
        user_bias = svd_trained_info['user_bias']
        item_bias = svd_trained_info['item_bias']
        r_mean = svd_trained_info['mean_rating'] # global mean of all ratings
        userid_to_idx = svd_trained_info['userid_to_index']
        itemid_to_idx = svd_trained_info['itemid_to_index']
        
        # predict personalized restaurant ratings for the user_id of interest
        if self.user_id in userid_to_idx:
            u_idx = userid_to_idx[self.user_id]
            pred = r_mean + user_bias[u_idx] + item_bias + np.dot(user_latent[u_idx,:],item_latent.T)
        else: 
            print("sorry, no personal data available for this user_id yet!")
            print("Here is the generic recommendation computed from all the users in our database:")
            pred = r_mean + item_bias
        
        # pairing the predicted ratings with the business_id by matching the corresponding matrix indices of the business_id
        prediction = pd.DataFrame(data=pred, index=itemid_to_idx.values(), columns=['predicted_stars']) 
        prediction.index.name = 'matrix_item_indice'
        assert len(prediction) == len(pred)
        prediction['business_id'] = list(itemid_to_idx.keys())
        
        # keep only the business_id not yet rated by the user_id of interest, if a personal history is available
        if self.user_id in userid_to_idx:       
            busi_rated = review_r[review_r.user_id == self.user_id].business_id.unique()
            prediction = prediction[~prediction.business_id.isin(busi_rated)]
        
        # inner-join the prediction dataframe with the recommendation catalog on 'business_id' to retrieve all relevant business information
        # note: the .merge step needs to be performed prior to extracting the top n
        # because many of the 'business_id' in the review dataframe are not restaurant-related, therefore not present in the 'business' catalog
        self.recomm = self.recomm.merge(prediction, on='business_id', how='inner') 
        
        # sort the prediction by the predicted ratings in descending order
        self.recomm = self.recomm.sort_values('predicted_stars', ascending=False).reset_index(drop=True)
        
        # add 'predicted_stars' to the list of columns to display
        self.column_to_display.insert(0, 'predicted_stars') 
        
        # display the list of top n recommendations
        self.display_recommendation()
        
        return self.recomm
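
The prediction step in `collaborative` adds the global mean, the user bias, the item biases and the dot product of the user and item latent factors, and falls back to global mean plus item bias for an unseen user. A minimal sketch of that arithmetic is given below, assuming toy arrays in place of the contents of `svd_trained_info.pkl`; every number and the user ids `user_a`, `user_b` and `user_z` are made up for illustration, and `predict_all_items` is a hypothetical helper, not part of the class.

```python
import numpy as np

# Toy stand-ins for the arrays stored in the pickle (all values are made up).
r_mean = 3.7                                   # global mean rating
user_bias = np.array([0.2, -0.4])              # one bias per known user
item_bias = np.array([0.5, -0.1, 0.3])         # one bias per item
user_latent = np.array([[0.1, 0.3],            # shape: (n_users, n_factors)
                        [0.0, -0.2]])
item_latent = np.array([[1.0, 0.5],            # shape: (n_items, n_factors)
                        [0.2, 0.1],
                        [-0.4, 0.8]])
userid_to_idx = {"user_a": 0, "user_b": 1}     # hypothetical user ids


def predict_all_items(user_id):
    """Return one predicted rating per item for the given user."""
    if user_id in userid_to_idx:
        u = userid_to_idx[user_id]
        # global mean + user bias + item biases + latent-factor interaction
        return r_mean + user_bias[u] + item_bias + item_latent @ user_latent[u]
    # unseen user: fall back to the non-personalized baseline
    return r_mean + item_bias


known = predict_all_items("user_a")    # personalized scores
unseen = predict_all_items("user_z")   # generic scores for an unknown id
print(known)
print(unseen)
```

In this sketch the raw scores are not clipped to the 1-5 star scale, which mirrors the module above and is why a predicted rating can exceed 5.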

### testing of the personalized collaborative filtering recommender module

In [73]:
%%time

# initiate a Recommender object
col = Recommender(n=5)

# test0: display only (same as no keywords)
print("------\nresult from test0 (display only): ")
col.display_recommendation()

# test1: no user id input
print("------\nresult from test1 (no user id input): ")
col.collaborative();

# test 2: invalid user id input
print("------\nresult from test2 (invalid user id input): ")
col.collaborative(user_id='928402');

------
result from test0 (display only): 
Below is a list of the top 5 recommended restaurants for you: 
      state             city                name  \
7464     AZ          Phoenix     Little Miss BBQ   
31910    NV        Las Vegas        Brew Tea Bar   
45401    NV        Las Vegas          Gelatology   
7784     NV  North Las Vegas        Poke Express   
28162    NV        Las Vegas  Meráki Greek Grill   

                            address  attributes.RestaurantsPriceRange2  \
7464           4301 E University Dr                                2.0   
31910  7380 S Rainbow Blvd, Ste 101                                1.0   
45401  7910 S Rainbow Blvd, Ste 110                                1.0   
7784        655 W Craig Rd, Ste 118                                2.0   
28162  4950 S Rainbow Blvd, Ste 160                                2.0   

                                   cuisine               style  review_count  \
7464                              barbeque         restau

In [75]:
%%time

# test 3: valid user id (no user data)
print("------\nresult from test3 (valid user id --- no user review data): ")
col.collaborative(user_id='-NzChtoNOw706kps82x0Kg');

------
result from test3 (valid user id --- no user review data): 
sorry, no personal data available for this user_id yet!
Here is the generic recommendation computed from all the users in our database:
Below is a list of the top 5 recommended restaurants for you: 
   predicted_stars state       city                              name  \
0         4.977158    NV  Henderson                        Party Pros   
1         4.968034    AZ      Tempe  Affordable Party & Event Rentals   
2         4.940860    WI    Madison           The Conscious Carnivore   
3         4.926701    NV  Las Vegas                  CHEFit Meal Prep   
4         4.916635    NV  Henderson                 Cakes On the Move   

                      address  attributes.RestaurantsPriceRange2  \
0           1153 Enchanted Ct                                NaN   
1      510 S 52nd St, Ste 105                                NaN   
2  3236 University Ave, Ste A                                2.0   
3             6235 S Pe

In [76]:
%%time

# test 4: valid user id (user has only one review)
print("------\nresult from test4 (valid user id --- user has only one review): ")
col.collaborative(user_id='---89pEy_h9PvHwcHNbpyg');

------
result from test4 (valid user id --- user has only one review): 
Below is a list of the top 5 recommended restaurants for you: 
   predicted_stars state       city                              name  \
0         5.176233    AZ      Tempe  Affordable Party & Event Rentals   
1         5.158063    NV  Henderson                        Party Pros   
2         5.155794    WI    Madison           The Conscious Carnivore   
3         5.152605    AZ    Gilbert           Big Island Hawaiian BBQ   
4         5.124354    AZ      Tempe              Almaza Hookah Lounge   

                      address  attributes.RestaurantsPriceRange2  \
0      510 S 52nd St, Ste 105                                NaN   
1           1153 Enchanted Ct                                NaN   
2  3236 University Ave, Ste A                                2.0   
3        2919 South Market St                                1.0   
4   107 E Baseline Rd, Ste A3                                2.0   

              cui

As shown, it takes only 2 seconds to return the personalized ranking. However, because this user has only one review, the result stays close to the generic recommendation for unseen users: three of the top five entries (Party Pros, Affordable Party & Event Rentals and The Conscious Carnivore) appear in both lists.
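
One way to put a number on how close two rankings are is the overlap of their top-k sets. The sketch below is only an illustration: `topk_overlap` is a hypothetical helper that is not part of the `Recommender` class, and the two lists are the restaurant names copied from the test3 and test4 outputs above.

```python
def topk_overlap(list_a, list_b):
    """Return the shared items and the Jaccard similarity of two top-k lists."""
    set_a, set_b = set(list_a), set(list_b)
    shared = set_a & set_b
    return shared, len(shared) / len(set_a | set_b)


# top 5 names from test3 (unseen user, generic recommendation)
generic_top5 = ["Party Pros", "Affordable Party & Event Rentals",
                "The Conscious Carnivore", "CHEFit Meal Prep", "Cakes On the Move"]
# top 5 names from test4 (user with a single review)
sparse_user_top5 = ["Affordable Party & Event Rentals", "Party Pros",
                    "The Conscious Carnivore", "Big Island Hawaiian BBQ",
                    "Almaza Hookah Lounge"]

shared, jaccard = topk_overlap(generic_top5, sparse_user_top5)
print(sorted(shared))
print(round(jaccard, 2))
```

The Jaccard similarity ignores the order within each list; a rank-aware measure would be needed to capture that Party Pros and Affordable Party & Event Rentals swap places.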

In [77]:
%%time

# test 5: valid user id (user has over 100 reviews)
print("------\nresult from test5 (valid user id --- user has over 100 reviews): ")
col.collaborative(user_id='---1lKK3aKOuomHnwAkAow');

------
result from test5 (valid user id --- user has over 100 reviews): 
Below is a list of the top 5 recommended restaurants for you: 
   predicted_stars state       city                         name  \
0         5.895364    NV  Las Vegas                     Sushi-ko   
1         5.847682    NV  Las Vegas  Marc Savard Comedy Hypnosis   
2         5.815304    NV  Las Vegas                 Urban Turban   
3         5.803355    NV  Las Vegas              Above the Crust   
4         5.750107    NV  Las Vegas         Layla Grill & Hookah   

                                     address  \
0                            7101 W Craig Rd   
1  3663 Las Vegas Blvd S, Ste 360, V-Theater   
2                    3900 Paradise Rd, Ste G   
3                              7810 W Ann Rd   
4                8665 W Flamingo Rd, Ste 107   

   attributes.RestaurantsPriceRange2                              cuisine  \
0                                2.0                 japanese, sushi bars   
1           

As shown, even for a user with a long review history, where the module must also remove every already-rated restaurant from the list, it takes only 2 seconds to return the personalized ranking. Thanks to the rich preference history, the recommendations are clearly personalized: all of the top five are in Las Vegas, and the list seems to suggest that this user prefers popular restaurants with many reviews, reasonable to good ratings (3.5-4.5) and a lower price range (\$-\$\$).
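
A reading like this can be checked by summarizing the relevant columns of the returned dataframe. The sketch below assumes a small made-up table with the same column names as the recommender output; the restaurant names and all numbers are placeholders for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows that mimic the columns returned by the recommender;
# the values are illustrative only and are not taken from the dataset.
top_n = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "review_count": [812, 455, 1290, 367, 640],
    "stars": [4.0, 4.5, 3.5, 4.0, 4.5],
    "attributes.RestaurantsPriceRange2": [2.0, 1.0, 2.0, None, 1.0],
})

# min / median / max of popularity, rating and price range (NaN is skipped)
cols = ["review_count", "stars", "attributes.RestaurantsPriceRange2"]
profile = top_n[cols].agg(["min", "median", "max"])
print(profile)
```

With the real output, `top_n` would be replaced by the first n rows of the dataframe returned by `col.collaborative(...)`.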

In [78]:
%%time

# test 6: valid user id (user has over 100 reviews)
print("------\nresult from test6 (valid user id --- user has over 100 reviews): ")
recomm = col.collaborative(user_id='---1lKK3aKOuomHnwAkAow');

# filter the personalized recommendation with keywords
print("------\nfurther filtering the personalized recommendations by keywords:")
recomm = col.keyword(df=recomm, city='Phoenix', personalized=True)

------
result from test6 (valid user id --- user has over 100 reviews): 
Below is a list of the top 5 recommended restaurants for you: 
   predicted_stars state       city                         name  \
0         5.895364    NV  Las Vegas                     Sushi-ko   
1         5.847682    NV  Las Vegas  Marc Savard Comedy Hypnosis   
2         5.815304    NV  Las Vegas                 Urban Turban   
3         5.803355    NV  Las Vegas              Above the Crust   
4         5.750107    NV  Las Vegas         Layla Grill & Hookah   

                                     address  \
0                            7101 W Craig Rd   
1  3663 Las Vegas Blvd S, Ste 360, V-Theater   
2                    3900 Paradise Rd, Ste G   
3                              7810 W Ann Rd   
4                8665 W Flamingo Rd, Ste 107   

   attributes.RestaurantsPriceRange2                              cuisine  \
0                                2.0                 japanese, sushi bars   
1           

In [79]:
%%time

# test 7: try to run keyword filtering of personalized recommendation directly
print("------\nresult from test7 (run keyword filtering of personalized recommendations directly):")
col.keyword(city='Phoenix', personalized=True)

------
result from test7 (run keyword filtering of personalized recommendations directly):
no personalized list of recommendations is generated yet!
please first run the collaborative filtering module or content-based filtering module for a personalized recommendations.
CPU times: user 1.55 ms, sys: 1.76 ms, total: 3.31 ms
Wall time: 1.8 ms


## 3.3 adding personalized content-based filtering module

# 4. Functions to control API interfaces

### Others: Business dataset joins review dataset

In [None]:
# busi_review = pd.merge(business[['business_id','stars','review_count']], review[['business_id','review_id','stars']], how='left', on='business_id')

In [None]:
# # compare the consistency of review counts from business dataset vs review dataset
# compare_review = busi_review.groupby('business_id')[['business_id','review_count']].agg({'business_id':'count','review_count':'mean'})
# print(compare_review.head())
# print(len(compare_review))
# print(compare_review[compare_review.business_id != compare_review.review_count].head())
# print(len(compare_review[compare_review.business_id != compare_review.review_count]))

In [None]:
# # compare the consistency of average rating from business dataset vs review dataset
# compare_rating = busi_review.groupby('business_id')[['stars_x','stars_y']].mean()
# compare_rating['stars_y_round'] = (compare_rating.stars_y//0.5)*0.5 + ((compare_rating.stars_y % 0.5)//0.25)*0.5
# print(compare_rating.head())
# print(len(compare_rating))
# print(compare_rating[compare_rating.stars_x != compare_rating.stars_y_round].head())
# print(len(compare_rating[compare_rating.stars_x != compare_rating.stars_y_round]))
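
The `stars_y_round` expression in the commented-out cell above rounds an average rating to the nearest half star, with ties rounded up, so that it can be compared against the half-star `stars` value in the business dataset. A small sketch of that formula on made-up averages is shown below; `round_to_half_star` is a hypothetical helper name, and the inputs are chosen to be exactly representable as floats, since values that fall on a quarter-star boundary only approximately may round differently.

```python
import pandas as pd


def round_to_half_star(x):
    """Round to the nearest 0.5, with quarter-star ties rounded up."""
    return (x // 0.5) * 0.5 + ((x % 0.5) // 0.25) * 0.5


# made-up per-business average ratings
mean_stars = pd.Series([3.0, 3.25, 3.5, 3.625, 3.75, 4.875])
rounded = round_to_half_star(mean_stars)
print(rounded.tolist())
```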