**Recommender system:**<br>
* Module 1 - simple recommender:<br>
Build a keyword search-based restaurant recommender module that filters the restaurant inventory by keyword. Keywords could include, for instance, location-based information (zip code, longitude, latitude) and restaurant feature-based information (cuisine, style). 
The restaurant inventory is filtered by keywords first, then ranked either by average rating or by a weighted "smart" rating that also takes popularity into consideration (depending on the user’s choice). The top-k restaurants in the ranked list are returned as the top-k recommendations.<br>
* Module 2 - content filtering recommender:<br>
Given a user ID and the restaurants’ metadata, build a content-based filtering recommender module that recommends restaurants similar to the user’s preferences, as inferred from the user’s past ratings. More specifically, each restaurant is turned into a feature vector using CountVectorizer or TfidfVectorizer, pairwise similarity scores (e.g. cosine similarity) are computed on these vectors, and restaurants are recommended in order of their weighted similarity score. The important restaurant metadata to consider include categories, attributes and location.<br>
* Module 3 - collaborative filtering recommender:<br>
Given the user_id x restaurant_id rating matrix, build a collaborative filtering recommender module. The dataset has a total of 1,518,169 users, 188,593 businesses and 5,996,995 reviews, so the user_id x business_id matrix is very sparse (about 0.002% non-empty). Therefore, matrix factorization using ALS (alternating least squares) will be used to complete the matrix and generate recommendations.<br>
* Metrics chosen for evaluating and optimizing the ‘goodness’ of the algorithms:<br>
a) measure prediction accuracy: RMSE (root mean squared error)<br>
b) measure ranking effectiveness: MAP (mean average precision) and NDCG (normalized discounted cumulative gain)<br>
* Integration - combine the above modules to build a hybrid recommendation engine:<br>
To combine the above modules, a few simple interactive questions will be added:<br>
a) “Want customized recommendations based on your user history by providing your user ID?”  If no, activate the simple recommender module to provide base-case recommendations using location information and/or optional keywords<br>
b) If yes, prompt to ask follow up question: “do you want to try something new based on people like you?” If yes, activate the collaborative filtering module to recommend new restaurants based on similar peers; otherwise, use content filter module to recommend similar restaurants. <br>
* Other improvements:<br>
Optimize restaurant ranking by weighting the average rating by the total number of ratings (popularity), weighting individual ratings by their recency, etc. A quick interactive question, “want smart rating instead?”, switches the ranking from the simple average rating to the weighted score described above.<br>
* Potential caveats - cold start problem:<br>
a) new restaurant → content-based recommendation module will be able to use the features (metadata) of the new restaurant and include it when generating recommendations.<br>
b) new user → will be treated as if no user ID is available (in both cases there is no user history), and the simple recommender module will be used to recommend restaurants based on location, keywords, popularity, etc. 
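
The content-based idea in Module 2 can be sketched on a toy example. This is only an illustration under stated assumptions, not the module itself: the four-restaurant catalog, the `user_ratings` dictionary, the helper name `recommend_similar` and the choice of 3 stars as the neutral rating are all hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalog: one text "document" per restaurant built from its metadata.
restaurant_docs = {
    "r1": "barbeque restaurants casual",
    "r2": "barbeque smokehouse restaurants",
    "r3": "desserts bubble tea cafes",
    "r4": "ice cream desserts",
}
# Hypothetical past ratings of a single user: restaurant id -> stars.
user_ratings = {"r1": 5, "r3": 2}

ids = list(restaurant_docs)
tfidf = TfidfVectorizer().fit_transform([restaurant_docs[i] for i in ids])
sim = cosine_similarity(tfidf)  # pairwise restaurant-to-restaurant similarity


def recommend_similar(user_ratings, ids, sim, top_k=2):
    """Rank unrated restaurants by rating-weighted similarity to the rated ones."""
    rated_idx = [ids.index(r) for r in user_ratings]
    # Center ratings on 3 stars (assumed neutral point) so disliked items push scores down.
    weights = np.array([user_ratings[ids[j]] - 3.0 for j in rated_idx])
    scores = sim[:, rated_idx] @ weights
    candidates = [i for i in range(len(ids)) if ids[i] not in user_ratings]
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return [ids[i] for i in ranked[:top_k]]


print(recommend_similar(user_ratings, ids, sim))
```

In this toy case the other barbeque restaurant ranks above the dessert shop, because the user rated a barbeque place highly and a dessert place poorly.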

**Note:**<br> 
Only a subset of Yelp restaurants from a few selected states are available in this dataset. Among them, only Arizona, Nevada, Ohio, North Carolina and Pennsylvania have a rich catalog of over 5000 restaurants. 

Only the top two states, Arizona and Nevada have over 10000 restaurants. 

In [1]:
import pandas as pd
import numpy as np
#import string

business = pd.read_csv('business_clean.csv')  # contains business data including location data, attributes and categories
#user = pd.read_csv('user_clean.csv') # contains users data including the user's friend mapping and all the metadata associated with the user
review = pd.read_csv('review_clean.csv') # contains full review text data including the user_id that wrote the review and the business_id the review is written for
#tip = pd.read_csv('tip_clean.csv') # tips written by a user on a business, tips are shorter than reviews and tend to convey quick suggestions
#checkin = pd.read_csv('checkin_clean.csv') # checkins on a business



# 1. Calculate Geodesic distance between two points on the globe

In [2]:
# calculate geodesic distances between two points on a globe, see https://janakiev.com/blog/gps-points-distance-python/ for more resource
# alternatively, one can use geopy.distance from geopy package to calculate either geodesic or great circle distance, https://github.com/geopy/geopy/blob/master/geopy/distance.py

def great_circle_mile(lat1, lon1, lat2, lon2):
    """
    Compute the great-circle distance between two points given their coordinates in degrees. 
    The function returns the distance in miles. 
    Note: 1. Calculation uses the earth's mean radius of 6371.009 km, 
    2. The central subtended angle is calculated by the spherical law of cosines: 
    alpha = arccos[sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2)*cos(lon1-lon2)]
    """
    
    from math import sin, cos, acos, radians
    
    lat1, lon1, lat2, lon2 = radians(lat1), radians(lon1), radians(lat2), radians(lon2) # convert degrees to radians
    earth_radius = 6371.009  # use earth's mean radius in kilometers
    cos_alpha = sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2)*cos(lon1-lon2)
    cos_alpha = max(-1.0, min(1.0, cos_alpha)) # guard against rounding error pushing the value outside the domain of acos
    alpha = acos(cos_alpha) # alpha is in radians
    dis_km = alpha * earth_radius
    dis_mile = dis_km * 0.621371   # convert kilometer to mile
    
    return dis_mile

In [3]:
pos1 = (51.5073219, -0.1276474) # London
pos2 = (52.5170365, 13.3888599) # Berlin
pos3 = (-33.8548157,151.2164539) # Sydney

In [4]:
%%timeit
# great_circle distance
distance_12 = great_circle_mile(pos1[0], pos1[1], pos2[0], pos2[1])
distance_12

1.9 µs ± 160 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [5]:
from geopy.distance import distance

In [6]:
%%timeit
# geodesic distance
distance2_12 = distance(pos1, pos2).miles
distance2_12

230 µs ± 16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [7]:
distance_12 = great_circle_mile(pos1[0], pos1[1], pos2[0], pos2[1])
distance2_12 = distance(pos1, pos2).miles
error_12 = (distance_12 - distance2_12)/distance2_12

distance_13 = great_circle_mile(pos1[0], pos1[1], pos3[0], pos3[1])
distance2_13 = distance(pos1, pos3).miles
error_13 = (distance_13 - distance2_13)/distance2_13

# print errors, 1-2 is distance between London and Berlin, 1-3 is distance between London and Sydney
print("absolute error (miles):", (distance_12-distance2_12), (distance_13-distance2_13)) 
print("relative error:", error_12, error_13)

absolute error (miles): -1.8326838373754981 2.892945931140275
relative error: -0.0031598293601707256 0.0002740520023695907


### Quick summary: 
Great-circle distance is reasonably accurate when compared with the geodesic distance calculated from an ellipsoidal model of the earth: the relative error is about 0.3% for London to Berlin and about 0.03% for London to Sydney. Given that the great-circle calculation is also roughly 100 times faster (1.9 µs vs. 230 µs per call), great-circle distance will be used in this project when calculating distances on the map.
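
Since the recommender later needs the distance from one location to every restaurant, the same formula can also be evaluated on whole arrays at once. The following is a minimal NumPy sketch of that idea; the function name `great_circle_mile_vec` is hypothetical and the clipping step is an added safeguard against rounding error, not part of the function above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.009  # earth's mean radius, same value as used above
KM_TO_MILE = 0.621371


def great_circle_mile_vec(lat, lon, lat0, lon0):
    """Great-circle distance in miles from many points (lat, lon) to one point (lat0, lon0).

    All coordinates are in degrees; lat and lon may be arrays of equal length.
    """
    lat, lon = np.radians(lat), np.radians(lon)
    lat0, lon0 = np.radians(lat0), np.radians(lon0)
    cos_alpha = np.sin(lat) * np.sin(lat0) + np.cos(lat) * np.cos(lat0) * np.cos(lon - lon0)
    # Clip to [-1, 1] so rounding error cannot push the value outside arccos's domain.
    alpha = np.arccos(np.clip(cos_alpha, -1.0, 1.0))
    return alpha * EARTH_RADIUS_KM * KM_TO_MILE


# Berlin, Sydney and London itself, each measured from London
lats = np.array([52.5170365, -33.8548157, 51.5073219])
lons = np.array([13.3888599, 151.2164539, -0.1276474])
dist = great_circle_mile_vec(lats, lons, 51.5073219, -0.1276474)
print(dist)
```

With a dataframe, the call would take the form `great_circle_mile_vec(df.latitude.values, df.longitude.values, lat0, lon0)`, avoiding a row-by-row `apply`.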

# 2. Adjusted rating

Here, a single metric is introduced as a substitute for the original average restaurant rating ('stars' column of the business dataframe). Ideally, the new metric should take into consideration: <br>
1) average rating of the restaurant (indicates the goodness of the restaurant, but does not consider popularity) <br>
2) # of ratings by users (indicates popularity, but does not imply the goodness of the restaurant) <br>
3) age of the rating (indicates the relevance of the rating, as outdated ratings might fail to indicate the actual quality)<br>

The proposed new score is: 
$$score_i = \frac{\sum_u r_{ui} + k*\mu}{n_i+k}$$

where $r_{ui}$ is the rating on item i by user u, $n_i$ is the number of ratings on item i, $\mu$ is the global mean of ratings over all businesses and all users, and k is the strength of the damping term.

As the equation shows, the adjusted score uses the mechanism of the damped mean to regulate the extreme cases of having only a few extreme ratings. k controls the strength of the damping effect: the larger k is, the more actual ratings are required to overcome the global mean.

In this case, k is set to 22 (which is the 50% quantile, i.e. the median, of the review counts for all businesses), but it can be tuned according to various business considerations.  

Note:<br> 
Here, the age of the rating is not adjusted in the current version of the proposed metric. This is because 90% of the reviews/ratings are from year 2011 and later, and are considered quite relevant. Therefore, the need of adjusting for the age of the ratings is not strong.
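
To make the damping effect concrete, the short sketch below applies the formula to two made-up restaurants: one with only 3 ratings of 5 stars, and one with 500 ratings averaging 4.5 stars. The helper name `damped_mean`, the two restaurants and the list of k values are illustrative assumptions; the global mean is the value computed in this notebook, rounded.

```python
def damped_mean(rating_sum, n_ratings, global_mean, k):
    """Damped mean: (sum of ratings + k * global mean) / (number of ratings + k)."""
    return (rating_sum + k * global_mean) / (n_ratings + k)


mu = 3.7278  # global mean rating from this notebook, rounded

scores = {}
for k in (4, 22, 66):  # 10%, 50% and 75% quantiles of the review counts
    few = damped_mean(5.0 * 3, 3, mu, k)        # 3 ratings, all 5 stars
    many = damped_mean(4.5 * 500, 500, mu, k)   # 500 ratings, 4.5 stars on average
    scores[k] = (few, many)
    print("k={}: few ratings -> {:.3f}, many ratings -> {:.3f}".format(k, few, many))
```

For every k tried here the well-reviewed restaurant ranks above the one with three perfect ratings, and a larger k pulls the sparsely rated restaurant closer to the global mean.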

In [8]:
# compute the global mean rating over all businesses, weighted by each business's review count
globe_mean = ((business.stars * business.review_count).sum())/(business.review_count.sum())
print("global mean rating is:", globe_mean)

global mean rating is: 3.7277814701620127


In [9]:
print(business.review_count.quantile([0.1,0.25,0.5,0.75,0.9]))
k = 22 # set strength k to 22, which is the 50% quantile of the review counts for all businesses
business['adjusted_score'] = (business.review_count * business.stars + k * globe_mean)/(business.review_count + k)
print("\nrank by the adjusted score in descending order:")
print(business[['review_count','stars','adjusted_score']].sort_values('adjusted_score', ascending=False).head(5))
print("\nrank by the original score in descending order:")
print(business[['review_count','stars','adjusted_score']].sort_values('stars', ascending=False).head(5))
print("\nrank by the least number of reviews:")
print(business[['review_count','stars','adjusted_score']].sort_values('review_count', ascending=True).head(5))

0.10      4.0
0.25      8.0
0.50     22.0
0.75     66.0
0.90    172.0
Name: review_count, dtype: float64

rank by the adjusted score in descending order:
       review_count  stars  adjusted_score
7464           1746    5.0        4.984169
31910          1380    5.0        4.980037
45401           547    5.0        4.950811
7784            520    5.0        4.948360
28162           472    5.0        4.943342

rank by the original score in descending order:
       review_count  stars  adjusted_score
22115             7    5.0        4.034869
23114             5    5.0        3.963377
42990             5    5.0        3.963377
42989            16    5.0        4.263452
12778             3    5.0        3.880448

rank by the least number of reviews:
       review_count  stars  adjusted_score
0                 3    4.5        3.820448
5707              3    4.0        3.760448
16594             3    4.0        3.760448
5699              3    3.5        3.700448
16605             3    3.5  

# 3. Create recommender class

## 3.1 non-personalized keyword filtering module

In [10]:
# update data type of the 'postal_code' column of business dataframe to string type
business['postal_code'] = business.postal_code.astype(str)

In [19]:
class Recommender:
    
    def __init__(self, n=5, original_score=False):
        """initiate a Recommender object by passing the desired number of recommendations to make, the default number is 5.
        By default, the adjusted score will be used for ranking; To rank by the original average rating of the restaurant, pass original_score=True
        """
        self.n = n # number of recommendations to make, default is 5
        self.original_score = original_score # boolean indicating whether the original average rating or the adjusted score is used
        # initiate a list of column names to display in the recommendation results
        self.column_to_display = ['state','city','name','address','attributes.RestaurantsPriceRange2','cuisine','style','review_count','stars','adjusted_score']
        
        # initiate the list of recommendations to be the entire catalog of the 'business' dataframe sorted by the score of interest
        if self.original_score:  # set sorting criteria to the original star rating
            score = 'stars'
        else:  # set sorting criteria to the adjusted score
            score = 'adjusted_score'
        self.recomm = business.sort_values(score, ascending=False)
        
    def _filter_by_location(self):
        """Filter and update the dataframe of recommendations by the matching location of interest.
        A combination of state, city and zipcode is used as the location information, partially missing information can be handled. 
        Matching restaurant is defined as the restaurant within the acceptable distance (max_distance) of the location of interest.
        note: this hidden method should only be called within the method 'keyword'
        """       
        from geopy.geocoders import Nominatim
        from geopy.exc import GeocoderTimedOut
        geolocator = Nominatim(user_agent="yelp_recommender") # use geopy.geocoders to make geolocation queries
        address = [self.city, self.state, self.zipcode]
        address = ",".join([str(i) for i in address if i is not None])
        # use geolocate query to find the coordinate for the location of interest
        try:
            location = geolocator.geocode(address, timeout=10) 
        except GeocoderTimedOut as e:
            print("Error: geocode failed to locate the address of interest {} with message {}".format(address, e))
            location = None
        if location is None: # the query timed out or returned no match, so no restaurant can be matched
            self.recomm = self.recomm.iloc[0:0]
            return

        # calculate the geodesic distance between each restaurant and the location of interest and add as a new column ''distance_to_interest'
        self.recomm['distance_to_interest'] = self.recomm.apply(lambda row: great_circle_mile(row.latitude, row.longitude, location.latitude, location.longitude), axis=1)
        # add the new column 'distance_to_interest' to the list of columns to display in the recommendation result
        self.column_to_display.insert(0, 'distance_to_interest')
        # filter by the desired distance
        self.recomm = self.recomm[self.recomm.distance_to_interest <= self.max_distance]

    def _filter_by_state(self):
        """ Filter and update the dataframe of recommendations by the matching state.
        note: this hidden method should only be called within the method 'keyword'
        """
        self.recomm = self.recomm[self.recomm.state == self.state.upper()]
    
    def _filter_by_cuisine(self):
        """ Filter and update the dataframe of recommendations by the matching cuisine of interest. 
        note: this hidden method should only be called within the method 'keyword'
        """                         
        idx = []
        for i in self.recomm.index: 
            if pd.notnull(self.recomm.loc[i,'cuisine']):
                entries = [entry.strip() for entry in self.recomm.loc[i,'cuisine'].split(',')] # strip the whitespace that follows each comma
                if self.cuisine in entries:
                    idx.append(i)
        self.recomm = self.recomm.loc[idx]
    
    def _filter_by_style(self):  
        """ Filter and update the dataframe of recommendations by the matching style of interest. 
        note: this hidden method should only be called within the method 'keyword'
        """
        idx = []
        for i in self.recomm.index: 
            if pd.notnull(self.recomm.loc[i,'style']):
                entries = [entry.strip() for entry in self.recomm.loc[i,'style'].split(',')] # strip the whitespace that follows each comma
                if self.style in entries:
                    idx.append(i)
        self.recomm = self.recomm.loc[idx]
    
    def display_recommendation(self):
        """ Display the list of top n recommended restaurants
        """
        # limit the list of recommendation to only top n at max
        if self.n < len(self.recomm):
            self.recomm = self.recomm.iloc[:self.n]
        if len(self.recomm) == 0:
            print("Sorry, there are no matching recommendations.")
        else: 
            print("The top {} recommended restaurants matching your keywords are".format(len(self.recomm)))
            print(self.recomm[self.column_to_display])
    
    # non-personalized keyword filtering-based recommendation module
    def keyword(self, zipcode=None, city=None, state=None, max_distance=10, cuisine=None, style=None):
        """Non-personalized recommendation by keyword filtering: 
        Support filtering by the desired distance and location (zipcode, city, state) of interest, 
        by the desired cuisine of interest and by the desired style of interest.
        Everytime this method is called, a new list of recommendation is created regardless of prior history.
        ---
        Note:
        state: needs to be the upper case of the state abbreviation, e.g.: 'NV', 'CA'
        max_distance: the max acceptable distance between the restaurant and the location of interest, unit is in miles, default is 10
        ---
        """
        # re-initiate the following variables every time the module is called so that the recommendation starts fresh
        self.recomm = business.copy() # start with a copy of the entire 'business' catalog so that the global dataframe is not modified
        self.recomm['distance_to_interest'] = np.nan # reset the distance between each restaurant and the location of interest
        self.column_to_display = ['state','city','name','address','attributes.RestaurantsPriceRange2','cuisine','style','review_count','stars','adjusted_score'] # reset the columns to display
        
        # assign variables based on user's keyword inputs
        self.zipcode = zipcode
        self.city = city
        self.state = state 
        self.max_distance = max_distance
        self.cuisine = cuisine
        self.style = style
         
        
            
        # filter by restaurant location
        if (self.zipcode != None) or (self.city != None) or (self.state != None):      
            if (self.zipcode != None) or (self.city != None): # use zipcode and/or city whenever available
                self._filter_by_location()
            else: # filter by state if state is the only location information available 
                self._filter_by_state()
            if len(self.recomm) == 0:
                print("no restaurant found for the matching location of interest.")
                return []
        
        # filter by restaurant 'cuisine'
        if self.cuisine != None:
            self._filter_by_cuisine()
            if len(self.recomm) == 0:
                print("no restaurant found for the matching cuisine of {}".format(self.cuisine))
                return []
        
        # filter by restaurant 'style'
        if self.style != None:
            self._filter_by_style() 
            if len(self.recomm) == 0:
                print("no restaurant found for the matching style of {}".format(self.style))
                return []
        
        # sort the matching list of restaurants by the score of interest
        if self.original_score:  # set sorting criteria to the original star rating
            score = 'stars'
        else:  # set sorting criteria to the adjusted score
            score = 'adjusted_score'
        self.recomm = self.recomm.sort_values(score, ascending=False)
        
        # display the list of top n recommendations
        self.display_recommendation()
        
        return self.recomm
    
    # personalized content-based filtering recommender module
    def content(self, user_id=None):
        """Passing of user_id is required if personalized recommendation is desired.
        """
        self.recomm = business # start with the entire 'business' catalog every time the module is called
                           
        self.user_id = user_id  # user_id for personalized recommendation using content_based filtering 
                          
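        if self.user_id is None:
            print("no user_id is provided")
            return
        if self.user_id not in review.user_id.values: # the 'user' dataframe is not loaded in this notebook; 'in' on a Series checks the index, so test against the values
            print("No data available for this user_id")
            return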
        
        # to be added
        
        # display the list of recommendations
        self.display_recommendation()
    
    # personalized collaborative-based filtering recommender module
    def collaborative(self, user_id=None):
        """Passing of user_id is required if personalized recommendation is desired.
        """
        self.recomm = business # start with the entire 'business' catalog every time the module is called
                           
        self.user_id = user_id # user_id for personalized recommendation using collaborative filtering 

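        if self.user_id is None:
            print("no user_id is provided")
            return
        if self.user_id not in review.user_id.values: # the 'user' dataframe is not loaded in this notebook; 'in' on a Series checks the index, so test against the values
            print("No data available for this user_id")
            return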
            
        # to be added
        
        # display the list of recommendations
        self.display_recommendation()

### Testing on the non-personalized keyword filtering module

In [20]:
%%time
# initiate a Recommender object
kw = Recommender(n=3)

# test0: display only (same as no keywords)
print("------\nresult from test0 (display only): ")
kw.display_recommendation()

# test1: no keywords
print("------\nresult from test1 (no keywords): ")
kw.keyword();

# test 2: a combination of city, state and zipcode
print("------\nresult from test2 (a combination of city, state and zipcode): ")
kw.keyword(city='Phoenix', state='AZ', zipcode='85023');

# test 3: a combination of cuisine and style
print("------\nresult from test3 (a combination of cuisine and style): ")
kw.keyword(cuisine='barbeque', style='restaurants');

# test 4: a combination of state, cuisine and style
print("------\nresult from test4 (a combination of state, cuisine and style): ")
kw.keyword(state='NV', cuisine='desserts', style='restaurants');

# test 5: no matching location
print("------\nresult from test5 (no matching location): ")
kw.keyword(city='milpitas', zipcode='95035');

# test 6: no matching 'cuisine'
print("------\nresult from test6 (no matching cuisine): ")
kw.keyword(cuisine='abc');

# test 7: no matching 'style'
print("------\nresult from test7 (no matching style): ")
kw.keyword(style='abc');

# test 8: a combination of location, cuisine and style
print("------\nresult from test8 (a combination of location, cuisine and style): ")
kw.keyword(city='Phoenix', zipcode='85023',cuisine='barbeque', style='restaurants');

# test 9: use the original average rating and return top 10 recommendations
print("------\nresult from test9 (top 10 recommendations ranked by original average rating): ")
kw2 = Recommender(n=10, original_score=True)
kw2.keyword(city='Phoenix', zipcode='85023',cuisine='barbeque', style='restaurants');

------
result from test0 (display only): 
The top 3 recommended restaurants matching your keywords are
      state       city             name                       address  \
7464     AZ    Phoenix  Little Miss BBQ          4301 E University Dr   
31910    NV  Las Vegas     Brew Tea Bar  7380 S Rainbow Blvd, Ste 101   
45401    NV  Las Vegas       Gelatology  7910 S Rainbow Blvd, Ste 110   

       attributes.RestaurantsPriceRange2                              cuisine  \
7464                                 2.0                             barbeque   
31910                                1.0                 desserts, bubble tea   
45401                                1.0  ice cream & frozen yogurt, desserts   

                    style  review_count  stars  adjusted_score  
7464          restaurants          1746    5.0        4.984169  
31910  cafes, restaurants          1380    5.0        4.980037  
45401                 NaN           547    5.0        4.950811  
------
result from 

As shown, 9 tests (9 queries) are performed with a total CPU time of 8 seconds and an elapsed time of 12 seconds. This averages to roughly 1 second per query, which is reasonable in practice.
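
The cuisine and style filters walk through the catalog one row at a time. If query time ever becomes a concern, the same matching could be expressed column-wise with pandas string methods. The sketch below is one possible way to do that, assuming the comma-separated format visible in the output above; the helper name `filter_by_entry` and the toy dataframe are hypothetical, and no timing on the real catalog is claimed.

```python
import pandas as pd


def filter_by_entry(df, column, keyword):
    """Keep the rows whose comma-separated `column` contains `keyword` as an exact entry."""
    entries = df[column].fillna("").str.split(",")
    # Strip whitespace so that "a, b" yields the entries "a" and "b".
    mask = entries.apply(lambda items: keyword in [item.strip() for item in items])
    return df[mask]


# toy catalog mimicking the 'cuisine' column format
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "cuisine": ["desserts, bubble tea", "ice cream & frozen yogurt, desserts", None],
})
matches = filter_by_entry(toy, "cuisine", "desserts")
print(matches)
```

Matching is on whole entries, so searching for "tea" does not match "bubble tea".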

## 3.2 collaborative filtering module

In [50]:
review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5996995 entries, 0 to 5996994
Data columns (total 9 columns):
funny          int64
stars          int64
useful         int64
cool           int64
text           object
business_id    object
user_id        object
review_id      object
date           object
dtypes: int64(4), object(5)
memory usage: 411.8+ MB


### using pandas pivot to convert 'review' dataframe to the 'user_id' x 'business_id' matrix

In [30]:
# trying only with the first 50,000 rows
matrix_reduced = review[:50000].pivot(index='user_id', columns='business_id', values='stars')

In [40]:
print(matrix_reduced.shape)
print(matrix_reduced.info())

# check the target matrix dimension
print(len(review.user_id.value_counts()))
print(len(review.business_id.value_counts()))

# check sparsity
print("The non-NaN entries in the target matrix is {}%".format(len(review)*100/(len(review.user_id.value_counts())*len(review.business_id.value_counts()))))

(20076, 30915)
<class 'pandas.core.frame.DataFrame'>
Index: 20076 entries, ---PLwSf5gKdIoVnyRHgBA to zzq0TgPc5-b3-7XKt6fwJA
Columns: 30915 entries, --6MefnULPED_I942VcFNA to zzwhN7x37nyjP0ZM8oiHmw
dtypes: float64(30915)
memory usage: 4.6+ GB
None
1518168
188593
The non-NaN entries in the target matrix is 0.002094538196300487%


As shown, matrix_reduced, created by pivoting only the first 50,000 rows, already takes 4.6+ GB with a shape of 20076 x 30915.<br>
The actual matrix shape will be len(review.user_id.value_counts()) x len(review.business_id.value_counts()), that is 1518168 x 188593, about 461 times larger. A dense pivot is therefore impossible given the space (memory) constraint. <br>

Because the target matrix is very sparse, the way to make it work within the memory constraint is to pivot the 'stars' rating in the 'review' dataframe directly into a sparse matrix, by 'user_id' and 'business_id'. 
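
The size estimates above can be checked with quick arithmetic. The sketch below assumes 8 bytes per dense value (float64) and, for the sparse estimate, a CSR layout with 8-byte values and 4-byte indices; the helper names are hypothetical.

```python
def dense_size_gib(n_rows, n_cols, bytes_per_value=8):
    """Memory of a dense n_rows x n_cols matrix in GiB."""
    return n_rows * n_cols * bytes_per_value / 2**30


def csr_size_mb(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Approximate memory of a CSR matrix in MB: values + column indices + row pointers."""
    return (nnz * value_bytes + nnz * index_bytes + (n_rows + 1) * index_bytes) / 1e6


reduced = dense_size_gib(20076, 30915)      # pivot of the first 50,000 reviews
full = dense_size_gib(1518168, 188593)      # pivot of all reviews
sparse = csr_size_mb(1518168, 5996995)      # same data stored as a sparse matrix

print("reduced dense matrix: {:.1f} GiB".format(reduced))
print("full dense matrix:    {:.0f} GiB ({:.0f} times larger)".format(full, full / reduced))
print("full sparse matrix:   {:.0f} MB".format(sparse))
```

The dense matrix would need on the order of two terabytes, while the sparse representation fits in well under 100 MB.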

### pivot directly into a sparse matrix

In [3]:
# define method for pivoting dataframe into a sparse matrix directly
# version: python 3.6.5, pandas 0.23.3, numpy 1.15.0 scipy 1.1.0

from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

# return the resulting sparse matrix only
def df_pivot_sparse_matrix_simple(df, idx, col, val):
    """pivot a pandas dataframe into sparse matrix directly using scipy.sparse.csr_matrix and return the resulting sparse matrix. 
    necessary when the df is large and pandas pivot (dense matrix) doesn't work due to space (memory) constraint. 
    ---
    input
    df: the pandas dataframe of interest
    idx: the column name of the df to be used as the index in the sparse matrix;
    col: the column name of the df to be used as the column in the sparse matrix;
    val: the column name of the df to be used as the actual value in the sparse matrix;
    """
    x = df[idx].astype(CategoricalDtype(ordered=True)).cat.codes
    y = df[col].astype(CategoricalDtype(ordered=True)).cat.codes
    return csr_matrix((df[val].values, (x, y)), shape=(df[idx].nunique(), df[col].nunique()))


# return the resulting sparse matrix along with the mapping dictionaries of matrix indices to the original values in the corresponding columns of df
def df_pivot_sparse_matrix(df, idx, col, val):
    """pivot a pandas dataframe into sparse matrix directly using scipy.sparse.csr_matrix and return the resulting sparse matrix, 
    necessary when the df is large and pandas pivot (dense matrix) doesn't work due to space (memory) constraint. 
    ---
    input
    df: the pandas dataframe of interest
    idx: the column name of the df to be used as the index in the sparse matrix;
    col: the column name of the df to be used as the column in the sparse matrix;
    val: the column name of the df to be used as the actual value in the sparse matrix;
    ---
    return:
    sparse_matrix: the resulting sparse matrix
    map_idx: the dictionary to map the numerical row indices of the sparse matrix back to the unique values in the idx column of the original df
    map_col: the dictionary to map the numerical column indices of the sparse matrix back to the unique values in the col column of the original df
    """
    idx_c = CategoricalDtype(sorted(df[idx].unique()),ordered=True) # find unique values in the idx column and define as a categorical type
    col_c = CategoricalDtype(sorted(df[col].unique()),ordered=True) # find unique values in the col column and define as a categorical type

    x = df[idx].astype(idx_c).cat.codes # cast columns to the newly created categorical type and access the underlying integer codes (corresponding numbering of the categories)
    y = df[col].astype(col_c).cat.codes 
    sparse_matrix = csr_matrix((df[val].values, (x, y)), \
                           shape=(len(idx_c.categories), len(col_c.categories))) # map to the sparse matrix
    
    map_idx = dict(zip(np.arange(len(idx_c.categories)), list(idx_c.categories))) # create the mapping dictionaries
    map_col = dict(zip(np.arange(len(col_c.categories)), list(col_c.categories)))
                               
    return sparse_matrix, map_idx, map_col

In [4]:
%%time
# convert to sparse matrix
matrix, map_user_id, map_business_id = df_pivot_sparse_matrix(review, 'user_id', 'business_id', 'stars')

CPU times: user 8.27 s, sys: 1 s, total: 9.27 s
Wall time: 10.2 s


In [5]:
# inspect the sparse matrix

# check shape
print("matrix shape:", matrix.shape)

# check memory use
print("memory use: {} MB".format((matrix.data.nbytes + matrix.indptr.nbytes + matrix.indices.nbytes) / 1e6)) # nbytes is already in bytes, so only convert bytes to megabytes

# check data type
print(matrix.dtype)
print(review.stars.value_counts())

# check non-NaN values
print(len(matrix.data))
print(review.stars.notnull().sum())

matrix shape: (1518168, 188593)
memory use: 78.03658 MB
int64
5    2641880
4    1335957
1     858139
3     673206
2     487813
Name: stars, dtype: int64
5996992
5996995


In [6]:
# inspect the mapping dictionaries
print(list(map_user_id.items())[:5])
print(list(map_business_id.items())[:5])

[(0, '---1lKK3aKOuomHnwAkAow'), (1, '---89pEy_h9PvHwcHNbpyg'), (2, '---94vtJ_5o_nikEs6hUjg'), (3, '---PLwSf5gKdIoVnyRHgBA'), (4, '---cu1hq55BP9DWVXXKHZg')]
[(0, '--1UhMGODdWsrMastO9DZw'), (1, '--6MefnULPED_I942VcFNA'), (2, '--7zmmkVg-IMGaXbuVd0SQ'), (3, '--8LPVSo5i0Oo61X01sV9A'), (4, '--9QQLMTbFzLJ_oT-ON3Xw')]


# [Q]: Why is the number of non-NaN values inconsistent (3 entries missing)?
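
One possible explanation, not verified against this dataset: when `csr_matrix` is built from (value, (row, column)) triplets, entries that share the same (row, column) pair are summed into a single stored value. If three (user_id, business_id) pairs appear twice in the review table, the matrix would hold exactly three fewer entries than there are reviews. The toy example below only demonstrates this behavior of `csr_matrix`; the arrays are made up.

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy triplets: user 0 rated business 1 twice (a duplicate pair)
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 1, 0, 2])
stars = np.array([4, 5, 3, 1])

m = csr_matrix((stars, (rows, cols)), shape=(3, 3))

print("input ratings:", len(stars))       # 4
print("stored entries:", m.nnz)           # 3, the duplicate pair was merged
print("merged value at (0, 1):", m[0, 1]) # 9, the two ratings were added together
```

On the real data, `review.duplicated(['user_id', 'business_id']).sum()` would show whether this is the cause. If so, the merged cells hold summed ratings that can exceed 5 stars, so it may be worth deduplicating the pairs (for example keeping the most recent review) before pivoting.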

### matrix factorization using scikit-learn non-negative matrix factorization (NMF)

In [7]:
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [None]:
# compute performance

def get_performance(m_pred_H, m_pred_W, m_true):
    """ compute the RMSE(root mean squared error) between the predicted ratings (m_pred_W x m_pred_H) and the true ratings in m_true.
    m_true is a sparse matrix, use m_true.data, m_true.indices, and m_true.indptr to evaluate on its non-empty entries only
    """
    # to be added
    return

In [8]:
# selecting n_components, set to 10 first, later use cross validation to optimize

model = NMF(n_components=10, init='random', random_state=0)
W = model.fit_transform(matrix)
H = model.components_

print(W.shape)
print(H.shape)

(1518168, 10)
(10, 188593)


### Problems to be solved
A) train_test_split on sparse matrix, randomly split on (user_id, business_id) combination. 
problem: some users only have one rating of one restaurant, those need to be in the training set. 
B) evaluate the test matrix only on available ratings (write def get_performance())
C) optimize n_components via cross validation 
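
For problem B, a minimal sketch of an RMSE restricted to the available ratings is shown below. It assumes W has shape (n_users, n_components) and H has shape (n_components, n_items), as returned by the NMF cell above; the function name `rmse_on_observed` and the toy matrices are hypothetical, and this is not the notebook's `get_performance`.

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix


def rmse_on_observed(W, H, m_true):
    """RMSE between W @ H and m_true, computed only on the stored entries of m_true."""
    coo = coo_matrix(m_true)  # exposes the (row, col, value) triplets of the stored entries
    # Predicted rating for each stored (row, col): dot product of the user row of W
    # with the item column of H, without ever forming the full dense product.
    pred = np.sum(W[coo.row] * H[:, coo.col].T, axis=1)
    return np.sqrt(np.mean((pred - coo.data) ** 2))


# toy check: W @ H = [[5, 0], [0, 3]], true ratings are 5 at (0, 0) and 1 at (1, 1)
W_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
H_toy = np.array([[5.0, 0.0], [0.0, 3.0]])
truth = csr_matrix((np.array([5.0, 1.0]), (np.array([0, 1]), np.array([0, 1]))), shape=(2, 2))

print(rmse_on_observed(W_toy, H_toy, truth))
```

The errors on the two stored entries are 0 and 2, so the result is the square root of 2. Called as `rmse_on_observed(W, H, matrix)` on the factors above, it would report training error only; a held-out matrix is still needed for a test score.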

### try out scikit-surprise package to handle the above complications
http://surpriselib.com/

In [31]:
from surprise import Dataset, Reader
from surprise import NMF, SVD
from surprise.model_selection import train_test_split, GridSearchCV
from surprise import accuracy

In [50]:
%%time
# prepare the data

# create a Reader object with the rating_scale from 1 to 5
reader = Reader(rating_scale=(1, 5))

# The columns must correspond to user id, item id and ratings in the exact order
data = Dataset.load_from_df(review[['user_id', 'business_id', 'stars']], reader)

# sample random trainset and testset
trainset, testset = train_test_split(data, test_size=0.3)

CPU times: user 32.6 s, sys: 29.2 s, total: 1min 1s
Wall time: 1min 21s


In [51]:
trainset.n_users

1250368

Note: this split doesn't keep all users in the training set either (1,250,368 of the 1,518,168 users). However, a prediction is still returned when the user_id or the business_id to be predicted is not found in the training set: it falls back on trainset.global_mean. 

In [52]:
%%time
# NMF with defaults

nmf = NMF() # initiate a NMF algorithm object
nmf.fit(trainset) # training
pred_nmf = nmf.test(testset) # predict ratings for the testset
accuracy.rmse(pred_nmf) # compute RMSE score

RMSE: 1.4656
CPU times: user 7min 31s, sys: 44.1 s, total: 8min 16s
Wall time: 8min 43s


In [53]:
%%time
# SVD with defaults

svd = SVD() # initiate a SVD algorithm object
svd.fit(trainset) # training
pred_svd = svd.test(testset) # predict ratings for the testset
accuracy.rmse(pred_svd) # compute RMSE score

RMSE: 1.2825
CPU times: user 5min 29s, sys: 38.6 s, total: 6min 8s
Wall time: 6min 35s
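
RMSE only measures prediction accuracy. The ranking metrics named in the plan (MAP and NDCG) can be sketched as follows. This is a plain-Python illustration under stated assumptions: the function names are hypothetical, AP@k is normalized by min(number of relevant items, k), which is one common convention among several, and NDCG uses the true star rating as the gain.

```python
import math


def average_precision_at_k(recommended, relevant, k):
    """AP@k: average of precision@i over the ranks i at which a relevant item appears."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0


def ndcg_at_k(recommended, true_ratings, k):
    """NDCG@k: discounted gain of the recommended order divided by that of the ideal order."""
    gains = [true_ratings.get(item, 0.0) for item in recommended[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(true_ratings.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0


# toy example for one user
true_ratings = {"a": 5, "b": 3, "c": 4}
ap = average_precision_at_k(["a", "b", "c"], relevant={"a", "c"}, k=3)
ndcg_best = ndcg_at_k(["a", "c", "b"], true_ratings, k=3)
ndcg_worse = ndcg_at_k(["b", "c", "a"], true_ratings, k=3)
print(ap, ndcg_best, ndcg_worse)
```

MAP is then the mean of `average_precision_at_k` over all evaluated users, and a perfectly ordered list gives an NDCG of 1.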


In [54]:
np.random.seed(42)

In [None]:
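# cross validation to optimize parameters of NMF
param_grid = {'n_factors': [10,20], 'n_epochs': [20, 50]}
gs = GridSearchCV(NMF, param_grid, measures=['rmse'], cv=3) # pass the algorithm class, not a fitted instance
gs.fit(data) # GridSearchCV takes the full Dataset object and builds the cross-validation folds itself
# best RMSE score
print(gs.best_score['rmse'])
# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])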

In [None]:
# cross validation to optimize parameters of SVD
param_grid = {'n_factors': [10,50], 'n_epochs': [10, 20], 'lr_all': [0.002, 0.01],'reg_all': [0.02, 0.1]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3) # pass the algorithm class, not a fitted instance
gs.fit(data) # GridSearchCV takes the full Dataset object and builds the cross-validation folds itself
# best RMSE score
print(gs.best_score['rmse'])
# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

# Functions to control API interfaces

### Business dataset joins review dataset

In [14]:
# busi_review = pd.merge(business[['business_id','stars','review_count']], review[['business_id','review_id','stars']], how='left', on='business_id')

In [15]:
# # compare the consistency of review counts from business dataset vs review dataset
# compare_review = busi_review.groupby('business_id')[['business_id','review_count']].agg({'business_id':'count','review_count':'mean'})
# print(compare_review.head())
# print(len(compare_review))
# print(compare_review[compare_review.business_id != compare_review.review_count].head())
# print(len(compare_review[compare_review.business_id != compare_review.review_count]))

In [16]:
# # compare the consistency of average rating from business dataset vs review dataset
# compare_rating = busi_review.groupby('business_id')[['stars_x','stars_y']].mean()
# compare_rating['stars_y_round'] = (compare_rating.stars_y//0.5)*0.5 + ((compare_rating.stars_y % 0.5)//0.25)*0.5
# print(compare_rating.head())
# print(len(compare_rating))
# print(compare_rating[compare_rating.stars_x != compare_rating.stars_y_round].head())
# print(len(compare_rating[compare_rating.stars_x != compare_rating.stars_y_round]))