# Report: Yelp Business Insights & Hybrid Restaurant Recommendation Engine

# Abstract
This capstone project is part of the data science career track program at Springboard. EDA and interactive visualizations are performed on the Yelp open dataset (Yelp Dataset Challenge) to understand restaurant, user and review patterns on the Yelp platform. A hybrid recommendation engine powered by the Yelp dataset is then developed, offering non-personalized keyword-search recommendation, personalized collaborative recommendation and personalized restaurant content-based recommendation at the user's choice.

---

# Executive Summary

## * Yelp Business Insights
The Yelp open dataset of 5,996,996 reviews, 1,518,169 users, 188,593 businesses, 1,185,348 tips, and over 1.4 million business attributes across the 188,593 businesses is obtained, cleaned and analyzed in this project. **Interactive visualizations are also created using Bokeh server.**<br> 
**The key business findings are:**
* Only a subset of Yelp restaurants from a few selected states are available in this dataset. Among them, only Arizona, Nevada, Ohio, North Carolina and Pennsylvania have a rich catalog of over 5000 restaurants. 
* The most common restaurants are popular chain or franchised restaurants, fast food or coffee shops, with Starbucks, McDonald's and Subway being the top three overall.
* The average restaurant rating is around 3.5 and is similar across locations, with 3.5 and 4.0 being the most common ratings. Half of the restaurants have fewer than 30 reviews, but restaurants from Nevada (Las Vegas) have significantly more reviews than others. The correlation between rating and review count suggests that restaurants with more reviews tend to have higher ratings on average.
* Most restaurants are in the low (40.9%) and mid (41.6%) price ranges. More expensive restaurants tend to receive more reviews on average, but the average rating remains similar.
* The most popular cuisine of restaurants overall is American style (traditional and new), followed by Mexican, Italian and Chinese, whereas the most popular restaurant setting is the formal restaurant style, followed by the nightlife/bar style and fast food. The above preference of cuisine varies quite a bit by location.
* The number of new users increased steadily from Yelp's debut in 2004 until 2015, followed by a significant decline thereafter. The average rating given by Yelp users is 3.72, and 60% of the users have fewer than 10 reviews in total, suggesting that most users post reviews on Yelp only occasionally.
* The daily number of reviews posted on Yelp shows a steady upward trend with seasonal fluctuations, whereas the daily number of tips only increased in the first four years and slowly declined thereafter, suggesting that tips are not as popular as reviews. Two thirds of restaurant reviews on Yelp are associated with a positive star rating of 4+. 
* Half of the restaurants have fewer than 20 checkins, indicating that checkin is not a widely used feature on Yelp when compared with review.

## * Hybrid Restaurant Recommendation Engine Powered by Yelp Datasets
A non-personalized keyword-search recommender module, a personalized collaborative recommender module and a personalized restaurant content-based recommender module are implemented and a user-friendly interface is created to integrate the three submodules, gather user interests and navigate users through the hybrid recommendation engine via user interactive questions.

**Capabilities of the hybrid recommendation engine include:** 
* **A non-personalized keyword-search recommender module** supports a combination of restaurant location-based (zip code, city, state) keyword filtering and restaurant feature-based (cuisine, style, price) keyword filtering of restaurant catalog, and returns the customized recommendations by ranking the filtered catalog based on ranking criteria of user's choice.
* **A personalized collaborative recommender module** supports personalized restaurant recommendation given the unique user_id. The personalization is computed based on the user's and all other users' rating history of all Yelp businesses via an optimized matrix factorization model, then user-unrated restaurants from the catalog are ranked by ratings predicted by the model and returned as personalized recommendations. <br>
* **A personalized restaurant content-based recommender module:** supports personalized restaurant recommendation given the unique user_id. The personalization is computed based on the similarity between the user's preference indicated by historical ratings and all restaurants' features extracted from a rich set of Yelp restaurant review texts, then user-unrated restaurants from the catalog are ranked by similarity score and returned as personalized recommendations.<br>
* **Further filter a recommendation list by keyword:** supports further filtering the recommendation results by a combination of restaurant location-based keywords and restaurant feature-based keywords by feeding the recommendation results as the restaurant catalog to the 'non-personalized keyword-search recommender module'.
* **An adjusted rating score**, introduced as an improved metric over the original restaurant average star rating, supports ranking the restaurants by the adjusted rating as an alternative ranking criterion. The adjusted rating score uses the damped mean to regularize restaurants with different numbers of ratings, with the merit of incorporating both the average restaurant rating (goodness) and the number of ratings (popularity).
* **A user-friendly interface** supports flexible navigation among the three available recommender modules at user's choice and options to further filter the recommendation results by keywords and/or display the desired number of recommendations.<br>
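
The damped-mean idea behind the adjusted rating score can be illustrated with a minimal sketch. The exact formula and damping constant used in the project are not stated here, so the function name `adjusted_rating`, the damping constant `k=10` and the global mean of 3.5 are assumptions chosen for illustration only.

```python
def adjusted_rating(avg_stars, n_ratings, global_mean, k=10):
    """Damped mean: blend a restaurant's average with the global mean.

    k acts as a number of 'pseudo-ratings' at the global mean, so restaurants
    with few ratings are pulled toward the global mean (k=10 is an assumed value).
    """
    return (avg_stars * n_ratings + global_mean * k) / (n_ratings + k)


# Toy catalog (hypothetical restaurants, not taken from the dataset)
restaurants = [
    {"name": "one_review_spot", "stars": 5.0, "review_count": 1},
    {"name": "well_reviewed_spot", "stars": 4.8, "review_count": 100},
]
GLOBAL_MEAN = 3.5  # assumed global average restaurant rating

for r in restaurants:
    r["adjusted_rating"] = adjusted_rating(r["stars"], r["review_count"], GLOBAL_MEAN)

# Rank by adjusted rating instead of the raw average star rating
ranked = sorted(restaurants, key=lambda r: r["adjusted_rating"], reverse=True)
```

With these assumed values, the restaurant with 100 ratings averaging 4.8 ranks above the restaurant with a single 5-star rating, which the raw average would have ranked first.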

**Performance of the hybrid recommendation engine:** 
* **Non-personalized keyword-search module** Test results validate that the recommendation results only contain restaurants matching user's combination of keywords and ranked by the appropriate scores of interest. 
* **Personalized collaborative module** Both the accuracy of rating prediction and the quality of recommendation ranking are computed on the unseen testset. RMSE (Root Mean Squared Error) of rating prediction is 1.2777 on the testset with new users/restaurants and 1.2443 on the testset without new users/restaurants. NDCG (Normalized Discounted Cumulative Gain) of recommendation ranking on the testset without new users/restaurants is 0.905 and 0.908 for NDCG@10 and NDCG@5, respectively.
* **Personalized content-based module** The quality of recommendation ranking is computed on the unseen testset. NDCG of recommendation ranking on the testset without new users/restaurants is 0.857 and 0.863 for NDCG@10 and NDCG@5, respectively.<br>
<br>
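
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how RMSE and NDCG@k are commonly computed for a single user. It assumes a linear gain (the true star rating) and a log2 discount; the project's notebooks may use a different gain or averaging scheme, and the function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k gains (log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(true_ratings, scores, k):
    """NDCG@k for one user: rank items by score, use true ratings as gains."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_gains = [true_ratings[i] for i in order]
    ideal = dcg_at_k(sorted(true_ratings, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Toy example: a ranking that matches the true order scores 1.0
perfect = ndcg_at_k([5, 3, 1], [0.9, 0.5, 0.1], k=3)
reversed_order = ndcg_at_k([5, 3, 1], [0.1, 0.5, 0.9], k=3)
```

A dataset-level NDCG would typically be the average of this per-user value over all test users.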

---

# 1. Introduction

## 1.1 Problem

Nowadays recommender systems are everywhere. Almost every major tech company has applied them in some form or another: Amazon uses them to suggest products to customers, and YouTube uses them to decide which video to play next on autoplay. In fact, one fundamental driver of data science’s skyrocketing popularity is the overwhelming amount of information available to anyone trying to make a good decision, and a recommender system helps to filter vast amounts of information and make suggestions according to an individual’s preferences.<br> 

Yelp is one of those companies whose business success relies heavily on the power of its recommender system. It provides users coming to its website or app with quick suggestions of nearby businesses or a list of suggestions for businesses matching users’ search keywords and location. While Yelp provides ratings for each business, these are not always indicative of a restaurant’s quality. For instance, a restaurant with only one rating of 5 stars would be ranked higher than a restaurant with a hundred ratings averaging 4.8 stars. Other problems include that star ratings vary from person to person, and that older ratings are less relevant. Improvements are needed to provide better ratings and suggestions.<br>

In this project, a hybrid recommender system will be developed featuring the following capabilities: <br>
1) for new or anonymous users, the recommendation engine can provide base-case recommendations using location information and/or restaurant features.<br>
2) with user ID as input and user's rating history in the database, either the collaborative personalization or the content-based personalization will be used to provide personalized recommendations at user's choice.<br>
3) smart weighted ratings will be computed taking into consideration both the average rating ('quality') and the number of ratings ('popularity').<br>

## 1.2 Approach

### Data wrangling 
The raw JSON files are first imported into Pandas dataframes, with nested dictionaries unpacked where present, followed by the necessary cleanup and transformation of some columns.
### EDA 
Understand business and user patterns: for instance, popular restaurant cuisines by location, popular restaurant styles by location, highly rated restaurants by cost, correlation between ratings and reviews, etc. This understanding will also help in designing the recommendation engine.
### Interactive data visualizations
Interactive data visualizations are created using bokeh based on EDA findings.
### Hybrid recommendation engine
* Module 1 - non-personalized keyword-search recommender:<br>
Build a keyword search-based restaurant recommender module to filter by keyword. Keywords could include, for instance, location-based information (zip code, longitude, latitude) and restaurant feature-based information (cuisine, style). 
The restaurant catalog will be filtered by keywords first, then ranked by the user-selected rating criteria. The top-n restaurants from the ranked list are returned with the user's choice of n.<br>
* Module 2 - personalized collaborative recommender:<br>
With user x business rating matrix, build a collaborative recommender module. Due to the highly sparse nature of the user x business matrix, matrix factorization models are prototyped to complete the matrix and generate personalized recommendation based on the predicted ratings.<br>
* Module 3 - personalized restaurant content-based recommender:<br>
With user ID and a rich set of restaurant metadata (features, reviews), build a content-based recommender module that recommends restaurants similar to the user’s preference inferred from the user’s rating history. Specifically, pairwise cosine similarity scores will be computed between restaurants and users based on their vectorized feature representations extracted from restaurant metadata; then the recommendation is generated by ranking the unrated restaurant catalog by similarity score in descending order.<br>
* Metrics for evaluating and optimizing recommender modules:<br>
a) accuracy of rating prediction: RMSE(root mean squared error)<br>
b) effectiveness of recommendation ranking: NDCG(Normalized Discounted Cumulative Gain)<br>
* Integration of submodules to build the hybrid recommendation engine:<br>
To combine the above three modules, a few interactive questions will be incorporated to help navigate users through the recommendation engine and customize the recommendation request:<br>
a) "Want to try a customized recommendation based on your Yelp user history?"<br>
b) "Want to rank your recommendations by 'smart' ratings?"<br>
c) "Which type of personalization would you prefer?" <br>
d) "Would you like to further filter your recommendation results by keywords?"<br>
* Other improvements:<br>
An alternative rating criterion will be introduced to adjust the original restaurant average star rating (quality) by also taking into consideration the number of ratings (popularity).<br>
* Potential caveats (cold start problem):<br>
a) new restaurant → can be addressed by both the non-personalized keyword-search module and the content-based personalized module by manually inputting the metadata or feature space of the new restaurant.<br>
b) new user → can be addressed by the non-personalized keyword-search module and treated as if the user id is not available, and recommend restaurants based on keywords such as location, features, etc.<br> 
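
The content-based ranking step described for Module 3 can be sketched as follows. This is a simplified illustration, not the project's implementation: the restaurant vectors are hand-written toy values rather than features extracted from review text, the user profile is assumed to be a rating-weighted average of the vectors of rated restaurants, and all names (`user_profile`, `recommend`, the business ids) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def user_profile(rated, vectors):
    """Rating-weighted average of the vectors of restaurants the user rated (assumed scheme)."""
    dim = len(next(iter(vectors.values())))
    total = sum(rated.values())
    profile = [0.0] * dim
    if total == 0:
        return profile
    for business_id, rating in rated.items():
        for j, x in enumerate(vectors[business_id]):
            profile[j] += rating * x / total
    return profile


def recommend(rated, vectors, n=5):
    """Rank restaurants the user has not rated by similarity to the user profile."""
    profile = user_profile(rated, vectors)
    scored = [(b, cosine(profile, v)) for b, v in vectors.items() if b not in rated]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Toy feature space: [mexican, japanese]
vectors = {"taco_a": [1.0, 0.0], "taco_b": [0.9, 0.1], "sushi_a": [0.0, 1.0]}
recs = recommend({"taco_a": 5}, vectors, n=2)
```

In this toy example a user who rated only `taco_a` gets `taco_b` ranked ahead of `sushi_a`, and the already-rated restaurant is excluded from the results.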


## 1.3 Impact
<p>The hybrid recommender system can be beneficial both to Yelp and to Yelp users. Yelp constantly looks for means to improve its recommendation systems and better make use of its rich business data. Having recommendations available for all levels of interaction, the hybrid recommender system will improve user experience and engagement by providing both quick suggestions for casual users and more sophisticated personalized recommendations for frequent users. The improved weighted rating metric will better represent restaurant quality, resulting in more accurate ranking for restaurants of interest. On the other hand, Yelp users will benefit from the various levels of interactions and personalized recommendations. </p>

## 1.4 Dataset
The Yelp dataset is available to the public via the Yelp Dataset Challenge and can be downloaded upon signing up at https://www.yelp.com/dataset. The raw data is structured as five individual JSON files containing a total of 5,996,996 reviews, 1,518,169 users, 188,593 businesses, 1,185,348 tips, and over 1.4 million business attributes across the 188,593 businesses.

# 2. Data Wrangling

## 2.1 Raw data
The Yelp dataset is downloaded as five individual JSON files from Yelp at https://www.yelp.com/dataset. The dataset contains a total of 5,996,996 reviews, 1,518,169 users, 188,593 businesses, 1,185,348 tips, and over 1.4 million business attributes across the 188,593 businesses. The total size of the dataset is more than 7 GB. A summary of information contained in the five raw JSON files is available at:
https://github.com/jingzhaomirror/capstone2_hybrid_yelp_recommender/blob/master/figures/dataset_info_from_yelp.png

## 2.2 Data wrangling

### Convert raw JSON files to CSV while unpacking nested dictionaries
The raw JSON files are downloaded ('business.json','user.json','review.json','tip.json','checkin.json'). Then a Python script ('json_to_csv.py'), available at https://github.com/jingzhaomirror/capstone2_hybrid_yelp_recommender/blob/master/json_to_csv.py, is adapted to convert all the raw JSON files to CSV files of the same name. Nested JSON dictionaries are flattened during this conversion and both parent and nested key-value pairs are extracted.
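
The flattening step can be illustrated with a small standard-library sketch. It is not the `json_to_csv.py` script itself: it assumes one JSON object per line, joins nested keys with a dot (matching the 'attributes.' and 'hours.' column prefixes described later), keeps only the leaf columns, and loads all rows into memory, so it is suitable for small samples only. The function names are hypothetical.

```python
import csv
import io
import json


def flatten(record, parent_key="", sep="."):
    """Recursively flatten nested dictionaries into dotted column names."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def json_lines_to_csv(json_lines, out_file):
    """Write flattened records to CSV; returns the column names used."""
    rows = [flatten(json.loads(line)) for line in json_lines if line.strip()]
    fieldnames = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)  # missing keys are written as empty strings
    return fieldnames


# Toy input with one nested dictionary (hypothetical record)
sample = ['{"business_id": "b1", "hours": {"Monday": "9:00-17:00"}}']
buffer = io.StringIO()
columns = json_lines_to_csv(sample, buffer)
```

Here `columns` contains `business_id` and `hours.Monday`, and the buffer holds the corresponding header and data row.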

### Data Cleaning
All five CSV files are imported as Pandas dataframes, inspected for data quality, cleaned and transformed accordingly. The data cleaning protocols are summarized below: <br>
* **'business' dataframe**<br>
The 'business' dataframe contains a total of 188,593 businesses all over the world with a primary focus on US businesses. Business categories include a wide variety of 1264 keywords, many of which are not restaurant-related, for instance, 'shopping', 'health & medical', 'automotive', etc. For this project, the scope is limited to US restaurants. Therefore, the 'business' dataframe is first filtered to US businesses only (138,757 businesses, reduced by 26%), then to restaurant-related businesses only (47,554 businesses, reduced by 66%).<br>
In addition, the 'categories' column of the 'business' dataframe is inspected and restaurant characteristics ('cuisine', 'style') are extracted and added back to the 'business' dataframe as new columns.<br>
* **'user' dataframe**<br>
The 'user' dataframe contains a total of 1,518,169 users, with only a few NaNs (0.03%) in the 'name' column and no NaNs in other columns. Since 'user_id' functions as the unique identifier for identifying users and cross-referencing to other dataframes, missing information in the 'name' column is not a problem.<br>
Action is taken to remove one outlier (value of 0.0) in the 'average_stars' column, as 'average_stars' should take any float number between 1.00 and 5.00.<br>
* **'review' dataframe**<br>
The 'review' dataframe contains a total of 5,996,996 reviews. The 'text' column containing the actual contents of the reviews has been converted to the 'string' data type. The carriage return '\r' is present in a few review texts, causing undesired creation of new rows when writing to and importing from CSV files. Therefore, '\r' is replaced with '\n\n'.
In addition, there are two problematic entries with incorrect and missing information: one has no valid review text, therefore has been removed; the other one contains out of range values, therefore the incorrect values have been updated.<br>
* **'tip' dataframe**<br>
The 'tip' dataframe contains a total of 1,185,348 tips, four (0.0003%) of which have no actual tip contents and are removed.
* **'checkin' dataframe**<br>
The 'checkin' dataframe contains 157,075 checkin logs associated with 157,075 different businesses. Each entry represents the checkin information for one business, with the checkin counts for particular hours in the different days of the week. A new column named 'total_count' is computed and introduced by adding up all checkins at all times.<br>
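
Two of the cleaning steps above, replacing carriage returns in review text and computing 'total_count' for checkins, are simple enough to sketch on plain Python records. The project performs these steps on Pandas dataframes; the helper names below are hypothetical, and the 'time.' column prefix follows the naming described for the flattened 'checkin' file.

```python
def clean_review_text(text):
    """Replace carriage returns, which otherwise split a review across CSV rows."""
    return text.replace("\r", "\n\n")


def total_checkins(checkin_row):
    """Sum the counts in all 'time.*' columns, treating missing slots as zero."""
    return sum(
        int(value)
        for key, value in checkin_row.items()
        if key.startswith("time.") and value not in (None, "")
    )


# Toy records (hypothetical values)
cleaned = clean_review_text("Great\rfood")
row = {"business_id": "b1", "time.Fri-0": 3, "time.Sat-1": None, "time.Sun-2": "2"}
row["total_count"] = total_checkins(row)
```

On a dataframe, the same two operations would typically be a string replacement on the 'text' column and a row-wise sum over the 'time.' columns.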

Please refer to the separate notebook for all details on data wrangling:
https://github.com/jingzhaomirror/capstone2_hybrid_yelp_recommender/blob/master/data_wrangling.ipynb

## 2.3 Cleaned datasets
The cleaned dataframes are saved as five separate csv files, 'business_clean.csv', 'user_clean.csv', 'review_clean.csv', 'tip_clean.csv' and 'checkin_clean.csv'.<br>
**A short description of each dataframe is given as below:**<br>
* **'business' dataframe**<br>
('business' also contains columns resulting from unpacking nested dictionaries under the 'attributes' and 'hours' columns, those 'child' columns feature column names starting with either 'attributes.' or 'hours.'. Below is a summary of only the 'parent' columns) <br>
1) business_id: no NaN, no duplicates, all business_ids are of the same length of 22 characters, and are case-sensitive.<br>
2) name: no NaN.<br>
3) address: there are 1.9% NaNs, but it's ok since postal_code and coordinates are used mostly instead of address<br>
4) postal_code: no NaN, following the American 5-digit zipcode format.<br>
5) city, 6) state, 7) latitude and 8) longitude: no NaN.<br>
9) stars: no NaN, all star ratings take discrete values from 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5 and 5.0.<br>
10) review_count: no NaN, review counts range between 3 and 7968.<br>
11) is_open: no NaN, integer taking values of 0 or 1 for closed (27.3%) or open (72.7%), respectively<br>
12) neighborhood: significant NaNs (66.7%), the top 5 neighborhoods are 'Westside','Southeast','Spring Valley','The Strip','downtown'. The neighborhood information will not be used as location information. Instead, it will be treated as one of the business features for NLP analysis.<br>
13) attributes: some NaNs (3.9%), all with subattributes shown under column names featuring 'attributes.'<br>
    -most subattributes are categorical with either True or False binary entries or a few categorical values;<br>
    -six subattributes (e.g. 'attributes.businessParking') still contain nested dictionaries;<br>
14) categories: a few NaNs (0.9%), string values containing comma-separated phrases describing restaurant cuisines or styles, e.g. 'burger'.<br>
15) hours: many NaNs (26.8%), all with subfeatures shown under column names featuring 'hours.' <br>
    -all subfeatures are day of the week from 'Monday' to 'Sunday', with string values indicating the operating hours<br>
16) cuisine: many NaNs (22.6%), strings containing comma-separated phrases representing restaurant cuisines; these features are extracted from the 'categories' column. <br>
17) style: many NaNs (20.4%), strings containing comma-separated phrases representing restaurant styles; these features are extracted from the 'categories' column. <br>

* **'user' dataframe**<br>
1) user_id: no NaN, no duplicates, similar to business_id, all user_ids are of the same length of 22 characters, and are case-sensitive.<br>
2) name: a few NaNs (0.03%). User_id will be used instead of name in all cases.<br>
3) elite: no NaNs, contains a list of the years the user was an elite member (very active Yelp users with frequent activities and many insightful reviews & tips). Most users (95.6%) have 'None' as the value.<br>
4) yelping_since: no NaN, string formatted as YYYY-MM-DD, ranging between 2004-10-12 and 2018-07-02, indicating the date user joined Yelp.<br>
5) review_count: no NaN, integer value indicating the number of reviews the user has written, value ranges between 0 and 12723.<br>
6) average_stars: no NaN, takes any float number between 1.00 and 5.00.<br>
7) useful: no NaN, integer indicating the number of useful votes sent by the user. Value ranges between 0 and 258479, with 0 being the most common value.<br>
8) funny: no NaN, integer indicating the number of funny votes sent by the user. Value ranges between 0 and 242120, with 0 being the most common value.<br>
9) cool: no NaN, integer indicating the number of cool votes sent by the user. Value ranges between 0 and 255909, with 0 being the most common value.<br>
10) fans: no NaN, integer indicating the number of fans the user has. Value ranges between 0 and 8665, with 0 being the most common value. <br>
11) compliment_*: no NaN, all integers indicating the number of various types of compliments received by the users.<br>

* **'review' dataframe**<br> 
1) review_id: no NaN, no duplicates, similar to user_id and business_id, all review_ids are of the same length of 22 characters, and are case-sensitive.<br>
2) user_id: no NaN, all of the same length of 22 characters, case-sensitive, corresponding to the user_id in dataframe 'user'.<br>
3) business_id: no NaN, all of the same length of 22 characters, case-sensitive, corresponding to the business_id in dataframe 'business'.<br>
4) stars: no NaN, integer indicating the star rating, takes discrete values of 1, 2, 3, 4 and 5<br>
5) text: no NaN and no empty entries, strings of the actual reviews, with length ranging from 1 to 5000.<br>
6) date: no NaN, string of length 10 formatted as YYYY-MM-DD, dates range from 2004-10-12 to 2018-07-02.<br>
7) useful: no NaN, integer, the number of useful votes the review received, values range from 0 to 1234.<br>
8) funny: no NaN, integer, the number of funny votes the review received, values range from 0 to 505.<br>
9) cool: no NaN, integer, the number of cool votes the review received, values range from 0 to 991.<br>

* **'tip' dataframe**<br>
1) user_id: no NaN, all of the same length of 22 characters, case-sensitive, corresponding to the user_id in dataframe 'user'.<br>
2) business_id: no NaN, all of the same length of 22 characters, case-sensitive, corresponding to the business_id in dataframe 'business'.<br>
3) text: no NaN and no empty entries, strings of the actual tips, with length ranging from 1 to 500.<br>
4) date: no NaN, string value of length 10 formatted as YYYY-MM-DD, dates range from 2009-04-15 to 2018-07-02.<br>
5) likes: no NaN, integer value indicating the number of likes the tip received, value ranges from 0 to 15.<br>

* **'checkin' dataframe** <br>
('checkin' also contains columns resulting from unpacking nested dictionaries under column 'time', those nested columns feature column names starting with 'time.'. Below is a summary of only the parent columns)<br>
1) business_id: no NaN, no duplicates, all business_ids are of the same length of 22 characters, and are case-sensitive.<br>
2) time: no NaN, parent column with nested dictionaries containing checkin counts (value) under all times (key).<br>
3) total_count: no NaN, integer indicating the sum of all checkins at all times for the business_id, values range from 1 to 138477.<br>
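
The identifier checks reported above (no NaN, no duplicates, fixed length of 22 characters) can be expressed as a small helper. This is an illustrative sketch with a hypothetical function name and toy ids, not the code used in the data wrangling notebook.

```python
def check_ids(ids, expected_length=22):
    """Count missing, duplicated and wrongly sized identifiers in a column."""
    present = [i for i in ids if i]
    return {
        "missing": len(ids) - len(present),
        "duplicates": len(present) - len(set(present)),
        "wrong_length": sum(1 for i in present if len(i) != expected_length),
    }


# Toy column: one duplicate, one short id and one missing value
report = check_ids(["a" * 22, "b" * 22, "a" * 22, "short", None])
```

A clean id column, such as those described for the cleaned dataframes, would yield zero for all three counts.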

# 3. EDA & Interactive Visualization

## 3.1 Yelp restaurant patterns

### 3.1.1 Common restaurants
The common restaurant names are analyzed nation-wide and for the five individual states (Arizona, Nevada, Ohio, North Carolina and Pennsylvania) that have rich inventories of over 5000 restaurants. The results are shown below as an interactive Bokeh plot, where the dropdown menu switches between 'all states' and the above five states: 
<img src="figures/common_restaurant_bokeh.png" height="800" width="900">

As expected, the top 10 common restaurant names are popular chain or franchised restaurants, fast food or coffee shops. Although the ranking varies a bit by state, Starbucks, McDonald's and Subway are the top 3 overall. Some regional restaurant chains show up in the top list only in certain states; for instance, Filibertos, one of the Southwest's favorite Mexican fast-food chains, is ranked #8 in the state of Arizona. 

### 3.1.2 Restaurant statistics by state
A variety of summary statistics of Yelp restaurants, including '# of restaurants (total)', '# of restaurants (open)', 'average rating of restaurant' and 'average number of reviews received per restaurant', are computed by state. An interactive Bokeh plot is given below, where the dropdown menu switches among the various summary statistics:
<img src="figures/restaurant_stats_by_state_bokeh.png" height="800" width="900">

As shown, this dataset only contains a subset of all Yelp businesses, focused on businesses from only a few selected states. In terms of restaurants, only a portion of restaurants from 15 states (Arizona, Nevada, Ohio, North Carolina, Pennsylvania, Wisconsin, Illinois, South Carolina, Indiana, Oregon, Colorado, New York, California, Vermont and Virginia) are available in this dataset. Among them, only Arizona, Nevada, Ohio, North Carolina and Pennsylvania have a rich catalog of over 5000 restaurants, and will be the main focus of this project.<br>
In terms of review counts, Nevada has a much higher average than all others, as a result of the popularity of Las Vegas as a resort town. The average restaurant rating is very similar among five states, close to 3.5.<br>

### 3.1.3 Restaurant distribution visualized on map
Both Bokeh and Folium were tried for visualizing the geographical distribution of restaurants on a map. The Bokeh Google map visualization is shown below as an example: 
<img src="figures/restaurant_on_map_bokeh.png">

In good agreement with the # of restaurants by state in section 3.1.2, the Google map distribution also confirms that the Yelp dataset only contains a subset of all Yelp businesses from a few states. In particular, restaurants are densely distributed around Phoenix, Arizona; Las Vegas, Nevada; Cleveland, Ohio; Charlotte, North Carolina; and Pittsburgh, Pennsylvania.

### 3.1.4 Restaurant rating vs. # of reviews
Restaurants from the Yelp dataset are analyzed and visualized to understand the correlation between restaurant rating and the # of reviews a restaurant receives. Below is a series of plots showing the restaurant distribution by rating, the restaurant distribution by # of reviews, as well as the restaurant rating by # of reviews.
<img src="figures/restaurant_rating_vs_review.png">

The plots reveal that the majority of the restaurants have a rating between 3.0 and 4.5, with 3.5 and 4.0 being the most common ratings; half of the restaurants have fewer than 30 reviews, although the record number of reviews is as high as 7968. In addition, the restaurant rating is related to the # of reviews to some extent, as restaurants with more reviews tend to have higher ratings on average.
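
The two summaries discussed above, the median number of reviews and the average rating as a function of review count, can be computed along the following lines. The sketch uses only the standard library and toy `(stars, review_count)` pairs; the bucket edges and the function name are assumptions, and the toy values are not taken from the dataset.

```python
from statistics import mean, median


def rating_by_review_bucket(restaurants, edges=(10, 30, 100)):
    """Average star rating per review-count bucket (bucket edges are assumed)."""
    buckets = {}
    for stars, review_count in restaurants:
        label = next((f"<{e}" for e in edges if review_count < e), f">={edges[-1]}")
        buckets.setdefault(label, []).append(stars)
    return {label: mean(values) for label, values in buckets.items()}


# Toy (stars, review_count) pairs for illustration only
toy = [(3.0, 5), (3.5, 20), (4.0, 50), (4.5, 500)]
median_reviews = median(count for _, count in toy)
avg_by_bucket = rating_by_review_bucket(toy)
```

On the real catalog, an increasing average across buckets would correspond to the trend that restaurants with more reviews tend to have higher ratings.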

### 3.1.5 Restaurant by price range
Several summary statistics of Yelp restaurants are also computed for a variety of restaurant price ranges and the results are shown in the interactive Bokeh plot below, where the dropdown menu offers both options to switch among various summary statistics, including '# of restaurants', 'average rating', 'average # of reviews' and 'maximum # of reviews', and options to switch between 'all states' and five individual states (Arizona, Nevada, Ohio, North Carolina and Pennsylvania): 
<img src="figures/restaurant_by_price_bokeh.png" height="800" width="900">

As shown, most restaurants are in the low (40.9%) and mid (41.6%) price ranges, whereas restaurants in the high and highest price ranges only account for 3.4% and 0.67%, respectively. In addition, 13.5% of restaurants have missing price range data.<br>
In terms of restaurant rating ('stars'), restaurants in different price ranges have relatively similar average ratings around 3.5. In terms of the # of reviews received ('review_count'), more expensive restaurants tend to receive more reviews on average.<br>
These trends vary only slightly by state; state-wise trends are in general agreement with the national trends.<br> 

### 3.1.6 Restaurants by category
The common restaurant cuisines and styles are analyzed both nation-wide and for five individual states (Arizona, Nevada, Ohio, North Carolina and Pennsylvania) that have rich inventories of over 5000 restaurants. The results are shown below as an interactive Bokeh plot, where the dropdown menu offers both options to switch between 'all states' and five individual states, and options to switch between 'cuisine' and 'style'.
<img src="figures/restaurant_by_category_bokeh.png" height="800" width="900">

As shown, the most popular cuisine among all is American style (traditional and new), followed by Mexican, Italian and Chinese. The most popular restaurant setting is the formal restaurant style, followed by the nightlife/bar style and fast food.<br>
Restaurant trend by cuisine varies quite a bit by location, suggesting people in different states favor different cuisines. The trend by style remains similar among all states.

## 3.2 Yelp user patterns
Yelp user patterns are analyzed by the year they joined Yelp, the average rating given by users and the total number of reviews given by users. The plots are shown below: 
<img src="figures/user_pattern.png">

As shown, Yelp saw a steady increase in new members from the beginning; this growth peaked in 2015 and was followed by a significant decline afterward. The average rating given by Yelp users is 3.72, and 81% of the users on Yelp are generous, with an average rating of 3+. Although the record number of reviews a Yelp user has given is 12723, 60% of the users have fewer than 10 reviews in total, suggesting that most users post reviews on Yelp only occasionally.

## 3.3 Yelp review & tip trends
Trends on Yelp reviews vs tips are analyzed and plotted over time, as shown below:
<img src="figures/review_tip_trend.png">

As revealed, review is one of the earliest features Yelp has offered since its beginning, whereas tip is a later feature introduced in 2009. The number of reviews shows a steady upward trend over time with seasonal fluctuations, whereas the number of tips increases in the first four years and slowly declines afterward. Overall, tip is not as popular as review.<br>
Two thirds of the reviews are associated with a positive star rating of 4+.

## 3.4 Yelp checkin distribution
Restaurant distribution is analyzed and plotted by the number of checkins received on Yelp: 
<img src="figures/checkin_distribution.png">
As shown, half of the restaurants have fewer than 20 checkins, even fewer than their reviews, suggesting that checkin is not a widely used feature on Yelp.

# 4. Building hybrid recommendation engine
## 4.1 Implementation
**A hybrid recommendation engine is implemented by combining the following recommender modules:**
The individual modules are prototyped using various methods, trained and evaluated in separate notebooks (refer to the corresponding links below), and all necessary information from the trained models is saved to file and available for import by the hybrid engine. 

* **Module 1 - non-personalized keyword-search recommender:**<br>
supports a combination of restaurant location-based (zip code, city, state) keyword filtering and restaurant feature-based (cuisine, style, price) keyword filtering of restaurant catalog, and returns the customized recommendations by ranking the filtered catalog based on ranking criteria of user's choice.<br>
--Please refer to the separate notebook on keyword recommender module for all implementation details: https://github.com/jingzhaomirror/capstone2_hybrid_yelp_recommender/blob/master/recommender_keyword.ipynb
<br>
<br>
* **Module 2 - personalized collaborative recommender:**<br>
supports personalized restaurant recommendation given the unique user_id. The personalization is computed based on the user's and all other users' rating history of all Yelp businesses via an optimized matrix factorization model, then user-unrated restaurants from the catalog are ranked by ratings predicted by the model and returned as personalized recommendations.<br>
**Building the module**<br>
To build the module, several matrix factorization models are prototyped. The optimized SVD with bias model gives the best RMSE for rating prediction on the unseen test set and is therefore chosen for implementing the collaborative recommender module. Prior to module implementation, the best matrix factorization model is trained on the entire dataset, and the user latent feature matrix and bias vector, the business latent feature matrix and bias vector, along with other necessary information from the trained model are saved to file for use by the module.<br>
In module implementation, this saved information is loaded first. Given the user_id of interest, personalized ratings are predicted for all businesses in the catalog and paired with the corresponding business_id. The list of predicted ratings is then filtered to businesses the user has not rated, and merged with the restaurant catalog on business_id to filter out non-restaurant businesses. Lastly, the resulting recommendations are ranked by predicted rating in descending order, and the top-n restaurants from the ranked list are returned with the user's choice of n.<br>
**Performance Evaluation**<br>
RMSE of rating prediction by the best matrix factorization model: the RMSE of predicted ratings on the unseen test set is 1.2777 with new users/restaurants included and 1.2443 without new users/restaurants. The RMSE drops further to 1.188 on a test set restricted to users/restaurants with at least five historical ratings, suggesting that model performance should improve as user and business ratings accumulate and the model is retrained periodically.<br>
NDCG of recommendation ranking: the recommendation ranking generated from the predicted ratings is also evaluated. NDCG (normalized discounted cumulative gain) is computed on the unseen test set, with average NDCG@10 and NDCG@5 scores of 0.905 and 0.908, respectively.<br> 
--Please refer to the separate notebook on collaborative recommender module for all details on algorithm selection, evaluation and module implementation: https://github.com/jingzhaomirror/capstone2_hybrid_yelp_recommender/blob/master/recommender_collaborative.ipynb
<br>
<br>
* **Module 3 - personalized content-based recommender:**<br>
supports personalized restaurant recommendation given the unique user_id. Restaurant feature vectors and user feature vectors (representing user's preference) are first extracted and computed from a rich set of Yelp restaurant text reviews and numerical ratings. The personalization is then computed based on the similarity between user vectors and restaurant vectors, then user-unrated restaurants from the catalog are ranked by similarity score and returned as personalized recommendations.<br>
**Building the module**<br>
The Yelp dataset offers a variety of restaurant metadata, including restaurant features as both numerical and text data, as well as a rich set of text-based restaurant reviews along with the users' ratings. **Three different strategies are prototyped for extracting restaurant feature vectors, aggregating user preference vectors and generating recommendation ranking:**<br>
1) restaurant feature vectors are extracted as the top 300 PCA components out of the top 1000 word features (unigrams and bigrams) extracted from **all restaurant reviews using Tfidf vectorizer**; user feature vectors representing the user's preference are then computed by aggregating the feature vectors of user-rated restaurants weighted by the corresponding user rating; recommendation ranking is finally generated by **ranking the cosine similarity scores** between the user feature vector and each restaurant feature vector in descending order.<br>
2) restaurant feature vectors are extracted as the top 300 word features (unigrams only) extracted from **all the available restaurant text-based metadata by count vectorizer**; user feature vectors are then computed by aggregating the feature vectors of user-rated restaurants weighted by the corresponding user rating; recommendation ranking is finally generated by **ranking the cosine similarity scores** between the user feature vector and each restaurant feature vector in descending order.<br>
3) the two cosine similarity scores from above 1) and 2) are used as engineered features from text-type metadata to enable personalization; along with all other restaurant numerical metadata, a **supervised regression model** is then built and optimized to predict user's rating of restaurants; recommendation ranking is finally generated by **ranking the predicted rating** of restaurants in descending order.<br>
Among the three strategies for personalization based on restaurant contents, ranking by the **cosine similarity score between user and restaurant based on their feature vectors extracted from all restaurant text reviews** (#1) consistently gives the best NDCG scores and is therefore chosen for implementing the content-based recommender module. Prior to module implementation, the restaurant and user feature vectors are re-computed using the entire 'review' dataset and saved to file for use by the module.<br>
In module implementation, this saved information is loaded first. Given the user_id of interest, cosine similarity scores are computed between this user and all restaurants in the catalog, and added back to the catalog as a restaurant feature. The restaurant catalog is then filtered to restaurants the user has not rated and ranked by descending similarity score. The top-n restaurants from the ranked list are returned with the user's choice of n.<br>
**Performance Evaluation**<br>
NDCG of recommendation ranking: the ranking generated by the best strategy scores 0.857 and 0.863 for NDCG@10 and NDCG@5, respectively.<br>
--Please refer to the separate notebook on restaurant content-based recommender module for all details on the three strategies experimented, performance evaluation and module implementation: https://github.com/jingzhaomirror/capstone2_hybrid_yelp_recommender/blob/master/recommender_content.ipynb
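
The rating-weighted user vector and the cosine-similarity ranking of strategy 1 can be sketched as follows. This is a minimal illustration on toy three-dimensional vectors rather than the notebook's implementation: the function names (`user_vector`, `cosine`, `recommend_by_content`) and the toy data are assumptions, and in the actual module the restaurant vectors would come from the Tfidf and PCA steps described above.

```python
import math

def user_vector(rated, restaurant_vecs):
    """Aggregate the vectors of rated restaurants, weighted by the user's rating.

    Dividing by the sum of ratings is an assumption made here for readability;
    it does not change the cosine similarity.
    """
    dim = len(next(iter(restaurant_vecs.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for business_id, rating in rated.items():
        for i, x in enumerate(restaurant_vecs[business_id]):
            total[i] += rating * x
        weight_sum += rating
    return [x / weight_sum for x in total]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend_by_content(rated, restaurant_vecs, n=2):
    """Rank restaurants the user has not rated by similarity to the user vector."""
    profile = user_vector(rated, restaurant_vecs)
    scores = {
        b: cosine(profile, v)
        for b, v in restaurant_vecs.items()
        if b not in rated
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical toy data: two "taco-like" and two "sushi-like" restaurants.
restaurant_vecs = {
    "taco_place": [0.9, 0.1, 0.0],
    "burrito_bar": [0.8, 0.2, 0.1],
    "sushi_spot": [0.0, 0.1, 0.9],
    "ramen_house": [0.1, 0.2, 0.8],
}
rated = {"taco_place": 5, "sushi_spot": 1}
top = recommend_by_content(rated, restaurant_vecs, n=2)
print(top)
```

With a 5-star rating on the taco restaurant and a 1-star rating on the sushi restaurant, the user vector leans toward the first dimension, so the burrito restaurant ranks ahead of the ramen restaurant.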

## 4.2 Testing
Tests are conducted on each submodule. Please refer to the separate notebook on integrating, building and testing the hybrid recommendation engine for all testing details: 
https://github.com/jingzhaomirror/capstone2_hybrid_yelp_recommender/blob/master/hybrid_recommender.ipynb
<br>
<br>
**Here is a quick summary on testing individual recommender modules:**
* **non-personalized keyword-search recommender module** <br>
11 tests (11 queries) are performed with a total CPU time of 10 seconds and an elapsed time of 15 seconds. This averages to roughly 1-2 seconds per query, which is reasonable in practice. For invalid queries or queries with no matching results, informative messages are returned by the module. 
* **personalized collaborative recommender module** <br>
On average, it takes 1 second to return the personalized recommendations, and the results feature a diverse list of restaurants. For users with very limited rating history, the recommendations are somewhat similar to the generic list for new users; for users with a rich rating history, the recommendations are clearly personalized. In addition, the personalized recommendations generated by the module can be further filtered by keywords. For a non-existing user, the module returns a generic list of recommendations; for an invalid user id, informative messages are returned by the module.
* **personalized restaurant content-based recommender module** <br>
The average time to return the personalized recommendations is also around 1 second. Recommendations are very personalized for all existing users: for users with very limited rating history, the recommended restaurants all share similar features with the restaurants the user has rated; for users with a diverse history, the list becomes more diverse. In addition, the personalized recommendations generated by the module can be further filtered by keywords. For an invalid user id or missing user data, informative messages are returned by the module.

# 5. Build user interface for the hybrid recommendation engine

To integrate the three recommender modules, adapt to the user's needs and provide a good user experience, a user interface is built by implementing the recommendation engine as a class, with the above submodules as class methods and several interactive questions that gather user interests and guide users through the application. Please refer to section 3.1 in the separate notebook (linked below) for the source code on implementing the user interface: https://github.com/jingzhaomirror/capstone2_hybrid_yelp_recommender/blob/master/hybrid_recommender.ipynb
<br>
<br>
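
As a rough sketch of this design, a recommendation engine class can expose keyword filtering as a method that accepts an optional starting catalog. Everything here is hypothetical and far simpler than the notebook's code: the class name `HybridRecommender`, the method names `keyword` and `display`, and the catalog fields (`name`, `city`, `cuisine`, `stars`) are assumptions chosen for illustration.

```python
class HybridRecommender:
    """Toy engine; only the keyword-search method is sketched here."""

    def __init__(self, catalog):
        self.catalog = catalog  # list of dicts, one per restaurant
        self.results = []       # last ranked recommendation list

    def keyword(self, catalog=None, rank_by="stars", **keywords):
        """Keep restaurants matching every keyword, then rank in descending order.

        Passing a previous result list as `catalog` filters that list
        instead of the full default catalog.
        """
        source = self.catalog if catalog is None else catalog
        matches = [
            r for r in source
            if all(
                str(r.get(field, "")).lower() == str(value).lower()
                for field, value in keywords.items()
            )
        ]
        self.results = sorted(matches, key=lambda r: r[rank_by], reverse=True)
        return self.results

    def display(self, n=5):
        """Return the names of the top-n restaurants from the last results."""
        return [r["name"] for r in self.results[:n]]

# Hypothetical four-restaurant catalog.
catalog = [
    {"name": "A", "city": "Las Vegas", "cuisine": "mexican", "stars": 4.0},
    {"name": "B", "city": "Las Vegas", "cuisine": "mexican", "stars": 4.5},
    {"name": "C", "city": "Phoenix", "cuisine": "mexican", "stars": 5.0},
    {"name": "D", "city": "Las Vegas", "cuisine": "italian", "stars": 3.5},
]
engine = HybridRecommender(catalog)
engine.keyword(city="las vegas", cuisine="Mexican")
top = engine.display(n=2)
print(top)
```

Keeping the last result list on the object is one simple way to let a later question (for example, changing the number of displayed results) reuse the ranking without recomputing it.
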
**Below is a list of interactive questions incorporated in the UI and the corresponding executions:**
* **"Want to try a customized recommendation based on your Yelp user history?"**<br>
If no, proceed with the keyword class method, which corresponds to the non-personalized keyword-search recommender module: first gather the user's keywords of interest, then filter the restaurant catalog by those keywords to provide non-personalized recommendations. If yes, prompt for the user id and ask a follow-up question to decide which personalized module to use.<br>

* **"Wanna rank your recommendations by 'smart' ratings?"**<br>
The keyword class method (non-personalized keyword-search recommender module) supports two different ranking mechanisms:<br>
    1. rank by the 'smart' rating: the original restaurant average star rating is weighted by taking into consideration the number of ratings it receives ('restaurant popularity'); the adjusted rating is then used for ranking.<br>
    2. rank by the original star rating: the original restaurant average star rating is used for ranking.<br>
Answering 'no' will rank the recommendation list by the original star rating of restaurants; otherwise, the weighted 'smart' rating will be used to rank the recommendation results in the keyword-search module. 
   
* **"Which personalized recommendation would you prefer?"**<br>
The personalized recommendation can be generated via two different recommender modules. Users can choose between option 1 and option 2 shown below:<br>
    1. "Something new based on people like you": if chosen, will activate the collaborative class method (corresponding to the personalized collaborative recommender module) and recommend new restaurants based on similar peers.<br>
    2. "Something similar to your favorite restaurants": if chosen, will activate the content class method (personalized content-based recommender module) to recommend similar restaurants based on the user's past preferences.<br>

* **"Would you like to further filter your recommendation results by keywords?"**<br>
The recommendation engine also supports further filtering the recommendation results with a variety of keywords (restaurant location and/or features). If yes, a series of questions is displayed to gather the user's keywords of interest, then the keyword class method (non-personalized keyword-search recommender module) is activated to filter the recommendation results by those keywords. For this filtering step, the previously returned list of recommendation results is used as the starting restaurant catalog instead of the default business catalog from the database.<br> 

* **"Would you like to display more/less recommendation results?"**<br>
The recommendation engine also supports displaying the top-n recommendation results with the user's choice of n. If yes, will take the user's choice of n and update the printed results to top-n. Under the hood, the class also has a display method that takes the user's choice of n, and outputs only the top-n restaurants of the entire ranked recommendation list with relevant restaurant feature columns.<br>
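
The 'smart' rating described above can be illustrated with a damped mean, which pulls the average of a restaurant with few ratings toward a global mean. This is only a sketch under stated assumptions: the function name `damped_mean`, the damping factor `k=10` and the global mean of 3.5 are illustrative choices, not values taken from the notebook.

```python
def damped_mean(avg_rating, n_ratings, global_mean, k=10):
    """Damped mean: behaves as if k extra ratings at the global mean were added.

    k is a hypothetical damping factor; a larger k pulls restaurants with
    few ratings more strongly toward the global mean.
    """
    return (n_ratings * avg_rating + k * global_mean) / (n_ratings + k)

# A 5.0-star restaurant with only 3 ratings vs. a 4.5-star one with 500 ratings.
few = damped_mean(5.0, 3, global_mean=3.5)
many = damped_mean(4.5, 500, global_mean=3.5)
print(round(few, 3), round(many, 3))
```

In this toy case the well-reviewed 4.5-star restaurant ends up ranked above the 5.0-star restaurant with only three ratings, which is the kind of adjustment the weighted rating is meant to provide.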

# 6. Demonstration of use cases of the hybrid recommendation engine
The subsections below demonstrate three use cases of the hybrid recommendation engine corresponding to the three recommender modules. The screenshots demonstrate the flow of all user inputs and application outputs in each case.

### 6.1 Recommendation by non-personalized recommender module and tuning the number of restaurants to recommend

<img src="figures/ui_keyword_1.png" width="600" align="left">
<img src="figures/ui_keyword_2.png" width="900" align="left">
<img src="figures/ui_keyword_3.png" width="800" align="left">
<img src="figures/ui_keyword_4.png" width="500" align="left">
<img src="figures/ui_keyword_5.png" width="750" align="left">
<img src="figures/ui_keyword_6.png" width="800" align="left">
<img src="figures/ui_keyword_7.png" width="800" align="left">

### 6.2 Recommendation by personalized collaborative module and further filtering the recommendation results by keywords

<img src="figures/ui_collaborative_1.png" width="600" align="left">
<img src="figures/ui_collaborative_2.png" width="800" align="left">
<img src="figures/ui_collaborative_3.png" width="800" align="left">
<img src="figures/ui_collaborative_4.png" width="600" align="left">

### 6.3 Recommendation by personalized content-based module and repeating with multiple independent recommendation requests within one session

<img src="figures/ui_content_1.png" width="600" align="left">
<img src="figures/ui_content_2.png" width="500" align="left">
<img src="figures/ui_content_3.png" width="700" align="left">
<img src="figures/ui_content_4.png" width="600" align="left">
<img src="figures/ui_content_5.png" width="500" align="left">
<img src="figures/ui_content_6.png" width="650" align="left">

# 7. Conclusion & Potential Improvement
## 7.1 Yelp Business Insights
The Yelp dataset, containing a total of 188,593 businesses, 1,518,169 users, 5,996,996 reviews, 1,185,348 tips, and 157,075 checkins, is obtained via the Yelp Dataset Challenge, cleaned up, and analyzed here. After data wrangling, there are valid records of 47,553 **US restaurant** businesses, 1,518,168 users, 5,996,995 reviews, 1,185,344 tips, and 157,075 checkins. EDA and interactive visualizations are performed to understand restaurant patterns, user patterns, review and tip trends, and checkin patterns on Yelp. Important findings are summarized as follows:

#### Restaurant pattern:
This dataset only contains a subset of all Yelp businesses, and only a portion of restaurants from 15 states are available. Among them, only Arizona, Nevada, Ohio, North Carolina and Pennsylvania have a rich catalog of over 5000 restaurants; these five states are therefore the main focus of the analysis.<br> 

The most common restaurants are the popular chain or franchised restaurants, fast food or coffee shops. Although the ranking varies by location, Starbucks, McDonald's and Subway are the top 3 among all.<br>

*Rating and review:*<br> 
The majority of restaurants have a rating between 3.0 and 4.5, with 3.5 and 4.0 being the most common ratings. The average restaurant rating is very similar across the five states, close to 3.5. When it comes to reviews, half of the restaurants have fewer than 30 reviews, although the record number of reviews is as high as 7,968. In addition, the review count a restaurant receives varies considerably by state: Nevada has a much higher average than all other states, as a result of the popularity of Las Vegas as a resort town. The correlation between restaurant ratings and review counts reveals that restaurants with more reviews tend to have higher ratings on average.<br>

*Cost:*<br>
Most restaurants are in the low (40.9%) and mid (41.6%) price ranges. Restaurants in different price ranges share similar average ratings of around 3.5, but different review counts. More expensive restaurants tend to receive more reviews on average. This trend varies by state only to some extent.<br>

*Category:*<br>
The most popular cuisine among all is American style (traditional and new), followed by Mexican, Italian and Chinese. The most popular restaurant setting is the formal restaurant style, followed by the nightlife/bar style and fast food.
Restaurant trend by cuisine varies quite a bit by location, suggesting people in different states favor different cuisines. The trend by style remains similar among all states.

#### User pattern:
Yelp saw a steady increase in new users from its beginning around 2004; new-user growth peaked in 2015 and declined significantly thereafter. The average rating given by Yelp users is 3.72, and 81% of users are generous, with an average rating of 3+. Although the record number of reviews by a single Yelp user is 12,723, 60% of users have fewer than 10 reviews in total, suggesting that most users post reviews on Yelp only occasionally.

#### Review & tip trends:
Two thirds of the reviews are associated with a positive star rating of 4+. Review vs. tip: the number of reviews shows a steady upward trend since the beginning in 2004 with seasonal fluctuations, whereas the number of tips increases in the first four years after its introduction (2009-2013) and slowly declines thereafter. Overall, tips are not as popular as reviews.<br>

#### Checkin distribution: 
Half of the restaurants have fewer than 20 checkins, an even lower median than for reviews (30), indicating that checkin is not a widely used feature on Yelp.<br> 

## 7.2 Yelp Hybrid Restaurant Recommendation Engine
A non-personalized keyword-search recommender module, a personalized collaborative recommender module and a personalized restaurant content-based recommender module are implemented and a user-friendly interface is created to integrate the three submodules, gather user interests and navigate users through the hybrid recommendation engine via user interactive questions.

**Capabilities of the hybrid recommendation engine include:** 
* **A non-personalized keyword-search recommender module** supports a combination of restaurant location-based (zip code, city, state) keyword filtering and restaurant feature-based (cuisine, style, price) keyword filtering of restaurant catalog, and returns the customized recommendations by ranking the filtered catalog based on ranking criteria of user's choice.
* **A personalized collaborative recommender module** supports personalized restaurant recommendation given the unique user_id. The personalization is computed based on the user's and all other users' rating history of all Yelp businesses via an optimized matrix factorization model, then user-unrated restaurants from the catalog are ranked by ratings predicted by the model and returned as personalized recommendations. <br>
* **A personalized restaurant content-based recommender module:** supports personalized restaurant recommendation given the unique user_id. The personalization is computed based on the similarity between the user's preference indicated by historical ratings and all restaurants' features extracted from a rich set of Yelp restaurant review texts, then user-unrated restaurants from the catalog are ranked by similarity score and returned as personalized recommendations.<br>
* **Further filter a recommendation list by keyword:** supports further filtering the recommendation results by a combination of restaurant location-based keywords and restaurant feature-based keywords by feeding the recommendation results as the restaurant catalog to the 'non-personalized keyword-search recommender module'.
* **An adjusted rating score, introduced as an improved metric over the original restaurant average star rating:** supports ranking the restaurants by the adjusted rating as an alternative ranking criterion. The adjusted rating score uses a damped mean to regularize restaurants with different numbers of ratings, incorporating both the average restaurant rating (goodness) and the number of ratings (popularity).
* **A user-friendly interface** supports flexible navigation among the three available recommender modules at user's choice and options to further filter the recommendation results by keywords and/or display the desired number of recommendations.<br>
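
The rating prediction behind the collaborative module can be sketched with the usual biased matrix-factorization form, predicted rating = global mean + user bias + business bias + dot product of the user and business latent vectors. The sketch below assumes that standard form; the numbers are made up, the names (`predict`, `recommend`, `model`) are hypothetical, and clipping predictions to the 1-5 star range is an added assumption. In the project itself the trained factors and biases are loaded from file.

```python
def predict(user, business, mu, user_bias, item_bias, P, Q):
    """Biased matrix-factorization prediction, clipped to the 1-5 star range."""
    dot = sum(p * q for p, q in zip(P[user], Q[business]))
    rating = mu + user_bias[user] + item_bias[business] + dot
    return min(5.0, max(1.0, rating))

def recommend(user, rated, restaurants, n, **model):
    """Rank restaurants the user has not rated by predicted rating, descending."""
    candidates = [b for b in restaurants if b not in rated]
    scored = [(predict(user, b, **model), b) for b in candidates]
    scored.sort(reverse=True)
    return [b for _, b in scored[:n]]

# Hypothetical trained parameters for one user and three restaurants.
model = dict(
    mu=3.7,
    user_bias={"u1": 0.2},
    item_bias={"r1": 0.3, "r2": -0.4, "r3": 0.1},
    P={"u1": [0.5, -0.2]},
    Q={"r1": [0.4, 0.1], "r2": [1.0, -1.0], "r3": [-0.6, 0.3]},
)
top = recommend("u1", rated={"r1"}, restaurants=["r1", "r2", "r3"], n=2, **model)
print(top)
```

Restricting `restaurants` to restaurant business ids plays the role of the merge step that drops non-restaurant businesses from the predictions.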

**Performance of the hybrid recommendation engine:** 
* **Non-personalized keyword-search module** Test results validate that the recommendation results only contain restaurants matching user's combination of keywords and ranked by the appropriate scores of interest. 
* **Personalized collaborative module** Both the accuracy of rating prediction and the quality of recommendation ranking are computed on the unseen test set. RMSE (Root Mean Squared Error) of rating prediction is 1.2777 on the test set with new users/restaurants and 1.2443 on the test set without new users/restaurants. NDCG (Normalized Discounted Cumulative Gain) of recommendation ranking on the test set without new users/restaurants is 0.905 and 0.908 for NDCG@10 and NDCG@5, respectively.
* **Personalized content-based module** The quality of recommendation ranking is computed on the unseen test set. NDCG of recommendation ranking on the test set without new users/restaurants is 0.857 and 0.863 for NDCG@10 and NDCG@5, respectively.
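
For reference, the two metrics quoted above can be computed as in the sketch below. It uses one common NDCG definition (linear gain with a log2 position discount); the notebooks may use a different gain function, and the function names and toy inputs are assumptions.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(true_ratings_in_predicted_order, k):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    ideal = sorted(true_ratings_in_predicted_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(true_ratings_in_predicted_order, k) / ideal_dcg

# True ratings of one user's test restaurants, listed in predicted rank order.
perfect = ndcg_at_k([5, 4, 3, 1], k=3)   # predicted order matches the ideal order
swapped = ndcg_at_k([3, 5, 4, 1], k=3)   # a 3-star restaurant was ranked first
error = rmse([4.0, 3.0], [5.0, 3.0])
print(perfect, round(swapped, 3), round(error, 3))
```

Averaging `ndcg_at_k` over all test users gives a single score per cutoff, which is how the NDCG@10 and NDCG@5 figures above are reported.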

## 7.3 Potential Improvement
There are several ways to improve the performances of the recommender modules: 
* Using unsupervised learning techniques to group restaurants and/or users into clusters<br>
This will enable more customized recommender models to be designed and optimized for each cluster, leading to better submodels with improved performance.
* Incorporating data from the 'tip' and 'checkin' dataset<br>
The personalization of the current recommendation engine is computed based on data from the 'review' dataset. The 'tip' and 'checkin' datasets also contain information that could be used for personalization. For instance, sentiment analysis can be performed on the tip texts in the 'tip' dataset and incorporated as an additional resource to infer user preference. A restaurant's number of checkins by day of the week, from the 'checkin' dataset, can also be incorporated as restaurant metadata indicating the restaurant's popularity and weekly variations.
* Adjusting the restaurant star rating by time relevance<br>
The current recommendation engine supports the improved rating metric over the original restaurant average star rating by adjusting the original average rating with the number of ratings. However, as restaurant ownership, food quality, service and surrounding environment can all change over time, newer ratings are considered more relevant than older ratings. Therefore, the rating metric can be further improved by incorporating the age relevance of the rating.
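
One possible way to fold recency into the rating metric is an exponentially time-decayed average, sketched below. This is not part of the current engine: the function name `time_weighted_rating` and the one-year half-life are hypothetical choices for illustration.

```python
def time_weighted_rating(ratings, half_life_days=365.0):
    """Average star rating where each rating's weight halves every half_life_days.

    ratings: list of (stars, age_in_days) pairs.
    """
    weights = [0.5 ** (age / half_life_days) for _, age in ratings]
    total = sum(weights)
    return sum(w * stars for w, (stars, _) in zip(weights, ratings)) / total

# A 2-star rating from two years ago and a 5-star rating from today.
ratings = [(2.0, 730), (5.0, 0)]
recent_heavy = time_weighted_rating(ratings)
plain_mean = sum(stars for stars, _ in ratings) / len(ratings)
print(round(recent_heavy, 2), plain_mean)
```

The decayed average sits closer to the recent 5-star rating than the plain mean does, and it could in principle be combined with the damped mean so that both recency and the number of ratings are taken into account.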