## Restaurant Reviews Capstone Overview



Many capstone projects predict restaurant ratings by combining details from individual reviews with information about the reviewers or the restaurant being reviewed.  These projects give insight into the factors that contribute most to the best or worst review scores. They help explain what makes a successful restaurant experience and support building recommendation engines.

In this project I was keen to use my subject matter expertise in London restaurants to extract and explore data that would let me test the hypothesis:
- **There is a bias in users' reviews of London restaurants that can be measurably attributed in part to the restaurant's location**

This knowledge would be useful to layer on top of standard restaurant analysis to build more nuanced, concierge-like restaurant recommendation engines.

This is an overview of the steps taken in this project, the final data used and key findings.

## Summary of Project Steps

Data Exploration and Web Scraping
- Research and analysis of various data sources (APIs, web pages and data sets)
- TimeOut and Google web scraping

Data Cleaning and EDA
- Initial Exploratory Data Analysis (EDA)
- Merging and cleaning of the scraped TimeOut and Google data

Features
- Feature Engineering
- Creation and merging of location features
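
One way a location feature of this kind could be built is to attach the zone of the nearest station to each restaurant. The sketch below is illustrative only and is not the project's code: the function names, the station names and the coordinates are assumptions, and it uses the haversine formula for distance.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(lat, lon, stations):
    """Return (station, distance_km) for the station closest to (lat, lon)."""
    best = min(
        stations,
        key=lambda s: haversine_km(lat, lon, s["latitude"], s["longitude"]),
    )
    return best, haversine_km(lat, lon, best["latitude"], best["longitude"])


# Made-up rows for illustration; these are not taken from the project data
stations = [
    {"name": "Station A", "latitude": 51.5152, "longitude": -0.1419, "zone": 1},
    {"name": "Station B", "latitude": 51.4627, "longitude": -0.1145, "zone": 2},
]
restaurant = {"title": "Example", "latitude": 51.5136, "longitude": -0.1365}

station, dist_km = nearest_station(
    restaurant["latitude"], restaurant["longitude"], stations
)
restaurant["zone"] = station["zone"]  # hypothetical engineered feature
restaurant["station_distance_km"] = round(dist_km, 2)
```

A nearest-station lookup is simple but treats zone boundaries as station-based, so restaurants midway between two zones may be assigned either one.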

Clustering
- Modelling using DBSCAN and agglomerative clustering algorithms to identify clusters that provide an enhanced location feature

Regression
- Various regression algorithms used to detect how much of a restaurant's average Google user review score can be predicted using general or location features, and which of those features are most important.

Classification
- Various classification algorithms used to identify which general and location features were most important to identify high or low scoring restaurants (based on average Google user review score)

Hypothesis and Similarity Testing 
- Hypothesis tests to identify whether review scores differ by a statistically significant amount between restaurants in different locations, for each of the 3 reviewer groups (TimeOut critics, TimeOut users and Google users)
- Similarity testing between each reviewer group to measure their relative similarity
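
This overview does not say which hypothesis tests were used. As a minimal sketch of the idea, the example below swaps in a two-sided permutation test for the difference in mean rating between two location groups. The function name and the ratings are made up for illustration.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in mean rating.

    Returns the observed difference (mean of a minus mean of b) and an
    estimated p-value.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign ratings to the two groups at random
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up ratings for illustration only
zone_1 = [4.1, 4.3, 4.0, 4.2, 4.4, 4.1, 4.3, 4.2]
outer_zones = [4.4, 4.5, 4.3, 4.6, 4.4, 4.5, 4.3, 4.6]

observed_diff, p_value = permutation_test(outer_zones, zone_1, n_permutations=2000)
print(f"difference in means: {observed_diff:.2f}, p-value: {p_value:.4f}")
```

A permutation test makes no normality assumption, which suits ratings that are bounded and rounded, but a small p-value alone does not show that the difference is large enough to matter.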


## Data Overview


There were 3 key initial challenges to consider with the data collection for this project:
1. How to obtain a reliable and complete (or at least sufficient) list of London restaurants.
2. How to find sufficient useful data within the time constraints.
3. What the 'right' data to use in the project was: in particular, what constitutes an unbiased baseline and which data contributes to meaningful features.

<br />
Significantly more data was acquired and investigated for this project than was used for the final modelling and analysis.  Compiling the data from scratch across many sources that were not necessarily compatible taught me a number of valuable lessons about 'messy data'.  <br />

The table below shows which data sources contributed to the table used for the final modelling and analysis.


### Data Used for Final Models and Analysis:

| Data | Description |
|:--- |:---|
| TimeOut Restaurants | List of c. 4,650 restaurants scraped from the website front page |
| TimeOut Restaurant Reviews Overview | High-level details of TimeOut restaurants: c. 1,200 had a TimeOut critic review and rating, and more have user ratings |
| Google Restaurant Reviews Overview | Rating and number of users giving a rating for restaurants with a TimeOut critic review (c. 1,100 returned) |
| Stations | CSV file of data scraped from the web showing the longitude, latitude and zone of London stations up to zone 6. Used for location feature engineering |
| Geosite | Geosite used to obtain London district by postcode for location feature engineering |
| Visit London | List of top tourist spots used for location feature engineering |

After looking at a number of different sources, I took the list of restaurants scraped from the TimeOut website's front page as the best list for this project. This was practical because:
- the number of restaurants scraped was aligned with the number available from other sources once chain/fast food restaurants were removed;
- the high level details provided by TimeOut in the individual restaurants' pages were helpful for creating features; 
- the TimeOut critic review was chosen as the best 'unbiased' baseline view of restaurants for rating comparison purposes from the initially available data; and
- inconsistencies between restaurant metadata across data sources (e.g. different restaurant name and address syntax) meant taking a master list from another source to access TimeOut's features would have introduced complexities and potential unnecessary data loss.

However, initial analysis made it clear that too few TimeOut users reviewed restaurants for TimeOut reviews alone to give a large enough data set for modelling.  If a lower limit of at least 5 user reviews was set to make a restaurant's average rating meaningful, only just over 500 restaurants were left.  Restaurant visitors rate restaurants on Google more frequently, so using the Google overview data doubled the data set size for modelling to just over 1,100.
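
The minimum-reviewer cut-off can be sketched in pandas as follows. The column names follow the data dictionary in this document, but the rows are made up and this is an illustration rather than the project's code.

```python
import pandas as pd

MIN_USERS = 5  # lower limit on the number of users rating a restaurant

# Made-up rows for illustration; column names follow the data dictionary
restaurants = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "user_rating_timeout": [4.2, 3.9, 4.8, 4.0],
    "users_timeout": [12, 3, None, 5],
})

# A missing count compares as False, so restaurants with no count drop out too
enough_reviews = restaurants[restaurants["users_timeout"] >= MIN_USERS]
print(len(enough_reviews), "of", len(restaurants), "restaurants kept")
```
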
<br ><br >
Further cleaning and analysis identified that the TimeOut location data required augmenting with other data sources to create location features.  The table above shows where the final data used was acquired from.

### Final Data Dictionary

The data dictionary for the data used for modelling (after data cleansing and feature engineering) is below.

```python
import pandas as pd

# Show long description text in full instead of truncating it
pd.set_option('display.max_colwidth', 1000)

# The first CSV column is a saved row index, so read it as the index
data_dict = pd.read_csv('data_dict_model.csv', index_col=0)
data_dict
```

| name | final_type | description | source | usage | Feature type |
|:--- |:--- |:--- |:--- |:--- |:--- |
| title | string | Name of restaurant | TimeOut | Identifier | Description |
| critic_rating_timeout | float | Critic rating for restaurant - (Out of 5 - integer) | TimeOut Cleaning | Feature/ Bias target creation | Target/ Rating Feature |
| user_rating_timeout | float | User rating for restaurant - (Out of 5 - float - to nearest 0.1) | TimeOut Cleaning | Feature/ Bias target creation - Potential | Target/ Rating Feature |
| users_timeout | float | Number of TimeOut users providing a rating for the restaurant | TimeOut Cleaning | Feature/ Bias target creation - Potential | Rating feature |
| latitude | float | Latitude of restaurant | TimeOut Cleaning | Feature/ Feature Creation | Location |
| longitude | float | Longitude of restaurant | TimeOut Cleaning | Feature/ Feature Creation | Location |
| postcode | string | Postcode of restaurant | TimeOut Cleaning | Feature Creation - Core | Location |
| subcategory_combined | string | Cuisine type - combined from TimeOut and Google categorisation | Feature Creation | Feature - Core | Description |
| sparcity | int | Created sparsity feature - 3 classes - 0, 1, 2 representing high, mid, low density | Feature Creation | Feature - Core | Location |
| budget_combined | float | Price/budget of restaurant - Engineered - TimeOut used as primary source and nulls filled by Google values | Feature Creation | Feature - Core | Rating feature |


## Summary of Findings

I learnt so much from working on this project that it was challenging to summarise everything I found and learnt.  The Capstone Presentation (Google Slides) gives a summary of the findings and project steps.  I have saved example notebooks from some of the project steps in the repository.
<br/>
My testing did show evidence that location is a factor in a restaurant's rating: restaurants outside zone 1 appeared to have a higher average rating from the critics and from both sets of users.  However, although the zone difference initially appeared statistically significant (through hypothesis testing), it was not large enough to give confidence that it was a true difference once the overlaps in the data and the rounding of ratings were taken into account.

<br/>
That said, I immensely enjoyed working with this data, and it taught me a lot about handling messy data and the challenges of interpreting ratings.
<br/>

Below is a Tableau image of all the London restaurants in my dataset, mapped and colour-coded by whether they are in zone 1 or an outer zone and whether they are in my main manufactured cluster, a mini manufactured cluster or no cluster at all. (Restaurants are most densely distributed within a cluster and more spread out outside one.)

![](./data/TableauZoneMap.png)