####  Iris Ryu

# Project Design Writeup 

## INTRO


According to the CDC, one in six people gets some form of foodborne illness each year. To help mitigate foodborne illness, New York City created a Restaurant Inspection Rating System intended to make restaurant sanitation more transparent to consumers. Similar rating systems are also used in many other cities, including but not limited to San Francisco, Chicago, Los Angeles, and Las Vegas.

But how telling are these ratings in reality? Are they actually good barometers of health safety? What do the ratings actually tell us? Do they relate to the actual quality of the food?

This project tries to answer as many of these questions as possible, in the simplest form, through data science.

## PROJECT PROBLEM

There are two main project problems:

__1. What is the likely sanitation score of a restaurant in NYC given different variables, such as zip code, violation type, cuisine, etc.?__

__2. Given that we know the sanitation scores, are they a good predictor of how good the food is? If not, what is a good predictor?__


The details of the two project problems are below:

__1. Predicting sanitary inspection scores__
- __Outcome:__ Sanitation inspection score (continuous variable)
- __Possible Features:__
    - zip code (categorical)
    - street (categorical)
    - cuisine category (categorical)
    - violation code (binary)
    - inspection type (categorical)
    - Borough (categorical)
    - violation type (categorical)
    - criticality (categorical)
    - price range (categorical)
    
Because the outcome of interest is a continuous variable, a linear regression analysis may be the best starting point.
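
As a rough illustration of that starting point, the sketch below fits an ordinary least squares model on a tiny made-up table. The toy values and the helper name `fit_linear_regression` are assumptions for illustration only; the column names follow the cleaned data dictionary, and categorical features are one-hot encoded before fitting.

```python
import numpy as np
import pandas as pd

# Hypothetical toy inspections; values are made up for illustration.
inspections = pd.DataFrame({
    "BORO": ["MANHATTAN", "BROOKLYN", "QUEENS", "MANHATTAN", "BROOKLYN", "QUEENS"],
    "Critical": [0, 1, 2, 3, 1, 4],
    "Not Critical": [1, 0, 2, 1, 3, 2],
    "SCORE": [2, 7, 16, 21, 11, 30],
})


def fit_linear_regression(df, numeric_cols, categorical_cols, target):
    """Ordinary least squares on numeric features plus one-hot encoded categoricals."""
    # drop_first avoids dummy columns that are perfectly collinear with the intercept.
    dummies = pd.get_dummies(df[categorical_cols], drop_first=True, dtype=float)
    X = pd.concat([df[numeric_cols].astype(float), dummies], axis=1)
    X.insert(0, "intercept", 1.0)
    y = df[target].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X.to_numpy(), y, rcond=None)
    return pd.Series(coef, index=X.columns)


coefficients = fit_linear_regression(
    inspections, ["Critical", "Not Critical"], ["BORO"], "SCORE"
)
print(coefficients)
```

Each coefficient is the estimated change in score per unit of that feature, holding the others fixed. With so few rows the numbers themselves mean nothing; the point is only the shape of the workflow.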

__2. Analyzing restaurant quality through sanitation score and other features__
- __Outcome:__ restaurant rating (1-5) (continuous variable)
- __Possible Features:__
    - Most recent inspection score (continuous)
    - Max inspection score (continuous)
    - Min inspection score (continuous)
    - Average inspection score (continuous)
    - Std dev inspection score (continuous)
    - price range (categorical)
    - review count (continuous)
    - restaurant category (categorical) 
    - area (categorical)
    - postal code (categorical)
    
Because the outcome of interest is a continuous variable, a linear regression analysis may be the best starting point.
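
The per-restaurant score features listed above (most recent, max, min, average, and standard deviation) could be derived from inspection-level rows roughly as follows. This is a minimal sketch on made-up data: keying on `PHONE` and the output column names are assumptions that mirror the merged data dictionary, not the notebook's actual code.

```python
import pandas as pd

# Hypothetical inspection-level rows (one row per inspection).
inspections = pd.DataFrame({
    "PHONE": ["2125550100"] * 3 + ["7185550101"] * 2,
    "INSPECTION DATE": pd.to_datetime(
        ["2015-06-01", "2016-06-01", "2017-04-01", "2016-01-15", "2017-02-20"]
    ),
    "SCORE": [10, 25, 12, 7, 13],
    "GRADE": ["A", "B", "A", "A", "A"],
})

# Sort by date so that "last" within each group is the most recent inspection.
inspections = inspections.sort_values(["PHONE", "INSPECTION DATE"])

score_stats = inspections.groupby("PHONE")["SCORE"].agg(
    ["count", "last", "max", "min", "mean", "std"]
)
score_stats.columns = [
    "Score Count", "Score Recent", "Score Max",
    "Score Min", "Score Average", "Score StdDev",
]

# Count how many times each grade was received per restaurant.
grade_counts = (
    pd.crosstab(inspections["PHONE"], inspections["GRADE"])
    .add_prefix("Grade ")
    .add_suffix(" Count")
)

features = score_stats.join(grade_counts)
print(features)
```

Note that the mean and standard deviation come back as floats, and the standard deviation is undefined (NaN) for a restaurant with a single inspection.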


## HYPOTHESIS

__1. Predicting sanitary inspection scores__
- Features that are likely to have a large impact:
    - cuisine
    - criticality
    - price range 
    - violation category
- Expected direction (a lower inspection score is better): restaurants with a higher price range would tend to have lower inspection scores, inspections with fewer critical violations will score lower, and small-scale restaurants such as cafes will have lower inspection scores
- Because we are focusing on the 5 boroughs of NYC, the area/zip code is speculated to have minimal impact

__2. Analyzing restaurant quality through sanitation score and other features__
- Features that are likely to have a large impact: 
    - price range
    - postal code 
- The higher the price range, the higher the restaurant rating will tend to be, and more expensive/gentrified areas would tend to have higher ratings

## DATASETS

To answer the two questions, two separate datasets were prepared and cleaned differently, but both draw on the same two data sources. For sanitary inspection data, NYC OpenData was used; for restaurant quality data, Yelp data was scraped from the Yelp website using BeautifulSoup.

### Original Data
#### -- DOHMH New York City Restaurant Inspection Results --

__Source:__ https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j

#### Data Dictionary:

FieldName|Type|Description
---------|----|-----------
CAMIS| int | Unique identifier for the restaurant
DBA | string | Name of restaurant
BORO | string | Borough in which the restaurant is located
BUILDING | string | Building number for which the restaurant is located
STREET | string | Street name at which the restaurant is located
ZIPCODE | int | Zipcode as per the address of the restaurant
PHONE | string | Phone number of restaurant
CUISINE DESCRIPTION | string | Describes the cuisine of the restaurant 
INSPECTION DATE | date | Date at which the inspection took place 
ACTION | string | Action associated with each restaurant inspection
VIOLATION CODE | string | Code for the particular violation recorded for each restaurant
VIOLATION DESCRIPTION | string | Description of violation associated with each violation code recorded for each restaurant
CRITICAL FLAG | binary| Flag if particular violation recorded is Critical = 1, Not Critical = 0
SCORE | int | Total score for a particular inspection. If there was adjudication, a judge may reduce the total points for the inspection and this field will have the updated amount
GRADE | string | • N = Not Yet Graded • A = Grade A • B = Grade B • C = Grade C • Z = Grade Pending • P= Grade Pending issued on re-opening following an initial inspection that resulted in a closure
GRADE DATE | date | Date when the current grade was issued to the restaurant
RECORD DATE | date | The date when the extract was run to produce this data set
INSPECTION TYPE | string | The type of inspection. A combination of the program and inspection type

Each row is a unique violation.

#### -- Yelp NYC Metadata --

__Code Source:__ https://github.com/bzsaindon/nycdsablog/blob/master/restaurantreviewsfoodborneillness/foodborne_illness_yelp.ipynb

__Data as of:__ 5/14/2017

#### Data Dictionary:

FieldName|Type|Description
---------|----|-----------
area| string | area in which the restaurant is located
restaurant | string | Name of restaurant
postal_code | int | Zipcode as per the address of the restaurant
phone | string | Phone number of restaurant
category | string | Describes the cuisine of the restaurant 
price_range | string | price range of an average meal at restaurant 
rating_value | float | average rating from 1-5 for all reviews of restaurant
review_count | int | Number of reviews written for restaurant

Each row is a unique, randomly selected restaurant.

### Cleaned Data

-- *for details, please look at the "Getting Yelp & Inspection Score Data" notebook* --

#### -- Predicting sanitary inspection scores --

__What was cleaned:__
- a row now represents a unique inspection, not a unique violation
- each violation code was mapped to a violation category as categorized by http://www.ehagroup.com/food-safety/new-york-abc-restaurant-grading/NYC-HACCP-SCORESHEET.pdf
- the critical/non-critical column was split into the columns "Critical", "Not Applicable", and "Not Critical", where each column holds the number of violations recorded at that criticality level
- the violation codes were transposed from rows to columns, where 1 = violation was recorded and 0 = violation was not recorded. The violations were then grouped into violation categories, where each violation category column holds the number of violations recorded for that type of violation
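
The reshaping steps above can be sketched with pandas as follows. The sample rows, the `code_to_category` mapping, and the use of `CAMIS` plus `INSPECTION DATE` as the inspection key are all assumptions for illustration; the real mapping was built by hand from the score sheet, and the criticality values are shown here as text labels.

```python
import pandas as pd

# Hypothetical raw rows: one row per violation, as in the original dataset.
violations = pd.DataFrame({
    "CAMIS": [1, 1, 1, 2, 2],
    "INSPECTION DATE": ["2017-01-05", "2017-01-05", "2017-03-01", "2017-02-10", "2017-02-10"],
    "VIOLATION CODE": ["04L", "08A", "10F", "02B", "04L"],
    "CRITICAL FLAG": ["Critical", "Not Critical", "Not Critical", "Critical", "Critical"],
    "SCORE": [12, 12, 2, 14, 14],
})

# Hypothetical code-to-category mapping, for illustration only.
code_to_category = {
    "04L": "VECTORS",
    "08A": "VECTORS",
    "10F": "FACILITY",
    "02B": "TEMPERATURE CONTROLS",
}
violations["CATEGORY"] = violations["VIOLATION CODE"].map(code_to_category).fillna("#N/A")

keys = ["CAMIS", "INSPECTION DATE"]
index_cols = [violations[k] for k in keys]

# Count violations per category and per criticality level for each inspection.
category_counts = pd.crosstab(index_cols, violations["CATEGORY"])
critical_counts = pd.crosstab(index_cols, violations["CRITICAL FLAG"])

# SCORE repeats on every violation row of an inspection, so take the first value.
scores = violations.groupby(keys)["SCORE"].first()

inspections = pd.concat([scores, category_counts, critical_counts], axis=1).reset_index()
print(inspections)
```

The result has one row per inspection, with a count column for each violation category and each criticality level.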

__Assumptions:__
- the manual mapping of violation codes to violation categories is accurate
- inspection scores are a better representation of restaurant health than grade 

__Notes:__
- The Nutrition Display, Legal, and Tobacco violation categories are not considered in this analysis because these violations were introduced recently and there is not enough data per restaurant to analyze them

#### Data Dictionary:

FieldName|Type|Description
---------|----|-----------
PHONE | string | Phone number of restaurant
DBA | string | Name of restaurant
BORO | string | Borough in which the restaurant is located
STREET | string | Street name at which the restaurant is located
ZIPCODE | int | Zipcode as per the address of the restaurant
CUISINE DESCRIPTION | string | Describes the cuisine of the restaurant 
INSPECTION DATE | date | Date at which the inspection took place 
INSPECTION TYPE | string | The type of inspection. A combination of the program and inspection type
SCORE | int | Total score for a particular inspection. If there was adjudication, a judge may reduce the total points for the inspection and this field will have the updated amount
GRADE | string | • N = Not Yet Graded • A = Grade A • B = Grade B • C = Grade C • Z = Grade Pending • P= Grade Pending issued on re-opening following an initial inspection that resulted in a closure
CHEMICALS | int | Number of violations related to chemicals for this inspection
EMERGENCY SAFETY | int | Number of violations related to emergency safety for this inspection
FACILITY | int | Number of violations related to facility for this inspection
FOOD HANDLING | int | Number of violations related to food handling for this inspection
HACCP PROGRAM | int | Number of violations related to HACCP program for this inspection
LEGAL | int | Number of violations related to legal for this inspection
NUTRITION DISPLAY | int | Number of violations related to nutrition display for this inspection
OTHER | int | Number of violations categorized as other for this inspection
PERSONAL HYGIENE | int | Number of violations related to personal hygiene for this inspection
SANITIZING PROGRAM | int | Number of violations related to sanitizing program for this inspection
TABACCO | int | Number of violations related to tobacco for this inspection
TEMPERATURE CONTROLS | int | Number of violations related to temperature controls for this inspection
VECTORS | int | Number of violations related to vectors, or vermin related violations for this inspection
WATER SUPPLY AND WASTE DISPOSAL | int | Number of violations related to water supply and waste disposal for this inspection
#N/A | int | Number of violations that didn't have enough information for this inspection
Critical | int | Number of violations that were "Critical" for this inspection
Not Applicable | int | Number of violations that were "Not Applicable" criticality for this inspection
Not Critical | int | Number of violations that were "Not Critical" for this inspection


#### -- Analyzing restaurant quality --

__Notes:__
- NYC OpenData and Yelp data were merged on phone number. The original dataset of 393 data points was reduced to 272 data points that have both Yelp and inspection data
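
A merge on phone number only works if both sources format the number the same way. The sketch below shows one possible approach, assuming US numbers; the helper `normalize_phone`, the `phone_key` column, and the sample rows are hypothetical and not taken from the notebook.

```python
import re

import pandas as pd


def normalize_phone(raw):
    """Keep digits only and drop a leading US country code."""
    digits = re.sub(r"\D", "", str(raw))
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


# Hypothetical sample rows from each source.
yelp = pd.DataFrame({
    "restaurant": ["Cafe A", "Diner B", "Bistro C"],
    "phone": ["(212) 555-0100", "+1 718-555-0101", "646.555.0102"],
    "rating_value": [4.0, 3.5, 4.5],
})
inspections = pd.DataFrame({
    "DBA": ["CAFE A", "DINER B"],
    "PHONE": ["2125550100", "7185550101"],
    "Score Recent": [9, 21],
})

yelp["phone_key"] = yelp["phone"].map(normalize_phone)
inspections["phone_key"] = inspections["PHONE"].map(normalize_phone)

# An inner join keeps only restaurants present in both sources;
# validate raises an error if a phone number is duplicated on either side.
merged = yelp.merge(inspections, on="phone_key", how="inner", validate="one_to_one")
print(len(yelp), "->", len(merged))
```

Rows that drop out of an inner join like this are one plausible reason a merged dataset ends up smaller than the original.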

#### Data Dictionary:

FieldName|Type|Description
---------|----|-----------
area| string | area in which the restaurant is located
restaurant | string | Name of restaurant
postal_code | int | Zipcode as per the address of the restaurant
phone | string | Phone number of restaurant
category | string | Describes the cuisine of the restaurant 
price_range | string | price range of an average meal at restaurant 
rating_value | float | average rating from 1-5 for all reviews of restaurant
review_count | int | Number of reviews written for restaurant
STREET | string | Street name at which the restaurant is located
CUISINE DESCRIPTION | string | Describes the cuisine of the restaurant 
Score Count | int | Total number of inspections done at this restaurant since 2011
Score Recent | int | Most recent inspection score for this restaurant
Score Max | int | Highest outstanding inspection score for this restaurant since 2011
Score Min | int | Lowest outstanding inspection score for this restaurant since 2011
Score Average | int | Average inspection score for this restaurant from 2011 to YTD
Score StdDev | int | Standard Deviation of inspection score for this restaurant for scores from 2011 to YTD
Grade A Count | int | Number of times this restaurant got an inspection grade of A from 2011 to YTD
Grade B Count | int | Number of times this restaurant got an inspection grade of B from 2011 to YTD
Grade C Count | int | Number of times this restaurant got an inspection grade of C from 2011 to YTD
Not Yet Graded | int | Number of times this restaurant got an inspection but didn't get graded from 2011 to YTD
P Count | int | Number of times this restaurant got a P after inspection from 2011 to YTD
Z Count | int | Number of times this restaurant got a Z after inspection from 2011 to YTD

## DOMAIN KNOWLEDGE

Although I have no experience working with restaurant inspection scoring processes, as a customer of restaurants in NYC I am aware of the grades posted at restaurant entrances. I also read up on the restaurant grading process to prepare for this project.

Source: http://www1.nyc.gov/site/doh/services/restaurant-grades.page

In addition, understanding the grading process will help with the analysis. Depending on the type of inspection and the respective score recorded for that inspection, the resulting grade of a restaurant can differ. For example, if a restaurant scores 14 or more points on its initial inspection (an A requires 13 or fewer), it is not graded immediately but gets another chance to earn an A through a re-inspection about a week after the initial inspection. When analyzing the "Grade" count columns, we must be aware of this, and that non-A grades can only be given for "Cycle Inspections".

### Pre-existing Models

Source: 
- Harvard Study on Restaurant Inspection:
    - The idea of "categorizing" each violation was inspired from this
    - Source: https://harvarddatasciencerestaurantinspections.wordpress.com/2014/12/12/big-data-analysis-of-nyc-restaurant-inspections/
- NYC Data Science Analysis:
    - The idea of scraping data from yelp website was adopted from here
    - Source: https://github.com/bzsaindon/nycdsablog/blob/master/restaurantreviewsfoodborneillness/foodborne_illness_yelp.ipynb


## Project Concerns

#### Questions:
- I am still unsure how to best deal with columns that have a lot of zeros. For example, the inspection score dataset has a lot of zeros for the violation categories, because many restaurants have a small number of violations recorded. Will this affect my analysis? 
- I don't know if I should analyze zipcode as a number or as a category, and if I analyze zipcode as a category, how do I go about doing that? 
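
On the zip code question, one common option is to treat the code as a label and one-hot encode it. The sketch below is only an illustration of that idea: the helper `encode_zip`, the `min_count` threshold, and the sample values are assumptions. Pooling rare zip codes into a single bucket is an extra step to keep the number of columns manageable on a small dataset.

```python
import pandas as pd

# Hypothetical sample of merged restaurants.
restaurants = pd.DataFrame({
    "postal_code": [10003, 10003, 11201, 10003, 11201, 10451],
    "rating_value": [4.0, 3.5, 4.5, 4.0, 3.0, 3.5],
})


def encode_zip(df, column, min_count=2):
    """One-hot encode a zip code column, pooling rare codes into 'OTHER'."""
    # Cast to string so a model cannot treat 11201 as "larger than" 10003.
    zips = df[column].astype(str).str.zfill(5)
    counts = zips.value_counts()
    rare = zips.map(counts) < min_count
    pooled = zips.mask(rare, "OTHER")
    return pd.get_dummies(pooled, prefix="zip", dtype=int)


zip_dummies = encode_zip(restaurants, "postal_code")
print(zip_dummies)
```

Each resulting column is a 0/1 indicator, so a linear regression estimates a separate offset per zip code instead of a slope over the numeric value.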

#### Assumptions: 
- all data collected from Yelp is up to date as of 5/14/2017 and all inspection data was collected before 5/14/2017

#### Missing Data:
- Because I used web scraping to collect data on Yelp restaurants, I was unable to get other data that could have improved the model, such as:
   - Takes Reservations Y/N
   - Delivery Y/N
   - Take out Y/N
   - Noise Level
   - Outdoor Seating
   - Dogs Allowed 
- If I could get other barometers of health, like the number of food poisoning incidents or rodent sightings, the analysis might have been better, but I had no way of joining that information together
   

## Outcome


This analysis could benefit two groups: __restaurant owners__ and __customers__

__Restaurant Owners:__
- If we can rank violation category importance for total inspection scores, restaurant owners can focus on improving those aspects of their restaurant to avoid getting a low grade
- Because the categories are relatively straightforward, the model will most likely not be complicated
- The output should be a linear regression model

__Customers:__
- Depending on the outcome, the analysis will give customers transparency into the three-level grades, so they can make more educated decisions on whether to use sanitary inspection grades as a determinant of good food
- Because it will take in more features that are different in nature, the model may be a little more complicated, but it will still be a linear regression in this case as well

Both analyses will be a success if I can identify one or more features that are highly correlated with the outcome.
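
One simple way to check that success criterion is to rank the numeric features by their Pearson correlation with the outcome. The sketch below uses made-up values; the helper name `rank_by_correlation` is hypothetical, and what counts as "highly correlated" is left for the analysis to decide.

```python
import pandas as pd

# Hypothetical merged data; values are made up for illustration.
merged = pd.DataFrame({
    "rating_value": [4.5, 4.0, 3.5, 3.0, 4.0, 2.5],
    "review_count": [320, 150, 80, 40, 210, 15],
    "Score Average": [8.0, 11.5, 19.0, 24.0, 10.0, 30.0],
})


def rank_by_correlation(df, target):
    """Pearson correlation of each numeric feature with the target, strongest first."""
    numeric = df.select_dtypes("number")
    correlations = numeric.corr()[target].drop(target)
    # Order by absolute value but keep the sign so the direction stays visible.
    order = correlations.abs().sort_values(ascending=False).index
    return correlations.reindex(order)


ranked = rank_by_correlation(merged, "rating_value")
print(ranked)
```

Correlation only captures linear, pairwise relationships, so this is a first screen to complement the regression coefficients.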

## RESOURCES

Violation Categories: http://www1.nyc.gov/assets/doh/downloads/pdf/rii/self-inspection-worksheet.pdf
