# DS-SF-36 | Unit Project | 1 | Research Design | Starter Code

In this first unit project, you will create a framework to scope out data science projects.  This framework will provide you with a guide to develop a well-articulated problem statement and analysis plan that will be robust and reproducible.

## Part A.  Evaluate the following problem statement:

> Determine which factors most impact how many check-ins a restaurant gets on average. How do we predict how many visitors a restaurant is going to get?

> PROJECT 1 - VISUALIZATION: I will first pull the list of all restaurants in SF from DataSF (https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores-LIVES-Standard/pyih-qa8i), which includes their addresses and health scores. Not all latitude/longitude values are populated, so I will fill in the gaps using a geocoding library. I will then plot this dataset on Google Maps. I will also do some EDA, grouping the level of risk by location to see if there are any trends.

> PROJECT 2 - EDA AND ML: I will supplement the health scores dataset with Foursquare's check-in data. To pull the per-restaurant data from the venues API (https://developer.foursquare.com/docs/responses/venue), I will need to loop through each restaurant name and pull the Foursquare data at a snapshot in time. I will then join the two datasets on restaurant name to create one dataset that contains health score, check-in, and online profile (rating, photos, menu) data. I can then regress check-ins on these factors to find any relationships. I will fit the ML algorithms we have learned by then on the training set and evaluate them on the test set.

> PROJECT 3 - ML + SENTIMENT ANALYSIS: I will add a sentiment analysis of the reviews and the key phrases in the reviews (provided by the same API above) to get a sense of polarity in the reviews. I will then re-run the analyses from Project 2 with the polarity scores added and see if they make any difference. I will also try other ML algorithms that we have learned since Project 2 and summarize all my results.
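
The join on restaurant name described under Project 2 is sensitive to small spelling differences between the two sources. Below is a minimal sketch of one way to approach it, assuming a simple normalization (lower-casing and collapsing whitespace) is enough. The `venue_name` and `checkins` column names, the helper `normalize_name`, and all values are hypothetical illustrations, not the actual Foursquare schema or data.

```python
import pandas as pd

def normalize_name(name: str) -> str:
    """Lower-case a name and collapse whitespace so near-identical spellings match."""
    return " ".join(name.lower().split())

# Made-up stand-ins for the two sources; real column names may differ.
health = pd.DataFrame({
    "business_name": ["Tiramisu Kitchen", "Cafe  Example"],
    "inspection_score": [92.0, 88.0],
})
foursquare = pd.DataFrame({
    "venue_name": ["tiramisu kitchen", "Another Place"],
    "checkins": [1200, 300],
})

# Build the same normalized key on both sides.
health["join_key"] = health["business_name"].map(normalize_name)
foursquare["join_key"] = foursquare["venue_name"].map(normalize_name)

# A left join keeps every inspected restaurant; indicator=True adds a
# "_merge" column that flags rows with no Foursquare match.
merged = health.merge(foursquare, on="join_key", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
```

Inspecting `unmatched` before modeling shows how many restaurants were lost to name mismatches, which matters because the health data contains several rows per restaurant.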

> ### Question 1.  What is the outcome?

Answer: The outcome is a prediction of how many check-ins to expect in a week given the various inputs

> ### Question 2.  What are the predictors/covariates?

Answer: The number of photos, whether a menu is available, health scores, sentiment analysis of recent reviews, price, sentiment analysis of key phrases

> ### Question 3.  What timeframe is this data relevant for?

Answer: San Francisco restaurants that are on Foursquare in July 2017

> ### Question 4.  What is the hypothesis?

Answer: Foot traffic is correlated with the depth of the online profile as well as health and safety scores

## Part B.  Let's start exploring our Foursquare dataset and answer some simple questions:

In [45]:
import os
import pandas as pd
import gmaps
import yaml

pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 20)
pd.set_option('display.notebook_repr_html', True)


df = pd.read_csv(os.path.join('..', '..', 'dataset', 'Restaurant_Scores_-_LIVES_Standard.csv'))
df = df.set_index('business_id')

with open('google.yaml', 'r') as f:
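    # safe_load avoids executing arbitrary YAML tags; yaml.load without a Loader is rejected by recent PyYAML
    google_credentials = yaml.safe_load(f)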

google_api_key = google_credentials['api-key']

gmaps.configure(api_key = google_api_key)

df.head()

business_id | business_name | business_address | business_city | business_state | business_postal_code | business_latitude | business_longitude | business_location | business_phone_number | inspection_id | inspection_date | inspection_score | inspection_type | violation_id | violation_description | risk_category
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
10 | Tiramisu Kitchen | 033 Belden Pl | San Francisco | CA | 94104 | 37.791116 | -122.403816 | (37.791116, -122.403816) | | 10_20140114 | 01/14/2014 12:00:00 AM | 92.0 | Routine - Unscheduled | 10_20140114_103119 | Inadequate and inaccessible handwashing facili... | Moderate Risk
10 | Tiramisu Kitchen | 033 Belden Pl | San Francisco | CA | 94104 | 37.791116 | -122.403816 | (37.791116, -122.403816) | | 10_20140114 | 01/14/2014 12:00:00 AM | 92.0 | Routine - Unscheduled | 10_20140114_103145 | Improper storage of equipment utensils or linens | Low Risk
10 | Tiramisu Kitchen | 033 Belden Pl | San Francisco | CA | 94104 | 37.791116 | -122.403816 | (37.791116, -122.403816) | | 10_20140114 | 01/14/2014 12:00:00 AM | 92.0 | Routine - Unscheduled | 10_20140114_103154 | Unclean or degraded floors walls or ceilings | Low Risk
10 | Tiramisu Kitchen | 033 Belden Pl | San Francisco | CA | 94104 | 37.791116 | -122.403816 | (37.791116, -122.403816) | | 10_20140729 | 07/29/2014 12:00:00 AM | 94.0 | Routine - Unscheduled | 10_20140729_103144 | Unapproved or unmaintained equipment or utensils | Low Risk
10 | Tiramisu Kitchen | 033 Belden Pl | San Francisco | CA | 94104 | 37.791116 | -122.403816 | (37.791116, -122.403816) | | 10_20140729 | 07/29/2014 12:00:00 AM | 94.0 | Routine - Unscheduled | 10_20140729_103129 | Insufficient hot water or running water | Moderate Risk


In [49]:
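# Fill in the missing latitude/longitude values
from geopy.geocoders import Nominatim
# Recent geopy releases require an explicit user_agent; this value is a placeholder
geolocator = Nominatim(user_agent="ds-sf-36-unit-project")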

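# Concatenate street address and city into one geocodable string
df['comb_address'] = df['business_address'] + " " + df['business_city']


# Create new latitude/longitude columns, starting from the coordinates already
# in the dataset (previously hard-coded to a single point for every row)
df['latitude'] = df['business_latitude']
df['longitude'] = df['business_longitude']
location = df[["latitude", "longitude"]]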

# Work in progress: geocode the rows that have no coordinates
# null_data = df[df['business_latitude'].isnull()]
# null_data['location'] = null_data['comb_address'].apply(geolocator.geocode)
# null_data.head()

# Build a dictionary mapping each unique address to its (latitude, longitude)
# comb_address = null_data['comb_address'].unique()
# len(comb_address)
# d = dict(zip(comb_address, pd.Series(comb_address).apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))))
# d

# Single-address check
# location = geolocator.geocode("033 Belden Pl San Francisco")
# print((location.latitude, location.longitude))

# Populate latitude/longitude based on address
# df['location'] = df['location'].apply(lambda x: (x.latitude, x.longitude))
# location = geolocator.geocode(df['comb_address'])
# print(location.latitude[0])

# Drop business_latitude and business_longitude
# df.head()
location = location.head()
location

business_id | latitude | longitude
---|---|---
10 | 37.791116 | -122.403816
10 | 37.791116 | -122.403816
10 | 37.791116 | -122.403816
10 | 37.791116 | -122.403816
10 | 37.791116 | -122.403816
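
The gap-filling step that is still commented out above could be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's final method: `fill_missing_coords` and `fake_geocode` are hypothetical helpers, and the stub geocoder stands in for a real service so the example runs offline. Each distinct address is looked up only once, and addresses that cannot be resolved are left as missing.

```python
import pandas as pd

def fill_missing_coords(frame, geocode):
    """Return a copy of frame with missing latitude/longitude filled via geocode.

    geocode is any callable mapping an address string to a (lat, lon) tuple,
    or None when the address cannot be resolved.
    """
    out = frame.copy()
    missing = out["latitude"].isnull() | out["longitude"].isnull()
    # Geocode each distinct address once to limit calls to the service.
    addresses = out.loc[missing, "comb_address"].dropna().unique()
    lookup = {addr: geocode(addr) for addr in addresses}
    for addr, coords in lookup.items():
        if coords is None:
            continue  # leave unresolved addresses as NaN
        rows = missing & (out["comb_address"] == addr)
        out.loc[rows, "latitude"] = coords[0]
        out.loc[rows, "longitude"] = coords[1]
    return out

def fake_geocode(address):
    """Stub geocoder (assumption): knows a single address, returns None otherwise."""
    known = {"033 Belden Pl San Francisco": (37.791116, -122.403816)}
    return known.get(address)

sample = pd.DataFrame({
    "comb_address": [
        "033 Belden Pl San Francisco",
        "033 Belden Pl San Francisco",
        "Unknown St San Francisco",
    ],
    "latitude": [37.791116, None, None],
    "longitude": [-122.403816, None, None],
})
filled = fill_missing_coords(sample, fake_geocode)
```

With geopy, the `geocode` argument could be a small wrapper around `geolocator.geocode` that returns `(result.latitude, result.longitude)` or `None` when no result comes back. Public geocoding services are rate limited, so the real run would need to pace its requests.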


In [50]:
# Drop the address, city, state, postal code, latitude, longitude, location, and phone
# columns, which are now superseded by comb_address, latitude, and longitude
df.drop(['business_address', 'business_city', 'business_state', 'business_postal_code', 'business_latitude', 'business_longitude', 'business_location', 'business_phone_number'], axis=1, inplace=True)

df.head()

business_id | business_name | inspection_id | inspection_date | inspection_score | inspection_type | violation_id | violation_description | risk_category | comb_address | latitude | longitude
---|---|---|---|---|---|---|---|---|---|---|---
10 | Tiramisu Kitchen | 10_20140114 | 01/14/2014 12:00:00 AM | 92.0 | Routine - Unscheduled | 10_20140114_103119 | Inadequate and inaccessible handwashing facili... | Moderate Risk | 033 Belden Pl San Francisco | 37.791116 | -122.403816
10 | Tiramisu Kitchen | 10_20140114 | 01/14/2014 12:00:00 AM | 92.0 | Routine - Unscheduled | 10_20140114_103145 | Improper storage of equipment utensils or linens | Low Risk | 033 Belden Pl San Francisco | 37.791116 | -122.403816
10 | Tiramisu Kitchen | 10_20140114 | 01/14/2014 12:00:00 AM | 92.0 | Routine - Unscheduled | 10_20140114_103154 | Unclean or degraded floors walls or ceilings | Low Risk | 033 Belden Pl San Francisco | 37.791116 | -122.403816
10 | Tiramisu Kitchen | 10_20140729 | 07/29/2014 12:00:00 AM | 94.0 | Routine - Unscheduled | 10_20140729_103144 | Unapproved or unmaintained equipment or utensils | Low Risk | 033 Belden Pl San Francisco | 37.791116 | -122.403816
10 | Tiramisu Kitchen | 10_20140729 | 07/29/2014 12:00:00 AM | 94.0 | Routine - Unscheduled | 10_20140729_103129 | Insufficient hot water or running water | Moderate Risk | 033 Belden Pl San Francisco | 37.791116 | -122.403816


In [51]:
figure = gmaps.figure()
figure.add_layer(gmaps.symbol_layer(location, fill_color = 'yellow', stroke_color = '#ffcc00', scale = 2))
figure

AssertionError: 

In [7]:
# figure = gmaps.figure()
# figure.add_layer(gmaps.heatmap_layer(locations_df))
# figure

> ### Question 5.  Create a data dictionary.

Answer: TODO

(Use the template below)

Variable | Description | Type of Variable
---|---|---
Is Menu available | 0 = No, 1 = Yes | Categorical
Number of Photos | Integer | Continuous
Rating | Integer | Continuous
Reviews Polarity | Integer | Continuous
Price | Integer | Continuous


We would like to explore the association between X and Y. 

> ### Question 6.  What is the outcome?

Answer: Outcome is the number of check-ins per week

## Part C.  Create an exploratory analysis plan by answering the following questions:

Because the answers to these questions haven't been covered in class yet, this section is optional.  This is by design.  Guessing or looking around for these answers will help the material make sense once we cover it in class.  You will not be penalized for wrong answers, but we encourage you to give it a try!

> ### Question 11. What are the goals of the exploratory analysis?

Answer: 

* Summarize the main characteristics of the dataset (looking at summary statistics)
* Visualize the data set 
* Formulate additional hypotheses
* Formulate assumptions and edge cases


> ### Question 12.  What are the assumptions of the distribution of data?

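Answer: The working assumption is that the check-in data is approximately normally distributed, which needs to be verified because count data is often right-skewed. There are probably one or two major factors that determine the shape of the curve.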

> ### Question 13.  How will you determine the distribution of your data?

Answer: I will determine the distribution by first plotting a histogram of the total check-ins across San Francisco. I will also compute the mean, the median, and the standard deviation of the dataset.
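
A minimal sketch of that check, using synthetic data since the Foursquare check-ins have not been downloaded yet. The log-normal generator and its parameters are assumptions chosen only to produce a skewed example; they say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic, made-up check-in counts (assumption: right-skewed, log-normal).
rng = np.random.default_rng(0)
checkins = pd.Series(
    rng.lognormal(mean=5.0, sigma=1.0, size=1000).round(),
    name="checkins",
)

# Summary statistics named in the answer, plus skewness as a shape indicator.
summary = {
    "mean": checkins.mean(),
    "median": checkins.median(),
    "std": checkins.std(),
    "skew": checkins.skew(),
}

# Histogram counts per bin; checkins.hist(bins=20) would draw the same bins.
counts, bin_edges = np.histogram(checkins, bins=20)
```

A mean well above the median together with a positive skew would suggest a right-skewed distribution, in which case a log transform of the outcome may be worth considering before fitting a linear model.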

> ### Question 14.  How might outliers impact your analysis?

Answer: Outliers can distort summary statistics such as the mean and standard deviation, and can pull a fitted regression line toward themselves. For example, a restaurant may have a very high number of check-ins due to a one-off event.

> ### Question 15.  How will you test for outliers?

Answer: Outliers can be identified by computing the inner fences from the interquartile range (IQR): Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. Data points outside the fences are candidate outliers that we can inspect and probably omit.
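
The fence calculation can be sketched as below. The helper name `iqr_fences` and the sample values are made up for illustration, and `k=1.5` is the conventional multiplier for inner fences.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up weekly check-in counts with one extreme value.
checkins = pd.Series([12, 15, 18, 20, 22, 25, 30, 400])

lower, upper = iqr_fences(checkins)
outliers = checkins[(checkins < lower) | (checkins > upper)]
trimmed = checkins[(checkins >= lower) & (checkins <= upper)]
```

Here only the value 400 falls outside the fences. Keeping `outliers` as a separate series makes it possible to review the flagged restaurants before deciding whether to drop them.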

> ### Question 16.  What is collinearity?

Answer: Multicollinearity (also collinearity) is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy.
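
One common way to quantify this is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below is an illustration on synthetic data, not part of the original plan; `variance_inflation_factors` is a hypothetical helper, and `near_duplicate` is a made-up predictor constructed to be almost a copy of `photos`.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF per column of X: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept plus other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic predictors: the third is nearly a linear function of the first.
rng = np.random.default_rng(0)
photos = rng.normal(size=200)
rating = rng.normal(size=200)
near_duplicate = 0.95 * photos + rng.normal(scale=0.1, size=200)

vifs = variance_inflation_factors(np.column_stack([photos, rating, near_duplicate]))
```

A VIF close to 1 indicates little collinearity. Values above roughly 5 to 10 are often treated as a warning sign, though that cutoff is a rule of thumb rather than a strict test.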

> ### Question 17.  How will you test for covariance?

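Answer: Covariance between predictors can be examined by computing the covariance or correlation matrix (for example with pandas `DataFrame.cov()` and `DataFrame.corr()`). A t-test can then check whether a given correlation coefficient differs significantly from zero.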
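
A minimal sketch of inspecting pairwise covariance and correlation with pandas. All data is synthetic, the `photos_scaled` column is a made-up near-duplicate, and the 0.8 threshold is an assumption rather than a standard cutoff.

```python
import numpy as np
import pandas as pd

# Synthetic predictors; photos_scaled is built to track photos closely.
rng = np.random.default_rng(1)
n = 300
photos = rng.poisson(20, size=n)
predictors = pd.DataFrame({
    "photos": photos,
    "rating": rng.uniform(5, 10, size=n),
    "photos_scaled": photos * 2.0 + rng.normal(scale=1.0, size=n),
})

cov_matrix = predictors.cov()    # depends on the units of each column
corr_matrix = predictors.corr()  # unit-free, values in [-1, 1]

# List each pair of distinct columns whose absolute correlation exceeds the threshold.
threshold = 0.8
high_pairs = [
    (a, b, corr_matrix.loc[a, b])
    for i, a in enumerate(corr_matrix.columns)
    for b in corr_matrix.columns[i + 1:]
    if abs(corr_matrix.loc[a, b]) > threshold
]
```

Correlation is usually easier to compare across predictors than raw covariance because it does not depend on each column's scale.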

> ### Question 18.  What is your exploratory analysis plan?

> Using the above information, write an exploratory analysis plan that would allow you or a colleague to reproduce your analysis one year from now.

Answer: 
* Craft the query to download the data for all restaurants in SF from the Foursquare venues API
* Output the file as a CSV and load it into a Jupyter notebook
* Remove any unnecessary columns
* Visualize the number of check-ins as a histogram
* Identify outliers using the interquartile range and remove them
* Compute the correlation matrix of the predictors to check for collinearity
* Remove one column from each pair of highly correlated predictors
* Split the data into training and testing sets, then run a linear regression of the outcome variable against the predictors on the training set
* Measure performance by applying the fitted model to the testing set
* Run other types of regressions and compare the results
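
The regression steps of this plan can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: the predictor distributions, the coefficients used to generate the outcome, the 80/20 split, and the use of ordinary least squares via NumPy rather than a specific modeling library.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Made-up predictors standing in for columns from the data dictionary.
X = np.column_stack([
    rng.poisson(20, size=n),       # number of photos
    rng.integers(0, 2, size=n),    # menu available (0/1)
    rng.uniform(70, 100, size=n),  # inspection score
])
true_coef = np.array([3.0, 10.0, 0.5])  # arbitrary values for the simulation
y = 5.0 + X @ true_coef + rng.normal(scale=5.0, size=n)

# Shuffle the row indices, then hold out 20% of the rows for testing.
idx = rng.permutation(n)
split = int(0.8 * n)
train, test = idx[:split], idx[split:]

def add_intercept(M):
    """Prepend a column of ones so the fit includes an intercept term."""
    return np.column_stack([np.ones(len(M)), M])

# Fit ordinary least squares on the training rows only.
coef, *_ = np.linalg.lstsq(add_intercept(X[train]), y[train], rcond=None)

# Evaluate on the held-out rows with R^2.
pred = add_intercept(X[test]) @ coef
ss_res = ((y[test] - pred) ** 2).sum()
ss_tot = ((y[test] - y[test].mean()) ** 2).sum()
r2_test = 1.0 - ss_res / ss_tot
```

Fixing the random seed and recording the split fraction are what make the evaluation reproducible a year later, which is the stated goal of the plan.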