# DS-SF-36 | Unit Project | 1 | Research Design | Starter Code

In this first unit project, you will create a framework to scope out data science projects.  This framework will provide you with a guide to develop a well-articulated problem statement and analysis plan that will be robust and reproducible.

## Part A.  Evaluate the following problem statement:

> Determine which factors most affect the average number of check-ins a restaurant gets. How do we predict how many visitors a restaurant is going to get?

> PROJECT 1 - VISUALIZATION: I will first get the list of all restaurants in SF from DataSF (https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores-LIVES-Standard/pyih-qa8i), which lists every restaurant with its address and health scores. Not all of the latitude/longitude values are populated, so I will fill in the gaps using a geocoding library. I will then plot this dataset on Google Maps. I will also do some EDA, grouping risk level by location to see if there are any trends.

> PROJECT 2 - EDA AND ML: I will supplement the health scores dataset with Foursquare's check-in data. To pull the per-restaurant data from the venues API (https://developer.foursquare.com/docs/responses/venue), I will need to loop through each restaurant name and pull the Foursquare data at a snapshot in time. I will then need to join the two datasets on restaurant name to create one dataset that combines health score, check-in, and online profile (rating, photos, menu) data. I can then regress check-ins on these factors to find any relationships. I will run the various ML algorithms that we have learned by then on both the training and the test datasets.

> PROJECT 3 - ML + SENTIMENT ANALYSIS: I will add a sentiment analysis of the reviews and of the key phrases in the reviews (provided by the same API above) to get a sense of polarity in the reviews. I will then re-run the analyses from Project 2 with the polarity scores added and see if they make any difference. I will also try other ML algorithms that we have learned since Project 2 and summarize all my results.

> ### Question 1.  What is the outcome?

Answer: The outcome is the number of check-ins a restaurant can expect in a week, predicted from the various inputs.

> ### Question 2.  What are the predictors/covariates?

Answer: The number of photos, whether a menu is available, health scores, sentiment analysis of recent reviews, price, sentiment analysis of key phrases

> ### Question 3.  What timeframe is this data relevant for?

Answer: San Francisco restaurants that are on Foursquare in July 2017

> ### Question 4.  What is the hypothesis?

Answer: Foot traffic is correlated with the depth of the online profile as well as health and safety scores

## Part B.  Let's start exploring our Foursquare dataset and answer some simple questions:

In [8]:
count = 0

In [18]:
import os
import pandas as pd
import gmaps
import googlemaps as g
import yaml
import tenacity
from tenacity import retry

pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 20)
pd.set_option('display.notebook_repr_html', True)

# Loading dataset
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'Restaurant_Scores_-_LIVES_Standard.csv'))
df = df.set_index('business_id')

# Applying google credentials
with open('google.yaml', 'r') as f:
    # safe_load parses plain YAML without constructing arbitrary Python objects
    google_credentials = yaml.safe_load(f)

google_api_key = google_credentials['api-key']
gmaps.configure(api_key = google_api_key)

googlemaps_api_key = google_credentials['api-key-gmaps']
googmaps = g.Client(key=googlemaps_api_key)

# df.head()

In [19]:
# Filling in the lat long gaps
from geopy.geocoders import Nominatim
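# Newer geopy releases require an explicit user_agent; the name below is a placeholder
geolocator = Nominatim(user_agent='ds-sf-36-unit-project')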

# Concatenate street address and city into one string to geocode
df['comb_address'] = df['business_address'] + " " + df['business_city']
# Create a new column for lat and long and location tuple

df.head()

business_id,business_name,business_address,business_city,business_state,business_postal_code,business_latitude,business_longitude,business_location,business_phone_number,inspection_id,inspection_date,inspection_score,inspection_type,violation_id,violation_description,risk_category,comb_address
10,Tiramisu Kitchen,033 Belden Pl,San Francisco,CA,94104,37.791116,-122.403816,"(37.791116, -122.403816)",,10_20140114,01/14/2014 12:00:00 AM,92.0,Routine - Unscheduled,10_20140114_103119,Inadequate and inaccessible handwashing facili...,Moderate Risk,033 Belden Pl San Francisco
10,Tiramisu Kitchen,033 Belden Pl,San Francisco,CA,94104,37.791116,-122.403816,"(37.791116, -122.403816)",,10_20140114,01/14/2014 12:00:00 AM,92.0,Routine - Unscheduled,10_20140114_103145,Improper storage of equipment utensils or linens,Low Risk,033 Belden Pl San Francisco
10,Tiramisu Kitchen,033 Belden Pl,San Francisco,CA,94104,37.791116,-122.403816,"(37.791116, -122.403816)",,10_20140114,01/14/2014 12:00:00 AM,92.0,Routine - Unscheduled,10_20140114_103154,Unclean or degraded floors walls or ceilings,Low Risk,033 Belden Pl San Francisco
10,Tiramisu Kitchen,033 Belden Pl,San Francisco,CA,94104,37.791116,-122.403816,"(37.791116, -122.403816)",,10_20140729,07/29/2014 12:00:00 AM,94.0,Routine - Unscheduled,10_20140729_103144,Unapproved or unmaintained equipment or utensils,Low Risk,033 Belden Pl San Francisco
10,Tiramisu Kitchen,033 Belden Pl,San Francisco,CA,94104,37.791116,-122.403816,"(37.791116, -122.403816)",,10_20140729,07/29/2014 12:00:00 AM,94.0,Routine - Unscheduled,10_20140729_103129,Insufficient hot water or running water,Moderate Risk,033 Belden Pl San Francisco
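
One thing to watch with the `+` concatenation used for `comb_address`: if either part is missing, pandas returns a missing value for the whole string. The sketch below is only an illustration on a toy frame. The second row's missing city is a made-up case, not something observed in the dataset, and `robust` is a hypothetical name.

```python
import pandas as pd

# Toy frame: the second row has no city (hypothetical case for illustration)
toy = pd.DataFrame({
    'business_address': ['033 Belden Pl', '909 Grant Ave'],
    'business_city': ['San Francisco', None],
})

# Plain concatenation: a missing part makes the whole result missing
plain = toy['business_address'] + ' ' + toy['business_city']

# Alternative: blank out missing parts, then join only the non-empty ones
parts = toy[['business_address', 'business_city']].fillna('')
robust = parts.apply(lambda row: ' '.join(p for p in row if p), axis=1)

print(plain.isnull().tolist())
print(robust.tolist())
```

Whether a partial address is worth sending to the geocoder is a separate judgment call; the point is only that missing parts are handled explicitly.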


In [20]:
# Creating a location column that copies business_location (NaN where no coordinates exist)

df['location'] = df['business_location']
df.head()

business_id,business_name,business_address,business_city,business_state,business_postal_code,business_latitude,business_longitude,business_location,business_phone_number,inspection_id,inspection_date,inspection_score,inspection_type,violation_id,violation_description,risk_category,comb_address,location
10,Tiramisu Kitchen,033 Belden Pl,San Francisco,CA,94104,37.791116,-122.403816,"(37.791116, -122.403816)",,10_20140114,01/14/2014 12:00:00 AM,92.0,Routine - Unscheduled,10_20140114_103119,Inadequate and inaccessible handwashing facili...,Moderate Risk,033 Belden Pl San Francisco,"(37.791116, -122.403816)"
10,Tiramisu Kitchen,033 Belden Pl,San Francisco,CA,94104,37.791116,-122.403816,"(37.791116, -122.403816)",,10_20140114,01/14/2014 12:00:00 AM,92.0,Routine - Unscheduled,10_20140114_103145,Improper storage of equipment utensils or linens,Low Risk,033 Belden Pl San Francisco,"(37.791116, -122.403816)"
10,Tiramisu Kitchen,033 Belden Pl,San Francisco,CA,94104,37.791116,-122.403816,"(37.791116, -122.403816)",,10_20140114,01/14/2014 12:00:00 AM,92.0,Routine - Unscheduled,10_20140114_103154,Unclean or degraded floors walls or ceilings,Low Risk,033 Belden Pl San Francisco,"(37.791116, -122.403816)"
10,Tiramisu Kitchen,033 Belden Pl,San Francisco,CA,94104,37.791116,-122.403816,"(37.791116, -122.403816)",,10_20140729,07/29/2014 12:00:00 AM,94.0,Routine - Unscheduled,10_20140729_103144,Unapproved or unmaintained equipment or utensils,Low Risk,033 Belden Pl San Francisco,"(37.791116, -122.403816)"
10,Tiramisu Kitchen,033 Belden Pl,San Francisco,CA,94104,37.791116,-122.403816,"(37.791116, -122.403816)",,10_20140729,07/29/2014 12:00:00 AM,94.0,Routine - Unscheduled,10_20140729_103129,Insufficient hot water or running water,Moderate Risk,033 Belden Pl San Francisco,"(37.791116, -122.403816)"


In [21]:
# Rows with no location, i.e. the ones that still need geocoding
null_data = df[df['location'].isnull()]
null_data.head()

business_id,business_name,business_address,business_city,business_state,business_postal_code,business_latitude,business_longitude,business_location,business_phone_number,inspection_id,inspection_date,inspection_score,inspection_type,violation_id,violation_description,risk_category,comb_address,location
79782,Deli 23,2449 23rd St,San Francisco,CA,94110,,,,,79782_20160503,05/03/2016 12:00:00 AM,92.0,Routine - Unscheduled,79782_20160503_103120,Moderate risk food holding temperature,Moderate Risk,2449 23rd St San Francisco,
76437,Sweetheart Cafe,909 Grant Ave,San Francisco,CA,94108,,,,,76437_20160329,03/29/2016 12:00:00 AM,76.0,Routine - Unscheduled,76437_20160329_103113,Sewage or wastewater contamination,High Risk,909 Grant Ave San Francisco,
88090,Hwaro,4516 Mission St,San Francisco,CA,94112,,,,14155210000.0,88090_20160729,07/29/2016 12:00:00 AM,,New Construction,,,,4516 Mission St San Francisco,
81161,Limon Peruvian Rotisserie,1001 South Van Ness Ave,San Francisco,CA,94110,,,,14155550000.0,81161_20160325,03/25/2016 12:00:00 AM,92.0,Routine - Unscheduled,81161_20160325_103149,Wiping cloths not clean or properly stored or ...,Low Risk,1001 South Van Ness Ave San Francisco,
85781,Domino's #7764,876 Geary St,San Francisco,CA,94109,,,,,85781_20160311,03/11/2016 12:00:00 AM,86.0,Routine - Unscheduled,85781_20160311_103124,Inadequately cleaned or sanitized food contact...,Moderate Risk,876 Geary St San Francisco,


In [22]:
# Setting up logging 
import logging

logging.basicConfig()
logger = logging.getLogger('jupyter')
logger.setLevel(logging.INFO)

In [25]:
# Function to fetch the latlong with address as input 

@retry(stop=tenacity.stop_after_attempt(10))
def fetch_latlong(locator, address):
    """
    Fetch the lat long for an address with some retries
    """
    logger.debug('Requesting latlong for address %s', address)
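    try:
        location = locator.geocode(address)
        if location:
            return (location.latitude, location.longitude)
        else:
            # geocode returned no match for this address
            logger.warning('Could not fetch lat long for address %s', address)
            return (None, None)
    except Exception:
        # Log the failure, then re-raise so that tenacity can retry
        logger.exception('Failed on %s', address)
        raise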

In [33]:
# Collecting the unique addresses that still need coordinates
comb_address = null_data['comb_address'].unique()
# Limiting to the first 10 addresses while testing
comb_address = comb_address[:10]

lat_longs = pd.Series(comb_address).apply(lambda x: fetch_latlong(geolocator, x))



ERROR:jupyter:Failed on 2449 23rd St San Francisco
Traceback (most recent call last):
  File "<ipython-input-25-1dfb3bc99465>", line 10, in fetch_latlong
    l = locator.geocode(address)
  File "/Users/manulohiya/anaconda2/lib/python2.7/site-packages/geopy/geocoders/osm.py", line 193, in geocode
    self._call_geocoder(url, timeout=timeout), exactly_one
  File "/Users/manulohiya/anaconda2/lib/python2.7/site-packages/geopy/geocoders/base.py", line 171, in _call_geocoder
    raise GeocoderServiceError(message)
GeocoderServiceError: <urlopen error [Errno 65] No route to host>
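
Network errors like the `No route to host` failure above can be transient, so it may help to pause between attempts rather than retrying immediately. The following is a minimal standard-library sketch of retrying with exponential backoff. It does not use tenacity or geopy: `FlakyLocator`, `StubLocation`, and `geocode_with_backoff` are hypothetical names, and the stub returns placeholder coordinates instead of calling a real service.

```python
import time


class StubLocation(object):
    """Hypothetical stand-in for a geocoder result."""
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


class FlakyLocator(object):
    """Hypothetical geocoder stub: raises for the first `failures` calls, then answers."""
    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def geocode(self, address):
        self.calls += 1
        if self.calls <= self.failures:
            raise IOError('No route to host')
        return StubLocation(37.0, -122.0)  # placeholder coordinates


def geocode_with_backoff(locator, address, attempts=5, base_delay=0.01):
    """Try up to `attempts` times, doubling the pause after each failure."""
    for attempt in range(attempts):
        try:
            found = locator.geocode(address)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)
            continue
        if found is None:
            return (None, None)
        return (found.latitude, found.longitude)


locator = FlakyLocator(failures=2)
result = geocode_with_backoff(locator, '2449 23rd St San Francisco')
print(result, locator.calls)
```

With tenacity, a similar effect should be achievable by passing a `wait=` strategy to `@retry` alongside `stop=`; the delay values here are arbitrary.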


In [34]:
# Converting the address -> lat/long mapping to a dataframe
d = dict(zip(comb_address, lat_longs))
# list(...) keeps this working on Python 3, where d.items() is a view
latlongs = pd.DataFrame(list(d.items()), columns = ['address', 'location'])
latlongs = latlongs.set_index('address')
latlongs.head()
len(latlongs)

10
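
Once the missing coordinates have been fetched, they still need to be written back to the main dataframe. The sketch below shows one possible way to do that on a toy frame, filling only the rows whose coordinates are missing. The `fetched` dictionary and its coordinate values are placeholders rather than real geocoder output, and `lookup` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for df: one row with coordinates, one without
toy = pd.DataFrame({
    'comb_address': ['033 Belden Pl San Francisco', '2449 23rd St San Francisco'],
    'business_latitude': [37.791116, None],
    'business_longitude': [-122.403816, None],
})

# Placeholder results keyed by address (not real lookups)
fetched = {'2449 23rd St San Francisco': (1.5, -2.5)}

# One row per address, with separate latitude and longitude columns
lookup = pd.DataFrame.from_dict(fetched, orient='index',
                                columns=['latitude', 'longitude'])

# Fill only the gaps: existing coordinates are left untouched
toy['business_latitude'] = toy['business_latitude'].fillna(
    toy['comb_address'].map(lookup['latitude']))
toy['business_longitude'] = toy['business_longitude'].fillna(
    toy['comb_address'].map(lookup['longitude']))

print(toy)
```

Mapping by address means every inspection row for the same restaurant receives the same coordinates, which matches how the addresses were de-duplicated before geocoding.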

In [None]:
# Building a numbered filename for the csv output
count = count + 1
filename = 'null_latlongs'+str(count)+'.csv'
filename



In [None]:
latlongs.to_csv(filename)
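
Because `count` is initialized in an earlier cell, rerunning that cell restarts the numbering, and a later `to_csv` could overwrite an earlier file. A possible alternative, sketched here with the standard library only, is to pick the first numbered filename that does not exist yet. `next_available_filename` is a hypothetical helper, and the temporary directory is used purely for the demonstration.

```python
import os
import tempfile


def next_available_filename(directory, prefix='null_latlongs', extension='.csv'):
    """Return the first '<prefix><n><extension>' path in `directory` that does not exist."""
    n = 1
    while True:
        candidate = os.path.join(directory, '{}{}{}'.format(prefix, n, extension))
        if not os.path.exists(candidate):
            return candidate
        n += 1


# Demonstration in a throwaway directory
workdir = tempfile.mkdtemp()
first = next_available_filename(workdir)
open(first, 'w').close()  # simulate writing the first csv
second = next_available_filename(workdir)

print(os.path.basename(first), os.path.basename(second))
```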