# APIs and SQL Joins Lab

## 1. Initial data inspection

To answer the question we will need to retrieve and merge data from multiple files.

Yelp provides data on food quality, which can be found at [this address](http://www.yelp.com/healthscores/feeds). We have already downloaded four files, which you can find in the [assets folder](../../assets/datasets/yelp/).

In the bonus part we will also use the Google Geocoding API and data on [Neighborhoods](https://www.google.com/fusiontables/DataSource?docid=1zNwsvTwj-dH0QxuuDrKFsyfNklajd7WwEyaZ2U9M#rows:id=1).

1. Open each of the files and inspect them visually
- What information do they contain?

In [1]:
!ls ../../DSI-CHI-1/lessons/week-06/2.2-sql-joins-lab/assets/datasets/yelp

businesses.csv	inspections.csv  legend.csv  violations.csv


## 2. Local database

The first step in our analysis is to import the data into a local SQLite database.

1. Connect to a local sqlite3 database and import the files to separate tables.

**Hint:** The files are probably not encoded in UTF-8, which could cause problems when importing the data into the database. You can read more about encodings in the PostgreSQL documentation: http://www.postgresql.org/docs/current/interactive/multibyte.html

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [6]:
import sqlite3
sqlite_db = '2.2-sql-joins-lab.sqlite'
conn = sqlite3.connect(sqlite_db)
c = conn.cursor()

In [7]:
businesses = pd.read_csv('../../DSI-CHI-1/lessons/week-06/2.2-sql-joins-lab/assets/datasets/yelp/businesses.csv')
inspections = pd.read_csv('../../DSI-CHI-1/lessons/week-06/2.2-sql-joins-lab/assets/datasets/yelp/inspections.csv')
legend = pd.read_csv('../../DSI-CHI-1/lessons/week-06/2.2-sql-joins-lab/assets/datasets/yelp/legend.csv')
violations = pd.read_csv('../../DSI-CHI-1/lessons/week-06/2.2-sql-joins-lab/assets/datasets/yelp/violations.csv')

In [13]:
businesses.head(1)

|   | business_id | name | address | city | state | postal_code | latitude | longitude | phone_number |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | TIRAMISU KITCHEN | 033 BELDEN PL | San Francisco | CA | 94104 | 37.791116 | -122.403816 | 14154217044 |


In [14]:
inspections.head(1)

|   | business_id | score | date | type |
|---|---|---|---|---|
| 0 | 10 | 94 | 20140729 | routine |


In [15]:
legend.head(1)

|   | Minimum_Score | Maximum_Score | Description |
|---|---|---|---|
| 0 | 0 | 70 | Poor |


In [16]:
violations.head(1)

|   | business_id | date | description |
|---|---|---|---|
| 0 | 10 | 20140729 | Insufficient hot water or running water [ dat... |


In [17]:
businesses.columns

Index([u'business_id', u'name', u'address', u'city', u'state', u'postal_code',
       u'latitude', u'longitude', u'phone_number'],
      dtype='object')

In [53]:
c.execute("DROP TABLE IF EXISTS businesses")
create_business = "CREATE TABLE businesses (business_id varchar PRIMARY KEY, name varchar, address varchar, city varchar, state varchar, postal_code varchar, latitude varchar, longitude varchar, phone_number varchar);"

c.execute(create_business)

<sqlite3.Cursor at 0x7ff12da2bea0>

In [20]:
inspections.columns

Index([u'business_id', u'score', u'date', u'type'], dtype='object')

In [21]:
create_inspections = "CREATE TABLE inspections (business_id varchar PRIMARY KEY, score varchar, date varchar, type varchar);"

c.execute(create_inspections)

<sqlite3.Cursor at 0x7ff12da2bea0>

In [22]:
legend.columns

Index([u'Minimum_Score', u'Maximum_Score', u'Description'], dtype='object')

In [23]:
create_legend = "CREATE TABLE legend (minimum_score varchar, maximum_score varchar, description varchar);"

c.execute(create_legend)

<sqlite3.Cursor at 0x7ff12da2bea0>

In [24]:
violations.columns

Index([u'business_id', u'date', u'description'], dtype='object')

In [25]:
create_violations = "CREATE TABLE violations (business_id varchar PRIMARY KEY, date varchar, description varchar);"

c.execute(create_violations)

<sqlite3.Cursor at 0x7ff12da2bea0>

In [54]:
conn.commit()

In [27]:
results = c.execute("SELECT * FROM sqlite_master WHERE type='table';")

In [28]:
results.fetchall()

[(u'table',
  u'businesses',
  u'businesses',
  2,
  u'CREATE TABLE businesses (business_id varchar PRIMARY KEY, name varchar, address varchar, city varchar, state varchar, postal_code varchar, latitude varchar, longitude varchar, phone_number varchar)'),
 (u'table',
  u'inspections',
  u'inspections',
  4,
  u'CREATE TABLE inspections (business_id varchar PRIMARY KEY, score varchar, date varchar, type varchar)'),
 (u'table',
  u'legend',
  u'legend',
  6,
  u'CREATE TABLE legend (minimum_score varchar, maximum_score varchar, description varchar)'),
 (u'table',
  u'violations',
  u'violations',
  7,
  u'CREATE TABLE violations (business_id varchar PRIMARY KEY, date varchar, description varchar)')]

In [30]:
results = c.execute("SELECT * FROM inspections;")
results.fetchall()

[]

In [31]:
# as_matrix() was removed in pandas 1.0; to_numpy() returns the same array
legend.to_numpy()

array([[0, 70, 'Poor'],
       [71, 85, 'Needs Improvement'],
       [86, 90, 'Adequate'],
       [91, 100, 'Good']], dtype=object)

In [32]:
list(legend.to_records())

[(0, 0, 70, 'Poor'),
 (1, 71, 85, 'Needs Improvement'),
 (2, 86, 90, 'Adequate'),
 (3, 91, 100, 'Good')]

In [33]:
legend.values

array([[0, 70, 'Poor'],
       [71, 85, 'Needs Improvement'],
       [86, 90, 'Adequate'],
       [91, 100, 'Good']], dtype=object)

In [34]:
c.executemany("INSERT INTO legend VALUES (?, ?, ?)", legend.values)

<sqlite3.Cursor at 0x7ff12da2bea0>

In [35]:
conn.commit()

In [36]:
results = c.execute("SELECT * FROM legend;")
results.fetchall()

[(u'0', u'70', u'Poor'),
 (u'71', u'85', u'Needs Improvement'),
 (u'86', u'90', u'Adequate'),
 (u'91', u'100', u'Good')]
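
The scores above come back as strings because the columns were declared `varchar`, which gives them TEXT affinity in SQLite. A minimal sketch of declaring the score bounds as `INTEGER` so they round-trip as numbers follows. It assumes an in-memory database, and `legend_typed` is a hypothetical table name used only for this illustration:

```python
import sqlite3

# In-memory database so the sketch does not touch the lab's .sqlite file.
conn_demo = sqlite3.connect(":memory:")
cur = conn_demo.cursor()

# legend_typed is a hypothetical table name used only for this illustration.
cur.execute(
    "CREATE TABLE legend_typed ("
    "minimum_score INTEGER, maximum_score INTEGER, description TEXT)"
)

# Same four rows as the legend file shown above.
legend_rows = [
    (0, 70, "Poor"),
    (71, 85, "Needs Improvement"),
    (86, 90, "Adequate"),
    (91, 100, "Good"),
]
cur.executemany("INSERT INTO legend_typed VALUES (?, ?, ?)", legend_rows)
conn_demo.commit()

# The score bounds now come back as Python ints rather than strings.
typed_rows = cur.execute("SELECT * FROM legend_typed").fetchall()
print(typed_rows[0])
```

The same choice would apply to the `score` column of `inspections` if numeric comparisons or sorting on it are needed later.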

In [38]:
c.execute("DROP TABLE IF EXISTS violations")

create_violations = "CREATE TABLE violations (business_id varchar, date varchar, description varchar);"
c.execute(create_violations)

c.executemany("INSERT INTO violations VALUES (?, ?, ?)", violations.values)

conn.commit()

In [41]:
results = c.execute("SELECT * FROM violations;")
results.fetchall()

[(u'10',
  u'20140729',
  u'Insufficient hot water or running water  [ date violation corrected: 8/7/2014 ]'),
 (u'10',
  u'20140729',
  u'Unapproved or unmaintained equipment or utensils  [ date violation corrected: 8/7/2014 ]'),
 (u'10',
  u'20140114',
  u'Inadequate and inaccessible handwashing facilities  [ date violation corrected: 1/24/2014 ]'),
 (u'10',
  u'20140114',
  u'Unclean or degraded floors walls or ceilings  [ date violation corrected: 1/24/2014 ]'),
 (u'10',
  u'20140114',
  u'Improper storage of equipment utensils or linens  [ date violation corrected: 1/24/2014 ]'),
 (u'19', u'20141110', u'Improper storage of equipment utensils or linens'),
 (u'19',
  u'20141110',
  u'Inadequate food safety knowledge or lack of certified food safety manager'),
 (u'19',
  u'20140214',
  u'Inadequately cleaned or sanitized food contact surfaces  [ date violation corrected: 2/14/2014 ]'),
 (u'19',
  u'20140214',
  u'Permit license or inspection report not posted  [ date violation correc

In [48]:
c.execute("DROP TABLE IF EXISTS inspections")

create_inspections = "CREATE TABLE inspections (business_id varchar, score varchar, date varchar, type varchar);"
c.execute(create_inspections)

c.executemany("INSERT INTO inspections VALUES (?, ?, ?, ?)", inspections.values)

conn.commit()

In [55]:
c.executemany("INSERT INTO businesses VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", businesses.values)

ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

In [56]:
conn.commit()

In [57]:
c.execute("SELECT COUNT(*) FROM businesses;").fetchall()

[(3177,)]
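
The `ProgrammingError` above is raised when a byte string containing non-ASCII bytes reaches `sqlite3` under Python 2. One way around it is to decode the bytes into Unicode text before inserting. The sketch below assumes the data is Latin-1 encoded, which this lab does not confirm for `businesses.csv`. The sample bytes and the `businesses_demo` table are made up for illustration:

```python
import csv
import io
import sqlite3

# Hypothetical sample: one CSV row containing a non-ASCII byte (0xE9 is 'é' in Latin-1).
raw_bytes = b"business_id,name\n1,CAF\xe9 DEMO\n"

# Decode explicitly; 'latin-1' is an assumption, swap in the file's real encoding.
text = raw_bytes.decode("latin-1")

# Parse the decoded text and keep the data rows as tuples of Unicode strings.
reader = csv.reader(io.StringIO(text))
header = next(reader)
rows = [tuple(row) for row in reader]

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE businesses_demo (business_id TEXT, name TEXT)")
conn_demo.executemany("INSERT INTO businesses_demo VALUES (?, ?)", rows)
conn_demo.commit()

stored = conn_demo.execute("SELECT name FROM businesses_demo").fetchall()
print(stored)
```

With pandas, the same idea can be expressed by passing an `encoding` argument to `pd.read_csv`.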

### 2.a Display the first few lines of each table

In [58]:
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [59]:
%sql sqlite:///2.2-sql-joins-lab.sqlite

u'Connected: None@2.2-sql-joins-lab.sqlite'

In [60]:
%%sql
SELECT * FROM violations LIMIT 10;

Done.


| business_id | date | description |
|---|---|---|
| 10 | 20140729 | Insufficient hot water or running water [ date violation corrected: 8/7/2014 ] |
| 10 | 20140729 | Unapproved or unmaintained equipment or utensils [ date violation corrected: 8/7/2014 ] |
| 10 | 20140114 | Inadequate and inaccessible handwashing facilities [ date violation corrected: 1/24/2014 ] |
| 10 | 20140114 | Unclean or degraded floors walls or ceilings [ date violation corrected: 1/24/2014 ] |
| 10 | 20140114 | Improper storage of equipment utensils or linens [ date violation corrected: 1/24/2014 ] |
| 19 | 20141110 | Improper storage of equipment utensils or linens |
| 19 | 20141110 | Inadequate food safety knowledge or lack of certified food safety manager |
| 19 | 20140214 | Inadequately cleaned or sanitized food contact surfaces [ date violation corrected: 2/14/2014 ] |
| 19 | 20140214 | Permit license or inspection report not posted [ date violation corrected: 11/10/2014 ] |
| 19 | 20130904 | Foods not protected from contamination [ date violation corrected: 9/4/2013 ] |


### 2.b Investigate violations

Let's focus on the violations table initially.


Answer these questions using SQL:
1. How many violations are there?
- How many businesses committing violations?
- What's the average number of violations per business?
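
One possible shape for these three queries is sketched below on a handful of made-up rows in an in-memory database; the real answers come from running the same SQL against the lab's `violations` table. Note that the average here is taken over businesses that have at least one violation:

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE violations (business_id TEXT, date TEXT, description TEXT)"
)
# Made-up rows: business 10 has three violations, business 19 has one.
conn_demo.executemany(
    "INSERT INTO violations VALUES (?, ?, ?)",
    [
        ("10", "20140729", "demo violation A"),
        ("10", "20140729", "demo violation B"),
        ("10", "20140114", "demo violation C"),
        ("19", "20141110", "demo violation D"),
    ],
)

# Total rows, distinct businesses, and their ratio in a single query.
# CAST(... AS REAL) avoids integer division.
total_violations, distinct_businesses, avg_per_business = conn_demo.execute(
    """
    SELECT COUNT(*),
           COUNT(DISTINCT business_id),
           CAST(COUNT(*) AS REAL) / COUNT(DISTINCT business_id)
    FROM violations
    """
).fetchone()

print(total_violations, distinct_businesses, avg_per_business)
```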

Answer these questions using Python:
1. Draw a plot of the violations count
- Is the average number of violations meaningful?
- Draw a plot of the normalized cumulative violation counts. Can we discard the restaurants with few violations?
- Where would you draw a threshold if you were to keep 90% of the violations?
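
The normalized cumulative count and the 90% threshold can be sketched with the standard library alone. The business ids below are made up, and the variable names are illustrative rather than part of the lab:

```python
from collections import Counter
from itertools import accumulate

# Made-up business_id values, one entry per violation row.
violation_business_ids = ["a"] * 6 + ["b"] * 3 + ["c"] * 1

# Violations per business, sorted from most to fewest.
counts = sorted(Counter(violation_business_ids).values(), reverse=True)

# Running share of all violations covered by the top-k businesses.
total = sum(counts)
cumulative_share = [running / total for running in accumulate(counts)]

# Number of top businesses needed to cover at least 90% of all violations.
n_needed = next(i + 1 for i, share in enumerate(cumulative_share) if share >= 0.9)

# Smallest per-business violation count among those businesses.
threshold_count = counts[n_needed - 1]

print(cumulative_share, n_needed, threshold_count)
```

Passing `cumulative_share` to `plt.plot` gives the normalized cumulative curve asked for above.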

### 2.c Investigate Inspections

In the previous step we looked at violation counts. However, we also have an inspection score available in the inspections table. Let's have a look at that too.

Answer these questions using SQL:
1. What's the average score for the whole city?
1. What's the average score per business?
- Does the score correlate with the number of inspections?
- Create a dataframe from a table with the following columns:
    business_id, average_score, number_of_inspections, number_of_violations
- Use pandas to do a scatter matrix plot of average_score, number_of_inspections, number_of_violations to check for correlations
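
A sketch of one way to build the per-business summary table, using made-up rows in an in-memory database. Each table is aggregated per business first and the two summaries are joined afterwards, because joining the raw tables directly would repeat every inspection once per violation and distort the counts:

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript(
    """
    CREATE TABLE inspections (business_id TEXT, score INTEGER, date TEXT, type TEXT);
    CREATE TABLE violations (business_id TEXT, date TEXT, description TEXT);
    INSERT INTO inspections VALUES ('10', 94, '20140729', 'routine');
    INSERT INTO inspections VALUES ('10', 90, '20140114', 'routine');
    INSERT INTO inspections VALUES ('19', 80, '20141110', 'routine');
    INSERT INTO violations VALUES ('10', '20140729', 'demo violation A');
    INSERT INTO violations VALUES ('10', '20140114', 'demo violation B');
    INSERT INTO violations VALUES ('10', '20140114', 'demo violation C');
    """
)

# LEFT JOIN keeps businesses with inspections but no violations;
# COALESCE turns their missing count into 0.
summary_sql = """
SELECT i.business_id,
       i.average_score,
       i.number_of_inspections,
       COALESCE(v.number_of_violations, 0) AS number_of_violations
FROM (SELECT business_id,
             AVG(score) AS average_score,
             COUNT(*) AS number_of_inspections
      FROM inspections
      GROUP BY business_id) AS i
LEFT JOIN (SELECT business_id, COUNT(*) AS number_of_violations
           FROM violations
           GROUP BY business_id) AS v
       ON v.business_id = i.business_id
ORDER BY i.business_id
"""
summary_rows = conn_demo.execute(summary_sql).fetchall()
print(summary_rows)
```

Against the lab database, `pd.read_sql(summary_sql, conn)` would load the same result into a dataframe, which can then be passed to `pd.plotting.scatter_matrix`.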

## 3. Zipcode analysis

The town administration would like to know which zip codes they should focus their inspections on.

Use the information contained in the `businesses` table as well as the previous tables to answer the following questions using SQL:

1. Count the number of businesses per zipcode and sort them in descending order
- Which are the top 5 zipcodes with the worst average score?
    - restrict your analysis to the zipcodes with at least 50 businesses
    - do a simple average of the inspection scores in the postal code
- Which are the top 5 zipcodes with the highest number of violations per restaurant?
    - restrict your analysis to the zipcodes with at least 50 businesses
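
The "worst average score" question combines a join, a `GROUP BY`, and a `HAVING` filter. Here is a sketch on made-up data, with the minimum number of businesses lowered from 50 to 2 so that the toy tables produce a result. `MIN_BUSINESSES` is an illustrative name:

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript(
    """
    CREATE TABLE businesses (business_id TEXT, postal_code TEXT);
    CREATE TABLE inspections (business_id TEXT, score INTEGER);
    INSERT INTO businesses VALUES ('1', '94104'), ('2', '94104'), ('3', '94110');
    INSERT INTO inspections VALUES ('1', 90), ('2', 70), ('3', 60);
    """
)

MIN_BUSINESSES = 2  # the lab asks for 50; lowered here for the toy data

# The inner join means only businesses with at least one inspection are counted.
worst_sql = """
SELECT b.postal_code,
       COUNT(DISTINCT b.business_id) AS number_of_businesses,
       AVG(i.score) AS average_score
FROM businesses AS b
JOIN inspections AS i ON i.business_id = b.business_id
GROUP BY b.postal_code
HAVING COUNT(DISTINCT b.business_id) >= ?
ORDER BY average_score ASC
LIMIT 5
"""
worst_zipcodes = conn_demo.execute(worst_sql, (MIN_BUSINESSES,)).fetchall()
print(worst_zipcodes)
```

Zipcode 94110 has the lowest score in the toy data but is filtered out because it has only one business, which is the effect the 50-business restriction is meant to have. The violations question follows the same pattern with `violations` in place of `inspections`.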


## Final recommendation
Give a final recommendation on which 2 zipcodes the administration should focus on, and choose an appropriate plot to convince them visually.

## Bonus: Neighborhood data

Instead of looking at zipcodes we may be interested in using Neighborhood names.

It's beyond the scope of this lab to do a proper introduction to Geocoding and Reverse Geocoding, but we will give some pointers for further exploration.

### 1. Google Geocoding API
Have a look at:
- https://developers.google.com/maps/documentation/geocoding/intro
- https://maps.googleapis.com/maps/api/geocode/json?address=
- https://maps.googleapis.com/maps/api/geocode/json?latlng=

Through this API you can retrieve an address or a neighborhood from a lat-lon pair (reverse geocoding), or retrieve the lat-lon pair and other information from an address (geocoding).

1. Try experimenting with and retrieving a few addresses
- Note that Google imposes limits on the number of free queries
- How many missing lat-lon pairs do we have?
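
As a starting point for experimenting, the two request URLs listed above can be built with the standard library. This is a sketch that only constructs the URLs and sends nothing over the network. The helper names are hypothetical, and the optional `api_key` argument is sent as the `key` query parameter, which should be checked against the documentation linked above:

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def geocode_url(address, api_key=None):
    """Build a geocoding request URL for a street address."""
    params = {"address": address}
    if api_key is not None:
        params["key"] = api_key
    return GEOCODE_ENDPOINT + "?" + urlencode(params)


def reverse_geocode_url(lat, lng, api_key=None):
    """Build a reverse-geocoding request URL for a lat-lon pair."""
    params = {"latlng": "{},{}".format(lat, lng)}
    if api_key is not None:
        params["key"] = api_key
    return GEOCODE_ENDPOINT + "?" + urlencode(params)


# Address and coordinates taken from the first row of the businesses table.
forward_url = geocode_url("033 BELDEN PL, San Francisco, CA")
reverse_url = reverse_geocode_url(37.791116, -122.403816)
print(forward_url)
print(reverse_url)
```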

### Bonus 2
The pycurl library seems to be faster than requests at getting information from the Google API.

1. See if you can extract the neighborhood from an address using the geocode api and a bit of json parsing
- Note that you would surely hit the daily limit if you pulled each address' neighborhood from the api
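
The JSON parsing step might look like the sketch below. The sample response is hand-written and heavily abbreviated, with a made-up neighborhood name; it only mirrors the `results` / `address_components` / `types` layout of a geocoding response. `extract_neighborhood` is a hypothetical helper:

```python
import json

# Hand-written, abbreviated stand-in for an API response (not real data).
sample_response = """
{
  "status": "OK",
  "results": [
    {
      "address_components": [
        {"long_name": "Demo Neighborhood", "short_name": "Demo",
         "types": ["neighborhood", "political"]},
        {"long_name": "San Francisco", "short_name": "SF",
         "types": ["locality", "political"]}
      ]
    }
  ]
}
"""


def extract_neighborhood(response_text):
    """Return the first component tagged 'neighborhood', or None if absent."""
    payload = json.loads(response_text)
    if payload.get("status") != "OK":
        return None
    for result in payload.get("results", []):
        for component in result.get("address_components", []):
            if "neighborhood" in component.get("types", []):
                return component.get("long_name")
    return None


neighborhood = extract_neighborhood(sample_response)
print(neighborhood)
```

Caching results per address in a local table is one way to stay under the query limits when rerunning the notebook.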

### Bonus 3
We can find the neighborhood using the polygons associated with each of them.
[Here](https://www.google.com/fusiontables/DataSource?docid=1zNwsvTwj-dH0QxuuDrKFsyfNklajd7WwEyaZ2U9M#rows:id=1) you can find these polygons (and we also copied them [locally](../../assets/datasets/sfneighborhoods.csv)).

[This article](http://streamhacker.com/2010/03/23/python-point-in-polygon-shapely/) describes how to use the shapely package to check if a point belongs to a polygon.

- See if you can build a function that retrieves the neighborhood for a given address using the polygon data
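
To illustrate the idea without depending on shapely, the sketch below uses a plain-Python ray-casting test as a stand-in for shapely's containment check. The polygons are made-up unit squares rather than real neighborhood boundaries, and the function takes coordinates, so a real address would first need to be geocoded:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) falls inside the polygon.

    polygon is a list of (x, y) vertices in order; points lying exactly
    on an edge are not handled specially.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_neighborhood(lon, lat, neighborhoods):
    """Return the name of the first polygon containing the point, else None."""
    for name, polygon in neighborhoods.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Made-up polygons standing in for the real neighborhood boundaries.
demo_neighborhoods = {
    "Demo West": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "Demo East": [(1, 0), (2, 0), (2, 1), (1, 1)],
}

print(find_neighborhood(0.5, 0.5, demo_neighborhoods))
print(find_neighborhood(1.5, 0.5, demo_neighborhoods))
print(find_neighborhood(5.0, 5.0, demo_neighborhoods))
```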