# Week 14 Problem 1

If you are not using the `Assignments` tab on the course JupyterHub server to read this notebook, read [Activating the assignments tab](https://github.com/UI-DataScience/info490-fa16/blob/master/Week2/assignments/README.md).

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write anywhere else will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run all_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and CheckPoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

In [1]:
from nose.tools import assert_equal
import pandas as pd
import sqlite3 as sl
import numpy as np
import os

This assignment will give you experience interacting with SQL databases through Pandas. First we'll make a playground directory and define the location for our database. Nothing will be in the database for now, and if you'd like to start fresh there is a cleanup cell at the bottom of this notebook.

In [2]:
# make sandbox if it doesn't exist
!mkdir -p ~/w14_p1

# make absolutely sure there isn't a database from a previous student
try:
    os.remove("/home/data_scientist/w14_p1/p1.db")
except OSError as e:
    pass

# set the database location
db = '/home/data_scientist/w14_p1/p1.db'

# Problem 1

Pandas, like many other Python modules, communicates with databases via an API as described in the [PEP 249](https://www.python.org/dev/peps/pep-0249/) specification. In this framework, "Access to the database is made available through connection objects", which provide a uniform interface to many different flavors of databases. Thus, once you've created the connection object, it doesn't really matter whether the underlying database is SQLite, MySQL, PostgreSQL, etc. This is useful if you want to write reusable code that operates on many different kinds of databases, since all you have to do is change the connection object.
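
As a minimal sketch of the PEP 249 pattern described above, the cell below uses the standard-library `sqlite3` driver against an in-memory database. The table name `demo` and its rows are made up for illustration and are not part of the assignment data.

```python
import sqlite3

# Open a throwaway in-memory database. With another PEP 249 driver,
# this connect() call is the main line that would need to change
# (placeholder syntax such as "?" can also differ between drivers).
con = sqlite3.connect(":memory:")

# Cursors execute statements and fetch results
cur = con.cursor()
cur.execute("CREATE TABLE demo (code TEXT, state TEXT)")
cur.executemany("INSERT INTO demo VALUES (?, ?)",
                [("CMI", "IL"), ("DFW", "TX")])
con.commit()

rows = cur.execute("SELECT code FROM demo ORDER BY code").fetchall()
print(rows)

# Release the connection when finished
con.close()
```

The same `connect` / `cursor` / `execute` / `fetchall` / `close` sequence is what the test cells in this notebook rely on.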

This problem asks you to create the connection object to connect to our database. Use the `sqlite3` module to connect to a database at a given path. 

In [3]:
def create_connector(database):
    '''
    Creates a connection to a sqlite database
    
    Parameters
    ----------
    database: str, a filepath
    
    Returns
    -------
    a sqlite3.Connection object
    '''
    
    #YOUR CODE HERE
    # Return a connection object of given filepath
    return sl.connect(database)

In [4]:
# WARNING: these will test if the connector is created correctly
# but won't test if it's connected to the correct database
c = create_connector(db)
assert_equal(type(c), sl.Connection)
assert_equal(c.in_transaction, False)

In [5]:
# ALWAYS RUN THIS CELL, ESPECIALLY WHEN YOU GET AN ASSERT ERROR AND WANT
# TO TRY AGAIN
c.close()

**Important Note:** If you open a connection you should close it when you're finished with it as above. If you get errors and don't know why, try closing all open connections by either closing them individually or **restarting the Python kernel**. For example, you won't be able to run the cleanup cell at the end of this notebook and delete the database if you have any open connections. This may frustrate you if you would like to retry code that failed.
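
One way to make sure a connection is always closed, even when an assertion or query fails, is `contextlib.closing` from the standard library. The sketch below is only an illustration on an in-memory database with a made-up table `t`. Note that using a `sqlite3` connection directly in a `with` statement only manages the transaction and does not close the connection.

```python
import sqlite3
from contextlib import closing

# closing() calls con.close() when the block exits, even on error
with closing(sqlite3.connect(":memory:")) as con:
    con.execute("CREATE TABLE t (x INTEGER)")
    con.execute("INSERT INTO t VALUES (1)")
    total = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# The connection is now closed, so further use raises ProgrammingError
try:
    con.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(total, still_open)
```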

# Problem 2

Now use Pandas to convert a `csv` file at a given path to a SQL table in the new database. Read in the `csv` and write it to the specified table name for the given connection. If the table already exists within the database, replace it.
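
The read-then-write pattern can be sketched on a tiny in-memory example. The CSV text, the table name `demo_airports`, and the in-memory database below are stand-ins chosen for illustration, not the course files.

```python
import io
import sqlite3
import pandas as pd

# A two-row CSV held in memory instead of a file on disk
csv_text = io.StringIO(
    "iata,city,state\n"
    "CMI,Champaign/Urbana,IL\n"
    "DFW,Dallas-Fort Worth,TX\n"
)
con = sqlite3.connect(":memory:")

frame = pd.read_csv(csv_text)

# Writing twice shows that if_exists='replace' overwrites the table
# rather than raising an error or appending duplicate rows
frame.to_sql("demo_airports", con, if_exists="replace")
frame.to_sql("demo_airports", con, if_exists="replace")

n_rows = con.execute("SELECT COUNT(*) FROM demo_airports").fetchone()[0]
cur = con.execute("SELECT * FROM demo_airports")
col_names = [d[0] for d in cur.description]
con.close()

print(n_rows, col_names)
```

Because `to_sql` writes the DataFrame index by default, the table gets one extra leading column ahead of the CSV columns. Pass `index=False` if that column is not wanted.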

In [6]:
def csv_to_sql(csv_name, table_name, con):
    '''
    Converts a csv file to a SQL table in a given database
    
    Parameters
    ----------
    csv_name: str, a filepath
    table_name: str, a name for the new table
    con: a database connection object
    
    Returns
    -------
    None
    '''
    #YOUR CODE HERE
    # Converts the csv file to a SQL table in a given database
    data = pd.read_csv(csv_name)
    # Drop the table if it exists
    data.to_sql(table_name, con, if_exists='replace')

In [7]:
# get airports.csv into a sql database as the airports table
d = create_connector(db)
csv_to_sql('/home/data_scientist/data/airports.csv', 'airports', d)

# get a cursor object
c = d.cursor()
# check that the number of airports is correct
num_apts = c.execute("SELECT COUNT(*) FROM airports").fetchone()
assert_equal(num_apts[0], 3376)
# check that the first airport (ordered by iata code) is 00M
one_apt = c.execute("SELECT * FROM airports ORDER BY iata ASC").fetchone()
assert_equal(one_apt[1], "00M")

In [8]:
# ALWAYS RUN THIS CELL, ESPECIALLY WHEN YOU GET AN ASSERT ERROR AND WANT
# TO TRY AGAIN
d.close()

In [9]:
# the number of airports in the airports table
print(num_apts)

(3376,)


In [10]:
# the first airport as ordered by iata code
print(one_apt)

(0, '00M', 'Thigpen ', 'Bay Springs', 'MS', 'USA', 31.95376472, -89.23450472)


# Problem 3

When querying databases from Python it is often useful to be able to programmatically create SQL queries. Use string formatting to create a SQL query to do the following:

* Select all columns from `table`
* where `city_col` is equal to `city`
* and `state_col` is equal to `state`
* If either `city` or `state` is `None`, don't filter on that field

This function will likely require a bit of thinking about the conditional logic required to deal with either city or state being `None`, or both, or neither.
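
One possible way to organize the conditional logic, shown here only as a sketch, is to collect a clause for each filter that is not `None` and join the clauses with `AND`. The helper name `build_filter_query` and its `filters` argument are hypothetical and differ from the `create_query` signature required below.

```python
def build_filter_query(table, filters):
    """Hypothetical helper: build a SELECT with optional equality filters.

    filters is a list of (column, value) pairs; a value of None
    means that column is not filtered on.
    """
    clauses = ["{} = '{}'".format(col, val)
               for col, val in filters if val is not None]
    query = "SELECT * FROM {}".format(table)
    if clauses:
        # Only add WHERE when at least one filter is active
        query += " WHERE " + " AND ".join(clauses)
    return query

both = build_filter_query("airports", [("city", "Champaign"), ("state", "IL")])
state_only = build_filter_query("airports", [("city", None), ("state", "MO")])
neither = build_filter_query("airports", [("city", None), ("state", None)])
print(both)
print(state_only)
print(neither)
```

Formatting values directly into the SQL string is fine for this exercise, but it breaks on values containing a single quote and is unsafe for untrusted input.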

In [11]:
def create_query(table, city_col, state_col, city=None, state=None):
    '''
    Creates a SQL query to filter a table by city and state
    
    Parameters
    ----------
    table: str, a table name
    city_col: str, the name of the city field
    state_col: str, the name of the state field
    city: str or None, the name of the city to filter on
    state: str or None, the name of the state to filter on
    
    Returns
    -------
    a string representing a valid sql query that filters `table`
    by `city` and `state`
    '''

    #YOUR CODE HERE
    if city is None:
        if state is None:
            # if both are None
            query = "SELECT * FROM {}".format(table)
        else:
            # if only city is None
            query = "SELECT * FROM {} WHERE {} = '{}'".format(table, state_col, state)
    else:
        if state is None:
            # if only state is None
            query = "SELECT * FROM {} WHERE {} = '{}'".format(table, city_col, city)
        else:
            # if neither is None
            query = "SELECT * FROM {} WHERE {} = '{}' AND {} = '{}'".format(
                table, city_col, city, state_col, state)
    return query

In [12]:
# test when neither are None
q = create_query('airports', 'city', 'state', 'Champaign', 'IL')
q_lower = q.lower()
assert("city = 'Champaign'" in q)
assert("state = 'IL'" in q)
assert('select * from airports where' in q_lower)
assert('airports' in q)
# test when both are None
q2 = create_query('airports', 'city', 'state')
q2_lower = q2.lower()
assert_equal(q2_lower, 'select * from airports')
# test when state is None
q3=create_query('airports', 'city', 'state', 'Chicago/Waukegan', None)
q3_lower = q3.lower()
assert_equal(q3_lower, "select * from airports where city = 'chicago/waukegan'")
# test when city is None
q4=create_query('airports', 'city', 'state', None, 'MO')
q4_lower = q4.lower()
assert_equal(q4_lower, "select * from airports where state = 'mo'")

# Problem 4

Now write a function that uses the `create_query` function to actually execute the query on the airports table through Pandas.
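
As a rough illustration of running a query string through Pandas, `pd.read_sql` takes the SQL text and a connection and returns a DataFrame. The three-row `airports` table below is a made-up stand-in built in memory, not the real course table.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")

# Build a small stand-in table (illustrative rows only)
pd.DataFrame({
    "iata": ["ORD", "MDW", "STL"],
    "city": ["Chicago", "Chicago", "St Louis"],
    "state": ["IL", "IL", "MO"],
}).to_sql("airports", con, index=False)

# Any SELECT string can be passed straight to read_sql
result = pd.read_sql("SELECT * FROM airports WHERE state = 'IL'", con)
con.close()

print(result)
```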

In [13]:
def get_citystate_apts(city, state, con):
    
    '''
    Gets the airports in a certain city and state from the airports table
    
    Parameters
    ----------
    city: str or None, the name of the city to filter on
    state: str or None, the name of the state to filter on
    con: a database connection object
    
    Returns
    -------
    a dataframe that is the result of the query created by `create_query`
    '''
    
    #YOUR CODE HERE
    #  Gets the airports in a certain city and state from the airports table
    return pd.read_sql(create_query('airports', 'city', 'state', city, state), con)

In [14]:
con = create_connector(db)

# check when only specifying state
q_data = get_citystate_apts(None, 'IL', con)
assert_equal(type(q_data), pd.DataFrame)
assert_equal(len(q_data), 88)
assert_equal(len(q_data.state.unique()), 1)
assert_equal(q_data.state.unique(), "IL")
# only specifying city
q_data2 = get_citystate_apts('Columbia', None, con)
assert_equal(type(q_data2), pd.DataFrame)
assert_equal(len(q_data2), 5)
assert_equal(len(q_data2.state.unique()), 4)
# specifying neither
q_data3 = get_citystate_apts(None, None, con)
assert_equal(type(q_data3), pd.DataFrame)
assert_equal(len(q_data3), 3376)
assert_equal(len(q_data3.state.unique()), 57)
# specifying both
q_data4 = get_citystate_apts("Chicago", "IL", con)
assert_equal(type(q_data4), pd.DataFrame)
assert_equal(len(q_data4), 3)
assert_equal(len(q_data4.state.unique()), 1)

In [15]:
# ALWAYS RUN THIS CELL, ESPECIALLY WHEN YOU GET AN ASSERT ERROR AND WANT
# TO TRY AGAIN
con.close()

# Problem 5

Finally, use what you've learned in previous weeks to do the following:

* Calculate the median DepDelay for each Origin airport in 2001.csv
* Merge on airport, city, and state from the airports.csv data  
* The final columns should be 'iata', 'airport', 'city', 'state', 'medianDepDelay'
* Push the result to the database as AirportDelays and don't write an index column

Keep in mind that during your median calculation you should not include any NA values. To speed up the import of 2001.csv, you can specify which columns you'll need via `usecols`. When doing the merge, there may be airports in 2001.csv that aren't in airports.csv. Do not discard these; keep all airports in 2001.csv but allow the airport, city, and state columns to be NA. Don't forget to close your database connection!
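
The steps above can be sketched on toy data before running them on the full files. The airport codes `AAA` and `ZZZ`, the delay values, and the airport details below are invented for illustration. The sketch relies on `groupby(...).median()` skipping NA values by default and on a left merge keeping unmatched rows.

```python
import sqlite3
import numpy as np
import pandas as pd

# Toy stand-ins for 2001.csv and airports.csv (values are made up)
flights = pd.DataFrame({
    "Origin": ["AAA", "AAA", "AAA", "ZZZ", "ZZZ"],
    "DepDelay": [1.0, np.nan, 5.0, -2.0, 4.0],
})
airports = pd.DataFrame({
    "iata": ["AAA"], "airport": ["Alpha Field"],
    "city": ["Alphaville"], "state": ["IL"],
})

# Median per origin; NaN delays are ignored by median()
medians = (flights.groupby("Origin", as_index=False)["DepDelay"].median()
           .rename(columns={"Origin": "iata", "DepDelay": "medianDepDelay"}))

# Left merge keeps ZZZ even though it has no match in airports
merged = medians.merge(airports, how="left", on="iata")
merged = merged[["iata", "airport", "city", "state", "medianDepDelay"]]

con = sqlite3.connect(":memory:")
# index=False keeps the DataFrame index out of the table
merged.to_sql("AirportDelays", con, if_exists="replace", index=False)
rows = con.execute("SELECT * FROM AirportDelays ORDER BY iata").fetchall()
con.close()

print(rows)
```

The unmatched airport comes back from SQLite with `None` in the airport, city, and state fields, which is the NA behavior the problem asks for.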

In [16]:
#YOUR CODE HERE
# Create the connector
con = create_connector(db)
# Read 2001.csv
df1 = pd.read_csv('/home/data_scientist/data/2001.csv',encoding='latin-1',usecols = ['DepDelay','Origin'])
# Change the columns name for merging
df1.columns = ['DepDelay','iata']
# Calculate the median
df1 = df1.groupby('iata', as_index = False).median()
# Read airports.csv
df2 = pd.read_csv('/home/data_scientist/data/airports.csv',usecols = ['iata','airport', 'city','state'])
# Merging
df = pd.merge(df1,df2, how = 'left',on='iata')
# Rename the median column
df = df.rename(columns={'DepDelay': 'medianDepDelay'})
# Order the columns as requested
df = df[['iata', 'airport', 'city', 'state', 'medianDepDelay']]
# Push the result to the database as AirportDelays without an index column
df.to_sql('AirportDelays', con, if_exists='replace', index=False)
# Close database connection
con.close()

In [17]:
# get connector and cursor
d = create_connector(db)
c = d.cursor()

# check that the number of airports is correct
num_apts = c.execute("SELECT COUNT(*) FROM AirportDelays").fetchone()
assert_equal(num_apts[0], 231)

# get the champaign data
cmi_data = c.execute("SELECT * FROM AirportDelays WHERE iata = 'CMI'").fetchall()
# there should only be one record
assert_equal(len(cmi_data), 1)

# check the column names and get indices
cols = [x[0] for x in c.description]
iata_col = cols.index('iata')
airpt_col = cols.index('airport')
city_col = cols.index('city')
state_col = cols.index('state')
data_col = cols.index('medianDepDelay')
cols.sort()
assert_equal(['airport', 'city', 'iata', 'medianDepDelay', 'state'], cols)

# check the champaign data
assert_equal('Champaign/Urbana', cmi_data[0][city_col])
assert_equal(-2.0, cmi_data[0][data_col])

# get the Dallas data
dfw_data = c.execute("SELECT * FROM AirportDelays WHERE iata = 'DFW'").fetchall()
assert_equal('Dallas-Fort Worth International', dfw_data[0][airpt_col])
assert_equal('Dallas-Fort Worth', dfw_data[0][city_col])
assert_equal('TX', dfw_data[0][state_col])

# get the Boston data
bos_data = c.execute("SELECT * FROM AirportDelays WHERE iata = 'BOS'").fetchall()
assert_equal(0.0, bos_data[0][data_col])

In [18]:
# close the connection
d.close()

# Cleanup

In [19]:
# if your code doesn't execute cleanly from top to bottom, you'll
# probably have to restart the kernel to get this cell to run
!rm -rf /home/data_scientist/w14_p1/
# make absolutely sure you get rid of your old database
try:
    os.remove("/home/data_scientist/w14_p1/p1.db")
except OSError as e:
    pass