#  CSCI 3287  -- Final Project
    
## FDA Food Database Exploration

Name: Jarryd Allison

Email: allisonj@colorado.edu

GitHub ID: jarrydallison

In [None]:
import sqlite3

In [None]:
# First, create the database using the attached csvs
# Connect to the database
conn = sqlite3.connect('database.db')

<hr>

## Instructions / Notes:

#### **_Read these carefully_**

* You **may** create new Jupyter notebook cells for testing, debugging, exploring, etc. In fact, this is encouraged!

* However, you must clearly mark the solution to each sub-problem: put your solution in the cell immediately after the cell marked ```### PLACE YOUR SOLUTION in the next cell```, and make sure your solution is immediately followed by the cell marked ```### SOLUTION in is the previous cell```. Place any testing cells after the ending cell.

* The expected results and correct output format of the queries are currently displayed in the notebook. The results are printed in nice tables with headers. When you run your queries, the displayed correct results will be replaced with your results, which should look like the initial results.

<hr>

#### Submission Instructions:
 1. Commit your changes to your local repository.
 2. Push your changes to the remote repository in Git Classroom.
 3. Print your notebook as a PDF document to have a completed version. 
    * To create a PDF from your notebook: select `File` -> `Save and Export Notebook As..` -> `PDF`.

If you run into problems with a query taking a very long time, first try `Kernel` -> `Restart All and Run All Cells..` and then ask on Piazza.

###  _Have fun!_
<hr>


#### Set up a connection to a SQLite database
Make sure you have decompressed the database file before running the next cell.  The next cell will create an empty database if the file *flights.db* is not in the current directory.
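
Because connecting to a missing file silently creates an empty database, it can help to check for the file first. The following is a minimal sketch in plain Python; the helper name `connect_existing` is hypothetical and not part of the course tooling.

```python
import os
import sqlite3


def connect_existing(path):
    """Open an existing SQLite file and refuse to create a new, empty one.

    Hypothetical helper for illustration only.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"{path} not found; decompress the database file first"
        )
    return sqlite3.connect(path)
```

Calling `connect_existing('flights.db')` before the `%sql` cell would fail fast with a clear message if the file was never decompressed, instead of leaving an empty `flights.db` behind.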

#### The next cell will set up the SQL tools for JupyterHub Notebooks.
* Remember:
    * `%sql [SQL]` is for _single line_ SQL queries
    * `%%sql 
    [SQL]` is for _multi line_ SQL queries


In [2]:
%reload_ext sql
%sql sqlite:///flights.db

'Connected: @flights.db'

<hr>

## Introduction: Travel Delays
<hr>

There's nothing I dislike more than travel delays -- how about you?

In fact, I'm always scheming new ways to avoid travel delays, and I just found an amazing dataset that will help me understand some of the causes and trade-offs when traveling. I wonder if you can use SQL to help me!

Not surprisingly... you can! In this homework, we'll use SQL to explore airline travel delays that occurred in July 2007. To start, let's look at the primary relation in the database we've prepared for you:

In [33]:
%%sql
SELECT * 
FROM ontime 
LIMIT 1;

 * sqlite:///flights.db
Done.


Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2007,7,1,7,2052,2050,2153,2155,WN,575,N304SW,61,65,50,-2,2,ONT,SJC,333,5,6,0,,0,0,0,0,0,0


Cool, there are so many columns! How many rows are there?

In [5]:
%%sql
SELECT COUNT(*) AS num_rows
FROM ontime;

 * sqlite:///flights.db
Done.


num_rows
648560


Wow, that's a lot of data! Good thing you don't have to answer all of my questions by hand...

You don't need to import more data into the database. However, you can find a description of each field online at [https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236](https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236). We actually downloaded the data from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HG7NV7&version=1.0 and focused on just July 2007 to reduce the data size. 

We've pre-loaded a number of additional tables that will help you decode important fields. The tables in the database are:
* `ontime` - the ontime flight data described at https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/HG7NV7/YZWKHN&version=1.0
* `carriers` - airlines described at https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/HG7NV7/3NOQ6Q&version=1.0
* `airports` - airports described at https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/HG7NV7/XTPZZY&version=1.0
* `planes` - individual plane information described at https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/HG7NV7/XXSL8A&version=1.0
* `weekdays` - hand-made database of weekdays with Sunday as day #1

Please use the following cell to explore the `carriers`, `airports`, `planes` and `weekdays` tables. Try a few different select statements to see what each table contains.

In [6]:
%%sql
select * from carriers Limit 1;

 * sqlite:///flights.db
Done.


code,carrier
02Q,Titan Airways


<hr>

### Your turn to write some queries
<hr>

Query 1: How long are flights delayed on average? (10 points)
------------------------
Just to get a sense of the data, let's start with a simple query.

In the cell below, write a SQL query that returns the average arrival delay for the entire month of July 2007 (i.e., the whole dataset).

In [7]:
### PLACE YOUR SOLUTION in the next cell
### vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

In [8]:
%%sql
select avg(ArrDelay) from ontime;

 * sqlite:///flights.db
Done.


avg(ArrDelay)
14.107679837700504


In [9]:
### ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
### SOLUTION in is the previous cell


Query 2: What was the worst flight delay? (10 points)
------------------------
Hmm, the average doesn't look too bad! What about the _worst_ delay?

In the cell below, write a SQL query that returns the maximum arrival delay for the entire month of July 2007 (i.e., the whole dataset).

In [10]:
### PLACE YOUR SOLUTION in the next cell
### vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

In [11]:
%%sql
select max(ArrDelay) from ontime;

 * sqlite:///flights.db
Done.


max(ArrDelay)
1386


In [12]:
### ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
### SOLUTION in is the previous cell


Query 3: What flight am I happiest I didn't take? (10 points)
------------------------
Yikes! What flight was so late?

In the cell below, write a SQL query that returns the airline code (i.e., `UniqueCarrier`), origin city name, destination city name, flight number, and arrival delay for the flight(s) with the maximum arrival delay for the entire month of July 2007. Do not hard-code the arrival delay you found above. Hint: use a subquery.

Your query should have the following columns (use `... AS ...` to rename attributes to the column labels shown):

|Airline Code | Origin | Destination|Flight Number|Arrival Delay|
|-------------|--------|------------|-------------|-------------|
| value... | value | value | value | value |
| ZZ | New York | Atlanta | 1942 | 480 |

In future problems, we would describe this as a query returning (`Airline Code`, `Origin City Name`, `Destination City Name`, `Flight Number`, `Arrival Delay`).
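
The subquery hint can be tried out on a tiny in-memory table first. This is only a sketch using Python's built-in `sqlite3` module; the table `toy_flights` and its rows are made up for illustration and are not part of `flights.db`.

```python
import sqlite3

# In-memory toy database; `toy_flights` is hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_flights (flight_num INTEGER, delay INTEGER)")
conn.executemany(
    "INSERT INTO toy_flights VALUES (?, ?)",
    [(101, 15), (202, 480), (303, 480), (404, -3)],
)

# The scalar subquery computes the maximum once; the outer query keeps
# every row that matches it, so ties are all returned.
worst = conn.execute(
    """
    SELECT flight_num, delay
    FROM toy_flights
    WHERE delay = (SELECT MAX(delay) FROM toy_flights)
    ORDER BY flight_num
    """
).fetchall()

print(worst)
conn.close()
```

Comparing against the subquery, rather than a hard-coded number, keeps the query correct if the data changes.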

In [13]:
### PLACE YOUR SOLUTION in the next cell
### vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

In [14]:
%%sql
select UniqueCarrier as "Airline Code", A1.city as "Origin", A2.city as "Destination", FlightNum as "Flight Number", ArrDelay as "Arrival Delay"
from ontime, airports as A1, airports as A2
where ArrDelay = (select max(ArrDelay) from ontime) and A1.iata = Origin and A2.iata = Dest;

 * sqlite:///flights.db
Done.


Airline Code,Origin,Destination,Flight Number,Arrival Delay
AA,Las Vegas,Dallas-Fort Worth,1004,1386


In [15]:
### ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
### SOLUTION in is the previous cell


Query 4: Which are the worst days to travel? (10 points)
------------------------
Since class just started, I don't have time to travel anytime soon. However, I'm headed out of town for a trip next week! What day is worst for booking my flight?

In the cell below, write a SQL query that returns the average arrival delay time for each day of the week, in descending order. 

The schema of your relation should be of the form (`Day of Week`, `Average Delay`).

**Note: do _not_ report the weekday ID.** (Hint: look at the `weekdays` table and perform a join to obtain the weekday name.)

In [16]:
### PLACE YOUR SOLUTION in the next cell
### vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

In [17]:
%%sql
SELECT Name as 'weekday_name', Avg(ArrDelay) as 'average_delay'
FROM ontime
INNER JOIN weekdays
ON weekdays.DayOfWeek = ontime.DayOfWeek
GROUP BY Name
ORDER BY Avg(ArrDelay) DESC;

 * sqlite:///flights.db
Done.


weekday_name,average_delay
Wednesday,18.104848250718305
Saturday,17.03286664090445
Sunday,15.889976224441275
Thursday,13.359122962962964
Monday,12.97324491855269
Tuesday,12.716138992293423
Friday,7.1869733939345615


In [18]:
### ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
### SOLUTION in is the previous cell


Query 5: Which airlines that fly out of DEN are delayed least?
------------------------
Now that I know which days to avoid, I'm curious which airline I should fly out of DEN. Since I haven't been told where I'm flying, please just compute the average for the airlines that fly from DEN.

In the cell below, write a SQL query that returns the average arrival delay time (across _all_ flights) for each carrier that flew out of DEN at least once in July 2007 (i.e., in the current dataset), in descending order.

The schema of your relation should be of the form (`Airline Name`, `Average Delay`).


**Note: do _not_ report the airline ID (UniqueCarrier).** (Hint: a subquery is helpful here; also, look at the `carriers` table and perform a join.)
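
One way to read "across _all_ flights for each carrier that flew out of DEN at least once" is: first pick the carriers with a subquery, then average over every flight of those carriers. A minimal sketch on made-up data follows; the tables `toy_flights` and `toy_carriers` are hypothetical stand-ins, not the real schema.

```python
import sqlite3

# In-memory toy database; table names, columns and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE toy_carriers (code TEXT, name TEXT);
    CREATE TABLE toy_flights (carrier TEXT, origin TEXT, delay REAL);
    INSERT INTO toy_carriers VALUES ('AA', 'Airline A'), ('BB', 'Airline B');
    INSERT INTO toy_flights VALUES
        ('AA', 'DEN', 10), ('AA', 'SFO', 30),
        ('BB', 'SFO', 50);
    """
)

# The IN subquery selects carriers seen at DEN at least once; the outer
# query then averages over all of their flights, not just the DEN ones.
rows = conn.execute(
    """
    SELECT c.name, AVG(f.delay) AS avg_delay
    FROM toy_flights AS f
    JOIN toy_carriers AS c ON c.code = f.carrier
    WHERE f.carrier IN (SELECT carrier FROM toy_flights WHERE origin = 'DEN')
    GROUP BY c.name
    ORDER BY avg_delay DESC
    """
).fetchall()

print(rows)
conn.close()
```

Here Airline A averages 20 over both of its flights, while Airline B is excluded because it never departs from DEN.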

In [19]:
### PLACE YOUR SOLUTION in the next cell
### vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

In [20]:
%%sql
SELECT UniqueCarrier as 'carrier_code', carrier as 'carrier_name', AVG(ArrDelay)
FROM ontime, carriers
WHERE Origin = 'DEN' and Code = UniqueCarrier
GROUP BY UniqueCarrier
ORDER BY AVG(ArrDelay) DESC;

 * sqlite:///flights.db
Done.


carrier_code,carrier_name,AVG(ArrDelay)
B6,JetBlue Airways,30.157894736842103
OH,Comair Inc.,26.0
XE,Expressjet Airlines Inc.,23.3046875
AA,American Airlines Inc.,19.608169440242055
UA,United Air Lines Inc.,16.398711524695777
NW,Northwest Airlines Inc.,16.311720698254366
AS,Alaska Airlines Inc.,15.78139534883721
US,US Airways Inc. (Merged with America West 9/05. Reporting for both starting 10/07.),13.11697247706422
DL,Delta Air Lines Inc.,12.268361581920905
CO,Continental Air Lines Inc.,12.004878048780489


In [21]:
### ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
### SOLUTION in is the previous cell


Query 6: What proportion of airlines are regularly late?
------------------------
Yeesh, there are a lot of late flights! How many airlines are regularly late?

In the cell below, write a SQL query that returns the proportion of airlines (appearing in `ontime`) whose flights are on average at least 10 minutes late to arrive. For example, if 4 of 8 airlines have average arrival delays of at least 10 minutes, you would report 0.5.

Do not hard-code the total number of airlines, and make sure to use at least one `HAVING` clause in your SQL query.

**Note:** sqlite `COUNT(*)` returns integer types. Therefore, your query should likely contain at least one `SELECT CAST (COUNT(*) AS float)` or a clause like `COUNT(*)*1.0`.
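
To see why the cast matters, here is a minimal sketch using Python's built-in `sqlite3` module against an in-memory database. The table `t` and its rows are made up for illustration and are not part of `flights.db`.

```python
import sqlite3

# In-memory toy database; the table `t` is hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,), (4,)])

# Integer / integer truncates: 2 of the 4 rows match, but 2 / 4 gives 0.
int_ratio = conn.execute(
    "SELECT (SELECT COUNT(*) FROM t WHERE x > 2) / (SELECT COUNT(*) FROM t)"
).fetchone()[0]

# Multiplying by 1.0 (or using CAST(... AS float)) forces real division.
real_ratio = conn.execute(
    "SELECT (SELECT COUNT(*) FROM t WHERE x > 2) * 1.0 / (SELECT COUNT(*) FROM t)"
).fetchone()[0]

print(int_ratio, real_ratio)  # prints: 0 0.5
conn.close()
```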

In [22]:
### PLACE YOUR SOLUTION in the next cell
### vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

In [23]:
%%sql
SELECT Later*1.0/Total as 'Percentage Late'
FROM (
    SELECT COUNT(*) as Later
    FROM (
        SELECT UniqueCarrier
        FROM ontime
        GROUP BY UniqueCarrier
        HAVING AVG(ArrDelay) >= 10
        )
    ),
    (
        SELECT COUNT(DISTINCT UniqueCarrier) as Total
        FROM ontime
    );

 * sqlite:///flights.db
Done.


Percentage Late
0.7


In [24]:
### ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
### SOLUTION in is the previous cell


Query 7: How do late departures affect late arrivals?
------------------------
It sure looks like my plane is likely to be delayed. I'd like to know: if my plane is delayed in taking off, how will it affect my arrival time?

The [sample covariance](https://en.wikipedia.org/wiki/Covariance) provides a measure of the joint variability of two variables. A large positive covariance means the two variables tend to move together, and a negative covariance indicates the variables tend to be inversely related. We can compute the sample covariance as:
$$
Cov(X,Y) = \frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})
$$
where $x_i$ denotes the $i$th sample of $X$, $y_i$ the $i$th sample of $Y$, and the means of $X$ and $Y$ are denoted by $\bar{x}$ and $\bar{y}$.

In the cell below, write a single SQL query that computes the covariance between the departure delay time and the arrival delay time. You should explicitly exclude entries where either the arrival or departure delay is NULL.

*Note: we could also compute a statistic like the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) here, which provides a normalized measure (i.e., on a scale from -1 to 1) of how strongly two variables are related. However, sqlite doesn't natively support square roots (unlike commonly-used relational databases like PostgreSQL and MySQL!), so we're asking you to compute covariance instead.*
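
As a worked example of the formula, the sketch below computes the sample covariance on a five-row toy table, once in SQL and once in plain Python as a cross-check. The table `delays` and its values are made up for illustration. It assumes the means $\bar{x}$ and $\bar{y}$ are taken over the same rows that enter the sum, i.e. only rows where both values are non-NULL.

```python
import sqlite3

# In-memory toy data; two rows contain a NULL and must be excluded.
rows = [(1, 2), (2, 4), (3, 6), (None, 5), (4, None)]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE delays (dep REAL, arr REAL)")
conn.executemany("INSERT INTO delays VALUES (?, ?)", rows)

# The means come from a one-row subquery that applies the same NULL filter
# as the outer query, so both use exactly the same n samples.
cov_sql = conn.execute(
    """
    SELECT SUM((dep - m.dep_avg) * (arr - m.arr_avg)) / (COUNT(*) - 1)
    FROM delays,
         (SELECT AVG(dep) AS dep_avg, AVG(arr) AS arr_avg
          FROM delays
          WHERE dep IS NOT NULL AND arr IS NOT NULL) AS m
    WHERE dep IS NOT NULL AND arr IS NOT NULL
    """
).fetchone()[0]
conn.close()

# Cross-check: apply the formula directly in Python.
pairs = [(d, a) for d, a in rows if d is not None and a is not None]
n = len(pairs)
dep_mean = sum(d for d, _ in pairs) / n
arr_mean = sum(a for _, a in pairs) / n
cov_py = sum((d - dep_mean) * (a - arr_mean) for d, a in pairs) / (n - 1)

print(cov_sql, cov_py)  # prints: 2.0 2.0
```

With the three complete pairs (1, 2), (2, 4), (3, 6), the means are 2 and 4, the products of deviations sum to 4, and dividing by $n-1 = 2$ gives 2.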

In [25]:
### PLACE YOUR SOLUTION in the next cell
### vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

In [7]:
%%sql
SELECT SUM((DepDelay - d_avg)*(ArrDelay -a_avg))/(COUNT(*) - 1) as COV
FROM ontime, (SELECT AVG(DepDelay) as d_avg, AVG(ArrDelay) as a_avg FROM ontime)
WHERE ArrDelay IS NOT NULL AND DepDelay IS NOT NULL;

 * sqlite:///flights.db
Done.


COV
1672.7287544186024


In [None]:
### ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
### SOLUTION in is the previous cell


Query 8: It was a bad week...
------------------------
Which airlines had the largest absolute increase in average arrival delay in the last week of July (i.e., flights on or after July 24th) compared to the previous days (i.e. flights before July 24th)?

In the cell below, write a single SQL query that returns the airline name (_not just_ ID) with the maximum absolute increase in average arrival delay between the first 23 days of the month and days 24-31. Report both the airline name and the absolute increase.

**Note:** due to [sqlite's handling of dates](http://www.sqlite.org/lang_datefunc.html), it may be easier to query using the `DayofMonth` column.

**Note 2:** This is probably the hardest query of the assignment; break it down into subqueries that you can run one-by-one and build up your answer subquery by subquery.

**Hint:** You can compute two subqueries, one to compute the average arrival delay for flights on or after July 24th, and one to compute the average arrival delay for flights before July 24th, and then join the two to calculate the increase in delay. 
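
The shape described in the hint (two grouped subqueries joined on the carrier) can be sketched on toy data as follows. The table `toy_flights` and its rows are hypothetical, and the carrier-name lookup is left out to keep the example short.

```python
import sqlite3

# In-memory toy database; all names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_flights (carrier TEXT, day INTEGER, delay REAL)")
conn.executemany(
    "INSERT INTO toy_flights VALUES (?, ?, ?)",
    [
        ("AA", 1, 10), ("AA", 2, 20),    # AA before day 24: average 15
        ("AA", 25, 18),                  # AA from day 24 on: average 18
        ("BB", 3, 5),                    # BB before day 24: average 5
        ("BB", 24, 10), ("BB", 30, 20),  # BB from day 24 on: average 15
    ],
)

# One subquery per period, joined on carrier; the difference of the two
# averages is the increase, and ORDER BY ... LIMIT 1 keeps the largest.
result = conn.execute(
    """
    SELECT late.carrier, late.avg_delay - early.avg_delay AS increase
    FROM (SELECT carrier, AVG(delay) AS avg_delay
          FROM toy_flights WHERE day < 24 GROUP BY carrier) AS early
    JOIN (SELECT carrier, AVG(delay) AS avg_delay
          FROM toy_flights WHERE day >= 24 GROUP BY carrier) AS late
      ON early.carrier = late.carrier
    ORDER BY increase DESC
    LIMIT 1
    """
).fetchone()

print(result)  # prints: ('BB', 10.0)
conn.close()
```

Each subquery can be run on its own first, which matches the advice to build the answer up piece by piece.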

In [None]:
### PLACE YOUR SOLUTION in the next cell
### vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

In [70]:
%%sql
SELECT MAX(second - first) as 'avg delay increase', firstHalf.carrier as airline
FROM
    (
        (
            SELECT AVG(ArrDelay) as first, carrier
            FROM ontime, carriers
            WHERE DayOfMonth < 24 and UniqueCarrier = Code
            GROUP BY UniqueCarrier
        ) as firstHalf
        INNER JOIN
        (
            SELECT AVG(ArrDelay) as second, carrier
            FROM ontime, carriers
            WHERE DayOfMonth >= 24 and UniqueCarrier = Code
            GROUP BY UniqueCarrier
        ) as secondHalf
        ON firstHalf.carrier = secondHalf.carrier
    )

 * sqlite:///flights.db
Done.


avg delay increase,airline
8.125207088689457,US Airways Inc. (Merged with America West 9/05. Reporting for both starting 10/07.)


In [None]:
### ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
### SOLUTION in is the previous cell

Query 9: Of Hipsters and Technologists
------------------------
I'm keen to visit both Portland (PDX) and San Francisco (SFO), but I can't fit both into the same trip. To maximize my frequent flier mileage, I'd like to use the same airline for each. Which airlines fly both DEN -> PDX and DEN -> SFO?

In the cell below, write a single SQL query that returns the distinct airline names (_not_ ID, and with no duplicates) that flew both DEN -> PDX and DEN -> SFO in July 2007.
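
One possible approach, among several, is a set intersection: carriers on the first route `INTERSECT` carriers on the second. The sketch below shows this on a made-up table `toy_routes`; it illustrates the technique only, and the name lookup in `carriers` is omitted.

```python
import sqlite3

# In-memory toy database; `toy_routes` is hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_routes (carrier TEXT, origin TEXT, dest TEXT)")
conn.executemany(
    "INSERT INTO toy_routes VALUES (?, ?, ?)",
    [
        ("AA", "DEN", "PDX"), ("AA", "DEN", "SFO"), ("AA", "DEN", "SFO"),
        ("BB", "DEN", "PDX"),                        # only one of the routes
        ("CC", "LAX", "PDX"), ("CC", "DEN", "SFO"),  # PDX flight is not from DEN
    ],
)

# INTERSECT keeps only carriers present in both result sets and
# removes duplicates, so no extra DISTINCT is needed.
both = conn.execute(
    """
    SELECT carrier FROM toy_routes WHERE origin = 'DEN' AND dest = 'PDX'
    INTERSECT
    SELECT carrier FROM toy_routes WHERE origin = 'DEN' AND dest = 'SFO'
    """
).fetchall()

print(both)  # prints: [('AA',)]
conn.close()
```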

In [None]:
### PLACE YOUR SOLUTION in the next cell
### vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

In [30]:
%%sql
SELECT *
FROM 
    (
        SELECT carrier
        FROM ontime, carriers
        WHERE Dest IN ('SFO', 'PDX') AND Origin = 'DEN' and Code = UniqueCarrier
        GROUP BY Dest, UniqueCarrier
    )
GROUP BY carrier
HAVING COUNT(*) >= 2;

 * sqlite:///flights.db
Done.


carrier
Frontier Airlines Inc.
United Air Lines Inc.


In [None]:
### ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
### SOLUTION in is the previous cell


Query 10: Decision Fatigue and Equidistance
------------------------
I'm flying back to Denver from LA later this month, and I can fly out of either LA (LAX), Ontario (ONT) or San Diego (SAN) and can fly into either Denver (DEN) or Colorado Springs (COS). If this month is like July, which flight will have the shortest arrival delay for flights leaving after 2PM local time?

In the cell below, write a single SQL query that returns the average arrival delay of flights departing either LAX, ONT or SAN after 2PM local time (`CrsDepTime`) and arriving at one of DEN or COS. Group by departure and arrival airport and return results descending by arrival delay.

Note: the `CrsDepTime` field is an integer formatted as hhmm (e.g. 4:15pm is 1615)
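
Because hhmm integers increase with the time of day, they can be compared directly against a cutoff such as 1400 without any date parsing. A small Python sketch of the encoding follows; the helper names `split_hhmm` and `departs_after` are hypothetical.

```python
from typing import Tuple


def split_hhmm(value: int) -> Tuple[int, int]:
    """Split an hhmm-formatted integer such as 1615 into (hour, minute)."""
    return divmod(value, 100)


def departs_after(value: int, cutoff_hhmm: int = 1400) -> bool:
    """Return True if the hhmm time is strictly later than the cutoff."""
    return value > cutoff_hhmm


print(split_hhmm(1615))     # prints: (16, 15)
print(departs_after(1615))  # prints: True
print(departs_after(1400))  # prints: False
```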

In [None]:
### PLACE YOUR SOLUTION in the next cell
### vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

In [3]:
%%sql
SELECT Origin, Dest, AVG(ArrDelay) as Delay
FROM ontime
WHERE CrsDepTime > 1400 AND (Dest = 'DEN' OR Dest = 'COS') AND (
    Origin = 'LAX' OR Origin = 'ONT' OR Origin = 'SAN'
)
GROUP BY Origin, Dest
ORDER BY Delay DESC;

 * sqlite:///flights.db
Done.


Origin,Dest,Delay
SAN,COS,25.548387096774192
ONT,DEN,23.285714285714285
LAX,DEN,22.167192429022084
SAN,DEN,18.78409090909091
LAX,COS,13.366666666666667
ONT,COS,8.0


In [None]:
### ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
### SOLUTION in is the previous cell


## You're done! Now *commit* and *push* your notebook!
 * Make sure to submit information in the Moodle assignment.
 * Make sure you have entered your information at the top of the notebook.
 * Refer to the top of this notebook for submission instructions.