## Setup
This notebook contains the code to create a database in postgres and import flight data. In produciton, I normally would not use a Juypter notebook, but it makes it easy to present my code in this demo.

The first cell below contains all our necessary imports and loads our environment variables. Be sure to run the cells in order or press the run all button up top, otherwise you may get errors.

In [1]:
import os, datetime
from helpers import make_table
from DBConn import DBConn
from dotenv import load_dotenv
import pandas as pd
import numpy as np
from sqlalchemy.dialects.postgresql import DOUBLE_PRECISION, SMALLINT, BOOLEAN, VARCHAR, CHAR
from sqlalchemy import create_engine, event


PROJECT_ROOT = os.path.dirname(os.path.realpath("__file__")) #Change if needed, defaults to location of this file
print('Project Root:', PROJECT_ROOT)
# DATA_DIR = os.path.join(PROJECT_ROOT,'data') # modify as needed or move data to projec_root\data\ (which is in .gitignore)
DATA_DIR = 'C:\\Users\\Public\\data\\' # modify as needed or move data to projec_root\data\ (which is in .gitignore)
print('Data Directory:', DATA_DIR)
DATABASE_NAME = 'flight_data'
DBConn.set_database(DATABASE_NAME) # Name of database to create.

# store data paths for later use -- modify filenames if yours are different
FLIGHT_DATA_DIR = os.path.join(DATA_DIR, 'FlightDataUncompressed')
FLIGHT_DATA_FNAMES = os.listdir(FLIGHT_DATA_DIR)
FLIGHT_DATA_PATHS = []
for file_name in FLIGHT_DATA_FNAMES:
    FLIGHT_DATA_PATHS.append(os.path.join(FLIGHT_DATA_DIR, file_name))
AIRPORTS_PATH = os.path.join(DATA_DIR, 'airports.csv')
CARRIERS_PATH = os.path.join(DATA_DIR, 'carriers.csv')

# load enviornment variables storing database conneciton info
load_dotenv(PROJECT_ROOT)   #values can now be accessed using os.getenv(KEY_NAME)

Project Root: C:\Users\leifk\Documents\GitHub\FlightDataAnalysis
Data Directory: C:\Users\Public\data\


True

## EDA with Pandas -- determine structure of data for database schema


In [5]:
import chardet
def row_count(f):

    for i, l in enumerate(f):
        pass
    return i   

total_rows = 0
file_rows = dict()
for path_ in FLIGHT_DATA_PATHS:
    name = os.path.basename(path_)
    with open(path_,'rb') as file:
        if int(name.rstrip('.csv'))>=0:       
            file_rows[name] = row_count(file)
            total_rows += file_rows[name]
            print(f"{name}: {file_rows[name]}")
            # print(f'{name} chardet:')
            # print(chardet.detect(file.read(1000)))
print(f'total rows: {total_rows}')

1987.csv: 1311826
1988.csv: 5202096
1989.csv: 5041200
1990.csv: 5270893
1991.csv: 5076925
1992.csv: 5092157
1993.csv: 5070501
1994.csv: 5180048
1995.csv: 5327435
1996.csv: 5351983
1997.csv: 5411843
1998.csv: 5384721
1999.csv: 5527884
2000.csv: 5683047
2001.csv: 5967780
2002.csv: 5271359
2003.csv: 6488540
2004.csv: 7129270
2005.csv: 7140596
2006.csv: 7141922
2007.csv: 7453215
2008.csv: 7009728
total rows: 123534969


In [3]:
# ingest to pandas
airports_df = pd.read_csv(AIRPORTS_PATH) 
carriers_df = pd.read_csv(CARRIERS_PATH)
carriers_df = carriers_df.fillna('NA') ## Added after EDA, due to North American Airlines being interpreted as null

print(f'Found {len(FLIGHT_DATA_FNAMES)} years of flight data starting: from {FLIGHT_DATA_FNAMES[0].rstrip(".csv")} - {FLIGHT_DATA_FNAMES[-1].rstrip(".csv")}')
print(f'looking at {FLIGHT_DATA_FNAMES[0].rstrip(".csv")}')
flight_df = pd.read_csv(FLIGHT_DATA_PATHS[9], encoding='ascii')

def print_max_str_len(df:pd.DataFrame):
    """
    Print max length of string columns in pandas dataframe
    """
    index = 0
    for col_name, col in df.iteritems():
        try:
            print(index, col_name, 'str', col.str.len().max())
        except AttributeError:
            try:
                print(index, col_name, 'int/float', max(col.map(str).apply(len)))
            except AttributeError:
                #lazy way to print col length of strings and ints
                pass
        index += 1

Found 22 years of flight data starting: from 1987 - 2008
looking at 1987


In [9]:
## Airports Schema
# print(airports_df.head(5))
print(airports_df[airports_df['iata'].astype(str).str.contains('CBM')])
# print(airports_df.info())
# print_max_str_len(airports_df)

# print(airports_df.isnull().sum())


Empty DataFrame
Columns: [iata, airport, city, state, country, lat, long]
Index: []


In [None]:
## Figure out Carriers Schema
## uncomment lines to see what steps I took.

# carriers_df.head()
# carriers_df.info()
# print_max_str_len(carriers_df)


# print('null values:')
# print(carriers_df.isnull().sum())
# ## looks like we have a null value
# carriers_df[carriers_df.Code.isnull()]
# ##and it's caused by North American being interpreted as NA.


In [None]:
## Flights Schema

# flight_df.info()
# flight_df.UniqueCarrier.isnull().sum()
# print_max_str_len(flight_df)

In [20]:
## More on flights

# flight_df.UniqueCarrier[0]
flight_df.head() # uncomment to see first five rows. You'll also need to comment next line.
# print(flight_df.Diverted.value_counts())


Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,1996,1,29,1,2039.0,1930,2245.0,2139,DL,345,...,6,10,0,,0,,,,,
1,1996,1,30,2,1931.0,1930,2142.0,2139,DL,345,...,5,22,0,,0,,,,,
2,1996,1,31,3,1956.0,1930,2231.0,2139,DL,345,...,7,27,0,,0,,,,,
3,1996,1,1,1,1730.0,1550,1909.0,1745,DL,411,...,4,14,0,,0,,,,,
4,1996,1,2,2,1714.0,1550,1841.0,1745,DL,411,...,4,8,0,,0,,,,,


Ok, so we've now looked at the columns in each dataset and can see that we'll need three tables relaations between:
- `flight.Origin`, `flight.Destination` to `airports.iata`
- `flight.UniqueCarrier` to `carrier.Code`

Now lets get the tables created in SQL:


### Create Flight Data Database
The database in the .env file is just an entry point. We are going to now create new database and load in our flight data.

#### Create database connection

In [None]:
### Create Database


# Create database if it doesn't exist already
sql = f'''--sql
SELECT 'CREATE DATABASE {DATABASE_NAME}'
WHERE NOT EXISTS (SELECT FROM pg_database WHERE datname = '{DATABASE_NAME}');
'''

conn = DBConn()
result = conn.exec(sql)
print("Database created successfully or already exists....")
DBConn.set_database(DATABASE_NAME)
conn.close()


## Create Tables

In [None]:
## Create Airport Table
conn = DBConn()
print('accessing', conn.get_database())
sql = f'''--sql
--DROP TABLE flights;
--DROP TABLE carriers;
--DROP TABLE airports;

--sql --comment included for code highlighting in IDE
 CREATE TABLE IF NOT EXISTS carriers (
    "Code" VARCHAR(7) NOT NULL,
    "Description" VARCHAR(100),
    CONSTRAINT carriers_pkey PRIMARY KEY ("Code")
);

--sql 
CREATE TABLE IF NOT EXISTS airports (
    iata VARCHAR(4) NOT NULL,
    airport VARCHAR(50) NOT NULL,
    city VARCHAR(40),
    state CHAR(2),
    country VARCHAR(40) NOT NULL,
    lat DOUBLE PRECISION NOT NULL,
    long DOUBLE PRECISION NOT NULL,
    CONSTRAINT airport_pkey PRIMARY KEY (iata)
);

CREATE TABLE IF NOT EXISTS flights (
 "Year" SMALLINT NOT NULL,  
 "Month" SMALLINT NOT NULL,  
 "DayofMonth" SMALLINT NOT NULL,  
 "DayOfWeek" SMALLINT NOT NULL,  
 "DepTime" SMALLINT,
 "CRSDepTime" SMALLINT NOT NULL,  
 "ArrTime" SMALLINT,
 "CRSArrTime" SMALLINT NOT NULL,  
 "UniqueCarrier" VARCHAR(6) NOT NULL,
 "FlightNum" SMALLINT NOT NULL,  
 "ActualElapsedTime" SMALLINT,
 "CRSElapsedTime" SMALLINT,  
 "AirTime" SMALLINT,
 "ArrDelay" SMALLINT,
 "DepDelay" SMALLINT,
 "Origin" VARCHAR(3) NOT NULL, 
 "Dest" VARCHAR(3) NOT NULL, 
 "Distance" INTEGER,
 "TaxiIn" SMALLINT,
 "TaxiOut" SMALLINT,
 "Cancelled" BOOLEAN NOT NULL,  
 "CarrierDelay" SMALLINT,
 "WeatherDelay" SMALLINT,
 "NASDelay" SMALLINT,
 "SecurityDelay" SMALLINT,
 "LateAircraftDelay" SMALLINT,
 
 -- PRIMARY KEY ("Year", "Month", "DayofMonth", "UniqueCarrier", "CRSDepTime", "FlightNum", "Origin", "CRSArrTime"),

 CONSTRAINT origin_airport_fk
    FOREIGN KEY ("Origin")
    REFERENCES airports(iata)
    ON DELETE CASCADE,

 CONSTRAINT dest_airport_fk 
    FOREIGN KEY ("Dest")
    REFERENCES airports(iata)
    ON DELETE CASCADE,

 CONSTRAINT uniquecarrier_carriers_fk 
    FOREIGN KEY ("UniqueCarrier")
    REFERENCES carriers("Code")
    ON DELETE CASCADE

);
'''
conn.exec(sql)
print('create successful or already created')
conn.close()

## Load Data

### Airports and Carriers

In [None]:
## load carriers into databse

load_carriers = False ## change to true to load
if load_carriers:
    engine = create_engine(
        f'postgresql://{os.getenv("DB_USER")}:{os.getenv("DB_PASS")}@{os.getenv("DB_HOST")}:{os.getenv("DB_PORT")}/{DATABASE_NAME}'
    )
    carriers_df.to_sql(
        'carriers', 
        engine, 
        if_exists='append', 
        index=False, 
        dtype={"Code":VARCHAR, "Description": VARCHAR})



In [None]:
## Load Airports into database
load_airports = False ## change to true to load
if load_airports:
    engine = create_engine(
        f'postgresql://{os.getenv("DB_USER")}:{os.getenv("DB_PASS")}@{os.getenv("DB_HOST")}:{os.getenv("DB_PORT")}/{DATABASE_NAME}'
    )
    airports_df.to_sql(
        'airports', 
        engine, 
        if_exists='append', 
        index=False, 
        dtype={
            "iata": VARCHAR, 
            "airport": VARCHAR,
            "city": VARCHAR,
            "state": CHAR,
            "country": VARCHAR,
            "lat": DOUBLE_PRECISION(),
            "long": DOUBLE_PRECISION(),
            })



### Load Flight Data

In [2]:
## Load Flights into database
load_flights=False
if load_flights:
    engine = create_engine(
        f'postgresql://{os.getenv("DB_USER")}:{os.getenv("DB_PASS")}@{os.getenv("DB_HOST")}:{os.getenv("DB_PORT")}/{DATABASE_NAME}'
    )

    #speeds up bulk import
    @event.listens_for(engine, "before_cursor_execute")
    def receive_before_cursor_execute(con, cursor, statement, params, context, executemany):
        if executemany:
            cursor.fast_executemany=True

    for path_ in reversed(FLIGHT_DATA_PATHS):
        name = os.path.basename(path_)
        year = int(name.rstrip('.csv'))
        if year != 2008:
            print(f"skipping {name}")
            break
        print(f'uploading {name} at {datetime.datetime.now()}')


        for chunk_num, chunk in enumerate(pd.read_csv(
            path_, chunksize=20000,encoding='ascii',dtype={
                "Year": 'int16',  
                "Month": 'int8',  
                "DayofMonth": 'int8',  
                "DayOfWeek": 'int8',  
                "DepTime": 'float64',
                "CRSDepTime": 'int16',  
                "ArrTime": 'float64',
                "CRSArrTime": 'int16',  
                "UniqueCarrier": 'string',
                "FlightNum": 'int16',  
                "ActualElapsedTime": 'float64',
                "CRSElapsedTime": 'float64',  
                "AirTime": 'float64',
                "ArrDelay": 'float64',
                "DepDelay": 'float64',
                "Origin": 'string', 
                "Dest": 'string', 
                "Distance": 'float64',
                "TaxiIn": 'float64',
                "TaxiOut": 'float64',
                "Cancelled": 'int8',  
                "CarrierDelay": 'float64',
                "WeatherDelay": 'float64',
                "NASDelay": 'float64',
                "SecurityDelay": 'float64',
                "LateAircraftDelay": 'float64',
            })):
            # if year == 2004:
            #     if chunk_num<104:
            #         continue
            
            # if year == 2007:
            #     if chunk_num!=235:
            #         print(f'skipping chunk {chunk_num}')
            #         continue
            task_start = datetime.datetime.now()
            print(f'working on chunk {chunk_num} at {task_start}')
            chunk = chunk.drop(['TailNum','CancellationCode','Diverted'], axis=1)
            chunk.to_sql(
                'flights', 
                engine, 
                if_exists='append', 
                index=False,
                method="multi",
                dtype={
                    "Year": SMALLINT,  
                    "Month": SMALLINT,  
                    "DayofMonth": SMALLINT,  
                    "DayOfWeek": SMALLINT,  
                    "DepTime": SMALLINT,
                    "CRSDepTime": SMALLINT,  
                    "ArrTime": SMALLINT,
                    "CRSArrTime": SMALLINT,  
                    "UniqueCarrier": VARCHAR,
                    "FlightNum": SMALLINT,  
                    "ActualElapsedTime": SMALLINT,
                    "CRSElapsedTime": SMALLINT,  
                    "AirTime": SMALLINT,
                    "ArrDelay": SMALLINT,
                    "DepDelay": SMALLINT,
                    "Origin": VARCHAR, 
                    "Dest": VARCHAR, 
                    "Distance": SMALLINT,
                    "TaxiIn": SMALLINT,
                    "TaxiOut": SMALLINT,
                    "Cancelled": BOOLEAN,  
                    "CarrierDelay": SMALLINT,
                    "WeatherDelay": SMALLINT,
                    "NASDelay": SMALLINT,
                    "SecurityDelay": SMALLINT,
                    "LateAircraftDelay": SMALLINT,
                    })
        complete_in = datetime.datetime.now()-task_start
        print(f'complete in {complete_in.seconds} seconds')

uploading 2008.csv at 2022-01-22 18:14:28.884997
working on chunk 0 at 2022-01-22 18:14:28.925007
working on chunk 1 at 2022-01-22 18:14:37.040424
working on chunk 2 at 2022-01-22 18:14:44.529860
working on chunk 3 at 2022-01-22 18:14:52.424278
working on chunk 4 at 2022-01-22 18:15:00.081595
working on chunk 5 at 2022-01-22 18:15:07.639774
working on chunk 6 at 2022-01-22 18:15:15.387070
working on chunk 7 at 2022-01-22 18:15:23.318664
working on chunk 8 at 2022-01-22 18:15:31.277152
working on chunk 9 at 2022-01-22 18:15:39.084707
working on chunk 10 at 2022-01-22 18:15:46.760302
working on chunk 11 at 2022-01-22 18:15:54.777905
working on chunk 12 at 2022-01-22 18:16:02.624210
working on chunk 13 at 2022-01-22 18:16:10.346252
working on chunk 14 at 2022-01-22 18:16:18.409987
working on chunk 15 at 2022-01-22 18:16:26.362341
working on chunk 16 at 2022-01-22 18:16:34.151862
working on chunk 17 at 2022-01-22 18:16:41.979764
working on chunk 18 at 2022-01-22 18:16:49.914284
working on 

## Problem Set

##### Problem 1:	What percentage of flights were canceled each year from 1999 to 2003?

In [3]:
conn = DBConn()
q = """
--sql
SELECT 
    count_cancelled,
    total,
    (count_cancelled/total::FLOAT)*100 perc_cancelled
FROM (
    SELECT 
        Count(*) total, 
        SUM(CASE WHEN "Cancelled" THEN 1 ELSE 0 END) count_cancelled
    FROM flights
    WHERE "Year" >= 1999 AND "Year" <= 2003
) x;
"""
ans_1 = conn.exec(q)
conn.close()
print(make_table(['Cancelled Flights','Total Flights', 'Percent Cancelled'], ans_1))

connection opened ...
{0: 17, 1: 13, 2: 17}
| Cancelled Flights | Total Flights | Percent Cancelled |
| ----------------- | ------------- | ----------------- |
| 739611            | 28938610      | 2.55579310823844  |

... connection closed.


| Cancelled Flights | Total Flights | Percent Cancelled |
| ----------------- | ------------- | ----------------- |
| 739611            | 28938610      | 2.55579310823844  |

##### Problem 2: On which day of the week in 2007 were you most likely to arrive on time flying from MCO to IAH

In [9]:
conn = DBConn()
q = """
--sql
SELECT
	"DayOfWeek",
	on_time_count,
	total_flights,
	(on_time_count/total_flights::FLOAT)*100 percent_ontime
FROM
(SELECT 
	"DayOfWeek",
	SUM(CASE WHEN "CRSArrTime" <= "ArrTime" THEN 1 ELSE 0 END) on_time_count,
	COUNT(*) total_flights
FROM flights
WHERE "Year" = 2007 AND "Origin" = 'MCO' AND "Dest" = 'IAH'
GROUP BY "DayOfWeek"
) x;
"""
ans_2 = conn.exec(q)
titles = ['Week Day', 'On Time Flights', 'Total Flights', 'Percent On Time']
print(make_table(titles, ans_2))
conn.close()

connection opened ...
{0: 8, 1: 15, 2: 13, 3: 16}
| Week Day | On Time Flights | Total Flights | Percent On Time  |
| -------- | --------------- | ------------- | ---------------- |
| 1        | 134             | 377           | 35.5437665782493 |
| 2        | 130             | 346           | 37.5722543352601 |
| 3        | 161             | 351           | 45.8689458689459 |
| 4        | 184             | 390           | 47.1794871794872 |
| 5        | 175             | 398           | 43.9698492462312 |
| 6        | 140             | 377           | 37.1352785145889 |
| 7        | 152             | 405           | 37.5308641975309 |

... connection closed.


|Week Day   | On Time Flights   | Total Flights | Percentage On time   |
|-----------|-------------------|---------------|----------------------|
|1 (Mon)    |134	            | 377       	| 35.5437665782493     |
|2 (Tue)    |130	            | 346       	| 37.5722543352601     |
|3 (Wed)    |161	            | 351       	| 45.8689458689459     |
|**4 (Thu)**|**184**	        | **390**      	| **47.1794871794872** |
|5 (Fri)    |175	            | 398       	| 43.9698492462312     |
|6 (Sat)    |140	            | 377       	| 37.1352785145889     |
|7 (sun)    |152	            | 405       	| 37.5308641975309     |

##### Problem 3.	Which 10 flights (airline, flight number, origin city, destination city, and date) had the latest actual vs. scheduled arrival in 2004?

In [30]:
conn = DBConn()
q = """
--sql
SELECT
	c."Description" AS "Airline",
	"FlightNum" AS "Flight Number",
	oa.city AS "Origin City",
	da.city AS "Dest City",
	"ArrTime" AS "Arrival Time",
	"CRSArrTime" AS "Scheduled Arrival",
	-- CASE WHEN "ArrTime"-"CRSArrTime" < 0 --Catch when flights go into the next day
	--	THEN 2400 - "CRSArrTime" + "ArrTime"
	--	ELSE "ArrTime"-"CRSArrTime"
	--	END AS "Late By"
	"ArrDelay" as "Arrival Delay",
	CONCAT("DayofMonth",'/',"Month",'/',"Year") AS "Date"
FROM flights
INNER JOIN carriers AS C
ON flights."UniqueCarrier" = c."Code"
INNER JOIN airports as da
ON flights."Dest" = da.iata
INNER JOIN airports as oa
ON flights."Origin" = oa.iata
WHERE "Year" = 2007 AND
	"ArrDelay" > 0 AND
	"ArrTime" IS NOT NULL AND 
	"CRSArrTime" IS NOT NULL
ORDER BY "ArrDelay"
LIMIT 10
;
"""
ans_3 = conn.exec(q)
titles = ["Airline","Flight Number", "Origin City", "Dest City","Arrival Time","Scheduled Arrival","Date","Arrival Delay"]
print(make_table(titles,ans_3))
conn.close()

connection opened ...
{0: 22, 1: 13, 2: 13, 3: 13, 4: 12, 5: 17, 6: 4, 7: 13}
| Airline                | Flight Number | Origin City   | Dest City     | Arrival Time | Scheduled Arrival | Date | Arrival Delay |
| ---------------------- | ------------- | ------------- | ------------- | ------------ | ----------------- | ---- | ------------- |
| Southwest Airlines Co. | 2705          | Birmingham    | Houston       | 1326         | 1325              | 1    | 2/1/2007      |
| Southwest Airlines Co. | 680           | Nashville     | Columbus      | 2156         | 2155              | 1    | 2/1/2007      |
| Southwest Airlines Co. | 1603          | Windsor Locks | Baltimore     | 1656         | 1655              | 1    | 2/1/2007      |
| Southwest Airlines Co. | 2774          | Albuquerque   | Phoenix       | 846          | 845               | 1    | 2/1/2007      |
| Southwest Airlines Co. | 1342          | Amarillo      | Dallas        | 1431         | 1430              | 1    | 2/1/200

| Airline                    | Flight Number | Origin City | Dest City | Arrival Time | Scheduled Arrival | Late By | Date       |
| -------------------------- | ------------- | ----------- | --------- | ------------ | ----------------- | ------- | ---------- |
| Continental Air Lines Inc. | 687           | Orlando     | Houston   | 2341         | 1636              | 705     | 22/5/2007  |
| Continental Air Lines Inc. | 1687          | Orlando     | Houston   | 2358         | 1918              | 440     | 28/5/2007  |
| Continental Air Lines Inc. | 1473          | Orlando     | Houston   | 1930         | 1509              | 421     | 26/12/2007 |
| Continental Air Lines Inc. | 1687          | Orlando     | Houston   | 2349         | 1944              | 405     | 3/1/2007   |
| Continental Air Lines Inc. | 1007          | Orlando     | Houston   | 137          | 2139              | 398     | 15/6/2007  |
| Continental Air Lines Inc. | 1007          | Orlando     | Houston   | 128          | 2139              | 389     | 3/9/2007   |
| Continental Air Lines Inc. | 1007          | Orlando     | Houston   | 121          | 2139              | 382     | 16/8/2007  |
| Continental Air Lines Inc. | 1872          | Orlando     | Houston   | 1821         | 1454              | 367     | 16/8/2007  |
| Continental Air Lines Inc. | 1487          | Orlando     | Houston   | 1631         | 1317              | 314     | 29/7/2007  |
| Continental Air Lines Inc. | 687           | Orlando     | Houston   | 1943         | 1642              | 301     | 14/3/2007  |

##### Problem 4.	For each year from 1987 to 2008, which airline made the trip between ORD and LAX the fastest (on average)? ***************

In [53]:
conn = DBConn()
q = """
--sql
WITH sub as (SELECT 
		"Year",
		"UniqueCarrier",
		AVG("ActualElapsedTime") AS avg_time
	FROM flights
	WHERE "Origin" = 'ORD' AND "Dest" = 'LAX'
	GROUP BY "Year", "UniqueCarrier"
	)

SELECT
	"Year",
	c."Description" AS "Airline",
	avg_time AS "Avg. Trip Time"
FROM sub
INNER JOIN carriers c
ON sub."UniqueCarrier" = c."Code"
WHERE "UniqueCarrier" IN (
	SELECT "UniqueCarrier"
	FROM sub
	ORDER BY avg_time LIMIT 1
)
GROUP BY "Year"
 ;
"""
a = """
	SELECT 
		"Year",
		"UniqueCarrier",
		AVG("ActualElapsedTime") AS avg_time
	FROM flights
	WHERE "Origin" = 'ORD' AND "Dest" = 'LAX'
	GROUP BY "Year";
"""
ans_4 = conn.exec(q)
titles = ["Year","Airline","Trip Time"]
print(make_table(titles,ans_4))
conn.close()

connection opened ...


GroupingError: column "c.Description" must appear in the GROUP BY clause or be used in an aggregate function
LINE 14:  c."Description" AS "Airline",
          ^


In [48]:
print(ans_4)

[(1987, 'American Airlines Inc.', Decimal('256.3271428571428571')), (1987, 'United Air Lines Inc.', Decimal('250.2101382488479263')), (1988, 'American Airlines Inc.', Decimal('246.7750724637681159')), (1988, 'United Air Lines Inc.', Decimal('247.5877598152424942')), (1989, 'American Airlines Inc.', Decimal('250.5437788018433180')), (1989, 'United Air Lines Inc.', Decimal('255.4851043865822191')), (1990, 'American Airlines Inc.', Decimal('253.6859344894026975')), (1990, 'United Air Lines Inc.', Decimal('254.0619114877589454')), (1991, 'American Airlines Inc.', Decimal('251.6377611351990540')), (1991, 'United Air Lines Inc.', Decimal('252.0872546701347836')), (1992, 'American Airlines Inc.', Decimal('248.0202058711399161')), (1992, 'United Air Lines Inc.', Decimal('249.5003646973012400')), (1993, 'American Airlines Inc.', Decimal('249.9389885807504078')), (1993, 'United Air Lines Inc.', Decimal('255.3960236432025793')), (1994, 'American Airlines Inc.', Decimal('251.8753669602348546')), (

##### Problem 5.	For the years 2002 to 2005, what is the ratio of carrier delay to elapsed travel time for each airline?

In [3]:
conn = DBConn()
q = """
--sql
SELECT
	carriers."Description",
	total_flights,
 	(carrier_delay/actual_travel_time::FLOAT)*100 carrier_delay_to_actual_travel_time,
 	(carrier_delay/scheduled_travel_time::FLOAT)*100 carrier_delay_to_scheduled_travel_time
FROM(
	SELECT
		"UniqueCarrier",
		count(*) total_flights,
		sum(COALESCE("CarrierDelay",0)) carrier_delay,
		sum("ActualElapsedTime") actual_travel_time,
		sum("CRSElapsedTime") scheduled_travel_time
	FROM flights
	WHERE "Year" >= 2002 AND
		"Year" <= 2005 AND 
		"ActualElapsedTime" IS NOT NULL AND 
		"CRSElapsedTime" IS NOT NULL
	GROUP BY "UniqueCarrier"
 ) sub
 INNER JOIN carriers
 ON sub."UniqueCarrier" = carriers."Code"
 ORDER BY carrier_delay_to_scheduled_travel_time DESC
 ;
"""
ans_2 = conn.exec(q)
print(ans_2)
conn.close()

connection opened ...
... connection closed.


| Airline                       | total flights | Carrier Delay : Actual Time | Carrier Delay : Scheduled Time |
|-------------------------------|-----------|-----------------------|--------------------|
|"Atlantic Southeast Airlines"	|   827770	|   4.21220220668667	|   4.06304872823098 |
|"Comair Inc."	                |   728359	|   4.20157846108717	|   4.04609468406112 |
|"Skywest Airlines Inc."	    |   1349626	|   3.84558131214308	|   3.72074115921868 |
|"American Eagle Airlines Inc."	|   1820414	|   2.46834946023174	|   2.46059032565379 |
|"Independence Air"	            |   669687	|   2.36632082048272	|   2.29166907122446 |
|"Alaska Airlines Inc."	        |   630560	|   2.28936929565487	|   2.2566713906672  |
|"Northwest Airlines Inc."	    |   1970265	|   1.93795258614886	|   1.94206729498834 |
|"Hawaiian Airlines Inc."	    |   103799	|   1.90166462858953	|   1.87090721407012 |
|"Expressjet Airlines Inc."	    |   1074944	|   1.58973862735903	|   1.61998129264762 |
|"AirTran Airways Corporation"	|   495523	|   1.52400065410015	|   1.51625822955084 |
|"Frontier Airlines Inc."	    |   53014	|   1.52439546315891	|   1.50189618285352 |
|"America West Airlines Inc. (Merged with US Airways 9/05. Stopped reporting 10/07.)"	|   768813	|   1.41595955812445	|   1.39156594181809 |
|"Delta Air Lines Inc."	        |   2687168	|   1.22959865780283	|   1.22939021315388 |
|"US Airways Inc. (Merged with America West 9/05. Reporting for both starting 10/07.)"	|   1739194	|   1.18512829714521	|   1.16945804472995 |
|"American Airlines Inc."	    |   2926333	|   1.12376052134875	|   1.11191833067806 |
|"ATA Airlines d/b/a ATA"	    |   186642	|   1.01907731549993	|   1.02120585233741 |
|"Southwest Airlines Co."	    |   3896607	|   1.0534639463494	    |   1.00619784487027 |
|"United Air Lines Inc."	    |   2146623	|   1.00772248368922	|   0.99692587190968 |
|"JetBlue Airways"	            |   265622	|   1.00898021545331	|   0.996318009919548 |
|"Continental Air Lines Inc."	|   1213153	|   0.75328152598789	|   0.754607711958256 |

##### Problem 6.	What airline spent the most and least average time taxiing (in and out) at JFK in 2006?

In [8]:
conn = DBConn()
q = """
--sql
SELECT
	carriers."Description",
    COUNT(*) total_flights,
	AVG(COALESCE("TaxiIn",0) + COALESCE("TaxiOut",0)) avg_taxi
FROM flights
INNER JOIN carriers
ON flights."UniqueCarrier" = carriers."Code"
WHERE "Year" = 2006 AND
    "TaxiIn" IS NOT NULL AND 
    "TaxiOut" IS NOT NULL AND (
        "Dest" = 'JFK' OR
        "Origin" = 'JFK'
    )
        
GROUP BY carriers."Description"
ORDER BY avg_taxi DESC
;
"""
ans_2 = conn.exec(q)
print(ans_2)
conn.close()

connection opened ...
[('Expressjet Airlines Inc.', 698, Decimal('42.9255014326647564')), ('Continental Air Lines Inc.', 1949, Decimal('38.2755259107234479')), ('Atlantic Southeast Airlines', 1020, Decimal('37.8343137254901961')), ('Comair Inc.', 31594, Decimal('37.4692030132303602')), ('Delta Air Lines Inc.', 25517, Decimal('37.3822941568366187')), ('US Airways Inc. (Merged with America West 9/05. Reporting for both starting 10/07.)', 5050, Decimal('34.0889108910891089')), ('Northwest Airlines Inc.', 3744, Decimal('33.8344017094017094')), ('American Airlines Inc.', 24681, Decimal('33.6701916453952433')), ('United Air Lines Inc.', 9797, Decimal('31.2978462794733082')), ('Mesa Airlines Inc.', 4902, Decimal('30.1472868217054264')), ('JetBlue Airways', 96604, Decimal('30.1264957972754751')), ('American Eagle Airlines Inc.', 15271, Decimal('29.7719206338812128')), ('ATA Airlines d/b/a ATA', 4, Decimal('24.0000000000000000'))]
... connection closed.


| Carrier | Total Flights | Average Taxi Time |
| ---- | ---- | ---- |
| Expressjet Airlines Inc. | 698 | 42.9255014326647564 |
| Continental Air Lines Inc. | 1949 | 38.2755259107234479 |
| Atlantic Southeast Airlines | 1020 | 37.8343137254901961 |
| Comair Inc. | 31594 | 37.4692030132303602 |
| Delta Air Lines Inc. | 25517 | 37.3822941568366187 |
| US Airways Inc. (Merged with America West 9/05. Reporting for both starting 10/07.) | 5050 | 34.0889108910891089 |
| Northwest Airlines Inc. | 3744 | 33.8344017094017094 |
| American Airlines Inc. | 24681 | 33.6701916453952433 |
| United Air Lines Inc. | 9797 | 31.2978462794733082 |
| Mesa Airlines Inc. | 4902 | 30.1472868217054264 |
| JetBlue Airways | 96604 | 30.1264957972754751 |
| American Eagle Airlines Inc. | 15271 | 29.7719206338812128 |
| ATA Airlines d/b/a ATA | 4 | 24.0000000000000000 |

##### Problem 7.	What were the top 10 routes (origin and destination city names and airport codes) most likely to have a weather delay of over 10 minutes in December 2005?
*Only consider routes with at least 20 flights that month.*


In [64]:
conn = DBConn()
q = """
--sql
SELECT
    "Origin" AS "Origin Code",
	oa.city AS "Origin City",
    "Dest" AS "Dest Code",
	da.city AS "Dest City",
    delays AS "Total Delays",
    total_flights AS "Total Flights",
    delayed_perc AS "Percent Delayed"
FROM (
    SELECT
        "Origin",
        "Dest",
        delays,
        total_flights,
	    (delays/total_flights::FLOAT)*100 delayed_perc
    FROM (
        SELECT
            "Origin",
            "Dest",
            COUNT(COALESCE("WeatherDelay", 0) > 10 THEN 1 ELSE 0 END) AS delays,
            COUNT(*) AS total_flights
        FROM flights
        WHERE "Year" = 2005 ANd "Month" = 12
        GROUP BY "Dest", "Origin"
    ) sub2
    WHERE
        total_flights > 20
) sub1
INNER JOIN airports as da
ON sub1."Dest" = da.iata
INNER JOIN airports as oa
ON sub1."Origin" = oa.iata

ORDER BY delayed_perc desc
LIMIT 100
;
"""
ans_3 = conn.exec(q)
titles = [
    "Origin Code",
    "Origin City", 
    "Dest Code", 
    "Dest City",
    "Weather Delays",
    "Total Flights",
    "Percent Delayed"]
print(make_table(titles,ans_3))
conn.close()

connection opened ...
{0: 11, 1: 21, 2: 9, 3: 13, 4: 14, 5: 13, 6: 15}
| Origin Code | Origin City           | Dest Code | Dest City     | Weather Delays | Total Flights | Percent Delayed |
| ----------- | --------------------- | --------- | ------------- | -------------- | ------------- | --------------- |
| CLE         | Cleveland             | ABE       | Allentown     | 90             | 90            | 100.0           |
| CLT         | Charlotte             | ABE       | Allentown     | 31             | 31            | 100.0           |
| CVG         | Covington             | ABE       | Allentown     | 79             | 79            | 100.0           |
| DFW         | Dallas-Fort Worth     | ABI       | Abilene       | 210            | 210           | 100.0           |
| AMA         | Amarillo              | ABQ       | Albuquerque   | 58             | 58            | 100.0           |
| ATL         | Atlanta               | ABQ       | Albuquerque   | 114            | 114        

##### Problem 8.	Flying Southwest, what is the year-over-year change in on-time travel rate from 2000 to 2007?

##### Problem 9.	What was the month-to-date on-time arrival rate for United for each date in September 2005?