## Setup
This notebook contains the code to create a database in PostgreSQL and import flight data. In production, I would not normally use a Jupyter notebook, but it makes it easy to present my code in this demo.

The first cell below contains all the necessary imports and loads our environment variables. Run the cells in order, or press the Run All button up top; otherwise you may get errors.

In [None]:
import os
from DBConn import DBConn
from dotenv import load_dotenv
import pandas as pd
import numpy as np
from sqlalchemy.types import String


PROJECT_ROOT = os.path.dirname(os.path.realpath("__file__")) # Change if needed; "__file__" is a plain string here, so this resolves to the current working directory
print('Project Root:', PROJECT_ROOT)
DATA_DIR = os.path.join(PROJECT_ROOT,'data') # modify as needed or move data to project_root\data\ (which is in .gitignore)
print('Data Directory:', DATA_DIR)

# store data paths for later use -- modify filenames if yours are different
FLIGHT_DATA_DIR = os.path.join(DATA_DIR, 'FlightDataUncompressed')
FLIGHT_DATA_FNAMES = sorted(os.listdir(FLIGHT_DATA_DIR))  # listdir order is arbitrary; sort so [0] and [-1] are the first and last files
FLIGHT_DATA_PATHS = []
for file_name in FLIGHT_DATA_FNAMES:
    FLIGHT_DATA_PATHS.append(os.path.join(FLIGHT_DATA_DIR, file_name))
AIRPORTS_PATH = os.path.join(DATA_DIR, 'airports.csv')
CARRIERS_PATH = os.path.join(DATA_DIR, 'carriers.csv')

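# load environment variables storing database connection info
# load_dotenv expects the path to the .env file itself; a bare directory path loads nothing
load_dotenv(os.path.join(PROJECT_ROOT, '.env'))   #values can now be accessed using os.getenv(KEY_NAME)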

## EDA with Pandas -- determine structure of data for database schema


In [None]:
# ingest to pandas
airports_df = pd.read_csv(AIRPORTS_PATH)
carriers_df = pd.read_csv(CARRIERS_PATH)
# splitext drops the extension; rstrip(".csv") strips any trailing '.', 'c', 's' or 'v' characters instead
first_name = os.path.splitext(FLIGHT_DATA_FNAMES[0])[0]
last_name = os.path.splitext(FLIGHT_DATA_FNAMES[-1])[0]
print(f'Found {len(FLIGHT_DATA_FNAMES)} years of flight data: {first_name} - {last_name}')
print(f'looking at {first_name}')
flight_df = pd.read_csv(FLIGHT_DATA_PATHS[0])

def print_max_str_len(df: pd.DataFrame):
    """
    Print the max string length of each column in a pandas DataFrame
    """
    # DataFrame.iteritems() was removed in pandas 2.0; items() is the replacement
    for index, (col_name, col) in enumerate(df.items()):
        try:
            # the .str accessor raises AttributeError on non-string columns
            print(index, col_name, 'str', col.str.len().max())
        except AttributeError:
            # lazy fallback: measure ints/floats by the length of their string form
            print(index, col_name, 'int/float', max(col.map(str).apply(len)))
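
To see what this kind of helper reports, here is a small self-contained sketch on a made-up frame. The column names and values are invented for illustration and are not taken from the flight data, and the function name `max_str_len` is hypothetical. It returns the lengths as a dict instead of printing them, which is only a choice to make the result easy to inspect.

```python
import pandas as pd


def max_str_len(df: pd.DataFrame) -> dict:
    """Return the longest string representation found in each column."""
    lengths = {}
    for col_name, col in df.items():
        # astype(str) covers numeric columns too; missing values are dropped first
        lengths[col_name] = int(col.dropna().astype(str).str.len().max())
    return lengths


# Hypothetical sample data, not the real airports file
sample_df = pd.DataFrame({
    "code": ["AAA", "BBB", "CCC"],
    "name": ["Short", "Medium name", "A much longer name"],
    "elevation": [1026, 672, 125],
})
print(max_str_len(sample_df))
```

The largest value per column is what a `VARCHAR(n)` width would need to cover for that sample.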

In [None]:
## Airports Schema
# print(airports_df.head(5))
print(airports_df.info())
print_max_str_len(airports_df)


In [None]:
## Carriers Schema
carriers_df.head()
# carriers_df.info()
# print_max_str_len(carriers_df)


In [None]:
## Flights Schema
flight_df.info()
# print_max_str_len(flight_df)

In [None]:
# flight_df.UniqueCarrier[0]
# flight_df.head() # uncomment to see the first five rows. You'll also need to comment out the print below, since only a cell's last expression is displayed.
#print(flight_df.info(verbose=True))
print(flight_df.ArrTime.describe())

Ok, so we've now looked at the columns in each dataset and can see that we'll need three tables, with relations between:
- `flights.Origin`, `flights.Dest` to `airports.iata`
- `flights.UniqueCarrier` to `carriers.Code`
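
Before turning these relations into foreign keys, it can help to confirm that every key in the flight data actually has a match in the lookup tables. The following is a minimal sketch on miniature, made-up frames; the helper name `missing_keys` and all values are assumptions for illustration only.

```python
import pandas as pd


def missing_keys(child: pd.Series, parent: pd.Series) -> set:
    """Return child values that have no match in the parent key column."""
    return set(child.dropna().unique()) - set(parent.dropna().unique())


# Hypothetical miniature versions of the three tables
airports = pd.DataFrame({"iata": ["ATL", "ORD", "LAX"]})
carriers = pd.DataFrame({"Code": ["AA", "DL"]})
flights = pd.DataFrame({
    "Origin": ["ATL", "ORD", "XXX"],
    "Dest": ["LAX", "ATL", "ORD"],
    "UniqueCarrier": ["AA", "DL", "ZZ"],
})

orphan_origins = missing_keys(flights["Origin"], airports["iata"])
orphan_dests = missing_keys(flights["Dest"], airports["iata"])
orphan_carriers = missing_keys(flights["UniqueCarrier"], carriers["Code"])
print(orphan_origins, orphan_dests, orphan_carriers)
```

Any non-empty set here points at rows that a foreign key constraint would reject on insert.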

Now let's get the tables created in SQL:


### Create Flight Data Database
The database in the .env file is just an entry point. We are now going to create a new database and load in our flight data.

#### Create database connection

In [None]:
### Create Database
DATABASE_NAME = 'flight_data'

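# Create database if it doesn't exist already
# The SELECT below only generates the CREATE DATABASE statement as text when the database
# is missing (the psql idiom pipes it to \gexec), so it only takes effect if DBConn.exec
# executes the returned statement.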
sql = f'''--sql
SELECT 'CREATE DATABASE {DATABASE_NAME}'
WHERE NOT EXISTS (SELECT FROM pg_database WHERE datname = '{DATABASE_NAME}');
'''

conn = DBConn()
result = conn.exec(sql)
print("Database created successfully or already exists....")
DBConn.set_database(DATABASE_NAME)
conn.close()
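
The cell above hands the existence check to `DBConn`, whose internals are not shown in this notebook. As a rough sketch of the same "create if missing" idea with a plain DB-API cursor, something like the following could work. It assumes a psycopg2-style cursor (`%s` placeholders) on a connection in autocommit mode, because PostgreSQL does not allow `CREATE DATABASE` inside a transaction block. The function name and the `FakeCursor` stand-in are hypothetical and exist only so the example runs without a server.

```python
import re


def create_database_if_missing(cursor, name: str) -> bool:
    """Issue CREATE DATABASE only when pg_database has no entry for `name`.

    Returns True if the statement was issued, False if the database existed.
    """
    # Identifiers cannot be bound as query parameters, so restrict the name
    # to a conservative pattern before interpolating it into the statement.
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", name):
        raise ValueError(f"unsafe database name: {name!r}")
    cursor.execute("SELECT 1 FROM pg_database WHERE datname = %s", (name,))
    if cursor.fetchone() is not None:
        return False
    cursor.execute(f"CREATE DATABASE {name}")
    return True


class FakeCursor:
    """Stand-in for a real cursor (assumption: only execute/fetchone are needed)."""

    def __init__(self, existing):
        self.existing = set(existing)
        self.statements = []
        self._row = None

    def execute(self, query, params=None):
        self.statements.append(query)
        if query.startswith("SELECT"):
            self._row = (1,) if params[0] in self.existing else None
        elif query.startswith("CREATE DATABASE"):
            self.existing.add(query.split()[-1])

    def fetchone(self):
        return self._row


cur = FakeCursor(existing=["postgres"])
created_first = create_database_if_missing(cur, "flight_data")
created_again = create_database_if_missing(cur, "flight_data")
print(created_first, created_again)
```

With a real driver, the `FakeCursor` would be replaced by a cursor from an autocommit connection to the entry-point database.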


## Create Tables

In [None]:
## Create carriers, airports, and flights tables
conn = DBConn()
print('accessing', conn.get_database())
sql = f'''--sql
DROP TABLE IF EXISTS flights;
DROP TABLE IF EXISTS carriers;
--DROP TABLE airports;

--sql
 CREATE TABLE IF NOT EXISTS carriers (
    "Code" VARCHAR(6) NOT NULL,
    "Description" VARCHAR(50),
    CONSTRAINT carriers_pkey PRIMARY KEY ("Code")
);

--sql -- comment included for code highlighting in IDE
CREATE TABLE IF NOT EXISTS airports (
    iata VARCHAR(4) NOT NULL,
    airport VARCHAR(50) NOT NULL,
    state CHAR(2) NOT NULL,
    country VARCHAR(40) NOT NULL,
    lat DOUBLE PRECISION NOT NULL,
    long DOUBLE PRECISION NOT NULL,
    CONSTRAINT airport_pkey PRIMARY KEY (iata)
);

CREATE TABLE IF NOT EXISTS flights (
 "Year" INTEGER NOT NULL,  
 "Month" INTEGER NOT NULL,  
 "DayofMonth" INTEGER NOT NULL,  
 "DayOfWeek" INTEGER NOT NULL,  
 "DepTime" INTEGER,
 "CRSDepTime" INTEGER NOT NULL,  
 "ArrTime" INTEGER,
 "CRSArrTime" INTEGER NOT NULL,  
 "UniqueCarrier" VARCHAR(6) NOT NULL,
 "FlightNum" INTEGER NOT NULL,  
 "TailNum" TEXT, -- tail numbers are alphanumeric, so INTEGER would reject them
 "ActualElapsedTime" INTEGER,
 "CRSElapsedTime" INTEGER NOT NULL,  
 "AirTime" INTEGER,
 "ArrDelay" INTEGER,
 "DepDelay" INTEGER,
 "Origin" VARCHAR(3) NOT NULL, 
 "Dest" VARCHAR(3) NOT NULL, 
 "Distance" INTEGER,
 "TaxiIn" INTEGER,
 "TaxiOut" INTEGER,
 "Cancelled" INTEGER NOT NULL,  
 "CancellationCode" VARCHAR(1), -- cancellation codes are single letters, not numbers
 "Diverted" INTEGER NOT NULL,  
 "CarrierDelay" INTEGER,
 "WeatherDelay" INTEGER,
 "NASDelay" INTEGER,
 "SecurityDelay" INTEGER,
 "LateAircraftDelay" INTEGER,
 
 PRIMARY KEY ("Year", "Month", "DayofMonth", "UniqueCarrier", "CRSDepTime", "FlightNum"),

 CONSTRAINT origin_airport_fk
    FOREIGN KEY ("Origin")
    REFERENCES airports(iata)
    ON DELETE CASCADE,

 CONSTRAINT dest_airport_fk 
    FOREIGN KEY ("Dest")
    REFERENCES airports(iata)
    ON DELETE CASCADE,

 CONSTRAINT uniquecarrier_carriers_fk 
    FOREIGN KEY ("UniqueCarrier")
    REFERENCES carriers("Code")
    ON DELETE CASCADE

);
'''
conn.exec(sql)
print('create successful or already created')
conn.close()
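
The composite primary key on `flights` means an insert will fail if two rows share all six key columns, so it may be worth checking a frame for collisions before loading it. Below is a minimal sketch with pandas; the helper name `duplicate_key_rows` and the sample rows are made up for illustration, and only the key column names come from the schema above.

```python
import pandas as pd

# Key columns copied from the PRIMARY KEY clause of the flights table
PK_COLS = ["Year", "Month", "DayofMonth", "UniqueCarrier", "CRSDepTime", "FlightNum"]


def duplicate_key_rows(df: pd.DataFrame, key_cols=PK_COLS) -> pd.DataFrame:
    """Return every row whose key columns collide with another row."""
    return df[df.duplicated(subset=key_cols, keep=False)]


# Hypothetical rows: the first two share all six key columns
sample = pd.DataFrame({
    "Year": [1987, 1987, 1987],
    "Month": [10, 10, 10],
    "DayofMonth": [14, 14, 15],
    "UniqueCarrier": ["AA", "AA", "AA"],
    "CRSDepTime": [730, 730, 730],
    "FlightNum": [1451, 1451, 1451],
    "Origin": ["ATL", "ORD", "ATL"],
})
dupes = duplicate_key_rows(sample)
print(len(dupes))
```

An empty result suggests the chosen key is unique for that file; a non-empty one shows which rows would violate the constraint.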

## Load Data