# Analyzing On-Time Performance Statistics

1. Conduct exploratory data analysis on on-time performance metrics across all airlines for each origin-destination route
2. Develop a robust pipeline to extract useful metrics and novel network attributes
3. Filter the on-time data to the nodes in the current network
4. Clean the data and impute missing values
5. Develop a pipeline to fit a density estimate and extract key metrics
6. Define statistical processes to model the probability of delay over time for a single route

Data Extraction and Preprocessing 

Pull data --> Clean data --> Filter data to match edge table --> Define r.v.s --> Determine optimal bandwidth --> Apply KDE fit for all r.v.s



In [2]:
# import relevant libraries 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [26]:
# read in the dataset - DOT airline on-time performance statistics
on_time_data = pd.read_csv('Database/August 2018 Nationwide.csv')
# generate new column for the foreign key (route origin-destination key)
on_time_data['citypair'] = on_time_data['ORIGIN'].astype(str) + '-' + on_time_data['DEST'].astype(str)   # dataset 4
# read in the carrier lookup table (carrier code and description)
on_time_data_carriers = pd.read_csv('Database/Air Carriers.csv')
# extract corresponding code for Southwest Airlines
sw_carrier_code = on_time_data_carriers.loc[on_time_data_carriers['Description'] == "Southwest Airlines Co.: WN", 'Code']
print(sw_carrier_code)

362    19393
Name: Code, dtype: int64


In [27]:
# clean the data
on_time_data = on_time_data.drop(['OP_CARRIER_FL_NUM', 'DEP_DELAY', 'ARR_DELAY', 'ORIGIN', 'DEST', 'CRS_ELAPSED_TIME', 'ORIGIN_AIRPORT_ID','DEST_AIRPORT_ID','CARRIER_DELAY','WEATHER_DELAY','NAS_DELAY','LATE_AIRCRAFT_DELAY'], axis=1)

on_time_data

Unnamed: 0,FL_DATE,DEP_DELAY_NEW,ARR_DELAY_NEW,CANCELLED,ACTUAL_ELAPSED_TIME,citypair
0,8/1/2018,9.0,44.0,0,377.0,JFK-PHX
1,8/1/2018,29.0,53.0,0,309.0,PHX-EWR
2,8/1/2018,0.0,0.0,0,177.0,CLE-DFW
3,8/1/2018,44.0,43.0,0,303.0,SJU-DFW
4,8/1/2018,0.0,0.0,0,175.0,AUS-MIA
...,...,...,...,...,...,...
701347,8/31/2018,0.0,0.0,0,150.0,SLC-DFW
701348,8/31/2018,0.0,12.0,0,226.0,TUS-ORD
701349,8/31/2018,2.0,0.0,0,182.0,DFW-LAX
701350,8/31/2018,0.0,0.0,0,304.0,LAX-MCO


## Part I: Exploratory Data Analysis and Data Cleaning

In [11]:
# import edge table data and filter on-time statistics records accordingly

# forward slash avoids the invalid '\F' escape and works on every OS
edge_data = pd.read_csv('Database/FINAL_EDGE_TABLE.csv')
list_of_routes = edge_data['citypair'].to_list()

fltrd_on_time = on_time_data[on_time_data['citypair'].isin(list_of_routes)]
unique_city_pairs = fltrd_on_time['citypair'].nunique()

# check the amount of data accounted for; the edge table has 1240 original edges
n_edges = len(list_of_routes)
n_missing = n_edges - unique_city_pairs   # routes with no on-time records
print('amount of data in on time stats', unique_city_pairs)
print('Missing Data for', n_missing, 'routes;', 'proportion of missing data:', n_missing / n_edges)

amount of data in on time stats 1211
Missing Data for 29 routes; proportion of missing data: 0.02338709677419355
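
The count above says how many routes lack on-time records but not which ones. A minimal sketch of how the missing routes could be listed with a set difference is shown here; the toy route names and the variables `edge_routes`, `observed`, `missing_routes`, and `missing_share` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the edge table and the filtered on-time data (hypothetical values)
edge_routes = pd.Series(['ABQ-DEN', 'ABQ-LAS', 'TUS-SAN', 'XXX-YYY'], name='citypair')
observed = pd.DataFrame({'citypair': ['ABQ-DEN', 'ABQ-DEN', 'TUS-SAN', 'ABQ-LAS']})

# Routes present in the edge table that never appear in the on-time data
missing_routes = sorted(set(edge_routes) - set(observed['citypair']))

# Share of edge-table routes without any on-time record
missing_share = len(missing_routes) / edge_routes.nunique()

print(missing_routes)
print(missing_share)
```

In the notebook, the same pattern would presumably use `list_of_routes` and `fltrd_on_time['citypair']` in place of the toy inputs.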


In [17]:
# function to filter main dataframe by citypair id

from tqdm import tqdm
from joblib import Parallel, delayed


def filter_ontime_data(citypair_id): 
    # NOTE: citypair_id --> string key of the form 'ORIGIN-DEST'
    df = fltrd_on_time.loc[fltrd_on_time['citypair'] == citypair_id]
    return df


# count the number of records returned by the filter for each route

df_lengths = np.zeros(len(list_of_routes))
for i in tqdm(range(len(list_of_routes))):
    df_lengths[i] = len(filter_ontime_data(list_of_routes[i]))


100%|██████████| 1240/1240 [00:45<00:00, 27.32it/s]
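
The loop above rescans the full dataframe once per route, which is why it takes about 45 seconds. As a hedged alternative, the per-route record counts can be obtained in one pass with `groupby(...).size()`, then aligned to the route list with `reindex`. The toy dataframe `flights` and the list `routes` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for fltrd_on_time and list_of_routes (hypothetical values)
flights = pd.DataFrame({'citypair': ['ABQ-DEN', 'ABQ-DEN', 'TUS-SAN', 'ABQ-LAS', 'ABQ-DEN']})
routes = ['ABQ-DEN', 'ABQ-LAS', 'TUS-SAN', 'XXX-YYY']

# One pass over the data: count records per route, then align to the route list.
# Routes with no records get a count of 0 instead of being dropped.
route_counts = (
    flights.groupby('citypair')
    .size()
    .reindex(routes, fill_value=0)
)

print(route_counts)
```

The result is a Series indexed by route, so it also shows directly which routes have zero records.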



In [30]:
# compile data together to get aggregate table

on_time_fin = fltrd_on_time.groupby('citypair').agg({'DEP_DELAY_NEW':'mean','ARR_DELAY_NEW':'mean', 'CANCELLED':'mean'})
on_time_fin

citypair,DEP_DELAY_NEW,ARR_DELAY_NEW,CANCELLED
ABQ-BWI,8.644068,9.440678,0.063492
ABQ-DAL,9.102740,8.075862,0.006803
ABQ-DEN,18.624521,18.896154,0.003817
ABQ-HOU,4.595506,4.044944,0.000000
ABQ-LAS,15.618321,13.946565,0.015038
...,...,...,...
TUS-DEN,9.318919,10.494565,0.000000
TUS-LAS,7.196262,6.915888,0.009259
TUS-MDW,6.354839,7.967742,0.000000
TUS-SAN,9.724138,7.879310,0.000000
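
The aggregate table only covers routes that have on-time records, so the routes missing from the data still need values before they can be joined to the edge table. The sketch below shows one simple option, assumed here purely for illustration: reindex to the full route list and fill the gaps with the network-wide column means, while flagging which rows were imputed. The notebook does not state which imputation method is used, and the toy values and names (`agg`, `routes`, `full`) are hypothetical.

```python
import pandas as pd

# Toy aggregate table (hypothetical values) indexed by citypair
agg = pd.DataFrame(
    {'DEP_DELAY_NEW': [18.6, 15.6], 'CANCELLED': [0.004, 0.015]},
    index=pd.Index(['ABQ-DEN', 'ABQ-LAS'], name='citypair'),
)
routes = ['ABQ-DEN', 'ABQ-LAS', 'XXX-YYY']  # 'XXX-YYY' has no on-time records

# Align to the full route list; missing routes become NaN rows
full = agg.reindex(routes)

# Record which rows are about to be imputed so they can be audited later
full['imputed'] = full['DEP_DELAY_NEW'].isna()

# ASSUMPTION: mean imputation; fill each metric with its network-wide mean
full = full.fillna(agg.mean())

print(full)
```

Mean imputation pulls missing routes toward the network average, so a model-based alternative (for example, using similar routes) may be preferable if the missing routes are systematically different.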


## Part II: Kernel Density Estimation

Define the following random variables for each airline route $i \in$ `list_of_routes`:

1. $D_{i}$ ~ departure delay time (continuous r.v.) ~ KDE distribution
2. $A_{i}$ ~ arrival delay time (continuous r.v.) ~ KDE distribution
3. $T_{i}$ ~ flight duration (continuous r.v.) ~ KDE distribution
4. $C_{i}$ ~ number of cancellations among $n$ flights (discrete binomial r.v.), with $P(C_{i} = x) = \binom{n}{x} p^{x} (1 - p)^{n-x}$, where $p$ is the probability that a flight on route $i$ is canceled
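
As a small worked example of the cancellation variable, the sketch below estimates $p$ for one route as the sample mean of the `CANCELLED` flag and evaluates the binomial pmf with `scipy.stats.binom`. The toy array `cancelled` is an assumption standing in for one route's `CANCELLED` column.

```python
import numpy as np
from scipy.stats import binom

# Toy CANCELLED flags for one route (hypothetical values): 1 = canceled, 0 = flown
cancelled = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0])

n = cancelled.size
p_hat = cancelled.mean()  # estimate of p: the observed cancellation rate

# P(C = 2): probability of exactly two cancellations in n flights
pmf_two = binom.pmf(2, n, p_hat)

# P(C >= 1): probability of at least one cancellation in n flights
p_any = 1 - binom.pmf(0, n, p_hat)

print(p_hat, pmf_two, p_any)
```

The `'CANCELLED': 'mean'` column of the aggregate table is exactly this estimate of $p$ computed per route.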



### Methods for Optimal Bandwidth Determination for KDEs

The bandwidth is the key tuning parameter in kernel density estimation. A large bandwidth oversmooths the estimate (high bias), whereas a small bandwidth undersmooths it (high variance).
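
To make the trade-off concrete, the sketch below fits two KDEs to a synthetic bimodal sample and counts the local maxima of each estimate. The synthetic data, the two bandwidth factors, and the helper `count_modes` are assumptions chosen for illustration. Note that `scipy.stats.gaussian_kde` treats a scalar `bw_method` as a factor that multiplies the sample standard deviation.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Synthetic bimodal sample (hypothetical data): two well-separated clusters
sample = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1, 300)])
grid = np.linspace(-4, 10, 500)

# Small factor: wiggly, high-variance estimate; large factor: oversmoothed estimate
narrow = gaussian_kde(sample, bw_method=0.05)(grid)
wide = gaussian_kde(sample, bw_method=2.0)(grid)


def count_modes(density):
    """Count strict local maxima of a density evaluated on a grid."""
    interior = density[1:-1]
    return int(np.sum((interior > density[:-2]) & (interior > density[2:])))


print(count_modes(narrow), count_modes(wide))
```

The narrow fit produces many spurious bumps, while the wide fit merges the two real clusters into a single peak.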

Methodology:

We consider the following method for selecting the optimal bandwidth:

### Maximum Likelihood Cross Validation

Given a kernel function $K$, a bandwidth $h$, and observations $x_1, \dots, x_n$, consider the following objective function, in which the inner sum leaves out the point $x_i$ itself:

$$ MLCV(h) = \frac{1}{n} \sum_{i=1}^{n} \log\left[\sum_{j \neq i} K\left(\frac{x_j - x_i}{h}\right)\right] - \log[(n-1)h] $$

We select the bandwidth $h$ that maximizes this objective function.

In [None]:
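# create a general function to apply a KDE fit for a given bandwidth

from scipy.stats import gaussian_kde


def kde_dist_fit(x, h=0.3, pad=None, step=0.001):
    # x ~ vector of data
    # h ~ bandwidth factor; fixed placeholder until the MLCV selection is wired in
    # NOTE: scipy scales a scalar bandwidth by the sample standard deviation
    x = np.asarray(x, dtype=float)
    model = gaussian_kde(x, bw_method=h)

    # ASSUMPTION: the grid margins were left blank in the original cell;
    # default to three kernel standard deviations on each side
    if pad is None:
        pad = 3 * h * x.std(ddof=1)
    grid = np.arange(x.min() - pad, x.max() + pad, step)
    rslts = model.evaluate(grid)
    return grid, rslts


# example: departure delays on one route (ABQ-DEN chosen only for illustration)
x = filter_ontime_data('ABQ-DEN')['DEP_DELAY_NEW'].dropna()
grid, rslts = kde_dist_fit(x, step=0.1)

plt.subplot(121)
plt.plot(grid, rslts, linewidth=3)
plt.hist(x, bins=20, density=True)   # density=True puts the histogram on the KDE's scale
plt.title('KDE Fit to ABQ-DEN Departure Delays w/ Bandwidth Factor h')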



