# WAF 2023
Analysis of passenger airline timeliness and determinants of on-time performance.

In [2]:
# load relevant libraries
import shap
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn_pandas import gen_features, DataFrameMapper

In [6]:
# load the data
flights_original_df = pd.read_csv(
    "./Data/fa23_datachallenge.csv",
    header=0
)
print(f"The data has shape: {flights_original_df.shape}")
flights_original_df.head(10)

The data has shape: (890644, 49)


Unnamed: 0,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_NAME,ORIGIN,ORIGIN_CITY_NAME,...,AWND,PSUN,TSUN,AIRPORT_FLIGHTS_MONTH,AIRLINE_FLIGHTS_MONTH,AIRLINE_AIRPORT_FLIGHTS_MONTH,AVG_MONTHLY_PASS_AIRPORT,AVG_MONTHLY_PASS_AIRLINE,FLT_ATTENDANTS_PER_PASS,GROUND_SERV_PER_PASS
0,6,8,6,AA,N186US,1216,11298,Dallas Fort Worth Regional,DFW,"Dallas/Fort Worth, TX",...,6.04,,,25390.0,76419.0,12632.0,2907365.0,11744595.0,9.8e-05,0.000177
1,5,19,7,OO,N445SW,3643,11823,,FWA,"Fort Wayne, IN",...,,,,,,,,,,
2,12,18,3,MQ,N226NN,3303,11298,Dallas Fort Worth Regional,DFW,"Dallas/Fort Worth, TX",...,3.36,,,25322.0,26721.0,5416.0,2907365.0,1204766.0,0.000348,0.000107
3,1,2,3,YX,N408YX,4697,10785,,BTV,"Burlington, VT",...,,,,,,,,,,
4,3,21,4,DL,N986AT,2639,14771,San Francisco International,SFO,"San Francisco, CA",...,7.61,,,13989.0,84142.0,1146.0,1908862.0,12460183.0,0.000144,0.000149
5,10,25,5,WN,N206WN,8,11259,Dallas Love Field,DAL,"Dallas, TX",...,15.88,,,6261.0,115051.0,5727.0,673221.0,13382999.0,6.2e-05,9.9e-05
6,2,1,5,AA,N937AN,168,12889,McCarran International,LAS,"Las Vegas, NV",...,3.8,,,11500.0,70199.0,1059.0,1903352.0,11744595.0,9.8e-05,0.000177
7,9,15,7,WN,N224WN,4551,14122,Pittsburgh International,PIT,"Pittsburgh, PA",...,3.8,,,4113.0,107436.0,732.0,385767.0,13382999.0,6.2e-05,9.9e-05
8,10,29,2,F9,N205FR,244,11292,Stapleton International,DEN,"Denver, CO",...,12.3,,,22355.0,12581.0,2616.0,2743323.0,1857122.0,0.000116,7e-06
9,2,19,2,DL,N887DN,505,14869,Salt Lake City International,SLC,"Salt Lake City, UT",...,10.74,,,8345.0,67273.0,3086.0,1065782.0,12460183.0,0.000144,0.000149



Do some EDA first, before dropping any features.



### Features to drop due to data leakage or redundant information
* `TAIL_NUM` - Unique number that identifies each aircraft
* `OP_CARRIER_FL_NUM` - Flight number
* `ORIGIN_AIRPORT_ID` - Unique numeric identifier for the origin airport; expresses the same information as `ORIGIN`
* `ORIGIN_AIRPORT_NAME` - English name of the origin airport; expresses the same information as `ORIGIN`
* `ORIGIN_CITY_NAME` - English name of the origin airport's city; expresses the same information as `ORIGIN`
* `DEST_AIRPORT_ID` - Unique numeric identifier for the destination airport; expresses the same information as `DEST`
* `DEST_CITY_NAME` - English name of the destination airport's city; expresses the same information as `DEST`
* `DEP_TIME` - At prediction time we will have the planned departure time but not the actual one, so including this feature would leak future information to the model
* `ARR_TIME` - At prediction time we will have the planned arrival time but not the actual one, so including this feature would leak future information to the model
* `ARR_DELAY_NEW` - The arrival delay is not known at prediction time
* All delay-type columns - This information is not available ahead of the prediction moment
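
The drops listed above could be applied with a small helper along these lines. This is only a sketch: `drop_leaky_columns` is a hypothetical name, and the delay-type column names are an assumption (they follow the usual BTS on-time naming and are not shown in this section). `errors="ignore"` skips any name that is absent from the frame.

```python
import pandas as pd

# Columns named in the list above.
LEAKY_OR_REDUNDANT = [
    "TAIL_NUM", "OP_CARRIER_FL_NUM",
    "ORIGIN_AIRPORT_ID", "ORIGIN_AIRPORT_NAME", "ORIGIN_CITY_NAME",
    "DEST_AIRPORT_ID", "DEST_CITY_NAME",
    "DEP_TIME", "ARR_TIME", "ARR_DELAY_NEW",
]

# ASSUMPTION: delay-type column names as in the usual BTS on-time data;
# adjust to whatever the challenge file actually contains.
DELAY_TYPE_COLS = [
    "CARRIER_DELAY", "WEATHER_DELAY", "NAS_DELAY",
    "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY",
]


def drop_leaky_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the leaky or redundant columns."""
    # errors="ignore" skips names that are not present in df
    return df.drop(columns=LEAKY_OR_REDUNDANT + DELAY_TYPE_COLS, errors="ignore")


# Toy frame for illustration only (not the challenge data)
demo = pd.DataFrame({
    "ORIGIN": ["DFW", "SFO"],
    "TAIL_NUM": ["N186US", "N986AT"],
    "DEP_TIME": [1015, 1820],
    "WEATHER_DELAY": [0.0, 12.0],
    "DISTANCE": [1464, 2139],
})
cleaned = drop_leaky_columns(demo)
print(list(cleaned.columns))
```

If the prediction target is derived from `ARR_DELAY_NEW`, it would need to be set aside before calling a helper like this.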


### Features to drop due to feature engineering
* Use `DEP_TIME_BLK` instead of `CRS_DEP_TIME`, and `ARR_TIME_BLK` instead of `CRS_ARR_TIME`. The time blocks encode the same information while reducing the dimensionality of the feature space, and they represent departure and arrival times more meaningfully
* Drop all observations for cancelled flights, using the `CANCELLED` flag variable
* Drop `CANCELLATION_CODE` as well
* Drop `DISTANCE_GROUP`, a binned representation of the flight distance. Binning might help in some cases, but it loses a significant amount of information compared with `DISTANCE`. We also don't know the bin cutoffs, which could introduce bias into our analysis
* Drop `AIRLINE_AIRPORT_FLIGHTS_MONTH` if it is highly correlated with `AIRPORT_FLIGHTS_MONTH` and `AIRLINE_FLIGHTS_MONTH`
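
A minimal sketch of the steps above follows. `apply_feature_engineering_drops` is a hypothetical name, and the 0.9 correlation cutoff is an illustrative assumption, since the notes only say "highly correlated". The sketch also assumes `CANCELLED` is a 0/1 flag, and it drops the flag once the cancelled rows are gone because it is then constant.

```python
import pandas as pd


def apply_feature_engineering_drops(df: pd.DataFrame,
                                    corr_threshold: float = 0.9) -> pd.DataFrame:
    # Keep only flights that were not cancelled (ASSUMPTION: flag is 0/1)
    out = df[df["CANCELLED"] == 0]

    # Drop columns superseded by the time blocks or by DISTANCE;
    # CANCELLED is constant after the filter, so it is dropped too.
    out = out.drop(
        columns=["CANCELLED", "CANCELLATION_CODE",
                 "CRS_DEP_TIME", "CRS_ARR_TIME", "DISTANCE_GROUP"],
        errors="ignore",
    )

    # Conditional drop: only if the airline-airport count is highly
    # correlated with BOTH of the other monthly counts.
    target = "AIRLINE_AIRPORT_FLIGHTS_MONTH"
    others = ["AIRPORT_FLIGHTS_MONTH", "AIRLINE_FLIGHTS_MONTH"]
    if all(col in out.columns for col in [target] + others):
        corr = out[[target] + others].corr().loc[target, others].abs()
        if (corr > corr_threshold).all():
            out = out.drop(columns=target)
    return out


# Toy frame for illustration only (not the challenge data)
demo = pd.DataFrame({
    "CANCELLED": [0, 0, 1, 0, 0],
    "CANCELLATION_CODE": [None, None, "B", None, None],
    "DISTANCE": [500, 1200, 300, 2100, 800],
    "DISTANCE_GROUP": [2, 5, 2, 9, 4],
    "AIRPORT_FLIGHTS_MONTH": [100, 200, 300, 400, 500],
    "AIRLINE_FLIGHTS_MONTH": [1000, 2100, 2900, 4000, 5100],
    "AIRLINE_AIRPORT_FLIGHTS_MONTH": [10, 20, 30, 41, 50],
})
engineered = apply_feature_engineering_drops(demo)
print(engineered.shape)
print(list(engineered.columns))
```

Before settling on a cutoff, it is worth inspecting the full correlation matrix during EDA.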


## Exploratory Data Analysis
[...]