# 1. Ask an interesting question

1. Delay binary classifier (yes/no, delayed 15 minutes or more)
- Isolate the key features that predict delays so that airlines can focus their efforts on their weakest points
- Account for uncontrollable variables (e.g. unpredicted adverse weather) and prepare to act accordingly
2. Additional questions:
- When is the best time of day/day of week/time of year to fly to minimize delays?
- Do older planes suffer more delays?
- How does the number of people flying between different locations change over time?
- How well does weather predict plane delays?
- Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
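As a minimal sketch of the classifier's target, the yes/no label can be derived from the arrival delay in minutes. The sample rows later in this notebook suggest the dataset's own indicator `ARR_DEL15` is set when `ARR_DELAY` is 15 or more, so the threshold below follows that convention. The `IS_DELAYED` column name and the toy values are assumptions made for illustration only.

```python
import pandas as pd

# Toy arrival delays in minutes (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({"ARR_DELAY": [-16.0, 0.0, 14.0, 15.0, 82.0]})

# Binary target: 1 if the flight arrived 15 or more minutes late, else 0
# (IS_DELAYED is a hypothetical column name)
toy["IS_DELAYED"] = (toy["ARR_DELAY"] >= 15).astype(int)

print(toy)
```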

___

# 2. Get the data

### On-Time : Reporting Carrier On-Time Performance (1987-present)

Source: https://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&DB_URL=Mode_ID=1&Mode_Desc=Aviation&Subject_ID2=0

<em>Note</em>: Over time both the code and the name of a carrier may change and the same code or name may be assumed by a different airline. To ensure that you are analyzing data from the same airline, TranStats provides four airline-specific variables that identify one and only one carrier or its entity: Airline ID (AirlineID), Unique Carrier Code (UniqueCarrier), Unique Carrier Name (UniqueCarrierName), and Unique Entity (UniqCarrierEntity). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation. US Airways and America West started to report combined on-time data in January 2006 and combined traffic and financial data in October 2007 following their 2005 merger announcement. Delta and Northwest began reporting jointly in January 2010 following their 2008 merger announcement. Continental Micronesia was combined into Continental Airlines in December 2010 and joint reporting began in January 2011. Atlantic Southeast and ExpressJet began reporting jointly in January 2012. United and Continental began reporting jointly in January 2012 following their 2010 merger announcement. Endeavor (9E) operated as Pinnacle prior to August 2013. Envoy (MQ) operated as American Eagle prior to April 2014. Southwest (WN) and AirTran (FL) began reporting jointly in January 2015 following their 2011 merger announcement. American (AA) and US Airways (US) began reporting jointly as AA in July 2015 following their 2013 merger announcement. Alaska (AS) and Virgin America (VX) began reporting jointly as AS in April 2018 following their 2016 merger announcement.

        
- **AC_Types**:
    - **Summary**:
        - Reporting Carrier On-Time Performance (1987-present)
    - **Description**:
        - Reporting carriers are required to (or voluntarily) report on-time data for flights they operate: on-time arrival and departure data for non-stop domestic flights by month and year, by carrier and by origin and destination airport. Includes scheduled and actual departure and arrival times, canceled and diverted flights, taxi-out and taxi-in times, causes of delay and cancellation, air time, and non-stop distance. Use Download for individual flight data.
    - **File**:
        - YYMM_123456789_T_ONTIME_REPORTING.zip *(YY = Year ; MM = Month)*

In [1]:
# Import libraries to be used

# Warning messages display
## import warnings
## warnings.filterwarnings(action='ignore') # https://docs.python.org/3/library/warnings.html#the-warnings-filter

# Directories/Files management
import os.path
## from zipfile import ZipFile # Not needed so far

# Timing
import time

# Memory monitoring
%load_ext memory_profiler
### Use '%memit' to check at each point

# Data analysis and wrangling
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None) # Show all columns in DataFrames
## pd.set_option('display.max_rows', None) # It greatly slows down the output display and freezes the kernel

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot') # choose a style: 'plt.style.available'
sns.set_theme(context='notebook',
              style="darkgrid") # {darkgrid, whitegrid, dark, white, ticks}
palette = sns.color_palette("flare", as_cmap=True)
import altair as alt

# Machine Learning
## from sklearn.[...] import ...

In [2]:
t0 = time.perf_counter() 

In [3]:
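# Detect Operating System running and manage paths accordingly

if os.name == 'nt': # Windows
    root = r"C:\Users\turge\CompartidoVM\0.TFM"
    print("Running on Windows.")
elif os.name == 'posix': # Ubuntu (any POSIX system matches this branch)
    root = "/home/dsc/shared/0.TFM"
    print("Running on Ubuntu.")
else: # Fail early instead of hitting a NameError on `root` below
    raise OSError("Unsupported operating system: " + os.name)
print("root path\t", root)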

Running on Windows.
root path	 C:\Users\turge\CompartidoVM\0.TFM


___

### SAMPLE FILE

Single file import (i.e. month-sized database)

Since each file is considerably large, and many of them will later have to be combined, a first exploration is done on the first 10,000 rows only.

Let's decide which columns to keep, and skip the rest.
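The sampling step described above could look roughly like the following sketch. `pd.read_csv` accepts `nrows` to stop after a given number of rows and `usecols` to keep only selected columns. The helper name `read_sample` and the in-memory demo data are assumptions made for illustration; the original cell is not part of this export.

```python
import io
import pandas as pd

def read_sample(source, nrows=10_000, usecols=None, **read_csv_kwargs):
    """Read only the first `nrows` rows, optionally restricted to `usecols`."""
    return pd.read_csv(source, nrows=nrows, usecols=usecols, **read_csv_kwargs)

# Tiny in-memory stand-in for one monthly file (hypothetical values)
demo_csv = io.StringIO(
    "YEAR,MONTH,ORIGIN,DEST,DEP_DELAY,DIV5_TAIL_NUM\n"
    "2019,1,ATL,SGF,6.0,\n"
    "2019,1,TYS,ATL,25.0,\n"
    "2019,1,LGA,BTV,,\n"
)

keep = ["YEAR", "MONTH", "ORIGIN", "DEST", "DEP_DELAY"]
sample = read_sample(demo_csv, nrows=2, usecols=keep)
print(sample.shape)
print(sample.columns.tolist())
```

For a real monthly file, the same call would take the file path and, as elsewhere in this notebook, `encoding='latin1'` passed through the keyword arguments.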

___

### MULTIPLE FILES

Next, let's import multiple files and concatenate them into a single DataFrame.
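A minimal sketch of the concatenation step, assuming every monthly file shares the same columns. The helper name `concat_monthly_files` and the demo file names are hypothetical; the real monthly files follow the naming pattern given in the dataset description.

```python
import os
import tempfile
import pandas as pd

def concat_monthly_files(paths):
    """Read each monthly CSV and stack them into a single DataFrame."""
    frames = [pd.read_csv(p) for p in paths]
    # ignore_index=True rebuilds a continuous 0..n-1 index across all months
    return pd.concat(frames, axis=0, ignore_index=True)

# Demo with two tiny stand-in files (hypothetical names and values)
tmp_dir = tempfile.mkdtemp()
paths = []
for month, rows in [(1, [("ATL", 6.0), ("TYS", 25.0)]), (2, [("BWI", -10.0)])]:
    path = os.path.join(tmp_dir, f"demo_2019_{month:02d}.csv")
    month_df = pd.DataFrame(rows, columns=["ORIGIN", "DEP_DELAY"]).assign(MONTH=month)
    month_df.to_csv(path, index=False)
    paths.append(path)

year_df = concat_monthly_files(paths)
print(year_df)
```

Collecting the frames in a list and calling `pd.concat` once is generally cheaper than concatenating inside the loop, because each `concat` call copies the data accumulated so far.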

___

#### Run only the first time to read the 12 individual monthly files and generate the global file (entire year 2019)

##### 1. Read individual month files

##### 2. Generate file (2019)

___

In [4]:
%memit

peak memory: 133.41 MiB, increment: 0.22 MiB


In [5]:
csv_path = os.path.join(root,
                        "Output_Data",
                        "US_DoT",
                        "AL_OTP_MVP_Preprocessed_19_v1.csv")
csv_path

'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Output_Data\\US_DoT\\AL_OTP_MVP_Preprocessed_19_v1.csv'

In [27]:
%%time

# Load the file in chunks to prevent memory-saturation errors:

# Read only the header row to get the column names
cols = pd.read_csv(csv_path, encoding='latin1', nrows=1).columns

chunks_list = []
chunks = pd.read_csv(csv_path,
                     encoding='latin1',
#                      nrows=1e6, # Fail-safe: in case the file is inadvertently too big
                     chunksize=1_000_000, # Rows per chunk
                     usecols=cols, # All columns for now; narrow this list to load fewer
                     low_memory=False)

for i, chunk in enumerate(chunks):
    if i == 13: # Fail-safe: for debugging purposes only
        break
    chunks_list.append(chunk)

    
df = pd.concat(chunks_list, axis=0)
del chunks_list

df.sample(5)

Wall time: 2min 51s


Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_FIPS,ORIGIN_STATE_NM,ORIGIN_WAC,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_FIPS,DEST_STATE_NM,DEST_WAC,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DELAY_NEW,DEP_DEL15,DEP_DELAY_GROUP,DEP_TIME_BLK,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DELAY_NEW,ARR_DEL15,ARR_DELAY_GROUP,ARR_TIME_BLK,CANCELLED,CANCELLATION_CODE,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,FIRST_DEP_TIME,TOTAL_ADD_GTIME,LONGEST_ADD_GTIME,DIV_AIRPORT_LANDINGS,DIV_REACHED_DEST,DIV_ACTUAL_ELAPSED_TIME,DIV_ARR_DELAY,DIV_DISTANCE,DIV1_AIRPORT,DIV1_AIRPORT_ID,DIV1_AIRPORT_SEQ_ID,DIV1_WHEELS_ON,DIV1_TOTAL_GTIME,DIV1_LONGEST_GTIME,DIV1_WHEELS_OFF,DIV1_TAIL_NUM,DIV2_AIRPORT,DIV2_AIRPORT_ID,DIV2_AIRPORT_SEQ_ID,DIV2_WHEELS_ON,DIV2_TOTAL_GTIME,DIV2_LONGEST_GTIME,DIV2_WHEELS_OFF,DIV2_TAIL_NUM,DIV3_AIRPORT,DIV3_AIRPORT_ID,DIV3_AIRPORT_SEQ_ID,DIV3_WHEELS_ON,DIV3_TOTAL_GTIME,DIV3_LONGEST_GTIME,DIV3_WHEELS_OFF,DIV3_TAIL_NUM,DIV4_AIRPORT,DIV4_AIRPORT_ID,DIV4_AIRPORT_SEQ_ID,DIV4_WHEELS_ON,DIV4_TOTAL_GTIME,DIV4_LONGEST_GTIME,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM
2521138,2019,2,5,25,6,2019-05-25,DL,19790,DL,N341NW,1960,11278,1127805,30852,DCA,"Washington, DC",VA,51,Virginia,38,11433,1143302,31295,DTW,"Detroit, MI",MI,26,Michigan,43,600,553.0,-7.0,0.0,0.0,-1.0,0600-0659,10.0,603.0,710.0,7.0,733,717.0,-16.0,0.0,0.0,-2.0,0700-0759,0.0,,0.0,93.0,84.0,67.0,1.0,405.0,2,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3966327,2019,3,7,1,1,2019-07-01,NK,20416,NK,N695NK,1095,14100,1410005,34100,PHL,"Philadelphia, PA",PA,42,Pennsylvania,23,13204,1320402,31454,MCO,"Orlando, FL",FL,12,Florida,33,1607,1611.0,4.0,4.0,0.0,0.0,1600-1659,14.0,1625.0,1821.0,13.0,1849,1834.0,-15.0,0.0,0.0,-1.0,1800-1859,0.0,,0.0,162.0,143.0,116.0,1.0,861.0,4,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1289565,2019,1,3,7,4,2019-03-07,AA,19805,AA,N997NN,2576,13930,1393007,30977,ORD,"Chicago, IL",IL,17,Illinois,41,10821,1082106,30852,BWI,"Baltimore, MD",MD,24,Maryland,35,1845,1844.0,-1.0,0.0,0.0,-1.0,1800-1859,14.0,1858.0,2118.0,7.0,2139,2125.0,-14.0,0.0,0.0,-1.0,2100-2159,0.0,,0.0,114.0,101.0,80.0,1.0,621.0,3,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6098830,2019,4,10,4,5,2019-10-04,G4,20368,G4,229NV,429,14761,1476107,34761,SFB,"Sanford, FL",FL,12,Florida,33,15412,1541205,35412,TYS,"Knoxville, TN",TN,47,Tennessee,54,2111,2117.0,6.0,6.0,0.0,0.0,2100-2159,27.0,2144.0,2251.0,7.0,2243,2258.0,15.0,15.0,1.0,1.0,2200-2259,0.0,,0.0,92.0,101.0,67.0,1.0,511.0,3,0.0,0.0,9.0,0.0,6.0,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
653541,2019,1,2,19,2,2019-02-19,9E,20363,9E,N8894A,5344,10821,1082106,30852,BWI,"Baltimore, MD",MD,24,Maryland,35,14492,1449202,34492,RDU,"Raleigh/Durham, NC",NC,37,North Carolina,36,1627,1617.0,-10.0,0.0,0.0,-1.0,1600-1659,16.0,1633.0,1728.0,5.0,1739,1733.0,-6.0,0.0,0.0,-1.0,1700-1759,0.0,,0.0,72.0,76.0,55.0,1.0,255.0,2,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [28]:
df.shape

(7422037, 109)

In [29]:
%memit

peak memory: 2721.94 MiB, increment: 0.66 MiB


There are some missing values. Let's delve further into them:

In [30]:
%%time

# Absolute & Relative frequency of missing values by column:
pd.set_option('display.max_rows', 110) # Enough rows to list all 109 columns
na_counts = df.isna().sum() # Computed once and reused for both measures
missing = pd.DataFrame([na_counts, na_counts / len(df) * 100], index=['Absolute', 'Relative']).T.sort_values(by='Relative', ascending=False)
missing

Wall time: 19.7 s


Unnamed: 0,Absolute,Relative
DIV5_TAIL_NUM,7422037.0,100.0
DIV4_TAIL_NUM,7422037.0,100.0
DIV4_AIRPORT,7422037.0,100.0
DIV4_AIRPORT_ID,7422037.0,100.0
DIV4_AIRPORT_SEQ_ID,7422037.0,100.0
DIV4_TOTAL_GTIME,7422037.0,100.0
DIV4_LONGEST_GTIME,7422037.0,100.0
DIV4_WHEELS_OFF,7422037.0,100.0
DIV4_WHEELS_ON,7422037.0,100.0
DIV5_AIRPORT_ID,7422037.0,100.0


In [31]:
# Show which columns present more than 5% of missing data:
empty_df = missing[missing['Relative'] > 5]
empty_df

Unnamed: 0,Absolute,Relative
DIV5_TAIL_NUM,7422037.0,100.0
DIV4_TAIL_NUM,7422037.0,100.0
DIV4_AIRPORT,7422037.0,100.0
DIV4_AIRPORT_ID,7422037.0,100.0
DIV4_AIRPORT_SEQ_ID,7422037.0,100.0
DIV4_TOTAL_GTIME,7422037.0,100.0
DIV4_LONGEST_GTIME,7422037.0,100.0
DIV4_WHEELS_OFF,7422037.0,100.0
DIV4_WHEELS_ON,7422037.0,100.0
DIV5_AIRPORT_ID,7422037.0,100.0


In [32]:
%%time

# Drop every column with more than 90% of missing values:
empty90_df = missing[missing['Relative'] > 90]
empty90_df_cols = empty90_df.index.to_list()
df = df.drop(empty90_df_cols, axis=1)
df.shape

Wall time: 9.08 s


(7422037, 61)

In [34]:
%%time

# CHECK AGAIN after dropping the (almost) empty columns.
# Absolute & Relative frequency of missing values by column:
pd.set_option('display.max_rows', 110) # Enough rows to list every remaining column
na_counts = df.isna().sum() # Computed once and reused for both measures
missing = pd.DataFrame([na_counts, na_counts / len(df) * 100], index=['Absolute', 'Relative']).T.sort_values(by='Relative', ascending=False)
missing

Wall time: 11.7 s


Unnamed: 0,Absolute,Relative
LATE_AIRCRAFT_DELAY,6032784.0,81.282052
SECURITY_DELAY,6032784.0,81.282052
NAS_DELAY,6032784.0,81.282052
WEATHER_DELAY,6032784.0,81.282052
CARRIER_DELAY,6032784.0,81.282052
ARR_DELAY_NEW,153805.0,2.072275
ARR_DEL15,153805.0,2.072275
ARR_DELAY_GROUP,153805.0,2.072275
ARR_DELAY,153805.0,2.072275
AIR_TIME,153805.0,2.072275


In [35]:
# CHECK AGAIN after manipulating the missing data.
# Show which columns present more than 5% of missing data:
empty_df = missing[missing['Relative'] > 5]
empty_df

Unnamed: 0,Absolute,Relative
LATE_AIRCRAFT_DELAY,6032784.0,81.282052
SECURITY_DELAY,6032784.0,81.282052
NAS_DELAY,6032784.0,81.282052
WEATHER_DELAY,6032784.0,81.282052
CARRIER_DELAY,6032784.0,81.282052


In [36]:
%%time

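# The delay-cause columns are ~81% NaN; fill them with 0,
# i.e. treat a missing value as "no delay attributed to that cause"
empty_df_cols = empty_df.index.to_list()
df[empty_df_cols] = df[empty_df_cols].fillna(value=0)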

Wall time: 667 ms


In [37]:
%memit

peak memory: 2555.05 MiB, increment: 0.63 MiB


In [38]:
isolated_elements_missing = missing.loc[(missing['Relative'] <= 5) & (missing['Relative'] > 0)]
isolated_elements_missing

Unnamed: 0,Absolute,Relative
ARR_DELAY_NEW,153805.0,2.072275
ARR_DEL15,153805.0,2.072275
ARR_DELAY_GROUP,153805.0,2.072275
ARR_DELAY,153805.0,2.072275
AIR_TIME,153805.0,2.072275
ACTUAL_ELAPSED_TIME,153805.0,2.072275
TAXI_IN,137647.0,1.854572
WHEELS_ON,137647.0,1.854572
ARR_TIME,137646.0,1.854558
WHEELS_OFF,133977.0,1.805124


In [39]:
%%time

# Quick approach → check how many rows contain empty values to see if directly dropping them would be feasible:

# Check which rows have at least 1 NaN:
df[df.isna().any(axis=1)]

Wall time: 5.96 s


Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_FIPS,ORIGIN_STATE_NM,ORIGIN_WAC,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_FIPS,DEST_STATE_NM,DEST_WAC,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DELAY_NEW,DEP_DEL15,DEP_DELAY_GROUP,DEP_TIME_BLK,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DELAY_NEW,ARR_DEL15,ARR_DELAY_GROUP,ARR_TIME_BLK,CANCELLED,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,DIV_AIRPORT_LANDINGS
26,2019,1,1,29,2,2019-01-29,9E,20363,9E,N931XJ,5122,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,14783,1478302,34783,SGF,"Springfield, MO",MO,29,Missouri,64,950,,,,,,0900-0959,,,,,1053,,,,,,1000-1059,1.0,0.0,123.0,,,1.0,563.0,3,0.0,0.0,0.0,0.0,0.0,0
53,2019,1,1,29,2,2019-01-29,9E,20363,9E,N931XJ,5122,14783,1478302,34783,SGF,"Springfield, MO",MO,29,Missouri,64,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,1128,,,,,,1100-1159,,,,,1420,,,,,,1400-1459,1.0,0.0,112.0,,,1.0,563.0,3,0.0,0.0,0.0,0.0,0.0,0
142,2019,1,1,20,7,2019-01-20,9E,20363,9E,N934XJ,5125,12953,1295304,31703,LGA,"New York, NY",NY,36,New York,22,10785,1078502,30785,BTV,"Burlington, VT",VT,50,Vermont,16,1420,,,,,,1400-1459,,,,,1545,,,,,,1500-1559,1.0,0.0,85.0,,,1.0,258.0,2,0.0,0.0,0.0,0.0,0.0,0
298,2019,1,1,24,4,2019-01-24,9E,20363,9E,N930XJ,5129,12953,1295304,31703,LGA,"New York, NY",NY,36,New York,22,10785,1078502,30785,BTV,"Burlington, VT",VT,50,Vermont,16,2159,,,,,,2100-2159,,,,,2321,,,,,,2300-2359,1.0,0.0,82.0,,,1.0,258.0,2,0.0,0.0,0.0,0.0,0.0,0
406,2019,1,1,22,2,2019-01-22,9E,20363,9E,N929XJ,5133,11433,1143302,31295,DTW,"Detroit, MI",MI,26,Michigan,43,10785,1078502,30785,BTV,"Burlington, VT",VT,50,Vermont,16,2020,,,,,,2000-2059,,,,,2218,,,,,,2200-2259,1.0,0.0,118.0,,,1.0,537.0,3,0.0,0.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7419670,2019,4,12,18,3,2019-12-18,B6,20409,B6,N348JB,2780,11618,1161802,31703,EWR,"Newark, NJ",NJ,34,New Jersey,21,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,2104,,,,,,2100-2159,,,,,2216,,,,,,2200-2259,1.0,0.0,72.0,,,1.0,200.0,1,0.0,0.0,0.0,0.0,0.0,0
7419672,2019,4,12,18,3,2019-12-18,B6,20409,B6,N329JB,2784,14492,1449202,34492,RDU,"Raleigh/Durham, NC",NC,37,North Carolina,36,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,530,,,,,,0001-0559,,,,,723,,,,,,0700-0759,1.0,0.0,113.0,,,1.0,612.0,3,0.0,0.0,0.0,0.0,0.0,0
7419848,2019,4,12,19,4,2019-12-19,B6,20409,B6,N306JB,374,15096,1509602,35096,SYR,"Syracuse, NY",NY,36,New York,22,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,530,,,,,,0001-0559,,,,,651,,,,,,0600-0659,1.0,0.0,81.0,,,1.0,265.0,2,0.0,0.0,0.0,0.0,0.0,0
7420968,2019,4,12,30,1,2019-12-30,B6,20409,B6,N579JB,321,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,14027,1402702,34027,PBI,"West Palm Beach/Palm Beach, FL",FL,12,Florida,33,1936,2057.0,81.0,81.0,1.0,5.0,1900-1959,53.0,2150.0,,,2303,,,,,,2300-2359,0.0,1.0,207.0,,,1.0,1197.0,5,0.0,0.0,0.0,0.0,0.0,1


In [40]:
%%time

# DF's length as is:
original_length = len(df)
print('Original dataset length:\t', original_length)

# Length after dropping every row with at least 1 NaN:
manipulated_length = len(df.dropna())
print('Manipulated dataset length:\t', manipulated_length)

# Dropped rows, absolute and relative number:
dropped = original_length - manipulated_length
print(f'{dropped} rows deleted, accounting for {dropped / original_length * 100:.2f}% of the original dataset.')

Original dataset length:	 7422037
Manipulated dataset length:	 7268232
153805 rows deleted, accounting for 2.07% of the original dataset.
Wall time: 8.8 s


Since this is only ~2% of the data and the dataset is large enough, let's simply drop those rows as a quick cleaning method.
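Before dropping, it may be worth checking what the incomplete rows actually are. In the listing above, the rows shown have `CANCELLED` or `DIVERTED` set to 1, which would mean that dropping them removes those flights from the analysis altogether. The sketch below shows one way to measure that share; the toy frame reuses the notebook's column names, but its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame reusing the notebook's column names (values are hypothetical)
toy = pd.DataFrame({
    "CANCELLED": [0.0, 1.0, 0.0, 0.0],
    "DIVERTED":  [0.0, 0.0, 1.0, 0.0],
    "ARR_DELAY": [25.0, np.nan, np.nan, -8.0],
})

# Rows with at least one missing value
has_nan = toy.isna().any(axis=1)

# Share of incomplete rows that are cancelled or diverted flights
incomplete = toy.loc[has_nan]
explained = (incomplete["CANCELLED"].eq(1) | incomplete["DIVERTED"].eq(1)).mean()
print(f"{has_nan.sum()} incomplete rows, {explained:.0%} cancelled or diverted")

# Dropping the incomplete rows leaves only completed, non-diverted flights here
clean = toy.dropna()
print(len(clean))
```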

In [41]:
%%time

df = df.dropna() # Drop every row with at least 1 NaN
df

Wall time: 8.89 s


Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_FIPS,ORIGIN_STATE_NM,ORIGIN_WAC,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_FIPS,DEST_STATE_NM,DEST_WAC,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DELAY_NEW,DEP_DEL15,DEP_DELAY_GROUP,DEP_TIME_BLK,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DELAY_NEW,ARR_DEL15,ARR_DELAY_GROUP,ARR_TIME_BLK,CANCELLED,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,DIV_AIRPORT_LANDINGS
0,2019,1,1,3,4,2019-01-03,9E,20363,9E,N195PQ,5121,15412,1541205,35412,TYS,"Knoxville, TN",TN,47,Tennessee,54,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,1140,1205.0,25.0,25.0,1.0,1.0,1100-1159,30.0,1235.0,1311.0,4.0,1250,1315.0,25.0,25.0,1.0,1.0,1200-1259,0.0,0.0,70.0,70.0,36.0,1.0,152.0,1,0.0,0.0,0.0,0.0,25.0,0
1,2019,1,1,4,5,2019-01-04,9E,20363,9E,N919XJ,5121,15412,1541205,35412,TYS,"Knoxville, TN",TN,47,Tennessee,54,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,1140,1250.0,70.0,70.0,1.0,4.0,1100-1159,35.0,1325.0,1403.0,9.0,1250,1412.0,82.0,82.0,1.0,5.0,1200-1259,0.0,0.0,70.0,82.0,38.0,1.0,152.0,1,0.0,0.0,12.0,0.0,70.0,0
2,2019,1,1,5,6,2019-01-05,9E,20363,9E,N316PQ,5122,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,14783,1478302,34783,SGF,"Springfield, MO",MO,29,Missouri,64,950,956.0,6.0,6.0,0.0,0.0,0900-0959,20.0,1016.0,1040.0,3.0,1051,1043.0,-8.0,0.0,0.0,-1.0,1000-1059,0.0,0.0,121.0,107.0,84.0,1.0,563.0,3,0.0,0.0,0.0,0.0,0.0,0
3,2019,1,1,6,7,2019-01-06,9E,20363,9E,N325PQ,5122,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,14783,1478302,34783,SGF,"Springfield, MO",MO,29,Missouri,64,950,945.0,-5.0,0.0,0.0,-1.0,0900-0959,16.0,1001.0,1026.0,3.0,1053,1029.0,-24.0,0.0,0.0,-2.0,1000-1059,0.0,0.0,123.0,104.0,85.0,1.0,563.0,3,0.0,0.0,0.0,0.0,0.0,0
4,2019,1,1,7,1,2019-01-07,9E,20363,9E,N904XJ,5122,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,14783,1478302,34783,SGF,"Springfield, MO",MO,29,Missouri,64,950,947.0,-3.0,0.0,0.0,-1.0,0900-0959,25.0,1012.0,1040.0,4.0,1053,1044.0,-9.0,0.0,0.0,-1.0,1000-1059,0.0,0.0,123.0,117.0,88.0,1.0,563.0,3,0.0,0.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7422032,2019,4,12,31,2,2019-12-31,B6,20409,B6,N193JB,846,13204,1320402,31454,MCO,"Orlando, FL",FL,12,Florida,33,15070,1507003,31703,SWF,"Newburgh/Poughkeepsie, NY",NY,36,New York,22,1356,1500.0,64.0,64.0,1.0,4.0,1300-1359,20.0,1520.0,1726.0,5.0,1639,1731.0,52.0,52.0,1.0,3.0,1600-1659,0.0,0.0,163.0,151.0,126.0,1.0,989.0,4,52.0,0.0,0.0,0.0,0.0,0
7422033,2019,4,12,31,2,2019-12-31,B6,20409,B6,N304JB,854,11278,1127805,30852,DCA,"Washington, DC",VA,51,Virginia,38,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,1420,1414.0,-6.0,0.0,0.0,-1.0,1400-1459,15.0,1429.0,1526.0,7.0,1550,1533.0,-17.0,0.0,0.0,-2.0,1500-1559,0.0,0.0,90.0,79.0,57.0,1.0,399.0,2,0.0,0.0,0.0,0.0,0.0,0
7422034,2019,4,12,31,2,2019-12-31,B6,20409,B6,N193JB,860,14100,1410005,34100,PHL,"Philadelphia, PA",PA,42,Pennsylvania,23,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,700,652.0,-8.0,0.0,0.0,-1.0,0700-0759,12.0,704.0,746.0,5.0,825,751.0,-34.0,0.0,0.0,-2.0,0800-0859,0.0,0.0,85.0,59.0,42.0,1.0,280.0,2,0.0,0.0,0.0,0.0,0.0,0
7422035,2019,4,12,31,2,2019-12-31,B6,20409,B6,N563JB,861,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,14843,1484306,34819,SJU,"San Juan, PR",PR,72,Puerto Rico,3,813,812.0,-1.0,0.0,0.0,-1.0,0800-0859,10.0,822.0,1245.0,3.0,1315,1248.0,-27.0,0.0,0.0,-2.0,1300-1359,0.0,0.0,242.0,216.0,203.0,1.0,1674.0,7,0.0,0.0,0.0,0.0,0.0,0


In [42]:
%%time

# CHECK AGAIN after manipulating the missing data.
# Absolute & Relative frequency of missing values by column:
pd.set_option('display.max_rows', 110) # Enough rows to list every remaining column
na_counts = df.isna().sum() # Computed once and reused for both measures
missing_2 = pd.DataFrame([na_counts, na_counts / len(df) * 100], index=['Absolute', 'Relative']).T.sort_values(by='Relative', ascending=False)
missing_2

Wall time: 11.2 s


Unnamed: 0,Absolute,Relative
YEAR,0.0,0.0
DEP_DELAY,0.0,0.0
DEP_DEL15,0.0,0.0
DEP_DELAY_GROUP,0.0,0.0
DEP_TIME_BLK,0.0,0.0
TAXI_OUT,0.0,0.0
WHEELS_OFF,0.0,0.0
WHEELS_ON,0.0,0.0
TAXI_IN,0.0,0.0
CRS_ARR_TIME,0.0,0.0


Great. The dataset is now completely free of missing values.

Let's store the recently cleaned dataset in a CSV.

#### Run only the first time to generate the global CLEAN file (year 2019)
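A minimal sketch of the export step, assuming the cleaned DataFrame is written with the same `latin1` encoding used when loading. The helper name `save_clean_csv` and the demo path are hypothetical. Passing `index=False` keeps the row index out of the file, so reloading it does not produce an extra unnamed column.

```python
import os
import tempfile
import pandas as pd

def save_clean_csv(frame, path, encoding="latin1"):
    """Write the cleaned DataFrame to CSV without the index column."""
    frame.to_csv(path, index=False, encoding=encoding)

# Toy stand-in for the cleaned dataset (hypothetical values)
toy = pd.DataFrame({"ORIGIN": ["ATL", "BOS"], "ARR_DELAY": [25.0, -17.0]})
out_path = os.path.join(tempfile.mkdtemp(), "demo_clean.csv")
save_clean_csv(toy, out_path)

# Reload to confirm the round trip preserves rows and columns
reloaded = pd.read_csv(out_path, encoding="latin1")
print(reloaded.columns.tolist())
print(len(reloaded))
```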

___

### Section left for later iterations to clean the CSV

In [None]:
int_cols = [
# Time Period
 'YEAR',
#  'QUARTER', # Disregarded: redundant
 'MONTH',
 'DAY_OF_MONTH',
 'DAY_OF_WEEK',
#  'FL_DATE', # Disregarded: redundant
# Airline / Aircraft
 'OP_UNIQUE_CARRIER',
 'OP_CARRIER_AIRLINE_ID',
 'OP_CARRIER',
 'TAIL_NUM',
 'OP_CARRIER_FL_NUM',
# Origin
 'ORIGIN_AIRPORT_ID',
 'ORIGIN_AIRPORT_SEQ_ID',
 'ORIGIN_CITY_MARKET_ID',
 'ORIGIN',
 'ORIGIN_CITY_NAME',
 'ORIGIN_STATE_ABR',
 'ORIGIN_STATE_FIPS',
 'ORIGIN_STATE_NM',
 'ORIGIN_WAC',
# Destination
 'DEST_AIRPORT_ID',
 'DEST_AIRPORT_SEQ_ID',
 'DEST_CITY_MARKET_ID',
 'DEST',
 'DEST_CITY_NAME',
 'DEST_STATE_ABR',
 'DEST_STATE_FIPS',
 'DEST_STATE_NM',
 'DEST_WAC',
# Departure Performance
 'CRS_DEP_TIME',
 'DEP_TIME',
 'DEP_DELAY',
 'DEP_DELAY_NEW',
 'DEP_DEL15',
 'DEP_DELAY_GROUP',
 'DEP_TIME_BLK',
 'TAXI_OUT',
 'WHEELS_OFF',
# Arrival Performance
 'WHEELS_ON',
 'TAXI_IN',
 'CRS_ARR_TIME',
 'ARR_TIME',
 'ARR_DELAY',
 'ARR_DELAY_NEW',
 'ARR_DEL15',
 'ARR_DELAY_GROUP',
 'ARR_TIME_BLK',
# Cancellations and Diversions
 'CANCELLED',
 'CANCELLATION_CODE',
 'DIVERTED',
# Flight Summaries
 'CRS_ELAPSED_TIME',
 'ACTUAL_ELAPSED_TIME',
 'AIR_TIME',
 'FLIGHTS',
 'DISTANCE',
 'DISTANCE_GROUP',
# Cause of Delay
 'CARRIER_DELAY',
 'WEATHER_DELAY',
 'NAS_DELAY',
 'SECURITY_DELAY',
 'LATE_AIRCRAFT_DELAY',
# Gate Return Information at Origin Airport (Data starts 10/2008)
 'FIRST_DEP_TIME',
 'TOTAL_ADD_GTIME',
 'LONGEST_ADD_GTIME',
# Diverted Airport Information (Data starts 10/2008)
 'DIV_AIRPORT_LANDINGS',
 'DIV_REACHED_DEST',
 'DIV_ACTUAL_ELAPSED_TIME',
 'DIV_ARR_DELAY',
 'DIV_DISTANCE',
 'DIV1_AIRPORT',
 'DIV1_AIRPORT_ID',
 'DIV1_AIRPORT_SEQ_ID',
 'DIV1_WHEELS_ON',
 'DIV1_TOTAL_GTIME',
 'DIV1_LONGEST_GTIME',
 'DIV1_WHEELS_OFF',
 'DIV1_TAIL_NUM',
 'DIV2_AIRPORT',
 'DIV2_AIRPORT_ID',
 'DIV2_AIRPORT_SEQ_ID',
 'DIV2_WHEELS_ON',
 'DIV2_TOTAL_GTIME',
 'DIV2_LONGEST_GTIME',
 'DIV2_WHEELS_OFF',
 'DIV2_TAIL_NUM',
 'DIV3_AIRPORT',
 'DIV3_AIRPORT_ID',
 'DIV3_AIRPORT_SEQ_ID',
 'DIV3_WHEELS_ON',
 'DIV3_TOTAL_GTIME',
 'DIV3_LONGEST_GTIME',
 'DIV3_WHEELS_OFF',
 'DIV3_TAIL_NUM',
 'DIV4_AIRPORT',
 'DIV4_AIRPORT_ID',
 'DIV4_AIRPORT_SEQ_ID',
 'DIV4_WHEELS_ON',
 'DIV4_TOTAL_GTIME',
 'DIV4_LONGEST_GTIME',
 'DIV4_WHEELS_OFF',
 'DIV4_TAIL_NUM',
 'DIV5_AIRPORT',
 'DIV5_AIRPORT_ID',
 'DIV5_AIRPORT_SEQ_ID',
 'DIV5_WHEELS_ON',
 'DIV5_TOTAL_GTIME',
 'DIV5_LONGEST_GTIME',
 'DIV5_WHEELS_OFF',
 'DIV5_TAIL_NUM'
]

In [None]:
print("{} columns kept from a total of {}.".format(len(int_cols), len(cols)))

Additional information on the meaning of each column can be found [here](https://www.transtats.bts.gov/Fields.asp?Table_ID=236&SYS_Table_Name=T_ONTIME_REPORTING&User_Table_Name=Reporting%20Carrier%20On-Time%20Performance%20(1987-present)&Year_Info=1&First_Year=1987&Last_Year=2020&Rate_Info=0&Frequency=Monthly&Data_Frequency=Annual,Quarterly,Monthly).

___

# 3. Explore the data

## Plots

In [None]:
cols = [
 'YEAR',
 'QUARTER',
 'MONTH',
 'DAY_OF_MONTH',
 'DAY_OF_WEEK',
#  'FL_DATE',
#  'OP_UNIQUE_CARRIER',
#  'OP_CARRIER_AIRLINE_ID',
#  'OP_CARRIER',
#  'TAIL_NUM',
#  'OP_CARRIER_FL_NUM',
#  'ORIGIN_AIRPORT_ID',
#  'ORIGIN_AIRPORT_SEQ_ID',
#  'ORIGIN_CITY_MARKET_ID',
#  'ORIGIN',
#  'ORIGIN_CITY_NAME',
#  'ORIGIN_STATE_ABR',
#  'ORIGIN_STATE_FIPS',
#  'ORIGIN_STATE_NM',
#  'ORIGIN_WAC',
#  'DEST_AIRPORT_ID',
#  'DEST_AIRPORT_SEQ_ID',
#  'DEST_CITY_MARKET_ID',
#  'DEST',
#  'DEST_CITY_NAME',
#  'DEST_STATE_ABR',
#  'DEST_STATE_FIPS',
#  'DEST_STATE_NM',
#  'DEST_WAC',
#  'CRS_DEP_TIME',
#  'DEP_TIME',
 'DEP_DELAY',
 'DEP_DELAY_NEW',
 'DEP_DEL15',
#  'DEP_DELAY_GROUP',
#  'DEP_TIME_BLK',
 'TAXI_OUT',
#  'WHEELS_OFF',
#  'WHEELS_ON',
 'TAXI_IN',
#  'CRS_ARR_TIME',
#  'ARR_TIME',
 'ARR_DELAY',
 'ARR_DELAY_NEW',
 'ARR_DEL15',
#  'ARR_DELAY_GROUP',
#  'ARR_TIME_BLK',
 'CANCELLED',
 'CANCELLATION_CODE',
 'DIVERTED',
#  'CRS_ELAPSED_TIME',
#  'ACTUAL_ELAPSED_TIME',
 'AIR_TIME',
 'FLIGHTS',
 'DISTANCE',
#  'DISTANCE_GROUP',
 'CARRIER_DELAY',
 'WEATHER_DELAY',
 'NAS_DELAY',
 'SECURITY_DELAY',
 'LATE_AIRCRAFT_DELAY',
#  'FIRST_DEP_TIME',
#  'TOTAL_ADD_GTIME',
#  'LONGEST_ADD_GTIME',
#  'DIV_AIRPORT_LANDINGS',
#  'DIV_REACHED_DEST',
#  'DIV_ACTUAL_ELAPSED_TIME',
 'DIV_ARR_DELAY',
 'DIV_DISTANCE',
#  'DIV1_AIRPORT',
#  'DIV1_AIRPORT_ID',
#  'DIV1_AIRPORT_SEQ_ID',
#  'DIV1_WHEELS_ON',
#  'DIV1_TOTAL_GTIME',
#  'DIV1_LONGEST_GTIME',
#  'DIV1_WHEELS_OFF',
#  'DIV1_TAIL_NUM',
#  'DIV2_AIRPORT',
#  'DIV2_AIRPORT_ID',
#  'DIV2_AIRPORT_SEQ_ID',
#  'DIV2_WHEELS_ON',
#  'DIV2_TOTAL_GTIME',
#  'DIV2_LONGEST_GTIME',
#  'DIV2_WHEELS_OFF',
#  'DIV2_TAIL_NUM',
#  'DIV3_AIRPORT',
#  'DIV3_AIRPORT_ID',
#  'DIV3_AIRPORT_SEQ_ID',
#  'DIV3_WHEELS_ON',
#  'DIV3_TOTAL_GTIME',
#  'DIV3_LONGEST_GTIME',
#  'DIV3_WHEELS_OFF',
#  'DIV3_TAIL_NUM',
#  'DIV4_AIRPORT',
#  'DIV4_AIRPORT_ID',
#  'DIV4_AIRPORT_SEQ_ID',
#  'DIV4_WHEELS_ON',
#  'DIV4_TOTAL_GTIME',
#  'DIV4_LONGEST_GTIME',
#  'DIV4_WHEELS_OFF',
#  'DIV4_TAIL_NUM',
#  'DIV5_AIRPORT',
#  'DIV5_AIRPORT_ID',
#  'DIV5_AIRPORT_SEQ_ID',
#  'DIV5_WHEELS_ON',
#  'DIV5_TOTAL_GTIME',
#  'DIV5_LONGEST_GTIME',
#  'DIV5_WHEELS_OFF',
#  'DIV5_TAIL_NUM'
]

In [None]:
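# `df0` is not defined earlier in this notebook: build it from `df`,
# keeping only the selected columns that survived the cleaning above
df0 = df[[c for c in cols if c in df.columns]]
df0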

In [None]:
output_csv_path = os.path.join(root,
                               "Output_Data",
                               "US_DoT",
                               "AL_OTP_MVP_Preprocessed.csv")
output_csv_path

In [None]:
%%time

df0.to_csv(path_or_buf=output_csv_path,
           index=False,
           encoding='latin1')

In [None]:
t1 = time.perf_counter() - t0
print("Time elapsed: ", t1) # Wall-clock seconds elapsed (floating point)

# 5. Communicate and visualize the results

___