# 1. Exploratory Data Analysis

## 1.3. Airline On-Time Performance Data

### 1.3.1. On-Time : Reporting Carrier On-Time Performance (1987-present)

Source: https://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&DB_URL=Mode_ID=1&Mode_Desc=Aviation&Subject_ID2=0

<em>Note</em>: Over time both the code and the name of a carrier may change, and the same code or name may be assumed by a different airline. To ensure that you are analyzing data from the same airline, TranStats provides four airline-specific variables that identify one and only one carrier or its entity: Airline ID (AirlineID), Unique Carrier Code (UniqueCarrier), Unique Carrier Name (UniqueCarrierName), and Unique Entity (UniqCarrierEntity). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.

US Airways and America West started to report combined on-time data in January 2006 and combined traffic and financial data in October 2007 following their 2005 merger announcement. Delta and Northwest began reporting jointly in January 2010 following their 2008 merger announcement. Continental Micronesia was combined into Continental Airlines in December 2010 and joint reporting began in January 2011. Atlantic Southeast and ExpressJet began reporting jointly in January 2012. United and Continental began reporting jointly in January 2012 following their 2010 merger announcement. Endeavor (9E) operated as Pinnacle prior to August 2013. Envoy (MQ) operated as American Eagle prior to April 2014. Southwest (WN) and AirTran (FL) began reporting jointly in January 2015 following their 2011 merger announcement. American (AA) and US Airways (US) began reporting jointly as AA in July 2015 following their 2013 merger announcement. Alaska (AS) and Virgin America (VX) began reporting jointly as AS in April 2018 following their 2016 merger announcement.

        
- **AC_Types**:
    - **Summary**:
        - Reporting Carrier On-Time Performance (1987-present)
    - **Description**:
        - Reporting carriers are required to (or voluntarily) report on-time data for flights they operate: on-time arrival and departure data for non-stop domestic flights by month and year, by carrier and by origin and destination airport. Includes scheduled and actual departure and arrival times, canceled and diverted flights, taxi-out and taxi-in times, causes of delay and cancellation, air time, and non-stop distance. Use Download for individual flight data.
    - **File**:
        - YYMM_123456789_T_ONTIME_REPORTING.zip *(YY = Year ; MM = Month)*
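
The naming pattern above can be used to recover the period each file covers. The following is a minimal sketch, not part of the original notebook: `parse_ontime_filename` is a hypothetical helper, the middle digits are assumed to be an arbitrary download identifier, and two-digit years are assumed to belong to the 2000s.

```python
import re

# Pattern for names such as '1901_921771952_T_ONTIME_REPORTING.zip'
_ONTIME_RE = re.compile(r"^(\d{2})(\d{2})_(\d+)_T_ONTIME_REPORTING\.zip$")

def parse_ontime_filename(filename):
    """Return (year, month) for a monthly on-time file, or None if the name does not match."""
    match = _ONTIME_RE.match(filename)
    if match is None:
        return None
    year = 2000 + int(match.group(1))  # Assumption: two-digit years belong to the 2000s
    month = int(match.group(2))
    if not 1 <= month <= 12:
        return None
    return year, month

print(parse_ontime_filename("1901_921771952_T_ONTIME_REPORTING.zip"))  # (2019, 1)
```

Returning `None` for non-matching names makes it easy to skip unrelated files when scanning a directory.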

___

In [1]:
# Import libraries to be used:

import pandas as pd
import numpy as np
import os # Both 'os.name' and 'os.path' are used below
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
import warnings # warnings.filterwarnings(action='ignore') # https://docs.python.org/3/library/warnings.html#the-warnings-filter
# from zipfile import ZipFile # Not needed so far

In [2]:
# Show all columns in DataFrames
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None) # Left disabled: it greatly slows down the output display and can freeze the kernel

# Show in notebook
%matplotlib inline

# style -> plt.style.available
# plt.style.use('seaborn')
plt.style.use('ggplot')

# theme
sns.set_theme(context='notebook',
              style="darkgrid") # {darkgrid, whitegrid, dark, white, ticks}

# color_palette -> https://seaborn.pydata.org/generated/seaborn.color_palette.html#seaborn.color_palette
palette = sns.color_palette("flare", as_cmap=True);

In [3]:
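if os.name == 'nt': # Windows
    root = r"C:\Users\turge\CompartidoVM\0.TFM"
    print("Running on Windows.")
elif os.name == 'posix': # Ubuntu (or any other POSIX system)
    root = "/home/dsc/shared/0.TFM"
    print("Running on Ubuntu.")
else: # Fail early instead of hitting a NameError on 'root' below
    raise OSError(f"Unsupported operating system: {os.name}")
print("root path\t", root)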

Running on Windows.
root path	 C:\Users\turge\CompartidoVM\0.TFM


___

Single file import (i.e. month-sized database)

In [4]:
csv_path = os.path.join(root,
                        "Raw_Data",
                        "US_DoT",
                        "ONTIME_REPORTING",
                        "1901_921771952_T_ONTIME_REPORTING.zip")
csv_path

'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\ONTIME_REPORTING\\1901_921771952_T_ONTIME_REPORTING.zip'

Each monthly file is large, and several of them will later be combined, so the first exploration uses only the first 10,000 rows.

In [7]:
# Since 'pd.read_csv' works fine with zipped csv files, we can proceed directly.
# A plain import loads an undesired extra blank column at the end, so the header is read first:
cols = pd.read_csv(csv_path, encoding='latin1', nrows=1).columns
df7_0 = pd.read_csv(csv_path,
                    encoding='latin1',
                    nrows=10000,
                    usecols=cols[:-1], # This way, the extra column is disregarded for the loading process
                    low_memory=False)
df7_0

Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_FIPS,ORIGIN_STATE_NM,ORIGIN_WAC,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_FIPS,DEST_STATE_NM,DEST_WAC,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DELAY_NEW,DEP_DEL15,DEP_DELAY_GROUP,DEP_TIME_BLK,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DELAY_NEW,ARR_DEL15,ARR_DELAY_GROUP,ARR_TIME_BLK,CANCELLED,CANCELLATION_CODE,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,FIRST_DEP_TIME,TOTAL_ADD_GTIME,LONGEST_ADD_GTIME,DIV_AIRPORT_LANDINGS,DIV_REACHED_DEST,DIV_ACTUAL_ELAPSED_TIME,DIV_ARR_DELAY,DIV_DISTANCE,DIV1_AIRPORT,DIV1_AIRPORT_ID,DIV1_AIRPORT_SEQ_ID,DIV1_WHEELS_ON,DIV1_TOTAL_GTIME,DIV1_LONGEST_GTIME,DIV1_WHEELS_OFF,DIV1_TAIL_NUM,DIV2_AIRPORT,DIV2_AIRPORT_ID,DIV2_AIRPORT_SEQ_ID,DIV2_WHEELS_ON,DIV2_TOTAL_GTIME,DIV2_LONGEST_GTIME,DIV2_WHEELS_OFF,DIV2_TAIL_NUM,DIV3_AIRPORT,DIV3_AIRPORT_ID,DIV3_AIRPORT_SEQ_ID,DIV3_WHEELS_ON,DIV3_TOTAL_GTIME,DIV3_LONGEST_GTIME,DIV3_WHEELS_OFF,DIV3_TAIL_NUM,DIV4_AIRPORT,DIV4_AIRPORT_ID,DIV4_AIRPORT_SEQ_ID,DIV4_WHEELS_ON,DIV4_TOTAL_GTIME,DIV4_LONGEST_GTIME,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM
0,2019,1,1,3,4,2019-01-03,9E,20363,9E,N195PQ,5121,15412,1541205,35412,TYS,"Knoxville, TN",TN,47,Tennessee,54,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,1140,1205.0,25.0,25.0,1.0,1.0,1100-1159,30.0,1235.0,1311.0,4.0,1250,1315.0,25.0,25.0,1.0,1.0,1200-1259,0.0,,0.0,70.0,70.0,36.0,1.0,152.0,1,0.0,0.0,0.0,0.0,25.0,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2019,1,1,4,5,2019-01-04,9E,20363,9E,N919XJ,5121,15412,1541205,35412,TYS,"Knoxville, TN",TN,47,Tennessee,54,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,1140,1250.0,70.0,70.0,1.0,4.0,1100-1159,35.0,1325.0,1403.0,9.0,1250,1412.0,82.0,82.0,1.0,5.0,1200-1259,0.0,,0.0,70.0,82.0,38.0,1.0,152.0,1,0.0,0.0,12.0,0.0,70.0,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2019,1,1,5,6,2019-01-05,9E,20363,9E,N316PQ,5122,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,14783,1478302,34783,SGF,"Springfield, MO",MO,29,Missouri,64,950,956.0,6.0,6.0,0.0,0.0,0900-0959,20.0,1016.0,1040.0,3.0,1051,1043.0,-8.0,0.0,0.0,-1.0,1000-1059,0.0,,0.0,121.0,107.0,84.0,1.0,563.0,3,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2019,1,1,6,7,2019-01-06,9E,20363,9E,N325PQ,5122,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,14783,1478302,34783,SGF,"Springfield, MO",MO,29,Missouri,64,950,945.0,-5.0,0.0,0.0,-1.0,0900-0959,16.0,1001.0,1026.0,3.0,1053,1029.0,-24.0,0.0,0.0,-2.0,1000-1059,0.0,,0.0,123.0,104.0,85.0,1.0,563.0,3,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,2019,1,1,7,1,2019-01-07,9E,20363,9E,N904XJ,5122,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,14783,1478302,34783,SGF,"Springfield, MO",MO,29,Missouri,64,950,947.0,-3.0,0.0,0.0,-1.0,0900-0959,25.0,1012.0,1040.0,4.0,1053,1044.0,-9.0,0.0,0.0,-1.0,1000-1059,0.0,,0.0,123.0,117.0,88.0,1.0,563.0,3,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,2019,1,1,14,1,2019-01-14,WN,19393,WN,N497WN,599,14831,1483106,32457,SJC,"San Jose, CA",CA,6,California,91,14908,1490803,32575,SNA,"Santa Ana, CA",CA,6,California,91,2100,2055.0,-5.0,0.0,0.0,-1.0,2100-2159,8.0,2103.0,2203.0,4.0,2215,2207.0,-8.0,0.0,0.0,-1.0,2200-2259,0.0,,0.0,75.0,72.0,60.0,1.0,342.0,2,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
9996,2019,1,1,14,1,2019-01-14,WN,19393,WN,N728SW,664,14831,1483106,32457,SJC,"San Jose, CA",CA,6,California,91,14908,1490803,32575,SNA,"Santa Ana, CA",CA,6,California,91,645,639.0,-6.0,0.0,0.0,-1.0,0600-0659,7.0,646.0,746.0,19.0,805,805.0,0.0,0.0,0.0,0.0,0800-0859,0.0,,0.0,80.0,86.0,60.0,1.0,342.0,2,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
9997,2019,1,1,14,1,2019-01-14,WN,19393,WN,N455WN,759,14831,1483106,32457,SJC,"San Jose, CA",CA,6,California,91,14908,1490803,32575,SNA,"Santa Ana, CA",CA,6,California,91,1755,1801.0,6.0,6.0,0.0,0.0,1700-1759,16.0,1817.0,1918.0,4.0,1915,1922.0,7.0,7.0,0.0,0.0,1900-1959,0.0,,0.0,80.0,81.0,61.0,1.0,342.0,2,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
9998,2019,1,1,14,1,2019-01-14,WN,19393,WN,N791SW,894,14831,1483106,32457,SJC,"San Jose, CA",CA,6,California,91,14908,1490803,32575,SNA,"Santa Ana, CA",CA,6,California,91,1610,,,,,,1600-1659,,,,,1730,,,,,,1700-1759,1.0,A,0.0,80.0,,,1.0,342.0,2,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
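
The extra blank column mentioned in the loading cell can be reproduced on a toy file. The sketch below assumes the cause is a trailing delimiter at the end of every line, which this notebook does not confirm; the data is made up and only the `usecols=cols[:-1]` idea comes from the cell above.

```python
import io
import pandas as pd

# Toy CSV whose lines all end with a trailing comma (assumed cause of the blank column)
raw = "A,B,\n1,2,\n3,4,\n"

plain = pd.read_csv(io.StringIO(raw))
print(list(plain.columns))  # the last entry is an auto-named, all-NaN column

# Read only the header, then keep every column except the last one
header = pd.read_csv(io.StringIO(raw), nrows=0).columns
clean = pd.read_csv(io.StringIO(raw), usecols=header[:-1])
print(list(clean.columns))
```

Excluding the column at read time avoids allocating it at all, which matters more for the full monthly files than for this toy input.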


In [9]:
df7_0.describe()

Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,OP_CARRIER_AIRLINE_ID,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN_STATE_FIPS,ORIGIN_WAC,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST_STATE_FIPS,DEST_WAC,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DELAY_NEW,DEP_DEL15,DEP_DELAY_GROUP,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DELAY_NEW,ARR_DEL15,ARR_DELAY_GROUP,CANCELLED,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,FIRST_DEP_TIME,TOTAL_ADD_GTIME,LONGEST_ADD_GTIME,DIV_AIRPORT_LANDINGS,DIV_REACHED_DEST,DIV_ACTUAL_ELAPSED_TIME,DIV_ARR_DELAY,DIV_DISTANCE,DIV1_AIRPORT_ID,DIV1_AIRPORT_SEQ_ID,DIV1_WHEELS_ON,DIV1_TOTAL_GTIME,DIV1_LONGEST_GTIME,DIV1_WHEELS_OFF,DIV2_AIRPORT,DIV2_AIRPORT_ID,DIV2_AIRPORT_SEQ_ID,DIV2_WHEELS_ON,DIV2_TOTAL_GTIME,DIV2_LONGEST_GTIME,DIV2_WHEELS_OFF,DIV2_TAIL_NUM,DIV3_AIRPORT,DIV3_AIRPORT_ID,DIV3_AIRPORT_SEQ_ID,DIV3_WHEELS_ON,DIV3_TOTAL_GTIME,DIV3_LONGEST_GTIME,DIV3_WHEELS_OFF,DIV3_TAIL_NUM,DIV4_AIRPORT,DIV4_AIRPORT_ID,DIV4_AIRPORT_SEQ_ID,DIV4_WHEELS_ON,DIV4_TOTAL_GTIME,DIV4_LONGEST_GTIME,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,9862.0,9862.0,9862.0,9862.0,9862.0,9861.0,9861.0,9856.0,9856.0,10000.0,9856.0,9852.0,9852.0,9852.0,9852.0,10000.0,10000.0,10000.0,9852.0,9852.0,10000.0,10000.0,10000.0,1270.0,1270.0,1270.0,1270.0,1270.0,46.0,46.0,46.0,10000.0,8.0,4.0,4.0,8.0,9.0,9.0,9.0,9.0,9.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,2019.0,1.0,1.0,11.2979,3.6095,19592.529,2472.6399,12951.3393,1295138.0,31919.0574,24.2616,60.8799,12717.0104,1271705.0,31768.8857,24.5137,60.863,1336.5375,1343.179477,6.352768,8.984385,0.140844,-0.168019,15.156982,1367.374607,1498.69998,5.608665,1514.5721,1504.146307,-1.84186,8.17976,0.128908,-0.581202,0.014,0.0008,124.7666,116.739748,95.972594,1.0,676.0572,3.2065,15.977165,2.782677,12.515748,0.022835,25.56063,1158.76087,25.847826,25.347826,0.0017,0.5,297.5,134.0,211.125,12672.777778,1267281.0,1512.222222,11.777778,9.111111,1509.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
std,0.0,0.0,0.0,6.116543,2.214204,392.105449,1839.091004,1403.946406,140394.5,1215.853545,15.947094,25.244112,1575.684243,157568.2,1255.503142,16.210919,24.374927,468.96326,477.536825,31.742815,30.843474,0.347878,1.754956,9.734649,477.1092,501.619888,4.092904,496.123375,504.244293,34.620159,30.650175,0.335115,1.901867,0.117496,0.028274,54.299622,53.132257,52.293776,0.0,440.206993,1.772053,39.908883,32.775426,27.324325,0.602293,43.708981,450.385918,22.719514,21.779927,0.094329,0.534522,47.528939,53.272882,304.691149,1386.541,138653.2,463.346463,8.941166,6.293736,342.991739,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
min,2019.0,1.0,1.0,1.0,1.0,19393.0,1.0,10140.0,1014005.0,30140.0,1.0,3.0,10135.0,1013505.0,30135.0,1.0,3.0,505.0,5.0,-23.0,0.0,0.0,-2.0,2.0,1.0,1.0,1.0,3.0,1.0,-61.0,0.0,0.0,-2.0,0.0,0.0,49.0,37.0,22.0,1.0,95.0,1.0,0.0,0.0,0.0,0.0,0.0,600.0,1.0,1.0,0.0,0.0,241.0,66.0,0.0,10781.0,1078105.0,742.0,2.0,2.0,1028.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
25%,2019.0,1.0,1.0,4.0,1.0,19393.0,964.0,11697.0,1169706.0,30928.0,8.0,36.0,11259.0,1125904.0,30693.0,8.0,36.0,930.0,932.0,-5.0,0.0,0.0,-1.0,9.0,949.0,1108.0,3.0,1124.5,1113.0,-16.0,0.0,0.0,-2.0,0.0,0.0,84.0,77.0,57.0,1.0,341.0,2.0,0.0,0.0,0.0,0.0,0.0,808.0,8.5,8.5,0.0,0.0,271.0,105.75,0.0,11278.0,1127805.0,1424.0,4.0,4.0,1416.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
50%,2019.0,1.0,1.0,13.0,5.0,19393.0,1951.0,12954.0,1295407.0,31703.0,24.0,64.0,12892.0,1289208.0,31454.0,22.0,64.0,1330.0,1337.0,-2.0,0.0,0.0,-1.0,12.0,1350.0,1520.0,5.0,1530.0,1523.5,-8.0,0.0,0.0,-1.0,0.0,0.0,110.0,104.0,82.0,1.0,563.0,3.0,0.0,0.0,0.0,0.0,8.0,1113.0,20.0,20.0,0.0,0.5,298.5,142.0,5.5,12889.0,1288903.0,1608.0,11.0,8.0,1586.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
75%,2019.0,1.0,1.0,14.0,5.0,19393.0,4684.25,14107.0,1410702.0,32575.0,36.0,85.0,14107.0,1410702.0,32575.0,37.0,85.0,1740.0,1744.0,4.0,4.0,0.0,0.0,18.0,1758.0,1919.0,6.0,1924.0,1923.0,2.0,2.0,0.0,0.0,0.0,0.0,155.0,144.0,123.0,1.0,909.0,4.0,16.0,0.0,17.0,0.0,30.0,1504.0,32.75,32.75,0.0,1.0,325.0,170.25,418.75,13232.0,1323202.0,1745.0,15.0,13.0,1679.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
max,2019.0,1.0,1.0,31.0,7.0,20363.0,6995.0,15624.0,1562404.0,35412.0,72.0,93.0,15919.0,1591904.0,35412.0,72.0,93.0,2235.0,2353.0,805.0,805.0,1.0,12.0,134.0,2358.0,2400.0,81.0,2355.0,2400.0,791.0,791.0,1.0,12.0,1.0,1.0,390.0,371.0,351.0,1.0,2555.0,11.0,578.0,774.0,453.0,19.0,387.0,2237.0,104.0,104.0,9.0,1.0,352.0,186.0,719.0,14893.0,1489302.0,2031.0,27.0,21.0,1835.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [11]:
df7_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Columns: 109 entries, YEAR to DIV5_TAIL_NUM
dtypes: float64(71), int64(21), object(17)
memory usage: 8.3+ MB
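
The `+` in the reported memory usage means the figure is a lower bound, because the contents of `object` columns are not measured unless a deep inspection is requested. A minimal sketch of how memory could be measured and reduced follows; the data is synthetic and the dtype choices are assumptions for illustration, not decisions taken in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few columns of the sample (values are made up)
rng = np.random.default_rng(0)
n = 10_000
toy = pd.DataFrame({
    "OP_UNIQUE_CARRIER": rng.choice(["9E", "WN", "UA"], size=n),
    "CANCELLED": rng.choice([0.0, 1.0], size=n),
    "DISTANCE": rng.integers(95, 2556, size=n).astype("float64"),
})

# deep=True also counts the strings stored in object columns
before = toy.memory_usage(deep=True).sum()

compact = toy.assign(
    OP_UNIQUE_CARRIER=toy["OP_UNIQUE_CARRIER"].astype("category"),  # few distinct labels
    CANCELLED=toy["CANCELLED"].astype("int8"),                      # 0/1 flag, no NaN here
    DISTANCE=pd.to_numeric(toy["DISTANCE"], downcast="float"),      # float64 -> float32
)
after = compact.memory_usage(deep=True).sum()

print(f"before: {before / 1e6:.2f} MB, after: {after / 1e6:.2f} MB")
```

Columns that contain missing values cannot be cast to plain integer dtypes, so a conversion like the one for `CANCELLED` only works where the column is complete.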


In [16]:
cols = list(df7_0.columns)
cols

['YEAR',
 'QUARTER',
 'MONTH',
 'DAY_OF_MONTH',
 'DAY_OF_WEEK',
 'FL_DATE',
 'OP_UNIQUE_CARRIER',
 'OP_CARRIER_AIRLINE_ID',
 'OP_CARRIER',
 'TAIL_NUM',
 'OP_CARRIER_FL_NUM',
 'ORIGIN_AIRPORT_ID',
 'ORIGIN_AIRPORT_SEQ_ID',
 'ORIGIN_CITY_MARKET_ID',
 'ORIGIN',
 'ORIGIN_CITY_NAME',
 'ORIGIN_STATE_ABR',
 'ORIGIN_STATE_FIPS',
 'ORIGIN_STATE_NM',
 'ORIGIN_WAC',
 'DEST_AIRPORT_ID',
 'DEST_AIRPORT_SEQ_ID',
 'DEST_CITY_MARKET_ID',
 'DEST',
 'DEST_CITY_NAME',
 'DEST_STATE_ABR',
 'DEST_STATE_FIPS',
 'DEST_STATE_NM',
 'DEST_WAC',
 'CRS_DEP_TIME',
 'DEP_TIME',
 'DEP_DELAY',
 'DEP_DELAY_NEW',
 'DEP_DEL15',
 'DEP_DELAY_GROUP',
 'DEP_TIME_BLK',
 'TAXI_OUT',
 'WHEELS_OFF',
 'WHEELS_ON',
 'TAXI_IN',
 'CRS_ARR_TIME',
 'ARR_TIME',
 'ARR_DELAY',
 'ARR_DELAY_NEW',
 'ARR_DEL15',
 'ARR_DELAY_GROUP',
 'ARR_TIME_BLK',
 'CANCELLED',
 'CANCELLATION_CODE',
 'DIVERTED',
 'CRS_ELAPSED_TIME',
 'ACTUAL_ELAPSED_TIME',
 'AIR_TIME',
 'FLIGHTS',
 'DISTANCE',
 'DISTANCE_GROUP',
 'CARRIER_DELAY',
 'WEATHER_DELAY',
 'NAS_DEL

Let's decide which columns to keep; the rest will be skipped.

In [17]:
# 'int_cols' stands for "interesting columns" (it does not mean integer-typed columns)
int_cols = ['YEAR', 'MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'FL_DATE',
            'OP_UNIQUE_CARRIER', 'TAIL_NUM', 'OP_CARRIER_FL_NUM',
            'ORIGIN_CITY_MARKET_ID', 'ORIGIN', 'ORIGIN_CITY_NAME',
            'DEST_CITY_MARKET_ID', 'DEST', 'DEST_CITY_NAME',
            'DEP_TIME', 'DEP_DELAY', 'TAXI_OUT', 'WHEELS_OFF',
            'WHEELS_ON', 'TAXI_IN', 'ARR_TIME', 'ARR_DELAY',
            'CANCELLED', 'CANCELLATION_CODE', 'DIVERTED',
            'CRS_ELAPSED_TIME', 'ACTUAL_ELAPSED_TIME', 'AIR_TIME', 'FLIGHTS', 'DISTANCE', 'DISTANCE_GROUP',
            'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY',
            'DIV_AIRPORT_LANDINGS', 'DIV_REACHED_DEST', 'DIV_ACTUAL_ELAPSED_TIME', 'DIV_ARR_DELAY', 'DIV_DISTANCE']

That leaves 41 columns, which is still quite a few, but more than half of the original 109 (68) have been dropped.
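
The arithmetic is 109 - 41 = 68. A quick way to check a selection like this, and to catch misspelled names before reloading the file, is sketched below with short illustrative lists; `all_cols` and `keep_cols` are hypothetical stand-ins for the full column list and the selected one.

```python
# Toy stand-ins: 'all_cols' plays the role of the 109 loaded columns,
# 'keep_cols' the role of the selected list (names here are illustrative only)
all_cols = ["YEAR", "QUARTER", "MONTH", "ORIGIN", "ORIGIN_WAC", "DEST", "DEST_WAC"]
keep_cols = ["YEAR", "MONTH", "ORIGIN", "DEST"]

missing = [c for c in keep_cols if c not in all_cols]   # typos in the selection
dropped = [c for c in all_cols if c not in keep_cols]   # columns that will be skipped

assert not missing, f"Unknown columns requested: {missing}"
print(len(keep_cols), "kept,", len(dropped), "dropped:", dropped)
```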

Additional information on each column meaning can be found [here](https://www.transtats.bts.gov/Fields.asp?Table_ID=236&SYS_Table_Name=T_ONTIME_REPORTING&User_Table_Name=Reporting%20Carrier%20On-Time%20Performance%20(1987-present)&Year_Info=1&First_Year=1987&Last_Year=2020&Rate_Info=0&Frequency=Monthly&Data_Frequency=Annual,Quarterly,Monthly).

With the columns of interest selected, the file is imported again, this time loading only those columns.

In [18]:
df7 = pd.read_csv(csv_path,
                  encoding='latin1',
                  nrows=1_000_000, # Fail-safe: in case the file is unexpectedly large
                  usecols=int_cols, # Load only the columns of interest (this also skips the extra blank column)
                  low_memory=False) # Parse the whole file at once so that dtypes are inferred consistently
df7

Unnamed: 0,YEAR,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEP_TIME,DEP_DELAY,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,ARR_TIME,ARR_DELAY,CANCELLED,CANCELLATION_CODE,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,DIV_AIRPORT_LANDINGS,DIV_REACHED_DEST,DIV_ACTUAL_ELAPSED_TIME,DIV_ARR_DELAY,DIV_DISTANCE
0,2019,1,3,4,2019-01-03,9E,N195PQ,5121,35412,TYS,"Knoxville, TN",30397,ATL,"Atlanta, GA",1205.0,25.0,30.0,1235.0,1311.0,4.0,1315.0,25.0,0.0,,0.0,70.0,70.0,36.0,1.0,152.0,1,0.0,0.0,0.0,0.0,25.0,0,,,,
1,2019,1,4,5,2019-01-04,9E,N919XJ,5121,35412,TYS,"Knoxville, TN",30397,ATL,"Atlanta, GA",1250.0,70.0,35.0,1325.0,1403.0,9.0,1412.0,82.0,0.0,,0.0,70.0,82.0,38.0,1.0,152.0,1,0.0,0.0,12.0,0.0,70.0,0,,,,
2,2019,1,5,6,2019-01-05,9E,N316PQ,5122,30397,ATL,"Atlanta, GA",34783,SGF,"Springfield, MO",956.0,6.0,20.0,1016.0,1040.0,3.0,1043.0,-8.0,0.0,,0.0,121.0,107.0,84.0,1.0,563.0,3,,,,,,0,,,,
3,2019,1,6,7,2019-01-06,9E,N325PQ,5122,30397,ATL,"Atlanta, GA",34783,SGF,"Springfield, MO",945.0,-5.0,16.0,1001.0,1026.0,3.0,1029.0,-24.0,0.0,,0.0,123.0,104.0,85.0,1.0,563.0,3,,,,,,0,,,,
4,2019,1,7,1,2019-01-07,9E,N904XJ,5122,30397,ATL,"Atlanta, GA",34783,SGF,"Springfield, MO",947.0,-3.0,25.0,1012.0,1040.0,4.0,1044.0,-9.0,0.0,,0.0,123.0,117.0,88.0,1.0,563.0,3,,,,,,0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
583980,2019,1,20,7,2019-01-20,UA,N78448,564,30325,DEN,"Denver, CO",31454,MCO,"Orlando, FL",948.0,3.0,18.0,1006.0,1459.0,8.0,1507.0,-9.0,0.0,,0.0,211.0,199.0,173.0,1.0,1546.0,7,,,,,,0,,,,
583981,2019,1,20,7,2019-01-20,UA,N37468,563,34100,PHL,"Philadelphia, PA",30977,ORD,"Chicago, IL",1652.0,-8.0,14.0,1706.0,1749.0,13.0,1802.0,-20.0,0.0,,0.0,142.0,130.0,103.0,1.0,678.0,3,,,,,,0,,,,
583982,2019,1,20,7,2019-01-20,UA,N12218,562,30325,DEN,"Denver, CO",32575,SNA,"Santa Ana, CA",1920.0,5.0,12.0,1932.0,2038.0,5.0,2043.0,0.0,0.0,,0.0,148.0,143.0,126.0,1.0,846.0,4,,,,,,0,,,,
583983,2019,1,20,7,2019-01-20,UA,N12225,561,31703,LGA,"New York, NY",30325,DEN,"Denver, CO",715.0,-10.0,14.0,729.0,920.0,5.0,925.0,-40.0,0.0,,0.0,280.0,250.0,231.0,1.0,1620.0,7,,,,,,0,,,,


Let's proceed with multiple-file importing, through concatenation into a single DataFrame.

In [21]:
directory_in_str = os.path.join(root,
                                "Raw_Data",
                                "US_DoT",
                                "ONTIME_REPORTING")
directory_in_str

'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\ONTIME_REPORTING'

In [30]:
# List the monthly files available before importing the 12 files for the year 2020
# (their names start with "20"). The listing is printed once, not once per file:
print(os.listdir(directory_in_str))

['1901_921771952_T_ONTIME_REPORTING.zip', '1902_921771952_T_ONTIME_REPORTING.zip', '1903_921771952_T_ONTIME_REPORTING.zip', '1904_921771952_T_ONTIME_REPORTING.zip', '1905_921881115_T_ONTIME_REPORTING.zip', '1906_921881115_T_ONTIME_REPORTING.zip', '1907_921881115_T_ONTIME_REPORTING.zip', '1908_921888367_T_ONTIME_REPORTING.zip', '1909_921901899_T_ONTIME_REPORTING.zip', '1910_921910120_T_ONTIME_REPORTING.zip', '1911_921910120_T_ONTIME_REPORTING.zip', '1912_921910120_T_ONTIME_REPORTING.zip', '2001_494230044_T_ONTIME_REPORTING.zip', '2002_494247439_T_ONTIME_REPORTING.zip', '2003_707907111_T_ONTIME_REPORTING.zip', '2004_921771952_T_ONTIME_REPORTING.zip', '2005_921771952_T_ONTIME_REPORTING.zip', '2006_921771952_T_ONTIME_REPORTING.zip', '2007_921771952_T_ONTIME_REPORTING.zip', '2008_921771952_T_ONTIME_REPORTING.zip', '2009_921771952_T_ONTIME_REPORTING.zip', '2010_921771952_T_ONTIME_REPORTING.zip', '2011_921771952_T_ONTIME_REPORTING.zip', '2012_921771952_T_ONTIME_REPORTING.zip']
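
The concatenation step itself could look like the sketch below. `load_months` is a hypothetical helper, not the notebook's own code, and the demo runs on two tiny made-up files in a temporary directory rather than on the real data. With the real files, passing `usecols` also keeps the extra blank column out.

```python
import os
import tempfile
import pandas as pd

def load_months(directory, prefix, usecols=None):
    """Concatenate every '<prefix>*.zip' monthly file found in 'directory'."""
    names = sorted(f for f in os.listdir(directory)
                   if f.startswith(prefix) and f.endswith(".zip"))
    if not names:
        raise FileNotFoundError(f"No '{prefix}*.zip' files in {directory}")
    frames = [pd.read_csv(os.path.join(directory, name),
                          encoding="latin1",
                          usecols=usecols,
                          low_memory=False)
              for name in names]
    # ignore_index=True gives the combined frame a single continuous index
    return pd.concat(frames, ignore_index=True)

# Self-contained demo with two tiny, made-up monthly files
with tempfile.TemporaryDirectory() as tmp:
    for month in (1, 2):
        toy = pd.DataFrame({"YEAR": [2020, 2020], "MONTH": [month, month], "EXTRA": [0, 0]})
        toy.to_csv(os.path.join(tmp, f"20{month:02d}_000000000_T_ONTIME_REPORTING.zip"),
                   index=False)
    demo = load_months(tmp, prefix="20", usecols=["YEAR", "MONTH"])

print(demo.shape)
```

Building the list of frames first and calling `pd.concat` once is generally cheaper than appending to a DataFrame inside the loop.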

There are some missing values. Let's take a closer look:

In [19]:
# Absolute number of missing values by column:
df7.isna().sum()

YEAR                            0
MONTH                           0
DAY_OF_MONTH                    0
DAY_OF_WEEK                     0
FL_DATE                         0
OP_UNIQUE_CARRIER               0
TAIL_NUM                     2543
OP_CARRIER_FL_NUM               0
ORIGIN_CITY_MARKET_ID           0
ORIGIN                          0
ORIGIN_CITY_NAME                0
DEST_CITY_MARKET_ID             0
DEST                            0
DEST_CITY_NAME                  0
DEP_TIME                    16352
DEP_DELAY                   16355
TAXI_OUT                    16616
WHEELS_OFF                  16616
WHEELS_ON                   17061
TAXI_IN                     17061
ARR_TIME                    17061
ARR_DELAY                   18022
CANCELLED                       0
CANCELLATION_CODE          567259
DIVERTED                        0
CRS_ELAPSED_TIME              134
ACTUAL_ELAPSED_TIME         18022
AIR_TIME                    18022
FLIGHTS                         0
DISTANCE      

In [20]:
# Relative frequency of missing values by column:
df7.isna().sum() / len(df7) * 100

YEAR                        0.000000
MONTH                       0.000000
DAY_OF_MONTH                0.000000
DAY_OF_WEEK                 0.000000
FL_DATE                     0.000000
OP_UNIQUE_CARRIER           0.000000
TAIL_NUM                    0.435456
OP_CARRIER_FL_NUM           0.000000
ORIGIN_CITY_MARKET_ID       0.000000
ORIGIN                      0.000000
ORIGIN_CITY_NAME            0.000000
DEST_CITY_MARKET_ID         0.000000
DEST                        0.000000
DEST_CITY_NAME              0.000000
DEP_TIME                    2.800072
DEP_DELAY                   2.800586
TAXI_OUT                    2.845279
WHEELS_OFF                  2.845279
WHEELS_ON                   2.921479
TAXI_IN                     2.921479
ARR_TIME                    2.921479
ARR_DELAY                   3.086038
CANCELLED                   0.000000
CANCELLATION_CODE          97.135885
DIVERTED                    0.000000
CRS_ELAPSED_TIME            0.022946
ACTUAL_ELAPSED_TIME         3.086038
A
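
Many of these gaps may be structural rather than data-quality problems: in the 10,000-row sample, the cancelled flight (row 9998, `CANCELLED` = 1.0) has no departure or arrival times, and `CANCELLATION_CODE` is only filled for cancelled flights. The sketch below shows one way to check this on made-up rows; it also shows that `isna().mean()` gives the same percentages as the expression used above.

```python
import numpy as np
import pandas as pd

# Made-up rows: the third flight is cancelled, so it has no departure time
toy = pd.DataFrame({
    "DEP_TIME": [1205.0, 956.0, np.nan, 1801.0],
    "CANCELLED": [0.0, 0.0, 1.0, 0.0],
    "CANCELLATION_CODE": [np.nan, np.nan, "A", np.nan],
})

# isna().mean() is the share of missing values; identical to isna().sum() / len(df)
pct_missing = toy.isna().mean() * 100
print(pct_missing)

# Share of missing values per column, split by the CANCELLED flag
by_status = toy.drop(columns="CANCELLED").isna().groupby(toy["CANCELLED"]).mean() * 100
print(by_status)
```

If the same split on the real data shows that nearly all missing times belong to cancelled or diverted flights, those rows can be handled as a separate case instead of being imputed.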