# 1. Ask an interesting question

1. Delay binary classifier (yes/no, >15 min)
    - Better isolate the key features that predict delays, so that airlines can focus their efforts on their weakest points
    - Account for uncontrollable variables (e.g. unpredicted adverse weather) and prepare to act accordingly
2. Additional questions:
    - When is the best time of day/day of week/time of year to fly to minimize delays?
    - Do older planes suffer more delays?
    - How does the number of people flying between different locations change over time?
    - How well does weather predict plane delays?
    - Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
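
As a rough illustration of the first question, a binary target could be derived from the arrival delay in minutes. The sketch below uses a hypothetical toy frame; `ARR_DELAY` matches a column in the dataset loaded later, while `IS_DELAYED` is a made-up name. The `>=` threshold is an assumption: a sampled row further down with `ARR_DELAY` of 15.0 carries `ARR_DEL15` of 1.0, which suggests the source flag means "15 minutes or more".

```python
import pandas as pd

# Hypothetical toy data: arrival delays in minutes (None = no arrival recorded)
toy = pd.DataFrame({"ARR_DELAY": [-9.0, 15.0, 38.0, None, 14.0]})

# Binary label: 1.0 when the delay is 15 minutes or more, 0.0 otherwise.
# Missing delays stay NaN instead of being silently labelled "on time".
is_late = (toy["ARR_DELAY"] >= 15).astype(float)
toy["IS_DELAYED"] = is_late.where(toy["ARR_DELAY"].notna())
print(toy)
```

Keeping the missing rows as NaN leaves the decision about cancelled or otherwise incomplete flights to a later, explicit step.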

___

# 2. Get the data

### On-Time : Reporting Carrier On-Time Performance (1987-present)

Source: https://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&DB_URL=Mode_ID=1&Mode_Desc=Aviation&Subject_ID2=0

<em>Note</em>: Over time both the code and the name of a carrier may change, and the same code or name may be assumed by a different airline. To ensure that you are analyzing data from the same airline, TranStats provides four airline-specific variables that identify one and only one carrier or its entity: Airline ID (AirlineID), Unique Carrier Code (UniqueCarrier), Unique Carrier Name (UniqueCarrierName), and Unique Entity (UniqCarrierEntity). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.

Changes in carrier reporting:

- US Airways and America West started to report combined on-time data in January 2006 and combined traffic and financial data in October 2007, following their 2005 merger announcement.
- Delta and Northwest began reporting jointly in January 2010, following their 2008 merger announcement.
- Continental Micronesia was combined into Continental Airlines in December 2010, and joint reporting began in January 2011.
- Atlantic Southeast and ExpressJet began reporting jointly in January 2012.
- United and Continental began reporting jointly in January 2012, following their 2010 merger announcement.
- Endeavor (9E) operated as Pinnacle prior to August 2013.
- Envoy (MQ) operated as American Eagle prior to April 2014.
- Southwest (WN) and AirTran (FL) began reporting jointly in January 2015, following their 2011 merger announcement.
- American (AA) and US Airways (US) began reporting jointly as AA in July 2015, following their 2013 merger announcement.
- Alaska (AS) and Virgin America (VX) began reporting jointly as AS in April 2018, following their 2016 merger announcement.

        
- **AC_Types**:
    - **Summary**:
        - Reporting Carrier On-Time Performance (1987-present)
    - **Description**:
        - Reporting carriers are required to (or voluntarily) report on-time data for flights they operate: on-time arrival and departure data for non-stop domestic flights by month and year, by carrier and by origin and destination airport. Includes scheduled and actual departure and arrival times, canceled and diverted flights, taxi-out and taxi-in times, causes of delay and cancellation, air time, and non-stop distance. Use Download for individual flight data.
    - **File**:
        - YYMM_123456789_T_ONTIME_REPORTING.zip *(YY = Year ; MM = Month)*
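
The file-naming pattern above can be checked programmatically before loading. The following is a minimal sketch, assuming the numeric middle part is an arbitrary download ID and that all files date from 2000 onwards; `PATTERN` and `parse_period` are hypothetical names, not part of the notebook.

```python
import re

# Assumed pattern: two-digit year, two-digit month, a numeric ID, then the
# fixed table name. The ID differs from one download to another.
PATTERN = re.compile(r"^(?P<yy>\d{2})(?P<mm>\d{2})_\d+_T_ONTIME_REPORTING\.zip$")

def parse_period(filename):
    """Return (year, month) parsed from a file name, or None if it does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    year = 2000 + int(match.group("yy"))  # Assumption: no files from before 2000
    month = int(match.group("mm"))
    return (year, month) if 1 <= month <= 12 else None

print(parse_period("1901_921771952_T_ONTIME_REPORTING.zip"))  # (2019, 1)
```

A check like this makes it harder for an unrelated archive in the same folder to slip into the yearly dataset.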

In [1]:
# Import libraries to be used

# Warning messages display
## import warnings
## warnings.filterwarnings(action='ignore') # https://docs.python.org/3/library/warnings.html#the-warnings-filter

# Directories/Files management
import os.path
## from zipfile import ZipFile # Not needed so far

# Timing
import time

# Memory monitoring
%load_ext memory_profiler
### Use '%memit' to check at each point

# Data analysis and wrangling
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None) # Show all columns in DataFrames
## pd.set_option('display.max_rows', None) # It greatly slows down the output display and freezes the kernel

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot') # choose a style: 'plt.style.available'
sns.set_theme(context='notebook',
              style="darkgrid") # {darkgrid, whitegrid, dark, white, ticks}
palette = sns.color_palette("flare", as_cmap=True)
import altair as alt

# Machine Learning
## from sklearn.[...] import ...

In [2]:
t0 = time.perf_counter() 

In [3]:
# Detect the operating system and set the project root path accordingly

if os.name == 'nt': # Windows
    root = r"C:\Users\turge\CompartidoVM\0.TFM"
    print("Running on Windows.")
elif os.name == 'posix': # Ubuntu
    root = "/home/dsc/shared/0.TFM"
    print("Running on Ubuntu.")
else: # Fail early instead of hitting an undefined 'root' below
    raise OSError("Unsupported operating system: " + os.name)
print("root path\t", root)

Running on Windows.
root path	 C:\Users\turge\CompartidoVM\0.TFM


___

Additional information on each column meaning can be found [here](https://www.transtats.bts.gov/Fields.asp?Table_ID=236&SYS_Table_Name=T_ONTIME_REPORTING&User_Table_Name=Reporting%20Carrier%20On-Time%20Performance%20(1987-present)&Year_Info=1&First_Year=1987&Last_Year=2020&Rate_Info=0&Frequency=Monthly&Data_Frequency=Annual,Quarterly,Monthly).

___

### SAMPLE FILE

Single file import (i.e. month-sized database)

In [4]:
csv_path = os.path.join(root,
                        "Raw_Data",
                        "US_DoT",
                        "ONTIME_REPORTING",
                        "1901_921771952_T_ONTIME_REPORTING.zip")
csv_path

'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\ONTIME_REPORTING\\1901_921771952_T_ONTIME_REPORTING.zip'

Since each monthly file is large, and many of them will later be combined, a first exploration is done on the first 10,000 rows only.

In [5]:
# 'pd.read_csv' reads zipped CSV files directly, so no manual extraction is needed.
# A plain import loads an unwanted blank trailing column, so read the header first:
cols = pd.read_csv(csv_path, nrows=1).columns
df_0 = pd.read_csv(csv_path,
                  encoding='latin1',
                  nrows=10000,
                  usecols=cols[:-1], # Skip the trailing blank column while loading
                  low_memory=False)
df_0.sample(5)

Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_FIPS,ORIGIN_STATE_NM,ORIGIN_WAC,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_FIPS,DEST_STATE_NM,DEST_WAC,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DELAY_NEW,DEP_DEL15,DEP_DELAY_GROUP,DEP_TIME_BLK,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DELAY_NEW,ARR_DEL15,ARR_DELAY_GROUP,ARR_TIME_BLK,CANCELLED,CANCELLATION_CODE,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,FIRST_DEP_TIME,TOTAL_ADD_GTIME,LONGEST_ADD_GTIME,DIV_AIRPORT_LANDINGS,DIV_REACHED_DEST,DIV_ACTUAL_ELAPSED_TIME,DIV_ARR_DELAY,DIV_DISTANCE,DIV1_AIRPORT,DIV1_AIRPORT_ID,DIV1_AIRPORT_SEQ_ID,DIV1_WHEELS_ON,DIV1_TOTAL_GTIME,DIV1_LONGEST_GTIME,DIV1_WHEELS_OFF,DIV1_TAIL_NUM,DIV2_AIRPORT,DIV2_AIRPORT_ID,DIV2_AIRPORT_SEQ_ID,DIV2_WHEELS_ON,DIV2_TOTAL_GTIME,DIV2_LONGEST_GTIME,DIV2_WHEELS_OFF,DIV2_TAIL_NUM,DIV3_AIRPORT,DIV3_AIRPORT_ID,DIV3_AIRPORT_SEQ_ID,DIV3_WHEELS_ON,DIV3_TOTAL_GTIME,DIV3_LONGEST_GTIME,DIV3_WHEELS_OFF,DIV3_TAIL_NUM,DIV4_AIRPORT,DIV4_AIRPORT_ID,DIV4_AIRPORT_SEQ_ID,DIV4_WHEELS_ON,DIV4_TOTAL_GTIME,DIV4_LONGEST_GTIME,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM
9706,2019,1,1,14,1,2019-01-14,WN,19393,WN,N8602F,1141,14679,1467903,33570,SAN,"San Diego, CA",CA,6,California,91,13796,1379608,32457,OAK,"Oakland, CA",CA,6,California,91,1155,1153.0,-2.0,0.0,0.0,-1.0,1100-1159,34.0,1227.0,1335.0,5.0,1325,1340.0,15.0,15.0,1.0,1.0,1300-1359,0.0,,0.0,90.0,107.0,68.0,1.0,446.0,2,0.0,0.0,15.0,0.0,0.0,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4329,2019,1,1,4,5,2019-01-04,WN,19393,WN,N7881A,819,14831,1483106,32457,SJC,"San Jose, CA",CA,6,California,91,13891,1389101,32575,ONT,"Ontario, CA",CA,6,California,91,1930,2018.0,48.0,48.0,1.0,3.0,1900-1959,8.0,2026.0,2114.0,4.0,2040,2118.0,38.0,38.0,1.0,2.0,2000-2059,0.0,,0.0,70.0,60.0,48.0,1.0,333.0,2,1.0,0.0,0.0,0.0,37.0,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4135,2019,1,1,4,5,2019-01-04,WN,19393,WN,N954WN,2077,14679,1467903,33570,SAN,"San Diego, CA",CA,6,California,91,14831,1483106,32457,SJC,"San Jose, CA",CA,6,California,91,1255,1254.0,-1.0,0.0,0.0,-1.0,1200-1259,16.0,1310.0,1409.0,2.0,1420,1411.0,-9.0,0.0,0.0,-1.0,1400-1459,0.0,,0.0,85.0,77.0,59.0,1.0,417.0,2,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
5396,2019,1,1,11,5,2019-01-11,WN,19393,WN,N8658A,1326,13198,1319801,33198,MCI,"Kansas City, MO",MO,29,Missouri,64,12339,1233904,32337,IND,"Indianapolis, IN",IN,18,Indiana,42,745,740.0,-5.0,0.0,0.0,-1.0,0700-0759,9.0,749.0,948.0,5.0,1010,953.0,-17.0,0.0,0.0,-2.0,1000-1059,0.0,,0.0,85.0,73.0,59.0,1.0,451.0,2,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6222,2019,1,1,13,7,2019-01-13,WN,19393,WN,N707SA,4682,14771,1477104,32457,SFO,"San Francisco, CA",CA,6,California,91,12892,1289208,32575,LAX,"Los Angeles, CA",CA,6,California,91,830,839.0,9.0,9.0,0.0,0.0,0800-0859,16.0,855.0,951.0,8.0,1000,959.0,-1.0,0.0,0.0,-1.0,1000-1059,0.0,,0.0,90.0,80.0,56.0,1.0,337.0,2,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [6]:
cols = list(df_0.columns)
cols

['YEAR',
 'QUARTER',
 'MONTH',
 'DAY_OF_MONTH',
 'DAY_OF_WEEK',
 'FL_DATE',
 'OP_UNIQUE_CARRIER',
 'OP_CARRIER_AIRLINE_ID',
 'OP_CARRIER',
 'TAIL_NUM',
 'OP_CARRIER_FL_NUM',
 'ORIGIN_AIRPORT_ID',
 'ORIGIN_AIRPORT_SEQ_ID',
 'ORIGIN_CITY_MARKET_ID',
 'ORIGIN',
 'ORIGIN_CITY_NAME',
 'ORIGIN_STATE_ABR',
 'ORIGIN_STATE_FIPS',
 'ORIGIN_STATE_NM',
 'ORIGIN_WAC',
 'DEST_AIRPORT_ID',
 'DEST_AIRPORT_SEQ_ID',
 'DEST_CITY_MARKET_ID',
 'DEST',
 'DEST_CITY_NAME',
 'DEST_STATE_ABR',
 'DEST_STATE_FIPS',
 'DEST_STATE_NM',
 'DEST_WAC',
 'CRS_DEP_TIME',
 'DEP_TIME',
 'DEP_DELAY',
 'DEP_DELAY_NEW',
 'DEP_DEL15',
 'DEP_DELAY_GROUP',
 'DEP_TIME_BLK',
 'TAXI_OUT',
 'WHEELS_OFF',
 'WHEELS_ON',
 'TAXI_IN',
 'CRS_ARR_TIME',
 'ARR_TIME',
 'ARR_DELAY',
 'ARR_DELAY_NEW',
 'ARR_DEL15',
 'ARR_DELAY_GROUP',
 'ARR_TIME_BLK',
 'CANCELLED',
 'CANCELLATION_CODE',
 'DIVERTED',
 'CRS_ELAPSED_TIME',
 'ACTUAL_ELAPSED_TIME',
 'AIR_TIME',
 'FLIGHTS',
 'DISTANCE',
 'DISTANCE_GROUP',
 'CARRIER_DELAY',
 'WEATHER_DELAY',
 'NAS_DEL

Let's see which columns will remain, and skip the rest.

___

### MULTIPLE FILES

Let's proceed with multiple-file importing, through concatenation into a single DataFrame.
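
The idea can be sketched on toy data first: read each source into its own DataFrame, collect them in a list, and concatenate once at the end. The in-memory CSV strings below are hypothetical stand-ins for two monthly files, and `ignore_index=True` is a choice made for this sketch rather than something the notebook does.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two monthly files (in-memory CSV text)
january = io.StringIO("MONTH,ORIGIN,ARR_DELAY\n1,SAN,15.0\n1,SJC,38.0\n")
february = io.StringIO("MONTH,ORIGIN,ARR_DELAY\n2,ATL,-8.0\n")

# Read each source into its own frame, then concatenate once at the end
frames = [pd.read_csv(source) for source in (january, february)]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Concatenating a list once avoids repeatedly copying a growing DataFrame, which matters when each monthly file holds hundreds of thousands of rows.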

#### Run only the first time, to read the 12 individual monthly files and generate the global file (entire year 2019)

##### 1. Read individual month files

In [7]:
directory_in_str = os.path.join(root,
                                "Raw_Data",
                                "US_DoT",
                                "ONTIME_REPORTING")
directory_in_str

'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\ONTIME_REPORTING'

In [8]:
# List the paths of the files corresponding to each month of 2019

file_list = []
try:
    file_names = sorted(os.listdir(directory_in_str)) # Sorted so that months load in calendar order
except FileNotFoundError:
    print("The system cannot find the specified path:\n" + directory_in_str + "\nPlease check that the path has been set properly.")
else:
    for file in file_names:
        if file.endswith(".zip") and file.startswith("19"):
            file_list.append(os.path.join(directory_in_str, file))
file_list

['C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\ONTIME_REPORTING\\1901_921771952_T_ONTIME_REPORTING.zip',
 'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\ONTIME_REPORTING\\1902_921771952_T_ONTIME_REPORTING.zip',
 'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\ONTIME_REPORTING\\1903_921771952_T_ONTIME_REPORTING.zip',
 'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\ONTIME_REPORTING\\1904_921771952_T_ONTIME_REPORTING.zip',
 'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\ONTIME_REPORTING\\1905_921881115_T_ONTIME_REPORTING.zip',
 'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\ONTIME_REPORTING\\1906_921881115_T_ONTIME_REPORTING.zip',
 'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\ONTIME_REPORTING\\1907_921881115_T_ONTIME_REPORTING.zip',
 'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\ONTIME_REPORTING\\1908_921888367_T_ONTIME_REPORTING.zip',
 'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\ONTIM

In [9]:
%%time

# Create a DataFrame from the 12 monthly files corresponding to the year 2019

cols = pd.read_csv(csv_path, nrows=1).columns # A plain import loads an unwanted blank trailing column

monthly_frames = []
for i, month_path in enumerate(file_list):
    if i >= 12: # Fail-safe: in case the list captured more than 12 files
        break
    df_month = pd.read_csv(month_path,
                           encoding='latin1',
                           nrows=1_000_000, # Fail-safe: in case the file is unexpectedly large
                           usecols=cols[:-1], # Skip the trailing blank column while loading
                           low_memory=False) # Parse each file in one pass to avoid mixed-dtype inference
    monthly_frames.append(df_month)
df = pd.concat(monthly_frames) # 'DataFrame.append' was removed in pandas 2.0
df.sample(5)

Wall time: 5min 59s


Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_FIPS,ORIGIN_STATE_NM,ORIGIN_WAC,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_FIPS,DEST_STATE_NM,DEST_WAC,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DELAY_NEW,DEP_DEL15,DEP_DELAY_GROUP,DEP_TIME_BLK,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DELAY_NEW,ARR_DEL15,ARR_DELAY_GROUP,ARR_TIME_BLK,CANCELLED,CANCELLATION_CODE,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,FIRST_DEP_TIME,TOTAL_ADD_GTIME,LONGEST_ADD_GTIME,DIV_AIRPORT_LANDINGS,DIV_REACHED_DEST,DIV_ACTUAL_ELAPSED_TIME,DIV_ARR_DELAY,DIV_DISTANCE,DIV1_AIRPORT,DIV1_AIRPORT_ID,DIV1_AIRPORT_SEQ_ID,DIV1_WHEELS_ON,DIV1_TOTAL_GTIME,DIV1_LONGEST_GTIME,DIV1_WHEELS_OFF,DIV1_TAIL_NUM,DIV2_AIRPORT,DIV2_AIRPORT_ID,DIV2_AIRPORT_SEQ_ID,DIV2_WHEELS_ON,DIV2_TOTAL_GTIME,DIV2_LONGEST_GTIME,DIV2_WHEELS_OFF,DIV2_TAIL_NUM,DIV3_AIRPORT,DIV3_AIRPORT_ID,DIV3_AIRPORT_SEQ_ID,DIV3_WHEELS_ON,DIV3_TOTAL_GTIME,DIV3_LONGEST_GTIME,DIV3_WHEELS_OFF,DIV3_TAIL_NUM,DIV4_AIRPORT,DIV4_AIRPORT_ID,DIV4_AIRPORT_SEQ_ID,DIV4_WHEELS_ON,DIV4_TOTAL_GTIME,DIV4_LONGEST_GTIME,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM
252731,2019,2,6,16,7,2019-06-16,OO,20304,OO,N963SW,5986,14252,1425202,34252,PSC,"Pasco/Kennewick/Richland, WA",WA,53,Washington,93,12892,1289208,32575,LAX,"Los Angeles, CA",CA,6,California,91,1630,1620.0,-10.0,0.0,0.0,-1.0,1600-1659,24.0,1644.0,1909.0,11.0,1913,1920.0,7.0,7.0,0.0,0.0,1900-1959,0.0,,0.0,163.0,180.0,145.0,1.0,851.0,4,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
54355,2019,4,11,8,5,2019-11-08,MQ,20398,MQ,N262NN,3399,13930,1393007,30977,ORD,"Chicago, IL",IL,17,Illinois,41,11066,1106606,31066,CMH,"Columbus, OH",OH,39,Ohio,44,1246,1243.0,-3.0,0.0,0.0,-1.0,1200-1259,11.0,1254.0,1440.0,7.0,1507,1447.0,-20.0,0.0,0.0,-2.0,1500-1559,0.0,,0.0,81.0,64.0,46.0,1.0,296.0,2,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
626770,2019,3,7,18,4,2019-07-18,WN,19393,WN,N457WN,882,13232,1323202,30977,MDW,"Chicago, IL",IL,17,Illinois,41,11193,1119302,33105,CVG,"Cincinnati, OH",KY,21,Kentucky,52,700,656.0,-4.0,0.0,0.0,-1.0,0700-0759,12.0,708.0,850.0,3.0,905,853.0,-12.0,0.0,0.0,-1.0,0900-0959,0.0,,0.0,65.0,57.0,42.0,1.0,249.0,1,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
145634,2019,4,11,29,5,2019-11-29,WN,19393,WN,N8618N,4840,11292,1129202,30325,DEN,"Denver, CO",CO,8,Colorado,82,10693,1069302,30693,BNA,"Nashville, TN",TN,47,Tennessee,54,2105,2115.0,10.0,10.0,0.0,0.0,2100-2159,32.0,2147.0,50.0,4.0,25,54.0,29.0,29.0,1.0,1.0,0001-0559,0.0,,0.0,140.0,159.0,123.0,1.0,1014.0,5,0.0,10.0,19.0,0.0,0.0,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
59012,2019,1,3,17,7,2019-03-17,AA,19805,AA,N140AN,1475,11298,1129806,30194,DFW,"Dallas/Fort Worth, TX",TX,48,Texas,74,14635,1463502,31714,RSW,"Fort Myers, FL",FL,12,Florida,33,1057,1053.0,-4.0,0.0,0.0,-1.0,1000-1059,22.0,1115.0,1414.0,7.0,1430,1421.0,-9.0,0.0,0.0,-1.0,1400-1459,0.0,,0.0,153.0,148.0,119.0,1.0,1017.0,5,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


##### 2. Generate file (2019)

In [10]:
output_csv_path = os.path.join(root,
                               "Output_Data",
                               "US_DoT",
                               "OTP_Preprocessed_2019_v1.csv")
output_csv_path

'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Output_Data\\US_DoT\\OTP_Preprocessed_2019_v1.csv'

In [11]:
%%time

df.to_csv(path_or_buf=output_csv_path,
          index=False,
          encoding='latin1')

Wall time: 9min 57s


___

In [12]:
%memit

peak memory: 6681.16 MiB, increment: 1.68 MiB


In [13]:
csv_path = os.path.join(root,
                        "Output_Data",
                        "US_DoT",
                        "OTP_Preprocessed_2019_v1.csv")
csv_path

'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Output_Data\\US_DoT\\OTP_Preprocessed_2019_v1.csv'

In [14]:
cols = [
    
### -----  < X > (PRE-FLIGHT DATA) -----

# Time Period
#  'YEAR', # Disregarded: for the time being, analysis limited to 2019
#  'QUARTER', # Disregarded: redundant
 'MONTH',
 'DAY_OF_MONTH',
 'DAY_OF_WEEK',
#  'FL_DATE', # Disregarded: redundant
# Airline / Aircraft
 'OP_UNIQUE_CARRIER',
#  'OP_CARRIER_AIRLINE_ID', # Disregarded: redundant
#  'OP_CARRIER', # Disregarded: redundant
 'TAIL_NUM',
#  'OP_CARRIER_FL_NUM', # Unknown in advance?
# Origin
#  'ORIGIN_AIRPORT_ID', # Disregarded: redundant
#  'ORIGIN_AIRPORT_SEQ_ID', # Disregarded: redundant
#  'ORIGIN_CITY_MARKET_ID', # Disregarded: not relevant for this particular analysis
 'ORIGIN',
 'ORIGIN_CITY_NAME', # Kept only for later grouping purposes
 'ORIGIN_STATE_ABR', # Kept only for later grouping purposes
#  'ORIGIN_STATE_FIPS', # Federal Information Processing Standards # Not used for the moment
 'ORIGIN_STATE_NM', # Kept only for later grouping purposes
#  'ORIGIN_WAC', # World Area Code # Not used for the moment
# Destination
#  'DEST_AIRPORT_ID', # Disregarded: redundant
#  'DEST_AIRPORT_SEQ_ID', # Disregarded: redundant
#  'DEST_CITY_MARKET_ID', # Disregarded: not relevant for this particular analysis
 'DEST',
 'DEST_CITY_NAME', # Kept only for later grouping purposes
 'DEST_STATE_ABR', # Kept only for later grouping purposes
#  'DEST_STATE_FIPS', # Federal Information Processing Standards # Not used for the moment
 'DEST_STATE_NM', # Kept only for later grouping purposes
#  'DEST_WAC', # World Area Code # Not used for the moment
# Departure Performance
 'CRS_DEP_TIME',
 'DEP_TIME_BLK', # Will be used for grouping into categories
#  'TAXI_OUT_median', #  Output / However, the median for each airport could be used as input !! (explanation below)   
# Arrival Performance
 'CRS_ARR_TIME',
 'ARR_TIME_BLK', # Will be used for grouping into categories
#  'TAXI_IN_median', #  Output / However, the median for each airport could be used as input !! (explanation below) 
# Flight Summaries
 'CRS_ELAPSED_TIME',
 'FLIGHTS',
 'DISTANCE',
 'DISTANCE_GROUP',

### ----- < y > (POST-FLIGHT DATA) -----

# Departure Performance
 'DEP_TIME', # Kept at this stage, although redundant
 'DEP_DELAY', # Kept: other potentially useful target
#  'DEP_DELAY_NEW', # Disregarded: redundant
 'DEP_DEL15', # Kept: other potentially useful target
#  'DEP_DELAY_GROUP', # Disregarded: not relevant for this particular analysis
 'TAXI_OUT', #  Output / However, the median for each airport could be used as input !! (explanation below)
#  'WHEELS_OFF', # Disregarded: redundant
# Arrival Performance
#  'WHEELS_ON', # Disregarded: redundant
 'TAXI_IN', #  Output / However, the median for each airport could be used as input !! (explanation below)
 'ARR_TIME', # Kept at this stage, although redundant
 'ARR_DELAY', # -------------------------------------------> MAIN TARGET !! (i.e. < y >)
#  'ARR_DELAY_NEW', # Disregarded: redundant
 'ARR_DEL15', # Kept: other potentially useful target
#  'ARR_DELAY_GROUP', # Disregarded: not relevant for this particular analysis
# Cancellations and Diversions
 'CANCELLED', # Kept at this stage, although not relevant for this particular analysis
#  'CANCELLATION_CODE', # Disregarded: not relevant for this particular analysis
#  'DIVERTED', # Disregarded: not relevant for this particular analysis
# Flight Summaries
#  'ACTUAL_ELAPSED_TIME', # Disregarded: redundant
#  'AIR_TIME', # Disregarded: redundant
# Cause of Delay
 'CARRIER_DELAY', # Kept: other potentially useful target
 'WEATHER_DELAY', # Kept: other potentially useful target
 'NAS_DELAY', # Kept: other potentially useful target
 'SECURITY_DELAY', # Kept: other potentially useful target
 'LATE_AIRCRAFT_DELAY', # Kept: other potentially useful target
# Gate Return Information at Origin Airport (Data starts 10/2008)
#  'FIRST_DEP_TIME', # Disregarded: not relevant for this particular analysis
#  'TOTAL_ADD_GTIME', # Disregarded: not relevant for this particular analysis
#  'LONGEST_ADD_GTIME', # Disregarded: not relevant for this particular analysis
# Diverted Airport Information (Data starts 10/2008)
#  'DIV_AIRPORT_LANDINGS', # Disregarded: not relevant for this particular analysis
]

In [15]:
%%time

# Load the file in chunks to avoid running out of memory:

# cols = pd.read_csv(csv_path, nrows=1).columns

chunks_list = []
chunks = pd.read_csv(csv_path,
                     encoding='latin1',
#                      nrows=1_000_000, # Fail-safe: in case the file is unexpectedly large
                     chunksize=1_000_000, # Number of rows per chunk
                     usecols=cols,
                     low_memory=False)

for i, chunk in enumerate(chunks):
    if i == 13: # Fail-safe: for debugging purposes only
        break
    chunks_list.append(chunk)

df = pd.concat(chunks_list, axis=0)
del chunks_list

df.sample(5)

Wall time: 1min 24s


Unnamed: 0,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,TAIL_NUM,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_NM,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_NM,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DEL15,DEP_TIME_BLK,TAXI_OUT,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DEL15,ARR_TIME_BLK,CANCELLED,CRS_ELAPSED_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY
311611,1,24,4,OH,N599NN,MEM,"Memphis, TN",TN,Tennessee,DCA,"Washington, DC",VA,Virginia,1818,1809.0,-9.0,0.0,1800-1859,14.0,6.0,2120,2054.0,-26.0,0.0,2100-2159,0.0,122.0,1.0,762.0,4,,,,,
2169929,4,14,7,AA,N524UW,CLT,"Charlotte, NC",NC,North Carolina,LAX,"Los Angeles, CA",CA,California,1000,1009.0,9.0,0.0,1000-1059,17.0,10.0,1244,1209.0,-35.0,0.0,1200-1259,0.0,344.0,1.0,2125.0,9,,,,,
2178910,4,1,1,AA,N861NN,MIA,"Miami, FL",FL,Florida,PHL,"Philadelphia, PA",PA,Pennsylvania,2130,2128.0,-2.0,0.0,2100-2159,14.0,6.0,20,2400.0,-20.0,0.0,0001-0559,0.0,170.0,1.0,1013.0,5,,,,,
2240847,4,5,5,B6,N988JT,JFK,"New York, NY",NY,New York,LAX,"Los Angeles, CA",CA,California,549,544.0,-5.0,0.0,0001-0559,19.0,10.0,909,844.0,-25.0,0.0,0900-0959,0.0,380.0,1.0,2475.0,10,,,,,
1427384,3,10,7,DL,N896AT,DTW,"Detroit, MI",MI,Michigan,CLT,"Charlotte, NC",NC,North Carolina,1356,1352.0,-4.0,0.0,1300-1359,15.0,15.0,1547,1536.0,-11.0,0.0,1500-1559,0.0,111.0,1.0,500.0,3,,,,,
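
To make the chunk-loading step concrete, the sketch below runs the same pattern on a tiny in-memory CSV. The data and the chunk size of 4 are hypothetical; the point is only that `pd.read_csv` with `chunksize` yields DataFrames of at most that many rows, which can then be concatenated.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV with 10 data rows, standing in for the yearly file
csv_text = "MONTH,ARR_DELAY\n" + "\n".join(f"{(i % 12) + 1},{i}" for i in range(10))

# With 'chunksize', read_csv returns an iterator of DataFrames
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

chunk_sizes = []
parts = []
for chunk in reader:
    chunk_sizes.append(len(chunk))
    parts.append(chunk)

toy = pd.concat(parts, axis=0)
print(chunk_sizes, toy.shape)  # [4, 4, 2] (10, 2)
```

Each chunk could also be filtered or downcast before being appended, which would lower the peak memory further; here the chunks are kept unchanged, as in the cell above.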


In [16]:
# For on-time flights, the "..._DELAY" cause columns are NaN.
# Strictly speaking, each could hold any value as long as they add up to less than 15 minutes.
# For data consistency, however, let's replace the NaN values with 0:
causes = ['CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']
df[causes] = df[causes].fillna(value=0)
df

Unnamed: 0,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,TAIL_NUM,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_NM,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_NM,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DEL15,DEP_TIME_BLK,TAXI_OUT,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DEL15,ARR_TIME_BLK,CANCELLED,CRS_ELAPSED_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY
0,1,3,4,9E,N195PQ,TYS,"Knoxville, TN",TN,Tennessee,ATL,"Atlanta, GA",GA,Georgia,1140,1205.0,25.0,1.0,1100-1159,30.0,4.0,1250,1315.0,25.0,1.0,1200-1259,0.0,70.0,1.0,152.0,1,0.0,0.0,0.0,0.0,25.0
1,1,4,5,9E,N919XJ,TYS,"Knoxville, TN",TN,Tennessee,ATL,"Atlanta, GA",GA,Georgia,1140,1250.0,70.0,1.0,1100-1159,35.0,9.0,1250,1412.0,82.0,1.0,1200-1259,0.0,70.0,1.0,152.0,1,0.0,0.0,12.0,0.0,70.0
2,1,5,6,9E,N316PQ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,956.0,6.0,0.0,0900-0959,20.0,3.0,1051,1043.0,-8.0,0.0,1000-1059,0.0,121.0,1.0,563.0,3,0.0,0.0,0.0,0.0,0.0
3,1,6,7,9E,N325PQ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,945.0,-5.0,0.0,0900-0959,16.0,3.0,1053,1029.0,-24.0,0.0,1000-1059,0.0,123.0,1.0,563.0,3,0.0,0.0,0.0,0.0,0.0
4,1,7,1,9E,N904XJ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,947.0,-3.0,0.0,0900-0959,25.0,4.0,1053,1044.0,-9.0,0.0,1000-1059,0.0,123.0,1.0,563.0,3,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7422032,12,31,2,B6,N193JB,MCO,"Orlando, FL",FL,Florida,SWF,"Newburgh/Poughkeepsie, NY",NY,New York,1356,1500.0,64.0,1.0,1300-1359,20.0,5.0,1639,1731.0,52.0,1.0,1600-1659,0.0,163.0,1.0,989.0,4,52.0,0.0,0.0,0.0,0.0
7422033,12,31,2,B6,N304JB,DCA,"Washington, DC",VA,Virginia,BOS,"Boston, MA",MA,Massachusetts,1420,1414.0,-6.0,0.0,1400-1459,15.0,7.0,1550,1533.0,-17.0,0.0,1500-1559,0.0,90.0,1.0,399.0,2,0.0,0.0,0.0,0.0,0.0
7422034,12,31,2,B6,N193JB,PHL,"Philadelphia, PA",PA,Pennsylvania,BOS,"Boston, MA",MA,Massachusetts,700,652.0,-8.0,0.0,0700-0759,12.0,5.0,825,751.0,-34.0,0.0,0800-0859,0.0,85.0,1.0,280.0,2,0.0,0.0,0.0,0.0,0.0
7422035,12,31,2,B6,N563JB,BOS,"Boston, MA",MA,Massachusetts,SJU,"San Juan, PR",PR,Puerto Rico,813,812.0,-1.0,0.0,0800-0859,10.0,3.0,1315,1248.0,-27.0,0.0,1300-1359,0.0,242.0,1.0,1674.0,7,0.0,0.0,0.0,0.0,0.0
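
One side effect of this fill is worth seeing on a small example: rows with no recorded arrival at all also receive zeros in the cause columns, so a zero there does not by itself mean "on time". The toy frame below is hypothetical and uses only two of the five cause columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ARR_DELAY": [25.0, -8.0, np.nan],  # Last row: no arrival recorded
    "CARRIER_DELAY": [0.0, np.nan, np.nan],
    "LATE_AIRCRAFT_DELAY": [25.0, np.nan, np.nan],
})

# Fill only the cause columns; other columns keep their NaN values
cause_cols = ["CARRIER_DELAY", "LATE_AIRCRAFT_DELAY"]
toy[cause_cols] = toy[cause_cols].fillna(0)
print(toy)
```

After the fill, the last row has zero delay minutes per cause but still a missing `ARR_DELAY`, so such rows need separate handling.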


In [17]:
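def val_freq(col, df=df):
    """Print the absolute and relative frequency of each value in 'col', sorted by value."""
    counts = df[col].value_counts().sort_index() # Computed once instead of on every iteration
    for value, n in counts.items():
        print("{} : {} records ({:.2f}%)".format(value, n, n / len(df) * 100))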
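
The same frequency summary can also be built as a small DataFrame instead of printed line by line, which makes it easier to sort or filter afterwards. This is a sketch on hypothetical toy data; `summary` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy column with eight records
toy = pd.DataFrame({"DAY_OF_WEEK": [1, 1, 2, 3, 3, 3, 3, 5]})

# Absolute counts per value, sorted by value, plus the share of all records
counts = toy["DAY_OF_WEEK"].value_counts().sort_index()
summary = pd.DataFrame({
    "records": counts,
    "percent": (counts / len(toy) * 100).round(2),
})
print(summary)
```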

In [18]:
%%time

for col in df.columns:
    print(col, ':', df[col].nunique(), 'unique values')
    if df[col].nunique() < 50:
        val_freq(col)
    print("")

MONTH : 12 unique values
1 : 583985 records (7.87%)
2 : 533175 records (7.18%)
3 : 632074 records (8.52%)
4 : 612023 records (8.25%)
5 : 636390 records (8.57%)
6 : 636691 records (8.58%)
7 : 659029 records (8.88%)
8 : 658461 records (8.87%)
9 : 605979 records (8.16%)
10 : 636014 records (8.57%)
11 : 602453 records (8.12%)
12 : 625763 records (8.43%)

DAY_OF_MONTH : 31 unique values
1 : 240951 records (3.25%)
2 : 237619 records (3.20%)
3 : 242174 records (3.26%)
4 : 241829 records (3.26%)
5 : 238517 records (3.21%)
6 : 244304 records (3.29%)
7 : 243985 records (3.29%)
8 : 248962 records (3.35%)
9 : 240393 records (3.24%)
10 : 247271 records (3.33%)
11 : 249266 records (3.36%)
12 : 243253 records (3.28%)
13 : 245085 records (3.30%)
14 : 244813 records (3.30%)
15 : 249956 records (3.37%)
16 : 240931 records (3.25%)
17 : 246247 records (3.32%)
18 : 250135 records (3.37%)
19 : 244596 records (3.30%)
20 : 243522 records (3.28%)
21 : 247257 records (3.33%)
22 : 250716 records (3.38%)
23 : 242

In [19]:
# Since 'FLIGHTS' is always 1, it adds no information. Let's just drop it:
df.drop('FLIGHTS', axis=1, inplace=True)

In [20]:
df.shape

(7422037, 34)

In [21]:
%memit

peak memory: 2212.89 MiB, increment: 0.07 MiB


In [22]:
# Focus on delayed flights:
delays = df[df['ARR_DEL15'] == 1]
delays

Unnamed: 0,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,TAIL_NUM,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_NM,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_NM,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DEL15,DEP_TIME_BLK,TAXI_OUT,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DEL15,ARR_TIME_BLK,CANCELLED,CRS_ELAPSED_TIME,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY
0,1,3,4,9E,N195PQ,TYS,"Knoxville, TN",TN,Tennessee,ATL,"Atlanta, GA",GA,Georgia,1140,1205.0,25.0,1.0,1100-1159,30.0,4.0,1250,1315.0,25.0,1.0,1200-1259,0.0,70.0,152.0,1,0.0,0.0,0.0,0.0,25.0
1,1,4,5,9E,N919XJ,TYS,"Knoxville, TN",TN,Tennessee,ATL,"Atlanta, GA",GA,Georgia,1140,1250.0,70.0,1.0,1100-1159,35.0,9.0,1250,1412.0,82.0,1.0,1200-1259,0.0,70.0,152.0,1,0.0,0.0,12.0,0.0,70.0
8,1,11,5,9E,N135EV,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,945.0,-5.0,0.0,0900-0959,33.0,24.0,1053,1113.0,20.0,1.0,1000-1059,0.0,123.0,563.0,3,0.0,0.0,20.0,0.0,0.0
12,1,15,2,9E,N917XJ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,1043.0,53.0,1.0,0900-0959,14.0,5.0,1053,1136.0,43.0,1.0,1000-1059,0.0,123.0,563.0,3,11.0,0.0,0.0,0.0,32.0
16,1,19,6,9E,N937XJ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,1001.0,11.0,0.0,0900-0959,40.0,5.0,1051,1113.0,22.0,1.0,1000-1059,0.0,121.0,563.0,3,0.0,0.0,11.0,0.0,11.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7421989,12,31,2,B6,N523JB,MCO,"Orlando, FL",FL,Florida,SJU,"San Juan, PR",PR,Puerto Rico,1428,1732.0,184.0,1.0,1400-1459,24.0,7.0,1815,2126.0,191.0,1.0,1800-1859,0.0,167.0,1189.0,5,184.0,0.0,7.0,0.0,0.0
7422004,12,31,2,B6,N198JB,BOS,"Boston, MA",MA,Massachusetts,RDU,"Raleigh/Durham, NC",NC,North Carolina,1545,1619.0,34.0,1.0,1500-1559,16.0,6.0,1759,1821.0,22.0,1.0,1700-1759,0.0,134.0,612.0,3,8.0,0.0,0.0,0.0,14.0
7422009,12,31,2,B6,N603JB,SLC,"Salt Lake City, UT",UT,Utah,MCO,"Orlando, FL",FL,Florida,2210,2231.0,21.0,1.0,2200-2259,29.0,5.0,417,438.0,21.0,1.0,0001-0559,0.0,247.0,1931.0,8,7.0,0.0,0.0,0.0,14.0
7422019,12,31,2,B6,N317JB,MCO,"Orlando, FL",FL,Florida,DCA,"Washington, DC",VA,Virginia,1853,1844.0,-9.0,0.0,1800-1859,12.0,68.0,2100,2139.0,39.0,1.0,2100-2159,0.0,127.0,759.0,4,0.0,0.0,39.0,0.0,0.0


In [23]:
# Let's check now the main reasons behind delays:
causes = ['CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']
delay_groups = delays.groupby('ARR_DEL15')[causes].agg(['count', 'median', 'mean', 'sum', 'max'])
# delay_groups.columns = ['_'.join(col).strip() for col in delay_groups.columns.values]
delay_groups.T.loc[:, 1].apply(lambda x: '%.0f' % x)

CARRIER_DELAY        count      1389253
                     median           0
                     mean            21
                     sum       29352960
                     max           2695
WEATHER_DELAY        count      1389253
                     median           0
                     mean             4
                     sum        5282501
                     max           1847
NAS_DELAY            count      1389253
                     median           2
                     mean            17
                     sum       23044854
                     max           1741
SECURITY_DELAY       count      1389253
                     median           0
                     mean             0
                     sum         133484
                     max           1078
LATE_AIRCRAFT_DELAY  count      1389253
                     median           3
                     mean            27
                     sum       38075922
                     max           2206
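
As a reading aid, the totals above can be turned into each cause's share of all attributed delay minutes. The sketch below hard-codes the `sum` values printed in the output above; the names `delay_minutes` and `delay_share` are illustrative and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Total delay minutes per cause, copied from the 'sum' rows of the output above.
delay_minutes = pd.Series({
    'CARRIER_DELAY': 29352960,
    'WEATHER_DELAY': 5282501,
    'NAS_DELAY': 23044854,
    'SECURITY_DELAY': 133484,
    'LATE_AIRCRAFT_DELAY': 38075922,
})

# Share of all attributed delay minutes, largest first.
delay_share = (delay_minutes / delay_minutes.sum()).sort_values(ascending=False)
print((delay_share * 100).round(1))
```

Sorting by share makes it easy to see which causes dominate, which the raw minute totals only show indirectly.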


There are some missing values. Let's delve further into them:

In [24]:
%%time

# Absolute & Relative frequency of missing values by column:
pd.set_option('display.max_rows', df.shape[1])
missing = pd.DataFrame([df.isna().sum(), df.isna().sum() / len(df) * 100], index=['Absolute', 'Relative']).T.sort_values(by='Relative', ascending=False)
missing

Wall time: 9.09 s


Unnamed: 0,Absolute,Relative
ARR_DELAY,153805.0,2.072275
ARR_DEL15,153805.0,2.072275
TAXI_IN,137647.0,1.854572
ARR_TIME,137646.0,1.854558
TAXI_OUT,133977.0,1.805124
DEP_DEL15,130110.0,1.753023
DEP_DELAY,130110.0,1.753023
DEP_TIME,130086.0,1.752699
TAIL_NUM,17837.0,0.240325
CRS_ELAPSED_TIME,135.0,0.001819
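
The summary cell calls `df.isna().sum()` twice, which scans the whole frame twice. A minimal sketch of the same table built from a single pass is shown below, on a small toy frame with made-up values (`toy`, `n_missing` and `missing_toy` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the flights DataFrame (hypothetical values).
toy = pd.DataFrame({
    'DEP_DELAY': [5.0, np.nan, -3.0, np.nan],
    'ARR_DELAY': [7.0, np.nan, np.nan, np.nan],
    'DISTANCE': [152.0, 563.0, 258.0, 537.0],
})

# Count NaNs once and derive the percentage from that single pass.
n_missing = toy.isna().sum()
missing_toy = (pd.DataFrame({'Absolute': n_missing,
                             'Relative': n_missing / len(toy) * 100})
               .sort_values(by='Relative', ascending=False))
print(missing_toy)
```

Building the frame from a dict of Series also keeps `Absolute` as an integer column instead of upcasting it to float.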


In [25]:
# Show which columns have more than 5% missing data:
empty_df = missing[missing['Relative'] > 5]
empty_df

Unnamed: 0,Absolute,Relative


In [26]:
%memit

peak memory: 2618.46 MiB, increment: 0.02 MiB


In [27]:
%%time

# Quick approach → check how many rows contain empty values to see if directly dropping them would be feasible:

# Check which rows have at least 1 NaN:
df[df.isna().any(axis=1)]

Wall time: 4.48 s


Unnamed: 0,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,TAIL_NUM,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_NM,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_NM,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DEL15,DEP_TIME_BLK,TAXI_OUT,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DEL15,ARR_TIME_BLK,CANCELLED,CRS_ELAPSED_TIME,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY
26,1,29,2,9E,N931XJ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,,,,0900-0959,,,1053,,,,1000-1059,1.0,123.0,563.0,3,0.0,0.0,0.0,0.0,0.0
53,1,29,2,9E,N931XJ,SGF,"Springfield, MO",MO,Missouri,ATL,"Atlanta, GA",GA,Georgia,1128,,,,1100-1159,,,1420,,,,1400-1459,1.0,112.0,563.0,3,0.0,0.0,0.0,0.0,0.0
142,1,20,7,9E,N934XJ,LGA,"New York, NY",NY,New York,BTV,"Burlington, VT",VT,Vermont,1420,,,,1400-1459,,,1545,,,,1500-1559,1.0,85.0,258.0,2,0.0,0.0,0.0,0.0,0.0
298,1,24,4,9E,N930XJ,LGA,"New York, NY",NY,New York,BTV,"Burlington, VT",VT,Vermont,2159,,,,2100-2159,,,2321,,,,2300-2359,1.0,82.0,258.0,2,0.0,0.0,0.0,0.0,0.0
406,1,22,2,9E,N929XJ,DTW,"Detroit, MI",MI,Michigan,BTV,"Burlington, VT",VT,Vermont,2020,,,,2000-2059,,,2218,,,,2200-2259,1.0,118.0,537.0,3,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7419670,12,18,3,B6,N348JB,EWR,"Newark, NJ",NJ,New Jersey,BOS,"Boston, MA",MA,Massachusetts,2104,,,,2100-2159,,,2216,,,,2200-2259,1.0,72.0,200.0,1,0.0,0.0,0.0,0.0,0.0
7419672,12,18,3,B6,N329JB,RDU,"Raleigh/Durham, NC",NC,North Carolina,BOS,"Boston, MA",MA,Massachusetts,530,,,,0001-0559,,,723,,,,0700-0759,1.0,113.0,612.0,3,0.0,0.0,0.0,0.0,0.0
7419848,12,19,4,B6,N306JB,SYR,"Syracuse, NY",NY,New York,BOS,"Boston, MA",MA,Massachusetts,530,,,,0001-0559,,,651,,,,0600-0659,1.0,81.0,265.0,2,0.0,0.0,0.0,0.0,0.0
7420968,12,30,1,B6,N579JB,BOS,"Boston, MA",MA,Massachusetts,PBI,"West Palm Beach/Palm Beach, FL",FL,Florida,1936,2057.0,81.0,1.0,1900-1959,53.0,,2303,,,,2300-2359,0.0,207.0,1197.0,5,0.0,0.0,0.0,0.0,0.0


In [28]:
%%time

# DF's length as is:
original_length = len(df)
print('Original dataset length:\t', original_length)

# Length of the dataset after dropping every row with at least 1 NaN:
manipulated_length = len(df.drop(df[df.isna().any(axis=1)].index, axis=0))
print('Manipulated dataset length:\t', manipulated_length)

# Dropped rows, absolute and relative number:
print('{} rows deleted, accounting for {:.2f}% of the original dataset.'.format(original_length - manipulated_length, (original_length - manipulated_length) / original_length * 100))

Original dataset length:	 7422037
Manipulated dataset length:	 7268232
153805 rows deleted, accounting for 2.07% of the original dataset.
Wall time: 6.47 s


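Since these rows account for only ~2% of the data and the dataset is large enough, let's simply drop them as a quick cleaning method. Note that the number of dropped rows (153,805) equals the number of rows missing `ARR_DELAY`, so every row with a missing value is also missing its arrival delay.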

In [29]:
%%time

df = df.drop(df[df.isna().any(axis=1)].index, axis=0)
df

Wall time: 6.21 s


Unnamed: 0,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,TAIL_NUM,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_NM,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_NM,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DEL15,DEP_TIME_BLK,TAXI_OUT,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DEL15,ARR_TIME_BLK,CANCELLED,CRS_ELAPSED_TIME,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY
0,1,3,4,9E,N195PQ,TYS,"Knoxville, TN",TN,Tennessee,ATL,"Atlanta, GA",GA,Georgia,1140,1205.0,25.0,1.0,1100-1159,30.0,4.0,1250,1315.0,25.0,1.0,1200-1259,0.0,70.0,152.0,1,0.0,0.0,0.0,0.0,25.0
1,1,4,5,9E,N919XJ,TYS,"Knoxville, TN",TN,Tennessee,ATL,"Atlanta, GA",GA,Georgia,1140,1250.0,70.0,1.0,1100-1159,35.0,9.0,1250,1412.0,82.0,1.0,1200-1259,0.0,70.0,152.0,1,0.0,0.0,12.0,0.0,70.0
2,1,5,6,9E,N316PQ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,956.0,6.0,0.0,0900-0959,20.0,3.0,1051,1043.0,-8.0,0.0,1000-1059,0.0,121.0,563.0,3,0.0,0.0,0.0,0.0,0.0
3,1,6,7,9E,N325PQ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,945.0,-5.0,0.0,0900-0959,16.0,3.0,1053,1029.0,-24.0,0.0,1000-1059,0.0,123.0,563.0,3,0.0,0.0,0.0,0.0,0.0
4,1,7,1,9E,N904XJ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,947.0,-3.0,0.0,0900-0959,25.0,4.0,1053,1044.0,-9.0,0.0,1000-1059,0.0,123.0,563.0,3,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7422032,12,31,2,B6,N193JB,MCO,"Orlando, FL",FL,Florida,SWF,"Newburgh/Poughkeepsie, NY",NY,New York,1356,1500.0,64.0,1.0,1300-1359,20.0,5.0,1639,1731.0,52.0,1.0,1600-1659,0.0,163.0,989.0,4,52.0,0.0,0.0,0.0,0.0
7422033,12,31,2,B6,N304JB,DCA,"Washington, DC",VA,Virginia,BOS,"Boston, MA",MA,Massachusetts,1420,1414.0,-6.0,0.0,1400-1459,15.0,7.0,1550,1533.0,-17.0,0.0,1500-1559,0.0,90.0,399.0,2,0.0,0.0,0.0,0.0,0.0
7422034,12,31,2,B6,N193JB,PHL,"Philadelphia, PA",PA,Pennsylvania,BOS,"Boston, MA",MA,Massachusetts,700,652.0,-8.0,0.0,0700-0759,12.0,5.0,825,751.0,-34.0,0.0,0800-0859,0.0,85.0,280.0,2,0.0,0.0,0.0,0.0,0.0
7422035,12,31,2,B6,N563JB,BOS,"Boston, MA",MA,Massachusetts,SJU,"San Juan, PR",PR,Puerto Rico,813,812.0,-1.0,0.0,0800-0859,10.0,3.0,1315,1248.0,-27.0,0.0,1300-1359,0.0,242.0,1674.0,7,0.0,0.0,0.0,0.0,0.0
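
The index-based drop used in this notebook can also be written with pandas' built-in `DataFrame.dropna()`. The sketch below compares both on a toy frame with made-up values; the two agree as long as the index labels are unique, which is assumed here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the flights DataFrame (hypothetical values).
toy = pd.DataFrame({
    'DEP_DELAY': [25.0, np.nan, 6.0, 81.0],
    'ARR_DELAY': [25.0, np.nan, -8.0, np.nan],
    'CANCELLED': [0.0, 1.0, 0.0, 0.0],
})

# Index-based drop: look up the labels of rows with any NaN, then drop them.
dropped_by_index = toy.drop(toy[toy.isna().any(axis=1)].index, axis=0)

# Built-in equivalent: drop every row with at least one NaN.
dropped_by_dropna = toy.dropna()

print(dropped_by_index.equals(dropped_by_dropna))
share_dropped = 1 - len(dropped_by_dropna) / len(toy)
print('{:.2%} of rows dropped'.format(share_dropped))
```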


In [30]:
%%time

# CHECK AGAIN after manipulating the missing data.
# Absolute & Relative frequency of missing values by column:
pd.set_option('display.max_rows', 110) # Cap the displayed rows: a very large value greatly slows down the output display and can freeze the kernel
missing_2 = pd.DataFrame([df.isna().sum(), df.isna().sum() / len(df) * 100], index=['Absolute', 'Relative']).T.sort_values(by='Relative', ascending=False)
missing_2

Wall time: 9.44 s


Unnamed: 0,Absolute,Relative
MONTH,0.0,0.0
CANCELLED,0.0,0.0
TAXI_IN,0.0,0.0
CRS_ARR_TIME,0.0,0.0
ARR_TIME,0.0,0.0
ARR_DELAY,0.0,0.0
ARR_DEL15,0.0,0.0
ARR_TIME_BLK,0.0,0.0
CRS_ELAPSED_TIME,0.0,0.0
DAY_OF_MONTH,0.0,0.0


Great. Now the dataset is totally free of missing values.

In [31]:
# 'TAXI_OUT' time is not an input, since there is no way to know its actual value in advance.
# However, there is some prior knowledge about it, based on past data about operations at each airport.
# Each carrier may optimize its airport operations in a different manner, leading to average differences between them.
# Therefore, let's see whether some value could serve as a baseline for each airport/carrier pair and hence be used as an input feature.

DepDel_TaxOutTim = df.groupby(['ORIGIN', 'OP_UNIQUE_CARRIER'])[['TAXI_OUT', 'DEP_DELAY']] \
                     .agg({'TAXI_OUT' : ['count', 'mean', 'median', 'min', 'max'],
                           'DEP_DELAY' : ['mean', 'median', 'min', 'max']})
DepDel_TaxOutTim

,,TAXI_OUT,TAXI_OUT,TAXI_OUT,TAXI_OUT,TAXI_OUT,DEP_DELAY,DEP_DELAY,DEP_DELAY,DEP_DELAY
ORIGIN,OP_UNIQUE_CARRIER,count,mean,median,min,max,mean,median,min,max
ABE,9E,626,16.915335,14.0,6.0,114.0,7.865815,-3.0,-18.0,360.0
ABE,DL,326,18.377301,14.0,7.0,112.0,1.638037,-3.0,-13.0,150.0
ABE,EV,137,18.518248,15.0,5.0,82.0,27.401460,-4.0,-19.0,382.0
ABE,G4,1142,10.722417,10.0,5.0,83.0,10.619965,-5.0,-32.0,1855.0
ABE,MQ,509,14.290766,12.0,5.0,85.0,11.618861,-6.0,-14.0,739.0
...,...,...,...,...,...,...,...,...,...,...
XNA,YX,1751,18.055397,15.0,4.0,141.0,14.347230,-4.0,-22.0,505.0
XWA,OO,204,26.769608,19.0,7.0,95.0,31.995098,0.0,-21.0,897.0
YAK,AS,694,10.207493,9.0,2.0,45.0,-6.237752,-12.0,-56.0,212.0
YUM,OO,1556,14.491645,13.0,2.0,87.0,3.267995,-6.0,-24.0,640.0


In [32]:
# To check whether the median could be taken as an input, let's first compare the mean with the median for
# each particular airport/carrier pair (ratio mean / median), which shows how strongly extreme values skew each pair.

Dep_relative_diff = DepDel_TaxOutTim.loc[:, ('TAXI_OUT', 'mean')] / DepDel_TaxOutTim.loc[:, ('TAXI_OUT', 'median')]
Dep_relative_diff.describe()

# Based on the results, the mean sits moderately above the median for most pairs:
# the mean/median ratio has a median of ~1.16 and an interquartile range of ~1.11 to ~1.22.
# Where the ratio is much larger (up to ~3.98), it is normally due to outliers.
# These extreme values pull the means to the right, but are not representative in general.

# In a nutshell, it is fair to take each airport/carrier median as a baseline value for an input feature.

count    1977.000000
mean        1.169532
std         0.106970
min         0.877193
25%         1.108320
50%         1.157459
75%         1.218919
max         3.977778
dtype: float64
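
To illustrate why a mean/median ratio well above 1 points to outliers, here is a small sketch with two made-up airport/carrier pairs (the airport codes `AAA`/`BBB`, carrier `XX` and all values are hypothetical):

```python
import pandas as pd

# Toy taxi-out times (minutes) for two hypothetical airport/carrier pairs.
toy = pd.DataFrame({
    'ORIGIN': ['AAA'] * 5 + ['BBB'] * 5,
    'OP_UNIQUE_CARRIER': ['XX'] * 10,
    'TAXI_OUT': [10, 12, 14, 16, 18,     # symmetric values: mean equals median
                 10, 12, 14, 16, 118],   # one extreme value inflates the mean only
})

stats = toy.groupby(['ORIGIN', 'OP_UNIQUE_CARRIER'])['TAXI_OUT'].agg(['mean', 'median'])
stats['mean_over_median'] = stats['mean'] / stats['median']
print(stats)
```

Both pairs share the same median, while a single extreme value is enough to push the ratio of the second pair far above 1. This is the robustness that motivates using the median as the baseline.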

In [33]:
%%time
# A slower solution (kept in a 'Raw NBConvert' cell) takes too long, so a much faster groupby + merge approach is used instead:
aux = df.groupby(['ORIGIN', 'OP_UNIQUE_CARRIER'], as_index=False)[['TAXI_OUT']].median()
df = df.reset_index().merge(aux, how='inner', on=['ORIGIN', 'OP_UNIQUE_CARRIER'], suffixes=(None,'_median')).set_index('index').sort_index()
df

Wall time: 28.4 s


index,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,TAIL_NUM,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_NM,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_NM,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DEL15,DEP_TIME_BLK,TAXI_OUT,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DEL15,ARR_TIME_BLK,CANCELLED,CRS_ELAPSED_TIME,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,TAXI_OUT_median
0,1,3,4,9E,N195PQ,TYS,"Knoxville, TN",TN,Tennessee,ATL,"Atlanta, GA",GA,Georgia,1140,1205.0,25.0,1.0,1100-1159,30.0,4.0,1250,1315.0,25.0,1.0,1200-1259,0.0,70.0,152.0,1,0.0,0.0,0.0,0.0,25.0,15.0
1,1,4,5,9E,N919XJ,TYS,"Knoxville, TN",TN,Tennessee,ATL,"Atlanta, GA",GA,Georgia,1140,1250.0,70.0,1.0,1100-1159,35.0,9.0,1250,1412.0,82.0,1.0,1200-1259,0.0,70.0,152.0,1,0.0,0.0,12.0,0.0,70.0,15.0
2,1,5,6,9E,N316PQ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,956.0,6.0,0.0,0900-0959,20.0,3.0,1051,1043.0,-8.0,0.0,1000-1059,0.0,121.0,563.0,3,0.0,0.0,0.0,0.0,0.0,17.0
3,1,6,7,9E,N325PQ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,945.0,-5.0,0.0,0900-0959,16.0,3.0,1053,1029.0,-24.0,0.0,1000-1059,0.0,123.0,563.0,3,0.0,0.0,0.0,0.0,0.0,17.0
4,1,7,1,9E,N904XJ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,947.0,-3.0,0.0,0900-0959,25.0,4.0,1053,1044.0,-9.0,0.0,1000-1059,0.0,123.0,563.0,3,0.0,0.0,0.0,0.0,0.0,17.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7422032,12,31,2,B6,N193JB,MCO,"Orlando, FL",FL,Florida,SWF,"Newburgh/Poughkeepsie, NY",NY,New York,1356,1500.0,64.0,1.0,1300-1359,20.0,5.0,1639,1731.0,52.0,1.0,1600-1659,0.0,163.0,989.0,4,52.0,0.0,0.0,0.0,0.0,15.0
7422033,12,31,2,B6,N304JB,DCA,"Washington, DC",VA,Virginia,BOS,"Boston, MA",MA,Massachusetts,1420,1414.0,-6.0,0.0,1400-1459,15.0,7.0,1550,1533.0,-17.0,0.0,1500-1559,0.0,90.0,399.0,2,0.0,0.0,0.0,0.0,0.0,15.0
7422034,12,31,2,B6,N193JB,PHL,"Philadelphia, PA",PA,Pennsylvania,BOS,"Boston, MA",MA,Massachusetts,700,652.0,-8.0,0.0,0700-0759,12.0,5.0,825,751.0,-34.0,0.0,0800-0859,0.0,85.0,280.0,2,0.0,0.0,0.0,0.0,0.0,17.0
7422035,12,31,2,B6,N563JB,BOS,"Boston, MA",MA,Massachusetts,SJU,"San Juan, PR",PR,Puerto Rico,813,812.0,-1.0,0.0,0800-0859,10.0,3.0,1315,1248.0,-27.0,0.0,1300-1359,0.0,242.0,1674.0,7,0.0,0.0,0.0,0.0,0.0,16.0
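
The per-group median can also be attached without a merge. The sketch below is an alternative rather than the notebook's own method: it uses `groupby(...).transform('median')` on a toy frame with made-up values and compares the result against the groupby + merge pattern.

```python
import pandas as pd

# Toy stand-in for the flights DataFrame (hypothetical values).
toy = pd.DataFrame({
    'ORIGIN': ['ATL', 'ATL', 'ATL', 'TYS', 'TYS'],
    'OP_UNIQUE_CARRIER': ['9E', '9E', '9E', '9E', '9E'],
    'TAXI_OUT': [20, 16, 25, 30, 35],
})

keys = ['ORIGIN', 'OP_UNIQUE_CARRIER']

# groupby + merge: aggregate per pair, then join the result back onto the rows.
aux = toy.groupby(keys, as_index=False)[['TAXI_OUT']].median()
merged = (toy.reset_index()
             .merge(aux, how='inner', on=keys, suffixes=(None, '_median'))
             .set_index('index')
             .sort_index())

# transform: broadcast each group's median back to its rows,
# keeping the original index and row order without a merge.
toy['TAXI_OUT_median'] = toy.groupby(keys)['TAXI_OUT'].transform('median')

print(toy)
```

`transform` avoids the `reset_index` / `set_index` / `sort_index` round trip, since it returns a Series aligned with the original index.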


In [34]:
# As with 'TAXI_OUT', 'TAXI_IN' time is not an input (its actual value cannot be known in advance).
# However, arrival operations are assumed to follow the same dynamics as departure operations.
# Therefore, let's proceed likewise and check the results.

TaxInTim_ArrDel = df.groupby(['DEST', 'OP_UNIQUE_CARRIER'])[['TAXI_IN', 'ARR_DELAY']] \
                     .agg({'TAXI_IN' : ['count', 'mean', 'median', 'min', 'max'],
                           'ARR_DELAY' : ['mean', 'median', 'min', 'max']})
TaxInTim_ArrDel

,,TAXI_IN,TAXI_IN,TAXI_IN,TAXI_IN,TAXI_IN,ARR_DELAY,ARR_DELAY,ARR_DELAY,ARR_DELAY
DEST,OP_UNIQUE_CARRIER,count,mean,median,min,max,mean,median,min,max
ABE,9E,629,5.076312,4.0,2.0,28.0,-0.233704,-10.0,-37.0,461.0
ABE,DL,326,4.542945,4.0,2.0,14.0,-1.368098,-11.5,-34.0,465.0
ABE,EV,137,4.832117,5.0,2.0,17.0,38.649635,7.0,-29.0,355.0
ABE,G4,1145,5.524017,5.0,4.0,75.0,1.863755,-7.0,-37.0,555.0
ABE,MQ,512,4.285156,4.0,1.0,40.0,6.503906,-7.0,-37.0,597.0
...,...,...,...,...,...,...,...,...,...,...
XNA,YX,1767,7.233729,6.0,2.0,88.0,3.688738,-7.0,-57.0,736.0
XWA,OO,202,6.331683,5.0,1.0,36.0,-2.995050,-15.0,-47.0,369.0
YAK,AS,697,4.173601,4.0,2.0,33.0,-0.256815,-6.0,-47.0,232.0
YUM,OO,1553,4.046362,3.0,2.0,27.0,4.652930,-5.0,-36.0,1218.0


In [35]:
# Similarly to what has been done for Departure:

Arr_relative_diff = TaxInTim_ArrDel.loc[:, ('TAXI_IN', 'mean')] / TaxInTim_ArrDel.loc[:, ('TAXI_IN', 'median')]
Arr_relative_diff.describe()

# Once again, the mean sits moderately above the median for most pairs:
# the mean/median ratio has a median of ~1.13 and an interquartile range of ~1.06 to ~1.22, with a maximum of ~2.90.

# Therefore, it is fair to take each airport/carrier median as a baseline value for an input feature.

count    1976.000000
mean        1.152699
std         0.149455
min         0.838818
25%         1.057349
50%         1.128030
75%         1.216900
max         2.904865
dtype: float64

In [36]:
%%time
# A slower solution (kept in a 'Raw NBConvert' cell) takes too long, so a much faster groupby + merge approach is used instead:
aux = df.groupby(['DEST', 'OP_UNIQUE_CARRIER'], as_index=False)[['TAXI_IN']].median()
df = df.reset_index().merge(aux, how='inner', on=['DEST', 'OP_UNIQUE_CARRIER'], suffixes=(None,'_median')).set_index('index').sort_index()
df

Wall time: 29.3 s


index,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,TAIL_NUM,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_NM,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_NM,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DEL15,DEP_TIME_BLK,TAXI_OUT,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DEL15,ARR_TIME_BLK,CANCELLED,CRS_ELAPSED_TIME,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,TAXI_OUT_median,TAXI_IN_median
0,1,3,4,9E,N195PQ,TYS,"Knoxville, TN",TN,Tennessee,ATL,"Atlanta, GA",GA,Georgia,1140,1205.0,25.0,1.0,1100-1159,30.0,4.0,1250,1315.0,25.0,1.0,1200-1259,0.0,70.0,152.0,1,0.0,0.0,0.0,0.0,25.0,15.0,8.0
1,1,4,5,9E,N919XJ,TYS,"Knoxville, TN",TN,Tennessee,ATL,"Atlanta, GA",GA,Georgia,1140,1250.0,70.0,1.0,1100-1159,35.0,9.0,1250,1412.0,82.0,1.0,1200-1259,0.0,70.0,152.0,1,0.0,0.0,12.0,0.0,70.0,15.0,8.0
2,1,5,6,9E,N316PQ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,956.0,6.0,0.0,0900-0959,20.0,3.0,1051,1043.0,-8.0,0.0,1000-1059,0.0,121.0,563.0,3,0.0,0.0,0.0,0.0,0.0,17.0,5.0
3,1,6,7,9E,N325PQ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,945.0,-5.0,0.0,0900-0959,16.0,3.0,1053,1029.0,-24.0,0.0,1000-1059,0.0,123.0,563.0,3,0.0,0.0,0.0,0.0,0.0,17.0,5.0
4,1,7,1,9E,N904XJ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,947.0,-3.0,0.0,0900-0959,25.0,4.0,1053,1044.0,-9.0,0.0,1000-1059,0.0,123.0,563.0,3,0.0,0.0,0.0,0.0,0.0,17.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7422032,12,31,2,B6,N193JB,MCO,"Orlando, FL",FL,Florida,SWF,"Newburgh/Poughkeepsie, NY",NY,New York,1356,1500.0,64.0,1.0,1300-1359,20.0,5.0,1639,1731.0,52.0,1.0,1600-1659,0.0,163.0,989.0,4,52.0,0.0,0.0,0.0,0.0,15.0,6.0
7422033,12,31,2,B6,N304JB,DCA,"Washington, DC",VA,Virginia,BOS,"Boston, MA",MA,Massachusetts,1420,1414.0,-6.0,0.0,1400-1459,15.0,7.0,1550,1533.0,-17.0,0.0,1500-1559,0.0,90.0,399.0,2,0.0,0.0,0.0,0.0,0.0,15.0,6.0
7422034,12,31,2,B6,N193JB,PHL,"Philadelphia, PA",PA,Pennsylvania,BOS,"Boston, MA",MA,Massachusetts,700,652.0,-8.0,0.0,0700-0759,12.0,5.0,825,751.0,-34.0,0.0,0800-0859,0.0,85.0,280.0,2,0.0,0.0,0.0,0.0,0.0,17.0,6.0
7422035,12,31,2,B6,N563JB,BOS,"Boston, MA",MA,Massachusetts,SJU,"San Juan, PR",PR,Puerto Rico,813,812.0,-1.0,0.0,0800-0859,10.0,3.0,1315,1248.0,-27.0,0.0,1300-1359,0.0,242.0,1674.0,7,0.0,0.0,0.0,0.0,0.0,16.0,5.0


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7268232 entries, 0 to 7422036
Data columns (total 36 columns):
 #   Column               Dtype  
---  ------               -----  
 0   MONTH                int64  
 1   DAY_OF_MONTH         int64  
 2   DAY_OF_WEEK          int64  
 3   OP_UNIQUE_CARRIER    object 
 4   TAIL_NUM             object 
 5   ORIGIN               object 
 6   ORIGIN_CITY_NAME     object 
 7   ORIGIN_STATE_ABR     object 
 8   ORIGIN_STATE_NM      object 
 9   DEST                 object 
 10  DEST_CITY_NAME       object 
 11  DEST_STATE_ABR       object 
 12  DEST_STATE_NM        object 
 13  CRS_DEP_TIME         int64  
 14  DEP_TIME             float64
 15  DEP_DELAY            float64
 16  DEP_DEL15            float64
 17  DEP_TIME_BLK         object 
 18  TAXI_OUT             float64
 19  TAXI_IN              float64
 20  CRS_ARR_TIME         int64  
 21  ARR_TIME             float64
 22  ARR_DELAY            float64
 23  ARR_DEL15            float64
 24

In [38]:
# Create new categorical features from the CRS departure/arrival time blocks:
# keep the first two characters (the block's starting hour) and rename the columns accordingly.
# Note: the '0001-0559' block maps to '00' although it spans several hours.
df['DEP_TIME_BLK'] = df['DEP_TIME_BLK'].str.slice(0, 2)
df['ARR_TIME_BLK'] = df['ARR_TIME_BLK'].str.slice(0, 2)
df = df.rename({'DEP_TIME_BLK' : 'DEP_TIME_hour', 'ARR_TIME_BLK' : 'ARR_TIME_hour'}, axis=1)
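
A minimal sketch of the hour extraction on a few time-block strings taken from the tables above (`blocks` and `hours` are illustrative names):

```python
import pandas as pd

# Time-block strings in the format used by DEP_TIME_BLK / ARR_TIME_BLK.
blocks = pd.Series(['0001-0559', '0900-0959', '1100-1159', '2300-2359'])

# The first two characters are the block's starting hour, kept as a string.
hours = blocks.str.slice(0, 2)
print(hours.tolist())
```

The result stays a zero-padded string (for example `'09'`), which suits a categorical feature. The early-morning block `'0001-0559'` collapses to `'00'` even though it covers almost six hours.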

In [39]:
# Casting float columns to 'int64' (possible now that no NaNs remain; note that any fractional value is truncated):
float64_cols = df.select_dtypes('float64').columns
df[float64_cols] = df[float64_cols].astype('int64')
# Casting certain features (numeric and string) types to 'category':
ord_cat_cols = ['MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'DEP_TIME_hour', 'ARR_TIME_hour', 'DISTANCE_GROUP']
notord_cat_cols = ['OP_UNIQUE_CARRIER', 'ORIGIN', 'DEST', 'DEP_DEL15', 'ARR_DEL15', 'CANCELLED']
df[ord_cat_cols] = df[ord_cat_cols].astype('category')
df[notord_cat_cols] = df[notord_cat_cols].astype('category')
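
Note that `astype('category')` produces unordered categoricals, so columns meant to be ordinal do not carry an order after this cast. If an explicit order is wanted, one option (a sketch, not something this notebook does) is an ordered `CategoricalDtype`, shown here on a toy month column:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

months = pd.Series([1, 3, 12, 3])

# astype('category') infers the categories from the data and leaves them unordered.
unordered = months.astype('category')

# An explicit dtype records the full category set and its order.
month_dtype = CategoricalDtype(categories=list(range(1, 13)), ordered=True)
ordered = months.astype(month_dtype)

print(unordered.cat.ordered, ordered.cat.ordered)
print((ordered < 6).tolist())
```

With an ordered dtype, comparisons such as `ordered < 6` are allowed. All twelve months are also registered as categories even if some are absent from the data.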

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7268232 entries, 0 to 7422036
Data columns (total 36 columns):
 #   Column               Dtype   
---  ------               -----   
 0   MONTH                category
 1   DAY_OF_MONTH         category
 2   DAY_OF_WEEK          category
 3   OP_UNIQUE_CARRIER    category
 4   TAIL_NUM             object  
 5   ORIGIN               category
 6   ORIGIN_CITY_NAME     object  
 7   ORIGIN_STATE_ABR     object  
 8   ORIGIN_STATE_NM      object  
 9   DEST                 category
 10  DEST_CITY_NAME       object  
 11  DEST_STATE_ABR       object  
 12  DEST_STATE_NM        object  
 13  CRS_DEP_TIME         int64   
 14  DEP_TIME             int64   
 15  DEP_DELAY            int64   
 16  DEP_DEL15            category
 17  DEP_TIME_hour        category
 18  TAXI_OUT             int64   
 19  TAXI_IN              int64   
 20  CRS_ARR_TIME         int64   
 21  ARR_TIME             int64   
 22  ARR_DELAY            int64   
 23  ARR_DEL

In [41]:
df

index,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,TAIL_NUM,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_NM,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_NM,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DEL15,DEP_TIME_hour,TAXI_OUT,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DEL15,ARR_TIME_hour,CANCELLED,CRS_ELAPSED_TIME,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,TAXI_OUT_median,TAXI_IN_median
0,1,3,4,9E,N195PQ,TYS,"Knoxville, TN",TN,Tennessee,ATL,"Atlanta, GA",GA,Georgia,1140,1205,25,1,11,30,4,1250,1315,25,1,12,0,70,152,1,0,0,0,0,25,15,8
1,1,4,5,9E,N919XJ,TYS,"Knoxville, TN",TN,Tennessee,ATL,"Atlanta, GA",GA,Georgia,1140,1250,70,1,11,35,9,1250,1412,82,1,12,0,70,152,1,0,0,12,0,70,15,8
2,1,5,6,9E,N316PQ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,956,6,0,09,20,3,1051,1043,-8,0,10,0,121,563,3,0,0,0,0,0,17,5
3,1,6,7,9E,N325PQ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,945,-5,0,09,16,3,1053,1029,-24,0,10,0,123,563,3,0,0,0,0,0,17,5
4,1,7,1,9E,N904XJ,ATL,"Atlanta, GA",GA,Georgia,SGF,"Springfield, MO",MO,Missouri,950,947,-3,0,09,25,4,1053,1044,-9,0,10,0,123,563,3,0,0,0,0,0,17,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7422032,12,31,2,B6,N193JB,MCO,"Orlando, FL",FL,Florida,SWF,"Newburgh/Poughkeepsie, NY",NY,New York,1356,1500,64,1,13,20,5,1639,1731,52,1,16,0,163,989,4,52,0,0,0,0,15,6
7422033,12,31,2,B6,N304JB,DCA,"Washington, DC",VA,Virginia,BOS,"Boston, MA",MA,Massachusetts,1420,1414,-6,0,14,15,7,1550,1533,-17,0,15,0,90,399,2,0,0,0,0,0,15,6
7422034,12,31,2,B6,N193JB,PHL,"Philadelphia, PA",PA,Pennsylvania,BOS,"Boston, MA",MA,Massachusetts,700,652,-8,0,07,12,5,825,751,-34,0,08,0,85,280,2,0,0,0,0,0,17,6
7422035,12,31,2,B6,N563JB,BOS,"Boston, MA",MA,Massachusetts,SJU,"San Juan, PR",PR,Puerto Rico,813,812,-1,0,08,10,3,1315,1248,-27,0,13,0,242,1674,7,0,0,0,0,0,16,5


Simply casting features to appropriate dtypes reduces memory usage by about 600 MB (~30%).
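
A saving of this kind can be measured with `DataFrame.memory_usage(deep=True)`. The sketch below uses a small toy frame with made-up values, so its numbers will not match the figure quoted above:

```python
import pandas as pd

# Toy frame with a repetitive string column and a whole-number float column.
toy = pd.DataFrame({
    'ORIGIN': ['ATL', 'BOS', 'MCO', 'ATL'] * 25_000,
    'DEP_DEL15': [0.0, 1.0, 0.0, 0.0] * 25_000,
})

# deep=True also counts the memory held by the Python string objects.
before = toy.memory_usage(deep=True).sum()

compact = toy.copy()
compact['ORIGIN'] = compact['ORIGIN'].astype('category')
compact['DEP_DEL15'] = compact['DEP_DEL15'].astype('int64').astype('category')

after = compact.memory_usage(deep=True).sum()
print('before: {:.2f} MB, after: {:.2f} MB'.format(before / 1e6, after / 1e6))
```

Categoricals store each distinct value once plus a small integer code per row, which is why repetitive columns such as airport codes shrink the most.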

#### Run only the first time to generate the global CLEAN file (year 2019)

In [47]:
%%time

output_csv_path = os.path.join(root,
                               "Output_Data",
                               "US_DoT",
                               "OTP_Preprocessed_2019_v2.csv")

df.to_csv(path_or_buf=output_csv_path,
          index=False,
          encoding='latin1')
          
# Wall time: 1min 2s

Wall time: 1min 23s
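
A CSV file stores no dtype information, so the `category` casts above are not preserved in the exported file. The sketch below, on a toy frame and an in-memory buffer instead of the real output path, shows one way to restore them on load with the `dtype` argument of `pd.read_csv`:

```python
import io
import pandas as pd

# Toy frame with categorical columns (hypothetical values).
toy = pd.DataFrame({
    'MONTH': pd.Series([1, 12, 12]).astype('category'),
    'DEP_TIME_hour': pd.Series(['09', '11', '23']).astype('category'),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain read: dtypes are inferred again, so '09' comes back as the integer 9.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read with explicit dtypes to get categories back and keep the zero-padded strings.
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={'MONTH': 'category', 'DEP_TIME_hour': 'category'})

print(plain.dtypes)
print(typed.dtypes)
```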


In [48]:
t1 = time.perf_counter() - t0
print("Time elapsed: ", t1) # Wall-clock seconds elapsed since t0 (floating point), not CPU time

Time elapsed:  4605.4373905
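
`time.perf_counter()` measures elapsed wall-clock time, while `time.process_time()` measures the CPU time of the current process. A short sketch of the difference (the variable names are illustrative and chosen so they do not overwrite the notebook's `t0`):

```python
import time

start_wall = time.perf_counter()   # wall-clock reference point
start_cpu = time.process_time()    # CPU-time reference point

time.sleep(0.2)                    # waiting uses wall-clock time but almost no CPU

wall_elapsed = time.perf_counter() - start_wall
cpu_elapsed = time.process_time() - start_cpu
print('wall: {:.3f} s, cpu: {:.3f} s'.format(wall_elapsed, cpu_elapsed))
```

For a notebook whose runtime includes disk reads and idle time between cells, the wall-clock figure is the one that reflects how long the run actually took.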


___
