In [1]:
import pandas as pd
import datetime
import numpy as np

### Data.
We are taking data from the US Department of Transportation Bureau database. It is available from 1987 until Sep 2020, but to get a clear picture of recent times, we will take all of the 2019 datasets (one per each month), which adds up to over 7 million observations.

First important procedures to work with this data is to group it by route, by month and by carrier. We are not interested on the exact date of the flight, nor on each individual flight within a route: we want to get the total (count) of flights performing said route each month, split within the different carriers.

In [2]:
df01 = pd.read_csv('Data/012019.csv')
df02 = pd.read_csv('Data/022019.csv')
df03 = pd.read_csv('Data/032019.csv')
df04 = pd.read_csv('Data/042019.csv')
df05 = pd.read_csv('Data/052019.csv')
df06 = pd.read_csv('Data/062019.csv')
df07 = pd.read_csv('Data/072019.csv')
df08 = pd.read_csv('Data/082019.csv')
df09 = pd.read_csv('Data/092019.csv')
df10 = pd.read_csv('Data/102019.csv')
df11 = pd.read_csv('Data/112019.csv')
df12 = pd.read_csv('Data/122019.csv')

dfs = [df01, df02, df03, df04, df05, df06, df07, df08, df09, df10, df11, df12]

In [3]:
for i in dfs:
    i.drop(columns = 'Unnamed: 11', axis = 1, inplace = True)
    i['Route'] = i['ORIGIN'] + '-' + i['DEST']
    i['ORIGIN_CITY_NAME'] = i['ORIGIN_CITY_NAME'][:-4]
    i['DEST_CITY_NAME'] = i['DEST_CITY_NAME'][:-4]
    i.columns = ['Month', 'Carrier', 'From', 'FCity', 'FST', 'To', 'TCity', 'TST', 'Delay', 'Flights', 'Dist', 'Route']
    

In [4]:
df01

Unnamed: 0,Month,Carrier,From,FCity,FST,To,TCity,TST,Delay,Flights,Dist,Route
0,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-12.0,1.0,83.0,ATL-CSG
1,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-20.0,1.0,83.0,ATL-CSG
2,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-13.0,1.0,83.0,ATL-CSG
3,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-15.0,1.0,83.0,ATL-CSG
4,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-11.0,1.0,83.0,ATL-CSG
...,...,...,...,...,...,...,...,...,...,...,...,...
583980,1,UA,SAT,"San Antonio, TX",TX,IAH,"Houston, TX",TX,-18.0,1.0,191.0,SAT-IAH
583981,1,UA,SJU,,PR,IAD,,VA,3.0,1.0,1571.0,SJU-IAD
583982,1,UA,IAD,,VA,SJU,,PR,2.0,1.0,1571.0,IAD-SJU
583983,1,UA,IAH,,TX,SFO,,CA,22.0,1.0,1635.0,IAH-SFO


After the initial wrangling of data, our 12 dataframes are in the same format and ready to be concatenated to get one complete dataset of all the flights in the whole year.

In [5]:
df = pd.concat(dfs)

In [6]:
df = df[['Route', 'Month', 'Carrier', 'From', 'FCity', 'FST',
         'To', 'TCity', 'TST', 'Delay', 'Flights', 'Dist']].reset_index(drop = True)

In [7]:
df

Unnamed: 0,Route,Month,Carrier,From,FCity,FST,To,TCity,TST,Delay,Flights,Dist
0,ATL-CSG,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-12.0,1.0,83.0
1,ATL-CSG,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-20.0,1.0,83.0
2,ATL-CSG,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-13.0,1.0,83.0
3,ATL-CSG,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-15.0,1.0,83.0
4,ATL-CSG,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-11.0,1.0,83.0
...,...,...,...,...,...,...,...,...,...,...,...,...
7422032,MCO-SWF,12,B6,MCO,"Orlando, FL",FL,SWF,"Newburgh/Poughkeepsie, NY",NY,52.0,1.0,989.0
7422033,DCA-BOS,12,B6,DCA,,VA,BOS,,MA,-17.0,1.0,399.0
7422034,PHL-BOS,12,B6,PHL,,PA,BOS,,MA,-34.0,1.0,280.0
7422035,BOS-SJU,12,B6,BOS,,MA,SJU,,PR,-27.0,1.0,1674.0


Once we have all the dataset, we are going to remove the few observations that have any sort of missing data. 

In [8]:
df['Delay'] = df['Delay'].fillna(0)

In [9]:
df

Unnamed: 0,Route,Month,Carrier,From,FCity,FST,To,TCity,TST,Delay,Flights,Dist
0,ATL-CSG,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-12.0,1.0,83.0
1,ATL-CSG,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-20.0,1.0,83.0
2,ATL-CSG,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-13.0,1.0,83.0
3,ATL-CSG,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-15.0,1.0,83.0
4,ATL-CSG,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-11.0,1.0,83.0
...,...,...,...,...,...,...,...,...,...,...,...,...
7422032,MCO-SWF,12,B6,MCO,"Orlando, FL",FL,SWF,"Newburgh/Poughkeepsie, NY",NY,52.0,1.0,989.0
7422033,DCA-BOS,12,B6,DCA,,VA,BOS,,MA,-17.0,1.0,399.0
7422034,PHL-BOS,12,B6,PHL,,PA,BOS,,MA,-34.0,1.0,280.0
7422035,BOS-SJU,12,B6,BOS,,MA,SJU,,PR,-27.0,1.0,1674.0


In [10]:
df = df.dropna(axis = 0, how = 'any')

We wanted to continue with the data wrangling in this same notebook and tried to group the routes as we specified before. Unfortunately, there have been some strange technical issues, so we have decided to export the current dataframe as we have it and reopen it again in a new notebook. This way, we have been able to continue working with the dataset without any kind of issues.

We will save it as .pkl because it is faster for Python to read it and because it keeps some extra format which .csv files fail to preserve so accurately.

In [11]:
df.to_pickle('Data19.pkl')