# Analyzing arrival time of flight

## Goal

The goal will be refined later. For now the focus is technical: load the data with pandas, build a model with scikit-learn, and build visualizations with matplotlib.

## Load data

Download and load the data into a pandas data frame.

In [8]:
#!curl https://topcs.blob.core.windows.net/public/FlightData.csv -o flightdata.csv

In [9]:
import pandas as pd

flightdata = pd.read_csv('flightdata.csv')
flightdata.head()

Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,UNIQUE_CARRIER,TAIL_NUM,FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN,...,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DEL15,CANCELLED,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,DISTANCE,Unnamed: 25
0,2016,1,1,1,5,DL,N836DN,1399,10397,ATL,...,2143,2102.0,-41.0,0.0,0.0,0.0,338.0,295.0,2182.0,
1,2016,1,1,1,5,DL,N964DN,1476,11433,DTW,...,1435,1439.0,4.0,0.0,0.0,0.0,110.0,115.0,528.0,
2,2016,1,1,1,5,DL,N813DN,1597,10397,ATL,...,1215,1142.0,-33.0,0.0,0.0,0.0,335.0,300.0,2182.0,
3,2016,1,1,1,5,DL,N587NW,1768,14747,SEA,...,1335,1345.0,10.0,0.0,0.0,0.0,196.0,205.0,1399.0,
4,2016,1,1,1,5,DL,N836DN,1823,14747,SEA,...,607,615.0,8.0,0.0,0.0,0.0,247.0,259.0,1927.0,


## Perform data analysis

First inspect the dimensions of the data that is loaded.

In [10]:
flightdata.shape

(11231, 26)

Next, list the columns. The `flightdata.head()` call already gave a glimpse of them, so let us also inspect the data type of each column.

In [11]:
flightdata.columns
flightdata.dtypes

YEAR                     int64
QUARTER                  int64
MONTH                    int64
DAY_OF_MONTH             int64
DAY_OF_WEEK              int64
UNIQUE_CARRIER          object
TAIL_NUM                object
FL_NUM                   int64
ORIGIN_AIRPORT_ID        int64
ORIGIN                  object
DEST_AIRPORT_ID          int64
DEST                    object
CRS_DEP_TIME             int64
DEP_TIME               float64
DEP_DELAY              float64
DEP_DEL15              float64
CRS_ARR_TIME             int64
ARR_TIME               float64
ARR_DELAY              float64
ARR_DEL15              float64
CANCELLED              float64
DIVERTED               float64
CRS_ELAPSED_TIME       float64
ACTUAL_ELAPSED_TIME    float64
DISTANCE               float64
Unnamed: 25            float64
dtype: object

Let us now check the data for null values. The check first tells us, as a boolean, whether any nulls exist. If they do, we summarize how many null values each column contains. This helps us decide what corrective action is needed to make the dataset useful. If the nulls sit in columns that we suspect have no relationship with delay, we can drop those columns. For columns that we expect to be related to delay, we should try to fill in a reasonable value. If filling risks distorting the dataset, we should instead remove the affected observations.

In [12]:
if flightdata.isnull().values.any():
    print(flightdata.isnull().sum())
else:
    print('No null values!')

YEAR                       0
QUARTER                    0
MONTH                      0
DAY_OF_MONTH               0
DAY_OF_WEEK                0
UNIQUE_CARRIER             0
TAIL_NUM                   0
FL_NUM                     0
ORIGIN_AIRPORT_ID          0
ORIGIN                     0
DEST_AIRPORT_ID            0
DEST                       0
CRS_DEP_TIME               0
DEP_TIME                 107
DEP_DELAY                107
DEP_DEL15                107
CRS_ARR_TIME               0
ARR_TIME                 115
ARR_DELAY                188
ARR_DEL15                188
CANCELLED                  0
DIVERTED                   0
CRS_ELAPSED_TIME           0
ACTUAL_ELAPSED_TIME      188
DISTANCE                   0
Unnamed: 25            11231
dtype: int64
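
The three corrective actions described above can be sketched on a small toy frame. The values and the `EMPTY` column are made up for illustration, and using the median as the fill value is only one possible assumption, not the choice made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, not the flight data itself.
toy = pd.DataFrame({
    "MONTH": [1, 1, 2, 2],
    "DEP_DELAY": [5.0, np.nan, -3.0, 12.0],
    "ARR_DEL15": [0.0, np.nan, 0.0, 1.0],
    "EMPTY": [np.nan] * 4,
})

# Option 1: drop a column that carries no information.
dropped_column = toy.drop(columns=["EMPTY"])

# Option 2: fill a column with a value judged reasonable (here, its median).
filled = dropped_column.fillna({"DEP_DELAY": dropped_column["DEP_DELAY"].median()})

# Option 3: drop the observations that still contain nulls.
complete_rows = filled.dropna()

print(dropped_column.shape, complete_rows.shape)
```

Which option fits depends on the column: filling is cheap but can bias a model if the fill value is not representative.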


Let us do the obvious thing first: remove the column that is an unintended parsing artifact, `Unnamed: 25`. We can do this as follows:

In [13]:
flightdata = flightdata.drop('Unnamed: 25', axis=1)
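
A column such as `Unnamed: 25` typically appears when every line of a CSV file ends with a trailing comma. That this is the cause here is an assumption; the sketch below reproduces the effect on a hypothetical two-row CSV and shows a way to drop every all-null column without naming it.

```python
import io
import pandas as pd

# Hypothetical CSV text whose lines end with a trailing comma.
csv_text = "YEAR,MONTH,\n2016,1,\n2016,2,\n"
toy = pd.read_csv(io.StringIO(csv_text))

# The empty trailing field becomes an extra, unnamed column.
print(list(toy.columns))

# Drop every column that contains nothing but nulls.
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))
```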

Now let us check the effect on the null counts.

In [14]:
flightdata.isnull().sum()

YEAR                     0
QUARTER                  0
MONTH                    0
DAY_OF_MONTH             0
DAY_OF_WEEK              0
UNIQUE_CARRIER           0
TAIL_NUM                 0
FL_NUM                   0
ORIGIN_AIRPORT_ID        0
ORIGIN                   0
DEST_AIRPORT_ID          0
DEST                     0
CRS_DEP_TIME             0
DEP_TIME               107
DEP_DELAY              107
DEP_DEL15              107
CRS_ARR_TIME             0
ARR_TIME               115
ARR_DELAY              188
ARR_DEL15              188
CANCELLED                0
DIVERTED                 0
CRS_ELAPSED_TIME         0
ACTUAL_ELAPSED_TIME    188
DISTANCE                 0
dtype: int64

That looks good, so let us consider which columns are relevant and restructure the data frame a bit before taking another look at the columns with nulls.

For example, a column like TAIL_NUM should have no bearing on flight delay, whereas a column like ARR_DEL15 is central to our analysis. Following this logic, we reduce the data frame to the following columns:

In [15]:
flightdata = flightdata[['MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'ORIGIN', 'DEST', 'CRS_DEP_TIME', 'CANCELLED', 'DIVERTED', 'ARR_DEL15']]
flightdata.isnull().sum()

MONTH             0
DAY_OF_MONTH      0
DAY_OF_WEEK       0
ORIGIN            0
DEST              0
CRS_DEP_TIME      0
CANCELLED         0
DIVERTED          0
ARR_DEL15       188
dtype: int64

The column ARR_DEL15 is clearly needed to analyze the on-time or delayed status of a flight. We therefore take a peek at the first few rows where that column is null.

In [16]:
flightdata[flightdata.isnull().values.any(axis=1)].head()

Unnamed: 0,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,ORIGIN,DEST,CRS_DEP_TIME,CANCELLED,DIVERTED,ARR_DEL15
177,1,9,6,MSP,SEA,701,0.0,1.0,
179,1,10,7,MSP,DTW,1348,1.0,0.0,
184,1,10,7,MSP,DTW,625,0.0,1.0,
210,1,10,7,DTW,MSP,1200,1.0,0.0,
478,1,22,5,SEA,JFK,2305,1.0,0.0,


We notice that the flights with NULL (NaN) in ARR_DEL15 are either cancelled or diverted. The following check supports this by comparing counts:

In [17]:
# Rows with at least one null; after the column selection only ARR_DEL15 has nulls
allnas = flightdata[flightdata.isna().any(axis=1)]
# Sum the Series (not one-column frames) so the results are plain scalars
total_canc_div_flights = allnas['CANCELLED'].sum() + allnas['DIVERTED'].sum()

# Compare the number of cancelled or diverted flights with the number of missing labels
if total_canc_div_flights == flightdata['ARR_DEL15'].isnull().sum():
    print('Flights are either cancelled or diverted always for NULL values in arrival delay')
else:
    print('There could be other reasons for data having NULL values for arrival delay')

Flights are either cancelled or diverted always for NULL values in arrival delay
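
Comparing totals works as long as no flight is both cancelled and diverted. A stricter, row-by-row variant is sketched below on a toy frame with made-up values; it checks that every row with a missing label is flagged as cancelled or diverted.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, using the same column names as the notebook.
toy = pd.DataFrame({
    "CANCELLED": [0.0, 1.0, 0.0, 0.0],
    "DIVERTED":  [0.0, 0.0, 1.0, 0.0],
    "ARR_DEL15": [0.0, np.nan, np.nan, 1.0],
})

missing = toy["ARR_DEL15"].isna()
cancelled_or_diverted = (toy["CANCELLED"] == 1) | (toy["DIVERTED"] == 1)

# True only if every row with a missing label is cancelled or diverted.
all_explained = bool(cancelled_or_diverted[missing].all())
print(all_explained)
```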


We are now sure that ARR_DEL15 is NULL because the flight was cancelled or diverted. Instead of dropping those rows, we can fill them with 1s, i.e. treat them as delayed, since a cancellation or diversion causes passengers at least as much discomfort as a delay.

In [18]:
flightdata = flightdata.fillna({'ARR_DEL15':1})
flightdata.iloc[177:185]

Unnamed: 0,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,ORIGIN,DEST,CRS_DEP_TIME,CANCELLED,DIVERTED,ARR_DEL15
177,1,9,6,MSP,SEA,701,0.0,1.0,1.0
178,1,9,6,DTW,JFK,1527,0.0,0.0,0.0
179,1,10,7,MSP,DTW,1348,1.0,0.0,1.0
180,1,10,7,DTW,MSP,1540,0.0,0.0,0.0
181,1,10,7,JFK,ATL,1325,0.0,0.0,0.0
182,1,10,7,JFK,ATL,610,0.0,0.0,0.0
183,1,10,7,JFK,SEA,1615,0.0,0.0,0.0
184,1,10,7,MSP,DTW,625,0.0,1.0,1.0


The CRS_DEP_TIME column of the dataset you are using represents scheduled departure times. The granularity of the numbers in this column — it contains more than 500 unique values — could have a negative impact on accuracy in a machine-learning model. This can be resolved using a technique called binning or quantization. What if you divided each number in this column by 100 and rounded down to the nearest integer? 1030 would become 10, 1925 would become 19, and so on, and you would be left with a maximum of 24 discrete values in this column. Intuitively, it makes sense, because it probably doesn't matter much whether a flight leaves at 10:30 a.m. or 10:40 a.m. It matters a great deal whether it leaves at 10:30 a.m. or 5:30 p.m.

In [19]:
# Integer-divide by 100 to keep only the hour (e.g. 1030 -> 10); vectorized, so no row loop is needed
flightdata['CRS_DEP_TIME'] = flightdata['CRS_DEP_TIME'] // 100

flightdata.head()

Unnamed: 0,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,ORIGIN,DEST,CRS_DEP_TIME,CANCELLED,DIVERTED,ARR_DEL15
0,1,1,5,ATL,SEA,19,0.0,0.0,0.0
1,1,1,5,DTW,MSP,13,0.0,0.0,0.0
2,1,1,5,ATL,SEA,9,0.0,0.0,0.0
3,1,1,5,SEA,MSP,8,0.0,0.0,0.0
4,1,1,5,SEA,DTW,23,0.0,0.0,0.0
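
The binning arithmetic described above can be checked on a handful of values. This is a minimal sketch; the times are hypothetical examples in the same HHMM format, not rows from the dataset.

```python
import pandas as pd

# Hypothetical scheduled departure times in HHMM format.
dep_time = pd.Series([1030, 1040, 1925, 1730, 5, 2359], name="CRS_DEP_TIME")

# Integer division by 100 discards the minutes and keeps the hour.
dep_hour = dep_time // 100

print(dep_hour.tolist())
print(dep_hour.nunique())
```

Both 10:30 and 10:40 land in the same bin, while 17:30 stays clearly separated from them, which is the behaviour the paragraph above argues for.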


Next we add indicator columns for the ORIGIN and DEST airports.

In [20]:
flightdata = pd.get_dummies(flightdata, columns=['ORIGIN','DEST'])
flightdata.head()

Unnamed: 0,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,CRS_DEP_TIME,CANCELLED,DIVERTED,ARR_DEL15,ORIGIN_ATL,ORIGIN_DTW,ORIGIN_JFK,ORIGIN_MSP,ORIGIN_SEA,DEST_ATL,DEST_DTW,DEST_JFK,DEST_MSP,DEST_SEA
0,1,1,5,19,0.0,0.0,0.0,1,0,0,0,0,0,0,0,0,1
1,1,1,5,13,0.0,0.0,0.0,0,1,0,0,0,0,0,0,1,0
2,1,1,5,9,0.0,0.0,0.0,1,0,0,0,0,0,0,0,0,1
3,1,1,5,8,0.0,0.0,0.0,0,0,0,0,1,0,0,0,1,0
4,1,1,5,23,0.0,0.0,0.0,0,0,0,0,1,0,1,0,0,0
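
One practical note: from pandas 2.0 onward, `pd.get_dummies` returns boolean columns by default, so the output may show `True`/`False` instead of the 1/0 values above. A minimal sketch on a toy frame (made-up rows) that requests integer indicators explicitly:

```python
import pandas as pd

# Toy frame with hypothetical routes.
toy = pd.DataFrame({
    "ORIGIN": ["ATL", "DTW", "SEA"],
    "DEST": ["SEA", "MSP", "DTW"],
})

# dtype=int asks for 0/1 indicators regardless of the pandas default.
encoded = pd.get_dummies(toy, columns=["ORIGIN", "DEST"], dtype=int)

print(encoded.columns.tolist())
print(encoded.iloc[0].tolist())
```

Each row has exactly one 1 among the `ORIGIN_*` columns and one among the `DEST_*` columns.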


We are now closer to building a model. To begin, we split the data into training and test sets using the familiar 80-20 split.

In [21]:
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(flightdata.drop('ARR_DEL15', axis=1), flightdata['ARR_DEL15'], test_size=0.2, random_state=42)

print('Shape of the training features')
print(train_x.shape)

print('Shape of the test features')
print(test_x.shape)

print('Shape of the training labels')
print(train_y.shape)

print('Shape of the testing labels')
print(test_y.shape)

Shape of the training features
(8984, 16)
Shape of the test features
(2247, 16)
Shape of the training labels
(8984,)
Shape of the testing labels
(2247,)
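
The shapes above follow from the row count and the 20% test fraction. The sketch below reproduces the arithmetic, assuming the test count is rounded up and the remainder goes to training, which is consistent with the numbers printed above.

```python
import math

n_samples = 11231   # rows reported earlier by flightdata.shape
test_size = 0.2     # fraction passed to train_test_split

# Assumption: the test count is rounded up; training gets the remaining rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```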


Let us take a moment to understand what we are trying to accomplish. Given the input features, we want to determine whether a flight will arrive late. In other words, we want to classify each record in the dataset as delayed on arrival or not delayed on arrival.
This is different from regression, which predicts a numerical value by generalizing from known observations to situations where no observation has been made.

Thus, we need a classification model. There is no yardstick that tells a data scientist up front which algorithm to select. The practical way is to build a model with one algorithm and measure its performance before deciding whether to switch to another.

We start with RandomForestClassifier, which is a robust classification method.

In [22]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=13)
model.fit(train_x, train_y)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=13, verbose=0, warm_start=False)

In [23]:
predicted = model.predict(test_x)
model.score(test_x, test_y)

0.8713840676457499
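
For a classifier, `model.score` returns the mean accuracy. Accuracy can look flattering when one class dominates, which is worth keeping in mind when reading the number above. The sketch below uses made-up labels and a hypothetical "model" that always predicts on time.

```python
# Hypothetical labels: 9 on-time flights (0) and 1 delayed flight (1).
actual = [0] * 9 + [1]

# A "model" that ignores its input and always predicts on time.
always_on_time = [0] * len(actual)

# Fraction of predictions that match the actual label.
accuracy = sum(a == p for a, p in zip(actual, always_on_time)) / len(actual)

# Number of delayed flights the model actually identified.
delayed_caught = sum(a == 1 and p == 1 for a, p in zip(actual, always_on_time))

print(accuracy, delayed_caught)
```

The accuracy is high even though no delayed flight is ever detected, so accuracy needs to be read alongside other measures.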

Having a model is not sufficient; we also need to evaluate it. The score computed above is the mean accuracy on the test set, and it is a good starting point. It ranges from 0 to 1, and the common notion is that a higher score means a better model. Accuracy alone is not a solid enough metric to move ahead with, though. Other measures are more informative, and we compute one of them next for the model we built.

In [25]:
from sklearn.metrics import roc_auc_score
probabilities = model.predict_proba(test_x)
roc_auc_score(test_y, probabilities[:,1])

0.684956385692647
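
To read this number: ROC AUC equals the probability that a randomly chosen delayed flight receives a higher predicted probability than a randomly chosen on-time flight, so 0.5 corresponds to chance and 1.0 to a perfect ranking. The sketch below computes it by brute force over all pairs. The helper `pairwise_auc` and the sample values are hypothetical and meant only to illustrate the definition, not to replace `roc_auc_score`.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities of the positive class.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(pairwise_auc(labels, scores))
```

Three of the four positive-negative pairs are ordered correctly here, so the result is 0.75. By the same reading, the value of about 0.68 above means the model ranks a delayed flight above an on-time one roughly two times out of three.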