## Predicting Airline Arrivals 

* We will predict how late an airplane will arrive.
* A flight counts as late when it arrives more than 30 minutes after its scheduled arrival time; the target is the number of minutes beyond that 30-minute threshold.
* The dataset covers flights from 2008.
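
As a minimal sketch of the lateness definition above, the helper below maps an arrival delay in minutes to the number of minutes past the 30-minute threshold. The function name `minutes_beyond_threshold` and the sample delays are illustrative assumptions, not part of the dataset.

```python
import numpy as np

def minutes_beyond_threshold(arr_delay, threshold=30):
    """Return minutes of delay past `threshold`, or 0 when the flight is not late."""
    arr_delay = np.asarray(arr_delay, dtype=float)
    return np.where(arr_delay > threshold, arr_delay - threshold, 0.0)

# Illustrative delays (minutes): early, slightly late, exactly 30, just over, very late.
sample_delays = [-5, 12, 30, 31, 95]
excess = minutes_beyond_threshold(sample_delays)
print(excess)
```

A delay of exactly 30 minutes is not counted as late under this definition, because the comparison is strict.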

**Variable descriptions:**
1. Year: 1987-2008 (this file contains 2008 only)

2. Month: month,  1-12

3. DayofMonth: day of month, 1-31

4. DayOfWeek: 1 (Monday) - 7 (Sunday)

5. DepTime: actual departure time (local, hhmm)

6. CRSDepTime: scheduled departure time (local, hhmm)

7. ArrTime: actual arrival time (local, hhmm)

8. CRSArrTime: scheduled arrival time (local, hhmm)

9. UniqueCarrier: unique carrier code

10. FlightNum: flight number

11. TailNum: plane tail number

12. ActualElapsedTime: actual elapsed time in minutes

13. CRSElapsedTime: scheduled elapsed time in minutes

14. AirTime: time in the air, in minutes

15. ArrDelay: arrival delay in minutes

16. DepDelay: departure delay in minutes

17. Origin: origin IATA airport code

18. Dest: destination IATA airport code

19. Distance: distance in miles

20. TaxiIn: taxi-in time in minutes

21. TaxiOut: taxi-out time in minutes

22. Cancelled: was the flight cancelled? 1 = yes, 0 = no

23. CancellationCode: reason for cancellation (A = carrier, B = weather, C = NAS, D = security)

24. Diverted: was the flight diverted? 1 = yes, 0 = no

25. CarrierDelay: delay in minutes

26. WeatherDelay: delay in minutes

27. NASDelay: delay in minutes

28. SecurityDelay: delay in minutes

29. LateAircraftDelay: delay in minutes
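
Several of the time columns above are stored as `hhmm` numbers, so 1955 means 19:55 rather than 1,955 minutes. One possible way to turn them into a quantity that behaves linearly is sketched below. The helper name `hhmm_to_minutes` is a hypothetical choice and the notebook itself does not apply this conversion.

```python
def hhmm_to_minutes(hhmm):
    """Convert a clock time encoded as hhmm (e.g. 1955) to minutes since midnight."""
    hhmm = int(hhmm)  # also accepts floats such as 2003.0; NaN raises ValueError
    hours, minutes = divmod(hhmm, 100)
    # 2400 appears in the data as a maximum, so hour 24 is allowed here.
    if not (0 <= hours <= 24 and 0 <= minutes < 60):
        raise ValueError(f"not a valid hhmm time: {hhmm}")
    return hours * 60 + minutes

# Scheduled departure times taken from the first rows of the data.
for value in (1955, 735, 620):
    print(value, "->", hhmm_to_minutes(value))
```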

In [1]:
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
from sklearn import tree
from IPython.display import Image
%matplotlib inline

from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV

In [2]:
data = pd.read_csv('/Users/jenny/documents/thinkful/random downloaded data/2008.csv')
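
The full file takes more than 1.5 GB in memory, and the model below uses only a handful of columns. A possible way to load less, sketched here on a tiny in-memory stand-in for the CSV, is the `usecols` argument of `pd.read_csv`. The sample rows and the `TailNum` values are illustrative assumptions.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for 2008.csv (values are illustrative only).
csv_text = io.StringIO(
    "Year,Month,DayOfWeek,ArrDelay,DepDelay,TailNum\n"
    "2008,1,4,-14,8,TAIL1\n"
    "2008,1,4,34,34,TAIL2\n"
)

# Only the listed columns are parsed; the rest are skipped while reading.
wanted = ["Month", "DayOfWeek", "ArrDelay", "DepDelay"]
small = pd.read_csv(csv_text, usecols=wanted)
print(small.columns.tolist())
print(small.shape)
```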

In [3]:
data.head()

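| | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2008 | 1 | 3 | 4 | 2003.0 | 1955 | 2211.0 | 2225 | WN | 335 | ... | 4.0 | 8.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 2008 | 1 | 3 | 4 | 754.0 | 735 | 1002.0 | 1000 | WN | 3231 | ... | 5.0 | 10.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 2008 | 1 | 3 | 4 | 628.0 | 620 | 804.0 | 750 | WN | 448 | ... | 3.0 | 17.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 2008 | 1 | 3 | 4 | 926.0 | 930 | 1054.0 | 1100 | WN | 1746 | ... | 3.0 | 7.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 4 | 2008 | 1 | 3 | 4 | 1829.0 | 1755 | 1959.0 | 1925 | WN | 3920 | ... | 3.0 | 10.0 | 0 | NaN | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 32.0 |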


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7009728 entries, 0 to 7009727
Data columns (total 29 columns):
Year                 int64
Month                int64
DayofMonth           int64
DayOfWeek            int64
DepTime              float64
CRSDepTime           int64
ArrTime              float64
CRSArrTime           int64
UniqueCarrier        object
FlightNum            int64
TailNum              object
ActualElapsedTime    float64
CRSElapsedTime       float64
AirTime              float64
ArrDelay             float64
DepDelay             float64
Origin               object
Dest                 object
Distance             int64
TaxiIn               float64
TaxiOut              float64
Cancelled            int64
CancellationCode     object
Diverted             int64
CarrierDelay         float64
WeatherDelay         float64
NASDelay             float64
SecurityDelay        float64
LateAircraftDelay    float64
dtypes: float64(14), int64(10), object(5)
memory usage: 1.5+ GB


In [7]:
data.describe()

| | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | ... | Distance | TaxiIn | TaxiOut | Cancelled | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 7009728.0 | 7009728.0 | 7009728.0 | 7009728.0 | 6873482.0 | 7009728.0 | 6858079.0 | 7009728.0 | 7009728.0 | 6855029.0 | ... | 7009728.0 | 6858079.0 | 6872670.0 | 7009728.0 | 7009728.0 | 1524735.0 | 1524735.0 | 1524735.0 | 1524735.0 | 1524735.0 |
| mean | 2008.0 | 6.37513 | 15.72801 | 3.924182 | 1333.83 | 1326.086 | 1481.258 | 1494.801 | 2224.2 | 127.3224 | ... | 726.387 | 6.860852 | 16.45305 | 0.01960618 | 0.002463006 | 15.77206 | 3.039031 | 17.16462 | 0.07497434 | 20.77098 |
| std | 0.0 | 3.406737 | 8.797068 | 1.988259 | 478.0689 | 464.2509 | 505.2251 | 482.6728 | 1961.716 | 70.18731 | ... | 562.1018 | 4.933649 | 11.3328 | 0.1386426 | 0.04956753 | 40.09912 | 19.50287 | 31.89495 | 1.83794 | 39.25964 |
| min | 2008.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 12.0 | ... | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2008.0 | 3.0 | 8.0 | 2.0 | 928.0 | 925.0 | 1107.0 | 1115.0 | 622.0 | 77.0 | ... | 325.0 | 4.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2008.0 | 6.0 | 16.0 | 4.0 | 1325.0 | 1320.0 | 1512.0 | 1517.0 | 1571.0 | 110.0 | ... | 581.0 | 6.0 | 14.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 0.0 | 0.0 |
| 75% | 2008.0 | 9.0 | 23.0 | 6.0 | 1728.0 | 1715.0 | 1909.0 | 1907.0 | 3518.0 | 157.0 | ... | 954.0 | 8.0 | 19.0 | 0.0 | 0.0 | 16.0 | 0.0 | 21.0 | 0.0 | 26.0 |
| max | 2008.0 | 12.0 | 31.0 | 7.0 | 2400.0 | 2359.0 | 2400.0 | 2400.0 | 9743.0 | 1379.0 | ... | 4962.0 | 308.0 | 429.0 | 1.0 | 1.0 | 2436.0 | 1352.0 | 1357.0 | 392.0 | 1316.0 |
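
The `count` row of the summary reports non-null values, so comparing it with the 7,009,728 total rows shows how much of each column is missing. The short calculation below uses counts copied from that row; the dictionary name `non_null_counts` is just an illustrative choice. On the loaded frame, `data.isna().mean()` should give the same shares directly.

```python
total_rows = 7_009_728  # RangeIndex size reported by data.info()

# Non-null counts copied from the `count` row of data.describe().
non_null_counts = {
    "DepTime": 6_873_482,
    "ArrTime": 6_858_079,
    "CarrierDelay": 1_524_735,
}

missing_share = {name: 1 - count / total_rows for name, count in non_null_counts.items()}
for name, share in missing_share.items():
    print(f"{name}: {share:.1%} missing")
```

Roughly 2% of departure and arrival times are missing, while the delay-cause columns are missing for about 78% of flights.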


In [8]:
data.corr()

| | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | ... | Distance | TaxiIn | TaxiOut | Cancelled | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Year | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Month | NaN | 1.0 | 0.001816 | -0.003727 | -0.010095 | -0.008215 | -0.00041 | 0.000249 | 0.002502 | -0.014061 | ... | -0.004148 | 0.007115 | -0.016092 | -0.028392 | 0.00184 | -0.000232 | 0.005598 | 0.013533 | -0.003165 | 0.002176 |
| DayofMonth | NaN | 0.001816 | 1.0 | 0.00565 | -0.001501 | -0.001449 | -0.001549 | -0.001442 | -0.000679 | 0.000598 | ... | 0.001799 | -0.002876 | -0.000708 | -0.008432 | 0.000612 | 0.000249 | 0.001249 | 0.004114 | -0.000276 | 0.009385 |
| DayOfWeek | NaN | -0.003727 | 0.00565 | 1.0 | 0.005576 | 0.005322 | 0.003378 | 0.005218 | -0.001274 | 0.012929 | ... | 0.017231 | 0.010199 | -0.007137 | -0.006286 | 0.001504 | 0.012319 | 0.006441 | -0.007274 | 0.004513 | 0.012024 |
| DepTime | NaN | -0.010095 | -0.001501 | 0.005576 | 1.0 | 0.968457 | 0.712649 | 0.791164 | -0.00608 | -0.01707 | ... | -0.016545 | -0.042257 | 0.050687 | 0.002028 | 0.001746 | 0.001787 | 0.023042 | -0.013341 | -0.009065 | 0.205079 |
| CRSDepTime | NaN | -0.008215 | -0.001449 | 0.005322 | 0.968457 | 1.0 | 0.696878 | 0.791819 | -0.010678 | -0.017184 | ... | -0.013143 | -0.047567 | 0.039814 | 0.016218 | -0.00081 | -0.053916 | 0.006909 | -0.052253 | -0.011447 | 0.191594 |
| ArrTime | NaN | -0.00041 | -0.001549 | 0.003378 | 0.712649 | 0.696878 | 1.0 | 0.861972 | -0.01766 | 0.037625 | ... | 0.02898 | 0.007112 | 0.049434 | NaN | -0.000683 | -0.058128 | -0.020296 | 0.019385 | -0.005345 | -0.009715 |
| CRSArrTime | NaN | 0.000249 | -0.001442 | 0.005218 | 0.791164 | 0.791819 | 0.861972 | 1.0 | -0.027878 | 0.051469 | ... | 0.045825 | -0.006542 | 0.059925 | 0.013236 | 0.007386 | -0.05337 | 0.007568 | -0.010596 | -0.009237 | 0.153406 |
| FlightNum | NaN | 0.002502 | -0.000679 | -0.001274 | -0.00608 | -0.010678 | -0.01766 | -0.027878 | 1.0 | -0.319347 | ... | -0.349557 | -0.009515 | 0.016561 | 0.042066 | -6.5e-05 | 0.057192 | 0.064391 | 0.004423 | -0.001206 | -0.034556 |
| ActualElapsedTime | NaN | -0.014061 | 0.000598 | 0.012929 | -0.01707 | -0.017184 | 0.037625 | 0.051469 | -0.319347 | 1.0 | ... | 0.964521 | 0.158444 | 0.267801 | NaN | NaN | -0.032919 | -0.013712 | 0.203805 | 0.000318 | -0.087003 |
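
The frame contains text columns such as `UniqueCarrier`, and recent pandas releases no longer drop those silently when computing correlations. A version-independent sketch is to select the numeric columns first, as below. The toy frame is an illustrative assumption; passing `numeric_only=True` to `corr` should have the same effect on pandas 1.5 or later.

```python
import pandas as pd

# Toy frame mixing numeric and text columns (values are illustrative).
toy = pd.DataFrame({
    "DepDelay": [8.0, 19.0, 8.0, -4.0, 34.0],
    "ArrDelay": [-14.0, 2.0, 14.0, -6.0, 34.0],
    "UniqueCarrier": ["WN"] * 5,
})

# Restrict to numeric columns before computing pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()
print(corr)
```

A column with zero variance, like `Year` in this single-year file, produces NaN correlations because its standard deviation is 0.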


In [4]:
X = data[['Month', 'DayOfWeek', 'CRSDepTime', 'CRSArrTime', 'UniqueCarrier', 
          'Origin', 'Dest', 'Distance', 'DepDelay']]
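# Target: minutes of arrival delay beyond the 30-minute threshold (0 if not late).
# Rows with a missing ArrDelay compare as False and are therefore assigned 0.
y = np.where(data.ArrDelay > 30, data.ArrDelay - 30, 0)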
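
Flights with no recorded `ArrDelay` (for example cancelled ones) end up with a target of 0 here, which treats them like on-time arrivals. The sketch below, on a toy frame with illustrative values, contrasts that behaviour with one possible alternative: dropping rows without an arrival delay before building the target. The notebook itself keeps all rows.

```python
import numpy as np
import pandas as pd

# Toy data: the second flight was cancelled, so it has no arrival delay.
toy = pd.DataFrame({
    "ArrDelay": [45.0, np.nan, 10.0, 75.0],
    "Cancelled": [0, 1, 0, 0],
})

# Same expression as in the notebook: NaN > 30 is False, so the NaN row becomes 0.
y_all = np.where(toy.ArrDelay > 30, toy.ArrDelay - 30, 0)

# Alternative (an assumption, not the notebook's choice): keep only rows with a delay.
completed = toy[toy.ArrDelay.notna()]
y_completed = np.where(completed.ArrDelay > 30, completed.ArrDelay - 30, 0)

print(y_all)
print(y_completed)
```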

In [5]:
X.head()

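| | Month | DayOfWeek | CRSDepTime | CRSArrTime | UniqueCarrier | Origin | Dest | Distance | DepDelay |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 4 | 1955 | 2225 | WN | IAD | TPA | 810 | 8.0 |
| 1 | 1 | 4 | 735 | 1000 | WN | IAD | TPA | 810 | 19.0 |
| 2 | 1 | 4 | 620 | 750 | WN | IND | BWI | 515 | 8.0 |
| 3 | 1 | 4 | 930 | 1100 | WN | IND | BWI | 515 | -4.0 |
| 4 | 1 | 4 | 1755 | 1925 | WN | IND | BWI | 515 | 34.0 |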


In [6]:
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname, col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)

X = MultiColumnLabelEncoder(columns = ['UniqueCarrier','Origin','Dest']).fit_transform(X)
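
The encoder above refits a fresh `LabelEncoder` on every call to `transform`, which is fine when it is applied once to the whole frame, as here. If the encoding had to be learned on training data and reused on new data, one possible alternative is scikit-learn's `OrdinalEncoder`, sketched below. The toy frames and the carrier code `XX` are hypothetical, and this is not the approach the notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy training and test frames (values are illustrative; "XX" is a made-up carrier).
train = pd.DataFrame({"UniqueCarrier": ["WN", "AA", "WN"], "Origin": ["IAD", "IND", "IND"]})
test = pd.DataFrame({"UniqueCarrier": ["AA", "XX"], "Origin": ["IND", "IAD"]})

# Categories are learned once from the training data; unseen values map to -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

encoded_test = encoder.transform(test)
print(encoded_test)
```

Either way, the integer codes impose an arbitrary ordering on carriers and airports. Tree-based models can usually work with that, but it is worth keeping in mind when reading feature effects.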

In [7]:
X = X.fillna(X.mean())
X.head()

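| | Month | DayOfWeek | CRSDepTime | CRSArrTime | UniqueCarrier | Origin | Dest | Distance | DepDelay |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 4 | 1955 | 2225 | 17 | 135 | 286 | 810 | 8.0 |
| 1 | 1 | 4 | 735 | 1000 | 17 | 135 | 286 | 810 | 19.0 |
| 2 | 1 | 4 | 620 | 750 | 17 | 140 | 48 | 515 | 8.0 |
| 3 | 1 | 4 | 930 | 1100 | 17 | 140 | 48 | 515 | -4.0 |
| 4 | 1 | 4 | 1755 | 1925 | 17 | 140 | 48 | 515 | 34.0 |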
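
`X.fillna(X.mean())` replaces each missing value with the mean of its own column. The toy example below, with illustrative values, shows the effect on a column with one gap. In the notebook's feature set, `DepDelay` is the only float column, so it is the one most likely to be affected.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing departure delay (values are illustrative).
toy = pd.DataFrame({
    "Distance": [810, 515, 515],
    "DepDelay": [8.0, np.nan, 34.0],
})

# Column means ignore NaN by default, so DepDelay's mean is (8 + 34) / 2 = 21.
filled = toy.fillna(toy.mean())
print(filled)
```

Note that the mean is computed over all rows before the train/test split, so a small amount of test-set information enters the imputed values.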


In [8]:
X.describe()

| | Month | DayOfWeek | CRSDepTime | CRSArrTime | UniqueCarrier | Origin | Dest | Distance | DepDelay |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 7009728.0 | 7009728.0 | 7009728.0 | 7009728.0 | 7009728.0 | 7009728.0 | 7009728.0 | 7009728.0 | 7009728.0 |
| mean | 6.37513 | 3.924182 | 1326.086 | 1494.801 | 11.10727 | 148.5133 | 149.4405 | 726.387 | 9.97257 |
| std | 3.406737 | 1.988259 | 464.2509 | 482.6728 | 5.890848 | 80.90266 | 81.29815 | 562.1018 | 34.96642 |
| min | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | -534.0 |
| 25% | 3.0 | 2.0 | 925.0 | 1115.0 | 6.0 | 80.0 | 81.0 | 325.0 | -4.0 |
| 50% | 6.0 | 4.0 | 1320.0 | 1517.0 | 12.0 | 155.0 | 156.0 | 581.0 | -1.0 |
| 75% | 9.0 | 6.0 | 1715.0 | 1907.0 | 17.0 | 211.0 | 213.0 | 954.0 | 9.0 |
| max | 12.0 | 7.0 | 2359.0 | 2400.0 | 19.0 | 302.0 | 303.0 | 4962.0 | 2467.0 |


In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=85)
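
With no `test_size` given, `train_test_split` holds out 25% of the rows for testing. The small sketch below confirms the proportions on a toy array; the array shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy rows with 2 features each.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Same call shape as in the notebook: default test_size, fixed random_state.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=85)
print(len(X_tr), len(X_te))
```

Fixing `random_state` makes the split reproducible between runs.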

In [10]:
clr1 = ensemble.RandomForestRegressor()
cv1 = cross_val_score(clr1, X_train, y_train, cv=5)
print(cv1)
print("mean = {:.3}".format(cv1.mean()))

[ 0.8992554   0.90070925  0.89902519  0.90522841  0.89974764]
mean = 0.901
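
The five numbers printed above are R-squared values, because `cross_val_score` falls back to the estimator's own `score` method when no `scoring` argument is given. Since the target is measured in minutes, an error metric in minutes may be easier to interpret. The sketch below shows both on small synthetic data; the data and the choice of 10 trees are assumptions made to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Small synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=10, random_state=0)

# Default scoring for a regressor is R^2.
r2_scores = cross_val_score(model, X_toy, y_toy, cv=5)

# scikit-learn reports errors as negated values, so flip the sign to get MAE.
mae_scores = -cross_val_score(model, X_toy, y_toy, cv=5, scoring="neg_mean_absolute_error")

print("R^2 per fold:", r2_scores)
print("MAE per fold:", mae_scores)
```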


In [16]:
#takes long time to run
#clr2 = ensemble.RandomForestRegressor(n_estimators=30, max_depth=3)
#cv2 = cross_val_score(clr2, X_train, y_train, cv=5)
#print(cv2)
#print("mean = {:.3}".format(cv2.mean()))

In [None]:
#takes really long time to run
#clr3 = ensemble.RandomForestRegressor(n_estimators=60, max_depth=2)
#cv3 = cross_val_score(clr3, X_train, y_train, cv=5)
#print(cv3)
#print("mean = {:.3}".format(cv3.mean()))
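
When cross-validation on millions of rows is too slow, one option is to compare candidate settings on a random subsample first. The sketch below only draws the row positions; `n_rows` and `sample_size` are hypothetical stand-ins, and this step is not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(85)

n_rows = 10_000      # stand-in for len(X_train)
sample_size = 1_000  # hypothetical subsample size

# Draw distinct row positions without replacement.
idx = rng.choice(n_rows, size=sample_size, replace=False)

# With the notebook's objects this would select matching rows, e.g.
#   X_small = X_train.iloc[idx]
#   y_small = y_train[idx]
print(idx[:5], len(idx))
```

Scores from a subsample are noisier than scores from the full training set, so the final model should still be checked on all the data.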

In [14]:
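#100 boosting iterations, 3-deep trees, least-absolute-deviation loss
#('lad' was renamed 'absolute_error' in scikit-learn 1.0 and removed in 1.2)
params = {'n_estimators': 100,
         'max_depth': 3,
         'loss': 'absolute_error'}

#instantiate and cross-validate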
gbr1 = ensemble.GradientBoostingRegressor(**params)
gb1 = cross_val_score(gbr1, X_train, y_train, cv=5)
print(gb1)
print('mean = {:.3}'.format(gb1.mean()))

[ 0.89059531  0.88637432  0.88612945  0.88200796  0.89075297]
mean = 0.887
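
The boosting model above is fit with an absolute-error loss rather than the default squared error. A quick way to see the difference is to ask which single constant prediction each loss prefers: squared error picks the mean, absolute error picks the median. The toy target below, mostly zeros with one large delay, is an illustrative assumption that loosely mimics the shape of `y`.

```python
import numpy as np

# Toy target: four on-time flights and one very late one.
y_toy = np.array([0.0, 0.0, 0.0, 0.0, 300.0])

# Try every whole-minute constant prediction from 0 to 300.
candidates = np.arange(0.0, 301.0, 1.0)
mse = [np.mean((y_toy - c) ** 2) for c in candidates]
mae = [np.mean(np.abs(y_toy - c)) for c in candidates]

best_mse = candidates[int(np.argmin(mse))]  # equals the mean of y_toy
best_mae = candidates[int(np.argmin(mae))]  # equals the median of y_toy
print(best_mse, best_mae)
```

Because most flights have a target of 0, an absolute-error model is pulled toward predicting 0 and is less sensitive to rare extreme delays.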


In [15]:
#fit clr1 on the training set and evaluate it on the test set
clr1.fit(X_train, y_train)
predict = clr1.predict(X_test)
r_sqrd = clr1.score(X_test, y_test)
print('R-squared is {:.3}'.format(r_sqrd))

R-squared is 0.899
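
The R-squared reported by `score` can be reproduced from predictions as one minus the ratio of residual to total sum of squares. The sketch below does this by hand and adds a mean absolute error in minutes. The function names and toy arrays are illustrative assumptions.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target (minutes)."""
    return float(np.mean(np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))))

# Toy targets and predictions (minutes beyond the 30-minute threshold).
y_true_toy = [0, 0, 15, 45]
y_pred_toy = [0, 5, 10, 40]

print("R-squared is {:.3}".format(r_squared(y_true_toy, y_pred_toy)))
print("MAE is {:.3} minutes".format(mae(y_true_toy, y_pred_toy)))
```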
