Testing the data with decision trees and random forests was frustrating.  Many models were tried, and it took until the 8th model tested to produce an accuracy score greater than 83%.  However, 6 of the subsequent 8 models produced accuracy scores of 99.99%+!  After finally getting a good working model, we could not resist further testing under different conditions to see if the results could be replicated. The following produced the best results:

1) Departure Delays 10 Features -> Decision Tree using PCA -> Accuracy = 99.99%
2) Arrival Delays 18 Features -> Decision Tree using PCA -> Accuracy = 100%

These two models also produced the same result when delays of 15+ minutes were used.  Perhaps multicollinearity was a much bigger factor than initially anticipated, which the PCA transform helped soften significantly.  Note, however, that in the cells below these trees are fit and scored on the same test split (`clf.fit(X_test_transformed, y_test)`), so the near-perfect scores are resubstitution accuracy and may not carry over to unseen flights.

A final model was tested: delays were split into 3 categorical values and factorized:
1) On Time -> Delay <= 0 minutes
2) Small Delay -> 0 < Delay <= 15 minutes
3) Long Delay -> Delay > 15 minutes

To our surprise, when the best feature sets above were reused for arrival and departure delays, both subsequent models produced accuracy scores of 100% for predicting the 3 delay categories.

This is a great example of trying to utilize too many models; however, this came from trying out both decision trees and random forests, with the latter not producing any meaningful results.  Perhaps this was due to the binary nature of our dependent variable.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.simplefilter('ignore')

# Random seed shared by the train/test splits and the classifiers
rng = int(np.random.randint(low=1, high=2000))

In [2]:
# Read in first quarter dataset
delays_df = pd.read_csv("Delay_first_quarter.csv")

In [3]:
# Do some additional cleaning
delays_df = delays_df.fillna(0)

In [4]:
# Dummy variables for flights with an east coast origin/destination (longitude >= -83)
# Dummy variables for flights with a west coast origin/destination (longitude <= -114)

delays_df['EAST_COAST_ORIGIN'] = (delays_df["ORIGIN_LONGITUDE"] >= -83).astype(int)
delays_df['EAST_COAST_DEST'] = (delays_df["DEST_LONGITUDE"] >= -83).astype(int)
delays_df['WEST_COAST_ORIGIN'] = (delays_df["ORIGIN_LONGITUDE"] <= -114).astype(int)
delays_df['WEST_COAST_DEST'] = (delays_df["DEST_LONGITUDE"] <= -114).astype(int)

In [5]:
# Create more dummy variables for categorical data:

Weekday = {
           "Monday": 1,
           "Tuesday": 2,
           "Wednesday": 3, 
           "Thursday" : 4,
           "Friday": 5,
           "Saturday": 6,
           "Sunday": 7
          }

Airline = {
        "UA": 1,
        "AA": 2,
        "9E": 3,
        "B6": 4,
        "EV": 5,
        "F9": 6,
        "G4": 7,
        "HA": 8,
        "MQ": 9,
        "NK": 10,
        "OH": 11,
        "OO": 12,
        "VX": 13,
        "WN": 14,
        "YV": 15,
        "YX": 16,
        "AS": 17,
        "DL": 18
}

In [6]:
delays_df['WEEKDAY_DUMMY'] = delays_df['WEEKDAY'].apply(  \
                            lambda x: next((y for z, y in Weekday.items() if x in z), 0))

In [7]:
delays_df['AIRLINE_DUMMY'] = delays_df['OP_CARRIER'].apply(  \
                            lambda x: next((y for z, y in Airline.items() if x in z), 0))
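
The two lookups above scan the whole dictionary for every row. A minimal sketch of an alternative, assuming the column values match the dictionary keys exactly, is `Series.map`. Note that the original `x in z` test also accepts substrings, so the behaviour is not identical. The `carriers` series and the shortened `airline_codes` mapping are made-up illustration data.

```python
import pandas as pd

# Subset of the Airline mapping above, for illustration only
airline_codes = {"UA": 1, "AA": 2, "DL": 18}

# Hypothetical sample of OP_CARRIER values; "ZZ" is not in the mapping
carriers = pd.Series(["UA", "DL", "ZZ", "AA"])

# map() does an exact-key lookup; unmatched carriers become NaN, then 0
airline_dummy = carriers.map(airline_codes).fillna(0).astype(int)
print(airline_dummy.tolist())
```

The fallback of 0 for unknown carriers mirrors the default used in the `next(...)` calls above.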

In [8]:
# Make a proper categorical delay dummy variable:
# keep the numeric 0/1 flag in a separate dummy column, then relabel the original column.
# This uses less memory than using lambda.

delay_dict = {0: "On Time", 1: "Delayed"}
delays_df["DEPARTURE_DELAY_DUMMY"] = delays_df["DEPARTURE_DELAY"].copy()
delays_df["DEPARTURE_DELAY"] = delays_df["DEPARTURE_DELAY"].replace(delay_dict)

In [9]:
arrive_dict = {0: "On Time", 1: "Delayed"}
delays_df["ARRIVAL_DELAY_DUMMY"] = delays_df["ARRIVAL_DELAY"].copy()
delays_df["ARRIVAL_DELAY"] = delays_df["ARRIVAL_DELAY"].replace(arrive_dict)

In [15]:
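# Whole-minute bins; a delay outside every range falls through to the default 0 in the lookups below
Delay_dict = {range(-2000, 1): "On Time",
              range(1, 16): "Small Delay",
              range(16, 2000): "Long Delay"}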

In [17]:
delays_df['DEPARTURE_DELAY_TEST'] = delays_df['DEP_DELAY'].apply(  \
                            lambda x: next((y for z, y in Delay_dict.items() if x in z), 0))

In [20]:
delays_df['ARRIVAL_DELAY_TEST'] = delays_df['ARR_DELAY'].apply(  \
                            lambda x: next((y for z, y in Delay_dict.items() if x in z), 0))
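
The range-based lookup only matches whole-minute delays between -2000 and 1999; anything else receives the default `0`. As a hedged alternative, `pd.cut` with open-ended bins assigns every numeric delay to one of the three labels. The sample delays below are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical delays in minutes, including a fractional and an extreme value
dep_delay = pd.Series([-5, 0, 1, 15, 16, 7.5, 2500])

# Right-closed bins: (-inf, 0], (0, 15], (15, inf)
bins = [-np.inf, 0, 15, np.inf]
labels = ["On Time", "Small Delay", "Long Delay"]

delay_category = pd.cut(dep_delay, bins=bins, labels=labels)
print(delay_category.tolist())
```

Because the outer bin edges are infinite, no row is left without a category, and the binning is vectorized rather than applied row by row.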

FIRST DEPARTURE MODEL - BOTH DECISION TREE (72.7%) AND RANDOM FOREST (68%)

In [9]:
# As this follows a similar path to logistic regression, the same initial steps will be used.
# Start with first model

# Departure delay model -> 10 features

X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEPARTURE_TIME_OF_DAY_DUMMY", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
               "OP_CARRIER_FL_NUM", "TAXI_OUT", "WHEELS_OFF","WEATHER_DELAY",]]

y = delays_df["DEPARTURE_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 10) (1683475, 1)


In [10]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng, stratify=y)

In [29]:
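# Set up the random forest classifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score  # needed for the scoring cell below

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=rng, oob_score=True)
# Note: the forest is fit on the test split, so it is later scored on the rows it was trained on
clf.fit(X_test, y_test)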

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=2, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=True, random_state=552, verbose=0,
                       warm_start=False)

In [30]:
print(clf.feature_importances_)

[1.01942146e-02 8.39816741e-04 3.10909307e-01 1.60446742e-01
 4.10788513e-05 1.04169516e-02 1.29436184e-04 5.16579186e-02
 3.02235443e-01 1.53129091e-01]


In [31]:
predictions = clf.predict(X_test)

In [32]:
y_transposed = (np.transpose(y_test)).flatten()

In [33]:
print(f"Accuracy Score: {accuracy_score(y_test, predictions)*100}")

Accuracy Score: 68.01570084753213


In [34]:
# Good at predicting on-time flights, but almost every actual delay is missed.
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

Predicted Delays,Delayed,On Time
Actual Delays,Unnamed: 1_level_1,Unnamed: 2_level_1
Delayed,3449,134603
On Time,9,282808
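
To make the crosstab above easier to read, the sketch below derives accuracy, per-class recall and a majority-class baseline from the same four counts. Only the counts come from the notebook; the variable names are illustrative.

```python
import numpy as np

# Rows = actual (Delayed, On Time), columns = predicted (Delayed, On Time),
# copied from the crosstab above
cm = np.array([[3449, 134603],
               [9, 282808]])

accuracy = np.trace(cm) / cm.sum()
recall_delayed = cm[0, 0] / cm[0].sum()   # share of real delays that were caught
recall_on_time = cm[1, 1] / cm[1].sum()   # share of on-time flights that were caught
baseline = cm[1].sum() / cm.sum()         # accuracy of always predicting "On Time"

print(f"accuracy={accuracy:.4f}, baseline={baseline:.4f}")
print(f"recall delayed={recall_delayed:.4f}, recall on time={recall_on_time:.4f}")
```

Comparing `accuracy` with `baseline` shows how little the forest adds over always predicting "On Time", which the single accuracy figure hides.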


In [24]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

clf = DecisionTreeClassifier(criterion = "entropy", random_state=rng)
clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=552, splitter='best')

In [25]:
y_transposed = (np.transpose(y_test)).flatten()

In [26]:
predictions = clf.predict(X_test)

In [27]:
print(f"Accuracy Score: {accuracy_score(y_test, predictions)*100}")

Accuracy Score: 72.70148193380838


In [28]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

Predicted Delays,Delayed,On Time
Actual Delays,Unnamed: 1_level_1,Unnamed: 2_level_1
Delayed,79726,58326
On Time,56565,226252


SECOND DEPARTURE DELAY 4 FEATURES -> RANDOM FOREST (68%)

In [36]:
# A complete stripdown to see if any useful analysis can be done with this method on delay data.
# Only 4 features

X = delays_df[["DEP_TIME", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", "WEATHER_DELAY",]]

y = delays_df["DEPARTURE_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 4) (1683475, 1)


In [37]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng, stratify=y)

In [38]:
# Set up the random forest classifier

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=rng, oob_score=True)
clf.fit(X_test, y_test)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=2, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=True, random_state=552, verbose=0,
                       warm_start=False)

In [39]:
# Hey look at that.  Numbers greater than 0.00001
print(clf.feature_importances_)

[0.6179831  0.00112584 0.10440218 0.27648889]


In [40]:
predictions = clf.predict(X_test)

In [41]:
y_transposed = (np.transpose(y_test)).flatten()

In [42]:
print(f"Accuracy Score: {accuracy_score(y_test, predictions)*100}")

Accuracy Score: 68.2326329570484


In [43]:
# As predicted, not useful.
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

Predicted Delays,Delayed,On Time
Actual Delays,Unnamed: 1_level_1,Unnamed: 2_level_1
Delayed,4363,133689
On Time,10,282807


ARRIVAL DELAYS WITH 18 FEATURES -> RANDOM FOREST (83.7%)

In [58]:
# Trying out Arrival data before utilizing a PCA.

X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEP_DELAY", "DEPARTURE_TIME_OF_DAY_DUMMY", \
              "OP_CARRIER_FL_NUM", "TAXI_OUT", "AIR_TIME", "TAXI_IN", \
              "ARRIVAL_TIME_OF_DAY_DUMMY", "CARRIER_DELAY", "DISTANCE", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
       "WEATHER_DELAY", "NAS_DELAY", "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY",]]
y = delays_df["ARRIVAL_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 18) (1683475, 1)


In [46]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng, stratify=y)

In [47]:
# Set up the random forest classifier

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=rng, oob_score=True)
clf.fit(X_test, y_test)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=2, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=True, random_state=552, verbose=0,
                       warm_start=False)

In [48]:
print(clf.feature_importances_)

[1.43593371e-03 0.00000000e+00 1.74217006e-02 3.03704467e-01
 8.02006973e-03 2.94857369e-04 8.72869937e-02 9.97959256e-03
 8.77208326e-03 4.23487158e-04 1.59000164e-01 1.25210929e-04
 0.00000000e+00 5.49552384e-04 3.31999451e-02 2.06266546e-01
 0.00000000e+00 1.63519397e-01]
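
The raw importance array is hard to read without the column names. A small sketch that pairs the printed values with the 18 feature names, in the order they were passed to `X`, and ranks them:

```python
import pandas as pd

features = ["DAY", "MONTH", "DEP_TIME", "DEP_DELAY", "DEPARTURE_TIME_OF_DAY_DUMMY",
            "OP_CARRIER_FL_NUM", "TAXI_OUT", "AIR_TIME", "TAXI_IN",
            "ARRIVAL_TIME_OF_DAY_DUMMY", "CARRIER_DELAY", "DISTANCE", "WEEKDAY_DUMMY",
            "AIRLINE_DUMMY", "WEATHER_DELAY", "NAS_DELAY", "SECURITY_DELAY",
            "LATE_AIRCRAFT_DELAY"]

# Values copied from the printed clf.feature_importances_ above
importances = [1.43593371e-03, 0.0, 1.74217006e-02, 3.03704467e-01,
               8.02006973e-03, 2.94857369e-04, 8.72869937e-02, 9.97959256e-03,
               8.77208326e-03, 4.23487158e-04, 1.59000164e-01, 1.25210929e-04,
               0.0, 5.49552384e-04, 3.31999451e-02, 2.06266546e-01,
               0.0, 1.63519397e-01]

# Label each importance with its feature and sort from most to least important
ranked = pd.Series(importances, index=features).sort_values(ascending=False)
print(ranked.head(5))
```

In a live session the same ranking could be built directly with `pd.Series(clf.feature_importances_, index=X.columns)`.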


In [49]:
# Well that is better
predictions = clf.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_test, predictions)*100}")

Accuracy Score: 83.68162064680459


In [50]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

Predicted Delays,0,1
Actual Delays,Unnamed: 1_level_1,Unnamed: 2_level_1
0,282965,195
1,68484,69225


ARRIVAL DELAYS WITH 8 FEATURES -> RANDOM FOREST (81.86%)

In [52]:
# Strip features down further.

X = delays_df[["DEP_TIME", "DEP_DELAY", "DEPARTURE_TIME_OF_DAY_DUMMY", \
              "OP_CARRIER_FL_NUM", "AIR_TIME",
              "ARRIVAL_TIME_OF_DAY_DUMMY","WEEKDAY_DUMMY", "AIRLINE_DUMMY", ]]
y = delays_df["ARRIVAL_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 8) (1683475, 1)


In [53]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng, stratify=y)

In [54]:
# Set up the random forest classifier

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=rng, oob_score=True)
clf.fit(X_test, y_test)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=2, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=True, random_state=552, verbose=0,
                       warm_start=False)

In [55]:
print(clf.feature_importances_)

[0.2418698  0.41068455 0.0402863  0.02756126 0.17455907 0.04152022
 0.00070267 0.06281613]


In [56]:
# Stripping out the variables made the model worse.
predictions = clf.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_test, predictions)*100}")

Accuracy Score: 81.85943844759296


In [57]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

Predicted Delays,Delayed,On Time
Actual Delays,Unnamed: 1_level_1,Unnamed: 2_level_1
Delayed,69649,68060
On Time,8288,274872


ARRIVAL DELAYS WITH 3 FEATURES -> RANDOM FOREST (83.76%)

In [59]:
# Strip features down further. Going with the feature importance recommendation.

X = delays_df[["DEP_TIME", "DEP_DELAY", "AIR_TIME"]]
y = delays_df["ARRIVAL_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 3) (1683475, 1)


In [60]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng, stratify=y)

In [61]:
# Set up the random forest classifier

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=rng, oob_score=True)
clf.fit(X_test, y_test)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=2, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=True, random_state=552, verbose=0,
                       warm_start=False)

In [62]:
print(clf.feature_importances_)

[0.19766085 0.64795597 0.15438318]


In [63]:
# Well this is the best model so far.  This may be the best we can get.
predictions = clf.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_test, predictions)*100}")

Accuracy Score: 83.7612178611397


In [64]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

Predicted Delays,Delayed,On Time
Actual Delays,Unnamed: 1_level_1,Unnamed: 2_level_1
Delayed,80028,57681
On Time,10663,272497


DEPARTURE DELAY 10 FEATURES DECISION TREE WITH PCA -> (99.9%)

In [10]:
# Try out best departure and arrival models with pca
# Start with the departure delay model -> 10 features

X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEPARTURE_TIME_OF_DAY_DUMMY", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
               "OP_CARRIER_FL_NUM", "TAXI_OUT", "WHEELS_OFF","WEATHER_DELAY",]]

y = delays_df["DEPARTURE_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 10) (1683475, 1)


In [13]:
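from sklearn.decomposition import PCA
# Keep as many components as needed to explain 95% of the variance.
# Note: the PCA is fit on the full, unscaled X before the train/test split.
pca = PCA(.95)
pca.fit(X)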

PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
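
PCA is sensitive to feature scale, so columns with large numeric ranges (for example flight numbers or clock times) can dominate the retained components. The sketch below illustrates the effect on synthetic data; `X_demo` and `rng_demo` are invented names, and the data is not taken from the flight dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng_demo = np.random.default_rng(0)

# Two independent synthetic features on very different scales
X_demo = np.column_stack([
    rng_demo.normal(0, 1000, size=500),   # large-scale column
    rng_demo.normal(0, 1, size=500),      # small-scale column
])

# Same 95% variance target, without and with standardization
pca_raw = PCA(0.95).fit(X_demo)
pca_scaled = PCA(0.95).fit(StandardScaler().fit_transform(X_demo))

print(pca_raw.n_components_, pca_scaled.n_components_)
```

Without scaling, a single component already covers 95% of the variance because the large-scale column swamps the other; after standardization both components are kept.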

In [18]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng, stratify=y)

In [19]:
X_test_transformed = pca.transform(X_test)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

clf = DecisionTreeClassifier(random_state=rng)

In [20]:
clf.fit(X_test_transformed, y_test)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1767, splitter='best')

In [21]:
predictions = clf.predict(X_test_transformed)

In [25]:
y_transposed = (np.transpose(y_test)).flatten()

In [27]:
# Accuracy jumps after PCA, but the tree was fit on this same transformed test split,
# so this is a resubstitution score rather than a held-out one.
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")

Accuracy Score: 99.99952479275024


In [26]:
# O.O  it finally worked!
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

Predicted Delays,Delayed,On Time
Actual Delays,Unnamed: 1_level_1,Unnamed: 2_level_1
Delayed,138052,0
On Time,2,282815
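
An unrestricted decision tree can memorize the rows it was fit on, so scoring it on those rows says little about unseen data. The sketch below, on synthetic noise labels, fits scaling, PCA and the tree on the training split only and compares the two scores. All names are illustrative and the pipeline is an assumption, not the notebook's original procedure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng_demo = np.random.default_rng(0)

# Synthetic data: 5 random features and labels that are pure noise,
# so no model can genuinely do much better than ~50% on unseen rows
X_demo = rng_demo.normal(size=(1000, 5))
y_demo = rng_demo.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo)

# Scaler, PCA and tree are all fit on the training split only
model = make_pipeline(StandardScaler(), PCA(0.95),
                      DecisionTreeClassifier(random_state=0))
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)   # scored on the rows it was fit on
test_acc = model.score(X_te, y_te)    # scored on held-out rows
print(f"train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")
```

The gap between the two numbers is the part of the score that comes from memorization. Rerunning the notebook's models this way would show how much of the 99.99% survives on held-out flights.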


ARRIVAL DELAYS 3 FEATURES DECISION TREE WITH PCA (97%)

In [37]:
# Try out the PCA with arrival delays using the best previous model (3 features)

X = delays_df[["DEP_TIME", "DEP_DELAY", "AIR_TIME"]]
y = delays_df["ARRIVAL_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 3) (1683475, 1)


In [38]:
from sklearn.decomposition import PCA
pca = PCA(.95)
pca.fit(X)

PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [39]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng, stratify=y)

In [40]:
X_test_transformed = pca.transform(X_test)

In [42]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

clf = DecisionTreeClassifier(random_state=rng)
clf.fit(X_test_transformed, y_test)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1767, splitter='best')

In [43]:
# Seems that PCA helped improve the arrival model as well
predictions = clf.predict(X_test_transformed)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")

Accuracy Score: 97.00738234462504


In [44]:
# While not as good as the departure delay, 97% and a correct looking confusion matrix is great to see here
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

Predicted Delays,Delayed,On Time
Actual Delays,Unnamed: 1_level_1,Unnamed: 2_level_1
Delayed,134122,3587
On Time,9008,274152


ARRIVAL MODEL 18 FEATURES DECISION TREE WITH PCA (100%!)

In [45]:
# Try out the original arrival model with 18 features, as its accuracy was nearly identical to the 3-feature model

X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEP_DELAY", "DEPARTURE_TIME_OF_DAY_DUMMY", \
              "OP_CARRIER_FL_NUM", "TAXI_OUT", "AIR_TIME", "TAXI_IN", \
              "ARRIVAL_TIME_OF_DAY_DUMMY", "CARRIER_DELAY", "DISTANCE", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
       "WEATHER_DELAY", "NAS_DELAY", "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY",]]
y = delays_df["ARRIVAL_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 18) (1683475, 1)


In [46]:
from sklearn.decomposition import PCA
pca = PCA(.95)
pca.fit(X)

PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [47]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng, stratify=y)

In [48]:
X_test_transformed = pca.transform(X_test)

In [49]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

clf = DecisionTreeClassifier(random_state=rng)
clf.fit(X_test_transformed, y_test)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1767, splitter='best')

In [50]:
# Well then.....
predictions = clf.predict(X_test_transformed)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")

Accuracy Score: 100.0


In [51]:
# This actually creeps me out.  100% accuracy is ridiculous to see.
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

Predicted Delays,Delayed,On Time
Actual Delays,Unnamed: 1_level_1,Unnamed: 2_level_1
Delayed,137709,0
On Time,0,283160


DEPARTURE DELAYS 10 FEATURES RANDOM FOREST WITH GRID SEARCH (68.42%)

In [54]:
# Try out the grid search with the random forest classifier as a final test
# Departure Delay with 10 features

X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEPARTURE_TIME_OF_DAY_DUMMY", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
               "OP_CARRIER_FL_NUM", "TAXI_OUT", "WHEELS_OFF","WEATHER_DELAY",]]

y = delays_df["DEPARTURE_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 10) (1683475, 1)


In [56]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng, stratify=y)

In [73]:
# Set up the random forest classifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=rng, oob_score=True)

grid_search = GridSearchCV(clf, {'n_estimators': [1,6,11]}, cv = 2, scoring = "roc_auc", return_train_score=True)

grid_search.fit(X_test, y_test)

GridSearchCV(cv=2, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=2,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=True, random_state=1767,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=None, param_grid={'n_estimators': [1,

In [74]:
# This is about the same as the original model.
clf_model = grid_search.best_estimator_
predictions = clf_model.predict(X_test)
prediction_p = clf_model.predict_proba(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")

Accuracy Score: 68.42319106420287


In [75]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

Predicted Delays,Delayed,On Time
Actual Delays,Unnamed: 1_level_1,Unnamed: 2_level_1
Delayed,5164,132888
On Time,9,282808
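
The grid above varies only `n_estimators` while `max_depth` stays at 2, which may be why the result barely moves. A hedged sketch of a search that also varies tree depth, cross-validates on the training split and scores once on held-out rows follows. The data is synthetic and the grid values are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng_demo = np.random.default_rng(0)

# Synthetic stand-in data: the label depends on the first feature only
X_demo = rng_demo.normal(size=(400, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo)

# Hypothetical grid: tree depth is searched together with the forest size
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 6]}
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=3, scoring="roc_auc")
grid.fit(X_tr, y_tr)                  # cross-validation uses training rows only

test_auc = grid.score(X_te, y_te)     # ROC AUC of the refit best model on held-out rows
print(grid.best_params_, round(test_auc, 3))
```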


ARRIVAL DELAYS 3 FEATURES RANDOM FOREST WITH GRID SEARCH (83.6%)

In [87]:
# Try out the random forest classifier with grid search on the best arrival delay model.

X = delays_df[["DEP_TIME", "DEP_DELAY", "AIR_TIME"]]
y = delays_df["ARRIVAL_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 3) (1683475, 1)


In [88]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng, stratify=y)

In [91]:
# Set up the random forest classifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=rng, oob_score=True)

grid_search = GridSearchCV(clf, {'n_estimators': [1,6,11]}, return_train_score=True)

grid_search.fit(X_test, y_test)

GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=2,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=True, random_state=1767,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=None, param_grid={'n_estimators'

In [90]:
# This is about the same as the original model as well.
clf_model = grid_search.best_estimator_
predictions = clf_model.predict(X_test)
prediction_p = clf_model.predict_proba(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")

Accuracy Score: 83.60653790134222


In [92]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

Predicted Delays,Delayed,On Time
Actual Delays,Unnamed: 1_level_1,Unnamed: 2_level_1
Delayed,79541,58168
On Time,10827,272333


DEPARTURE DELAYS 15+ MINUTES DECISION TREE WITH PCA (99.99%)

In [28]:
# Test if the Decision Tree with PCA can predict departure delays of 15 minutes+

X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEPARTURE_TIME_OF_DAY_DUMMY", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
               "OP_CARRIER_FL_NUM", "TAXI_OUT", "WHEELS_OFF","WEATHER_DELAY",]]

y = delays_df["DEPARTURE_DELAY_OVER_15_MINUTES"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 10) (1683475, 1)


In [29]:
from sklearn.decomposition import PCA
pca = PCA(.95)
pca.fit(X)

PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [30]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng, stratify=y)

In [31]:
X_test_transformed = pca.transform(X_test)

In [32]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

clf = DecisionTreeClassifier(random_state=rng)

In [33]:
clf.fit(X_test_transformed, y_test)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1007, splitter='best')

In [34]:
# Seems like the variables can be interchanged
predictions = clf.predict(X_test_transformed)
prediction_p = clf.predict_proba(X_test_transformed)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")

Accuracy Score: 99.99976239637512


In [35]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

Predicted Delays,0,1
Actual Delays,Unnamed: 1_level_1,Unnamed: 2_level_1
0,416673,0
1,1,4195


ARRIVAL DELAYS 15+ MINUTES DECISION TREE WITH PCA (100%)

In [39]:
# Try it out for arrival delays of 15 minutes+

X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEP_DELAY", "DEPARTURE_TIME_OF_DAY_DUMMY", \
              "OP_CARRIER_FL_NUM", "TAXI_OUT", "AIR_TIME", "TAXI_IN", \
              "ARRIVAL_TIME_OF_DAY_DUMMY", "CARRIER_DELAY", "DISTANCE", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
       "WEATHER_DELAY", "NAS_DELAY", "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY",]]
y = delays_df["ARRIVAL_DELAY_OVER_15_MINUTES"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 18) (1683475, 1)


In [40]:
from sklearn.decomposition import PCA
pca = PCA(.95)
pca.fit(X)

PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [41]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng, stratify=y)

In [42]:
X_test_transformed = pca.transform(X_test)

In [43]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

clf = DecisionTreeClassifier(random_state=rng)
clf.fit(X_test_transformed, y_test)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1007, splitter='best')

In [44]:
# Seems like the variables can be interchanged.  Should work for all delay lengths.
predictions = clf.predict(X_test_transformed)
prediction_p = clf.predict_proba(X_test_transformed)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")

Accuracy Score: 100.0


In [45]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

Predicted Delays,0,1
Actual Delays,Unnamed: 1_level_1,Unnamed: 2_level_1
0,305172,0
1,0,115697


DEPARTURE DELAY 3 CATEGORY SPLIT DECISION TREE (100%)

In [62]:
# A final test predicting on-time flights, small delays (1-15 minutes) and long delays (over 15 minutes)

X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEPARTURE_TIME_OF_DAY_DUMMY", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
               "OP_CARRIER_FL_NUM", "TAXI_OUT", "WHEELS_OFF","WEATHER_DELAY",]]

y = delays_df["DEPARTURE_DELAY_TEST"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 10) (1683475, 1)


In [65]:
y = pd.factorize(delays_df["DEPARTURE_DELAY_TEST"])[0]

In [67]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng, stratify=y)

In [69]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

clf = DecisionTreeClassifier(random_state=rng)
clf.fit(X_test, y_test)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1007, splitter='best')

In [70]:
# Well this is a huge improvement over the original model
predictions = clf.predict(X_test)
prediction_p = clf.predict_proba(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")

Accuracy Score: 100.0
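
A 100% score deserves a sanity check. As a hedged illustration on synthetic data (not the flight data), the sketch below shows that an unpruned decision tree scores 100% on the rows it was fit on even when the labels are pure noise, so only the score on rows unseen during fitting says anything about predictive power. All names here are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

demo_rng = np.random.default_rng(0)
X_noise = demo_rng.normal(size=(1000, 5))
y_noise = demo_rng.integers(0, 2, size=1000)  # labels unrelated to the features

Xn_train, Xn_test, yn_train, yn_test = train_test_split(X_noise, y_noise, random_state=0)

# Tree fit and scored on the same rows: it simply memorizes them
seen_score = DecisionTreeClassifier(random_state=0).fit(Xn_test, yn_test).score(Xn_test, yn_test)

# Tree fit on the training rows and scored on held-out rows
unseen_score = DecisionTreeClassifier(random_state=0).fit(Xn_train, yn_train).score(Xn_test, yn_test)

print(f"Scored on rows seen during fit: {seen_score * 100:.1f}")
print(f"Scored on held-out rows:        {unseen_score * 100:.1f}")
```

The first score is perfect by construction, while the second stays near chance because there is nothing to learn from noise.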


In [71]:
# TODO: find out why a 4th category (code 3) appears when only three were defined
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

Predicted Delays,0,1,2,3
Actual Delays,,,,
0,282817,0,0,0
1,0,68998,0,0
2,0,0,68977,0
3,0,0,0,77
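
On the fourth category: `pd.factorize` assigns one integer code per distinct value it encounters, so a code 3 means the column holds four distinct values rather than three. The sketch below uses made-up labels to show how the codes map back to the original values. The "Unlabeled" entry is a hypothetical stand-in, since the actual fourth value in `DEPARTURE_DELAY_TEST` is not shown in this notebook.

```python
import pandas as pd

# Made-up labels; "Unlabeled" is a hypothetical fourth value
demo_labels = pd.Series(["On Time", "Small Delay", "Long Delay",
                         "On Time", "Unlabeled", "Small Delay"])

# factorize returns the integer codes and the distinct values in order of first appearance
codes, uniques = pd.factorize(demo_labels)
code_to_label = dict(enumerate(uniques))

print(code_to_label)
print(pd.Series(codes).map(code_to_label).value_counts())
```

Running the same two lines on the real column would show which value sits behind code 3 and how many rows carry it.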


ARRIVAL DELAY 3 CATEGORIES DECISION TREE (100%)

In [75]:
# Test out on Arrival Delays with 3 categories

X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEP_DELAY", "DEPARTURE_TIME_OF_DAY_DUMMY",
               "OP_CARRIER_FL_NUM", "TAXI_OUT", "AIR_TIME", "TAXI_IN",
               "ARRIVAL_TIME_OF_DAY_DUMMY", "CARRIER_DELAY", "DISTANCE", "WEEKDAY_DUMMY", "AIRLINE_DUMMY",
               "WEATHER_DELAY", "NAS_DELAY", "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY"]]
y = delays_df["ARRIVAL_DELAY_TEST"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 18) (1683475, 1)


In [76]:
y = pd.factorize(delays_df["ARRIVAL_DELAY_TEST"])[0]

In [77]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng, stratify=y)

In [78]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

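clf = DecisionTreeClassifier(random_state=rng)
# Fit on the training split only: an unpruned tree fit on X_test memorizes the
# labels it is then scored on, which by itself can yield 100% accuracy.
clf.fit(X_train, y_train)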

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1007, splitter='best')

In [79]:
# Again, good to see this can be applied to both arriving and departure delays.
predictions = clf.predict(X_test)
prediction_p = clf.predict_proba(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")

Accuracy Score: 100.0
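
A single train/test split gives one number. Stratified k-fold cross-validation repeats the fit on several splits and shows how stable the score is. The sketch below runs on synthetic three-class data with 18 features. The data, the five folds, and the variable names are assumptions for illustration, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 18 features, 3 classes (assumption, not the flight data)
X_cv, y_cv = make_classification(n_samples=1500, n_features=18, n_informative=6,
                                 n_classes=3, random_state=0)

# Each fold keeps the class proportions of the full dataset
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                            X_cv, y_cv, cv=folds, scoring="accuracy")

print(f"Fold accuracies: {cv_scores.round(3)}")
print(f"Mean: {cv_scores.mean():.3f}, std: {cv_scores.std():.3f}")
```

Every row is used for testing exactly once, and the tree is never scored on rows it was fit on.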


In [80]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

Predicted Delays,0,1,2,3
Actual Delays,,,,
0,283160,0,0,0
1,0,69700,0,0
2,0,0,67933,0
3,0,0,0,76
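
When a tree separates the classes perfectly, it is worth checking which features it relies on. The arrival feature list includes `DEP_DELAY` and the delay-cause columns, which may already encode the arrival delay itself. That is an assumption about the data and is not verified in this notebook. The sketch below uses synthetic data with a hypothetical `delay_proxy` column to show how `feature_importances_` exposes a feature that gives the target away.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

demo_rng = np.random.default_rng(1)
n_rows = 2000
delay_minutes = demo_rng.normal(loc=5, scale=20, size=n_rows)

X_leak = pd.DataFrame({
    "noise_a": demo_rng.normal(size=n_rows),
    "noise_b": demo_rng.normal(size=n_rows),
    "delay_proxy": delay_minutes,  # hypothetical column that encodes the target
})

# 0 = on time (<= 0), 1 = small delay (<= 15), 2 = long delay (> 15)
y_leak = np.digitize(delay_minutes, bins=[0, 15], right=True)

leak_tree = DecisionTreeClassifier(random_state=0).fit(X_leak, y_leak)
importances = pd.Series(leak_tree.feature_importances_,
                        index=X_leak.columns).sort_values(ascending=False)
print(importances)
```

If one or two columns carry almost all of the importance on the real data, refitting without them shows whether the remaining features predict anything on their own.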


In [84]:
Decision_Trees_ML_outcomes = {
    # One entry per tested model; a blank "Model" cell continues the group named above it
    "Model": ["Departure Delay", "", "", "", "", "Departure Delay 15+ Minutes", "Departure Delay 3 Categories",
              "Arrival Delay", "", "", "", "", "", "Arrival Delay 15+ Minutes", "Arrival Delay 3 Categories"],

    "Model Type": ["Decision Tree", "Random Forest", "Random Forest",
                   "Decision Tree with PCA", "Random Forest with Grid Search",
                   "Decision Tree with PCA", "Decision Tree",
                   "Random Forest", "Random Forest", "Random Forest",
                   "Decision Tree with PCA", "Decision Tree with PCA", "Random Forest with Grid Search",
                   "Decision Tree with PCA", "Decision Tree"],

    "Features": [10, 10, 4, 10, 10, 10, 10, 18, 8, 3, 3, 18, 3, 18, 18],

    "Test Data R2": [.727, .68, .6823, .999, .6842, .999, 1, .837, .8186, .8376, .97, 1, .836, 1, 1]
}

Decision_Trees_ML_outcomes_df = pd.DataFrame(Decision_Trees_ML_outcomes)
Decision_Trees_ML_outcomes_df

,Model,Model Type,Features,Test Data R2
0,Departure Delay,Decision Tree,10,0.727
1,,Random Forest,10,0.68
2,,Random Forest,4,0.6823
3,,Decision Tree with PCA,10,0.999
4,,Random Forest with Grid Search,10,0.6842
5,Departure Delay 15+ Minutes,Decision Tree with PCA,10,0.999
6,Departure Delay 3 Categories,Decision Tree,10,1.0
7,Arrival Delay,Random Forest,18,0.837
8,,Random Forest,8,0.8186
9,,Random Forest,3,0.8376
