#### A Test to See How Our Best Models Predict Delays with the Real Data from the Second Quarter of 2019

Unfortunately, arrival delay predictions were nowhere near the quality of their departure delay counterparts.  One possible cause is the extra variance in arrival delays, which is visible when the departure and arrival delay bar graphs are compared side by side.

Like departure delays, arrival delays were tested in 3 scenarios:

1) Arrival Delays

2) Arrival Delays over 15 minutes

3) Arrival Delays broken up into 3 subcategories:

    1) Flight On Time

    2) Flight Delayed less than or equal to 15 minutes

    3) Flight Delayed over 15 minutes.
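
The three scenarios above can be sketched as plain labelling functions. This is only an illustration: the label strings match the `ARRIVAL_DELAY_TEST` values seen later in this notebook, but the function names are hypothetical and the rule that a delay of zero or less counts as on time is an assumption, not something the notebook states.

```python
def categorize_arrival_delay(delay_minutes):
    """Scenario 3: map an arrival delay in minutes to one of three labels.

    Assumption: a delay of zero minutes or less counts as "On Time".
    """
    if delay_minutes <= 0:
        return "On Time"
    if delay_minutes <= 15:
        return "Small Delay"
    return "Long Delay"


def arrival_delay_dummy(delay_minutes):
    """Scenario 1: 1 if the flight arrived late at all, else 0."""
    return int(delay_minutes > 0)


def arrival_delay_over_15_dummy(delay_minutes):
    """Scenario 2: 1 only for delays longer than 15 minutes."""
    return int(categorize_arrival_delay(delay_minutes) == "Long Delay")


for minutes in (-3, 10, 15, 16):
    print(minutes,
          categorize_arrival_delay(minutes),
          arrival_delay_dummy(minutes),
          arrival_delay_over_15_dummy(minutes))
```

Deriving the over-15-minutes dummy from the three-way label mirrors the way the notebook later builds `ARRIVAL_DELAY_OVER_15_MINUTES` from `ARRIVAL_DELAY_TEST == "Long Delay"`.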
    
    
While the model for departure delays over 15 minutes consistently scored over 99%, the best outcome among the 3 arrival scenarios was arrival delays over 15 minutes, which scored just under 71% on the second-quarter data.  (For the classification models the score is classification accuracy, although the summary table labels it R2.)  More work would have to be done with this model to improve its prediction accuracy.

While the linear models yielded good results, the categorical model once again fell short, this time scoring just under 51% (50.88%).  Perhaps it could be concluded that a true/false outcome variable works best for predicting both departure and arrival delays, as those models held up best when their predictions were compared against real data.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import time

import warnings
warnings.simplefilter('ignore')

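# Random seed shared by every train/test split and model below.
# It is redrawn on each kernel run, so scores can differ slightly between runs.
rng = int(np.random.randint(low=1, high=2000))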

In [3]:
# Load the first and second quarter data for comparison.
delays_df = pd.read_csv("Delay_first_quarter1.csv")

In [4]:
delays_df2 = pd.read_csv("Delay_second_quarter2.csv") 

In [22]:
# Fix for arrival delay over 15 minutes dummy variable
delays_df["ARRIVAL_DELAY_OVER_15_MINUTES"] = 1*np.ravel(delays_df["ARRIVAL_DELAY_TEST"] == "Long Delay")
delays_df2["ARRIVAL_DELAY_OVER_15_MINUTES"] = 1*np.ravel(delays_df2["ARRIVAL_DELAY_TEST"] == "Long Delay")

In [17]:
# Best models for the 6 tested dependent variables.
# 4 over 95%, the other two over 89% and 81%.  Seems like it is time to compare with the real data.
Best_model = {
    "Model Variable": ["Departure Delay", "Departure Delay Over 15 Minutes", "Departure Delay 3 Categories", "",
                       "Arrival Delay", "Arrival Delay", "Arrival Delay Over 15 Minutes",
                       "Arrival Delay 3 Categories", ""],

    "Model Type": ["Linear Regression", "Scaled Stochastic Gradient Descent + PCA",
                   "Scaled Stochastic Gradient Boosting", "", "Linear Regression", "Logistic Regression",
                   "Scaled Stochastic Gradient Boosting", "Logistic Regression", ""],

    "Features": [10, 10, 23, "", 16, 18, 26, 32, ""],

    "Test Data R2": [.9431, .99, .8121, "", .9537, .8847, .9977, .8981, ""],

    "Second Quarter R2": [.94, .9977, .5776, "", .9609, .5172, .7079, .5088, ""]
}

Best_model_df = pd.DataFrame(Best_model)
Best_model_df

| | Model Variable | Model Type | Features | Test Data R2 | Second Quarter R2 |
|---|---|---|---|---|---|
| 0 | Departure Delay | Linear Regression | 10 | 0.9431 | 0.94 |
| 1 | Departure Delay Over 15 Minutes | Scaled Stochastic Gradient Descent + PCA | 10 | 0.99 | 0.9977 |
| 2 | Departure Delay 3 Categories | Scaled Stochastic Gradient Boosting | 23 | 0.8121 | 0.5776 |
| 3 | | | | | |
| 4 | Arrival Delay | Linear Regression | 16 | 0.9537 | 0.9609 |
| 5 | Arrival Delay | Logistic Regression | 18 | 0.8847 | 0.5172 |
| 6 | Arrival Delay Over 15 Minutes | Scaled Stochastic Gradient Boosting | 26 | 0.9977 | 0.7079 |
| 7 | Arrival Delay 3 Categories | Logistic Regression | 32 | 0.8981 | 0.5088 |
| 8 | | | | | |


#### BEST ARRIVAL DELAY MODEL -> ARRIVAL DELAY TIMES

#### LINEAR REGRESSION WITH 16 FEATURES (96.09%)

In [7]:
X = delays_df[["DAY", "MONTH", "OP_CARRIER_FL_NUM", "AIR_TIME", "DISTANCE", "WHEELS_OFF", "TAXI_OUT", \
               "DEPARTURE_TIME_OF_DAY_DUMMY", "DEPARTURE_DELAY_DUMMY", "CRS_ARR_TIME", "ARRIVAL_TIME_OF_DAY_DUMMY", \
               "CARRIER_DELAY", "WEATHER_DELAY", "NAS_DELAY", "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY"]]
y = delays_df["ARR_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 16) (1683475, 1)


In [10]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)

X_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train)

X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
y_train_scaled = y_scaler.transform(y_train)
y_test_scaled = y_scaler.transform(y_test)

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train_scaled, y_train_scaled)
end = time.time()
print(f"{end-start} seconds")

4.680936813354492 seconds


In [12]:
from sklearn.metrics import mean_squared_error

predictions = model.predict(X_test_scaled)
MSE = mean_squared_error(y_test_scaled, predictions)
r2 = model.score(X_test_scaled, y_test_scaled)
print(f"MSE: {MSE}, R2: {r2}")

MSE: 0.04566464809330771, R2: 0.9542262008069549
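
The two numbers printed above are closely linked. R2 equals one minus the MSE divided by the variance of the test target, and a standardized target has a variance close to 1, so 1 - 0.0457 lands near the printed R2 of 0.9542. The sketch below checks that identity on a handful of made-up values; the helper names are illustrative and are not part of the notebook.

```python
def mean(values):
    return sum(values) / len(values)


def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def variance(values):
    """Population variance (divides by n, matching the MSE denominator)."""
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    m = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values only, chosen to look like a roughly standardized target.
y_true = [-1.2, -0.4, 0.1, 0.6, 1.5]
y_pred = [-1.0, -0.5, 0.3, 0.4, 1.4]

r2_direct = r2(y_true, y_pred)
r2_from_mse = 1.0 - mse(y_true, y_pred) / variance(y_true)
print(r2_direct, r2_from_mse)
```

The small gap between 1 - MSE and the printed R2 in the notebook is expected, since the test target is scaled with statistics fitted on the training split and its variance is therefore only approximately 1.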


In [13]:
# Test out with the second quarter data
X = delays_df2[["DAY", "MONTH", "OP_CARRIER_FL_NUM", "AIR_TIME", "DISTANCE", "WHEELS_OFF", "TAXI_OUT", \
               "DEPARTURE_TIME_OF_DAY_DUMMY", "DEPARTURE_DELAY_DUMMY", "CRS_ARR_TIME", "ARRIVAL_TIME_OF_DAY_DUMMY", \
               "CARRIER_DELAY", "WEATHER_DELAY", "NAS_DELAY", "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY"]]
y = delays_df2["ARR_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1816125, 16) (1816125, 1)


In [14]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)

X_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train)

X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
y_train_scaled = y_scaler.transform(y_train)
y_test_scaled = y_scaler.transform(y_test)

In [16]:
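# R2 of 96.09 on the first 100,000 second-quarter test rows.  Not bad here again.
# Note: `predictions` still holds the first-quarter predictions from the earlier cell,
# so the MSE below pairs unrelated rows; only the R2 (via model.score) uses second-quarter inputs.
MSE = mean_squared_error(y_test_scaled[:100000], predictions[:100000])
r2 = model.score(X_test_scaled[:100000], y_test_scaled[:100000])
print(f"MSE: {MSE}, R2: {r2}")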

MSE: 1.9665093232425084, R2: 0.9608624480993


#### BEST ARRIVAL DELAY MODEL -> ARRIVAL DELAY TIMES DUMMY

#### LOGISTIC REGRESSION WITH 18 FEATURES 

In [11]:
X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEP_DELAY", "DEPARTURE_TIME_OF_DAY_DUMMY", \
              "OP_CARRIER_FL_NUM", "TAXI_OUT", "AIR_TIME", "TAXI_IN", \
              "ARRIVAL_TIME_OF_DAY_DUMMY", "CARRIER_DELAY", "DISTANCE", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
       "WEATHER_DELAY", "NAS_DELAY", "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY",]]
y = delays_df["ARRIVAL_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 18) (1683475, 1)


In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
classifier = LogisticRegression(penalty='l2')
classifier.fit(X_train, y_train)
end = time.time()
print(f"{end-start} seconds")

72.26071691513062 seconds


In [7]:
from sklearn.metrics import accuracy_score
start = time.time()

predictions = classifier.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_test, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 88.47313534615284
2.8837475776672363 seconds


In [13]:
# Try out using the second quarter data
X = delays_df2[["DAY", "MONTH", "DEP_TIME", "DEP_DELAY", "DEPARTURE_TIME_OF_DAY_DUMMY", \
              "OP_CARRIER_FL_NUM", "TAXI_OUT", "AIR_TIME", "TAXI_IN", \
              "ARRIVAL_TIME_OF_DAY_DUMMY", "CARRIER_DELAY", "DISTANCE", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
       "WEATHER_DELAY", "NAS_DELAY", "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY",]]
y = delays_df2["ARRIVAL_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1816125, 18) (1816125, 1)


In [9]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
classifier = LogisticRegression(penalty='l2')
classifier.fit(X_train, y_train)
end = time.time()
print(f"{end-start} seconds")

103.186598777771 seconds


In [16]:
from sklearn.metrics import accuracy_score
start = time.time()

# predictions = classifier.predict(X_test)
# With the line above commented out, `predictions` still comes from the first-quarter test split.
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_test[:100000], predictions[:100000])*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 57.152
0.6170988082885742 seconds


In [None]:
pd.crosstab(y_transposed[:100000], predictions[:100000], rownames=["Actual Delays"], colnames=["Predicted Delays"])

#### BEST ARRIVAL DELAY OVER 15 MINUTES DUMMY

#### SCALED STOCHASTIC GRADIENT BOOSTING 26 FEATURES (70.79%)

In [23]:
X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEP_DELAY", "DEPARTURE_TIME_OF_DAY_DUMMY", \
              "OP_CARRIER_FL_NUM", "TAXI_OUT", "AIR_TIME", "TAXI_IN", "WHEELS_ON", "WHEELS_OFF",\
              "ARRIVAL_TIME_OF_DAY_DUMMY", "CARRIER_DELAY", "DISTANCE", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
       "WEATHER_DELAY", "NAS_DELAY", "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY", "CANCELLED", "DIVERTED", \
              "EAST_COAST_ORIGIN", "WEST_COAST_ORIGIN", "EAST_COAST_DEST", "WEST_COAST_DEST",]]
y = delays_df["ARRIVAL_DELAY_OVER_15_MINUTES"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 26) (1683475, 1)


In [24]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

1.0739738941192627 seconds


In [25]:
# Try out a min/max scaler for the data
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

1.853917121887207 seconds


In [26]:
# Set up and fit the Gradient Booster

from sklearn.ensemble import GradientBoostingClassifier
start = time.time()

gbc = GradientBoostingClassifier(n_estimators=100, random_state=rng)
gbc.fit(X_train, y_train)
end = time.time()
print(f"{end-start} seconds")

437.2535951137543 seconds


In [27]:
# Score the gradient booster on the first-quarter test split
from sklearn.metrics import accuracy_score
start = time.time()

predictions = gbc.predict(X_test)
prediction_p = gbc.predict_proba(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 99.76833646574113
2.445967197418213 seconds


In [28]:
# Set up with second quarter data
X = delays_df2[["DAY", "MONTH", "DEP_TIME", "DEP_DELAY", "DEPARTURE_TIME_OF_DAY_DUMMY", \
              "OP_CARRIER_FL_NUM", "TAXI_OUT", "AIR_TIME", "TAXI_IN", "WHEELS_ON", "WHEELS_OFF",\
              "ARRIVAL_TIME_OF_DAY_DUMMY", "CARRIER_DELAY", "DISTANCE", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
       "WEATHER_DELAY", "NAS_DELAY", "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY", "CANCELLED", "DIVERTED", \
              "EAST_COAST_ORIGIN", "WEST_COAST_ORIGIN", "EAST_COAST_DEST", "WEST_COAST_DEST",]]

y = delays_df2["ARRIVAL_DELAY_OVER_15_MINUTES"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1816125, 26) (1816125, 1)


In [29]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

1.1600158214569092 seconds


In [30]:
# Try out a min/max scaler for the data
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

1.8129653930664062 seconds


In [32]:
# y_test is from the second quarter while predictions are for the first quarter of data
# 70.787% is not bad.  Not great either.
from sklearn.metrics import accuracy_score
start = time.time()

# predictions = gbc.predict(X_test)
# prediction_p = gbc.predict_proba(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed[:100000], predictions[:100000])*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 70.787
0.021996736526489258 seconds


In [33]:
pd.crosstab(y_transposed[:100000], predictions[:100000], rownames=["Actual Delays"], colnames=["Predicted Delays"])

| Actual Delays (rows) / Predicted Delays (columns) | 0 | 1 |
|---|---|---|
| 0 | 67649 | 13602 |
| 1 | 15611 | 3138 |
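
As the comment in the scoring cell notes, `y_test` is from the second quarter while `predictions` are from the first quarter. A quick check using only the counts in the crosstab above compares the observed agreement with the agreement that would be expected if predictions and labels were statistically independent. This is a hedged sketch for reading the table, not part of the original analysis.

```python
# Counts copied from the crosstab above: rows are actual, columns are predicted.
crosstab = [[67649, 13602],
            [15611, 3138]]

total = sum(sum(row) for row in crosstab)

# Observed agreement is the share of rows on the diagonal.
observed = sum(crosstab[k][k] for k in range(2)) / total

# Marginal shares of each class among actual labels and among predictions.
actual_share = [sum(row) / total for row in crosstab]
predicted_share = [sum(crosstab[i][j] for i in range(2)) / total
                   for j in range(2)]

# Agreement expected if predictions carried no information about the labels.
expected_if_independent = sum(actual_share[k] * predicted_share[k]
                              for k in range(2))

print(f"observed agreement:      {observed:.5f}")
print(f"expected if independent: {expected_if_independent:.5f}")
```

Both figures come out near 0.7079, which suggests the 70.787% mostly reflects how often "no long delay" occurs in both quarters rather than how well the model transfers. Predicting on the second-quarter `X_test` before scoring would give a more informative number.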


#### BEST ARRIVAL DELAY 3 CATEGORIES

#### LOGISTIC REGRESSION 32 FEATURES

In [35]:
X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEP_DELAY", "DEPARTURE_DELAY_DUMMY", "DEPARTURE_TIME_OF_DAY_DUMMY", \
              "OP_CARRIER_FL_NUM", "TAXI_OUT", "AIR_TIME", "TAXI_IN", "WHEELS_ON", "WHEELS_OFF",\
              "ARRIVAL_TIME_OF_DAY_DUMMY", "CARRIER_DELAY", "DISTANCE", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
              "WEATHER_DELAY", "NAS_DELAY", "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY", "CANCELLED", "DIVERTED", \
              "EAST_COAST_ORIGIN", "WEST_COAST_ORIGIN", "EAST_COAST_DEST", "WEST_COAST_DEST", "CRS_ARR_TIME",
              'ORIGIN_LATITUDE', 'ORIGIN_LONGITUDE', 'DEST_LATITUDE', 'DEST_LONGITUDE']]
y = delays_df["ARRIVAL_DELAY_TEST"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 32) (1683475, 1)


In [36]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

1.434030532836914 seconds


In [37]:
# Set up the logistic regression classifier

from sklearn.linear_model import LogisticRegression
start = time.time()

classifier = LogisticRegression(penalty='l2')
classifier.fit(X_train, y_train)
end = time.time()
print(f"{end-start} seconds")

559.5249679088593 seconds


In [38]:
from sklearn.metrics import accuracy_score
start = time.time()

predictions = classifier.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_test, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 89.83365370222089
1.5060200691223145 seconds


In [43]:
# Set up with the second quarter dataset

X = delays_df2[["DAY", "MONTH", "DEP_TIME", "DEP_DELAY", "DEPARTURE_DELAY_DUMMY", "DEPARTURE_TIME_OF_DAY_DUMMY", \
              "OP_CARRIER_FL_NUM", "TAXI_OUT", "AIR_TIME", "TAXI_IN", "WHEELS_ON", "WHEELS_OFF",\
              "ARRIVAL_TIME_OF_DAY_DUMMY", "CARRIER_DELAY", "DISTANCE", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
              "WEATHER_DELAY", "NAS_DELAY", "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY", "CANCELLED", "DIVERTED", \
              "EAST_COAST_ORIGIN", "WEST_COAST_ORIGIN", "EAST_COAST_DEST", "WEST_COAST_DEST", "CRS_ARR_TIME",
              'ORIGIN_LATITUDE', 'ORIGIN_LONGITUDE', 'DEST_LATITUDE', 'DEST_LONGITUDE']]
y = delays_df2["ARRIVAL_DELAY_TEST"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1816125, 32) (1816125, 1)


In [44]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

1.9675931930541992 seconds


In [45]:
# It seems that the categorical predictions may not work out here.
from sklearn.metrics import accuracy_score
start = time.time()

# predictions = classifier.predict(X_test)
# With the line above commented out, `predictions` still comes from the first-quarter test split.
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_test[:100000], predictions[:100000])*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 50.88399999999999
0.2689526081085205 seconds


In [42]:
pd.crosstab(y_transposed[:100000], predictions[:100000], rownames=["Actual Delays"], colnames=["Predicted Delays"])

| Actual Delays (rows) / Predicted Delays (columns) | Long Delay | On Time | Small Delay |
|---|---|---|---|
| Long Delay | 69830 | 9 | 43 |
| On Time | 0 | 270934 | 12156 |
| Small Delay | 2539 | 28040 | 37318 |
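
The crosstab above can be read class by class. The sketch below uses only the counts shown and derives the overall accuracy plus the per-class recall (the share of each actual class that was predicted correctly). The variable names are illustrative.

```python
labels = ["Long Delay", "On Time", "Small Delay"]

# Counts copied from the crosstab above: rows are actual, columns are predicted.
crosstab = [[69830, 9, 43],
            [0, 270934, 12156],
            [2539, 28040, 37318]]

total = sum(sum(row) for row in crosstab)
correct = sum(crosstab[k][k] for k in range(len(labels)))
accuracy = correct / total

# Recall per class: correct predictions divided by the actual row total.
recall = {label: crosstab[k][k] / sum(crosstab[k])
          for k, label in enumerate(labels)}

print(f"rows: {total}, accuracy: {accuracy:.4f}")
for label in labels:
    print(f"{label:12s} recall: {recall[label]:.4f}")
```

The table holds 420,869 rows and its diagonal gives an accuracy of about 89.83%, which matches the first-quarter test score printed earlier rather than the 50.88% second-quarter figure, so this crosstab appears to describe the first-quarter test split. Within it, "Small Delay" is the weak class, with recall of roughly 55%, while "Long Delay" and "On Time" are both above 95%.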
