# Stochastic Gradient Descent Classification

`SGDClassifier` provided perhaps the most insight of any of the machine learning techniques tested on the flight delay data.  It was a great tool to stumble upon: it works well with scaling and PCA, and a binary model fits in under 10 seconds.

The theory behind SGD is fairly involved.  Despite the similar name, it is a different kind of tool from Stochastic Gradient Boosting.  The first two words mean the same thing, but "descent" refers to how the model is trained: the weights of a single linear model are nudged step by step down the gradient of the loss function, using a random sample at each step, so the loss shrinks over the iterations.  Gradient boosting instead builds an ensemble, adding a new tree at each stage to correct the errors of the previous ones.
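
As a rough illustration of the descent idea, the sketch below runs plain stochastic gradient descent on a logistic loss using NumPy only. The toy data, the learning rate `lr`, the epoch count and the helper names are all assumptions made for this example, and it is not meant to mirror how scikit-learn's `SGDClassifier` is implemented internally.

```python
import numpy as np

def sgd_logistic(X, y, lr=0.1, epochs=5, seed=0):
    """Fit logistic regression weights with one-sample-at-a-time SGD."""
    order_rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in order_rng.permutation(len(y)):   # visit samples in random order
            z = X[i] @ w + b
            p = 1.0 / (1.0 + np.exp(-z))           # predicted probability of class 1
            grad = p - y[i]                        # derivative of the log loss w.r.t. z
            w -= lr * grad * X[i]                  # step against the gradient
            b -= lr * grad
    return w, b

def mean_log_loss(X, y, w, b):
    """Average logistic loss of the linear model (w, b) on the data."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Hypothetical toy data: two features, label depends on the sign of their sum.
data_rng = np.random.default_rng(42)
X_demo = data_rng.normal(size=(500, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(float)

loss_before = mean_log_loss(X_demo, y_demo, np.zeros(2), 0.0)
w_fit, b_fit = sgd_logistic(X_demo, y_demo)
loss_after = mean_log_loss(X_demo, y_demo, w_fit, b_fit)
train_accuracy = float(np.mean((X_demo @ w_fit + b_fit > 0) == (y_demo == 1)))

print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
print(f"training accuracy: {train_accuracy:.3f}")
```

Each update only looks at one sample, which is why this style of training stays fast on a dataset with well over a million rows.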

The most concrete conclusion here is that we have found the best model for predicting whether a flight will be delayed by 15 or more minutes.  Using 10 features (all of them, apart from wheels off, fairly reliable to track), this model consistently produced an accuracy of about 99% across 4 different machine learning techniques (Decision Trees, Stochastic Gradient Boosting, `SGDClassifier` and plain logistic regression).  Unfortunately, neither of the other two departure delay variables (the plain departure delay flag and departure delays split into 3 categories) could match these results.  Good models for them turned out to rely on information that would be impossible to get consistently at prediction time (for example, arrival time and the arrival delay dummy were significant for the 3 category departure delay model).  In conclusion, departure delays of 15+ minutes will be the final variable used out of the 3 for departure delay predictions.

Unfortunately, as usual, the models used for arrival delays were not up to the same standard.  The best score produced was 83.73%, for arrival delays over 15 minutes with 33 features.  The final push for this project will be to find a model that breaks the 90% ceiling for arrival delays.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import time

import warnings
warnings.simplefilter('ignore')

# Random seed shared by the train/test splits and the models
rng = int(np.random.randint(low=1, high=2000))

# xgboost has installation problems similar to tensorflow, so it will not be utilized. Same with parfit.
# from xgboost import XGBClassifier
# import parfit.parfit as pf

In [2]:
# Read in first quarter dataset
delays_df = pd.read_csv("Delay_first_quarter1.csv")

In [3]:
# Do some additional cleaning
delays_df = delays_df.fillna(0)

#### DEPARTURE DELAY SGDCLASSIFIER SCALED PCA WITH GRID SEARCH (67.99%)

In [40]:
# Try out SGD classifier on best departure delay model.

X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEPARTURE_TIME_OF_DAY_DUMMY", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
               "OP_CARRIER_FL_NUM", "TAXI_OUT", "WHEELS_OFF","WEATHER_DELAY",]]

y = delays_df["DEPARTURE_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 10) (1683475, 1)


In [41]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)

In [36]:
# Try out PCA if needed
from sklearn.decomposition import PCA
start = time.time()

pca = PCA(.95)
pca.fit(X_train)  # fit on the training split only so the test set stays unseen
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

2.32086181640625 seconds


In [42]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

0.9809563159942627 seconds
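
The scaling and PCA cells can also be chained with the classifier so that every preprocessing step is fitted on the training split only and then reused unchanged on the test split. The sketch below shows one way to do that with scikit-learn's `make_pipeline`. It runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo` and `pipe` are assumptions for the example, not part of the flight delay notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the flight data (hypothetical, 10 features).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Scale, keep enough components for 95% of the variance, then classify.
pipe = make_pipeline(
    MinMaxScaler(),
    PCA(n_components=0.95),
    SGDClassifier(random_state=0),
)
pipe.fit(X_tr, y_tr)                      # all steps are fitted on the training data only

demo_accuracy = pipe.score(X_te, y_te)    # the test data is only transformed, never fitted
n_components_kept = pipe.named_steps["pca"].n_components_
print(f"accuracy: {demo_accuracy:.3f}, PCA components kept: {n_components_kept}")
```

With a pipeline the order of the steps is fixed in one place, which helps when cells are rerun out of order.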


In [38]:
# SGDClassifier with Grid Search
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
start = time.time()

sgd = SGDClassifier(random_state=rng)

grid = {
    'alpha': [1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3], # regularization strength
    'max_iter': [10], # number of epochs
    'loss': ['log_loss'], # logistic regression ('log' in scikit-learn versions before 1.1)
    'penalty': ['l2'],
    'n_jobs': [-1]
}

grid_search = GridSearchCV(sgd, grid)
grid_search.fit(X_train, y_train)
end = time.time()
print(f"{end-start} seconds")

201.20335936546326 seconds
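
In scikit-learn, `alpha` is the constant that multiplies the regularization term of `SGDClassifier` (it also feeds into the default "optimal" learning rate schedule), so the grid is really a search over regularization strength. The sketch below shows how the per-value cross-validation scores could be inspected after a search. It uses a small synthetic dataset and a shortened `alpha` list, both assumptions made to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset (hypothetical stand-in for the flight data).
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)

param_grid = {
    "alpha": [1e-4, 1e-2, 1e0],   # regularization strength
    "penalty": ["l2"],
}

# n_jobs=-1 could be passed to GridSearchCV here to spread the fits across cores.
search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=3)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
cv_table = [
    (params["alpha"], score)
    for params, score in zip(search.cv_results_["params"],
                             search.cv_results_["mean_test_score"])
]
for alpha, score in cv_table:
    print(f"alpha={alpha:g}  mean CV accuracy={score:.3f}")
print(f"best alpha: {best_alpha:g}")
```

If every `alpha` gives almost the same score, as happened with the 10 feature departure delay model, the search is unlikely to be worth its extra runtime.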


In [39]:
# Yuck.  Perhaps try without the grid search to simplify things

from sklearn.metrics import accuracy_score
start = time.time()

model = grid_search.best_estimator_
predictions = model.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 67.21093737006052
3.1169469356536865 seconds


In [26]:
# Try out PCA if needed
from sklearn.decomposition import PCA
start = time.time()

pca = PCA(.95)
pca.fit(X_train)  # fit on the training split only so the test set stays unseen
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

2.414828300476074 seconds


In [27]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

0.4174480438232422 seconds


In [43]:
# SGDClassifier -> Runs surprisingly fast
from sklearn.linear_model import SGDClassifier
start = time.time()

sgd = SGDClassifier(random_state=rng)
sgd.fit(X_train, y_train)
end = time.time()
print(f"{end-start} seconds")

5.4937591552734375 seconds


In [44]:
# Both models produced the same score (67%)

from sklearn.metrics import accuracy_score
start = time.time()

predictions = sgd.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 67.21212538818492
3.24591326713562 seconds


In [45]:
# Good at predicting on time. Whether that is correct or not is a whole other story: nearly every flight is predicted as on time.
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

| Actual Delays (rows), Predicted Delays (columns) | Delayed | On Time |
|---|---|---|
| Delayed | 5 | 137994 |
| On Time | 0 | 282870 |


#### DEPARTURE DELAY 15+ MINUTES SGDCLASSIFIER PCA SCALED (99%)

In [53]:
# Try out SGD classifier on best departure delay over 15 minutes model.

X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEPARTURE_TIME_OF_DAY_DUMMY", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
               "OP_CARRIER_FL_NUM", "TAXI_OUT", "WHEELS_OFF","WEATHER_DELAY",]]

y = delays_df["DEPARTURE_DELAY_OVER_15_MINUTES"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 10) (1683475, 1)


In [54]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)

In [48]:
# Try out PCA if needed
from sklearn.decomposition import PCA
start = time.time()

pca = PCA(.95)
pca.fit(X_train)  # fit on the training split only so the test set stays unseen
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

2.4059107303619385 seconds


In [55]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

0.9359583854675293 seconds


In [56]:
# SGDClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
start = time.time()

sgd = SGDClassifier(random_state=rng)
sgd.fit(X_train, y_train)
end = time.time()
print(f"{end-start} seconds")

2.404978036880493 seconds


In [57]:
# This seems to be the go-to model for departure delays over 15 minutes, as it is the fourth model with
# around 99% accuracy.

from sklearn.metrics import accuracy_score
start = time.time()

predictions = sgd.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 99.00158956825045
0.15599703788757324 seconds


In [58]:
# A bit odd: it predicts every flight as on time or delayed < 15 minutes, so the 99% accuracy equals the share of
# such flights in the test set (416667 of 420869).
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

| Actual Delays (rows), Predicted Delays (columns) | 0 |
|---|---|
| 0 | 416667 |
| 1 | 4202 |
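
When a model predicts a single class, accuracy alone can be misleading. The short calculation below compares the accuracy with the score of always guessing the most common class, and with balanced accuracy (the mean of the per-class recalls). The counts are copied from the crosstab above and the variable names are illustrative only.

```python
# Counts copied from the crosstab above: the model only ever predicted class 0.
pred0_actual0 = 416667   # on time or delayed < 15 minutes, predicted as class 0
pred0_actual1 = 4202     # delayed 15+ minutes, but predicted as class 0
pred1_actual0 = 0        # class 1 was never predicted
pred1_actual1 = 0

total = pred0_actual0 + pred0_actual1 + pred1_actual0 + pred1_actual1
accuracy = (pred0_actual0 + pred1_actual1) / total

# Accuracy of always guessing the most common class.
n_actual0 = pred0_actual0 + pred1_actual0
n_actual1 = pred0_actual1 + pred1_actual1
majority_baseline = max(n_actual0, n_actual1) / total

# Recall per class and their mean (balanced accuracy).
recall_class0 = pred0_actual0 / n_actual0
recall_class1 = pred1_actual1 / n_actual1
balanced_accuracy = (recall_class0 + recall_class1) / 2

print(f"accuracy:          {accuracy:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
print(f"recall (class 1):  {recall_class1:.4f}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
```

Here the accuracy and the majority baseline are the same number, so comparing the two is a quick check worth running before reading much into a high score.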


#### DEPARTURE DELAY 4 FEATURES SCALED PCA SGDCLASSIFIER (67%)

In [67]:
# Try out Departure Delay again.

X = delays_df[["DEP_TIME", "DEPARTURE_TIME_OF_DAY_DUMMY", "WEEKDAY_DUMMY", "AIRLINE_DUMMY"]]

y = delays_df["DEPARTURE_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 4) (1683475, 1)


In [68]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)

In [69]:
# Try out PCA if needed
from sklearn.decomposition import PCA
start = time.time()

pca = PCA(.95)
pca.fit(X_train)  # fit on the training split only so the test set stays unseen
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

0.9819846153259277 seconds


In [70]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

0.041997432708740234 seconds


In [71]:
# SGDClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
start = time.time()

sgd = SGDClassifier(random_state=rng)
sgd.fit(X_train, y_train)
end = time.time()
print(f"{end-start} seconds")

4.854905843734741 seconds


In [72]:
# Perhaps it may be time to give up on departure delays and just use delays of 15+ minutes.

from sklearn.metrics import accuracy_score
start = time.time()

predictions = sgd.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 67.21093737006052
3.016428232192993 seconds


#### USE SGDCLASSIFIER WITH GRID SEARCH TO FIND BEST MODEL FOR DEPARTURE DELAY

##### 47 FEATURES USING THE BEST MODEL SCORE (100%)

In [68]:
# Use a grid search and find the best overall score with Departure delay

y = delays_df["DEPARTURE_DELAY"].values.reshape(-1, 1)

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
X = delays_df.select_dtypes(include=numerics)

print(X.shape, y.shape)

(1683475, 47) (1683475, 1)


In [69]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

1.9019725322723389 seconds


In [70]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

3.181917190551758 seconds


In [7]:
# SGDClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
start = time.time()

sgd = SGDClassifier(random_state=rng)
sgd.fit(X_train, y_train)
end = time.time()
print(f"{end-start} seconds")

6.4533913135528564 seconds


In [8]:
# I am not sure I believe this.  Odd. 

from sklearn.metrics import accuracy_score
start = time.time()

predictions = sgd.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 100.0
3.5334360599517822 seconds


In [9]:
# 47 features seemed to improve the model quite a bit.
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

| Actual Delays (rows), Predicted Delays (columns) | Delayed | On Time |
|---|---|---|
| Delayed | 137830 | 0 |
| On Time | 0 | 283039 |


In [71]:
# SGDClassifier with Grid Search
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
start = time.time()

sgd = SGDClassifier(random_state=rng)

grid = {
    'alpha': [1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3], # regularization strength
    'max_iter': [10], # number of epochs
    'loss': ['log_loss'], # logistic regression ('log' in scikit-learn versions before 1.1)
    'penalty': ['l2'],
    'n_jobs': [-1]
}

grid_search = GridSearchCV(sgd, grid)
grid_search.fit(X_train, y_train)
end = time.time()
print(f"{end-start} seconds")

251.5979974269867 seconds


In [72]:
# The best score utilizing the 47 features here is 100%.  Eureka!

from sklearn.metrics import accuracy_score
start = time.time()

model = grid_search.best_estimator_
predictions = model.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 100.0
3.260974884033203 seconds
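
A perfect score often hints that one of the features encodes the target directly. One rough way to look for such columns is to check how strongly each feature correlates with the target. The sketch below does this with NumPy on made-up data: the column names `TAXI_OUT_DEMO` and `DELAY_FLAG_COPY`, the helper `flag_leaky_columns` and the 0.95 threshold are all assumptions for illustration.

```python
import numpy as np

def flag_leaky_columns(columns, target, threshold=0.95):
    """Return names of columns whose |Pearson correlation| with target is >= threshold."""
    flagged = []
    for name, values in columns.items():
        if np.std(values) == 0:            # constant column: correlation is undefined
            continue
        r = np.corrcoef(values, target)[0, 1]
        if abs(r) >= threshold:
            flagged.append(name)
    return flagged

demo_rng = np.random.default_rng(0)
n_rows = 1000

# Hypothetical target: 1 if the flight left late, 0 otherwise.
target = (demo_rng.normal(loc=5, scale=20, size=n_rows) > 0).astype(float)

columns = {
    "TAXI_OUT_DEMO": demo_rng.normal(loc=15, scale=5, size=n_rows),  # unrelated to the target
    "DELAY_FLAG_COPY": target.copy(),                                # duplicates the target
}

flagged = flag_leaky_columns(columns, target)
print("possible leakage:", flagged)
```

A correlation screen like this only catches direct, roughly linear copies of the target, so it complements rather than replaces a manual review of what each column means.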


#### DEPARTURE DELAY RANDOM FOREST FEATURE IMPORTANCE

In [7]:
# One way to narrow down this model is feature_importances_, but a linear model such as the grid search's
# SGDClassifier does not expose it. The random forest classifier does, so that will be tested here.

y = delays_df["DEPARTURE_DELAY"].values.reshape(-1, 1)

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
X = delays_df.select_dtypes(include=numerics)

print(X.shape, y.shape)

(1683475, 47) (1683475, 1)


In [8]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

2.068992853164673 seconds


In [9]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

3.255887985229492 seconds


In [10]:
# Set up the random forest classifier

from sklearn.ensemble import RandomForestClassifier
start = time.time()

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=rng, oob_score=True)
clf.fit(X_train, y_train) 
end = time.time()
print(f"{end-start} seconds")

185.31030941009521 seconds


In [12]:
importance_df = pd.DataFrame(clf.feature_importances_, X.columns, columns=["Importance"]).reset_index()
importance_df

| | index | Importance |
|---|---|---|
| 0 | DAY | 2.235823e-07 |
| 1 | MONTH | 0.0 |
| 2 | OP_CARRIER_FL_NUM | 2.988311e-05 |
| 3 | ORIGIN_LATITUDE | 2.948623e-06 |
| 4 | ORIGIN_LONGITUDE | 2.147621e-06 |
| 5 | EAST_COAST_ORIGIN | 6.954277e-07 |
| 6 | WEST_COAST_ORIGIN | 2.937212e-07 |
| 7 | CRS_DEP_TIME | 0.006631807 |
| 8 | DEP_TIME | 0.02589541 |
| 9 | DEPARTURE_TIME_OF_DAY_DUMMY | 0.0007261635 |
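
As an alternative to the random forest importances, a fitted linear model exposes its weights through `coef_`, and on features scaled to a common range the absolute weights give a rough ranking. The sketch below shows the idea on synthetic data. The feature names are placeholders, and coefficient size is only an approximate stand-in for importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with 2 informative and 3 noise features (hypothetical).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # placeholder names

# Scale first so the coefficient magnitudes are comparable across features.
X_scaled = MinMaxScaler().fit_transform(X_demo)
linear_model = SGDClassifier(random_state=0).fit(X_scaled, y_demo)

# coef_ has shape (1, n_features) for a binary problem.
abs_coef = np.abs(linear_model.coef_[0])
ranking = sorted(zip(feature_names, abs_coef), key=lambda pair: pair[1], reverse=True)

for name, weight in ranking:
    print(f"{name}: {weight:.3f}")
```

After a grid search, the same attribute would be read from `grid_search.best_estimator_.coef_`.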


#### DEPARTURE DELAY MODEL TRIMDOWN 47 TO 34 FEATURES

In [42]:
# Drop some of the columns that conflict with departure delays and try again.

y = delays_df["DEPARTURE_DELAY"].values.reshape(-1, 1)

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
X = delays_df.select_dtypes(include=numerics)
X = X.drop(['DEPARTURE_TIME_OF_DAY_DUMMY', 'DEP_DELAY', 'DEPARTURE_DELAY_OVER_15_MINUTES', 'DEPARTURE_DELAY_OVER_30_MINUTES', \
            'DEPARTURE_DELAY_OVER_45_MINUTES', 'DEPARTURE_DELAY_OVER_60_MINUTES', \
            'ARR_DELAY', 'ARRIVAL_DELAY_OVER_15_MINUTES', 'ARRIVAL_DELAY_OVER_30_MINUTES', \
            'ARRIVAL_DELAY_OVER_45_MINUTES', 'ARRIVAL_DELAY_OVER_60_MINUTES', \
            'ARRIVAL_DELAY_OVER_60_MINUTES.1', 'ARRIVAL_TIME_OF_DAY_DUMMY',], axis = 1)

print(X.shape, y.shape)

(1683475, 34) (1683475, 1)


In [33]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

1.638005018234253 seconds


In [34]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

2.252122402191162 seconds


In [35]:
# SGDClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
start = time.time()

sgd = SGDClassifier(random_state=rng)
sgd.fit(X_train, y_train)
end = time.time()
print(f"{end-start} seconds")

5.610364675521851 seconds


In [36]:
# This is another good sign.  Look back at the feature importance to try to get a good model with fewer features.
# 34 features are too many to keep up to date consistently when the model is used for predictions.

from sklearn.metrics import accuracy_score
start = time.time()

predictions = sgd.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 100.0
3.6655337810516357 seconds


In [37]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

| Actual Delays (rows), Predicted Delays (columns) | Delayed | On Time |
|---|---|---|
| Delayed | 137791 | 0 |
| On Time | 0 | 283078 |


#### DEPARTURE DELAY MODEL TRIMDOWN 34 TO 30 FEATURES (71.475%)

In [24]:
# Continue to drop features that both conflict with departure delays and have a low feature importance.

y = delays_df["DEPARTURE_DELAY"].values.reshape(-1, 1)

X.columns

Index(['DAY', 'MONTH', 'OP_CARRIER_FL_NUM', 'ORIGIN_LATITUDE',
       'ORIGIN_LONGITUDE', 'EAST_COAST_ORIGIN', 'WEST_COAST_ORIGIN',
       'CRS_DEP_TIME', 'DEP_TIME', 'TAXI_OUT', 'WHEELS_OFF', 'AIR_TIME',
       'CRS_ELAPSED_TIME', 'ACTUAL_ELAPSED_TIME', 'DISTANCE', 'WHEELS_ON',
       'TAXI_IN', 'DEST_LATITUDE', 'DEST_LONGITUDE', 'EAST_COAST_DEST',
       'WEST_COAST_DEST', 'CRS_ARR_TIME', 'ARR_TIME', 'CANCELLED', 'DIVERTED',
       'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY',
       'LATE_AIRCRAFT_DELAY', 'WEEKDAY_DUMMY', 'AIRLINE_DUMMY',
       'DEPARTURE_DELAY_DUMMY', 'ARRIVAL_DELAY_DUMMY'],
      dtype='object')

In [38]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

1.5662486553192139 seconds


In [39]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

2.477146863937378 seconds


In [40]:
# Set up the random forest classifier

from sklearn.ensemble import RandomForestClassifier
start = time.time()

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=rng, oob_score=True)
clf.fit(X_train, y_train) 
end = time.time()
print(f"{end-start} seconds")

204.14404678344727 seconds


In [41]:
importance_df = pd.DataFrame(clf.feature_importances_, X.columns, columns=[["Importance"]]).reset_index()
importance_df

| | index | Importance |
|---|---|---|
| 0 | DAY | 0.003217066 |
| 1 | MONTH | 0.0 |
| 2 | OP_CARRIER_FL_NUM | 0.0001011075 |
| 3 | ORIGIN_LATITUDE | 0.0003044551 |
| 4 | ORIGIN_LONGITUDE | 5.325754e-06 |
| 5 | EAST_COAST_ORIGIN | 0.0 |
| 6 | WEST_COAST_ORIGIN | 0.0 |
| 7 | CRS_DEP_TIME | 0.02807475 |
| 8 | DEP_TIME | 0.04215495 |
| 9 | TAXI_OUT | 0.008555668 |


In [43]:
X = X.drop(['DEP_TIME', 'ARR_TIME','DEPARTURE_DELAY_DUMMY','ARRIVAL_DELAY_DUMMY'], axis = 1)

In [44]:
print(X.shape, y.shape)

(1683475, 30) (1683475, 1)


In [45]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

1.423994779586792 seconds


In [49]:
# Try out PCA if needed
from sklearn.decomposition import PCA
start = time.time()

pca = PCA(.95)
pca.fit(X_train)  # fit on the training split only so the test set stays unseen
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

7.795314311981201 seconds


In [53]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

0.38097596168518066 seconds


In [54]:
# SGDClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
start = time.time()

sgd = SGDClassifier(random_state=rng)
sgd.fit(X_train, y_train)
end = time.time()
print(f"{end-start} seconds")

4.772235870361328 seconds


In [55]:
# Seems like it may not work as departure delays seem to heavily rely on arrival time and arrival delays.
# This could work with preexisting flights but be irrelevant for new flight paths.

from sklearn.metrics import accuracy_score
start = time.time()

predictions = sgd.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 67.26035892403574
2.896061897277832 seconds


In [31]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

| Actual Delays (rows), Predicted Delays (columns) | Delayed | On Time |
|---|---|---|
| Delayed | 23130 | 114661 |
| On Time | 65 | 283013 |


#### DEPARTURE DELAY 3 CATEGORIES SGDCLASSIFIER SCALED PCA (67.26%)

In [56]:
# Try the 3 categorical y for Departure Delays

X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEPARTURE_TIME_OF_DAY_DUMMY", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
               "OP_CARRIER_FL_NUM", "TAXI_OUT", "WHEELS_OFF","WEATHER_DELAY",]]

y = delays_df["DEPARTURE_DELAY_TEST"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 10) (1683475, 1)


In [57]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

0.7909786701202393 seconds


In [58]:
# Try out PCA if needed
from sklearn.decomposition import PCA
start = time.time()

pca = PCA(.95)
pca.fit(X_train)  # fit on the training split only so the test set stays unseen
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

2.3417959213256836 seconds


In [59]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

0.3640408515930176 seconds


In [60]:
# SGDClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
start = time.time()

sgd = SGDClassifier(random_state=rng)
sgd.fit(X_train, y_train)
end = time.time()
print(f"{end-start} seconds")

8.164777755737305 seconds


In [61]:
# Perhaps trying this out with the 34 feature model may produce a better if unrealistic result.

from sklearn.metrics import accuracy_score
start = time.time()

predictions = sgd.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 67.26035892403574
1.5279765129089355 seconds


In [62]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

| Actual Delays (rows), Predicted Delays (columns) | On Time |
|---|---|
| Long Delay | 69140 |
| On Time | 283078 |
| Small Delay | 68651 |


#### DEPARTURE DELAY 3 CATEGORIES 34 FEATURE SGDCLASSIFIER SCALED PCA

In [75]:
# Try out the 34 feature model

y = delays_df["DEPARTURE_DELAY_TEST"].values.reshape(-1, 1)

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
X = delays_df.select_dtypes(include=numerics)
X = X.drop(['DEPARTURE_TIME_OF_DAY_DUMMY', 'DEP_DELAY', 'DEPARTURE_DELAY_OVER_15_MINUTES', 'DEPARTURE_DELAY_OVER_30_MINUTES', \
            'DEPARTURE_DELAY_OVER_45_MINUTES', 'DEPARTURE_DELAY_OVER_60_MINUTES', \
            'ARR_DELAY', 'ARRIVAL_DELAY_OVER_15_MINUTES', 'ARRIVAL_DELAY_OVER_30_MINUTES', \
            'ARRIVAL_DELAY_OVER_45_MINUTES', 'ARRIVAL_DELAY_OVER_60_MINUTES', \
            'ARRIVAL_DELAY_OVER_60_MINUTES.1', 'ARRIVAL_TIME_OF_DAY_DUMMY',], axis = 1)

print(X.shape, y.shape)

(1683475, 34) (1683475, 1)


In [76]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

1.5199692249298096 seconds


In [77]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

2.288613796234131 seconds


In [78]:
# SGDClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
start = time.time()

sgd = SGDClassifier(random_state=rng)
sgd.fit(X_train, y_train)
end = time.time()
print(f"{end-start} seconds")

11.069297313690186 seconds


In [79]:
# The 34 feature model does produce a better, if unrealistic, result.

from sklearn.metrics import accuracy_score
start = time.time()

predictions = sgd.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 91.73234426864416
1.4033496379852295 seconds


In [80]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

| Actual Delays (rows), Predicted Delays (columns) | Long Delay | On Time | Small Delay |
|---|---|---|---|
| Long Delay | 66093 | 0 | 3047 |
| On Time | 0 | 283078 | 0 |
| Small Delay | 31749 | 0 | 36902 |
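
With three classes, the overall accuracy hides which category is being confused with which. The calculation below derives per-class precision and recall from the counts in the crosstab above, using NumPy only. The matrix is typed in by hand from that output, and the variable names are illustrative.

```python
import numpy as np

labels = ["Long Delay", "On Time", "Small Delay"]

# Rows are actual classes, columns are predicted classes (counts from the crosstab above).
cm = np.array([
    [66093,      0,  3047],
    [    0, 283078,     0],
    [31749,      0, 36902],
])

accuracy = np.trace(cm) / cm.sum()           # correct predictions over all predictions
recall = np.diag(cm) / cm.sum(axis=1)        # share of each actual class that was found
precision = np.diag(cm) / cm.sum(axis=0)     # share of each predicted class that was right

print(f"accuracy: {accuracy * 100:.2f}%")
for name, p, r in zip(labels, precision, recall):
    print(f"{name:<12} precision={p:.3f} recall={r:.3f}")
```

On these counts the "On Time" class is separated perfectly, while a little under half of the actual small delays are labelled as long delays, which is where most of the remaining error sits.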


In [81]:
# Trying this again by dropping variables that conflict with y
X = X.drop(['DEP_TIME', 'ARR_TIME','DEPARTURE_DELAY_DUMMY','ARRIVAL_DELAY_DUMMY'], axis = 1)

In [82]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

1.4039931297302246 seconds


In [83]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

2.0187394618988037 seconds


In [84]:
# SGDClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
start = time.time()

sgd = SGDClassifier(random_state=rng)
sgd.fit(X_train, y_train)
end = time.time()
print(f"{end-start} seconds")

13.720370769500732 seconds


In [85]:
# Further proof that using this model with 3 categories is highly dependent on variables that are unrealistic to have for
# predictions.

from sklearn.metrics import accuracy_score
start = time.time()

predictions = sgd.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 70.40409248483495
1.7947876453399658 seconds


In [86]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

| Actual Delays (rows), Predicted Delays (columns) | Long Delay | On Time |
|---|---|---|
| Long Delay | 13269 | 55871 |
| On Time | 38 | 283040 |
| Small Delay | 4 | 68647 |


#### ARRIVAL DELAY 26 FEATURES SGDCLASSIFIER SCALED (76.39%)

In [102]:
# Arrival Delay 26 features using SGDclassification

X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEP_DELAY", "DEPARTURE_TIME_OF_DAY_DUMMY", \
              "OP_CARRIER_FL_NUM", "TAXI_OUT", "AIR_TIME", "TAXI_IN", "WHEELS_ON", "WHEELS_OFF",\
              "ARRIVAL_TIME_OF_DAY_DUMMY", "CARRIER_DELAY", "DISTANCE", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
       "WEATHER_DELAY", "NAS_DELAY", "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY", "CANCELLED", "DIVERTED", \
              "EAST_COAST_ORIGIN", "WEST_COAST_ORIGIN", "EAST_COAST_DEST", "WEST_COAST_DEST",]]
y = delays_df["ARRIVAL_DELAY"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 26) (1683475, 1)


In [103]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

1.4695534706115723 seconds


In [104]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

1.7988338470458984 seconds


In [105]:
# SGDClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
start = time.time()

sgd = SGDClassifier(random_state=rng)
sgd.fit(X_train, y_train)
end = time.time()
print(f"{end-start} seconds")

6.722989559173584 seconds


In [106]:
# Eh, not the best way to start off. The best score for any of the arrival models is .8896 and this does not cut it.

from sklearn.metrics import accuracy_score
start = time.time()

predictions = sgd.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 76.38790217383557
3.3022053241729736 seconds


In [107]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

| Actual Delays (rows), Predicted Delays (columns) | Delayed | On Time |
|---|---|---|
| Delayed | 39558 | 98318 |
| On Time | 1058 | 281935 |


#### ARRIVAL DELAY 33 FEATURES SCALED SGDCLASSIFIER (82.27%)

In [110]:
# Try out the 33 feature model minus the arrival delay dummy variable

y = delays_df["ARRIVAL_DELAY"].values.reshape(-1, 1)

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
X = delays_df.select_dtypes(include=numerics)
X = X.drop(['DEPARTURE_TIME_OF_DAY_DUMMY', 'DEP_DELAY', 'DEPARTURE_DELAY_OVER_15_MINUTES', 'DEPARTURE_DELAY_OVER_30_MINUTES', \
            'DEPARTURE_DELAY_OVER_45_MINUTES', 'DEPARTURE_DELAY_OVER_60_MINUTES', \
            'ARR_DELAY', 'ARRIVAL_DELAY_OVER_15_MINUTES', 'ARRIVAL_DELAY_OVER_30_MINUTES', \
            'ARRIVAL_DELAY_OVER_45_MINUTES', 'ARRIVAL_DELAY_OVER_60_MINUTES', \
            'ARRIVAL_DELAY_OVER_60_MINUTES.1', 'ARRIVAL_TIME_OF_DAY_DUMMY', 'ARRIVAL_DELAY_DUMMY'], axis = 1)

print(X.shape, y.shape)

(1683475, 33) (1683475, 1)


In [111]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

1.5790274143218994 seconds


In [112]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

2.3608081340789795 seconds


In [113]:
# SGDClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
start = time.time()

sgd = SGDClassifier(random_state=rng)
sgd.fit(X_train, y_train.ravel())  # 1-D labels avoid scikit-learn's DataConversionWarning
end = time.time()
print(f"{end-start} seconds")

7.064920663833618 seconds
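
To make the "descent" part of the name concrete, here is a minimal NumPy sketch of stochastic gradient descent on a logistic loss. It is not scikit-learn's implementation: `SGDClassifier` defaults to a hinge loss with L2 regularization and its own learning-rate schedule. The function name, learning rate, epoch count, and toy data below are all assumptions made for illustration.

```python
import numpy as np

def sgd_logistic(X, y, lr=0.1, epochs=50, seed=0):
    """Fit logistic-regression weights by updating on one sample at a time."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        # Visit the samples in a fresh random order every epoch.
        for i in rng.permutation(n_samples):
            p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))  # predicted probability
            error = p - y[i]                            # gradient of the log loss w.r.t. the score
            w -= lr * error * X[i]                      # step against the gradient
            b -= lr * error
    return w, b

# Toy data (hypothetical): two features already in [0, 1], label is 1 when their sum exceeds 1.
data_rng = np.random.default_rng(42)
X_demo = data_rng.uniform(0, 1, size=(500, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(float)

w, b = sgd_logistic(X_demo, y_demo)
demo_accuracy = ((X_demo @ w + b > 0).astype(float) == y_demo).mean()
print(f"Training accuracy on the toy data: {demo_accuracy:.3f}")
```

Each update uses a single row, which is why fitting stays fast on large tables, and also why scaling matters: unscaled features with large ranges would dominate the step size.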


In [114]:
# This is an improvement, but it is no better than what the other models were producing.

from sklearn.metrics import accuracy_score
start = time.time()

predictions = sgd.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 82.27025511501203
3.7160470485687256 seconds


In [115]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

| Actual Delays | Predicted: Delayed | Predicted: On Time |
|---|---|---|
| Delayed | 102912 | 34964 |
| On Time | 39655 | 243338 |


#### ARRIVAL DELAY OVER 15 MINUTES 33 FEATURES SCALED SGDCLASSIFIER (83.73%)

In [116]:
# Try out the 33 feature model minus the arrival delay dummy variable

y = delays_df["ARRIVAL_DELAY_OVER_15_MINUTES"].values.reshape(-1, 1)

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
X = delays_df.select_dtypes(include=numerics)
X = X.drop(['DEPARTURE_TIME_OF_DAY_DUMMY', 'DEP_DELAY', 'DEPARTURE_DELAY_OVER_15_MINUTES', 'DEPARTURE_DELAY_OVER_30_MINUTES', \
            'DEPARTURE_DELAY_OVER_45_MINUTES', 'DEPARTURE_DELAY_OVER_60_MINUTES', \
            'ARR_DELAY', 'ARRIVAL_DELAY_OVER_15_MINUTES', 'ARRIVAL_DELAY_OVER_30_MINUTES', \
            'ARRIVAL_DELAY_OVER_45_MINUTES', 'ARRIVAL_DELAY_OVER_60_MINUTES', \
            'ARRIVAL_DELAY_OVER_60_MINUTES.1', 'ARRIVAL_TIME_OF_DAY_DUMMY', 'ARRIVAL_DELAY_DUMMY'], axis = 1)

print(X.shape, y.shape)

(1683475, 33) (1683475, 1)


In [117]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

1.3400073051452637 seconds


In [118]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

2.400963544845581 seconds


In [119]:
# SGDClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
start = time.time()

sgd = SGDClassifier(random_state=rng)
sgd.fit(X_train, y_train.ravel())  # 1-D labels avoid scikit-learn's DataConversionWarning
end = time.time()
print(f"{end-start} seconds")

5.123867034912109 seconds
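
All of the fits in this section use `SGDClassifier` with its default settings. One possible next step is to tune the regularization strength with `GridSearchCV`, wrapping the scaler and classifier in a `Pipeline` so the scaler is refit inside each cross-validation fold. The sketch below runs on synthetic data rather than `delays_df`, and the `alpha` grid, fold count, and variable names are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the flight data (assumption: NOT the real delays_df).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Scaling lives inside the pipeline, so each CV fold scales with its own training part.
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("sgd", SGDClassifier(random_state=0)),
])

# Hypothetical grid over the regularization strength only.
param_grid = {"sgd__alpha": [1e-5, 1e-4, 1e-3]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"Held-out accuracy: {search.score(X_te, y_te):.4f}")
```

On the full 1.68 million row dataset this multiplies the fit time by the number of grid points times the number of folds, so a small grid is a sensible starting point.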


In [120]:
# This is also an improvement, but still no better than what the other models were producing.

from sklearn.metrics import accuracy_score
start = time.time()

predictions = sgd.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 83.7298541826554
0.1680006980895996 seconds


In [121]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

| Actual Delays | Predicted: 0 | Predicted: 1 |
|---|---|---|
| 0 | 293157 | 12139 |
| 1 | 56337 | 59236 |


#### ARRIVAL DELAY OVER 15 MINUTES 26 FEATURES SCALED SGDCLASSIFIER (72.54%)

In [124]:
# Try out this model that produced the best score for a previous arrival delays model on delays of 15+ minutes.

X = delays_df[["DAY", "MONTH", "DEP_TIME", "DEP_DELAY", "DEPARTURE_TIME_OF_DAY_DUMMY", \
              "OP_CARRIER_FL_NUM", "TAXI_OUT", "AIR_TIME", "TAXI_IN", "WHEELS_ON", "WHEELS_OFF",\
              "ARRIVAL_TIME_OF_DAY_DUMMY", "CARRIER_DELAY", "DISTANCE", "WEEKDAY_DUMMY", "AIRLINE_DUMMY", \
       "WEATHER_DELAY", "NAS_DELAY", "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY", "CANCELLED", "DIVERTED", \
              "EAST_COAST_ORIGIN", "WEST_COAST_ORIGIN", "EAST_COAST_DEST", "WEST_COAST_DEST",]]
y = delays_df["ARRIVAL_DELAY_OVER_15_MINUTES"].values.reshape(-1, 1)
print(X.shape, y.shape)

(1683475, 26) (1683475, 1)


In [125]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

1.2449698448181152 seconds


In [126]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

2.0329737663269043 seconds


In [127]:
# SGDClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
start = time.time()

sgd = SGDClassifier(random_state=rng)
sgd.fit(X_train, y_train.ravel())  # 1-D labels avoid scikit-learn's DataConversionWarning
end = time.time()
print(f"{end-start} seconds")

4.787930011749268 seconds


In [128]:
# As suspected, the smaller 26-feature set significantly lowers the model accuracy.
# At 72.54%, the score is no better than always predicting "not delayed".

from sklearn.metrics import accuracy_score
start = time.time()

predictions = sgd.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 72.53919865801473
0.19199514389038086 seconds


In [130]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

| Actual Delays | Predicted: 0 | Predicted: 1 |
|---|---|---|
| 0 | 305295 | 1 |
| 1 | 115573 | 0 |
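
The crosstab above shows the model predicting class 0 for all but one test flight, so the 72.54% accuracy mostly reflects the class balance rather than anything learned. A quick check, using only the counts from the crosstab (variable names are illustrative), compares the model against a majority-class baseline:

```python
# Counts copied from the crosstab above (rows = actual, columns = predicted).
crosstab = [
    [305295, 1],  # actual 0: predicted 0, predicted 1
    [115573, 0],  # actual 1: predicted 0, predicted 1
]

total = sum(sum(row) for row in crosstab)

# Accuracy of the fitted model: correct predictions sit on the diagonal.
model_accuracy = (crosstab[0][0] + crosstab[1][1]) / total

# Accuracy of a "model" that always predicts the most common actual class.
majority_baseline = max(sum(row) for row in crosstab) / total

# How many times the model predicted a delay at all.
predicted_positive = crosstab[0][1] + crosstab[1][1]

print(f"Model accuracy:       {model_accuracy:.6f}")
print(f"Majority baseline:    {majority_baseline:.6f}")
print(f"Predicted delays (1): {predicted_positive}")
```

Because the single delay prediction was wrong, the fitted model lands a hair below the baseline. This points to a fit that collapsed to one class, not just to a weaker feature set.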


#### ARRIVAL DELAY 3 CATEGORIES 33 FEATURES SCALED SGDCLASSIFIER (79.26%)

In [131]:
# Try out the 33 feature model minus the arrival delay dummy variable

y = delays_df["ARRIVAL_DELAY_TEST"].values.reshape(-1, 1)

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
X = delays_df.select_dtypes(include=numerics)
X = X.drop(['DEPARTURE_TIME_OF_DAY_DUMMY', 'DEP_DELAY', 'DEPARTURE_DELAY_OVER_15_MINUTES', 'DEPARTURE_DELAY_OVER_30_MINUTES', \
            'DEPARTURE_DELAY_OVER_45_MINUTES', 'DEPARTURE_DELAY_OVER_60_MINUTES', \
            'ARR_DELAY', 'ARRIVAL_DELAY_OVER_15_MINUTES', 'ARRIVAL_DELAY_OVER_30_MINUTES', \
            'ARRIVAL_DELAY_OVER_45_MINUTES', 'ARRIVAL_DELAY_OVER_60_MINUTES', \
            'ARRIVAL_DELAY_OVER_60_MINUTES.1', 'ARRIVAL_TIME_OF_DAY_DUMMY', 'ARRIVAL_DELAY_DUMMY'], axis = 1)

print(X.shape, y.shape)

(1683475, 33) (1683475, 1)


In [132]:
# Split for train and test datasets
from sklearn.model_selection import train_test_split
start = time.time()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
end = time.time()
print(f"{end-start} seconds")

1.5709190368652344 seconds


In [133]:
# Try out a min/max scaler for the data if needed
from sklearn.preprocessing import MinMaxScaler
start = time.time()

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
end = time.time()
print(f"{end-start} seconds")

2.620950222015381 seconds


In [134]:
# SGDClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
start = time.time()

sgd = SGDClassifier(random_state=rng)
sgd.fit(X_train, y_train.ravel())  # 1-D labels avoid scikit-learn's DataConversionWarning
end = time.time()
print(f"{end-start} seconds")

13.653743267059326 seconds


In [135]:
# Mediocre. It may be worth identifying which variables in the full model are
# significant, as was done with the departure delay data.

from sklearn.metrics import accuracy_score
start = time.time()

predictions = sgd.predict(X_test)
y_transposed = (np.transpose(y_test)).flatten()
print(f"Accuracy Score: {accuracy_score(y_transposed, predictions)*100}")
end = time.time()
print(f"{end-start} seconds")

Accuracy Score: 79.26528207114328
1.5239343643188477 seconds


In [136]:
pd.crosstab(y_transposed, predictions, rownames=["Actual Delays"], colnames=["Predicted Delays"])

| Actual Delays | Predicted: Long Delay | Predicted: On Time | Predicted: Small Delay |
|---|---|---|---|
| Long Delay | 57853 | 12060 | 89 |
| On Time | 7357 | 275631 | 5 |
| Small Delay | 19894 | 47861 | 119 |
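
The overall 79.26% accuracy masks a large gap between the three categories. As a sketch, per-class recall can be computed directly from the crosstab counts above (the `labels` and `recall` names are illustrative):

```python
# Counts copied from the crosstab above (rows = actual, columns = predicted).
labels = ["Long Delay", "On Time", "Small Delay"]
crosstab = [
    [57853, 12060, 89],     # actual Long Delay
    [7357, 275631, 5],      # actual On Time
    [19894, 47861, 119],    # actual Small Delay
]

# Recall per class: correct predictions (the diagonal) over all actual members of the class.
recall = {
    label: crosstab[i][i] / sum(crosstab[i])
    for i, label in enumerate(labels)
}

for label, value in recall.items():
    print(f"{label:<12} recall: {value:.4f}")
```

The "Small Delay" class is almost never predicted: nearly all of its 67,874 test flights are assigned to "On Time" or "Long Delay". The model is effectively behaving as a two-class classifier.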


In [138]:
SGD_ML_outcomes = {
    # An empty string marks a second run for the model named on the row above it
    "Model": ["Departure Delay", "", "Departure Delay 15+ Minutes", "Departure Delay 3 Categories",
              "Arrival Delay", "", "Arrival Delay 15+ Minutes", "", "Arrival Delay 3 Categories"],

    "Model Type": ["Scaled SGD+PCA+Grid", "Scaled SGD", "Scaled SGD+PCA", "Scaled SGD+PCA",
                   "Scaled SGD", "Scaled SGD", "Scaled SGD", "Scaled SGD", "Scaled SGD"],

    "Features": [10, 30, 10, 10, 26, 33, 33, 26, 33],

    # Test-set scores from the runs above (.8227 is the 82.27% arrival delay run)
    "Test Data R2": [.6799, .6726, .99, .6726, .7639, .8227, .8373, .7254, .7926],
}

SGD_ML_outcomes_df = pd.DataFrame(SGD_ML_outcomes)
SGD_ML_outcomes_df

|   | Model | Model Type | Features | Test Data R2 |
|---|---|---|---|---|
| 0 | Departure Delay | Scaled SGD+PCA+Grid | 10 | 0.6799 |
| 1 |  | Scaled SGD | 30 | 0.6726 |
| 2 | Departure Delay 15+ Minutes | Scaled SGD+PCA | 10 | 0.99 |
| 3 | Departure Delay 3 Categories | Scaled SGD+PCA | 10 | 0.6726 |
| 4 | Arrival Delay | Scaled SGD | 26 | 0.7639 |
| 5 |  | Scaled SGD | 33 | 0.8227 |
| 6 | Arrival Delay 15+ Minutes | Scaled SGD | 33 | 0.8373 |
| 7 |  | Scaled SGD | 26 | 0.7254 |
| 8 | Arrival Delay 3 Categories | Scaled SGD | 33 | 0.7926 |
