# Capstone 2 on RDU Airline Delays - Modeling2
5/16/23

We have wrangled, explored, and preprocessed our data. Now let's create some models to predict whether or not a flight will be delayed/cancelled. (This is attempt 2 because I had to redo my resampling.)

In [1]:
# Import necessary packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# load dataset

X_train = pd.read_csv('../Data/Final/X_train.csv')
X_test = pd.read_csv('../Data/Final/X_test.csv')
y_train = pd.read_csv('../Data/Final/y_train.csv')
y_test = pd.read_csv('../Data/Final/y_test.csv')

In [3]:
X_train.drop(['Unnamed: 0'], axis=1, inplace=True)
X_test.drop(['Unnamed: 0'], axis=1, inplace=True)
y_train.drop(['Unnamed: 0'], axis=1, inplace=True)
y_test.drop(['Unnamed: 0'], axis=1, inplace=True)

In [4]:
# flatten the single-column target DataFrames into 1-D arrays for sklearn
y_train = y_train.to_numpy().ravel()
y_test = y_test.to_numpy().ravel()

It's hard to remember now what my columns are, so here's the list of columns before get_dummies:

         QUARTER              52645 non-null  category
         MONTH                52645 non-null  category
         DAY_OF_MONTH         52645 non-null  category
         DAY_OF_WEEK          52645 non-null  category
         CARRIER              52645 non-null  object  
         FL_NUM               52645 non-null  object  
         DEST                 52645 non-null  object  
         CRS_ELAPSED_TIME_LG  52645 non-null  float64 
         DISTANCE_LG          52645 non-null  float64
         DEP_TIME_BINS        52645 non-null  object  
         ARR_TIME_BINS        52626 non-null  object  

And this is what our RESULT column corresponds to:

                0 = no delay
                1 = delay of 1 hour or less
                2 = delay of more than 1 hour, up to 2 hours
                3 = delay of more than 2 hours
                4 = cancelled
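
As a rough illustration of this encoding, a labelling function might look like the sketch below. The name `encode_result`, the minute cutoffs, and the reading of class 2 as "more than 1 hour, up to 2 hours" are my assumptions here; the actual binning was done in an earlier notebook and may differ.

```python
def encode_result(delay_minutes, cancelled=False):
    """Map a departure delay in minutes to the RESULT classes listed above.

    Assumption: a delay of 0 minutes or less counts as "no delay"; the
    preprocessing notebook may have used a different cutoff.
    """
    if cancelled:
        return 4
    if delay_minutes <= 0:
        return 0
    if delay_minutes <= 60:
        return 1
    if delay_minutes <= 120:
        return 2
    return 3


# One example per class: early, 30 min late, 90 min late, 200 min late, cancelled
labels = [encode_result(m) for m in (-5, 30, 90, 200)]
labels.append(encode_result(0, cancelled=True))
print(labels)
```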

## Models

Since we're trying to solve a multiclass classification problem, we can use the following models:

        0. Dummy Classifier
        1. KNN Classifier
        2. Logistic Regression
        3. Random Forest Classifier
        4. Gradient Boosting Classifier
        
We've already done our preprocessing/scaling and standardizing. The remaining columns that we haven't scaled are objects or binary categories. We'll perform cross validation with KFold, hyperparameter tuning with RandomizedSearchCV, and put these together with a pipeline.
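
The cells below pass each estimator straight to RandomizedSearchCV with `cv=5`. For reference, here's a minimal sketch of how KFold, RandomizedSearchCV, and a pipeline could fit together. The data is synthetic (made with `make_classification`), and the step names `scale` and `knn` are just labels I picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 5 classes, like RESULT
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=0)

# Scaling lives inside the pipeline, so it is re-fit on each training fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_dist = {"knn__n_neighbors": np.arange(1, 20, 2),
              "knn__weights": ["uniform", "distance"]}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = RandomizedSearchCV(pipe, param_dist, n_iter=5, cv=cv, random_state=0)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```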

## 0. Dummy Classifier

The sklearn dummy classifier will give us a baseline with which to compare.

In [5]:
from sklearn.dummy import DummyClassifier
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)

print("Accuracy Score on Training Set for Dummy Classifier: {}".format(dummy_clf.score(X_train, y_train)))
print("Accuracy Score on Test Set for Dummy Classifier: {}".format(dummy_clf.score(X_test, y_test)))

Accuracy Score on Training Set for Dummy Classifier: 0.2
Accuracy Score on Test Set for Dummy Classifier: 0.8087187767119385


In [6]:
y_pred_dummy = dummy_clf.predict(X_test)
y_probs_dummy = dummy_clf.predict_proba(X_test)

In [7]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
print("Accuracy Score on Test Set for Dummy Classifier: {}"
      .format(accuracy_score(y_test, y_pred_dummy)))
print("Precision Score - weighted averaged on Test Set for Dummy Classifier: {}"
      .format(precision_score(y_test, y_pred_dummy, average='weighted')))
print("Recall Score - weighted averaged on Test Set for Dummy Classifier: {}"
      .format(recall_score(y_test, y_pred_dummy, average='weighted')))
print("F1 Score - weighted averaged on Test Set for Dummy Classifier: {}"
      .format(f1_score(y_test, y_pred_dummy, average='weighted')))
print("ROC-AUC score - weighted averaged on Test Set for Dummy Classifier: {}"
      .format(roc_auc_score(y_test, y_probs_dummy, average='weighted', multi_class='ovo')))

# note: since the test set is still imbalanced, I'm using average='weighted'

Accuracy Score on Test Set for Dummy Classifier: 0.8087187767119385
Precision Score - weighted averaged on Test Set for Dummy Classifier: 0.6540260598064542
Recall Score - weighted averaged on Test Set for Dummy Classifier: 0.8087187767119385
F1 Score - weighted averaged on Test Set for Dummy Classifier: 0.7231926468916358
ROC-AUC score - weighted averaged on Test Set for Dummy Classifier: 0.5



Since our test set hasn't been resampled, it is mostly class 0: on-time departures. The dummy classifier uses the "most frequent" strategy, so it predicts the same class for every flight and gets about 81% of its test predictions correct (versus 0.2 on the resampled training set). Thus it has a high accuracy and recall score, but its ROC-AUC score is 0.5, as expected for a model that can't rank flights at all.
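
The dummy classifier's weighted scores can be checked by hand. This is a back-of-the-envelope sketch, assuming the classifier predicts class 0 for every test row (which the identical accuracy and recall suggest). The only input is the accuracy printed above.

```python
p0 = 0.8087187767119385  # share of class 0 in the test set (= dummy accuracy)

# Class 0: every prediction is 0, so precision = p0 and recall = 1.
# The other classes are never predicted, so their precision, recall,
# and F1 all count as 0.
precision_0, recall_0 = p0, 1.0
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

# average='weighted' weights each class by its share of the test set,
# and only class 0 contributes anything non-zero.
weighted_precision = p0 * precision_0
weighted_recall = p0 * recall_0
weighted_f1 = p0 * f1_0

print(weighted_precision, weighted_recall, weighted_f1)
```

These come out to the same precision (about 0.654) and F1 (about 0.723) reported above, so the "good" dummy scores are purely a reflection of the class balance.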

## 1. K Nearest Neighbors

In [8]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
params = {"n_neighbors": np.arange(1, 50, 2),
          "weights": ['uniform', 'distance'],
          "p": [1, 2]}
knn_cv = RandomizedSearchCV(knn, params, cv=5)
knn_cv.fit(X_train, y_train)

In [9]:
print("Tuned KNN Classifier Parameters: {}".format(knn_cv.best_params_))
print("Tuned KNN Classifier Best Accuracy Score: {}".format(knn_cv.best_score_))

Tuned KNN Classifier Parameters: {'weights': 'distance', 'p': 2, 'n_neighbors': 7}
Tuned KNN Classifier Best Accuracy Score: 0.9505082776648273


Let's calculate the same metrics for this KNN classifier: accuracy, precision, recall, F1, and ROC-AUC score.

In [10]:
y_pred_knn = knn_cv.predict(X_test)
y_probs_knn = knn_cv.predict_proba(X_test)

In [11]:
print("Accuracy Score on Test Set for KNN Classifier: {}"
      .format(accuracy_score(y_test, y_pred_knn)))
print("Precision Score - weighted averaged on Test Set for KNN Classifier: {}"
      .format(precision_score(y_test, y_pred_knn, average='weighted')))
print("Recall Score - weighted averaged on Test Set for KNN Classifier: {}"
      .format(recall_score(y_test, y_pred_knn, average='weighted')))
print("F1 Score - weighted averaged on Test Set for KNN Classifier: {}"
      .format(f1_score(y_test, y_pred_knn, average='weighted')))
print("ROC-AUC score - weighted averaged on Test Set for KNN Classifier: {}"
      .format(roc_auc_score(y_test, y_probs_knn, average='weighted', multi_class='ovo')))

Accuracy Score on Test Set for KNN Classifier: 0.7551524361287871
Precision Score - weighted averaged on Test Set for KNN Classifier: 0.6970710277899569
Recall Score - weighted averaged on Test Set for KNN Classifier: 0.7551524361287871
F1 Score - weighted averaged on Test Set for KNN Classifier: 0.7226003365000664
ROC-AUC score - weighted averaged on Test Set for KNN Classifier: 0.5926151582490415


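The KNN model is hard to tell apart from the dummy classifier. Its accuracy and recall (0.76) are actually lower than the dummy's 0.81, its F1 score is essentially the same (0.72), and only its precision (0.70 vs. 0.65) and ROC-AUC (0.59 vs. 0.50) are slightly better. The cross-validated accuracy of 0.95 on the resampled training set clearly doesn't carry over to the test set.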
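
Every model in this notebook gets the same five print statements. A small helper could replace them; the sketch below is one way to write it. The name `report_metrics` and the toy data are mine for illustration only. It also passes `zero_division=0` to the precision and F1 calls, which still scores never-predicted classes as 0 but without the warning.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)


def report_metrics(name, y_true, y_pred, y_probs):
    """Print and return the five test-set metrics used in this notebook."""
    scores = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted",
                                     zero_division=0),
        "recall": recall_score(y_true, y_pred, average="weighted"),
        "f1": f1_score(y_true, y_pred, average="weighted", zero_division=0),
        "roc_auc": roc_auc_score(y_true, y_probs, average="weighted",
                                 multi_class="ovo"),
    }
    for metric, value in scores.items():
        print("{} on Test Set for {}: {:.4f}".format(metric, name, value))
    return scores


# Toy data (assumption): three balanced classes and a most-frequent dummy
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = np.repeat([0, 1, 2], 20)
toy_clf = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
toy_scores = report_metrics("Toy Dummy", y_toy,
                            toy_clf.predict(X_toy),
                            toy_clf.predict_proba(X_toy))
```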

## 2. Logistic Regression

In [12]:
# took forever to run a RandomizedSearchCV with more parameters,
# so stripped it down to just one parameter for a "baseline metric"

from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()
params = {"C": [100, 10, 1.0, 0.1, 0.01]}
LR_cv = RandomizedSearchCV(LR, params, cv=5)
LR_cv.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [13]:
print("Tuned Logistic Regression Parameters: {}"
      .format(LR_cv.best_params_))
print("Tuned Logistic Regression Best Accuracy Score: {}"
      .format(LR_cv.best_score_))

Tuned Logistic Regression Parameters: {'C': 0.01}
Tuned Logistic Regression Best Accuracy Score: 0.42932326459483006


In [14]:
y_pred_LR = LR_cv.predict(X_test)
y_probs_LR = LR_cv.predict_proba(X_test)

In [15]:
print("Accuracy Score on Test Set for Logistic Regression Classifier: {}"
      .format(accuracy_score(y_test, y_pred_LR)))
print("Precision Score - weighted averaged on Test Set for Logistic Regression Classifier: {}"
      .format(precision_score(y_test, y_pred_LR, average='weighted')))
print("Recall Score - weighted averaged on Test Set for Logistic Regression Classifier: {}"
      .format(recall_score(y_test, y_pred_LR, average='weighted')))
print("F1 Score - weighted averaged on Test Set for Logistic Regression Classifier: {}"
      .format(f1_score(y_test, y_pred_LR, average='weighted')))
print("ROC-AUC score - weighted averaged on Test Set for Logistic Regression Classifier: {}"
      .format(roc_auc_score(y_test, y_probs_LR, average='weighted', multi_class='ovo')))

Accuracy Score on Test Set for Logistic Regression Classifier: 0.5789723620476779
Precision Score - weighted averaged on Test Set for Logistic Regression Classifier: 0.7130569355741322
Recall Score - weighted averaged on Test Set for Logistic Regression Classifier: 0.5789723620476779
F1 Score - weighted averaged on Test Set for Logistic Regression Classifier: 0.6337286055108474
ROC-AUC score - weighted averaged on Test Set for Logistic Regression Classifier: 0.6373196900689284


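We have extremely poor numbers for logistic regression: test accuracy is only 0.58, well below the dummy classifier's 0.81. The solver also kept hitting its iteration limit, so the model never converged. Even as rough "baseline numbers," these suggest it would not be a good model for us.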
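
For what it's worth, the warning itself suggests raising `max_iter` (the default is 100). Here's a minimal sketch of that on synthetic data; I haven't tested whether it would rescue logistic regression on the flight data, and the value 2000 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption), not the flight data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=8,
    n_classes=5, n_clusters_per_class=1, random_state=0)

# Give the default lbfgs solver a larger iteration budget than the default 100
LR_demo = LogisticRegression(C=0.01, max_iter=2000)
LR_demo.fit(X_demo, y_demo)

# n_iter_ holds the iterations actually used; a value below max_iter
# means the solver stopped because it converged, not because it ran out
print(LR_demo.n_iter_)
```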

## 3. Random Forest

In [23]:
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier()
RF.fit(X_train, y_train)

In [26]:
y_pred_rf = RF.predict(X_test)
y_probs_rf = RF.predict_proba(X_test)

In [27]:
print("Accuracy Score on Test Set for Random Forest Classifier: {}"
      .format(accuracy_score(y_test, y_pred_rf)))
print("Precision Score - weighted averaged on Test Set for Random Forest Classifier: {}"
      .format(precision_score(y_test, y_pred_rf, average='weighted')))
print("Recall Score - weighted averaged on Test Set for Random Forest Classifier: {}"
      .format(recall_score(y_test, y_pred_rf, average='weighted')))
print("F1 Score - weighted averaged on Test Set for Random Forest Classifier: {}"
      .format(f1_score(y_test, y_pred_rf, average='weighted')))
print("ROC-AUC score - weighted averaged on Test Set for Random Forest Classifier: {}"
      .format(roc_auc_score(y_test, y_probs_rf, average='weighted', multi_class='ovo')))

Accuracy Score on Test Set for Random Forest Classifier: 0.7958020704720297
Precision Score - weighted averaged on Test Set for Random Forest Classifier: 0.7206488879107353
Recall Score - weighted averaged on Test Set for Random Forest Classifier: 0.7958020704720297
F1 Score - weighted averaged on Test Set for Random Forest Classifier: 0.7457074298933853
ROC-AUC score - weighted averaged on Test Set for Random Forest Classifier: 0.6907940151756098


Even without RandomizedSearchCV, Random Forest already performs better than our other models. Let's try it with RandomizedSearchCV now.

In [10]:
params = {"n_estimators": np.arange(25, 500, 25),
          "criterion": ["gini", "entropy"]}
RF_cv = RandomizedSearchCV(RF, params, cv=5)
RF_cv.fit(X_train, y_train)

In [11]:
print("Tuned Random Forest Classifier Parameters: {}".format(RF_cv.best_params_))
print("Tuned Random Forest Classifier Best Accuracy Score: {}".format(RF_cv.best_score_))

Tuned Random Forest Classifier Parameters: {'n_estimators': 425, 'criterion': 'entropy'}
Tuned Random Forest Classifier Best Accuracy Score: 0.9684925936683124


In [12]:
y_pred_rfcv = RF_cv.predict(X_test)
y_probs_rfcv = RF_cv.predict_proba(X_test)

In [13]:
print("Accuracy Score on Test Set for Random Forest Classifier: {}"
      .format(accuracy_score(y_test, y_pred_rfcv)))
print("Precision Score - weighted averaged on Test Set for Random Forest Classifier: {}"
      .format(precision_score(y_test, y_pred_rfcv, average='weighted')))
print("Recall Score - weighted averaged on Test Set for Random Forest Classifier: {}"
      .format(recall_score(y_test, y_pred_rfcv, average='weighted')))
print("F1 Score - weighted averaged on Test Set for Random Forest Classifier: {}"
      .format(f1_score(y_test, y_pred_rfcv, average='weighted')))
print("ROC-AUC score - weighted averaged on Test Set for Random Forest Classifier: {}"
      .format(roc_auc_score(y_test, y_probs_rfcv, average='weighted', multi_class='ovo')))

Accuracy Score on Test Set for Random Forest Classifier: 0.7965618767214361
Precision Score - weighted averaged on Test Set for Random Forest Classifier: 0.7190495771048088
Recall Score - weighted averaged on Test Set for Random Forest Classifier: 0.7965618767214361
F1 Score - weighted averaged on Test Set for Random Forest Classifier: 0.7444888093081168
ROC-AUC score - weighted averaged on Test Set for Random Forest Classifier: 0.6977167825865475


The tuned random forest classifier barely differs from the default one: accuracy and ROC-AUC are marginally higher, while precision and F1 are marginally lower. Nonetheless, random forest performs the best of all of our models so far.
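
Worth keeping in mind: RandomizedSearchCV only tries `n_iter=10` combinations by default, out of the 19 × 2 = 38 in the grid above, and without a `random_state` the sampled combinations change from run to run. The sketch below shows those two knobs on synthetic data; the smaller grid and `n_iter=4` are just to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption), not the flight data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=0)

params_demo = {"n_estimators": np.arange(10, 60, 10),
               "criterion": ["gini", "entropy"]}

search_demo = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    params_demo,
    n_iter=4,        # number of sampled combinations (default is 10)
    cv=3,
    random_state=0,  # fixes which combinations get sampled
)
search_demo.fit(X_demo, y_demo)

# cv_results_ lists every combination that was actually tried
print(len(search_demo.cv_results_["params"]), search_demo.best_params_)
```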

## 4. Gradient Boosting Classifier

In [20]:
# for some reason it was IMPOSSIBLE to run a RandomizedSearchCV on this classifier.
# I let it run for hours and it never completed.
# But let's try it with the default parameters to see if it's promising
# (it's not)

from sklearn.ensemble import GradientBoostingClassifier
GBC = GradientBoostingClassifier()
GBC.fit(X_train, y_train)

In [21]:
y_pred_gbc = GBC.predict(X_test)
y_probs_gbc = GBC.predict_proba(X_test)

In [22]:
print("Accuracy Score on Test Set for Gradient Boosting Classifier: {}"
      .format(accuracy_score(y_test, y_pred_gbc)))
print("Precision Score - weighted averaged on Test Set for Gradient Boosting Classifier: {}"
      .format(precision_score(y_test, y_pred_gbc, average='weighted')))
print("Recall Score - weighted averaged on Test Set for Gradient Boosting Classifier: {}"
      .format(recall_score(y_test, y_pred_gbc, average='weighted')))
print("F1 Score - weighted averaged on Test Set for Gradient Boosting Classifier: {}"
      .format(f1_score(y_test, y_pred_gbc, average='weighted')))
print("ROC-AUC score - weighted averaged on Test Set for Gradient Boosting Classifier: {}"
      .format(roc_auc_score(y_test, y_probs_gbc, average='weighted', multi_class='ovo')))

Accuracy Score on Test Set for Gradient Boosting Classifier: 0.7088042549149967
Precision Score - weighted averaged on Test Set for Gradient Boosting Classifier: 0.7174036038538861
Recall Score - weighted averaged on Test Set for Gradient Boosting Classifier: 0.7088042549149967
F1 Score - weighted averaged on Test Set for Gradient Boosting Classifier: 0.7123079122040494
ROC-AUC score - weighted averaged on Test Set for Gradient Boosting Classifier: 0.6510593057288828


Again, the RandomizedSearchCV never finished running on the GradientBoostingClassifier, so I used the default parameters to get a baseline metric. As we can see, these scores were mediocre. Accuracy and recall (0.71) are well below the dummy classifier's 0.81, and F1 (0.71) is slightly below it too; only precision (0.72) and ROC-AUC (0.65) are better than the dummy classifier's.

## Conclusion

The Random Forest Classifier outperforms the KNN Classifier, Logistic Regression, and Gradient Boosting Classifier. Its accuracy and recall actually don't outperform the Dummy Classifier, but its precision and ROC-AUC scores are quite a bit better, and its F1 score is slightly better.

For the RF Classifier, the best hyperparameters found through RandomizedSearchCV were {'n_estimators': 425, 'criterion': 'entropy'}.

Rounded to two decimal places:
Our accuracy was 0.80,
precision was 0.72,
recall was 0.80,
F1 score was 0.74,
and ROC-AUC was 0.70.
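
To make the comparison with the baseline explicit, here's a small sketch that lists which models beat the dummy classifier on each metric. The numbers are the test-set scores printed above, rounded to four decimals; recall is left out because with weighted averaging it equals accuracy.

```python
# Test-set scores copied from the outputs above, rounded to four decimals
scores = {
    "Dummy": {"accuracy": 0.8087, "precision": 0.6540,
              "f1": 0.7232, "roc_auc": 0.5000},
    "KNN": {"accuracy": 0.7552, "precision": 0.6971,
            "f1": 0.7226, "roc_auc": 0.5926},
    "Logistic Regression": {"accuracy": 0.5790, "precision": 0.7131,
                            "f1": 0.6337, "roc_auc": 0.6373},
    "Random Forest (tuned)": {"accuracy": 0.7966, "precision": 0.7190,
                              "f1": 0.7445, "roc_auc": 0.6977},
    "Gradient Boosting": {"accuracy": 0.7088, "precision": 0.7174,
                          "f1": 0.7123, "roc_auc": 0.6511},
}

baseline = scores["Dummy"]
beats_dummy = {
    metric: [name for name, s in scores.items()
             if name != "Dummy" and s[metric] > baseline[metric]]
    for metric in baseline
}

for metric, names in beats_dummy.items():
    print("{}: {}".format(metric, ", ".join(names) if names else "none"))
```

No model beats the dummy on accuracy, all four beat it on precision and ROC-AUC, and only the random forest beats it on F1.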

## Saving our best model

In [16]:
from sklearn.ensemble import RandomForestClassifier
best_RF = RandomForestClassifier(n_estimators=425, criterion='entropy')
best_model = best_RF.fit(X_train, y_train)

In [17]:
from sklearn import __version__ as sklearn_version
import datetime
import pickle
best_model.version = 1.0
best_model.pandas_version = pd.__version__
best_model.numpy_version = np.__version__
best_model.sklearn_version = sklearn_version
best_model.X_columns = [col for col in X_train.columns]
best_model.build_datetime = datetime.datetime.now()

In [18]:
savedmodel = 'RDU_departure_predictions.pkl'
# use a context manager so the file handle is closed after writing
with open(savedmodel, 'wb') as f:
    pickle.dump(best_model, f)
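
Loading the model back later might look like the sketch below. So that it runs on its own, it pickles a `SimpleNamespace` stand-in carrying the same metadata attributes rather than the real fitted forest, and it writes to a temporary directory; both are assumptions for illustration. Pickles should only be loaded from a trusted source, ideally with the same sklearn version that is stored on the model.

```python
import datetime
import os
import pickle
import tempfile
from types import SimpleNamespace

# Stand-in for the fitted model (assumption); in the notebook this is best_model
stand_in = SimpleNamespace(version=1.0,
                           X_columns=["DISTANCE_LG", "CRS_ELAPSED_TIME_LG"],
                           build_datetime=datetime.datetime.now())

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "RDU_departure_predictions.pkl")
    with open(path, "wb") as f:
        pickle.dump(stand_in, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# New data should have the same columns, in the same order, as at training time
new_columns = ["DISTANCE_LG", "CRS_ELAPSED_TIME_LG"]
columns_match = list(new_columns) == loaded.X_columns
print(loaded.version, columns_match)
```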