# Capstone 2 on RDU Airline Delays - Modeling2
10/19/22

We have wrangled, explored, and preprocessed our data. Now let's build some models to predict whether a flight will be delayed or cancelled. (This is attempt 2 because I found an issue with my dataset during the first attempt.)

In [1]:
#Import necessary packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# load dataset

X_train = pd.read_csv('X_train2.csv')
X_test = pd.read_csv('X_test2.csv')
y_train = pd.read_csv('y_train2.csv')
y_test = pd.read_csv('y_test2.csv')

In [3]:
X_train.drop(['Unnamed: 0'], axis=1, inplace=True)
X_test.drop(['Unnamed: 0'], axis=1, inplace=True)
y_train.drop(['Unnamed: 0'], axis=1, inplace=True)
y_test.drop(['Unnamed: 0'], axis=1, inplace=True)

In [4]:
X_train.columns

Index(['FL_NUM', 'QUARTER_2', 'QUARTER_3', 'QUARTER_4', 'MONTH_2', 'MONTH_3',
       'MONTH_4', 'MONTH_5', 'MONTH_6', 'MONTH_7',
       ...
       'DEST_TTN', 'DEST_VPS', 'DEP_TIME_BINS_LATE_NIGHT',
       'DEP_TIME_BINS_MIDDAY', 'DEP_TIME_BINS_MORNING',
       'ARR_TIME_BINS_LATE_NIGHT', 'ARR_TIME_BINS_MIDDAY',
       'ARR_TIME_BINS_MORNING', 'CRS_ELAPSED_TIME_LG', 'DISTANCE_LG'],
      dtype='object', length=124)

In [5]:
y_train = pd.DataFrame(y_train).to_numpy().ravel()
y_test = pd.DataFrame(y_test).to_numpy().ravel()

It's hard to remember now what my columns are, so here's the list of columns before get_dummies:

         QUARTER              52645 non-null  category
         MONTH                52645 non-null  category
         DAY_OF_MONTH         52645 non-null  category
         DAY_OF_WEEK          52645 non-null  category
         CARRIER              52645 non-null  object  
         FL_NUM               52645 non-null  object  
         DEST                 52645 non-null  object  
         CRS_ELAPSED_TIME_LG  52645 non-null  float64 
         DISTANCE_LG          52645 non-null  float64
         DEP_TIME_BINS        52645 non-null  object  
         ARR_TIME_BINS        52626 non-null  object  
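
As a reminder of how these eleven columns expand into the 124 shown above, here is a minimal sketch of one-hot encoding with `pd.get_dummies`. The toy frame and its values are made up, and `drop_first=True` is an assumption inferred from the absence of columns such as `QUARTER_1` above; this is not the actual preprocessing code.

```python
import pandas as pd

# Toy stand-in for the pre-encoding data; values are made up for illustration.
toy = pd.DataFrame({
    "QUARTER": pd.Categorical([1, 2, 3, 4]),
    "DEP_TIME_BINS": ["EVENING", "MORNING", "MIDDAY", "LATE_NIGHT"],
    "DISTANCE_LG": [5.1, 6.3, 7.0, 5.8],
})

# drop_first=True drops one level per categorical column (assumed setting).
encoded = pd.get_dummies(toy, columns=["QUARTER", "DEP_TIME_BINS"], drop_first=True)
print(list(encoded.columns))
```

Each categorical column becomes one binary column per remaining level, while numeric columns such as `DISTANCE_LG` pass through unchanged.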

And this is what our RESULT column corresponds to:

                0 = no delay
                1 = delay of 1 hour or less
                2 = delay of more than 1 hour, up to 2 hours
                3 = delay of more than 2 hours
                4 = cancelled

## Models

Since we're trying to solve a multiclass classification problem, we can use the following models:

        0. Dummy Classifier
        1. KNN Classifier
        2. Logistic Regression
        3. Random Forest Classifier
        4. Gradient Boosting Classifier
        
We've already handled preprocessing, including scaling and standardizing. The columns we haven't scaled are objects or binary categories. We'll tune hyperparameters with RandomizedSearchCV, which runs 5-fold cross validation (cv=5) for every candidate it samples.
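
The cells below pass `cv=5` straight to the search. For reference, here is a minimal sketch of how an explicit `KFold` splitter, `RandomizedSearchCV`, and a `Pipeline` could be combined. The synthetic data, the step names `scale` and `model`, and the parameter ranges are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 5-class data standing in for X_train / y_train (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

pipe = Pipeline([
    ("scale", StandardScaler()),        # included only to show a multi-step pipeline
    ("model", KNeighborsClassifier()),
])

# Step-prefixed names route each parameter to the right pipeline step.
param_distributions = {
    "model__n_neighbors": np.arange(1, 30, 2),
    "model__weights": ["uniform", "distance"],
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = RandomizedSearchCV(pipe, param_distributions, n_iter=5, cv=cv, random_state=0)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into preprocessing.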

## 0. Dummy Classifier

The sklearn dummy classifier will give us a baseline with which to compare.

In [6]:
from sklearn.dummy import DummyClassifier
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)

print("Accuracy Score on Training Set for Dummy Classifier: {}".format(dummy_clf.score(X_train, y_train)))
print("Accuracy Score on Test Set for Dummy Classifier: {}".format(dummy_clf.score(X_test, y_test)))

Accuracy Score on Training Set for Dummy Classifier: 0.50016
Accuracy Score on Test Set for Dummy Classifier: 0.49952


In [7]:
y_pred_dummy = dummy_clf.predict(X_test)
y_probs_dummy = dummy_clf.predict_proba(X_test)

In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
print("Accuracy Score on Test Set for Dummy Classifier: {}"
      .format(accuracy_score(y_test, y_pred_dummy)))
print("Precision Score - weighted averaged on Test Set for Dummy Classifier: {}"
      .format(precision_score(y_test, y_pred_dummy, average='weighted')))
print("Recall Score - weighted averaged on Test Set for Dummy Classifier: {}"
      .format(recall_score(y_test, y_pred_dummy, average='weighted')))
print("F1 Score - weighted averaged on Test Set for Dummy Classifier: {}"
      .format(f1_score(y_test, y_pred_dummy, average='weighted')))
print("ROC-AUC score - weighted averaged on Test Set for Dummy Classifier: {}"
      .format(roc_auc_score(y_test, y_probs_dummy, average='weighted', multi_class='ovo')))

# note: since my dataset is still slightly imbalanced, I'm using average='weighted'

Accuracy Score on Test Set for Dummy Classifier: 0.49952
Precision Score - weighted averaged on Test Set for Dummy Classifier: 0.2495202304
Recall Score - weighted averaged on Test Set for Dummy Classifier: 0.49952
F1 Score - weighted averaged on Test Set for Dummy Classifier: 0.332800136577038
ROC-AUC score - weighted averaged on Test Set for Dummy Classifier: 0.5




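The dummy classifier always predicts the most frequent class, so its accuracy (about 0.50) simply reflects that class's share of the data, and its ROC-AUC of 0.5 is no better than chance. Our models won't have to work too hard to beat these metrics.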
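
The dummy metrics follow directly from always predicting one class. If that class has share p of the samples, weighted precision is p squared, weighted recall is p, and weighted F1 is 2p^2 / (1 + p). With p = 0.49952 this gives 0.2495202304 for precision, matching the output above. The sketch below checks those identities on made-up labels; the class shares are assumptions, not the RDU data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up labels: class 0 holds half of the 1000 samples (assumed shares).
y = np.repeat([0, 1, 2, 3, 4], [500, 200, 150, 100, 50])
X = np.zeros((len(y), 1))  # features are ignored by DummyClassifier

pred = DummyClassifier(strategy="most_frequent").fit(X, y).predict(X)
p = np.mean(y == 0)  # share of the majority class

# zero_division=0 scores never-predicted classes as 0 without a warning.
precision = precision_score(y, pred, average="weighted", zero_division=0)
recall = recall_score(y, pred, average="weighted", zero_division=0)
f1 = f1_score(y, pred, average="weighted", zero_division=0)

print(precision, p ** 2)         # weighted precision vs. p squared
print(recall, p)                 # weighted recall vs. p
print(f1, 2 * p ** 2 / (1 + p))  # weighted F1 vs. 2p^2 / (1 + p)
```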

## 1. K Nearest Neighbors

In [9]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
params = {"n_neighbors": np.arange(1, 50, 2),
        "weights": ['uniform', 'distance'],
         'p': [1, 2]}
knn_cv = RandomizedSearchCV(knn, params, cv=5)
knn_cv.fit(X_train, y_train)

RandomizedSearchCV(cv=5, estimator=KNeighborsClassifier(),
                   param_distributions={'n_neighbors': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33,
       35, 37, 39, 41, 43, 45, 47, 49]),
                                        'p': [1, 2],
                                        'weights': ['uniform', 'distance']})

In [10]:
print("Tuned KNN Classifier Parameters: {}".format(knn_cv.best_params_))
print("Tuned KNN Classifier Best Accuracy Score: {}".format(knn_cv.best_score_))

Tuned KNN Classifier Parameters: {'weights': 'distance', 'p': 1, 'n_neighbors': 43}
Tuned KNN Classifier Best Accuracy Score: 0.8306666666666667


Let's calculate some metrics for this KNN classifier: accuracy, precision, recall, F1, and ROC-AUC score.

In [11]:
y_pred_knn = knn_cv.predict(X_test)
y_probs_knn = knn_cv.predict_proba(X_test)

In [12]:
print("Accuracy Score on Test Set for KNN Classifier: {}"
      .format(accuracy_score(y_test, y_pred_knn)))
print("Precision Score - weighted averaged on Test Set for KNN Classifier: {}"
      .format(precision_score(y_test, y_pred_knn, average='weighted')))
print("Recall Score - weighted averaged on Test Set for KNN Classifier: {}"
      .format(recall_score(y_test, y_pred_knn, average='weighted')))
print("F1 Score - weighted averaged on Test Set for KNN Classifier: {}"
      .format(f1_score(y_test, y_pred_knn, average='weighted')))
print("ROC-AUC score - weighted averaged on Test Set for KNN Classifier: {}"
      .format(roc_auc_score(y_test, y_probs_knn, average='weighted', multi_class='ovo')))

Accuracy Score on Test Set for KNN Classifier: 0.85256
Precision Score - weighted averaged on Test Set for KNN Classifier: 0.8604214834908291
Recall Score - weighted averaged on Test Set for KNN Classifier: 0.85256
F1 Score - weighted averaged on Test Set for KNN Classifier: 0.8538858273467026
ROC-AUC score - weighted averaged on Test Set for KNN Classifier: 0.95906915448453


## 2. Logistic Regression

In [19]:
# took forever to run a RandomizedSearchCV with more parameters,
# so stripped it down to just one parameter for a "baseline metric"

from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()
params = {"C": [100, 10, 1.0, 0.1, 0.01]}
LR_cv = RandomizedSearchCV(LR, params, cv=5)
LR_cv.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

RandomizedSearchCV(cv=5, estimator=LogisticRegression(),
                   param_distributions={'C': [100, 10, 1.0, 0.1, 0.01]})

In [20]:
print("Tuned Logistic Regression Parameters: {}"
      .format(LR_cv.best_params_))
print("Tuned Logistic Regression Best Accuracy Score: {}"
      .format(LR_cv.best_score_))

Tuned Logistic Regression Parameters: {'C': 0.01}
Tuned Logistic Regression Best Accuracy Score: 0.5238400000000001


In [21]:
y_pred_LR = LR_cv.predict(X_test)
y_probs_LR = LR_cv.predict_proba(X_test)

In [22]:
print("Accuracy Score on Test Set for Logistic Regression Classifier: {}"
      .format(accuracy_score(y_test, y_pred_LR)))
print("Precision Score - weighted averaged on Test Set for Logistic Regression Classifier: {}"
      .format(precision_score(y_test, y_pred_LR, average='weighted')))
print("Recall Score - weighted averaged on Test Set for Logistic Regression Classifier: {}"
      .format(recall_score(y_test, y_pred_LR, average='weighted')))
print("F1 Score - weighted averaged on Test Set for Logistic Regression Classifier: {}"
      .format(f1_score(y_test, y_pred_LR, average='weighted')))
print("ROC-AUC score - weighted averaged on Test Set for Logistic Regression Classifier: {}"
      .format(roc_auc_score(y_test, y_probs_LR, average='weighted', multi_class='ovo')))

Accuracy Score on Test Set for Logistic Regression Classifier: 0.52
Precision Score - weighted averaged on Test Set for Logistic Regression Classifier: 0.39038691194207187
Recall Score - weighted averaged on Test Set for Logistic Regression Classifier: 0.52
F1 Score - weighted averaged on Test Set for Logistic Regression Classifier: 0.42608632238480376
ROC-AUC score - weighted averaged on Test Set for Logistic Regression Classifier: 0.6175465272854115




Logistic regression performs very poorly: test accuracy is 0.52, barely above the dummy classifier's 0.50. The solver also hit its iteration limit without converging. Even as "baseline numbers," these results already suggest it would not be a good model for us.
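
The warning text itself suggests raising `max_iter`. Here is a minimal sketch of that change on synthetic data, counting how many convergence warnings are still raised. The value `max_iter=1000` and the synthetic dataset are assumptions, and nothing here shows that the change would improve the scores on the RDU data.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 5-class data standing in for X_train / y_train (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# max_iter=1000 is an assumed value; the scikit-learn default is 100.
LR = LogisticRegression(max_iter=1000)
params = {"C": [100, 10, 1.0, 0.1, 0.01]}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # n_iter=5 matches the 5 candidate values, so every C is tried once.
    LR_cv = RandomizedSearchCV(LR, params, n_iter=5, cv=5, random_state=0)
    LR_cv.fit(X_demo, y_demo)

n_conv = sum(issubclass(w.category, ConvergenceWarning) for w in caught)
print("Convergence warnings:", n_conv)
print("Best parameters:", LR_cv.best_params_)
```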

## 3. Random Forest

In [6]:
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier()
# note: for classifiers, 'auto' is the same as 'sqrt' ('auto' was removed in scikit-learn 1.3)
params = {"n_estimators": np.arange(25, 500, 25),
          "criterion": ["gini", "entropy"],
          "max_features": ['auto', 'sqrt']}
RF_cv = RandomizedSearchCV(RF, params, cv=5)
RF_cv.fit(X_train, y_train)

RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(),
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_features': ['auto', 'sqrt'],
                                        'n_estimators': array([ 25,  50,  75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325,
       350, 375, 400, 425, 450, 475])})

In [7]:
print("Tuned Random Forest Classifier Parameters: {}".format(RF_cv.best_params_))
print("Tuned Random Forest Classifier Best Accuracy Score: {}".format(RF_cv.best_score_))

Tuned Random Forest Classifier Parameters: {'n_estimators': 425, 'max_features': 'auto', 'criterion': 'gini'}
Tuned Random Forest Classifier Best Accuracy Score: 0.8520800000000002


In [8]:
y_pred_rf = RF_cv.predict(X_test)
y_probs_rf = RF_cv.predict_proba(X_test)

In [9]:
print("Accuracy Score on Test Set for Random Forest Classifier: {}"
      .format(accuracy_score(y_test, y_pred_rf)))
print("Precision Score - weighted averaged on Test Set for Random Forest Classifier: {}"
      .format(precision_score(y_test, y_pred_rf, average='weighted')))
print("Recall Score - weighted averaged on Test Set for Random Forest Classifier: {}"
      .format(recall_score(y_test, y_pred_rf, average='weighted')))
print("F1 Score - weighted averaged on Test Set for Random Forest Classifier: {}"
      .format(f1_score(y_test, y_pred_rf, average='weighted')))
print("ROC-AUC score - weighted averaged on Test Set for Random Forest Classifier: {}"
      .format(roc_auc_score(y_test, y_probs_rf, average='weighted', multi_class='ovo')))

Accuracy Score on Test Set for Random Forest Classifier: 0.88656
Precision Score - weighted averaged on Test Set for Random Forest Classifier: 0.8878030167706007
Recall Score - weighted averaged on Test Set for Random Forest Classifier: 0.88656
F1 Score - weighted averaged on Test Set for Random Forest Classifier: 0.8867338091960775
ROC-AUC score - weighted averaged on Test Set for Random Forest Classifier: 0.960010153347069


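This model outperforms our (already very good) KNN classifier on test accuracy (0.887 vs. 0.853), while the ROC-AUC scores are nearly identical (0.960 vs. 0.959).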

## 4. Gradient Boosting Classifier

In [11]:
# for some reason it was IMPOSSIBLE to run a RandomizedSearchCV on this classifier.
# I let it run for hours and it never completed.
# But let's try it with the default parameters to see if it's promising
# (it's not)

from sklearn.ensemble import GradientBoostingClassifier
GBC = GradientBoostingClassifier()
GBC.fit(X_train, y_train)

GradientBoostingClassifier()

In [13]:
y_pred_gbc = GBC.predict(X_test)
y_probs_gbc = GBC.predict_proba(X_test)

In [14]:
print("Accuracy Score on Test Set for Gradient Boosting Classifier: {}"
      .format(accuracy_score(y_test, y_pred_gbc)))
print("Precision Score - weighted averaged on Test Set for Gradient Boosting Classifier: {}"
      .format(precision_score(y_test, y_pred_gbc, average='weighted')))
print("Recall Score - weighted averaged on Test Set for Gradient Boosting Classifier: {}"
      .format(recall_score(y_test, y_pred_gbc, average='weighted')))
print("F1 Score - weighted averaged on Test Set for Gradient Boosting Classifier: {}"
      .format(f1_score(y_test, y_pred_gbc, average='weighted')))
print("ROC-AUC score - weighted averaged on Test Set for Gradient Boosting Classifier: {}"
      .format(roc_auc_score(y_test, y_probs_gbc, average='weighted', multi_class='ovo')))

Accuracy Score on Test Set for Gradient Boosting Classifier: 0.54856
Precision Score - weighted averaged on Test Set for Gradient Boosting Classifier: 0.5627793693132978
Recall Score - weighted averaged on Test Set for Gradient Boosting Classifier: 0.54856
F1 Score - weighted averaged on Test Set for Gradient Boosting Classifier: 0.47528682737883166
ROC-AUC score - weighted averaged on Test Set for Gradient Boosting Classifier: 0.715993475639177


Again, the RandomizedSearchCV never finished running on the GradientBoostingClassifier, so I fit it with the default parameters to get a baseline metric. As we can see, the results weren't very good: test accuracy was 0.55, only slightly above the dummy classifier.

## Conclusion

The Random Forest Classifier outperforms the KNN Classifier, Logistic Regression, and Gradient Boosting Classifier. And of course it far outperforms the Dummy Classifier.

For the RF Classifier, the best hyperparameters found through RandomizedSearchCV were: {'n_estimators': 425, 'max_features': 'auto', 'criterion': 'gini'}.

On the test set, our accuracy was 0.89, precision was 0.89, recall was 0.89, F1 score was 0.89, and ROC-AUC was 0.96.

## Saving our best model

In [6]:
from sklearn.ensemble import RandomForestClassifier
best_RF = RandomForestClassifier(n_estimators=425,
                criterion='gini',
                max_features='auto')
best_model = best_RF.fit(X_train, y_train)

In [7]:
from sklearn import __version__ as sklearn_version
import datetime
import pickle
best_model.version = 1.0
best_model.pandas_version = pd.__version__
best_model.numpy_version = np.__version__
best_model.sklearn_version = sklearn_version
best_model.X_columns = [col for col in X_train.columns]
best_model.build_datetime = datetime.datetime.now()

In [8]:
savedmodel = 'RDU_departure_predictions.pkl'
# use a context manager so the file handle is closed after writing
with open(savedmodel, 'wb') as f:
    pickle.dump(best_model, f)
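
To show why the `X_columns` metadata is worth storing, here is a minimal sketch of a save and load round trip in which new data is reordered to the training column order before predicting. The toy frame, the small forest, and the file name `toy_model.pkl` are hypothetical stand-ins, not the saved RDU model.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up training frame standing in for X_train / y_train.
X_toy = pd.DataFrame({"DISTANCE_LG": [5.1, 6.3, 7.0, 5.8],
                      "QUARTER_2": [0, 1, 0, 1]})
y_toy = [0, 1, 0, 1]

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)
model.X_columns = list(X_toy.columns)  # same metadata idea as in the cell above

path = Path(tempfile.mkdtemp()) / "toy_model.pkl"  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

# New data may arrive with columns in a different order; realign before predicting.
X_new = pd.DataFrame({"QUARTER_2": [1], "DISTANCE_LG": [6.0]})
X_new = X_new[loaded.X_columns]
print(loaded.predict(X_new))
```

Pickled scikit-learn models are generally only safe to load with the same library version that wrote them, which is why recording `sklearn_version` alongside the model is useful.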