# Capstone 2 on RDU Airline Delays - Modeling
10/14/22

We have wrangled, explored, and preprocessed our data. Now let's create some models to predict whether or not a flight will be delayed/cancelled.

In [1]:
#Import necessary packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate, RandomizedSearchCV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score

In [10]:
# load dataset

X_train = pd.read_csv('X_train.csv')
X_test = pd.read_csv('X_test.csv')
y_train = pd.read_csv('y_train.csv')
y_test = pd.read_csv('y_test.csv')

In [11]:
X_train.drop(['Unnamed: 0'], axis=1, inplace=True)
X_test.drop(['Unnamed: 0'], axis=1, inplace=True)
y_train.drop(['Unnamed: 0'], axis=1, inplace=True)
y_test.drop(['Unnamed: 0'], axis=1, inplace=True)

In [21]:
# Flatten the single-column target frames to 1-D arrays, the shape scikit-learn expects for y
y_train = y_train.to_numpy().ravel()
y_test = y_test.to_numpy().ravel()
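
The drop-then-flatten steps above can also be written more compactly. The following is a minimal sketch, assuming the `Unnamed: 0` column is just the index that `to_csv()` wrote out by default. The in-memory CSV and the `DELAY_CLASS` column name are hypothetical stand-ins, not the real files.

```python
import io
import pandas as pd

# Hypothetical stand-in for y_train.csv, written with the default to_csv() index
csv_text = ",DELAY_CLASS\n0,0\n1,2\n2,1\n"

# index_col=0 reads the saved index back as the index, so no 'Unnamed: 0' column appears
y_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Flatten the single-column frame to a 1-D array for use as y
y = y_df.to_numpy().ravel()
print(y_df.columns.tolist(), y.shape)
```

Saving the splits with `to_csv(..., index=False)` in the preprocessing notebook would avoid the extra column in the first place.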

For reference, here is the list of columns as they were before one-hot encoding with get_dummies:

         QUARTER              52645 non-null  category
         MONTH                52645 non-null  category
         DAY_OF_MONTH         52645 non-null  category
         DAY_OF_WEEK          52645 non-null  category
         CARRIER              52645 non-null  object  
         FL_NUM               52645 non-null  object  
         DEST                 52645 non-null  object  
         CRS_ELAPSED_TIME_LG  52645 non-null  float64 
         DISTANCE_LG          52645 non-null  float64 
         CARRIER_DELAY        52645 non-null  category
         WEATHER_DELAY        52645 non-null  category
         NAS_DELAY            52645 non-null  category
         SECURITY_DELAY       52645 non-null  category
         LATE_AIRCRAFT_DELAY  52645 non-null  category
         DEP_TIME_BINS        52645 non-null  object  
         ARR_TIME_BINS        52626 non-null  object  

## Models

Since we're trying to solve a multiclass classification problem, we can use the following models:

1. Random Forest
2. K Nearest Neighbors (KNN)

The data has already been preprocessed, scaled, and standardized. For each model we tune hyperparameters with RandomizedSearchCV using 5-fold cross-validation (`cv=5`). For a classifier, scikit-learn uses stratified folds when `cv` is given as an integer.
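
The cells below pass the estimator directly to RandomizedSearchCV. As an illustration of how a pipeline and an explicit fold splitter could be combined with the search, here is a minimal sketch. It runs on synthetic data from `make_classification` rather than the flight data, and the scaler step, the reduced parameter grid, and the `random_state` values are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for X_train / y_train (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))

# Parameter names are the lowercased step name, two underscores, then the parameter
param_distributions = {
    "randomforestclassifier__n_estimators": np.arange(25, 100, 25),
    "randomforestclassifier__criterion": ["gini", "entropy"],
}

# Explicit stratified 5-fold splitter with shuffling
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=4, cv=cv, scoring="accuracy", random_state=0
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping preprocessing inside the pipeline matters mainly when the scaler is fit during the search. Since the data here was scaled beforehand, the simpler form used in this notebook also works.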

## Random Forest

In [34]:
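RF = RandomForestClassifier()
# Note: for classifiers 'auto' means the same as 'sqrt'. 'auto' was deprecated in
# scikit-learn 1.1 and removed in 1.3, so on newer versions keep only 'sqrt'.
params = {"n_estimators": np.arange(25, 500, 25),
          "criterion": ["gini", "entropy"],
          "max_features": ['auto', 'sqrt']}
# With the default n_iter=10, the search samples 10 parameter combinations
RF_cv = RandomizedSearchCV(RF, params, cv=5)
RF_cv.fit(X_train, y_train)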

RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(),
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_features': ['auto', 'sqrt'],
                                        'n_estimators': array([ 25,  50,  75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325,
       350, 375, 400, 425, 450, 475])})

In [35]:
print("Tuned Random Forest Classifier Parameters: {}".format(RF_cv.best_params_))
print("Tuned Random Forest Classifier Best Accuracy Score: {}".format(RF_cv.best_score_))

Tuned Random Forest Classifier Parameters: {'n_estimators': 200, 'max_features': 'auto', 'criterion': 'gini'}
Tuned Random Forest Classifier Best Accuracy Score: 0.849986028321324


Let's calculate some metrics for this random forest classifier: accuracy, precision, and ROC-AUC score.

In [50]:
y_pred_rf = RF_cv.predict(X_test)
y_probs_rf = RF_cv.predict_proba(X_test)

In [53]:
print("Accuracy Score on Test Set for Random Forest Classifier: {}".format(accuracy_score(y_test, y_pred_rf)))
print("Precision Score - weighted averaged on Test Set for Random Forest Classifier: {}".format(precision_score(y_test, y_pred_rf, average='weighted')))
print("ROC-AUC score - weighted averaged on Test Set for Random Forest Classifier: {}".format(roc_auc_score(y_test, y_probs_rf, average='weighted', multi_class='ovo')))

Accuracy Score on Test Set for Random Forest Classifier: 0.8504786506609938
Precision Score - weighted averaged on Test Set for Random Forest Classifier: 0.824977037253275
ROC-AUC score - weighted averaged on Test Set for Random Forest Classifier: 0.8784947327590824
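
To make the `average='weighted'` setting concrete: weighted precision is the per-class precision averaged with each class weighted by its support (the number of true samples of that class). The sketch below works this out by hand with NumPy on a tiny set of hypothetical labels that are not taken from the flight data.

```python
import numpy as np

# Tiny hypothetical labels, not taken from the flight data
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 0, 2, 2])

classes = np.unique(y_true)
per_class_precision = []
support = []
for c in classes:
    predicted_c = y_pred == c
    true_positive = np.sum(predicted_c & (y_true == c))
    # Precision for class c: correct predictions of c / all predictions of c
    per_class_precision.append(true_positive / predicted_c.sum())
    # Support: number of true samples of class c
    support.append(np.sum(y_true == c))

# Support-weighted mean of the per-class precisions
weighted_precision = np.average(per_class_precision, weights=support)
print(per_class_precision, weighted_precision)
```

Because the weights follow class frequency, a large majority class dominates the weighted score, so per-class values are worth checking when the classes are imbalanced. This toy example assumes every class is predicted at least once; otherwise the division would be by zero.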


## K Nearest Neighbors

In [32]:
knn = KNeighborsClassifier()
params = {"n_neighbors": np.arange(1, 50, 2),
          "weights": ['uniform', 'distance'],
          "p": [1, 2]}
knn_cv = RandomizedSearchCV(knn, params, cv=5)
knn_cv.fit(X_train, y_train)

RandomizedSearchCV(cv=5, estimator=KNeighborsClassifier(),
                   param_distributions={'n_neighbors': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33,
       35, 37, 39, 41, 43, 45, 47, 49]),
                                        'p': [1, 2],
                                        'weights': ['uniform', 'distance']})

In [33]:
print("Tuned KNN Classifier Parameters: {}".format(knn_cv.best_params_))
print("Tuned KNN Classifier Best Accuracy Score: {}".format(knn_cv.best_score_))

Tuned KNN Classifier Parameters: {'weights': 'distance', 'p': 1, 'n_neighbors': 15}
Tuned KNN Classifier Best Accuracy Score: 0.8191626704971391


In [54]:
y_pred_knn = knn_cv.predict(X_test)
y_probs_knn = knn_cv.predict_proba(X_test)

In [55]:
print("Accuracy Score on Test Set for KNN Classifier: {}".format(accuracy_score(y_test, y_pred_knn)))
print("Precision Score - weighted averaged on Test Set for KNN Classifier: {}".format(precision_score(y_test, y_pred_knn, average='weighted')))
print("ROC-AUC score - weighted averaged on Test Set for KNN Classifier: {}".format(roc_auc_score(y_test, y_probs_knn, average='weighted', multi_class='ovo')))

Accuracy Score on Test Set for KNN Classifier: 0.8130223370308464
Precision Score - weighted averaged on Test Set for KNN Classifier: 0.7442509100408177
ROC-AUC score - weighted averaged on Test Set for KNN Classifier: 0.7082888419437557


## Conclusion

On all three test-set metrics (accuracy, weighted precision, and weighted ROC-AUC), the Random Forest Classifier outperforms the KNN Classifier.

For the RF Classifier we can use the hyperparameters found through RandomizedSearchCV: {'n_estimators': 200, 'max_features': 'auto', 'criterion': 'gini'}. For a classifier, 'auto' is equivalent to 'sqrt'.

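The Random Forest accuracy on the test set (0.8505) is marginally higher than the mean cross-validated accuracy on the training data (0.8500), which suggests the tuned model is not overfitting to the training folds.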
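
The test-set metrics printed earlier can be put side by side to make the comparison explicit. This is a small sketch using pandas; the values are copied from the outputs above and rounded to four decimals, and the column names are chosen here for illustration.

```python
import pandas as pd

# Test-set metrics copied from the printed outputs above, rounded to 4 decimals
results = pd.DataFrame(
    {
        "accuracy": [0.8505, 0.8130],
        "precision_weighted": [0.8250, 0.7443],
        "roc_auc_ovo_weighted": [0.8785, 0.7083],
    },
    index=["Random Forest", "KNN"],
)

# Positive values mean Random Forest scored higher on that metric
difference = results.loc["Random Forest"] - results.loc["KNN"]
print(results)
print(difference.round(4))
```

The gap is smallest for accuracy and largest for ROC-AUC.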