# Sprint 13, Task 1

## Level 1

### Exercise 1

Create at least three different classification models that predict flight delay (ArrDelay) as accurately as possible from the DelayedFlights.csv dataset. Treat a flight as late when ArrDelay > 0.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics

In [2]:
dades = pd.read_csv("../Sprint 11/DelayedFlights.csv")

In [3]:
print(len(dades["ArrDelay"]))
print(dades["ArrDelay"].value_counts()[0])

# Only 27k of the roughly 1.94M flights have an ArrDelay of exactly 0, which makes sense since this is a delay dataset

####################################################################################################
#------------------------------------ IMPORTANT ---------------------------------------------------#
####################################################################################################

# This is an important limitation: all 27k no-delay cases must still have had a Departure Delay to be
# included in this dataset. As a consequence, our predictions will most likely be less accurate, since our
# pool of negative cases (no delay) is already severely biased (they were delayed on departure)

1936758
27040


In [4]:
# Drop columns with little explanation potential

dades.drop(columns=["Unnamed: 0", "Year", "FlightNum", "TailNum", "Cancelled", "CancellationCode"], inplace=True)

In [5]:
# Fill NaN entries with information from Departure Delay

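# Assign the result back instead of calling fillna(inplace=True) on a column selection
dades["ArrDelay"] = dades["ArrDelay"].fillna(dades["DepDelay"])

# Estimated elapsed time, used only where the actual value is missing
delay = dades["CRSElapsedTime"] + dades["DepDelay"]

dades["ActualElapsedTime"] = dades["ActualElapsedTime"].fillna(delay)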

In [6]:
# Fill taxiing NaN values with mean value

dades["TaxiIn"] = dades["TaxiIn"].fillna(dades["TaxiIn"].mean())

dades["TaxiOut"] = dades["TaxiOut"].fillna(dades["TaxiOut"].mean())


In [7]:
# ArrDelay no longer has NaN values, since nulls were imputed with the corresponding Departure Delay entry

dades.isnull().sum()

Month                     0
DayofMonth                0
DayOfWeek                 0
DepTime                   0
CRSDepTime                0
ArrTime                7110
CRSArrTime                0
UniqueCarrier             0
ActualElapsedTime       198
CRSElapsedTime          198
AirTime                8387
ArrDelay                  0
DepDelay                  0
Origin                    0
Dest                      0
Distance                  0
TaxiIn                    0
TaxiOut                   0
Diverted                  0
CarrierDelay         689270
WeatherDelay         689270
NASDelay             689270
SecurityDelay        689270
LateAircraftDelay    689270
dtype: int64

In [8]:
# Then, we transform Arrival Delay into a 0/1 variable

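# Note: x == 0 labels only exactly on-time arrivals as 0, so any early arrival (negative ArrDelay) is labelled 1.
# The exercise's rule (ArrDelay > 0) would be (dades["ArrDelay"] > 0).astype(int); the outputs shown here come
# from the x == 0 rule
dades["is_delay"] = dades["ArrDelay"].apply(lambda x: 0 if x==0 else 1)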

print(dades["is_delay"].value_counts()[0] / len(dades["is_delay"]))
print(dades["is_delay"].value_counts()[1] / len(dades["is_delay"]))

# These are the class proportions: 98.6% of cases saw a delay, 1.4% did not. That means a model's accuracy has to
# beat 98.6% for it to be any good

0.013961475827129668
0.9860385241728703
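
The 98.6% baseline can be reproduced with a dummy model. The following is a minimal sketch that is not part of the original notebook: `X_demo` and `y_demo` are synthetic stand-ins (assumed to contain about 1.4% zeros) rather than `dades`, and the baseline is scikit-learn's `DummyClassifier` with `strategy="most_frequent"`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic labels with roughly the class balance reported above (assumption)
rng = np.random.default_rng(0)
y_demo = (rng.random(100_000) > 0.014).astype(int)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_base)
baseline_balanced = balanced_accuracy_score(y_demo, y_base)
print(baseline_accuracy, baseline_balanced)
```

Balanced accuracy averages the recall of both classes, so a model that never predicts the minority class stays at 0.5 however high its plain accuracy is.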


In [9]:
# Rescale potentially good variables into a 0-1 scale: MinMaxScaler

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaled_df = scaler.fit_transform(dades[["is_delay", "DepDelay", "Distance", "TaxiIn", "TaxiOut"]])


In [10]:
scaled_df = pd.DataFrame(scaled_df, columns = ["is_delay", "DepDelay", "Distance", "TaxiIn", "TaxiOut"])
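
One possible refinement, sketched here under assumptions rather than taken from the notebook: the scaler above is fitted on the full dataset, so its minimum and maximum also come from rows that later land in the test split. Fitting it on the training rows only avoids that. `X_demo` and `y_demo` are synthetic stand-ins; a 0/1 target such as `is_delay` does not need scaling at all.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic features and an imbalanced 0/1 target (assumption, not the flight data)
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=50, scale=20, size=(1000, 4))
y_demo = (rng.random(1000) > 0.1).astype(int)

# Split first, then learn the scaling from the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=11, stratify=y_demo)

scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test values may fall slightly outside [0, 1]

print(X_tr_scaled.min(), X_tr_scaled.max())
print(X_te_scaled.min(), X_te_scaled.max())
```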

_________________________________________________________________________________________________________
#########################################################################################################
_________________________________________________________________________________________________________

In [11]:
# First model: Logistic Regression

y = scaled_df["is_delay"]
X = scaled_df[["DepDelay", "Distance", "TaxiIn", "TaxiOut"]]

X_train_log, X_test_log, y_train_log, y_test_log = train_test_split(X, y, random_state = 11, stratify= scaled_df["is_delay"])

# We stratify on flight delay to maintain stable class proportions in both splits


In [12]:
# We fit and predict the model

from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression()

logistic.fit(X_train_log, y_train_log)

y_pred_log = logistic.predict(X_test_log)

In [13]:
# Accuracy Score

metrics.accuracy_score(y_test_log, y_pred_log)

# In principle, a very good score; however, it matches almost exactly the proportion of Delay cases in the
# dataset overall (no model), which should make us wary

0.9860385385902228

In [14]:
# Confusion Matrix

metrics.confusion_matrix(y_test_log, y_pred_log)

# As can be seen, this is a terrible model for predicting on-time arrivals (class 0). Not a single one of
# them was predicted by the model, and yet accuracy was excellent

array([[     0,   6760],
       [     0, 477430]])
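
Reading the matrix row by row makes the problem explicit. The sketch below only re-uses the four counts printed above (rows are true classes, columns are predicted classes, as in scikit-learn's layout); the variable names are illustrative.

```python
import numpy as np

# Counts copied from the confusion matrix above
cm = np.array([[0, 6760],
               [0, 477430]])
tn, fp, fn, tp = cm.ravel()

recall_no_delay = tn / (tn + fp)   # share of true class 0 predicted as 0
recall_delay = tp / (tp + fn)      # share of true class 1 predicted as 1
accuracy = (tn + tp) / cm.sum()
balanced_accuracy = (recall_no_delay + recall_delay) / 2

print(recall_no_delay, recall_delay, accuracy, balanced_accuracy)
```

Recall is 0 for the no-delay class and 1 for the delay class, so balanced accuracy is 0.5 even though plain accuracy is 0.986.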

In [15]:
# F1 Score

metrics.f1_score(y_test_log, y_pred_log)

# Very high F1 score, but it is exactly what a dummy model that always predicts Delay would obtain

0.9929701961273684
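
The F1 value above can be reproduced from the confusion-matrix counts alone, which shows that it describes the delay class only. This is a small sketch with labels rebuilt from those counts; `pos_label`, `average` and `zero_division` are standard `f1_score` parameters, while the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score

# Rebuild labels from the counts: 6760 true zeros, 477430 true ones, all predicted as 1
y_true = np.repeat([0, 1], [6760, 477430])
y_pred = np.ones_like(y_true)

f1_delay = f1_score(y_true, y_pred)                                   # positive class = 1
f1_no_delay = f1_score(y_true, y_pred, pos_label=0, zero_division=0)  # positive class = 0
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f1_delay, f1_no_delay, f1_macro)
```

With the minority class as the positive label, F1 drops to 0, and the macro average is half of the delay-class F1.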

In [16]:
# AUC score

y_pred_prob_log = logistic.predict_proba(X_test_log)[:, 1] 

    # These are the predicted probabilities that each instance is a Delay

metrics.roc_auc_score(y_test_log, y_pred_prob_log)

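# An AUC of 0.82 is well above the 0.5 of a random ranking, so the predicted probabilities do separate the classes
# to some extent; it is the default 0.5 decision threshold that sends every prediction to the majority class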

0.8221003917114402
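
Since AUC is computed from the predicted probabilities rather than from the 0/1 predictions, one option is to move the decision threshold instead of keeping the default 0.5. The following is a hedged sketch on synthetic data (`make_classification` with an assumed 3% minority class), choosing the threshold that maximises Youden's J (TPR minus FPR) on a validation split. It is not the approach used in the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the flight data (assumption)
X_demo, y_demo = make_classification(
    n_samples=20000, n_features=4, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, weights=[0.03, 0.97], random_state=0,
)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_val = clf.predict_proba(X_val)[:, 1]

# Pick the threshold with the largest TPR - FPR on the validation split
fpr, tpr, thresholds = roc_curve(y_val, proba_val)
threshold = thresholds[np.argmax(tpr - fpr)]

def youden_j(y_true, y_hat):
    """Return TPR - FPR for hard 0/1 predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    return tp / (tp + fn) - fp / (fp + tn)

j_default = youden_j(y_val, clf.predict(X_val))                    # default 0.5 threshold
j_moved = youden_j(y_val, (proba_val >= threshold).astype(int))    # tuned threshold
print(threshold, j_default, j_moved)
```

The tuned threshold should then be evaluated on a separate test split, since it was selected on the validation data.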

In [17]:
# All in all, a plain logistic regression is not a good model for predicting Arrival Delay here: it assigns every
# instance to the majority class and never identifies the minority class. Further hyperparameter tuning will be needed
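
One hyperparameter that targets this behaviour directly is `class_weight`. The sketch below is an illustration on synthetic data, not a result from this notebook: it compares minority-class recall for a default `LogisticRegression` and for one with `class_weight="balanced"`. The dataset and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the flight data (assumption)
X_demo, y_demo = make_classification(
    n_samples=20000, n_features=4, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, weights=[0.03, 0.97], random_state=11,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=11)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# "balanced" reweights samples inversely to class frequency
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Recall of the minority class (label 0) for both models
recall0_plain = recall_score(y_te, plain.predict(X_te), pos_label=0)
recall0_balanced = recall_score(y_te, balanced.predict(X_te), pos_label=0)
print(recall0_plain, recall0_balanced)
```

Reweighting usually trades some majority-class accuracy for minority-class recall, so both should be reported together.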

_________________________________________________________________________________________________________
#########################################################################################################
_________________________________________________________________________________________________________

In [18]:
# New split, this time with a smaller sample to make fitting the SVM faster

X_train_svm, X_test_svm, y_train_svm, y_test_svm = train_test_split(X, y, train_size=75000, test_size=25000, random_state = 13, stratify= scaled_df["is_delay"])


In [19]:
# Support Vector Machine:

from sklearn import svm

vector = svm.SVC(kernel='linear', probability=True)

# Model fit:

vector.fit(X_train_svm, y_train_svm)

# Model Execution time: 1m 10s

SVC(kernel='linear', probability=True)

In [20]:
# Model prediction

y_pred_svm = vector.predict(X_test_svm)

In [21]:
# Accuracy Score

metrics.accuracy_score(y_test_svm, y_pred_svm)

# Like the logistic model, it coincides quite closely with the majority class; suspicious

0.98604

In [22]:
# Confusion Matrix

metrics.confusion_matrix(y_test_svm, y_pred_svm)

# Once again, the model is very bad at predicting the minority instances, which happen to be what we care about

array([[    0,   349],
       [    0, 24651]])

In [23]:
# F1 score:

metrics.f1_score(y_test_svm, y_pred_svm)

# Almost perfect F1 score, yet other metrics show its bias

0.992970937141246

In [24]:
# AUC score:

y_pred_prob_svm = vector.predict_proba(X_test_svm)[:, 1] 

metrics.roc_auc_score(y_test_svm, y_pred_prob_svm)

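# Lowest AUC so far. AUC should be read against 0.5 (random ranking), not against the majority-class proportion,
# so 0.78 still indicates some ability to rank delayed flights above on-time ones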

0.7844702883194961
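
`roc_auc_score` also accepts the raw scores returned by `decision_function`, so an AUC can be computed without `probability=True`, which makes fitting slower because scikit-learn calibrates the probabilities with internal cross-validation. A minimal sketch on synthetic data follows; `vector_demo` and the dataset are assumptions, and the resulting AUC will not match the value above.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Small synthetic, imbalanced dataset (assumption, not the flight data)
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=4, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, weights=[0.05, 0.95], random_state=13,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=13)

# Same kernel as above, but without probability calibration
vector_demo = svm.SVC(kernel="linear").fit(X_tr, y_tr)

# Signed distance to the separating hyperplane; larger means "more likely class 1"
scores = vector_demo.decision_function(X_te)
auc_demo = roc_auc_score(y_te, scores)
print(auc_demo)
```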

_________________________________________________________________________________________________________
#########################################################################################################
_________________________________________________________________________________________________________

In [25]:
# XGBoost

import xgboost as xgb

X_train_xgb, X_test_xgb, y_train_xgb, y_test_xgb = train_test_split(X, y, random_state = 55, stratify= scaled_df["is_delay"])

xclassif = xgb.XGBClassifier(random_state=3)

# Fit the model

xclassif.fit(X_train_xgb, y_train_xgb)

# Execution time: 1m 45s





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1, random_state=3,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [26]:
# Predict with the fitted model:

y_pred_xgb = xclassif.predict(X_test_xgb)

In [27]:
# Accuracy Score

metrics.accuracy_score(y_test_xgb, y_pred_xgb)

# Same as previous models

0.9860385385902228

In [28]:
# Confusion Matrix

metrics.confusion_matrix(y_test_xgb, y_pred_xgb)

# Same as previous models

array([[     0,   6760],
       [     0, 477430]])

In [29]:
# F1 score:

metrics.f1_score(y_test_xgb, y_pred_xgb)

# Same as previous models

0.9929701961273684

In [30]:
# AUC score:

y_pred_prob_xgb = xclassif.predict_proba(X_test_xgb)[:, 1] 

metrics.roc_auc_score(y_test_xgb, y_pred_prob_xgb)

# Best AUC of the three models (Logistic Regression: 0.822, SVM: 0.784)

0.852250652749119
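
The fitted model above shows `scale_pos_weight=1`, meaning both classes are weighted equally. A commonly suggested starting value for imbalanced data is the ratio of negative to positive instances. The arithmetic below uses the class counts printed at the top of the notebook; since class 1 (delay) is the majority here, the ratio is well below 1. This is only a sketch of the calculation, and the commented-out constructor call is an untested suggestion.

```python
# Counts taken from the notebook output: total rows and rows with ArrDelay == 0
n_total = 1936758
n_no_delay = 27040
n_delay = n_total - n_no_delay

# Ratio of negative (class 0) to positive (class 1) instances
scale_pos_weight = n_no_delay / n_delay
print(n_delay, round(scale_pos_weight, 5))

# Hypothetical usage, not run here:
# xclassif = xgb.XGBClassifier(random_state=3, scale_pos_weight=scale_pos_weight)
```

Because the splits are stratified, the same ratio holds approximately for the training set.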

_________________________________________________________________________________________________________
#########################################################################################################
_________________________________________________________________________________________________________

### Exercise 2

Compare the previous classification models using accuracy, a confusion matrix and other more advanced techniques.

In [31]:
# Dummy baseline reminder (proportion of majority class in original dataset): 0.986

In [32]:
# Logistic Regression:

    # Accuracy Score: 0.986
    # Confusion Matrix: 0 correct predictions of minority class
    # F1 Score: 0.993
    # AUC Score: 0.822

In [33]:
# Support Vector Machine:

    # Accuracy Score: 0.986
    # Confusion Matrix: 0 correct predictions of minority class
    # F1 Score: 0.993
    # AUC Score: 0.784

In [34]:
# XGBoost:

    # Accuracy Score: 0.986
    # Confusion Matrix: 0 correct predictions of minority class
    # F1 Score: 0.993
    # AUC Score: 0.852
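
The figures above can be collected into a single table for a side-by-side view. This sketch simply re-enters the rounded values reported in the notebook; the `minority_recall` column is derived from the confusion matrices (zero correct minority predictions), and the column names are illustrative.

```python
import pandas as pd

# Rounded metrics as reported in the cells above
summary = pd.DataFrame(
    {
        "accuracy": [0.986, 0.986, 0.986],
        "minority_recall": [0.0, 0.0, 0.0],
        "f1": [0.993, 0.993, 0.993],
        "auc": [0.822, 0.784, 0.852],
    },
    index=["Logistic Regression", "Support Vector Machine", "XGBoost"],
)

# AUC is the only metric that distinguishes the three models
ranked = summary.sort_values("auc", ascending=False)
print(ranked)
```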

In [35]:
################################ CONCLUSIONS ################################################################

# Judging by the accuracy scores alone, one might think that the models used are quite good. However, that could
# not be further from the truth. The raw models (no hyperparameter tuning) never identify the minority class of
# our target variable (Arrival Delay). The extremely high accuracy and F1 scores are due to the models predicting
# the majority class by default. That is not what a classification model is expected to do.

# Hence, the notebook S13 T01 P2 will try to improve performance through hyperparameter tuning