# Predicting Significant Flight Delays using Supervised Learning

In this notebook I present my work in ___________________________.

## Business Understanding

Many consultants travel frequently over long distances for business purposes. Managers at a consulting firm might be interested in how to minimize -------------------------- risk when booking flights. One pertinent concern for these managers is how to reduce the risk of significant flight delay to ensure that consultants can utilize air travel reliably and efficiently.

My goal is to predict whether a given flight will be significantly delayed given several known factors about the flight such as airline, scheduled time of departure, and ---------------------------------. An effective model of this kind will assist managers in understanding ---------------------- so that they can make more informed decisions when booking flights.

## Data Understanding


I obtained the data, which are in the public domain, from https://www.kaggle.com/datasets/robikscube/flight-delay-dataset-20182022. Each row pertains to a flight --------------------. The data include several promising features such as --------------------------------.

I will build my model using a random sample from a year's worth of raw data, spanning August 2021 to July 2022.
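The sampling step described above might look roughly like the sketch below. It runs on a small synthetic frame rather than the raw files, and the `FlightDate` column name, the toy `Distance` values, and the sample size of 100 are all assumptions made for illustration; the notebook's own `sample.csv` may have been produced differently.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw flight records; the FlightDate column name is an assumption.
rng = np.random.default_rng(100)
raw = pd.DataFrame({
    "FlightDate": pd.date_range("2021-01-01", "2022-12-31", freq="D"),
})
raw["Distance"] = rng.integers(100, 3000, size=len(raw))

# Keep only rows dated August 2021 through July 2022.
start, end = pd.Timestamp("2021-08-01"), pd.Timestamp("2022-07-31")
one_year = raw.loc[raw["FlightDate"].between(start, end)]

# Draw a reproducible random sample (100 rows here, purely for illustration).
sample = one_year.sample(n=100, random_state=100).reset_index(drop=True)
print(sample["FlightDate"].min(), sample["FlightDate"].max())
```

Fixing `random_state` keeps the sample reproducible, so later cells see the same rows on every run.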

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [4]:
df_original = pd.read_csv('sample.csv')


### Data Cleaning and Feature Selection

In [5]:
to_keep = ['Quarter',
           'Month', 'DayOfWeek', 'Operating_Airline ',
           'Origin', 'Dest',
           'DepTimeBlk', 'ArrDel15', 'ArrTimeBlk', 'Cancelled', 'Distance']
len(to_keep)

11

In [6]:
df = df_original[to_keep]

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Quarter             100000 non-null  int64  
 1   Month               100000 non-null  int64  
 2   DayOfWeek           100000 non-null  int64  
 3   Operating_Airline   100000 non-null  object 
 4   Origin              100000 non-null  object 
 5   Dest                100000 non-null  object 
 6   DepTimeBlk          100000 non-null  object 
 7   ArrDel15            97266 non-null   float64
 8   ArrTimeBlk          100000 non-null  object 
 9   Cancelled           100000 non-null  float64
 10  Distance            100000 non-null  float64
dtypes: float64(3), int64(3), object(5)
memory usage: 8.4+ MB
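The summary above shows that `ArrDel15` is the only column with missing values. One way to check whether those gaps line up with cancelled flights is sketched below. The frame `toy` is hypothetical data that only reuses the notebook's column names, so the counts it prints say nothing about the real sample.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that reuse the notebook's column names.
toy = pd.DataFrame({
    "ArrDel15":  [0.0, 1.0, np.nan, np.nan, 0.0],
    "Cancelled": [0.0, 0.0, 1.0, 0.0, 0.0],
})

# Cross-tabulate "ArrDel15 is missing" against the Cancelled flag.
missing = toy["ArrDel15"].isna().rename("ArrDel15 missing")
cancelled = (toy["Cancelled"] == 1).rename("Cancelled")
print(pd.crosstab(missing, cancelled))

# Rows whose delay flag is missing even though the flight was not cancelled.
# These would need a separate explanation (possibly diversions; that is an assumption).
unexplained = toy[missing & ~cancelled]
print(len(unexplained), "missing ArrDel15 value(s) not explained by cancellation")
```

If the real data contain such unexplained rows, they are labelled 0 by the target rule in the next cell, which may or may not be the intended treatment.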


In [8]:
# Create target variable: 1 if the flight arrived 15+ minutes late (ArrDel15) or was cancelled, 0 otherwise.
# assign() and drop() return new DataFrames, which avoids pandas' SettingWithCopyWarning
# and replaces the slow row-by-row loop with a vectorized comparison.
is_delayed = (df['ArrDel15'] == 1) | (df['Cancelled'] == 1)
df = df.assign(Target=is_delayed.astype(int))
df = df.drop(['ArrDel15', 'Cancelled'], axis=1)


In [9]:
df

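| | Quarter | Month | DayOfWeek | Operating_Airline | Origin | Dest | DepTimeBlk | ArrTimeBlk | Distance | Target |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 5 | 9E | DTW | TYS | 2000-2059 | 2200-2259 | 443.0 | 0 |
| 1 | 1 | 1 | 2 | UA | DEN | FLL | 0900-0959 | 1500-1559 | 1703.0 | 1 |
| 2 | 1 | 1 | 7 | AA | PHX | MCO | 1600-1659 | 2200-2259 | 1849.0 | 0 |
| 3 | 3 | 8 | 2 | DL | SLC | ATL | 0800-0859 | 1300-1359 | 1590.0 | 0 |
| 4 | 4 | 11 | 2 | YV | IAH | DFW | 1800-1859 | 1900-1959 | 224.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99995 | 4 | 11 | 3 | WN | AUS | MDW | 1300-1359 | 1500-1559 | 972.0 | 0 |
| 99996 | 2 | 6 | 5 | OO | BHM | DEN | 1500-1559 | 1700-1759 | 1083.0 | 0 |
| 99997 | 2 | 4 | 5 | DL | SLC | DEN | 2100-2159 | 2200-2259 | 391.0 | 0 |
| 99998 | 3 | 8 | 5 | DL | ATL | BNA | 2000-2059 | 2000-2059 | 214.0 | 0 |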


In [10]:
from sklearn.preprocessing import OneHotEncoder


X = df.drop(['Target'], axis=1)
y = df['Target']

X_cat = X.drop(['Distance'], axis=1)
X_num = X[['Distance']]

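# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer versions need sparse_output=False.
ohe = OneHotEncoder(drop="first", sparse=False)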
ohe.fit(X_cat)

X_cat_ohe = pd.DataFrame(
    data=ohe.transform(X_cat),
    # columns=[{cat} for cat in ohe.categories_[][1:]],
    index=X_cat.index
)

X_final = pd.concat([X_num, X_cat_ohe], axis=1)
X_final

Unnamed: 0,Distance,0,1,2,3,4,5,6,7,8,...,810,811,812,813,814,815,816,817,818,819
0,443.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1703.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1849.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1590.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,224.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,972.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
99996,1083.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
99997,391.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
99998,214.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y, random_state=100)
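Because the classes are imbalanced (roughly 22% of test flights are delayed or cancelled), it may be worth keeping that ratio identical in the training and test sets. The sketch below shows the `stratify` argument of `train_test_split` on synthetic labels; `X_toy` and `y_toy` are made-up stand-ins, and this is an optional variation rather than what the split above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with about 22% positives, similar to the imbalance in the notebook.
y_toy = np.array([1] * 220 + [0] * 780)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the positive rate the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=100
)
print(y_tr.mean(), y_te.mean())
```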

In [12]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(criterion='entropy', random_state=100)

clf.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=100)

In [13]:
from sklearn.metrics import accuracy_score

y_preds = clf.predict(X_test)

print('Accuracy: ', accuracy_score(y_test, y_preds))

Accuracy:  0.6928


In [14]:
# from sklearn.model_selection import GridSearchCV

# clf_grid = DecisionTreeClassifier()

# param_grid = {
#     'criterion': ['gini', 'entropy'],
#     'max_depth': [5, 10, 20],
#     'min_samples_split': [2, 5, 10],
#     'min_samples_leaf': [2, 5, 10],
#     'class_weight':['balanced']
# }

# gs_tree = GridSearchCV(clf_grid, param_grid, cv=3, scoring='recall')
# gs_tree.fit(X_train, y_train)

# gs_tree.best_params_
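The grid search above is commented out, presumably because it is slow on the full feature matrix. A reduced, runnable version on synthetic data is sketched below to show the mechanics. The data from `make_classification` are not the flight features, and the smaller grid is an assumption chosen to keep the run short, so the parameters it selects carry no meaning for the real problem.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the flight data (not the real features).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=8, weights=[0.78, 0.22], random_state=100
)

# A reduced version of the grid from the commented-out cell.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10],
    "min_samples_leaf": [2, 10],
    "class_weight": ["balanced"],
}

# scoring="recall" selects the combination that catches the most positives.
gs = GridSearchCV(
    DecisionTreeClassifier(random_state=100), param_grid, cv=3, scoring="recall"
)
gs.fit(X_syn, y_syn)
print(gs.best_params_, round(gs.best_score_, 3))
```

Optimizing recall alone can favor models that flag most flights as delayed, so it helps to read the selected model's precision and accuracy alongside it.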

In [15]:
clf_best = DecisionTreeClassifier(criterion='gini', max_depth=5, min_samples_split=5, min_samples_leaf=5, class_weight='balanced', random_state=100)
clf_best.fit(X_train, y_train)
y_preds_best = clf_best.predict(X_test)

from sklearn.metrics import confusion_matrix
c = confusion_matrix(y_test, y_preds_best)
print('Test accuracy: ', accuracy_score(y_test, y_preds_best))
print('Train accuracy: ', accuracy_score(y_train, clf_best.predict(X_train)))
print("Recall: {}".format(c[1][1]/(c[1][1]+c[1][0])))

Test accuracy:  0.46444
Train accuracy:  0.46758666666666665
Recall: 0.7704561911658219


In [16]:

c

array([[ 7355, 12121],
       [ 1268,  4256]], dtype=int64)

In [29]:
def get_metrics(c):
    """Print summary ratios from a 2x2 confusion matrix (rows = actual, columns = predicted)."""
    tn = c[0][0]
    fp = c[0][1]
    fn = c[1][0]
    tp = c[1][1]

    # Note: the printed values are fractions between 0 and 1, not percentages.
    print("% Delayed/cancelled flights that were predicted correctly: {}".format(tp/(tp+fn)))  # recall
    print("% Non-delayed/cancelled flights that were predicted correctly: {}".format(tn/(tn+fp)))  # specificity
    print("% Predicted delays/cancellations that were delayed/cancelled: {}".format(tp/(tp+fp)))  # precision
    print("% Predicted non-delays/cancellations that were not delayed/cancelled: {}".format(tn/(tn+fn)))  # negative predictive value
    print("Overall accuracy: {}".format((tp+tn)/(tp+tn+fp+fn)))
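The five ratios printed by `get_metrics` have standard names: recall, specificity, precision, negative predictive value, and accuracy. The sketch below computes them under those names and applies them to the confusion matrix printed earlier for the tuned decision tree. The helper name `metrics_from_confusion` is hypothetical and not part of the notebook.

```python
import numpy as np

def metrics_from_confusion(c):
    """Return named ratios from a 2x2 confusion matrix (rows = actual, columns = predicted)."""
    (tn, fp), (fn, tp) = np.asarray(c)
    return {
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Confusion matrix printed earlier for the tuned decision tree.
c_tree = [[7355, 12121], [1268, 4256]]
tree_metrics = metrics_from_confusion(c_tree)
for name, value in tree_metrics.items():
    print(f"{name}: {value:.4f}")
```

Returning a dictionary instead of printing makes it straightforward to collect the results for several models into one comparison table.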

In [30]:
from sklearn.linear_model import LogisticRegression

# Instantiate the model
logreg = LogisticRegression(fit_intercept=False,
                            # C=1e12,
                            # solver='saga',
                            class_weight='balanced',
                            max_iter=1000)

# Fit the model
logreg.fit(X_train, y_train)

y_hat_train = logreg.predict(X_train)
y_hat_test = logreg.predict(X_test)

c = confusion_matrix(y_test, y_hat_test)

# Confusion matrix layout: c[0][0] = true negatives, c[1][0] = false negatives,
# c[1][1] = true positives, c[0][1] = false positives.

get_metrics(c)


% Delayed/cancelled flights that were predicted correctly: 0.6285300506879073
% Non-delayed/cancelled flights that were predicted correctly: 0.5927295132470733
% Predicted delays/cancellations that were delayed/cancelled: 0.3044545773412838
% Predicted non-delays/cancellations that were not delayed/cancelled: 0.8490732568402471
Overall accuracy: 0.60064


In [31]:
from sklearn.ensemble import RandomForestClassifier

# rfc_grid = RandomForestClassifier()

# param_grid_rfc = {
#     'n_estimators':[10],
#     'criterion': ['gini', 'entropy'],
#     'max_depth': [5, 10, 15, 20],
#     'min_samples_split': [2, 5, 10, 15],
#     'min_samples_leaf': [2, 5, 10, 15],
#     'class_weight':['balanced']
# }

# gs_rfc_tree = GridSearchCV(rfc_grid, param_grid_rfc, cv=3, scoring='recall')
# gs_rfc_tree.fit(X_train, y_train)

# gs_rfc_tree.best_params_

In [32]:
rfc = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=20, min_samples_split=2, min_samples_leaf=10, class_weight='balanced', random_state=100)

#rfc = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=5, min_samples_split=20, min_samples_leaf=2, class_weight='balanced', random_state=100)

rfc.fit(X_train, y_train)
y_preds_best = rfc.predict(X_test)
# print('Test accuracy: ', accuracy_score(y_test, y_preds_best))
# print('Train accuracy: ', accuracy_score(y_train, rfc.predict(X_train)))

c = confusion_matrix(y_test, y_preds_best)
get_metrics(c)

% Delayed/cancelled flights that were predicted correctly: 0.5997465604634323
% Non-delayed/cancelled flights that were predicted correctly: 0.6173238858081742
% Predicted delays/cancellations that were delayed/cancelled: 0.30772803269552296
% Predicted non-delays/cancellations that were not delayed/cancelled: 0.8446676970633694
Overall accuracy: 0.61344


In [21]:
c

array([[12023,  7453],
       [ 2211,  3313]], dtype=int64)

In [22]:
print("TP: {}\nTN:{}\nFP:{}\nFN:{}".format(c[1][1], c[0][0], c[0][1], c[1][0]))

TP: 3313
TN:12023
FP:7453
FN:2211


In [23]:
# #X_train_nb = X_train.drop(['Distance'], axis=1)
# #X_test_nb = X_test.drop(['Distance'], axis=1)


# from sklearn.naive_bayes import CategoricalNB

# X_train_nb, X_test_nb, y_train_nb, y_test_nb = train_test_split(to_enc, y, random_state=100)

# bnb = CategoricalNB()
# bnb.fit(X_train_nb, y_train_nb)

# y_hat_train = bnb.predict(X_train_nb)
# y_hat_test = bnb.predict(X_test_nb)

# c = confusion_matrix(y_test_nb, y_hat_test)

# print('Test accuracy: ', accuracy_score(y_test_nb, y_hat_test))
# print('Train accuracy: ', accuracy_score(y_train_nb, y_hat_train))

In [34]:
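# Share of non-delayed flights in the test set, i.e. the accuracy of always predicting "no delay"
# TODO: look into oversampling for naive Bayes
1-sum(y_test)/len(y_test)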

0.77904
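The 0.77904 figure is the accuracy a model would reach by always predicting "not delayed", so it serves as a baseline for the accuracies reported above. A minimal sketch of the same baseline using scikit-learn's `DummyClassifier` follows. The labels are synthetic but reuse the test set's class counts taken from the confusion matrices (19,476 negatives and 5,524 positives), and `X_toy` is a placeholder because the dummy model ignores its features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with the test set's class counts: 19,476 negatives and 5,524 positives.
y_toy = np.array([0] * 19476 + [1] * 5524)
X_toy = np.zeros((len(y_toy), 1))  # placeholder features, ignored by the dummy model

# Always predict the most frequent class ("not delayed").
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_base = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, y_base)
baseline_recall = recall_score(y_toy, y_base)
print("Baseline accuracy:", baseline_accuracy)
print("Baseline recall:", baseline_recall)
```

The baseline has high accuracy but zero recall, which is why the models above are judged mainly on recall rather than on accuracy alone.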