# Predicting Significant Flight Delays using Supervised Learning

In this notebook I present my work in ___________________________.

## Business Understanding

Many consultants travel frequently over long distances for business purposes. Managers at a consulting firm might be interested in how to minimize -------------------------- risk when booking flights. One pertinent concern for these managers is how to reduce the risk of significant flight delay to ensure that consultants can utilize air travel reliably and efficiently.

My goal is to predict whether a given flight will be significantly delayed given several known factors about the flight such as airline, scheduled time of departure, and ---------------------------------. An effective model of this kind will assist managers in understanding ---------------------- so that they can make more informed decisions when booking flights.

## Data Understanding


I obtained public-domain data from https://www.kaggle.com/datasets/robikscube/flight-delay-dataset-20182022. Each row pertains to a flight --------------------. The data includes several promising features such as --------------------------------.

I will build my model using a random sample drawn from one year of raw data, spanning August 2021 to July 2022.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
df_original = pd.read_csv('sample.csv')



### Data Cleaning and Feature Selection

In [3]:
to_keep = ['Year', 'Quarter', 'Month', 'DayOfWeek', 'Operating_Airline ', 'Origin', 'Dest', 'DepTimeBlk', 'ArrDel15', 'ArrTimeBlk', 'Cancelled', 'Distance']
len(to_keep)

12

In [4]:
df = df_original[to_keep]

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Year                25000 non-null  int64  
 1   Quarter             25000 non-null  int64  
 2   Month               25000 non-null  int64  
 3   DayOfWeek           25000 non-null  int64  
 4   Operating_Airline   25000 non-null  object 
 5   Origin              25000 non-null  object 
 6   Dest                25000 non-null  object 
 7   DepTimeBlk          25000 non-null  object 
 8   ArrDel15            24315 non-null  float64
 9   ArrTimeBlk          25000 non-null  object 
 10  Cancelled           25000 non-null  float64
 11  Distance            25000 non-null  float64
dtypes: float64(3), int64(4), object(5)
memory usage: 2.3+ MB


In [6]:
# Create target variable: 1 if the flight is significantly delayed
# (arrived 15+ minutes late, i.e. ArrDel15 == 1) or cancelled, 0 otherwise.
# Rows with a missing ArrDel15 that were not cancelled get 0, as in a row-by-row check.
is_delayed = (df['ArrDel15'] == 1) | (df['Cancelled'] == 1)
df = df.assign(Target=is_delayed.astype(int))
df = df.drop(columns=['ArrDel15', 'Cancelled'])


In [7]:
df

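| | Year | Quarter | Month | DayOfWeek | Operating_Airline | Origin | Dest | DepTimeBlk | ArrTimeBlk | Distance | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | 2 | 5 | 1 | AA | CLT | VPS | 0900-0959 | 0900-0959 | 460.0 | 0 |
| 1 | 2021 | 4 | 12 | 3 | WN | HOU | MSY | 1100-1159 | 1200-1259 | 302.0 | 0 |
| 2 | 2022 | 1 | 3 | 5 | AA | PHL | MCO | 0800-0859 | 1000-1059 | 861.0 | 1 |
| 3 | 2022 | 2 | 4 | 5 | DL | SDF | ATL | 1900-1959 | 2100-2159 | 321.0 | 0 |
| 4 | 2022 | 2 | 5 | 7 | AS | MCO | SFO | 1900-1959 | 2100-2159 | 2446.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24995 | 2021 | 4 | 10 | 1 | WN | CMH | RSW | 1400-1459 | 1700-1759 | 930.0 | 0 |
| 24996 | 2021 | 3 | 8 | 7 | 9E | ATL | LEX | 1000-1059 | 1100-1159 | 304.0 | 0 |
| 24997 | 2021 | 3 | 8 | 5 | OO | SLC | TWF | 2200-2259 | 2300-2359 | 175.0 | 1 |
| 24998 | 2022 | 2 | 5 | 5 | DL | ATL | ILM | 2000-2059 | 2200-2259 | 377.0 | 0 |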


In [8]:
from sklearn.preprocessing import OneHotEncoder


X = df.drop(['Target'], axis=1)
y = df['Target']

X_cat = X.drop(['Distance'], axis=1)
X_num = X[['Distance']]

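# Note: `sparse` was renamed to `sparse_output` in scikit-learn 1.2; adjust to the installed version.
ohe = OneHotEncoder(drop="first", sparse=False)
ohe.fit(X_cat)

X_cat_ohe = pd.DataFrame(
    data=ohe.transform(X_cat),
    # columns=ohe.get_feature_names_out(),  # optional: readable names instead of integer column labels
    index=X_cat.index
)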

X_final = pd.concat([X_num, X_cat_ohe], axis=1)
X_final

Unnamed: 0,Distance,0,1,2,3,4,5,6,7,8,...,768,769,770,771,772,773,774,775,776,777
0,460.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,302.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,861.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,321.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,2446.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,930.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
24996,304.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24997,175.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
24998,377.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [9]:
# ohe.categories_

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y, random_state=100)

In [11]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(criterion='entropy', random_state=100)

clf.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=100)

In [12]:
from sklearn.metrics import accuracy_score

y_preds = clf.predict(X_test)

print('Accuracy: ', accuracy_score(y_test, y_preds))

Accuracy:  0.69632


In [13]:
"""
TODO:
- Figure out OHE issue with values showing up in testing that are not in training set
- Implement grid search for decision tree
- Consider using logistic regression
"""


In [14]:
# from sklearn.model_selection import GridSearchCV

# clf_grid = DecisionTreeClassifier()

# param_grid = {
#     'criterion': ['gini', 'entropy'],
#     'max_depth': [1, 5, 10, 20],
#     'min_samples_split': [1.0, 2, 5, 10],
#     'min_samples_leaf': [1, 2, 5, 10],
#     'class_weight':['balanced']
# }

# gs_tree = GridSearchCV(clf_grid, param_grid, cv=3, scoring='f1')
# gs_tree.fit(X_train, y_train)

# gs_tree.best_params_
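A runnable version of the commented-out search might look like the sketch below. It runs on a small synthetic, imbalanced dataset from `make_classification` rather than the flight data, so the selected parameters say nothing about this project; the names `X_demo`, `y_demo`, `X_tr`, and `y_tr` are placeholders. One detail worth keeping in mind: scikit-learn reads a float `min_samples_split` as a fraction of the samples, so the sketch uses integers only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the flight data: roughly 80% negatives, 20% positives.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=100
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=100
)

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 5, 10],
    'min_samples_split': [2, 5, 10],   # integers: minimum number of samples to split a node
    'min_samples_leaf': [1, 5, 10],
    'class_weight': ['balanced'],
}

# 3-fold cross-validation on the training split only, scored by F1 on the positive class.
gs_tree = GridSearchCV(
    DecisionTreeClassifier(random_state=100), param_grid, cv=3, scoring='f1'
)
gs_tree.fit(X_tr, y_tr)

print(gs_tree.best_params_)
print('Mean cross-validated F1:', gs_tree.best_score_)
```

Scoring by F1 rather than accuracy is a design choice that suits an imbalanced target, since accuracy can look good for a model that rarely predicts the minority class.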

In [15]:
clf_best = DecisionTreeClassifier(criterion='gini', max_depth=5, min_samples_split=5, min_samples_leaf=5, class_weight='balanced', random_state=100)
clf_best.fit(X_train, y_train)
y_preds_best = clf_best.predict(X_test)
print('Test accuracy: ', accuracy_score(y_test, y_preds_best))
print('Train accuracy: ', accuracy_score(y_train, clf_best.predict(X_train)))

Test accuracy:  0.4048
Train accuracy:  0.4119466666666667


In [16]:
from sklearn.metrics import confusion_matrix
c = confusion_matrix(y_test, y_preds_best)

In [17]:
print("TP: {}\nTN: {}\nFP: {}\nFN: {}".format(c[1][1], c[0][0], c[0][1], c[1][0]))

TP: 1113
TN: 1417
FP: 3484
FN: 236


In [18]:
print("Percent of delayed/cancelled accurately predicted: {}%".format(c[1][1]/(c[1][1]+c[1][0])*100))
print("Percent of non-delayed/cancelled accurately predicted: {}%".format(c[0][0]/(c[0][0]+c[0][1])*100))

Percent of delayed/cancelled accurately predicted: 82.50555967383248%
Percent of non-delayed/cancelled accurately predicted: 28.912466843501328%
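The two percentages above are the recall and specificity of the tuned tree. As a small worked example, the remaining standard metrics can be derived from the same four confusion-matrix counts; the sketch below hard-codes the counts printed earlier and also computes the accuracy of a hypothetical baseline that always predicts "not delayed".

```python
# Counts copied from the confusion matrix of the tuned decision tree on the test set.
tp, tn, fp, fn = 1113, 1417, 3484, 236
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
recall = tp / (tp + fn)            # share of delayed/cancelled flights that were flagged
specificity = tn / (tn + fp)       # share of on-time flights that were left unflagged
precision = tp / (tp + fp)         # share of flagged flights that were actually delayed/cancelled
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority class (0, not delayed).
majority_baseline = (tn + fp) / total

print(f"accuracy={accuracy:.4f} recall={recall:.4f} specificity={specificity:.4f}")
print(f"precision={precision:.4f} f1={f1:.4f} majority_baseline={majority_baseline:.4f}")
```

The test set holds 1,349 delayed or cancelled flights out of 6,250, so the always-negative baseline scores about 0.784 on accuracy. That is higher than either decision tree's test accuracy, which suggests accuracy alone is a weak yardstick here; precision for the tuned tree is about 0.242.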


In [22]:
from sklearn.linear_model import LogisticRegression

# Work on copies so the assignments below do not trigger SettingWithCopyWarning.
X_train = X_train.copy()
X_test = X_test.copy()

# Rescale Distance by its range within each split.
# NOTE: the minimum is not subtracted, and the test set is divided by its own range
# rather than the training range, so the two splits are not on exactly the same scale.
X_train['Distance'] = X_train['Distance'] / (X_train['Distance'].max() - X_train['Distance'].min())
X_test['Distance'] = X_test['Distance'] / (X_test['Distance'].max() - X_test['Distance'].min())

# Instantiate the model: no intercept, effectively no regularization (very large C),
# and class weights balanced to offset the rarer delayed/cancelled class.
logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear', class_weight='balanced')

# Fit the model
logreg.fit(X_train, y_train)

y_hat_train = logreg.predict(X_train)
y_hat_test = logreg.predict(X_test)
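The scaling in the cell above divides `Distance` by the range of whichever split it is applied to. A more conventional alternative, sketched here under the assumption that min-max scaling is what was intended, is to fit scikit-learn's `MinMaxScaler` on the training split and reuse the learned minimum and maximum for the test split. The frames `train_demo` and `test_demo` hold made-up distances for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy distances, made up for illustration; in the notebook these would be X_train / X_test.
train_demo = pd.DataFrame({'Distance': [100.0, 500.0, 2500.0]})
test_demo = pd.DataFrame({'Distance': [300.0, 2700.0]})

scaler = MinMaxScaler()

# Learn min and max from the training split only, then map it onto [0, 1].
train_demo['Distance'] = scaler.fit_transform(train_demo[['Distance']]).ravel()

# Reuse the training min and max; test values outside the training range fall outside [0, 1].
test_demo['Distance'] = scaler.transform(test_demo[['Distance']]).ravel()

print(train_demo)
print(test_demo)
```

Applying this to the notebook's data would change the fitted model slightly, so the scores reported in the cells that follow would need to be regenerated.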


In [23]:
c = confusion_matrix(y_test, y_hat_test)
print("Percent of delayed/cancelled accurately predicted: {}%".format(c[1][1]/(c[1][1]+c[1][0])*100))
print("Percent of non-delayed/cancelled accurately predicted: {}%".format(c[0][0]/(c[0][0]+c[0][1])*100))

Percent of delayed/cancelled accurately predicted: 59.08080059303188%
Percent of non-delayed/cancelled accurately predicted: 60.538665578453376%


In [24]:
print('Test accuracy: ', accuracy_score(y_test, y_hat_test))
print('Train accuracy: ', accuracy_score(y_train, y_hat_train))

Test accuracy:  0.60224
Train accuracy:  0.6327466666666667
