# Predicting Significant Flight Delays using Supervised Learning

In this notebook I present my work in ___________________________.

## Business Understanding

Many consultants travel frequently over long distances for business purposes. Managers at a consulting firm might be interested in how to minimize -------------------------- risk when booking flights. One pertinent concern for these managers is how to reduce the risk of significant flight delay to ensure that consultants can utilize air travel reliably and efficiently.

My goal is to predict whether a given flight will be significantly delayed given several known factors about the flight such as airline, scheduled time of departure, and ---------------------------------. An effective model of this kind will assist managers in understanding ---------------------- so that they can make more informed decisions when booking flights.

## Data Understanding


I obtained public-domain data from https://www.kaggle.com/datasets/robikscube/flight-delay-dataset-20182022. Each row pertains to a flight --------------------. The data includes several promising features such as --------------------------------.

I will build my model using a random sample from a year's worth of raw data spanning August 2021 to July 2022.
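The notebook does not show how `sample.csv` was produced. As a minimal sketch of one way such a sample could be drawn, the code below filters frames to the August 2021 to July 2022 window and then samples rows without replacement. The helper names (`in_window`, `sample_flights`) and the tiny synthetic frames are assumptions for illustration, not the author's actual procedure.

```python
import pandas as pd


def in_window(df, start=202108, end=202207):
    """Keep rows whose Year/Month fall between start and end (YYYYMM, inclusive)."""
    period = df["Year"] * 100 + df["Month"]
    return df[(period >= start) & (period <= end)]


def sample_flights(frames, n, seed=100):
    """Concatenate the frames and draw n rows without replacement."""
    combined = pd.concat(frames, ignore_index=True)
    return combined.sample(n=n, random_state=seed).reset_index(drop=True)


# Synthetic stand-ins for the raw yearly files (values are made up).
raw_2021 = pd.DataFrame({"Year": [2021] * 6, "Month": [8, 9, 10, 11, 12, 12]})
raw_2022 = pd.DataFrame({"Year": [2022] * 6, "Month": [1, 2, 3, 5, 7, 8]})

sample = sample_flights([in_window(raw_2021), in_window(raw_2022)], n=5)
print(sample)
```

Fixing the seed keeps the sample reproducible, so later cells see the same rows on every run.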

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
df_original = pd.read_csv('sample.csv')


### Data Cleaning and Feature Selection

In [3]:
to_keep = ['Year', 'Quarter', 'Month', 'DayofMonth', 'DayOfWeek', 'Operating_Airline ', 'Origin', 'OriginState', 'Dest', 'DestState', 'DepTimeBlk', 'ArrDel15', 'ArrTimeBlk', 'Cancelled', 'Distance']
len(to_keep)

15

In [4]:
# Take an explicit copy so later assignments do not trigger SettingWithCopyWarning
df = df_original[to_keep].copy()

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Year                25000 non-null  int64  
 1   Quarter             25000 non-null  int64  
 2   Month               25000 non-null  int64  
 3   DayofMonth          25000 non-null  int64  
 4   DayOfWeek           25000 non-null  int64  
 5   Operating_Airline   25000 non-null  object 
 6   Origin              25000 non-null  object 
 7   OriginState         25000 non-null  object 
 8   Dest                25000 non-null  object 
 9   DestState           25000 non-null  object 
 10  DepTimeBlk          25000 non-null  object 
 11  ArrDel15            24315 non-null  float64
 12  ArrTimeBlk          25000 non-null  object 
 13  Cancelled           25000 non-null  float64
 14  Distance            25000 non-null  float64
dtypes: float64(3), int64(5), object(7)
memory usage: 2.9+ MB

In [6]:
# Create target variable: 1 if the flight is significantly delayed, 0 if not.
# ArrDel15 flags flights that arrived 15 or more minutes late.
df['Target'] = df['ArrDel15']


In [7]:
# Drop the 685 rows with no arrival-delay value (ArrDel15 has 24315 non-null entries out of 25000)
df = df.dropna(axis=0)
df


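| | Year | Quarter | Month | DayofMonth | DayOfWeek | Operating_Airline | Origin | OriginState | Dest | DestState | DepTimeBlk | ArrDel15 | ArrTimeBlk | Cancelled | Distance | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | 2 | 5 | 16 | 1 | AA | CLT | NC | VPS | FL | 0900-0959 | 0.0 | 0900-0959 | 0.0 | 460.0 | 0.0 |
| 1 | 2021 | 4 | 12 | 22 | 3 | WN | HOU | TX | MSY | LA | 1100-1159 | 0.0 | 1200-1259 | 0.0 | 302.0 | 0.0 |
| 2 | 2022 | 1 | 3 | 11 | 5 | AA | PHL | PA | MCO | FL | 0800-0859 | 1.0 | 1000-1059 | 0.0 | 861.0 | 1.0 |
| 3 | 2022 | 2 | 4 | 22 | 5 | DL | SDF | KY | ATL | GA | 1900-1959 | 0.0 | 2100-2159 | 0.0 | 321.0 | 0.0 |
| 4 | 2022 | 2 | 5 | 22 | 7 | AS | MCO | FL | SFO | CA | 1900-1959 | 0.0 | 2100-2159 | 0.0 | 2446.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24995 | 2021 | 4 | 10 | 18 | 1 | WN | CMH | OH | RSW | FL | 1400-1459 | 0.0 | 1700-1759 | 0.0 | 930.0 | 0.0 |
| 24996 | 2021 | 3 | 8 | 15 | 7 | 9E | ATL | GA | LEX | KY | 1000-1059 | 0.0 | 1100-1159 | 0.0 | 304.0 | 0.0 |
| 24997 | 2021 | 3 | 8 | 6 | 5 | OO | SLC | UT | TWF | ID | 2200-2259 | 1.0 | 2300-2359 | 0.0 | 175.0 | 1.0 |
| 24998 | 2022 | 2 | 5 | 6 | 5 | DL | ATL | GA | ILM | NC | 2000-2059 | 0.0 | 2200-2259 | 0.0 | 377.0 | 0.0 |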


In [8]:
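from sklearn.preprocessing import OneHotEncoder

# Separate features from the target; ArrDel15 is dropped because Target is a copy of it
X = df.drop(['Target', 'ArrDel15'], axis=1)
y = df['Target']

# Treat every column except Distance and DayofMonth as categorical
X_cat = X.drop(['Distance', 'DayofMonth'], axis=1)
X_num = X[['Distance', 'DayofMonth']]

# One-hot encode the categorical columns, dropping the first level of each.
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer versions need that name.
ohe = OneHotEncoder(drop="first", sparse=False)
ohe.fit(X_cat)

# The encoded columns keep default integer labels (0, 1, 2, ...)
X_cat_ohe = pd.DataFrame(
    data=ohe.transform(X_cat),
    index=X_cat.index
)

X_final = pd.concat([X_num, X_cat_ohe], axis=1)
X_final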

Unnamed: 0,Distance,DayofMonth,0,1,2,3,4,5,6,7,...,868,869,870,871,872,873,874,875,876,877
0,460.0,16,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,302.0,22,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,861.0,11,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,321.0,22,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,2446.0,22,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,930.0,18,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
24996,304.0,15,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24997,175.0,6,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
24998,377.0,6,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
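The encoder above is fitted on the full dataset before the train/test split, so it sees every category in advance. A possible alternative, sketched here under assumptions, is to wrap the encoder and the classifier in a scikit-learn `Pipeline` that is fitted on the training rows only, with `handle_unknown="ignore"` so that categories first seen at prediction time are encoded as all zeros instead of raising an error. The toy frame and its values are made up, and `drop="first"` is left out to keep the sketch simple.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Small synthetic stand-in for the cleaned flight data (values are made up).
toy = pd.DataFrame({
    "Operating_Airline": ["AA", "WN", "DL", "AA", "WN", "DL", "AS", "AA"],
    "Origin": ["CLT", "HOU", "ATL", "PHL", "CMH", "SDF", "MCO", "CLT"],
    "Distance": [460.0, 302.0, 377.0, 861.0, 930.0, 321.0, 2446.0, 460.0],
    "Target": [0, 0, 0, 1, 0, 0, 0, 1],
})
X = toy.drop(columns="Target")
y = toy["Target"]

cat_cols = ["Operating_Airline", "Origin"]
num_cols = ["Distance"]

# Encode categorical columns; pass numeric columns through unchanged.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("num", "passthrough", num_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(criterion="entropy", random_state=100)),
])

# The encoder now learns its categories from the training rows only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)
model.fit(X_train, y_train)
preds = model.predict(X_test)

# A row with an airline and airport never seen in training still gets a prediction.
unseen = pd.DataFrame(
    {"Operating_Airline": ["ZZ"], "Origin": ["XXX"], "Distance": [100.0]}
)
unseen_pred = model.predict(unseen)
```

Keeping preprocessing inside the pipeline also means the same object can be passed to cross-validation or a grid search without leaking test-set categories into training.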


In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y, random_state=100)

In [10]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(criterion='entropy', random_state=100)

clf.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=100)

In [11]:
from sklearn.metrics import accuracy_score

y_preds = clf.predict(X_test)

print('Accuracy: ', accuracy_score(y_test, y_preds))

Accuracy:  0.703898667544004
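An accuracy of about 0.70 is hard to interpret without knowing the class balance, because a model that always predicts "not delayed" can score well when delays are the minority class. The sketch below shows how a majority-class baseline could be measured for comparison. The labels are synthetic with an assumed delay rate of roughly 20%, which is not taken from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

rng = np.random.default_rng(100)

# Synthetic labels with roughly 20% positives (an assumed rate, not from the data).
y_true = (rng.random(1000) < 0.2).astype(int)
X_dummy = np.zeros((1000, 1))  # the baseline ignores the features

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

base_acc = accuracy_score(y_true, y_base)
base_recall = recall_score(y_true, y_base)

print("Baseline accuracy:", base_acc)
print("Baseline recall:  ", base_recall)
print(confusion_matrix(y_true, y_base))
```

On the real data, the same comparison would use `y_train` and `y_test`. If the decision tree's accuracy is close to the baseline's, metrics such as recall or F1 on the delayed class give a clearer picture of its usefulness.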


In [12]:
"""
TODO:
- Figure out OHE issue with values showing up in testing that are not in training set
- Implement grid search for decision tree
- Consider using logistic regression
"""

''
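As a starting point for the grid search item in the TODO list above, here is a minimal sketch using `GridSearchCV` with a decision tree. The data comes from `make_classification` purely for illustration, and the parameter grid, the F1 scoring choice, and the stratified split are assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data standing in for the flight features.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=100
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=100
)

# Hypothetical grid: limit tree depth and leaf size to curb overfitting.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=100),
    param_grid,
    scoring="f1",  # F1 on the positive (delayed) class
    cv=5,
)
search.fit(X_train, y_train)

test_f1 = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Test F1:", test_f1)
```

The same pattern should work with `LogisticRegression` in place of the tree, using a grid over `C`, which would cover the third TODO item as well.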