### Final Project  
Course:  
Semester: Summer 2025  
Institution: University of San Diego  
  
### Project Details  
Project Title:  
Authors: Greg Moore, Zachary Artman, Jack Baxter  
Instructor: David Friesen  
Submission Date: 06/23/2025  
  
### Dependencies  
Python [3.9 or higher]  
Jupyter Notebook  
kagglehub (or download the dataset from [here](https://www.kaggle.com/datasets/usdot/flight-delays))  
Libraries: [pandas, numpy, scikit-learn, matplotlib, seaborn, scipy, xgboost]  


In [1]:
#import required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import tree 
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split,  RandomizedSearchCV
from sklearn import metrics
from scipy.stats import randint
from xgboost import XGBClassifier, plot_importance
from scipy.stats import f_oneway
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import mutual_info_regression
import kagglehub
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import xgboost as xgb_mod
from sklearn.linear_model import LogisticRegression


## Problem statement and justification for the proposed approach.
When a flight is delayed, the engines are left running while the plane sits idle on the tarmac. Some delay is inevitable, but for every 30 minutes a flight is delayed the airline burns $1,500-$3,000 in fuel. Airlines are also required to pay crews 10-15% more for unplanned delays; at large airports, this extra wage costs $500-$2,000 per airline per day (Airlines for America (A4A) 2023 report). Using the airline dataset, our project applies ML to predict delays from historical patterns by airline and airport. With these predictions, airlines gain the foresight to manage, reroute, or cancel flights, which could save thousands of dollars daily. In addition, identifying flights that are consistently delayed, or times when an airport is more likely to experience delays, supports more advanced flight planning that avoids unnecessary costs and improves traveler satisfaction. Specific ML models have yet to be chosen for our project; once further EDA provides insight into the data, an ideal model or ensemble will be selected.


## Data understanding (EDA)

In [None]:
#create dataframe
path = kagglehub.dataset_download("usdot/flight-delays")
airlinedata = pd.read_csv(f"{path}/flights.csv")
airlinedata.head()

In [None]:
#data summary and column description
data_shape = airlinedata.shape #(rows, columns); named to avoid shadowing the built-in sum()
print("airlinedata(row, col):", data_shape, "\n")
for col in airlinedata.columns: #print summary statistics for each column
    print(airlinedata[col].describe())

In [None]:
#list the column names to identify unnecessary delay categories
for col in airlinedata.columns:
    print(col)

In [None]:
#count null values per column to find columns with significant missing data
nulls = airlinedata.isna().sum()
print(nulls)
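
Raw null counts are easier to act on when expressed as a share of rows. The following is a minimal sketch on a small made-up frame; the `toy` data and the 0.5 cutoff are illustrative assumptions, not values taken from this project.

```python
import numpy as np
import pandas as pd

# Toy stand-in for airlinedata; the values are made up for illustration.
toy = pd.DataFrame({
    "DEPARTURE_DELAY": [5.0, -2.0, np.nan, 12.0],
    "WEATHER_DELAY": [np.nan, np.nan, np.nan, 30.0],
    "AIRLINE": ["AA", "DL", "UA", "AA"],
})

# Fraction of missing values per column (mean of the boolean null mask).
null_fraction = toy.isna().mean()

# Hypothetical cutoff: flag columns that are more than half empty.
THRESHOLD = 0.5
mostly_null = null_fraction[null_fraction > THRESHOLD].index.tolist()

print(null_fraction)
print("Columns above threshold:", mostly_null)
```

Columns flagged this way are candidates for dropping, while sparsely missing columns can be handled row by row with `dropna`.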

## Data preparation & feature engineering

In [6]:
#remove the columns with abundant nulls, remove remaining NAs
nons = ['CANCELLATION_REASON','WEATHER_DELAY', 'LATE_AIRCRAFT_DELAY', 'AIRLINE_DELAY', 'SECURITY_DELAY', 'AIR_SYSTEM_DELAY']
airlinedata.drop(nons, axis=1, inplace=True)
airlinedata.dropna(inplace=True)

In [None]:
#confirm the shape after dropping sparse columns and rows with NaN values
shape = airlinedata.shape #shape data
print(shape)

In [None]:
#boxplot of departure delay by airline for EDA
plt.figure(figsize=(12, 6))
sns.boxplot(x='AIRLINE', y='DEPARTURE_DELAY', data=airlinedata)
plt.xticks(rotation=45)
plt.title('Delay Distribution by Airline')
plt.xlabel('Airline')
plt.ylabel('Delay (minutes)')
plt.tight_layout()
plt.show()

In [None]:
#pivot visualization for month/day delay averages 
pivot_table = airlinedata.pivot_table(values='DEPARTURE_DELAY', index='MONTH', columns='DAY_OF_WEEK', aggfunc='mean')
plt.figure(figsize=(10, 8))
sns.heatmap(pivot_table, annot=True, cmap='RdYlGn_r', fmt='.1f')
plt.title('Average Delay by Month and Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Month')
plt.tight_layout()
plt.show()

In [None]:
#check numerical cols for correlation 
numerical_cols = ['SCHEDULED_DEPARTURE', 'TAXI_OUT', 'DISTANCE', 'SCHEDULED_TIME', 'DEPARTURE_DELAY']
corr_matrix = airlinedata[numerical_cols].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap of Numerical Features')
plt.tight_layout()
plt.show()

In [None]:
#inspect most frequent airports for potential outliers
top_airports = airlinedata['ORIGIN_AIRPORT'].value_counts().index[:20]
plt.figure(figsize=(12, 6))
sns.barplot(x='ORIGIN_AIRPORT', y='DEPARTURE_DELAY', data=airlinedata[airlinedata['ORIGIN_AIRPORT'].isin(top_airports)])
plt.xticks(rotation=45)
plt.title('Average Departure Delay by Origin Airport (Top 20)')
plt.xlabel('Origin Airport')
plt.ylabel('Average Departure Delay (minutes)')
plt.tight_layout()
plt.show()

In [None]:
#visual for delay by scheduled time of day 
plt.figure(figsize=(10, 6))
sns.scatterplot(x='SCHEDULED_DEPARTURE', y='DEPARTURE_DELAY', data=airlinedata, alpha=0.5)
plt.title('Departure Delay vs. Scheduled Departure Time')
plt.xlabel('Scheduled Departure (HHMM, local time)')
plt.ylabel('Departure Delay (minutes)')
plt.tight_layout()
plt.show()
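
If the time axis should be linear, the scheduled time can be converted to minutes past midnight first. This is a minimal sketch that assumes `SCHEDULED_DEPARTURE` holds HHMM-style integers (for example 1430 for 14:30), which is how the values appear to be encoded; `hhmm_to_minutes` is a hypothetical helper, not part of the notebook.

```python
def hhmm_to_minutes(hhmm: int) -> int:
    """Convert an HHMM-style integer (e.g. 1430) to minutes past midnight."""
    hours, minutes = divmod(int(hhmm), 100)
    if not (0 <= hours <= 24 and 0 <= minutes < 60):
        raise ValueError(f"not a valid HHMM value: {hhmm}")
    # Treat 2400 as midnight by wrapping the hour.
    return (hours % 24) * 60 + minutes


print(hhmm_to_minutes(5))     # 00:05
print(hhmm_to_minutes(1430))  # 14:30
print(hhmm_to_minutes(2359))  # 23:59
```

Applied to a column, this could be used as `airlinedata['SCHEDULED_DEPARTURE'].map(hhmm_to_minutes)`, again under the HHMM assumption.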

In [None]:
#evaluate numerical columns for correlation with delays 
nums = ['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'SCHEDULED_DEPARTURE', 'TAXI_OUT', 
                 'SCHEDULED_TIME', 'DISTANCE', 'TAXI_IN', 'SCHEDULED_ARRIVAL', 'DEPARTURE_DELAY']
corrmat = airlinedata[nums].corr()
correlations = corrmat['DEPARTURE_DELAY'].sort_values(ascending=False)
print("Correlation with DEPARTURE_DELAY:")
print(correlations)
plt.figure(figsize=(10, 8))
sns.heatmap(corrmat, annot=True, cmap='coolwarm', fmt='.2f', vmin=-1, vmax=1)
plt.title('Correlation Heatmap of Numerical Features')
plt.tight_layout()
plt.show()

## Feature Selection

In [None]:
#create random forest for feature importance evaluation (January only, to keep training fast)
airlinedatajan = airlinedata[airlinedata['MONTH'] == 1]
#copy so the encoding below does not modify a view of airlinedata
forestairlines = airlinedatajan.dropna(subset=['DEPARTURE_DELAY']).copy()
cats = ['AIRLINE', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'TAIL_NUMBER']
for col in cats:
    le = LabelEncoder()
    forestairlines[col] = le.fit_transform(forestairlines[col].astype(str))
#note: columns such as DEPARTURE_TIME and WHEELS_OFF are only known after departure
feature_cols = ['MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER', 'TAIL_NUMBER', 
                    'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 
                      'TAXI_OUT', 'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE', 'WHEELS_ON', 'TAXI_IN', 
                      'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME', 'DIVERTED', 'CANCELLED']
xvars = forestairlines[feature_cols]
yvars = forestairlines['DEPARTURE_DELAY']
redwoodforest = RandomForestRegressor(n_estimators=10, random_state=42)
redwoodforest.fit(xvars, yvars)
whatsimportant = pd.Series(redwoodforest.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("Feature Importances:")
print(whatsimportant)

**How were the features selected based on the data analysis?**

## Modeling

In [None]:
# Remove rows where the DEPARTURE_DELAY value is missing (NaN)
# Copy so the new column below is not written to a view of airlinedata
airlinedata_clean = airlinedata.dropna(subset=['DEPARTURE_DELAY']).copy()

# Select the categories and the predictors
cats = ['AIRLINE', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'TAIL_NUMBER']
features = ['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER', 'TAIL_NUMBER',
            'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'SCHEDULED_DEPARTURE', 'DISTANCE']

# Convert delay times to classes
# DEPARTURE_DELAY	    Departure Status
# 0			            On Time
# Negative number (< 0)	Early
# Positive number (> 0)	Delayed
conditions = [
    airlinedata_clean['DEPARTURE_DELAY'] == 0,
    airlinedata_clean['DEPARTURE_DELAY'] < 0,
    airlinedata_clean['DEPARTURE_DELAY'] > 0
]

# Choose the 3 different departure statuses
choices = ['On Time', 'Early', 'Delayed']

# Store the 3 classes in a new column so the numeric DEPARTURE_DELAY is preserved
# (writing strings into the float column fails on re-run and on recent pandas versions)
airlinedata_clean['DEPARTURE_STATUS'] = np.select(conditions, choices, default='On Time')

# Set variables for train/test split
X = airlinedata_clean[features].copy()
y = airlinedata_clean['DEPARTURE_STATUS']

# Encode categorical variables
for col in cats:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col].astype(str))

# Convert ['On Time', 'Early', 'Delayed'] to [2, 1, 0]
# (LabelEncoder assigns codes in sorted order of the class names)
le = LabelEncoder()
y_encoded = le.fit_transform(y)

In [None]:
y_encoded
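
The integer codes follow from `LabelEncoder` sorting class names before numbering them. A small sketch on a made-up list of statuses (the `statuses` list is illustrative only) shows how to recover the mapping from `classes_`:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration; not drawn from the flight data.
statuses = ["On Time", "Early", "Delayed", "Delayed", "Early"]

encoder = LabelEncoder()
codes = encoder.fit_transform(statuses)

# classes_ is sorted, so the position of each class is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(mapping)         # {'Delayed': 0, 'Early': 1, 'On Time': 2}
print(codes.tolist())  # [2, 1, 0, 0, 1]
```

The same lookup on the fitted `le` makes it possible to read the rows of a classification report by class name rather than by code.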

### Model 1 - Random Forest

Features selected: SCHEDULED_DEPARTURE, ORIGIN_AIRPORT, AIRLINE, DAY_OF_WEEK, MONTH, DESTINATION_AIRPORT, FLIGHT_NUMBER, TAIL_NUMBER, DAY, TAXI_OUT

In [None]:
# select RF features
rf_feats = ['SCHEDULED_DEPARTURE','ORIGIN_AIRPORT','AIRLINE',
            'DAY_OF_WEEK','MONTH','DESTINATION_AIRPORT',
            'FLIGHT_NUMBER','TAIL_NUMBER','DAY','TAXI_OUT']
X_rf = airlinedata_clean[rf_feats].copy()
y_rf = y_encoded.copy()

# encode the categoricals
for col in ['ORIGIN_AIRPORT','DESTINATION_AIRPORT','AIRLINE','TAIL_NUMBER']:
    le = LabelEncoder()
    X_rf[col] = le.fit_transform(X_rf[col].astype(str))

# split
X_rf_train, X_rf_test, y_rf_train, y_rf_test = train_test_split(
    X_rf, y_rf, test_size=0.2, random_state=42, stratify=y_rf)

# train
rf = RandomForestClassifier(random_state=42)
rf.fit(X_rf_train, y_rf_train)

# eval
y_rf_pred = rf.predict(X_rf_test)
y_rf_proba = rf.predict_proba(X_rf_test)
print("RF: Classification Report:\n", classification_report(y_rf_test, y_rf_pred))
print("RF: ROC AUC (ovr):", roc_auc_score(y_rf_test, y_rf_proba, multi_class='ovr'))

### Model 2 - Logistic Regression

In [None]:
# reuse RF feature list
lr_feats = ['SCHEDULED_DEPARTURE','ORIGIN_AIRPORT','AIRLINE',
            'DAY_OF_WEEK','MONTH','DESTINATION_AIRPORT',
            'FLIGHT_NUMBER','TAIL_NUMBER','DAY','TAXI_OUT']
X_lr = airlinedata_clean[lr_feats].copy()
y_lr = y_encoded.copy()

# encode categoricals
for col in ['ORIGIN_AIRPORT','DESTINATION_AIRPORT','AIRLINE','TAIL_NUMBER']:
    le = LabelEncoder()
    X_lr[col] = le.fit_transform(X_lr[col].astype(str))

# split
X_lr_train, X_lr_test, y_lr_train, y_lr_test = train_test_split(
    X_lr, y_lr, test_size=0.2, random_state=42, stratify=y_lr)

# train
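# lbfgs fits a multinomial model for multiclass targets by default; the multi_class argument is deprecated in recent scikit-learn
lr = LogisticRegression(random_state=42, solver='lbfgs', max_iter=1000)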
lr.fit(X_lr_train, y_lr_train)

# eval
y_lr_pred = lr.predict(X_lr_test)
y_lr_proba = lr.predict_proba(X_lr_test)
print("LR: Classification Report:\n", classification_report(y_lr_test, y_lr_pred))
print("LR: ROC AUC (ovr):", roc_auc_score(y_lr_test, y_lr_proba, multi_class='ovr'))

### Model 3 - XGBoost

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)

# Use XGBoost model classifier
# mlogloss is the multiclass log loss; use_label_encoder is no longer used by recent XGBoost releases
xgb = XGBClassifier(eval_metric='mlogloss', random_state=42)

# Train the XGBoost model
xgb.fit(X_train, y_train)

# Evaluation Metrics for XGB
# Use trained XGBoost model (xgb) to predict the class labels (0, 1, 2)
y_pred = xgb.predict(X_test)
# Predicted class probabilities, one column per class
y_proba = xgb.predict_proba(X_test)

print("XGBoost: Classification Report:\n", classification_report(y_test, y_pred))

# Model ability to distinguish between classes (1.0 = perfect)
print("XGBoost: ROC AUC (ovr):", roc_auc_score(y_test, y_proba, multi_class='ovr'))

**selection, comparison, tuning, and analysis – consider ensembles**
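
One possible approach to the tuning step is a randomized search over a few hyperparameters. The sketch below runs on synthetic data from `make_classification` rather than the flight data, and the parameter ranges, `n_iter`, and `cv` values are illustrative assumptions chosen to keep it fast, not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 3-class problem standing in for the encoded flight features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, random_state=42)

# Hypothetical search space; randint(a, b) samples integers in [a, b).
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(2, 8),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5, cv=3, scoring="accuracy", random_state=42)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```

On the full dataset the same pattern would apply to `X_rf_train` and `y_rf_train`, although the search cost grows with the number of rows, iterations, and folds.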

## Evaluation

In [None]:
xgb_mod.plot_importance(xgb)
plt.show()
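
A confusion matrix complements the classification report by showing which classes are mistaken for which. The following minimal sketch uses made-up label arrays (`y_true_toy` and `y_pred_toy` are illustrative, not model output) and derives per-class recall from the matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted class codes (0, 1, 2) for illustration.
y_true_toy = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred_toy = np.array([0, 1, 0, 1, 1, 2, 0, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2])

# Recall per class: correct predictions divided by the row total.
recall_per_class = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(recall_per_class)
```

For the trained models, the same call would take `y_test` and `y_pred` (or the RF and LR equivalents) in place of the toy arrays.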

**performance measures, results, and conclusions**

## Deployment

**A discussion of either the hypothetical deployment of the model or the actual deployment of the model if it has been deployed.**

## Discussion and conclusions

**address the problem statement and recommendation**