<a href="https://colab.research.google.com/github/gomezphd/flight-delay-classifier/blob/main/FinalProjectPrept.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Flight Delay Classifier Using Weather Data
**Authors:** Barbara Lorenzo & Carlos C. Gomez  
**Date:** December 2024

# 1. Introduction

## 1.1 Project Overview
The goal of this project is to develop a binary classifier that predicts whether a flight will be delayed, using flight records together with the relevant weather data. Flight delays raise airlines' operating costs and disrupt passengers' schedules, which leads to dissatisfaction. With accurate delay predictions, airlines can act early to limit these effects, improving operational efficiency and customer satisfaction.

## 1.2 Business Value
- Improved customer satisfaction through proactive delay notifications.
- Better resource allocation for airlines.
- Reduced operational costs through optimized scheduling.
- Enhanced decision-making for both airlines and passengers.


In [None]:
# Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# For machine learning model creation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score, confusion_matrix

# For parsing timestamps when aligning the flight and weather datasets
from datetime import datetime



# 2. Dataset Selection and Loading

## 2.1 Dataset Selection
We are using the "US Flight Delay Data" from Kaggle along with NOAA Weather Data to capture flight-specific and weather-specific features that may contribute to delays.

## 2.2 Dataset Description
We will use two datasets:
1. **Flight Delay Data**: Historical flight records, including departure/arrival times and delay information.
2. **Weather Data**: Corresponding weather conditions at airports during relevant times.

## 2.3 Loading the Datasets
We will load the datasets into Google Colab, reading them from the project's GitHub repository, so that the data cleaning, exploration, and preprocessing stages can begin.


In [2]:

# URLs for datasets stored in GitHub repository
flight_url = "https://raw.githubusercontent.com/gomezphd/flight-delay-classifier/main/data/2015-flight-delays/flights.csv"
# weather_url = "https://raw.githubusercontent.com/gomezphd/flight-delay-classifier/main/data/weather-data/weather_2015.csv"

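# Load datasets into DataFrames with error handling
flight_data = None  # stays None if loading fails, so later cells can check for it
try:
    flight_data = pd.read_csv(flight_url)
    print("Flight Data Sample:")
    display(flight_data.head())
except Exception as e:
    # Report the failure instead of stopping the notebook
    print(f"Error loading flight data: {e}")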

# Uncomment this section once weather data is available
# try:
#     weather_data = pd.read_csv(weather_url)
#     print("Weather Data Sample:")
#     display(weather_data.head())
# except Exception as e:
#     print(f"Error loading weather data: {e}")




Error loading flight data: HTTP Error 404: Not Found


## 2.4 Loading from Local Files
If the datasets cannot be read from the repository URLs, as in the 404 error above, they can be uploaded from a local machine instead. As described in Section 2.1, the project combines the "US Flight Delay Data" from Kaggle with weather data from NOAA to capture both the flight-specific and weather-specific variables that may contribute to delays.




In [None]:
# Upload flight dataset from local (you will be prompted to select a file)
from google.colab import files  # Colab-only helper; required before calling files.upload()
uploaded = files.upload()

# Load flight dataset into a DataFrame
flight_data = pd.read_csv('flight_delay_data.csv') # replace with your file's name


In [None]:
# Upload weather dataset from local (you will be prompted to select a file)
uploaded = files.upload()

# Load weather dataset into a DataFrame
weather_data = pd.read_csv('weather_data.csv') # replace with your file's name


# 2. Data Cleaning and Preprocessing

## 2.1 Data Cleaning Plan
1. **Handling Missing Values**: Missing values will be imputed where possible (for example, by interpolating numerical features), and rows missing critical values will be dropped.
2. **Converting Categorical Variables**: We will convert categorical variables such as airline codes into numerical formats using one-hot encoding, which allows the classifier to process these features effectively.
3. **Scaling Numeric Features**: Numeric features such as temperature, wind speed, and visibility will be scaled using either **normalization** or **standardization**, so that features with large value ranges do not dominate scale-sensitive models such as Logistic Regression.
4. **Dataset Integration**: The flight and weather datasets will be combined based on **timestamps** and **location identifiers** to ensure proper alignment.
5. **Feature Engineering**: New features such as a "bad weather" indicator will be created to capture severe weather conditions that are likely to influence flight delays.
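
The integration and feature-engineering steps above could be sketched roughly as follows. This is a minimal illustration on toy data, not the project's final code: every column name (`airport`, `sched_dep`, `obs_time`, and so on), the 90-minute matching window, the 15-minute delay threshold, and the "bad weather" cutoffs are assumptions made for the example.

```python
import pandas as pd

# Toy flight records; all column names are hypothetical placeholders.
flights = pd.DataFrame({
    "airport": ["JFK", "JFK", "ORD"],
    "sched_dep": pd.to_datetime(
        ["2015-01-01 08:10", "2015-01-01 12:40", "2015-01-01 09:05"]),
    "dep_delay_min": [5, 42, 0],
})

# Toy weather observations per airport (also hypothetical).
weather = pd.DataFrame({
    "airport": ["JFK", "JFK", "ORD"],
    "obs_time": pd.to_datetime(
        ["2015-01-01 08:00", "2015-01-01 12:00", "2015-01-01 09:00"]),
    "wind_speed_mph": [8.0, 35.0, 12.0],
    "visibility_mi": [10.0, 1.5, 9.0],
})

# merge_asof requires both frames to be sorted by their time keys.
flights = flights.sort_values("sched_dep")
weather = weather.sort_values("obs_time")

# For each flight, attach the most recent observation at the same airport
# taken at or before scheduled departure, and no more than 90 minutes old.
merged = pd.merge_asof(
    flights, weather,
    left_on="sched_dep", right_on="obs_time",
    by="airport",
    direction="backward",
    tolerance=pd.Timedelta("90min"),
)

# Binary target: an assumed 15-minute threshold marks a flight as delayed.
merged["delayed"] = (merged["dep_delay_min"] >= 15).astype(int)

# "Bad weather" indicator; the cutoffs are illustrative assumptions only.
merged["bad_weather"] = (
    (merged["wind_speed_mph"] >= 30) | (merged["visibility_mi"] <= 2)
).astype(int)

print(merged[["airport", "sched_dep", "obs_time", "delayed", "bad_weather"]])
```

Matching on the nearest earlier observation, rather than an exact timestamp, is one way to handle weather readings that do not line up with scheduled departure times.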


# 3. Exploratory Data Analysis (EDA)

## 3.1 Univariate Analysis
- We will perform univariate analysis to examine the distributions of flight delays, weather conditions (e.g., temperature, precipitation), and flight schedule variables (e.g., departure times).

## 3.2 Bivariate Analysis
- Bivariate analysis will help us understand the relationships between weather conditions and flight delays. For example, we will analyze how different weather features (e.g., wind speed, visibility) correlate with delays.
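
As a rough sketch of such a bivariate check, the snippet below compares delay rates across wind-speed bins and computes a simple correlation. The data is synthetic (the delay probability is made to rise with wind speed by construction), and the column names `wind_speed_mph` and `delayed` are assumptions rather than names from the actual datasets.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged flight/weather table (not real data).
rng = np.random.default_rng(0)
n = 1_000
wind = rng.uniform(0, 40, n)
# By construction, the delay probability increases with wind speed.
delayed = (rng.random(n) < 0.1 + 0.015 * wind).astype(int)
df = pd.DataFrame({"wind_speed_mph": wind, "delayed": delayed})

# Bin the weather feature and compare the share of delayed flights per bin.
df["wind_bin"] = pd.cut(
    df["wind_speed_mph"], bins=[0, 10, 20, 30, 40], include_lowest=True)
delay_rate = df.groupby("wind_bin", observed=True)["delayed"].mean()
print(delay_rate)

# Pearson correlation between a numeric feature and the 0/1 delay flag.
corr = df["wind_speed_mph"].corr(df["delayed"])
print(f"correlation: {corr:.3f}")
```

The resulting `delay_rate` series can be passed to a bar plot (for example `delay_rate.plot(kind="bar")`) to visualize the relationship.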

## 3.3 Multivariate Analysis
- Multivariate analysis will allow us to explore interactions among multiple features, such as how airline, weather, and flight schedule jointly affect the likelihood of delays.

**Visualizations** will be created using **Matplotlib** and **Seaborn** to illustrate these relationships and uncover any important patterns that may guide feature selection and model building.


# 4. Model Selection and Implementation

We will develop and compare several classification models, including **Logistic Regression**, **Random Forest**, and **XGBoost**, to predict whether a flight will be delayed. Below is our modeling plan:

## 4.1 Logistic Regression
- As a baseline model, Logistic Regression provides a simple linear decision boundary and interpretable coefficients, giving a reference point against which the more complex models can be judged.

## 4.2 Random Forest
- An ensemble-based classifier that captures non-linear relationships and is less prone to overfitting than individual decision trees.

## 4.3 XGBoost
- A more sophisticated gradient boosting algorithm known for excellent performance on tabular data, especially with heterogeneous features.

The models will be evaluated using metrics such as **accuracy**, **precision**, **recall**, **F1-score**, and **AUC-ROC** to determine which approach best suits our problem.
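
A minimal sketch of how this comparison might be wired up with scikit-learn is shown below. It assumes a preprocessing step (median imputation, standardization, one-hot encoding) wrapped with each classifier in a `Pipeline`. The data is synthetic, and the feature names are hypothetical. `XGBClassifier` follows the same estimator interface and could be added to the `models` dictionary in the same way; it is left out here only to keep the example dependent on scikit-learn alone.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; feature names are hypothetical.
rng = np.random.default_rng(42)
n = 2_000
X = pd.DataFrame({
    "airline": rng.choice(["AA", "DL", "UA"], n),
    "wind_speed_mph": rng.uniform(0, 40, n),
    "visibility_mi": rng.uniform(0, 10, n),
})
# By construction, delays become more likely as wind speed increases.
y = (rng.random(n) < 0.1 + 0.015 * X["wind_speed_mph"]).astype(int)

numeric = ["wind_speed_mph", "visibility_mi"]
categorical = ["airline"]

# Impute and scale numeric columns; one-hot encode categorical columns.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

results = {}
for name, model in models.items():
    # Fitting preprocessing inside the pipeline keeps test data out of it.
    clf = Pipeline([("preprocess", preprocess), ("model", model)])
    clf.fit(X_train, y_train)
    proba = clf.predict_proba(X_test)[:, 1]
    results[name] = {
        "roc_auc": roc_auc_score(y_test, proba),
        "f1": f1_score(y_test, clf.predict(X_test)),
    }

print(pd.DataFrame(results).T.round(3))
```

Keeping the preprocessing inside each pipeline means the imputer and scaler are fitted on the training split only, which avoids leaking information from the test split into the evaluation.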


# 5. Business Report Summary

## 5.1 Introduction
- The objective of this project is to predict flight delays by using both flight-specific and weather-related data. This can help airlines proactively manage scheduling and improve resource utilization.

## 5.2 Methods
- **Data Cleaning and Preprocessing**: Missing data imputation, categorical encoding, and feature scaling.
- **Modeling**: We will test multiple models, including Logistic Regression, Random Forest, and XGBoost, and tune their hyperparameters for the best results.

## 5.3 Results
- We will summarize our findings, including model performance metrics and key insights on which factors are most influential in predicting delays.

## 5.4 Conclusion
- Predicting flight delays allows airlines to mitigate their effects through better scheduling and operational adjustments. However, limitations include potential inaccuracies in weather data and the unpredictability of operational disruptions.


# 6. Innovation and Creativity

## 6.1 Enhancements and Creative Methods
- We will apply **weather severity indices** to improve predictions, combining weather features into a single "severity" measure to see how well it correlates with delays.
- **Ensemble Modeling**: We plan to use an ensemble approach, combining **Random Forest** and **XGBoost** to enhance prediction robustness.
- **Advanced Visualizations**: To better convey trends, we will create **animated visualizations** that show changes in flight status and delays over time.
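
One possible form of the weather severity index mentioned above is an average of standardized weather features, each oriented so that larger values mean worse conditions. The sketch below is an assumption about how such an index could be built, not a settled design: the helper `severity_index`, the column names, the equal weighting, and the toy values are all hypothetical.

```python
import pandas as pd

# Toy observations with hypothetical column names (not real data).
weather = pd.DataFrame({
    "wind_speed_mph": [5.0, 12.0, 30.0, 45.0],
    "precip_in": [0.0, 0.1, 0.6, 1.2],
    "visibility_mi": [10.0, 8.0, 3.0, 0.5],
    "dep_delay_min": [0, 5, 40, 90],
})


def severity_index(df: pd.DataFrame) -> pd.Series:
    """Equal-weight average of z-scored features, where higher means worse."""
    oriented = pd.DataFrame({
        "wind": df["wind_speed_mph"],
        "precip": df["precip_in"],
        # Low visibility is worse, so flip its sign before standardizing.
        "low_visibility": -df["visibility_mi"],
    })
    # Note: a constant column has zero standard deviation and would yield NaN.
    z = (oriented - oriented.mean()) / oriented.std(ddof=0)
    return z.mean(axis=1)


weather["severity"] = severity_index(weather)
print(weather[["severity", "dep_delay_min"]])

# How strongly the combined index tracks delays in this toy table.
print(weather["severity"].corr(weather["dep_delay_min"]))
```

Equal weights are the simplest choice; weights learned from the data, or a domain-specific scoring rule, are alternatives worth comparing against it.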


# 7. Timeline and Submission

## 7.1 Proposed Timeline
- **Day _**: Dataset exploration and cleaning.
- **Day _**: Conduct EDA and perform feature engineering.
- **Day __**: Model selection, training, and testing.
- **Day __**: Write the business report summarizing our project.
- **Day __**: Finalize all deliverables and submit.

## 7.2 Deliverables
- **Colab Notebook**: Submitted as both `.ipynb` and `.html`.
- **Business Report**: A well-polished PDF covering all aspects of the project.
