<a href="https://colab.research.google.com/github/gomezphd/flight-delay-classifier/blob/main/Notebooks/ProjectPrept.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Flight Delay Classifier Using Weather Data
**Authors:** Barbara Lorenzo & Carlos C. Gomez  
**Date:** December 2024

# 1. Introduction

## 1.1 Project Overview
The goal of this project is to develop a binary classifier that predicts whether a flight will be delayed or not, based on both flight data and relevant weather data. Flight delays impact airlines by increasing operational costs and disrupting passengers' schedules, leading to dissatisfaction. If we can accurately predict delays, airlines can proactively mitigate their effects, improving operational efficiency and customer satisfaction.

## 1.2 Business Value
- Improved customer satisfaction through proactive delay notifications.
- Better resource allocation for airlines.
- Reduced operational costs through optimized scheduling.
- Enhanced decision-making for both airlines and passengers.


In [9]:
# Cell 1: Importing Required Libraries

# Data Analysis and Manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning Model Creation and Evaluation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score, confusion_matrix

# For handling and combining multiple datasets using datetime as PK
from datetime import datetime

# Widgets and Display for Google Colab
import ipywidgets as widgets
from IPython.display import display

# Checking available RAM and Disk Space
import psutil
import shutil

# Function to check available RAM and Disk Space
def check_resources():
    ram = psutil.virtual_memory().available / (1024 ** 3)  # in GB
    total, used, free = shutil.disk_usage('/')
    print(f"Available RAM: {ram:.2f} GB")
    print(f"Disk Space: Total: {total // (2**30)} GB, Used: {used // (2**30)} GB, Free: {free // (2**30)} GB")

# Display available resources
print("Initial Resource Check:")
check_resources()

from IPython.display import display



Initial Resource Check:
Available RAM: 11.27 GB
Disk Space: Total: 107 GB, Used: 33 GB, Free: 73 GB


# 2. Dataset Selection and Loading

## 2.1 Dataset Selection
For this project, we are using datasets that contain information on flights and weather conditions. Specifically, we will combine the "US Flight Delay Data" from Kaggle with weather data from NOAA. This combination will allow us to capture both the flight-specific and weather-specific variables that may contribute to delays.

## 2.2 Dataset Description
We will use two datasets:
1. **Flight Delay Data**: Historical flight records, including departure/arrival times and delay information.
2. **Weather Data**: Corresponding weather conditions at airports during relevant times.

## 2.3 Loading the Datasets
We will upload the datasets to Google Colab, which will enable us to start the data cleaning, exploration, and preprocessing stages.


In [12]:
# Cell 2: Cloning Repository, Pulling LFS Files, and Loading Data

# Function to check available RAM and Disk Space before cloning
print("Resource Check Before Cloning Repository:")
check_resources()

# Clone the entire GitHub repository into Google Colab
!git clone https://github.com/gomezphd/flight-delay-classifier.git

# Change directory to the cloned repository
%cd flight-delay-classifier

# Install Git LFS to manage large files
!git lfs install
!git lfs pull

# Function to check available RAM and Disk Space after pulling LFS files
print("Resource Check After Pulling LFS Files:")
check_resources()

# Loading Data with Pandas
# Paths to files within the cloned repository
flight_url = 'data/2015-flight-delays/flights.csv'
airline_url = 'data/2015-flight-delays/airlines.csv'
airport_url = 'data/2015-flight-delays/airports.csv'

# Load datasets into pandas DataFrames
try:
    # Load flights.csv
    flight_data = pd.read_csv(flight_url)
    print("Flight Data Sample:")
    display(flight_data.head())

    # Load airlines.csv
    airlines_data = pd.read_csv(airline_url)
    print("\nAirlines Data Sample:")
    display(airlines_data.head())

    # Load airports.csv
    airports_data = pd.read_csv(airport_url)
    print("\nAirports Data Sample:")
    display(airports_data.head())

    # Placeholder for loading the weather data once it becomes available
    # Uncomment the following lines once the weather data is added to the repository
    # Make sure to add the weather data to the data/weather-data/ folder within the repository
    # weather_url = 'data/weather-data/weather_2015.csv'
    # weather_data = pd.read_csv(weather_url)
    # print("\nWeather Data Sample:")
    # display(weather_data.head())

except Exception as e:
    print(f"Error loading data: {e}")

# Function to check available RAM after loading datasets
print("Resource Check After Loading Data:")
check_resources()


Resource Check Before Cloning Repository:
Available RAM: 8.47 GB
Disk Space: Total: 107 GB, Used: 36 GB, Free: 71 GB
Cloning into 'flight-delay-classifier'...
remote: Enumerating objects: 20, done.[K
remote: Counting objects: 100% (20/20), done.[K
remote: Compressing objects: 100% (19/19), done.[K
remote: Total 20 (delta 1), reused 16 (delta 0), pack-reused 0 (from 0)[K
Receiving objects: 100% (20/20), 16.19 KiB | 2.31 MiB/s, done.
Resolving deltas: 100% (1/1), done.
/content/flight-delay-classifier/flight-delay-classifier/flight-delay-classifier/flight-delay-classifier
Updated git hooks.
Git LFS initialized.
Resource Check After Pulling LFS Files:
Available RAM: 8.45 GB
Disk Space: Total: 107 GB, Used: 37 GB, Free: 70 GB


  flight_data = pd.read_csv(flight_url)


Flight Data Sample:


Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,...,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,5,...,408.0,-22.0,0,0,,,,,,
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,10,...,741.0,-9.0,0,0,,,,,,
2,2015,1,1,4,US,840,N171US,SFO,CLT,20,...,811.0,5.0,0,0,,,,,,
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,20,...,756.0,-9.0,0,0,,,,,,
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,25,...,259.0,-21.0,0,0,,,,,,



Airlines Data Sample:


Unnamed: 0,IATA_CODE,AIRLINE
0,UA,United Air Lines Inc.
1,AA,American Airlines Inc.
2,US,US Airways Inc.
3,F9,Frontier Airlines Inc.
4,B6,JetBlue Airways



Airports Data Sample:


Unnamed: 0,IATA_CODE,AIRPORT,CITY,STATE,COUNTRY,LATITUDE,LONGITUDE
0,ABE,Lehigh Valley International Airport,Allentown,PA,USA,40.65236,-75.4404
1,ABI,Abilene Regional Airport,Abilene,TX,USA,32.41132,-99.6819
2,ABQ,Albuquerque International Sunport,Albuquerque,NM,USA,35.04022,-106.60919
3,ABR,Aberdeen Regional Airport,Aberdeen,SD,USA,45.44906,-98.42183
4,ABY,Southwest Georgia Regional Airport,Albany,GA,USA,31.53552,-84.19447


Resource Check After Loading Data:
Available RAM: 7.08 GB
Disk Space: Total: 107 GB, Used: 37 GB, Free: 70 GB


# 2. Data Cleaning and Preprocessing

## 2.1 Data Cleaning Plan
1. **Handling Missing Values**: Missing data will be addressed by using imputation techniques like interpolation for numerical features or by dropping rows with critical missing values.
2. **Converting Categorical Variables**: We will convert categorical variables such as airline codes into numerical formats using one-hot encoding, which allows the classifier to process these features effectively.
3. **Scaling Numeric Features**: Numeric features such as temperature, wind speed, and visibility will be scaled using either **normalization** or **standardization**. This will ensure that our model is not biased by different value ranges.
4. **Dataset Integration**: The flight and weather datasets will be combined based on **timestamps** and **location identifiers** to ensure proper alignment.
5. **Feature Engineering**: New features such as a "bad weather" indicator will be created to capture severe weather conditions that are likely to influence flight delays.


# 3. Exploratory Data Analysis (EDA)

## 3.1 Univariate Analysis
- We will perform univariate analysis to examine the distributions of flight delays, weather conditions (e.g., temperature, precipitation), and flight schedule variables (e.g., departure times).

## 3.2 Bivariate Analysis
- Bivariate analysis will help us understand the relationships between weather conditions and flight delays. For example, we will analyze how different weather features (e.g., wind speed, visibility) correlate with delays.

## 3.3 Multivariate Analysis
- Multivariate analysis will allow us to explore interactions among multiple features, such as how airline, weather, and flight schedule jointly affect the likelihood of delays.

**Visualizations** will be created using **Matplotlib** and **Seaborn** to illustrate these relationships and uncover any important patterns that may guide feature selection and model building.


# 4. Model Selection and Implementation

We will develop and compare several classification models, including **Logistic Regression**, **Random Forest**, and **XGBoost**, to predict whether a flight will be delayed. Below is our modeling plan:

## 4.1 Logistic Regression
- As a baseline model, Logistic Regression provides a simple linear decision boundary to assess initial feature correlations.

## 4.2 Random Forest
- An ensemble-based classifier that captures non-linear relationships and is less prone to overfitting compared to individual decision trees.

## 4.3 XGBoost
- A more sophisticated gradient boosting algorithm known for excellent performance on tabular data, especially with heterogeneous features.

The models will be evaluated using metrics such as **accuracy**, **precision**, **recall**, **F1-score**, and **AUC-ROC** to determine which approach best suits our problem.


# 5. Business Report Summary

## 5.1 Introduction
- The objective of this project is to predict flight delays by using both flight-specific and weather-related data. This can help airlines proactively manage scheduling and improve resource utilization.

## 5.2 Methods
- **Data Cleaning and Preprocessing**: Missing data imputation, categorical encoding, and feature scaling.
- **Modeling**: We tested multiple models including Logistic Regression, Random Forest, and XGBoost, optimizing their hyperparameters for the best results.

## 5.3 Results
- We will summarize our findings, including model performance metrics and key insights on which factors are most influential in predicting delays.

## 5.4 Conclusion
- Predicting flight delays allows airlines to mitigate their effects through better scheduling and operational adjustments. However, limitations include potential inaccuracies in weather data and the unpredictability of operational disruptions.


# 6. Innovation and Creativity

## 6.1 Enhancements and Creative Methods
- We will apply **weather severity indices** to improve predictions, combining weather features into a single "severity" measure to see how well it correlates with delays.
- **Ensemble Modeling**: We plan to use an ensemble approach, combining **Random Forest** and **XGBoost** to enhance prediction robustness.
- **Advanced Visualizations**: To better convey trends, we will create **animated visualizations** that show changes in flight status and delays over time.


# 7. Timeline and Submission

## 7.1 Proposed Timeline
- **Day _**: Dataset exploration and cleaning.
- **Day _**: Conduct EDA and perform feature engineering.
- **Day __**: Model selection, training, and testing.
- **Day __**: Write the business report summarizing our project.
- **Day __**: Finalize all deliverables and submit.

## 7.2 Deliverables
- **Colab Notebook**: Submitted as both `.ipynb` and `.html`.
- **Business Report**: A well-polished PDF covering all aspects of the project.
