<a href="https://colab.research.google.com/github/glgunderson/INFOB2DA-PA4/blob/main/pa4_ipyn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Dashboard Visualizations and Coordinated View Systems**
## Practical Assignment 4 - INFOB2DA
*Tobias Buiten & Grace Gunderson*


In [None]:
# Import Relevant Libraries for Level 2 Visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
# Download Dataset from GitHub Release & Unzip Large Dataset
!curl -L -o DelayedFlights.zip "https://github.com/glgunderson/INFOB2DA-PA4/releases/download/PA4.DATA/DelayedFlights.zip"
!unzip -o DelayedFlights.zip -d data

import pandas as pd

# Load Dataset
df_raw = pd.read_csv('data/airlinedelaycauses_DelayedFlights.csv')

# Copy Dataset for Preprocessing (Delayed Flights)
df = df_raw.copy()

# Copy Dataset for Comparison (Delayed vs. All Flights)
df_all = df_raw.copy()

# Preview Dataset
df_raw.head()

## Dataset Overview

### DOT’s Air Travel Consumer Report:
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the **on-time performance** of domestic flights operated by large air carriers.
- DOT provides **monthly summary information** on the number of on-time, delayed, cancelled and diverted flights.
- BTS collects details on the **causes of flight delays** and releases summary statistics and raw data.


## Relevant Features

Time-Related Features
- `Month` (1-12) / `DayOfMonth` (1-31) / `DayOfWeek` (1-7) — calendar indicators for each flight.
- `DepTime` / `ArrTime` — actual departure and arrival times (in HHMM format).
- `CRSDepTime` / `CRSArrTime` — scheduled departure and arrival times (in HHMM format).
- `DepHour` / `ArrHour`: derived from the actual departure and arrival times (`DepTime` / `ArrTime`), showing the hour of day (0–23) - *AFTER PREPROCESSING*

Flight Details
- `Distance` — flight distance in miles.
  - `DistanceGroup` — categorized distance range (<500, 500–1000, etc.) - *AFTER PREPROCESSING*
- `AirTime` — total time spent in the air (minutes).
- `ActualElapsedTime` / `CRSElapsedTime` — actual vs. scheduled total flight durations (minutes).

Delay Metrics
- `DepDelay` / `ArrDelay` — minutes delayed at departure and arrival.
  - `DepDelayHours` / `ArrDelayHours` — same delays converted to hours for easier interpretation - *AFTER PREPROCESSING*
- `CarrierDelay` / `WeatherDelay` / `NASDelay` / `SecurityDelay` / `LateAircraftDelay`: minutes of delay attributed to specific causes.

Operational Flags
- `Cancelled` / `Diverted` — binary indicators (1 = yes, 0 = no).
- `UniqueCarrier` — airline carrier code (e.g., AA, DL, UA).
- `Origin` / `Dest` — airport codes for departure and arrival locations.


## Summary Statistics

In [None]:
# Understand the dataset
df_raw.info()
df_raw.shape

In [None]:
# Basic Summary Statistics
df_raw.describe().T

### Understanding the Dataset
The initial output upon loading the *full raw dataset* includes:
- **1,936,758 rows (flights) x 30 columns (features)**
- The columns consist of both numeric (`int64`, `float64`) and categorical (`object`) features, including:
  - 14 float variables (e.g., `DepTime`, `ArrTime`, `DepDelay`, `ArrDelay`)
  - 11 integer variables (e.g., `Year`, `Month`, `DayOfWeek`, `FlightNum`)
  - 5 object variables (e.g., `UniqueCarrier`, `Origin`, `Dest`)  

According to PA4, the dataset *should* include:
- Flight delay metrics for **1,247,486** different flights.
- **30 different features**, both numerical and categorical.

### Understanding the Record Discrepancy
The difference between the ~1.94 million and ~1.25 million flight records is explained by *dataset scope*.
- The full raw dataset (**1,936,758 rows**) includes **all scheduled flights** in 2008 - whether they were on time, delayed, cancelled, or diverted.
- Only a subset of the flight records (**1,247,488 rows**) contain complete **delay-related data** (`CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay`, `LateAircraftDelay`).
  - These represent flights that actually experienced a *delay event*, which is the primary focus of this data analysis.

***As a result, subsequent preprocessing and visualizations are performed by isolating this ~1.25M delayed-flight subset to ensure meaningful data analysis.***
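
As a rough illustration of how the size of this subset could be checked, the sketch below counts rows in which all five delay-cause columns are present. The toy DataFrame and the names `toy`, `delay_cols`, and `complete_mask` are invented for this example; on the real data the same mask would be built from `df_raw`.

```python
import numpy as np
import pandas as pd

delay_cols = ["CarrierDelay", "WeatherDelay", "NASDelay",
              "SecurityDelay", "LateAircraftDelay"]

# Hypothetical toy data: rows 0 and 2 are complete, rows 1 and 3 are not
toy = pd.DataFrame({
    "CarrierDelay":      [10.0, np.nan, 0.0, np.nan],
    "WeatherDelay":      [0.0,  np.nan, 5.0, np.nan],
    "NASDelay":          [0.0,  np.nan, 0.0, 3.0],
    "SecurityDelay":     [0.0,  np.nan, 0.0, np.nan],
    "LateAircraftDelay": [20.0, np.nan, 0.0, np.nan],
})

# True only where every delay-cause column has a value
complete_mask = toy[delay_cols].notna().all(axis=1)
n_complete = int(complete_mask.sum())
n_missing = int((~complete_mask).sum())
print(f"Complete rows: {n_complete}, incomplete rows: {n_missing}")
```

A row with only some causes filled in (row 3 here) counts as incomplete, which matches the behaviour of `dropna(subset=...)` with its default `how='any'`.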

## Preprocessing

In [None]:
# PREPROCESSING - DELAYED FLIGHT DATA (df)

# 1. Drop redundant/irrelevant columns
df = df.drop(columns=['Unnamed: 0', 'Year', 'FlightNum', 'TailNum', 'CancellationCode'], errors='ignore')

# 2. Remove cancelled or diverted flights
df = df[(df['Cancelled'] == 0) & (df['Diverted'] == 0)]

# 3. Identify and define records with complete delay-cause data
DelayCause = ['CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
df = df.dropna(subset=DelayCause)

# 4. Fill missing delay-cause values (NaN) with 0
df[DelayCause] = df[DelayCause].fillna(0)

# 5. Clip negative delay values (representing early arrivals) to 0
df['ArrDelay'] = df['ArrDelay'].clip(lower=0)

# 6. Derive new time-based features for later visualization
df['DepHour'] = (df['DepTime'] // 100).astype(int)
df['ArrHour'] = (df['ArrTime'] // 100).astype(int)

# 7. Convert arrival/departure delays from minutes to hours
df['ArrDelayHours'] = (df['ArrDelay'] / 60).astype('float64')
df['DepDelayHours'] = (df['DepDelay'] / 60).astype('float64')

# 8. Create distance categories (in miles)
df['DistanceGroup'] = pd.cut(
    df['Distance'],
    bins=[0, 500, 1000, 2000, 3000, 5000],
    labels=['<500', '500–1000', '1000–2000', '2000–3000', '3000–5000']
)

# 9. Reset DataFrame index to ensure clean row alignment
df = df.reset_index(drop=True)

# Verify structure after preprocessing
df.info()
df.shape

## Data Preprocessing

*Before visualizing flight delay trends, the dataset required preprocessing to ensure data analysis focuses only on valid, delayed flight records.*

### 1. Removed redundant and irrelevant columns
- The first column, `Unnamed: 0`, is an index column automatically generated during export and does not represent a meaningful feature.
- Dropped `Year` since all records were from 2008 (constant value), offering no variance for analysis.
- `FlightNum`, `TailNum`, and `CancellationCode` provide no meaningful information for data analysis of flight delays.

### 2. Excluded cancelled or diverted flights
- Removed flights where `Cancelled = 1` or `Diverted = 1`.  
- These records do not have valid arrival/departure data, which is essential for delay analysis.
- The focus of analysis is delayed flights, so this information is not relevant.

### 3. Retained only records with complete delay-cause data
- Filtered to include only rows where all five delay cause fields were present:  
  `CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay`, `LateAircraftDelay`.
- Define delay causes for consistent reference.
- This isolates the ~1.25 million delayed-flight subset that contains full delay-cause information — the focus of this data analysis.

### 4. Filled missing delay-cause values with zero
- Any remaining `NaN` values in delay-cause columns were replaced with `0`
- NaN indicated *no delay* from that cause (e.g., `WeatherDelay`).

### 5. Clipped negative arrival delay values
- Negative values in `ArrDelay` represent early arrivals (e.g., `-109` = 109 minutes early).  
- To focus purely on delays, these were clipped to `0`, indicating no delay.  
- `DepDelay` contained no negative values, so no adjustment was required.

### 6. Derived new time-based features
- Created `DepHour` and `ArrHour` by converting actual departure/arrival times (`DepTime` / `ArrTime`, e.g., `1530`) to hour bins (e.g., `15`).  
- This facilitates later visualization of delay patterns by time of day.
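
The HHMM-to-hour conversion can be sketched on a few made-up values. The series `times` below is hypothetical; the point is that integer division by 100 drops the minutes, and that a value such as `2400`, if it occurs in the data, would map to hour `24` rather than `0`.

```python
import pandas as pd

# Hypothetical HHMM values stored as floats, as in the raw time columns
times = pd.Series([5.0, 930.0, 1530.0, 2400.0])

# Floor division by 100 keeps the hour part and discards the minutes
hours = (times // 100).astype(int)
print(hours.tolist())
```

If hour `24` is not wanted as a separate bin, one option would be to fold it into hour `0` with `hours % 24`; that is a design choice, not something the preprocessing above does.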

### 7. Converted arrival and departure delays from minutes to hours
- Created two new columns, `ArrDelayHours` and `DepDelayHours`, by dividing the delay values (in minutes) by 60.
- This transformation makes large delay durations easier to interpret (e.g., 180 minutes → 3.0 hours).
- Both columns were stored as floating-point values (`float64`) to preserve precision for visualizations and calculations.

### 8. Created distance categories (DistanceGroup)
- Grouped flight distances (in miles) into five categories to simplify comparison across flight lengths.
- The new `DistanceGroup` column was created using `pd.cut()` with the following bins and labels: <500, 500–1000, 1000–2000, 2000–3000, 3000–5000.
- This grouping allows visualizations to explore whether longer flights are more or less prone to delays.
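
A small sketch of how `pd.cut()` assigns these bins, using invented distances. By default the intervals are closed on the right, so a distance of exactly 500 falls into the `<500` group, and a value outside `(0, 5000]` would become `NaN`.

```python
import pandas as pd

# Hypothetical distances in miles, chosen to sit on and around bin edges
distances = pd.Series([120, 500, 501, 2000, 4962])

groups = pd.cut(
    distances,
    bins=[0, 500, 1000, 2000, 3000, 5000],
    labels=['<500', '500–1000', '1000–2000', '2000–3000', '3000–5000'],
)

# cat.codes gives the position of each assigned label (0 = first bin)
codes = groups.cat.codes.tolist()
print(list(zip(distances.tolist(), groups.astype(str).tolist())))
```

Passing `right=False` to `pd.cut()` would instead make the bins closed on the left, which would match the `<500` label more literally.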

### 9. Reset DataFrame index
- Reset the index after all filtering steps to maintain continuous row alignment.

### Preprocessing Summary
After cleaning and filtering the dataset, the records were reduced from **1,936,758** total scheduled flights to **1,247,488** valid delayed-flight records.

*This refined dataset now focuses exclusively on flights that experienced measurable delays, removing cancelled, diverted, or incomplete records.*
- Dataset size reduced: from ~443 MB to ~277 MB
- Features retained: 30 (including derived time-based and distance-group variables)
- Delay cause completeness: All five causes (`CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay`, `LateAircraftDelay`) present for every record
- Negative delay values clipped (early arrivals = 0 hours delay)
- Delays converted from minutes to hours for interpretability
- Added columns: `DepHour`, `ArrHour`, `DepDelayHours`, `ArrDelayHours`, `DistanceGroup`

*The resulting dataset is now optimized for visual analysis of **when, why, and how long flights are delayed**, forming a clean basis for the dashboard visualizations.*


In [None]:
# PREPROCESSING — FULL FLIGHT DATA (df_all)
# Includes all flights (on-time, cancelled, diverted)

# 1. Drop unnecessary columns
df_all = df_all.drop(columns=['Unnamed: 0', 'Year', 'FlightNum', 'TailNum', 'CancellationCode'], errors='ignore')

# 2. Fill NaNs in delay-cause columns with 0 (on-time flights)
df_all[DelayCause] = df_all[DelayCause].fillna(0)

# 3. Convert delay columns to hours for consistency
df_all['ArrDelayHours'] = (df_all['ArrDelay'] / 60).astype('float64')
df_all['DepDelayHours'] = (df_all['DepDelay'] / 60).astype('float64')
df_all[DelayCause] = df_all[DelayCause] / 60

# 4. Add departure and arrival hour columns
df_all['DepHour'] = (df_all['DepTime'] // 100).astype('Int64')
df_all['ArrHour'] = (df_all['ArrTime'] // 100).astype('Int64')

# 5. Define distance bins
df_all['DistanceGroup'] = pd.cut(
    df_all['Distance'],
    bins=[0, 500, 1000, 2000, 3000, 5000],
    labels=['<500', '500–1000', '1000–2000', '2000–3000', '3000–5000']  # same labels as df so groups align
)

df_all.info()

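Comparable preprocessing (column cleanup, unit conversion, derived features) was applied to the full dataset (`df_all`) so that it can be compared with the delayed-flight subset (`df`) to reveal trends specific to delayed flights.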

## Review Dataset

In [None]:
# Basic summary statistics after preprocessing
df.describe().T

In [None]:
print(f"Raw data shape: {df_raw.shape}")
print(f"Delayed flight data shape: {df.shape}")
print(f"All flight data shape: {df_all.shape}")
print(f"Rows without delay-cause data: {df_raw.shape[0] - df.shape[0]:,}")

## Investigate Dataset

In [None]:
# Number of delayed flights due to each cause?
(df[['CarrierDelay','WeatherDelay','NASDelay','SecurityDelay','LateAircraftDelay']] > 0).sum()

In [None]:
# Delays greater than 24 hours?
(df['ArrDelay'] > 1440).sum(), (df['DepDelay'] > 1440).sum(), (df['CarrierDelay'] > 1440).sum()

In [None]:
# Number of delayed departures/arrivals by hour
plt.figure(figsize=(12,5))
width = 0.4

# Plot departure and arrival side-by-side instead of overlapping
plt.bar(df.groupby('DepHour').size().index - width/2,
        df.groupby('DepHour').size().values,
        width=width, color='steelblue', label='Departure')

plt.bar(df.groupby('ArrHour').size().index + width/2,
        df.groupby('ArrHour').size().values,
        width=width, color='lightblue', label='Arrival')

# Visuals
plt.title('Number of Delayed Flights by Hour (Departure vs Arrival)', weight='bold')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Delayed Flights')
plt.xticks(range(0, 24))
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Total vs average delays by hour?
hourly_stats = df.groupby('DepHour').agg(
    total_delays=('ArrDelayHours', 'count'),
    avg_delay=('ArrDelayHours', 'mean')
).reset_index()

fig, axes = plt.subplots(1,2,figsize=(22,5), sharex=True)

sns.barplot(x='DepHour', y='total_delays', data=hourly_stats,
            color='skyblue', ax=axes[0])
axes[0].set_title('Number of Delayed Flights by Hour')
axes[0].set_ylabel('Number of Flights')

sns.lineplot(x='DepHour', y='avg_delay', data=hourly_stats,
             color='steelblue', marker='o', ax=axes[1])
axes[1].set_title('Average Delay by Hour')
axes[1].set_ylabel('Hours')

for ax in axes:
    ax.set_xlabel('Hour of Day (Departure)')
    ax.set_xticks(range(0,24))
sns.despine()
plt.tight_layout()
plt.show()

In [None]:
# Hourly trend for number of flights compared to delay?

sns.set_style("whitegrid")

# Aggregate data by hour
hourly_stats = df.groupby('DepHour').agg(
    avg_delay=('ArrDelayHours', 'mean'),
    total_delays=('ArrDelayHours', 'count')
).reset_index()

# Normalize both metrics to comparable scale
hourly_stats['avg_delay_scaled'] = hourly_stats['avg_delay'] / hourly_stats['avg_delay'].max()
hourly_stats['total_delays_scaled'] = (hourly_stats['total_delays'] / hourly_stats['total_delays'].max()) * -1

plt.figure(figsize=(12,6))

sns.barplot(data=hourly_stats, x='DepHour', y='total_delays_scaled',
            color='steelblue', label='Number of Delayed Flights')
sns.barplot(data=hourly_stats, x='DepHour', y='avg_delay_scaled',
            color='skyblue', label='Average Delay Duration(Hours)')

# Titles and labels
plt.title('Opposing Trends: Flight Volume vs. Average Delay by Hour', weight='bold')
plt.xlabel('Hour of Day (Departure)')
plt.ylabel('')
plt.yticks([])
plt.legend(loc='upper right', fontsize=12)
plt.show()

In [None]:
# Distribution of delay duration by hour?

sns.set_style("whitegrid")

plt.figure(figsize=(12,5))
sns.boxplot(
    data=df, x='DepHour', y='ArrDelayHours',
    showfliers=False,
    color='lightblue'
)

plt.title('Distribution of Arrival Delay Durations by Departure Hour', weight='bold')
plt.xlabel('Hour of Day (Departure)')
plt.ylabel('Arrival Delay Duration (Hours)')
plt.tight_layout()
plt.show()

These visuals introduce the daily pattern of flight delays, showing how both severity and frequency change by hour of departure.
- The boxplot shows that delays are longest and most variable in the early-morning hours, likely caused by flights carried over from the previous day.
  - Delay durations stabilize during daytime hours but rise again in the late evening.
- The bar chart reveals an inverse trend between flight volume and delay duration — as flight activity increases throughout the day, average delay times initially decrease before rising again as the system becomes congested.

Key takeaway:

Flight delays follow a clear daily rhythm — early-morning hours show the worst severity, while evening congestion amplifies frequency, creating cyclical system strain.
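
The mirrored bar chart relies on scaling both metrics by their own maximum and negating one of them. A minimal sketch with invented numbers (the frame `hourly` is hypothetical) shows what the scaled columns look like:

```python
import pandas as pd

# Hypothetical hourly aggregates: mean delay in hours and number of delayed flights
hourly = pd.DataFrame({
    "DepHour": [6, 12, 18],
    "avg_delay": [0.5, 1.0, 2.0],
    "total_delays": [100, 400, 200],
})

# Each metric is divided by its own maximum so both lie in [0, 1];
# the counts are negated so their bars extend below the axis
hourly["avg_delay_scaled"] = hourly["avg_delay"] / hourly["avg_delay"].max()
hourly["total_delays_scaled"] = -hourly["total_delays"] / hourly["total_delays"].max()
print(hourly)
```

Because each series is scaled independently, bar heights above and below the axis are only comparable in shape, not in magnitude.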

In [None]:
hourly_total = df.groupby('DepHour')['ArrDelayHours'].sum().reset_index()
hourly_total['CumulativeDelay'] = hourly_total['ArrDelayHours'].cumsum()

plt.figure(figsize=(9,5))
sns.lineplot(data=hourly_total, x='DepHour', y='CumulativeDelay', color='steelblue', marker='o')
plt.title('Cumulative Build-Up of Total Delay Hours Throughout the Day', weight='bold')
plt.xlabel('Hour of Day (Departure)')
plt.ylabel('Cumulative Delay (Hours)')
plt.tight_layout()
plt.show()

Key Findings:
- The majority of delayed flights occur during the daytime hours, reflecting the busiest hours for flight traffic.
  - Delays tend to peak slightly later in the day, showing how early departure delays cause downstream congestion by evening (delay propagation).
- While most delays occur during the day, nighttime flights experience longer and more unpredictable delays. Low traffic volume at night means that the few delays that occur are often severe.
- The flights with the greatest delays appear to occur during evening hours in visualizations, although this may be skewed because of how few delays there are in comparison to the daytime.
- Daytime delays are compact (consistent short delays) while early/late hours have extreme variability.

In [None]:
#Percentage of flights delayed by hour?

# Group by departure hour
total_by_hour = df_all.groupby('DepHour').size().rename('TotalFlights')
delayed_by_hour = df.groupby('DepHour').size().rename('DelayedFlights')

# Merge into one table
compare = pd.concat([total_by_hour, delayed_by_hour], axis=1).fillna(0)

# Calculate percentage of flights delayed
compare['PctDelayed'] = (compare['DelayedFlights'] / compare['TotalFlights']) * 100

# Plot
plt.figure(figsize=(12,6))
plt.bar(compare.index, compare['PctDelayed'], color='steelblue')
plt.title('Percentage of Flights Delayed by Hour of Departure', weight='bold')
plt.xlabel('Hour of Day (Departure)')
plt.ylabel('% of Flights Delayed')
plt.xticks(range(0,25))
plt.tight_layout()
plt.show()

# Summary stats
overall_pct = (df.shape[0] / df_all.shape[0]) * 100
print(f"Overall, {overall_pct:.2f}% of flights in the dataset experienced a delay.")

This chart shows how the likelihood of flight delays changes throughout the day.
- Flights departing in the evening face the highest delay percentages, often exceeding 80–90%.
- Delays drop sharply around 5–6 AM, aligning with the start of a new flight cycle, before gradually increasing again throughout the day.
- By afternoon, over 60% of flights experience some delay, reflecting system congestion and delay propagation as the day progresses.
- Overall, 64.4% of all flights in the dataset experienced a delay.

Key takeaway:

Delays accumulate as the day unfolds — evening flights are most likely to depart behind schedule, while early-morning flights remain the most punctual.
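
The percentage calculation can be traced on a toy example. The frames `all_flights` and `delayed` below are invented; the sketch shows how `pd.concat(..., axis=1)` aligns the two counts by hour and why `fillna(0)` is needed for hours that have flights but no delays.

```python
import pandas as pd

# Hypothetical data: 4 flights at hour 6, 2 at hour 18, 1 at hour 23
all_flights = pd.DataFrame({"DepHour": [6, 6, 6, 6, 18, 18, 23]})
# Of those, 1 delayed at hour 6, 2 at hour 18, none at hour 23
delayed = pd.DataFrame({"DepHour": [6, 18, 18]})

total = all_flights.groupby("DepHour").size().rename("TotalFlights")
late = delayed.groupby("DepHour").size().rename("DelayedFlights")

# Align on the hour index; hours missing from `late` become NaN, then 0
compare = pd.concat([total, late], axis=1).fillna(0)
compare["PctDelayed"] = compare["DelayedFlights"] / compare["TotalFlights"] * 100
print(compare)
```

Hours with very few flights (hour 23 here) can swing between 0% and 100% on a single record, so percentages for low-volume hours are best read alongside the raw counts.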


In [None]:
# Total number of flights vs. flights delayed?

# Group by departure hour
total_by_hour = df_all.groupby('DepHour').size().rename('TotalFlights')
delayed_by_hour = df.groupby('DepHour').size().rename('DelayedFlights')

# Merge datasets
compare = pd.concat([total_by_hour, delayed_by_hour], axis=1).fillna(0)

# Plot both on same chart
plt.figure(figsize=(12,6))
plt.bar(compare.index - 0.2, compare['TotalFlights'], width=0.4, color='lightblue', label='Total Flights')
plt.bar(compare.index + 0.2, compare['DelayedFlights'], width=0.4, color='steelblue', label='Delayed Flights')

plt.title('Total vs Delayed Flights by Hour of Departure', weight='bold')
plt.xlabel('Hour of Day (Departure)')
plt.ylabel('Number of Flights')
plt.xticks(range(0, 24))
plt.legend()
plt.tight_layout()
plt.show()

This chart compares the total number of flights scheduled versus the number of flights delayed for each hour of the day.
- Total flights (light blue) increase steadily throughout the morning, peaking during mid-to-late afternoon (15:00–19:00) when air traffic is highest.
- Delayed flights (dark blue) follow a similar trend, with the highest number of delays occurring during these peak traffic hours.
- However, despite fewer flights overnight, delays are still present — showing that late-night flights experience a disproportionately high share of delays relative to total flights.

Key takeaway:

The busiest hours account for most total delays due to volume, but evening flights (after 22:00) remain at higher risk per flight, indicating persistent operational bottlenecks at the end of the day.


In [None]:
# Calculate total and delayed flights per airline
total_by_airline = df_all.groupby('UniqueCarrier').size()
delayed_by_airline = df.groupby('UniqueCarrier').size()

# Calculate % of flights delayed
delay_percent = (delayed_by_airline / total_by_airline * 100).sort_values(ascending=False)

# Plot
plt.figure(figsize=(10,5))
sns.barplot(x=delay_percent.index, y=delay_percent.values, color='steelblue')

plt.title('Percentage of Flights Delayed by Airline', weight='bold')
plt.xlabel('Airline')
plt.ylabel('% of Flights Delayed')
# Dashed line marks the mean delay percentage across airlines
plt.axhline(delay_percent.mean(), color='midnightblue', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


In [None]:
# Percentage of flights delayed by airline?
# Map airline codes to names
airline_names = {
    'AA': 'American',
    'AS': 'Alaska',
    'B6': 'JetBlue',
    'CO': 'Continental',
    'DL': 'Delta',
    'EV': 'Atlantic Southeast',  # EV was Atlantic Southeast Airlines in 2008
    'F9': 'Frontier',
    'FL': 'AirTran',
    'HA': 'Hawaiian',
    'MQ': 'American Eagle',
    'NW': 'Northwest',
    'OH': 'Comair',
    'OO': 'SkyWest',
    'UA': 'United',
    'US': 'US Airways',
    'WN': 'Southwest',
    'XE': 'ExpressJet (XE)',
    'YV': 'Mesa',
    'AQ': 'Aloha',
    '9E': 'Pinnacle'  # 9E is referenced in the analysis text below
}

# Replace airline codes with names
delay_percent_named = delay_percent.rename(index=airline_names)

# Plot
plt.figure(figsize=(12,5))
sns.barplot(x=delay_percent_named.index, y=delay_percent_named.values, color='steelblue')

plt.title('Percentage of Flights Delayed by Airline', weight='bold')
plt.xlabel('Airline')
plt.ylabel('% of Flights Delayed')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

This visualization compares each airline’s operational reliability, showing the percentage of all flights (from df_all) that experienced a delay (in df):
- Mesa (YV) and Comair (OH) have the highest share of delayed flights — over 70% of their total operations.
- Southwest (WN) and Aloha (AQ) maintain the lowest proportion of delays, under 55%.
- Major national carriers like American, Delta, and United fall near the middle, reflecting moderate but consistent congestion across their large networks.

Key takeaway:

Regional and connector airlines tend to experience more frequent delays, while large network carriers maintain more stable on-time performance overall.


In [None]:
# Flights delayed by month and day of week?
df_all['Month'] = df_all['Month'].astype(int)
df_all['DayOfWeek'] = df_all['DayOfWeek'].astype(int)
df['Month'] = df['Month'].astype(int)
df['DayOfWeek'] = df['DayOfWeek'].astype(int)

# All x Delayed
total = df_all.groupby(['Month','DayOfWeek']).size()
delayed = df.groupby(['Month','DayOfWeek']).size()

# 1–12 x 1–7
p = (delayed / total * 100).unstack().reindex(
    index=range(1,13), columns=range(1,8)
)

# Labels
month_map = {1:'Jan',2:'Feb',3:'Mar',4:'Apr',5:'May',6:'Jun',
             7:'Jul',8:'Aug',9:'Sep',10:'Oct',11:'Nov',12:'Dec'}
dow_map   = {1:'Mon',2:'Tue',3:'Wed',4:'Thu',5:'Fri',6:'Sat',7:'Sun'}

p = p.rename(index=month_map, columns=dow_map)

# Heatmap
plt.figure(figsize=(10,6))
sns.heatmap(p, cmap='YlGnBu', annot=True, fmt='.1f',
            cbar_kws={'label':'% of Flights Delayed'},
            xticklabels=True, yticklabels=True)
plt.title('Percentage of All Flights Delayed by Month and Day of Week', weight='bold')
plt.xlabel('Day of Week'); plt.ylabel('Month')
plt.tight_layout(); plt.show()

This heatmap shows how the likelihood of flight delays varies throughout the year and across different days of the week. By combining both month and weekday dimensions, it highlights seasonal and weekly patterns in overall delay frequency.

Key Insights:
- Highest delay rates occur in winter months (Dec, Jan-Feb) — likely due to holiday travel traffic.
- Weekdays show slightly lower delay percentages, reflecting more stable business-travel patterns.
- Weekends (Fri–Sun) and holiday months show increased delays, aligning with peak leisure travel and congestion at major airports.
- Summer months (Jun–Aug) have moderate but consistent delay levels, possibly due to higher flight volumes and air traffic congestion.
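
The heatmap values come from dividing two grouped counts and pivoting the result. A small sketch with invented records (the frames `all_f` and `late` are hypothetical) shows the shape of that computation and how a month/weekday pair with no flights ends up as `NaN`:

```python
import pandas as pd

# Hypothetical flights: month 1 has flights on weekdays 1 and 2, month 2 only on weekday 1
all_f = pd.DataFrame({"Month": [1, 1, 1, 1, 2, 2],
                      "DayOfWeek": [1, 1, 2, 2, 1, 1]})
# One delayed flight in each populated (Month, DayOfWeek) cell
late = pd.DataFrame({"Month": [1, 1, 2],
                     "DayOfWeek": [1, 2, 1]})

total = all_f.groupby(["Month", "DayOfWeek"]).size()
delayed = late.groupby(["Month", "DayOfWeek"]).size()

# Division aligns on the (Month, DayOfWeek) MultiIndex; unstack pivots weekdays into columns
pct = (delayed / total * 100).unstack()
print(pct)
```

Each populated cell here is 50%, while the empty (month 2, weekday 2) cell is `NaN`, which seaborn's heatmap leaves blank.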

In [None]:
# Flights delayed by departure hour + weekday?
total_hourly = df_all.groupby(['DepHour', 'DayOfWeek']).size()
delayed_hourly = df.groupby(['DepHour', 'DayOfWeek']).size()

# Percentage delayed
delay_percent_hourly = (delayed_hourly / total_hourly * 100).unstack()

# Map weekday numbers to names (1–7)
dow_map = {1:'Mon', 2:'Tue', 3:'Wed', 4:'Thu', 5:'Fri', 6:'Sat', 7:'Sun'}
delay_percent_hourly.columns = delay_percent_hourly.columns.map(dow_map)

# Reorder hours so the day starts at 5 AM and wraps to 4 AM
hour_order = list(range(5, 25)) + list(range(0, 5))
delay_percent_hourly = delay_percent_hourly.reindex(hour_order)

# Plot
plt.figure(figsize=(10,8))
sns.heatmap(delay_percent_hourly,
            cmap='YlGnBu',
            annot=True, fmt='.1f',
            cbar_kws={'label':'% of Flights Delayed'})

plt.title('Percentage of Flights Delayed by Hour and Day of Week', weight='bold')
plt.xlabel('Day of Week')
plt.ylabel('Hour of Day (Departure)')
plt.tight_layout()
plt.show()

This heatmap shows how flight delay frequency varies throughout the day and across each day of the week. It reveals patterns tied to daily flight volume, air-traffic congestion, and operational timing.

Key Insights:
- Evening hours have the highest percentage of delays, though this is partly due to a smaller number of total flights—so each delay weighs more heavily on the percentage.
- Morning to early afternoon (5:00–14:00) shows the lowest proportion of delays, reflecting smoother operations and lower congestion.
- As the day progresses, delay percentages rise again as delays accumulate and propagate; this pattern is consistent across all days of the week.
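
The row ordering of this heatmap depends on `reindex`, which can be sketched on a short, invented series (`pct` is hypothetical). Hours are rearranged so the operational day starts at 5 AM, and hours absent from the data come back as `NaN`:

```python
import pandas as pd

# Hypothetical delay percentages for a handful of departure hours
pct = pd.Series([50.0, 10.0, 30.0, 80.0], index=[0, 5, 12, 23])

# Start the day at 5 AM and wrap around to 4 AM
hour_order = list(range(5, 24)) + list(range(0, 5))

# reindex returns one entry per requested hour; missing hours are NaN and dropped here
reordered = pct.reindex(hour_order).dropna()
print(reordered)
```

In the heatmap itself the `NaN` rows are kept rather than dropped, so every hour in `hour_order` gets a row even if no flights departed then.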

In [None]:
# Total delay impact by airline?

# Group by airline and month for total delay counts
delays_by_airline_month = df.groupby(['UniqueCarrier', 'Month']).size().unstack(fill_value=0)

# Convert to percentage of total delays by month
delay_percent_airline_month = delays_by_airline_month.div(delays_by_airline_month.sum(axis=0), axis=1) * 100

# Month labels
month_labels = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
delay_percent_airline_month.columns = month_labels

# Plot heatmap
plt.figure(figsize=(10,6))
sns.heatmap(delay_percent_airline_month,
            cmap='YlGnBu',
            annot=True, fmt='.1f',
            cbar_kws={'label':'% of Total Delays (by Month)'})

plt.title('Percentage of Total Delays by Airline and Month', weight='bold')
plt.xlabel('Month')
plt.ylabel('Airline')
plt.tight_layout()
plt.show()

- Southwest (WN) consistently accounts for the largest share of delays year-round, often contributing 15–19% of all delayed flights each month.
- American (AA), United (UA), SkyWest (OO), and American Eagle (MQ) are also major contributors, especially during spring and summer, suggesting operational strain during peak travel periods.
- Smaller regional carriers (e.g., HA, AQ) barely register, meaning their total delay contribution is minimal even when their own rates might be high.
- No single airline dominates in any one month; total delays are spread across the major carriers, aligning with industry-wide congestion cycles.

In [None]:
# Percentage of each airline's flights delayed per month - airline reliability?

# Total flights per airline per month
total_by_airline_month = df_all.groupby(['UniqueCarrier', 'Month']).size().unstack(fill_value=0)

# Delayed flights per airline per month
delayed_by_airline_month = df.groupby(['UniqueCarrier', 'Month']).size().unstack(fill_value=0)

# Calculate % of that airline's flights delayed each month
delay_rate_airline_month = (delayed_by_airline_month / total_by_airline_month * 100).fillna(0)

# Month labels
month_labels = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
delay_rate_airline_month.columns = month_labels

# Plot heatmap
plt.figure(figsize=(10,6))
sns.heatmap(delay_rate_airline_month,
            cmap='YlGnBu',
            annot=True, fmt='.1f',
            cbar_kws={'label':'% of Airline Flights Delayed'})

plt.title('Monthly Reliability: % of Flights Delayed by Airline', weight='bold')
plt.xlabel('Month')
plt.ylabel('Airline')
plt.tight_layout()
plt.show()

In [None]:
# Total flight volume by airline?
total_flights = df_all['UniqueCarrier'].value_counts().sort_values(ascending=False)

plt.figure(figsize=(10,5))
sns.barplot(x=total_flights.index, y=total_flights.values, color='steelblue')
plt.title('Total Number of Flights per Airline (All Year)', weight='bold')
plt.xlabel('Airline')
plt.ylabel('Total Flights')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Although Southwest Airlines (WN) accounts for the largest share of total U.S. delays, this chart reveals why:
- Southwest operates far more flights than any other airline, nearly double the next largest carrier.
- Because of this enormous volume, even a moderate delay rate (like 60–70%) creates the highest overall delay impact across the country.
- In contrast, smaller regional carriers (like EV, YV, or 9E) may have higher delay percentages but contribute little to the total because they operate fewer flights.

**Total flight volume, not delay rate alone, explains Southwest's disproportionate delay impact.**
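
The arithmetic behind this point can be made concrete with two made-up carriers. The names and numbers below are purely illustrative and are not taken from the dataset: a large carrier with a lower delay rate still produces most of the delayed flights.

```python
# Hypothetical carriers: flight counts and per-flight delay rates
flights = {"BigCarrier": 1000, "SmallCarrier": 100}
delay_rate = {"BigCarrier": 0.60, "SmallCarrier": 0.75}

# Delayed flights = volume x rate
delayed = {name: flights[name] * delay_rate[name] for name in flights}
total_delayed = sum(delayed.values())

# Each carrier's share of all delayed flights, in percent
share = {name: delayed[name] / total_delayed * 100 for name in delayed}
for name, pct in share.items():
    print(f"{name}: {pct:.1f}% of delayed flights")
```

Despite its higher delay rate, the small carrier contributes roughly one ninth of the delayed flights in this toy setup.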

In [None]:
# Total flights and contribution to total delay by airline?
fig, ax1 = plt.subplots(figsize=(10,6))

# Bars = Total Flights (scale)
sns.barplot(x=total_flights.index, y=total_flights.values, color='lightblue', alpha=0.7, label='Total Flights', ax=ax1)
ax1.set_ylabel('Total Flights', color='steelblue')
ax1.tick_params(axis='y', labelcolor='steelblue')

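# Line = % of Total Delays (impact)
# delay_share is not defined in an earlier cell; it is assumed here to be each
# airline's share of all delayed flights, ordered to match the bars
delay_share = (df['UniqueCarrier'].value_counts(normalize=True) * 100).reindex(total_flights.index)
ax2 = ax1.twinx()
sns.lineplot(x=delay_share.index, y=delay_share.values, color='darkred', marker='o', sort=False, label='% of Total Delays', ax=ax2)
ax2.set_ylabel('% of Total Delays', color='darkred')
ax2.tick_params(axis='y', labelcolor='darkred')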

# Visuals
plt.title('Airline Scale vs Delay Impact', weight='bold')
ax1.set_xlabel('Airline')
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.tight_layout()
plt.show()

In [None]:
# Average delay duration (hours) by airline?
avg_delay_by_airline = df.groupby('UniqueCarrier')['ArrDelayHours'].mean().sort_values(ascending=False)

# Plot
plt.figure(figsize=(10,5))
sns.barplot(x=avg_delay_by_airline.index, y=avg_delay_by_airline.values, color='steelblue')

plt.title('Average Arrival Delay Duration by Airline', weight='bold')
plt.xlabel('Airline')
plt.ylabel('Average Delay (Hours)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Total delay minutes by airline and cause?
delay_causes = df.groupby('UniqueCarrier')[['CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']].sum()

# Percentage within each airline
delay_cause_percent = delay_causes.div(delay_causes.sum(axis=1), axis=0) * 100

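# Plot as stacked bar chart (DataFrame.plot creates its own figure via figsize,
# so a separate plt.figure() call would only leave an empty figure behind)
delay_cause_percent.plot(
    kind='bar',
    stacked=True,
    colormap='tab20c',
    width=0.8,
    figsize=(12,6)
)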

# Visuals
plt.title('Breakdown of Delay Causes by Airline', weight='bold')
plt.xlabel('Airline')
plt.ylabel('% of Total Delay Minutes')
plt.legend(title='Delay Cause', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

### Delay Patterns by Cause, Distance, and Airline

Overview:
- Carrier delays increase with flight distance and are most pronounced among airlines with higher distance operations (>2000 miles).
- Late Aircraft delays dominate short-haul and high-frequency carriers.
- Regional carriers (e.g., 9E, EV, MQ) show similar delay compositions due to their shared operational range (<1000 miles).
- Distance category distributions correspond closely with each airline’s network scope as reflected in delay composition.

In [None]:
# Total delay minutes per airline by cause?
delay_causes = df.groupby('UniqueCarrier')[[ 'LateAircraftDelay', 'CarrierDelay', 'NASDelay', 'WeatherDelay', 'SecurityDelay']].sum()

# Percentage of each airline’s total delay time
delay_percent = delay_causes.div(delay_causes.sum(axis=1), axis=0) * 100

# Heatmap
plt.figure(figsize=(10,6))
sns.heatmap(delay_percent,
            cmap='YlGnBu',
            annot=True, fmt='.1f',
            cbar_kws={'label':'% of Airline Delay Minutes'},
)

plt.title('Percentage of Total Delay Minutes by Cause and Airline', weight='bold')
plt.xlabel('Delay Cause')
plt.ylabel('Airline')
plt.tight_layout()
plt.show()

- The majority of airline delays stem from late aircraft and carrier-related causes, highlighting that network ripple effects and operational inefficiencies drive most disruptions.
- Late aircraft delays dominate larger carriers (AA, UA, WN), indicating that tight turnaround schedules magnify small disruptions.
- Carrier delays are more prominent among regional and smaller airlines (YV, EV), suggesting limited resources for recovery.
- Weather and NAS delays contribute less overall, underscoring that most delays arise from internal or network factors rather than uncontrollable events.

Takeaway:
Most U.S. flight delays are contagious rather than spontaneous — once a delay enters the system, it spreads through tightly packed flight schedules.

Note:
- `LateAircraftDelay` shows delays cascading through the system.
- `CarrierDelay` hints at internal inefficiencies (maintenance, staffing, etc.).
- `NASDelay` often reflects air traffic control or congestion issues.
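
The claim that late-aircraft and carrier causes account for most delay minutes can be checked by summing each cause column over all flights and converting the totals to shares. A minimal sketch on a small made-up table; the values and the `toy` frame are illustrative assumptions, not figures from the dataset:

```python
import pandas as pd

# Hypothetical toy data: delay minutes per flight, by cause (not real values)
toy = pd.DataFrame({
    'UniqueCarrier': ['AA', 'AA', 'WN', 'WN'],
    'LateAircraftDelay': [30, 0, 45, 15],
    'CarrierDelay': [10, 20, 0, 10],
    'NASDelay': [0, 15, 5, 0],
    'WeatherDelay': [0, 0, 10, 0],
    'SecurityDelay': [0, 0, 0, 0],
})
cause_cols = ['LateAircraftDelay', 'CarrierDelay', 'NASDelay',
              'WeatherDelay', 'SecurityDelay']

# Share of all delay minutes attributable to each cause, largest first
cause_totals = toy[cause_cols].sum()
cause_share = (cause_totals / cause_totals.sum() * 100).sort_values(ascending=False)
print(cause_share.round(1))
```

Replacing `toy` with the notebook's `df` should give the dataset-wide ranking of causes, independent of airline.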

In [None]:
# Total delay minutes by cause and distance
delay_by_cause_dist = df.groupby('DistanceGroup')[[ 'LateAircraftDelay', 'CarrierDelay', 'NASDelay', 'WeatherDelay', 'SecurityDelay']].sum()

# Percent within each distance bin
delay_by_cause_dist_pct = delay_by_cause_dist.div(delay_by_cause_dist.sum(axis=1), axis=0) * 100

# Plot
plt.figure(figsize=(8,8))
sns.heatmap(delay_by_cause_dist_pct,
            cmap='YlGnBu',
            annot=True, fmt='.1f',
            cbar_kws={'label':'% of Delay Minutes'},
            linewidths=0.3, linecolor='white')

plt.title('Distribution of Delay Causes by Flight Distance Category', weight='bold')
plt.xlabel('Delay Cause')
plt.ylabel('Distance Category (miles)')
plt.tight_layout()
plt.show()

Delay causes vary significantly with flight distance.
- Shorter routes are most affected by late aircraft delays, reflecting network congestion and tight scheduling.
- Medium-haul routes (1000–3000 mi) show a more even split between causes, indicating multiple interacting delay factors.
- Long-haul flights (>3000 mi) are dominated by carrier-related delays, suggesting that maintenance, fueling, or operational factors play a larger role than propagation or air traffic.

Takeaway:
As flight distance increases, delays shift from system ripple effects (short flights) to carrier-specific operational challenges (long flights). This highlights different optimization priorities for airlines depending on route length.
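
The distance categories used in this heatmap come from the `DistanceGroup` column created during preprocessing. As a rough sketch of how such bins could be built with `pd.cut`, under the assumption of the bin edges and labels below (the notebook's actual preprocessing may differ):

```python
import pandas as pd

# Hypothetical flight distances in miles
distances = pd.Series([120, 650, 1500, 2400, 3800], name='Distance')

# Assumed bin edges and labels; the notebook's preprocessing may use different ones
edges = [0, 500, 1000, 2000, 3000, float('inf')]
labels = ['<500', '500-1000', '1000-2000', '2000-3000', '>3000']

# right=False makes each bin include its lower edge, e.g. [500, 1000)
distance_group = pd.cut(distances, bins=edges, labels=labels, right=False)
print(distance_group.tolist())
```

Using left-closed bins keeps a flight of exactly 500 miles out of the `<500` group, which matches how the labels read.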

In [None]:
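# Flights by airline and distance?

# Count flights per airline and distance group
# (assumed definition; drop this line if flights_by_dist was built in an earlier cell)
flights_by_dist = df.groupby(['UniqueCarrier', 'DistanceGroup']).size().unstack(fill_value=0)

# Normalize by airline
flights_by_dist_pct = flights_by_dist.div(flights_by_dist.sum(axis=1), axis=0) * 100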

plt.figure(figsize=(10,6))
sns.heatmap(flights_by_dist_pct,
            cmap='YlGnBu',
            annot=True, fmt='.1f',
            cbar_kws={'label':'% of Airline Flights'},
            )

# Visuals
plt.title('Flight Distribution by Airline and Distance Category', weight='bold')
plt.xlabel('Distance Category (miles)')
plt.ylabel('Airline')
plt.tight_layout()
plt.show()

*Each airline’s delay profile is shaped by its route structure.*
- Regional carriers dominate short-haul operations (<500 mi), where quick turnarounds make them vulnerable to cascading delays.
- Southwest Airlines (WN) maintains a high flight volume concentrated in short- and medium-haul routes (over 80%), explaining its large share of total U.S. delays despite relatively moderate delay rates.
- Legacy carriers (e.g., AA, UA) balance domestic and long-haul routes, diversifying their delay exposure.
- Geographically specialized airlines (Alaska, Hawaiian) display unique distance patterns reflecting limited route networks.
  - HA (Hawaiian) and AQ (Aloha) have the greatest prevalence of carrier delays, as well as the greatest share of 2000–3000 mile flights.

Takeaway:
Route structure is a hidden driver of delay risk — airlines operating shorter, high-frequency routes are more prone to frequent but shorter delays, while long-haul carriers face fewer but more severe disruptions.
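
The per-airline distance shares discussed above are a row-normalized count table. One way to get the same kind of table in a single call is `pd.crosstab` with `normalize='index'`. A minimal sketch on made-up flights; the carrier codes and groups in `toy` are illustrative only:

```python
import pandas as pd

# Hypothetical toy flights; carrier codes and groups are illustrative only
toy = pd.DataFrame({
    'UniqueCarrier': ['WN', 'WN', 'WN', 'WN', 'HA', 'HA'],
    'DistanceGroup': ['<500', '<500', '500-1000', '<500', '2000-3000', '<500'],
})

# Row-normalized counts: each airline's flights split across distance groups (%)
share = pd.crosstab(toy['UniqueCarrier'], toy['DistanceGroup'],
                    normalize='index') * 100
print(share.round(1))
```

Each row sums to 100, so airlines with very different flight volumes can be compared on route mix alone.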



# **Visualization Dashboard**
*visualizations from group presentation*



## **Delays by Hour**

In [None]:
# Average Arrival and Departure Delays by Hour

# Visuals (style and palette must be set before the figure is created to take effect)
sns.set_style("whitegrid")
sns.set_palette("Blues_r")

# Create bar plot
plt.figure(figsize=(12,6))
df.groupby('DepHour')[['ArrDelayHours', 'DepDelayHours']].mean().plot(
    kind='bar',
    ax=plt.gca(),
    width=0.8,
    edgecolor='none'
)

# Titles and labels
plt.title('Average Arrival and Departure Delays by Hour', weight='bold')
plt.xlabel('Hour of Day (Departure)')
plt.ylabel('Average Delay (Hours)')
plt.legend(['Arrival Delay', 'Departure Delay'], title='Flight Delay')

# Remove top and right spines, then render
sns.despine()
plt.tight_layout()
plt.show()

## **Delay Variability by Hour**

In [None]:
# Aggregate key hourly metrics
hourly = df.groupby('DepHour').agg(
    total_delays=('ArrDelayHours', 'count'),
    median_delay=('ArrDelayHours', 'median'),
    q1=('ArrDelayHours', lambda x: x.quantile(0.25)),
    q3=('ArrDelayHours', lambda x: x.quantile(0.75))
).reset_index()

fig, ax1 = plt.subplots(figsize=(12,6))

# Bars for total delayed flights
ax1.bar(hourly['DepHour'], hourly['total_delays'],
        color='steelblue', label='Number of Delayed Flights')
ax1.set_ylabel('Number of Delayed Flights', color='steelblue')
ax1.tick_params(axis='y', labelcolor='steelblue')

# Line for median delay (hours)
ax2 = ax1.twinx()
ax2.plot(hourly['DepHour'], hourly['median_delay'],
         color='black', linewidth=2.5, marker='o', label='Median Delay (Hours)')
ax2.fill_between(hourly['DepHour'],
                 hourly['q1'], hourly['q3'],
                 color='black', alpha=0.15, label='Delay Variability (IQR)')
ax2.set_ylabel('Delay Duration (Hours)', color='black')
ax2.tick_params(axis='y', labelcolor='black')

# Visuals
plt.title('Daily Flight Delays: Frequency vs Severity by Hour of Departure', weight='bold')
ax1.set_xlabel('Hour of Day (Departure)')
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
sns.set_style("white")
plt.tight_layout()
plt.show()

This visualization compares **how often flights are delayed** with **how long those delays last** across each hour of the day.
- Delays are most common during the daytime, peaking between 15:00 and 20:00, when overall air traffic is highest.
- The median delay duration (in hours) shows that typical delays are shorter during the day but lengthen again at night, when fewer flights are operating.
- The shaded gray band marks delay variability (IQR): late-night delays are less predictable, with a wider spread of delay lengths.

Key takeaway:

While most delays occur during busy daytime hours due to traffic volume, longer and more unpredictable delays tend to happen late at night — when fewer flights are available to absorb disruptions from earlier in the day.
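
The "less predictable" claim can be expressed as a number by computing the IQR width (q3 minus q1) per departure hour. A small sketch on made-up delays; the `toy` values are assumptions chosen only to show the calculation:

```python
import pandas as pd

# Hypothetical toy delays (hours) for an afternoon hour and a late-night hour
toy = pd.DataFrame({
    'DepHour': [14, 14, 14, 14, 2, 2, 2, 2],
    'ArrDelayHours': [0.5, 0.6, 0.7, 0.8, 0.5, 1.0, 2.0, 4.0],
})

# Quartiles per hour, reshaped so each quantile becomes a column
spread = toy.groupby('DepHour')['ArrDelayHours'].quantile([0.25, 0.75]).unstack()
spread.columns = ['q1', 'q3']

# IQR width: a larger value means a wider spread of delay lengths
spread['iqr'] = spread['q3'] - spread['q1']
print(spread)
```

Sorting the result by `iqr` would list the hours with the most variable delays first.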


## **Daily/Hourly Delays**

In [None]:
# Average arrival delay by weekday and hour

# Create pivot table
pivot = df.pivot_table(
    values='ArrDelayHours',
    index='DepHour',       # rows = hour of day
    columns='DayOfWeek',   # columns = day of week
    aggfunc='mean'
)

# Rename day columns for clarity
pivot.columns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Plot heatmap
plt.figure(figsize=(8,6))
sns.heatmap(pivot, cmap='Blues', annot=False, cbar_kws={'label': 'Average Arrival Delay (hours)'})
plt.title('Average Arrival Delay by Hour and Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Hour of Day')
plt.show()

This heatmap shows how average arrival delays (in hours) vary by both time of day and day of the week.
- The darker areas represent longer average delays (up to 4 hours), while lighter areas indicate shorter or minimal delays.
- Early morning hours (midnight–4 AM) consistently show the longest delays, suggesting residual effects from previous-day disruptions and overnight scheduling limits.
- Daytime hours are mostly stable with shorter delays, reflecting higher traffic but more operational recovery capacity.
- There are no major weekday-to-weekend differences.

Key takeaway:

Delays are driven more by time of day than day of week: flights departing late at night and in the early morning (midnight–4 AM) experience the most severe delays, due to limited recovery options and overnight schedule congestion.
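
One simple way to compare the two effects is to look at how much the row means (hours) vary against how much the column means (days) vary. A sketch on a made-up pivot table; `pivot_toy` and its values are illustrative assumptions, and the means here are unweighted averages of cell averages:

```python
import pandas as pd

# Hypothetical toy pivot: rows = departure hour, columns = day of week
pivot_toy = pd.DataFrame(
    {'Mon': [3.0, 1.0, 1.2], 'Sat': [3.4, 0.9, 1.1]},
    index=pd.Index([2, 12, 18], name='DepHour'),
)

# Range of the hourly means (hour effect) vs. range of the daily means (day effect)
hour_means = pivot_toy.mean(axis=1)
day_means = pivot_toy.mean(axis=0)
hour_range = hour_means.max() - hour_means.min()
day_range = day_means.max() - day_means.min()
print(f'hour range: {hour_range:.2f} h, day range: {day_range:.2f} h')
```

A much larger hour range than day range is what the heatmap shows visually as strong horizontal banding.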


## **Hourly Delays**

In [None]:
# Group delay causes by departure hour (values still in minutes)
hourly_delays_hr = df.groupby('DepHour')[DelayCause].sum().reset_index()

# Convert all delay cause totals from minutes to hours
hourly_delays_hr[DelayCause] = hourly_delays_hr[DelayCause] / 60

# Create interactive stacked area chart
fig = px.area(
    hourly_delays_hr,
    x='DepHour',
    y=DelayCause,
    title='Hourly Distribution of Delay Causes',
    labels={
        'value': 'Total Delay (hours)',
        'DepHour': 'Hour of Day (Departure)'
    }
)

# Clean legend and axes formatting
fig.update_layout(
    legend_title_text='Cause of Delay',
    xaxis=dict(
        tickmode='linear',
        dtick=1,  # show every hour (0–23)
        title='Hour of Day (Departure)',
        range=[0, 23]
    ),
    yaxis_title='Total Delay (hours)',
    plot_bgcolor='white'
)

fig.show()

## **Daily/Monthly Delay Trends**

In [None]:
# Average Arrival Delay by Month/Day

# Pivot table: average arrival delay grouped by month and weekday
pivot = df.pivot_table(
    values='ArrDelayHours',
    index='Month',        # months as rows
    columns='DayOfWeek',  # weekdays as columns
    aggfunc='mean'
)

# Replace numeric weekdays (1–7) with labels
pivot.columns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Replace numeric months (1–12) with short names
month_labels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
pivot.index = month_labels

# Plot heatmap
plt.figure(figsize=(9,6))
sns.heatmap(pivot, cmap='Blues', annot=False, cbar_kws={'label': 'Avg Arrival Delay (hours)'})
plt.title('Average Delay by Month and Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Month')
plt.show()

This heatmap illustrates how average arrival delays fluctuate across both months of the year and days of the week.
- Overall, delays remain fairly consistent throughout the year, averaging around 1 hour.
- August Sundays stand out with the highest average delays (~1.5 hours), likely reflecting peak summer travel demand and congested air traffic.
- Winter months (December–February) also show slightly higher delays, consistent with weather-related disruptions.
- Weekday variation is minimal, indicating that seasonal and operational factors have a greater impact than specific days of the week.

Key takeaway:

Summer and winter travel seasons bring longer average delays — particularly Sundays in August, when peak leisure travel and weather patterns converge to slow operations.
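
Rather than reading the darkest cell off the heatmap, the peak month and weekday can be located directly from the pivot table. A minimal sketch on a made-up table; `pivot_toy` and its values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical toy pivot: rows = month, columns = day of week (average delay, hours)
pivot_toy = pd.DataFrame(
    {'Sat': [0.9, 1.0, 1.1], 'Sun': [1.0, 1.1, 1.5]},
    index=['Jun', 'Jul', 'Aug'],
)

# unstack() flattens the table to a Series indexed by (column, row);
# idxmax() then returns the (day, month) pair of the largest cell
day, month = pivot_toy.unstack().idxmax()
peak = pivot_toy.loc[month, day]
print(f'Peak average delay: {peak} h on {day} in {month}')
```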


## **Monthly Delays**

In [None]:
# Delay Causes by Month

# Sum delay causes by month (for all flights)
monthly_cause = df_all.groupby('Month')[['CarrierDelay', 'WeatherDelay', 'NASDelay', 'LateAircraftDelay']].sum()

# Calculate percentage of total delay minutes
monthly_percent = monthly_cause.div(monthly_cause.sum().sum(), axis=0) * 100

# Convert month numbers to names
monthly_percent = monthly_percent.reset_index()
monthly_percent['Month'] = monthly_percent['Month'].replace({
    1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun',
    7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'
})

# Plot
fig = px.area(
    monthly_percent,
    x='Month',
    y=['CarrierDelay', 'WeatherDelay', 'NASDelay', 'LateAircraftDelay'],
    title='Delay Causes by Month',
    labels={'value':'% of Total Delay Minutes', 'variable':'Delay Cause'}
)

fig.show()

### Delay Causes by Month

This stacked area chart illustrates how different delay causes contribute to the total delay time across months of the year.
- Late Aircraft Delays (purple) consistently account for the largest share of total delay minutes.
- NAS (National Airspace System) Delays (green) track the overall delay trend, increasing slightly in summer (June–August) and in the winter holiday months (December–February), when air traffic volume is highest.
- Carrier Delays (blue) and Weather Delays (red) remain relatively steady throughout the year.
- Delays spike notably in winter (Dec–Feb) and early summer (June), periods often affected by severe weather and schedule congestion.

Key takeaway:

Most delays stem from late-arriving aircraft and system congestion, with seasonal disruption during winter and summer peaks.
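
The chart above divides each monthly total by the year's grand total, so the height of the stack reflects how much delay a month contributes overall. Dividing by each month's own total instead isolates the mix of causes. A sketch contrasting the two normalizations on made-up totals; `monthly_toy` is an illustrative assumption:

```python
import pandas as pd

# Hypothetical monthly delay minutes for two causes
monthly_toy = pd.DataFrame(
    {'CarrierDelay': [100, 200], 'LateAircraftDelay': [300, 400]},
    index=pd.Index([1, 2], name='Month'),
)

# Share of the year's total delay minutes (all cells together sum to 100)
pct_of_year = monthly_toy / monthly_toy.sum().sum() * 100

# Share within each month (each row sums to 100)
pct_within_month = monthly_toy.div(monthly_toy.sum(axis=1), axis=0) * 100

print(pct_of_year)
print(pct_within_month.round(1))
```

The first view answers "which months are worst", the second "which cause dominates in a given month".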


## **Average Delays by Airline**

In [None]:
# Average Delays by Airline
avg_delay_carrier = df.groupby('UniqueCarrier')[['ArrDelay','DepDelay']].mean().sort_values('ArrDelay', ascending=False)
avg_delay_carrier.plot(kind='bar', figsize=(10,6))
plt.title("Average Arrival & Departure Delays by Airline")
plt.ylabel("Arrival/Departure Delay (minutes)")
plt.xlabel("Airline")
plt.show()

This bar chart compares the average arrival and departure delays (in minutes) across major U.S. airlines.
- B6 (JetBlue) shows the highest average delays, exceeding 75 minutes, followed by the regional carriers YV and XE and by UA (United), indicating greater susceptibility to scheduling and congestion issues.
- Major carriers like AA (American), DL (Delta), and US (US Airways) perform better, maintaining delays around 50–60 minutes.
- Low-cost carriers such as WN (Southwest) and F9 (Frontier) show the shortest delays, averaging under 45 minutes, reflecting more efficient turnaround times and simplified networks.
- Across all airlines, arrival delays are consistently slightly higher than departure delays — consistent with compounding effects from prior flight disruptions.

Key takeaway:

Delay severity varies by airline: B6, the regional carriers YV and XE, and UA experience the longest delays, while the low-cost carriers WN and F9 demonstrate stronger schedule reliability and recovery efficiency.
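
The observation that arrival delays run slightly above departure delays can be made explicit by adding a gap column to the per-airline averages. A small sketch on made-up values; the `toy` frame is an assumption, not data from the report:

```python
import pandas as pd

# Hypothetical toy delays in minutes
toy = pd.DataFrame({
    'UniqueCarrier': ['B6', 'B6', 'WN', 'WN'],
    'ArrDelay': [80, 70, 40, 44],
    'DepDelay': [75, 69, 38, 42],
})

# Average delays per airline, plus the arrival-minus-departure gap
avg = toy.groupby('UniqueCarrier')[['ArrDelay', 'DepDelay']].mean()
avg['Gap'] = avg['ArrDelay'] - avg['DepDelay']
print(avg.sort_values('ArrDelay', ascending=False))
```

A positive `Gap` means flights lose additional time after leaving the gate; a negative one means time is made up en route.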
