<a href="https://colab.research.google.com/github/glgunderson/INFOB2DA-PA4/blob/main/pa4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Dashboard Visualizations and Coordinated View Systems**
## Practical Assignment 4 - INFOB2DA
*Tobias Buiten & Grace Gunderson*


In [None]:
# Import Relevant Libraries for Level 2 Visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
# Download Dataset from GitHub Release & Unzip Large Dataset
!curl -L -o DelayedFlights.zip "https://github.com/glgunderson/INFOB2DA-PA4/releases/download/PA4.DATA/DelayedFlights.zip"
!unzip -o DelayedFlights.zip -d data

import pandas as pd

# Load Dataset
df_raw = pd.read_csv('data/airlinedelaycauses_DelayedFlights.csv')

# Copy Dataset for Preprocessing (Delayed Flights)
df = df_raw.copy()

# Copy Dataset for Comparison (Delayed vs. All Flights)
df_all = df_raw.copy()

# Preview Dataset
df_raw.head()

## Dataset Overview

### DOT's Air Travel Consumer Report:
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the **on-time performance** of domestic flights operated by large air carriers.
- DOT provides **monthly summary information** on the number of on-time, delayed, cancelled and diverted flights.
- BTS collects details on the **causes of flight delays** and releases summary statistics and raw data.


## Relevant Features

Time-Related Features
- `Month` (1-12) / `DayOfMonth` (1-31) / `DayOfWeek` (1-7) — calendar indicators for each flight.
- `DepTime` / `ArrTime` — actual departure and arrival times (in HHMM format).
- `CRSDepTime` / `CRSArrTime` — scheduled departure and arrival times (in HHMM format).
- `DepHour` / `ArrHour` — derived from the actual departure/arrival times (`DepTime` / `ArrTime`), showing the hour of day (0–23, or 24 for a recorded time of 2400) - *AFTER PREPROCESSING*

Flight Details
- `Distance` — flight distance in miles.
  - `DistanceGroup` — categorized distance range (<500, 500–1000, etc.) - *AFTER PREPROCESSING*
- `AirTime` — total time spent in the air (minutes).
- `ActualElapsedTime` / `CRSElapsedTime` — actual vs. scheduled total flight durations (minutes).

Delay Metrics
- `DepDelay` / `ArrDelay` — minutes delayed at departure and arrival.
  - `DepDelayHours` / `ArrDelayHours` — same delays converted to hours for easier interpretation - *AFTER PREPROCESSING*
- `CarrierDelay` / `WeatherDelay` / `NASDelay` / `SecurityDelay` / `LateAircraftDelay` — minutes of delay attributed to specific causes.

Operational Flags
- `Cancelled` / `Diverted` — binary indicators (1 = yes, 0 = no).
- `UniqueCarrier` — airline carrier code (e.g., AA, DL, UA).
- `Origin` / `Dest` — airport codes for departure and arrival locations.


## Summary Statistics

In [None]:
# Understand the dataset
df_raw.info()
df_raw.shape

In [None]:
# Basic Summary Statistics
df_raw.describe().T

### Understanding the Dataset
The initial output upon loading the *full raw dataset* includes:
- **1,936,758 rows (flights) x 30 columns (features)**
- The columns consist of both numeric (`int64`, `float64`) and categorical (`object`) features, including:
  - 14 float variables (e.g., `DepTime`, `ArrTime`, `DepDelay`, `ArrDelay`)
  - 11 integer variables (e.g., `Year`, `Month`, `DayOfWeek`, `FlightNum`)
  - 5 object variables (e.g., `UniqueCarrier`, `Origin`, `Dest`)  

According to PA4, the dataset *should* include:
- Flight delay metrics for **1,247,486** different flights.
- **30 different features**, both numerical and categorical.

### Understanding the Record Discrepancy
The difference between the ~1.94 million and ~1.25 million flight records is explained by *dataset scope*.
- The full raw dataset (**1,936,758 rows**) includes **all scheduled flights** in 2008 - whether they were on time, delayed, cancelled, or diverted.
- Only a subset of the flight records (**1,247,488 rows**) contain complete **delay-related data** (`CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay`, `LateAircraftDelay`).
  - These represent flights that actually experienced a *delay event*, which is the primary focus of this data analysis.

***As a result, preprocessing and subsequent visualizations are performed by isolating this ~1.25M delayed-flight subset to ensure meaningful data analysis.***
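
The split between the two row counts can be checked by counting the rows in which all five delay-cause fields are populated. The following is a minimal sketch on a toy frame; the helper name `count_complete_delay_rows` and the sample values are illustrative assumptions, not part of the assignment data.

```python
import numpy as np
import pandas as pd

DELAY_CAUSES = ['CarrierDelay', 'WeatherDelay', 'NASDelay',
                'SecurityDelay', 'LateAircraftDelay']

def count_complete_delay_rows(frame: pd.DataFrame) -> int:
    """Count rows in which every delay-cause column is non-missing."""
    return int(frame[DELAY_CAUSES].notna().all(axis=1).sum())

# Toy data (hypothetical values): only the first row has all five causes recorded
toy = pd.DataFrame({
    'CarrierDelay':      [10.0, np.nan, np.nan],
    'WeatherDelay':      [0.0,  np.nan, 5.0],
    'NASDelay':          [0.0,  np.nan, 0.0],
    'SecurityDelay':     [0.0,  np.nan, 0.0],
    'LateAircraftDelay': [25.0, np.nan, 0.0],
})

complete = count_complete_delay_rows(toy)
print(f"{complete} of {len(toy)} rows have complete delay-cause data")
```

Applied to the real data, `count_complete_delay_rows(df_raw)` would be expected to return the ~1.25M figure quoted above.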

## Preprocessing

In [None]:
# PREPROCESSING - DELAYED FLIGHT DATA (df)

# 1. Drop redundant/irrelevant columns
df = df.drop(columns=['Unnamed: 0', 'Year', 'FlightNum', 'TailNum', 'CancellationCode'], errors='ignore')

# 2. Remove cancelled or diverted flights
df = df[(df['Cancelled'] == 0) & (df['Diverted'] == 0)]

# 3. Keep only records with complete delay-cause data
#    (.copy() makes df an independent frame for the column assignments below)
DelayCause = ['CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
df = df.dropna(subset=DelayCause).copy()

# 4. Fill missing delay-cause values (NaN) with 0
#    (safeguard only: step 3 has already dropped every row with a missing cause)
df[DelayCause] = df[DelayCause].fillna(0)

# 5. Clip negative delay values (representing early arrivals) to 0
df['ArrDelay'] = df['ArrDelay'].clip(lower=0)

# 6. Derive new time-based features for later visualization
df['DepHour'] = (df['DepTime'] // 100).astype(int)
df['ArrHour'] = (df['ArrTime'] // 100).astype(int)

# 7. Convert arrival/departure delays from minutes to hours
df['ArrDelayHours'] = (df['ArrDelay'] / 60).astype('float64')
df['DepDelayHours'] = (df['DepDelay'] / 60).astype('float64')

# 8. Create distance categories (in miles)
df['DistanceGroup'] = pd.cut(
    df['Distance'],
    bins=[0, 500, 1000, 2000, 3000, 5000],
    labels=['<500', '500–1000', '1000–2000', '2000–3000', '3000–5000']
)

# 9. Define delay cause columns (same list as in step 3, restated for later cells)
DelayCause = ['CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']

# 10. Reset DataFrame index to ensure clean row alignment
df = df.reset_index(drop=True)

# Verify structure after preprocessing
df.info()
df.shape

## Data Preprocessing

*Before visualizing flight delay trends, the dataset required preprocessing to ensure data analysis focuses only on valid, delayed flight records.*

### 1. Removed redundant and irrelevant columns
- The first column, `Unnamed: 0`, is an index column automatically generated during export and does not represent a meaningful feature.
- Dropped `Year` since all records were from 2008 (constant value), offering no variance for analysis.
- `FlightNum`, `TailNum`, and `CancellationCode` provide no meaningful information for data analysis of flight delays.

### 2. Excluded cancelled or diverted flights
- Removed flights where `Cancelled = 1` or `Diverted = 1`.  
- These records do not have valid arrival/departure data, which is essential for delay analysis.
- The focus of analysis is delayed flights, so this information is not relevant.

### 3. Retained only records with complete delay-cause data
- Filtered to include only rows where all five delay cause fields were present:  
  `CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay`, `LateAircraftDelay`.  
- This isolates the ~1.25 million delayed-flight subset that contains full delay-cause information — the focus of this data analysis.

### 4. Filled missing delay-cause values with zero
- Any remaining `NaN` values in the delay-cause columns were replaced with `0`. Because step 3 already drops rows with missing causes, this acts as a safeguard.
- A value of `0` indicates *no delay* from that cause (e.g., `WeatherDelay`).

### 5. Clipped negative arrival delay values
- Negative values in `ArrDelay` represent early arrivals (e.g., `-109` = 109 minutes early).  
- To focus purely on delays, these were clipped to `0`, indicating no delay.  
- `DepDelay` contained no negative values, so no adjustment was required.
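
The clipping step can be illustrated on a handful of values. This is a minimal sketch with hypothetical delays in minutes; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical arrival delays in minutes; negative values are early arrivals
arr_delay = pd.Series([-109, -5, 0, 12, 180])

# Count the early arrivals before clipping (the same check can confirm
# that a column such as DepDelay has no negative values)
n_negative = int((arr_delay < 0).sum())

# Replace every negative value with 0 and leave real delays untouched
clipped = arr_delay.clip(lower=0)

print(n_negative)        # 2
print(clipped.tolist())  # [0, 0, 0, 12, 180]
```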

### 6. Derived new time-based features
- Created `DepHour` and `ArrHour` by converting the actual departure/arrival times in `DepTime` and `ArrTime` (e.g., `1530`) to hour bins (e.g., `15`).  
- This facilitates later visualization of delay patterns by time of day.
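
The hour binning is a floor division of the HHMM value by 100. The sketch below uses hypothetical times to show the edge cases; the optional wrap with `% 24` is an assumption about how a value of 2400 might be handled and is not something the preprocessing cell does.

```python
import pandas as pd

# Hypothetical HHMM times stored as floats, as they appear after read_csv
times = pd.Series([5.0, 930.0, 1530.0, 2359.0, 2400.0])

# Floor-divide by 100 to drop the minutes and keep the hour
hours = (times // 100).astype(int)
print(hours.tolist())          # [0, 9, 15, 23, 24]

# Optional: fold a recorded 2400 (midnight) back into hour 0
hours_wrapped = hours % 24
print(hours_wrapped.tolist())  # [0, 9, 15, 23, 0]
```

A time of 2400, if present, yields hour 24, which is why an hourly axis may need 25 tick positions rather than 24.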

### 7. Converted arrival and departure delays from minutes to hours
- Created two new columns, `ArrDelayHours` and `DepDelayHours`, by dividing the delay values (in minutes) by 60.
- This transformation makes large delay durations easier to interpret (e.g., 180 minutes → 3.0 hours).
- Both columns were stored as floating-point values (`float64`) to preserve precision for visualizations and calculations.

### 8. Created distance categories (DistanceGroup)
- Grouped flight distances (in miles) into five categories to simplify comparison across flight lengths.
- The new `DistanceGroup` column was created using `pd.cut()` with the following bins and labels: `<500`, `500–1000`, `1000–2000`, `2000–3000`, `3000–5000`.
- This grouping allows visualizations to explore whether longer flights are more or less prone to delays.
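
How `pd.cut()` assigns boundary values is worth checking, since its default `right=True` makes each bin closed on the right. The following sketch uses hypothetical distances with the same bins and labels as the preprocessing cell.

```python
import pandas as pd

# Hypothetical flight distances in miles, including values on bin edges
distances = pd.Series([120, 500, 501, 1000, 2500, 4962])

labels = ['<500', '500–1000', '1000–2000', '2000–3000', '3000–5000']
groups = pd.cut(
    distances,
    bins=[0, 500, 1000, 2000, 3000, 5000],
    labels=labels,
)

# With right-closed bins, exactly 500 miles lands in '<500'
# and exactly 1000 miles lands in '500–1000'
print(groups.astype(str).tolist())
```

Passing `right=False` would make the bins left-closed instead, so a 500-mile flight would fall under `500–1000`.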

### 9. Defined delay-cause columns
- Stored the five delay-cause column names in the `DelayCause` list for consistent reference in later analysis of flight delays.


### 10. Reset DataFrame index
- Reset the index after all filtering steps to maintain continuous row alignment.

### Preprocessing Summary
After cleaning and filtering the dataset, the records were reduced from **1,936,758** total scheduled flights to **1,247,488** valid delayed-flight records.

*This refined dataset now focuses exclusively on flights that experienced measurable delays, removing cancelled, diverted, or incomplete records.*
- Dataset size reduced: from ~443 MB to ~277 MB
- Features retained: 30 (including derived time-based and distance-group variables)
- Delay cause completeness: All five causes (`CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay`, `LateAircraftDelay`) present for every record
- Negative delay values clipped (early arrivals = 0 hours delay)
- Delays converted from minutes to hours for interpretability
- Added columns: `DepHour`, `ArrHour`, `DepDelayHours`, `ArrDelayHours`, `DistanceGroup`

*The resulting dataset is now optimized for visual analysis of **when, why, and how long flights are delayed**, forming a clean basis for the dashboard visualizations.*


In [None]:
# PREPROCESSING — FULL FLIGHT DATA (df_all)
# Includes all flights (on-time, cancelled, diverted)

# 1. Fill NaNs in delay-cause columns with 0 (on-time flights)
df_all[DelayCause] = df_all[DelayCause].fillna(0)

# 2. Map cancellation codes (A–D) to delay categories
cancel_map = {'A': 'CarrierDelay', 'B': 'WeatherDelay', 'C': 'NASDelay', 'D': 'SecurityDelay'}
df_all['CancelMapped'] = df_all['CancellationCode'].map(cancel_map).fillna('Not Cancelled')

# 3. Convert delay columns to hours for consistency
df_all['ArrDelayHours'] = (df_all['ArrDelay'] / 60).astype('float64')
df_all['DepDelayHours'] = (df_all['DepDelay'] / 60).astype('float64')
df_all[DelayCause] = df_all[DelayCause] / 60

# 4. Add departure and arrival hour columns if not already present
df_all['DepHour'] = (df_all['DepTime'] // 100).astype('Int64')
df_all['ArrHour'] = (df_all['ArrTime'] // 100).astype('Int64')

df_all.info()

## Review Dataset

In [None]:
# Basic summary statistics after preprocessing
df.describe().T

In [None]:
print(f"Raw data shape: {df_raw.shape}")
print(f"Delayed flight data shape: {df.shape}")
print(f"All flight data shape: {df_all.shape}")
print(f"Rows without delay-cause data: {df_raw.shape[0] - df.shape[0]:,}")

## Investigate Dataset

In [None]:
# Number of delayed flights due to each cause?
(df[['CarrierDelay','WeatherDelay','NASDelay','SecurityDelay','LateAircraftDelay']] > 0).sum()

In [None]:
# Delays greater than 24 hours?
(df['ArrDelay'] > 1440).sum(), (df['DepDelay'] > 1440).sum(), (df['CarrierDelay'] > 1440).sum()

In [None]:
# Number of Delayed Flights by Departure Hour?

sns.set_style("whitegrid")
sns.set_palette("Blues_r")

# Count delayed flights per hour
delays_by_hour = df['DepHour'].value_counts().sort_index().reset_index()
delays_by_hour.columns = ['DepHour', 'DelayedFlights']

# Bar chart
plt.figure(figsize=(10,5))
sns.barplot(data=delays_by_hour, x='DepHour', y='DelayedFlights')

plt.title('Number of Delayed Flights by Departure Hour')
plt.xlabel('Hour of Day (Departure)')
plt.ylabel('Number of Delayed Flights')
plt.xticks(range(0, 24))
sns.despine()
plt.tight_layout()
plt.show()

In [None]:
# Number of Delayed Flights by Arrival Hour?

sns.set_style("whitegrid")
sns.set_palette("Blues_r")

# Count delayed flights per hour
delays_by_hour = df['ArrHour'].value_counts().sort_index().reset_index()
delays_by_hour.columns = ['ArrHour', 'DelayedFlights']

# Bar chart
plt.figure(figsize=(10,5))
sns.barplot(data=delays_by_hour, x='ArrHour', y='DelayedFlights')

plt.title('Number of Delayed Flights by Arrival Hour')
plt.xlabel('Hour of Day (Arrival)')
plt.ylabel('Number of Delayed Flights')
plt.xticks(range(0, 24))
sns.despine()
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(12,5))
width = 0.4

# Count delayed flights per hour once, then reuse the counts for the bars
dep_counts = df.groupby('DepHour').size()
arr_counts = df.groupby('ArrHour').size()

# Plot departure and arrival side-by-side instead of overlapping
plt.bar(dep_counts.index - width/2, dep_counts.values,
        width=width, color='steelblue', label='Departure')

plt.bar(arr_counts.index + width/2, arr_counts.values,
        width=width, color='lightblue', label='Arrival')

# Visuals
plt.title('Number of Delayed Flights by Hour (Departure vs Arrival)', weight='bold')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Delayed Flights')
plt.xticks(range(0, 24))
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
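# Aggregate by departure hour (defined here so the cell runs top to bottom)
hourly_stats = df.groupby('DepHour').agg(
    avg_delay=('ArrDelayHours', 'mean'),
    total_delays=('ArrDelayHours', 'count')
).reset_index()

fig, axes = plt.subplots(1,2,figsize=(22,5), sharex=True)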

sns.barplot(x='DepHour', y='total_delays', data=hourly_stats,
            color='skyblue', ax=axes[0])
axes[0].set_title('Number of Delayed Flights by Hour')
axes[0].set_ylabel('Number of Flights')

sns.lineplot(x='DepHour', y='avg_delay', data=hourly_stats,
             color='steelblue', marker='o', ax=axes[1])
axes[1].set_title('Average Delay by Hour')
axes[1].set_ylabel('Hours')

for ax in axes:
    ax.set_xlabel('Hour of Day (Departure)')
    ax.set_xticks(range(0,24))
sns.despine()
plt.tight_layout()
plt.show()

In [None]:
sns.set_style("whitegrid")

# Aggregate data by hour
hourly_stats = df.groupby('DepHour').agg(
    avg_delay=('ArrDelayHours', 'mean'),
    total_delays=('ArrDelayHours', 'count')
).reset_index()

# Normalize both metrics to comparable scale
hourly_stats['avg_delay_scaled'] = hourly_stats['avg_delay'] / hourly_stats['avg_delay'].max()
hourly_stats['total_delays_scaled'] = (hourly_stats['total_delays'] / hourly_stats['total_delays'].max()) * -1

plt.figure(figsize=(12,6))

sns.barplot(data=hourly_stats, x='DepHour', y='total_delays_scaled',
            color='steelblue', label='Number of Delayed Flights')
sns.barplot(data=hourly_stats, x='DepHour', y='avg_delay_scaled',
            color='skyblue', label='Average Delay Duration (Hours)')

# Titles and labels
plt.title('Opposing Trends: Flight Volume vs. Average Delay by Hour', weight='bold')
plt.xlabel('Hour of Day (Departure)')
plt.ylabel('')
plt.yticks([])
plt.legend(loc='upper right', fontsize=12)
plt.show()

In [None]:
sns.set_style("whitegrid")

plt.figure(figsize=(12,5))
sns.boxplot(
    data=df, x='DepHour', y='ArrDelayHours',
    showfliers=False,
    color='lightblue'
)

plt.title('Distribution of Arrival Delay Durations by Departure Hour', weight='bold')
plt.xlabel('Hour of Day (Departure)')
plt.ylabel('Arrival Delay Duration (Hours)')
plt.tight_layout()
plt.show()

In [None]:
hourly_total = df.groupby('DepHour')['ArrDelayHours'].sum().reset_index()
hourly_total['CumulativeDelay'] = hourly_total['ArrDelayHours'].cumsum()

plt.figure(figsize=(9,5))
sns.lineplot(data=hourly_total, x='DepHour', y='CumulativeDelay', color='steelblue', marker='o')
plt.title('Cumulative Build-Up of Total Delay Hours Throughout the Day', weight='bold')
plt.xlabel('Hour of Day (Departure)')
plt.ylabel('Cumulative Delay (Hours)')
plt.tight_layout()
plt.show()

In [None]:
# Group by departure hour
total_by_hour = df_all.groupby('DepHour').size().rename('TotalFlights')
delayed_by_hour = df.groupby('DepHour').size().rename('DelayedFlights')

# Merge into one table
compare = pd.concat([total_by_hour, delayed_by_hour], axis=1).fillna(0)

# Calculate percentage of flights delayed
compare['PctDelayed'] = (compare['DelayedFlights'] / compare['TotalFlights']) * 100

# Plot
plt.figure(figsize=(12,6))
plt.bar(compare.index, compare['PctDelayed'], color='steelblue')
plt.title('Percentage of Flights Delayed by Hour of Departure', weight='bold')
plt.xlabel('Hour of Day (Departure)')
plt.ylabel('% of Flights Delayed')
plt.xticks(range(0,25))
plt.tight_layout()
plt.show()

# Summary stats
overall_pct = (df.shape[0] / df_all.shape[0]) * 100
print(f"Overall, {overall_pct:.2f}% of all flights in the dataset experienced a delay.")

In [None]:
# Group by departure hour
total_by_hour = df_all.groupby('DepHour').size().rename('TotalFlights')
delayed_by_hour = df.groupby('DepHour').size().rename('DelayedFlights')

# Merge datasets
compare = pd.concat([total_by_hour, delayed_by_hour], axis=1).fillna(0)

# Plot both on same chart
plt.figure(figsize=(12,6))
plt.bar(compare.index - 0.2, compare['TotalFlights'], width=0.4, color='lightblue', label='Total Flights')
plt.bar(compare.index + 0.2, compare['DelayedFlights'], width=0.4, color='steelblue', label='Delayed Flights')

plt.title('Total vs Delayed Flights by Hour of Departure', weight='bold')
plt.xlabel('Hour of Day (Departure)')
plt.ylabel('Number of Flights')
plt.xticks(range(0, 24))
plt.legend()
plt.tight_layout()
plt.show()

Key Findings:
- The majority of delayed flights occur during the daytime hours, reflecting the busiest hours for flight traffic.
  - Arrival delays tend to peak slightly later in the day, showing how early departure delays cause downstream congestion by evening (delay propagation).
- While most delays occur during the day, nighttime flights experience longer and more unpredictable delays. Low traffic volume at night means that the few delays that occur are often severe.
- The longest delays appear in the evening hours in the visualizations, although this may be skewed by how few delayed flights there are at those hours compared with the daytime.
- Daytime delays are compact (consistent short delays) while early/late hours have extreme variability.
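
The contrast between frequency and variability in these findings can be quantified per hour with a count, a median, and an interquartile range. The sketch below runs on toy data; `hourly_spread` and the sample values are hypothetical and only illustrate the calculation.

```python
import pandas as pd

def hourly_spread(frame: pd.DataFrame) -> pd.DataFrame:
    """Per departure hour: number of flights, median delay, and IQR of delay."""
    grouped = frame.groupby('DepHour')['ArrDelayHours']
    out = grouped.agg(flights='count', median='median')
    out['iqr'] = grouped.quantile(0.75) - grouped.quantile(0.25)
    return out

# Toy data (hypothetical): hour 3 has widely spread delays, hour 14 compact ones
toy = pd.DataFrame({
    'DepHour':       [3, 3, 3, 3, 14, 14, 14, 14],
    'ArrDelayHours': [0.5, 1.0, 4.0, 9.0, 0.5, 0.6, 0.8, 1.0],
})

spread = hourly_spread(toy)
print(spread)
```

A larger `iqr` next to a small `flights` count is the pattern described above for early and late hours, and the count shows how much data each estimate rests on.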

# **Visualization Dashboard**

In [None]:
# Average Arrival and Departure Delays by Hour

# Visuals (style and palette must be set before the figure is created to apply to it)
sns.set_style("whitegrid")
sns.set_palette("Blues_r")

# Create bar plot
plt.figure(figsize=(12,6))
df.groupby('DepHour')[['ArrDelayHours', 'DepDelayHours']].mean().plot(
    kind='bar',
    ax=plt.gca(),
    width=0.8,
    edgecolor='none'
)

# Titles and labels
plt.title('Average Arrival and Departure Delays by Hour', weight='bold')
plt.xlabel('Hour of Day (Departure)')
plt.ylabel('Average Delay (Hours)')
plt.legend(['Arrival Delay', 'Departure Delay'], title='Flight Delay')

sns.despine()
plt.tight_layout()
plt.show()

In [None]:
# Aggregate key hourly metrics
hourly = df.groupby('DepHour').agg(
    total_delays=('ArrDelayHours', 'count'),
    median_delay=('ArrDelayHours', 'median'),
    q1=('ArrDelayHours', lambda x: x.quantile(0.25)),
    q3=('ArrDelayHours', lambda x: x.quantile(0.75))
).reset_index()

# Visuals (set the style before creating the figure so it applies to this plot)
sns.set_style("white")

fig, ax1 = plt.subplots(figsize=(12,6))

# Bars for total delayed flights
ax1.bar(hourly['DepHour'], hourly['total_delays'],
        color='steelblue', label='Number of Delayed Flights')
ax1.set_ylabel('Number of Delayed Flights', color='steelblue')
ax1.tick_params(axis='y', labelcolor='steelblue')

# Line for median delay (hours), with the interquartile range as a shaded band
ax2 = ax1.twinx()
ax2.plot(hourly['DepHour'], hourly['median_delay'],
         color='black', linewidth=2.5, marker='o', label='Median Delay (Hours)')
ax2.fill_between(hourly['DepHour'],
                 hourly['q1'], hourly['q3'],
                 color='black', alpha=0.15, label='Delay Variability (IQR)')
ax2.set_ylabel('Delay Duration (Hours)', color='black')
ax2.tick_params(axis='y', labelcolor='black')

# Titles, labels and legends
plt.title('Daily Flight Delays: Frequency vs Severity by Hour of Departure', weight='bold')
ax1.set_xlabel('Hour of Day (Departure)')
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.tight_layout()
plt.show()

## **Visualization 2**

In [None]:
# Average arrival delay by weekday and hour

# Create pivot table
pivot = df.pivot_table(
    values='ArrDelayHours',
    index='DepHour',       # rows = hour of day
    columns='DayOfWeek',   # columns = day of week
    aggfunc='mean'
)

# Rename day columns for clarity
pivot.columns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Plot heatmap
plt.figure(figsize=(8,6))
sns.heatmap(pivot, cmap='coolwarm', annot=False, cbar_kws={'label': 'Average Arrival Delay (hours)'})
plt.title('Average Arrival Delay by Hour and Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Hour of Day')
plt.show()

## **Visualization 3**

In [None]:
# Group delay causes by departure hour (values still in minutes)
hourly_delays_hr = df.groupby('DepHour')[DelayCause].sum().reset_index()

# Convert all delay cause totals from minutes to hours
hourly_delays_hr[DelayCause] = hourly_delays_hr[DelayCause] / 60

# Create interactive stacked area chart
fig = px.area(
    hourly_delays_hr,
    x='DepHour',
    y=DelayCause,
    title='Hourly Distribution of Delay Causes',
    labels={
        'value': 'Total Delay (hours)',
        'DepHour': 'Hour of Day (Departure)'
    }
)

# Clean legend and axes formatting
fig.update_layout(
    legend_title_text='Cause of Delay',
    xaxis=dict(
        tickmode='linear',
        dtick=1,  # show every hour (0–23)
        title='Hour of Day (Departure)',
        range=[0, 23]
    ),
    yaxis_title='Total Delay (hours)',
    plot_bgcolor='white'
)

fig.show()

## **Visualization 4**

In [None]:
# Average Arrival Delay by Month/Day

# Pivot table: average arrival delay grouped by month and weekday
pivot = df.pivot_table(
    values='ArrDelayHours',
    index='Month',        # months as rows
    columns='DayOfWeek',  # weekdays as columns
    aggfunc='mean'
)

# Replace numeric weekdays (1–7) with labels
pivot.columns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Replace numeric months (1–12) with short names
month_labels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
pivot.index = month_labels

# Plot heatmap
plt.figure(figsize=(9,6))
sns.heatmap(pivot, cmap='Blues', annot=False, cbar_kws={'label': 'Avg Arrival Delay (hours)'})
plt.title('Average Delay by Month and Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Month')
plt.show()

Describe

## **Visualization 5**

In [None]:
# Delay Causes by Month

# Sum delay-cause minutes by month (for all flights; NaN values are skipped by sum)
monthly_cause = df_raw.groupby('Month')[['CarrierDelay', 'WeatherDelay', 'NASDelay', 'LateAircraftDelay']].sum()

# Express each month/cause total as a percentage of all delay minutes across the year
monthly_percent = monthly_cause / monthly_cause.sum().sum() * 100

# Convert month numbers to names
monthly_percent = monthly_percent.reset_index()
monthly_percent['Month'] = monthly_percent['Month'].replace({
    1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun',
    7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'
})

# Plot
fig = px.area(
    monthly_percent,
    x='Month',
    y=['CarrierDelay', 'WeatherDelay', 'NASDelay', 'LateAircraftDelay'],
    title='Delay Causes by Month',
    labels={'value':'% of Total Delay Minutes', 'variable':'Delay Cause'}
)

fig.show()

## **Visualization 6**

In [None]:
# Average Delays by Airline
avg_delay_carrier = df.groupby('UniqueCarrier')[['ArrDelay','DepDelay']].mean().sort_values('ArrDelay', ascending=False)
avg_delay_carrier.plot(kind='bar', figsize=(10,6))
plt.title("Average Arrival & Departure Delays by Airline")
plt.ylabel("Arrival/Departure Delay (minutes)")
plt.show()