<a href="https://colab.research.google.com/github/glgunderson/INFOB2DA-PA4/blob/main/pa4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Dashboard Visualizations and Coordinated View Systems**
## Practical Assignment 4 - INFOB2DA
*Tobias Buiten & Grace Gunderson*


In [None]:
# Import Relevant Libraries for Level 2 Visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
# Download Dataset from GitHub Release & Unzip Large Dataset
!curl -L -o DelayedFlights.zip "https://github.com/glgunderson/INFOB2DA-PA4/releases/download/PA4.DATA/DelayedFlights.zip"
!unzip -o DelayedFlights.zip -d data

import pandas as pd

# Load Dataset
df_raw = pd.read_csv('data/airlinedelaycauses_DelayedFlights.csv')

# Copy Dataset for Preprocessing
df = df_raw.copy()

# Preview Dataset
df_raw.head()

## Dataset Overview

### DOT’s Air Travel Consumer Report:
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the **on-time performance** of domestic flights operated by large air carriers.
- DOT provides **monthly summary information** on the number of on-time, delayed, cancelled and diverted flights.
- BTS collects details on the **causes of flight delays** and releases summary statistics and raw data.


## Relevant Features

Time-Related Features
- `Month` (1-12) / `DayOfMonth` (1-31) / `DayOfWeek` (1-7) — calendar indicators for each flight.
- `DepTime` / `ArrTime` — actual departure and arrival times (in HHMM format).
- `CRSDepTime` / `CRSArrTime` — scheduled departure and arrival times (in HHMM format).
- `DepHour` / `ArrHour`: hour of day (0–23), derived from the actual departure and arrival times (`DepTime` / `ArrTime`) - *AFTER PREPROCESSING*

Flight Details
- `Distance` — flight distance in miles.
  - `DistanceGroup` — categorized distance range (<500, 500–1000, etc.) - *AFTER PREPROCESSING*
- `AirTime` — total time spent in the air (minutes).
- `ActualElapsedTime` / `CRSElapsedTime` — actual vs. scheduled total flight durations (minutes).

Delay Metrics
- `DepDelay` / `ArrDelay` — minutes delayed at departure and arrival.
  - `DepDelayHours` / `ArrDelayHours` — same delays converted to hours for easier interpretation - *AFTER PREPROCESSING*
- `CarrierDelay` / `WeatherDelay` / `NASDelay` / `SecurityDelay` / `LateAircraftDelay`: minutes of delay attributed to specific causes.

Operational Flags
- `Cancelled` / `Diverted` — binary indicators (1 = yes, 0 = no).
- `UniqueCarrier` — airline carrier code (e.g., AA, DL, UA).
- `Origin` / `Dest` — airport codes for departure and arrival locations.


## Summary Statistics

In [None]:
# Understand the dataset
df_raw.info()
df_raw.shape

In [None]:
# Basic Summary Statistics
df_raw.describe().T

### Understanding the Dataset
The initial output upon loading the *full raw dataset* includes:
- **1,936,758 rows (flights) x 30 columns (features)**
- The columns consist of both numeric (`int64`, `float64`) and categorical (`object`) features, including:
  - 14 float variables (e.g., `DepTime`, `ArrTime`, `DepDelay`, `ArrDelay`)
  - 11 integer variables (e.g., `Year`, `Month`, `DayOfWeek`, `FlightNum`)
  - 5 object variables (e.g., `UniqueCarrier`, `Origin`, `Dest`)  
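
The per-dtype column counts listed above can be tallied directly instead of being read off the `info()` output. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the flight data); on the real data the same call would be applied to `df_raw`.

```python
import pandas as pd

# Hypothetical stand-in for df_raw with one column per dtype family
toy = pd.DataFrame({
    "Month": [1, 2, 3],                   # integer column
    "DepDelay": [12.0, 45.0, 8.0],        # float column
    "UniqueCarrier": ["AA", "DL", "UA"],  # text column (object in pandas 2.x)
})

# Count how many columns share each dtype; dtypes are compared as strings
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)
```

Applied to `df_raw`, the three counts should add up to the 30 columns reported by `df_raw.shape`.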

According to PA4, the dataset *should* include:
- Flight delay metrics for **1,247,486** different flights.
- **30 different features**, both numerical and categorical.

### Understanding the Record Discrepancy
The difference between the ~1.94 million and ~1.25 million flight records is explained by *dataset scope*.
- The full raw dataset (**1,936,758 rows**) includes **all scheduled flights** in 2008 - whether they were on time, delayed, cancelled, or diverted.
- Only a subset of the flight records (**1,247,488 rows**) contain complete **delay-related data** (`CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay`, `LateAircraftDelay`).
  - These represent flights that actually experienced a *delay event*, which is the primary focus of this data analysis.

As a result, all preprocessing and subsequent visualizations are performed on this ~1.25M delayed-flight subset to ensure meaningful data analysis.
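
The size of the delayed-flight subset can be checked by counting the rows in which all five delay-cause fields are present. This is a minimal sketch on made-up rows; the helper name `count_complete_cause_rows` and the `toy` frame are illustrative assumptions, and on the real data the function would be called on `df_raw`.

```python
import numpy as np
import pandas as pd

DELAY_CAUSES = ["CarrierDelay", "WeatherDelay", "NASDelay",
                "SecurityDelay", "LateAircraftDelay"]

def count_complete_cause_rows(frame: pd.DataFrame) -> int:
    """Return the number of rows where every delay-cause field is non-null."""
    return int(frame[DELAY_CAUSES].notna().all(axis=1).sum())

# Made-up rows: the second row has no delay-cause data at all
toy = pd.DataFrame({
    "CarrierDelay":      [10.0, np.nan, 0.0],
    "WeatherDelay":      [0.0,  np.nan, 0.0],
    "NASDelay":          [5.0,  np.nan, 20.0],
    "SecurityDelay":     [0.0,  np.nan, 0.0],
    "LateAircraftDelay": [0.0,  np.nan, 15.0],
})

complete = count_complete_cause_rows(toy)
print(complete)
```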

## Preprocessing

In [None]:
# PREPROCESSING

# 1. Drop redundant/irrelevant columns
df = df.drop(columns=['Unnamed: 0', 'Year', 'FlightNum', 'TailNum', 'CancellationCode'], errors='ignore')

# 2. Remove cancelled or diverted flights
df = df[(df['Cancelled'] == 0) & (df['Diverted'] == 0)]

# 3. Identify records with complete delay-cause data
DelayCause = ['CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
df = df.dropna(subset=DelayCause)

# 4. Safeguard: fill any remaining NaN delay-cause values with 0 (no rows are affected after step 3)
df[DelayCause] = df[DelayCause].fillna(0)

# 5. Clip negative delay values (representing early arrivals) to 0
df['ArrDelay'] = df['ArrDelay'].clip(lower=0)

# 6. Derive hour-of-day features from the actual departure/arrival times (HHMM)
#    `% 24` maps a 2400 timestamp to hour 0 so values stay within 0–23
df['DepHour'] = (df['DepTime'] // 100).astype(int) % 24
df['ArrHour'] = (df['ArrTime'] // 100).astype(int) % 24

# 7. Convert arrival/departure delays from minutes to hours
df['ArrDelayHours'] = (df['ArrDelay'] / 60).astype('float64')
df['DepDelayHours'] = (df['DepDelay'] / 60).astype('float64')

# 8. Create distance categories (in miles)
df['DistanceGroup'] = pd.cut(
    df['Distance'],
    bins=[0, 500, 1000, 2000, 3000, 5000],
    labels=['<500', '500–1000', '1000–2000', '2000–3000', '3000–5000']
)

# 9. Reset DataFrame index to ensure clean row alignment
df = df.reset_index(drop=True)

# Verify structure after preprocessing
df.info()
df.shape

## Data Preprocessing

*Before visualizing flight delay trends, the dataset required preprocessing to ensure data analysis focuses only on valid, delayed flight records.*

### 1. Removed redundant and irrelevant columns
- The first column, `Unnamed: 0`, is an index column automatically generated during export and does not represent a meaningful feature.
- Dropped `Year` since all records were from 2008 (constant value), offering no variance for analysis.
- `FlightNum`, `TailNum`, and `CancellationCode` provide no meaningful information for data analysis of flight delays.

### 2. Excluded cancelled or diverted flights
- Removed flights where `Cancelled = 1` or `Diverted = 1`.  
- These records do not have valid arrival/departure data, which is essential for delay analysis.
- The focus of analysis is delayed flights, so this information is not relevant.

### 3. Retained only records with complete delay-cause data
- Filtered to include only rows where all five delay cause fields were present:  
  `CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay`, `LateAircraftDelay`.  
- This isolates the ~1.25 million delayed-flight subset that contains full delay-cause information — the focus of this data analysis.

### 4. Filled missing delay-cause values with zero
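- Any remaining `NaN` values in the delay-cause columns were replaced with `0`, on the reading that a missing value means *no delay* from that cause (e.g., `WeatherDelay`).
- Because step 3 already drops every row with a missing delay cause, this step acts as a safeguard and leaves the filtered data unchanged.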

### 5. Clipped negative arrival delay values
- Negative values in `ArrDelay` represent early arrivals (e.g., `-109` = 109 minutes early).  
- To focus purely on delays, these were clipped to `0`, indicating no delay.  
- `DepDelay` contained no negative values, so no adjustment was required.
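
As a small illustration of the clipping step, the sketch below applies `Series.clip(lower=0)` to a few made-up delay values (the numbers are illustrative, not taken from the dataset).

```python
import pandas as pd

# Made-up arrival delays in minutes; negative values are early arrivals
arr_delay = pd.Series([-109.0, -5.0, 0.0, 42.0])

# Values below 0 are raised to 0; everything else is left untouched
clipped = arr_delay.clip(lower=0)
print(clipped.tolist())
```

Only the early arrivals change, so positive delays keep their original magnitude.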

### 6. Derived new time-based features
- Created `DepHour` and `ArrHour` by converting the actual departure/arrival times in `DepTime` and `ArrTime` (e.g., `1530`) to hour bins (e.g., `15`).  
- This facilitates later visualization of delay patterns by time of day.
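
The HHMM-to-hour conversion can be sketched on a few made-up values. One assumption worth flagging: if a time field contains `2400` (midnight written as the end of the day), plain floor division yields hour `24`, so the sketch wraps it back to `0` with `% 24`.

```python
import pandas as pd

# Illustrative HHMM values stored as floats, as in a CSV with missing entries
hhmm = pd.Series([5.0, 930.0, 1530.0, 2359.0, 2400.0])

# Floor-divide by 100 to drop the minutes, then wrap hour 24 back to 0
hour = (hhmm // 100).astype(int) % 24
print(hour.tolist())
```

Without the `% 24` step the last value would land in a 25th bin and fall outside a 0–23 axis.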

### 7. Converted arrival and departure delays from minutes to hours
- Created two new columns, `ArrDelayHours` and `DepDelayHours`, by dividing the delay values (in minutes) by 60.
- This transformation makes large delay durations easier to interpret (e.g., 180 minutes → 3.0 hours).
- Both columns were stored as floating-point values (`float64`) to preserve precision for visualizations and calculations.

### 8. Created distance categories (DistanceGroup)
- Grouped flight distances (in miles) into five categories to simplify comparison across flight lengths.
- The new `DistanceGroup` column was created using `pd.cut()` with the following bins and labels: `<500`, `500–1000`, `1000–2000`, `2000–3000`, `3000–5000`.
- This grouping allows visualizations to explore whether longer flights are more or less prone to delays.
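
How `pd.cut()` treats bin edges is easy to overlook, so here is a minimal sketch with made-up distances. By default the bins are closed on the right, which means a distance of exactly 500 miles is labelled `<500`; passing `right=False` would assign it to the next group instead.

```python
import pandas as pd

bins = [0, 500, 1000, 2000, 3000, 5000]
labels = ['<500', '500–1000', '1000–2000', '2000–3000', '3000–5000']

# Made-up distances in miles, including a value that sits on a bin edge
distances = pd.Series([120, 500, 501, 2475, 4962])

# Default right=True gives intervals (0, 500], (500, 1000], ...
groups = pd.cut(distances, bins=bins, labels=labels)
print(groups.astype(str).tolist())
```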

### 9. Reset DataFrame index
- Reset the index after all filtering steps to maintain continuous row alignment.

### Preprocessing Summary
After cleaning and filtering the dataset, the records were reduced from **1,936,758** total scheduled flights to **1,247,488** valid delayed-flight records.

*This refined dataset now focuses exclusively on flights that experienced measurable delays, removing cancelled, diverted, or incomplete records.*
- Dataset size reduced: from ~443 MB to ~277 MB
- Features retained: 30 (including derived time-based and distance-group variables)
- Delay cause completeness: All five causes (`CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay`, `LateAircraftDelay`) present for every record
- Negative delay values clipped (early arrivals = 0 hours delay)
- Delays converted from minutes → hours for interpretability
- Added columns: `DepHour`, `ArrHour`, `DepDelayHours`, `ArrDelayHours`, `DistanceGroup`

*The resulting dataset is now optimized for visual analysis of **when, why, and how long flights are delayed**, forming a clean basis for the dashboard visualizations.*
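
The properties claimed in this summary can be turned into a few automated checks. The sketch below is one possible way to do that; the helper `check_cleaned` and the two-row `toy` frame are hypothetical and only mirror the column names used in this notebook. On the real data the function would be called on `df`.

```python
import pandas as pd

DELAY_CAUSES = ["CarrierDelay", "WeatherDelay", "NASDelay",
                "SecurityDelay", "LateAircraftDelay"]

def check_cleaned(frame: pd.DataFrame) -> dict:
    """Run a few named sanity checks and return {check_name: passed}."""
    return {
        "no_cancelled_or_diverted": bool(
            ((frame["Cancelled"] == 0) & (frame["Diverted"] == 0)).all()
        ),
        "delay_causes_complete": bool(frame[DELAY_CAUSES].notna().all().all()),
        "arr_delay_non_negative": bool((frame["ArrDelay"] >= 0).all()),
        "hours_match_minutes": bool(
            ((frame["ArrDelayHours"] * 60 - frame["ArrDelay"]).abs() < 1e-6).all()
        ),
    }

# Made-up two-row frame shaped like the cleaned data
toy = pd.DataFrame({
    "Cancelled": [0, 0], "Diverted": [0, 0],
    "ArrDelay": [30.0, 90.0], "ArrDelayHours": [0.5, 1.5],
    "CarrierDelay": [30.0, 0.0], "WeatherDelay": [0.0, 0.0],
    "NASDelay": [0.0, 60.0], "SecurityDelay": [0.0, 0.0],
    "LateAircraftDelay": [0.0, 30.0],
})

results = check_cleaned(toy)
print(results)
```

Returning a dictionary rather than raising on the first failure makes it possible to see every failed check at once.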


## Review Dataset

In [None]:
# Basic summary statistics after preprocessing
df.describe().T

In [None]:
print(f"Raw dataset shape: {df_raw.shape}")
print(f"Cleaned dataset shape: {df.shape}")
print(f"Rows removed during preprocessing: {df_raw.shape[0] - df.shape[0]:,}")

## Investigate Dataset

In [None]:
# Number of delayed flights due to each cause?
(df[['CarrierDelay','WeatherDelay','NASDelay','SecurityDelay','LateAircraftDelay']] > 0).sum()

In [None]:
# Delays greater than 24 hours?
(df['ArrDelay'] > 1440).sum(), (df['DepDelay'] > 1440).sum(), (df['CarrierDelay'] > 1440).sum()

In [None]:
plt.figure(figsize=(6,5))
sns.heatmap(df[DelayCause].corr(), annot=True, cmap='Blues', fmt=".2f",
            cbar_kws={'label':'Correlation Coefficient'})
plt.title('Correlation Between Delay Causes')
plt.xticks(rotation=45)
plt.show()

# **Visualization Dashboard**

## **Average Arrival and Departure Delays by Hour**

In [None]:
# Visualization 1 – Average Arrival and Departure Delays by Hour
import matplotlib.pyplot as plt
import seaborn as sns

# Apply a blue-friendly theme
sns.set_style("whitegrid")
sns.set_palette("Blues_r")

# Create bar plot
plt.figure(figsize=(10,5))
df.groupby('DepHour')[['ArrDelayHours', 'DepDelayHours']].mean().plot(
    kind='bar',
    ax=plt.gca(),
    width=0.8,
    edgecolor='none'
)

# Titles and labels
plt.title('Average Arrival and Departure Delays by Hour', fontsize=13, weight='bold')
plt.xlabel('Hour of Day (Departure)', fontsize=11)
plt.ylabel('Average Delay (Hours)', fontsize=11)
plt.xticks(rotation=0)
plt.legend(['Arrival Delay', 'Departure Delay'], title='Delay Type')

# Remove unnecessary frame and tighten layout
sns.despine()
plt.tight_layout()
plt.show()

## **Visualization 2**

In [None]:
# 2
# Create pivot table for average arrival delay by weekday and hour
pivot = df.pivot_table(
    values='ArrDelayHours',
    index='DepHour',       # rows = hour of day
    columns='DayOfWeek',   # columns = day of week
    aggfunc='mean'
)

# Rename day columns for clarity
pivot.columns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Plot heatmap
plt.figure(figsize=(8,6))
sns.heatmap(pivot, cmap='coolwarm', annot=False, cbar_kws={'label': 'Average Arrival Delay (hours)'})
plt.title('Average Arrival Delay by Hour and Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Hour of Day')
plt.show()

## **Visualization 3**

In [None]:
# Group delay causes by departure hour (values still in minutes)
hourly_delays_hr = df.groupby('DepHour')[DelayCause].sum().reset_index()

# Convert all delay cause totals from minutes to hours
hourly_delays_hr[DelayCause] = hourly_delays_hr[DelayCause] / 60

# Create interactive stacked area chart
fig = px.area(
    hourly_delays_hr,
    x='DepHour',
    y=DelayCause,
    title='Hourly Distribution of Delay Causes',
    labels={
        'value': 'Total Delay (hours)',
        'DepHour': 'Hour of Day (Departure)'
    }
)

# Clean legend and axes formatting
fig.update_layout(
    legend_title_text='Cause of Delay',
    xaxis=dict(
        tickmode='linear',
        dtick=1,  # show every hour (0–23)
        title='Hour of Day (Departure)',
        range=[0, 23]
    ),
    yaxis_title='Total Delay (hours)',
    plot_bgcolor='white'
)

fig.show()

## **Visualization 4**

In [None]:
# 4
# Pivot table: average arrival delay grouped by month and weekday
pivot = df.pivot_table(
    values='ArrDelayHours',
    index='Month',        # months as rows
    columns='DayOfWeek',  # weekdays as columns
    aggfunc='mean'
)

# Replace numeric weekdays (1–7) with labels
pivot.columns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Replace numeric months (1–12) with short names
month_labels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
pivot.index = month_labels

# Plot heatmap
plt.figure(figsize=(9,6))
sns.heatmap(pivot, cmap='YlOrRd', annot=False, cbar_kws={'label': 'Avg Arrival Delay (hours)'})
plt.title('Average Arrival Delay by Month and Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Month')
plt.show()

The weak correlations (|r| < 0.2) between delay causes, shown in the correlation heatmap under *Investigate Dataset*, suggest that most types of delay occur largely independently of one another.
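
A threshold statement such as |r| < 0.2 can be checked numerically by taking the largest absolute off-diagonal entry of the correlation matrix. The sketch below shows the idea on a made-up frame; the helper `max_offdiag_abs_corr` and the `toy` columns are illustrative assumptions, and on the real data the function would be applied to `df[DelayCause]`.

```python
import numpy as np
import pandas as pd

def max_offdiag_abs_corr(frame: pd.DataFrame) -> float:
    """Return the largest absolute Pearson correlation between distinct columns."""
    corr = frame.corr().to_numpy()
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)  # mask out the 1.0 diagonal
    return float(np.abs(corr[off_diagonal]).max())

# Made-up data: columns "a" and "b" are perfectly correlated, "c" is not
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, -1.0, 1.0, -1.0],
})

strongest = max_offdiag_abs_corr(toy)
print(round(strongest, 3))
```

If the value returned for the delay-cause columns stays below 0.2, the claim holds for every pair of causes.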

## **Visualization 5**

In [None]:
# 5
# Sum delay causes by month (for all flights)
monthly_cause = df_raw.groupby('Month')[['CarrierDelay', 'WeatherDelay', 'NASDelay', 'LateAircraftDelay']].sum()

# Express each cause's monthly total as a percentage of all delay minutes (these four causes, all months)
monthly_percent = monthly_cause.div(monthly_cause.sum().sum(), axis=0) * 100

# Convert month numbers to names
monthly_percent = monthly_percent.reset_index()
monthly_percent['Month'] = monthly_percent['Month'].replace({
    1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun',
    7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'
})

# Plot
fig = px.area(
    monthly_percent,
    x='Month',
    y=['CarrierDelay', 'WeatherDelay', 'NASDelay', 'LateAircraftDelay'],
    title='Delay Causes by Month',
    labels={'value':'% of Total Delay Minutes', 'variable':'Delay Cause'}
)

fig.show()

## **Visualization 6**

In [None]:
# 6
avg_delay_carrier = df.groupby('UniqueCarrier')[['ArrDelay','DepDelay']].mean().sort_values('ArrDelay', ascending=False)
avg_delay_carrier.plot(kind='bar', figsize=(12,6))
plt.title("Average Arrival & Departure Delays by Airline")
plt.ylabel("Arrival/Departure Delay (minutes)")
plt.show()