<a href="https://colab.research.google.com/github/glgunderson/INFOB2DA-PA4/blob/main/pa4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Dashboard Visualizations and Coordinated View Systems**
## Practical Assignment 4 - INFOB2DA
*Tobias Buiten & Grace Gunderson*


In [None]:
# Download Dataset from GitHub Releases & Unzip Large Dataset
!curl -L -o DelayedFlights.zip "https://github.com/glgunderson/INFOB2DA-PA4/releases/download/PA4.DATA/DelayedFlights.zip"
!unzip -o DelayedFlights.zip -d data

import pandas as pd

# Load and Preview Dataset
df = pd.read_csv('data/airlinedelaycauses_DelayedFlights.csv')
df.head()

Archive:  DelayedFlights.zip
  inflating: data/airlinedelaycauses_DelayedFlights.csv  


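|   | Unnamed: 0 | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2008 | 1 | 3 | 4 | 2003.0 | 1955 | 2211.0 | 2225 | WN | ... | 4.0 | 8.0 | 0 | N | 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | 2008 | 1 | 3 | 4 | 754.0 | 735 | 1002.0 | 1000 | WN | ... | 5.0 | 10.0 | 0 | N | 0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 2 | 2008 | 1 | 3 | 4 | 628.0 | 620 | 804.0 | 750 | WN | ... | 3.0 | 17.0 | 0 | N | 0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 4 | 2008 | 1 | 3 | 4 | 1829.0 | 1755 | 1959.0 | 1925 | WN | ... | 3.0 | 10.0 | 0 | N | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 32.0 |
| 4 | 5 | 2008 | 1 | 3 | 4 | 1940.0 | 1915 | 2121.0 | 2110 | WN | ... | 4.0 | 10.0 | 0 | N | 0 | NaN | NaN | NaN | NaN | NaN |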


## Dataset Overview

### DOT’s Air Travel Consumer Report:
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the **on-time performance** of domestic flights operated by large air carriers.
- DOT provides **monthly summary information** on the number of on-time, delayed, cancelled and diverted flights.
- BTS collects details on the **causes of flight delays** and releases summary statistics and raw data.


## Summary Statistics

In [None]:
# Understand the dataset
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1936758 entries, 0 to 1936757
Data columns (total 30 columns):
 #   Column             Dtype  
---  ------             -----  
 0   Unnamed: 0         int64  
 1   Year               int64  
 2   Month              int64  
 3   DayofMonth         int64  
 4   DayOfWeek          int64  
 5   DepTime            float64
 6   CRSDepTime         int64  
 7   ArrTime            float64
 8   CRSArrTime         int64  
 9   UniqueCarrier      object 
 10  FlightNum          int64  
 11  TailNum            object 
 12  ActualElapsedTime  float64
 13  CRSElapsedTime     float64
 14  AirTime            float64
 15  ArrDelay           float64
 16  DepDelay           float64
 17  Origin             object 
 18  Dest               object 
 19  Distance           int64  
 20  TaxiIn             float64
 21  TaxiOut            float64
 22  Cancelled          int64  
 23  CancellationCode   object 
 24  Diverted           int64  
 25  CarrierDelay      

(1936758, 30)

The initial output upon loading the *full raw dataset* includes:
- **1,936,758** rows (flights) x 30 columns (features)
- The columns consist of both numeric (int64, float64) and categorical (object) features, including:
  - 14 float variables (e.g., DepTime, ArrTime, DepDelay, ArrDelay)
  - 11 integer variables (e.g., Year, Month, DayOfWeek, FlightNum)
  - 5 object variables (e.g., UniqueCarrier, Origin, Dest)
- The first column, **Unnamed: 0**, is an index column that was automatically generated when the dataset was exported to CSV; it does not represent a meaningful feature.
  - It will be removed in later preprocessing, leaving 29 features (e.g., Year, Month, DepDelay, ArrDelay, CarrierDelay, etc.) for analysis.
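
As a small illustration of the two points above, the sketch below shows how an `Unnamed: 0` column typically appears after a CSV round trip and how the per-dtype column counts can be computed. The frame `toy` and its values are hypothetical stand-ins, not rows from the flight data.

```python
import io
import pandas as pd

# Hypothetical miniature frame standing in for the flight data.
toy = pd.DataFrame({
    "Year": [2008, 2008],
    "DepDelay": [8.0, 19.0],
    "UniqueCarrier": ["WN", "WN"],
})

# Writing with the default index=True stores the row index
# as an extra first column with an empty header.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Reading it back without index_col names that column "Unnamed: 0".
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))

# Count how many columns have each dtype.
dtype_counts = reloaded.dtypes.astype(str).value_counts().to_dict()
print(dtype_counts)
```

Passing `index_col=0` to `pd.read_csv` would be one way to avoid the extra column at load time; this notebook instead drops it during preprocessing.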


The dataset *should* include:
- Flight delay metrics for **1,247,486** different flights.
- **30 different features**, both numerical and categorical.

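*The raw file contains ~1.94 million rows, whereas cleaned versions of this dataset reference ~1.25 million flight records. Cancelled, diverted, and incomplete flight records account for only a small part of this gap: the preprocessing step below removes them and still leaves 1,928,369 rows. The rest of the difference is not explained by that step; one possibility, not verified here, is that the smaller figure counts only flights with recorded delay causes.*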
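
One way to see which filter accounts for how many rows is to record the row count after each step. The sketch below does this on a hypothetical five-row frame; the values are illustrative only, and the column names mirror those in the dataset.

```python
import pandas as pd

# Hypothetical toy records; values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "Cancelled":    [0, 1, 0, 0, 0],
    "Diverted":     [0, 0, 1, 0, 0],
    "DepDelay":     [12.0, 30.0, 25.0, 40.0, 7.0],
    "ArrDelay":     [5.0, None, None, 38.0, 2.0],
    "CarrierDelay": [None, None, None, 38.0, None],
})

counts = {"raw": len(toy)}

# Step 1: keep flights that were neither cancelled nor diverted.
completed = toy[(toy["Cancelled"] == 0) & (toy["Diverted"] == 0)]
counts["not cancelled or diverted"] = len(completed)

# Step 2: keep rows with both delay measurements present.
with_delays = completed.dropna(subset=["DepDelay", "ArrDelay"])
counts["with DepDelay and ArrDelay"] = len(with_delays)

# Extra check: rows that also carry a recorded delay cause
# (one candidate explanation for a smaller record count).
counts["with recorded CarrierDelay"] = int(with_delays["CarrierDelay"].notna().sum())

for step, n in counts.items():
    print(f"{step}: {n}")
```

Running the same bookkeeping on the full `df` would show how far each filter moves the row count toward the ~1.25 million figure.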

In [None]:
# Data Cleaning and Preprocessing

# Remove unnecessary index column
df = df.drop(columns=['Unnamed: 0'])

# Remove cancelled or diverted flights
df = df[(df['Cancelled'] == 0) & (df['Diverted'] == 0)]

# Drop rows with missing essential delay information
df = df.dropna(subset=['DepDelay', 'ArrDelay'])

# Fill missing delay cause columns with 0 (no delay recorded)
delay_cols = ['CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
df[delay_cols] = df[delay_cols].fillna(0)

# Remove duplicate rows (if any)
df = df.drop_duplicates()

# Confirm dataset shape and structure
print("Cleaned dataset shape:", df.shape)
df.info()

Cleaned dataset shape: (1928369, 29)
<class 'pandas.core.frame.DataFrame'>
Index: 1928369 entries, 0 to 1936757
Data columns (total 29 columns):
 #   Column             Dtype  
---  ------             -----  
 0   Year               int64  
 1   Month              int64  
 2   DayofMonth         int64  
 3   DayOfWeek          int64  
 4   DepTime            float64
 5   CRSDepTime         int64  
 6   ArrTime            float64
 7   CRSArrTime         int64  
 8   UniqueCarrier      object 
 9   FlightNum          int64  
 10  TailNum            object 
 11  ActualElapsedTime  float64
 12  CRSElapsedTime     float64
 13  AirTime            float64
 14  ArrDelay           float64
 15  DepDelay           float64
 16  Origin             object 
 17  Dest               object 
 18  Distance           int64  
 19  TaxiIn             float64
 20  TaxiOut            float64
 21  Cancelled          int64  
 22  CancellationCode   object 
 23  Diverted           int64  
 24  CarrierDelay      
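
The cleaning cell above fills missing delay-cause values with 0, which treats "no cause recorded" as "zero minutes of delay". The minimal sketch below, using hypothetical values, shows how that choice changes an average; it is worth keeping in mind when the dashboard aggregates these columns.

```python
import pandas as pd

# Hypothetical values: two flights with a recorded carrier delay, two without.
carrier_delay = pd.Series([10.0, None, 30.0, None], name="CarrierDelay")

# mean() skips NaN by default, so only recorded delays are averaged.
mean_recorded_only = carrier_delay.mean()

# After fillna(0), the missing entries count as 0 minutes and pull the mean down.
mean_zero_filled = carrier_delay.fillna(0).mean()

print(mean_recorded_only, mean_zero_filled)
```

Whether the zero-filled or the recorded-only average is appropriate depends on the question being asked, for example "average delay per flight" versus "average delay when this cause occurred".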

## Analyze Dataset

In [None]:
# Experiment with basic charts/visualizations

# **INTERACTIVE VISUALIZATION DASHBOARD**

## **Visualization 1**

In [None]:
# 1

### Conclusion

## **Visualization 2**

In [None]:
# 2

### Conclusion

## **Visualization 3**

In [None]:
# 3

### Conclusion

## **Visualization 4**

In [None]:
# 4

### Conclusion



## **Visualization 5**

In [None]:
# 5

### Conclusion