### Load Packages

In [1]:
import pandas as pd

### Set Config

In [12]:
data_dir = "/Users/mrla/Documents/Projects/data/flight_delays/ot_delaycause1_DL"
pd.set_option('display.max_columns', None)

### Load Data

In [5]:
df = pd.read_csv(data_dir + "/Airline_Delay_Cause.csv")


### Data Definitions ([Source](https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp?20=E))

![Data Definitions](DataDefinition.png)

### Data Exploration

#### High-level Data Summary

In [13]:
print(f"Shape of data: {df.shape[0]:,} rows, {df.shape[1]:,} columns")
print(f"Columns in data: {df.columns.tolist()}")
print(f"First 5 rows of data:\n{df.head()}")


Shape of data: 400,118 rows, 21 columns
Columns in data: ['year', 'month', 'carrier', 'carrier_name', 'airport', 'airport_name', 'arr_flights', 'arr_del15', 'carrier_ct', 'weather_ct', 'nas_ct', 'security_ct', 'late_aircraft_ct', 'arr_cancelled', 'arr_diverted', 'arr_delay', 'carrier_delay', 'weather_delay', 'nas_delay', 'security_delay', 'late_aircraft_delay']
First 5 rows of data:
   year  month carrier       carrier_name airport  \
0  2025      2      9E  Endeavor Air Inc.     ABE   
1  2025      2      9E  Endeavor Air Inc.     AEX   
2  2025      2      9E  Endeavor Air Inc.     AGS   
3  2025      2      9E  Endeavor Air Inc.     ALB   
4  2025      2      9E  Endeavor Air Inc.     ATL   

                                        airport_name  arr_flights  arr_del15  \
0  Allentown/Bethlehem/Easton, PA: Lehigh Valley ...         78.0        9.0   
1           Alexandria, LA: Alexandria International         78.0       12.0   
2        Augusta, GA: Augusta Regional at Bush Field   

In [10]:
print(f"Data types of columns:\n{df.dtypes}")
print("--------------------------------")
missing_counts = df.isnull().sum()
missing_percent = (missing_counts / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_counts,
    'Missing %': missing_percent.round(2)
})

print("Missing values in each column:\n", missing_df)

print("--------------------------------")
print(f"Summary statistics of numerical columns:\n{df.describe()}")

Data types of columns:
year                     int64
month                    int64
carrier                 object
carrier_name            object
airport                 object
airport_name            object
arr_flights            float64
arr_del15              float64
carrier_ct             float64
weather_ct             float64
nas_ct                 float64
security_ct            float64
late_aircraft_ct       float64
arr_cancelled          float64
arr_diverted           float64
arr_delay              float64
carrier_delay          float64
weather_delay          float64
nas_delay              float64
security_delay         float64
late_aircraft_delay    float64
dtype: object
--------------------------------
Missing values in each column:
                      Missing Count  Missing %
year                             0       0.00
month                            0       0.00
carrier                          0       0.00
carrier_name                     0       0.00
airport          

##### Insights so far

🧭 Understanding `*_ct` and `*_delay` Columns

🔹 `*_ct` Columns (e.g., `carrier_ct`, `weather_ct`)

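* Despite the name, these are **not raw integer counts** of delays.
* They contain **decimal values**, indicating they are likely one of:

  * **Weighted averages** (e.g., delays per flight or per day)
  * **Proportional allocations** of delay causes, where a delayed flight with several causes is split across them
* Fractional counts like these appear in **aggregated datasets** such as this BTS delay-cause table (see the source link above), where records are rolled up by carrier, airport, and month.
* **Important**: We should **not assume** the `*_ct` values sum exactly to the total delayed flights (`arr_del15`). If they are proportional allocations, the sum should be close to `arr_del15` up to rounding, but this has to be checked against the data.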
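
Whether the `*_ct` columns add up to `arr_del15` can be checked directly rather than assumed. The sketch below is only an illustration: the frame `toy` holds made-up values (not rows from the dataset), the helper `ct_sum_gap` is hypothetical, and the tolerance of 0.05 is an assumption meant to absorb rounding. On the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

CT_COLS = ['carrier_ct', 'weather_ct', 'nas_ct', 'security_ct', 'late_aircraft_ct']


def ct_sum_gap(frame: pd.DataFrame) -> pd.Series:
    """Return arr_del15 minus the row-wise sum of the *_ct columns.

    Hypothetical helper for illustration. DataFrame.sum skips NaNs by
    default, so missing *_ct values are effectively treated as 0.
    """
    return frame['arr_del15'] - frame[CT_COLS].sum(axis=1)


# Illustrative values only (not real rows from Airline_Delay_Cause.csv).
toy = pd.DataFrame({
    'arr_del15':        [9.0, 12.0, 5.0],
    'carrier_ct':       [3.5, 4.0, 1.0],
    'weather_ct':       [0.5, 1.0, 0.0],
    'nas_ct':           [2.0, 3.0, 1.0],
    'security_ct':      [0.0, 0.0, 0.0],
    'late_aircraft_ct': [3.0, 4.0, 1.0],
})

gap = ct_sum_gap(toy)

# Rows whose *_ct columns do not add up to arr_del15 within the assumed tolerance
mismatched = toy[gap.abs() > 0.05]
print(f"Rows with a *_ct gap: {mismatched.index.tolist()}")
```

If almost every row of the real data passes this check, the proportional-allocation reading is the more plausible one; a large share of failures would point toward some other normalization.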

🔹 `*_delay` Columns (e.g., `carrier_delay`, `weather_delay`)

* These represent **delay durations in minutes**.
* In principle, their **sum should match** `arr_delay`:

  ```
  arr_delay ≈ carrier_delay + weather_delay + nas_delay + security_delay + late_aircraft_delay
  ```
* In practice, small mismatches may exist due to:

  * Rounding
  * Missing values
  * Unattributed or jointly caused delays
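
How missing values affect the comparison depends on how NaNs are handled when the components are summed. A minimal sketch with made-up values: `DataFrame.sum(axis=1)` skips NaNs by default, so a row with every component missing sums to 0, while `min_count=1` keeps such a row as NaN so it can be told apart from a genuine zero.

```python
import numpy as np
import pandas as pd

DELAY_COLS = ['carrier_delay', 'weather_delay', 'nas_delay',
              'security_delay', 'late_aircraft_delay']

# Illustrative values only: row 1 has one missing component,
# row 2 has all components missing.
toy = pd.DataFrame({
    'carrier_delay':       [100.0, 50.0, np.nan],
    'weather_delay':       [0.0, np.nan, np.nan],
    'nas_delay':           [20.0, 10.0, np.nan],
    'security_delay':      [0.0, 0.0, np.nan],
    'late_aircraft_delay': [30.0, 40.0, np.nan],
})

# Default behavior: NaNs are skipped, equivalent to filling them with 0.
sum_skipna = toy[DELAY_COLS].sum(axis=1)

# min_count=1: the result is NaN unless at least one component is present.
sum_min_count = toy[DELAY_COLS].sum(axis=1, min_count=1)

print(sum_skipna.tolist())
print(sum_min_count.tolist())
```

Treating NaNs as 0 is convenient for a first pass, but it means a row with missing components can show up as a mismatch against `arr_delay` even when nothing was misreported.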

---

👉 **In the next code section, we will check if the delay components add up to `arr_delay` and explore any mismatches.**


In [16]:
# Step 1: Safely compute delay component sum by treating NaNs as 0
delay_components = (
    df['carrier_delay'].fillna(0) +
    df['weather_delay'].fillna(0) +
    df['nas_delay'].fillna(0) +
    df['security_delay'].fillna(0) +
    df['late_aircraft_delay'].fillna(0)
)

# Step 2: Compute difference from arr_delay (only where arr_delay is not null)
valid_mask = df['arr_delay'].notna()
delay_difference = df.loc[valid_mask, 'arr_delay'] - delay_components[valid_mask]

# Step 3: Identify rows where the difference is significant (e.g., > 1 minute)
mismatched_rows = df.loc[valid_mask].loc[delay_difference.abs() > 1]

# Step 4: Summary
print(f"Total valid rows (non-null arr_delay): {valid_mask.sum()}")
# Percentage is taken over the valid rows, the same population as the mismatch count
print(f"Rows where delay components do NOT sum to arr_delay: {len(mismatched_rows)} ({(len(mismatched_rows) / valid_mask.sum()) * 100:.2f}%)")

print("\nSample mismatches:")
print(
    mismatched_rows[['arr_delay']].assign(
        delay_components=delay_components[valid_mask],
        difference=delay_difference
    ).head()
)




Total valid rows (non-null arr_delay): 399461
Rows where delay components do NOT sum to arr_delay: 9 (0.00%)

Sample mismatches:
       arr_delay  delay_components  difference
13845    39334.0           39154.0       180.0
13848    12812.0           12686.0       126.0
25479     2773.0            1428.0      1345.0
28886     6934.0            6880.0        54.0
31638     3006.0            2775.0       231.0


Out of 399,461 valid records (rows where `arr_delay` is not null), only 9 rows, or less than 0.01%, show a difference of more than 1 minute between the sum of the individual delay components (`carrier_delay`, `weather_delay`, `nas_delay`, `security_delay`, `late_aircraft_delay`) and the reported `arr_delay`. The decomposition is therefore consistent for nearly the entire dataset. The mismatches that do occur are not always small: in the five sample rows shown, the differences range from 54 to 1,345 minutes, and in row 25479 the gap is roughly half of the reported `arr_delay` (1,345 of 2,773 minutes). In every sampled row `arr_delay` exceeds the component sum, which is consistent with unattributed delay minutes or with missing components that were filled with 0; reporting errors are also possible. With so few affected rows, the delay attribution can be treated as consistent for modeling, and the 9 mismatched rows can be inspected or excluded separately.
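
Absolute differences are easier to judge next to the size of `arr_delay` itself. The sketch below re-enters the five `arr_delay` / `delay_components` pairs printed in the sample output above and expresses each gap as a percentage of `arr_delay`. The helper name `relative_gap` is hypothetical, and the calculation assumes `arr_delay` is non-zero, which holds for these five rows.

```python
import pandas as pd

# The five sample mismatches printed above, keyed by their original row index.
sample = pd.DataFrame(
    {
        'arr_delay':        [39334.0, 12812.0, 2773.0, 6934.0, 3006.0],
        'delay_components': [39154.0, 12686.0, 1428.0, 6880.0, 2775.0],
    },
    index=[13845, 13848, 25479, 28886, 31638],
)


def relative_gap(frame: pd.DataFrame) -> pd.Series:
    """Gap between arr_delay and the component sum, as a % of arr_delay.

    Hypothetical helper; assumes arr_delay is non-zero for every row.
    """
    diff = frame['arr_delay'] - frame['delay_components']
    return (diff / frame['arr_delay'] * 100).round(2)


sample['difference'] = sample['arr_delay'] - sample['delay_components']
sample['difference_pct'] = relative_gap(sample)

# Largest relative gaps first
print(sample.sort_values('difference_pct', ascending=False))
```

Sorted this way, row 25479 stands out, while rows such as 13845 have a gap of well under 1% of their reported delay. The same calculation could be applied to `mismatched_rows` to rank all 9 cases.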
