Data Download and Understanding
-------------------------------


In [None]:
# Task 1: Loading Libraries and Data
import pandas as pd
import pathlib as Path

raw_path = Path.Path('../data/raw/')

files = sorted(raw_path.glob('*.csv'))  # sorted so the concat order is reproducible
print(f"found {len(files)} files")

dfs = []
for file in files:
    df = pd.read_csv(file)
    df["source_file"] = file.name  # lineage: record the source file so rows can be traced and audited
    dfs.append(df)

raw_df = pd.concat(dfs, ignore_index=True)  # combine all dataframes; ignore_index avoids duplicate index labels
print(f"combined dataframe shape: {raw_df.shape}")


found 13 files
combined dataframe shape: (7607025, 18)


In [15]:
# Task 2: First sanity check and column name normalization

# Convert column names to lowercase and replace spaces with underscores
raw_df.columns = (
    raw_df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

raw_df.info()  # dtypes and memory footprint, printed with the normalized names
# raw_df.describe(include="all")  # brief statistical summary of the dataframe
raw_df.columns.tolist()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7607025 entries, 0 to 7607024
Data columns (total 18 columns):
 #   Column               Dtype  
---  ------               -----  
 0   fl_date              object 
 1   op_unique_carrier    object 
 2   origin_airport_id    int64  
 3   origin               object 
 4   dest_airport_id      int64  
 5   dest                 object 
 6   dep_delay            float64
 7   dep_delay_new        float64
 8   arr_delay            float64
 9   arr_delay_new        float64
 10  cancelled            float64
 11  diverted             float64
 12  carrier_delay        float64
 13  weather_delay        float64
 14  nas_delay            float64
 15  security_delay       float64
 16  late_aircraft_delay  float64
 17  source_file          object 
dtypes: float64(11), int64(2), object(5)
memory usage: 1.0+ GB


['fl_date',
 'op_unique_carrier',
 'origin_airport_id',
 'origin',
 'dest_airport_id',
 'dest',
 'dep_delay',
 'dep_delay_new',
 'arr_delay',
 'arr_delay_new',
 'cancelled',
 'diverted',
 'carrier_delay',
 'weather_delay',
 'nas_delay',
 'security_delay',
 'late_aircraft_delay',
 'source_file']

1. Categorical columns
- 5 object columns: [fl_date, op_unique_carrier, origin, dest, source_file]
- fl_date is a date stored as text and still needs parsing

2. Numerical columns
- 2 ints: [origin_airport_id, dest_airport_id]
- 11 floats: the remaining columns

3. Missing values
- Mostly in the five columns that record flight delay causes
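
The missing-value note can be quantified with a short check. This is a minimal sketch, assuming the lowercased column names from Task 2; the helper name `missing_share` and the toy values are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame standing in for raw_df (values are made up for illustration)
toy = pd.DataFrame(
    {
        "dep_delay": [-8.0, 26.0, 50.0, np.nan],
        "carrier_delay": [np.nan, np.nan, 0.0, np.nan],
        "late_aircraft_delay": [np.nan, np.nan, 32.0, np.nan],
    }
)

shares = missing_share(toy)
print(shares)
```

On the real data the same call would be `missing_share(raw_df)`.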

In [16]:
# Task 3: Convert the data to Parquet format for faster loading in future
processed_path = Path.Path('../data/processed/')
processed_path.mkdir(parents=True, exist_ok=True)  # also create missing parent folders

raw_df.to_parquet(
    processed_path / "flights_raw.parquet",
    index=False
)

In [17]:
# Reading back the parquet file to verify
df_parquet = pd.read_parquet(processed_path / "flights_raw.parquet")
print(f"Parquet dataframe shape: {df_parquet.shape}")
df_parquet.head()

Parquet dataframe shape: (7607025, 18)


Unnamed: 0,fl_date,op_unique_carrier,origin_airport_id,origin,dest_airport_id,dest,dep_delay,dep_delay_new,arr_delay,arr_delay_new,cancelled,diverted,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,source_file
0,4/1/2025 12:00:00 AM,AA,10140,ABQ,11298,DFW,-8.0,0.0,-11.0,0.0,0.0,0.0,,,,,,T_ONTIME_REPORTING_APR-2025.csv
1,4/1/2025 12:00:00 AM,AA,10140,ABQ,11298,DFW,-2.0,0.0,-11.0,0.0,0.0,0.0,,,,,,T_ONTIME_REPORTING_APR-2025.csv
2,4/1/2025 12:00:00 AM,AA,10140,ABQ,11298,DFW,-1.0,0.0,-17.0,0.0,0.0,0.0,,,,,,T_ONTIME_REPORTING_APR-2025.csv
3,4/1/2025 12:00:00 AM,AA,10140,ABQ,11298,DFW,26.0,26.0,10.0,10.0,0.0,0.0,,,,,,T_ONTIME_REPORTING_APR-2025.csv
4,4/1/2025 12:00:00 AM,AA,10140,ABQ,11298,DFW,50.0,50.0,32.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,32.0,T_ONTIME_REPORTING_APR-2025.csv


Parquet file type overview
--------------------------

🔹 Excel / CSV
- Row-based
- CSV is plain text: every value is parsed from a string on each load
- No built-in compression
- No type enforcement

🔹 Parquet
- Columnar storage
- Compression per column
- Stores schema + types
- Reads only the columns you request

Example:
- CSV: reads all 18 columns even if you need 2
- Parquet: reads only requested columns

So on your laptop:
- Less RAM usage
- Faster EDA
- Faster data loading ahead of model training

This is why Parquet is a de facto standard for analytical workloads.
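
Column pruning can be sketched as below. The helper name `load_columns` is hypothetical; `columns` is an existing parameter of `pandas.read_parquet`, and a Parquet engine such as pyarrow is assumed to be installed, as it already is for `to_parquet` in Task 3.

```python
from pathlib import Path

import pandas as pd


def load_columns(parquet_path: Path, columns: list[str]) -> pd.DataFrame:
    """Read only the requested columns from a Parquet file."""
    # The Parquet engine skips the column chunks that are not listed,
    # so unused columns are never loaded into memory.
    return pd.read_parquet(parquet_path, columns=columns)


# Hypothetical usage against the file written in Task 3:
# delays = load_columns(
#     Path("../data/processed/flights_raw.parquet"),
#     ["fl_date", "dep_delay"],
# )
```

`pd.read_csv(..., usecols=[...])` also returns a subset of columns, but it still has to scan the full text of every row to find them.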

Data Review
-----------

✅ We will keep both departure-delay columns for now.

Why?
- dep_delay_new → better for business regression (early departures are recorded as 0)
- dep_delay → useful for EDA and interpretation (keeps negative values for early departures)

Later:
- Regression target → dep_delay_new
- Drop dep_delay to avoid leakage
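
The relationship between the two delay columns can be checked directly. This is a minimal sketch using the five rows from the `head()` output; it assumes that the `_new` column is simply the raw delay with early departures set to 0, which should be confirmed on the full data.

```python
import pandas as pd

# Rows mirroring the head() output shown earlier
sample = pd.DataFrame(
    {
        "dep_delay": [-8.0, -2.0, -1.0, 26.0, 50.0],
        "dep_delay_new": [0.0, 0.0, 0.0, 26.0, 50.0],
    }
)

# Assumption: dep_delay_new is dep_delay with early departures set to 0
clipped = sample["dep_delay"].clip(lower=0)
matches = clipped.eq(sample["dep_delay_new"]).all()
print(f"dep_delay_new == max(dep_delay, 0) on the sample: {matches}")

# When modelling dep_delay_new, keep dep_delay out of the feature matrix
target = sample["dep_delay_new"]
features = sample.drop(columns=["dep_delay", "dep_delay_new"])
```

Because one column is a deterministic function of the other, leaving the raw delay among the features would hand the model the answer.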

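- ✔️ The 5 delay-cause columns provide valuable insight but CANNOT be used as features to predict delays. This is important.

Why?
- They are post-event explanations, only known once the flight has been delayed
- They sum to the arrival delay (in the sample above, arr_delay = 32 and late_aircraft_delay = 32)
- Using them as features would be target leakage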
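
A minimal sketch of the leakage argument, using two rows modelled on the `head()` output. The names `CAUSE_COLS`, `sums_match` and `feature_df` are illustrative, and the check covers only rows where the causes are populated.

```python
import numpy as np
import pandas as pd

CAUSE_COLS = [
    "carrier_delay",
    "weather_delay",
    "nas_delay",
    "security_delay",
    "late_aircraft_delay",
]

# Two rows mirroring the head() output: causes are missing for the first
# row and populated for the second
sample = pd.DataFrame(
    {
        "origin": ["ABQ", "ABQ"],
        "arr_delay": [10.0, 32.0],
        "carrier_delay": [np.nan, 0.0],
        "weather_delay": [np.nan, 0.0],
        "nas_delay": [np.nan, 0.0],
        "security_delay": [np.nan, 0.0],
        "late_aircraft_delay": [np.nan, 32.0],
    }
)

# Where causes are recorded, check that they add up to the arrival delay
has_causes = sample[CAUSE_COLS].notna().all(axis=1)
cause_total = sample.loc[has_causes, CAUSE_COLS].sum(axis=1)
sums_match = cause_total.eq(sample.loc[has_causes, "arr_delay"]).all()
print(f"causes sum to arr_delay: {sums_match}")

# Keep the post-event cause columns out of the predictive feature set
feature_df = sample.drop(columns=CAUSE_COLS)
```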

📌 What they ARE good for:

- Secondary regression:
    “Which cause explains most delay?”

- EDA:
    Business insight dashboard

They will be used in a mini regression later, but NOT in the main predictive model.
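
Before any regression, a descriptive first pass at "which cause explains most delay" could look like the sketch below. It is a plain share-of-minutes summary, not the mini regression itself, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

CAUSE_COLS = [
    "carrier_delay",
    "weather_delay",
    "nas_delay",
    "security_delay",
    "late_aircraft_delay",
]

# Toy rows standing in for the flights table (made-up values);
# the all-NaN row represents a flight with no recorded causes
toy = pd.DataFrame(
    [
        [0.0, 0.0, 0.0, 0.0, 32.0],
        [15.0, 0.0, 5.0, 0.0, 0.0],
        [np.nan] * 5,
        [10.0, 0.0, 0.0, 0.0, 20.0],
    ],
    columns=CAUSE_COLS,
)

# Total delay minutes per cause; sum() skips NaN by default
total_minutes = toy[CAUSE_COLS].sum()

# Share of all attributed delay minutes, largest first
share = (total_minutes / total_minutes.sum()).sort_values(ascending=False)
print(share.round(2))
```

Grouping by carrier or origin before summing would give the per-segment breakdown a dashboard typically needs.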