# Flight Delay Prediction – Operational Risk for Complex Logistics
## Stage 1: Data Acquisition, Cleaning, and Target Creation
### Objective:
Load raw data, perform essential cleaning, and establish the primary target variable (Operational Risk Signal). This notebook simulates the output of a Data Engineering pipeline.

-----



## 1. Configuration and Library Imports

In [2]:
import pandas as pd
import numpy as np

# Setting display options
pd.set_option('display.max_columns', 100)

## 2. Data Loading and Initial Inspection


In [4]:
# Load the raw dataset (simulating a call to a modular function)
try:
    df = pd.read_csv('../data/raw/dataset_SCL.csv', parse_dates=['Fecha-I', 'Fecha-O'])
    print(f"Dataset loaded with {len(df)} rows.")
except FileNotFoundError:
    print("Error: File not found. Ensure 'dataset_SCL.csv' is in the path '../data/raw/'.")
    df = None

if df is not None:
    # Standardize column names (snake_case)
    df.columns = df.columns.str.lower().str.replace('-', '_')

    print("\nInitial Data Types and Non-Null Values:")
    df.info()

Dataset loaded with 68206 rows.

Initial Data Types and Non-Null Values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68206 entries, 0 to 68205
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   fecha_i    68206 non-null  datetime64[ns]
 1   vlo_i      68206 non-null  object        
 2   ori_i      68206 non-null  object        
 3   des_i      68206 non-null  object        
 4   emp_i      68206 non-null  object        
 5   fecha_o    68206 non-null  datetime64[ns]
 6   vlo_o      68205 non-null  object        
 7   ori_o      68206 non-null  object        
 8   des_o      68206 non-null  object        
 9   emp_o      68206 non-null  object        
 10  dia        68206 non-null  int64         
 11  mes        68206 non-null  int64         
 12  año        68206 non-null  int64         
 13  dianom     68206 non-null  object        
 14  tipovuelo  68206 non-null  object        
 15  opera      682

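
The `df.info()` output above shows `vlo_o` with 68205 non-null values out of 68206 rows, so one value is missing. A minimal sketch of how that gap could be located and handled follows. The toy frame and its values are made up, and falling back to the scheduled flight number (`vlo_i`) is an assumed policy for illustration, not something this notebook does.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; values are invented.
toy = pd.DataFrame({
    "vlo_i": ["226", "912", "940"],
    "vlo_o": ["226", None, "940"],
})

# Count missing values per column and keep only the columns that have any.
null_counts = toy.isna().sum()
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)

# Assumed policy: reuse the scheduled flight number when the operated one is missing.
toy["vlo_o"] = toy["vlo_o"].fillna(toy["vlo_i"])
print(toy)
```

Dropping the row would be just as defensible for a single missing value; whichever policy is chosen should be recorded so that later stages can rely on it.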


## 3. Core Feature Creation: Target Variable and Basic Time Features

In [5]:
# 3.1 Target Variable Creation (The Operational Risk Signal)
# Target: delay_15 (1 if actual departure time > scheduled departure time + 15 mins)
df['min_diff'] = (df['fecha_o'] - df['fecha_i']).dt.total_seconds() / 60
df['delay_15'] = (df['min_diff'] > 15).astype(int)

print(f"\nDistribution of Target 'delay_15':\n{df['delay_15'].value_counts(normalize=True)}")
print(f"Delay rate (Positive Class): {df['delay_15'].mean():.2%}")
# Note: delayed flights are the minority class (about 18.5% of rows), so the target is imbalanced.
# The modeling stage (Stage 3) should account for this, e.g. with cost-sensitive learning or class weighting.


Distribution of Target 'delay_15':
delay_15
0    0.81506
1    0.18494
Name: proportion, dtype: float64
Delay rate (Positive Class): 18.49%
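
Two details of the target are easy to miss: the comparison is strict, so a flight departing exactly 15 minutes late is not labeled as a delay, and the 81.5% / 18.5% split determines how heavily the minority class would need to be weighted. The sketch below illustrates both on made-up timestamps. The weights use the common "balanced" heuristic, weight = 1 / (n_classes × class proportion), which is the formula scikit-learn documents for `class_weight="balanced"`. Whether Stage 3 uses this heuristic is an assumption.

```python
import pandas as pd

# Part 1: hypothetical flights, all scheduled for 10:00, to show the strict threshold.
toy = pd.DataFrame({
    "fecha_i": pd.to_datetime(["2017-01-01 10:00"] * 4),
    "fecha_o": pd.to_datetime([
        "2017-01-01 09:55",  # early departure -> negative difference
        "2017-01-01 10:15",  # exactly 15 minutes late -> not a delay
        "2017-01-01 10:16",  # 16 minutes late -> delay
        "2017-01-01 11:00",  # 60 minutes late -> delay
    ]),
})
toy["min_diff"] = (toy["fecha_o"] - toy["fecha_i"]).dt.total_seconds() / 60
toy["delay_15"] = (toy["min_diff"] > 15).astype(int)
print(toy[["min_diff", "delay_15"]])

# Part 2: "balanced" class weights from the proportions reported above.
proportions = pd.Series({0: 0.81506, 1: 0.18494})
class_weight = (1.0 / (len(proportions) * proportions)).to_dict()
print(class_weight)
```

With these proportions the weights come out at roughly 0.61 for on-time flights and 2.70 for delays, so a misclassified delay would cost a weighted model about four times as much as a misclassified on-time flight.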


In [6]:
# 3.2 Basic Time Features (Required for subsequent FE)
df['month'] = df['fecha_i'].dt.month
df['day_of_week'] = df['fecha_i'].dt.day_name()
df['hour'] = df['fecha_i'].dt.hour
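
As a quick sanity check of what the three `.dt` accessors return, here is a small sketch on made-up scheduled departure times. Note that `day_name()` yields English day names by default.

```python
import pandas as pd

# Hypothetical scheduled departure times (not taken from the dataset).
fecha_i = pd.Series(pd.to_datetime(["2017-01-01 23:30:00", "2017-07-04 06:05:00"]))

features = pd.DataFrame({
    "month": fecha_i.dt.month,             # integer month, 1-12
    "day_of_week": fecha_i.dt.day_name(),  # English day name
    "hour": fecha_i.dt.hour,               # integer hour, 0-23
})
print(features)
```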

## 4. Save Cleaned Data

In [None]:
# Save the cleaned dataset with the target variable for the next stage.
df.to_csv('../data/interim/01_cleaned_target_data.csv', index=False)
print("\nCleaned data saved to interim/01_cleaned_target_data.csv.")


Cleaned data saved to interim/01_cleaned_target_data.csv.
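
One caveat when the next stage reads this file back: CSV does not store dtypes, so `fecha_i` and `fecha_o` return as plain text unless `parse_dates` is passed again. The sketch below shows the difference using an in-memory buffer instead of the real file; the toy row is made up.

```python
import io

import pandas as pd

# Hypothetical one-row frame mirroring the saved columns.
toy = pd.DataFrame({
    "fecha_i": pd.to_datetime(["2017-01-01 23:30:00"]),
    "fecha_o": pd.to_datetime(["2017-01-01 23:48:00"]),
    "delay_15": [1],
})

# Write to an in-memory buffer to stand in for the interim CSV file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read it back twice: once naively, once asking pandas to parse the date columns.
buffer.seek(0)
plain = pd.read_csv(buffer)
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["fecha_i", "fecha_o"])

print(plain.dtypes)
print(parsed.dtypes)
```

A binary format such as Parquet would preserve the dtypes across stages, at the cost of an extra dependency.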
