# ✈️ Flight Delay Prediction – Santiago Airport (SCL)
This project aims to predict the probability of a flight being delayed by more than 15 minutes using flight data from Santiago de Chile International Airport (SCL) during 2017.

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Project modules
from src.data.load_data import load_raw_data
from src.data.preprocess import parse_dates, clean_column_names
from src.features.build_features import build_synthetic_features
from src.visualization.visualize import (
    plot_delay_rate_by_column,
    plot_count_by_column,
    plot_delay_distribution,
    plot_delay_rate_by_two_categories
)

## 📥 3. Data Loading and Initial Inspection

In [None]:
# Load and prepare data
df = load_raw_data()
df = parse_dates(df)
df = clean_column_names(df)

# Preview
df.head()
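
The helpers imported from `src.data.preprocess` are not shown in this notebook. As a rough sketch of what they might do, the stand-ins below assume the raw file has two timestamp columns named `Fecha-I` and `Fecha-O` (a guess inferred from the `fecha_i` / `fecha_o` columns used later) and that cleaning means lower-casing names and replacing hyphens or spaces with underscores. The function bodies and the toy data are hypothetical.

```python
import pandas as pd


def parse_dates(df: pd.DataFrame, columns=("Fecha-I", "Fecha-O")) -> pd.DataFrame:
    """Hypothetical stand-in: convert timestamp columns to datetime64."""
    out = df.copy()
    for col in columns:
        if col in out.columns:
            # errors="coerce" turns unparseable values into NaT instead of raising
            out[col] = pd.to_datetime(out[col], errors="coerce")
    return out


def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: lower-case names, '-' and spaces become '_'."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"[-\s]+", "_", regex=True)
    )
    return out


# Made-up rows, only to show the effect of the two helpers
raw = pd.DataFrame(
    {
        "Fecha-I": ["2017-01-01 23:30:00", "2017-01-02 08:00:00"],
        "Fecha-O": ["2017-01-01 23:33:00", "2017-01-02 08:40:00"],
        "OPERA": ["Airline A", "Airline B"],
    }
)
demo = clean_column_names(parse_dates(raw))
print(demo.dtypes)
```

Parsing dates before renaming mirrors the call order used in the cell above.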

## 🔍 4. Data Overview

In [None]:
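from IPython.display import display

# Only the last expression of a cell is rendered, so display each summary explicitly
df.info()
display(df.describe(include="all"))
display(df.isnull().sum())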


## 🛠️ 5. Feature Engineering

In [None]:
df = build_synthetic_features(df)

# Preview engineered features
df[["fecha_i", "fecha_o", "min_diff", "delay_15", "high_season", "period_day"]].head()
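
`build_synthetic_features` lives in `src.features` and is not reproduced here. The sketch below shows one way the four engineered columns could be derived. Everything in it is an assumption made for illustration: that `fecha_i` is the scheduled time and `fecha_o` the actual one, the hour cut-offs for `period_day`, and the date ranges for `high_season`.

```python
import pandas as pd


def build_synthetic_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only; the cut-offs and date ranges are assumptions."""
    out = df.copy()

    # Minutes between actual (fecha_o) and scheduled (fecha_i) departure
    out["min_diff"] = (out["fecha_o"] - out["fecha_i"]).dt.total_seconds() / 60

    # Binary target: 1 when the flight left more than 15 minutes late
    out["delay_15"] = (out["min_diff"] > 15).astype(int)

    # Time of day from the scheduled hour (assumed bins, inclusive bounds)
    hour = out["fecha_i"].dt.hour
    out["period_day"] = "night"
    out.loc[hour.between(5, 11), "period_day"] = "morning"
    out.loc[hour.between(12, 18), "period_day"] = "afternoon"

    # High season encoded as month*100 + day (assumed ranges:
    # Dec 15 to Mar 3, Jul 15 to Jul 31, Sep 11 to Sep 30)
    md = out["fecha_i"].dt.month * 100 + out["fecha_i"].dt.day
    out["high_season"] = (
        (md >= 1215) | (md <= 303) | md.between(715, 731) | md.between(911, 930)
    ).astype(int)
    return out


# Made-up flights to exercise each branch
flights = pd.DataFrame(
    {
        "fecha_i": pd.to_datetime(
            ["2017-01-01 23:30", "2017-05-10 08:00", "2017-07-20 13:00"]
        ),
        "fecha_o": pd.to_datetime(
            ["2017-01-01 23:33", "2017-05-10 08:40", "2017-07-20 13:16"]
        ),
    }
)
features = build_synthetic_features_sketch(flights)
print(features[["min_diff", "delay_15", "period_day", "high_season"]])
```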


## 📊 6. Exploratory Data Analysis (EDA)

In [None]:
# Delay distribution
plot_delay_distribution(df)

# Delay rate by airline
plot_delay_rate_by_column(df, "opera")

# Delay rate by destination
plot_delay_rate_by_column(df, "siglades")

# Flight counts by time of day
plot_count_by_column(df, "period_day")

# Delay rate by airline and flight type
plot_delay_rate_by_two_categories(df, "opera", "tipovuelo", top_n=10)

# Delay rate by destination and high season
plot_delay_rate_by_two_categories(df, "siglades", "high_season", top_n=10)
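
The plotting helpers come from `src.visualization` and their internals are not shown. The quantity behind a "delay rate by column" chart is most likely a grouped mean of the binary target, which can be sketched as follows. The function name `delay_rate_by_column` and the toy data are hypothetical.

```python
import pandas as pd


def delay_rate_by_column(df: pd.DataFrame, column: str, target: str = "delay_15") -> pd.DataFrame:
    """Share of delayed flights and flight count per category, highest rate first."""
    return (
        df.groupby(column)[target]
        .agg(delay_rate="mean", flights="count")
        .sort_values("delay_rate", ascending=False)
    )


# Made-up data: three airlines with different delay shares
toy = pd.DataFrame(
    {
        "opera": ["A", "A", "A", "B", "B", "C"],
        "delay_15": [1, 0, 0, 1, 1, 0],
    }
)
rates = delay_rate_by_column(toy, "opera")
print(rates)
```

Keeping the flight count next to the rate helps when reading the charts: a category with only a handful of flights can show an extreme rate by chance.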


## 🤖 Predictive Modeling

This section trains a supervised classifier to predict whether a flight is delayed by more than 15 minutes. A Random Forest is used here; Logistic Regression and XGBoost are natural alternatives.

Models will be evaluated using classification metrics such as:
- Accuracy
- Precision / Recall
- F1-score
- ROC-AUC
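
To make the threshold-based metrics concrete, they can be computed by hand from the four confusion-matrix counts. This is a plain-Python illustration on made-up labels, not the project's evaluation code (in practice `sklearn.metrics` would be the usual choice). ROC-AUC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = delayed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 2 delayed flights out of 10
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A model that always predicts "no delay"
always_on_time = classification_metrics(y_true, [0] * 10)

# A model that flags three flights and gets one of them right
some_alerts = classification_metrics(y_true, [0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

print(always_on_time)
print(some_alerts)
```

The first model scores well on accuracy while catching no delays at all, which is why accuracy alone is a weak guide when delayed flights are the minority class.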

In [None]:
from src.models.train_model import train_model

# Train Random Forest model
model, X_test, y_test, y_pred = train_model(df, model_type="random_forest")
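
`train_model` is defined in `src.models.train_model` and its body is not part of this notebook. A minimal sketch of a function with the same return shape is given below, assuming one-hot encoded categorical inputs, a stratified 80/20 split, and a scikit-learn `RandomForestClassifier`. The function name, the feature list, the hyperparameters, and the synthetic data are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def train_model_sketch(df, feature_cols, target="delay_15", seed=42):
    """Hypothetical stand-in returning (model, X_test, y_test, y_pred)."""
    # One-hot encode text columns; numeric columns pass through unchanged
    X = pd.get_dummies(df[feature_cols])
    y = df[target]

    # Stratify so both splits keep the same share of delayed flights
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )

    model = RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=seed
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return model, X_test, y_test, y_pred


# Synthetic stand-in for the flight table (random labels, no real signal)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "opera": rng.choice(["A", "B", "C"], size=n),
        "tipovuelo": rng.choice(["I", "N"], size=n),
        "period_day": rng.choice(["morning", "afternoon", "night"], size=n),
        "high_season": rng.integers(0, 2, size=n),
        "delay_15": rng.integers(0, 2, size=n),
    }
)
features_used = ["opera", "tipovuelo", "period_day", "high_season"]
sk_model, sk_X_test, sk_y_test, sk_y_pred = train_model_sketch(toy, features_used)
print(sk_X_test.shape, sk_y_pred[:5])
```

`min_diff` is deliberately absent from the feature list in this sketch: if the label is a threshold on that column, including it would leak the answer into the inputs.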




## 📉 Plot Evaluation Metrics

In [None]:
from src.models.evaluate_model import plot_confusion_matrix, plot_roc_curve

# Predicted probabilities for ROC
y_proba = model.predict_proba(X_test)[:, 1]

# Plot evaluation visuals
plot_confusion_matrix(y_test, y_pred, labels=["No Delay", "Delay"], title="Confusion Matrix – Random Forest")
plot_roc_curve(y_test, y_proba, title="ROC Curve – Random Forest")
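
The two plots can be complemented with the numbers behind them. The sketch below uses scikit-learn's `confusion_matrix`, `roc_curve`, and `roc_auc_score` on a tiny made-up example; the 0.5 decision threshold and the demo values are assumptions, not results from this project.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Made-up ground truth and predicted delay probabilities
y_test_demo = [0, 0, 1, 1]
y_proba_demo = [0.1, 0.4, 0.35, 0.8]

# Hard predictions at an assumed 0.5 threshold
y_pred_demo = [int(p >= 0.5) for p in y_proba_demo]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_test_demo, y_pred_demo)

# ROC-AUC is computed from the probabilities, not from the hard predictions
auc = roc_auc_score(y_test_demo, y_proba_demo)
fpr, tpr, thresholds = roc_curve(y_test_demo, y_proba_demo)

print(cm)
print(f"ROC-AUC: {auc:.2f}")
```

In the notebook itself, the same calls would take `y_test`, `y_pred`, and `y_proba` from the cells above.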


## ✅ Conclusions

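- In the EDA plots, certain airlines and destinations show noticeably higher delay rates than others.
- The `min_diff` feature captures the time deviation between `fecha_i` and `fecha_o` and helps reveal operational issues. Because the 15-minute delay label is a threshold on this kind of deviation, `min_diff` should be kept out of the model inputs to avoid target leakage.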
- Temporal and categorical variables like `period_day`, `high_season`, and `tipovuelo` provide important segmentations.
- The problem is well-suited for a binary classification model.

## ✅ Summary & Next Steps

This notebook covered the full pipeline from data loading to model evaluation.

Key takeaways:
- Delay prediction is a feasible task using simple operational variables.
- Features like flight type, airline, time of day, and seasonality provide valuable segmentation.
- A Random Forest baseline was trained to classify delays over 15 minutes; its quality should be judged from the confusion matrix and ROC curve above rather than from accuracy alone.

### Next steps:
- Hyperparameter tuning with cross-validation
- Feature selection and dimensionality reduction
- Integration of external features (e.g., weather, traffic)
- Deployment of the model as a service (API or dashboard)
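
As a starting point for the first item, hyperparameter tuning with cross-validation could look roughly like the sketch below. It uses scikit-learn's `GridSearchCV` on synthetic data from `make_classification`; the parameter grid, the 3-fold setup, and the ROC-AUC scoring choice are illustrative assumptions rather than tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the encoded flight features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Small illustrative grid; a real search would cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [4, None]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",
    # Stratified folds keep the delayed/on-time ratio stable across splits
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```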

