## Baseline Model

The baseline always predicts "no cancellation". Because only ~2.64% of flights are cancelled, this naive rule is right most of the time, which makes it a useful floor for the models that follow.
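
The same kind of baseline can also be expressed with scikit-learn's `DummyClassifier`. The sketch below is illustrative only: the toy labels (`y_toy`) and placeholder features (`X_toy`) are assumptions, not the flight data, and the notebook itself builds the baseline with `np.zeros_like` instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels with roughly the same imbalance as the flight data (assumed values).
rng = np.random.default_rng(0)
y_toy = (rng.random(10_000) < 0.0264).astype(int)
X_toy = np.zeros((len(y_toy), 1))  # DummyClassifier ignores the feature values

# "most_frequent" always predicts the majority class seen during fit.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_toy, y_toy)
dummy_pred = dummy.predict(X_toy)

print("Unique predictions:", np.unique(dummy_pred))
print(f"Accuracy: {dummy.score(X_toy, y_toy):.4f}")
print(f"1 - positive rate: {1 - y_toy.mean():.4f}")
```

The accuracy of such a classifier is simply one minus the positive rate, which is why accuracy alone says little on imbalanced data.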

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report, confusion_matrix, roc_auc_score, average_precision_score

csv_path = "../data/flights_sample_3m.csv"
df = pd.read_csv(csv_path)

# Similar to what was done in feature_engineering.ipynb

# Feature Engineering: Add temporal features
df["dep_hour"] = df["CRS_DEP_TIME"] // 100

# Extract temporal features from FL_DATE
df['FL_DATE'] = pd.to_datetime(df['FL_DATE'])
df['month'] = df['FL_DATE'].dt.month
df['day_of_week'] = df['FL_DATE'].dt.dayofweek  # 0=Monday, 6=Sunday

# Frequency encoding (does not use the target, so it is computed before the split;
# note that the counts still include rows that later end up in the test set)
# Count how often each airport appears
df["origin_freq"] = df["ORIGIN"].map(df["ORIGIN"].value_counts())
df["dest_freq"] = df["DEST"].map(df["DEST"].value_counts())

# Get proportions
df["origin_freq_proportion"] = df["ORIGIN"].map(df["ORIGIN"].value_counts(normalize=True))
df["dest_freq_proportion"] = df["DEST"].map(df["DEST"].value_counts(normalize=True))

# Drop leakage columns (actual times/delays that happen after scheduled departure)
leakage_cols = [
    'DEP_TIME', 'DEP_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'WHEELS_ON',
    'TAXI_IN', 'ARR_TIME', 'ARR_DELAY', 'ELAPSED_TIME', 'AIR_TIME'
]
df = df.drop(columns=[col for col in leakage_cols if col in df.columns])

# Separate features and target
X = df.drop(columns=["CANCELLED"])  # Features
y = df["CANCELLED"]  # Target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {len(X_train):,}")
print(f"Test set size: {len(X_test):,}")
print(f"\nCancellation rate in test set: {y_test.mean():.4f} ({y_test.mean()*100:.2f}%)")

Training set size: 2,400,000
Test set size: 600,000

Cancellation rate in test set: 0.0264 (2.64%)
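
The cell above computes airport frequencies on the full dataset before splitting. That does not use the target, but the counts do include test rows. A stricter variant learns the counts from the training rows only. The sketch below assumes small toy frames (`train`, `test`) in place of `X_train` and `X_test`, and the choice to give unseen airports a count of 0 is an assumption as well.

```python
import pandas as pd

# Hypothetical toy frames standing in for X_train / X_test.
train = pd.DataFrame({"ORIGIN": ["ATL", "ATL", "ORD", "ATL", "DFW"]})
test = pd.DataFrame({"ORIGIN": ["ATL", "ORD", "SEA"]})  # SEA never appears in train

# Learn the counts from the training rows only.
origin_counts = train["ORIGIN"].value_counts()

train["origin_freq"] = train["ORIGIN"].map(origin_counts)
# Airports unseen during training map to NaN; treat them as count 0 (assumed policy).
test["origin_freq"] = test["ORIGIN"].map(origin_counts).fillna(0).astype(int)

print(train)
print(test)
```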


In [3]:
# Baseline: Predict all zeros (no cancellations)
baseline_pred = np.zeros_like(y_test)

print(f"Baseline predictions: {np.unique(baseline_pred, return_counts=True)}")
print(f"\nActual test labels: {np.unique(y_test, return_counts=True)}")

Baseline predictions: (array([0.]), array([600000]))

Actual test labels: (array([0., 1.]), array([584172,  15828]))


In [4]:
# Evaluate baseline performance
baseline_accuracy = accuracy_score(y_test, baseline_pred)
baseline_precision = precision_score(y_test, baseline_pred, zero_division=0)
baseline_recall = recall_score(y_test, baseline_pred, zero_division=0)

# ROC-AUC and PR-AUC are computed from scores rather than hard labels,
# so treat the all-zero predictions as a constant probability of 0.0.
# With a constant score, ROC-AUC = 0.5 (random guessing) and average precision = positive class rate.
baseline_proba = baseline_pred.astype(float)  # all 0.0

try:
    baseline_roc_auc = roc_auc_score(y_test, baseline_proba)
except ValueError as e:
    baseline_roc_auc = 0.5  # When all predictions are 0.0, ROC-AUC equals 0.5 (diagonal ROC curve)
    print(f"Note: ROC-AUC calculation issue: {e}")

try:
    baseline_pr_auc = average_precision_score(y_test, baseline_proba)
except ValueError as e:
    baseline_pr_auc = y_test.mean()  # When all predictions are 0.0, PR-AUC equals the positive class rate
    print(f"Note: PR-AUC calculation issue: {e}")

print("Baseline Model Performance:")
print(f"  Accuracy:  {baseline_accuracy:.4f}")
print(f"  Precision: {baseline_precision:.4f}")
print(f"  Recall:    {baseline_recall:.4f}")
print(f"  ROC-AUC:   {baseline_roc_auc:.4f}")
print(f"  PR-AUC:    {baseline_pr_auc:.4f}")

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, baseline_pred))

Baseline Model Performance:
  Accuracy:  0.9736
  Precision: 0.0000
  Recall:    0.0000
  ROC-AUC:   0.5000
  PR-AUC:    0.0264

Confusion Matrix:
[[584172      0]
 [ 15828      0]]


**Interpretation:**
- This baseline predicts **no cancellations** for all flights
- It achieves ~97.36% accuracy (simply because most flights aren't cancelled)
- However, it has **0% recall** (fails to catch any cancellations) and **0% precision**
- **ROC-AUC = 0.5000**: Equivalent to random guessing. Because every flight receives the same score (0.0), the model cannot rank any cancelled flight above a non-cancelled one, so the ROC curve is the diagonal (TPR = FPR) and AUC = 0.5. Values below 0.5 would be worse than random
- **PR-AUC = 0.0264**: Equal to the positive class rate (2.64% cancellation rate). With a constant score, average precision (what `average_precision_score` computes) reduces to the proportion of positive examples, so this is the floor for PR-AUC
- Any real model should beat this baseline by achieving ROC-AUC > 0.5 and PR-AUC > 0.0264
- **Note**: For imbalanced datasets, PR-AUC is often more informative than ROC-AUC, as it focuses on the minority class performance
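
The two floor values in the interpretation above can be checked on synthetic labels. This is a minimal sketch with an assumed toy label vector (`y_toy`, 26 positives out of 1,000), not the flight data.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic labels: 1,000 samples with exactly 26 positives (assumed for illustration).
y_toy = np.zeros(1000, dtype=int)
y_toy[:26] = 1

# A constant score gives the model no way to rank positives above negatives.
constant_scores = np.zeros(len(y_toy))

roc = roc_auc_score(y_toy, constant_scores)
ap = average_precision_score(y_toy, constant_scores)

print(f"ROC-AUC:           {roc:.4f}")
print(f"Average precision: {ap:.4f}")
print(f"Positive rate:     {y_toy.mean():.4f}")
```

Any constant score behaves the same way, not just 0.0, because both metrics depend only on the ranking of the scores.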

In [5]:
# Preprocessing: Apply target encoding (after train/test split to avoid leakage)
import warnings
warnings.filterwarnings('ignore')

# Make copies for preprocessing
X_train_processed = X_train.copy()
X_test_processed = X_test.copy()

# Target Encoding for AIRLINE (using only training data to avoid leakage)
# Calculate mean cancellation rate per airline from TRAINING data only
if 'AIRLINE' in X_train_processed.columns:
    # Create a temporary dataframe with aligned indices
    temp_df = pd.DataFrame({
        'AIRLINE': X_train_processed['AIRLINE'],
        'CANCELLED': y_train.values
    }, index=X_train_processed.index)
    
    airline_target_means = temp_df.groupby('AIRLINE')['CANCELLED'].mean()
    
    # Apply to both train and test
    X_train_processed['airline_target_encoded'] = X_train_processed['AIRLINE'].map(airline_target_means)
    X_test_processed['airline_target_encoded'] = X_test_processed['AIRLINE'].map(airline_target_means)
    
    # Fill any unseen airlines in test with global training mean
    global_mean = y_train.mean()
    X_test_processed['airline_target_encoded'] = X_test_processed['airline_target_encoded'].fillna(global_mean)
    
    # Also fill any NaN in train (shouldn't happen, but just in case)
    X_train_processed['airline_target_encoded'] = X_train_processed['airline_target_encoded'].fillna(global_mean)
    
    # Drop original AIRLINE column (we have the encoded version)
    X_train_processed = X_train_processed.drop(columns=['AIRLINE'])
    X_test_processed = X_test_processed.drop(columns=['AIRLINE'])

# Select final feature set: use frequency-encoded and target-encoded features
# Drop remaining categorical columns that aren't encoded yet
categorical_cols = X_train_processed.select_dtypes(include=['object']).columns.tolist()
if categorical_cols:
    print(f"Dropping unencoded categorical columns: {categorical_cols}")
    X_train_processed = X_train_processed.drop(columns=categorical_cols)
    X_test_processed = X_test_processed.drop(columns=categorical_cols)

# CANCELLATION_CODE is only populated for cancelled flights, so it must not be used as a feature.
# It is usually removed with the object columns above; this guard covers a non-object dtype.
columns_to_drop = ['CANCELLATION_CODE'] if 'CANCELLATION_CODE' in X_train_processed.columns else []
if columns_to_drop:
    X_train_processed = X_train_processed.drop(columns=columns_to_drop)
    X_test_processed = X_test_processed.drop(columns=columns_to_drop)

# Fill remaining NaN values in numeric columns with the training-set median
# (medians are computed once, from training data only, and reused for the test set)
numeric_cols = X_train_processed.select_dtypes(include=[np.number]).columns
train_medians = X_train_processed[numeric_cols].median()
X_train_processed[numeric_cols] = X_train_processed[numeric_cols].fillna(train_medians)
X_test_processed[numeric_cols] = X_test_processed[numeric_cols].fillna(train_medians)

# Ensure all columns are numeric (convert any remaining non-numeric to numeric)
for col in X_train_processed.columns:
    if X_train_processed[col].dtype == 'object':
        # Try to convert to numeric
        X_train_processed[col] = pd.to_numeric(X_train_processed[col], errors='coerce')
        X_test_processed[col] = pd.to_numeric(X_test_processed[col], errors='coerce')

# Final check: ensure no NaN or infinite values
X_train_processed = X_train_processed.replace([np.inf, -np.inf], np.nan)
X_test_processed = X_test_processed.replace([np.inf, -np.inf], np.nan)
X_train_processed = X_train_processed.fillna(X_train_processed.median())
X_test_processed = X_test_processed.fillna(X_train_processed.median())

# Keep only numeric columns (this drops the datetime FL_DATE column);
# sklearn accepts DataFrames directly, so no conversion to numpy arrays is needed
X_train_processed = X_train_processed.select_dtypes(include=[np.number])
X_test_processed = X_test_processed.select_dtypes(include=[np.number])

print(f"\nProcessed training shape: {X_train_processed.shape}")
print(f"Processed test shape: {X_test_processed.shape}")
print(f"\nData types: {X_train_processed.dtypes.value_counts().to_dict()}")
print(f"\nKey features being used:")
feature_list = [col for col in X_train_processed.columns if any(x in col.lower() for x in ['freq', 'target', 'hour', 'month', 'day', 'distance'])]
for feat in feature_list[:10]:  # Show first 10
    print(f"  - {feat}")
if len(feature_list) > 10:
    print(f"  ... and {len(feature_list) - 10} more features")

Dropping unencoded categorical columns: ['AIRLINE_DOT', 'AIRLINE_CODE', 'ORIGIN', 'ORIGIN_CITY', 'DEST', 'DEST_CITY', 'CANCELLATION_CODE']

Processed training shape: (2400000, 20)
Processed test shape: (600000, 20)

Data types: {dtype('float64'): 11, dtype('int64'): 7, dtype('int32'): 2}

Key features being used:
  - DISTANCE
  - dep_hour
  - month
  - day_of_week
  - origin_freq
  - dest_freq
  - origin_freq_proportion
  - dest_freq_proportion
  - airline_target_encoded
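
The target-encoding step in the cell above can be isolated into a small function. The sketch below is a simplified illustration on toy data: `target_encode` is a hypothetical helper (not part of the notebook), and the airline codes and labels are made up.

```python
import pandas as pd

def target_encode(train_cat, train_target, test_cat):
    """Mean-encode a categorical column using training rows only (hypothetical helper)."""
    # Per-category mean of the target, learned from the training data.
    means = train_target.groupby(train_cat).mean()
    global_mean = train_target.mean()
    train_encoded = train_cat.map(means)
    # Categories unseen in training fall back to the global training mean.
    test_encoded = test_cat.map(means).fillna(global_mean)
    return train_encoded, test_encoded

train_airline = pd.Series(["AA", "AA", "DL", "DL", "DL", "UA"])
train_cancelled = pd.Series([1, 0, 0, 0, 0, 1])
test_airline = pd.Series(["AA", "DL", "B6"])  # B6 is unseen in training

train_enc, test_enc = target_encode(train_airline, train_cancelled, test_airline)
print(train_enc.tolist())
print(test_enc.tolist())
```

Because the means come from the training labels alone, no test label influences the encoded feature. Categories with very few training rows (like `UA` here) get noisy estimates.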


In [7]:
# Helper function to evaluate models
def evaluate_model(y_true, y_pred, y_proba, model_name):
    """Evaluate model performance and return metrics dictionary"""
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    roc_auc = roc_auc_score(y_true, y_proba)
    pr_auc = average_precision_score(y_true, y_proba)
    
    results = {
        'Model': model_name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'ROC-AUC': roc_auc,
        'PR-AUC': pr_auc
    }
    
    return results


## Model Training and Evaluation

**Fair Comparison Setup:**
- All models use the **same decision threshold (0.5)**, which is the default for `.predict()`
- All models handle class imbalance in a comparable way:
  - Logistic Regression & Random Forest: `class_weight='balanced'`
  - XGBoost & LightGBM: `scale_pos_weight` = negatives / positives, which gives the same positive-to-negative weight ratio
- All models use `random_state=42` for reproducibility
- All models are evaluated on the same train/test split
- Threshold optimization will be done later after selecting the best model
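
The claim that `class_weight='balanced'` and `scale_pos_weight` are comparable can be checked numerically. A minimal sketch on assumed toy labels (950 negatives, 50 positives):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 950 negatives, 50 positives (assumed counts).
y_toy = np.array([0] * 950 + [1] * 50)

# 'balanced' assigns each class the weight n_samples / (n_classes * class_count).
w_neg, w_pos = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)

# Same formula the notebook uses for XGBoost and LightGBM.
scale_pos_weight = (y_toy == 0).sum() / (y_toy == 1).sum()

print(f"balanced weights:     neg={w_neg:.4f}, pos={w_pos:.4f}")
print(f"weight ratio pos/neg: {w_pos / w_neg:.4f}")
print(f"scale_pos_weight:     {scale_pos_weight:.4f}")
```

The absolute weights differ, but the ratio between the positive and negative class is the same in both schemes. How each library applies the weights internally still varies, so the setups are comparable rather than identical.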

**Key Metrics for Imbalanced Data:**
- **PR-AUC**: Primary metric (focuses on minority class); computed here as average precision via `average_precision_score`
- **ROC-AUC**: Secondary metric (overall discrimination ability)
- **Recall**: How many cancellations we catch
- **Precision**: How many predicted cancellations are real
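
Precision and recall depend on the decision threshold, while PR-AUC summarizes the whole ranking. The sketch below illustrates this on synthetic scores. The label rate, score distribution, and thresholds are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_score, recall_score

rng = np.random.default_rng(42)
y_toy = (rng.random(5000) < 0.05).astype(int)
# Synthetic scores: positives tend to score higher than negatives.
scores = np.clip(rng.normal(0.3 + 0.3 * y_toy, 0.15), 0.0, 1.0)

# Threshold-free summary of the ranking quality.
print(f"Average precision: {average_precision_score(y_toy, scores):.3f}")

rows = []
for threshold in (0.3, 0.5, 0.7):
    pred = (scores >= threshold).astype(int)
    precision = precision_score(y_toy, pred, zero_division=0)
    recall = recall_score(y_toy, pred, zero_division=0)
    rows.append((threshold, precision, recall))
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  recall={recall:.3f}")
```

Raising the threshold can only keep recall the same or lower it, which is why comparing models at a single fixed threshold shows only one point on each model's precision-recall curve.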

## 1. Logistic Regression

In [11]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(
    class_weight='balanced',
    max_iter=1000,
    random_state=42,
    n_jobs=-1
)

lr_model.fit(X_train_processed, y_train)

# Predictions
lr_pred = lr_model.predict(X_test_processed)
lr_proba = lr_model.predict_proba(X_test_processed)[:, 1]

# Evaluate
lr_results = evaluate_model(y_test, lr_pred, lr_proba, 'Logistic Regression')
print("\nLogistic Regression Performance:")
for metric, value in lr_results.items():
    if metric != 'Model':
        print(f"  {metric}: {value:.4f}")


Logistic Regression Performance:
  Accuracy: 0.5309
  Precision: 0.0424
  Recall: 0.7782
  ROC-AUC: 0.6896
  PR-AUC: 0.0441


## 2. Random Forest

In [13]:
from sklearn.ensemble import RandomForestClassifier

# Depth and leaf-size limits keep training fast on the full training set;
# relax them (or raise n_estimators) for a more thorough fit
rf_model = RandomForestClassifier(
    n_estimators=100,  # Reduce for faster training; increase for better performance
    class_weight='balanced',
    random_state=42,
    n_jobs=-1,
    max_depth=20,  # Limit depth for faster training
    min_samples_split=100,
    min_samples_leaf=50
)

rf_model.fit(X_train_processed, y_train)

# Predictions
rf_pred = rf_model.predict(X_test_processed)
rf_proba = rf_model.predict_proba(X_test_processed)[:, 1]

# Evaluate
rf_results = evaluate_model(y_test, rf_pred, rf_proba, 'Random Forest')
print("\nRandom Forest Performance:")
for metric, value in rf_results.items():
    if metric != 'Model':
        print(f"  {metric}: {value:.4f}")


Random Forest Performance:
  Accuracy: 0.7727
  Precision: 0.0730
  Recall: 0.6510
  ROC-AUC: 0.8022
  PR-AUC: 0.1068


## 3. XGBoost

In [None]:
try:
    import xgboost as xgb
    
    # Calculate scale_pos_weight for class imbalance
    # scale_pos_weight = number of negative samples / number of positive samples
    scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
    
    xgb_model = xgb.XGBClassifier(
        n_estimators=100,
        max_depth=6,
        learning_rate=0.1,
        scale_pos_weight=scale_pos_weight,
        random_state=42,
        n_jobs=-1,
        eval_metric='logloss',
        tree_method='hist'
    )
    
    xgb_model.fit(X_train_processed, y_train)
    
    # Predictions
    xgb_pred = xgb_model.predict(X_test_processed)
    xgb_proba = xgb_model.predict_proba(X_test_processed)[:, 1]
    
    # Evaluate
    xgb_results = evaluate_model(y_test, xgb_pred, xgb_proba, 'XGBoost')
    print("\nXGBoost Performance:")
    for metric, value in xgb_results.items():
        if metric != 'Model':
            print(f"  {metric}: {value:.4f}")
            
except ImportError:
    print("XGBoost not installed. Install with: pip install xgboost")
    xgb_results = None


XGBoost Performance:
  Accuracy: 0.6684
  Precision: 0.0579
  Recall: 0.7580
  ROC-AUC: 0.7858
  PR-AUC: 0.0866


## 4. LightGBM

In [None]:
try:
    import lightgbm as lgb
    
    # Calculate scale_pos_weight for class imbalance
    scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
    
    lgb_model = lgb.LGBMClassifier(
        n_estimators=100,
        max_depth=6,
        learning_rate=0.1,
        scale_pos_weight=scale_pos_weight,
        random_state=42,
        n_jobs=-1,
        verbose=-1
    )
    
    lgb_model.fit(X_train_processed, y_train)
    
    # Predictions
    lgb_pred = lgb_model.predict(X_test_processed)
    lgb_proba = lgb_model.predict_proba(X_test_processed)[:, 1]
    
    # Evaluate
    lgb_results = evaluate_model(y_test, lgb_pred, lgb_proba, 'LightGBM')
    print("\nLightGBM Performance:")
    for metric, value in lgb_results.items():
        if metric != 'Model':
            print(f"  {metric}: {value:.4f}")
            
except ImportError:
    print("LightGBM not installed. Install with: pip install lightgbm")
    lgb_results = None



LightGBM Performance:
  Accuracy: 0.6730
  Precision: 0.0586
  Recall: 0.7560
  ROC-AUC: 0.7876
  PR-AUC: 0.0891


## Model Comparison

In [14]:
# Collect all results
all_results = [lr_results, rf_results]

if xgb_results:
    all_results.append(xgb_results)
if lgb_results:
    all_results.append(lgb_results)

# Create comparison DataFrame
comparison_df = pd.DataFrame(all_results)
comparison_df = comparison_df.set_index('Model')

# Add baseline for comparison
baseline_results = {
    'Model': 'Baseline (All Zeros)',
    'Accuracy': baseline_accuracy,
    'Precision': baseline_precision,
    'Recall': baseline_recall,
    'ROC-AUC': baseline_roc_auc,
    'PR-AUC': baseline_pr_auc
}
baseline_df = pd.DataFrame([baseline_results]).set_index('Model')

# Combine baseline with model results
comparison_df = pd.concat([baseline_df, comparison_df])

print("\n" + "="*70)
print("MODEL COMPARISON")
print("="*70)
print(comparison_df.round(4))
print("\n" + "="*70)

# Show which model performs best on each metric (higher is better for all five)
print("\nBest performing models by metric:")
metrics_to_compare = ['Accuracy', 'Precision', 'Recall', 'ROC-AUC', 'PR-AUC']
for metric in metrics_to_compare:
    best_idx = comparison_df[metric].idxmax()
    best_value = comparison_df.loc[best_idx, metric]
    print(f"  {metric:12s}: {best_idx:20s} ({best_value:.4f})")


MODEL COMPARISON
                      Accuracy  Precision  Recall  ROC-AUC  PR-AUC
Model                                                             
Baseline (All Zeros)    0.9736     0.0000  0.0000   0.5000  0.0264
Logistic Regression     0.5309     0.0424  0.7782   0.6896  0.0441
Random Forest           0.7727     0.0730  0.6510   0.8022  0.1068
XGBoost                 0.6684     0.0579  0.7580   0.7858  0.0866
LightGBM                0.6730     0.0586  0.7560   0.7876  0.0891


Best performing models by metric:
  Accuracy    : Baseline (All Zeros) (0.9736)
  Precision   : Random Forest        (0.0730)
  Recall      : Logistic Regression  (0.7782)
  ROC-AUC     : Random Forest        (0.8022)
  PR-AUC      : Random Forest        (0.1068)


## Model Selection Decision

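**Fair Comparison Summary:**
- All models were compared using hand-picked, untuned hyperparameters (not library defaults in every case: Random Forest uses `max_depth=20`, the boosted models use `max_depth=6`)
- Same decision threshold (0.5), same train/test split, same features
- The comparison is **fair** in the sense that no model received a hyperparameter search

**Results:**
- **Random Forest** performs best on the key metrics: PR-AUC (0.1068), ROC-AUC (0.8022), and Precision (0.0730)
- XGBoost (PR-AUC 0.0866) and LightGBM (PR-AUC 0.0891) trail Random Forest with the current hyperparameters, but both beat the baseline PR-AUC of 0.0264
- Logistic Regression has the highest Recall (0.7782) but the lowest PR-AUC of the four models (0.0441)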
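
One way to read the PR-AUC column is as a lift over the all-zeros baseline. The sketch below uses the PR-AUC values from the comparison table; the `lift` framing itself is an added illustration, not something the notebook computes.

```python
import pandas as pd

# PR-AUC values copied from the model comparison table.
pr_auc = pd.Series({
    "Logistic Regression": 0.0441,
    "Random Forest": 0.1068,
    "XGBoost": 0.0866,
    "LightGBM": 0.0891,
})
baseline_pr_auc = 0.0264  # positive class rate in the test set

# Ratio of each model's PR-AUC to the baseline floor, best model first.
lift = (pr_auc / baseline_pr_auc).sort_values(ascending=False)
print(lift.round(2))
```

Random Forest reaches roughly four times the baseline PR-AUC, and every model is above 1, meaning each ranks cancellations better than a constant score would.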