## Introduction

This notebook builds on the cleaned dataset from the preprocessing stage and
implements the feature engineering steps needed to prepare the data for model
training. The goal here is to enrich the dataset with leakage-safe encodings,
interaction terms, and final feature selection before training a baseline model.

We load the processed data, apply out-of-fold target encoding to key categorical
fields, construct interaction features that capture cross-effects between
important signals, and prepare the final training, validation, and test
matrices used for modeling.


In [None]:
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import lightgbm as lgb
from sklearn.metrics import mean_absolute_error, mean_squared_error

project_root = next(
    (p for p in [Path.cwd()] + list(Path.cwd().parents) if (p / "src").exists()),
    Path.cwd()
)

sys.path.insert(0, str(project_root))



In [3]:
# Module imports 
from src.preprocessing_imputation import preprocess_impute
from src.feature_engineering import (
    TargetEncoderOOF,
    FeatureInteractionBuilder,
)

## 1. Load Model-Ready Preprocessed Data

We begin by loading the preprocessed dataset produced in the previous notebook.
Selected integer and float columns are downcast from 64-bit to 32-bit where
safe (the target is excluded), which reduces memory usage and should not
materially affect model performance.

At this stage, the data is free of missing values and already enriched with
temporal, order-level, store-level, and market load features.


In [4]:
df = pd.read_parquet(project_root / "data" / "processed" / "preprocessed_data.parquet",
                     engine="pyarrow")

# Columns to convert to int32
int64_to_int32 = [
    "order_subtotal_bucket",
    "order_total_items_bucket",
    "order_distinct_items_bucket",
    "store_order_volume"
]

df[int64_to_int32] = df[int64_to_int32].astype("int32")

# Columns to convert to float32 (exclude target!)
float64_to_float32 = [
    "time_hour_sin", "time_hour_cos",
    "time_dow_sin", "time_dow_cos",
    "order_avg_item_price",
    "order_percent_distinct_items",
    "load_roll_outstanding_mean",
    "load_roll_onshift_mean",
    "load_roll_busy_mean",
    "load_roll_busy_ratio_mean",
    "load_roll_demand_supply_mean",
    "load_roll_outstanding_std",
    "load_roll_busy_std",
    "load_recent_order_count",
    "load_demand_momentum",
    "load_supply_momentum",
    "load_busy_momentum",
    "load_utilization_volatility",
    "load_pressure_index",
    "total_onshift_dashers_lag_5min",
    "total_onshift_dashers_lag_15min",
    "total_onshift_dashers_lag_30min",
    "total_busy_dashers_lag_5min",
    "total_busy_dashers_lag_15min",
    "total_busy_dashers_lag_30min",
    "total_outstanding_orders_lag_5min",
    "total_outstanding_orders_lag_15min",
    "total_outstanding_orders_lag_30min",
    "load_busy_ratio_lag_5min",
    "load_busy_ratio_lag_15min",
    "load_busy_ratio_lag_30min",
    "load_demand_supply_ratio_lag_5min",
    "load_demand_supply_ratio_lag_15min",
    "load_demand_supply_ratio_lag_30min",
    "total_onshift_dashers_roll10_std",
    "total_busy_dashers_roll10_std",
    "total_outstanding_orders_roll10_std",
    "load_busy_ratio_roll10_std",
    "load_demand_supply_ratio_roll10_std",
    "demand_spike_15m",
    "supply_drop_15m",
    "busy_growth_15m",
    "burst_index",
]

df[float64_to_float32] = df[float64_to_float32].astype("float32")
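
One way to confirm that the downcast is both safe and worthwhile is to check integer ranges before converting and to compare memory usage afterwards. The sketch below does this on a small synthetic frame; `downcast_report` is a hypothetical helper, not part of the project's `src` package, and the column names are borrowed only for illustration.

```python
import numpy as np
import pandas as pd


def downcast_report(frame, int_cols, float_cols):
    """Return a downcast copy after checking that int columns fit in int32."""
    out = frame.copy()
    int32_info = np.iinfo(np.int32)
    for col in int_cols:
        # Refuse to downcast if any value would overflow int32.
        if out[col].min() < int32_info.min or out[col].max() > int32_info.max:
            raise ValueError(f"{col} does not fit in int32")
        out[col] = out[col].astype("int32")
    for col in float_cols:
        out[col] = out[col].astype("float32")
    return out


# Synthetic stand-in for the real dataset.
demo = pd.DataFrame({
    "store_order_volume": np.arange(1000, dtype="int64"),
    "order_avg_item_price": np.linspace(1.0, 50.0, 1000),
})
small = downcast_report(demo, ["store_order_volume"], ["order_avg_item_price"])

before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"memory: {before} -> {after} bytes")
```

The range check matters mainly for integer columns, where an out-of-range cast can silently wrap around; float32 trades precision rather than range for typical feature values.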


## 2. Time-Based Train/Validation/Test Split

To mimic real-world forecasting conditions, the dataset is split
chronologically:

- **Train:** earliest 70%  
- **Validation:** next 15%  
- **Test:** most recent 15%

This ensures the model never learns from future data and reflects realistic
production behavior.


In [None]:
# 1. Split Data 
cutoff_1 = df.created_at.quantile(0.70)
cutoff_2 = df.created_at.quantile(0.85)

train_df = df[df.created_at < cutoff_1].copy()
valid_df = df[(df.created_at >= cutoff_1) & (df.created_at < cutoff_2)].copy()
test_df  = df[df.created_at >= cutoff_2].copy()
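
The split logic can be sanity-checked on a toy frame before trusting it on the full dataset. The sketch below wraps the same quantile-based cutoffs in a hypothetical helper, `chronological_split`, and asserts that the three splits cover every row and do not overlap in time.

```python
import pandas as pd


def chronological_split(frame, time_col, train_q=0.70, valid_q=0.85):
    """Split a frame into train/valid/test by timestamp quantiles."""
    cutoff_1 = frame[time_col].quantile(train_q)
    cutoff_2 = frame[time_col].quantile(valid_q)
    train = frame[frame[time_col] < cutoff_1].copy()
    valid = frame[(frame[time_col] >= cutoff_1) & (frame[time_col] < cutoff_2)].copy()
    test = frame[frame[time_col] >= cutoff_2].copy()
    return train, valid, test


# Toy data: 100 orders, one per day.
toy = pd.DataFrame({"created_at": pd.date_range("2015-01-01", periods=100, freq="D")})
train, valid, test = chronological_split(toy, "created_at")

# Every row lands in exactly one split, and the splits are ordered in time.
assert len(train) + len(valid) + len(test) == len(toy)
assert train["created_at"].max() < valid["created_at"].min()
assert valid["created_at"].max() < test["created_at"].min()
print(len(train), len(valid), len(test))
```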

## 3. Leakage-Safe Target Encoding

Several categorical fields, such as `store_id`, `market_id`,
`store_primary_category`, and `order_protocol`, contain thousands of levels or
highly uneven distributions.

We apply a leakage-safe K-fold target encoder that replaces each category with
the average delivery time observed *in the other folds* of the training set.
This technique captures category-specific behavior without letting any
training row's encoding depend on its own target value.

- The **training set** receives out-of-fold (OOF) encoded features.
- The **validation and test sets** receive the final encoding trained on the
  entire training split.

This produces stable, leakage-free numerical representations of the major
categorical features across all splits.
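
The internals of `TargetEncoderOOF` live in the project's `src` package and are not shown in this notebook. As a rough illustration of the idea, the sketch below implements smoothed out-of-fold mean encoding on a toy frame. The function names, the random fold assignment, and the smoothing formula (a count-weighted blend of the category mean and the global mean) are all assumptions, not the project's implementation.

```python
import numpy as np
import pandas as pd


def smoothed_means(frame, col, target, smoothing):
    """Category means shrunk toward the global mean: (n*mean + m*global) / (n + m)."""
    global_mean = frame[target].mean()
    stats = frame.groupby(col)[target].agg(["mean", "count"])
    enc = (stats["count"] * stats["mean"] + smoothing * global_mean) / (stats["count"] + smoothing)
    return enc, global_mean


def oof_encode_train(train, col, target, n_folds=5, smoothing=10, seed=42):
    """Encode each training row using statistics from the other folds only."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(train))  # assumed: random fold assignment
    encoded = np.empty(len(train), dtype="float64")
    for k in range(n_folds):
        in_fold = folds == k
        enc, global_mean = smoothed_means(train[~in_fold], col, target, smoothing)
        # Categories unseen in the other folds fall back to those folds' global mean.
        encoded[in_fold] = train.loc[in_fold, col].map(enc).fillna(global_mean).to_numpy()
    return pd.Series(encoded, index=train.index, name=f"{col}_encoded")


def encode_holdout(train, holdout, col, target, smoothing=10):
    """Encode validation/test rows with statistics from the full training split."""
    enc, global_mean = smoothed_means(train, col, target, smoothing)
    return holdout[col].map(enc).fillna(global_mean).rename(f"{col}_encoded")


# Toy example with a made-up target.
toy_train = pd.DataFrame({
    "store_id": ["a", "a", "b", "b", "b", "c"] * 10,
    "target": np.arange(60, dtype="float64"),
})
toy_valid = pd.DataFrame({"store_id": ["a", "b", "unseen"]})

train_enc = oof_encode_train(toy_train, "store_id", "target")
valid_enc = encode_holdout(toy_train, toy_valid, "store_id", "target")
print(valid_enc.tolist())
```

In this sketch the smoothing term pulls rare categories toward the global mean, so a store with only a handful of orders does not receive an extreme encoding, and unseen categories in the holdout sets receive the global mean.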


In [None]:
# Initialize target encoders
cat_encoder = TargetEncoderOOF(
    col="store_primary_category",
    target_col="target_delivery_seconds",
    n_folds=5,
    smoothing=10
)

store_embedder = TargetEncoderOOF(
    col="store_id",
    target_col="target_delivery_seconds",
    n_folds=5,
    smoothing=10
)

market_encoder = TargetEncoderOOF(
    col="market_id",
    target_col="target_delivery_seconds",
    n_folds=5,
    smoothing=10
)

protocol_encoder = TargetEncoderOOF(
    col="order_protocol",
    target_col="target_delivery_seconds",
    n_folds=5,
    smoothing=10
)

# Fit encoders on training data
cat_encoder.fit(train_df)
store_embedder.fit(train_df)
market_encoder.fit(train_df)
protocol_encoder.fit(train_df)

# Use K-fold out-of-fold encoding for training set
train_df = cat_encoder.transform(train_df, is_training=True)
train_df = store_embedder.transform(train_df, is_training=True)
train_df = market_encoder.transform(train_df, is_training=True)
train_df = protocol_encoder.transform(train_df, is_training=True)
# Use regular encoding for validation and test sets
valid_df = cat_encoder.transform(valid_df, is_training=False)
valid_df = store_embedder.transform(valid_df, is_training=False)
valid_df = market_encoder.transform(valid_df, is_training=False)
valid_df = protocol_encoder.transform(valid_df, is_training=False)

test_df = cat_encoder.transform(test_df, is_training=False)
test_df = store_embedder.transform(test_df, is_training=False)
test_df = market_encoder.transform(test_df, is_training=False)
test_df = protocol_encoder.transform(test_df, is_training=False)

## 4. Interaction Features

Certain operational phenomena involve relationships between multiple features:
for example, whether high order volume matters more at busy markets, or whether
a costly order interacts with store performance.

We construct a small set of targeted interaction terms, including:

- subtotal × store encoding  
- item count × busy ratio  
- time-of-day × demand/supply ratio  
- store encoding × average item price  

These interactions help the model learn joint effects that may not be captured
by individual features alone.
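
`FeatureInteractionBuilder` is defined in the project's `src` package, and its code is not shown here. Assuming each interaction is a plain elementwise product, and following the `interaction__a__x__b` column naming that appears in this notebook's output, a minimal equivalent might look like the sketch below. The name `add_product_interactions` is hypothetical.

```python
import pandas as pd


def add_product_interactions(frame, pairs):
    """Append one product column per (left, right) pair; returns a new frame."""
    out = frame.copy()
    for left, right in pairs:
        out[f"interaction__{left}__x__{right}"] = out[left] * out[right]
    return out


# Toy values, chosen only to make the product easy to verify by hand.
toy = pd.DataFrame({
    "order_subtotal": [1000.0, 2500.0],
    "store_id_encoded": [2800.0, 3100.0],
})
result = add_product_interactions(toy, [("order_subtotal", "store_id_encoded")])
print(result.columns.tolist())
```

Because a product depends only on values within the same row, a transform of this kind needs no fitting and can be applied to the train, validation, and test splits independently without leakage.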


In [7]:
interaction_pairs = [
    # Interaction 1: Does a big order matter more at a slow store?
    ("order_subtotal", "store_id_encoded"), 
    
    # Interaction 2: Does volume matter more when the fleet is busy?
    ("order_total_items", "load_busy_ratio"),
    
    # Interaction 3: Does time of day amplify supply/demand mismatches?
    ("time_hour_sin", "load_demand_supply_ratio"),
    
    # Interaction 4: Do expensive items take longer at slow stores?
    ("store_id_encoded", "order_avg_item_price"),
]

interactions = FeatureInteractionBuilder(interactions=interaction_pairs)

train_df = interactions.transform(train_df)
valid_df = interactions.transform(valid_df)
test_df  = interactions.transform(test_df)

print("Interaction features created successfully.")
print([c for c in train_df.columns if "interaction" in c])

Interaction features created successfully.
['interaction__order_subtotal__x__store_id_encoded', 'interaction__order_total_items__x__load_busy_ratio', 'interaction__time_hour_sin__x__load_demand_supply_ratio', 'interaction__store_id_encoded__x__order_avg_item_price']


In [8]:
print("Train:", train_df.shape, "Valid:", valid_df.shape, "Test:", test_df.shape)

Train: (137399, 92) Valid: (29443, 92) Test: (29443, 92)


In [None]:
train_df.to_parquet(project_root / "data" / "processed" / "train_features.parquet")
valid_df.to_parquet(project_root / "data" / "processed" / "valid_features.parquet")
test_df.to_parquet(project_root / "data" / "processed" / "test_features.parquet")


## 5. Baseline LightGBM Model

To evaluate the quality of the engineered features, we train a baseline
LightGBM regression model using:

- all engineered features,
- early stopping on the validation set,
- MAE and RMSE as evaluation metrics.

This baseline helps verify that our encodings, splits, and interaction terms
lead to stable, predictive signals before experimenting with more advanced
methods.
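
To judge whether the model's scores reflect real signal, it can help to compare them with a predictor that ignores the features entirely. The sketch below computes MAE and RMSE for constant predictions on synthetic data. The helper name `constant_baseline_scores` is hypothetical, and the numbers it prints are illustrative, not results from this dataset.

```python
import numpy as np


def constant_baseline_scores(y_train, y_eval):
    """MAE of predicting the training median and RMSE of predicting the training mean."""
    y_train = np.asarray(y_train, dtype="float64")
    y_eval = np.asarray(y_eval, dtype="float64")
    # The median is the constant that minimizes in-sample absolute error.
    mae = np.mean(np.abs(y_eval - np.median(y_train)))
    # The mean is the constant that minimizes in-sample squared error.
    rmse = np.sqrt(np.mean((y_eval - np.mean(y_train)) ** 2))
    return mae, rmse


# Synthetic delivery times in seconds (illustrative only).
rng = np.random.default_rng(0)
y_train_demo = rng.gamma(shape=6.0, scale=480.0, size=5000)
y_valid_demo = rng.gamma(shape=6.0, scale=480.0, size=1000)

mae, rmse = constant_baseline_scores(y_train_demo, y_valid_demo)
print(f"constant baseline: MAE={mae:.1f}s RMSE={rmse:.1f}s")
```

Passing the notebook's `y_train` and `y_valid` to a function like this gives reference scores that any feature-based model should beat.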


In [9]:
FEATURE_COLS = [
    col for col in train_df.columns
    if col not in [
        # Target and timestamps
        "created_at",
        "actual_delivery_time",
        "target_delivery_seconds",
        # Raw categoricals, replaced by their target-encoded versions
        "market_id",
        "store_id",
        "store_primary_category",
        "order_protocol",
    ]
]

X_train = train_df[FEATURE_COLS]
y_train = train_df["target_delivery_seconds"]

X_valid = valid_df[FEATURE_COLS]
y_valid = valid_df["target_delivery_seconds"]

X_test  = test_df[FEATURE_COLS]
y_test  = test_df["target_delivery_seconds"]


In [12]:
# Baseline LightGBM Model
params = {
    "objective": "regression",
    "metric": ["mae", "rmse"],
    "boosting_type": "gbdt",
    "learning_rate": 0.05,
    "num_leaves": 64,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.9,
    "bagging_freq": 5,
    "seed": 42,
    # Early-stopping patience of 100 rounds, set once under the canonical
    # parameter name (the original set it twice under two aliases).
    "early_stopping_round": 100,
    "verbosity": 1,
}

train_set = lgb.Dataset(X_train, y_train)
valid_set = lgb.Dataset(X_valid, y_valid)

model = lgb.train(
    params,
    train_set,
    valid_sets=[train_set, valid_set],
    num_boost_round=2000
)

# ensure predictions are numpy arrays (avoid sparse / list types for sklearn metrics)
pred_valid = np.asarray(model.predict(X_valid)).ravel()
pred_test = np.asarray(model.predict(X_test)).ravel()

mae_valid = mean_absolute_error(y_valid, pred_valid)
rmse_valid = np.sqrt(mean_squared_error(y_valid, pred_valid))

mae_test = mean_absolute_error(y_test, pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, pred_test))

print("\n------------- Validation Results -------------")
print(f"MAE  (valid): {mae_valid:,.2f} seconds")
print(f"RMSE (valid): {rmse_valid:,.2f} seconds")

print("\n------------- Test Results -------------")
print(f"MAE  (test): {mae_test:,.2f} seconds")
print(f"RMSE (test): {rmse_test:,.2f} seconds")

importances = model.feature_importance(importance_type="gain")
feat_imp = (
    pd.DataFrame({
        "feature": model.feature_name(), 
        "importance": importances
    })
    .sort_values("importance", ascending=False)
)

feat_imp.head(20)


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.017957 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 14414
[LightGBM] [Info] Number of data points in the train set: 137399, number of used features: 85
[LightGBM] [Info] Start training from score 2812.942736
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[78]	training's l1: 577.033	training's rmse: 764.58	valid_1's l1: 583.254	valid_1's rmse: 778.346

------------- Validation Results -------------
MAE  (valid): 583.25 seconds
RMSE (valid): 778.35 seconds

------------- Test Results -------------
MAE  (test): 617.05 seconds
RMSE (test): 828.51 seconds


| index | feature | importance (gain) |
|---|---|---|
| 83 | `interaction__time_hour_sin__x__load_demand_sup...` | 149151200000.0 |
| 78 | `store_id_encoded` | 111142900000.0 |
| 9 | `time_estimated_store_to_consumer_driving_duration` | 60979510000.0 |
| 39 | `load_roll_demand_supply_mean` | 45684040000.0 |
| 81 | `interaction__order_subtotal__x__store_id_encoded` | 31532860000.0 |
| 33 | `load_demand_supply_ratio` | 18919190000.0 |
| 15 | `time_minute_of_day` | 17395330000.0 |
| 61 | `load_demand_supply_ratio_lag_5min` | 14378830000.0 |
| 75 | `bucket_distance` | 13406110000.0 |
| 17 | `time_hour_sin` | 11505080000.0 |


## 6. Feature Importance Review

LightGBM’s gain-based feature importance highlights the strongest predictors
in the engineered dataset. This step provides an early look at:

- which encoded categories carry the most signal,
- which marketplace load features are most influential,
- whether interaction terms meaningfully contribute to model performance.

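In the baseline run above, the time-of-day × demand/supply interaction ranks
first by gain, followed by `store_id_encoded`, and the subtotal × store
encoding interaction also appears in the top five. These insights inform
further feature refinement and model tuning.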
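
Raw gain values are hard to compare at a glance, so it can help to express each feature's gain as a share of the total. The sketch below does this on a toy frame with the same `feature` and `importance` columns as `feat_imp`. The helper name `add_gain_share` is hypothetical and the numbers are invented for illustration.

```python
import pandas as pd


def add_gain_share(feat_imp):
    """Add each feature's share of total gain and the running cumulative share."""
    out = feat_imp.sort_values("importance", ascending=False).reset_index(drop=True)
    out["gain_share"] = out["importance"] / out["importance"].sum()
    out["cumulative_share"] = out["gain_share"].cumsum()
    return out


# Made-up gains, not the notebook's results.
toy_imp = pd.DataFrame({
    "feature": ["f_a", "f_b", "f_c", "f_d"],
    "importance": [50.0, 30.0, 15.0, 5.0],
})
shares = add_gain_share(toy_imp)
print(shares)
```

The cumulative column gives a quick view of how many features account for most of the gain, which is one possible starting point for pruning low-value features.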


## Conclusion

This notebook transformed the preprocessed dataset into a fully engineered,
model-ready feature matrix by applying leakage-safe target encoding, building
select interaction features, and constructing a clean train/valid/test split.

The baseline LightGBM model reaches a validation MAE of about 583 seconds and
a test MAE of about 617 seconds. The next steps involve model tuning,
evaluation across multiple algorithms, and refinement of the most impactful
signals.
