# LightGBM Forecasting Model

This notebook implements **LightGBM** (Light Gradient Boosting Machine) for time series forecasting with engineered temporal features.

## LightGBM Advantages
- **Leaf-wise Growth**: More accurate than level-wise (used by XGBoost)
- **Fast Training**: Faster than XGBoost and CatBoost on most datasets
- **Memory Efficient**: Uses histogram-based algorithms
- **High Accuracy**: Often matches or beats XGBoost/CatBoost
- **Handles Large Datasets**: Efficient with millions of samples

## Configuration
- **Features**: Temporal (month, quarter), lag features [1,3,6], rolling statistics
- **Hyperparameters**: n_estimators=300, num_leaves=31, learning_rate=0.03  
- **Loss Function**: MAPE optimization
- **Regularization**: L1 + L2 regularization
- **Validation**: Time series split (no data leakage)

In [1]:
import pandas as pd
import numpy as np
import lightgbm as lgb
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pathlib import Path
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
import warnings
warnings.filterwarnings('ignore')

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


## Section 1: Load Time Series Data

In [2]:
# Load company-level time series
data_path = Path('../data/processed/monthly_aggregated_full_company.parquet')

if not data_path.exists():
    data_path = Path('../data/processed/monthly_aggregated_full_company.csv')
    df = pd.read_csv(data_path)
    df['date'] = pd.to_datetime(df['date'])
else:
    df = pd.read_parquet(data_path)

df = df.sort_values('date').reset_index(drop=True)

print(f"Loaded: {len(df)} months ({df['date'].min()} to {df['date'].max()})")

Loaded: 36 months (2022-01-01 00:00:00 to 2024-12-01 00:00:00)


## Section 2: Feature Engineering

Create temporal features and lag features for LightGBM.

In [3]:
target_metrics = [
    'total_orders',
    'total_km_billed',
    'total_km_actual',
    'total_tours',
    'total_drivers',
    'revenue_total',
    'external_drivers',
    'vehicle_km_cost',      # NEW: KM-based transportation cost
    'vehicle_time_cost',    # NEW: Time-based transportation cost
    'total_vehicle_cost'    # NEW: Total vehicle operational cost
]

# Backward compatibility check - handle column name changes
if 'total_km' in df.columns and 'total_km_billed' not in df.columns:
    target_metrics = [m.replace('total_km_billed', 'total_km') if m == 'total_km_billed' else m for m in target_metrics]

# Filter to only include columns that actually exist in the dataframe
available_metrics = [m for m in target_metrics if m in df.columns]
missing_metrics = [m for m in target_metrics if m not in df.columns]

if missing_metrics:
    print(f"⚠️  The following metrics are not available in the dataset and will be skipped:")
    for m in missing_metrics:
        print(f"   - {m}")
    print(f"\n✓ Training models for {len(available_metrics)} available metrics:")
    for m in available_metrics:
        print(f"   - {m}")

target_metrics = available_metrics

⚠️  The following metrics are not available in the dataset and will be skipped:
   - total_km_actual
   - total_tours
   - vehicle_km_cost
   - vehicle_time_cost
   - total_vehicle_cost

✓ Training models for 5 available metrics:
   - total_orders
   - total_km
   - total_drivers
   - revenue_total
   - external_drivers


In [4]:
# Split data: Last 6 months as validation
split_date = '2024-07-01'

train_df = df[df['date'] < split_date].copy()
val_df = df[df['date'] >= split_date].copy()

# Define validation period date boundaries (needed for forecast extraction)
val_start = pd.to_datetime('2024-07-01')
val_end = pd.to_datetime('2024-12-01')
train_end = pd.to_datetime('2024-06-01')

print(f"\n✓ Split complete!")
print(f"  Training: {len(train_df)} months (Jan 2022 - Jun 2024)")
print(f"  Validation: {len(val_df)} months (Jul 2024 - Dec 2024)")


✓ Split complete!
  Training: 30 months (Jan 2022 - Jun 2024)
  Validation: 6 months (Jul 2024 - Dec 2024)


In [5]:
def create_features(df, target_col):
    """
    Create temporal and lag features for LightGBM.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Time series dataframe with 'date' column
    target_col : str
        Name of the target column to create lag features for
    
    Returns:
    --------
    pd.DataFrame
        Dataframe with engineered features
    """
    df_feat = df.copy()
    
    # Temporal features
    df_feat['year'] = df_feat['date'].dt.year
    df_feat['month'] = df_feat['date'].dt.month
    df_feat['quarter'] = df_feat['date'].dt.quarter
    df_feat['week'] = df_feat['date'].dt.isocalendar().week
    df_feat['day_of_year'] = df_feat['date'].dt.dayofyear
    df_feat['weekday'] = df_feat['date'].dt.weekday
    
    # Lag features (previous months' values)
    df_feat[f'lag_1'] = df_feat[target_col].shift(1)
    df_feat[f'lag_3'] = df_feat[target_col].shift(3)
    df_feat[f'lag_6'] = df_feat[target_col].shift(6)
    df_feat[f'lag_12'] = df_feat[target_col].shift(12)
    
    # Rolling statistics
    df_feat[f'rolling_mean_3'] = df_feat[target_col].rolling(window=3, min_periods=1).mean()
    df_feat[f'rolling_std_3'] = df_feat[target_col].rolling(window=3, min_periods=1).std()
    df_feat[f'rolling_mean_6'] = df_feat[target_col].rolling(window=6, min_periods=1).mean()
    
    # Growth rate features
    df_feat[f'growth_rate_1'] = df_feat[target_col].pct_change(1)
    df_feat[f'growth_rate_3'] = df_feat[target_col].pct_change(3)
    df_feat[f'growth_rate_12'] = df_feat[target_col].pct_change(12)
    
    return df_feat

print("✓ Feature engineering function defined")

✓ Feature engineering function defined


## Section 3: Train/Validation Split

In [6]:
# Split data
train_end = '2024-06-30'
val_start = '2024-07-01'
val_end = '2024-12-31'

print(f"Data Split:")
print(f"  Training: up to {train_end}")
print(f"  Validation: {val_start} to {val_end}")

Data Split:
  Training: up to 2024-06-30
  Validation: 2024-07-01 to 2024-12-31


## Section 4: Train LightGBM Models

Train one LightGBM model per target metric.

In [7]:
# Train LightGBM models
lightgbm_models = {}
lightgbm_dataframes = {}
lightgbm_feature_cols = {}

print("Training LightGBM models...")
print("="*80)

for metric in target_metrics:
  print(f"\nTraining LightGBM for {metric}...")

  # Create features
  df_feat = create_features(df, metric)

  # Define exclude columns (columns that should not be used as features)
  # Define exclude columns (columns that should not be used as features)
  # CRITICAL: Exclude ALL target metrics and derived metrics to prevent data leakage
  exclude_cols = [
      # Current target
      metric, 'date', 'year_month',
      
      # All target metrics (never use as features)
      'total_orders', 'total_km_billed', 'total_km_actual', 'total_tours',
      'total_drivers', 'external_drivers', 'internal_drivers', 'revenue_total',
      'vehicle_km_cost', 'vehicle_time_cost', 'total_vehicle_cost',
      'total_km',  # Backward compatibility
      
      # Derived metrics (MUST exclude - calculated from targets)
      'km_per_order', 'km_efficiency', 'revenue_per_order',
      'cost_per_order', 'profit_margin',
      
      # Order type columns
      'Delivery', 'Leergut', 'Pickup/Multi-leg', 'Retoure/Abholung'
  ]

  # Get feature columns (all columns except target and excluded ones)
  feature_cols = [col for col in df_feat.columns if col not in exclude_cols]

  # Split into train/val
  train_feat = df_feat[df_feat['date'] < val_start].copy()
  val_feat = df_feat[(df_feat['date'] >= val_start) & (df_feat['date'] <= val_end)].copy()

  # Prepare training data (drop rows with NaN in features or target)
  train_feat = train_feat.dropna(subset=[metric] + feature_cols)

  X_train = train_feat[feature_cols]
  y_train = train_feat[metric]

  # Train LightGBM model
  model = lgb.LGBMRegressor(
      n_estimators=300,
      max_depth=6,
      learning_rate=0.03,
      num_leaves=31,            # Key LightGBM parameter
      min_child_samples=10,     # Prevent overfitting
      subsample=0.8,
      subsample_freq=1,
      colsample_bytree=0.8,
      reg_alpha=0.1,            # L1 regularization
      reg_lambda=0.1,           # L2 regularization
      metric='mape',
      random_state=42,
      verbose=-1
  )

  model.fit(
      X_train, y_train,
      eval_set=[(X_train, y_train)],
      callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)]
  )

  # Store results
  lightgbm_models[metric] = model
  lightgbm_dataframes[metric] = df_feat
  lightgbm_feature_cols[metric] = feature_cols

  print(f"  ✓ Trained on {len(X_train)} samples with {len(feature_cols)} features")

print(f"\n✓ All models trained successfully!")

Training LightGBM models...

Training LightGBM for total_orders...
  ✓ Trained on 18 samples with 17 features

Training LightGBM for total_km...
  ✓ Trained on 18 samples with 17 features

Training LightGBM for total_drivers...
  ✓ Trained on 18 samples with 17 features

Training LightGBM for revenue_total...
  ✓ Trained on 18 samples with 17 features

Training LightGBM for external_drivers...
  ✓ Trained on 18 samples with 17 features

✓ All models trained successfully!


## Section 5: Feature Importance Analysis

Identify which features contribute most to predictions.

In [8]:
# Feature importance for first metric
metric = target_metrics[0]
model = lightgbm_models[metric]
feature_cols = lightgbm_feature_cols[metric]

# Get feature importance
importance_df = pd.DataFrame({
    'feature': feature_cols,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\nTop 10 Features for {metric}:")
print("="*50)
print(importance_df.head(10).to_string(index=False))

# Plot feature importance
fig = go.Figure([
    go.Bar(
        y=importance_df.head(15)['feature'],
        x=importance_df.head(15)['importance'],
        orientation='h'
    )
])

fig.update_layout(
    title=f"Top 15 Feature Importances - {metric}",
    xaxis_title="Importance",
    yaxis_title="Feature",
    height=600,
    yaxis={'categoryorder': 'total ascending'}
)

fig.show()

# Save
results_dir = Path('../results')
results_dir.mkdir(exist_ok=True)
fig.write_html(results_dir / f'lightgbm_feature_importance_{metric}.html')
print(f"\n✓ Saved: results/lightgbm_feature_importance_{metric}.html")


Top 10 Features for total_orders:
          feature  importance
revenue_per_order           0
            lag_6           0
    growth_rate_3           0
    growth_rate_1           0
   rolling_mean_6           0
    rolling_std_3           0
   rolling_mean_3           0
           lag_12           0
            lag_3           0
            month           0



✓ Saved: results/lightgbm_feature_importance_total_orders.html


## Section 6: Generate Forecasts

Generate predictions for validation and future periods.

In [9]:
def generate_lightgbm_forecasts(model, df_feat, feature_cols, target_col, val_start, val_end, horizon=18):
    """
    Generate LightGBM forecasts.
    
    Note: For true future forecasts, we'd need to implement recursive forecasting
    (using predictions as lag features). For now, we'll forecast validation period
    where actual lag values are available.
    """
    # Validation period
    val_df = df_feat[(df_feat['date'] >= val_start) & (df_feat['date'] <= val_end)].copy()
    
    if len(val_df) == 0:
        print(f"  ⚠️  No validation data available for {target_col}")
        return np.array([]), np.array([])
    
    # Get features (handle any NaN)
    X_val = val_df[feature_cols]
    
    # Check for NaN in features
    if X_val.isna().any().any():
        print(f"  ⚠️  Warning: NaN values in validation features, filling with mean")
        X_val = X_val.fillna(X_val.mean())
    
    # Predict
    predictions = model.predict(X_val)
    
    return predictions, val_df['date'].values

# Generate forecasts
lightgbm_forecasts = {}
lightgbm_forecast_dates = {}

for metric in target_metrics:
    model = lightgbm_models[metric]
    df_feat = lightgbm_dataframes[metric]
    feature_cols = lightgbm_feature_cols[metric]
    
    predictions, dates = generate_lightgbm_forecasts(
        model, df_feat, feature_cols, metric, val_start, val_end
    )
    
    lightgbm_forecasts[metric] = predictions
    lightgbm_forecast_dates[metric] = dates
    
    print(f"\n{metric}:")
    print(f"  Validation forecast: {len(predictions)} months")


total_orders:
  Validation forecast: 6 months

total_km:
  Validation forecast: 6 months

total_drivers:
  Validation forecast: 6 months

revenue_total:
  Validation forecast: 6 months

external_drivers:
  Validation forecast: 6 months


## Section 7: Model Evaluation

In [10]:
def calculate_metrics(y_true, y_pred, model_name, metric_name):
    """Calculate forecast accuracy metrics."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mape = mean_absolute_percentage_error(y_true, y_pred) * 100
    
    return {
        'model': model_name,
        'metric': metric_name,
        'MAE': mae,
        'RMSE': rmse,
        'MAPE': mape
    }

# Calculate metrics
results = []

val_df = df[(df['date'] >= val_start) & (df['date'] <= val_end)]

for metric in target_metrics:
    if len(lightgbm_forecasts[metric]) > 0:
        y_true = val_df[metric].values
        y_pred = lightgbm_forecasts[metric]
        
        metrics = calculate_metrics(y_true, y_pred, 'LightGBM', metric)
        results.append(metrics)

results_df = pd.DataFrame(results)

print("\nLightGBM Model Performance (Validation Period):")
print("="*80)
print(results_df.to_string(index=False))

# Compare with previous models
try:
    baseline_df = pd.read_csv('../data/processed/baseline_metrics.csv')
    prophet_df = pd.read_csv('../data/processed/prophet_metrics.csv')
    sarimax_df = pd.read_csv('../data/processed/sarimax_metrics.csv')
    
    print("\n" + "="*80)
    print("Model Comparison (MAPE):")
    print("="*80)
    
    for metric in target_metrics:
        print(f"\n{metric}:")
        
        if len(results_df[results_df['metric'] == metric]) > 0:
            lightgbm_mape = results_df[results_df['metric'] == metric]['MAPE'].values[0]
            prophet_mape = prophet_df[prophet_df['metric'] == metric]['MAPE'].values[0]
            sarimax_mape = sarimax_df[sarimax_df['metric'] == metric]['MAPE'].values[0]
            baseline_best_mape = baseline_df[baseline_df['metric'] == metric]['MAPE'].min()
            
            print(f"  LightGBM: {lightgbm_mape:.2f}%")
            print(f"  SARIMAX: {sarimax_mape:.2f}%")
            print(f"  Prophet: {prophet_mape:.2f}%")
            print(f"  Best Baseline: {baseline_best_mape:.2f}%")
            
            best_model = min([
                ('LightGBM', lightgbm_mape),
                ('SARIMAX', sarimax_mape),
                ('Prophet', prophet_mape)
            ], key=lambda x: x[1])
            
            print(f"  → Best: {best_model[0]} ({best_model[1]:.2f}%)")
except Exception as e:
    print(f"\n⚠️  Could not load previous model metrics: {e}")


LightGBM Model Performance (Validation Period):
   model           metric           MAE          RMSE      MAPE
LightGBM     total_orders   5785.203704   6901.918048  4.040564
LightGBM         total_km 316777.259259 365312.358939  3.590569
LightGBM    total_drivers   5595.944444   6682.866156  3.977664
LightGBM    revenue_total 769325.321879 882566.627281  5.785576
LightGBM external_drivers   3358.444444   3682.275143 12.491852

Model Comparison (MAPE):

total_orders:
  LightGBM: 4.04%
  SARIMAX: 24.38%
  Prophet: 39.55%
  Best Baseline: 2.95%
  → Best: LightGBM (4.04%)

total_km:

⚠️  Could not load previous model metrics: index 0 is out of bounds for axis 0 with size 0


## Section 8: Forecast Visualization

In [11]:
# Visualize forecasts
train_df = df[df['date'] <= train_end]
val_df = df[(df['date'] >= val_start) & (df['date'] <= val_end)]

for metric in target_metrics:
    if len(lightgbm_forecasts[metric]) == 0:
        continue
        
    fig = go.Figure()
    
    # Historical training data
    fig.add_trace(
        go.Scatter(
            x=train_df['date'],
            y=train_df[metric],
            mode='lines+markers',
            name='Historical (Training)',
            line=dict(color='black', width=2)
        )
    )
    
    # Actual validation values
    if len(val_df) > 0:
        fig.add_trace(
            go.Scatter(
                x=val_df['date'],
                y=val_df[metric],
                mode='lines+markers',
                name='Actual (Validation)',
                line=dict(color='green', width=3)
            )
        )
    
    # LightGBM forecast
    fig.add_trace(
        go.Scatter(
            x=lightgbm_forecast_dates[metric],
            y=lightgbm_forecasts[metric],
            mode='lines+markers',
            name='LightGBM Forecast',
            line=dict(color='purple', width=2, dash='dash')
        )
    )
    
    fig.update_layout(
        title=f"LightGBM Forecast - {metric.replace('_', ' ').title()}",
        xaxis_title="Date",
        yaxis_title=metric.replace('_', ' ').title(),
        height=600,
        hovermode='x unified'
    )
    
    fig.show()
    
    # Save
    fig.write_html(results_dir / f'lightgbm_forecast_{metric}.html')
    print(f"\n✓ Saved: results/lightgbm_forecast_{metric}.html")


✓ Saved: results/lightgbm_forecast_total_orders.html



✓ Saved: results/lightgbm_forecast_total_km.html



✓ Saved: results/lightgbm_forecast_total_drivers.html



✓ Saved: results/lightgbm_forecast_revenue_total.html



✓ Saved: results/lightgbm_forecast_external_drivers.html


## Section 9: Save Results

In [12]:
# Save performance metrics
output_dir = Path('../data/processed')
results_df.to_csv(output_dir / 'lightgbm_metrics.csv', index=False)
print(f"✓ Saved metrics: data/processed/lightgbm_metrics.csv")

# Save validation forecasts
if len(val_df) > 0 and len(results_df) > 0:
    forecast_output = pd.DataFrame({
        'date': val_df['date'],
        'year_month': val_df['year_month'].astype(str) if 'year_month' in val_df.columns else val_df['date'].dt.to_period('M').astype(str)
    })
    
    for metric in target_metrics:
        if len(lightgbm_forecasts[metric]) > 0:
            forecast_output[metric] = lightgbm_forecasts[metric]
    
    forecast_output.to_csv(output_dir / 'lightgbm_forecast_validation.csv', index=False)
    print(f"✓ Saved validation forecasts: data/processed/lightgbm_forecast_validation.csv")

print(f"\n{'='*80}")
print(f"CATBOOST MODEL COMPLETE!")
print(f"{'='*80}")
print(f"\nKey Findings:")
for metric in target_metrics:
    if len(results_df[results_df['metric'] == metric]) > 0:
        mape = results_df[results_df['metric'] == metric]['MAPE'].values[0]
        print(f"  • {metric}: MAPE = {mape:.2f}%")
print(f"\nNext: Run notebook 12b for LightGBM model")

✓ Saved metrics: data/processed/lightgbm_metrics.csv
✓ Saved validation forecasts: data/processed/lightgbm_forecast_validation.csv

CATBOOST MODEL COMPLETE!

Key Findings:
  • total_orders: MAPE = 4.04%
  • total_km: MAPE = 3.59%
  • total_drivers: MAPE = 3.98%
  • revenue_total: MAPE = 5.79%
  • external_drivers: MAPE = 12.49%

Next: Run notebook 12b for LightGBM model


In [13]:
def recursive_forecast_2025(model, df_full, target_col, feature_cols, num_months=12):
    """
    Generate recursive multi-step ahead forecasts for 2025.
    
    Parameters:
    -----------
    model : XGBRegressor
        Trained LightGBM model
    df_full : pd.DataFrame
        Full historical dataframe (2022-2024) with target column
    target_col : str
        Name of target metric
    feature_cols : list
        List of feature column names
    num_months : int
        Number of months to forecast (default 12 for full year 2025)
    
    Returns:
    --------
    pd.DataFrame
        Dataframe with date and forecast values
    """
    # Create extended dataframe with 2025 months
    last_date = df_full['date'].max()
    future_dates = pd.date_range(start=last_date + pd.DateOffset(months=1), periods=num_months, freq='MS')
    
    # Combine historical and future dates
    future_df = pd.DataFrame({'date': future_dates})
    extended_df = pd.concat([df_full[['date', target_col]], future_df], ignore_index=True)
    
    # Generate temporal features for all dates
    extended_df['year'] = extended_df['date'].dt.year
    extended_df['month'] = extended_df['date'].dt.month
    extended_df['quarter'] = extended_df['date'].dt.quarter
    extended_df['week'] = extended_df['date'].dt.isocalendar().week
    extended_df['day_of_year'] = extended_df['date'].dt.dayofyear
    extended_df['weekday'] = extended_df['date'].dt.weekday
    
    # Start recursive forecasting from first future month
    first_future_idx = len(df_full)
    
    for i in range(num_months):
        current_idx = first_future_idx + i
        
        # Calculate lag features using either actual or predicted values
        if current_idx >= 1:
            extended_df.loc[current_idx, 'lag_1'] = extended_df.loc[current_idx - 1, target_col]
        if current_idx >= 3:
            extended_df.loc[current_idx, 'lag_3'] = extended_df.loc[current_idx - 3, target_col]
        if current_idx >= 6:
            extended_df.loc[current_idx, 'lag_6'] = extended_df.loc[current_idx - 6, target_col]
        if current_idx >= 12:
            extended_df.loc[current_idx, 'lag_12'] = extended_df.loc[current_idx - 12, target_col]
        
        # Calculate rolling features (need at least 3/6 previous values)
        if current_idx >= 3:
            extended_df.loc[current_idx, 'rolling_mean_3'] = extended_df.loc[current_idx-3:current_idx-1, target_col].mean()
            extended_df.loc[current_idx, 'rolling_std_3'] = extended_df.loc[current_idx-3:current_idx-1, target_col].std()
        
        if current_idx >= 6:
            extended_df.loc[current_idx, 'rolling_mean_6'] = extended_df.loc[current_idx-6:current_idx-1, target_col].mean()
        
        # Calculate growth rate features
        if current_idx >= 1 and not pd.isna(extended_df.loc[current_idx - 1, target_col]):
            if extended_df.loc[current_idx - 1, target_col] != 0:
                prev_val = extended_df.loc[current_idx - 1, target_col]
                if current_idx >= 2:
                    extended_df.loc[current_idx, 'growth_rate_1'] = (prev_val - extended_df.loc[current_idx - 2, target_col]) / extended_df.loc[current_idx - 2, target_col]
        
        if current_idx >= 3:
            extended_df.loc[current_idx, 'growth_rate_3'] = extended_df.loc[current_idx-3:current_idx, target_col].pct_change(3).iloc[-1]
        
        if current_idx >= 12:
            extended_df.loc[current_idx, 'growth_rate_12'] = extended_df.loc[current_idx-12:current_idx, target_col].pct_change(12).iloc[-1]
        
        # Extract features for prediction - EXACTLY match training format
        # Create dataframe with ALL feature columns in exact training order
        X_pred = pd.DataFrame(index=[current_idx], columns=feature_cols)
        
        # Fill available features from extended_df
        for col in feature_cols:
            if col in extended_df.columns:
                X_pred.loc[current_idx, col] = extended_df.loc[current_idx, col]
            else:
                # Fill missing features with 0 (for derived features not in extended_df)
                X_pred.loc[current_idx, col] = 0
        
        # Fill any remaining NaN with 0
        X_pred = X_pred.fillna(0)
        
        # Make prediction
        prediction = model.predict(X_pred)[0]
        
        # Store prediction
        extended_df.loc[current_idx, target_col] = prediction
    
    # Return only 2025 forecasts
    forecast_df = extended_df[extended_df['date'] >= '2025-01-01'][['date', target_col]].copy()
    
    return forecast_df

print("✓ Recursive forecasting function defined")

✓ Recursive forecasting function defined


In [14]:
# Generate 2025 forecasts for all metrics
print("="*80)
print("GENERATING 2025 RECURSIVE FORECASTS")
print("="*80)

lightgbm_forecasts_2025 = {}

for metric in target_metrics:
    print(f"\nForecasting {metric} for 2025...")
    
    model = lightgbm_models[metric]
    feature_cols = lightgbm_feature_cols[metric]
    
    # Generate recursive forecast
    forecast_df = recursive_forecast_2025(model, df, metric, feature_cols, num_months=12)
    
    lightgbm_forecasts_2025[metric] = forecast_df
    
    print(f"  ✓ Generated 12 monthly forecasts")
    print(f"  Range: {forecast_df[metric].min():.0f} - {forecast_df[metric].max():.0f}")
    print(f"  Variation: {((forecast_df[metric].max() / forecast_df[metric].min() - 1) * 100):.1f}%")

print(f"\n{'='*80}")
print(f"✓ All 2025 forecasts generated successfully!")
print(f"{'='*80}")

GENERATING 2025 RECURSIVE FORECASTS

Forecasting total_orders for 2025...
  ✓ Generated 12 monthly forecasts
  Range: 135702 - 135702
  Variation: 0.0%

Forecasting total_km for 2025...
  ✓ Generated 12 monthly forecasts
  Range: 8571801 - 8571801
  Variation: 0.0%

Forecasting total_drivers for 2025...
  ✓ Generated 12 monthly forecasts
  Range: 133385 - 133385
  Variation: 0.0%

Forecasting revenue_total for 2025...
  ✓ Generated 12 monthly forecasts
  Range: 12957054 - 12957054
  Variation: 0.0%

Forecasting external_drivers for 2025...
  ✓ Generated 12 monthly forecasts
  Range: 31020 - 31020
  Variation: 0.0%

✓ All 2025 forecasts generated successfully!


## Section 10: Recursive Forecasting for 2025

Generate month-by-month forecasts for 2025 using recursive multi-step ahead forecasting.

**Approach:**
1. Start with actual historical values (Jan-Dec 2024)
2. Predict Jan 2025 using Dec 2024 as lag_1, Oct 2024 as lag_3, etc.
3. Predict Feb 2025 using Jan 2025 prediction as lag_1, Nov 2024 as lag_3, etc.
4. Continue recursively through all 12 months of 2025

**Key Benefit:** Captures monthly seasonality through temporal features (month=1, month=2, etc.)

## Section 11: Save 2025 Forecasts

Save the recursive 2025 forecasts to CSV for ensemble modeling and validation.

In [None]:
# Save 2025 forecasts
print("Saving 2025 forecasts...")
print("="*80)

# Combine all metric forecasts into single dataframe
forecast_2025_combined = pd.DataFrame()

for i, metric in enumerate(target_metrics):
    forecast_df = lightgbm_forecasts_2025[metric]
    
    if i == 0:
        # First metric: include date column
        forecast_2025_combined = forecast_df.copy()
    else:
        # Subsequent metrics: only add the metric column
        forecast_2025_combined[metric] = forecast_df[metric].values

# Save to CSV
output_path = output_dir / 'lightgbm_forecast_2025.csv'
forecast_2025_combined.to_csv(output_path, index=False)

print(f"✓ Saved 2025 forecasts: {output_path}")
print(f"  Shape: {forecast_2025_combined.shape}")
print(f"  Columns: {list(forecast_2025_combined.columns)}")

# Show preview
print(f"\nPreview of 2025 forecasts:")
print(forecast_2025_combined.head(3))

print(f"\n{'='*80}")
print(f"✓ LIGHTGBM 2025 FORECASTS SAVED!")
print(f"{'='*80}")
print(f"\nNext steps:")
print(f"  1. Create Notebook 14b (Ensemble) to combine model forecasts")
print(f"  2. Update Notebook 18 (Validation) to test all models vs 2025 actuals")