# 02c LightGBM Hyperparameter Optimization with Optuna

Optimize LightGBM hyperparameters using Optuna with CRPS minimization on validation set.

**Core Configuration:**
- Horizons: 7, 28 days ahead (same as nb/02)
- Origins: Bi-weekly from 2024-07-08 to 2025-05-26
- Minimum training window: 730 days (2 years)
- Optimization: 25 Optuna trials with TPESampler(seed=42)
- Objective: Minimize CRPS on validation set (July-Sep 2024)

**Output:** `results/forecasts/lightgbm_optuna.parquet` with model='LightGBM_Optimized'

## 0. Setup

### Imports

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Model and optimization imports
import lightgbm as lgb
import optuna
from optuna.visualization import plot_optimization_history, plot_param_importances
import joblib
import json

In [3]:
# Set random seeds for reproducibility
np.random.seed(42)

In [4]:
# Define paths
project_root = Path.cwd().parent
data_dir = project_root / 'data'
results_dir = project_root / 'results' / 'forecasts'
optimization_dir = project_root / 'results' / 'optimization'
figures_dir = project_root / 'results' / 'figures'

# Create directories if they don't exist
results_dir.mkdir(parents=True, exist_ok=True)
optimization_dir.mkdir(parents=True, exist_ok=True)
figures_dir.mkdir(parents=True, exist_ok=True)

### Load Data

In [5]:
# Load cleaned time series
ts = pd.read_pickle(data_dir / 'flu_daily_clean.pkl')
print(f"Loaded data: {ts.shape[0]} observations")
print(f"Date range: {ts.index.min()} to {ts.index.max()}")
print(f"Frequency: {ts.index.freq}")

Loaded data: 1078 observations
Date range: 2022-07-04 00:00:00 to 2025-06-15 00:00:00
Frequency: <Day>


### Configure Rolling Windows

In [6]:
# Forecast configuration (same as nb/02)
HORIZONS = [7, 28]  # Short-term (weekly) vs long-term (monthly) forecasts
ORIGINS = pd.date_range('2024-07-08', '2025-05-26', freq='2W-MON')  # Bi-weekly origins
MIN_TRAIN = 730  # 2 years minimum training data

print(f"Forecast horizons: {HORIZONS}")
print(f"Number of forecast origins: {len(ORIGINS)} (bi-weekly)")
print(f"Minimum training days: {MIN_TRAIN}")
print(f"Total forecasts: {len(ORIGINS) * len(HORIZONS)}")

Forecast horizons: [7, 28]
Number of forecast origins: 24 (bi-weekly)
Minimum training days: 730
Total forecasts: 48


## 1. Helper Functions

In [7]:
def build_fourier_terms(dates, period=365, order=2):
    """
    Build Fourier terms for seasonality from date index.

    Parameters
    ----------
    dates : pd.DatetimeIndex
        Date index to compute Fourier terms for
    period : int
        Seasonal period (365 for annual cycle)
    order : int
        Number of sine/cosine pairs (order=2 gives 4 terms)

    Returns
    -------
    pd.DataFrame
        Fourier terms with columns sin1, cos1, sin2, cos2, ...
    """
    fourier = pd.DataFrame(index=dates)
    for k in range(1, order + 1):
        fourier[f'sin{k}'] = np.sin(2 * np.pi * k * np.arange(len(dates)) / period)
        fourier[f'cos{k}'] = np.cos(2 * np.pi * k * np.arange(len(dates)) / period)
    return fourier

In [8]:
def build_lag_features(series, lags=[1, 2, 3, 7, 14]):
    """
    Build lag features from a time series.

    Parameters
    ----------
    series : pd.Series
        Time series to create lags from
    lags : list of int
        Lag values to create

    Returns
    -------
    pd.DataFrame
        Dataframe with lag columns
    """
    lag_df = pd.DataFrame(index=series.index)
    for lag in lags:
        lag_df[f'lag_{lag}'] = series.shift(lag)
    return lag_df

In [9]:
def compute_crps(y_true, q10, q50, q90):
    """
    Compute approximate CRPS using three quantiles.

    CRPS approximation based on quantile coverage:
    CRPS ≈ 0.1 * |y - q10| + 0.8 * |y - q50| + 0.1 * |y - q90|

    Parameters
    ----------
    y_true : float or array-like
        True values
    q10 : float or array-like
        0.1 quantile predictions
    q50 : float or array-like
        0.5 quantile predictions (median)
    q90 : float or array-like
        0.9 quantile predictions

    Returns
    -------
    float
        Mean CRPS across all observations
    """
    y_true = np.asarray(y_true)
    q10 = np.asarray(q10)
    q50 = np.asarray(q50)
    q90 = np.asarray(q90)

    crps = 0.1 * np.abs(y_true - q10) + 0.8 * np.abs(y_true - q50) + 0.1 * np.abs(y_true - q90)
    return np.mean(crps)

## 2. Validation Set Preparation

In [10]:
# Define train/validation split
TRAIN_END = pd.Timestamp('2024-06-30')
VAL_START = pd.Timestamp('2024-07-01')
VAL_END = pd.Timestamp('2024-09-30')

# Split data
train_opt = ts[ts.index < TRAIN_END]
val_opt = ts[(ts.index >= VAL_START) & (ts.index <= VAL_END)]

print(f"Training set: {len(train_opt)} days ({train_opt.index.min()} to {train_opt.index.max()})")
print(f"Validation set: {len(val_opt)} days ({val_opt.index.min()} to {val_opt.index.max()})")

Training set: 727 days (2022-07-04 00:00:00 to 2024-06-29 00:00:00)
Validation set: 92 days (2024-07-01 00:00:00 to 2024-09-30 00:00:00)


## 3. Optuna Objective Function

In [11]:
def lightgbm_forecast_single_quantile(train_series, horizon, quantile, params, lags=[1, 2, 3, 7, 14], fourier_order=2, period=365):
    """
    Forecast with LightGBM for a single quantile.

    Parameters
    ----------
    train_series : pd.Series
        Training data (datetime-indexed)
    horizon : int
        Number of steps ahead to forecast
    quantile : float
        Target quantile (e.g., 0.1, 0.5, 0.9)
    params : dict
        LightGBM hyperparameters
    lags : list of int
        Lag features to create
    fourier_order : int
        Number of Fourier term pairs
    period : int
        Seasonal period for Fourier terms

    Returns
    -------
    float
        Forecast value at target horizon
    """
    # Build lag features
    lag_df = build_lag_features(train_series, lags=lags)

    # Build Fourier features
    fourier_df = build_fourier_terms(train_series.index, period=period, order=fourier_order)

    # Combine features
    X_train = pd.concat([lag_df, fourier_df], axis=1).dropna()
    y_train = train_series.loc[X_train.index]

    # Train model with specified hyperparameters
    model_params = params.copy()
    model_params['objective'] = 'quantile'
    model_params['alpha'] = quantile
    model_params['random_state'] = 42
    model_params['verbose'] = -1

    model = lgb.LGBMRegressor(**model_params)
    model.fit(X_train, y_train)

    # Multi-step forecast (iterative)
    current_series = train_series.copy()

    for step in range(horizon):
        # Build features for next step
        lag_feats = build_lag_features(current_series, lags=lags).iloc[-1:]

        # Fourier terms for next date
        next_date = current_series.index[-1] + pd.Timedelta(days=1)
        fourier_feats = build_fourier_terms(pd.DatetimeIndex([next_date]), period=period, order=fourier_order)

        # Combine and predict
        X_next = pd.concat([lag_feats, fourier_feats], axis=1)
        pred = model.predict(X_next)[0]

        # Append prediction to series for next iteration
        current_series = pd.concat([
            current_series,
            pd.Series([pred], index=[next_date])
        ])

    return float(pred)

In [12]:
def objective(trial):
    """
    Optuna objective function: minimize CRPS on validation set.

    Optimizes for median (q=0.5) and evaluates CRPS using all three quantiles.
    """
    # Suggest hyperparameters
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
    }

    # Generate forecasts on validation set for both horizons
    y_true = []
    y_pred_q10 = []
    y_pred_q50 = []
    y_pred_q90 = []

    # Evaluate on a subset of validation dates (for speed)
    val_origins = pd.date_range('2024-07-08', '2024-09-23', freq='W-MON')  # Weekly samples

    for origin in val_origins:
        # Training data up to origin
        train = ts[ts.index < origin]

        if len(train) < MIN_TRAIN:
            continue

        for horizon in HORIZONS:
            target_date = origin + pd.Timedelta(days=horizon - 1)

            # Skip if target is beyond validation period
            if target_date not in val_opt.index:
                continue

            actual = val_opt.loc[target_date]

            # Generate predictions for all three quantiles
            try:
                pred_q10 = lightgbm_forecast_single_quantile(train, horizon, 0.1, params)
                pred_q50 = lightgbm_forecast_single_quantile(train, horizon, 0.5, params)
                pred_q90 = lightgbm_forecast_single_quantile(train, horizon, 0.9, params)

                # Enforce quantile monotonicity
                pred_q10 = min(pred_q10, pred_q50)
                pred_q90 = max(pred_q50, pred_q90)

                y_true.append(actual)
                y_pred_q10.append(pred_q10)
                y_pred_q50.append(pred_q50)
                y_pred_q90.append(pred_q90)
            except Exception as e:
                # Skip failed forecasts
                continue

    # Compute CRPS
    if len(y_true) == 0:
        return float('inf')  # Penalize failed trials

    crps_score = compute_crps(y_true, y_pred_q10, y_pred_q50, y_pred_q90)

    return crps_score

## 4. Run Optuna Optimization

In [13]:
# Create Optuna study
print("Starting Optuna hyperparameter optimization...")
print(f"Trials: 25")
print(f"Objective: Minimize CRPS on validation set (July-Sep 2024)")
print("\nThis will take approximately 15-20 minutes...\n")

study = optuna.create_study(
    direction='minimize',
    sampler=optuna.samplers.TPESampler(seed=42),
    study_name='lightgbm_crps_optimization'
)

# Run optimization
study.optimize(objective, n_trials=25, show_progress_bar=True)

print("\n✅ Optimization complete!")
print(f"\nBest CRPS: {study.best_value:.4f}")
print(f"\nBest hyperparameters:")
for param, value in study.best_params.items():
    print(f"  {param}: {value}")

[I 2025-10-16 09:31:37,928] A new study created in memory with name: lightgbm_crps_optimization


Starting Optuna hyperparameter optimization...
Trials: 25
Objective: Minimize CRPS on validation set (July-Sep 2024)

This will take approximately 15-20 minutes...



  0%|          | 0/25 [00:00<?, ?it/s]

[I 2025-10-16 09:31:44,139] Trial 0 finished with value: 0.3069851473117954 and parameters: {'n_estimators': 218, 'max_depth': 10, 'learning_rate': 0.08960785365368121}. Best is trial 0 with value: 0.3069851473117954.
[I 2025-10-16 09:31:48,950] Trial 1 finished with value: 0.4747679233433221 and parameters: {'n_estimators': 319, 'max_depth': 4, 'learning_rate': 0.015957084694148364}. Best is trial 0 with value: 0.3069851473117954.
[I 2025-10-16 09:31:52,602] Trial 2 finished with value: 0.5004897230605911 and parameters: {'n_estimators': 76, 'max_depth': 9, 'learning_rate': 0.06054365855469246}. Best is trial 0 with value: 0.3069851473117954.
[I 2025-10-16 09:31:57,233] Trial 3 finished with value: 0.2854345567706134 and parameters: {'n_estimators': 369, 'max_depth': 3, 'learning_rate': 0.18276027831785724}. Best is trial 3 with value: 0.2854345567706134.
[I 2025-10-16 09:32:02,423] Trial 4 finished with value: 0.35550493841365244 and parameters: {'n_estimators': 425, 'max_depth': 4, 

## 5. Save Optimization Artifacts

In [14]:
# Save Optuna study
study_path = optimization_dir / 'optuna_study.pkl'
joblib.dump(study, study_path)
print(f"Saved study: {study_path}")

Saved study: /home/mikhailarutyunov/projects/time-series-flu/results/optimization/optuna_study.pkl


In [15]:
# Save best hyperparameters
params_path = optimization_dir / 'lightgbm_best_params.json'
with open(params_path, 'w') as f:
    json.dump(study.best_params, f, indent=2)
print(f"Saved best params: {params_path}")

Saved best params: /home/mikhailarutyunov/projects/time-series-flu/results/optimization/lightgbm_best_params.json


In [16]:
# Save optimization history
history_df = study.trials_dataframe()
history_path = optimization_dir / 'optuna_history.csv'
history_df.to_csv(history_path, index=False)
print(f"Saved history: {history_path}")
print(f"\nOptimization history preview:")
print(history_df[['number', 'value', 'params_n_estimators', 'params_max_depth', 'params_learning_rate']].head())

Saved history: /home/mikhailarutyunov/projects/time-series-flu/results/optimization/optuna_history.csv

Optimization history preview:
   number     value  params_n_estimators  params_max_depth  \
0       0  0.306985                  218                10   
1       1  0.474768                  319                 4   
2       2  0.500490                   76                 9   
3       3  0.285435                  369                 3   
4       4  0.355505                  425                 4   

   params_learning_rate  
0              0.089608  
1              0.015957  
2              0.060544  
3              0.182760  
4              0.017241  


## 6. Optuna Visualizations

In [17]:
# Optimization history plot
fig_history = plot_optimization_history(study)
fig_history.update_layout(
    title="LightGBM Hyperparameter Optimization History",
    xaxis_title="Trial",
    yaxis_title="CRPS (Validation)"
)

# Try to save image (requires kaleido package)
try:
    fig_history.write_image(figures_dir / 'optuna_history.png', width=800, height=500)
    print(f"Saved: {figures_dir / 'optuna_history.png'}")
except Exception as e:
    print(f"⚠️  Could not save PNG image: {e}")
    print("   Install kaleido for image export: uv pip install kaleido")

fig_history.show()

⚠️  Could not save PNG image: 

Kaleido requires Google Chrome to be installed.

Either download and install Chrome yourself following Google's instructions for your operating system,
or install it from your terminal by running:

    $ plotly_get_chrome


   Install kaleido for image export: uv pip install kaleido


In [18]:
# Parameter importance plot
fig_importance = plot_param_importances(study)
fig_importance.update_layout(
    title="Hyperparameter Importance (CRPS Minimization)"
)

# Try to save image (requires kaleido package)
try:
    fig_importance.write_image(figures_dir / 'optuna_importance.png', width=800, height=500)
    print(f"Saved: {figures_dir / 'optuna_importance.png'}")
except Exception as e:
    print(f"⚠️  Could not save PNG image: {e}")
    print("   Install kaleido for image export: uv pip install kaleido")

fig_importance.show()

⚠️  Could not save PNG image: 

Kaleido requires Google Chrome to be installed.

Either download and install Chrome yourself following Google's instructions for your operating system,
or install it from your terminal by running:

    $ plotly_get_chrome


   Install kaleido for image export: uv pip install kaleido


## 7. Rolling Forecasts with Optimized Hyperparameters

In [19]:
def forecast_lightgbm_optimized(train_series, horizon, best_params, lags=[1, 2, 3, 7, 14], fourier_order=2, period=365):
    """
    Forecast with optimized LightGBM hyperparameters.

    Parameters
    ----------
    train_series : pd.Series
        Training data (datetime-indexed)
    horizon : int
        Number of steps ahead to forecast
    best_params : dict
        Optimized hyperparameters from Optuna
    lags : list of int
        Lag features to create
    fourier_order : int
        Number of Fourier term pairs
    period : int
        Seasonal period for Fourier terms

    Returns
    -------
    dict : {'q0.1': float, 'q0.5': float, 'q0.9': float}
        Forecast quantiles
    """
    quantiles = [0.1, 0.5, 0.9]
    predictions = {}

    for q in quantiles:
        pred = lightgbm_forecast_single_quantile(
            train_series=train_series,
            horizon=horizon,
            quantile=q,
            params=best_params,
            lags=lags,
            fourier_order=fourier_order,
            period=period
        )
        predictions[f'q{q}'] = pred

    # Enforce quantile monotonicity
    predictions['q0.1'] = min(predictions['q0.1'], predictions['q0.5'])
    predictions['q0.9'] = max(predictions['q0.5'], predictions['q0.9'])

    return predictions

In [20]:
def run_rolling_forecasts_optimized(ts, origins, horizons, min_train, best_params):
    """
    Run rolling forecasts with optimized LightGBM hyperparameters.

    Parameters
    ----------
    ts : pd.Series
        Full time series
    origins : pd.DatetimeIndex
        Forecast origin dates
    horizons : list of int
        Forecast horizons
    min_train : int
        Minimum training window size
    best_params : dict
        Optimized hyperparameters

    Returns
    -------
    pd.DataFrame
        Forecast results
    """
    results = []

    # Progress bar
    total_iter = len(origins) * len(horizons)
    pbar = tqdm(total=total_iter, desc="Rolling forecasts (optimized)")

    for origin in origins:
        # Get training data (all data before origin)
        train = ts[ts.index < origin]

        # Skip if insufficient training data
        if len(train) < min_train:
            continue

        for horizon in horizons:
            # Target forecast date
            target_date = origin + pd.Timedelta(days=horizon - 1)

            # Skip if target date is beyond available data
            if target_date not in ts.index:
                pbar.update(1)
                continue

            # Get actual value
            actual = ts.loc[target_date]

            # Generate forecast
            try:
                pred = forecast_lightgbm_optimized(train, horizon, best_params)
                results.append({
                    'date': target_date,
                    'origin': origin,
                    'horizon': horizon,
                    'model': 'LightGBM_Optimized',
                    'q0.1': pred['q0.1'],
                    'q0.5': pred['q0.5'],
                    'q0.9': pred['q0.9'],
                    'actual': actual
                })
            except Exception as e:
                # Skip failed forecasts
                pass

            pbar.update(1)

    pbar.close()

    return pd.DataFrame(results)

### Execute Rolling Forecasts

**Warning:** This cell will take several minutes to complete.

In [21]:
print("Starting rolling forecast generation with optimized hyperparameters...")
print(f"Total iterations: {len(ORIGINS) * len(HORIZONS)}")
print("\nThis will take approximately 5-10 minutes...\n")

forecast_df = run_rolling_forecasts_optimized(
    ts=ts,
    origins=ORIGINS,
    horizons=HORIZONS,
    min_train=MIN_TRAIN,
    best_params=study.best_params
)

print("\n✅ Rolling forecasts complete!")
print(f"Total forecasts: {len(forecast_df)}")

Starting rolling forecast generation with optimized hyperparameters...
Total iterations: 48

This will take approximately 5-10 minutes...



Rolling forecasts (optimized): 100%|██████████| 48/48 [00:19<00:00,  2.50it/s]


✅ Rolling forecasts complete!
Total forecasts: 47





## 8. Save Forecast Results

In [22]:
# Save forecasts as parquet
output_path = results_dir / 'lightgbm_optuna.parquet'
forecast_df.to_parquet(output_path, index=False)
print(f"Saved forecasts: {len(forecast_df)} rows → {output_path}")

Saved forecasts: 47 rows → /home/mikhailarutyunov/projects/time-series-flu/results/forecasts/lightgbm_optuna.parquet


## 9. Summary Statistics

In [23]:
# Display summary of forecast counts by horizon
summary = forecast_df.groupby('horizon').size().reset_index(name='count')
print("\n" + "=" * 60)
print("FORECAST COUNT SUMMARY")
print("=" * 60)
print(summary.to_string(index=False))
print(f"\nTotal forecasts: {len(forecast_df)}")


FORECAST COUNT SUMMARY
 horizon  count
       7     24
      28     23

Total forecasts: 47


## 10. Quality Checks

In [24]:
# Check for NaN values
nan_count = forecast_df[['q0.1', 'q0.5', 'q0.9']].isna().sum().sum()
print(f"NaN values in forecasts: {nan_count}")
assert nan_count == 0, "ERROR: Found NaN values in forecasts!"
print("✅ No NaN values")

NaN values in forecasts: 0
✅ No NaN values


In [25]:
# Check quantile monotonicity
violations = (
    (forecast_df['q0.1'] > forecast_df['q0.5']) |
    (forecast_df['q0.5'] > forecast_df['q0.9'])
).sum()
print(f"Quantile monotonicity violations: {violations}")
assert violations == 0, "ERROR: Quantile monotonicity violated!"
print("✅ Quantile monotonicity preserved (q0.1 ≤ q0.5 ≤ q0.9)")

Quantile monotonicity violations: 0
✅ Quantile monotonicity preserved (q0.1 ≤ q0.5 ≤ q0.9)


In [26]:
# Check forecast count
expected_forecasts = 47  # Same as baseline (24 origins * 2 horizons - 1 missing)
actual_forecasts = len(forecast_df)
print(f"Expected forecasts: ~{expected_forecasts}")
print(f"Actual forecasts: {actual_forecasts}")
assert actual_forecasts >= expected_forecasts - 2, "ERROR: Too few forecasts generated!"
print("✅ Forecast count matches expected")

Expected forecasts: ~47
Actual forecasts: 47
✅ Forecast count matches expected


In [27]:
# Compare hyperparameters to baseline
baseline_params = {'n_estimators': 300, 'max_depth': 5, 'learning_rate': 0.05}
optimized_params = study.best_params

print("\n" + "=" * 60)
print("HYPERPARAMETER COMPARISON")
print("=" * 60)
print(f"{'Parameter':<20} {'Baseline':<15} {'Optimized':<15} {'Change'}")
print("-" * 60)
for param in baseline_params:
    baseline_val = baseline_params[param]
    optimized_val = optimized_params[param]
    change = "✓" if baseline_val != optimized_val else "(same)"
    print(f"{param:<20} {baseline_val:<15} {optimized_val:<15} {change}")
print("=" * 60)


HYPERPARAMETER COMPARISON
Parameter            Baseline        Optimized       Change
------------------------------------------------------------
n_estimators         300             439             ✓
max_depth            5               8               ✓
learning_rate        0.05            0.0326376815988589 ✓


## Checkpoint Summary

**Expected outcomes:**
- `results/forecasts/lightgbm_optuna.parquet` with 47-48 forecasts
- `results/optimization/optuna_study.pkl` (for resuming optimization)
- `results/optimization/lightgbm_best_params.json` (best hyperparameters)
- `results/optimization/optuna_history.csv` (trial history)
- `results/figures/optuna_history.png` (optimization convergence plot)
- `results/figures/optuna_importance.png` (parameter importance plot)
- Columns: date, origin, horizon, model, q0.1, q0.5, q0.9, actual
- Total runtime: 20-30 minutes
- No NaN values, quantile monotonicity preserved

**Success criterion:** MASE < 0.64 (evaluated in nb/03_evaluation.ipynb)

**Next:** Proceed to `03_evaluation.ipynb` to compare optimized LightGBM against foundation models.