**For correct rendering, view this notebook in [nbviewer](https://nbviewer.org/github/markuskrecik/preference-dynamics-learning/blob/main/notebooks/41_training_cnn_n1_residual.ipynb)**

# 1d CNN Residual Model Training for n=1 action

The previous CNN model already performed well on parameter and initial condition prediction, but it reproduced the steady state poorly.
It is also not suitable for forecasting and cannot incorporate extra features.

I introduce an improved model architecture that takes additional steady-state features as input and trains the CNN only on the residual $r = x - c$ between the time series $x$ and the steady state $c$.
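
As a minimal sketch of the residual decomposition, assume a time series stored as a NumPy array of shape `(n_vars, n_time_points)` and one steady-state value per variable. The function name, array names, and values are illustrative and not taken from the package:

```python
import numpy as np


def residual(x: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Return r = x - c, broadcasting the steady state c over the time axis."""
    return x - c[:, None]


# Illustrative data: two variables relaxing exponentially toward their steady states
t = np.linspace(0.0, 10.0, 6)
c = np.array([1.0, -0.5])
x = c[:, None] + np.array([[2.0], [0.5]]) * np.exp(-t)[None, :]

r = residual(x, c)
print(r.shape)
```

The residual decays toward zero as the series approaches its steady state, so the CNN only has to model the transient part of the signal.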

**This notebook:**
- Trains the residual 1d CNN model for n=1 action
- Evaluates the model on the test set with various metrics
- Performs hyperparameter studies on the learning rate and on the number of filters, kernel sizes, and hidden dimensions
- Compares time series simulated from true and predicted parameters


## Training for 1 action


In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
from optuna.visualization import plot_parallel_coordinate, plot_contour, plot_param_importances
from plotly.offline import init_notebook_mode

init_notebook_mode(connected=True)

from preference_dynamics.schemas import (
    ParameterVector,
    ICVector,
    ODEConfig,
    SolverConfig,
    ODESolverConfig,
    TrainerConfig,
    RunnerConfig,
)

from preference_dynamics.solver import create_default_sampler, generate_batch, solve_ODE
from preference_dynamics.data import DataConfig, DataManager
from preference_dynamics.data.adapters import (
    StateInputAdapter,
    StateFeatureInputAdapter,
    ParameterICForecastTargetAdapter,
)
from preference_dynamics.data.transformer import (
    SampleGroupNormalizer,
    SampleGroupStdNormalizer,
    ShortenTimeSeriesTransformer,
    SteadyStateFeature,
)
from preference_dynamics.models import CNN1DConfig, CNN1DFeatConfig, CNN1DResidualConfig
from preference_dynamics.training import compute_metrics
from preference_dynamics.experiments import ExperimentRunner
from preference_dynamics.visualization import (
    plot_metrics,
    plot_parameter_comparison,
    plot_time_series,
    plot_training_curves,
)
from preference_dynamics.utils import (
    num_vars,
    num_params,
    get_param_names,
    get_var_names,
    assemble_checkpoint_path,
)


n_actions = 1
data_dir = f"data/n{n_actions}"
model_name = f"cnn1d_n{n_actions}_residual"

### Preliminaries: Data Loading and Transformation

I assume the data has already been generated in `40_training_cnn_n1.ipynb`.

I transform the raw data by inferring the steady state as a feature through `SteadyStateFeature`, and I also shorten the time series through `ShortenTimeSeriesTransformer` in order to predict the last time point.

The `StateFeatureInputAdapter` specifies that the model receives the state vector and a feature vector as input.
The `ParameterICForecastTargetAdapter` specifies that the model targets are the parameters, the initial conditions, and a forecast of the last time point.
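
The two transformations can be pictured with the following sketch. It is not the implementation of `SteadyStateFeature` or `ShortenTimeSeriesTransformer`. It assumes, purely for illustration, that the steady state is estimated as the mean over the last few time points and that shortening drops the final time point, which then serves as the forecast target. The function names and the `tail` parameter are hypothetical:

```python
import numpy as np


def estimate_steady_state(x: np.ndarray, tail: int = 5) -> np.ndarray:
    """Assumed estimator: mean of the last `tail` time points per variable."""
    return x[:, -tail:].mean(axis=1)


def split_forecast_target(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Assumed shortening: drop the last time point and return it as the target."""
    return x[:, :-1], x[:, -1]


# Illustrative series with 2 variables and 6 time points
x = np.arange(12, dtype=float).reshape(2, 6)

# Same order as in the transformer list: feature first, shortening afterwards
steady_state = estimate_steady_state(x, tail=2)
inputs, forecast_target = split_forecast_target(x)
print(steady_state, inputs.shape, forecast_target)
```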

In [2]:
data_config = DataConfig(
    data_dir=data_dir,
    load_if_exists=False,
    transformers=[
        SteadyStateFeature(),
        SampleGroupStdNormalizer(),
        ShortenTimeSeriesTransformer(),
    ],
    input_adapter=StateFeatureInputAdapter(),
    target_adapter=ParameterICForecastTargetAdapter(),
)
dm = DataManager(config=data_config).setup()

### Model Training

I start with the optimal configuration from the previous study:
- 2 convolutional layers with 64 and 128 filters
- 1 global pooling layer
- A dropout layer
- 1 hidden layer with 128 neurons
- All activation functions are ReLU
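
To get a feel for the size of this configuration, the following sketch counts the trainable parameters of a plain convolution, pooling, and dense stack. It assumes standard 1d convolution and linear layers with biases and ignores any layers the residual model adds for the steady-state features. `in_channels=3` and `out_dim=8` are placeholder values, since the notebook reads them from `dm.n_inputs` and `dm.n_targets`:

```python
def count_parameters(in_channels, filters, kernel_sizes, hidden_dims, out_dim):
    """Count weights and biases of a conv -> global pool -> dense stack."""
    total = 0
    channels = in_channels
    for n_filters, kernel_size in zip(filters, kernel_sizes):
        # Conv1d: one kernel per (input channel, output channel) pair, plus biases
        total += channels * n_filters * kernel_size + n_filters
        channels = n_filters
    # Global pooling leaves one value per channel and has no parameters
    width = channels
    for size in [*hidden_dims, out_dim]:
        # Linear: weight matrix plus biases
        total += width * size + size
        width = size
    return total


# Placeholder input and output sizes (assumptions, see text)
n_parameters = count_parameters(
    in_channels=3, filters=[64, 128], kernel_sizes=[3, 3], hidden_dims=[128], out_dim=8
)
print(n_parameters)  # 42888
```

Dropout and ReLU add no parameters, so they do not appear in the count.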

In [3]:
model_config = CNN1DResidualConfig(
    model_name=model_name,
    in_channels=dm.n_inputs,
    filters=[64, 128],
    kernel_sizes=[3, 3],
    hidden_dims=[128],
    out_dim=dm.n_targets,
    dropout=0.3,
)

trainer_config = TrainerConfig(
    loss_function="mse",
    learning_rate=0.002,
    num_epochs=200,
    early_stopping_patience=20,
)

runner_config = RunnerConfig(
    experiment_name=f"preference_dynamics_n{n_actions}",
)

runner = ExperimentRunner(
    runner_config=runner_config,
    data_config=data_config,
    model_config=model_config,
    trainer_config=trainer_config,
)

2026/01/07 16:47:21 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2026/01/07 16:47:21 INFO mlflow.store.db.utils: Updating database tables
2026/01/07 16:47:21 INFO alembic.runtime.migration: Context impl SQLiteImpl.
2026/01/07 16:47:21 INFO alembic.runtime.migration: Will assume non-transactional DDL.
2026/01/07 16:47:21 INFO alembic.runtime.migration: Context impl SQLiteImpl.
2026/01/07 16:47:21 INFO alembic.runtime.migration: Will assume non-transactional DDL.


In [4]:
experiment = runner.run("residual_base")

plot_training_curves(experiment.history)

Training:  45%|████▌     | 90/200 [06:28<07:54,  4.31s/it, epoch=90, train_loss=0.162, val_loss=0.136, epoch_time=4.67]
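
`TrainerConfig` sets `early_stopping_patience=20`. The following is a generic sketch of patience-based early stopping, not the package's trainer. It assumes that any decrease of the validation loss counts as an improvement, and the loss values are made up:

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never stops early."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: the best value occurs at epoch 3
losses = [0.5, 0.4, 0.3, 0.31, 0.32, 0.33, 0.29]
stop = early_stopping_epoch(losses, patience=3)
print(stop)  # 6
```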


### Model Evaluation

The training converged well. Evaluation metrics show that the new architecture performs far better than the previous model on parameter and IC prediction, in particular for $g_0$.
The new model also uses the steady-state features to predict the last time point, which is an easy task.
It is worth noting that the new model also picks up the unobservable initial conditions $v_0<0$ and $m_0<0$, which the previous model failed to do.
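
Per-target evaluation of this kind can be sketched as follows. Which metrics `compute_metrics` actually reports is not shown here, so MAE, RMSE, and $R^2$ are assumptions, and the function name and data are illustrative:

```python
import numpy as np


def per_target_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute MAE, RMSE, and R^2 separately for each target column."""
    err = y_pred - y_true
    ss_res = (err**2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return {
        "mae": np.abs(err).mean(axis=0),
        "rmse": np.sqrt((err**2).mean(axis=0)),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Illustrative data: target 0 is predicted exactly, target 1 with a constant offset
y_true = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]])
y_pred = np.array([[1.0, 0.5], [2.0, 1.5], [3.0, 2.5]])
m = per_target_metrics(y_true, y_pred)
print(m)
```

Computing the metrics per column makes it possible to single out individual targets such as $g_0$ instead of averaging over all of them.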

In [None]:
experiment = runner.load_checkpoint("cnn1d_n1_residual/f36e45be394a4331bfbfbc759d0ddb56/best")
# experiment.load_checkpoint("best")

y_pred, y_true, loss = experiment.trainer.evaluate(dm.test_dataloader)

col_names = get_param_names(n_actions, ic=True)
col_names.extend(get_var_names(n_actions, suffix="forecast"))
metrics = compute_metrics(y_true, y_pred)
plot_metrics(metrics, target_names=col_names, height=1000)
plot_parameter_comparison(y_true, y_pred, col_names)

INFO:preference_dynamics.training.trainer:Initialized Trainer with device=cuda, checkpoint_dir=checkpoints
INFO:preference_dynamics.training.trainer:Loading checkpoint from checkpoints/cnn1d_n1_residual/f36e45be394a4331bfbfbc759d0ddb56/best.pt


#### Time Series Comparison

Does the model replicate the time series? Let's look at some examples (s=0 true, s=1 predicted time series). The model performs far better than the previous one, although for some time series it still misses the steady state.
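
The visual comparison can be complemented by two simple numbers per variable: the RMSE over the whole trajectory and the gap at the final time point as a proxy for the steady-state error. This is a sketch with a hypothetical helper and made-up toy trajectories that do not come from the preference-dynamics ODE. Only the time grid mirrors the `SolverConfig` used in this notebook:

```python
import numpy as np


def trajectory_mismatch(x_true: np.ndarray, x_pred: np.ndarray):
    """Return per-variable RMSE and absolute difference at the final time point."""
    rmse = np.sqrt(((x_pred - x_true) ** 2).mean(axis=1))
    final_gap = np.abs(x_pred[:, -1] - x_true[:, -1])
    return rmse, final_gap


# Time grid as in SolverConfig: 401 points on [0, 200]
t = np.linspace(0.0, 200.0, 401)

# Toy trajectories (assumption): the prediction overshoots the steady state by 10 %
x_true = np.vstack([1.0 - np.exp(-0.05 * t)])
x_pred = np.vstack([1.1 * (1.0 - np.exp(-0.05 * t))])

rmse, final_gap = trajectory_mismatch(x_true, x_pred)
print(rmse, final_gap)
```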

In [6]:
# Visual comparison of true and predicted time series
solver_config = SolverConfig(
    time_span=(0.0, 200.0),
    n_time_points=401,
)
n_params = num_params(n_actions)
n_vars = num_vars(n_actions)

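# Simulate the first 10 evaluated test samples, once from the true and once from the predicted targets
for i in range(10):
    samples = []
    try:
        for name, y in {"true": y_true.cpu().numpy()[i], "pred": y_pred.cpu().numpy()[i]}.items():
            # The first n_params entries are the parameters, the next n_vars the initial conditions
            config_ode = ODEConfig(
                parameters=ParameterVector(values=y[:n_params]),
                initial_conditions=ICVector(values=y[n_params : n_params + n_vars]),
            )
            config = ODESolverConfig(ode=config_ode.model_dump(), solver=solver_config)

            sample = solve_ODE(config)
            samples.append(sample)
        plot_time_series(samples)
    except ValueError:
        # Skip samples for which building the configuration or solving the ODE raises a ValueError
        continue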

## Hyperparameter Optimization

### Learning Rate

I first optimize the learning rate.

For this, I again subclass the `ExperimentRunner` and override the `suggest_parameters` method.
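
The search range spans two orders of magnitude, so the learning rate is sampled on a logarithmic scale (`log=True`). The following stand-alone sketch shows log-uniform sampling independently of Optuna. The helper name is hypothetical:

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniformly distributed on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(1e-4, 1e-2, rng) for _ in range(1000)]

# About half of the samples fall below the geometric midpoint 1e-3
fraction_below = sum(s < 1e-3 for s in samples) / len(samples)
print(fraction_below)
```

With plain uniform sampling on the same interval, only about 9 % of the draws would fall below $10^{-3}$, so small learning rates would rarely be tried.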

In [7]:
class Runner(ExperimentRunner):
    def suggest_parameters(self, trial):
        lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
        self.trainer_config.learning_rate = lr


runner = Runner(
    runner_config=runner_config,
    data_config=data_config,
    model_config=model_config,
    trainer_config=trainer_config,
)

In [8]:
study = runner.run_study("residual_lr_study", n_trials=10, n_jobs=1)

[I 2026-01-06 19:08:56,021] A new study created in RDB with name: residual_lr_study

suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.

INFO:preference_dynamics.experiments.runner:Run: Trial 0: lr=0.002011967960926902
INFO:preference_dynamics.training.trainer:Initialized Trainer with device=cuda, checkpoint_dir=checkpoints
INFO:preference_dynamics.data.manager:Loading raw data from data/n1/raw
INFO:preference_dynamics.data.manager:Saving 3 splits to data/n1/processed
INFO:preference_dynamics.training.trainer:Starting training for 200 epochs with run_id=1416227b274e44929346aef03d5c05ba
Training:  60%|██████    | 121/200 [06:35<03:48,  2.90s/it, epoch=121, train_loss=0.148, val_loss=0.123, epoch_time=3.13]INFO:preference_dynamics.training.trainer:Early stopping at epoch 121 (patience: 20)
Training:  60%|██████    | 121/200 [06:35<04:18,  3.27s/it, ep

In [8]:
study = runner.load_study("residual_lr_study")
plot_parallel_coordinate(study)

A learning rate of $\approx 0.00162$ yields the best compromise between speed and performance.

### Number of Filters, Kernel Sizes, and Hidden Dimensions

Next, I optimize over the number of filters, kernel sizes, and hidden dimensions. I keep the number of layers fixed at three convolutional layers and two hidden layers, one more of each than in the base model.
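
A quick count shows how large this search space is. The ranges are the ones used in this study's `suggest_parameters`: odd kernel sizes from 1 to 7, and filters and hidden dimensions from 32 to 256 in steps of 32. The variable names are illustrative:

```python
# Discrete choices per hyperparameter
kernel_choices = list(range(1, 8, 2))  # 1, 3, 5, 7
width_choices = list(range(32, 257, 32))  # 32, 64, ..., 256

n_conv_layers = 3
n_hidden_layers = 2

n_combinations = (
    len(kernel_choices) ** n_conv_layers  # kernel sizes
    * len(width_choices) ** n_conv_layers  # filters
    * len(width_choices) ** n_hidden_layers  # hidden dimensions
)

n_trials = 80
print(n_combinations)  # 2097152
print(n_trials / n_combinations)
```

The 80 trials cover only a tiny fraction of the roughly two million combinations, which is why the analysis relies on broad patterns rather than an exhaustive search.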

In [9]:
model_config = CNN1DResidualConfig(
    model_name=model_name,
    in_channels=dm.n_inputs,
    filters=[64, 128, 256],
    kernel_sizes=[3, 3, 3],
    hidden_dims=[128, 64],
    out_dim=dm.n_targets,
    dropout=0.3,
)

trainer_config = TrainerConfig(
    loss_function="mse",
    learning_rate=0.00162,
    num_epochs=200,
    early_stopping_patience=20,
)


class Runner(ExperimentRunner):
    def suggest_parameters(self, trial):
        # n_kernels = trial.suggest_int("n_kernels", 1, 4)
        n_kernels = len(self.model_config.filters)
        self.model_config.kernel_sizes = [
            trial.suggest_int(f"kernel_size_{i}", 1, 7, step=2) for i in range(n_kernels)
        ]
        self.model_config.filters = [
            trial.suggest_int(f"filter_{i}", 32, 256, step=32) for i in range(n_kernels)
        ]

        # n_hidden_layers = trial.suggest_int("n_hidden_layers", 1, 3)
        n_hidden_layers = len(self.model_config.hidden_dims)
        self.model_config.hidden_dims = [
            trial.suggest_int(f"hidden_dim_{i}", 32, 256, step=32) for i in range(n_hidden_layers)
        ]


runner = Runner(
    runner_config=runner_config,
    data_config=data_config,
    model_config=model_config,
    trainer_config=trainer_config,
)

In [16]:
study = runner.run_study("residual_num_filter_hidden_kernel_study", n_trials=80, n_jobs=1)

[I 2026-01-06 21:44:39,609] A new study created in RDB with name: residual_num_filter_hidden_kernel_study
INFO:preference_dynamics.experiments.runner:Run: Trial 0: kernel_size_0=7, kernel_size_1=7, kernel_size_2=3, filter_0=160, filter_1=32, filter_2=96, hidden_dim_0=32, hidden_dim_1=160
INFO:preference_dynamics.training.trainer:Initialized Trainer with device=cuda, checkpoint_dir=checkpoints
INFO:preference_dynamics.data.manager:Loading raw data from data/n1/raw
INFO:preference_dynamics.data.manager:Saving 3 splits to data/n1/processed
INFO:preference_dynamics.training.trainer:Starting training for 200 epochs with run_id=9d5dfca0e89145408b277bea786c0aeb
Training:  60%|█████▉    | 119/200 [07:44<05:26,  4.03s/it, epoch=119, train_loss=0.122, val_loss=0.137, epoch_time=4.3]  INFO:preference_dynamics.training.trainer:Early stopping at epoch 119 (patience: 20)
Training:  60%|█████▉    | 119/200 [07:44<05:15,  3.90s/it, epoch=119, train_loss=0.122, val_loss=0.137, epoch_time=4.3]
INFO:pref

We first check which hyperparameters have the highest impact on the model performance. The number of filters in the last CNN layer is most impactful.

The contour plots show how the objective varies over pairs of hyperparameters. By fixing one parameter and comparing the performance against each of the others, we can identify broad patterns in the parameter space.
The broad findings are:

- `filter_0=64`: best across most combinations of the other parameters
- `filter_1=64`: same
- `filter_2=192`: same
- `kernel_size_0=3`: not unambiguously better than 5, but yields better performance for the right parameter combinations
- `kernel_size_1=5`: unambiguously best
- `kernel_size_2=7`: same
- `hidden_dim_0=64`: best; larger hidden layers yield worse performance
- `hidden_dim_1=96`: best

To double-check, we can locate this particular parameter combination in the contour plots and verify that it lies in a valley across all parameters.
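
The per-parameter findings can be collected into the list form used by `CNN1DResidualConfig`. The sketch below uses plain Python containers only and also checks that every value lies on the grid that was searched:

```python
# Best value per hyperparameter, as listed in the findings
best = {
    "filter_0": 64,
    "filter_1": 64,
    "filter_2": 192,
    "kernel_size_0": 3,
    "kernel_size_1": 5,
    "kernel_size_2": 7,
    "hidden_dim_0": 64,
    "hidden_dim_1": 96,
}

# Search grid of the study: odd kernel sizes 1..7, widths 32..256 in steps of 32
kernel_grid = set(range(1, 8, 2))
width_grid = set(range(32, 257, 32))

filters = [best[f"filter_{i}"] for i in range(3)]
kernel_sizes = [best[f"kernel_size_{i}"] for i in range(3)]
hidden_dims = [best[f"hidden_dim_{i}"] for i in range(2)]

on_grid = all(k in kernel_grid for k in kernel_sizes) and all(
    w in width_grid for w in filters + hidden_dims
)
print(filters, kernel_sizes, hidden_dims, on_grid)
```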

In [10]:
study = runner.load_study("residual_num_filter_hidden_kernel_study")

display(plot_param_importances(study))

fig = plot_contour(study)
fig.update_layout(height=1000)
# restrict color range to relevant values
fig.update_traces(zmin=0.065, zmax=0.11, selector={"type": "contour"})
fig.show()

#### Model Evaluation, Best Run

I take the best run (which coincides with the best parameter combination identified above) and evaluate it on the test set.
The overall metrics show a slight improvement in model performance.

In [11]:
run_id = [t.user_attrs["mlflow_run_id"] for t in study.trials if t.number == 24][0]
checkpoint_path = assemble_checkpoint_path(["best"], model_name=model_name, run_id=run_id)

experiment = runner.load_checkpoint(checkpoint_path)

y_pred, y_true, loss = experiment.trainer.evaluate(dm.test_dataloader)

metrics = compute_metrics(y_true, y_pred)
plot_metrics(metrics, col_names)
plot_parameter_comparison(y_true, y_pred, col_names)

INFO:preference_dynamics.training.trainer:Initialized Trainer with device=cuda, checkpoint_dir=checkpoints
INFO:preference_dynamics.training.trainer:Loading checkpoint from checkpoints/cnn1d_n1_residual/c4ccbc4412d446b290a8e01be8230577/best.pt


#### Time Series Comparison, Best Run

Again, I visualize predictions for a few individual test samples. The optimization was worth the effort: the model captures the steady state better than the model I started with. Some difficult outliers remain, however.

In [12]:
solver_config = SolverConfig(
    time_span=(0.0, 200.0),
    n_time_points=401,
)
n_params = num_params(n_actions)
n_vars = num_vars(n_actions)

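# Simulate the first 10 evaluated test samples, once from the true and once from the predicted targets
for i in range(10):
    samples = []
    try:
        for name, y in {"true": y_true.cpu().numpy()[i], "pred": y_pred.cpu().numpy()[i]}.items():
            # The first n_params entries are the parameters, the next n_vars the initial conditions
            config_ode = ODEConfig(
                parameters=ParameterVector(values=y[:n_params]),
                initial_conditions=ICVector(values=y[n_params : n_params + n_vars]),
            )
            config = ODESolverConfig(ode=config_ode.model_dump(), solver=solver_config)

            sample = solve_ODE(config)
            samples.append(sample)
        plot_time_series(samples)
    except ValueError:
        # Skip samples for which building the configuration or solving the ODE raises a ValueError
        continue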

## Summary

This notebook trains an improved 1d CNN architecture in which the CNN learns only on the residual.
I performed hyperparameter studies to identify the best learning rate, number of filters, kernel sizes, and hidden dimensions.
The optimized model outperforms the previous model and achieves a better representation of the simulated time series.

**Future extensions:**
- Add a custom loss function to weight forecasting differently and to regularize the residual
- Additional hyperparameter studies on optimizer, activation functions, etc.
- Analysis of outliers
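
One possible reading of the first extension is a weighted MSE with an L2 penalty on the residual signal. The sketch below is an assumption about what such a loss could look like, written with NumPy for illustration. The function name, the weights, and the penalty strength are all hypothetical:

```python
import numpy as np


def weighted_mse_with_residual_penalty(y_true, y_pred, weights, residual, penalty=0.0):
    """Weighted mean of per-target MSEs plus an L2 penalty on the residual."""
    per_target_mse = ((y_pred - y_true) ** 2).mean(axis=0)
    data_term = float((weights * per_target_mse).sum() / weights.sum())
    reg_term = penalty * float((residual**2).mean())
    return data_term + reg_term


# Illustrative batch with two targets; the second (e.g. a forecast) is down-weighted
y_true = np.zeros((2, 2))
y_pred = np.array([[1.0, 2.0], [1.0, 2.0]])
weights = np.array([1.0, 0.5])
residual = np.array([[1.0, -1.0]])

loss = weighted_mse_with_residual_penalty(y_true, y_pred, weights, residual, penalty=0.1)
print(loss)
```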

**Next steps:**
- Train PINN model (coming soon)