# Anemoi training configurations

Anemoi adopts a config-driven approach to development, and training new models is no exception. In this notebook, we give a brief overview of how the anemoi-training configuration system works. The configuration controls every aspect of the program, including:

- data sources, data loading
- model architectures
- optimization hyperparameters
- distributed training strategy
- training diagnostics

and a lot more!

## Generate default configs

Start by generating a configuration directory with:

In [34]:
!anemoi-training config generate --output ../configs/training

2025-10-23 11:59:18 INFO Generating configs, please wait.
2025-10-23 11:59:18 INFO File ../configs/training/ensemble_crps.yaml already exists, skipping
2025-10-23 11:59:18 INFO File ../configs/training/debug.yaml already exists, skipping
2025-10-23 11:59:18 INFO File ../configs/training/stretched.yaml already exists, skipping
2025-10-23 11:59:18 INFO File ../configs/training/diffusion.yaml already exists, skipping
2025-10-23 11:59:18 INFO File ../configs/training/interpolator.yaml already exists, skipping
2025-10-23 11:59:18 INFO File ../configs/training/hierarchical.yaml already exists, skipping
2025-10-23 11:59:18 INFO File ../configs/training/lam.yaml already exists, skipping
2025-10-23 11:59:18 INFO File ../configs/training/config.yaml already exists, skipping
2025-10-23 11:59:18 INFO File ../configs/training/data/zarr.yaml already exists, skipping
2025-10-23 11:59:18 INFO File ../configs/training/dataloader/native_grid.yaml already exists, skipping
2025-10-23 11:59:18 INFO File ..

This will contain some default configs to run the training. You can find more about how the training configuration works in the ["Configuration basics"](https://anemoi.readthedocs.io/projects/training/en/latest/user-guide/hydra-intro.html) section of the anemoi-training documentation.

In the top folder of `configs/training` you will find a collection of configs that anemoi contributors are currently using. Quite some diversity! For this tutorial, we will create our own config starting from `config.yaml`.


## Our first training config

Let's inspect our training configuration:

In [35]:
import yaml
from pprint import pprint

with open("../configs/training/config.yaml") as f:
    config = yaml.safe_load(f)

pprint(config)

{'config_validation': True,
 'defaults': [{'data': 'zarr'},
              {'dataloader': 'native_grid'},
              {'diagnostics': 'evaluation'},
              {'datamodule': 'single'},
              {'hardware': 'example'},
              {'graph': 'multi_scale'},
              {'model': 'gnn'},
              {'training': 'default'},
              '_self_']}


For now, we only have some defaults. Defaults are themselves YAML files, and they are resolved into the full configuration with [Hydra](https://github.com/facebookresearch/hydra), a framework for handling hierarchically structured configs. Let's try to compose the full config.

In [36]:
from omegaconf import OmegaConf
from pprint import pprint
from anemoi_demo_keisler2022.helpers import compose_config, load_config

composed_config = compose_config("../configs/training", "config")

pprint(OmegaConf.to_container(composed_config))

{'config_validation': True,
 'data': {'diagnostic': ['tp', 'cp'],
          'forcing': ['cos_latitude',
                      'cos_longitude',
                      'sin_latitude',
                      'sin_longitude',
                      'cos_julian_day',
                      'cos_local_time',
                      'sin_julian_day',
                      'sin_local_time',
                      'insolation',
                      'lsm',
                      'sdor',
                      'slor',
                      'z'],
          'format': 'zarr',
          'frequency': '6h',
          'normalizer': {'default': 'mean-std',
                         'max': ['sdor', 'slor', 'z'],
                         'min-max': None,
                         'none': ['cos_latitude',
                                  'cos_longitude',
                                  'sin_latitude',
                                  'sin_longitude',
                                  'cos_julian_day',
           

That's a long configuration! The advantage of composing configurations like this is that we can replace entire sections with a single line change. For instance, let's say we want to use a transformer model instead of a GNN. They are defined in the YAML files `transformer.yaml` and `gnn.yaml`, respectively, under `configs/training/model`. All we need to do is change the `model` entry of the `defaults` list to `transformer`.

In [37]:
config["defaults"][6]["model"] = "transformer"

pprint(config)

{'config_validation': True,
 'defaults': [{'data': 'zarr'},
              {'dataloader': 'native_grid'},
              {'diagnostics': 'evaluation'},
              {'datamodule': 'single'},
              {'hardware': 'example'},
              {'graph': 'multi_scale'},
              {'model': 'transformer'},
              {'training': 'default'},
              '_self_']}
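
Indexing the defaults list by position (`[6]`) breaks as soon as the list is reordered. A minimal sketch of a lookup by group name follows; `set_default` is a hypothetical helper written for this notebook, not part of anemoi-training or Hydra, and the shortened `example` dict only mimics the structure printed above.

```python
def set_default(config: dict, group: str, option: str) -> None:
    """Point the defaults entry for `group` at `option`, in place."""
    for entry in config["defaults"]:
        # Entries are single-key dicts such as {"model": "gnn"};
        # the trailing "_self_" marker is a plain string and is skipped.
        if isinstance(entry, dict) and group in entry:
            entry[group] = option
            return
    raise KeyError(f"no defaults entry for group {group!r}")


# Shortened stand-in for the config loaded from config.yaml.
example = {
    "defaults": [{"data": "zarr"}, {"model": "gnn"}, {"training": "default"}, "_self_"],
    "config_validation": True,
}
set_default(example, "model", "transformer")
print(example["defaults"])
```

Raising `KeyError` for an unknown group makes a typo in the group name fail loudly instead of silently leaving the config unchanged.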


The anemoi-training configuration also supports interpolating values from other parts of the config. For instance, let's look at the dataloader config:

In [38]:
pprint(OmegaConf.to_container(composed_config["dataloader"]))

{'batch_size': {'test': 4, 'training': 2, 'validation': 4},
 'dataset': '${hardware.paths.data}/${hardware.files.dataset}',
 'grid_indices': {'_target_': 'anemoi.training.data.grid_indices.FullGrid',
                  'nodes_name': '${graph.data}'},
 'limit_batches': {'test': 20, 'training': None, 'validation': None},
 'num_workers': {'test': 8, 'training': 8, 'validation': 8},
 'pin_memory': True,
 'prefetch_factor': 2,
 'read_group_size': '${hardware.num_gpus_per_model}',
 'test': {'dataset': '${dataloader.dataset}',
          'drop': [],
          'end': None,
          'frequency': '${data.frequency}',
          'start': 2022},
 'training': {'dataset': '${dataloader.dataset}',
              'drop': [],
              'end': 2020,
              'frequency': '${data.frequency}',
              'start': None},
 'validation': {'dataset': '${dataloader.dataset}',
                'drop': [],
                'end': 2021,
                'frequency': '${data.frequency}',
                'sta

The dataloader config `dataset` entry is `${hardware.paths.data}/${hardware.files.dataset}`. This means that the value will be taken from other parts of the config. Let's go have a look.

In [39]:
composed_config["hardware"]["paths"]["data"]

MissingMandatoryValue: Missing mandatory value: hardware.paths.data
    full_key: hardware.paths.data
    object_type=dict

We haven't set this one. This is why, when we try to resolve the full configuration (in other words, to interpolate all values that use the `${value}` syntax), we get an error:

In [40]:
# note 'resolve=True' argument
OmegaConf.to_container(composed_config, resolve=True)

InterpolationToMissingValueError: MissingMandatoryValue while resolving interpolation: Missing mandatory value: hardware.paths.data
    full_key: dataloader.dataset
    object_type=dict
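
To make the mechanism concrete, here is a toy resolver in plain Python. It is only an illustration of the idea and not how OmegaConf is implemented: it handles absolute `${a.b.c}` references only, with no resolvers such as `oc.env` and no relative keys. The names `lookup` and `resolve`, the file name `example.zarr`, and the path `/data` are all made up for this sketch.

```python
import re

MISSING = "???"  # marker used for mandatory values that have no default
PATTERN = re.compile(r"\$\{([^${}]+)\}")  # matches ${some.dotted.key}


def lookup(config: dict, dotted_key: str):
    """Follow a dotted key such as 'hardware.paths.data' through nested dicts."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(config: dict, value: str) -> str:
    """Substitute every ${a.b.c} reference in `value`, following chained references."""
    def substitute(match):
        key = match.group(1)
        target = lookup(config, key)
        if target == MISSING:
            raise ValueError(f"missing mandatory value: {key}")
        # The referenced value may itself contain references.
        return resolve(config, str(target))

    return PATTERN.sub(substitute, value)


toy = {
    "hardware": {"paths": {"data": MISSING}, "files": {"dataset": "example.zarr"}},
    "dataloader": {"dataset": "${hardware.paths.data}/${hardware.files.dataset}"},
}

try:
    resolve(toy, toy["dataloader"]["dataset"])
except ValueError as err:
    print(err)  # the reference points at a '???' value

toy["hardware"]["paths"]["data"] = "/data"
print(resolve(toy, toy["dataloader"]["dataset"]))
```

The error is raised for `dataloader.dataset` even though that key has a value, because one of the keys it references is still missing. This mirrors the `full_key` reported in the traceback above.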

There are a few other values that have no defaults:

In [41]:
!grep "???" -r ../configs/training

../configs/training/stretched.yaml:        thinning: ???
../configs/training/stretched.yaml:      weight_frac_of_total: ???
../configs/training/stretched.yaml:    dataset: ???
../configs/training/stretched.yaml:    forcing_dataset: ???
../configs/training/training/scalers/stretched.yaml:  weight_frac_of_total: ???
../configs/training/diagnostics/evaluation.yaml:    entity: ???
../configs/training/hardware/files/example.yaml:dataset: ???
../configs/training/hardware/files/example.yaml:graph: ???
../configs/training/hardware/paths/example.yaml:data: ???
../configs/training/hardware/paths/example.yaml:output: ???
../configs/training/lam.yaml:        thinning: ???
../configs/training/lam.yaml:    dataset: ???
../configs/training/lam.yaml:    forcing_dataset: ???
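
The same search can be done on a config that is already loaded as a plain dictionary. The sketch below walks nested dicts and lists and reports the dotted path of every `???` value; `find_missing` is a hypothetical helper, and the `hardware` dict is a hand-written stand-in whose `num_gpus_per_model` value is assumed. For composed configs, recent OmegaConf versions also provide `OmegaConf.missing_keys`.

```python
def find_missing(node, prefix=""):
    """Yield the dotted path of every value equal to the '???' marker."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_missing(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_missing(value, f"{prefix}[{index}]")
    elif node == "???":
        yield prefix


# Hand-written stand-in for the hardware section of the example configs.
hardware = {
    "paths": {"data": "???", "output": "???"},
    "files": {"dataset": "???", "graph": "???"},
    "num_gpus_per_model": 1,
}
print(list(find_missing(hardware, "hardware")))
```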



However, the ones that are strictly required to run a training are the following:

- `hardware.paths.data`: Location of base directory where datasets are stored.
- `hardware.paths.graph`: Location of graph directory. You can set this to where you stored your generated graph: `../output/graphs`.
- `hardware.paths.output`: Location of output directory. Set this to `../output/training`.
- `hardware.files.dataset`: Filename(s) of datasets used for training.
- `hardware.files.graph`: If you have pre-computed a specific graph, specify its filename here. Otherwise, a new graph will be constructed on the fly and written to the filename given.


Under `../configs/training/hardware` we can find some example configs that will help us. Let's set some values directly in our config.

In [46]:
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent

# set all the mandatory missing config entries
config["hardware"] = {
    "paths": {
        "data": str(PROJECT_ROOT / "output/datasets"),
        "graph": str(PROJECT_ROOT / "output/graphs"),
        "output": str(PROJECT_ROOT / "output/training"),
    },
    "files": {
        "dataset": "aifs-ea-an-oper-0001-mars-o96-1979-2022-6h-v6.zarr",
        "graph": "o96_multi_scale.pt",
    }
}

# set some extra config entries
config["defaults"][4]["hardware"] = "slurm"

# In this demo we use a small per-GPU batch size. Since training is
# distributed across multiple GPUs, the effective batch size is
# (num_gpus * batch_size_per_gpu).
config["dataloader"] = {
    "batch_size": {
        "training": 1,
        "validation": 1
    }
}


We save the config now.

In [50]:
with open("../configs/training/demo.yaml", "w") as f:
    yaml.dump(config, f, sort_keys=False)

Let's check that we can compose and resolve it:

In [48]:
composed_config = compose_config("../configs/training", "demo")
pprint(OmegaConf.to_container(composed_config, resolve=True))

InterpolationResolutionError: KeyError raised while resolving interpolation: "Environment variable 'SLURM_GPUS_PER_NODE' not found"
    full_key: hardware.num_gpus_per_node
    object_type=dict

We are getting this error because some of the keys are configured to be resolved via environment variables, for instance:

In [49]:
composed_config = OmegaConf.to_container(composed_config, resolve=False)
composed_config["hardware"]["num_gpus_per_node"]

'${oc.decode:${oc.env:SLURM_GPUS_PER_NODE}}'

This is expected, because in our config we are using a pre-defined `slurm` config for hardware's defaults. This config can be used to launch training programs on SLURM clusters. 

One can of course create another default config for other cluster types, or manually override these values.
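
For a quick check outside of a SLURM job, one option is to provide the variable temporarily before resolving the config. The context manager below is a sketch using only the standard library; `temporary_env` is a hypothetical helper and the value `"4"` is an assumption. The `slurm` hardware config may read further SLURM variables, which would need to be supplied in the same way.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(**variables: str):
    """Set environment variables inside a `with` block, then restore the old state."""
    previous = {name: os.environ.get(name) for name in variables}
    os.environ.update(variables)
    try:
        yield
    finally:
        for name, value in previous.items():
            if value is None:
                os.environ.pop(name, None)  # variable did not exist before
            else:
                os.environ[name] = value  # put the original value back


with temporary_env(SLURM_GPUS_PER_NODE="4"):
    # Inside this block, an `oc.env` lookup of SLURM_GPUS_PER_NODE would succeed.
    print(os.environ["SLURM_GPUS_PER_NODE"])
```

Restoring the previous state in `finally` keeps the notebook session from leaking fake SLURM values into later cells.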

## Launching the training program

The training program can be launched from the command line with:

```bash
anemoi-training train --config-path=/path/to/training/config/directory --config-name=demo.yaml
```

One can also override some values directly from the command line:

```bash
anemoi-training train \
    --config-path=/path/to/training/config/directory \
    --config-name=demo.yaml \
    training.max_steps=100000
```
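
When launching from Python, for instance from a notebook, the same command can be assembled programmatically. This is a sketch that only uses the flags and the `key=value` override syntax shown above; `build_train_command` is a hypothetical helper, not an anemoi-training API.

```python
import shlex


def build_train_command(config_path: str, config_name: str, overrides: dict) -> list:
    """Assemble the argv list for `anemoi-training train` with key=value overrides."""
    command = [
        "anemoi-training",
        "train",
        f"--config-path={config_path}",
        f"--config-name={config_name}",
    ]
    # Each override becomes one dotted `key=value` argument.
    command += [f"{key}={value}" for key, value in overrides.items()]
    return command


command = build_train_command(
    "/path/to/training/config/directory",
    "demo.yaml",
    {"training.max_steps": 100000, "dataloader.batch_size.training": 1},
)
print(shlex.join(command))
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting issues.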

Launching the program like this can be useful for debugging, but full-scale training runs are typically launched through a submission script. In this notebook we focus on training on a SLURM cluster, for which we will use a submission script such as the one below. Note that the SLURM job specification can be specific to your cluster, so you might have to adapt it.

### Example SLURM submission script


In this case, we will be launching a distributed training over 8 nodes, with 4 GPUs per node. We will specify 4 tasks per node because PyTorch's Distributed Data Parallel (DDP) requires one process per GPU. In total we will have 8 × 4 = 32 GPUs, and therefore 32 processes.

In [None]:
SLURM_SUBMISSION_SCRIPT = """#!/bin/bash

#SBATCH --job-name="demo-anemoi-training"
#SBATCH --partition=normal
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=12:00:00
#SBATCH --exclusive
#SBATCH --output=slogs/debug.log
#SBATCH --error=slogs/debug.err

CONFIG_DIR=$(realpath configs/training)
CONFIG_NAME="demo.yaml"

echo "=================================================="
echo "Submitting anemoi-training job to SLURM..."
echo "Using config dir: $CONFIG_DIR"
echo "Using config name: $CONFIG_NAME"
echo "=================================================="

srun -u anemoi-training train --config-path $CONFIG_DIR --config-name $CONFIG_NAME
"""

Path("../scripts").mkdir(exist_ok=True)
with open("../scripts/submit_slurm.sh", "w") as f:
    f.write(SLURM_SUBMISSION_SCRIPT)
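
Before submitting, it can be worth checking that the resource request is consistent. The sketch below parses the relevant `#SBATCH` directives from a shortened copy of the script and computes the number of processes, 8 × 4 = 32. `sbatch_option` is a hypothetical helper, and the effective batch size assumes pure data parallelism with one GPU per model instance.

```python
import re

# Shortened copy of the directives in the submission script.
script = """#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
"""


def sbatch_option(text: str, name: str) -> int:
    """Return the integer value of an `#SBATCH --name=value` directive."""
    pattern = rf"^#SBATCH\s+--{re.escape(name)}=(\d+)\s*$"
    match = re.search(pattern, text, flags=re.MULTILINE)
    if match is None:
        raise KeyError(f"directive --{name} not found")
    return int(match.group(1))


nodes = sbatch_option(script, "nodes")
tasks_per_node = sbatch_option(script, "ntasks-per-node")
gpus_per_node = sbatch_option(script, "gpus-per-node")

# DDP expects one process per GPU, so the two per-node counts should agree.
assert tasks_per_node == gpus_per_node

world_size = nodes * tasks_per_node  # total number of processes
batch_size_per_gpu = 1  # the training batch size set in demo.yaml
print(world_size, world_size * batch_size_per_gpu)
```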

In [None]:
!cd ../ && sbatch scripts/submit_slurm.sh

/scratch/mch/fzanetta/projects/anemoi-demo-keisler2022/configs/training/demo.yaml


## Tracking your experiments

Anemoi supports two popular tools for tracking your experiments: MLflow and Weights & Biases (W&B). In the anemoi community, MLflow is currently considered the standard, so we will focus on that.