# HEIMDALL's configuration file

This notebook covers the definitions and descriptions of the input parameters.

__version__:  0.3.0

**NB:** the version reported in the YAML file must match the one installed.

--------------------------------------------------------------------------------------


Here's the configuration file's template:

```yaml
version: "0.3.0"

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

BUILD_GNN:
  PROJ_START_DATE: "2000-01-01T00:00:00"
  PROJ_END_DATE: "2050-12-31T23:59:59"
  INVENTORY_PATH: "Inventory_training_data.xml"
  NETWORKS: ["N", ]  # Network code selection of stations (if False or null, select ALL network codes in Inventory)
  PLOT_GRAPH_ARCH: True
  BASE_CONNECT_TYPE: "KNN"
  BASE_CONNECT_VALUE: 7  # degree
  SELF_LOOPS: True
  SCALE_DISTANCES: "max"   # Also NONE or FALSE or STD
  GNN_TAG: "heimdall_graph"  # stored inside the NPZ
  PLOT_BOUNDARIES: []

BUILD_GRID:
  BOUNDARIES: [130.3, 131.8,
               32.55, 33.5,
               0.0, 25.0]       # km
  SPACING_XYZ: [0.25, 0.25, 0.1]  # km
  CENTRE: False
  GRID_TAG: "heimdall_grid"  # stored in the NPZ

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

PREPARE_DATA:
  DOWNSAMPLE:
    new_df: 100.0
    hp_freq: 2   # The bandpass filter will be applied from HP-FREQ to the antialias frequency ALWAYS

  SOURCE_ERRORS_BY_PICKS:
    # n.picks in window:  [radius (km), max background noise, source noise]
    0:  [0.0, 0.0, 0.0]
    1:  [0.0, 0.0, 0.0]
    2:  [0.0, 0.0, 0.0]
    3:  [7.0, 0.0, 0.0]
    4:  [6.0, 0.0, 0.0]
    5:  [5.0, 0.0, 0.0]
    6:  [4.0, 0.0, 0.0]
    7:  [3.0, 0.0, 0.0]
    8:  [2.0, 0.0, 0.0]
    9:  [1.0, 0.0, 0.0]
    10: [0.7, 0.0, 0.0]
    11: [0.5, 0.0, 0.0]
    # higher number of picks in windows will keep the last key

  SLICING:
      wlen_seconds: 5.0
      slide_seconds: 0.5

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

TRAINING_PARAMETERS:
  # ----------  output
  PLOTS:
    make_plots:       True      # if False the code never calls gplt
    every_batches:    50        # produce a figure every N batches

  # ----------  data sampling & augmentations
  DATASET:
    how_many:         null      # null == use all events
    evenize:                    # arguments passed to __evenize_classes__
      min_pick_signal:  1
      reduce_data:      False
      noise_perc:       0.10
      signal_perc:      0.90
    batch_size: 8
    n_work: 5

  AUGMENTATION:
    enabled:          True

  RANDOM_SEED:        42

  # ----------  splits
  SPLIT:
    test:             0.10
    val:              0.10

  # ----------  optimiser & scheduler
  OPTIMISATION:
    learning_rate:    1.e-4
    epochs:           null      # null --> use early stopping (next block)
    early_stopping:
      patience:       7
      delta:          1.e-4

  # ----------  loss weighting
  LOCATOR_LOSS:          # W1_XY, W2_XZ, W3_YZ
    xy:   1.0
    xz:   1.0
    yz:   1.0

  COMPOSITE_LOSS:        # ALPHA, BETA, GAMMA
    alpha: 1.0
    beta:  1.0
    gamma: 1.0

  # ----------  model initialisation
  MODEL:
    pretrained_weights: ""      # path to previously trained model (i.e. transfer learning or fine tuning of the heads) or "" to train from scratch
    freeze_encoder:    False    # keep it False unless you've good reasons not to ...
```

---------------------------------------------------------

To start with, throughout the framework the configuration file is read with the `heimdall.io.read_configuration_file` function, which returns an `AttributeDict`-style dictionary. Its entries can therefore be accessed with dot syntax (e.g. `config.version` or `config.TRAINING_PARAMETERS.MODEL.freeze_encoder`).
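
The dot-style access can be illustrated with a minimal stand-in. The `AttrDict` class and the `to_attr` helper below are hypothetical names written only for this example; they are not the object returned by `heimdall.io.read_configuration_file`, and the nested dictionary simply mimics what a YAML loader would produce from a fragment of the template above.

```python
class AttrDict(dict):
    """Hypothetical stand-in: a dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(name) from exc


def to_attr(obj):
    """Recursively wrap nested dicts so dot access works at every level."""
    if isinstance(obj, dict):
        return AttrDict({key: to_attr(value) for key, value in obj.items()})
    if isinstance(obj, list):
        return [to_attr(item) for item in obj]
    return obj


# A fragment of the template, as a YAML loader would return it.
raw = {
    "version": "0.3.0",
    "TRAINING_PARAMETERS": {
        "MODEL": {"pretrained_weights": "", "freeze_encoder": False},
    },
}
config = to_attr(raw)

print(config.version)                                   # 0.3.0
print(config.TRAINING_PARAMETERS.MODEL.freeze_encoder)  # False
```

Regular dictionary access (`config["version"]`) keeps working in this sketch, since `AttrDict` is still a `dict`.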

Below is a **plain‑text description in Markdown** of every section and parameter found in `heimdall_v030_confs.yml`.  
For each key you’ll see:

* **Type** – accepted value(s) or data‑type.  
* **Default / Example** – the value shown in your file.  
* **Purpose** – what the code uses it for.

---

## Version

| Key | Type | Default / Example | Purpose |
|---|---|---|---|
| `version` | string (semantic) | '0.3.0' | Internal schema version; the loader checks this before parsing the file. |
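
As a rough illustration of the version requirement, the check could look like the sketch below. The function name `check_version` and the exact-string comparison are assumptions made for this example; the real loader may compare versions differently.

```python
def check_version(config_version, installed_version):
    """Raise if the YAML `version` differs from the installed package version.

    Assumption: an exact string match is required.
    """
    if str(config_version).strip() != str(installed_version).strip():
        raise ValueError(
            f"Configuration file version {config_version!r} does not match "
            f"the installed version {installed_version!r}"
        )
    return True


check_version("0.3.0", "0.3.0")  # passes silently
```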

---

## BUILD_GNN — graph‑building options

| Key | Type | Default / Example | Purpose |
|---|---|---|---|
| `PROJ_START_DATE` | ISO‑8601 datetime | '2000‑01‑01T00:00:00' | First timestamp of data to include when scanning waveform archives. |
| `PROJ_END_DATE` | ISO‑8601 datetime | '2050‑12‑31T23:59:59' | Last timestamp to include. |
| `INVENTORY_PATH` | path / string | 'Inventory_training_data.xml' | StationXML (or similar) inventory used to fetch station coordinates (nodes) & metadata. |
| `NETWORKS` | list of strings or **null** | ['N'] | Network codes to keep (e.g. 'IU', 'XR'). Empty/null -> keep all networks. |
| `PLOT_GRAPH_ARCH` | bool | True | If True, write a PNG/PDF visualisation of the final graph. |
| `BASE_CONNECT_TYPE` | 'KNN', 'DIST', … | 'KNN' | Strategy to connect stations. **KNN**: each node is linked to its `BASE_CONNECT_VALUE` nearest neighbours. **DBSCAN**: radius‑based linking. |
| `BASE_CONNECT_VALUE` | int | 7 | Value used by the chosen `BASE_CONNECT_TYPE` (e.g. *k* in k‑NN, or the radius in km if **DBSCAN** is used). |
| `SELF_LOOPS` | bool | True | Add self‑edges so convolution layers can mix a node’s own features. |
| `SCALE_DISTANCES` | 'max', 'std', 'none'/false | 'max' | How to normalise edge‑length attributes: <br>• **max** → divide by max distance. <br>• **std** → z‑score. <br>• **none/false** → leave raw kilometres. |
| `GNN_TAG` | string | 'heimdall_graph' | Name stored inside the .npz graph file (useful when several graphs sit in the same archive). |
| `PLOT_BOUNDARIES` | list [lon₁, lon₂, lat₁, lat₂] | [] (empty) | Geographic map bounds for plots; empty → auto‑fit to stations. |
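
To make the interplay of `BASE_CONNECT_TYPE: "KNN"`, `BASE_CONNECT_VALUE`, `SELF_LOOPS` and `SCALE_DISTANCES: "max"` more concrete, here is a minimal pure-Python sketch. It is not HEIMDALL's implementation: `build_knn_edges` is a hypothetical helper, station positions are assumed to be planar coordinates in km, and only the `"max"` scaling is handled (any other value leaves the distances untouched).

```python
import math


def build_knn_edges(coords, k, self_loops=True, scale="max"):
    """Return (source, target, distance) edges linking each node to its k nearest neighbours.

    coords: list of (x_km, y_km) station positions (planar, assumed).
    """
    edges = []
    for i, (xi, yi) in enumerate(coords):
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        )
        for dist, j in dists[:k]:
            edges.append((i, j, dist))
        if self_loops:
            edges.append((i, i, 0.0))  # SELF_LOOPS: a node is also its own neighbour

    if scale == "max":
        longest = max((d for _, _, d in edges), default=0.0)
        if longest > 0:
            edges = [(i, j, d / longest) for i, j, d in edges]
    return edges


stations = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (10.0, 10.0)]
edges = build_knn_edges(stations, k=2)
print(len(edges))  # 12: two neighbours plus one self-loop for each of the 4 stations
```

With `"max"` scaling the longest edge gets an attribute of exactly 1.0 and self-loops stay at 0.0.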

---

## BUILD_GRID — 3‑D locator grid

| Key | Type | Default / Example | Purpose |
|---|---|---|---|
| `BOUNDARIES` | list [lon₁, lon₂, lat₁, lat₂, zmin, zmax] | [130.3, 131.8, 32.55, 33.5, 0, 25] | Geographic extents (°) and depth range (km) for the search grid. |
| `SPACING_XYZ` | list [dx, dy, dz] (km) | [0.25, 0.25, 0.1] | Grid spacing along X (lon), Y (lat) and Z (depth). |
| `CENTRE` | bool | False | If True, re‑centre grid so (0,0,0) sits at the geometric centre. |
| `GRID_TAG` | string | 'heimdall_grid' | Key under which the grid is saved in the .npz file. |
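
Because `BOUNDARIES` and `SPACING_XYZ` together determine how large the grid becomes, it can help to estimate the node count before building it. The sketch below is only an approximation under stated assumptions: `grid_shape` is a hypothetical helper, degrees are converted to kilometres with a fixed 111.19 km per degree and a cosine correction at the mid latitude, and the projection actually used by HEIMDALL may differ.

```python
import math

KM_PER_DEG = 111.19  # assumed mean length of one degree of latitude


def grid_shape(boundaries, spacing_xyz):
    """Approximate number of grid nodes along X (lon), Y (lat) and Z (depth)."""
    lon1, lon2, lat1, lat2, zmin, zmax = boundaries
    dx, dy, dz = spacing_xyz
    mid_lat = math.radians(0.5 * (lat1 + lat2))
    x_km = (lon2 - lon1) * KM_PER_DEG * math.cos(mid_lat)
    y_km = (lat2 - lat1) * KM_PER_DEG
    z_km = zmax - zmin
    return tuple(
        int(round(length / step)) + 1
        for length, step in ((x_km, dx), (y_km, dy), (z_km, dz))
    )


nx, ny, nz = grid_shape([130.3, 131.8, 32.55, 33.5, 0.0, 25.0], [0.25, 0.25, 0.1])
print(nx, ny, nz, nx * ny * nz)  # nodes per axis and total node count
```

Halving the spacing on all three axes multiplies the total node count by roughly eight, which is worth keeping in mind when choosing `SPACING_XYZ`.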

---

## PREPARE_DATA — pre‑processing & label generation

### DOWNSAMPLE

| Key | Type | Default / Example | Purpose |
|---|---|---|---|
| `new_df` | float (Hz) | 100.0 | Target sampling rate after decimation. |
| `hp_freq` | float (Hz) | 2 | High‑pass corner; the pipeline always applies a band‑pass from hp_freq to the new Nyquist. |

### SOURCE_ERRORS_BY_PICKS

Defines how much random error/noise to inject into the synthetic source label, **based on how many P-picks per event fall inside a window**.

| n picks | [radius_km, max_noise, source_noise] | Meaning |
|---|---|---|
| 0–2 | [0, 0, 0] | No localisation information. |
| 3 | [7, 0, 0] | If only 3 picks, smear the PDF within a 7 km radius. |
| 4 | [6, 0, 0] | … |
| 5 | [5, 0, 0] | … |
| 6 | [4, 0, 0] | … |
| 7 | [3, 0, 0] | … |
| 8 | [2, 0, 0] | … |
| 9 | [1, 0, 0] | … |
| 10 | [0.7, 0, 0] | … |
| 11 | [0.5, 0, 0] | With many picks the allowed radius keeps shrinking. |
| >11 | last entry | Any higher pick count re‑uses the last line. |
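
The "re-use the last entry" rule from the table can be sketched as a simple lookup. The helper name `errors_for` is hypothetical; the dictionary reproduces the values from the template.

```python
SOURCE_ERRORS_BY_PICKS = {
    0: [0.0, 0.0, 0.0],
    1: [0.0, 0.0, 0.0],
    2: [0.0, 0.0, 0.0],
    3: [7.0, 0.0, 0.0],
    4: [6.0, 0.0, 0.0],
    5: [5.0, 0.0, 0.0],
    6: [4.0, 0.0, 0.0],
    7: [3.0, 0.0, 0.0],
    8: [2.0, 0.0, 0.0],
    9: [1.0, 0.0, 0.0],
    10: [0.7, 0.0, 0.0],
    11: [0.5, 0.0, 0.0],
}


def errors_for(n_picks, table=SOURCE_ERRORS_BY_PICKS):
    """Return [radius_km, max_noise, source_noise] for a pick count.

    Counts above the largest key fall back to the last entry.
    """
    key = max(k for k in table if k <= n_picks)
    return table[key]


print(errors_for(3))   # [7.0, 0.0, 0.0]
print(errors_for(25))  # [0.5, 0.0, 0.0]
```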

### SLICING

| Key | Type | Default / Example | Purpose |
|---|---|---|---|
| `wlen_seconds` | float | 5.0 | Window duration in seconds. Adjust it for your use case (e.g. shorter for microseismic events, longer for regional ones). Tip: compute the distribution of S‑P times in your dataset and use about twice its typical value. |
| `slide_seconds` | float | 0.5 | Rolling window's step (in seconds) between consecutive windows (`overlap = wlen − slide`). |
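
As an illustration of how `wlen_seconds` and `slide_seconds` interact, the sketch below lists the start times of all full windows that fit in a trace. `window_starts` is a hypothetical helper, and dropping the trailing partial window is an assumption of this example.

```python
import math


def window_starts(trace_seconds, wlen_seconds=5.0, slide_seconds=0.5):
    """Start times (s) of every full window that fits inside the trace."""
    if trace_seconds < wlen_seconds:
        return []
    # Small epsilon guards against floating-point round-off in the division.
    n_windows = math.floor((trace_seconds - wlen_seconds) / slide_seconds + 1e-9) + 1
    return [i * slide_seconds for i in range(n_windows)]


starts = window_starts(10.0)
print(len(starts))            # 11 windows in a 10 s trace
print(starts[0], starts[-1])  # 0.0 5.0
```

With the template values, consecutive windows overlap by 5.0 − 0.5 = 4.5 s, so every sample is seen by up to ten windows.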

---

## TRAINING_PARAMETERS

### PLOTS

| Key | Type | Default / Example | Purpose |
|---|---|---|---|
| `make_plots` | bool | True | Enable the plotting routines for the test dataset (after training), for a quick view of the training performance. |
| `every_batches` | int | 50 | Draw a figure every N batches of the test-dataset. |

### DATASET

| Key | Type | Default / Example | Purpose |
|---|---|---|---|
| `how_many` | int or null | null | Cap the number of training elements (useful for quick tests). If null, all the H5 rows are used. |
| `evenize` | dict | See below | Arguments passed to `__evenize_classes__`, which rebalances signal/noise windows. If `False` or `{}`, all elements are used. |
| `batch_size` | int | 8 | Samples per mini‑batch sent to the model loader (i.e. DataLoader batching). |
| `n_work` | int | 5 | Number of PyTorch DataLoader workers (parallel file readers for faster I/O at the training stage). A suggested value is `(nproc/2)-1`, where `nproc` is the number of processors available on your machine. |
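
The `(nproc/2)-1` rule of thumb for `n_work` can be computed as in the sketch below. `suggested_workers` is a hypothetical helper; clamping at zero is an addition of this example (in PyTorch, `num_workers=0` means data is loaded in the main process).

```python
import os


def suggested_workers(n_cpus=None):
    """Apply the (nproc / 2) - 1 rule, never going below zero."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    return max(n_cpus // 2 - 1, 0)


print(suggested_workers(12))  # 5, the value used in the template
print(suggested_workers())    # value for the current machine
```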

#### Evenize sub‑keys

| Sub‑key | Default | Purpose |
|---|---|---|
| `min_pick_signal` | 1 | Minimum picks to call a window “signal”. |
| `reduce_data` | False | If True, randomly drop part of the balanced set. |
| `noise_perc` | 0.10 | Desired fraction of noise windows after balancing. |
| `signal_perc` | 0.90 | Desired fraction of signal windows. |

### AUGMENTATION

| Key | Type | Default / Example | Purpose |
|---|---|---|---|
| `enabled` | bool | True | Toggles on‑the‑fly waveform augmentations (jitter, scaling, polarity flip …). |

### RANDOM_SEED

| Key | Type | Default / Example | Purpose |
|---|---|---|---|
| `RANDOM_SEED` | int | 42 | Seed for Python, NumPy and PyTorch to guarantee repeatable results. |

### SPLIT

| Key | Type | Default / Example | Purpose |
|---|---|---|---|
| `test` | float | 0.10 | Fraction of data set aside for the final test evaluation. |
| `val` | float | 0.10 | Fraction used for early‑stopping / hyper‑parameter tuning. |
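
A seeded split with these fractions might look like the following sketch. It assumes both fractions refer to the full dataset and that the remainder is used for training; `split_indices` is a hypothetical helper, not the function used by the training script.

```python
import random


def split_indices(n_items, test=0.10, val=0.10, seed=42):
    """Shuffle indices reproducibly and cut them into train/val/test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_items * test))
    n_val = int(round(n_items * val))
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```

Using a dedicated `random.Random(seed)` instance keeps the split repeatable without touching the global random state.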

### OPTIMISATION

| Key | Type | Default / Example | Purpose |
|---|---|---|---|
| `learning_rate` | float | 1e‑4 | Initial LR for the Adam optimiser. |
| `epochs` | int or null | null | Max epochs; null lets early‑stopping decide when to stop. |
| `early_stopping.patience` | int | 7 | Non‑improving epochs to wait before halting. |
| `early_stopping.delta` | float | 1e‑4 | Minimum drop in validation loss that counts as improvement. |
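
The roles of `patience` and `delta` can be sketched with a small helper class. `EarlyStopping` here is a hypothetical minimal version written for illustration, not the class used in the training script.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved by `delta` for `patience` epochs."""

    def __init__(self, patience=7, delta=1e-4):
        self.patience = patience
        self.delta = delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=3, delta=0.01)
decisions = [stopper.step(loss) for loss in [1.0, 0.9, 0.895, 0.894, 0.893]]
print(decisions)  # [False, False, False, False, True]
```

In the example the last three losses improve on 0.9 by less than `delta`, so they count as non-improving epochs and the third one triggers the stop.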

### LOCATOR_LOSS

| Key | Type | Default | Purpose |
|---|---|---|---|
| `xy` | float | 1.0 | Weight for XY plane pdf loss. |
| `xz` | float | 1.0 | Weight for XZ plane pdf loss. |
| `yz` | float | 1.0 | Weight for YZ plane pdf loss. |

### COMPOSITE_LOSS

| Key | Type | Default | Purpose |
|---|---|---|---|
| `alpha` | float | 1.0 | Weight for detector loss. |
| `beta` | float | 1.0 | Weight for averaged locator‑plane loss. |
| `gamma` | float | 1.0 | Weight for coordinate‑consistency loss. |
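
One way to read the two weighting tables together is sketched below. The function names are hypothetical, and the normalisation of the plane weights (a weighted average) is an assumption of this example; the actual loss in the training code may combine the terms differently.

```python
def locator_loss(l_xy, l_xz, l_yz, w_xy=1.0, w_xz=1.0, w_yz=1.0):
    """Weighted average of the three plane losses (averaging scheme assumed)."""
    total_weight = w_xy + w_xz + w_yz
    if total_weight == 0:
        raise ValueError("at least one LOCATOR_LOSS weight must be non-zero")
    return (w_xy * l_xy + w_xz * l_xz + w_yz * l_yz) / total_weight


def composite_loss(l_det, l_loc, l_coord, alpha=1.0, beta=1.0, gamma=1.0):
    """alpha * detector + beta * locator + gamma * coordinate-consistency."""
    return alpha * l_det + beta * l_loc + gamma * l_coord


l_loc = locator_loss(0.3, 0.6, 0.9)
total = composite_loss(l_det=0.2, l_loc=l_loc, l_coord=0.1)
print(round(l_loc, 6), round(total, 6))  # 0.6 0.9
```

Setting one of `alpha`, `beta` or `gamma` to 0.0 removes the corresponding term from the total in this sketch.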

### MODEL

| Key | Type | Default / Example | Purpose |
|---|---|---|---|
| `pretrained_weights` | path / string | '' (empty) | Optional .pt file to warm‑start from; empty → train from scratch. |
| `freeze_encoder` | bool | False | If True, the CNN encoder’s weights stay frozen (transfer‑learning). |

---

### Section pipeline summary

1. BUILD_GNN & BUILD_GRID run once to create two .npz assets (heimdall_graph.npz, heimdall_grid.npz).
2. PREPARE_DATA slices raw MiniSEED, injects label uncertainty and writes per‑event shards.
3. HeimdallCore_2a_createHdf5.py merges shards; TRAINING_PARAMETERS then steer HeimdallCore_3_Training.py.


# Which script uses what?

Each script mainly uses its own section of the configuration file:

**core/**
- `HeimdallCore_1_BuildNetwork.py`: uses the `BUILD_GNN` section
- `HeimdallCore_1a_BuildGrid.py`: uses the `BUILD_GRID` section
- `HeimdallCore_2_PrepareDataset.py`: uses the `PREPARE_DATA` section
- `HeimdallCore_2a_createHdf5.py`: uses none
- `HeimdallCore_3_Training.py`: uses the `TRAINING_PARAMETERS` section
- `HeimdallCore_4_Predict.py`
- `HeimdallCore_5_ExtractResults.py`

**utils/**
- `HeimdallUtils_Plot_Label.py`
- `HeimdallUtils_Plot_Catalog.py`
- `HeimdallUtils_Plot_Event.py`  
