# Config Deep Dive

The config class is the central place to customize `AUTOENCODIX` pipelines. This notebook is more of a reference than a tutorial: it mainly lists config parameters with explanations and default values and includes only a few code examples.

**IMPORTANT**
> This tutorial explains specific concepts of our config class. If you're unfamiliar with general concepts,  
> we recommend following the `Getting Started - Vanillix` Tutorial first.

## What You'll Learn

We'll cover the following points:

- The two ways to provide a config:
  - as an instance of our config class
  - as a `YAML` file
- Which config parameters you should set
- The two parts of our config:
  - main config
  - data config
- Pipeline-specific config parameters
- Default config reference


## 1) How to Provide a Config

The main way is to pass a config instance to the pipeline object, as you've seen many times in the pipeline tutorials. The instance can be a `DefaultConfig` or one of its pipeline-specific subclasses, such as the `VarixConfig` used below.

#### 1.1 Provide Config as a Class Instance


In [1]:
import autoencodix as acx
from autoencodix.configs.default_config import DefaultConfig, DataCase
from autoencodix.configs import VarixConfig
from autoencodix.utils.example_data import EXAMPLE_MULTI_BULK
import yaml
from pathlib import Path

config = VarixConfig(
    latent_dim=8, scaling="MINMAX", data_case=DataCase.MULTI_BULK, epochs=10
)
varix = acx.Varix(config=config, data=EXAMPLE_MULTI_BULK)


in handle_direct_user_data with data: <class 'autoencodix.data.datapackage.DataPackage'>


#### 1.2 Provide the Config as a YAML file

In our GitHub repo, we prepared a directory called `configs` with sample YAML files.  
We can easily load the values into our config class with the `model_validate` method, as shown below:

#### ❗❗ Requirements: Getting Tutorial Data ❗❗
To follow along, please download the data from the link below (1 GB). If you've already done the `Ontix` tutorial, you have probably downloaded it already.

https://cloud.scadsai.uni-leipzig.de/index.php/s/QXYnieKY8AA3Zta/download/OntixTutorialData.zip

After downloading:
- from the root of the repository, create the folders `data/raw` if they don't exist yet
- move the downloaded files there

#### Get the Correct Path
We assume you are in the root of the package. The following code ensures that the correct paths are used.
[1] Tutorials/DeepDives/ConfigTutorial.ipynb


In [5]:
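# First, make sure we are in the repository root
import os

cwd = os.getcwd()
root_name = "autoencodix_package"
parts = cwd.split(os.sep)
# Check against whole path components so a partial name match cannot slip through
if root_name not in parts:
    raise FileNotFoundError(f"'{root_name}' not found in path: {cwd}")
# Cut the path right after the repository folder and change into it
os.chdir(os.sep.join(parts[: parts.index(root_name) + 1]))
print(f"Changed to: {os.getcwd()}")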

# now we can load the config
custom_config = VarixConfig.model_validate(
    {
        **yaml.safe_load(Path("configs/multi_bulk.yaml").read_text()),
        "learning_rate": 0.77,
    }
)
# and pass to a pipeline
varix = acx.Varix(config=custom_config)
r = varix.run()


Changed to: /Users/maximilianjoas/development/autoencodix_package
reading parquet: data/raw/combined_rnaseq_formatted.parquet
reading parquet: data/raw/combined_meth_formatted.parquet
reading parquet: data/raw/combined_clin_formatted.parquet
anno key: paired
Epoch 1 - Train Loss: 47.9436
Sub-losses: recon_loss: 27.6438, var_loss: 20.2998, anneal_factor: 0.0000, effective_beta_factor: 0.0000
Epoch 1 - Valid Loss: 18.8278
Sub-losses: recon_loss: 18.8271, var_loss: 0.0007, anneal_factor: 0.0000, effective_beta_factor: 0.0000
Epoch 2 - Train Loss: 359.4296
Sub-losses: recon_loss: 20.3551, var_loss: 339.0744, anneal_factor: 0.0344, effective_beta_factor: 0.0034
Epoch 2 - Valid Loss: 43.4410
Sub-losses: recon_loss: 18.7178, var_loss: 24.7232, anneal_factor: 0.0344, effective_beta_factor: 0.0034
Epoch 3 - Train Loss: 9603.1596
Sub-losses: recon_loss: 20.6843, var_loss: 9582.4751, anneal_factor: 0.9656, effective_beta_factor: 0.0966
Epoch 3 - Valid Loss: 18.5837
Sub-losses: recon_loss: 18.5836
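
The override in the cell above relies on plain dictionary unpacking: when a key appears twice, the later entry wins. A minimal sketch of that behavior, using a hypothetical stand-in for the parsed YAML content (the keys and values are illustrative, not the actual contents of `configs/multi_bulk.yaml`):

```python
# Stand-in for the dict returned by yaml.safe_load(...); values are illustrative.
loaded_from_yaml = {"learning_rate": 0.001, "epochs": 3, "latent_dim": 16}

# Keys listed after the unpacked dict take precedence over the loaded values.
merged = {**loaded_from_yaml, "learning_rate": 0.77}

print(merged)
```

The original dict is left untouched, so the same YAML content can be reused with different overrides.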

**Note**  
> We can also go the reverse way and save our config object to a YAML file, as shown below:


In [6]:
output_path = Path("default_config.yaml")

# Convert to a plain Python dict first; mode="json" turns enums and similar types into plain values
config_dict = config.model_dump(mode="json")
# Write YAML with nice formatting
with output_path.open("w") as f:
    yaml.dump(config_dict, f, sort_keys=False, default_flow_style=False)

print(f"✅ Saved config to {output_path.resolve()}")

✅ Saved config to /Users/maximilianjoas/development/autoencodix_package/default_config.yaml
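
When a config holds enum-valued fields, it can help to convert them to plain values before writing, so that the file contains only strings and numbers. The sketch below illustrates the idea with JSON from the standard library in place of YAML and a hypothetical `Case` enum. Whether the real config needs this step depends on how its fields are defined.

```python
import json
from enum import Enum


class Case(Enum):
    # Hypothetical stand-in for an enum-valued config field.
    MULTI_BULK = "Multi Bulk"


def to_plain(value):
    """Recursively replace enum members with their underlying values."""
    if isinstance(value, Enum):
        return value.value
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    return value


config_dict = {"data_case": Case.MULTI_BULK, "epochs": 10}
plain = to_plain(config_dict)

# The plain dict survives a serialize/parse round trip unchanged.
restored = json.loads(json.dumps(plain))
```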


## 2) Which Config Parameters to Set

To make **AUTOENCODIX** easy to use, we provide sensible defaults per pipeline, so you don't need to set 30+ config parameters before starting.  
However, there are a few parameters you should consider adjusting and, depending on the pipeline and data type, a few mandatory ones you need to set.


### 2.1) Mandatory Config Parameters
*(These depend on your pipeline; see the pipeline-specific documentation for details.)*


### 2.2) Parameters to Consider

Although we set sensible defaults per pipeline for these parameters, it might make sense to adjust the following values:

| **Parameter** | **Type** | **Description** |
|----------------|-----------|-----------------|
| `latent_dim` | `int` | Dimension of the latent space |
| `epochs` | `int` | Number of training epochs |
| `batch_size` | `int` | Number of samples per batch (must be >1 due to BatchNorm) |
| `filtering` | `Literal['VAR', 'MAD', 'CORR', 'VARCORR', 'NOFILT', 'NONZEROVAR']` | Feature filtering method |
| `scaling` | `Literal['STANDARD', 'MINMAX', 'ROBUST', 'MAXABS', 'NONE']` | How to scale your input data; should be compatible with your loss function |
| `k_filter` | `Optional[int]` | Number of features to keep |
| `learning_rate` | `float` | Learning rate for optimization |
| `reconstruction_loss` | `Literal['mse', 'bce']` | Type of reconstruction loss: `mse` = Mean Squared Error, `bce` = Binary Cross-Entropy |
| `default_vae_loss` | `Literal['kl', 'mmd']` | Type of VAE loss: `kl` = Kullback-Leibler Divergence, `mmd` = Maximum Mean Discrepancy |
| `beta` | `float` | β weighting factor for VAE loss |


## 3) General Config and Data Config
Our config class consists of two parts: (a) a general part where we set global parameters, and (b) a `DataConfig` where we can set parameters for each data type individually.  
Inside the `DataConfig` we have a `DataInfo` object where we can set the parameters; see a minimal code example below.  
You can find a full list of parameters in [section 5.1](#51-dataconfig--configuration-parameters).


In [7]:
import os
import autoencodix as acx
from autoencodix.configs.default_config import DataConfig, DataInfo, DataCase
from autoencodix.configs import VarixConfig
from autoencodix.utils.example_data import raw_protein, raw_rna, annotation
from autoencodix.data import DataPackage


my_dp = DataPackage(
    multi_bulk={"rna": raw_rna, "protein": raw_protein}, annotation={"anno": annotation}
)
data_config = DataConfig(
    data_info={
        "rna": DataInfo(scaling="MINMAX", data_type="NUMERIC"),
        "protein": DataInfo(scaling="MINMAX", data_type="NUMERIC"),
    },
)

bulk_config = VarixConfig(data_config=data_config, data_case=DataCase.MULTI_BULK, epochs=30)
varix = acx.Varix(config=bulk_config, data=my_dp)

in handle_direct_user_data with data: <class 'autoencodix.data.datapackage.DataPackage'>


## 4) Pipeline Specific Configs
We have one `DefaultConfig` class that defines all configurable parameters and sets sensible defaults.  
However, the same defaults are not sensible for every pipeline (Varix, XModalix, etc.).  
Thus, each pipeline has its own config class that inherits from `DefaultConfig` and overrides some default values.  
When you build a config object, make sure to use the config class that matches your pipeline.  
The config can be imported from `autoencodix.configs`. The naming convention is `<pipeline-name>Config`, for example: `VarixConfig` or `OntixConfig`.
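
The inheritance pattern described above can be sketched with standard-library dataclasses. The class names, fields, and default values below are hypothetical and only illustrate the mechanism; the real config classes may be implemented differently.

```python
from dataclasses import dataclass


@dataclass
class BaseConfig:
    # Hypothetical base class with shared defaults (values are illustrative).
    latent_dim: int = 16
    epochs: int = 3
    beta: float = 1.0


@dataclass
class PipelineAConfig(BaseConfig):
    # A pipeline-specific subclass only redeclares the defaults it changes.
    beta: float = 0.1


base = BaseConfig()
pipeline_a = PipelineAConfig()
custom = PipelineAConfig(epochs=30)  # user values still override any default
```

Everything not redeclared in the subclass is inherited unchanged, which is why picking the matching config class matters: it determines which defaults apply to parameters you do not set yourself.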


## 5) Configuration Parameters Reference
All configurable parameters are listed below, grouped by function:

#### **Data Configuration**

| **Parameter** | **Type** | **Description** |
|----------------|-----------|-----------------|
| `data_config` | `autoencodix.configs.default_config.DataConfig` | Contains detailed information about each data modality (see `DataConfig` section) |
| `img_path_col` | `str` | When working with images, defines the column name containing image paths per sample |
| `requires_paired` | `Optional[bool]` | Indicates whether samples for xmodalix are paired (based on sample ID) |
| `data_case` | `Optional[DataCase]` | Data case for the model (auto-determined) |
| `k_filter` | `Optional[int]` | Number of features to keep |
| `scaling` | `Literal['STANDARD', 'MINMAX', 'ROBUST', 'MAXABS', 'NONE']` | Global scaling setting (can be overridden per modality) |
| `skip_preprocessing` | `bool` | Skip scaling, filtering, and cleaning |
| `class_param` | `Optional[str]` | Optional column name for class labels |


#### **Model Architecture**

| **Parameter** | **Type** | **Description** |
|----------------|-----------|-----------------|
| `latent_dim` | `int` | Dimension of the latent space |
| `n_layers` | `int` | Number of encoder/decoder layers (excluding latent layer) |
| `enc_factor` | `int` | Encoder dimension scaling factor |
| `input_dim` | `int` | Input feature dimension |
| `drop_p` | `float` | Dropout probability |
| `save_memory` | `bool` | Skip storing `TrainingDynamics` to save memory |


#### **Training Hyperparameters**

| **Parameter** | **Type** | **Description** |
|----------------|-----------|-----------------|
| `learning_rate` | `float` | Learning rate for optimization |
| `batch_size` | `int` | Samples per batch (>1 required due to BatchNorm) |
| `epochs` | `int` | Number of training epochs |
| `weight_decay` | `float` | L2 regularization factor |
| `reconstruction_loss` | `Literal['mse', 'bce']` | Type of reconstruction loss |
| `default_vae_loss` | `Literal['kl', 'mmd']` | Type of VAE loss |
| `loss_reduction` | `Literal['sum', 'mean']` | Loss reduction mode in PyTorch |
| `beta` | `float` | β weight for VAE loss |
| `beta_mi` | `float` | β weight for mutual information term |
| `beta_tc` | `float` | β weight for total correlation term |
| `beta_dimKL` | `float` | β weight for dimension-wise KL |
| `use_mss` | `bool` | Use minibatch stratified sampling for disentangled VAE loss |
| `gamma` | `float` | γ weight for adversarial loss (XModalix classifier) |
| `delta_pair` | `float` | δ weight for paired loss (XModalix training) |
| `delta_class` | `float` | δ weight for class loss (XModalix training) |
| `anneal_function` | `Literal['5phase-constant', '3phase-linear', '3phase-log', 'logistic-mid', 'logistic-early', 'logistic-late', 'no-annealing']` | Annealing function strategy for VAE loss |
| `pretrain_epochs` | `int` | Number of pretraining epochs (can differ per modality) |
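
In the training log of section 1.2, `effective_beta_factor` is one tenth of `anneal_factor`, which is consistent with an annealing factor that multiplies β during training. The sketch below shows that idea with a generic logistic ramp. The curve shape, the `steepness` parameter, and the function names are assumptions for illustration, not the library's schedules.

```python
import math


def logistic_anneal(epoch, total_epochs, steepness=10.0):
    """Logistic ramp from about 0 to about 1, centered at the middle of training."""
    progress = epoch / max(total_epochs - 1, 1)  # 0.0 at the start, 1.0 at the end
    return 1.0 / (1.0 + math.exp(-steepness * (progress - 0.5)))


def effective_beta(beta, epoch, total_epochs):
    """Weight applied to the VAE loss term at a given epoch."""
    return beta * logistic_anneal(epoch, total_epochs)


# With beta = 0.1 the effective weight rises from near 0 towards 0.1.
schedule = [effective_beta(0.1, epoch, 11) for epoch in range(11)]
```

Starting with a small effective β lets the model focus on reconstruction first and adds the regularization term gradually.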

---

#### **Device & Performance**

| **Parameter** | **Type** | **Description** |
|----------------|-----------|-----------------|
| `device` | `Literal['cpu', 'cuda', 'gpu', 'tpu', 'mps', 'auto']` | Compute device |
| `n_gpus` | `int` | Number of GPUs to use |
| `checkpoint_interval` | `int` | Checkpoint save interval |
| `float_precision` | `Literal['transformer-engine', 'transformer-engine-float16', '16-true', '16-mixed', 'bf16-true', 'bf16-mixed', '32-true', '64-true', '64', '32', '16', 'bf16']` | Floating-point precision |
| `gpu_strategy` | `Literal['auto', 'dp', 'ddp', 'ddp_spawn', 'ddp_find_unused_parameters_true', 'xla', 'deepspeed', 'fsdp']` | GPU parallelization strategy |


#### **Data Splits & Reproducibility**

| **Parameter** | **Type** | **Description** |
|----------------|-----------|-----------------|
| `train_ratio` | `float` | Training split ratio |
| `test_ratio` | `float` | Test split ratio |
| `valid_ratio` | `float` | Validation split ratio |
| `min_samples_per_split` | `int` | Minimum samples per split |
| `reproducible` | `bool` | Ensure reproducibility |
| `global_seed` | `int` | Global random seed |
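
The split ratios and the seed work together: the ratios decide how many samples land in each split, and a fixed seed makes the assignment repeatable. The sketch below assumes the three ratios should sum to 1 and uses a hypothetical `split_indices` helper. It does not reproduce the library's splitter or its handling of `min_samples_per_split`.

```python
import random


def split_indices(n_samples, train_ratio, valid_ratio, test_ratio, seed):
    """Shuffle sample indices with a fixed seed and cut them into three splits."""
    if abs(train_ratio + valid_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("split ratios must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, global state stays untouched
    n_train = round(n_samples * train_ratio)
    n_valid = round(n_samples * valid_ratio)
    return (
        indices[:n_train],
        indices[n_train : n_train + n_valid],
        indices[n_train + n_valid :],  # the remainder goes to the test split
    )


train, valid, test = split_indices(100, 0.7, 0.2, 0.1, seed=42)
```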


#### 5.1 DataConfig — Configuration Parameters

---

##### **DataConfig**

| **Parameter** | **Type** | **Description** |
|----------------|-----------|-----------------|
| `data_info` | `Dict[str, DataInfo]` | Dictionary mapping modality names (e.g. `"RNA"`, `"IMG"`) to their `DataInfo` configuration |
| `require_common_cells` | `Optional[bool]` | Whether to require that all data modalities share a common set of cells/samples |
| `annotation_columns` | `Optional[List[str]]` | List of column names from the annotation file to include as metadata |

###### **DataInfo**

| **Parameter** | **Type** | **Description** |
|----------------|-----------|-----------------|
| `file_path` | `str` | Path to the raw data file |
| `data_type` | `Literal['NUMERIC', 'CATEGORICAL', 'IMG', 'ANNOTATION']` | Type of data modality |
| `scaling` | `Literal['STANDARD', 'MINMAX', 'ROBUST', 'MAXABS', 'NONE', 'NOTSET']` | Overrides the globally set scaling method for this modality |
| `filtering` | `Literal['VAR', 'MAD', 'CORR', 'VARCORR', 'NOFILT', 'NONZEROVAR']` | Feature filtering method |
| `sep` | `Optional[str]` | Delimiter for CSV/TSV input files (passed to `pandas.read_csv`) |
| `extra_anno_file` | `Optional[str]` | Path to an additional annotation file |
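
One way to read the `'NOTSET'` option of the per-modality `scaling` parameter is as "fall back to the global setting". The sketch below shows that resolution logic under this assumption; `resolve_scaling` is a hypothetical helper, not a library function.

```python
def resolve_scaling(global_scaling, modality_scaling="NOTSET"):
    """Use the modality-level scaling unless it was left at 'NOTSET'."""
    if modality_scaling == "NOTSET":
        return global_scaling
    return modality_scaling


global_scaling = "STANDARD"
per_modality = {"rna": "MINMAX", "protein": "NOTSET"}
resolved = {
    name: resolve_scaling(global_scaling, scaling)
    for name, scaling in per_modality.items()
}
```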


**Single-Cell Specific Parameters**

| **Parameter** | **Type** | **Description** |
|----------------|-----------|-----------------|
| `is_single_cell` | `bool` | Whether the dataset represents single-cell data |
| `min_cells` | `float` | Minimum fraction of cells in which a gene must be expressed to be kept (filters rare genes) |
| `min_genes` | `float` | Minimum fraction of genes a cell must express to be kept (filters low-quality cells) |
| `selected_layers` | `List[str]` | Layers to include from the single-cell dataset; must always include `"X"` |
| `is_X` | `bool` | Whether the data originates from the `"X"` matrix only |
| `normalize_counts` | `bool` | Whether to normalize single-cell counts by total expression per cell |
| `log_transform` | `bool` | Whether to apply `log1p` transformation after normalization |
| `k_filter` | `Optional[int]` | Automatically set based on global config; do not override manually |
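
The effect of `normalize_counts`, `log_transform`, and `min_cells` can be sketched in plain Python. The helper names and the `target_sum` value are hypothetical, and the library's own preprocessing may differ in detail.

```python
import math


def normalize_and_log(counts, target_sum=10_000.0):
    """Scale one cell's counts to a common total, then apply log1p."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(count / total * target_sum) for count in counts]


def keep_gene(expression_per_cell, min_cells):
    """Keep a gene if the fraction of cells expressing it reaches min_cells."""
    expressing = sum(1 for value in expression_per_cell if value > 0)
    return expressing / len(expression_per_cell) >= min_cells


cell = [0, 5, 15]
transformed = normalize_and_log(cell)

# A gene expressed in 1 of 4 cells is dropped at a threshold of 0.5.
rare_gene_kept = keep_gene([0, 0, 0, 3], min_cells=0.5)
```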


**Image-Specific Parameters**

| **Parameter** | **Type** | **Description** |
|----------------|-----------|-----------------|
| `img_width_resize` | `Optional[int]` | Target width for image resizing (must equal height) |
| `img_height_resize` | `Optional[int]` | Target height for image resizing (must equal width) |


**XModalix & Translation Parameters**

| **Parameter** | **Type** | **Description** |
|----------------|-----------|-----------------|
| `translate_direction` | `Optional[Literal['from', 'to']]` | Defines translation direction in cross-modal (XModalix) training |
| `pretrain_epochs` | `int` | Number of pretraining epochs specific to this modality (overrides global pretraining setting) |


**Validation Rules**

- `selected_layers` must always contain `"X"`.  
- `img_width_resize` and `img_height_resize` must be **positive integers** and **equal** (enforces square resizing).  
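
The two rules above can be expressed as a small check. This is a standard-library sketch with hypothetical function names. Treating the resize parameters as "both or neither" is an assumption, and the library's own validators may raise different errors.

```python
def validate_data_info(selected_layers, img_width_resize=None, img_height_resize=None):
    """Raise ValueError if one of the documented rules is violated."""
    if "X" not in selected_layers:
        raise ValueError('selected_layers must always contain "X"')
    sizes = (img_width_resize, img_height_resize)
    if any(size is not None for size in sizes):
        if not all(isinstance(size, int) and size > 0 for size in sizes):
            raise ValueError("resize dimensions must be positive integers")
        if img_width_resize != img_height_resize:
            raise ValueError("resize width and height must be equal")


def is_valid(**kwargs):
    """Convenience wrapper that reports the outcome as a boolean."""
    try:
        validate_data_info(**kwargs)
        return True
    except ValueError:
        return False
```

For example, `is_valid(selected_layers=["X"], img_width_resize=64, img_height_resize=32)` is rejected because the target size is not square.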

## 6) List Config Parameters Dynamically

If you want to see all available parameters directly via Python, you can call `<config-instance>.print_schema()` as shown below.


In [8]:
config = DefaultConfig()
config.print_schema()


DefaultConfig Configuration Parameters:
--------------------------------------------------

data_config:
  Type: <class 'autoencodix.configs.default_config.DataConfig'>
  Default: data_info={} require_common_cells=False annotation_columns=None
  Description: No description available

img_path_col:
  Type: <class 'str'>
  Default: img_paths
  Description: When working with images, we except a column in your annotation file that specifies the path of the image for a particular sample. Here you can define the name of this column

requires_paired:
  Type: typing.Optional[bool]
  Default: PydanticUndefined
  Description: Indicator if the samples for the xmodalix are paired, based on some sample id

data_case:
  Type: typing.Optional[autoencodix.configs.default_config.DataCase]
  Default: PydanticUndefined
  Description: Data case for the model, will be determined automatically

k_filter:
  Type: typing.Optional[int]
  Default: 20
  Description: Number of features to keep

scaling:
  Type: 