
# Config Deep Dive
Our config class is the central place to customize the `AUTOENCODIX` pipelines. This notebook is more of a reference document than a tutorial: it mainly lists config parameters with explanations and default values, and contains only a few code cells.

**IMPORTANT**
> This tutorial explains specific concepts of our config class. If you're unfamiliar with the general concepts,  
> we recommend following the `Getting Started - Vanillix` tutorial first.

## What You'll Learn
We'll cover the following points:
- The two ways to provide a config:
  - as instance of our config class
  - as a `YAML` file
- Which config parameters you should set
- The two parts of our config:
  - main config
  - data config
- Pipeline specific config parameters
- Default config reference

## 1) How to Provide a Config
The main way is to pass an instance of our `DefaultConfig` class to the pipeline object as you've seen many times in the pipeline tutorials.
#### 1.1 Provide Config as a Class Instance

In [4]:
import autoencodix as acx
from autoencodix.configs.default_config import DefaultConfig, DataCase
from autoencodix.utils.example_data import EXAMPLE_MULTI_BULK
import yaml
from pathlib import Path

config = DefaultConfig(
    latent_dim=8, scaling="MINMAX", data_case=DataCase.MULTI_BULK, epochs=10
)
varix = acx.Varix(config=config, data=EXAMPLE_MULTI_BULK)


in handle_direct_user_data with data: <class 'autoencodix.data.datapackage.DataPackage'>




#### 1.2 Provide the Config as a YAML file
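In our GitHub repo, we provide a directory called `configs` with sample YAML files.  
We can load the values into our config class with the `model_validate` method, as shown below. Individual values from the file can be overridden by adding keys to the dict, as done here for `learning_rate`: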

In [9]:
# first make sure we are in the repository root
import os

p = os.getcwd()
d = "autoencodix_package"
if d not in p:
    raise FileNotFoundError(f"'{d}' not found in path: {p}")
os.chdir(os.sep.join(p.split(os.sep)[: p.split(os.sep).index(d) + 1]))
print(f"Changed to: {os.getcwd()}")

# now we can load the config
custom_config = DefaultConfig.model_validate(
    {
        **yaml.safe_load(Path("configs/multi_bulk.yaml").read_text()),
        "learning_rate": 0.77,
    }
)
# and pass to a pipeline
varix = acx.Varix(config=custom_config)
r = varix.run()


Changed to: /Users/maximilianjoas/development/autoencodix_package
reading parquet: data/raw/mini/bulk/clinical_sample_data.parquet




anno key: paired




Epoch 1 - Train Loss: 31.4416
Sub-losses: recon_loss: 31.4414, var_loss: 0.0002, anneal_factor: 0.0000, effective_beta_factor: 0.0000
Epoch 1 - Valid Loss: 6103076175872.0000
Sub-losses: recon_loss: 6103076175872.0000, var_loss: 66080.3906, anneal_factor: 0.0000, effective_beta_factor: 0.0000
Epoch 2 - Train Loss: 355.1007
Sub-losses: recon_loss: 31.8294, var_loss: 323.2712, anneal_factor: 0.0344, effective_beta_factor: 0.0344
Epoch 2 - Valid Loss: 210153668608.0000
Sub-losses: recon_loss: 210128601088.0000, var_loss: 25069478.0000, anneal_factor: 0.0344, effective_beta_factor: 0.0344
Epoch 3 - Train Loss: 89.9097
Sub-losses: recon_loss: 23.5348, var_loss: 66.3749, anneal_factor: 0.9656, effective_beta_factor: 0.9656
Epoch 3 - Valid Loss: 2088455503872.0000
Sub-losses: recon_loss: 2087284375552.0000, var_loss: 1171138688.0000, anneal_factor: 0.9656, effective_beta_factor: 0.9656
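
The `learning_rate` override in the loading cell relies on Python's dictionary unpacking: a key written after the `**` expansion replaces the same-named key from the unpacked dict. A minimal sketch, using a made-up stand-in for the parsed YAML (the real contents of `configs/multi_bulk.yaml` are not shown in this notebook and may differ):

```python
# Hypothetical stand-in for yaml.safe_load(Path(...).read_text()).
file_values = {"latent_dim": 8, "epochs": 3, "learning_rate": 0.001}

# Later keys win, so the explicit entry replaces the value read from the file.
merged = {**file_values, "learning_rate": 0.77}

print(merged)
```

The original dict is left untouched, so the same file values can be reused with different overrides.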


**Note** 
> We got a warning that our config 

In [5]:
output_path = Path("default_config.yaml")

# Convert to plain Python dict first
config_dict = config.model_dump()
# Write YAML with nice formatting
with output_path.open("w") as f:
    yaml.dump(config_dict, f, sort_keys=False, default_flow_style=False)

print(f"✅ Saved config to {output_path.resolve()}")

✅ Saved config to /Users/maximilianjoas/development/autoencodix_package/Tutorials/DeepDives/default_config.yaml



# 🧩 DefaultConfig: Configuration Parameters

---

## **Data Configuration**

| **Parameter** | **Type** | **Default** | **Description** |
|----------------|-----------|--------------|-----------------|
| `data_config` | `autoencodix.configs.default_config.DataConfig` | `data_info={} require_common_cells=False annotation_columns=None` | No description available |
| `img_path_col` | `str` | `img_paths` | When working with images, defines the column name containing image paths per sample |
| `requires_paired` | `Optional[bool]` | *PydanticUndefined* | Indicates whether samples for xmodalix are paired (based on sample ID) |
| `data_case` | `Optional[DataCase]` | *PydanticUndefined* | Data case for the model (auto-determined) |
| `k_filter` | `Optional[int]` | `20` | Number of features to keep |
| `scaling` | `Literal['STANDARD', 'MINMAX', 'ROBUST', 'MAXABS', 'NONE']` | `STANDARD` | Global scaling setting (can be overridden per modality) |
| `skip_preprocessing` | `bool` | `False` | Skip scaling, filtering, and cleaning |
| `class_param` | `Optional[str]` | `None` | No description available |

---

## **Model Architecture**

| **Parameter** | **Type** | **Default** | **Description** |
|----------------|-----------|--------------|-----------------|
| `latent_dim` | `int` | `16` | Dimension of the latent space |
| `n_layers` | `int` | `3` | Number of encoder/decoder layers (excluding latent layer) |
| `enc_factor` | `int` | `4` | Encoder dimension scaling factor |
| `input_dim` | `int` | `10000` | Input feature dimension |
| `drop_p` | `float` | `0.1` | Dropout probability |
| `save_memory` | `bool` | `False` | Skip storing `TrainingDynamics` to save memory |

---

## **Training Hyperparameters**

| **Parameter** | **Type** | **Default** | **Description** |
|----------------|-----------|--------------|-----------------|
| `learning_rate` | `float` | `0.001` | Learning rate for optimization |
| `batch_size` | `int` | `32` | Samples per batch (>1 required due to BatchNorm) |
| `epochs` | `int` | `3` | Number of training epochs |
| `weight_decay` | `float` | `0.01` | L2 regularization factor |
| `reconstruction_loss` | `Literal['mse', 'bce']` | `mse` | Type of reconstruction loss |
| `default_vae_loss` | `Literal['kl', 'mmd']` | `kl` | Type of VAE loss |
| `loss_reduction` | `Literal['sum', 'mean']` | `sum` | Loss reduction mode in PyTorch |
| `beta` | `float` | `1` | β weight for VAE loss |
| `beta_mi` | `float` | `1` | β weight for mutual information term |
| `beta_tc` | `float` | `1` | β weight for total correlation term |
| `beta_dimKL` | `float` | `1` | β weight for dimension-wise KL |
| `use_mss` | `bool` | `True` | Use minibatch stratified sampling for disentangled VAE loss |
| `gamma` | `float` | `10.0` | γ weight for adversarial loss (XModalix classifier) |
| `delta_pair` | `float` | `5.0` | δ weight for paired loss (XModalix training) |
| `delta_class` | `float` | `5.0` | δ weight for class loss (XModalix training) |
| `anneal_function` | `Literal['5phase-constant', '3phase-linear', '3phase-log', 'logistic-mid', 'logistic-early', 'logistic-late', 'no-annealing']` | `logistic-mid` | Annealing function strategy for VAE loss |
| `pretrain_epochs` | `int` | `0` | Pretraining epochs (can differ per modality) |

---

## **Device & Performance**

| **Parameter** | **Type** | **Default** | **Description** |
|----------------|-----------|--------------|-----------------|
| `device` | `Literal['cpu', 'cuda', 'gpu', 'tpu', 'mps', 'auto']` | `auto` | Compute device |
| `n_gpus` | `int` | `1` | Number of GPUs to use |
| `n_workers` | `int` | `0` | Data loader workers |
| `checkpoint_interval` | `int` | `10` | Checkpoint save interval |
| `float_precision` | `Literal['transformer-engine', 'transformer-engine-float16', '16-true', '16-mixed', 'bf16-true', 'bf16-mixed', '32-true', '64-true', '64', '32', '16', 'bf16']` | `32` | Floating-point precision |
| `gpu_strategy` | `Literal['auto', 'dp', 'ddp', 'ddp_spawn', 'ddp_find_unused_parameters_true', 'xla', 'deepspeed', 'fsdp']` | `auto` | GPU parallelization strategy |

---

## **Data Splits & Reproducibility**

| **Parameter** | **Type** | **Default** | **Description** |
|----------------|-----------|--------------|-----------------|
| `train_ratio` | `float` | `0.7` | Training split ratio |
| `test_ratio` | `float` | `0.2` | Test split ratio |
| `valid_ratio` | `float` | `0.1` | Validation split ratio |
| `min_samples_per_split` | `int` | `1` | Minimum samples per split |
| `reproducible` | `bool` | `False` | Ensure reproducibility |
| `global_seed` | `int` | `1` | Global random seed |
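
To make the split ratios concrete, here is a small sketch of how the three defaults could translate into sample counts. `split_sizes` is a hypothetical helper, not part of the package; the requirement that the ratios sum to 1 and the rounding scheme are assumptions made for illustration.

```python
import math


def split_sizes(n_samples, train_ratio=0.7, test_ratio=0.2, valid_ratio=0.1):
    """Hypothetical helper: turn split ratios into sample counts."""
    total = train_ratio + test_ratio + valid_ratio
    if not math.isclose(total, 1.0):
        raise ValueError(f"ratios sum to {total}, expected 1.0")
    n_train = round(n_samples * train_ratio)
    n_test = round(n_samples * test_ratio)
    # Give the remainder to validation so the counts always add up to n_samples.
    n_valid = n_samples - n_train - n_test
    return n_train, n_test, n_valid


print(split_sizes(100))
```

With very small datasets, a split can end up with few or no samples, which is the situation `min_samples_per_split` is meant to guard against.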


# 🧬 DataConfig: Configuration Parameters

---

## **DataConfig**

| **Parameter** | **Type** | **Default** | **Description** |
|----------------|-----------|--------------|-----------------|
| `data_info` | `Dict[str, DataInfo]` | *Required* | Dictionary mapping modality names (e.g. `"RNA"`, `"IMG"`) to their `DataInfo` configuration |
| `require_common_cells` | `Optional[bool]` | `False` | Whether to require that all data modalities share a common set of cells/samples |
| `annotation_columns` | `Optional[List[str]]` | `None` | List of column names from the annotation file to include as metadata |
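
As an illustration of the nesting, the sketch below writes a two-modality `data_config` as a plain Python dict, using only field names from the tables in this reference. The modality names, file paths, and annotation column are made up for the example.

```python
# Plain-dict sketch of a nested data_config; paths and names are hypothetical.
raw_config = {
    "scaling": "MINMAX",
    "data_config": {
        "data_info": {
            "RNA": {"file_path": "data/rna.parquet", "data_type": "NUMERIC"},
            "CLIN": {"file_path": "data/clinical.parquet", "data_type": "ANNOTATION"},
        },
        "annotation_columns": ["tissue"],
    },
}

# Each key under data_info is a modality name, each value its DataInfo settings.
for name, info in raw_config["data_config"]["data_info"].items():
    print(f"{name}: {info['data_type']} from {info['file_path']}")
```

A dict of this shape could then be passed to `DefaultConfig.model_validate`, in the same way as the YAML-derived dict in section 1.2; whether these particular values pass validation depends on your data and the installed version.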

---

## **DataInfo**

| **Parameter** | **Type** | **Default** | **Description** |
|----------------|-----------|--------------|-----------------|
| `file_path` | `str` | `""` | Path to the raw data file |
| `data_type` | `Literal['NUMERIC', 'CATEGORICAL', 'IMG', 'ANNOTATION']` | `NUMERIC` | Type of data modality |
| `scaling` | `Literal['STANDARD', 'MINMAX', 'ROBUST', 'MAXABS', 'NONE', 'NOTSET']` | `NOTSET` | Overrides the globally set scaling method for this modality |
| `filtering` | `Literal['VAR', 'MAD', 'CORR', 'VARCORR', 'NOFILT', 'NONZEROVAR']` | `VAR` | Feature filtering method |
| `sep` | `Optional[str]` | `None` | Delimiter for CSV/TSV input files (passed to `pandas.read_csv`) |
| `extra_anno_file` | `Optional[str]` | `None` | Path to an additional annotation file |

---

## **Single-Cell Specific Parameters**

| **Parameter** | **Type** | **Default** | **Description** |
|----------------|-----------|--------------|-----------------|
| `is_single_cell` | `bool` | `False` | Whether the dataset represents single-cell data |
| `min_cells` | `float` | `0.05` | Minimum fraction of cells in which a gene must be expressed to be kept (filters rare genes) |
| `min_genes` | `float` | `0.02` | Minimum fraction of genes a cell must express to be kept (filters low-quality cells) |
| `selected_layers` | `List[str]` | `['X']` | Layers to include from the single-cell dataset; must always include `"X"` |
| `is_X` | `bool` | `False` | Whether the data originates from the `"X"` matrix only |
| `normalize_counts` | `bool` | `True` | Whether to normalize single-cell counts by total expression per cell |
| `log_transform` | `bool` | `True` | Whether to apply `log1p` transformation after normalization |
| `k_filter` | `Optional[int]` | `20` | Automatically set based on global config; do not override manually |
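
To show what `min_cells`, `normalize_counts`, and `log_transform` describe, here is a toy sketch in plain Python. It is not the package's implementation: the count matrix is made up, the threshold is set to 0.5 instead of the default 0.05 so that the tiny example filters something, and the inclusive comparison and per-cell normalization to a total of 1 are assumptions.

```python
import math

# Toy count matrix: rows are cells, columns are genes (made-up numbers).
counts = [
    [4, 0, 1],
    [0, 0, 3],
    [2, 0, 6],
    [6, 1, 2],
]
n_cells = len(counts)
n_genes = len(counts[0])
min_cells = 0.5  # keep genes expressed in at least this fraction of cells

# Fraction of cells with a non-zero count, per gene.
expressed_fraction = [
    sum(1 for cell in counts if cell[g] > 0) / n_cells for g in range(n_genes)
]
kept_genes = [g for g, frac in enumerate(expressed_fraction) if frac >= min_cells]

# Normalize each cell by its total count over the kept genes, then apply log1p.
normalized = []
for cell in counts:
    total = sum(cell[g] for g in kept_genes)
    normalized.append([math.log1p(cell[g] / total) for g in kept_genes])

print("kept gene indices:", kept_genes)
print("first cell after normalization:", normalized[0])
```

The `min_genes` filter works the same way along the other axis, dropping cells that express too small a fraction of genes.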

---

## **Image-Specific Parameters**

| **Parameter** | **Type** | **Default** | **Description** |
|----------------|-----------|--------------|-----------------|
| `img_width_resize` | `Optional[int]` | `64` | Target width for image resizing (must equal height) |
| `img_height_resize` | `Optional[int]` | `64` | Target height for image resizing (must equal width) |

---

## **XModalix & Translation Parameters**

| **Parameter** | **Type** | **Default** | **Description** |
|----------------|-----------|--------------|-----------------|
| `translate_direction` | `Optional[Literal['from', 'to']]` | `None` | Defines translation direction in cross-modal (XModalix) training |
| `pretrain_epochs` | `int` | `0` | Number of pretraining epochs specific to this modality (overrides global pretraining setting) |

---

### ⚙️ **Validation Rules**

- `selected_layers` must always contain `"X"`.  
- `img_width_resize` and `img_height_resize` must be **positive integers** and **equal** (enforces square resizing).  
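
The two rules above can be expressed as a small standalone check. `check_data_info` is a hypothetical function written for illustration; in the package these rules are enforced by the config classes themselves, and the exact error messages will differ.

```python
def check_data_info(selected_layers, img_width_resize, img_height_resize):
    """Hypothetical standalone check mirroring the two validation rules."""
    if "X" not in selected_layers:
        raise ValueError('selected_layers must contain "X"')
    for name, value in (
        ("img_width_resize", img_width_resize),
        ("img_height_resize", img_height_resize),
    ):
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    if img_width_resize != img_height_resize:
        raise ValueError("img_width_resize and img_height_resize must be equal")


check_data_info(["X"], 64, 64)  # valid settings pass silently

try:
    check_data_info(["X"], 64, 32)  # non-square resize
except ValueError as err:
    print(f"rejected: {err}")
```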

---
