# How to Access Pipeline Results - Deep Dive
Each step in the pipeline writes its results to the `Result` object of the pipeline instance (Vanillix, Varix, etc.).  
In this tutorial we explore how to access and make sense of these results.  
Most attributes of the `Result` object are instances of the `TrainingDynamics` class.  
This class provides a standardized interface for accessing results from different splits and epochs.  
<br>
**IMPORTANT** 
> Epoch-specific TrainingDynamics, such as losses or intermediate latentspaces, are not stored every epoch by default.
> You need to set the `checkpoint_interval` parameter in the config according to your needs.
## What You Will Learn
We go in depth into:
- The TrainingDynamics API. <br><br>
  - latentspaces
  - losses
  - reconstructions
  - sample_ids
- Nested TrainingDynamics like `sub_losses`<br><br>
- Non-TrainingDynamics Result attributes like:
  - datasets
  - new_datasets
  - model
  - adata_latent
  - final_reconstruction
  - embedding_evaluation
- Special methods to obtain pandas DataFrames
  - get the latentspace as a DataFrame with sample_ids
  - get the reconstruction as a DataFrame with sample_ids


## 1) Filling the Result Object
Before we can investigate the result object, we first need to create results. Therefore, we run two pipelines: `XModalix` and `Varix`.

#### 1.1 The Datasets
For the `Varix` example we use a mock single-cell dataset as a MuData object inside our custom `DataPackage` class. For the `XModalix` example we use the same dataset as in the `XModalix.ipynb` tutorial:  

As a showcase for data modality translation by `x-modalix`, we use cancer gene expression from TCGA in combination with handwritten digits from the [MNIST data set](https://keras.io/api/datasets/mnist/).  
Our goal is to translate the gene expression signature of five selected cancer subtypes to images of digits, where each cancer subtype class is assigned to a digit between 0 and 4.  


In practice, those images could be histopathological images or any other data modality. The underlying idea is a cross-modal VAE as proposed by [Yang & Uhler](https://arxiv.org/abs/1902.03515). 



In [1]:
import os

p = os.getcwd()
d = "autoencodix_package"
if d not in p:
    raise FileNotFoundError(f"'{d}' not found in path: {p}")
os.chdir(os.sep.join(p.split(os.sep)[: p.split(os.sep).index(d) + 1]))
print(f"Changed to: {os.getcwd()}")


Changed to: /Users/maximilianjoas/development/autoencodix_package


In [2]:
%%capture
from autoencodix.utils.example_data import EXAMPLE_MULTI_SC
from autoencodix.configs.varix_config import VarixConfig
from autoencodix.configs.default_config import DataCase, DataInfo, DataConfig
from autoencodix.configs.xmodalix_config import XModalixConfig
import autoencodix as acx

varix_config = VarixConfig(
    learning_rate=0.001,
    epochs=33,
    checkpoint_interval=1,
    default_vae_loss="kl",  # kl or mmd possible
    data_case=DataCase.MULTI_SINGLE_CELL,
)
varix = acx.Varix(data=EXAMPLE_MULTI_SC, config=varix_config)
result = varix.run()

# XModalix
clin_file = os.path.join("data", "XModalix-Tut-data", "combined_clin_formatted.parquet")
rna_file = os.path.join("data", "XModalix-Tut-data", "combined_rnaseq_formatted.parquet")
img_root = os.path.join("data", "XModalix-Tut-data", "images", "tcga_fake")

xmodalix_config = XModalixConfig(
    checkpoint_interval=1,
    class_param="CANCER_TYPE_ACRONYM",
    epochs=10,
    data_case=DataCase.IMG_TO_BULK,
    data_config=DataConfig(
        data_info={
            "img": DataInfo(
                file_path=img_root,
                data_type="IMG",
                scaling="MINMAX",
                translate_direction="to",
            ),
            "rna": DataInfo(
                file_path=rna_file,
                data_type="NUMERIC",
                scaling="MINMAX",
                translate_direction="from",
            ),
            "anno": DataInfo(file_path=clin_file, data_type="ANNOTATION", sep="\t"),
        },
    ),
)

xmodalix = acx.XModalix(config=xmodalix_config)
xmodalix_result = xmodalix.run()



## 2) TrainingDynamics Interface Deep Dive
Before accessing the actual results, we provide a theory section on our interface:  
<br><br>
The `TrainingDynamics` object has the following form:  
`<epoch><split><data>`  

So, if you want to access the train loss for epoch 5, you would use:  
```python
result.losses.get(epoch=5, split="train")
```

##### The `.get()` Method Explained
Let's say we're interested in the reconstructions of our autoencoder.  
The `reconstructions.get()` method provides flexible access to reconstruction data stored during training. It can retrieve data for specific epochs, specific splits, or any combination of these parameters.

##### Parameters

* **`epoch`** (Optional[int]):

  * Positive integer (e.g., `2`): Get reconstructions from that specific epoch
  * Negative integer (e.g., `-1`): Get the latest epoch (-1), second-to-last (-2), etc.
  * `None`: Return data for all epochs

* **`split`** (Optional[str]):

  * Valid values: `"train"`, `"valid"`, `"test"`
  * `None`: Return data for all splits

##### Return Value Behavior

The method returns different types depending on the parameters:

1. **Both `epoch` and `split` specified**:

   * Returns a NumPy array for that specific epoch and split
   * Example: `get(epoch=2, split="train")` → `array([...])`

2. **Only `epoch` specified**:

   * Returns a dictionary of all splits for that epoch
   * Example: `get(epoch=2)` → `{"train": array([...]), "valid": array([...]), ...}`

3. **Only `split` specified**:

   * Returns a NumPy array containing data for that split across all epochs
   * Example: `get(split="train")` → `array([[...], [...], ...])` (first dimension represents epochs)

4. **Neither specified**:

   * Returns the complete nested dictionary structure
   * Example: `get()` → `{0: {"train": array([...])}, 1: {...}, ...}`

##### Special Handling

* If an invalid split is provided, a `KeyError` is raised
* Negative epoch indices work like Python list indexing (-1 is the last epoch)
* If an epoch doesn't exist, an empty array or dictionary is returned
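
The lookup rules above can be pictured with a small stand-in. The sketch below is not the library's implementation: `toy_get` and `store` are hypothetical names, plain Python lists stand in for NumPy arrays, and the `{epoch: {split: data}}` layout is assumed from the description above.

```python
from typing import Any, Dict, List, Optional

VALID_SPLITS = ("train", "valid", "test")

# Hypothetical store: {epoch: {split: data}}; lists stand in for NumPy arrays.
store: Dict[int, Dict[str, List[float]]] = {
    0: {"train": [0.9, 0.8], "valid": [1.0]},
    1: {"train": [0.7, 0.6], "valid": [0.9]},
    2: {"train": [0.5, 0.4], "valid": [0.8]},
}


def toy_get(
    data: Dict[int, Dict[str, List[float]]],
    epoch: Optional[int] = None,
    split: Optional[str] = None,
) -> Any:
    """Mimic the four return shapes described above (illustration only)."""
    if split is not None and split not in VALID_SPLITS:
        raise KeyError(f"Invalid split: {split}")
    if epoch is None and split is None:
        return data  # 4. complete nested dictionary
    epochs = sorted(data)
    if epoch is None:
        # 3. one split across all epochs; the first dimension is the epoch
        return [data[e][split] for e in epochs if split in data[e]]
    if epoch < 0 and -epoch <= len(epochs):
        epoch = epochs[epoch]  # negative epochs count from the end, like list indexing
    if epoch not in data:
        return {} if split is None else []  # a missing epoch gives an empty result
    if split is None:
        return data[epoch]  # 2. all splits of one epoch
    return data[epoch].get(split, [])  # 1. one epoch, one split


print(toy_get(store, epoch=2, split="train"))  # [0.5, 0.4]
print(toy_get(store, epoch=-1))                # {'train': [0.5, 0.4], 'valid': [0.8]}
print(toy_get(store, split="valid"))           # [[1.0], [0.9], [0.8]]
print(toy_get(store, epoch=7))                 # {}
```

The real `get()` returns NumPy arrays instead of lists, so the "all epochs" case yields one array whose first dimension is the epoch.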

##### Code Example

```python
# Access train reconstructions for epoch 5
train_epoch_5 = result.reconstructions.get(epoch=5, split="train")

# Access all splits for the latest epoch
latest_epoch_all_splits = result.reconstructions.get(epoch=-1)

# Access data for all epochs for the "valid" split
all_epochs_valid = result.reconstructions.get(split="valid")

# Access the full nested dictionary
full_data = result.reconstructions.get()
```


## 3) Working with Actual Results
### 3.1) Varix

As an example, we show how to access the `latentspaces` attribute of `result` for different epochs and splits.

In [3]:
all_ls = result.latentspaces.get()
print(f"Keys of all latentspaces: {all_ls.keys()}")


Keys of all latentspaces: dict_keys([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32])


We see that we have latentspaces for each epoch, because we set `checkpoint_interval=1` in our configs in [step 1](#1-filling-the-result-object).  
For each epoch we have the latentspace for the `train` and `valid` splits. The epoch key `-1` is a special key for the `test` split. For the other splits, `epoch=-1` works just like negative indexing in Python lists, so you get the last epoch for -1, the second-to-last for -2, and so on. 
<br><br>
**Special Case**
> You cannot get the last epoch for all splits at once; you can either get the last epoch for train and valid, or only for test. See the code below.

In [4]:
print(f"Splits in 2. epoch: {result.latentspaces.get(epoch=2).keys()}")

# This only returns the data for train and valid, since 'test' is a special case
print(f"Splits in epoch =-1: {result.latentspaces.get(epoch=-1).keys()}")
# Get test by adding the 'split' argument
test_ls = result.latentspaces.get(split="test")
print(test_ls[0][0])

print("\n")
print("-" * 80)

# Get a specific epoch and a specific split.
# Using a variable avoids nested double quotes inside the f-string,
# which are a syntax error before Python 3.12.
train_ls_epoch_4 = result.latentspaces.get(split="train", epoch=4)
print(f"latentspace of one sample in train split at epoch 4: {train_ls_epoch_4[0]}")

Splits in 2. epoch: dict_keys(['train', 'valid'])
Splits in epoch =-1: dict_keys(['train', 'valid'])
[ 7.4745417e-01  4.0559044e+00  1.6833726e+00  1.0847610e+00
 -1.5539709e+00 -2.3329544e-01  1.6099819e+00  1.4448245e+00
  7.3622680e-01  1.3920254e+00 -7.2016740e-01  4.7351834e-01
  3.0655026e+00 -4.5949221e-04 -1.4978287e+00  5.4582012e-01]


--------------------------------------------------------------------------------
latentspace of one sample in train split at epoch 4: [ 1.4870985  -0.73432136 -0.9347738  -0.20718834 -0.6143952  -0.13031495
 -0.41699654 -1.1720629  -1.7184465   1.2840557   0.20730922  1.3808663
 -0.72797185 -0.01947922 -0.28647375  0.7279187 ]
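
Because `test` is handled separately, collecting the final latentspaces of all three splits takes two calls. The helper below is a minimal sketch of that pattern: `last_epoch_all_splits` and the `FakeDynamics` stub are hypothetical names written for this example, and the stub only imitates the behavior described above so that the snippet runs without a trained pipeline. It assumes that `get(epoch=-1, split="test")` returns the test data.

```python
from typing import Any, Dict, Optional


class FakeDynamics:
    """Stub imitating the behavior described above (not the real class)."""

    def __init__(self) -> None:
        # Regular epochs hold train/valid; the special key -1 holds test.
        self._data = {
            -1: {"test": [[0.3, 0.1]]},
            0: {"train": [[1.0, 2.0]], "valid": [[1.5, 2.5]]},
            1: {"train": [[0.5, 1.0]], "valid": [[0.7, 1.2]]},
        }

    def get(self, epoch: int, split: Optional[str] = None) -> Any:
        if split == "test":
            return self._data[-1]["test"]
        regular = sorted(e for e in self._data if e >= 0)
        chosen = self._data[regular[epoch] if epoch < 0 else epoch]
        return chosen if split is None else chosen[split]


def last_epoch_all_splits(dynamics: Any) -> Dict[str, Any]:
    """Merge train/valid of the last epoch with the separately stored test split."""
    merged = dict(dynamics.get(epoch=-1))  # train and valid only
    merged["test"] = dynamics.get(epoch=-1, split="test")
    return merged


final = last_epoch_all_splits(FakeDynamics())
print(sorted(final))  # ['test', 'train', 'valid']
```

If the assumption holds, the same helper could be applied to a real result, for example `last_epoch_all_splits(result.latentspaces)`.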


### 3.2) XModalix
Accessing the results works slightly differently for `XModalix`, because we have:
1. multiple latentspaces (one for each data modality). <br><br>
2. not just reconstructions, but also a `translation` and a `reference translation`. The reference translation is the reconstruction within one data modality on the same split as the translation, so if we translate from `rna` to `img`, this is the reconstruction from `img` to `img`. We can access both via the `reconstructions` attribute of the result object: we first apply the usual `TrainingDynamics` API via `get()` and then obtain a dict with one entry for each data modality, the translation, and the reference translation. <br> <br>
3. multiple losses. These are accessed via the `sub_losses` attribute of the `result` object, which is a Dict of TrainingDynamics. So you first select the loss type you're interested in and then work with the usual `TrainingDynamics` API. <br><br>
  3.1 A note on the naming of the sub-losses: global losses are simply named after the loss type, e.g. `class_loss`. Losses per data modality are named after the global key of the data modality (like `multi_bulk`, `multi_sc` or `img`), then a dot, then the name you defined for the specific data modality (e.g. `rna`), then another dot, and finally the name of the loss. See the print below.


#### 3.2.1 Access Modality Latentspaces
As described in our `XModalix Deep Dive` [1], we fit one latentspace per data modality. You can access these by first picking the epoch and split you're interested in (standard TrainingDynamics API). The result is a `Dict` with the name of each data modality as key.  


In [5]:
print(f" Keys of data modalities for latent space dynamic: {xmodalix_result.latentspaces.get(epoch=-1, split='test').keys()}")


 Keys of data modalities for latent space dynamic: dict_keys(['multi_bulk.rna', 'img.img'])


Now you can access the latentspace of the `image` modality with the key `img.img`:

In [6]:
xmodalix_result.latentspaces.get(epoch=-1, split="test").get("img.img")

array([[-1.0930173 ,  4.738305  , 21.872086  , ...,  1.3188266 ,
         1.9298503 ,  9.99824   ],
       [ 0.6364157 ,  7.6179223 ,  3.2025988 , ...,  4.461285  ,
         2.8316135 ,  2.182133  ],
       [-0.17223616,  4.0562906 ,  2.528502  , ...,  1.2698087 ,
         2.0447204 ,  3.495261  ],
       ...,
       [-0.05119326,  0.5352183 , 15.210966  , ...,  3.1731205 ,
         1.2334447 ,  3.177263  ],
       [-0.6019183 ,  6.928581  ,  3.1212626 , ...,  2.9466472 ,
         0.46270385,  7.041744  ],
       [ 0.8243733 ,  8.979268  , 13.781986  , ...,  9.794446  ,
         0.9893867 ,  4.5015326 ]], shape=(646, 16), dtype=float32)

#### 3.2.2 Access Translation

In [7]:
print("Get reconstruction keys")
# First define the split and epoch you're interested in,
# usually the test split (it has no epochs, so it is always epoch=-1)
recons = xmodalix_result.reconstructions.get(split="test", epoch=-1)
print(recons.keys())
print("Getting Translation")
trans = recons.get("translation")
print(f"shape of translation: {trans.shape}")

Get reconstruction keys
dict_keys(['multi_bulk.rna', 'img.img', 'translation', 'reference_img.img_to_img.img'])
Getting Translation
shape of translation: (711, 1, 64, 64)


#### 3.2.3 Access Sub-Losses

In [8]:

sub_losses = xmodalix_result.sub_losses
print("Sub Losses:")
print(f"keys: {sub_losses.keys()}")
print("\n")
# Select one loss type first, then use the usual TrainingDynamics API
paired_dyn = sub_losses.get(key="paired_loss")
print("Value of paired loss in epoch 4 for train split")
print(f"{paired_dyn.get(split='train', epoch=4):.2f}")


Sub Losses:
keys: dict_keys(['total_loss', 'adver_loss', 'aggregated_sub_losses', 'paired_loss', 'class_loss', 'multi_bulk.rna.recon_loss', 'multi_bulk.rna.var_loss', 'multi_bulk.rna.anneal_factor', 'multi_bulk.rna.effective_beta_factor', 'multi_bulk.rna.loss', 'img.img.recon_loss', 'img.img.var_loss', 'img.img.anneal_factor', 'img.img.effective_beta_factor', 'img.img.loss', 'clf_loss'])


Value of paired loss in epoch 4 for train split
10.00
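
The naming scheme of the sub-loss keys can be taken apart programmatically, for example to group losses by modality. The function below is a sketch based only on the keys printed above; `split_loss_key` and `LossKey` are hypothetical helpers, not part of the package, and they assume that modality-specific keys always consist of exactly three dot-separated parts.

```python
from typing import NamedTuple, Optional


class LossKey(NamedTuple):
    global_key: Optional[str]  # e.g. "multi_bulk" or "img"; None for global losses
    modality: Optional[str]    # user-defined modality name, e.g. "rna"
    loss_name: str             # e.g. "recon_loss"


def split_loss_key(key: str) -> LossKey:
    """Split '<global_key>.<modality>.<loss>'; keys without dots are global losses."""
    parts = key.split(".")
    if len(parts) == 3:
        return LossKey(*parts)
    if len(parts) == 1:
        return LossKey(None, None, key)
    raise ValueError(f"Unexpected sub-loss key format: {key}")


for key in ["class_loss", "multi_bulk.rna.recon_loss", "img.img.var_loss"]:
    print(key, "->", split_loss_key(key))
```

Keys that do not fit either pattern raise a `ValueError`, so a change in the naming scheme would be noticed rather than silently misparsed.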


## 4) Non-TrainingDynamics Result Attributes 
There are other (intermediate) results that are not created during training but might still be interesting. These results do not follow a uniform interface like `TrainingDynamics`, but they are much more straightforward. We go over each attribute briefly.

#### 4.1 Datasets
The `datasets` attribute stores the preprocessed data in a `DatasetContainer`. This is basically a dict with `train`, `valid` and `test` as keys, where each value is a child class of a PyTorch dataset. So whenever you need to re-access your preprocessed data, you can do so with the `datasets` attribute as shown below:

In [9]:
print(result.datasets)
print(result.datasets.train)
print(type(result.datasets.train.data))
result.datasets.train.metadata.head()

DatasetContainer(train=<autoencodix.data._numeric_dataset.NumericDataset object at 0x14179aff0>, valid=<autoencodix.data._numeric_dataset.NumericDataset object at 0x1421785c0>, test=<autoencodix.data._numeric_dataset.NumericDataset object at 0x141b81a60>)
<autoencodix.data._numeric_dataset.NumericDataset object at 0x14179aff0>
<class 'torch.Tensor'>


| | rna:cell_type | rna:batch | rna:donor | rna:cell_cycle | rna:n_genes | protein:cell_type | protein:batch | protein:donor | protein:cell_cycle | protein:n_genes |
|---|---|---|---|---|---|---|---|---|---|---|
| cell_1 | type_0 | batch3 | donor4 | G1 | 357 | type_0 | batch3 | donor4 | G1 | 147 |
| cell_107 | type_4 | batch2 | donor3 | S | 335 | type_4 | batch2 | donor3 | S | 146 |
| cell_189 | type_3 | batch2 | donor2 | G2M | 368 | type_3 | batch2 | donor2 | G2M | 155 |
| cell_19 | type_3 | batch1 | donor1 | G1 | 345 | type_3 | batch1 | donor1 | G1 | 147 |
| cell_190 | type_4 | batch1 | donor2 | S | 338 | type_4 | batch1 | donor2 | S | 152 |


#### 4.2 New Datasets
Whenever you run the `predict` step of the pipeline and pass new, unseen data to it, we preprocess this data (if necessary). To avoid overwriting the original `datasets`, we store the result in `new_datasets`. If you run predict again with other data, `new_datasets` is overwritten. Otherwise, `new_datasets` works the same as `datasets`.

First we create new data and then we run predict.

In [10]:
import copy

from autoencodix.utils.example_data import EXAMPLE_MULTI_SC

# Note: copy.copy is shallow, so the nested MuData object is shared with EXAMPLE_MULTI_SC
new_data = copy.copy(EXAMPLE_MULTI_SC)
new_multi_sc = new_data.multi_sc["multi_sc"]
for modname, mod in new_multi_sc.mod.items():
    # Rename every cell so the new samples can be told apart from the originals
    mod.obs_names = mod.obs_names.str.replace("cell", "new_cell")
    print(mod.obs_names)

# Propagate the renamed observations to the MuData container
new_multi_sc.update()

Index(['new_cell_0', 'new_cell_1', 'new_cell_10', 'new_cell_100',
       'new_cell_101', 'new_cell_102', 'new_cell_103', 'new_cell_104',
       'new_cell_105', 'new_cell_106',
       ...
       'new_cell_990', 'new_cell_991', 'new_cell_992', 'new_cell_993',
       'new_cell_994', 'new_cell_995', 'new_cell_996', 'new_cell_997',
       'new_cell_998', 'new_cell_999'],
      dtype='object', length=1000)
Index(['new_cell_0', 'new_cell_1', 'new_cell_10', 'new_cell_100',
       'new_cell_101', 'new_cell_102', 'new_cell_103', 'new_cell_104',
       'new_cell_105', 'new_cell_106',
       ...
       'new_cell_990', 'new_cell_991', 'new_cell_992', 'new_cell_993',
       'new_cell_994', 'new_cell_995', 'new_cell_996', 'new_cell_997',
       'new_cell_998', 'new_cell_999'],
      dtype='object', length=1000)




Run the predict step:

In [11]:
%%capture
varix.predict(data=new_data)

Examine `datasets` and `new_datasets`:  
We see that `datasets` is kept and only `new_datasets` is updated.

In [12]:
print(f"Sample of original dataset: {result.datasets.train.sample_ids[0]}")

print(f"Sample of new dataset: {result.new_datasets.test.sample_ids[0]}")

Sample of original dataset: cell_1
Sample of new dataset: new_cell_0


#### 4.3 Model Attribute
This is simply the trained model as a PyTorch Module.

In [13]:
result.model

VarixArchitecture(
  (_encoder): Sequential(
    (0): Linear(in_features=20, out_features=16, bias=True)
    (1): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): Dropout(p=0.1, inplace=False)
    (3): ReLU()
    (4): Linear(in_features=16, out_features=16, bias=True)
    (5): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): Dropout(p=0.1, inplace=False)
    (7): ReLU()
    (8): Linear(in_features=16, out_features=16, bias=True)
    (9): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): Dropout(p=0.1, inplace=False)
    (11): ReLU()
  )
  (_mu): Linear(in_features=16, out_features=16, bias=True)
  (_logvar): Linear(in_features=16, out_features=16, bias=True)
  (_decoder): Sequential(
    (0): Linear(in_features=16, out_features=16, bias=True)
    (1): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): Dropout(p=0.1, inplace=False)

#### 4.4 Adata Latent
We save the latentspace of the `test` split from the final trained model as an `AnnData` object for the single-cell community.  
This is also useful for non-single-cell cases, because you obtain the sample ids via the index of `.obs`.

In [14]:
print(result.adata_latent)
print(result.adata_latent.obs)

AnnData object with n_obs × n_vars = 1000 × 16
    uns: 'var_names'
Empty DataFrame
Columns: []
Index: [new_cell_0, new_cell_1, new_cell_10, new_cell_100, new_cell_101, new_cell_102, new_cell_103, new_cell_104, new_cell_105, new_cell_106, new_cell_107, new_cell_108, new_cell_109, new_cell_11, new_cell_110, new_cell_111, new_cell_112, new_cell_113, new_cell_114, new_cell_115, new_cell_116, new_cell_117, new_cell_118, new_cell_119, new_cell_12, new_cell_120, new_cell_121, new_cell_122, new_cell_123, new_cell_124, new_cell_125, new_cell_126, new_cell_127, new_cell_128, new_cell_129, new_cell_13, new_cell_130, new_cell_131, new_cell_132, new_cell_133, new_cell_134, new_cell_135, new_cell_136, new_cell_137, new_cell_138, new_cell_139, new_cell_14, new_cell_140, new_cell_141, new_cell_142, new_cell_143, new_cell_144, new_cell_145, new_cell_146, new_cell_147, new_cell_148, new_cell_149, new_cell_15, new_cell_150, new_cell_151, new_cell_152, new_cell_153, new_cell_154, new_cell_155, new_cell_1

#### 4.5 Final Reconstruction
This attribute gives you the reconstructed values in the same data structure that you used as input, i.e. `MuData` in our case.

In [15]:
result.final_reconstruction

#### 4.6 Evaluation Embeddings
Before we can access this attribute, we first need to run the `evaluate` step. This uses the latent space to train a downstream machine learning task. In our case, we want to classify the cancer type.  
<br><br>
The results of the evaluate step are stored in `embedding_evaluation`.

In [16]:
%%capture
xmodalix.evaluate(params=["CANCER_TYPE"])

In [17]:
xmodalix_result.embedding_evaluation

| | score_split | CLINIC_PARAM | metric | value | ML_ALG | ML_TYPE | MODALITY | ML_TASK | ML_SUBTASK |
|---|---|---|---|---|---|---|---|---|---|
| 0 | train | CANCER_TYPE | roc_auc_ovo | 0.613343 | LogisticRegression() | classification | multi_bulk.rna | Latent | `Latent_$_multi_bulk.rna` |
| 1 | valid | CANCER_TYPE | roc_auc_ovo | 0.613812 | LogisticRegression() | classification | multi_bulk.rna | Latent | `Latent_$_multi_bulk.rna` |
| 2 | test | CANCER_TYPE | roc_auc_ovo | 0.574837 | LogisticRegression() | classification | multi_bulk.rna | Latent | `Latent_$_multi_bulk.rna` |
| 0 | train | CANCER_TYPE | roc_auc_ovo | 0.996737 | LogisticRegression() | classification | img.img | Latent | `Latent_$_img.img` |
| 1 | valid | CANCER_TYPE | roc_auc_ovo | 0.991656 | LogisticRegression() | classification | img.img | Latent | `Latent_$_img.img` |
| 2 | test | CANCER_TYPE | roc_auc_ovo | 0.993542 | LogisticRegression() | classification | img.img | Latent | `Latent_$_img.img` |


In [None]:

varix.result.datasets.test.metadata.head()
varix.evaluate(params=["rna:batch"])

| | rna:cell_type | rna:batch | rna:donor | rna:cell_cycle | rna:n_genes | protein:cell_type | protein:batch | protein:donor | protein:cell_cycle | protein:n_genes |
|---|---|---|---|---|---|---|---|---|---|---|
| cell_0 | type_0 | batch2 | donor2 | S | 346 | type_0 | batch2 | donor2 | S | 150 |
| cell_195 | type_4 | batch2 | donor2 | S | 357 | type_4 | batch2 | donor2 | S | 147 |
| cell_199 | type_1 | batch3 | donor4 | G1 | 347 | type_1 | batch3 | donor4 | G1 | 145 |
| cell_202 | type_1 | batch2 | donor3 | G1 | 354 | type_1 | batch2 | donor3 | G1 | 151 |
| cell_203 | type_3 | batch3 | donor1 | G1 | 346 | type_3 | batch3 | donor1 | G1 | 149 |


## 5) Special Methods to Obtain DataFrames
As you've seen in the [TrainingDynamics section](#2-trainingdynamics-interface-deep-dive), we only get plain values for reconstructions and latentspaces. Often it is more useful to have the sample ids, too. We could obtain the sample ids in the same order via the `sample_ids` TrainingDynamics attribute. To make this more accessible, we added the methods
`get_latent_df` and `get_reconstructions_df`. Here you pass `epoch` and `split` as seen before, and you get a pandas DataFrame of the latent space or the reconstruction for that specific split and epoch.

In [None]:
result.get_latent_df(epoch=-1, split="test").head()

In [None]:
result.get_reconstructions_df(epoch=-1, split="test").head()
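
Conceptually, these helpers pair each row of values with its sample id. The snippet below sketches that pairing by hand with pandas. The ids and values are made up for illustration, the column names `latent_0` and `latent_1` and the index name `sample_id` are assumptions, and the DataFrames returned by `get_latent_df` may be laid out differently.

```python
import pandas as pd

# Made-up stand-ins for the sample ids and latent values of one split and epoch
sample_ids = ["cell_0", "cell_195", "cell_199"]
latent = [[0.75, 4.06], [1.49, -0.73], [-0.17, 2.53]]

# One row per sample, one column per latent dimension, sample ids as index
latent_df = pd.DataFrame(
    latent,
    index=pd.Index(sample_ids, name="sample_id"),
    columns=[f"latent_{i}" for i in range(len(latent[0]))],
)
print(latent_df.loc["cell_195"])
```

With the sample ids as index, rows can be looked up by id or joined with the metadata of the corresponding dataset.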