# AUTOENCODIX PACKAGE HANDBOOK
This notebook demonstrates the usage of the autoencodix package.
For now it serves as an internal guideline with the goal to:
- test the package from a user perspective
- serve as a first draft of user documentation
- serve as a developer guideline
  - the developer guide will be derived from this notebook

## 00 Generate mock data
As development proceeds, this section should show how to use different data types.
For now we only use a mock numpy array.

In [1]:
import numpy as np
sample_data = np.random.rand(100, 10)
sample_data.shape

(100, 10)

## 01 General Pipeline Usage

In [2]:
# imports
import autoencodix as acx
from autoencodix.utils.default_config import DefaultConfig



In [None]:
#### --------------------------------------------
# TODO user prepares data or config
### INITIALIZATION ### --------------------------
# Use Vanillix Pipeline interface
# needs to be initialized with data
# data should be a numpy array, pandas dataframe or AnnData object
# possible to pass a custom Config object
van = acx.Vanillix(data=sample_data)
# ------------------------------------------------
### DATA PROCESSING ### --------------------------
# job of old make data
# populates self._features attribute with torch tensor
# populates self._datasets attribute with torch dataset
# (important for training with dataloader)
# possible to pass a custom Config object, or keyword arguments
van.preprocess()
# ------------------------------------------------
### MODEL TRAINING ### --------------------------
# job of old make model
# calls self.Trainer class to init and train model
# populates self._model attribute with trained model
# populates self.result attribute with training results (model, losses, etc)
van.fit()
# ------------------------------------------------
### PREDICTION ### -------------------------------
# job of old make predict
# if no data is passed, uses the test split from preprocessing
# otherwise, uses the data passed, and preprocesses it
# updates self.result attribute with predictions (latent space, reconstructions, etc)
van.predict()
# ------------------------------------------------
### EVALUATION ### -------------------------------
# job of old make ml_task
# populates self.result attribute with ml task results
van.evaluate() # not implemented yet
# ------------------------------------------------
### VISUALIZATION ### ---------------------------
# job of old make visualize
# populates self.result attribute with visualizations
van.visualize()
# show visualizations for notebook use
van.show_result()
# --------------------------
# --------------------------
# run all steps in the pipeline
result_object = van.run()

cpu not relevant here
batch: 0
model_outputs.reconstruction: torch.Size([32, 10])
batch: 1
model_outputs.reconstruction: torch.Size([32, 10])
batch: 2
model_outputs.reconstruction: torch.Size([6, 10])
Epoch: 0, Loss: 2.3556206822395325
output.reconstruction: torch.Size([32, 10])
output.reconstruction: torch.Size([32, 10])
output.reconstruction: torch.Size([6, 10])
output.reconstruction: torch.Size([10, 10])
batch: 0
model_outputs.reconstruction: torch.Size([32, 10])
batch: 1
model_outputs.reconstruction: torch.Size([32, 10])
batch: 2
model_outputs.reconstruction: torch.Size([6, 10])
Epoch: 1, Loss: 2.308934450149536
output.reconstruction: torch.Size([32, 10])
output.reconstruction: torch.Size([32, 10])
output.reconstruction: torch.Size([6, 10])
output.reconstruction: torch.Size([10, 10])
batch: 0
model_outputs.reconstruction: torch.Size([32, 10])
batch: 1
model_outputs.reconstruction: torch.Size([32, 10])
batch: 2
model_outputs.reconstruction: torch.Size([6, 10])
Epoch: 2, Loss: 2.2133
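
The batch shapes in the log above (32, 32, 6) are what you would expect for 70 training samples and a batch size of 32, provided the last, smaller batch is kept. The following is only a small arithmetic sketch of that reasoning; `batch_sizes` is a hypothetical helper and not part of autoencodix.

```python
def batch_sizes(n_samples: int, batch_size: int) -> list:
    """Sizes of consecutive batches when the last, smaller batch is kept."""
    full, rest = divmod(n_samples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


# 70 training samples (70% of 100) with the default batch_size of 32
print(batch_sizes(70, 32))  # [32, 32, 6]
# 10 validation samples fit into a single batch
print(batch_sizes(10, 32))  # [10]
```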

In [None]:
recons = result_object.reconstructions.get(split="train", epoch=2)
recons_val = result_object.reconstructions.get(split="valid", epoch=2)
recons_test = result_object.reconstructions.get(split="test", epoch=-1)
print(recons.shape, recons_val.shape, recons_test.shape)

(70, 10) (10, 10) (20, 10)


In [None]:
latents = result_object.latentspaces.get(split="train", epoch=2)
latents_val = result_object.latentspaces.get(split="valid", epoch=2)
latents_test = result_object.latentspaces.get(split="test", epoch=-1)
print(latents.shape, latents_val.shape, latents_test.shape)

(70, 16) (10, 16) (20, 16)


#### Using a custom train, test, valid split
When you pass data to the pipeline, autoencodix internally splits it for you based on the train, valid, and test ratios provided in the config (defaults are 70%/10%/20% train/valid/test).
You can either pass custom ratios (see next section) or provide the indices directly, as shown in the code cell below.
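
As a rough mental model of the ratio-based default, the sketch below shuffles the sample indices and cuts them by ratio. It is an illustration under assumptions, not the package's actual splitting code: `split_indices`, its rounding rule, and the use of a seeded shuffle are all hypothetical.

```python
import random


def split_indices(n_samples, train_ratio=0.7, valid_ratio=0.1, test_ratio=0.2, seed=1):
    """Shuffle sample indices and cut them into train/valid/test by ratio."""
    if abs(train_ratio + valid_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_train = round(n_samples * train_ratio)
    n_valid = round(n_samples * valid_ratio)
    return {
        "train": indices[:n_train],
        "valid": indices[n_train:n_train + n_valid],
        "test": indices[n_train + n_valid:],  # remainder goes to the test split
    }


split = split_indices(100)
print({name: len(idx) for name, idx in split.items()})
# {'train': 70, 'valid': 10, 'test': 20}
```

The resulting sizes match the shapes printed earlier in this notebook for 100 samples.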

In [None]:
sample_data = np.random.rand(100, 10)
custom_train_indices = np.arange(75) # we won't allow overlap between splits
custom_valid_indices = np.arange(75, 80)
custom_test_indices = np.arange(80, 100)

# the custom split needs to be a dictionary with keys "train", "valid", and "test" and indices of the samples to be included in each split as numpy arrays
custom_split = {"train": custom_train_indices,
                "valid": custom_valid_indices,
                "test": custom_test_indices}
van = acx.Vanillix(data=sample_data, custom_splits=custom_split)
van.preprocess()
van.fit(epochs=3)

cpu not relevant here


Epoch: 0, Loss: 2.367237627506256
Epoch: 1, Loss: 2.1364158391952515
Epoch: 2, Loss: 2.140314042568207


It is possible to pass empty splits, but depending on how you use the autoencodix pipeline, this will raise an error at some point. For example, you can call `fit` with only training data, but calling `predict` without providing new data will not work if the test split is empty.
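
To catch such problems before handing a split to the pipeline, you could run a small check of your own. The helper below is a hypothetical sketch (not an autoencodix function) that reports missing keys, empty splits, out-of-range indices, and overlap between splits. It accepts any iterable of integer indices, so numpy arrays work as well as ranges.

```python
def check_custom_split(split, n_samples):
    """Return a list of human-readable problems found in a custom split."""
    problems = []
    required = ("train", "valid", "test")
    missing = [key for key in required if key not in split]
    if missing:
        problems.append(f"missing keys: {missing}")
    seen = set()
    for key in required:
        indices = [int(i) for i in split.get(key, [])]
        if not indices:
            problems.append(f"'{key}' split is empty")
        out_of_range = [i for i in indices if not 0 <= i < n_samples]
        if out_of_range:
            problems.append(f"'{key}' has out-of-range indices: {out_of_range}")
        overlap = seen.intersection(indices)
        if overlap:
            problems.append(f"'{key}' overlaps earlier splits at: {sorted(overlap)}")
        seen.update(indices)
    return problems


good = {"train": range(75), "valid": range(75, 80), "test": range(80, 100)}
bad = {"train": range(75), "valid": range(70, 80), "test": []}
print(check_custom_split(good, 100))  # []
print(check_custom_split(bad, 100))   # reports the overlap and the empty test split
```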

#### Using predict with new data
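The standard case is to train the model with the train split and then predict on the test split.
However, it is also possible to pass new data to the `predict` method to run inference on it with the already trained model. Note that in the run recorded below, the call printed an `Invalid parameters: data` message, so the `data` keyword may not be fully wired up in this version.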

In [None]:
new_unseen_data = np.random.rand(10, 10)
van.predict(data=new_unseen_data)


Invalid parameters: data
Valid parameters are: config


#### Examining the result of the pipeline
Each step in the pipeline writes its results in the result object of the Vanillix instance.
In this section we explore how to access and make sense of the results.

In [None]:
result = van.result
print(result)

Result Object Public Attributes:
------------------------------
latentspaces: TrainingDynamics object
reconstructions: TrainingDynamics object
mus: TrainingDynamics object
sigmas: TrainingDynamics object
losses: TrainingDynamics object
preprocessed_data: Tensor of shape (100, 10)
model: _FabricModule
model_checkpoints: TrainingDynamics object
datasets: DatasetContainer(train=<autoencodix.data._numeric_dataset.NumericDataset object at 0x1053f1e70>, valid=<autoencodix.data._numeric_dataset.NumericDataset object at 0x10626cd90>, test=<autoencodix.data._numeric_dataset.NumericDataset object at 0x10626ce20>)


##### TrainingDynamics object in result
The TrainingDynamics object has the following form:
`<epoch><split><data>`
So if you want to access the train loss for epoch 5, you would call:
`result.losses.get(epoch=5, split="train")`

In [None]:
loss_train_ep2 = result.losses.get(epoch=2, split="train")
print(loss_train_ep2)
valid_loss = result.losses.get(split="valid")
print(valid_loss)
print(result.losses.get())

0.7134380141894022
[0.27492604 0.24243712 0.23720634]
{0: {'train': array(0.78907921), 'valid': array(0.27492604)}, 1: {'train': array(0.71213861), 'valid': array(0.24243712)}, 2: {'train': array(0.71343801), 'valid': array(0.23720634)}}


Note: this schema works for every TrainingDynamics instance in the results object.
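
To make the epoch → split → data layout concrete, here is a minimal stand-in that mimics the access pattern shown above. `MiniDynamics` is hypothetical and much simpler than the real `TrainingDynamics` class: it returns plain lists instead of numpy arrays, and the behaviour for `get(epoch=...)` without a split is an assumption. The loss values are rounded from the output above.

```python
class MiniDynamics:
    """Hypothetical stand-in that stores values as {epoch: {split: value}}."""

    def __init__(self):
        self._data = {}

    def add(self, epoch, split, value):
        self._data.setdefault(epoch, {})[split] = value

    def get(self, epoch=None, split=None):
        if epoch is None and split is None:
            return self._data  # everything, keyed by epoch, then split
        if epoch is None:
            # one value per epoch for the requested split, in epoch order
            return [by_split[split]
                    for _, by_split in sorted(self._data.items())
                    if split in by_split]
        if split is None:
            return self._data[epoch]  # all splits of a single epoch
        return self._data[epoch][split]


losses = MiniDynamics()
recorded = [(0.789, 0.275), (0.712, 0.242), (0.713, 0.237)]
for epoch, (train_loss, valid_loss) in enumerate(recorded):
    losses.add(epoch, "train", train_loss)
    losses.add(epoch, "valid", valid_loss)

print(losses.get(epoch=2, split="train"))  # 0.713
print(losses.get(split="valid"))           # [0.275, 0.242, 0.237]
print(losses.get())                        # full nested dictionary
```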

## 02 Pipeline usage with custom parameters
Here we show how to customize the pipeline shown above with a user config or with keyword arguments.
In future iterations we want to support reading a config from a file; this will also be demonstrated here.

In [None]:

# Use Vanillix Pipeline interface
# needs to be initialized with data
# data should be a numpy array, pandas dataframe or AnnData object
# possible to pass a custom Config object
van = acx.Vanillix(data=sample_data)
# job of old make data
# populates self._features attribute with torch tensor
# populates self._datasets attribute with torch dataset
# (important for training with dataloader)
# possible to pass a custom Config object, or keyword arguments
van.preprocess()
# job of old make model
# calls self.Trainer class to init and train model
# populates self._model attribute with trained model
# populates self.result attribute with training results (losses, etc)
# van.fit()
# Each step can be run separately with custom parameters. These parameters
# can be passed as keyword arguments or as a Config object.
van.fit(learning_rate=0.01, batch_size=32, epochs=5) # or like this:
# learning_rate=130.0 is extremely large; compare the losses of the two runs in the output
my_config = DefaultConfig(learning_rate=130.0, batch_size=32, epochs=5)
van.fit(config=my_config) # config has to be a keyword argument



cpu not relevant here
Epoch: 0, Loss: 1.9614873230457306
Epoch: 1, Loss: 1.382771223783493
Epoch: 2, Loss: 0.977891355752945
Epoch: 3, Loss: 0.6883653551340103
Epoch: 4, Loss: 0.5179779827594757
cpu not relevant here
Epoch: 0, Loss: 8780462592.772186
Epoch: 1, Loss: 1348435297.7003174
Epoch: 2, Loss: 5917.90837097168
Epoch: 3, Loss: 10487.504081726074
Epoch: 4, Loss: 3092.8131675720215
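
The second run above uses `learning_rate=130.0`, and its losses are many orders of magnitude larger than in the first run. A toy example helps to see why a step size that is too large hurts. The sketch below runs plain gradient descent on f(x) = x², which is a deliberately simplified stand-in and not the autoencoder or the optimizer used by the package.

```python
def descend(learning_rate, steps=5, start=1.0):
    """Run plain gradient descent on f(x) = x**2 and return the loss per step."""
    x = start
    losses = []
    for _ in range(steps):
        x -= learning_rate * 2 * x  # the gradient of x**2 is 2x
        losses.append(x * x)
    return losses


small = descend(0.01)   # each step multiplies x by 0.98, so the loss shrinks
large = descend(130.0)  # each step multiplies x by -259, so the loss explodes
print(small)
print(large)
```

In this toy setting every update multiplies x by (1 - 2 · learning_rate), so the loss only decreases when that factor has magnitude below 1. Real training is noisier, as the recorded losses show, but the same intuition applies.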


#### 02.1  How to find the relevant keyword arguments for pipeline methods
It can be hard to know what keyword arguments are valid for each step,
so we show:
- how to get a list of allowed keyword arguments
- what happens if you pass non-allowed keyword arguments

In [None]:
# each pipeline method exposes a valid_params attribute
van = acx.Vanillix(data=sample_data)
fit_params = van.fit.valid_params # returns a set of keyword arguments that are actually used in the fit method

import pprint
pprint.pprint(fit_params)

{'batch_size',
 'checkpoint_interval',
 'config',
 'device',
 'epochs',
 'global_seed',
 'gpu_strategy',
 'learning_rate',
 'n_gpus',
 'n_workers',
 'reconstruction_loss',
 'reproducible',
 'weight_decay'}


To get even more verbose info about the keyword args, you can run the following code.

In [None]:
# when you want to have more info about the params, you can get type hints from the config object
my_config = DefaultConfig()
config_values = my_config.get_params()
my_config.print_schema(filter_params=fit_params)

Valid Keyword Arguments:
--------------------------------------------------

learning_rate:
  Type: <class 'float'>
  Default: 0.001
  Description: Learning rate for optimization

batch_size:
  Type: <class 'int'>
  Default: 32
  Description: Number of samples per batch

epochs:
  Type: <class 'int'>
  Default: 3
  Description: Number of training epochs

weight_decay:
  Type: <class 'float'>
  Default: 0.01
  Description: L2 regularization factor

reconstruction_loss:
  Type: typing.Literal['mse', 'bce']
  Default: mse
  Description: Type of reconstruction loss

device:
  Type: typing.Literal['cpu', 'cuda', 'gpu', 'tpu', 'mps', 'auto']
  Default: auto
  Description: Device to use

n_gpus:
  Type: <class 'int'>
  Default: 1
  Description: Number of GPUs to use

n_workers:
  Type: <class 'int'>
  Default: 2
  Description: Number of data loading workers

checkpoint_interval:
  Type: <class 'int'>
  Default: 1
  Description: Interval for saving checkpoints

gpu_strategy:
  Type: typing.Lit

If you pass unsupported parameters, you get a warning.

In [None]:
# if you use an unsupported keyword argument, you will get a warning
# as you see, the default value from the DefaultConfig is not overwritten and the training runs for the default 3 epochs (not 10)
van.preprocess()
van.fit(epochds=10)


Invalid parameters: epochds
Valid parameters are: batch_size, checkpoint_interval, config, device, epochs, global_seed, gpu_strategy, learning_rate, n_gpus, n_workers, reconstruction_loss, reproducible, weight_decay
cpu not relevant here
Epoch: 0, Loss: 2.2837421894073486
Epoch: 1, Loss: 2.2954598665237427
Epoch: 2, Loss: 2.3972206711769104
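
The behaviour above (warn about unknown names, keep the defaults) can be pictured with a few lines of plain Python. This is a hedged sketch of the idea only: `split_kwargs`, `FIT_PARAMS`, and `defaults` are hypothetical names, and the real pipeline may implement the check differently.

```python
def split_kwargs(valid_params, **kwargs):
    """Separate keyword arguments into accepted ones and unknown names."""
    accepted = {k: v for k, v in kwargs.items() if k in valid_params}
    rejected = sorted(k for k in kwargs if k not in valid_params)
    if rejected:
        print(f"Invalid parameters: {', '.join(rejected)}")
        print(f"Valid parameters are: {', '.join(sorted(valid_params))}")
    return accepted, rejected


FIT_PARAMS = {"batch_size", "epochs", "learning_rate"}  # shortened for the example
defaults = {"batch_size": 32, "epochs": 3, "learning_rate": 0.001}

# "epochds" is a typo, so it is reported and ignored; batch_size is applied
accepted, rejected = split_kwargs(FIT_PARAMS, epochds=10, batch_size=16)
effective = {**defaults, **accepted}
print(effective)  # {'batch_size': 16, 'epochs': 3, 'learning_rate': 0.001}
```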


#### 02.2 How to get information about the default config parameters

In [None]:
# if you want to see what config parameters are used in the default config you can do it like:
default_config = DefaultConfig()
default_config.print_schema()




DefaultConfig Configuration Parameters:
--------------------------------------------------

latent_dim:
  Type: <class 'int'>
  Default: 16
  Description: Dimension of the latent space

n_layers:
  Type: <class 'int'>
  Default: 3
  Description: Number of layers in encoder/decoder

enc_factor:
  Type: <class 'int'>
  Default: 4
  Description: Scaling factor for encoder dimensions

input_dim:
  Type: <class 'int'>
  Default: 10000
  Description: Input dimension

drop_p:
  Type: <class 'float'>
  Default: 0.1
  Description: Dropout probability

learning_rate:
  Type: <class 'float'>
  Default: 0.001
  Description: Learning rate for optimization

batch_size:
  Type: <class 'int'>
  Default: 32
  Description: Number of samples per batch

epochs:
  Type: <class 'int'>
  Default: 3
  Description: Number of training epochs

weight_decay:
  Type: <class 'float'>
  Default: 0.01
  Description: L2 regularization factor

reconstruction_loss:
  Type: typing.Literal['mse', 'bce']
  Default: mse
  

#### 02.3 Documentation of the Config class
You can update the config with your own values by:
- passing arguments as:
    - dict
    - single arguments
- passing a file (TODO)

In [3]:
from autoencodix.utils.default_config import DefaultConfig
# METHOD 1: override the default config with a dictionary
my_args = {"learning_rate": 0.0234, "batch_size": 13, "epochs": 12}
my_config = DefaultConfig(**my_args)
# METHOD 2: override single parameters
my_new_config = DefaultConfig(latent_dim=23, n_gpus=13)

# METHOD 3: from a file: TODO
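
If you want to experiment with the override pattern without installing the package, a dataclass can serve as a stand-in. `MockConfig` below is hypothetical and only copies a few defaults from the schema printed earlier. A dataclass rejects unknown field names with a `TypeError`; how `DefaultConfig` reacts to unknown fields is not shown in this notebook, so do not read that part as a statement about the real class.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MockConfig:
    """Hypothetical stand-in with a few of the defaults listed above."""
    latent_dim: int = 16
    learning_rate: float = 0.001
    batch_size: int = 32
    epochs: int = 3


# METHOD 1: unpack a dictionary; fields that are not mentioned keep their defaults
my_args = {"learning_rate": 0.0234, "batch_size": 13, "epochs": 12}
from_dict = MockConfig(**my_args)
print(asdict(from_dict))

# METHOD 2: override single parameters
single = MockConfig(latent_dim=23)
print(asdict(single))

# a misspelled field name is rejected by the dataclass constructor
try:
    MockConfig(epochds=10)
except TypeError as err:
    print(f"rejected: {err}")
```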


## 03 Use another model
Now we show how easy it is to use a variational autoencoder instead of a vanilla version.

In [1]:
from autoencodix.utils.default_config import DefaultConfig
import autoencodix as acx
import numpy as np
sample_data = np.random.rand(100, 10)
my_config = DefaultConfig(learning_rate=0.001, epochs=3, checkpoint_interval=1)
varix = acx.Varix(data=sample_data, config=my_config)
result = varix.run()

cpu not relevant here
Epoch: 0, Loss: 1.2922510504722595
Epoch: 1, Loss: 1.2435818016529083
Epoch: 2, Loss: 1.2091135382652283

Invalid parameters: data
Valid parameters are: config


#### Examine Variational result
Here, we have more info in our results object than in the Vanillix case. In addition to the losses and reconstructions, we have the learned parameters mu and logvar of the normal distribution. We also provide the sampled latent spaces for each epoch and split.

You can resample new latent spaces (shown in the section on sampling below).

In [6]:
# the test split is not used during training, so we don't need to pass an epoch
# technically the epoch is -1
latent_test_ep_last = result.latentspaces.get(split="test")
print(latent_test_ep_last.shape)

(1, 20, 16)


#### Different loss types
For our variational autoencoder, the total loss consists of a reconstruction loss and a distribution loss, i.e. the KL divergence. To investigate these losses, the result object has the attribute `sub_losses`. This is a `LossRegistry` with the name of the loss as key; each value is a `TrainingDynamics` object and can be accessed as shown in the Vanillix part.

In [7]:
sub_losses = result.sub_losses
print(f"keys: {sub_losses.keys()}")
recon_dyn = sub_losses.get(key="recon_loss")
print(recon_dyn.get(split="train"))

keys: dict_keys(['recon_loss', 'var_loss'])
[0.42741003 0.44090185 0.37895077]
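
To see how the two parts combine, the sketch below computes a mean-squared-error reconstruction term and the closed-form KL divergence between a diagonal Gaussian and the standard normal for one toy sample. It assumes an unweighted sum of both terms; whether the package applies a weighting factor or a different reduction is not shown in this notebook. The helper names and numbers are made up for illustration.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def mse(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


x, x_hat = [0.2, 0.8, 0.5], [0.25, 0.7, 0.5]  # toy input and reconstruction
mu, logvar = [0.1, -0.3], [0.0, -0.2]          # toy encoder outputs

recon_loss = mse(x, x_hat)
var_loss = kl_to_standard_normal(mu, logvar)
total_loss = recon_loss + var_loss  # assumption: plain, unweighted sum
print(recon_loss, var_loss, total_loss)
```

The KL term is zero exactly when mu is 0 and logvar is 0 in every dimension, which is why it pulls the latent distribution towards the standard normal.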


#### Sample new latentspaces
You might want to use the trained model and the fitted parameters mu and logvar to sample new latent spaces. For this, the Varix pipeline is meant to provide the additional method `sample_latent_space`; in the run recorded below the call still raises an `AttributeError`.

In [None]:
sampled = varix.

AttributeError: 'Varix' object has no attribute 'sample_latent_space'
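
Independent of the pipeline method, sampling from fitted mu and logvar values usually follows the reparameterization idea: z = mu + sigma · eps with sigma = exp(0.5 · logvar) and eps drawn from a standard normal. The sketch below shows this in plain Python under that assumption; `sample_latent` is a hypothetical helper and says nothing about the signature or behaviour of `sample_latent_space`.

```python
import math
import random


def sample_latent(mu, logvar, rng):
    """Draw z = mu + sigma * eps with sigma = exp(0.5 * logvar), eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


rng = random.Random(1)      # seeded so repeated runs give the same samples
mu = [0.5, -1.0, 0.0]       # toy means for a 3-dimensional latent space
logvar = [-2.0, -2.0, 0.0]  # toy log-variances

samples = [sample_latent(mu, logvar, rng) for _ in range(1000)]
means = [sum(column) / len(column) for column in zip(*samples)]
print([round(m, 2) for m in means])  # should lie close to mu
```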

## How to add a new architecture
If you want to add a new architecture


# TODOS
- show how to update and work with the config object (later)

### SANDBOX 
Currently testing MPS (Mac GPU support), float precision, and GPU strategies.

In [None]:
# test
from autoencodix.utils.default_config import DefaultConfig
import torch
from autoencodix.modeling._vanillix_architecture import VanillixArchitecture
from autoencodix.data._numeric_dataset import NumericDataset
from autoencodix.trainers._general_trainer import GeneralTrainer
from autoencodix.utils._result import Result
from autoencodix.utils._losses import VanillixLoss

config = DefaultConfig(
    epochs=3, checkpoint_interval=1, reproducible=True
)
train_dataset = NumericDataset(data=torch.rand(100, 10), config=config)
valid_dataset = NumericDataset(data=torch.rand(10, 10), config=config)


general_trainer = GeneralTrainer(
    trainset=train_dataset,
    validset=valid_dataset,
    result=Result(),
    config=config,
    model_type=VanillixArchitecture,
    loss_type=VanillixLoss,
)
result1 = general_trainer.train()
train_loss1 = result1.losses.get(split="train")
reconstructed_data1 = result1.reconstructions.get(split="train")

result2 = general_trainer.train()
train_losses2 = result2.losses.get(split="train")
reconstructed_data2 = result2.reconstructions.get(split="train")


TypeError: NumericDataset.__init__() missing 1 required positional argument: 'config'

In [None]:
result = van.result
losses = result.losses.get(split="train")
print(losses)

[0.7612474  0.76515329 0.79907356]


In [None]:
sample_data.shape

(100, 10)

In [None]:
r = result_object.reconstructions.get(split="train")

In [None]:
r.shape
r_  = r.reshape((-1,10))
r_.shape

(18, 10)

In [None]:
reconstructions = result.reconstructions.get(split="train")
print((reconstructions.shape))


(3, 6, 10)


In [None]:
my_config = DefaultConfig(float_precision="32", device="mps", n_gpus=1, gpu_strategy="auto")
van = acx.Vanillix(data=sample_data, config=my_config)
van.preprocess()
van.fit(epochs=2)



ValueError: You selected an invalid strategy name: `strategy=None`. It must be either a string or an instance of `lightning_fabric.strategies.Strategy`. Example choices: auto, ddp, ddp_spawn, deepspeed, dp, ... Find a complete list of options in our documentation at https://lightning.ai

In [None]:
from autoencodix.utils._result import Result
result = Result()
# check if the result object is empty or None
for key, value in result.__dict__.items():
    if value is None:
        print(f"{key} is None")
    elif len(value) == 0:
        print(f"{key} is empty")

TypeError: object of type 'TrainingDynamics' has no len()

In [None]:
d = {}

In [None]:
len(d)

0

In [None]:
data = result.losses._data

In [None]:
len(data)

0
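
The `TypeError` in the emptiness check above comes from calling `len()` on an object that does not define `__len__`, while its wrapped `_data` dictionary does support it. One possible way to make such a check robust is sketched below. `NoLen` and `describe` are hypothetical, and falling back to a `_data` attribute is an assumption based on the cells above.

```python
class NoLen:
    """Stand-in for an object that wraps a dict but defines no __len__."""

    def __init__(self):
        self._data = {}


def describe(value):
    """Classify a value as 'None', 'empty', 'filled', or 'unknown'."""
    if value is None:
        return "None"
    if hasattr(value, "__len__"):
        return "empty" if len(value) == 0 else "filled"
    inner = getattr(value, "_data", None)  # assumption: container lives in _data
    if inner is not None and hasattr(inner, "__len__"):
        return "empty" if len(inner) == 0 else "filled"
    return "unknown"


for name, value in {"losses": NoLen(), "model": None, "datasets": [1, 2]}.items():
    print(f"{name} is {describe(value)}")
```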

In [None]:
from autoencodix.utils._traindynamics import TrainingDynamics
from autoencodix.data._datasetcontainer import DatasetContainer
td = TrainingDynamics()
td._data

{}

In [None]:
ds = DatasetContainer(train=[1, 2, 3], valid=[4, 5, 6], test=[7, 8, 9])

In [None]:
ds

DatasetContainer(train=[1, 2, 3], valid=[4, 5, 6], test=[7, 8, 9])

In [None]:
my_config.dict()

NameError: name 'my_config' is not defined

In [None]:
config = DefaultConfig()
config.dict()

/var/folders/5y/4yr_9t4x5zgf77_zw1krm4vw0000gn/T/ipykernel_84592/1790132599.py:2: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/
  config.dict()


{'latent_dim': 16,
 'n_layers': 3,
 'enc_factor': 4,
 'input_dim': 10000,
 'drop_p': 0.1,
 'learning_rate': 0.001,
 'batch_size': 32,
 'epochs': 23,
 'weight_decay': 0.01,
 'reconstruction_loss': 'mse',
 'default_vae_loss': 'kl',
 'min_samples_per_split': 1,
 'device': 'auto',
 'n_gpus': 1,
 'n_workers': 2,
 'checkpoint_interval': 10,
 'float_precision': '32',
 'gpu_strategy': 'auto',
 'train_ratio': 0.7,
 'test_ratio': 0.2,
 'valid_ratio': 0.1,
 'reproducible': True,
 'global_seed': 1}

In [None]:
new_params = {"learning_rate": 0.31, "batch_size": 2, "epochs": 53}

config = DefaultConfig(**new_params)
print(config.model_dump())

{'latent_dim': 16, 'n_layers': 3, 'enc_factor': 4, 'input_dim': 10000, 'drop_p': 0.1, 'learning_rate': 0.31, 'batch_size': 2, 'epochs': 53, 'weight_decay': 0.01, 'reconstruction_loss': 'mse', 'default_vae_loss': 'kl', 'min_samples_per_split': 1, 'device': 'auto', 'n_gpus': 1, 'n_workers': 2, 'checkpoint_interval': 10, 'float_precision': '32', 'gpu_strategy': 'auto', 'train_ratio': 0.7, 'test_ratio': 0.2, 'valid_ratio': 0.1, 'reproducible': True, 'global_seed': 1}


In [None]:
from autoencodix.utils.default_config import DefaultConfig

In [None]:
import torch
torch.cuda.deterministic

AttributeError: module 'torch.cuda' has no attribute 'deterministic'

In [None]:
            torch.cuda.manual_seed(seed=self._config.global_seed)
            torch.cuda.manual_seed_all(seed=self._config.global_seed)
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False

In [None]:
torch.backends.cudnn.benchmark
torch.cuda.manual_seed
# get seed from cuda
torch.cuda.initial_seed()

AssertionError: Torch not compiled with CUDA enabled