# AUTOENCODIX PACKAGE HANDBOOK
This notebook demonstrates the usage of the autoencodix package.
For now it serves as an internal guideline with the goal to:
- test the package from a user perspective
- serve as a first draft of user documentation
- serve as a developer guideline
  - the developer guide will be derived from this notebook

In [1]:
import os
os.getcwd()
os.chdir("../../")


## 00 Generate mock data
As development proceeds, this section should show how to use different data types.
For now, we only use a mock NumPy array.

In [2]:
import numpy as np
sample_data = np.random.rand(100, 10)
sample_data.shape

(100, 10)
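`np.random.rand` draws from NumPy's global random state, so the mock values change on every run. A minimal sketch of a reproducible variant follows; the seed 42 is an arbitrary illustrative choice, not a package default.

```python
import numpy as np

# Seeded generator: the same seed always yields the same mock matrix.
rng = np.random.default_rng(42)  # 42 is an arbitrary, illustrative seed
sample_data = rng.random((100, 10))  # 100 samples x 10 features, values in [0, 1)

# A second generator with the same seed reproduces the data exactly.
sample_data_again = np.random.default_rng(42).random((100, 10))
print(sample_data.shape, np.array_equal(sample_data, sample_data_again))
```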

## 01 General Pipeline Usage

In [3]:
# imports
import autoencodix as acx
from autoencodix.utils.default_config import DefaultConfig

In [None]:
# Use Vanillix Pipeline interface
# needs to be initialized with data
# data should be a numpy array, pandas dataframe or AnnData object
# possible to pass a custom Config object
van = acx.Vanillix(data=sample_data)
# job of old make data
# populates self._features attribute with torch tensor
# populates self._datasets attribute with torch dataset
# (important for training with dataloader)
# possible to pass a custom Config object, or keyword arguments
van.preprocess()
# job of old make model
# calls self.Trainer class to init and train model
# populates self._model attribute with trained model
# populates self.result attribute with training results (model, losses, etc)
van.fit()
# job of old make predict
# if no data is passed, uses the test split from preprocessing
# otherwise, uses the data passed, and preprocesses it
# updates self.result attribute with predictions (latent space, reconstructions, etc)
van.predict()
# job of old make ml_task
# populates self.result attribute with ml task results
van.evaluate() # not implemented yet
# job of old make visualize
# populates self.result attribute with visualizations
van.visualize()
# show visualizations for notebook use
van.show_result()
# --------------------------
# --------------------------
# run all steps in the pipeline
result_object = van.run()



getting model for Vanillix
Epoch: 9, Loss: 1.9387561082839966
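The step-by-step calls and `run()` in the cell above follow a common pattern: every step writes into one shared result object, and `run()` calls the steps in order. The toy sketch below illustrates that pattern in plain Python. `ToyPipeline` and `ToyResult` are hypothetical names invented for this illustration and do not reflect how autoencodix is implemented internally.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyResult:
    """Hypothetical stand-in for the pipeline's result object."""
    steps_run: List[str] = field(default_factory=list)
    values: Dict[str, Any] = field(default_factory=dict)


class ToyPipeline:
    """Illustrative only: not the autoencodix implementation."""

    def __init__(self, data: List[List[float]]) -> None:
        self.data = data
        self.result = ToyResult()

    def preprocess(self) -> None:
        # Stand-in for tensor/dataset creation: record the sample count.
        self.result.values["n_samples"] = len(self.data)
        self.result.steps_run.append("preprocess")

    def fit(self) -> None:
        # Stand-in for training: "learn" the per-feature mean.
        n = len(self.data)
        self.result.values["feature_means"] = [sum(col) / n for col in zip(*self.data)]
        self.result.steps_run.append("fit")

    def predict(self) -> None:
        # Stand-in for inference: center each row with the learned means.
        means = self.result.values["feature_means"]
        self.result.values["centered"] = [
            [x - m for x, m in zip(row, means)] for row in self.data
        ]
        self.result.steps_run.append("predict")

    def run(self) -> ToyResult:
        # Call the steps in order and hand back the shared result object.
        self.preprocess()
        self.fit()
        self.predict()
        return self.result


toy = ToyPipeline(data=[[1.0, 2.0], [3.0, 4.0]])
toy_result = toy.run()
print(toy_result.steps_run)
```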


#### Using a custom train, test, valid split
When you pass data to the pipeline, autoencodix internally splits it based on the train, valid, and test ratios provided in the config (defaults are 70%/10%/20% train/valid/test).
You can either pass custom ratios (see next section) or provide the indices directly, as shown below.

In [None]:
sample_data = np.random.rand(100, 10)
custom_train_indices = np.arange(75) # we won't allow overlap between splits
custom_valid_indices = np.arange(75, 80)
custom_test_indices = np.arange(80, 100)

# the custom split needs to be a dictionary with keys "train", "valid", and "test" and values as numpy arrays
custom_split = {"train": custom_train_indices,
                "valid": custom_valid_indices,
                "test": custom_test_indices}
van = acx.Vanillix(data=sample_data, custom_splits=custom_split)
van.preprocess()
van.fit(epochs=3)

getting model for Vanillix
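For comparison with the explicit indices above, here is a rough sketch of what a ratio-based split could look like. It shuffles the sample indices and cuts them into disjoint parts using the 70%/10%/20% defaults mentioned in this section. `split_by_ratio` and its `seed` argument are hypothetical; the package's actual splitting logic may differ.

```python
import random
from typing import Dict, List


def split_by_ratio(n_samples: int, train: float = 0.7, valid: float = 0.1,
                   test: float = 0.2, seed: int = 0) -> Dict[str, List[int]]:
    """Shuffle sample indices and cut them into disjoint train/valid/test parts."""
    if abs(train + valid + test - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    n_train = round(n_samples * train)
    n_valid = round(n_samples * valid)
    return {
        "train": sorted(indices[:n_train]),
        "valid": sorted(indices[n_train:n_train + n_valid]),
        "test": sorted(indices[n_train + n_valid:]),  # remainder goes to test
    }


splits = split_by_ratio(100)
print({name: len(idx) for name, idx in splits.items()})
```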


It is possible to pass empty splits, but depending on how you use the autoencodix pipeline, this will raise an error at some point. For example, you can call `fit` with only training data, but calling `predict` without providing new data will not work if the test split is empty.
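Before handing a custom split to the pipeline, it can help to check it yourself. The sketch below verifies the three required keys, rejects overlapping or out-of-range indices, and reports which splits are empty. `check_custom_split` is a hypothetical helper written for this notebook, not part of autoencodix.

```python
from typing import Dict, Iterable, List

REQUIRED_KEYS = ("train", "valid", "test")


def check_custom_split(split: Dict[str, Iterable[int]], n_samples: int) -> List[str]:
    """Return the names of empty splits; raise ValueError on structural problems."""
    missing = [key for key in REQUIRED_KEYS if key not in split]
    if missing:
        raise ValueError(f"missing split keys: {missing}")
    seen = set()
    empty = []
    for key in REQUIRED_KEYS:
        indices = [int(i) for i in split[key]]  # also accepts numpy integer arrays
        if not indices:
            empty.append(key)
        if any(i < 0 or i >= n_samples for i in indices):
            raise ValueError(f"{key!r} contains out-of-range indices")
        overlap = seen.intersection(indices)
        if overlap:
            raise ValueError(f"{key!r} overlaps another split at {sorted(overlap)}")
        seen.update(indices)
    return empty


example_split = {"train": range(75), "valid": range(75, 80), "test": range(80, 100)}
print(check_custom_split(example_split, n_samples=100))
print(check_custom_split({**example_split, "test": []}, 100))
```

An empty list means every split has data; a result such as `["test"]` signals that `predict` would need new data passed in.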

#### Using predict with new data
The standard case is to train the model with the train data and then predict with the test split.
However, it is possible to pass new data to the `predict` method to perform inference on it with the already trained model.

In [None]:
new_unseen_data = np.random.rand(10, 10)
van.predict(data=new_unseen_data)


Invalid parameters: data
Valid parameters are: config


#### Examining the result of the pipeline
Each step in the pipeline writes its results in the result object of the Vanillix instance.
In this section we explore how to access and make sense of the results.

In [None]:
result = van.result
print(result)

Result Object Public Attributes:
------------------------------
latentspaces: TrainingDynamics object
reconstructions: TrainingDynamics object
mus: TrainingDynamics object
sigmas: TrainingDynamics object
losses: TrainingDynamics object
preprocessed_data: Tensor of shape (100, 10)
model: Module
model_checkpoints: Dict with 0 items
datasets: DataSetContainer(train=<autoencodix.data._numeric_dataset.NumericDataset object at 0x1053a1420>, valid=<autoencodix.data._numeric_dataset.NumericDataset object at 0x1053a0f70>, test=<autoencodix.data._numeric_dataset.NumericDataset object at 0x10621ff40>)


##### TrainingDynamics object in result
The TrainingDynamics object has the following form:
`<epoch><split><data>`
So if you want to access the train loss stored under epoch 5, you would call:
`result.losses.get(epoch=5, split="train")`

In [None]:
loss_train_ep2 = result.losses.get(epoch=2, split="train")
print(loss_train_ep2)
valid_loss = result.losses.get(split="valid")
print(valid_loss)
print(result.losses.get())

1.004350741704305
[0.35802913 0.38561997 0.41879243]
{0: {'train': array(1.07741566), 'valid': array(0.35802913)}, 1: {'train': array(0.95756259), 'valid': array(0.38561997)}, 2: {'train': array(1.00435074), 'valid': array(0.41879243)}}
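The three access patterns above (one value, one split across all epochs, everything) can be mimicked with a small nested dictionary. `MiniDynamics` below is a hypothetical stand-in that returns plain lists instead of NumPy arrays; it is not the package's `TrainingDynamics` class, and the loss values are made up for illustration.

```python
from typing import Any, Dict, Optional


class MiniDynamics:
    """Hypothetical epoch -> split -> value store; not the package's class."""

    def __init__(self) -> None:
        self._data: Dict[int, Dict[str, Any]] = {}

    def add(self, epoch: int, split: str, value: Any) -> None:
        self._data.setdefault(epoch, {})[split] = value

    def get(self, epoch: Optional[int] = None, split: Optional[str] = None) -> Any:
        if epoch is None and split is None:
            return self._data  # everything, keyed by epoch then split
        if epoch is None:
            # one split across all epochs, ordered by epoch
            return [self._data[e][split] for e in sorted(self._data)
                    if split in self._data[e]]
        if split is None:
            return self._data[epoch]  # all splits of one epoch
        return self._data[epoch][split]  # a single value


losses = MiniDynamics()
# Made-up (train, valid) losses for three epochs, indexed from 0.
for epoch, (train, valid) in enumerate([(1.0, 0.5), (0.8, 0.4), (0.6, 0.3)]):
    losses.add(epoch, "train", train)
    losses.add(epoch, "valid", valid)

print(losses.get(epoch=2, split="train"))
print(losses.get(split="valid"))
```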


## 02 Pipeline usage with custom parameters
Here we show how to customize the pipeline shown above with a user config or with keyword arguments.
In future iterations we want to support reading a config from a file; this will also be demonstrated here.

In [None]:

# Use Vanillix Pipeline interface
# needs to be initialized with data
# data should be a numpy array, pandas dataframe or AnnData object
# possible to pass a custom Config object
van = acx.Vanillix(data=sample_data)
# job of old make data
# populates self._features attribute with torch tensor
# populates self._datasets attribute with torch dataset
# (important for training with dataloader)
# possible to pass a custom Config object, or keyword arguments
van.preprocess()
# job of old make model
# calls self.Trainer class to init and train model
# populates self._model attribute with trained model
# populates self.result attribute with training results (losses, etc)
# van.fit()
""" 
Each step can be run separately, with custom parameters, these parameters
can be passed as keyword arguments, or as a Config object
"""
van.fit(learning_rate=0.01, batch_size=32, epochs=5) # or like this:
my_config = DefaultConfig(learning_rate=130.0, batch_size=32, epochs=5)
van.fit(config=my_config) # config has to be a keyword argument



getting model for Vanillix


KeyboardInterrupt: 
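One way to think about the two call styles is as a merge: start from a config object (or the defaults) and apply keyword arguments on top. The sketch below assumes that keyword arguments take precedence over the config, which this notebook does not state. `SketchConfig` and `resolve_config` are hypothetical, and the default values are copied from the schema output printed in this notebook.

```python
from dataclasses import dataclass, replace
from typing import Any, Optional


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical config; defaults mirror values printed in this notebook."""
    learning_rate: float = 0.001
    batch_size: int = 32
    epochs: int = 23


def resolve_config(config: Optional[SketchConfig] = None, **overrides: Any) -> SketchConfig:
    """Start from `config` (or the defaults) and apply keyword overrides on top."""
    base = config if config is not None else SketchConfig()
    # dataclasses.replace raises TypeError for unknown field names.
    return replace(base, **overrides)


print(resolve_config(learning_rate=0.01, epochs=5))
print(resolve_config(config=SketchConfig(epochs=5)))
```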

#### How to find the relevant keyword arguments for pipeline methods
It can be hard to know what keyword arguments are valid for each step,
so we show:
- how to get a list of allowed keyword arguments
- what happens if you pass non-allowed keyword arguments

In [None]:
# for each pipeline method, we can access its valid_params attribute
van = acx.Vanillix(data=sample_data)
fit_params = van.fit.valid_params # returns a set of keyword arguments that are actually used in the fit method

import pprint
pprint.pprint(fit_params)

{'batch_size',
 'checkpoint_interval',
 'config',
 'epochs',
 'global_seed',
 'gpu_strategy',
 'learning_rate',
 'n_devices',
 'n_workers',
 'reconstruction_loss',
 'reproducible',
 'use_gpu',
 'weight_decay'}
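An attribute like `valid_params` can be attached to a function with a decorator. The sketch below derives the set from the function's keyword-only parameters using the standard `inspect` module. `expose_valid_params` and the toy `fit` function are hypothetical and only illustrate the idea; autoencodix may build its set differently.

```python
import inspect
from typing import Callable


def expose_valid_params(func: Callable) -> Callable:
    """Attach the function's keyword-only parameter names as `valid_params`."""
    signature = inspect.signature(func)
    func.valid_params = {
        name for name, param in signature.parameters.items()
        if param.kind is inspect.Parameter.KEYWORD_ONLY
    }
    return func


@expose_valid_params
def fit(*, config=None, epochs=23, batch_size=32, learning_rate=0.001):
    """Toy stand-in for a pipeline step."""
    return {"epochs": epochs, "batch_size": batch_size, "learning_rate": learning_rate}


print(sorted(fit.valid_params))
```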


To get even more verbose info about the keyword args, you can run the following code.

In [None]:
# when you want to have more info about the params, you can get type hints from the config object
my_config = DefaultConfig()
config_values = my_config.get_params()
my_config.print_schema(filter_params=fit_params)

Valid Keyword Arguments:
--------------------------------------------------

learning_rate:
  Type: <class 'float'>
  Default: 0.001
  Description: Learning rate for optimization

batch_size:
  Type: <class 'int'>
  Default: 32
  Description: Number of samples per batch

epochs:
  Type: <class 'int'>
  Default: 23
  Description: Number of training epochs

weight_decay:
  Type: <class 'float'>
  Default: 0.01
  Description: L2 regularization factor

reconstruction_loss:
  Type: typing.Literal['mse', 'bce']
  Default: mse
  Description: Type of reconstruction loss

use_gpu:
  Type: <class 'bool'>
  Default: False
  Description: Whether to use GPU acceleration

n_devices:
  Type: <class 'int'>
  Default: 1
  Description: Number of devices for computation

n_workers:
  Type: <class 'int'>
  Default: 2
  Description: Number of data loading workers

checkpoint_interval:
  Type: <class 'int'>
  Default: 10
  Description: Interval for saving checkpoints

gpu_strategy:
  Type: typing.Literal['a
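A schema listing like the one above can be generated from field definitions that carry their own description. The following sketch uses standard-library dataclasses with field metadata. `SchemaConfig` and `schema_lines` are hypothetical names, and nothing here is meant to describe how `DefaultConfig.print_schema` is actually implemented.

```python
from dataclasses import dataclass, field, fields
from typing import Iterable, List, Optional


@dataclass
class SchemaConfig:
    """Hypothetical config whose fields carry a description in their metadata."""
    learning_rate: float = field(
        default=0.001, metadata={"description": "Learning rate for optimization"})
    batch_size: int = field(
        default=32, metadata={"description": "Number of samples per batch"})
    epochs: int = field(
        default=23, metadata={"description": "Number of training epochs"})


def schema_lines(config_cls, filter_params: Optional[Iterable[str]] = None) -> List[str]:
    """Describe each field, optionally restricted to the names in `filter_params`."""
    wanted = None if filter_params is None else set(filter_params)
    lines: List[str] = []
    for f in fields(config_cls):
        if wanted is not None and f.name not in wanted:
            continue
        lines += [f"{f.name}:", f"  Type: {f.type}", f"  Default: {f.default}",
                  f"  Description: {f.metadata['description']}", ""]
    return lines


print("\n".join(schema_lines(SchemaConfig, filter_params={"epochs", "batch_size"})))
```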

If you pass unsupported parameters, you get a warning.

In [None]:
# if you use an unsupported keyword argument, you will get a warning
# as you can see, the default value from the DefaultConfig is not overwritten, so training runs for the default number of epochs (23 according to the schema output above), not 10
van.preprocess()
van.fit(epochds=10)


Invalid parameters: epochds
Valid parameters are: batch_size, checkpoint_interval, config, epochs, global_seed, gpu_strategy, learning_rate, n_devices, n_workers, reconstruction_loss, reproducible, use_gpu, weight_decay
getting model for Vanillix
Epoch: 9, Loss: 1.312615990638733
Epoch: 19, Loss: 0.8676923215389252
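The behavior seen above (a misspelled keyword is reported and the default stays in effect) can be reproduced with a small filter. The sketch below drops unsupported keys and warns about them. `filter_kwargs` is a hypothetical helper, and the choice of `warnings.warn` is an assumption, since the notebook output does not show how the package emits its message.

```python
import warnings
from typing import Any, Dict, Iterable


def filter_kwargs(kwargs: Dict[str, Any], valid: Iterable[str]) -> Dict[str, Any]:
    """Keep only supported keyword arguments and warn about the rest."""
    valid_set = set(valid)
    invalid = sorted(set(kwargs) - valid_set)
    if invalid:
        warnings.warn(
            f"Invalid parameters: {', '.join(invalid)}. "
            f"Valid parameters are: {', '.join(sorted(valid_set))}",
            UserWarning,
            stacklevel=2,
        )
    return {key: value for key, value in kwargs.items() if key in valid_set}


defaults = {"epochs": 23, "batch_size": 32}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # The typo "epochds" is dropped, so the default for "epochs" survives.
    used = {**defaults, **filter_kwargs({"epochds": 10}, valid=defaults)}

print(used["epochs"], len(caught))
```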


#### How to get information about the default config parameters

In [None]:
# if you want to see what config parameters are used in the default config you can do it like:
default_config = DefaultConfig()
default_config.print_schema()




DefaultConfig Configuration Parameters:
--------------------------------------------------

latent_dim:
  Type: <class 'int'>
  Default: 16
  Description: Dimension of the latent space

n_layers:
  Type: <class 'int'>
  Default: 3
  Description: Number of layers in encoder/decoder

enc_factor:
  Type: <class 'int'>
  Default: 4
  Description: Scaling factor for encoder dimensions

input_dim:
  Type: <class 'int'>
  Default: 10000
  Description: Input dimension

drop_p:
  Type: <class 'float'>
  Default: 0.1
  Description: Dropout probability

learning_rate:
  Type: <class 'float'>
  Default: 0.001
  Description: Learning rate for optimization

batch_size:
  Type: <class 'int'>
  Default: 32
  Description: Number of samples per batch

epochs:
  Type: <class 'int'>
  Default: 23
  Description: Number of training epochs

weight_decay:
  Type: <class 'float'>
  Default: 0.01
  Description: L2 regularization factor

reconstruction_loss:
  Type: typing.Literal['mse', 'bce']
  Default: mse
 

### Documentation Config class
You can update the config with your own values by:
- passing a dictionary (TODO)
- passing a file (TODO)

# TODOS
- show how to use a custom split
- show how to update and work with the config object (later)

### SANDBOX
Currently testing MPS (Mac GPU support), float precision, and GPU strategies.

In [None]:
my_config = DefaultConfig(float_precision="32", device="mps", n_gpus=1, gpu_strategy="auto")
van = acx.Vanillix(data=sample_data, config=my_config)
van.preprocess()
van.fit(epochs=2)