# Implementing a new CV from scratch
 
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/luigibonati/mlcolvar/blob/main/docs/notebooks/tutorials/adv_newcv_scratch.ipynb)

In this notebook, we will move top-down through the structure of the CV classes in `mlcolvar`.
We will give an overview of how CV classes should be implemented from scratch, alongside some coding conventions adopted in the library that may be useful for external contributors.

As an example, we will implement (and comment) the `AutoEncoderCV` step by step.

## Define the class object
In `mlcolvar`, CV classes inherit from two parent classes:
- `BaseCV` class, which contains some common and default helper functions
- `lightning.LightningModule` class, which automatically gives access to the Lightning package utilities

In the class declaration preamble, we set the names of the `DEFAULT_BLOCKS` that will constitute the main body of the CV itself.

The blocks are meant to correspond to classes and functions defined in `mlcolvar.core`. However, the names we give in `DEFAULT_BLOCKS` are arbitrary: in principle, a model could contain several blocks of the same type, and we would then need to distinguish between them.
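To make the role of the block names concrete, here is a minimal, library-free sketch. The class `TinyModel` is hypothetical and is not part of `mlcolvar`; it only illustrates that the names are labels chosen by the author, and that each name can later be bound to an actual block.

```python
# Hypothetical illustration, not mlcolvar code: block names are just labels.
class TinyModel:
    # 'encoder' and 'decoder' could be two blocks of the same type,
    # told apart only by the names chosen here
    DEFAULT_BLOCKS = ["norm_in", "encoder", "decoder"]

    def __init__(self):
        # start with every named block unset; each one is filled in later
        for name in self.DEFAULT_BLOCKS:
            setattr(self, name, None)


model = TinyModel()
block_state = {name: getattr(model, name) for name in TinyModel.DEFAULT_BLOCKS}
print(block_state)
```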

In [None]:
# Colab setup
import os

if os.getenv("COLAB_RELEASE_TAG"):
    import subprocess
    subprocess.run('wget https://raw.githubusercontent.com/luigibonati/mlcolvar/main/colab_setup.sh', shell=True)
    cmd = subprocess.run('bash colab_setup.sh TUTORIAL', shell=True, stdout=subprocess.PIPE)
    print(cmd.stdout.decode('utf-8'))

In [None]:
import torch
import lightning

from mlcolvar.cvs import BaseCV

class AutoEncoderCV(BaseCV, lightning.LightningModule):
    DEFAULT_BLOCKS = ['norm_in','encoder','decoder'] 



To keep the code in the library as clear as possible, we should also add a short docstring to our CV class, briefly explaining how it works!

To save some space, however, we will skip this in the following cells.

In [None]:
class AutoEncoderCV(BaseCV, lightning.LightningModule):
    """AutoEncoding Collective Variable. It is composed by a first neural network (encoder) which projects 
    the input data into a latent space (the CVs). Then a second network (decoder) takes 
    the CVs and tries to reconstruct the input data based on them. It is an unsupervised learning approach, 
    typically used when no labels are available.
    Furthermore, it can also be used to learn a representation which can be used not to reconstruct the data but 
    to predict, e.g. future configurations. 

    For training it requires a DictDataset with the key 'data' and optionally 'weights'. If a 'target' 
    key is present this will be used as reference for the output of the decoder, otherwise this will be compared
    with the input 'data'.
    """
    
    DEFAULT_BLOCKS = ['norm_in','encoder','decoder'] 

## The CV class `__init__` method
The `__init__` method is the signature of the CV model, as it initializes everything that is necessary for the CV model to run, including blocks, variables, loss functions, etc.

### Declaration of the `__init__` method

All the CVs in `mlcolvar` have some common elements:

- **in/out features**: All the CV classes in `mlcolvar` should define the number of `in_features` and `out_features`, i.e. the number of inputs and outputs, respectively. This information must be passed to the `BaseCV` parent class through `super().__init__`; in the cells below this is done with the call `super().__init__(model=encoder_layers, **kwargs)`.
- **options**: The `options` dict provides the interface to modify the defaults of the CV's elements, e.g. the parameters of the blocks or of the optimizer (see later).
- **`**kwargs`**: CVs in `mlcolvar` also accept keyword arguments to be passed to their inner functions.

Each CV class depends on different parameters. In our example, the characteristic parameters of the `AutoEncoderCV` are just `encoder_layers` (compulsory) and `decoder_layers` (optional).
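As a small illustration of how the layer list relates to the number of inputs and outputs, consider the sketch below. It assumes, as in the test at the end of this notebook, that the list runs from the input size to the CV size; the numbers are arbitrary.

```python
# Assumption for illustration: the layer list runs from input size to CV size.
encoder_layers = [8, 6, 4, 2]

in_features = encoder_layers[0]    # number of inputs
out_features = encoder_layers[-1]  # number of CVs (latent dimension)

print(in_features, out_features)
```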

To stay as user-friendly as possible, in `mlcolvar` we always try to give meaningful and intelligible names to the parameters. Besides that, it is also good practice to provide a complete docstring for the `__init__` method, explaining in more detail what each parameter actually does in the model.


In [None]:
class AutoEncoderCV(BaseCV, lightning.LightningModule):
    DEFAULT_BLOCKS = ['norm_in','encoder','decoder']  

    def __init__(self,
# ================================================ LOOK HERE 0.0 ================================================   
                encoder_layers : list, 
                decoder_layers : list = None, 
                options : dict = None, 
                **kwargs):
        """
        Train a CV defined as the output layer of the encoder of an autoencoder model (latent space). 
        The decoder part is used only during the training for the reconstruction loss.

        Parameters
        ----------
        encoder_layers : list
            Number of neurons per layer of the encoder
        decoder_layers : list, optional
            Number of neurons per layer of the decoder, by default None
            If not set it automatically takes the reversed architecture of the encoder
        options : dict[str,Any], optional
            Options for the building blocks of the model, by default None.
            Available blocks: ['norm_in', 'encoder','decoder'].
            Set 'block_name' = None or False to turn off that block
        """
        super().__init__(model=encoder_layers, **kwargs)
        
# ================================================ LOOK HERE 0.0 ================================================   


### Parse options and parameters
The different options in the `options` dictionary are parsed using the `BaseCV.parse_options` function. This step is required, as it also initializes the defaults whenever specific option entries are not given and checks that the given options make sense for the CV at hand.

Options must be a dictionary of dictionaries mapping the name of a block (or the optimizer) to a dictionary of keyword arguments to pass to the `__init__` function of the block (or the optimizer), i.e. name_of_block -> block_init_kwargs, e.g. `options = {'encoder': {'activation': 'relu'}, 'optimizer': {'lr': 1e-3}}`.

Here we also initialize what is needed from the input parameters. In our case, for example, we specify that whenever `decoder_layers` is not given, it should be the reversed `encoder_layers`.
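The two steps above can be sketched without the library as follows. The function `parse_options_sketch` is a hypothetical stand-in written for this example; it is not the implementation of `BaseCV.parse_options`, and the set of accepted keys is an assumption.

```python
# Hypothetical stand-in for the option parsing step; not the mlcolvar implementation.
def parse_options_sketch(options, block_names):
    """Return a dict with one entry per block plus 'optimizer', filling gaps with {}."""
    options = {} if options is None else dict(options)
    allowed = list(block_names) + ["optimizer"]
    # reject keys that do not correspond to a known block
    for key in options:
        if key not in allowed:
            raise KeyError(f"Unknown option '{key}', expected one of {allowed}")
    # missing entries default to an empty kwargs dictionary
    for key in allowed:
        options.setdefault(key, {})
    return options


blocks = ["norm_in", "encoder", "decoder"]
parsed = parse_options_sketch({"encoder": {"activation": "relu"}}, blocks)

# default decoder architecture: the encoder layers in reverse order
encoder_layers = [8, 6, 4, 2]
decoder_layers = None
if decoder_layers is None:
    decoder_layers = encoder_layers[::-1]

print(parsed)
print(decoder_layers)
```

Filling every missing entry with an empty dictionary means that later code can always write `**options[name]` without checking whether the user provided that entry.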

In [None]:
class AutoEncoderCV(BaseCV, lightning.LightningModule):
    DEFAULT_BLOCKS = ['norm_in','encoder','decoder'] 
    
    def __init__(self,
                encoder_layers : list, 
                decoder_layers : list = None, 
                options : dict = None, 
                **kwargs):
        super().__init__(model=encoder_layers, **kwargs)

# ================================================ LOOK HERE 0.0 ================================================   
        
        # ======= OPTIONS ======= 
        # parse and sanitize
        options = self.parse_options(options)

        # if decoder is not given reverse the encoder
        if decoder_layers is None:
            decoder_layers = encoder_layers[::-1]
            
# ================================================ LOOK HERE 0.0 ================================================   



### Define the `loss_fn` in the model
In the `mlcolvar` CVs, the loss functions are defined as attributes of the CV class. In our case, we will use the `MSELoss` defined in `mlcolvar.core.loss`.
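For readers unfamiliar with the reconstruction loss, here is a library-free sketch of a mean squared error with optional per-sample weights, stored as an attribute of a toy class. The names `mse_sketch` and `ModelSketch` are hypothetical, and the weighting scheme (weights multiply the squared errors before averaging) is an assumption; the exact definition used by `MSELoss` may differ.

```python
# Library-free sketch of a (weighted) mean squared error on nested lists.
def mse_sketch(x_hat, x_ref, weights=None):
    """Average squared difference over all entries; optional per-sample weights."""
    if weights is None:
        weights = [1.0] * len(x_ref)
    total, count = 0.0, 0
    for row_hat, row_ref, w in zip(x_hat, x_ref, weights):
        for a, b in zip(row_hat, row_ref):
            total += w * (a - b) ** 2
            count += 1
    return total / count


class ModelSketch:
    def __init__(self):
        # the loss is stored as an attribute, so it can be inspected or swapped later
        self.loss_fn = mse_sketch


model = ModelSketch()
x_ref = [[0.0, 0.0], [1.0, 1.0]]
x_hat = [[1.0, 0.0], [1.0, 3.0]]
loss = model.loss_fn(x_hat, x_ref)                               # 1.25
loss_weighted = model.loss_fn(x_hat, x_ref, weights=[2.0, 0.5])  # 1.0
```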

In [None]:
from mlcolvar.core.loss import MSELoss

class AutoEncoderCV(BaseCV, lightning.LightningModule):
    DEFAULT_BLOCKS = ['norm_in','encoder','decoder'] 
    
    def __init__(self,
                encoder_layers : list, 
                decoder_layers : list = None, 
                options : dict = None, 
                **kwargs):
        super().__init__(model=encoder_layers, **kwargs)

        # ======= OPTIONS ======= 
        # parse and sanitize
        options = self.parse_options(options)

        # if decoder is not given reverse the encoder
        if decoder_layers is None:
            decoder_layers = encoder_layers[::-1]
            
# ================================================ LOOK HERE 0.0 ================================================   

        # =======   LOSS  =======
        # Reconstruction (MSE) loss
        self.loss_fn = MSELoss()

# ================================================ LOOK HERE 0.0 ================================================   


### Initialize the Blocks in the CV model
In general, the blocks are meant to be initialized relying on the functions and classes implemented in `mlcolvar.core`.

As a reminder, the list of names of the blocks we want to include in our CV is defined in the class constant `DEFAULT_BLOCKS`.

In our example, we will implement `norm_in` = `Normalization()` to normalize the inputs, `encoder` = `FeedForward()` as the NN for the encoder part of the architecture, and `decoder` = `FeedForward()` as the NN for the decoder part.

#### Modifying the block defaults

We pass the entries of `options` as kwargs (`**options[o]`) to the block constructors, so that the `options` dictionary can be used to modify the defaults when initializing the CV model.
For example, in the case of the `encoder` block, we can change the activation function of the layers to `shifted_softplus` using

`options={'encoder':{'activation':'shifted_softplus'}}`

Sometimes we may also want to deactivate a block, as we do here for the `norm_in` block, which can be skipped using

`options={'norm_in': None}` or `options={'norm_in': False}`
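The option handling described above can be tried out in isolation with the sketch below. `BlockSketch` and `build_blocks` are hypothetical stand-ins for the library blocks and for the body of `__init__`; only the pattern (skip on `None`/`False`, otherwise unpack the kwargs) mirrors the cell that follows.

```python
# Hypothetical stand-ins for library blocks, used only to show the option handling.
class BlockSketch:
    def __init__(self, layers, activation="relu"):
        self.layers = layers
        self.activation = activation


def build_blocks(encoder_layers, options):
    blocks = {"norm_in": None, "encoder": None}
    # a block set to None or False is skipped and left as None
    o = "norm_in"
    if (options[o] is not False) and (options[o] is not None):
        blocks[o] = BlockSketch([encoder_layers[0]], **options[o])
    # the encoder is always built; its options override the defaults
    o = "encoder"
    blocks[o] = BlockSketch(encoder_layers, **options[o])
    return blocks


options = {"norm_in": None, "encoder": {"activation": "shifted_softplus"}}
blocks = build_blocks([8, 4, 2], options)

defaults = build_blocks([8, 4, 2], {"norm_in": {}, "encoder": {}})
```

Note that an empty dictionary `{}` keeps the block active with its defaults, whereas `None` and `False` switch it off.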


In [None]:
from mlcolvar.core.nn import FeedForward
from mlcolvar.core.transform import Normalization

class AutoEncoderCV(BaseCV, lightning.LightningModule):
    DEFAULT_BLOCKS = ['norm_in','encoder','decoder'] 
    
    def __init__(self,
                encoder_layers : list, 
                decoder_layers : list = None, 
                options : dict = None, 
                **kwargs):
        super().__init__(model=encoder_layers, **kwargs)

        # ======= OPTIONS ======= 
        # parse and sanitize
        options = self.parse_options(options)

        # if decoder is not given reverse the encoder
        if decoder_layers is None:
            decoder_layers = encoder_layers[::-1]
            
        # =======   LOSS  =======
        # Reconstruction (MSE) loss
        self.loss_fn = MSELoss()

# ================================================ LOOK HERE 0.0 ================================================   

        # ======= BLOCKS =======

        # initialize norm_in
        o = 'norm_in'
        if ( options[o] is not False ) and (options[o] is not None): # this allows to deactivate it
            self.norm_in = Normalization(self.in_features,**options[o]) 

        # initialize encoder
        o = 'encoder'
        self.encoder = FeedForward(encoder_layers, **options[o])

        # initialize decoder
        o = 'decoder'
        self.decoder = FeedForward(decoder_layers, **options[o])

# ================================================ LOOK HERE 0.0 ================================================   


## Defining the `forward` and `forward_cv` function
The `BaseCV` class has two methods that apply the CV model:
- `forward_cv` sequentially executes the blocks, skipping pre- and postprocessing.
- `forward`, which is used when calling `model(input)` and for deploying the model, also applies pre- and postprocessing operations, if present.

By default, **all** the defined blocks are meant to be executed to obtain the CV; however, sometimes this may not be the case.
In the case of an autoencoder, for example, the `decoder` block must be skipped, as the CV space corresponds to the latent representation of the autoencoder.

To implement this we must:
- **overload** the `forward_cv` method of the `BaseCV` parent class in our CV model
- **implement** a function (`encode_decode`), to be used during training, that executes both the encoder and the decoder and reverts the normalization applied to the inputs
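The difference between the two code paths can be seen on a toy, one-dimensional example. The class `AutoEncoderSketch` and its hand-written "networks" are hypothetical and only serve to show which operations each method applies.

```python
# Toy one-dimensional "autoencoder" built from plain functions (illustrative only).
class AutoEncoderSketch:
    def __init__(self, mean, scale):
        self.mean, self.scale = mean, scale

    def norm(self, x):          # normalization of the inputs
        return (x - self.mean) / self.scale

    def norm_inverse(self, x):  # undo the normalization
        return x * self.scale + self.mean

    def encoder(self, x):       # toy encoder: halve the value
        return 0.5 * x

    def decoder(self, z):       # toy decoder: exact inverse of the encoder
        return 2.0 * z

    def forward_cv(self, x):
        # the CV is the latent value: the decoder is not applied
        return self.encoder(self.norm(x))

    def encode_decode(self, x):
        # full pass used for the reconstruction loss, back in input units
        return self.norm_inverse(self.decoder(self.forward_cv(x)))


model = AutoEncoderSketch(mean=10.0, scale=4.0)
cv = model.forward_cv(18.0)                 # 1.0, the latent value
reconstruction = model.encode_decode(18.0)  # 18.0, the input is recovered
```

Because the toy decoder exactly inverts the toy encoder, the reconstruction matches the input; with trained networks the match is only approximate, and the mismatch is what the reconstruction loss measures.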

In [8]:
def forward_cv(self, x: torch.Tensor) -> (torch.Tensor):
    if self.norm_in is not None:
        x = self.norm_in(x)
    x = self.encoder(x)
    return x

def encode_decode(self, x: torch.Tensor) -> (torch.Tensor):
    x = self.forward_cv(x)  # encode without pre/postprocessing
    x = self.decoder(x)
    if self.norm_in is not None:
        x = self.norm_in.inverse(x)
    return x

## Define the `training_step`
All the CV classes in `mlcolvar` must overload the `lightning.LightningModule.training_step` function.

- First, within this function we select the data we need from the batch. This is done using the keyword indexing of `mlcolvar.data.DictDataset`, which makes the code easy to read.
- Then we apply the model and compute the loss function from the results.
- Finally, and optionally, we log the quantities we are interested in monitoring using the lightning framework.

The `BaseCV` parent class also has `validation_step` and `test_step` functions, which by default are equal to `training_step`.
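The data-selection logic of the first step can be checked on plain dictionaries, without lightning. The helper `select_inputs` is hypothetical; it only reproduces the key handling ('data', optional 'target', optional 'weights') used in the cell that follows.

```python
# Illustrative batch handling with plain dictionaries (no lightning required).
def select_inputs(batch):
    """Return (input, reference, loss keyword arguments) from a dict-like batch."""
    x = batch["data"]
    # reconstruct 'target' when given, otherwise the input itself
    x_ref = batch["target"] if "target" in batch else x
    loss_kwargs = {}
    if "weights" in batch:
        loss_kwargs["weights"] = batch["weights"]
    return x, x_ref, loss_kwargs


plain = select_inputs({"data": [1.0, 2.0]})
with_target = select_inputs(
    {"data": [1.0, 2.0], "target": [2.0, 3.0], "weights": [0.5, 0.5]}
)
```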

In [9]:
def training_step(self, train_batch, batch_idx):
    # =================get data===================
    x = train_batch['data']
    loss_kwargs = {}
    if 'weights' in train_batch:
        loss_kwargs['weights'] = train_batch['weights']

    # =================forward====================
    x_hat = self.encode_decode(x)

    # ===================loss=====================
    # Reference output (compare with a 'target' key, if any, otherwise with input 'data')
    if 'target' in train_batch:
        x_ref = train_batch['target']
    else:
        x_ref = x 
    loss = self.loss_fn(x_hat, x_ref, **loss_kwargs)
    
    # ====================log=====================     
    name = 'train' if self.training else 'valid'       
    self.log(f'{name}_loss', loss, on_epoch=True)
    return loss

## Wrap up: the complete example CV class 

In [None]:
class AutoEncoderCV(BaseCV, lightning.LightningModule):
    """AutoEncoding Collective Variable. It is composed by a first neural network (encoder) which projects 
    the input data into a latent space (the CVs). Then a second network (decoder) takes 
    the CVs and tries to reconstruct the input data based on them. It is an unsupervised learning approach, 
    typically used when no labels are available.
    Furthermore, it can also be used to learn a representation which can be used not to reconstruct the data but 
    to predict, e.g. future configurations. 

    For training it requires a DictDataset with the key 'data' and optionally 'weights'. If a 'target' 
    key is present this will be used as reference for the output of the decoder, otherwise this will be compared
    with the input 'data'.
    """
    
    DEFAULT_BLOCKS = ['norm_in','encoder','decoder'] 
    
    def __init__(self,
                encoder_layers : list, 
                decoder_layers : list = None, 
                options : dict = None, 
                **kwargs):
        """
        Train a CV defined as the output layer of the encoder of an autoencoder model (latent space). 
        The decoder part is used only during the training for the reconstruction loss.

        Parameters
        ----------
        encoder_layers : list
            Number of neurons per layer of the encoder
        decoder_layers : list, optional
            Number of neurons per layer of the decoder, by default None
            If not set it automatically takes the reversed architecture of the encoder
        options : dict[str,Any], optional
            Options for the building blocks of the model, by default None.
            Available blocks: ['norm_in', 'encoder','decoder'].
            Set 'block_name' = None or False to turn off that block
        """
        super().__init__(model=encoder_layers, **kwargs)

        # ======= OPTIONS ======= 
        # parse and sanitize
        options = self.parse_options(options)

        # if decoder is not given reverse the encoder
        if decoder_layers is None:
            decoder_layers = encoder_layers[::-1]
            
        # =======   LOSS  =======
        # Reconstruction (MSE) loss
        self.loss_fn = MSELoss()

        # ======= BLOCKS =======

        # initialize norm_in
        o = 'norm_in'
        if ( options[o] is not False ) and (options[o] is not None): # this allows to deactivate it
            self.norm_in = Normalization(self.in_features,**options[o]) 

        # initialize encoder
        o = 'encoder'
        self.encoder = FeedForward(encoder_layers, **options[o])

        # initialize decoder
        o = 'decoder'
        self.decoder = FeedForward(decoder_layers, **options[o])

    def forward_cv(self, x: torch.Tensor) -> (torch.Tensor):
        if self.norm_in is not None:
            x = self.norm_in(x)
        x = self.encoder(x)
        return x

    def encode_decode(self, x: torch.Tensor) -> (torch.Tensor):
        x = self.forward_cv(x)  # encode without pre/postprocessing
        x = self.decoder(x)
        if self.norm_in is not None:
            x = self.norm_in.inverse(x)
        return x
    
    def training_step(self, train_batch, batch_idx):
        # =================get data===================
        x = train_batch['data']
        loss_kwargs = {}
        if 'weights' in train_batch:
            loss_kwargs['weights'] = train_batch['weights']

        # =================forward====================
        x_hat = self.encode_decode(x)

        # ===================loss=====================
        # Reference output (compare with a 'target' key, if any, otherwise with input 'data')
        if 'target' in train_batch:
            x_ref = train_batch['target']
        else:
            x_ref = x 
        loss = self.loss_fn(x_hat, x_ref, **loss_kwargs)
        
        # ====================log=====================     
        name = 'train' if self.training else 'valid'       
        self.log(f'{name}_loss', loss, on_epoch=True)
        return loss

## Write test functions
To ensure the smooth functioning of the `mlcolvar` library, all the main functions have to be accompanied by proper test functions, which should be added to the tests folder.
In their final form, these are mainly meant to ensure that the code does not crash under the different possible settings, and they should be kept as generic and concise as possible.
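One way to cover several settings in a compact test is to generate the option dictionaries programmatically. The sketch below only builds such a grid; the choice of settings is hypothetical, and in a real test the loop body would construct the model and run a short training.

```python
from itertools import product

# Hypothetical grid of settings a test could loop over.
norm_in_choices = [None, {}]
activation_choices = ["relu", "shifted_softplus"]

settings = []
for norm_in, activation in product(norm_in_choices, activation_choices):
    settings.append({"norm_in": norm_in, "encoder": {"activation": activation}})

for options in settings:
    # in a real test, build the model with these options and run a short training here
    print(options)
```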

In [16]:
def test_autoencodercv():
    from mlcolvar.data import DictDataset, DictModule
    import numpy as np

    in_features, out_features = 8,2
    layers = [in_features, 6, 4, out_features]

    # initialize via dictionary
    options = { 'norm_in'  : None,
             'encoder' : { 'activation' : 'relu' },
             'optimizer' : {'lr' : 1e-3}
           } 
    model = AutoEncoderCV( encoder_layers=layers, options=options )

    # train on synthetic dataset
    X = torch.randn(100,in_features) 
    dataset = DictDataset({'data': X})
    datamodule = DictModule(dataset)
    trainer = lightning.Trainer(max_epochs=1, log_every_n_steps=2,logger=None, enable_checkpointing=False, enable_model_summary=False)
    trainer.fit( model, datamodule )
    model.eval()
    X_hat = model(X)

if __name__ == "__main__":
    test_autoencodercv() 

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


                                                                            



Epoch 0: 100%|██████████| 1/1 [00:00<00:00, 71.27it/s, v_num=32] 

`Trainer.fit` stopped: `max_epochs=1` reached.


Epoch 0: 100%|██████████| 1/1 [00:00<00:00, 64.56it/s, v_num=32]
