# Pre/post processing
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/luigibonati/mlcolvar/blob/main/docs/notebooks/tutorials/adv_preprocessing.ipynb)

This tutorial shows how to add pre- and postprocessing modules to CVs. These modules hold any operation that should be performed only at inference time, not during training. This additional flexibility is useful, for example, to:
- apply a preprocessing to the data once, rather than at every training step, while also saving it in the model so that it is applied at prediction time, e.g., in PLUMED
- apply a postprocessing after the training is finished, for example, to normalize the output CV

## Setup

In [7]:
# Colab setup
import os

if os.getenv("COLAB_RELEASE_TAG"):
    import subprocess
    subprocess.run('wget https://raw.githubusercontent.com/luigibonati/mlcolvar/main/colab_setup.sh', shell=True)
    cmd = subprocess.run('bash colab_setup.sh TUTORIAL', shell=True, stdout=subprocess.PIPE)
    print(cmd.stdout.decode('utf-8'))
     
import torch
import mlcolvar
import numpy as np

## BaseCV class

Note that the `BaseCV` class implements the forward method in the following way:

In [None]:
def forward(self, x : torch.Tensor) -> torch.Tensor:
    """
    Evaluation of the CV
    - Apply preprocessing if any
    - Execute the forward_cv method
    - Apply postprocessing if any
    """
    
    if self.preprocessing is not None:
        x = self.preprocessing(x)

    x = self.forward_cv(x)

    if self.postprocessing is not None:
        x = self.postprocessing(x)

    return x

As explained in the tutorial on implementing CVs from scratch, 
- the `forward` method is supposed to be called during inference
- the `forward_cv` method is called from `training_step`, and is the one which is re-implemented by the various subclasses
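
The relation between the two methods can be illustrated with a minimal sketch. `ToyCV` below is a hypothetical stand-in written only for this example (it is not the `mlcolvar` `BaseCV`), and `scale` is an arbitrary illustrative preprocessing function. It only assumes the `forward` logic shown above.

```python
import torch


class ToyCV(torch.nn.Module):
    """Hypothetical stand-in for a CV, mimicking the forward logic above."""

    def __init__(self, n_in, preprocessing=None, postprocessing=None):
        super().__init__()
        self.preprocessing = preprocessing
        self.postprocessing = postprocessing
        self.nn = torch.nn.Linear(n_in, 1)

    def forward_cv(self, x):
        # what a training step would call, on already pre-processed data
        return self.nn(x)

    def forward(self, x):
        # what is called at inference, on raw data
        if self.preprocessing is not None:
            x = self.preprocessing(x)
        x = self.forward_cv(x)
        if self.postprocessing is not None:
            x = self.postprocessing(x)
        return x


def scale(x):
    # illustrative preprocessing: keep the first 3 columns and double them
    return 2.0 * x[:, :3]


cv = ToyCV(n_in=3, preprocessing=scale)
X_raw = torch.rand(5, 10)

with torch.no_grad():
    out_inference = cv(X_raw)                    # forward on raw data
    out_training = cv.forward_cv(scale(X_raw))   # forward_cv on pre-processed data

print(torch.allclose(out_inference, out_training))
```

With no postprocessing set, the two calls give the same values, which is the property that makes it safe to train on pre-processed data and deploy on raw data.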

## Pre-processing

Assume we have a dataset on which we want to apply a preprocessing operation. In general we can define this operation as:

- (a) a module implemented in the library (such as `mlcolvar.core.transform` or `mlcolvar.core.stats` objects)
- (b) a generic class that inherits from `torch.nn.Module` (including `torch.nn.Sequential`, to chain several transformations)
- (c) a generic function that takes as input a `torch.Tensor` and returns another `torch.Tensor`. 

If the transformation leaves the dimensionality of the inputs unchanged, all three cases work without any other changes. Otherwise, the preprocessing must expose an `in_features` member that specifies the size of the raw input, which is used to correctly chain it with the model.
This member is already present in all objects of type (a) and must be added to those of type (b), while it is not available for plain Python functions (c).
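
As a rough sketch of cases (b) and (c), the snippet below defines a custom module that changes the input size and therefore carries an `in_features` member, plus a plain function that keeps the size unchanged. The names `SelectFeatures` and `center` are hypothetical and introduced only for illustration. How `mlcolvar` reads `in_features` is taken from the description above, not shown here.

```python
import torch


class SelectFeatures(torch.nn.Module):
    """Hypothetical case (b): keep only a subset of the input columns."""

    def __init__(self, in_features: int, indices: list):
        super().__init__()
        # size of the raw input, before the transformation
        self.in_features = in_features
        self.out_features = len(indices)
        # stored as a buffer so it is saved and moved together with the module
        self.register_buffer("indices", torch.tensor(indices, dtype=torch.long))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x[:, self.indices]


def center(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical case (c): a plain function that keeps the dimensionality."""
    return x - 0.5


pre = SelectFeatures(in_features=10, indices=[0, 3, 7])
X_raw = torch.rand(100, 10)

X_sel = pre(X_raw)      # dimensionality changes: 10 -> 3
X_cen = center(X_raw)   # dimensionality unchanged

print(X_sel.shape, X_cen.shape)
```

The module reduces the inputs from 10 to 3 features, so a model using it as preprocessing would need an input layer of size 3, while `in_features` records the original size of 10.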

Once we have defined the preprocessing, we need to:
- apply it to the data before creating the Dataset/Datamodule
- save it into the model. This can be done either by passing it to the `preprocessing` keyword in the constructor or by assigning it to the `preprocessing` member after initialization.


### Using a mlcolvar object as preprocessing

In this example we show how to use a `mlcolvar` module, and in particular Principal Component Analysis (PCA) to reduce the dimensionality of the inputs. We first define the preprocessing and compute the 2 principal components out of a 10-d dataset.

In [8]:
from mlcolvar.core.stats import PCA

# create synthetic dataset
n_input = 10
X = torch.rand(100,n_input)
y = X.square().sum(1)

# compute PCA
n_pca = 2

pca = PCA(in_features=n_input, out_features=n_pca)
_ = pca.compute(X)
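
For intuition about what the projection does, here is a minimal sketch of PCA written directly with `torch.linalg.svd`. It is an independent illustration and not necessarily how the `mlcolvar` `PCA` class is implemented (for instance, whether it centers the data is an assumption made here). The variable names are chosen only for this example.

```python
import torch

torch.manual_seed(0)
X_demo = torch.rand(100, 10)

# center the data (assumption of this sketch), then decompose it
X_centered = X_demo - X_demo.mean(dim=0, keepdim=True)
U, S, Vh = torch.linalg.svd(X_centered, full_matrices=False)

# the principal directions are the leading right-singular vectors
n_components = 2
components = Vh[:n_components].T        # shape (10, 2)
X_proj = X_centered @ components        # shape (100, 2)

# variance captured by each retained component
explained_var = S[:n_components] ** 2 / (X_demo.shape[0] - 1)

print(X_proj.shape)
```

The projected data has one column per retained component, ordered by decreasing variance.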

Then we can apply it to the raw data to obtain the pre-processed inputs and build the dataset.

In [9]:
from mlcolvar.data import DictDataset

X_pre = pca(X)

DictDataset(dict(data=X_pre,target=y))

DictDataset( "data": [100, 2], "target": [100] )

Finally, we save the preprocessing into the model, here a `RegressionCV`. Note that the input size of the CV must now be 2, since it is applied to the pre-processed dataset.

In [12]:
from mlcolvar.cvs import RegressionCV

model = RegressionCV(model=[2,10,10,1], 
                     preprocessing = pca ) 

# the preprocessing can also be saved later, like in:
# model.preprocessing = pca

model

RegressionCV(
  (preprocessing): PCA(in_features=10, out_features=2)
  (loss_fn): MSELoss()
  (norm_in): Normalization(in_features=2, out_features=2, mode=mean_std)
  (nn): FeedForward(
    (nn): Sequential(
      (0): Linear(in_features=2, out_features=10, bias=True)
      (1): ReLU(inplace=True)
      (2): Linear(in_features=10, out_features=10, bias=True)
      (3): ReLU(inplace=True)
      (4): Linear(in_features=10, out_features=1, bias=True)
    )
  )
)

For inference, we can either call `forward` on the raw data (this is what is exported to TorchScript) or call `forward_cv` on the pre-processed data (this is what is executed during training). The two calls give the same result.

In [11]:
y_pred      = model.forward(X) #equivalent to model(X)
y_pred_pre  = model.forward_cv(X_pre)

torch.allclose(y_pred,y_pred_pre)

True

## Post-processing

Similarly, one might want to apply some post-processing operations, typically after the training is completed. Here we use this feature to rescale the CV output so that it lies in the range between -1 and 1.

In [15]:
from mlcolvar.cvs import AutoEncoderCV

model = AutoEncoderCV(encoder_layers=[10,5,1])

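Compute the statistics of the CV output (mean, standard deviation, minimum and maximum) with the `Statistics` class. These provide the values that the `Normalization` class subtracts from the output and divides it by.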

In [23]:
from mlcolvar.core.transform import Statistics

with torch.no_grad():
    y_pred = model(X)
    
stats = Statistics(y_pred).to_dict()
stats

{'mean': tensor([-0.2367]),
 'std': tensor([0.0248]),
 'min': tensor([-0.3403]),
 'max': tensor([-0.1554])}

Define a `Normalization` object based on these values with `mode='min_max'`. Note that to standardize the outputs so that the mean is 0 and the standard deviation is 1, you should use `mode='mean_std'` instead.

In [24]:
from mlcolvar.core.transform import Normalization

norm = Normalization(in_features=1,
                        stats=stats, mode='min_max')

Finally, we can save it as postprocessing in the CV object, and test whether it is working when calling the forward method.

In [27]:
model.postprocessing = norm

with torch.no_grad():
    y_pred_post = model(X)
    
stats = Statistics(y_pred_post).to_dict()
stats

{'mean': tensor([0.1210]),
 'std': tensor([0.2687]),
 'min': tensor([-1.]),
 'max': tensor([1.])}

That's it! Now the outputs of the CV will be rescaled such that the min and max over the training set are equal to -1 and 1.
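
The printed statistics can be cross-checked by hand. The sketch below assumes that `min_max` mode subtracts the midpoint of the range and divides by half of its width. This is an assumption made for illustration: it is consistent with the min and max of -1 and 1 shown above, but the formula is not stated in this tutorial. The numbers are copied from the statistics printed before postprocessing.

```python
# values copied from the statistics printed before postprocessing
y_min, y_max = -0.3403, -0.1554
y_mean, y_std = -0.2367, 0.0248

# assumed min_max rescaling: subtract the midpoint, divide by the half-width
center = (y_max + y_min) / 2
half_range = (y_max - y_min) / 2


def rescale(y):
    return (y - center) / half_range


rescaled_min = rescale(y_min)
rescaled_max = rescale(y_max)
rescaled_mean = rescale(y_mean)
# a standard deviation is only scaled, not shifted
rescaled_std = y_std / half_range

print(rescaled_min, rescaled_max, rescaled_mean, rescaled_std)
```

Under this assumption the rescaled mean and standard deviation agree with the values printed after postprocessing, up to the rounding of the four-digit inputs.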

Note: if you would like to reset the pre- or postprocessing modules, you can just set them to `None`.