# Overview

## Enhanced sampling and collective variables

- CV-based enhanced sampling
- meaning of CVs:
    1. dimensionality reduction
    2. able to distinguish metastable states of interest
    3. able to promote the sampling along the minimum free energy pathways
- chicken-and-egg problem
- historical pathway: from physics to data-driven
    - physical intuition
    - linear transformation methods
    - non-linear (e.g. NN) CVs

## What is `mlcolvar`
`mlcolvar`, which stands for Machine Learning COLlective VARiables, is a Python library designed to aid the design of data-driven reaction coordinates for atomistic simulations.

The guiding principles of `mlcolvar` are twofold. On the one hand, we wanted to develop a unified framework to help test and deploy CVs proposed in the literature. On the other hand, we tried to do so in a modular way, which should also simplify the development of new approaches and the cross-contamination between them.

The library is based on the PyTorch machine learning library, and we decided to rely on the Lightning high-level package to simplify the overall workflow and focus on the design of the CVs.

While it can of course be used as a standalone tool (e.g. for the analysis of MD simulations), its primary purpose is to be part of an enhanced sampling workflow in combination with PLUMED. Thus, the input features are expected to come from PLUMED, and the final result is a serialized model that can be deployed in PLUMED via the LibTorch C++ interface.
Note that, unlike other ML workflows, this implies that all the pre- and postprocessing steps (e.g. standardization of the input data) have to be saved inside the model, to avoid repeating them in PLUMED, which could be tedious and/or complex.

<center><img src="images/graphical_overview_mlcolvar.png" width="800" /></center>

## `mlcolvar` workflow

The main goal of `mlcolvar` is to make the construction of CVs as straightforward and accessible as possible for all types of users.

In general, the process of optimizing CVs from data consists of the steps presented in Fig. XXX, which correspond to the following pseudocode:

1. Import training data (e.g. PLUMED COLVAR files), using `mlcolvar.utils.io`
2. Split the dataset into training and validation sets, using a `lightning.LightningDataModule`
3. Choose a model from `mlcolvar.cvs` and define its hyper-parameters
4. Define a `lightning.Trainer` object and customize the training procedure (e.g. loggers, early stopping, model checkpointing)
5. Optimize the parameters (`trainer.fit`)
6. Compile the model with TorchScript (`model.to_torchscript()`)
7. Use it in PLUMED with the pytorch module
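As an illustration of steps 1 and 2 only, the following is a minimal, dependency-free sketch of what importing a COLVAR file and splitting it involves. It is not the `mlcolvar.utils.io` implementation: the helper names `read_colvar` and `train_valid_split` are hypothetical, and the sketch assumes only the standard PLUMED convention that column names are listed on a `#! FIELDS` header line.

```python
import io
import random


def read_colvar(stream):
    """Parse a PLUMED COLVAR-style text stream into {column_name: [floats]}.

    Hypothetical helper for illustration; not part of mlcolvar.
    """
    names, columns = None, None
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#!"):
            tokens = line.split()
            # The first "#! FIELDS ..." header lists the column names.
            if len(tokens) > 1 and tokens[1] == "FIELDS" and names is None:
                names = tokens[2:]
                columns = {name: [] for name in names}
            continue  # other "#!" lines (such as SET) are metadata
        if line.startswith("#"):
            continue  # plain comments
        if names is None:
            raise ValueError("data found before a '#! FIELDS' header")
        values = [float(v) for v in line.split()]
        if len(values) != len(names):
            raise ValueError("row length does not match the FIELDS header")
        for name, value in zip(names, values):
            columns[name].append(value)
    if names is None:
        raise ValueError("no '#! FIELDS' header found")
    return columns


def train_valid_split(n_samples, valid_fraction=0.2, seed=0):
    """Return shuffled (train_indices, valid_indices) for n_samples frames."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_valid = int(round(n_samples * valid_fraction))
    return indices[n_valid:], indices[:n_valid]


# Toy COLVAR content with a time column and two descriptors.
example = io.StringIO(
    "#! FIELDS time d1 d2\n"
    "0.0 1.10 2.05\n"
    "1.0 1.20 2.00\n"
    "2.0 1.15 1.95\n"
    "3.0 2.30 0.90\n"
    "4.0 2.35 0.85\n"
)
colvar = read_colvar(example)
train_idx, valid_idx = train_valid_split(len(colvar["time"]), valid_fraction=0.2)
```

In the library itself these two steps are handled by `mlcolvar.utils.io` and by the data module, so a sketch like this is only meant to show the shape of the data that enters the workflow.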

## Structure of the mlcolvar library

### data
Implements all the tools used to efficiently handle data in `mlcolvar`. The structure is inspired by `lightning`, with the addition of a dictionary-like handling of the datasets, in which fields are indexed by keyword for ease of use.
The key elements are:
- **DictionaryDataset**: A dictionary-like `torch.utils.data.Dataset` that works with tensors and names, e.g. data, labels, target, weights, etc.
- **DictionaryDataModule**: A `lightning.LightningDataModule` to be initialized from a DictionaryDataset.
- **FastDictionaryLoader**: A DataLoader-like object for sets of tensors. It is adapted to work with dictionaries and to be faster than the standard dataloader (see docs).
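The dictionary-like indexing can be illustrated with a small sketch in plain Python. This is not the `DictionaryDataset` class: `ToyDictDataset` is a hypothetical stand-in that stores lists instead of tensors, and its behavior (a string key returns a whole field, an integer returns one sample as a dictionary) is an assumption chosen to convey the idea.

```python
class ToyDictDataset:
    """Hypothetical, list-based stand-in for a dictionary-like dataset."""

    def __init__(self, fields):
        lengths = {len(values) for values in fields.values()}
        if len(lengths) != 1:
            raise ValueError("all fields must have the same number of samples")
        self._fields = {name: list(values) for name, values in fields.items()}
        self._length = lengths.pop()

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        # A string selects a whole named field ...
        if isinstance(index, str):
            return self._fields[index]
        # ... anything else (int or slice) selects samples from every field.
        return {name: values[index] for name, values in self._fields.items()}


dataset = ToyDictDataset(
    {
        "data": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
        "labels": [0, 0, 1],
    }
)
sample = dataset[1]        # one sample, with all of its named fields
labels = dataset["labels"]  # one field, for all samples
```

Keeping the field names attached to the samples means that a loss function can request exactly the keys it needs (for instance data and labels, or data and weights) without relying on positional conventions.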


### core
Implements building blocks of the mlcolvar classes.
- **nn** :        Implements trainable building blocks of the mlcolvar classes
- **loss** :      Implements loss functions for the training of mlcolvar
- **stats** :     Implements statistical analysis methods (LDA, TICA, PCA, ...)
- **transform** : Implements non-trainable transformations of data (e.g. normalization)

Each of them is implemented as a Python subclass of `torch.nn.Module`.
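To make the notion of a non-trainable transformation concrete, here is a minimal sketch of a standardization step whose statistics are computed once from the data and then stored in the object. Plain Python is used instead of `torch.nn.Module` to keep the example dependency-free, and the class name `Standardize` is hypothetical rather than an `mlcolvar` class.

```python
import math


class Standardize:
    """Hypothetical non-trainable transform: (x - mean) / std per feature."""

    def __init__(self, data):
        n_samples = len(data)
        n_features = len(data[0])
        self.mean = [
            sum(row[j] for row in data) / n_samples for j in range(n_features)
        ]
        self.std = []
        for j in range(n_features):
            variance = sum((row[j] - self.mean[j]) ** 2 for row in data) / n_samples
            # Constant features have zero spread: fall back to 1.0 so that
            # the transform never divides by zero.
            self.std.append(math.sqrt(variance) or 1.0)

    def __call__(self, x):
        return [(v - m) / s for v, m, s in zip(x, self.mean, self.std)]


training_data = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
standardize = Standardize(training_data)
scaled = standardize([2.0, 11.0])
```

Because the mean and standard deviation live inside the object, whoever applies it later only has to pass raw features, which is the same reason the library stores such steps inside the exported model.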

### cvs
Implements ready-to-use mlcolvar classes and the `BaseCV` template class.
The CVs can be grouped, based on the criterion used for the optimization, in: 
- **unsupervised** :      Only require data about the system (`AutoEncoderCV` and `VariationalAutoEncoderCV`).
- **supervised**:         Require either labeled data from the different metastable states of the system (`DeepLDA` and `DeepTDA`) or data and a target to be matched (`RegressionCV`).
- **timelagged**:         Require time-lagged data from reactive trajectories (`DeepTICA`).
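As a sketch of what time-lagged data means in practice, the snippet below pairs every frame of a trajectory with the frame found `lag` steps later. It assumes uniformly spaced frames and ignores any reweighting; the function name `time_lagged_pairs` and the keys `data` and `data_lag` are illustrative choices, not a statement about the `mlcolvar` API.

```python
def time_lagged_pairs(frames, lag):
    """Pair each frame x(t) with x(t + lag) for a uniformly spaced trajectory.

    Hypothetical helper for illustration; not part of mlcolvar.
    """
    if lag < 1 or lag >= len(frames):
        raise ValueError("lag must satisfy 1 <= lag < number of frames")
    return {
        "data": frames[:-lag],     # x(t)
        "data_lag": frames[lag:],  # x(t + lag)
    }


trajectory = [[0.0], [1.0], [2.0], [3.0]]
pairs = time_lagged_pairs(trajectory, lag=2)
```

A trajectory of N frames yields N - lag pairs, so longer lag times leave fewer training samples.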

These are defined as classes which inherit from a `BaseCV` class and from `LightningModule`, which in turn inherits from `torch.nn.Module` and adds further functionality.

The first superclass, `BaseCV`, defines a template for all the CVs, along with common utility methods and the handling of pre- and postprocessing in the model.

Each CV is characterized by its specific methods, attributes and properties, which are implemented on top of these two superclasses.
The structure of CVs in `mlcolvar` is designed to be modular: the core of each model is defined as a series of `BLOCKS`, implemented as `torch.nn.Module` objects, that are automatically executed in sequence, similarly to what `torch.nn.Sequential` does.
Each CV then has a `loss_fn` attribute that sets the loss function to be minimized when optimizing the trainable blocks. The optimizer used to train the weights of the model, on the other hand, is set as a property of the model.

In addition to the core of the CV class, the user can also prepend and append preprocessing and postprocessing modules. These are generally expected to be `Transform` objects, as they are not trainable, but in principle they could be generic `torch.nn.Module` objects.
This possibility allows the non-trainable preprocessing operations to be performed on the dataset only once, at the beginning of the training, while still including them in the final model for exporting, testing, etc.
Furthermore, it allows postprocessing to be applied to the outputs of the model, and to be added after the training has already been completed.
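The execution order described above can be summarized with a small sketch in plain Python. It is not the `BaseCV` implementation: `ToyCV`, the block names `norm_in` and `nn`, and the method names `forward_cv` and `forward` are assumptions made for illustration, with simple callables standing in for `torch.nn.Module` blocks.

```python
class ToyCV:
    """Hypothetical CV whose core is a fixed, ordered list of named blocks."""

    BLOCKS = ["norm_in", "nn"]

    def __init__(self, blocks, preprocessing=None, postprocessing=None):
        missing = [name for name in self.BLOCKS if name not in blocks]
        if missing:
            raise ValueError(f"missing blocks: {missing}")
        for name in self.BLOCKS:
            setattr(self, name, blocks[name])
        self.preprocessing = preprocessing
        self.postprocessing = postprocessing

    def forward_cv(self, x):
        # Core of the model: run the blocks in the order given by BLOCKS.
        for name in self.BLOCKS:
            x = getattr(self, name)(x)
        return x

    def forward(self, x):
        # Full model: optional preprocessing, core blocks, optional postprocessing.
        if self.preprocessing is not None:
            x = self.preprocessing(x)
        x = self.forward_cv(x)
        if self.postprocessing is not None:
            x = self.postprocessing(x)
        return x


cv = ToyCV(
    blocks={
        "norm_in": lambda x: [(v - 1.0) / 2.0 for v in x],  # fixed rescaling
        "nn": lambda x: [sum(x)],                            # stand-in for a network
    },
    preprocessing=lambda x: [abs(v) for v in x],
)
raw_input = [-3.0, 5.0]
cv_value = cv.forward(raw_input)

# Postprocessing can be attached after "training", e.g. to flip the sign.
cv.postprocessing = lambda s: [-v for v in s]
flipped_value = cv.forward(raw_input)
```

Separating `forward_cv` from `forward` in this sketch mirrors the point made in the text: during training the core can be fed data that was preprocessed once, while the exported model still applies every step to raw inputs.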

### utils
Implements miscellaneous and transversal tools for a smoother workflow in `mlcolvar`.
- **io**:            Utilities for fast and efficient data import from files, optimized for PLUMED COLVAR files.
- **fes**:           Functions to compute and plot 1D and 2D free energy surfaces (FES) from biased data. The reweighting is based on kernel density estimation (KDE), using either `KDEpy` (faster) or `SciPy` (slower).
- **timelagged** :   Utilities for time-lagged datasets.
- **plot** :         Utility functions for frequently used plots (i.e. `plot_metrics` and `plot_isolines_2D`), plus the `cm_fessa` and `cortina80` color palettes.
- **trainer** :      `pytorch_lightning.Callback` functions for metrics logging.
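For the free energy estimate, the underlying relation is F(s) = -k_B T ln p(s), where p(s) is the (reweighted) probability density along the CV. The snippet below is a minimal sketch of that idea using a weighted Gaussian KDE written in plain Python. It is not the `mlcolvar.utils.fes` implementation: the function name `fes_1d`, its arguments, and the assumption that the weights are supplied as logarithms (for example the bias divided by k_B T for a static bias) are all illustrative.

```python
import math


def fes_1d(samples, grid, bandwidth, kbt=1.0, log_weights=None):
    """Estimate F(s) = -kbt * ln p(s) on a grid with a weighted Gaussian KDE.

    Hypothetical helper for illustration; not part of mlcolvar.
    """
    if log_weights is None:
        log_weights = [0.0] * len(samples)
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(log_weights)
    weights = [math.exp(lw - shift) for lw in log_weights]
    norm = sum(weights) * bandwidth * math.sqrt(2.0 * math.pi)

    fes = []
    for s in grid:
        density = sum(
            w * math.exp(-0.5 * ((s - x) / bandwidth) ** 2)
            for x, w in zip(samples, weights)
        ) / norm
        fes.append(-kbt * math.log(density) if density > 0.0 else math.inf)

    # Shift so that the lowest finite value is zero.
    finite = [f for f in fes if f != math.inf]
    f_min = min(finite) if finite else 0.0
    return [f - f_min for f in fes]


cv_samples = [0.0, 0.1, -0.1, 0.05, 2.0]
cv_grid = [0.0, 1.0, 2.0]
free_energy = fes_1d(cv_samples, cv_grid, bandwidth=0.2)
```

In this toy data most samples sit near s = 0, so the estimate has its minimum there, a higher value at the sparsely populated state near s = 2, and a barrier in between.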



### test

We use pytest to run:
- the tests contained in the mlcolvar/tests folder
- the Jupyter notebooks contained in the docs/notebooks folder (via the nbmake extension)