# Training an MLP with NNFFLIB

In this tutorial, we will train a simple MLP for ethanal.

In [None]:
import os
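# Suppress TensorFlow's C++ log messages (INFO, WARNING and ERROR); must be set before importing tensorflow
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'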

In [None]:
from glob import glob
import numpy as np
import tensorflow as tf

## Preprocessing data

[Extended xyz](https://wiki.fysik.dtu.dk/ase/ase/io/formatoptions.html#extxyz) files are the easiest file format to store and process data for use within NNFFLIB. In general, a single snapshot should look like this:

    7
    Properties=species:S:1:pos:R:3:Z:I:1:force:R:3 energy=-4179.499904512839
    O	0.799673646	0.500360941	-0.301057334	8	0.19326614893650695	4.0296261599732475	0.5339464202520161
    C	-0.343136376	0.460221225	0.175132418	6	1.3154501686057245	-2.7655710555658	-0.26792985503376854
    C	-0.673405235	-1.004626693	0.127376591	6	0.825295199293655	-0.03915909188103606	0.543113470938796
    H	-0.816178397	1.256991818	0.7303707769999999	1	-0.24686343844785177	0.7401369791385353	-0.41621451861542136
    H	0.657637755	-0.829930033	-0.228715689	1	-2.7923321765335167	-2.6854090250013893	-0.10726306811241916
    H	-1.534805957	-1.384324001	-0.434431111	1	0.28378608782006104	0.4602677629570161	-0.6539789189674613
    H	-0.927792662	-1.430857733	1.106000824	1	0.42382386079352236	0.25992823576131424	0.36947153520393444

This example describes a single ethanal molecule. In this tutorial, the data have already been split into a training set and a validation set, stored in `data/train.xyz` and `data/validation.xyz`, respectively.
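
To make the column layout concrete, the sketch below decodes the `Properties=` descriptor of the snapshot above and splits one atom line accordingly. The helper `parse_properties` is hypothetical and written only for illustration. It is not part of NNFFLIB, and real files are better read with a dedicated reader such as the one in ASE.

```python
def parse_properties(comment_line):
    """Return (name, type, n_columns) triples from the Properties= entry."""
    for token in comment_line.split():
        if token.startswith("Properties="):
            fields = token[len("Properties="):].split(":")
            return [(fields[i], fields[i + 1], int(fields[i + 2]))
                    for i in range(0, len(fields), 3)]
    raise ValueError("no Properties= entry found")


comment = ("Properties=species:S:1:pos:R:3:Z:I:1:force:R:3 "
           "energy=-4179.499904512839")
atom_line = ("O\t0.799673646\t0.500360941\t-0.301057334\t8\t"
             "0.19326614893650695\t4.0296261599732475\t0.5339464202520161")

layout = parse_properties(comment)
values = atom_line.split()

# Walk through the columns in the order given by the descriptor.
atom = {}
column = 0
for name, _kind, width in layout:
    atom[name] = values[column:column + width]
    column += width

print(layout)
print(atom["species"], atom["Z"], atom["force"])
```

Each atom line therefore carries one species label, three position components, one atomic number and three force components.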

The first step in training an MLP is to convert all the data into a TensorFlow Record file (tfr-file). This is easily done with the `TFRWriter` class of NNFFLIB. Before doing so, we need to tell NNFFLIB which properties to look for in the extended xyz files. This is done via the `list_of_properties` variable:

In [None]:
list_of_properties = ['positions', 'numbers', 'energy', 'forces']

For periodic structures, add `'rvec'` to the list to include the cell matrix. Afterwards, we can convert both the training and validation sets.

In [None]:
from nnfflib.datasets import TFRWriter

writer = TFRWriter('train.tfr', list_of_properties = list_of_properties, per_atom_reference = -597.2258011428571)
writer.write_from_xyz('data/train.xyz')
writer.close()

writer = TFRWriter('validation.tfr', list_of_properties = list_of_properties, per_atom_reference = -597.2258011428571)
writer.write_from_xyz('data/validation.xyz')
writer.close()

Note that NNFFLIB prints the number of configurations stored in each tfr-file, together with the total number of atoms and some statistics of the energy. Make a note of the number of configurations, as this information is not included in the tfr-files themselves (it is needed below when constructing the data sets).
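
If the printed count is lost, it can be recovered from the xyz file itself. The following is a minimal sketch with a hypothetical helper, `count_configurations`, that is not part of NNFFLIB. It assumes every frame consists of an atom-count line, one comment line and then one line per atom, and it is exercised on two made-up frames.

```python
import io


def count_configurations(handle):
    """Count the frames in an (extended) xyz stream."""
    count = 0
    while True:
        header = handle.readline()
        if not header.strip():
            break  # end of file (or a trailing blank line)
        n_atoms = int(header)
        # Skip the comment line and the n_atoms atom lines of this frame.
        for _ in range(n_atoms + 1):
            handle.readline()
        count += 1
    return count


# Two tiny made-up frames, used only to exercise the function.
demo = io.StringIO(
    "2\nProperties=species:S:1:pos:R:3 energy=-1.0\n"
    "H 0.0 0.0 0.0\nH 0.0 0.0 0.7\n"
    "1\nProperties=species:S:1:pos:R:3 energy=-0.5\n"
    "H 0.0 0.0 0.0\n"
)
n_configs = count_configurations(demo)
print(n_configs)
```

For the tutorial data, the same function could be applied to a file opened with `open('data/train.xyz')`.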

The `TFRWriter` class has several arguments:
- `reference`: a single reference energy which will be subtracted from all the energies.
- `per_atom_reference`: a reference energy per atom that will be subtracted from all the energies.
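
As a worked example, the snippet below applies the `per_atom_reference` used in this tutorial to the energy of the snapshot shown earlier. It assumes that the per-atom reference is multiplied by the number of atoms before being subtracted. The description above suggests this convention, but it is not confirmed here.

```python
n_atoms = 7                          # atoms in the ethanal snapshot
energy = -4179.499904512839          # energy= entry of that snapshot
per_atom_reference = -597.2258011428571

# Assumed convention: the per-atom reference is scaled by the atom count.
shifted_energy = energy - n_atoms * per_atom_reference
print(f"shifted energy: {shifted_energy:.4f}")
print(f"per atom:       {shifted_energy / n_atoms:.4f}")
```

Under this assumption, the shifted energy is of order one instead of several thousand, which is a far more convenient target for training.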

## Configuring the training procedure

First, one has to choose a TensorFlow distributed training strategy (https://www.tensorflow.org/guide/distributed_training#types_of_strategies). This enables multi-GPU training. The `MirroredStrategy` is a good default, which is also valid for single-GPU training.

In [None]:
strategy = tf.distribute.MirroredStrategy()

All of the following objects should be created within the scope of the same strategy.

In [None]:
from nnfflib.datasets import DataSet
with strategy.scope():
    train_data = DataSet(['train.tfr'], num_configs = 4995, cutoff = 5.0, batch_size = 64, float_type = 32, num_parallel_calls = 8, strategy = strategy, list_of_properties = list_of_properties)
    validation_data = DataSet(['validation.tfr'], num_configs = 840, cutoff = 5.0, batch_size = 64, float_type = 32, num_parallel_calls = 8, strategy = strategy, list_of_properties = list_of_properties, test = True)


Next, the training and validation data sets are initialized via the `DataSet` class starting from a list of tfr-files. Here, it is important to specify the correct number of configurations stored in each data set via the `num_configs` argument (see above when generating the tfr-files). Otherwise, NNFFLIB will not count the number of epochs correctly. The validation set should have the extra argument `test=True`. Other arguments include:
- `cutoff`: The cutoff distance of the MLP. Defaults to 4 Å.
- `batch_size`: The batch size used during training.
- `test`: (boolean) Set True for the validation set, False for the training set.
- `num_parallel_calls`: The number of configurations to be preprocessed in parallel. A good default is the number of CPU cores available.
- `strategy`: The distributed strategy defined above.
- `list_of_properties`: The list of properties defined above.
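
To see why `num_configs` matters for epoch counting, the sketch below computes how many batches make up one pass over each data set for the values used above. The helper `steps_per_epoch` is hypothetical, and whether NNFFLIB keeps or drops the last incomplete batch is an assumption exposed here as a flag.

```python
import math


def steps_per_epoch(num_configs, batch_size, drop_remainder=False):
    """Number of batches needed to see every configuration once."""
    if drop_remainder:
        return num_configs // batch_size
    return math.ceil(num_configs / batch_size)


train_steps = steps_per_epoch(4995, 64)
validation_steps = steps_per_epoch(840, 64)
print(train_steps, validation_steps)
```

A wrong `num_configs` changes this number, so the reported epochs would no longer correspond to full passes over the data.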

In [None]:
from nnfflib.l1mlp import L1MLP
from nnfflib.schnet import SchNet
with strategy.scope():
    # SchNet is constructed here only to illustrate its arguments; the remainder of this tutorial uses the L1MLP.
    model_schnet = SchNet(cutoff = 5., n_max = 32, num_layers = 4, start = 0.0, end = 5.0, num_filters = 64, num_features = 512, shared_W_interactions = False, float_type = 32)
    model = L1MLP(cutoff = 5., n_max = 32, num_layers = 6, start = 0.0, end = 5.0, num_filters = 64, num_features = 512)
    # To resume from a stored model instead, use:
    #model = L1MLP.from_restore_file('model_dir/model_name_2.00')

Next, the architecture of the MLP should be specified. For now, two types of MLPs are supported: `SchNet` (https://doi.org/10.1063/1.5019779) and the `L1MLP` (an equivariant MLP with l=1).

We will continue with the `L1MLP`. One can tune the following arguments:
- `cutoff`: The cutoff of the MLP. Should take the same value as the cutoff specified in the `DataSet` class.
- `n_max`: The number of radial features.
- `num_features`: The number of features for every particle.
- `num_filters`: The number of filters.
- `num_layers`: The number of layers or interaction blocks.
- `start`: The distance of the first radial feature.
- `end`: The distance of the last radial feature.
- `reference`: If a number, use a constant reference while training. If the reference energy is already subtracted in the `DataSet` class, the reference energy should not be included here anymore.
- `per_atom_reference`: If a number, use a constant per atom reference while training. If the reference energy per atom is already subtracted in the `DataSet` class, the reference energy should not be included here anymore.
- `xla`: (boolean). Whether or not to use [XLA](https://www.tensorflow.org/xla) when training the model. Defaults to False.
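
The roles of `start`, `end` and `n_max` can be illustrated with a small radial expansion. The sketch below assumes evenly spaced Gaussian functions, in the spirit of the distance expansion described in the SchNet paper. The function names and the width parameter `gamma` are hypothetical, and the basis actually used by `L1MLP` may differ.

```python
import math


def radial_centers(start=0.0, end=5.0, n_max=32):
    """Evenly spaced centres from start to end (both included)."""
    step = (end - start) / (n_max - 1)
    return [start + i * step for i in range(n_max)]


def gaussian_features(distance, centers, gamma=10.0):
    """Expand one interatomic distance on Gaussians at the given centres."""
    return [math.exp(-gamma * (distance - mu) ** 2) for mu in centers]


centers = radial_centers()
features = gaussian_features(1.2, centers)

# The feature whose centre lies closest to the distance responds most strongly.
strongest = max(range(len(features)), key=features.__getitem__)
print(len(features), round(centers[strongest], 3))
```

In this picture, `n_max` sets the radial resolution, while `start` and `end` set the range of distances that the features can resolve.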

When restarting from a pretrained model, or simply when resuming the training procedure, the MLP can be loaded as follows: 

    model = L1MLP.from_restore_file('model_dir/model_name_2.00')

where `'model_dir/model_name_2.00'` is the location where the MLP is stored.

In [None]:
from nnfflib.learning_rate_manager import ExponentialDecayLearningRate
with strategy.scope():
    optimizer = tf.optimizers.Adam(3e-04)
    learning_rate_manager = ExponentialDecayLearningRate(initial_learning_rate = 3e-04, decay_rate = 0.5, decay_epochs = 300)

Here, the optimizer and the learning rate scheduler are defined. Any TensorFlow optimizer can be used (https://www.tensorflow.org/api_docs/python/tf/keras/optimizers). Two popular learning rate schedulers are available:
- `ExponentialDecayLearningRate(initial_learning_rate = 3e-04, decay_rate = 0.5, decay_epochs = 300)`: An exponentially decaying learning rate. The learning rate starts at `initial_learning_rate` and decays by a factor of `decay_rate` every `decay_epochs` epochs.
- `ConstantDecayLearningRate(initial_learning_rate = 1e-04, decay_factor = 0.5, min_learning_rate = 1e-07, decay_patience = 25)`: Starting from `initial_learning_rate`, the learning rate decays by a factor of `decay_factor` whenever the validation losses have not decreased for `decay_patience` epochs. The training stops when the learning rate has dropped below `min_learning_rate`.
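
The two schedules can be mimicked in plain Python to get a feeling for their behaviour. This is only a sketch of the descriptions above, not the NNFFLIB implementation: the names `exponential_decay` and `PlateauDecay` are hypothetical, and the exponential schedule is assumed to decay in discrete steps rather than continuously.

```python
def exponential_decay(epoch, initial_learning_rate=3e-4,
                      decay_rate=0.5, decay_epochs=300):
    """Stepwise decay: multiply by decay_rate once every decay_epochs."""
    return initial_learning_rate * decay_rate ** (epoch // decay_epochs)


class PlateauDecay:
    """Decay the learning rate when the validation loss stops improving."""

    def __init__(self, initial_learning_rate=1e-4, decay_factor=0.5,
                 min_learning_rate=1e-7, decay_patience=25):
        self.learning_rate = initial_learning_rate
        self.decay_factor = decay_factor
        self.min_learning_rate = min_learning_rate
        self.decay_patience = decay_patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def update(self, validation_loss):
        """Register one epoch; return False once training should stop."""
        if validation_loss < self.best_loss:
            self.best_loss = validation_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
            if self.epochs_without_improvement >= self.decay_patience:
                self.learning_rate *= self.decay_factor
                self.epochs_without_improvement = 0
        return self.learning_rate >= self.min_learning_rate


print(exponential_decay(0), exponential_decay(300), exponential_decay(600))

# Short patience and made-up losses, only to trigger one decay quickly.
schedule = PlateauDecay(decay_patience=2)
for loss in [1.0, 0.9, 0.95, 0.97]:
    keep_training = schedule.update(loss)
print(schedule.learning_rate, keep_training)
```

Only the second schedule has a built-in stopping criterion, which is relevant for deciding when the training run ends.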

In [None]:
from nnfflib.losses import MSE, MAE
with strategy.scope():
    losses = [MSE('energy', scale_factor = 1., per_atom = True), MSE('forces', scale_factor = 1.)]
    validation_losses = [MAE('energy', per_atom = True), MAE('forces', scale_factor = 1.)]

To define the training and validation losses, one can use mean squared errors (MSE) or mean absolute errors (MAE). They should be given as a list, where the `scale_factor` argument of each loss sets its weight and thereby the relative importance of the energy and force contributions.
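
The effect of the weights can be sketched with a hand-written loss. The functions below are hypothetical and use made-up numbers. They assume that the total loss is the weighted sum of a per-atom energy MSE and an MSE over force components, which may differ in detail from what NNFFLIB computes.

```python
def mean_squared_error(predicted, reference):
    """Plain mean of the squared differences."""
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)


def total_loss(e_pred, e_ref, n_atoms, f_pred, f_ref,
               energy_scale=1.0, force_scale=1.0):
    """Weighted sum of a per-atom energy MSE and a force-component MSE."""
    e_pred_per_atom = [e / n for e, n in zip(e_pred, n_atoms)]
    e_ref_per_atom = [e / n for e, n in zip(e_ref, n_atoms)]
    energy_term = mean_squared_error(e_pred_per_atom, e_ref_per_atom)
    force_term = mean_squared_error(f_pred, f_ref)
    return energy_scale * energy_term + force_scale * force_term


# Made-up numbers: one 7-atom configuration and three force components.
loss = total_loss(e_pred=[1.0], e_ref=[0.3], n_atoms=[7],
                  f_pred=[0.0, 0.2, -0.2], f_ref=[0.0, 0.0, 0.0])
force_heavy = total_loss(e_pred=[1.0], e_ref=[0.3], n_atoms=[7],
                         f_pred=[0.0, 0.2, -0.2], f_ref=[0.0, 0.0, 0.0],
                         force_scale=10.0)
print(loss, force_heavy)
```

Raising one of the scale factors makes the optimizer prioritise the corresponding property at the expense of the other.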

Finally, the `SaveHook` class specifies the save location and how frequently the model is stored.

In [None]:
from nnfflib.hooks import SaveHook
with strategy.scope():
    savehook = SaveHook(model, ckpt_name = 'model_dir/model_name', max_to_keep = 5, save_period = 1.0, history_period = 8.0,
                        npz_file = 'model_dir/model_name.npz')

The following arguments can be specified:
- `ckpt_name`: The location where the model is saved (checkpoint files).
- `max_to_keep`: The maximum number of saves that are kept. Older saves are deleted.
- `save_period`: The number of epochs after which the validation losses are calculated. The model is saved only **if the current validation losses are the lowest** so far.
- `history_period`: The number of epochs after which the model is saved irrespective of the current validation losses. Can be switched off by setting the value to `None`.
- `npz_file`: All the losses, time per epoch and other information will be stored to produce graphs in an npz file at this location.
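
The interplay between these arguments can be sketched as a small simulation. This is an illustration of the description above, not the `SaveHook` implementation: the function `simulate_saves` is hypothetical, it validates after every epoch, and it assumes that both kinds of saves count towards `max_to_keep`.

```python
from collections import deque


def simulate_saves(validation_losses, max_to_keep=5, history_period=8):
    """Return the epochs whose saves survive at the end of the run."""
    kept = deque(maxlen=max_to_keep)  # the oldest save is dropped automatically
    best = float("inf")
    for epoch, loss in enumerate(validation_losses, start=1):
        is_best = loss < best
        if is_best:
            best = loss
        is_history = history_period is not None and epoch % history_period == 0
        if is_best or is_history:
            kept.append(epoch)
    return list(kept)


# Made-up validation losses for nine epochs.
losses = [1.0, 0.8, 0.9, 0.7, 0.75, 0.74, 0.73, 0.72, 0.6]
surviving = simulate_saves(losses, max_to_keep=3)
print(surviving)
```

In this run, epoch 8 is saved only because of `history_period`, since its validation loss is not the lowest seen so far.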

With everything in place, one can train the model. Training continues until the `learning_rate_manager` stops it. If it never does, you have to interrupt the process manually.

In [None]:
from nnfflib.training import Trainer
with strategy.scope():
    trainer = Trainer(model, losses, train_data, validation_data, strategy = strategy, optimizer = optimizer, savehook = savehook, 
                      learning_rate_manager = learning_rate_manager, validation_losses = validation_losses)
    trainer.train(verbose = True, validate_first = False)