# Introduction to Neural Force Field

This Jupyter Notebook contains an introduction to the `nff` package. Here, we will load the modules and functions from `nff` to import a dataset, create dataloaders, create a model, train it and check the test stats. We will do most of it manually to illustrate the usage of the API. However, scripts such as the one provided in the `scripts/` folder already automate most of this process.

After the `nff` package has been installed, we start by importing all dependencies for this tutorial.

**Multi-task**: Set `two_states` to `True` to train on two states, or leave it as `False` if you just want one state. Set `one_mol` to `True` if you just want one molecule.

In [1]:
two_states = False
one_mol = False
MAX_GEOM = 100

In [2]:
import sys


sys.path.append("/home/saxelrod/Repo")
sys.path.append("/home/saxelrod/Repo/projects/multi_task")
sys.path.append("/home/saxelrod/Repo/projects/multi_task/NeuralForceField")


import os
import shutil
import numpy as np
import matplotlib.pyplot as plt

import torch
from torch.optim import Adam

from nff.data import Dataset, split_train_validation_test
import pickle
from nff.data.loader import collate_dicts
from torch.utils.data import DataLoader


from nff.train import Trainer, get_trainer, get_model, loss, hooks, metrics, evaluate

It may also be useful to set the GPU you want to use, together with the output directory:

In [3]:
DEVICE = 2
OUTDIR = './sandbox'

if os.path.exists(OUTDIR):
    newpath = os.path.join(os.path.dirname(OUTDIR), 'backup')
    if os.path.exists(newpath):
        shutil.rmtree(newpath)
        
    shutil.move(OUTDIR, newpath)

## Loading the relevant data

Since we usually work with data from a database, we can pack that information into the class `Dataset`, which is a subclass of `torch.utils.data.Dataset`. It wraps the atomic numbers, energies, forces and SMILES strings for each of the geometries. In this example, the properties have already been compiled into a pickle file. We start by loading this file, building the `Dataset`, and creating three slices of it (training, validation and test):

In [4]:
import pdb

import copy


METHOD_NAME = "sf_tddft_bhhlyp"
FILE_PATH = ""
DATA_FILE = os.path.join(FILE_PATH, "data", "{}.pickle".format(METHOD_NAME))
with open(DATA_FILE, "rb") as f:
    props = pickle.load(f)

if one_mol:
    smiles = props["smiles"][0]
    val_idx = [x==smiles for x in props["smiles"]]
    for key in props.keys():
        props[key] = np.array(props[key])[val_idx].tolist()


if two_states:

    # Energy gradients are the negative of the forces.
    for state in ["0", "1"]:
        forces = props.pop("force_{}".format(state))
        props["energy_{}_grad".format(state)] = [-force for force in forces]


    for key in ["energy_0", "energy_1", "energy_0_grad", "energy_1_grad",
               "nxyz", "smiles"]:
        if MAX_GEOM is not None:
            props[key] = props[key][:MAX_GEOM]

    lst = [*props["energy_0"], *props["energy_1"]]
    mean = np.mean(lst)

    for key in ["energy_0", "energy_1"]:
        lst = copy.deepcopy(props[key])
        for i, element in enumerate(lst):
            props[key][i] -= mean


else:


    props.pop("force_1")
    props.pop("energy_1")

    props["energy_0_grad"] = copy.deepcopy(props["force_0"])
    props.pop("force_0")

    for i, element in enumerate(props["energy_0_grad"]):
        if type(element) is np.ndarray:
            props["energy_0_grad"][i] = -element

    lst = [en for en in props["energy_0"] if en is not None]
    mean = np.mean(lst)

    for key in ["energy_0"]:
        lst = copy.deepcopy(props[key])
        for i, element in enumerate(lst):
            if element is not None:
                lst[i] -= mean

        props[key] = lst


dataset = Dataset(props=props.copy(), units='atomic')
dataset.generate_neighbor_list(cutoff=5)

train, val, test = split_train_validation_test(dataset, val_size=0.2, test_size=0.2)
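
The preprocessing in the cell above comes down to two steps: forces are negated to obtain energy gradients, and the mean energy is subtracted from every energy. The following is a minimal sketch of those two steps on toy data, using plain Python lists. The dictionary `toy_props`, its values, and the helper names are made up for illustration and are not part of `nff`.

```python
# Toy stand-in for the `props` dictionary; all numbers are made up.
toy_props = {
    "energy_0": [1.0, 2.0, None, 3.0],
    "force_0": [[0.5, -0.5], [1.0, 0.0], None, [-2.0, 2.0]],
}


def forces_to_gradients(forces):
    """Negate each force entry (gradient = -force), keeping None placeholders."""
    return [None if f is None else [-x for x in f] for f in forces]


def center_energies(energies):
    """Subtract the mean of the available energies, keeping None placeholders."""
    known = [e for e in energies if e is not None]
    mean = sum(known) / len(known)
    return [None if e is None else e - mean for e in energies]


# Replace the forces by gradients and shift the energies to zero mean.
toy_props["energy_0_grad"] = forces_to_gradients(toy_props.pop("force_0"))
toy_props["energy_0"] = center_energies(toy_props["energy_0"])
```

Centering the energies keeps the regression targets close to zero, which is a common choice when the raw energies are large in absolute value.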

The `nff` code interfaces with the `graphbuilder` module through a git submodule in the repository. `graphbuilder` provides methods to create batches of graphs. In `nff`, we interface with it through a custom dataloader called `GraphLoader`. Here, we create one loader for each of the slices; the cell below uses PyTorch's `DataLoader` with `collate_dicts` as the collate function.

In [5]:
train_loader = DataLoader(train, batch_size=50, collate_fn=collate_dicts)
val_loader = DataLoader(val, batch_size=50, collate_fn=collate_dicts)
test_loader = DataLoader(test, batch_size=50, collate_fn=collate_dicts)
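
To get a feeling for what `batch_size=50` means, it can help to count how many batches a loader yields per epoch. With PyTorch's default `drop_last=False`, the last batch may be smaller than the others, so the count is the ceiling of the number of samples divided by the batch size. The sketch below reproduces that arithmetic in plain Python; the helper `n_batches` and the sample counts are hypothetical.

```python
import math


def n_batches(n_samples, batch_size, drop_last=False):
    """Number of batches per epoch for a given dataset size."""
    if drop_last:
        # Incomplete final batch is discarded.
        return n_samples // batch_size
    # Incomplete final batch is kept.
    return math.ceil(n_samples / batch_size)


# Hypothetical slice sizes, for illustration only.
batches_60 = n_batches(60, 50)
batches_101 = n_batches(101, 50)
batches_101_dropped = n_batches(101, 50, drop_last=True)
```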

## Creating a model

`nff` is based on SchNet. It parameterizes interatomic interactions in molecules and materials through a series of convolution layers with continuous filters. Here, we are going to create a simple model using the hyperparameters given in `params`:

In [6]:
n_atom_basis = 256
EPS = 1e-15
 
if two_states:
    
    readoutdict = {
                        "energy_0": [{'name': 'linear', 'param' : { 'in_features': n_atom_basis, 
                                                                  'out_features': int(n_atom_basis / 2)}},
                                   {'name': 'shifted_softplus', 'param': {}},
                                   {'name': 'linear', 'param' : { 'in_features': int(n_atom_basis / 2), 
                                                                  'out_features': 1}}],
                        "energy_1": [{'name': 'linear', 'param' : { 'in_features': n_atom_basis, 
                                                                  'out_features': int(n_atom_basis / 2)}},
                                   {'name': 'shifted_softplus', 'param': {}},
                                   {'name': 'linear', 'param' : { 'in_features': int(n_atom_basis / 2), 
                                                                  'out_features': 1}}]
                    }
    


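    def post_readout(predict_dict, readoutdict):
        # Order the predictions so that energy_0 <= energy_1 for every geometry.
        sorted_keys = sorted(list(readoutdict.keys()))
        # Stack along a new leading "state" axis and sort along it (dim=0);
        # the default dim=-1 would sort within each state's predictions instead.
        sorted_ens = torch.sort(torch.stack([predict_dict[key] for key in sorted_keys]), dim=0)[0]
        sorted_dic = {key: val for key, val in zip(sorted_keys, sorted_ens)}
        return sorted_dic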


        
else:
    readoutdict = {
                        "energy_0": [{'name': 'linear', 'param' : { 'in_features': n_atom_basis, 
                                                                  'out_features': int(n_atom_basis / 2)}},
                                   {'name': 'shifted_softplus', 'param': {}},
                                   {'name': 'linear', 'param' : { 'in_features': int(n_atom_basis / 2), 
                                                                  'out_features': 1}}]
                    }
    
    post_readout = None
    
params = {
    'n_atom_basis': n_atom_basis,
    'n_filters': 256,
    'n_gaussians': 32,
    'n_convolutions': 10,
    'cutoff': 5.0,
    'trainable_gauss': False, 
    'readoutdict': readoutdict,
    'post_readout': post_readout
}



model = get_model(params)
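
Each readout above is a small network: a linear layer, a `shifted_softplus` activation, and a final linear layer with a single output. Assuming the activation follows the SchNet definition, ssp(x) = ln(0.5 eˣ + 0.5), it is a softplus shifted down by ln 2 so that ssp(0) = 0. The sketch below evaluates it for a scalar in plain Python; it is an illustration, not the implementation used inside `nff`.

```python
import math


def shifted_softplus(x):
    """ln(0.5 * exp(x) + 0.5), written in a numerically stable form."""
    # softplus(x) = max(x, 0) + log1p(exp(-|x|)) avoids overflow for large x.
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return softplus - math.log(2.0)


ssp_at_zero = shifted_softplus(0.0)      # zero, up to rounding
ssp_large = shifted_softplus(1000.0)     # close to x - ln(2), no overflow
ssp_negative = shifted_softplus(-1000.0) # close to -ln(2)
```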

## Creating a trainer

To train our model with the data provided, we have to create a loss function. The easiest way to do that is through the `build_mse_loss` builder. Its argument `loss_coef` assigns to each property a coefficient that multiplies the mean squared error (MSE) of that property before the terms are summed. In the cell below, the energy terms are weighted by `rho` and the gradient terms by 1.

In [7]:
import numpy as np
rho = 0.1
decay = 10

if two_states:
    loss_coef = {'energy_0': rho, 'energy_0_grad': 1, 'energy_1': rho, 'energy_1_grad': 1}
else:
    loss_coef = {'energy_0': rho, 'energy_0_grad': 1}



loss_fn = loss.build_mse_loss(loss_coef=loss_coef)
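
As a rough picture of what such a loss computes, the sketch below sums the MSE of each property weighted by its coefficient. It works on plain Python lists with made-up numbers and is only an assumption about the general form of the loss; the actual `build_mse_loss` in `nff` may differ in details such as normalization or the handling of missing values.

```python
def mse(pred, target):
    """Mean squared error between two equally long lists of numbers."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def weighted_mse_loss(results, targets, loss_coef):
    """Sum of per-property MSEs, each multiplied by its coefficient."""
    return sum(coef * mse(results[key], targets[key])
               for key, coef in loss_coef.items())


# Toy predictions and targets (flattened); values are made up.
toy_targets = {"energy_0": [1.0, 2.0], "energy_0_grad": [0.0, 0.0, 0.0, 0.0]}
toy_results = {"energy_0": [2.0, 4.0], "energy_0_grad": [1.0, -1.0, 1.0, -1.0]}

# Energy MSE is 2.5 and gradient MSE is 1.0, so the loss is 0.1 * 2.5 + 1 * 1.0.
toy_loss = weighted_mse_loss(toy_results, toy_targets,
                             {"energy_0": 0.1, "energy_0_grad": 1})
```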

We should also select an optimizer for our recently created model:

In [8]:
trainable_params = filter(lambda p: p.requires_grad, model.parameters())
optimizer = Adam(trainable_params, lr=2e-4)

### Metrics and hooks

Metrics and hooks allow the customization of the training process. Instead of tweaking directly the code or having to resort to countless flags, we can create submodules (or add-ons) to monitor the progress of the training or customize it.

If we want to monitor the progress of our training, say by looking at the mean absolute error (MAE) of energies and forces, we can simply create metrics to observe them:

In [9]:
if two_states:
    train_metrics = [
        metrics.MeanAbsoluteError('energy_0'),
        metrics.MeanAbsoluteError('energy_0_grad'),
        metrics.MeanAbsoluteError('energy_1'),
        metrics.MeanAbsoluteError('energy_1_grad'),

    ]
else:
    train_metrics = [
        metrics.MeanAbsoluteError('energy_0'),
        metrics.MeanAbsoluteError('energy_0_grad')
    ]
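
For reference, the mean absolute error reported by these metrics is the average of the absolute differences between predictions and targets. A minimal sketch on made-up numbers (the function `mean_absolute_error` is written here for illustration and is not the `nff` metric class):

```python
def mean_absolute_error(pred, target):
    """Average absolute difference between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Absolute errors are 1, 0 and 2, so the MAE is 1.0.
toy_mae = mean_absolute_error([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
```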
    

Furthermore, if we want to customize how our training procedure is done, we can use hooks, which can interrupt or change the training automatically.

In our case, we are adding hooks to:
* Stop the training procedure after a maximum of 5000 epochs;
* Log the training on a machine-readable CSV file under the directory `./sandbox`;
* Print the progress on the screen with custom formatting; and
* Set up a scheduler for the learning rate.

In [10]:
train_hooks = [
    hooks.MaxEpochHook(5000),
    hooks.CSVHook(
        OUTDIR,
        metrics=train_metrics,
    ),
    hooks.PrintingHook(
        OUTDIR,
        metrics=train_metrics,
        separator = ' | '
    ),
    hooks.ReduceLROnPlateauHook(
        optimizer=optimizer,
        patience=30,
        factor=0.5,
        min_lr=1e-7,
        window_length=1,
        stop_after_min=True
    )
]
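
The learning-rate hook reduces the learning rate when the validation loss stops improving. The sketch below simulates a simplified version of that idea in plain Python: if the loss has not improved for more than `patience` epochs, the learning rate is multiplied by `factor`, but never drops below `min_lr`. The exact rule used by `ReduceLROnPlateauHook` (including `window_length` and `stop_after_min`) is not reproduced here; the function and the toy losses are assumptions made for illustration.

```python
def plateau_schedule(val_losses, lr, patience, factor, min_lr):
    """Return the learning rate in effect after each epoch (simplified rule)."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for val_loss in val_losses:
        if val_loss < best:
            # New best validation loss: reset the counter.
            best = val_loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            # Plateau detected: shrink the learning rate, respecting the floor.
            lr = max(lr * factor, min_lr)
            bad_epochs = 0
        history.append(lr)
    return history


# Made-up validation losses that stall after the second epoch.
toy_losses = [1.0, 0.9, 0.95, 0.94, 0.93, 0.92, 0.91]
toy_lrs = plateau_schedule(toy_losses, lr=2e-4, patience=2, factor=0.5, min_lr=1e-7)
```

In this toy run the learning rate stays at `2e-4` for the first four epochs and is halved once the loss has failed to improve for three epochs in a row.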

### Trainer wrapper

A `Trainer` in the `nff` package is a wrapper to train a model. It automatically creates checkpoints, and trains and validates a given model. It also allows further training by loading checkpoints from existing paths, making the training procedure more flexible. Its functionality can be extended by the hooks we created above. To create a trainer, we execute the following command:

In [11]:
import pdb
try:
    T = Trainer(
        model_path=OUTDIR,
        model=model,
        loss_fn=loss_fn,
        optimizer=optimizer,
        train_loader=train_loader,
        validation_loader=val_loader,
        checkpoint_interval=1,
        hooks=train_hooks
    )
except Exception:
    pdb.post_mortem()

Now we can finally train the model using the method `train` from the `Trainer`:

In [12]:
import pdb

try:
    T.train(device=DEVICE, n_epochs=10)

except Exception as e:
    print(e)
    pdb.post_mortem()
    


               Time | Epoch | Learning rate | Train loss | Validation loss | MAE_energy_0 | MAE_energy_0_grad | GPU Memory (MB)
2019-09-16 13:46:00 |     1 |     2.000e-04 | 2459295338.6418 | 1277151616.0000 |   87280.2400 |           34.9220 |            2119
2019-09-16 13:46:01 |     2 |     2.000e-04 | 2458654592.2236 | 1276801024.0000 |   87274.6800 |           35.0514 |            2119
2019-09-16 13:46:02 |     3 |     2.000e-04 | 2456638388.1095 | 1275590784.0000 |   87247.9400 |           44.7343 |            2119
2019-09-16 13:46:03 |     4 |     2.000e-04 | 2449349770.5859 | 1271791872.0000 |   87177.1400 |          114.1421 |            2119
2019-09-16 13:46:04 |     5 |     2.000e-04 | 2425140239.9161 | 1263825792.0000 |   87865.5600 |          409.1880 |            2119
2019-09-16 13:46:05 |     6 |     2.000e-04 | 2375795221.1718 | 1300772224.0000 |   90725.3000 |         1449.3995 |            2119
2019-09-16 13:46:06 |     7 |     2.000e-04 | 2475810267.0612 | 1296837248

## Evaluating the model on the test set

Now we have a brand new model trained and validated. We can use the best model from this training to evaluate its performance on the test set. `results` contains the predicted properties for the whole test dataset, and `targets` contains the corresponding ground truth. The loss on the test set, calculated with the same function used during training, is stored in `val_loss` in the cell below.

In [13]:
results, targets, val_loss, other_results = evaluate(model, test_loader, loss_fn, device=DEVICE)

RuntimeError: expected device cpu and dtype Float but got device cuda:2 and dtype Float

Finally, we can plot our results to observe how well our model is performing:

In [None]:

units = {
    'energy_0_grad': r'kcal/mol/$\AA$',
    'energy_0': 'kcal/mol',
    'energy_1_grad': r'kcal/mol/$\AA$',
    'energy_1': 'kcal/mol'
}

dic_keys = list(loss_coef.keys())

for i in range(len(dic_keys) // 2):
    
    fig, ax_fig = plt.subplots(1, 2, figsize=(12, 6))

    for ax, key in zip(ax_fig, dic_keys[2*i:2*i+2] ):

        # Move tensors to the CPU before converting to NumPy (needed if they live on the GPU).
        pred = torch.cat(results[key]).reshape(-1).detach().cpu().numpy()
        targ = torch.cat(targets[key]).reshape(-1).detach().cpu().numpy()

        ax.scatter(pred, targ, color='#ff7f0e', alpha=0.3)

        lim_min = min(np.min(pred), np.min(targ)) * 1.1
        lim_max = max(np.max(pred), np.max(targ)) * 1.1

        ax.set_xlim(lim_min, lim_max)
        ax.set_ylim(lim_min, lim_max)

        ax.set_aspect('equal')

        ax.plot((lim_min, lim_max),
                (lim_min, lim_max),
                color='#000000',
                zorder=-1,
                linewidth=0.5)

        ax.set_title(key.upper(), fontsize=14)
        ax.set_xlabel('predicted %s (%s)' % (key, units[key]), fontsize=12)
        ax.set_ylabel('target %s (%s)' % (key, units[key]), fontsize=12)

    plt.show()

The model is performing quite well.