# Introduction to Neural Force Field

This Jupyter notebook introduces the `nff` package. We will load modules and functions from `nff` to import a dataset, create dataloaders, build a model, train it, and check the test statistics. Most steps are done manually to illustrate the API; the scripts provided in the `scripts/` folder already automate most of this process.

After the `nff` package has been installed, we start by importing all dependencies for this tutorial.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.insert(0, "..")

import os
import shutil
import numpy as np
import matplotlib.pyplot as plt

import torch
from torch.optim import Adam
from torch.utils.data import DataLoader

from nff.data import Dataset, split_train_validation_test, collate_dicts, to_tensor
from nff.train import Trainer, get_trainer, get_model, load_model, loss, hooks, metrics, evaluate

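It may also be useful to set the GPU you want to use and the output directory. If the output directory already exists, the cell below moves it to `backup`, overwriting any previous backup: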

In [3]:
# DEVICE = 1
# OUTDIR = './sandbox'
# model = load_model(OUTDIR)

In [4]:
DEVICE = 3
OUTDIR = './sandbox'

if os.path.exists(OUTDIR):
    newpath = os.path.join(os.path.dirname(OUTDIR), 'backup')
    if os.path.exists(newpath):
        shutil.rmtree(newpath)
        
    shutil.move(OUTDIR, newpath)

## Loading the relevant data

Because we usually work with a database, we pack its information into the `Dataset` class, which is a subclass of `torch.utils.data.Dataset`. It wraps the atomic numbers, energies, forces and SMILES strings for each geometry; the dataset used here also carries a `bind` property that flags binders. In this example, a pre-compiled `Dataset` is already available. We start by loading this file and creating three slices of the original dataset, distributing the binders across the slices according to `split_ratio`:

In [5]:
# dataset = Dataset.from_file('./data/covid.pth.tar')
dataset = Dataset.from_file('./data/covid_mmff94.pth.tar')


In [6]:
# Helpers that spread the binders over the train/validation/test splits.

def separate_datasets(dataset, split_ratio):

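    bind_indices = torch.LongTensor([i for i, bind in enumerate(dataset.props['bind']) if bind])
    # Use a set for membership tests; `in` on a tensor scans every element.
    bind_set = set(bind_indices.tolist())
    remaining_indices = [i for i in range(len(dataset)) if i not in bind_set]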

    fail_dataset = dataset.copy()
    for key, val in fail_dataset.props.items():
        fail_dataset.props[key] = [val[i] for i in remaining_indices]
    return dataset, fail_dataset, bind_indices

def get_split_bind_indices(bind_indices, split_ratio):
    num_bind = len(bind_indices)
    bind_per_split = (split_ratio * num_bind).astype('int')
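    # Integer truncation can leave a few binders unassigned;
    # hand them out one per split, starting with the training split.
    remainder = int(num_bind - bind_per_split.sum())
    for i in range(remainder):
        bind_per_split[i % len(bind_per_split)] += 1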

    bind_per_split = bind_per_split.tolist()
    split_bind_indices = torch.split(bind_indices, bind_per_split)
    return split_bind_indices

def make_bind_datasets(split_bind_indices, dataset):
    
    datasets = []
    for indices in split_bind_indices:
        new_set = dataset.copy()
        for key, val in dataset.props.items():
            new_set.props[key] = to_tensor([val[i] for i in indices])
        datasets.append(new_set)
    return tuple(datasets)
    

def split_data(dataset, split_ratio):
    dataset, fail_dataset, bind_indices = separate_datasets(dataset, split_ratio)
    split_bind_indices =  get_split_bind_indices(bind_indices, split_ratio)
    bind_datasets = make_bind_datasets(split_bind_indices, dataset)
    
    train, val, test = split_train_validation_test(fail_dataset, val_size=0.2, test_size=0.2)
    split_sets = [train, val, test]
    
    for i in range(3):
        split_set = split_sets[i]
        bind_set = bind_datasets[i]
        
        for key, value in bind_set.props.items():
            if type(value) is list:
                split_set.props[key] += value
            else:
                split_set.props[key] = torch.cat((split_set.props[key], value))
    
    return train, val, test

        

In [7]:
split_ratio = np.array([0.6, 0.2, 0.2])
train, val, test = split_data(dataset, split_ratio)
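
The allocation step inside `get_split_bind_indices` can be illustrated without `torch` or `nff`. The sketch below is a simplified stand-in, not the notebook's code: `allocate_counts` is a hypothetical helper that truncates each share to an integer and then hands the leftover items to the splits in order. The value 271 is the total number of binders reported by this notebook (163 + 54 + 54).

```python
def allocate_counts(num_items, ratios):
    """Split num_items into integer counts that follow ratios and sum exactly."""
    # Truncate each share toward zero, as `.astype('int')` does in the notebook.
    counts = [int(ratio * num_items) for ratio in ratios]
    # Hand the items lost to truncation back, one per split, starting with the first.
    remainder = num_items - sum(counts)
    for i in range(remainder):
        counts[i % len(counts)] += 1
    return counts


counts = allocate_counts(271, [0.6, 0.2, 0.2])
print(counts)  # [163, 54, 54]
```

Giving the leftover items to the first split favours the training set, which matches the binder counts printed further down.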

The `nff` code interfaces with the `graphbuilder` module through a git submodule in the repository. `graphbuilder` provides methods to create batches of graphs. In `nff`, we interface with it through a custom dataloader called `GraphLoader`; the cell below, however, uses the standard PyTorch `DataLoader` with `collate_dicts` as the collate function. Here, we create one loader for each of the slices.

In [8]:
train_loader = DataLoader(train, batch_size=10, collate_fn=collate_dicts)
val_loader = DataLoader(val, batch_size=10, collate_fn=collate_dicts)
test_loader = DataLoader(test, batch_size=10, collate_fn=collate_dicts)
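
To make the role of a collate function concrete, here is a minimal pure-Python sketch of batching dictionary samples. It is not the implementation of `collate_dicts` in `nff.data`, which works on tensors; `collate_dicts_sketch`, `iter_batches`, and the toy sample keys are hypothetical names used only for illustration.

```python
def collate_dicts_sketch(samples):
    """Merge a list of per-sample dicts into one dict of lists (one batch)."""
    batch = {}
    for sample in samples:
        for key, value in sample.items():
            batch.setdefault(key, []).append(value)
    return batch


def iter_batches(samples, batch_size):
    """Yield collated batches of at most batch_size samples, in order."""
    for start in range(0, len(samples), batch_size):
        yield collate_dicts_sketch(samples[start:start + batch_size])


# Hypothetical toy samples; real samples hold tensors such as coordinates.
toy = [{"bind": i % 2, "num_atoms": 3 + i} for i in range(25)]
batches = list(iter_batches(toy, batch_size=10))
print([len(b["bind"]) for b in batches])  # batch sizes: [10, 10, 5]
```

With `batch_size=10`, as in the loaders above, the last batch is simply smaller when the split size is not a multiple of ten.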

Number of positive binders in train, validation, and test sets:

In [9]:
print(torch.sum(train.props['bind']))
print(torch.sum(val.props['bind']))
print(torch.sum(test.props['bind']))



tensor(163)
tensor(54)
tensor(54)


## Creating a model

`nff` is based on SchNet, which parameterizes interatomic interactions in molecules and materials through a series of convolution layers with continuous filters. Here, we create a simple model using the hyperparameters given in `params`:

In [10]:


n_atom_basis = 256
mol_basis = 256

mol_fp_layers = [
    {'name': 'linear', 'param': {'in_features': n_atom_basis,
                                 'out_features': int((n_atom_basis + mol_basis) / 2)}},
    {'name': 'shifted_softplus', 'param': {}},
    {'name': 'linear', 'param': {'in_features': int((n_atom_basis + mol_basis) / 2),
                                 'out_features': mol_basis}},
]

readoutdict = {
    'bind': [
        {'name': 'linear', 'param': {'in_features': mol_basis,
                                     'out_features': int(mol_basis / 2)}},
        {'name': 'shifted_softplus', 'param': {}},
        {'name': 'linear', 'param': {'in_features': int(mol_basis / 2),
                                     'out_features': 1}},
        {'name': 'sigmoid', 'param': {}},
    ],
}

params = {
    'n_atom_basis': n_atom_basis,
    'n_filters': 256,
    'n_gaussians': 32,
    'n_convolutions': 4,
    'cutoff': 5.0,
    'trainable_gauss': True,
    'dropout_rate': 0.2,
    'mol_fp_layers': mol_fp_layers,
    'readoutdict': readoutdict
}


model = get_model(params=params, model_type='WeightedConformers')
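
Layer specifications like these are easy to get wrong when the widths change, so a quick consistency check can help before building the model. The sketch below is not part of `nff`: `linear_widths` and `check_chain` are hypothetical helpers. It assumes that the output of `mol_fp_layers` feeds the `bind` readout, which is what the matching `mol_basis` widths suggest.

```python
def linear_widths(layers):
    """Return (in_features, out_features) for each 'linear' entry, in order."""
    return [
        (layer["param"]["in_features"], layer["param"]["out_features"])
        for layer in layers
        if layer["name"] == "linear"
    ]


def check_chain(layers, expected_in):
    """Return the final width; raise ValueError if linear layers do not line up."""
    width = expected_in
    for n_in, n_out in linear_widths(layers):
        if n_in != width:
            raise ValueError(f"expected in_features={width}, got {n_in}")
        width = n_out
    return width


n_atom_basis, mol_basis = 256, 256
half = (n_atom_basis + mol_basis) // 2
mol_fp_layers = [
    {"name": "linear", "param": {"in_features": n_atom_basis, "out_features": half}},
    {"name": "shifted_softplus", "param": {}},
    {"name": "linear", "param": {"in_features": half, "out_features": mol_basis}},
]
bind_readout = [
    {"name": "linear", "param": {"in_features": mol_basis, "out_features": mol_basis // 2}},
    {"name": "shifted_softplus", "param": {}},
    {"name": "linear", "param": {"in_features": mol_basis // 2, "out_features": 1}},
    {"name": "sigmoid", "param": {}},
]

# Assumption: the molecular fingerprint is the input of the readout.
fp_width = check_chain(mol_fp_layers, expected_in=n_atom_basis)
out_width = check_chain(bind_readout, expected_in=fp_width)
print(fp_width, out_width)  # 256 1
```

The final width of 1, followed by a sigmoid, gives one probability-like score per molecule for the `bind` label.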

## Creating a trainer

To train our model with the data provided, we have to create a loss function. Because `bind` is a binary label, we use the `build_cross_entropy_loss` builder rather than a mean-squared-error loss. Its argument `loss_coef` is a dictionary that maps each target property to the coefficient multiplying its loss term; here the only target is `bind`, with a coefficient of 1.0.

In [11]:
loss_fn = loss.build_cross_entropy_loss(loss_coef={'bind': 1.0})
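
For reference, the quantity a binary cross-entropy loss computes can be written in a few lines of plain Python. This is a sketch of the textbook formula, not the code behind `build_cross_entropy_loss`, whose reduction and weighting may differ; `binary_cross_entropy` and the example predictions are hypothetical.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Hypothetical sigmoid outputs for three molecules; only the last one is a binder.
loss_value = binary_cross_entropy([0.1, 0.2, 0.7], [0, 0, 1])
print(round(loss_value, 4))
```

The sigmoid at the end of the `bind` readout keeps the predictions in (0, 1), which is the input range this formula expects.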

We should also select an optimizer for our recently created model:

In [12]:
trainable_params = filter(lambda p: p.requires_grad, model.parameters())
optimizer = Adam(trainable_params, lr=3e-4)

### Metrics and hooks

Metrics and hooks allow the customization of the training process. Instead of tweaking the code directly or resorting to countless flags, we can create submodules (or add-ons) to monitor the progress of the training or customize it.

If we want to monitor the progress of our training, say by looking at the true and false positives and negatives of the `bind` prediction, we can simply create metrics to observe them:

In [13]:
train_metrics = [
    metrics.TruePositives('bind'),
    metrics.TrueNegatives('bind'),
    metrics.FalsePositives('bind'),
    metrics.FalseNegatives('bind'),

]
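
How these four metrics are normalized is not stated in this notebook. In the training log at the end, the true-negative and false-negative columns sum to 1 while both positive columns are `nan`. One reading consistent with that output is that each count is divided by the number of predictions of its class. The sketch below implements that reading only; `prediction_rates` and the 0.5 threshold are assumptions, not the `nff.train.metrics` implementation.

```python
def prediction_rates(probs, targets, threshold=0.5):
    """Confusion-matrix counts, each divided by the number of predictions of its class."""
    preds = [1 if p >= threshold else 0 for p in probs]
    pairs = list(zip(preds, targets))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)

    def ratio(count, total):
        # No predictions of this class: the rate is undefined, reported as nan.
        return count / total if total else float("nan")

    return {
        "true_positive": ratio(tp, tp + fp),
        "false_positive": ratio(fp, tp + fp),
        "true_negative": ratio(tn, tn + fn),
        "false_negative": ratio(fn, tn + fn),
    }


# Hypothetical model that scores every molecule below the threshold.
rates = prediction_rates(probs=[0.1, 0.2, 0.3, 0.4], targets=[0, 0, 0, 1])
print(rates)
```

Under this reading, `nan` in the positive columns means the model predicted no binders at all, a common failure mode when the positive class is rare.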

Furthermore, if we want to customize how the training procedure is carried out, we can use hooks, which can interrupt or change the training automatically.

In our case, we are adding hooks to:
* Stop the training procedure after 100 epochs;
* Log the training on a machine-readable CSV file under the directory `./sandbox`;
* Print the progress on the screen with custom formatting; and
* Set up a scheduler for the learning rate.

In [14]:
train_hooks = [
    hooks.MaxEpochHook(100),
    hooks.CSVHook(
        OUTDIR,
        metrics=train_metrics,
    ),
    hooks.PrintingHook(
        OUTDIR,
        metrics=train_metrics,
        separator = ' | ',
        time_strf='%M:%S'
    ),
    hooks.ReduceLROnPlateauHook(
        optimizer=optimizer,
        patience=30,
        factor=0.5,
        min_lr=1e-7,
        window_length=1,
        stop_after_min=True
    )
]
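
The learning-rate hook follows the usual reduce-on-plateau idea: when the validation loss has not improved for more than `patience` epochs, the learning rate is multiplied by `factor`, down to `min_lr`. The class below is a simplified sketch of that logic under those assumptions, not the code of `hooks.ReduceLROnPlateauHook`; `PlateauScheduler` is a hypothetical name, and the `window_length` averaging is left out.

```python
class PlateauScheduler:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr, patience=30, factor=0.5, min_lr=1e-7):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return False once min_lr has been reached."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_epochs = 0
        return self.lr > self.min_lr


scheduler = PlateauScheduler(lr=3e-4)
scheduler.step(1.30)  # the first epoch sets the best loss
for _ in range(31):   # 31 epochs without improvement exceed patience=30
    scheduler.step(1.31)
print(scheduler.lr)
```

The boolean returned by `step` mirrors the intent of `stop_after_min=True`: a training loop could stop once the learning rate has bottomed out.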

### Trainer wrapper

A `Trainer` in the `nff` package is a wrapper to train a model. It automatically creates checkpoints and handles both training and validation of a given model. It also allows further training by loading checkpoints from existing paths, making the training procedure more flexible. Its functionality can be extended by the hooks we created above. To create a trainer, we execute the following command:

In [15]:
T = Trainer(
    model_path=OUTDIR,
    model=model,
    loss_fn=loss_fn,
    optimizer=optimizer,
    train_loader=train_loader,
    validation_loader=val_loader,
    checkpoint_interval=1,
    hooks=train_hooks
)

Now we can finally train the model using the method `train` from the `Trainer`:

In [None]:
T.train(device=DEVICE, n_epochs=100)


 Time | Epoch | Learning rate | Train loss | Validation loss | TruePositive_bind | TrueNegative_bind | FalsePositive_bind | FalseNegative_bind | GPU Memory (MB)




51:08 |     1 |     3.000e-04 |     1.4861 |          1.3004 |               nan |            0.9573 |                nan |             0.0427 |               0
51:32 |     2 |     3.000e-04 |     1.4827 |          1.3051 |               nan |            0.9573 |                nan |             0.0427 |               0
51:56 |     3 |     3.000e-04 |     1.4827 |          1.3051 |               nan |            0.9573 |                nan |             0.0427 |               0
