# Creating a dataset using nff's code

In [9]:
import sys
import numpy as np
from importlib import reload
import networkx as nx

import torch
from torch.utils.data import DataLoader

# Make the nff package importable from the repository root
sys.path.append("../..")

In [10]:
import nff.data as d

### Data: ethanol trajectories

For this example, we use an ethanol dataset provided by Wujie. It is a simple benchmark that shows how to create a dataset from raw data. We start by loading the raw data:

In [3]:
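ethanol_data = np.load('ethanol_ccsd_t-train.npz')
n_frames = len(ethanol_data.f.R)  # 1000 geometries in this file
# Stack the atomic numbers (z) onto the coordinates (R): each row is [Z, x, y, z]
nxyz_data = np.dstack((np.array([ethanol_data.f.z] * n_frames).reshape(n_frames, -1, 1),
                       np.array(ethanol_data.f.R)))
force_data = ethanol_data.f.F
# Shift the energies so that their mean is zero
energy_data = ethanol_data.f.E.squeeze() - ethanol_data.f.E.mean()
smiles_data = ["CCO"] * n_frames  # SMILES string of ethanol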
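To make the shape manipulation above concrete, here is a minimal sketch on synthetic arrays. The values are made up and only the shapes mirror the ethanol file. It assumes `z` has shape `(n_atoms,)` and `R` has shape `(n_frames, n_atoms, 3)`, and it uses broadcasting rather than repeating a list.

```python
import numpy as np

# Synthetic stand-ins for ethanol_data.f.z and ethanol_data.f.R (made-up values)
n_frames, n_atoms = 4, 9
z = np.array([6, 6, 8, 1, 1, 1, 1, 1, 1])   # atomic numbers, shape (n_atoms,)
R = np.random.default_rng(0).normal(size=(n_frames, n_atoms, 3))  # coordinates

# Repeat the atomic numbers for every frame and give them a trailing axis
z_col = np.broadcast_to(z.reshape(1, n_atoms, 1), (n_frames, n_atoms, 1))

# Concatenate along the last axis so that each row reads [Z, x, y, z]
nxyz = np.concatenate([z_col, R], axis=-1)

print(nxyz.shape)   # (4, 9, 4)
```

The first column of every frame holds the atomic numbers, and the remaining three columns hold the Cartesian coordinates.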

The `Dataset` requires a dictionary of lists for its properties. It also works with `energy_grad` rather than forces. Since the force is the negative gradient of the energy, we convert between the two by inverting the sign of the forces:

In [9]:
props = {
    'nxyz': nxyz_data.tolist(),
    'energy': energy_data.tolist(),
    'energy_grad': [(-x).tolist() for x in force_data],
    'smiles': smiles_data
}
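Because every property is a list with one entry per geometry, a quick consistency check before building the dataset can catch mismatched lengths early. The helper below, `check_props`, is hypothetical (it is not part of nff) and runs on a small made-up dictionary.

```python
def check_props(props):
    """Return the common number of entries, or raise if the lists disagree.

    Hypothetical helper for illustration; not part of nff.
    """
    lengths = {key: len(val) for key, val in props.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Property lists differ in length: {lengths}")
    return next(iter(lengths.values()))


# Made-up data: two geometries of a two-atom molecule
forces = [[[0.1, 0.0, 0.0], [-0.1, 0.0, 0.0]],
          [[0.2, 0.0, 0.0], [-0.2, 0.0, 0.0]]]

toy_props = {
    'nxyz': [[[1, 0.0, 0.0, 0.0], [1, 0.74, 0.0, 0.0]],
             [[1, 0.0, 0.0, 0.0], [1, 0.80, 0.0, 0.0]]],
    'energy': [-1.0, -0.9],
    # energy_grad = dE/dR = -F, so every force component changes sign
    'energy_grad': [[[-f for f in atom] for atom in geom] for geom in forces],
    'smiles': ['[H][H]', '[H][H]'],
}

n_geoms = check_props(toy_props)
print(n_geoms)
```

Raising early with the offending lengths in the message makes it easier to see which property was built from a different number of frames.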

### Creating the dataset

When creating the dataset, we have to supply the properties of interest and the units of the energy. The energy gradients should use the same energy units (per Å), and the XYZ positions should be in Å.

In [25]:
dataset = d.Dataset(props.copy(), units='kcal/mol')
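If the raw data comes in a different energy unit, the energies and gradients have to agree with the `units` argument. As a sketch, assuming made-up data stored in eV and eV/Å, the conversion to kcal/mol uses 1 eV ≈ 23.0605 kcal/mol. The names below are illustrative and not part of nff.

```python
EV_TO_KCAL_MOL = 23.0605  # 1 eV expressed in kcal/mol (rounded)


def ev_to_kcal_mol(values):
    """Recursively scale a nested list of numbers from eV to kcal/mol."""
    if isinstance(values, (int, float)):
        return values * EV_TO_KCAL_MOL
    return [ev_to_kcal_mol(v) for v in values]


# Made-up data: one energy per geometry (eV) and one gradient per atom (eV/Å)
energy_ev = [0.010, -0.020]
energy_grad_ev = [[[0.1, 0.0, -0.1]], [[0.0, 0.2, 0.0]]]

energy_kcal = ev_to_kcal_mol(energy_ev)
energy_grad_kcal = ev_to_kcal_mol(energy_grad_ev)  # kcal/mol/Å; positions stay in Å

print(energy_kcal)
```

Only the energy part of the gradient unit changes here, because the positions are already in Å.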

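Here's an example of inspecting the properties stored in a dataset. We load a previously saved dataset that already contains topology information and look at the number of improper dihedrals in its first four geometries:

In [22]:
autopology_dataset = d.Dataset.from_file("autopology_dataset.pth.tar")
autopology_dataset.props["num_impropers"][:4]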

[tensor(18), tensor(18), tensor(18), tensor(12)]

Calculating the length of the dataset:

In [None]:
len(dataset)

### Generating neighbor list

The dataset is responsible for managing the neighbor lists of the input graphs, and it has a built-in function to generate them. The function requires only the cutoff (in Å) below which two atoms count as neighbors:

In [None]:
dataset.generate_neighbor_list(cutoff=5)
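Conceptually, a neighbor list is the set of atom pairs that lie closer together than the cutoff. The brute-force sketch below illustrates the idea in plain Python. `brute_force_neighbor_list` is a hypothetical name, listing each pair in both directions is an assumption, and the sketch is not meant to reproduce the exact algorithm or output format of `generate_neighbor_list`.

```python
import math


def brute_force_neighbor_list(xyz, cutoff):
    """Return all directed pairs (i, j), i != j, with distance below cutoff.

    Hypothetical O(N^2) illustration; xyz is a list of [x, y, z] in Å.
    """
    pairs = []
    for i, pos_i in enumerate(xyz):
        for j, pos_j in enumerate(xyz):
            if i != j and math.dist(pos_i, pos_j) < cutoff:
                pairs.append((i, j))
    return pairs


# Made-up geometry: three atoms on a line, 1.5 Å apart
xyz = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [3.0, 0.0, 0.0]]

print(brute_force_neighbor_list(xyz, cutoff=2.0))
# [(0, 1), (1, 0), (1, 2), (2, 1)]
print(len(brute_force_neighbor_list(xyz, cutoff=5.0)))
# 6
```

With a 5 Å cutoff every pair of atoms in a molecule this small is connected, which is also what one would expect for ethanol.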

We can plot an example of a graph by using the neighbor list we just computed:

In [None]:
%matplotlib inline
nbr_list = dataset[0]['nbr_list'].numpy()
G = nx.from_edgelist(nbr_list)
nx.draw_kamada_kawai(G)

### Loading/saving dataset from file

We can save this dataset to a file by using its built-in method:

In [None]:
dataset.save('dataset.pth.tar')

Later, we can load this same dataset directly from the file:

In [None]:
dataset = d.Dataset.from_file('dataset.pth.tar')

## DataLoader

To create a dataloader for the dataset we just created, we use PyTorch's DataLoader and our custom collate function:

In [None]:
loader = DataLoader(dataset, batch_size=5, collate_fn=d.collate_dicts)

Example of a batch from this dataloader:

In [None]:
next(iter(loader))
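Batching molecular graphs differs from batching fixed-size tensors: the atoms of all molecules are concatenated, so neighbor-list indices must be shifted by the number of atoms that precede each molecule. The sketch below shows this idea with plain lists. `collate_sketch` is a hypothetical stand-in, and whether `d.collate_dicts` works exactly this way is an assumption.

```python
def collate_sketch(items):
    """Merge per-molecule dicts into one batch dict (illustrative only).

    Each item holds 'nxyz' (list of atoms), 'nbr_list' (list of [i, j])
    and 'energy' (float). Neighbor indices are offset so that they refer
    to positions in the concatenated atom list.
    """
    batch = {'nxyz': [], 'nbr_list': [], 'energy': [], 'num_atoms': []}
    offset = 0
    for item in items:
        n_atoms = len(item['nxyz'])
        batch['nxyz'].extend(item['nxyz'])
        batch['nbr_list'].extend([i + offset, j + offset]
                                 for i, j in item['nbr_list'])
        batch['energy'].append(item['energy'])
        batch['num_atoms'].append(n_atoms)
        offset += n_atoms
    return batch


# Made-up molecules: a diatomic and a triatomic
mol_a = {'nxyz': [[1, 0.0, 0.0, 0.0], [1, 0.7, 0.0, 0.0]],
         'nbr_list': [[0, 1], [1, 0]], 'energy': -1.0}
mol_b = {'nxyz': [[8, 0.0, 0.0, 0.0], [1, 1.0, 0.0, 0.0], [1, 0.0, 1.0, 0.0]],
         'nbr_list': [[0, 1], [0, 2]], 'energy': -2.0}

batch = collate_sketch([mol_a, mol_b])
print(batch['nbr_list'])   # [[0, 1], [1, 0], [2, 3], [2, 4]]
print(batch['num_atoms'])  # [2, 3]
```

Keeping `num_atoms` in the batch makes it possible to split per-atom quantities back into molecules afterwards.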

# Generating a dataset for SchNet + AuTopology

If we want to learn classical priors together with SchNet, we need the topologies (bonds, angles, dihedrals, etc.) of the molecule. We generate them with the `generate_topologies` function. Note that the old props must include `smiles`, and that we must supply a bond dictionary for each distinct SMILES string. Normally we would get the bond dictionary from `htvs/djangochem/analysis/reference_molecule_graph.py`. Here we've provided a snippet of this code (`data/htvs_snippet.py`) to demonstrate how it works.

Here's an example in which we take an existing dataset and modify it by adding topologies.

In [24]:
from htvs_snippet import get_mol_ref

old_dataset = d.Dataset.from_file("switch_demonstration.pth.tar")

In [25]:
smileslist = list(set(old_dataset.props["smiles"]))
mol_ref = get_mol_ref(smileslist=smileslist,
                      groupname="switches",
                      method_name='molecular_mechanics_mmff94')

For `bond_dic`, we take only the zeroth element of each entry, which contains the bond list:

In [26]:
bond_dic = {key: val[0] for key, val in mol_ref.items()}

Before adding topologies, the old dataset contains the following props:

In [27]:
old_dataset.props.keys()

dict_keys(['energy_0', 'energy_1', 'energy_0_grad', 'nxyz', 'energy_1_grad', 'nbr_list', 'num_atoms', 'smiles'])

Now we just call `generate_topologies` on a copy of the dataset and the work is done for us:

In [29]:
dataset = old_dataset.copy()
dataset.generate_topologies(bond_dic)

In [30]:
dataset.save("autopology_demonstration.pth.tar")

We see that the new dataset props contain topology information:

In [31]:
dataset.props.keys()

dict_keys(['angles', 'degree_vec', 'num_dihedrals', 'bonds', 'smiles', 'energy_0', 'nxyz', 'impropers', 'bonded_nbr_list', 'pairs', 'num_angles', 'energy_1', 'num_pairs', 'nbr_list', 'num_atoms', 'dihedrals', 'energy_1_grad', 'energy_0_grad', 'num_impropers', 'num_bonds'])

For example, we can look at the first five dihedrals of the first geometry:

In [32]:
dataset.props["dihedrals"][0][:5]

tensor([[ 0,  1,  2, 24],
        [ 0,  1,  2,  3],
        [ 7,  1,  2, 24],
        [ 0,  1,  7,  8],
        [ 0,  1,  7,  5]])
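Each row of the dihedral tensor lists four atom indices joined by three consecutive bonds. As a rough illustration of how such terms can be enumerated from a bond list alone, here is a sketch in plain Python. `angles_and_dihedrals` is a hypothetical helper, and its ordering and de-duplication conventions are assumptions that may differ from what `generate_topologies` produces.

```python
from collections import defaultdict


def angles_and_dihedrals(bonds):
    """Enumerate angles (i, j, k) and proper dihedrals (i, j, k, l).

    Hypothetical illustration: bonds is a list of (a, b) index pairs.
    Each term is reported once, in the orientation whose first index
    is smaller than its last.
    """
    nbrs = defaultdict(set)
    for a, b in bonds:
        nbrs[a].add(b)
        nbrs[b].add(a)
    atoms = list(nbrs)

    # An angle is a central atom j with two distinct neighbors i and k
    angles = sorted((i, j, k)
                    for j in atoms
                    for i in nbrs[j]
                    for k in nbrs[j]
                    if i < k)

    # A dihedral is a central bond j-k extended by one neighbor on each side
    dihedrals = sorted((i, j, k, l)
                       for j in atoms
                       for k in nbrs[j]
                       for i in nbrs[j] - {k}
                       for l in nbrs[k] - {j}
                       if i < l)
    return angles, dihedrals


# Made-up skeleton: chain 0-1-2-3 with a branch 1-4
bonds = [(0, 1), (1, 2), (2, 3), (1, 4)]
angles, dihedrals = angles_and_dihedrals(bonds)
print(angles)     # [(0, 1, 2), (0, 1, 4), (1, 2, 3), (2, 1, 4)]
print(dihedrals)  # [(0, 1, 2, 3), (3, 2, 1, 4)]
```

The condition `i < l` keeps exactly one of the two equivalent orientations of each dihedral and also discards the degenerate case in which both end atoms coincide, as can happen in three-membered rings.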

Now we can use the dataset in the `SchNetAuTopology` model to combine classical priors with the neural force field!