# Creating a dataset using nff's code

In [2]:
import sys
import numpy as np
from importlib import reload
import networkx as nx

import torch
from torch.utils.data import DataLoader

# Add the directory two levels up to the import path so that nff can be found
sys.path.append("../..")

In [5]:
import nff.data as d

### Data: ethanol trajectories

For this example, we use an ethanol dataset provided by Wujie. It is a simple benchmark that shows how to create a dataset from raw data. Below, we start by loading the raw data:

In [3]:
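ethanol_data = np.load('ethanol_ccsd_t-train.npz')
n_frames = 1000  # number of geometries assumed by this example
# Prepend the atomic numbers to the coordinates: each atom row becomes [Z, x, y, z]
nxyz_data = np.dstack((np.array([ethanol_data.f.z] * n_frames).reshape(n_frames, -1, 1), np.array(ethanol_data.f.R)))
force_data = ethanol_data.f.F
# Shift the energies so that their mean is zero
energy_data = ethanol_data.f.E.squeeze() - ethanol_data.f.E.mean()
smiles_data = ["CCO"] * n_frames  # "CCO" is the SMILES string for ethanol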
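
The `nxyz` array built above stores, for every geometry, one row per atom of the form `[Z, x, y, z]`. The following is a minimal sketch of that layout using made-up data; the arrays `z` and `R` are illustrative stand-ins with random positions, not the contents of the ethanol file.

```python
import numpy as np

# Hypothetical stand-ins for the arrays stored in the .npz file:
# z: atomic numbers, shape (n_atoms,)
# R: positions, shape (n_frames, n_atoms, 3)
z = np.array([6, 6, 8, 1, 1, 1, 1, 1, 1])   # C2H6O has 9 atoms
rng = np.random.default_rng(0)
R = rng.normal(size=(4, z.size, 3))          # 4 fake geometries

# Repeat the atomic numbers for every frame and prepend them as a column
z_col = np.broadcast_to(z[None, :, None], (R.shape[0], z.size, 1))
nxyz = np.concatenate([z_col, R], axis=-1)

print(nxyz.shape)  # (4, 9, 4)
```

The first column of every frame holds the atomic numbers and the remaining three columns hold the Cartesian coordinates.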

The `Dataset` class requires a dictionary of lists for its properties. It also works with `energy_grad` instead of forces. Since the force is the negative gradient of the energy, we invert the sign of the forces when building the `props` dictionary.

In [4]:
dataset = d.Dataset.from_file('autopology_dataset.pth.tar')

In [5]:
dataset.props["num_angles"][:10]

[tensor(3947),
 tensor(3915),
 tensor(3952),
 tensor(3242),
 tensor(3156),
 tensor(3227),
 tensor(2769),
 tensor(2670),
 tensor(2668),
 tensor(2667)]

In [6]:
dataset.props["num_atoms"][9]

tensor(27)

In [7]:
dataset.props["num_bonds"][:10]

[tensor(262),
 tensor(261),
 tensor(262),
 tensor(227),
 tensor(225),
 tensor(227),
 tensor(196),
 tensor(193),
 tensor(193),
 tensor(192)]

In [8]:
# props["smiles"]

In [9]:
props = {
    'nxyz': nxyz_data.tolist(),
    'energy': energy_data.tolist(),
    'energy_grad': [(-x).tolist() for x in force_data],
    'smiles': smiles_data
}
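
The sign flip between forces and `energy_grad` can be checked on a toy potential. The sketch below uses a hypothetical harmonic energy (the functions `energy` and `force` and the constant `k` are assumptions for illustration only) and compares a finite-difference gradient with the negated analytical force.

```python
import numpy as np

k = 2.0  # hypothetical force constant


def energy(x):
    """Harmonic toy potential E(x) = k/2 * |x|^2."""
    return 0.5 * k * np.sum(x ** 2)


def force(x):
    """Analytical force F = -dE/dx = -k * x."""
    return -k * x


x = np.array([0.3, -0.1, 0.7])

# Central finite-difference estimate of dE/dx, one coordinate at a time
h = 1e-6
energy_grad = np.array([
    (energy(x + h * e) - energy(x - h * e)) / (2 * h)
    for e in np.eye(x.size)
])

# The gradient of the energy equals the negative of the force
assert np.allclose(energy_grad, -force(x), atol=1e-6)
```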

### Creating the dataset

When creating the dataset, we supply the properties of interest and the units of the energy. The forces must be expressed in the same system of units, and XYZ positions must be in Å.
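
If the raw data comes in other units, it has to be converted consistently before it is passed in. The sketch below shows one way to do this by hand; the helper names are hypothetical and not part of nff, and the conversion factors are rounded standard values.

```python
# Conversion factors to kcal/mol (rounded standard values)
TO_KCAL_PER_MOL = {
    "kcal/mol": 1.0,
    "eV": 23.0605,
    "Hartree": 627.5095,
    "kJ/mol": 1.0 / 4.184,
}
BOHR_TO_ANGSTROM = 0.529177


def energy_to_kcal(value, unit):
    """Convert an energy from `unit` to kcal/mol."""
    return value * TO_KCAL_PER_MOL[unit]


def grad_to_kcal_per_angstrom(value, energy_unit, length_unit="Angstrom"):
    """Convert an energy gradient to kcal/mol/Angstrom."""
    scale = TO_KCAL_PER_MOL[energy_unit]
    if length_unit == "Bohr":
        # A gradient is energy per length, so divide by the length factor
        scale /= BOHR_TO_ANGSTROM
    elif length_unit != "Angstrom":
        raise ValueError(f"unsupported length unit: {length_unit}")
    return value * scale


print(energy_to_kcal(1.0, "eV"))
print(grad_to_kcal_per_angstrom(1.0, "Hartree", "Bohr"))
```

Energies and gradients are converted with the same energy factor, which keeps the two properties in one unit system.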

In [24]:
old_dataset = d.Dataset.from_file("switch_data_new_stoich.pth.tar")
props = old_dataset.props
props.pop("num_atoms")
props.pop("nbr_list")
# props.pop("pbc")


[tensor([[ 0,  1],
         [ 0,  2],
         [ 0,  3],
         ...

In [25]:
dataset = d.Dataset(props.copy(), units='kcal/mol')

In [26]:
dataset.generate_neighbor_list(cutoff=5)

In [27]:
# dataset.props["nbr_list"]

In [28]:
dataset.generate_topologies()

In [29]:
dataset.save('autopology_full.pth.tar')

Here's an example of a property stored in the dataset, the number of impropers for the first four entries:

In [22]:
dataset = d.Dataset.from_file("autopology_dataset.pth.tar")
dataset.props["num_impropers"][:4]

[tensor(18), tensor(18), tensor(18), tensor(12)]

Calculating the length of the dataset:

In [None]:
len(dataset)

### Generating neighbor list

The dataset is responsible for managing the neighbor list of the input graphs, and it has a built-in method to do so. The method only requires the cutoff distance (in Å) within which two atoms count as neighbors:

In [None]:
dataset.generate_neighbor_list(cutoff=5)
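
To illustrate what a cutoff-based neighbor list contains, here is a minimal NumPy sketch for a single non-periodic geometry. It is not nff's implementation: the function name `neighbor_list_sketch`, the strict `<` comparison, and the choice to return directed pairs are assumptions made for this example.

```python
import numpy as np


def neighbor_list_sketch(xyz, cutoff):
    """Return directed pairs (i, j), i != j, with |r_i - r_j| < cutoff."""
    diff = xyz[:, None, :] - xyz[None, :, :]   # all pairwise displacement vectors
    dist = np.linalg.norm(diff, axis=-1)       # pairwise distance matrix
    mask = (dist < cutoff) & ~np.eye(len(xyz), dtype=bool)  # drop self-pairs
    return np.argwhere(mask)


# Three atoms on a line; the third is farther than 5 Å from the other two
xyz = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [10.0, 0.0, 0.0]])
pairs = neighbor_list_sketch(xyz, cutoff=5.0)
print(pairs.tolist())  # [[0, 1], [1, 0]]
```

Each pair appears in both directions, and the isolated third atom has no neighbors.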

We can plot an example of a graph by using the neighbor list we just computed:

In [None]:
%matplotlib inline
nbr_list = dataset[0]['nbr_list'].numpy()
G = nx.from_edgelist(nbr_list)
nx.draw_kamada_kawai(G)

### Loading/saving dataset from file

We can save this dataset to a file using its built-in method:

In [None]:
dataset.save('dataset.pth.tar')

Alternatively, we could load this same dataset directly:

In [None]:
dataset = d.Dataset.from_file('dataset.pth.tar')

## DataLoader

To create a dataloader for the dataset we just created, we use PyTorch's DataLoader and our custom collate function:

In [None]:
loader = DataLoader(dataset, batch_size=5, collate_fn=d.collate_dicts)

Example of a batch from this dataloader:

In [None]:
next(iter(loader))
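
A custom collate function is needed because molecules in a batch can have different numbers of atoms, so their arrays cannot simply be stacked. The sketch below shows one common way to merge such items, written with NumPy for brevity. The function `collate_sketch` is hypothetical and is not the code behind `d.collate_dicts`; the index-offset scheme is an assumption about how such a batch can be laid out.

```python
import numpy as np


def collate_sketch(items):
    """Merge per-molecule dicts into one batch dict (illustrative only)."""
    num_atoms = [len(item["nxyz"]) for item in items]
    # Index of the first atom of each molecule in the concatenated array
    offsets = np.cumsum([0] + num_atoms[:-1])
    return {
        "nxyz": np.concatenate([item["nxyz"] for item in items]),
        "num_atoms": np.array(num_atoms),
        "energy": np.array([item["energy"] for item in items]),
        # Shift atom indices so they point into the concatenated nxyz array
        "nbr_list": np.concatenate(
            [item["nbr_list"] + off for item, off in zip(items, offsets)]
        ),
    }


mol_a = {"nxyz": np.zeros((2, 4)), "energy": -1.0,
         "nbr_list": np.array([[0, 1], [1, 0]])}
mol_b = {"nxyz": np.zeros((3, 4)), "energy": -2.0,
         "nbr_list": np.array([[0, 2], [2, 0]])}

batch = collate_sketch([mol_a, mol_b])
print(batch["nbr_list"].tolist())  # [[0, 1], [1, 0], [2, 4], [4, 2]]
```

Keeping `num_atoms` in the batch makes it possible to split per-atom quantities back into individual molecules later.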

# Generating a dataset for SchNet + AuTopology

If we want to learn classical priors together with SchNet, we need the topologies (bonds, angles, dihedrals, etc.) of each molecule. We generate them with the `generate_topologies` function.

Here's an example in which we take an existing dataset and modify it by adding topologies.

In [10]:
old_dataset = d.Dataset.from_file("switch_demonstration.pth.tar")
initial_prop_keys = ["energy_0", "energy_1", "energy_0_grad", "energy_1_grad", "nxyz"]
props = {key: val for key, val in old_dataset.props.items() if key in initial_prop_keys}

new_dataset = d.Dataset(props.copy(), units='kcal/mol')
new_dataset.generate_neighbor_list(cutoff=5.0)
new_dataset.generate_topologies()


new_dataset.save("autopology_demonstration.pth.tar")
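
To make the notion of topologies concrete, the sketch below derives angles and proper dihedrals from a list of bonds using only the standard library. It is an illustration of the idea, not the algorithm used by `generate_topologies`; the helper names and the four-atom example chain are assumptions.

```python
from itertools import combinations


def build_neighbors(bonds):
    """Map each atom index to the set of atoms it is bonded to."""
    neighbors = {}
    for a, b in bonds:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def angles_from_bonds(bonds):
    """Enumerate angles (i, j, k) where j is bonded to both i and k."""
    neighbors = build_neighbors(bonds)
    return [
        (i, j, k)
        for j in sorted(neighbors)
        for i, k in combinations(sorted(neighbors[j]), 2)
    ]


def dihedrals_from_bonds(bonds):
    """Enumerate proper dihedrals (i, j, k, l) around each bond j-k."""
    neighbors = build_neighbors(bonds)
    dihedrals = []
    for j, k in bonds:
        for i in sorted(neighbors[j] - {k}):
            for l in sorted(neighbors[k] - {j}):
                if i != l:  # skip three-membered rings
                    dihedrals.append((i, j, k, l))
    return dihedrals


# Hypothetical four-atom chain 0-1-2-3
bonds = [(0, 1), (1, 2), (2, 3)]
print(angles_from_bonds(bonds))     # [(0, 1, 2), (1, 2, 3)]
print(dihedrals_from_bonds(bonds))  # [(0, 1, 2, 3)]
```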