# Create NFF dataset from MD17

MD17 is a popular benchmark dataset of energies and forces sampled from ab initio molecular dynamics trajectories of small organic molecules.

This brief tutorial shows how to use the utility function `get_md17_dataset` to prepare an MD17 dataset for NFF.

References:
* http://quantum-machine.org/gdml/#datasets
* Chmiela, S., Tkatchenko, A., Sauceda, H. E., Poltavsky, I., Schütt, K. T., Müller, K.-R. Machine learning of accurate energy-conserving molecular force fields. Science Advances 3(5), e1603015 (2017).

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from nff.data import Dataset
from nff.data.utils import get_md17_dataset

Download and prepare an NFF dataset for one of the supported molecules. The accepted names are `aspirin`, `benzene`, `ethanol`, `malonaldehyde`, `naphthalene`, `salicylic`, `toluene`, `uracil`, `paracetamol`, and `azobenzene`.

In [3]:
molecule = 'benzene'

dataset = get_md17_dataset(molecule)

Check the number of entries in the dataset and inspect the first one.

In [4]:
len(dataset)

49863

In [5]:
dataset[0]

{'nxyz': tensor([[  6.0000, -36.7161,  41.9528, -36.0171],
         [  6.0000, -36.0692,  41.9257, -34.7771],
         [  6.0000, -36.8074,  41.7153, -33.6075],
         [  6.0000, -38.1924,  41.5319, -33.6781],
         [  6.0000, -38.8393,  41.5589, -34.9181],
         [  6.0000, -38.1012,  41.7694, -36.0877],
         [  1.0000, -36.1398,  42.1171, -36.9301],
         [  1.0000, -34.9879,  42.0689, -34.7220],
         [  1.0000, -36.3024,  41.6941, -32.6394],
         [  1.0000, -38.7687,  41.3675, -32.7651],
         [  1.0000, -39.9206,  41.4158, -34.9732],
         [  1.0000, -38.6062,  41.7906, -37.0558]]),
 'energy': tensor([-145503.0469]),
 'energy_grad': tensor([[-0.0856, -0.0189,  0.1409],
         [-0.1680, -0.0249, -0.0189],
         [-0.0733, -0.0039, -0.1592],
         [ 0.0856,  0.0189, -0.1409],
         [ 0.1680,  0.0248,  0.0189],
         [ 0.0733,  0.0040,  0.1592],
         [ 0.2642,  0.0709, -0.4128],
         [ 0.4869,  0.0668,  0.0272],
         [ 0.2255, -0.00
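
Judging from the output above, each row of `nxyz` appears to hold an atomic number followed by Cartesian x, y, z coordinates (6 for carbon and 1 for hydrogen, as expected for benzene). Assuming `energy_grad` is the gradient of the energy with respect to atomic positions, as its name suggests, the forces would be its negative. The sketch below uses plain Python lists with a few rows copied from that output; the helper name `split_nxyz` is hypothetical and not part of NFF.

```python
# A few rows of dataset[0]['nxyz'], copied from the output above
# (two carbons and the first hydrogen).
nxyz = [
    [6.0, -36.7161, 41.9528, -36.0171],
    [6.0, -36.0692, 41.9257, -34.7771],
    [1.0, -36.1398, 42.1171, -36.9301],
]

# The first two rows of dataset[0]['energy_grad'] from the same output.
energy_grad = [
    [-0.0856, -0.0189, 0.1409],
    [-0.1680, -0.0249, -0.0189],
]


def split_nxyz(rows):
    """Hypothetical helper: split [Z, x, y, z] rows into numbers and coordinates."""
    atomic_numbers = [int(row[0]) for row in rows]  # first column: atomic number
    coords = [row[1:] for row in rows]              # remaining columns: x, y, z
    return atomic_numbers, coords


atomic_numbers, coords = split_nxyz(nxyz)

# Assumption: forces are the negative of the energy gradient.
forces = [[-g for g in row] for row in energy_grad]

print(atomic_numbers)
print(forces[0])
```

On the actual tensors, the same split can be written with standard indexing, for example `dataset[0]['nxyz'][:, 0]` for the atomic numbers and `dataset[0]['nxyz'][:, 1:]` for the coordinates.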

Save and load the dataset.

In [None]:
dataset.save(f'{molecule}.pth.tar')

In [23]:
dataset = Dataset.from_file(f'{molecule}.pth.tar')

Retrieving multiple datasets is easy with a loop.

In [7]:
molecules = ['benzene', 'ethanol']
datasets = []
for molecule in molecules:
    dataset = get_md17_dataset(molecule)
    datasets.append(dataset)

In [8]:
print([len(d) for d in datasets])

[49863, 555092]
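
A plain list does not record which dataset belongs to which molecule. One option, sketched here, is to key the datasets by name. The helper `load_many` and the stand-in `fake_loader` are hypothetical; the stand-in only exists so the sketch runs without downloading anything.

```python
def load_many(names, loader):
    """Hypothetical helper: map each molecule name to loader(name)."""
    return {name: loader(name) for name in names}


def fake_loader(name):
    """Stand-in for a real loader; returns a short list instead of a dataset."""
    return [name] * 3


datasets_by_name = load_many(['benzene', 'ethanol'], fake_loader)

# Sizes are now labeled by molecule name.
sizes = {name: len(d) for name, d in datasets_by_name.items()}
print(sizes)
```

In the notebook, the call would presumably be `load_many(molecules, get_md17_dataset)`, after which `datasets_by_name['ethanol']` retrieves a dataset by name.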


Requesting a molecule that is not part of MD17 raises a `ValueError` that lists the accepted names.

In [6]:
dataset = get_md17_dataset('cureall_molecule')

ValueError: ('Incorrect value for molecule. Must be one of: ', ['aspirin', 'benzene', 'ethanol', 'malonaldehyde', 'naphthalene', 'salicylic', 'toluene', 'uracil', 'paracetamol', 'azobenzene'])
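
To catch a bad name before calling the loader, a script could validate it up front. The sketch below is an illustration only: `MD17_MOLECULES` is copied from the error message above, and `check_md17_name` is a hypothetical helper, not an NFF function.

```python
# Accepted names, copied from the error message above.
MD17_MOLECULES = [
    'aspirin', 'benzene', 'ethanol', 'malonaldehyde', 'naphthalene',
    'salicylic', 'toluene', 'uracil', 'paracetamol', 'azobenzene',
]


def check_md17_name(name):
    """Hypothetical helper: return name if accepted, else raise ValueError."""
    if name not in MD17_MOLECULES:
        raise ValueError(
            f"Unknown MD17 molecule {name!r}. Must be one of: {MD17_MOLECULES}"
        )
    return name


# A valid name passes through unchanged.
checked = check_md17_name('benzene')

# An invalid name raises, and the message can be reported to the user.
try:
    check_md17_name('cureall_molecule')
except ValueError as err:
    message = str(err)
    print(message)
```

The same `try`/`except ValueError` pattern should also work around `get_md17_dataset` itself, since the traceback above shows it raising that exception type.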