# Creating a dataset using nff's code and our database on HTVS

This tutorial shows how to create a dataset directly from our HTVS database. It requires setting up the `dbsetting` project, which makes loading the data easier for this code.

In [70]:
import sys
import numpy as np

import torch
from dbsetting import *
import nff.data as d

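### Data: dimethyl ether calculations using a double-hybrid functional

For this example, we will create a dataset of double-hybrid DFT calculations (method `dft_d3_dhyb_dsdpbebp86`). The query below filters on the SMILES `COC`, which is dimethyl ether, an isomer of ethanol (`CCO`). We will extract up to 500 calculations from the database.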

In [71]:
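# All species matching the SMILES string
myspecies = Species.objects.filter(smiles='COC')
# The level of theory to pull calculations for
mymethod = Method.objects.filter(name='dft_d3_dhyb_dsdpbebp86').first()
# Calculations for these species with this method, in random order
mycalcs = (Calc.objects
    .filter(method=mymethod)
    .filter(species__in=myspecies)
    .order_by('?')
)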

Checking whether there are reference Calcs for this method. The query below looks for molecular hydrogen (`[H][H]`) and finds none:

In [69]:
mymethod = Method.objects.filter(name='dft_d3_dhyb_dsdpbebp86').first()
Calc.objects.filter(method=mymethod, species__smiles='[H][H]').count()

0

In [72]:
geoms_id = mycalcs.values_list('geoms__id')[:500]
geoms = Geom.objects.filter(id__in=geoms_id)

Retrieving the calculations:

In [91]:
values = get_xyz_force_energy_smiles(
    geoms,
    methodname=mymethod.name
)

nxyz_data = values[0]
force_data = [np.array(x) for x in values[1]]
energy_data = np.array(values[2]) - np.mean(values[2])
smiles_data = values[3]
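Subtracting the mean shifts every energy by the same constant, so energy differences between geometries are unchanged. A minimal sketch with made-up energies (illustrative values, not taken from the database):

```python
import numpy as np

# Made-up total energies for three geometries (illustrative values only)
raw_energies = np.array([-154.10, -154.08, -154.06])

# Same operation as in the cell above: shift so that the mean is zero
centered = raw_energies - np.mean(raw_energies)

# Differences between consecutive geometries, before and after the shift
raw_gaps = np.diff(raw_energies)
centered_gaps = np.diff(centered)
```

The two sets of gaps agree, which is why the shift does not affect the forces.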

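The `Dataset` requires a dictionary of lists for its properties. It works with `energy_grad` (the gradient of the energy with respect to the atomic positions) instead of forces. Since the force is the negative of that gradient, we convert between the two by inverting the sign of the forces: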

In [92]:
props = {
    'nxyz': nxyz_data,
    'energy': energy_data,
    'energy_grad': [-x for x in force_data]
}
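The sign convention can be checked on a toy system. The sketch below assumes a hypothetical harmonic potential (it has nothing to do with the database) and compares the negated analytical force with a finite-difference gradient of the energy:

```python
import numpy as np

k = 2.0  # hypothetical spring constant, chosen only for illustration

def energy(pos):
    """Harmonic energy E = 0.5 * k * sum(x^2) over all coordinates."""
    return 0.5 * k * np.sum(pos ** 2)

def force(pos):
    """Analytical force F = -dE/dx = -k * x."""
    return -k * pos

# Two "atoms" with arbitrary positions
pos = np.array([[0.1, -0.2, 0.3],
                [0.0, 0.5, -0.4]])

# Central finite-difference gradient of the energy
h = 1e-6
num_grad = np.zeros_like(pos)
for idx in np.ndindex(pos.shape):
    step = np.zeros_like(pos)
    step[idx] = h
    num_grad[idx] = (energy(pos + step) - energy(pos - step)) / (2 * h)

# Flipping the sign of the force gives the energy gradient
energy_grad = -force(pos)
```

Here `num_grad` and `energy_grad` match to numerical precision, which is the relation used when building `props`.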

### Creating the dataset

When creating the dataset, we have to supply the properties of interest and the units of the energy. The energy gradients should be in the same system of units, and XYZ positions should be in Å.

In [93]:
dataset = d.Dataset(props.copy(), units='atomic')
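If the stored values and the target units differ, both energies and gradients need converting, and the gradient also picks up a length factor. The sketch below converts Hartree and Hartree/Bohr to kcal/mol and kcal/mol/Å, assuming that is what `'atomic'` refers to here. The helper names are hypothetical and are not part of nff:

```python
HARTREE_TO_KCAL_MOL = 627.509  # 1 Hartree expressed in kcal/mol
BOHR_TO_ANGSTROM = 0.529177    # 1 Bohr expressed in Å

def energy_to_kcal_mol(energy_hartree):
    """Convert a single energy from Hartree to kcal/mol."""
    return energy_hartree * HARTREE_TO_KCAL_MOL

def grad_to_kcal_mol_angstrom(grad_hartree_bohr):
    """Convert a list of per-atom gradients from Hartree/Bohr to kcal/mol/Å."""
    factor = HARTREE_TO_KCAL_MOL / BOHR_TO_ANGSTROM
    return [[c * factor for c in atom] for atom in grad_hartree_bohr]

# Illustrative inputs only
converted_energy = energy_to_kcal_mol(0.001)
converted_grad = grad_to_kcal_mol_angstrom([[0.01, 0.0, -0.01]])
```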

Here's an example of an item from the dataset:

In [94]:
dataset[0]

{'nxyz': tensor([[ 6.0000,  1.1761,  0.1112, -0.0274],
         [ 8.0000,  0.1080, -0.8071,  0.2331],
         [ 6.0000, -1.1692, -0.2110,  0.0710],
         [ 1.0000,  1.2005,  0.5543, -1.0620],
         [ 1.0000,  1.0958,  1.0418,  0.5908],
         [ 1.0000,  2.1149, -0.4663,  0.0950],
         [ 1.0000, -1.9168, -0.9302,  0.3407],
         [ 1.0000, -1.2084,  0.0966, -0.9999],
         [ 1.0000, -1.2481,  0.8450,  0.7435]]),
 'energy': tensor(1.1699),
 'energy_grad': tensor([[ 2.3291e+00, -7.8456e+00,  1.1062e+01],
         [-8.6176e+00, -1.5709e+01, -3.1400e+00],
         [-3.4544e+01, -7.4079e+01, -7.2512e+00],
         [ 9.6509e+00,  1.8666e+01, -8.2697e+00],
         [-5.5672e+00,  2.1054e+01, -5.0141e+00],
         [ 1.1862e+01, -1.3442e+01, -6.2142e+00],
         [ 9.7976e+00,  4.9230e+00, -3.8420e+00],
         [ 1.5142e+01,  1.5226e+00, -1.3161e+01],
         [-6.0180e-02,  6.4751e+01,  3.5879e+01]]),
 'num_atoms': tensor(9)}

Calculating the length of the dataset:

In [96]:
len(dataset)

481

### Generating neighbor list

Managing the neighbor list of the input graphs is the responsibility of the dataset, which has a built-in method to do so. It requires only the cutoff (in Å) within which two atoms count as neighbors:

In [97]:
dataset.generate_neighbor_list(cutoff=5)
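To illustrate what a cutoff-based neighbor list contains, here is a brute-force sketch in plain Python. It is not nff's implementation. The function name and the choice to store directed pairs `(i, j)` are assumptions made for the example:

```python
import math

def brute_force_neighbors(xyz, cutoff):
    """Return all directed pairs (i, j), i != j, closer than `cutoff`."""
    pairs = []
    for i, a in enumerate(xyz):
        for j, b in enumerate(xyz):
            if i != j and math.dist(a, b) < cutoff:
                pairs.append((i, j))
    return pairs

# Three points on a line (in Å); the third is beyond the cutoff of both others
xyz = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
neighbors = brute_force_neighbors(xyz, cutoff=5.0)
```

Only the first two points end up as neighbors of each other. This double loop scales quadratically with the number of atoms, which is fine for small molecules like the one in this tutorial.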

### Loading/saving dataset from file

We can save this dataset to a file using its built-in method:

In [98]:
dataset.save('dataset_dlpno_TEST.pth.tar')