In [1]:
import datasets
import tqdm

## Load dataset

The data are stored in the [Hugging Face `datasets`](https://huggingface.co/docs/datasets/en/index) format.

In [13]:
opt_dataset = datasets.Dataset.load_from_disk("../data/additional-optimizations/")
td_dataset = datasets.Dataset.load_from_disk("../data/additional-torsiondrives/")

Displaying a dataset lists the features stored in each entry and the number of rows.

In [14]:
opt_dataset

Dataset({
    features: ['smiles', 'coords', 'energy', 'forces'],
    num_rows: 70
})

Indexing a dataset returns a single entry. Fields such as `coords` and `forces` are stored as flat lists of floats.

In [4]:
opt_dataset[0]

{'smiles': '[H:17][c:1]1[c:2]([c:5]([c:9]([c:8]([c:4]1[H:20])[c:10]2[c:6]([c:3]([c:7]([c:11](=[O:15])[n:14]2[C:12]([H:24])([H:25])[H:26])[H:23])[H:19])[H:22])[O:16][C:13]([H:27])([H:28])[H:29])[H:21])[H:18]',
 'coords': [-2.429097601086157,
  -4.995734076091361,
  5.279949289173572,
  -1.4927028614970528,
  -4.969878833032298,
  6.313283057490595,
  -1.7725006246161668,
  -0.5189687735923793,
  1.6009539610385115,
  -2.247461979620605,
  -4.159604022426994,
  4.172769570744676,
  -0.38466058273334747,
  -4.115854293359497,
  6.251594902508013,
  -1.9333702569868894,
  -1.4028490611584157,
  2.7018932656017056,
  -0.6955642015756051,
  -0.6374878174592984,
  0.7671004023768058,
  -1.1450532229998112,
  -3.3048529745574973,
  4.08185260086658,
  -0.21036471460187214,
  -3.2807783690358603,
  5.1415423097198305,
  -1.002708576933281,
  -2.391834366260054,
  2.9167438933540053,
  0.30720990737475856,
  -1.6608091536936738,
  0.9680378916896591,
  1.0356941877630836,
  -3.6276665354563615,
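
Because `coords` and `forces` are flat, they have to be regrouped into one (x, y, z) triple per atom before they can be read atom by atom. The sketch below is a minimal illustration in plain Python. `to_xyz_rows` is a hypothetical helper, not part of `datasets`, and it assumes the values are ordered atom by atom as x, y, z. The numbers are the first six `coords` values from the entry above.

```python
def to_xyz_rows(flat):
    """Group a flat [x0, y0, z0, x1, y1, z1, ...] list into [x, y, z] rows."""
    if len(flat) % 3 != 0:
        raise ValueError("expected a multiple of 3 values")
    # Step through the list three values at a time: one row per atom.
    return [flat[i:i + 3] for i in range(0, len(flat), 3)]


# First six 'coords' values of the entry shown above (two atoms).
flat_coords = [
    -2.429097601086157, -4.995734076091361, 5.279949289173572,
    -1.4927028614970528, -4.969878833032298, 6.313283057490595,
]
rows = to_xyz_rows(flat_coords)
```

For a `torch` tensor, `tensor.reshape(-1, 3)` produces the same grouping.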


The numeric columns need to be converted back to PyTorch tensors.

In [15]:
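# Return the numeric columns as torch tensors instead of Python lists.
# output_all_columns=True keeps the unformatted columns (e.g. 'smiles') in each entry.
opt_dataset.set_format('torch', columns=['energy', 'coords', 'forces'], output_all_columns=True)
td_dataset.set_format('torch', columns=['energy', 'coords', 'forces'], output_all_columns=True)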

In [16]:
opt_dataset[0]

{'coords': tensor([-1.7365,  1.3097,  5.1288, -1.5198,  1.1527,  3.7460, -0.6614,  0.3525,
          3.0057, -0.8821,  0.7326,  1.6502, -0.2310,  0.1866,  0.4400, -0.4846,
          0.6215, -1.0329,  0.5960, -0.5005, -1.1146,  1.2970,  0.1879,  0.0944,
         -0.2535, -1.2926, -0.0766, -1.8106,  1.6883,  1.5537, -2.1830,  1.9305,
          2.8444, -1.0775,  0.7501,  5.6606, -2.6807,  1.0490,  5.4048,  0.0104,
         -0.4076,  3.3820, -1.4734,  0.3898, -1.4397, -0.1691,  1.6355, -1.2960,
          1.0740, -0.8972, -2.0142,  1.7470,  1.1697, -0.0845,  1.9392, -0.4433,
          0.7182,  0.2762, -2.0313,  0.5348, -1.2254, -1.6684, -0.4125, -2.8092,
          2.6975,  3.0476]),
 'energy': tensor([-298509.5312]),
 'forces': tensor([ 8.6516e-03,  9.9962e-03, -2.1068e-02, -6.0739e-02, -7.5098e-02,
          1.4797e-02,  6.7018e-02,  6.8418e-02,  1.4632e-02, -1.8251e-02,
         -5.5096e-02,  2.0181e-02, -4.5286e-02,  5.9464e-02, -4.5025e-02,
         -3.8120e-03, -6.7527e-03,  5.1948e-03

## Fitting

For how to fit a force field to optimization data, starting from a SMIRNOFF force field, see the [example I put together for the IRL Irvine meeting](https://openforcefield.atlassian.net/wiki/spaces/MEET/pages/3440508935/Hackathon+How+to+train+your+force+field+with+smee) (`run-smee-fit-from-qca-data-commented.ipynb`). You can largely follow on from the "Assign parameters to molecules in the dataset" heading.

One caveat: the `descent.targets.energy.predict` function would have to be rewritten to leave forces out of both the prediction and the objective when they are not in the data.
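
To illustrate the idea of an objective that skips forces when they are absent, here is a minimal sketch in PyTorch. It is not the `descent` API. `energy_loss` and its arguments are hypothetical names, and comparing energies relative to their mean over conformers is an assumption made for this example.

```python
import torch


def energy_loss(energy_ref, energy_pred, forces_ref=None, forces_pred=None):
    """Hypothetical sum-of-squares loss; the force term is optional."""
    # Compare energies relative to their mean over conformers, so a constant
    # offset between reference and predicted energies does not contribute.
    delta_ref = energy_ref - energy_ref.mean()
    delta_pred = energy_pred - energy_pred.mean()
    loss = ((delta_pred - delta_ref) ** 2).sum()

    # Add the force term only when both sets of forces are available.
    if forces_ref is not None and forces_pred is not None:
        loss = loss + ((forces_pred - forces_ref) ** 2).sum()
    return loss


# Toy values for three conformers of one molecule (not taken from the dataset).
energy_ref = torch.tensor([0.0, 1.0, 3.0])
energy_pred = torch.tensor([10.0, 11.5, 12.5], requires_grad=True)

loss = energy_loss(energy_ref, energy_pred)  # energy-only objective
loss.backward()  # gradients flow back to the predicted energies
```

Passing `forces_ref` and `forces_pred` adds the force term back, so the same function covers datasets with and without forces.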