# Prepare a NAGL dataset for training

Training a graph convolutional network (GCN) requires a dataset of examples that the model should reproduce and interpolate between. This notebook describes how to prepare such a dataset for predicting partial charges.

## Imports

In [1]:
from pathlib import Path

from tqdm import tqdm

from openff.toolkit.topology import Molecule

from openff.nagl.storage.record import MoleculeRecord
from openff.nagl.storage import MoleculeStore

## Choosing our molecules

The simplest way to specify the molecules in our dataset is with SMILES, though [anything you can load](https://docs.openforcefield.org/projects/toolkit/en/stable/users/molecule_cookbook.html) into an OpenFF [`Molecule`] is fair game. For instance, with the [`Molecule.from_file()`] method you could load molecules from SDF files, along with any partial charges they carry. For this example, though, we'll have NAGL generate the charges, so we only need to provide the SMILES themselves:

[`Molecule`]: https://docs.openforcefield.org/projects/toolkit/en/stable/api/generated/openff.toolkit.topology.Molecule.html
[`Molecule.from_file()`]: https://docs.openforcefield.org/projects/toolkit/en/stable/api/generated/openff.toolkit.topology.Molecule.html#openff.toolkit.topology.Molecule.from_file

In [2]:
alkanes_smiles = Path("alkanes.smi").read_text().splitlines()
alkanes_smiles

['C',
 'CC',
 'CCC',
 'CCCC',
 'CC(C)C',
 'CCCCC',
 'CC(C)CC',
 'CCCCCC',
 'CC(C)CCC',
 'CC(CC)CC']
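
Reading the file with `read_text().splitlines()` passes blank lines, comments, and duplicates straight through. A minimal sketch of a more defensive loader follows. The `load_smiles` helper and the `#` comment convention are assumptions made for illustration; neither is part of NAGL or the OpenFF Toolkit.

```python
from pathlib import Path


def load_smiles(path):
    """Return unique, non-empty SMILES from a file, preserving order.

    Hypothetical helper: assumes one SMILES per line, optionally followed by
    whitespace and a molecule name, with comment lines starting with '#'.
    """
    seen = set()
    smiles_list = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        smiles = line.split()[0]  # drop any trailing molecule name
        if smiles not in seen:
            seen.add(smiles)
            smiles_list.append(smiles)
    return smiles_list
```

With this helper, `load_smiles("alkanes.smi")` would stand in for the `read_text().splitlines()` call above.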

## Generating charges

NAGL can generate AM1-BCC and AM1-Mulliken charges automatically with the OpenFF Toolkit. If you'd like a dataset of other charges, load them into the [`Molecule.partial_charges`] attribute and use the [`MoleculeRecord.from_precomputed_openff()`] method.

[`MoleculeRecord.from_precomputed_openff()`]: https://docs.openforcefield.org/projects/nagl/en/stable/api/generated/openff.nagl.storage.record.html#openff.nagl.storage.record.MoleculeRecord.from_precomputed_openff
[`Molecule.partial_charges`]: https://docs.openforcefield.org/projects/toolkit/en/stable/api/generated/openff.toolkit.topology.Molecule.html#openff.toolkit.topology.Molecule.partial_charges
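
When bringing your own charges, it can help to sanity-check them before attaching them to a molecule. The sketch below is plain Python and does not use the NAGL or OpenFF Toolkit APIs. The `check_charges` helper, its tolerance, and the methane charges are all hypothetical and only illustrate the idea that there should be one charge per atom and that the charges should sum to the molecule's net charge.

```python
import math


def check_charges(charges, n_atoms, total_charge=0.0, tol=1e-4):
    """Hypothetical sanity check for a list of precomputed partial charges.

    charges: per-atom partial charges, in units of elementary charge.
    n_atoms: number of atoms in the molecule, including hydrogens.
    total_charge: expected net charge of the molecule.
    """
    if len(charges) != n_atoms:
        raise ValueError(f"expected {n_atoms} charges, got {len(charges)}")
    net = math.fsum(charges)
    if abs(net - total_charge) > tol:
        raise ValueError(f"charges sum to {net:.4f}, expected {total_charge:.4f}")
    return net


# Methane with illustrative (not computed) charges: one carbon, four hydrogens
methane_charges = [-0.4, 0.1, 0.1, 0.1, 0.1]
net_charge = check_charges(methane_charges, n_atoms=5)
```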

In [3]:
records = [
    MoleculeRecord.from_openff(
        Molecule.from_smiles(smiles, allow_undefined_stereo=True),
        partial_charge_methods=["am1bcc", "am1"],
        generate_conformers=True,
        n_conformer_pool=500,  # Start with 500 conformers...
        n_conformers=10,  # ... and prune all but 10 (ELF10)
        rms_cutoff=0.05,  # Conformers in the initial pool must be at least this different
    )
    for smiles in tqdm(alkanes_smiles, desc="Labeling molecules")
]

Labeling molecules: 100%|███████████████████████████████████████████| 10/10 [00:01<00:00,  9.71it/s]
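
To give some intuition for what an RMS cutoff does to a conformer pool, here is a simplified sketch of greedy RMS pruning in plain Python. It is not how NAGL or the OpenFF Toolkit implement conformer generation: the `rms_distance` and `prune_by_rms` helpers are hypothetical, the coordinates are made up, and no structural alignment is performed before comparing conformers.

```python
import math


def rms_distance(conf_a, conf_b):
    """RMS distance between two conformers given as lists of (x, y, z) tuples
    in the same atom order. No alignment is performed (a simplification)."""
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(conf_a, conf_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(conf_a))


def prune_by_rms(conformers, rms_cutoff):
    """Greedily keep each conformer that is at least rms_cutoff away from
    every conformer kept so far."""
    kept = []
    for conf in conformers:
        if all(rms_distance(conf, other) >= rms_cutoff for other in kept):
            kept.append(conf)
    return kept


# Three made-up two-atom "conformers" that differ only in bond length
conf0 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
conf1 = [(0.0, 0.0, 0.0), (1.01, 0.0, 0.0)]  # nearly identical to conf0
conf2 = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]  # clearly different
pruned = prune_by_rms([conf0, conf1, conf2], rms_cutoff=0.05)
```

In this toy pool, `conf1` falls within the cutoff of `conf0` and is discarded, while `conf2` is kept. A smaller cutoff keeps more near-duplicate conformers in the pool; a larger one keeps fewer.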


## Storing the dataset

Finally, we'll save all the molecule records to a SQLite database file, which NAGL can use directly as a dataset via [`DGLMoleculeLightningDataModule`]:

[`DGLMoleculeLightningDataModule`]: https://docs.openforcefield.org/projects/nagl/en/stable/api/generated/openff.nagl.nn.dataset.html#openff.nagl.nn.dataset.DGLMoleculeLightningDataModule

In [39]:
output_store_file = Path("alkanes.sqlite")
# Start from a fresh file so records from an earlier run are not mixed in
if output_store_file.exists():
    output_store_file.unlink()

store = MoleculeStore(output_store_file)
store.store(records)

grouping records to store by InChI key: 100%|█████████████████████| 10/10 [00:00<00:00, 1008.63it/s]
storing grouped records: 100%|█████████████████████████████████████| 10/10 [00:00<00:00, 472.48it/s]
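
Because the store is an ordinary SQLite file, it can be inspected with Python's standard `sqlite3` module. The sketch below lists the tables in a database file. The `list_tables` helper is hypothetical, and nothing is assumed here about the table names or schema that NAGL actually writes.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file.

    Note: connecting to a path that does not exist creates an empty database.
    """
    with closing(sqlite3.connect(db_path)) as connection:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Calling `list_tables("alkanes.sqlite")` after the cell above should show whichever tables the store created.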
