# Generation process

This notebook documents the generation of a dataset of diverse fragment molecules with multiple Br atoms.

The molecules of `OpenFF ESP Fragment Conformers v1.0` were filtered to select all molecules with *multiple Cl atoms*. Every Cl atom was replaced with Br, and the resulting molecules were filtered again to remove any duplicates of molecules already in `OpenFF ESP Fragment Conformers v1.0`. The output of this filtering is stored in the `filtered-cl-esps-replaced-with-br.smi` file.
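
The filtering script itself is not part of this notebook. As a rough illustration only, the three steps described above (select molecules with at least two Cl atoms, swap Cl for Br, drop molecules already present) could be sketched at the SMILES-string level as below. The function name `select_and_swap` and the toy input list are hypothetical, and the duplicate check here compares raw strings; a real workflow would more likely compare canonical SMILES or InChIKeys produced by a cheminformatics toolkit.

```python
def select_and_swap(smiles_list, existing):
    """Keep SMILES with >= 2 Cl atoms, swap Cl -> Br, skip known molecules.

    Hypothetical helper for illustration; not the script used for the dataset.
    """
    seen = set(existing)
    selected = []
    for smi in smiles_list:
        # In SMILES the two-character token "Cl" always denotes chlorine
        if smi.count("Cl") < 2:
            continue
        swapped = smi.replace("Cl", "Br")
        # String-level duplicate check only (assumption: inputs are
        # already written in one consistent canonical form)
        if swapped in seen:
            continue
        seen.add(swapped)
        selected.append(swapped)
    return selected


# Toy stand-in for the parent dataset
parent = ["ClCCl", "CCCl", "ClC(Cl)Cl", "BrCBr"]
br_candidates = select_and_swap(parent, existing=parent)
print(br_candidates)
```

In this toy input, `ClCCl` is dropped because its Br analogue `BrCBr` is already in the parent set, and `CCCl` is dropped because it has only one Cl.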

In this notebook we generate ELF conformers for each molecule and create an `OptimizationDataset` to optimize their geometries.

## Imports

In [1]:
import openff.qcsubmit
import openff.toolkit
import openeye
import qcelemental
import qcportal

print("OpenFF QCSubmit:", openff.qcsubmit.__version__)
print("OpenFF Toolkit:", openff.toolkit.__version__)
print("OpenEye:", openeye.__version__)
print("QCElemental:", qcelemental.__version__)
print("QCPortal:", qcportal.__version__)

OpenFF QCSubmit: 0.50.0
OpenFF Toolkit: 0.14.4
OpenEye: 2023.1.1
QCElemental: 0.27.1
QCPortal: 0.51


In [2]:
import tqdm

from openff.units import unit

from openff.toolkit import Molecule
from openff.toolkit.utils import OpenEyeToolkitWrapper, ToolkitRegistry

from openff.qcsubmit.datasets import OptimizationDataset
from openff.qcsubmit.factories import OptimizationDatasetFactory

from qcelemental.models.results import WavefunctionProtocolEnum

## Setting up dataset

In [3]:
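# The factory is used here only to record toolkit provenance
# and, later, to build a unique index for each molecule
dataset_factory = OptimizationDatasetFactory()
provenance = dataset_factory.provenance(ToolkitRegistry([OpenEyeToolkitWrapper]))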

In [4]:
dataset = OptimizationDataset(
    dataset_name="OpenFF multi-Br ESP Fragment Conformers v1.0",
    dataset_tagline="HF/6-31G* conformers of diverse fragment molecules with multiple Br.",
    description=(
        "A dataset containing molecules from the "
        "`OpenFF ESP Fragment Conformers v1.0` with modifications. "
        "Molecules with multiple Cl atoms were selected from the "
        "dataset and the Cl atoms were replaced with Br.\n\n"
        "For each molecule, a set of up to 5 conformers were generated by:\n"
        "  * generating a set of up to 1000 conformers with a RMS cutoff of 0.5 Å "
        "using the OpenEye backend of the OpenFF toolkit\n"
        "  * applying ELF conformer selection (max 5 conformers) using OpenEye\n\n"
        "Each conformer will be converged according to the 'GAU_LOOSE' criteria."
    ),
    provenance=provenance
)
dataset.metadata.submitter = "lilyminium"
dataset.metadata.long_description_url = (
    "https://github.com/openforcefield/qca-dataset-submission/tree/master/"
    "submissions/"
    "2023-11-02-OpenFF-multi-Br-ESP-Fragment-Conformers-v1.0"
)

In [5]:
# Remove any existing QC specifications so that HF/6-31G* is the only one
dataset.clear_qcspecs()
dataset.add_qc_spec(
    program="psi4",
    method="hf",
    basis="6-31G*",
    spec_name="HF/6-31G*",
    spec_description="The standard HF/6-31G* basis used to derive RESP style charges.",
    store_wavefunction=WavefunctionProtocolEnum.orbitals_and_eigenvalues
)

## Loading input

In [6]:
with open("filtered-cl-esps-replaced-with-br.smi", "r") as f:
    br_smiles = [x.strip() for x in f.readlines()]
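
The list comprehension above keeps one entry per line, so a trailing blank line in the file would yield an empty SMILES string. A slightly more defensive reader, shown here only as a sketch, skips blank lines and keeps the first whitespace-separated field, on the assumption that a `.smi` file may carry an optional title after the SMILES. The name `read_smiles` and the demo file are hypothetical.

```python
import os
import tempfile


def read_smiles(path):
    """Read SMILES from a .smi file, skipping blank lines.

    Hypothetical helper: keeps only the first whitespace-separated
    field of each line, assuming anything after it is a title.
    """
    smiles = []
    with open(path, "r") as f:
        for line in f:
            fields = line.split()
            if not fields:  # blank or whitespace-only line
                continue
            smiles.append(fields[0])
    return smiles


# Demonstrate on a small temporary file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.smi")
    with open(demo_path, "w") as f:
        f.write("BrCBr\n\nBrC(Br)Br some-title\n")
    demo_smiles = read_smiles(demo_path)

print(demo_smiles)
```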

## Generating conformers

In [7]:
def generate_conformers(smiles: str) -> Molecule:
    wrapper = OpenEyeToolkitWrapper()
    
    mol = Molecule.from_smiles(
        smiles,
        allow_undefined_stereo=True,
        toolkit_registry=wrapper,
    )
    # generate max 1000 conformers with OpenEye
    mol.generate_conformers(
        n_conformers=1000,
        rms_cutoff=0.5 * unit.angstrom,
        toolkit_registry=wrapper,
    )
    
    # prune based on ELF method, max 5 conformers output
    mol.apply_elf_conformer_selection(
        percentage=2.0,
        limit=5,
        toolkit_registry=wrapper
    )
    
    assert mol.n_conformers > 0
    
    return mol

In [8]:
for smi in tqdm.tqdm(br_smiles, desc="generating conformers"):
    mol = generate_conformers(smi)
    dataset.add_molecule(
        dataset_factory.create_index(molecule=mol),
        mol
    )

generating conformers: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 611/611 [01:27<00:00,  6.99it/s]


## Exporting dataset

In [9]:
dataset.export_dataset("dataset.json.bz2")
dataset.molecules_to_file("dataset.smi", "smi")
dataset.visualize("dataset.pdf", columns=8)

print(dataset.qc_specifications)

{'HF/6-31G*': QCSpec(method='hf', basis='6-31G*', program='psi4', spec_name='HF/6-31G*', spec_description='The standard HF/6-31G* basis used to derive RESP style charges.', store_wavefunction=<WavefunctionProtocolEnum.orbitals_and_eigenvalues: 'orbitals_and_eigenvalues'>, implicit_solvent=None, maxiter=200, scf_properties=[<SCFProperties.Dipole: 'dipole'>, <SCFProperties.Quadrupole: 'quadrupole'>, <SCFProperties.WibergLowdinIndices: 'wiberg_lowdin_indices'>, <SCFProperties.MayerIndices: 'mayer_indices'>], keywords={})}


## Dataset information

In [10]:
import numpy as np
from collections import Counter

In [11]:
print("n_molecules:", dataset.n_molecules)
print("n_conformers:", dataset.n_records)

n_molecules: 611
n_conformers: 677


In [12]:
n_confs = np.array(
    [mol.n_conformers for mol in dataset.molecules]
)
n_heavy_atoms = np.array(
    [mol.to_rdkit().GetNumHeavyAtoms() for mol in dataset.molecules]
)

In [13]:
print(
    "Number of conformers (min, mean, max):",
    n_confs.min(), n_confs.mean(), n_confs.max()
)
print("# heavy atoms")
counts = Counter(n_heavy_atoms)
for n_heavy in sorted(counts):
    print(f"{str(n_heavy):>3}: {counts[n_heavy]}")

Number of conformers (min, mean, max): 1 1.1080196399345335 5
# heavy atoms
  4: 1
  5: 8
  6: 10
  7: 17
  8: 31
  9: 81
 10: 121
 11: 172
 12: 170


In [14]:
unique_charges = {
    mol.total_charge.m_as(unit.elementary_charge)
    for mol in dataset.molecules
}
unique_charges

{-4.0, -2.0, -1.0, 0.0, 1.0, 2.0}

In [15]:
masses = np.array([
    sum([atom.mass.m for atom in mol.atoms])
    for mol in dataset.molecules
])
print("MW (min, mean, max):", masses.min(), masses.mean(), masses.max())

MW (min, mean, max): 201.84508399999999 292.191026416203 466.5868909999999


In [16]:
elements = set(
    atom.symbol
    for mol in dataset.molecules
    for atom in mol.atoms
)
print(elements)

{'P', 'N', 'C', 'F', 'H', 'Br', 'O', 'S'}
