# OpenFF Additional Generated ChEMBL Optimizations 4.0

This notebook generates an additional optimization dataset from molecules selected for bond, angle, and improper parameters in Sage 2.2.1. The targeted parameters either have low coverage or are ones where, in a substantial number of existing records, the molecule minimizes to a value that differs from the QM value.
Input files are obtained from the process in the [fragment-chembl-data](https://github.com/lilyminium/fragment-chembl-data/commit/1a85d6b296350867a134529aa76930965a710a68) repo.

In [1]:
import zstandard
import qcportal
import pathlib

from openff.toolkit import Molecule, ForceField
import numpy as np
import tqdm

from openff.qcsubmit.utils import get_symmetry_classes, get_symmetry_group
from openff.qcsubmit import workflow_components
from openff.qcsubmit.factories import OptimizationDatasetFactory
from openff.qcsubmit.utils.visualize import molecules_to_pdf

In [2]:
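input_directory = pathlib.Path("inputs")
input_files = input_directory.glob("*.smi")
all_molecules = []
for file in tqdm.tqdm(input_files):
    # The parameter ID is the last hyphen-separated token of the file stem
    parameter_id = file.stem.split("-")[-1]
    molecules = Molecule.from_file(file, allow_undefined_stereo=True)
    # Molecule.from_file returns a bare Molecule for single-entry files
    if isinstance(molecules, Molecule):
        molecules = [molecules]
    for mol in molecules:
        # Clear any name read from the input file
        mol.name = None
    all_molecules.extend(molecules)
    molecules_to_pdf(molecules, f"inputs/{parameter_id}_molecules.pdf")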

68it [00:02, 31.89it/s]


In [3]:
len(all_molecules)

1880
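
The loading cell above takes the parameter ID from the last hyphen-separated token of each file stem. A minimal sketch of that convention follows. The file names are hypothetical (the actual contents of `inputs/` are not listed in this notebook), and `parameter_id_from_path` is an illustrative helper, not part of any OpenFF package.

```python
import pathlib
from collections import Counter

# Hypothetical file names; the real files in inputs/ may be named differently.
example_files = [
    pathlib.Path("inputs/selected-b15.smi"),
    pathlib.Path("inputs/selected-t141a.smi"),
    pathlib.Path("inputs/selected-i4.smi"),
]


def parameter_id_from_path(path: pathlib.Path) -> str:
    """Return the last hyphen-separated token of the file stem."""
    return path.stem.split("-")[-1]


parameter_ids = [parameter_id_from_path(path) for path in example_files]

# The first character of an ID gives the parameter type (a, b, i, or t).
type_counts = Counter(pid[0] for pid in parameter_ids)

print(parameter_ids)
print(dict(type_counts))
```

A tally like `type_counts` is one way to check that the files on disk match the parameter list given in the dataset description.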

In [4]:
dataset_factory = OptimizationDatasetFactory()
dataset_factory.add_workflow_components(
    workflow_components.StandardConformerGenerator(max_conformers=10)
)

description = """\
This optimization dataset adds more coverage for rare parameters, as listed in `inputs/`.
The parameters, as in Sage 2.2.1 are:
- a16, a36, a7
- b15, b23, b24, b29, b40, b42, b44, b47, b49, b54, b55, b59, b62, b63, b64, b65, b67, b69, b74, b76, b77, b78, b80, b81, b82
- i4, i5
- t101, t102, t103, t112, t113, t114, t12, t126, t128, t129, t136, t137, t138a, t141, t141a, t141c
- t154, t158, t164, t165, t167, t30, t31a, t33, t42a, t49, t54, t55, t59, t60, t61, t62, t7, t73, t8, t81, t88, t89.

Molecules were generated according to the process in https://github.com/lilyminium/fragment-chembl-data/commit/1a85d6b296350867a134529aa76930965a710a68
repo. In short, for torsions:

- ChEMBL molecules were split into "elementary" fragments without rotatable bonds
- Elementary fragments were combined with a single bond
- For each torsion, a pool of up to 5000 molecules were initially selected after sorting for low molecular weight
- Up to 250 molecules were selected from this pool by maximising chemical diversity, using the Tanimoto distance of the Morgan fingerprints
- Up to 10 molecules were selected from the second pool by maximising the diversity of coupled torsions through the central bond

For bonds, angles, and impropers:
- For each parameter, a pool of up to 10000 molecules were initially selected after sorting for low molecular weight
- Up to 500 molecules were selected from this pool by maximising chemical diversity, using the Tanimoto distance of the Morgan fingerprints
- Up to 50 molecules were selected from the second pool by maximising the diversity of other parameters applied to the atoms of each parameter


This dataset uses the OpenFF default level of theory (B3LYP-D3BJ/DZVP).
It covers the H, N, P, F, Br, C, I, Cl, S, O elements and -3.0, -2.0, -1.0, 0.0, 1.0, 2.0 charges.
Molecular MW ranges from 32 – 313 Da with mean MW of 126 Da.

"""

dataset = dataset_factory.create_dataset(
    dataset_name="OpenFF Additional Generated ChEMBL Optimizations 4.0",
    tagline="Additional optimizations curated from ChEMBL fragments for rare parameters",
    description=description,
    molecules=all_molecules,
)

dataset.metadata.submitter = "lilyminium"
dataset.metadata.long_description_url = (
    "https://github.com/openforcefield/qca-dataset-submission/tree/master/"
    "submissions/" + str(pathlib.Path.cwd().name)
)

Deduplication                 : 100%|█████| 1880/1880 [00:00<00:00, 2037.29it/s]
[15:31:03] UFFTYPER: Unrecognized charge state for atom: 2
[15:31:03] UFFTYPER: Unrecognized charge state for atom: 5
[15:31:03] UFFTYPER: Unrecognized charge state for atom: 1
[15:31:03] UFFTYPER: Unrecognized charge state for atom: 1
[15:31:03] UFFTYPER: Unrecognized charge state for atom: 5
[15:31:04] UFFTYPER: Unrecognized charge state for atom: 1
[15:31:04] UFFTYPER: Unrecognized charge state for atom: 1
[15:31:04] UFFTYPER: Unrecognized charge state for atom: 1
[15:31:04] UFFTYPER: Unrecognized atom type: S_5+4 (1)
[15:31:04] UFFTYPER: Unrecognized charge state for atom: 1
[15:31:04] UFFTYPER: Unrecognized charge state for atom: 1
[15:31:04] UFFTYPER: Unrecognized charge state for atom: 1
StandardConformerGenerator    :   8%|▋       | 145/1847 [00:04<00:25, 65.79it/s][15:31:04] UFFTYPER: Unrecognized charge state for atom: 1
[15:31:04] UFFTYPER: Unrecognized charge state for atom: 1
[15:31:04] UFFTYP

In [5]:
dataset.n_molecules

1844
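
The description string in cell 4 says molecules were chosen by maximising chemical diversity using the Tanimoto distance of Morgan fingerprints. The sketch below illustrates one common way to do this, greedy MaxMin picking. It is an assumption that the fragment-chembl-data repo works along these lines; this is not its implementation. Fingerprints are represented here as plain sets of on-bit indices so the example runs without a cheminformatics toolkit, and the function names are illustrative.

```python
def tanimoto_distance(fp_a, fp_b):
    """Return 1 - (shared on-bits / total on-bits) for two sets of on-bits."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0
    return 1.0 - len(fp_a & fp_b) / union


def maxmin_select(fingerprints, n_select):
    """Greedily pick indices so that each new pick is as far as possible
    from its nearest already-picked fingerprint (MaxMin heuristic)."""
    if not fingerprints or n_select <= 0:
        return []
    # Seed with the first entry, e.g. the lowest molecular weight molecule
    # if the pool was sorted beforehand (an assumption of this sketch).
    selected = [0]
    # min_dist[i] is the distance from i to its nearest selected fingerprint.
    min_dist = [tanimoto_distance(fp, fingerprints[0]) for fp in fingerprints]
    while len(selected) < min(n_select, len(fingerprints)):
        candidate = max(range(len(fingerprints)), key=lambda i: min_dist[i])
        if min_dist[candidate] == 0.0:
            break  # only duplicates of already-selected entries remain
        selected.append(candidate)
        for i, fp in enumerate(fingerprints):
            distance = tanimoto_distance(fp, fingerprints[candidate])
            min_dist[i] = min(min_dist[i], distance)
    return selected


# Toy fingerprints: sets of on-bit indices, not real Morgan fingerprints.
toy_fingerprints = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({7, 8, 9}),
    frozenset({1, 2, 3, 4}),
]
picked = maxmin_select(toy_fingerprints, 3)
print(picked)
```

With real data, the sets would be replaced by fingerprints from a toolkit such as RDKit, and `n_select` would correspond to the pool sizes quoted in the description (for example 250 or 500).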

In [6]:
print(description)

This optimization dataset adds more coverage for rare parameters, as listed in `inputs/`.
The parameters, as in Sage 2.2.1 are:
- a16, a36, a7
- b15, b23, b24, b29, b40, b42, b44, b47, b49, b54, b55, b59, b62, b63, b64, b65, b67, b69, b74, b76, b77, b78, b80, b81, b82
- i4, i5
- t101, t102, t103, t112, t113, t114, t12, t126, t128, t129, t136, t137, t138a, t141, t141a, t141c
- t154, t158, t164, t165, t167, t30, t31a, t33, t42a, t49, t54, t55, t59, t60, t61, t62, t7, t73, t8, t81, t88, t89.

Molecules were generated according to the process in https://github.com/lilyminium/fragment-chembl-data/commit/1a85d6b296350867a134529aa76930965a710a68
repo. In short, for torsions:

- ChEMBL molecules were split into "elementary" fragments without rotatable bonds
- Elementary fragments were combined with a single bond
- For each torsion, a pool of up to 5000 molecules were initially selected after sorting for low molecular weight
- Up to 250 molecules were selected from this pool by maximising chemi

In [7]:
# summarize dataset for readme
confs = np.array([len(mol.conformers) for mol in dataset.molecules])

print("* Number of unique molecules:", dataset.n_molecules)
# n_filtered counts the molecules removed by the workflow components while
# the dataset was being created.
print("* Number of filtered molecules:", dataset.n_filtered)
print("* Number of conformers:", sum(confs))
print(
    "* Number of conformers per molecule (min, mean, max): "
    f"{confs.min()}, {confs.mean():.2f}, {confs.max()}"
)

# Molecular weight of each molecule, summed over its atoms
masses = np.array(
    [
        sum(atom.mass.m for atom in molecule.atoms)
        for molecule in dataset.molecules
    ]
)
print(f"* Mean molecular weight: {masses.mean():.2f}")
print(f"* Min molecular weight: {masses.min():.2f}")
print(f"* Max molecular weight: {masses.max():.2f}")
print("* Charges:", sorted(set(m.total_charge.m for m in dataset.molecules)))


print("## Metadata")
print(f"* Elements: {{{', '.join(dataset.metadata.dict()['elements'])}}}")


fields = [
    "basis",
    "implicit_solvent",
    "keywords",
    "maxiter",
    "method",
    "program",
]
for spec, obj in dataset.qc_specifications.items():
    od = obj.dict()
    print("* Spec:", spec)
    for field in fields:
        print(f"\t * {field}: {od[field]}")
    print("\t* SCF properties:")
    for field in od["scf_properties"]:
        print(f"\t\t* {field}")


# export the dataset
dataset.export_dataset("dataset.json.bz2")
dataset.molecules_to_file("output.smi", "smi")
dataset.visualize("dataset.pdf", columns=8)

* Number of unique molecules: 1844
* Number of filtered molecules: 3
* Number of conformers: 2429
* Number of conformers per molecule (min, mean, max): 1, 1.32, 6
* Mean molecular weight: 125.99
* Min molecular weight: 32.05
* Max molecular weight: 312.99
* Charges: [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0]
## Metadata
* Elements: {F, Br, Cl, P, S, O, N, H, C, I}
* Spec: default
	 * basis: DZVP
	 * implicit_solvent: None
	 * keywords: {}
	 * maxiter: 200
	 * method: B3LYP-D3BJ
	 * program: psi4
	* SCF properties:
		* dipole
		* quadrupole
		* wiberg_lowdin_indices
		* mayer_indices
