# OpenFF Additional Generated ChEMBL TorsionDrives 4.0

This notebook generates additional TorsionDrive inputs for torsion parameters in Sage 2.2.1 that have little or no existing coverage in QCArchive.
Input molecules come from the process in the [fragment-chembl-data](https://github.com/lilyminium/fragment-chembl-data/commit/d94c9fc4be6945c4e680b2bb5b31a2693c4b772b) repo. Low-coverage torsions are identified using the [profile-qc-data](https://github.com/lilyminium/profile-qc-data) repo.

In [1]:
import zstandard
import qcportal
import pathlib

from openff.toolkit import Molecule, ForceField
import numpy as np
import tqdm

from openff.qcsubmit.utils import get_symmetry_classes, get_symmetry_group
from openff.qcsubmit.workflow_components import TorsionIndexer
from openff.qcsubmit import workflow_components
from openff.qcsubmit.factories import TorsiondriveDatasetFactory
from openff.qcsubmit.utils.visualize import molecules_to_pdf

In [2]:
def load_molecules(file: str, parameter_id: str) -> list[Molecule]:
    """Load SMILES from file and assign dihedrals to rotate around by pattern"""
    case_molecules = []
    forcefield = ForceField("inputs/openff_unconstrained-2.2.1.offxml")

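    molecules = Molecule.from_file(file, allow_undefined_stereo=True)
    # from_file returns a bare Molecule when the file holds a single entry
    if isinstance(molecules, Molecule):
        molecules = [molecules]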

    for mol in molecules:
        
        unique_central_bonds = set()
        torsion_indexer = TorsionIndexer()
        symmetry_classes = get_symmetry_classes(mol)

        labels = forcefield.label_molecules(mol.to_topology())[0]["ProperTorsions"]
        for (i, j, k, l), parameter in labels.items():
            if parameter.id != parameter_id:
                continue
            central_bond = tuple(sorted([j, k]))
            # skip torsions whose central bond is part of a ring
            if mol.get_bond_between(j, k).is_in_ring():
                continue
                
            symmetry_group = get_symmetry_group(central_bond, symmetry_classes)
            if central_bond in unique_central_bonds:
                continue
                
            unique_central_bonds.add(central_bond)
            torsion_indexer.add_torsion((i, j, k, l), symmetry_group, (-165, 180))

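        # every input molecule is expected to contain the target torsion
        assert len(torsion_indexer.torsions), (
            f"no non-ring {parameter_id} torsion found in {mol.to_smiles()}"
        )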
        mol.properties["dihedrals"] = torsion_indexer
        case_molecules.append(mol)
    return case_molecules


def visualize(mols, filename):
    """Draw output molecules as PDF"""
    new_mols = []
    for mol in mols:
        for val in mol.properties["dihedrals"].torsions.values():
            new_mol = Molecule(mol)
            new_mol.properties["dihedrals"] = val.get_dihedrals
            new_mols.append(new_mol)
    molecules_to_pdf(new_mols, filename)
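
The loop in `load_molecules` keeps only the first labelled torsion for each central bond, so a rotatable bond is driven once even when several atom quartets share it. A minimal sketch of that de-duplication step in plain Python, without the toolkit, is shown below. The helper name `first_torsion_per_central_bond` and the atom indices are illustrative assumptions, not part of the notebook.

```python
def first_torsion_per_central_bond(torsions):
    """Keep the first (i, j, k, l) quartet seen for each central bond (j, k)."""
    seen = set()
    kept = []
    for i, j, k, l in torsions:
        # sort so that (j, k) and (k, j) count as the same bond
        central_bond = tuple(sorted((j, k)))
        if central_bond in seen:
            continue
        seen.add(central_bond)
        kept.append((i, j, k, l))
    return kept


# hypothetical atom indices: three quartets share the 1-2 bond, one uses 2-3
torsions = [(0, 1, 2, 3), (4, 1, 2, 5), (6, 2, 1, 0), (1, 2, 3, 7)]
kept = first_torsion_per_central_bond(torsions)
print(kept)  # [(0, 1, 2, 3), (1, 2, 3, 7)]
```

Sorting the central bond is what makes the reversed quartet `(6, 2, 1, 0)` collapse onto the same entry as `(0, 1, 2, 3)`.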

In [3]:
input_directory = pathlib.Path("inputs")
input_files = input_directory.glob("*.smi")
all_molecules = []
for file in tqdm.tqdm(input_files):
    parameter_id = file.stem.split("-")[-1]
    molecules = load_molecules(file, parameter_id)
    all_molecules.extend(molecules)
    visualize(molecules, f"inputs/{parameter_id}_molecules.pdf")

29it [00:08,  3.28it/s]
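
The cell above takes the parameter ID from the last hyphen-separated token of each input file name. A small sketch of that convention follows. The file name `selected-t123.smi` and the helper `parameter_id_from_path` are made up for illustration; the real names are whatever is present in `inputs/`.

```python
import pathlib


def parameter_id_from_path(path):
    """Return the last hyphen-separated token of the file stem."""
    return pathlib.Path(path).stem.split("-")[-1]


# hypothetical file name, used only to show the parsing
example = pathlib.Path("inputs/selected-t123.smi")
print(parameter_id_from_path(example))  # t123
```

A file name without any hyphen yields the whole stem, so the convention only works when the parameter ID is the final token.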


In [4]:
len(all_molecules)

290

In [5]:
dataset_factory = TorsiondriveDatasetFactory()
dataset_factory.add_workflow_components(
    workflow_components.StandardConformerGenerator(max_conformers=5)
)

description = """\
Molecules were curated to add more coverage for rare torsions, as listed in `inputs/`.
Rare torsions that run through a ring were not included.
Molecules were generated according to the process in https://github.com/lilyminium/fragment-chembl-data/commit/d94c9fc4be6945c4e680b2bb5b31a2693c4b772b
repo. In short:

- ChEMBL molecules were split into "elementary" fragments without rotatable bonds
- Elementary fragments were combined with a single bond
- For each torsion, a pool of up to 5000 molecules were initially selected after sorting for low molecular weight
- Up to 250 molecules were selected from this pool by maximising chemical diversity, using the Tanimoto distance of the Morgan fingerprints
- Up to 10 molecules were selected from the second pool by maximising the diversity of coupled torsions through the central bond

"""

dataset = dataset_factory.create_dataset(
    dataset_name="OpenFF Additional Generated ChEMBL TorsionDrives 4.0",
    tagline="Additional TorsionDrives curated from ChEMBL fragments for rare torsions",
    description=description,
    molecules=all_molecules,
)

dataset.metadata.submitter = "lilyminium"
dataset.metadata.long_description_url = (
    "https://github.com/openforcefield/qca-dataset-submission/tree/master/"
    "submissions/" + str(pathlib.Path.cwd().name)
)

Deduplication                 : 100%|███████| 290/290 [00:00<00:00, 1654.10it/s]
[17:41:29] UFFTYPER: Unrecognized atom type: S_5+4 (4)
[17:41:29] UFFTYPER: Unrecognized atom type: S_5+4 (0)
[17:41:29] UFFTYPER: Unrecognized charge state for atom: 1
[17:41:29] UFFTYPER: Unrecognized atom type: S_5+4 (1)
[17:41:29] UFFTYPER: Unrecognized charge state for atom: 3
[17:41:29] UFFTYPER: Unrecognized charge state for atom: 1
[17:41:29] UFFTYPER: Unrecognized atom type: S_5+4 (0)
[17:41:30] UFFTYPER: Unrecognized atom type: S_5+4 (1)
[17:41:30] UFFTYPER: Unrecognized charge state for atom: 1
[17:41:31] UFFTYPER: Unrecognized charge state for atom: 1
[17:41:31] UFFTYPER: Unrecognized charge state for atom: 4
[17:41:31] UFFTYPER: Unrecognized charge state for atom: 4
[17:41:31] UFFTYPER: Unrecognized charge state for atom: 1
[17:41:31] UFFTYPER: Unrecognized charge state for atom: 2
[17:41:31] UFFTYPER: Unrecognized charge state for atom: 1
[17:41:31] UFFTYPER: Unrecognized charge state for ato

In [6]:
print(description)

Molecules were curated to add more coverage for rare torsions, as listed in `inputs/`.
Rare torsions that run through a ring were not included.
Molecules were generated according to the process in https://github.com/lilyminium/fragment-chembl-data/commit/d94c9fc4be6945c4e680b2bb5b31a2693c4b772b
repo. In short:

- ChEMBL molecules were split into "elementary" fragments without rotatable bonds
- Elementary fragments were combined with a single bond
- For each torsion, a pool of up to 5000 molecules were initially selected after sorting for low molecular weight
- Up to 250 molecules were selected from this pool by maximising chemical diversity, using the Tanimoto distance of the Morgan fingerprints
- Up to 10 molecules were selected from the second pool by maximising the diversity of coupled torsions through the central bond
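
The diversity-selection steps in the description can be illustrated with a greedy MaxMin picker. This is only a sketch under stated assumptions: the actual selection code lives in the fragment-chembl-data repo and may use a different algorithm, fingerprints are represented here as plain sets of on-bit indices rather than real Morgan fingerprints, and the names `tanimoto_distance` and `maxmin_pick` are hypothetical.

```python
def tanimoto_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def maxmin_pick(fingerprints, n_pick, first=0):
    """Greedily pick indices, each time taking the candidate that is
    farthest from its nearest already-picked neighbour."""
    if n_pick <= 0 or not fingerprints:
        return []
    picked = [first]
    # distance from every candidate to its nearest picked fingerprint
    min_dist = [tanimoto_distance(fp, fingerprints[first]) for fp in fingerprints]
    while len(picked) < min(n_pick, len(fingerprints)):
        candidate = max(
            (i for i in range(len(fingerprints)) if i not in picked),
            key=lambda i: min_dist[i],
        )
        picked.append(candidate)
        for i, fp in enumerate(fingerprints):
            d = tanimoto_distance(fp, fingerprints[candidate])
            min_dist[i] = min(min_dist[i], d)
    return picked


# toy fingerprints: index 2 shares no bits with the others
fingerprints = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
selected = maxmin_pick(fingerprints, n_pick=2)
print(selected)  # [0, 2]
```

With the limits quoted in the description, a picker of this kind would be called with `n_pick=250` on the pool of up to 5000 molecules.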




In [7]:
# summarize dataset for readme
confs = np.array([len(mol.conformers) for mol in dataset.molecules])

print("* Number of unique molecules:", dataset.n_molecules)
# With multiple torsions per unique molecule, n_molecules * confs.mean() no
# longer equals the number of conformers. Instead, the number of dihedrals *
# confs.mean() should equal the number of conformers. The dataset contains one
# record per driven torsion (rather than combining multiple dihedrals into the
# same record), so n_records is the same as manually adding up len(dihedrals)
# for each record.
print("* Number of driven torsions:", dataset.n_records)
print("* Number of filtered molecules:", dataset.n_filtered)
print("* Number of conformers:", sum(confs))
print(
    "* Number of conformers per molecule (min, mean, max): "
    f"{confs.min()}, {confs.mean():.2f}, {confs.max()}"
)

# total mass of each molecule in the dataset
masses = [
    sum(atom.mass.m for atom in molecule.atoms)
    for molecule in dataset.molecules
]
print(f"* Mean molecular weight: {np.mean(np.array(masses)):.2f}")
print(f"* Max molecular weight: {np.max(np.array(masses)):.2f}")
print("* Charges:", sorted(set(m.total_charge.m for m in dataset.molecules)))


print("## Metadata")
print(f"* Elements: {{{', '.join(dataset.metadata.dict()['elements'])}}}")


fields = [
    "basis",
    "implicit_solvent",
    "keywords",
    "maxiter",
    "method",
    "program",
]
for spec, obj in dataset.qc_specifications.items():
    od = obj.dict()
    print("* Spec:", spec)
    for field in fields:
        print(f"\t * {field}: {od[field]}")
    print("\t* SCF properties:")
    for field in od["scf_properties"]:
        print(f"\t\t* {field}")


# export the dataset
dataset.export_dataset("dataset.json.bz2")
dataset.molecules_to_file("output.smi", "smi")
dataset.visualize("dataset.pdf", columns=8)

* Number of unique molecules: 270
* Number of driven torsions: 275
* Number of filtered molecules: 17
* Number of conformers: 352
* Number of conformers per molecule (min, mean, max): 1, 1.28, 4
* Mean molecular weight: 124.98
* Max molecular weight: 312.99
* Charges: [-1.0, 0.0, 1.0, 2.0]
## Metadata
* Elements: {O, Cl, Br, C, I, P, F, H, N, S}
* Spec: default
	 * basis: DZVP
	 * implicit_solvent: None
	 * keywords: {}
	 * maxiter: 200
	 * method: B3LYP-D3BJ
	 * program: psi4
	* SCF properties:
		* dipole
		* quadrupole
		* wiberg_lowdin_indices
		* mayer_indices
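
The printed summary is consistent with the comment in the cell above: the mean conformer count matches the conformer total divided by the number of driven torsions, not by the number of unique molecules. A quick arithmetic check using only the numbers printed above:

```python
# values copied from the summary output above
n_molecules = 270
n_records = 275
n_conformers = 352

mean_per_record = n_conformers / n_records
mean_per_molecule = n_conformers / n_molecules

print(f"{mean_per_record:.2f}")    # 1.28, the reported mean
print(f"{mean_per_molecule:.2f}")  # 1.30, which does not match
```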
