# TorsionNet500 Re-optimization TorsionDrives v4.0

This notebook generates a `TorsiondriveDataset` based on the [existing TorsionNet500 dataset](submissions/2021-11-09-TorsionNet500-single-points), which only computed single-point energies and whose geometries were originally optimized at a different level of theory. This submission uses OpenFF's default level of theory.

In [1]:
from rich.pretty import pprint
import numpy
import pathlib

from openff.qcsubmit.common_structures import QCSpec, SCFProperties
from openff.toolkit import Molecule
from openff.qcsubmit.utils import get_symmetry_classes, get_symmetry_group
from openff.qcsubmit.factories import TorsiondriveDatasetFactory
from openff.qcsubmit.workflow_components.utils import TorsionIndexer, SingleTorsion

In [2]:
all_molecules = Molecule.from_file(
    "TorsionNet500_qm_opt_geometries.sdf",
    file_format="sdf",
    allow_undefined_stereo=True,
)

assert len(all_molecules) == 12000, len(all_molecules)

assert len({molecule.to_inchikey() for molecule in all_molecules}) == 500

In [3]:
def deduplicate_molecules(all_molecules: list[Molecule]) -> list[Molecule]:
    """
    Given the entire TorsionNet500 dataset, 'de-duplicate' such that
    * only one molecule per scan is returned
    * the lowest-energy conformer (as reported in the SDF) is returned
    """
    returned_molecules = list()

    n_unique_molecules = len({molecule.to_inchikey() for molecule in all_molecules})

    for index in range(n_unique_molecules):
        molecules = all_molecules[24 * index : 24 * (index + 1)]

        assert len({molecule.to_inchikey() for molecule in molecules}) == 1, (
            "Molecules in this group of 24 not identical"
        )

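        # use the energy-minimum molecule as a starting point for this scan;
        # the constant offset shifts every energy equally, so it does not
        # change which conformer is lowest
        energies = [
            float(molecule.properties["Energy"]) - -351878.3514795939
            for molecule in molecules
        ]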
        min_energy_molecule = molecules[numpy.argmin(energies)]

        # 1-indexed in file, needs to be 0-indexed for OpenFF/QC* software
        torsion_atoms = tuple(
            [
                # renamed from `index` to avoid shadowing the outer loop variable
                int(atom_index) - 1
                for atom_index in min_energy_molecule.properties[
                    "TORSION_ATOMS_FRAGMENT"
                ].split(" ")
            ]
        )

        central_bond = (torsion_atoms[1], torsion_atoms[2])
        symmetry_classes = get_symmetry_classes(min_energy_molecule)
        symmetry_group = get_symmetry_group(central_bond, symmetry_classes)

        min_energy_molecule.properties["dihedrals"] = TorsionIndexer(
            torsions={
                central_bond: SingleTorsion(
                    torsion1=torsion_atoms,
                    scan_range=(-165, 180),
                    scan_increment=[15],
                    symmetry_group1=symmetry_group,
                )
            }
        )

        returned_molecules.append(min_energy_molecule)

    return returned_molecules
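
The selection logic in `deduplicate_molecules` can be illustrated without the OpenFF toolkit. The following is a minimal sketch, assuming the records arrive in consecutive blocks of equal size, one block per scan. The helper name `pick_lowest_energy_per_block`, the plain-dict records, and the toy energies are all hypothetical and exist only for this illustration.

```python
def pick_lowest_energy_per_block(records, block_size=24):
    """Return the lowest-energy record from each consecutive block.

    Hypothetical helper: `records` is a list of dicts with "inchikey" and
    "energy" keys, standing in for the molecules read from the SDF.
    """
    if len(records) % block_size != 0:
        raise ValueError("number of records is not a multiple of block_size")

    selected = []
    for start in range(0, len(records), block_size):
        block = records[start : start + block_size]

        # every record in a block must describe the same molecule
        if len({record["inchikey"] for record in block}) != 1:
            raise ValueError(f"block starting at {start} mixes molecules")

        selected.append(min(block, key=lambda record: record["energy"]))

    return selected


# toy data: two "molecules" with three conformers each
toy_records = [
    {"inchikey": "A", "energy": -1.0},
    {"inchikey": "A", "energy": -3.0},
    {"inchikey": "A", "energy": -2.0},
    {"inchikey": "B", "energy": 0.5},
    {"inchikey": "B", "energy": 0.1},
    {"inchikey": "B", "energy": 0.9},
]

lowest = pick_lowest_energy_per_block(toy_records, block_size=3)
print([(record["inchikey"], record["energy"]) for record in lowest])
# prints [('A', -3.0), ('B', 0.1)]
```

Like the notebook's function, this relies on the input ordering: if the conformers of one scan were not contiguous, the identity check would raise rather than silently pick from a mixed block.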

In [4]:
deduplicated_molecules = deduplicate_molecules(all_molecules)

assert len(deduplicated_molecules) == 500, len(deduplicated_molecules)
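
The conversion of the `TORSION_ATOMS_FRAGMENT` field from 1-indexed to 0-indexed atom indices, and the choice of the two middle atoms as the central bond, can be sketched in isolation. The helper name `parse_torsion_atoms` and the field value `"3 5 7 9"` are made up for illustration and are not taken from the SDF.

```python
def parse_torsion_atoms(field: str) -> tuple[int, ...]:
    """Convert a space-separated, 1-indexed atom list to 0-indexed integers."""
    indices = tuple(int(token) - 1 for token in field.split(" "))
    if len(indices) != 4:
        raise ValueError(f"a torsion needs four atoms, got {len(indices)}")
    return indices


# hypothetical field value, as it might appear in an SDF property
torsion = parse_torsion_atoms("3 5 7 9")

# the rotatable (central) bond is formed by the two middle atoms
central_bond = (torsion[1], torsion[2])

print(torsion, central_bond)
# prints (2, 4, 6, 8) (4, 6)
```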

In [5]:
dataset_factory = TorsiondriveDatasetFactory(
    qc_specifications={
        "default": QCSpec(
            spec_description="Standard OpenFF optimization quantum chemistry specification with `mbis_charges` and `lowdin_charges` keywords added.",
            scf_properties=[
                SCFProperties.Dipole,
                SCFProperties.Quadrupole,
                SCFProperties.MBISCharges,
                SCFProperties.LowdinCharges,
                SCFProperties.WibergLowdinIndices,
                SCFProperties.MayerIndices,
            ],
        )
    }
)
pprint(dataset_factory)

In [6]:
description = """\
The TorsionNet500 (TN500) dataset is the input.
TN500 was originally optimized to a different level of theory than OpenFF's default.

It is available at https://github.com/pfizer-opensource/TorsionNet/blob/main/data/TorsionNet500_qm_opt_geometries.sdf/.

For each scan (of 24 conformers), the lowest-energy conformer (as reported in the SDF) was used as a starting point for a TorsionDrive workflow.
The same scan range/increment as the original TorsionNet500 dataset was used, -165 to 180 degrees in increments of 15 degrees.

This dataset uses the OpenFF default level of theory (B3LYP-D3BJ/DZVP).
It covers the N, Cl, O, S, H, C, F elements and 0.0 charges.
Molecular MW ranges from 70.13 - 268.74 Da with mean MW of 183.52 Da.

"""

In [7]:
dataset = dataset_factory.create_dataset(
    dataset_name="TorsionNet500 Re-optimization TorsionDrives v4.0",
    molecules=deduplicated_molecules,
    tagline="TorsionNet500 TorsionDrives re-optimized with OpenFF default spec",
    description=description,
    verbose=True,
)

dataset.metadata.submitter = "mattwthompson"
dataset.metadata.long_description_url = (
    "https://github.com/openforcefield/qca-dataset-submission/tree/master/"
    "submissions/" + str(pathlib.Path.cwd().name)
)

Deduplication                 : 100%|███████| 500/500 [00:00<00:00, 2027.88it/s]
Preparation                   : 100%|████████| 500/500 [00:04<00:00, 116.73it/s]
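
The dataset description states a scan from -165 to 180 degrees in increments of 15 degrees, which matches the 24 conformers per scan in the input file. The arithmetic can be checked with a short sketch. This only enumerates the grid implied by those three numbers; it is not a claim about how TorsionDrive builds its grid internally.

```python
scan_start, scan_stop, scan_step = -165, 180, 15

# both endpoints are included, so extend the stop by one degree
grid = list(range(scan_start, scan_stop + 1, scan_step))

print(len(grid), grid[0], grid[-1])
# prints 24 -165 180
```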


In [8]:
assert dataset.n_molecules > 0, (
    f"Ended with {dataset.n_molecules=}, {dataset.n_filtered=} were filtered out"
)

In [9]:
# summarize dataset for readme
confs = numpy.array([len(mol.conformers) for mol in dataset.molecules])

print("* Number of unique molecules:", dataset.n_molecules)
# With multiple torsions per unique molecule, n_molecules * confs.mean() no
# longer equals the number of conformers. Instead, the number of dihedrals *
# confs.mean() should equal the number of conformers. The dataset contains one
# record per driven torsion (rather than combining multiple dihedrals into the
# same record), so n_records is the same as manually adding up len(dihedrals)
# for each record.
print("* Number of driven torsions:", dataset.n_records)
print("* Number of filtered molecules:", dataset.n_filtered)
print("* Number of conformers:", sum(confs))
print(
    "* Number of conformers per molecule (min, mean, max): "
    f"{confs.min()}, {confs.mean():.2f}, {confs.max()}"
)

masses = numpy.array(
    [sum(atom.mass.m for atom in molecule.atoms) for molecule in dataset.molecules]
)
print(f"* Mean molecular weight: {masses.mean():.2f}")
print(f"* Min molecular weight: {masses.min():.2f}")
print(f"* Max molecular weight: {masses.max():.2f}")
print("* Charges:", sorted(set(m.total_charge.m for m in dataset.molecules)))


print("## Metadata")
print(f"* Elements: {{{', '.join(dataset.metadata.dict()['elements'])}}}")


fields = [
    "basis",
    "implicit_solvent",
    "keywords",
    "maxiter",
    "method",
    "program",
]
for spec, obj in dataset.qc_specifications.items():
    od = obj.dict()
    print("* Spec:", spec)
    for field in fields:
        print(f"\t * {field}: {od[field]}")
    print("\t* SCF properties:")
    for field in od["scf_properties"]:
        print(f"\t\t* {field}")

* Number of unique molecules: 500
* Number of driven torsions: 500
* Number of filtered molecules: 0
* Number of conformers: 500
* Number of conformers per molecule (min, mean, max): 1, 1.00, 1
* Mean molecular weight: 183.52
* Min molecular weight: 70.13
* Max molecular weight: 268.74
* Charges: [0.0]
## Metadata
* Elements: {S, Cl, C, N, H, F, O}
* Spec: default
	 * basis: DZVP
	 * implicit_solvent: None
	 * keywords: {}
	 * maxiter: 200
	 * method: B3LYP-D3BJ
	 * program: psi4
	* SCF properties:
		* dipole
		* quadrupole
		* mbis_charges
		* lowdin_charges
		* wiberg_lowdin_indices
		* mayer_indices


In [10]:
dataset.visualize("dataset.pdf", columns=8)

In [11]:
dataset.molecules_to_file("dataset.smi", "smi")

In [12]:
dataset.metadata

Metadata(submitter='mattwthompson', creation_date=datetime.date(2026, 2, 17), collection_type='TorsionDriveDataset', dataset_name='TorsionNet500 Re-optimization TorsionDrives v4.0', short_description='TorsionNet500 TorsionDrives re-optimized with OpenFF default spec', long_description_url=HttpUrl('https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2026-02-12-TorsionNet500-Re-optimization-TorsionDrives-v4.0', ), long_description="The TorsionNet500 (TN500) dataset is the input.\nTN500 was originally optimized to a different level of theory than OpenFF's default.\n\nIt is available at https://github.com/pfizer-opensource/TorsionNet/blob/main/data/TorsionNet500_qm_opt_geometries.sdf/.\n\nFor each scan (of 24 conformers), the lowest-energy conformer (as reported in the SDF) was used as a starting point for a TorsionDrive workflow.\nThe same scan range/increment as the original TorsionNet500 dataset was used, -165 to 180 degrees in increments of 15 degrees.\n\nT

In [13]:
dataset.export_dataset("dataset.json.bz2")
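
To inspect an exported file without QCSubmit, the standard library may be enough. The sketch below assumes, based only on the `.json.bz2` extension, that the export is bzip2-compressed JSON; `load_bz2_json` is a hypothetical helper, and the round trip uses a stand-in file rather than the real `dataset.json.bz2`.

```python
import bz2
import json
import pathlib
import tempfile


def load_bz2_json(path):
    """Read a bzip2-compressed JSON file into Python objects."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# round trip with a small stand-in payload (not the real dataset)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = pathlib.Path(tmp_dir) / "demo.json.bz2"
    with bz2.open(demo_path, "wt", encoding="utf-8") as handle:
        json.dump({"dataset_name": "demo", "n_records": 2}, handle)

    loaded = load_bz2_json(demo_path)

print(sorted(loaded))
# prints ['dataset_name', 'n_records']
```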