# OpenFF NSP Optimization Set 2 Sulfur v4.0

This notebook generates the optimization dataset for the sulfur subset of the NSP (nitrogen, sulfur, phosphorus) molecule set curated from PubChem.

In [1]:
import pathlib
from pprint import pprint
from openff.toolkit import Molecule
import numpy as np

from openff.qcsubmit import workflow_components
from openff.qcsubmit.factories import OptimizationDatasetFactory
from openff.qcsubmit.utils.visualize import molecules_to_pdf


In [2]:
input_file = 'set2-S-smiles-pka-normalized.smi'
molecules = Molecule.from_file(input_file, allow_undefined_stereo=True)
molecules_to_pdf(molecules, "dataset.pdf")

In [3]:
len(molecules)

1235
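
As a toolkit-free sanity check on the input, the number of records in the `.smi` file can be counted directly and compared with `len(molecules)`. The sketch below assumes one record per line (a SMILES string, optionally followed by whitespace and a name); that layout is an assumption about `set2-S-smiles-pka-normalized.smi`, and `count_smiles_records` is a hypothetical helper, not part of the OpenFF toolkit.

```python
from pathlib import Path


def count_smiles_records(path):
    """Count non-empty lines in a .smi file.

    Assumes one record per line: a SMILES string, optionally followed by
    whitespace and a name. Blank and whitespace-only lines are ignored.
    """
    count = 0
    for line in Path(path).read_text().splitlines():
        if line.strip():
            count += 1
    return count
```

Calling `count_smiles_records(input_file)` should agree with `len(molecules)` only if every line is a record the toolkit can parse; a mismatch would point to skipped or malformed lines.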

In [4]:
dataset_factory = OptimizationDatasetFactory()
dataset_factory.add_workflow_components(
    workflow_components.StandardConformerGenerator(max_conformers=10)
)

description = """\
Optimization dataset to probe coverage of nitrogen, sulfur, and phosphorus (i.e., NSP) functional groups in general, and sulfur in particular. Molecules were curated from PubChem datasets, retaining those with HAC < 40 and applying additional filters; the preprocessing steps are detailed at https://github.com/pavankum/NSP_sets. This dataset includes a broader set of molecules, many of which may not be drug-like, but are informative for differentiating force field parameter ranges. In addition to the default OpenFF QC specification (B3LYP-D3BJ/DZVP), a SPICE-level QC specification (ωB97M-D3BJ/def2-TZVPPD) is included, since many molecules have charged states. Set 2 prioritizes as many varied charged states as possible with valid valences for the atoms. Some of these may include charged states that get protonated immediately in solvent, but they were retained for experiments to check the quality of charge equilibration in MLIPs as some of these may be used as reference for bespoke torsion fits.

- Number of unique molecules: 1213
- Number of filtered molecules: 0
- Number of conformers: 8018
- Number of conformers per molecule (min, mean, max): 1, 6.61, 10
- Mean molecular weight: 293.47
- Min molecular weight: 86.16
- Max molecular weight: 597.88
- Charges: [-6.0, -5.0, -4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
## Metadata
- Elements: {O, P, I, C, F, N, Br, S, H, Cl}
- Spec: default
	 - basis: DZVP
	 - implicit_solvent: None
	 - keywords: {}
	 - maxiter: 200
	 - method: B3LYP-D3BJ
	 - program: psi4
	- SCF properties:
		- dipole
		- quadrupole
		- wiberg_lowdin_indices
		- mayer_indices
		- lowdin_charges
		- mulliken_charges
- Spec: WB97M-D3BJ/def2-TZVPPD
	 - basis: def2-TZVPPD
	 - implicit_solvent: None
	 - keywords: {}
	 - maxiter: 200
	 - method: WB97M-D3BJ
	 - program: psi4
	- SCF properties:
		- dipole
		- quadrupole
		- wiberg_lowdin_indices
		- mayer_indices
		- lowdin_charges
		- mulliken_charges
"""

dataset = dataset_factory.create_dataset(
    dataset_name="OpenFF NSP Optimization Set 2 Sulfur v4.0",
    tagline="Molecules curated from PubChem for Sulfur",
    description=description,
    molecules=molecules,
)

dataset.metadata.submitter = "pavankum"
dataset.metadata.long_description_url = (
    "https://github.com/openforcefield/qca-dataset-submission/tree/master/"
    "submissions/" + str(pathlib.Path.cwd().name)
)

Deduplication                 : 100%|██████| 1235/1235 [00:03<00:00, 389.39it/s]
StandardConformerGenerator    :   3%|▎        | 38/1214 [00:08<02:06,  9.33it/s][13:38:04] UFFTYPER: Unrecognized charge state for atom: 8
[13:38:04] UFFTYPER: Unrecognized charge state for atom: 21
[13:42:38] UFFTYPER: Unrecognized charge state for atom: 1
[13:42:38] UFFTYPER: Unrecognized charge state for atom: 1
[13:42:39] UFFTYPER: Unrecognized charge state for atom: 12
[13:42:39] UFFTYPER: Unrecognized charge state for atom: 7
[13:42:40] UFFTYPER: Unrecognized charge state for atom: 2
[13:42:40] UFFTYPER: Unrecognized charge state for atom: 6
[13:42:40] UFFTYPER: Unrecognized charge state for atom: 10
[13:42:40] UFFTYPER: Unrecognized charge state for atom: 14
[13:42:40] UFFTYPER: Unrecognized charge state for atom: 1
[13:42:40] UFFTYPER: Unrecognized charge state for atom: 13
[13:42:40] UFFTYPER: Unrecognized charge state for atom: 19
[13:42:40] UFFTYPER: Unrecognized charge state for atom: 6
[13:42:

In [5]:
dataset.n_molecules

1213

Add a new QC specification to the factory, which will be applied to the dataset.

    Parameters:
        method: The name of the method to use, e.g. B3LYP-D3BJ.
        basis: The name of the basis set to use; can also be `None`.
        program: The name of the program that executes the computation.
        spec_name: The name the spec should be stored under.
        spec_description: The description of the spec.
        store_wavefunction: Which parts of the wavefunction should be saved.
        overwrite: If a spec is already stored under this name, overwrite it.
        implicit_solvent: The implicit solvent settings, if one is to be used.
        maxiter: The maximum number of SCF iterations.
        scf_properties: The list of SCF properties to extract from the calculation.
        keywords: Program-specific computational keywords to pass to
            the program.
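
To make the `overwrite` parameter concrete, here is a minimal sketch of name-keyed storage with an overwrite guard. `QCSpecSketch` and `SpecRegistrySketch` are hypothetical stand-ins written for illustration, not openff-qcsubmit classes, and raising `ValueError` on a name clash is an assumption about the behaviour rather than the library's actual error type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QCSpecSketch:
    """Hypothetical stand-in for a QC specification (not the qcsubmit class)."""

    method: str
    basis: Optional[str]
    program: str = "psi4"
    maxiter: int = 200


class SpecRegistrySketch:
    """Hypothetical registry that stores specs under their spec_name."""

    def __init__(self):
        self.specs = {}

    def add(self, spec_name, spec, overwrite=False):
        # Refuse to silently replace an existing entry unless asked to.
        if spec_name in self.specs and not overwrite:
            raise ValueError(
                f"A spec is already stored under {spec_name!r}; "
                "pass overwrite=True to replace it."
            )
        self.specs[spec_name] = spec


registry = SpecRegistrySketch()
registry.add("default", QCSpecSketch("B3LYP-D3BJ", "DZVP"))
registry.add("WB97M-D3BJ/def2-TZVPPD", QCSpecSketch("WB97M-D3BJ", "def2-TZVPPD"))
# Re-adding under an existing name only succeeds with overwrite=True.
registry.add("default", QCSpecSketch("B3LYP-D3BJ", "DZVP"), overwrite=True)
```

This mirrors the pattern in the next cell, where the `default` spec is added again with `overwrite=True`, presumably because a spec of that name already exists on the dataset.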

In [6]:
dataset.add_qc_spec(method='WB97M-D3BJ',
                    basis='def2-TZVPPD',
                    program='psi4',
                    spec_name="WB97M-D3BJ/def2-TZVPPD",
                    spec_description="SPICE sets level of theory",
                    store_wavefunction='none',
                    implicit_solvent=None,
                    maxiter=200,
                    scf_properties=['dipole', 'quadrupole', 'wiberg_lowdin_indices', 'mayer_indices', 'lowdin_charges', 'mulliken_charges'],
                    keywords={}
                    )
dataset.add_qc_spec(method='B3LYP-D3BJ',
                    basis='DZVP',
                    program='psi4',
                    spec_name="default",
                    spec_description="Standard OpenFF optimization quantum chemistry specification.",
                    store_wavefunction='none',
                    implicit_solvent=None,
                    maxiter=200,
                    scf_properties=['dipole', 'quadrupole', 'wiberg_lowdin_indices', 'mayer_indices', 'lowdin_charges', 'mulliken_charges'],
                    keywords={},
                    overwrite=True
                    )

In [7]:
pprint(dataset.dict()['qc_specifications'])

{'WB97M-D3BJ/def2-TZVPPD': {'basis': 'def2-TZVPPD',
                            'implicit_solvent': None,
                            'keywords': {},
                            'maxiter': 200,
                            'method': 'WB97M-D3BJ',
                            'program': 'psi4',
                            'scf_properties': ['dipole',
                                               'quadrupole',
                                               'wiberg_lowdin_indices',
                                               'mayer_indices',
                                               'lowdin_charges',
                                               'mulliken_charges'],
                            'spec_description': 'SPICE sets level of theory',
                            'spec_name': 'WB97M-D3BJ/def2-TZVPPD',
                            'store_wavefunction': 'none'},
 'default': {'basis': 'DZVP',
             'implicit_solvent': None,
             'keywords': {},
             'maxi

In [8]:
# summarize dataset for readme
confs = np.array([len(mol.conformers) for mol in dataset.molecules])

print("- Number of unique molecules:", dataset.n_molecules)
print("- Number of filtered molecules:", dataset.n_filtered)
print("- Number of conformers:", sum(confs))
print(
    "- Number of conformers per molecule (min, mean, max): "
    f"{confs.min()}, {confs.mean():.2f}, {confs.max()}"
)

# One molecular weight per molecule (sum of atomic masses, units stripped)
masses = np.array(
    [
        sum(atom.mass.m for atom in molecule.atoms)
        for molecule in dataset.molecules
    ]
)
print(f"- Mean molecular weight: {masses.mean():.2f}")
print(f"- Min molecular weight: {masses.min():.2f}")
print(f"- Max molecular weight: {masses.max():.2f}")
print("- Charges:", sorted(set(m.total_charge.m for m in dataset.molecules)))


print("## Metadata")
print(f"- Elements: {{{', '.join(dataset.metadata.dict()['elements'])}}}")


fields = [
    "basis",
    "implicit_solvent",
    "keywords",
    "maxiter",
    "method",
    "program",
]
for spec, obj in dataset.qc_specifications.items():
    od = obj.dict()
    print("- Spec:", spec)
    for field in fields:
        print(f"\t - {field}: {od[field]}")
    print("\t- SCF properties:")
    for field in od["scf_properties"]:
        print(f"\t\t- {field}")


# export the dataset
dataset.export_dataset("dataset.json.bz2")
dataset.molecules_to_file("dataset.smi", "smi")
dataset.visualize("dataset.pdf", columns=4)

- Number of unique molecules: 1213
- Number of filtered molecules: 0
- Number of conformers: 8018
- Number of conformers per molecule (min, mean, max): 1, 6.61, 10
- Mean molecular weight: 293.47
- Min molecular weight: 86.16
- Max molecular weight: 597.88
- Charges: [-6.0, -5.0, -4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
## Metadata
- Elements: {F, I, Cl, C, H, Br, O, P, S, N}
- Spec: default
	 - basis: DZVP
	 - implicit_solvent: None
	 - keywords: {}
	 - maxiter: 200
	 - method: B3LYP-D3BJ
	 - program: psi4
	- SCF properties:
		- dipole
		- quadrupole
		- wiberg_lowdin_indices
		- mayer_indices
		- lowdin_charges
		- mulliken_charges
- Spec: WB97M-D3BJ/def2-TZVPPD
	 - basis: def2-TZVPPD
	 - implicit_solvent: None
	 - keywords: {}
	 - maxiter: 200
	 - method: WB97M-D3BJ
	 - program: psi4
	- SCF properties:
		- dipole
		- quadrupole
		- wiberg_lowdin_indices
		- mayer_indices
		- lowdin_charges
		- mulliken_charges
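
The exported `dataset.json.bz2` can be inspected without the OpenFF stack. The sketch below assumes the file is bz2-compressed JSON, as its extension suggests; `load_bz2_json` is a hypothetical helper that uses only the Python standard library.

```python
import bz2
import json


def load_bz2_json(path):
    """Read a bz2-compressed JSON file and return the decoded object."""
    # "rt" opens the compressed stream in text mode so json can read it.
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

If the export mirrors `dataset.dict()`, then `load_bz2_json("dataset.json.bz2")["qc_specifications"]` should list the same two specs printed above; the key name is taken from the `dataset.dict()` output in this notebook and is not verified against the file itself.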
