# Dataset Preparation

We will use the same assets from `2020-07-27-OpenFF-Benchmark-Ligands` as our starting point:

In [1]:
! ls ../2020-07-27-OpenFF-Benchmark-Ligands/sdfs/

bace.sdf  jnk1.sdf  p38a.sdf   thrombin.sdf
cdk2.sdf  mcl1.sdf  ptp1b.sdf  tyk2.sdf


We'll want these as OpenFF `Molecule` objects to feed into QCSubmit. Some differences from that dataset:
1. We won't fragment the molecules at all, since we are not doing QM.
2. We will exclude all molecules with elements that are not supported by ANI2x.
3. We don't want the default compute spec; instead we want:
    - `openff-1.0.0`
    - `openff-1.1.0`
    - `openff-1.2.0`
    - `openff-1.2.1`
    - `ani2x`

In [2]:
from qcsubmit.factories import TorsiondriveDatasetFactory
from qcsubmit import workflow_components



In [3]:
factory = TorsiondriveDatasetFactory()
factory

TorsiondriveDatasetFactory(qc_specifications={'default': QCSpec(method='B3LYP-D3BJ', basis='DZVP', program='psi4', spec_name='default', spec_description='Standard OpenFF optimization quantum chemistry specification.', store_wavefunction=<WavefunctionProtocolEnum.none: 'none'>)}, maxiter=200, driver=<DriverEnum.gradient: 'gradient'>, scf_properties=['dipole', 'quadrupole', 'wiberg_lowdin_indices', 'mayer_indices'], priority='normal', dataset_tags=['openff'], compute_tag='openff', workflow={}, optimization_program=GeometricProcedure(program='geometric', coordsys='tric', enforce=0.1, epsilon=0.0, reset=True, qccnv=True, molcnv=False, check=0, trust=0.1, tmax=0.3, maxiter=300, convergence_set='GAU', constraints={}), grid_spacings=[15], energy_upper_limit=0.05, dihedral_ranges=None, energy_decrease_thresh=None)

In [4]:
conformers = workflow_components.StandardConformerGenerator(max_conformers=10)
factory.add_workflow_component(conformers)
factory

TorsiondriveDatasetFactory(qc_specifications={'default': QCSpec(method='B3LYP-D3BJ', basis='DZVP', program='psi4', spec_name='default', spec_description='Standard OpenFF optimization quantum chemistry specification.', store_wavefunction=<WavefunctionProtocolEnum.none: 'none'>)}, maxiter=200, driver=<DriverEnum.gradient: 'gradient'>, scf_properties=['dipole', 'quadrupole', 'wiberg_lowdin_indices', 'mayer_indices'], priority='normal', dataset_tags=['openff'], compute_tag='openff', workflow={'StandardConformerGenerator': StandardConformerGenerator(component_name='StandardConformerGenerator', component_description='Generate conformations for the given molecules', component_fail_message='Conformers could not be generated', toolkit='openeye', max_conformers=10, clear_existing=True)}, optimization_program=GeometricProcedure(program='geometric', coordsys='tric', enforce=0.1, epsilon=0.0, reset=True, qccnv=True, molcnv=False, check=0, trust=0.1, tmax=0.3, maxiter=300, convergence_set='GAU', con

In [5]:
factory.export_settings("torsiondrivefactory_settings.yaml")

## Gathering molecules supported by ANI2x

In [6]:
from glob import glob
from openforcefield.topology import Molecule

In [7]:
sdfs = sorted(glob('../2020-07-27-OpenFF-Benchmark-Ligands/sdfs/*.sdf'))

In [8]:
offmols = [Molecule.from_file(i, allow_undefined_stereo=True) for i in sdfs]

Problematic atoms are:
Atom atomic num: 16, name: , idx: 44, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 10, aromatic: True, chiral: False
bond order: 2, chiral: False to atom atomic num: 8, name: , idx: 45, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 46, aromatic: False, chiral: False



In [9]:
list(map(len, offmols))

[36, 16, 21, 42, 34, 23, 11, 16]
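The counts above are easier to read when paired with their targets. A minimal sketch, assuming the sorted `glob` order matches the alphabetical order of the file names listed earlier (the `targets` list is typed in by hand here, not read from disk):

```python
# Target names taken from the SDF listing, in sorted (alphabetical) order.
targets = ["bace", "cdk2", "jnk1", "mcl1", "p38a", "ptp1b", "thrombin", "tyk2"]
# Molecule counts per file, copied from the output of `list(map(len, offmols))`.
counts = [36, 16, 21, 42, 34, 23, 11, 16]

per_target = dict(zip(targets, counts))
total = sum(per_target.values())

for name, n in per_target.items():
    print(f"{name:9s}{n:4d}")
print("total", total)
```

The total is 199 molecules across the eight files.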

We'll walk through these and build up the set of molecules that contain only elements ANI2x supports.

In [10]:
mol = offmols[0][0]

In [11]:
# atomic numbers supported by ANI2x: H, C, N, O, F, S, Cl
allowed = {1, 6, 7, 8, 9, 16, 17}

In [12]:
symbols = set(a.atomic_number for a in mol.atoms)
symbols

{1, 6, 7, 8, 17}

In [13]:
symbols.issubset(allowed)

True

In [14]:
# add phosphorus (15), which is not in `allowed`, to check that the test rejects it
symbols.add(15)

In [15]:
symbols.issubset(allowed)

False

In [16]:
selected_offmols = []

# keep only molecules whose elements are all in `allowed`; `count` tracks the total seen
count = 0
for molset in offmols:
    for mol in molset:
        count += 1
        symbols = set(a.atomic_number for a in mol.atoms)
        if symbols.issubset(allowed):
            selected_offmols.append(mol)

In [17]:
count

199

In [18]:
len(selected_offmols)

174

We have our subset: 174 of the 199 molecules contain only supported elements. :D
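To see why a molecule was dropped, the same subset test can be turned into a small partition helper. This is only a sketch: `split_by_element_support` is a hypothetical helper, and each molecule is represented as a plain list of atomic numbers (a stand-in for `[a.atomic_number for a in mol.atoms]`) so that it runs without the toolkit.

```python
# Atomic numbers in the `allowed` set used above: H, C, N, O, F, S, Cl.
ANI2X_ELEMENTS = {1, 6, 7, 8, 9, 16, 17}


def split_by_element_support(molecules, allowed=ANI2X_ELEMENTS):
    """Partition molecules into (kept, rejected) by element coverage.

    Each molecule is an iterable of atomic numbers. Rejected entries are
    returned together with the set of unsupported atomic numbers.
    """
    kept, rejected = [], []
    for atomic_numbers in molecules:
        unsupported = set(atomic_numbers) - allowed
        if unsupported:
            rejected.append((atomic_numbers, unsupported))
        else:
            kept.append(atomic_numbers)
    return kept, rejected


# Hypothetical toy inputs, not taken from the benchmark set.
toy = [
    [6, 1, 1, 1, 17],  # C, H, Cl only: supported
    [6, 1, 1, 1, 35],  # contains Br (35): unsupported
    [15, 8, 8, 8, 8],  # contains P (15): unsupported
]
kept, rejected = split_by_element_support(toy)
offending = [sorted(unsupported) for _, unsupported in rejected]
print(len(kept), offending)
```

With real molecules, the rejected list would show which elements account for the excluded entries.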

## Building the dataset

In [19]:
dataset = factory.create_dataset(dataset_name="OpenFF Benchmark Ligands - Unfragmented v1.0",
                                 molecules=selected_offmols,
                                 description="Torsiondrives of the unfragmented JACS benchmark inhibitors.",
                                 tagline="Torsiondrives of unfragmented JACS benchmark inhibitors.")

Problematic atoms are:
Atom atomic num: 16, name: , idx: 44, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 10, aromatic: True, chiral: False
bond order: 2, chiral: False to atom atomic num: 8, name: , idx: 45, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 46, aromatic: False, chiral: False

Problematic atoms are:
Atom atomic num: 16, name: , idx: 26, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 6, aromatic: True, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 17, aromatic: False, chiral: False
bond order: 2, chiral: False to atom atomic num: 8, name: , idx: 24, aromatic: False, chiral: False



In [20]:
dataset.metadata.elements

{'C', 'Cl', 'F', 'H', 'N', 'O', 'S'}

In [21]:
dataset.n_molecules

Problematic atoms are:
Atom atomic num: 16, name: , idx: 43, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 3, aromatic: True, chiral: False
bond order: 2, chiral: False to atom atomic num: 8, name: , idx: 44, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 45, aromatic: False, chiral: False

174

In [22]:
dataset.n_records

1255

### Setting compute specs

In [23]:
dataset.qc_specifications.pop('default')

QCSpec(method='B3LYP-D3BJ', basis='DZVP', program='psi4', spec_name='default', spec_description='Standard OpenFF optimization quantum chemistry specification.', store_wavefunction=<WavefunctionProtocolEnum.none: 'none'>)

In [24]:
dataset.qc_specifications

{}

In [25]:
openff_versions = [
    "openff-1.0.0",
    "openff-1.1.0",
    "openff-1.2.0",
    "openff-1.2.1",
]

In [26]:
for openff_version in openff_versions:
    dataset.add_qc_spec(method=openff_version, 
                        basis="smirnoff", 
                        program="openmm", 
                        spec_name=openff_version, 
                        spec_description=f"default {openff_version} spec")

Open question: should ANI be a separate submission, or part of the same one?

Also, I believe Josh Horton mentioned that something other than increasing `maxiter` might be done for ANI cases that aren't converging?

In [27]:
dataset.add_qc_spec(method="ani2x",
                    basis=None,
                    program="torchani",
                    spec_name="ani2x",
                    spec_description="ANI2x ML potential")
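Before adding specs to a dataset, it can help to lay them out as plain data and check that the names are unique. A minimal sketch, assuming the keyword arguments are exactly those used in the cells above; `build_spec_kwargs` is a hypothetical helper, not part of QCSubmit:

```python
openff_versions = ["openff-1.0.0", "openff-1.1.0", "openff-1.2.0", "openff-1.2.1"]


def build_spec_kwargs(versions):
    """Return one keyword-argument dict per compute spec (force fields plus ANI2x)."""
    specs = [
        dict(
            method=version,
            basis="smirnoff",
            program="openmm",
            spec_name=version,
            spec_description=f"default {version} spec",
        )
        for version in versions
    ]
    # The ANI2x spec uses no basis and runs through torchani, as in the cell above.
    specs.append(
        dict(
            method="ani2x",
            basis=None,
            program="torchani",
            spec_name="ani2x",
            spec_description="ANI2x ML potential",
        )
    )
    return specs


specs = build_spec_kwargs(openff_versions)
names = [spec["spec_name"] for spec in specs]
# Spec names should be unique within one dataset.
assert len(names) == len(set(names))
for spec in specs:
    print(spec["spec_name"], spec["program"])
```

Each dictionary could then be passed along as `dataset.add_qc_spec(**spec)`, which is equivalent to the explicit keyword calls above.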

In [28]:
dataset.visualize("molecules.pdf")

Problematic atoms are:
Atom atomic num: 16, name: , idx: 43, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 3, aromatic: True, chiral: False
bond order: 2, chiral: False to atom atomic num: 8, name: , idx: 44, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 45, aromatic: False, chiral: False

### Metadata additions

In [29]:
dataset.metadata.long_description_url = "https://github.com/openforcefield/qca-dataset-submission/tree/master/2020-10-08-OpenFF-Benchmark-Ligands-Unfragmented"

### Writeout

In [30]:
dataset.export_dataset("dataset.json")

In [33]:
! bzip2 dataset.json
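The same compression can be done from Python with the standard-library `bz2` module, which avoids depending on a shell. A sketch under the assumption that we want the default `bzip2` behaviour of writing `<name>.bz2` and removing the original; `bzip2_file` is a hypothetical helper, and the demonstration uses a throwaway file rather than the real `dataset.json`:

```python
import bz2
import os
import shutil
import tempfile


def bzip2_file(path):
    """Compress `path` to `path + '.bz2'` and remove the original file."""
    target = path + ".bz2"
    with open(path, "rb") as src, bz2.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return target


# Demonstration on a placeholder file in a temporary directory.
workdir = tempfile.mkdtemp()
demo_path = os.path.join(workdir, "dataset.json")
with open(demo_path, "w") as handle:
    handle.write('{"placeholder": true}')

compressed_path = bzip2_file(demo_path)

# Read the archive back to confirm the round trip.
with bz2.open(compressed_path, "rt") as handle:
    restored = handle.read()
print(os.path.basename(compressed_path))
```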

In [34]:
dataset.filtered_molecules

{'StandardConformerGenerator': FilterEntry(component_name='StandardConformerGenerator', component_description={'component_name': 'StandardConformerGenerator', 'component_description': 'Generate conformations for the given molecules', 'component_fail_message': 'Conformers could not be generated', 'toolkit': 'openeye', 'max_conformers': 10, 'clear_existing': True}, component_provenance={'OpenforcefieldToolkit': '0.7.1', 'QCSubmit': '0+untagged.152.g76dd8ac', 'openeye': '2020.1.0'}, molecules=[]),
 'LinearTorsionRemoval': FilterEntry(component_name='LinearTorsionRemoval', component_description={'component_description': 'Remove any molecules with a linear torsions selected to drive.'}, component_provenance={'qcsubmit': '0+untagged.152.g76dd8ac', 'openforcefield': '0.7.1', 'openeye': '2020.1.0'}, molecules=[]),
 'UnconnectedTorsionRemoval': FilterEntry(component_name='UnconnectedTorsionRemoval', component_description={'component_description': 'Remove any molecules with unconnected torsion i

In [35]:
dataset.metadata.elements

{'C', 'Cl', 'F', 'H', 'N', 'O', 'S'}