# Creating submission workflows using qcsubmit.

Here we aim to create reproducible workflows that can process large lists of molecules, applying filtering and other useful operations
such as state enumeration and fragmentation, before formatting the data into a QCArchive dataset that can be
submitted to a public or local archive instance.

The entire workflow is built from modular components, each of which can have programmable settings that are
controlled through the API or via workflow yaml/json files.

In this example, we will demonstrate how to set up a basic workflow via the API and how it can be exported to a settings file 
which can then be used to reconstruct the entire workflow by another user.

First, let's load in the packages.

In [1]:
from qcsubmit.factories import BasicDatasetFactory  # load in the factory to process molecules
from qcsubmit import workflow_components                # load in a list of workflow_components
from openforcefield.utils.utils import get_data_file_path  # a util function to load a mini drug bank file
from openforcefield.topology import Molecule


Create the basic QCArchive dataset factory. This is useful for large machine learning datasets that are collections
of single point calculations provided through the energy/gradient/hessian drivers.

In [2]:
factory = BasicDatasetFactory()
# we can view and change any of the basic settings in the factory such as the qm settings
factory.program = 'torchani'
factory.method = 'ANI1ccx'
factory.basis = None
factory.driver = 'energy'
factory.spec_description = "ANI1ccx standard specification"
factory.spec_name = "ani1ccx"
# let's look at the class and the settings
print(factory)

method='ANI1ccx' basis=None program='torchani' maxiter=200 driver=<DriverEnum.energy: 'energy'> scf_properties=['dipole', 'qudrupole', 'wiberg_lowdin_indices'] spec_name='ani1ccx' spec_description='ANI1ccx standard specification' priority='normal' dataset_tags=['openff'] compute_tag='openff' workflow={}


In [3]:
# each of the fields is also validated as the class is based on pydantic
# so the following should produce an error
factory.basis = {'test': 1}


ValidationError: 1 validation error for BasicDatasetFactory
basis
  str type expected (type=type_error.str)
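
The rejection above comes from validating a field at the moment it is assigned. The following is a rough, stdlib-only sketch of that idea. `SketchFactory` and its single `basis` field are hypothetical stand-ins for illustration and are not the qcsubmit or pydantic implementation.

```python
from typing import Optional


class SketchFactory:
    """Hypothetical stand-in for a settings class that validates on assignment."""

    def __init__(self, basis: Optional[str] = "DZVP") -> None:
        # This assignment is routed through __setattr__, so it is validated too.
        self.basis = basis

    def __setattr__(self, name: str, value: object) -> None:
        # Only the 'basis' field is checked in this sketch: it must be a str or None.
        if name == "basis" and value is not None and not isinstance(value, str):
            raise TypeError(f"basis: str type expected, got {type(value).__name__}")
        super().__setattr__(name, value)


factory_sketch = SketchFactory()
factory_sketch.basis = None  # accepted, as for the torchani settings above

try:
    factory_sketch.basis = {"test": 1}  # rejected, like the cell above
except TypeError as err:
    rejected_message = str(err)
```

Because the check runs on every assignment, a badly typed setting is caught where it is made rather than later, when the dataset is built.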

Note that the basic settings should be suitable in most cases, as they are the ones recommended by
openforcefield and are currently used in the fitting of the most recent force fields. Now let's look at
the workflow components.

In [None]:
# the workflow is a dictionary that contains all of the components that will be executed in order
print(factory.workflow)

In [None]:
# the workflow is also validated so only properly configured workflow components can be added
factory.add_workflow_component(3)

As you can see, the number is not a proper workflow component, so it has been rejected from the workflow.
Users can make their own workflow components and add them to the workflow, but each must be a subclass
of the CustomWorkflowComponent class with all of the abstract methods and settings implemented. See the creating workflow
components notebook for examples of how to do this.
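
As a loose illustration of the subclassing requirement, the sketch below uses Python's `abc` module. The names `SketchComponent`, `KeepShortSmiles` and `SketchWorkflow` are hypothetical, and the real CustomWorkflowComponent interface (its abstract methods and how it reports a rejection) may well differ.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SketchComponent(ABC):
    """Hypothetical base class standing in for CustomWorkflowComponent."""

    component_name: str = "SketchComponent"

    @abstractmethod
    def apply(self, molecules: List[str]) -> List[str]:
        """Return the molecules that pass this component."""


class KeepShortSmiles(SketchComponent):
    """Toy component that keeps SMILES strings up to a maximum length."""

    component_name = "KeepShortSmiles"

    def __init__(self, max_length: int = 10) -> None:
        self.max_length = max_length

    def apply(self, molecules: List[str]) -> List[str]:
        return [smiles for smiles in molecules if len(smiles) <= self.max_length]


class SketchWorkflow:
    """Keeps components in insertion order, keyed by component name."""

    def __init__(self) -> None:
        self.workflow: Dict[str, SketchComponent] = {}

    def add_workflow_component(self, component: object) -> None:
        # Anything that is not a component subclass is refused (assumed behaviour).
        if not isinstance(component, SketchComponent):
            raise TypeError(f"{component!r} is not a workflow component")
        self.workflow[component.component_name] = component


sketch = SketchWorkflow()
sketch.add_workflow_component(KeepShortSmiles(max_length=5))

try:
    sketch.add_workflow_component(3)
except TypeError as err:
    rejection = str(err)
```

A subclass that forgets to implement `apply` cannot be instantiated at all, which is what makes the abstract base class a convenient gatekeeper here.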

Now let's set up a workflow that will filter out some unwanted elements, then filter by molecular weight
before generating conformers for each of the molecules.

In [4]:
# set up the element filter
el_filter = workflow_components.ElementFilter()
# let's view the options available for this filter
print(el_filter)

component_name='ElementFilter' component_description='Filter out molecules who contain elements not in the allowed element list' component_fail_message='Molecule contained elements not in the allowed elements list' allowed_elements=['H', 'C', 'N', 'O', 'F', 'P', 'S', 'Cl', 'Br', 'I']


This filter can select elements by atomic symbol or atomic number; we just have to supply a list of
symbols or numbers to the filter. Here, let's keep only molecules made of H, C, N and O, as we would like
to use ANI as our QM method.
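
To make the "symbols or numbers" idea concrete, here is a small sketch of how a mixed list could be normalised to atomic numbers before checking a molecule. The helper names and the lookup table are hypothetical and only cover the default elements printed above; this is not how ElementFilter is necessarily implemented.

```python
from typing import Iterable, List, Set, Union

# Lookup limited to the default allowed elements shown in the filter output above.
SYMBOL_TO_NUMBER = {
    "H": 1, "C": 6, "N": 7, "O": 8, "F": 9,
    "P": 15, "S": 16, "Cl": 17, "Br": 35, "I": 53,
}


def to_atomic_numbers(allowed: Iterable[Union[str, int]]) -> Set[int]:
    """Convert a mixed list of element symbols and atomic numbers to atomic numbers."""
    return {
        SYMBOL_TO_NUMBER[item] if isinstance(item, str) else int(item)
        for item in allowed
    }


def passes_element_filter(
    molecule_elements: List[int], allowed: Iterable[Union[str, int]]
) -> bool:
    """True when every atom in the molecule belongs to the allowed set."""
    return set(molecule_elements) <= to_atomic_numbers(allowed)


ethanol = [6, 6, 8, 1, 1, 1, 1, 1, 1]        # C2H6O
dichloroethane = [6, 6, 17, 17, 1, 1, 1, 1]  # C2H4Cl2

ethanol_ok = passes_element_filter(ethanol, ["H", "C", 7, 8])
dichloroethane_ok = passes_element_filter(dichloroethane, [1, 6, 7, 8])
```

With the H, C, N, O selection, ethanol passes while 1,2-dichloroethane is rejected because of its chlorine atoms.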

In [5]:
# set the filter to only keep molecules with these elements
el_filter.allowed_elements = [1, 6, 7, 8]

# now let's add the filter to the workflow
factory.add_workflow_component(el_filter)

Now we will set up the weight filter and conformer generation components and add them to the workflow.

In [6]:
weight_filter = workflow_components.MolecularWeightFilter()
factory.add_workflow_component(weight_filter)
conf_gen = workflow_components.StandardConformerGenerator()
conf_gen.max_conformers = 1
conf_gen.toolkit = "openeye"
factory.add_workflow_component(conf_gen)

Now let's look at the workflow and make sure all of the components were added correctly. Then let's save the
settings and workflow so they can be used again later.


In [7]:
print(factory.workflow)

{'ElementFilter': ElementFilter(component_name='ElementFilter', component_description='Filter out molecules who contain elements not in the allowed element list', component_fail_message='Molecule contained elements not in the allowed elements list', allowed_elements=[1, 6, 7, 8]), 'MolecularWeightFilter': MolecularWeightFilter(component_name='MolecularWeightFilter', component_description='Molecules are filtered based on the allowed molecular weights.', component_fail_message='Molecule weight was not in the specified region.', minimum_weight=130, maximum_weight=781), 'StandardConformerGenerator': StandardConformerGenerator(component_name='StandardConformerGenerator', component_description='Generate conformations for the given molecules', component_fail_message='Conformers could not be generated', toolkit='openeye', max_conformers=1, clear_existing=True)}


In [8]:
# now let's save the workflow to json
factory.export_settings('work1.json')

# and let's save to yaml
factory.export_settings('work1.yaml')

In [9]:
# let's look at the output files
! head -n 20 work1.yaml

basis: null
compute_tag: openff
dataset_tags:
- openff
driver: energy
maxiter: 200
method: ANI1ccx
priority: normal
program: torchani
scf_properties:
- dipole
- qudrupole
- wiberg_lowdin_indices
spec_description: ANI1ccx standard specification
spec_name: ani1ccx
workflow:
  ElementFilter:
    allowed_elements:
    - 1
    - 6


In [10]:
! head -n 20 work1.json

{
  "method": "ANI1ccx",
  "basis": null,
  "program": "torchani",
  "maxiter": 200,
  "driver": "energy",
  "scf_properties": [
    "dipole",
    "qudrupole",
    "wiberg_lowdin_indices"
  ],
  "spec_name": "ani1ccx",
  "spec_description": "ANI1ccx standard specification",
  "priority": "normal",
  "dataset_tags": [
    "openff"
  ],
  "compute_tag": "openff",
  "workflow": {
    "ElementFilter": {


Now let's make a new workflow factory and load in the settings we just saved to quickly rebuild the workflow.


In [11]:
factory2 = BasicDatasetFactory()

# now let's print out the basic workflow factory with its default settings
print(factory2)

method='B3LYP-D3BJ' basis='DZVP' program='psi4' maxiter=200 driver=<DriverEnum.energy: 'energy'> scf_properties=['dipole', 'qudrupole', 'wiberg_lowdin_indices'] spec_name='default' spec_description='Standard OpenFF optimization quantum chemistry specification.' priority='normal' dataset_tags=['openff'] compute_tag='openff' workflow={}


In [12]:
# now load in the settings from the yaml file and print out the factory components

factory2.import_settings('work1.yaml')

# note that the QM settings have changed and the workflow components have been recreated
print(factory2)

method='ANI1ccx' basis=None program='torchani' maxiter=200 driver=<DriverEnum.energy: 'energy'> scf_properties=['dipole', 'qudrupole', 'wiberg_lowdin_indices'] spec_name='ani1ccx' spec_description='ANI1ccx standard specification' priority='normal' dataset_tags=['openff'] compute_tag='openff' workflow={'ElementFilter': ElementFilter(component_name='ElementFilter', component_description='Filter out molecules who contain elements not in the allowed element list', component_fail_message='Molecule contained elements not in the allowed elements list', allowed_elements=[1, 6, 7, 8]), 'MolecularWeightFilter': MolecularWeightFilter(component_name='MolecularWeightFilter', component_description='Molecules are filtered based on the allowed molecular weights.', component_fail_message='Molecule weight was not in the specified region.', minimum_weight=130, maximum_weight=781), 'StandardConformerGenerator': StandardConformerGenerator(component_name='StandardConformerGenerator', component_description='
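
The export and import steps work because the factory settings reduce to plain, serialisable data. The sketch below shows that round trip with the standard `json` module only. The `settings` dictionary is a hand-written excerpt of the values printed above, and the file name is hypothetical; it does not call qcsubmit.

```python
import json
import os
import tempfile

# Hand-copied excerpt of the exported settings shown above (not the full file).
settings = {
    "method": "ANI1ccx",
    "basis": None,
    "program": "torchani",
    "workflow": {"ElementFilter": {"allowed_elements": [1, 6, 7, 8]}},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "work1_sketch.json")

    # Write the settings out, much as an export step would.
    with open(path, "w") as handle:
        json.dump(settings, handle, indent=2)

    # Read them back, much as an import step would.
    with open(path) as handle:
        restored = json.load(handle)

round_trip_ok = restored == settings
```

Note that Python's `None` is written as `null` in the file and read back as `None`, matching the `basis: null` entry in the exported settings.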

Now we can run the workflow on a set of molecules in the mini drug bank file that comes with openforcefield.

In [13]:
mols = Molecule.from_file(get_data_file_path('molecules/minidrugbank.sdf'), allow_undefined_stereo=True)

Problematic atoms are:
Atom atomic num: 7, name: , idx: 20, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 4, aromatic: True, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 16, aromatic: False, chiral: True
bond order: 1, chiral: False to atom atomic num: 8, name: , idx: 28, aromatic: False, chiral: False

Problematic atoms are:
Atom atomic num: 7, name: , idx: 21, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 12, aromatic: True, chiral: False

In [14]:
# create the dataset ready for submission
dataset = factory2.create_dataset(dataset_name='my_dataset', molecules=mols, description="my test dataset.")

Problematic atoms are:
Atom atomic num: 7, name: , idx: 23, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 2, aromatic: True, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 14, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 18, aromatic: False, chiral: False





c: [0, 0]
fc: [0, 0.0]
m: [1]
fm: [1, 1, 2]


In [15]:
dataset.export_dataset("dataset_dump.json")

In [16]:
for molecule in dataset.filtered:
    print(molecule)

Problematic atoms are:
Atom atomic num: 7, name: , idx: 7, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 6, aromatic: True, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 8, aromatic: False, chiral: True
bond order: 1, chiral: False to atom atomic num: 8, name: , idx: 48, aromatic: False, chiral: False



Molecule with name '' and SMILES '[H]c1c(c(c(c(c1[H])[H])Oc2c(c(c(c(c2[H])C([H])([H])C([H])([H])C([H])([H])[C@]([H])(P(=O)(O[H])O[H])S(=O)(=O)O[H])[H])[H])[H])[H])[H]'
Molecule with name '' and SMILES '[H]c1c(c(c(c(c1C([H])([H])[H])[H])[H])S(=O)(=O)O[H])[H]'
Molecule with name '' and SMILES '[H]c1c(c(c(c(c1C(C([H])([H])[H])(C([H])([H])[H])C([H])([H])[H])[H])[H])S(=O)(=O)O[H])[H]'
Molecule with name '' and SMILES '[H]c1c(c(c(c(c1C([H])([H])O[C@@]([H])(O[H])SC([H])([H])[C@]([H])(C(=O)N([H])C([H])([H])C(=O)O[H])N([H])C(=O)C([H])([H])C([H])([H])[C@@]([H])(C(=O)O[H])N([H])[H])[H])[H])N(=O)([H])O[H])[H]'
Molecule with name '' and SMILES '[H]c1c(c(c2c(c1[H])N=C(S2)SC([H])([H])C([H])([H])C([H])([H])S(=O)(=O)O[H])[H])[H]'
Molecule with name '' and SMILES '[H]c1c(c(c(c(c1c2c(c(c(c(c2[H])[H])[C@@]([H])(C([H])([H])[H])C([H])([H])N([H])S(=O)(=O)C([H])([H])[H])[H])[H])[H])[H])[C@@]([H])(C([H])([H])[H])C([H])([H])N([H])S(=O)(=O)C([H])([H])[H])[H]'
Molecule with name '' and SMILES '[H]c1c(c2c(c(c(c(c2

Problematic atoms are:
Atom atomic num: 7, name: , idx: 7, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 6, aromatic: True, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 8, aromatic: False, chiral: True
bond order: 1, chiral: False to atom atomic num: 8, name: , idx: 48, aromatic: False, chiral: False



Molecule with name '' and SMILES '[H]C1=C(SC(=N1)[C@@]([H])([C@@]([H])(C([H])([H])[H])O[H])N([H])[H])[H]'
Molecule with name '' and SMILES '[H]C([H])(C([H])([H])Cl)Cl'
Molecule with name '' and SMILES '[H]C1=C(N(C(C(=C1N([H])C([H])([H])C([H])([H])[C@]([H])(C(=O)N2C(C(C(C(C2([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])N([H])C(=O)C([H])([H])N([H])S(=O)(=O)C3(C(C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[H])[H])([H])[H])[H])[H]'
Molecule with name '' and SMILES '[H]/C(=C(/[H])\S[H])/N([H])C(=O)C([H])([H])C([H])([H])N([H])C(=O)[C@@]([H])(C(C([H])([H])[H])(C([H])([H])[H])C([H])([H])O[H])O[H]'
Molecule with name '' and SMILES '[H]c1c(c(c(c(c1C([H])([H])[C@@]([H])(C(=O)O[H])N([H])S(=O)(=O)C([H])([H])C([H])([H])C([H])([H])C([H])([H])[H])[H])[H])OC([H])([H])C([H])([H])C([H])([H])C([H])([H])C2(C(C(N(C(C2([H])[H])([H])[H])[H])([H])[H])([H])[H])[H])[H]'
Molecule with name '' and SMILES '[H]c1c(c(c(c(c1C([H])([H])C(=O)OC([H])([H])C([H])([H])Br)C(=O)[H])[H])N([H])[H])[H]'
Molecule with nam

Problematic atoms are:
Atom atomic num: 7, name: , idx: 11, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 3, aromatic: True, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 12, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 7, name: , idx: 32, aromatic: False, chiral: False



Molecule with name '' and SMILES '[H][C@]12[C@]3([C@@]([C@](C(C3([H])[H])([H])[H])([H])[C@]([H])(C([H])([H])[H])C([H])([H])C([H])([H])C(=O)N([H])C([H])([H])C([H])([H])S(=O)(=O)O[H])([C@](C([C@@]1([C@@]4([C@@](C([C@@]2([H])O[H])([H])[H])(C([C@@](C(C4([H])[H])([H])[H])([H])O[H])([H])[H])[H])C([H])([H])[H])[H])([H])[H])([H])O[H])C([H])([H])[H])[H]'
Molecule with name '' and SMILES '[H]c1c(c(c(nc1[H])[H])C([H])([H])N2C(=O)[C@@]3([C@]([C@]4([C@@](c5c(c(c(c(c5C(C4([H])[H])([H])[H])[H])OS(=O)(=O)N([H])[H])[H])[H])(C(C3([H])[H])([H])[H])[H])[H])(C(C2=O)([H])[H])[H])C([H])([H])[H])[H]'
Molecule with name '' and SMILES '[H]C(=C([H])S(=O)(=O)O[H])[H]'
Molecule with name '' and SMILES '[H][C@](C(=O)C([H])([H])[P@](=O)([H])O[H])([C@]([H])(C([H])([H])C([H])([H])[H])O[H])C([H])([H])[H]'
Molecule with name '' and SMILES '[H][C@@](C(=O)O[H])(C([H])([H])SS(=O)(=O)O[H])N([H])[H]'
Molecule with name '' and SMILES '[H]C1=C(SC(=C1[H])C2(C(C(C(C(C2([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])N3C(C(C(C(C3([H]

Problematic atoms are:
Atom atomic num: 7, name: , idx: 18, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 3, aromatic: False, chiral: True
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 22, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 19, aromatic: False, chiral: False



Molecule with name '' and SMILES '[H]C1=C(N(C(=N1([H])[H])[H])P(=O)(O[H])O[H])C([H])([H])[C@@]([H])(C(=O)O[H])N([H])[H]'
Molecule with name '' and SMILES '[H]c1c(c(c(c(c1[H])[H])c2c(c(c(c(c2[H])[H])C([H])([H])N3C(=O)[C@](C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])([H])N([H])C(=O)[C@]([H])(C([H])([H])c4c(c(c(c(c4[H])P(=O)(O[H])O[H])OC([H])([H])C(=O)O[H])[H])[H])N([H])C(=O)C([H])([H])[H])[H])[H])[H])[H]'
Molecule with name '' and SMILES '[H][C@@](C(=O)O[H])(C([H])([H])SO[H])N([H])[H]'
Molecule with name '' and SMILES '[H]c1c(c(c(c(c1[H])C(=O)O[H])N([H])c2c(c(c(c(c2[H])Cl)OOO[H])Cl)[H])[H])[H]'
Molecule with name '' and SMILES '[H][C@](C(=O)C([H])([H])[P@@](=O)([H])O[H])([C@]([H])([C@@]([H])(C([H])([H])[H])C([H])([H])[C@@]([H])(C([H])([H])[H])C([H])([H])O[H])O[H])C([H])([H])[H]'
Molecule with name '' and SMILES '[H][C@]1([C@@]([C@](O[C@@]1([H])C([H])([H])OP(=O)(O[H])O[H])([H])N2C(=C([N+](C2([H])[H])([H])[H])C(=O)N([H])[H])O[H])([H])O[H])O[H]'
Molecule with name '' and SMILES '[H]c1c(c(c(

Problematic atoms are:
Atom atomic num: 7, name: , idx: 3, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 2, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 4, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 61, aromatic: False, chiral: False



Molecule with name '' and SMILES '[H]C1(C(C(C(C(C1([H])[H])([H])[H])([H])N([H])C([H])([H])C([H])([H])C([H])([H])S(=O)(=O)O[H])([H])[H])([H])[H])[H]'
Molecule with name '' and SMILES '[H][C@@](C([H])([H])OC(=O)C([H])([H])C([H])([H])C([H])([H])C([H])([H])[H])(C([H])([H])O[P@@](=S)(OC([H])([H])C([H])([H])[N+](C([H])([H])[H])(C([H])([H])[H])C([H])([H])[H])S[H])OC(=O)C([H])([H])C([H])([H])C([H])([H])C([H])([H])[H]'
Molecule with name '' and SMILES '[H]C1=NC2=C(N(C(=N[C@]2(N1[C@]3([C@]([C@@]([C@](O3)([H])C([H])([H])OS(=O)(=O)N([H])C(=O)[C@@]([H])(C([H])([H])C([H])([H])SC([H])([H])[H])N([H])[H])([H])O[H])([H])O[H])[H])[H])[H])[H])N([H])[H]'
Molecule with name '' and SMILES '[H][C@@](C(=O)O[H])(C([H])([H])C([H])([H])[S@](=O)C([H])([H])[H])N([H])[H]'
Molecule with name '' and SMILES '[H][C@]1([C@@](C([C@@](N1[H])([H])S(=O)(=O)O[H])([H])[H])([H])C([H])([H])[H])C(=O)O[H]'
Molecule with name '' and SMILES '[H]C([H])([H])C(=O)N(C([H])([H])C([H])([H])OP(=O)(O[H])O[H])Br'
Molecule with name '' and SM

Molecule with name '' and SMILES '[H]C([H])([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])[N+](C([H])([H])[H])(C([H])([H])[H])C([H])([H])C([H])([H])C([H])([H])S(=O)(=O)O[H]'
Molecule with name '' and SMILES '[H]c1c(c(c(c(c1[H])Cl)Cl)C2=NC3=C(C(=NN3C(=C2[H])N([H])C([H])([H])c4c(c(nc(c4[H])[H])[H])[H])[H])SC#N)[H]'
Molecule with name '' and SMILES '[H]c1c(c(c(c(c1C([H])([H])C(C(=O)OC([H])([H])C([H])([H])[H])(C(=O)OC([H])([H])C([H])([H])[H])C([H])([H])C([H])([H])N([H])C(=O)OC(C([H])([H])[H])(C([H])([H])[H])C([H])([H])[H])[H])[H])N([H])S(=O)(=O)O[H])[H]'
Molecule with name '' and SMILES '[H]c1nc2c(c(n1)N([H])[H])N=C(N2[C@]3([C@]([C@@]([C@](O3)([H])C([H])([H])OS(=O)(=O)N([H])C(=O)[C@@]([H])(C([H])([H])S[H])N([H])[H])([H])O[H])([H])O[H])[H])[H]'
Molecule with name '' and SMILES '[H]C1=C(SC(=C1[H])C2=C(C(=C(S2)C([H])([H])N([H])[H])[H])[H])[H]'
Molecule with name '' and SMILES '[H][C@@](C(=O)O[H])(C([H])([H])SC([H])

In [17]:
for molecule in dataset.molecules:
    print(molecule)

Molecule with name 'c1ccc(cc1)C[C@H](CC(CN(=O)O)(O)O)C(=O)O' and SMILES '[H]c1c(c(c(c(c1[H])[H])C([H])([H])[C@@]([H])(C(=O)O[H])C([H])([H])C(C([H])([H])N(=O)([H])O[H])(O[H])O[H])[H])[H]'
Molecule with name 'CC(=O)Nc1ccc(cc1NC(N)N)C(=O)O' and SMILES '[H]c1c(c(c(c(c1C(=O)O[H])[H])N([H])C([H])(N([H])[H])N([H])[H])N([H])C(=O)C([H])([H])[H])[H]'


Problematic atoms are:
Atom atomic num: 7, name: , idx: 23, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 2, aromatic: True, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 24, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 28, aromatic: False, chiral: False



Molecule with name 'CC(C)C[C@@H](C(=O)N[C@@H]1CCN(C1)C#N)NC(=O)OCc2ccccc2' and SMILES '[H]c1c(c(c(c(c1[H])[H])C([H])([H])OC(=O)N([H])[C@]([H])(C(=O)N([H])[C@@]2(C(C(N(C2([H])[H])C#N)([H])[H])([H])[H])[H])C([H])([H])C([H])(C([H])([H])[H])C([H])([H])[H])[H])[H]'
Molecule with name 'C[C@@H](C(=O)C)NC(=O)[C@H](CCCNC(N)N)NC(=O)OCc1ccccc1' and SMILES '[H]c1c(c(c(c(c1[H])[H])C([H])([H])OC(=O)N([H])[C@]([H])(C(=O)N([H])[C@]([H])(C(=O)C([H])([H])[H])C([H])([H])[H])C([H])([H])C([H])([H])C([H])([H])N([H])C([H])(N([H])[H])N([H])[H])[H])[H]'
Molecule with name 'c1ccc(cc1)C2=NN3C=CC=CC3=C2c4cc5c(nn4)NN=C5N' and SMILES '[H]c1c(c(c(c(c1[H])[H])C2=NN3C(=C(C(=C(C3=C2c4c(c5c(nn4)N(N=C5N([H])[H])[H])[H])[H])[H])[H])[H])[H])[H]'
Molecule with name 'C(CCCN)CCCNC(N)N' and SMILES '[H]C([H])(C([H])([H])C([H])([H])C([H])([H])N([H])[H])C([H])([H])C([H])([H])C([H])([H])N([H])C([H])(N([H])[H])N([H])[H]'
Molecule with name 'CCCC[C@H](CNNC[C@H](CCCC)C(=O)N[C@H](CCCCN)C(=O)Nc1ccccc1)C(=O)N[C@@H](CCCCN)C(=O)Nc2ccccc2'

Molecule with name 'c1ccc(cc1)C[C@@H](C(=O)CN#N)NC(=O)OCc2ccccc2' and SMILES '[H]c1c(c(c(c(c1[H])[H])C([H])([H])[C@@]([H])(C(=O)C([H])([H])N(#N)[H])N([H])C(=O)OC([H])([H])c2c(c(c(c(c2[H])[H])[H])[H])[H])[H])[H]'
Molecule with name 'c1ccc(cc1)C(=C/C=N/c2c3ccccc3nnc2CN)c4ccccc4' and SMILES '[H]c1c(c(c(c(c1[H])[H])C(=C([H])/C(=N/c2c3c(c(c(c(c3nnc2C([H])([H])N([H])[H])[H])[H])[H])[H])/[H])c4c(c(c(c(c4[H])[H])[H])[H])[H])[H])[H]'
Molecule with name 'Cc1ccccc1c2c3ccc(cc3cnn2)c4cc(ccc4C)C(=O)NC5CC5' and SMILES '[H]c1c(c(c(c(c1[H])c2c3c(c(c(c(c3c(nn2)[H])[H])c4c(c(c(c(c4C([H])([H])[H])[H])[H])C(=O)N([H])C5(C(C5([H])[H])([H])[H])[H])[H])[H])[H])C([H])([H])[H])[H])[H]'
Molecule with name 'CCCCCN[C@@H]([C@@H](Cc1ccc(c(c1)C(=O)O)OCC(=O)O)N[C@H]([C@@H](Cc2ccccc2)N[C@H](O)OC(C)(C)C)O)O' and SMILES '[H]c1c(c(c(c(c1[H])[H])C([H])([H])[C@]([H])([C@@]([H])(N([H])[C@@]([H])([C@]([H])(N([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])[H])O[H])C([H])([H])c2c(c(c(c(c2[H])C(=O)O[H])OC([H])([H])C(=

In [18]:
dataset.components

[{'component_name': 'ElementFilter',
  'component_description': {'component_name': 'ElementFilter',
   'component_description': 'Filter out molecules who contain elements not in the allowed element list',
   'component_fail_message': 'Molecule contained elements not in the allowed elements list',
   'allowed_elements': [1, 6, 7, 8]},
  'component_provenance': {'OpenforcefieldToolkit': '0.6.0+383.g20d4b740',
   'QCSubmit': '0+untagged.51.g0fb4d0d.dirty',
   'openmm_elements': '7.4.1'}},
 {'component_name': 'MolecularWeightFilter',
  'component_description': {'component_name': 'MolecularWeightFilter',
   'component_description': 'Molecules are filtered based on the allowed molecular weights.',
   'component_fail_message': 'Molecule weight was not in the specified region.',
   'minimum_weight': 130,
   'maximum_weight': 781},
  'component_provenance': {'OpenforcefieldToolkit': '0.6.0+383.g20d4b740',
   'QCSubmit': '0+untagged.51.g0fb4d0d.dirty',
   'openmm_units': '7.4.1'}},
 {'component_

In [19]:
dataset.n_molecules

57

In [20]:
dataset.n_records

57

In [21]:
len(mols)

371

In [22]:
len(dataset.dataset.keys())

57

In [23]:
dataset.filtered_molecules

{'ElementFilter': FilterEntry(component_name='ElementFilter', component_description={'component_name': 'ElementFilter', 'component_description': 'Filter out molecules who contain elements not in the allowed element list', 'component_fail_message': 'Molecule contained elements not in the allowed elements list', 'allowed_elements': [1, 6, 7, 8]}, component_provenance={'OpenforcefieldToolkit': '0.6.0+383.g20d4b740', 'QCSubmit': '0+untagged.51.g0fb4d0d.dirty', 'openmm_elements': '7.4.1'}, molecules=['[H]c1c(c(c(c(c1[H])[H])Oc2c(c(c(c(c2[H])C([H])([H])C([H])([H])C([H])([H])[C@]([H])(P(=O)(O[H])O[H])S(=O)(=O)O[H])[H])[H])[H])[H])[H]', '[H]c1c(c(c(c(c1C([H])([H])[H])[H])[H])S(=O)(=O)O[H])[H]', '[H]c1c(c(c(c(c1C(C([H])([H])[H])(C([H])([H])[H])C([H])([H])[H])[H])[H])S(=O)(=O)O[H])[H]', '[H]c1c(c(c(c(c1C([H])([H])O[C@@]([H])(O[H])SC([H])([H])[C@]([H])(C(=O)N([H])C([H])([H])C(=O)O[H])N([H])C(=O)C([H])([H])C([H])([H])[C@@]([H])(C(=O)O[H])N([H])[H])[H])[H])N(=O)([H])O[H])[H]', '[H]c1c(c(c2c(c1[H])N
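
The output above groups the rejected molecules by the component that removed them. A minimal sketch of that bookkeeping follows, assuming each component only sees the molecules that passed the previous one. The `run_filters` helper and the three toy molecules are hypothetical and are not taken from the mini drug bank; only the weight limits and allowed elements are copied from the settings shown earlier.

```python
from typing import Any, Callable, Dict, List, Tuple

ToyMolecule = Dict[str, Any]

ALLOWED_ELEMENTS = {1, 6, 7, 8}
MIN_WEIGHT, MAX_WEIGHT = 130, 781  # limits shown in the MolecularWeightFilter output


def run_filters(
    molecules: List[ToyMolecule],
    filters: List[Tuple[str, Callable[[ToyMolecule], bool]]],
) -> Tuple[List[ToyMolecule], Dict[str, List[str]]]:
    """Apply filters in order and record which filter removed each molecule."""
    passed = list(molecules)
    filtered: Dict[str, List[str]] = {}
    for name, keep in filters:
        filtered[name] = [mol["smiles"] for mol in passed if not keep(mol)]
        passed = [mol for mol in passed if keep(mol)]
    return passed, filtered


# Toy input: caffeine, 1,2-dichloroethane and ethanol.
toy_molecules = [
    {"smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "elements": {1, 6, 7, 8}, "weight": 194.19},
    {"smiles": "ClCCCl", "elements": {1, 6, 17}, "weight": 98.96},
    {"smiles": "CCO", "elements": {1, 6, 8}, "weight": 46.07},
]

filters = [
    ("ElementFilter", lambda mol: mol["elements"] <= ALLOWED_ELEMENTS),
    ("MolecularWeightFilter", lambda mol: MIN_WEIGHT <= mol["weight"] <= MAX_WEIGHT),
]

passed, filtered = run_filters(toy_molecules, filters)
```

Here the dichloroethane is recorded under the element filter, ethanol under the weight filter, and only caffeine reaches the end of the workflow.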

In [24]:
for i in dataset.dataset['c1ccc(cc1)C[C@H](CC(CN(=O)O)(O)O)C(=O)O'].initial_molecules:
    print(i.dict())

{'schema_name': 'qcschema_molecule', 'schema_version': 2, 'validated': True, 'symbols': array(['C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'N',
       'O', 'O', 'O', 'O', 'O', 'O', 'H', 'H', 'H', 'H', 'H', 'H', 'H',
       'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H'], dtype='<U1'), 'geometry': array([[ 8.18206614e+00, -7.98085237e+00, -2.36548381e+00],
       [ 9.23438708e+00, -5.62018828e+00, -1.84823333e+00],
       [ 5.67634431e+00, -8.16059222e+00, -3.16390574e+00],
       [ 7.78100827e+00, -3.43918248e+00, -2.12915991e+00],
       [ 4.22296009e+00, -5.97960129e+00, -3.44482714e+00],
       [ 5.27525040e+00, -3.61885159e+00, -2.92747890e+00],
       [ 4.08606535e+00,  8.25998710e-01,  9.81701040e-01],
       [ 3.70002598e+00, -1.25765326e+00, -3.23168225e+00],
       [ 1.03974700e-02,  1.20714561e+00, -1.45069085e+00],
       [-2.86265369e+00, -3.84372700e-02,  2.31156091e+00],
       [ 2.27177321e+00, -5.17215560e-01, -7.99937790e-01],
       [-1.62834659e+00,

In [26]:
coverage = dataset.coverage_report(['openff-1.0.0.offxml'])

Problematic atoms are:
Atom atomic num: 7, name: , idx: 23, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 2, aromatic: True, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 24, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 28, aromatic: False, chiral: False



In [27]:
coverage

{'openff-1.0.0.offxml': {'Constraints': ['c1'],
  'Bonds': ['b5',
   'b84',
   'b2',
   'b3',
   'b20',
   'b17',
   'b1',
   'b83',
   'b7',
   'b14',
   'b39',
   'b38',
   'b86',
   'b87',
   'b4',
   'b8',
   'b10',
   'b26',
   'b29',
   'b19',
   'b9',
   'b15',
   'b12',
   'b6',
   'b13',
   'b35',
   'b33',
   'b32',
   'b18',
   'b40',
   'b16',
   'b25',
   'b11',
   'b36',
   'b37'],
  'Angles': ['a10',
   'a11',
   'a1',
   'a27',
   'a17',
   'a18',
   'a15',
   'a2',
   'a19',
   'a20',
   'a16',
   'a13',
   'a14',
   'a22',
   'a4',
   'a6',
   'a3',
   'a28',
   'a12',
   'a5'],
  'ProperTorsions': ['t44',
   't17',
   't1',
   't20',
   't2',
   't4',
   't85',
   't118',
   't99',
   't50',
   't18',
   't100',
   't9',
   't84',
   't3',
   't47',
   't69',
   't59',
   't51',
   't70',
   't86',
   't61',
   't62',
   't157',
   't98',
   't101',
   't23',
   't22',
   't52',
   't43',
   't75',
   't77',
   't71',
   't45',
   't130',
   't128',
   't68',
   't12
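
The coverage report is a nested dictionary of force field file, then parameter type, then parameter ids, so it can be condensed with plain Python. The sketch below uses a short hand-copied excerpt of the output above; `summarize_coverage` and `sorted_ids` are hypothetical helpers and not part of qcsubmit.

```python
from typing import Dict, List

# Truncated excerpt of the report printed above, copied by hand.
coverage_excerpt: Dict[str, Dict[str, List[str]]] = {
    "openff-1.0.0.offxml": {
        "Constraints": ["c1"],
        "Bonds": ["b5", "b84", "b2"],
        "Angles": ["a10", "a11", "a1"],
    }
}


def summarize_coverage(report: Dict[str, Dict[str, List[str]]]) -> Dict[str, Dict[str, int]]:
    """Count how many distinct parameters of each type are exercised per force field."""
    return {
        force_field: {ptype: len(set(ids)) for ptype, ids in ptypes.items()}
        for force_field, ptypes in report.items()
    }


def sorted_ids(ids: List[str]) -> List[str]:
    """Sort ids such as 'b84' by their numeric part (assumes one letter then digits)."""
    return sorted(ids, key=lambda parameter_id: int(parameter_id[1:]))


summary = summarize_coverage(coverage_excerpt)
bonds_in_order = sorted_ids(coverage_excerpt["openff-1.0.0.offxml"]["Bonds"])
```

Sorting by the numeric part makes it easier to compare the covered ids against the parameter list in the force field file.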

In [28]:
dataset.molecules_to_file('test.smi', 'smi')

In [29]:
dataset.molecules_to_file('test.inchi', 'inchi')

In [30]:
dataset.molecules_to_file('test.inchikey', 'inchikey')

In [31]:
dataset.dict()

{'dataset_name': 'my_dataset',
 'dataset_tagline': 'OpenForcefield single point evaluations.',
 'dataset_type': 'BasicDataSet',
 'method': 'ANI1ccx',
 'basis': None,
 'program': 'torchani',
 'maxiter': 200,
 'driver': <DriverEnum.energy: 'energy'>,
 'scf_properties': ['dipole', 'qudrupole', 'wiberg_lowdin_indices'],
 'spec_name': 'ani1ccx',
 'spec_description': 'ANI1ccx standard specification',
 'priority': 'normal',
 'description': 'my test dataset.',
 'dataset_tags': ['openff'],
 'compute_tag': 'openff',
 'metadata': {'date': '2020-05-12'},
 'provenance': {'qcsubmit': '0+untagged.51.g0fb4d0d.dirty',
  'openforcefield': '0.6.0+383.g20d4b740'},
 'dataset': {'c1ccc(cc1)C[C@H](CC(CN(=O)O)(O)O)C(=O)O': {'index': 'c1ccc(cc1)C[C@H](CC(CN(=O)O)(O)O)C(=O)O',
   'initial_molecules': [{'schema_name': 'qcschema_molecule',
     'schema_version': 2,
     'validated': True,
     'symbols': array(['C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'N',
            'O', 'O', 'O', 'O', 'O', '

In [34]:
from qcsubmit.datasets import BasicDataSet

data = BasicDataSet.parse_obj(dataset.dict())

In [35]:
data

BasicDataSet(dataset_name='my_dataset', method='wB97x-d', basis='DZVP', program='psi4', maxiter=200, driver='energy', scf_properties=['dipole', 'qudrupole', 'wiberg_lowdin_indices'], spec_name='default', spec_description='Standard OpenFF optimization quantum chemistry specification.', client='public', priority='normal', tag='openff', dataset={'c1ccc(cc1)C[C@H](CC(CN(=O)O)(O)O)C(=O)O': {'attributes': {'canonical_smiles': 'c1ccc(cc1)CC(CC(CN(=O)O)(O)O)C(=O)O', 'canonical_isomeric_smiles': 'c1ccc(cc1)C[C@H](CC(CN(=O)O)(O)O)C(=O)O', 'canonical_explicit_hydrogen_smiles': '[H]c1c(c(c(c(c1[H])[H])C([H])([H])C([H])(C(=O)O[H])C([H])([H])C(C([H])([H])N(=O)([H])O[H])(O[H])O[H])[H])[H]', 'canonical_isomeric_explicit_hydrogen_smiles': '[H]c1c(c(c(c(c1[H])[H])C([H])([H])[C@@]([H])(C(=O)O[H])C([H])([H])C(C([H])([H])N(=O)([H])O[H])(O[H])O[H])[H])[H]', 'canonical_isomeric_explicit_hydrogen_mapped_smiles': '[H:20][c:1]1[c:2]([c:4]([c:6]([c:5]([c:3]1[H:22])[H:24])[C:8]([H:25])([H:26])[C@@:11]([H:31])([C:

In [17]:
dataset.export_dataset('data1.yaml')

In [18]:
dataset.export_dataset('data1.json')

In [19]:
dataset.dict()

{'dataset_name': 'my_dataset',
 'method': 'wB97x-d',
 'basis': 'DZVP',
 'program': 'psi4',
 'maxiter': 200,
 'driver': 'energy',
 'scf_properties': ['dipole', 'qudrupole', 'wiberg_lowdin_indices'],
 'spec_name': 'default',
 'spec_description': 'Standard OpenFF optimization quantum chemistry specification.',
 'client': 'public',
 'priority': 'normal',
 'tag': 'openff',
 'dataset': {'c1ccc(cc1)C[C@H](CC(CN(=O)O)(O)O)C(=O)O': {'attributes': {'canonical_smiles': 'c1ccc(cc1)CC(CC(CN(=O)O)(O)O)C(=O)O',
    'canonical_isomeric_smiles': 'c1ccc(cc1)C[C@H](CC(CN(=O)O)(O)O)C(=O)O',
    'canonical_explicit_hydrogen_smiles': '[H]c1c(c(c(c(c1[H])[H])C([H])([H])C([H])(C(=O)O[H])C([H])([H])C(C([H])([H])N(=O)([H])O[H])(O[H])O[H])[H])[H]',
    'canonical_isomeric_explicit_hydrogen_smiles': '[H]c1c(c(c(c(c1[H])[H])C([H])([H])[C@@]([H])(C(=O)O[H])C([H])([H])C(C([H])([H])N(=O)([H])O[H])(O[H])O[H])[H])[H]',
    'canonical_isomeric_explicit_hydrogen_mapped_smiles': '[H:20][c:1]1[c:2]([c:4]([c:6]([c:5]([c:3]1

In [20]:
dataset.dataset

{'c1ccc(cc1)C[C@H](CC(CN(=O)O)(O)O)C(=O)O': {'attributes': {'canonical_smiles': 'c1ccc(cc1)CC(CC(CN(=O)O)(O)O)C(=O)O',
   'canonical_isomeric_smiles': 'c1ccc(cc1)C[C@H](CC(CN(=O)O)(O)O)C(=O)O',
   'canonical_explicit_hydrogen_smiles': '[H]c1c(c(c(c(c1[H])[H])C([H])([H])C([H])(C(=O)O[H])C([H])([H])C(C([H])([H])N(=O)([H])O[H])(O[H])O[H])[H])[H]',
   'canonical_isomeric_explicit_hydrogen_smiles': '[H]c1c(c(c(c(c1[H])[H])C([H])([H])[C@@]([H])(C(=O)O[H])C([H])([H])C(C([H])([H])N(=O)([H])O[H])(O[H])O[H])[H])[H]',
   'canonical_isomeric_explicit_hydrogen_mapped_smiles': '[H:20][c:1]1[c:2]([c:4]([c:6]([c:5]([c:3]1[H:22])[H:24])[C:8]([H:25])([H:26])[C@@:11]([H:31])([C:7](=[O:14])[O:16][H:33])[C:9]([H:27])([H:28])[C:12]([C:10]([H:29])([H:30])[N:13](=[O:15])([H:32])[O:19][H:36])([O:17][H:34])[O:18][H:35])[H:23])[H:21]',
   'molecular_formula': 'C12H17NO6',
   'standard_inchi': 'InChI=1S/C12H17NO6/c14-11(15)10(6-9-4-2-1-3-5-9)7-12(16,17)8-13(18)19/h1-5,10,13,16-17H,6-8H2,(H,14,15)(H,18,19)/t10-/m1

In [20]:
from qcsubmit.datasets import BasicDataSet

data = BasicDataSet.parse_file('data1.json')

In [22]:
data.dataset

{'c1ccc(cc1)C[C@H](CC(CN(=O)O)(O)O)C(=O)O': {'attributes': {'canonical_smiles': 'c1ccc(cc1)CC(CC(CN(=O)O)(O)O)C(=O)O',
   'canonical_isomeric_smiles': 'c1ccc(cc1)C[C@H](CC(CN(=O)O)(O)O)C(=O)O',
   'canonical_explicit_hydrogen_smiles': '[H]c1c(c(c(c(c1[H])[H])C([H])([H])C([H])(C(=O)O[H])C([H])([H])C(C([H])([H])N(=O)([H])O[H])(O[H])O[H])[H])[H]',
   'canonical_isomeric_explicit_hydrogen_smiles': '[H]c1c(c(c(c(c1[H])[H])C([H])([H])[C@@]([H])(C(=O)O[H])C([H])([H])C(C([H])([H])N(=O)([H])O[H])(O[H])O[H])[H])[H]',
   'canonical_isomeric_explicit_hydrogen_mapped_smiles': '[H:20][c:1]1[c:2]([c:4]([c:6]([c:5]([c:3]1[H:22])[H:24])[C:8]([H:25])([H:26])[C@@:11]([H:31])([C:7](=[O:14])[O:16][H:33])[C:9]([H:27])([H:28])[C:12]([C:10]([H:29])([H:30])[N:13](=[O:15])([H:32])[O:19][H:36])([O:17][H:34])[O:18][H:35])[H:23])[H:21]',
   'molecular_formula': 'C12H17NO6',
   'standard_inchi': 'InChI=1S/C12H17NO6/c14-11(15)10(6-9-4-2-1-3-5-9)7-12(16,17)8-13(18)19/h1-5,10,13,16-17H,6-8H2,(H,14,15)(H,18,19)/t10-/m1
