# Creating submission workflows using qcsubmit.

Our aim here is to create reproducible workflows that can process large lists of molecules, applying filtering and other useful operations
such as state enumeration and fragmentation, before formatting the data into a QCArchive dataset that can be
submitted to the public or a local archive instance.

The entire workflow is built up out of modular components, each of which can have its own programmable settings that can be
controlled through the API or via workflow YAML/JSON files.

In this example we will demonstrate how to set up a basic workflow via the API and how it can be exported to a settings file 
which can then be used to reconstruct the entire workflow by another user.

First let's load in the packages.

In [17]:
from qcsubmit.factories import QCFractalDatasetFactory  # load in the factory to process molecules
from qcsubmit import workflow_components                # load in a list of workflow_components
from openforcefield.utils.utils import get_data_file_path  # a util function to load a mini drug bank file
from openforcefield.topology import Molecule


Create the basic QCArchive dataset factory. This is useful for large machine learning datasets, which are collections
of single point calculations provided through the energy/gradient/hessian drivers.

In [18]:
factory = QCFractalDatasetFactory()
# we can view and change any of the basic settings in the factory, such as the QM settings
factory.theory = 'wB97x-d'
# let's look at the class and the settings
print(factory)

theory='wB97x-d' basis='DZVP' program='psi4' maxiter=200 driver='energy' scf_properties=['dipole', 'qudrupole', 'wiberg_lowdin_indices'] client='public' priority='normal' tags='openff' workflow={}


In [19]:
# each of the fields is also validated, as the class is based on pydantic,
# so the following should produce an error
factory.basis = {'test': 1}


ValidationError: 1 validation error for QCFractalDatasetFactory
basis
  str type expected (type=type_error.str)
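
The error above comes from pydantic checking the value on assignment. As a rough illustration of the idea only, and not of how pydantic or qcsubmit implement it, the following stdlib-only sketch uses a hypothetical `Settings` class (the class and its `_field_types` table are assumptions made for this example) that type-checks each field when it is set:

```python
class Settings:
    """Hypothetical stand-in for a validated settings model (not the qcsubmit class)."""

    # expected type for each field; the field names mirror the factory output above
    _field_types = {"theory": str, "basis": str, "maxiter": int}

    def __init__(self, theory="B3LYP-D3BJ", basis="DZVP", maxiter=200):
        self.theory = theory
        self.basis = basis
        self.maxiter = maxiter

    def __setattr__(self, name, value):
        # reject unknown fields and values of the wrong type
        if name not in self._field_types:
            raise AttributeError(f"unknown field: {name}")
        expected = self._field_types[name]
        if not isinstance(value, expected):
            raise TypeError(f"{name}: {expected.__name__} type expected")
        super().__setattr__(name, value)


settings = Settings()
settings.theory = "wB97x-d"       # accepted, the value is a string
message = None
try:
    settings.basis = {"test": 1}  # rejected, the value is a dict
except TypeError as error:
    message = str(error)
```

Because the check runs before the value is stored, a rejected assignment leaves the previous setting untouched.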

Note that the basic settings should be suitable in most cases, as they are those recommended by
Open Force Field and are currently used in the fitting of the most recent force fields. Now let's look at
the workflow components.

In [20]:
# the workflow is a dictionary that contains all of the components that will be executed in order
print(factory.workflow)

{}


In [21]:
# the workflow is also validated so only properly configured workflow components can be added
factory.add_workflow_component(3)

Component 3 rejected as it is not subclass of CustomWorkflowComponent.


As you can see, the number is not a proper workflow component and so it has been rejected from the workflow.
Users can make their own workflow components and add them to the workflow as well, but each must be a subclass
of the CustomWorkflowComponent class with all of the abstract methods and settings implemented. See the creating workflow
components notebook for examples of how to do this.
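
The subclass requirement can be pictured with Python's `abc` module. The sketch below is an assumption-laden illustration: `ComponentBase`, `KeepAll`, `Incomplete` and `add_component` are hypothetical names that stand in for `CustomWorkflowComponent` and `add_workflow_component`, and the real qcsubmit classes may define different abstract methods.

```python
from abc import ABC, abstractmethod


class ComponentBase(ABC):
    """Hypothetical base class standing in for CustomWorkflowComponent."""

    @abstractmethod
    def apply(self, molecules):
        """Return the molecules that pass this component."""


class KeepAll(ComponentBase):
    """Trivial component that passes every molecule through."""

    def apply(self, molecules):
        return list(molecules)


class Incomplete(ComponentBase):
    """Subclass that forgets to implement the abstract method."""


def add_component(workflow, component):
    """Add the component keyed by its class name; return False if it is rejected."""
    if not isinstance(component, ComponentBase):
        return False
    workflow[type(component).__name__] = component
    return True


workflow = {}
accepted_int = add_component(workflow, 3)           # an int is not a component
accepted_keep = add_component(workflow, KeepAll())  # a complete subclass instance

incomplete_rejected = False
try:
    Incomplete()  # abstract methods are missing, so instantiation fails
except TypeError:
    incomplete_rejected = True
```

In this sketch an incomplete subclass fails as soon as it is instantiated, so it can never reach the workflow dictionary.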

Now let's set up a workflow that will filter out some unwanted elements, then filter by molecular weight
before generating conformers for each of the molecules.

In [22]:
# set up the element filter
el_filter = workflow_components.ElementFilter()
# let's view the options available for this filter
print(el_filter)

componet_name='ElementFilter' componet_descripton='Filter out molecules who contain elements not in the allowed element list' componet_fail_message='Molecule contained elements not in the allowed elements list' allowed_elements=['H', 'C', 'N', 'O', 'F', 'P', 'S', 'Cl', 'Br', 'I']


This filter can select elements by atomic symbol or atomic number; we just have to supply a list of
symbols or numbers to the filter. Here let's only keep molecules made up of H, C, N and O, as we would like
to use ANI1 as our QM method.

In [23]:
# set the filter to only keep molecules with these elements
el_filter.allowed_elements = [1, 6, 7, 8]

# now let's add the filter to the workflow
factory.add_workflow_component(el_filter)
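
To make the symbol-or-number behaviour concrete, here is a minimal stdlib-only sketch of how such a filter could normalise its allowed list. The helper names `normalise_allowed` and `passes_element_filter`, and the representation of a molecule as a list of atomic numbers, are assumptions for illustration and are not taken from qcsubmit.

```python
# atomic numbers for the symbols in the default allowed list printed above
SYMBOL_TO_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8, "F": 9,
                    "P": 15, "S": 16, "Cl": 17, "Br": 35, "I": 53}


def normalise_allowed(allowed):
    """Convert a mixed list of symbols and atomic numbers to a set of atomic numbers."""
    return {SYMBOL_TO_NUMBER[item] if isinstance(item, str) else int(item)
            for item in allowed}


def passes_element_filter(atomic_numbers, allowed):
    """Return True if every atom in the molecule is an allowed element."""
    return set(atomic_numbers) <= normalise_allowed(allowed)


# toy molecules given as lists of atomic numbers
water = [8, 1, 1]
chloromethane = [6, 1, 1, 1, 17]

water_ok = passes_element_filter(water, [1, 6, 7, 8])
chloromethane_ok = passes_element_filter(chloromethane, ["H", "C", "N", "O"])
```

Normalising to atomic numbers first means the two input styles can be mixed freely in one list.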

Now we will set up the weight filter and conformer generation components and add them to the workflow.

In [24]:
weight_filter = workflow_components.MolecularWeightFilter()
factory.add_workflow_component(weight_filter)
conf_gen = workflow_components.StandardConformerGenerator()
factory.add_workflow_component(conf_gen)

Now let's look at the workflow and make sure all of the components were added correctly. Then let's save the
settings and workflow so they can be used again later.


In [25]:
print(factory.workflow)

{'ElementFilter': ElementFilter(componet_name='ElementFilter', componet_descripton='Filter out molecules who contain elements not in the allowed element list', componet_fail_message='Molecule contained elements not in the allowed elements list', allowed_elements=[1, 6, 7, 8]), 'MolecularWeightFilter': MolecularWeightFilter(componet_name='MolecularWeightFilter', componet_descripton='Molecules are filtered based on the allowed molecular weights.', componet_fail_message='Molecule weight was not in the specified region.', minimum_weight=130, maximum_weight=781), 'StandardConformerGenerator': StandardConformerGenerator(componet_name='StandardConformerGenerator', componet_descripton='Generate conformations for the given molecules', componet_fail_message='Conformers could not be generated', max_conformers=20, clear_exsiting=True, toolkit='rdkit')}


In [26]:
# now let's save the workflow to JSON
factory.export_settings('work1.json')

# and let's also save it to YAML
factory.export_settings('work1.yaml')

In [27]:
# let's look at the output files
! head -n 20 work1.yaml

basis: DZVP
client: public
driver: energy
maxiter: 200
priority: normal
program: psi4
scf_properties:
- dipole
- qudrupole
- wiberg_lowdin_indices
tags: openff
theory: wB97x-d
workflow:
  ElementFilter:
    allowed_elements:
    - 1
    - 6
    - 7
    - 8
    componet_descripton: Filter out molecules who contain elements not in the allowed


In [28]:
! head -n 20 work1.json

{
  "theory": "wB97x-d",
  "basis": "DZVP",
  "program": "psi4",
  "maxiter": 200,
  "driver": "energy",
  "scf_properties": [
    "dipole",
    "qudrupole",
    "wiberg_lowdin_indices"
  ],
  "client": "public",
  "priority": "normal",
  "tags": "openff",
  "workflow": {
    "ElementFilter": {
      "componet_name": "ElementFilter",
      "componet_descripton": "Filter out molecules who contain elements not in the allowed element list",
      "componet_fail_message": "Molecule contained elements not in the allowed elements list",
      "allowed_elements": [


Now let's make a new workflow factory and load in the settings we just saved to quickly rebuild the workflow.


In [29]:
factory2 = QCFractalDatasetFactory()

# now let's print out the basic workflow factory
print(factory2)

theory='B3LYP-D3BJ' basis='DZVP' program='psi4' maxiter=200 driver='energy' scf_properties=['dipole', 'qudrupole', 'wiberg_lowdin_indices'] client='public' priority='normal' tags='openff' workflow={}


In [30]:
# now load in the settings from the yaml file and print out the factory components

factory2.import_settings('work1.yaml')

# note that the theory has changed and the workflow components have been recreated
print(factory2)

theory='wB97x-d' basis='DZVP' program='psi4' maxiter=200 driver='energy' scf_properties=['dipole', 'qudrupole', 'wiberg_lowdin_indices'] client='public' priority='normal' tags='openff' workflow={'ElementFilter': ElementFilter(componet_name='ElementFilter', componet_descripton='Filter out molecules who contain elements not in the allowed element list', componet_fail_message='Molecule contained elements not in the allowed elements list', allowed_elements=[1, 6, 7, 8]), 'MolecularWeightFilter': MolecularWeightFilter(componet_name='MolecularWeightFilter', componet_descripton='Molecules are filtered based on the allowed molecular weights.', componet_fail_message='Molecule weight was not in the specified region.', minimum_weight=130, maximum_weight=781), 'StandardConformerGenerator': StandardConformerGenerator(componet_name='StandardConformerGenerator', componet_descripton='Generate conformations for the given molecules', componet_fail_message='Conformers could not be generated', max_conform
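
The export and import steps amount to a serialisation round trip. The sketch below shows that idea with the standard `json` module only. It does not call qcsubmit, the `settings` dictionary is a hand-written subset of the fields shown in the output above, and YAML is left out because it needs a third-party package.

```python
import json
import os
import tempfile

# hand-written subset of the exported settings, for illustration only
settings = {
    "theory": "wB97x-d",
    "basis": "DZVP",
    "workflow": {
        "ElementFilter": {"allowed_elements": [1, 6, 7, 8]},
        "MolecularWeightFilter": {"minimum_weight": 130, "maximum_weight": 781},
    },
}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "work1.json")
    # write the settings out ...
    with open(path, "w") as handle:
        json.dump(settings, handle, indent=2)
    # ... and read them back in, as another user would
    with open(path) as handle:
        restored = json.load(handle)

# the components come back in the order they were written
component_order = list(restored["workflow"])
```

Keeping the component order intact matters here because the workflow components are executed in order.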

Now we can run the workflow on a set of molecules from the mini drug bank file that comes with the openforcefield package.

In [31]:
mols = Molecule.from_file(get_data_file_path('molecules/minidrugbank.sdf'), allow_undefined_stereo=True)

Problematic atoms are:
Atom atomic num: 7, name: , idx: 20, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 4, aromatic: True, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 16, aromatic: False, chiral: True
bond order: 1, chiral: False to atom atomic num: 8, name: , idx: 28, aromatic: False, chiral: False

Problematic atoms are:
Atom atomic num: 7, name: , idx: 20, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 4, aromatic: True, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 16, aromatic: False, chiral: True
bond order: 1, chiral: False to atom atomic num: 8, name: , idx: 28, aromatic: False, chiral: False

Problematic atoms are:
Atom atomic num: 7, name: , idx: 21, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 12, aromatic: True, chiral: False

In [32]:
# create the dataset ready for submission
dataset = factory2.create_dataset('my_dataset', mols)

RDKit ERROR: [13:05:45] Explicit valence for atom # 35 O, 3, is greater than permitted
RDKit ERROR: [13:06:03] Explicit valence for atom # 11 N, 5, is greater than permitted


The final component result molecules [Molecule with name 'DrugBank_5415' and SMILES '[H]c1c(c(c(c(c1[H])[H])C([H])([H])[C@@]([H])(C(=O)O[H])C([H])([H])C(C([H])([H])N(=O)([H])O[H])(O[H])O[H])[H])[H]', Molecule with name 'DrugBank_2998' and SMILES '[H]c1c(c(c(c(c1C(=O)O[H])[H])N([H])C([H])(N([H])[H])N([H])[H])N([H])C(=O)C([H])([H])[H])[H]', Molecule with name 'DrugBank_3101' and SMILES '[H]c1c(c(c(c(c1[H])[H])C([H])([H])OC(=O)N([H])[C@]([H])(C(=O)N([H])[C@@]2(C(C(N(C2([H])[H])C#N)([H])[H])([H])[H])[H])C([H])([H])C([H])(C([H])([H])[H])C([H])([H])[H])[H])[H]', Molecule with name 'DrugBank_3175' and SMILES '[H]c1c(c(c(c(c1[H])[H])C([H])([H])OC(=O)N([H])[C@]([H])(C(=O)N([H])[C@]([H])(C(=O)C([H])([H])[H])C([H])([H])[H])C([H])([H])C([H])([H])C([H])([H])N([H])C([H])(N([H])[H])N([H])[H])[H])[H]', Molecule with name 'DrugBank_5711' and SMILES '[H]c1c(c(c(c(c1[H])[H])C2=NN3C(=C(C(=C(C3=C2c4c(c5c(nn4)N(N=C5N([H])[H])[H])[H])[H])[H])[H])[H])[H])[H]', Molecule with name 'DrugBank_3269' and SMILES '[H

The filtered molecules {'ElementFilter': {'component_description': 'Filter out molecules who contain elements not in the allowed element list', 'component_fail_message': 'Molecule contained elements not in the allowed elements list', 'molecules': [Molecule with name 'DrugBank_5354' and SMILES '[H]c1c(c(c(c(c1[H])[H])Oc2c(c(c(c(c2[H])C([H])([H])C([H])([H])C([H])([H])[C@]([H])(P(=O)(O[H])O[H])S(=O)(=O)O[H])[H])[H])[H])[H])[H]', Molecule with name 'DrugBank_2791' and SMILES '[H]c1c(c(c(c(c1C([H])([H])[H])[H])[H])S(=O)(=O)O[H])[H]', Molecule with name 'DrugBank_5373' and SMILES '[H]c1c(c(c(c(c1C(C([H])([H])[H])(C([H])([H])[H])C([H])([H])[H])[H])[H])S(=O)(=O)O[H])[H]', Molecule with name 'DrugBank_2799' and SMILES '[H]c1c(c(c(c(c1C([H])([H])O[C@@]([H])(O[H])SC([H])([H])[C@]([H])(C(=O)N([H])C([H])([H])C(=O)O[H])N([H])C(=O)C([H])([H])C([H])([H])[C@@]([H])(C(=O)O[H])N([H])[H])[H])[H])N(=O)([H])O[H])[H]', Molecule with name 'DrugBank_2800' and SMILES '[H]c1c(c(c2c(c1[H])N=C(S2)SC([H])([H])C([H]
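
The two outputs above list the molecules that passed every component and, per component, the molecules that were removed. A minimal sketch of that bookkeeping, assuming each component can be reduced to a keep/discard predicate, could look like the following. `run_workflow` and the toy molecule tuples are hypothetical and the actual qcsubmit implementation may differ; the weight limits are the defaults printed earlier.

```python
def run_workflow(components, molecules):
    """Apply each (name, predicate) pair in order.

    Returns the molecules that passed every component and a dict mapping
    each component name to the molecules it removed.
    """
    remaining = list(molecules)
    filtered = {}
    for name, keep in components.items():
        filtered[name] = [mol for mol in remaining if not keep(mol)]
        remaining = [mol for mol in remaining if keep(mol)]
    return remaining, filtered


# toy molecules: (name, set of atomic numbers, approximate molecular weight)
molecules = [
    ("benzoic_acid", {1, 6, 8}, 122.1),
    ("caffeine", {1, 6, 7, 8}, 194.2),
    ("chlorobenzene", {1, 6, 17}, 112.6),
]

components = {
    "ElementFilter": lambda mol: mol[1] <= {1, 6, 7, 8},
    "MolecularWeightFilter": lambda mol: 130 <= mol[2] <= 781,
}

passed, filtered = run_workflow(components, molecules)
```

Each molecule is recorded against the first component that rejects it, so later components only see molecules that are still in play.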