# [WIP] QCArchive Interface 

Here we show how to safely create openforcefield molecules from data in the QCArchive using the cmiles entries. Specifically, we use the canonical_isomeric_explicit_hydrogen_mapped_smiles attribute, which is metadata stored at the entry level of a collection.
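
To make the role of the mapped SMILES concrete, here is a minimal sketch (not the toolkit's own parser) that reads the `:n` map indices out of such a string and lists the elements in map order. It assumes that map index n corresponds to atom n-1 in the stored geometry, which is consistent with the entry shown later in this notebook. The helper name `symbols_in_map_order` is hypothetical, and the regular expression only handles simple bracket atoms.

```python
import re

# Matches a bracket atom carrying a map index, e.g. [H:38], [c:1] or [C@@H:5].
_MAPPED_ATOM = re.compile(r"\[([A-Z][a-z]?|[a-z])[^\]:]*:(\d+)\]")


def symbols_in_map_order(mapped_smiles):
    """Return element symbols sorted by their atom-map index."""
    atoms = [
        (int(index), symbol.capitalize())
        for symbol, index in _MAPPED_ATOM.findall(mapped_smiles)
    ]
    return [symbol for _, symbol in sorted(atoms)]


# Methanol written in an arbitrary SMILES order but with an explicit atom map.
print(symbols_in_map_order("[H:3][C:1]([H:4])([H:5])[O:2][H:6]"))
```

For the methanol string the heavy atoms come first (C, O) followed by the four hydrogens, even though a hydrogen is written first in the SMILES.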

First, load up the client you wish to connect to. In this case, we use the public instance.

In [1]:
import qcportal as ptl
from openforcefield.topology import Molecule

client = ptl.FractalClient()
# list the collections available
client.list_collections()



collection,name,tagline
Dataset,ANI-1,22 million off-equilibrium conformations and e...
Dataset,COMP6 ANI-MD,Benchmark containing MD trajectories from the ...
Dataset,COMP6 DrugBank,Benchmark containing DrugBank off-equilibrium ...
Dataset,COMP6 GDB10to13,Benchmark containing off-equilibrium molecules...
Dataset,COMP6 GDB7to9,Benchmark containing off-equilibrium molecules...
...,...,...
TorsionDriveDataset,OpenFF Primary TorsionDrive Benchmark 1,
TorsionDriveDataset,OpenFF Substituted Phenyl Set 1,
TorsionDriveDataset,Pfizer Discrepancy Torsion Dataset 1,
TorsionDriveDataset,SMIRNOFF Coverage Torsion Set 1,


Now let us grab a molecule from an optimization dataset.

In [2]:
ds = client.get_collection('OptimizationDataset', 'Kinase Inhibitors: WBO Distributions')

Take the first entry from the collection. 

In [3]:
entry = ds.get_entry(ds.df.index[0])

We can view the entry in detail by looking at the dictionary representation.

In [4]:
entry.dict()

{'name': 'Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C-0',
 'initial_molecule': '9589274',
 'additional_keywords': {},
 'attributes': {'canonical_explicit_hydrogen_smiles': '[H]c1c(c(c(nc1[H])[H])c2c(c(nc(n2)N([H])c3c(c(c(c(c3C([H])([H])[H])[H])[H])N([H])C(=O)c4c(c(c(c(c4[H])[H])C([H])([H])N5C(C(N(C(C5([H])[H])([H])[H])C([H])([H])[H])([H])[H])([H])[H])[H])[H])[H])[H])[H])[H]',
  'canonical_isomeric_explicit_hydrogen_mapped_smiles': '[H:38][c:1]1[c:2]([c:14]([c:13]([n:30][c:11]1[H:48])[H:50])[c:20]2[c:9]([c:12]([n:31][c:21]([n:32]2)[N:35]([H:67])[c:19]3[c:10]([c:18]([c:8]([c:7]([c:17]3[C:27]([H:59])([H:60])[H:61])[H:44])[H:45])[N:36]([H:68])[C:22](=[O:37])[c:15]4[c:3]([c:5]([c:16]([c:6]([c:4]4[H:41])[H:43])[C:29]([H:65])([H:66])[N:34]5[C:25]([C:23]([N:33]([C:24]([C:26]5([H:57])[H:58])([H:53])[H:54])[C:28]([H:62])([H:63])[H:64])([H:51])[H:52])([H:55])[H:56])[H:42])[H:40])[H:47])[H:49])[H:46])[H:39]',
  'canonical_isomeric_explicit_hydrogen_smiles': '[H]c1c(c(c(nc1[H])[H])c2

Now we can make a molecule using a few different input options.

In [5]:
# first make a molecule using this record object
mol_record = Molecule.from_qcschema(entry)

# we could have also used the dictionary representation of the object
mol_dict = Molecule.from_qcschema(entry.dict(encoding='json'))

In [6]:
# check that the molecule has been ordered to match the ordering used in the database
# by printing out the atomic numbers of both objects in order

# first let's get the initial molecule from the database
initial_mol = client.query_molecules(id=entry.initial_molecule)[0]

for atoms in zip(mol_record.atoms, initial_mol.atomic_numbers):
    print(atoms[0].atomic_number, atoms[1])

# we can also check that the molecules are the same regardless of how they are made 
assert mol_dict == mol_record

6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
6 6
7 7
7 7
7 7
7 7
7 7
7 7
7 7
8 8
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1


In [7]:
# we can also compare the graph representations of the molecules to make sure they are in the same order
import networkx as nx

# make a graph of the initial molecule using networkx and the data in the record
initial_network = nx.Graph()
for i, atom_num in enumerate(initial_mol.atomic_numbers):
    initial_network.add_node(i, atomic_number=atom_num)
    
for bond in initial_mol.connectivity:
    initial_network.add_edge(*bond[:2])
# now we can use the new isomorphic check to get the atom mapping
isomorphic, atom_map = Molecule.are_isomorphic(mol_record, 
                                               initial_network, 
                                               return_atom_map=True,
                                               aromatic_matching=False,
                                               formal_charge_matching=False,
                                               bond_order_matching=False,
                                               bond_stereochemistry_matching=False,
                                               atom_stereochemistry_matching=False)

# print whether the graphs were found to be isomorphic, followed by the atom mapping
# the atoms are in the same order here, as the indexes are the same in the mapping
print(isomorphic)
print(atom_map)

True
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31, 32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47, 48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63, 64: 64, 65: 65, 66: 66, 67: 67}
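
Reading a long mapping by eye is error prone, so the identity check can be done in code. The following is a small sketch with hypothetical helper names. It assumes the returned dictionary maps indices in the first object to indices in the second, which is how the output above reads. The second helper shows how per-atom data could be reordered if the mapping were not the identity.

```python
def is_identity_map(atom_map):
    """True when every index maps to itself, i.e. both objects share one atom order."""
    return all(key == value for key, value in atom_map.items())


def reordered(values, atom_map):
    """Move per-atom values from the first object's order into the second's."""
    out = [None] * len(values)
    for source, target in atom_map.items():
        out[target] = values[source]
    return out


print(is_identity_map({0: 0, 1: 1, 2: 2}))
print(reordered(["a", "b", "c"], {0: 2, 1: 0, 2: 1}))
```

In the notebook, `assert is_identity_map(atom_map)` would replace the visual inspection.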


Now that we have seen how to make the molecule, let's look at getting the geometry as well, since we currently have none.

In [8]:
# check there is no geometry for the molecule
assert mol_record.n_conformers == 0

# if we also want the input geometry for the molecule, we just need to pass the relevant client instance
mol_dict = Molecule.from_qcschema(entry.dict(encoding='json'), client=client)

# check that there is a conformation
mol_dict.n_conformers

1

In [9]:
# thanks to the qcschema method we also get visualisation for free, along with being able to compute
# properties like energy, gradient and hessian with qcengine using QM, rdkit, openmm, or ani1
mol_dict.to_qcschema()

_ColormakerRegistry()

NGLWidget()

Here we will try to compute the energy using RDKit. Only run this cell if RDKit and qcengine are installed.

In [14]:
# for example, this molecule's energy can be computed with the UFF force field in RDKit as follows
import qcengine

# set up the RDKit task
rdkit_task = {"schema_name": "qcschema_input",
              "schema_version": 2,
              "molecule": mol_dict.to_qcschema(),
              "driver": "energy",
              "model": {"method": 'UFF', "basis": None},
              "keywords": {"scf_type": "df"}}

# now let's compute the energy using qcengine and RDKit and print the result
qcengine.compute(rdkit_task, 'rdkit').return_result

0.053473234080915956

# Adding Final conformations

This section sketches some ideas for how we might safely add conformations from optimization trajectories or final molecule conformations.

## The problem

During optimization, connectivity can change, and because the connectivity records are simply propagated from the input structure, they are unreliable sources from which to create OFFMols. However, Wiberg bond orders (WBO) have now been calculated for many datasets like this one, and these could be used to infer connectivity and check that the conformer is still valid for the input connectivity.

At the entry level we also have two different final molecules, depending on the run settings. Let's start with how to attach the final molecule from the default run.

In [15]:
# load up the optimization record
opt_default = client.query_procedures(id=9527898)[0]

In [16]:
# look at the details of the job
opt_default.dict()

{'id': '9527898',
 'hash_index': 'dc5a29efea828dd45763a2fca034f56b98e481ab',
 'procedure': 'optimization',
 'program': 'geometric',
 'version': 1,
 'protocols': {},
 'extras': {},
 'stdout': '13566247',
 'stderr': None,
 'error': None,
 'task_id': None,
 'manager_name': 'lilac_multithread-lt19-70eff68f-502c-4deb-91b8-f5b373b646ef',
 'status': <RecordStatusEnum.complete: 'COMPLETE'>,
 'modified_on': datetime.datetime(2019, 12, 29, 2, 39, 7, 482239),
 'created_on': datetime.datetime(2019, 12, 8, 17, 1, 5, 552791),
 'provenance': {'creator': 'geomeTRIC',
  'version': '0.9.7.2',
  'routine': 'geometric.run_json.geometric_run_json',
  'username': 'chodera',
  'wall_time': 10380.749236822128,
  'qcengine_version': 'v0.13.0',
  'hostname': 'lt19',
  'cpu': 'Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz'},
 'schema_version': 1,
 'initial_molecule': '9589274',
 'qc_spec': {'driver': <DriverEnum.gradient: 'gradient'>,
  'method': 'b3lyp-d3(bj)',
  'basis': 'dzvp',
  'keywords': '2',
  'program': 'ps

In [17]:
# look at the final molecule
opt_default.get_final_molecule()

NGLWidget()

In [29]:
# this conformation looks valid but we should check it using the WBO that was calculated
# grab the last gradient calculation from the traj
last_calc = opt_default.get_trajectory()[-1]

In [30]:
# make sure this was done on the final geometry
assert last_calc.molecule == opt_default.final_molecule

In [31]:
# now let's grab the flat WBO list
wbo = last_calc.extras['qcvars']['WIBERG_LOWDIN_INDICES']

In [32]:
# now form it into an N x N array, where N is the number of atoms in the molecule
import numpy as np

wbo_array = np.array(wbo).reshape(-1, mol_dict.n_atoms)

In [33]:
wbo_array

array([[0.00000000e+00, 1.55415073e+00, 5.49124559e-05, ...,
        6.15653758e-07, 3.55044068e-06, 7.37668585e-06],
       [1.55415073e+00, 0.00000000e+00, 1.10352077e-04, ...,
        6.56349635e-07, 2.68150435e-05, 1.63365792e-05],
       [5.49124559e-05, 1.10352077e-04, 0.00000000e+00, ...,
        1.05862549e-03, 4.17623676e-06, 9.79195744e-03],
       ...,
       [6.15653758e-07, 6.56349635e-07, 1.05862549e-03, ...,
        0.00000000e+00, 1.12429462e-08, 5.70559477e-06],
       [3.55044068e-06, 2.68150435e-05, 4.17623676e-06, ...,
        1.12429462e-08, 0.00000000e+00, 2.41934552e-05],
       [7.37668585e-06, 1.63365792e-05, 9.79195744e-03, ...,
        5.70559477e-06, 2.41934552e-05, 0.00000000e+00]])
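
The reshape silently produces a wrong-shaped array if the flat list does not hold exactly N x N values. As a hedged sketch in plain Python (the function names are hypothetical, and row-major order is assumed, matching the reshape call above), the conversion can be guarded like this, with an extra symmetry check since a bond-order matrix should be symmetric:

```python
import math


def square_from_flat(flat_values, n_atoms=None):
    """Turn a flat, row-major list into an N x N nested list, validating the size."""
    n = math.isqrt(len(flat_values))
    if n * n != len(flat_values):
        raise ValueError(f"{len(flat_values)} values cannot form a square matrix")
    if n_atoms is not None and n != n_atoms:
        raise ValueError(f"matrix is {n} x {n} but the molecule has {n_atoms} atoms")
    return [list(flat_values[row * n:(row + 1) * n]) for row in range(n)]


def is_symmetric(matrix, tol=1e-8):
    """Check that matrix[i][j] and matrix[j][i] agree within a tolerance."""
    n = len(matrix)
    return all(
        abs(matrix[i][j] - matrix[j][i]) <= tol
        for i in range(n)
        for j in range(i + 1, n)
    )


# Toy two-atom example with made-up values.
toy = square_from_flat([0.0, 1.5, 1.5, 0.0], n_atoms=2)
print(toy, is_symmetric(toy))
```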

In [34]:
# now lets make a function which can build a graphical representation of this array 
# which we can check is the same as our molecule. 
import networkx as nx
from simtk.openmm.app import Element


# NOTE: we have to define a cutoff; any WBO below it is not counted as a bond
def get_conection_graph(molecule_record, wbo_array, wbo_cutoff):
    "use the molecule record and wbo_array to build a graph of the molecule"

    # loop through the array and create a set of connections,
    # storing each pair only once regardless of direction
    connections = set()
    for i in range(wbo_array.shape[0]):
        for j in range(wbo_array.shape[1]):
            if wbo_array[i, j] >= wbo_cutoff:
                if (j, i) in connections:
                    continue
                else:
                    connections.add((i, j))

    # build a networkx representation of the molecule
    topology = nx.Graph()
    for i, symbol in enumerate(molecule_record.symbols):
        topology.add_node(i, atomic_number=Element.getBySymbol(symbol).atomic_number)

    for bond in connections:
        topology.add_edge(bond[0], bond[1])

    return topology

In [35]:
# Now use the function to generate a graph of the final molecule, using a cutoff of 0.8
final_mol_graph = get_conection_graph(opt_default.get_final_molecule(), wbo_array, 0.8)

In [36]:
# Use the new isomorphic function to check if the networkx graph is isomorphic to the molecule
# the network only has atomic numbers and connections, so turn off all restrictive matching
isomorphic, _ = Molecule.are_isomorphic(mol_dict, 
                                       final_mol_graph, 
                                       return_atom_map=False, 
                                       atom_stereochemistry_matching=False, 
                                       bond_stereochemistry_matching=False, 
                                       aromatic_matching=False,
                                       bond_order_matching=False,
                                       formal_charge_matching=False)

In [37]:
# now if we check we see that the graph is not the same at this cutoff and the conformer would be rejected
isomorphic

False
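
When the check fails, it helps to see which bonds are responsible. The following is a minimal sketch that lists expected bonds falling below the cutoff and unexpected pairs rising above it. It assumes the WBO matrix and the expected connectivity share the same atom order, as verified for this entry earlier. The helper names and the three-atom numbers are hypothetical, not values from the dataset.

```python
def bonds_above_cutoff(wbo_matrix, cutoff):
    """Return the set of (i, j) pairs with i < j whose WBO is at or above the cutoff."""
    n = len(wbo_matrix)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if wbo_matrix[i][j] >= cutoff
    }


def connectivity_diff(expected_bonds, wbo_matrix, cutoff):
    """Return (missing, extra) bonds relative to the expected connectivity.

    expected_bonds holds entries such as (i, j) or (i, j, order); only the
    first two items of each entry are used.
    """
    expected = {tuple(sorted(bond[:2])) for bond in expected_bonds}
    found = bonds_above_cutoff(wbo_matrix, cutoff)
    return sorted(expected - found), sorted(found - expected)


# Made-up three-atom chain: one strong bond, one weaker bond, one non-bonded pair.
toy_wbo = [[0.00, 0.95, 0.01],
           [0.95, 0.00, 0.78],
           [0.01, 0.78, 0.00]]
toy_bonds = [(0, 1, 1.0), (1, 2, 1.0)]

print(connectivity_diff(toy_bonds, toy_wbo, 0.80))
print(connectivity_diff(toy_bonds, toy_wbo, 0.75))
```

In the toy data the bond between atoms 1 and 2 is reported missing at a cutoff of 0.80 and nothing is reported at 0.75, the same pattern as the two cutoffs tried in this notebook.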

In [38]:
# we could of course lower this requirement and try again
final_mol_graph = get_conection_graph(opt_default.get_final_molecule(), wbo_array, 0.75)
isomorphic, _ = Molecule.are_isomorphic(mol_dict, 
                                       final_mol_graph, 
                                       return_atom_map=False, 
                                       atom_stereochemistry_matching=False, 
                                       bond_stereochemistry_matching=False, 
                                       aromatic_matching=False,
                                       bond_order_matching=False,
                                       formal_charge_matching=False)
print(isomorphic)

True


The molecule conformer would now be accepted and could be safely added to the molecule.
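
Rather than trying cutoffs one at a time, the range of cutoffs that reproduces a given connectivity can be read directly from the matrix. With the `>=` rule used above, a cutoff works exactly when it is greater than the largest WBO among non-bonded pairs and no greater than the smallest WBO among bonded pairs. The sketch below assumes matching atom order between the matrix and the expected bonds; the function name and toy numbers are hypothetical.

```python
def cutoff_window(expected_bonds, wbo_matrix):
    """Return (strongest_nonbonded, weakest_bonded) WBO values.

    Any cutoff c with strongest_nonbonded < c <= weakest_bonded reproduces
    the expected connectivity; if no such c exists the conformer cannot match.
    """
    expected = {tuple(sorted(bond[:2])) for bond in expected_bonds}
    n = len(wbo_matrix)
    bonded, nonbonded = [], []
    for i in range(n):
        for j in range(i + 1, n):
            target = bonded if (i, j) in expected else nonbonded
            target.append(wbo_matrix[i][j])
    return max(nonbonded, default=0.0), min(bonded, default=float("inf"))


# Made-up three-atom chain, not values from the dataset.
toy_wbo = [[0.00, 0.95, 0.01],
           [0.95, 0.00, 0.78],
           [0.01, 0.78, 0.00]]
low, high = cutoff_window([(0, 1), (1, 2)], toy_wbo)
print(low, high)
```

For the toy matrix the window runs from 0.01 to 0.78, so 0.75 is accepted and 0.80 is not. A very narrow or empty window would be a sign that the geometry really has changed its bonding.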

## How do we automate this, and should we automate it?

If this were added to the method that makes molecules from the QCArchive, we would need some extra keyword arguments to control how we handle entries with multiple procedures in the object map. It might look something like:

In [18]:
mol.from_qcschema(entry_instance/dict, client=None, attach_final_molecules=False, wbo_cutoff=0.75, spec_name='default')

NameError: name 'mol' is not defined

Alternatively, the spec_name could be passed through attach_final_molecules_from, with None as the default, which would not gather any final molecules:

In [19]:
mol.from_qcschema(entry_instance/dict, client=None, attach_final_molecules_from=None, wbo_cutoff=0.75)

NameError: name 'mol' is not defined

then to pull final molecules from the 'default' run

In [20]:
mol.from_qcschema(entry_instance/dict, client=None, attach_final_molecules_from='default', wbo_cutoff=0.75)

NameError: name 'mol' is not defined

Or we could extend the add_conformer API to accept optimization or TorsionDrive records:

In [21]:
mol.add_conformer(opt/torsiondrive_record, wbo_cutoff=0.75)

NameError: name 'mol' is not defined

In this case, we do not have to worry about the spec name, as the user supplies a record for the spec they want. All we have to do is gather the final molecules, build their graphs from the WBO, and compare the connectivity. If it does not match, we can raise an error about trying to attach an invalid conformer.
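
The fail-loudly behaviour described above could be sketched as follows. This is purely illustrative and is not part of any existing toolkit API: `ConnectivityError` and `add_checked_conformer` are hypothetical names, a plain list stands in for the molecule's conformers, and the WBO matrix is assumed to use the same atom order as the expected bonds.

```python
class ConnectivityError(ValueError):
    """Raised when WBO-derived bonds differ from the expected connectivity."""


def add_checked_conformer(conformers, coordinates, expected_bonds,
                          wbo_matrix, wbo_cutoff=0.75):
    """Append coordinates to conformers only if the WBO connectivity matches."""
    expected = {tuple(sorted(bond[:2])) for bond in expected_bonds}
    n = len(wbo_matrix)
    found = {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if wbo_matrix[i][j] >= wbo_cutoff
    }
    if found != expected:
        raise ConnectivityError(
            f"invalid conformer at cutoff {wbo_cutoff}: "
            f"missing bonds {sorted(expected - found)}, "
            f"extra bonds {sorted(found - expected)}"
        )
    conformers.append(coordinates)
    return len(conformers)


# Made-up two-atom example: the single bond has a WBO of 0.9.
conformers = []
wbo = [[0.0, 0.9], [0.9, 0.0]]
coords = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.1)]
print(add_checked_conformer(conformers, coords, [(0, 1)], wbo))
```

Raising instead of silently skipping keeps the decision with the user, who can retry with a different cutoff or discard the record.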