Infer bond orders and formal charges #1828
Comments
This seems like a solid idea. My major concerns are:
Anyway, the biggest thing I'd like to see to approve this moving forward in our ecosystem is a benchmark that it actually works. So the first step toward either of these outcomes is running the InChI benchmark. Assorted notes:
|
That's a great idea for the test. I'll try writing something along those lines and see how MDAnalysis does. |
The following test goes through all the monomers from the DES370K dataset (just under 400 of them). For each one it performs the series of transformations SDF -> OpenFF Molecule -> PDB -> MDAnalysis Universe -> RDKit Molecule -> OpenFF Molecule. It doesn't find any errors.

```python
from openff.toolkit import Molecule
import MDAnalysis as mda
import os

sdf_dir = '/Users/peastman/workspace/spice-dataset/des370k/SDFS'
errors = 0
for filename in os.listdir(sdf_dir):
    # SDF -> OpenFF Molecule -> PDB
    mol = Molecule(os.path.join(sdf_dir, filename), allow_undefined_stereo=True)
    mol.to_file('temp.pdb', 'PDB')
    # PDB -> MDAnalysis Universe -> RDKit Molecule -> OpenFF Molecule
    u = mda.Universe('temp.pdb')
    mol2 = Molecule(u.atoms.convert_to('RDKit', force=True), allow_undefined_stereo=True)
    # The round trip succeeded if both molecules give the same InChI
    if mol.to_inchi() != mol2.to_inchi():
        print(filename, mol.to_inchi(), mol2.to_inchi())
        errors += 1
print(errors, 'errors')
```

That's the biggest set of SDF files I happened to have sitting around. I have SMILES strings for about 400,000 PubChem molecules that can make a much bigger test. |
Here is how the Free Software Foundation interprets it according to their FAQ. When you load a GPL library into memory, you link it to all the other code running in that process. All that code together becomes a derived product and must be licensed under the GPL. Therefore, any code that you load into the same process as a GPL library must be available under a GPL-compatible license (one that allows relicensing it as GPL). That isn't a problem for OpenFF Toolkit itself because the MIT license is GPL-compatible. But the OpenEye toolkit is not. If you load both the OpenEye toolkit and MDAnalysis into the same process, you're violating the license. |
I'm also not incredibly familiar with licenses, but very simplistically the way I understand it is code that runs |
I'm in favor of not depending on GPL code at runtime, certainly not adding new GPL dependencies now that I am aware of this argument that side-by-side imports are a license violation. (I previously thought Python's import library got around GPL via magic I didn't understand, but now I'm not sure.) The toolkit tries to load

```
In [1]: from openff.toolkit import Molecule

In [2]: import os, sys

In [3]: os.path.isfile("/Users/mattthompson/.oe_license.txt")
Out[3]: True

In [4]: "openeye" in sys.modules
Out[4]: True
```

so I'm already needing to go back and see if some tests I wrote in the past are infected (presumably this is why OpenFE has mixed licenses in its ecosystem?). |
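The same `sys.modules` lookup can be wrapped in a small helper to see which optional toolkits have ended up in the current process. This is only a sketch: the name `loaded_packages` is hypothetical, and the check reports what has been imported, nothing about licensing.

```python
import sys


def loaded_packages(names):
    """Return the names from `names` that are already imported in this process."""
    return [name for name in names if name in sys.modules]


# Report which of these toolkits have been imported so far (possibly none).
print(loaded_packages(["openeye", "MDAnalysis", "rdkit"]))
```

Running it right after `from openff.toolkit import Molecule` would show which backends that import pulled in on a given machine.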
This version of the test runs through the PubChem molecules.

```python
from openff.toolkit import Molecule
import MDAnalysis as mda

errors = 0
with open('/Users/peastman/workspace/spice-dataset/pubchem/sorted.txt') as f:
    for line in f:
        mol_id, smiles = line.split()
        # SMILES -> OpenFF Molecule with one conformer -> PDB
        mol = Molecule.from_smiles(smiles, allow_undefined_stereo=True)
        mol.generate_conformers(n_conformers=1)
        mol.to_file('temp.pdb', 'PDB')
        # PDB -> MDAnalysis Universe -> RDKit Molecule -> OpenFF Molecule
        u = mda.Universe('temp.pdb')
        mol2 = Molecule(u.atoms.convert_to('RDKit', force=True), allow_undefined_stereo=True)
        if mol.to_inchi() != mol2.to_inchi():
            print('Error:', smiles)
            print(mol.to_inchi())
            print(mol2.to_inchi())
            errors += 1
print(errors, 'errors')
```

It does report some errors. Here are a few examples.
It looks to me like these are mostly cases where it's getting the total charge wrong. Unlike the MDAnalysis routine, the |
This version uses RDKit to read the PDB file and fill in missing information.

```python
from openff.toolkit import Molecule
from rdkit import Chem

errors = 0
with open('/Users/peastman/workspace/spice-dataset/pubchem/sorted.txt') as f:
    for line in f:
        mol_id, smiles = line.split()
        # SMILES -> OpenFF Molecule with one conformer -> PDB
        mol = Molecule.from_smiles(smiles, allow_undefined_stereo=True)
        mol.generate_conformers(n_conformers=1)
        mol.to_file('temp.pdb', 'PDB')
        # PDB -> RDKit Molecule (keeping hydrogens) -> OpenFF Molecule
        rdmol = Chem.MolFromPDBFile('temp.pdb', removeHs=False)
        mol2 = Molecule(rdmol, allow_undefined_stereo=True)
        if mol.to_inchi() != mol2.to_inchi():
            print('Error:', smiles)
            print(mol.to_inchi())
            print(mol2.to_inchi())
            errors += 1
print(errors, 'errors')
```

When it fails, it generally knows something has gone wrong and prints an error message.
|
I ran 10,000 PubChem molecules through the above code. Here's how it did. 9862 succeeded. It made it through all the transformations, and the final molecule was identical to the initial one. 122 reported errors in reading the PDB file and 16 made it through, but the final molecule was different from the initial one. Here are some of the molecules in that last category.
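Expressed as rates, the counts above work out as follows. This is a small arithmetic sketch; the category labels are my own wording for the three outcomes described.

```python
total = 10_000
outcomes = {
    "round trip succeeded": 9862,
    "error reported while reading PDB": 122,
    "no error, but molecule changed": 16,
}

# The three categories should account for every molecule tested.
assert sum(outcomes.values()) == total

rates = {label: 100 * count / total for label, count in outcomes.items()}
for label, pct in rates.items():
    print(f"{label}: {pct:.2f}%")
```

The last category is the one that matters most for safety, since those molecules come back wrong without any warning.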
|
I think this would be a great feature - but it does add ambiguity when hydrogens are missing. If I pass in a PDB with graph Since PDB files in particular usually do not include hydrogens, we should be very careful doing this by default. Even a check that fails if there are no hydrogens would not be sufficient to make this safe, as PDB files commonly include non-polar hydrogens. I would be in favour of a false-by-default, well documented |
Just chiming in on a couple of thoughts with my MDA & OpenFE hats on:
To clarify, the strategy here is that the core openfe toolkit isn't importing
Couple of thoughts here:
|
It looks like most of those errors aren't reproducible. They happen when RDKit incorrectly infers the bonds based on coordinates. Since the script calls That does suggest a workaround. It's easy to check whether it inferred the correct set of bonds. If it didn't, generate a new random conformation and try again. Of course, it would be even better if RDKit would use the actual bonds specified in the PDB file, not insist on ignoring them and selecting new bonds based on coordinates. |
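The retry workaround could be sketched roughly like this. Everything here is a hypothetical stand-in: `make_conformer` and `infer_bonds` are placeholders for whatever generates coordinates and whatever infers bonds from them, and the toy 1D example exists only to make the sketch runnable.

```python
def infer_with_retries(make_conformer, infer_bonds, expected_bonds, max_attempts=5):
    """Generate fresh conformers until the bonds inferred from them match.

    make_conformer: callable returning a new set of coordinates (hypothetical)
    infer_bonds:    callable mapping coordinates to (i, j) atom index pairs
    expected_bonds: the bonds already known from the topology
    Returns (coordinates, attempts_used); raises RuntimeError if no attempt matches.
    """
    expected = {frozenset(b) for b in expected_bonds}
    for attempt in range(1, max_attempts + 1):
        coords = make_conformer()
        inferred = {frozenset(b) for b in infer_bonds(coords)}
        if inferred == expected:
            return coords, attempt
    raise RuntimeError(f"bonds still wrong after {max_attempts} attempts")


# Toy stand-ins: 1D "coordinates" and a distance cutoff for bonding.
conformers = iter([[0.0, 3.0, 4.0], [0.0, 1.0, 2.0]])


def toy_infer(coords, cutoff=1.5):
    n = len(coords)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(coords[i] - coords[j]) < cutoff]


coords, attempts = infer_with_retries(lambda: next(conformers), toy_infer, [(0, 1), (1, 2)])
print(attempts)  # the first toy conformer gives the wrong bonds, the second matches
```

Comparing bonds as unordered pairs keeps the check independent of atom ordering within a bond.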
I have a bit of an update on this. I realized the tests above were kind of cheating. I had OpenFF generate a conformer, wrote it to a PDB file, and had RDKit read the file and try to infer bond orders from it. But here's the problem: OpenFF already knew the bond orders at the start, and made use of them in generating the conformer. Without that information, it couldn't generate realistic coordinates, and without realistic coordinates, RDKit couldn't determine the bond orders. Oops!

Instead I tried to get RDKit to infer bond orders just from the topology without needing a conformation. In the process I discovered a couple of bugs. A new RDKit version with fixes for those bugs was just released today, allowing me to get back to it. Here is the new code to create an RDKit molecule from an OpenFF molecule and determine bond orders.

```python
from rdkit import Chem
from rdkit.Chem import rdDetermineBonds

# mol is an existing OpenFF Molecule
rdmol = Chem.EditableMol(Chem.Mol())
for atom in mol.atoms:
    a = Chem.Atom(atom.atomic_number)
    a.SetNoImplicit(True)  # all hydrogens are explicit, so don't add implicit ones
    rdmol.AddAtom(a)
for bond in mol.bonds:
    # Copy only the connectivity; every bond starts out as a single bond
    rdmol.AddBond(bond.atom1_index, bond.atom2_index, Chem.BondType.SINGLE)
rdmol = rdmol.GetMol()
# Assign bond orders and formal charges from the topology and the total charge
rdDetermineBonds.DetermineBondOrders(rdmol, int(mol.total_charge.m), embedChiral=False)
```
That last case doesn't necessarily mean it was wrong. Someone more knowledgeable about this than me would have to look at them and decide. In some cases I suspect the problem may be in the original specification. Like this one: Original: I'm not a chemist, but that just looks really strange to me. Positive charges on the oxygens??? The one produced by RDKit looks a lot more plausible. |
Is your feature request related to a problem? Please describe.
Creating a Topology requires you to provide bond orders and formal charges. Sometimes that information is not available, like when reading a PDB file or converting an OpenMM Topology. In that case, you need to provide SMILES strings for each molecule. That itself may be hard to determine, such as if all you have is the PDB file, or if it contains a protein.
Describe the solution you'd like
As long as all hydrogens are present, you can determine the bond orders and formal charges from the elements and bonds. That would make some workflows much easier. This article describes an algorithm for doing it. This routine in MDAnalysis implements it. We can't copy the code directly since it's GPL, but it's fine to use it as a reference for how the algorithm works.
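To give a feel for why this is possible when every hydrogen is present, here is a deliberately simplified sketch. It is not the algorithm from the article and not the MDAnalysis routine: it is a greedy pass I am assuming for illustration, with a tiny hard-coded valence table, and it can give wrong answers for conjugated systems depending on bond ordering.

```python
# Assumed neutral valences for a few elements; a real implementation needs far more.
STANDARD_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


def infer_orders_and_charges(elements, bonds):
    """Greedy guess at bond orders and formal charges from elements and connectivity.

    elements: list of element symbols, one per atom (all hydrogens explicit)
    bonds:    list of (i, j) atom index pairs
    Returns (orders, charges): a dict {(i, j): order} and a list of formal charges.
    """
    orders = {tuple(sorted(b)): 1 for b in bonds}
    # Unfilled valence per atom once every bond is counted as a single bond.
    deficit = [STANDARD_VALENCE[e] for e in elements]
    for i, j in orders:
        deficit[i] -= 1
        deficit[j] -= 1
    # Raise the order of a bond when both of its atoms still have unfilled valence.
    for i, j in orders:
        shared = min(deficit[i], deficit[j])
        if shared > 0:
            orders[(i, j)] += shared
            deficit[i] -= shared
            deficit[j] -= shared
    # Leftover unfilled valence becomes negative charge; excess bonds become positive.
    charges = [-d for d in deficit]
    return orders, charges


# Formate, HCOO-: atom 0 is C, atoms 1 and 2 are O, atom 3 is H.
orders, charges = infer_orders_and_charges(["C", "O", "O", "H"], [(0, 1), (0, 2), (0, 3)])
print(orders, charges)
```

For formate this yields one C=O double bond and a charge of -1 on the other oxygen. If the hydrogen were missing, the carbon's leftover valence would be misread as a charge, which is why the approach depends on all hydrogens being present.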
Describe alternatives you've considered
RDKit has a routine called determineBondOrders() that does something similar. However, it starts by throwing out all the existing bonds and determining new ones based on coordinates. That isn't a reliable thing to do.