Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infer bond orders and formal charges #1828

Open
peastman opened this issue Mar 4, 2024 · 13 comments
Open

Infer bond orders and formal charges #1828

peastman opened this issue Mar 4, 2024 · 13 comments

Comments

@peastman
Copy link

peastman commented Mar 4, 2024

Is your feature request related to a problem? Please describe.
Creating a Topology requires you to provide bond orders and formal charges. Sometimes that information is not available, like when reading a PDB file or converting an OpenMM Topology. In that case, you need to provide SMILES strings for each molecule. That itself may be hard to determine, such as if all you have is the PDB file, or if it contains a protein.

Describe the solution you'd like
As long as all hydrogens are present, you can determine the bond orders and formal charges from the elements and bonds. That would make some workflows much easier. This article describes an algorithm for doing it. This routine in MDAnalysis implements it. We can't copy the code directly since it's GPL, but it's fine to use it as a reference for how the algorithm works.

Describe alternatives you've considered
RDKit has a routine called determineBondOrders() that does something similar. However, it starts by throwing out all the existing bonds and determining new ones based on coordinates. That isn't a reliable thing to do.

@j-wags
Copy link
Member

j-wags commented Mar 5, 2024

This seems like a solid idea. My major concerns are:

  • Reliability
    • Intuitively, I can't think of an example with CHONPS, first column elements, and halogens where this wouldn't work. The closest I can come to counterexamples are elements with multiple allowed valence states like nitrogen, particularly cases with ambiguous aromaticity (like Ns in aromatic five-membered rings).
    • I don't have much intuition outside the elements I mentioned above (specifically I'm thinking about transition metals here). So I'd scope the initial implementation to just accept the common elements, and have it raise an error if anything outside of those are found.
    • A good test for this functionality might be something like roundtripping a large relevant dataset (maybe our industry benchmark set or SPICE, filtered to only the elements that we aim to cover). For each molecule, start with a fully specified SDF/OFFMol, write it to PDB (with explicit elements and CONECTs), read it back in using one of the proposed methods, and see whether the InChI (with our default layers) is identical.
  • Branding/communication
    • I'm thinking a lot about how we advertise functionality/packages as experimental vs private vs public. Once things become public, we are responsible for their correctness and for maintaining them. So two solutions seem nice to me here:
      • Plan for this functionality be inside our ecosystem, and initially be private or labeled experimental, and after clearing some big tests (like the InChI benchmark described above) or a long time with active users and no errors, roll it out publicly. This has the upside of having the answer within our API, with the downside that we become responsible for maintenance and correctness (even though other groups are also spending time to maintain other implementations).
      • Have this be a "blessed" code path outside of our ecosystem, where we include (for example) using the MDA --> RDKit converter in our molecule cookbook. This reduces our maintenance load and only has one group maintaining the functionality, with the downside that it's a longer trek for our users to figure out that it exists.
  • Implementation
    • I haven't read a lot into license compatibility, but if we go with an external solution (like the MDA reader), we might avoid license issues by making a ToolkitWrapper for it and then having MDA be an optional dependency. Also, many MDA folks are at OMSF now so we may be able to ask them whether this functionality could be treated under a different license.

Anyway, the biggest thing I'd like to see to approve this moving forward in our ecosystem is a benchmark that it actually works. So the first step toward either of these outcomes is running the InChI benchmark.

Assorted notes:

  • Reading element-and-bond-graphs sounds a lot like InChI reading. It's possible that there's been a solution there all along.
  • I wonder if this will need knowledge of total formal charge. For XYZ reading I think it's necessary to know this, but that might be more as a cross-check for geometry-based bond detection.

@peastman
Copy link
Author

peastman commented Mar 5, 2024

That's a great idea for the test. I'll try writing something along those lines and see how MDAnalysis does.

@peastman
Copy link
Author

peastman commented Mar 5, 2024

The following test goes through all the monomers from the DES370K dataset (just under 400 of them). For each one it performs the series of transformations SDF->OpenFF Molecule->PDB->MDAnalysis Unverse->RDKit Molecule->OpenFF Molecule. It doesn't find any errors.

from openff.toolkit import Molecule
import MDAnalysis as mda
import os

dir = '/Users/peastman/workspace/spice-dataset/des370k/SDFS'
errors = 0
for filename in os.listdir(dir):
    mol = Molecule(os.path.join(dir, filename), allow_undefined_stereo=True)
    mol.to_file('temp.pdb', 'PDB')
    u = mda.Universe('temp.pdb')
    mol2 = Molecule(u.atoms.convert_to('RDKit', force=True), allow_undefined_stereo=True)
    if mol.to_inchi() != mol2.to_inchi():
        print(filename, mol.to_inchi(), mol2.to_inchi())
        errors += 1
print(errors, 'errors')

That's the biggest set of SDF files I happened to have sitting around. I have SMILES strings for about 400,000 PubChem molecules that can make a much bigger test.

@peastman
Copy link
Author

peastman commented Mar 5, 2024

I haven't read a lot into license compatibility

Here is how the Free Software Foundation interprets it according to their FAQ. When you load a GPL library into memory, you link it to all the other code running in that process. All that code together becomes a derived product and must be licensed under the GPL. Therefore, any code that you load into the same process as a GPL library must be available under a GPL-compatible license (one that allows relicensing it as GPL). That isn't a problem for OpenFF Toolkit itself because the MIT license is GPL-compatible. But the OpenEye toolkit is not. If you load both the OpenEye toolkit and MDAnalysis into the same process, you're violating the license.

@lilyminium
Copy link
Collaborator

I'm also not incredibly familiar with licenses, but very simplistically the way I understand it is code that runs import MDAnalysis should also currently be GPL-licensed. So as @peastman says you could load the OpenFF toolkit and MDAnalysis into the same GPL code, but I'm not sure what the implications would be of actually including MDAnalysis code in the Toolkit itself, even if it's an optional dependency and isolated in a toolkitwrapper. We're trying to re-license at the moment to LGPL but it's a slow process. The RDKit converter implementation, however, has relatively few authors -- it could be easier to ask them if they would relicense the implementation to be MIT-compatible and possibly just copy that.

@mattwthompson
Copy link
Member

I'm in favor of not depending on GPL code at runtime, certainly not adding new GPL dependencies now that I am aware of this argument that side-by-side imports are a license violation. (I previously thought Python's import library got around GPL via magic I didn't understand, but now I'm not sure.)

The toolkit tries to load oechem in most practical code paths:

In [1]: from openff.toolkit import Molecule

In [2]: import os, sys

In [3]: os.path.isfile("/Users/mattthompson/.oe_license.txt")
Out[3]: True

In [4]: "openeye"in sys.modules
Out[4]: True

so I'm already needing to go back and see if some tests I wrote in the past are infected (presumably this is why OpenFE has mixed licenses in its ecosystem?).

@peastman
Copy link
Author

peastman commented Mar 5, 2024

This version of the test runs through the PubChem molecules.

from openff.toolkit import Molecule
import MDAnalysis as mda
import os

errors = 0
for line in open('/Users/peastman/workspace/spice-dataset/pubchem/sorted.txt'):
    id, smiles = line.split()
    mol = Molecule.from_smiles(smiles, allow_undefined_stereo=True)
    mol.generate_conformers(n_conformers=1)
    mol.to_file('temp.pdb', 'PDB')
    u = mda.Universe('temp.pdb')
    mol2 = Molecule(u.atoms.convert_to('RDKit', force=True), allow_undefined_stereo=True)
    if mol.to_inchi() != mol2.to_inchi():
        print('Error:', smiles)
        print(mol.to_inchi())
        print(mol2.to_inchi())
        errors += 1
print(errors, 'errors')

It does report some errors. Here are a few examples.

Error: N#P(Br)Br
InChI=1S/Br2NP/c1-4(2)3
InChI=1S/Br2NP/c1-4(2)3/q-2

Error: [O-][I+3]([O-])([O-])[O-]
InChI=1S/IO4/c2-1(3,4)5/q-1
InChI=1S/IO4/c2-1(3,4)5/q-3

Error: I[I-]I
InChI=1S/I3/c1-3-2/q-1
InChI=1S/I3/c1-3-2/q+1

Error: Cl[C+](Cl)Cl
InChI=1S/CCl3/c2-1(3)4/q+1
InChI=1S/CCl3/c2-1(3)4/q-1

Error: O=S=NSN=S=O
InChI=1S/N2O2S3/c3-6-1-5-2-7-4
InChI=1S/H2N2O2S3/c3-6-1-5-2-7-4/h3-4H/q-2/p-2

Error: [O-][n+]1cc2c3c[n+]([O-])c4ccccc4c3[n+]([O-])nc2c2ccccc21
InChI=1S/C18H10N4O3/c23-20-9-13-14-10-21(24)16-8-4-2-6-12(16)18(14)22(25)19-17(13)11-5-1-3-7-15(11)20/h1-10H
InChI=1S/C18H10N4O3/c23-20-9-13-14-10-21(24)16-8-4-2-6-12(16)18(14)22(25)19-17(13)11-5-1-3-7-15(11)20/h1-10H/q+2

It looks to me like these are mostly cases where it's getting the total charge wrong. Unlike the MDAnalysis routine, the determineBondOrders() function in RDKit does take the total charge as an argument. I'll try it next and see if it does better.

@peastman
Copy link
Author

peastman commented Mar 5, 2024

This version uses RDKit to read the PDB file and fill in missing information.

from openff.toolkit import Molecule
from rdkit import Chem

errors = 0
for line in open('/Users/peastman/workspace/spice-dataset/pubchem/sorted.txt'):
    id, smiles = line.split()
    mol = Molecule.from_smiles(smiles, allow_undefined_stereo=True)
    mol.generate_conformers(n_conformers=1)
    mol.to_file('temp.pdb', 'PDB')
    rdmol = Chem.MolFromPDBFile('temp.pdb', removeHs=False)
    mol2 = Molecule(rdmol, allow_undefined_stereo=True)
    if mol.to_inchi() != mol2.to_inchi():
        print('Error:', smiles)
        print(mol.to_inchi())
        print(mol2.to_inchi())
        errors += 1
print(errors, 'errors')

When it fails, it generally knows something has gone wrong and prints an error message.

[13:35:37] UFFTYPER: Warning: hybridization set to SP3 for atom 1
[13:35:37] UFFTYPER: Unrecognized charge state for atom: 1
[13:35:37] Explicit valence for atom # 0 F, 2, is greater than permitted
[13:35:37] ERROR: Empty structure

Error: F[P-](F)(F)(F)(F)F
InChI=1S/F6P/c1-7(2,3,4,5)6/q-1
[13:35:37] ERROR: Empty structure

@peastman
Copy link
Author

peastman commented Mar 5, 2024

I ran 10,000 PubChem molecules through the above code. Here's how it did.

9862 succeeded. It made it through all the transformations, and the final molecule was identical to the initial one.

122 reported errors in reading the PDB file and MolFromPDBFile() returned None. These ones failed, but RDKit knew they had failed.

16 made it through, but the final molecule was different from the initial one.

Here are some of the molecules in that last category.

Error: CC1=CC2=S(SC=C2)S1
InChI=1S/C6H6S3/c1-5-4-6-2-3-7-9(6)8-5/h2-4H,1H3
InChI=1S/C6H6S3/c1-5-4-6-2-3-7-9(6)8-5/h2-4H,1H3/t9-/m0/s1

Error: O=S(=O)(/N=S1/N=S=NS1)C(F)(F)F
InChI=1S/CF3N3O2S4/c2-1(3,4)13(8,9)7-12-6-10-5-11-12/t12-/m1/s1
InChI=1S/CH2F3N3O2S4/c2-1(3,4)11(8,9)6-13-7-12(13)5-10(12)13/h10,13H/t12-/m1/s1

Error: CN1P(F)(F)(F)N(C)P1(F)(F)F
InChI=1S/C2H6F6N2P2/c1-9-11(3,4,5)10(2)12(9,6,7)8/h1-2H3
InChI=1S/C2H8F6N2P2/c1-9-10(2)11(9,3,4,5)12(9,10,6,7)8/h11-12H,1-2H3/q+2/t9-,10+

Error: ClC1(Cl)SC(Cl)(Cl)S1
InChI=1S/C2Cl4S2/c3-1(4)7-2(5,6)8-1
InChI=1S/C2H2Cl4S2/c3-1(4)7-2(5,6)8(1)7/h7-8H

Error: S=C1Cc2[nH]c(=S)sc(=S)c2C(=S)N1
InChI=1S/C7H4N2S5/c10-3-1-2-4(5(11)9-3)6(12)14-7(13)8-2/h1H2,(H,8,13)(H,9,10,11)
InChI=1S/C7H6N2S5/c10-3-1-2-4-5(9-3)13-14-6(4)12-7(11)8-2/h13-14H,1H2,(H,8,11)(H,9,10)

Error: N#Cc1cc(S(=O)(=O)Nc2ncns2)ccc1Oc1ccc(Cl)cc1-c1ccnc(N)c1
InChI=1S/C20H13ClN6O3S2/c21-14-1-3-18(16(9-14)12-5-6-24-19(23)8-12)30-17-4-2-15(7-13(17)10-22)32(28,29)27-20-25-11-26-31-20/h1-9,11H,(H2,23,24)(H,25,26,27)
InChI=1S/C20H13ClN6O3S2/c21-14-1-3-18(16(9-14)12-5-6-24-19(23)8-12)30-17-4-2-15(7-13(17)10-22)32(28,29)26-20-27-11-25-31(20)27/h1-9,11,31H,(H2,23,24)/p+1

@Yoshanuikabundi
Copy link
Collaborator

I think this would be a great feature - but it does add ambiguity when hydrogens are missing. If I pass in a PDB with graph N-C-C-N, at the moment we can say "I don't know what that is" and raise an exception. With this change, we'd say it was SMILES N#CC#N (the dinitrile). But what if the user has made a mistake and intended the diamine [NH2][CH2][CH2][NH2] but didn't include hydrogens? With this change, we lose the ability to ask the user to disambiguate their input.

Since PDB files in particular usually do not include hydrogens, we should be very careful doing this by default. Even a check that fails if there are no hydrogens would not be sufficient to make this safe, as PDB files commonly include non-polar hydrogens. I would be in favour of a false-by-default, well documented infer_bond_orders argument though!

@IAlibay
Copy link

IAlibay commented Mar 7, 2024

Just chiming in on a couple of thoughts with my MDA & OpenFE hats on:

so I'm already needing to go back and see if some tests I wrote in the past are infected (presumably this is why OpenFE has mixed licenses in its ecosystem?).

To clarify, the strategy here is that the core openfe toolkit isn't importing openfe_analysis but rather calling it via subprocess. This a reasonably common approach that gets around licensing concerns about dynamic linking.

We're trying to re-license at the moment to LGPL but it's a slow process. The RDKit converter implementation, however, has relatively few authors -- it could be easier to ask them if they would relicense the implementation to be MIT-compatible and possibly just copy that.

Couple of thoughts here:

  1. Although we still have some distance to go, MDAnalysis relicensing is probably on the order of months rather than years. We still have to talk to our lawyer about the current status of things, but assuming we can get a couple of universities to agree, I am optimistic that we're likely to be close to done by summer - or at least a point where a package of a portion of the library that is LGPLv2.1+ licensed could be made. P.S. Part of the blocker here is FTE time - if you want to go this direction there's definitely things that could be done to accelerate the process.
  2. A direct converter package itself would still be GPLv3+ licensed for now even if only the relevant authors agreed to the license change. You would need to remove the MDAnalysis dependency itself, and even then you might still be in "is this derivative work by design" ambiguity.

@peastman
Copy link
Author

It looks like most of those errors aren't reproducible. They happen when RDKit incorrectly infers the bonds based on coordinates. Since the script calls generate_conformers() to generate a random conformation, the set of failures can be different each time.

That does suggest a workaround. It's easy to check whether it inferred the correct set of bonds. If it didn't, generate a new random conformation and try again. Of course, it would be even better if RDKit would use the actual bonds specified in the PDB file, not insist on ignoring them and selecting new bonds based on coordinates.

@peastman
Copy link
Author

peastman commented May 2, 2024

I have a bit of an update on this. I realized the tests above were kind of cheating. I had OpenFF generate a conformer, wrote it to a PDB file, and had RDKit read the file and try to infer bond orders from it. But here's the problem: OpenFF already knew the bond orders at the start, and made use of them in generating the conformer. Without that information, it couldn't generate realistic coordinates, and without realistic coordinates, RDKit couldn't determine the bond orders. Oops!

Instead I tried to get RDKit to infer bond orders just from the topology without needing a conformation. In the process I discovered a couple of bugs. A new RDKit version with fixes for those bugs was just released today, allowing me to get back to it. Here is the new code to create an RDKit molecule from an OpenFF molecule and determine bond orders.

rdmol = Chem.EditableMol(Chem.Mol())
for atom in mol.atoms:
    a = Chem.Atom(atom.atomic_number)
    a.SetNoImplicit(True)
    rdmol.AddAtom(a)
for bond in mol.bonds:
    rdmol.AddBond(bond.atom1_index, bond.atom2_index, Chem.BondType.SINGLE)
rdmol = rdmol.GetMol()
rdDetermineBonds.DetermineBondOrders(rdmol, int(mol.total_charge.m), embedChiral=False)

DetermineBondOrders() throws an exception if for any reason it can't determine them. I ran 10,000 PubChem molecules through this code with the following results.

  • 9420 succeeded. RDKit was able to assign bond orders and formal charges. I created a new OpenFF molecule, and the InChI string exactly matched the original one.
  • 285 failed. Either RDKit threw an exception when trying to determine bond orders or (much less often) OpenFF Toolkit threw an exception when I tried to convert the molecule.
  • 295 made it through the full round trip, but the result did not exactly match the original.

That last case doesn't necessarily mean it was wrong. Someone more knowledgeable about this than me would have to look at them and decide. In some cases I suspect the problem may be in the original specification. Like this one:

Original: [H][O+]=[C]1[N-][c]2[c]([H])[c]([C]([F])([F])[F])[c]([H])[n+]([O-])[c]2[N-][C]1=[O+][H]
Final: [H][O][C]1=[N][C]2=[C]([H])[C]([C]([F])([F])[F])=[C]([H])[N]([O-])[C]2=[N+]=[C]1[O][H]

I'm not a chemist, but that just looks really strange to me. Positive charges on the oxygens??? The one produced by RDKit looks a lot more plausible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants