Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Toolkit-dependent differences in how some atomic primitives are interpreted in chemical_environment_matches #511

Open
maxentile opened this issue Feb 9, 2020 · 26 comments

Comments

@maxentile
Copy link
Member

Describe the bug
Some atomic primitive SMIRKS have inconsistent semantics in RDKit vs. OpenEye toolkit.

(Note: None of the specific primitives I'm reporting are used in Parsley, so this issue might affect chemical environment manipulation but not the released forcefields!)

To Reproduce
For any molecule mol, call mol.chemical_environment_matches(query) where query contains any of the following atomic primitives: r0,r1,r2,R1,R2. For some molecules, different toolkit behavior is also seen for query containing atomic primitives R3 or R4.

(Recall that "R2" means "in 2 SSSR rings", and "r0" means "in smallest SSSR ring of size 0.")

# different SMARTS semantics between toolkits

import openforcefield, rdkit, openeye
for package in (openforcefield, rdkit, openeye):
    print('{} version: '.format(package.__name__).ljust(24), package.__version__)

from openforcefield.utils.toolkits import RDKitToolkitWrapper, OpenEyeToolkitWrapper
rdkit_tk, openeye_tk = RDKitToolkitWrapper(), OpenEyeToolkitWrapper()
toolkits = {'RDKit': rdkit_tk, 'OpenEye': openeye_tk}

from openforcefield.topology import Molecule
mol = Molecule.from_smiles('NC(N1)=NC2(CCCCC2)C1=O')

patterns = ["[r0:1]", "[r1:1]", "[r2:1]",
            "[R1:1]", "[R2:1]", "[R4:1]"]

for pattern in patterns:
    print('\nchemical environment: "{}"'.format(pattern))
    for name, toolkit in toolkits.items():
        try:
            matches = mol.chemical_environment_matches(pattern, toolkit)
        except ValueError as e:
            matches = e
        print('{} matches:'.format(name).ljust(22),matches)

mol.to_rdkit()

Output

image

openforcefield version:  0.6.0
rdkit version:           2019.09.3
openeye version:         2019.Oct.2

chemical environment: "[r0:1]"
RDKit matches:         RDKit could not parse the SMIRKS string "[r0:1]"
OpenEye matches:       [(0,), (11,), (12,), (13,), (14,), (15,), (16,), (17,), (18,), (19,), (20,), (21,), (22,), (23,), (24,)]

chemical environment: "[r1:1]"
RDKit matches:         RDKit could not parse the SMIRKS string "[r1:1]"
OpenEye matches:       []

chemical environment: "[r2:1]"
RDKit matches:         RDKit could not parse the SMIRKS string "[r2:1]"
OpenEye matches:       []

chemical environment: "[R1:1]"
RDKit matches:         [(1,), (2,), (3,), (5,), (6,), (7,), (8,), (9,), (10,)]
OpenEye matches:       []

chemical environment: "[R2:1]"
RDKit matches:         [(4,)]
OpenEye matches:       [(1,), (2,), (3,), (5,), (6,), (7,), (8,), (9,), (10,)]

chemical environment: "[R4:1]"
RDKit matches:         []
OpenEye matches:       [(4,)]

Computing environment:
MacOS 10.15.3

openforcefield version: 0.6.0
rdkit version: 2019.09.3
openeye version: 2019.Oct.2

Additional context
The atomic primitives r0,r1,r2,R1, and R2 are handled differently by the two toolkits on 100% of the ~5000 drug-like molecules I checked, R3 is handled differently on ~70%, and R4 is handled differently on ~2%.

These SMIRKS are also considered valid for constructing ChemicalEnvironment objects, i.e.

from openforcefield.typing.chemistry.environment import AtomChemicalEnvironment
AtomChemicalEnvironment(smirks='[r0:1]')

doesn't raise an error.

To resolve, I suggest we should:

  1. Raise a warning or error when users input SMARTS/SMIRKS containing any primitives that may not have an agreed meaning between the toolkits, or that are otherwise unsupported by ChemicalEnvironment.
  2. Pick a meaning for rn, n=0,1,2, which I think could reasonably default to mean "not in an SSSR ring of any size", aka !r.
  3. Possibly look further into difference between how the toolkits interpret R1,R2,R3,R4. In the example shown, it looks like RDKit's interpretation was correct for R2 and R4, and OpenEye's was correct for R1.
@jaimergp
Copy link
Contributor

jaimergp commented Feb 9, 2020

I just want to say this is an excellent example on how to raise an issue 👏

@maxentile
Copy link
Member Author

I just realize that Rn has a totally different intended meaning in OpenEye SMARTS than in Daylight SMARTS!

OpenEye SMARTS atomic primitives
image

https://docs.eyesopen.com/toolkits/cpp/oechemtk/SMARTS.html

vs.

Daylight SMARTS atomic primitives
image

https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html

Given the issues described in OpenEye's footnote 1 about Rn's possible non-determinism, and given that OpenEye assigns Rn to be a synonym of xn (which is supported unambiguously in both toolkits), I propose to reject if the user inputs Rn, and suggest to use xn to mean "ring bond count" without the possibility of toolkit-dependent ambiguity.

@j-wags
Copy link
Member

j-wags commented Feb 11, 2020

Wow, thanks for the excellent and thoughtful writeup.

I agree that we should forbid the use of R in SMARTS. Until we put out parameters for Rutherfordium, Rhenium, Radon, or Rubidium, we shouldn't have any unwanted clashes there.

The atomic primitives r0,r1,r2,R1, and R2 are handled differently by the two toolkits on 100% of the ~5000 drug-like molecules I checked

Do you think we should also forbid r in favor of x?

@maxentile
Copy link
Member Author

Wow, thanks for the excellent and thoughtful writeup.

No worries -- I'm very interested in this issue!

I agree that we should forbid the use of R in SMARTS. Until we put out parameters for Rutherfordium, Rhenium, Radon, or Rubidium, we shouldn't have any unwanted clashes there.

Hah, yeah, and I think the OpenFF force-fields so far have always opted to use atomic number rather than element symbol, so no loss of expressiveness or change of convention required there.

Do you think we should also forbid r in favor of x?

No, I don't think so! Ring size is a very important concept to be able to express, and the primitives r3,r4,r5,r6 are used many times in Parsley.

@j-wags
Copy link
Member

j-wags commented Feb 11, 2020

Ahh, I read closer and see what you mean about rn -- It's impossible to have rings of size 0, 1, or 2, so those are obviously nonsense. But r3 and up are real, and both toolkits handle them properly. Good call.

Pick a meaning for rn, n=0,1,2, which I think could reasonably default to mean "not in an SSSR ring of any size", aka !r.

I'm not sure what, if any, special behavior we should have for r{0,1,2}. Our options seem to be:

  1. validate every SMARTS to find instances of r{0,1,2} and turn it into !r, so that we get the same behavior from both toolkits
  2. validate every SMARTS and raise an exception immediately if r{0,1,2} are found
  3. let the toolkits do their own thing and behave differently for r{0,1,2}, but caution users about ambiguous behavior

I'm mostly in favor of the last option. 1) and 2) will require us to find and replace r{0,1,2}, which will hit edge cases like Ar, Sr, Zr, Cr` atoms in rings, and who-knows-what-else. 1) will fail round trip tests, and 2) will block off reasonable functionality from OpenEye. 3) isn't perfect, but seems to be a good compromise.

@maxentile
Copy link
Member Author

I'm mostly in favor of the last option.

  1. will fail round trip tests, and 2) will block off reasonable functionality from OpenEye. 3) isn't perfect, but seems to be a good compromise.

A warning seems perfectly fine to me! Warning text can suggest to use an alternative unambiguous SMARTS like "!r" to express the likely intended meaning.

  1. and 2) will require us to find and replace r{0,1,2}, which will hit edge cases like Ar, Sr, Zr, Cr` atoms in rings, and who-knows-what-else.

Sorry, minor point, but I'm not sure I see how r1,r2 can hit anything in a ring. Could you please flesh out a bit how one of these edge-cases work?

@maxentile
Copy link
Member Author

Sorry, minor point, but I'm not sure I see how r1,r2 can hit anything in a ring. Could you please flesh out a bit how one of these edge-cases work?

Ahh! Sorry, just got the concern.

(If you string-match for "r0", "r1" etc. you may hit unintended tokens whose last letter is "r" if they're followed by something that starts with a number. Although I'm not aware of any valid things to say in SMARTS / SMIRKS that start with a number, I agree that find-and-replace is a risky option, and I agree with you about going with a warning instead of modifying the user's query.)

@davidlmobley
Copy link
Contributor

I believe @bannanc caught this at one point and we'd originally used Rn quite a bit in some early materials and then we moved away from it because of such ambiguities/differences. I think we should have gone further and refused to use it, it sounds like.

For history, see this PR openforcefield/smirnoff99Frosst#57 and the issue referenced therein #40 -- which is very brief but I know that we discussed this with Chris Bayly (probably in person in a call, or on Slack). Also note that Caitlin tested that going from R to x didn't change typing when we made that change.

So I think we're on board with ensuring folks don't use R.

@bannanc
Copy link
Collaborator

bannanc commented Feb 14, 2020

Yes, there are documented differences in R! In OpenEye R and x are the same, they mean number of ring bonds (or ring connectivity). However, the Daylight SMARTS and RDKit use R to mean the number of rings an atom is a part of and x to been ring connectivity (number of ring bonds on the atom). When OE made their SMARTS parse they changed the meaning of R because x didn't exist yet and "number of rings" isn't necessarily intuitive.

We fixed this 3 years ago in smirnoff99Frosst issue 54. If you think you've found new chemical perception issues I highly recommend searching in the smirnoff99frosst issue trackers.

@bannanc
Copy link
Collaborator

bannanc commented Feb 14, 2020

This is why document issues is so important so we don't waste time repeating research into the history of a problem. Perhaps there are other things that should be done to prevent such repeated work.

@maxentile
Copy link
Member Author

Yes, there are documented differences in R!

Thanks! Will do a better job for searching the history before raising similar issues here. I should have emphasized more clearly that none of the primitives r0,r1,r2,R1,R2,R3,R4 are used in the released forcefields, and that the extent of the issue is that chemical_environment_matches function may have toolkit-dependent behavior under unrestricted user input. The readme points only to the Daylight spec, and since I had studied only the Daylight spec, this was surprising behavior, some of which was clarified as soon as looked at the OpenEye spec.

If you think you've found new chemical perception issues I highly recommend searching in the smirnoff99frosst issue trackers.

To the best of my knowledge, these are the only atomic primitives in the Daylight spec that have different behavior in OpenEye vs. RDKit.

(Although I'm aware there are atomic primitives that appear in the OpenEye spec but not in the Daylight spec (e.g. OpenEye's hybridization primitives '[^0:1]', '[^2:1]', '[^3:1]' match different subsets of atoms in the example molecule when toolkit is OpenEye vs. RDKit), and there are differences in what grammatical compositions of consensus primitives the RDKit and OpenEye parsers reject (e.g. trailing bond symbols are rejected by RDKit parser but allowed by OpenEye parser, '[*:1]~', '[*:1]-', '[*:1]=', '[*:1]#', '[*:1]@', '[*:1]:'). Again, these corner-case issues do not affect the released forcefields, but affect behavior of chemical_environment_matches function under unrestricted user input. After a search, I didn't see r{0,1,2} or ^{n} differences described previously in the smirnoff99Frosst tracker.)

Perhaps there are other things that should be done to prevent such repeated work.

I notice in the linked issue thread that John had pointed out an OpenSMARTS specification, which appears to have stalled but contains a draft of a formal grammar for SMARTS: https://github.com/timvdm/OpenSMARTS/blob/master/source/grammar.rst . If it ever becomes unwieldy to guarantee sufficiently toolkit-independent chemical perception behavior, it may be worth writing down a consensus grammar shared by the OpenEye and RDKit parsers, to validate that user input SMARTS are in a consensus grammar for which the Open Force Field Toolkit can guarantee {rdkit, openeye}-independent behavior.

@bannanc
Copy link
Collaborator

bannanc commented Feb 14, 2020

I'm sorry for my curt response, I didn't note the hybridization note before. We (both hand written and auto generated patterns) didn't use the hybridization, so it wouldn't have come up. I was weary of the OpenSMARTS since it was being actively maintained. I do agree that we need something more robust for checking patterns than just are they valid with the toolkit being used. We had an issue for that on the SMARTY issue tracker related to the ChemicalEnvironment class which was my hand written parser and would probably not get those trailing bonds correct because it only looks for bonds between atoms.

I had a couple of discrepancies that did get noted during ChemPer's development. These were mostly things that "broke" OpenEye and didn't "break" RDKit. I was intentionally trying to raise an error when I learned that RDK would support a m symbol.

@bannanc
Copy link
Collaborator

bannanc commented Feb 14, 2020

Looking back at the top issues I have a couple of things to ask:

  • RDK should be able to parse normal rn, so I don't know why "[r1:1]" isn't working does "[*r1:1]" work?

  • I don't remember which toolkit it was, but one of them does not like r0 which is why when an atom isn't in a ring ChemPer generates !r for not in a ring.

  • There is another issue I thought of related to SMARTS, some things can causing parsing problems, for example, ChemPer at one point made a SMARTS that include Ar6 for aliphatic and in a 6 membered ring but one or both of the toolkits read that as Argon I assume with isotope 6 since that's what the standalone number means.

@bannanc
Copy link
Collaborator

bannanc commented Feb 14, 2020

The more I think about the r parsing the more worried I get? That decorator is HUGELY important to getting our typing right and we haven't had a problem with it before. @maxentile could you confirm that it is only some rearrangements of r that don't work.

From a users perspective, I think if you have a "master" user who can write their own SMARTS and wants to make their own force field you should let them use any pattern that is parseable, then warn against ambiguous behavior. However, I think its important that force fields OpenFF makes behave the same either way.

@maxentile
Copy link
Member Author

I'm sorry for my curt response

No worries!

I didn't note the hybridization note before. We (both hand written and auto generated patterns) didn't use the hybridization, so it wouldn't have come up.

I just noticed that this week, and I don't see any place it would have come up before.

The more I think about the r parsing the more worried I get? That decorator is HUGELY important to getting our typing right and we haven't had a problem with it before. @maxentile could you confirm that it is only some rearrangements of r that don't work.

I think this should not be a cause for concern. For the few thousands of molecules I've checked so far, RDKit and OpenEye interpret r3,r4,r5,r6,r7 identically -- it's just that r0,r1,r2 aren't really well-defined (since a ring of size 0,1, or 2 can't exist), and OpenEye provides a reasonable default behavior for these symbols while RDKit rejects them. I don't think this has had any practical impact, but if someone were to do chemical-perception learning using chemical_environment_matches relying on OpenEye toolkit-specific behavior, they would need to make a minor revision at the end (from r{<=2} to r!, as you've done in ChemPer) for compatibility with RDKit.

From a users perspective, I think if you have a "master" user who can write their own SMARTS and wants to make their own force field you should let them use any pattern that is parseable, then warn against ambiguous behavior. However, I think its important that force fields OpenFF makes behave the same either way.

Agreed!

Looking back at the top issues I have a couple of things to ask:

Looking into these shortly!

@bannanc
Copy link
Collaborator

bannanc commented Feb 14, 2020

I think this should not be a cause for concern...
Thank you, I wasn't thinking critically about those numbers, just the differences and the letter. It makes sense that a nonsensical r usage might be different between them.

Regarding my questions, I think you got them all, except that I wanted to make sure other people know about interpretations of letters next to each other.

@maxentile
Copy link
Member Author

Ar6 for aliphatic and in a 6 membered ring but one or both of the toolkits read that as Argon I assume with isotope 6 since that's what the standalone number means.

Ahh, good to be aware of this! I hadn't considered that.

Looking at a few examples of this now, it seems like both RDKit and OpenEye probably resolve this type of parse ambiguity in the same way (e.g. in both toolkits "[rA:1]" and "[A&r:1]" match the same 10 atoms in the example molecule, while "[Ar:1]" matches nothing). May be interesting at some point to search for other types of parse ambiguities that the toolkits might resolve with different precedence rules. I didn't see any stated precedence rules in the Daylight or OpenEye specifications that would resolve the ambiguity between interpreting "Ar" as "argon" vs. "aliphatic and in a ring", but presumably they have rules for this written somewhere...

(In any case, this again poses no issue for the unambiguous SMIRKS subset used in the released forcefields, which always reference atomic numbers rather than element symbols.)

@maxentile
Copy link
Member Author

@yuanqing-wang encountered a molecule in Esol this afternoon, where attempting to parameterize led to an UnassignedBondParameterException when using OpenEye but passed when using RDKit.

This prompted me to loop over the SMIRKS patterns in openff-1.1.1 and the molecules in Esol, to check if there are any mols and patterns for which mol.chemical_environment_matches(pattern, toolkit) has toolkit-dependent behavior.

Surprisingly, this turned up a few examples, listed here: https://gist.github.com/maxentile/d528c3f271aa021eb1fce9d2b3debbb9 . Some of these patterns reference the aromatic bond primitive : (which I might be suspicious of having complicated behavior), but some look really benign (like [#6X4:1]). I'm curious what might explain the difference there -- maybe something with explicit vs. implicit hydrogens?

I haven’t checked if these possible differences in chemical-environment-matching would result in different parameters ultimately being assigned for any of these molecules.

@maxentile
Copy link
Member Author

I've modified the script to print if there are any molecules in this set where different force field parameters are assigned depending on toolkit, and I was surprised that it produced any output: https://gist.github.com/maxentile/bede56b2888132ac2b215d459e00ec1b#file-tk_dependent_parameters-txt .

Looking at a random example molecule, c1c(OC)c(OC)C2C(=O)OCC2c1, it appears that all of the differently assigned parameters for this molecule can be traced to RDKit and OpenEye disagreeing whether atom 13 is matched by atomic primitive X4 or X3. The "total connectivity" primitive X has the same documented meaning in Daylight SMARTS and OpenEye SMARTS, so I'm not sure what the root cause of this disagreement is -- perhaps this reflects a pre-SMARTS difference in how implicit hydrogens are handled in offmol->rdmol conversions vs. offmol->oemol conversions or something. A search of the smirnoff99Frosst issue tracker didn't turn up a previous discussion of this issue. A possibly related issue is #362 . (I also haven't yet determined if this is the culprit for the other molecules in the set too, or if there is possibly more than one cause.)

c1c(OC)c(OC)C2C(=O)OCC2c1

from openforcefield.utils.toolkits import RDKitToolkitWrapper, OpenEyeToolkitWrapper
rdkit_tk, openeye_tk = RDKitToolkitWrapper(), OpenEyeToolkitWrapper()
toolkits = {'RDKit': rdkit_tk, 'OpenEye': openeye_tk}

from openforcefield.topology import Molecule
mol = Molecule.from_smiles('c1c(OC)c(OC)C2C(=O)OCC2c1', allow_undefined_stereo=True)
pattern = "[X4:1]"

print('\nchemical environment: "{}"'.format(pattern))
for name, toolkit in toolkits.items():
    print('{} matches:'.format(name).ljust(22), mol.chemical_environment_matches(pattern, toolkit))
chemical environment: "[X4:1]"
RDKit matches:         [(3,), (4,), (6,), (7,), (11,), (12,), (13,)]
OpenEye matches:       [(3,), (4,), (6,), (7,), (11,), (12,)]

@davidlmobley
Copy link
Contributor

Can I ask where this molecule comes from? Arguably part of the problem with this particular case is likely that this is "badly formed" chemistry in some sense, e.g. you've got a carbon in that six-membered ring which only makes three bonds. I'd guess that RDKit is effectively saying, "That can't be what you mean" and OpenEye is saying "Well OK then, if you insist..."

Do you happen to have any examples which are well-formed chemistry?

I'm not saying this doesn't indicate a problem, but the IMPORTANCE of the problem is impacted by whether it occurs elsewhere. For example, this specific problem we could probably help avoid by doing more careful assessment of what people feed in and throwing errors/warnings if they feed in something which doesn't seem to make chemical sense.

(As a side note, we except that it should also be possible to find some inconsistencies in representations of aromaticity; we're using the MDL aromaticity model in both toolkits but aromaticity models are tricky to define fully, so there may be a few cases where they are not completely consistent. If we find such cases, we'd be firing them at the OpenFF/RDKit toolkit developers to get them to work out differences.)

@maxentile
Copy link
Member Author

These ~60 examples were from the ESOL dataset, which I believe @yuanqing-wang obtained here: https://github.com/deepchem/deepchem/blob/master/datasets/delaney-processed.csv .

Do you happen to have any examples which are well-formed chemistry?

These are the first examples I've encountered -- I'll follow-up to see if this appears elsewhere.

@j-wags
Copy link
Member

j-wags commented Aug 26, 2020

@maxentile This is a great writeup. Thanks so much.

Skimming over those molecules, it seems like we have two categories of errors:

  1. Nitro groups
  2. Rings with a mix of aromatic and non-aromatic carbons

Nitro groups are worth looking into separately, since I recall them being some degree of trouble to even load from file, much less parameterize.

The partially-aromatic rings are interesting to me, because the same SMILES would probably be fine if they were provided with explicit bond orders. I agree with @davidlmobley that the problem is likely to be the SMILES being interpreted with a different aromaticity model than they were written with.

I've been smelling smoke around aromaticity for some time, and have started taking some notes on how to move forward. Implementing a fix for this may take a little while, but I'll try to include "handling this sort of case" as a reach goal for the new toolkit's behavior.

@maxentile
Copy link
Member Author

Arguably part of the problem with this particular case is likely that this is "badly formed" chemistry in some sense
the IMPORTANCE of the problem is impacted by whether it occurs elsewhere
I'll follow-up to see if this appears elsewhere.

Comfortingly, running the same check on a few thousand randomly sampled molecules from the Enamine REAL Diverse set returned 0 mismatches of this sort, supporting the interpretation that this issue is restricted to malformed SMILES strings.

@davidlmobley
Copy link
Contributor

@maxentile That's great. So far, after our very early toolkit releases, I've found that our toolkit actually does a pretty good job of breaking only for bad molecules. Tangentially, in my experience, if I energy minimize molecules from a database like eMolecules with several force fields, most of the OpenFF failures will be molecules which are badly formed, whereas other FFs (like GAFF) will fail on a reasonable fraction of the well formed molecules.

But this does also raise the issue that it would be nice if we flagged molecules like yours as badly formed, rather than just giving weird results for them.

@davidlmobley
Copy link
Contributor

@j-wags perhaps the best solution to some of these is to not use an aromaticity model at all, which will become increasingly feasible as we finish implementing WBOs for the various terms. Someone could fit a FF which uses only valence/connectivity and not bond order at all.

@j-wags
Copy link
Member

j-wags commented Oct 5, 2020

@davidlmobley Unfortunately, we still need aromaticity/bond orders in molecule equality checks, which drive how the Topology object recognizes when different molecules are the same chemical species. But I agree that, for SMARTS, we can someday do away with aromaticity/bond orders entirely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants