Toolkit-dependent differences in how some atomic primitives are interpreted in chemical_environment_matches #511
Comments
I just want to say this is an excellent example of how to raise an issue 👏 |
I just realized that Rn has a totally different intended meaning in OpenEye SMARTS than in Daylight SMARTS! OpenEye SMARTS atomic primitives: https://docs.eyesopen.com/toolkits/cpp/oechemtk/SMARTS.html vs. Daylight SMARTS atomic primitives: https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html . Given the issues described in OpenEye's footnote 1 about Rn's possible non-determinism, and given that OpenEye treats Rn as a synonym of xn (which is supported unambiguously in both toolkits), I propose that we reject Rn in user input and suggest xn instead, which means "ring bond count" without any toolkit-dependent ambiguity. |
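To make the proposal above concrete, here is a minimal sketch of such a check using only the standard library. The names `reject_ring_membership` and `AmbiguousSmartsError` are hypothetical and are not part of the Open Force Field Toolkit. The sketch relies on the fact that no element symbol is an upper-case "R" followed directly by a digit, so any `R<digits>` in a query can be read as the ring-membership primitive.

```python
import re

# "R" followed by digits. Element symbols starting with R (Rb, Re, Rh, Rn, Ru, ...)
# always continue with a lower-case letter, so this cannot hit an element.
_RING_MEMBERSHIP = re.compile(r"R(\d+)")


class AmbiguousSmartsError(ValueError):
    """Hypothetical exception for primitives whose meaning depends on the toolkit."""


def reject_ring_membership(smarts: str) -> None:
    """Raise if the query uses Rn; return None otherwise."""
    hits = _RING_MEMBERSHIP.findall(smarts)
    if hits:
        # dict.fromkeys keeps first-seen order while dropping duplicates
        suggestions = ", ".join(f"R{n} -> x{n}" for n in dict.fromkeys(hits))
        raise AmbiguousSmartsError(
            f"{smarts!r} uses Rn, which Daylight and OpenEye define differently; "
            f"if ring bond count is intended, consider: {suggestions}"
        )
```

With this sketch, `reject_ring_membership("[#6X4:1]")` and `reject_ring_membership("[Rn]")` pass silently, while `"[#6R2:1]"` raises with a hint to use `x2`.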
Wow, thanks for the excellent and thoughtful writeup. I agree that we should forbid the use of
Do you think we should also forbid |
No worries -- I'm very interested in this issue!
Hah, yeah, and I think the OpenFF force-fields so far have always opted to use atomic number rather than element symbol, so no loss of expressiveness or change of convention required there.
No, I don't think so! Ring size is a very important concept to be able to express, and the primitives r3,r4,r5,r6 are used many times in Parsley. |
Ahh, I read closer and see what you mean about
I'm not sure what, if any, special behavior we should have for
I'm mostly in favor of the last option. 1) and 2) will require us to find and replace |
A warning seems perfectly fine to me! Warning text can suggest to use an alternative unambiguous SMARTS like "!r" to express the likely intended meaning.
Sorry, minor point, but I'm not sure I see how r1,r2 can hit anything in a ring. Could you please flesh out a bit how one of these edge-cases work? |
Ahh! Sorry, just got the concern. (If you string-match for "r0", "r1" etc. you may hit unintended tokens whose last letter is "r" if they're followed by something that starts with a number. Although I'm not aware of any valid things to say in SMARTS / SMIRKS that start with a number, I agree that find-and-replace is a risky option, and I agree with you about going with a warning instead of modifying the user's query.) |
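A rough sketch of the "warn, don't rewrite" option discussed above, using only the standard library. The function name `warn_on_small_ring_sizes` is hypothetical. To avoid the false hits mentioned in this comment, the pattern skips any "r" that directly follows an upper-case letter (as in Br, Cr, Zr), which makes the check deliberately conservative: it can miss an ambiguous case such as `[Cr1]`, but it should not flag element symbols. It also leaves larger ring sizes such as r10 alone.

```python
import re
import warnings

# r0, r1 or r2, not preceded by an upper-case letter (so "Br1" is ignored)
# and not followed by another digit (so "r10" or "r12" is ignored).
_SMALL_RING = re.compile(r"(?<![A-Z])r([0-2])(?!\d)")


def warn_on_small_ring_sizes(smarts: str) -> list:
    """Warn about r0/r1/r2 and return the primitives found; the query is not modified."""
    found = ["r" + n for n in _SMALL_RING.findall(smarts)]
    if found:
        warnings.warn(
            f"{smarts!r} contains {', '.join(found)}, which RDKit and OpenEye "
            "handle differently; '!r' may express the intended meaning "
            "(not in a ring of any size).",
            UserWarning,
            stacklevel=2,
        )
    return found
```

Returning the list of hits rather than a rewritten string keeps the user's query untouched, in line with the preference stated above.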
I believe @bannanc caught this at one point and we'd originally used For history, see this PR openforcefield/smirnoff99Frosst#57 and the issue referenced therein #40 -- which is very brief but I know that we discussed this with Chris Bayly (probably in person in a call, or on Slack). Also note that Caitlin tested that going from R to x didn't change typing when we made that change. So I think we're on board with ensuring folks don't use |
Yes, there are documented differences in We fixed this 3 years ago in smirnoff99Frosst issue 54. If you think you've found new chemical perception issues I highly recommend searching in the smirnoff99frosst issue trackers. |
This is why documenting issues is so important: it keeps us from wasting time repeating research into the history of a problem. Perhaps there are other things that should be done to prevent such repeated work. |
Thanks! Will do a better job of searching the history before raising similar issues here. I should have emphasized more clearly that none of the primitives r0,r1,r2,R1,R2,R3,R4 are used in the released forcefields, and that the extent of the issue is that the chemical_environment_matches function may have toolkit-dependent behavior under unrestricted user input. The readme points only to the Daylight spec, and since I had studied only the Daylight spec, this was surprising behavior, some of which was clarified as soon as I looked at the OpenEye spec.
To the best of my knowledge, these are the only atomic primitives in the Daylight spec that have different behavior in OpenEye vs. RDKit. (Although I'm aware there are atomic primitives that appear in the OpenEye spec but not in the Daylight spec (e.g. OpenEye's hybridization primitives
I notice in the linked issue thread that John had pointed out an OpenSMARTS specification, which appears to have stalled but contains a draft of a formal grammar for SMARTS: https://github.com/timvdm/OpenSMARTS/blob/master/source/grammar.rst . If it ever becomes unwieldy to guarantee sufficiently toolkit-independent chemical perception behavior, it may be worth writing down a consensus grammar shared by the OpenEye and RDKit parsers, to validate that user input SMARTS are in a consensus grammar for which the Open Force Field Toolkit can guarantee {rdkit, openeye}-independent behavior. |
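As a toy illustration of validating input against a restricted subset, the sketch below checks each bracket atom against a small whitelist of primitives. The whitelist itself is an assumption made up for this example and is not an agreed consensus grammar; the name `unsupported_atoms` is hypothetical. Bonds and the outer brackets of recursive `$(...)` expressions are not checked.

```python
import re

# Illustrative whitelist (an assumption, not a vetted consensus grammar):
# atomic number, X/x/H counts, ring size >= 3 or bare "r", charge,
# logical operators, wildcard, aromatic/aliphatic.
_TOKEN = (
    r"(?:#\d+|X\d+|x\d+|H\d+"
    r"|r(?:[1-9]\d+|[3-9])|r(?!\d)"
    r"|[+-]\d*|[!&,;]|\*|[aA])"
)
# One or more whitelisted tokens, optionally followed by a ":n" map index.
_BRACKET_BODY = re.compile(rf"(?:{_TOKEN})+(?::\d+)?")
# Innermost bracket atoms only (no nested brackets inside).
_BRACKET_ATOM = re.compile(r"\[([^\[\]]*)\]")


def unsupported_atoms(smarts: str) -> list:
    """Return the bracket-atom bodies that fall outside the toy whitelist."""
    return [
        body
        for body in _BRACKET_ATOM.findall(smarts)
        if not _BRACKET_BODY.fullmatch(body)
    ]
```

An empty result means every bracket atom stayed inside the whitelist. For example, `unsupported_atoms("[#6R2:1]-[#6r5:2]")` reports only `#6R2:1`.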
I'm sorry for my curt response, I didn't notice the hybridization note before. We (both hand written and auto generated patterns) didn't use the hybridization, so it wouldn't have come up. I was wary of OpenSMARTS since it wasn't being actively maintained. I do agree that we need something more robust for checking patterns than just whether they are valid with the toolkit being used. We had an issue for that on the SMARTY issue tracker related to the I had a couple of discrepancies that did get noted during ChemPer's development. These were mostly things that "broke" OpenEye and didn't "break" RDKit. I was intentionally trying to raise an error when I learned that RDK would support a |
Looking back at the top issues I have a couple of things to ask:
|
The more I think about the From a user's perspective, I think if you have a "master" user who can write their own SMARTS and wants to make their own force field, you should let them use any pattern that is parseable, then warn against ambiguous behavior. However, I think it's important that force fields OpenFF makes behave the same either way. |
No worries!
I just noticed that this week, and I don't see any place it would have come up before.
I think this should not be a cause for concern. For the few thousand molecules I've checked so far, RDKit and OpenEye interpret r3,r4,r5,r6,r7 identically -- it's just that r0,r1,r2 aren't really well-defined (since a ring of size 0, 1, or 2 can't exist), and OpenEye provides a reasonable default behavior for these symbols while RDKit rejects them. I don't think this has had any practical impact, but if someone were to do chemical-perception learning using chemical_environment_matches relying on OpenEye toolkit-specific behavior, they would need to make a minor revision at the end (from r{<=2} to !r, as you've done in ChemPer) for compatibility with RDKit.
Agreed!
Looking into these shortly! |
Regarding my questions, I think you got them all, except that I wanted to make sure other people know about interpretations of letters next to each other. |
Ahh, good to be aware of this! I hadn't considered that. Looking at a few examples of this now, it seems like both RDKit and OpenEye probably resolve this type of parse ambiguity in the same way (e.g. in both toolkits (In any case, this again poses no issue for the unambiguous SMIRKS subset used in the released forcefields, which always reference atomic numbers rather than element symbols.) |
@yuanqing-wang encountered a molecule in Esol this afternoon, where attempting to parameterize led to an This prompted me to loop over the SMIRKS patterns in openff-1.1.1 and the molecules in Esol, to check if there are any Surprisingly, this turned up a few examples, listed here: https://gist.github.com/maxentile/d528c3f271aa021eb1fce9d2b3debbb9 . Some of these patterns reference the aromatic bond primitive I haven’t checked if these possible differences in chemical-environment-matching would result in different parameters ultimately being assigned for any of these molecules. |
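The kind of cross-toolkit check described above can be sketched generically as follows. This is not the script from the gist: `find_mismatches` and the `matchers` argument are hypothetical, with each matcher being any callable that takes a SMILES string and a SMIRKS pattern and returns the matched atom-index tuples for one toolkit.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

Match = Tuple[int, ...]
# (smiles, smirks) -> matched atom-index tuples for one toolkit
Matcher = Callable[[str, str], Iterable[Match]]


def find_mismatches(
    molecules: Iterable[str],
    patterns: Iterable[str],
    matchers: Dict[str, Matcher],
) -> List[Tuple[str, str, Dict[str, FrozenSet[Match]]]]:
    """Collect (smiles, smirks, per-toolkit matches) wherever toolkits disagree."""
    patterns = list(patterns)  # reused for every molecule
    mismatches = []
    for smiles in molecules:
        for smirks in patterns:
            # Compare as sets so that match ordering does not count as a difference.
            results = {
                name: frozenset(tuple(m) for m in matcher(smiles, smirks))
                for name, matcher in matchers.items()
            }
            if len(set(results.values())) > 1:
                mismatches.append((smiles, smirks, results))
    return mismatches
```

In practice each matcher would wrap a call like `mol.chemical_environment_matches(pattern, toolkit)` for one toolkit wrapper; the sketch keeps that detail outside the comparison logic so it can be exercised without either toolkit installed.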
I've modified the script to print if there are any molecules in this set where different force field parameters are assigned depending on toolkit, and I was surprised that it produced any output: https://gist.github.com/maxentile/bede56b2888132ac2b215d459e00ec1b#file-tk_dependent_parameters-txt . Looking at a random example molecule:

    from openforcefield.topology import Molecule
    from openforcefield.utils.toolkits import RDKitToolkitWrapper, OpenEyeToolkitWrapper

    # One wrapper instance per toolkit, so each can be queried explicitly
    rdkit_tk, openeye_tk = RDKitToolkitWrapper(), OpenEyeToolkitWrapper()
    toolkits = {'RDKit': rdkit_tk, 'OpenEye': openeye_tk}

    mol = Molecule.from_smiles('c1c(OC)c(OC)C2C(=O)OCC2c1', allow_undefined_stereo=True)
    pattern = "[X4:1]"
    print('\nchemical environment: "{}"'.format(pattern))

    # Run the same query through each toolkit and print the matches side by side
    for name, toolkit in toolkits.items():
        print('{} matches:'.format(name).ljust(22), mol.chemical_environment_matches(pattern, toolkit))
|
Can I ask where this molecule comes from? Arguably part of the problem with this particular case is likely that this is "badly formed" chemistry in some sense, e.g. you've got a carbon in that six-membered ring which only makes three bonds. I'd guess that RDKit is effectively saying, "That can't be what you mean" and OpenEye is saying "Well OK then, if you insist..." Do you happen to have any examples which are well-formed chemistry? I'm not saying this doesn't indicate a problem, but the IMPORTANCE of the problem is impacted by whether it occurs elsewhere. For example, this specific problem we could probably help avoid by doing more careful assessment of what people feed in and throwing errors/warnings if they feed in something which doesn't seem to make chemical sense. (As a side note, we expect that it should also be possible to find some inconsistencies in representations of aromaticity; we're using the MDL aromaticity model in both toolkits but aromaticity models are tricky to define fully, so there may be a few cases where they are not completely consistent. If we find such cases, we'd be firing them at the OpenFF/RDKit toolkit developers to get them to work out differences.) |
These ~60 examples were from the ESOL dataset, which I believe @yuanqing-wang obtained here: https://github.com/deepchem/deepchem/blob/master/datasets/delaney-processed.csv .
These are the first examples I've encountered -- I'll follow-up to see if this appears elsewhere. |
@maxentile This is a great writeup. Thanks so much. Skimming over those molecules, it seems like we have two categories of errors:
Nitro groups are worth looking into separately, since I recall them being some degree of trouble to even load from file, much less parameterize. The partially-aromatic rings are interesting to me, because the same SMILES would probably be fine if they were provided with explicit bond orders. I agree with @davidlmobley that the problem is likely to be the SMILES being interpreted with a different aromaticity model than they were written with. I've been smelling smoke around aromaticity for some time, and have started taking some notes on how to move forward. Implementing a fix for this may take a little while, but I'll try to include "handling this sort of case" as a reach goal for the new toolkit's behavior. |
Comfortingly, running the same check on a few thousand randomly sampled molecules from the Enamine REAL Diverse set returned 0 mismatches of this sort, supporting the interpretation that this issue is restricted to malformed SMILES strings. |
@maxentile That's great. So far, after our very early toolkit releases, I've found that our toolkit actually does a pretty good job of breaking only for bad molecules. Tangentially, in my experience, if I energy minimize molecules from a database like eMolecules with several force fields, most of the OpenFF failures will be molecules which are badly formed, whereas other FFs (like GAFF) will fail on a reasonable fraction of the well formed molecules. But this does also raise the issue that it would be nice if we flagged molecules like yours as badly formed, rather than just giving weird results for them. |
@j-wags perhaps the best solution to some of these is to not use an aromaticity model at all, which will become increasingly feasible as we finish implementing WBOs for the various terms. Someone could fit a FF which uses only valence/connectivity and not bond order at all. |
@davidlmobley Unfortunately, we still need aromaticity/bond orders in molecule equality checks, which drive how the |
Describe the bug
Some atomic primitive SMIRKS have inconsistent semantics in RDKit vs. OpenEye toolkit.
(Note: None of the specific primitives I'm reporting are used in Parsley, so this issue might affect chemical environment manipulation but not the released forcefields!)
To Reproduce
For any molecule mol, call mol.chemical_environment_matches(query) where query contains any of the following atomic primitives: r0,r1,r2,R1,R2. For some molecules, different toolkit behavior is also seen for query containing atomic primitives R3 or R4.
(Recall that "R2" means "in 2 SSSR rings", and "r0" means "in smallest SSSR ring of size 0.")
Output
Computing environment:
MacOS 10.15.3
openforcefield version: 0.6.0
rdkit version: 2019.09.3
openeye version: 2019.Oct.2
Additional context
The atomic primitives r0,r1,r2,R1, and R2 are handled differently by the two toolkits on 100% of the ~5000 drug-like molecules I checked, R3 is handled differently on ~70%, and R4 is handled differently on ~2%.
These SMIRKS are also considered valid for constructing ChemicalEnvironment objects, i.e. doing so doesn't raise an error.
To resolve, I suggest we should:
ChemicalEnvironment.
rn, n=0,1,2, which I think could reasonably default to mean "not in an SSSR ring of any size", aka !r.