Fix #344 #351
There are a bunch of failing tests in the OpenEye version like this one. Looks like the |
Very interesting. When I only run that test by itself, it passes. My guess is that some test that runs before must be modifying the cached test molecules. Looking into this. |
Ahh, fascinating. #308 had us start caching SMILES to speed up hashing, but a molecule's SMILES depends on the toolkit used to make it. So, in our test suite, the RDKit tests run first, creating RDKit SMILES for all molecules. Then the OE tests run, but they recover the cached SMILES from RDKit and skip generating OE-format SMILES. I think this is actually a bug that could arise from use of the public API. We do publicly expose the

This issue highlights the fact that, moving forward, we will need to be stricter about defining what exactly a "cached property" is. We originally intended it to mean "dependent on the molecular graph", so cache invalidations would occur in a pretty strict set of circumstances (that is, graph changes). However, an OFFMol's SMILES is not uniquely a function of

I could see two ways forward here:
I think that toolkit runtime is pretty important, so I'm partial to the latter option. I'll add this for review momentarily. |
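To make the failure mode concrete, here is a minimal sketch of a cache that ignores which toolkit produced the SMILES. `ToyMolecule`, `toolkit_a_to_smiles`, and `toolkit_b_to_smiles` are hypothetical stand-ins invented for illustration, not the real toolkit wrappers; they only return two different, equally valid SMILES strings for ethanol.

```python
# Hypothetical stand-ins for two toolkits that write the same molecule
# (ethanol) as different, equally valid SMILES strings.
def toolkit_a_to_smiles(atoms):
    return "CCO"


def toolkit_b_to_smiles(atoms):
    return "OCC"


class ToyMolecule:
    """Illustrative only: caches one SMILES regardless of toolkit."""

    def __init__(self, atoms):
        self.atoms = atoms
        self._cached_smiles = None

    def to_smiles(self, to_smiles_function):
        # Bug: the cache does not record which toolkit produced the string,
        # so whichever toolkit runs first wins for every later caller.
        if self._cached_smiles is None:
            self._cached_smiles = to_smiles_function(self.atoms)
        return self._cached_smiles


mol = ToyMolecule(["C", "C", "O"])
first = mol.to_smiles(toolkit_a_to_smiles)
# The second call asks for toolkit B's format but gets toolkit A's cached string.
second = mol.to_smiles(toolkit_b_to_smiles)
```

Running a single test in isolation hides the problem, because the cache starts empty and the requested toolkit is the one that fills it.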
Some possible solutions:
|
This is the safest option, and what I've implemented. There is minimal runtime cost as implemented in my previous commit.
I think InChI is still "toolkit dependent", in cases where OE and RDK interpret the stereochemistry or aromaticity of the same molecule differently. This would be a pretty good option, but I think the first option is better.
I'd thought about caching Hill formulas, but the toolkit internally uses SMILES to generate molecule hashes, so that would lead to some problematic key collisions (isopropanol vs. propanol) |
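The isopropanol vs. propanol collision can be checked with a few lines of plain Python. This is only a sketch: `hill_formula` is a hypothetical helper written for this illustration, and the molecules are represented as bare element lists rather than toolkit objects.

```python
from collections import Counter


def hill_formula(elements):
    """Return the Hill-system formula for a list of element symbols."""
    counts = Counter(elements)
    # Hill order: carbon first, hydrogen second, then the rest alphabetically.
    # When carbon is absent, every element is listed alphabetically.
    if "C" in counts:
        order = ["C"] + (["H"] if "H" in counts else [])
        order += sorted(e for e in counts if e not in ("C", "H"))
    else:
        order = sorted(counts)
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


# Element lists for the two isomers. Connectivity is deliberately absent,
# which is exactly why a formula cannot tell them apart.
propanol = ["C"] * 3 + ["H"] * 8 + ["O"]      # 1-propanol, CCCO
isopropanol = ["C"] * 3 + ["H"] * 8 + ["O"]   # 2-propanol, CC(C)O

cache = {}
cache[hill_formula(propanol)] = "propanol"
cache[hill_formula(isopropanol)] = "isopropanol"  # overwrites the first entry
```

Both isomers reduce to the same key, so anything keyed on the formula alone would treat them as one molecule.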
Our assumption is that both toolkits implement the same variant MDL aromaticity model, right? |
Looks great! Thank you!
I agree that the double caching is the best way to go for now. We don't expose any part of it anyway, so we'll be free to change it later if we want to.
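A rough sketch of what a per-toolkit cache could look like, keyed on the `__qualname__` of the method that generated the SMILES (the same key the reviewed snippet uses). `ToolkitA`, `ToolkitB`, and `ToyMolecule` are hypothetical names for this illustration and are not the project's actual classes.

```python
class ToolkitA:
    """Hypothetical toolkit wrapper, not the real API."""

    def to_smiles(self, atoms):
        return "CCO"


class ToolkitB:
    """Second hypothetical wrapper that formats the same molecule differently."""

    def to_smiles(self, atoms):
        return "OCC"


class ToyMolecule:
    """Illustrative only: one cached SMILES per generating method."""

    def __init__(self, atoms):
        self.atoms = atoms
        self._cached_smiles = {}

    def to_smiles(self, toolkit):
        to_smiles_method = toolkit.to_smiles
        # e.g. "ToolkitA.to_smiles"; distinct for each wrapper class.
        func_qualname = to_smiles_method.__qualname__
        if func_qualname not in self._cached_smiles:
            self._cached_smiles[func_qualname] = to_smiles_method(self.atoms)
        return self._cached_smiles[func_qualname]


mol = ToyMolecule(["C", "C", "O"])
smiles_a = mol.to_smiles(ToolkitA())
smiles_b = mol.to_smiles(ToolkitB())
```

Each toolkit pays the generation cost once per molecule, and neither can read back a string written in the other's format.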
openforcefield/topology/molecule.py
func_qualname = to_smiles_method.__qualname__

# Check to see if a SMILES for this molecule was already cached using this method
if func_qualname in self._cached_smiles.keys():
I'd remove the keys() call: the in operator already tests membership against the dictionary's keys.
Oh man, I totally thought I'd read some Python docs that indicated that in checks both keys and values, and I was going to be all like "I know more than Andrea about one thing!".
Except now I can't find it, and the Python docs unambiguously agree with you. It must have been a dream :-/
Reverting...
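For reference, a quick check of how `in` behaves on a plain dict. The dictionary contents here are made up for illustration.

```python
cached_smiles = {"ToolkitA.to_smiles": "CCO"}

# Membership on a dict tests keys only.
key_found = "ToolkitA.to_smiles" in cached_smiles

# Going through .keys() gives the same answer, so the call is redundant.
key_found_via_keys = "ToolkitA.to_smiles" in cached_smiles.keys()

# A value is not found by a bare membership test...
value_found = "CCO" in cached_smiles

# ...unless the values view is searched explicitly.
value_found_via_values = "CCO" in cached_smiles.values()
```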
@jchodera That is our assumption, but I'm not 100% convinced (though I don't have hard data immediately available to back up any discrepancy). I'd prefer to assume "unsafe" in this case. |
Codecov Report
@@ Coverage Diff @@
## master #351 +/- ##
==========================================
+ Coverage 75.93% 75.95% +0.01%
==========================================
Files 17 17
Lines 5428 5431 +3
==========================================
+ Hits 4122 4125 +3
Misses 1306 1306
For the record:
I am quite confident there will be edge cases where they do not have exactly the same aromaticity model still. A lot of the back and forth about the definition of aromaticity models -- once the fundamentals are documented in place -- amounts to questions like, "Does it also consider X aromatic?" "What about Y?" We've basically gone through the alphabet once and all the "common" cases agree but may still disagree about edge cases. We will need to document any which come up and work with developers to get them corrected. |
allow_undefined_stereo kwargs to Molecule.from_smiles and Molecule.from_object
test_to_from_openeye and test_to_from_rdkit to be more explicit and test a wider range of behavior.
test_smiles_missing_stereochemistry to test a wider range of behavior.