Fix #344 #351

j-wags · 2019-06-13T04:04:09Z

Fixes Generate OFFMol molecule from OEMol is failing #344
adds allow_undefined_stereo kwargs to Molecule.from_smiles and Molecule.from_object
refactors test_molecule test_to_from_openeye and test_to_from_rdkit to be more explicit and test a wider range of behavior.
refactors test_toolkits test_smiles_missing_stereochemistry to test a wider range of behavior.

jchodera · 2019-06-13T08:58:18Z

There are a bunch of failing tests in the OpenEye version like this one.

Looks like the pip install is now installing the 2019.4.2. toolkit that was released 30 Apr 2019, so it doesn't appear to be due to a recent toolkit update.

j-wags · 2019-06-13T17:41:10Z

Very interesting. When I only run that test by itself, it passes. My guess is that some test that runs before must be modifying the cached test molecules. Looking into this.

j-wags · 2019-06-13T20:09:50Z

Ahh, fascinating. #308 had us start caching SMILES to speed up hashing, but a molecule's SMILES is dependent on the toolkit used to make it. So, in our test suite, the RDKit tests run first, creating RDKit SMILES for all molecules. Then the OE tests run, but they recover the cached SMILES from RDKit and skip generating OE-format SMILES.

I think this is actually a bug that could arise from use of the pubic API. We do publicly expose the toolkit_registry kwarg in many functions, so I would say that returning an RDKit SMILES after the user requested an OE SMILES is pretty squarely a bug on our side.

This issue highlights the fact that, moving forward, we will need to be stricter about defining what exactly a "cached property" is. We originally intended it to mean "dependent on the molecular graph", so cache invalidations would occur in a pretty strict set of circumstances (that is, graph changes). However, an OFFMol's SMILES is not uniquely a function of (atoms, bonds, stereochemistry) -- it is also dependent on (toolkit).

I could see two ways forward here:

Don't cache SMILES, which will increase system creation time by 50-100% according to some quick tests.
Have Molecule._cached_smiles be a dict of the form {ToolkitRegistry._toolkit_name: SMILES}

I think that toolkit runtime is pretty important, so I'm partial to the latter option. I'll add this for review momentarily.

jchodera · 2019-06-14T10:31:02Z

Some possible solutions:

Refactor the code and tests to make it OK if two different toolkits are used---caching a SMILES representation should only help speed things up, after all. If we end up caching both RDKit and OpenEye SMILES, that's fine---we do 2x the effort, but still see a speedup.
Switch from using SMILES (which is toolkit-dependent) for caching to something easily available but more universal, like InChI: RDKit InChI
Use something else for caching, like Hill formulas

j-wags · 2019-06-14T14:30:48Z

Refactor the code and tests to make it OK if two different toolkits are used---caching a SMILES representation should only help speed things up, after all. If we end up caching both RDKit and OpenEye SMILES, that's fine---we do 2x the effort, but still see a speedup.

This is the safest option, and what I've implemented. There is minimal runtime cost as implemented in my previous commit.

Switch from using SMILES (which is toolkit-dependent) for caching to something easily available but more universal, like InChI: RDKit InChI

I think InChI is still "toolkit dependent", in cases where OE and RDK interpret the stereochemistry or aromaticity of the same molecule differently. This would be a pretty good option, but I think the first option is better.

Use something else for caching, like Hill formulas

I'd thought about caching Hill formulas, but the toolkit internally uses SMILES to generate molecule hashes, so that would lead to some problematic key collisions (isopropanol vs. propanol)

jchodera · 2019-06-14T14:49:08Z

I think InChI is still "toolkit dependent", in cases where OE and RDK interpret the stereochemistry or aromaticity of the same molecule differently. This would be a pretty good option, but I think the first option is better.

Our assumption is that both toolkits implement the same variant MDL aromaticity model, right?

andrrizzi

Looks great! Thank you!

I agree that the double caching is the best way to go for now. We don't expose any part of it anyway, so we'll be free to change it later if we want to.

andrrizzi · 2019-06-14T14:45:54Z

openforcefield/topology/molecule.py

+        func_qualname = to_smiles_method.__qualname__
+
+        # Check to see if a SMILES for this molecule was already cached using this method
+        if func_qualname in self._cached_smiles.keys():


I'd remove the keys() with the in operator.

Oh man, I totally thought I'd read some python docs that indicated that in checks both keys and values, and I was going to be all like "I know more than Andrea about one thing!".

Except now I can't find it, and the Python docs unambiguously agree with you. It must have been a dream :-/

Reverting...

j-wags · 2019-06-14T15:05:14Z

Our assumption is that both toolkits implement the same variant MDL aromaticity model, right?

@jchodera That is our assumption, but I'm not 100% convinced (though I don't have hard data immediately available to back up any discrepancy). I'd prefer to assume "unsafe" in this case.

codecov-io · 2019-06-14T16:09:59Z

Codecov Report

Merging #351 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #351      +/-   ##
==========================================
+ Coverage   75.93%   75.95%   +0.01%     
==========================================
  Files          17       17              
  Lines        5428     5431       +3     
==========================================
+ Hits         4122     4125       +3     
  Misses       1306     1306

Impacted Files	Coverage Δ
openforcefield/utils/toolkits.py	`87.08% <100%> (+0.01%)`	⬆️
openforcefield/topology/molecule.py	`86.18% <100%> (+0.03%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6d69835...936d3f4. Read the comment docs.

davidlmobley · 2019-06-21T00:45:06Z

For the record:

Our assumption is that both toolkits implement the same variant MDL aromaticity model, right?

@jchodera That is our assumption, but I'm not 100% convinced (though I don't have hard data immediately available to back up any discrepancy). I'd prefer to assume "unsafe" in this case.

I am quite confident there will be edge cases where they do not have exactly the same aromaticity model still. A lot of the back and forth about the definition of aromaticity models -- once the fundamentals are documented in place -- amounts to questions like, "Does it also consider X aromatic?" "What about Y?" We've basically gone through the alphabet once and all the "common" cases agree but may still disagree about edge cases. We will need to document any which come up and work with developers to get them corrected.

j-wags added 3 commits June 12, 2019 20:21

Fix #344 by adding allow_undef_stereo kwarg to more funcs in OE and RDK

83beb09

cleanup test_to_from_rdkit docstring

382b684

implementation and tests for allow_undef_stereo for mol.from_smiles

315f8fd

use __qualname__ to distinguish cached smiles from specific toolkits

a3a2c0a

jchodera mentioned this pull request Jun 14, 2019

Generate OFFMol molecule from OEMol is failing #344

Closed

andrrizzi approved these changes Jun 14, 2019

View reviewed changes

j-wags added 2 commits June 14, 2019 08:06

Merge branch 'master' into fix-344

a9caf4c

"in dict.keys()" --> "in dict"

936d3f4

j-wags merged commit cc1d5eb into master Jun 14, 2019

j-wags deleted the fix-344 branch June 14, 2019 16:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #344 #351

Fix #344 #351

j-wags commented Jun 13, 2019

jchodera commented Jun 13, 2019

j-wags commented Jun 13, 2019

j-wags commented Jun 13, 2019 •

edited

Loading

jchodera commented Jun 14, 2019

j-wags commented Jun 14, 2019 •

edited

Loading

jchodera commented Jun 14, 2019

andrrizzi left a comment

andrrizzi Jun 14, 2019

j-wags Jun 14, 2019

j-wags commented Jun 14, 2019

codecov-io commented Jun 14, 2019

davidlmobley commented Jun 21, 2019

Fix #344 #351

Fix #344 #351

Conversation

j-wags commented Jun 13, 2019

jchodera commented Jun 13, 2019

j-wags commented Jun 13, 2019

j-wags commented Jun 13, 2019 • edited Loading

jchodera commented Jun 14, 2019

j-wags commented Jun 14, 2019 • edited Loading

jchodera commented Jun 14, 2019

andrrizzi left a comment

Choose a reason for hiding this comment

andrrizzi Jun 14, 2019

Choose a reason for hiding this comment

j-wags Jun 14, 2019

Choose a reason for hiding this comment

j-wags commented Jun 14, 2019

codecov-io commented Jun 14, 2019

Codecov Report

davidlmobley commented Jun 21, 2019

j-wags commented Jun 13, 2019 •

edited

Loading

j-wags commented Jun 14, 2019 •

edited

Loading