Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #344 #351

Merged
merged 6 commits into from
Jun 14, 2019
Merged

Fix #344 #351

merged 6 commits into from
Jun 14, 2019

Conversation

j-wags
Copy link
Member

@j-wags j-wags commented Jun 13, 2019

  • Fixes Generate OFFMol molecule from OEMol is failing #344
  • adds allow_undefined_stereo kwargs to Molecule.from_smiles and Molecule.from_object
  • refactors test_molecule test_to_from_openeye and test_to_from_rdkit to be more explicit and test a wider range of behavior.
  • refactors test_toolkits test_smiles_missing_stereochemistry to test a wider range of behavior.

@jchodera
Copy link
Member

There are a bunch of failing tests in the OpenEye version like this one.

Looks like the pip install is now installing the 2019.4.2. toolkit that was released 30 Apr 2019, so it doesn't appear to be due to a recent toolkit update.

@j-wags
Copy link
Member Author

j-wags commented Jun 13, 2019

Very interesting. When I only run that test by itself, it passes. My guess is that some test that runs before must be modifying the cached test molecules. Looking into this.

@j-wags
Copy link
Member Author

j-wags commented Jun 13, 2019

Ahh, fascinating. #308 had us start caching SMILES to speed up hashing, but a molecule's SMILES is dependent on the toolkit used to make it. So, in our test suite, the RDKit tests run first, creating RDKit SMILES for all molecules. Then the OE tests run, but they recover the cached SMILES from RDKit and skip generating OE-format SMILES.

I think this is actually a bug that could arise from use of the pubic API. We do publicly expose the toolkit_registry kwarg in many functions, so I would say that returning an RDKit SMILES after the user requested an OE SMILES is pretty squarely a bug on our side.

This issue highlights the fact that, moving forward, we will need to be stricter about defining what exactly a "cached property" is. We originally intended it to mean "dependent on the molecular graph", so cache invalidations would occur in a pretty strict set of circumstances (that is, graph changes). However, an OFFMol's SMILES is not uniquely a function of (atoms, bonds, stereochemistry) -- it is also dependent on (toolkit).

I could see two ways forward here:

  • Don't cache SMILES, which will increase system creation time by 50-100% according to some quick tests.
  • Have Molecule._cached_smiles be a dict of the form {ToolkitRegistry._toolkit_name: SMILES}

I think that toolkit runtime is pretty important, so I'm partial to the latter option. I'll add this for review momentarily.

@jchodera
Copy link
Member

Some possible solutions:

  • Refactor the code and tests to make it OK if two different toolkits are used---caching a SMILES representation should only help speed things up, after all. If we end up caching both RDKit and OpenEye SMILES, that's fine---we do 2x the effort, but still see a speedup.
  • Switch from using SMILES (which is toolkit-dependent) for caching to something easily available but more universal, like InChI: RDKit InChI
  • Use something else for caching, like Hill formulas

@j-wags
Copy link
Member Author

j-wags commented Jun 14, 2019

Refactor the code and tests to make it OK if two different toolkits are used---caching a SMILES representation should only help speed things up, after all. If we end up caching both RDKit and OpenEye SMILES, that's fine---we do 2x the effort, but still see a speedup.

This is the safest option, and what I've implemented. There is minimal runtime cost as implemented in my previous commit.

Switch from using SMILES (which is toolkit-dependent) for caching to something easily available but more universal, like InChI: RDKit InChI

I think InChI is still "toolkit dependent", in cases where OE and RDK interpret the stereochemistry or aromaticity of the same molecule differently. This would be a pretty good option, but I think the first option is better.

Use something else for caching, like Hill formulas

I'd thought about caching Hill formulas, but the toolkit internally uses SMILES to generate molecule hashes, so that would lead to some problematic key collisions (isopropanol vs. propanol)

@jchodera
Copy link
Member

I think InChI is still "toolkit dependent", in cases where OE and RDK interpret the stereochemistry or aromaticity of the same molecule differently. This would be a pretty good option, but I think the first option is better.

Our assumption is that both toolkits implement the same variant MDL aromaticity model, right?

Copy link
Collaborator

@andrrizzi andrrizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thank you!

I agree that the double caching is the best way to go for now. We don't expose any part of it anyway, so we'll be free to change it later if we want to.

func_qualname = to_smiles_method.__qualname__

# Check to see if a SMILES for this molecule was already cached using this method
if func_qualname in self._cached_smiles.keys():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove the keys() with the in operator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh man, I totally thought I'd read some python docs that indicated that in checks both keys and values, and I was going to be all like "I know more than Andrea about one thing!".

Except now I can't find it, and the Python docs unambiguously agree with you. It must have been a dream :-/

Reverting...

@j-wags
Copy link
Member Author

j-wags commented Jun 14, 2019

Our assumption is that both toolkits implement the same variant MDL aromaticity model, right?

@jchodera That is our assumption, but I'm not 100% convinced (though I don't have hard data immediately available to back up any discrepancy). I'd prefer to assume "unsafe" in this case.

@codecov-io
Copy link

Codecov Report

Merging #351 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #351      +/-   ##
==========================================
+ Coverage   75.93%   75.95%   +0.01%     
==========================================
  Files          17       17              
  Lines        5428     5431       +3     
==========================================
+ Hits         4122     4125       +3     
  Misses       1306     1306
Impacted Files Coverage Δ
openforcefield/utils/toolkits.py 87.08% <100%> (+0.01%) ⬆️
openforcefield/topology/molecule.py 86.18% <100%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6d69835...936d3f4. Read the comment docs.

@j-wags j-wags merged commit cc1d5eb into master Jun 14, 2019
@j-wags j-wags deleted the fix-344 branch June 14, 2019 16:56
@davidlmobley
Copy link
Contributor

For the record:

Our assumption is that both toolkits implement the same variant MDL aromaticity model, right?

@jchodera That is our assumption, but I'm not 100% convinced (though I don't have hard data immediately available to back up any discrepancy). I'd prefer to assume "unsafe" in this case.

I am quite confident there will be edge cases where they do not have exactly the same aromaticity model still. A lot of the back and forth about the definition of aromaticity models -- once the fundamentals are documented in place -- amounts to questions like, "Does it also consider X aromatic?" "What about Y?" We've basically gone through the alphabet once and all the "common" cases agree but may still disagree about edge cases. We will need to document any which come up and work with developers to get them corrected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Generate OFFMol molecule from OEMol is failing
5 participants