Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Toolkit showcase update #1368

Merged
merged 57 commits into from
Aug 24, 2022
Merged

Toolkit showcase update #1368

merged 57 commits into from
Aug 24, 2022

Conversation

Yoshanuikabundi
Copy link
Collaborator

@Yoshanuikabundi Yoshanuikabundi commented Aug 11, 2022

This PR updates the showcase to use all the new magic in the toolkit and interchange. It will no longer use Parmed and OpenMM force fields.

It also includes performance optimizations for new features required to run the example in an acceptable time frame.

@codecov
Copy link

codecov bot commented Aug 11, 2022

Codecov Report

Merging #1368 (70e776d) into main (2a9d6ec) will decrease coverage by 0.09%.
The diff coverage is 95.04%.

@Yoshanuikabundi Yoshanuikabundi marked this pull request as draft August 11, 2022 02:17
@mattwthompson
Copy link
Member

@Yoshanuikabundi please tag me or reach out to me directly if you have any questions or suggestions on integrating Interchange in this, including patches to other repos. I don't want to get in your way but I'm also interested and invested in getting this working nicely 🙂

Molecule.are_isomorphic() begins by comparing the
Hill formulae of each molecule, which is quite
slow especially for large molecules like proteins.
Adding a simple check for the number of atoms
first is much faster. This change cuts in half the
time taken to call from_openmm on a topology of a
protein and about 20k waters (~2min -> ~1min)
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@Yoshanuikabundi
Copy link
Collaborator Author

I've added a few potentially controversial methods to Topology:

  • Topology.get_positions()
  • Topology.set_positions()
  • Topology.visualize()

I'm happy to make them private or put them in the notebook itself, but I think all three make life a lot easier. The first two are just convenience methods for accessing and updating all the positions at once and work by iterating over molecules and assigning/reading the first conformers of each. At some point we should talk about adding an option to the PDB output methods to use standard atom names, because I haven't found any protein visualization software that can render a cartoon on our PDB outputs (and the atom names are not even unique, possibly because the field in a PDB isn't long enough), but for the moment the new visualize() method can fill in this gap.

@mattwthompson Thanks for the offer! Turns out I need you immediately. When I try to construct an OpenMM topology, Interchange complains about virtual sites (AFAIK there should be no virtual sites in this system, but Interchange seems to expect them):

---------------------------------------------------------------------------
LookupError                               Traceback (most recent call last)
Input In [14], in <cell line: 2>()
      1 omm_system = interchange.to_openmm()
----> 2 omm_top = interchange.to_openmm_topology()

File ~/Documents/openff/openff-toolkit/.soap/examples/lib/python3.10/site-packages/openff/interchange/components/interchange.py:529, in Interchange.to_openmm_topology(self, ensure_unique_atom_names)
    526 """Export components of this Interchange to an OpenMM Topology."""
    527 from openff.interchange.interop.openmm import to_openmm_topology
--> 529 return to_openmm_topology(
    530     self, ensure_unique_atom_names=ensure_unique_atom_names
    531 )

File ~/Documents/openff/openff-toolkit/.soap/examples/lib/python3.10/site-packages/openff/interchange/interop/openmm.py:922, in to_openmm_topology(interchange, ensure_unique_atom_names)
    916 from openff.interchange.interop._virtual_sites import (
    917     _virtual_site_parent_molecule_mapping,
    918 )
    920 topology = interchange.topology
--> 922 virtual_site_molecule_map = _virtual_site_parent_molecule_mapping(interchange)
    924 molecule_virtual_site_map = defaultdict(list)
    926 for virtual_site, molecule in virtual_site_molecule_map.items():

File ~/Documents/openff/openff-toolkit/.soap/examples/lib/python3.10/site-packages/openff/interchange/interop/_virtual_sites.py:18, in _virtual_site_parent_molecule_mapping(interchange)
     13 def _virtual_site_parent_molecule_mapping(
     14     interchange: "Interchange",
     15 ) -> Dict[VirtualSiteKey, Molecule]:
     16     mapping = dict()
---> 18     for virtual_site_key in interchange["VirtualSites"].slot_map:
     19         virtual_site_key: VirtualSiteKey  # type: ignore[no-redef]
     20         parent_atom_index = virtual_site_key.orientation_atom_indices[0]

File ~/Documents/openff/openff-toolkit/.soap/examples/lib/python3.10/site-packages/openff/interchange/components/interchange.py:826, in Interchange.__getitem__(self, item)
    824     return self.handlers[item]
    825 else:
--> 826     raise LookupError(
    827         f"Could not find component {item}. This object has the following "
    828         f"potential handlers registered:\n\t{[*self.handlers.keys()]}"
    829     )

LookupError: Could not find component VirtualSites. This object has the following potential handlers registered:
	['Bonds', 'Constraints', 'Angles', 'ProperTorsions', 'ImproperTorsions', 'vdW', 'Electrostatics']

A minimal example is (with the attached topology.json):

from openff.toolkit import Topology, ForceField
from openmm import unit as openmm_unit
from openff.units.openmm import from_openmm

with open("topology.json", "r") as f:
    top = Topology.from_json(f.read())

sage_ff14sb = ForceField("openff-2.0.0.offxml", "ff14sb_off_impropers_0.0.3.offxml")
interchange = sage_ff14sb.create_interchange(top)
omm_top = interchange.to_openmm_topology()

If you want to see how the topology is constructed, I've committed and pushed my current working version of the notebook. Note that it requires changes to the Toolkit itself that are included on this branch. Please ignore the prose, I'm working on the code first.

I'm excited to get this working!

@mattwthompson
Copy link
Member

My first-impression is that these new Topology methods should be standalone functions scoped to the notebook; I think we are getting close to the wire for significant changes to the API in this release and writing them as functions allows us to defer decisions into the future - enabling the notebook to work just about as well as if they were methods but requiring less testing and consensus-gathering. I'm moderately opposed to positions being on topologies here and my opinion on Topology.visualize() is closer to neutrality.

I'll fix that bug shortly, no reason that logic should be in place in the way it is. I thought I wrote a test for that, guess not.

@j-wags
Copy link
Member

j-wags commented Aug 11, 2022

I've added a few potentially controversial methods to Topology:

Topology.get_positions()
Topology.set_positions()
Topology.visualize()

I'm happy to make them private or put them in the notebook itself, but I think all three make life a lot easier.

My first-impression is that these new Topology methods should be standalone functions scoped to the notebook; I think we are getting close to the wire for significant changes to the API in this release and writing them as functions allows us to defer decisions into the future

100% agree with Matt here - Let's just define these in the notebook for now! We've been talking about exposing positions on topologies for a while but every option seems to be a steep tradeoff between horrible complexity and user confusion/footguns. So it'll need more thought, but this close to release it'll be best to keep this out of the API.

@Yoshanuikabundi
Copy link
Collaborator Author

Ok, I've moved those methods back into the notebook.

I'm moderately opposed to positions being on topologies here
We've been talking about exposing positions on topologies for a while but every option seems to be a steep tradeoff between horrible complexity and user confusion/footguns.

We need to make a decision about this, because I need to document (and already have) when users should introduce their positions. I thought we agreed in openff-interchange/436 that we would support velocities, visualization, and positions in topologies - hence me adding some of that support here! 😅

@j-wags I think just providing Topology.get_positions() and Topology.set_positions() is the compromise that avoids confusion without much added complexity. Both methods work by iterating over the molecules' first conformers. The get/set pattern makes it clear that you're working with a copy of the data, and we document that as well. Both methods together including docstrings are less than 60 lines of code and include all the safety checks you could hope for. I suspect if we don't provide these methods, our users will write them themselves. But we don't have to do that today!

Here's me ranting about how we should conceptually have positions on topologies (and a solvate method) for a few pages, you should probably skip it but I couldn't bear to delete it.

I'll just recap the current situation:

  • Interchange stores the system positions and velocities in the .positions and .velocities arrays
  • Interchange initializes its positions from the first conformation of each molecule in the topology iff every molecule has at least one conformer.
  • Interchange works with .positions or .velocities set to None, but export methods requiring those values will fail
  • Interchange supports a .visualize() method
  • By design, methods for adding or removing molecules to an existing Interchange do not exist

In the showcase, this makes things a bit awkward. Our inputs are:

a. A PDB of the receptor protein with crystallographic waters (that we want to preserve)
b. An SDF of the small molecule ligand

To assemble the system, the current strategy is:

  1. Solvate the PDB with crystallographic waters with PDBFixer, which takes a PDB as input
  2. Split the protein out into its own PDB and import it with Molecule.from_polymer_pdb()
  3. Import the fully solvated system directly from PDBFixer with Topology.from_openmm(), passing in the above protein molecule as one of the unique_molecules

At this point, we have:

a. a topology with everything except the ligand, without positions
b. an array of positions for everything except the ligand (from PDBFixer)
c. an SDF of the ligand, with positions

If topologies don't have positions, what do we do here? Something like identify the clashing waters, remove their coordinates from the array, remove the corresponding number of waters from the topology, concatenate the SDF coordinates to the rest of the coordinates, add the ligand to the topology, convert it to an interchange, and THEN add the coordinates to the interchange. Conceptually, having positions in the topology is much simpler:

  1. Set the positions of the molecules in the topology with the array of positions from PDBFixer.
  2. Import the ligand with Molecule.from_file() and add it to the topology
  3. Remove water molecules that clash with the ligand

Note that I'm not talking about having a positions array in the topology; I'm just talking about what we suggest conceptually. Setting positions just means iterating over the molecules and setting the first conformer. This keeps everything object oriented, which is convenient for scientists. I think in most workflows involving proteins, the molecular identity and coordinates are stored together, and so they should enter the OpenFF ecosystem together. Plus, if topologies have coordinates, a user can easily parametrize the exact same system with different force fields.

This workflow is very complex. There is an important question as to whether the developers should pay that complexity cost or the users. I know a Topology.solvate() method is unpopular, but I would rather personally maintain a single solvation method in the toolkit than a separate one in every example, and I'm sure our users would appreciate it too. This would replace steps 1, 4, and 6 with a call to this method. It would mean either packaging equillibrated water boxes with the toolkit, or calling out to something like PDBFixer.

A Topology.from_pdb() method, which would take a list of unique molecules and just load the PDB with openmm.app.PDBFile and pass it through Topology.from_openmm, would be nice too, but this is simple enough that I'm happy for it to live in the examples.

Not asking for anything to change now, but if we're gonna say positions don't belong on topologies I need an absolutely rock solid reason because from a pedagogical and users' perspective I am 99.999% sure they do.


I need to flag a few other things that are driving me nuts at the moment so we can sort them out after 0.11.0:

  • Visualization at the moment is a massive pain, because we attempt to uniquify atom names at the drop of a hat. PyMol, VMD, and NGLView all rely on atom names to identify proteins. Interchange automatically attempts to uniquify atom names, and this means that all outputs are unable to be visualized as ribbons or cartoons, and you can't use selection languages to show side chains in a different representation, and so on. Call interchange.visualize() on the toolkit showcase interchange and you'll see... two caps and a ligand in ball and stick rep and nothing else. This happens even in Topology.to_file(), which is essential for visualization outside the notebook - and there, it produces PDBs that don't even have unique atom names past 100 atoms of an element because the name has a fixed width! I think we should have options everywhere to preserve atom names.

  • at the moment I have to write PDBs to disk a lot to pass information between libraries

    • It would be nice to be able to provide a file-like object to methods like Topology.to_file.
    • Exposing a method like Molecule.from_polymer_openmm_topology(), which is just Molecule.from_polymer_pdb() without the PDBFile step, would be wonderful too

@mattwthompson - What are the dangers of just doing top.to_openmm() rather than interchange.to_openmm_topology()? It seems to be working fine (and it lets me visualize things).


Also, Topology.to_json() and Topology.from_json() are wonderful methods whom I love.

Sorry if this is a grumpy comment.

@mattwthompson
Copy link
Member

What are the dangers of just doing top.to_openmm() rather than interchange.to_openmm_topology()?

Not a danger, but the toolkit's method does not include virtual sites (by design). Otherwise their behavior should be identical

@j-wags
Copy link
Member

j-wags commented Aug 15, 2022

I thought we agreed in openforcefield/openff-interchange#436 (comment) that we would support velocities, visualization, and positions in topologies - hence me adding some of that support here! 😅

Woah, so I did. My mistake @Yoshanuikabundi, and thanks for correcting me. You're right - We agreed to add positions to topologies (and velocities, which doesn't need to be done in this PR). So I'll stick with that decision. Sorry for the confusion and thanks for linking to the written decision :-)

Visualization at the moment is a massive pain, because we attempt to uniquify atom names at the drop of a hat. PyMol, VMD, and NGLView all rely on atom names to identify proteins

Agreed. It looks like you've already switched ensure_unique_atom_names=False in the Toolkit calls, which is great. @mattwthompson, would you be open to a future PR to Interchange that prevents renaming in Interchange.visualize?

It would be nice to be able to provide a file-like object to methods like Topology.to_file.

Oh, that'd be great. Happy to have that in this PR or in a future one.

Exposing a method like Molecule.from_polymer_openmm_topology(), which is just Molecule.from_polymer_pdb() without the PDBFile step, would be wonderful too

Agree, but that will take more quality assurance/documentation than we have time for in the near future. I think the current from_polymer_pdb behavior relies somewhat on the fact that the topology it's loading came from an OpenMM PDBFile object. So we'll need additional logic to handle/provide useful errors for loading general OpenMM topologies.

@lgtm-com
Copy link

lgtm-com bot commented Aug 16, 2022

This pull request introduces 1 alert when merging 1adc90c into db20b63 - view on LGTM.com

new alerts:

  • 1 for Unused import

@Yoshanuikabundi Yoshanuikabundi marked this pull request as ready for review August 23, 2022 07:41
@Yoshanuikabundi
Copy link
Collaborator Author

OK well this PR is a monstrosity. Sorry you have to review it @j-wags. I think it is done though, and I expect the currently running tests to pass!

j-wags
j-wags previously approved these changes Aug 24, 2022
Copy link
Member

@j-wags j-wags left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a labor of love. Left a few small comments but few are blocking. Overall this looks great. Merge when ready :-)

Comment on lines 95 to 97
### Breaking change: List to Tuple in `Topology.identical_molecule_groups`

The [`Topology.identical_molecule_groups`] property has been refactored for performance and improved typing support. It now returns a value of type `{int:[(int, {int: int})]}`, where it previously returned `{int:[[int, {int: int}]]}`; the interior 2-element list has been changed to a tuple to allow the type of each element to be specified separately. This is technically a breaking change but is unlikely to cause any issues.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(blocking) It's nice that you thought to include this! But it's not actually an API break because it was never in a release. You can either remove it or announce identical_molecule_groups as a new API point.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nice! Easy

Comment on lines +900 to +904
with NamedTemporaryFile(suffix=".sdf") as outfile:
filename = str(outfile.name)
ethanol.to_file(filename, file_format="sdf")

Molecule.from_file(pathlib.Path("ethanol.sdf"))
Molecule.from_file(pathlib.Path(filename))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

from openff.toolkit.topology import Molecule, Topology

mol = Molecule.from_polymer_pdb(
get_data_file_path("proteins/MainChain_ALA_ALA.pdb")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not blocking) The atom names in this file are already unique, so we wouldn't expect any atom renaming to happen in this test. It's still a useful test as-is though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I guess the point of this test is just to make sure the argument is getting passed along to to_openmm. The exact behavior is tested in the tests for that method.

)
off_topology = Topology.from_molecules([peptide])

# Remove atom names from some residues, make others have duplicate atom names
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brilliant!

)
def test_to_openmm_copies_molecules(self, ensure_unique_atom_names):
"""
Check that generating new atom names doesn't affect the input topology
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Yoshanuikabundi
Copy link
Collaborator Author

This must be a new GH feature - I think by making changes after you gave an approving review I've undone your approval. I'll merge with my admin privileges once tests pass and I've double checked the rendered docs but this might need addressing for the future.

@Yoshanuikabundi Yoshanuikabundi merged commit d294dad into main Aug 24, 2022
@Yoshanuikabundi Yoshanuikabundi deleted the update_showcase branch August 24, 2022 03:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants