Toolkit showcase update #1368

Yoshanuikabundi · 2022-08-11T01:35:53Z

This PR updates the showcase to use all the new magic in the toolkit and interchange. It will no longer use Parmed and OpenMM force fields.

It also includes performance optimizations for new features required to run the example in an acceptable time frame.

codecov · 2022-08-11T01:46:13Z

Codecov Report

Merging #1368 (70e776d) into main (2a9d6ec) will decrease coverage by 0.09%.
The diff coverage is 95.04%.

mattwthompson · 2022-08-11T03:42:07Z

@Yoshanuikabundi please tag me or reach out to me directly if you have any questions or suggestions on integrating Interchange in this, including patches to other repos. I don't want to get in your way but I'm also interested and invested in getting this working nicely 🙂

Molecule.are_isomorphic() begins by comparing the Hill formulae of each molecule, which is quite slow especially for large molecules like proteins. Adding a simple check for the number of atoms first is much faster. This change cuts in half the time taken to call from_openmm on a topology of a protein and about 20k waters (~2min -> ~1min)

review-notebook-app · 2022-08-11T11:16:49Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

for more information, see https://pre-commit.ci

Yoshanuikabundi · 2022-08-11T11:32:29Z

I've added a few potentially controversial methods to Topology:

Topology.get_positions()
Topology.set_positions()
Topology.visualize()

I'm happy to make them private or put them in the notebook itself, but I think all three make life a lot easier. The first two are just convenience methods for accessing and updating all the positions at once and work by iterating over molecules and assigning/reading the first conformers of each. At some point we should talk about adding an option to the PDB output methods to use standard atom names, because I haven't found any protein visualization software that can render a cartoon on our PDB outputs (and the atom names are not even unique, possibly because the field in a PDB isn't long enough), but for the moment the new visualize() method can fill in this gap.

@mattwthompson Thanks for the offer! Turns out I need you immediately. When I try to construct an OpenMM topology, Interchange complains about virtual sites (AFAIK there should be no virtual sites in this system, but Interchange seems to expect them):

---------------------------------------------------------------------------
LookupError                               Traceback (most recent call last)
Input In [14], in <cell line: 2>()
      1 omm_system = interchange.to_openmm()
----> 2 omm_top = interchange.to_openmm_topology()

File ~/Documents/openff/openff-toolkit/.soap/examples/lib/python3.10/site-packages/openff/interchange/components/interchange.py:529, in Interchange.to_openmm_topology(self, ensure_unique_atom_names)
    526 """Export components of this Interchange to an OpenMM Topology."""
    527 from openff.interchange.interop.openmm import to_openmm_topology
--> 529 return to_openmm_topology(
    530     self, ensure_unique_atom_names=ensure_unique_atom_names
    531 )

File ~/Documents/openff/openff-toolkit/.soap/examples/lib/python3.10/site-packages/openff/interchange/interop/openmm.py:922, in to_openmm_topology(interchange, ensure_unique_atom_names)
    916 from openff.interchange.interop._virtual_sites import (
    917     _virtual_site_parent_molecule_mapping,
    918 )
    920 topology = interchange.topology
--> 922 virtual_site_molecule_map = _virtual_site_parent_molecule_mapping(interchange)
    924 molecule_virtual_site_map = defaultdict(list)
    926 for virtual_site, molecule in virtual_site_molecule_map.items():

File ~/Documents/openff/openff-toolkit/.soap/examples/lib/python3.10/site-packages/openff/interchange/interop/_virtual_sites.py:18, in _virtual_site_parent_molecule_mapping(interchange)
     13 def _virtual_site_parent_molecule_mapping(
     14     interchange: "Interchange",
     15 ) -> Dict[VirtualSiteKey, Molecule]:
     16     mapping = dict()
---> 18     for virtual_site_key in interchange["VirtualSites"].slot_map:
     19         virtual_site_key: VirtualSiteKey  # type: ignore[no-redef]
     20         parent_atom_index = virtual_site_key.orientation_atom_indices[0]

File ~/Documents/openff/openff-toolkit/.soap/examples/lib/python3.10/site-packages/openff/interchange/components/interchange.py:826, in Interchange.__getitem__(self, item)
    824     return self.handlers[item]
    825 else:
--> 826     raise LookupError(
    827         f"Could not find component {item}. This object has the following "
    828         f"potential handlers registered:\n\t{[*self.handlers.keys()]}"
    829     )

LookupError: Could not find component VirtualSites. This object has the following potential handlers registered:
	['Bonds', 'Constraints', 'Angles', 'ProperTorsions', 'ImproperTorsions', 'vdW', 'Electrostatics']

A minimal example is (with the attached topology.json):

from openff.toolkit import Topology, ForceField
from openmm import unit as openmm_unit
from openff.units.openmm import from_openmm

with open("topology.json", "r") as f:
    top = Topology.from_json(f.read())

sage_ff14sb = ForceField("openff-2.0.0.offxml", "ff14sb_off_impropers_0.0.3.offxml")
interchange = sage_ff14sb.create_interchange(top)
omm_top = interchange.to_openmm_topology()

If you want to see how the topology is constructed, I've committed and pushed my current working version of the notebook. Note that it requires changes to the Toolkit itself that are included on this branch. Please ignore the prose, I'm working on the code first.

I'm excited to get this working!

mattwthompson · 2022-08-11T16:13:06Z

My first-impression is that these new Topology methods should be standalone functions scoped to the notebook; I think we are getting close to the wire for significant changes to the API in this release and writing them as functions allows us to defer decisions into the future - enabling the notebook to work just about as well as if they were methods but requiring less testing and consensus-gathering. I'm moderately opposed to positions being on topologies here and my opinion on Topology.visualize() is closer to neutrality.

I'll fix that bug shortly, no reason that logic should be in place in the way it is. I thought I wrote a test for that, guess not.

j-wags · 2022-08-11T17:11:04Z

I've added a few potentially controversial methods to Topology:
Topology.get_positions()
Topology.set_positions()
Topology.visualize()
I'm happy to make them private or put them in the notebook itself, but I think all three make life a lot easier.

My first-impression is that these new Topology methods should be standalone functions scoped to the notebook; I think we are getting close to the wire for significant changes to the API in this release and writing them as functions allows us to defer decisions into the future

100% agree with Matt here - Let's just define these in the notebook for now! We've been talking about exposing positions on topologies for a while but every option seems to be a steep tradeoff between horrible complexity and user confusion/footguns. So it'll need more thought, but this close to release it'll be best to keep this out of the API.

Yoshanuikabundi · 2022-08-15T06:14:42Z

Ok, I've moved those methods back into the notebook.

I'm moderately opposed to positions being on topologies here
We've been talking about exposing positions on topologies for a while but every option seems to be a steep tradeoff between horrible complexity and user confusion/footguns.

We need to make a decision about this, because I need to document (and already have) when users should introduce their positions. I thought we agreed in openff-interchange/436 that we would support velocities, visualization, and positions in topologies - hence me adding some of that support here! 😅

@j-wags I think just providing Topology.get_positions() and Topology.set_positions() is the compromise that avoids confusion without much added complexity. Both methods work by iterating over the molecules' first conformers. The get/set pattern makes it clear that you're working with a copy of the data, and we document that as well. Both methods together including docstrings are less than 60 lines of code and include all the safety checks you could hope for. I suspect if we don't provide these methods, our users will write them themselves. But we don't have to do that today!

Here's me ranting about how we should conceptually have positions on topologies (and a solvate method) for a few pages, you should probably skip it but I couldn't bear to delete it.

I'll just recap the current situation:

Interchange stores the system positions and velocities in the .positions and .velocities arrays
Interchange initializes its positions from the first conformation of each molecule in the topology iff every molecule has at least one conformer.
Interchange works with .positions or .velocities set to None, but export methods requiring those values will fail
Interchange supports a .visualize() method
By design, methods for adding or removing molecules to an existing Interchange do not exist

In the showcase, this makes things a bit awkward. Our inputs are:

a. A PDB of the receptor protein with crystallographic waters (that we want to preserve)
b. An SDF of the small molecule ligand

To assemble the system, the current strategy is:

Solvate the PDB with crystallographic waters with PDBFixer, which takes a PDB as input
Split the protein out into its own PDB and import it with Molecule.from_polymer_pdb()
Import the fully solvated system directly from PDBFixer with Topology.from_openmm(), passing in the above protein molecule as one of the unique_molecules

At this point, we have:

a. a topology with everything except the ligand, without positions
b. an array of positions for everything except the ligand (from PDBFixer)
c. an SDF of the ligand, with positions

If topologies don't have positions, what do we do here? Something like identify the clashing waters, remove their coordinates from the array, remove the corresponding number of waters from the topology, concatenate the SDF coordinates to the rest of the coordinates, add the ligand to the topology, convert it to an interchange, and THEN add the coordinates to the interchange. Conceptually, having positions in the topology is much simpler:

Set the positions of the molecules in the topology with the array of positions from PDBFixer.
Import the ligand with Molecule.from_file() and add it to the topology
Remove water molecules that clash with the ligand

Note that I'm not talking about having a positions array in the topology; I'm just talking about what we suggest conceptually. Setting positions just means iterating over the molecules and setting the first conformer. This keeps everything object oriented, which is convenient for scientists. I think in most workflows involving proteins, the molecular identity and coordinates are stored together, and so they should enter the OpenFF ecosystem together. Plus, if topologies have coordinates, a user can easily parametrize the exact same system with different force fields.

This workflow is very complex. There is an important question as to whether the developers should pay that complexity cost or the users. I know a Topology.solvate() method is unpopular, but I would rather personally maintain a single solvation method in the toolkit than a separate one in every example, and I'm sure our users would appreciate it too. This would replace steps 1, 4, and 6 with a call to this method. It would mean either packaging equillibrated water boxes with the toolkit, or calling out to something like PDBFixer.

A Topology.from_pdb() method, which would take a list of unique molecules and just load the PDB with openmm.app.PDBFile and pass it through Topology.from_openmm, would be nice too, but this is simple enough that I'm happy for it to live in the examples.

Not asking for anything to change now, but if we're gonna say positions don't belong on topologies I need an absolutely rock solid reason because from a pedagogical and users' perspective I am 99.999% sure they do.

I need to flag a few other things that are driving me nuts at the moment so we can sort them out after 0.11.0:

Visualization at the moment is a massive pain, because we attempt to uniquify atom names at the drop of a hat. PyMol, VMD, and NGLView all rely on atom names to identify proteins. Interchange automatically attempts to uniquify atom names, and this means that all outputs are unable to be visualized as ribbons or cartoons, and you can't use selection languages to show side chains in a different representation, and so on. Call interchange.visualize() on the toolkit showcase interchange and you'll see... two caps and a ligand in ball and stick rep and nothing else. This happens even in Topology.to_file(), which is essential for visualization outside the notebook - and there, it produces PDBs that don't even have unique atom names past 100 atoms of an element because the name has a fixed width! I think we should have options everywhere to preserve atom names.
at the moment I have to write PDBs to disk a lot to pass information between libraries
- It would be nice to be able to provide a file-like object to methods like Topology.to_file.
- Exposing a method like Molecule.from_polymer_openmm_topology(), which is just Molecule.from_polymer_pdb() without the PDBFile step, would be wonderful too

@mattwthompson - What are the dangers of just doing top.to_openmm() rather than interchange.to_openmm_topology()? It seems to be working fine (and it lets me visualize things).

Also, Topology.to_json() and Topology.from_json() are wonderful methods whom I love.

Sorry if this is a grumpy comment.

mattwthompson · 2022-08-15T09:52:51Z

What are the dangers of just doing top.to_openmm() rather than interchange.to_openmm_topology()?

Not a danger, but the toolkit's method does not include virtual sites (by design). Otherwise their behavior should be identical

j-wags · 2022-08-15T21:56:29Z

I thought we agreed in openforcefield/openff-interchange#436 (comment) that we would support velocities, visualization, and positions in topologies - hence me adding some of that support here! 😅

Woah, so I did. My mistake @Yoshanuikabundi, and thanks for correcting me. You're right - We agreed to add positions to topologies (and velocities, which doesn't need to be done in this PR). So I'll stick with that decision. Sorry for the confusion and thanks for linking to the written decision :-)

Visualization at the moment is a massive pain, because we attempt to uniquify atom names at the drop of a hat. PyMol, VMD, and NGLView all rely on atom names to identify proteins

Agreed. It looks like you've already switched ensure_unique_atom_names=False in the Toolkit calls, which is great. @mattwthompson, would you be open to a future PR to Interchange that prevents renaming in Interchange.visualize?

It would be nice to be able to provide a file-like object to methods like Topology.to_file.

Oh, that'd be great. Happy to have that in this PR or in a future one.

Exposing a method like Molecule.from_polymer_openmm_topology(), which is just Molecule.from_polymer_pdb() without the PDBFile step, would be wonderful too

Agree, but that will take more quality assurance/documentation than we have time for in the near future. I think the current from_polymer_pdb behavior relies somewhat on the fact that the topology it's loading came from an OpenMM PDBFile object. So we'll need additional logic to handle/provide useful errors for loading general OpenMM topologies.

for more information, see https://pre-commit.ci

lgtm-com · 2022-08-16T11:43:57Z

This pull request introduces 1 alert when merging 1adc90c into db20b63 - view on LGTM.com

new alerts:

1 for Unused import

for more information, see https://pre-commit.ci

openff/toolkit/topology/topology.py

for more information, see https://pre-commit.ci

…cal_molecule_groups

for more information, see https://pre-commit.ci

Yoshanuikabundi · 2022-08-23T07:55:42Z

OK well this PR is a monstrosity. Sorry you have to review it @j-wags. I think it is done though, and I expect the currently running tests to pass!

j-wags

What a labor of love. Left a few small comments but few are blocking. Overall this looks great. Merge when ready :-)

j-wags · 2022-08-24T01:14:34Z

docs/releasehistory.md

+### Breaking change: List to Tuple in `Topology.identical_molecule_groups`
+
+The [`Topology.identical_molecule_groups`] property has been refactored for performance and improved typing support. It now returns a value of type `{int:[(int, {int: int})]}`, where it previously returned `{int:[[int, {int: int}]]}`; the interior 2-element list has been changed to a tuple to allow the type of each element to be specified separately. This is technically a breaking change but is unlikely to cause any issues.


(blocking) It's nice that you thought to include this! But it's not actually an API break because it was never in a release. You can either remove it or announce identical_molecule_groups as a new API point.

Oh nice! Easy

j-wags · 2022-08-24T01:15:22Z

openff/toolkit/tests/test_molecule.py

+        with NamedTemporaryFile(suffix=".sdf") as outfile:
+            filename = str(outfile.name)
+            ethanol.to_file(filename, file_format="sdf")

-        Molecule.from_file(pathlib.Path("ethanol.sdf"))
+            Molecule.from_file(pathlib.Path(filename))


j-wags · 2022-08-24T01:19:19Z

openff/toolkit/tests/test_topology.py

+        from openff.toolkit.topology import Molecule, Topology
+
+        mol = Molecule.from_polymer_pdb(
+            get_data_file_path("proteins/MainChain_ALA_ALA.pdb")


(not blocking) The atom names in this file are already unique, so we wouldn't expect any atom renaming to happen in this test. It's still a useful test as-is though.

Yeah I guess the point of this test is just to make sure the argument is getting passed along to to_openmm. The exact behavior is tested in the tests for that method.

j-wags · 2022-08-24T01:22:47Z

openff/toolkit/tests/test_topology.py

+        )
+        off_topology = Topology.from_molecules([peptide])
+
+        # Remove atom names from some residues, make others have duplicate atom names


j-wags · 2022-08-24T01:45:01Z

openff/toolkit/tests/test_topology.py

+    )
+    def test_to_openmm_copies_molecules(self, ensure_unique_atom_names):
+        """
+        Check that generating new atom names doesn't affect the input topology


Yoshanuikabundi · 2022-08-24T02:52:50Z

This must be a new GH feature - I think by making changes after you gave an approving review I've undone your approval. I'll merge with my admin privileges once tests pass and I've double checked the rendered docs but this might need addressing for the future.

Faster identification of unique molecules

a9e7d79

Yoshanuikabundi marked this pull request as draft August 11, 2022 02:17

Yoshanuikabundi added 5 commits August 11, 2022 18:16

Cache hill formulae

85749d8

invalidate cache only once when adding multiple molecules to topology

7ab61ad

Topology methods for vizualization and getting/setting positions

65449ba

Initial progress on updating showcase

dc1b241

[pre-commit.ci] auto fixes from pre-commit.com hooks

4d67d03

for more information, see https://pre-commit.ci

Move visualize(), get_positions(), set_positions() to notebook

9d3c26d

Yoshanuikabundi and others added 6 commits August 16, 2022 15:42

Add get/set_positions to topology with tests

9ffae08

Get example running (Excluding appendix)

1e5a170

[pre-commit.ci] auto fixes from pre-commit.com hooks

5747a44

for more information, see https://pre-commit.ci

Improvements to Topology.to_file

1d89996

Pedagogic updates to toolkit_showcase

14afb79

[pre-commit.ci] auto fixes from pre-commit.com hooks

1adc90c

for more information, see https://pre-commit.ci

Yoshanuikabundi added 5 commits August 16, 2022 21:48

Clear outputs of toolkit showcase

2c0afba

Remove unused import

14f3a97

Make MyPy happy

bd63cea

Add comments to Topology.to_file()

580f5cd

PintQuantity -> Quantity (waiting for generic support in openff.units)

87c944e

[pre-commit.ci] auto fixes from pre-commit.com hooks

ef5c0cc

for more information, see https://pre-commit.ci

mattwthompson reviewed Aug 18, 2022

View reviewed changes

openff/toolkit/topology/topology.py Outdated Show resolved Hide resolved

Yoshanuikabundi and others added 14 commits August 21, 2022 18:53

Import Iterator type hint from correct place

6db08d9

Rename keepIds argument to keep_ids

4291c95

Tests for new stuff in Topology.to_file()

1f1a084

Merge branch 'main' into update_showcase

eb491f2

[pre-commit.ci] auto fixes from pre-commit.com hooks

728d9fd

for more information, see https://pre-commit.ci

Remove unused variables

3ff15bc

Copy topology in to_openmm and test input atom names are preserved

c3a1a5c

Add new ensure_unique_atom_names options to existing tests

ce182f8

Add comments to test

77cbfae

Add tests for residue support of ensure_unique_atom_names in to_openmm

df8b4e7

Annotate to_file tests as requires_rdkit

cce90bc

Add type hints to _identify_chemically_identical_molecules and identi…

8768971

…cal_molecule_groups

Stop tests from writing out un-gitignored ethanol.sdf

24c4715

[pre-commit.ci] auto fixes from pre-commit.com hooks

c2c77d7

for more information, see https://pre-commit.ci

Yoshanuikabundi mentioned this pull request Aug 23, 2022

Port new ensure_unique_atom_names behavior from Toolkit openforcefield/openff-interchange#510

Merged

3 tasks

remove unused namedtuple import

dd1c240

Yoshanuikabundi marked this pull request as ready for review August 23, 2022 07:41

Update changelog

1b57098

Yoshanuikabundi requested a review from j-wags August 23, 2022 07:55

j-wags previously approved these changes Aug 24, 2022

View reviewed changes

j-wags mentioned this pull request Aug 24, 2022

Topology.to_openmm fails if an atom doesn't have a name and ensure_unique_atom_names=False #1375

Closed

Change breaking change announcement to new API point announcement

70e776d

Yoshanuikabundi dismissed j-wags’s stale review via 70e776d August 24, 2022 02:46

Yoshanuikabundi merged commit d294dad into main Aug 24, 2022

Yoshanuikabundi deleted the update_showcase branch August 24, 2022 03:13

mattwthompson mentioned this pull request Feb 2, 2024

Allow Topology.set_positions(None) #1820

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Toolkit showcase update #1368

Toolkit showcase update #1368

Yoshanuikabundi commented Aug 11, 2022 •

edited

Loading

codecov bot commented Aug 11, 2022 •

edited

Loading

mattwthompson commented Aug 11, 2022

review-notebook-app bot commented Aug 11, 2022

Yoshanuikabundi commented Aug 11, 2022

mattwthompson commented Aug 11, 2022

j-wags commented Aug 11, 2022

Yoshanuikabundi commented Aug 15, 2022

mattwthompson commented Aug 15, 2022

j-wags commented Aug 15, 2022

lgtm-com bot commented Aug 16, 2022

Yoshanuikabundi commented Aug 23, 2022

j-wags left a comment

j-wags Aug 24, 2022

Yoshanuikabundi Aug 24, 2022

j-wags Aug 24, 2022

j-wags Aug 24, 2022

Yoshanuikabundi Aug 24, 2022

j-wags Aug 24, 2022

j-wags Aug 24, 2022

Yoshanuikabundi commented Aug 24, 2022

		### Breaking change: List to Tuple in `Topology.identical_molecule_groups`

		The [`Topology.identical_molecule_groups`] property has been refactored for performance and improved typing support. It now returns a value of type `{int:[(int, {int: int})]}`, where it previously returned `{int:[[int, {int: int}]]}`; the interior 2-element list has been changed to a tuple to allow the type of each element to be specified separately. This is technically a breaking change but is unlikely to cause any issues.

Toolkit showcase update #1368

Toolkit showcase update #1368

Conversation

Yoshanuikabundi commented Aug 11, 2022 • edited Loading

codecov bot commented Aug 11, 2022 • edited Loading

Codecov Report

mattwthompson commented Aug 11, 2022

review-notebook-app bot commented Aug 11, 2022

Yoshanuikabundi commented Aug 11, 2022

mattwthompson commented Aug 11, 2022

j-wags commented Aug 11, 2022

Yoshanuikabundi commented Aug 15, 2022

mattwthompson commented Aug 15, 2022

j-wags commented Aug 15, 2022

lgtm-com bot commented Aug 16, 2022

Yoshanuikabundi commented Aug 23, 2022

j-wags left a comment

Choose a reason for hiding this comment

j-wags Aug 24, 2022

Choose a reason for hiding this comment

Yoshanuikabundi Aug 24, 2022

Choose a reason for hiding this comment

j-wags Aug 24, 2022

Choose a reason for hiding this comment

j-wags Aug 24, 2022

Choose a reason for hiding this comment

Yoshanuikabundi Aug 24, 2022

Choose a reason for hiding this comment

j-wags Aug 24, 2022

Choose a reason for hiding this comment

j-wags Aug 24, 2022

Choose a reason for hiding this comment

Yoshanuikabundi commented Aug 24, 2022

Yoshanuikabundi commented Aug 11, 2022 •

edited

Loading

codecov bot commented Aug 11, 2022 •

edited

Loading