-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Toolkit showcase update #1368
Toolkit showcase update #1368
Conversation
@Yoshanuikabundi please tag me or reach out to me directly if you have any questions or suggestions on integrating Interchange in this, including patches to other repos. I don't want to get in your way but I'm also interested and invested in getting this working nicely 🙂 |
Molecule.are_isomorphic() begins by comparing the Hill formulae of each molecule, which is quite slow especially for large molecules like proteins. Adding a simple check for the number of atoms first is much faster. This change cuts in half the time taken to call from_openmm on a topology of a protein and about 20k waters (~2min -> ~1min)
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
for more information, see https://pre-commit.ci
I've added a few potentially controversial methods to Topology:
I'm happy to make them private or put them in the notebook itself, but I think all three make life a lot easier. The first two are just convenience methods for accessing and updating all the positions at once and work by iterating over molecules and assigning/reading the first conformers of each. At some point we should talk about adding an option to the PDB output methods to use standard atom names, because I haven't found any protein visualization software that can render a cartoon on our PDB outputs (and the atom names are not even unique, possibly because the field in a PDB isn't long enough), but for the moment the new @mattwthompson Thanks for the offer! Turns out I need you immediately. When I try to construct an OpenMM topology, Interchange complains about virtual sites (AFAIK there should be no virtual sites in this system, but Interchange seems to expect them): ---------------------------------------------------------------------------
LookupError Traceback (most recent call last)
Input In [14], in <cell line: 2>()
1 omm_system = interchange.to_openmm()
----> 2 omm_top = interchange.to_openmm_topology()
File ~/Documents/openff/openff-toolkit/.soap/examples/lib/python3.10/site-packages/openff/interchange/components/interchange.py:529, in Interchange.to_openmm_topology(self, ensure_unique_atom_names)
526 """Export components of this Interchange to an OpenMM Topology."""
527 from openff.interchange.interop.openmm import to_openmm_topology
--> 529 return to_openmm_topology(
530 self, ensure_unique_atom_names=ensure_unique_atom_names
531 )
File ~/Documents/openff/openff-toolkit/.soap/examples/lib/python3.10/site-packages/openff/interchange/interop/openmm.py:922, in to_openmm_topology(interchange, ensure_unique_atom_names)
916 from openff.interchange.interop._virtual_sites import (
917 _virtual_site_parent_molecule_mapping,
918 )
920 topology = interchange.topology
--> 922 virtual_site_molecule_map = _virtual_site_parent_molecule_mapping(interchange)
924 molecule_virtual_site_map = defaultdict(list)
926 for virtual_site, molecule in virtual_site_molecule_map.items():
File ~/Documents/openff/openff-toolkit/.soap/examples/lib/python3.10/site-packages/openff/interchange/interop/_virtual_sites.py:18, in _virtual_site_parent_molecule_mapping(interchange)
13 def _virtual_site_parent_molecule_mapping(
14 interchange: "Interchange",
15 ) -> Dict[VirtualSiteKey, Molecule]:
16 mapping = dict()
---> 18 for virtual_site_key in interchange["VirtualSites"].slot_map:
19 virtual_site_key: VirtualSiteKey # type: ignore[no-redef]
20 parent_atom_index = virtual_site_key.orientation_atom_indices[0]
File ~/Documents/openff/openff-toolkit/.soap/examples/lib/python3.10/site-packages/openff/interchange/components/interchange.py:826, in Interchange.__getitem__(self, item)
824 return self.handlers[item]
825 else:
--> 826 raise LookupError(
827 f"Could not find component {item}. This object has the following "
828 f"potential handlers registered:\n\t{[*self.handlers.keys()]}"
829 )
LookupError: Could not find component VirtualSites. This object has the following potential handlers registered:
['Bonds', 'Constraints', 'Angles', 'ProperTorsions', 'ImproperTorsions', 'vdW', 'Electrostatics'] A minimal example is (with the attached from openff.toolkit import Topology, ForceField
from openmm import unit as openmm_unit
from openff.units.openmm import from_openmm
with open("topology.json", "r") as f:
top = Topology.from_json(f.read())
sage_ff14sb = ForceField("openff-2.0.0.offxml", "ff14sb_off_impropers_0.0.3.offxml")
interchange = sage_ff14sb.create_interchange(top)
omm_top = interchange.to_openmm_topology() If you want to see how the topology is constructed, I've committed and pushed my current working version of the notebook. Note that it requires changes to the Toolkit itself that are included on this branch. Please ignore the prose, I'm working on the code first. I'm excited to get this working! |
My first-impression is that these new I'll fix that bug shortly, no reason that logic should be in place in the way it is. I thought I wrote a test for that, guess not. |
100% agree with Matt here - Let's just define these in the notebook for now! We've been talking about exposing positions on topologies for a while but every option seems to be a steep tradeoff between horrible complexity and user confusion/footguns. So it'll need more thought, but this close to release it'll be best to keep this out of the API. |
Ok, I've moved those methods back into the notebook.
We need to make a decision about this, because I need to document (and already have) when users should introduce their positions. I thought we agreed in openff-interchange/436 that we would support velocities, visualization, and positions in topologies - hence me adding some of that support here! 😅 @j-wags I think just providing Here's me ranting about how we should conceptually have positions on topologies (and a solvate method) for a few pages, you should probably skip it but I couldn't bear to delete it.I'll just recap the current situation:
In the showcase, this makes things a bit awkward. Our inputs are: a. A PDB of the receptor protein with crystallographic waters (that we want to preserve) To assemble the system, the current strategy is:
At this point, we have: a. a topology with everything except the ligand, without positions If topologies don't have positions, what do we do here? Something like identify the clashing waters, remove their coordinates from the array, remove the corresponding number of waters from the topology, concatenate the SDF coordinates to the rest of the coordinates, add the ligand to the topology, convert it to an interchange, and THEN add the coordinates to the interchange. Conceptually, having positions in the topology is much simpler:
Note that I'm not talking about having a positions array in the topology; I'm just talking about what we suggest conceptually. Setting positions just means iterating over the molecules and setting the first conformer. This keeps everything object oriented, which is convenient for scientists. I think in most workflows involving proteins, the molecular identity and coordinates are stored together, and so they should enter the OpenFF ecosystem together. Plus, if topologies have coordinates, a user can easily parametrize the exact same system with different force fields. This workflow is very complex. There is an important question as to whether the developers should pay that complexity cost or the users. I know a A Not asking for anything to change now, but if we're gonna say positions don't belong on topologies I need an absolutely rock solid reason because from a pedagogical and users' perspective I am 99.999% sure they do. I need to flag a few other things that are driving me nuts at the moment so we can sort them out after 0.11.0:
@mattwthompson - What are the dangers of just doing Also, Sorry if this is a grumpy comment. |
Not a danger, but the toolkit's method does not include virtual sites (by design). Otherwise their behavior should be identical |
Woah, so I did. My mistake @Yoshanuikabundi, and thanks for correcting me. You're right - We agreed to add positions to topologies (and velocities, which doesn't need to be done in this PR). So I'll stick with that decision. Sorry for the confusion and thanks for linking to the written decision :-)
Agreed. It looks like you've already switched
Oh, that'd be great. Happy to have that in this PR or in a future one.
Agree, but that will take more quality assurance/documentation than we have time for in the near future. I think the current |
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
This pull request introduces 1 alert when merging 1adc90c into db20b63 - view on LGTM.com new alerts:
|
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
…cal_molecule_groups
for more information, see https://pre-commit.ci
OK well this PR is a monstrosity. Sorry you have to review it @j-wags. I think it is done though, and I expect the currently running tests to pass! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What a labor of love. Left a few small comments but few are blocking. Overall this looks great. Merge when ready :-)
docs/releasehistory.md
Outdated
### Breaking change: List to Tuple in `Topology.identical_molecule_groups` | ||
|
||
The [`Topology.identical_molecule_groups`] property has been refactored for performance and improved typing support. It now returns a value of type `{int:[(int, {int: int})]}`, where it previously returned `{int:[[int, {int: int}]]}`; the interior 2-element list has been changed to a tuple to allow the type of each element to be specified separately. This is technically a breaking change but is unlikely to cause any issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(blocking) It's nice that you thought to include this! But it's not actually an API break because it was never in a release. You can either remove it or announce identical_molecule_groups as a new API point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh nice! Easy
with NamedTemporaryFile(suffix=".sdf") as outfile: | ||
filename = str(outfile.name) | ||
ethanol.to_file(filename, file_format="sdf") | ||
|
||
Molecule.from_file(pathlib.Path("ethanol.sdf")) | ||
Molecule.from_file(pathlib.Path(filename)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
from openff.toolkit.topology import Molecule, Topology | ||
|
||
mol = Molecule.from_polymer_pdb( | ||
get_data_file_path("proteins/MainChain_ALA_ALA.pdb") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not blocking) The atom names in this file are already unique, so we wouldn't expect any atom renaming to happen in this test. It's still a useful test as-is though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I guess the point of this test is just to make sure the argument is getting passed along to to_openmm
. The exact behavior is tested in the tests for that method.
) | ||
off_topology = Topology.from_molecules([peptide]) | ||
|
||
# Remove atom names from some residues, make others have duplicate atom names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brilliant!
) | ||
def test_to_openmm_copies_molecules(self, ensure_unique_atom_names): | ||
""" | ||
Check that generating new atom names doesn't affect the input topology |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
This must be a new GH feature - I think by making changes after you gave an approving review I've undone your approval. I'll merge with my admin privileges once tests pass and I've double checked the rendered docs but this might need addressing for the future. |
This PR updates the showcase to use all the new magic in the toolkit and interchange. It will no longer use Parmed and OpenMM force fields.
It also includes performance optimizations for new features required to run the example in an acceptable time frame.