Defining a specification for atomic map indices #522
Comments
Oh darn, the table can't be updated like normal checkboxes. Oh well, I'll try to keep it updated manually by directly editing the markdown as we get feedback. My votes are: |
Thanks for putting this up @j-wags
This came up in our QCFractal meeting today, and @ChayaSt and I are strongly against having We may still need functionality like a way to store an atom mapping, but it would be useful to review the use cases for this before we evaluate the proposed implementation, since we have no means to rationally evaluate it otherwise. I'll note we also have similar functionality in the |
Thankfully, I misspoke. Josh's implementation was doing the right thing all along, so this isn't an issue!
I hadn't thought about atom maps replacing the hacky dict that powers |
Can map indices be redundant? No, I don't see the benefit of having a many-to-one mapping, and it complicates things.

To add to the list:

In any case, I agree that we should just return the same ordering that was given, rather than modifying the order and giving a map that can be used to get what was originally there in the first place.

Also, maps are defined as between molecules, just as bonds are defined as between atoms. Storing a map inside of a molecule is a bit strange, and I'm not sure what purpose it would serve. Much of the wording in these checkboxes makes maps seem like a molecule property, which is a little misleading in my opinion. Atom maps should just be dictionaries, and functions that can utilize them should take in a map kwarg (and the obvious default is a {1:1, 2:2, ..., n:n} mapping for molecules in the same order).

If we really wanted to have maps as molecule-level attrs, then we would need to store a map for each possible conversion, e.g. to/from_oemol, to/from_rdkit, to/from_qcschema, etc., where we would then embed these representations in the OFF molecule itself to maintain the "state" of the mapping (if we e.g. delete an atom). In other words, it is not clear to me that one map will be the same across the toolkits in the way they build molecules.

Here is how I envision this working. Let us have 3 molecules:

import numpy as np

mol1, mol2, mol3 = get_off_mols()  # could be from a mol2, qcschema, OE, RDKit, etc.
try:
    map12 = Molecule.map(mol1, mol2, allow_partial=False)[0]  # or map12 = mol1.map(mol2)[0]
    map13 = Molecule.map(mol1, mol3, allow_partial=False)[0]
except NullMap:
    # Bail if not isomorphic and allow_partial turned off. In the case of allow_partial=True, not
    # even a fragment could be found.
    #
    # Note we need to be careful then, since we could always probably find many partial mappings of
    # single sp3 carbons. We would therefore generate an M by N matrix of map dictionaries, where M
    # is the number of sp3 carbons in mol1, and N is the number of sp3 carbons in mol2.
    # To alleviate some of this behavior, we could implement a map_only option which takes a list of
    # atom indices from mol1 that we want to explicitly map (default is all atoms in mol1), to get a 1xN
    # mapping.
    #
    # tl;dr partial maps are difficult to handle (and luckily not part of this issue)
    return
except PartialMap:
    # The idea here is that, in general, we can map a fragment of mol1 to mol2 and it could match
    # multiple parts of mol2.
    # However, since we know that these will map completely (e.g. are isomorphic), we turned
    # allow_partial=False above, which means this exception will be thrown if a complete mapping
    # cannot be found.
    assert len(map12) == 1  # if partial maps allowed, then a fragment can map multiple times
    assert len(map13) == 1
    return

# Collect charges in mol1's atom order; sorting by the mol1 index keeps the columns aligned.
charges = []
charges.append(mol1.partial_charges)
charges.append([mol2.partial_charges[dst] for _, dst in sorted(map12.items())])
charges.append([mol3.partial_charges[dst] for _, dst in sorted(map13.items())])
# or even mol3.partial_charges(map=map13) as a convenience

for sym, (q1, q2, q3) in zip(mol1.symbols, np.transpose(charges)):
    print(sym, q1, q2, q3)

resultA = mol1.some_function(mol2, map=map12)
resultB = Molecule.some_other_function(mol1, mol3, map=map13)
print(resultA, resultB)

Talking with Jeff a little on this, and a question that came up is whether

Overall, I think we should just have a map arg for any function that takes 2 or more molecules as input (or if a mol object has a function taking another mol object as input), and just provide a function, e.g. map/are_isomorphic/remap, that will provide this map if it is possible. Whether we call this map automatically is up for debate, and I think it will depend on how much overhead it will introduce when molecules become large.

Finally, note that in my description, the
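To make the "maps are just dictionaries" idea concrete, here is a minimal sketch of applying such a map to a per-atom list like partial charges. It assumes 0-based atom indices and a one-to-one map from the first molecule to the second; `identity_map` and `apply_map` are hypothetical helper names for illustration, not toolkit API.

```python
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def identity_map(n_atoms: int) -> Dict[int, int]:
    """Trivial map for two molecules whose atoms are already in the same order."""
    return {i: i for i in range(n_atoms)}


def apply_map(values: Sequence[T], atom_map: Dict[int, int]) -> List[T]:
    """Reorder per-atom values of molecule B into the atom order of molecule A.

    atom_map[i] = j means atom i of A corresponds to atom j of B.
    """
    # Keys must be exactly 0..n-1 so the output has one entry per atom of A.
    if sorted(atom_map) != list(range(len(atom_map))):
        raise ValueError("map keys must cover 0..n-1 exactly once")
    # Reject many-to-one maps.
    if len(set(atom_map.values())) != len(atom_map):
        raise ValueError("map must be one-to-one")
    return [values[atom_map[i]] for i in range(len(atom_map))]


charges_b = [-0.4, 0.1, 0.3]   # charges in molecule B's atom order
map_ab = {0: 2, 1: 0, 2: 1}    # atom 0 of A is atom 2 of B, and so on
print(apply_map(charges_b, map_ab))  # [0.3, -0.4, 0.1]
```

Indexing by key rather than iterating over `atom_map.items()` means the result does not depend on the insertion order of the dictionary.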
Atom maps are very useful to have for both bespoke fitting workflows (e.g. tracking which fragment atoms map to which parent atoms) and for fitting in general, where aspects of a molecule (e.g. the torsion being driven) need to be tracked. While atom maps are partially implemented currently, they are still not first-class citizens of the toolkit and are not guaranteed to be preserved or respected by functions. Canonically ordering a molecule, for example, will retain the My preference would be to add a As for the above questions, presumably these are for a future (pydantic?) implementation of the model? Currently the validation logic for molecules is quite scattered and segmented, so I'm not sure how easy such validation would be at the moment.
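One way to read "preserved" here is that a map index should follow its atom when atoms are reordered. The sketch below illustrates that reading only; it assumes the map is stored as a dict from atom index to map index, and `remap_after_reorder` is a hypothetical name, not an existing toolkit function.

```python
from typing import Dict, List


def remap_after_reorder(atom_map: Dict[int, int], new_order: List[int]) -> Dict[int, int]:
    """Carry map indices through an atom reordering.

    new_order[new_index] = old_index, i.e. the atom now at position
    new_index used to sit at position old_index.
    """
    old_to_new = {old: new for new, old in enumerate(new_order)}
    return {old_to_new[old]: map_index for old, map_index in atom_map.items()}


atom_map = {0: 1, 3: 2}       # atoms 0 and 3 carry map indices 1 and 2
new_order = [3, 1, 2, 0]      # reordering swaps the first and last atoms
print(remap_after_reorder(atom_map, new_order))  # {3: 1, 0: 2}
```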
Is your feature request related to a problem? Please describe.
Both OETK and RDKit support atom map indices. Many users find having a secondary indexing system useful for a variety of reasons. I'm generally against adding complexity, but @jthorton is making a good case for why supporting map indices will be helpful as we develop an automated QCA submission tool and bespoke workflow.
Describe the solution you'd like
We should decide on a "specification for atomic map indices". This specification should be a set of valid behaviors for map indices on a molecule. Hopefully it will line up well with what OETK and RDKit do, but that's not necessarily the biggest goal. What I want to settle on is an answer to "what behaviors do we allow?". Can map indices be redundant? Can they be negative? I'll put a more complete list of decisions below. As I'm learning with partial charges right now, if we make these decisions now, then life will be easy later.
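As an illustration of what such a specification could look like once written down, here is a minimal sketch of a validator with each open question exposed as a switch. The function name, the dict-of-ints representation, and the default values are all assumptions for discussion, not decisions made in this thread.

```python
from typing import Dict, List


def validate_atom_map(
    atom_map: Dict[int, int],
    n_atoms: int,
    allow_duplicates: bool = False,
    allow_negative: bool = False,
) -> List[str]:
    """Return a list of violations; an empty list means the map is valid.

    atom_map maps atom index -> map index. Atoms without an entry are
    treated as unmapped, so partial maps are accepted in this sketch.
    """
    problems = []
    for atom_index, map_index in atom_map.items():
        if not 0 <= atom_index < n_atoms:
            problems.append(f"atom index {atom_index} is out of range")
        if map_index < 0 and not allow_negative:
            problems.append(f"atom {atom_index} has negative map index {map_index}")
    values = list(atom_map.values())
    if not allow_duplicates and len(set(values)) != len(values):
        problems.append("map indices are not unique")
    return problems


print(validate_atom_map({0: 1, 1: 2}, n_atoms=3))  # []
print(validate_atom_map({0: 1, 1: 1}, n_atoms=3))  # ['map indices are not unique']
```

Each "what behaviors do we allow?" vote then corresponds to flipping one default.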
Note that this may be separate from an "API for atomic map indices" -- at this time I don't really care whether we access it as offmol.properties['atom_map'] or offmol.atoms[0].map_index.
For reference, here are the relevant portions of the Daylight SMILES specification, which is where all this mapped-SMILES stuff began to the best of my knowledge. This may help us understand what behavior RDK and OETK should support.
Section 3.5
Section 3.5.1
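For readers unfamiliar with the notation: in mapped SMILES the map index is written after a colon inside a bracket atom, e.g. `[CH3:1]`. The sketch below pulls those indices out with a regular expression. It is not a SMILES parser (it ignores atoms written without brackets), and `bracket_atom_maps` is a hypothetical helper for illustration only.

```python
import re
from typing import List, Optional, Tuple

# A bracket atom, optionally ending in ":<digits>" before the closing bracket.
_BRACKET_ATOM = re.compile(r"\[([^\[\]:]+)(?::(\d+))?\]")


def bracket_atom_maps(smiles: str) -> List[Tuple[str, Optional[int]]]:
    """Return (atom text, map index or None) for each bracket atom, in order."""
    result = []
    for body, map_index in _BRACKET_ATOM.findall(smiles):
        result.append((body, int(map_index) if map_index else None))
    return result


print(bracket_atom_maps("[CH3:1][OH:2]"))   # [('CH3', 1), ('OH', 2)]
print(bracket_atom_maps("[CH3:1]C[NH4+]"))  # [('CH3', 1), ('NH4+', None)]
```

Because the pattern only accepts digits after the colon, a negative index can never be produced from this notation, which bears on the "can they be negative?" question.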
Things we need to decide
I don't really know how to define a specification, but I know they end up saying a bunch of things like "SHALL", "MAY", "THOU", and "SHALL NOT". I've tried to put down a bunch of decisions we can vote on here, and hopefully our consensus will help us arrive at some set of biblical-language statements.
I'm happy to add more folks to the table, just tagging the most relevant now.
None
?None
?==
operator to fail for otherwise-identical molecules?