Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defining a specification for atomic map indices #522

Open
j-wags opened this issue Feb 19, 2020 · 6 comments
Open

Defining a specification for atomic map indices #522

j-wags opened this issue Feb 19, 2020 · 6 comments

Comments

@j-wags
Copy link
Member

j-wags commented Feb 19, 2020

Is your feature request related to a problem? Please describe.
Both OETK and RDKit support atom map indices. Many users find having a secondary indexing system useful for a variety of reasons. I'm generally against adding complexity, but @jthorton is making a good case for why supporting map indices will be helpful as we develop an automated QCA submission tool and bespoke workflow.

Describe the solution you'd like
We should decide on a "specification for atomic map indices". This specification should be a set of valid behaviors for map indices on a molecule. Hopefully it will line up well with what OETK and RDKit do, but that's not necessarily the biggest goal. What I want to settle on is an answer to "what behaviors do we allow?". Can map indices be redundant? Can they be negative? I'll put a more complete list of decisions below. As I'm learning with partial charges right now, if we make these decisions now, then life will be easy later.

Note that this may be separate from an "API for atomic map indices" -- at this time I don't really care whether we access it as offmol.properties['atom_map'] or offmol.atoms[0].map_index.

For reference, here's the relevant portions of the Daylight SMILES specification, which is where all this mapped-SMILES stuff began to the best of my knowledge. This may help us understand what behavior RDK and OETK should support.

Section 3.5

Atom maps are non-negative integer atom modifiers. They follow the ":" character within an atom expression. They must be the last modifier within the atom expression:

Atom maps are an atomic property. They can legally appear in a SMILES for any atom, whether or not it is part of a reaction. Atom with atom map labels in a molecule SMILES are considered valid; the atom maps are ignored for molecule processing. Absolute and unique SMILES generated by the system for molecules never include atom maps.

Section 3.5.1

Within the SMILES language, atom maps are represented as a non-negative numeric atom modifier following the ":" character (e.g. [CH3:2] is a carbon in class 2).

Within the Daylight toolkit, the atom maps are manipulated as sets of mapped atoms. The atom map class numbers which are used in SMILES do not appear in the toolkit interface to a reaction. The map class numbers in SMILES do not have any additional significance, except to associate all atoms with the same map class label to one another.

There are no requirements for completeness or uniqueness of the atom mappings. Atom mappings are independent of the connectivity and properties of the underlying molecules.

Things we need to decide

I don't really know how to define a specification, but I know the end up saying a bunch of things like "SHALL", "MAY", "THOU", and "SHALL NOT". I've tried to put down a bunch of decisions we can vote on here, and hopefully our consensus will help us arrive at some set of biblical-language statements.

I'm happy to add more folks to the table, just tagging the most relevant now.

Question @j-wags @jthorton @jchodera @trevorgokey
Can map indices be redundant?
Can they be negative?
Can they be strings?
Can they be None?
Should they be round-trippable to/from OEMol, RDMol, and/or SDF?
Should map indices be initialized upon atom creation to a default value?
If no to above, should an absence of map indices be the same as map indices being all set to None?
Should a difference in map index cause the == operator to fail for otherwise-identical molecules?
Must there be an atom map for all atoms in the molecule if there's an atom map for at least one?
Must the mapping function return a map of every atom, even if given a partial map?
Should a partial map be accepted by the toolkit functions?
@j-wags
Copy link
Member Author

j-wags commented Feb 19, 2020

Oh darn, the table can't be updated like normal checkboxes. Oh well, I'll try to keep it updated manually by directly editing the markdown as we get feedback.

My votes are:
Can map indices be redundant? Yes
Can they be negative? No
Can they be strings? No
Can they be None? No (this is mostly due to implementation difficulty)
Should they be round-trippable to/from OEMol, RDMol, and/or SDF? Yes
Should map indices be initialized upon atom creation to a default value? Yes
If no to above, should an absence of map indices be the same as map indices being all set to None? N/A
Should a difference in map index cause the == operator to fail for otherwise-identical molecules? No
Must there be an atom map for all atoms in the molecule if there's an atom map for at least one? Yes (since I think they should all be initialized to a default value, probably 0)

@jthorton
Copy link
Collaborator

Thanks for putting this up @j-wags
My votes are:
Can map indices be redundant? Yes
Can they be negative? No
Can they be strings? No
Can they be None? Yes (only in OFFTK) this would be the default value or indicate no map is present, as users will mainly deal with the OFFMol I think this would make more sense than 0 as our atom indexing starts from 0. If we use 0 as the default value you would have to check more than one atoms map index to see if you have a valid mapping whereas a quick None check could be done on any atom.
Should they be round-trippable to/from OEMol, RDMol, and/or SDF? Yes
Should map indices be initialized upon atom creation to a default value? Yes
If no to above, should an absence of map indices be the same as map indices being all set to None? N/A
Should a difference in map index cause the == operator to fail for otherwise-identical molecules? No
Must there be an atom map for all atoms in the molecule if there's an atom map for at least one? Yes (since I think they should all be initialized to a default value, probably None).

@jchodera
Copy link
Member

Both OETK and RDKit support atom map indices. Many users find having a secondary indexing system useful for a variety of reasons. I'm generally against adding complexity, but @jthorton is making a good case for why supporting map indices will be helpful as we develop an automated QCA submission tool and bespoke workflow.

This came up in our QCFractal meeting today, and @ChayaSt and I are strongly against having Molecule.from_qcschema() return a Molecule object with an incorrect atom ordering by default. If the cmiles block is present, the atom order of the generated molecule must match the tagged SMILES. The entire point of CMILES is to provide this tagged canonical ordering for this purpose---99.9% of our use cases require the molecule be constructed in the correct order. I'll raise a separate issue for this.

We may still need functionality like a way to store an atom mapping, but it would be useful to review the use cases for this before we evaluate the proposed implementation, since we have no means to rationally evaluate it otherwise.

I'll note we also have similar functionality in the TopologyMolecule class, which stores a mapping of a Molecule onto another molecule (in aTopology). If we design this feature well, we might be able to replace the TopologyMolecule code with this mapping object.

@j-wags
Copy link
Member Author

j-wags commented Feb 21, 2020

This came up in our QCFractal meeting today, and @ChayaSt and I are strongly against having Molecule.from_qcschema() return a Molecule object with an incorrect atom ordering by default. If the cmiles block is present, the atom order of the generated molecule must match the tagged SMILES. The entire point of CMILES is to provide this tagged canonical ordering for this purpose---99.9% of our use cases require the molecule be constructed in the correct order. I'll raise a separate issue for this

Thankfully, I misspoke. Josh's implementation was doing the right thing all along, so this isn't an issue!

I'll note we also have similar functionality in the TopologyMolecule class, which stores a mapping of a Molecule onto another molecule (in aTopology). If we design this feature well, we might be able to replace the TopologyMolecule code with this mapping object.

I hadn't thought about atom maps replacing the hacky dict that powers TopologyMolecule, but that would be a really nice outcome. I'll put some thought into it... one big sticking point is that Topology objects have two indexing systems -- One for TopologyAtoms, and another for TopologyParticles (which are a combination of all TopologyAtoms and VirtualSites). There are some hairy things there we need to figure out about VirtualSites, like whether we need rules for how they're indexed in the list of TopologyParticles, how to maintain a sane indexing system for particles when we can add VirtualSites both via the Molecule API and during system creation, and how they can be accessed and deduplicated. I'll talk to @trevorgokey this Friday about those and post our thoughts.

@trevorgokey
Copy link
Collaborator

Can map indices be redundant? No, I don't see the benefit to having a many-to-one mapping, and it complicates things.
Can they be negative? No
Can they be strings? No, but we can convert easily if detected
Can they be None? No, the value of 0 encodes the same thing None would, and having a None inside the map array would complicate things
Should they be round-trippable to/from OEMol, RDMol, and/or SDF? Yes
Should map indices be initialized upon atom creation to a default value? No, but I am assuming the map is a molecule-level object rather than an atom-level object. I don't see the benefit of having an atom-level object.
Should a difference in map index cause the == operator to fail for otherwise-identical molecules? No
Must there be an atom map for all atoms in the molecule if there's an atom map for at least one? No. Partial maps are out there, and if we come across one we can just set the unmapped atoms to 0 (or None). See my additions below.

To add to the list:
Must the mapping function return a map of every atom, even if given a partial map? Yes, if a partial map is given, then return a complete map with those unmap atoms mapped to 0. Or, which is probably more correct, is to map the unmapped atoms as they are literally mapped, which is in sequential order.
Should a partial map be accepted by the toolkit functions? No, the toolkit functions should always expect a map to have the same number of keys in the map as atoms in the molecule.

In any case, I agree that we should just return the same ordering that was given, rather than modifying the order and giving a map that can be used to get what was originally there in the first place.

Also, maps are defined as between molecules, just as bonds are defined as between atoms. Storing a map inside of a molecule is a bit strange, and not sure what purpose it would serve. Much of the wording in these checkboxes make maps seem like a molecule property, which is a little misleading in my opinion. Atom maps should just be dictionaries, and functions that can utilize them should take in a map kwarg (and the obvious is a {1:1, 2:2, ..., n:n} mapping for molecules in the same order).

If we really wanted to have maps as molecule-level attrs, then we would need to store a map for each possible conversion, e.g. to/from_oemol, to/from_rdkit, to/from_qcschema, etc. where we would then embed these representations in the oFF molecule itself to maintain the "state" of the mapping (if we e.g. delete an atom). In other words, it is not clear to me that one map will be the same across the toolkits in the way they build molecules.

Here is how I envision this working. Let us have 3 offmols that were obtained from the wild, and are the exact same molecule (e.g. benzoic acid). Each mol has some partial charges, and we want to, say, make a table for comparison. Therefore, we need to ensure that the partial charges are in the same order, as we must not expect the ordering to be the same between the three different instances. The appropriate pseudo-code is as follows:

mol1, mol2, mol3 = get_off_mols() # could be from a mol2, qcschema, OE, RDkit, etc.
try:
    map12 = Molecule.map(mol1, mol2, allow_partial=False)[0] # or map12 = mol1.map(mol2)[0]
    map13 = Molecule.map(mol1, mol2, allow_partial=False)[0]
except NullMap:
    # Bail if not isomorphic and allow_partial turned off. In the case of allow_partial=True, not 
    # even a fragment could be found.
    #
    # Note we need to be careful then, since we could always probably find many partial mappings of 
    # single sp3 carbons. We would therefore generate a M by N matrix of map dictionaries, where M 
    # is the number of sp3 carbons in mol1, and N is the number of sp3 carbons in mo2. 
    # To alleviate some of this behavior, we could implement a map_only option which takes a list of 
    # atom indices from mol1 that we want to explicitly map (default is all atoms in mol1), to get a 1xN 
    # mapping.
    #
    # tl;dr partial maps are difficult to handle (and luckily not part of this issue)
    return
except PartialMap:
    # The idea here is that, in general, we can map a fragment of mol1 to mol2 and it could match
    # multiple parts of mol2.
    # However, since we know that these will map completely (e.g. are isomorphic), we turned 
    # allow_partial=False above, which means this exception will be thrown if a complete mapping 
    # cannot be found.

    assert len(map12) == 1 # if partial maps allowed, then a fragment can map multiple times
    assert len(map13) == 1
    return
charges = []
charges.append(  mol1.partial_charges)
charges.append( [mol2.partial_charges[ to] for from, to in map12.items() ])
charges.append( [mol3.partial_charges[ to] for from, to in map13.items() ])
# or even mol3.partial_charges(map=map13) as a convenience
for sym, (q1, q2, q3) in zip( mol1.symbols, np.transpose( charges)):
    print(sym, q1, q2, q3)
resultA = mol1.some_function( mol2, map=map12)
resultB = Molecule.some_other_function(mo1, mol3, map=map13)
print(resultA, resultB)

Talking with Jeff a little on this, and a question that came up is whether some_function which is order dependent should automatically make the mapping, by calling map behind the scenes. My 2 cents is that we should only do this if we are in "paranoid" mode, since calling map ourselves might be a bit more work on the user's side, but is more efficient and easier to debug if no map can be found and some exception is raised. We could perform the mapping ourselves if map=None in the above functions, which would satisfy the "paranoid" aspect of the code (and we set map=None by default). To perform some function on two molecules in the same order and prevent the mapping to be done, we would have to pass in something like {i:i for i in range(len(mol1.atoms))}.

Overall, I think we should just have a map arg for any function that takes 2 or molecules in as input (or if a mol object has a function taking another mol object as input), and just provide a function, e.g. map/are_isomorphic/remap, that will provide this map if it possible. Whether we call this map automatically is up for debate, and I think it will depend on how much overhead it will introduce when molecules become large.

Finally, note that in my description, the remap function that we recently merged would not suffice, as it creates a new mol from an existing mol and a map, i.e. it's a copy constructor. Here, each mol has its own data, and we simply need a map between the two.

@j-wags j-wags mentioned this issue May 27, 2020
4 tasks
@j-wags j-wags mentioned this issue Mar 31, 2021
5 tasks
@SimonBoothroyd
Copy link
Contributor

SimonBoothroyd commented Apr 28, 2021

Atom maps are very useful to have for both bespoke fitting workflows (e.g. track which fragment atoms map to which parent atoms) and for fitting in general where aspects of a molecule (e.g. the torsion being driven) need to tracked.

While atom maps are partially implemented currently, they are still not first class citizens of the toolkit and are not guaranteed to be preserved or respected by functions. Canonically ordering a molecule, for example, will retain the atom_map property but will not update it to have the correct ordering. As another example, to_rdkit and to_openeye do not pass the map indices and so these are lost in a roundtrip.

My preference would be to add a map_index attribute to the Atom class to store them. This would make handling atom arrangements safer, as well as match more closely what OE and RDKit expose.

As for the above questions, presumably these are for a future (pydantic?) implementation of the model? Currently the validation logic for molecules is a quite scattered and segmented, so I'm not sure how easier such validation would be at the moment.

  • Can map indices be redundant? No to begin with, yes if a user brings a compelling use case in the future.
  • Can they be negative? No - they should also be greater than 0 to ensure compatibility with OE / RDKit. A zero index should be replaced with None.
  • Can they be strings? No - ints only.
  • Can they be None? Yes - this is more pythonic than using 0 as no index.
  • Should they be round-trippable to/from OEMol, RDMol, and/or SDF? Yes
  • Should map indices be initialized upon atom creation to a default value? Yes - None.
  • If no to above, should an absence of map indices be the same as map indices being all set to None? N/A
  • Should a difference in map index cause the == operator to fail for otherwise-identical molecules? No
  • Must there be an atom map for all atoms in the molecule if there's an atom map for at least one? No - think e.g. tracking a dihedral that should be driven.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants