Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fragmentation API #49

Closed
jchodera opened this issue Sep 8, 2019 · 18 comments · Fixed by #103
Closed

Fragmentation API #49

jchodera opened this issue Sep 8, 2019 · 18 comments · Fixed by #103
Labels

Comments

@jchodera
Copy link
Member

jchodera commented Sep 8, 2019

The API for fragmenting is a little awkward:

f = fragmenter.fragment.WBOFragmenter(mol, functional_groups=None, verbose=False) # some of the options go here
f.fragment(threshold=0.01, keep_non_rotor_ring_substituents=True, heuristic='path_length', ...) # some of the options go here
for bond in f.fragments:
    # use fragments
    ...

Instead of making the Fragmenter object only operate on a single molecule, and splitting all the options that control its behavior across the constructor (WBOFragmenter()) and the fragment() method, what about making this a factory?

You could still produce an object for each molecule that had all the information about the fragments, like FragmentSet. Using this API would look something like this:

# Create the fragmenter factory
f = fragmenter.fragment.WBOFragmenter()
# Configure the factory for any non-default options you want to set
f.functonal_groups = ...
f.threshold = 0.01
f.keep_non_rotor_ring_substituents = False
# Fragment one or more molecules
for oemol in oemols:
    fragment_set = f.fragment(mol)
    # do something with the fragments
    for bond in fragment_set.fragments:
        # use fragments
        ...
@jchodera
Copy link
Member Author

jchodera commented Sep 8, 2019

This also would enable a bunch of easy optimizations.

For example, a Fragmenter instance could memoize (remember) input molecules so it can look up their FragmentSet if it has already been fragmented.

A future extension could also use multiprocessing under the hood to fragment multiple molecules on multiple threads by adding support to Fragment.fragment() that would accept multiple molecules, rather than just one. Or it could be parallelized across bonds in the same molecule.

@jthorton
Copy link
Contributor

@jchodera @ChayaSt @j-wags I wanted to restart this conversation about the Fragmenter API ahead of the planned integration into the OFFtoolkit. Following the idea of rebasing the code to use the Molecule class from the toolkit it might be nice to have a method like Molecule.wbo_fragment(kwargs), however, based on all of the key arguments listed above this function could have a lot of options which might awkward to change.

I also like the idea of the FragmenterFactory class which could house all of the running options and remember fragments of molecules already processed with the same settings, multiprocessing would also be a nice addition at least at the molecule level. The FragmenterFactory could then keep all of the fragmented molecules with links to the fragments which could be accessed as mol1_fragments = factory.molecules[0].fragments, also if the class was pickled we would still know which molecules the fragments are related to. We could also have some post fragmentation filtering options this way which could remove duplicates etc.

I am not fully sure yet of how Fragmenter will be used so maybe this will make getting the fragments awkward, anything else we should consider?

@jchodera
Copy link
Member Author

We want to avoid making the fragmentation scheme be a method of Molecule because there is no unambiguous fragmentation scheme---it's still very much an active area of research. Molecule should only contain convenience methods that are stable, clearly defined processes, like conversion to/from other formats or clear manipulation methods.

I think creating a Fragmenter factory class is the right way to go. (It doesn't need to be called FragmenterFactory since its API will make it clear it is a factory.) If we create a common Fragmenter base class with a simple, clear API, we can have multiple implementations of this via subclasses (e.g. WBOFragmenter) that can easily be swapped out.

I do not think the factory should store the parent molecule and fragments. That's not any sort of standard idiom or design pattern I'm familiar with. Instead, the factory should be something you create, configure, and use to create new objects that are returned by a factory method. The object created by the factory, however, could contain the parent and daughter fragments in a convenient data object model, however---call it a FragmentResult or FragmentedMolecule.

Writing out a few clear use cases is the easiest way to approach design.

For example, consider a fragmenter that just returns new Molecule objects. We want to fragment a bunch of parent molecules and compile a unique set of fragments:

# Create a Fragmenter
fragmenter = WBOFragmenter(wbo_threshold=0.8, capping_rules='default')
# Further configure the fragmenter
fragmenter.set_wbo_threshold(0.7)
# Fragment a bunch of molecules
all_fragments = set()
for molecule in molecules:
    fragments = fragmenter.fragment(molecule)
    all_fragments.add(fragments)

That could be very useful on its own.

If we wanted to also bundle the results of fragmentation---including the computed WBOs for the parent and child molecules---we could do that through a FragmenterResult object:

# Create a Fragmenter
fragmenter = WBOFragmenter(wbo_threshold=0.8, capping_rules='default')
# Further configure the fragmenter
fragmenter.set_wbo_threshold(0.7)
# Fragment a bunch of molecules
all_fragments = set()
for molecule in molecules:
    fragmented_molecule = fragmenter.fragment(molecule)
    all_fragments.add(fragmented_molecule.fragments)
    # we can also inspect the parent molecule
    parent = fragmented_molecule.parent_molecule

etc. But do we really need to preserve this relationship by bundling everything in an object? What information would be useful to bundle together, and can we make a compelling case for why?

Filtering can simply be done by using a set() and adding molecules to the set. I believe we use the canonical isomeric SMILES for uniqueness checks, so no new machinery is needed to filter out duplicates.

@j-wags
Copy link
Member

j-wags commented Jan 10, 2020

For example, a Fragmenter instance could memoize (remember) input molecules so it can look up their FragmentSet if it has already been fragmented.

links to the fragments which could be accessed as mol1_fragments = factory.molecules[0].fragments

I do not think the factory should store the parent molecule and fragments

The object created by the factory, however, could contain the parent and daughter fragments in a convenient data object model, however---call it a FragmentResult or FragmentedMolecule.

In light of the planned integration of Fragmenter into OFFTK, I agree with @jchodera's last proposal about not keeping links between the original molecule and the fragmentation results. Keeping/caching those relationships would introduce all sorts of surprising behaviors and edge cases -- For example a user who wants to fragment one million molecules would get stuck with a huge memory footprint, even if they wrote out the fragments after processing each molecule. Also, since Molecules are to some extent mutable, the user could change the parent or fragments in several ways that could invalidate a cached result (for example, manually setting the WBOs on a parent molecule after fragmenting it once). My biggest lesson so far in this role is that users will explore literally every extreme case within weeks of a feature release, so it's best to provide them with simple functionality, where the solution to memory/processing limits are in their hands.

If we wanted to also bundle the results of fragmentation---including the computed WBOs for the parent and child molecules---we could do that through a FragmenterResult object:

I like this idea, with the added bonus that the FragmenterResult could contain a mapping from the whole-molecule atom indices to the fragment's -- Especially helpful if we need to keep track of the 4 atoms involved in a torsion that we want to scan. This would avoid the potential memory explosion by having the FragmenterResult object go out of scope if the user doesn't care about it.

@jchodera
Copy link
Member Author

Are we set on integrating the Fragmenter into OFFTK? I thought we had figured out a way to keep the fragmenter package separate, since there is still a lot of science to be done there (even if we have v1 of a production algorithm ready now).

@j-wags
Copy link
Member

j-wags commented Jan 10, 2020

@jchodera I recall that Fragmenter was planned for integration into OFFTK, but CMILES was slated to remain largely independent. I can dig up the meeting notes for that tonight if we'd like to reevaluate.

@jchodera
Copy link
Member Author

I think it could still work do to this, provided we do so in a way that makes it easy to extend via subclassing and still use externally developed variants of Fragmenter without modifying the code in openforcefield.

@ChayaSt
Copy link
Collaborator

ChayaSt commented Jan 13, 2020

But do we really need to preserve this relationship by bundling everything in an object? What information would be useful to bundle together, and can we make a compelling case for why?

The only reason to maintain the relationship between the parent and daughter molecule is to know which torsions to drive for bespoke torsion fitting because the fragments are generated for the rotatable bond. For general forcefield parameterization, there is no need for that relationship because all torsions (or combination of torsions) will be driven. I like the idea of having the FragmenterResults have the option to maintain this relationship.

@jchodera
Copy link
Member Author

The only reason to maintain the relationship between the parent and daughter molecule is to know which torsions to drive for bespoke torsion fitting because the fragments are generated for the rotatable bond.

I'm not sure I get this. My understanding is that the input to the fragmenter algorithm includes the torsion to fragment around, which would mean we start with that information.

Or do you mean that this would be useful for a different API call in which you ask it to "fragment this whole molecule", which would then iteratively fragment around each rotatable bond? In that case, we'd like to know which fragments go with which bond.

@jthorton
Copy link
Contributor

I think with the bespoke fitting for a molecule with multiple bonds users will want to fragment around all of them and refit the parameters which is why I want to return the relational information for each torsion. If this was not built into fragmenter it could instead be part of the bespoke fitting package which could make multiple calls to fragmenter and record the relation its self, but I think this functionality would make more sense in fragmenter.

@ChayaSt
Copy link
Collaborator

ChayaSt commented Jan 13, 2020

Currently, the torsion to fragment around is not part of fragmenters input. The first thing it does is find the rotatable bonds to fragment around and then maps those to the new fragment.
The issue is that the rotatable bonds are numbered with the map indices of the parent molecule but the daughter fragments need to have their own map indices. Another way to deal with this without keeping the parent - daughter relationship is to have fragmenter find that torsion in the resulting fragment before returning the fragments.
This happens in the method that generates the torsiondrive inputs and it uses the parent map indices that are still on the fragment. The provenance still retains the parent molecule map indices because it is used when validating fragmentation schemes.

Validating fragmentation schemes is another reason to retain the parental relationship but since it will probably not happen very often it is not that important. The relationship is needed when validating because it makes it straight forward to find all fragments that have the bond of interest.

It can be a good idea to have the torsion of interest be part of the input because that will reduce the cost of fragmenting in general.

@jthorton
Copy link
Contributor

jthorton commented Jan 13, 2020

That might be a nice extension to the API to have the option to fragment all or fragment around a list of user-supplied bonds. In the future, we may have some method of estimating the performance of a set of parameters applied to a molecule (for bespoke fitting), which could indicate some suspect torsions that should be refit. In this case, only fragmenting around these bonds would be more efficient.

So when fragmenting one molecule we could

# Create a Fragmenter
fragmenter = WBOFragmenter(wbo_threshold=0.8, capping_rules='default')
# Further configure the fragmenter
fragmenter.set_wbo_threshold(0.7)
# fragment our molecule around target bonds
target_bonds = [(0,1), (3,4)]
fragmented_molecules = fragmenter.fragment(molecule, target_bonds=target_bonds)

The default use of the fragmenter would then still fragment around all rotable bonds.

fragmented_molecules = fragmenter.fragment(molecule, target_bonds='all')

@jchodera
Copy link
Member Author

Can you folks make some concrete proposals about the API you're envisioning here? Simple examples illustrating use (like my example above) make it much easier to see what you're proposing.

@SimonBoothroyd
Copy link
Collaborator

SimonBoothroyd commented Apr 21, 2021

My proposal for the API is:

parent_molecule = Molecule.from_smiles(
    "OC1(CN(C1)C(=O)C1=C(NC2=C(F)C=C(I)C=C2)C(F)=C(F)C=C1)[C@@H]1CCCCN1"
)

fragmenter = WBOFragmenter(
    keep_non_rotor_ring_substituents=False,
    threshold=0.03,
    heuristic="path_length"
)
fragmentation_result = fragmentater.fragment(parent_molecule)

print(fragmentation_result.fragments)
print(fragmentation_result.provenance)

The fragmenter settings can be changed on the object directly:

fragmenter.threshold = 0.07

fragmentation_result = fragmentater.fragment(parent_molecule)

The fragmentation engine will inherit from the pydantic base model allowing the options it uses to be readily serialisable. This would then enable us to easily include the general schema for how a molecule should be fragmented in other workflow schemas (e.g. the bespoke fit optimization schema).

The WBOFragmenter and similar would be entirely stateless and not keep track of individual molecules.

The output data models (again pydantic based) would look something like:

class Fragment(BaseModel):
    smiles: str  # A mapped SMILES pattern describing the fragment. The map index of 
                 # the fragment will match the corresponding map indices of the parent
    
    bond: Tuple[int, int]  # The MAP indices of the atoms involved in the rotatable
                           # bond that the fragment was built around.
    
    @property
    def molecule(self):
        return Molecule.from_smiles(self.smiles)

class FragmentationResult(BaseModel):
    
    parent_smiles: str  # A mapped SMILES pattern describing the parent molecule
    
    fragments: List[Fragment]  # The generated fragments. This can't be a dict as
                               # JSON cannot handle tuple keys unfortunately.
    
    @property
    def parent_molecule(self):
        return Molecule.from_smiles(self.parent_smiles)

A mapping between each fragment and the parent could then be generated by:

parent = fragmentation_result.parent_molecule
fragment = fragmentation_result.fragments[0].molecule

fragment_to_parent = {
    fragment_index: get_atom_index(parent, map_index)
    for fragment_index, map_index in fragment.properties["atom_map"].items()
}

@jchodera
Copy link
Member Author

@SimonBoothroyd : I like the idea of using pydantic to be able to easily serialize class options, but how would one modify the WBOFragmenter factory options after instantiation? Would they still be accessible via class fields? Or would it be something like

# Change an option after factory has been instantiated
fragmenter.options.threshold = 0.04

@SimonBoothroyd
Copy link
Collaborator

SimonBoothroyd commented Apr 21, 2021

@SimonBoothroyd : I like the idea of using pydantic to be able to easily serialize class options, but how would one modify the WBOFragmenter factory options after instantiation? Would they still be accessible via class fields? Or would it be something like

I've been playing around with this a bit more and it seems to make sense to make the fragmenter class inherit directly from a pydantic base model. In this way options can be easily changed like normal properties:

fragmenter = WBOFragmenter(threshold=0.03)
# Change an option after factory has been instantiated
fragmenter.threshold = 0.05

and the whole factory can be easily (de)serialized:

>>> fragmenter = WBOFragmenter()
>>> fragmenter.json()

{
  "functional_groups": {
    "hydrazine": "[NX3:1][NX3:2]",
    ...
    "tri_halide": "[#6:1]([F,Cl,I,Br:2])([F,Cl,I,Br:3])([F,Cl,I,Br:4])"
  },
  "scheme": "WBO",
  "wbo_options": {
    "method": "am1-wiberg-elf10",
    "max_conformers": 800,
    "rms_threshold": 1.0
  },
  "threshold": 0.03,
  "heuristic": "path_length",
  "keep_non_rotor_ring_substituents": false
}

I've updated the above comment to reflect this proposed API.

@jchodera
Copy link
Member Author

I've been playing around with this a bit more and it seems to make sense to make the fragmenter class inherit directly from a pydantic base model. In this way options can be easily changed like normal properties:

@SimonBoothroyd : This is great! I was just looking into this, and it sounds like this is the best way to go. Any class fields prepended with _ will not be serialized by default, so it should still be possible to use class fields for temporary or cached storage without issues but have the class fields directly be serialzed and still use a Pythonic factory idiom.

@SimonBoothroyd SimonBoothroyd linked a pull request Apr 21, 2021 that will close this issue
1 task
@SimonBoothroyd
Copy link
Collaborator

Great! I've implemented this now in #103 (example notebook here) if anyone has any additional feedback they'd like to get in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants