Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow ForceField to store extra descriptors for a residue #2757

Closed
tristanic opened this issue Jun 24, 2020 · 15 comments · Fixed by #3604
Closed

Allow ForceField to store extra descriptors for a residue #2757

tristanic opened this issue Jun 24, 2020 · 15 comments · Fixed by #3604
Milestone

Comments

@tristanic
Copy link
Contributor

Something that would be really nice for interactive applications of OpenMM would be the ability to store extra information about a residue (e.g. long-format name(s), SMILES string, etc.) directly in the ffXML, so that it's stored as metadata in the ForceField when running loadFile. This came up because I've been working on adding a tool to ISOLDE for listing possible templates for a residue based on finding the maximum common subgraph between the residue and templates with (leniently) "similar" elemental composition (to recover from cases where the residue is incomplete, has incorrect hydrogens, etc.). The basic mechanism is working, but now I'm faced with the challenge of how to represent the results to the user. For example, if I delete the sidechain amine and run it past this method, I get:
Topology matches: LYN, LYS, CLYS, NLYS, PTM_LYZ, PTM_MLY, PTM_MLZ, ZK
... which is the correct answer in a sense, but will be confusing as heck to the user. Having a name associated with each one would make things much more clear:
LYN (neutral lysine)
LYS (lysine)
CLYS (C-terminal lysine)
NLYS (N-terminal lysine)
PTM_LYZ (5-hydroxylysine)
PTM_MLY (N6,N6-dimethyllysine)
PTM_MLZ (N6-methllysine)
ZK (free, zwitterionic lysine)

@peastman
Copy link
Member

It sounds like this feature would involve two distinct parts. 1) adding the infrastructure for storing extra information, and 2) adding that extra information to all the standard force fields. We would also need to figure out how to generate those labels when applying patches.

Do we want to support arbitrary, use defined metadata? Or would it be better to just support a single description attribute?

@jchodera
Copy link
Member

This came up because I've been working on adding a tool to ISOLDE for listing possible templates for a residue based on finding the maximum common subgraph between the residue and templates with (leniently) "similar" elemental composition (to recover from cases where the residue is incomplete, has incorrect hydrogens, etc.).

@tristanic : Would you be up for chatting with the Open Force Field Initiative folks who are working on biopolymer force fields? We're very much headed in this direction, including having OFF Topology objects (which should provide all this information) and supporting very flexible parameterization of all manner of nonstandard residues.

cc: @j-wags who could help coordinate this call.

@tristanic
Copy link
Contributor Author

@jchodera Absolutely! Not immediately, though. Crazy busy few weeks coming up.

If it's of interest, I found a really nice, fast open-source C++ implementation of a maximum-common-subgraph implementation (this one: https://www.ijcai.org/Proceedings/2017/0099.pdf) which I've wrapped for Python using PyBind11 (code at https://github.com/tristanic/isolde/tree/master/isolde/src/graph). Neatly handles the core problem of matching an incomplete/incorrect residue to potential templates. Feel free to use it if you see a need for it.

@tristanic
Copy link
Contributor Author

@peastman a single description would go a long way. Beyond that, where a corresponding template exists in the Chemical Components Dictionary I'd love to be able to definitively link the two.

@tristanic
Copy link
Contributor Author

What do you think of something like the following approach?

  • give ForceField a dict called something like _custom_attribute_handlers
  • add the following method:
class ForceField:
    def registerCustomResidueAttribute(self, attr_name, handler):
        '''
        Add a handler to look for extra residue attributes when loading a ffXML
        file. The handler should take two arguments: a `ForceField._TemplateData`
        and a string containing the value of the given attribute, and should 
        not return anything.
        '''
        self._custom_attribute_handlers[attr_name] = handler

... with most handlers being exceedingly simple, like:

    def common_name_handler(template, name):
        template.common_name = name
  • then, in ForceField.loadFile():
        for tree in trees:
            if tree.getroot().find('Residues') is not None:
                for residue in tree.getroot().find('Residues').findall('Residue'):
                    resName = prefix+residue.attrib['name']
                    template = ForceField._TemplateData(resName)

>                   for attr_name, handler in self._custom_attribute_handlers.items():
>                       custom_attr = residue.attrib.get(attr_name, '')
>                       handler(template, custom_attr)

                    if 'override' in residue.attrib:
                        template.overrideLevel = int(residue.attrib['override'])
                    atomIndices = template.atomIndices

@peastman
Copy link
Member

Can you give examples of what handlers might do? What are the advantages over simply storing all extra attributes into a dict where they'll be available to user code, but not trying to do anything with them?

Of course, neither one addresses the problem of what to do about patched residues.

@tristanic
Copy link
Contributor Author

Can you give examples of what handlers might do? What are the advantages over simply storing all extra attributes into a dict where they'll be available to user code, but not trying to do anything with them?

One use case for this is if the Chemical Components Dictionary template name is used as an attribute: the handler code could then map this to a separate database containing the info from there: amongst other things, ideal coordinates and explicit definitions of chiral centres. Of course, there are many ways to skin that cat.

Of course, neither one addresses the problem of what to do about patched residues.

Yes, I agree that's much harder. Possibly the patch could provide (a) string(s) to be appended to the name(s) of the affected residue(s) to specify how they've been changed? If, for example, a cysteine's common_name attribute is "L-cysteine", then the patch to deprotonated cys could provide a suffix1 attribute "deprotonated", so the new residue's common_name becomes L-cysteine, deprotonated. A disulfide patch would provide suffix1="disulfide", suffix2="disulfide", so both affected residues become L-cysteine, disulfide.

@peastman
Copy link
Member

Alternatively, the template could directly provide a list of the patches that went into creating it. That might be more useful.

Of course getMatchingTemplates() doesn't consider patches anyway, but the private method _matchAllResiduesToTemplates() does, so as long as you don't mind that your code might break in the future you could use that to get patched templates.

@tristanic
Copy link
Contributor Author

At the moment I'm not using patches - for a range of reasons including choices made by collaborators and lack of resources to properly support more than one forcefield, ISOLDE's currently pretty much locked to AMBER. That being said, I do find the patch approach more elegant than the "new template for every bonding variant" one, and at some point I'd certainly like to start looking down that path.

@peastman peastman added this to the 7.7 milestone Aug 9, 2021
@peastman peastman modified the milestones: 7.7, 7.8 Nov 5, 2021
@peastman
Copy link
Member

I'm looking into implementing this now. For the code part, I'm thinking it would be best to just do the very simple implementation: add an attributes dict to the classes representing residues and patches. Any attributes from the XML tag will be stored in it. We won't try to process them in any way. They're just there for future reference, in case you want to do something with them.

To be useful, the XML files need to actually contain extra information. Should we try to regenerate some of our force fields adding extra attributes? The Amber input files do contain longer descriptions that we could put into a description attribute. For example,

HISTIDINE DELTAH                                                
                                                                
 HID  INT     1                                                 
 CORR OMIT DU   BEG                                             
 ...

@peastman
Copy link
Member

@tristanic any thoughts on the above questions?

@tristanic
Copy link
Contributor Author

Sorry for my long radio silence - a lot of time spent working on figuring out my next career move (successfully, I'm happy to say - and in a way that'll substantially increase my ability to keep working in this area). Anyway, yes - I could definitely work with that approach!

@tristanic
Copy link
Contributor Author

If you wanted a richer source of data to mine, there's always the Chemical Components Dictionary (https://www.wwpdb.org/data/ccd). Encompasses residue name (including synonyms), type (nucleic, peptide, etc.), various flavours of SMILES, and example coordinates (experimental and ideal). Things can get a bit hairy when you get out into the more exotic stuff (erroneous/missing coordinates, only IUPAC rather than common names, etc.) - but for the common stuff it's very reliable.

@peastman
Copy link
Member

Thanks! I'll start by adding the attributes field. Then we can consider changes to https://github.com/openmm/openmmforcefields so it will store useful attributes when generating new force fields.

Congratulations on your career move, whatever it is!

@peastman
Copy link
Member

It's in #3604.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants