Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

residue.is_nucleic and 'nucleic' in atom selection DSL #1398

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

dwhswenson
Copy link
Contributor

This is a really simple PR aiming to add support for residue.is_nucleic and 'nucleic'/'is_nucleic as special query strings (like 'protein') in the atom selection language.

Main question: Is there any reason not to do this? It seems like such low-hanging fruit, and it was half-implemented, so I wondered if some significant concern I'm not aware of was preventing it.

It looks like the last time anything was done with this was way back in #594, when nucleic was officially treated as "not supported" by commenting out several lines that were still comments in the code.

A few specific questions to ask:

  1. Should I add single-letter nucleic acid codes, analogous to amino acid codes? For example, the residue name 'DG' would have code 'G'.
  2. Do we include special names for terminal residues? I've seen some files with, for example, DA5 for a 5-prime A. This is the name used for differentiating the force field/topology at the terminal residue. However, that name is used in the chemical components database for another molecule. To make it really confusing, here's a PDB of the ligand DA5 binding DNA. (Fortunately, that sequence does not have a 5' adenine. That would be really nasty.)

@mpharrigan
Copy link
Contributor

If I had to guess, it was probably a combination of

  1. no one could be bothered to look up the right codes for nucleic residues
  2. residue_names.py didn't exist or was still being modified (see linked issue in Atom Selection DSL #594)

@dwhswenson
Copy link
Contributor Author

Thanks -- I thought it might just be that the MDTraj user/dev community is more focused on proteins than nucleic acids, so this hasn't been an itch anyone else wanted to scratch. This PR is probably worth continuing, then.

Any thoughts on the question about using unofficial (and possibly conflicting) residue names? As for the list, I think ParmEd has a pretty solid list of nucleic acid residues and common aliases/variants for them (though missing inosine):

https://github.com/ParmEd/ParmEd/blob/47dab71532c198c78f260be4fc5e4dbadbf208d3/parmed/residue.py#L269-L283

@rmcgibbo
Copy link
Member

rmcgibbo commented May 4, 2019

@dwhswenson: is this still relevant? I'm happy to hit "merge".

@dwhswenson
Copy link
Contributor Author

It's nearly ready. I've been meaning to ask what you think should be included as "nucleic", but got distracted by other things. In short, there are several possibilities to consider (beyond the obvious and expected residues):

  • Unofficial names that may conflict with CCD names (e.g., the internally-used Amber DA5 for 5-prime adenine; I think OpenMM may still output PDBs with these names).
  • CCD residues that are listed as nucleic linkers, but are not necessarily that similar to the default nucleic residues (e.g., massive changes to the base -- YYG being my go-to example).

I had some discussion on this with the ParmEd folks in ParmEd/ParmEd#1047. I put some detailed discussion in a gist. (Since I'm having trouble getting GitHub to load that gist right now, here's the notebook in nbviewer.) I've designed what I've included to make it relatively easy (though not trivial) for users to customize the behavior here.

The notebook in that gist shows how to generate a VARIANTS dictionary in from the CCD data, in the same format I use in this PR. It includes a section on monkey-patching MDTraj to add other variants. Let me know how much of this you'd like to have included, and I'll update. (Also, I'll add a couple unit tests, just as a matter of my own principles.)

@rmcgibbo
Copy link
Member

rmcgibbo commented May 4, 2019

Honestly I don't really know anything about the conventions for nucleic acid names, so I'm not really in a position to review much.

@stale
Copy link

stale bot commented Mar 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Mar 19, 2020
@stale stale bot removed the wontfix label Mar 20, 2020
@rmcgibbo rmcgibbo force-pushed the master branch 7 times, most recently from 9087f56 to 29dd0ec Compare April 5, 2021 04:04
@rmcgibbo rmcgibbo force-pushed the master branch 8 times, most recently from 9ca5ba0 to 3adc2d8 Compare April 5, 2021 04:40
@stale
Copy link

stale bot commented Jul 10, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 10, 2022
@stale stale bot removed the wontfix label Jul 13, 2022
@mattwthompson
Copy link
Member

Is there anybody we could recruit to give a review of this, even if it's only a brief skim? I'm in less of a position to speak with authority on nucleic acid naming conventions.

I'm open to just merging this as-is ... has anything change since your comment in May 2019?

@dwhswenson
Copy link
Contributor Author

Mainly I keep dragging this "forever PR" along, updating it to master every time it gets marked stale, but not bringing it to conclusion. Sorry about that.

I think my primary thoughts on the matter are exposed a little in the gist mentioned above. There are several options there; I think my preference is now leaning toward "let's default to something reasonably minimal, but provide users with an easy way to customize."

In the 4 years (programming gods forgive me) since I opened this PR, I'm hoping that the non-standard Amber-specific names that I used to see are used less frequently. My vote would be for boring standard residues as default "nucleic" (so keep the _AMBER_VARIANTS defined, maybe not subscripted, but drop them from the default definition), and to hve an example notebook (basically that gist) that shows how to add everything in the CCD (and probably then, also the _AMBER_VARIANTS).

It's been a while since I've checked that my janky awk-based parsing in that notebook works, although it should. I'd be very open to an improvement on that ugly mess. (I went through an awk phase in the mid 2010s that apparently hadn't been fully cured before that gist. Sorry.)

@stale
Copy link

stale bot commented May 21, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label May 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants