residue.is_nucleic and 'nucleic' in atom selection DSL #1398

dwhswenson · 2018-11-11T18:24:14Z

This is a really simple PR aiming to add support for residue.is_nucleic and 'nucleic'/'is_nucleic' as special query strings (like 'protein') in the atom selection language.

Main question: Is there any reason not to do this? It seems like such low-hanging fruit, and it was half-implemented, so I wondered if some significant concern I'm not aware of was preventing it.

It looks like the last time anything was done with this was way back in #594, when nucleic was officially treated as "not supported" by commenting out several lines, which were still sitting in the code as comments.

A few specific questions to ask:

Should I add single-letter nucleic acid codes, analogous to amino acid codes? For example, the residue name 'DG' would have code 'G'.
Do we include special names for terminal residues? I've seen some files with, for example, DA5 for a 5-prime A. This is the name used for differentiating the force field/topology at the terminal residue. However, that name is used in the chemical components database for another molecule. To make it really confusing, here's a PDB of the ligand DA5 binding DNA. (Fortunately, that sequence does not have a 5' adenine. That would be really nasty.)
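
To make the first question concrete, here is a minimal sketch of a one-letter lookup. The table and the function name `nucleic_code` are hypothetical, not what this PR or MDTraj defines; the sketch covers only the standard PDB names for DNA and RNA residues and deliberately leaves ambiguous names such as DA5 unresolved.

```python
# Hypothetical lookup table: standard PDB residue names -> one-letter codes.
# Illustrative only; not the table used in this PR.
_NUCLEIC_CODES = {
    "DA": "A", "DC": "C", "DG": "G", "DT": "T",  # DNA
    "A": "A", "C": "C", "G": "G", "U": "U",      # RNA
}


def nucleic_code(resname):
    """Return the one-letter code for a residue name, or None if unknown."""
    return _NUCLEIC_CODES.get(resname.strip().upper())


print(nucleic_code("DG"))   # G
print(nucleic_code("DA5"))  # None
```

Returning None for DA5 keeps the decision about terminal-residue names separate from the basic code table.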

mpharrigan · 2018-11-15T01:00:04Z

If I had to guess, it was probably a combination of

no one could be bothered to look up the right codes for nucleic residues
residue_names.py didn't exist or was still being modified (see linked issue in Atom Selection DSL #594)

dwhswenson · 2018-11-15T10:36:40Z

Thanks -- I thought it might just be that the MDTraj user/dev community is more focused on proteins than nucleic acids, so this hasn't been an itch anyone else wanted to scratch. This PR is probably worth continuing, then.

Any thoughts on the question about using unofficial (and possibly conflicting) residue names? As for the list, I think ParmEd has a pretty solid list of nucleic acid residues and common aliases/variants for them (though missing inosine):

https://github.com/ParmEd/ParmEd/blob/47dab71532c198c78f260be4fc5e4dbadbf208d3/parmed/residue.py#L269-L283

rmcgibbo · 2019-05-04T19:51:33Z

@dwhswenson: is this still relevant? I'm happy to hit "merge".

dwhswenson · 2019-05-04T20:27:29Z

It's nearly ready. I've been meaning to ask what you think should be included as "nucleic", but got distracted by other things. In short, there are several possibilities to consider (beyond the obvious and expected residues):

Unofficial names that may conflict with CCD names (e.g., the internally-used Amber DA5 for 5-prime adenine; I think OpenMM may still output PDBs with these names).
CCD residues that are listed as nucleic linkers, but are not necessarily that similar to the default nucleic residues (e.g., massive changes to the base -- YYG being my go-to example).

I had some discussion on this with the ParmEd folks in ParmEd/ParmEd#1047. I put some detailed discussion in a gist. (Since I'm having trouble getting GitHub to load that gist right now, here's the notebook in nbviewer.) I've designed what I've included to make it relatively easy (though not trivial) for users to customize the behavior here.

The notebook in that gist shows how to generate a VARIANTS dictionary from the CCD data, in the same format I use in this PR. It includes a section on monkey-patching MDTraj to add other variants. Let me know how much of this you'd like to have included, and I'll update. (Also, I'll add a couple of unit tests, just as a matter of my own principles.)
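
For readers who have not opened the gist, the general idea of a variants dictionary can be sketched as follows. The layout (canonical residue name mapped to a list of alternative names) and every name in the block are assumptions for illustration; the actual format in this PR and the notebook may differ.

```python
# Assumed layout: canonical residue name -> alternative names.
# DA5/DA3 and DG5/DG3 are Amber-style terminal names, used only as examples.
VARIANTS = {
    "DA": ["DA5", "DA3"],
    "DG": ["DG5", "DG3"],
}


def invert_variants(variants):
    """Map every name (canonical or variant) to its canonical residue."""
    lookup = {}
    for canonical, names in variants.items():
        lookup[canonical] = canonical
        for name in names:
            lookup[name] = canonical
    return lookup


lookup = invert_variants(VARIANTS)
print(lookup["DA5"])   # DA
print(sorted(lookup))  # every name that would count as nucleic
```

With this shape, adding user-specific variants amounts to updating the dictionary before the lookup is built.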

rmcgibbo · 2019-05-04T20:29:19Z

Honestly I don't really know anything about the conventions for nucleic acid names, so I'm not really in a position to review much.

stale · 2020-03-19T12:53:51Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2022-07-10T07:34:59Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mattwthompson · 2022-09-16T03:57:31Z

Is there anybody we could recruit to give a review of this, even if it's only a brief skim? I'm in less of a position to speak with authority on nucleic acid naming conventions.

I'm open to just merging this as-is ... has anything changed since your comment in May 2019?

dwhswenson · 2022-09-16T04:32:32Z

Mainly I keep dragging this "forever PR" along, updating it to master every time it gets marked stale, but not bringing it to conclusion. Sorry about that.

I think my primary thoughts on the matter are exposed a little in the gist mentioned above. There are several options there; I think my preference is now leaning toward "let's default to something reasonably minimal, but provide users with an easy way to customize."

In the 4 years (programming gods forgive me) since I opened this PR, I'm hoping that the non-standard Amber-specific names that I used to see are used less frequently. My vote would be for boring standard residues as default "nucleic" (so keep the _AMBER_VARIANTS defined, maybe not subscripted, but drop them from the default definition), and to have an example notebook (basically that gist) that shows how to add everything in the CCD (and probably then, also the _AMBER_VARIANTS).

It's been a while since I've checked that my janky awk-based parsing in that notebook works, although it should. I'd be very open to an improvement on that ugly mess. (I went through an awk phase in the mid 2010s that apparently hadn't been fully cured before that gist. Sorry.)

stale · 2023-05-21T21:03:39Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

dwhswenson added 2 commits November 11, 2018 19:04

Implement is_nucleic; support in .select()

ffea0da

Merge remote-tracking branch 'upstream/master' into nucleic

f7c7cad

dwhswenson added 7 commits December 3, 2018 07:21

add nucleic acid codes & fasta support

cd79d96

Merge remote-tracking branch 'upstream/master' into nucleic

c14bea9

Merge remote-tracking branch 'upstream/master' into nucleic

3d017c9

Merge remote-tracking branch 'upstream/master' into nucleic

bf4157a

Add residues based on parmed

a0dd14f

update with variant dict-based nuc res listing

3f6177c

algorithmic setup of residues and codes

1b17b29

dwhswenson added 3 commits May 28, 2019 14:34

Merge remote-tracking branch 'upstream/master' into nucleic

d8a4dd0

Merge remote-tracking branch 'upstream/master' into nucleic

41e6e66

Merge remote-tracking branch 'upstream/master' into nucleic

591f93a

stale bot added the wontfix label Mar 19, 2020

Merge remote-tracking branch 'upstream/master' into nucleic

7b4aa21

stale bot removed the wontfix label Mar 20, 2020

dwhswenson added 2 commits July 29, 2020 15:41

Merge remote-tracking branch 'upstream/master' into nucleic

89e533e

Merge remote-tracking branch 'upstream/master' into nucleic

a645d2b

rmcgibbo force-pushed the master branch 7 times, most recently from 9087f56 to 29dd0ec Compare April 5, 2021 04:04

rmcgibbo force-pushed the master branch 8 times, most recently from 9ca5ba0 to 3adc2d8 Compare April 5, 2021 04:40

dwhswenson added 2 commits May 31, 2021 11:15

Merge remote-tracking branch 'upstream/master' into nucleic

f80bbba

Merge remote-tracking branch 'upstream/master' into nucleic

e8fc974

stale bot added the wontfix label Jul 10, 2022

Merge remote-tracking branch 'upstream/master' into nucleic

b853b8e

stale bot removed the wontfix label Jul 13, 2022

stale bot added the wontfix label May 21, 2023
