Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

augur treetime is too complex #133

Closed
trvrb opened this issue Jun 26, 2018 · 18 comments
Closed

augur treetime is too complex #133

trvrb opened this issue Jun 26, 2018 · 18 comments

Comments

@trvrb
Copy link
Member

trvrb commented Jun 26, 2018

@rneher ---

I'm of the opinion that the augur treetime module is overly complex. If you just look at the arguments we see 28 lines and significantly more complexity than the other modules. Much of this complexity seems necessary, but I would try to tame things a bit.

I had initially been reading the command as augur timetree and expecting it to do the one Unix-y task of taking a ML subs tree and outputting a timetree. I would then have augur ancestral to do ancestral state inference. This streamlines both modules and surfaces the hidden ancestral state functionality.

Right now, after augur treetime is run there will be node_data.json with things like sequence, mutations, clock_length and mutation_length. I like the pattern where augur export stitches together multiple JSON files keyed off of NODE_0000071, etc... node_data.json is opaque, but if augur timetree produced node_dates.json and augur ancestral produced node_nt_seqs.json it would be immediately clear what sort of data it is we're talking about. This would also make it clearer that to run augur translate you need node_nt_seqs.json as input, which would produce node_aa_seqs.json as output.

Also, there is an issue with BEAST tree input. It should be obvious how to proceed from BEAST Newick. I don't think having one function like this is obvious, especially when almost everyone will think that treetime is just to get a clock (the name doesn't suggest at all that it can do ancestral state reconstruction).

Along these lines, what about having augur tree label nodes in the output Newick as NODE_0000071, etc... to prevent any issues downstream? Then augur ancestral could be run without having to run treetime first to get labels.

Also, checking with @emmahodcroft, @huddlej, @jameshadfield and @tsibley to see how they feel about augur treetime vs augur timetree / augur ancestral

@tsibley
Copy link
Member

tsibley commented Jun 26, 2018

I strongly agree with what you've proposed @trvrb. Unlike other augur commands, the treetime command is modal and bundles several things into one. It'd be good to unbundle those.

@trvrb
Copy link
Member Author

trvrb commented Jun 26, 2018

Oh... and I thought of two other things that bothered me here:

  1. Calling augur treetime --timetree is just weird.

  2. augur tree is appropriately agnostic about the underlying program used to build the tree. It could be RAxML, FastTree, etc... augur tree is about going from alignment to subs tree, irrespective of method to get there. However, augur treetime breaks this and specifies program in the title. We could very well eventually want to allow LSD of some other method to infer timetrees for completeness.

@rneher
Copy link
Member

rneher commented Jun 27, 2018

Along these lines, what about having augur tree label nodes in the output Newick as NODE_0000071,

that was the initial idea, but that doesn't work since rerooting and polytomy resolution changes topology/adds/removes nodes.

I had initially been reading the command as augur timetree and expecting it to do the one Unix-y task of taking a ML subs tree and outputting a timetree. I would then have augur ancestral to do ancestral state inference. This streamlines both modules and surfaces the hidden ancestral state functionality.

yes and no. timetree inference via treetime will in most cases do ancestral inference. One could do it twice, but then it would be crucial to do it after timetree inference.

node_data.json is opaque, but if augur timetree produced node_dates.json and augur ancestral produced node_nt_seqs.json

initially node_data.json was the only one. agree that the name doesn't make much sense anymore. but we also don't want a separate file for each individual attribute. if you call it dates, then branch lengths should be somewhere else and we get a massive proliferation of files.

augur tree is appropriately agnostic about the underlying program used to build the tree. It could be RAxML, FastTree, etc...

but this only works if we basically hide all method specific commands from the user. Otherwise, we get a big confusion with the hundreds of options that these different tools take. In this spirit, the current treetime module is just a wrapper around treetime that happens to produce output compatible with augur.

@emmahodcroft
Copy link
Member

I agree with @rneher . Also, moving ancestral out separately is only going to remove 2 (3 if you count the output vcf) arguments, so may not really change the complexity of the treetime call.

I think given what treetime does it's acceptable that it has a lot of arguments, but I'd agree that they're not so straightforward at the moment and could be made clearer, possibly by renaming some, better help info, and mostly, good documentation somewhere.

I also agree that timetree/treetime is confusing. I mess this up on the regular 🙃

@tsibley
Copy link
Member

tsibley commented Jun 27, 2018

but this only works if we basically hide all method specific commands from the user. Otherwise, we get a big confusion with the hundreds of options that these different tools take.

I think this is a false dilemma. It's possible to expose only the most common method-specific options, if any, and also provide a general mechanism for passing any arbitrary option into specific methods without enumerating all of them.

There are several approaches we could take. Using augur tree as an example, one approach might be:

  1. Providing common method-specific options like --raxml-foo or --fasttree-bar-baz which map to --foo and --bar-baz in RAxML and FastTree. These would be "supported" method options.

  2. Providing a --method-options option which is used like --method-options=--foo. This provides people the rope they'd need to do what they wanted, but also implies that something custom is going on which requires you to know what you're doing ("you break it, you keep it").

Again, there are other ways to approach the syntax here, but the general idea is the same.

initially node_data.json was the only one. agree that the name doesn't make much sense anymore. but we also don't want a separate file for each individual attribute. if you call it dates, then branch lengths should be somewhere else and we get a massive proliferation of files.

On what order of magnitude would "massive" be?

@rneher
Copy link
Member

rneher commented Jun 27, 2018

the number of files is ultimately not the issue (would be probably just 2-3 more). The more central issue is how to address nodes across steps.

@tsibley
Copy link
Member

tsibley commented Jun 27, 2018

the number of files is ultimately not the issue (would be probably just 2-3 more). The more central issue is how to address nodes across steps.

Nod. It didn't seem like the number of files would actually be an issue, that's why I asked.

For the core issue of internal nodes, it seems that node annotations must happen after the topology is fixed, whichever step that may be. Is that correct?

@rneher
Copy link
Member

rneher commented Jun 28, 2018

what we could do are simple consistency checks on load:

  • name internal nodes in augur tree (these might change later)
  • if read_node_data is called, that implies a previous augur module requiring a tree was called. We could simply pass the tree file name to that function and check for consistency of names.
  • if an augur module is called with a 'raw' tree without internal labels, name them and rewrite the tree with nothing changed but the labels (kind of violates the in/output rules, but would enable somebody to go into the pipeline with an ML tree and a metadata table and perform traits. result would be a traits.json and a tree with internal node labels, the latter would be a surprise but could be communicated in all caps...)

@tsibley
Copy link
Member

tsibley commented Jun 28, 2018

Sounds reasonable, @rneher, except for one bit:

rewrite the tree with nothing changed but the labels (kind of violates the in/output rules…)

Overwriting an input file in-place is a huge taboo. Even with an all caps warning, it could easily be a nasty surprise. I strongly recommend not doing this, but finding another way of allowing on-the-fly node naming if a tree without them is provided. This could be, for example, augur traits automatically running an augur assign-node-names command with the "raw" tree as input and saving a new "named" tree as output and then automatically using that as the actual input to augur traits.

But also, augur traits could just require a tree with named nodes and tell you to run augur assign-node-names if you have a tree without them.

@trvrb
Copy link
Member Author

trvrb commented Jun 28, 2018

@rneher: I definitely agree on finding ways to do consistency checks.

@tsibley: I agree that silently rewriting files sounds like a bad idea.

@rneher
Copy link
Member

rneher commented Jun 28, 2018

Overwriting an input file in-place is a huge taboo. Even with an all
caps warning, it could easily be a nasty surprise.

I agree. (but hey, gzip does it...)

but finding another way of allowing on-the-fly node
naming if a tree without them is provided. This could be, for example,
automatically running an |augur assign-node-names| command with the
"raw" tree as input and saving a new "named" tree as output and then
using that as the new input.

yes, I was thinking something similar.

But also, |augur traits| could just require a tree with named nodes and
tell you to run |augur assign-node-names| if you have a tree without them.

it sort of strikes me as clunky to have such an book-keeping step explicit.

@jameshadfield
Copy link
Member

re: creating an ancestral command (to be run after timetree temporal or whatever it's called):

  • 👎 it is unnecessary compute, as it has to be done within timetree anyways
  • 👍 it's clearer what's going on
  • 👍 it'd be super useful to add ancestral states to a BEAST tree, where we don't want to mess with topology or branches. (This isn't meant to turn this thread into "how to deal with BEAST", but clearly it's something i've been doing and will move over to modular augur at some point)

For what it's worth, I like the requirement for "new" layers to simply write a flat JSON with the nodes object and have that end up in the tree.JSON

@trvrb
Copy link
Member Author

trvrb commented Jun 29, 2018

Okay. Latest thoughts...

It's clear that we need to support at least 3 big use cases.

  1. Going from alignment to ML tree and time tree via IQ-Tree + TreeTime and then traits, etc... + export
  2. Going from alignment to ML tree via IQ-Tree and then traits, etc... + export
  3. Going from BEAST tree to ML branch lengths and then traits, etc... + export

The current setup is actually not so bad for these three aims.

  1. Is accomplished with augur tree followed by augur treetime to produce tree.nwk and node_data.json that then form the basis of downstream operations.
  2. Is accomplished with augur tree followed by augur treetime but asking treetime not to make a timetree.
  3. Is accomplished with augur treetime but asking treetime to use the input tree as timetree.

I think the core operations could be made more obvious by a simple rename of augur treetime to augur refine. This makes it clear (at least to me) that you would produce a raw tree tree_raw.nwk with augur tree than then needs to be refined by running augur refine before doing any downstream operations. This seems semantically logical to me. You would still need to "refine" a BEAST tree even if it doesn't need timing done.

In one example... there was the issue of wanting to add traits to an ML tree (option 2 above). In this case, I'd recommend to enforce augur traits to take node_data.json as input. This way it's clear that you go augur tree, augur refine, augur traits. Additionally, augur refine could handle rerooting the ML tree via --root NODENAME as it currently does. augur tree wouldn't expose rooting options of IQ-Tree / RAxML.

I know this is basically exactly as things currently exist. Would just work on the exact interface to augur treetime / augur refine.

@rneher
Copy link
Member

rneher commented Jun 29, 2018

an argument in favor of pulling sequences out of node_data.json and putting them into a separate file:

this would align the file I/O pattern between the vcf/fasta workflows. I am fine with splitting sequence reconstruction from rerooting/timetree if this makes is more transparent. the computational overhead is small.

@rneher
Copy link
Member

rneher commented Jun 29, 2018

392c17c checks whether names in node_data match nodes in tree if provided. We should more generally discuss how we want to handle errors and warnings.

@trvrb
Copy link
Member Author

trvrb commented Jun 30, 2018

I hadn't noticed that VCF version of augur treetime produces three file outputs https://github.com/nextstrain/augur/blob/master/builds/tb/Snakefile#L68. To me, this is one output too many and would push for separating augur refine from augur ancestral. augur ancestral would produce a single output file (likeaugur traits) that's something like nt_muts.json (or I guess nt_muts.vcf).

Looking at tb/results/node_data.json I see things like "mutations": [["A", 600688, "C"], so I don't quite understand why treetime.vcf needs to exist (I see a corresponding line that reads MTB_anc 600688 . A C , but this is a separate issue and likely due to my own naivety with what's going on with the VCF pipeline.

Edit: I see... The difference is that node_data.json gets a sequence attribute. Vs VCF where this attribute is not supplied.

@rneher
Copy link
Member

rneher commented Jul 1, 2018

I agree this needs streamlining. sequences in node_data.json was really just a temporary fix when I realized I need them in translate.

@trvrb
Copy link
Member Author

trvrb commented Jul 4, 2018

I'm going to close this now given that #145 is merged. There were some issues surfaced above (like passing arguments to IQTREE and sequences in node data), but I think these exist better as their own issues.

@trvrb trvrb closed this as completed Jul 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants