Errors and warnings for trees without internal node names #283

emmahodcroft · 2019-05-09T15:45:30Z

These are changes designed to detect cases where refine hasn't been run in order to give internal node names, and other functions that could or do rely on node-names are called. Some give warnings, others give errors and fail.

The warnings/errors I implemented are quite wordy. If there's a better way to clearly convey the problem/solution to users, please modify :)

1. Make adding node labels easier in refine

Currently just to add node labels you must provide --tree, --output-tree, --alignment, and, if VCF, --vcf-reference (because this is checked if you've provided a VCF alignment). The alignment isn't actually used - the code is there for a fake alignment to be made if only renaming nodes.... but if you haven't provided --output-node-data (why would you when just naming nodes) then it tries to make the filename from args.alignment...

Node-data is necessary (it seems) even if it only contains branch lengths, because export looks for branch lengths in metadata read in from such files. However, I played around and export can quite easily be modified to look for branch lengths in the Newick if it doesn't find it in the metadata. @jameshadfield , is there any other reason we couldn't move to this?

For the moment, I've modified this so that if args.alignment isn't present, it makes the node-data filename from args.tree. So, you can run just name nodes with:
augur refine --tree results/tree_raw.nwk --output-tree results/tree.nwk

Similarly, --output-tree is also not a required argument but looks to args.alignment to create the output file name, so I also modified this so that if args.alignment is missing, it'll use args.tree (required). (so you can actually now just run:
augur refine --tree results/tree_raw.nwk )

2. Added errors with likely problem and solution to translate

a) If some/all node names are missing, advises user to check they are using the tree output by refine. If refine hasn't been run, gives the run command above.

b) If a node in the tree can't be found in the alignment, advises user to ensure they've used the same tree as input for ancestral and translate, or to re-run ancestral with the tree they provided here, then run translate with the new alignment output.

This doesn't detect situations where the order/topology of the tree has changed since running ancestral but all the node names are the same. (But I'm not sure we could detect this?)

For Fasta-input, there is a message if nodes can't be found in the translation: "no sequence pair for nodes node1-node2", but even if this message displays for every or almost every pair, it doesn't give an error and creates the output files anyway. I don't think that's necessarily obvious that something's gone wrong!
I added an extra catch that if it can't find both nodes in the alignment, it breaks and gives the same error as b) above. However, this allows some nodes to be missing, unlike with VCF (where any missing cause an error). We could be more strict here with Fasta-input as well - unsure if there is a reason for it allowing lots of missing nodes?

For Fasta-input, if the root node doesn't have a match in the alignment, it won't have the full translated sequence(s) in the resulting JSON. This doesn't seem to affect how it views in auspice - could this cause errors elsewhere? (If so, easy to add an extra check here.)

I used some return strings here to separate out the errors. I don't know if that's good Python practice - feel free to edit if there's a better way! (I wanted to be able to include the filename args in the error messages.)

3. Added warnings to traits and ancestral

If it's detected that the tree has no internal node names, a warning is displayed that tells the user if they'd like to use export or translate later, they should either use the tree from refine as input, or run refine to add node names (and gives the simple command)

4. Fixed and added error messages in export (minor)
Minor fixes

I've tested all of the above with Fasta and VCF, but probably not exhaustively, so other eyes would be appreciated.

jameshadfield

@emmahodcroft this is great -- thanks for doing this 🎉

My comments (below) are generally aimed at forcing users to add node names via augur refine and nowhere else. I think this makes things less error-prone (e.g. it avoids the subtle bug where the wrong tree was provided to augur traits and your warning message ignored). Naming things in refine also means that it's only ever done once -- whether you infer temporal phylogeny or not.

This doesn't detect situations where the order/topology of the tree has changed since running ancestral but all the node names are the same. (But I'm not sure we could detect this?)

This would be avoided if we only ever added internal nodes via refine and all downstream commands checked for these.

augur refine --tree results/tree_raw.nwk --output-tree results/tree.nwk

This is much much much nicer 👍
Perhaps we could detect this case (i.e. no alignment provided) and print some output indicating that refine is only adding internal node labels.

For Fasta-input, if the root node doesn't have a match in the alignment, it won't have the full translated sequence(s) in the resulting JSON. This doesn't seem to affect how it views in auspice - could this cause errors elsewhere? (If so, easy to add an extra check here.)

Hmm not sure I understand this. How can we infer mutations on the root node?

jameshadfield · 2019-05-11T04:26:34Z

augur/ancestral.py

+        print("*** If you want to use 'augur export' or `augur translate` later, re-run this command with the output of 'augur refine'.")
+        print("*** If you haven't run 'augur refine', you can add node names to your tree by running:")
+        print("*** augur refine --tree %s --output-tree <filename>.nwk"%(args.tree) )
+        print("*** And use <filename>.nwk as the tree when running 'ancestral', 'translate', and 'traits'")


This is great 👍

I would be strongly in favor of changing this to an ERROR and exiting at this point, thus forcing the user to generate these internal node names through augur refine, then rerunning ancestral / traits using the correct tree.

I think this might raise a bigger question of how we want these individual parts to operate - more independently or always together? Should it be ok just to run ancestral or translate to get some information, without the intention of putting it together or through export (in which case node names don't matter), or do we want to make people always keep the pieces so that they could be put together, because they may make that decision later without remembering the warnings? I can think for both sides, myself - it would be good to hear thoughts of others. (Should I ask in Slack?)

I'm all for being able to use these features as independent parts, with no intention to run export. But if one runs augur traits (e.g.) using a tree without internal node names, then you get an output (json) which refers to things like NODE00001 which do not exist in the tree. I think in these cases we want to exit and say "We can only infer things onto a tree with internal node names labelled, else you cannot match the results to your tree! One way to do this is to run augur refine --tree A --output-tree B"

I'm happy to change to this. Feel free to nudge me if I haven't done it in a few days...!

jameshadfield · 2019-05-11T04:32:53Z

augur/translate.py

+            print("\n*** ERROR: Mismatch between node names in %s and in %s"%(args.tree, args.ancestral_sequences))
+            print("*** Ensure you are using the same tree you used to run 'ancestral' as input here.")
+            print("*** Or, re-run 'ancestral' using %s, then use the new %s as input here."%(args.tree, args.ancestral_sequences))
+        return 1


jameshadfield · 2019-05-11T04:44:08Z

augur/refine.py

    if args.output_node_data:
        node_data_fname = args.output_node_data
-    else:
+    elif args.alignment:


Node-data is necessary (it seems) even if it only contains branch lengths, because export looks for branch lengths in metadata read in from such files. However, I played around and export can quite easily be modified to look for branch lengths in the Newick if it doesn't find it in the metadata.

I think it's sensible for export to work without any node_data JSONs, and look in the newick file for divergence. (If divergence is provided in the node_data files then it should preferentially use that.) Certainly if I just used refine to add node names it seems unnecessarily complicated to have a node data JSON produced.

Ok, I'll make a PR with this change in export, and we can take a look and ensure it seems like it's going to work as intended, then we can modify refine here so that it doesn't output this if only node-names are added.

I played around with this in export. It's simple to get branch lengths from the tree rather than branch_lengths node-data (but preferentially from branch_lengths node-data, if available). However, to test this, I'm inputting a 'raw' tree, then providing node-data (some is required) for only aa_muts (for example).

So the deeper question is, if you export with a raw tree and some ancestral, or trait info also generated on a raw tree (so no internal node names match the tree), how should export handle this?

I played around and for V1, changed recursively_decorate_tree_json_nextflu_schema so that if node['strain'] is None (this is an internal node, so has no name on a raw tree), it doesn't error, but displays a warning saying it can't find the node, and so metadata for this node won't be written. Essentially, the only node-data traits attached to the tree are those on the tips.

Thanks to the wonderful raw tree I am using, this looks like a hot mess, but it works...:

By 'works' I mean that you can see different AA mutations at the tips, for example. Branches always seem to be just coloured by the first AA in the legend, I guess that's a default behaviour somewhere for if there's no other information for the branch? But assume this could be changed to default to grey?

For traits there ends up being no difference between just adding the trait from the metadata.

Is this the kind of behaviour we want? Or do we not want to support this?

(I tried to test this for V2 as well, but didn't realise it required reinstalling Auspice... so I've modified export to give me V2 JSONS without error, but I haven't yet tested them. Same theory applies as above.)

Both the V1 exports that work in 'current' auspice and the V2 export do NOT work in V2 auspice.

So no, it doesn't work for mutations, in V1 or V2. --node-data in export that can't be matched with nodes in the tree should cause an error (a reasonable one, unlike currently).

We could in theory still support export reading branch length from the tree instead of from a node-data file, in case they ran refine just to get node-names but didn't include the branch-length node-data JSON in export, but this is a question of how much we want to coddle users I suppose. If not, we should at least try to give a reasonable error probably.

augur/translate.py

emmahodcroft · 2019-05-13T08:38:11Z

For Fasta-input, if the root node doesn't have a match in the alignment, it won't have the full translated sequence(s) in the resulting JSON. This doesn't seem to affect how it views in auspice - could this cause errors elsewhere? (If so, easy to add an extra check here.)

I'm less familiar with Fasta-input, and I've already forgotten a lot over the weekend - I'll look again. But from what I remember, normally the whole translated sequence of the root node is put in aa_muts, whereas on other nodes it's just the mutations. If the root node in the tree doesn't have a corresponding sequence in translate, then it won't have a sequence written out. From my trials on Friday, it seems this didn't actually impact how things ended up looking in auspice - I don't think we actually export that translated sequence at the moment, an as mutations are already inferred, it's not needed? (But I could be wrong.) However, from other discussions we've had it seems we may assume this is present in future (to allow auspice to show the base at a constant site, for example), so perhaps better to check for this now...

Alternatively we could make the Fasta-input as strict as the VCF-input, where even one missing internal node will cause the whole thing to error, which would intrinsically solve the problem.

emmahodcroft · 2019-05-13T15:05:17Z

I changed to using Exceptions as suggested in translate.

Remaining issues:

Should translate always output a full translated sequence on the root node for Fasta-input? Should Fasta-input be as strict about missing nodes as VCF-input? (Which would solve the first issue.)
How independent do we want augur commands to be? Should users be able to run translate without internal node names if they wish (just give a warning)? Or should they be forced to generate node names to ensure future stuff works predictably (give an error)?
What do we want to require in export? If it can't find internal node info, is it ok to leave it out? (see comment above)

Errors and warnings for trees without internal node names

e235d14

emmahodcroft requested a review from jameshadfield May 9, 2019 15:50

jameshadfield reviewed May 12, 2019

View reviewed changes

use exceptions to catch node errors

a391380

rneher merged commit fddb5b3 into master Jul 10, 2019

rneher deleted the better_node_naming branch July 10, 2019 12:43

emmahodcroft mentioned this pull request Aug 9, 2019

refine w/o TreeTime Error #348

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Errors and warnings for trees without internal node names #283

Errors and warnings for trees without internal node names #283

emmahodcroft commented May 9, 2019 •

edited

jameshadfield left a comment

jameshadfield May 11, 2019

emmahodcroft May 13, 2019 •

edited

jameshadfield May 17, 2019

emmahodcroft May 17, 2019

jameshadfield May 11, 2019

jameshadfield May 11, 2019

emmahodcroft May 13, 2019

emmahodcroft May 13, 2019

emmahodcroft May 13, 2019

emmahodcroft May 13, 2019

emmahodcroft commented May 13, 2019

emmahodcroft commented May 13, 2019

Errors and warnings for trees without internal node names #283

Errors and warnings for trees without internal node names #283

Conversation

emmahodcroft commented May 9, 2019 • edited

jameshadfield left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

emmahodcroft May 13, 2019 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

emmahodcroft commented May 13, 2019

emmahodcroft commented May 13, 2019

emmahodcroft commented May 9, 2019 •

edited

emmahodcroft May 13, 2019 •

edited