Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Understand mutations produced by augur #583

Open
biocyberman opened this issue Jun 27, 2020 · 5 comments
Open

Understand mutations produced by augur #583

biocyberman opened this issue Jun 27, 2020 · 5 comments
Labels
documentation faq Frequently asked question or common issue that could warrant documentation

Comments

@biocyberman
Copy link

biocyberman commented Jun 27, 2020

I guess this has much to do with how treetime works, but haven't got enough time to check it out thoroughly. So I would like to get some help to understand the mutations reported by augur. This is important to us to interpret the results on auspice.

I composed a test script (check_muts.py), run command (run_check.sh) and example output (check_muts.txt) in this gist: https://gist.github.com/biocyberman/13cb3ca5fdd055bf213711b93e9e6b81

My questions are below

1. Root sequence vs reference sequence vs outgroup

I haven't got a reliable source and have been inferring, but it would be great if someone could explain this to me. I noticed that root sequence from nt_muts.json produced by ancestral.py as in this rule is not the same as sequence in ncov's reference.gb. Should they be the same or should they not? I tend to think they are the same, but apparently not. The ncov workflow config also outgroup and doesn't seem to use it.

2. Why are there quite many flipping of mutations around a position like I see in the check_muts.txt?

For example, consider strain Hungary/SRC-00817/2020 with all of its mutations along the path here: Position 23731 are predicted to mutate twice ? It is understandable to compare strain's sequence vs reference sequence to decide mutations. And therefore it becomes confusing when I look at a mutation on a time tree via auspice interface, thinking that the mutation exists in the strain(s) at the tip of that branch, but it actually actually does not! For example, mutation C23731T for node NODE_0000873 in the output below doesn't exist for Hungary/SRC-00817/2020.

Timetree accumulated mutations do not match direct mutations for Hungary/SRC-00817/2020:
Unique for timetree: {'C23731T', 'T23731C'}
Intersection: {'G28882A', 'C13536T', 'A23403G', 'C4002T', 'G10097A', 'C14408T', 'C241T', 'C3037T', 'C19718T', 'G28881A', 'G28883C'}
Unique for direct: set()

Debug, Node Path: NODE_0000100,NODE_0000343,NODE_0000376,NODE_0000341,NODE_0000342,NODE_0000394,NODE_0000323,NODE_0000348,NODE_0000739,NODE_0000873,NODE_0000881,NODE_0000734,NODE_0000896,Hungary/SRC-00817/2020

NODE_0000100:[]
		[](NODE_0000079)
		[](NODE_0000343)
NODE_0000343:[]
		['A5084G', 'C28854T'](NODE_0000268)
		[](NODE_0000376)
NODE_0000376:[]
		['C17373T'](NODE_0000330)
		['C241T', 'A23403G'](NODE_0000341)
NODE_0000341:['C241T', 'A23403G']
		['T141C', 'C1473T', 'C16523A', 'G26144T'](Uganda/UG013/2020)
		['C3037T'](NODE_0000342)
NODE_0000342:['C3037T']
		['G27915T', 'G28580T'](NODE_0000347)
		[](NODE_0000394)
NODE_0000394:[]
		[](NODE_0000396)
		[](NODE_0000323)
NODE_0000323:[]
		[](Germany/BavPat2-ChVir984-ChVir1017/2020)
		['C14408T'](NODE_0000348)
NODE_0000348:['C14408T']
		['G10523C', 'C13424A', 'C14097T', 'G14580T', 'A19298G', 'A21163T', 'A21166T'](Morocco/6893/2020)
		['A3337G', 'T3741C', 'T3762A', 'G21985A', 'T22308C', 'T26187C'](Finland/FIN-25/2020)
		['C13620T', 'C21595T'](NODE_0000402)
		['C23575T'](NODE_0000464)
		['A187G'](NODE_0000473)
		['A20268G'](NODE_0000380)
		['A26530G'](NODE_0000359)
		[](NODE_0000356)
		['C15324T'](NODE_0000475)
		['G25563T'](NODE_0000578)
		['G28881A', 'G28882A', 'G28883C'](NODE_0000739)
NODE_0000739:['G28881A', 'G28882A', 'G28883C']
		['A8945G', 'T24022C'](Peru/010/2020)
		['G957A'](Nigeria/Lagos01/2020)
		['A1103G', 'C1420T', 'G1743T', 'G2422C', 'T27153G', 'C27689T'](Portugal/PT0593/2020)
		['C4010T', 'C18555T', 'A25983G'](Ecuador/HGSQ-USFQ-018/2020)
		['A8449C', 'C13115T', 'C20234T'](SouthAfrica/KRISP-109/2020)
		['A23116T'](Denmark/SSI-01/2020)
		['G10427A', 'C14786T', 'C20389T'](Morocco/6888/2020)
		['A5220T', 'C19170T', 'G19509A', 'T25461C'](SouthAfrica/KRISP-012/2020)
		['C27046T'](NODE_0000800)
		['G12832A'](NODE_0000815)
		['A2869G', 'C17733T', 'C29095T'](NODE_0000783)
		['T3037C'](NODE_0000801)
		['C6573T', 'C25528T'](NODE_0000850)
		['C3373A'](NODE_0000859)
		['G10097A', 'C23731T'](NODE_0000873)
		[](NODE_0000889)
NODE_0000873:['G10097A', 'C23731T']
		['G28580T'](NODE_0000874)
		['C4002T', 'C13536T'](NODE_0000881)
NODE_0000881:['C4002T', 'C13536T']
		['G28209T'](Croatia/ZG-297-20/2020)
		[](Denmark/SSI-05/2020)
		[](Denmark/ALAB-SSI-248/2020)
		['A6693G', 'C8386T'](NODE_0000908)
		['G11083T'](NODE_0000915)
		[](NODE_0000734)
NODE_0000734:[]
		['C19718T'](NODE_0000896)
		[](NODE_0000736)
NODE_0000896:['C19718T']
		['T23731C'](Hungary/SRC-00817/2020)
		['C22323T'](Austria/CeMM0067/2020)
Hungary/SRC-00817/2020:['T23731C']

@emmahodcroft
Copy link
Member

Hi @biocyberman , thanks for reaching out and I hope I can help!

  1. For fasta-input runs, the reference sequence is used to align the sequences (and usually the same one is used to provide translation information - gene annotations etc - but this doesn't necessarily have to be the same sequence). It can also be used to root the sequence (but again, it doesn't have to be the same) an outgroup might also be used for rooting - we may once have used this in the ncov repo, and that may be a leftover from that. Currently in ncov we root based on early samples.

The root sequence, on the other hand, is the inferred ancestor of the tree. So, this is unlikely to be the reference, as the reference is generally a 'real' sequence, from GenBank or similar. The root is the inferred ancestor and so while a virus like this may have existed in the past, it may not have been exactly the same as what we infer - we just reconstruct this as likely as possible.

The mutations shown in augur/auspice are relative to the root sequence, not to the reference sequence, and are cumulative on the tree (to find all the mutations on a tip, start at the root, and note all mutations on every node up from the root to the tip).

It's important to note that for VCF-input, this is different. There, the reference plays a really important role - as VCF files themselves only describe the sites that are different - the reference is needed to know all sites that don't change. In these trees, the root on the tree (which is again, inferred from the sequences) is also relative to the reference, so you'll see there may be mutations on the root node itself. However, you don't need to worry about this for Fasta-file input.

  1. I'm not entirely sure I understand question two. The only way to know if a tip has a mutation is to walk through the tree from the root and keep track. It is always possible there's a mutation and then a back-mutation. Alternatively, if mutations are switching a lot with in a cluster, this can be a sign of alignment/sequencing problems making it hard to reconcile the tree topology. From the example you gave, it seems a mutation happens at 23731 and then it back-mutated. It's hard to say whether this seems to be problematic or not without knowing more about the tree. Could you send the tree, or, could you find a similar example on one of the dated Nextstrain runs - ex: https://nextstrain.org/ncov/global/2020-06-23 (you can change the date to about any day)?

@biocyberman
Copy link
Author

biocyberman commented Jun 29, 2020

Hi @emmahodcroft
=> 1. Thanks for clarification and also confirming my inference, that sequence of the root of the tree is not the same as reference sequence. This however makes interpretation the mutations on tree harder: Those mutations are not always 'real' mutation. By 'real' I mean by comparing directly to the reference sequence. I would say this is a big thing that many people who look at the mutations are not aware of this, especially those have worked with multi-sequence-alignment-based mutation calculation.

What I am aiming is to cluster or map the real mutations into a hierarchical cluster. I was hoping the tree would be used for this purpose, and even better, the mutations along the tree can be used directly for this. Should I generate VCF by some other way and attach the the tree? In that case I don't know yet what to do with sequence of internal node.

=> 2. Yeah it is a complicated problem so it is hard to explain shortly and clearly. If you take a look at the python script in the gist I shared, you can see I did walk along the path Hungary/SRC-00817/2020 and accumulate mutations. I first thought it might be back-mutation, but it happens to quite many sequences, so I suspect it may be another reason. I attached tree, nt_muts.json. To reproduce the pattern of flipping mutations, you can also run the python script on your own data. Sample command is in the run_check.sh part of the gist.

test.data.tar.gz

@emmahodcroft
Copy link
Member

Hi @biocyberman !

  1. So it's fairly normal in a phylogenetic context that mutations are relative to the root of the tree. As a reference sequence could be chosen from anywhere on the tree - in EV-D68 we choose it from a relatively recent tip, as we don't have very many good quality older sequences. So then, the mutations being relative to a reference doesn't really make sense - at least not in a tree context. You can always find out mutations relative to a reference - either manually by traversing the tree and then comparing to a reference, or using code/software to do these comparisons (this is not an uncommon question in biology, so there are tools to calculate this!)

I would definitely say it's not really worth the effort to convert sequences to VCF and run through it in this fashion, but it's possible! You'll need to change some of the rule options to get the pipeline working. I think an easier course would be to change thinking about mutations as relative to an arbitrary reference to being relative to a reconstructed ancestor, then using these on the tree - but its your choice!

  1. When I look at the position you mention (23731) on the Nextstrain tree, I'm not seeing it doing a lot of flipping - it looks like it's in two clusters: https://nextstrain.org/ncov/global/2020-06-29?c=gt-nuc_23731 . Even in the example you give above, it seems it's only changed twice in the tree. I wouldn't consider this a lot of flipping. Of course, there are sites where we may see more changes and often this can indicate a sequencing issue - but this should be fairly rare.

Unfortunately I can't open .gz files at the moment - can you re-post as .zip?

@biocyberman
Copy link
Author

biocyberman commented Jul 1, 2020

Hi @emmahodcroft
=>1. How stable is the inferred root sequence? In what case will it change? The script I wrote already can compute mutations relative to reference to compare with mutations relative to root sequence.

=> 2. You are right, position 23731 changes twice, and not many times for that position. What I meant when I wrote earlier is that this flipping happens to others positions and internal nodes of the tree. For example, along the path of Uganda/UG009/2020:

Timetree accumulated mutations do not match direct mutations for Uganda/UG009/2020:
Unique for timetree: {'T3037C', 'C3037T', 'T14408C', 'C14408T'}
Intersection: {'C241T', 'G28882A', 'T7207A', 'C21335T', 'G28580T', 'G10097A', 'C5997T', 'G28881A', 'T606C', 'A23403G', 'G28883C'}
Unique for direct: set()
 

I uploaded same data in zip format.
test.data.zip

@emmahodcroft
Copy link
Member

Hi @biocyberman -

  1. The root sequence may change, just like the tree, as it'll depend on where the branches happen at the root. I can't say how much this happens, as I haven't looked into it, and we wouldn't expect major changes.

  2. I had a look at the Uganda sequence you referenced. However, we have multiple sequences from around the world that show these same two mutations -
    Position 14408: https://nextstrain.org/ncov/africa/2020-06-30?c=gt-nuc_14408&label=clade:20B
    Position 3037: https://nextstrain.org/ncov/africa/2020-06-30?c=gt-nuc_3037&label=clade:20B

It could indeed be that this is a primer problem of some kind, and these labs are using the same primers and/or similar/same bioinformatics pipelines. However, for many totally different labs to have that problem, it's a bit odd. We'd expect to perhaps see more correspondence with samples from the same country and same lab having the problem. (Having said that, I haven't looked into this in great detail.)

So, I think we don't know if these are real mutations or artefacts of sequencing/assembly - but they don't seem to be a problem with the Treetime algorithm, as far as I can see.

@huddlej huddlej added faq Frequently asked question or common issue that could warrant documentation documentation labels Jul 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation faq Frequently asked question or common issue that could warrant documentation
Projects
None yet
Development

No branches or pull requests

3 participants