Understand mutations produced by augur #583
Comments
Hi @biocyberman , thanks for reaching out and I hope I can help!
The root sequence, on the other hand, is the inferred ancestor of the tree. It is unlikely to be identical to the reference, as the reference is generally a 'real' sequence, from GenBank or similar. Because the root is inferred, a virus like this may have existed in the past, but it may not have been exactly the same as what we infer; we simply reconstruct the most likely sequence. The mutations shown in augur/auspice are relative to the root sequence, not to the reference sequence, and are cumulative on the tree (to find all the mutations on a tip, start at the root and note all mutations on every node along the path from the root to the tip). It's important to note that this is different for VCF input. There, the reference plays a really important role: VCF files describe only the sites that differ, so the reference is needed to know all the sites that don't change. In these trees, the root of the tree (which is, again, inferred from the sequences) is also expressed relative to the reference, so you may see mutations on the root node itself. However, you don't need to worry about this for FASTA input.
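The root-to-tip walk described above can be sketched in a few lines of Python. This is only an illustration, not how augur or auspice implement it: the toy tree, the node names and the `PARENT` / `BRANCH_MUTS` dictionaries are hypothetical stand-ins for whatever your own tree and node data contain.

```python
# Hypothetical toy tree: child -> parent. The root has no entry.
PARENT = {
    "NODE_1": "ROOT",
    "NODE_2": "NODE_1",
    "tip_A": "NODE_2",
    "tip_B": "NODE_1",
}

# Hypothetical mutations on the branch leading to each node,
# written as <ancestral state><1-based position><derived state>.
BRANCH_MUTS = {
    "ROOT": [],
    "NODE_1": ["C100T"],
    "NODE_2": ["A250G"],
    "tip_A": ["T100C"],
    "tip_B": [],
}


def path_from_root(tip, parent):
    """Return the list of nodes from the root down to `tip`."""
    path = [tip]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]


def mutations_along_path(tip, parent, branch_muts):
    """Collect (node, mutation) pairs in root-to-tip order."""
    collected = []
    for node in path_from_root(tip, parent):
        for mut in branch_muts.get(node, []):
            collected.append((node, mut))
    return collected


for node, mut in mutations_along_path("tip_A", PARENT, BRANCH_MUTS):
    print(node, mut)
```

In this toy data, position 100 changes on `NODE_1` and changes back on `tip_A`, so the full list along the path has to be read in order rather than branch by branch.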
Hi @emmahodcroft What I am aiming for is to cluster or map the real mutations into a hierarchical cluster. I was hoping the tree could be used for this purpose and, even better, that the mutations along the tree could be used directly. Should I generate a VCF some other way and attach it to the tree? In that case I don't yet know what to do with the sequences of internal nodes. => 2. Yeah, it is a complicated problem, so it is hard to explain briefly and clearly. If you take a look at the Python script in the gist I shared, you can see I did walk along the path
Hi @biocyberman !
I would definitely say it's not really worth the effort to convert sequences to VCF and run them through the pipeline in this fashion, but it's possible! You'll need to change some of the rule options to get the pipeline working. I think an easier course would be to switch from thinking about mutations as relative to an arbitrary reference to thinking of them as relative to a reconstructed ancestor, and then using these on the tree - but it's your choice!
Unfortunately I can't open .gz files at the moment - can you re-post as .zip?
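For anyone who still wants reference-relative mutations for a tip, one possible route is to rebuild the tip sequence from the root plus the cumulative mutations and then diff it against the reference. The sketch below is an assumption-laden illustration: the short sequences are made up, both sequences are assumed to be aligned and of equal length, and `apply_mutations` and `diff_against` are hypothetical helper names, not augur functions.

```python
def apply_mutations(root_seq, muts):
    """Apply root-to-tip ordered mutations (e.g. 'C2T', 1-based) to a root sequence."""
    seq = list(root_seq)
    for mut in muts:
        anc, pos, der = mut[0], int(mut[1:-1]), mut[-1]
        if seq[pos - 1] != anc:
            raise ValueError(f"{mut}: expected {anc} at position {pos}, found {seq[pos - 1]}")
        seq[pos - 1] = der
    return "".join(seq)


def diff_against(base_seq, seq):
    """List substitutions of `seq` relative to `base_seq` (same length, aligned)."""
    if len(base_seq) != len(seq):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        f"{b}{i}{s}"
        for i, (b, s) in enumerate(zip(base_seq, seq), start=1)
        if b != s
    ]


# Made-up example data: the reference differs from the inferred root at position 8.
root = "ACGTACGT"
reference = "ACGTACGA"
tip = apply_mutations(root, ["C2T", "G7A"])

print(diff_against(root, tip))       # mutations relative to the inferred root
print(diff_against(reference, tip))  # mutations relative to the reference
```

The two lists differ wherever the inferred root and the reference disagree, which is why a tip can carry a "mutation" relative to the reference that never appears on any branch of the tree.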
Hi @emmahodcroft => 2. You are right, position
I uploaded the same data in zip format.
Hi @biocyberman -
It could indeed be that this is a primer problem of some kind, and these labs are using the same primers and/or similar/same bioinformatics pipelines. However, for many totally different labs to have that problem, it's a bit odd. We'd expect to perhaps see more correspondence with samples from the same country and same lab having the problem. (Having said that, I haven't looked into this in great detail.) So, I think we don't know if these are real mutations or artefacts of sequencing/assembly - but they don't seem to be a problem with the Treetime algorithm, as far as I can see.
I guess this has much to do with how treetime works, but I haven't had enough time to check it out thoroughly. So I would like to get some help understanding the mutations reported by augur. This is important for us to interpret the results on auspice.
I composed a test script (check_muts.py), run command (run_check.sh) and example output (check_muts.txt) in this gist: https://gist.github.com/biocyberman/13cb3ca5fdd055bf213711b93e9e6b81
My questions are below
1. Root sequence vs reference sequence vs outgroup
I haven't got a reliable source and have been inferring, but it would be great if someone could explain this to me. I noticed that the root sequence from nt_muts.json produced by ancestral.py as in this rule is not the same as the sequence in ncov's reference.gb. Should they be the same or should they not? I tended to think they would be the same, but apparently they are not. The ncov workflow config also specifies an outgroup but doesn't seem to use it.
2. Why are there so many mutations flipping back and forth at a position, like I see in check_muts.txt?
For example, consider strain Hungary/SRC-00817/2020 with all of its mutations along the path here: Position 23731 is predicted to mutate twice? It is understandable to compare a strain's sequence against the reference sequence to decide mutations. It therefore becomes confusing when I look at a mutation on a time tree via the auspice interface, thinking that the mutation exists in the strain(s) at the tip of that branch, when it actually does not! For example, mutation C23731T for node NODE_0000873 in the output below doesn't exist for Hungary/SRC-00817/2020.
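One way to see why a mutation on an internal branch may be absent at the tip is to collapse the root-to-tip list into net changes per position. The sketch below is illustrative only: `net_mutations` is a hypothetical helper, and the example list is made up. `C23731T` is taken from the question, while the later reversion `T23731C` and the unrelated `A100G` are assumptions added to show the effect.

```python
def net_mutations(path_muts):
    """Collapse root-to-tip ordered mutations into net changes per position.

    A position that mutates and later reverts to its starting state
    produces no net mutation at the tip.
    """
    first_ancestral = {}  # position -> state at the root
    last_derived = {}     # position -> state at the tip
    for mut in path_muts:
        anc, pos, der = mut[0], int(mut[1:-1]), mut[-1]
        first_ancestral.setdefault(pos, anc)
        last_derived[pos] = der
    return [
        f"{first_ancestral[pos]}{pos}{last_derived[pos]}"
        for pos in sorted(last_derived)
        if first_ancestral[pos] != last_derived[pos]
    ]


# Hypothetical path: position 23731 mutates on an internal node and
# reverts on a later branch, so only A100G survives at the tip.
path = ["C23731T", "A100G", "T23731C"]
print(net_mutations(path))
```

Under this reading, a branch label in auspice describes what changed on that branch, and only the net result of all branches on the path describes the tip.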