-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ancestral] positions with no information are assigned a base which may contradict the reference #1382
Labels
bug
Something isn't working
Comments
jameshadfield
added a commit
that referenced
this issue
Dec 29, 2023
Fixes a bug where positions which were all ambiguous (i.e. masked) positions were being reported as mutations in the JSON output. Furthermore, while the mask was being applied to sequence export to make all masked positions "N", if we are inferring ambiguous sites then we should use the reference base instead. This is now fixed. Closes #1382 for the FASTA-input case only. Unfortunately the added test doesn't actually recreate the first bug - for this simple example TreeTime is not inferring a ATGC allele at the all-Ns position. I can recreate this (and discovered it) using Cholera data. I presume this is related to numerical consistency across the (flat) likelihood values in this small test case. Variable names are changed: `root_sequence` to `reference_sequence`. This is much clearer (to my eyes at least). NOTE: A further bug was discovered while working on this, which involved behaviour when we are not inferring ambiguous bases. See the added test here, which is currently disabled as otherwise CI fails. This bug is not newly introduced by this work.
jameshadfield
added a commit
that referenced
this issue
Dec 29, 2023
NOTE: This requires a corresponding change in TreeTime, included in PR <neherlab/treetime#263> Previously we were not calculating a mask for VCF files. This adds it and applies it the VCF output. See <#1382> and the parent commit for more context. Closes #1382 (together with the parent commit)
jameshadfield
added a commit
that referenced
this issue
Dec 30, 2023
Fixes a bug where positions which were all ambiguous (i.e. masked) positions were being reported as mutations in the JSON output. Furthermore, while the mask was being applied to sequence export to make all masked positions "N", if we are inferring ambiguous sites then we should use the reference base instead. This is now fixed. Closes #1382 for the FASTA-input case only. Unfortunately the added test doesn't actually recreate the first bug - for this simple example TreeTime is not inferring a ATGC allele at the all-Ns position. I can recreate this (and discovered it) using Cholera data. I presume this is related to numerical consistency across the (flat) likelihood values in this small test case. Variable names are changed: `root_sequence` to `reference_sequence`. This is much clearer (to my eyes at least). NOTE: A further bug was discovered while working on this, which involved behaviour when we are not inferring ambiguous bases. See the added test here, which is currently disabled as otherwise CI fails. This bug is not newly introduced by this work.
jameshadfield
added a commit
that referenced
this issue
Dec 30, 2023
NOTE: This requires a corresponding change in TreeTime, included in PR <neherlab/treetime#263> Previously we were not calculating a mask for VCF files. This adds it and applies it the VCF output. See <#1382> and the parent commit for more context. Closes #1382 (together with the parent commit)
jameshadfield
added a commit
that referenced
this issue
Jan 22, 2024
Fixes a bug where positions which were all ambiguous (i.e. masked) positions were being reported as mutations in the JSON output. Furthermore, while the mask was being applied to sequence export to make all masked positions "N", if we are inferring ambiguous sites then we should use the reference base instead. This is now fixed. Closes #1382 for the FASTA-input case only. Unfortunately the added test doesn't actually recreate the first bug - for this simple example TreeTime is not inferring a ATGC allele at the all-Ns position. I can recreate this (and discovered it) using Cholera data. I presume this is related to numerical consistency across the (flat) likelihood values in this small test case. Variable names are changed: `root_sequence` to `reference_sequence`. This is much clearer (to my eyes at least). NOTE: A further bug was discovered while working on this, which involved behaviour when we are not inferring ambiguous bases. See the added test here, which is currently disabled as otherwise CI fails. This bug is not newly introduced by this work.
jameshadfield
added a commit
that referenced
this issue
Jan 22, 2024
NOTE: This requires a corresponding change in TreeTime, included in PR <neherlab/treetime#263> Previously we were not calculating a mask for VCF files. This adds it and applies it the VCF output. See <#1382> and the parent commit for more context. Closes #1382 (together with the parent commit)
jameshadfield
added a commit
that referenced
this issue
Jan 22, 2024
Fixes a bug where positions which were all ambiguous (i.e. masked) positions were being reported as mutations in the JSON output. Furthermore, while the mask was being applied to sequence export to make all masked positions "N", if we are inferring ambiguous sites then we should use the reference base instead. This is now fixed. Closes #1382 for the FASTA-input case only. Unfortunately the added test doesn't actually recreate the first bug - for this simple example TreeTime is not inferring a ATGC allele at the all-Ns position. I can recreate this (and discovered it) using Cholera data. I presume this is related to numerical consistency across the (flat) likelihood values in this small test case. Variable names are changed: `root_sequence` to `reference_sequence`. This is much clearer (to my eyes at least). NOTE: A further bug was discovered while working on this, which involved behaviour when we are not inferring ambiguous bases. See the added test here, which is currently disabled as otherwise CI fails. This bug is not newly introduced by this work.
jameshadfield
added a commit
that referenced
this issue
Jan 22, 2024
NOTE: This requires a corresponding change in TreeTime, included in PR <neherlab/treetime#263> Previously we were not calculating a mask for VCF files. This adds it and applies it the VCF output. See <#1382> and the parent commit for more context. Closes #1382 (together with the parent commit)
jameshadfield
added a commit
that referenced
this issue
Jan 22, 2024
NOTE: This requires a corresponding change in TreeTime, included in PR <neherlab/treetime#263> Previously we were not calculating a mask for VCF files. This adds it and applies it the VCF output. See <#1382> and the parent commit for more context. Closes #1382 (together with the parent commit)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
(This is probably an issue to be addressed in TreeTime, but it manifests within Augur)
Current Behavior
Given an
augur ancestral
invocation with--root-sequence reference.fasta
which has (e.g.) "C" at a given position, and an input alignment in either VCF or FASTA format where all samples have "N" at that position, we infer a nucleotide for all samples in the tree thus resulting in a "C to something" mutation at the root node.Update: for FASTA input, the sequences attached to each node do have an "N" at this position, due to masking. See below.
The inferred base, which is the same for all nodes in the tree, seems to be consistent across invocations of
augur ancestral
. However it does differ between VCF and FASTA inputs (all other arguments being the same). The inference type (joint vs marginal) makes no difference to the result.P.S. I assume that in some case the assigned base just so happens to match the reference, my current debugging approach hasn't scanned for such occurrences.
Expected behavior
No data is available, and since we are running with the (default)
--infer_ambiguous
behaviour, we should infer all nodes to have the reference base. This will result in no reported mutations, and the inferred sequence at each node (FASTA only) using the reference base.How to reproduce
todo
Update
This may be an augur problem after all. The
ancestral
code includes a per-base-mask which is described asIndeed, for the FASTA input the mask is set to
True
for this position. But the mutations are still reported, and none ofaugur translate
,augur export
or Auspice use the mask. For VCF input we make a mask (and export it!) but it's a dummy mask with no information.The text was updated successfully, but these errors were encountered: