Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ancestral] positions with no information are assigned a base which may contradict the reference #1382

Closed
jameshadfield opened this issue Dec 27, 2023 · 0 comments
Labels
bug Something isn't working

Comments

@jameshadfield
Copy link
Member

jameshadfield commented Dec 27, 2023

(This is probably an issue to be addressed in TreeTime, but it manifests within Augur)

Current Behavior

Given an augur ancestral invocation with --root-sequence reference.fasta which has (e.g.) "C" at a given position, and an input alignment in either VCF or FASTA format where all samples have "N" at that position, we infer a nucleotide for all samples in the tree thus resulting in a "C to something" mutation at the root node.

Update: for FASTA input, the sequences attached to each node do have an "N" at this position, due to masking. See below.

The inferred base, which is the same for all nodes in the tree, seems to be consistent across invocations of augur ancestral. However it does differ between VCF and FASTA inputs (all other arguments being the same). The inference type (joint vs marginal) makes no difference to the result.

P.S. I assume that in some case the assigned base just so happens to match the reference, my current debugging approach hasn't scanned for such occurrences.

Expected behavior

No data is available, and since we are running with the (default) --infer_ambiguous behaviour, we should infer all nodes to have the reference base. This will result in no reported mutations, and the inferred sequence at each node (FASTA only) using the reference base.

How to reproduce

todo

Update

This may be an augur problem after all. The ancestral code includes a per-base-mask which is described as

Identify sites for which every terminal sequence is ambiguous. These sites will be masked to prevent rounding errors in the maximum likelihood inference from assigning an arbitrary nucleotide to sites at internal nodes.

Indeed, for the FASTA input the mask is set to True for this position. But the mutations are still reported, and none of augur translate, augur export or Auspice use the mask. For VCF input we make a mask (and export it!) but it's a dummy mask with no information.

@jameshadfield jameshadfield added the bug Something isn't working label Dec 27, 2023
jameshadfield added a commit that referenced this issue Dec 29, 2023
Fixes a bug where positions which were all ambiguous (i.e. masked)
positions were being reported as mutations in the JSON output.

Furthermore, while the mask was being applied to sequence export to make
all masked positions "N", if we are inferring ambiguous sites then we
should use the reference base instead. This is now fixed.

Closes #1382 for the FASTA-input case only.

Unfortunately the added test doesn't actually recreate the first bug -
for this simple example TreeTime is not inferring a ATGC allele at the
all-Ns position. I can recreate this (and discovered it) using Cholera
data. I presume this is related to numerical consistency across the
(flat) likelihood values in this small test case.

Variable names are changed: `root_sequence` to `reference_sequence`.
This is much clearer (to my eyes at least).

NOTE: A further bug was discovered while working on this, which involved
behaviour when we are not inferring ambiguous bases. See the added test
here, which is currently disabled as otherwise CI fails. This bug is not
newly introduced by this work.
jameshadfield added a commit that referenced this issue Dec 29, 2023
NOTE: This requires a corresponding change in TreeTime, included in PR
<neherlab/treetime#263>

Previously we were not calculating a mask for VCF files. This adds it
and applies it the VCF output.

See <#1382> and the parent
commit for more context.

Closes #1382 (together with the parent commit)
jameshadfield added a commit that referenced this issue Dec 30, 2023
Fixes a bug where positions which were all ambiguous (i.e. masked)
positions were being reported as mutations in the JSON output.

Furthermore, while the mask was being applied to sequence export to make
all masked positions "N", if we are inferring ambiguous sites then we
should use the reference base instead. This is now fixed.

Closes #1382 for the FASTA-input case only.

Unfortunately the added test doesn't actually recreate the first bug -
for this simple example TreeTime is not inferring a ATGC allele at the
all-Ns position. I can recreate this (and discovered it) using Cholera
data. I presume this is related to numerical consistency across the
(flat) likelihood values in this small test case.

Variable names are changed: `root_sequence` to `reference_sequence`.
This is much clearer (to my eyes at least).

NOTE: A further bug was discovered while working on this, which involved
behaviour when we are not inferring ambiguous bases. See the added test
here, which is currently disabled as otherwise CI fails. This bug is not
newly introduced by this work.
jameshadfield added a commit that referenced this issue Dec 30, 2023
NOTE: This requires a corresponding change in TreeTime, included in PR
<neherlab/treetime#263>

Previously we were not calculating a mask for VCF files. This adds it
and applies it the VCF output.

See <#1382> and the parent
commit for more context.

Closes #1382 (together with the parent commit)
jameshadfield added a commit that referenced this issue Jan 22, 2024
Fixes a bug where positions which were all ambiguous (i.e. masked)
positions were being reported as mutations in the JSON output.

Furthermore, while the mask was being applied to sequence export to make
all masked positions "N", if we are inferring ambiguous sites then we
should use the reference base instead. This is now fixed.

Closes #1382 for the FASTA-input case only.

Unfortunately the added test doesn't actually recreate the first bug -
for this simple example TreeTime is not inferring a ATGC allele at the
all-Ns position. I can recreate this (and discovered it) using Cholera
data. I presume this is related to numerical consistency across the
(flat) likelihood values in this small test case.

Variable names are changed: `root_sequence` to `reference_sequence`.
This is much clearer (to my eyes at least).

NOTE: A further bug was discovered while working on this, which involved
behaviour when we are not inferring ambiguous bases. See the added test
here, which is currently disabled as otherwise CI fails. This bug is not
newly introduced by this work.
jameshadfield added a commit that referenced this issue Jan 22, 2024
NOTE: This requires a corresponding change in TreeTime, included in PR
<neherlab/treetime#263>

Previously we were not calculating a mask for VCF files. This adds it
and applies it the VCF output.

See <#1382> and the parent
commit for more context.

Closes #1382 (together with the parent commit)
jameshadfield added a commit that referenced this issue Jan 22, 2024
Fixes a bug where positions which were all ambiguous (i.e. masked)
positions were being reported as mutations in the JSON output.

Furthermore, while the mask was being applied to sequence export to make
all masked positions "N", if we are inferring ambiguous sites then we
should use the reference base instead. This is now fixed.

Closes #1382 for the FASTA-input case only.

Unfortunately the added test doesn't actually recreate the first bug -
for this simple example TreeTime is not inferring a ATGC allele at the
all-Ns position. I can recreate this (and discovered it) using Cholera
data. I presume this is related to numerical consistency across the
(flat) likelihood values in this small test case.

Variable names are changed: `root_sequence` to `reference_sequence`.
This is much clearer (to my eyes at least).

NOTE: A further bug was discovered while working on this, which involved
behaviour when we are not inferring ambiguous bases. See the added test
here, which is currently disabled as otherwise CI fails. This bug is not
newly introduced by this work.
jameshadfield added a commit that referenced this issue Jan 22, 2024
NOTE: This requires a corresponding change in TreeTime, included in PR
<neherlab/treetime#263>

Previously we were not calculating a mask for VCF files. This adds it
and applies it the VCF output.

See <#1382> and the parent
commit for more context.

Closes #1382 (together with the parent commit)
jameshadfield added a commit that referenced this issue Jan 22, 2024
NOTE: This requires a corresponding change in TreeTime, included in PR
<neherlab/treetime#263>

Previously we were not calculating a mask for VCF files. This adds it
and applies it the VCF output.

See <#1382> and the parent
commit for more context.

Closes #1382 (together with the parent commit)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant