-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix VCF parsing bugs #1355
Fix VCF parsing bugs #1355
Conversation
0d4b80e
to
81eac92
Compare
d1f3247
to
c91ca33
Compare
81eac92
to
02a865e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having next to zero experience with VCF, I can't speak to the reasoning behind changes. Aside from that, overall code changes look good, especially the revamped testing! Noted a few small things.
77b4ee5
to
dd4ac03
Compare
Exclude the Augur version from the diff, otherwise it'll fail on the next release. Some of these tests were added via this PR but many weren't, so it's easier to do it all in a single commit. <#1355 (comment)>
Thanks for the review @victorlin. When paired with neherlab/treetime#263 all tests are passing and running a full cholera analysis using VCFs or using FASTA now produces identical mutations across the tree, so I'm going to draw the line here. Things to do before merge:
|
The added test here fails under TreeTime 0.11.1 but has been fixed as part of <neherlab/treetime#263>. Description of the bug (as it relates to the test added in `vcf.t`): The SNPs at nt 33 are encoded in the VCF as: 1 33 . A C,G . . . GT 1 2 0 where ALT 1 ("C") is on Sample_A and ALT 2 ("G") is on Sample_B. The ALT 2 is not being parsed by `read_vcf`, which results in a changed mutation profile at pos 33: . **FASTA input** **VCF input** . |---G33C-- sample_A |---A33C-- sample_A . --A33G-| -------| . |--------- sample_B |--------- sample_B . Because of this bug, the following test fails. The `read_vcf` function is used in augur commands ancestral, refine, sequence-traits, translate and tree.
Switches the `translate-with-gff-and-locus-tag.t` test to using the same data as the corresponding `translate-with-gff-and-gene.t`, thus testing _just_ the change in GFF syntax. The replaced test used TB data which was problematic for a few reasons: - The VCF file wasn't correctly formatted, with a mixture of haploid and diploid genotypes. TreeTime's `read_vcf` will error on this after <neherlab/treetime#263> is merged. - The VCF encoded genotypes of '.' which were read as allele="N", however these were supposed to be reference bases (encoded as genotype="0"). If we update the VCF then the aa_muts.json are very different. This speaks to a bigger problem with test data such as this - there is no assurance that the output data is correct. For this reason I prefer the "simple-genome" tests for which we can validate every mutation.
This requires a TreeTime with <neherlab/treetime#263>. The benefit is that the produced VCF file will encode genotypes with the same ploidy as the input, as well as using the same chromosome name
The functionality is unchanged but this centralises the logic. A subsequent commit will re-use this "is a file a VCF" logic within `augur translate` so rather than pulling it into it's own function it's better to use the widely-used .io function.
If VCF input with requested alignment output then we require --vcf-reference-output. This was not the case in augur 23.1.1 and earlier, where we would automatically create a filename. This is in line with a general philosophy of "files only created when requested"
This test was modified to confirm my own understanding of the code which allowed the next set of commits (modifying `augur translate`). Rather than drop it, it's helpful to include it in the repo for future reference. This commit is related to <#1362>.
dd4ac03
to
0004826
Compare
Exclude the Augur version from the diff, otherwise it'll fail on the next release. Some of these tests were added via this PR but many weren't, so it's easier to do it all in a single commit. <#1355 (comment)>
For JSON inputs we were previously incorrectly exporting the root-sequence translations as the "reference". Instead, we now translate the provided reference (nuc) sequence. (There is some subtlety here because the provided nuc reference sequence may in fact be the root-sequence rather than an actual reference, but this is a problem with `augur ancestral`. See <#1362> for more details.) This allows us to compare the reference translation to the root-sequence translation and thus detail any AA mutations at the root node. A side-effect of this is that we now always export an array of mutations for each gene/CDS at the root node, although this may often be empty. This brings the behaviour of JSON inputs in-line with that of VCF inputs.
These currently fail because for VCF inputs we do not export the translated sequences at the root-node. (We do for JSON inputs.) The subsequent commit will update translate to do this.
This brings the behaviour in-line with that of JSON inputs. The failing tests of the previous commit now pass.
The (somewhat confusingly named, and perhaps little used?) `--root-sequence <FASTA>` is used to report mutations between the inferred tree root node and the supplied FASTA. The inferred sequence would always be upper-case, but the supplied reference wasn't converted so if it were lowercase we would get erroneous mutations such as 'a1A'. Closes #1376
See <#1380> for a full description of the bug this fixes (or see the added tests here). The changes here need to be paired with a corresponding commit in TreeTime (commit message '[get_tree_dict] restore argument behaviour').
Exclude the Augur version from the diff, otherwise it'll fail on the next release. Some of these tests were added via this PR but many weren't, so it's easier to do it all in a single commit. <#1355 (comment)>
0004826
to
59eb681
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #1355 +/- ##
==========================================
+ Coverage 66.29% 66.69% +0.40%
==========================================
Files 69 69
Lines 7260 7321 +61
Branches 1780 1797 +17
==========================================
+ Hits 4813 4883 +70
+ Misses 2179 2170 -9
Partials 268 268 ☔ View full report in Codecov by Sentry. |
There are no functional changes. This is in preparation for subsequent commits which will fix (separate) bugs associated with mask creation, mutation calling on the root node, and proper masking of sequences.
Fixes a bug where positions which were all ambiguous (i.e. masked) positions were being reported as mutations in the JSON output. Furthermore, while the mask was being applied to sequence export to make all masked positions "N", if we are inferring ambiguous sites then we should use the reference base instead. This is now fixed. Closes #1382 for the FASTA-input case only. Unfortunately the added test doesn't actually recreate the first bug - for this simple example TreeTime is not inferring a ATGC allele at the all-Ns position. I can recreate this (and discovered it) using Cholera data. I presume this is related to numerical consistency across the (flat) likelihood values in this small test case. Variable names are changed: `root_sequence` to `reference_sequence`. This is much clearer (to my eyes at least). NOTE: A further bug was discovered while working on this, which involved behaviour when we are not inferring ambiguous bases. See the added test here, which is currently disabled as otherwise CI fails. This bug is not newly introduced by this work.
NOTE: This requires a corresponding change in TreeTime, included in PR <neherlab/treetime#263> Previously we were not calculating a mask for VCF files. This adds it and applies it the VCF output. See <#1382> and the parent commit for more context. Closes #1382 (together with the parent commit)
Exclude the Augur version from the diff, otherwise it'll fail on the next release. Some of these tests were added via this PR but many weren't, so it's easier to do it all in a single commit. <#1355 (comment)>
59eb681
to
937ab0b
Compare
This is required for VCF related tests added in <#1355> Note that different parts of our CI can use different TreeTime versions according to their installation method. For instance, "pytest-cram" installs Augur & TreeTime via pip, thus getting 0.11.2 at the time of writing. "pathogen-repo-ci" first installs augur via conda, and thus gets 0.11.1.
The test added here fails on the version of TreeTime augur uses. The tests pass under TreeTime neherlab/treetime#263.