-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve VCF parsing #263
Improve VCF parsing #263
Conversation
The VCF 4.2/4.3 spec is precise about the format of meta-information / header-line / data-line syntax. We assert that the file contents matches what we expect and raise errors if not. This allows code simplifications. I've reordered the if/else block to match how the file appears, not the previous approach of placing the data lines first as they're the "most common so first in 'if-list'". This is much more readable to me!
The test failure is due to the failure to parse multiple ALT alleles and will be fixed in a subsequent commit.
Only addresses the case of a single homozygous nuc ALT allele. Specifically, the previous code would ignore all ALT alleles beyond a GT call of 1 (see previously failing test TestSimpleVcf::test_mutation_parsing) Comments added regarding other potential bugs, but I'll add tests / fixes in subsequent commits.
I was hoping the examples would be a little more exhaustive, however they do cover the basics. The examples in v4.3 are almost identical. The encoding of insertions, which may or may not be a bug, is detailed both in the code and in the tests.
Detailed explanation added in test `TestNoCallsOrMissing` Sections of code were removed which incorrectly considered the "." character could be part of a multi-character alternate allele, however the spec is clear that the "." allele can only exist as a single-character allele. We enforce that alleles conform to this upstream of the code, so it wasn't possible that `alt[pos+i]=='.'` Closes #261
See added tests
The VCF spec uses a genotype of '0' to mean that the reference base(s) is unchanged for this sample. The previous '.' means an uncalled allele, in v4.2 "a call cannot be made for a sample at a given locus" and in v4.3 "a call cannot be made for a sample at a given locus".
This is currently unused but a subsequent commit will allow `write_vcf` to use this information. As part of reading this information we now enforce: - only one chromosome can be defined in the vcf. This will provide a helpful error message in situations such as <#247> - genotype calls much consistently use the same ploidy There is a lot more checking we could do if we wish to parse the meta-lines
37af131
to
99a56a1
Compare
@rneher could you suggest someone to review it (or look at it yourself?) 🙏 There are a number of bugfixes related to parsing haploid (or polyploid homozygous) VCFs and I'm now reasonably confident the parsing is correct. The encoding of insertions has the potential to double count the base before the insertion (see comments added in code here) but I don't know what treetime does with these; for the purposes of I haven't touched the parsing of heterozygous genotypes. I've fixed a number of bugs in the writing of VCFs, but I haven't thoroughly reviewed that code and the test VCF I've added here is pretty simple. |
thanks, @jameshadfield! I'll have a look. This code hasn't really been looked at in a long time and was written for rather specific needs in our early MTb work, so really appreciate some attention! |
this all looks good to me. happy to merge it in and make a release. |
Thanks! I'm just doing some final checks comparing |
The added test here fails under TreeTime 0.11.1 but has been fixed as part of <neherlab/treetime#263>. Description of the bug (as it relates to the test added in `vcf.t`): The SNPs at nt 33 are encoded in the VCF as: 1 33 . A C,G . . . GT 1 2 0 where ALT 1 ("C") is on Sample_A and ALT 2 ("G") is on Sample_B. The ALT 2 is not being parsed by `read_vcf`, which results in a changed mutation profile at pos 33: . **FASTA input** **VCF input** . |---G33C-- sample_A |---A33C-- sample_A . --A33G-| -------| . |--------- sample_B |--------- sample_B . Because of this bug, the following test fails. The `read_vcf` function is used in augur commands ancestral, refine, sequence-traits, translate and tree.
Switches the `translate-with-gff-and-locus-tag.t` test to using the same data as the corresponding `translate-with-gff-and-gene.t`, thus testing _just_ the change in GFF syntax. The replaced test used TB data which was problematic for a few reasons: - The VCF file wasn't correctly formatted, with a mixture of haploid and diploid genotypes. TreeTime's `read_vcf` will error on this after <neherlab/treetime#263> is merged. - The VCF encoded genotypes of '.' which were read as allele="N", however these were supposed to be reference bases (encoded as genotype="0"). If we update the VCF then the aa_muts.json are very different. This speaks to a bigger problem with test data such as this - there is no assurance that the output data is correct. For this reason I prefer the "simple-genome" tests for which we can validate every mutation.
This requires a TreeTime with <neherlab/treetime#263>. The benefit is that the produced VCF file will encode genotypes with the same ploidy as the input, as well as using the same chromosome name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that PyVCF (or the more current PyVCF3) is not used to replace/supplement read_vcf
for parsing efficiency reasons, but could it be useful for write_vcf
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks partly to your suggestion to use cProfile, the VCF writing is now down to a handful of seconds 58ab78d 🎉
NOTE: This requires a corresponding change in TreeTime, included in PR <neherlab/treetime#263> Previously we were not calculating a mask for VCF files. This adds it and applies it the VCF output. See <#1382> and the parent commit for more context. Closes #1382 (together with the parent commit)
77e5043
to
8322d54
Compare
This is only used if provided, otherwise we fall back to the default ploidy of diploid (unchanged) and chromosome name of "1" (changed from the previously hardcoded "MTB_anc"). Note that the spec states that no-call genotypes should be encoded to reflect the ploidy (e.g. "./." for diploid and "." only for haploid) which is a change from previous behaviour. Closes #262
This is (partly) behind the bug described in <nextstrain/augur#1380>
This (optional) ability allows the node-data JSON and VCF output from `augur ancestral` to match.
This allows the VCF file to be parsed by bcftools. See <#263 (comment)>
These were previously printed with ALT="" (and all genotype calls indicating to use the reference allele).
The 2 main changes here are changing the const sites list to a set to improve lookup time and pre-computing the positions at each variable site in a more efficient manner. For my test VCF (100k rows, 100 samples) `write_vcf` went from 95 seconds to 0.7 seconds. For the full 800k rows VCF, `write_vcf` went from ~8 hours to 5.6 seconds (5000x speedup). Times & profiling using cProfile CLoses nextstrain/augur#1378
8322d54
to
58ab78d
Compare
NOTE: This requires a corresponding change in TreeTime, included in PR <neherlab/treetime#263> Previously we were not calculating a mask for VCF files. This adds it and applies it the VCF output. See <#1382> and the parent commit for more context. Closes #1382 (together with the parent commit)
@rneher I'm happy with the state of this now and don't plan to do any more work here. There were a few more commits added since you last saw it. Once this is merged and released I'll update nextstrain/augur#1355 to use the newest version. |
The added test here fails under TreeTime 0.11.1 but has been fixed as part of <neherlab/treetime#263>. Description of the bug (as it relates to the test added in `vcf.t`): The SNPs at nt 33 are encoded in the VCF as: 1 33 . A C,G . . . GT 1 2 0 where ALT 1 ("C") is on Sample_A and ALT 2 ("G") is on Sample_B. The ALT 2 is not being parsed by `read_vcf`, which results in a changed mutation profile at pos 33: . **FASTA input** **VCF input** . |---G33C-- sample_A |---A33C-- sample_A . --A33G-| -------| . |--------- sample_B |--------- sample_B . Because of this bug, the following test fails. The `read_vcf` function is used in augur commands ancestral, refine, sequence-traits, translate and tree.
Switches the `translate-with-gff-and-locus-tag.t` test to using the same data as the corresponding `translate-with-gff-and-gene.t`, thus testing _just_ the change in GFF syntax. The replaced test used TB data which was problematic for a few reasons: - The VCF file wasn't correctly formatted, with a mixture of haploid and diploid genotypes. TreeTime's `read_vcf` will error on this after <neherlab/treetime#263> is merged. - The VCF encoded genotypes of '.' which were read as allele="N", however these were supposed to be reference bases (encoded as genotype="0"). If we update the VCF then the aa_muts.json are very different. This speaks to a bigger problem with test data such as this - there is no assurance that the output data is correct. For this reason I prefer the "simple-genome" tests for which we can validate every mutation.
This requires a TreeTime with <neherlab/treetime#263>. The benefit is that the produced VCF file will encode genotypes with the same ploidy as the input, as well as using the same chromosome name
NOTE: This requires a corresponding change in TreeTime, included in PR <neherlab/treetime#263> Previously we were not calculating a mask for VCF files. This adds it and applies it the VCF output. See <#1382> and the parent commit for more context. Closes #1382 (together with the parent commit)
NOTE: This requires a corresponding change in TreeTime, included in PR <neherlab/treetime#263> Previously we were not calculating a mask for VCF files. This adds it and applies it the VCF output. See <#1382> and the parent commit for more context. Closes #1382 (together with the parent commit)
NOTE: This requires a corresponding change in TreeTime, included in PR <neherlab/treetime#263> Previously we were not calculating a mask for VCF files. This adds it and applies it the VCF output. See <#1382> and the parent commit for more context. Closes #1382 (together with the parent commit)
This isn't complete yet, but if people wanted to take a look and see if they are happy with the direction this is going in that'd be great!
I still want to take a close look at how insertions are being encoded in the "sequences" output, and is treetime/augur is ok with that (i.e.
data['sequences'][<sample>][<pos>]
is more than one character). And ideally I'd sort out the writing of VCFs as well to both preserve the ploidy representation of alleles and also use the correct chromosome name!