Improve VCF parsing #263

jameshadfield · 2023-12-08T03:47:51Z

This isn't complete yet, but if people wanted to take a look and see if they are happy with the direction this is going in that'd be great!

I still want to take a close look at how insertions are being encoded in the "sequences" output, and is treetime/augur is ok with that (i.e. data['sequences'][<sample>][<pos>] is more than one character). And ideally I'd sort out the writing of VCFs as well to both preserve the ploidy representation of alleles and also use the correct chromosome name!

The VCF 4.2/4.3 spec is precise about the format of meta-information / header-line / data-line syntax. We assert that the file contents matches what we expect and raise errors if not. This allows code simplifications. I've reordered the if/else block to match how the file appears, not the previous approach of placing the data lines first as they're the "most common so first in 'if-list'". This is much more readable to me!

The test failure is due to the failure to parse multiple ALT alleles and will be fixed in a subsequent commit.

Only addresses the case of a single homozygous nuc ALT allele. Specifically, the previous code would ignore all ALT alleles beyond a GT call of 1 (see previously failing test TestSimpleVcf::test_mutation_parsing) Comments added regarding other potential bugs, but I'll add tests / fixes in subsequent commits.

I was hoping the examples would be a little more exhaustive, however they do cover the basics. The examples in v4.3 are almost identical. The encoding of insertions, which may or may not be a bug, is detailed both in the code and in the tests.

Detailed explanation added in test `TestNoCallsOrMissing` Sections of code were removed which incorrectly considered the "." character could be part of a multi-character alternate allele, however the spec is clear that the "." allele can only exist as a single-character allele. We enforce that alleles conform to this upstream of the code, so it wasn't possible that `alt[pos+i]=='.'` Closes #261

See added tests

The VCF spec uses a genotype of '0' to mean that the reference base(s) is unchanged for this sample. The previous '.' means an uncalled allele, in v4.2 "a call cannot be made for a sample at a given locus" and in v4.3 "a call cannot be made for a sample at a given locus".

This is currently unused but a subsequent commit will allow `write_vcf` to use this information. As part of reading this information we now enforce: - only one chromosome can be defined in the vcf. This will provide a helpful error message in situations such as <#247> - genotype calls much consistently use the same ploidy There is a lot more checking we could do if we wish to parse the meta-lines

jameshadfield · 2023-12-12T04:04:54Z

@rneher could you suggest someone to review it (or look at it yourself?) 🙏

There are a number of bugfixes related to parsing haploid (or polyploid homozygous) VCFs and I'm now reasonably confident the parsing is correct. The encoding of insertions has the potential to double count the base before the insertion (see comments added in code here) but I don't know what treetime does with these; for the purposes of augur ancestral, which is what motivated this work, insertions are not considered.

I haven't touched the parsing of heterozygous genotypes.

I've fixed a number of bugs in the writing of VCFs, but I haven't thoroughly reviewed that code and the test VCF I've added here is pretty simple.

rneher · 2023-12-12T16:12:09Z

thanks, @jameshadfield! I'll have a look. This code hasn't really been looked at in a long time and was written for rather specific needs in our early MTb work, so really appreciate some attention!

treetime/vcf_utils.py

rneher · 2023-12-14T17:50:27Z

this all looks good to me. happy to merge it in and make a release.

jameshadfield · 2023-12-14T20:40:43Z

this all looks good to me. happy to merge it in and make a release.

Thanks! I'm just doing some final checks comparing augur ancestral & augur translate using FASTA vs VCF inputs on the Cholera dataset so let's hold off on this for a few days just in case there are any more little fixes I need to add.

The added test here fails under TreeTime 0.11.1 but has been fixed as part of <neherlab/treetime#263>. Description of the bug (as it relates to the test added in `vcf.t`): The SNPs at nt 33 are encoded in the VCF as: 1 33 . A C,G . . . GT 1 2 0 where ALT 1 ("C") is on Sample_A and ALT 2 ("G") is on Sample_B. The ALT 2 is not being parsed by `read_vcf`, which results in a changed mutation profile at pos 33: . **FASTA input** **VCF input** . |---G33C-- sample_A |---A33C-- sample_A . --A33G-| -------| . |--------- sample_B |--------- sample_B . Because of this bug, the following test fails. The `read_vcf` function is used in augur commands ancestral, refine, sequence-traits, translate and tree.

Switches the `translate-with-gff-and-locus-tag.t` test to using the same data as the corresponding `translate-with-gff-and-gene.t`, thus testing _just_ the change in GFF syntax. The replaced test used TB data which was problematic for a few reasons: - The VCF file wasn't correctly formatted, with a mixture of haploid and diploid genotypes. TreeTime's `read_vcf` will error on this after <neherlab/treetime#263> is merged. - The VCF encoded genotypes of '.' which were read as allele="N", however these were supposed to be reference bases (encoded as genotype="0"). If we update the VCF then the aa_muts.json are very different. This speaks to a bigger problem with test data such as this - there is no assurance that the output data is correct. For this reason I prefer the "simple-genome" tests for which we can validate every mutation.

This requires a TreeTime with <neherlab/treetime#263>. The benefit is that the produced VCF file will encode genotypes with the same ploidy as the input, as well as using the same chromosome name

victorlin · 2023-12-27T00:39:53Z

treetime/vcf_utils.py

I understand that PyVCF (or the more current PyVCF3) is not used to replace/supplement read_vcf for parsing efficiency reasons, but could it be useful for write_vcf?

Thanks partly to your suggestion to use cProfile, the VCF writing is now down to a handful of seconds 58ab78d 🎉

NOTE: This requires a corresponding change in TreeTime, included in PR <neherlab/treetime#263> Previously we were not calculating a mask for VCF files. This adds it and applies it the VCF output. See <#1382> and the parent commit for more context. Closes #1382 (together with the parent commit)

This is only used if provided, otherwise we fall back to the default ploidy of diploid (unchanged) and chromosome name of "1" (changed from the previously hardcoded "MTB_anc"). Note that the spec states that no-call genotypes should be encoded to reflect the ploidy (e.g. "./." for diploid and "." only for haploid) which is a change from previous behaviour. Closes #262

This is (partly) behind the bug described in <nextstrain/augur#1380>

This (optional) ability allows the node-data JSON and VCF output from `augur ancestral` to match.

This allows the VCF file to be parsed by bcftools. See <#263 (comment)>

These were previously printed with ALT="" (and all genotype calls indicating to use the reference allele).

The 2 main changes here are changing the const sites list to a set to improve lookup time and pre-computing the positions at each variable site in a more efficient manner. For my test VCF (100k rows, 100 samples) `write_vcf` went from 95 seconds to 0.7 seconds. For the full 800k rows VCF, `write_vcf` went from ~8 hours to 5.6 seconds (5000x speedup). Times & profiling using cProfile CLoses nextstrain/augur#1378

NOTE: This requires a corresponding change in TreeTime, included in PR <neherlab/treetime#263> Previously we were not calculating a mask for VCF files. This adds it and applies it the VCF output. See <#1382> and the parent commit for more context. Closes #1382 (together with the parent commit)

jameshadfield · 2023-12-30T00:45:32Z

@rneher I'm happy with the state of this now and don't plan to do any more work here. There were a few more commits added since you last saw it. Once this is merged and released I'll update nextstrain/augur#1355 to use the newest version.

The added test here fails under TreeTime 0.11.1 but has been fixed as part of <neherlab/treetime#263>. Description of the bug (as it relates to the test added in `vcf.t`): The SNPs at nt 33 are encoded in the VCF as: 1 33 . A C,G . . . GT 1 2 0 where ALT 1 ("C") is on Sample_A and ALT 2 ("G") is on Sample_B. The ALT 2 is not being parsed by `read_vcf`, which results in a changed mutation profile at pos 33: . **FASTA input** **VCF input** . |---G33C-- sample_A |---A33C-- sample_A . --A33G-| -------| . |--------- sample_B |--------- sample_B . Because of this bug, the following test fails. The `read_vcf` function is used in augur commands ancestral, refine, sequence-traits, translate and tree.

Switches the `translate-with-gff-and-locus-tag.t` test to using the same data as the corresponding `translate-with-gff-and-gene.t`, thus testing _just_ the change in GFF syntax. The replaced test used TB data which was problematic for a few reasons: - The VCF file wasn't correctly formatted, with a mixture of haploid and diploid genotypes. TreeTime's `read_vcf` will error on this after <neherlab/treetime#263> is merged. - The VCF encoded genotypes of '.' which were read as allele="N", however these were supposed to be reference bases (encoded as genotype="0"). If we update the VCF then the aa_muts.json are very different. This speaks to a bigger problem with test data such as this - there is no assurance that the output data is correct. For this reason I prefer the "simple-genome" tests for which we can validate every mutation.

This requires a TreeTime with <neherlab/treetime#263>. The benefit is that the produced VCF file will encode genotypes with the same ploidy as the input, as well as using the same chromosome name

NOTE: This requires a corresponding change in TreeTime, included in PR <neherlab/treetime#263> Previously we were not calculating a mask for VCF files. This adds it and applies it the VCF output. See <#1382> and the parent commit for more context. Closes #1382 (together with the parent commit)

jameshadfield added 9 commits December 12, 2023 12:10

[vcf] Add simple VCF allele parsing tests (failing)

6888678

The test failure is due to the failure to parse multiple ALT alleles and will be fixed in a subsequent commit.

[vcf] Fix missed SNP when followed by insertion

ecd340d

See added tests

[vcf] test roundtripping a vcf (fails)

2ae57e0

jameshadfield force-pushed the james/vcf branch from 37af131 to 99a56a1 Compare December 12, 2023 03:59

jameshadfield marked this pull request as ready for review December 12, 2023 04:10

jameshadfield mentioned this pull request Dec 12, 2023

Fix VCF parsing bugs nextstrain/augur#1355

Merged

jameshadfield commented Dec 12, 2023

View reviewed changes

treetime/vcf_utils.py Show resolved Hide resolved

This was referenced Dec 12, 2023

[vcf] Replace write_VCF_translation with treetime's write_vcf nextstrain/augur#1356

Open

[vcf] replace VCF reading with TreeTime's read_vcf nextstrain/augur#1357

Open

victorlin reviewed Dec 27, 2023

View reviewed changes

jameshadfield mentioned this pull request Dec 27, 2023

[ancestral] VCF and JSON outputs differ nextstrain/augur#1380

Closed

jameshadfield force-pushed the james/vcf branch from 77e5043 to 8322d54 Compare December 29, 2023 06:21

jameshadfield added 4 commits December 30, 2023 12:35

[get_tree_dict] restore argument behaviour

8a196bd

This is (partly) behind the bug described in <nextstrain/augur#1380>

[vcf] add mask to VCF output

a8e3850

This (optional) ability allows the node-data JSON and VCF output from `augur ancestral` to match.

[vcf] Add contig meta-line

4f1514b

This allows the VCF file to be parsed by bcftools. See <#263 (comment)>

jameshadfield added 2 commits December 30, 2023 12:39

[vcf] don't write sites without variation

2864301

These were previously printed with ALT="" (and all genotype calls indicating to use the reference allele).

jameshadfield force-pushed the james/vcf branch from 8322d54 to 58ab78d Compare December 30, 2023 00:34

rneher merged commit 5dbb1cb into master Jan 22, 2024
5 checks passed

rneher deleted the james/vcf branch January 22, 2024 08:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve VCF parsing #263

Improve VCF parsing #263

jameshadfield commented Dec 8, 2023

jameshadfield commented Dec 12, 2023

rneher commented Dec 12, 2023

rneher commented Dec 14, 2023

jameshadfield commented Dec 14, 2023

victorlin Dec 27, 2023

jameshadfield Dec 30, 2023

jameshadfield commented Dec 30, 2023

Improve VCF parsing #263

Improve VCF parsing #263

Conversation

jameshadfield commented Dec 8, 2023

jameshadfield commented Dec 12, 2023

rneher commented Dec 12, 2023

rneher commented Dec 14, 2023

jameshadfield commented Dec 14, 2023

victorlin Dec 27, 2023

Choose a reason for hiding this comment

jameshadfield Dec 30, 2023

Choose a reason for hiding this comment

jameshadfield commented Dec 30, 2023