Faster VCFs loading, support HTTP, and refactored variant metadata #94
timodonnell merged 7 commits into master
Add support for loading VCFs over any protocol supported by `urlopen` (HTTP, HTTPS, FTP). Closes #91. Also adds a test for loading gzip'd VCFs from files and over HTTP.
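A minimal sketch of the idea in this commit, assuming the whole file fits in memory. `read_vcf_text` is a hypothetical helper written for illustration, not the function in vcf.py. It fetches a VCF from any URL that `urlopen` understands and decompresses it if the payload is gzip'd.

```python
import gzip
from urllib.request import urlopen


def read_vcf_text(url):
    """Hypothetical helper: return the text of a VCF at `url`.

    Works for any scheme urlopen supports (http, https, ftp, file).
    """
    with urlopen(url) as response:
        raw = response.read()
    # gzip streams start with the magic bytes 0x1f 0x8b
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")


# Usage (placeholder URL, not a real test fixture):
# text = read_vcf_text("https://example.org/sample.vcf.gz")
```

Reading everything at once keeps the sketch short; a file the size of dbsnp would need a streaming approach instead.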
…adata

* Speed improvements to `load_vcf`, using a new pandas-based implementation and other tweaks. This is 2X faster if you need the variant info column, and 9X faster if you don't.
* Add support for loading VCFs over any protocol supported by the `requests` package (HTTP, HTTPS, FTP). Closes #91.
* Refactoring: got rid of the `variant.info` field and moved it to `VariantCollection.metadata`, which is a dict from variants to extra metadata. I think this simplifies things, especially when working with variants coming from different sources, where the variant may be the same but the metadata may be different.

Optimizations

* Specify a default sort key in `VariantCollection.__init__`, to avoid making pairwise variant comparisons in `sorted()` in `Collection.__init__`.
* Optimize memoize function for the common case (cache hit).
* Avoid unnecessary list creation in `Collection.__init__`.
* Faster implementation of `trim_shared_suffix`.
* Use slots for `Variant` instances.
* Optimize the common case of a simple SNV in `Variant.__init__`.
* Optimize `Variant.__hash__` and `__eq__`.

Other changes

* Add test/benchmark_vcf_load.py for benchmarking `load_vcf` performance.
* Add `allow_extended_nucleotides` parameter to `load_vcf`. This is necessary to load the dbsnp VCF.
* Change multiallelic.vcf test case to use tabs (as per the VCF standard), not spaces.

Timing details. Time to load first 100K variants in dbsnp VCF (median of 3 runs):

* master varcode + pyensembl: 10.8 sec
* varcode tweaks: 8.1 sec
* varcode tweaks + pandas-based loader: 5.4 sec
* varcode tweaks + pandas + skip parsing info: 1.5 sec
* varcode tweaks + pandas + skip parsing info + pyensembl tweaks: 1.2 sec
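To illustrate the metadata refactoring, here is a toy sketch. The `Variant` namedtuple and the simplified `VariantCollection` below are stand-ins invented for this example, and the `source` and `DP` keys are made up; the real varcode classes carry more fields and behavior. The point is only that metadata lives on the collection, keyed by variant, so the same variant can carry different metadata depending on where it came from.

```python
from collections import namedtuple

# Stand-in for varcode's Variant (assumption: the real class has more fields).
Variant = namedtuple("Variant", ["contig", "start", "ref", "alt"])


class VariantCollection:
    """Toy collection: a sorted list of variants plus a dict from variant to metadata."""

    def __init__(self, variants, metadata=None):
        self.variants = sorted(set(variants))
        self.metadata = dict(metadata or {})


v = Variant("1", 1000, "A", "T")

# The same variant, loaded from two callers, with caller-specific metadata.
from_mutect = VariantCollection([v], metadata={v: {"source": "mutect", "DP": 30}})
from_strelka = VariantCollection([v], metadata={v: {"source": "strelka", "DP": 42}})

# The variants compare equal; only the per-collection metadata differs.
same_variants = from_mutect.variants == from_strelka.variants
```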
|
Small initial comment: can we make the fast path for parsing optional and PyVCF the default? |
|
Alternatively: can we add more unit tests of VCFs from the wild (with, for example, bad tabs vs. spaces mixups)? |
|
Yeah, making the pyvcf version the default sounds good to me. We can experiment with the faster version for a while and if it seems to work reliably can switch it later. I'll push an update with that change. |
* `load_vcf` now uses the traditional pyvcf implementation. `load_vcf_fast` is the pandas implementation.
* Split out `load_vcf_fast` into helper functions that load data into pandas and then use the pandas dataframe to create a VariantCollection.
* Better docs
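The two-stage split described above might look roughly like the following. This is a sketch under stated assumptions, not the code in vcf.py. `vcf_to_dataframe` and `dataframe_to_variants` are hypothetical names, the input is assumed to have exactly the eight fixed VCF columns (no sample columns), and plain tuples stand in for `Variant` objects. Treating a FILTER of `.` as passing is also an assumption.

```python
import io

import pandas as pd

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def vcf_to_dataframe(handle, include_info=True):
    """Stage 1: read the fixed VCF columns into a pandas DataFrame."""
    usecols = VCF_COLUMNS if include_info else VCF_COLUMNS[:-1]
    return pd.read_csv(
        handle,
        sep="\t",
        comment="#",  # skips meta and header lines; assumes no '#' inside data fields
        header=None,
        names=VCF_COLUMNS,
        usecols=usecols,  # skipping INFO avoids loading the widest column
        dtype={"CHROM": str},  # keep contig names like "1" as strings
    )


def dataframe_to_variants(df, only_passing=True):
    """Stage 2: emit one (contig, pos, ref, alt) tuple per alternate allele."""
    if only_passing:
        df = df[df["FILTER"].isin(["PASS", "."])]
    variants = []
    for chrom, pos, ref, alts in zip(df["CHROM"], df["POS"], df["REF"], df["ALT"]):
        for alt in alts.split(","):
            variants.append((chrom, int(pos), ref, alt))
    return variants


sample = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tA\tT,G\t50\tPASS\tDP=10\n"
    "1\t200\trs1\tC\tG\t10\tq10\tDP=3\n"
)
variants = dataframe_to_variants(vcf_to_dataframe(io.StringIO(sample), include_info=False))
```

Keeping the stages separate means the DataFrame step can be tested and benchmarked on its own, which fits the plan of running the fast path alongside the pyvcf one for a while.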
|
Ok, pushed an update with this change, as well as some hopefully clearer code in vcf.py |
With this change, we now delay parsing the INFO field until after we have tested whether the record is PASS.
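A sketch of what delaying INFO parsing means in practice, using a plain-Python line parser rather than the pyvcf or pandas code in this PR. `parse_info` and `passing_records` are hypothetical helpers, and treating a FILTER of `.` as passing is an assumption. The FILTER check runs first, so rejected records never pay for INFO parsing.

```python
def parse_info(info_field):
    """Parse an INFO string such as 'DP=10;DB' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if entry in ("", "."):
            continue
        key, sep, value = entry.partition("=")
        # Flag entries (no '=') are stored as True.
        info[key] = value if sep else True
    return info


def passing_records(lines, only_passing=True):
    """Yield (contig, pos, ref, alt, info) for each record that passes FILTER."""
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, filt, info = fields[:8]
        if only_passing and filt not in ("PASS", "."):
            continue  # rejected before any INFO parsing happens
        yield chrom, int(pos), ref, alt, parse_info(info)


lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tT\t50\tPASS\tDP=10;DB\n",
    "1\t200\t.\tC\tG\t10\tq10\tDP=3\n",
]
records = list(passing_records(lines))
```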
Fix a bug in VCF URL parsing where absolute paths starting with two slashes '//foo/file.vcf' would get misparsed. Also add VCFs from strelka and mutect to test the equivalence of the two VCF parsing routines.
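For context on the double-slash bug: `urllib.parse.urlparse` treats a leading `//` as the start of a network location, so `'//foo/file.vcf'` comes back with `netloc='foo'` and `path='/file.vcf'`. Code that keeps only the parsed path therefore drops the `foo` component. One way to avoid this, shown below as an assumption rather than the actual fix in vcf.py, is to trust the parsed result only when a known scheme is present. `split_vcf_location` is a hypothetical helper.

```python
from urllib.parse import urlparse

REMOTE_SCHEMES = ("http", "https", "ftp")


def split_vcf_location(location):
    """Classify a VCF location as ('remote', url) or ('local', path)."""
    parsed = urlparse(location)
    if parsed.scheme in REMOTE_SCHEMES:
        return ("remote", location)
    if parsed.scheme == "file":
        return ("local", parsed.path)
    # No recognized scheme: treat the original string as a filesystem path,
    # so '//foo/file.vcf' is not split into netloc 'foo' and path '/file.vcf'.
    return ("local", location)


naive = urlparse("//foo/file.vcf")
print(naive.netloc, naive.path)  # shows how the path loses its first component
print(split_vcf_location("//foo/file.vcf"))
```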
|
Simplified the code a bit and added VCFs from strelka and mutect as part of the unit test that the fast and slow VCF parsing routines are equivalent. This is ready for another pass when you have a chance, @iskandr . |
|
Also minor note: while |
This looks like a maintenance nightmare. Does it substantially improve performance, or were you running into memory usage problems?
Yeah, it's for memory. It makes it possible to load in dbsnp on my machine. Whether that is a use case we care about I suppose might be debatable. Since the errors that are raised if this gets out of sync with the Variant attributes will be very obvious, I think it's worth keeping for now, despite the maintenance hassle.
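A small illustration of the trade-off being discussed, using a toy class with made-up field names rather than the real `Variant`. Slotted instances have no per-instance `__dict__`, which is where the memory saving comes from, and assigning an attribute missing from `__slots__` fails immediately with `AttributeError`.

```python
class Variant:
    """Toy stand-in for varcode's Variant; field names are illustrative."""

    __slots__ = ("contig", "start", "ref", "alt")

    def __init__(self, contig, start, ref, alt):
        self.contig = contig
        self.start = start
        self.ref = ref
        self.alt = alt


v = Variant("1", 1000, "A", "T")

# No per-instance dictionary is allocated for slotted classes.
print(hasattr(v, "__dict__"))

# An attribute that was never added to __slots__ fails on first assignment,
# which is the "very obvious" error mentioned above.
try:
    v.gene_name = "TP53"
except AttributeError as error:
    print("AttributeError:", error)
```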
|
LGTM other than the question about slots and single vs double quotes in docstrings |
|
Updated to address the single v double quotes comment |
There is a small related pyensembl PR, which I'll push shortly.