Faster VCFs loading, support HTTP, and refactored variant metadata by timodonnell · Pull Request #94 · openvax/varcode

timodonnell · 2015-06-22T16:30:47Z

There is a small related pyensembl PR, which I'll push shortly.

Speed improvements to load_vcf, using a new pandas-based implementation and other tweaks. This is 2X faster if you need the variant info column, and 9X faster if you don't.
Add support for loading VCFs over any protocol supported by the requests package (HTTP, HTTPS, FTP). Closes #91 (support loading VCFs over HTTP).
Refactoring: got rid of the variant.info field and moved it to VariantCollection.metadata, which is a dict from variants to extra metadata. I think this simplifies things, especially when working with variants coming from different sources, where the variant may be the same but the metadata may be different.
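
To illustrate the metadata refactoring, here is a minimal sketch of a collection that keeps per-variant metadata in a dict keyed by the variant itself. `SimpleVariant` and `SimpleCollection` are hypothetical stand-ins invented for this example, not varcode's actual classes.

```python
from collections import namedtuple

# Hypothetical stand-in for a variant: hashable and compared by value,
# so the same variant from two sources is the same dict key.
SimpleVariant = namedtuple("SimpleVariant", ["contig", "start", "ref", "alt"])


class SimpleCollection:
    """Toy collection: variants plus a dict from variant to extra metadata."""

    def __init__(self, variants, metadata=None):
        self.variants = sorted(set(variants))
        self.metadata = dict(metadata or {})


v = SimpleVariant("1", 1000, "A", "T")
same_v = SimpleVariant("1", 1000, "A", "T")

# The same variant reported by two sources with different metadata.
from_caller_a = SimpleCollection([v], {v: {"DP": 30}})
from_caller_b = SimpleCollection([same_v], {same_v: {"DP": 12}})
```

Because the metadata lives on the collection, the two collections can share equal variant objects while still disagreeing about depth or any other source-specific field.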

Optimizations

Specify a default sort key in VariantCollection.__init__, to avoid making pairwise variant comparisons in sorted() in Collection.__init__.
Optimize memoize function for common case (cache hit).
Avoid unnecessary list creation in Collection.__init__.
Faster implementation of trim_shared_suffix.
Use slots for Variant instances.
Optimize common case of a simple SNV in Variant.__init__.
Optimize Variant.__hash__ and __eq__.

Other changes

Add test/benchmark_vcf_load.py for benchmarking load_vcf performance.
Add allow_extended_nucleotides parameter to load_vcf. This is necessary to load the dbsnp VCF.
Change multiallelic.vcf test case to use tabs (as per VCF standard) not spaces.

Timing details. Time to load first 100K variants in dbsnp VCF (median of 3 runs):

master varcode + pyensembl: 10.8 sec
varcode tweaks: 8.1 sec
varcode tweaks + pandas-based loader: 5.4 sec
varcode tweaks + pandas + skip parsing info: 1.5 sec
varcode tweaks + pandas + skip parsing info + pyensembl tweaks: 1.2 sec
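
The 2X and 9X figures quoted in the description follow from these timings. The sketch below recomputes the speedups relative to master and includes a small median-of-runs timer; `median_runtime` is a hypothetical helper written for this example and is not taken from test/benchmark_vcf_load.py.

```python
import statistics
import time


def median_runtime(fn, runs=3):
    """Return the median wall-clock time of calling fn() `runs` times."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Timings in seconds, copied from the list above.
timings_sec = {
    "master varcode + pyensembl": 10.8,
    "varcode tweaks": 8.1,
    "varcode tweaks + pandas-based loader": 5.4,
    "varcode tweaks + pandas + skip parsing info": 1.5,
    "varcode tweaks + pandas + skip parsing info + pyensembl tweaks": 1.2,
}

baseline = timings_sec["master varcode + pyensembl"]
speedups = {name: round(baseline / t, 1) for name, t in timings_sec.items()}
```

With the info column parsed, the speedup is 10.8 / 5.4 = 2X; with info parsing skipped and the pyensembl tweaks applied, it is 10.8 / 1.2 = 9X.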

Add support for loading VCFs over any protocol supported by `urlopen` (HTTP, HTTPS, FTP). Closes #91. Also adds a test for loading gzip'd VCFs from files and over HTTP.
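
For readers unfamiliar with the approach, here is a minimal sketch of reading a VCF from either a local path or a URL with `urlopen`, handling gzip'd content in both cases. `read_vcf_text` and `REMOTE_SCHEMES` are hypothetical names for this example, and the code in the PR may be structured differently.

```python
import gzip
from urllib.parse import urlparse
from urllib.request import urlopen

# Assumption for this sketch: only these schemes are treated as remote.
REMOTE_SCHEMES = {"http", "https", "ftp"}


def read_vcf_text(path_or_url):
    """Return the decoded text of a VCF, local or remote, plain or gzip'd."""
    parsed = urlparse(path_or_url)
    if parsed.scheme in REMOTE_SCHEMES:
        with urlopen(path_or_url) as response:
            raw = response.read()
    else:
        with open(path_or_url, "rb") as handle:
            raw = handle.read()
    # gzip streams start with the magic bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")
```

This sketch reads the whole file into memory for simplicity; a loader meant for files the size of dbsnp would more likely stream the data.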

iskandr · 2015-06-22T19:26:31Z

Small initial comment: can we make the fast path for parsing optional and PyVCF the default?

iskandr · 2015-06-22T19:31:12Z

Alternatively: can we add more unit tests of VCFs from the wild (with, for example, bad tabs vs. spaces mixups)?

timodonnell · 2015-06-23T15:12:30Z

Yeah, making the pyvcf version the default sounds good to me. We can experiment with the faster version for a while and if it seems to work reliably can switch it later. I'll push an update with that change.

* `load_vcf` now uses the traditional pyvcf implementation. `load_vcf_fast` is the pandas implementation.
* Split out `load_vcf_fast` into helper functions that load data into pandas and then use the pandas dataframe to create a VariantCollection.
* better docs

timodonnell · 2015-06-24T20:06:44Z

Ok, pushed an update with this change, as well as some hopefully clearer code in vcf.py

With this change, we now delay parsing the INFO field until after we have tested whether the record is PASS.
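
The idea can be sketched in plain Python: check the FILTER column first and only parse INFO for records that survive. The function names here are hypothetical, and treating a missing filter (".") as passing is an assumption of this example rather than something stated in the PR.

```python
def parse_info(info_field):
    """Parse a VCF INFO string such as 'DP=9;SOMATIC' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        # Flag entries (no '=') are stored as True.
        info[key] = value if sep else True
    return info


def passing_records(lines, only_passing=True):
    """Yield (chrom, pos, ref, alt, info) tuples, skipping filtered records."""
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, filt, info = fields[:8]
        if only_passing and filt not in ("PASS", "."):
            continue  # rejected before any INFO parsing happens
        yield chrom, int(pos), ref, alt, parse_info(info)


example_lines = [
    "##fileformat=VCFv4.1\n",
    "1\t100\t.\tA\tT\t.\tREJECT\tDP=5;SOMATIC\n",
    "1\t200\t.\tG\tC\t.\tPASS\tDP=9;SOMATIC\n",
]
records = list(passing_records(example_lines))
```

When most records are rejected, almost all of the INFO parsing work is skipped.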

Fix a bug in VCF URL parsing where absolute paths starting with two slashes '//foo/file.vcf' would get misparsed. Also add VCFs from strelka and mutect to test the equivalence of the two VCF parsing routines.
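
The commit message does not show the fix itself, but the underlying pitfall is easy to reproduce with the standard library: `urlparse` treats a leading `//` as the start of a network location, so the first path component ends up in `netloc`. The sketch below shows the misparse and one possible guard; `split_location` is a hypothetical helper, not the code from this PR.

```python
from urllib.parse import urlparse

# Assumption for this sketch: only these schemes are treated as remote.
REMOTE_SCHEMES = {"http", "https", "ftp"}


def split_location(path_or_url):
    """Return ("remote", url) or ("local", path), leaving local paths untouched."""
    parsed = urlparse(path_or_url)
    if parsed.scheme in REMOTE_SCHEMES:
        return "remote", path_or_url
    # No recognized scheme: use the original string, not parsed.path,
    # so '//foo/file.vcf' is not truncated to '/file.vcf'.
    return "local", path_or_url


misparsed = urlparse("//foo/file.vcf")
# misparsed.netloc is "foo" and misparsed.path is "/file.vcf"
```

Relying on `parsed.path` for local files would silently drop the `foo` directory, which is the kind of misparse described above.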

timodonnell · 2015-06-25T16:50:16Z

Simplified the code a bit and added VCFs from strelka and mutect as part of the unit test that the fast and slow VCF parsing routines are equivalent. This is ready for another pass when you have a chance, @iskandr.

timodonnell · 2015-06-25T16:56:40Z

Also minor note: while load_vcf_fast is only 2-4X faster than load_vcf on typical data, on files like Mutect output, where the vast majority of variants are REJECT, this is many orders of magnitude faster.

iskandr · 2015-07-01T16:16:47Z

This looks like a maintenance nightmare. Does it substantially improve performance, or were you running into memory usage problems?

timodonnell · 2015-07-06

Yeah, it's for memory. It makes it possible to load in dbsnp on my machine. Whether that is a use case we care about I suppose might be debatable. Since the errors that are raised if this gets out of sync with the Variant attributes will be very obvious, I think it's worth keeping for now, despite the maintenance hassle.
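
For context on the trade-off being discussed, here is a minimal sketch of what slots change. `SlottedVariant` is a hypothetical class for illustration and its attribute list is an assumption, not the actual Variant definition.

```python
class SlottedVariant:
    # Instances get fixed storage for these names and no per-instance __dict__,
    # which is where the memory saving comes from.
    __slots__ = ("contig", "start", "ref", "alt")

    def __init__(self, contig, start, ref, alt):
        self.contig = contig
        self.start = start
        self.ref = ref
        self.alt = alt


v = SlottedVariant("1", 1000, "A", "T")

out_of_sync_error = None
try:
    # An attribute missing from __slots__ fails loudly at assignment time.
    v.gene_name = "TP53"
except AttributeError as error:
    out_of_sync_error = error
```

The maintenance cost is that the slots tuple has to be updated whenever an attribute is added, but forgetting to do so raises an AttributeError immediately rather than failing quietly.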

iskandr · 2015-07-01T16:21:06Z

LGTM other than the question about slots and single vs double quotes in docstrings

timodonnell · 2015-07-06T16:56:19Z

Updated to address the single v double quotes comment

timodonnell added 2 commits June 18, 2015 12:05

Support loading VCFs from URL

85f232c

timodonnell assigned iskandr Jun 22, 2015

timodonnell added 2 commits June 25, 2015 11:46

Optimization for vcf loading when most variants are not passing

05c1628

Bugfix in load_vcf URL parsing, add tests

302ef49

comment tweak

0fad9cd

timodonnell mentioned this pull request Jun 26, 2015

Is a VariantCollection single sample or multiple samples? #96

Closed

iskandr reviewed Jul 1, 2015

style tweak: change single quotes to double quotes in docstrings in v…

1ec26d4

…cf.py

timodonnell added a commit that referenced this pull request Jul 6, 2015

Merge pull request #94 from hammerlab/faster-vcf-parsing

2b6bb6b

Faster VCFs loading, support HTTP, and refactored variant metadata

timodonnell merged commit 2b6bb6b into master Jul 6, 2015

timodonnell deleted the faster-vcf-parsing branch July 6, 2015 16:56

timodonnell merged 7 commits into master from faster-vcf-parsing
