
Faster VCFs loading, support HTTP, and refactored variant metadata #94

Merged
timodonnell merged 7 commits into master from faster-vcf-parsing
Jul 6, 2015

@timodonnell
Contributor

There is a small related pyensembl PR, which I'll push shortly.

  • Speed improvements to load_vcf, using a new pandas-based implementation and other tweaks. This is 2X faster if you need the variant info column, and 9X faster if you don't.
  • Add support for loading VCFs over any protocol supported by the requests package (HTTP, HTTPS, FTP). Closes #91 (support loading VCFs over HTTP).
  • Refactoring: got rid of the variant.info field and moved it to VariantCollection.metadata, which is a dict from variants to extra metadata. I think this simplifies things, especially when working with variants coming from different sources, where the variant may be the same but the metadata may be different.
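
To make the metadata refactoring concrete, here is a minimal sketch of the idea. It is not varcode's actual code: `Variant` and `VariantCollection` below are simplified stand-ins, and the metadata keys and values are made up for illustration.

```python
from collections import namedtuple

# Simplified stand-in for varcode's Variant: identity is only locus + alleles.
Variant = namedtuple("Variant", ["contig", "start", "ref", "alt"])


class VariantCollection(object):
    """Hypothetical, minimal collection: variants plus per-variant metadata."""

    def __init__(self, variants, metadata=None):
        self.variants = sorted(set(variants))
        # dict from variant -> dict of extra fields (e.g. parsed VCF INFO)
        self.metadata = dict(metadata or {})


v = Variant("1", 1000, "A", "T")

# The same variant seen in two sources, each carrying different metadata.
# The metadata contents here are invented placeholders.
from_dbsnp = VariantCollection([v], {v: {"source": "dbsnp"}})
from_caller = VariantCollection([v], {v: {"source": "caller", "filter": "PASS"}})

same_variant = from_dbsnp.variants[0] == from_caller.variants[0]
different_metadata = from_dbsnp.metadata[v] != from_caller.metadata[v]
```

Because the metadata lives on the collection rather than on the variant, the two variants compare equal even though their sources disagree about the extra fields.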

Optimizations

  • Specify a default sort key in VariantCollection.__init__, to avoid making variant pairwise comparisons in sorted() in Collection.__init__.
  • Optimize memoize function for common case (cache hit).
  • Avoid unnecessary list creation in Collection.__init__
  • Faster implementation of trim_shared_suffix
  • Use __slots__ for Variant instances.
  • Optimize common case of a simple SNV in Variant.__init__
  • Optimize Variant.__hash__, __eq__
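
Two of the optimizations above can be sketched in a few lines. This is an illustration of the general techniques, not the code in this PR: the `memoize` decorator and the `normalize_contig` function are hypothetical, and the records are plain tuples rather than `Variant` objects.

```python
def memoize(fn):
    """Cache results of fn, keyed on its (hashable) positional arguments."""
    cache = {}

    def wrapped(*args):
        try:
            # Cache hit (the common case): a single dict lookup,
            # with no separate membership test first.
            return cache[args]
        except KeyError:
            result = cache[args] = fn(*args)
            return result

    return wrapped


calls = []  # records every time the underlying function actually runs


@memoize
def normalize_contig(name):
    calls.append(name)
    return name[3:] if name.startswith("chr") else name


normalize_contig("chr1")
normalize_contig("chr1")  # served from the cache; fn is not called again

# Sorting with an explicit key: the key is computed once per element and
# the resulting tuples are compared, instead of invoking a Python-level
# comparison method on the objects for every pair the sort examines.
records = [("2", 50, "A", "T"), ("1", 300, "G", "C"), ("1", 20, "C", "G")]
by_locus = sorted(records, key=lambda r: (r[0], r[1]))
```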

Other changes

  • Add test/benchmark_vcf_load.py for benchmarking load_vcf performance.
  • Add allow_extended_nucleotides parameter to load_vcf. This is necessary to load the dbsnp VCF.
  • Change multiallelic.vcf test case to use tabs (as per VCF standard) not spaces.

Timing details. Time to load first 100K variants in dbsnp VCF (median of 3 runs):

  • master varcode + pyensembl: 10.8 sec
  • varcode tweaks: 8.1 sec
  • varcode tweaks + pandas-based loader: 5.4 sec
  • varcode tweaks + pandas + skip parsing info: 1.5 sec
  • varcode tweaks + pandas + skip parsing info + pyensembl tweaks: 1.2 sec
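
For reference, the speedup factors implied by these timings can be worked out directly. The numbers are the ones listed above; the dictionary keys are just shorthand labels.

```python
baseline = 10.8  # seconds, master varcode + pyensembl

timings = {
    "varcode tweaks": 8.1,
    "pandas loader": 5.4,
    "pandas, skip info": 1.5,
    "pandas, skip info, pyensembl tweaks": 1.2,
}

# Speedup relative to master, rounded to one decimal place.
speedups = {name: round(baseline / secs, 1) for name, secs in timings.items()}
```

This gives 1.3X for the varcode tweaks alone, 2.0X with the pandas loader, 7.2X when INFO parsing is skipped, and 9.0X once the pyensembl tweaks are included.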


Add support for loading VCFs over any protocol supported by `urlopen` (HTTP, HTTPS, FTP). Closes #91.

Also adds a test for loading gzip'd VCFs from files and over HTTP.
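
A rough sketch of how a loader can accept local paths, URLs, and gzip'd input alike. This is not the code from this commit: `open_vcf_text` and `maybe_gunzip` are hypothetical helpers, and the import is the Python 3 location of `urlopen` (on Python 2 it lived in `urllib2`).

```python
import gzip
import io
from urllib.request import urlopen


def maybe_gunzip(raw):
    """Wrap a seekable binary stream as text, decompressing it if it
    starts with the two gzip magic bytes."""
    head = raw.read(2)
    raw.seek(0)
    if head == b"\x1f\x8b":
        raw = gzip.GzipFile(fileobj=raw)
    return io.TextIOWrapper(raw, encoding="utf-8")


def open_vcf_text(source):
    """Return a text-mode file object for a local path or a URL."""
    if "://" in source:
        # Read the whole response into memory so the stream is seekable.
        raw = io.BytesIO(urlopen(source).read())
    else:
        raw = open(source, "rb")
    return maybe_gunzip(raw)
```

Sniffing the magic bytes rather than trusting a `.gz` extension is a design choice made for this sketch; it means a URL without a telling suffix is still handled.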
@iskandr
Contributor

iskandr commented Jun 22, 2015

Small initial comment: can we make the fast path for parsing optional and PyVCF the default?

@iskandr
Contributor

iskandr commented Jun 22, 2015

Alternatively: can we add more unit tests of VCFs from the wild (with, for example, bad tabs vs. spaces mixups)?

@timodonnell
Contributor Author

Yeah, making the pyvcf version the default sounds good to me. We can experiment with the faster version for a while and if it seems to work reliably can switch it later. I'll push an update with that change.

 * `load_vcf` now uses the traditional pyvcf implementation. `load_vcf_fast` is the pandas implementation.
 * Split out `load_vcf_fast` into helper functions that load data into pandas and then use the pandas dataframe to create a VariantCollection.
 * better docs
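
The two-step split described in this commit might look roughly like the following. This is a sketch under assumptions, not the code in vcf.py: the helper names `load_vcf_dataframe` and `dataframe_to_variants` are invented, variants are returned as plain tuples instead of a VariantCollection, and treating both "PASS" and "." as passing is an assumption about the filter semantics.

```python
import io

import pandas as pd

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def load_vcf_dataframe(path_or_buffer, include_info=True):
    """Step 1: read the fixed VCF columns into a pandas DataFrame."""
    columns = VCF_COLUMNS if include_info else VCF_COLUMNS[:-1]
    return pd.read_csv(
        path_or_buffer,
        sep="\t",
        comment="#",  # skips '##' meta lines and the '#CHROM' header line
        header=None,
        names=columns,
        usecols=list(range(len(columns))),  # ignore INFO/sample columns if unused
        dtype={"CHROM": str, "ID": str, "REF": str, "ALT": str, "FILTER": str},
        keep_default_na=False,  # don't let strings like "NA" become NaN
    )


def dataframe_to_variants(df, only_passing=True):
    """Step 2: turn rows into (contig, pos, ref, alt) tuples, one per alt allele."""
    if only_passing:
        df = df[df["FILTER"].isin(["PASS", "."])]
    variants = []
    for chrom, pos, ref, alts in zip(df["CHROM"], df["POS"], df["REF"], df["ALT"]):
        for alt in alts.split(","):
            variants.append((chrom, int(pos), ref, alt))
    return variants


example_vcf = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tA\tT,G\t.\tPASS\tDP=10\n"
    "1\t200\t.\tC\tG\t.\tREJECT\tDP=3\n"
)
df = load_vcf_dataframe(io.StringIO(example_vcf), include_info=False)
variants = dataframe_to_variants(df)
```

Note that `comment="#"` also truncates any data line containing a `#`, so a real loader would need to handle that case differently.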
@timodonnell
Contributor Author

Ok, pushed an update with this change, as well as some hopefully clearer code in vcf.py

With this change, we now delay parsing the INFO field until after we have tested whether the record is PASS.
Fix a bug in VCF URL parsing where absolute paths starting with two slashes '//foo/file.vcf' would get misparsed.
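
The misparse is easy to reproduce with the standard library: a string starting with two slashes is read as a network-path reference, so the first path component becomes the host. The `split_source` helper below is a hypothetical illustration of one way to guard against this, not the fix that was committed.

```python
from urllib.parse import urlparse

# '//foo/file.vcf' is parsed as host 'foo' plus path '/file.vcf'.
parsed = urlparse("//foo/file.vcf")


def split_source(source):
    """Classify a VCF source as ('remote', url) or ('local', path)."""
    result = urlparse(source)
    if result.scheme in ("http", "https", "ftp"):
        return ("remote", source)
    if result.scheme == "file":
        return ("local", result.path)
    # No recognised scheme: treat the whole string as a filesystem path
    # rather than trusting result.path, which would drop '//foo'.
    return ("local", source)
```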

Also add VCFs from strelka and mutect to test the equivalence of the two VCF parsing routines.
@timodonnell
Contributor Author

Simplified the code a bit and added VCFs from strelka and mutect as part of the unit test that the fast and slow VCF parsing routines are equivalent. This is ready for another pass when you have a chance, @iskandr .

@timodonnell
Contributor Author

Also minor note: while load_vcf_fast is only 2-4X faster than load_vcf on typical data, on files like Mutect output, where the vast majority of variants are REJECT, this is many orders of magnitude faster.
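
The reason rejected-heavy files benefit so much is that the INFO column is only parsed for records that survive the FILTER check. A minimal sketch of that ordering, assuming plain tab-separated lines; `parse_info` and `passing_records` are hypothetical names, and counting "." as passing is an assumption.

```python
def parse_info(info):
    """Parse an INFO string such as 'DP=10;SOMATIC' into a dict.
    Flags without a value map to True; values stay as strings."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def passing_records(lines):
    """Yield (contig, pos, ref, alt, info_dict) for passing records only."""
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t", 8)
        chrom, pos, _id, ref, alt, _qual, filt, info = fields[:8]
        if filt not in ("PASS", "."):
            continue  # rejected: the INFO string is never parsed
        yield chrom, int(pos), ref, alt, parse_info(info)


lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tT\t.\tREJECT\tDP=3\n",
    "1\t200\t.\tC\tG\t.\tPASS\tDP=10;SOMATIC\n",
]
records = list(passing_records(lines))
```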

Comment thread varcode/variant.py
Contributor

This looks like a maintenance nightmare. Does it substantially improve performance, or were you running into memory usage problems?

Contributor Author

Yeah, it's for memory. It makes it possible to load in dbsnp on my machine. Whether that is a use case we care about I suppose might be debatable. Since the errors that are raised if this gets out of sync with the Variant attributes will be very obvious, I think it's worth keeping for now, despite the maintenance hassle.
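
For readers unfamiliar with the trade-off being discussed, here is a small illustration of `__slots__`. The classes are simplified stand-ins, not varcode's `Variant`, and the attribute names are chosen for the example.

```python
class PlainVariant(object):
    def __init__(self, contig, start, ref, alt):
        self.contig = contig
        self.start = start
        self.ref = ref
        self.alt = alt


class SlottedVariant(object):
    # Instances get fixed storage for exactly these attributes and no
    # per-instance __dict__, which is where the memory saving comes from.
    __slots__ = ("contig", "start", "ref", "alt")

    def __init__(self, contig, start, ref, alt):
        self.contig = contig
        self.start = start
        self.ref = ref
        self.alt = alt


plain = PlainVariant("1", 100, "A", "T")
slotted = SlottedVariant("1", 100, "A", "T")

# If the slots list falls out of sync with the attributes the class sets,
# the failure is immediate: assigning an undeclared attribute raises.
try:
    slotted.gene_name = "hypothetical"
    out_of_sync_error = None
except AttributeError as error:
    out_of_sync_error = error
```

The maintenance cost is that every new attribute has to be added to the `__slots__` tuple as well as to `__init__`.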

@iskandr
Contributor

iskandr commented Jul 1, 2015

LGTM other than the question about slots and single vs double quotes in docstrings

@timodonnell
Contributor Author

Updated to address the single v double quotes comment

timodonnell added a commit that referenced this pull request Jul 6, 2015
 Faster VCFs loading, support HTTP, and refactored variant metadata
@timodonnell timodonnell merged commit 2b6bb6b into master Jul 6, 2015
@timodonnell timodonnell deleted the faster-vcf-parsing branch July 6, 2015 16:56
Development

Successfully merging this pull request may close these issues.

support loading VCFs over HTTP

2 participants