
Cythonize parser #25

Closed
arq5x opened this Issue · 29 comments

5 participants

Aaron Quinlan Martijn Vermaat James Casbon Brent Pedersen - Bioinformatics Peter Cock
Aaron Quinlan

As expected, looping through the samples in _parse_samples is rather slow for files with many samples (e.g., [1]). The main bottlenecks seem to be Python's split() and the inherent slowness of looping at the Python level.

One potential solution would be to port the slowest looping bits to Cython. Another would be to work with BCF.

Thoughts? I think dealing with deep and wide files will be crucial for the future.

[1] ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120131_omni_genotypes_and_intensities/Omni25_genotypes_2141_samples.b37.vcf.gz

Martijn Vermaat
Collaborator

Agreed, this is important. But it should be kept in mind that once you go the Cython route, you can't run on PyPy anymore (if I'm not mistaken). Same goes for a hard dependency on pysam.

James Casbon
Owner
Brent Pedersen - Bioinformatics

See also #24: we could make the sample parsing lazy, so that samples are not parsed unless needed.
I'm wary of using Cython for this, since right now PyVCF has no dependencies...
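The lazy approach could look something like this minimal sketch (class and attribute names are illustrative, not PyVCF's actual API): keep the raw sample text unparsed until a caller first asks for it.

```python
# Hypothetical sketch of lazy sample parsing; names are illustrative.
class LazyRecord(object):
    def __init__(self, fixed_fields, raw_sample_text):
        self.fixed_fields = fixed_fields    # CHROM, POS, REF, ALT, ...
        self._raw = raw_sample_text         # unparsed tab-joined samples
        self._samples = None                # filled in on first access

    @property
    def samples(self):
        # the expensive split happens only if somebody asks for samples
        if self._samples is None:
            self._samples = self._raw.split('\t')
        return self._samples

rec = LazyRecord(['1', '12345'], '0/1:12\t0/0:34')
assert rec._samples is None                 # nothing parsed yet
assert rec.samples == ['0/1:12', '0/0:34']  # parsed on demand
```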

James Casbon
Owner

Just want to point out that the most efficient optimisation is to use a pysam-based fetch to read only the rows of interest.

Also, a few shortcuts would probably speed up the splits. At the moment we have to split on tab to separate samples, on colon to separate fields, and on comma to separate values. Most values do not need the final split.
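The "most values do not need the final split" shortcut can be sketched as a guard before the comma split (the function name is illustrative):

```python
# Sketch of skipping the final comma split for scalar values.
def split_value(value):
    # most FORMAT values are scalars, so test before splitting
    if ',' in value:
        return value.split(',')
    return value

assert split_value('99') == '99'             # scalar: no split performed
assert split_value('12,34') == ['12', '34']  # list value: split as before
```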

James Casbon
Owner

Just committed a change that halves the number of splits; a minor improvement.

However, there is no point improving _parse_samples since >50% of the time is spent in _parse_sample.

before:

Fri Feb 17 11:10:58 2012    1kg.prof

         2701745 function calls (2701621 primitive calls) in 7.593 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   239649    4.116    0.000    6.428    0.000 parser.py:518(_parse_sample)
   668478    1.567    0.000    1.567    0.000 parser.py:443(_map)
  1283830    0.776    0.000    0.776    0.000 {method 'split' of 'str' objects}
      381    0.576    0.002    7.439    0.020 parser.py:490(_parse_samples)
   239649    0.397    0.000    0.397    0.000 parser.py:114(__init__)

after:

Fri Feb 17 11:32:43 2012    1kg.prof

         1685560 function calls (1685436 primitive calls) in 6.986 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   239649    4.284    0.000    5.848    0.000 parser.py:518(_parse_sample)
   346906    1.133    0.000    1.133    0.000 parser.py:443(_map)
      381    0.533    0.001    6.831    0.018 parser.py:490(_parse_samples)
   589217    0.463    0.000    0.463    0.000 {method 'split' of 'str' objects}
   239649    0.410    0.000    0.410    0.000 parser.py:114(__init__)
   247269    0.039    0.000    0.039    0.000 {method 'append' of 'list' objects}
      654    0.025    0.000    0.025    0.000 {built-in method decompress}
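For reference, profiles like the ones above can be produced with the stdlib profiler; this is a generic sketch in which parse_file is a stand-in for the real 1kg benchmark.

```python
import cProfile
import pstats

def parse_file():
    # stand-in workload; the real benchmark parsed the 1kg test file
    for line in ['0/1:12:99\t0/0:34:88'] * 1000:
        [s.split(':') for s in line.split('\t')]

pr = cProfile.Profile()
pr.runcall(parse_file)
pr.dump_stats('1kg.prof')

# print the top functions ordered by internal time, as in the output above
pstats.Stats('1kg.prof').sort_stats('time').print_stats(5)
```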
James Casbon
Owner

I ran 'test/prof.py time' using pypy and python 2.7. It parses the 1kg test file.

pypy runs in 1.8 seconds, cpython in 3.0 seconds. 40% faster is not bad!

Aaron Quinlan

Just an FYI,

I am playing around with the following two options.

[3] https://github.com/ekg/vcflib
[4] http://code.google.com/p/pysam/source/browse/pysam/cvcf.pyx

However, my (perhaps naive) understanding is that linking to C++ libraries with ctypes is a bit murky. Untrue?

Brent Pedersen - Bioinformatics

if you decide to go that route, why not use cython?

Aaron Quinlan

Yeah, that's what I was getting at. My sense at this point is that the two best options are to:

  1. Work together to port vcflib with Cython (I have a local fork of the C++ lib that includes many of the methods I added to PyVcf)

  2. Use pysam's VCF class which is already written in Cython. I don't quite grok the API yet though.

James Casbon
Owner

An update: @arq5x has cythonized the parser https://github.com/arq5x/cyvcf/blob/master/cyvcf/parser.pyx

I would propose we include this in the next release. If any watchers want to speak up for pure python (i.e. not including the cythonized parser), speak now.

Peter Cock

Is having a Cython (or other low level) module plus the current code as a pure Python fall back viable? That way you can continue to support Jython and PyPy etc.
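A minimal sketch of that fall-back import pattern (the module name _cparse is hypothetical): prefer the compiled extension when it built, otherwise use pure Python.

```python
# Hypothetical optional-speedup pattern with a pure Python fall back.
try:
    from _cparse import parse_sample          # compiled Cython module
except ImportError:
    def parse_sample(text):                   # pure Python fall back
        return text.split(':')

assert parse_sample('0/1:10:99') == ['0/1', '10', '99']
```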

James Casbon
Owner

@peterjc I think the best route for this would be pypy's support of cython http://mail.python.org/pipermail/cython-devel/2012-February/001930.html

Otherwise, there is a lot of duplication of effort.

Peter Cock

I'm not sure how stable PyPy + Cython is yet, but I hope it will work well before too long.

Do you care about Python implementations other than C Python and PyPy? e.g. Jython, IronPython, ...

James Casbon
Owner

@peterjc Jython is still on Python 2.5, so I think it is effectively unsupported already. Does anyone actually use IronPython? I've never seen it in the wild.

Peter Cock

Right now I would say PyPy support was more important than Jython, which in turn is more important than IronPython (I don't know anyone using it).

Aaron Quinlan
arq5x commented

I'm glad to see you are interested in bringing in the Cython version. Is supporting PyPy a high priority? Speaking from an admittedly naive standpoint, my sense is that few people use it. Am I wrong?

Also, Python packaging still baffles/annoys me. Adding Cython as an installation dependency makes things more complex. What is the best way to package things so that only a C compiler (e.g. gcc) is needed, assuming the generated C file (.pyx -> .c) is present in the repo?

Are there better ways to handle this?

James Casbon
Owner

@arq5x I agree that pypy is not important right now; no one really uses it in anger today (I think). But I expect it will matter in the medium term.

For the packaging, check pybedtools' setup.py, which uses Cython if it is available and otherwise builds the extension from the shipped C source with gcc: https://github.com/daler/pybedtools/blob/master/setup.py
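A sketch of that setup.py pattern, assuming the generated parser.c is checked into the repo (paths and module names here are illustrative, not cyvcf's actual layout):

```python
from setuptools import setup, Extension

try:
    # if Cython is installed, regenerate the C source from the .pyx
    from Cython.Build import cythonize
    extensions = cythonize([Extension('cyvcf.parser',
                                      ['cyvcf/parser.pyx'])])
except ImportError:
    # otherwise compile the pre-generated C file with the system compiler
    extensions = [Extension('cyvcf.parser', ['cyvcf/parser.c'])]

setup(name='cyvcf', ext_modules=extensions)
```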

Aaron Quinlan

Thanks. Yeah, I actually based the setup.py in cyvcf on pybedtools for this reason, but I was unable to get it to behave as expected; I was likely doing something very noobish.

Having more expert eyes on that would be welcome.

By the way, the library is really fast now and getting pretty mature. I encourage you to test it out on some VCFs from 1000G. Its speed is close to vcflib and plinkseq, making it much more useful for constructing analysis scripts on large datasets.

James Casbon
Owner

Hi @arq5x, still want to do this but have scheduled for 0.6.0 as I wanted to get the 4.1 support out first.

Aaron Quinlan

A rational plan. Let me know if I can assist in any way.

James Casbon
Owner

I played around with cython and got this working 8ba9bed

It only cythonizes two functions at around 100 LOC. It's ~30% faster than the cyvcf branch (but does less, no aggregate stats).

It's possible to maintain parallel Python and Cython versions of 100 LOC, so this could be a way forward.

We'd need to add @arq5x's aggregation stuff. I can see this makes sense, since you can collect aggregate stats during parsing rather than re-iterating over the whole sample list at each site.

James Casbon jamescasbon referenced this issue from a commit
James Casbon Add cython code #25 618988f
James Casbon
Owner

Just added an optional cython branch. My 1kg profile takes 3.4s versus 4.7s for cyvcf.

Will probably need to add some logic to setup.py so that it still builds under pypy, and ensure that the pure Python code is tested as well (add to tox and travis).

James Casbon
Owner

PS: that means a code branch, not a git branch.

Aaron Quinlan

Nice. Is there any intent to precompute num_hets, num_hom_ref, etc., as well as lists of gt_types, gt_phases, etc., or do you want to keep these attrs as they are currently, which requires a second pass through all of the genotype info?

One idea would be to have a parameter for the Reader that dictates whether these attributes should be pre-computed or not. Thoughts? I just struggle with the fact that large lists of genotypes need to be looped through at least twice to compute useful metrics. For example, if I have a file with 1000 genotypes and ask for num_hets, num_hom_ref, num_hom_alt, and num_unknown, the genotypes are looped through 5 different times.
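The single-pass idea can be sketched like this, using PyVCF's gt_type encoding (0 = hom_ref, 1 = het, 2 = hom_alt, None = unknown); the function name is illustrative:

```python
# Sketch: compute all four aggregate counts in one pass over gt_types
# instead of looping over the genotypes once per attribute.
def aggregate(gt_types):
    num_hom_ref = num_het = num_hom_alt = num_unknown = 0
    for gt in gt_types:
        if gt == 0:
            num_hom_ref += 1
        elif gt == 1:
            num_het += 1
        elif gt == 2:
            num_hom_alt += 1
        else:
            num_unknown += 1
    return num_hom_ref, num_het, num_hom_alt, num_unknown

assert aggregate([0, 1, 1, 2, None]) == (1, 2, 1, 1)
```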

James Casbon
Owner

Fixed, and for the record pypy is still a bit faster than the cythonized version.

James Casbon
Owner

FWIW, I replaced the use of a dict to store the call data with a namedtuple and now the pure python version is as fast as cyvcf.

You could port this change, or start using HEAD, which now has cython support.
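For anyone following along, the change described is roughly this (field names are illustrative; a real reader would build one tuple type per FORMAT string and reuse it for every call):

```python
from collections import namedtuple

# one namedtuple type per FORMAT string, created once and reused
CallData = namedtuple('CallData', ['GT', 'GQ', 'DP'])

call = CallData(GT='0/1', GQ=99, DP=10)
assert call.GT == '0/1'    # attribute access replaces call['GT']
assert call[1] == 99       # positional access still works
```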

Aaron Quinlan

Fantastic, thanks for the heads up.

Brent Pedersen - Bioinformatics
brentp commented

@jamescasbon, nice! Is the getattr as fast as the previous []? I never use namedtuple.

Chris Lasher gotgenes referenced this issue from a commit in gotgenes/PyVCF
James Casbon Add cython code #25 3365dd2