Improve performance of INFO field parsing #96

Closed
bruth opened this Issue · 16 comments

4 participants

@bruth

I am working with VCF files with 1.5+ million records. A quick benchmark:

Stats

  • Mac OS X 10.8.2
  • 2.5 GHz i5
  • Python 2.7.2
  • PyVCF 0.6.3 + built Cython extensions
import vcf

# Baseline iteration over the raw file
with open('/path/to/file.vcf') as source:
    for line in source:
        pass

# Iteration with PyVCF (each line is parsed into a _Record)
with open('/path/to/file.vcf') as source:
    reader = vcf.Reader(source)
    for record in reader:
        pass

File size: 610 MB
Variant count: 1,794,055

Iteration over flat file (no parsing)

Time: ~80 seconds
Memory: 0

Iteration over the VCF file with PyVCF parsing

Time: ~41 minutes (extrapolated)
Memory: 246 KB

@seandavi

It might help if you can annotate this with your versions of Python and PyVCF, if only for posterity's sake.

@bruth

Added to ticket

@jamescasbon
Owner

Hi @bruth, thanks for your feedback. How many samples were in your file? Assuming there were 10 samples and 4 fields in each cell, a better baseline is the read time plus the time taken for 4 * 10 * 1.8e6 casts.

There are quite a few reasons why parsing VCFs will be inherently slow: the potential nullability of every field, the variable-length calls, lists, etc. However, we can probably do better.

Anyway, it's not clear what this issue actually refers to. You can select a target range to analyze using tabix, which should be your first optimization. Can you clarify the issue, please?
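
For reference, a minimal sketch of the tabix-based range access mentioned above (this assumes a bgzipped, tabix-indexed file, pysam installed, and the fetch API as it exists in PyVCF 0.6.x):

import vcf

# Requires file.vcf.gz plus the file.vcf.gz.tbi index produced by bgzip/tabix.
reader = vcf.Reader(filename='/path/to/file.vcf.gz')

# Only records in the requested range are read and parsed, so a targeted
# analysis never pays the full parse cost for all 1.8M records.
for record in reader.fetch('1', 10000, 20000):
    pass  # analyze only this slice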

@bruth

@jamescasbon This particular VCF file contains a single sample. Here is the header and what a single record looks like:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample_1
1       10186   .       T       G       10.9    PASS    ABHom=0.067;AC=2;AF=1.00;AN=2;BaseCounts=0,1,1,14;BaseQRankSum=-0.347;DP=16;Dels=0.00;FS=0.000;GC=51.81;HaplotypeScore=8.5283;LowMQ=0.0000,0.6250,16;MLEAC=2;MLEAF=1.00;MQ=14.45;MQ0=0;MQRankSum=1.504;OND=0.938;PercentNBaseSolid=0.0000;QD=0.68;ReadPosRankSum=1.504;set=variant;EFF=INTERGENIC(MODIFIER|||||||||)    GT:AD:DP:GQ:PL  1/1:14,1:16:3:37,3,0

As you can see, the INFO field is quite large.
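
To make the cost concrete, this is roughly the per-record work an INFO parser faces (a simplified sketch, not PyVCF's actual _parse_info, which additionally casts every value against its ##INFO header type):

def parse_info_sketch(info_str):
    info = {}
    for entry in info_str.split(';'):
        # Flag fields have no '='; other values may be comma-separated lists.
        key, sep, value = entry.partition('=')
        if not sep:
            info[key] = True
        else:
            info[key] = value.split(',')
    return info

The record above yields over twenty INFO keys, so every record costs dozens of string splits and dictionary insertions before any type casting even starts.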

@bruth

@jamescasbon The issue is simply the time comparison between plain old file iteration and iteration using PyVCF (which of course includes all the parsing and _Record creation for each record). I am not making any claim as to why it is slow, but my reaction was that this is quite a painful difference.

Unless you are already aware of any glaring issues, I am going to run it through a profiler to see which steps take the longest during a single iteration.

@bruth

I did a quick profile using the profilehooks profile decorator, which allows targeting specific methods; the setup is sketched below.

I profiled the reader and found an initial slowness in the _parse_info method, specifically due to the internals of OrderedDict. Swapping it out for a normal dictionary reduced the total time by ~40% (for 10,000 iterations). I am curious about the breakdown of use cases for preserving the position of each INFO field (I personally do not need the order).

I will be investigating other areas as well.
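
A sketch of how such a targeted profile can be set up with profilehooks (method names follow PyVCF 0.6.x; entries=40 matches the truncated listings below; stats print when the process exits):

import vcf
from profilehooks import profile

# Wrap only the methods of interest on the class itself.
vcf.Reader.next = profile(vcf.Reader.next, entries=40)

with open('/path/to/file.vcf') as source:
    reader = vcf.Reader(source)
    for i, record in enumerate(reader):
        if i == 10000:  # the profiles below cover 10,000 iterations
            break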

@bruth

I made a quick branch that adds a preserve_order argument to Reader, which determines the type of dict to use. This keeps the ordering available but optional and allows a backwards-compatible approach (although I currently have it defaulting to False). chop-dbhi@b333e88
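
The shape of that change is roughly the following (an illustrative skeleton, not the actual diff in chop-dbhi@b333e88; attribute names here are hypothetical):

from collections import OrderedDict

class Reader(object):
    def __init__(self, fsock=None, preserve_order=False):
        # Pick the mapping type once at construction time, so _parse_info
        # pays no per-record branch for the choice.
        self._info_dict = OrderedDict if preserve_order else dict
        self._fsock = fsock

    def _parse_info(self, info_str):
        # dict by default, OrderedDict only when the caller asks for it.
        retdict = self._info_dict()
        for entry in info_str.split(';'):
            key, sep, value = entry.partition('=')
            retdict[key] = value if sep else True
        return retdict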

@martijnvermaat
Collaborator

See also #45.

@jamescasbon
Owner

Thanks, @bruth
Curious: what proportion of the time was _parse_info vs. parsing samples? I optimized sample parsing but didn't touch INFO parsing.

FWIW, I tried to convince some people to use JSON for fields like this. We have existing fast parsers for that.
http://www.biostars.org/p/5265/
Check out lh3's answer ;)
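
The point is that a JSON-encoded INFO column could lean on an existing, heavily optimized parser instead of hand-rolled splitting and casting (a hypothetical encoding, not anything in the VCF spec):

import json

# One call parses keys, nested lists, and numeric types in optimized code.
info = json.loads('{"DP": 16, "AF": 1.0, "BaseCounts": [0, 1, 1, 14]}')
assert info['DP'] == 16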

@bruth

Here are the two profiles for next and _parse_info. The first two are with OrderedDict and the second two without. I am working with a single-sample file; you can see that _parse_samples is 4-5x faster than _parse_info.

With OrderedDict:

next method

function called 10000 times

         960496 function calls in 2.970 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 70 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.243    0.000    2.969    0.000 parser.py:462(next)
    10000    1.004    0.000    1.632    0.000 parser.py:287(_parse_info)
   230213    0.516    0.000    0.672    0.000 parser.py:282(_map)
    10000    0.241    0.000    0.480    0.000 parser.py:360(_parse_samples)
   459979    0.302    0.000    0.302    0.000 {method 'split' of 'str' objects}
    10000    0.027    0.000    0.301    0.000 re.py:164(split)
    10000    0.241    0.000    0.241    0.000 {method 'split' of '_sre.SRE_Pattern' objects}
    10011    0.032    0.000    0.155    0.000 parser.py:434(_parse_alt)
    10000    0.065    0.000    0.075    0.000 model.py:117(__init__)
    10011    0.048    0.000    0.069    0.000 model.py:436(__init__)
    10000    0.048    0.000    0.057    0.000 parser.py:209(<genexpr>)
    20011    0.045    0.000    0.055    0.000 re.py:228(_compile)
    10011    0.024    0.000    0.054    0.000 re.py:139(search)
    10000    0.033    0.000    0.042    0.000 <string>:8(__new__)
    10000    0.026    0.000    0.026    0.000 model.py:11(__init__)
    10011    0.019    0.000    0.019    0.000 model.py:420(__init__)
    20016    0.010    0.000    0.010    0.000 {method 'get' of 'dict' objects}
    20000    0.010    0.000    0.010    0.000 {method 'strip' of 'str' objects}
    10000    0.009    0.000    0.009    0.000 {built-in method __new__ of type object at 0x109a0e248}
    10011    0.009    0.000    0.009    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
    10000    0.008    0.000    0.008    0.000 {method 'extend' of 'list' objects}
    30039    0.005    0.000    0.005    0.000 {len}
    10040    0.003    0.000    0.003    0.000 {method 'append' of 'list' objects}
    10000    0.001    0.000    0.001    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.001    0.001 parser.py:342(_parse_sample_format)
        1    0.000    0.000    0.001    0.001 model.py:529(make_calldata_tuple)
        1    0.001    0.001    0.001    0.001 collections.py:237(namedtuple)
        1    0.000    0.000    0.000    0.000 sre_compile.py:495(compile)
        1    0.000    0.000    0.000    0.000 sre_compile.py:480(_code)
        2    0.000    0.000    0.000    0.000 sre_compile.py:178(_compile_charset)
        2    0.000    0.000    0.000    0.000 sre_compile.py:207(_optimize_charset)
        1    0.000    0.000    0.000    0.000 sre_compile.py:361(_compile_info)
        1    0.000    0.000    0.000    0.000 sre_compile.py:32(_compile)
        1    0.000    0.000    0.000    0.000 sre_parse.py:663(parse)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 sre_parse.py:301(_parse_sub)
        1    0.000    0.000    0.000    0.000 sre_parse.py:379(_parse)
        6    0.000    0.000    0.000    0.000 {all}
       24    0.000    0.000    0.000    0.000 collections.py:277(<genexpr>)
        6    0.000    0.000    0.000    0.000 sre_parse.py:201(get)

_parse_info only

function called 10000 times

         989340 function calls in 2.709 seconds

   Ordered by: cumulative time, internal time, call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    1.216    0.000    2.708    0.000 parser.py:287(_parse_info)
   200213    0.486    0.000    0.486    0.000 parser.py:282(_map)
   209457    0.432    0.000    0.432    0.000 collections.py:53(__setitem__)
   419670    0.291    0.000    0.291    0.000 {method 'split' of 'str' objects}
    10000    0.094    0.000    0.283    0.000 collections.py:37(__init__)
    10000    0.063    0.000    0.155    0.000 _abcoll.py:483(update)
    10000    0.013    0.000    0.073    0.000 {isinstance}
    10000    0.039    0.000    0.059    0.000 abc.py:128(__instancecheck__)
    30000    0.032    0.000    0.032    0.000 _weakrefset.py:68(__contains__)
    10000    0.016    0.000    0.032    0.000 abc.py:148(__subclasscheck__)
    10000    0.014    0.000    0.014    0.000 {hasattr}
    30000    0.004    0.000    0.004    0.000 {len}
    10000    0.004    0.000    0.004    0.000 {getattr}
    10000    0.003    0.000    0.003    0.000 {method 'items' of 'dict' objects}
    10000    0.002    0.000    0.002    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        0    0.000             0.000          profile:0(profiler)

Without OrderedDict (only changed in _parse_info):

next method

function called 10000 times

         1309953 function calls in 4.183 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 78 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.283    0.000    4.182    0.000 parser.py:462(next)
    10000    1.234    0.000    2.738    0.000 parser.py:287(_parse_info)
   230213    0.559    0.000    0.732    0.000 parser.py:282(_map)
    10000    0.264    0.000    0.467    0.000 parser.py:360(_parse_samples)
   209457    0.454    0.000    0.454    0.000 collections.py:53(__setitem__)
    10000    0.031    0.000    0.336    0.000 re.py:164(split)
   459979    0.335    0.000    0.335    0.000 {method 'split' of 'str' objects}
    10000    0.096    0.000    0.302    0.000 collections.py:37(__init__)
    10000    0.266    0.000    0.266    0.000 {method 'split' of '_sre.SRE_Pattern' objects}
    10011    0.037    0.000    0.173    0.000 parser.py:434(_parse_alt)
    10000    0.066    0.000    0.166    0.000 _abcoll.py:483(update)
    10000    0.075    0.000    0.086    0.000 model.py:117(__init__)
    10004    0.013    0.000    0.079    0.000 {isinstance}
    10011    0.055    0.000    0.078    0.000 model.py:436(__init__)
    10000    0.062    0.000    0.073    0.000 parser.py:209(<genexpr>)
    10000    0.042    0.000    0.066    0.000 abc.py:128(__instancecheck__)
    20011    0.049    0.000    0.061    0.000 re.py:228(_compile)
    10011    0.026    0.000    0.058    0.000 re.py:139(search)
    10000    0.039    0.000    0.048    0.000 <string>:8(__new__)
    30000    0.038    0.000    0.038    0.000 _weakrefset.py:68(__contains__)
    10000    0.018    0.000    0.037    0.000 abc.py:148(__subclasscheck__)
    10000    0.030    0.000    0.030    0.000 model.py:11(__init__)
    10011    0.022    0.000    0.022    0.000 model.py:420(__init__)
    10000    0.015    0.000    0.015    0.000 {hasattr}
    20016    0.012    0.000    0.012    0.000 {method 'get' of 'dict' objects}
    60039    0.011    0.000    0.011    0.000 {len}
    20000    0.010    0.000    0.010    0.000 {method 'strip' of 'str' objects}
    10011    0.010    0.000    0.010    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
    10000    0.010    0.000    0.010    0.000 {built-in method __new__ of type object at 0x10ddaa248}
    10000    0.009    0.000    0.009    0.000 {method 'extend' of 'list' objects}
    10000    0.005    0.000    0.005    0.000 {getattr}
    10040    0.003    0.000    0.003    0.000 {method 'append' of 'list' objects}
    10001    0.003    0.000    0.003    0.000 {method 'items' of 'dict' objects}
    10000    0.002    0.000    0.002    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.001    0.001 parser.py:342(_parse_sample_format)
        1    0.000    0.000    0.001    0.001 model.py:529(make_calldata_tuple)
        1    0.001    0.001    0.001    0.001 collections.py:237(namedtuple)
        1    0.000    0.000    0.000    0.000 sre_compile.py:495(compile)
        1    0.000    0.000    0.000    0.000 sre_compile.py:480(_code)
        2    0.000    0.000    0.000    0.000 sre_compile.py:178(_compile_charset)

_parse_info only

function called 10000 times

         639883 function calls in 1.572 seconds

   Ordered by: cumulative time, internal time, call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.979    0.000    1.571    0.000 parser.py:287(_parse_info)
   200213    0.351    0.000    0.351    0.000 parser.py:282(_map)
   419670    0.242    0.000    0.242    0.000 {method 'split' of 'str' objects}
    10000    0.001    0.000    0.001    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        0    0.000             0.000          profile:0(profiler)
@jamescasbon
Owner

See also #50.

@martijnvermaat
Collaborator

I'm not too happy with the approach of #127, and added some reasoning in that pull request.

The problem is with the OrderedDict in _parse_info; all the other OrderedDict uses are not hindering performance.

However, when writing a record to a VCF file, it would be nice to have some predictable order, as requested in #46. I also think #46 makes some good suggestions on how to tackle this, though they were never implemented.

So to recap these two suggestions, I think the best approach here is to change the OrderedDict to an ordinary dictionary in _parse_info and, when writing a record, either:

  1. have the order of the fields in the INFO column respect the order of the INFO fields in the header, or
  2. sort the fields alphabetically, both in the header and in the INFO column.

I probably prefer the first option, since it preserves alphabetical order if you started with it and gives you the freedom to place existing and new fields where you like; a sketch follows below.

Please let me know if you think otherwise or if I'm missing something.
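
A minimal sketch of the first option (a hypothetical helper, not PyVCF code): fields declared in the header come first, in header order, and any remaining fields fall back to alphabetical order.

def ordered_info_keys(info, header_ids):
    # header_ids: INFO IDs in the order they appear in the ##INFO header lines.
    rank = dict((key, i) for i, key in enumerate(header_ids))
    # Declared fields sort by header position; unknown fields all share a
    # rank past the end and are then broken alphabetically.
    return sorted(info, key=lambda k: (rank.get(k, len(rank)), k))

# ordered_info_keys({'ZZ': 1, 'DP': 16, 'AF': 1.0}, ['AF', 'DP'])
# -> ['AF', 'DP', 'ZZ']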

@martijnvermaat
Collaborator

To clarify, I see no problem in losing the order when using a record in Python code. I do see a problem when writing the record back to a VCF file.

So my suggestion is to recover the order only when writing. This of course might move the performance hit to that point, but we should just run some new timings to check.

@martijnvermaat martijnvermaat referenced this issue from a commit in martijnvermaat/PyVCF
@martijnvermaat martijnvermaat Do not maintain the order of INFO fields within records
Using an ordinary dict instead of an OrderedDict for the INFO fields makes
parsing faster. The INFO fields are sorted by the VCF writer: all fields
defined in the VCF header go first, in the same order, followed by the
remaining fields in alphabetical order. Note that this makes writing slower.

The following are some simple benchmarks, starting with just parsing:

1. Without this change (using OrderedDict):

       In [1]: %timeit list(vcf.Reader(open('vcf/test/1kg.sites.vcf')))
       100 loops, best of 3: 15 ms per loop

2. With this change (using dict):

       In [1]: %timeit list(vcf.Reader(open('vcf/test/1kg.sites.vcf')))
       100 loops, best of 3: 10 ms per loop

Now parsing the same file and writing it back to VCF:

1. Without this change (using OrderedDict, no sorting):

       In [1]: %%timeit
          ...: reader = vcf.Reader(open('vcf/test/1kg.sites.vcf'))
          ...: writer = vcf.Writer(open(os.devnull, 'w'), reader)
          ...: for record in reader:
          ...:     writer.write_record(record)
          ...:
       10 loops, best of 3: 22.7 ms per loop

2. With half this change (using dict, no sorting):

       In [1]: %%timeit
          ...: reader = vcf.Reader(open('vcf/test/1kg.sites.vcf'))
          ...: writer = vcf.Writer(open(os.devnull, 'w'), reader)
          ...: for record in reader:
          ...:     writer.write_record(record)
          ...:
       100 loops, best of 3: 16.5 ms per loop

3. With this change (using dict, sorting during write):

       In [6]: %%timeit
          ...: reader = vcf.Reader(open('vcf/test/1kg.sites.vcf'))
          ...: writer = vcf.Writer(open(os.devnull, 'w'), reader)
          ...: for record in reader:
          ...:     writer.write_record(record)
          ...:
       100 loops, best of 3: 17.7 ms per loop

Fixes GitHub issue #96.
14dd121
@martijnvermaat
Collaborator

My proposed fix is in #128. If you see no problems with it I'll merge it shortly.

@bruth

@martijnvermaat looks great. Thank you.

@martijnvermaat martijnvermaat referenced this issue from a commit in martijnvermaat/PyVCF
@martijnvermaat martijnvermaat Do not maintain the order of INFO fields within records
Using an ordinary dict instead of an OrderedDict for the INFO fields makes
parsing faster. The INFO fields are sorted by the VCF writer: all fields
defined in the VCF header go first, in the same order, followed by the
remaining fields in alphabetical order. Note that this makes writing slower.

We lose two things:

1. Getting the INFO fields in their original order from a record when using
   PyVCF as a library (but I don't think most users expect this anyway).
2. Preserving the original order of the INFO fields when writing (but the
   order is predictable).

[Benchmarks identical to those in 14dd121 above.]

Fixes GitHub issue #96.
4892aab
@martijnvermaat
Collaborator

@bruth thanks a lot for raising the issue and working with us towards a solution!

@gotgenes gotgenes referenced this issue from a commit in gotgenes/PyVCF
@martijnvermaat martijnvermaat Do not maintain the order of INFO fields within records
[Same commit message and benchmarks as 4892aab above.]
ea5f458