
Improve mzMLb read performance with array cache #38

Merged (2 commits, Apr 8, 2021)

Conversation

mobiusklein
Contributor

Sure enough, the feature was merged and then I started doing more serious performance testing. It turns out that h5py, for simplicity, does not cache decompressed chunks in memory: every read of a slice of a dataset decompresses a whole chunk and then takes the requested slice from it, every single time. This makes sequential reading of an mzMLb file orders of magnitude slower than reading the equivalent compressed mzML.

This feature adds an extra intermediate caching layer to the MzMLb object, which keeps a chunk of each array in memory under the assumption that the next read will fall in the same region. This makes the parser about twice as fast as on the equivalent mzML file.
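The caching idea can be sketched roughly like this. The class and method names here are hypothetical (not the actual pyteomics implementation), and a counting stand-in replaces a real h5py dataset so the sketch is self-contained; only simple `[start:stop]` slices are handled:

```python
class ChunkedArrayCache:
    """Keep one contiguous block of the underlying dataset in memory so that
    sequential slice reads rarely touch the (expensive) backing store."""

    def __init__(self, dataset, block_size=2 ** 20):
        self.dataset = dataset      # anything supporting len() and [start:stop]
        self.block_size = block_size
        self.block_start = 0
        self.block = None           # cached block, or None before the first read

    def __getitem__(self, key):
        start, end = key.start or 0, key.stop
        if (self.block is None or start < self.block_start
                or end > self.block_start + len(self.block)):
            # Cache miss: read a whole block starting at the requested offset.
            self.block_start = start
            stop = min(start + self.block_size, len(self.dataset))
            self.block = self.dataset[start:stop]
        offset = start - self.block_start
        return self.block[offset:offset + (end - start)]


class CountingStore:
    """Stand-in for an h5py dataset that counts raw slice reads."""

    def __init__(self, data):
        self.data, self.reads = data, 0

    def __len__(self):
        return len(self.data)

    def __getitem__(self, key):
        self.reads += 1
        return self.data[key]
```

With a 4096-element block size, reading a 10000-element array sequentially in 100-element slices touches the backing store only three times instead of a hundred.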

@levitsky
Owner

levitsky commented Apr 7, 2021

In a very quick test with a file converted from mzML to mzMLb, the updated parser is almost 2x faster than without caching, yet the original mzML file is parsed about 5x faster. I wonder what the difference is.

@mobiusklein
Contributor Author

There's a difference in compression levels between test.mzML and test.mzMLb. All the buffers in test.mzMLb are gzip compressed. There's likely still some inefficiency. I will need to carry out a more thorough benchmark with large data files, larger than what would go in a unit test anyway.

@levitsky
Owner

levitsky commented Apr 7, 2021

To clarify, I didn't test with test.mzML and test.mzMLb, but rather with some arbitrary mzML files of ~100 MB and ~1 GB I had lying around. I thought it might have to do with compression differences in mzML, but I don't think compression affects mzML performance all that much, at least not with zlib. Anyway, you're right that more benchmarking is needed, assuming it even makes sense to compare speed with mzML parsing here.
Meanwhile, I don't see why this PR should be waiting; I'll be happy to merge it as is.

@mobiusklein
Contributor Author

Sounds good to me then. I'll get back to you on benchmarks soon.

@mobiusklein
Contributor Author

Since I saw this wasn't merged yet, I thought I'd add another optimization.

I found that even after reducing the overhead of reading chunks of the numerical data arrays, I could further cut the number of direct HDF5 reads by adding buffering to the XML byte stream, which suffered from the same repeated __getitem__ cost as the array slicing did. This way, we only hit the HDF5 file a few times during a sequential read-through instead of multiple times per iteration.
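The buffering idea can be sketched as a small file-like wrapper; as before, the names are hypothetical and a counting stand-in plays the role of the HDF5 byte dataset:

```python
class BufferedDatasetReader:
    """File-like wrapper that pulls the underlying byte dataset in large
    blocks, so parsing many small XML pieces doesn't hit HDF5 per read()."""

    def __init__(self, dataset, block_size=2 ** 16):
        self.dataset = dataset      # supports len() and [start:stop] -> bytes
        self.block_size = block_size
        self.offset = 0             # position of the next dataset read
        self.buffer = b""
        self.buf_pos = 0

    def read(self, n):
        out = bytearray()
        while n > 0:
            if self.buf_pos >= len(self.buffer):
                if self.offset >= len(self.dataset):
                    break           # dataset exhausted
                # Refill: one dataset read serves many small read() calls.
                end = min(self.offset + self.block_size, len(self.dataset))
                self.buffer = self.dataset[self.offset:end]
                self.offset = end
                self.buf_pos = 0
            take = min(n, len(self.buffer) - self.buf_pos)
            out += self.buffer[self.buf_pos:self.buf_pos + take]
            self.buf_pos += take
            n -= take
        return bytes(out)


class CountingByteStore:
    """Stand-in for the HDF5 byte array that counts raw slice reads."""

    def __init__(self, data):
        self.data, self.reads = data, 0

    def __len__(self):
        return len(self.data)

    def __getitem__(self, key):
        self.reads += 1
        return self.data[key]
```

Reading a 10 KB byte array a few bytes at a time with a 4 KB block then costs only three raw dataset reads, rather than one per call.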

I think h5py has been planning to implement better caching for a long time, but keeps going back and forth between implementing it themselves and building on top of PyTables.

@mobiusklein
Contributor Author

Benchmarked on a full .raw file I had lying around:

mzMLb with blosc compression (1 GB)

In [2]: reader2 = mzmlb.MzMLb("AGP_tryptic_300ng_2microscans_glycoproteomics_nCE_27-30.mzMLb")
In [3]: %%prun
   ...: for scan in reader2:
   ...:     pass
   ...:
         34527800 function calls (33607805 primitive calls) in 30.914 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
436268/41270    7.102    0.000   24.059    0.001 xml.py:403(_get_info)
     6182    4.516    0.001    4.865    0.001 dataset.py:688(__getitem__)
    41271    4.324    0.000   30.831    0.001 xml.py:528(_iterfind_impl)
  1190912    4.258    0.000    7.041    0.000 xml.py:352(_handle_param)
  6508886    2.449    0.000    3.267    0.000 xml.py:51(_local_name)
436268/41270    1.179    0.000   24.241    0.001 mzml.py:268(_get_info_smart)
  3219488    1.043    0.000    1.043    0.000 {built-in method __new__ of type object at 0x00007FF9AB5BC810}
  6516852    0.820    0.000    0.820    0.000 {method 'rpartition' of 'str' objects}
  1190912    0.472    0.000    0.472    0.000 structures.py:345(__new__)
  1190912    0.375    0.000    0.969    0.000 structures.py:288(__new__)
    82540    0.344    0.000    6.222    0.000 mzmlb.py:226(_handle_binary)
  1190912    0.311    0.000    0.311    0.000 xml.py:379(_insert_param)
        1    0.287    0.287   31.185   31.185 <string>:1(<module>)
    82540    0.228    0.000    1.162    0.000 mzml.py:235(_handle_binary)
    82540    0.218    0.000    0.218    0.000 {built-in method numpy.array}
    82540    0.196    0.000    0.310    0.000 mzml.py:116(_detect_array_name)
  2305750    0.194    0.000    0.194    0.000 {built-in method builtins.isinstance}
    82540    0.187    0.000    0.195    0.000 mzml.py:210(_determine_compression)
  1559546    0.167    0.000    0.167    0.000 {method 'get' of 'dict' objects}
   540438    0.160    0.000    0.296    0.000 structures.py:313(__new__)
    82540    0.145    0.000    0.153    0.000 mzml.py:190(_determine_array_dtype)
   296746    0.142    0.000    0.372    0.000 xml.py:151(str_to_num)
  1446388    0.137    0.000    0.137    0.000 {method 'append' of 'list' objects}
   296746    0.125    0.000    0.230    0.000 structures.py:268(__new__)
  1391790    0.119    0.000    0.119    0.000 {method 'items' of 'dict' objects}
     6182    0.118    0.000    0.119    0.000 dataset.py:466(_fast_reader)
   296746    0.118    0.000    0.490    0.000 xml.py:157(convert_from)

mzML uncompressed (1.5 GB)

In [4]: reader = mzml.MzML("./AGP_tryptic_300ng_2microscans_glycoproteomics_nCE_27-30.mzML")
In [5]: %%prun
   ...: for scan in reader:
   ...:     pass
   ...:
         25450028 function calls (24494951 primitive calls) in 37.147 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
436268/41270    8.323    0.000   28.340    0.001 xml.py:403(_get_info)
    41270    6.870    0.000    7.496    0.000 xml.py:622(_find_by_id_no_reset)
    82540    5.872    0.000    5.872    0.000 {built-in method zlib.decompress}
   984561    3.647    0.000    5.830    0.000 xml.py:352(_handle_param)
436268/41270    2.487    0.000   28.558    0.001 mzml.py:268(_get_info_smart)
    82540    2.266    0.000    2.266    0.000 {built-in method binascii.a2b_base64}
  2765038    1.020    0.000    1.020    0.000 {built-in method __new__ of type object at 0x00007FF9AB5BC810}
  2841658    0.640    0.000    0.640    0.000 xml.py:51(_local_name)
   984561    0.391    0.000    0.391    0.000 structures.py:345(__new__)
   984561    0.313    0.000    0.885    0.000 structures.py:288(__new__)
    82540    0.290    0.000    9.004    0.000 utils.py:170(decode_data_array)
   984561    0.269    0.000    0.269    0.000 xml.py:379(_insert_param)
    82540    0.245    0.000    9.923    0.000 mzml.py:235(_handle_binary)
   984561    0.225    0.000    0.409    0.000 <string>:1(__new__)
    41270    0.223    0.000    0.443    0.000 ntpath.py:450(normpath)
  2423094    0.221    0.000    0.221    0.000 {built-in method builtins.isinstance}
    82540    0.186    0.000    0.194    0.000 mzml.py:210(_determine_compression)
    82540    0.173    0.000    0.266    0.000 mzml.py:116(_detect_array_name)
  1515154    0.173    0.000    0.173    0.000 {method 'get' of 'dict' objects}
    41270    0.171    0.000   36.872    0.001 file_helpers.py:78(wrapped)
   499168    0.160    0.000    0.316    0.000 structures.py:313(__new__)
    41270    0.156    0.000   36.367    0.001 xml.py:1107(get_by_id)
   123810    0.151    0.000    0.151    0.000 {method 'seek' of '_io.BufferedReader' objects}
   296746    0.149    0.000    0.386    0.000 xml.py:151(str_to_num)
    82540    0.147    0.000    0.154    0.000 mzml.py:190(_determine_array_dtype)

@mobiusklein
Contributor Author

And to try out a better compressor: mzMLb with blosc:zstd, at 700 MB on disk. That's less than the mzML file compressed with gzip (970 MB).

In [2]: reader2 = mzmlb.MzMLb("AGP_tryptic_300ng_2microscans_glycoproteomics_nCE_27-30.zstd.mzMLb")
In [3]: %%prun
   ...: for scan in reader2:
   ...:     pass
   ...:
         34527932 function calls (33607934 primitive calls) in 33.959 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     6185    7.568    0.001    7.940    0.001 dataset.py:688(__getitem__)
436268/41270    7.047    0.000   27.003    0.001 xml.py:403(_get_info)
    41271    4.382    0.000   33.879    0.001 xml.py:528(_iterfind_impl)
  1190912    4.264    0.000    7.007    0.000 xml.py:352(_handle_param)
  6508886    2.438    0.000    3.261    0.000 xml.py:51(_local_name)
436268/41270    1.185    0.000   27.183    0.001 mzml.py:268(_get_info_smart)
  3219488    1.040    0.000    1.040    0.000 {built-in method __new__ of type object at 0x00007FF9AB3BC810}
  6516855    0.826    0.000    0.826    0.000 {method 'rpartition' of 'str' objects}
  1190912    0.440    0.000    0.440    0.000 structures.py:345(__new__)
  1190912    0.358    0.000    0.961    0.000 structures.py:288(__new__)
    82540    0.344    0.000    9.253    0.000 mzmlb.py:226(_handle_binary)
  1190912    0.305    0.000    0.305    0.000 xml.py:379(_insert_param)
        1    0.285    0.285   34.232   34.232 <string>:1(<module>)
    82540    0.223    0.000    1.152    0.000 mzml.py:235(_handle_binary)
    82540    0.223    0.000    0.223    0.000 {built-in method numpy.array}
    82540    0.201    0.000    0.313    0.000 mzml.py:116(_detect_array_name)
  2305771    0.194    0.000    0.194    0.000 {built-in method builtins.isinstance}
    82540    0.182    0.000    0.190    0.000 mzml.py:210(_determine_compression)
  1559546    0.165    0.000    0.165    0.000 {method 'get' of 'dict' objects}
   540438    0.163    0.000    0.298    0.000 structures.py:313(__new__)
   296746    0.147    0.000    0.360    0.000 xml.py:151(str_to_num)
    82540    0.138    0.000    0.146    0.000 mzml.py:190(_determine_array_dtype)
  1446388    0.136    0.000    0.136    0.000 {method 'append' of 'list' objects}
   296746    0.132    0.000    0.491    0.000 xml.py:157(convert_from)
     6185    0.129    0.000    0.130    0.000 dataset.py:466(_fast_reader)
  1391790    0.120    0.000    0.120    0.000 {method 'items' of 'dict' objects}
   296746    0.119    0.000    0.213    0.000 structures.py:268(__new__)
   996346    0.118    0.000    0.118    0.000 {method 'pop' of 'dict' objects}
    82540    0.090    0.000    7.642    0.000 mzmlb.py:325(get)
   495240    0.077    0.000    0.077    0.000 {method 'endswith' of 'str' objects}

@levitsky
Owner

levitsky commented Apr 8, 2021

Nice! Now I see parsing times faster than for mzML, as well. Thank you!

@levitsky levitsky merged commit c1e1502 into levitsky:master Apr 8, 2021