Improve mzMLb read performance with array cache #38
Conversation
In a very quick test with a file converted from mzML to mzMLb, the updated parser is almost 2x faster than without caching, yet the original mzML file is parsed about 5x faster. I wonder what the difference is.
There's a difference in compression levels between test.mzML and test.mzMLb.
To clarify, I didn't test with test.mzML and test.mzMLb, but rather some arbitrary mzML files of ~100 MB and ~1 GB I had lying around. I thought it might have to do with compression differences in the mzML, but I don't think that affects mzML performance all that much, at least not with zlib. Anyway, you're right that more benchmarking is needed, and it's worth asking whether speed comparisons with mzML parsing even make sense here.
Sounds good to me then. I'll get back to you on benchmarks soon.
Since I saw this wasn't merged yet, I thought I'd add another optimization. I found that even after reducing the overhead of reading chunks of the numerical data arrays, I could further reduce the number of direct reads from the HDF5 file by adding buffering to the XML byte stream, which suffered from the same repeated chunk-decompression overhead, I think.
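To illustrate the buffering idea, here is a minimal sketch of a file-like wrapper that serves an XML parser's many small reads from one large in-memory block, instead of hitting the HDF5 dataset on every read. The class name, `block_size` parameter, and structure are illustrative assumptions, not the actual pyteomics implementation.

```python
import io

import numpy as np


class BufferedDatasetReader(io.RawIOBase):
    """Hypothetical sketch: buffer large blocks of the HDF5 byte dataset
    holding the mzMLb XML, so small parser reads don't each hit HDF5."""

    def __init__(self, dataset, block_size=2 ** 20):
        self.dataset = dataset      # a 1-D uint8 dataset (h5py.Dataset or ndarray)
        self.block_size = block_size
        self.position = 0           # absolute offset of the next unread byte
        self.buffer = b''           # bytes already pulled from the dataset

    def readable(self):
        return True

    def read(self, size=-1):
        total = self.dataset.shape[0]
        if size < 0:
            size = total - self.position
        # Refill with one large dataset read whenever the buffer runs dry.
        while len(self.buffer) < size:
            start = self.position + len(self.buffer)
            if start >= total:
                break
            stop = min(start + self.block_size, total)
            self.buffer += np.asarray(self.dataset[start:stop], dtype=np.uint8).tobytes()
        out, self.buffer = self.buffer[:size], self.buffer[size:]
        self.position += len(out)
        return out
```

With a wrapper like this, the XML parser can read a few bytes at a time while the HDF5 layer only sees one read per block.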
Benchmarked on a full .raw file I had lying around. mzMLb:

In [2]: reader2 = mzmlb.MzMLb("AGP_tryptic_300ng_2microscans_glycoproteomics_nCE_27-30.mzMLb")
In [3]: %%prun
...: for scan in reader2:
...: pass
...:
34527800 function calls (33607805 primitive calls) in 30.914 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
436268/41270 7.102 0.000 24.059 0.001 xml.py:403(_get_info)
6182 4.516 0.001 4.865 0.001 dataset.py:688(__getitem__)
41271 4.324 0.000 30.831 0.001 xml.py:528(_iterfind_impl)
1190912 4.258 0.000 7.041 0.000 xml.py:352(_handle_param)
6508886 2.449 0.000 3.267 0.000 xml.py:51(_local_name)
436268/41270 1.179 0.000 24.241 0.001 mzml.py:268(_get_info_smart)
3219488 1.043 0.000 1.043 0.000 {built-in method __new__ of type object at 0x00007FF9AB5BC810}
6516852 0.820 0.000 0.820 0.000 {method 'rpartition' of 'str' objects}
1190912 0.472 0.000 0.472 0.000 structures.py:345(__new__)
1190912 0.375 0.000 0.969 0.000 structures.py:288(__new__)
82540 0.344 0.000 6.222 0.000 mzmlb.py:226(_handle_binary)
1190912 0.311 0.000 0.311 0.000 xml.py:379(_insert_param)
1 0.287 0.287 31.185 31.185 <string>:1(<module>)
82540 0.228 0.000 1.162 0.000 mzml.py:235(_handle_binary)
82540 0.218 0.000 0.218 0.000 {built-in method numpy.array}
82540 0.196 0.000 0.310 0.000 mzml.py:116(_detect_array_name)
2305750 0.194 0.000 0.194 0.000 {built-in method builtins.isinstance}
82540 0.187 0.000 0.195 0.000 mzml.py:210(_determine_compression)
1559546 0.167 0.000 0.167 0.000 {method 'get' of 'dict' objects}
540438 0.160 0.000 0.296 0.000 structures.py:313(__new__)
82540 0.145 0.000 0.153 0.000 mzml.py:190(_determine_array_dtype)
296746 0.142 0.000 0.372 0.000 xml.py:151(str_to_num)
1446388 0.137 0.000 0.137 0.000 {method 'append' of 'list' objects}
296746 0.125 0.000 0.230 0.000 structures.py:268(__new__)
1391790 0.119 0.000 0.119 0.000 {method 'items' of 'dict' objects}
6182 0.118 0.000 0.119 0.000 dataset.py:466(_fast_reader)
  296746    0.118    0.000    0.490    0.000 xml.py:157(convert_from)

mzML uncompressed (1.5 GB):

In [4]: reader = mzml.MzML("./AGP_tryptic_300ng_2microscans_glycoproteomics_nCE_27-30.mzML")
In [5]: %%prun
...: for scan in reader:
...: pass
...:
25450028 function calls (24494951 primitive calls) in 37.147 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
436268/41270 8.323 0.000 28.340 0.001 xml.py:403(_get_info)
41270 6.870 0.000 7.496 0.000 xml.py:622(_find_by_id_no_reset)
82540 5.872 0.000 5.872 0.000 {built-in method zlib.decompress}
984561 3.647 0.000 5.830 0.000 xml.py:352(_handle_param)
436268/41270 2.487 0.000 28.558 0.001 mzml.py:268(_get_info_smart)
82540 2.266 0.000 2.266 0.000 {built-in method binascii.a2b_base64}
2765038 1.020 0.000 1.020 0.000 {built-in method __new__ of type object at 0x00007FF9AB5BC810}
2841658 0.640 0.000 0.640 0.000 xml.py:51(_local_name)
984561 0.391 0.000 0.391 0.000 structures.py:345(__new__)
984561 0.313 0.000 0.885 0.000 structures.py:288(__new__)
82540 0.290 0.000 9.004 0.000 utils.py:170(decode_data_array)
984561 0.269 0.000 0.269 0.000 xml.py:379(_insert_param)
82540 0.245 0.000 9.923 0.000 mzml.py:235(_handle_binary)
984561 0.225 0.000 0.409 0.000 <string>:1(__new__)
41270 0.223 0.000 0.443 0.000 ntpath.py:450(normpath)
2423094 0.221 0.000 0.221 0.000 {built-in method builtins.isinstance}
82540 0.186 0.000 0.194 0.000 mzml.py:210(_determine_compression)
82540 0.173 0.000 0.266 0.000 mzml.py:116(_detect_array_name)
1515154 0.173 0.000 0.173 0.000 {method 'get' of 'dict' objects}
41270 0.171 0.000 36.872 0.001 file_helpers.py:78(wrapped)
499168 0.160 0.000 0.316 0.000 structures.py:313(__new__)
41270 0.156 0.000 36.367 0.001 xml.py:1107(get_by_id)
123810 0.151 0.000 0.151 0.000 {method 'seek' of '_io.BufferedReader' objects}
296746 0.149 0.000 0.386 0.000 xml.py:151(str_to_num)
   82540    0.147    0.000    0.154    0.000 mzml.py:190(_determine_array_dtype)
And to try out a better compressor, mzMLb with zstd:

In [2]: reader2 = mzmlb.MzMLb("AGP_tryptic_300ng_2microscans_glycoproteomics_nCE_27-30.zstd.mzMLb")
In [3]: %%prun
...: for scan in reader2:
...: pass
...:
34527932 function calls (33607934 primitive calls) in 33.959 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
6185 7.568 0.001 7.940 0.001 dataset.py:688(__getitem__)
436268/41270 7.047 0.000 27.003 0.001 xml.py:403(_get_info)
41271 4.382 0.000 33.879 0.001 xml.py:528(_iterfind_impl)
1190912 4.264 0.000 7.007 0.000 xml.py:352(_handle_param)
6508886 2.438 0.000 3.261 0.000 xml.py:51(_local_name)
436268/41270 1.185 0.000 27.183 0.001 mzml.py:268(_get_info_smart)
3219488 1.040 0.000 1.040 0.000 {built-in method __new__ of type object at 0x00007FF9AB3BC810}
6516855 0.826 0.000 0.826 0.000 {method 'rpartition' of 'str' objects}
1190912 0.440 0.000 0.440 0.000 structures.py:345(__new__)
1190912 0.358 0.000 0.961 0.000 structures.py:288(__new__)
82540 0.344 0.000 9.253 0.000 mzmlb.py:226(_handle_binary)
1190912 0.305 0.000 0.305 0.000 xml.py:379(_insert_param)
1 0.285 0.285 34.232 34.232 <string>:1(<module>)
82540 0.223 0.000 1.152 0.000 mzml.py:235(_handle_binary)
82540 0.223 0.000 0.223 0.000 {built-in method numpy.array}
82540 0.201 0.000 0.313 0.000 mzml.py:116(_detect_array_name)
2305771 0.194 0.000 0.194 0.000 {built-in method builtins.isinstance}
82540 0.182 0.000 0.190 0.000 mzml.py:210(_determine_compression)
1559546 0.165 0.000 0.165 0.000 {method 'get' of 'dict' objects}
540438 0.163 0.000 0.298 0.000 structures.py:313(__new__)
296746 0.147 0.000 0.360 0.000 xml.py:151(str_to_num)
82540 0.138 0.000 0.146 0.000 mzml.py:190(_determine_array_dtype)
1446388 0.136 0.000 0.136 0.000 {method 'append' of 'list' objects}
296746 0.132 0.000 0.491 0.000 xml.py:157(convert_from)
6185 0.129 0.000 0.130 0.000 dataset.py:466(_fast_reader)
1391790 0.120 0.000 0.120 0.000 {method 'items' of 'dict' objects}
296746 0.119 0.000 0.213 0.000 structures.py:268(__new__)
996346 0.118 0.000 0.118 0.000 {method 'pop' of 'dict' objects}
82540 0.090 0.000 7.642 0.000 mzmlb.py:325(get)
  495240    0.077    0.000    0.077    0.000 {method 'endswith' of 'str' objects}
Nice! Now I see parsing times faster than for mzML as well. Thank you!
Sure enough, the feature was merged, and then I started doing more serious performance testing. It appears that h5py, for simplicity, does not attempt serious in-memory caching of decompressed chunks, so every read of a slice of a dataset decompresses a whole chunk and then takes the requested slice on demand. This makes sequential reading of an mzMLb file orders of magnitude slower than reading the equivalent compressed mzML. This feature adds an extra intermediate caching layer to the MzMLb object, which keeps a chunk of each array in memory under the assumption that the next read will fall in the same region. This boosts performance to roughly twice as fast as the equivalent mzML file.
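The caching layer described above can be sketched as a small wrapper that keeps the most recently read block of each dataset in memory, so consecutive slice reads in the same region skip the decompress-and-slice round trip. The names (`ChunkCache`, `block_size`) and the exact policy are illustrative assumptions, not the actual pyteomics code.

```python
import numpy as np


class ChunkCache:
    """Hypothetical sketch of a per-dataset read cache: hold one contiguous
    block in memory and serve slice reads from it while they stay in range."""

    def __init__(self, dataset, block_size=2 ** 20):
        self.dataset = dataset        # an h5py.Dataset (or any sliceable array)
        self.block_size = block_size  # minimum block to fetch, in elements
        self.start = 0                # absolute bounds of the cached block
        self.end = 0
        self.buffer = None

    def read(self, offset, length):
        end = offset + length
        # Cache miss: fetch one large block starting at the requested offset,
        # paying the chunk decompression cost once for the whole region.
        if self.buffer is None or offset < self.start or end > self.end:
            self.start = offset
            self.end = min(offset + max(length, self.block_size),
                           self.dataset.shape[0])
            self.buffer = self.dataset[self.start:self.end]
        # Cache hit: slice the in-memory block, no HDF5 read at all.
        return self.buffer[offset - self.start:end - self.start]
```

Since spectra are read in order, the next array slice almost always lands in the block fetched for the previous one, which is why a single cached region per dataset is enough to recover the performance.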