I/O performance improvements #100

iguinn · 2024-07-11T20:42:55Z

Use the low-level h5py API for reading. Note that this required moving the handling of lists of files from _serializers.read.composite to store.read and core.read. Only the _serializers functions use the low-level API; read itself still uses h5py.File to open the file and get the top-level group.
Require h5py >= 3.10. Not sure why, but there is a noticeable performance improvement starting with this version.
Encode string attributes to UTF-8 before writing. This happens implicitly anyway, so I'm not sure why doing it explicitly helps, but it does.
Write files using paged aggregation. This is a setting you choose when opening a file for writing. Right now it is hard-coded to 64 kB pages, which seemed optimal in terms of file size and read speed, although this was only tested on a file with many small datasets. Also use the latest version of the file format for writing.
Among these changes, the low-level API made the biggest difference (a factor of almost 2), while the two write-time changes (string encoding and paged aggregation) combine for an improvement of maybe 1.5 or so. The low-level API helps without reprocessing our files, whereas the write-time changes only take effect once files are rewritten, so they will only be noticed after reprocessing.

codecov · 2024-07-11T20:45:05Z

Codecov Report

Attention: Patch coverage is 76.73267% with 47 lines in your changes missing coverage. Please review.

Project coverage is 74.69%. Comparing base (5ac36c8) to head (c19906b).
Report is 1 commits behind head on main.

Files	Patch %	Lines
...rc/lgdo/lh5/_serializers/read/vector_of_vectors.py	59.09%	9 Missing ⚠️
src/lgdo/lh5/core.py	72.72%	9 Missing ⚠️
src/lgdo/lh5/_serializers/read/ndarray.py	75.86%	7 Missing ⚠️
src/lgdo/lh5/store.py	80.55%	7 Missing ⚠️
src/lgdo/lh5/_serializers/read/composite.py	70.00%	6 Missing ⚠️
src/lgdo/lh5/_serializers/read/utils.py	84.21%	3 Missing ⚠️
src/lgdo/lh5/_serializers/read/encoded.py	83.33%	2 Missing ⚠️
src/lgdo/lh5/exceptions.py	33.33%	2 Missing ⚠️
src/lgdo/lh5/_serializers/read/array.py	88.88%	1 Missing ⚠️
src/lgdo/lh5/_serializers/read/scalar.py	87.50%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
+ Coverage   74.45%   74.69%   +0.23%     
==========================================
  Files          45       45              
  Lines        2815     2908      +93     
==========================================
+ Hits         2096     2172      +76     
- Misses        719      736      +17


gipert · 2024-07-12T12:15:45Z

@oschulz anything to be worried about from the Julia side?

oschulz · 2024-07-12T15:25:52Z

> @oschulz anything to be worried about from the Julia side?

Probably not, but can you give a test output file to @apmypb for testing?

oschulz · 2024-07-12T15:27:45Z

> Probably not, but can you give a test output file to @apmypb for testing?

The best thing would be to replace the HDF5 files (read and write again with these improvements) in the legend-testdata repo as part of a PR there. Then we can test on Julia and finally merge into legend-testdata.

iguinn · 2024-07-20T21:51:50Z

Added a change to read files with locking=False (see #78 (comment))

iguinn and others added 9 commits July 6, 2024 14:31

Replace file/name with hdf5 group/dataset when decoding

5f81458

style: pre-commit fixes

bed10c8

Fixed pre-commit thing

b954b88

Use low level h5py API to read LGDO objects

40526b4

Require h5py>=3.10

f0ac76d

Encode string attributes to utf-8 before writing

69b2a1b

Write files using paged aggregation

cae648e

Pre-commit fixes

0db9fc4

Merge branch 'main' of https://github.com/legend-exp/legend-pydataobj

56e334d

iguinn mentioned this pull request Jul 11, 2024

Increase read speed by x20-100 for most data #78

Open

iguinn and others added 2 commits July 11, 2024 21:14

Fixed bug when reading multiple files

322b250

style: pre-commit fixes

050f964

gipert added the performance (Code performance) and lh5 (HDF5 I/O) labels Jul 12, 2024

Added test_read_multiple_files

ce9c6d0

Read files with file locking off

94037f7

iguinn and others added 4 commits July 25, 2024 09:43

Merge branch 'main' of https://github.com/legend-exp/legend-pydataobj

f035526

Use locking=false for lh5.show

efa2e8e

Bug fix

1b2bef2

style: pre-commit fixes

c19906b
