I/O performance improvements #100

Open: wants to merge 17 commits into main

iguinn (Contributor) commented Jul 11, 2024

  • Use the low-level h5py API for reads. As a result, the handling of lists of files moved from _serializers.read.composite to store.read and core.read. Only the _serializers functions use the low-level API; read itself still uses h5py.File to open the file and get the top-level group.
  • Require h5py >= 3.10. Not sure why, but there is a noticeable performance improvement starting at this version.
  • Encode string attributes to UTF-8 before writing. This happens implicitly anyway, so I'm not sure why doing it explicitly helps, but it does.
  • Write files using paged aggregation. This is a setting you choose when opening a file for writing. Right now it is hard-coded to use 64 kB pages, which seemed optimal in terms of file size and read speed, although this was only tested on a file with many small datasets. Also use the latest file format version for writing.

Among these changes, the low-level API made the biggest difference (a factor of almost 2), while the two write-time changes (UTF-8 encoding and paged aggregation) combine for an improvement of maybe a factor of 1.5 or so. The low-level API helps without reprocessing our files, whereas the write-time changes will only be noticed after reprocessing.

codecov bot commented Jul 11, 2024

Codecov Report

Attention: Patch coverage is 76.73267% with 47 lines in your changes missing coverage. Please review.

Project coverage is 74.69%. Comparing base (5ac36c8) to head (c19906b).
Report is 1 commit behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...rc/lgdo/lh5/_serializers/read/vector_of_vectors.py | 59.09% | 9 Missing ⚠️ |
| src/lgdo/lh5/core.py | 72.72% | 9 Missing ⚠️ |
| src/lgdo/lh5/_serializers/read/ndarray.py | 75.86% | 7 Missing ⚠️ |
| src/lgdo/lh5/store.py | 80.55% | 7 Missing ⚠️ |
| src/lgdo/lh5/_serializers/read/composite.py | 70.00% | 6 Missing ⚠️ |
| src/lgdo/lh5/_serializers/read/utils.py | 84.21% | 3 Missing ⚠️ |
| src/lgdo/lh5/_serializers/read/encoded.py | 83.33% | 2 Missing ⚠️ |
| src/lgdo/lh5/exceptions.py | 33.33% | 2 Missing ⚠️ |
| src/lgdo/lh5/_serializers/read/array.py | 88.88% | 1 Missing ⚠️ |
| src/lgdo/lh5/_serializers/read/scalar.py | 87.50% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
+ Coverage   74.45%   74.69%   +0.23%     
==========================================
  Files          45       45              
  Lines        2815     2908      +93     
==========================================
+ Hits         2096     2172      +76     
- Misses        719      736      +17     


gipert added the performance (Code performance) and lh5 (HDF5 I/O) labels Jul 12, 2024

gipert (Member) commented Jul 12, 2024

@oschulz anything to be worried about from the Julia side?

oschulz commented Jul 12, 2024

> @oschulz anything to be worried about from the Julia side?

Probably not, but can you give a test output file to @apmypb for testing?

oschulz commented Jul 12, 2024

> Probably not, but can you give a test output file to @apmypb for testing?

The best thing would be to replace the HDF5 files in the legend-testdata repo (read them and write them again with these improvements) as part of a PR there. Then we can test on the Julia side and finally merge into legend-testdata.

iguinn (Contributor, Author) commented Jul 20, 2024

Added a change to read files with locking=False (see #78 (comment))

Labels: lh5 (HDF5 I/O), performance (Code performance)

3 participants