significant performance loss after transitioning from h5 package #85
Hi Michael,

I assume the performance penalty is from creating and destroying the R6 objects representing the datasets and the file object. If you open very many small datasets in different files thousands of times, there will indeed be a significant performance penalty. This can also occur if you open datasets very often and then discard the corresponding R6 pointer to the dataset. The package was intended for opening large datasets and reading large amounts of data from them.

There is a way to handle this by working with the raw ids, which is a bit tricky as you have to take care of closing them yourself.

I assume this is an artificial example? Can you explain a little more about your use case where this performance penalty becomes felt?

Thanks
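A minimal sketch of the cheaper pattern the comment describes: keep one `H5File` handle open and reuse it across reads, so only the dataset objects are created and destroyed per call. The file name and dataset path are placeholders, not from the original example.

```r
library(hdf5r)

# Hypothetical file and dataset names -- substitute your own.
f <- H5File$new("results.h5", mode = "r")

# Reuse the open file handle for many reads instead of re-opening
# the file each time; only the dataset object is created per call.
read_table <- function(file, path) {
  ds <- file[[path]]
  on.exit(ds$close())  # close the dataset id deterministically
  ds$read()
}

x <- read_table(f, "Results/table")
f$close_all()          # release the file and any remaining open ids
```

The `on.exit(ds$close())` call is the "take care of closing them yourself" part: without it, the dataset id stays open until the R6 object is garbage-collected.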
Another benchmark. It doesn't look like opening/closing is the bottleneck:

Results:
I have a package for reading HDF5 outputs from the HEC-RAS software package. The main usage modes are (1) dynamically accessing multiple tables from a single HDF file for exploratory analysis of results, and (2) comparing outputs of multiple similarly-structured HDF files. In general, this means reading a few small tables (for output metadata) and a few larger tables (for actual results).

I expect my users to be fairly unfamiliar with working with HDF files, so I have structured the package so that HDF file connections are managed for the user (i.e. users pass in filenames rather than HDF5 objects, and the files are opened/closed within the function). I initially assumed that opening/closing was a bottleneck as well, and have already put some time into optimizing my package to limit the number of open/close actions, but the performance hit is still much larger than expected (and frankly, too large for my use case).

I would be interested in working with the raw IDs if that will actually get performance closer to `h5`.
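One way to keep the filename-based API described above while amortizing open/close costs is a per-session handle cache keyed by normalized path. This is a sketch under assumed names (`.h5_cache`, `get_h5`, `close_h5_cache` are illustrative, not from any package):

```r
library(hdf5r)

# Cache of open H5File handles, keyed by normalized file path.
.h5_cache <- new.env(parent = emptyenv())

get_h5 <- function(path) {
  key <- normalizePath(path)
  if (!exists(key, envir = .h5_cache)) {
    # Open once; later calls with the same path reuse the handle.
    assign(key, H5File$new(key, mode = "r"), envir = .h5_cache)
  }
  get(key, envir = .h5_cache)
}

close_h5_cache <- function() {
  # Release every cached handle and its remaining open ids.
  for (key in ls(.h5_cache)) get(key, envir = .h5_cache)$close_all()
  rm(list = ls(.h5_cache), envir = .h5_cache)
}
```

Users still pass filenames; the package decides when handles are actually closed (e.g. via `close_h5_cache()` or a finalizer).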
Reading this, the benchmarks suggest both hdf5r and h5 spend roughly 80% of their time opening and closing the file. For hdf5r, opening the dataset is another bottleneck. I will have to look into this use case to see if it can be made faster.
Some additional benchmarks. First, benchmarks for opening/closing a table in an already-opened file:

Second, benchmarks for reading already-opened tables:
Thanks for the benchmarks. I will have a look at them soon.
Any updates on this?
Hi Michael,

To me there are no clear solutions yet. To address 1., my ideas would be to

Cheers,
I'm seeing significant performance loss after switching from `h5` to `hdf5r`. With `h5`, I was able to use the lower-level `openDataSet` and `readDataSet` commands, which were significantly faster than accessing with `[[`. Using the `hdf5r` object methods `open()` and `read()` is faster than `[[` but still much slower than `h5`.

Consider the following microbenchmark test (sample HDF file available here):

My results:

This ignores the additional cost of transposing the result of the `hdf5r` method to match the structure of `h5` outputs.