significant performance loss after transitioning from h5 package #85

Open
mkoohafkan opened this issue Nov 17, 2017 · 9 comments

@mkoohafkan
I'm seeing a significant performance loss after switching from h5 to hdf5r. With h5, I was able to use the lower-level openDataSet and readDataSet commands, which were significantly faster than accessing datasets with [[. Using the hdf5r object methods open() and read() is faster than [[, but still much slower than h5.

Consider the following microbenchmark test (sample HDF file available here):

# read a dataset via hdf5r's R6 methods
get_dataset_hdf5r = function(f, table.path) {
  x = hdf5r::H5File$new(f)
  g = x$open(table.path)
  res = g$read()
  g$close()
  x$close()
  res
}

# read the same dataset via h5's lower-level functions
get_dataset_h5 = function(f, table.path, type = "double") {
  x = h5::h5file(f)
  g = h5::openDataSet(x, table.path, type)
  res = h5::readDataSet(g)
  h5::h5close(g)
  h5::h5close(x)
  res
}

myfile = system.file("sample-data/SampleQuasiUnsteady.hdf", package = "RAStestR")
mytable =  "Results/Sediment/Output Blocks/Sediment/Sediment Time Series/Cross Sections/Vol Bed Change Cum"

library(microbenchmark)

microbenchmark(
  get_dataset_hdf5r(myfile, mytable),
  get_dataset_h5(myfile, mytable)
)

My results:

Unit: milliseconds
                               expr      min       lq      mean   median       uq       max neval
 get_dataset_hdf5r(myfile, mytable) 4.053642 4.197638 11.082077 4.313332 4.538343 552.76297   100
    get_dataset_h5(myfile, mytable) 1.606342 1.670565  2.271489 1.738831 1.833065  46.95775   100

This ignores the additional cost of transposing the result of the hdf5r method to match the structure of h5 outputs.
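
For reference, matching the h5 layout would require something like the following on every read (assuming the hdf5r result is simply the transpose of the h5 result):

res = get_dataset_hdf5r(myfile, mytable)
res = t(res)  # extra allocation and copy just to match the h5 layout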

@mkoohafkan (Author)

Actually, using [[ in hdf5r appears to be slightly faster than the open/read methods, but still slower than h5:

get_dataset_hdf5r_b = function(f, table.path) {
  x = hdf5r::H5File$new(f)
  res = x[[table.path]][,]
  x$close()
  res
}

microbenchmark(
  get_dataset_hdf5r_b(myfile, mytable),
  get_dataset_hdf5r(myfile, mytable),
  get_dataset_h5(myfile, mytable)
)
Unit: milliseconds
                                 expr      min       lq     mean   median       uq      max neval
 get_dataset_hdf5r_b(myfile, mytable) 3.162613 3.364921 3.664380 3.446250 3.592112 8.197476   100
   get_dataset_hdf5r(myfile, mytable) 3.240986 3.455113 3.905648 3.569875 3.758032 8.045706   100
      get_dataset_h5(myfile, mytable) 1.596080 1.676941 1.799975 1.733855 1.822180 5.037663   100

@hhoeflin (Owner) commented Nov 17, 2017

Hi Michael,

I assume the performance penalty comes from creating and destroying the R6 objects that represent the datasets and the file. If you open very many small datasets across different files, thousands of times, there will indeed be a significant performance penalty; the same happens if you open datasets very often and then discard the corresponding R6 pointer to the dataset. The package was intended for opening large datasets and reading large amounts of data from them.
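
As a rough illustration of that cost (a toy sketch, not hdf5r code), creating an R6 object is far more expensive than creating a plain R structure:

library(R6)
library(microbenchmark)

# toy class standing in for the per-dataset R6 wrappers
Toy <- R6Class("Toy", public = list(
  id = NULL,
  initialize = function(id) self$id <- id
))

microbenchmark(
  Toy$new(1L),   # full R6 instantiation (allocates and populates a new environment)
  list(id = 1L)  # bare R structure, for comparison
)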

There is a way to handle this by working with the raw ids, but it is a bit tricky, as you then have to take care of closing them yourself.

I assume this is an artificial example? Can you explain a little more about the use case where this performance penalty becomes noticeable?

Thanks

@mkoohafkan (Author) commented Nov 17, 2017

Another benchmark: it doesn't look like opening/closing the file is the bottleneck:

read_dataset_hdf5r = function(x, table.path) {
  x[[table.path]][,]
}
read_dataset_h5 = function(x, table.path, type = "double") {
  g = h5::openDataSet(x, table.path, type)
  res = h5::readDataSet(g)
  h5::h5close(g)
  res
}

file_h5 = h5::h5file(myfile)
file_hdf5r = hdf5r::H5File$new(myfile)

microbenchmark(
  read_dataset_hdf5r(file_hdf5r, mytable),
  read_dataset_h5(file_h5, mytable)
)

file_hdf5r$close()
h5::h5close(file_h5)

Results:

Unit: microseconds
                                    expr      min       lq      mean   median        uq       max neval
 read_dataset_hdf5r(file_hdf5r, mytable) 1513.663 1571.355 1816.8625 1616.139 1666.3670 10526.595   100
       read_dataset_h5(file_h5, mytable)  290.791  317.848  381.4583  350.193  375.3845  2499.861   100

@mkoohafkan (Author) commented Nov 17, 2017

I have a package for reading HDF5 outputs from the HEC-RAS software. The main usage modes are (1) dynamically accessing multiple tables from a single HDF5 file for exploratory analysis of results, and (2) comparing outputs of multiple similarly-structured HDF5 files. In general, this means reading a few small tables (output metadata) and a few larger tables (actual results). I expect my users to be fairly unfamiliar with HDF5 files, so I have structured the package so that HDF5 file connections are managed for the user (i.e. users pass in filenames rather than HDF5 objects, and files are opened and closed within each function).
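
Roughly, the user-facing pattern looks like this (simplified sketch; read_ras_table is a made-up name):

# users pass a filename; the function manages the HDF5 connection itself
read_ras_table = function(filename, table.path) {
  x = hdf5r::H5File$new(filename, mode = "r")
  on.exit(x$close_all(), add = TRUE)  # close the file and anything left open in it
  x[[table.path]]$read()
}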

I initially assumed that opening/closing was a bottleneck as well, and I have already put some time into optimizing my package to limit the number of open/close actions, but the performance hit is still much larger than expected (and, frankly, too large for my use case).

I would be interested in working with the raw IDs if that would actually get performance closer to h5 levels. However, I do plan to publish my package, which means I need to rely on exported methods from hdf5r in order to pass CRAN checks.
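
(For what it's worth, it is easy to check what hdf5r actually exports, and is therefore safe to call from a CRAN package:)

# anything not in this list would need :::, which R CMD check flags
exports = getNamespaceExports("hdf5r")
grep("^H5", sort(exports), value = TRUE)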

@hhoeflin (Owner) commented Nov 17, 2017

Reading this, the benchmarks suggest that both hdf5r and h5 spend roughly 80% of their time opening and closing (for h5, mostly the file itself: ~1.73 ms for the full call vs ~0.35 ms once the file is already open). For hdf5r, opening the dataset is another significant bottleneck.

I will have to look into this use case to see if this can be made faster.

@mkoohafkan (Author)

Some additional benchmarks. First, opening and closing a dataset in an already-open file:

library(microbenchmark)

myfile = system.file("sample-data/SampleQuasiUnsteady.hdf", package = "RAStestR")
mytable =  "Results/Sediment/Output Blocks/Sediment/Sediment Time Series/Cross Sections/Vol Bed Change Cum"

h5_openclose = function() {
  table_h5 = h5::openDataSet(file_h5, mytable)
  h5::h5close(table_h5)
}
hdf5r_openclose = function() {
  table_hdf5r = file_hdf5r$open(mytable)
  table_hdf5r$close()
}

file_h5 = h5::h5file(myfile)
file_hdf5r = hdf5r::H5File$new(myfile)

microbenchmark(
  h5_openclose(),
  hdf5r_openclose()
)
Unit: microseconds
              expr      min        lq      mean    median        uq      max neval
    h5_openclose()  176.030  190.4915  242.0152  215.0605  233.8765 2144.997   100
 hdf5r_openclose() 1065.191 1105.1555 1226.5023 1138.4320 1175.4425 3720.234   100

Second, reading from already-opened datasets:

table_h5 = h5::openDataSet(file_h5, mytable)
table_hdf5r = file_hdf5r$open(mytable)

microbenchmark(
  table_hdf5r$read(),
  h5::readDataSet(table_h5)
)
Unit: microseconds
                      expr     min       lq     mean   median       uq      max neval
        table_hdf5r$read() 569.139 592.1535 611.7184 601.3275 616.2565 1083.851   100
 h5::readDataSet(table_h5)  84.905  97.8120 110.3982 113.8290 117.7160  266.843   100

@hhoeflin (Owner)

Thanks for the benchmarks. I will have a look at them soon.

@hhoeflin hhoeflin self-assigned this Nov 28, 2017
@mkoohafkan (Author)

Any updates on this?

@mannau (Collaborator) commented Oct 20, 2018

Hi Michael,
I just revisited the issue and found that:

  1. Opening datasets in hdf5r is significantly slower than in h5. As Holger already mentioned, R6 object instantiation in H5GTD_factory seems to be the reason:
     return(H5GTD_factory(oid))
  2. Reading datasets is also slower, but I need to take a closer look to find out what is happening there.

There are no clear solutions yet. To address 1., my ideas would be to either:

  1.1) Create something like dataset collection objects (e.g. instantiated with a single [) that do not instantiate each dataset separately, and implement e.g. iterators over them (a rough approximation of the idea follows below).
  1.2) Switch from R6 to S3.

I would clearly prefer 1.1.
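
Until then, the closest approximation with the current exported API is to batch reads against a single open file (sketch only; read_all is hypothetical):

# rough prototype of a "dataset collection": one open file handle, lazy reads
read_all = function(x, paths) {
  setNames(lapply(paths, function(p) {
    d = x[[p]]           # still creates one R6 object per dataset for now;
    on.exit(d$close())   # the proposed design would avoid this via raw ids
    d$read()
  }), paths)
}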

Cheers,
mario
