Benchmark with h5 package #92

Open · swissr opened this issue Feb 28, 2018 · 5 comments

swissr commented Feb 28, 2018

I noticed that reading HDF5 files with the hdf5r package is much slower than with the deprecated h5 package (16 times slower in the example below).

Am I doing something wrong?

   test replications elapsed relative
1    h5           10   0.041    1.000
2 hdf5r           10   0.662   16.146

MWE:

# create hdf5 file (6 vectors with 10k random numbers each)

h5file <- hdf5r::H5File$new("testdata.h5", "w")
for (i in paste0("vector", 1:6)) {
    h5file[[i]] <- runif(10000)
}
h5file$close_all()

# compare read speed when using h5 and hdf5r package

read_h5 <- function(file) {
    h5file <- h5::h5file(file, "r")
    sets <- h5::list.datasets(h5file)
    result <- lapply(sets, function(i) h5file[i][])
    h5::h5close(h5file)
    result
}
read_hdf5r <- function(file) {
    h5file <- hdf5r::H5File$new(file, "r")
    sets <- h5file$ls()$name
    result <- lapply(sets, function(i) h5file[[i]][])
    h5file$close_all()
    result
}
rbenchmark::benchmark(
    replications = 10,
    "h5"     = read_h5("testdata.h5"),
    "hdf5r"  = read_hdf5r("testdata.h5"))[,1:4]
hhoeflin (Owner) commented

Hi, no, you are not doing anything wrong. It is a known issue that I unfortunately haven't had time to dig into recently.

mannau (Collaborator) commented Feb 28, 2018

Hi Hans-Peter,
A quick profiling exercise suggests that most of the time (80%) is spent in the $close_all() call:

Rprof("issue-92-hdf5r.out", line.profiling=TRUE)
h5file <- hdf5r::H5File$new("testdata.h5", "r")
sets <- h5file$ls()$name
result <- lapply(sets, function(i) h5file[[i]][])
h5file$close_all()
Rprof(NULL)
summaryRprof("issue-92-hdf5r.out", lines = "show")
$by.total
                       total.time total.pct self.time self.pct
R6Classes_H5File.R#260       0.08        80      0.08       80
R6Classes.R#43               0.02        20      0.02       20
#1                           0.02        20      0.00        0
Common_functions.R#99        0.02        20      0.00        0
high_level_UI.R#81           0.02        20      0.00        0
R6Classes_H5D.R#109          0.02        20      0.00        0
R6Classes_H5Group.R#95       0.02        20      0.00        0

swissr (Author) commented Mar 1, 2018

> most of the time (80%) is spent in the $close_all() call

Yes, exactly.

When I commented out close_all() in a real-life script where hundreds of such HDF5 files are read and processed, hdf5r took about twice as long as h5. In the example above, commenting out close_all() drops the hdf5r time to about a quarter (but it is still roughly 5 times slower than h5).

(Spurred by #86, I tried an hdf5r branch where the gc() in close_all() is commented out. I got an error, though, and stopped, since I don't have much low-level HDF5 background.)
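
For reference, a possible workaround sketch along these lines (an assumption, not code from this thread, and the function name is hypothetical): close each dataset handle explicitly after reading, then use the file's plain $close(), which, unlike $close_all(), should not trigger a gc().

# Hypothetical helper (a sketch, not from this thread): close each
# dataset handle right after reading, then use the file's plain $close(),
# which should avoid the gc() that $close_all() performs.
read_hdf5r_noclose_all <- function(file) {
    h5file <- hdf5r::H5File$new(file, "r")
    sets <- h5file$ls()$name
    result <- lapply(sets, function(i) {
        dset <- h5file[[i]]   # open the dataset handle
        out <- dset[]         # read the full dataset into memory
        dset$close()          # release the handle immediately
        out
    })
    h5file$close()            # plain close, no gc()
    result
}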

mkoohafkan commented

Yep, I observed similar issues and posted issue #85. I put up some benchmarks in that issue, too.

JZL commented Feb 3, 2022

I know this is an old issue, but I found similar results regarding close_all()'s gc() call. I was using hdf5r as part of Seurat (a bioinformatics package) to load a moderate number of h5 files and found it was spending far more time in gc() than loading the data.

I removed that line in close_all() and saw a demonstrable speedup. This aligns with what I've seen with gc() in general: it is often very fast, but deeply nested, large objects (30 GB+) can really slow down gc() and object.size(), which is exacerbated by opening many h5 files. I haven't yet seen any corruption or errors with the gc() commented out, so it could be worth revisiting and/or putting it behind a flag.

I don't fully understand what the gc() does, but going off of 'If not all objects in a file are closed, the file remains open and cannot be re-opened the regular way', maybe removing the gc() would lead to errors if I tried to reopen a file? In that case, letting the user open many files at once and then close them all at once could solve both problems, with some added complexity (see the sketch below).
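
A minimal sketch of that "close all at once" idea (hypothetical usage built on the read_hdf5r_noclose_all() sketch above, not an existing hdf5r flag): amortize the collection cost by running gc() once for the whole batch.

# Hypothetical batching sketch: read many files with explicit per-handle
# closes (no gc() per file), then trigger a single gc() for the batch.
files <- list.files("data", pattern = "\\.h5$", full.names = TRUE)
results <- lapply(files, read_hdf5r_noclose_all)
gc()  # one collection at the end instead of one inside every close_all()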
