Benchmark with h5 package #92

Open · swissr opened this issue Feb 28, 2018 · 5 comments

swissr commented Feb 28, 2018

I noticed that reading HDF5 files with the hdf5r package is much slower than with the deprecated h5 package (16 times slower in the example below).

Am I doing something wrong?

   test replications elapsed relative
1    h5           10   0.041    1.000
2 hdf5r           10   0.662   16.146

MWE:

# create hdf5 file (6 vectors with 10k random numbers each)

h5file <- hdf5r::H5File$new("testdata.h5", "w")
for (i in paste0("vector", 1:6)) {
    h5file[[i]] <- runif(10000)
}
h5file$close_all()

# compare read speed when using h5 and hdf5r package

read_h5 <- function(file) {
    h5file <- h5::h5file(file, "r")
    sets <- h5::list.datasets(h5file)
    result <- lapply(sets, function(i) h5file[i][])
    h5::h5close(h5file)
    result
}
read_hdf5r <- function(file) {
    h5file <- hdf5r::H5File$new(file, "r")
    sets <- h5file$ls()$name
    result <- lapply(sets, function(i) h5file[[i]][])
    h5file$close_all()
    result
}
rbenchmark::benchmark(
    replications = 10,
    "h5"     = read_h5("testdata.h5"),
    "hdf5r"  = read_hdf5r("testdata.h5"))[,1:4]
hhoeflin (Owner) commented

Hi, no, you are not doing anything wrong. It is a known issue that I unfortunately haven't had time to dig into recently.

mannau (Collaborator) commented Feb 28, 2018

Hi Hans-Peter,
A quick profiling exercise suggests that most of the time (80%) is spent in the $close_all() call:

Rprof("issue-92-hdf5r.out", line.profiling=TRUE)
h5file <- hdf5r::H5File$new("testdata.h5", "r")
sets <- h5file$ls()$name
result <- lapply(sets, function(i) h5file[[i]][])
h5file$close_all()
Rprof(NULL)
summaryRprof("issue-92-hdf5r.out", lines = "show")
$by.total
                       total.time total.pct self.time self.pct
R6Classes_H5File.R#260       0.08        80      0.08       80
R6Classes.R#43               0.02        20      0.02       20
#1                           0.02        20      0.00        0
Common_functions.R#99        0.02        20      0.00        0
high_level_UI.R#81           0.02        20      0.00        0
R6Classes_H5D.R#109          0.02        20      0.00        0
R6Classes_H5Group.R#95       0.02        20      0.00        0

swissr (Author) commented Mar 1, 2018

> most of the time (80%) is spent in the $close_all() call

Yes, exactly.

When I commented out close_all() in a real-life script where hundreds of such HDF5 files are read and processed, hdf5r took about twice as long as h5. In the example above, commenting out close_all() drops the hdf5r time to about a quarter (but it is still roughly 5 times slower than h5).

(Spurred by #86, I tried an hdf5r branch where the gc() in close_all() is commented out. I got an error, though, and stopped, since I don't have much low-level HDF5 background.)
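
For reference, a possible workaround sketch along these lines (an assumption, not code from this thread, and the function name is hypothetical): close each dataset handle explicitly after reading, then use the file's plain $close(), which, unlike $close_all(), should not trigger a gc().

# Hypothetical helper (a sketch, not from this thread): close each
# dataset handle right after reading, then use the file's plain $close(),
# which should avoid the gc() that $close_all() performs.
read_hdf5r_noclose_all <- function(file) {
    h5file <- hdf5r::H5File$new(file, "r")
    sets <- h5file$ls()$name
    result <- lapply(sets, function(i) {
        dset <- h5file[[i]]   # open the dataset handle
        out <- dset[]         # read the full dataset into memory
        dset$close()          # release the handle immediately
        out
    })
    h5file$close()            # plain close, no gc()
    result
}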

mkoohafkan commented

Yep, I observed similar issues and posted issue #85. I put up some benchmarks in that issue, too.

JZL commented Feb 3, 2022

I know this is an old issue, but I found similar results regarding close_all()'s gc() call. I was using hdf5r as part of Seurat (a bioinformatics package) to load a moderate number of h5 files and found it was spending far more time in gc() than loading the data.

I removed that line in close_all() and saw a demonstrable speedup. This aligns with what I've seen with gc() in general: it is often very fast, but deeply nested, large objects (30 GB+) can really slow down gc() and object.size(), which is exacerbated by opening many h5 files. I haven't yet seen any corruption or errors with the gc() commented out, so it could be worth revisiting and/or putting it behind a flag.

I don't fully understand what the gc() does, but going off of 'If not all objects in a file are closed, the file remains open and cannot be re-opened the regular way', maybe removing the gc() would lead to errors if I tried to reopen a file? In that case, letting the user open many files at once and then close them all at once could solve both problems, with some added complexity (see the sketch below).
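
A minimal sketch of that "close all at once" idea (hypothetical usage built on the read_hdf5r_noclose_all() sketch above, not an existing hdf5r flag): amortize the collection cost by running gc() once for the whole batch.

# Hypothetical batching sketch: read many files with explicit per-handle
# closes (no gc() per file), then trigger a single gc() for the batch.
files <- list.files("data", pattern = "\\.h5$", full.names = TRUE)
results <- lapply(files, read_hdf5r_noclose_all)
gc()  # one collection at the end instead of one inside every close_all()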
