Skip to content

Commit

Permalink
Try to iron kinks in selective vignette code
Browse files Browse the repository at this point in the history
  • Loading branch information
grimbough committed Feb 6, 2024
1 parent c5539ba commit 9f39273
Showing 1 changed file with 11 additions and 5 deletions.
16 changes: 11 additions & 5 deletions vignettes/practical_tips.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,8 @@ library(rhdf5)
library(dplyr)
library(ggplot2)
library(BiocParallel)
run_on_platform <- (Sys.info()['sysname'] == "Linux")
```

# Introduction
Expand Down Expand Up @@ -94,13 +96,13 @@ However, things get a little more tricky if you want an irregular selection of d
```{r, chooseColunns, eval = TRUE}
columns <- sample(x = seq_len(20000), size = 1000, replace = FALSE) %>%
sort()
```

```{r singleReads, eval = (Sys.info()['sysname'] == "Linux"), cache = TRUE}
f1 <- function(cols, name) {
h5read(file = ex_file, name = name,
index = list(NULL, cols))
}
```

```{r singleReads, eval = run_on_platform, cache = TRUE}
system.time(res4 <- vapply(X = columns, FUN = f1,
FUN.VALUE = integer(length = 100),
name = 'counts'))
Expand Down Expand Up @@ -172,7 +174,7 @@ If we were stuck with the single-chunk dataset and want to minimise the number o

Unfortunately, this approach doesn't scale very well. This is because creating unions of hyperslabs is currently very slow in HDF5 (see [Union of non-consecutive hyperslabs is very slow](https://forum.hdfgroup.org/t/union-of-non-consecutive-hyperslabs-is-very-slow/5062) for another report of this behaviour), with the performance penalty increasing exponentially relative to the number of unions. The plot below shows the the exponential increase in time as the number of hyberslab unions increases.

```{r, hyperslab-benchmark, eval = (Sys.info()['sysname'] == "Linux"), echo = FALSE, fig.width=6, fig.height=3, fig.wide = TRUE, fig.cap='The time taken to join hyperslabs increases expontentially with the number of join operations. These timings are taken with no reading occuring, just the creation of a dataset selection.'}
```{r, hyperslab-benchmark, eval = run_on_platform, echo = FALSE, fig.width=6, fig.height=3, fig.wide = TRUE, fig.cap='The time taken to join hyperslabs increases expontentially with the number of join operations. These timings are taken with no reading occuring, just the creation of a dataset selection.'}
## this code demonstrates the exponential increase in time as the
## number of hyberslab unions increases
Expand Down Expand Up @@ -342,7 +344,7 @@ split_and_gather(tempfile(), input_dsets = dsets,

Below we can see some timings comparing calling `simple_writer()` with `split_and_gather()` using 1, 2, and 4 cores.

```{r, run-writing-benchmark, eval = (Sys.info()['sysname'] == "Linux"), cache = TRUE, message=FALSE, echo=FALSE}
```{r, run-writing-benchmark, eval = run_on_platform, cache = TRUE, message=FALSE, echo=FALSE}
bench_results <- bench::mark(
"simple writer" = simple_writer(file_name = tempfile(), dsets = dsets),
"split/gather - 1 core" = split_and_gather(tempfile(), input_dsets = dsets,
Expand All @@ -356,6 +358,10 @@ bench_results <- bench::mark(
bench_results |> select(expression, min, median)
```

```{r, fake-timings, eval = !run_on_platform, message = FALSE, echo = FALSE}
bench_results <- data.frame(median = 4:1)
```

We can see from our benchmark results that there is some performance improvement to be achieved by using the parallel approach. Based on the median times of out three iterations using two cores sees an speedup of `r round(bench_results$median[1] / bench_results$median[3], 2)` and `r round(bench_results$median[1] / bench_results$median[4], 1)` with 4 cores. This isn't quite linear, presumably because there are overheads involved both in using a two-step process and initialising the parallel workers, but it is a noticeable improvement.

# Session info {.unnumbered}
Expand Down

0 comments on commit 9f39273

Please sign in to comment.