Speed up HDF5 processing #110

emilmelnikov · 2023-07-28T12:59:53Z

Previously, HDF5 datasets were always read and written plane-wise (XY). Also, pixel values were always boxed when reading. This resulted in sub-optimal performance, especially when the source HDF5 did not have XY as its fastest-changing dimensions.

The new reader implementation always reads data either in native HDF5 chunks, or in the largest possible contiguous blocks, and pixel values are never boxed. Moreover, if the source image is not chunked and sufficiently small (element count is less than 2^31), a simpler ArrayImg (just a simple wrapper around a primitive array) is used instead of CellImg (ND grid of ArrayImg).
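A minimal sketch of the storage decision described above, in plain Java without imglib2. The class `StorageChoice`, the `Backing` enum, and both method names are hypothetical and only illustrate the rule "not chunked and fewer than 2^31 elements"; the actual reader may decide differently in edge cases.

```java
/** Hypothetical helper; not part of the ilastik4ij API. */
final class StorageChoice {
    enum Backing { ARRAY_IMG, CELL_IMG }

    /** Multiplies all dimensions; throws ArithmeticException on long overflow. */
    static long elementCount(long[] dims) {
        long n = 1;
        for (long d : dims) {
            n = Math.multiplyExact(n, d);
        }
        return n;
    }

    /** Picks a single flat array only for unchunked datasets below 2^31 elements. */
    static Backing choose(long[] dims, boolean chunked) {
        boolean fitsInOneArray = elementCount(dims) < (1L << 31);
        return (!chunked && fitsInOneArray) ? Backing.ARRAY_IMG : Backing.CELL_IMG;
    }
}
```

For example, an unchunked 512×512×100 dataset would map to `ARRAY_IMG`, while a 65536×65536 one (2^32 elements) would fall back to `CELL_IMG`.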

The writer still deals with boxed pixels: we need to be able to write (almost) arbitrary ImgPlus which could contain anything. However, it tries to minimize overhead by creating flat primitive arrays from large blocks obtained from the source ImgPlus and writing them directly. Note that these blocks are large and distinct from HDF5 chunks, which are usually much smaller.
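As a rough illustration of the "boxed pixels in, flat primitive array out" step, here is a sketch in plain Java. `BlockFlattening.toPrimitive` is a hypothetical name, and `Number` merely stands in for whatever boxed pixel type the source image yields; the real writer operates on imglib2 types.

```java
/** Hypothetical helper; not part of the ilastik4ij API. */
final class BlockFlattening {
    /**
     * Copies boxed values from one block into a flat float array,
     * so the whole block can be handed to the HDF5 layer in a single call.
     */
    static float[] toPrimitive(Iterable<? extends Number> pixels, int count) {
        float[] out = new float[count];
        int i = 0;
        for (Number p : pixels) {
            if (i == count) {
                throw new IllegalArgumentException("more pixels than expected");
            }
            out[i++] = p.floatValue(); // unboxing happens once per pixel, here only
        }
        if (i != count) {
            throw new IllegalArgumentException("fewer pixels than expected");
        }
        return out;
    }
}
```

The point of the design is that the per-pixel boxing cost is paid once while filling the array, and the actual write then deals only with primitives.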

Another important change is using image views both when reading and writing instead of trying to actually change data dimensions. When reading, source dimension order is preserved, but before returning it is (virtually) permuted to match the requested axis order. When writing, data is (also virtually) permuted before obtaining blocks and performing actual writes.
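The idea of a virtual permutation can be sketched with plain Java arrays. `PermutedView` and its fields are hypothetical names for illustration only; the PR works with imglib2 image views, whereas this sketch just shows that coordinates are remapped on access and no pixel data is copied.

```java
/** Hypothetical read-only view; not part of imglib2 or ilastik4ij. */
final class PermutedView {
    private final int[] data;     // flat source data, first source axis varies fastest
    private final long[] srcDims; // source dimensions in on-disk order
    private final int[] perm;     // perm[i] = source axis shown as view axis i

    PermutedView(int[] data, long[] srcDims, int[] perm) {
        this.data = data;
        this.srcDims = srcDims;
        this.perm = perm;
    }

    /** Size of the view along the given view axis. */
    long dimension(int viewAxis) {
        return srcDims[perm[viewAxis]];
    }

    /** Reads one value; the position is given in view (requested) axis order. */
    int get(long... viewPos) {
        long[] srcPos = new long[srcDims.length];
        for (int i = 0; i < perm.length; i++) {
            srcPos[perm[i]] = viewPos[i];
        }
        long index = 0;
        long stride = 1;
        for (int d = 0; d < srcDims.length; d++) {
            index += srcPos[d] * stride;
            stride *= srcDims[d];
        }
        return data[(int) index];
    }
}
```

Because only the index arithmetic changes, the underlying array keeps the layout it had on disk, which is what lets blocks be read sequentially regardless of the requested axis order.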

Additionally, support for other primitive data types has been added.

k-dominik

Hey @emilmelnikov,

when I first loaded a file with this code I couldn't believe it! Before it was 2 minutes, and with your changes it feels almost instant. That's really awesome!!

Also, thank you for running me through the changes in person, really nice work. I left some comments on what we discovered during the walkthrough.

btbest · 2023-08-21T14:29:25Z

src/main/java/org/ilastik/ilastik4ij/hdf5/Hdf5.java

+     * if axes are specified, image will be written in the specified axis order.
+     * <p>
+     * <p>
+     * If not null, callback will be invoked each time a block is about to be written.


Suggested change

-     * If not null, callback will be invoked each time a block is about to be written.
+     * If not null, callback will be invoked each time a block is about to be written. This is used to provide feedback via progress bar.

wolny

fantastic job! Code is not only more efficient, but much more elegant (using streams and optionals). Can't wait to see it released!

emilmelnikov · 2023-08-23T16:55:56Z

I've made some more changes.

Now HDF5 chunk size and imglib2 cell size are always separate. When reading a chunked HDF5 file, its chunk size is no longer used as the cell size because chunks are too small. When writing, cell size is chosen to be at most 256 MiB, which should be a good compromise between interactivity and performance. The write cell shape is now always either XY or XYZ, with singleton dimensions for the other axes.
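One way to picture the write cell shape is the sketch below, in plain Java. `CellShape.choose`, the axis-string convention, and the policy of taking as many whole Z slices as fit into the budget are all assumptions made for illustration; in particular, the sketch does not split a single XY plane that is already larger than 256 MiB, and the real code may handle that case differently.

```java
import java.util.Arrays;

/** Hypothetical helper; not part of the ilastik4ij API. */
final class CellShape {
    static final long MAX_CELL_BYTES = 256L * 1024 * 1024; // 256 MiB budget

    /**
     * Returns a cell shape covering full XY planes, extended along Z while
     * the cell stays within the budget; every other axis gets size 1.
     * The axes string has one lowercase letter per dimension, e.g. "xyzc".
     */
    static long[] choose(String axes, long[] dims, int bytesPerPixel) {
        int x = axes.indexOf('x');
        int y = axes.indexOf('y');
        int z = axes.indexOf('z');
        if (x < 0 || y < 0) {
            throw new IllegalArgumentException("x and y axes are required");
        }
        long planeBytes = Math.multiplyExact(
                Math.multiplyExact(dims[x], dims[y]), (long) bytesPerPixel);
        if (planeBytes <= 0) {
            throw new IllegalArgumentException("empty or invalid XY plane");
        }
        long[] cell = new long[dims.length];
        Arrays.fill(cell, 1); // singleton for every axis by default
        cell[x] = dims[x];
        cell[y] = dims[y];
        if (z >= 0) {
            long slicesInBudget = MAX_CELL_BYTES / planeBytes; // may be 0
            cell[z] = Math.max(1, Math.min(dims[z], slicesInBudget));
        }
        return cell;
    }
}
```

With 4-byte pixels, a 1024×1024×100×3 dataset in `xyzc` order yields cells of 1024×1024×64×1, since one plane is 4 MiB and 64 planes fill the budget exactly.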

When writing, callbacks now trigger status bar updates, so users can see the progress. The callback signature has changed: it now accepts a long, which is the total number of bytes written. Bytes are easier to deal with because we can precalculate the total number of bytes to read or write in advance and use it for progress bar updates.
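A byte-based progress callback could look roughly like the sketch below. `ProgressSketch`, `writeBlocks`, and `totalBytes` are hypothetical names, `LongConsumer` is only an assumed shape for a callback that accepts a long, and the sketch reports progress after each block rather than trying to reproduce the exact moment the real callback fires.

```java
import java.util.function.LongConsumer;

/** Hypothetical helper; not part of the ilastik4ij API. */
final class ProgressSketch {
    /** Total payload size, known up front from the dimensions and pixel size. */
    static long totalBytes(long[] dims, int bytesPerPixel) {
        long n = bytesPerPixel;
        for (long d : dims) {
            n = Math.multiplyExact(n, d);
        }
        return n;
    }

    /** Walks over the blocks and reports the running byte count to the callback. */
    static void writeBlocks(long[] blockSizesInBytes, LongConsumer callback) {
        // Fall back to a no-op so callers may pass null.
        LongConsumer cb = (callback != null) ? callback : bytes -> { };
        long written = 0;
        for (long blockBytes : blockSizesInBytes) {
            // The actual HDF5 write for this block would happen here.
            written += blockBytes;
            cb.accept(written);
        }
    }
}
```

A status bar can then show `written / totalBytes(...)` without knowing anything about block shapes or counts.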

I think now this is really ready to be merged, if everything else is OK.

emilmelnikov · 2023-08-23T17:02:12Z

Side note: imglib2 recently added PrimitiveBlocks for copying an arbitrary subregion into a raw array. I've benchmarked it for writing a "simple" dataset, and it is faster by 20–40% depending on the specific axis order! However, if the target dataset is a more complex view, it switches to a fallback implementation which, in my tests, resulted in an out-of-heap error. Therefore, I left the existing implementation in place for now.

k-dominik

Really nice! With all the eyes this PR got and all the praise, too, let's not delay merging it (pls don't squash).

The only minor cosmetic issue I noticed is that the progress bar is not reset to zero when writing during workflow processing. It's reset after processing is done though, so, also fine to merge without fixing it.

imagesc-bot · 2023-09-14T15:47:37Z

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/ilastik-instance-handling-in-fiji-macros/86286/2

emilmelnikov force-pushed the faster-hdf5 branch from e22e72b to 7ac379d on August 7, 2023 14:54

emilmelnikov added 10 commits August 16, 2023 14:57

Add new HDF5 list datasets implementation

11cfd74

Add reading of "small" HDF5 datasets

89ddaa9

Delete temp files in tests ASAP

7babc47

Add new HDF5 dataset reader

dfeb381

The new implementation always reads blocks from disk sequentially, doesn't box pixels, and changes axis order by wrapping the result into a view without data copying.

Add HDF5 dataset writer

46aeddf

Add writing for ARGBType datasets

2ed9305

Add deflate compression option

41a1d15

Temp benchmark

5a31fb1

Refactor HDF5 reader/writer

c839f6a

Transpose images before writing

e6927ac

emilmelnikov force-pushed the faster-hdf5 branch from 8707710 to c0c3b8c on August 16, 2023 19:01

emilmelnikov marked this pull request as ready for review August 16, 2023 19:01

emilmelnikov requested review from k-dominik and btbest August 16, 2023 19:02

Replace old HDF5 reader/writer

03742e2

Also fix some bugs and adapt tests.

emilmelnikov force-pushed the faster-hdf5 branch from c0c3b8c to 03742e2 on August 16, 2023 19:08

emilmelnikov requested a review from wolny August 17, 2023 11:53

k-dominik reviewed Aug 17, 2023

emilmelnikov added 4 commits August 18, 2023 09:14

Add callbacks for block reading/writing

7562132

This partially replaces logService from the old implementation.

Improve axis handling

1e1d1bd

* Use no-op callbacks if none was provided * Add error checking

Move HDF5 tests to JUnit 5

a5f579a

Initialize default axes in writeDataset

2b80a3a

btbest approved these changes Aug 21, 2023

wolny approved these changes Aug 21, 2023

emilmelnikov added 3 commits August 23, 2023 16:14

Change dataset chunking

662f6b6

HDF5 chunk sizes are usually quite small. At the same time, we want cells in `CellImg` to be large (larger cells means less overhead).

Warn about writing ARGBType images

2408bdd

Display write progress in status bar

b7a1830

emilmelnikov force-pushed the faster-hdf5 branch from 79327ff to b7a1830 on August 23, 2023 16:19

Log read/write timings

ae36523

emilmelnikov added 2 commits August 23, 2023 18:39

Remove wildcard imports

ffc3d5a

Log ilastik subprocess run timings

58a5b4c

emilmelnikov requested a review from k-dominik August 23, 2023 16:48

emilmelnikov force-pushed the faster-hdf5 branch from ea7dfe3 to 0d60c61 on August 23, 2023 18:58

Merge branch 'master' into faster-hdf5

82a706f

emilmelnikov force-pushed the faster-hdf5 branch from 0d60c61 to 82a706f on August 23, 2023 18:59

k-dominik approved these changes Aug 24, 2023

emilmelnikov and others added 2 commits August 24, 2023 14:54

Add HDF5 generation script

601d3ed

This script can be used to test the HDF5 reader.

Clear status bar after writing

5737f05

If not cleared, progress in the status bar does not disappear. Co-authored-by: Dominik Kutra <k-dominik@users.noreply.github.com>

emilmelnikov merged commit 08c603e into master Aug 24, 2023
1 check passed

emilmelnikov deleted the faster-hdf5 branch August 24, 2023 13:00

This was referenced Aug 24, 2023

Progress bar when running import/export only #113

Closed

min/max computation when exporting #114

Closed

emilmelnikov mentioned this pull request Aug 24, 2023

Add support for float64 in HDF5 import/export #30

Closed
