
[Feature]: Parallel Write Support for HDMF-Zarr #101

Closed
3 tasks done
CodyCBakerPhD opened this issue Jun 27, 2023 · 5 comments · Fixed by #118
Labels: category: proposal (proposed enhancements or new features), priority: medium (non-critical problem and/or affecting only a small set of users)

Comments

@CodyCBakerPhD (Contributor)

What would you like to see added to HDMF-ZARR?

Parallel Write Support for HDMF-Zarr

Allow NWB files written using the Zarr backend to leverage multiple threads or CPUs to enhance speed of operation.
Objectives

Zarr is built to support efficient Python parallelization strategies, both multi-processing and multi-threaded
HDMF-Zarr currently handles all write operations (including buffering and slicing) without exposing the necessary controls to enable these strategies

Approach and Plan
Identify the best injection point for parallelization parameters in the io.write() stack of HDMF-Zarr
Progress and Next Steps
TODO
Background and References

Is your feature request related to a problem?

No response

What solution would you like?

Identify the best injection point for parallelization parameters in the io.write() stack of HDMF-Zarr (as controlled via the NWBZarrIO)

Essentially this revolves around the line https://github.com/hdmf-dev/hdmf/blob/2f9ec567ebe1df9fccb05f139d2f669661e50018/src/hdmf/backends/hdf5/h5_utils.py#L61 in the main HDMF repo (which might be what delegates the write command here as well?).
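
For context, a minimal sketch of what plain Zarr already supports today (this is not the HDMF-Zarr API, just an illustration of the target behavior): concurrent writers are safe as long as each worker writes to a distinct set of chunks.

# Illustrative only: raw Zarr with one thread per chunk-aligned row.
# Each task writes exactly one chunk, so no synchronizer is needed.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import zarr

z = zarr.open("example.zarr", mode="w", shape=(8, 1_000), chunks=(1, 1_000), dtype="f8")

def write_row(row: int) -> None:
    z[row, :] = np.random.rand(1_000)

with ThreadPoolExecutor(max_workers=4) as executor:
    list(executor.map(write_row, range(z.shape[0])))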

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

@oruebel added the "category: proposal" and "priority: medium" labels on Jun 27, 2023
@oruebel (Contributor)

oruebel commented Jun 27, 2023

Approach and Plan

For DataChunkIterators the easiest place to start I believe is to look at how to parallelize the ZarrIODataChunkIteratorQueue.exhaust_queue function.

The reason this is a good place to start is because: 1) it keeps things simple, since at that point all the setup for the file and datasets is already done and all we deal with is writing the datasets themselves, 2) it's a pretty simple function, and 3) it allows us to parallelize both across chunks within a dataset and across all DataChunkIterators that are being used. I believe this should cover all cases that involve writing DataChunkIterators, because the ZarrIO.write_dataset method also just calls the exhaust_queue function to write from a DataChunkIterator, here:

if exhaust_dci:
    self.__dci_queue.exhaust_queue()

only in this case the queue will always contain just a single DataChunkIterator object. I.e., when calling ZarrIO.write with exhaust_dci=False we would get all DataChunkIterator objects at once as part of the queue, and with exhaust_dci=True we would see them one at a time. In other words, parallelizing exhaust_queue should take care of all cases related to parallel write of DataChunkIterator objects.
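
A rough, standalone sketch of what a parallel exhaust_queue could look like. This deliberately models the queue as a plain list of (zarr_dataset, DataChunkIterator) pairs rather than using the actual ZarrIODataChunkIteratorQueue internals; all names here are illustrative.

# Illustrative sketch: exhaust several DataChunkIterators concurrently.
# Assumes each DataChunk exposes .selection and .data (as in HDMF's DataChunk).
from concurrent.futures import ThreadPoolExecutor


def _exhaust_one(zarr_dataset, data_chunk_iterator):
    # Write every chunk produced by one iterator into its target dataset.
    for chunk in data_chunk_iterator:
        zarr_dataset[chunk.selection] = chunk.data


def exhaust_queue_parallel(queue, max_workers=4):
    # Parallelism here is across iterators; it could also be pushed down
    # to individual chunks within a single iterator.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(_exhaust_one, ds, dci) for ds, dci in queue]
        for f in futures:
            f.result()  # re-raise any write errors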

If we also want to do parallel write for other datasets that are specified via numpy arrays or lists, then that would require us to also look at ZarrIO.write, but as a start focusing on DataChunkIterator should be fine. For the more general case we would also likely need to worry more about how to intelligently determine how to write these in parallel, whereas with DataChunkIterator and ZarrDataIO there is a more explicit way for the user to control things.

Identify the best injection point for parallelization parameters in the io.write() stack of HDMF-Zarr

If it needs to be parametrized on a per-dataset level, then the definition of parameters would probably go into ZarrDataIO. If it needs to be parametrized across datasets (i.e., use the same settings for all), then this would probably go either as parameters of ZarrIO.write or of ZarrIO itself. I think this all depends on how flexible we need it to be. ZarrIO may end up having to pass these to ZarrIODataChunkIteratorQueue, but that is an internal detail a user would normally not see, since the queue is an internal data structure that is not exposed to the user (i.e., it is only stored in a private member variable and should not be used directly).
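
Purely hypothetical sketches of the two parametrization options; none of these keyword arguments exist in HDMF-Zarr at this point.

# Option A (hypothetical): per-dataset control via ZarrDataIO
# wrapped = ZarrDataIO(data=my_iterator, chunks=(1000,), parallel_workers=4)

# Option B (hypothetical): one setting for all datasets, passed to the IO class
# with NWBZarrIO(path, mode="w", number_of_jobs=4) as io:
#     io.write(nwbfile)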

@oruebel (Contributor)

oruebel commented Jun 27, 2023

Another part that we likely need to consider is to:

  1. ensure that we use the correct synchronizer for Zarr in ZarrIO.__init__
  2. we may also need to ensure that numcodecs.blosc.use_threads = False is set, according to https://zarr.readthedocs.io/en/stable/tutorial.html#parallel-computing-and-synchronization
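
A short snippet of the two settings from the linked Zarr (v2) tutorial, assuming a process-based writer where workers may touch the same chunk:

# From the Zarr v2 tutorial: pass a ProcessSynchronizer when concurrent writers
# may touch the same chunk, and disable Blosc's internal threads when using
# multiprocessing.
import numcodecs
import zarr

numcodecs.blosc.use_threads = False  # avoid Blosc thread/fork interactions

synchronizer = zarr.ProcessSynchronizer("example.sync")
z = zarr.open_array(
    "example.zarr",
    mode="w",
    shape=(10_000, 10_000),
    chunks=(1_000, 1_000),
    dtype="i4",
    synchronizer=synchronizer,
)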

@oruebel (Contributor)

oruebel commented Jun 27, 2023

@CodyCBakerPhD what approach for parallel write were you thinking of, Dask, MPI, joblib, multiprocessing etc.?

@CodyCBakerPhD (Contributor, Author)

CodyCBakerPhD commented Jun 27, 2023

(1) ensure that we use the correct synchronizer for Zarr in ZarrIO.__init__

As I understand it, with multiprocessing we should not need a synchronizer if the DataChunkIterator can be trusted to properly partition the data evenly across chunks (which the GenericDataChunkIterator does by design - so maybe easiest to only support that to begin with)

Unsure about the threading, haven't tried that yet

(2) @CodyCBakerPhD what approach for parallel write were you thinking of, Dask, MPI, joblib, multiprocessing etc.?

concurrent.futures is our go-to choice as a nicer simplification of multiprocessing and multithreading, without needing to add much management of queues and other overhead

Would probably start there, but we could think of a design that is extendable to a user's choice of backend, especially given how well Dask is regarded to work with Zarr

I do not recommend MPI; the method of encoding the parallelization has always felt rather awkward and I don't even know how we would use it at these deeper levels of abstraction (demos of MPI I've seen have consisted of simple scripts that get 'deployed' via CLI - not exactly our use case either)
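
A minimal sketch of the concurrent.futures pattern described above, assuming each worker process writes a distinct, chunk-aligned slab so that no Zarr synchronizer is required. This is illustrative only, not the HDMF-Zarr implementation.

# Illustrative: ProcessPoolExecutor writing chunk-aligned slabs of a Zarr array.
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import zarr


def write_slab(start: int) -> None:
    # Re-open the store inside the worker; each 1000-row slab maps to whole chunks.
    z = zarr.open_array("example.zarr", mode="r+")
    z[start:start + 1_000, :] = np.random.rand(1_000, z.shape[1])


if __name__ == "__main__":
    zarr.open_array("example.zarr", mode="w", shape=(10_000, 256),
                    chunks=(1_000, 256), dtype="f8")
    with ProcessPoolExecutor(max_workers=4) as executor:
        list(executor.map(write_slab, range(0, 10_000, 1_000)))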

@oruebel (Contributor)

oruebel commented Jun 28, 2023

concurrent.futures is our go-to choice

Sounds reasonable to me as a start.

we could think of a design that is extendable to a user's choice of backend

Let's see how this goes. I think if ZarrIODataChunkIteratorQueue.exhaust_queue is the main place we need to modify, then to make it configurable we could either: 1) implement different derived versions of ZarrIODataChunkIteratorQueue (e.g., a DaskZarrIODataChunkIteratorQueue) and then have a parameter on ZarrIO.__init__ to set the queue to use, or 2) make the behavior configurable in a single ZarrIODataChunkIteratorQueue class. But at first glance, I think I'd prefer option 1.
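
A hypothetical sketch of option 1: a pluggable queue class selected via a (not yet existing) parameter on ZarrIO.__init__. The subclass body and the argument name are placeholders for illustration; only ZarrIODataChunkIteratorQueue is an existing class.

# Hypothetical sketch of "option 1"; the dci_queue_class argument does not exist yet.
from hdmf_zarr.utils import ZarrIODataChunkIteratorQueue


class DaskZarrIODataChunkIteratorQueue(ZarrIODataChunkIteratorQueue):
    """Queue variant that would exhaust its DataChunkIterators via Dask."""

    def exhaust_queue(self):
        # Placeholder: build one delayed task per queued iterator and compute
        # them together, instead of writing serially.
        raise NotImplementedError


# Hypothetical usage, assuming ZarrIO grew a dci_queue_class argument:
# io = ZarrIO(path, mode="w", dci_queue_class=DaskZarrIODataChunkIteratorQueue)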
