[Feature]: Parallel Write Support for HDMF-Zarr #101
The reason this is a good place to start is because: 1) it keeps things simple, because at that point all the setup for the file and datasets is already done and all we deal with is writing the datasets itself; 2) it's a pretty simple function; and 3) it allows us to parallelize across both chunks within a dataset as well as across all datasets. (See hdmf-zarr/src/hdmf_zarr/backend.py, lines 839 to 840 at 531296c.)
…only in this case the queue will always just contain a single … If we also want to do parallel write for other datasets that are being specified via numpy arrays or lists, then that would require us to also look at …
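To make the idea concrete, here is a minimal, purely illustrative sketch of a parallelized queue-draining step. It is not the real `ZarrIODataChunkIteratorQueue.exhaust_queue`; the datasets are plain Python lists and the chunk iterators are stand-ins for HDMF's `DataChunkIterator`. The point it shows is that once file and dataset setup is done, the remaining work is a flat collection of independent per-chunk writes that can be submitted to a single pool, parallelizing both within and across datasets.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a DataChunkIterator: yields (offset, values)
# chunks destined for one dataset.
def make_chunks(data, chunk_size):
    for start in range(0, len(data), chunk_size):
        yield start, data[start:start + chunk_size]

def exhaust_queue_parallel(queue, max_workers=4):
    """Drain a queue of (dataset, chunk_iterator) pairs, writing every
    chunk of every dataset through one shared pool. Chunks are disjoint
    slices, so the workers never touch the same region."""
    def write_chunk(dataset, offset, values):
        dataset[offset:offset + len(values)] = values

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        for dataset, chunk_iter in queue:
            for offset, values in chunk_iter:
                futures.append(pool.submit(write_chunk, dataset, offset, values))
        for f in futures:
            f.result()  # surface any write errors

# Two in-memory "datasets" backed by plain lists:
ds_a, ds_b = [0] * 10, [0] * 6
queue = [(ds_a, make_chunks(list(range(10)), 3)),
         (ds_b, make_chunks(list(range(6)), 2))]
exhaust_queue_parallel(queue)
```

With a real Zarr store, `write_chunk` would instead assign into a `zarr.Array` slice; the flat-pool structure stays the same.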
If it needs to be parametrized on a per-dataset level then the definition of parameters would probably go into ZarrDataIO. If this needs to be parametrized across datasets (i.e., use the same settings for all) then this would probably either go as parameters of …
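For the per-dataset option, a rough sketch of what such a wrapper could look like follows. The class and field names (`ParallelZarrDataIO`, `number_of_jobs`, `executor`) are assumptions for illustration, loosely modeled on `ZarrDataIO`, not an existing API.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ParallelZarrDataIO:
    """Hypothetical per-dataset wrapper carrying parallelization
    settings alongside the data itself (names are illustrative)."""
    data: Any
    chunks: Optional[tuple] = None
    number_of_jobs: int = 1      # 1 == serial write (assumed default)
    executor: str = "thread"     # "thread" or "process" (assumed knob)

# Per-dataset configuration: one dataset written in parallel, one serially.
fast = ParallelZarrDataIO(data=list(range(1000)), chunks=(100,), number_of_jobs=4)
slow = ParallelZarrDataIO(data=[1, 2, 3])
```

The backend would then read these settings per dataset during write, while the cross-dataset variant would instead expose the same knobs once, e.g. on the IO object or on `write()` itself.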
Another part that we likely need to consider is to …
@CodyCBakerPhD what approach for parallel write were you thinking of?
As I understand it, with multiprocessing we should not need a synchronizer if the DataChunkIterator can be trusted to properly partition the data evenly across chunks (which the …). Unsure about threading; haven't tried that yet.
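The "no synchronizer needed" claim rests on the writes being chunk-aligned and disjoint. A small stdlib illustration of that invariant, using a plain list in place of a Zarr array:

```python
import threading

# Disjoint, chunk-aligned writes: each worker owns its own slice, so no
# lock (in Zarr terms, no synchronizer) is required. This mirrors the
# assumption that the DataChunkIterator partitions data evenly across chunks.
buffer = [0] * 8

def write_slice(start, stop, value):
    for i in range(start, stop):
        buffer[i] = value

threads = [threading.Thread(target=write_slice, args=(0, 4, 1)),
           threading.Thread(target=write_slice, args=(4, 8, 2))]
for t in threads:
    t.start()
for t in threads:
    t.join()

# By contrast, if two workers could write into the *same* chunk, a
# per-chunk lock (what zarr.sync.ThreadSynchronizer provides for
# threaded writers) would be needed to avoid lost updates.
```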
Would probably start there, but we could think of a design that is extendable to a user's choice of backend, especially given how well Dask is regarded to work with Zarr. I do not recommend MPI; the method of encoding the parallelization has always felt rather awkward, and I don't even know how we would use it at these deeper levels of abstraction (demos of MPI I've seen have consisted of simple scripts that get 'deployed' via CLI, which is not exactly our use case either).
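One way to keep the backend choice open is to make the parallel map itself pluggable. The sketch below (all names hypothetical) registers interchangeable execution strategies; a Dask-backed entry could later map the same per-chunk function through delayed tasks without changing the write logic.

```python
from concurrent.futures import ThreadPoolExecutor

def serial_map(fn, items):
    return [fn(x) for x in items]

def threaded_map(fn, items, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, items))

# Registry of parallelization "backends" (hypothetical design); a Dask
# entry could map chunks through dask.delayed in the same shape.
BACKENDS = {"serial": serial_map, "thread": threaded_map}

def write_all_chunks(chunks, backend="serial"):
    # Toy "write": sum each chunk; a real backend would write to Zarr.
    return BACKENDS[backend](lambda c: sum(c), chunks)

chunks = [[1, 2], [3, 4], [5, 6]]
```

The user-facing option is then just a string (or callable), which sidesteps hard-coding any single parallel framework into the write path.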
Sounds reasonable to me as a start.
Let's see how this goes. I think if ZarrIODataChunkIteratorQueue.exhaust_queue is the main place we need to modify, then to make it configurable we could either: 1) implement different derived versions of …
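Option 1 (derived queue classes) can be sketched as follows. `IteratorQueue` here is a minimal hypothetical stand-in for `ZarrIODataChunkIteratorQueue`, not its actual interface; the point is only that a subclass overriding `exhaust_queue` is enough to swap serial for parallel draining.

```python
from concurrent.futures import ThreadPoolExecutor

class IteratorQueue:
    """Minimal stand-in for ZarrIODataChunkIteratorQueue (hypothetical)."""
    def __init__(self):
        self.items = []

    def append(self, dataset, chunks):
        self.items.append((dataset, chunks))

    def exhaust_queue(self):
        # Serial baseline: write every chunk of every queued dataset.
        for dataset, chunks in self.items:
            for offset, values in chunks:
                dataset[offset:offset + len(values)] = values
        self.items.clear()

class ParallelIteratorQueue(IteratorQueue):
    """Derived version overriding exhaust_queue with a threaded drain."""
    def exhaust_queue(self):
        def write(dataset, offset, values):
            dataset[offset:offset + len(values)] = values

        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(write, ds, off, vals)
                       for ds, chunks in self.items
                       for off, vals in chunks]
            for f in futures:
                f.result()
        self.items.clear()

target = [0] * 6
q = ParallelIteratorQueue()
q.append(target, [(0, [1, 2, 3]), (3, [4, 5, 6])])
q.exhaust_queue()
```

The alternative (parameterizing a single `exhaust_queue`) would trade the subclass for a `number_of_jobs`-style argument, at the cost of branching inside one method.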
What would you like to see added to HDMF-ZARR?
Parallel Write Support for HDMF-Zarr
Allow NWB files written using the Zarr backend to leverage multiple threads or CPUs to enhance speed of operation.
Objectives
Zarr is built to support efficient Python parallelization strategies, both multi-process and multi-threaded
HDMF-Zarr currently handles all write operations (including buffering and slicing) without exposing the necessary controls to enable these strategies
Approach and Plan
Identify the best injection point for parallelization parameters in the io.write() stack of HDMF-Zarr
Progress and Next Steps
TODO
Background and References
Is your feature request related to a problem?
No response
What solution would you like?
Identify the best injection point for parallelization parameters in the io.write() stack of HDMF-Zarr (as controlled via the NWBZarrIO). Essentially revolving around the line https://github.com/hdmf-dev/hdmf/blob/2f9ec567ebe1df9fccb05f139d2f669661e50018/src/hdmf/backends/hdf5/h5_utils.py#L61 from the main repo (which might be what is used to delegate the command here as well?)
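To illustrate what "injection point" means here, a toy call chain threading a parallelization knob from the top-level write down to the per-dataset helper. None of these signatures are the real HDMF or hdmf-zarr API; the names and the `number_of_jobs` parameter are assumptions for the sketch.

```python
# Hypothetical call chain: write(...) -> write_builder(...) -> write_dataset(...)
# showing one place a parallelization parameter could be injected and forwarded.
def write_dataset(name, data, number_of_jobs=1):
    mode = "parallel" if number_of_jobs > 1 else "serial"
    return f"{name}: wrote {len(data)} values ({mode}, jobs={number_of_jobs})"

def write_builder(datasets, number_of_jobs=1):
    return [write_dataset(n, d, number_of_jobs=number_of_jobs)
            for n, d in datasets.items()]

def write(datasets, number_of_jobs=1):
    # Top-level entry point (akin in spirit to NWBZarrIO.write)
    # forwarding the knob unchanged down the stack.
    return write_builder(datasets, number_of_jobs=number_of_jobs)

log = write({"timeseries": [1, 2, 3]}, number_of_jobs=4)
```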
Do you have any interest in helping implement the feature?
Yes.
Code of Conduct