Optimize zarr multiscale writing based on #237 #257

camilolaiton · 2023-03-03T22:25:01Z

Please, consider this PR as a possible solution to #237. It is more convenient to let the user decide how to compute the dask delayed objects. This is certainly useful/optimal when we are working with large image datasets.

will-moore · 2023-03-06T11:25:22Z

Thanks for this contribution.

Could you show how you're using this? Is it only used for using persist() instead of compute() or are there other usages?

Also, it would be great to have this parameter tested - I realise that we don't have write_multiscale() tested directly since it's tested indirectly from write_image() - See example at

ome-zarr-py/tests/test_writer.py

Line 149 in 2c4d489

def test_write_image_dask(self, read_from_zarr):

A test similar to that which tested the compute parameter via @pytest.mark.parametrize would be great.

Would it make sense to add this same parameter to write_image() too?

camilolaiton · 2023-03-06T19:52:28Z

Hello @will-moore , thanks for the prompt response.

Yes, I can give you some examples in how we're using this feature to write datasets in our team. Example 1 and example 2. For both examples, we are currently using persist + wait. It is important to mention that the example 1 is running in a cloud VM to write approximately 400-800 GB of data while the second one about 100 TB of data using the HPC, of course, the size depends on multiple factors (they are an approximate).

In terms to the testing, I agree, it is totally necessary to test this parameter. As I pointed in this comment, we're using persist + wait for our purposes. I can totally add that to the tests if you agree.

Lastly, I believe it makes sense to add this parameter to the write_image as well. I noticed that the write_image does something similar to what we do in the multiscale writer by capturing the delayed jobs, but as I mentioned in the issue, this could lead to memory issues when we compute large images (it works perfectly fine when we compute images that fit in memory). Therefore, if you agree, we can also contribute in that sense 👍🏽

will-moore · 2023-03-22T08:45:34Z

tests/test_writer.py

        )
+
+        if not len(dask_delayed_jobs):


maybe before this line it makes sense to assert compute == (len(dask_delayed_jobs) > 0) ?

Yes, I agree. I tried to follow dask convention. in the compute flag. Therefore, when compute is False we return the delayed jobs and assert not compute == len(dask_delayed_jobs).

codecov · 2023-03-22T09:10:17Z

Codecov Report

Patch coverage: 94.73% and project coverage change: +0.05 🎉

Comparison is base (9753167) 84.79% compared to head (0e767a6) 84.84%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #257      +/-   ##
==========================================
+ Coverage   84.79%   84.84%   +0.05%     
==========================================
  Files          13       13              
  Lines        1473     1485      +12     
==========================================
+ Hits         1249     1260      +11     
- Misses        224      225       +1

Impacted Files	Coverage Δ
ome_zarr/writer.py	`95.27% <94.73%> (-0.18%)`	⬇️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

will-moore

Looks good. That's all from me 👍

camilolaiton · 2023-04-10T21:49:39Z

Hello! @will-moore I just wanted to follow up with this PR. Do I have to do anything else to get the branch merge with main?
Thanks!

will-moore · 2023-04-11T05:39:34Z

Hi - apologies for the delay - I have assigned this to be included in the next release, but I wanted to get feedback from others on the OME team (some of us are away just now - Easter etc). But not forgotten! - cc @joshmoore @sbesson.

sbesson

As a preamble, I am not actively using the ome-zarr Python library and have even less input on the dask functionality. I am unsure if the new maintenance responsibilities have been redefined for this repository. If not, I would suggest a member of the academically-funded OME team should add his/her review.

I don't have much to add to this PR beyond the fact that the proposed API addition feels simple enough. Only suggestion is that the new return value for each of the write method might be worth communicating in the docstring (and hence the generated readthedocs). In particular, my understand is that compute will be a no-op and the return list will be [] if the image is a simple numpy array, is that correct? If so, it might be worth clarifying the expectation for consumers.

joshmoore · 2023-04-18T09:04:48Z

Thanks for this, @camilolaiton. Definitely like the idea and it seems straight-forward enough.

The only question I have is whether or not there is any duck-typing (and/or wrapping) to be considered on the return type to improve the future-proofing. It's likely not a particularly critical issue at this point so I wouldn't want us to get into over-engineering, but if there is a common idiom out there (e.g. from https://data-apis.org/) then I could see adding that here. Alternatively, we (continue to) go all in on Dask as part of the public API.

Does anyone have any thoughts?

camilolaiton · 2023-04-25T19:59:18Z

As a preamble, I am not actively using the ome-zarr Python library and have even less input on the dask functionality. I am unsure if the new maintenance responsibilities have been redefined for this repository. If not, I would suggest a member of the academically-funded OME team should add his/her review.

I don't have much to add to this PR beyond the fact that the proposed API addition feels simple enough. Only suggestion is that the new return value for each of the write method might be worth communicating in the docstring (and hence the generated readthedocs). In particular, my understand is that compute will be a no-op and the return list will be [] if the image is a simple numpy array, is that correct? If so, it might be worth clarifying the expectation for consumers.

I totally agree. I'll add this to the docstring.

Thanks for this, @camilolaiton. Definitely like the idea and it seems straight-forward enough.

The only question I have is whether or not there is any duck-typing (and/or wrapping) to be considered on the return type to improve the future-proofing. It's likely not a particularly critical issue at this point so I wouldn't want us to get into over-engineering, but if there is a common idiom out there (e.g. from https://data-apis.org/) then I could see adding that here. Alternatively, we (continue to) go all in on Dask as part of the public API.

Does anyone have any thoughts?

Yes, I get your point. However, I think it's not entirely necessary to add a wrapper for the return on this functionality, as it would probably make things a little more complicated than they should be. I think going all in on Dask is better at this point for the public API.

for more information, see https://pre-commit.ci

joshmoore · 2023-05-03T07:21:36Z

Resolved conflicts and build now running.

joshmoore · 2023-05-03T17:39:42Z

Everything is green. If there are no other comments, I'd propose to get this shipped as 0.7.0. We may eventually re-consider the all-in-on-dask decision, but introducing a new interface at this stage seems tricky at best.

joshmoore · 2023-05-04T09:02:55Z

Taking that as confirmation and releasing. Thanks again, @camilolaiton!

joshmoore · 2023-05-08T07:37:23Z

Note that this might have broken the conda-forge build:

ome-zarr 0.7.0 requires distributed, which is not installed.

see conda-forge/ome-zarr-feedstock#12 for the failing build

adding optional compute flag to write multiscale

ddfd3ff

adding compute parameter to write_image and testing for this parameter

fbfd23b

will-moore reviewed Mar 22, 2023

View reviewed changes

will-moore requested a review from joshmoore March 22, 2023 08:48

will-moore added this to the 0.7.0 milestone Mar 22, 2023

including assert to dask compute flag

059b7bf

camilolaiton requested review from will-moore and removed request for joshmoore March 22, 2023 16:21

will-moore approved these changes Mar 22, 2023

View reviewed changes

will-moore requested a review from sbesson March 28, 2023 10:41

sbesson reviewed Apr 18, 2023

View reviewed changes

including returns in the docstrings

ff1619f

sbesson approved these changes May 2, 2023

View reviewed changes

joshmoore and others added 2 commits May 3, 2023 09:21

Merge branch 'master' into optimize-multiscale

a4e3697

[pre-commit.ci] auto fixes from pre-commit.com hooks

0e767a6

for more information, see https://pre-commit.ci

joshmoore merged commit 23986fa into ome:master May 4, 2023
12 checks passed

joshmoore added a commit to will-moore/ome-zarr-py that referenced this pull request May 4, 2023

update changelog with ome#257

6c547aa

will-moore mentioned this pull request May 4, 2023

changelog entry for 0.7.0 #267

Merged

joshmoore mentioned this pull request May 8, 2023

ome-zarr v0.7.0 conda-forge/ome-zarr-feedstock#12

Closed

3 tasks

camilolaiton mentioned this pull request May 16, 2023

write_multiscale function could be optimized when writing a dask multiscale image #237

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize zarr multiscale writing based on #237 #257

Optimize zarr multiscale writing based on #237 #257

camilolaiton commented Mar 3, 2023 •

edited

Loading

will-moore commented Mar 6, 2023

camilolaiton commented Mar 6, 2023 •

edited

Loading

will-moore Mar 22, 2023 •

edited

Loading

camilolaiton Mar 22, 2023

codecov bot commented Mar 22, 2023 •

edited

Loading

will-moore left a comment

camilolaiton commented Apr 10, 2023

will-moore commented Apr 11, 2023

sbesson left a comment

joshmoore commented Apr 18, 2023

camilolaiton commented Apr 25, 2023

joshmoore commented May 3, 2023

joshmoore commented May 3, 2023

joshmoore commented May 4, 2023

joshmoore commented May 8, 2023

Optimize zarr multiscale writing based on #237 #257

Optimize zarr multiscale writing based on #237 #257

Conversation

camilolaiton commented Mar 3, 2023 • edited Loading

will-moore commented Mar 6, 2023

camilolaiton commented Mar 6, 2023 • edited Loading

will-moore Mar 22, 2023 • edited Loading

Choose a reason for hiding this comment

camilolaiton Mar 22, 2023

Choose a reason for hiding this comment

codecov bot commented Mar 22, 2023 • edited Loading

Codecov Report

will-moore left a comment

Choose a reason for hiding this comment

camilolaiton commented Apr 10, 2023

will-moore commented Apr 11, 2023

sbesson left a comment

Choose a reason for hiding this comment

joshmoore commented Apr 18, 2023

camilolaiton commented Apr 25, 2023

joshmoore commented May 3, 2023

joshmoore commented May 3, 2023

joshmoore commented May 4, 2023

joshmoore commented May 8, 2023

camilolaiton commented Mar 3, 2023 •

edited

Loading

camilolaiton commented Mar 6, 2023 •

edited

Loading

will-moore Mar 22, 2023 •

edited

Loading

codecov bot commented Mar 22, 2023 •

edited

Loading