Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize zarr multiscale writing based on #237 #257

Merged
merged 6 commits into from
May 4, 2023

Conversation

camilolaiton
Copy link
Contributor

@camilolaiton camilolaiton commented Mar 3, 2023

Please, consider this PR as a possible solution to #237. It is more convenient to let the user decide how to compute the dask delayed objects. This is certainly useful/optimal when we are working with large image datasets.

@will-moore
Copy link
Member

Thanks for this contribution.

Could you show how you're using this? Is it only used for using persist() instead of compute() or are there other usages?

Also, it would be great to have this parameter tested - I realise that we don't have write_multiscale() tested directly since it's tested indirectly from write_image() - See example at

def test_write_image_dask(self, read_from_zarr):

A test similar to that which tested the compute parameter via @pytest.mark.parametrize would be great.

Would it make sense to add this same parameter to write_image() too?

@camilolaiton
Copy link
Contributor Author

camilolaiton commented Mar 6, 2023

Hello @will-moore , thanks for the prompt response.

Yes, I can give you some examples in how we're using this feature to write datasets in our team. Example 1 and example 2. For both examples, we are currently using persist + wait. It is important to mention that the example 1 is running in a cloud VM to write approximately 400-800 GB of data while the second one about 100 TB of data using the HPC, of course, the size depends on multiple factors (they are an approximate).

In terms to the testing, I agree, it is totally necessary to test this parameter. As I pointed in this comment, we're using persist + wait for our purposes. I can totally add that to the tests if you agree.

Lastly, I believe it makes sense to add this parameter to the write_image as well. I noticed that the write_image does something similar to what we do in the multiscale writer by capturing the delayed jobs, but as I mentioned in the issue, this could lead to memory issues when we compute large images (it works perfectly fine when we compute images that fit in memory). Therefore, if you agree, we can also contribute in that sense 👍🏽

)

if not len(dask_delayed_jobs):
Copy link
Member

@will-moore will-moore Mar 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe before this line it makes sense to assert compute == (len(dask_delayed_jobs) > 0) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. I tried to follow dask convention. in the compute flag. Therefore, when compute is False we return the delayed jobs and assert not compute == len(dask_delayed_jobs).

@will-moore will-moore added this to the 0.7.0 milestone Mar 22, 2023
@codecov
Copy link

codecov bot commented Mar 22, 2023

Codecov Report

Patch coverage: 94.73% and project coverage change: +0.05 🎉

Comparison is base (9753167) 84.79% compared to head (0e767a6) 84.84%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #257      +/-   ##
==========================================
+ Coverage   84.79%   84.84%   +0.05%     
==========================================
  Files          13       13              
  Lines        1473     1485      +12     
==========================================
+ Hits         1249     1260      +11     
- Misses        224      225       +1     
Impacted Files Coverage Δ
ome_zarr/writer.py 95.27% <94.73%> (-0.18%) ⬇️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@camilolaiton camilolaiton requested review from will-moore and removed request for joshmoore March 22, 2023 16:21
Copy link
Member

@will-moore will-moore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. That's all from me 👍

@will-moore will-moore requested a review from sbesson March 28, 2023 10:41
@camilolaiton
Copy link
Contributor Author

Hello! @will-moore I just wanted to follow up with this PR. Do I have to do anything else to get the branch merge with main?
Thanks!

@will-moore
Copy link
Member

Hi - apologies for the delay - I have assigned this to be included in the next release, but I wanted to get feedback from others on the OME team (some of us are away just now - Easter etc). But not forgotten! - cc @joshmoore @sbesson.

Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a preamble, I am not actively using the ome-zarr Python library and have even less input on the dask functionality. I am unsure if the new maintenance responsibilities have been redefined for this repository. If not, I would suggest a member of the academically-funded OME team should add his/her review.

I don't have much to add to this PR beyond the fact that the proposed API addition feels simple enough. Only suggestion is that the new return value for each of the write method might be worth communicating in the docstring (and hence the generated readthedocs). In particular, my understand is that compute will be a no-op and the return list will be [] if the image is a simple numpy array, is that correct? If so, it might be worth clarifying the expectation for consumers.

@joshmoore
Copy link
Member

Thanks for this, @camilolaiton. Definitely like the idea and it seems straight-forward enough.

The only question I have is whether or not there is any duck-typing (and/or wrapping) to be considered on the return type to improve the future-proofing. It's likely not a particularly critical issue at this point so I wouldn't want us to get into over-engineering, but if there is a common idiom out there (e.g. from https://data-apis.org/) then I could see adding that here. Alternatively, we (continue to) go all in on Dask as part of the public API.

Does anyone have any thoughts?

@camilolaiton
Copy link
Contributor Author

As a preamble, I am not actively using the ome-zarr Python library and have even less input on the dask functionality. I am unsure if the new maintenance responsibilities have been redefined for this repository. If not, I would suggest a member of the academically-funded OME team should add his/her review.

I don't have much to add to this PR beyond the fact that the proposed API addition feels simple enough. Only suggestion is that the new return value for each of the write method might be worth communicating in the docstring (and hence the generated readthedocs). In particular, my understand is that compute will be a no-op and the return list will be [] if the image is a simple numpy array, is that correct? If so, it might be worth clarifying the expectation for consumers.

I totally agree. I'll add this to the docstring.

Thanks for this, @camilolaiton. Definitely like the idea and it seems straight-forward enough.

The only question I have is whether or not there is any duck-typing (and/or wrapping) to be considered on the return type to improve the future-proofing. It's likely not a particularly critical issue at this point so I wouldn't want us to get into over-engineering, but if there is a common idiom out there (e.g. from https://data-apis.org/) then I could see adding that here. Alternatively, we (continue to) go all in on Dask as part of the public API.

Does anyone have any thoughts?

Yes, I get your point. However, I think it's not entirely necessary to add a wrapper for the return on this functionality, as it would probably make things a little more complicated than they should be. I think going all in on Dask is better at this point for the public API.

@joshmoore
Copy link
Member

Resolved conflicts and build now running.

@joshmoore
Copy link
Member

Everything is green. If there are no other comments, I'd propose to get this shipped as 0.7.0. We may eventually re-consider the all-in-on-dask decision, but introducing a new interface at this stage seems tricky at best.

@joshmoore
Copy link
Member

Taking that as confirmation and releasing. Thanks again, @camilolaiton!

@joshmoore joshmoore merged commit 23986fa into ome:master May 4, 2023
12 checks passed
joshmoore added a commit to will-moore/ome-zarr-py that referenced this pull request May 4, 2023
@joshmoore
Copy link
Member

Note that this might have broken the conda-forge build:

ome-zarr 0.7.0 requires distributed, which is not installed.

see conda-forge/ome-zarr-feedstock#12 for the failing build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants