Big images #134
Conversation
I'll export some sample data...
Especially if we need to split the logic, 👍 for having tests of each.
src/omero_zarr/raw_pixels.py (Outdated)

```python
)
# if big image...
if image.requiresPixelsPyramid():
    paths = add_big_image(image, parent, level_count)
```
Is the big image method so much slower that it's worth splitting the logic?
Ah - you mean always use a tile-based approach, even if it's not a big image?
One question then would be: should the chunk-size match the whole plane (as it currently does for non-big images), or would it be better to cap it at e.g. 1024 or 512 (which would be a ~breaking change)?
Actually, that reminds me that I should update to get a preferred tile-size from the rawPixelsStore, instead of hard-coding it at 512 x 512.
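For reference, a minimal sketch of what asking the server for its preferred tile size could look like with the OMERO Python (Blitz) API; the `conn` and `image` objects are assumed to exist, and the exact call sites in this PR may differ:

```python
# Sketch only: query the server's preferred tile size instead of hard-coding 512 x 512.
# Assumes an existing BlitzGateway connection `conn` and an ImageWrapper `image`.
pixels_id = image.getPrimaryPixels().getId()
rps = conn.createRawPixelsStore()
try:
    rps.setPixelsId(pixels_id, False, conn.SERVICE_OPTS)
    tile_width, tile_height = rps.getTileSize()
finally:
    rps.close()
print("preferred tile size:", tile_width, "x", tile_height)
```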
@joshmoore Updated as I think you suggested? Removed code duplication so we use tile sizes from
Looking to test this PR on the export of https://idr.openmicroscopy.org/webclient/?show=image-9846152, the preferred tile-size from that Image is
I'll go ahead and make that change, unless there's any objection?
I think the tile size given by the tile service is retrieved from the relevant Bio-Formats reader and effectively reflects how the data is stored on disk in its original file format. For comparison,
Since we already support
Running the current version against https://idr.openmicroscopy.org/webclient/?show=image-9846152. Downsampling then failed at this point, during the downsample from 1 -> 2:
downsample stacktrace
This pull request has been mentioned on Image.sc Forum. There might be relevant details there: https://forum.image.sc/t/converting-other-idr-images-into-public-zarr/44025/20
The resize error above was fixed by ome/ome-zarr-py@e1fdd12. The branch at this point was used to generate the image at https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0048A/9846152.zarr/ Also updated the description with all changes in this PR.
The
Same error at ome/ome-zarr-py#244. The last release of https://pypi.org/project/poetry/ was 10th Jan 2023, but that doesn't appear to be the issue since this check has passed after that date above. I don't see any mention of Poetry itself in this repo's config.
Googled and found https://stackoverflow.com/questions/75269700/pre-commit-fails-to-install-isort-5-11-4-with-error-runtimeerror-the-poetry-co which suggested the fix in the previous commit.
Still could use some functional testing (to convince ourselves that no chunk could be unintentionally missed) but from reading the code:
- No objection to the new strategy but the dropping of an argument will require a bigger bump since pipelines could break. Additionally we might want to say, "need to delete existing datasets if you don't want them unused".
- Generally could see moving to `logging` rather than `print` everywhere, but we can follow up with this since I assume there are still a good number of `print`s now.
- I could see adding dask into the mix in the future to download/write tiles in parallel (see the sketch after this list).
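As a rough illustration of that last point, here is a hypothetical sketch (not code from this PR) of using `dask.delayed` to download and write tiles in parallel; `fetch_tile` is a stand-in for the OMERO call (e.g. `rawPixelsStore.getTile(...)`), and the paths and shapes are made up:

```python
# Hypothetical sketch: write tiles to a zarr array in parallel with dask.
import dask
import numpy as np
import zarr

TILE = 512
# Chunks aligned with the tile size, so each task writes to distinct chunks.
arr = zarr.open("example.zarr/0", mode="a", shape=(4096, 4096),
                chunks=(TILE, TILE), dtype="uint16")

def fetch_tile(y, x, h, w):
    # Placeholder for the real tile download; returns dummy data here.
    return np.zeros((h, w), dtype="uint16")

@dask.delayed
def write_tile(y, x):
    h = min(TILE, arr.shape[0] - y)
    w = min(TILE, arr.shape[1] - x)
    arr[y:y + h, x:x + w] = fetch_tile(y, x, h, w)

dask.compute(*(write_tile(y, x)
               for y in range(0, arr.shape[0], TILE)
               for x in range(0, arr.shape[1], TILE)))
```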
```python
# create "0" array if it doesn't exist
path = "0"
```
Unrelated heads up that we might want to encapsulate these names in case we ever want to move to a SHOULD recommendation for the naming of the arrays that includes a prefix.
Re logging, I tried using logging but couldn't work out how to actually see the logging statements. If I do:
I see this, e.g.:
which is not ideal (I really just want
That comes down to the config. The idea is that someone else can also influence how it appears. If you are doing this in your own code, then use logging.basicConfig and you set the format and the log-level. Just make sure that isn't in the released code, so that the CLI, e.g., can set its own preferences.
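For example, a minimal sketch of the kind of setup being described, with `logging.basicConfig` called once at the entry point of your own script rather than in library code:

```python
# Configure logging once, at the application entry point, so that callers
# (e.g. the omero CLI) remain free to set their own handlers, formats and levels.
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")

logger = logging.getLogger("omero_zarr")
logger.info("exporting chunk %s", "0/0.0.0.0.0")
```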
Testing the export of a Plate with this PR, I see an issue with the logic for testing whether a chunk has been previously written:
This is failing to detect existing chunks and prints out...
It looks like a different path separator is being used for the path to the Image (
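Purely as an illustration of how a separator mismatch like that can defeat an "already written?" check (hypothetical code, not the PR's actual logic):

```python
# Hypothetical: a path built with os.path.join() uses "\\" on Windows, while zarr
# chunk keys always use "/", so a naive prefix/exists comparison never matches.
import os

image_path = os.path.join("plate.zarr", "A", "1", "0")  # "plate.zarr\\A\\1\\0" on Windows
chunk_key = "plate.zarr/A/1/0/0.0.0.0.0"

already_written = chunk_key.startswith(image_path)               # False on Windows
fixed = chunk_key.startswith(image_path.replace(os.sep, "/"))    # normalize first
```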
Much easier to judge export progress if Wells are ordered
In general, looking good. I do note the lack of automated testing around this stack that we probably need to spend a bit of time on.
```
@@ -126,12 +126,6 @@ def _configure(self, parser: Parser) -> None:
            "--output", type=str, default="", help="The output directory"
        )

        parser.add_argument(
            "--cache_numpy",
```
I assume this takes us to 0.6.0 unless we want to tolerate it being set and print a warning message.
I think 0.6.0 makes sense, since we also change the default chunking behaviour.
👍 Merging. Anything else to roll into a 0.6.0, @will-moore? @dominikl?
Fixes #111
This adds support for exporting big images in a tile-based approach instead of a plane-based approach.
As suggested by @joshmoore, this tile-based approach is used for ALL images (instead of using the whole plane as a single chunk as before).
First we write a full-size array at `0`, a tile at a time. Then resize via `ome_zarr.dask_utils.resize` (see bug-fix at ome/ome-zarr-py#244) to create a pyramid.
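Roughly, that downsampling step looks something like the following sketch (paths and keyword arguments are assumptions; the PR's actual code may differ):

```python
# Sketch of creating one pyramid level from the full-size "0" array
# using ome_zarr.dask_utils.resize (which wraps skimage.transform.resize).
import dask.array as da
from ome_zarr.dask_utils import resize

level_0 = da.from_zarr("9846152.zarr/0")
# Halve only the Y and X dimensions.
output_shape = level_0.shape[:-2] + tuple(s // 2 for s in level_0.shape[-2:])
level_1 = resize(level_0, output_shape, preserve_range=True, anti_aliasing=False)
level_1.to_zarr("9846152.zarr/1")
```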
Other changes:
The `--numpy_cache` functionality has been replaced.
Previously, if you anticipated connection issues with OMERO, you could use the `--numpy_cache` option to cache chunks to disk at the same time as writing them to zarr. The problem with this was that if you lost connection and weren't expecting it, then you had nothing cached. Also, caching doubled the disk usage.
Now, if the connection fails, you can simply rerun the `omero zarr export Image:ID` command and we pick up writing to the existing partially-generated array. This is nicer as you don't have to delete the partially-exported image as before, and it's much faster as you don't have to copy the cached chunks and re-write them to zarr.
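That resume behaviour rests on being able to tell whether a chunk has already been written before re-fetching its data; a simplified sketch of the idea (names and layout are assumptions, not the PR's exact implementation):

```python
# Simplified sketch: skip a tile if its target chunk file is already on disk.
import os

def chunk_exists(export_dir: str, array_path: str, chunk_name: str) -> bool:
    # e.g. export_dir="9846152.zarr", array_path="0", chunk_name="0.0.0.0.0"
    return os.path.exists(os.path.join(export_dir, array_path, chunk_name))

if not chunk_exists("9846152.zarr", "0", "0.0.0.0.0"):
    pass  # fetch the tile from OMERO and write it to the array
```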
Exporting the big image below required many re-runs of the export command, maybe 20-30 times, before completion!
The `tile_width` and `tile_height` options that previously only applied to `bioformats2raw` usage now also apply to API-based export.
Image exported with this PR is at:
https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0048A/9846152.zarr/
I also fixed the Plate export to work with the new export logic and to support interrupting a Plate export (partial export) and continuing from where you left off.