
Big images #134

Merged (17 commits, Mar 9, 2023)

Conversation

@will-moore (Member) commented Jan 23, 2023

Fixes #111

This adds support for exporting big images using a tile-based approach instead of a plane-based approach.
As suggested by @joshmoore, this tile-based approach is now used for ALL images (instead of using the whole plane as a single chunk, as before).

First we write a full-size array at "0", a tile at a time.
Then we resize via ome_zarr.dask_utils.resize (see the bug-fix at ome/ome-zarr-py#244) to create a pyramid.
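A minimal sketch of that flow (assuming an omero.gateway ImageWrapper image and a zarr group root; the helper names, 1024-pixel tiles and 5-level pyramid are illustrative, not the exact code in this PR):

import dask.array as da
from ome_zarr.dask_utils import resize

def write_full_size_level(image, root, tile_w=1024, tile_h=1024):
    # Write the full-size "0" array one tile at a time from OMERO
    pixels = image.getPrimaryPixels()
    size_x, size_y = image.getSizeX(), image.getSizeY()
    shape = (image.getSizeT(), image.getSizeC(), image.getSizeZ(), size_y, size_x)
    dtype = pixels.getPixelsType().getValue()  # e.g. "uint16"
    arr = root.require_dataset(
        "0", shape=shape, chunks=(1, 1, 1, tile_h, tile_w), dtype=dtype
    )
    for t in range(shape[0]):
        for c in range(shape[1]):
            for z in range(shape[2]):
                for y in range(0, size_y, tile_h):
                    for x in range(0, size_x, tile_w):
                        w = min(tile_w, size_x - x)
                        h = min(tile_h, size_y - y)
                        tile = pixels.getTile(z, c, t, (x, y, w, h))
                        arr[t, c, z, y:y + h, x:x + w] = tile
    return arr

def downsample_on_disk(root, levels=5):
    # Halve X and Y for each resolution level, writing "1", "2", ... to the same store
    for level in range(1, levels):
        source = da.from_zarr(root.store, component=str(level - 1))
        shape = list(source.shape)
        shape[-2], shape[-1] = shape[-2] // 2, shape[-1] // 2
        output = resize(source, tuple(shape))
        da.to_zarr(arr=output, url=root.store, component=str(level))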

Other changes:

The --cache_numpy functionality has been replaced.
Previously, if you anticipated connection issues with OMERO, you could use the --cache_numpy option to cache chunks to disk at the same time as writing them to zarr. The problem with this was that if you lost the connection unexpectedly, you had nothing cached. Caching also doubled the disk usage.

Now, if the connection fails, you can simply rerun the omero zarr export Image:ID command and we pick up writing to the existing, partially-generated array.

This is nicer, as you don't have to delete the partially-exported image as before, and it's much faster, since you don't have to copy the cached chunks and re-write them to zarr.
Exporting the big image below required many re-runs of the export command - maybe 20-30 - before completion!

The --tile_width and --tile_height options, which previously only applied to bioformats2raw usage, now also apply to API-based export.

Image exported with this PR is at:
https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0048A/9846152.zarr/

I also fixed Plate export to work with the new export logic and to support interrupting a Plate export (partial export) and continuing from where you left off.

@joshmoore (Member) left a comment

I'll export some sample data...

Especially if we need to split the logic, 👍 for having tests of each.

)
# if big image...
if image.requiresPixelsPyramid():
    paths = add_big_image(image, parent, level_count)
@joshmoore (Member) commented:

Is the big image method so much slower that it's worth splitting the logic?

@will-moore (Member Author) replied:

Ah - you mean always use a tile-based approach, even if it's not a big image?
One question then: should the chunk size match the whole plane (as it currently does for non-big images), or would it be better to cap it at e.g. 1024 or 512 (which would be a ~breaking change)?

Actually, that reminds me that I should update to get a preferred tile-size from the rawPixelsStore, instead of hard-coding it at 512 x 512....

@will-moore (Member Author) commented:

@joshmoore Updated as I think you suggested: removed the code duplication, so we now use tile sizes from rfs.getTileSize() for ALL images, even if they're not big images.
Maybe not quite as fast for non-big images as previously, but speed is not so critical for export, and supporting a single way of exporting certainly means less code.
Also, this now adds support for the --cache_numpy behaviour for ALL images (including big images).
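For reference, a sketch of reading that preferred tile size (rfs.getTileSize() is the call mentioned above; conn and image are an assumed BlitzGateway connection and ImageWrapper):

# Query the server's preferred tile size from the raw pixels store
rfs = conn.createRawPixelsStore()
try:
    rfs.setPixelsId(image.getPixelsId(), False)
    tile_w, tile_h = rfs.getTileSize()
finally:
    rfs.close()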

@will-moore (Member Author) commented:

Looking to test this PR on the export of https://idr.openmicroscopy.org/webclient/?show=image-9846152, the preferred tile-size for that Image is tile_size_x: 19120, tile_size_y: 27.
This seems like a sub-optimal tile size for viewing an OME-NGFF image, so I wonder whether, instead of using the preferred tile-size from rawPixelsStore, we should simply choose a good default, e.g. 512 x 512 (which is what iviewer uses for all Images).
This might mean slightly slower export, which is OK, but the exported NGFF would be better.

I'll go ahead and make that change, unless there's any objection?

@sbesson (Member) commented Jan 25, 2023

> I'll go ahead and make that change, unless there's any objection?

I think the tile size given by the tile service is retrieved from the relevant Bio-Formats reader and effectively reflects how the data is stored on disk in its original file format.

For comparison, bioformats2raw completely ignores this value when converting the data and sets a chunk size of 1024x1024. This is largely in agreement with the proposal you are making here.

@will-moore (Member Author) commented:

Since we already support --tile_width and --tile_height for bioformats2raw, I've used them for CLI export too.

@will-moore (Member Author) commented Feb 1, 2023

Running the current version against https://idr.openmicroscopy.org/webclient/?show=image-9846152:
Loading all tiles for resolution 0 completed after many re-runs, losing the connection after ~12 Z planes each time.

Downsampling then failed at the point below, during the downsample from 1 -> 2:

$ ls -alh ./9846152.zarr
total 4.0K
drwxrwxr-x. 5 wmoore wmoore 68 Feb  1 20:18 .
drwxrwxr-x. 4 wmoore wmoore 53 Feb  1 06:04 ..
drwxrwxr-x. 5 wmoore wmoore 68 Feb  1 12:03 0
drwxrwxr-x. 5 wmoore wmoore 68 Feb  1 20:10 1
drwxrwxr-x. 5 wmoore wmoore 68 Feb  1 20:19 2
-rw-rw-r--. 1 wmoore wmoore 24 Feb  1 06:04 .zgroup

$ find ./9846152.zarr/0 -type f | wc -l
72619
$ find ./9846152.zarr/1 -type f | wc -l
19111
$ find ./9846152.zarr/2 -type f | wc -l
Downsample stack trace:
Traceback (most recent call last):
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/bin/omero", line 11, in <module>
    sys.exit(main())
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/omero/main.py", line 125, in main
    rv = omero.cli.argv()
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/omero/cli.py", line 1784, in argv
    cli.invoke(args[1:])
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/omero/cli.py", line 1222, in invoke
    stop = self.onecmd(line, previous_args)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/omero/cli.py", line 1299, in onecmd
    self.execute(line, previous_args)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/omero/cli.py", line 1381, in execute
    args.func(args)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/omero_zarr/cli.py", line 107, in _wrapper
    return func(self, *args, **kwargs)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/omero_zarr/cli.py", line 316, in export
    image_to_zarr(image, args)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/omero_zarr/raw_pixels.py", line 36, in image_to_zarr
    add_image(image, root, tile_width=tile_width, tile_height=tile_height)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/omero_zarr/raw_pixels.py", line 63, in add_image
    paths = add_raw_image(image, parent, level_count, tile_width, tile_height)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/omero_zarr/raw_pixels.py", line 164, in add_raw_image
    downsample_pyramid_on_disk(parent, paths)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/omero_zarr/raw_pixels.py", line 188, in downsample_pyramid_on_disk
    da.to_zarr(arr=output, url=parent.store, component=path)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/dask/array/core.py", line 3694, in to_zarr
    return arr.store(z, lock=False, compute=compute, return_stored=return_stored)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/dask/array/core.py", line 1767, in store
    r = store([self], [target], **kwargs)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/dask/array/core.py", line 1235, in store
    compute_as_if_collection(Array, store_dsk, map_keys, **kwargs)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/dask/base.py", line 341, in compute_as_if_collection
    return schedule(dsk2, keys, **kwargs)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/dask/array/core.py", line 4366, in store_chunk
    return load_store_chunk(x, out, index, lock, return_stored, False)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/dask/array/core.py", line 4348, in load_store_chunk
    out[index] = x
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/zarr/core.py", line 1373, in __setitem__
    self.set_basic_selection(pure_selection, value, fields=fields)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/zarr/core.py", line 1468, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/zarr/core.py", line 1772, in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/zarr/core.py", line 1800, in _set_selection
    check_array_shape('value', value, sel_shape)
  File "/home/wmoore/miniconda3/envs/omero_zarr_export/lib/python3.9/site-packages/zarr/util.py", line 546, in check_array_shape
    raise ValueError('parameter {!r}: expected array with shape {!r}, got {!r}'
ValueError: parameter 'value': expected array with shape (1, 1, 265, 1024), got (1, 1, 259, 1024)

@imagesc-bot commented:

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/converting-other-idr-images-into-public-zarr/44025/20

@will-moore (Member Author) commented:

The resize error above was fixed by ome/ome-zarr-py@e1fdd12

The branch at this point was used to generate the image at https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0048A/9846152.zarr/

I've also updated the description to cover all the changes in this PR.

@will-moore (Member Author) commented:

The pre-commit.ci check is failing with:

              RuntimeError: The Poetry configuration is invalid:
                - [extras.pipfile_deprecated_finder.2] 'pip-shims<=0.3.4' does not match '^[a-zA-Z-_.0-9]+$'

Same error at ome/ome-zarr-py#244

The last poetry release (https://pypi.org/project/poetry/) was 10th Jan 2023, but that doesn't appear to be the cause, since the check above has passed since that date.

I don't see any mention of Poetry itself in this repo's config.

@joshmoore (Member) left a comment

Still could use some functional testing (to convince ourselves that no chunk could be unintentionally missed), but from reading the code:

  • No objection to the new strategy, but the dropping of an argument will require a bigger version bump, since pipelines could break. Additionally we might want to say, "need to delete existing datasets if you don't want them unused".
  • Generally I could see moving to logging rather than print everywhere, but we can follow up on this, since I assume there are still a good number of prints now.
  • I could see adding dask into the mix in the future to download/write tiles in parallel.


# create "0" array if it doesn't exist
path = "0"
@joshmoore (Member) commented:

Unrelated heads up that we might want to encapsulate these names in case we ever want to move to a SHOULD recommendation for the naming of the arrays that includes a prefix.
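For example, a hypothetical helper along those lines (the prefix parameter is an assumption, not anything specified by this PR):

def dataset_name(level: int, prefix: str = "") -> str:
    # Centralise how resolution-level arrays are named, so a future
    # prefixed-naming recommendation only needs a change in one place
    return f"{prefix}{level}"

"0" would then become dataset_name(0) at the call sites.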

@will-moore (Member Author) commented:

Re logging, I tried using logging but couldn't work out how to actually see the logging statements.

If I do:

import logging
LOGGER = logging.getLogger("omero_zarr.raw_pixels")
...
LOGGER.warn("t, c, z, chk_x, chk_y %s %s %s %s %s" % (t, c, z, chk_x, chk_y))

I see this e.g:

WARNING:omero_zarr.raw_pixels:t, c, z, chk_x, chk_y 0 2 0 0 0

which is not ideal (I really just want t, c, z, chk_x, chk_y 0 2 0 0 0).
If I use LOGGER.info() I see nothing at all.

@joshmoore (Member) commented:

That comes down to the config. The idea is that someone else can also influence how the output appears. If you are doing this in your own code, use logging.basicConfig to set the format and the log level. Just make sure that isn't in the released code, so that the CLI, for example, can set its own preferences.
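A minimal sketch of that suggestion (for your own debugging entry point, not the released code):

import logging

# Configure once: bare message format, INFO level
logging.basicConfig(level=logging.INFO, format="%(message)s")

logger = logging.getLogger("omero_zarr.raw_pixels")
logger.info("t, c, z, chk_x, chk_y %s %s %s %s %s", 0, 2, 0, 0, 0)
# prints: t, c, z, chk_x, chk_y 0 2 0 0 0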

@will-moore (Member Author) commented:

Testing the export of a Plate with this PR, I see an issue with the logic for testing whether a chunk has been previously written:

chunk_keys = list(zarray.chunk_store.keys())
separator = zarray.chunk_store.key_separator

chunk_key = separator.join([str(k) for k in key_dims])
print("chunk_key", chunk_key)
print("keys", chunk_keys)
if chunk_key not in chunk_keys:
    # ... load from OMERO ...

This is failing to detect existing chunks and prints out...

chunk_key 0/4/0/0
keys ['.zgroup', '1/.zarray', '1/0/0/0', '1/1/0/0', '1/2/0/0', '1/3/0/0', '1/4/0/0', 'L/.zgroup', 'L/9/.zattrs', 'L/9/.zgroup', 'L/9/0/.zattrs', 'L/9/0/.zgroup', 'L/9/0/0/.zarray', 'L/9/0/0/0/0/0', 'L/9/0/0/1/0/0', 'L/9/0/0/2/0/0', 'L/9/0/0/3/0/0', 'L/9/0/0/4/0/0', 'L/9/0/1/.zarray', 'L/9/0/1/0.0.0', 'L/9/0/1/1.0.0', 'L/9/0/1/2.0.0', 'L/9/0/1/3.0.0', 'L/9/0/1/4.0.0', 'L/9/0/2/.zarray', 'L/9/0/2/0.0.0', 'L/9/0/2/1.0.0', 'L/9/0/2/2.0.0', 'L/9/0/2/3.0.0', 'L/9/0/2/4.0.0', 'L/9/0/3/.zarray', 'L/9/0/3/0.0.0', 'L/9/0/3/1.0.0', 'L/9/0/3/2.0.0', 'L/9/0/3/3.0.0', 'L/9/0/3/4.0.0', 'L/9/1/.zattrs', 'L/9/1/.zgroup', 'L/9/1/0/.zarray', 'L/9/1/0/0/0/0', 'L/9/1/0/1/0/0', 'L/9/1/0/2/0/0', 'L/9/1/0/3/0/0', 'L/9/1/0/4/0/0', 'L/9/1/1/.zarray', 'L/9/1/1/0.0.0', 'L/9/1/1/1.0.0', 'L/9/1/1/2.0.0', 'L/9/1/1/3.0.0', 'L/9/1/1/4.0.0', 'L/9/1/2/.zarray', 'L/9/1/2/0.0.0', 'L/9/1/2/1.0.0', 'L/9/1/2/2.0.0', 'L/9/1/2/3.0.0', 'L/9/1/2/4.0.0', 'L/9/1/3/.zarray', 'L/9/1/3/0.0.0', 'L/9/1/3/1.0.0', 'L/9/1/3/2.0.0', 'L/9/1/3/3.0.0', 'L/9/1/3/4.0.0', 'L/9/2/.zattrs', 'L/9/2/.zgroup', 'L/9/2/0/.zarray', 'L/9/2/0/0/0/0', 'L/9/2/0/1/0/0', 'L/9/2/0/2/0/0', 'L/9/2/0/3/0/0', 'L/9/2/0/4/0/0', 'L/9/2/1/.zarray', 'L/9/2/1/0.0.0', 'L/9/2/1/1.0.0', 'L/9/2/1/2.0.0', 'L/9/2/1/3.0.0', 'L/9/2/1/4.0.0', 'L/9/2/2/.zarray', 'L/9/2/2/0.0.0', 'L/9/2/2/1.0.0', 'L/9/2/2/2.0.0', 'L/9/2/2/3.0.0', 'L/9/2/2/4.0.0', 'L/9/2/3/.zarray', 'L/9/2/3/0.0.0', 'L/9/2/3/1.0.0', 'L/9/2/3/2.0.0', 'L/9/2/3/3.0.0', 'L/9/2/3/4.0.0', 'L/9/3/.zattrs', 'L/9/3/.zgroup', 'L/9/3/0/.zarray', 'L/9/3/0/0/0/0', 'L/9/3/0/1/0/0', 'L/9/3/0/2/0/0', 'L/9/3/0/3/0/0', 'L/9/3/0/4/0/0', 'L/9/3/1/.zarray', 'L/9/3/1/0.0.0', 'L/9/3/1/1.0.0', 'L/9/3/1/2.0.0', 'L/9/3/1/3.0.0', 'L/9/3/1/4.0.0', 'L/9/3/2/.zarray', 'L/9/3/2/0.0.0', 'L/9/3/2/1.0.0', 'L/9/3/2/2.0.0', 'L/9/3/2/3.0.0', 'L/9/3/2/4.0.0', 'L/9/3/3/.zarray', 'L/9/3/3/0.0.0', 'L/9/3/3/1.0.0', 'L/9/3/3/2.0.0', 'L/9/3/3/3.0.0', 'L/9/3/3/4.0.0', 'L/9/4/.zgroup', 'L/9/4/0/.zarray']

It looks like a different path separator is being used for the path to the Image (/) and for the Image dims (.). Trying to match this behaviour - and have it behave differently for Images vs Plates - seems too fragile.
So I think I'll revert to my previous approach (changed in 6af4061) to check for existing chunks.
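One possible shape for such a check, assuming a local DirectoryStore (the helper name is illustrative and not necessarily what 6af4061 does):

import os

def chunk_exists(zarray, key_dims):
    # Look for the chunk file on disk; try both chunk-key separators,
    # since arrays here may use either "." or "/" between chunk indices
    array_dir = os.path.join(zarray.store.path, zarray.path)
    for sep in (".", "/"):
        if os.path.exists(os.path.join(array_dir, sep.join(str(k) for k in key_dims))):
            return True
    return False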

@joshmoore (Member) left a comment

In general, looking good. I do note the lack of automated testing around this stack, which we probably need to spend a bit of time on.

@@ -126,12 +126,6 @@ def _configure(self, parser: Parser) -> None:
            "--output", type=str, default="", help="The output directory"
        )

        parser.add_argument(
            "--cache_numpy",
@joshmoore (Member) commented:

I assume this takes us to 0.6.0 unless we want to tolerate it being set and print a warning message.

@will-moore (Member Author) replied:

I think 0.6.0 makes sense, since we also change the default chunking behaviour

@joshmoore (Member) commented:

👍 Merging.

Anything else to roll into a 0.6.0, @will-moore? @dominikl?

Successfully merging this pull request may close these issues:

zarr export doesn't support big images