Zarr Reading and Writing (.zspy format) - Reformatted #2825
Conversation
# Conflicts:
#	hyperspy/io_plugins/hspy.py
Codecov Report

```
@@               Coverage Diff                @@
##    RELEASE_next_minor    #2825      +/-   ##
===============================================
- Coverage      78.45%     77.00%     -1.46%
===============================================
  Files            203        206         +3
  Lines          31287      31566       +279
  Branches        6872       6907        +35
===============================================
- Hits           24547      24307       -240
- Misses          4977       5505       +528
+ Partials        1763       1754         -9
```

Continue to review full report at Codecov.
Thanks @CSSFrancis, this looks good and very promising! I will not have time in the next two weeks to look at the details, so I will do that at the beginning of October.
This looks very good! See comments in the changes and:
- there is an empty file which needs to be removed: hyperspy/io_plugins/spy.py
- when overwriting an existing folder, it doesn't ask for confirmation, while the other readers do
- it would be good to increase the coverage as highlighted by the GitHub annotations, but this can be done in a separate PR.
Co-authored-by: Eric Prestat <eric.prestat@gmail.com>
Done!
I changed this behavior, so it should be working now. It was an oversight...
I've added a bit more coverage, but some parts are obviously still lacking, mostly because the hspy format isn't at 100% coverage. I might get around to that in another PR, as that should probably be at 100%. The only thing I'm not fully confident about in this PR is removing the shuffle and maxshape arguments; I think they are legacy from some older h5py requirement. For maxshape I can say fairly confidently that removing it is fine. For shuffle, I have set the default to True, and at least now it can also be set through the kwds argument.
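For context on what the shuffle filter buys you, here is a standalone sketch (plain NumPy and zlib, not the hspy/zspy writer itself) of why byte-shuffling multi-byte integer data typically improves compressibility:

```python
import zlib

import numpy as np


def byte_shuffle(arr):
    """Group the i-th byte of every element together (the 'shuffle' filter).

    For multi-byte dtypes this puts the slowly-varying high-order bytes
    into long runs, which generic compressors handle much better.
    """
    itemsize = arr.dtype.itemsize
    as_bytes = arr.view(np.uint8).reshape(arr.size, itemsize)
    return as_bytes.T.tobytes()


data = np.arange(100_000, dtype=np.int64)
plain = zlib.compress(data.tobytes(), 6)
shuffled = zlib.compress(byte_shuffle(data), 6)
print(len(plain), len(shuffled))  # shuffled is much smaller here
```

This is the same idea behind `shuffle=True` in HDF5 filters and `Blosc(..., shuffle=...)` in numcodecs; the exact savings depend on the data.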
I made some more comments and a PR to simplify the class inheritance: CSSFrancis#1.
an issue, it can be disabled by setting ``compressor=None``. In general we recommend
compressing your datasets, as it can greatly reduce I/O overhead.

- ``write_to_storage``: The write to storage option allows you to pass the path to a directory (or database)
Going for option 2 should be good, because this is the pattern used by zarr.group, which can take a path or a zarr store as its first argument.
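To make the "path or store" pattern concrete, here is a toy dispatch sketch; the names (`MemoryStore`, `open_group`) are hypothetical stand-ins, not zarr's actual API:

```python
from pathlib import Path


class MemoryStore(dict):
    """Toy stand-in for an already-constructed store object."""


def open_group(store):
    """Accept either a path (str/Path) or a store object, mirroring the
    dispatch pattern zarr.group uses for its first argument."""
    if isinstance(store, (str, Path)):
        # A path means "open a directory-backed store at this location".
        return {"kind": "directory", "path": str(store)}
    # Anything else is treated as an existing store.
    return {"kind": "store", "store": store}
```

With this shape, the user-facing API stays a single argument while still supporting databases or other zarr storage backends.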
Simplify class inheritance for the zarr-io PR
@ericpre

```python
import numpy as np
import h5py

# NOTE: `multiply` is hyperspy's product-of-a-sequence helper
# (equivalent to np.prod over the list).

def _get_signal_chunks(shape, dtype, signal_axes, target_size=1024 * 1024):
    """Calculate chunks for the signal, preferably at least one
    chunk per signal space.

    Parameters
    ----------
    shape : tuple
        The shape of the dataset to be stored / chunked.
    dtype : {dtype, string}
        The numpy dtype of the data.
    signal_axes : {None, iterable of ints}
        The axes defining the "signal space" of the dataset. If None,
        the default h5py chunking is performed.
    target_size : int
        The target number of bytes for one chunk.
    """
    typesize = np.dtype(dtype).itemsize
    if signal_axes is None:
        return h5py._hl.filters.guess_chunk(shape, None, typesize)
    # Largely based on guess_chunk in h5py
    bytes_per_signal = multiply([shape[i] for i in signal_axes]) * typesize
    signals_per_chunk = np.floor_divide(target_size, bytes_per_signal)
    if signals_per_chunk < 2:
        # Signal is larger than the chunk max
        chunks = [s if i in signal_axes else 1 for i, s in enumerate(shape)]
        return tuple(chunks)
    else:
        # Signal is smaller than the chunk max
        navigation_axes = tuple(i for i in range(len(shape))
                                if i not in signal_axes)
        num_nav_axes = len(navigation_axes)
        sig_axes_chunk = np.floor(signals_per_chunk ** (1 / num_nav_axes))
        remainder = np.floor_divide(
            signals_per_chunk - (sig_axes_chunk ** num_nav_axes),
            sig_axes_chunk)
        chunks = [s if i in signal_axes else sig_axes_chunk
                  for i, s in enumerate(shape)]
        chunks[navigation_axes[0]] = chunks[navigation_axes[0]] + remainder
        return tuple(int(x) for x in chunks)
```

I think this makes quite a bit more sense and gets rid of the while loop. It changes the behavior a little (I think), but the previous version was not actually creating 1 MB chunks; it created chunks of 1024*1024 * bit depth, which I don't think makes much sense.
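As a quick sanity check of the direct calculation, here is a self-contained version of the same logic (with np.prod standing in for hyperspy's `multiply` helper), applied to an illustrative 4D dataset:

```python
import numpy as np


def guess_signal_chunks(shape, dtype, signal_axes, target_size=1024 * 1024):
    """Standalone sketch of the chunk calculation above."""
    typesize = np.dtype(dtype).itemsize
    bytes_per_signal = np.prod([shape[i] for i in signal_axes]) * typesize
    signals_per_chunk = target_size // bytes_per_signal
    if signals_per_chunk < 2:
        # Signal larger than the chunk budget: full signal dims, 1 elsewhere.
        return tuple(s if i in signal_axes else 1 for i, s in enumerate(shape))
    nav_axes = tuple(i for i in range(len(shape)) if i not in signal_axes)
    sig_axes_chunk = np.floor(signals_per_chunk ** (1 / len(nav_axes)))
    remainder = (signals_per_chunk
                 - sig_axes_chunk ** len(nav_axes)) // sig_axes_chunk
    chunks = [s if i in signal_axes else sig_axes_chunk
              for i, s in enumerate(shape)]
    chunks[nav_axes[0]] += remainder
    return tuple(int(x) for x in chunks)


# 15x15 navigation grid of 256x256 int64 frames, 1 MiB target chunks:
print(guess_signal_chunks((15, 15, 256, 256), np.int64, (2, 3)))
# → (2, 1, 256, 256), i.e. two signals (exactly 1 MiB) per chunk
```

Note that the result lands exactly on the target here (2 × 256 × 256 × 8 bytes = 1 MiB), which is why a strict `<` assertion on the chunk size can fail.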
As mentioned in one of the comments, I would be inclined not to change how the chunking is calculated and to sort it out properly in a separate PR.
Some tests of this PR are failing and a few other minor comments need to be considered. Other than that, it should be good to go!
```python
def _get_signal_chunks():
    raise NotImplementedError(
        "This method must be implemented by subclasses.")
```

```python
def _get_signal_chunks(shape, dtype, signal_axes=None, target_size=1e6):
```
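The split above follows the usual template-method pattern: the shared base writer declares the hook and each format-specific writer overrides it. A minimal sketch (class names and the trivial override body are assumptions, not the PR's actual implementations):

```python
class HierarchicalWriter:
    """Shared base for hierarchical-format writers (hspy, zspy)."""

    @staticmethod
    def _get_signal_chunks(shape, dtype, signal_axes=None, target_size=1e6):
        raise NotImplementedError(
            "This method must be implemented by subclasses.")


class ZspyWriter(HierarchicalWriter):
    @staticmethod
    def _get_signal_chunks(shape, dtype, signal_axes=None, target_size=1e6):
        # Illustrative override only: returning None would defer to the
        # backend's automatic chunking.
        return None
```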
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also hyperspy/_signals/lazy.py, lines 189 to 263 in 3517f8a:
```python
def _get_dask_chunks(self, axis=None, dtype=None):
    """Returns dask chunks.

    Aims:
    - Have at least one signal (or specified axis) in a single chunk,
      or as many as fit in memory

    Parameters
    ----------
    axis : {int, string, None, axis, tuple}
        If axis is None (default), returns chunks for current data shape so
        that at least one signal is in the chunk. If an axis is specified,
        only that particular axis is guaranteed to be "not sliced".
    dtype : {string, np.dtype}
        The dtype of target chunks.

    Returns
    -------
    Tuple of tuples, dask chunks
    """
    dc = self.data
    dcshape = dc.shape
    for _axis in self.axes_manager._axes:
        if _axis.index_in_array < len(dcshape):
            _axis.size = int(dcshape[_axis.index_in_array])

    if axis is not None:
        need_axes = self.axes_manager[axis]
        if not np.iterable(need_axes):
            need_axes = [need_axes, ]
    else:
        need_axes = self.axes_manager.signal_axes

    if dtype is None:
        dtype = dc.dtype
    elif not isinstance(dtype, np.dtype):
        dtype = np.dtype(dtype)
    typesize = max(dtype.itemsize, dc.dtype.itemsize)
    want_to_keep = multiply([ax.size for ax in need_axes]) * typesize

    # @mrocklin recommends to have around 100MB chunks, so we do that:
    num_that_fit = int(100. * 2.**20 / want_to_keep)

    # want to have at least one "signal" per chunk
    if num_that_fit < 2:
        chunks = [tuple(1 for _ in range(i)) for i in dc.shape]
        for ax in need_axes:
            chunks[ax.index_in_array] = dc.shape[ax.index_in_array],
        return tuple(chunks)

    sizes = [
        ax.size for ax in self.axes_manager._axes if ax not in need_axes
    ]
    indices = [
        ax.index_in_array for ax in self.axes_manager._axes
        if ax not in need_axes
    ]

    while True:
        if multiply(sizes) <= num_that_fit:
            break
        i = np.argmax(sizes)
        sizes[i] = np.floor(sizes[i] / 2)
    chunks = []
    ndim = len(dc.shape)
    for i in range(ndim):
        if i in indices:
            size = float(dc.shape[i])
            split_array = np.array_split(
                np.arange(size), np.ceil(size / sizes[indices.index(i)]))
            chunks.append(tuple(len(sp) for sp in split_array))
        else:
            chunks.append((dc.shape[i], ))
    return tuple(chunks)
```
which does pretty much the same thing. I guess we need a standalone guess_chunks function which should be used both here and in hyperspy._signals.lazy.LazySignal.

Since this is only indirectly related to the purpose of this PR, maybe it is better to do this in a separate PR. I would prefer not to change the behaviour in this PR and to sort it out completely in a separate PR, with tests added, etc. There is already a lot in this PR! What do you think?
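The navigation-dimension halving inside `_get_dask_chunks` can be isolated into a few lines. This sketch (plain NumPy, simplified from the loop above) shows the strategy: repeatedly halve the largest navigation dimension until the signals per chunk fit the budget:

```python
import numpy as np


def halve_until_fits(nav_sizes, num_that_fit):
    """Halve the largest navigation dimension until at most
    `num_that_fit` signals remain in one chunk."""
    sizes = list(nav_sizes)
    while np.prod(sizes) > num_that_fit:
        i = int(np.argmax(sizes))
        sizes[i] = int(np.floor(sizes[i] / 2))
    return sizes


# e.g. a 100x100 navigation grid where only ~600 signals fit in the budget:
print(halve_until_fits([100, 100], 600))  # → [12, 25]
```

Because each step halves a dimension, the result can substantially undershoot the budget (here 300 of 600 allowed signals), which is the behaviour the direct calculation in this PR tried to avoid.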
For reference, here is the auto-chunking from zarr:
https://github.com/zarr-developers/zarr-python/blob/be3d657484590adacaedf980a2951ddfa92b68b8/zarr/util.py#L60-L103
We can sort it out in a separate PR. Do you want me to roll back to the previous _get_signal_chunks?

I would argue against both the zarr and the h5py guess-chunk functions. Something about using a while loop rather than directly calculating the chunk shape makes me itch and is very confusing. zarr does a slightly better job of explaining what is happening, but I think it is still fairly unpythonic.

Honestly, chunking in hyperspy is a little bit broken and could use some major work. I can make an issue to track some of the problems that I have come across.
I would suggest leaving the chunking discussion for another PR. The same goes for passing a zarr store instead of a filename string, because it doesn't fully work yet.
There are two other comments/questions.
```python
path = group._store.dir_path() + "/" + dset.path
dset = zarr.open_array(path,
                       mode="w",
                       shape=data.shape,
                       dtype=data.dtype,
                       chunks=chunks,
                       **kwds)
```
Is it necessary?
At least on my machine it seems to be. If you take it out, the array is never updated. There are some things about how the synchronizer is handled that I am still rather confused by. There is no flush method for the directory store, but I think that reopening the file is equivalent.
I tried a bunch of different things, and this is the one that worked in the end.
OK, let's leave it there for now and we can revisit after I rebase #2797 on this branch.
Co-authored-by: Eric Prestat <eric.prestat@gmail.com>
There are two tests failing:

```
______________________ test_get_signal_chunks[1000000.0] ______________________
[gw1] win32 -- Python 3.8.10 c:\hostedtoolcache\windows\python\3.8.10\x64\python.exe

target_size = 1000000.0

    @pytest.mark.parametrize("target_size", (1e6, 1e7))
    def test_get_signal_chunks(target_size):
        chunks = HierarchicalWriter._get_signal_chunks(shape=[15, 15, 256, 256],
                                                       dtype=int,
                                                       signal_axes=(2, 3),
                                                       target_size=target_size)
>       assert (np.prod(chunks)*8 < target_size)
E       assert (196608 * 8) < 1000000.0
E        +  where 196608 = <function prod at 0x0000017CB7857160>((3, 1, 256, 256))
E        +  where <function prod at 0x0000017CB7857160> = np.prod

hyperspy\tests\io\test_hdf5.py:857: AssertionError
_____________________ test_get_signal_chunks[10000000.0] ______________________
[gw1] win32 -- Python 3.8.10 c:\hostedtoolcache\windows\python\3.8.10\x64\python.exe

target_size = 10000000.0

    @pytest.mark.parametrize("target_size", (1e6, 1e7))
    def test_get_signal_chunks(target_size):
        chunks = HierarchicalWriter._get_signal_chunks(shape=[15, 15, 256, 256],
                                                       dtype=int,
                                                       signal_axes=(2, 3),
                                                       target_size=target_size)
>       assert (np.prod(chunks)*8 < target_size)
E       assert (2359296 * 8) < 10000000.0
E        +  where 2359296 = <function prod at 0x0000017CB7857160>((6, 6, 256, 256))
E        +  where <function prod at 0x0000017CB7857160> = np.prod
```
@ericpre What a strange little bug... the Windows operating systems running the tests are 32 bit, so …
… dependency for dask version < 2021.3.1
…ilable for numcodecs on other platforms. Skip zspy tests when zarr not installed
The machines on CI are 64-bit. In 2b236f5, I made zarr optional because there are no packages available on PyPI or conda-forge for architectures other than x86_64/amd64.
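The underlying gotcha here is that NumPy's legacy default integer follows the platform's C long, which is 32-bit on Windows even in 64-bit builds (behaviour that also changed in NumPy 2.0), so `dtype=int` gives 4-byte elements there while the test multiplies by 8. A quick check that avoids assuming any fixed itemsize for `int`:

```python
import numpy as np

# np.dtype(int) follows the platform/NumPy-version defaults:
# 8 bytes on 64-bit Linux/macOS, historically 4 bytes on Windows.
# Pinning np.int64 removes the ambiguity from size-based assertions.
print("np.dtype(int).itemsize      =", np.dtype(int).itemsize)
print("np.dtype(np.int64).itemsize =", np.dtype(np.int64).itemsize)
```

In a test like `test_get_signal_chunks`, passing `dtype=np.int64` (and sizing the assertion with `np.dtype(dtype).itemsize`) keeps the expected chunk byte count platform-independent.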
Thanks for the good work on this, @CSSFrancis!
This PR is a refactoring of #2798 and all further development should be done here.