Zarr Reading and Writing (.zspy format) - Reformatted #2825

Merged
merged 34 commits into from Oct 20, 2021

Conversation

CSSFrancis
Member

This PR is a refactoring of #2798 and all further development should be done here.

@codecov

codecov bot commented Sep 16, 2021

Codecov Report

Merging #2825 (2b236f5) into RELEASE_next_minor (a030666) will decrease coverage by 1.45%.
The diff coverage is 78.04%.


@@                  Coverage Diff                   @@
##           RELEASE_next_minor    #2825      +/-   ##
======================================================
- Coverage               78.45%   77.00%   -1.46%     
======================================================
  Files                     203      206       +3     
  Lines                   31287    31566     +279     
  Branches                 6872     6907      +35     
======================================================
- Hits                    24547    24307     -240     
- Misses                   4977     5505     +528     
+ Partials                 1763     1754       -9     
Impacted Files Coverage Δ
hyperspy/io_plugins/__init__.py 71.42% <60.00%> (-2.49%) ⬇️
hyperspy/signal.py 73.80% <60.00%> (-2.15%) ⬇️
hyperspy/io_plugins/_hierarchical.py 73.31% <73.31%> (ø)
hyperspy/io_plugins/hspy.py 88.63% <88.37%> (+19.70%) ⬆️
hyperspy/io_plugins/zspy.py 91.66% <91.66%> (ø)
hyperspy/io.py 85.14% <100.00%> (+0.47%) ⬆️
hyperspy/io_plugins/emd.py 66.99% <100.00%> (+0.03%) ⬆️
hyperspy/io_plugins/nexus.py 93.91% <100.00%> (+0.02%) ⬆️
hyperspy/misc/io/tools.py 68.85% <100.00%> (+5.51%) ⬆️
hyperspy/misc/machine_learning/import_sklearn.py 50.00% <0.00%> (-50.00%) ⬇️
... and 15 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a030666...2b236f5.

@ericpre
Member

ericpre commented Sep 17, 2021

Thanks @CSSFrancis, this looks good and very promising! I will not have time in the next two weeks to look at the details, so I will do that at the beginning of October.

Member

@ericpre ericpre left a comment


This looks very good! See comments in the changes and:

  • there is an empty file which needs to be removed: hyperspy/io_plugins/spy.py
  • when overwriting an existing folder, it doesn't ask to overwrite, while other readers do
  • it would be good to increase the coverage as highlighted by the GitHub annotations, but this can be done in a separate PR.

Review threads: doc/user_guide/io.rst, doc/user_guide/big_data.rst, hyperspy/io_plugins/hierarchical.py, hyperspy/io_plugins/zspy.py, hyperspy/tests/io/test_zspy.py (most resolved or outdated).
@CSSFrancis
Member Author

@ericpre

This looks very good! See comments in the changes and:

  • there is an empty file which needs to be removed: hyperspy/io_plugins/spy.py

Done!

  • when overwriting an existing folder, it doesn't ask to overwrite, while other readers do

I changed this behavior so it should be working. It was kind of an oversight...
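For reference, the usual pattern in HyperSpy is to ask before clobbering an existing target. A minimal sketch of such a prompt (hypothetical helper name, similar in spirit to hyperspy.misc.io.tools.overwrite, not the code added in this PR):

```python
import os


def confirm_overwrite(path):
    # Hypothetical sketch of an overwrite prompt; not the PR's actual code.
    if not os.path.exists(path):
        return True  # nothing to clobber, safe to write
    answer = input(f"Overwrite '{path}' (y/n)?\n")
    return answer.strip().lower() == "y"
```

The writer would call this before creating the zarr directory store and abort the save when it returns False.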

  • it would be good to increase the coverage as highlighted by the GitHub annotations, but this can be done in a separate PR.

I've added a little more coverage, but obviously some parts are still lacking, mostly because the hspy format isn't at 100% coverage. I might get around to that in another PR, as that should probably be at 100%.

The only thing I'm not super confident about in this PR is removing the shuffle and maxshape options. I think they are legacy from some older h5py requirement. For maxshape I can pretty confidently say that removing it is fine. For shuffle, I have just set the default to True, but at least now it can also be set using the kwds argument.
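To illustrate the new default, here is a sketch of forwarding shuffle through the keyword arguments (the group class and wrapper name are stand-ins so the snippet runs without h5py; the real call would go to create_dataset on an h5py or zarr group):

```python
class FakeGroup:
    """Stand-in for an h5py/zarr group so this sketch runs stand-alone."""

    def create_dataset(self, name, data=None, **kwds):
        # record what would be passed to the real dataset constructor
        return {"name": name, "data": data, **kwds}


def write_dataset(group, name, data, shuffle=True, **kwds):
    # 'shuffle' now defaults to True, but it (and any other filter option)
    # can still be overridden by the caller through the keyword arguments.
    dset_kwds = {"shuffle": shuffle}
    dset_kwds.update(kwds)
    return group.create_dataset(name, data=data, **dset_kwds)


print(write_dataset(FakeGroup(), "data", [1, 2, 3])["shuffle"])  # -> True
print(write_dataset(FakeGroup(), "data", [1, 2, 3], shuffle=False)["shuffle"])  # -> False
```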

Member

@ericpre ericpre left a comment


I made some more comments and a PR to simplify the class inheritance: CSSFrancis#1.

Review threads: doc/user_guide/io.rst, hyperspy/io_plugins/hspy.py, hyperspy/io_plugins/hierarchical.py, hyperspy/io_plugins/zspy.py (most resolved or outdated).
an issue, it can be disabled by setting ``compressor=None``. In general we recommend
compressing your datasets as it can greatly reduce I/O overhead.

- ``write_to_storage``: The write to storage option allows you to pass the path to a directory (or database)
Member


Going for option 2 should be good because this is a pattern used by zarr.group, which can take a path or a zarr store as its first argument.

Simplify class inheritance for the zarr-io PR
@CSSFrancis
Member Author

CSSFrancis commented Oct 7, 2021

@ericpre
In reference to #2825 (comment), I think that the current get_signal_chunks function is very confusing. If I am going to make a generic function, I might as well just change it...

    import h5py
    import numpy as np

    def multiply(values):
        # product of an iterable (hyperspy defines a similar helper)
        return np.prod(values)

    def _get_signal_chunks(shape, dtype, signal_axes, target_size=1024*1024):
        """Function that calculates chunks for the signal, preferably at least one
        chunk per signal space.

        Parameters
        ----------
        shape : tuple
            the shape of the dataset to be stored / chunked
        dtype : {dtype, string}
            the numpy dtype of the data
        signal_axes : {None, iterable of ints}
            the axes defining "signal space" of the dataset. If None, the default
            h5py chunking is performed.
        target_size : int
            The target number of bytes for one chunk
        """
        typesize = np.dtype(dtype).itemsize
        if signal_axes is None:
            return h5py._hl.filters.guess_chunk(shape, None, typesize)

        # largely based on the guess_chunk in h5py
        bytes_per_signal = multiply([shape[i] for i in signal_axes]) * typesize
        signals_per_chunk = np.floor_divide(target_size, bytes_per_signal)
        if signals_per_chunk < 2:
            # signal is larger than chunk max
            chunks = [s if i in signal_axes else 1 for i, s in enumerate(shape)]
            return tuple(chunks)
        else:
            # signal is smaller than chunk max
            navigation_axes = tuple(i for i in range(len(shape)) if i not in
                                    signal_axes)
            num_nav_axes = len(navigation_axes)
            sig_axes_chunk = np.floor(signals_per_chunk**(1 / num_nav_axes))
            remainder = np.floor_divide(signals_per_chunk - (sig_axes_chunk**num_nav_axes),
                                        sig_axes_chunk)
            chunks = [s if i in signal_axes else sig_axes_chunk for i, s in enumerate(shape)]
            chunks[navigation_axes[0]] = chunks[navigation_axes[0]] + remainder
        return tuple(int(x) for x in chunks)

I think this makes quite a bit more sense and gets rid of the while loop. It changes the behavior a little (I think), but the previous version was not actually creating 1 MB chunks but rather chunks of 1024*1024 * bit-depth bytes, which I don't think makes much sense.
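The arithmetic behind the proposal can be sketched in plain Python (a simplified illustration of the "directly calculate the chunk shape" idea, not the PR's exact function; the remainder redistribution step is omitted):

```python
import math


def guess_signal_chunks(shape, itemsize, signal_axes, target_size=1024 * 1024):
    # Simplified sketch: keep whole signals per chunk and stay under
    # target_size bytes.
    bytes_per_signal = math.prod(shape[i] for i in signal_axes) * itemsize
    signals_per_chunk = target_size // bytes_per_signal
    if signals_per_chunk < 2:
        # a single signal already exceeds the target: one signal per chunk
        return tuple(s if i in signal_axes else 1 for i, s in enumerate(shape))
    nav_axes = [i for i in range(len(shape)) if i not in signal_axes]
    # spread the chunk budget evenly over the navigation axes
    nav_chunk = math.floor(signals_per_chunk ** (1 / len(nav_axes)))
    return tuple(s if i in signal_axes else nav_chunk for i, s in enumerate(shape))


# 15x15 navigation of 256x256 signals, 8-byte ints, ~10 MB target:
print(guess_signal_chunks((15, 15, 256, 256), 8, (2, 3), target_size=int(1e7)))
# -> (4, 4, 256, 256): 4 * 4 * 256 * 256 * 8 = 8388608 bytes < 1e7
```

No loop is needed: the chunk shape falls directly out of the byte budget, which is the point being made above.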

Member

@ericpre ericpre left a comment


As mentioned in one of the comments, I would be inclined not to change how the chunking is calculated and sort it out properly in a separate PR.
Some tests of this PR are failing and a few other minor comments need to be considered. Other than that, it should be good to go!

Review threads: hyperspy/io.py, hyperspy/io_plugins/zspy.py, hyperspy/signal.py (most resolved or outdated).
    def _get_signal_chunks():
        raise NotImplementedError(
            "This method must be implemented by subclasses.")

    def _get_signal_chunks(shape, dtype, signal_axes=None, target_size=1e6):
Member


There is also

    def _get_dask_chunks(self, axis=None, dtype=None):
        """Returns dask chunks.

        Aims:
        - Have at least one signal (or specified axis) in a single chunk,
          or as many as fit in memory

        Parameters
        ----------
        axis : {int, string, None, axis, tuple}
            If axis is None (default), returns chunks for current data shape so
            that at least one signal is in the chunk. If an axis is specified,
            only that particular axis is guaranteed to be "not sliced".
        dtype : {string, np.dtype}
            The dtype of target chunks.

        Returns
        -------
        Tuple of tuples, dask chunks
        """
        dc = self.data
        dcshape = dc.shape
        for _axis in self.axes_manager._axes:
            if _axis.index_in_array < len(dcshape):
                _axis.size = int(dcshape[_axis.index_in_array])

        if axis is not None:
            need_axes = self.axes_manager[axis]
            if not np.iterable(need_axes):
                need_axes = [need_axes, ]
        else:
            need_axes = self.axes_manager.signal_axes

        if dtype is None:
            dtype = dc.dtype
        elif not isinstance(dtype, np.dtype):
            dtype = np.dtype(dtype)
        typesize = max(dtype.itemsize, dc.dtype.itemsize)
        want_to_keep = multiply([ax.size for ax in need_axes]) * typesize

        # @mrocklin recommends to have around 100MB chunks, so we do that:
        num_that_fit = int(100. * 2.**20 / want_to_keep)

        # want to have at least one "signal" per chunk
        if num_that_fit < 2:
            chunks = [tuple(1 for _ in range(i)) for i in dc.shape]
            for ax in need_axes:
                chunks[ax.index_in_array] = (dc.shape[ax.index_in_array], )
            return tuple(chunks)

        sizes = [
            ax.size for ax in self.axes_manager._axes if ax not in need_axes
        ]
        indices = [
            ax.index_in_array for ax in self.axes_manager._axes
            if ax not in need_axes
        ]

        while True:
            if multiply(sizes) <= num_that_fit:
                break
            i = np.argmax(sizes)
            sizes[i] = np.floor(sizes[i] / 2)

        chunks = []
        ndim = len(dc.shape)
        for i in range(ndim):
            if i in indices:
                size = float(dc.shape[i])
                split_array = np.array_split(
                    np.arange(size), np.ceil(size / sizes[indices.index(i)]))
                chunks.append(tuple(len(sp) for sp in split_array))
            else:
                chunks.append((dc.shape[i], ))
        return tuple(chunks)

which does pretty much the same thing. I guess we need a standalone guess_chunks which should be used here and in hyperspy._signals.lazy.LazySignal.

Since this is only indirectly related to the purpose of this PR, maybe it is better to do this in a separate PR. I would prefer not to change the behaviour in this PR and sort it out completely in a separate PR, adding tests, etc. There is already a lot in this PR! What do you think?

For reference, here is the auto-chunking from zarr:
https://github.com/zarr-developers/zarr-python/blob/be3d657484590adacaedf980a2951ddfa92b68b8/zarr/util.py#L60-L103

Member Author


We can sort it out in a separate PR. Do you want me to roll back to the previous _get_signal_chunks?

I would argue against the zarr and h5py guess-chunk functions. Something about using a while loop rather than directly calculating the chunk shape makes me itch and is very confusing. zarr does a little better job of explaining what is happening, but I think it is still fairly unpythonic.

Honestly, chunking in hyperspy is a little bit broken and could use some major work.

I can make an issue to track some of the problems I have come across.

Member

@ericpre ericpre left a comment


I would suggest leaving the chunking discussion for another PR. Same for passing a zarr store instead of a filename string, because it doesn't fully work.

There are two other comments/questions.

Review thread: hyperspy/io_plugins/_hierarchical.py (outdated).
Comment on lines +125 to +131
path = group._store.dir_path() + "/" + dset.path
dset = zarr.open_array(path,
mode="w",
shape=data.shape,
dtype=data.dtype,
chunks=chunks,
**kwds)
Member


Is it necessary?

Member Author


At least on my machine it seems to be. If you take it out, then the array is never updated. There are some things about how the synchronizer is handled that I am still rather confused by. There is no flush method for the directory store, but I think that reopening the file is equivalent to this.

I tried a bunch of different things, but this is the one that worked in the end.

Member


OK, let's leave it there for now and we can revisit after I rebase #2797 on this branch.

Review thread: hyperspy/io_plugins/zspy.py.
Co-authored-by: Eric Prestat <eric.prestat@gmail.com>
@ericpre
Member

ericpre commented Oct 19, 2021

There are two tests failing:

______________________ test_get_signal_chunks[1000000.0] ______________________
[gw1] win32 -- Python 3.8.10 c:\hostedtoolcache\windows\python\3.8.10\x64\python.exe

target_size = 1000000.0

    @pytest.mark.parametrize("target_size", (1e6,1e7))
    def test_get_signal_chunks(target_size):
        chunks = HierarchicalWriter._get_signal_chunks(shape=[15, 15, 256, 256],
                                                       dtype=int,
                                                       signal_axes=(2, 3),
                                                       target_size=target_size)
>       assert (np.prod(chunks)*8 < target_size)
E       assert (196608 * 8) < 1000000.0
E        +  where 196608 = <function prod at 0x0000017CB7857160>((3, 1, 256, 256))
E        +    where <function prod at 0x0000017CB7857160> = np.prod

hyperspy\tests\io\test_hdf5.py:857: AssertionError
_____________________ test_get_signal_chunks[10000000.0] ______________________
[gw1] win32 -- Python 3.8.10 c:\hostedtoolcache\windows\python\3.8.10\x64\python.exe

target_size = 10000000.0

    @pytest.mark.parametrize("target_size", (1e6,1e7))
    def test_get_signal_chunks(target_size):
        chunks = HierarchicalWriter._get_signal_chunks(shape=[15, 15, 256, 256],
                                                       dtype=int,
                                                       signal_axes=(2, 3),
                                                       target_size=target_size)
>       assert (np.prod(chunks)*8 < target_size)
E       assert (2359296 * 8) < 10000000.0
E        +  where 2359296 = <function prod at 0x0000017CB7857160>((6, 6, 256, 256))
E        +    where <function prod at 0x0000017CB7857160> = np.prod

@CSSFrancis
Member Author

@ericpre What a strange little bug... the Windows operating systems running the tests are 32-bit, so int defaults to 32 bits there, whereas on the 64-bit operating systems int defaults to 64 bits.

…ilable for numcodecs on other platforms. Skip zspy tests when zarr not installed
@ericpre
Member

ericpre commented Oct 20, 2021

@ericpre What a strange little bug... the Windows operating systems running the tests are 32-bit, so int defaults to 32 bits there, whereas on the 64-bit operating systems int defaults to 64 bits.

The machines on CI are 64-bit ('win32', which must be the output of sys.platform, is misleading here...) and there must be another reason. Anyway, since we say that the chunking estimation will need a refactor in a follow-up PR, we can leave it for now.

In 2b236f5, I made zarr optional because there are no packages available on PyPI or conda-forge for architectures other than x86_64/amd64.
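For context on the dtype point above: NumPy's default integer follows the platform's C long, so with NumPy 1.x np.dtype(int) is 4 bytes on Windows builds (even 64-bit ones, where C long is still 32-bit) and 8 bytes on 64-bit Linux/macOS, which would be enough to change the itemsize assumed by the failing test. A quick check:

```python
import numpy as np

# The explicit dtypes are unambiguous across platforms:
print(np.dtype(np.int32).itemsize)  # -> 4
print(np.dtype(np.int64).itemsize)  # -> 8

# np.dtype(int) is platform dependent with NumPy 1.x: 4 bytes on Windows
# (where C long is 32-bit even on 64-bit builds), 8 bytes on 64-bit
# Linux/macOS, so no expected value is given here.
print(np.dtype(int).itemsize)
```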

@ericpre ericpre merged commit 19dab6e into hyperspy:RELEASE_next_minor Oct 20, 2021
@ericpre
Member

ericpre commented Oct 20, 2021

Thanks for the good work on this, @CSSFrancis!
