Zarr Reading and Writing (.zspy format) - Reformatted #2825

Merged
merged 34 commits into from Oct 20, 2021

Conversation

CSSFrancis
Member

This PR is a refactoring of #2798 and all further development should be done here.

@codecov

codecov bot commented Sep 16, 2021

Codecov Report

Merging #2825 (2b236f5) into RELEASE_next_minor (a030666) will decrease coverage by 1.45%.
The diff coverage is 78.04%.


@@                  Coverage Diff                   @@
##           RELEASE_next_minor    #2825      +/-   ##
======================================================
- Coverage               78.45%   77.00%   -1.46%     
======================================================
  Files                     203      206       +3     
  Lines                   31287    31566     +279     
  Branches                 6872     6907      +35     
======================================================
- Hits                    24547    24307     -240     
- Misses                   4977     5505     +528     
+ Partials                 1763     1754       -9     
Impacted Files Coverage Δ
hyperspy/io_plugins/__init__.py 71.42% <60.00%> (-2.49%) ⬇️
hyperspy/signal.py 73.80% <60.00%> (-2.15%) ⬇️
hyperspy/io_plugins/_hierarchical.py 73.31% <73.31%> (ø)
hyperspy/io_plugins/hspy.py 88.63% <88.37%> (+19.70%) ⬆️
hyperspy/io_plugins/zspy.py 91.66% <91.66%> (ø)
hyperspy/io.py 85.14% <100.00%> (+0.47%) ⬆️
hyperspy/io_plugins/emd.py 66.99% <100.00%> (+0.03%) ⬆️
hyperspy/io_plugins/nexus.py 93.91% <100.00%> (+0.02%) ⬆️
hyperspy/misc/io/tools.py 68.85% <100.00%> (+5.51%) ⬆️
hyperspy/misc/machine_learning/import_sklearn.py 50.00% <0.00%> (-50.00%) ⬇️
... and 15 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a030666...2b236f5.

@ericpre
Member

ericpre commented Sep 17, 2021

Thanks @CSSFrancis, this looks good and very promising! I will not have time in the next two weeks to look at the details, so I will do that at the beginning of October.

Member

@ericpre ericpre left a comment


This looks very good! See comments in the changes and:

  • there is an empty file which needs to be removed: hyperspy/io_plugins/spy.py
  • when overwriting an existing folder, it doesn't ask to overwrite, while other readers do
  • it would be good to increase the coverage as highlighted by the GitHub annotations, but this can be done in a separate PR.

Review threads: doc/user_guide/io.rst, doc/user_guide/big_data.rst, hyperspy/io_plugins/hierarchical.py, hyperspy/io_plugins/zspy.py, hyperspy/tests/io/test_zspy.py (most resolved or outdated).
@CSSFrancis
Member Author

@ericpre

This looks very good! See comments in the changes and:

  • there is an empty file which needs to be removed: hyperspy/io_plugins/spy.py

Done!

  • when overwriting an existing folder, it doesn't ask to overwrite, while other readers do

I changed this behavior so it should be working. It was kind of an oversight...
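For reference, the usual pattern in HyperSpy is to ask before clobbering an existing target. A minimal sketch of such a prompt (hypothetical helper name, similar in spirit to hyperspy.misc.io.tools.overwrite, not the code added in this PR):

```python
import os


def confirm_overwrite(path):
    # Hypothetical sketch of an overwrite prompt; not the PR's actual code.
    if not os.path.exists(path):
        return True  # nothing to clobber, safe to write
    answer = input(f"Overwrite '{path}' (y/n)?\n")
    return answer.strip().lower() == "y"
```

The writer would call this before creating the zarr directory store and abort the save when it returns False.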

  • it would be good to increase the coverage as highlighted by the GitHub annotations, but this can be done in a separate PR.

I've added a little more coverage, but obviously some parts are still lacking, mostly because the hspy format isn't at 100% coverage. I might get around to that in another PR, as that should probably be at 100%.

The only thing I'm not super confident about in this PR is removing the shuffle and maxshape options. I think they are legacy from some older h5py requirement. For maxshape I can pretty confidently say that removing it is fine. For shuffle, I have just set the default to True, but at least now it can also be set using the kwds argument.
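To illustrate the new default, here is a sketch of forwarding shuffle through the keyword arguments (the group class and wrapper name are stand-ins so the snippet runs without h5py; the real call would go to create_dataset on an h5py or zarr group):

```python
class FakeGroup:
    """Stand-in for an h5py/zarr group so this sketch runs stand-alone."""

    def create_dataset(self, name, data=None, **kwds):
        # record what would be passed to the real dataset constructor
        return {"name": name, "data": data, **kwds}


def write_dataset(group, name, data, shuffle=True, **kwds):
    # 'shuffle' now defaults to True, but it (and any other filter option)
    # can still be overridden by the caller through the keyword arguments.
    dset_kwds = {"shuffle": shuffle}
    dset_kwds.update(kwds)
    return group.create_dataset(name, data=data, **dset_kwds)


print(write_dataset(FakeGroup(), "data", [1, 2, 3])["shuffle"])  # -> True
print(write_dataset(FakeGroup(), "data", [1, 2, 3], shuffle=False)["shuffle"])  # -> False
```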

Member

@ericpre ericpre left a comment


I made some more comments and a PR to simplify the class inheritance: CSSFrancis#1.

Review threads: doc/user_guide/io.rst, hyperspy/io_plugins/hspy.py, hyperspy/io_plugins/hierarchical.py, hyperspy/io_plugins/zspy.py (most resolved or outdated).
an issue, it can be disabled by setting ``compressor=None``. In general we recommend
compressing your datasets as it can greatly reduce I/O overhead.

- ``write_to_storage``: The write to storage option allows you to pass the path to a directory (or database)
Member


Going for option 2 should be good because this is a pattern used by zarr.group, which can take a path or a zarr store as its first argument.

Simplify class inheritance for the zarr-io PR
@CSSFrancis
Member Author

CSSFrancis commented Oct 7, 2021

@ericpre
In reference to #2825 (comment), I think that the current get_signal_chunks function is very confusing. If I am going to make a generic function, I might as well just change it...

    import h5py
    import numpy as np

    def multiply(values):
        # product of an iterable (hyperspy defines a similar helper)
        return np.prod(values)

    def _get_signal_chunks(shape, dtype, signal_axes, target_size=1024*1024):
        """Function that calculates chunks for the signal, preferably at least one
        chunk per signal space.

        Parameters
        ----------
        shape : tuple
            the shape of the dataset to be stored / chunked
        dtype : {dtype, string}
            the numpy dtype of the data
        signal_axes : {None, iterable of ints}
            the axes defining "signal space" of the dataset. If None, the default
            h5py chunking is performed.
        target_size : int
            The target number of bytes for one chunk
        """
        typesize = np.dtype(dtype).itemsize
        if signal_axes is None:
            return h5py._hl.filters.guess_chunk(shape, None, typesize)

        # largely based on the guess_chunk in h5py
        bytes_per_signal = multiply([shape[i] for i in signal_axes]) * typesize
        signals_per_chunk = np.floor_divide(target_size, bytes_per_signal)
        if signals_per_chunk < 2:
            # signal is larger than chunk max
            chunks = [s if i in signal_axes else 1 for i, s in enumerate(shape)]
            return tuple(chunks)
        else:
            # signal is smaller than chunk max
            navigation_axes = tuple(i for i in range(len(shape)) if i not in
                                    signal_axes)
            num_nav_axes = len(navigation_axes)
            sig_axes_chunk = np.floor(signals_per_chunk**(1 / num_nav_axes))
            remainder = np.floor_divide(signals_per_chunk - (sig_axes_chunk**num_nav_axes),
                                        sig_axes_chunk)
            chunks = [s if i in signal_axes else sig_axes_chunk for i, s in enumerate(shape)]
            chunks[navigation_axes[0]] = chunks[navigation_axes[0]] + remainder
        return tuple(int(x) for x in chunks)

I think this makes quite a bit more sense and gets rid of the while loop. It changes the behavior a little (I think), but the previous version was not actually creating 1 MB chunks but rather chunks of 1024*1024 * bit-depth bytes, which I don't think makes much sense.
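The arithmetic behind the proposal can be sketched in plain Python (a simplified illustration of the "directly calculate the chunk shape" idea, not the PR's exact function; the remainder redistribution step is omitted):

```python
import math


def guess_signal_chunks(shape, itemsize, signal_axes, target_size=1024 * 1024):
    # Simplified sketch: keep whole signals per chunk and stay under
    # target_size bytes.
    bytes_per_signal = math.prod(shape[i] for i in signal_axes) * itemsize
    signals_per_chunk = target_size // bytes_per_signal
    if signals_per_chunk < 2:
        # a single signal already exceeds the target: one signal per chunk
        return tuple(s if i in signal_axes else 1 for i, s in enumerate(shape))
    nav_axes = [i for i in range(len(shape)) if i not in signal_axes]
    # spread the chunk budget evenly over the navigation axes
    nav_chunk = math.floor(signals_per_chunk ** (1 / len(nav_axes)))
    return tuple(s if i in signal_axes else nav_chunk for i, s in enumerate(shape))


# 15x15 navigation of 256x256 signals, 8-byte ints, ~10 MB target:
print(guess_signal_chunks((15, 15, 256, 256), 8, (2, 3), target_size=int(1e7)))
# -> (4, 4, 256, 256): 4 * 4 * 256 * 256 * 8 = 8388608 bytes < 1e7
```

No loop is needed: the chunk shape falls directly out of the byte budget, which is the point being made above.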

Member

@ericpre ericpre left a comment


As mentioned in one of the comments, I would be inclined not to change how the chunking is calculated and sort it out properly in a separate PR.
Some tests of this PR are failing and a few other minor comments need to be considered. Other than that, it should be good to go!

Review threads: hyperspy/io.py, hyperspy/io_plugins/zspy.py, hyperspy/signal.py (most resolved or outdated).
    def _get_signal_chunks():
        raise NotImplementedError(
            "This method must be implemented by subclasses.")

    def _get_signal_chunks(shape, dtype, signal_axes=None, target_size=1e6):
Member


There is also

    def _get_dask_chunks(self, axis=None, dtype=None):
        """Returns dask chunks.

        Aims:
        - Have at least one signal (or specified axis) in a single chunk,
          or as many as fit in memory

        Parameters
        ----------
        axis : {int, string, None, axis, tuple}
            If axis is None (default), returns chunks for current data shape so
            that at least one signal is in the chunk. If an axis is specified,
            only that particular axis is guaranteed to be "not sliced".
        dtype : {string, np.dtype}
            The dtype of target chunks.

        Returns
        -------
        Tuple of tuples, dask chunks
        """
        dc = self.data
        dcshape = dc.shape
        for _axis in self.axes_manager._axes:
            if _axis.index_in_array < len(dcshape):
                _axis.size = int(dcshape[_axis.index_in_array])

        if axis is not None:
            need_axes = self.axes_manager[axis]
            if not np.iterable(need_axes):
                need_axes = [need_axes, ]
        else:
            need_axes = self.axes_manager.signal_axes

        if dtype is None:
            dtype = dc.dtype
        elif not isinstance(dtype, np.dtype):
            dtype = np.dtype(dtype)
        typesize = max(dtype.itemsize, dc.dtype.itemsize)
        want_to_keep = multiply([ax.size for ax in need_axes]) * typesize

        # @mrocklin recommends to have around 100MB chunks, so we do that:
        num_that_fit = int(100. * 2.**20 / want_to_keep)

        # want to have at least one "signal" per chunk
        if num_that_fit < 2:
            chunks = [tuple(1 for _ in range(i)) for i in dc.shape]
            for ax in need_axes:
                chunks[ax.index_in_array] = (dc.shape[ax.index_in_array], )
            return tuple(chunks)

        sizes = [
            ax.size for ax in self.axes_manager._axes if ax not in need_axes
        ]
        indices = [
            ax.index_in_array for ax in self.axes_manager._axes
            if ax not in need_axes
        ]

        while True:
            if multiply(sizes) <= num_that_fit:
                break
            i = np.argmax(sizes)
            sizes[i] = np.floor(sizes[i] / 2)

        chunks = []
        ndim = len(dc.shape)
        for i in range(ndim):
            if i in indices:
                size = float(dc.shape[i])
                split_array = np.array_split(
                    np.arange(size), np.ceil(size / sizes[indices.index(i)]))
                chunks.append(tuple(len(sp) for sp in split_array))
            else:
                chunks.append((dc.shape[i], ))
        return tuple(chunks)

which does pretty much the same thing. I guess we need a standalone guess_chunks which should be used here and in hyperspy._signals.lazy.LazySignal.

Since this is only indirectly related to the purpose of this PR, maybe it is better to do this in a separate PR. I would prefer not to change the behaviour in this PR and sort it out completely in a separate PR, adding tests, etc. There is already a lot in this PR! What do you think?

For reference, here is the auto-chunking from zarr:
https://github.com/zarr-developers/zarr-python/blob/be3d657484590adacaedf980a2951ddfa92b68b8/zarr/util.py#L60-L103

Member Author


We can sort it out in a separate PR. Do you want me to roll back to the previous _get_signal_chunks?

I would argue against the zarr and h5py guess-chunk functions. Something about using a while loop rather than directly calculating the chunk shape makes me itch and is very confusing. zarr does a little better job of explaining what is happening, but I think it is still fairly unpythonic.

Honestly, chunking in hyperspy is a little bit broken and could use some major work.

I can make an issue to track some of the problems I have come across.

Member

@ericpre ericpre left a comment


I would suggest leaving the chunking discussion for another PR. Same for passing a zarr store instead of a filename string, because it doesn't fully work.

There are two other comments/questions.

Review thread: hyperspy/io_plugins/_hierarchical.py (outdated).
Comment on lines +125 to +131
path = group._store.dir_path() + "/" + dset.path
dset = zarr.open_array(path,
mode="w",
shape=data.shape,
dtype=data.dtype,
chunks=chunks,
**kwds)
Member


Is it necessary?

Member Author


At least on my machine it seems to be. If you take it out, then the array is never updated. There are some things about how the synchronizer is handled that I am still rather confused by. There is no flush method for the directory store, but I think that reopening the file is equivalent to this.

I tried a bunch of different things, but this is the one that worked in the end.

Member


OK, let's leave it there for now and we can revisit after I rebase #2797 on this branch.

Review thread: hyperspy/io_plugins/zspy.py.
Co-authored-by: Eric Prestat <eric.prestat@gmail.com>
@ericpre
Member

ericpre commented Oct 19, 2021

There are two tests failing:

______________________ test_get_signal_chunks[1000000.0] ______________________
[gw1] win32 -- Python 3.8.10 c:\hostedtoolcache\windows\python\3.8.10\x64\python.exe

target_size = 1000000.0

    @pytest.mark.parametrize("target_size", (1e6,1e7))
    def test_get_signal_chunks(target_size):
        chunks = HierarchicalWriter._get_signal_chunks(shape=[15, 15, 256, 256],
                                                       dtype=int,
                                                       signal_axes=(2, 3),
                                                       target_size=target_size)
>       assert (np.prod(chunks)*8 < target_size)
E       assert (196608 * 8) < 1000000.0
E        +  where 196608 = <function prod at 0x0000017CB7857160>((3, 1, 256, 256))
E        +    where <function prod at 0x0000017CB7857160> = np.prod

hyperspy\tests\io\test_hdf5.py:857: AssertionError
_____________________ test_get_signal_chunks[10000000.0] ______________________
[gw1] win32 -- Python 3.8.10 c:\hostedtoolcache\windows\python\3.8.10\x64\python.exe

target_size = 10000000.0

    @pytest.mark.parametrize("target_size", (1e6,1e7))
    def test_get_signal_chunks(target_size):
        chunks = HierarchicalWriter._get_signal_chunks(shape=[15, 15, 256, 256],
                                                       dtype=int,
                                                       signal_axes=(2, 3),
                                                       target_size=target_size)
>       assert (np.prod(chunks)*8 < target_size)
E       assert (2359296 * 8) < 10000000.0
E        +  where 2359296 = <function prod at 0x0000017CB7857160>((6, 6, 256, 256))
E        +    where <function prod at 0x0000017CB7857160> = np.prod

@CSSFrancis
Member Author

@ericpre What a strange little bug... the Windows operating systems running the tests are 32-bit, so int defaults to 32 bits there, whereas on the 64-bit operating systems int defaults to 64 bits.

…ilable for numcodecs on other platforms. Skip zspy tests when zarr not installed
@ericpre
Member

ericpre commented Oct 20, 2021

@ericpre What a strange little bug... the Windows operating systems running the tests are 32-bit, so int defaults to 32 bits there, whereas on the 64-bit operating systems int defaults to 64 bits.

The machines on CI are 64-bit ('win32', which must be the output of sys.platform, is misleading here...) and there must be another reason. Anyway, since we say that the chunking estimation will need a refactor in a follow-up PR, we can leave it for now.

In 2b236f5, I made zarr optional because there are no packages available on PyPI or conda-forge for architectures other than x86_64/amd64.
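For context on the dtype point above: NumPy's default integer follows the platform's C long, so with NumPy 1.x np.dtype(int) is 4 bytes on Windows builds (even 64-bit ones, where C long is still 32-bit) and 8 bytes on 64-bit Linux/macOS, which would be enough to change the itemsize assumed by the failing test. A quick check:

```python
import numpy as np

# The explicit dtypes are unambiguous across platforms:
print(np.dtype(np.int32).itemsize)  # -> 4
print(np.dtype(np.int64).itemsize)  # -> 8

# np.dtype(int) is platform dependent with NumPy 1.x: 4 bytes on Windows
# (where C long is 32-bit even on 64-bit builds), 8 bytes on 64-bit
# Linux/macOS, so no expected value is given here.
print(np.dtype(int).itemsize)
```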

@ericpre ericpre merged commit 19dab6e into hyperspy:RELEASE_next_minor Oct 20, 2021
@ericpre
Member

ericpre commented Oct 20, 2021

Thanks for the good work on this, @CSSFrancis!
