Lazy save improvement #2797
Conversation
Codecov Report
```diff
@@                 Coverage Diff                 @@
##        RELEASE_next_minor    #2797      +/-  ##
==================================================
+ Coverage            77.03%   77.07%    +0.04%
==================================================
  Files                  206      206
  Lines                31566    31600       +34
  Branches              6907     6918       +11
==================================================
+ Hits                 24317    24357       +40
- Misses                5493     5494        +1
+ Partials              1756     1749        -7
```
Continue to review full report at Codecov.
some typos
I will come back to this after #2825 is merged.
…nal data is never used
Force-pushed from a1dcb73 to 1196121
The test suite is failing on Python 3.6, and I think it is not worth trying to get it to work: Python 3.6 is not supported anymore and we should drop it in the next minor release.
@CSSFrancis, would you like to review this PR? Thanks!
Yeah, let me see if I can get to this today or tomorrow.
@ericpre Looks like this cleans up a lot of the little things that I wasn't terribly confident about and reduces the code complexity a fair bit! Thanks for doing this.
One note (as I said above) is that the `store` function doesn't play well with `dask.distributed` unless `lock=False`. It might be worth adding some tests for compatibility there if the goal is to fully integrate with dask, but I am not sure how easy deploying something like that on a testing environment is.
I assume that replacing `create_group` with `require_group` is preferable because it won't throw an error if the group already exists?
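That is indeed the difference between the two calls; a minimal illustration with h5py (the zarr `Group` API behaves the same way, and the file path here is just a throwaway example):

```python
import os
import tempfile
import h5py

path = os.path.join(tempfile.mkdtemp(), "demo.hspy")
with h5py.File(path, "w") as f:
    f.create_group("Experiments")        # first creation succeeds
    try:
        f.create_group("Experiments")    # creating it again raises
    except ValueError:
        pass
    f.require_group("Experiments")       # idempotent: returns the existing group
    f.require_group("Experiments/sig")   # creates missing intermediate groups
```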
Other than that, one of the tests, which tested passing a `MutableMapping` object as the file name, was deleted and should be restored.
hyperspy/tests/io/test_zspy.py
Outdated
```python
def test_save_N5_type(self, signal, tmp_path):
    filename = tmp_path / 'testmodels.zspy'
    store = zarr.N5Store(path=filename)
    signal.save(store, write_to_storage=True)
    signal2 = load(filename)
    np.testing.assert_array_equal(signal2.data, signal.data)
```
This should not be deleted, I think. You can remove the `write_to_storage=True`, but this test shows whether passing a `MutableMapping` object works for saving the data.
Running the integration test suite shows that it was breaking the public API... see #2797 (comment). This is fixed in caf6a01. @CSSFrancis, does it look good to you?
Yeah, with the changes it looks good! Thanks for cleaning this stuff up.
@jlaehne, do you want to do another review of this PR?
hyperspy/io.py
Outdated
```
@@ -277,6 +288,17 @@ def load(filenames=None,
    acquisition stopped before the end: if True, load only the acquired
    data. If False, fill empty data with zeros. Default is False and this
    default value will change to True in version 2.0.
chunks : tuple of integer or None
    Only for hspy files. Define the chunking used for saving the dataset.
```
In the user guide it says these extra arguments are only relevant for the `zarr` format, while here it says they are only relevant for the `hspy` format. Also, some details in the description differ between here and io.rst.
Yes, this docstring is for `save`, not `load`! 🤦
I sent a PR correcting some typos. There are still some codecov warnings (mostly about untested exceptions). Otherwise nothing that caught my eye.
Thanks @jlaehne, I increased the coverage and fixed the docstring.
I tested this on my own computer, and I'm getting an error:

```python
import dask.array as da
import hyperspy.api as hs

s = hs.signals.Signal1D(da.random.random((10, 20, 50))).as_lazy()
s.save("test_dataset.hspy")
```

giving the error:
```
hyperspy/io_plugins/_hierarchical.py in overwrite_dataset(cls, group, data, key, signal_axes, chunks, **kwds)
    574     # we delete the old one and create new in the next loop run
    575     del group[key]
--> 576     if dset == data:
    577         # just a reference to already created thing
    578         pass

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
```
I do have a bit of a strange mix of package versions, so it could be caused by that.
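For context, the `ValueError` above is generic NumPy behaviour rather than anything hyperspy-specific: comparing two arrays with `==` is element-wise, so the result has no single truth value and cannot be used directly in an `if`. A small reproduction:

```python
import numpy as np

a = np.arange(5)
b = np.arange(5)

result = a == b          # element-wise: array([True, True, True, True, True])
try:
    bool(result)         # what `if dset == data:` does implicitly
except ValueError as exc:
    message = str(exc)   # "The truth value of an array ... is ambiguous ..."

# Unambiguous alternatives:
same_values = np.array_equal(a, b)   # value equality
same_object = a is b                 # object identity, what a reference check wants
```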
Thanks @magnunor! It seems that you are on the wrong branch; the code at line 576 is different in this branch:
Force-pushed from 9c5bf6f to 40365f1
Indeed! Seems like … failed.
I got the previous issue sorted out, and tested the functionality a little bit. Some observations:

- `write_dataset=False` for updating both `hspy` and `zspy` files works fine, for both `axes_manager` and `metadata`.
- `zspy` does not require loading the data with any specific `mode`, as opposed to `hspy`, which requires `mode="a"`.
- The filename can be dropped when doing `s.save(overwrite=True, write_dataset=False)`.
- Loading via `zspy` is much faster, both using `DirectoryStore` and `LMDBStore`. For a 512 x 512 x 256 x 256 dataset with uint16, it took about 20 seconds. The same dataset took about 100 seconds with `hspy`.

This is probably unrelated, but doing:

```python
s = hs.load("test_data.hspy", lazy=True)
s.compute()
```

used twice as much memory (64 GB) as `s = hs.load("test_data.hspy", lazy=False)`, which used 32 GB.

This is using the most recent version of `dask`, 2021.10.0.
However, be aware that loading those files will require installing the package
providing the compression filter. If not available an error will be raised.

Compression can significantly increase the saving speed. If file size is not
Is this sentence correct? It seems ambiguous at the very least. I'm guessing it should say that using compression can cause file saving and loading to be much slower.
Yes, this is correct in many cases, where IO time is balanced against CPU time: most of the time, CPUs are fast enough and compressors are efficient enough that compression actually speeds up saving.
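As an aside, enabling a compression filter when writing a dataset is a one-argument change; a hedged sketch with h5py and the gzip filter (the file path and dataset names are illustrative only):

```python
import os
import tempfile
import numpy as np
import h5py

# Highly compressible data; real experimental data will compress less well.
data = np.zeros((256, 256), dtype=np.uint16)
path = os.path.join(tempfile.mkdtemp(), "compressed.h5")

with h5py.File(path, "w") as f:
    f.create_dataset("raw", data=data)                     # no filter
    f.create_dataset("gz", data=data, compression="gzip")  # gzip filter
```

Whether the compressed write is faster or slower depends on whether IO or CPU is the bottleneck, which is exactly the trade-off discussed above.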
Yes, this is not related to this PR - at least I don't see how it could be, and I would expect the same to happen with other branches. Do you observe the same memory usage with …
Force-pushed from 40365f1 to de5ec73
Thanks @CSSFrancis, @jlaehne and @magnunor for the reviews, I will merge this PR to be able to rebase #2839 and fix CI, so that it is easier to work on #2842 and co.
Yes, this was also with …
Progress of the PR

- … `h5py` version (3.5)
- … `upcoming_changes` folder (see `upcoming_changes/README.rst`)
- … `readthedocs` doc build of this PR (link in github checks)

Minimal example of the bug fix or the new feature