
Lazy save improvement #2797

Merged
merged 18 commits into hyperspy:RELEASE_next_minor on Oct 30, 2021

Conversation

@ericpre (Member) commented Jul 15, 2021

Progress of the PR

  • Add option to close the file when saving a lazy signal to the hspy format,
  • add option not to write the dataset, which can be useful when saving large signals (hspy/zspy formats only),
  • fix issue with the latest h5py version (3.5),
  • update docstring,
  • update user guide,
  • add a changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
  • check formatting of the changelog entry in the readthedocs doc build of this PR (link in github checks),
  • add tests,
  • ready for review.

Minimal example of the bug fix or the new feature

import hyperspy.api as hs
import numpy as np

s = hs.signals.Signal1D(np.arange(1000).reshape(10, 100))

fname = 'test.hspy'
s.save(fname, overwrite=True)

s2 = hs.load(fname, lazy=True, mode='a')
s2.axes_manager[-1].scale = 0.1
# write everything, except the dataset itself
s2.save(fname, overwrite=True, write_dataset=False)

s3 = hs.load(fname)
print(s3.axes_manager[-1].scale)

@ericpre ericpre added this to the v1.7 milestone Jul 15, 2021
@codecov bot commented Jul 15, 2021

Codecov Report

Merging #2797 (de5ec73) into RELEASE_next_minor (19dab6e) will increase coverage by 0.04%.
The diff coverage is 94.11%.


@@                  Coverage Diff                   @@
##           RELEASE_next_minor    #2797      +/-   ##
======================================================
+ Coverage               77.03%   77.07%   +0.04%     
======================================================
  Files                     206      206              
  Lines                   31566    31600      +34     
  Branches                 6907     6918      +11     
======================================================
+ Hits                    24317    24357      +40     
- Misses                   5493     5494       +1     
+ Partials                 1756     1749       -7     
Impacted Files Coverage Δ
hyperspy/_signals/lazy.py 89.30% <88.88%> (+0.26%) ⬆️
hyperspy/io_plugins/_hierarchical.py 75.16% <90.38%> (+1.85%) ⬆️
hyperspy/io_plugins/hspy.py 93.33% <93.61%> (+4.69%) ⬆️
hyperspy/io.py 85.76% <95.00%> (+0.61%) ⬆️
hyperspy/axes.py 90.50% <100.00%> (+<0.01%) ⬆️
hyperspy/io_plugins/emd.py 66.95% <100.00%> (-0.04%) ⬇️
hyperspy/io_plugins/nexus.py 93.88% <100.00%> (-0.03%) ⬇️
hyperspy/io_plugins/zspy.py 95.34% <100.00%> (+3.68%) ⬆️
hyperspy/signal.py 73.80% <100.00%> (ø)
hyperspy/misc/eels/eelsdb.py 50.72% <0.00%> (-14.50%) ⬇️
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jlaehne (Contributor) left a comment:

some typos

@ericpre (Member, Author) commented Oct 4, 2021

I will come back to this after #2825 is merged.

@ericpre ericpre marked this pull request as ready for review October 21, 2021 12:11
@ericpre (Member, Author) left a comment:

The test suite is failing on python 3.6 and I think that it is not worth trying to get it to work because python 3.6 is not supported anymore and we should drop it in the next minor release.

@ericpre mentioned this pull request Oct 21, 2021
@ericpre (Member, Author) commented Oct 21, 2021

@CSSFrancis, would you like to review this PR? Thanks!

@CSSFrancis (Member) commented:

> @CSSFrancis, would you like to review this PR? Thanks!

Yea let me see if I can get to this today or tomorrow

@CSSFrancis (Member) left a comment:

@ericpre Looks like this cleans up a lot of the little things that I wasn't terribly confident about and reduces the code complexity a fair bit! Thanks for doing this.

One note (as I said above) is that the store function doesn't play well with dask.distributed unless lock=False. It might be worth adding some tests for compatibility there if the goal is to fully integrate with dask, but I am not sure how easy deploying something like that on a testing environment is.

I assume that replacing create_group with require_group is preferable because it won't throw an error if the group already exists?

Other than that, one of the tests, which tested passing a MutableMapping object as the file name, was deleted, and that case should still be covered.
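The require_group point can be sketched with plain h5py (file and group names here are illustrative, not HyperSpy's actual writer code): create_group raises on an existing name, while require_group returns the existing group.

```python
import os
import tempfile

import h5py

# Illustrative temporary file, not tied to HyperSpy's writer.
path = os.path.join(tempfile.mkdtemp(), "sketch.h5")

with h5py.File(path, "w") as f:
    f.create_group("Experiments")
    try:
        # A second create_group on the same name raises ValueError.
        f.create_group("Experiments")
        raised = False
    except ValueError:
        raised = True
    # require_group returns the existing group instead of raising,
    # which is friendlier when re-saving into an existing file.
    grp = f.require_group("Experiments")
    name = grp.name
```

On the dask.distributed point: dask.array.store takes a lock argument, and passing lock=False (as noted above) disables the default lock and leaves synchronisation to the caller.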

Comment on lines 38 to 44
def test_save_N5_type(self, signal, tmp_path):
filename = tmp_path / 'testmodels.zspy'
store = zarr.N5Store(path=filename)
signal.save(store, write_to_storage=True)
signal2 = load(filename)
np.testing.assert_array_equal(signal2.data, signal.data)

Review comment (Member):

I think this should not be deleted. You can remove the write_to_storage=True, but this test shows whether passing a MutableMapping object works for saving the data.

@ericpre (Member, Author) commented Oct 22, 2021

Running the integration test suite showed that it was breaking the public API... see #2797 (comment). This is fixed in caf6a01.

@CSSFrancis, does it look good to you?

@CSSFrancis (Member) commented:

> @CSSFrancis, does it look good to you?

Yea with the changes it looks good! Thanks for cleaning this stuff up.

@ericpre (Member, Author) commented Oct 27, 2021

@jlaehne, do you want to do another review of this PR?

hyperspy/io.py Outdated
@@ -277,6 +288,17 @@ def load(filenames=None,
acquisition stopped before the end: if True, load only the acquired
data. If False, fill empty data with zeros. Default is False and this
default value will change to True in version 2.0.
chunks : tuple of integer or None
Only for hspy files. Define the chunking used for saving the dataset.
Review comment (Contributor):

In the user guide it says these extra arguments are only relevant for zarr format, here it says they are only relevant for hspy format. Also some details in the description differ between here and io.rst.

@ericpre (Member, Author) replied:

Yes, this docstring is for save, not load! 🤦

@jlaehne (Contributor) commented Oct 27, 2021

> @jlaehne, do you want to do another review of this PR?

I sent a PR correcting some typos. There are still some codecov warnings (mostly about untested exceptions). Otherwise nothing that caught my eye.

@ericpre (Member, Author) commented Oct 29, 2021

Thanks @jlaehne, I increased the coverage and fixed the docstring.

@magnunor (Contributor) left a comment:

I tested this on my own computer, and I'm getting an error:

import dask.array as da
import hyperspy.api as hs

s = hs.signals.Signal1D(da.random.random((10, 20, 50))).as_lazy()
s.save("test_dataset.hspy")

Giving the error:

hyperspy/io_plugins/_hierarchical.py in overwrite_dataset(cls, group, data, key, signal_axes, chunks, **kwds)
    574                     # we delete the old one and create new in the next loop run
    575                     del group[key]
--> 576         if dset == data:
    577             # just a reference to already created thing
    578             pass

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I do have a bit of a strange mix of package versions, so it could be caused by that.
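The traceback above boils down to a classic NumPy pitfall, sketched below (this is an illustration of the error, not the code of either branch): == on arrays is elementwise, so its result cannot be used directly in an if, whereas an identity check with is never touches the array contents.

```python
import numpy as np

data = np.arange(4)
dset = data  # stands in for the just-created dataset object

caught = False
try:
    if dset == data:  # elementwise comparison: a 4-element boolean array
        pass
except ValueError:
    # "The truth value of an array with more than one element is ambiguous"
    caught = True

# An identity check asks "is this the same object?", which is
# unambiguous and safe to use in an if statement.
same_object = dset is data
```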

@ericpre (Member, Author) commented Oct 29, 2021

> I tested this on my own computer, and I'm getting an error: [...]
> ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Thanks @magnunor! It seems that you are on the wrong branch; the code at line 576 is different in this branch:
https://github.com/ericpre/hyperspy/blob/9c5bf6fa1550867eee8a586ad4a8d163d38f4f2a/hyperspy/io_plugins/_hierarchical.py#L576

@magnunor (Contributor) commented:

> It seems that you are on the wrong branch; the code at line 576 is different in this branch:

Indeed! Seems like git had changed how it handled pulling branches from a remote, so this,

git checkout -b ericpre-lazy_save_improvement RELEASE_next_minor
git pull https://github.com/ericpre/hyperspy.git lazy_save_improvement

failed.

@magnunor (Contributor) left a comment:

I got the previous issue sorted out, and tested the functionality a little bit. Some observations:

  • write_dataset=False for updating both hspy and zspy files works fine, for both axes_manager and metadata
  • zspy does not require loading the data with any specific mode, as opposed to hspy, which requires mode="a"
  • The filename can be dropped when doing s.save(overwrite=True, write_dataset=False)
  • Loading via zspy is much faster, both using DirectoryStore and LMDBStore. For a 512 x 512 x 256 x 256 dataset, with uint16, it took about 20 seconds. The same dataset took about 100 seconds with hspy

This is probably unrelated, but doing:

s = hs.load("test_data.hspy", lazy=True)
s.compute()

used twice as much memory (64 GB) as s = hs.load("test_data.hspy", lazy=False), which used 32 GB.

This is using the most recent version of dask, 2021.10.0

However, be aware that loading those files will require installing the package
providing the compression filter. If not available an error will be raised.

Compression can significantly increase the saving speed. If file size is not
Review comment (Contributor):

Is this sentence correct? Or at least ambiguous?

I'm guessing it should say that using compression can cause file saving and loading to be much slower.

@ericpre (Member, Author) replied:

Yes, this is correct in many cases where the IO time is balanced against the CPU time: most of the time, CPUs are fast enough and compressors are efficient enough.
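A minimal h5py sketch of the tradeoff being discussed (file name and data are illustrative): on compressible data, gzip shrinks the on-disk storage considerably, and whether saving gets faster or slower depends on whether IO or CPU is the bottleneck.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "compression.h5")
data = np.zeros((256, 256), dtype=np.uint16)  # highly compressible

with h5py.File(path, "w") as f:
    f.create_dataset("raw", data=data)
    # compression="gzip" implies chunked storage; h5py picks chunks.
    f.create_dataset("gz", data=data, compression="gzip")

with h5py.File(path, "r") as f:
    raw_size = f["raw"].id.get_storage_size()  # full 128 KiB
    gz_size = f["gz"].id.get_storage_size()    # far smaller for zeros
```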

@ericpre (Member, Author) commented Oct 29, 2021

> This is probably unrelated, but loading test_data.hspy with lazy=True followed by s.compute() used twice as much memory (64 GB) as loading with lazy=False, which used 32 GB. This is using the most recent version of dask, 2021.10.0.

Yes, this is not related to this PR - at least I don't see how it can be, and I would expect the same to happen with other branches. Do you observe the same memory usage with zspy?

@ericpre (Member, Author) commented Oct 30, 2021

Thanks @CSSFrancis, @jlaehne and @magnunor for the reviews, I will merge this PR to be able to rebase #2839 and fix CI, so that it is easier to work on #2842 and co.

@ericpre ericpre merged commit 560d1e3 into hyperspy:RELEASE_next_minor Oct 30, 2021
@magnunor (Contributor) commented:

> Yes, this is not related to this PR - at least I don't see how it can be, and I would expect the same to happen with other branches. Do you observe the same memory usage with zspy?

Yes, this was also with zspy.

@ericpre ericpre deleted the lazy_save_improvement branch October 31, 2021 20:16
@ericpre removed the "status: needs review" and "run-extension-tests" labels Nov 1, 2021