Lazy save improvement #2797
Conversation
Codecov Report
```diff
@@                 Coverage Diff                 @@
##        RELEASE_next_minor    #2797      +/-  ##
==================================================
+ Coverage            77.03%   77.07%    +0.04%
==================================================
  Files                  206      206
  Lines                31566    31600       +34
  Branches              6907     6918       +11
==================================================
+ Hits                 24317    24357       +40
- Misses                5493     5494        +1
+ Partials              1756     1749        -7
```
Continue to review full report at Codecov.
some typos
I will come back to this after #2825 is merged.
…nal data is never used
Force-pushed from a1dcb73 to 1196121
The test suite is failing on Python 3.6, and I think it is not worth trying to get it to work: Python 3.6 is not supported anymore and we should drop it in the next minor release.
@CSSFrancis, would you like to review this PR? Thanks!
Yeah, let me see if I can get to this today or tomorrow.
@ericpre Looks like this cleans up a lot of the little things that I wasn't terribly confident about and reduces the code complexity a fair bit! Thanks for doing this.
One note (as I said above) is that the `store` function doesn't play well with `dask.distributed` unless `lock=False`. It might be worth adding some tests for compatibility there if the goal is to fully integrate with dask, but I am not sure how easy deploying something like that on a testing environment is.
I assume that replacing `create_group` with `require_group` is preferable because it won't throw an error if the group already exists?
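That is indeed the difference between the two calls; a minimal illustration with h5py (the zarr `Group` API behaves the same way, and the file path here is just a throwaway example):

```python
import os
import tempfile
import h5py

path = os.path.join(tempfile.mkdtemp(), "demo.hspy")
with h5py.File(path, "w") as f:
    f.create_group("Experiments")        # first creation succeeds
    try:
        f.create_group("Experiments")    # creating it again raises
    except ValueError:
        pass
    f.require_group("Experiments")       # idempotent: returns the existing group
    f.require_group("Experiments/sig")   # creates missing intermediate groups
```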
Other than that, one of the tests, which tested passing a `MutableMapping` object as the file name, was deleted and should be restored.
hyperspy/tests/io/test_zspy.py
Outdated
```python
def test_save_N5_type(self, signal, tmp_path):
    filename = tmp_path / 'testmodels.zspy'
    store = zarr.N5Store(path=filename)
    signal.save(store, write_to_storage=True)
    signal2 = load(filename)
    np.testing.assert_array_equal(signal2.data, signal.data)
```
This should not be deleted, I think. You can remove the `write_to_storage=True`, but this test shows whether passing a `MutableMapping` object works for saving the data.
Running the integration test suite shows that it was breaking the public API... see #2797 (comment). This is fixed in caf6a01. @CSSFrancis, does it look good to you?
Yeah, with the changes it looks good! Thanks for cleaning this stuff up.
@jlaehne, do you want to do another review of this PR?
hyperspy/io.py
Outdated
```
@@ -277,6 +288,17 @@ def load(filenames=None,
    acquisition stopped before the end: if True, load only the acquired
    data. If False, fill empty data with zeros. Default is False and this
    default value will change to True in version 2.0.
chunks : tuple of integer or None
    Only for hspy files. Define the chunking used for saving the dataset.
```
In the user guide it says these extra arguments are only relevant for the `zarr` format, while here it says they are only relevant for the `hspy` format. Also, some details in the description differ between here and io.rst.
Yes, this docstring is for `save`, not `load`! 🤦
I sent a PR correcting some typos. There are still some codecov warnings (mostly about untested exceptions). Otherwise nothing that caught my eye.
Thanks @jlaehne, I increased the coverage and fixed the docstring.
I tested this on my own computer, and I'm getting an error:

```python
import dask.array as da
import hyperspy.api as hs

s = hs.signals.Signal1D(da.random.random((10, 20, 50))).as_lazy()
s.save("test_dataset.hspy")
```

giving the error:
```
hyperspy/io_plugins/_hierarchical.py in overwrite_dataset(cls, group, data, key, signal_axes, chunks, **kwds)
    574     # we delete the old one and create new in the next loop run
    575     del group[key]
--> 576     if dset == data:
    577         # just a reference to already created thing
    578         pass

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
```
I do have a bit of a strange mix of package versions, so it could be caused by that.
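For context, the `ValueError` above is generic NumPy behaviour rather than anything hyperspy-specific: comparing two arrays with `==` is element-wise, so the result has no single truth value and cannot be used directly in an `if`. A small reproduction:

```python
import numpy as np

a = np.arange(5)
b = np.arange(5)

result = a == b          # element-wise: array([True, True, True, True, True])
try:
    bool(result)         # what `if dset == data:` does implicitly
except ValueError as exc:
    message = str(exc)   # "The truth value of an array ... is ambiguous ..."

# Unambiguous alternatives:
same_values = np.array_equal(a, b)   # value equality
same_object = a is b                 # object identity, what a reference check wants
```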
Thanks @magnunor! It seems that you are on the wrong branch; the code at line 576 is different in this branch:
Force-pushed from 9c5bf6f to 40365f1
Indeed! Seems like … failed.
I got the previous issue sorted out, and tested the functionality a little bit. Some observations:

- `write_dataset=False` for updating both `hspy` and `zspy` files works fine, for both `axes_manager` and `metadata`.
- `zspy` does not require loading the data with any specific `mode`, as opposed to `hspy`, which requires `mode="a"`.
- The filename can be dropped when doing `s.save(overwrite=True, write_dataset=False)`.
- Loading via `zspy` is much faster, both using `DirectoryStore` and `LMDBStore`. For a 512 x 512 x 256 x 256 dataset with uint16, it took about 20 seconds. The same dataset took about 100 seconds with `hspy`.

This is probably unrelated, but doing:

```python
s = hs.load("test_data.hspy", lazy=True)
s.compute()
```

used twice as much memory (64 GB) as `s = hs.load("test_data.hspy", lazy=False)`, which used 32 GB.

This is using the most recent version of `dask`, 2021.10.0.
However, be aware that loading those files will require installing the package
providing the compression filter. If not available an error will be raised.

Compression can significantly increase the saving speed. If file size is not
Is this sentence correct? It seems ambiguous at the very least. I'm guessing it should say that using compression can cause file saving and loading to be much slower.
Yes, this is correct in many cases, where IO time is balanced against CPU time: most of the time, CPUs are fast enough and compressors are efficient enough that compression actually speeds up saving.
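As an aside, enabling a compression filter when writing a dataset is a one-argument change; a hedged sketch with h5py and the gzip filter (the file path and dataset names are illustrative only):

```python
import os
import tempfile
import numpy as np
import h5py

# Highly compressible data; real experimental data will compress less well.
data = np.zeros((256, 256), dtype=np.uint16)
path = os.path.join(tempfile.mkdtemp(), "compressed.h5")

with h5py.File(path, "w") as f:
    f.create_dataset("raw", data=data)                     # no filter
    f.create_dataset("gz", data=data, compression="gzip")  # gzip filter
```

Whether the compressed write is faster or slower depends on whether IO or CPU is the bottleneck, which is exactly the trade-off discussed above.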
Yes, this is not related to this PR - at least I don't see how it could be, and I would expect the same to happen with other branches. Do you observe the same memory usage with …
Force-pushed from 40365f1 to de5ec73
Thanks @CSSFrancis, @jlaehne and @magnunor for the reviews, I will merge this PR to be able to rebase #2839 and fix CI, so that it is easier to work on #2842 and co.
Yes, this was also with …
Progress of the PR

- … `h5py` version (3.5)
- … `upcoming_changes` folder (see `upcoming_changes/README.rst`)
- … `readthedocs` doc build of this PR (link in github checks)

Minimal example of the bug fix or the new feature