Lazy save improvement #2797

Merged: 18 commits, Oct 30, 2021
Changes from 10 commits
44 changes: 27 additions & 17 deletions doc/user_guide/io.rst
@@ -405,21 +405,31 @@ Extra saving arguments
- ``compression``: One of ``None``, ``'gzip'``, ``'szip'``, ``'lzf'`` (default is ``'gzip'``).
``'szip'`` may be unavailable as it depends on the HDF5 installation including it.

.. note::

HyperSpy uses h5py for reading and writing HDF5 files and, therefore, it
supports all `compression filters supported by h5py <https://docs.h5py.org/en/stable/high/dataset.html#dataset-compression>`_.
The default is ``'gzip'``. It is possible to enable other compression filters
such as ``blosc`` by installing e.g. `hdf5plugin <https://github.com/silx-kit/hdf5plugin>`_.
However, be aware that loading those files will require installing the package
providing the compression filter. If not available an error will be raised.

Compression can significantly increase the saving speed. If file size is not
Contributor:

Is this sentence correct? Or at least ambiguous?

I'm guessing it should say that using compression can cause file saving and loading to be much slower.

Member (author):

Yes, this is correct in many cases, when the IO time is balanced against the CPU time: most of the time, CPUs are fast enough and compressors are efficient enough.

an issue, it can be disabled by setting ``compression=None``. Notice that only
``compression=None`` and ``compression='gzip'`` are available in all platforms,
see the `h5py documentation <https://docs.h5py.org/en/stable/faq.html#what-compression-processing-filters-are-supported>`_
for more details. Therefore, if you choose any other compression filter for
saving a file, be aware that it may not be possible to load it in some platforms.
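
To make the size/speed trade-off concrete, here is a minimal sketch (the signal and filenames are illustrative, not from the PR):

```python
import numpy as np
import hyperspy.api as hs

s = hs.signals.Signal2D(np.random.random((10, 64, 64)))

# Default: portable and reasonably fast
s.save('small.hspy', compression='gzip', overwrite=True)
# No filter: fastest to write, largest file
s.save('fast.hspy', compression=None, overwrite=True)
# 'lzf' (and 'szip') may not be available on every platform/HDF5 build
s.save('lzf.hspy', compression='lzf', overwrite=True)
```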

- ``chunks``: tuple of integer or None. Define the chunking used for saving
  the dataset. If None, chunks are calculated for the signal, with preferably
  at least one chunk per signal space.
- ``close_file``: if ``False``, doesn't close the file after writing. The file
should not be closed if the data need to be accessed lazily after saving.
Default is ``True``.
- ``write_dataset``: if ``False``, doesn't write the dataset when writing the file.
This can be useful to overwrite signal attributes only (for example ``axes_manager``)
without having to write the whole dataset, which can take time. Default is ``True``.
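
A hedged sketch combining these arguments (shapes and filenames are illustrative):

```python
import numpy as np
import hyperspy.api as hs

s = hs.signals.Signal2D(np.random.random((10, 64, 64)))

# One chunk per 64x64 signal-space image
s.save('data.hspy', chunks=(1, 64, 64), overwrite=True)

# Rewrite the signal attributes only (e.g. after recalibrating an axis),
# skipping the potentially expensive dataset write
s.axes_manager[-1].scale = 0.5
s.save('data.hspy', overwrite=True, write_dataset=False)

# For a lazy signal whose data must stay accessible after saving,
# keep the file open
s.as_lazy().save('lazy.hspy', overwrite=True, close_file=False)
```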


.. _zspy-format:
@@ -666,7 +676,7 @@ Extra saving arguments
scalebar. Useful to set formatting, location, etc. of the scalebar. See the
`matplotlib-scalebar <https://pypi.org/project/matplotlib-scalebar/>`_
documentation for more information.
- ``output_size`` : (int, tuple of length 2 or None, optional): the output size
of the image in pixels:

* if ``int``, defines the width of the image, the height is
@@ -1827,7 +1837,7 @@ Extra loading arguments
acquired last frame, which typically occurs when the acquisition was
interrupted. When loading incomplete data (``only_valid_data=False``),
the missing data are filled with zeros. If ``sum_frames=True``, this argument
will be ignored to enforce a consistent sum over the mapped area.
(default True).
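
As a hedged illustration (the filename is made up, and these arguments only apply to readers that expose them):

```python
import hyperspy.api as hs

# Acquisition was interrupted: keep the incomplete frames, padding the
# missing data with zeros (sum_frames=False so only_valid_data applies)
s = hs.load('interrupted_map.pts', only_valid_data=False, sum_frames=False)
```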


19 changes: 15 additions & 4 deletions hyperspy/_signals/lazy.py
@@ -167,13 +167,21 @@ def rechunk(self,
**kwargs)
)


def close_file(self):
"""Closes the associated data file if any.

Currently it only supports closing the file associated with a dask
array created from an h5py DataSet (default HyperSpy hdf5 reader).

"""
try:
self._get_file_handle().close()
except AttributeError:
_logger.warning("Failed to close lazy Signal file")

def _get_file_handle(self, warn=True):
"""Return file handle when possible; currently only hdf5 file are
supported.
"""
arrkey = None
for key in self.data.dask.keys():
@@ -182,9 +190,12 @@ def close_file(self):
break
if arrkey:
try:
self.data.dask[arrkey].file.close()
except AttributeError:
_logger.exception("Failed to close lazy Signal file")
return self.data.dask[arrkey].file
except (AttributeError, ValueError):
if warn:
_logger.warning("Failed to retrieve file handle, either "
"the file is already closed or it is not "
"a hdf5 file.")

def _get_dask_chunks(self, axis=None, dtype=None):
"""Returns dask chunks.
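For context, a minimal usage sketch of the refactored pair (filename illustrative):

```python
import hyperspy.api as hs

s = hs.load('data.hspy', lazy=True)  # the HDF5 file stays open for dask reads
# ... lazy processing ...
s.close_file()  # release the underlying h5py file handle when done
```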
7 changes: 4 additions & 3 deletions hyperspy/axes.py
@@ -289,9 +289,10 @@ def __init__(self,

self.events = Events()
if '_type' in kwargs:
if kwargs.get('_type') != self.__class__.__name__:
raise ValueError('The passed `_type` of axis is inconsistent '
'with the given attributes')
_type = kwargs.get('_type')
if _type != self.__class__.__name__:
raise ValueError(f'The passed `_type` ({_type}) of axis is '
'inconsistent with the given attributes.')
_name = self.__class__.__name__
self.events.index_changed = Event("""
Event that triggers when the index of the `{}` changes
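A hedged sketch of the improved error message (the axis attributes are illustrative; `_type` is the key checked in the diff above):

```python
from hyperspy.axes import DataAxis

# `_type` records which axis class produced a serialised axis dict.
# Recreating an axis with a mismatched `_type` now names the offender:
DataAxis(size=10, scale=0.1, offset=0.0, _type='SomeOtherAxis')
# ValueError: The passed `_type` (SomeOtherAxis) of axis is inconsistent
# with the given attributes.
```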
18 changes: 17 additions & 1 deletion hyperspy/io.py
@@ -26,7 +26,7 @@
from natsort import natsorted
from inspect import isgenerator
from pathlib import Path
from collections import MutableMapping
from collections.abc import MutableMapping

from hyperspy.drawing.marker import markers_metadata_dict_to_markers
from hyperspy.exceptions import VisibleDeprecationWarning
@@ -277,6 +277,17 @@ def load(filenames=None,
acquisition stopped before the end: if True, load only the acquired
data. If False, fill empty data with zeros. Default is False and this
default value will change to True in version 2.0.
chunks : tuple of integer or None
Only for hspy files. Define the chunking used for saving the dataset.
Contributor:

In the user guide it says these extra arguments are only relevant for zarr format, here it says they are only relevant for hspy format. Also some details in the description differ between here and io.rst.

Member (author):

Yes, this docstring is for save, not load! 🤦

If None, chunks are calculated for the signal, with preferably at least
one chunk per signal space.
close_file : bool, optional
Only for hspy files. Close the file after writing, default is True.
write_dataset : bool, optional
Only for hspy files. If True, write the dataset; otherwise, don't
write it. Useful to save attributes without having to write the whole
dataset. Default is True.


Returns
-------
@@ -516,6 +527,11 @@ def load_with_reader(
signal.tmp_parameters.folder = folder
signal.tmp_parameters.filename = filename
signal.tmp_parameters.extension = extension.replace('.', '')
# original_folder, original_filename and original_extension are used to
# keep track of where the lazily opened file is located
signal.tmp_parameters.original_folder = folder
signal.tmp_parameters.original_filename = filename
signal.tmp_parameters.original_extension = extension.replace('.', '')
# test if binned attribute is still in metadata
if signal.metadata.has_item('Signal.binned'):
for axis in signal.axes_manager.signal_axes:
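A hedged sketch of what this tracking enables (behaviour inferred from the comment above; filenames illustrative): after saving under a new name, the `original_*` attributes still point at the file backing the lazy dask array.

```python
import hyperspy.api as hs

s = hs.load('data.hspy', lazy=True)
s.save('copy.hspy', overwrite=True)
# tmp_parameters.filename now follows 'copy', but the original_* attributes
# still locate 'data.hspy', whose handle backs the lazy data
print(s.tmp_parameters.original_filename)  # 'data'
```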