Zarr Reading and Writing (.zspy format) - Reformatted #2825

Merged
merged 34 commits into from Oct 20, 2021
Changes from 29 commits
Commits
5d94aa7
refactoring hspy and zspy to build from each other
CSSFrancis Aug 31, 2021
2270bd7
Added in zspy as file format
CSSFrancis Sep 1, 2021
0e6d705
Added in HyperspyReader and HyperspyWriter Classes for easier organiz…
CSSFrancis Sep 3, 2021
37d5fe0
Hyperspy file loading as a class set of functions
CSSFrancis Sep 3, 2021
8e27d65
Zspy classes for saving/loading data added. Still need some additiona…
CSSFrancis Sep 3, 2021
af590ee
Tried to reduce code. Still could remove parse data function
CSSFrancis Sep 3, 2021
023630b
Messed up the file reading a little bit. Need to Fix Nexus and EMD f…
CSSFrancis Sep 8, 2021
28350a5
Broken-- Starting with a new branch
CSSFrancis Sep 16, 2021
c6f8178
Cleaned up get_signal_chunks and overwrite dataset functions
CSSFrancis Sep 16, 2021
6021f3f
Added in the ability to define your storage container
CSSFrancis Sep 16, 2021
db105d5
Updated Documentation including zarr format
CSSFrancis Sep 16, 2021
fc2302d
Updated Big data documentation
CSSFrancis Sep 16, 2021
f766284
Added in change description
CSSFrancis Sep 16, 2021
7fe289a
Added Zarr as a requirement
CSSFrancis Sep 16, 2021
57d96f1
removed unused imports
CSSFrancis Sep 16, 2021
665dc59
Merge remote-tracking branch 'upstream/RELEASE_next_minor' into zarr-io
CSSFrancis Sep 16, 2021
f0a2144
Change wording to be more formal
CSSFrancis Oct 5, 2021
4132db5
Cleaning up after review
CSSFrancis Oct 5, 2021
b83ec5d
Added in spacing for nexus file
CSSFrancis Oct 5, 2021
9fffe4c
Removed maxshape and shuffle for the `require_dataset` function
CSSFrancis Oct 5, 2021
cf8632e
Overwrite working properly
CSSFrancis Oct 5, 2021
fb4c048
Added in more checking for directory and zspy format
CSSFrancis Oct 5, 2021
48c88f6
Add in shuffle kwd argument as default for hspy format.
CSSFrancis Oct 5, 2021
efd1086
Fix typo error message.
ericpre Oct 6, 2021
b9be6fc
Simplify inheritance structure
ericpre Oct 6, 2021
ad0902b
Merge pull request #1 from ericpre/zarr-io
CSSFrancis Oct 7, 2021
9840bcc
Replaced get_signal_chunks_function
CSSFrancis Oct 7, 2021
31d27b0
Allow MutableMapping Objects to be passed in
CSSFrancis Oct 7, 2021
0e98cae
Cleaning up print statements and saving output
CSSFrancis Oct 18, 2021
88b38bc
Update hyperspy/io_plugins/_hierarchical.py
CSSFrancis Oct 19, 2021
cb45958
Specify dtype for testing on 32bit operating systems
CSSFrancis Oct 19, 2021
d863064
Merge remote-tracking branch 'origin/zarr-io' into zarr-io
CSSFrancis Oct 19, 2021
3132e22
Add fsspec to conda environment and setup.py since this is not a dask…
ericpre Oct 20, 2021
2b236f5
Install zarr on x86_64 machines until pypi and conda packages are ava…
ericpre Oct 20, 2021
1 change: 1 addition & 0 deletions conda_environment.yml
@@ -28,5 +28,6 @@ dependencies:
- toolz
- tqdm
- traits
- zarr


9 changes: 9 additions & 0 deletions doc/user_guide/big_data.rst
@@ -397,6 +397,15 @@ Other minor differences
convenience, ``nansum``, ``nanmean`` and other ``nan*`` signal methods were
added to mimic the workflow as closely as possible.

.. _big_data.saving:

Saving Big Data
^^^^^^^^^^^^^^^^^

The most efficient format supported by HyperSpy to write data is the :ref:`zspy format <zspy-format>`,
mainly because it supports writing concurrently from multiple threads or processes.

This also allows for smooth interaction with dask-distributed for efficient scaling.

.. _lazy_details:

61 changes: 61 additions & 0 deletions doc/user_guide/io.rst
@@ -251,6 +251,8 @@ HyperSpy. The "lazy" column specifies if lazy evaluation is supported.
+-----------------------------------+--------+--------+--------+
| hspy | Yes | Yes | Yes |
+-----------------------------------+--------+--------+--------+
| zspy | Yes | Yes | Yes |
+-----------------------------------+--------+--------+--------+
| Image: e.g. jpg, png, tif, ... | Yes | Yes | Yes |
+-----------------------------------+--------+--------+--------+
| TIFF | Yes | Yes | Yes |
@@ -418,6 +420,65 @@ Extra saving arguments
saving a file, be aware that it may not be possible to load it in some platforms.


.. _zspy-format:

ZSpy - HyperSpy's Zarr Specification
------------------------------------

Similarly to the :ref:`hspy format <hspy-format>`, the zspy format guarantees that no
information will be lost in the writing process and supports saving data
of arbitrary dimensions. It is based on the `Zarr project <https://zarr.readthedocs.io/en/stable/index.html>`_,
which aims to be a drop-in replacement for HDF5 that addresses some of the speed and scaling
issues of the HDF5 format, and is therefore suitable for saving :ref:`big data <big_data.saving>`.


.. code-block:: python

    >>> s = hs.signals.BaseSignal([0])
    >>> s.save('test.zspy') # will save in nested directory
    >>> hs.load('test.zspy') # loads the directory


When saving to `zspy <https://zarr.readthedocs.io/en/stable/index.html>`_, all supported objects in the signal's
:py:attr:`~.signal.BaseSignal.metadata` are stored. This includes lists, tuples and signals.
Please note that, in order to increase saving efficiency and speed, the inner-most
structures are converted to numpy arrays where possible when saved. This
procedure homogenizes the types of the objects inside, most notably casting
numbers as strings if any strings are present.
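
The casting described above is ordinary NumPy behaviour rather than anything specific to zspy. A minimal sketch, assuming only that a mixed metadata list ends up going through ``numpy.array`` (the variable names and values are illustrative, not taken from HyperSpy):

```python
import numpy as np

# A metadata-like list that mixes numbers and a string (illustrative values).
mixed = [1, 2.5, "a"]

# NumPy chooses one dtype for the whole array, so the numbers become strings.
as_array = np.array(mixed)

print(as_array.dtype.kind)  # 'U' means a unicode string dtype
print(as_array.tolist())    # every element is now a str
```

If numeric values must survive a round trip unchanged, keeping them in a separate, purely numeric list avoids the conversion.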

Extra saving arguments
^^^^^^^^^^^^^^^^^^^^^^

- ``compressor``: A `Numcodecs Codec <https://numcodecs.readthedocs.io/en/stable/index.html>`_.
  A compressor can be passed to the save function to compress the data efficiently. The default
  is to call a Blosc Compressor object.

.. code-block:: python

    >>> from numcodecs import Blosc
    >>> compressor = Blosc(cname='zstd', clevel=1, shuffle=Blosc.SHUFFLE)
    >>> s.save('test.zspy', compressor=compressor) # will save with Blosc compression

.. note::

    Lazy operations are often I/O bound: reading and writing the data creates a bottleneck
    due to the slow read/write speed of many hard disks. In these cases, compressing your data is often
    beneficial to the speed of some operations. Compression speeds up the process because there is less to
    read and write, at the cost of slightly more computational work on the CPU.

- ``write_to_storage``: The write to storage option allows you to pass the path to a directory (or database)
Member:

Another alternative would be to define the store argument to pass a zarr store, for example, as defined by store = zarr.ZipStore('data/example.zip', mode='w'). If the store argument is specified, then zspy can pick up the path from the store itself and it doesn't need to be defined explicitly.
An even simpler alternative is to accept a zarr store for the filename, but I am not convinced it is good because it is a bit misleading.

Member Author:

Yea I considered both of those possibilities but decided against them for kind of the reason you described above.

For the passing of a store argument your filename doesn't really mean anything as only the store argument will be passed. If I remember correctly this complicates things a little bit.

The second one is pretty do-able and I think that I got it working at one point. I could be convinced to get that working again. The question is if that really makes sense.

Member:

Going for option 2 should be good because this is a pattern used by the zarr.group, as it can take a path or a zarr store as first argument.

and write directly to the storage container. This gives you access to the `different storage methods
<https://zarr.readthedocs.io/en/stable/api/storage.html>`_
available through zarr, namely SQL, MongoDB or LMDB databases. Additional packages may need
to be installed to use these features.

.. code-block:: python

    >>> import os
    >>> import zarr
    >>> filename = 'test.zspy/'
    >>> os.mkdir('test.zspy')
    >>> store = zarr.LMDBStore(path=filename)
    >>> s.save(store.path, write_to_storage=True) # saved to LMDB
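
The trade-off described in the note under ``compressor`` can be seen without zarr installed. The sketch below swaps in the standard library's ``zlib`` for Blosc, so it only illustrates the principle (fewer bytes to move, extra CPU work to restore them) and says nothing about Blosc's actual ratios or speed. The payload is hypothetical.

```python
import zlib

# Highly repetitive payload standing in for a sparse detector frame (hypothetical data).
raw = bytes(1_000_000)

# Level 1 favours speed over compression ratio.
packed = zlib.compress(raw, 1)

# Fewer bytes reach the disk; the price is a decompression step when reading back.
restored = zlib.decompress(packed)

print(len(raw), len(packed))
```

Real signal data is far less regular than a block of zeros, so the gain depends heavily on the dataset.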

.. _netcdf-format:

NetCDF
69 changes: 43 additions & 26 deletions hyperspy/io.py
@@ -26,6 +26,7 @@
from natsort import natsorted
from inspect import isgenerator
from pathlib import Path
from collections.abc import MutableMapping

from hyperspy.drawing.marker import markers_metadata_dict_to_markers
from hyperspy.exceptions import VisibleDeprecationWarning
@@ -327,22 +328,23 @@ def load(filenames=None,
lazy = load_ui.lazy
if filenames is None:
raise ValueError("No file provided to reader")

if isinstance(filenames, str):
pattern = filenames
if escape_square_brackets:
filenames = _escape_square_brackets(filenames)

 filenames = natsorted([f for f in glob.glob(filenames)
-                       if os.path.isfile(f)])
+                       if os.path.isfile(f) or (os.path.isdir(f) and
+                       os.path.splitext(f)[1] == '.zspy')])

if not filenames:
raise ValueError(f'No filename matches the pattern "{pattern}"')

elif isinstance(filenames, Path):
# Just convert to list for now, pathlib.Path not
# fully supported in io_plugins
-filenames = [f for f in [filenames] if f.is_file()]
+filenames = [f for f in [filenames]
+             if f.is_file() or (f.is_dir() and ".zspy" in f.name)]

elif isgenerator(filenames):
filenames = list(filenames)
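
The string branch of the hunk above widens the glob filter so that directories ending in ``.zspy`` count as loadable alongside regular files. A standalone sketch of that filter follows; ``match_loadable`` is a hypothetical helper name, and plain ``sorted`` stands in for ``natsorted`` so that the example needs only the standard library.

```python
import glob
import os
import tempfile


def match_loadable(pattern):
    """Return sorted paths that are files, or directories ending in '.zspy'."""
    return sorted(
        f for f in glob.glob(pattern)
        if os.path.isfile(f)
        or (os.path.isdir(f) and os.path.splitext(f)[1] == ".zspy")
    )


with tempfile.TemporaryDirectory() as tmp:
    # One ordinary file, one zspy-style directory, one unrelated directory.
    open(os.path.join(tmp, "a.hspy"), "w").close()
    os.mkdir(os.path.join(tmp, "b.zspy"))
    os.mkdir(os.path.join(tmp, "c_other"))

    found = [os.path.basename(p) for p in match_loadable(os.path.join(tmp, "*"))]

print(found)
```

The unrelated directory is dropped, while the ``.zspy`` directory is kept even though it is not a file.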
@@ -449,7 +451,8 @@ def load_single_file(filename, **kwds):
Data loaded from the file.

"""
-if not os.path.isfile(filename):
+if not os.path.isfile(filename) and not (os.path.isdir(filename) or
+                                         os.path.splitext(filename)[1] == '.zspy'):
raise FileNotFoundError(f"File: {filename} not found!")

# File extension without "." separator
@@ -731,11 +734,14 @@ def save(filename, signal, overwrite=None, **kwds):
None

"""
-filename = Path(filename).resolve()
-extension = filename.suffix
-if extension == '':
-    extension = ".hspy"
-    filename = filename.with_suffix(extension)
+if isinstance(filename, MutableMapping):
+    extension = ".zspy"
+else:
+    filename = Path(filename).resolve()
+    extension = filename.suffix
+    if extension == '':
+        extension = ".hspy"
+        filename = filename.with_suffix(extension)

writer = None
for plugin in io_plugins:
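
The branching added at the top of ``save`` can be read in isolation: a mapping-like object is treated as a store and implies the zspy writer, while anything else is resolved as a path with ``.hspy`` as the fallback suffix. The sketch below reproduces just that decision. ``resolve_target`` is a hypothetical name, and a plain ``dict`` stands in for a zarr store on the assumption that the store implements the ``MutableMapping`` interface.

```python
from collections.abc import MutableMapping
from pathlib import Path


def resolve_target(filename):
    """Return (target, extension) following the branching in the hunk above."""
    if isinstance(filename, MutableMapping):
        # Store-like objects are passed through untouched and imply zspy.
        return filename, ".zspy"
    path = Path(filename).resolve()
    extension = path.suffix
    if extension == "":
        # No suffix given: fall back to the hspy default.
        extension = ".hspy"
        path = path.with_suffix(extension)
    return path, extension


store = {}  # a plain dict stands in for a zarr store here
print(resolve_target(store)[1])
print(resolve_target("data")[0].name)
```

Note that the import comes from ``collections.abc``; importing ``MutableMapping`` directly from ``collections`` no longer works on Python 3.10 and later.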
@@ -780,24 +786,35 @@ def save(filename, signal, overwrite=None, **kwds):
)

# Create the directory if it does not exist
ensure_directory(filename.parent)
is_file = filename.is_file()

if overwrite is None:
write = overwrite_method(filename) # Ask what to do
elif overwrite is True or (overwrite is False and not is_file):
write = True # Write the file
elif overwrite is False and is_file:
write = False # Don't write the file
if not isinstance(filename, MutableMapping):
ensure_directory(filename.parent)
is_file = filename.is_file() or (filename.is_dir() and
os.path.splitext(filename)[1] == '.zspy')

if overwrite is None:
write = overwrite_method(filename) # Ask what to do
elif overwrite is True or (overwrite is False and not is_file):
write = True # Write the file
elif overwrite is False and is_file:
write = False # Don't write the file
else:
raise ValueError("`overwrite` parameter can only be None, True or "
"False.")
else:
raise ValueError("`overwrite` parameter can only be None, True or "
"False.")
write = True # file does not exist (creating it)
if write:
# Pass as a string for now, pathlib.Path not
# properly supported in io_plugins
writer.file_writer(str(filename), signal, **kwds)

_logger.info(f'{filename} was created')
signal.tmp_parameters.set_item('folder', filename.parent)
signal.tmp_parameters.set_item('filename', filename.stem)
signal.tmp_parameters.set_item('extension', extension)
if not isinstance(filename, MutableMapping):
writer.file_writer(str(filename), signal, **kwds)
_logger.info(f'{filename} was created')
signal.tmp_parameters.set_item('folder', filename.parent)
signal.tmp_parameters.set_item('filename', filename.stem)
signal.tmp_parameters.set_item('extension', extension)
else:
writer.file_writer(filename, signal, **kwds)
if hasattr(filename, "path"):
file = Path(filename.path).resolve()
signal.tmp_parameters.set_item('folder', file.parent)
signal.tmp_parameters.set_item('filename', file.stem)
signal.tmp_parameters.set_item('extension', extension)
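
For path-like targets, the overwrite handling in this hunk reduces to a small decision table. The sketch below restates it as a pure function so the cases can be checked one by one. ``should_write`` and its ``ask`` callback are hypothetical names; ``ask`` stands in for the interactive ``overwrite_method`` prompt, whose behaviour is not shown in the diff.

```python
def should_write(overwrite, exists, ask=lambda: False):
    """Decide whether to write, mirroring the overwrite branching above."""
    if overwrite is None:
        # Defer to the caller-supplied prompt (hypothetical stand-in).
        return ask()
    if overwrite is True:
        return True
    if overwrite is False:
        # Only write when nothing would be replaced.
        return not exists
    raise ValueError("`overwrite` parameter can only be None, True or False.")


print(should_write(True, exists=True))
print(should_write(False, exists=True))
print(should_write(False, exists=False))
```

In the hunk, ``exists`` corresponds to ``is_file``, which is also true for an existing ``.zspy`` directory. Store objects skip this check entirely and are always written.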
2 changes: 2 additions & 0 deletions hyperspy/io_plugins/__init__.py
@@ -40,6 +40,7 @@
semper_unf,
sur,
tiff,
zspy,
)

io_plugins = [
@@ -63,6 +64,7 @@
semper_unf,
sur,
tiff,
zspy,
]

