Fix saving ragged array string hspy #213
Conversation
…dependent (ragged) texts markers: h5py doesn't support numpy unicode dtype, convert to variable length h5py str type before saving
…t (the dtype of the first was taken and the length of the unicode dtype can be too short for all the other arrays)
```python
# Since h5py doesn't support numpy unicode dtype, we need to save
# data with h5py variable length str dtype and when reading the
# ragged data, we need to convert it back to numpy unicode dtype
if data.dtype.metadata["vlen"].metadata["vlen"] == str:
```
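The nested metadata check above can be exercised without h5py: `h5py.special_dtype(vlen=...)` returns an object dtype whose `.metadata` carries the variable-length base type under the `"vlen"` key, so the sketch below builds the same shape of dtype by hand with plain numpy (the real dtypes would come from h5py when reading a file).

```python
import numpy as np

# Emulate the dtypes h5py produces: an object dtype whose metadata
# records the variable-length base type (stand-in for special_dtype).
inner = np.dtype("O", metadata={"vlen": str})     # variable-length strings
ragged = np.dtype("O", metadata={"vlen": inner})  # ragged array of the above

# The check quoted above: unwrap two levels of "vlen" metadata
is_string_ragged = ragged.metadata["vlen"].metadata["vlen"] == str
print(is_string_ragged)  # True
```

The `metadata` keyword of `np.dtype` is only weakly documented, which is exactly the robustness concern raised in this thread.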
I am not sure how robust it is, but this is the only workaround I could find to figure out whether the dtype of the ragged array is string.
Yeah, I think in the past I've used something like `data[data.ndim * (0,)].dtype`, but you're right that this makes a little more sense and removes the possible error that might occur with a 0-dimensional array.
numpy does say that it might change in the future, so I would just make sure that it is well tested.
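The alternative check mentioned above can be sketched like this (hypothetical helper name; it inspects the dtype of the first element by building its index as `(0,) * ndim`, and would raise `IndexError` on an empty array):

```python
import numpy as np

# Inspect the dtype of the first ragged element instead of relying on
# the dtype metadata (which numpy warns may change in the future).
def first_element_dtype(data):
    return data[data.ndim * (0,)].dtype

ragged = np.empty((2, 2), dtype=object)
for index in np.ndindex(ragged.shape):
    ragged[index] = np.array(["a", "bc"])

print(first_element_dtype(ragged))  # dtype of the first ragged element
```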
Yeah, I am not keen on this, but I would be ok with keeping it, knowing that I couldn't find any other solution with h5py, even if there is a risk that it breaks at some point because this attribute is weakly documented.
This is another justification for considering the alternative approach more seriously.
@CSSFrancis, can you please have a look at this PR?
```python
self.ragged_kwds = {
    "dtype": h5py.special_dtype(vlen=signal["data"][0].dtype)
}
self.unicode_kwds = {"dtype": h5py.string_dtype()}
```
Code scanning warning (CodeQL): Overwriting attribute in super-class or sub-class (`HierarchicalWriter`).
Codecov Report (Attention):

```diff
@@            Coverage Diff             @@
##             main     #213      +/-   ##
==========================================
- Coverage   86.22%   86.15%   -0.07%
==========================================
  Files          82       82
  Lines       10549    10568      +19
  Branches     2293     2300       +7
==========================================
+ Hits         9096     9105       +9
- Misses        933      940       +7
- Partials      520      523       +3
```

View full report in Codecov by Sentry.
Hmm, I think that this is a good solution to a hard problem. I don't love that you have to go through every position to find the longest string. I almost want to just say preallocate 128 characters and call it good; it might be a waste, but storage is cheap. We could have a setting where it either checks or goes with the default.
That, or we could unwrap and wrap the data, which might be the best way to do things.
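The two sizing strategies being weighed here can be sketched as follows (helper name and sample data are hypothetical, not the PR's code): scan every ragged element for the widest unicode dtype, or preallocate a fixed width such as 128 characters.

```python
import numpy as np

# Strategy 1: scan all elements for the widest numpy unicode dtype.
def longest_str_dtype(data):
    return max((data[index].dtype for index in np.ndindex(data.shape)),
               key=lambda dt: dt.itemsize)

data = np.empty((3,), dtype=object)
data[0] = np.array(["a"])
data[1] = np.array(["abcdef"])
data[2] = np.array(["xy"])

print(longest_str_dtype(data))    # widest dtype found by scanning
# Strategy 2: fixed preallocation; numpy unicode uses 4 bytes per char.
print(np.dtype("U128").itemsize)  # 512 bytes per entry, scan-free
```

The scan is exact but touches every index (a problem for lazy arrays, as noted below); the fixed width is O(1) but wasteful and can still be exceeded.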
```python
    convert_to_unicode = True
else:
    convert_to_unicode = False
except Exception:
```
Why is there a try/except here? Maybe add a comment on why this can fail.
```python
for index in np.ndindex(data.shape):
    if data[index].dtype.itemsize // size_of_char > dtype.itemsize:
        dtype = data[index].dtype
```
How does this work with lazy arrays of strings? I don't think that this should work, and if it does, it won't be very efficient, as you have to look at every single index.
We don't have to support saving lazy arrays of strings, but we should at least document it... Part of me is tempted to just preallocate something like 128 characters, but as soon as I say that, something is going to come up where I want more than 128.
Yes, I will have a look at the feasibility of the alternative approach mentioned in #213 (comment), because if it is possible, this would be done in the `flatten`/`unflatten` functions and therefore be lazy friendly.
I think to do that you would basically have to store everything as one long string. For example:
`["a", "b", "cd", "e"]` --> `"abcde"`
and then also save an int array `[1, 1, 2, 1]`.
You could also use a delimiter:
`["a", "b", "cd", "e"]` --> `"a-b-cd-e"`
The flatten function:
```python
def flatten_strings(arr):
    string_lengths = np.empty(arr.shape, dtype=object)
    flat_strings = np.empty(arr.shape, dtype=object)
    for i in np.ndindex(arr.shape):
        # per-element string lengths, and all strings joined into one
        string_lengths[i] = np.char.str_len(arr[i])
        flat_strings[i] = "".join(arr[i])
    return string_lengths, flat_strings
```
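A hypothetical inverse of the flatten function sketched above, rebuilding the ragged array from the lengths array and the joined strings (names and shapes are assumptions for illustration, not the PR's API):

```python
import numpy as np

# Rebuild each ragged element by slicing the joined string at the
# offsets given by the cumulative sum of the stored lengths.
def unflatten_strings(string_lengths, flat_strings):
    arr = np.empty(flat_strings.shape, dtype=object)
    for i in np.ndindex(flat_strings.shape):
        offsets = np.concatenate(([0], np.cumsum(string_lengths[i])))
        arr[i] = np.array(
            [flat_strings[i][start:end]
             for start, end in zip(offsets[:-1], offsets[1:])]
        )
    return arr

lengths = np.empty((1,), dtype=object)
flat = np.empty((1,), dtype=object)
lengths[0] = np.array([1, 1, 2, 1])
flat[0] = "abcde"
print(unflatten_strings(lengths, flat)[0])  # ['a' 'b' 'cd' 'e']
```

Both directions only ever touch one ragged element at a time, which is what would make this approach lazy friendly.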
Note that 2-d ragged arrays of strings might be more of a headache so I'm okay with sticking to the 1d case.
Done in #217.
Fix for #212.
Progress of the PR:
- `upcoming_changes` folder entry (see `upcoming_changes/README.rst`),
- `docs/readthedocs.org:rosettasciio` build of this PR (link in github checks),
- Minimal example of the bug fix or the new feature.