Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix saving ragged array string hspy #213

Closed
wants to merge 3 commits into from

Conversation

ericpre
Copy link
Member

@ericpre ericpre commented Jan 21, 2024

Fix for #212.

Progress of the PR

  • Change implemented (can be split into several points),
  • update docstring (if appropriate),
  • update user guide (if appropriate),
  • add a changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
  • Check formatting of the changelog entry (and eventual user guide changes) in the docs/readthedocs.org:rosettasciio build of this PR (link in github checks)
  • add tests,
  • ready for review.

Minimal example of the bug fix or the new feature

"""
import numpy as np
import hyperspy.api as hs
import h5py

fname = "test.hspy"

rng = np.random.default_rng(0)
data = np.ones((5, 100, 100))
s = hs.signals.Signal2D(data)

# Create an navigation dependent (ragged) Texts marker
offsets = np.empty(s.axes_manager.navigation_shape, dtype=object)
texts = np.empty(s.axes_manager.navigation_shape, dtype=object)

for index in np.ndindex(offsets.shape):
    i = index[0]
    offsets[index] = rng.random((5, 2))[: i + 2] * 100
    texts[index] = np.array(
        ["a" * (i + 1), "b", "c", "d", "e"][: i + 2]
    )

m = hs.plot.markers.Texts(
    offsets=offsets,
    texts=texts,
    sizes=3,
    facecolor="black",
)

s.add_marker(m, permanent=True)
s.plot()
s.save(fname, overwrite=True)

s2 = hs.load(fname)

m_texts = m.kwargs['texts']
m2_texts = s2.metadata.Markers.Texts.kwargs['texts']


for index in np.ndindex(m_texts.shape):
    np.testing.assert_equal(m_texts[index], m2_texts[index])

…dependent (ragged) texts markers: h5py doesn't support numpy unicode dtype, convert to variable length h5py str type before saving
…t (the dtype of the first was taken and the length of unicode dtype can be too short for all other array)
# Since h5py doesn't support numpy unicode dtype, we need to save
# data with h5py variable length str dtype and when reading the
# ragged data, we need to convert it back to numpy unicode dtype
if data.dtype.metadata["vlen"].metadata["vlen"] == str:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how robust it would be but this is the only way (workaround) that I could find to figure out if the dtype of the ragged are string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I think in the past I've used something like data[data.dim*(0,)].dtype but you're right that this makes a little more sense and removes the possible error that might occur with a 0 dimensional array.

numpy does say that it might change in the future so I would just make sure that it is well tested.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I am not keen on this but I would ok with keeping it knowing that I couldn't find any another solution with h5py, even if there is a risk that it breaks at some point because this attribute is "weakly" documented.
This is another justification for consider the alternative approach more seriously.

@ericpre
Copy link
Member Author

ericpre commented Jan 21, 2024

@CSSFrancis, can you please have a look at this PR?
An alternative would be to save an attributes which contain the dtype for each array of the ragged array and convert it back when reading - similarly as it is done with reshaping ragged array. I didn't look much how easy it would be but I thought about it as being more robust/generic.

self.ragged_kwds = {
"dtype": h5py.special_dtype(vlen=signal["data"][0].dtype)
}
self.unicode_kwds = {"dtype": h5py.string_dtype()}

Check warning

Code scanning / CodeQL

Overwriting attribute in super-class or sub-class Warning

Assignment overwrites attribute unicode_kwds, which was previously defined in superclass
HierarchicalWriter
.
Copy link

codecov bot commented Jan 21, 2024

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (c85f768) 86.22% compared to head (d48e4ae) 86.15%.

Files Patch % Lines
rsciio/zspy/_api.py 16.66% 4 Missing and 1 partial ⚠️
rsciio/_hierarchical.py 85.71% 2 Missing and 1 partial ⚠️
rsciio/hspy/_api.py 75.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #213      +/-   ##
==========================================
- Coverage   86.22%   86.15%   -0.07%     
==========================================
  Files          82       82              
  Lines       10549    10568      +19     
  Branches     2293     2300       +7     
==========================================
+ Hits         9096     9105       +9     
- Misses        933      940       +7     
- Partials      520      523       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@CSSFrancis CSSFrancis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I think that this is a good solution to hard problem. I don't love that you have to go through every position to find the longest string. I almost want to just say preallocate 128 characters and call it good. It might be a waste but storage is cheap. We could have a setting where it either checks or goes with the default.

That or we could unwrap and wrap the data which might be the best way to do things.

# Since h5py doesn't support numpy unicode dtype, we need to save
# data with h5py variable length str dtype and when reading the
# ragged data, we need to convert it back to numpy unicode dtype
if data.dtype.metadata["vlen"].metadata["vlen"] == str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I think in the past I've used something like data[data.dim*(0,)].dtype but you're right that this makes a little more sense and removes the possible error that might occur with a 0 dimensional array.

numpy does say that it might change in the future so I would just make sure that it is well tested.

convert_to_unicode = True
else:
convert_to_unicode = False
except Exception:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a Try/Except here? Maybe add a comment on why this fails.

Comment on lines +111 to +113
for index in np.ndindex(data.shape):
if data[index].dtype.itemsize // size_of_char > dtype.itemsize:
dtype = data[index].dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work with lazy arrays of strings? I don't think that this should work? If it does it won't be very efficient as you have to look at every single index.

We don't have to support saving lazy arrays of strings but we should at least document it... Part of me is just tempted to pre allocate something like 128 characters but as soon as I say that then something is going to come up where I want more than 128.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have a look at the feasibility of the alternative approach mentioned in #213 (comment), because if this is possible, this would be done in the flatten/unflatten functions and therefore be lazy friendly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to do that you would basically have to store everything as 1 long string:

For example

["a", "b", "cd", "e"] --> "abcde" and then also save an int array [1,1,2,1]

You could also use a deliminator:
["a", "b", "cd", "e"] --> "a-b-cd-e"

The flatten function:

def flatten_strings(arr):
    string_lengths = np.empty(arr.shape, dtype=object)
    flat_strings= np.empty(arr.shape, dtype=object)
    for i in np.ndindex(arr):
        string_lengths[I] = np.char.str_len(arr)
        flat_strings[i] =  "".join( arr[I])
    return string_lengths, flat_strings

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that 2-d ragged arrays of strings might be more of a headache so I'm okay with sticking to the 1d case.

@ericpre
Copy link
Member Author

ericpre commented Jan 31, 2024

Done in #217.

@ericpre ericpre closed this Jan 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error when saving markers with navigation dimension to hspy format
3 participants