Fix saving ragged array string hspy #213
Conversation
…dependent (ragged) texts markers: h5py doesn't support numpy unicode dtype, convert to variable length h5py str type before saving
…t (the dtype of the first was taken and the length of the unicode dtype can be too short for all the other arrays)
```python
# Since h5py doesn't support numpy unicode dtype, we need to save
# data with h5py variable length str dtype and when reading the
# ragged data, we need to convert it back to numpy unicode dtype
if data.dtype.metadata["vlen"].metadata["vlen"] == str:
```
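The nested metadata check above can be exercised without h5py: `h5py.special_dtype(vlen=...)` returns an object dtype whose `.metadata` carries the variable-length base type under the `"vlen"` key, so the sketch below builds the same shape of dtype by hand with plain numpy (the real dtypes would come from h5py when reading a file).

```python
import numpy as np

# Emulate the dtypes h5py produces: an object dtype whose metadata
# records the variable-length base type (stand-in for special_dtype).
inner = np.dtype("O", metadata={"vlen": str})     # variable-length strings
ragged = np.dtype("O", metadata={"vlen": inner})  # ragged array of the above

# The check quoted above: unwrap two levels of "vlen" metadata
is_string_ragged = ragged.metadata["vlen"].metadata["vlen"] == str
print(is_string_ragged)  # True
```

The `metadata` keyword of `np.dtype` is only weakly documented, which is exactly the robustness concern raised in this thread.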
I am not sure how robust it is, but this is the only workaround I could find to figure out whether the dtype of the ragged array is string.
Yeah, I think in the past I've used something like `data[data.ndim * (0,)].dtype`, but you're right that this makes a little more sense and removes the possible error that might occur with a 0-dimensional array.
numpy does say that it might change in the future, so I would just make sure that it is well tested.
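The alternative check mentioned above can be sketched like this (hypothetical helper name; it inspects the dtype of the first element by building its index as `(0,) * ndim`, and would raise `IndexError` on an empty array):

```python
import numpy as np

# Inspect the dtype of the first ragged element instead of relying on
# the dtype metadata (which numpy warns may change in the future).
def first_element_dtype(data):
    return data[data.ndim * (0,)].dtype

ragged = np.empty((2, 2), dtype=object)
for index in np.ndindex(ragged.shape):
    ragged[index] = np.array(["a", "bc"])

print(first_element_dtype(ragged))  # dtype of the first ragged element
```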
Yeah, I am not keen on this, but I would be ok with keeping it, knowing that I couldn't find any other solution with h5py, even if there is a risk that it breaks at some point because this attribute is weakly documented.
This is another justification for considering the alternative approach more seriously.
@CSSFrancis, can you please have a look at this PR?
```python
self.ragged_kwds = {
    "dtype": h5py.special_dtype(vlen=signal["data"][0].dtype)
}
self.unicode_kwds = {"dtype": h5py.string_dtype()}
```
Code scanning warning (CodeQL): Overwriting attribute in super-class or sub-class (`HierarchicalWriter`).
Codecov Report (Attention):

```diff
@@            Coverage Diff             @@
##             main     #213      +/-   ##
==========================================
- Coverage   86.22%   86.15%   -0.07%
==========================================
  Files          82       82
  Lines       10549    10568      +19
  Branches     2293     2300       +7
==========================================
+ Hits         9096     9105       +9
- Misses        933      940       +7
- Partials      520      523       +3
```

View full report in Codecov by Sentry.
Hmm, I think that this is a good solution to a hard problem. I don't love that you have to go through every position to find the longest string. I almost want to just say preallocate 128 characters and call it good; it might be a waste, but storage is cheap. We could have a setting where it either checks or goes with the default.
That, or we could unwrap and wrap the data, which might be the best way to do things.
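The two sizing strategies being weighed here can be sketched as follows (helper name and sample data are hypothetical, not the PR's code): scan every ragged element for the widest unicode dtype, or preallocate a fixed width such as 128 characters.

```python
import numpy as np

# Strategy 1: scan all elements for the widest numpy unicode dtype.
def longest_str_dtype(data):
    return max((data[index].dtype for index in np.ndindex(data.shape)),
               key=lambda dt: dt.itemsize)

data = np.empty((3,), dtype=object)
data[0] = np.array(["a"])
data[1] = np.array(["abcdef"])
data[2] = np.array(["xy"])

print(longest_str_dtype(data))    # widest dtype found by scanning
# Strategy 2: fixed preallocation; numpy unicode uses 4 bytes per char.
print(np.dtype("U128").itemsize)  # 512 bytes per entry, scan-free
```

The scan is exact but touches every index (a problem for lazy arrays, as noted below); the fixed width is O(1) but wasteful and can still be exceeded.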
```python
    convert_to_unicode = True
else:
    convert_to_unicode = False
except Exception:
```
Why is there a try/except here? Maybe add a comment on why this can fail.
```python
for index in np.ndindex(data.shape):
    if data[index].dtype.itemsize // size_of_char > dtype.itemsize:
        dtype = data[index].dtype
```
How does this work with lazy arrays of strings? I don't think that this should work, and if it does, it won't be very efficient, as you have to look at every single index.
We don't have to support saving lazy arrays of strings, but we should at least document it... Part of me is tempted to just preallocate something like 128 characters, but as soon as I say that, something is going to come up where I want more than 128.
Yes, I will have a look at the feasibility of the alternative approach mentioned in #213 (comment), because if it is possible, this would be done in the `flatten`/`unflatten` functions and therefore be lazy friendly.
I think to do that you would basically have to store everything as one long string. For example:
`["a", "b", "cd", "e"]` --> `"abcde"`
and then also save an int array `[1, 1, 2, 1]`.
You could also use a delimiter:
`["a", "b", "cd", "e"]` --> `"a-b-cd-e"`
The flatten function:
```python
def flatten_strings(arr):
    string_lengths = np.empty(arr.shape, dtype=object)
    flat_strings = np.empty(arr.shape, dtype=object)
    for i in np.ndindex(arr.shape):
        # per-element string lengths, and all strings joined into one
        string_lengths[i] = np.char.str_len(arr[i])
        flat_strings[i] = "".join(arr[i])
    return string_lengths, flat_strings
```
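A hypothetical inverse of the flatten function sketched above, rebuilding the ragged array from the lengths array and the joined strings (names and shapes are assumptions for illustration, not the PR's API):

```python
import numpy as np

# Rebuild each ragged element by slicing the joined string at the
# offsets given by the cumulative sum of the stored lengths.
def unflatten_strings(string_lengths, flat_strings):
    arr = np.empty(flat_strings.shape, dtype=object)
    for i in np.ndindex(flat_strings.shape):
        offsets = np.concatenate(([0], np.cumsum(string_lengths[i])))
        arr[i] = np.array(
            [flat_strings[i][start:end]
             for start, end in zip(offsets[:-1], offsets[1:])]
        )
    return arr

lengths = np.empty((1,), dtype=object)
flat = np.empty((1,), dtype=object)
lengths[0] = np.array([1, 1, 2, 1])
flat[0] = "abcde"
print(unflatten_strings(lengths, flat)[0])  # ['a' 'b' 'cd' 'e']
```

Both directions only ever touch one ragged element at a time, which is what would make this approach lazy friendly.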
Note that 2-d ragged arrays of strings might be more of a headache so I'm okay with sticking to the 1d case.
Done in #217.
Fix for #212.
Progress of the PR:
- `upcoming_changes` folder entry (see `upcoming_changes/README.rst`),
- `docs/readthedocs.org:rosettasciio` build of this PR (link in github checks),
- Minimal example of the bug fix or the new feature.