Breakout special dtypes #1132
Conversation
Optimistically marking for 2.9 because I can and the tests are passing, but feel free to bump this to a later release if you like.
This looks nice and useful, esp. …
I think it's a neat symmetry to have the …
With the string_dtype function, the options could look like:
# Fixed length
string_dtype('utf-8', length=10)
# Variable length - different possible APIs:
string_dtype('utf-8', length=None)
string_dtype('utf-8', length=h5py.VLEN)
string_dtype('utf-8', length=any)
string_dtype('utf-8', vlen=True)

I like the options with a sentinel value for length.
I like …
I've bumped this to 2.10, because trying to land redesigned APIs just before a release sounds like a bad idea.
I work with questionnaire-based datasets, and the entire tooling revolves around utf-8 for obvious reasons. vlen/fixed strings are used as appropriate due to space constraints. I actually gave up on h5py due to this limitation more than 6 months ago and ended up having to write my own thin C wrapper around libhdf5. I've been revisiting h5py lately for other reasons, and I stumbled onto the same issue again.
Please be indulgent, I might need to better understand the issue. My understanding is that vlen utf-8 arrays are supported as special_dtype(vlen=unicode) under Python 2 and as special_dtype(vlen=str) under Python 3. Is fixed-length utf-8 the missing thing to be dealt with? If so, the string_dtype('utf-8', length=None) suggested above seems quite reasonable as a signature and, being a new thing, I would not expect it to break code.
Thanks. I do not know what you will decide, but the proposal above seems excellent: fixed length: string_dtype('utf-8', length=10); variable length: string_dtype('utf-8', length=None).
The above proposal also looks good to me.
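For concreteness, here is a small sketch of how the proposed spelling could sit next to the existing special_dtype call (illustrative only; the exact keyword handling is still being discussed above):

import h5py

# Existing spelling (still supported):
vlen_str = h5py.special_dtype(vlen=str)               # variable-length unicode

# Proposed spelling, with the encoding made explicit:
fixed_utf8 = h5py.string_dtype('utf-8', length=10)    # fixed-length UTF-8
vlen_utf8 = h5py.string_dtype('utf-8', length=None)   # variable-length UTF-8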
return dtype(dt, metadata={'enum': values_dict})

ref_dtype = dtype('O', metadata={'ref': Reference})
There is now one instance each of these two dtypes that is re-used, whereas previously we would call dtype each time. Do we care, and are dtype instances mutable?
I'm pretty sure dtype instances are indeed mutable, for better or worse.
That seems like a bit of a foot-cannon to hand to users...
Do you want these to be functions returning new dtype objects on each call?
I am not sure I have a good enough grasp of the trade-offs here.
A function makes it more like the others and gets around the mutability issue.
On the other hand, if dtypes are mutable then that also applies to the dtypes from numpy itself, which no one seems to worry about, so we should not either, and it is nice that these are constants.
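To make the two options being weighed here concrete, a small numpy-only sketch (the Reference stand-in class and the make_ref_dtype name are just for illustration):

import numpy as np

class Reference:
    pass  # stand-in for h5py's object reference type, for illustration only

# Option 1: a shared module-level constant (one dtype instance, re-used)
ref_dtype = np.dtype('O', metadata={'ref': Reference})

# Option 2: a factory function returning a fresh dtype on each call
def make_ref_dtype():
    return np.dtype('O', metadata={'ref': Reference})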
Maybe I was wrong here -- it seems that you cannot assign into the metadata dict after creating a dtype:
>>> d1 = np.dtype(object, metadata={'x': 1})
>>> d1.metadata['x'] = 2
TypeError: 'mappingproxy' object does not support item assignment
And attempting to modify dt.kind and dt.itemsize gives me AttributeError: readonly attribute. So maybe even if they're not totally immutable, they're hard enough to modify that we needn't worry about the potential footgun.
Do we want to deprecate special_dtype?
Looks nice to me as well, with a potential development of …
I've added the … However, that now breaks the symmetry between …
We definitely need some way to get full string dtype information (encoding and size) out from arbitrary h5py datasets.
This sounds pretty reasonable to me. My guess is that most of the time, users who are using h5py aren't bothering to check the returned dtype at all -- they just rely on reading the data into a NumPy array. But if they do care, then they likely care about both details. At least it's not obvious to me why …

If you wanted to make it slightly more readable/convenient, you could have … The other option is to split this into two functions, e.g. …
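One possible shape for the "more readable/convenient" variant would be a small named tuple bundling both pieces of information; the string_info name here is only an illustration, not necessarily what the PR does:

from collections import namedtuple

# Hypothetical bundled return value for check_string_dtype()
string_info = namedtuple('string_info', ['encoding', 'length'])

string_info(encoding='utf-8', length=10)    # fixed-length UTF-8
string_info(encoding='utf-8', length=None)  # variable-length UTF-8
# A non-string dtype would simply return None.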
The dtype of the dataset for a fixed-length string will be a normal numpy fixed-length bytes dtype (e.g. S10).

That leads to another question: do we want to publicly document the dtype metadata h5py uses (in particular, string encoding) as part of its API, or keep that as an implementation detail? And is …

Also, how strict do we want to be about encoding names? So far I've only allowed the exact strings 'ascii' and 'utf-8'.
True, but fixed-length UTF-8 (if/when we add it) won't have a NumPy equivalent.
I would slightly rather keep this an implementation detail. Long term, I'm optimistic that the NumPy dtype refactor (which is finally being started on) will let us write real NumPy dtypes that correspond to each type of h5py string.
Normalization seems easy enough to do, and definitely a user-friendly choice.
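A minimal sketch of one way the normalization could work, using the standard library to map spelling variants onto canonical names (the helper name is made up):

import codecs

def normalize_encoding(name):
    # codecs.lookup() maps variants like 'UTF8' or 'ASCII' to canonical names
    canonical = codecs.lookup(name).name
    if canonical not in ('utf-8', 'ascii'):
        raise ValueError('Unsupported encoding: %r' % name)
    return canonical

normalize_encoding('UTF8')   # -> 'utf-8'
normalize_encoding('ASCII')  # -> 'ascii'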
I'd suggest having the metadata explicitly documented as internal to h5py, as I'm not sure whether … Is there a plan for what …?
I agree on documenting that the metadata is not a stable API. I made …
Codecov Report
@@ Coverage Diff @@
## master #1132 +/- ##
=========================================
Coverage ? 83.73%
=========================================
Files ? 18
Lines ? 2146
Branches ? 0
=========================================
Hits ? 1797
Misses ? 349
Partials ? 0
I believe this is ready for further review. I'll summarise what the PR now does, because its scope has grown a bit since it started:
Naturally, the existing functions are still there. It's pretty painless to keep them working unless we have to change how the information is represented in dtypes, which I'm not proposing to do.
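For anyone skimming the thread, a short sketch of how the new helpers fit together (file and dataset names are made up; this assumes the functions are exposed at the top-level h5py namespace):

import h5py

with h5py.File('example.h5', 'w') as f:
    dt = h5py.string_dtype('utf-8')            # variable-length UTF-8 strings
    ds = f.create_dataset('names', shape=(3,), dtype=dt)
    ds[:] = ['alpha', 'beta', 'gamma']

    info = h5py.check_string_dtype(ds.dtype)   # string info for string dtypes, None otherwise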
@@ -746,15 +746,15 @@ class TestStrings(BaseDataset):

    def test_vlen_bytes(self):
What about some check_string_dtype() (and check_vlen_dtype()) here and in the following tests?
I've just added some in test_datatype - does that cover what you meant?
Not exactly (I thought about check_*_dtype() calls after create_dataset(), in these tests), but now it's definitely good enough to be merged.
Ah, you want to check that they roundtrip correctly through HDF5. That makes sense. I've added some tests.
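Something along these lines, for anyone curious what such a roundtrip check looks like (the test name and the pytest tmp_path fixture are illustrative, not the PR's actual test code):

import h5py

def test_vlen_bytes_roundtrip(tmp_path):
    # Create a dataset with a vlen bytes dtype and confirm the dtype read
    # back from the file still reports vlen bytes.
    with h5py.File(tmp_path / 'strings.h5', 'w') as f:
        dt = h5py.vlen_dtype(bytes)
        ds = f.create_dataset('x', shape=(2,), dtype=dt)
        assert h5py.check_vlen_dtype(ds.dtype) is bytes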
Looks really good to me! (save the tiny nitpicks wrt tests)
Looks good to me, too.
Looks good to me too! I told AppVeyor to re-run the failing tests to see if the failures were down to AppVeyor playing up. If they pass, does anyone have a problem with merging this?
Thanks everyone for the reviews :-)
I had hoped to get #1132 in for 2.9, but it actually landed in 2.10, and I missed this.
Closes #1118.
This adds three new pairs of functions for handling special dtypes:

- string_dtype and check_string_dtype
- vlen_dtype and check_vlen_dtype
- enum_dtype and check_enum_dtype

And two constants, ref_dtype and regionref_dtype, along with the function check_ref_dtype.

@aragilar I haven't yet changed the representation of string dtypes, as I think you were suggesting in #1118. I think it may make sense, especially if the changes requested in #379 go ahead, but there's more risk of breaking working code with that, so maybe it would be better to leave it for 3.0.
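As a quick illustration of the enum pair (the values are made up):

import h5py

dt = h5py.enum_dtype({'RED': 0, 'GREEN': 1, 'BLUE': 2}, basetype='i')
h5py.check_enum_dtype(dt)                   # -> {'RED': 0, 'GREEN': 1, 'BLUE': 2}
h5py.check_enum_dtype(h5py.string_dtype())  # -> None for non-enum dtypes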