
Breakout special dtypes #1132

Merged: 16 commits into h5py:master from takluyver:breakout-special-dtypes, Feb 7, 2019

Conversation

takluyver (Member)

Closes #1118.

This adds three new pairs of functions for handling special dtypes:

  • string_dtype and check_string_dtype
  • vlen_dtype and check_vlen_dtype
  • enum_dtype and check_enum_dtype

And two constants: ref_dtype and regionref_dtype, along with the function check_ref_dtype.
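
For illustration, here's roughly how the new pieces fit together (a sketch based on the descriptions above; the exact signatures are in the diff):

import h5py
import numpy as np

vlen_str = h5py.vlen_dtype(str)              # variable-length strings
vlen_ints = h5py.vlen_dtype(np.dtype('i4'))  # variable-length int arrays
colors = h5py.enum_dtype({'RED': 0, 'GREEN': 1}, basetype='i1')

h5py.check_vlen_dtype(vlen_str)        # -> str
h5py.check_enum_dtype(colors)          # -> {'RED': 0, 'GREEN': 1}
h5py.check_ref_dtype(h5py.ref_dtype)   # -> h5py.Reference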

@aragilar I haven't yet changed the representation of string dtypes, as I think you were suggesting in #1118. I think it may make sense, especially if the changes requested in #379 go ahead, but there's more risk of breaking working code with that, so maybe it would be better to leave it for 3.0.

@takluyver takluyver added this to the 2.9 milestone Nov 12, 2018
takluyver (Member Author)

Optimistically marking for 2.9 because I can and the tests are passing, but feel free to bump this to a later release if you like.

aparamon (Member)

This looks nice and useful, esp. ref_dtype and regionref_dtype!
I'm curious about the naming of check_*_dtype: these functions actually inspect, not just check, the dtype objects (and do so in different ways). Recognizing the historical context, is it intentional to keep them uniformly named given their different, non-compatible return values?

takluyver (Member Author)

I think it's a neat symmetry to have the x_dtype functions (or constants) to create dtypes, and check_x_dtype functions to see if you have such a dtype.

aragilar (Member)

With string_dtype/check_string_dtype I was thinking of using them for fixed-length strings (#719, #988 and similar), rather than variable-length strings. So to create a variable-length utf-8 string (matching that of special_dtype(vlen=unicode)), I guess h5py.vlen_dtype('utf-8') would be the recommended usage, with h5py.vlen_dtype(h5py.string_dtype(encoding='utf-8')) also working once we add the fixed utf-8 string support (which can be put off to a later PR).

takluyver (Member Author)

vlen_dtype('utf-8') doesn't feel quite right to me. What about something like this:

# Fixed length
string_dtype('utf-8', length=10)

# Variable length - different possible APIs:
string_dtype('utf-8', length=None)
string_dtype('utf-8', length=h5py.VLEN)
string_dtype('utf-8', length=any)
string_dtype('utf-8', vlen=True)

I like the options with a sentinel value for length=, because that eliminates the possibility of passing length=10, vlen=True. But on the other hand the built-in values aren't particularly good sentinels, and it's a bit awkward to have a new name in h5py just for that.

shoyer commented Nov 25, 2018

I like string_dtype('utf-8', length=None). length=None feels like a pretty reasonable sentinel value for "no fixed length".

@takluyver takluyver modified the milestones: 2.9, 2.10 Nov 29, 2018
takluyver (Member Author)

I've bumped this to 2.10, because trying to land redesigned APIs just before a release sounds like a bad idea.

wavexx (Contributor) commented Jan 8, 2019

I work with questionnaire-based datasets, and the entire tooling revolves around utf-8 for obvious reasons. vlen/fixed strings are used as appropriate due to space constraints. I actually gave up on h5py due to this limitation more than 6 months ago and ended up having to write my own thin C wrapper around libhdf5.

I've been revisiting h5py lately for other reasons, and I stumbled again onto the same issue.
I feel the handling of variable/fixed utf-8 strings should be fixed soon, breaking the API if necessary.

vasole (Contributor) commented Jan 8, 2019

Please bear with me - I might need to better understand the issue.

My understanding is that vlen, utf-8 arrays are supported as special_dtype(vlen=unicode) under python2 and as special_dtype(vlen=str) under python3.

Is fixed-length utf-8 the missing thing to be dealt with? If so, the string_dtype('utf-8', length=None) suggested above seems quite reasonable as a signature, and being a new thing, I would not expect it to break existing code.

wavexx (Contributor) commented Jan 8, 2019

Fixed-length UTF-8 is missing, both for reading and for writing.
See #973 and #988

vasole (Contributor) commented Jan 8, 2019

Thanks. I do not know what you will decide, but the proposal above seems excellent:

Fixed length:

string_dtype('utf-8', length=10)

Variable length:

string_dtype('utf-8', length=None)

wavexx (Contributor) commented Jan 8, 2019

The above proposal also looks good to me.


return dtype(dt, metadata={'enum': values_dict})

ref_dtype = dtype('O', metadata={'ref': Reference})
Member:

There is now one instance each of these two dtypes that is re-used, whereas previously we would call dtype each time. Do we care, and are dtype instances mutable?

Reply:

I'm pretty sure dtype instances are indeed mutable, for better or worse.

Member:

That seems like a bit of a foot-cannon to hand to users....

Member Author:

Do you want these to be functions returning new dtype objects on each call?

Member:

I am not sure I have a good enough grasp of the trade-offs here.

A function makes it more like the others and gets us out of the mutability issue.

On the other hand, if dtypes are mutable then the same applies to the dtypes from numpy itself, which no one seems to worry about, so perhaps we should not either; and it is nice that these are constants.

Reply:

Maybe I was wrong here -- it seems that you cannot assign into the metadata dict after creating a dtype:

>>> d1 = np.dtype(object, metadata={'x': 1})
>>> d1.metadata['x'] = 2
TypeError: 'mappingproxy' object does not support item assignment

Member Author:

And attempting to modify dt.kind and dt.itemsize gives me AttributeError: readonly attribute. So even if they're not totally immutable, they're hard enough to modify that we needn't worry about the potential footgun.
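
For reference, both checks are easy to reproduce with plain NumPy (a minimal sketch; the exact error messages vary between NumPy versions):

import numpy as np

dt = np.dtype('O', metadata={'ref': object})

try:
    dt.metadata['ref'] = None   # metadata is exposed as a read-only mappingproxy
except TypeError as e:
    print(e)

try:
    dt.itemsize = 16            # core dtype attributes are read-only too
except AttributeError as e:
    print(e)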

tacaswell (Member)

Do we want to deprecate special_dtype in the code or just leave it?

aparamon (Member) commented Jan 9, 2019

Fixed length
string_dtype('utf-8', length=10)
Variable length
string_dtype('utf-8', length=None)

Looks nice to me as well, with a potential future development of None becoming the default (since variable-length strings are arguably more common).

takluyver (Member Author)

I've added the length= parameter to string_dtype, as suggested.

However, that now breaks the symmetry between string_dtype and check_string_dtype - the latter only checks for a variable-length string dtype. Do we want to:

  • Make check_string_dtype return (encoding, length) for fixed- and variable-length strings, restoring the symmetry, but potentially making it less convenient to check for a variable length string (not sure how common this is?).
  • Rename it to check_vlen_string_dtype, making the broken symmetry more clear.
  • Drop check_string_dtype entirely - it's introduced by this pull request, so there are no compatibility requirements, and it's not used in h5py at present. I wrote it to fit the pattern of having a pair of functions to construct and check some kind of dtype.

shoyer commented Jan 9, 2019

We definitely need some way to get full string dtype information (encoding and size) out from arbitrary h5py datasets.

  • Make check_string_dtype return (encoding, length) for fixed- and variable-length strings, restoring the symmetry, but potentially making it less convenient to check for a variable length string (not sure how common this is?).

This sounds pretty reasonable to me. My guess is that most of the time, users of h5py aren't bothering to check the returned dtype at all -- they just rely on reading the data into a NumPy array. But if they do care, then they likely care about both details. At least it's not obvious to me why encoding is more important than length.

If you wanted to make it slightly more readable/convenient, you could have check_string_dtype return a namedtuple with fields encoding and length.

The other option is to split this into two functions, e.g., check_string_dtype_encoding() (defined on both fixed-size and vlen string dtypes) and check_dtype_size() (the latter could be defined for other types, too). But I think I like the single function returning a namedtuple or None a bit more.
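
To make the namedtuple idea concrete, a hypothetical sketch (the string_info name, the metadata keys, and the detection logic here are all illustrative assumptions, not the PR's actual code):

from collections import namedtuple

string_info = namedtuple('string_info', ['encoding', 'length'])

def check_string_dtype(dt):
    # Variable-length strings: the base type is stored in dtype metadata.
    vlen = dt.metadata.get('vlen') if dt.metadata else None
    if vlen is str:
        return string_info('utf-8', None)
    if vlen is bytes:
        return string_info('ascii', None)
    # Fixed-length strings: itemsize gives the length, metadata the encoding.
    if dt.kind == 'S' and dt.metadata and 'encoding' in dt.metadata:
        return string_info(dt.metadata['encoding'], dt.itemsize)
    return None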

takluyver (Member Author)

The dtype of the dataset for a fixed-length string will be a normal numpy fixed-length bytes dtype (e.g. dtype('|S10')), so there's some way to access this information even without h5py providing a function - you can check dataset.dtype.itemsize and dataset.dtype.metadata['h5py_encoding'].

That leads to another question: do we want to publicly document the dtype metadata h5py uses (in particular, string encoding) as part of its API, or keep that as an implementation detail? And is 'h5py_encoding' a good name? The other fields h5py is using (vlen, enum, ref) don't have the h5py prefix, but I wondered if using such generic terms could lead to clashes with other software?

Also, how strict do we want to be about encoding names? So far I've only allowed the exact strings {'utf-8', 'ascii'}, but Python's own codec lookup (when you do 'hello'.encode(x)) is case insensitive, normalises hyphens and underscores, and has aliases. We could normalise names through Python before checking them.
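
For reference, normalising through Python's codec machinery would be a one-liner on top of the standard library (just a sketch of the idea; the restriction to HDF5's two encodings is my assumption):

import codecs

def normalise_encoding(name):
    # codecs.lookup handles case, hyphens/underscores and aliases,
    # e.g. 'UTF8' and 'utf_8' both map to 'utf-8'.
    canonical = codecs.lookup(name).name
    if canonical not in ('utf-8', 'ascii'):
        raise ValueError('HDF5 strings are ascii or utf-8, got %r' % name)
    return canonical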

shoyer commented Jan 9, 2019

The dtype of the dataset for a fixed-length string will be a normal numpy fixed-length bytes dtype (e.g. dtype('|S10')), so there's some way to access this information even without h5py providing a function - you can check dataset.dtype.itemsize and dataset.dtype.metadata['h5py_encoding'].

True, but fixed-length UTF-8 (if/when we add it) won't have a NumPy equivalent.

That leads to another question: do we want to publicly document the dtype metadata h5py uses (in particular, string encoding) as part of its API, or keep that as an implementation detail?

I would slightly rather keep this an implementation detail. Long term, I'm optimistic that the NumPy dtype refactor (which is finally being started on) will let us write real NumPy dtypes that correspond to each type of h5py string.

Also, how strict do we want to be about encoding names? So far I've only allowed the exact strings {'utf-8', 'ascii'}, but Python's own codec lookup (when you do 'hello'.encode(x)) is case insensitive, normalises hyphens and underscores, and has aliases. We could normalise names through Python before checking them.

Normalization seems easy enough to do, and definitely a user-friendly choice.

aragilar (Member)

I'd suggest having the metadata explicitly documented as internal to h5py, as I'm not sure dtype.metadata is explicitly documented anywhere (it's definitely not widely advertised as a feature). It also means that if we do need to make backwards-incompatible changes to the metadata (as may happen if numpy changes the structure of dtypes), we've explicitly called out that nobody should rely on a specific internal format.

Is there a plan for what check_vlen_dtype will do for strings? I would suggest returning the same output as check_string_dtype, unless people have other suggestions?

takluyver (Member Author)

I agree on documenting that the metadata is not a stable API.

I made check_vlen_dtype(x) equivalent to check_dtype(vlen=x), i.e. returning str or bytes for a vlen string. It's a bit ugly, but it's easy to adapt existing code to it. If we wanted to make it more consistent, it could return a numpy dtype like dtype('S1') or dtype('U1'), representing the unit which can be repeated.
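
Concretely, the behaviour described is (a quick sketch of the equivalence, not new API):

import h5py
import numpy as np

h5py.check_vlen_dtype(h5py.vlen_dtype(str))             # -> str
h5py.check_vlen_dtype(h5py.vlen_dtype(np.dtype('i4')))  # -> dtype('int32')
h5py.check_vlen_dtype(np.dtype('f8'))                   # -> None (not vlen)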

codecov bot commented Jan 10, 2019

Codecov Report

No coverage uploaded for pull request base (master@6217b86). The diff coverage is 83.33%.

@@            Coverage Diff            @@
##             master    #1132   +/-   ##
=========================================
  Coverage          ?   83.73%
=========================================
  Files             ?       18
  Lines             ?     2146
  Branches          ?        0
=========================================
  Hits              ?     1797
  Misses            ?      349
  Partials          ?        0

Impacted Files        Coverage Δ
h5py/_hl/dataset.py   84.19% <ø> (ø)
h5py/__init__.py      59.64% <100%> (ø)
h5py/_hl/attrs.py     86.17% <100%> (ø)
h5py/_hl/base.py      90.35% <75%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6217b86...a343ede.

takluyver (Member Author)

I believe this is ready for further review. I'll summarise what the PR now does, because its scope has grown a bit since it started:

  1. string_dtype() provides a unified way to prepare an HDF5 compatible dtype for strings - fixed or variable length, with ASCII or UTF-8 encoding. This does not change what kind of Python objects can be stored, but it does give us the first high-level API to declare that fixed-length strings should be interpreted as UTF-8.
  2. The functionality of special_dtype() is broken out into vlen_dtype(), enum_dtype(), ref_dtype and regionref_dtype, plus string_dtype() as already described. The ref/regionref ones are constants rather than functions, since they take no parameters; investigation in this PR suggested that dtype objects are immutable enough that this is probably workable, but it's easy to make them constructor functions if preferred.
  3. The corresponding functionality of check_dtype() is split into check_string_dtype(), check_vlen_dtype(), check_enum_dtype() and check_ref_dtype(). These are all used in a similar way: they return the construction information if the dtype is of the kind that function checks, and None otherwise.

Naturally, the existing functions are still there. It's pretty painless to keep them working unless we have to change how the information is represented in dtypes, which I'm not proposing to do.
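
Putting it together, a sketch of the intended workflow (the filename and data are illustrative, and I'm assuming the namedtuple-style return for check_string_dtype discussed above):

import h5py

with h5py.File('example.h5', 'w') as f:
    # Variable-length UTF-8 strings
    ds = f.create_dataset('names', shape=(2,), dtype=h5py.string_dtype('utf-8'))
    ds[:] = ['alpha', 'beta']

    info = h5py.check_string_dtype(ds.dtype)
    print(info.encoding, info.length)   # utf-8 None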

@@ -746,15 +746,15 @@ class TestStrings(BaseDataset):

def test_vlen_bytes(self):
Member:

What about adding some check_string_dtype() (and check_vlen_dtype()) calls here and in the following tests?

Member Author:

I've just added some in test_datatype - does that cover what you meant?

Member:

Not exactly (I was thinking of check_*_dtype() calls after create_dataset() in these tests), but now it's definitely good enough to be merged.

Member Author:

Ah, you want to check that they roundtrip correctly through HDF5. That makes sense. I've added some tests.

aparamon (Member)

Looks really good to me! (save the tiny nitpicks wrt tests)

shoyer commented Feb 5, 2019

Looks good to me, too.

@takluyver takluyver mentioned this pull request Feb 5, 2019
aragilar (Member) commented Feb 7, 2019

Looks good to me too! I told AppVeyor to re-run the failing tests to see if the failures were down to AppVeyor playing up. If they pass, does anyone have a problem with merging this?

@aparamon aparamon merged commit ab949c9 into h5py:master Feb 7, 2019
@takluyver takluyver deleted the breakout-special-dtypes branch February 7, 2019 11:35
takluyver (Member Author)

Thanks everyone for the reviews :-)
