
Breakout special dtypes #1132

Merged: 16 commits into h5py:master from takluyver:breakout-special-dtypes, Feb 7, 2019

Conversation

takluyver (Member)

Closes #1118.

This adds three new pairs of functions for handling special dtypes:

  • string_dtype and check_string_dtype
  • vlen_dtype and check_vlen_dtype
  • enum_dtype and check_enum_dtype

And two constants: ref_dtype and regionref_dtype, along with the function check_ref_dtype.
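
For illustration, here's roughly how the new pieces fit together (a sketch based on the descriptions above; the exact signatures are in the diff):

import h5py
import numpy as np

vlen_str = h5py.vlen_dtype(str)              # variable-length strings
vlen_ints = h5py.vlen_dtype(np.dtype('i4'))  # variable-length int arrays
colors = h5py.enum_dtype({'RED': 0, 'GREEN': 1}, basetype='i1')

h5py.check_vlen_dtype(vlen_str)        # -> str
h5py.check_enum_dtype(colors)          # -> {'RED': 0, 'GREEN': 1}
h5py.check_ref_dtype(h5py.ref_dtype)   # -> h5py.Reference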

@aragilar I haven't yet changed the representation of string dtypes, as I think you were suggesting in #1118. I think it may make sense, especially if the changes requested in #379 go ahead, but there's more risk of breaking working code with that, so maybe it would be better to leave it for 3.0.

@takluyver takluyver added this to the 2.9 milestone Nov 12, 2018
takluyver (Member Author)

Optimistically marking for 2.9 because I can and the tests are passing, but feel free to bump this to a later release if you like.

aparamon (Member)

This looks nice and useful, esp. ref_dtype and regionref_dtype!
I'm curious about the naming of check_*_dtype: these functions actually inspect, not just check, the dtype objects (and do so in different ways). Recognizing the historical context, is it intentional to keep them uniformly named given their different, non-compatible return values?

takluyver (Member Author)

I think it's a neat symmetry to have the x_dtype functions (or constants) to create dtypes, and check_x_dtype functions to see if you have such a dtype.

aragilar (Member)

With string_dtype/check_string_dtype I was thinking of using them for fixed-length strings (#719, #988 and similar), rather than variable-length strings. So to create a variable-length utf-8 string (matching that of special_dtype(vlen=unicode)), I guess h5py.vlen_dtype('utf-8') would be the recommended usage, with h5py.vlen_dtype(h5py.string_dtype(encoding='utf-8')) also working once we add the fixed utf-8 string support (which can be put off to a later PR).

takluyver (Member Author)

vlen_dtype('utf-8') doesn't feel quite right to me. What about something like this:

# Fixed length
string_dtype('utf-8', length=10)

# Variable length - different possible APIs:
string_dtype('utf-8', length=None)
string_dtype('utf-8', length=h5py.VLEN)
string_dtype('utf-8', length=any)
string_dtype('utf-8', vlen=True)

I like the options with a sentinel value for length=, because that eliminates the possibility of passing length=10, vlen=True. But on the other hand the built-in values aren't particularly good sentinels, and it's a bit awkward to have a new name in h5py just for that.

shoyer commented Nov 25, 2018

I like string_dtype('utf-8', length=None). length=None feels like a pretty reasonable sentinel value for "no fixed length".

@takluyver takluyver modified the milestones: 2.9, 2.10 Nov 29, 2018
takluyver (Member Author)

I've bumped this to 2.10, because trying to land redesigned APIs just before a release sounds like a bad idea.

wavexx (Contributor) commented Jan 8, 2019

I work with questionnaire-based datasets, and the entire tooling revolves around utf-8 for obvious reasons. vlen/fixed strings are used as appropriate due to space constraints. I actually gave up on h5py due to this limitation more than 6 months ago and ended up having to write my own thin C wrapper around libhdf5.

I've been revisiting h5py lately for other reasons, and I stumbled again onto the same issue.
I feel the handling of variable/fixed utf-8 strings should be fixed soon, breaking the API if necessary.

vasole (Contributor) commented Jan 8, 2019

Please bear with me - I might need to better understand the issue.

My understanding is that vlen, utf-8 arrays are supported as special_dtype(vlen=unicode) under python2 and as special_dtype(vlen=str) under python3.

Is fixed-length utf-8 the missing thing to be dealt with? If so, the string_dtype('utf-8', length=None) suggested above seems quite reasonable as a signature, and being a new thing, I would not expect it to break existing code.

wavexx (Contributor) commented Jan 8, 2019

Fixed-length UTF-8 is missing, both for reading and for writing.
See #973 and #988

vasole (Contributor) commented Jan 8, 2019

Thanks. I do not know what you will decide, but the proposal above seems excellent:

Fixed length:

string_dtype('utf-8', length=10)

Variable length:

string_dtype('utf-8', length=None)

wavexx (Contributor) commented Jan 8, 2019

The above proposal also looks good to me.


return dtype(dt, metadata={'enum': values_dict})

ref_dtype = dtype('O', metadata={'ref': Reference})
Member:

There is now one instance each of these two dtypes that is re-used, whereas previously we would call dtype each time. Do we care, and are dtype instances mutable?

Reply:

I'm pretty sure dtype instances are indeed mutable, for better or worse.

Member:

That seems like a bit of a foot-cannon to hand to users....

Member Author:

Do you want these to be functions returning new dtype objects on each call?

Member:

I am not sure I have a good enough grasp of the trade-offs here.

A function makes it more like the others and gets us out of the mutability issue.

On the other hand, if dtypes are mutable then the same applies to the dtypes from numpy itself, which no one seems to worry about, so perhaps we should not either; and it is nice that these are constants.

Reply:

Maybe I was wrong here -- it seems that you cannot assign into the metadata dict after creating a dtype:

>>> d1 = np.dtype(object, metadata={'x': 1})
>>> d1.metadata['x'] = 2
TypeError: 'mappingproxy' object does not support item assignment

Member Author:

And attempting to modify dt.kind and dt.itemsize gives me AttributeError: readonly attribute. So even if they're not totally immutable, they're hard enough to modify that we needn't worry about the potential footgun.
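
For reference, both checks are easy to reproduce with plain NumPy (a minimal sketch; the exact error messages vary between NumPy versions):

import numpy as np

dt = np.dtype('O', metadata={'ref': object})

try:
    dt.metadata['ref'] = None   # metadata is exposed as a read-only mappingproxy
except TypeError as e:
    print(e)

try:
    dt.itemsize = 16            # core dtype attributes are read-only too
except AttributeError as e:
    print(e)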

tacaswell (Member)

Do we want to deprecate special_dtype in the code or just leave it?

aparamon (Member) commented Jan 9, 2019

Fixed length
string_dtype('utf-8', length=10)
Variable length
string_dtype('utf-8', length=None)

Looks nice to me as well, with a potential future development of None becoming the default (since variable-length strings are arguably more common).

takluyver (Member Author)

I've added the length= parameter to string_dtype, as suggested.

However, that now breaks the symmetry between string_dtype and check_string_dtype - the latter only checks for a variable-length string dtype. Do we want to:

  • Make check_string_dtype return (encoding, length) for fixed- and variable-length strings, restoring the symmetry, but potentially making it less convenient to check for a variable length string (not sure how common this is?).
  • Rename it to check_vlen_string_dtype, making the broken symmetry more clear.
  • Drop check_string_dtype entirely - it's introduced by this pull request, so there are no compatibility requirements, and it's not used in h5py at present. I wrote it to fit the pattern of having a pair of functions to construct and check some kind of dtype.

shoyer commented Jan 9, 2019

We definitely need some way to get full string dtype information (encoding and size) out from arbitrary h5py datasets.

  • Make check_string_dtype return (encoding, length) for fixed- and variable-length strings, restoring the symmetry, but potentially making it less convenient to check for a variable length string (not sure how common this is?).

This sounds pretty reasonable to me. My guess is that most of the time, users of h5py aren't bothering to check the returned dtype at all -- they just rely on reading the data into a NumPy array. But if they do care, then they likely care about both details. At least it's not obvious to me why encoding is more important than length.

If you wanted to make it slightly more readable/convenient, you could have check_string_dtype return a namedtuple with fields encoding and length.

The other option is to split this into two functions, e.g., check_string_dtype_encoding() (defined on both fixed-size and vlen string dtypes) and check_dtype_size() (the latter could be defined for other types, too). But I think I like the single function returning a namedtuple or None a bit more.
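
To make the namedtuple idea concrete, a hypothetical sketch (the string_info name, the metadata keys, and the detection logic here are all illustrative assumptions, not the PR's actual code):

from collections import namedtuple

string_info = namedtuple('string_info', ['encoding', 'length'])

def check_string_dtype(dt):
    # Variable-length strings: the base type is stored in dtype metadata.
    vlen = dt.metadata.get('vlen') if dt.metadata else None
    if vlen is str:
        return string_info('utf-8', None)
    if vlen is bytes:
        return string_info('ascii', None)
    # Fixed-length strings: itemsize gives the length, metadata the encoding.
    if dt.kind == 'S' and dt.metadata and 'encoding' in dt.metadata:
        return string_info(dt.metadata['encoding'], dt.itemsize)
    return None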

takluyver (Member Author)

The dtype of the dataset for a fixed-length string will be a normal numpy fixed-length bytes dtype (e.g. dtype('|S10')), so there's some way to access this information even without h5py providing a function - you can check dataset.dtype.itemsize and dataset.dtype.metadata['h5py_encoding'].

That leads to another question: do we want to publicly document the dtype metadata h5py uses (in particular, string encoding) as part of its API, or keep that as an implementation detail? And is 'h5py_encoding' a good name? The other fields h5py is using (vlen, enum, ref) don't have the h5py prefix, but I wondered if using such generic terms could lead to clashes with other software?

Also, how strict do we want to be about encoding names? So far I've only allowed the exact strings {'utf-8', 'ascii'}, but Python's own codec lookup (when you do 'hello'.encode(x)) is case insensitive, normalises hyphens and underscores, and has aliases. We could normalise names through Python before checking them.
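
For reference, normalising through Python's codec machinery would be a one-liner on top of the standard library (just a sketch of the idea; the restriction to HDF5's two encodings is my assumption):

import codecs

def normalise_encoding(name):
    # codecs.lookup handles case, hyphens/underscores and aliases,
    # e.g. 'UTF8' and 'utf_8' both map to 'utf-8'.
    canonical = codecs.lookup(name).name
    if canonical not in ('utf-8', 'ascii'):
        raise ValueError('HDF5 strings are ascii or utf-8, got %r' % name)
    return canonical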

shoyer commented Jan 9, 2019

The dtype of the dataset for a fixed-length string will be a normal numpy fixed-length bytes dtype (e.g. dtype('|S10')), so there's some way to access this information even without h5py providing a function - you can check dataset.dtype.itemsize and dataset.dtype.metadata['h5py_encoding'].

True, but fixed-length UTF-8 (if/when we add it) won't have a NumPy equivalent.

That leads to another question: do we want to publicly document the dtype metadata h5py uses (in particular, string encoding) as part of its API, or keep that as an implementation detail?

I would slightly rather keep this an implementation detail. Long term, I'm optimistic that the NumPy dtype refactor (which is finally being started on) will let us write real NumPy dtypes that correspond to each type of h5py string.

Also, how strict do we want to be about encoding names? So far I've only allowed the exact strings {'utf-8', 'ascii'}, but Python's own codec lookup (when you do 'hello'.encode(x)) is case insensitive, normalises hyphens and underscores, and has aliases. We could normalise names through Python before checking them.

Normalization seems easy enough to do, and definitely a user-friendly choice.

aragilar (Member)

I'd suggest having the metadata explicitly documented as internal to h5py, as I'm not sure dtype.metadata is explicitly documented anywhere (it's definitely not widely advertised as a feature). It also means that if we do need to make backwards-incompatible changes to the metadata (as may happen if numpy changes the structure of dtypes), we've explicitly called out that nobody should rely on a specific internal format.

Is there a plan for what check_vlen_dtype will do for strings? I would suggest returning the same output as check_string_dtype, unless people have other suggestions?

takluyver (Member Author)

I agree on documenting that the metadata is not a stable API.

I made check_vlen_dtype(x) equivalent to check_dtype(vlen=x), i.e. returning str or bytes for a vlen string. It's a bit ugly, but it's easy to adapt existing code to it. If we wanted to make it more consistent, it could return a numpy dtype like dtype('S1') or dtype('U1'), representing the unit which can be repeated.
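
Concretely, the behaviour described is (a quick sketch of the equivalence, not new API):

import h5py
import numpy as np

h5py.check_vlen_dtype(h5py.vlen_dtype(str))             # -> str
h5py.check_vlen_dtype(h5py.vlen_dtype(np.dtype('i4')))  # -> dtype('int32')
h5py.check_vlen_dtype(np.dtype('f8'))                   # -> None (not vlen)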

codecov bot commented Jan 10, 2019

Codecov Report

No coverage uploaded for pull request base (master@6217b86). The diff coverage is 83.33%.

@@            Coverage Diff            @@
##             master    #1132   +/-   ##
=========================================
  Coverage          ?   83.73%
=========================================
  Files             ?       18
  Lines             ?     2146
  Branches          ?        0
=========================================
  Hits              ?     1797
  Misses            ?      349
  Partials          ?        0

Impacted Files        Coverage Δ
h5py/_hl/dataset.py   84.19% <ø> (ø)
h5py/__init__.py      59.64% <100%> (ø)
h5py/_hl/attrs.py     86.17% <100%> (ø)
h5py/_hl/base.py      90.35% <75%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6217b86...a343ede.

takluyver (Member Author)

I believe this is ready for further review. I'll summarise what the PR now does, because its scope has grown a bit since it started:

  1. string_dtype() provides a unified way to prepare an HDF5 compatible dtype for strings - fixed or variable length, with ASCII or UTF-8 encoding. This does not change what kind of Python objects can be stored, but it does give us the first high-level API to declare that fixed-length strings should be interpreted as UTF-8.
  2. The functionality of special_dtype() is broken out into vlen_dtype(), enum_dtype(), ref_dtype and regionref_dtype, plus string_dtype() as already described. The ref/regionref ones are constants rather than functions, since they take no parameters; investigation in this PR suggested that dtype objects are immutable enough that this is probably workable, but it's easy to make them constructor functions if preferred.
  3. The corresponding functionality of check_dtype() is split into check_string_dtype(), check_vlen_dtype(), check_enum_dtype() and check_ref_dtype(). These are all used in a similar way: they return the construction information if the dtype is of the kind that function checks, and None otherwise.

Naturally, the existing functions are still there. It's pretty painless to keep them working unless we have to change how the information is represented in dtypes, which I'm not proposing to do.
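
Putting it together, a sketch of the intended workflow (the filename and data are illustrative, and I'm assuming the namedtuple-style return for check_string_dtype discussed above):

import h5py

with h5py.File('example.h5', 'w') as f:
    # Variable-length UTF-8 strings
    ds = f.create_dataset('names', shape=(2,), dtype=h5py.string_dtype('utf-8'))
    ds[:] = ['alpha', 'beta']

    info = h5py.check_string_dtype(ds.dtype)
    print(info.encoding, info.length)   # utf-8 None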

@@ -746,15 +746,15 @@ class TestStrings(BaseDataset):

def test_vlen_bytes(self):
Member:

What about adding some check_string_dtype() (and check_vlen_dtype()) calls here and in the following tests?

Member Author:

I've just added some in test_datatype - does that cover what you meant?

Member:

Not exactly (I was thinking of check_*_dtype() calls after create_dataset() in these tests), but now it's definitely good enough to be merged.

Member Author:

Ah, you want to check that they roundtrip correctly through HDF5. That makes sense. I've added some tests.

aparamon (Member)

Looks really good to me! (save the tiny nitpicks wrt tests)

shoyer commented Feb 5, 2019

Looks good to me, too.

@takluyver takluyver mentioned this pull request Feb 5, 2019
aragilar (Member) commented Feb 7, 2019

Looks good to me too! I told AppVeyor to re-run the failing tests to see if the failures were down to AppVeyor playing up. If they pass, does anyone have a problem with merging this?

@aparamon aparamon merged commit ab949c9 into h5py:master Feb 7, 2019
@takluyver takluyver deleted the breakout-special-dtypes branch February 7, 2019 11:35
takluyver (Member Author)

Thanks everyone for the reviews :-)
