Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

string handling changed in h5py >= 3.0 #81

Closed
kmuehlbauer opened this issue Nov 10, 2020 · 8 comments · Fixed by #82
Closed

string handling changed in h5py >= 3.0 #81

kmuehlbauer opened this issue Nov 10, 2020 · 8 comments · Fixed by #82

Comments

@kmuehlbauer
Copy link
Collaborator

In h5py storing of strings has changed (see https://docs.h5py.org/en/latest/whatsnew/3.0.html and https://docs.h5py.org/en/latest/strings.html).

This currently breaks the testsuite and also downstream packages using h5py>=3.0 (pydata/xarray#4570).

Below is the ncdump of the three write methods used in the testsuit (using h5py=3.1), where the output of netcdf4 differs from the output of the two h5netcdf flavours.


write module netcdf4:
write_module <module 'netCDF4' from '/home/kai/miniconda3/envs/h5py3/lib/python3.9/site-packages/netCDF4/__init__.py'>
netcdf testfile {
dimensions:
	x = 4 ;
	y = 5 ;
	z = 6 ;
	empty = UNLIMITED ; // (0 currently)
	string3 = 3 ;
	mismatched_dim = 1 ;
variables:
	double foo(x, y) ;
		foo:units = "meters" ;
	int64 y(y) ;
		y:_FillValue = -1LL ;
	char z(z, string3) ;
		z:_FillValue = "X" ;
	float scalar ;
	int64 intscalar ;
	int64 mismatched_dim ;
	string var_len_str(x) ;

// global attributes:
		:global = 42LL ;
		:other_attr = "yes" ;
data:

 foo =
  1, 1, 1, 1, 1,
  1, 1, 1, 1, 1,
  1, 1, 1, 1, 1,
  1, 1, 1, 1, 1 ;

 y = 0, 1, 2, 3, _ ;

 z =
  "a",
  "b",
  "c",
  "foo",
  "bar",
  "baz" ;

 scalar = 2 ;

 intscalar = 2 ;

 mismatched_dim = _ ;

 var_len_str = "foo", _, _, _ ;

group: subgroup {
  dimensions:
  	y = 10 ;
  variables:
  	int subvar(x) ;
  	double y_var(y) ;
  data:

   subvar = 0, 1, 2, 3 ;

   y_var = _, _, _, _, _, _, _, _, _, _ ;
  } // group subgroup
}

write Module h5netcdf.legacy API:
write_module <module 'h5netcdf.legacyapi' from '/home/data/kai/projects/h5netcdf/h5netcdf/legacyapi.py'>
netcdf testfile {
dimensions:
	x = 4 ;
	y = 5 ;
	z = 6 ;
	empty = UNLIMITED ; // (0 currently)
	string3 = 3 ;
	mismatched_dim = 1 ;
variables:
	int64 mismatched_dim ;
	double foo(x, y) ;
		string foo:units = "meters" ;
	int64 intscalar ;
	float scalar ;
	string var_len_str(x) ;
	int64 y(y) ;
		y:_FillValue = -1LL ;
	char z(z, string3) ;
		z:_FillValue = "X" ;

// global attributes:
		:global = 42LL ;
		string :other_attr = "yes" ;
data:

 mismatched_dim = 0 ;

 foo =
  1, 1, 1, 1, 1,
  1, 1, 1, 1, 1,
  1, 1, 1, 1, 1,
  1, 1, 1, 1, 1 ;

 intscalar = 2 ;

 scalar = 2 ;

 var_len_str = "foo", NIL, NIL, NIL ;

 y = 0, 1, 2, 3, _ ;

 z =
  "a",
  "b",
  "c",
  "foo",
  "bar",
  "baz" ;

group: subgroup {
  dimensions:
  	y = 10 ;
  variables:
  	int subvar(x) ;
  	double y_var(y) ;
  data:

   subvar = 0, 1, 2, 3 ;

   y_var = 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ;
  } // group subgroup
}

write Module h5netcdf API:
write_module: <module 'h5netcdf' from '/home/data/kai/projects/h5netcdf/h5netcdf/__init__.py'>
netcdf testfile {
dimensions:
	x = 4 ;
	y = 5 ;
	z = 6 ;
	empty = UNLIMITED ; // (0 currently)
	string3 = 3 ;
	mismatched_dim = 1 ;
variables:
	int64 mismatched_dim ;
	double foo(x, y) ;
		string foo:units = "meters" ;
	int64 intscalar ;
	float scalar ;
	string var_len_str(x) ;
	int64 y(y) ;
		y:_FillValue = -1LL ;
	char z(z, string3) ;
		z:_FillValue = "X" ;

// global attributes:
		:global = 42LL ;
		string :other_attr = "yes" ;
data:

 mismatched_dim = 0 ;

 foo =
  1, 1, 1, 1, 1,
  1, 1, 1, 1, 1,
  1, 1, 1, 1, 1,
  1, 1, 1, 1, 1 ;

 intscalar = 2 ;

 scalar = 2 ;

 var_len_str = "foo", NIL, NIL, NIL ;

 y = 0, 1, 2, 3, _ ;

 z =
  "a",
  "b",
  "c",
  "foo",
  "bar",
  "baz" ;

group: subgroup {
  dimensions:
  	y = 10 ;
  variables:
  	int subvar(x) ;
  	double y_var(y) ;
  data:

   subvar = 0, 1, 2, 3 ;

   y_var = 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ;
  } // group subgroup
}

A quick check using h5py=2.10.0 leads to the same results. That means:

  • There is a problem with writing which lead to different files. This does not break the testsuite using h5py<3.0.
  • There is a problem with reading, which breaks the testsuite using h5py>=3.0.

One more note: Another indicator that only reading is the problem, is that the testsuite doesn't break for the tests where netcdf4 is used for reading back files.

@kmuehlbauer
Copy link
Collaborator Author

Quoting from the h5py 3.1 documentation:

String data in HDF5 datasets is read as bytes by default: bytes objects for variable-length strings, or numpy bytes arrays ('S' dtypes) for fixed-length strings. Use Dataset.asstr() to retrieve str objects.

Variable-length strings in attributes are read as str objects. These are decoded as UTF-8 with surrogate escaping for unrecognised bytes.

AFAICS the problem is with the variable-length strings which are returned as bytes objects.

@kmuehlbauer
Copy link
Collaborator Author

The major pain is now, that netCDF4 returns decoded strings and h5py>=3.0 returns bytes objects.

ping @shoyer I would try to come up with a PR to fix this, but wanted to hear your thoughts about possible handling. As compatibility with netCDF4 is crucial there would be need to decode the bytes object. There is Dataset.asstr() which can be utilized.

@kmuehlbauer
Copy link
Collaborator Author

See #82 for possible fix.

@shoyer
Copy link
Collaborator

shoyer commented Nov 13, 2020

I think using asstr() is probably the right fix here, though I am a little hesitant to make h5netcdf inconsistent with h5py. Maybe it would make sense only to change the "netCDF4" compatible interface? I'm honestly not entirely sure

@kmuehlbauer
Copy link
Collaborator Author

@shoyer I must admit that I've never used the legacyapi. From my understanding this problem has three contradicting points:

  1. backwards compatibility
  2. compatibility with netCDF4
  3. consistency with h5py

One solution would be, if I understand your suggestion correctly, to apply this only to the legacyapiand leave the modern API to use h5py verbatim. But this would break every downstream package which uses modern API and h5py>=3 (at least they would have to do the string decoding).

We could add some keyword argument for the user to select the behaviour (eg. decode_strings). We could start defaulting to decode_strings=True and add a DeprecationWarning, that this will be changed to decode_strings=False in future releases (to be in line with h5py 3). For the legacyapi we would use decode_strings=True as default.

I very much appreciate any further suggestions and ideas.

@drew-parsons
Copy link
Contributor

Your idea of adding a decode_strings with a DeprecationWarning sounds reasonable to me. It supports existing users of h5netcdf and old h5py while providing a clear pathway for upgrading to h5py 3. The alternative is to just drop backwards compatibility, which is essentially what h5py have done. Going through a DeprecationWarning first makes it a softer transition.

@kmuehlbauer
Copy link
Collaborator Author

@drew-parsons Thanks for the feedback. I'll try to integrate this into #82 when I find time.

@shoyer
Copy link
Collaborator

shoyer commented Jan 11, 2021

I like the new keyword argument + deprecation cycle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants