Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode decode error when reading tags from HDF5 #1663

Closed
glostis opened this issue Mar 26, 2019 · 8 comments

Comments

Projects
None yet
2 participants
@glostis
Copy link

commented Mar 26, 2019

First of all, thanks for this package that makes using GDAL enjoyable in Python!

I have encountered an unexpected behavior when reading tags from a seemingly valid HDF5 file.

The file contains an HDF5 attribute with value 135, encoded as a uint8 (in an HDF5 file, each attribute has its own datatype according to the official specification). When reading the file's tags with rasterio, a UnicodeDecodeError is raised because the attribute's binary value is being decoded as a utf-8 string (in the same way that the python expression bytes([135]).decode("utf-8") raises a UnicodeDecodeError).

Expected behavior and actual behavior.

  • Expected behavior: When reading an HDF5 file with an attribute with value 135, I expect rasterio to decode the tag using its datatype instead of decoding it as a utf-8 string.

  • Actual behavior: f.tags() on the HDF5 file in rasterio currently raises an error, preventing me from accessing the image's tags.

Steps to reproduce the problem.

I have made a small python script to reproduce my problem. In this script, I create several HDF5 files using the library h5py. These files contain an empty dataset, and one attribute with value 125 or 135, encoded with several different datatypes. I then parse this attribute's value in 3 different ways, using h5py, gdalinfo/grep and rasterio.

Here is the script:

import subprocess
import warnings

import h5py
import rasterio
from rasterio.errors import NotGeoreferencedWarning

print("Rasterio version", rasterio.__version__)
print("GDAL version", rasterio.__gdal_version__)
print()

warnings.filterwarnings("ignore", category=NotGeoreferencedWarning)

string = "{: <8} {: <8} {: <8} {: <8} {: <8}"
print(string.format("dtype", "number", "h5py", "rio", "gdal"))
print("=" * (8 * 5 + 5))

attr_name = "test_attr"

for dtype in [
    "int8",
    "uint8",
    "int16",
    "uint16",
    "float16",
    "int32",
    "uint32",
    "float32",
    "float64",
]:
    for num in [125, 135]:
        filename = "{}_{}.h5".format(dtype, num)
        with h5py.File(filename, "w") as f:
            f.create_dataset("dataset", shape=(100, 100), dtype="float32")
            f.attrs.create(attr_name, num, dtype=dtype)
            attr = f.attrs.get(attr_name)

        tmp = subprocess.Popen(["gdalinfo", filename], stdout=subprocess.PIPE)
        gdal_out = subprocess.check_output(["grep", attr_name], stdin=tmp.stdout)
        try:
            gdal_meta = gdal_out.split()[0].split(b"=")[-1].decode("utf-8")
        except UnicodeDecodeError:
            gdal_meta = "ERROR"

        with rasterio.open(filename) as f:
            try:
                tags = f.tags()
                print(string.format(dtype, num, attr, tags[attr_name], gdal_meta))
            except UnicodeDecodeError:
                print(string.format(dtype, num, attr, "ERROR", gdal_meta))
    print()

Here is the script's output:

Rasterio version 1.0.22
GDAL version 2.2.3

dtype    number   h5py     rio      gdal
=============================================
int8     125      125      }        }
int8     135      -121     ERROR    ERROR

uint8    125      125      }        }
uint8    135      135      ERROR    ERROR

int16    125      125      125      125
int16    135      135      135      135

uint16   125      125      125d     125d
uint16   135      135      135d     135d

float16  125      125.0    125      125
float16  135      135.0    135      135

int32    125      125      125      125
int32    135      135      135      135

uint32   125      125      125d     125d
uint32   135      135      135d     135d

float32  125      125.0    125      125
float32  135      135.0    135      135

float64  125      125.0    125      125
float64  135      135.0    135      135

As you can see:

  • For attribute datatype int8 and uint8, accessing the tags() crashes for value 135, but works for value 125 since bytes([125]).decode("utf-8") gives }.
  • For uint16 and uint32, tags adds a d after the actual value.
  • For all of the other datatypes, the decoding works as expected.

I have found a workaround by monkey-patching the function tags in rasterio._base.pyx, by replacing the line:

return dict(metadata[i].split('=', 1) for i in range(num_items))

by

d = dict()
for i in range(num_items):
    try:
        k, v = metadata[i].split('=', 1)
        d[k] = v
    except UnicodeDecodeError:
        pass
return d

But this solution is not a fix, it merely ignores tag values that cannot be decode as utf-8...

Operating system

I have tested this problem on several OS and gdal/rasterio versions.
The example above was run on the following Docker environment:

FROM ubuntu:18.04

RUN apt-get update && \
    apt-get install -y python3-pip && \
    apt-get install -y gdal-bin libgdal-dev

RUN pip3 install numpy && \
    pip3 install rasterio --no-binary rasterio && \
    pip3 install h5py

ENTRYPOINT ["bash"]

Rasterio version and provenance

Rasterio version 1.0.22 installed from pypi (see Dockerfile above)

@sgillies

This comment has been minimized.

Copy link
Contributor

commented Mar 26, 2019

@glostis interesting! I'm inclined to say that this is a GDAL bug because GDALGetMetadata (), which we call at https://github.com/mapbox/rasterio/blob/master/rasterio/_base.pyx#L960, is supposed to return an array of UTF-8 encoded (as I understand it) C strings.

Do you subscribe to gdal-dev by any chance? Would you be willing to ask there? If not, I'll make a note to ask soon.

@sgillies

This comment has been minimized.

Copy link
Contributor

commented Mar 26, 2019

rouault added a commit to OSGeo/gdal that referenced this issue Mar 26, 2019

HDF5: fix handling of attributes of type SCHAR, UCHAR, USHORT and UINT
Fixes mapbox/rasterio#1663

For SCHAR and UCHAR, they were up to now considered as strings, which is
wrong according to https://support.hdfgroup.org/HDF5/doc/H5.user/Datatypes.html
"""3.6. Character and String Datatype Issues
The H5T_NATIVE_CHAR and H5T_NATIVE_UCHAR datatypes are actually numeric data
(1-byte integers). If the application wishes to store character data, then
an HDF5 string datatype should be derived from H5T_C_S1 instead."
So do consider them as integers, and add a GDAL_HDF5_CHAR_AS_STRING config
option that defaults to NO to get previous behaviour in case this would really
be needed.

For USHORT and UINT, use correct '%u' formatting instead of '%ud' that
added an extraneous 'd' character after the number.

rouault added a commit to OSGeo/gdal that referenced this issue Mar 26, 2019

HDF5: fix handling of attributes of type SCHAR, UCHAR, USHORT and UINT
Fixes mapbox/rasterio#1663

For SCHAR and UCHAR, they were up to now considered as strings, which is
wrong according to https://support.hdfgroup.org/HDF5/doc/H5.user/Datatypes.html
"""3.6. Character and String Datatype Issues
The H5T_NATIVE_CHAR and H5T_NATIVE_UCHAR datatypes are actually numeric data
(1-byte integers). If the application wishes to store character data, then
an HDF5 string datatype should be derived from H5T_C_S1 instead."
So do consider them as integers, and add a GDAL_HDF5_CHAR_AS_STRING config
option that defaults to NO to get previous behaviour in case this would really
be needed.

For USHORT and UINT, use correct '%u' formatting instead of '%ud' that
added an extraneous 'd' character after the number.
@sgillies

This comment has been minimized.

Copy link
Contributor

commented Mar 26, 2019

It's a GDAL bug, fixed in OSGeo/gdal@9640098 and will be available in 2.4.2.

@sgillies sgillies closed this Mar 26, 2019

@sgillies sgillies added upstream and removed needs discussion labels Mar 26, 2019

@glostis

This comment has been minimized.

Copy link
Author

commented Mar 27, 2019

Thanks for the promptness of your reply and for your help @sgillies !
I have tested my script with rasterio using a GDAL version locally built on commit 9640098, and it works as expected.

@glostis

This comment has been minimized.

Copy link
Author

commented Mar 27, 2019

@sgillies I just have a question following my previous comment: with rasterio installed using my newly-built GDAL version, whenever I run import rasterio, I get the following warning:

Warning 1: +init=epsg:XXXX syntax is deprecated. It might return a CRS with a non-EPSG compliant axis order.

The traceback of this warning is the following:

<ipython-input-1-350e27267e59> in <module>
----> 1 import rasterio

~/rasterio/rasterio/__init__.py in <module>
     20             pass
     21
---> 22 from rasterio._base import gdal_version
     23 from rasterio.drivers import is_blacklisted
     24 from rasterio.dtypes import (

~/rasterio/rasterio/_base.pyx in init rasterio._base()
     25 from rasterio.enums import (
     26     ColorInterp, Compression, Interleaving, MaskFlags, PhotometricInterp)
---> 27 from rasterio.env import Env, env_ctx_if_needed
     28 from rasterio.errors import (
     29     RasterioIOError, CRSError, DriverRegistrationError, NotGeoreferencedWarning,

~/rasterio/rasterio/env.py in <module>
    621
    622     # See https://github.com/mapbox/rasterio/issues/1631.
--> 623     if PROJDataFinder().has_data():
    624         log.debug("PROJ data files are available at built-in paths")
    625

~/rasterio/rasterio/_env.pyx in rasterio._env.PROJDataFinder.has_data()
    274         try:
    275             exc_wrap_ogrerr(exc_wrap_int(OSRImportFromProj4(osr, "+init=epsg:4326")))
    276         except CPLE_BaseError:
    277             return False

The warning comes from the following GDAL lines which were introduced in GDAL 2 months ago.

This warning can be silenced by replacing in rasterio/_env.pyx the +init=epsg:4326 by +proj=longlat. But since I am not a geospatial expert, I do not know if this substitution is correct.

I can file a separate issue detailing this if needed.

@sgillies

This comment has been minimized.

Copy link
Contributor

commented Mar 27, 2019

Ack, I can't believe I added deprecated usage to the library! I think that pytest, or my usage of pytest, has been hiding this from me. No need for a separate issue @glostis. Fix coming right up.

@sgillies

This comment has been minimized.

Copy link
Contributor

commented Mar 28, 2019

c5b0d91 should fix this.

@glostis

This comment has been minimized.

Copy link
Author

commented Mar 28, 2019

Indeed it does work. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.