Dataset.tags() call throws UnicodeDecodeError on GeoTIFF metadata readable by GDAL #2078
Comments
Looking into it briefly, the only difference I see between rasterio's method of accessing the metadata and the GDAL Python SWIG bindings is in how the bytes get decoded: GDAL decodes the encoded metadata tags with a call to GDALGetMetadata here, while rasterio calls GDALGetMetadata and then uses CPLParseNameValue and Cython's char * -> string conversion to decode the bytes. So the discrepancy is potentially caused by the difference in Unicode decoding between the two approaches?
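To illustrate the hypothesis, here is a minimal pure-Python sketch of how the same raw bytes can either raise or be read, depending on the decoding policy. The tag name and byte values are made up for illustration, and the lenient path is only an assumption about what a more tolerant reader might do; it is not taken from the GDAL bindings or from rasterio.

```python
# Hypothetical raw metadata item in NAME=VALUE form.
# The 0xFF byte can never appear in valid UTF-8; it stands in for the
# corrupt bytes carried over from the HDF file.
raw_item = b"PROCESSING_NOTE=baseline \xff value"


def decode_strict(item: bytes) -> str:
    """Decode as UTF-8 and raise UnicodeDecodeError on any invalid byte."""
    return item.decode("utf-8")


def decode_lenient(item: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for any invalid byte."""
    return item.decode("utf-8", errors="replace")


try:
    decode_strict(raw_item)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

lenient = decode_lenient(raw_item)
print(strict_failed)
print(lenient)
```

A strict conversion fails on the whole item, while a lenient one keeps the readable part and marks the damaged byte.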
👋 Happy New Year @lossyrob. I modified
Its gdalinfo reports:
I suspect Making rasterio able to handle corrupted metadata will be a bit of work since the assumption that GDAL strings are all properly UTF-8 encoded is well baked into
@sgillies happy new year to you!
I agree; looks like there's some corrupt metadata being carried over from the HDF file.
That makes sense - thank you for adding the error handling! The user who encountered this was using
@lossyrob I forgot to add yesterday that I worked the fix into rasterio 1.2b2, which is on PyPI now. Using it on the GeoTIFF, we now get:
Ah I see, I misinterpreted that commit - it emits a warning but still successfully reads in non-corrupt tags. Many thanks!
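The behavior described above (warn about the corrupt item, keep the rest) can be sketched in plain Python. This is not rasterio's actual code: `parse_tags` and the sample items are hypothetical, and the real fix lives in rasterio's Cython layer.

```python
import warnings


def parse_tags(raw_items):
    """Decode NAME=VALUE byte strings, skipping any that are not valid UTF-8."""
    tags = {}
    for item in raw_items:
        try:
            text = item.decode("utf-8")
        except UnicodeDecodeError:
            # Report the bad item but keep going, so one corrupt tag
            # does not hide all the readable ones.
            warnings.warn(f"Skipping metadata item that is not valid UTF-8: {item!r}")
            continue
        key, sep, value = text.partition("=")
        if sep:  # ignore items that have no NAME=VALUE separator
            tags[key] = value
    return tags


# Hypothetical metadata list with one corrupt entry.
items = [b"AREA_OR_POINT=Area", b"BAD_TAG=\xff\xfe", b"SENSOR=MSI"]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    tags = parse_tags(items)

print(tags)
print(len(caught))
```

Skipping the item entirely is one design choice; decoding it with `errors="replace"` would be another, at the cost of returning altered values.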
Expected behavior and actual behavior.
Calling .tags() on an open dataset of a specific GeoTIFF file causes a UnicodeDecodeError. The dataset does contain misbehaving characters, but using GDAL directly, as well as the GDAL Python bindings, reads the metadata without error.
The file is NASA HLS data converted from an HDF file into a COG - specifically from https://hls.gsfc.nasa.gov/data/v1.4/S30/2019/10/S/E/G/HLS.S30.T10SEG.2019048.v1.4.hdf to https://hlssa.blob.core.windows.net/hls/S30/HLS.S30.T10SEG.2019048.v1.4_01.tif
The offending bytes do exist in the HDF file, and were transferred to the COG.
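For anyone wanting to locate such bytes in a metadata value, a small helper along these lines may be useful. It is a sketch only: `find_invalid_utf8` is a hypothetical name and the sample bytes are invented, not taken from the HLS file.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, bytes) pairs for each span that cannot be decoded as UTF-8."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder is valid UTF-8
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice being decoded.
            start = pos + err.start
            end = pos + err.end
            bad.append((start, data[start:end]))
            pos = end  # resume scanning just past the bad span
    return bad


# Invented sample value containing two invalid bytes.
sample = b"ok \xff\xfe end"
offsets = find_invalid_utf8(sample)
print(offsets)
```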
Steps to reproduce the problem.
The issue can be reproduced by patching in the unit test here and running in an environment with GDAL python bindings installed: master...lossyrob:bug/tags/unicode-decode-error
(Pasted here for convenient viewing)
Operating system
Multiple environments have replicated this issue, but I'm personally running Ubuntu 20.04.1 LTS on Windows Subsystem for Linux 2.
Rasterio version and provenance
I reproduced with the above unit test on current master (6351d5a)