Dataset.tags() call throws UnicodeDecodeError on GeoTIFF metadata readable by GDAL #2078

Closed
lossyrob opened this issue Jan 8, 2021 · 5 comments
@lossyrob

lossyrob commented Jan 8, 2021

Expected behavior and actual behavior.

Calling .tags() on an open dataset for a specific GeoTIFF file raises a UnicodeDecodeError. The dataset does contain misbehaving characters, but both GDAL itself and the GDAL Python bindings read the metadata without error.

The file is NASA HLS data converted from an HDF file into a COG - specifically from https://hls.gsfc.nasa.gov/data/v1.4/S30/2019/10/S/E/G/HLS.S30.T10SEG.2019048.v1.4.hdf to https://hlssa.blob.core.windows.net/hls/S30/HLS.S30.T10SEG.2019048.v1.4_01.tif

The offending bytes do exist in the HDF file, and were transferred to the COG.

Steps to reproduce the problem.

The issue can be reproduced by patching in the unit test here and running in an environment with GDAL python bindings installed: master...lossyrob:bug/tags/unicode-decode-error

(Pasted here for convenient viewing)

def test_tags_reads_what_gdal_can():
    import os
    from tempfile import TemporaryDirectory
    from urllib.request import urlretrieve

    # Note: GDAL python bindings must be installed in the virtualenv, e.g.
    # pip install GDAL==`gdal-config --version`
    from osgeo import gdal

    # Imported locally so the pasted snippet is self-contained
    import rasterio

    # HLS COG file from AI for Earth datasets
    url = 'https://hlssa.blob.core.windows.net/hls/S30/HLS.S30.T10SEG.2019048.v1.4_01.tif'

    with TemporaryDirectory() as tmp_dir:
        test_path = os.path.join(tmp_dir, 'test.tif')
        urlretrieve(url, test_path)

        # Show tags can be read with GDAL
        gdal_ds = gdal.Open(test_path)
        gdal_tags = gdal_ds.GetMetadata_Dict()

        # Can tags be read in by rasterio?
        with rasterio.open(test_path) as ds:
            rio_tags = ds.tags()

        # Notice error thrown: UnicodeDecodeError

Operating system

Multiple environments have replicated this issue, but I'm personally running Ubuntu 20.04.1 LTS on Windows Subsystem for Linux 2.

Rasterio version and provenance

I reproduced with the above unit test on current master (6351d5a)

@lossyrob lossyrob added the bug label Jan 8, 2021
@lossyrob
Author

lossyrob commented Jan 8, 2021

Looking into it briefly, the only difference I see between rasterio's way of accessing the metadata and the GDAL Python SWIG bindings is that GDAL decodes the encoded metadata tags with a call to GDALGetMetadata here, whereas rasterio calls GDALGetMetadata and then uses CPLParseNameValue and Cython's char * -> string conversion to decode the bytes. So the difference is potentially caused by the two approaches decoding Unicode differently?

@sgillies
Member

sgillies commented Jan 9, 2021

👋 Happy New Year @lossyrob. I modified tags() to not convert the metadata to unicode and here's what I see in the GeoTIFF:

{b'ACCODE': b'LaSRCS2AV3.5.5 + LaSRCS2AV3.5.5\xfc\xffp\xd9)\xfe\x7f',
 b'AREA_OR_POINT': b'Area',
 b'AngleBand': b'0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12',
 b'DATASTRIP_ID': b'S2B_OPER_MSI_L1C_DS_MPS__20190217T222115_S20190217T19045'
                  b'3_N02.07 + S2B_OPER_MSI_L1C_DS_MTI__20190217T222820_S201'
                  b'90217T191401_N02.07',
 b'HLS_PROCESSING_TIME': b'2019-02-23T03:37:12Z',
 b'HORIZONTAL_CS_CODE': b'EPSG:32610',
 b'HORIZONTAL_CS_NAME': b'WGS84 / UTM zone 10N',
 b'L1C_IMAGE_QUALITY': b'NONE + NONE',
 b'L1_PROCESSING_TIME': b'2019-02-17T23:02:30.528367Z',
 b'MEAN_SUN_AZIMUTH_ANGLE(B01)': b'158.25120335712',
 b'MEAN_SUN_ZENITH_ANGLE(B01)': b'51.9487642872077',
 b'MEAN_VIEW_AZIMUTH_ANGLE(B01)': b'290.459083715287',
 b'MEAN_VIEW_ZENITH_ANGLE(B01)': b'10.6675355130094',
 b'MSI band 01 bandpass adjustment slope and offset': b'0.995900, -0.000200',
 b'MSI band 02 bandpass adjustment slope and offset': b'0.977800, -0.004000',
 b'MSI band 03 bandpass adjustment slope and offset': b'1.007500, -0.000800',
 b'MSI band 04 bandpass adjustment slope and offset': b'0.976100, 0.001000',
 b'MSI band 11 bandpass adjustment slope and offset': b'1.000000, -0.000300',
 b'MSI band 12 bandpass adjustment slope and offset': b'0.986700, 0.000400',
 b'MSI band 8a bandpass adjustment slope and offset': b'0.996600, 0.000000',
 b'NBAR_Solar_Zenith': b'42.3149907361208',
 b'NCOLS': b'3660',
 b'NROWS': b'3660',
 b'PROCESSING_BASELINE': b'02.07 + 02.07',
 b'PRODUCT_URI': b'S2B_MSIL1C_20190217T190439_N0207_R013_T10SEG_20190217T222115'
                 b'.SAFE + S2B_MSIL1C_20190217T190439_N0207_R013_T10SEG_2019021'
                 b'7T222820.SAFE',
 b'SENSING_TIME': b'2019-02-17T19:14:02.147Z + 2019-02-17T19:14:08.412Z',
 b'SPACECRAFT_NAME': b'Sentinel-2B',
 b'SPATIAL_RESOLUTION': b'30',
 b'TILE_ID': b'S2B_OPER_MSI_L1C_TL_MPS__20190217T222115_A010192_T10SEG_N02.07 +'
             b' S2B_OPER_MSI_L1C_TL_MTI__20190217T222820_A010192_T10SEG_N02.07',
 b'ULX': b'499980',
 b'ULY': b'4200000',
 b'_FillValue': b'-1000',
 b'add_offset': b'0.0',
 b'arop_ave_xshift(meters)': b'0',
 b'arop_ave_yshift(meters)': b'0',
 b'arop_ncp': b'0',
 b'arop_rmse(meters)': b'0',
 b'arop_s2_refimg': b'NONE',
 b'cloud_coverage': b'34',
 b'long_name': b'Coastal_Aerosol',
 b'scale_factor': b'0.0001',
 b'spatial_coverage': b'23'}

It's ACCODE and its value b'LaSRCS2AV3.5.5 + LaSRCS2AV3.5.5\xfc\xffp\xd9)\xfe\x7f' that are the problem.
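As a quick check, the failure can be reproduced in plain Python without rasterio or GDAL. This is only a sketch of how Python's own UTF-8 codec treats these bytes; it does not claim to show what either library does internally.

```python
# Value of the ACCODE tag, copied from the raw dump above.
raw = b'LaSRCS2AV3.5.5 + LaSRCS2AV3.5.5\xfc\xffp\xd9)\xfe\x7f'

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first byte that cannot be decoded.
    bad_offset = exc.start
    print(f'strict decode fails at byte {bad_offset}: {raw[bad_offset:bad_offset + 1]!r}')

# Lenient error handlers never raise, but they alter the undecodable bytes.
replaced = raw.decode('utf-8', errors='replace')           # inserts U+FFFD
escaped = raw.decode('utf-8', errors='backslashreplace')   # inserts \xNN escapes
print(replaced)
print(escaped)
```

Everything up to the end of the second `LaSRCS2AV3.5.5` decodes cleanly; only the trailing bytes are rejected.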

gdalinfo reports:

  ACCODE=LaSRCS2AV3.5.5 + LaSRCS2AV3.5.5p
  add_offset=0.0
  AngleBand=0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
  AREA_OR_POINT=Area
  arop_ave_xshift(meters)=0
  arop_ave_yshift(meters)=0
  arop_ncp=0
  arop_rmse(meters)=0
  arop_s2_refimg=NONE
  cloud_coverage=34
  ...

I suspect ACCODE=LaSRCS2AV3.5.5 + LaSRCS2AV3.5.5p isn't correct either.

Making rasterio able to handle corrupted metadata will be a bit of work since the assumption that GDAL strings are all properly UTF-8 encoded is well baked into rasterio._base; I think I'll put that off until after 1.2.0.

@lossyrob
Author

@sgillies happy new year to you!

I suspect ACCODE=LaSRCS2AV3.5.5 + LaSRCS2AV3.5.5p isn't correct either.

I agree; looks like there's some corrupt metadata being carried over from the HDF file.

Making rasterio able to handle corrupted metadata will be a bit of work since the assumption that GDAL strings are all properly UTF-8 encoded is well baked into rasterio._base; I think I'll put that off until after 1.2.0.

That makes sense - thank you for adding the error handling!

The user who encountered this was using xarray.open_rasterio to open the file; I think the workaround here will be a step that reads and then overwrites the metadata in the GeoTIFF with a sanitized version beforehand.
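A minimal sketch of what the sanitizing part of such a step might look like, assuming the raw tags are already available as bytes (as in the dump earlier in this thread). The names `sanitize_value` and `sanitize_tags` are hypothetical, and truncating at the first undecodable byte is just one possible policy, chosen here because the garbage sits at the end of the value.

```python
from typing import Dict


def sanitize_value(raw: bytes) -> str:
    """Decode as UTF-8, keeping only the part before the first bad byte."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        # Everything before exc.start decoded cleanly, so this cannot raise.
        return raw[:exc.start].decode('utf-8')


def sanitize_tags(raw_tags: Dict[bytes, bytes]) -> Dict[str, str]:
    """Apply sanitize_value to every key and value of a raw tag mapping."""
    return {sanitize_value(k): sanitize_value(v) for k, v in raw_tags.items()}


raw_tags = {
    b'ACCODE': b'LaSRCS2AV3.5.5 + LaSRCS2AV3.5.5\xfc\xffp\xd9)\xfe\x7f',
    b'AREA_OR_POINT': b'Area',
}
clean = sanitize_tags(raw_tags)
print(clean)
```

Reading the raw bytes out of the file and writing the cleaned tags back are left out of this sketch; the write step could presumably go through GDAL or rasterio's `update_tags` on a dataset opened in `r+` mode.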

@sgillies
Member

@lossyrob I forgot to add yesterday that I worked the fix into rasterio 1.2b2, which is on PyPI now. Using it on the GeoTIFF, we now get:

$ rio info ~/Downloads/HLS.S30.T10SEG.2019048.v1.4_01.tif --tags
WARNING:rasterio._base:Failed to decode metadata item: i=0, item=b'ACCODE=LaSRCS2AV3.5.5 + LaSRCS2AV3.5.5\xfc\xffp\xd9)\xfe\x7f'
{"add_offset": "0.0", "AngleBand": "0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12", "AREA_OR_POINT": "Area", "arop_ave_xshift(meters)": "0", "arop_ave_yshift(meters)": "0", "arop_ncp": "0", "arop_rmse(meters)": "0", "arop_s2_refimg": "NONE", "cloud_coverage": "34", "DATASTRIP_ID": "S2B_OPER_MSI_L1C_DS_MPS__20190217T222115_S20190217T190453_N02.07 + S2B_OPER_MSI_L1C_DS_MTI__20190217T222820_S20190217T191401_N02.07", "HLS_PROCESSING_TIME": "2019-02-23T03:37:12Z", "HORIZONTAL_CS_CODE": "EPSG:32610", "HORIZONTAL_CS_NAME": "WGS84 / UTM zone 10N", "L1C_IMAGE_QUALITY": "NONE + NONE", "L1_PROCESSING_TIME": "2019-02-17T23:02:30.528367Z", "long_name": "Coastal_Aerosol", "MEAN_SUN_AZIMUTH_ANGLE(B01)": "158.25120335712", "MEAN_SUN_ZENITH_ANGLE(B01)": "51.9487642872077", "MEAN_VIEW_AZIMUTH_ANGLE(B01)": "290.459083715287", "MEAN_VIEW_ZENITH_ANGLE(B01)": "10.6675355130094", "MSI band 01 bandpass adjustment slope and offset": "0.995900, -0.000200", "MSI band 02 bandpass adjustment slope and offset": "0.977800, -0.004000", "MSI band 03 bandpass adjustment slope and offset": "1.007500, -0.000800", "MSI band 04 bandpass adjustment slope and offset": "0.976100, 0.001000", "MSI band 11 bandpass adjustment slope and offset": "1.000000, -0.000300", "MSI band 12 bandpass adjustment slope and offset": "0.986700, 0.000400", "MSI band 8a bandpass adjustment slope and offset": "0.996600, 0.000000", "NBAR_Solar_Zenith": "42.3149907361208", "NCOLS": "3660", "NROWS": "3660", "PROCESSING_BASELINE": "02.07 + 02.07", "PRODUCT_URI": "S2B_MSIL1C_20190217T190439_N0207_R013_T10SEG_20190217T222115.SAFE + S2B_MSIL1C_20190217T190439_N0207_R013_T10SEG_20190217T222820.SAFE", "scale_factor": "0.0001", "SENSING_TIME": "2019-02-17T19:14:02.147Z + 2019-02-17T19:14:08.412Z", "SPACECRAFT_NAME": "Sentinel-2B", "spatial_coverage": "23", "SPATIAL_RESOLUTION": "30", "TILE_ID": "S2B_OPER_MSI_L1C_TL_MPS__20190217T222115_A010192_T10SEG_N02.07 + S2B_OPER_MSI_L1C_TL_MTI__20190217T222820_A010192_T10SEG_N02.07", "ULX": "499980", 
"ULY": "4200000", "_FillValue": "-1000"}

@lossyrob
Author

lossyrob commented Jan 11, 2021

Ah I see, I misinterpreted that commit - it emits a warning but still successfully reads in non-corrupt tags. Many thanks!
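For anyone reading later, the behavior described here (warn on the bad item, keep the rest) can be sketched in plain Python roughly as follows. This is an illustration only, not rasterio's actual code; `decode_items` is a hypothetical helper and the input is assumed to be a list of `NAME=VALUE` byte strings.

```python
import logging
from typing import Dict, Iterable

log = logging.getLogger(__name__)


def decode_items(items: Iterable[bytes]) -> Dict[str, str]:
    """Decode NAME=VALUE items, skipping any that are not valid UTF-8."""
    tags = {}
    for i, item in enumerate(items):
        try:
            text = item.decode('utf-8')
        except UnicodeDecodeError:
            # Report the undecodable item and carry on with the others.
            log.warning('Failed to decode metadata item: i=%r, item=%r', i, item)
            continue
        key, sep, value = text.partition('=')
        if sep:  # ignore items that contain no '=' separator
            tags[key] = value
    return tags


items = [
    b'ACCODE=LaSRCS2AV3.5.5 + LaSRCS2AV3.5.5\xfc\xffp\xd9)\xfe\x7f',
    b'add_offset=0.0',
    b'AREA_OR_POINT=Area',
]
tags = decode_items(items)
print(tags)
```

The trade-off is that the corrupt tag is dropped entirely rather than partially recovered, which matches the `rio info` output above where ACCODE is absent from the result.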
