
Scale/offset with raster:bands.{scale|offset} #55

Open
gadomski opened this issue Apr 21, 2022 · 13 comments
Labels: enhancement (New feature or request)

@gadomski (Contributor)

The raster extension provides scale and offset information. These scales and offsets should be applied to loaded data, if available. Right now, we're having to work around it like this:

import odc.stac
from pystac.extensions.raster import RasterExtension

data = odc.stac.load(
    items,
    crs="EPSG:3857",
    bands="500m_16_days_NDVI",
    resolution=500,
    bbox=bbox,
)
# Apply the scale from the raster extension by hand (note: the offset is not applied here)
raster = RasterExtension.ext(items[0].assets["500m_16_days_NDVI"])
data = data["500m_16_days_NDVI"] * raster.bands[0].scale

A full notebook example that includes the workaround in Cell 4 is linked here.

@gadomski added the enhancement label on Apr 21, 2022
@Kirill888 (Member)

There are some important implications to supporting this use case.

  1. It impacts dtype derivation for the band; I assume the expected load dtype would be float32 in this case.
  2. Native pixel-type reads should still be possible, in my opinion.
    • This requires some kind of configuration mechanism.
    • The default should probably be to apply scaling, but some mechanism to disable it should also be available, ideally per band, but at least "disable scaling for this load".
    • Scaling info should be propagated into the returned xarray attributes (see the sketch below).
  3. Requires adjusting the model to propagate scaling information from the STAC document.
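
A sketch of what points 1 and 2 could look like after load, reusing items and data from the workaround above (the attribute names are illustrative, not an agreed convention):

from pystac.extensions.raster import RasterExtension

band_meta = RasterExtension.ext(items[0].assets["500m_16_days_NDVI"]).bands[0]
band = data["500m_16_days_NDVI"]
# Point 1: promote dtype to float32 while applying the scaling;
# point 2: keep the scaling info recoverable in the attributes.
scaled = band.astype("float32") * band_meta.scale + (band_meta.offset or 0.0)
scaled.attrs.update({"raster:scale": band_meta.scale, "raster:offset": band_meta.offset})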

Related to that is scaling information supplied in the raster file itself (depending on the format, these things are "standard" to varying degrees). For example, netCDF reads with GDAL ignore scaling info and return raw pixel values, while reads of the same file with the netCDF library return scaled pixels. I'm not sure if TIFF has any "standard" tags like that.
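
For reference, rasterio does report whatever in-file scaling the driver exposes, without applying it to the pixels it returns (a minimal sketch; the path is illustrative):

import rasterio

with rasterio.open("example.tif") as src:
    print(src.scales, src.offsets)  # one (scale, offset) pair per band
    raw = src.read(1)               # pixel values are returned unscaled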

@gadomski (Contributor, Author)

> Native pixel type reads should still be possible in my opinion

Agreed.

> Related to that is scaling information that is supplied in the raster file itself

Does odc.stac.load apply in-file scales and offsets currently? IMO it would be an error state for the STAC scale/offset information to disagree with the actual file's scale/offset information, so maybe just throw an error if your use_raster_extension config argument is True and the data and metadata disagree?
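
(A sketch of that proposed check, assuming rasterio for the in-file values; use_raster_extension is the hypothetical flag proposed above:)

import rasterio
from pystac.extensions.raster import RasterExtension

def check_scaling(item, asset_key):
    # Compare STAC raster:bands scale/offset against what the file itself reports
    band = RasterExtension.ext(item.assets[asset_key]).bands[0]
    with rasterio.open(item.assets[asset_key].href) as src:
        in_file = (src.scales[0], src.offsets[0])
    in_stac = (band.scale or 1.0, band.offset or 0.0)
    if in_file != in_stac:
        raise ValueError(f"STAC and in-file scale/offset disagree for {asset_key}: {in_stac} vs {in_file}")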

@Kirill888 (Member)

> Does odc.stac.load apply in-file scales and offsets currently?

No, unless rasterio/GDAL do it, which I'm pretty sure they don't for COG/netCDF, but who knows what other GDAL drivers do or don't do.

> IMO it would be an error state for the STAC scale/offset information to disagree with the actual file's scale/offset information, so maybe just throw an error if your use_raster_extension config argument is True and the data and metadata disagree?

There is "disagree", but then there is also "STAC does not include it but the data does".

@gadomski (Contributor, Author)

there is "disagree", but then there is also "stac does not include it but data does".

Yeah, I'd say if rasterio+GDAL isn't applying scale+offset for the "common" data formats, we can't expect to protect against some other GDAL reader applying it under the hood. So I think the answer is "use raster:bands if its there, otherwise don't apply any scale+offset"?
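
(A minimal sketch of that fallback, assuming band is a pystac RasterBand, or None when the asset has no raster:bands:)

def maybe_scale(pixels, band):
    # Apply scale/offset only when raster:bands supplies them; otherwise return raw pixels
    if band is None or (band.scale is None and band.offset is None):
        return pixels
    return pixels * (band.scale or 1.0) + (band.offset or 0.0)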

@Kirill888 (Member)

The issue is that we can't tell whether STAC data is "just a description of a storage scheme" or "a description plus an instruction to convert at load time". For formats like COG there is no "automatic" scaling done by GDAL at read time, but some TileDB driver or whatnot might perform that scaling as part of the read, and it might still be useful to record that metadata in STAC. I guess having "read without further scaling" is enough to deal with those cases for now; later on, if it becomes a problem, we can change the default based on the format of the source.

@idantene commented Mar 5, 2024

With the new Element84 Sentinel-2 collection, it was decided that:

> The offset will no longer be applied so the data will be unaltered other than the conversion to a COG. Users should always apply gains and offset when using data.

Is this the time to implement scale/offset application? Or perhaps at least document @gadomski's approach here?

Alternatively, should this be pushed further down the stack (pun intended) to e.g. pystac-client?

EDIT: @gadomski's solution comes with a few caveats:

  1. It assumes the scale/offset in the first item applies to the whole item collection (should this be verified?).
  2. It assumes the scale/offset exists in the first item.
  3. Only the scale is applied, not the offset.

A sketch addressing all three is below.
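
(For illustration only: a minimal variant that verifies consistency across items and applies both scale and offset, reusing items and data from the workaround above and assuming raster:bands is present on every item:)

from pystac.extensions.raster import RasterExtension

def scale_offset(item):
    band = RasterExtension.ext(item.assets["500m_16_days_NDVI"]).bands[0]
    # A missing offset in the metadata is treated as 0
    return band.scale, band.offset or 0.0

pairs = {scale_offset(item) for item in items}
if len(pairs) != 1:
    raise ValueError(f"inconsistent scale/offset across items: {pairs}")
scale, offset = pairs.pop()
data = data["500m_16_days_NDVI"] * scale + offset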

@Kirill888 (Member)

@idantene my understanding is that right now this collection is in a bit of flux (this is what I hear from other users; I have not been working with this data recently): some items have pixel scaling applied to the data and some do not, and this makes data loading particularly tricky.

> Alternatively, should this be pushed further down the stack (pun intended) to e.g. pystac-client?

Not sure what you mean by that, as pystac-client only deals in metadata, not pixel values.

My preference here is to either:

  • keep the current state: return "raw" pixels, with the caveat that "raw" depends on the format being used (see the discussion above about GDAL data-loading drivers), or
  • implement the most flexible solution.

The most flexible solution would have to:

  • support raw-mode output
    • default to applying scaling
    • allow users to opt out of that with a load-time parameter that can be applied per band
  • apply scale/offset per item (do not assume that all items share the same scale/offset)
  • allow user-supplied overrides per band
  • possibly allow a user hook to decide per item what scaling should be used (a hypothetical shape for such a hook is sketched after this list)
    • Metadata is broken, or non-standard, way too often; one always needs an escape hatch
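
Purely as illustration (none of this is odc-stac's actual API), such a hook might look like:

from typing import Optional, Tuple

from pystac import Item

# Hypothetical signature: given an item and a band name, return the
# (scale, offset) to apply, or None to leave the pixels untouched.
def scaling_for(item: Item, band: str) -> Optional[Tuple[float, float]]:
    raster_bands = item.assets[band].extra_fields.get("raster:bands")
    if not raster_bands:
        return None  # no metadata: read raw pixels
    b = raster_bands[0]
    return b.get("scale", 1.0), b.get("offset", 0.0)

# Hypothetical load-time parameter:
# data = odc.stac.load(items, bands=[...], scaling=scaling_for)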

Right now work is underway to refactor the loading parts of odc-stac to support non-GDAL data-loading drivers (zarr with kerchunk, for example). There is already an abstraction in place to "bring your own loader", but it will change somewhat as we iron things out. I feel like completing this work will make adding support for inline pixel modification much simpler, with scale and offset being just one of the standard transforms odc-stac should support natively. The same machinery could be used to bring arbitrary user-supplied transforms that apply to pixel values in the native projection of the data, before reprojection and merging of rasters that share the same pixel plane (data acquired at the same time/orbit/day).

@Kirill888 (Member)

Pinging @robbibt as it relates to "unifying odc-stac/datacube-core", and @woodcockr because of "virtual product light".

@woodcockr (Member)

Working with all our EO workflows, the only sensible option is "possibly allow a user hook to decide per item what scaling should be used". In part this is because using data-provider metadata is required and is also processing-version dependent: the absence of a "scale and offset applied" flag may mean you need to know which processing version you are dealing with (it's implicit).
Ultimately, ODC can only reliably provide "raw" data and allow the client call to hook in anything specific for a particular source.

That is where we are heading with the ODC hyperspectral native-loader support, currently prototyped in odc-stac but to be added to datacube-core once we get the kinks out.

@idantene commented Mar 7, 2024

Thanks (once again) for the very detailed reply @Kirill888!

All of that sounds great, and I guess I'll just wait for future updates (/subscribe 😉).

> Not sure what you mean by that, as pystac-client only deals in metadata, not pixel values.

Sorry, I was being a bit laconic there. I meant that maybe pystac-client could handle any mismatches/issues in the metadata (such as inconsistent or missing scale/offset), so that odc-stac would simply have to apply it.

@gadomski (Contributor, Author) commented Mar 7, 2024

> this collection is in a bit of flux

That's correct: Element 84 is building a new Sentinel-2 L2A collection, sentinel-2-c1-l2a. Items in that collection will not have the offset applied, which differs from the items in the old sentinel-2-l2a collection. This discussion has links to the mailing-list signup and archive for anyone who wants to dig in more.

> I meant that maybe pystac-client could handle any mismatches/issues in the metadata (such as inconsistent or missing scale/offset)

@idantene can you provide more detail on the desired behavior here? E.g. an example? Thanks!

@idantene commented Mar 7, 2024

Sure @gadomski. We currently apply a similar filtering/transformation internally, but consider:

  1. The user expects a scale and offset for each item. As such, they consider any Item without scale/offset metadata invalid, and would prefer to ignore it.
  2. The user expects the same scale/offset for each item. They may want to consider only the most common scale/offset and reject the non-standard ones.
  3. The user expects the same scale/offset for each item. They want to be alerted (or have the code crash) if this isn't the case.

Our current approach is, e.g.:

from typing import Callable, Optional

from pystac import Item

def fetch_sentinel2_timeseries(..., item_filter: Optional[Callable[[Item], bool]] = None):
    # ...
    query = stac_client.search(...)
    try:
        items = query.item_collection()
    except Exception as e:  # noqa, capture any fetching exceptions and raise them as runtime errors
        raise RuntimeError from e
    if item_filter is not None:
        items = [item for item in items if item_filter(item)]
    # ...

For the time being, the only item_filter we're using filters away items with a zenith angle larger than 11 degrees (calculated from the granule_metadata or metadata XML file).
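
For example, an item_filter implementing expectation 1 above (ignore any Item without scale/offset metadata) might look like this sketch (the asset key is an assumption):

from pystac import Item

def has_scale_offset(item: Item, asset_key: str = "red") -> bool:
    # Keep only items whose asset carries raster:bands with both scale and offset
    asset = item.assets.get(asset_key)
    if asset is None:
        return False
    bands = asset.extra_fields.get("raster:bands") or []
    return bool(bands) and bands[0].get("scale") is not None and bands[0].get("offset") is not None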

@gadomski (Contributor, Author) commented Mar 7, 2024

To me, pystac-client is meant to be a relatively general tool for fetching data from a STAC API. What you're describing feels (to me) more use-case specific, and it is already supported more or less by the workflow you describe. Happy to keep chatting about desired behaviors, but we should probably move the discussion over to an issue on https://github.com/stac-utils/pystac-client to avoid adding noise here on odc-stac.

As an aside, I would recommend using items instead of item_collection so that you can do your filtering on the fly rather than loading all items first, e.g.:

from pystac_client import Client

item_search = Client.open(...).search(...)
items = [item for item in item_search.items() if item_filter(item)]
