
Derived fields and aggregation support #164

Open
jerstlouis opened this issue Apr 13, 2022 · 4 comments
Labels: 2022-05 Sprint · Cross-SWG Discussion · EDR-related (for coordination with EDR) · Extension (will be addressed by a future extension)


jerstlouis commented Apr 13, 2022

Suggesting that we plan for a separate part enabling basic analytics capabilities, including conformance classes for:

  • derived fields supporting arithmetic (e.g. NDVI computation), properties=
  • filtering (Retrieve values within a certain range #103) (e.g., only retaining cells with elevation values above a certain threshold), filter=
  • sorting (e.g., allowing multiple scenes to be flattened into a 2D image with the least cloudy cells retained), sortby=
  • standardized pre-defined aggregation functions, e.g. Max(), Min(), Avg(), StdDev(), Sum()... used within properties=, filter=, sortby= expressions. The dimensions over which data is aggregated could also leverage subset, bbox, datetime, but a mechanism would still be needed to distinguish whether a series should be returned for a particular dimension or aggregation should be performed over it.
  • operating over multiple collections (allowing the above capabilities to combine fields from those multiple collections), collections=

This would be informed by the work from DAPA and Testbed-17 GeoDataCube API, and ideally be consistent with the OGC API - Features Search extension as well as with OGC API - DGGS and OGC API - EDR.
We plan to explore this in the upcoming May 2022 Code Sprint.

Example proposed syntax:
properties=NDVI:Max((B5-B4)/(B5+B4))&subset=datetime("2020-07-01":"2020-07-31")
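As a quick client-side illustration, the proposed query can be assembled and URL-encoded with Python's standard library; the host and collection path here are hypothetical placeholders, not part of the proposal.

```python
# Sketch: encoding the proposed properties= / subset= query parameters.
# Host and path are hypothetical; parameter values come from the example above.
from urllib.parse import urlencode

params = {
    "properties": "NDVI:Max((B5-B4)/(B5+B4))",
    "subset": 'datetime("2020-07-01":"2020-07-31")',
}
url = "https://example.com/collections/sentinel2/coverage?" + urlencode(params)
print(url)
```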


jerstlouis commented Apr 28, 2022

We should consider use cases where we want to aggregate / return results differently for different dimensions, for example:

A) Return a 0D value including derived "minimum NDVI" and "maximum NDVI" values, first aggregated over the time dimension at each point in space, then averaged over the spatial dimensions.
B) Support aggregating to a time series at a coarser resolution, rather than to a single value over a dimension, e.g. computing the monthly minimum, maximum or average for each month of a year.

What could that look like syntactically? Possibly an additional parameter to the aggregation function to select the dimensions over which to aggregate? e.g. time, space, spacetime, [latitude, longitude, datetime].

A) Aggregate minimum of spatially local values over time, then aggregate average over space (a single cell is returned with a minimum and a maximum value)

properties=
   minNDVI:Avg(
      Min((B5-B4)/(B5+B4), time),
   space),
   maxNDVI:Avg(
      Max((B5-B4)/(B5+B4), time),
   space)
&subset=datetime("2020-01-01":"2021-12-31"),Lat(45.0:45.1),Lon(-75.1:-75.0)
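The semantics of example A can be sketched in plain Python over a tiny in-memory grid (the NDVI values below are illustrative, not real data): take the per-cell minimum across time, then average the resulting cells over space.

```python
# Sketch of example A's aggregation order: Min over time per cell, then Avg
# over space. ndvi[t][y][x] holds NDVI for 3 time steps on a 2x2 grid.
ndvi = [
    [[0.2, 0.4], [0.6, 0.8]],  # t = 0
    [[0.1, 0.5], [0.7, 0.3]],  # t = 1
    [[0.3, 0.2], [0.5, 0.9]],  # t = 2
]

# Min((B5-B4)/(B5+B4), time): per-cell minimum across the time dimension
min_over_time = [
    [min(ndvi[t][y][x] for t in range(3)) for x in range(2)]
    for y in range(2)
]

# Avg(..., space): a single 0D value averaged over the remaining spatial cells
cells = [v for row in min_over_time for v in row]
min_ndvi = sum(cells) / len(cells)
print(round(min_ndvi, 3))  # 0.275
```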

With an additional option to specify aggregating to a coarser resolution, as opposed to a single value? e.g., time:month, Lat:0.005

B) Aggregate minimum of spatially local values over time for each given month, then aggregate sum over space. The result would be a 1D time series with 12 cells (data records / features), each containing the sum of the monthly minimums and the sum of the monthly maximums over all subsetted space.

properties=
   minNDVI:Sum(
      Min((B5-B4)/(B5+B4), time:month),
   space),
   maxNDVI:Sum(
      Max((B5-B4)/(B5+B4), time:month),
   space)
&subset=datetime("2020-01-01":"2020-12-31"),Lat(45.0:45.1),Lon(-75.1:-75.0)

A special month resolution is proposed in the example here to accommodate the common use of uneven temporal units. A number corresponding to units (e.g., in seconds or meters or degrees) could also be used to qualify the dimension over which aggregation is performed.
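The time:month qualifier amounts to grouping timestamped values by calendar month before applying the aggregating function, which can be sketched as follows (the sample dates and values are illustrative; a real service would draw them from the subsetted coverage):

```python
# Sketch of Min(..., time:month): group values by calendar month, then take
# one minimum per month, yielding a 1D monthly series instead of a 0D value.
from collections import defaultdict
from datetime import date

samples = [  # (date, NDVI value at one spatial cell)
    (date(2020, 1, 5), 0.4), (date(2020, 1, 20), 0.2),
    (date(2020, 2, 3), 0.6), (date(2020, 2, 18), 0.5),
]

by_month = defaultdict(list)
for d, v in samples:
    by_month[(d.year, d.month)].append(v)

monthly_min = {m: min(vs) for m, vs in sorted(by_month.items())}
print(monthly_min)  # {(2020, 1): 0.2, (2020, 2): 0.5}
```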

DAPA had some similar ideas for its aggregate query parameter, but more so for the different aggregation processes (area:aggregate-space, area:aggregate-space-time, area:aggregate-time, grid:aggregate-time, position:aggregate-time).

To compare aggregating a gridded coverage with the Features search extension, cells are akin to the features in that their set of properties have given values. Aggregation is essentially creating a new collection of cells (equivalent to a new feature collection) with different dimensionality and/or resolution across some dimension(s).

Note that if aggregation is simply functions used in derived fields properties, then the resulting dimensionality may differ if returned properties use different kinds of aggregation -- that could mean fields that are not aggregated over some dimensions or resolution would get duplicated.

Another use case for sortby might be to more explicitly specify the subset slice sparse-data behavior discussed in #105, e.g., by including the time dimension as a sortable. That could be combined with other sortable keys, including derived fields using aggregation, e.g., Avg() over space (but not time) to sort scenes as a whole without mixing them up.

sortby=
      -Avg((B5-B4)/(B5+B4), space),
      +time
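The intended ordering can be sketched with Python's sorted(): descending on the spatially averaged NDVI (the leading "-"), ascending on time as a tiebreaker. The scene records below are hypothetical stand-ins, with the spatial average precomputed per scene.

```python
# Sketch of sortby=-Avg(...,space),+time over whole scenes.
scenes = [
    {"time": "2020-07-01", "avg_ndvi": 0.35},
    {"time": "2020-07-11", "avg_ndvi": 0.62},
    {"time": "2020-07-21", "avg_ndvi": 0.62},
]

# "-" sort key -> negate the aggregate for descending order;
# "+time" -> ISO 8601 strings compare chronologically as plain text.
ordered = sorted(scenes, key=lambda s: (-s["avg_ndvi"], s["time"]))
print([s["time"] for s in ordered])  # ['2020-07-11', '2020-07-21', '2020-07-01']
```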


ghobona commented May 10, 2022

It would be great to have a list or tree like the one at https://github.com/cportele/ogcapi-building-blocks

This would help to visualise what the building blocks are.


jerstlouis commented May 10, 2022

@ghobona I tried to organize them in a bullet list at the top of this issue.

Most of these building blocks are query parameters:

Analytics Query parameters:

  • properties
    • For the simplest conformance class this is simply for "property selection" (proposed future part of Features) or "range subsetting" (current conformance class of Coverages)
    • For more advanced analytics, it can support complex expressions for Derived Fields, as suggested in DAPA (instead of only identifiers -- those expressions can be very similar to CQL2 expressions, except they can return any type of value, not only a boolean)
      • Then pre-defined aggregation functions can be defined:
        • Same aggregation over all dimensions, or
        • With an extra parameter to specify over which dimensions a particular aggregating function should aggregate
  • filter
    • Support a predicate, e.g. defined with CQL2 (can refer to any queryables: e.g., feature properties, coverage cell range values, scene metadata properties) -- as in Features - Part 3: Filtering
  • sortby
    • In Coverages, together with returning a lower dimensionality than the result set, it can also control which pixels to keep (e.g., least cloudy scene or cells on top)
  • collections
    • In the context of DGGS / Coverages, it would allow using fields (feature properties / coverage data record range values) from multiple collections, including mixed vector/raster collections (much like FROM <tables> in SQL). The fields can then be prefixed with the {collectionId}. to disambiguate them.

Aggregating functions

  • Min()
  • Max()
  • Sum()
  • Avg()
  • StdDev()
  • ... ?
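These pre-defined aggregating functions map naturally onto standard reducers; the dispatch table below is a hypothetical server-side sketch, not part of the proposal (note that StdDev() would also need to pin down population vs. sample standard deviation).

```python
# Sketch: hypothetical dispatch table from proposed aggregating-function
# names to Python's standard reducers.
from statistics import mean, pstdev

AGGREGATES = {
    "Min": min,
    "Max": max,
    "Sum": sum,
    "Avg": mean,
    "StdDev": pstdev,  # population std dev; sample std dev is another option
}

values = [1.0, 2.0, 3.0, 4.0]
print(AGGREGATES["Avg"](values))  # 2.5
```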

Spatiotemporal Subsetting Query parameters:

  • subset
  • datetime
  • bbox

@ghobona
Copy link
Contributor

ghobona commented May 16, 2022

Thanks @jerstlouis !

Cc: @doublebyte1
