Future work: 1st order stats and thresholding from collections/ensembles


1st order stats and thresholding from collections/ensembles

Ensembles

At various meetings, and in the original Charter, it was agreed that support for ensemble-type data would be tackled as a second phase of the EDR API.

It was also agreed that a more sophisticated EDR API might entail processing sets of data, such as an ensemble, rather than just retrieving stored values: e.g. retrieve the min/max/mean/mode value from a datastore, get the quartile/decile/percentile values, or return only those values above/below/between threshold values, etc.
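As an illustration only (none of this is defined EDR API behaviour), the kind of first-order statistics and thresholding described above can be sketched in a few lines of NumPy on a hypothetical ensemble array; all names, shapes and thresholds below are purely illustrative:

```python
import numpy as np

# Hypothetical ensemble of 2 m temperature values, shape (member, y, x)
ens = np.random.default_rng(0).normal(15.0, 2.0, size=(51, 100, 100))

# First-order statistics across the ensemble (member) dimension
ens_min = ens.min(axis=0)
ens_max = ens.max(axis=0)
ens_mean = ens.mean(axis=0)

# Quartile / decile / percentile values
quartiles = np.percentile(ens, [25, 50, 75], axis=0)
deciles = np.percentile(ens, np.arange(10, 100, 10), axis=0)

# Thresholding: fraction of members exceeding 18 degC at each grid point
prob_above_18 = (ens > 18.0).mean(axis=0)
```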

Obviously, some of this can be done in common processing environments such as Python, JavaScript, R, WPS, etc., so there is a debate to be had over what should be 'built in' to the API and what should remain firmly in a general-purpose programming environment.

To this end, below is a list of common data manipulation patterns that might be helpful.

The World Meteorological Organisation Guidelines for ensemble forecasts cover this in Chapter 4.

Common Data Transformations

  • Enrich metadata: adding metadata fields to support downstream data usage

  • Dataset arithmetic: combining datasets with consistent domains using arithmetic functions – enhanced by using unit of measure to assess whether it is sensible to combine those datasets

  • Generate summary statistics: evaluating a standard set of summary statistics for a dataset e.g. max, min, mean, std deviation

  • [simple] Collapse: reducing the dimensionality of a dataset by evaluating a standard set of summary statistics for the values along one axis e.g. converting a {x, y, t} dataset of temperatures to a {x, y} dataset with values of temperature max, min, mean and standard deviation for the time-range of the original dataset (see the sketch after this list)

  • Handle missing data:

    i) converting ‘magic numbers’ to missing data values,
    
    ii) converting data-gaps to missing data values i.e. creating a regular structured dataset for simple processing,
    
    iii) generating statistics on the amount of missing data values in a dataset
    
  • Group: aggregating multiple data fields into a single file

  • Concatenate: combine multiple consistent datasets with a continuous time axis into a single dataset

  • Restructure*: modify how data is persisted to disk to optimise access patterns, e.g. convert a dataset persisted as a set of files, each containing values for all parameters at a given time-step, to a set where each file contains all time-steps for a given parameter

  • Reformat: read a file in format (A) and write in format (B) – offer standard file formats e.g. GRIB2, netCDF, CoverageJSON, [cloud-optimised] GeoTIFF etc.

  • Extract: provide a part of a dataset based on the user’s selected geometry (i.e. point, line, area, volume), time instant / time-range, and/or parameter / parameter set – extract will provide data values from the original dataset without resampling, i.e. point-extract will return the nearest point to the user’s selected position and not interpolate to derive a value at that position

  • Mask: convert a dataset to a mask based on evaluation of a Boolean expression e.g. temperature value exceeds a specified threshold (also shown in the sketch after this list)

  • Visualise: render data into a visual form for display – this may include specifying the coordinate reference system onto which the data is projected

  • Resample/reproject: provide part of a dataset on a new domain based on the user’s selected geometry, time instant / time range, and/or parameter / parameter set – resample will use interpolation to recalculate data values at positions specified in the user’s request e.g. for re-gridding to a standard grid with different resolution or projection [resampling / reprojection needs significant domain insight to use without introducing unwanted errors or side-effects … the goal is to retain the essential scientific characteristics and integrity of the data; post-processing might aim to improve the scientific characteristics]
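A minimal sketch of three of the patterns above – converting a 'magic number' to missing data, a simple collapse of a {x, y, t} dataset along its time axis, and a threshold mask – using xarray on a hypothetical temperature cube; the variable names, dimensions and threshold are illustrative, not part of any API:

```python
import numpy as np
import xarray as xr

# Hypothetical {t, y, x} temperature cube, with -9999 used as a 'magic number'
data = np.random.default_rng(1).normal(15.0, 2.0, size=(24, 50, 50))
data[0, 0, 0] = -9999.0
temp = xr.DataArray(data, dims=("t", "y", "x"), name="temperature")

# Handle missing data: convert the magic number to NaN
temp = temp.where(temp != -9999.0)

# Simple collapse: reduce the t axis to a fixed set of summary statistics
collapsed = xr.Dataset(
    {
        "t_max": temp.max(dim="t"),
        "t_min": temp.min(dim="t"),
        "t_mean": temp.mean(dim="t"),
        "t_std": temp.std(dim="t"),
    }
)

# Mask: Boolean {y, x} field where the time-mean temperature exceeds a threshold
mask = collapsed["t_mean"] > 18.0
```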

Definitions:

Domain – the set of positions in space and/or time for which data values are available

Range – the set of data values

Examples:

Dataset arithmetic – evaluating temperature anomaly by subtracting the long-term mean annual average temperature field from the annual average temperature field for a given year

Dataset arithmetic – evaluating wind speed as √((x wind)² + (y wind)²)
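Both dataset-arithmetic examples can be sketched with plain NumPy, assuming the fields share a consistent domain and unit of measure (all field names and values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Temperature anomaly: both fields must share the same domain and unit (degC)
annual_mean_2020 = rng.normal(15.0, 1.0, size=(100, 100))
long_term_annual_mean = np.full((100, 100), 14.2)
temperature_anomaly = annual_mean_2020 - long_term_annual_mean

# Wind speed from the x and y wind components (both in m/s)
x_wind = rng.normal(0.0, 5.0, size=(100, 100))
y_wind = rng.normal(0.0, 5.0, size=(100, 100))
wind_speed = np.sqrt(x_wind**2 + y_wind**2)
```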

Generalised ‘reshaping’ of datasets is not routinely done – pivoting datasets doesn’t really make sense for geospatial/geotemporal datasets

Generalised merging or joining of datasets is a specialist function

Simple collapse: fix the summary statistics to a standard set to make it simpler for users – avoid them having to provide input parameters

Missing data handling can be complicated and challenging

'*' Restructuring to optimise data access patterns – is this still crucial if the file format supports byte-range access, i.e. you don’t need to read the entire file into memory to extract the required values?
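For illustration, a client that only needs part of a file can ask an HTTP server for a byte range; if the server honours the Range header it returns 206 Partial Content. The URL and offsets below are purely hypothetical:

```python
import requests

url = "https://example.com/data/forecast.grib2"  # hypothetical file

# Ask for just the first 16 KiB, e.g. enough to read an index or header section
resp = requests.get(url, headers={"Range": "bytes=0-16383"}, timeout=30)

if resp.status_code == 206:      # 206 Partial Content: the range was honoured
    header_bytes = resp.content
else:                            # server ignored the Range header entirely
    header_bytes = resp.content[:16384]
```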

Extract – includes operations like slice and trim

Visualise – avoid overloading with style information; app developers can add this afterwards? Key factors: discrete/categorical values versus continuous/field/coverage; scalar, vector or tensor fields; dimensionality 1/2/3/etc. Secondary factors: distribution/variability/scale; others?

Mask – is this really a common operation?

Geometry used for extract / resample can be 0d (point), 1d (line – e.g. vertical profile, time-series, route, trajectory), 2d (area – e.g. horizontal polygon, vertical curtain, time-series for a vertical profile i.e. Hovmoller diagram), 3d (volume – e.g. cube) etc. Some sampling geometries are likely to need resampling (e.g. trajectory) because the original dataset is unlikely to provide data-points whose positions fit on the specified sampling geometry.
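The extract-versus-resample distinction for a trajectory can be sketched with SciPy: 'extract' takes the nearest stored grid value at each trajectory position, while 'resample' interpolates to those positions (the grid and trajectory values below are illustrative):

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Hypothetical gridded field on regular y and x axes
y = np.linspace(0.0, 10.0, 11)
x = np.linspace(0.0, 10.0, 11)
field = np.add.outer(y, x)

# Trajectory sample positions as (y, x) pairs, off the grid nodes
trajectory = np.array([[1.2, 3.7], [4.9, 6.1], [7.3, 2.8]])

# Extract: nearest stored value, no resampling
extracted = RegularGridInterpolator((y, x), field, method="nearest")(trajectory)

# Resample: interpolate to the requested positions
resampled = RegularGridInterpolator((y, x), field, method="linear")(trajectory)
```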

Resampling is very easy to get wrong – it’s more akin to a post-processing activity and needs scientific or domain insight to do correctly
