
Unexpected use of GroupBy's sort_key to label output dimensions when applying custom grouping #1153

Closed
robbibt opened this issue Jul 13, 2021 · 0 comments · Fixed by #1157
Expected behaviour

I would like to use a custom GroupBy object to specify custom sorting when grouping data. For example, when loading datasets from multiple annual geomedian products, I would like datasets to be sorted by Landsat platform (e.g. LANDSAT-5, LANDSAT-7, LANDSAT-8) when data is grouped, so that I can prioritise data from one satellite over the others.

In the example below, I create a GroupBy that uses a custom sort_by_platform function to prioritise data in alphabetical platform order:

def sort_by_platform(ds):
    return ds.metadata.platform

# GroupBy and _extract_time_from_ds are defined in the full
# example under "Steps to reproduce" below
platform_grouper = GroupBy(dimension='time',
                           group_by_func=_extract_time_from_ds,
                           units='seconds since 1970-01-01 00:00:00',
                           sort_key=sort_by_platform)

I expect to be able to use this custom GroupBy to load data with a normal time dimension, with the only difference being the internal sorting used when combining datasets.
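To make the expected semantics concrete, here is a plain-Python sketch (using hypothetical (center_time, platform) tuples in place of real datacube Datasets) of what I mean: grouping and axis labelling by time, with the platform sort only affecting the order within each group:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical stand-ins for datacube Datasets: (center_time, platform) pairs
datasets = [
    (datetime(2015, 1, 1), "LANDSAT_8"),
    (datetime(2015, 1, 1), "LANDSAT_7"),
    (datetime(2016, 1, 1), "LANDSAT_8"),
]

# Group by time (the axis label), then sort *within* each group by platform
groups = defaultdict(list)
for ds in datasets:
    groups[ds[0]].append(ds)
for key in groups:
    groups[key].sort(key=lambda ds: ds[1])

# The output axis should still be labelled by time...
axis = sorted(groups)
assert axis == [datetime(2015, 1, 1), datetime(2016, 1, 1)]

# ...while LANDSAT_7 is preferred over LANDSAT_8 within the 2015 group
assert groups[datetime(2015, 1, 1)][0][1] == "LANDSAT_7"
```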

Actual behaviour

When I supply this custom GroupBy to dc.load, the output of my custom sorting function now replaces the time dimension in my dataset:

[Screenshot: the printed `ds.time` coordinate contains the sort function's output in place of time values]

It does not appear possible to specify custom sorting without the output of this function also being used to relabel the time dimension.

This coupling of the axis values to the group sort order appears to have been flagged previously in this TODO: https://github.com/opendatacube/datacube-core/blob/develop/datacube/api/core.py#L464
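My guess at what is happening (a simplified sketch, not the actual core.py implementation) is that a single key is used both to sort the groups and to label the output axis, so supplying a custom sort_key changes the axis values too:

```python
from datetime import datetime

# Hypothetical stand-ins for datacube Datasets: (center_time, platform) pairs
datasets = [
    (datetime(2015, 1, 1), "LANDSAT_8"),
    (datetime(2015, 1, 1), "LANDSAT_5"),
]

def sort_key(ds):        # my custom sort_by_platform
    return ds[1]

def group_by_func(ds):   # _extract_time_from_ds
    return ds[0]

# Suspected (coupled) behaviour: sort_key both orders the groups
# and supplies the axis values, so the axis holds platform names
coupled_axis = sorted({sort_key(ds) for ds in datasets})
assert coupled_axis == ["LANDSAT_5", "LANDSAT_8"]  # not timestamps!

# Decoupled behaviour would label the axis with group_by_func instead
decoupled_axis = sorted({group_by_func(ds) for ds in datasets})
assert decoupled_axis == [datetime(2015, 1, 1)]
```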

Steps to reproduce the behaviour

import datacube
import datetime
import collections
from datacube.model import Dataset
from datacube.utils.dates import normalise_dt

dc = datacube.Datacube()

#################
# Find datasets #
#################

# Set query
query = {
    'x': (146.592962, 146.712962),
    'y': (-38.80321, -38.683209999999995),
    'time': ('1987', '2020')
}

# Find datasets for all three geomedians
dss_ls5 = dc.find_datasets(product='ls5_nbart_geomedian_annual', **query)
dss_ls7 = dc.find_datasets(product='ls7_nbart_geomedian_annual', **query)
dss_ls8 = dc.find_datasets(product='ls8_nbart_geomedian_annual', **query)

##################
# Custom GroupBy #
##################

def _extract_time_from_ds(ds: Dataset) -> datetime.datetime:
    return normalise_dt(ds.center_time)

def sort_by_platform(ds):
    return ds.metadata.platform


GroupBy = collections.namedtuple(
    'GroupBy', ['dimension', 'group_by_func', 'units', 'sort_key'])
platform_grouper = GroupBy(dimension='time',
                           group_by_func=_extract_time_from_ds,
                           units='seconds since 1970-01-01 00:00:00',
                           sort_key=sort_by_platform)

#############
# Load data #
#############

ds = dc.load(datasets=dss_ls5 + dss_ls7 + dss_ls8,
             measurements=['swir1'],
             dask_chunks={},
             group_by=platform_grouper,
             **query)

ds.time

Environment information

  • Which datacube --version are you using? '1.8.4.dev81+g80d466a2'
  • What datacube deployment/environment are you running against? DEA Sandbox, standard image