
Mapping from `Dataset` to `Storage Unit` is poorly defined when using netcdf storage #415

Closed
Kirill888 opened this issue Apr 11, 2018 · 8 comments

Comments

@Kirill888
Contributor

commented Apr 11, 2018

When using netcdf files for storage we often "stack" a year's worth of observations into one netcdf file. As a result, multiple dc.Dataset objects point to the same file on disk. Since a dc.Dataset represents a single point in time, only one slice along the time dimension "belongs" to a given dataset. Which slice that is is poorly defined.

We do not store the time slice index anywhere in the dc.Dataset metadata document. Instead, the center_time property is used to "locate" the index that "matches best".

The "matches best" logic is fairly involved, see:

def get_bandnumber(self, src):
    # If `band` property is set to an integer it overrides any other logic
    band = self._measurement.get('band')
    if band is not None:
        if isinstance(band, integer_types):
            return band
        else:
            _LOG.warning('Expected "band" property to be of integer type')

    if 'netcdf' not in self._dataset.format.lower():
        layer_id = self._measurement.get('layer', 1)
        return layer_id if isinstance(layer_id, integer_types) else 1

    tag_name = GDAL_NETCDF_DIM + 'time'
    if tag_name not in src.tags(1):  # TODO: support time-less datasets properly
        return 1

    time = self._dataset.center_time
    sec_since_1970 = datetime_to_seconds_since_1970(time)

    idx = 0
    dist = float('+inf')
    for i in range(1, src.count + 1):
        v = float(src.tags(i)[tag_name])
        if abs(sec_since_1970 - v) < dist:
            idx = i
            dist = abs(sec_since_1970 - v)
    return idx

It cannot use exact equality since center_time is a datetime.datetime, which can contain a sub-second component, while what is stored on disk is a float64 of seconds since 1970, always rounded to whole seconds (I have only observed it being rounded down, but who knows, it happens inside the netcdf library). By the way, this matching is repeated for every band being loaded, even though all bands share the same time slice index.
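For illustration, a tiny standard-library example with a made-up timestamp (the arithmetic below is mine, not the library's):

import datetime
import numpy as np

# A center_time carrying a sub-second component...
center_time = datetime.datetime(2018, 4, 11, 1, 2, 3, 700000,
                                tzinfo=datetime.timezone.utc)
center_time.timestamp()                                     # 1523408523.7

# ...versus the value stored on disk, truncated to whole seconds
np.datetime64('2018-04-11T01:02:03.700').astype('<M8[s]').astype('double')
# 1523408523.0 -> exact equality fails, hence the nearest-match search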

It also means that the file needs to be opened before one can figure out which part of the file is to be read. This is a problem for optimized read scheduling.

@Kirill888

Contributor Author

commented Apr 11, 2018

Correction: the loss of sub-second precision on the time axis happens here:

if data.dtype.kind == 'M':
    return data.astype('<M8[s]').astype('double')

So, as long as the numpy conversion from dtype('<M8[ns]') to dtype('<M8[s]') is consistent across versions, we can use exact comparison to locate the time slice. But really we should just store the index somewhere in the dataset metadata document.

@Kirill888

Contributor Author

commented Apr 17, 2018

After some discussions there is an extra constraint to consider:

  • The stacking process does not create new datasets and does not modify existing dataset metadata documents; instead it just changes the dataset location (stored in a separate table) by adding a new location pointing at the stacked netcdf file and marking the previous, non-stacked location as "archived".

This means that the "obvious solution" of adding an extra field to a dataset's metadata document to indicate which slice of a stacked netcdf file it occupies is not ideal, as it does not support the "stacking as a location change" mode of operation of the current stacker.

There seems to be agreement that the current "fuzzy matching" approach needs fixing: the location of a dataset within a stacked netcdf should be clearly defined as an index. Even if we replace "fuzzy matching" with "exact matching", there are still problems:

  • What happens when the timestamp is missing?
  • What happens when there are duplicate timestamps in the file?
  • The file still needs to be opened before we know which parts of it to read

There were several proposed solutions:

  1. Add an optional metadata column to the dataset_location table (json format); driver-specific data could then be stored there, including the time index as well as other data like dimensions/chunking. Requires a schema migration and code changes.

  2. Add a "fragment" part to the file uri carrying driver-specific data, for example file:///stacked.nc#tidx=20; it is then up to the driver to interpret the part following #. This approach doesn't require a schema migration and can be made backward compatible. We just need to be careful in the code that maps a URI to a file to make sure the fragment part is discarded (see the sketch at the end of this comment).

I personally think the approach in (2.) is quite reasonable. It gives a clear indication that only part of the file is "claimed" by a given dataset, it does not require a db migration, and a tool to update locations for existing stacked datasets can be implemented easily.
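A minimal sketch of option 2 using only the standard library; the tidx key name here is just a placeholder, not a decided format:

from urllib.parse import parse_qs

location = 'file:///stacked.nc#tidx=20'

file_uri, _, fragment = location.partition('#')   # 'file:///stacked.nc' is what gets opened
driver_info = parse_qs(fragment)                  # {'tidx': ['20']} is interpreted by the driver
time_index = int(driver_info['tidx'][0])          # 20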

@omad

Member

commented Apr 18, 2018

Agreed on needing to fix the fuzzy matching of times; it's way too likely to cause incorrect data to be loaded, or other invisible bugs!

Adding a URL Fragment sounds like a pragmatic and effective solution. Location metadata would be overly complex.

I thought the NetCDF Subset Service might be a good example to follow for URL fragments, but it uses a nearest time search too! For a web service though, I think it makes sense. For an internal data archive, not at all.

@Kirill888

Contributor Author

commented Apr 19, 2018

Looks like the code already supports having a fragment in the location without any problems:

from datacube.utils import uri_to_local_path
from datacube.storage.storage import _resolve_url

display(uri_to_local_path('file:///some/path/file.nc#tidx=10'))
display(_resolve_url('s3://some-bucket/path/file.json#param=1', 'a.tif'))

generates

PosixPath('/some/path/file.nc')
's3://some-bucket/path/a.tif'

So just updating the existing location with an extra fragment part #{metadata-in-some-driver-specific-format} will not break any logic for locating the file on disk and will not interfere with relative path resolution for other url types.

The question remains what format to choose for the fragment part:

  1. Simplest: #time_index_as_int, e.g. #23, but least extensible
  2. Query-string format: #key=value&key2=value2, e.g. #time_idx=11

There is also the question of whether to use 0- or 1-based indexing: rasterio uses 1-based indexing, while netcdf is 0-based.
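A rough sketch of what parsing either format could look like; the time_idx key name and the 0-based convention are assumptions, not decisions:

from urllib.parse import parse_qs

def time_index_from_fragment(fragment):
    # Option 1: bare integer, e.g. '23'
    if fragment.isdigit():
        return int(fragment)
    # Option 2: query-string style, e.g. 'time_idx=11' (extensible to more keys)
    params = parse_qs(fragment)
    return int(params['time_idx'][0]) if 'time_idx' in params else None

tidx = time_index_from_fragment('time_idx=11')   # 11, 0-based by assumption
band = tidx + 1                                  # rasterio band numbers are 1-based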

@omad

Member

commented Apr 19, 2018

Sweet!

I think we need extensibility here; hard-coding fragment = time would be a mistake. Let's use a query-string format.

Personally I prefer 0-based indexing, but I don't have a strong argument either way. We will just have to be very careful to use the right base with each library. It's enough to make selecting by time somewhat appealing again...

@Kirill888

Contributor Author

commented Apr 19, 2018

I think in this case 0-based indexing is less surprising, since the interpretation will be "netcdf" driver specific; the fact that the netcdf driver is currently implemented by the rasterio-based catch-all driver is just an implementation detail that might change in the future.

@Kirill888

Contributor Author

commented May 15, 2018

As part of this change we might need to change the behaviour of the read_documents function:

def read_documents(*paths):
    """
    Read & parse documents from the filesystem (yaml or json).
    Note that a single yaml file can contain multiple documents.

    This function will load any dates in the documents as strings. In
    the datacube we use JSON in PostgreSQL and it will turn our dates
    to strings anyway.

    :type paths: pathlib.Path
    :rtype: tuple[(pathlib.Path, dict)]
    """

Currently it produces a generator of (path, parsed_yaml_doc) from a bunch of filesystem paths. I think it should return a url instead of just a plain path, and this url should include a fragment part in the case of a stacked netcdf file, and arguably for multi-part yaml documents also. Switching to urls will also allow future extension to non-local sources like http:// or s3://.

Most users of this function ignore the path output anyway, or use it for logging purposes only. As far as I can tell only dataset update/add use the returned path for anything but logging, and the first thing they do is convert it to a uri:

for metadata_path, metadata_doc in read_documents(metadata_path):
    uri = metadata_path.absolute().as_uri()
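Something along these lines could work; read_documents is the existing helper, while the wrapper and the #part= fragment key below are made up purely for illustration:

from datacube.utils import read_documents

def read_documents_as_uris(path):
    # Hypothetical wrapper: yield a uri per document, with a fragment
    # identifying the document's position within a multi-document file.
    for idx, (doc_path, doc) in enumerate(read_documents(path)):
        yield '{}#part={}'.format(doc_path.absolute().as_uri(), idx), doc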

@Kirill888

Contributor Author

commented May 15, 2018

Thinking more about this, read_documents should also accept urls as input, so that the user can point to a specific dataset in a multi-dataset file.
