EO Datasets 3

EO Datasets aims to be the easiest way to write, validate and convert dataset imagery and metadata for the Open Data Cube.

There are two major tools for creating datasets:

  1. DatasetAssembler<assembling_metadata>, for writing a full dataset package: the metadata document, COG imagery, thumbnails, checksum files, etc.
  2. DatasetPrepare<preparing_metadata>, for preparing only a metadata document, referencing existing imagery and files.

Their APIs are the same, except the assembler adds methods named write_* for writing new files.

Assemble a Dataset Package

Here's a simple example of creating a dataset package with one measurement (called "blue") from an existing image.

The measurement is converted to a COG image when written to the package:
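A minimal sketch of that API (the input image and collection paths here are hypothetical):

from datetime import datetime
from pathlib import Path

from eodatasets3 import DatasetAssembler

# Hypothetical input image and output collection:
blue_geotiff_path = Path("/some/input/blue_band.tif")
collection = Path("/some/output/collection")

with DatasetAssembler(collection, naming_conventions="default") as p:
    # Some basic metadata (enough for default name generation):
    p.product_family = "blues"
    p.datetime = datetime(2019, 7, 4, 13, 7, 5)
    p.processed_now()

    # The image is converted to a COG as it is written into the package:
    p.write_measurement("blue", blue_geotiff_path)

    # Write the metadata document, checksum file, etc.
    dataset_id, metadata_path = p.done()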

Writing only a metadata doc

(ie. "I already have appropriate imagery files!")

Example of generating a metadata document with DatasetPrepare <eodatasets3.DatasetPrepare>:
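A minimal sketch with hypothetical paths, using note_measurement to reference the imagery in place:

from datetime import datetime
from pathlib import Path

from eodatasets3 import DatasetPrepare

# Hypothetical: a folder of existing imagery.
dataset_folder = Path("/data/LC08_L1TP_090084_20160121")

p = DatasetPrepare(metadata_path=dataset_folder / "odc-metadata.yaml")
p.product_family = "level1"
p.datetime = datetime(2016, 1, 21, 23, 59, 59)
p.processed_now()

# Reference the existing image; nothing is copied or rewritten.
p.note_measurement("red", dataset_folder / "band4.tif")

# Validate and write the metadata document.
p.done()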

Custom properties can also be set directly on .properties:

p.properties['fmask:cloud_cover'] = 34.0

And known properties are automatically normalised:

p.platform = "LANDSAT_8"  # to: 'landsat-8'
p.processed = "2016-03-04 14:23:30Z"  # into a date.
p.maturity = "FINAL"  # lowercased
p.properties["eo:off_nadir"] = "34"  # into a number

Including provenance

Most datasets are processed from an existing input dataset and have the same spatial information as the input. We can record them as source datasets, and the assembler can optionally copy any common metadata automatically:

collection = Path('/some/output/collection/path')
with DatasetAssembler(collection) as p:
    # We add a source dataset, asking to inherit the common properties
    # (eg. platform, instrument, datetime)
    p.add_source_path(level1_ls8_dataset_path, auto_inherit_properties=True)

    # Set our product information.
    # It's a GA product of "numerus-unus" ("the number one").
    p.producer = "ga.gov.au"
    p.product_family = "numerus-unus"
    p.dataset_version = "3.0.0"

    ...

In these situations, we often write our new pixels as a numpy array, inheriting the existing grid spatial information <eodatasets3.GridSpec> of our input dataset:

# Write a measurement from a numpy array, using the source dataset's grid spec.
p.write_measurement_numpy(
    "water",
    my_computed_numpy_array,
    GridSpec.from_dataset_doc(source_dataset),
    nodata=-999,
)

Other ways to reference your source datasets (both are sketched below):

  • As an in-memory DatasetDoc <eodatasets3.DatasetDoc> using p.add_source_dataset() <eodatasets3.DatasetPrepare.add_source_dataset>
  • Or as raw uuids, using p.note_source_datasets() <eodatasets3.DatasetPrepare.note_source_datasets> (without property inheritance)
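For instance, a rough sketch of both (the path, classifier and uuid below are purely illustrative):

from pathlib import Path
from uuid import UUID

from eodatasets3 import serialise

# As an in-memory DatasetDoc, loaded from an existing metadata file:
source_doc = serialise.from_path(Path("/data/level1/odc-metadata.yaml"))
p.add_source_dataset(source_doc, auto_inherit_properties=True)

# Or as raw uuids under a classifier, without property inheritance:
p.note_source_datasets("level1", UUID("11111111-2222-3333-4444-555555555555"))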

Creating documents in-memory

You may want to assemble metadata entirely in memory without touching the filesystem.

To do this, prepare a dataset as normal. You still need to give a dataset location, as paths in the document will be relative to this location:


from pathlib import Path
import tempfile

from affine import Affine
from rasterio.crs import CRS

from eodatasets3 import GridSpec

tmp_path = Path(tempfile.mkdtemp())

grid_spec = GridSpec(
    shape=(7721, 7621),
    transform=Affine(30.0, 0.0, 241485.0, 0.0, -30.0, -2281485.0),
    crs=CRS.from_epsg(32656),
)

dataset_location = tmp_path / 'test_dataset'
measurement_path = dataset_location / "our_image_dont_read_it.tif"


>>> from eodatasets3 import DatasetPrepare >>> >>> p = DatasetPrepare(dataset_location=dataset_location) >>> p.datetime = datetime(2019, 7, 4, 13, 7, 5) >>> p.product_name = "loch_ness_sightings" >>> p.processed = datetime(2019, 7, 4, 13, 8, 7)

Normally when a measurement is added, the image is opened to read its grid and size information. You can avoid this read by giving a GridSpec <eodatasets3.GridSpec> yourself (see the GridSpec doc <eodatasets3.GridSpec> for creation):


>>> p.note_measurement( ... "blue", ... measurement_path, ... # We give it grid information, so it doesn't have to read it itself. ... grid=grid_spec, ... # And the image pixels, since we are letting it calculate our geometry. ... pixels=numpy.ones((60, 60), numpy.int16), ... nodata=-1, ... )

Note

If you're writing your own image files manually, you may still want to use eodatasets3's name generation. You can ask for suitable paths from p.names <eodatasets3.DatasetPrepare.names>; see the naming section<names_n_paths> for more examples.
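For instance, using the same attributes shown in the naming section:

# Ask the namer for suitable file names:
water_filename = p.names.measurement_filename('water')
metadata_filename = p.names.metadata_file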

Now finish it as a DatasetDoc <eodatasets3.DatasetDoc>:


>>> dataset = p.to_dataset_doc()

You can now use serialise functions<serialise_explanation> on the result yourself, such as conversion to a dictionary:


>>> from eodatasets3 import serialise >>> doc: dict = serialise.to_doc(dataset) >>> doc['label'] 'loch_ness_sightings_2019-07-04'

Or convert it to formatted YAML: serialise.to_path(path, dataset) <eodatasets3.serialise.to_path> or serialise.to_stream(stream, dataset) <eodatasets3.serialise.to_stream>.
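For instance, writing to a (hypothetical) output path:

from pathlib import Path
from eodatasets3 import serialise

serialise.to_path(Path('/tmp/dataset.odc-metadata.yaml'), dataset)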

Avoiding geometry calculation

Datasets include a geometry field, which shows the coverage of valid data pixels of all measurements.

By default, the assembler creates this geometry for you: it reads the pixels from your measurements and calculates a valid-data geometry on completion.

This can be configured by setting the p.valid_data_method <eodatasets3.DatasetPrepare.valid_data_method> to a different ValidDataMethod<eodatasets3.ValidDataMethod> value.
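For example, a sketch (the bounds member name is an assumption here; see the ValidDataMethod documentation for the exact values in your version):

from eodatasets3 import ValidDataMethod

# Use a simple bounding geometry rather than tracing valid pixels:
p.valid_data_method = ValidDataMethod.bounds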

But you may want to avoid these reads and calculations entirely, in which case you can set a geometry yourself:

p.geometry = my_shapely_polygon

Or copy it from one of your source datasets when you add your provenance (if it has the same coverage):

p.add_source_path(source_path, inherit_geometry=True)

If you do this before you note measurements, it will not need to read any pixels from them.

Generating names & paths alone

You can use the naming module alone to find file paths:

import eodatasets3
from pathlib import Path
from eodatasets3 import DatasetDoc

Create some properties:

d = DatasetDoc()
d.platform = "sentinel-2a"
d.product_family = "fires"
d.datetime = "2018-05-04T12:23:32"
d.processed_now()

# Arbitrarily set any properties.
d.properties["fmask:cloud_shadow"] = 42.0
d.properties.update({"odc:file_format": "GeoTIFF"})

Note

You can use a plain dict if you prefer. But we use a DatasetDoc() <eodatasets3.DatasetDoc> here, which has convenience methods similar to DatasetAssembler <eodatasets3.DatasetAssembler> for building properties.

Now create a namer instance with our properties (and chosen naming conventions):

names = eodatasets3.namer(d, conventions="default")

We can see some generated names:

print(names.metadata_file)
print(names.measurement_filename('water'))
print()
print(names.product_name)
print(names.dataset_folder)

Output:

s2a_fires_2018-05-04.odc-metadata.yaml s2a_fires_2018-05-04_water.tif

s2a_fires s2a_fires/2018/05/04

In reality, these paths go within a location (folder, s3 bucket, etc.) somewhere.

This location is called the collection_prefix, and we can create our namer with one:

collection_path = Path('/datacube/collections')

names = eodatasets3.namer(d, collection_prefix=collection_path)

print("The dataset location is always a URL:") print(names.dataset_location)

print()

a_file_name = names.measurement_filename('water') print(f"We can resolve our previous file name to a dataset URL:") print(names.resolve_file(a_file_name))

print()

print(f"Or a local path (if it's file://):") print(repr(names.resolve_path(a_file_name)))

from eodatasets3 import DatasetAssembler
import tempfile

collection_path = Path(tempfile.mkdtemp())
names.collection_prefix = collection_path.as_uri()

We could now start assembling some metadata if our dataset doesn't exist, passing it our existing fields:

# Our dataset doesn't exist?
if not names.dataset_path.exists():
    with DatasetAssembler(names=names) as p:
        # The properties are already set, thanks to our namer.
        ...  # Write some measurements here, etc!
        p.done()

# It exists!
assert names.dataset_path.exists()

Naming things yourself

Names and paths are only auto-generated if they have not been set manually by the user.

You can set properties yourself on the NamingConventions <eodatasets3.NamingConventions> to avoid automatic generation (or to avoid their finicky metadata requirements).


from eodatasets3 import DatasetPrepare
from pathlib import Path
import tempfile

collection_path = Path(tempfile.mkdtemp())


>>> p = DatasetPrepare(collection_path) >>> p.platform = 'sentinel-2a' >>> p.product_family = 'ard' >>> # The namer will generate a product name: >>> p.names.product_name 's2a_ard' >>> # Let's customise the generated abbreviation: >>> p.names.platform_abbreviated = "s2" >>> p.names.product_name 's2_ard'

See more examples in the assembler .names <eodatasets3.DatasetPrepare.names> property.

Separating metadata from data

(Or. "I don’t want to store ODC metadata alongside my data!")

You may want your data to live in a different location to your ODC metadata files, or even not store metadata on disk at all. But you still want it to be easily indexed.

To do this, the done() methods take an embed_location=True argument. This tells the assembler to embed the dataset_location in the output document.

For example:
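(A sketch with hypothetical paths: the metadata document is written locally, while the dataset location points at imagery on s3.)

from pathlib import Path
from eodatasets3 import DatasetPrepare

p = DatasetPrepare(
    dataset_location="s3://dea-public-data/year-summary/2010/",
    metadata_path=Path("/tmp/my-dataset.odc-metadata.yaml"),
)
...  # set properties and note measurements as usual
p.done(embed_location=True)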

Now our dataset location is included in the document:
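Abridged, with the hypothetical values from the sketch above:

location: s3://dea-public-data/year-summary/2010/
...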

Now ODC will ignore the actual location of the metadata file we are indexing, and use the embedded s3 location instead.

Understanding Locations

When writing an ODC dataset, there are two important locations that the assembler needs to know: where the metadata file will go, and the "dataset location".

Note

In ODC, all file paths in a dataset are computed relative to the dataset_location.

Examples

Dataset Location                                        Path                                               Result
======================================================  =================================================  ======================================================
s3://dea-public-data/year-summary/2010/                 water.tif                                          s3://dea-public-data/year-summary/2010/water.tif
s3://dea-public-data/year-summary/2010/                 bands/water.tif                                    s3://dea-public-data/year-summary/2010/bands/water.tif
file:///rs0/datacube/LS7_NBAR/10_-24/odc-metadata.yaml  v1496652530.nc                                     file:///rs0/datacube/LS7_NBAR/10_-24/v1496652530.nc
file:///rs0/datacube/LS7_NBAR/10_-24/odc-metadata.yaml  s3://dea-public-data/year-summary/2010/water.tif   s3://dea-public-data/year-summary/2010/water.tif

You can specify both of these paths if you want:

with DatasetPrepare(dataset_location=..., metadata_path=...):
    ...

But you usually don't need to give them explicitly. They will be inferred if missing.

  1. If you only give a metadata path, the dataset location will be the same:

    metadata_path             = "file:///tmp/ls7_nbar_20120403_c1/my-dataset.odc-metadata.yaml"
    inferred_dataset_location = "file:///tmp/ls7_nbar_20120403_c1/my-dataset.odc-metadata.yaml"
  2. If you only give a dataset location, a metadata path will be created as a sibling file with an .odc-metadata.yaml suffix in the same "directory" as the location:

    dataset_location       = "file:///tmp/ls7_nbar_20120403_c1/my-dataset.tar"
    inferred_metadata_path = "file:///tmp/ls7_nbar_20120403_c1/my-dataset.odc-metadata.yaml"
  3. ... or you can give neither of them, and they will be generated from a base collection_path:

    collection_path           = "file:///collections"
    inferred_dataset_location = "file:///collections/ga_s2am_level4_3/023/543/2013/02/03/"
    inferred_metadata_path    = "file:///collections/ga_s2am_level4_3/023/543/2013/02/03/ga_s2am_level4_3-2-3_023543_2013-02-03_final.odc-metadata.yaml"

Note

For local files, you can give the location as a pathlib.Path, and it will internally be converted into a URL for you.

In the third case, the folder and file names are generated from your metadata properties and chosen naming convention. You can also set folders, files and parts yourself <eodatasets3.DatasetPrepare.names>.

Specifying a collection path:

with DatasetPrepare(collection_path=collection, naming_conventions='default'):
    ...

Let's print out a table of example default paths for each built-in naming convention:
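A rough sketch of how such a table can be produced, reusing the d from the naming examples above (only the 'default' and 'dea' convention names are assumed here; stricter conventions raise an error when mandatory properties are missing):

for conventions in ("default", "dea"):
    try:
        names = eodatasets3.namer(d, conventions=conventions)
        names.collection_prefix = "file:///collections"
        print(f"{conventions}: {names.dataset_location}")
    except ValueError as error:
        # eg. 'dea' mandates extra fields such as producer and dataset_version.
        print(f"{conventions}: {error}")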

Result:

Note

The default conventions look the same as dea here, but dea is stricter in its mandatory metadata fields (following policies within the organisation).

With default you can leave out many more properties from your metadata, and they will simply not be included in the generated paths.

Dataset Prepare class reference

eodatasets3.DatasetPrepare

Dataset Assembler class reference

eodatasets3.DatasetAssembler

Reading/Writing YAMLs

Methods for parsing and outputting EO3 docs as an eodatasets3.DatasetDoc.

Parsing

eodatasets3.serialise.from_path

eodatasets3.serialise.from_doc

Writing

eodatasets3.serialise.to_path

eodatasets3.serialise.to_stream

eodatasets3.serialise.to_doc

Name Generation API

You may want to use the name generation alone, for instance to tell if a dataset has already been written before you assemble it.
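For instance, a sketch reusing the d and collection_path from the naming section above:

names = eodatasets3.namer(d, conventions="default")
names.collection_prefix = collection_path.as_uri()

# Skip assembly if this dataset has already been written:
if names.dataset_path.exists():
    print("Already exists:", names.dataset_path)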

eodatasets3.namer

eodatasets3.NamingConventions

EO Metadata API

eodatasets3.properties.Eo3Interface

Misc Types

eodatasets3