
Define and Store metadata in STAC #51

Closed
ymoisan opened this issue Feb 5, 2019 · 9 comments
Labels
question Further information is requested

Comments

@ymoisan
Contributor

ymoisan commented Feb 5, 2019

Currently, training is performed on a list of GeoTIFF input images using reference data in GeoPackage files. That list of inputs is stored in CSV files. As for the results, we store only the weights of our model (a .pth file).

To make our models interoperable, we need to write out the model definition together with its weights; those items are our final shareable outputs. Also, if we want to check whether a particular dataset is amenable to inference with a given model, we need to store all inputs somewhere.

Initially we thought of using HDF to store both the inputs to and outputs of our models. It now appears one of the STAC extensions might be a more logical approach, as STAC is much more web-friendly than HDF.

@mpelchat04
Collaborator

Mandatory information to store with the model, for re-usability (see the sketch after these lists):

  • Weights (.pth)
  • Model definition (e.g. Unet model)
  • Task type (e.g. classification or semantic segmentation)
  • Number of classes and their definitions (e.g. 1 - Vegetation, 2 - Lake, 3 - Building, etc.)
  • Number of bands used for training and their definitions (e.g. 4 bands: R-G-B-NIR);
    • The definition should describe the source of each band:
      • Sensor type (e.g. Satellite, LiDAR, aerial photos, radar, etc.)
      • Acquisition date
      • Wavelength (if applicable)
      • Preprocessing (if applicable)
  • Spatial resolution at which the training was conducted
  • Geographic location where the training/validation and tests were conducted. (e.g. bounding box or footprint, maybe?)

Optional information to store:

  • Training and validation accuracy
  • Training parameters (e.g. learning rate, number of epochs, class weights, etc.)
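
A minimal sketch of how this metadata could travel with the checkpoint, assuming a PyTorch workflow; all field names and values below are illustrative placeholders, not an agreed-upon schema:

```python
# Hypothetical sketch: bundle the metadata listed above with the weights.
import torch
import torch.nn as nn

model = nn.Conv2d(4, 3, kernel_size=1)  # stand-in for the real Unet

metadata = {
    "model_definition": "Unet",
    "task": "semantic segmentation",
    "classes": {1: "Vegetation", 2: "Lake", 3: "Building"},
    "bands": [
        {"name": "R", "sensor": "aerial photo", "acquisition_date": "2017-08-31"},
        {"name": "G", "sensor": "aerial photo", "acquisition_date": "2017-08-31"},
        {"name": "B", "sensor": "aerial photo", "acquisition_date": "2017-08-31"},
        {"name": "NIR", "sensor": "aerial photo", "acquisition_date": "2017-08-31"},
    ],
    "spatial_resolution_m": 0.2,                # example value
    "extent_wgs84": [-71.4, 46.7, -71.1, 46.9], # bounding box of training areas (example)
    # optional
    "metrics": {"train_acc": None, "val_acc": None},
    "training_params": {"learning_rate": 1e-4, "epochs": None, "class_weights": None},
}

torch.save({"state_dict": model.state_dict(), "metadata": metadata}, "unet_rgbnir.pth")
```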

@ymoisan
Contributor Author

ymoisan commented Feb 7, 2019

A nice way of validating whether inputs are applicable to a given model is to implement the check as a decorator: see "input validation" in A comprehensive guide to putting a machine learning model in production using Flask, Docker, and Kubernetes.
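
For illustration, a minimal sketch of that decorator idea, assuming inputs arrive as (bands, H, W) arrays; the function and parameter names are made up:

```python
# Hypothetical input-validation decorator: reject inputs whose band count
# does not match what the stored metadata says the model was trained on.
import functools
import numpy as np

def validate_inputs(expected_bands):
    def decorator(infer_fn):
        @functools.wraps(infer_fn)
        def wrapper(image, *args, **kwargs):
            if image.ndim != 3 or image.shape[0] != expected_bands:
                raise ValueError(
                    f"Expected a (bands, H, W) array with {expected_bands} bands, "
                    f"got shape {image.shape}"
                )
            return infer_fn(image, *args, **kwargs)
        return wrapper
    return decorator

@validate_inputs(expected_bands=4)
def run_inference(image):
    return image.mean()  # placeholder for the real model call

run_inference(np.zeros((4, 256, 256)))    # passes
# run_inference(np.zeros((3, 256, 256)))  # raises ValueError
```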

@ymoisan
Contributor Author

ymoisan commented Feb 8, 2019

If we wanted to devise some kind of standard for model interoperability around HDF5, we would likely come up with an HDF5 product definition. Interesting excerpts from [HDF Product Designer](https://wiki.earthdata.nasa.gov/display/HPD/HDF+Product+Designer):

> The Hierarchical Data Format (HDF5) provides a flexible container that supports groups and datasets, each of which can have attributes. In many ways, HDF5 is similar to a directory structure in a file and, like directory structures, the same data can be structured and annotated in many ways. This flexibility empowers HDF5 users to arrange data in ways that make sense to them. However, it can make it difficult to share data ...
> Many communities have successfully addressed this problem by creating conventional structures and annotations for data in HDF5. This approach depends on data files (e.g., products) that carefully follow these conventions.
> A HDF5 product is the content that should exist in a single HDF5 file.
> This content is defined by the HDF5 objects (groups, attributes, datasets), their names, the hierarchies they create (links and references), and attribute values. Dataset values are typically not stored in such files (unless they qualify as metadata) thus this software cannot be used as a data server. Once completed, a HDF5 product is replicated in many files (commonly on the order of tens of thousands or more) and filled with real data.

How would the use of HDF5 help us in forming totally independent DL containers that would contain all the information needed for interoperability? Could we implement something in relation to "standardised environments" as per OGC Testbed 14?
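
To make the idea concrete, here is a rough sketch of what such an HDF5 "model product" layout could look like with h5py; the group and attribute names are purely illustrative, not a proposed convention:

```python
# Hypothetical HDF5 "model product": a fixed hierarchy of groups and
# attributes holding the metadata, with the weights stored as datasets.
import h5py
import numpy as np

with h5py.File("model_product.h5", "w") as f:
    meta = f.create_group("metadata")
    meta.attrs["model_definition"] = "Unet"
    meta.attrs["task"] = "semantic segmentation"
    meta.attrs["classes"] = "1:Vegetation;2:Lake;3:Building"
    meta.attrs["bands"] = "R,G,B,NIR"
    meta.attrs["spatial_resolution_m"] = 0.2

    weights = f.create_group("weights")
    # one dataset per parameter tensor (dummy array here)
    weights.create_dataset("encoder.conv1.weight", data=np.zeros((64, 4, 3, 3)))
```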

@ymoisan
Contributor Author

ymoisan commented Feb 8, 2019

How well does HDF5 play with Big Data infrastructures and OGC services like WCS? Could the H5Server be useful?

@ymoisan ymoisan added the P2 Medium priority label Feb 12, 2019
@ymoisan
Contributor Author

ymoisan commented Apr 5, 2019

Could we integrate STAC fields?

@mpelchat04 mpelchat04 added P1 High priority and removed P2 Medium priority labels Jun 12, 2019
@mpelchat04 mpelchat04 added this to the V1.1 milestone Jun 12, 2019
@ymoisan
Contributor Author

ymoisan commented Jul 16, 2019

deepdish? torch hdf5?

@ymoisan
Contributor Author

ymoisan commented Aug 1, 2019

The EO profile of STAC includes items such as sun azimuth and elevation: https://github.com/radiantearth/stac-spec/blob/master/extensions/eo/schema.json. Type 20170831_162740_ssc1d1 in your browser search bar and you'll end up here:

[screenshot of the resulting STAC Item page]

All we need is there...

I suggest we investigate creating STAC Items of the label extension type (a rough sketch is below). Note: models per se are not STAC Items for now. I think there is an opportunity for us to think about how we could make that happen.
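
As a starting point, a rough sketch of what a label-extension Item describing one of our GeoPackage reference layers might look like; the field names follow my reading of the label extension spec, and the ID, path, date, and extent are made up:

```python
# Hypothetical STAC Item using the label extension for a training-label layer.
import json

label_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "stac_extensions": [
        "https://stac-extensions.github.io/label/v1.0.1/schema.json"
    ],
    "id": "training-labels-example",
    "bbox": [-71.4, 46.7, -71.1, 46.9],
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [-71.4, 46.7], [-71.1, 46.7], [-71.1, 46.9],
            [-71.4, 46.9], [-71.4, 46.7]
        ]],
    },
    "properties": {
        "datetime": "2017-08-31T16:27:40Z",
        "label:description": "Reference polygons used for training",
        "label:type": "vector",
        "label:properties": ["class"],
        "label:classes": [
            {"name": "class", "classes": ["Vegetation", "Lake", "Building"]}
        ],
        "label:tasks": ["segmentation"],
    },
    "links": [],
    "assets": {
        "labels": {
            "href": "labels.gpkg",
            "type": "application/geopackage+sqlite3",
        }
    },
}

print(json.dumps(label_item, indent=2))
```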

@valhassan valhassan changed the title Store model definitions and metadata in HDF Define and Store metadata in STAC Aug 11, 2020
@ymoisan ymoisan added question Further information is requested and removed P1 High priority labels Aug 11, 2020
@mpelchat04 mpelchat04 removed this from the V1.1 milestone Aug 12, 2020
@CharlesAuthier
Collaborator

@mpelchat04 is this something that we still want to do?

@remtav
Collaborator

remtav commented May 4, 2023

Work is ongoing to develop a STAC extension applied to models. The GDL team will check on this as the extension is developed. We will close the issue for now.

@remtav remtav closed this as completed May 4, 2023