MLHub dataset #406

calebrob6 · 2022-02-16T15:11:03Z

(this is similar in spirit to #403)

Radiant Earth's MLHub is a repository for geospatial ML datasets. We have already created VisionDatasets for a few of the datasets that they host (we use their API to download the entire dataset archive to local disk), e.g.: NASAMarineDebris, BeninSmallHolderCashews, CV4AKenyaCropType, TropicalCycloneWindEstimation.

In addition to storing the entire archive of a dataset (as a zip file or similar), Radiant Earth have formatted each dataset as a STAC Collection (or several STAC Collections). They are also currently developing a way that users can download just the STAC metadata associated with each dataset. This would allow users to more easily subset very large datasets (e.g. BigEarthNet) without having to download the entire dataset. We'd like to create a generic MLHubDataset(...) that uses this upcoming feature to build a GeoDataset for any dataset hosted on MLHub. The rough idea is as follows:

1. Download the STAC Collection files associated with a requested dataset
2. Build a RasterDataset, creating the index using the metadata from each item
   - For datasets that have both imagery and labels, the label items have a pointer to their corresponding imagery item
3. In __getitem__ we can download the necessary files on the fly and cache them to disk
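
To make the indexing step concrete, here is a minimal sketch using only the standard library. It assumes each STAC item is already loaded as a plain dict with the usual `bbox`, `properties.datetime`, `assets`, and `links` fields, and that label items point at their imagery through links with `rel` set to `"source"` (as in the STAC label extension). `IndexEntry`, `build_index`, `query`, and the example items are hypothetical names and data for illustration, not part of torchgeo or MLHub.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical index record built from one STAC item."""

    item_id: str
    bbox: BBox
    timestamp: datetime
    asset_hrefs: Dict[str, str]
    source_hrefs: Tuple[str, ...]  # imagery items a label item points to


def build_index(items: List[Dict[str, Any]]) -> List[IndexEntry]:
    """Build an in-memory index from STAC item dicts without touching any pixels."""
    entries = []
    for item in items:
        # Assumption: properties.datetime is set (STAC also allows start/end ranges).
        raw_time = item["properties"]["datetime"].replace("Z", "+00:00")
        entries.append(
            IndexEntry(
                item_id=item["id"],
                bbox=tuple(item["bbox"][:4]),
                timestamp=datetime.fromisoformat(raw_time),
                asset_hrefs={k: v["href"] for k, v in item.get("assets", {}).items()},
                source_hrefs=tuple(
                    link["href"]
                    for link in item.get("links", [])
                    if link.get("rel") == "source"
                ),
            )
        )
    return entries


def query(entries: List[IndexEntry], bbox: BBox) -> List[IndexEntry]:
    """Return the entries whose bounding box intersects the query box."""
    minx, miny, maxx, maxy = bbox
    return [
        e
        for e in entries
        if e.bbox[0] <= maxx and e.bbox[2] >= minx
        and e.bbox[1] <= maxy and e.bbox[3] >= miny
    ]


# Made-up items: one imagery item and one label item that references it.
EXAMPLE_ITEMS = [
    {
        "id": "img_0",
        "bbox": [0.0, 0.0, 1.0, 1.0],
        "properties": {"datetime": "2018-01-01T00:00:00Z"},
        "assets": {"B04": {"href": "https://example.com/img_0/B04.tif"}},
        "links": [],
    },
    {
        "id": "lbl_0",
        "bbox": [0.0, 0.0, 1.0, 1.0],
        "properties": {"datetime": "2018-01-01T00:00:00Z"},
        "assets": {"labels": {"href": "https://example.com/lbl_0/labels.tif"}},
        "links": [{"rel": "source", "href": "../img_0/stac.json"}],
    },
]

index = build_index(EXAMPLE_ITEMS)
subset = query(index, (0.5, 0.5, 2.0, 2.0))
```

Because the index is built from metadata alone, subsetting a very large collection only requires the STAC files; the asset hrefs are resolved later, when a sample is actually requested.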

A reasonable signature for the constructor would be something like MLHubDataset(collection_name, root="data/", max_cache_size=None).
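
As a rough illustration of how `max_cache_size` could behave, the sketch below caches fetched files on disk and evicts the least recently used ones once a size budget is exceeded. It makes several assumptions: the budget is measured in bytes, `None` means unlimited, and the download is injected as a `fetch` callable so the example runs without network access. `DiskCache` and its methods are hypothetical names, not an existing torchgeo or MLHub API.

```python
import hashlib
from collections import OrderedDict
from pathlib import Path
from typing import Callable, Optional


class DiskCache:
    """Hypothetical on-disk cache with least-recently-used eviction."""

    def __init__(
        self,
        root: str,
        fetch: Callable[[str], bytes],
        max_cache_size: Optional[int] = None,
    ) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # e.g. a function that downloads an href
        self.max_cache_size = max_cache_size  # assumed to be in bytes
        self._sizes = OrderedDict()  # href -> size, oldest first

    def _path(self, href: str) -> Path:
        # Hash the href so any URL maps to a safe file name.
        return self.root / hashlib.sha256(href.encode("utf-8")).hexdigest()

    def get(self, href: str) -> Path:
        """Return a local path for href, fetching it only on a cache miss."""
        path = self._path(href)
        if href in self._sizes and path.exists():
            self._sizes.move_to_end(href)  # mark as most recently used
            return path
        data = self.fetch(href)
        path.write_bytes(data)
        self._sizes[href] = len(data)
        self._sizes.move_to_end(href)
        self._evict()
        return path

    def _evict(self) -> None:
        """Delete least recently used files until the cache fits the budget."""
        if self.max_cache_size is None:
            return
        # Always keep the newest file, even if it alone exceeds the budget.
        while sum(self._sizes.values()) > self.max_cache_size and len(self._sizes) > 1:
            oldest, _ = self._sizes.popitem(last=False)
            self._path(oldest).unlink(missing_ok=True)
```

In `__getitem__`, the dataset would call something like `cache.get(href)` for each asset of the requested item and then open the returned path as usual, so repeated access to the same tiles does not trigger repeated downloads.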

Similar to #403, this would require new dependencies for working with STAC.

If you are interested in working on this, I can send you an example "STAC metadata archive" that corresponds to the LandCoverNet dataset.


adamjstewart · 2022-02-16T16:19:04Z

Just being able to easily convert all existing MLHub datasets from VisionDataset to GeoDataset would be a huge win!

adamjstewart · 2024-04-04T13:41:47Z

MLHub is dead, long live Source Cooperative!

calebrob6 added the datasets (Geospatial or benchmark datasets) label Feb 16, 2022

adamjstewart closed this as completed Apr 4, 2024
