Radiant Earth's MLHub is a repository for geospatial ML datasets. So far we have created VisionDatasets for a few of the datasets that they host (we use their API to download the entire dataset archive to local disk), e.g. NASAMarineDebris, BeninSmallHolderCashews, CV4AKenyaCropType, TropicalCycloneWindEstimation.
In addition to storing the entire archive of a dataset (as a zip file or similar), Radiant Earth have formatted each dataset as a STAC Collection (or several STAC Collections). They are also developing a way for users to download just the STAC metadata associated with each dataset. This would let users subset very large datasets (e.g. BigEarthNet) without having to download the entire dataset. We'd like to create a generic MLHubDataset(...) that uses this upcoming feature to build a GeoDataset for any dataset hosted on MLHub. The rough idea is as follows:
Download the STAC Collection files associated with a requested dataset
Build a RasterDataset, creating the index using the metadata from each item
For datasets that have both imagery and labels, the label items have a pointer to their corresponding imagery item
In __getitem__ we can download the necessary files on the fly and cache them to disk
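To make the indexing and label-pointer steps above concrete, here is a minimal sketch that works on plain STAC Item dictionaries (as loaded with `json.load`). The helper names `item_bounds` and `source_hrefs` are hypothetical, the `(minx, maxx, miny, maxy, mint, maxt)` ordering is an assumption about what the index would want, and the sketch assumes 2D bounding boxes and label items that point at their imagery through links with `rel` set to `"source"`, as in the STAC label extension.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Tuple

# Assumed ordering for an index entry: (minx, maxx, miny, maxy, mint, maxt)
Bounds = Tuple[float, float, float, float, float, float]


def parse_stac_datetime(value: str) -> float:
    """Convert an RFC 3339 timestamp such as '2018-01-01T00:00:00Z' to POSIX seconds."""
    # datetime.fromisoformat() rejects a trailing 'Z' before Python 3.11
    dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        # Assumption: timestamps without an offset are UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.timestamp()


def item_bounds(item: Dict[str, Any]) -> Bounds:
    """Extract spatiotemporal bounds from a STAC Item dict (hypothetical helper)."""
    bbox = item["bbox"]
    if len(bbox) != 4:
        raise ValueError("this sketch only handles 2D bounding boxes")
    minx, miny, maxx, maxy = bbox  # STAC order: west, south, east, north
    props = item["properties"]
    # Items carry either a single datetime or a start/end range
    start = props.get("start_datetime") or props.get("datetime")
    end = props.get("end_datetime") or props.get("datetime")
    if start is None or end is None:
        raise ValueError("item has no usable datetime")
    return (minx, maxx, miny, maxy, parse_stac_datetime(start), parse_stac_datetime(end))


def source_hrefs(label_item: Dict[str, Any]) -> List[str]:
    """Return the hrefs of the imagery items a label item points to."""
    return [
        link["href"]
        for link in label_item.get("links", [])
        if link.get("rel") == "source"
    ]
```

Note that STAC bounding boxes are in WGS84, so a real implementation would probably need to reproject them into the dataset's CRS before inserting them into the index.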
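A reasonable signature for the constructor would be something like MLHubDataset(collection_name, root="data/", max_cache_size=None) (with collection_name first, since Python requires arguments without defaults to come before those with defaults).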
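One way the on-the-fly download and `max_cache_size` could fit together is a small least-recently-used file cache under `root`. The sketch below is only an illustration: `DiskCache` is a hypothetical class, `fetch` stands in for whatever download call MLHub ends up providing, and treating `max_cache_size` as a byte count is an assumption.

```python
import os
from collections import OrderedDict
from typing import Callable, Optional


class DiskCache:
    """Hypothetical LRU file cache for downloaded assets."""

    def __init__(
        self,
        root: str,
        fetch: Callable[[str, str], None],  # fetch(href, destination_path)
        max_cache_size: Optional[int] = None,  # bytes; None means unbounded
    ) -> None:
        self.root = root
        self.fetch = fetch
        self.max_cache_size = max_cache_size
        self._sizes: "OrderedDict[str, int]" = OrderedDict()  # href -> size, oldest first
        os.makedirs(root, exist_ok=True)

    def _path(self, href: str) -> str:
        # Assumption: asset basenames are unique within a collection
        return os.path.join(self.root, os.path.basename(href))

    def get(self, href: str) -> str:
        """Return a local path for href, downloading it on a cache miss."""
        path = self._path(href)
        if href in self._sizes and os.path.exists(path):
            self._sizes.move_to_end(href)  # mark as most recently used
            return path
        self.fetch(href, path)
        self._sizes[href] = os.path.getsize(path)
        self._sizes.move_to_end(href)
        self._evict()
        return path

    def _evict(self) -> None:
        """Delete least recently used files until the cache fits its budget."""
        if self.max_cache_size is None:
            return
        # Always keep the newest entry, even if it alone exceeds the budget
        while sum(self._sizes.values()) > self.max_cache_size and len(self._sizes) > 1:
            old_href, _ = self._sizes.popitem(last=False)
            try:
                os.remove(self._path(old_href))
            except FileNotFoundError:
                pass
```

In this design `__getitem__` would call `cache.get(href)` for each asset that intersects the query and then read the returned path as usual, so the rest of the dataset code never needs to know whether the file was already on disk.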
Similar to #403, this would require new dependencies for working with STAC.
If you are interested in working on this, I can send you an example "STAC metadata archive" that corresponds to the LandCoverNet dataset.
(this is similar in spirit to #403)