
Radiant MLHub datasets - FileNotFoundError due to radiant_mlhub version 0.5.0 release #711

Closed
KennSmithDS opened this issue Aug 10, 2022 · 3 comments · Fixed by #1102
Assignees
Labels
datasets (Geospatial or benchmark datasets), dependencies (Packaging and dependencies)

Comments

@KennSmithDS
Contributor

Description

@calebrob6 flagged an issue while merging #510: the utils.download_radiant_mlhub_collection(...) method raises a FileNotFoundError. The method in torchgeo.utils expects radiant_mlhub.Dataset.download(...) to fetch a .tar.gz archive of the entire dataset or of an individual collection. As of v0.5.0 of the radiant-mlhub Python Client, this is no longer the case: no .tar.gz archives are downloaded except for the archive containing the STAC catalog. This change was made to support asset-level filtering and downloading (see radiantearth/radiant-mlhub#104).

All MLHub dataset classes in TorchGeo (BeninSmallHolderCashews, CV4AKenyaCropType, TropicalCycloneWindEstimation, NASAMarineDebris) expect an archive of the entire dataset, or archives of each collection. For example, NASAMarineDebris, like CloudCoverDetection, assigns the archive names and their md5 checksums as class attributes:

```python
dataset_id = "nasa_marine_debris"
directories = ["nasa_marine_debris_source", "nasa_marine_debris_labels"]
filenames = ["nasa_marine_debris_source.tar.gz", "nasa_marine_debris_labels.tar.gz"]
md5s = ["fe8698d1e68b3f24f0b86b04419a797d", "d8084f5a72778349e07ac90ec1e1d990"]
```
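For context, checksums like these are typically verified after download roughly as follows. This is an illustrative sketch using only the standard library; `verify_md5` is a hypothetical helper, not torchgeo's actual implementation:

```python
import hashlib

def verify_md5(path: str, expected_md5: str, chunk_size: int = 1 << 20) -> bool:
    """Compare a file's md5 digest against an expected checksum, reading in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5
```

With radiant-mlhub v0.5.0+ no longer producing these per-collection archives, any such checksum verification against `md5s` fails before it even starts, because the expected files never appear on disk.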

Proposed Short Term Solution:
The easiest fix is to pin the radiant-mlhub Python Client to v0.4.x, i.e., the last release line before the one that introduced the breakage in torchgeo.
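Such a pin might look like this in the package's dependency metadata (a sketch only; the exact bounds shown here are an assumption, not the actual pin that was applied):

```
# setup.cfg (illustrative)
install_requires =
    radiant-mlhub>=0.2.1,<0.5
```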

Long Term Solution:
Ideally, any existing and new torchgeo datasets sourced from Radiant MLHub will be updated to reflect the latest Dataset.download(...) behavior. Calling that method now triggers the following sequence of events:

  1. Download the catalog archive with all STAC objects for the dataset
  2. Uncompress the STAC catalog archive directory
  3. Load the STAC catalog into a Sqlite database on disk
  4. Apply spatial, temporal, and band filters to the SQL table
  5. Construct list of assets to download from query
  6. Download all assets locally into catalog directory
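Steps 3–5 of the sequence above can be sketched in pure Python. This is an illustration of the filter-then-download-list pattern, not the client's actual code; the table schema and item records are hypothetical:

```python
import sqlite3

# Hypothetical STAC asset metadata, as it might look after step 2 (uncompressing
# the catalog archive) and flattening each item's assets.
items = [
    {"id": "item_1", "band": "B02", "href": "https://example.com/item_1_B02.tif"},
    {"id": "item_1", "band": "B08", "href": "https://example.com/item_1_B08.tif"},
    {"id": "item_2", "band": "B02", "href": "https://example.com/item_2_B02.tif"},
]

# Step 3: load the catalog into a SQLite database (the real client uses one on disk).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assets (item_id TEXT, band TEXT, href TEXT)")
conn.executemany("INSERT INTO assets VALUES (:id, :band, :href)", items)

# Step 4: apply a band filter to the SQL table.
wanted_bands = ("B02", "B03", "B04")
placeholders = ",".join("?" * len(wanted_bands))

# Step 5: construct the list of assets to download from the query result.
to_download = [
    row[0]
    for row in conn.execute(
        f"SELECT href FROM assets WHERE band IN ({placeholders})", wanted_bands
    )
]
```

Step 6 would then fetch each `href` in `to_download` into the catalog directory individually, which is exactly why no per-collection .tar.gz archives appear anymore.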

Steps to reproduce

  1. Run the following code to reproduce:

```python
from torchgeo.datasets import BeninSmallHolderCashews

cashews = BeninSmallHolderCashews(
    root='/data',
    bands=('B02', 'B03', 'B04'),
    download=True,
)

# the same result will happen with CV4AKenyaCropType, TropicalCycloneWindEstimation,
# NASAMarineDebris, and CloudCoverDetection
```

Version

0.4.0.dev0

@adamjstewart adamjstewart added this to the 0.3.1 milestone Aug 10, 2022
@adamjstewart adamjstewart added the datasets Geospatial or benchmark datasets label Aug 10, 2022
@adamjstewart adamjstewart self-assigned this Aug 10, 2022
@adamjstewart
Collaborator

> It seems the easiest solution is to fix the version of the radiant-mlhub Python Client to v0.4.x, prior to the release which introduced the bug in torchgeo.

Fixed in 51b4d6d (probably should have opened a PR first, but I accidentally pushed to main). Unfortunately, I don't know of an easy way to test this. We try to avoid tests that require internet access, so we currently monkeypatch Dataset.download() to simply copy a local file. That's why our tests didn't catch the bug.

@adamjstewart
Collaborator

@KennSmithDS any updates on a Radiant MLHub dataset? We're planning a 0.4.0 release by the end of the month and would love to see support for the latest version of radiant-mlhub. Not sure what your work schedule is like before the holidays.

@KennSmithDS
Contributor Author

Hi @adamjstewart, sorry for the belated response. We have been in a mad push during Q4, especially November and December, to finish updating all of our STAC catalogs to comply with the STAC specifications and to add metadata/extensions where they weren't used before. Then I was on holiday break, and today is my last day at Radiant Earth. I will not be able to continue supporting the development of this dataset class for Radiant MLHub's datasets going forward, but I look forward to contributing to torchgeo in other capacities.
