Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for radiant-mlhub 0.5+ #1102

Conversation

SpontaneousDuck
Copy link
Contributor

@SpontaneousDuck SpontaneousDuck commented Feb 9, 2023

Quick changes to several datasets and the tests to support the new download file structures in radiant-mlhub>=0.5.

Quick changes for now. Currently no checksum checking for the dataset since radiant-mlhub just downloads the individual files directly. Any ideas how I can check for partial dataset downloads? radiant-mlhub says it will auto-resume dataset downloads if requested but this code does not currently check or ask it to resume.

Closes #711
Closes #1101

@github-actions github-actions bot added datasets Geospatial or benchmark datasets dependencies Packaging and dependencies testing Continuous integration testing labels Feb 9, 2023
@adamjstewart
Copy link
Collaborator

We'll need to update all datasets, not just NASA Marine Debris.

radiant-mlhub says it will auto-resume dataset downloads if requested

Can we just ask to resume whenever download=True? Will it fail if the download is 0% or 100% complete and only work when 1<x<99% complete? Or will it handle things gracefully?

environment.yml Outdated Show resolved Hide resolved
@SpontaneousDuck
Copy link
Contributor Author

Got it, working on all the datatsets now!

@SpontaneousDuck SpontaneousDuck force-pushed the NASAMarineDebris_download-structure branch from 2286ec1 to e9a2366 Compare February 11, 2023 01:40
@SpontaneousDuck
Copy link
Contributor Author

Ok! So what I ended up doing is falling back to download the old file structure by grabbing the radiant_mlhub.Collection instead of the radiant_mlhub.Dataset. This basically matches the old way of doing it but calls tochgeo.datasets.utils.download_radiant_mlhub_collection instead to download each individual tarball for each dataset. This allows us to grab files individually and checksum each tarball after downloading.

@adamjstewart
Copy link
Collaborator

Does that approach work for older radiant-mlhub or is 0.5.0 the minimum version where that technique works?

@SpontaneousDuck SpontaneousDuck force-pushed the NASAMarineDebris_download-structure branch from cef2957 to d369a34 Compare February 11, 2023 02:08
@SpontaneousDuck
Copy link
Contributor Author

Good idea! Checked with radiant-mlhub=0.2.2 and this new method still works. Changed environment.yml and setup.cfg to just remove the version cap instead of bumping the required version

@SpontaneousDuck SpontaneousDuck marked this pull request as ready for review February 11, 2023 03:28
Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also update requirements/datasets.txt to radiant-mlhub==0.5.5 and remove radiant-mlhub from .github/dependabot.yml?

setup.cfg Outdated Show resolved Hide resolved
@@ -16,15 +16,15 @@
from torchgeo.datasets import BeninSmallHolderCashews


class Dataset:
class Collection:
def download(self, output_dir: str, **kwargs: str) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When radiant-mlhub 0.5 was first released, we actually didn't notice when things broke because we replace their download method with our own. Do you know if it's possible to monkeypatch something else farther internal so that we can actually test their download method but still use a local file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually have never used monkeypatch! So I'm not sure how that would work. It looks like the download call just creates a requests. Session and download chunks for the dataset in a threadpool. We would probably have to over-ride something in requests? But then our test data would need to be chunked I think.

torchgeo/datasets/benin_cashews.py Show resolved Hide resolved
@adamjstewart adamjstewart changed the title Support for radiant-mlhub>0.5 in NASAMarineDebris Support for radiant-mlhub 0.5+ Feb 13, 2023
@adamjstewart adamjstewart added this to the 0.4.1 milestone Feb 13, 2023
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
@SpontaneousDuck
Copy link
Contributor Author

Can we loosen the shapely requirements for CI?

@adamjstewart
Copy link
Collaborator

Bleh, that's annoying. Let me see if I can convince them to support newer shapely. I would rather focus on testing newer shapely than on newer radiant-mlhub. So we can test the latest version of radiant-mlhub 0.4 and prevent dependabot from updating it until radiant-mlhub supports shapely 2.

@adamjstewart
Copy link
Collaborator

Done: https://github.com/radiantearth/radiant-mlhub/pull/195

Hopefully the next release will support newer shapely. Until then, let's keep testing old radiant-mlhub.

@SpontaneousDuck
Copy link
Contributor Author

Done: radiantearth/radiant-mlhub#195

Hopefully the next release will support newer shapely. Until then, let's keep testing old radiant-mlhub.

So that means changing .github/dependabot.yml back? Do any other files need to be changed back?

@adamjstewart
Copy link
Collaborator

You can actually leave your changes to dependabot.yml, when the new release comes out we want to try bumping the version we test.

setup.cfg and environment.yml should be updated, but requirements/datasets.txt should not.

requirements/datasets.txt Outdated Show resolved Hide resolved
setup.cfg Outdated Show resolved Hide resolved
@SpontaneousDuck SpontaneousDuck force-pushed the NASAMarineDebris_download-structure branch from 8f24cb0 to 76268b3 Compare February 13, 2023 19:22
environment.yml Outdated Show resolved Hide resolved
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
@adamjstewart
Copy link
Collaborator

Everything looks good at first glance, testing this locally now...

@adamjstewart
Copy link
Collaborator

adamjstewart commented Feb 15, 2023

This is working for most datasets but doesn't seem to work for SpaceNet 6. Maybe @ashnair1 can comment on why only SpaceNet 6 is different.

spacenet6: fetch stac catalog: 2829KB [00:00, 2959.89KB/s]                                                                                                                                                          
INFO:radiant_mlhub.client.catalog_downloader:unarchive spacenet6.tar.gz ...
unarchive spacenet6.tar.gz: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13608/13608 [00:08<00:00, 1555.38it/s]
INFO:radiant_mlhub.client.catalog_downloader:create stac asset list (please wait) ...
INFO:radiant_mlhub.client.catalog_downloader:20406 unique assets in stac catalog.
download assets:   1%|█▏                                                                                                                                                      | 165/20406 [00:54<3:26:14,  1.64it/s]WARNING:urllib3.connectionpool:Retrying (Retry(total=9, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', OSError(22, 'Invalid argument'))': /spacenet/SN6_buildings/train/AOI_11_Rotterdam/PAN/SN6_Train_AOI_11_Rotterdam_PAN_20190822144046_20190822144338_tile_1484.tif
download assets:   1%|█▍                                                                                                                                                      | 187/20406 [01:06<3:04:13,  1.83it/s]WARNING:urllib3.connectionpool:Retrying (Retry(total=9, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))': /spacenet/SN6_buildings/train/AOI_11_Rotterdam/RGBNIR/SN6_Train_AOI_11_Rotterdam_RGBNIR_20190822065725_20190822065959_tile_7267.tif
download assets:   3%|████▉                                                                                                                                                   | 668/20406 [04:39<1:43:10,  3.19it/s]

(scroll right to see error messages)

@SpontaneousDuck
Copy link
Contributor Author

So Spacenet6 is the only one that still uses the old method (download_radiant_mlhub_collection). It only has a single zip instead of multiple so I thought it would be ok. It's data is not easily available through the new method I changed to. I can look in to that more though!

@SpontaneousDuck
Copy link
Contributor Author

So it looks like SpaceNet provides a direct download link shown here: https://spacenet.ai/sn6-challenge/. Since it is 39GB, it looks like ml-hub does not support downloading that dataset directly anymore and only allow the new method of downloading each image individually using the dataset API in radiant-mlhub. This method as is should work, might just need some retires for data that does not download. Oooor, we could change to directly downloading the 39GB tar file from s3 and using that.

@adamjstewart
Copy link
Collaborator

I'll let @ashnair1 decide, he wrote all of our SpaceNet data loaders. I think the current approach is probably fine.

@ashnair1
Copy link
Collaborator

So the difference in SpaceNet6 is primarily caused by the way it was archived. Refer to radiantearth/radiant-mlhub#93

While writing this dataset, download_radiant_mlhub_dataset worked while download_radiant_mlhub_collection did not. It might be possible to download each asset but I haven't tried it out yet.

Long term, I think it might be best to download all SpaceNet datasets from the official site - https://spacenet.ai/ and not use radiant-mlhub.

@adamjstewart
Copy link
Collaborator

With this PR both SpaceNet 6 and SpaceNet != 6 seem to download properly for all versions of radiant-mlhub (just ignore the warning messages) so I think this is safe to download. We can consider moving the SpaceNet downloads (or rehosting) later. Would be great if we could download all of our datasets without an API key.

@adamjstewart adamjstewart merged commit dc98b60 into microsoft:main Feb 17, 2023
calebrob6 pushed a commit that referenced this pull request Apr 10, 2023
* update datasets and tests to support radiant-mlhub>0.5

* add test coverage for nasa_marine_debris corrupted cases

* style fixes

* Correct return type in test_nasa_marine_debris.py

* Update setup.cfg to limit radiant-mlhub version

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* radiant-mlhub version updates to <0.6

* Update environment.yml to not upper bound radiant-mlhub

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

---------

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023
* update datasets and tests to support radiant-mlhub>0.5

* add test coverage for nasa_marine_debris corrupted cases

* style fixes

* Correct return type in test_nasa_marine_debris.py

* Update setup.cfg to limit radiant-mlhub version

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* radiant-mlhub version updates to <0.6

* Update environment.yml to not upper bound radiant-mlhub

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

---------

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets dependencies Packaging and dependencies testing Continuous integration testing
Projects
None yet
3 participants