Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MapInWild dataset #1131

Merged
merged 58 commits into from
Sep 29, 2023
Merged

Add MapInWild dataset #1131

merged 58 commits into from
Sep 29, 2023

Conversation

burakekim
Copy link
Contributor

@burakekim burakekim commented Feb 21, 2023

Closes #1096

Here is the MapInWild dataset instance. There are things I solved with ugly fixes and quick workarounds but I hope this pull request still is a good-enough starting point.

TODO and issues

  • Checksum: So far no checksum. I can not seem to find the MD5s from Hugging Face. There are only SH256s.
  • Now it is not really straightforward to distinguish the modalities in the __getitem__ making it harder to apply specific normalization steps (i.e., s2/10000).
  • Documentation and tests

Nice to do

  • Picking modalities band-wise to enable the use of various combinations of bands. This could even supersede the modality-wise picking.
  • Improving the plot function. A normalization function (i.e., percentile_normalization) is needed for at least the Sentinel-2 data. For that, an argument and modality-aware logic inside this function could be helpful.
  • Remove the pandas and move to a lower-level library.

Any help is appreciated. Thanks in advance for your time in reviewing this pull request.

P.S. I am by no means a GitHub pro, so any heads-up is appreciated.

The signature:
mapinwild = MapInWild(root="data", download=True, modality=["mask","esa_wc","s2_temporal_subset","viirs","s1"], split="train", transforms=None, checksum=False)

Some sample images:

VIIRS Nighttime light band
viirs
Sentinel-2 single temporal subset (the Sentinel-2 season with the highest score)
s2
ESA WorldCover map
esa_wc
Sentinel-1
s1

@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Feb 21, 2023
@adamjstewart adamjstewart added this to the 0.5.0 milestone Feb 21, 2023
Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a great start! Let me know if you have any specific questions.

Checksum: So far no checksum. I can not seem to find the MD5s from Hugging Face. There are only SH256s.

You can calculate them yourself using md5 <filename> on the command line. We'll likely move from MD5 to SHA256 someday.

Remove the pandas and move to a lower-level library.

You should probably just keep pandas. I've been thinking about making it a required dep of TorchGeo because so many datasets use it.

torchgeo/datasets/mapinwild.py Show resolved Hide resolved
torchgeo/datasets/mapinwild.py Outdated Show resolved Hide resolved
torchgeo/datasets/mapinwild.py Outdated Show resolved Hide resolved
torchgeo/datasets/mapinwild.py Outdated Show resolved Hide resolved
torchgeo/datasets/mapinwild.py Outdated Show resolved Hide resolved
torchgeo/datasets/mapinwild.py Outdated Show resolved Hide resolved
torchgeo/datasets/mapinwild.py Outdated Show resolved Hide resolved
torchgeo/datasets/mapinwild.py Show resolved Hide resolved
@burakekim
Copy link
Contributor Author

@microsoft-github-policy-service agree company="unibwmunich"

@burakekim
Copy link
Contributor Author

burakekim commented Feb 22, 2023

Thanks a bunch for your revisions!

I applied some quick changes and tests. I will make sure there is a full test coverage after applying the necessary additions and changes, which are for now:

  • Applying MD5 checks.
  • Finding a generic way to handle modalities one by one in the __getitem__ so that we can apply modality-specific normalizations. As a last resort, I will create some mess by simply using self._load_raster(id, modals) to load the modalities separately.
  • Making plot function modality aware to apply proper visualization scalings.

Issues

  • Do you happen to know why the line pyupgrade --py37-plus $(find . -name "*.py") outputs the error below? It works without the query argument $(find . -name "*.py"). Is this argument needed?
usage: pyupgrade [-h] [--exit-zero-even-if-changed] [--keep-percent-format] [--keep-mock] [--keep-runtime-typing]
                 [--py3-plus] [--py36-plus] [--py37-plus] [--py38-plus] [--py39-plus] [--py310-plus] [--py311-plus]
                 [filenames ...]
pyupgrade: error: unrecognized arguments: -name *.py)
  • I shortened the plain texts that are going over 89 characters and applied # noqa: E501 for the links, is this OK?

@adamjstewart
Copy link
Collaborator

Do you happen to know why the line pyupgrade --py37-plus $(find . -name "*.py") outputs the error below?

Nope, that syntax should be correct. Note that our tests use the more complicated:

$ pyupgrade --py38-plus $(find . -path ./docs/src -prune -o -name "*.py" -print)

I shortened the plain texts that are going over 89 characters and applied # noqa: E501 for the links, is this OK?

Yes this is perfect.

@adamjstewart
Copy link
Collaborator

Note that we just dropped Python 3.8 support, so --py38-plus is now --py39-plus

burakekim and others added 2 commits April 18, 2023 17:28
Improves the dataset instance and test data script.
@github-actions github-actions bot added the testing Continuous integration testing label Apr 24, 2023
@burakekim
Copy link
Contributor Author

@adamjstewart I have improved the dataset class (mapinwild.py) and added the code to create test data (mapinwild\data.py) with the last commit.

Now, I assume the next step is to populate the test_mapinwild.py.

The progress is quite slow as I have very little bandwidth for this side hustle -sorry for that! In fact, feel free to contribute to the test_mapinwild.py to wrap up things faster, otherwise, it could take up quite some time for me.

@adamjstewart
Copy link
Collaborator

No rush. Most of us are currently working towards a paper deadline, so we prob won't be able to help much either. But might be able to help after June.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Aug 4, 2023
@burakekim burakekim changed the title Add mapinwild dataset Add MapInWild dataset Aug 4, 2023
@burakekim
Copy link
Contributor Author

burakekim commented Sep 25, 2023

Could you please help me understand the check "license/cla Expected — Waiting for status to be reported"? I can not seem to complete that.

Also, please let me know if there is anything else to be done from my side before the release. Thank you for your time in helping me out here, it was a great learning curve for me. I could have never done it without your great reviews :) I plan to contribute more to the library as much as I can!

@adamjstewart
Copy link
Collaborator

@calebrob6 any idea what's wrong with the CLA bot? It was accepted here: #1131 (comment)

Glad my harsh reviews didn't scare you away 😆. I also see that you're in Munich? I just started a postdoc at TUM, maybe I'll see you around!

I'll review everything in detail again tomorrow, but I think we're probably pretty close now.

@calebrob6 calebrob6 closed this Sep 25, 2023
@calebrob6 calebrob6 reopened this Sep 25, 2023
@calebrob6
Copy link
Member

No idea, let's see if the ol' close and reopen trick works :)

@adamjstewart
Copy link
Collaborator

I hate how reliable ol' close and reopen is...

@calebrob6
Copy link
Member

Yep, I think that did it! Thanks for all the hard work on this @burakekim!

@burakekim
Copy link
Contributor Author

I also see that you're in Munich? I just started a postdoc at TUM, maybe I'll see you around!

@adamjstewart Yes, I am doing my PhD here in Munich. Definitely, it would be great to meet up! I will shoot you an e-mail :)

@burakekim
Copy link
Contributor Author

Thanks for all the hard work on this @burakekim!

Thanks a bunch, @calebrob6! Contributing to the library has been an enjoyable experience :)

torchgeo/datasets/mapinwild.py Outdated Show resolved Hide resolved
tests/datasets/test_mapinwild.py Outdated Show resolved Hide resolved
torchgeo/datasets/mapinwild.py Outdated Show resolved Hide resolved
torchgeo/datasets/mapinwild.py Outdated Show resolved Hide resolved
torchgeo/datasets/mapinwild.py Outdated Show resolved Hide resolved
torchgeo/datasets/mapinwild.py Outdated Show resolved Hide resolved
torchgeo/datasets/mapinwild.py Outdated Show resolved Hide resolved
@calebrob6
Copy link
Member

What are the requested changes here?

@adamjstewart
Copy link
Collaborator

Need to re-review tonight

Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two remaining easy changes and I think this will be ready to merge!

tests/datasets/test_mapinwild.py Show resolved Hide resolved
torchgeo/datasets/mapinwild.py Show resolved Hide resolved
Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work on this! Glad to have your dataset in TorchGeo!

@adamjstewart adamjstewart enabled auto-merge (squash) September 29, 2023 09:56
@adamjstewart
Copy link
Collaborator

Bumping tests...

auto-merge was automatically disabled September 29, 2023 10:38

Pull request was closed

@adamjstewart adamjstewart reopened this Sep 29, 2023
@adamjstewart adamjstewart enabled auto-merge (squash) September 29, 2023 10:38
@adamjstewart adamjstewart merged commit c51014c into microsoft:main Sep 29, 2023
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets documentation Improvements or additions to documentation testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MapInWild dataset
3 participants