Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Radiant MLHub (REF) Cloud Cover Dataset #510

Merged
merged 51 commits into from
Aug 31, 2022

Conversation

KennSmithDS
Copy link
Contributor

@KennSmithDS KennSmithDS commented Apr 19, 2022

This PR adds the Cloud Cover Detection Challenge dataset which was generated as part of a crowdsourcing competition as hosted on the Radiant MLHub.

Dataset features:

  • Sentinel-2 L2A (RGB and NIR bands only)
  • 22,728 chips in the training dataset, and 10,980 chips in the test dataset
  • Chips represent imagery between 2018 and 2020
  • GeoTiff labels represent binary classification if a pixel is labelled as a cloud or not

To Dos:

  • Resolve linter errors
  • Create new sample data with data.py script

@github-actions github-actions bot added datasets Geospatial or benchmark datasets testing Continuous integration testing labels Apr 19, 2022
@adamjstewart adamjstewart added this to the 0.3.0 milestone Apr 19, 2022
torchgeo/datasets/cloud_cover.py Outdated Show resolved Hide resolved
torchgeo/datasets/cloud_cover.py Show resolved Hide resolved
torchgeo/datasets/cloud_cover.py Outdated Show resolved Hide resolved
@calebrob6
Copy link
Member

Thanks for the contribution @KennSmithDS! I took a pass at reviewing. Other comments:

Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still needs a data.py script to generate the fake data. It also looks like some of the tests are failing. You'll need to rebase/merge on top of main to run some of the new required tests. Let me know if I can help with any of this.

tests/datasets/test_cloud_cover.py Outdated Show resolved Hide resolved
tests/datasets/test_cloud_cover.py Outdated Show resolved Hide resolved
torchgeo/datasets/__init__.py Outdated Show resolved Hide resolved
torchgeo/datasets/cloud_cover.py Outdated Show resolved Hide resolved
torchgeo/datasets/cloud_cover.py Outdated Show resolved Hide resolved
torchgeo/datasets/cloud_cover.py Outdated Show resolved Hide resolved
torchgeo/datasets/cloud_cover.py Show resolved Hide resolved
torchgeo/datasets/cloud_cover.py Outdated Show resolved Hide resolved
@KennSmithDS
Copy link
Contributor Author

KennSmithDS commented Jul 6, 2022

@adamjstewart @calebrob6 thank you for the feedback and recommendations. I've rebased/merged the latest release with my local branch, and fixed the failing tests, as well as implemented the suggestions above. The only thing I still have questions on is the data.py script.

Is this requirement for contributing documented anywhere in the repository? I've been looking at a few different examples to get a sense for what it does, e.g. agb_live_woody_density, naip, and baywide datasets. Each have their own flavor, but the common workflow I see is defining the profile information of a raster file to be written using rasterio. Questions this raises for me are:

  • Is it required to use the long form WKT string for the profile["crs"] attribute or can you simply pass in the simpler versions, e.g. CRS.from_epsg(4326) or CRS.from_string("EPSG:4326")?
  • Is it necessary/recommended to use a colormap for the geotiffs written to disk?
  • Should there be a min/max value for the random values generated for the rasterio dataset writer class output? The above examples in the repo just create a random image np array of any dtype value
  • I'm also doing things in a slightly different way with this Dataset to load information from the actual STAC catalog, e.g. collection.json and stac.json so is it necessary to create a GeoJSON file (FeatureCollection) like with agb_live_woody_density?

Thanks again!

@adamjstewart
Copy link
Collaborator

Is this requirement for contributing documented anywhere in the repository?

The fact that it's needed is documented at https://torchgeo.readthedocs.io/en/latest/user/contributing.html#datasets. Some tips for writing the script can be found at https://github.com/microsoft/torchgeo/blob/main/tests/data/README.md

For all of your other questions, the answer is that you should try to match the real data from your dataset.

  • If your dataset contains raster images with geospatial metadata, you should create fake data using rasterio. If your dataset contains PNG or JPEG images with no geospatial metadata, you should create fake data using PIL.
  • For raster datasets, you should use the same CRS and dtype as the real files.
  • The random values can range from the min/max value of that dtype, which can be computed using np.iinfo(dtype).min and np.iinfo(dtype).max (or np.ffinfo for float dtypes).
  • If your dataset contains labels with a colormap, your fake data should to.

Hope that answers your questions!

@KennSmithDS
Copy link
Contributor Author

@adamjstewart I had missed the README file nestled under tests/data/README.md so thanks for pointing that out!

@KennSmithDS
Copy link
Contributor Author

KennSmithDS commented Jul 6, 2022

@adamjstewart I'm assuming the answer to this question is yes, but to clarify, should the data.py script also generate the collection.json and stac.json files used by the method I added _load_collections()

@adamjstewart
Copy link
Collaborator

Yep! At some point I'm planning to convert these data.py files to pytest fixtures to generate all of our data at test time instead of storing it in the repo. So if the data.py generates those files too that would help a lot.

@adamjstewart adamjstewart modified the milestones: 0.3.0, 0.4.0 Jul 9, 2022
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jul 22, 2022
datasets/cloud-cover branch diverged from main, and the base class VisionDataset is now deprecated, unable to make commit with new class name NonGeoDataset as the underlying class structures have changed, and are preventing the commit.
Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the data.py, that looks like it wasn't easy to write!

Did a more detailed pass through the code and listed mostly minor things to change. I think this is pretty close to completion! Will likely have one last pass once these comments are addressed.

docs/api/datasets.rst Show resolved Hide resolved
torchgeo/datasets/cloud_cover.py Show resolved Hide resolved
torchgeo/datasets/cloud_cover.py Outdated Show resolved Hide resolved
torchgeo/datasets/cloud_cover.py Outdated Show resolved Hide resolved
torchgeo/datasets/cloud_cover.py Outdated Show resolved Hide resolved
torchgeo/datasets/cloud_cover.py Show resolved Hide resolved
torchgeo/datasets/cloud_cover.py Outdated Show resolved Hide resolved
KennSmithDS and others added 9 commits August 21, 2022 23:19
adding commas to train/test split sizes

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Including total dataset size in api CSV, removing train/test split

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Accepting carrots under text in datasets.rst, sorry forgot to add these in!

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
@adamjstewart
Copy link
Collaborator

Rebased to fix coverage. Just need to see an example plot and this should be ready to merge.

@KennSmithDS
Copy link
Contributor Author

@adamjstewart sorry for delay, we've had some high priority projects past 1-2 weeks that have required my attention.

  1. Do you have an example of what the plot should look like, or will this do?

cloud_cover_plot_idx_1

  1. Is there a standard pixel value normalization for Sentinel 2?

I just modified a single line of code in the .plot(...) method to divide the pixel values by 3000, which is arbitrary but also used in BeninSmallHolderCashews, another MLHub NonGeoDataset which has the added numpy dimension for time series in the data. The np.clip(..) method is also used while normalizing the S2 pixel values.

I'll have to push 1 more commit with the slight modification to the .plot(...) method after I hear back from you. Cheers!

@adamjstewart
Copy link
Collaborator

  1. Plot looks good, just wanted to make sure the image and label aren't flipped or something like that. You'd be surprised how often that happens.
  2. No preference on normalization constant. I used 2000 for Sentinel2 because 3000 made the image a bit too dark. I don't know of a more rigorous way to define how Sentinel imagery should be visualized. You would think there would be an "official" way to plot a true color image...

@adamjstewart adamjstewart enabled auto-merge (squash) August 31, 2022 17:47
@adamjstewart adamjstewart merged commit 7d49956 into microsoft:main Aug 31, 2022
Modexus pushed a commit to Modexus/torchgeo that referenced this pull request Sep 3, 2022
* adding cloud cover dataset class

* Adding Cloud Cover Detection Challenge dataset class and tests

* fixed linter issues and passing pre-commit tests

* resolving linter errors

* addressing failed isort test

* replacing deprecated VisionDataset with NonGeoDataset reference

* removed NDArray import for quoted np.typing.NDArray

* addressing mypy errors

* updated docstring for plot method

* Update docs/api/non_geo_datasets.csv

adding commas to train/test split sizes

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update docs/api/non_geo_datasets.csv

Including total dataset size in api CSV, removing train/test split

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* updated Cloud Cover Dataset name in dataset.rst

* Update docs/api/datasets.rst

Accepting carrots under text in datasets.rst, sorry forgot to add these in!

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update torchgeo/datasets/cloud_cover.py

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update torchgeo/datasets/cloud_cover.py

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update torchgeo/datasets/cloud_cover.py

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* flakefreaking8

* Fix duplicate rst link

* adding cloud cover dataset class

* Adding Cloud Cover Detection Challenge dataset class and tests

* fixed linter issues and passing pre-commit tests

* resolving linter errors

* addressing failed isort test

* replacing deprecated VisionDataset with NonGeoDataset reference

* removed NDArray import for quoted np.typing.NDArray

* addressing mypy errors

* updated docstring for plot method

* Update docs/api/non_geo_datasets.csv

adding commas to train/test split sizes

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update docs/api/non_geo_datasets.csv

Including total dataset size in api CSV, removing train/test split

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* updated Cloud Cover Dataset name in dataset.rst

* Update docs/api/datasets.rst

Accepting carrots under text in datasets.rst, sorry forgot to add these in!

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update torchgeo/datasets/cloud_cover.py

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update torchgeo/datasets/cloud_cover.py

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update torchgeo/datasets/cloud_cover.py

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* flakefreaking8

* Fix duplicate rst link

* added normalization to plot method for better visual

* CloudCoverDetection is a non-geo dataset

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Co-authored-by: Caleb Robinson <calebrob6@gmail.com>
yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023
* adding cloud cover dataset class

* Adding Cloud Cover Detection Challenge dataset class and tests

* fixed linter issues and passing pre-commit tests

* resolving linter errors

* addressing failed isort test

* replacing deprecated VisionDataset with NonGeoDataset reference

* removed NDArray import for quoted np.typing.NDArray

* addressing mypy errors

* updated docstring for plot method

* Update docs/api/non_geo_datasets.csv

adding commas to train/test split sizes

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update docs/api/non_geo_datasets.csv

Including total dataset size in api CSV, removing train/test split

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* updated Cloud Cover Dataset name in dataset.rst

* Update docs/api/datasets.rst

Accepting carrots under text in datasets.rst, sorry forgot to add these in!

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update torchgeo/datasets/cloud_cover.py

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update torchgeo/datasets/cloud_cover.py

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update torchgeo/datasets/cloud_cover.py

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* flakefreaking8

* Fix duplicate rst link

* adding cloud cover dataset class

* Adding Cloud Cover Detection Challenge dataset class and tests

* fixed linter issues and passing pre-commit tests

* resolving linter errors

* addressing failed isort test

* replacing deprecated VisionDataset with NonGeoDataset reference

* removed NDArray import for quoted np.typing.NDArray

* addressing mypy errors

* updated docstring for plot method

* Update docs/api/non_geo_datasets.csv

adding commas to train/test split sizes

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update docs/api/non_geo_datasets.csv

Including total dataset size in api CSV, removing train/test split

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* updated Cloud Cover Dataset name in dataset.rst

* Update docs/api/datasets.rst

Accepting carrots under text in datasets.rst, sorry forgot to add these in!

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update torchgeo/datasets/cloud_cover.py

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update torchgeo/datasets/cloud_cover.py

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update torchgeo/datasets/cloud_cover.py

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* flakefreaking8

* Fix duplicate rst link

* added normalization to plot method for better visual

* CloudCoverDetection is a non-geo dataset

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Co-authored-by: Caleb Robinson <calebrob6@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets documentation Improvements or additions to documentation testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants