Overhaul BoundingBox and ZipDataset classes #144

Merged
merged 37 commits into from
Dec 3, 2021

Conversation

calebrob6
Member

@calebrob6 calebrob6 commented Sep 18, 2021

This PR completely overhauls the BoundingBox and ZipDataset classes. Notable enhancements include:

  • BoundingBox supports set arithmetic (intersection, union, contains)
  • GeoDataset supports set arithmetic (intersection, union)
  • GeoDataset no longer supports addition, as it is ambiguous
  • ZipDataset has been replaced by IntersectionDataset and UnionDataset
  • IntersectionDataset and UnionDataset merge the indices of their datasets
  • IntersectionDataset and UnionDataset work even if CRS/res don't match
  • IntersectionDataset stacks tensors, UnionDataset merges tensors
  • GeoSampler ROI is respected
  • Extensive testing
  • Documentation updates
  • Tutorial
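
The set arithmetic listed above can be illustrated with a small, self-contained sketch. The `Box` class below is a hypothetical stand-in written for this example, not the `BoundingBox` implementation from this PR; it only assumes a box with spatial and temporal extents and shows one plausible way `&`, `|`, and `in` could behave.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Toy spatiotemporal box; a stand-in, not torchgeo's BoundingBox."""

    minx: float
    maxx: float
    miny: float
    maxy: float
    mint: float
    maxt: float

    def __contains__(self, other: "Box") -> bool:
        # True when `other` lies entirely inside this box
        return (
            self.minx <= other.minx <= other.maxx <= self.maxx
            and self.miny <= other.miny <= other.maxy <= self.maxy
            and self.mint <= other.mint <= other.maxt <= self.maxt
        )

    def __or__(self, other: "Box") -> "Box":
        # Union: the smallest box that covers both operands
        return Box(
            min(self.minx, other.minx), max(self.maxx, other.maxx),
            min(self.miny, other.miny), max(self.maxy, other.maxy),
            min(self.mint, other.mint), max(self.maxt, other.maxt),
        )

    def __and__(self, other: "Box") -> "Box":
        # Intersection: the overlapping region, or an error if there is none
        box = Box(
            max(self.minx, other.minx), min(self.maxx, other.maxx),
            max(self.miny, other.miny), min(self.maxy, other.maxy),
            max(self.mint, other.mint), min(self.maxt, other.maxt),
        )
        if box.minx > box.maxx or box.miny > box.maxy or box.mint > box.maxt:
            raise ValueError("Boxes have no overlap")
        return box


a = Box(0, 10, 0, 10, 0, 10)
b = Box(5, 15, 5, 15, 5, 15)
print(a & b)         # overlapping region
print(a | b)         # smallest box covering both
print((a & b) in a)  # the overlap lies inside `a`
```

Raising `ValueError` for disjoint boxes is a design choice of this sketch; it lets a caller that combines two datasets catch the error and report that they have no overlap.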

Closes #77
Closes #86
Closes #135
Closes #149
Closes #260

@adamjstewart adamjstewart added the datasets label (Geospatial or benchmark datasets) Sep 19, 2021
@adamjstewart adamjstewart changed the title from "Adding a UnionDataset" to "Overhaul BoundingBox and ZipDataset classes" Nov 13, 2021
@adamjstewart adamjstewart mentioned this pull request Nov 15, 2021
@adamjstewart adamjstewart added this to the 0.2.0 milestone Nov 20, 2021
@adamjstewart
Collaborator

adamjstewart commented Nov 30, 2021

> This will break benchmark.py right?

benchmark.py has now been updated to use & instead of + and to use stack_samples instead of the default collation function.
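
For readers unfamiliar with why a custom collation function is needed, here is a rough, dependency-free sketch of the idea. `stack_samples_sketch` is a hypothetical name and plain lists stand in for tensors; the internals of the real `stack_samples` are not shown in this thread, so treat this only as an illustration of collating dicts whose keys may differ.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def stack_samples_sketch(samples: Iterable[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Group the values of each key across a list of sample dicts.

    Hypothetical stand-in: a tensor-aware version would stack the grouped
    values along a new batch dimension instead of returning plain lists.
    """
    batch: Dict[str, List[Any]] = defaultdict(list)
    for sample in samples:
        for key, value in sample.items():
            # A key missing from some samples just yields a shorter list
            batch[key].append(value)
    return dict(batch)


samples = [
    {"image": [[1, 2], [3, 4]], "bbox": (0, 1, 0, 1, 0, 1)},
    {"image": [[5, 6], [7, 8]], "bbox": (1, 2, 0, 1, 0, 1), "mask": [[0, 1], [1, 0]]},
]
batch = stack_samples_sketch(samples)
print(sorted(batch))
print(len(batch["image"]), len(batch["mask"]))
```

Grouping by key rather than assuming identical keys is what lets samples from differently shaped datasets share one batch.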

@adamjstewart
Collaborator

> I think it is worth re-running the paper experiment and comparing times.

Done. As usual, the results vary quite a bit and are so random as to be completely useless.

Before (main)

$ ./benchmark.py --landsat-root /datadrive/adam/landsat/original --cdl-root /datadrive/adam/cdl/original -n 8 -v
Global seed set to 0

RandomGeoSampler:
  duration: 23.199 sec
  count: 128 patches
  rate: 5.518 patches/sec

GridGeoSampler:
  duration: 17.035 sec
  count: 128 patches
  rate: 7.514 patches/sec

RandomBatchGeoSampler:
  duration: 47.418 sec
  count: 128 patches
  rate: 2.699 patches/sec

ResNet-34:
  duration: 4.040 sec
  count: 128 patches
  rate: 31.684 patches/sec
$ ./benchmark.py --landsat-root /datadrive/adam/landsat/warped --cdl-root /datadrive/adam/cdl/warped -c -n 8 -v
Global seed set to 0

RandomGeoSampler:
  duration: 112.901 sec
  count: 128 patches
  rate: 1.134 patches/sec
CacheInfo(hits=253, misses=834, maxsize=128, currsize=128)

GridGeoSampler:
  duration: 2.377 sec
  count: 128 patches
  rate: 53.855 patches/sec
CacheInfo(hits=1016, misses=8, maxsize=128, currsize=8)

RandomBatchGeoSampler:
  duration: 28.301 sec
  count: 128 patches
  rate: 4.523 patches/sec
CacheInfo(hits=967, misses=57, maxsize=128, currsize=57)

ResNet-34:
  duration: 1.999 sec
  count: 128 patches
  rate: 64.037 patches/sec

After (feature/zipdatasets)

$ ./benchmark.py --landsat-root /datadrive/adam/landsat/original --cdl-root /datadrive/adam/cdl/original -n 8 -v
Global seed set to 0

RandomGeoSampler:
  duration: 19.806 sec
  count: 128 patches
  rate: 6.463 patches/sec

GridGeoSampler:
  duration: 15.489 sec
  count: 128 patches
  rate: 8.264 patches/sec

RandomBatchGeoSampler:
  duration: 18.744 sec
  count: 128 patches
  rate: 6.829 patches/sec

ResNet-34:
  duration: 1.996 sec
  count: 128 patches
  rate: 64.137 patches/sec
$ ./benchmark.py --landsat-root /datadrive/adam/landsat/warped --cdl-root /datadrive/adam/cdl/warped -c -n 8 -v
Global seed set to 0

RandomGeoSampler:
  duration: 62.110 sec
  count: 128 patches
  rate: 2.061 patches/sec
CacheInfo(hits=253, misses=834, maxsize=128, currsize=128)

GridGeoSampler:
  duration: 1.161 sec
  count: 128 patches
  rate: 110.212 patches/sec
CacheInfo(hits=1016, misses=8, maxsize=128, currsize=8)

RandomBatchGeoSampler:
  duration: 5.916 sec
  count: 128 patches
  rate: 21.636 patches/sec
CacheInfo(hits=967, misses=57, maxsize=128, currsize=57)

ResNet-34:
  duration: 2.228 sec
  count: 128 patches
  rate: 57.454 patches/sec

@adamjstewart
Collaborator

> I think it is probably worth having a section that explains how the geo dataset stuff is different than vanilla pytorch.

I assume you mean vanilla torchvision? Both torchvision and torchgeo are pure PyTorch; we aren't doing anything hacky here.

@adamjstewart
Collaborator

@calebrob6 I fleshed out the README a bit more. I think I've addressed most of your comments. Ready for another round of review.

@calebrob6
Member Author

> @calebrob6 I fleshed out the README a bit more. I think I've addressed most of your comments. Ready for another round of review.

I love it! Really well done -- I think it really effectively communicates why torchgeo is really cool. The only thing I'd add is pictures but we should do that later.

@calebrob6
Member Author

Benchmarking...

"the results vary quite a bit and are so random as to be completely useless." This is troubling, and we'll definitely need to revisit it. We should be able to reduce the variance between runs by running longer; there is not anything too random going on.
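
One way to put a number on the run-to-run variance is to repeat each measurement and report the spread alongside the mean. The sketch below is not part of benchmark.py; `time_runs` and `summarize` are hypothetical helpers, and the durations passed to `summarize` are made-up values used only to show the arithmetic.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(fn: Callable[[], object], repeats: int) -> List[float]:
    """Time `fn` `repeats` times and return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and coefficient of variation."""
    mean = statistics.mean(durations)
    stdev = statistics.stdev(durations)  # needs at least two runs
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


# Hypothetical durations (seconds) from three repeated runs of one sampler
print(summarize([10.0, 12.0, 14.0]))

# Timing a placeholder workload; swap in the real benchmark loop
print(summarize(time_runs(lambda: sum(range(100_000)), repeats=5)))
```

A coefficient of variation that shrinks as the number of patches per run grows would support the suggestion that longer runs give more stable numbers.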

@calebrob6
Member Author

Just went through it again carefully and it looks good to me! I can't approve though because I originally opened this PR.

[Image: Plotting Landsat8 + CDL based on intersection dataset]

except ValueError:
    raise ValueError("Datasets have no overlap")

# Force dataset2 to have the same CRS/res as dataset1
dataset2.crs = dataset1.crs
Member Author

Why is this necessary? (this is present in both Union and Intersection)

Collaborator

If the two datasets have a different CRS/res, we need to ensure that they have a matching CRS/res. Otherwise, the combined R-tree index would be meaningless. I added a getter/setter to GeoDataset that updates the entire index when you try to set the CRS to a different CRS.
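
A toy sketch of the getter/setter pattern described here, under loud assumptions: `ToyGeoDataset`, `reproject`, and the unit-label "CRSs" are all hypothetical, the index is a plain list rather than an R-tree, and uniform scaling stands in for real reprojection only to keep the example runnable.

```python
from typing import Dict, List, Tuple

Bounds = Tuple[float, float, float, float]  # (minx, maxx, miny, maxy)

# Hypothetical "CRSs": plain unit labels with a scale factor to meters.
# Real reprojection is not a uniform scaling; this is only a placeholder.
METERS_PER_UNIT: Dict[str, float] = {"m": 1.0, "km": 1000.0}


def reproject(bounds: Bounds, src: str, dst: str) -> Bounds:
    """Toy stand-in for a real coordinate transformation."""
    factor = METERS_PER_UNIT[src] / METERS_PER_UNIT[dst]
    minx, maxx, miny, maxy = bounds
    return (minx * factor, maxx * factor, miny * factor, maxy * factor)


class ToyGeoDataset:
    """Keeps a list of bounds as its index (a real index would be an R-tree)."""

    def __init__(self, crs: str, index: List[Bounds]) -> None:
        self._crs = crs
        self.index = list(index)

    @property
    def crs(self) -> str:
        return self._crs

    @crs.setter
    def crs(self, new_crs: str) -> None:
        if new_crs == self._crs:
            return  # nothing to convert
        # Rebuild every index entry in the new CRS so lookups stay meaningful
        self.index = [reproject(b, self._crs, new_crs) for b in self.index]
        self._crs = new_crs


dataset1 = ToyGeoDataset("m", [(0.0, 1000.0, 0.0, 1000.0)])
dataset2 = ToyGeoDataset("km", [(0.5, 1.5, 0.5, 1.5)])
dataset2.crs = dataset1.crs  # same pattern as the line under review
print(dataset2.crs, dataset2.index)
```

Routing the conversion through a property setter means a plain assignment is enough to keep the index and the CRS in sync, which is why the one-line assignment in the diff is sufficient.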

Member Author

@calebrob6 calebrob6 Dec 2, 2021

Makes sense. Should we warn users that if they use a dataset in an Intersection/Union then it might have its properties changed? (I'm imagining throwing a warning if dataset1.crs != dataset2.crs and the same with res)

Collaborator

Done. I ended up using a print statement instead of a warning since warnings aren't displayed by default.
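
The two options discussed in this exchange can be sketched side by side. `notify_crs_change` is a hypothetical helper, the message text is illustrative rather than the one used in the PR, and the EPSG strings are arbitrary labels chosen for the example.

```python
import contextlib
import io
import warnings
from types import SimpleNamespace


def notify_crs_change(dataset1, dataset2, use_warning: bool = False) -> bool:
    """Tell the user when dataset2 is about to be converted to dataset1's CRS.

    Returns True when a conversion is needed. Hypothetical helper; the
    message text is illustrative only.
    """
    if dataset1.crs == dataset2.crs:
        return False
    message = f"Converting dataset2 CRS from {dataset2.crs} to {dataset1.crs}"
    if use_warning:
        warnings.warn(message, UserWarning)
    else:
        print(message)
    return True


# Arbitrary CRS labels on placeholder objects
ds1 = SimpleNamespace(crs="EPSG:4326")
ds2 = SimpleNamespace(crs="EPSG:3857")

# Capture stdout to show what the print-based variant emits
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    changed = notify_crs_change(ds1, ds2)
print(changed, buffer.getvalue().strip())
```

The same check would apply to `res`; it is omitted here to keep the sketch short.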

adamjstewart
adamjstewart previously approved these changes Dec 2, 2021
Collaborator

@adamjstewart adamjstewart left a comment

Approving on behalf of @calebrob6

@adamjstewart adamjstewart merged commit 5d407b7 into main Dec 3, 2021
@adamjstewart adamjstewart deleted the feature/zipdatasets branch December 3, 2021 22:40
adamjstewart added a commit that referenced this pull request Dec 24, 2021
adamjstewart added a commit that referenced this pull request Dec 24, 2021
@adamjstewart adamjstewart added the utilities (Utilities for working with geospatial data), samplers (Samplers for indexing datasets), documentation (Improvements or additions to documentation), and testing (Continuous integration testing) labels and removed the utilities (Utilities for working with geospatial data) label Jan 2, 2022
yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023
* Adding a UnionDataset

* Adding contains method to BoundingBox

* Finishing UnionDataset

* Add __contains__ method

* Overhaul BoundingBox, add set arithmetic

* mypy fixes

* pydocstyle fixes

* Ignore erroneous pydocstyle warnings

* rtree only supports tuples, not BoundingBoxes

* mypy fixes

* Use custom collate function to handle BoundingBoxes

* Add back support for Python 3.6

* Add tests for all new BoundingBox features

* Rename ZipDataset to IntersectionDataset

* Merge indices of IntersectionDataset, auto-convert CRS/res

* Get tests to pass

* Fix more tests

* Test more of RasterDataset/VectorDataset directly

* Increase UnionDataset test coverage

* IntersectionDataset stacks tensors, UnionDataset merges tensors

* Support collating dicts with differing keys, add tests

* Style fixes

* Samplers: compute intersection between index and ROI

* Update README with example usage

* GeoDataset addition is deprecated

* Add note about CRS/res

* More documentation for Intersection/UnionDatasets

* Use collate function in tutorial

* Don't use multiple workers

* Fix typo

* Drop support for adding GeoDatasets

* Remove unused import

* Add comment explaining coverage config settings

* Collation function needed for benchmark script

* Add more explanation to README

* Correct Landsat 8 bands

* Print warning when changing CRS/res

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023
Labels
datasets (Geospatial or benchmark datasets), documentation (Improvements or additions to documentation), samplers (Samplers for indexing datasets), testing (Continuous integration testing)
Projects
None yet
2 participants