Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pixel sampling mode #294

Merged
merged 17 commits into from
Feb 24, 2022
Merged

Conversation

RitwikGupta
Copy link
Collaborator

@RitwikGupta RitwikGupta commented Dec 16, 2021

This would still need updated docs, benchmarks, tests, etc., but let's verify functionality first. Resolves #279

Co-authored-by: Ashwin Nair <ash1995@gmail.com>
@adamjstewart adamjstewart added the samplers Samplers for indexing datasets label Dec 16, 2021
@adamjstewart adamjstewart added this to the 0.2.0 milestone Dec 19, 2021
Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mostly looks okay to me, just had some questions about the API and ways that we can make things more clear.

torchgeo/samplers/batch.py Outdated Show resolved Hide resolved
torchgeo/samplers/constants.py Outdated Show resolved Hide resolved
torchgeo/samplers/constants.py Outdated Show resolved Hide resolved
torchgeo/samplers/single.py Outdated Show resolved Hide resolved
torchgeo/samplers/utils.py Outdated Show resolved Hide resolved
torchgeo/samplers/utils.py Outdated Show resolved Hide resolved
@adamjstewart
Copy link
Collaborator

Just realized that a lot of our documentation uses size=256 to return "256x256 images". We should definitely push this through since it seems that a size in pixel coordinates is more intuitive even for me who wrote the original samplers.

@RitwikGupta
Copy link
Collaborator Author

The enum module seems to be Python 3.10+. Is that okay? I'd think Python 3.8+ support would be ideal.

@adamjstewart
Copy link
Collaborator

The enum module was added in Python 3.4

@adamjstewart
Copy link
Collaborator

@calebrob6 @isaaccorley (and anyone else): this is a pretty important design change so I would like to get your opinions on this.

Design

To the best of my knowledge, my original thought process for the sampler design (and why I chose CRS size instead of pixel size) was as follows:

  1. When stitching together images in a dataset, or composing multiple GeoDatasets, the images won't share the same origin and may not even share the same pixel grid, so we have to use geospatial coordinates to index GeoDatasets
  2. Since we're using geospatial coordinates for indexing, it is easiest/most natural to specify the size in geospatial coordinates as well

In hindsight, 2) is not necessarily true. Because the index itself is largely abstracted from the user, the user doesn't necessarily think about the fact that the index is in CRS units and that the size may also be in CRS units. Actually, 1) isn't necessarily true either. We could compute the bounds of the image in pixel coordinates relative to the origin and store than in the rtree, then convert those back to relative pixel coords (relative to a particular image) during indexing. But that seems way harder than it needs to be.

In fact, despite being the person who designed this, I myself have made the mistake of thinking that size is in pixel coordinates (our README/paper/blog all currently make this assumption and use size=256 to get "256x256 pixel images"). Before we publish our paper/blog, we really need to settle on a final design.

Discussion

The solution proposed in this PR seems very logical and straightforward. I think we only have a couple remaining decisions to make:

  1. Do we really need to support both pixel units and CRS units? Are there applications where pixel units don't make sense?
  2. If we do support both, which should we use by default? Which is more natural?

Based on the fact that my original design used CRS units, I'm leaning towards supporting both. I think for computer scientists, pixel units make much more sense. However, for remote sensing scientists, I think CRS units might be more intuitive.

One of the advantages of pixel units over CRS units is that you don't need to think too hard about ensuring that your size is an integer multiple of the CRS units. If my CRS uses 30 m/pixel units and I ask for size=256, what happens? Since 256 is not a multiple of 30, the exact behavior isn't well defined. I believe rasterio/GDAL resamples the image so that it's possible to get an image that is exactly 256x256 m, but I'm not sure. At one point, I hacked our samplers to always choose the floor of this floating point number to avoid time-consuming resampling. If we only support pixel units and only allow ints, that avoids this entire problem.

Are there any corner cases where either CRS units or pixel units could be problematic? My biggest concern is w.r.t. #278. If we want to avoid reprojecting images in a dataset (like how @calebrob6 does in the ChesapeakeCVPR dataset), we start to hit problems where the resulting images are not necessarily the exact same size. @calebrob6 what was the specific problem you ran into again, was it that the bounding box you warp ends up rotated and so the bounding box of that bounding box has a different size? If we fix #278 and only reproject when absolutely necessary, we may end up with problems with CRS units or pixel units. I want to ensure that the image we sample is always the same pixel dimensions, even if that means the spatial distance changes.

@calebrob6
Copy link
Member

Do we really need to support both pixel units and CRS units? Are there applications where pixel units don't make sense?

I think so -- one reasonable example I can think of is a RasterDataset made up of tiles that have different pixel resolutions. A user may want to specify 1km crops (e.g. sampling in CRS units) then have a transform step that resamples the resulting images to a fixed size.

If we do support both, which should we use by default? Which is more natural?

Pixel units is more natural to me, but I can see the argument for either.

@calebrob6
Copy link
Member

calebrob6 commented Feb 15, 2022

  • Need to add enum to docs
  • Need to explain what is going on here at docs/apis/samplers.rst

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 16, 2022
@RitwikGupta
Copy link
Collaborator Author

@calebrob6 how's that? Not sure the best way to do the docs.

@calebrob6
Copy link
Member

Looks good -- I can finish off the rest of this too (the above are basically notes from a conversation with Adam). We think that it is okay to merge this with the added documentation, then work on how GeoDataset / RasterDataset should be indexed / how the samplers should interact with them (see #409).

@adamjstewart
Copy link
Collaborator

This will also need unit tests to ensure that size=10 in pixel units actually results in an image that is 10x10 pixels, and that size=300 in CRS units where res=30 is also equivalent.

@RitwikGupta
Copy link
Collaborator Author

RitwikGupta commented Feb 19, 2022

@adamjstewart Does anything have to change in the index for this to work? I think I'll need help writing this test out.

@adamjstewart
Copy link
Collaborator

adamjstewart commented Feb 22, 2022

Changes in last commit:

  • Clarifications to docs
  • Add Units to docs
  • Add Units to torchgeo.samplers import
  • Revert change to local imports
  • units must be of type Units, not int
  • remove "in units of CRS" from size docs

Remaining things I'll work on now:

  • Add units arg to GridGeoSampler
  • Update benchmarking code
  • Add tests

For GridGeoSampler, we don't use get_random_bounding_box. I think it would be better to convert size to the right units directly in GeoSampler or one of its subclasses instead of doing this in get_random_bounding_box. I'll make this change.

@github-actions github-actions bot added the testing Continuous integration testing label Feb 23, 2022
@adamjstewart
Copy link
Collaborator

Hmm, possible enum values don't seem to show up in the docs. Not sure what to do about that, maybe a bug in Sphinx?

Anyway, I think this looks good to me now.

@adamjstewart adamjstewart merged commit cc1a9fb into microsoft:main Feb 24, 2022
@adamjstewart adamjstewart added the backwards-incompatible Changes that are not backwards compatible label Jul 10, 2022
@adamjstewart adamjstewart mentioned this pull request Jul 11, 2022
yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023
* Add pixel sampling mode

* Fix maxy indexing error

Co-authored-by: Ashwin Nair <ash1995@gmail.com>

* Add sample_mode docstrings, default to PIXELS

* Replace sample_mode with units

* Update to use enum

* Fix mypy, tuple, and flake8 issues

* Fix isort and pydocstyle problems

* Update sampler docs to discuss unit sampling mode

* Various fixes

* Add units arg to GridGeoSampler

* Update benchmark script

* Add tests

* Document enum values

* mypy fixes

Co-authored-by: Ashwin Nair <ash1995@gmail.com>
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backwards-incompatible Changes that are not backwards compatible documentation Improvements or additions to documentation samplers Samplers for indexing datasets testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sample in pixel units, not CRS units
4 participants