RandomGeoSampler bias #408

Closed
adamjstewart opened this issue Feb 16, 2022 · 2 comments · Fixed by #477
Labels
samplers Samplers for indexing datasets

Comments

@adamjstewart
Collaborator

I've discovered two instances of bias in our current RandomGeoSampler and RandomBatchGeoSampler implementations.

Area bias

In our current implementations, we first select a tile uniformly at random, then choose a random patch from that tile. This means that small tiles are just as likely to be sampled as large tiles. For most datasets, this is not an issue, as all tiles are approximately the same size. However, for IntersectionDatasets, the area of each intersection can vary widely. A tile that is barely large enough to yield a single sample is selected just as often as a massive tile. This is particularly problematic for RandomBatchGeoSampler, which would sample an entire batch of images from that very small tile.

This issue is inherent in our current implementation, but it is relatively easy to fix. The solution would be to use a weighted random sampler where each tile's weight is derived from its area, so that large tiles are more likely to be sampled from than small tiles.
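
To make the idea concrete, here is a minimal sketch of area-weighted tile selection using only the standard library. It is not the TorchGeo implementation: the `Tile` tuple and the helper names `tile_area`, `sample_tile`, and `sample_patch` are hypothetical stand-ins for whatever the samplers and bounding boxes actually provide.

```python
import random

# Hypothetical tile bounds as (minx, maxx, miny, maxy); not TorchGeo's BoundingBox.
Tile = tuple[float, float, float, float]


def tile_area(tile: Tile) -> float:
    """Area of a tile in the units of its coordinate system."""
    minx, maxx, miny, maxy = tile
    return (maxx - minx) * (maxy - miny)


def sample_tile(tiles: list[Tile], rng: random.Random) -> Tile:
    """Pick one tile with probability proportional to its area."""
    weights = [tile_area(t) for t in tiles]
    return rng.choices(tiles, weights=weights, k=1)[0]


def sample_patch(tile: Tile, size: float, rng: random.Random) -> Tile:
    """Pick a size x size patch uniformly at random inside the tile."""
    minx, maxx, miny, maxy = tile
    x = rng.uniform(minx, maxx - size)
    y = rng.uniform(miny, maxy - size)
    return (x, x + size, y, y + size)
```

With this scheme a 10 x 10 tile is chosen about 100 times as often as a 1 x 1 tile, instead of equally often. A batch sampler could draw the tile once with `sample_tile` and then call `sample_patch` repeatedly on it.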

Latitude bias

In certain 2D projections like Mercator, polar regions are stretched so that they cover far more projected area than they do on the actual 3D Earth. If we sample uniformly at random in projected coordinates, we end up oversampling the poles relative to the equator. This has serious consequences for model training, including models that are biased towards polar regions or that overestimate/underestimate certain climate patterns.

This is an issue inherent to 2D projections of the Earth in general, not necessarily an issue in TorchGeo. However, we could still try to do something about it. One solution would be to force the R-tree index to be in an equal-area projection like Albers. Another solution would be to force the index to be in Mercator and to use a weighted random sampler where the weight comes from (the square root of?) the latitude.
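
As a rough sketch of what a latitude weight could look like, assuming a spherical Earth: in Mercator the map is stretched by 1/cos(lat) in both directions, so the ground area per unit of projected area falls off as cos²(lat), while on a plain lat/lon (equirectangular) grid only the east-west direction is stretched and the factor is cos(lat). The function names below are hypothetical and this is not an existing TorchGeo API.

```python
import math
import random

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)


def mercator_y_to_lat(y: float) -> float:
    """Invert a spherical Mercator northing (metres) to latitude in radians."""
    return 2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0


def area_weight(lat: float, projection: str) -> float:
    """Relative ground area per unit of projected area at latitude `lat` (radians)."""
    if projection == "mercator":
        return math.cos(lat) ** 2
    if projection == "equirectangular":
        return math.cos(lat)
    raise ValueError(f"no latitude weight known for {projection!r}")


def accept_sample(lat: float, projection: str, rng: random.Random) -> bool:
    """Rejection step: keep a uniformly drawn patch with probability equal to its weight."""
    return rng.random() < area_weight(lat, projection)
```

The weight can either feed a weighted random sampler or, as in `accept_sample`, be used to reject patches drawn uniformly in projected coordinates. Rejection wastes draws at high latitudes but needs no precomputed weights.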

@adamjstewart adamjstewart added the samplers Samplers for indexing datasets label Feb 16, 2022
@adamjstewart adamjstewart added this to the 0.2.2 milestone Mar 19, 2022
@adamjstewart
Collaborator Author

I think latitude bias is going to be difficult to correct. It depends on the CRS being used (equal-angle CRSs are affected, but equal-area CRSs are not), so we can't simply use a weighted random sampler without first checking against a list of known equal-angle CRSs. It's also unclear what to do for CRSs that are neither equal-angle nor equal-area.
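
For illustration only, the "list of known CRSs" check might look something like the sketch below. The table is hand-written and deliberately incomplete, and the category names and the `latitude_correction` helper are assumptions made for this example, not part of TorchGeo or any CRS library.

```python
from typing import Optional

# Hand-maintained, deliberately incomplete table (an assumption for illustration).
KNOWN_CRS_KINDS = {
    "EPSG:4326": "equal_angle",  # WGS 84 lat/lon, treated as a flat grid of degrees
    "EPSG:3857": "mercator",     # Web Mercator
    "EPSG:5070": "equal_area",   # NAD83 / Conus Albers
    "EPSG:6933": "equal_area",   # WGS 84 / NSIDC EASE-Grid 2.0 Global
}


def latitude_correction(crs: str) -> Optional[str]:
    """Name the latitude weight to apply for a CRS, or None if the CRS is unknown."""
    kind = KNOWN_CRS_KINDS.get(crs.upper())
    if kind == "equal_area":
        return "none"  # projected area already matches ground area
    if kind == "equal_angle":
        return "cos(lat)"
    if kind == "mercator":
        return "cos(lat)^2"
    return None  # unclear what to do, so the caller has to decide
```

Returning `None` for anything outside the table keeps the open question visible rather than silently applying a weight that may be wrong for that CRS.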

@adamjstewart adamjstewart modified the milestones: 0.2.2, 0.3.0 Mar 19, 2022
@adamjstewart
Collaborator Author

Correcting area bias will require the new bbox.area attribute introduced in 0.3.0.
