Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GeoDataset: avoid unnecessary reprojection #409

Open
adamjstewart opened this issue Feb 16, 2022 · 22 comments
Open

GeoDataset: avoid unnecessary reprojection #409

adamjstewart opened this issue Feb 16, 2022 · 22 comments
Labels
datasets Geospatial or benchmark datasets samplers Samplers for indexing datasets

Comments

@adamjstewart
Copy link
Collaborator

This issue is meant to serve as a proposal for how to fix #278. Please try to poke holes in this proposal and find ways in which it won't work, or suggest alternative solutions.

Rationale

Geospatial data files are stored in a particular CRS. In order to build a single R-tree index of a geospatial dataset, the bounds of all files must be reprojected to the same CRS. During sampling, we also reproject the file itself to a common CRS so that we can stitch together multiple files from the same layer, or combine multiple layers in an IntersectionDataset.

Unfortunately, reprojection/resampling is very slow, and can negatively affect data loader performance (see section 4.2 of https://arxiv.org/abs/2111.08872). It can also lead to large distortions of images in polar regions. We would like to be able to avoid reprojection if possible.

Proposal

I propose to only reproject files during sampling when absolutely necessary. This includes any situation in which we sample from multiple files at once and all files are not in the same CRS, such as:

  • a single GeoDataset containing overlapping files on the border of two UTM zones
  • an IntersectionDataset with datasets in different CRSes
  • a UnionDataset with overlapping files and datasets in different CRSes

The details of how this can be done is left to the "Implementation" section below, but the basic idea is to add an additional parameter for __getitem__ that contains the CRS to project to. The sampler is then responsible for setting this CRS in a way that minimizes reprojection.

There is one issue that arises with this implementation. If we choose our bounding box in one CRS and reproject it to a different CRS, the bounding box will be rotated, and the enclosing bounding box may be larger than the original. This can result in tensors of different size that cannot be concatenated without padding. The "Implementation" section describes how this problem can be avoided.

Implementation

The details of what will need to be changed are as follows:

  1. GeoDataset.index: instead of storing {bbox: filename}, store {bbox: [(filename, CRS), ...]}
  2. {Intersection|Union}Dataset.index: instead of storing {bbox}, store {bbox: [(filename, CRS), ...]}
  3. GeoDataset.__getitem__: instead of taking a BoundingBox(minx, maxx, miny, maxy, mint, maxt), accept a BoundingBox(x, y, t, width, height, time) and CRS, then project (x, y) to that CRS before sampling
  4. GeoSampler: look at the native CRS of all files in the R-tree and select the mode as the CRS to return

By reprojecting only (x, y) instead of (minx, maxx, miny, maxy), we avoid issues with inconsistent tensor sizes during concatenation. The width/height can be applied directly in the new CRS without distortion.

Note that the R-tree index will still use (minx, maxx, miny, maxy, mint, maxt) as its index, while __getitem__ will take a BoundingBox(x, y, t, width, height, time). This is quite different from the previous approach that used the former for both.

Alternatives

There have been a few other alternatives proposed, each with their own pros/cons:

  1. Always project to shared CRS—This is our current implementation. It can be slow and cause spatial distortion, but it's simple and straightforward.
  2. Always sample in native CRS—This is the implementation we use for ChesapeakeCVPR. It is the fastest possible solution with the least distortion, but it disallows datasets with overlapping images in different CRS, so it's somewhere between VisionDataset and GeoDataset. It also suffers from the aforementioned bug with tensor concatenation shape mismatches, although we could employ the same solution as proposed here to fix that.
  3. Add flag to control reprojection—We could add a flag that tells TorchGeo whether or not to reproject files. Unfortunately, this just punts the problem to the user and allows them to shoot themselves in the foot. This also results in more code paths to test and debug.
  4. Require single CRS—We could require all users to reproject their data themselves and ensure that all files are in the same CRS. Then we never have to worry about reprojection. This is similar to what Raster Vision does (from my understanding). Fast at loading time, but removes most of what makes TorchGeo cool. Also increases storage demands.

@RitwikGupta

@adamjstewart adamjstewart added datasets Geospatial or benchmark datasets samplers Samplers for indexing datasets labels Feb 16, 2022
@weiji14
Copy link
Contributor

weiji14 commented Feb 17, 2022

Just to give my two cents as I've been following #278 closely and trying to figure out how to use torchgeo to load multiple Sentinel-2 and higher resolution datasets from different UTM zones without reprojection for an object detection project 😃

  1. GeoDataset.index: instead of storing {bbox: filename}, store {bbox: [(filename, CRS), ...]}
  2. {Intersection|Union}Dataset.index: instead of storing {bbox}, store {bbox: [(filename, CRS), ...]}

Ok with this, but instead of storing tuple(filename, CRS) in the rtree index, I would suggest storing a Python dict ({"filename": filename, "CRS": epsg}) or better, a namedtuple ((filename=filename, CRS=epsg)) instead. This would be a bit more future proof as there might be more metadata we want to store, e.g. the resolution of the raster GeoDataset (c.f. #74, e.g. when people want to store different Sentinel-2 images in their native resolution of 10m, 20m, 60m).

  1. GeoDataset.__getitem__: instead of taking a BoundingBox(minx, maxx, miny, maxy, mint, maxt), accept a BoundingBox(x, y, t, width, height, time) and CRS, then project (x, y) to that CRS before sampling

Just a minor (but important) detail. I insist that the x and y in BoundingBox(x, y, t, width, height, time) be the middle x and y, and not some arbitrary corner. I've had problems before with having to flip data up-down because some people use minx/miny from the bottom left and some people use minx/miny from the top left 🙃

  1. GeoSampler: look at the native CRS of all files in the R-tree and select the mode as the CRS to return

Could you elaborate more on this? By 'mode', do you mean having a flag where the default is reproject to the R-tree index's CRS (as with torchgeo v0.2.0), and a mode='native' that just samples at the raster's native projection?

Either way, I'm happy to help review and/or open a PR to implement this in torchgeo!

@adamjstewart
Copy link
Collaborator Author

Thanks for the feedback @weiji14!

I would suggest storing a Python dict... or better, a namedtuple... instead.

A named tuple could work. Proper type hints weren't added until Python 3.7 but we could use typing-extensions to backport this. I'm not super worried about being future proof because we rebuild the index each time, so it's not like we need to support old indices. I don't think many users directly pass a value to __getitem__, I think most users instantiate a GeoSampler and let that determine the value to pass.

I insist that the x and y in BoundingBox(x, y, t, width, height, time) be the middle x and y, and not some arbitrary corner.

I actually proposed this exact thing to @calebrob6 because I was worried that we would end up with more distortion if we use a corner instead of a centroid. However, the problem is that you can run into off-by-one errors if the width/height is odd vs even. It also adds a bit of complexity to the code if you want to specify a bounding box that rasterio can actually read. Again, I'm hoping that most users won't need to think about BoundingBox at all, they will just get it from a GeoSampler.

Could you elaborate more on this?

I mean "mode" in the mathematical sense. If I'm sampling from an IntersectionDataset and the area of overlap has files in (CRS1, CRS1), I will use CRS1. If the area has files in (CRS1, CRS2, CRS2), I'll use CRS2. If the area has files in (CRS1, CRS2, CRS3), I'll randomly choose one since they're all equally common. This minimizes the number of files we need to reproject.

@adamjstewart
Copy link
Collaborator Author

Proper type hints weren't added until Python 3.7 but we could use typing-extensions to backport this.

Actually, that's for TypedDict, not NamedTuple. The former was added in Python 3.7, the latter is available in all versions of Python we support.

@adamjstewart
Copy link
Collaborator Author

  1. GeoDataset.index: instead of storing {bbox: filename}, store {bbox: [(filename, CRS), ...]}
  2. {Intersection|Union}Dataset.index: instead of storing {bbox}, store {bbox: [(filename, CRS), ...]}

We might have a problem with this. I was hoping to contribute our intersection/union logic directly to rtree: Toblerity/rtree#210. However, it isn't clear what the behavior should be when computing an intersection for general rtree indices. For example, if an rtree stores general objects, the only possible way to compute an intersection would be to store the tuple of objects from both indices. So a dataset like ((ds1 & ds2) & ds3) would end up with an object like [[(bbox1, CRS1), (bbox2, CRS2)], (bbox3, CRS3)]. This would be less fun to parse than a single list, but we can't use a single list if we want general rtree support.

@weiji14
Copy link
Contributor

weiji14 commented Feb 18, 2022

I would suggest storing a Python dict... or better, a namedtuple... instead.

A named tuple could work. Proper type hints weren't added until Python 3.7 but we could use typing-extensions to backport this. I'm not super worried about being future proof because we rebuild the index each time, so it's not like we need to support old indices. I don't think many users directly pass a value to __getitem__, I think most users instantiate a GeoSampler and let that determine the value to pass.

Is torchgeo still supporting Python 3.6? Considering that Python 3.6 is EOL already and many Python scientific/geospatial packages are moving towards adopting NEP29 (e.g. rasterio/rasterio#1813, Toblerity/Fiona#1034) which is technically Python 3.8+, I really don't feel the need to do the backporting. I did find b6925f6 and the closed PR #326 though, so if there's a reason to have Python 3.6 support for a few more months, that's ok too.

I insist that the x and y in BoundingBox(x, y, t, width, height, time) be the middle x and y, and not some arbitrary corner.

I actually proposed this exact thing to @calebrob6 because I was worried that we would end up with more distortion if we use a corner instead of a centroid. However, the problem is that you can run into off-by-one errors if the width/height is odd vs even. It also adds a bit of complexity to the code if you want to specify a bounding box that rasterio can actually read. Again, I'm hoping that most users won't need to think about BoundingBox at all, they will just get it from a GeoSampler.

Right, I'm thinking of large full scene satellite images several km wide that would definitely get distorted if we choose a corner vs a centroid. Perhaps raise a warning if the user sets an odd number width/height, saying that there will be some offset or resampling going on? Let's try to draft an implementation and see how it goes, like you say, most of this should be backend math calculations casual users shouldn't have to worry about.

Could you elaborate more on this?

I mean "mode" in the mathematical sense. If I'm sampling from an IntersectionDataset and the area of overlap has files in (CRS1, CRS1), I will use CRS1. If the area has files in (CRS1, CRS2, CRS3), I'll use CRS2. If the area has files in (CRS1, CRS2, CRS3), I'll randomly choose one since they're all equally common. This minimizes the number of files we need to reproject.

Ah ok, that statistical 'mode'! Personally, I'd prefer the sampling to be more deterministic by having an explicit flag/parameter to control what CRS is being returned, and whether or not any reprojection should happen. To your point on tensor concatenation shape mismatches (mentioned in alternative point 2 #409 (comment)), I think this will become a non-issue with #294, so we would be left with the problem on how to sample from overlapping regions, in which a 'mode' based sampling or 'first-raster' based sampling approach would be ok.

@adamjstewart
Copy link
Collaborator Author

Is torchgeo still supporting Python 3.6?

For the time being, yes. I would prefer to stick to Python's EOL timeline, but there are many systems that still use Python 3.6. I strongly disagree with numpy's aggressive deprecation timeline, as Google Colab still runs Python 3.7 by default. I think we'll likely compromise and stick to whatever versions of Python that PyTorch supports (currently 3.6–3.9 in the latest release, soon to be 3.7–3.9 in the next release). Unless I can convince @calebrob6 to drop 3.6 support since PyTorch has already dropped it in master.

Personally, I'd prefer the sampling to be more deterministic by having an explicit flag/parameter to control what CRS is being returned

This sounds like alternative 1 (our current behavior). If we do this, we require all samples to be projected to the same CRS, which leads to extreme distortion and slow data loading times.

For example, consider a dataset composed of files from 4 different UTM zones. If a sample contains files in zones A and B, we want to return a sample in one of those zones. If another sample contains files in zones C and D, we don't want to warp both files to zone A or B, we want to choose from C and D in order to minimize reprojection.

and whether or not any reprojection should happen.

This sounds like alternative 3. As I mentioned above, there are many issues with this. If I specify reproject=False and a sample contains files in two different CRSes, what should the behavior be? Crash? Only sample from one file? If we do the latter, the dataset basically becomes a VisionDataset and can no longer be combined intelligently with other GeoDatasets.

@weiji14
Copy link
Contributor

weiji14 commented Feb 18, 2022

Is torchgeo still supporting Python 3.6?

For the time being, yes. I would prefer to stick to Python's EOL timeline, but there are many systems that still use Python 3.6. I strongly disagree with numpy's aggressive deprecation timeline, as Google Colab still runs Python 3.7 by default. I think we'll likely compromise and stick to whatever versions of Python that PyTorch supports (currently 3.6–3.9 in the latest release, soon to be 3.7–3.9 in the next release). Unless I can convince @calebrob6 to drop 3.6 support since PyTorch has already dropped it in master.

Yep, I think keeping Python 3.7 is definitely good since Pytorch still supports it. Let's continue this discussion at #413 since typing.NamedTuple should work in Python 3.6 as you said. I've got a draft implementation at #411 for this ready.

Personally, I'd prefer the sampling to be more deterministic by having an explicit flag/parameter to control what CRS is being returned

This sounds like alternative 1 (our current behavior). If we do this, we require all samples to be projected to the same CRS, which leads to extreme distortion and slow data loading times.

For example, consider a dataset composed of files from 4 different UTM zones. If a sample contains files in zones A and B, we want to return a sample in one of those zones. If another sample contains files in zones C and D, we don't want to warp both files to zone A or B, we want to choose from C and D in order to minimize reprojection.

and whether or not any reprojection should happen.

This sounds like alternative 3. As I mentioned above, there are many issues with this. If I specify reproject=False and a sample contains files in two different CRSes, what should the behavior be? Crash? Only sample from one file? If we do the latter, the dataset basically becomes a VisionDataset and can no longer be combined intelligently with other GeoDatasets.

Consider a semantic segmentation task (X -> Y) with:

  • Sentinel-2 images (X_1) in UTM zone 11N
  • Sentinel-2 image (X_2) in UTM zone 12N
  • Land use classification mask (Y) in EPSG:3857

I would say that the training input X should not be reprojected. The sampler should ideally take be taking 256x256 crops of the UTM zone 11N image and UTM zone 12N image, both in their native projections. The land use classification mask (which is of lower quality) in EPSG:3857 would then be reprojected to either UTM zone 11N or UTM zone 12N, depending on where the sampler is sampling from. The key is that the training image X should remain in the native UTM projection, because we will always be running inference on the native UTM projection of Sentinel-2 images (X), and not the projection of the mask file (EPSG:3857).

My initial idea of doing this is as follows, note that this assumes pixel sampling mode (c.f. #294):

## Land use classification (LUC) dataset stored in EPSG:3857
mask_dataset = RasterDataset(filepath="landuseclassification.tif")
# mask_sample = {
#        "image": torch.tensor(mask),  # shape: (1, 256, 256)
#        "crs": self.crs,  # in EPSG:3857
#        "bbox": query,  # in EPSG:3857
#  }

## Sentinel-2 image with 12 bands across UTM zone 11N and 12N
image_dataset11 = RasterDataset(filepath="S2_UTM_Zone_11N.tif")
image_dataset12 = RasterDataset(filepath="S2_UTM_Zone_12N.tif")
# image_sample = {
#        "image": torch.tensor(image),  # shape: (12, 256, 256)
#        "crs": self.crs,  # in UTM projection
#        "bbox": query,  # bounds in UTM projection
#  }

## Pair Sentinel-2 dataset with LUC dataset
utm11N_mask_dataset = IntersectionGeoDataset(dataset1=image_dataset11, dataset2=mask_dataset)
utm12N_mask_dataset = IntersectionGeoDataset(dataset1=image_dataset12, dataset2=mask_dataset)
# image_mask_sample1 = {
#        "image": torch.tensor(image_mask),  # shape: (13, 256, 256)
#        "crs": self.crs,  # in UTM projection 11N
#        "bbox": query,  # bounds in UTM projection 11N
#  }
# image_mask_sample2 = {
#        "image": torch.tensor(image_mask),  # shape: (13, 256, 256)
#        "crs": self.crs,  # in UTM projection 12N
#        "bbox": query,  # bounds in UTM projection 12N
#  }

## Combine UTM Zone 11N and Zone 12N GeoDatasets
full_image_dataset = UnionGeoDataset(
    dataset1=utmzone11N_mask_dataset, dataset2=utmzone12N_mask_dataset
)
# full_image_sample = {
#        "image": torch.tensor(image),  # shape: (13, 256, 256)
#        "crs": self.crs,  # in UTM projection 11N or 12N, depending on area
#        "bbox": query,  # bounds in what projection?? UTM 11N, 12N, or EPSG:3857?
#  }

So the problem is with the last Union operation - what projection should the Rtree index be in? UTM Zone 11N, 12N, or EPSG:3857? Maybe you're right and what I'm after might not be a job for IntersectionGeoDataset/UnionGeoDataset, but just a VisionDataset, or something else entirely like a ConcatGeoDataset or CombinedGeoDataset (for lack of a better term) that does not do reprojection at all (i.e. the user has to handle it manually).

It did occur to me that there is an inherent limitation with the rtree library that only allows for one projection type. I had an idea to store 3 rtree indexes, one for each GeoDataset being combined, and have GeoSampler do a for-loop through all of those. But not sure how well that would work in practice, and probably more trouble than it is worth.

@adamjstewart
Copy link
Collaborator Author

I would say that the training input X should not be reprojected.

What if the training dataset X contains a file A in UTM zone 11N and a file B in UTM zone 12N and both A and B intersect with the sampler bounding box? Do you a) crash, or b) only return one of the two images? How do you decide which image to return?

You do make a good point that we should avoid reprojecting float images and prefer reprojecting int label masks when possible, although it's going to be difficult to assert that preference in my proposed solution.

@weiji14
Copy link
Contributor

weiji14 commented Feb 18, 2022

I would say that the training input X should not be reprojected.

What if the training dataset X contains a file A in UTM zone 11N and a file B in UTM zone 12N and both A and B intersect with the sampler bounding box? Do you a) crash, or b) only return one of the two images? How do you decide which image to return?

How about taking the image where the sampled bounding box overlaps with most? Or if the box happens to be in the middle and the overlap areas are the same, I would just go with returning the first one in the case of RandomGeoSampler (since it is random anyway). Not sure how to handle it for GridGeoSampler though, probably should return both images, but not sure if that is possible looking at the code.

You do make a good point that we should avoid reprojecting float images and prefer reprojecting int label masks when possible, although it's going to be difficult to assert that preference in my proposed solution.

Taking a step back, I think we can both agree that implementation details 1, 2 and 3 are good enough, and can be done right now just with some code refactoring. Most of the debate appears to be on implementation detail 4 (how GeoSampler should behave), and given that there are probably a whole lot of combinations we haven't even though about, maybe we should come up with the simplest method that makes sense (e.g. taking the 'mode' projection, or projection of the 'first' GeoDataset), and make it easier somehow for people to subclass or adapt GeoSampler to fit their own specific problem.

@adamjstewart
Copy link
Collaborator Author

The problem is that implementation details 1, 2, and 3 are only necessary if we decide to go with this proposal. Alternative 1 (not changing anything, requiring the user to specify the CRS to reproject to) means not changing anything. Let's decide if this proposal is the behavior we want to achieve before we worry about making any changes. I'm still interested in hearing other alternative proposals, especially since you seem interested in alternatives 1 and 3.

@weiji14
Copy link
Contributor

weiji14 commented Feb 18, 2022

The problem is that implementation details 1, 2, and 3 are only necessary if we decide to go with this proposal. Alternative 1 (not changing anything, requiring the user to specify the CRS to reproject to) means not changing anything. Let's decide if this proposal is the behavior we want to achieve before we worry about making any changes. I'm still interested in hearing other alternative proposals, especially since you seem interested in alternatives 1 and 3.

Well, I would argue that implementation details 1 and 2 is actually quite useful regardless of what happens. At the very least, having the CRS information in the Rtree index would allow me to create a custom Dataset/DataModule containing data from different projections more easily, instead of having to subclass RasterDataset and rewrite these 70 lines of logic:

def __init__(
self,
root: str,
crs: Optional[CRS] = None,
res: Optional[float] = None,
transforms: Optional[Callable[[Dict[str, Any]], Dict[str, Any]]] = None,
cache: bool = True,
) -> None:
"""Initialize a new Dataset instance.
Args:
root: root directory where dataset can be found
crs: :term:`coordinate reference system (CRS)` to warp to
(defaults to the CRS of the first file found)
res: resolution of the dataset in units of CRS
(defaults to the resolution of the first file found)
transforms: a function/transform that takes an input sample
and returns a transformed version
cache: if True, cache file handle to speed up repeated sampling
Raises:
FileNotFoundError: if no files are found in ``root``
"""
super().__init__(transforms)
self.root = root
self.cache = cache
# Populate the dataset index
i = 0
pathname = os.path.join(root, "**", self.filename_glob)
filename_regex = re.compile(self.filename_regex, re.VERBOSE)
for filepath in glob.iglob(pathname, recursive=True):
match = re.match(filename_regex, os.path.basename(filepath))
if match is not None:
try:
with rasterio.open(filepath) as src:
# See if file has a color map
try:
self.cmap = src.colormap(1)
except ValueError:
pass
if crs is None:
crs = src.crs
if res is None:
res = src.res[0]
with WarpedVRT(src, crs=crs) as vrt:
minx, miny, maxx, maxy = vrt.bounds
except rasterio.errors.RasterioIOError:
# Skip files that rasterio is unable to read
continue
else:
mint: float = 0
maxt: float = sys.maxsize
if "date" in match.groupdict():
date = match.group("date")
mint, maxt = disambiguate_timestamp(date, self.date_format)
coords = (minx, maxx, miny, maxy, mint, maxt)
self.index.insert(i, coords, filepath)
i += 1
if i == 0:
raise FileNotFoundError(
f"No {self.__class__.__name__} data was found in '{root}'"
)
self._crs = cast(CRS, crs)
self.res = cast(float, res)

@adamjstewart adamjstewart added this to the 0.3.0 milestone Feb 18, 2022
@adamjstewart
Copy link
Collaborator Author

When we make this switch, we should also switch from interleaved=True to interleaved=False in our rtree index. The former is known to be buggy and actually discouraged by the rtree maintainers: Toblerity/rtree#204 (comment)

@adamjstewart
Copy link
Collaborator Author

If anyone ever tries to implement this, we should use a dict, not a tuple. There are a lot of things we would like to optionally store in the R-tree that aren't always available but could be used by the sampler. For example, we could store percent cloud cover and have a sampler weighted by cloud cover.

@AdeelH
Copy link

AdeelH commented Aug 10, 2024

Require single CRS—We could require all users to reproject their data themselves and ensure that all files are in the same CRS. Then we never have to worry about reprojection. This is similar to what Raster Vision does (from my understanding).

Sorry to resurrect this old issue, but re: Raster Vision: the above is incorrect.

The RV approach (which has not changed since this issue was created) is to have one GeoDataset per scene (i.e. per file) and then concatenate them using PyTorch's ConcatDataset for training etc. This means you can even concatenate GeoDatasets with ImageDatasets (RV's counterpart to NonGeoDatasets) or even with non-RV Datasets. So not only can you sample from a mix of scenes with different projections, but you can even do it from a mix of georeferenced and non-georeferenced data sources. At no point* is any file reprojected.

You can even take it a step further and stack bands from files with different projections. Needless to say, the labels can come from a source with different projection than the imagery as well.

The concatenation is possible because RV GeoDatasets have a length and are indexable by integers. Unlike TorchGeo, RV does not use specialized Samplers. The spatial sampling is part of the definition of GeoDataset subclasses, SlidingWindowGeoDataset and RandomWindowGeoDataset, and is done in pixel-space. The map-to-pixel coordinate and vice versa transformation is performed at a different level of abstraction, namely at the RasterSource and the VectorSource level.

Happy to provide more details about RV's approach, if needed.


*An exception is if you create a GeoDataset (or RasterioSource) by passing multiple URIs; in that case, those are mosaiced together into a single VRT using gdal.

@adamjstewart
Copy link
Collaborator Author

Thanks for the clarification @AdeelH! If I understand correctly, RV never does reprojection unless you stack bands or combine images and masks from different CRS? I think this is similar to what I'm proposing here.

@AdeelH
Copy link

AdeelH commented Aug 11, 2024

Yes, if by reprojection you mean transforming the bounding box being read between pixel and map coordinates. E.g. when sampling from a MultiRasterSource, the query bounding box (if I'm using the TorchGeo terminology correctly), which is initially in pixel coords, will

  • get transformed to map coords
  • and then, for each sub-RasterSource:
    • will get transformed to pixel coords in the sub-RasterSource's respective pixel space
    • which will then be used to read a chip
  • finally, the sub-chips will be concatenated along the channel dimension

Each RasterSource has an associated CRSTransformer which can transform bounding boxes between pixel coords and coords in some map CRS (EPSG:4326 by default).

@adamjstewart
Copy link
Collaborator Author

By reprojection, I mean using gdalwarp (or rasterio.warp) to project the data from one CRS to another. That's what this issue is about.

@AdeelH
Copy link

AdeelH commented Aug 11, 2024

In that case, no, RV never does reprojection nor requires users to do it.

@adamjstewart
Copy link
Collaborator Author

If users want to combine images and masks in the same location but two different CRSs, who does the reprojection? The user?

@AdeelH
Copy link

AdeelH commented Aug 12, 2024

It's supposed to work the same way as when stacking bands from different sources that I described above. Since RV samples windows in pixel coords e.g. Box(0, 0, 100, 100), if we're sampling a window from the image RasterSource, that window will be transformed to the pixel coords of the mask RasterSource such that it still references the same spatial area. Although as I'm looking at the code now, the band-stacking implementation is currently more general than the image-mask alignment implementation. The former can handle different CRSs, extents, and resolutions while the latter can only handle different CRSs and extents.

If the masks are coming from a vector data source (e.g. GeoJSON), the geometries are transformed to pixel coords in-memory, as shown here.

Edit: Note: It's entirely possible that this approach is not fully general and does not work for all CRS combinations. If you can think of cases where it might fail, I'd be eager to know.

@adamjstewart
Copy link
Collaborator Author

If the images are in different coordinate systems, don't you end up with rotated/warped images like in Figure 2 of https://arxiv.org/abs/2111.08872? How do you even compute pixel coordinates if the images don't have the same bounds?

@AdeelH
Copy link

AdeelH commented Aug 12, 2024

Example noteboook showing stacking of bands from NAIP, Sentinel-2, and Landsat with geospatial alignment without reprojection: https://colab.research.google.com/drive/1trjhV2ohQN9AkJb889XuOymr-Z9F7YLt?usp=sharing

Sampled chip:
image

What the scenes look like in QGIS:
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets samplers Samplers for indexing datasets
Projects
None yet
Development

No branches or pull requests

3 participants