RasterDataset support for nodata masks. #1078

Open
pmaldonado opened this issue Jan 31, 2023 · 4 comments
Labels
datasets Geospatial or benchmark datasets

Comments

@pmaldonado

Summary

We suggest adding support for nodata masking in RasterDataset, allowing users to optionally include a nodata mask with each sample returned by __getitem__.

Rationale

In one of our applications, we take NAIP quarter quadrangles and sample the RasterDataset with a spatial sampler. We noticed that our labels will occasionally appear at the edge of a raster where there is no data present. We'd like to mask those regions during training and evaluation to prevent those labels from being included in the loss.
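As a rough illustration of the goal, and assuming a boolean nodata mask is already available for each patch, a per-pixel loss can be averaged over valid pixels only. This is a NumPy sketch rather than TorchGeo code; `masked_mean_loss` and the array names are made up for the example.

```python
import numpy as np


def masked_mean_loss(per_pixel_loss: np.ndarray, nodata: np.ndarray) -> float:
    """Average a per-pixel loss over valid pixels only.

    per_pixel_loss: float array of shape (H, W).
    nodata: boolean array of shape (H, W), True where the raster has no data.
    """
    valid = ~nodata
    n_valid = int(valid.sum())
    if n_valid == 0:
        # No valid pixels in this patch: contribute nothing to the loss.
        return 0.0
    return float(per_pixel_loss[valid].sum() / n_valid)


# Hypothetical 2 x 3 patch whose right-hand column lies off the raster edge.
loss = np.array([[1.0, 2.0, 9.0], [3.0, 2.0, 9.0]])
nodata = np.array([[False, False, True], [False, False, True]])
print(masked_mean_loss(loss, nodata))  # 2.0
```

The same idea carries over to tensors: multiply or index by the valid mask before reducing, and divide by the number of valid pixels rather than the patch size.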

Implementation

To produce optional "nodata" masks, we simply add a field to RasterDataset defining whether to return said masks with each sample, and when reading bands during a query we additionally read the raster nodata masks and return them as a tensor.

We previously implemented this feature on a forked version of RasterDataset and would be happy to submit a PR with the implementation.

Alternatives

No response

Additional information

No response

@adamjstewart
Collaborator

I agree that we should have a way to access a nodata mask. However, I'm not sure of the best way to do this. Your approach would work, but if you then want to combine NAIP with another mask dataset, the mask will go from B x H x W to B x C x H x W. Our current semantic segmentation trainers can't handle this, and I've been trying to standardize the output of our datasets more: #985

Reading through https://rasterio.readthedocs.io/en/latest/topics/masks.html, it seems like the nodata mask is simply the locations in the image where the value equals the nodata value. So if nodata=0, then all pixels in the image equal to zero are nodata pixels.

One option would be to manually parse this information from the image yourself during training instead of creating a separate mask returned by the data loader. The problem with this is you have to be careful to track how this value changes if you use image normalization.
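A small NumPy sketch of that pitfall, assuming a sentinel of 0 and made-up normalization statistics: the mask is easy to derive from raw values, but the same comparison silently fails once the image has been standardized.

```python
import numpy as np

NODATA = 0.0  # assumed sentinel value for this example

# Hypothetical single-band 2 x 3 image; the last column is nodata.
image = np.array([[10.0, 20.0, NODATA], [30.0, 40.0, NODATA]])

# Derive the mask from pixel values before any normalization.
nodata_mask = image == NODATA

# Standardize with made-up dataset statistics.
mean, std = 25.0, 10.0
normalized = (image - mean) / std

# The sentinel has moved: nodata pixels are now -2.5, not 0.
print(normalized[nodata_mask])       # [-2.5 -2.5]
print((normalized == NODATA).any())  # False

# Recovering the mask afterwards requires normalizing the sentinel too.
shifted_sentinel = (NODATA - mean) / std
print(np.array_equal(normalized == shifted_sentinel, nodata_mask))  # True
```

Computing the mask before normalization, or carrying the shifted sentinel alongside the statistics, avoids the mismatch.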

Another option would be to use what you described and either forbid this kind of dataset being combined with a mask, or to make all of our masks B x C x H x W like Kornia expects. What do you think?

@adamjstewart adamjstewart added the datasets Geospatial or benchmark datasets label Feb 1, 2023
@pmaldonado
Author

pmaldonado commented Feb 2, 2023

Parsing the nodata mask from the image during training strikes me as breaking the abstractions of the dataloader and the Trainer. I would expect the dataloader to do all necessary disk access, particularly given that GeoDatasets provide additional abstraction above raw imagery files.

One potential problem with pulling the nodata value from the imagery is that nodatavals is optional and sometimes returns None. For instance, this happens for all of our NAIP imagery. Nevertheless, using .read([band], masked=True) still returns the appropriate nodata mask.
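A minimal sketch of reading the data and the per-band masks together, assuming rasterio's `read` and `read_masks` dataset methods (`read_masks` returns 0 for invalid pixels and 255 for valid ones). The helper name `read_with_nodata` is hypothetical, and this is not the implementation from our fork.

```python
import numpy as np


def read_with_nodata(src, window=None):
    """Return (image, nodata) for an open rasterio dataset.

    image: array of shape (C, H, W) with the raw band values.
    nodata: boolean array of shape (C, H, W), True where a band has no data.
    """
    image = src.read(window=window)
    # read_masks follows the GDAL convention: 0 = invalid, 255 = valid.
    valid = src.read_masks(window=window)
    return image, valid == 0


# Assumed usage with rasterio installed ("naip_tile.tif" is a placeholder path):
#
#     import rasterio
#     with rasterio.open("naip_tile.tif") as src:
#         image, nodata = read_with_nodata(src)
```

Because GDAL can also derive the mask from an alpha band or a mask file, this may remain informative when `nodatavals` is None, though that depends on how the files were written.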

Rasterio stores masks per-band, so the dataset could return a mask of shape C x H x W to match the input key and users can decide how to apply the mask downstream.
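For example, a user who wants a single H x W mask could reduce the C x H x W mask over the band axis. The NumPy sketch below uses a made-up two-band mask and shows the two obvious choices.

```python
import numpy as np

# Hypothetical per-band nodata mask of shape (C, H, W) = (2, 2, 2).
nodata = np.array(
    [
        [[False, True], [False, False]],  # band 0
        [[False, True], [True, False]],   # band 1
    ]
)

# Strict: a pixel is nodata if any band is missing.
nodata_any = nodata.any(axis=0)
# Lenient: a pixel is nodata only if every band is missing.
nodata_all = nodata.all(axis=0)

print(nodata_any.tolist())  # [[False, True], [True, False]]
print(nodata_all.tolist())  # [[False, True], [False, False]]
```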

What you write in #985 about standardizing sample keys makes a lot of sense to me. Am I correct in my understanding that the mask key is used for geometric features (like in VectorDataset)? If so, what would you think about adding an additional nodata key to produce an additional nodata mask? The key distinction is whether nodata masks should use the same key in a sample as a collection of geometric features. If we separate them, we'd prevent collisions between the two types of "mask".

@adamjstewart
Collaborator

> Parsing the nodata mask from the image during training strikes me as breaking the abstractions of the dataloader and the Trainer. I would expect the dataloader to do all necessary disk access, particularly given that GeoDatasets provide additional abstraction above raw imagery files.

Just to clarify, I was suggesting creating the mask from the image in memory; there shouldn't be any file I/O necessary. For example, if I load an image, and I know the value 0 means nodata, it wouldn't be hard to create a mask from an image with torch.nonzero. But that's only if we can parse it from the image.

> Nevertheless, using .read([band], masked=True) still returns the appropriate nodata mask.

If that's the case, it may not be possible to create a mask from the image in memory.

> Rasterio stores masks per-band

In what scenarios does the mask differ by band?

> Am I correct in my understanding that the mask key is used for geometric features (like in VectorDataset)?

mask is used for semantic segmentation masks. These masks can come from either raster images (RasterDataset, e.g., CDL) or vector shapefiles (VectorDataset, e.g., CBF).

> If so, what would you think about adding an additional nodata key to produce an additional nodata mask?

This actually sounds like a pretty good idea! I would be fine with this solution. It removes most of my concerns, and it's not particularly obtrusive.

My only remaining concern would be Kornia support. We want to make sure RandomCrop affects not only the image but also the nodata mask. Part of the reason for standardizing the keys is to match what Kornia uses. In that sense, Kornia only recognizes mask; it wouldn't recognize nodata. I'm still working on kornia/kornia#2119, but it should be possible to add support for names like "nodata_mask" or "mask_nodata". If you want to submit a PR to implement this, let's just call it nodata for now and we can rename it later.
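To make the requirement concrete, here is a hand-rolled NumPy sketch (not Kornia's API) in which one crop window is drawn once and applied to every spatial entry of a sample, so image, mask, and nodata stay aligned. The function `random_crop_sample` and the sample layout are assumptions for illustration only.

```python
import numpy as np


def random_crop_sample(sample, size, rng):
    """Crop every array in `sample` with the same random window.

    Each value is assumed to have shape (..., H, W) with identical H and W.
    """
    height, width = next(iter(sample.values())).shape[-2:]
    top = int(rng.integers(0, height - size + 1))
    left = int(rng.integers(0, width - size + 1))
    return {
        key: value[..., top : top + size, left : left + size]
        for key, value in sample.items()
    }


rng = np.random.default_rng(0)
image = np.arange(3 * 4 * 4).reshape(3, 4, 4)
sample = {
    "image": image,
    "mask": np.zeros((4, 4), dtype=np.int64),
    "nodata": image == 0,  # value-derived mask, just for the demo
}
cropped = random_crop_sample(sample, size=2, rng=rng)
print({key: value.shape for key, value in cropped.items()})
# {'image': (3, 2, 2), 'mask': (2, 2), 'nodata': (3, 2, 2)}
```

An augmentation that only knows about image and mask would leave nodata at its original extent, which is the misalignment we want to avoid.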

@pmaldonado
Author

> In what scenarios does the mask differ by band?

In our application it should be consistent between bands. My guess is that this feature exists in rasterio to allow composition of different data sources, which may have missing data at different locations in a raster depending on the observation method.

> I would be fine with this solution.

Great! I'll work on a PR for this and call it nodata for now.
