Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add custom RasterDataset notebook #283

Merged
merged 7 commits into from
Dec 21, 2021

Conversation

RitwikGupta
Copy link
Collaborator

Using xView3 as an example to address issue #280

@ghost
Copy link

ghost commented Dec 14, 2021

CLA assistant check
All CLA requirements met.

@isaaccorley
Copy link
Collaborator

Can you add the tutorial name to docs/index.rst?

@RitwikGupta
Copy link
Collaborator Author

Goes to show I should finally learn how to use Sphinx...

@adamjstewart adamjstewart added the documentation Improvements or additions to documentation label Dec 14, 2021
@calebrob6
Copy link
Member

calebrob6 commented Dec 15, 2021

Thanks @RitwikGupta!

Just reviewed this but it is hard to leave comments inline on notebooks...

  • I'm not able to load "https://iuu.xview3.us/". Is this the right URL / is that server up?
  • Does "Copyright (c) Ritwik Gupta. All rights reserved." conflict with the CLA (https://cla.opensource.microsoft.com/microsoft/torchgeo?pullRequest=283)? I'm reading the "You grant Microsoft, and those who receive the Submission directly or indirectly from Microsoft, a perpetual, worldwide, non-exclusive, royalty-free, irrevocable license in the Submission to reproduce, prepare derivative works of, publicly display, publicly perform, and distribute the Submission and such derivative works, and to sublicense any or all of the foregoing rights to third parties." part but am not sure.
  • I'd like to be a bit more verbose with the descriptions of what is going on, something like:

"""
Assume we have the xView3 tiny dataset downloaded and unzipped in our local directory:

xview3tiny
├── 05bc615a9b0e1159t
│ ├── bathymetry.tif
│ ├── owiMask.tif
│ ├── owiWindDirection.tif
│ ├── owiWindQuality.tif
│ ├── owiWindSpeed.tif
│ ├── VH_dB.tif
│ └── VV_dB.tif
├── 2899cfb18883251bt
│ ├── bathymetry.tif
│ ├── owiMask.tif
│ ├── owiWindDirection.tif
│ ├── owiWindQuality.tif
│ ├── owiWindSpeed.tif
│ ├── VH_dB.tif
│ └── VV_dB.tif
|.......

We would like to create a custom Dataset class based off of RasterDataset for this xView3 data. This will let us use torchgeo features such as: random sampling, merging other layers, etc. To do this, we can simply subclass RasterDataset and define a filename_glob property to select which files in a root directory will be included in the dataset. For example:
"""

Thoughts?

@adamjstewart adamjstewart added this to the 0.1.1 milestone Dec 15, 2021
@RitwikGupta
Copy link
Collaborator Author

  1. I had committed prior to signing the CLA, fixed.
  2. Typo in the URL, fixed.
  3. Added the description!

@calebrob6
Copy link
Member

calebrob6 commented Dec 15, 2021

I think we try to run each notebook as an integration test before new releases and this one would break because there is no way to auto download xView3 tiny. This is generally OK with me (as other notebooks have special test behavior and exist to show one-off things). @adamjstewart, thoughts?

see notebook render here https://github.com/RitwikGupta/torchgeo/blob/custom_rasterdataset_doc/docs/tutorials/custom_raster_dataset.ipynb

@RitwikGupta
Copy link
Collaborator Author

@calebrob6 if you want to self-host xView3 tiny like you self host your subset of xView2, that should be simple enough to add in.

At some point once we finish writing the paper for xView3 I can add the right citation for it too.

@calebrob6
Copy link
Member

We don't self-host any real data as to not run into any licensing issues. I.e., all the data at tests/data/* should be dummy data for test purposes. Having dummy data would make the example notebook a lot less cool though... If we can put xview3 tiny on zenodo or someone's google drive that'd be the ideal solution.

@RitwikGupta
Copy link
Collaborator Author

How are the integration tests for xview.py currently run?

@calebrob6
Copy link
Member

calebrob6 commented Dec 15, 2021

(Disclaimer, I might have my terminology mixed up, I'm new to this software engineering stuff)

I don't think we run "integration" tests on xview.py, but run "unit" tests (https://github.com/microsoft/torchgeo/blob/main/tests/datasets/test_xview2.py). These unit tests use dummy data that we create to be in the same format as the xview2 data (tests/data/xview2/*).

Some parts of the code take too long to run on every commit to a PR, so we mark these as "slow" and only run them occasionally (see https://github.com/microsoft/torchgeo/blob/main/tests/test_train.py#L12 for an example). These are what I'm calling "integration" tests and the notebooks fall in this category.

To run the unit tests you can do pytest or pytest --cov=. --cov-report=term-missing to see test coverage.
To run the integration tests you can do pytest -m slow.

@RitwikGupta
Copy link
Collaborator Author

RitwikGupta commented Dec 15, 2021

Okay, how's this. I created completely empty xView3 rasters with the same geotransforms as original source files. The custom RasterDataset in the notebook wouldn't be a complete xView3 dataset, so any future xview3.py could be unit tested with those files. Thoughts? This still allows you to use this notebook to manually test the code (I assume you manually run these notebooks).

@calebrob6
Copy link
Member

haha yes, that will definitely work! Note: I just wanted Adam to weigh in on how to handle these notebooks in general. It seems reasonable to me that we don't have to force these tutorial notebooks be fully executable by the github tests (at the cost of losing some usefulness / cool factor). However, if we decide that, then we'll have to be extra vigilant about running them ourselves periodically.

@adamjstewart
Copy link
Collaborator

I think it makes sense to expect all of our tutorials to actually run. Without this, we have no way of testing them, so they rapidly become out-of-date. Manually running tests is a perfect recipe for not running tests. If we just want to document a feature, it doesn't need to be in a notebook.

Note: there are actually two types of integration tests we run, the slow unit tests can be run with pytest -m slow, and the notebook tests can be run with pytest --nbmake docs/tutorials. Both of these are run when creating a new release as @calebrob6 mentioned.

@calebrob6
Copy link
Member

Okay so last changes:

  • Use the example data
  • is it possible to shrink the VH_dB.tif file somehow? I see it is already using LZW with predictor=2, but is also 12MB.

@RitwikGupta
Copy link
Collaborator Author

Zipping all the files together brings the total file size down to 142 KB. What's the way to have TorchGeo automatically unzip files before using them?

@calebrob6
Copy link
Member

torchgeo.datasets.utils.download_and_extract_archive

@RitwikGupta
Copy link
Collaborator Author

Try that on for size

@calebrob6
Copy link
Member

Fits great!

calebrob6
calebrob6 previously approved these changes Dec 16, 2021
@isaaccorley
Copy link
Collaborator

This lgtm but while I'm here looking at the notebook, I think you can simplify the dataset further from

class XView3Polarizations(RasterDataset):
    '''
    Load xView3 polarization data that ends in *_dB.tif
    '''

    filename_glob = "*_dB.tif"

    def __init__(
        self,
        root: Path = None,
        crs: Optional[CRS] = None,
        res: Optional[float] = None,
        transforms: Optional[Callable[[Dict[str, Tensor]], Dict[str, Tensor]]] = None,
        cache: bool = True
    ) -> None:
        self.root = root

        super().__init__(root, crs, res, transforms, cache)

to just:

class XView3Polarizations(RasterDataset):
    '''
    Load xView3 polarization data that ends in *_dB.tif
    '''

    filename_glob = "*_dB.tif"

@RitwikGupta
Copy link
Collaborator Author

This was a willful decision to be explicit. If people want to modify the root, they can! Should I add a markdown comment discussing this instead?

@adamjstewart
Copy link
Collaborator

I agree with @isaaccorley on this one, there's no need to reimplement methods of the parent class unless you need to change them. If a different root is needed, you can change the variable that is passed into the dataset when you instantiate it.

@isaaccorley
Copy link
Collaborator

In this case I think we should remove the boilerplate to show how simple it is for a new user to inherit and just change the glob. However, I do think we should come up with some more examples in a future PR which show how you can customize the dataset more.

@RitwikGupta
Copy link
Collaborator Author

Is this any better?

@adamjstewart adamjstewart modified the milestones: 0.1.1, 0.1.2 Dec 19, 2021
isaaccorley
isaaccorley previously approved these changes Dec 19, 2021
Copy link
Collaborator

@isaaccorley isaaccorley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@calebrob6
Copy link
Member

What is the difference in the new one? Is the idea XView3Polarizations("xview3/train/")?

@RitwikGupta
Copy link
Collaborator Author

Yeah. It's a pretty weak way of showing how you could change things up, to be honest.

@RitwikGupta
Copy link
Collaborator Author

But I think it does open your mind up to the fact that you can add additional args and things, and use them, in your constructor.

@calebrob6
Copy link
Member

No problem, maybe like a final cell that says something to the effect of, "Now you can do this train_ds = XView3Polarizations("xview3/train/")", whereas before you couldn't." to make that explicit?

@adamjstewart
Copy link
Collaborator

You could do that before and after, I don't see any point of setting self.root, it's the same as what super().__init__(...) does. I think if we want to include this we need a better example of a more complex use case. This tutorial should tell users everything they could potentially need to know to when creating your own RasterDataset.

@calebrob6 calebrob6 merged commit a3f5593 into microsoft:main Dec 21, 2021
@adamjstewart
Copy link
Collaborator

Looks like the header levels are too high. Can you reduce the header levels so they match the other tutorials? Also, the plot is empty. Is it supposed to be?

@calebrob6
Copy link
Member

I just did this on main. FYI for all of us (I didn't know this anyway) that level 0 headers in markdown ("# Headers") will get included in the sidebar on the docs, e.g.

image

@adamjstewart adamjstewart modified the milestones: 0.1.2, 0.2.0 Jan 1, 2022
@adamjstewart adamjstewart added utilities Utilities for working with geospatial data and removed utilities Utilities for working with geospatial data labels Jan 2, 2022
"source": [
"from torchgeo.datasets.utils import extract_archive\n",
"\n",
"data_root = Path('../../tests/data/xview3/')\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This directory doesn't exist on Colab, so this tutorial crashes immediately when you run it. We should really write these files ourselves using rasterio inside the notebook, or download them (from GitHub or just use real images).

yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023
* Add custom RasterDataset notebook

* Update docs index.rst

* Update copyright, fix URL typo, and add verbose description

* Add xview3 sample data

* Update notebook

* Show simple example first, complicated example second

* Remove the second half of the notebook, can expand later
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants