Add dataset table to the documentation #435

Merged: 25 commits into main from docs/dataset_table on Jun 19, 2022


calebrob6
Member

@calebrob6 calebrob6 commented Feb 26, 2022

In the TorchGeo paper we have a table that lists the properties of some of the datasets we've added. This should be reproduced in the docs so that we have an overview of what all is available in the library.

I've just copied the tables from the paper (more or less) and haven't yet updated them with the datasets that have been implemented since.

Things to look into/questions:

  • What other columns can we add for the Geospatial datasets? In the paper we've split it into Benchmark/Generic; however, that's not really the organization that we have here. Perhaps we need a single table with a "Type" column?
  • Adding hyperlinks to the citations for each dataset
  • Updating the CSVs to be current

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 26, 2022
@adamjstewart
Collaborator

> What other columns can we add for the Geospatial datasets? In the paper we've split it into Benchmark/Generic, however that's not really the organization that we have here. Perhaps we need a single table with a "Type" column?

This re-raises the question of how we should categorize datasets. We keep flip-flopping on this and it's leading to a lot of inconsistencies. Functionally, I think the biggest distinction is between geospatial datasets (which have geospatial metadata) and non-geospatial datasets. In terms of attributes that we might want to list in this table, the biggest distinction is between benchmark datasets (which have both input images and target labels) and non-benchmark datasets. The division in the docs doesn't necessarily need to match the base class division.

I'm leaning towards splitting the docs into benchmark vs. non-benchmark and keeping geospatial vs. non-geospatial as a base class distinction only. I've also been thinking about renaming VisionDataset to NonGeoDataset. Thoughts?
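
For illustration only, here is a minimal sketch of what such a rename could look like while keeping the old name importable. The deprecation shim and the method set are assumptions made for this example, not what the library actually does.

```python
import warnings
from abc import ABC, abstractmethod
from typing import Any, Dict


class NonGeoDataset(ABC):
    """Hypothetical base class for datasets queried by integer index."""

    @abstractmethod
    def __getitem__(self, index: int) -> Dict[str, Any]:
        """Return the sample stored at the given integer index."""

    @abstractmethod
    def __len__(self) -> int:
        """Return the number of samples in the dataset."""


class VisionDataset(NonGeoDataset):
    """Old name kept as a deprecated alias (an assumption for this sketch)."""

    def __init_subclass__(cls, **kwargs: Any) -> None:
        # Runs whenever someone subclasses the old name.
        super().__init_subclass__(**kwargs)
        warnings.warn(
            "VisionDataset is deprecated, use NonGeoDataset instead",
            DeprecationWarning,
            stacklevel=2,
        )
```

With a shim like this, existing subclasses of the old name would keep working and would still pass `isinstance(ds, NonGeoDataset)` checks.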

> Adding hyperlinks to the citations for each dataset

Torchvision does this for their models and it always confuses me because I expect the hyperlink to take me to the model class definition, not the citation. I would prefer to have hyperlinks to class definitions and then the class definition contains a hyperlink to the citation. Thoughts?

> Updating the CSVs to be current

We'll have to remind people to update this table every time they add a new dataset.
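
One way to back up that reminder is a small check that compares the table against a list of dataset names. The sketch below is only an assumption about how this could be done: the `Dataset` column name, the `Task` column, and the sample rows are made up for the example and are not taken from the actual CSV files.

```python
import csv
import io
from typing import Iterable, Set


def missing_from_table(
    csv_text: str, dataset_names: Iterable[str], name_column: str = "Dataset"
) -> Set[str]:
    """Return the dataset names that have no row in the CSV table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    listed = {row[name_column].strip() for row in reader}
    return set(dataset_names) - listed


# Illustrative table contents, not the real docs CSV.
table = "Dataset,Task\nBigEarthNet,classification\nSpaceNet,segmentation\n"
print(missing_from_table(table, ["BigEarthNet", "SpaceNet", "EuroSAT"]))
```

A check like this could run in the docs build or in a test, so a new dataset without a table row fails loudly instead of relying on reviewers to notice.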

@calebrob6
Member Author

calebrob6 commented Feb 26, 2022

Dataset naming stuff

NonGeoDataset sounds great to me. And I'm fine with having a "benchmark dataset" table that doesn't align with how the classes are organized.

> I would prefer to have hyperlinks to class definitions and then the class definition contains a hyperlink to the citation. Thoughts?

Fine with me -- rows in the table should definitely link somewhere.

> We'll have to remind people to update this table every time they add a new dataset.

Yep! That's fine with me. I can also add instructions to the contributing page in this PR.

@adamjstewart adamjstewart added this to the 0.3.0 milestone Feb 27, 2022
@ashnair1
Collaborator

ashnair1 commented Mar 23, 2022

While we're on the topic, what about datasets like SpaceNet where they're pre-chipped to be like VisionDatasets but do contain geospatial metadata? In that case, it was the organisation of the dataset that informed the decision to make it a VisionDataset (query by integer index and not bounding box) not its lack of geospatial metadata.

Kind of lies in between geo-vs-vision
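
To make the distinction concrete, here is a small self-contained sketch of the same pre-chipped data exposed both ways. The `Box`, `ChipsByIndex`, and `ChipsByBox` names are hypothetical stand-ins invented for this example, and the boxes cover spatial extent only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical, spatial extent only)."""

    minx: float
    maxx: float
    miny: float
    maxy: float

    def intersects(self, other: "Box") -> bool:
        # Boxes that merely touch along an edge count as intersecting.
        return (
            self.minx <= other.maxx
            and self.maxx >= other.minx
            and self.miny <= other.maxy
            and self.maxy >= other.miny
        )


class ChipsByIndex:
    """Pre-chipped data queried by integer index."""

    def __init__(self, chips: List[Tuple[str, Box]]) -> None:
        self.chips = chips

    def __len__(self) -> int:
        return len(self.chips)

    def __getitem__(self, index: int) -> str:
        return self.chips[index][0]


class ChipsByBox:
    """The same chips queried by bounding box."""

    def __init__(self, chips: List[Tuple[str, Box]]) -> None:
        self.chips = chips

    def __getitem__(self, query: Box) -> List[str]:
        return [name for name, bounds in self.chips if bounds.intersects(query)]


chips = [("a.tif", Box(0, 10, 0, 10)), ("b.tif", Box(20, 30, 0, 10))]
print(ChipsByIndex(chips)[1])                # chip at position 1
print(ChipsByBox(chips)[Box(2, 4, 2, 4)])    # chips overlapping the query box
```

The data is identical in both cases; only the query interface differs, which is why a pre-chipped dataset with geospatial metadata can plausibly sit on either side.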

@adamjstewart
Collaborator

@ashnair1 my current plan is to someday convert all of those to GeoDatasets (#83). The only thing holding us back at the moment is #409. I'm also planning on adding a new sampler (maybe PreChippedGeoSampler?) that doesn't require the user to specify the epoch length or patch size and instead gathers this directly from the dataset r-tree index. This will make them almost as simple as a VisionDataset but way more powerful since you can combine them with other datasets.
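
As a rough sketch of that sampler idea, assuming the index can hand back one bounding box per pre-chipped file: the class below uses a plain list in place of an r-tree, and its name, constructor arguments, and box layout are all hypothetical.

```python
import random
from typing import Iterator, List, Optional, Tuple

# (minx, maxx, miny, maxy); the layout is an assumption for this sketch.
BBox = Tuple[float, float, float, float]


class PreChippedSamplerSketch:
    """Yield each chip's bounds exactly once per epoch.

    The epoch length and patch size are never passed in: both follow
    from the chip bounds that the dataset index already stores.
    """

    def __init__(
        self,
        chip_bounds: List[BBox],
        shuffle: bool = False,
        seed: Optional[int] = None,
    ) -> None:
        self.chip_bounds = list(chip_bounds)
        self.shuffle = shuffle
        self._rng = random.Random(seed)

    def __iter__(self) -> Iterator[BBox]:
        order = list(range(len(self.chip_bounds)))
        if self.shuffle:
            self._rng.shuffle(order)
        for i in order:
            yield self.chip_bounds[i]

    def __len__(self) -> int:
        # One sample per chip, so the epoch length is the chip count.
        return len(self.chip_bounds)


bounds = [(0.0, 10.0, 0.0, 10.0), (20.0, 30.0, 0.0, 10.0)]
sampler = PreChippedSamplerSketch(bounds)
for box in sampler:
    print(box)
```

Because each yielded box is exactly one chip's extent, a dataset queried with these boxes would behave much like integer indexing while still being combinable with other bounding-box-indexed datasets.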

Collaborator

@adamjstewart adamjstewart left a comment


In terms of filenames, there isn't a ton of consistency. We list "Geospatial Datasets" in "generic_datasets.csv" and "Non-geospatial Datasets" in "non_geo_datasets.csv". I'm honestly not sure what to call them anymore, and we've gone back and forth for a while. We should figure out how to make these more consistent in our docs/API. This doesn't necessarily need to happen in this PR, just pointing out the inconsistencies here.

docs/api/datasets.rst (resolved)
docs/api/vision_datasets.csv (outdated, resolved)
@adamjstewart
Collaborator

This looks great! Will take a closer look later. Since we first created the docs, torchvision's docs have completely changed. They now have a short page with just the dataset tables and then separate pages for each dataset. I actually kind of like that format, and it may allow us to skip the step of adding the dataset to datasets.rst. This doesn't change anything in this PR, but it may make this PR more important in the future.

docs/api/geo_datasets.csv (resolved)
docs/api/non_geo_datasets.csv (outdated, resolved)
docs/api/non_geo_datasets.csv (outdated, resolved)
docs/api/non_geo_datasets.csv (outdated, resolved)
docs/api/non_geo_datasets.csv (outdated, resolved)
docs/user/contributing.rst (outdated, resolved)
docs/user/contributing.rst (resolved)
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
calebrob6 and others added 4 commits June 18, 2022 04:44
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
@calebrob6
Member Author

Alright, ready for round 2

@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Jun 19, 2022
@adamjstewart adamjstewart enabled auto-merge (squash) June 19, 2022 19:29
@adamjstewart adamjstewart merged commit 98cc3c9 into main Jun 19, 2022
@adamjstewart adamjstewart deleted the docs/dataset_table branch June 19, 2022 19:30
@adamjstewart adamjstewart mentioned this pull request Jul 11, 2022
yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023
* Add benchmark dataset table

* Add geospatial datasets

* Work on Data table (microsoft#478)

* added to data table

* add links

* fix docs

* Added section for implementing new datasets to the Contributing page

* Removing extra file

* Add EDDMapS and GBIF rows to generic

* Formatting

* Renaming to make sense

* Short names

* Fixes

* Checking references

* Trying links

* Figured out links

* Removing hyphens for empty cells as these are rendered as bullet points

* Update docs/api/non_geo_datasets.csv

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update docs/api/non_geo_datasets.csv

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update docs/api/non_geo_datasets.csv

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update docs/api/non_geo_datasets.csv

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update docs/user/contributing.rst

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update docs/api/geo_datasets.csv

* Update geo_datasets.csv

* Update geo_datasets.csv

* Update contributing.rst

* Formatting

* Fix table links

Co-authored-by: Nils Lehmann <35272119+nilsleh@users.noreply.github.com>
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Labels
datasets Geospatial or benchmark datasets documentation Improvements or additions to documentation
4 participants