MapInWild dataset #1096

Closed
burakekim opened this issue Feb 8, 2023 · 2 comments · Fixed by #1131
burakekim commented Feb 8, 2023

Summary

Hi!

MapInWild is a large-scale multi-modal dataset curated for the novel task of wilderness mapping from space, introduced in our recent paper. We would be glad to see our dataset in torchgeo!

  • The dataset is originally hosted on Harvard Dataverse. Following @calebrob6's suggestion, I have mirrored it on Hugging Face to make things easier for everyone.

  • The original GitHub repository of the dataset is here.

  • Wilderness mapping is a supervised pixel-wise classification task. Although the annotations contain four classes/pixel values {0: Background, 1: Strict Nature Reserve, 2: Wilderness Area, 3: National Park}, we apply annotation[annotation != 0] = 1 and frame the task as a binary classification problem where 0 is Background and 1 is Wilderness Area.

  • The annotations (in the form of polygons) are derived from the World Database on Protected Areas (WDPA).

  • The dataset is around 350 GB in size, so each modality is zipped separately, letting users download only the modalities they want to work with.
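The binary relabeling mentioned in the list above can be sketched with NumPy. This is only an illustration: the helper name `binarize_mask` and the toy array are made up here and are not part of the dataset code.

```python
import numpy as np


def binarize_mask(annotation: np.ndarray) -> np.ndarray:
    """Collapse the three protected-area classes (1, 2, 3) into class 1."""
    binary = annotation.copy()  # leave the caller's array untouched
    binary[binary != 0] = 1     # same operation as in the issue text
    return binary


# Tiny hand-made mask (not real data) containing all four class values.
toy = np.array([[0, 1], [2, 3]], dtype=np.uint8)
result = binarize_mask(toy)
```

Copying first is a design choice for the sketch; the in-place form annotation[annotation != 0] = 1 gives the same values if the original labels are no longer needed.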

There are 1018 areas in the dataset, and each area contains the following modalities, each 1920 x 1920 pixels in size.

MapInWild
├── Dual Pol Sentinel-1 (2 Bands)
├── Sentinel-2 (10 Bands)
│   ├── Spring
│   ├── Autumn
│   ├── Winter
│   ├── Summer
│   └── Single Temporal Subset
├── VIIRS Night Time Light (1 Band)
└── ESA WorldCover (1 Band)
  • As explained in the paper, the single temporal subset contains the most informative Sentinel-2 season for each area. It is recommended for users who are not interested in the multi-seasonal aspect of the dataset.

Following the torchgeo logic, here are the bands and their modality-level combinations.

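    from typing import Dict, Tuple

    # Band names grouped by modality; every value is a tuple of band names.
    BAND_SETS: Dict[str, Tuple[str, ...]] = {
        "all": (
            "VV",
            "VH",
            "B2",
            "B3",
            "B4",
            "B5",
            "B6",
            "B7",
            "B8",
            "B8A",
            "B11",
            "B12",
            "2020_Map",
            "avg_rad",
        ),
        "s1": ("VV", "VH"),
        "s2-rgb": ("B4", "B3", "B2"),
        "s2-all": (
            "B2",
            "B3",
            "B4",
            "B5",
            "B6",
            "B7",
            "B8",
            "B8A",
            "B11",
            "B12",
        ),
        # Single-band modalities are one-element tuples, not sets.
        "esa_wc": ("2020_Map",),
        "viirs": ("avg_rad",),
    }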
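
One way such band sets could be used is to turn a set name into channel indices. The sketch below assumes, purely for illustration, that all modalities are stacked into one array in the order of the "all" entry; the helper `band_indices` is hypothetical and not part of torchgeo.

```python
from typing import Dict, List, Tuple

# Channel order assumed for a stacked array (same order as the "all" set).
ALL_BANDS: Tuple[str, ...] = (
    "VV", "VH",
    "B2", "B3", "B4", "B5", "B6", "B7", "B8", "B8A", "B11", "B12",
    "2020_Map", "avg_rad",
)

# A few of the band sets from the issue, repeated so the sketch is standalone.
BAND_SETS: Dict[str, Tuple[str, ...]] = {
    "s1": ("VV", "VH"),
    "s2-rgb": ("B4", "B3", "B2"),
    "esa_wc": ("2020_Map",),
    "viirs": ("avg_rad",),
}


def band_indices(bands: Tuple[str, ...]) -> List[int]:
    """Return the position of each requested band within ALL_BANDS."""
    unknown = [b for b in bands if b not in ALL_BANDS]
    if unknown:
        raise ValueError(f"Unknown band(s): {unknown}")
    return [ALL_BANDS.index(b) for b in bands]


rgb_idx = band_indices(BAND_SETS["s2-rgb"])  # red, green, blue channel positions
```

Raising on unknown names catches typos such as "B08" early instead of silently selecting the wrong channel.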

Rationale

No response

Implementation

When the user calls MapInWild(root="data/", modalities=[...], download=True), any requested modality can be loaded as shown below:

from datasets import load_dataset

mask = load_dataset("burakekim/mapinwild", data_dir="mask")
s1 = load_dataset("burakekim/mapinwild", data_dir="s1")
viirs = load_dataset("burakekim/mapinwild", data_dir="viirs")
esa_wc = load_dataset("burakekim/mapinwild", data_dir="esa_wc")
s2 = load_dataset("burakekim/mapinwild", data_dir="s2_temporal_subset")
s2_autumn = load_dataset("burakekim/mapinwild", data_dir="s2_autumn")
s2_spring = load_dataset("burakekim/mapinwild", data_dir="s2_spring")
s2_winter = load_dataset("burakekim/mapinwild", data_dir="s2_winter")
s2_summer = load_dataset("burakekim/mapinwild", data_dir="s2_summer")

The s1 and s2 modalities are larger than 50 GB, so each is split into two zip files. For these modalities, the num_proc=2 argument of load_dataset can be used.
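
A small sketch of how the num_proc choice could be made per modality. The helper `load_dataset_kwargs` and the contents of `SPLIT_DIRS` are assumptions for illustration; in particular, which Sentinel-2 directories are actually split into two archives is a guess and should be checked against the repository.

```python
from typing import Any, Dict, FrozenSet

REPO_ID = "burakekim/mapinwild"

# Assumption: directories whose data is split across two zip files.
SPLIT_DIRS: FrozenSet[str] = frozenset({"s1", "s2_temporal_subset"})


def load_dataset_kwargs(
    data_dir: str, split_dirs: FrozenSet[str] = SPLIT_DIRS
) -> Dict[str, Any]:
    """Build keyword arguments for datasets.load_dataset for one modality."""
    kwargs: Dict[str, Any] = {"path": REPO_ID, "data_dir": data_dir}
    if data_dir in split_dirs:
        kwargs["num_proc"] = 2  # one worker per zip file
    return kwargs


s1_kwargs = load_dataset_kwargs("s1")
mask_kwargs = load_dataset_kwargs("mask")

# Usage (downloads a large amount of data, so it is left commented out):
# from datasets import load_dataset
# s1 = load_dataset(**s1_kwargs)
```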

Alternatives

No response

Additional information

No response

@adamjstewart adamjstewart added the datasets Geospatial or benchmark datasets label Feb 8, 2023
@adamjstewart
Collaborator

We would love to have your dataset in TorchGeo! Would you like to try opening a PR to add it? I would suggest looking at recent PRs that added new datasets to get a full list of files you would need to add/modify. I'm happy to help review a draft when it's ready.

@burakekim
Contributor Author

Sure, I can give it a go!
