Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix folder name for labels_root in the FAIR1M dataset #1099

Merged
merged 5 commits into from
Feb 9, 2023

Conversation

SpontaneousDuck
Copy link
Contributor

Change name of expected extracted folder for FAIR1M labels_root. Fixes #1098

Change name of expected extracted folder for FAIR1M labels_root
@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Feb 8, 2023
@adamjstewart
Copy link
Collaborator

Thanks for the fix!

Where did you even download the dataset? The dataset homepage seems to be down.

In order to get the tests to pass, you'll also need to change the fake data in tests/data/fair1m to have the same directory structure (both the directory and the zip file of that directory). Let me know if you need help.

@adamjstewart adamjstewart added this to the 0.4.1 milestone Feb 9, 2023
@SpontaneousDuck
Copy link
Contributor Author

I was able to access the dataset's website here: https://www.gaofen-challenge.com/benchmark. Since it requires an account to get the links to the dataset, my colleague who is from China was able to get that for me. The Google Drive hosted dataset is a public folder here: https://drive.google.com/drive/folders/1lCZibAl3k9sI5d7ahRm_5GA3g7OCLXmY.

@SpontaneousDuck
Copy link
Contributor Author

SpontaneousDuck commented Feb 9, 2023

Tests fixed! I changed what you requested in the tests folder. We'll see if that works!

@github-actions github-actions bot added the testing Continuous integration testing label Feb 9, 2023
@SpontaneousDuck
Copy link
Contributor Author

@microsoft-github-policy-service agree company="Kostas Research Institute at Northeastern University"

@adamjstewart
Copy link
Collaborator

I think you need to update the zip file too.

@SpontaneousDuck
Copy link
Contributor Author

The zip file name is correct! The problem is they zipped a folder with a different name than the archive containing it so you download labelXmls.zip from the dataset repository but then that zip extracts to a folder named labelXml when you extract it. So the files inside the labelXmls.zip zip file are under the structure like labelXml/518.xml.

@adamjstewart
Copy link
Collaborator

The name of the zip file is correct, but the contents of the zip file in our test data are not correct.

@SpontaneousDuck
Copy link
Contributor Author

Ah, got it! I just changed the contents and made a new commit

@adamjstewart
Copy link
Collaborator

I was able to access the dataset's website here: https://www.gaofen-challenge.com/benchmark.

Can you update the URL in the FAIR1M docstring? The old one gives a 404.

@adamjstewart
Copy link
Collaborator

Is there a difference in directory structure between the 1.0 and 2.0 versions of the dataset? I'm wondering if this PR only supports 2.0 and the old version only supports 1.0. It would be easy to support both if you have the ability to download both and test them.

@SpontaneousDuck
Copy link
Contributor Author

So the current version of the dataset has two different parts to the training dataset as well as a validation set. (train/part1 and train/part2). It looks like the md5 for images.zip matches the part1 folder so I think what they did is leave part1 as the original dataset and added the part2 folder as well as the validation folder as version 2.0. The md5 for part1/labelXmls.zip so there must have been some change to the original labels when they updated. Maybe the folder name change in the zip was changed in the update and that also affected the original dataset?

From what I can tell, they left the original zips from 1.0 in the 2.0 dataset and added more data zips to extend it. Should I extend this PR to support version 2.0 or make that a separate PR? Either way I am getting the 1.0 md5s and tests fixed now 😊

@SpontaneousDuck
Copy link
Contributor Author

Directory structure for FAIR1M2.0 is as below with the exception of the labelXmls folders which I renamed to work with the current release version of torchgeo.datasets.FAIR1M

.
├── test
│   ├── images0.zip
│   ├── images1.zip
│   └── images2.zip
├── train
│   ├── part1
│   │   ├── images
│   │   ├── images.zip
│   │   ├── labelXmls
│   │   └── labelXmls.zip
│   └── part2
│       ├── images
│       ├── images.zip
│       ├── labelXmls
│       └── labelXmls.zip
└── validation
    ├── images.zip
    └── labelXmls.zip

@adamjstewart
Copy link
Collaborator

So what you're saying is that our dataset is for version 1.0, but the authors seem to have changed the directory structure for 1.0 (and therefore the dataset broke and the MD5s changed), and your PR fixes support for version 1.0? That sounds fine to me, we can add support for version 2.0 in a separate PR.

@SpontaneousDuck
Copy link
Contributor Author

Great! Yes, they actually changed the 1.0 version download link to just link to v2.0/train/part1 folder. So, if anyone downloads the 1.0 version, this dataset code will work. This dataset code will also work if the version 2.0 dataset is downloaded and root="train/part1 is set for torchgeo.datasets.FAIR1M.

@adamjstewart
Copy link
Collaborator

Closing and reopening to try to fix the coverage checks...

@adamjstewart adamjstewart reopened this Feb 9, 2023
@adamjstewart adamjstewart merged commit f82af4a into microsoft:main Feb 9, 2023
calebrob6 pushed a commit that referenced this pull request Apr 10, 2023
* Update fair1m.py

Change name of expected extracted folder for FAIR1M labels_root

* moved test data for FAIR1M to match new folder name

* fix contents of fair1m test label zipfile

* update md5 hash for fair1m labelXmls.zip

* update md5s for fair1m dataset as well as tests
yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023
* Update fair1m.py

Change name of expected extracted folder for FAIR1M labels_root

* moved test data for FAIR1M to match new folder name

* fix contents of fair1m test label zipfile

* update md5 hash for fair1m labelXmls.zip

* update md5s for fair1m dataset as well as tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FAIR1M Dataset labels_root name
2 participants