Split definition of DTD, EuroSAT and SUN397 #1

gortizji · 2023-02-06T16:02:06Z

Hi, awesome work!

I'm trying to reproduce your results but I cannot find the split definitions you use for DTD, EuroSAT and SUN397. Would you mind pointing me to the right resources to download the versions of these datasets compatible with your code?

Thanks a lot!

gabrielilharco · 2023-02-06T18:05:27Z

Hi @gortizji. Thanks for the interest in our work and for the kind words!

DTD: Downloaded from https://www.robots.ox.ac.uk/~vgg/data/dtd/ (direct link: https://www.robots.ox.ac.uk/~vgg/data/dtd/download/dtd-r1.0.1.tar.gz), which also contains the original splits. As in PAINT, we use their test set unchanged, and merge the training and validation sets and re-split them. So from the files you download, you can put the training and validation files into a train folder, and the test files into a val folder.
EuroSAT: Downloaded from https://github.com/phelber/EuroSAT (direct link: https://madm.dfki.de/files/sentinel/EuroSAT.zip). For this dataset we randomly split the downloaded data into train/validation/test (~~55,000/5,000/10,000 samples~~ 21,600/2,700/2,700 samples).
SUN397: Downloaded from https://vision.princeton.edu/projects/2010/SUN/ (direct link: http://vision.princeton.edu/projects/2010/SUN/SUN397.tar.gz).

Please note that in this codebase we use the suffix "Val" to indicate that we want to use the validation sets instead of the test sets (e.g., evaluating on DTDVal uses the validation set, and evaluating on DTD uses the test set). You should also use the Val suffix when training.
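
To make the naming convention concrete, here is a minimal sketch of how a dataset name could be mapped to a base dataset and a split. The helper `parse_dataset_name` is hypothetical and is not taken from the codebase; it only illustrates the "Val" suffix rule described above.

```python
def parse_dataset_name(name: str) -> tuple[str, str]:
    """Split a name like 'DTDVal' into ('DTD', 'validation').

    Hypothetical helper for illustration only; the actual codebase may
    resolve dataset names differently.
    """
    suffix = "Val"
    # A trailing "Val" selects the validation set; otherwise the test set is used.
    if name.endswith(suffix) and len(name) > len(suffix):
        return name[: -len(suffix)], "validation"
    return name, "test"
```

Under this sketch, `parse_dataset_name("DTDVal")` selects the validation set of DTD and `parse_dataset_name("DTD")` selects its test set.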

Hope this helps, and let me know if you have any other questions!

gortizji · 2023-02-07T08:19:58Z

Thanks @gabrielilharco for your quick answer. Some follow-up questions:

DTD: There are 10 different splits defined in the original webpage. I assume you use the first split, then?
EuroSAT: As far as I can tell, EuroSAT has only 27,000 images. Could the split be 12,000/5,000/10,000?
SUN397: Again there are 10 different balanced splits defined in the original webpage. Do you use any one in particular?

Thanks!

gabrielilharco · 2023-02-07T20:26:02Z

For DTD and SUN397, yes, we use the first split (train1.txt+val1.txt / test1.txt for DTD, and Testing_01.txt/Training_01.txt for SUN397, as in https://vision.princeton.edu/projects/2010/SUN/download/Partitions.zip). EuroSAT indeed has 27,000 images, and we use a 21,600/2,700/2,700 split (I also updated the previous message).
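
The DTD arrangement described in this thread (train1.txt and val1.txt into a train folder, test1.txt into a val folder) could be sketched as follows. This is a minimal sketch, not a script from the codebase. It assumes the extracted archive contains `images/<class>/<file>` and `labels/train1.txt`, `labels/val1.txt`, `labels/test1.txt`, with one relative path such as `banded/banded_0002.jpg` per line; the function name `arrange_dtd` and the output layout are hypothetical.

```python
import shutil
from pathlib import Path


def arrange_dtd(dtd_root, output_root, split_id=1):
    """Copy DTD images into output_root/train and output_root/val.

    Assumed layout (not confirmed by the thread): dtd_root/images/<class>/<file>
    and dtd_root/labels/{train,val,test}<split_id>.txt, one relative path per line.
    """
    dtd_root, output_root = Path(dtd_root), Path(output_root)
    # The train and val lists go to 'train'; the test list goes to 'val'.
    plan = {"train": ["train", "val"], "val": ["test"]}
    counts = {}
    for out_name, list_names in plan.items():
        copied = 0
        for list_name in list_names:
            list_file = dtd_root / "labels" / f"{list_name}{split_id}.txt"
            for line in list_file.read_text().splitlines():
                rel = line.strip()
                if not rel:
                    continue  # skip blank lines
                dst = output_root / out_name / rel
                dst.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy(dtd_root / "images" / rel, dst)
                copied += 1
        counts[out_name] = copied
    return counts
```

Calling `arrange_dtd("path/to/dtd", "path/to/output")` returns the number of files copied into each folder, which can be compared against the lengths of the split files.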

gortizji · 2023-02-08T08:36:38Z

That makes sense 😄. Thanks a lot @gabrielilharco!

gortizji · 2023-03-24T16:14:16Z

Hi again,

Could you comment on the expected folder structure for SUN397? This seems to determine the classnames of the dataset, but I am not sure how to deal with the nested structure of labels such as volleyball_court/indoor and volleyball_court/outdoor.

Thanks in advance 😄

gabrielilharco · 2023-03-24T17:46:14Z

Hi @gortizji,

We expect the data to be stored without nested folders. It should look like this:

a_abbey
     sun_aaalbzqrimafwbiv.jpg
     sun_aasgdbvvfthiibcm.jpg
     ...
a_airplane_cabin
a_airport_terminal
a_alley
a_amphitheater
...
v_volleyball_court_indoor
v_volleyball_court_outdoor
...
y_youth_hostel
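
The flattening shown in the listing above can be sketched as a small function. This is only an illustration: it assumes each line in the partition files is a path of the form `/<letter>/<class>[/<subclass>]/<file>.jpg`, and the name `flat_class_name` is hypothetical.

```python
def flat_class_name(partition_line: str) -> str:
    """Map '/v/volleyball_court/indoor/sun_x.jpg' to 'v_volleyball_court_indoor'."""
    # Remove surrounding whitespace and slashes, then split into path components.
    parts = partition_line.strip().strip("/").split("/")
    # Everything except the last component (the file name) forms the class path.
    return "_".join(parts[:-1])
```

Nested labels such as volleyball_court/indoor therefore become a single folder name, with the leading letter directory kept as a prefix.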

gortizji · 2023-03-24T18:12:07Z

Perfect! Thanks a lot.

prateeky2806 · 2023-04-18T18:57:37Z

Hi @gabrielilharco and @gortizji, I am facing a similar issue. I have downloaded the datasets from the provided links, but I am not sure how to structure the downloaded files so that they can be loaded correctly. Does either of you have a script that can be used to structure these downloaded datasets correctly?

Thank you in advance!
Prateek Yadav

prateeky2806 · 2023-04-19T03:50:00Z

I mostly figured this out myself, but for anyone else in the same situation, here are the scripts I used. Four datasets require manual downloading: DTD, EuroSAT, RESISC45, and SUN397. The links for downloading the datasets and the split files are mentioned above in the thread. Download the datasets from those links.
No folder structuring is required for RESISC45. I have provided the code for SUN397 and EuroSAT below. I forgot to save the script for DTD, but it is very similar to the ones provided below.

## PROCESS SUN397 DATASET

import os
import shutil


def process_dataset(txt_file, downloaded_data_path, output_folder):
    # Each line of the partition file is expected to be a path such as /a/abbey/sun_xxx.jpg.
    with open(txt_file, 'r') as file:
        lines = [line.strip() for line in file if line.strip()]

    for i, input_path in enumerate(lines):
        # Flatten the class path, e.g. /v/volleyball_court/indoor -> v_volleyball_court_indoor.
        final_folder_name = "_".join(input_path.split('/')[:-1])[1:]
        filename = input_path.split('/')[-1]
        output_class_folder = os.path.join(output_folder, final_folder_name)
        os.makedirs(output_class_folder, exist_ok=True)

        # Drop the leading '/' so that os.path.join keeps downloaded_data_path.
        full_input_path = os.path.join(downloaded_data_path, input_path[1:])
        output_file_path = os.path.join(output_class_folder, filename)
        shutil.copy(full_input_path, output_file_path)
        if i % 100 == 0:
            print(f"Processed {i}/{len(lines)} images")

downloaded_data_path = "path/to/downloaded/SUN/data"
process_dataset('Training_01.txt', downloaded_data_path, os.path.join(downloaded_data_path, "train"))
process_dataset('Testing_01.txt', downloaded_data_path, os.path.join(downloaded_data_path, "val"))

### PROCESS EuroSAT_RGB DATASET

import os
import shutil
import random

def create_directory_structure(base_dir, classes):
    for dataset in ['train', 'val', 'test']:
        path = os.path.join(base_dir, dataset)
        os.makedirs(path, exist_ok=True)
        for cls in classes:
            os.makedirs(os.path.join(path, cls), exist_ok=True)

def split_dataset(base_dir, source_dir, classes, val_size=270, test_size=270):
    for cls in classes:
        class_path = os.path.join(source_dir, cls)
        images = os.listdir(class_path)
        random.shuffle(images)

        # Per class: the first val_size images go to val, the next test_size to test, the rest to train.
        splits = {
            'train': images[val_size + test_size:],
            'val': images[:val_size],
            'test': images[val_size:val_size + test_size],
        }

        for split_name, split_images in splits.items():
            for img in split_images:
                src_path = os.path.join(class_path, img)
                dst_path = os.path.join(base_dir, split_name, cls, img)
                print(src_path, dst_path)
                shutil.copy(src_path, dst_path)

source_dir = '/nas-hdd/prateek/data/EuroSAT_RGB'  # replace with the path to your dataset
base_dir = '/nas-hdd/prateek/data/EuroSAT_Splitted'  # replace with the path to the output directory

classes = [d for d in os.listdir(source_dir) if os.path.isdir(os.path.join(source_dir, d))]

create_directory_structure(base_dir, classes)
split_dataset(base_dir, source_dir, classes)

Cheers,
Prateek
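
After running a split script like the EuroSAT one, it may be worth checking the resulting counts against the 21,600/2,700/2,700 split mentioned earlier in the thread. The following is a minimal sketch that assumes the output layout `<root>/{train,val,test}/<class>/<image>`; the function name `count_images` is hypothetical.

```python
import os


def count_images(split_root):
    """Return {split: number of files} for train/val/test under split_root."""
    counts = {}
    for split in ("train", "val", "test"):
        split_dir = os.path.join(split_root, split)
        total = 0
        if os.path.isdir(split_dir):
            # Sum the number of entries in every class folder of this split.
            for cls in os.listdir(split_dir):
                cls_dir = os.path.join(split_dir, cls)
                if os.path.isdir(cls_dir):
                    total += len(os.listdir(cls_dir))
        counts[split] = total
    return counts
```

With 270 validation and 270 test images taken from each of EuroSAT's 10 classes, the val and test folders should each hold 2,700 images, and train should hold the remaining 21,600 of the 27,000.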

gabrielilharco · 2023-04-19T15:47:52Z

Thanks a lot @prateeky2806!

fredzzhang · 2023-11-29T01:40:52Z

Hi @gabrielilharco and @prateeky2806,

I might have missed something, but why doesn't RESISC45 need a folder structure? The dataset class inherits from the ImageFolder class, which assumes the images are arranged under folders named after the classes. So when I tried to run the code on this dataset, I got the following error:

...
  File "/home/frederic/miniconda3/envs/ws/lib/python3.10/site-packages/torchvision/datasets/folder.py", line 309, in __init__
    super().__init__(
  File "/home/frederic/miniconda3/envs/ws/lib/python3.10/site-packages/torchvision/datasets/folder.py", line 144, in __init__
    classes, class_to_idx = self.find_classes(self.root)
  File "/home/frederic/miniconda3/envs/ws/lib/python3.10/site-packages/torchvision/datasets/folder.py", line 218, in find_classes
    return find_classes(directory)
  File "/home/frederic/miniconda3/envs/ws/lib/python3.10/site-packages/torchvision/datasets/folder.py", line 42, in find_classes
    raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
FileNotFoundError: Couldn't find any class folder in /home/frederic/data/resisc45/NWPU-RESISC45.

Is there something I missed?

Thanks,
Fred.
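
The error above comes from the class-folder discovery step. A rough approximation of that check, written with the standard library only, is sketched below. It is not torchvision's actual code, and the name `find_class_folders` is hypothetical; it only shows why a directory holding images directly, with no subfolders, fails.

```python
import os


def find_class_folders(root):
    """List the immediate subdirectories of root, sorted, as class names."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    if not classes:
        # Same situation as in the traceback: files only, no class folders.
        raise FileNotFoundError(f"Couldn't find any class folder in {root}.")
    return classes
```

Running this on the dataset root before constructing the dataset is a quick way to confirm that every class has its own folder.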

fredzzhang · 2023-11-29T07:41:23Z

It seems that creating a folder for each class is necessary. Either way, I'll attach the scripts to set up the resisc45 dataset for future reference.

mkdir resisc45 && cd resisc45
# Download the dataset and splits
FILE=NWPU-RESISC45.rar
ID=1DnPSU5nVSN7xv95bpZ3XQ0JhKXZOKgIv
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&id=$ID" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p')&id=$ID" -O $FILE && rm -rf /tmp/cookies.txt
unrar x $FILE
wget -O resisc45-train.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-train.txt"
wget -O resisc45-val.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-val.txt"
wget -O resisc45-test.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-test.txt"

# Partition the dataset into different classes

import os
import shutil

def create_directory_structure(data_root, split):
    split_file = f'resisc45-{split}.txt'
    with open(os.path.join(data_root, split_file), 'r') as f:
        lines = f.readlines()
    for line in lines:
        filename = line.strip()
        # File names are expected to look like <class_name>_<index>.jpg, so drop the last '_' field.
        class_name = '_'.join(filename.split('_')[:-1])
        class_dir = os.path.join(data_root, 'NWPU-RESISC45', class_name)
        os.makedirs(class_dir, exist_ok=True)
        # Move the image from the flat folder into its class folder.
        src_path = os.path.join(data_root, 'NWPU-RESISC45', filename)
        dst_path = os.path.join(class_dir, filename)
        print(src_path, dst_path)
        shutil.move(src_path, dst_path)

data_root = '/home/frederic/data/resisc45'
for split in ['train', 'val', 'test']:
    create_directory_structure(data_root, split)

Cheers,
Fred.

gortizji closed this as completed Feb 8, 2023

gortizji reopened this Mar 24, 2023

gortizji closed this as completed Mar 24, 2023

gabrielilharco mentioned this issue Sep 28, 2023

Query Regarding SUN397 Dataset mlfoundations/patching#4

Closed

yifei-he mentioned this issue Feb 14, 2024

About dataset preparation EnnengYang/AdaMerging#1

Closed
