Split definition of DTD, EuroSAT and SUN397 #1
Comments
Hi @gortizji. Thanks for the interest in our work and for the kind words!
Please note that in this codebase we use the suffix "Val" to indicate that we want to use the validation sets instead of the test sets (e.g. evaluating on …). Hope this helps, and let me know if you have any other questions!
Thanks @gabrielilharco for your quick answer. Some follow-up questions:
Thanks!
For DTD and SUN397, yes, we use the first split (train1.txt + val1.txt / test1.txt for DTD, and Training_01.txt / Testing_01.txt for SUN397, as in https://vision.princeton.edu/projects/2010/SUN/download/Partitions.zip). For EuroSAT, it indeed has 27,000 images, and we use a 21,600/2,700/2,700 train/val/test split (also updated the previous message).
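As a quick arithmetic check (my own, not from the thread): EuroSAT has 10 classes with 2,700 images each, so the 21,600/2,700/2,700 split corresponds to a per-class 2,160/270/270 partition:

```python
# Sanity-check the EuroSAT split sizes described above.
num_classes = 10           # EuroSAT_RGB has 10 land-use classes
images_per_class = 2700    # 10 * 2,700 = 27,000 images total
val_per_class = 270
test_per_class = 270
train_per_class = images_per_class - val_per_class - test_per_class

totals = (train_per_class * num_classes,
          val_per_class * num_classes,
          test_per_class * num_classes)
print(totals)  # (21600, 2700, 2700)
```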
That makes sense 😄. Thanks a lot @gabrielilharco!
Hi again. Could you comment on what the expected folder structure is for … ? Thanks in advance 😄
Hi @gortizji, we expect the data to be stored without nested folders. It should look like this: …
Perfect! Thanks a lot.
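To make the layout described above concrete, here is a small sketch (the class names are placeholders of my own) of the flat, one-folder-per-class structure that `ImageFolder`-style loaders expect:

```python
# Build a tiny ImageFolder-style layout: <root>/<split>/<class>/<image>,
# with no extra nesting inside the class folders.
import os
import tempfile

root = tempfile.mkdtemp()  # stand-in for the real dataset root
for split in ("train", "val"):
    for cls in ("Forest", "River"):  # placeholder class names
        os.makedirs(os.path.join(root, split, cls), exist_ok=True)

print(sorted(os.listdir(os.path.join(root, "train"))))  # ['Forest', 'River']
```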
Hi @gabrielilharco and @gortizji, I am facing a similar issue. I have downloaded the datasets from the provided links, but I am not sure how to structure the downloaded files so that they can be loaded correctly. Does either of you have a script that can be used to correctly structure these downloaded datasets? Thank you in advance!
I kind of figured this out myself, but for anyone else like me, here are the scripts I used. There are four datasets that require manual downloading.

## PROCESS SUN397 DATASET

```python
import os
import shutil

def process_dataset(txt_file, downloaded_data_path, output_folder):
    with open(txt_file, 'r') as file:
        lines = file.readlines()
    for i, line in enumerate(lines):
        input_path = line.strip()
        # '/a/abbey/sun_x.jpg' -> class folder 'a_abbey', filename 'sun_x.jpg'
        final_folder_name = "_".join(input_path.split('/')[:-1])[1:]
        filename = input_path.split('/')[-1]
        output_class_folder = os.path.join(output_folder, final_folder_name)
        if not os.path.exists(output_class_folder):
            os.makedirs(output_class_folder)
        full_input_path = os.path.join(downloaded_data_path, input_path[1:])
        output_file_path = os.path.join(output_class_folder, filename)
        shutil.copy(full_input_path, output_file_path)
        if i % 100 == 0:
            print(f"Processed {i}/{len(lines)} images")

downloaded_data_path = "path/to/downloaded/SUN/data"
process_dataset('Training_01.txt', downloaded_data_path, os.path.join(downloaded_data_path, "train"))
process_dataset('Testing_01.txt', downloaded_data_path, os.path.join(downloaded_data_path, "val"))
```
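As a quick sanity check after running the SUN397 script (my own addition, not part of the original comment), you can count how many images landed in each split folder:

```python
# Hypothetical sanity check: count images per split folder.
import os
import tempfile

def count_images(split_dir, exts=(".jpg", ".jpeg", ".png")):
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(split_dir):
        total += sum(name.lower().endswith(exts) for name in filenames)
    return total

# Tiny demo on a throwaway folder (real use: count_images(".../SUN/train")).
demo = tempfile.mkdtemp()
os.makedirs(os.path.join(demo, "abbey"))
for name in ("sun_000001.jpg", "sun_000002.jpg", "readme.txt"):
    open(os.path.join(demo, "abbey", name), "w").close()
print(count_images(demo))  # 2
```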
### PROCESS EuroSAT_RGB DATASET

```python
import os
import random
import shutil

def create_directory_structure(base_dir, classes):
    for dataset in ['train', 'val', 'test']:
        path = os.path.join(base_dir, dataset)
        os.makedirs(path, exist_ok=True)
        for cls in classes:
            os.makedirs(os.path.join(path, cls), exist_ok=True)

def split_dataset(base_dir, source_dir, classes, val_size=270, test_size=270):
    for cls in classes:
        class_path = os.path.join(source_dir, cls)
        images = os.listdir(class_path)
        random.shuffle(images)  # note: seed the RNG for a reproducible split
        splits = {
            'val': images[:val_size],
            'test': images[val_size:val_size + test_size],
            'train': images[val_size + test_size:],
        }
        for split, split_images in splits.items():
            for img in split_images:
                shutil.copy(os.path.join(class_path, img),
                            os.path.join(base_dir, split, cls, img))

source_dir = '/nas-hdd/prateek/data/EuroSAT_RGB'  # replace with the path to your dataset
base_dir = '/nas-hdd/prateek/data/EuroSAT_Splitted'  # replace with the path to the output directory
classes = [d for d in os.listdir(source_dir) if os.path.isdir(os.path.join(source_dir, d))]
create_directory_structure(base_dir, classes)
split_dataset(base_dir, source_dir, classes)
```

Cheers,
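One caveat worth noting: the EuroSAT script above shuffles with `random.shuffle` without a fixed seed, so the resulting split changes from run to run. Seeding the RNG first makes it reproducible; a minimal sketch (my own, using a hypothetical `shuffled` helper):

```python
import random

def shuffled(names, seed=0):
    # Seed a dedicated RNG before shuffling so every run
    # produces the same ordering (and hence the same split).
    rng = random.Random(seed)
    names = list(names)
    rng.shuffle(names)
    return names

images = [f"img_{i}.jpg" for i in range(10)]
print(shuffled(images) == shuffled(images))  # True: the split is reproducible
```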
Thanks a lot @prateeky2806!
Hi @gabrielilharco and @prateeky2806, I might have missed something, but why doesn't … ?
Is there something I missed? Thanks,
It seems that creating a folder for each class is necessary. Either way, I'll attach the scripts I used.

```shell
mkdir resisc45 && cd resisc45

# Download the dataset and splits
FILE=NWPU-RESISC45.rar
ID=1DnPSU5nVSN7xv95bpZ3XQ0JhKXZOKgIv
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&id=$ID" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p')&id=$ID" -O $FILE && rm -rf /tmp/cookies.txt
unrar x $FILE

wget -O resisc45-train.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-train.txt"
wget -O resisc45-val.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-val.txt"
wget -O resisc45-test.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-test.txt"
```

# Partition the dataset into different classes
```python
import os
import shutil

def create_directory_structure(data_root, split):
    split_file = f'resisc45-{split}.txt'
    with open(os.path.join(data_root, split_file), 'r') as f:
        lines = f.readlines()
    for l in lines:
        l = l.strip()
        # 'baseball_diamond_001.jpg' -> class 'baseball_diamond'
        class_name = '_'.join(l.split('_')[:-1])
        class_dir = os.path.join(data_root, 'NWPU-RESISC45', class_name)
        if not os.path.exists(class_dir):
            os.mkdir(class_dir)
        src_path = os.path.join(data_root, 'NWPU-RESISC45', l)
        dst_path = os.path.join(class_dir, l)
        shutil.move(src_path, dst_path)

data_root = '/home/frederic/data/resisc45'
for split in ['train', 'val', 'test']:
    create_directory_structure(data_root, split)
```

Cheers,
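The class-name parsing in the RESISC45 script relies on filenames having the form `<class_name>_<index>.jpg`: dropping the last underscore-separated token (which also carries the extension) leaves the class name. A small illustration (the `class_of` helper is my own naming, not from the script):

```python
def class_of(filename):
    # 'baseball_diamond_001.jpg' -> ['baseball', 'diamond', '001.jpg']
    # -> 'baseball_diamond'
    return '_'.join(filename.split('_')[:-1])

print(class_of('airplane_001.jpg'))          # airplane
print(class_of('baseball_diamond_042.jpg'))  # baseball_diamond
```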
Hi, awesome work!
I'm trying to reproduce your results but I cannot find the split definitions you use for DTD, EuroSAT and SUN397. Would you mind pointing me to the right resources to download the versions of these datasets compatible with your code?
Thanks a lot!