ImageFolder with Grayscale images dataset #4112

Closed
chainyo opened this issue Apr 6, 2022 · 3 comments

chainyo commented Apr 6, 2022

Hi, I'm facing a problem with a grayscale image dataset I have uploaded here (RVL-CDIP).

I'm getting an error when I try to use the images to train a model with a PyTorch DataLoader. Here is the full traceback:

AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1765, in __getitem__
    return self._getitem(
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1750, in _getitem
    formatted_output = format_table(
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 532, in format_table
    return formatter(pa_table, query_type=query_type)
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 281, in __call__
    return self.format_row(pa_table)
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/formatting/torch_formatter.py", line 58, in format_row
    return self.recursive_tensorize(row)
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/formatting/torch_formatter.py", line 54, in recursive_tensorize
    return map_nested(self._recursive_tensorize, data_struct, map_list=False)
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 314, in map_nested
    mapped = [
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 315, in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 267, in _single_map_nested
    return {k: _single_map_nested((function, v, types, None, True, None)) for k, v in pbar}
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 267, in <dictcomp>
    return {k: _single_map_nested((function, v, types, None, True, None)) for k, v in pbar}
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 251, in _single_map_nested
    return function(data_struct)
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/formatting/torch_formatter.py", line 51, in _recursive_tensorize
    return self._tensorize(data_struct)
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/formatting/torch_formatter.py", line 38, in _tensorize
    if np.issubdtype(value.dtype, np.integer):
AttributeError: 'bytes' object has no attribute 'dtype'

I don't really understand why the image is still a bytes object even though I applied transformations to it. Here is the code I used to upload the dataset (it worked well):

from datasets import load_dataset, DatasetDict

train_dataset = load_dataset("imagefolder", data_dir="data/train")
train_dataset = train_dataset["train"]
test_dataset = load_dataset("imagefolder", data_dir="data/test")
test_dataset = test_dataset["train"]
val_dataset = load_dataset("imagefolder", data_dir="data/val")
val_dataset = val_dataset["train"]

dataset = DatasetDict({
    "train": train_dataset,
    "val": val_dataset,
    "test": test_dataset
})
dataset.push_to_hub("ChainYo/rvl-cdip")

Now here is the code I am using to get the dataset and prepare it for training:

import torch
from datasets import load_dataset
from torchvision import transforms

img_size = 512
batch_size = 128
normalize = [(0.5), (0.5)]
data_dir = "ChainYo/rvl-cdip"

dataset = load_dataset(data_dir, split="train")

transforms = transforms.Compose([
        transforms.Resize(img_size), 
        transforms.CenterCrop(img_size), 
        transforms.ToTensor(), 
        transforms.Normalize(*normalize)
])

transformed_dataset = dataset.with_transform(transforms)
transformed_dataset.set_format(type="torch", device="cuda")

train_dataloader = torch.utils.data.DataLoader(
    transformed_dataset, batch_size=batch_size, shuffle=True, num_workers=4, pin_memory=True
)

But this gives me the error above, and I don't understand why it behaves this way. Do I need to map something over the dataset? Something like this:

from datasets import ClassLabel, Features, Image

labels = dataset.features["label"].names
num_labels = dataset.features["label"].num_classes


def preprocess_data(examples):
    images = [ex.convert("RGB") for ex in examples["image"]]
    labels = [ex for ex in examples["label"]]
    return {"images": images, "labels": labels}


features = Features({
    "images": Image(decode=True, id=None),
    "labels": ClassLabel(num_classes=num_labels, names=labels)
})


decoded_dataset = dataset.map(
    preprocess_data,
    remove_columns=dataset.column_names,
    features=features,
    batched=True,
    batch_size=100,
)
mariosasko (Collaborator) commented:

Hi! Replacing:

transformed_dataset = dataset.with_transform(transforms)
transformed_dataset.set_format(type="torch", device="cuda")

with:

def transform_func(examples):
    examples["image"] = [transforms(img).to("cuda") for img in examples["image"]]
    return examples

transformed_dataset = dataset.with_transform(transform_func)

should fix the issue. datasets doesn't support chaining of transforms (you can think of set_format/with_format as a predefined transform function for set_transform/with_transform), so the last transform set (in your case, set_format) takes precedence over the previous ones (in your case, with_transform). And the PyTorch formatter doesn't support the Image feature yet, hence the error (adding support for that is on our short-term roadmap).
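
For anyone who finds this later, here is a minimal end-to-end sketch of the suggested fix, with a couple of assumptions on my side: the torchvision Compose is stored as image_transforms (renamed so it doesn't rebind the transforms module), and tensors are moved to the GPU inside the training loop rather than inside the transform, which keeps it compatible with num_workers > 0.

import torch
from datasets import load_dataset
from torchvision import transforms

img_size = 512
batch_size = 128

dataset = load_dataset("ChainYo/rvl-cdip", split="train")

# Renamed so we don't rebind the torchvision.transforms module
image_transforms = transforms.Compose([
    transforms.Resize(img_size),
    transforms.CenterCrop(img_size),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),  # single channel, since the images are grayscale
])

def transform_func(examples):
    # examples["image"] holds decoded PIL images at this point
    examples["image"] = [image_transforms(img) for img in examples["image"]]
    return examples

transformed_dataset = dataset.with_transform(transform_func)

train_dataloader = torch.utils.data.DataLoader(
    transformed_dataset, batch_size=batch_size, shuffle=True, num_workers=4, pin_memory=True
)

for batch in train_dataloader:
    images = batch["image"].to("cuda", non_blocking=True)  # move to GPU here instead of in the transform
    labels = batch["label"].to("cuda", non_blocking=True)
    # ... training step ...
    break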

chainyo (Author) commented Apr 6, 2022

Ok, thanks a lot for the code snippet!

I love how easy datasets is to use, but it takes a really long time to pre-process all the images (400,000 in my case) before training anything. ImageFolder from torchvision is faster in my case, but it forces me to keep the images on my local machine.

I don't know how to speed up the process without switching to ImageFolder 😄

mariosasko (Collaborator) commented:

You can pass ignore_verifications=True to load_dataset to skip checksum verification, which takes a lot of time when the number of files is large. We will consider making this the default behavior.
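
As a quick sketch, reusing the calls from earlier in this thread with the extra keyword argument:

from datasets import load_dataset

# when building the dataset from local folders
train_dataset = load_dataset("imagefolder", data_dir="data/train", ignore_verifications=True)

# or when loading the pushed dataset back from the Hub
dataset = load_dataset("ChainYo/rvl-cdip", split="train", ignore_verifications=True)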

chainyo closed this as completed Apr 22, 2022