ImageFolder with Grayscale images dataset #4112

Closed
chainyo opened this issue Apr 6, 2022 · 3 comments

chainyo commented Apr 6, 2022

Hi, I'm facing a problem with a grayscale image dataset I have uploaded here (RVL-CDIP).

I'm getting an error when I try to use the images to train a model with a PyTorch DataLoader. Here is the full traceback:

AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1765, in __getitem__
    return self._getitem(
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1750, in _getitem
    formatted_output = format_table(
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 532, in format_table
    return formatter(pa_table, query_type=query_type)
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 281, in __call__
    return self.format_row(pa_table)
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/formatting/torch_formatter.py", line 58, in format_row
    return self.recursive_tensorize(row)
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/formatting/torch_formatter.py", line 54, in recursive_tensorize
    return map_nested(self._recursive_tensorize, data_struct, map_list=False)
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 314, in map_nested
    mapped = [
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 315, in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 267, in _single_map_nested
    return {k: _single_map_nested((function, v, types, None, True, None)) for k, v in pbar}
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 267, in <dictcomp>
    return {k: _single_map_nested((function, v, types, None, True, None)) for k, v in pbar}
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 251, in _single_map_nested
    return function(data_struct)
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/formatting/torch_formatter.py", line 51, in _recursive_tensorize
    return self._tensorize(data_struct)
  File "/home/chainyo/miniconda3/envs/gan-bird/lib/python3.8/site-packages/datasets/formatting/torch_formatter.py", line 38, in _tensorize
    if np.issubdtype(value.dtype, np.integer):
AttributeError: 'bytes' object has no attribute 'dtype'

I don't really understand why the image is still a bytes object even though I applied transformations to it. Here is the code I used to upload the dataset (it worked well):

from datasets import load_dataset, DatasetDict

train_dataset = load_dataset("imagefolder", data_dir="data/train")
train_dataset = train_dataset["train"]
test_dataset = load_dataset("imagefolder", data_dir="data/test")
test_dataset = test_dataset["train"]
val_dataset = load_dataset("imagefolder", data_dir="data/val")
val_dataset = val_dataset["train"]

dataset = DatasetDict({
    "train": train_dataset,
    "val": val_dataset,
    "test": test_dataset
})
dataset.push_to_hub("ChainYo/rvl-cdip")

Now here is the code I am using to get the dataset and prepare it for training:

import torch
from datasets import load_dataset
from torchvision import transforms

img_size = 512
batch_size = 128
normalize = [(0.5), (0.5)]
data_dir = "ChainYo/rvl-cdip"

dataset = load_dataset(data_dir, split="train")

transforms = transforms.Compose([
        transforms.Resize(img_size), 
        transforms.CenterCrop(img_size), 
        transforms.ToTensor(), 
        transforms.Normalize(*normalize)
])

transformed_dataset = dataset.with_transform(transforms)
transformed_dataset.set_format(type="torch", device="cuda")

train_dataloader = torch.utils.data.DataLoader(
    transformed_dataset, batch_size=batch_size, shuffle=True, num_workers=4, pin_memory=True
)

But this gives me the error above, and I don't understand why it behaves this way. Do I need to map something over the dataset? Something like this:

from datasets import ClassLabel, Features, Image

labels = dataset.features["label"].names
num_labels = dataset.features["label"].num_classes


def preprocess_data(examples):
    images = [ex.convert("RGB") for ex in examples["image"]]
    labels = [ex for ex in examples["label"]]
    return {"images": images, "labels": labels}


features = Features({
    "images": Image(decode=True, id=None),
    "labels": ClassLabel(num_classes=num_labels, names=labels)
})


decoded_dataset = dataset.map(
    preprocess_data,
    remove_columns=dataset.column_names,
    features=features,
    batched=True,
    batch_size=100,
)
mariosasko (Collaborator) commented:

Hi! Replacing:

transformed_dataset = dataset.with_transform(transforms)
transformed_dataset.set_format(type="torch", device="cuda")

with:

def transform_func(examples):
    examples["image"] = [transforms(img).to("cuda") for img in examples["image"]]
    return examples

transformed_dataset = dataset.with_transform(transform_func)

should fix the issue. datasets doesn't support chaining of transforms (you can think of set_format/with_format as a predefined transform function for set_transform/with_transform), so the last transform set (in your case, set_format) takes precedence over the previous ones (in your case, with_transform). And the PyTorch formatter doesn't support the Image feature yet, hence the error (adding support for that is on our short-term roadmap).
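
For anyone who finds this later, here is a minimal end-to-end sketch of the suggested fix, with a couple of assumptions on my side: the torchvision Compose is stored as image_transforms (renamed so it doesn't rebind the transforms module), and tensors are moved to the GPU inside the training loop rather than inside the transform, which keeps it compatible with num_workers > 0.

import torch
from datasets import load_dataset
from torchvision import transforms

img_size = 512
batch_size = 128

dataset = load_dataset("ChainYo/rvl-cdip", split="train")

# Renamed so we don't rebind the torchvision.transforms module
image_transforms = transforms.Compose([
    transforms.Resize(img_size),
    transforms.CenterCrop(img_size),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),  # single channel, since the images are grayscale
])

def transform_func(examples):
    # examples["image"] holds decoded PIL images at this point
    examples["image"] = [image_transforms(img) for img in examples["image"]]
    return examples

transformed_dataset = dataset.with_transform(transform_func)

train_dataloader = torch.utils.data.DataLoader(
    transformed_dataset, batch_size=batch_size, shuffle=True, num_workers=4, pin_memory=True
)

for batch in train_dataloader:
    images = batch["image"].to("cuda", non_blocking=True)  # move to GPU here instead of in the transform
    labels = batch["label"].to("cuda", non_blocking=True)
    # ... training step ...
    break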

chainyo (Author) commented Apr 6, 2022

Ok, thanks a lot for the code snippet!

I love how easy datasets is to use, but it takes a really long time to pre-process all the images (400,000 in my case) before training anything. ImageFolder from torchvision is faster in my case, but it forces me to keep the images on my local machine.

I don't know how to speed up the process without switching to ImageFolder 😄

mariosasko (Collaborator) commented:

You can pass ignore_verifications=True to load_dataset to skip checksum verification, which takes a lot of time when the number of files is large. We will consider making this the default behavior.
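
As a quick sketch, reusing the calls from earlier in this thread with the extra keyword argument:

from datasets import load_dataset

# when building the dataset from local folders
train_dataset = load_dataset("imagefolder", data_dir="data/train", ignore_verifications=True)

# or when loading the pushed dataset back from the Hub
dataset = load_dataset("ChainYo/rvl-cdip", split="train", ignore_verifications=True)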

chainyo closed this as completed Apr 22, 2022