
Error iterating over Dataset with DataLoader #1765

Closed
EvanZ opened this issue Jan 21, 2021 · 6 comments

Comments

@EvanZ

EvanZ commented Jan 21, 2021

I have a Dataset that I've mapped a tokenizer over:

encoded_dataset.set_format(type='torch',columns=['attention_mask','input_ids','token_type_ids'])
encoded_dataset[:1]
{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'input_ids': tensor([[  101,   178,  1198,  1400,  1714, 22233, 21365,  4515,  8618,  1113,
            102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}

When I try to iterate as in the docs, I get errors:

dataloader = torch.utils.data.DataLoader(encoded_dataset, batch_sampler=32)
next(iter(dataloader))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-05180ba8aa35> in <module>()
      1 dataloader = torch.utils.data.DataLoader(encoded_dataset, batch_sampler=32)
----> 2 next(iter(dataloader))

3 frames
/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in __init__(self, loader)
    411         self._timeout = loader.timeout
    412         self._collate_fn = loader.collate_fn
--> 413         self._sampler_iter = iter(self._index_sampler)
    414         self._base_seed = torch.empty((), dtype=torch.int64).random_(generator=loader.generator).item()
    415         self._persistent_workers = loader.persistent_workers

TypeError: 'int' object is not iterable


@mariosasko
Collaborator

Instead of:

dataloader = torch.utils.data.DataLoader(encoded_dataset, batch_sampler=32)

It should be:

dataloader = torch.utils.data.DataLoader(encoded_dataset, batch_size=32)

batch_sampler accepts a Sampler object or an Iterable of index batches, so passing a plain int raises a TypeError.
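A minimal sketch of the difference between the two arguments, using a toy TensorDataset as a stand-in for encoded_dataset:

```python
import torch
from torch.utils.data import DataLoader, BatchSampler, SequentialSampler, TensorDataset

# Toy stand-in for encoded_dataset (the real one comes from set_format above).
dataset = TensorDataset(torch.arange(10).unsqueeze(1))

# batch_size expects an int; DataLoader builds a sampler for you:
loader = DataLoader(dataset, batch_size=4)

# batch_sampler expects a Sampler (or any iterable of index batches),
# and is mutually exclusive with batch_size:
sampler = BatchSampler(SequentialSampler(dataset), batch_size=4, drop_last=False)
loader = DataLoader(dataset, batch_sampler=sampler)
```

Either form yields batches of 4 items; the batch_sampler form is only needed when you want custom batching logic.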

@EvanZ
Author

EvanZ commented Jan 22, 2021

@mariosasko I thought that would fix it, but now I'm getting a different error:

/usr/local/lib/python3.6/dist-packages/datasets/arrow_dataset.py:851: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return torch.tensor(x, **format_kwargs)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-20-3af1d82bf93a> in <module>()
      1 dataloader = torch.utils.data.DataLoader(encoded_dataset, batch_size=32)
----> 2 next(iter(dataloader))

5 frames
/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
     53             storage = elem.storage()._new_shared(numel)
     54             out = elem.new(storage)
---> 55         return torch.stack(batch, 0, out=out)
     56     elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
     57             and elem_type.__name__ != 'string_':

RuntimeError: stack expects each tensor to be equal size, but got [7] at entry 0 and [10] at entry 1

Any thoughts on what this means? Do I need padding?

@mariosasko
Collaborator

mariosasko commented Jan 23, 2021

Yes, padding is the answer.

This can be solved by passing a callable that adds padding as the collate_fn argument of DataLoader.
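A minimal sketch of such a collate_fn, assuming each example is a dict of 1-D tensors keyed by 'input_ids', 'attention_mask', and 'token_type_ids' (as produced by set_format above); it pads every feature to the longest sequence in the batch with torch.nn.utils.rnn.pad_sequence:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    # Pad every feature to the length of the longest sequence in the batch.
    return {
        key: pad_sequence([example[key] for example in batch],
                          batch_first=True, padding_value=0)
        for key in batch[0]
    }

# dataloader = torch.utils.data.DataLoader(encoded_dataset, batch_size=32,
#                                          collate_fn=pad_collate)
```

Alternatively, the tokenizer itself can pad to a fixed length at map time (padding='max_length'), which avoids a custom collate_fn at the cost of extra padding.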

@EvanZ
Author

EvanZ commented Jan 23, 2021

Padding was the fix, thanks!

@EvanZ EvanZ closed this as completed Jan 23, 2021
@anupamadeo

dataloader = torch.utils.data.DataLoader(encoded_dataset, batch_size=4)
batch = next(iter(dataloader))

I get:
ValueError: cannot reshape array of size 8192 into shape (1,512,4)

I had set padding to 2048 for encoded_dataset. Kindly help.

@arul210

arul210 commented Oct 28, 2022

data_loader_val = torch.utils.data.DataLoader(val_dataset, batch_size=32, shuffle=True, drop_last=False, num_workers=0)
dataiter = iter(data_loader_val)
images, _ = next(dataiter)

I get: TypeError: 'list' object is not callable

I cannot iterate through the data. Kindly suggest.
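A hedged guess at the cause, not confirmed by the thread: this error usually means the dataset's transform was passed as a plain list of callables instead of a single callable (e.g. wrapping the list in torchvision.transforms.Compose). A minimal stdlib-only sketch of the symptom and fix:

```python
# A list of callables is not itself callable:
transforms = [str.strip, str.lower]
try:
    transforms("  HELLO ")
except TypeError as e:
    print(e)  # 'list' object is not callable

# Composing the list into one callable fixes it
# (torchvision.transforms.Compose does the same thing for image transforms):
def compose(fns):
    def apply(x):
        for fn in fns:
            x = fn(x)
        return x
    return apply

pipeline = compose(transforms)
pipeline("  HELLO ")
```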
