
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. #15505

Closed
anum94 opened this issue Feb 3, 2022 · 7 comments · Fixed by #18119

@anum94

anum94 commented Feb 3, 2022

Maybe @SaulLu can help?

Information

I am following the text summarization tutorial on the Hugging Face website, which uses the mt5-small model. It explains step by step how to perform a text summarization task.

To reproduce

Steps to reproduce the behavior:

  1. Run the following notebook.
  2. Cell #32 should reproduce the following error (it did for me):
ValueError                                Traceback (most recent call last)
File ~/PycharmProjects/nlp-env/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:707, in BatchEncoding.convert_to_tensors(self, tensor_type, prepend_batch_axis)
    706 if not is_tensor(value):
--> 707     tensor = as_tensor(value)
    709     # Removing this for now in favor of controlling the shape with `prepend_batch_axis`
    710     # # at-least2d
    711     # if tensor.ndim > 2:
    712     #     tensor = tensor.squeeze(0)
    713     # elif tensor.ndim < 2:
    714     #     tensor = tensor[None, :]

ValueError: too many dimensions 'str'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Input In [72], in <module>
----> 1 data_collator(features)

File ~/PycharmProjects/nlp-env/lib/python3.9/site-packages/transformers/data/data_collator.py:586, in DataCollatorForSeq2Seq.__call__(self, features, return_tensors)
    583         else:
    584             feature["labels"] = np.concatenate([remainder, feature["labels"]]).astype(np.int64)
--> 586 features = self.tokenizer.pad(
    587     features,
    588     padding=self.padding,
    589     max_length=self.max_length,
    590     pad_to_multiple_of=self.pad_to_multiple_of,
    591     return_tensors=return_tensors,
    592 )
    594 # prepare decoder_input_ids
    595 if (
    596     labels is not None
    597     and self.model is not None
    598     and hasattr(self.model, "prepare_decoder_input_ids_from_labels")
    599 ):

File ~/PycharmProjects/nlp-env/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2842, in PreTrainedTokenizerBase.pad(self, encoded_inputs, padding, max_length, pad_to_multiple_of, return_attention_mask, return_tensors, verbose)
   2839             batch_outputs[key] = []
   2840         batch_outputs[key].append(value)
-> 2842 return BatchEncoding(batch_outputs, tensor_type=return_tensors)

File ~/PycharmProjects/nlp-env/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:212, in BatchEncoding.__init__(self, data, encoding, tensor_type, prepend_batch_axis, n_sequences)
    208     n_sequences = encoding[0].n_sequences
    210 self._n_sequences = n_sequences
--> 212 self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)

File ~/PycharmProjects/nlp-env/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:723, in BatchEncoding.convert_to_tensors(self, tensor_type, prepend_batch_axis)
    718         if key == "overflowing_tokens":
    719             raise ValueError(
    720                 "Unable to create tensor returning overflowing tokens of different lengths. "
    721                 "Please see if a fast version of this tokenizer is available to have this feature available."
    722             )
--> 723         raise ValueError(
    724             "Unable to create tensor, you should probably activate truncation and/or padding "
    725             "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
    726         )
    728 return self

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
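
For reference, here is a minimal sketch of the kind of input that triggers the error (the feature dicts below are hypothetical stand-ins; in the notebook they come from tokenized_datasets, which still contains the original string columns):

from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
data_collator = DataCollatorForSeq2Seq(tokenizer)

# Hypothetical features: tokenized fields plus a leftover raw-text column
features = [
    {"input_ids": [259, 336, 1], "labels": [259, 27531, 1], "review_title": "Great book!"},
    {"input_ids": [259, 336, 4921, 1], "labels": [259, 27531, 1], "review_title": "Not for me."},
]

# The string column cannot be padded into a tensor, so this raises the
# "Unable to create tensor ..." ValueError shown above
data_collator(features)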

@LysandreJik
Member

Ah actually this is linked to the summarization example in the course @lewtun

@lewtun
Member

lewtun commented Feb 3, 2022

Hi @anum94, thanks for reporting this bug! The cause of the error is that the tokenized_datasets object has columns with strings, and the data collator doesn't know how to pad these.

The fix is to add the following line before the data collator:

tokenized_datasets = tokenized_datasets.remove_columns(books_dataset["train"].column_names)

I'll post a fix in the website and Colab too - thanks!
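
A minimal sketch of where that line fits in the notebook's flow (this assumes books_dataset, preprocess_function, tokenizer and model are defined as in the course notebook; the names come from there and are not re-verified here):

from transformers import DataCollatorForSeq2Seq

tokenized_datasets = books_dataset.map(preprocess_function, batched=True)

# Drop the original string columns so only tokenized, paddable fields reach the collator
tokenized_datasets = tokenized_datasets.remove_columns(books_dataset["train"].column_names)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
features = [tokenized_datasets["train"][i] for i in range(2)]
batch = data_collator(features)  # no longer fails on string columns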

@anum94
Author

anum94 commented Feb 4, 2022

Thank you.
I think I have an older version of the datasets library, so it worked for me when I used

tokenized_datasets = tokenized_datasets.remove_columns_(books_dataset["train"].column_names)

instead of

tokenized_datasets = tokenized_datasets.remove_columns(books_dataset["train"].column_names)

Thanks for your help and the prompt response.
Cheers
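
For completeness, a small version-dependent sketch (the in-place remove_columns_ variant comes from older releases of the datasets library and was removed in later ones, so which call is available depends on the installed version):

cols = books_dataset["train"].column_names
if hasattr(tokenized_datasets, "remove_columns_"):
    # older datasets releases: in-place variant
    tokenized_datasets.remove_columns_(cols)
else:
    # newer datasets releases: returns a new dataset
    tokenized_datasets = tokenized_datasets.remove_columns(cols)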

@samozturk

When I used tokenized_datasets = tokenized_datasets.remove_columns(books_dataset["train"].column_names) it gave ZeroDivisionError: integer division or modulo by zero because it can't access the rows.

@dmatekenya

I'm following this tutorial and running into the same Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length error during model training. I'm not sure how to apply the solution that worked in the case above. Any help would be appreciated.

@ollayf

ollayf commented Jun 28, 2023

I am having the same issue^

@VirginieBfd

Upgrading numpy to 1.24 resolved this issue:

pip install numpy --upgrade
