
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. #15505

Closed
anum94 opened this issue Feb 3, 2022 · 7 comments · Fixed by #18119

@anum94

anum94 commented Feb 3, 2022

Maybe @SaulLu can help?

Information

I am following the text summarization tutorial on the Hugging Face website, which uses the mt5-small model. It explains step by step how to perform a text summarization task.

To reproduce

Steps to reproduce the behavior:

  1. Run the following notebook.
  2. Cell #32 should reproduce the following error (it did for me):
ValueError                                Traceback (most recent call last)
File ~/PycharmProjects/nlp-env/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:707, in BatchEncoding.convert_to_tensors(self, tensor_type, prepend_batch_axis)
    706 if not is_tensor(value):
--> 707     tensor = as_tensor(value)
    709     # Removing this for now in favor of controlling the shape with `prepend_batch_axis`
    710     # # at-least2d
    711     # if tensor.ndim > 2:
    712     #     tensor = tensor.squeeze(0)
    713     # elif tensor.ndim < 2:
    714     #     tensor = tensor[None, :]

ValueError: too many dimensions 'str'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Input In [72], in <module>
----> 1 data_collator(features)

File ~/PycharmProjects/nlp-env/lib/python3.9/site-packages/transformers/data/data_collator.py:586, in DataCollatorForSeq2Seq.__call__(self, features, return_tensors)
    583         else:
    584             feature["labels"] = np.concatenate([remainder, feature["labels"]]).astype(np.int64)
--> 586 features = self.tokenizer.pad(
    587     features,
    588     padding=self.padding,
    589     max_length=self.max_length,
    590     pad_to_multiple_of=self.pad_to_multiple_of,
    591     return_tensors=return_tensors,
    592 )
    594 # prepare decoder_input_ids
    595 if (
    596     labels is not None
    597     and self.model is not None
    598     and hasattr(self.model, "prepare_decoder_input_ids_from_labels")
    599 ):

File ~/PycharmProjects/nlp-env/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2842, in PreTrainedTokenizerBase.pad(self, encoded_inputs, padding, max_length, pad_to_multiple_of, return_attention_mask, return_tensors, verbose)
   2839             batch_outputs[key] = []
   2840         batch_outputs[key].append(value)
-> 2842 return BatchEncoding(batch_outputs, tensor_type=return_tensors)

File ~/PycharmProjects/nlp-env/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:212, in BatchEncoding.__init__(self, data, encoding, tensor_type, prepend_batch_axis, n_sequences)
    208     n_sequences = encoding[0].n_sequences
    210 self._n_sequences = n_sequences
--> 212 self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)

File ~/PycharmProjects/nlp-env/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:723, in BatchEncoding.convert_to_tensors(self, tensor_type, prepend_batch_axis)
    718         if key == "overflowing_tokens":
    719             raise ValueError(
    720                 "Unable to create tensor returning overflowing tokens of different lengths. "
    721                 "Please see if a fast version of this tokenizer is available to have this feature available."
    722             )
--> 723         raise ValueError(
    724             "Unable to create tensor, you should probably activate truncation and/or padding "
    725             "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
    726         )
    728 return self

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
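
For reference, here is a minimal sketch of the kind of input that triggers the error (the feature dicts below are hypothetical stand-ins; in the notebook they come from tokenized_datasets, which still contains the original string columns):

from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
data_collator = DataCollatorForSeq2Seq(tokenizer)

# Hypothetical features: tokenized fields plus a leftover raw-text column
features = [
    {"input_ids": [259, 336, 1], "labels": [259, 27531, 1], "review_title": "Great book!"},
    {"input_ids": [259, 336, 4921, 1], "labels": [259, 27531, 1], "review_title": "Not for me."},
]

# The string column cannot be padded into a tensor, so this raises the
# "Unable to create tensor ..." ValueError shown above
data_collator(features)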

@LysandreJik
Member

Ah actually this is linked to the summarization example in the course @lewtun

@lewtun
Member

lewtun commented Feb 3, 2022

Hi @anum94, thanks for reporting this bug! The cause of the error is that the tokenized_datasets object has columns with strings, and the data collator doesn't know how to pad these.

The fix is to add the following line before the data collator:

tokenized_datasets = tokenized_datasets.remove_columns(books_dataset["train"].column_names)

I'll post a fix in the website and Colab too - thanks!
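
A minimal sketch of where that line fits in the notebook's flow (this assumes books_dataset, preprocess_function, tokenizer and model are defined as in the course notebook; the names come from there and are not re-verified here):

from transformers import DataCollatorForSeq2Seq

tokenized_datasets = books_dataset.map(preprocess_function, batched=True)

# Drop the original string columns so only tokenized, paddable fields reach the collator
tokenized_datasets = tokenized_datasets.remove_columns(books_dataset["train"].column_names)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
features = [tokenized_datasets["train"][i] for i in range(2)]
batch = data_collator(features)  # no longer fails on string columns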

@anum94
Author

anum94 commented Feb 4, 2022

Thank you.
I think I have an older version of the datasets library, so it worked for me when I used

tokenized_datasets = tokenized_datasets.remove_columns_(books_dataset["train"].column_names)

instead of

tokenized_datasets = tokenized_datasets.remove_columns(books_dataset["train"].column_names)

Thanks for your help and the prompt response.
Cheers
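
For completeness, a small version-dependent sketch (the in-place remove_columns_ variant comes from older releases of the datasets library and was removed in later ones, so which call is available depends on the installed version):

cols = books_dataset["train"].column_names
if hasattr(tokenized_datasets, "remove_columns_"):
    # older datasets releases: in-place variant
    tokenized_datasets.remove_columns_(cols)
else:
    # newer datasets releases: returns a new dataset
    tokenized_datasets = tokenized_datasets.remove_columns(cols)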

@samozturk

When I used tokenized_datasets = tokenized_datasets.remove_columns(books_dataset["train"].column_names) it gave ZeroDivisionError: integer division or modulo by zero because it can't access the rows.

@dmatekenya

I'm following this tutorial and running into the same Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length error during model training. I'm not sure how to apply the solution that worked in the case above. Any help would be appreciated.

@ollayf

ollayf commented Jun 28, 2023

I am having the same issue^

@VirginieBfd

Upgrading numpy to 1.24 resolved this issue:

pip install numpy --upgrade
