More informative error message for DataCollatorForSeq2Seq #17447

CakeCrusher · 2022-05-27T01:08:28Z

What does this PR do?

I ran into an error caused by incorrectly shaped inputs when using DataCollatorForSeq2Seq. I learned that it originates in the BatchEncoding class. I did not find the error message particularly helpful, as it does not mention anything about the input shape. Therefore I added an extra line to the error message to help guide anyone else who runs into this error.

Fixes #15505
@stas00

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

stas00 · 2022-05-27T01:13:17Z

@CakeCrusher, if it's programmable - won't it be better to actually validate the input shape explicitly, and assert if it's wrong - instead of piling more possible causes onto an already long error message?

CakeCrusher · 2022-05-27T01:46:52Z

@stas00 agreed, I'll make a check for it.

CakeCrusher · 2022-05-27T20:27:13Z

src/transformers/tokenization_utils_base.py

@@ -715,15 +715,14 @@ def convert_to_tensors(

                    self[key] = tensor
            except:  # noqa E722
-                if key == "overflowing_tokens":
+                if key == "overflowing_tokens" or key == "input_ids" or key == "attention_mask":


I'm not sure if there are more keys to take into account for this issue.

I think there was a miscommunication there.

In the OP you suggested that there could be an error of passing wrongly shaped inputs - at least that's how I understood it. And I proposed that perhaps it'd be better to check if inputs are misformatted and assert if that's the case. But your newly proposed change is something totally different. So I think I lost you here.

Perhaps you could show an example of wrongly shaped inputs and then the corrected one?

HuggingFaceDocBuilderDev · 2022-05-27T20:27:47Z

The documentation is not available anymore as the PR was closed or merged.

CakeCrusher · 2022-05-27T20:28:21Z

src/transformers/tokenization_utils_base.py

                    raise ValueError(
-                        "Unable to create tensor returning overflowing tokens of different lengths. "
+                        f"Unable to create tensor returning {key} of different lengths. "
                        "Please see if a fast version of this tokenizer is available to have this feature available."


Not sure if this feature of fast tokenizers applies for the other keys.
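For illustration, the idea behind the dynamic message can be sketched outside the library. This is a minimal standalone sketch, not the code in tokenization_utils_base.py: the helper names `find_ragged_key` and `check_rectangular` and the padding hint in the message are assumptions made for the example. It reports whichever key holds sequences of different lengths instead of hard-coding "overflowing tokens".

```python
from typing import Dict, List, Optional


def find_ragged_key(batch: Dict[str, List[List[int]]]) -> Optional[str]:
    """Return the first key whose sequences differ in length, or None.

    Hypothetical helper for illustration; not part of transformers.
    """
    for key, sequences in batch.items():
        lengths = {len(seq) for seq in sequences}
        if len(lengths) > 1:
            return key
    return None


def check_rectangular(batch: Dict[str, List[List[int]]]) -> None:
    """Raise a ValueError naming the offending key if any entry is ragged."""
    key = find_ragged_key(batch)
    if key is not None:
        # The hint below is an assumption, not the library's wording.
        raise ValueError(
            f"Unable to create tensor returning {key} of different lengths. "
            "Consider padding or truncating the sequences to a common length."
        )


if __name__ == "__main__":
    check_rectangular({"input_ids": [[1, 2], [3, 4]]})  # passes silently
    try:
        check_rectangular({"input_ids": [[1, 2, 3], [4]]})
    except ValueError as err:
        print(err)
```

Because the key is looked up rather than listed in an `if`, the sketch does not need to know in advance which keys can be affected.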

CakeCrusher · 2022-06-05T15:47:47Z

Hi @stas00, is the PR ok?

stas00 · 2022-06-05T16:29:11Z

This PR is waiting for your answer here: #17447 (comment)

stas00 · 2022-06-30T15:26:25Z

@CakeCrusher, I think we lost each other here. Should we finish this PR?

CakeCrusher · 2022-07-06T00:10:12Z

@sgugger @stas00

Hi @stas00 sorry for the discontinuity, I am now able to focus and see this issue through.

Here is an example demonstrating successful and erring inputs:
https://colab.research.google.com/drive/16aLu6QrDSV_aUYRdpufl5E4iS08qkUGj?usp=sharing

I then made the following changes to overcome excessive nesting (a list containing a single item):
CakeCrusher/transformers@main...lead_nesting_solution

I understand the changes are pretty fundamental, but they work. I have yet to add the assert statement, since the nesting fix does the job forcefully. I was hoping to do an overarching PR, involving the new error message (or assert) and the fix (possibly parametrized so that it is not forced). What are your thoughts?
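A minimal sketch of what removing the excessive nesting could look like for a single example's feature. It is not the code in the linked branch: the function name `strip_lead_nesting` is hypothetical, and it assumes each feature is meant to be a flat list of token ids.

```python
from typing import Any


def strip_lead_nesting(value: Any) -> Any:
    """Unwrap outer lists for as long as the outer list holds exactly one list.

    Hypothetical helper for illustration; assumes the intended shape of a
    single example's feature is a flat list such as [101, 7592, 102].
    """
    while isinstance(value, list) and len(value) == 1 and isinstance(value[0], list):
        value = value[0]
    return value


if __name__ == "__main__":
    print(strip_lead_nesting([[101, 7592, 102]]))  # one redundant level removed
    print(strip_lead_nesting([101, 7592, 102]))    # already flat, unchanged
```

Note that this is only safe per example: applied to a whole batch that happens to hold one sequence, it would also strip the batch dimension.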

stas00 · 2022-07-08T21:59:59Z

This is an interesting idea, but I'm concerned that (1) it might not be backward compatible, and (2) I think it's best for the user to apply this function themselves. Perhaps if it's a useful util function we can provide it and assert with a message to use it instead?

And as a reminder, my initial suggestion was:

check if shape is wrong and raise a specific assert if it is wrong (with possible hints at how to fix it)

e.g. the inputs shape is wrong, expecting a, but got b....

won't that be a clean solution?

we can then discuss with others if they feel your proposed util function would be a good match to add.
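The check suggested above could be sketched roughly as follows. This is an illustration under assumptions, not code from the PR: the names `nesting_depth` and `validate_feature_depth`, the default key list, and the expected depth of 1 per example are all hypothetical.

```python
from typing import Any, Dict, Iterable, List


def nesting_depth(value: Any) -> int:
    """Count list/tuple levels by following the first element at each level."""
    depth = 0
    while isinstance(value, (list, tuple)):
        depth += 1
        if not value:
            break
        value = value[0]
    return depth


def validate_feature_depth(
    features: List[Dict[str, Any]],
    expected_depth: int = 1,
    keys: Iterable[str] = ("input_ids", "attention_mask", "labels"),
) -> None:
    """Raise a specific error when a feature is nested more or less than expected.

    Hypothetical validator; the keys and expected depth are assumptions.
    """
    for index, example in enumerate(features):
        for key in keys:
            if key not in example:
                continue
            depth = nesting_depth(example[key])
            if depth != expected_depth:
                raise ValueError(
                    f"Wrong input shape for `{key}` in example {index}: "
                    f"expected nesting depth {expected_depth}, but got {depth}. "
                    "Each feature should be a flat list of values."
                )


if __name__ == "__main__":
    validate_feature_depth([{"input_ids": [1, 2, 3]}])  # passes silently
    try:
        validate_feature_depth([{"input_ids": [[1, 2, 3]]}])
    except ValueError as err:
        print(err)
```

The check only inspects the first element at each level, so it is cheap, but it will not catch features that are nested inconsistently further along the list.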

CakeCrusher · 2022-07-09T19:59:49Z

@stas00

Perhaps if it's a useful util function we can provide it and assert with a message to use it instead?

That is an excellent idea.

I will have it ready early next week with a test.

Do you recommend I make a new PR for it or merge it to this one?

CakeCrusher · 2022-07-13T02:01:30Z

Hi @stas00 , I submitted a new PR for the aforementioned fixes. I have yet to add the test and proper docs. As for what I have so far please let me know what you think.

(My git tree was a mess, which is largely why it's a new PR. Sorry about that.)

stas00 · 2022-07-18T19:55:53Z

Apologies for taking a long time to follow up, @CakeCrusher

As I suggested in the first place I think your suggestion to assert on invalid input nesting is great.

I see you tried to move the helper util to datasets and it's not being welcomed there, as it's really a user's responsibility to prepare the data correctly.

Perhaps we just stick to the assert part and trust the user to figure out how to fix it?

@sgugger, are you ok with the assertion part of this PR on the deeply nested input? I'd guess that you too might be against the 2nd part of adding a helper util to remove excessive nesting as it's not generic enough.

CakeCrusher · 2022-07-18T23:17:38Z

No worries @stas00,
Yeah.. I understand if I have to give up on introducing the helper function in this PR. I'll see what lhoestq ends up thinking about the datasets alternative.

In the meantime, I'll keep the assert independent. And maybe open a new PR for the helper function.

sgugger · 2022-07-19T06:09:02Z

I must admit I do not understand what the problem is, since the notebook linked executes without any issue.

CakeCrusher · 2022-07-19T21:46:16Z

Sorry about that @sgugger, the notebook was organized in a weird way. Now the notebook will raise the error.

sgugger · 2022-07-20T06:22:45Z

I see. I've pointed out in #18119 where that error message should be updated.

More informative error message

aff6b41

CakeCrusher added 2 commits May 27, 2022 16:04

Merge branch 'main' of https://github.com/huggingface/transformers

9639817

raise dynamic error

0f84b93

huggingface deleted a comment from github-actions bot Jun 30, 2022

Merge branch 'main' of https://github.com/huggingface/transformers

7780f2a

CakeCrusher deleted the branch huggingface:main July 12, 2022 23:35

CakeCrusher closed this Jul 12, 2022

CakeCrusher deleted the main branch July 12, 2022 23:35
