More informative error message for DataCollatorForSeq2Seq #17447
@CakeCrusher, if it's programmable, wouldn't it be better to validate the input shape explicitly and assert if it's wrong, instead of piling more possible causes onto an already long error message?
@stas00 Agreed, I'll make a check for it.
@@ -715,15 +715,14 @@ def convert_to_tensors(

                 self[key] = tensor
             except:  # noqa E722
-                if key == "overflowing_tokens":
+                if key == "overflowing_tokens" or key == "input_ids" or key == "attention_mask":
I'm not sure whether there are more keys to take into account for this issue.
I think there was a miscommunication there.
In the OP you suggested that there could be an error of passing wrongly shaped inputs - at least that's how I understood it. And I proposed that perhaps it'd be better to check if inputs are misformatted and assert if that's the case. But your newly proposed change is something totally different. So I think I lost you here.
Perhaps you could show an example of wrongly shaped inputs and then the corrected one?
                     raise ValueError(
-                        "Unable to create tensor returning overflowing tokens of different lengths. "
+                        f"Unable to create tensor returning {key} of different lengths. "
                         "Please see if a fast version of this tokenizer is available to have this feature available."
Not sure if this feature of fast tokenizers applies to the other keys.
Hi @stas00, is the PR OK?
This PR is waiting for your answer here: #17447 (comment)
@CakeCrusher, I think we lost each other here. Should we finish this PR?
Hi @stas00, sorry for the discontinuity. I am now able to focus and see this issue through.
Here is an example demonstrating successful and failing inputs:
I then made the following changes to overcome excessive nesting (a list containing a single item):
I understand the changes are pretty fundamental, but they work. I have yet to add the assert statement, since the nesting fix does the job forcefully. I was hoping to do an overarching PR involving the new error message (or assert) and the fix (possibly parametrized so that it is not forced). What are your thoughts?
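The example and the patch referred to in the comment above are not reproduced in this thread. As a rough illustration only, assuming that "excessive nesting" means each feature value is wrapped in an extra single-item list, the kind of unwrapping being described might be sketched as follows. The helper name `strip_singleton_nesting` and the sample token ids are hypothetical and are not part of transformers.

```python
from typing import Any


def strip_singleton_nesting(value: Any) -> Any:
    """Unwrap lists that contain exactly one list, e.g. [[1, 2, 3]] -> [1, 2, 3]."""
    while isinstance(value, list) and len(value) == 1 and isinstance(value[0], list):
        value = value[0]
    return value


# Hypothetical feature dict with one extra level of nesting per key.
nested = {"input_ids": [[101, 7592, 102]], "labels": [[2023, 102]]}

# Apply the helper to every value; already-flat values pass through unchanged.
flat = {key: strip_singleton_nesting(val) for key, val in nested.items()}
```

Note that this unwrapping is ambiguous: a legitimate batch holding a single sequence has the same shape as an over-nested single example, which is one reason applying it automatically could change behavior for existing callers.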
This is an interesting idea, but I have two concerns: (1) it might not be backward compatible, and (2) I think it's best for the user to apply this function themselves. Perhaps, if it's a useful util function, we can provide it and assert with a message to use it instead? As a reminder, my initial suggestion was:
e.g. "the input shape is wrong, expecting a, but got b". Wouldn't that be a clean solution? We can then discuss with others whether they feel your proposed util function would be a good match to add.
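A minimal sketch of the "expecting a, but got b" style of check suggested here, under the assumption that the collator expects each feature value to be a flat list (nesting depth 1). The function names `nesting_depth` and `check_feature_depth` are hypothetical, and the sketch raises `ValueError` rather than using a bare `assert`; it is not the check that was eventually added to the library.

```python
from typing import Any, Dict, List


def nesting_depth(value: Any) -> int:
    """Count how many list/tuple levels wrap the first leaf element."""
    depth = 0
    while isinstance(value, (list, tuple)):
        depth += 1
        if not value:  # an empty list still counts as one level
            break
        value = value[0]
    return depth


def check_feature_depth(features: List[Dict[str, Any]], expected_depth: int = 1) -> None:
    """Raise ValueError naming the key and the expected vs. actual nesting depth."""
    for index, feature in enumerate(features):
        for key, value in feature.items():
            depth = nesting_depth(value)
            if depth != expected_depth:
                raise ValueError(
                    f"Feature {index}: `{key}` has the wrong shape, expecting "
                    f"nesting depth {expected_depth}, but got {depth}."
                )
```

Calling `check_feature_depth([{"input_ids": [[1, 2, 3]]}])` would raise with a message that names `input_ids` and reports depth 2 against the expected 1, which leaves the actual fix to the user.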
That is an excellent idea. I will have it ready early next week with a test. Do you recommend I open a new PR for it or add it to this one?
Apologies for taking a long time to follow up, @CakeCrusher. As I suggested in the first place, I think your suggestion to assert on invalid input nesting is great. I see you tried to move the helper util to
Perhaps we just stick to the assert part and trust the user to figure out how to fix it?
@sgugger, are you OK with the assertion part of this PR on the deeply nested input? I'd guess that you too might be against the second part, adding a helper util to remove excessive nesting, as it's not generic enough.
I must admit I do not understand what the problem is, since the linked notebook executes without any issue.
Sorry about that, @sgugger. The notebook was organized in a weird way. Now the notebook will raise the error.
I see. I've pointed out in #18119 where that error message should be updated.
What does this PR do?
I ran into an error related to an incorrect shape of inputs when using DataCollatorForSeq2Seq. I learned that it had to do with the BatchEncoding class. I did not find the error message particularly helpful, as it does not mention anything about the input shape. Therefore I added an extra line to the error message to help guide anyone else who runs into this error.
Fixes #15505
@stas00