Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More informative error message for DataCollatorForSeq2Seq #17447

Closed
wants to merge 4 commits into from

Conversation

CakeCrusher
Copy link
Contributor

@CakeCrusher CakeCrusher commented May 27, 2022

What does this PR do?

I ran into an error related to an incorrect shape of inputs when using DataCollatorForSeq2Seq. I learned that it had to do with the BatchEncoding class. I did not find the error message particularly helpful as it does not mention anything about the input shape. Therefore I added the extra line on the error message to help guide anyone else who runs into this error.

Fixes #15505
@stas00

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

@stas00
Copy link
Contributor

stas00 commented May 27, 2022

@CakeCrusher, if it's programmable - won't it be better to actually validate the input shape explicitly, and assert if it's wrong - instead of piling up possible errors to an already long error message?

@CakeCrusher
Copy link
Contributor Author

@stas00 agreed ill make a check for it

@@ -715,15 +715,14 @@ def convert_to_tensors(

self[key] = tensor
except: # noqa E722
if key == "overflowing_tokens":
if key == "overflowing_tokens" or key == "input_ids" or key == "attention_mask":
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there are more keys to take take into account for this issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there was a miscommunication there.

In the OP you suggested that there could be an error of passing wrongly shaped inputs - at least that's how I understood it. And I proposed that perhaps it'd be better to check if inputs are misformatted and assert if that's the case. But your newly proposed change is something totally different. So I think I lost you here.

Perhaps you could show an example of wrongly shaped inputs and then the corrected one?

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented May 27, 2022

The documentation is not available anymore as the PR was closed or merged.

raise ValueError(
"Unable to create tensor returning overflowing tokens of different lengths. "
f"Unable to create tensor returning {key} of different lengths. "
"Please see if a fast version of this tokenizer is available to have this feature available."
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this feature of fast tokenizers applies for the other keys.

@CakeCrusher
Copy link
Contributor Author

Hi @stas00 , is the pr ok?

@stas00
Copy link
Contributor

stas00 commented Jun 5, 2022

This PR is waiting for your answer here: #17447 (comment)

@huggingface huggingface deleted a comment from github-actions bot Jun 30, 2022
@stas00
Copy link
Contributor

stas00 commented Jun 30, 2022

@CakeCrusher, I think we lost each other here. Should we finish this PR?

@CakeCrusher
Copy link
Contributor Author

@sgugger @stas00

Hi @stas00 sorry for the discontinuity, I am now able to focus and see this issue through.

Here is an example demonstrating successful and erring inputs:
https://colab.research.google.com/drive/16aLu6QrDSV_aUYRdpufl5E4iS08qkUGj?usp=sharing

I then made the following changes to overcome excessive nesting (a list containing a single item):
CakeCrusher/transformers@main...lead_nesting_solution

I understand the changes are pretty fundamental, but they work. I have yet to add the assert statement, since the nesting fix does the job forcefully. I was hoping to do an overarching PR, involving the new error message (or assert) and the fix (possibly parametrized so that it is not forced). What are your thoughts?

@stas00
Copy link
Contributor

stas00 commented Jul 8, 2022

This is an interesting idea, but I'm concerned it might be (1) not backward compatible (2) I think it's best for the user to apply this function themselves. Perhaps if it's a useful util function we can provide it and assert with a message to use it instead?

And to remind my initial suggestion was:

  • check if shape is wrong and raise a specific assert if it is wrong (with possible hints at how to fix it)

e.g. the inputs shape is wrong, expecting a, but got b....

won't that be a clean solution?

we can then discuss with others if they feel your proposed util function would be a good match to add.

@CakeCrusher
Copy link
Contributor Author

@stas00

Perhaps if it's a useful util function we can provide it and assert with a message to use it instead?

That is an excellent idea.

I will have it ready early next week with a test.

Do you recommend I make a new PR for it or merge it to this one?

@CakeCrusher CakeCrusher deleted the branch huggingface:main July 12, 2022 23:35
@CakeCrusher CakeCrusher deleted the main branch July 12, 2022 23:35
@CakeCrusher
Copy link
Contributor Author

CakeCrusher commented Jul 13, 2022

Hi @stas00 , I submitted a new PR for the aforementioned fixes. I have yet to add the test and proper docs. As for what I have so far please let me know what you think.

(My git tree was a mess, so that was largely why it's a new PR sorry about that.)

@stas00
Copy link
Contributor

stas00 commented Jul 18, 2022

Apologies for taking a long time to follow up, @CakeCrusher

As I suggested in the first place I think your suggestion to assert on invalid input nesting is great.

I see you tried to move the helper util to datasets and it's not being welcomed there, as it's really a user's responsibility to prepare the data correctly.

Perhaps we just stick to the assert part and trust the user to figure out how to fix it?

@sgugger, are you ok with the assertion part of this PR on the deeply nested input? I'd guess that you too might be against the 2nd part of adding a helper util to remove excessive nesting as it's not generic enough.

@CakeCrusher
Copy link
Contributor Author

No worries @stas00,
Yeah.. I understand if I have to give up on introducing the helper function on this PR. I'll see what what lhoestq ends up thinking about the datasets alternative.

In the meantime, I'll keep the assert independent. And maybe open a new PR for the helper function.

@sgugger
Copy link
Collaborator

sgugger commented Jul 19, 2022

I must admit I do not understand what the problem is, since the notebook linked executes without any issue.

@CakeCrusher
Copy link
Contributor Author

Sorry about that @sgugger the notebook was organized in a weird way. Now the notebook will raise the error.

@sgugger
Copy link
Collaborator

sgugger commented Jul 20, 2022

I see. I've pointed out in #18119 where that error message should be updated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants