Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct order of overflowing tokens for LayoutLmV2 tokenizer #13495

Conversation

Apoorvgarg-creator
Copy link
Contributor

@Apoorvgarg-creator Apoorvgarg-creator commented Sep 9, 2021

What does this PR do?

This PR objective is same as #13179
Fixes #13148
The issue was resolved for every tokenizer except the LayoutLmV2 tokenizer.

Before submitting

  • Did you read the contributor guideline,
    Pull Request section? Yes 👍🏻.
  • Was this discussed/approved via a Github issue or the forum? Yes👍🏻 , Slow tokenizers return overflowing tokens in reversed order #13148
  • Did you make sure to update the documentation with your changes? Yes 👍🏻.
  • Did you add any necessary tests?
    The following test have been added which check the sequence of overflowing tokens, bbox sequence and input_ids.
    • test_maximum_encoding_length_pair_input
    • test_maximum_encoding_length_single_input

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@LysandreJik @sgugger @SaulLu @NielsRogge

@SaulLu
Copy link
Contributor

SaulLu commented Sep 9, 2021

Thank you very much for the PR.

As it seems to me that we didn't see any tests failing for LayoutLMv2 at the time your previous PR was merged, could you confirm that this behaviour is tested for LayoutLMv2 now? As I see that in the PR you did not change any tests, I'd like to make sure we catch the problem with the tests next time 🙂 .

@Apoorvgarg-creator
Copy link
Contributor Author

Apoorvgarg-creator commented Sep 9, 2021

As I see that in the PR you did not change any tests, I'd like to make sure we catch the problem with the tests next time 🙂 .

@SaulLu, The test test_maximum_encoding_length_pair_input and test_maximum_encoding_length_single_input were skipped because of the reason LayoutLMv2 tokenizer requires boxes besides sequences.
I will make the necessary changes to the file test_tokenization_layoutlmv2.py to ensure that we catch the problem next time.
Thank you

@Apoorvgarg-creator Apoorvgarg-creator changed the title Correct order of overflowing tokens for LayoutLmV2 tokenizer [WIP] Correct order of overflowing tokens for LayoutLmV2 tokenizer Sep 9, 2021
@Apoorvgarg-creator
Copy link
Contributor Author

@SaulLu, I have added a function get_clean_sequences which gives us list of words along with their ids and boxes, have taken reference from the test_tokenization_common.py.
Furthermore, I have modified the test_maximum_encoding_length_single_input to ensure the correct order of overflowing order.
Please can you review the changes.
Thank you

Copy link
Contributor

@SaulLu SaulLu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for all this work @Apoorvgarg-creator. 🎊

I left you mostly comments about the testing part. Especially as LayoutLmV2 has extra boxes, I think we should also add checks on the corresponding sequences.

Finally, if I'm not mistaken, there is no test_maximum_encoding_length_pair_input test for lajoutlmv2, it would be nice to have it too. 😄

Let me know if you need some help!

tests/test_tokenization_layoutlmv2.py Outdated Show resolved Hide resolved
tests/test_tokenization_layoutlmv2.py Outdated Show resolved Hide resolved
tests/test_tokenization_layoutlmv2.py Outdated Show resolved Hide resolved
tests/test_tokenization_layoutlmv2.py Show resolved Hide resolved
tests/test_tokenization_layoutlmv2.py Show resolved Hide resolved
tests/test_tokenization_layoutlmv2.py Show resolved Hide resolved
tests/test_tokenization_layoutlmv2.py Show resolved Hide resolved
tests/test_tokenization_layoutlmv2.py Outdated Show resolved Hide resolved
@Apoorvgarg-creator
Copy link
Contributor Author

@SaulLu, I have gone through the reviews. I will make the changes.
I didn't commit test_maximum_encoding_length_pair_input function because I wanted to see the changes that are required in the get_clean_sequences and test_maximum_encoding_length_single_input.
Thank you for reviewing and suggesting the changes.

@Apoorvgarg-creator
Copy link
Contributor Author

Apoorvgarg-creator commented Oct 15, 2021

@SaulLu @NielsRogge @LysandreJik, Sorry for the delay in the PR. I am done with most of the work, but I would like to ask for help in Comparing bbox sequence for fast tokenizer for test_maximum_encoding_length_pair_input. That test isn't working properly. Can you help me from where I should read about this ?
I am sorry once again for delaying the work.
Update: I have resolved the problem I was facing.

@Apoorvgarg-creator Apoorvgarg-creator changed the title [WIP] Correct order of overflowing tokens for LayoutLmV2 tokenizer Correct order of overflowing tokens for LayoutLmV2 tokenizer Oct 15, 2021
@Apoorvgarg-creator
Copy link
Contributor Author

@SaulLu, Working on test_maximum_encoding_length_pair_input. Will Commit the changes by end of the day.

@Apoorvgarg-creator
Copy link
Contributor Author

@SaulLu @LysandreJik @NielsRogge, could you please review the test added and changes.
Thank you

@Apoorvgarg-creator
Copy link
Contributor Author

@patrickvonplaten @SaulLu @LysandreJik @NielsRogge , Could you please review the PR.
Thank you.

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the very extensive tests, great work!
Regarding the format:

  • Could you add some comments explaining what is done? There is a lot of code so I fear it will become unmaintainable without clear comments explaining what is done.
  • Please replace all assert with unittest's specific methods.

I'll let @NielsRogge review the logic.

tests/test_tokenization_layoutlmv2.py Outdated Show resolved Hide resolved
@Apoorvgarg-creator
Copy link
Contributor Author

Apoorvgarg-creator commented Nov 1, 2021

  • Could you add some comments explaining what is done? There is a lot of code so I fear it will become unmaintainable without clear comments explaining what is done.

Sure, Thank you for reviewing.

@Apoorvgarg-creator Apoorvgarg-creator changed the title Correct order of overflowing tokens for LayoutLmV2 tokenizer [WIP] Correct order of overflowing tokens for LayoutLmV2 tokenizer Nov 2, 2021
Copy link
Contributor

@NielsRogge NielsRogge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks good to me.

Pinging @SaulLu for a final review.

@Apoorvgarg-creator Apoorvgarg-creator changed the title [WIP] Correct order of overflowing tokens for LayoutLmV2 tokenizer Correct order of overflowing tokens for LayoutLmV2 tokenizer Nov 7, 2021
Copy link
Contributor

@SaulLu SaulLu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for all your work @Apoorvgarg-creator !

Thanks especially for adding all those tests! I have one last little request which is about the overflowing boxes check, but to save you time I have opened a PR on your branch which includes the changes I am referring to. If you like these proposals, don't hesitate to merge this PR in your branch. 🙂

tests/test_tokenization_layoutlmv2.py Outdated Show resolved Hide resolved
@huggingface huggingface deleted a comment Nov 8, 2021
@SaulLu SaulLu self-requested a review November 8, 2021 18:52
Copy link
Contributor

@SaulLu SaulLu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for these latest changes! It's all good for me!

@Apoorvgarg-creator
Copy link
Contributor Author

Apoorvgarg-creator commented Nov 9, 2021

@SaulLu @LysandreJik, I have been getting these mails, I tried checking the reason for their failing but couldn't understand the issue.
Message : This request was automatically failed because there were no enabled runners online to process the request for more than 1 days.
Here the Screenshot

@sgugger sgugger merged commit 6326aa4 into huggingface:master Nov 9, 2021
@sgugger
Copy link
Collaborator

sgugger commented Nov 9, 2021

Thnaks a gain for all your work on this!

@Apoorvgarg-creator Apoorvgarg-creator deleted the LayoutLmV2-overflowing-tokens-order-fix branch November 9, 2021 12:52
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
…face#13495)

* correct order of overflowing tokens for LayoutLmV2 tokenizer

* test to check order of overflowing_tokens for a seq of input_ids

* fix up quality

* added suggested changes

* check that tests the bbox sequence

* pair_input test added

* pass quality test

* check bbox sequence added

* unittest method

* comments added

* add overflowing bbox test

* improved "seq_1"

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>

* improve code quality

Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Slow tokenizers return overflowing tokens in reversed order
5 participants