
Add LayoutXLMTokenizer and LayoutXLMTokenizerFast #14030

Closed
wants to merge 12 commits into from

Conversation

kingyiusuen (Contributor)

What does this PR do?

Fixes #13972 (LayoutLMv2Processor does not accept the XLMRobertaTokenizerFast)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Optionally, one can provide integer :obj:`word_labels`, which are turned into token-level :obj:`labels` for token
classification tasks (such as FUNSD, CORD).
:class:`~transformers.LayoutLMv2Tokenizer`, :class:`~transformers.LayoutLMv2TokenizerFast`,
:class:`LayoutXLMTokenizer` or :class:`LayoutXLMTokenizerFast`which turns the words and bounding boxes into
Suggested change
:class:`LayoutXLMTokenizer` or :class:`LayoutXLMTokenizerFast`which turns the words and bounding boxes into
:class:`LayoutXLMTokenizer` or :class:`LayoutXLMTokenizerFast` which turn the words and bounding boxes into
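
To illustrate the word_labels behavior described in the docstring above, here is a minimal sketch (illustrative only, not part of the PR diff; it uses the existing LayoutLMv2Tokenizer, whose API the LayoutXLM tokenizers mirror):

from transformers import LayoutLMv2Tokenizer

tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]
word_labels = [1, 2]  # one integer label per word

encoding = tokenizer(words, boxes=boxes, word_labels=word_labels)
# Word-level labels are expanded to token level: only the first token of each
# word keeps its label; special tokens and subword continuations get -100,
# which the loss function ignores.
print(encoding["labels"])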

@@ -0,0 +1,1237 @@
# coding=utf-8
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.

Suggested change
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
# Copyright 2021 The HuggingFace Inc. team.

@@ -0,0 +1,818 @@
# coding=utf-8
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.

Suggested change
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
# Copyright 2021 The HuggingFace Inc. team.

@NielsRogge (Contributor)

Nice! Let me know if you need any help.

@kingyiusuen (Contributor, Author)

kingyiusuen commented Oct 19, 2021

> Nice! Let me know if you need any help.

Thanks!

The tokenizer in `microsoft/layoutxlm-base` was registered with `XLMRobertaTokenizer`, so the user will get a warning that there is a mismatch when they do

tokenizer = LayoutXLMTokenizer.from_pretrained('microsoft/layoutxlm-base')

Is there a way to avoid this?

Also, I keep getting a `ModuleNotFoundError: No module named 'sentencepiece'` error in tests/test_processor_layoutlmv2.py. I can't figure out what is wrong.

@NielsRogge (Contributor)

NielsRogge commented Oct 19, 2021

> The tokenizer in `microsoft/layoutxlm-base` was registered with `XLMRobertaTokenizer`, so the user will get a warning

Yeah, that's because the config currently has an attribute `tokenizer_class` which is set to `XLMRobertaTokenizer`, as can be seen here. Once LayoutXLMTokenizer/LayoutXLMTokenizerFast are ready, we will upload the vocab files to the model repo and remove the `tokenizer_class` attribute.
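
(For reference, a minimal sketch of where that warning comes from, illustrative and not part of the PR: AutoTokenizer resolves the tokenizer class from this config attribute, so loading the checkpoint with a different class triggers the mismatch warning.)

from transformers import AutoConfig

config = AutoConfig.from_pretrained("microsoft/layoutxlm-base")
# At the time of this discussion this printed "XLMRobertaTokenizer"; per the
# plan above, the attribute would be removed once the new tokenizers are uploaded.
print(config.tokenizer_class)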

@kingyiusuen (Contributor, Author)

> The tokenizer in `microsoft/layoutxlm-base` was registered with `XLMRobertaTokenizer`, so the user will get a warning

> Yeah, that's because the config currently has an attribute `tokenizer_class` which is set to `XLMRobertaTokenizer`, as can be seen here. Once LayoutXLMTokenizer/LayoutXLMTokenizerFast are ready, we will upload the vocab files to the model repo and remove the `tokenizer_class` attribute.

Got it. I just have a failing test and a docstring error in the CI. Any advice on how to fix them?

@NielsRogge (Contributor)

Ok, I'll take a look later today.

@NielsRogge (Contributor)

NielsRogge commented Oct 21, 2021

@kingyiusuen I've fixed the docs issue and am currently looking into the tests. One is failing: the bounding boxes produced by the slow and fast tokenizers aren't equal:

from transformers import LayoutXLMTokenizer, LayoutXLMTokenizerFast

tokenizer_p = LayoutXLMTokenizer.from_pretrained("microsoft/layoutxlm-base")      # slow (Python) tokenizer
tokenizer_r = LayoutXLMTokenizerFast.from_pretrained("microsoft/layoutxlm-base")  # fast (Rust-backed) tokenizer
question = "what's his name?"
words = ["a", "weirdly", "test"]
boxes = [[423, 237, 440, 251], [427, 272, 441, 287], [419, 115, 437, 129]]

encoding_p = tokenizer_p(question, words, boxes, padding="max_length", max_length=20)
encoding_r = tokenizer_r(question, words, boxes, padding="max_length", max_length=20)
# compare the token-level bounding boxes side by side
for x, y in zip(encoding_p.bbox, encoding_r.bbox):
    print(x, y)

# this prints:
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[1000, 1000, 1000, 1000] [1000, 1000, 1000, 1000]
[423, 237, 440, 251] [1000, 1000, 1000, 1000]
[427, 272, 441, 287] [423, 237, 440, 251]
[427, 272, 441, 287] [427, 272, 441, 287]
[419, 115, 437, 129] [427, 272, 441, 287]
[1000, 1000, 1000, 1000] [419, 115, 437, 129]
[0, 0, 0, 0] [1000, 1000, 1000, 1000]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]

The input_ids and attention_mask are equal. Decoding the input_ids, it seems that two special tokens should be added between the question and the context (XLM-RoBERTa's pair format is `<s> A </s></s> B </s>`), so the fast tokenizer seems correct.
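
A quick way to verify this is to decode both encodings (an illustrative check continuing the snippet above, not part of the original comment):

print(tokenizer_p.decode(encoding_p.input_ids))
print(tokenizer_r.decode(encoding_r.input_ids))
# expected pair layout for XLM-RoBERTa: <s> question </s></s> words </s>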

UPDATE: fixed it in the slow tokenizer.

@NielsRogge (Contributor)

It might make sense to create a separate LayoutXLMProcessor. Will do this.
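
(For context, such a processor would mirror the existing LayoutLMv2Processor, which chains a feature extractor for the document image with the tokenizer. A hedged usage sketch, assuming the same API shape as LayoutLMv2Processor; the actual LayoutXLMProcessor was still being written at this point, and the input file is hypothetical:)

from PIL import Image
from transformers import LayoutXLMProcessor

processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")
image = Image.open("document.png").convert("RGB")  # hypothetical input file

# The feature extractor resizes the image and, by default, runs OCR to obtain
# words and normalized bounding boxes; the tokenizer then builds model inputs.
encoding = processor(image, return_tensors="pt")
print(encoding.keys())  # e.g. input_ids, attention_mask, bbox, image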

@NielsRogge (Contributor)

NielsRogge commented Oct 21, 2021

I've made all necessary changes, fixed all tests, implemented a new LayoutXLMProcessor and created new tests for it accordingly.

You can find my branch here: https://github.com/NielsRogge/transformers/tree/add-layoutxlm-fast-tokenizer

Should I open a PR on your branch? Or should I directly open a PR to HuggingFace Transformers?

@kingyiusuen (Contributor, Author)

> I've made all necessary changes, fixed all tests, implemented a new LayoutXLMProcessor and created new tests for it accordingly.
>
> You can find my branch here: https://github.com/NielsRogge/transformers/tree/add-layoutxlm-fast-tokenizer
>
> Should I open a PR on your branch? Or should I directly open a PR to HuggingFace Transformers?

Maybe you should open a PR directly to HuggingFace Transformers. You've done most of the heavy lifting, so it should count as your contribution. 😃
