
Add LayoutXLMTokenizer and LayoutXLMTokenizerFast #14030

Closed
wants to merge 12 commits into from

Conversation

kingyiusuen (Contributor)

What does this PR do?

Fixes #13972 (LayoutLMv2Processor does not accept the XLMRobertaTokenizerFast)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Optionally, one can provide integer :obj:`word_labels`, which are turned into token-level :obj:`labels` for token
classification tasks (such as FUNSD, CORD).
:class:`~transformers.LayoutLMv2Tokenizer`, :class:`~transformers.LayoutLMv2TokenizerFast`,
:class:`LayoutXLMTokenizer` or :class:`LayoutXLMTokenizerFast`which turns the words and bounding boxes into
Suggested change
:class:`LayoutXLMTokenizer` or :class:`LayoutXLMTokenizerFast`which turns the words and bounding boxes into
:class:`LayoutXLMTokenizer` or :class:`LayoutXLMTokenizerFast` which turn the words and bounding boxes into
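
To illustrate the word_labels behavior described in the docstring above, here is a minimal sketch (illustrative only, not part of the PR diff; it uses the existing LayoutLMv2Tokenizer, whose API the LayoutXLM tokenizers mirror):

from transformers import LayoutLMv2Tokenizer

tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]
word_labels = [1, 2]  # one integer label per word

encoding = tokenizer(words, boxes=boxes, word_labels=word_labels)
# Word-level labels are expanded to token level: only the first token of each
# word keeps its label; special tokens and subword continuations get -100,
# which the loss function ignores.
print(encoding["labels"])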

@@ -0,0 +1,1237 @@
# coding=utf-8
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.

Suggested change
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
# Copyright 2021 The HuggingFace Inc. team.

@@ -0,0 +1,818 @@
# coding=utf-8
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.

Suggested change
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
# Copyright 2021 The HuggingFace Inc. team.

@NielsRogge (Contributor)

Nice! Let me know if you need any help.

@kingyiusuen (Contributor, Author)

kingyiusuen commented Oct 19, 2021

> Nice! Let me know if you need any help.

Thanks!

The tokenizer in `microsoft/layoutxlm-base` was registered with `XLMRobertaTokenizer`, so the user will get a warning that there is a mismatch when they do

tokenizer = LayoutXLMTokenizer.from_pretrained('microsoft/layoutxlm-base')

Is there a way to avoid this?

Also, I keep getting a `ModuleNotFoundError: No module named 'sentencepiece'` error in tests/test_processor_layoutlmv2.py. I can't figure out what is wrong.

@NielsRogge (Contributor)

NielsRogge commented Oct 19, 2021

> The tokenizer in `microsoft/layoutxlm-base` was registered with `XLMRobertaTokenizer`, so the user will get a warning

Yeah, that's because the config currently has an attribute `tokenizer_class` which is set to `XLMRobertaTokenizer`, as can be seen here. Once LayoutXLMTokenizer/LayoutXLMTokenizerFast are ready, we will upload the vocab files to the model repo and remove the `tokenizer_class` attribute.
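
(For reference, a minimal sketch of where that warning comes from, illustrative and not part of the PR: AutoTokenizer resolves the tokenizer class from this config attribute, so loading the checkpoint with a different class triggers the mismatch warning.)

from transformers import AutoConfig

config = AutoConfig.from_pretrained("microsoft/layoutxlm-base")
# At the time of this discussion this printed "XLMRobertaTokenizer"; per the
# plan above, the attribute would be removed once the new tokenizers are uploaded.
print(config.tokenizer_class)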

@kingyiusuen (Contributor, Author)

> The tokenizer in `microsoft/layoutxlm-base` was registered with `XLMRobertaTokenizer`, so the user will get a warning

> Yeah, that's because the config currently has an attribute `tokenizer_class` which is set to `XLMRobertaTokenizer`, as can be seen here. Once LayoutXLMTokenizer/LayoutXLMTokenizerFast are ready, we will upload the vocab files to the model repo and remove the `tokenizer_class` attribute.

Got it. I just have a failing test and a docstring error in the CI. Any advice on how to fix them?

@NielsRogge (Contributor)

Ok, I'll take a look later today.

@NielsRogge (Contributor)

NielsRogge commented Oct 21, 2021

@kingyiusuen I've fixed the docs issue and am currently looking into the tests. One is failing: the bounding boxes produced by the slow and fast tokenizers aren't equal:

from transformers import LayoutXLMTokenizer, LayoutXLMTokenizerFast

tokenizer_p = LayoutXLMTokenizer.from_pretrained("microsoft/layoutxlm-base")      # slow (Python) tokenizer
tokenizer_r = LayoutXLMTokenizerFast.from_pretrained("microsoft/layoutxlm-base")  # fast (Rust-backed) tokenizer
question = "what's his name?"
words = ["a", "weirdly", "test"]
boxes = [[423, 237, 440, 251], [427, 272, 441, 287], [419, 115, 437, 129]]

encoding_p = tokenizer_p(question, words, boxes, padding="max_length", max_length=20)
encoding_r = tokenizer_r(question, words, boxes, padding="max_length", max_length=20)
# compare the token-level bounding boxes side by side
for x, y in zip(encoding_p.bbox, encoding_r.bbox):
    print(x, y)

# this prints:
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[1000, 1000, 1000, 1000] [1000, 1000, 1000, 1000]
[423, 237, 440, 251] [1000, 1000, 1000, 1000]
[427, 272, 441, 287] [423, 237, 440, 251]
[427, 272, 441, 287] [427, 272, 441, 287]
[419, 115, 437, 129] [427, 272, 441, 287]
[1000, 1000, 1000, 1000] [419, 115, 437, 129]
[0, 0, 0, 0] [1000, 1000, 1000, 1000]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]
[0, 0, 0, 0] [0, 0, 0, 0]

The input_ids and attention_mask are equal. Decoding the input_ids, it seems that two special tokens should be added between the question and the context (XLM-RoBERTa's pair format is `<s> A </s></s> B </s>`), so the fast tokenizer seems correct.
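
A quick way to verify this is to decode both encodings (an illustrative check continuing the snippet above, not part of the original comment):

print(tokenizer_p.decode(encoding_p.input_ids))
print(tokenizer_r.decode(encoding_r.input_ids))
# expected pair layout for XLM-RoBERTa: <s> question </s></s> words </s>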

UPDATE: fixed it in the slow tokenizer.

@NielsRogge (Contributor)

It might make sense to create a separate LayoutXLMProcessor. Will do this.
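
(For context, such a processor would mirror the existing LayoutLMv2Processor, which chains a feature extractor for the document image with the tokenizer. A hedged usage sketch, assuming the same API shape as LayoutLMv2Processor; the actual LayoutXLMProcessor was still being written at this point, and the input file is hypothetical:)

from PIL import Image
from transformers import LayoutXLMProcessor

processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")
image = Image.open("document.png").convert("RGB")  # hypothetical input file

# The feature extractor resizes the image and, by default, runs OCR to obtain
# words and normalized bounding boxes; the tokenizer then builds model inputs.
encoding = processor(image, return_tensors="pt")
print(encoding.keys())  # e.g. input_ids, attention_mask, bbox, image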

@NielsRogge (Contributor)

NielsRogge commented Oct 21, 2021

I've made all necessary changes, fixed all tests, implemented a new LayoutXLMProcessor and created new tests for it accordingly.

You can find my branch here: https://github.com/NielsRogge/transformers/tree/add-layoutxlm-fast-tokenizer

Should I open a PR on your branch? Or should I directly open a PR to HuggingFace Transformers?

@kingyiusuen (Contributor, Author)

> I've made all necessary changes, fixed all tests, implemented a new LayoutXLMProcessor and created new tests for it accordingly.
>
> You can find my branch here: https://github.com/NielsRogge/transformers/tree/add-layoutxlm-fast-tokenizer
>
> Should I open a PR on your branch? Or should I directly open a PR to HuggingFace Transformers?

Maybe you should open a PR directly to HuggingFace Transformers. You've done most of the heavy lifting, so it should count as your contribution. 😃
