Add LayoutXLMProcessor (and LayoutXLMTokenizer, LayoutXLMTokenizerFast) #14115

Merged

@NielsRogge NielsRogge commented Oct 22, 2021

What does this PR do?

This PR implements LayoutXLMProcessor, which can be used to prepare all data for LayoutXLM. LayoutXLM is a multilingual version of LayoutLMv2. It uses the same vocabulary as XLMRoBERTa.

Big thanks to @kingyiusuen for setting up a first draft. This PR is built on his work: #14030

  • To do: it might make sense to make a new "layoutxlm" folder in the models directory, where the following files can be added:
      • tokenization_layoutxlm.py
      • tokenization_layoutxlm_fast.py
      • processor_layoutxlm.py
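
For readers unfamiliar with what "preparing all data" involves, here is a minimal, self-contained sketch of one part of the job: giving every subword token the bounding box of the word it came from. It is not the code in this PR. The helper names (`toy_subword_split`, `prepare_example`), the fixed-size splitting, and the special-token boxes are assumptions made for illustration only; the `<s>` and `</s>` markers follow the XLM-RoBERTa convention.

```python
from typing import Dict, List

# Illustrative special-token boxes (an assumption, not taken from this PR).
CLS_BOX = [0, 0, 0, 0]
SEP_BOX = [1000, 1000, 1000, 1000]


def toy_subword_split(word: str, piece_len: int = 4) -> List[str]:
    """Stand-in for a real subword tokenizer: cut a word into fixed-size pieces."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def prepare_example(words: List[str], boxes: List[List[int]]) -> Dict[str, list]:
    """Pair every subword token with the bounding box of its source word."""
    if len(words) != len(boxes):
        # Raise a ValueError rather than using an assertion.
        raise ValueError("Each word needs exactly one bounding box.")

    tokens, token_boxes = ["<s>"], [CLS_BOX]
    for word, box in zip(words, boxes):
        pieces = toy_subword_split(word)
        tokens.extend(pieces)
        # Every piece of a word inherits that word's box.
        token_boxes.extend([box] * len(pieces))
    tokens.append("</s>")
    token_boxes.append(SEP_BOX)
    return {"tokens": tokens, "bbox": token_boxes}


example = prepare_example(
    ["Rechnung", "Nr."],
    [[50, 40, 210, 60], [220, 40, 260, 60]],
)
```

The real tokenizers work on a learned vocabulary and also handle padding, truncation and labels; the sketch only shows the word-to-token box alignment.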

vanpersie32 commented Oct 24, 2021

Have you merged it into master?

@sgugger sgugger left a comment

Thanks for adding those! I think we can simplify the methods that are rewritten for this tokenizer a bit (I left pointers in the slow file, but they should also apply to the fast one). Otherwise it looks great!

@LysandreJik LysandreJik left a comment

This is fantastic @NielsRogge! Thank you for working on this refactor. Great work adding tests as well.

LGTM!

docs/source/model_doc/layoutxlm.rst (review thread: outdated, resolved)
@NielsRogge NielsRogge merged commit 5f789a6 into huggingface:master Nov 3, 2021
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
…t) (huggingface#14115)

* Add LayoutXLMTokenizer and LayoutXLMTokenizerFast

* Fix styling issues

* Fix more styling issues

* Fix more styling issues

* Fix docstring

* Fix unit tests

* Fix docs

* Fix unit tests

* Fix typos and styling issues

* Fix styling issues

* Fix docstring

* Make all tests of test_tokenization_layoutxlm pass

* Add LayoutXLMProcessor

* Make fixup

* Make all LayoutXLMProcessor tests pass

* Minor fixes

* Leave LayoutLMv2Processor tests unchanged

* Fix code quality

* Move LayoutXLM tokenizers and processor to separate folder

* Fix code quality

* Apply suggestions from code review

* Replace assertions by value errors

* Remove methods from fast tokenizer

Co-authored-by: King Yiu Suen <kingyiusuen@gmail.com>

5 participants