Add LayoutXLMProcessor (and LayoutXLMTokenizer, LayoutXLMTokenizerFast) #14115
Have you merged it into master?
Thanks for adding those! I think we can simplify the methods that are rewritten for this tokenizer a bit (I left pointers in the slow file, but they should also apply to the fast one). Otherwise it looks great!
This is fantastic @NielsRogge! Thank you for working on this refactor. Great work adding tests as well.
LGTM!
…t) (huggingface#14115)

* Add LayoutXLMTokenizer and LayoutXLMTokenizerFast
* Fix styling issues
* Fix more styling issues
* Fix more styling issues
* Fix docstring
* Fix unit tests
* Fix docs
* Fix unit tests
* Fix typos and styling issues
* Fix styling issues
* Fix docstring
* Make all tests of test_tokenization_layoutxlm pass
* Add LayoutXLMProcessor
* Make fixup
* Make all LayoutXLMProcessor tests pass
* Minor fixes
* Leave LayoutLMv2Processor tests unchanged
* Fix code quality
* Move LayoutXLM tokenizers and processor to separate folder
* Fix code quality
* Apply suggestions from code review
* Replace assertions by value errors
* Remove methods from fast tokenizer

Co-authored-by: King Yiu Suen <kingyiusuen@gmail.com>
What does this PR do?
This PR implements LayoutXLMProcessor, which can be used to prepare all data for LayoutXLM. LayoutXLM is a multilingual version of LayoutLMv2. It uses the same vocabulary as XLMRoBERTa.

Big thanks to @kingyiusuen for setting up a first draft. This PR is built on his work: #14030
tokenization_layoutxlm.py
tokenization_layoutxlm_fast.py
processor_layoutxlm.py
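
To make "prepare all data" more concrete, here is a rough, dependency-free sketch of the kind of work a layout-aware processor does. It is not the code from this PR. It assumes the LayoutLM-family convention of scaling bounding boxes to a 0-1000 range and repeating each word's box for every subword token. The names normalize_box, toy_subword_split and prepare are hypothetical, and the fixed-size splitter only stands in for the real XLMRoBERTa vocabulary. The length check raises a ValueError rather than asserting, in the spirit of the "Replace assertions by value errors" commit.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]


def normalize_box(box: Sequence[float], width: float, height: float) -> Box:
    """Scale an (x0, y0, x1, y1) pixel box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return (
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    )


def toy_subword_split(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks."""
    return [word[i : i + max_len] for i in range(0, len(word), max_len)]


def prepare(
    words: Sequence[str],
    boxes: Sequence[Sequence[float]],
    width: float,
    height: float,
) -> Tuple[List[str], List[Box]]:
    """Pair every subword token with the normalized box of its source word."""
    if len(words) != len(boxes):
        # Explicit error instead of an assert, so it survives `python -O`.
        raise ValueError(
            f"Got {len(words)} words but {len(boxes)} boxes; they must match."
        )
    tokens: List[str] = []
    token_boxes: List[Box] = []
    for word, box in zip(words, boxes):
        pieces = toy_subword_split(word)
        tokens.extend(pieces)
        # Every subword inherits the box of the word it came from.
        token_boxes.extend([normalize_box(box, width, height)] * len(pieces))
    return tokens, token_boxes


if __name__ == "__main__":
    # Two words on a 200x100 pixel page.
    toks, tok_boxes = prepare(
        ["Invoice", "total"],
        [(10, 20, 110, 40), (120, 20, 180, 40)],
        width=200,
        height=100,
    )
    for tok, tok_box in zip(toks, tok_boxes):
        print(tok, tok_box)
```

In the actual library the tokenizer and processor handle these steps (plus special tokens, padding and truncation) internally; the sketch only illustrates why words and boxes have to stay aligned.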