Add LayoutXLMTokenizer and LayoutXLMTokenizerFast #14030
Optionally, one can provide integer :obj:`word_labels`, which are turned into token-level :obj:`labels` for token
classification tasks (such as FUNSD, CORD).
:class:`~transformers.LayoutLMv2Tokenizer`, :class:`~transformers.LayoutLMv2TokenizerFast`,
:class:`LayoutXLMTokenizer` or :class:`LayoutXLMTokenizerFast`which turns the words and bounding boxes into
- :class:`LayoutXLMTokenizer` or :class:`LayoutXLMTokenizerFast`which turns the words and bounding boxes into
+ :class:`LayoutXLMTokenizer` or :class:`LayoutXLMTokenizerFast` which turn the words and bounding boxes into
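The docstring quoted above says that integer `word_labels` are turned into token-level `labels`. A minimal sketch of that alignment follows, assuming the common convention of labeling only the first sub-word of each word and padding the rest with an ignore index. The names `align_word_labels` and `toy_split` are hypothetical and written for illustration only; `toy_split` is a stand-in for a real sub-word tokenizer, and this is not the code from the PR.

```python
from typing import Callable, List

# Assumed ignore index; -100 is the default ignore_index of PyTorch's CrossEntropyLoss.
PAD_TOKEN_LABEL = -100


def align_word_labels(
    words: List[str],
    word_labels: List[int],
    split_word: Callable[[str], List[str]],
    only_label_first_subword: bool = True,
) -> List[int]:
    """Expand one label per word into one label per sub-word token."""
    if len(words) != len(word_labels):
        raise ValueError("words and word_labels must have the same length")

    token_labels: List[int] = []
    for word, label in zip(words, word_labels):
        pieces = split_word(word)
        if not pieces:
            # A word that produces no tokens contributes no labels.
            continue
        if only_label_first_subword:
            # First piece keeps the word label, remaining pieces are ignored by the loss.
            token_labels.extend([label] + [PAD_TOKEN_LABEL] * (len(pieces) - 1))
        else:
            # Every piece inherits the word label.
            token_labels.extend([label] * len(pieces))
    return token_labels


def toy_split(word: str) -> List[str]:
    """Hypothetical stand-in for a sub-word tokenizer: chunks of up to 3 characters."""
    return [word[i : i + 3] for i in range(0, len(word), 3)]


labels = align_word_labels(["Invoice", "total"], [1, 2], toy_split)
print(labels)
```

A real tokenizer would additionally assign the ignore index to special tokens and padding, and would repeat each word's bounding box for every sub-word it produces; the sketch leaves both out.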
@@ -0,0 +1,1237 @@
# coding=utf-8
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
- # Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
+ # Copyright 2021 the HuggingFace Inc. team.
@@ -0,0 +1,818 @@
# coding=utf-8
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
- # Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
+ # Copyright 2021 The HuggingFace Inc. team.
Nice! Let me know if you need any help.
Thanks! The tokenizer in
Is there a way to avoid this? Also, I keep getting a
Yeah, that's because the config currently has an attribute
Got it. I just have a failing test and a docstring error in the CI. Any advice on how to fix them?
Ok, I'll take a look later today.
@kingyiusuen I've fixed the docs issue, currently checking out the tests. One is failing, the boxes created between the slow and fast tokenizer aren't equal:
The UPDATE: fixed it in the slow tokenizer.
It might make sense to create a separate LayoutXLMProcessor. Will do this.
I've made all necessary changes, fixed all tests, implemented a new You can find my branch here: https://github.com/NielsRogge/transformers/tree/add-layoutxlm-fast-tokenizer Should I open a PR on your branch? Or should I directly open a PR to HuggingFace Transformers?
Maybe you should directly open a PR to HuggingFace Transformers. You've done most of the heavy lifting. This should be counted as your contribution. 😃
What does this PR do?
Fixes #13972 (issue)
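Before submitting
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.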
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.