
LayoutXLM tokenizer issues after last update #14275

Closed
1 of 2 tasks
topolskib opened this issue Nov 4, 2021 · 3 comments · Fixed by #14344

Environment info

  • transformers version: 4.10.0
  • Platform: Darwin-20.4.0-x86_64-i386-64bit
  • Python version: 3.7.11
  • PyTorch version (GPU?): 1.9.0 (False)
  • Tensorflow version (GPU?): 2.4.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@NielsRogge

Describe the bug
I'm using LayoutXLM. After the latest update on the Hugging Face Hub, the tokenizer stopped working correctly.

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tokenizer for layoutxlm-base produces some mismatched token ids: they exceed the declared tokenizer vocabulary size, and they are also larger than the size of the model's embedding layer.

To Reproduce
Steps to reproduce the behavior:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/layoutxlm-base')
input_ = tokenizer(["foo"], boxes=[[0, 0, 12, 12]], add_special_tokens=True)

This produces

ValueError: Id not recognized

Tokenization works with add_special_tokens=False, but adding the special tokens manually then obviously crashes the model because the ids are too large.
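The symptom above can be checked with a small sanity helper. This is a hypothetical function (not part of transformers), and the numbers are illustrative only; the real LayoutXLM vocabulary is much larger:

```python
# Hypothetical helper (not part of transformers): list the token ids that an
# embedding table of the given size cannot look up -- the symptom reported here.
def find_out_of_range_ids(input_ids, vocab_size):
    """Return every id outside the valid range [0, vocab_size)."""
    return [tok_id for tok_id in input_ids if not 0 <= tok_id < vocab_size]

# Illustrative numbers only: with a 100-entry vocabulary, ids 100 and 103
# would be exactly the kind of out-of-range special-token ids reported here.
print(find_out_of_range_ids([0, 5, 99, 100, 103], vocab_size=100))  # [100, 103]
```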

Expected behavior
Tokenization works with add_special_tokens=True, and the model's embedding matrix is resized to match the tokenizer if necessary.
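In transformers, adapting the embeddings is done with the real call model.resize_token_embeddings(len(tokenizer)). Conceptually that amounts to copying the existing rows and appending rows for the new ids; a minimal pure-Python sketch of that step (the matrix is a plain list of rows, and new rows are zero-initialized here only to keep the sketch deterministic, whereas the library initializes them from the model's init distribution):

```python
# Conceptual sketch of embedding-table resizing; in transformers the real
# call is model.resize_token_embeddings(len(tokenizer)).
def resize_embedding_matrix(matrix, new_vocab_size, dim):
    """Keep existing rows (truncating if shrinking), append rows for new ids."""
    resized = [row[:] for row in matrix[:new_vocab_size]]
    while len(resized) < new_vocab_size:
        # Zero init keeps the sketch deterministic; real code draws random values.
        resized.append([0.0] * dim)
    return resized

# Growing a 1-row table to 3 rows of dimension 2:
print(resize_embedding_matrix([[1.0, 2.0]], new_vocab_size=3, dim=2))
# [[1.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
```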

@NielsRogge (Contributor)

Update: I've restored the tokenizer_class attribute of the LayoutXLM configuration, so your tokenizer works as expected again.

However, once a new version of Transformers is released, you can use LayoutXLMTokenizer/LayoutXLMTokenizerFast and the corresponding LayoutXLMProcessor, which prepare all the data for the model (see PR #14115).
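A sketch of that recommended path, assuming a transformers release that ships the LayoutXLM classes from PR #14115; the words and boxes below are placeholder data, mirroring the repro above:

```python
def prepare_example():
    # Assumes a transformers version that ships LayoutXLMTokenizerFast
    # (see PR #14115); words/boxes are placeholder data.
    from transformers import LayoutXLMTokenizerFast

    tokenizer = LayoutXLMTokenizerFast.from_pretrained("microsoft/layoutxlm-base")
    words = ["foo"]
    boxes = [[0, 0, 12, 12]]  # one bounding box per word
    return tokenizer(words, boxes=boxes, add_special_tokens=True, return_tensors="pt")

if __name__ == "__main__":
    encoding = prepare_example()
    print(encoding["input_ids"])
```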

@topolskib (Author)

Thanks. I've just tested the changes and it works.
