Describe the bug
I'm using LayoutXLM. After the last update on the Hugging Face Hub, the tokenizer stopped working correctly.
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The tokenizer for layoutxlm-base produces some mismatched token ids: they exceed the declared tokenizer vocabulary size, and they are also larger than the size of the embedding module in the model.
Update: I've restored the tokenizer_class attribute of the configuration of LayoutXLM, such that your tokenizer still works as expected.
However, once a new version of Transformers comes out, one can use LayoutXLMTokenizer/LayoutXLMTokenizerFast and the corresponding LayoutXLMProcessor, which allow you to prepare all the data for the model (see PR #14115).
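For reference, a rough sketch of what that processor-based preparation could look like (a sketch, not the exact upcoming API surface, assuming the microsoft/layoutxlm-base checkpoint, an RGB page image, and the default built-in OCR, which requires pytesseract):

```python
from PIL import Image
from transformers import LayoutXLMProcessor

# Hypothetical input file; any RGB document image works.
image = Image.open("document.png").convert("RGB")

# Bundles the LayoutLMv2 feature extractor and the LayoutXLM tokenizer.
processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")

# With the default apply_ocr=True, the processor runs OCR itself and returns
# input_ids, bbox, attention_mask and image tensors ready for the model.
encoding = processor(image, return_tensors="pt")
print(encoding.keys())
```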
Environment info
transformers version: 4.10.0
Who can help
@NielsRogge
To Reproduce
Steps to reproduce the behavior:
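A minimal sketch of the kind of check that exposes the problem (assuming the microsoft/layoutxlm-base checkpoint and that AutoTokenizer resolves it to a plain XLM-RoBERTa-style tokenizer that accepts raw text):

```python
from transformers import AutoConfig, AutoTokenizer

checkpoint = "microsoft/layoutxlm-base"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint)

ids = tokenizer("Hello world", add_special_tokens=True)["input_ids"]

print("declared tokenizer vocab size:", tokenizer.vocab_size)
print("model embedding size (config.vocab_size):", config.vocab_size)
print("max token id produced:", max(ids))
# With the broken tokenizer, max(ids) ends up >= config.vocab_size, so the
# embedding lookup inside the model fails with an index error.
```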
This produces token ids that exceed both the declared tokenizer vocabulary size and the size of the model's embedding module.
Tokenization works with add_special_tokens=False, but obviously adding the special tokens manually causes the model to crash because of the too-large ids.
Expected behavior
Tokenization works with add_special_tokens=True, and the model's embeddings are adapted to the new changes if necessary.
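Until that happens, a possible stop-gap is to grow the embedding matrix by hand so every id the tokenizer can emit has a row; a sketch, assuming detectron2 is installed (the LayoutLMv2 visual backbone needs it) and the microsoft/layoutxlm-base checkpoint:

```python
from transformers import AutoTokenizer, LayoutLMv2Model

checkpoint = "microsoft/layoutxlm-base"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = LayoutLMv2Model.from_pretrained(checkpoint)  # needs detectron2

# Resize the input embeddings to cover the full tokenizer length (base vocab
# plus added special tokens) so no id indexes past the embedding matrix.
model.resize_token_embeddings(len(tokenizer))
```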