
LayoutXLM tokenizer issues after last update #14275

Closed
1 of 2 tasks
topolskib opened this issue Nov 4, 2021 · 3 comments · Fixed by #14344

Environment info

  • transformers version: 4.10.0
  • Platform: Darwin-20.4.0-x86_64-i386-64bit
  • Python version: 3.7.11
  • PyTorch version (GPU?): 1.9.0 (False)
  • Tensorflow version (GPU?): 2.4.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@NielsRogge

Describe the bug
I'm using LayoutXLM. After the latest update on the Hugging Face Hub, the tokenizer stopped working correctly.

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tokenizer for layoutxlm-base produces some mismatched token ids: they exceed the declared tokenizer vocabulary size, and they are also larger than the size of the model's embedding layer.

To Reproduce
Steps to reproduce the behavior:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/layoutxlm-base')
input_ = tokenizer(["foo"], boxes=[[0, 0, 12, 12]], add_special_tokens=True)

This produces

ValueError: Id not recognized

Tokenization works with add_special_tokens=False, but adding the special tokens manually then obviously crashes the model because the ids are too large.
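The symptom above can be checked with a small sanity helper. This is a hypothetical function (not part of transformers), and the numbers are illustrative only; the real LayoutXLM vocabulary is much larger:

```python
# Hypothetical helper (not part of transformers): list the token ids that an
# embedding table of the given size cannot look up -- the symptom reported here.
def find_out_of_range_ids(input_ids, vocab_size):
    """Return every id outside the valid range [0, vocab_size)."""
    return [tok_id for tok_id in input_ids if not 0 <= tok_id < vocab_size]

# Illustrative numbers only: with a 100-entry vocabulary, ids 100 and 103
# would be exactly the kind of out-of-range special-token ids reported here.
print(find_out_of_range_ids([0, 5, 99, 100, 103], vocab_size=100))  # [100, 103]
```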

Expected behavior
Tokenization works with add_special_tokens=True, and the model's embedding matrix is resized to match the tokenizer if necessary.
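In transformers, adapting the embeddings is done with the real call model.resize_token_embeddings(len(tokenizer)). Conceptually that amounts to copying the existing rows and appending rows for the new ids; a minimal pure-Python sketch of that step (the matrix is a plain list of rows, and new rows are zero-initialized here only to keep the sketch deterministic, whereas the library initializes them from the model's init distribution):

```python
# Conceptual sketch of embedding-table resizing; in transformers the real
# call is model.resize_token_embeddings(len(tokenizer)).
def resize_embedding_matrix(matrix, new_vocab_size, dim):
    """Keep existing rows (truncating if shrinking), append rows for new ids."""
    resized = [row[:] for row in matrix[:new_vocab_size]]
    while len(resized) < new_vocab_size:
        # Zero init keeps the sketch deterministic; real code draws random values.
        resized.append([0.0] * dim)
    return resized

# Growing a 1-row table to 3 rows of dimension 2:
print(resize_embedding_matrix([[1.0, 2.0]], new_vocab_size=3, dim=2))
# [[1.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
```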

@NielsRogge (Contributor)

Update: I've restored the tokenizer_class attribute of the LayoutXLM configuration, so your tokenizer works as expected again.

However, once a new version of Transformers is released, you can use LayoutXLMTokenizer/LayoutXLMTokenizerFast and the corresponding LayoutXLMProcessor, which prepare all the data for the model (see PR #14115).
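A sketch of that recommended path, assuming a transformers release that ships the LayoutXLM classes from PR #14115; the words and boxes below are placeholder data, mirroring the repro above:

```python
def prepare_example():
    # Assumes a transformers version that ships LayoutXLMTokenizerFast
    # (see PR #14115); words/boxes are placeholder data.
    from transformers import LayoutXLMTokenizerFast

    tokenizer = LayoutXLMTokenizerFast.from_pretrained("microsoft/layoutxlm-base")
    words = ["foo"]
    boxes = [[0, 0, 12, 12]]  # one bounding box per word
    return tokenizer(words, boxes=boxes, add_special_tokens=True, return_tensors="pt")

if __name__ == "__main__":
    encoding = prepare_example()
    print(encoding["input_ids"])
```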

@topolskib (Author)

Thanks. I've just tested the changes and it works.
