Vocab size for microsoft/layoutxlm-base #50

FelipeAlb94 · 2021-11-03T22:27:52Z

Hello there,

First of all, thank you so much for the work you are doing; it has been really helpful for getting my hands dirty with state-of-the-art models.

Some weeks ago I fine-tuned layoutxlm-base using this notebook as a reference and it worked; I even got nice results with it.

Today I tried to run another training, but unfortunately something went wrong. After a couple of hours I noticed that the tokenizer's size and the model's vocab_size are both 250002, but the vocab's length is 250007.
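A mismatch like this can be caught before training by checking whether any token id falls outside the embedding table. The following is only a minimal sketch: the helper name check_vocab_fits is hypothetical, and the toy vocabulary stands in for the mapping that a Hugging Face tokenizer returns from get_vocab(), with num_embeddings playing the role of model.config.vocab_size.

```python
def check_vocab_fits(vocab, num_embeddings):
    """Return the token ids that would index past the embedding table.

    vocab: mapping of token -> id (e.g. the result of tokenizer.get_vocab())
    num_embeddings: number of rows in the word-embedding matrix
                    (e.g. model.config.vocab_size)
    """
    return sorted(i for i in vocab.values() if i >= num_embeddings)


# Toy stand-in for a real tokenizer vocabulary (illustrative values only).
toy_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "hello": 3, "<extra>": 4}

# With only 4 embedding rows, id 4 has no row to look up.
out_of_range = check_vocab_fits(toy_vocab, num_embeddings=4)
print(out_of_range)  # [4]
```

An empty list means every id the tokenizer can emit has a matching embedding row.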

So as a workaround I came up with this:
model.layoutlmv2.embeddings.word_embeddings = torch.nn.Embedding(250007, 768, padding_idx=1)

It seems to be working.
Furthermore, I will save the tokenizer and model files to ensure they always stay the same.

But my question is: if I change this layer in the previous model, will I get the same results? Or does it need to be retrained?

Once again, thank you so much!

NielsRogge · 2021-11-04T06:57:34Z

Hi,

Thanks for the kind words! Actually, I recently updated the config.json of LayoutXLM, and I probably know the reason: in a recent PR, I added a LayoutXLMProcessor, together with a new LayoutXLMTokenizer/LayoutXLMTokenizerFast.

Therefore, I removed the "tokenizer_class" attribute of LayoutXLM's configuration, as this was still set to XLMRobertaTokenizer.

However, you have to install Transformers from master to use them for now: pip install git+https://github.com/huggingface/transformers.git.

Maybe it's better for me to add the tokenizer_class attribute again, and remove it once Transformers has a new version on PyPI.

Thanks for reporting!

FelipeAlb94 · 2021-11-04T12:02:25Z

Hmm, OK, I see. Well, I'm using the right tokenizer and everything is working fine now.

Just to update: I let the model train with the previous change to the word embedding layer, but it didn't converge; the loss stayed high throughout the whole training process.
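One plausible contributor to the high loss is that a freshly constructed torch.nn.Embedding starts from random weights, so the pretrained word embeddings are discarded. The sketch below is an illustration only, not the fix used in this thread (which was to use the right tokenizer): the helper grow_embedding is hypothetical, and it enlarges an embedding table while copying the existing rows across.

```python
import torch


def grow_embedding(old, new_num_embeddings):
    """Return a larger nn.Embedding whose first rows are copied from `old`."""
    if new_num_embeddings < old.num_embeddings:
        raise ValueError("new size must be at least the old size")
    new = torch.nn.Embedding(
        new_num_embeddings, old.embedding_dim, padding_idx=old.padding_idx
    )
    with torch.no_grad():
        # Keep the existing rows; only the extra rows stay randomly initialised.
        new.weight[: old.num_embeddings].copy_(old.weight)
    return new


# Small illustrative sizes; the sizes in this thread were 250002 -> 250007, dim 768.
old = torch.nn.Embedding(10, 4, padding_idx=1)
new = grow_embedding(old, 15)
```

In Transformers, model.resize_token_embeddings(len(tokenizer)) serves a similar purpose and also preserves the existing rows.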

Appreciate your help!

NielsRogge · 2021-11-04T13:54:32Z

Update: I've restored the tokenizer_class attribute for now, so that there are no breaking changes. So for now, the recommended way to load the tokenizer for LayoutXLM is to use the AutoTokenizer class.

However, in the new version of Transformers, it's recommended to use LayoutXLMProcessor and LayoutXLMTokenizer/LayoutXLMTokenizerFast, which accept bounding boxes and labels and prepare them for the model.

FelipeAlb94 closed this as completed Nov 4, 2021

NielsRogge mentioned this issue Nov 4, 2021

LayoutXLM tokenizer issues after last update huggingface/transformers#14275
