Probably a bug in XLMRobertaTokenizer #2795

zjujh1995 · 2020-02-10T13:13:12Z

(Everything works perfectly when I run the experiment with Multilingual BERT, but it seems only the base model has been released.)

When using XLM-R, the corresponding tokenizer (XLMRobertaTokenizer) converts <unk> and every OOV token to id = 1. However, 1 should be the id of <pad>. (In the other direction, the tokenizer converts 1 to <pad> and 3 to <unk>.)
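
The mismatch described above can be checked with a small round-trip test. The following is only a sketch: `ToyTokenizer` is a hypothetical stand-in that hard-codes the ids reported in this issue, and `find_round_trip_mismatches` is an illustrative helper, not part of the library. It relies only on `convert_tokens_to_ids` and `convert_ids_to_tokens`, so it should presumably also work on a real `XLMRobertaTokenizer` instance.

```python
class ToyTokenizer:
    """Hypothetical stand-in that mimics the mismatch reported in this issue."""

    def __init__(self, token_to_id, id_to_token, unk_token="<unk>"):
        self.token_to_id = token_to_id
        self.id_to_token = id_to_token
        self.unk_token = unk_token

    def convert_tokens_to_ids(self, token):
        # Unknown (OOV) tokens fall back to the id mapped to unk_token.
        return self.token_to_id.get(token, self.token_to_id[self.unk_token])

    def convert_ids_to_tokens(self, index):
        # Unknown ids fall back to unk_token.
        return self.id_to_token.get(index, self.unk_token)


def find_round_trip_mismatches(tokenizer, tokens):
    """Return (token, id, decoded) triples where token -> id -> token changes the token."""
    mismatches = []
    for token in tokens:
        index = tokenizer.convert_tokens_to_ids(token)
        decoded = tokenizer.convert_ids_to_tokens(index)
        if decoded != token:
            mismatches.append((token, index, decoded))
    return mismatches


# Ids as reported above: <unk> encodes to 1, but 1 decodes to <pad> and 3 to <unk>.
buggy = ToyTokenizer(
    token_to_id={"<unk>": 1},
    id_to_token={1: "<pad>", 3: "<unk>"},
)
print(find_round_trip_mismatches(buggy, ["<unk>"]))
# prints [('<unk>', 1, '<pad>')]
```

An empty list would mean every checked token survives the round trip, which is what a consistent tokenizer should return for its special tokens.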

ShiroKL · 2020-02-12T13:51:07Z

Hi, if I am not mistaken, shouldn't the pad id be 2?
At least, that is what the tokenizer.pad_token_id parameter suggests for XLM-R.

EDIT: 2 is eos, sorry.

LysandreJik · 2020-03-09T22:45:28Z

Indeed, this is a bug that will be fixed when #3198 is merged. Thanks for letting us know.

LysandreJik added the "Core: Tokenization" label (Internals of the library; Tokenization.) Feb 10, 2020

dennlinger mentioned this issue Feb 14, 2020

Is it Multilingual? UKPLab/sentence-transformers#75

Closed

LysandreJik mentioned this issue Mar 9, 2020

XLM-R Tokenizer now passes common tests + Integration tests #3198

Merged

LysandreJik closed this as completed in #3198 Mar 18, 2020
