Conversion from slow to fast for BPE spm vocabs contained an error. #10120

Narsil · 2021-02-10T13:42:15Z

What does this PR do?

- Only one test (tokenizers + slow) currently exercises the modified code path: reformer. Its vocab does not modify any ids, so the bug stayed silent until now.
- The real issue is that the `vocab` variable was overwritten by the output of SentencePieceExtractor, so the slow-specific vocab oddities were completely ignored.
- The bug was reported in "Model Hub hanging in model's loading" #9518.
- Ran the complete tokenization test suite with slow tests enabled, without error (`RUN_SLOW=1 pytest -sv tests/test_tokenization_*`).
- Keep in mind that BPE + SPM models are relatively rare.
- I still need to carry out a full sweep of the hub to check all possible variants.
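
To illustrate the kind of bug described above, here is a minimal sketch of a variable being overwritten during conversion. It is not the actual diff of this PR: `slow_vocab`, `extract_from_spm_file`, `convert_buggy` and `convert_fixed` are hypothetical names, and the toy vocab and merges are made up for the example.

```python
from typing import Dict, List, Tuple


def slow_vocab() -> List[Tuple[str, float]]:
    """Hypothetical vocab as the slow tokenizer defines it.

    It includes a model-specific oddity: an extra "<pad>" token and
    reordered special tokens, which shift the ids of regular pieces.
    """
    return [
        ("<s>", 0.0),
        ("<pad>", 0.0),
        ("</s>", 0.0),
        ("<unk>", 0.0),
        ("▁he", -1.0),
        ("llo", -2.0),
    ]


def extract_from_spm_file() -> Tuple[Dict[str, int], List[Tuple[str, str]]]:
    """Hypothetical stand-in for an extractor that reads the raw SPM file.

    The raw file knows nothing about the slow tokenizer's id changes.
    """
    raw_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "▁he": 3, "llo": 4}
    merges = [("▁h", "e"), ("l", "lo")]
    return raw_vocab, merges


def convert_buggy() -> Tuple[Dict[str, int], List[Tuple[str, str]]]:
    vocab = slow_vocab()
    # Rebinding `vocab` here silently throws away the slow-specific ids.
    vocab, merges = extract_from_spm_file()
    return vocab, merges


def convert_fixed() -> Tuple[Dict[str, int], List[Tuple[str, str]]]:
    vocab = slow_vocab()
    # Keep only the merges from the extractor; ids come from the slow vocab.
    _, merges = extract_from_spm_file()
    bpe_vocab = {token: index for index, (token, _score) in enumerate(vocab)}
    return bpe_vocab, merges
```

In this toy setup the buggy path loses the `<pad>` token and assigns different ids to regular pieces, while the merges are identical in both paths. A vocab with no id modifications would produce the same result either way, which would explain why a test on such a model does not catch the problem.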

Affected models (all repos containing sentencepiece.bpe.model):

Musixmatch/umberto-commoncrawl-cased-v1
idb-ita/gilberto-uncased-from-camembert
itsunoda/wolfbbsRoBERTa-large (not fixed with current PR, seems linked to prefixed '_' in fast tokenizers)
itsunoda/wolfbbsRoBERTa-small (not fixed with current PR)
mrm8488/umberto-wikipedia-uncased-v1-finetuned-squadv1-it
EMBEDDIA/litlat-bert
neuralspace-reverie/indic-transformers-bn-xlmroberta
neuralspace-reverie/indic-transformers-hi-xlmroberta
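
For checking repos like the ones listed above, a small parity helper could look like the sketch below. `find_mismatches` is a hypothetical helper, not part of this PR or of the test suite; it only assumes that each tokenizer can be represented as a function from text to a list of ids.

```python
from typing import Callable, Iterable, List, Tuple

# An encoder maps a piece of text to a list of token ids.
Encoder = Callable[[str], List[int]]


def find_mismatches(
    slow_encode: Encoder,
    fast_encode: Encoder,
    texts: Iterable[str],
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, slow_ids, fast_ids) for every text where the ids differ."""
    mismatches = []
    for text in texts:
        slow_ids = slow_encode(text)
        fast_ids = fast_encode(text)
        if slow_ids != fast_ids:
            mismatches.append((text, slow_ids, fast_ids))
    return mismatches
```

With transformers, the two encoders could plausibly be the `encode` methods of `AutoTokenizer.from_pretrained(name, use_fast=False)` and `AutoTokenizer.from_pretrained(name, use_fast=True)`; that wiring is an assumption about usage and is left out of the sketch so that it runs without network access.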

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@thomwolf @LysandreJik @sgugger

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

n1t0

LGTM! Thanks for fixing this @Narsil!

sgugger

Thanks for the fix!

LysandreJik

Great, LGTM! Thanks @Narsil!

Narsil added 3 commits February 10, 2021 14:33

Remove rebase error.

ab55612

Adding the fixture.

c5d37c3

n1t0 approved these changes Feb 10, 2021

sgugger approved these changes Feb 10, 2021

LysandreJik approved these changes Feb 13, 2021

LysandreJik merged commit c9837a0 into huggingface:master Feb 13, 2021
