[Lllama] Update tokenization code to ensure parsing of the special tokens [core] #24042

ArthurZucker · 2023-06-06T09:08:40Z

What does this PR do?

Addresses the issues with the fast tokenizer of LLaMA. Namely:

the fast tokenizer should not return token type ids.
the added tokens are not correctly encoded.

There seems to be an issue with the conversion: before the Python layer, just loading the tokenizer_config.json file with the Rust backend still split the special token into pieces. tokenizer.encode("this is not<s>").tokens returned ['<s>', '▁This', '▁is', '▁not', '</', 's', '>']

ArthurZucker · 2023-06-06T09:44:27Z

Ok, narrowed it down to this line:

        # Check all our special tokens are registered as "no split" token (we don't cut them) and are in the vocab
        added_tokens = tokenizer.sanitize_special_tokens()

When converting the model from a slow one, the tokenizer correctly processes the inputs up until this point. This means the special tokens were already registered as special tokens beforehand, and adding them once more most probably breaks the internal regex. Still checking, but this should be the cause.

ArthurZucker · 2023-06-07T08:38:09Z

After debugging with @Narsil, it seems that the special tokens must not be normalized; otherwise the normalizer prepends a space to the token when it is added, which is why the token is not recognized. I suspect that there is another bug, as I tried with the special tokens set to normalized = True (when calling with from_slow=True and commenting out self._sanitize_special_tokens), but the current change should fix the conversion.
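To make the failure mode concrete, here is a minimal pure-Python sketch of the behaviour described above. It is a toy model, not the tokenizers implementation: `toy_normalize` and `is_recognized` are hypothetical helpers, and the normalizer is assumed to simply prepend the `▁` marker and replace spaces with it.

```python
SPACE_MARKER = "▁"  # SentencePiece-style word-boundary marker


def toy_normalize(text: str) -> str:
    """Toy stand-in for a Llama-style normalizer (assumption, not the real one)."""
    return SPACE_MARKER + text.replace(" ", SPACE_MARKER)


def is_recognized(text: str, token: str, normalized: bool) -> bool:
    """Return True if `token` would be found in `text` under this toy model."""
    if normalized:
        # Both the registered token and the input go through the normalizer,
        # so the pattern becomes "▁<s>" and only matches right after a space
        # (or at the very start of the text).
        return toy_normalize(token) in toy_normalize(text)
    # Non-normalized tokens are matched against the raw input text.
    return token in text


if __name__ == "__main__":
    text = "this is not<s>"
    print(is_recognized(text, "<s>", normalized=True))
    print(is_recognized(text, "<s>", normalized=False))
```

Under these assumptions, a special token glued to the previous word is missed when it is registered with normalized = True, and found when it is registered with normalized = False.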

A big discrepancy is that the default AddedToken imported from tokenizers sets normalized to !special, so if you add tokens as special tokens, normalized will be False. In transformers this is not the case, which explains why the call to sanitize is a source of problems.
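The two default rules can be contrasted with a small sketch. Everything here is hypothetical and for illustration only: `ToyAddedToken` is not the real `AddedToken` class, and the "fixed default" on the transformers side is assumed to be True, which the comment above implies but does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyAddedToken:
    """Hypothetical stand-in for an added token; not the real AddedToken."""
    content: str
    special: bool = False
    normalized: Optional[bool] = None  # None means "fall back to a default rule"


def resolve_not_special(token: ToyAddedToken) -> bool:
    """Default rule described for tokenizers: normalized = !special."""
    if token.normalized is not None:
        return token.normalized
    return not token.special


def resolve_fixed_default(token: ToyAddedToken, default: bool = True) -> bool:
    """Assumed alternative rule: a fixed default, regardless of `special`."""
    if token.normalized is not None:
        return token.normalized
    return default


if __name__ == "__main__":
    bos = ToyAddedToken("<s>", special=True)
    print(resolve_not_special(bos))    # special token, so not normalized
    print(resolve_fixed_default(bos))  # fixed default ignores `special`
```

With the first rule a special token is never normalized unless asked for explicitly; with a fixed default it is, so re-registering the same token through the second path can change how it is matched.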

ArthurZucker · 2023-06-07T15:24:14Z

We have to update the online models to change the tokenizer.json (people might be confused because the normalized param is also in the slow files, but there it is always ignored).
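A quick way to spot a tokenizer.json that still needs updating could look like the sketch below. It assumes the file has a top-level `added_tokens` list whose entries carry `content`, `special` and `normalized` keys; the helper name and the sample data are made up for illustration.

```python
import json
from typing import List


def special_tokens_still_normalized(tokenizer_json: dict) -> List[str]:
    """List special added tokens whose `normalized` flag is still true."""
    return [
        entry["content"]
        for entry in tokenizer_json.get("added_tokens", [])
        if entry.get("special", False) and entry.get("normalized", False)
    ]


# Made-up sample standing in for the parsed content of a tokenizer.json file.
SAMPLE = json.loads("""
{
  "added_tokens": [
    {"id": 0, "content": "<unk>", "special": true, "normalized": true},
    {"id": 1, "content": "<s>", "special": true, "normalized": false},
    {"id": 2, "content": "</s>", "special": true, "normalized": true}
  ]
}
""")

if __name__ == "__main__":
    print(special_tokens_still_normalized(SAMPLE))
```

For a real file, the dictionary would come from `json.load` on the opened tokenizer.json instead of the inline sample.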

sgugger

Thanks for the fix!

preventllama fast from returning token type ids

c0c9672

remove type hints

c8dee03

ArthurZucker changed the title ~~preventllama fast from returning token type ids~~ [Lllama] Update tokenization code to ensure parsing of the special tokens [core] Jun 6, 2023

normalised False

8622c17

Merge branch 'main' of https://github.com/huggingface/transformers in…

8c8c782

…to fix-llama-fast

ArthurZucker linked an issue Jun 7, 2023 that may be closed by this pull request

LLaMATokenizerFast works abnormally #23818

Closed

4 tasks

ArthurZucker requested a review from Narsil June 7, 2023 13:58

ArthurZucker marked this pull request as ready for review June 7, 2023 14:13

ArthurZucker requested a review from sgugger June 7, 2023 15:27

sgugger approved these changes Jun 7, 2023

ArthurZucker merged commit 535542d into huggingface:main Jun 9, 2023
22 checks passed

This was referenced Jun 9, 2023

LLaMA Implementation #21955

Closed

LLaMATokenizerFast works abnormally #23818

Closed

ArthurZucker mentioned this pull request Jun 27, 2023

LlamaModel.forward() got an unexpected keyword argument 'token_type_ids' #24514

Closed

4 tasks

ArthurZucker mentioned this pull request Jul 5, 2023

'eos_token_id' for llama model.generate is not working #24644

Closed

4 tasks

regisss mentioned this pull request Jul 15, 2023

Add llama model change huggingface/optimum-habana#296

Merged

3 tasks
