Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Lllama] Update tokenization code to ensure parsing of the special tokens [core] #24042

Merged
merged 4 commits into from Jun 9, 2023

Conversation

ArthurZucker
Copy link
Collaborator

What does this PR do?

Adresses the issues with the fast tokenizer of LLama. Namely:

  • nit making it return token type ids.
  • the added tokens are not correctly encoded.

There seems to be an issue with the conversion: before the python layer, just loading the tokenizer_config.json file with the rust backend still produced: tokenizer.encode("this is not<s>").tokens, ['<s>', '▁This', '▁is', '▁not', '</', 's', '>']

@ArthurZucker
Copy link
Collaborator Author

Ok, narrowed it down to this line:

        # Check all our special tokens are registered as "no split" token (we don't cut them) and are in the vocab
        added_tokens = tokenizer.sanitize_special_tokens()

When converting the model from a slow one, the tokenizer correctly processes the inputs up until this point. Meaning that before, the special tokens where already registered as special tokens, but adding them once more most probably breaks the internal regex. Still checking but should be this.

@ArthurZucker ArthurZucker changed the title preventllama fast from returning token type ids [Lllama] Update tokenization code to ensure parsing of the special tokens [core] Jun 6, 2023
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Jun 6, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker
Copy link
Collaborator Author

ArthurZucker commented Jun 7, 2023

After debugging with @Narsil it seems that the special tokens have to be not normalised, otherwise the normalizer prepends a space when adding it, which is why the token is not recognized. I suspect that there is another bug, as I tried with special tokens set to normalized = True (when calling from_slow=True+commenting self._sanitize_special_tokens) but the current should fix the conversion.

A big discrepancy is that the default AddedTokens imported from tokenizers will set normalized to !special, so if you add tokens as special tokens, normalized will be False. But in transformers this is not the case, which explains why the call to sanitize is a source of problem.

@ArthurZucker ArthurZucker linked an issue Jun 7, 2023 that may be closed by this pull request
4 tasks
@ArthurZucker ArthurZucker requested a review from Narsil June 7, 2023 13:58
@ArthurZucker ArthurZucker marked this pull request as ready for review June 7, 2023 14:13
@ArthurZucker
Copy link
Collaborator Author

We have to update the online models to change the tokenizer.json, (people might be confused because the normalized param is also in the slow files but always ignored)

@ArthurZucker ArthurZucker requested a review from sgugger June 7, 2023 15:27
Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

@ArthurZucker ArthurZucker merged commit 535542d into huggingface:main Jun 9, 2023
22 checks passed
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
…kens [core] (huggingface#24042)

* preventllama fast from returning token type ids

* remove type hints

* normalised False
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LLaMATokenizerFast works abnormally
3 participants