Special tokens not tokenized properly #12168
Comments
Hello! What is your tokenizer? Is it a WordPiece-based tokenizer, or a byte-level BPE-based tokenizer like the original one from RoBERTa?
Hi @LysandreJik, thanks for your reply and sorry that I'm just seeing this now. My tokenizer is a byte-level BPE-based tokenizer.
Hi @LysandreJik, let me know if you have a solution for this or if you need more info, thanks a lot in advance :)
Hi, how did you add the additional special tokens? So you started from a pre-trained RoBERTa, then added additional special tokens and further pre-trained on a corpus? Did you add these additional special tokens using the tokenizers library? Normally, one can add additional tokens as follows (based on huggingface/tokenizers#247 (comment)):
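A minimal sketch of that pattern, assuming a byte-level BPE tokenizer built with the tokenizers library; the file names are illustrative, not the ones from this issue:

```python
from tokenizers import ByteLevelBPETokenizer

# Load the trained byte-level BPE files (illustrative paths).
tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt")

# Register the extra token as a *special* token so the
# pre-tokenizer never splits it into sub-tokens.
tokenizer.add_special_tokens(["<hashtag>"])

# Should now encode to the single ID assigned to <hashtag>.
print(tokenizer.encode("<hashtag>").ids)
```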
However, printing the following:
Returns
When I then test your example:
I get:

And when doing:
I get:
Awesome @NielsRogge, thanks a lot! Will test this and get back to you/close if solved.
I created a new vocab with the tokenizers module, for which I added new special tokens. Here is the code I used:
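A sketch of what that training code could look like with the tokenizers library's ByteLevelBPETokenizer; the corpus path, vocab size, and exact special-token list are assumptions, not the values from the original issue:

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Train a byte-level BPE vocab; custom tokens are passed as special
# tokens so each gets its own ID (corpus and sizes are illustrative).
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>", "<hashtag>"],
)

# Writes vocab.json and merges.txt to the given directory.
tokenizer.save_model("tokenizer_dir")
```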
Works fine, thanks again!
Environment info
transformers version: 4.5.1

Who can help
@LysandreJik

Information
Hi,
I have recently further pretrained a RoBERTa model with fairseq. I use a custom vocabulary, trained with the tokenizers module. After converting the fairseq model to PyTorch, I loaded all my model-related files here.
When loading the tokenizer, I noticed that the special tokens are not tokenized properly.
To reproduce
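A minimal reproduction along these lines, where the model path is a placeholder for the uploaded files:

```python
from transformers import RobertaTokenizer

# Placeholder path; the issue loads the converted fairseq checkpoint
# together with its custom vocab.json/merges.txt.
tokenizer = RobertaTokenizer.from_pretrained("path/to/converted-model")

# The special token gets split into sub-tokens instead of surviving
# as a single piece.
print(tokenizer.tokenize("<hashtag>"))

# Expected [0, 7, 2] per the vocab (see "Expected behavior" below).
print(tokenizer("<hashtag>")["input_ids"])
```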
Expected behavior
Since <hashtag> is a special token in the vocabulary with ID 7 (see here), the last output should be: [0, 7, 2]. <hashtag> with the '<>' should also be recognized as a unique token.

Potential explanation
When looking at the files from a similar model, it seems that the vocab is in txt format and they also have the bpe.codes file, which I don't have. Could that be the issue? And if so, how do I convert my files to this format? For vocab.txt, I have already found your lengthy explanation here, thanks for this.