Bert Batch Encode Plus adding an extra [SEP] #3502
Comments
Hi @creat89, thanks for posting this issue! You are correct, there is some inconsistent behavior here.

Well, the issue doesn't only happen with a simple string; in my actual code I was using a batch of size 2. I just used a simple example to demonstrate the issue. I didn't find any inconsistency between

Sorry, I was skimming through your problem too quickly. I see what you mean now.

Created a PR that fixes this behavior. Thanks for pointing this out @creat89 :-)

There has been a big change in tokenizers recently :-) which adds a

Cool, that's awesome, and yes, I'm sure that makes everything easier. Cheers!
🐛 Bug
Information
I'm using the `bert-base-multilingual-cased` tokenizer and model to build another model. However, `batch_encode_plus` is adding an extra `[SEP]` token id in the middle.

The problem arises when using:
- the strings `16.`, `3.`, and `10.`
- the `bert-base-multilingual-cased` tokenizer, used beforehand to tokenize the strings above, with `batch_encode_plus` then used to convert the tokenized strings to ids

In fact, `batch_encode_plus` will generate an `input_ids` list containing two `[SEP]` tokens, such as `[101, 10250, 102, 119, 102]`.
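The extra id pattern suggests the pre-tokenized input is being post-processed as a sentence *pair* rather than a single sequence. A minimal sketch of the two BERT layouts, using the ids quoted above (the helper functions and the id table are illustrative, not transformers API):

```python
# Illustrative sketch: BERT single-sequence vs sentence-pair layouts.
# The CLS/SEP ids and wordpiece ids are copied from the issue output;
# the helper functions are hypothetical, not part of transformers.
CLS, SEP = 101, 102
WORDPIECE_IDS = {"16": 10250, ".": 119}

def encode_single(tokens):
    # Expected layout for one sequence: [CLS] tokens [SEP]
    return [CLS] + [WORDPIECE_IDS[t] for t in tokens] + [SEP]

def encode_pair(first, second):
    # Layout BERT uses for a sentence pair: [CLS] first [SEP] second [SEP]
    return ([CLS] + [WORDPIECE_IDS[t] for t in first] + [SEP]
            + [WORDPIECE_IDS[t] for t in second] + [SEP])

print(encode_single(["16", "."]))  # [101, 10250, 119, 102]
print(encode_pair(["16"], ["."]))  # [101, 10250, 102, 119, 102] -- the buggy output
```

The buggy `input_ids` match the pair layout exactly, as if the token list `["16", "."]` had been split into two sequences.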
I have seen similar issues, but they don't indicate the version of transformers:

#2658
#3037

Thus, I'm not sure whether it is related to transformers version 2.6.0.
To reproduce
Steps to reproduce the behavior (simplified steps):
Tokenize the string `16.` (or `6.`) and pass the token list to `batch_encode_plus`:

```python
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

tokens = bert_tokenizer.tokenize("16.")
bert_tokenizer.batch_encode_plus([tokens])
```
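A quick consistency check along the lines of the failing test can diff the ids produced for the raw string against those produced for the pre-tokenized input (a standalone sketch; `first_mismatch` is a hypothetical helper, and the two id lists are copied from the issue, not recomputed):

```python
def first_mismatch(expected_ids, actual_ids):
    """Return the index of the first differing id, or None if the lists match."""
    for i, (e, a) in enumerate(zip(expected_ids, actual_ids)):
        if e != a:
            return i
    if len(expected_ids) != len(actual_ids):
        return min(len(expected_ids), len(actual_ids))
    return None

# ids quoted in the issue for the string "16.":
from_string = [101, 10250, 119, 102]        # encoding the raw string
from_tokens = [101, 10250, 102, 119, 102]   # encoding the pre-tokenized list

print(first_mismatch(from_string, from_tokens))  # 2 -- an unexpected [SEP] (102)
```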
You can reproduce the error with this code. The code will break at the test `test_tokens_vs_batch_list_tokens`, with the following summarized output:

Expected behavior
`batch_encode_plus` should always produce the same `input_ids`, no matter whether we pass it a list of tokens or a list of strings. For instance, for the string `16.` we should always get `[101, 10250, 119, 102]`. However, using `batch_encode_plus` we get `[101, 10250, 102, 119, 102]` if we pass it input that is already tokenized.

Environment info
`transformers` version: 2.6.0