
Bert Batch Encode Plus adding an extra [SEP] #3502

Closed
creat89 opened this issue Mar 28, 2020 · 6 comments · Fixed by #3517
creat89 commented Mar 28, 2020

🐛 Bug

Information

I'm using the bert-base-multilingual-cased tokenizer and model to build another model. However, batch_encode_plus is adding an extra [SEP] token id in the middle of the sequence.

The problem arises when using:

  • Specific strings to encode, e.g. 16., 3., 10.,
  • The bert-base-multilingual-cased tokenizer being used beforehand to tokenize those strings, and
  • batch_encode_plus being used to convert the tokenized strings to ids

In fact, batch_encode_plus will generate an input_ids list containing two [SEP] ids, such as [101, 10250, 102, 119, 102].

I have seen similar issues, but they don't indicate the version of transformers:

#2658
#3037

Thus, I'm not sure whether it is specific to transformers version 2.6.0.

To reproduce

Steps to reproduce the behavior (simplified):

  1. Have a string such as 16. or 6.
  2. Use tokens = bert_tokenizer.tokenize("16.")
  3. Use bert_tokenizer.batch_encode_plus([tokens])

You can reproduce the error with this code:

from transformers import BertTokenizer
import unittest

class TestListElements(unittest.TestCase):

    def setUp(self):

        bert_tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

        problematic_string = "16."

        tokens = bert_tokenizer.tokenize(problematic_string)

        self.encoded_batch_1 = bert_tokenizer.batch_encode_plus([tokens])              # batch: list of token lists
        self.encoded_batch_2 = bert_tokenizer.batch_encode_plus([problematic_string])  # batch: list of strings
        self.encoded_tokens_1 = bert_tokenizer.encode_plus(problematic_string)         # single string
        self.encoded_tokens_2 = bert_tokenizer.encode_plus(tokens)                     # single token list

    def test_tokens_vs_tokens(self):
        self.assertListEqual(self.encoded_tokens_1["input_ids"], self.encoded_tokens_2["input_ids"])

    def test_tokens_vs_batch_string(self):
        self.assertListEqual(self.encoded_tokens_1["input_ids"], self.encoded_batch_2["input_ids"][0])

    def test_tokens_vs_batch_list_tokens(self):
        self.assertListEqual(self.encoded_tokens_1["input_ids"], self.encoded_batch_1["input_ids"][0])

if __name__ == "__main__":
    unittest.main(verbosity=2)

The code will fail at test_tokens_vs_batch_list_tokens, with the following summarized output:

- [101, 10250, 119, 102]
+ [101, 10250, 102, 119, 102]
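
Decoding those ids back to tokens makes the stray [SEP] visible (a quick sketch using the same tokenizer; 101 and 102 are BERT's [CLS] and [SEP] ids):

print(bert_tokenizer.convert_ids_to_tokens([101, 10250, 102, 119, 102]))
# ['[CLS]', '16', '[SEP]', '.', '[SEP]'] -- an extra [SEP] between '16' and '.'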

Expected behavior

batch_encode_plus should always produce the same input_ids, regardless of whether we pass it a list of token lists or a list of strings.

For instance, for the string 16. we should always get [101, 10250, 119, 102]. However, batch_encode_plus returns [101, 10250, 102, 119, 102] if we pass it input that is already tokenized.
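
Stated as a single invariant (a sketch of the expected behavior, reusing bert_tokenizer from the test above):

ids_from_string = bert_tokenizer.batch_encode_plus(["16."])["input_ids"][0]
ids_from_tokens = bert_tokenizer.batch_encode_plus([bert_tokenizer.tokenize("16.")])["input_ids"][0]
assert ids_from_string == ids_from_tokens == [101, 10250, 119, 102]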

Environment info

  • transformers version: 2.6.0
  • Platform: Linux (Manjaro)
  • Python version: Python 3.8.1 (default, Jan 8 2020, 22:29:32)
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): ---
  • Using GPU in script?: False
  • Using distributed or parallel set-up in script?: False
patrickvonplaten self-assigned this Mar 28, 2020

patrickvonplaten commented Mar 28, 2020

Hi @creat89,

Thanks for posting this issue!

You are correct, there is some inconsistent behavior here.

  1. In general, we should probably not allow calling batch_encode_plus() on a simple string; the encode_plus() function should be used for that (see the sketch after this list).
  2. It seems like there is an inconsistency between encode_plus([string]) and encode_plus(string). This should probably be fixed.
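
A minimal sketch of that intended split (assuming the 2.x tokenizer API; bert_tokenizer as instantiated in the report above):

single = bert_tokenizer.encode_plus("16.")               # one string -> encode_plus
batch = bert_tokenizer.batch_encode_plus(["16.", "3."])  # list of strings -> batch_encode_plus
print(single["input_ids"])    # [101, 10250, 119, 102], per the report above
print(batch["input_ids"][0])  # should match the line above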


creat89 commented Mar 28, 2020

Well, the issue doesn't only happen with a simple string. In my actual code I was using a batch of size 2; I just used a simple example to demonstrate the issue.

I didn't find any inconsistency between encode_plus([string]) and encode_plus(string), but rather between batch_encode_plus([strings]) and batch_encode_plus([[tokens]]).


patrickvonplaten commented Mar 29, 2020

Sorry, I was skimming through your problem too quickly - I see what you mean now.
I will take a closer look at this.

@patrickvonplaten

Created a PR (#3517) that fixes this behavior. Thanks for pointing this out @creat89 :-)

creat89 closed this as completed Mar 29, 2020

patrickvonplaten commented Apr 7, 2020

There has been a big change in tokenizers recently :-) which adds an is_pretokenized flag to the input and makes everything much easier. It should then be used as follows:
bert_tokenizer.batch_encode_plus([tokens], is_pretokenized=True)
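
A minimal usage sketch, assuming a transformers version that ships this flag:

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
tokens = bert_tokenizer.tokenize("16.")  # ['16', '.'] per the ids reported above

# The flag tells the tokenizer the input is already tokenized, so the two-element
# token list is not misread as a (text, text_pair) tuple -- which is what
# produced the stray [SEP] in the original report.
encoded = bert_tokenizer.batch_encode_plus([tokens], is_pretokenized=True)
print(encoded["input_ids"])  # expected: [[101, 10250, 119, 102]]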


creat89 commented Apr 7, 2020

Cool, that's awesome and yes, I'm sure that makes everything easier. Cheers!
