
[Tokenizer] batch_encode_plus method cannot encode List[Tuple[str]] with is_pretokenized=True #5169

Closed
vjeronymo2 opened this issue Jun 21, 2020 · 2 comments · Fixed by #5184
Labels
Core: Tokenization Internals of the library; Tokenization.

Comments


vjeronymo2 commented Jun 21, 2020

🐛 Bug

Information

Model I am using: BERT

Language I am using the model on: English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on are:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Load a tokenizer
  2. Create a generic List[Tuple[List[int], List[int]]], for example [([2023, 2573], [2023, 2573, 2205])]
  3. Encode the list with batch_encode_plus and is_pretokenized=True so the pairs are encoded together
  4. Observe the error
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = [('This works', 'this works too')]
print(tokenizer.batch_encode_plus(text, add_special_tokens=False)['input_ids'])  # This works

input_ids = [([2023, 2573], [2023, 2573, 2205])]
tokenizer.batch_encode_plus(input_ids, is_pretokenized=True)  # This raises the error

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in get_input_ids(text)
   1715             else:
   1716                 raise ValueError(
-> 1717                     "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
   1718                 )
   1719 

ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

Expected behavior

batch_encode_plus should be able to encode List[Tuple[List[int], List[int]]], as described in its documentation; hence, the example's input_ids would be:
[[101, 2023, 2573, 102, 2023, 2573, 2205, 102]]
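
For reference, one way to get exactly that output from id lists is build_inputs_with_special_tokens, which joins two id sequences with the model's special tokens (a minimal sketch for a single pair, not a batch):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
ids_a, ids_b = [2023, 2573], [2023, 2573, 2205]
# Adds [CLS] (101) before, and [SEP] (102) between and after, the two id lists.
print(tokenizer.build_inputs_with_special_tokens(ids_a, ids_b))
# [101, 2023, 2573, 102, 2023, 2573, 2205, 102]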

Environment info

  • transformers version: 2.11.0
  • Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.5.0+cu101 (True)
  • Tensorflow version (GPU?): 2.2.0 (True)
  • Using GPU in script?: probably not, but irrelevant
  • Using distributed or parallel set-up in script?: probably not
patil-suraj (Contributor) commented

Hi @vjeronymo2, the encode_plus method expects a list of str, i.e. tokens, when is_pretokenized=True.
Tokens here are not ints; they are WordPiece token strings. If you want to convert a string to tokens, you can use the .tokenize method.
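
For reference, a minimal sketch of the usage described above, assuming a transformers 2.x tokenizer where encode_plus accepts is_pretokenized (as this comment indicates):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# .tokenize turns a string into WordPiece token strings, not ids.
tokens_a = tokenizer.tokenize('This works')      # ['this', 'works']
tokens_b = tokenizer.tokenize('this works too')  # ['this', 'works', 'too']
# Pass token strings, not ids, when is_pretokenized=True.
print(tokenizer.encode_plus(tokens_a, tokens_b, is_pretokenized=True)['input_ids'])
# Expected: [101, 2023, 2573, 102, 2023, 2573, 2205, 102]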

@sshleifer sshleifer added the Core: Tokenization Internals of the library; Tokenization. label Jun 22, 2020
@sshleifer sshleifer changed the title [Tokenizer] batch_encode_plus method cannot encode List[Tuple[List[int], List[int]]] [Tokenizer] batch_encode_plus method cannot encode List[Tuple[str]] with is_pretokenized=True Jun 22, 2020
vjeronymo2 (Author) commented

> Hi @vjeronymo2, the encode_plus method expects a list of str, i.e. tokens, when is_pretokenized=True.
> Tokens here are not ints; they are WordPiece token strings. If you want to convert a string to tokens, you can use the .tokenize method.

Thanks for clarifying that. I guess my biggest confusion was with the correct meaning of the is_pretokenized flag.
Have a nice day, good sir.
