
[Tokenizer] batch_encode_plus method cannot encode List[Tuple[str]] with is_pretokenized=True #5169

Closed
vjeronymo2 opened this issue Jun 21, 2020 · 2 comments · Fixed by #5184
Labels
Core: Tokenization Internals of the library; Tokenization.

Comments


vjeronymo2 commented Jun 21, 2020

🐛 Bug

Information

Model I am using: BERT

Language I am using the model on: English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on are:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Load a tokenizer
  2. Create a generic List[Tuple[List[int], List[int]]], for example [([2023, 2573], [2023, 2573, 2205])]
  3. Encode the list with batch_encode_plus and is_pretokenized=True so the pairs are encoded together
  4. Observe the error
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = [('This works', 'this works too')]
print(tokenizer.batch_encode_plus(text, add_special_tokens=False)['input_ids'])  # This works

input_ids = [([2023, 2573], [2023, 2573, 2205])]
tokenizer.batch_encode_plus(input_ids, is_pretokenized=True)  # This raises the error

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in get_input_ids(text)
   1715             else:
   1716                 raise ValueError(
-> 1717                     "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
   1718                 )
   1719 

ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

Expected behavior

batch_encode_plus should be able to encode List[Tuple[List[int], List[int]]], as described in its documentation; hence, the example's input_ids would be:
[[101, 2023, 2573, 102, 2023, 2573, 2205, 102]]
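
For reference, one way to get exactly that output from id lists is build_inputs_with_special_tokens, which joins two id sequences with the model's special tokens (a minimal sketch for a single pair, not a batch):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
ids_a, ids_b = [2023, 2573], [2023, 2573, 2205]
# Adds [CLS] (101) before, and [SEP] (102) between and after, the two id lists.
print(tokenizer.build_inputs_with_special_tokens(ids_a, ids_b))
# [101, 2023, 2573, 102, 2023, 2573, 2205, 102]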

Environment info

  • transformers version: 2.11.0
  • Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.5.0+cu101 (True)
  • Tensorflow version (GPU?): 2.2.0 (True)
  • Using GPU in script?: probably not, but irrelevant
  • Using distributed or parallel set-up in script?: probably not
patil-suraj (Contributor) commented

Hi @vjeronymo2, the encode_plus method expects a list of str, i.e. tokens, when is_pretokenized=True.
Tokens here are not ints; they are WordPiece token strings. If you want to convert a string to tokens, you can use the .tokenize method.
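
For reference, a minimal sketch of the usage described above, assuming a transformers 2.x tokenizer where encode_plus accepts is_pretokenized (as this comment indicates):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# .tokenize turns a string into WordPiece token strings, not ids.
tokens_a = tokenizer.tokenize('This works')      # ['this', 'works']
tokens_b = tokenizer.tokenize('this works too')  # ['this', 'works', 'too']
# Pass token strings, not ids, when is_pretokenized=True.
print(tokenizer.encode_plus(tokens_a, tokens_b, is_pretokenized=True)['input_ids'])
# Expected: [101, 2023, 2573, 102, 2023, 2573, 2205, 102]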

@sshleifer sshleifer added the Core: Tokenization Internals of the library; Tokenization. label Jun 22, 2020
@sshleifer sshleifer changed the title [Tokenizer] batch_encode_plus method cannot encode List[Tuple[List[int], List[int]]] [Tokenizer] batch_encode_plus method cannot encode List[Tuple[str]] with is_pretokenized=True Jun 22, 2020
vjeronymo2 (Author) commented

> Hi @vjeronymo2, the encode_plus method expects a list of str, i.e. tokens, when is_pretokenized=True.
> Tokens here are not ints; they are WordPiece token strings. If you want to convert a string to tokens, you can use the .tokenize method.

Thanks for clarifying that. I guess my biggest confusion was with the correct meaning of the is_pretokenized flag.
Have a nice day, good sir.
