
Tokenizer Fast bug: ValueError: TextInputSequence must be str #7735

Closed
mariusjohan opened this issue Oct 12, 2020 · 6 comments

Comments

@mariusjohan

Environment info

  • transformers version:
  • Platform: In a Colab environment as well as on my local Windows machine
  • Python version: 3.7.4
  • PyTorch version (GPU?): Yes and No
  • Tensorflow version (GPU?): I didn't try with TensorFlow, but I suspect it has nothing to do with it
  • Using GPU in script?: I used the automodeling on a GPU session in Colab
  • Using distributed or parallel set-up in script?: Nope

Who can help

@mfuntowicz

Information

Model I am using: Initially Electra but I tested it out with BERT, DistilBERT and RoBERTa

I'm using your scripts, but again, I believe it wouldn't work if I wrote it myself either. The model is trained on SQuAD.

Error traceback

"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py", line 165, in squad_convert_example_to_features
    return_token_type_ids=True,
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py", line 2050, in encode_plus
    **kwargs,
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_fast.py", line 473, in _encode_plus
    **kwargs,
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_fast.py", line 376, in _batch_encode_plus
    is_pretokenized=is_split_into_words,
  File "/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py", line 212, in encode
    return self._tokenizer.encode(sequence, pair, is_pretokenized, add_special_tokens)
ValueError: TextInputSequence must be str
"""

To reproduce

Steps to reproduce the behavior:

  1. Download model and tokenizer (fast)
  2. Test it out with the transformers pipeline for a question answering task

I've also made a small notebook so you can test it out for yourself here.
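For reference, the steps above amount to something like the following sketch (the checkpoint name is only an example, not the one from the notebook; with the transformers version current at the time, the final call raised the ValueError shown above):

```python
from transformers import pipeline

# Any SQuAD-finetuned checkpoint reproduces the setup; this name is just an example.
qa = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
    tokenizer="distilbert-base-cased-distilled-squad",
)

# With a fast tokenizer and the transformers version in use at the time,
# this call raised "ValueError: TextInputSequence must be str".
result = qa(
    question="Where is the Eiffel Tower?",
    context="The Eiffel Tower is in Paris.",
)
```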

Expected behavior

Instead of raising an error, I would expect the tokenizer to encode the input successfully.

@LysandreJik
Member

Hi, thanks for opening such a detailed issue with a notebook!

Unfortunately, fast tokenizers don’t currently work with the QA pipeline. They will in the second pipeline version which is expected in a few weeks to a few months, but right now please use the slow tokenizers for the QA pipeline.

Thanks!

@zhiqihuang

I think the issue is still there.

@LysandreJik
Member

Please open a new issue with your environment, an example of what the issue is and how you expect it to work. Thank you.

@frederico-klein

frederico-klein commented Dec 22, 2020

> Hi, thanks for opening such a detailed issue with a notebook!
>
> Unfortunately, fast tokenizers don’t currently work with the QA pipeline. They will in the second pipeline version which is expected in a few weeks to a few months, but right now please use the slow tokenizers for the QA pipeline.
>
> Thanks!

And how do I do that? I don't understand the difference between slow and fast tokenizers. Do I need to train my tokenizer again, or can I just somehow "cast" the fast version into the slow one?

I could fix this simply by changing:

from transformers import RobertaTokenizerFast 
tokenizer = RobertaTokenizerFast

to:

from transformers import RobertaTokenizer 
tokenizer = RobertaTokenizer
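Equivalently, a sketch of the same fix via AutoTokenizer (the checkpoint name here is just an example): passing use_fast=False selects the slow, pure-Python tokenizer class, so no retraining is needed, because both implementations read the same vocabulary files.

```python
from transformers import AutoTokenizer

# use_fast=False loads the slow tokenizer (e.g. RobertaTokenizer instead
# of RobertaTokenizerFast) from the same pretrained vocabulary files.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=False)
print(type(tokenizer).__name__)
```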

@CSworkspace

I also ran into this problem when using transformers. I checked my data and found that this error is returned when the csv file contains many null values or zero-length strings. After filtering out those rows, I can run my code successfully.
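A minimal pre-filtering step along those lines (pure Python, the sample data is made up for illustration) drops anything that is not a genuine, non-empty string before it reaches the tokenizer:

```python
def clean_texts(texts):
    """Keep only genuine, non-empty strings.

    The fast tokenizers' Rust backend raises
    "ValueError: TextInputSequence must be str" for None/NaN inputs,
    so anything that is not a non-empty str is dropped here.
    """
    cleaned = []
    for t in texts:
        if isinstance(t, str) and t.strip():
            cleaned.append(t)
    return cleaned

rows = ["a question", None, "", float("nan"), "another question"]
print(clean_texts(rows))  # -> ['a question', 'another question']
```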

@tonight-is-you

Double-check the data and make sure there are no NaN values in it; that was the problem I encountered.

6 participants