
Tokenizer Fast bug: ValueError: TextInputSequence must be str #7735

Closed
mariusjohan opened this issue Oct 12, 2020 · 6 comments

Comments

@mariusjohan

Environment info

  • transformers version:
  • Platform: In a Colab environment as well as on my local Windows machine
  • Python version: 3.7.4
  • PyTorch version (GPU?): Yes and No
  • Tensorflow version (GPU?): I didn't try with TensorFlow, but I suspect it has nothing to do with it
  • Using GPU in script?: I used the automodeling on a GPU session in Colab
  • Using distributed or parallel set-up in script?: Nope

Who can help

@mfuntowicz

Information

Model I am using: Initially Electra but I tested it out with BERT, DistilBERT and RoBERTa

I'm using your scripts, but again, I believe it wouldn't work if I wrote it myself either. The model is trained on SQuAD.

Error traceback

"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py", line 165, in squad_convert_example_to_features
    return_token_type_ids=True,
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py", line 2050, in encode_plus
    **kwargs,
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_fast.py", line 473, in _encode_plus
    **kwargs,
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_fast.py", line 376, in _batch_encode_plus
    is_pretokenized=is_split_into_words,
  File "/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py", line 212, in encode
    return self._tokenizer.encode(sequence, pair, is_pretokenized, add_special_tokens)
ValueError: TextInputSequence must be str
"""

To reproduce

Steps to reproduce the behavior:

  1. Download model and tokenizer (fast)
  2. Test it out with the transformers pipeline for a question answering task

I've also made a small notebook so you can test it out for yourself here.
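For reference, the steps above amount to something like the following sketch (the checkpoint name is only an example, not the one from the notebook; with the transformers version current at the time, the final call raised the ValueError shown above):

```python
from transformers import pipeline

# Any SQuAD-finetuned checkpoint reproduces the setup; this name is just an example.
qa = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
    tokenizer="distilbert-base-cased-distilled-squad",
)

# With a fast tokenizer and the transformers version in use at the time,
# this call raised "ValueError: TextInputSequence must be str".
result = qa(
    question="Where is the Eiffel Tower?",
    context="The Eiffel Tower is in Paris.",
)
```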

Expected behavior

Instead of raising an error, I would expect the tokenizer to encode the input successfully.

@LysandreJik
Member

Hi, thanks for opening such a detailed issue with a notebook!

Unfortunately, fast tokenizers don’t currently work with the QA pipeline. They will in the second pipeline version which is expected in a few weeks to a few months, but right now please use the slow tokenizers for the QA pipeline.

Thanks!

@zhiqihuang

I think the issue is still there.

@LysandreJik
Member

Please open a new issue with your environment, an example of what the issue is and how you expect it to work. Thank you.

@frederico-klein

frederico-klein commented Dec 22, 2020

> Hi, thanks for opening such a detailed issue with a notebook!
>
> Unfortunately, fast tokenizers don’t currently work with the QA pipeline. They will in the second pipeline version which is expected in a few weeks to a few months, but right now please use the slow tokenizers for the QA pipeline.
>
> Thanks!

And how do I do that? I don't understand the difference between slow and fast tokenizers. Do I need to train my tokenizer again, or can I just somehow "cast" the fast version into the slow one?

I could fix this simply by changing:

from transformers import RobertaTokenizerFast 
tokenizer = RobertaTokenizerFast

to:

from transformers import RobertaTokenizer 
tokenizer = RobertaTokenizer
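Equivalently, a sketch of the same fix via AutoTokenizer (the checkpoint name here is just an example): passing use_fast=False selects the slow, pure-Python tokenizer class, so no retraining is needed, because both implementations read the same vocabulary files.

```python
from transformers import AutoTokenizer

# use_fast=False loads the slow tokenizer (e.g. RobertaTokenizer instead
# of RobertaTokenizerFast) from the same pretrained vocabulary files.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=False)
print(type(tokenizer).__name__)
```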

@CSworkspace

I also ran into this problem when using transformers. I checked my data and found that this error is returned when the csv file contains many null values or zero-length strings. After filtering out those rows, I can run my code successfully.
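A minimal pre-filtering step along those lines (pure Python, the sample data is made up for illustration) drops anything that is not a genuine, non-empty string before it reaches the tokenizer:

```python
def clean_texts(texts):
    """Keep only genuine, non-empty strings.

    The fast tokenizers' Rust backend raises
    "ValueError: TextInputSequence must be str" for None/NaN inputs,
    so anything that is not a non-empty str is dropped here.
    """
    cleaned = []
    for t in texts:
        if isinstance(t, str) and t.strip():
            cleaned.append(t)
    return cleaned

rows = ["a question", None, "", float("nan"), "another question"]
print(clean_texts(rows))  # -> ['a question', 'another question']
```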

@tonight-is-you

Double-check the data and make sure there are no NaN values in it; that was the problem I encountered.

6 participants