QA pipeline fails during convert_squad_examples_to_features #8787

Closed
2 of 4 tasks
TrupeshKumarPatel opened this issue Nov 25, 2020 · 7 comments
TrupeshKumarPatel commented Nov 25, 2020

Environment info

  • transformers version: 4.0.0-rc-1
  • Platform: Linux-3.10.0-1062.9.1.el7.x86_64-x86_64-with-redhat-7.8-Maipo
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.7.0 (True)
  • Tensorflow version (GPU?): 2.3.1 (True)
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help

@LysandreJik @mfuntowicz

I'm not sure who else can help, but in short, I am looking for someone who can help me with QA tasks.

Information

Model I am using (Bert, XLNet ...): Bert

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: run_squad.py (modifying to run using jupyter notebook, using "HfArgumentParser")

The tasks I am working on is:

  • an official GLUE/SQuAD task: SQuAD
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. replaced all argparse usage with HfArgumentParser
  2. created a "ModelArguments" dataclass for HfArgumentParser (Ref: https://github.com/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb)
  3. made a few small changes throughout the script.
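Step 1 above swaps argparse for HfArgumentParser, which builds a command-line parser directly from dataclass fields. A minimal stdlib-only approximation of that idea is sketched below (the field names and defaults are illustrative, not the actual ModelArguments from the notebook, and real HfArgumentParser handles many more cases):

```python
import argparse
from dataclasses import dataclass, fields

@dataclass
class ModelArguments:
    # Illustrative fields; the real notebook's dataclass defines many more.
    model_name_or_path: str = "bert-base-uncased"
    max_seq_length: int = 384
    doc_stride: int = 128

def parse_dataclass(cls, argv):
    # Build one --flag per dataclass field, typed from its annotation,
    # mimicking what HfArgumentParser does under the hood.
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return cls(**vars(parser.parse_args(argv)))

# In a notebook you pass an explicit argv instead of reading sys.argv:
args = parse_dataclass(ModelArguments, ["--max_seq_length", "512"])
```

Passing an explicit argv list (rather than letting argparse read sys.argv) is what makes this pattern usable inside Jupyter, where sys.argv contains kernel flags.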

The test fails with error TypeError: TextInputSequence must be str

Complete failure result:

RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 175, in squad_convert_example_to_features
    return_token_type_ids=True,
  File "/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2439, in encode_plus
    **kwargs,
  File "/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 463, in _encode_plus
    **kwargs,
  File "/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 378, in _batch_encode_plus
    is_pretokenized=is_split_into_words,
TypeError: TextInputSequence must be str
"""

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-19-263240bbee7e> in <module>
----> 1 main()

<ipython-input-18-61d7f0eab618> in main()
    111     # Training
    112     if train_args.do_train:
--> 113         train_dataset = load_and_cache_examples((model_args, train_args), tokenizer, evaluate=False, output_examples=False)
    114         global_step, tr_loss = train(args, train_dataset, model, tokenizer)
    115         logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

<ipython-input-8-79eb3ed364c2> in load_and_cache_examples(args, tokenizer, evaluate, output_examples)
     54             max_query_length=model_args.max_query_length,
     55             is_training=not evaluate,
---> 56             return_dataset="pt",
     57 #             threads=model_args.threads,
     58         )

/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/site-packages/transformers/data/processors/squad.py in squad_convert_examples_to_features(examples, tokenizer, max_seq_length, doc_stride, max_query_length, is_training, padding_strategy, return_dataset, threads, tqdm_enabled)
    366                 total=len(examples),
    367                 desc="convert squad examples to features",
--> 368                 disable=not tqdm_enabled,
    369             )
    370         )

/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/site-packages/tqdm/std.py in __iter__(self)
   1131 
   1132         try:
-> 1133             for obj in iterable:
   1134                 yield obj
   1135                 # Update and possibly print the progressbar.

/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/multiprocessing/pool.py in <genexpr>(.0)
    323                     result._set_length
    324                 ))
--> 325             return (item for chunk in result for item in chunk)
    326 
    327     def imap_unordered(self, func, iterable, chunksize=1):

/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/multiprocessing/pool.py in next(self, timeout)
    746         if success:
    747             return value
--> 748         raise value
    749 
    750     __next__ = next                    # XXX

TypeError: TextInputSequence must be str

Expected behavior

For more details, see: https://github.com/uabinf/nlp-group-project-fall-2020-deepbiocomp/blob/cancer_ask/scripts/qa_script/qa_squad_v1.ipynb

@TrupeshKumarPatel TrupeshKumarPatel changed the title QA pipeline fails QA pipeline fails during convert_squad_examples_to_features Nov 25, 2020
@TrupeshKumarPatel (Author)

After updating the run_squad.py script with a newer version of transformers, it works now!

Thank you!

@aleSuglia (Contributor)

@TrupeshKumarPatel It seems this is still not working. What was the actual solution?

@TrupeshKumarPatel (Author)

Hi @aleSuglia,
here is the updated link: https://github.com/uabinf/nlp-group-project-fall-2020-deepbiocomp/blob/main/scripts/qa_script/qa_squad_v1.ipynb . See if this helps; if not, please elaborate on the error or problem you are facing.

@aleSuglia (Contributor) commented Feb 15, 2021

I have exactly the same error that you reported: TypeError: TextInputSequence must be str
By debugging, I can see that the variable truncated_query contains a list of integers (which should be the current question's token ids). However, when you pass that to the encode_plus method, you get the error. I guess it's because encode_plus expects strings, not integers. Do you have any suggestions?
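The type mismatch described above can be illustrated with a minimal stand-in for the fast tokenizer's input validation (this is a sketch of the contract, not transformers' actual code; the token ids are made-up examples):

```python
def encode_plus_fast(text):
    # Simplified stand-in for the fast tokenizer's input check: the
    # Rust-backed tokenizers layer only accepts strings (or pairs of
    # strings), whereas the slow Python tokenizer also accepted lists
    # of already-converted token ids.
    if not isinstance(text, str):
        raise TypeError("TextInputSequence must be str")
    return text.split()  # placeholder for real subword encoding

truncated_query = [2054, 2003, 1996]  # a list of token ids, as seen while debugging
try:
    encode_plus_fast(truncated_query)
except TypeError as e:
    print(e)  # TextInputSequence must be str

encode_plus_fast("what is the")  # a plain string is accepted
```

This is why the same call path works with a slow tokenizer but fails with a fast one: the slow path tolerated pre-tokenized id lists, while the fast path enforces the string-only contract.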

@aleSuglia (Contributor) commented Feb 15, 2021

If you googled this error and you are reading this post, please do the following. When you create your tokenizer, make sure you set the use_fast flag to False, like this:

AutoTokenizer.from_pretrained(tokenizer_name, use_fast=False)

This fixes the error. However, I wonder why there is no backward compatibility...

@juice500ml (Contributor)

I had a similar issue. What @aleSuglia suggested indeed works, but the underlying issue persists; the fast version of the tokenizer should be compatible with the previous methods. In my case, I narrowed the problem down to InputExample, where text_b can be None,

text_b: Optional[str] = None
label: Optional[str] = None

but the tokenizer apparently doesn't accept None as an input. So, I found a workaround by changing

InputExample(guid=some_id, text_a=some_text, label=some_label)
-> InputExample(guid=some_id, text_a=some_text, text_b='', label=some_label)

I'm not sure this completely solves the issue though.
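The workaround above can be sketched with a minimal stand-in mirroring the fields of transformers' InputExample dataclass (the guard function and its name are illustrative, not library code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InputExample:
    # Field names mirror transformers' InputExample dataclass.
    guid: str
    text_a: str
    text_b: Optional[str] = None
    label: Optional[str] = None

def to_tokenizer_inputs(example):
    # Fast tokenizers reject None as a text input, so substitute an
    # empty string for a missing second segment before encoding --
    # the same effect as passing text_b='' at construction time.
    text_b = example.text_b if example.text_b is not None else ""
    return (example.text_a, text_b)

ex = InputExample(guid="id-0", text_a="What is BERT?")
pair = to_tokenizer_inputs(ex)  # ("What is BERT?", "")
```

Normalizing at encoding time (rather than at every InputExample construction site) keeps the workaround in one place, though as noted it papers over rather than fixes the compatibility gap.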

@juice500ml (Contributor)

Potentially related issues: #6545 #7735 #7011
