QA pipeline fails during convert_squad_examples_to_features #8787

Closed
2 of 4 tasks
TrupeshKumarPatel opened this issue Nov 25, 2020 · 7 comments
TrupeshKumarPatel commented Nov 25, 2020

Environment info

  • transformers version: 4.0.0-rc-1
  • Platform: Linux-3.10.0-1062.9.1.el7.x86_64-x86_64-with-redhat-7.8-Maipo
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.7.0 (True)
  • Tensorflow version (GPU?): 2.3.1 (True)
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help

@LysandreJik @mfuntowicz

I'm not sure who else can help, but in short, I am looking for someone who can help me with QA tasks.

Information

Model I am using (Bert, XLNet ...): Bert

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: run_squad.py (modifying to run using jupyter notebook, using "HfArgumentParser")

The tasks I am working on is:

  • an official GLUE/SQuAD task: SQuAD
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. replaced all argparse usage with HfArgumentParser
  2. created a "ModelArguments" dataclass for HfArgumentParser (Ref: https://github.com/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb)
  3. made a few small changes throughout the script.
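Step 1 above swaps argparse for HfArgumentParser, which builds a command-line parser directly from dataclass fields. A minimal stdlib-only approximation of that idea is sketched below (the field names and defaults are illustrative, not the actual ModelArguments from the notebook, and real HfArgumentParser handles many more cases):

```python
import argparse
from dataclasses import dataclass, fields

@dataclass
class ModelArguments:
    # Illustrative fields; the real notebook's dataclass defines many more.
    model_name_or_path: str = "bert-base-uncased"
    max_seq_length: int = 384
    doc_stride: int = 128

def parse_dataclass(cls, argv):
    # Build one --flag per dataclass field, typed from its annotation,
    # mimicking what HfArgumentParser does under the hood.
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return cls(**vars(parser.parse_args(argv)))

# In a notebook you pass an explicit argv instead of reading sys.argv:
args = parse_dataclass(ModelArguments, ["--max_seq_length", "512"])
```

Passing an explicit argv list (rather than letting argparse read sys.argv) is what makes this pattern usable inside Jupyter, where sys.argv contains kernel flags.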

The test fails with error TypeError: TextInputSequence must be str

Complete failure result:

RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 175, in squad_convert_example_to_features
    return_token_type_ids=True,
  File "/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2439, in encode_plus
    **kwargs,
  File "/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 463, in _encode_plus
    **kwargs,
  File "/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 378, in _batch_encode_plus
    is_pretokenized=is_split_into_words,
TypeError: TextInputSequence must be str
"""

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-19-263240bbee7e> in <module>
----> 1 main()

<ipython-input-18-61d7f0eab618> in main()
    111     # Training
    112     if train_args.do_train:
--> 113         train_dataset = load_and_cache_examples((model_args, train_args), tokenizer, evaluate=False, output_examples=False)
    114         global_step, tr_loss = train(args, train_dataset, model, tokenizer)
    115         logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

<ipython-input-8-79eb3ed364c2> in load_and_cache_examples(args, tokenizer, evaluate, output_examples)
     54             max_query_length=model_args.max_query_length,
     55             is_training=not evaluate,
---> 56             return_dataset="pt",
     57 #             threads=model_args.threads,
     58         )

/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/site-packages/transformers/data/processors/squad.py in squad_convert_examples_to_features(examples, tokenizer, max_seq_length, doc_stride, max_query_length, is_training, padding_strategy, return_dataset, threads, tqdm_enabled)
    366                 total=len(examples),
    367                 desc="convert squad examples to features",
--> 368                 disable=not tqdm_enabled,
    369             )
    370         )

/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/site-packages/tqdm/std.py in __iter__(self)
   1131 
   1132         try:
-> 1133             for obj in iterable:
   1134                 yield obj
   1135                 # Update and possibly print the progressbar.

/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/multiprocessing/pool.py in <genexpr>(.0)
    323                     result._set_length
    324                 ))
--> 325             return (item for chunk in result for item in chunk)
    326 
    327     def imap_unordered(self, func, iterable, chunksize=1):

/data/user/tr27p/.conda/envs/DeepBioComp/lib/python3.7/multiprocessing/pool.py in next(self, timeout)
    746         if success:
    747             return value
--> 748         raise value
    749 
    750     __next__ = next                    # XXX

TypeError: TextInputSequence must be str

Expected behavior

For more details, see: https://github.com/uabinf/nlp-group-project-fall-2020-deepbiocomp/blob/cancer_ask/scripts/qa_script/qa_squad_v1.ipynb

@TrupeshKumarPatel TrupeshKumarPatel changed the title QA pipeline fails QA pipeline fails during convert_squad_examples_to_features Nov 25, 2020
@TrupeshKumarPatel (Author)

After updating the run_squad.py script with a newer version of transformers, it works now!

Thank you!

@aleSuglia (Contributor)

@TrupeshKumarPatel It seems this is still not working. What was the actual solution?

@TrupeshKumarPatel (Author)

Hi @aleSuglia,
here is the updated link: https://github.com/uabinf/nlp-group-project-fall-2020-deepbiocomp/blob/main/scripts/qa_script/qa_squad_v1.ipynb . See if this helps; if not, please elaborate on the error or problem you are facing.

@aleSuglia (Contributor) commented Feb 15, 2021

I have exactly the same error that you reported: TypeError: TextInputSequence must be str
By debugging, I can see that the variable truncated_query contains a list of integers (which should be the current question's token ids). However, when you pass that to the encode_plus method, you get the error. I guess it's because encode_plus expects strings, not integers. Do you have any suggestions?
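The type mismatch described above can be illustrated with a minimal stand-in for the fast tokenizer's input validation (this is a sketch of the contract, not transformers' actual code; the token ids are made-up examples):

```python
def encode_plus_fast(text):
    # Simplified stand-in for the fast tokenizer's input check: the
    # Rust-backed tokenizers layer only accepts strings (or pairs of
    # strings), whereas the slow Python tokenizer also accepted lists
    # of already-converted token ids.
    if not isinstance(text, str):
        raise TypeError("TextInputSequence must be str")
    return text.split()  # placeholder for real subword encoding

truncated_query = [2054, 2003, 1996]  # a list of token ids, as seen while debugging
try:
    encode_plus_fast(truncated_query)
except TypeError as e:
    print(e)  # TextInputSequence must be str

encode_plus_fast("what is the")  # a plain string is accepted
```

This is why the same call path works with a slow tokenizer but fails with a fast one: the slow path tolerated pre-tokenized id lists, while the fast path enforces the string-only contract.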

@aleSuglia (Contributor) commented Feb 15, 2021

If you googled this error and you are reading this post, please do the following. When you create your tokenizer, make sure you set the use_fast flag to False, like this:

AutoTokenizer.from_pretrained(tokenizer_name, use_fast=False)

This fixes the error. However, I wonder why there is no backward compatibility...

@juice500ml (Contributor)

I had a similar issue. What @aleSuglia suggested indeed works, but the underlying issue persists; the fast version of the tokenizer should be compatible with the previous methods. In my case, I narrowed the problem down to InputExample, where text_b can be None,

text_b: Optional[str] = None
label: Optional[str] = None

but the tokenizer apparently doesn't accept None as an input. So, I found a workaround by changing

InputExample(guid=some_id, text_a=some_text, label=some_label)
-> InputExample(guid=some_id, text_a=some_text, text_b='', label=some_label)

I'm not sure this completely solves the issue though.
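The workaround above can be sketched with a minimal stand-in mirroring the fields of transformers' InputExample dataclass (the guard function and its name are illustrative, not library code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InputExample:
    # Field names mirror transformers' InputExample dataclass.
    guid: str
    text_a: str
    text_b: Optional[str] = None
    label: Optional[str] = None

def to_tokenizer_inputs(example):
    # Fast tokenizers reject None as a text input, so substitute an
    # empty string for a missing second segment before encoding --
    # the same effect as passing text_b='' at construction time.
    text_b = example.text_b if example.text_b is not None else ""
    return (example.text_a, text_b)

ex = InputExample(guid="id-0", text_a="What is BERT?")
pair = to_tokenizer_inputs(ex)  # ("What is BERT?", "")
```

Normalizing at encoding time (rather than at every InputExample construction site) keeps the workaround in one place, though as noted it papers over rather than fixes the compatibility gap.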

@juice500ml (Contributor)

Potentially related issues: #6545 #7735 #7011
