
Pipeline fails with IndexError when using a Bert model with output_hidden_states and batch size >= 16 #14414

Closed
2 of 4 tasks
alwayscurious opened this issue Nov 16, 2021 · 3 comments · Fixed by #14420



alwayscurious commented Nov 16, 2021

Environment info

  • transformers version: 4.13.0.dev0
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.12
  • PyTorch version (GPU?): 1.10.0+cu111 (True)
  • Tensorflow version (GPU?): 2.7.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@Narsil


Information

Model I am using (Bert, XLNet ...): FinBert

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

I came across this issue when setting output_hidden_states=True during instantiation of a pretrained model, in order to obtain the inferred [CLS] sentence embeddings using the following approach:

from transformers.pipelines.text_classification import TextClassificationPipeline
class FinBertSentimentClassificationPipeline(TextClassificationPipeline):
  def postprocess(self, model_outputs, function_to_apply=None, return_all_scores=False):
    prediction = super().postprocess(model_outputs, function_to_apply, return_all_scores)
    # attach the requested hidden-state slice to the prediction dict
    prediction['last_hidden_layer'] = model_outputs.hidden_states[0][0][0]
    return prediction


def custom_pipeline(task, model, tokenizer, **kwargs):
  kwargs['tokenizer'] = tokenizer
  kwargs['feature_extractor'] = None
  return FinBertSentimentClassificationPipeline(model=model, framework='pt', task=task, **kwargs)

Here I override the postprocess function to also return the last hidden state layer. However, the issue occurs in the pipeline code itself when the batch size is 16 or larger.

To reproduce

Steps to reproduce the behavior:

  1. Install transformers from source:

!pip install git+https://github.com/huggingface/transformers.git

  2. Load the model and build the pipeline:

from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import pipeline

finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone', num_labels=3, output_hidden_states=True)
tokenizer = BertTokenizerFast.from_pretrained('yiyanghkust/finbert-tone')

nlp = pipeline("sentiment-analysis", model=finbert, tokenizer=tokenizer, device=0)

varying_length_sentences = ["there is a shortage of capital, and we need extra financing " * 5,
                            "growth is strong and we have plenty of liquidity ",
                            "there are doubts about our finances" * 10,
                            "profits are flat",
                            "profits are flat " * 30] * 1000
similar_length_sentences = ["there is a shortage",
                            "growth is strong ",
                            "there are doubts",
                            "profits are flat"] * 1000

  3. Run the pipeline with a batch size of 16:

results = nlp(similar_length_sentences, batch_size=16, num_workers=2)

See the Colab Notebook for reference.

Running step 3 produces the following error:

/usr/local/lib/python3.7/dist-packages/transformers/pipelines/base.py in loader_batch_item(self)
    750                     if k == "past_key_values":
    751                         continue
--> 752                     if isinstance(element[self._loader_batch_index], torch.Tensor):
    753                         loader_batched[k] = element[self._loader_batch_index].unsqueeze(0)
    754                     elif isinstance(element[self._loader_batch_index], np.ndarray):

IndexError: tuple index out of range

Executing the pipeline with batch sizes smaller than 16 seems to work (see the Colab notebook).
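The error is consistent with the shape of hidden_states: it is a tuple with one entry per layer, not a tensor batched on dim 0, so indexing the container with the in-batch index walks over layers and runs off the end of the tuple. A minimal sketch of the mechanism (the 13-entry tuple assumes a BERT-base style model: embeddings plus 12 transformer layers; the string stand-ins just replace the per-layer tensors):

```python
# Why the unbatching loop fails: hidden_states is a tuple with one entry
# per layer (13 for a BERT-base model), each entry a tensor of shape
# (batch, seq_len, hidden). Indexing the tuple with the in-batch index
# selects layers, not samples, and fails once the index reaches len(tuple).
num_layers = 13
batch_size = 16
hidden_states = tuple(f"layer_{n}_tensor" for n in range(num_layers))  # stand-ins

for i in range(batch_size):
    try:
        _ = hidden_states[i]          # what loader_batch_item effectively does
    except IndexError as exc:
        print(f"i={i}: {exc}")        # i=13: tuple index out of range
        break
```

This also matches the observation that small batch sizes appear to work: the tuple indexing only raises once the batch index exceeds the number of layer entries.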

Expected behavior

The pipeline runs successfully with any batch size when using a model loaded to output hidden states and attentions.


Narsil commented Nov 16, 2021

Hi @alwayscurious ,

Yes, the current system for automatic batching/unbatching doesn't support hidden_states or attentions.
I opened a PR. Currently it explicitly checks for specific keys to handle these tuples of tensors, since they are not the norm for model outputs.
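For reference, a minimal sketch of what correct unbatching for these outputs has to do (an illustration under the assumptions above, not the PR's actual code; `unbatch_tuple_output` is a hypothetical helper name): slice each per-layer tensor along the batch dimension and preserve the tuple structure:

```python
import torch

def unbatch_tuple_output(element, i):
    # Hypothetical helper, not the PR's code: for tuple-of-tensor outputs
    # such as hidden_states/attentions, unbatch by slicing each per-layer
    # tensor on dim 0 instead of indexing the tuple itself.
    if isinstance(element, tuple):
        return tuple(layer[i].unsqueeze(0) for layer in element)
    return element[i].unsqueeze(0)

# 13 layer entries, batch of 16, seq_len 8, hidden size 4
hidden_states = tuple(torch.randn(16, 8, 4) for _ in range(13))
item = unbatch_tuple_output(hidden_states, 0)
print(len(item), item[0].shape)   # 13 torch.Size([1, 8, 4])
```

Each unbatched item keeps the tuple-of-layers structure the caller expects, with a batch dimension of 1 on every layer tensor.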

@alwayscurious

Hi @Narsil

Thanks for adding a fix to support this. Once the PR is merged to master I'll check that it works successfully!


Narsil commented Nov 17, 2021

@alwayscurious by the way, don't use this with batch_size < 16 before the fix; it's just incorrect: you will receive the first layer's hidden states (for the full batch) as your first item, the second layer's as your second item, and so on.
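To illustrate the warning: with a batch size below the number of layer entries, the naive tuple indexing does not raise, but it silently returns whole layers instead of per-sample slices. A small sketch (shapes are arbitrary illustrative values):

```python
import torch

batch, seq, hidden = 4, 8, 6                   # batch_size < number of layer entries
hidden_states = tuple(torch.randn(batch, seq, hidden) for _ in range(13))

# Naive unbatching: "item 0" is really layer 0 for the WHOLE batch ...
wrong_item0 = hidden_states[0]
print(wrong_item0.shape)                        # torch.Size([4, 8, 6])

# ... whereas sample 0's hidden states are a slice of every layer:
right_item0 = tuple(layer[0] for layer in hidden_states)
print(len(right_item0), right_item0[0].shape)   # 13 torch.Size([8, 6])
```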
