
Pipeline fails with IndexError when using a Bert model with output_hidden_states and batch size >= 16 #14414

Closed
2 of 4 tasks
alwayscurious opened this issue Nov 16, 2021 · 3 comments · Fixed by #14420



alwayscurious commented Nov 16, 2021

Environment info

  • transformers version: 4.13.0.dev0
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.12
  • PyTorch version (GPU?): 1.10.0+cu111 (True)
  • Tensorflow version (GPU?): 2.7.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@Narsil


Information

Model I am using (Bert, XLNet ...): FinBert

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

I came across this issue when setting output_hidden_states=True during instantiation of a pretrained model, in order to obtain the inferred [CLS] sentence embeddings using the following approach:

from transformers.pipelines.text_classification import TextClassificationPipeline
class FinBertSentimentClassificationPipeline(TextClassificationPipeline):
  def postprocess(self, model_outputs, function_to_apply=None, return_all_scores=False):
    prediction = super().postprocess(model_outputs, function_to_apply, return_all_scores)
    # attach the requested hidden-state slice to the prediction dict
    prediction['last_hidden_layer'] = model_outputs.hidden_states[0][0][0]
    return prediction


def custom_pipeline(task, model, tokenizer, **kwargs):
  kwargs['tokenizer'] = tokenizer
  kwargs['feature_extractor'] = None
  return FinBertSentimentClassificationPipeline(model=model, framework='pt', task=task, **kwargs)

Here I override the postprocess function to also return the last hidden state layer. However, the issue occurs in the pipeline code itself when the batch size is 16 or larger.

To reproduce

Steps to reproduce the behavior:

  1. Install transformers from source:

!pip install git+https://github.com/huggingface/transformers.git

  2. Load the model and build the pipeline:

from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import pipeline

finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone', num_labels=3, output_hidden_states=True)
tokenizer = BertTokenizerFast.from_pretrained('yiyanghkust/finbert-tone')

nlp = pipeline("sentiment-analysis", model=finbert, tokenizer=tokenizer, device=0)

varying_length_sentences = ["there is a shortage of capital, and we need extra financing " * 5,
                            "growth is strong and we have plenty of liquidity ",
                            "there are doubts about our finances" * 10,
                            "profits are flat",
                            "profits are flat " * 30] * 1000
similar_length_sentences = ["there is a shortage",
                            "growth is strong ",
                            "there are doubts",
                            "profits are flat"] * 1000

  3. Run the pipeline with a batch size of 16:

results = nlp(similar_length_sentences, batch_size=16, num_workers=2)

See the Colab Notebook for reference.

Running step 3 produces the following error:

/usr/local/lib/python3.7/dist-packages/transformers/pipelines/base.py in loader_batch_item(self)
    750                     if k == "past_key_values":
    751                         continue
--> 752                     if isinstance(element[self._loader_batch_index], torch.Tensor):
    753                         loader_batched[k] = element[self._loader_batch_index].unsqueeze(0)
    754                     elif isinstance(element[self._loader_batch_index], np.ndarray):

IndexError: tuple index out of range

Executing the pipeline with batch sizes smaller than 16 seems to work (see the Colab notebook).
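The error is consistent with the shape of hidden_states: it is a tuple with one entry per layer, not a tensor batched on dim 0, so indexing the container with the in-batch index walks over layers and runs off the end of the tuple. A minimal sketch of the mechanism (the 13-entry tuple assumes a BERT-base style model: embeddings plus 12 transformer layers; the string stand-ins just replace the per-layer tensors):

```python
# Why the unbatching loop fails: hidden_states is a tuple with one entry
# per layer (13 for a BERT-base model), each entry a tensor of shape
# (batch, seq_len, hidden). Indexing the tuple with the in-batch index
# selects layers, not samples, and fails once the index reaches len(tuple).
num_layers = 13
batch_size = 16
hidden_states = tuple(f"layer_{n}_tensor" for n in range(num_layers))  # stand-ins

for i in range(batch_size):
    try:
        _ = hidden_states[i]          # what loader_batch_item effectively does
    except IndexError as exc:
        print(f"i={i}: {exc}")        # i=13: tuple index out of range
        break
```

This also matches the observation that small batch sizes appear to work: the tuple indexing only raises once the batch index exceeds the number of layer entries.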

Expected behavior

The pipeline runs successfully with any batch size when using a model loaded to output hidden states and attentions.


Narsil commented Nov 16, 2021

Hi @alwayscurious ,

Yes, the current system for automatic batching/unbatching doesn't support hidden_states or attentions.
I opened a PR. Currently it explicitly checks for specific keys to handle these tuples of tensors, since they are not the norm for model outputs.
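For reference, a minimal sketch of what correct unbatching for these outputs has to do (an illustration under the assumptions above, not the PR's actual code; `unbatch_tuple_output` is a hypothetical helper name): slice each per-layer tensor along the batch dimension and preserve the tuple structure:

```python
import torch

def unbatch_tuple_output(element, i):
    # Hypothetical helper, not the PR's code: for tuple-of-tensor outputs
    # such as hidden_states/attentions, unbatch by slicing each per-layer
    # tensor on dim 0 instead of indexing the tuple itself.
    if isinstance(element, tuple):
        return tuple(layer[i].unsqueeze(0) for layer in element)
    return element[i].unsqueeze(0)

# 13 layer entries, batch of 16, seq_len 8, hidden size 4
hidden_states = tuple(torch.randn(16, 8, 4) for _ in range(13))
item = unbatch_tuple_output(hidden_states, 0)
print(len(item), item[0].shape)   # 13 torch.Size([1, 8, 4])
```

Each unbatched item keeps the tuple-of-layers structure the caller expects, with a batch dimension of 1 on every layer tensor.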

@alwayscurious

Hi @Narsil

Thanks for adding a fix to support this. Once the PR is merged to master I'll check that it works successfully!


Narsil commented Nov 17, 2021

@alwayscurious by the way, don't use this with batch_size < 16 before the fix; it's just incorrect: you will receive the first layer's hidden states (for the full batch) as your first item, the second layer's as your second item, and so on.
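To illustrate the warning: with a batch size below the number of layer entries, the naive tuple indexing does not raise, but it silently returns whole layers instead of per-sample slices. A small sketch (shapes are arbitrary illustrative values):

```python
import torch

batch, seq, hidden = 4, 8, 6                   # batch_size < number of layer entries
hidden_states = tuple(torch.randn(batch, seq, hidden) for _ in range(13))

# Naive unbatching: "item 0" is really layer 0 for the WHOLE batch ...
wrong_item0 = hidden_states[0]
print(wrong_item0.shape)                        # torch.Size([4, 8, 6])

# ... whereas sample 0's hidden states are a slice of every layer:
right_item0 = tuple(layer[0] for layer in hidden_states)
print(len(right_item0), right_item0[0].shape)   # 13 torch.Size([8, 6])
```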
