# Pipelines for Inference

## Pipeline usage

The `pipeline()` function automatically loads a default model and a preprocessing class capable of inference for the given task.

Take automatic speech recognition (ASR), also known as speech-to-text, as an example.

1. Start by creating a `pipeline()` and specify the inference task:

In [None]:
from transformers import pipeline

transcriber = pipeline(task='automatic-speech-recognition')

2. Pass the input to the `pipeline()`:

In [None]:
transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")

{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP LIVE UP THE TRUE MEANING OF ITS TREES'}

Try the Whisper large-v2 model from OpenAI:

In [None]:
transcriber = pipeline(model="openai/whisper-large-v2")

transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")


{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}

Multiple inputs can be passed as a list:

In [None]:
transcriber(
    [
        "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac",
        "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac",
    ]
)



[{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'},
 {'text': ' He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick, peppered, flour-fattened sauce.'}]

Pipelines are great for experimentation as switching from one model to another is trivial.

## Parameters

In [None]:
# Parameters can be set when the pipeline is created or overridden on individual calls.
# `torch_dtype` is a load-time option, so it is set here and not per call.
import torch
transcriber = pipeline(model='openai/whisper-large-v2', torch_dtype=torch.float16)

In [None]:
# This uses the defaults chosen at creation time (no timestamps requested)
out = transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
out

{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}

In [None]:
# This overrides the default for this call only and uses `return_timestamps=True`
out = transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac", return_timestamps=True)
out

In [None]:
# This goes back to the creation-time defaults
out = transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
out
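As a rough mental model (a toy sketch, not the actual Transformers implementation), creation-time parameters act as defaults that a single call can override without changing them. The `ToyPipeline` class and the `my_parameter` name are hypothetical and exist only to illustrate the precedence rule:

```python
class ToyPipeline:
    """Hypothetical stand-in that mimics default/override precedence only."""

    def __init__(self, **default_params):
        # Parameters given at creation time become the defaults.
        self.default_params = dict(default_params)

    def __call__(self, item, **call_params):
        # Call-time parameters win for this call; the defaults are not mutated.
        params = {**self.default_params, **call_params}
        return {"input": item, "params": params}


toy = ToyPipeline(my_parameter=1)
first = toy("audio.flac")                   # uses my_parameter=1
second = toy("audio.flac", my_parameter=2)  # overrides for this call only
third = toy("audio.flac")                   # back to my_parameter=1
print(first["params"], second["params"], third["params"])
```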

### Device

If we use `device=n`, the pipeline automatically puts the model on the specified device.

In [None]:
transcriber = pipeline(model='openai/whisper-large-v2', device=0)

If the model is too large for a single GPU and we are using PyTorch, we can set `torch_dtype='float16'` to enable FP16 precision inference.

Alternatively, we can set `device_map='auto'` to automatically determine how to load and store the model weights. Using the `device_map` argument requires the Hugging Face Accelerate package:

In [None]:
!pip install --upgrade accelerate

Then this automatically loads and stores model weights across devices:

In [None]:
transcriber = pipeline(model='openai/whisper-large-v2', device_map='auto')

Once `device_map="auto"` is passed, do not also pass `device=device` when instantiating the `pipeline`, since combining the two may lead to unexpected behavior.
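One way to keep the two placement options from being mixed is to build the keyword arguments in a single place. The helper below is a hypothetical convenience function (it is not part of Transformers) and only sketches the "one or the other" rule:

```python
def placement_kwargs(device=None, use_device_map=False):
    """Hypothetical helper: return either `device` or `device_map`, never both."""
    if use_device_map and device is not None:
        raise ValueError("Pass either device or device_map='auto', not both.")
    if use_device_map:
        return {"device_map": "auto"}
    if device is not None:
        return {"device": device}
    return {}  # let the pipeline fall back to its default placement


print(placement_kwargs(device=0))
print(placement_kwargs(use_device_map=True))
```

The resulting dictionary could then be unpacked into the call, for example `pipeline(model='openai/whisper-large-v2', **placement_kwargs(use_device_map=True))`.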

### Batch Size

Batching during inference is not necessarily faster, and in some cases it can be considerably slower.

When it does help, it only requires setting `batch_size`:

In [None]:
transcriber = pipeline(model='openai/whisper-large-v2',
                       device=0,
                       batch_size=2)

audio_filenames = [f"https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/{i}.flac" for i in range(1, 5)]

texts = transcriber(audio_filenames)
texts

This runs the pipeline on the 4 provided audio files, but it will pass them in batches of 2 to the model (which is on a GPU, where batching is more likely to help) without requiring any further code.
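To make the grouping concrete, here is a minimal pure-Python sketch of how 4 inputs fall into batches of 2. The `make_batches` helper is hypothetical; the real pipeline performs its batching internally:

```python
def make_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


audio_filenames = [
    f"https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/{i}.flac"
    for i in range(1, 5)
]
batches = list(make_batches(audio_filenames, batch_size=2))
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {len(batch)} files")
```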

### Task-specific parameters

Different tasks provide different task-specific parameters for additional flexibility.

For example, the `transformers.AutomaticSpeechRecognitionPipeline.__call__()` method has a `return_timestamps` parameter, which sounds promising for subtitling videos:

In [None]:
transcriber = pipeline(model='openai/whisper-large-v2', return_timestamps=True)

transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
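For the subtitling use case, timestamped output could be turned into SRT text along the following lines. This is a sketch under assumptions: the `sample` dictionary is hand-written (not real model output) and mirrors the `{'text': ..., 'chunks': [{'timestamp': (start, end), 'text': ...}]}` shape, and both helper functions are hypothetical. It also assumes every chunk has a start and an end time.

```python
def to_srt_time(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks):
    """Build SRT text from a list of {'timestamp': (start, end), 'text': str}."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        blocks.append(
            f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(blocks)


# Hand-written sample in the assumed output shape, for illustration only.
sample = {
    "text": " Hello there. General remarks.",
    "chunks": [
        {"timestamp": (0.0, 2.5), "text": " Hello there."},
        {"timestamp": (2.5, 4.0), "text": " General remarks."},
    ],
}
srt = chunks_to_srt(sample["chunks"])
print(srt)
```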

The `AutomaticSpeechRecognitionPipeline` also has a `chunk_length_s` parameter which is helpful for working on really long audio files that a model typically cannot handle on its own:

In [None]:
transcriber = pipeline(model='openai/whisper-large-v2', chunk_length_s=30)

transcriber("https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/ted_60.wav")
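The idea behind chunking can be sketched as splitting a long recording into overlapping windows that are transcribed separately and then merged. The function below is a simplified illustration, not the library's implementation; `chunk_bounds` is a hypothetical helper and the 5-second stride is an arbitrary value chosen for the example:

```python
def chunk_bounds(total_s, chunk_length_s, stride_length_s):
    """Return (start, end) windows in seconds.

    Consecutive windows overlap by 2 * stride_length_s.
    """
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must exceed 2 * stride_length_s")
    bounds = []
    start = 0
    while start < total_s:
        end = min(start + chunk_length_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break  # the last window reached the end of the audio
        start += step
    return bounds


# A 60-second file split into 30-second windows with a 5-second stride.
windows = chunk_bounds(total_s=60, chunk_length_s=30, stride_length_s=5)
print(windows)
```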

## Using pipelines on a dataset

The recommended way to run inference on a large dataset is to use an iterator:

In [None]:
def data():
    for i in range(10):
        yield f"My example {i}"

In [None]:
pipe = pipeline(model='openai-community/gpt2', device=0)

generated_characters = 0
for out in pipe(data()):
    generated_characters += len(out[0]['generated_text'])
    print(f"Generated {generated_characters} characters so far")

The generator `data()` yields one input at a time. The pipeline recognizes that the input is iterable and fetches new items while it continues to process earlier ones on the GPU (it uses a `DataLoader` under the hood). As a result, we don't have to allocate memory for the whole dataset, and the GPU can be fed as fast as possible.
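The lazy behavior can be observed with plain generators. In this sketch, `toy_pipe` is a hypothetical stand-in for a pipeline and does no real inference; the real pipeline may also prefetch a few items ahead, which this toy version does not model:

```python
def data(produced):
    """Yield inputs one by one, recording each index as it is produced."""
    for i in range(10):
        produced.append(i)
        yield f"My example {i}"


def toy_pipe(inputs):
    """Hypothetical stand-in: consume inputs lazily and yield one result each."""
    for text in inputs:
        yield [{"generated_text": text + " ..."}]


produced = []
results = toy_pipe(data(produced))

first = next(results)
produced_after_first = list(produced)  # only the first input exists so far
print(first, produced_after_first)

remaining = list(results)  # consuming the rest pulls the other nine inputs
print(len(remaining), len(produced))
```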

In [None]:
# KeyDataset is a util that will just output the item we are interested in
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset

pipe = pipeline(model='hf-internal-testing/tiny-random-wav2vec2', device=0)
dataset = load_dataset('hf-internal-testing/librispeech_asr_dummy', 'clean', split='validation[:10]')

for out in pipe(KeyDataset(dataset, 'audio')):
    print(out)

## Using pipelines for a webserver

Building an inference server around a pipeline is covered in a dedicated guide: https://huggingface.co/docs/transformers/pipeline_webserver

## Vision pipeline

In [None]:
from transformers import pipeline

vision_classifier = pipeline(model='google/vit-base-patch16-224')

In [3]:
preds = vision_classifier(
    images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
)

preds = [
    {'score': round(pred['score'], 4), 'label': pred['label']}
    for pred in preds
]

preds

[{'score': 0.4335, 'label': 'lynx, catamount'},
 {'score': 0.0348,
  'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'},
 {'score': 0.0324, 'label': 'snow leopard, ounce, Panthera uncia'},
 {'score': 0.0239, 'label': 'Egyptian cat'},
 {'score': 0.0229, 'label': 'tiger cat'}]

## Text pipeline

In [None]:
# This model is a `zero-shot-classification` model
classifier = pipeline(model='facebook/bart-large-mnli')

In [4]:
classifier(
    "I have a problem with my iphone that needs to be resolved asap!!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
)

{'sequence': 'I have a problem with my iphone that needs to be resolved asap!!',
 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'],
 'scores': [0.503635585308075,
  0.47879981994628906,
  0.012600085698068142,
  0.002655789954587817,
  0.0023087512236088514]}
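Because `labels` and `scores` are parallel lists, a little post-processing makes the result easier to act on. The snippet below reuses the output shown above; the 0.3 threshold is an arbitrary value chosen for illustration:

```python
result = {
    "sequence": "I have a problem with my iphone that needs to be resolved asap!!",
    "labels": ["urgent", "phone", "computer", "not urgent", "tablet"],
    "scores": [
        0.503635585308075,
        0.47879981994628906,
        0.012600085698068142,
        0.002655789954587817,
        0.0023087512236088514,
    ],
}

# Pair each label with its score.
label_scores = dict(zip(result["labels"], result["scores"]))

# Keep only the labels above an (arbitrary) confidence threshold.
threshold = 0.3
confident_labels = [
    label for label, score in label_scores.items() if score >= threshold
]
best_label = max(label_scores, key=label_scores.get)
print(best_label, confident_labels)
```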

## Multimodal pipeline

In [1]:
!pip install pytesseract



In [None]:
vqa = pipeline(model='impira/layoutlm-document-qa')

In [None]:
output = vqa(
    image="https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png",
    question="What is the invoice number?",
)

output[0]['score'] = round(output[0]['score'], 5)
output

## Using pipeline on large models with Accelerate

In [6]:
!pip install accelerate



In [None]:
import torch
from transformers import pipeline

pipe = pipeline(model='facebook/opt-1.3b',
                torch_dtype=torch.bfloat16,
                device_map='auto')

In [10]:
output = pipe('This is a cool example!', do_sample=True, top_p=0.95)
output

[{'generated_text': 'This is a cool example! It shows the kind of detail the animators are putting in this'}]

We can also pass 8-bit loaded models:

In [None]:
!pip install bitsandbytes
import torch

pipe = pipeline(model='facebook/opt-1.3b',
                device_map='auto',
                model_kwargs={'load_in_8bit': True})

In [None]:
output = pipe('This is a cool example!', do_sample=True, top_p=0.95)
output
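A back-of-the-envelope calculation shows why lower precision helps with large models. The sketch below counts weights only and ignores activations, buffers, and framework overhead. The parameter count of 1.3 billion is an approximation taken from the model name, and `weight_memory_gb` is a hypothetical helper:

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


num_params = 1.3e9  # rough figure implied by the name "opt-1.3b"
bytes_per_dtype = {"float32": 4, "bfloat16": 2, "int8": 1}

estimates = {
    dtype: weight_memory_gb(num_params, size)
    for dtype, size in bytes_per_dtype.items()
}
for dtype, gigabytes in estimates.items():
    print(f"{dtype}: ~{gigabytes:.1f} GB")
```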