# Hugging Face pipelines

Pipelines are a simple way to use pretrained models for **inference**: they apply **existing models (available on the Hugging Face Hub)** to your own data with only a few lines of code.

The models that can be used in pipelines cover both **uni-modal** and **multi-modal** tasks.

## General usage:

The generic `pipeline()` is a **wrapper** around all the other available, task-specific pipelines.

It can be executed on:

- a single item
- a list of items
- a Dataset object
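
To make the wrapper idea concrete, here is a toy sketch of task-name dispatch. It is not the transformers implementation: the class names, the task names and `toy_pipeline` are all hypothetical stand-ins, chosen only to show how one entry point can hand back a task-specific object that accepts either a single item or a list.

```python
# Toy illustration only: every name below is hypothetical and unrelated
# to the real transformers code base.

class ToyPipeline:
    """Base class: accepts a single item or a list of items."""

    def __call__(self, inputs):
        # A list is processed item by item; anything else is one item.
        if isinstance(inputs, list):
            return [self._run(item) for item in inputs]
        return self._run(inputs)

    def _run(self, text):
        raise NotImplementedError


class UppercasePipeline(ToyPipeline):
    """Hypothetical task-specific pipeline that upper-cases text."""

    def _run(self, text):
        return text.upper()


class WordCountPipeline(ToyPipeline):
    """Hypothetical task-specific pipeline that counts words."""

    def _run(self, text):
        return len(text.split())


# Registry mapping a task name to the class that handles it.
TASK_REGISTRY = {
    "uppercase": UppercasePipeline,
    "word-count": WordCountPipeline,
}


def toy_pipeline(task):
    """Return an instance of the pipeline registered under `task`."""
    if task not in TASK_REGISTRY:
        raise KeyError(f"Unknown task: {task!r}")
    return TASK_REGISTRY[task]()


counter = toy_pipeline("word-count")
single_result = counter("This is a short sentence.")    # single item
batch_result = counter(["one two", "three four five"])  # list of items
print(single_result)  # 5
print(batch_result)   # [2, 3]
```

The real `pipeline()` additionally loads a model and its preprocessing components; the sketch only mirrors the calling convention described above.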


In [1]:
from transformers import pipeline
import logging, sys
import warnings

from IPython.display import HTML

# Silence library logging and Python warnings to keep the notebook output readable.
logging.disable(sys.maxsize)
warnings.filterwarnings('ignore')

def colored(s, color='blue'):
    # Wrap a string in an HTML span so that it renders in the given color.
    return '<span style="color:{}">{}</span>'.format(color, s)

## Feature extraction from text

### on a single item

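The result is a **tensor** of shape [1, sequence_length, hidden_dimension] representing the input string. In the example below, sequence_length is 8 rather than 6 because the BERT tokenizer adds the special tokens [CLS] and [SEP] around the 6 tokens of the sentence.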

In [2]:
from transformers import pipeline
import torch

from transformers import AutoTokenizer


model = "bert-base-uncased"
task = "feature-extraction"
tokenizer = AutoTokenizer.from_pretrained(model)


text = "This is a short sentence."
tokenized_text = tokenizer.tokenize(text)


feature_extractor = pipeline(task = task, model=model, tokenizer=tokenizer)
result = feature_extractor(text, return_tensors=True)

print(tokenized_text)
print(result.shape)


['this', 'is', 'a', 'short', 'sentence', '.']
torch.Size([1, 8, 768])


### on a list of items (i.e., sentences)
The result is a **list of tensors**: each tensor of shape [1, sequence_length, hidden_dimension] represents one input string.

In [3]:
sentences = ["an suv sitting on top of a cross walk.",
             "the suv is stopped in the middle of the crosswalk.",
             "there are people crossing the street and a car in the cross walk",
             "an intersection at night and the light is red.",
             "the car stops on the crosswalk to allow pedestrians to cross the street safely."]
feature_extractor = pipeline(model="bert-base-uncased", task="feature-extraction")
result_list = feature_extractor(sentences, return_tensors=True)

print(len(result_list))
result_list[0].shape

5


torch.Size([1, 12, 768])
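
One common way to use such features (an assumption here, not a step this notebook performs) is to average the token vectors into a single vector per sentence. The sketch below does this on plain nested lists with tiny hypothetical numbers; `mean_pool` and `toy_features` are illustrative names.

```python
def mean_pool(features):
    """Average the token vectors of a [1, sequence_length, hidden_dimension] nested list."""
    token_vectors = features[0]  # drop the leading axis of size 1
    sequence_length = len(token_vectors)
    hidden_dimension = len(token_vectors[0])
    return [
        sum(vector[i] for vector in token_vectors) / sequence_length
        for i in range(hidden_dimension)
    ]


# Hypothetical features: 1 sentence, 3 tokens, hidden dimension 2.
toy_features = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
sentence_vector = mean_pool(toy_features)
print(sentence_vector)  # [3.0, 4.0]
```

With the tensors returned in the cells above, `result.mean(dim=1)` would compute the same average over the token axis and yield a tensor of shape [1, 768].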

## Pre-defined pipelines

Pre-defined pipelines are also useful for inference on the most important NLP, Vision and Multi-modal tasks.

### The Focus of this notebook:

#### Vision tasks:

- Object Detection
- Image Classification
- Zero-shot Object Detection

#### Multi-modal tasks:
- Automatic Captioning
- Document Question Answering 
- Visual Question Answering

## Vision tasks:

### Object Detection Pipeline:

This pipeline can be used with any AutoModelForObjectDetection. It predicts bounding boxes of objects and their classes.

#### Inferring from [**facebook/detr-resnet-50**](https://huggingface.co/facebook/detr-resnet-50)

The DETR model is an encoder-decoder transformer with a convolutional backbone. Two heads are added on top of the decoder outputs in order to perform object detection: a linear layer for the class labels and an MLP (multi-layer perceptron) for the bounding boxes. The model uses object queries to detect objects in an image. Each object query looks for a particular object in the image.


In [4]:
from transformers import pipeline
from IPython.display import Image, display


image_url = "https://farm4.staticflickr.com/3236/2474976343_15aabea22b_z.jpg"
detector = pipeline(model="facebook/detr-resnet-50")

display(Image(url= image_url))
detector(image_url)


[{'score': 0.9331260323524475,
  'label': 'sandwich',
  'box': {'xmin': 24, 'ymin': 209, 'xmax': 290, 'ymax': 377}},
 {'score': 0.9893137812614441,
  'label': 'hot dog',
  'box': {'xmin': 197, 'ymin': 16, 'xmax': 414, 'ymax': 152}},
 {'score': 0.9176713228225708,
  'label': 'dining table',
  'box': {'xmin': 0, 'ymin': 4, 'xmax': 639, 'ymax': 474}},
 {'score': 0.9770232439041138,
  'label': 'chair',
  'box': {'xmin': 0, 'ymin': 1, 'xmax': 267, 'ymax': 474}}]
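
The detector returns a plain list of dictionaries, so it can be post-processed with ordinary Python. As a minimal sketch, the snippet below keeps only the detections above a score threshold; the detections are copied from the output above, while `filter_by_score` and the 0.95 threshold are illustrative choices.

```python
# Detections copied from the DETR output above.
detections = [
    {'score': 0.9331260323524475, 'label': 'sandwich',
     'box': {'xmin': 24, 'ymin': 209, 'xmax': 290, 'ymax': 377}},
    {'score': 0.9893137812614441, 'label': 'hot dog',
     'box': {'xmin': 197, 'ymin': 16, 'xmax': 414, 'ymax': 152}},
    {'score': 0.9176713228225708, 'label': 'dining table',
     'box': {'xmin': 0, 'ymin': 4, 'xmax': 639, 'ymax': 474}},
    {'score': 0.9770232439041138, 'label': 'chair',
     'box': {'xmin': 0, 'ymin': 1, 'xmax': 267, 'ymax': 474}},
]


def filter_by_score(detections, threshold):
    """Keep detections with score >= threshold, highest score first."""
    kept = [d for d in detections if d['score'] >= threshold]
    return sorted(kept, key=lambda d: d['score'], reverse=True)


confident = filter_by_score(detections, threshold=0.95)
labels = [d['label'] for d in confident]
print(labels)  # ['hot dog', 'chair']
```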

### Let's try with a different (smaller) model

We use [**hustvl/yolos-tiny**](https://huggingface.co/hustvl/yolos-tiny) for inference.

The YOLOS model is a Vision Transformer (ViT) trained using the DETR loss. Despite its simplicity, a base-sized YOLOS model is able to achieve 42 AP on COCO validation 2017 (similar to DETR and more complex frameworks such as Faster R-CNN).

It was pre-trained on ImageNet-1k and fine-tuned on COCO 2017 object detection (300 epochs).

In [5]:
from transformers import pipeline
from IPython.display import Image, display
import timm

image_url = "https://farm4.staticflickr.com/3236/2474976343_15aabea22b_z.jpg"
detector = pipeline(model="hustvl/yolos-tiny")

display(Image(url= image_url))
detector(image_url)


[{'score': 0.9307621121406555,
  'label': 'dining table',
  'box': {'xmin': 0, 'ymin': 2, 'xmax': 640, 'ymax': 473}}]
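
Both models detect the dining table, and one way to check how closely their boxes agree is the intersection over union (IoU). This is a sketch of the standard IoU formula applied to the two `dining table` boxes printed above; `box_area` and `iou` are helper names introduced here for illustration.

```python
def box_area(box):
    """Area of a box given as xmin/ymin/xmax/ymax; 0 if the box is empty."""
    width = max(0, box['xmax'] - box['xmin'])
    height = max(0, box['ymax'] - box['ymin'])
    return width * height


def iou(a, b):
    """Intersection over union of two boxes."""
    intersection_box = {
        'xmin': max(a['xmin'], b['xmin']),
        'ymin': max(a['ymin'], b['ymin']),
        'xmax': min(a['xmax'], b['xmax']),
        'ymax': min(a['ymax'], b['ymax']),
    }
    intersection = box_area(intersection_box)
    union = box_area(a) + box_area(b) - intersection
    return intersection / union if union else 0.0


# 'dining table' boxes copied from the two outputs above.
detr_table = {'xmin': 0, 'ymin': 4, 'xmax': 639, 'ymax': 474}
yolos_table = {'xmin': 0, 'ymin': 2, 'xmax': 640, 'ymax': 473}

overlap = iou(detr_table, yolos_table)
print(f"{overlap:.3f}")  # 0.992
```

An IoU close to 1 means the two boxes nearly coincide, so the smaller model localizes the table almost identically even though it returns only this one detection.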

### Image Classification Pipeline:

The pipeline can be used with any AutoModelForImageClassification (see the [available models](https://huggingface.co/models?filter=image-classification)). It predicts the class of an image.

 
#### Inferring from [**beit-base-patch16-224-pt22k-ft22k**](https://huggingface.co/microsoft/beit-base-patch16-224-pt22k-ft22k)

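The BEiT model is a Vision Transformer (ViT), which is a transformer encoder model (BERT-like). It is pretrained on a large collection of images (ImageNet-21k, also called ImageNet-22k) in a self-supervised fashion. As the `pt22k-ft22k` suffix of this checkpoint suggests, it is then fine-tuned in a supervised fashion on the same dataset (about 14 million images and 21,841 classes) rather than on the 1,000-class ImageNet-1k.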


In [6]:
from transformers import pipeline
from IPython.display import Image, display

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"

classifier = pipeline(model="microsoft/beit-base-patch16-224-pt22k-ft22k")

display(Image(url= image_url))
classifier(image_url)

[{'score': 0.20681101083755493, 'label': 'tabby, tabby_cat'},
 {'score': 0.09384264796972275, 'label': 'tabby, queen'},
 {'score': 0.05316951125860214, 'label': 'kitten, kitty'},
 {'score': 0.04978814721107483, 'label': 'beanbag'},
 {'score': 0.03999922797083855, 'label': 'reliquary'}]
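
The scores can be read as a measure of confidence. Assuming they are probabilities over all classes (which, to my knowledge, is the pipeline's default behaviour for single-label classification), the sketch below picks the top label and adds up the probability mass of the five predictions shown above.

```python
# Predictions copied from the output above.
predictions = [
    {'score': 0.20681101083755493, 'label': 'tabby, tabby_cat'},
    {'score': 0.09384264796972275, 'label': 'tabby, queen'},
    {'score': 0.05316951125860214, 'label': 'kitten, kitty'},
    {'score': 0.04978814721107483, 'label': 'beanbag'},
    {'score': 0.03999922797083855, 'label': 'reliquary'},
]

# Highest-scoring prediction, without relying on the list order.
top = max(predictions, key=lambda p: p['score'])

# Probability mass covered by the five predictions that are shown.
shown_mass = sum(p['score'] for p in predictions)

print(top['label'])          # tabby, tabby_cat
print(round(shown_mass, 3))  # 0.444
```

Under that assumption, the five listed classes account for less than half of the total probability, which indicates that the model spreads its belief over many of its fine-grained classes.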

### Zero-Shot Object Detection Pipeline:

This pipeline is based on OwlViTForObjectDetection. It predicts bounding boxes for the objects named in a set of `candidate_labels`, which must be provided.

#### Inferring from [**google/owlvit-base-patch32**](https://huggingface.co/google/owlvit-base-patch32)

The model uses a CLIP backbone with a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. 

The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective.

The pipeline can also be used with [other models](https://huggingface.co/models?other=zero-shot-object-detection).



In [13]:
from transformers import pipeline
import pandas as pd
from IPython.display import Image, display, Markdown

detector = pipeline(model="google/owlvit-base-patch32", task="zero-shot-object-detection")

examples = [("https://farm4.staticflickr.com/3266/3247626615_b3ab8a85af_z.jpg",["person","woman","man"]),
           ("https://currumbinvetservices.com.au/wp-content/uploads/2015/09/indian-ringneck.jpg",["parrot","bird",'flamingo'])]



for (image,candidate_labels) in examples:
    
    response = detector(image=image, candidate_labels=candidate_labels)
    
    display(Image(url= image,width=500))
    print(f"Detections: {response[0]['label']} (score:{response[0]['score']})")
    print(response[0]['box'])
    print()




Detections: man (score:0.4218994379043579)
{'xmin': 341, 'ymin': 105, 'xmax': 467, 'ymax': 310}



Detections: parrot (score:0.598288893699646)
{'xmin': 5, 'ymin': 117, 'xmax': 705, 'ymax': 574}
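
The cell above prints only `response[0]`. As a sketch of how the full response could be summarized, assuming it is a list of dictionaries with `score`, `label` and `box` keys as the printed fields suggest, the snippet below keeps the best detection for each candidate label. Only the first entry of `response` is copied from the output above; the other two are hypothetical values added for illustration.

```python
def best_per_label(detections):
    """Return a dict mapping each label to its highest-scoring detection."""
    best = {}
    for detection in detections:
        label = detection['label']
        if label not in best or detection['score'] > best[label]['score']:
            best[label] = detection
    return best


response = [
    # Copied from the output above.
    {'score': 0.4218994379043579, 'label': 'man',
     'box': {'xmin': 341, 'ymin': 105, 'xmax': 467, 'ymax': 310}},
    # Hypothetical extra detections, for illustration only.
    {'score': 0.31, 'label': 'person',
     'box': {'xmin': 340, 'ymin': 100, 'xmax': 470, 'ymax': 315}},
    {'score': 0.12, 'label': 'man',
     'box': {'xmin': 10, 'ymin': 90, 'xmax': 120, 'ymax': 300}},
]

best = best_per_label(response)
for label in sorted(best):
    print(label, best[label]['score'])
```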



## Multi-modal tasks:

### Automatic Captioning:

The image-to-text pipeline is designed for AutoModelForVision2Seq models. It predicts a caption for a given image and can also be used with [other models](https://huggingface.co/models?pipeline_tag=image-to-text).

#### Inferring from [**ydshieh/vit-gpt2-coco-en**](https://huggingface.co/ydshieh/vit-gpt2-coco-en)
It's not a state-of-the-art model, but it works reasonably well on simple images.

In [14]:
from transformers import pipeline
from IPython.display import Image, display


image_url = "https://farm4.staticflickr.com/3607/3567935365_dc4880fa10_z.jpg"


captioner = pipeline(model="ydshieh/vit-gpt2-coco-en", max_new_tokens=50)
display(Image(url= image_url))

caption = captioner(image_url)[0]['generated_text']

# Use <br> for the line break: a plain newline is not rendered in HTML.
HTML(f'Predicted caption:<br>{colored(caption)}')

### Document Question Answering:

The Document Question Answering pipeline can be used with any AutoModelForDocumentQuestionAnswering.

The pipeline takes an image of a document (and optionally OCR’d words/boxes) together with a question as input, and it generates answers according to the document's content.

The pipeline can also be used with [other models](https://huggingface.co/models?pipeline_tag=document-question-answering).

#### Inferring from [**impira/layoutlm-document-qa**](https://huggingface.co/impira/layoutlm-document-qa)

The model has been fine-tuned on SQuAD2.0 and DocVQA.

IMPORTANT: In addition to transformers, it also requires **PIL**, **pytesseract**, and **PyTorch**.


In [15]:
from transformers import pipeline
import pandas as pd
from IPython.display import Image, display


questions = ['What is the difference average gross hourly earnings between males and females in Europe?',
             'What is the difference in pay gap between Italy and Poland?']
images = ['https://ec.europa.eu/eurostat/documents/4187653/13722720/Gender_pay_gap_2020.png/',
          'https://ec.europa.eu/eurostat/documents/4187653/13722720/Gender_pay_gap_2020.png/']

document_qa = pipeline(model="impira/layoutlm-document-qa")


for question, image in zip(questions,images):
    result = document_qa(image = image, question=question)
    answer = result[0]['answer']
    display(Image(url= image,width=500))
    print(f'Question: {question} \nAnswer:{answer}')
    

Question: What is the difference average gross hourly earnings between males and females in Europe? 
Answer:13.0


Question: What is the difference in pay gap between Italy and Poland? 
Answer:4.2


### Visual Question Answering:

The Visual Question Answering pipeline can be used with any AutoModelForVisualQuestionAnswering.

The pipeline takes an image and a question about it as input, and it generates answers according to the image content.

The pipeline can also be used with [other models](https://huggingface.co/models?pipeline_tag=visual-question-answering).

#### Inferring from [**vilt-b32-finetuned-vqa**](https://huggingface.co/dandelin/vilt-b32-finetuned-vqa)

Vision-and-Language Transformer (ViLT) model fine-tuned on VQAv2.


In [16]:
from transformers import pipeline

vqa = pipeline(model="dandelin/vilt-b32-finetuned-vqa")
image_url = "https://farm5.staticflickr.com/4046/4314731899_4baf64470f_z.jpg"


questions = ['Is this a man?',
             'What is the subject of the photo?',
             'How many people are in the picture?']
images = ['https://huggingface.co/datasets/Narsil/image_dummy/raw/main/lena.png',
          'http://images.cocodataset.org/val2017/000000039769.jpg',
          'https://farm5.staticflickr.com/4046/4314731899_4baf64470f_z.jpg']

for question, image in zip(questions,images):
    result = vqa(image = image, question=question)
    answer = result[0]['answer']
    display(Image(url= image,width=500))
    print(f'Question: {question} \nAnswer:{answer}')




Question: Is this a man? 
Answer:no


Question: What is the subject of the photo? 
Answer:cat


Question: How many people are in the picture? 
Answer:4
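
The loop above prints each answer as it goes. As a minimal sketch, assuming each call returns a list of dictionaries with an `answer` key (as `result[0]['answer']` implies), the question and answer pairs can instead be collected into records for later use. The answers below are copied from the output above and every other key of the pipeline output is omitted; `top_answers` is a helper name introduced here.

```python
def top_answers(questions, results):
    """Pair each question with the first answer of its result list."""
    return [
        {'question': question, 'answer': result[0]['answer']}
        for question, result in zip(questions, results)
    ]


questions = ['Is this a man?',
             'What is the subject of the photo?',
             'How many people are in the picture?']

# Answers copied from the output above; all other keys are left out.
results = [[{'answer': 'no'}], [{'answer': 'cat'}], [{'answer': '4'}]]

records = top_answers(questions, results)
for record in records:
    # Left-align the question in a 40-character column.
    print(f"{record['question']:<40}{record['answer']}")
```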
