# Transformers, what can they do?

## Introduction

In this section, we look at what **Transformer models** can do and use the **`pipeline()`** function from Hugging Face (HF from now on).

### Transformers are everywhere!

**Transformer models** are used to solve **all kinds** of **NLP** tasks (like the ones in the previous section).

Companies such as:
- Facebook
- Microsoft
- Grammarly

use **HF** and **Transformer models**, and share their own models back with the community.

The [**HF Transformers Library**](https://github.com/huggingface/transformers) provides tools to:
- Create models
- Use the shared models

The [**HF Model Hub**](https://huggingface.co/models) contains thousands of open-source **pretrained models**.

## Working with pipelines

Install the `Transformers`, `Datasets`, and `Evaluate` libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]



The most **basic object** in the `transformers` library is the `pipeline()` function.
It connects a **model** with the steps it needs:
- preprocessing
- postprocessing

This allows us to:
1. input any text directly
2. get an intelligible answer

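By default, the `pipeline()` function selects a **pretrained model** that has been fine-tuned for the requested task (here, sentiment analysis of English text). The model is **downloaded and cached** the first time the pipeline is created, so later runs reuse the cached copy.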

In [1]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598046541213989}]

We can even pass a **list** of sentences:

In [2]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598046541213989},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
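The pipeline returns one dictionary per input sentence, in the same order as the inputs. As a small sketch of how this output can be consumed, the snippet below pairs each sentence with its prediction and keeps only the confident ones. The results are copied from the output above; the helper name `confident_predictions` and the `0.98` threshold are illustrative assumptions, not part of the library.

```python
sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# Results copied from the notebook output above (no model is run here).
results = [
    {"label": "POSITIVE", "score": 0.9598046541213989},
    {"label": "NEGATIVE", "score": 0.9994558691978455},
]


def confident_predictions(sentences, results, threshold=0.98):
    """Pair each sentence with its label and keep scores at or above the threshold."""
    return [
        (sentence, result["label"])
        for sentence, result in zip(sentences, results)
        if result["score"] >= threshold
    ]


# Print every prediction next to its sentence.
for sentence, result in zip(sentences, results):
    print(f"{result['label']:<8} {result['score']:.3f}  {sentence}")

print(confident_predictions(sentences, results))
```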

There are **3 main steps** involved when you pass **text** to the `pipeline()`:

1. Text is **preprocessed** into a format the model can understand
2. **Preprocessed inputs** are passed to the model
3. **Predictions** made by the model are **post-processed** so we can understand them
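The three steps above can be sketched with a toy example in plain Python. This is not how the library implements them: the word weights, the label order, and every function name below are invented for illustration, and a real pipeline uses a tokenizer and a neural network instead.

```python
import math

# Hand-picked word weights (hypothetical; a real model learns its parameters).
WORD_WEIGHTS = {"love": 2.0, "great": 1.5, "waiting": 0.5, "hate": -2.5, "bad": -1.5}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into tokens the toy model understands."""
    return [token.strip(".,!?").lower() for token in text.split()]


def toy_model(tokens):
    """Step 2: map the tokens to one raw score (logit) per label."""
    total = sum(WORD_WEIGHTS.get(token, 0.0) for token in tokens)
    return [-total, total]  # [NEGATIVE logit, POSITIVE logit]


def postprocess(logits):
    """Step 3: convert logits to probabilities (softmax) and pick the best label."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    probs = [value / sum(exps) for value in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, as `pipeline()` does behind the scenes."""
    return postprocess(toy_model(preprocess(text)))


print(toy_pipeline("I hate this so much!"))
```

The point of the sketch is the shape of the flow: text goes in, numbers pass through the model, and the post-processing step turns them back into a readable label and score.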

In [3]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445965647697449, 0.11197595298290253, 0.0434274896979332]}

The [**pipelines currently available in HF**](https://huggingface.co/docs/transformers/main_classes/pipelines) include:
- feature-extraction
- fill-mask
- ner
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification

### Using any model from the **HF HUB** in a `pipeline()`

You can choose a **particular model** from the [**HF HUB**](https://huggingface.co/models) to use in the `pipeline()`:


In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator("In this course, we will teach you how to")
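The warning printed by the earlier cells says that using a pipeline without a model name and revision is not recommended in production. One way to follow that advice, sketched below, is to pass both explicitly. The model name and revision are the ones reported in the sentiment-analysis log above; the two helper functions are hypothetical names introduced here, and the download only happens if the last line is uncommented.

```python
def pinned_pipeline_kwargs(task, model, revision):
    """Collect the arguments that identify exactly which model a pipeline loads."""
    return {"task": task, "model": model, "revision": revision}


def build_pipeline(kwargs):
    """Create the pipeline; imported lazily so the helper above needs no download."""
    from transformers import pipeline

    return pipeline(**kwargs)


# Values taken from the log message of the sentiment-analysis cell.
sentiment_kwargs = pinned_pipeline_kwargs(
    task="sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    revision="af0f99b",
)
print(sentiment_kwargs)

# classifier = build_pipeline(sentiment_kwargs)  # downloads the model on first use
```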

## Types of pipelines & examples

### Zero-Shot-Classification

It is called `zero-shot` because we don't need to **fine-tune** the model on our own data to use it.

It allows us to **classify texts** that haven't been **labeled**.

This is a common scenario in real-world projects.

The `zero-shot-classification` pipeline lets us specify **which labels to use for classification**.

By doing that, we don't have to rely on the **labels of the pretrained model**.

In [6]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "The company was bought by an arabic fund",
    candidate_labels=["mergers and acquisitions", "politics"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'The company was bought by an arabic fund',
 'labels': ['mergers and acquisitions', 'politics'],
 'scores': [0.9484009146690369, 0.05159905552864075]}
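A rough sketch of the idea behind this output: the default model is a natural language inference model, so each candidate label is turned into a hypothesis sentence, the model scores how well the text entails it, and the scores are normalized and sorted. The template string below is, to my knowledge, the library's default, but treat it as an assumption; the logits are made-up numbers and the function names are hypothetical.

```python
import math

# Assumed default template; the real pipeline lets you override it.
HYPOTHESIS_TEMPLATE = "This example is {}."


def build_hypotheses(labels, template=HYPOTHESIS_TEMPLATE):
    """Turn each candidate label into a sentence the NLI model can judge."""
    return [template.format(label) for label in labels]


def rank_labels(labels, entailment_logits):
    """Softmax over one entailment logit per label, sorted best first."""
    top = max(entailment_logits)
    exps = [math.exp(value - top) for value in entailment_logits]
    total = sum(exps)
    scored = sorted(
        zip(labels, (value / total for value in exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {
        "labels": [label for label, _ in scored],
        "scores": [score for _, score in scored],
    }


labels = ["politics", "mergers and acquisitions"]
made_up_logits = [0.1, 3.0]  # invented values standing in for model output

print(build_hypotheses(labels))
print(rank_labels(labels, made_up_logits))
```

This also explains why the labels in the output are ordered by score rather than in the order they were passed.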

### Text Generation

Used for **generating text**.

We provide a **prompt** and the model **auto-completes** it by generating the **remaining text**.

Generation involves **randomness**, so outputs may differ between executions.

In [7]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("Mergers and acquisitions or M&A is ")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Mergers and acquisitions or M&A is \xa0not necessarily a bad thing.\nM&A is a form of financial engineering which is a process which takes advantage of all the information available in terms of historical information that has come into being.'}]
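To illustrate where the randomness comes from, here is a toy sampler: at each step the next token is drawn at random from a probability distribution, so two runs can differ unless the random seed is fixed. The probabilities and function names are invented for this sketch. In the real library, `transformers.set_seed` can be called before generation for reproducible results.

```python
import random

# Hypothetical next-token probabilities; a real model computes them from the prompt.
NEXT_TOKEN_PROBS = {"a": 0.4, "the": 0.3, "not": 0.2, "often": 0.1}


def sample_next_token(rng):
    """Draw one token according to its probability."""
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


def sample_continuation(seed, length=5):
    """Sample `length` tokens with a dedicated, seeded random generator."""
    rng = random.Random(seed)
    return [sample_next_token(rng) for _ in range(length)]


print(sample_continuation(seed=0))
print(sample_continuation(seed=0))  # same seed, same tokens
print(sample_continuation(seed=1))  # a different seed may give different tokens
```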

You can control **how many different sequences are generated** by using `num_return_sequences`.

You can control **the total length of the output text** by using `max_length`.

In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)
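The effect of these two arguments can be sketched with a toy generator. It is a simplification under stated assumptions: tokens are whitespace-separated words (real models count subword tokens), the vocabulary is invented, and `toy_generate` is a hypothetical name. What it does mirror is that `max_length` is a total that includes the prompt, and that each returned sequence is an independent continuation.

```python
import random

# Invented vocabulary for the toy generator.
VOCAB = ["build", "train", "use", "models", "quickly", "."]


def toy_generate(prompt, max_length, num_return_sequences, seed=0):
    """Return `num_return_sequences` continuations, each at most `max_length` tokens."""
    rng = random.Random(seed)
    prompt_tokens = prompt.split()
    outputs = []
    for _ in range(num_return_sequences):
        tokens = list(prompt_tokens)
        # The prompt counts toward the total length.
        while len(tokens) < max_length:
            tokens.append(rng.choice(VOCAB))
        outputs.append({"generated_text": " ".join(tokens)})
    return outputs


prompt = "In this course, we will teach you how to"
for output in toy_generate(prompt, max_length=15, num_return_sequences=2):
    print(output["generated_text"])
```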

### Mask Filling

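The model fills in the blank in a given text. The blank is marked with the **special** `<mask>` word, also called the **mask token**.

Other mask-filling models **might have different mask tokens**, so check the model's mask token before reusing this example with another model.

The `top_k` argument controls:
- How many possibilities are displayed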

In [8]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.19619670510292053,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052688181400299,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]
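As a sketch of what `top_k` does, the toy function below keeps the `top_k` highest-scoring candidates for the masked position and substitutes each into the sentence. The first two scores are rounded from the output above; the other candidates, and the function name `fill_mask`, are invented for illustration. In the real pipeline, the mask token of the loaded model can be read from `unmasker.tokenizer.mask_token`.

```python
import heapq

MASK_TOKEN = "<mask>"  # mask token of the default model above; other models may differ

# Candidate scores: the first two are rounded from the notebook output,
# the last two are made up to give top_k something to discard.
candidate_scores = {
    "mathematical": 0.196,
    "computational": 0.041,
    "statistical": 0.030,
    "predictive": 0.020,
}


def fill_mask(template, scores, top_k, mask_token=MASK_TOKEN):
    """Return the top_k candidates, each substituted into the template."""
    if template.count(mask_token) != 1:
        raise ValueError(f"expected exactly one {mask_token!r} in the input")
    best = heapq.nlargest(top_k, scores.items(), key=lambda item: item[1])
    return [
        {
            "score": score,
            "token_str": word,
            "sequence": template.replace(mask_token, word),
        }
        for word, score in best
    ]


for candidate in fill_mask(
    "This course will teach you all about <mask> models.", candidate_scores, top_k=2
):
    print(candidate)
```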

### Named Entity Recognition (NER)

The model has to find which parts of the **input text** correspond to entities (companies, locations, persons...)

By passing the option `grouped_entities=True`, we tell the pipeline:
- To regroup the parts of the sentence that correspond to the same entity

In the example, Hugging Face is treated as a **single** organization, even though the name consists of multiple words.

Recent versions of the library expose the same behavior through the `aggregation_strategy` argument (for example, `aggregation_strategy="simple"`).

In [9]:
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

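The model has recognized:
- Sylvain as a **person** `PER`
- Hugging Face as an **organization** `ORG`
- Brooklyn as a **location** `LOC`

As with the other pipelines, we can pass a specific model from the Hub. The next cell runs the same sentence through `MMG/xlm-roberta-large-ner-spanish`: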

In [15]:
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True, model="MMG/xlm-roberta-large-ner-spanish")
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

[{'entity_group': 'PER',
  'score': 0.8840325,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9043134,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9973139,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
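The regrouping described earlier in this section can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it assumes whole-word tokens with plain labels, whereas real models work on subword tokens and tag schemes. The token-level predictions and the name `group_entities` are hypothetical; the character offsets match the sentence used above.

```python
def group_entities(text, token_predictions):
    """Merge adjacent tokens that share an entity type into one span."""
    groups = []
    for prediction in token_predictions:
        last = groups[-1] if groups else None
        # Extend the previous group when the type matches and only
        # whitespace separates the two spans in the original text.
        if (
            last is not None
            and last["entity_group"] == prediction["entity"]
            and text[last["end"]:prediction["start"]].strip() == ""
        ):
            last["end"] = prediction["end"]
        else:
            groups.append(
                {
                    "entity_group": prediction["entity"],
                    "start": prediction["start"],
                    "end": prediction["end"],
                }
            )
    # Recover each entity's text from its character offsets.
    for group in groups:
        group["word"] = text[group["start"]:group["end"]]
    return groups


text = "My name is Sylvain and I work at Hugging Face in Brooklyn."
token_predictions = [
    {"entity": "PER", "start": 11, "end": 18},  # "Sylvain"
    {"entity": "ORG", "start": 33, "end": 40},  # "Hugging"
    {"entity": "ORG", "start": 41, "end": 45},  # "Face"
    {"entity": "LOC", "start": 49, "end": 57},  # "Brooklyn"
]

for group in group_entities(text, token_predictions):
    print(group)
```

The `start` and `end` values in the pipeline output play the same role: slicing the input text with them gives back the entity's words.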

### Question Answering

The pipeline answers a question by **extracting information** from the **provided context**.

It does not generate the answer: the answer is always a span of text copied from the context.

In [16]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6949754953384399, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
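Because the answer is extracted rather than generated, the `start` and `end` values in the output are character offsets into the context. The sketch below checks this with the result copied from above, then shows a toy version of span selection: score every candidate start and end position and keep the best pair. The scores and the function name `best_span` are made up, and the real model scores subword tokens rather than whole words.

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# Result copied from the notebook output above.
result = {"score": 0.6949754953384399, "start": 33, "end": 45, "answer": "Hugging Face"}

# The answer is literally a slice of the context.
print(context[result["start"]:result["end"]])


def best_span(words, start_scores, end_scores, max_words=3):
    """Pick the span whose start score plus end score is highest."""
    best = None
    for i, start_score in enumerate(start_scores):
        for j in range(i, min(i + max_words, len(words))):
            score = start_score + end_scores[j]
            if best is None or score > best[0]:
                best = (score, i, j)
    _, i, j = best
    return " ".join(words[i:j + 1])


words = context.split()
# Made-up scores: a strong start on "Hugging" and a strong end on "Face".
start_scores = [0.0] * len(words)
end_scores = [0.0] * len(words)
start_scores[words.index("Hugging")] = 5.0
end_scores[words.index("Face")] = 5.0

print(best_span(words, start_scores, end_scores))
```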

### Summarization

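The summarization pipeline reduces a text to a shorter version that keeps the most important points.

We can specify a `max_length` or `min_length` argument for the result. A small `max_length` can cut the summary off mid-sentence, as happens in the output of the next cell.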

In [19]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
""",
    max_length=30,
    min_length=10,
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as'}]

### Translation

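The translation pipeline converts text from one language to another. Here we choose a French-to-English model from the Hub by name.

As with summarization, we can specify a `max_length` or `min_length` argument for the result.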

In [20]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]