# Transformers, what can they do?

<span style="color:blue"><b>Transformer models are used to solve all kinds of NLP tasks</b></span>

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [13]:
# !pip install datasets evaluate transformers[sentencepiece]

## Working with pipelines

<span style="color:blue">The most basic object in the 🤗 Transformers library is the `pipeline()` function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer</span>:

In [24]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598050713539124}]

In [25]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598050713539124},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when you create the classifier object. If you rerun the command, the cached model will be used instead and there is no need to download the model again.
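The warning printed above recommends naming the model and revision explicitly. A minimal sketch of doing so, reusing the model name and revision hash reported in that log message. `PINNED` and `build_pinned_classifier` are hypothetical names introduced here, and the import is kept inside the function so that nothing is downloaded until it is called:

```python
# Model name and revision are copied from the log message printed above.
PINNED = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_pinned_classifier(config=PINNED):
    """Create the pipeline with an explicit model and revision (hypothetical helper)."""
    # Imported lazily: this line needs the transformers library to be installed.
    from transformers import pipeline

    return pipeline(config["task"], model=config["model"], revision=config["revision"])


# Uncomment to build it; the first call downloads the model, later calls reuse the cache.
# classifier = build_pinned_classifier()
```

Pinning both values should keep results stable even if the default model for the task changes in a later release of the library.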

<b>There are three main steps involved when you pass some text to a pipeline:</b>
1. <span style="color:blue">The text is preprocessed into a format the model can understand.</span>
2. <span style="color:blue">The preprocessed inputs are passed to the model.</span>
3. <span style="color:blue">The predictions of the model are post-processed, so you can make sense of them.</span>
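
The three steps above can be sketched with a toy example in plain Python. This is only an illustration of the flow, not the library's implementation: the vocabulary, the weights, and the function names are all made up for this sketch.

```python
import math

# Hypothetical toy vocabulary and weights, standing in for a real tokenizer and model.
VOCAB = {"love": 0, "hate": 1, "great": 2, "awful": 3}
WEIGHTS = {0: 2.0, 1: -2.5, 2: 1.5, 3: -1.5}  # positive values push toward POSITIVE


def preprocess(text):
    """Step 1: turn raw text into token ids the toy model understands."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [VOCAB[w] for w in words if w in VOCAB]


def model(token_ids):
    """Step 2: map token ids to raw scores (logits) for [NEGATIVE, POSITIVE]."""
    total = sum(WEIGHTS[t] for t in token_ids)
    return [-total, total]


def postprocess(logits):
    """Step 3: turn logits into a readable label and a probability (softmax)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    probs = [e / sum(exps) for e in exps]
    labels = ["NEGATIVE", "POSITIVE"]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, as a pipeline does behind a single call."""
    return postprocess(model(preprocess(text)))


result = toy_pipeline("I hate this so much!")
print(result)
```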

<b>Some of the currently available pipelines are:</b>

- `feature-extraction` (get the vector representation of a text)
- `fill-mask`
- `ner` (named entity recognition)
- `question-answering`
- `sentiment-analysis`
- `summarization`
- `text-generation`
- `translation`
- `zero-shot-classification`

## Zero-shot classification

We’ll start by tackling a more challenging task where we need to classify texts that haven’t been labelled. This is a common scenario in real-world projects because <span style="color:red">annotating text is usually time-consuming and requires domain expertise</span>. <span style="color:green">For this use case, the `zero-shot-classification` pipeline is very powerful: it allows you to specify which labels to use for the classification, so you don’t have to rely on the labels of the pretrained model</span>. You’ve already seen how the model can classify a sentence as positive or negative using those two labels, but it can also classify the text using any other set of labels you like.

In [27]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445950150489807, 0.11197733134031296, 0.043427735567092896]}

In [33]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    ["Can you please cancel the booking as I am stuck in traffic?", "I am on the way"],
    candidate_labels=["cancel", "not cancel"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'sequence': 'Can you please cancel the booking as I am stuck in traffic?',
  'labels': ['cancel', 'not cancel'],
  'scores': [0.9738192558288574, 0.026180660352110863]},
 {'sequence': 'I am on the way',
  'labels': ['not cancel', 'cancel'],
  'scores': [0.8968855738639832, 0.10311441868543625]}]

In [35]:
classifier = pipeline("zero-shot-classification")
classifier(
    ["Can you please cancel the booking as I am stuck in traffic?", "I am on the way"],
    candidate_labels=["cancel"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'sequence': 'Can you please cancel the booking as I am stuck in traffic?',
  'labels': ['cancel'],
  'scores': [0.9704782366752625]},
 {'sequence': 'I am on the way',
  'labels': ['cancel'],
  'scores': [0.0013881203485652804]}]

<span style="color:blue">This pipeline is called <i>zero-shot</i> because you don’t need to fine-tune the model on your data to use it</span>. It can directly return probability scores for any list of labels you want!
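
In the outputs above, the scores add up to 1 when several candidate labels are given, while the single-label scores are independent of each other. One plausible reading, sketched below with made-up logits, is that the underlying model scores how strongly the text entails each label: with several labels the entailment scores are normalised across the labels, and with a single label entailment is compared against contradiction. The numbers and variable names are assumptions for illustration only.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up entailment logits, one per candidate label.
entailment_logits = {"education": 3.0, "business": 1.0, "politics": 0.0}

# Several candidate labels: normalise the entailment logits across the labels.
labels = list(entailment_logits)
scores = softmax([entailment_logits[label] for label in labels])
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)

# A single candidate label: compare entailment against contradiction instead.
contradiction_logit, entailment_logit = -2.0, 1.5  # made-up values
single_label_score = softmax([contradiction_logit, entailment_logit])[1]

print(ranked)
print(single_label_score)
```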

## Text generation

Now let’s see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. This is similar to the predictive text feature that is found on many phones. <b>Text generation involves randomness, so it’s normal if you don’t get the same results every time you run it, or the same results as shown below.</b>

You can control how many different sequences are generated with the argument `num_return_sequences` and the total length of the output text with the argument `max_length`.

In [38]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to build out your app and test it in any of the available programming languages. All you need are some Python libraries and your favourite Ruby scripts. We will take you through the steps to launch your app.'}]

In [42]:
generator("Later today, I will go to gym to")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Later today, I will go to gym to look for new bodybuilding books and to look for articles on weight loss. This site may not be up for your reading pleasure.'}]

In [44]:
generator("Later today, I will go to gym to", num_return_sequences=3, max_length=15)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Later today, I will go to gym to make sure my life is better'},
 {'generated_text': 'Later today, I will go to gym to make sure everyone is getting in'},
 {'generated_text': 'Later today, I will go to gym to get as much attention as I'}]
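
To make the role of randomness, `num_return_sequences`, and `max_length` more concrete, here is a toy sampler in plain Python. It is a sketch under loud assumptions: the word list and weights are invented, whitespace-separated words stand in for tokens, and `toy_generate` is a hypothetical function, not part of the library.

```python
import random

# Hypothetical next-word distribution; a real model recomputes it at every step.
WORDS = ["run", "lift", "swim", "stretch", "rest"]
WEIGHTS = [0.35, 0.25, 0.2, 0.1, 0.1]


def toy_generate(prompt, max_length, num_return_sequences, seed=None):
    """Sample continuations until each sequence has max_length words in total."""
    rng = random.Random(seed)  # a fixed seed makes the sampling repeatable
    outputs = []
    for _ in range(num_return_sequences):
        words = prompt.split()
        while len(words) < max_length:
            words.append(rng.choices(WORDS, weights=WEIGHTS)[0])
        outputs.append({"generated_text": " ".join(words)})
    return outputs


samples = toy_generate(
    "Later today, I will go to gym to",
    max_length=12,
    num_return_sequences=3,
    seed=0,
)
for sample in samples:
    print(sample)
```

Without a seed, each call draws different words, which mirrors why the generated text in this notebook changes from run to run.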

## Using any model from the Hub in a pipeline

The previous examples used the default model for the task at hand, but you can also choose a particular model from the Hub to use in a pipeline for a specific task — say, text generation. Go to the [Model Hub](https://huggingface.co/models) and click on the corresponding tag on the left to display only the supported models for that task. You should get to a page like [this one](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending).

Let’s try the [distilgpt2](https://huggingface.co/distilgpt2) model! Here’s how to load it in the same pipeline as before:

In [46]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to apply your principles and techniques to learning from traditional and emerging business leaders, as well as building relationships that'},
 {'generated_text': 'In this course, we will teach you how to read a few of my favorite Japanese prose from Hainan-Won and I love talking about'}]

You can refine your search for a model by clicking on the language tags, and pick a model that will generate text in another language. The Model Hub even contains checkpoints for multilingual models that support several languages.

Once you select a model by clicking on it, you’ll see that there is a widget enabling you to try it directly online. This way you can quickly test the model’s capabilities before downloading it.

- [bloom](https://huggingface.co/bigscience/bloom?text=%E0%AE%8F%E0%AE%A4%E0%AF%8B+%E0%AE%92%E0%AE%A9%E0%AF%8D%E0%AE%B1%E0%AF%81+%E0%AE%8E%E0%AE%A9%E0%AF%8D%E0%AE%A9%E0%AF%88+%E0%AE%A4%E0%AE%BE%E0%AE%95%E0%AF%8D%E0%AE%95+%E0%AE%AF%E0%AE%BE%E0%AE%B0%E0%AF%8B) - 176B params - Takes too long to load, and the checkpoint files are very large
- [bloom560m](https://huggingface.co/bigscience/bloom-560m?text=%E0%AE%8F%E0%AE%A4%E0%AF%8B+%E0%AE%92%E0%AE%A9%E0%AF%8D%E0%AE%B1%E0%AF%81+%E0%AE%8E%E0%AE%A9%E0%AF%8D%E0%AE%A9%E0%AF%88+%E0%AE%A4%E0%AE%BE%E0%AE%95%E0%AF%8D%E0%AE%95+%E0%AE%AF%E0%AE%BE%E0%AE%B0%E0%AF%8B) - 559M params

In [54]:
generator = pipeline("text-generation", model="bigscience/bloom-560m")
generator(
    "ஏதோ ஒன்று என்னை தாக்க யாரோ",
    max_new_tokens=10,
    num_return_sequences=1,
)

[{'generated_text': 'ஏதோ ஒன்று என்னை தாக்க யாரோ என்னை தாக்கி தாக்க'}]

### The Inference API

All the models can be tested directly through your browser using the Inference API. You can play with a model directly on its Hub page by inputting custom text and watching the model process the input data.

## Mask filling

The next pipeline you’ll try is `fill-mask`. The idea of this task is to fill in the blanks in a given text:

In [56]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.1961977630853653,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052729532122612,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

The `top_k` argument controls how many possibilities (predictions) you want to be displayed. <span style="color:blue">Note that here the model fills in the special `<mask>` word, which is often referred to as a <b><i>mask token</i></b>. Other mask-filling models might have different mask tokens, so it’s always good to verify the proper mask word when exploring other models</span>. One way to check it is by looking at the mask word used in the widget.

In [63]:
# tokenizer details
print(unmasker.tokenizer)
# Vocab dict
list(unmasker.tokenizer.vocab.items())[:5]

RobertaTokenizerFast(name_or_path='distilroberta-base', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)}, clean_up_tokenization_spaces=True)


[('heavy', 18888),
 ('Ġfact', 754),
 ('Ġhusbands', 27718),
 ('ĠSalvation', 21435),
 ('Ġworkforce', 6862)]
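
The effect of `top_k`, and the reason the mask word has to match the model, can be sketched in plain Python. The first two probabilities below are rounded from the output above; the other two are invented, and `top_k_fill` is a hypothetical helper rather than a library function. The `[MASK]` string in the last lines is the mask token used by BERT-style models.

```python
import heapq

# First two values are rounded from the output above; the last two are made up.
candidate_probs = {
    " mathematical": 0.196,
    " computational": 0.041,
    " statistical": 0.030,
    " predictive": 0.020,
}


def top_k_fill(template, probs, top_k, mask_token="<mask>"):
    """Return the top_k most probable fillings of the mask token in template."""
    if mask_token not in template:
        # Mirrors the advice above: the mask word must be the one the model expects.
        raise ValueError(f"No {mask_token} token found in the input text")
    best = heapq.nlargest(top_k, probs.items(), key=lambda item: item[1])
    return [
        {
            "score": score,
            "token_str": token,
            "sequence": template.replace(mask_token, token.strip()),
        }
        for token, score in best
    ]


predictions = top_k_fill(
    "This course will teach you all about <mask> models.", candidate_probs, top_k=2
)
print(predictions)

# A template written for a different model's mask token is rejected.
try:
    top_k_fill("This course will teach you all about [MASK] models.", candidate_probs, top_k=2)
    wrong_mask_rejected = False
except ValueError:
    wrong_mask_rejected = True
```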

## Named entity recognition

Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations. Let’s look at an example:

In [64]:
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796021,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

Here the model correctly identified that Sylvain is a person (PER), Hugging Face an organization (ORG), and Brooklyn a location (LOC).

<span style="color:green">We pass the option `grouped_entities=True` in the pipeline creation function to tell the pipeline to regroup together the parts of the sentence that correspond to the same entity: here the model correctly grouped “Hugging” and “Face” as a single organization, even though the name consists of multiple words</span>. In fact, as we will see in the next chapter, the preprocessing even splits some words into smaller parts. For instance, Sylvain is split into four pieces: S, ##yl, ##va, and ##in. In the post-processing step, the pipeline successfully regrouped those pieces.
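
The regrouping described above can be sketched as follows. This is a simplified illustration, not the pipeline's actual post-processing: the tags are plain `PER`/`ORG`/`LOC`/`O` labels, the tagged pieces are written by hand (only the split of Sylvain comes from the text above), and `group_entities` here is a hypothetical function.

```python
def group_entities(tagged_pieces):
    """Merge consecutive pieces that carry the same entity tag into one entry."""
    groups = []
    previous_tag = "O"
    for piece, tag in tagged_pieces:
        if tag != "O":
            if piece.startswith("##") and tag == previous_tag:
                # Subword continuation: glue it onto the current word.
                groups[-1]["word"] += piece[2:]
            elif tag == previous_tag:
                # New word belonging to the same entity.
                groups[-1]["word"] += " " + piece
            else:
                # A new entity starts here.
                groups.append({"entity_group": tag, "word": piece})
        previous_tag = tag
    return groups


# Hand-written model output for illustration; "O" marks pieces outside any entity.
tagged = [
    ("My", "O"), ("name", "O"), ("is", "O"),
    ("S", "PER"), ("##yl", "PER"), ("##va", "PER"), ("##in", "PER"),
    ("and", "O"), ("I", "O"), ("work", "O"), ("at", "O"),
    ("Hugging", "ORG"), ("Face", "ORG"),
    ("in", "O"), ("Brooklyn", "LOC"), (".", "O"),
]

entities = group_entities(tagged)
print(entities)
```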

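For part-of-speech (POS) tagging in English, Flair provides a default model: https://huggingface.co/flair/pos-english. It is meant to be loaded with the Flair library, and `"pos"` is not a `pipeline()` task name, so the attempt below is left commented out and a token-classification model from the Hub is used instead.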

In [71]:
# from transformers import pipeline

# pos = pipeline("pos", "flair/pos-english")
# pos("My name is Sylvain and I work at Hugging Face in Brooklyn.")

In [76]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

model_name = "QCRI/bert-base-multilingual-cased-pos-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Use a distinct name so the pipeline() function imported earlier is not overwritten
pos_tagger = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
outputs = pos_tagger("My name is Clara and I live in Berkeley, California.")
for output in outputs:
    print(output)

{'entity': 'PRP$', 'score': 0.9995173, 'index': 1, 'word': 'My', 'start': 0, 'end': 2}
{'entity': 'NN', 'score': 0.9995148, 'index': 2, 'word': 'name', 'start': 3, 'end': 7}
{'entity': 'VBZ', 'score': 0.99945337, 'index': 3, 'word': 'is', 'start': 8, 'end': 10}
{'entity': 'NNP', 'score': 0.99843496, 'index': 4, 'word': 'Clara', 'start': 11, 'end': 16}
{'entity': 'CC', 'score': 0.99965525, 'index': 5, 'word': 'and', 'start': 17, 'end': 20}
{'entity': 'PRP', 'score': 0.9996451, 'index': 6, 'word': 'I', 'start': 21, 'end': 22}
{'entity': 'VBP', 'score': 0.99787474, 'index': 7, 'word': 'live', 'start': 23, 'end': 27}
{'entity': 'IN', 'score': 0.9996462, 'index': 8, 'word': 'in', 'start': 28, 'end': 30}
{'entity': 'NNP', 'score': 0.9988323, 'index': 9, 'word': 'Berkeley', 'start': 31, 'end': 39}
{'entity': ',', 'score': 0.9998846, 'index': 10, 'word': ',', 'start': 39, 'end': 40}
{'entity': 'NNP', 'score': 0.9995912, 'index': 11, 'word': 'California', 'start': 41, 'end': 51}
{'entity': '.',

## Question answering

The `question-answering` pipeline answers questions using information from a given context:

In [77]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6949763894081116, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

<span style="color:blue">Note that this pipeline works by extracting information from the provided context; it does not generate the answer.</span>

In [83]:
# Example: Struggling
question_answerer(
    question="Where do I live?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

{'score': 0.47422829270362854,
 'start': 33,
 'end': 57,
 'answer': 'Hugging Face in Brooklyn'}

In [84]:
# Example: Completely off
question_answerer(
    question="What day is today?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

{'score': 0.22079642117023468,
 'start': 11,
 'end': 57,
 'answer': 'Sylvain and I work at Hugging Face in Brooklyn'}
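
Because the pipeline extracts rather than generates, each answer above is simply the slice of the context between the returned `start` and `end` character offsets. A short check in plain Python, using the offsets copied from the three outputs above:

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# start/end offsets copied from the three outputs above.
predictions = [
    {"question": "Where do I work?", "start": 33, "end": 45},
    {"question": "Where do I live?", "start": 33, "end": 57},
    {"question": "What day is today?", "start": 11, "end": 57},
]

# Slicing the context with the offsets reproduces each returned answer.
answers = [context[p["start"]:p["end"]] for p in predictions]
for prediction, answer in zip(predictions, answers):
    print(f"{prediction['question']} -> {answer!r}")
```

This also shows why the last two examples go wrong: the context contains no span that answers those questions, yet the pipeline still returns some span, with a lower score.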

## Summarization

Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text. Like with text generation, you can specify a `max_length` or a `min_length` for the result.

Here’s an example:

In [85]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' The number of engineering graduates in the United States has declined in recent years . China and India graduate six and eight times as many traditional engineers as the U.S. does . Rapidly developing economies such as China continue to encourage and advance the teaching of engineering . There are declining offerings in engineering subjects dealing with infrastructure, infrastructure, the environment, and related issues .'}]

## Translation

For translation, you can use a default model if you provide a language pair in the task name (such as `"translation_en_to_fr"`), but the easiest way is to pick the model you want to use on the Model Hub. 

Like with text generation and summarization, you can specify a `max_length` or a `min_length` for the result.

Here we’ll try translating from French to English:

In [86]:
from transformers import pipeline

translator = pipeline(task="translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")



[{'translation_text': 'This course is produced by Hugging Face.'}]

In [88]:
translator = pipeline(task="translation", model="facebook/nllb-200-distilled-600M")
translator("Ce cours est produit par Hugging Face.", src_lang="french", tgt_lang="english")

[{'translation_text': 'This course is produced by Hugging Face.'}]

In [90]:
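# Output is still English, likely because NLLB expects FLORES-200 codes (e.g. "fra_Latn", "tam_Taml"), not language names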
translator = pipeline(task="translation", model="facebook/nllb-200-distilled-600M")
translator("Ce cours est produit par Hugging Face.", src_lang="french", tgt_lang="tamil")

[{'translation_text': 'This course is produced by Hugging Face.'}]

In [91]:
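# Output is still English, likely because NLLB expects FLORES-200 codes (e.g. "fra_Latn", "spa_Latn"), not language names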
translator = pipeline(task="translation", model="facebook/nllb-200-distilled-600M")
translator("Ce cours est produit par Hugging Face.", src_lang="french", tgt_lang="spanish")

[{'translation_text': 'This course is produced by Hugging Face.'}]
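
The NLLB checkpoints identify languages by FLORES-200 codes such as `fra_Latn` rather than by language names, which would explain why the two calls above still return English. A minimal sketch of a lookup helper follows. `FLORES_CODES` and `nllb_lang_kwargs` are hypothetical names introduced here, the table covers only the four languages used in this notebook, and the translated output is not shown because it was not run for this notebook.

```python
# FLORES-200 codes for the languages used in this notebook.
FLORES_CODES = {
    "english": "eng_Latn",
    "french": "fra_Latn",
    "spanish": "spa_Latn",
    "tamil": "tam_Taml",
}


def nllb_lang_kwargs(source, target):
    """Map language names to the src_lang/tgt_lang codes NLLB expects (hypothetical helper)."""
    try:
        return {"src_lang": FLORES_CODES[source], "tgt_lang": FLORES_CODES[target]}
    except KeyError as missing:
        raise ValueError(f"No FLORES-200 code registered for {missing}") from None


kwargs = nllb_lang_kwargs("french", "tamil")
print(kwargs)

# Intended usage with the translator created above (needs the model to be downloaded):
# translator("Ce cours est produit par Hugging Face.", **kwargs)
```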

The pipelines shown so far are mostly for demonstrative purposes. They were programmed for specific tasks and cannot perform variations of them. In the next chapter, you’ll learn what’s inside a `pipeline()` function and how to customize its behavior.