In [1]:
# The most basic object in the 🤗 Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, so we can pass in raw text directly and get an intelligible answer:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598048329353333}]
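The log above warns against relying on the default model in production. A minimal sketch of pinning both the checkpoint and the revision, using the values reported in that log. The names `PINNED` and `build_classifier` are hypothetical and introduced only for illustration; the function is defined but not called here, so nothing is downloaded until you invoke it.

```python
# Model name and revision are copied from the log message above.
PINNED = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_classifier():
    """Create the pipeline with an explicit model and revision (hypothetical helper)."""
    # Imported lazily so that defining this helper does not load the library.
    from transformers import pipeline

    return pipeline(**PINNED)
```

Calling `build_classifier()` should give a classifier equivalent to the default one used above, without depending on whatever default the library ships in the future.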

In [3]:
classifier("Energy Transfer Hikes Its Distribution And Guides For More: Thesis Re-Examined")

[{'label': 'POSITIVE', 'score': 0.9965259432792664}]

In [5]:
classifier("AT&T: The More You Sell, The More I Buy")

[{'label': 'POSITIVE', 'score': 0.9965259432792664}]

In [6]:
# Pass a list to classify multiple sentences in one call

classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
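When a list is passed in, the results come back in the same order as the inputs, so they can be paired up directly. A small sketch of that post-processing, reusing the output shown above as literal data; `label_sentences` and its `min_score` threshold are hypothetical names chosen for this example.

```python
sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
# Results copied from the output above.
results = [
    {"label": "POSITIVE", "score": 0.9598048329353333},
    {"label": "NEGATIVE", "score": 0.9994558691978455},
]


def label_sentences(sentences, results, min_score=0.5):
    """Pair each input with its predicted label, keeping only confident predictions."""
    return [
        (text, res["label"])
        for text, res in zip(sentences, results)
        if res["score"] >= min_score
    ]


for text, label in label_sentences(sentences, results):
    print(f"{label:8} {text}")
```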

# Zero-shot classification

Next, we’ll tackle a more challenging task: classifying texts that haven’t been labelled. This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise. For this use case, the zero-shot-classification pipeline is very powerful: it allows you to specify which labels to use for the classification, so you don’t have to rely on the labels of the pretrained model. You’ve already seen a model classify a sentence as positive or negative using those two fixed labels; the zero-shot pipeline can classify text using any other set of labels you like.

This pipeline is called zero-shot because you don’t need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want!


In [7]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445989489555359, 0.11197412759065628, 0.04342695698142052]}
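The result is a dictionary whose `labels` and `scores` lists are aligned position by position. A short sketch of reading it, using the output above as literal data; `top_label` is a hypothetical helper, not part of the library.

```python
# Result copied from the output above.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8445989489555359, 0.11197412759065628, 0.04342695698142052],
}


def top_label(result):
    """Return the (label, score) pair with the highest score."""
    return max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])


label, score = top_label(result)
print(f"{label}: {score:.3f}")

# In this output the scores behave like probabilities over the candidate labels.
print(f"sum of scores: {sum(result['scores']):.6f}")
```

In this particular output the labels are already sorted by decreasing score, so `result["labels"][0]` gives the same answer; the helper simply avoids relying on that ordering.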

# Text generation
Now let’s see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. This is similar to the predictive text feature that is found on many phones. Text generation involves randomness, so it’s normal if you don’t get the same results as shown below.

In [10]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to play the "Game" game and what can be done to create a good life with fun at a great price. You will also teach you all about the great games you can play, and how to'}]
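To illustrate why the output varies between runs, here is a toy sketch of sampling the next word from a probability distribution. The vocabulary and probabilities are invented for illustration and have nothing to do with GPT-2's real distribution; `sample_words` is a hypothetical helper.

```python
import random

# Toy next-word distribution; the words and probabilities are made up.
NEXT_WORD_PROBS = {"use": 0.4, "build": 0.3, "train": 0.2, "play": 0.1}


def sample_words(n, seed=None):
    """Draw n words at random according to NEXT_WORD_PROBS."""
    rng = random.Random(seed)
    words = list(NEXT_WORD_PROBS)
    weights = list(NEXT_WORD_PROBS.values())
    return rng.choices(words, weights=weights, k=n)


print(sample_words(5))          # may differ from run to run
print(sample_words(5, seed=0))  # identical on every run for a fixed seed
```

The same idea applies to the real pipeline: fixing the random seed before generating (for example with the `set_seed` helper from `transformers`) should make the results repeatable on a given setup.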

# Using any model from the Hub in a pipeline

The previous examples used the default model for each task, but you can also choose a particular model from the Hub to use in a pipeline for a specific task, such as text generation. Go to the Model Hub and click the corresponding tag on the left to display only the models supported for that task. The example below uses the distilgpt2 model.

In [9]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use PHP to perform functions that are set up in the constructor.\n\n\n\n\n\n\n'},
 {'generated_text': "In this course, we will teach you how to get involved and learn a lot more about the nature of digital media. You'll also learn about how"}]
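With `num_return_sequences=2` the pipeline returns two dictionaries, and each `generated_text` starts with the original prompt. A sketch of extracting only the newly generated part, using the output above as literal data; `continuations` is a hypothetical helper.

```python
prompt = "In this course, we will teach you how to"
# Results copied from the output above.
results = [
    {"generated_text": "In this course, we will teach you how to use PHP to perform functions that are set up in the constructor.\n\n\n\n\n\n\n"},
    {"generated_text": "In this course, we will teach you how to get involved and learn a lot more about the nature of digital media. You'll also learn about how"},
]


def continuations(prompt, results):
    """Strip the prompt and surrounding whitespace from each generated text."""
    out = []
    for res in results:
        text = res["generated_text"]
        if text.startswith(prompt):
            text = text[len(prompt):]
        out.append(text.strip())
    return out


for i, text in enumerate(continuations(prompt, results), start=1):
    print(f"{i}. {text}")
```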

# The Inference API

All the models can be tested directly in your browser using the Inference API, which is available on the Hugging Face website. You can try a model directly on its Hub page by entering custom text and watching the model process the input.

The Inference API that powers the widget is also available as a paid product, which comes in handy if you need it for your workflows. See the pricing page for more details.

https://huggingface.co/pricing



# Mask filling
The next pipeline you’ll try is fill-mask. The idea of this task is to fill in the blanks in a given text.

The top_k argument controls how many possibilities you want to be displayed. Note that here the model fills in the special <mask> word, which is often referred to as a mask token. Other mask-filling models might have different mask tokens, so it’s always good to verify the proper mask word when exploring other models. One way to check it is by looking at the mask word used in the widget.



In [14]:
unmasker = pipeline("fill-mask")
unmasker("This will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.17133232951164246,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This will teach you all about mathematical models.'},
 {'score': 0.037065230309963226,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This will teach you all about computational models.'}]

In [17]:
unmasker = pipeline('fill-mask', model='bert-base-cased')
unmasker("This will teach you all about [MASK] models.", top_k=2)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.2097155600786209,
  'token': 1648,
  'token_str': 'role',
  'sequence': 'This will teach you all about role models.'},
 {'score': 0.13849380612373352,
  'token': 1103,
  'token_str': 'the',
  'sequence': 'This will teach you all about the models.'}]
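The two examples above use different mask tokens (`<mask>` and `[MASK]`). One way to avoid mixing them up is to build the prompt from a template, as in the sketch below. `MASK_TOKENS` and `masked_prompt` are hypothetical names, and the table only covers the two models shown in this section.

```python
# Mask tokens as used in the two examples above.
MASK_TOKENS = {
    "distilroberta-base": "<mask>",
    "bert-base-cased": "[MASK]",
}


def masked_prompt(template, model_name):
    """Fill the {mask} placeholder with the mask token of the given model."""
    return template.format(mask=MASK_TOKENS[model_name])


template = "This will teach you all about {mask} models."
for name in MASK_TOKENS:
    print(f"{name}: {masked_prompt(template, name)}")
```

For a model that is not in the table, the pipeline's tokenizer should report the right token through its `mask_token` attribute (for example `unmasker.tokenizer.mask_token`).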

# Named entity recognition

Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations. In the example below, the model correctly identifies that Sylvain is a person (PER), Hugging Face an organization (ORG), and Brooklyn a location (LOC).

We pass the option grouped_entities=True in the pipeline creation function to tell the pipeline to regroup the parts of the sentence that correspond to the same entity: here the model correctly grouped “Hugging” and “Face” as a single organization, even though the name consists of multiple words. In fact, as we will see in the next chapter, the preprocessing even splits some words into smaller parts. For instance, Sylvain is split into four pieces: S, ##yl, ##va, and ##in. In the post-processing step, the pipeline successfully regrouped those pieces. Note that recent versions of 🤗 Transformers deprecate grouped_entities in favor of aggregation_strategy="simple".

In [18]:
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
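The regrouping of word pieces described above can be sketched in a few lines. This is a simplified illustration, not the pipeline's actual implementation: it only glues `##` continuation pieces back onto the previous piece, and it uses the `start`/`end` offsets from the output above to recover each entity from the original text. `merge_wordpieces` is a hypothetical helper.

```python
def merge_wordpieces(pieces):
    """Glue '##' continuation pieces back onto the preceding piece."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words


print(merge_wordpieces(["S", "##yl", "##va", "##in"]))

# The offsets in the output above index into the original string.
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."
entities = [
    {"entity_group": "PER", "start": 11, "end": 18},
    {"entity_group": "ORG", "start": 33, "end": 45},
    {"entity_group": "LOC", "start": 49, "end": 57},
]
spans = [(e["entity_group"], text[e["start"]:e["end"]]) for e in entities]
print(spans)
```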

In [22]:
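# With only a model given, the task is inferred from the model's Hub metadata.
# This checkpoint tags each token with its part of speech (PRP$, NN, VBZ, ...).
pos = pipeline(model="QCRI/bert-base-multilingual-cased-pos-english")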
pos("My name is Sylvain and I work at Hugging Face in Brooklyn.")



[{'entity': 'PRP$',
  'score': 0.99944586,
  'index': 1,
  'word': 'My',
  'start': 0,
  'end': 2},
 {'entity': 'NN',
  'score': 0.9995615,
  'index': 2,
  'word': 'name',
  'start': 3,
  'end': 7},
 {'entity': 'VBZ',
  'score': 0.9995523,
  'index': 3,
  'word': 'is',
  'start': 8,
  'end': 10},
 {'entity': 'NNP',
  'score': 0.9963742,
  'index': 4,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity': 'CC',
  'score': 0.9996537,
  'index': 5,
  'word': 'and',
  'start': 19,
  'end': 22},
 {'entity': 'PRP',
  'score': 0.99956113,
  'index': 6,
  'word': 'I',
  'start': 23,
  'end': 24},
 {'entity': 'VBP',
  'score': 0.9976095,
  'index': 7,
  'word': 'work',
  'start': 25,
  'end': 29},
 {'entity': 'IN',
  'score': 0.999801,
  'index': 8,
  'word': 'at',
  'start': 30,
  'end': 32},
 {'entity': 'NNP',
  'score': 0.99481,
  'index': 9,
  'word': 'Hu',
  'start': 33,
  'end': 35},
 {'entity': 'NNP',
  'score': 0.9181016,
  'index': 10,
  'word': '##gging',
  'start': 35,
  'end'

# Question answering

The question-answering pipeline answers questions using information from a given context. 

Note that this pipeline works by extracting information from the provided context; it does not generate the answer.

In [23]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6949767470359802, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
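Because the pipeline extracts rather than generates, the answer is a literal slice of the context. A quick check using the output above as data:

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"
# Result copied from the output above.
result = {"score": 0.6949767470359802, "start": 33, "end": 45, "answer": "Hugging Face"}

# 'start' and 'end' are character offsets into the context string.
extracted = context[result["start"]:result["end"]]
print(extracted)
print(extracted == result["answer"])
```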

# Summarization

Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text. Here’s an example:

In [24]:
summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]
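The raw summary above contains stray spacing (a leading space, spaces before periods, and a run of spaces carried over from the indented input). If you need cleaner text, a small post-processing step along these lines may help; `tidy_summary` is a hypothetical helper and the rules are a rough heuristic, not something the pipeline provides.

```python
import re


def tidy_summary(text):
    """Collapse whitespace runs and remove spaces before punctuation."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


# Excerpt of the summary_text shown above.
raw = " America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering ."
print(tidy_summary(raw))
```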

# Translation

For translation, you can use a default model if you provide a language pair in the task name (such as "translation_en_to_fr"), but the easiest way is to pick the model you want to use on the Model Hub. Here we’ll try translating from French to English:


In [26]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")



[{'translation_text': 'This course is produced by Hugging Face.'}]
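Both ways of selecting a translation model depend on a language pair. The sketch below builds the task name in the `translation_en_to_fr` style mentioned above, and a model identifier following the pattern of the checkpoint used in this example. Both helpers are hypothetical, and the model naming pattern is an assumption: not every language pair necessarily has a matching checkpoint on the Hub.

```python
def translation_task(src, tgt):
    """Build a task name such as 'translation_en_to_fr'."""
    return f"translation_{src}_to_{tgt}"


def opus_mt_model(src, tgt):
    """Guess a model id in the style of 'Helsinki-NLP/opus-mt-fr-en'.

    Assumption: the checkpoint for this pair exists; check the Model Hub first.
    """
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


print(translation_task("en", "fr"))
print(opus_mt_model("fr", "en"))
```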