## Working with pipelines 

The most basic object in the 🤗 Transformers library is the `pipeline()` function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer:

In [2]:
from transformers import pipeline

In [2]:
classifier = pipeline('sentiment-analysis')
classifier('I`ve been looking to build a silicon valley in Punjab')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9512071013450623}]

We can even pass several sentences at once; the pipeline returns one result per input:

In [5]:
classifier(['I`ve been looking to build a silicon valley in Punjab','Indian government is worst'])

[{'label': 'POSITIVE', 'score': 0.9512071013450623},
 {'label': 'NEGATIVE', 'score': 0.9997968077659607}]

By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when you create the classifier object. If you rerun the command, the cached model will be used instead and there is no need to download the model again.
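
The log above warns against relying on the default model in production. A minimal sketch of pinning the checkpoint is shown below. The model ID and revision are copied from that log; `build_classifier` is a hypothetical helper name, and the import is placed inside it only so that defining the helper does not load the library.

```python
# Checkpoint and revision copied from the "No model was supplied" log above.
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "714eb0f"


def build_classifier():
    """Hypothetical helper: create the sentiment pipeline with a pinned checkpoint."""
    # Imported here so the library is only loaded when the pipeline is needed.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_ID, revision=REVISION)


# Usage (downloads the model on the first call, then reuses the local cache):
# classifier = build_classifier()
# classifier("I`ve been looking to build a silicon valley in Punjab")
```

Pinning both values should keep results reproducible even if the library's default checkpoint changes later.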

There are three main steps involved when you pass some text to a pipeline: 
* The text is processed into a format the model can understand
* The preprocessed inputs are passed to the model
* The predictions of the model are post-processed, so we can make sense of them
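
The three steps above can be sketched with a toy stand-in. Everything in this block is invented for illustration (the vocabulary, the weights, and the function names); a real pipeline uses a trained tokenizer and a neural network. Only the shape of the flow is the point: preprocess, run the model, post-process.

```python
import math

# Toy vocabulary and per-token weights; both are invented for illustration.
VOCAB = {"<unk>": 0, "great": 1, "love": 2, "terrible": 3, "hate": 4}
# One (negative, positive) weight pair per token ID.
WEIGHTS = [(0.0, 0.0), (-1.0, 2.0), (-1.5, 2.5), (2.0, -1.0), (2.5, -1.5)]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token IDs the 'model' understands."""
    words = text.lower().replace(".", " ").replace("!", " ").split()
    return [VOCAB.get(w, VOCAB["<unk>"]) for w in words]


def model(token_ids):
    """Step 2: map token IDs to raw scores (logits), one per label."""
    neg = sum(WEIGHTS[i][0] for i in token_ids)
    pos = sum(WEIGHTS[i][1] for i in token_ids)
    return [neg, pos]


def postprocess(logits):
    """Step 3: softmax the logits and report the most likely label."""
    exps = [math.exp(x - max(logits)) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, mirroring what a pipeline call does."""
    return postprocess(model(preprocess(text)))


print(toy_pipeline("I love this course"))
print(toy_pipeline("I hate waiting"))
```

The returned dictionary deliberately has the same `label` and `score` keys as the classifier output shown earlier.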

#### Let’s have a look at a few more pipelines

## Zero-shot classification 

We’ll start by tackling a more challenging task where we need to classify texts that haven’t been labelled. This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise. For this use case, the zero-shot-classification pipeline is very powerful: it allows you to specify which labels to use for the classification, so you don’t have to rely on the labels of the pretrained model. You’ve already seen how the model can classify a sentence as positive or negative using those two labels — but it can also classify the text using any other set of labels you like.

In [6]:
classifier = pipeline('zero-shot-classification')

No model was supplied, defaulted to FacebookAI/roberta-large-mnli and revision 2a8f12d (https://huggingface.co/FacebookAI/roberta-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [10]:
classifier("I`ve been looking to build a silicon valley in Punjab.",
          candidate_labels=['education', 'politics', 'business'])

{'sequence': 'I`ve been looking to build a silicon valley in Punjab.',
 'labels': ['business', 'politics', 'education'],
 'scores': [0.8677144050598145, 0.08291906118392944, 0.04936656355857849]}

This pipeline is called zero-shot because you don’t need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want!
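
As a small illustration of how to read the result, the sketch below works on the dictionary printed above (copied verbatim, so no model is needed to run it). The variable names are arbitrary.

```python
# Output copied from the zero-shot call above.
result = {
    "sequence": "I`ve been looking to build a silicon valley in Punjab.",
    "labels": ["business", "politics", "education"],
    "scores": [0.8677144050598145, 0.08291906118392944, 0.04936656355857849],
}

# "labels" and "scores" are parallel lists; here they are ordered from the
# most likely label to the least likely one.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]

for label, score in ranked:
    print(f"{label:<10} {score:.3f}")

# In this output the scores add up to roughly 1, so they can be read as a
# probability distribution over the candidate labels.
total = sum(result["scores"])
```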

## Text Generation
The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. This is similar to the predictive text feature that is found on many phones.

In [26]:
generator = pipeline('text-generation', model="distilgpt2")

In [24]:
# Using the default model (pipeline created without a `model` argument)
res = generator("In this course, we will teach you how to", max_length=30, num_return_sequences=3)
print(res)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to write your own program by following and using common Java methods provided in the Java Language Reference Manual.\n'}, {'generated_text': 'In this course, we will teach you how to use the basic functions of a computer to build a data storage application.\n\nNote that the tutorial'}, {'generated_text': 'In this course, we will teach you how to set your own alarms so that you no longer have to worry about putting it off when you open,'}]


In [27]:
# Using distilgpt2
res = generator("In this course, we will teach you how to", max_length=30, num_return_sequences=3)
print(res)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to become a successful consumer, how to create value.\n\n\n\nWhat is your next product?'}, {'generated_text': 'In this course, we will teach you how to set a specific goal. In this course, we will explore the concept of the "goal" so'}, {'generated_text': "In this course, we will teach you how to get you started applying for the right to do business online, so it doesn't take a long time"}]
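
Each `generated_text` above starts with the prompt itself. If only the newly generated part is wanted, one option is to slice the prompt off, as in this sketch. It uses two of the generations printed above, copied verbatim, so it runs without a model.

```python
prompt = "In this course, we will teach you how to"

# Two of the distilgpt2 generations printed above.
res = [
    {"generated_text": "In this course, we will teach you how to become a successful consumer, how to create value.\n\n\n\nWhat is your next product?"},
    {"generated_text": "In this course, we will teach you how to get you started applying for the right to do business online, so it doesn't take a long time"},
]

# Drop the prompt prefix and surrounding whitespace from every generation.
continuations = [r["generated_text"][len(prompt):].strip() for r in res]

for text in continuations:
    print(repr(text))
```

The text-generation pipeline also accepts `return_full_text=False`, which should give the continuation directly.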


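## Mask Filling

The `fill-mask` pipeline predicts the word hidden behind a mask token in the input. The mask token depends on the model: `<mask>` for distilroberta-base and `[MASK]` for BERT in the examples below. The `top_k` argument sets how many candidates are returned.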

In [32]:
unmasker = pipeline('fill-mask', model='bert-base-cased')

In [29]:
# Using distilbert/distilroberta-base (a different `unmasker` from the bert-base-cased one created above; note the `<mask>` token)
unmasker("This course will teach you all about <mask> models.", top_k=2)

[{'score': 0.1961970329284668,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052671417593956,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

In [38]:
# Using Google BERT base model (cased)
unmasker("This course will teach you all about [MASK] models.", top_k=2)

[{'score': 0.2596321702003479,
  'token': 1648,
  'token_str': 'role',
  'sequence': 'This course will teach you all about role models.'},
 {'score': 0.09427190572023392,
  'token': 1103,
  'token_str': 'the',
  'sequence': 'This course will teach you all about the models.'}]

## Named Entity Recognition 
Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.

In [44]:
# `aggregation_strategy='simple'` replaces the deprecated `grouped_entities=True`
ner = pipeline('ner', model='eventdata-utd/conflibert-named-entity-recognition', aggregation_strategy='simple')

In [42]:
# Using the default model (pipeline created without a `model` argument)
ner("My name is Harpreet and I work at Google in Bangalore.")

[{'entity_group': 'PER',
  'score': 0.9982609,
  'word': 'Harpreet',
  'start': 11,
  'end': 19},
 {'entity_group': 'ORG',
  'score': 0.9989806,
  'word': 'Google',
  'start': 34,
  'end': 40},
 {'entity_group': 'LOC',
  'score': 0.9986883,
  'word': 'Bangalore',
  'start': 44,
  'end': 53}]

The model correctly identified that Harpreet is a person, Google is an organization, and Bangalore is a location.
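
The `start` and `end` fields are character offsets into the input string. The sketch below uses the entity groups printed above (scores omitted) to slice the original text; no model is required to run it.

```python
text = "My name is Harpreet and I work at Google in Bangalore."

# Entity groups copied from the output above (scores omitted for brevity).
entities = [
    {"entity_group": "PER", "word": "Harpreet", "start": 11, "end": 19},
    {"entity_group": "ORG", "word": "Google", "start": 34, "end": 40},
    {"entity_group": "LOC", "word": "Bangalore", "start": 44, "end": 53},
]

# Slice the original string with the offsets to recover each entity span.
spans = [(e["entity_group"], text[e["start"]:e["end"]]) for e in entities]
print(spans)
```

Slicing by offset is handy with uncased models, whose `word` fields come back lowercased: the offsets still point at the original casing.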

In [45]:
# Using eventdata-utd/conflibert-named-entity-recognition
ner("My name is Harpreet and I work at Google in Bangalore.")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'entity_group': 'Person',
  'score': 0.97572845,
  'word': 'my',
  'start': 0,
  'end': 2},
 {'entity_group': 'Person',
  'score': 0.9741495,
  'word': 'harpreet',
  'start': 11,
  'end': 19},
 {'entity_group': 'Person',
  'score': 0.99614257,
  'word': 'i',
  'start': 24,
  'end': 25},
 {'entity_group': 'Organisation',
  'score': 0.9885367,
  'word': 'google',
  'start': 34,
  'end': 40},
 {'entity_group': 'Location',
  'score': 0.9990915,
  'word': 'bangalore',
  'start': 44,
  'end': 53}]

## Question Answering 

The question-answering pipeline answers questions using information from a given context:

In [65]:
question_answerer_default = pipeline('question-answering')

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.

In [67]:
question_answerer_default(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

{'score': 0.694976270198822, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

In [66]:
question_answerer_default(
    question='What is the bathroom south of?',
    context='The office is south of the bathroom. The kitchen is north of the bathroom.'
)

{'score': 0.49766385555267334, 'start': 37, 'end': 48, 'answer': 'The kitchen'}

In [71]:
question_answerer_default(
    question='What is the bedroom south of?', 
    context='The kitchen is north of the bathroom. The bathroom is north of the bedroom.',
)

{'score': 0.6194040775299072, 'start': 38, 'end': 50, 'answer': 'The bathroom'}

*Note*: this pipeline works by extracting the answer from the provided context; it does not generate it.
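
Because the answer is extracted, the `start` and `end` fields index directly into the context string. A small check, using the first result above (copied verbatim, so no model is needed):

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# Result copied from the first question-answering call above.
result = {"score": 0.694976270198822, "start": 33, "end": 45, "answer": "Hugging Face"}

# The answer is a slice of the context, not newly generated text.
extracted = context[result["start"]:result["end"]]
print(extracted)
```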

### Using the deepset/roberta-base-squad2 model

In [58]:
question_answerer = pipeline('question-answering', model='deepset/roberta-base-squad2')

In [68]:
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

{'score': 0.6549862623214722, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

In [70]:
question_answerer(
    question='What is the bedroom south of?', 
    context='The kitchen is north of the bathroom. The bathroom is north of the bedroom.',
)

{'score': 1.8777550394588616e-06, 'start': 28, 'end': 36, 'answer': 'bathroom'}
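
The score of the last result is close to zero, which suggests the model has little confidence in that span. One possible way to handle such cases is to discard answers below a score threshold, as sketched here. The helper `answer_or_none` and the cutoff of 0.1 are assumptions made for illustration, not part of the library.

```python
def answer_or_none(result, min_score=0.1):
    """Hypothetical helper: keep an answer only if its score clears a threshold."""
    return result["answer"] if result["score"] >= min_score else None


# Both results are copied from the deepset/roberta-base-squad2 calls above.
confident = {"score": 0.6549862623214722, "start": 33, "end": 45, "answer": "Hugging Face"}
doubtful = {"score": 1.8777550394588616e-06, "start": 28, "end": 36, "answer": "bathroom"}

print(answer_or_none(confident))
print(answer_or_none(doubtful))
```

A suitable threshold depends on the model and the data, so it is best tuned on a few known examples.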

## Summarization 
Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text. 

In [76]:
summarizer = pipeline('summarization', model='facebook/bart-large-cnn')

In [77]:
summarizer(
    """
    If you’re looking for a good opportunity to upgrade from an older pair of AirPods, 
    then some of the early Black Friday sales going on right now are where you want to be.
    Walmart, for example, has chopped down the price of a pair of Apple AirPods (third-gen) with a
    Lightning charging case to $94, which is a massive $75 off and easily the lowest price we’ve seen to date.
    
    The fourth-gen AirPods are out now — with slight audio quality and comfort improvements, 
    plus the option to get them with active noise cancellation and a wireless charging case bearing its own speaker —
    but the third-gen AirPods are still rock solid for routine music, podcasts, calls, and other listening needs.
    The charging case adds 24 hours of playtime to the earbuds’ six-hour battery life, but without the option for wireless charging,
    the Lightning connector is a bit of a downer now that Apple has moved on from it in nearly all of its latest devices. 
    Still, the price is low enough that it may be worth keeping an extra cable handy.

    And the third-gen AirPods are still great on other merits.
    They were the first base AirPods to snub the tailpipe in an overdue design refresh,
    but they’re also notable for adding spatial audio head tracking, IPX4 sweat and water resistance, precision 
    Find My, automatic and quick device pairing and switching, and other tight integrations with iPhones, iPads, Macs, 
    and other Apple devices. We considered them well-balanced at their original price, so they’re even easier 
    to recommend with today’s discount.
""", max_length=150
)


[{'summary_text': 'Walmart is offering a $75 discount on a pair of Apple AirPods with a Lightning charging case. The third-gen AirPod are still rock solid for routine music, podcasts, calls, and other listening needs. The charging case adds 24 hours of playtime to the earbuds’ six-hour battery life.'}]

## Translation

In [82]:
%pip install sentencepiece

In [5]:
translator = pipeline('translation', model='Helsinki-NLP/opus-mt-fr-en')

In [7]:
translator("Chuck Sipes a utilisé efficacement une routine d'entraînement de 3 jours, mettant l'accent sur un volume et une intensité élevés adaptés à ses commentaires et objectifs personnels.")

[{'translation_text': 'Chuck Sipes effectively used a 3-day training routine, focusing on a high volume and intensity adapted to his personal comments and goals.'}]