**Pipeline in Transformers**

A pipeline loads a model and automatically handles the preprocessing and postprocessing around it.

In [1]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis')
classifier('The model has gone crazy')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


[{'label': 'NEGATIVE', 'score': 0.9909963607788086}]

In [5]:
classifier(["I have been waiting to learn LLMs for a long time.", "I think it is a mistake."])

[{'label': 'NEGATIVE', 'score': 0.9866707921028137},
 {'label': 'NEGATIVE', 'score': 0.9997982382774353}]

Internally, a pipeline performs three steps:

1. Preprocessing: the raw text is converted into the input format the model expects.
2. Inference: the preprocessed inputs are passed to the model.
3. Postprocessing: the model's raw outputs are converted into readable predictions.
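The three stages above can be imitated with a toy example. This is only a sketch: the whitespace tokenizer, the tiny vocabulary, the word counting in `toy_model`, and all function names are made up for illustration and are not how `transformers` implements them. Only the shape of the flow (text to ids, ids to logits, logits to a labelled score) mirrors what the pipeline does.

```python
import math

# Hypothetical vocabulary: a real tokenizer ships its own with the model.
VOCAB = {"[UNK]": 0, "the": 1, "model": 2, "is": 3, "great": 4, "crazy": 5}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into the integer ids a model expects."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]


def toy_model(input_ids):
    """Step 2: map ids to one raw score (logit) per label.

    Stand-in for a neural network: it only counts two hand-picked words.
    """
    negative = sum(1.0 for i in input_ids if i == VOCAB["crazy"])
    positive = sum(1.0 for i in input_ids if i == VOCAB["great"])
    return [negative, positive]


def postprocess(logits):
    """Step 3: softmax the logits and report the best label with its score."""
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    return postprocess(toy_model(preprocess(text)))


result = toy_pipeline("The model is crazy")
print(result)
```

The output has the same `label`/`score` structure as the real classifier's result, although the numbers here come from the toy scoring rule rather than a trained model.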

The following are some of the available pipelines:
1. sentiment-analysis
2. fill-mask
3. zero-shot-classification
4. question-answering
5. text-generation
6. translation
7. feature-extraction
8. ner
9. summarization

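***Zero-shot-classification:***

It assigns labels to a text from a list of candidate labels supplied at call time, without the model having been fine-tuned on those labels.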

In [6]:
from transformers import pipeline

classifier = pipeline('zero-shot-classification')
classifier("I am studing LLMs and plan on mastering it.",
           candidate_labels = ['focus', 'education', 'politics'])

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'I am studing LLMs and plan on mastering it.',
 'labels': ['education', 'focus', 'politics'],
 'scores': [0.5787284970283508, 0.40389740467071533, 0.017374113202095032]}
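The result is a plain dictionary in which the `labels` and `scores` lists are aligned position by position. A small sketch of reading it, using the output recorded above (the variable names are ours):

```python
# Output copied from the cell above.
result = {
    "sequence": "I am studing LLMs and plan on mastering it.",
    "labels": ["education", "focus", "politics"],
    "scores": [0.5787284970283508, 0.40389740467071533, 0.017374113202095032],
}

# Pair each label with its score and pick the highest-scoring one.
ranked = sorted(zip(result["labels"], result["scores"]),
                key=lambda pair: pair[1], reverse=True)
best_label, best_score = ranked[0]

# In this run the scores add up to one across the candidate labels.
total = sum(result["scores"])

print(best_label, round(best_score, 3))
print(round(total, 6))
```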

***Text-generation:***

It generates a continuation of the given sentence or prompt.

In [7]:
from transformers import pipeline

classifier = pipeline('text-generation')
classifier('This course is a detailed and easy way to understand')

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'This course is a detailed and easy way to understand the core principles of the game of Dota 2. It covers general fundamentals for each game that helps players learn how to play better together.\n\nAlso on this course are five new exercises that provide new'}]

In [8]:
classifier('This course is a detailed and easy way to understand', num_return_sequences = 2, max_length = 50)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'This course is a detailed and easy way to understand your application concept. It is structured to have your concepts in one place, for example, you can use the class Object and its properties to create reusable instances.\n\nThe concepts you create are easily'},
 {'generated_text': 'This course is a detailed and easy way to understand and become a proficient, professional speaker of English. This class will give you the ability to create and publish your own transcripts, and you can then send your transcripts to us to be transcribed and edited'}]

Notice that the pipelines above load a default model for each task. We can also choose a different model from the Hub by passing its name through the `model` argument.

In [9]:
from transformers import pipeline

generator = pipeline('text-generation', model = 'distilgpt2')
generator('The humanity thrives for', num_return_sequences = 2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The humanity thrives for the sake of the future. We live in an age of a world where government and private monopolies have created and monopolized the lives of the middle class and poor, even the middle class. In order to achieve this we'},
 {'generated_text': 'The humanity thrives for. One in a million people will live in the land of God, of the Earth and of the Holy Spirit, and of the planets of space and time – and in the Earth – all live in the same eternal universe through'}]
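Each default-model run above logs a warning that using a pipeline without a model name and revision is not recommended in production. One way to address it, sketched under the assumption that the `transformers` package is installed and the Hub is reachable, is to pass `model` and `revision` explicitly. The identifiers below are the ones reported in the first log message; the `PINNED` dictionary and the `build_classifier` helper are illustrative names.

```python
# Model name and revision copied from the message printed by the first cell.
PINNED = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_classifier():
    """Create a pipeline that always loads the same pinned checkpoint."""
    # Imported lazily so the configuration can be inspected without the library.
    from transformers import pipeline

    return pipeline(**PINNED)


# Usage (downloads the model on the first call):
# classifier = build_classifier()
# classifier("The model has gone crazy")
```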

***fill-mask:***

It fills in the blank in a sentence. The mask token (`<mask>` here) can differ from model to model, so check which one the chosen model expects.

`top_k` sets how many candidate fills to return.

In [10]:
from transformers import pipeline

unmask = pipeline('fill-mask')
unmask('This course is about <mask>', top_k = 4)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.020620953291654587,
  'token': 2239,
  'token_str': ' learning',
  'sequence': 'This course is about learning'},
 {'score': 0.017827048897743225,
  'token': 35,
  'token_str': ':',
  'sequence': 'This course is about:'},
 {'score': 0.016429610550403595,
  'token': 734,
  'token_str': '...',
  'sequence': 'This course is about...'},
 {'score': 0.013306526467204094,
  'token': 2600,
  'token_str': ' reading',
  'sequence': 'This course is about reading'}]
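Conceptually, `top_k` keeps the k highest-scoring candidates for the masked position. A toy sketch of that selection follows; the first four scores are taken from the output above, the last entry is invented, and `candidates` and `top_k_tokens` are hypothetical names rather than library API.

```python
import heapq

# The first four scores come from the output above; the last entry is
# invented to show that lower-ranked tokens are dropped.
candidates = {
    " learning": 0.020620953291654587,
    ":": 0.017827048897743225,
    "...": 0.016429610550403595,
    " reading": 0.013306526467204094,
    " cooking": 0.001,
}


def top_k_tokens(scores, k):
    """Return the k (token, score) pairs with the highest score, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


for token, score in top_k_tokens(candidates, k=2):
    print(f"This course is about{token}  ({score:.4f})")
```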

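***ner:***

Named entity recognition (NER) identifies the entities in a sentence, such as people, organisations, and places.

`grouped_entities=True` groups the tokens that belong to the same entity into a single result. In recent `transformers` releases the same grouping is requested with the `aggregation_strategy` argument.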

In [13]:
from transformers import pipeline

ner = pipeline('ner', grouped_entities = True)
ner('My name is Sonali and I work in Los Angles as a data scientist.')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.99884415,
  'word': 'Sonali',
  'start': 11,
  'end': 17},
 {'entity_group': 'ORG',
  'score': 0.97355396,
  'word': 'Los Angles',
  'start': 32,
  'end': 42}]
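The `start` and `end` fields are character offsets into the input string, so each entity can be recovered by slicing. A short check against the output recorded above:

```python
text = "My name is Sonali and I work in Los Angles as a data scientist."

# Entities copied from the output above (scores omitted for brevity).
entities = [
    {"entity_group": "PER", "word": "Sonali", "start": 11, "end": 17},
    {"entity_group": "ORG", "word": "Los Angles", "start": 32, "end": 42},
]

for entity in entities:
    # Slice the original text with the reported character offsets.
    span = text[entity["start"]:entity["end"]]
    print(entity["entity_group"], repr(span))
```

Note that in this run the misspelled "Los Angles" was tagged as an organisation (`ORG`) rather than a location.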

***question-answering:***

It answers the given question by extracting a span of text from the given context. It does not generate a new answer.

In [15]:
from transformers import pipeline

question_answerer = pipeline('question-answering')
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6949767470359802, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
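Extractive question answering models typically score every token as a possible answer start and as a possible answer end, and the answer is the span with the best combined score. The sketch below illustrates that idea on whitespace-separated words with made-up scores. The numbers, the `best_span` helper, and the `max_len` limit are assumptions for illustration; the real pipeline works on sub-word tokens and computes its scores differently.

```python
def best_span(start_scores, end_scores, max_len=5):
    """Return (start, end) token indices maximizing start + end score.

    Only spans with start <= end and at most max_len tokens are considered.
    """
    best, best_score = (0, 0), float("-inf")
    for i, start in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = start + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


tokens = "My name is Sylvain and I work at Hugging Face in Brooklyn".split()

# Invented scores: high start score on "Hugging", high end score on "Face".
start_scores = [0.0] * len(tokens)
end_scores = [0.0] * len(tokens)
start_scores[tokens.index("Hugging")] = 4.0
end_scores[tokens.index("Face")] = 3.5
end_scores[tokens.index("Brooklyn")] = 1.0

i, j = best_span(start_scores, end_scores)
answer = " ".join(tokens[i:j + 1])
print(answer)
```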

***Summarization:***

It condenses the given text into a shorter summary.

In [16]:
from transformers import pipeline

summarizer = pipeline('summarization')
summarizer("""America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.""")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil, electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India, as well as other industrial countries in Europe and Asia, continue to encourage and advance the teaching of engineering .'}]

In [18]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


In [1]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]

## How do transformers work

Transformer models can be grouped into three categories:
1. GPT-like - auto-regressive transformer models
2. BERT-like - auto-encoding transformer models
3. BART/T5-like - sequence-to-sequence transformer models
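"Auto-regressive" means the model produces one token at a time and feeds each prediction back in as input for the next step. A toy sketch of that loop follows, where a hand-written lookup table (`NEXT_TOKEN`, a made-up stand-in for a trained model) plays the role of the network. A real model conditions on all previous tokens, not just the last one.

```python
# Hypothetical "model": maps the last token to the next one.
NEXT_TOKEN = {
    "this": "course",
    "course": "is",
    "is": "about",
    "about": "transformers",
    "transformers": "<eos>",
}


def generate(prompt_tokens, max_new_tokens=10):
    """Greedy auto-regressive decoding over the toy lookup table."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if next_token == "<eos>":  # stop at the end-of-sequence marker
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens


print(" ".join(generate(["this", "course"])))
```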

At a high level, the transformer is an encoder-decoder model. However, each of these two parts can also be used independently:

1. Encoder-only models - Good for understanding the input
   Uses only the encoder of the transformer model. The attention layers can access all the words in the input. This bi-directional attention is why these models are often called auto-encoding models.

   Working: Pretraining corrupts the input sentence by masking a few words and asks the model to reconstruct the original sentence.

   Family includes:
   a. BERT
   b. ALBERT
   c. RoBERTa
   d. ELECTRA
   e. DistilBERT

   E.g.: text classification, NER


2. Decoder-only models - Good for generative tasks
   Uses only the decoder of a transformer model. For each word, the attention layers can access only the words positioned before it in the sentence. These models are also called auto-regressive models.

   Family includes:
   a. CTRL
   b. GPT
   c. GPT-2
   d. Transformer XL

   E.g.: text generation

3. Sequence-to-sequence models - Good for generative tasks that depend on an input
   Uses both the encoder and the decoder of a transformer.

   Family includes:
   a. BART
   b. mBART
   c. Marian
   d. T5

   E.g.: summarization, translation
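The difference between the encoder's bi-directional attention and the decoder's left-to-right attention can be pictured as a mask over token positions. A minimal sketch in plain Python follows; the function names are ours, and real implementations build such masks as tensors inside the model.

```python
def bidirectional_mask(n):
    """Encoder-style: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["This", "course", "is", "great"]
masks = [("encoder", bidirectional_mask(len(tokens))),
         ("decoder", causal_mask(len(tokens)))]

for name, mask in masks:
    print(name)
    for token, row in zip(tokens, mask):
        # A 1 means the token in this row can see the token in that column.
        print(f"  {token:>6}: {row}")
```

In a sequence-to-sequence model the encoder side uses the first kind of mask and the decoder side uses the second, with the decoder additionally attending to the encoder's output.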

#### Limitations and bias

A limitation of these models is that they can generate sexist, biased, or homophobic content. Fine-tuning does not make this disappear, because the model still relies heavily on the data it was pretrained on.