<a href="https://colab.research.google.com/github/pratikgujral/learn-data-analysis/blob/master/Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q transformers datasets xformers sentencepiece


# Pipeline
There are pipelines for several types of tasks readily available in the `transformers` library.

In [2]:
from transformers import pipeline

### Sentiment Analysis pipeline

In [3]:
## Creating a pipeline for sentiment analysis.
## This pipeline will use the default model set for the task.
classifier = pipeline('sentiment-analysis')

classifier.predict(
    ['This movie was a horror movie. I like watching horror.',
     'This pudding tastes bland']
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9815176129341125},
 {'label': 'NEGATIVE', 'score': 0.9996724128723145}]

Simply passing the list to the pipeline object also works.

In [4]:
classifier(
    ['This movie was a horror movie. I like watching horror.',
     'This pudding tastes bland']
)

[{'label': 'POSITIVE', 'score': 0.9815176129341125},
 {'label': 'NEGATIVE', 'score': 0.9996724128723145}]
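
The warning printed with the first call notes that relying on the task default is not recommended in production. A minimal sketch of pinning both values, assuming the model name and short revision hash reported in that warning are still resolvable on the Hub. `PINNED_SENTIMENT_MODEL` and `build_pinned_classifier` are illustrative names, not part of `transformers`.

```python
# Model name and revision are copied from the warning message above;
# the constant and helper names are hypothetical.
PINNED_SENTIMENT_MODEL = {
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_pinned_classifier():
    """Create a sentiment-analysis pipeline with an explicit model and revision."""
    # Imported lazily; the first call downloads the model weights.
    from transformers import pipeline

    return pipeline("sentiment-analysis", **PINNED_SENTIMENT_MODEL)


# Usage (needs network access on the first call):
# classifier = build_pinned_classifier()
# classifier(["This pudding tastes bland"])
```

Pinning the revision means a later update to the default checkpoint cannot silently change the predictions.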

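### Zero-shot classification pipeline

This text classification pipeline scores the input text against candidate labels supplied at inference time, so the model does not need to have been fine-tuned on those labels.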

In [5]:
classifier = pipeline('zero-shot-classification')

classifier([
    'this is an example notebook for using the transformers library',
    'Bhartiya Janta Party has been winning many elections lately',
    '7 साल में दिल्ली के PM-2.5 और PM-10 के स्तर में 30% की गिरावट: केजरीवाल'
], candidate_labels=['education', 'politics', 'events', 'entertainment'])

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'sequence': 'this is an example notebook for using the transformers library',
  'labels': ['events', 'entertainment', 'education', 'politics'],
  'scores': [0.5622214078903198,
   0.2550264298915863,
   0.1196880042552948,
   0.06306417286396027]},
 {'sequence': 'Bhartiya Janta Party has been winning many elections lately',
  'labels': ['politics', 'events', 'entertainment', 'education'],
  'scores': [0.7305487990379333,
   0.24770037829875946,
   0.01709085889160633,
   0.004659924190491438]},
 {'sequence': '7 साल में दिल्ली के PM-2.5 और PM-10 के स्तर में 30% की गिरावट: केजरीवाल',
  'labels': ['events', 'entertainment', 'politics', 'education'],
  'scores': [0.6858853697776794,
   0.1147824302315712,
   0.1091599091887474,
   0.09017231315374374]}]
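
In the output above, each result lists its labels in order of descending score, so the first label is the model's best guess. A small sketch of turning such a result into a single decision; the `top_label` helper and the `min_score` threshold of 0.5 are assumptions made for illustration, not part of the pipeline API.

```python
def top_label(result, min_score=0.5):
    """Return the best label, or None when its score is below min_score."""
    # Labels and scores arrive sorted from most to least likely.
    label, score = result["labels"][0], result["scores"][0]
    return label if score >= min_score else None


# First two results copied from the output above.
results = [
    {
        "sequence": "this is an example notebook for using the transformers library",
        "labels": ["events", "entertainment", "education", "politics"],
        "scores": [0.5622214078903198, 0.2550264298915863,
                   0.1196880042552948, 0.06306417286396027],
    },
    {
        "sequence": "Bhartiya Janta Party has been winning many elections lately",
        "labels": ["politics", "events", "entertainment", "education"],
        "scores": [0.7305487990379333, 0.24770037829875946,
                   0.01709085889160633, 0.004659924190491438],
    },
]

print([top_label(r) for r in results])        # ['events', 'politics']
print(top_label(results[0], min_score=0.6))   # None
```

The first sentence is arguably about education, yet "events" wins with a modest score. A threshold like this is one way to flag such low-confidence predictions for review.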

### Text generation pipeline

In [6]:
generate = pipeline('text-generation')

print(generate('What are you doing this', num_return_sequences=4, max_new_tokens=20))

print(generate('In this course I will teach you how to', num_return_sequences=4, max_new_tokens=20))

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'What are you doing this on, like, the day I took your vote?"\n\n"Yeah...you\'re in here'}, {'generated_text': "What are you doing this for? How do you feel about this? It does not affect your life? Maybe you don't"}, {'generated_text': 'What are you doing this for?" he asks me.\n\n"I don\'t know. I could just look at it'}, {'generated_text': 'What are you doing this week? (1:00 pm) "The other end of this week is here, where we'}]
[{'generated_text': 'In this course I will teach you how to create smart contracts, where that is done in a smart contract network. This way your contract will look'}, {'generated_text': 'In this course I will teach you how to implement a simple and simple SQL query using a database. You will learn about how you create data,'}, {'generated_text': 'In this course I will teach you how to write a simple, and concise, code which generates new JSON objects.\n\nThe goal of this'}, {'generated_text': 'In this course I will teach you how to use a 

### Other models
We can specify the model to be used for the task.


In [7]:
generate = pipeline('text-generation', model='distilgpt2')
## distilgpt2 is a distilled (smaller) version of GPT-2

generate('I am heading to the supermarket', num_return_sequences=4, max_length=30)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I am heading to the supermarket to buy the food from the local supermarket. There is also a huge number of supermarket workers in Westchester which was shut'},
 {'generated_text': "I am heading to the supermarket. At one time, I'd rather have them than wait for someone in the store to make it. But one time"},
 {'generated_text': 'I am heading to the supermarket next week to watch a movie, I know. We‼ll see him go to bed, get naked, I'},
 {'generated_text': 'I am heading to the supermarket this afternoon, which may not be a difficult experience for most shoppers who do plan to purchase.\n\n\n\nThe'}]

### Filling mask

In [8]:
fillmask = pipeline('fill-mask')
fillmask('I told him to bring a <mask> along with him to the party.')

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.03388902172446251,
  'token': 1441,
  'token_str': ' friend',
  'sequence': 'I told him to bring a friend along with him to the party.'},
 {'score': 0.021068697795271873,
  'token': 1040,
  'token_str': ' book',
  'sequence': 'I told him to bring a book along with him to the party.'},
 {'score': 0.020387286320328712,
  'token': 4806,
  'token_str': ' bike',
  'sequence': 'I told him to bring a bike along with him to the party.'},
 {'score': 0.017366837710142136,
  'token': 7304,
  'token_str': ' bottle',
  'sequence': 'I told him to bring a bottle along with him to the party.'},
 {'score': 0.017116010189056396,
  'token': 4437,
  'token_str': ' beer',
  'sequence': 'I told him to bring a beer along with him to the party.'}]

### Translation (English to French)

In [9]:
translator = pipeline('translation_en_to_fr')

translator('The server installed on the second floor of the main IT building caught fire.')

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


[{'translation_text': 'Le serveur installé au deuxième étage du bâtiment principal des TI a pris feu.'}]

### Named entity recognition

In [10]:
ner = pipeline('ner', aggregation_strategy="simple")

ner(['Shyam lives on the second floor of the appartment in the city of New York.'])

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[[{'entity_group': 'PER',
   'score': 0.99482733,
   'word': 'Shyam',
   'start': 0,
   'end': 5},
  {'entity_group': 'LOC',
   'score': 0.99947196,
   'word': 'New York',
   'start': 65,
   'end': 73}]]
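
The `start` and `end` values in the output above are character offsets into the input string. A short sketch of using them to recover each entity directly from the text; the entity dictionaries are copied from the output with the scores omitted, and `entity_spans` is a hypothetical helper.

```python
# Input text exactly as passed to the pipeline above (spelling unchanged,
# because the offsets depend on it).
text = "Shyam lives on the second floor of the appartment in the city of New York."

# Entities copied from the pipeline output above (scores omitted).
entities = [
    {"entity_group": "PER", "word": "Shyam", "start": 0, "end": 5},
    {"entity_group": "LOC", "word": "New York", "start": 65, "end": 73},
]


def entity_spans(text, entities):
    """Pair each entity group with the substring its offsets point to."""
    return [(e["entity_group"], text[e["start"]:e["end"]]) for e in entities]


spans = entity_spans(text, entities)
print(spans)  # [('PER', 'Shyam'), ('LOC', 'New York')]
```

Slicing by offset is useful when the entities need to be highlighted or replaced in the original text.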

### Question Answering

In [11]:
qa = pipeline('question-answering')

qa(
   question = 'Where do I live?',
   context = 'My name is Pratik. I was born in New Delhi, and currently live there. However, I have travelled across many cities in the states of Uttranchal, Himachal Pradesh and Karnataka.'
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9807660579681396, 'start': 33, 'end': 42, 'answer': 'New Delhi'}


| **Task**                     | **Description**                                                                                              | **Modality**    | **Pipeline identifier**                         |
|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------|-------------------------------------------------|
| Text classification          | assign a label to a given sequence of text                                                                   | NLP             | `pipeline(task="sentiment-analysis")`           |
| Text generation              | generate text given a prompt                                                                                 | NLP             | `pipeline(task="text-generation")`              |
| Summarization                | generate a summary of a sequence of text or document                                                         | NLP             | `pipeline(task="summarization")`                |
| Image classification         | assign a label to an image                                                                                   | Computer vision | `pipeline(task="image-classification")`         |
| Image segmentation           | assign a label to each individual pixel of an image (supports semantic, panoptic, and instance segmentation) | Computer vision | `pipeline(task="image-segmentation")`           |
| Object detection             | predict the bounding boxes and classes of objects in an image                                                | Computer vision | `pipeline(task="object-detection")`             |
| Audio classification         | assign a label to some audio data                                                                            | Audio           | `pipeline(task="audio-classification")`         |
| Automatic speech recognition | transcribe speech into text                                                                                  | Audio           | `pipeline(task="automatic-speech-recognition")` |
| Visual question answering    | answer a question about the image, given an image and a question                                             | Multimodal      | `pipeline(task="vqa")`                          |
| Document question answering  | answer a question about a document, given an image and a question                                            | Multimodal      | `pipeline(task="document-question-answering")`  |
| Image captioning             | generate a caption for a given image                                                                         | Multimodal      | `pipeline(task="image-to-text")`                |


## Translation

In [12]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")
translator("How are you dude?")

[{'translation_text': 'आप दोस्त कैसे हैं?'}]

Transformer models can be broadly classified into three categories:
1. **Auto-regressive models**: e.g. GPT-like models
2. **Auto-encoding models**: e.g. BERT-like models
3. **Sequence-to-sequence models**: e.g. BART, T5, Marian, mBART


The original Transformer architecture is composed of two blocks: an encoder and a decoder.



More details on how Transformers work can be found on [Jay Alammar's blog](http://jalammar.github.io/illustrated-transformer/).


Each of these parts can be used independently, depending on the task:

* **Encoder-only models**: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition. E.g. BERT
* **Decoder-only models**: Good for generative tasks such as text generation. E.g. GPT
* **Encoder-decoder models** or **sequence-to-sequence models**: Good for generative tasks that require an input, such as translation or summarization. E.g. BART, T5
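
To connect these families to the pipelines used earlier, the sketch below groups the default checkpoints named in the pipeline warnings in this notebook by architecture family. The model names come from those warnings; the grouping and the `models_in_family` helper are annotations added here for illustration.

```python
# Default checkpoints reported by the pipelines earlier in this notebook,
# labelled with the architecture family each one belongs to.
ARCHITECTURE_BY_DEFAULT_MODEL = {
    "distilbert-base-uncased-finetuned-sst-2-english": "encoder-only",   # sentiment-analysis
    "distilroberta-base": "encoder-only",                                # fill-mask
    "dbmdz/bert-large-cased-finetuned-conll03-english": "encoder-only",  # ner
    "distilbert-base-cased-distilled-squad": "encoder-only",             # question-answering
    "gpt2": "decoder-only",                                              # text-generation
    "t5-base": "encoder-decoder",                                        # translation_en_to_fr
    "facebook/bart-large-mnli": "encoder-decoder",                       # zero-shot-classification
}


def models_in_family(family):
    """List the default checkpoints that belong to one architecture family."""
    return sorted(
        name for name, fam in ARCHITECTURE_BY_DEFAULT_MODEL.items() if fam == family
    )


print(models_in_family("decoder-only"))     # ['gpt2']
print(models_in_family("encoder-decoder"))  # ['facebook/bart-large-mnli', 't5-base']
```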

![Transformer Architecture](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers.svg)

The first attention layer in a decoder block pays attention to all (past) inputs to the decoder, but the second attention layer uses the output of the encoder. It can thus access the whole input sentence to best predict the current word. This is very useful as different languages can have grammatical rules that put the words in different orders, or some context provided later in the sentence may be helpful to determine the best translation of a given word.

The attention mask can also be used in the encoder/decoder to prevent the model from paying attention to some special words — for instance, the special padding word used to make all the inputs the same length when batching together sentences.
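
A minimal sketch of the padding case described above, in plain Python. The `<pad>` token, the `pad_batch` helper, and the second sentence are made up for illustration; the 1/0 convention mirrors the `attention_mask` that Hugging Face tokenizers return.

```python
PAD = "<pad>"  # hypothetical padding token used only in this sketch


def pad_batch(sentences):
    """Pad tokenized sentences to equal length and build matching attention masks."""
    max_len = max(len(tokens) for tokens in sentences)
    padded, masks = [], []
    for tokens in sentences:
        n_pad = max_len - len(tokens)
        padded.append(tokens + [PAD] * n_pad)
        masks.append([1] * len(tokens) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return padded, masks


batch = [["Welcome", "to", "India"], ["Hello"]]
padded, masks = pad_batch(batch)
print(padded)  # [['Welcome', 'to', 'India'], ['Hello', '<pad>', '<pad>']]
print(masks)   # [[1, 1, 1], [1, 0, 0]]
```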

----

# Encoder architectures
- An encoder block is used to encode a sequence of text.
- So, if the input to the encoder is the three words "Welcome to India", the output from the encoder will be three vectors, one for each input word. This numerical representation is also called a **feature vector** or **feature tensor**.
- The dimension of the feature vector is defined by the architecture of the model. For example, for the base BERT model, feature vectors have length 768.
- Unlike earlier static word embedding techniques (such as word2vec or GloVe), these numerical embeddings are contextualized.
- These feature vectors are constructed through the self-attention mechanism. Hence, the representation of a word in a sequence is affected by the other words present in the sequence.




Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having “bi-directional” attention, and are often called **auto-encoding** models.
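
The bi-directional behaviour can be sketched with a stripped-down scaled dot-product self-attention in plain Python. This is a simplification under stated assumptions: the 2-dimensional embeddings are made up, and queries, keys and values are taken to be the inputs themselves, whereas a real encoder applies learned projections and several attention heads.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = inputs."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score the query against every position, to the left and right alike.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output vector is a weighted average of all input vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


tokens = ["Welcome", "to", "India"]
toy_embeddings = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # made-up 2-d vectors
features, weights = self_attention(toy_embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Every row of weights is non-zero for all three positions, which is what makes the resulting feature vector for each word depend on the whole sentence.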

### Key words
* Bi-directional
* Self-attention

## When to use encoder models?
- Sequence classification, NER, question answering, masked language modelling
- Tasks that require understanding natural language (NLU)
- Examples of encoder models: BERT, DistilBERT, RoBERTa, ALBERT, ELECTRA etc.

---

# Decoder architectures
* Examples: GPT-2, GPT-3, CTRL, Transformer-XL
* Decoder architectures can be used for the same tasks as encoder models, but they generally perform worse than encoder models on those tasks.
  * For example, we can pass the three words "Welcome to India" to a decoder architecture, and the output will be a sequence of three feature vectors, one for each word in the input sequence.

### Key words
* Uni-directional
* Auto-regressive
* Masked self-attention (plus cross-attention when the decoder is paired with an encoder)

## How decoders differ from encoders
* The primary difference between a decoder and an encoder is the way the attention mechanism is implemented.

* Encoders use self-attention. An encoder takes in the entire sequence in one go.
* Decoders perform masked self-attention and, when paired with an encoder, cross-attention. Masked self-attention means that future words are not available to the decoder. When calculating the vector for the word "to" in the sequence "Welcome to India", the decoder sees only "Welcome" (and "to" itself), whereas "India" is masked. Hence, a decoder uses only the words on one side of the current word, typically the left, to understand the context. This is unlike encoder architectures, where bi-directional context is used.
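
The masking can be sketched as a lower-triangular boolean matrix, in which each position may attend to itself and to earlier positions only. `causal_mask` is a hypothetical helper written for this illustration.

```python
def causal_mask(length):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(length)] for i in range(length)]


tokens = ["Welcome", "to", "India"]
mask = causal_mask(len(tokens))

# For every token, list the tokens it is allowed to look at.
visible = {
    token: [tokens[j] for j, allowed in enumerate(row) if allowed]
    for token, row in zip(tokens, mask)
}
print(visible)
# {'Welcome': ['Welcome'], 'to': ['Welcome', 'to'], 'India': ['Welcome', 'to', 'India']}
```

In practice the disallowed positions have their attention scores set to negative infinity before the softmax, so they receive zero weight.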

## When to use decoder models


*   Having access to only the left (or right) context, decoders are good at generating sequences. This is known as **causal language modelling**.



## Why are decoder models called autoregressive models?

Decoder models generate one word at a time. Each generated word is appended to the input sequence, which is then used to generate the next word. Because past outputs are fed back as inputs in the next step, these models are called autoregressive.

* My --> name
* My name --> is
* My name is --> Pratik

We can continue generating until `<EOS>` is generated. Otherwise, generation is bounded by the maximum context length, which, as with encoders, is fixed by the design of the architecture. E.g. GPT-2 has a maximum context length of 1024 tokens. This means that GPT-2 can work with sequences of up to 1024 tokens (prompt included) while still being able to attend to the first token in the sequence.
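
The generation loop and its two stopping conditions can be sketched with a toy stand-in for the model. `TOY_NEXT_TOKEN` is a hypothetical lookup table that only looks at the most recent token; a real decoder conditions on the entire sequence generated so far.

```python
EOS = "<EOS>"

# Hypothetical lookup table standing in for a trained language model.
TOY_NEXT_TOKEN = {"My": "name", "name": "is", "is": "Pratik", "Pratik": EOS}


def generate(prompt, max_context=8):
    """Greedy autoregressive loop: append each prediction to the input and repeat."""
    sequence = list(prompt)
    while len(sequence) < max_context:      # stop at the maximum context length
        next_token = TOY_NEXT_TOKEN.get(sequence[-1], EOS)
        if next_token == EOS:               # stop when end-of-sequence is predicted
            break
        sequence.append(next_token)         # past output becomes part of the next input
    return sequence


print(generate(["My"]))                 # ['My', 'name', 'is', 'Pratik']
print(generate(["My"], max_context=3))  # ['My', 'name', 'is']
```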


----

# Sequence to Sequence Models (Encoder-Decoder models)
* Examples: T5, BART
* **Encoder** takes a sequence of words as input and produces a numerical representation, called a feature vector, for each word. This representation holds the contextual meaning of the words appearing in the sequence.
* **Decoder** takes the output of the encoder as input. In addition to the encoder output, we also provide the past decoder output sequence as input to the decoder. When there is no initial sequence to provide as an input to the decoder, we provide a dummy value that indicates 'Start of Sequence'.

### Encoder and Decoder together
The encoder takes a sequence as input and outputs a numerical representation of it. This (final) numerical representation is then sent to the decoder. The encoder's job is done here.

The decoder uses the numerical representation received from the encoder along with its own sequence (initially, "start of sequence") and "decodes" it, outputting one word at a time. The words generated by the decoder are fed back to the input of the decoder in an autoregressive manner.
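
The data flow between the two blocks can be sketched with a toy example: the encoder runs once, and the decoder then loops, reading both the encoder output and its own past output. Everything here is hypothetical: `TOY_LEXICON` stands in for learned weights, and `encode`, `decode_step` and `translate` are illustrative helpers rather than any library's API.

```python
SOS, EOS = "<SOS>", "<EOS>"

# Hypothetical word-for-word table standing in for learned weights.
TOY_LEXICON = {"welcome": "bienvenue", "to": "en", "india": "Inde"}


def encode(source_tokens):
    """Run once: map every source token to a (toy) representation."""
    return [TOY_LEXICON[token.lower()] for token in source_tokens]


def decode_step(encoder_output, decoder_tokens):
    """Predict the next token from the encoder output and the tokens generated so far."""
    position = len(decoder_tokens) - 1  # tokens emitted so far, excluding <SOS>
    if position < len(encoder_output):
        return encoder_output[position]
    return EOS


def translate(source_tokens, max_steps=10):
    encoder_output = encode(source_tokens)  # the encoder's job ends here
    decoder_tokens = [SOS]                  # dummy start-of-sequence input
    for _ in range(max_steps):
        next_token = decode_step(encoder_output, decoder_tokens)
        if next_token == EOS:
            break
        decoder_tokens.append(next_token)   # fed back in on the next step
    return decoder_tokens[1:]


print(translate(["Welcome", "to", "India"]))  # ['bienvenue', 'en', 'Inde']
```

A real decoder reads the encoder output through cross-attention and can reorder or merge words; the positional lookup used here only illustrates which inputs each step receives.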
