Source: https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt

# Behind the pipeline (PyTorch)

> This is the first section where the content is slightly different depending on whether you use PyTorch or TensorFlow. Toggle the switch on top of the source website title to select the platform you prefer!


Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [13]:
# !pip install datasets evaluate transformers[sentencepiece]

Let’s start with a complete example, taking a look at what happened behind the scenes when we executed the following code in Chapter 1:

In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


[{'label': 'POSITIVE', 'score': 0.9598050713539124},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

As we saw in Chapter 1, this <b>pipeline groups together three steps</b>: 
1. <b>preprocessing</b> - convert raw text to numbers using a tokenizer
2. <b>passing the inputs through the model</b> - which outputs logits 
3. <b>postprocessing</b> - transform logits into labels and scores

<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg" style="width:900px;" title="Full NLP pipeline">

Let’s quickly go over each of these.

## Preprocessing with a tokenizer


 <span style="color:red">Like other neural networks, Transformer models can’t process raw text directly</span>, so the <span style="color:blue">first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this we use a <b><i>tokenizer</i></b></span>, which will be responsible for:

- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model

<img src="images/tokenizer-steps.png" style="width:600px;" title="tokenizer steps">
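
To make the three responsibilities listed above concrete, here is a minimal sketch of a toy tokenizer. The vocabulary, the splitting rule, and the names `toy_vocab`, `toy_tokenize`, and `toy_encode` are all hypothetical; a real tokenizer uses the vocabulary and the subword algorithm the model was pretrained with.

```python
# Toy illustration only: the vocabulary and the splitting rule below are
# hypothetical, not the real DistilBERT tokenizer.
import re

toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9}

def toy_tokenize(text):
    # Split the lowercased input into words and punctuation symbols (tokens).
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_encode(text):
    tokens = toy_tokenize(text)
    # Add extra inputs: special tokens marking the start and end of the sequence.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Map each token to an integer, falling back to [UNK] for unknown tokens.
    return [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]

tokens = toy_tokenize("I hate this so much!")
ids = toy_encode("I hate this so much!")
print(tokens)  # ['i', 'hate', 'this', 'so', 'much', '!']
print(ids)     # [2, 4, 5, 6, 7, 8, 9, 3]
```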

<span style="color:blue">All this preprocessing needs to be done in exactly the same way as when the model was pretrained</span>, so we first need to download that information from the Model Hub. To do this, we use the `AutoTokenizer` class and its `from_pretrained()` method. Using the checkpoint name of our model, it will automatically fetch the data associated with the model’s tokenizer and cache it (so it’s only downloaded the first time you run the code below).

Since the default checkpoint of the `sentiment-analysis` pipeline is `distilbert-base-uncased-finetuned-sst-2-english` (you can see its model card [here](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)), we run the following:

In [14]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Once we have the tokenizer, we can directly pass our sentences to it and we’ll get back a dictionary that’s ready to feed to our model! The only thing left to do is to convert the list of input IDs to tensors.

You can use 🤗 Transformers without having to worry about which ML framework is used as a backend; it might be PyTorch or TensorFlow, or Flax for some models. However, <span style="color:blue">Transformer models only accept <i>tensors</i> as input. <b>If this is your first time hearing about tensors, you can think of them as NumPy arrays instead</b>. A NumPy array can be a scalar (0D), a vector (1D), a matrix (2D), or have more dimensions. It’s effectively a tensor; other ML frameworks’ tensors behave similarly, and are usually as simple to instantiate as NumPy arrays.</span>

To specify the type of tensors we want to get back (PyTorch, TensorFlow, or plain NumPy), we use the `return_tensors` argument.

Don’t worry about `padding` and `truncation` just yet; we’ll explain those later. In short:
- Since the two sentences are not the same length, we need to pad the shorter one to build a rectangular array.
- To truncate any sentence longer than the model can handle, we specify `truncation=True`.

The main things to remember here are that <span style="color:blue">you can pass one sentence or a list of sentences, as well as specifying the type of tensors you want to get back</span> (if no type is passed, you will get a list of lists as a result).

Here’s what the results look like as PyTorch tensors:

In [17]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


The output itself is a dictionary containing two keys, `input_ids` and `attention_mask`. <span style="color:blue">`input_ids` contains two rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence</span>. We’ll explain what the `attention_mask` is later in this chapter.
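
The padded second row and the matching mask printed above can be reproduced by hand. The sketch below is a simplification: the helper name `pad_batch` is hypothetical, truncation is not modeled, and the padding ID of 0 is read off the printed output rather than from the tokenizer itself.

```python
# Minimal sketch of padding; pad_batch is a hypothetical helper, not a library function.
# The IDs are copied from the tokenizer output printed above.
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
     2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]

def pad_batch(sequences, pad_id=0):
    # Pad every sequence on the right up to the length of the longest one.
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # In the mask, 1 marks a real token and 0 marks a padding position.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq))
                      for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

padded = pad_batch(batch)
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

Both rows end up with 16 entries, which is why the two sentences can be stacked into a single tensor.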

In [34]:
# Inspecting the tokenizer's outputs
print (type(inputs))
print (inputs.keys())
print (inputs['input_ids'])
print (len(inputs['input_ids'][0]), len(inputs['input_ids'][1]))
print (inputs['attention_mask'])

<class 'transformers.tokenization_utils_base.BatchEncoding'>
dict_keys(['input_ids', 'attention_mask'])
tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]])
16 16
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])


## Going through the model - <span style="color:red">Slightly unclear</span>

We can download our pretrained model the same way we did with our tokenizer. 🤗 Transformers provides an `AutoModel` class which also has a `from_pretrained()` method:

In [35]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In this code snippet, we have downloaded the same checkpoint we used in our pipeline before (it should actually have been cached already) and instantiated a model with it.

<span style="color:blue">This architecture contains only the <b>base Transformer module</b>: given some inputs, it outputs what we’ll call <i>hidden states</i>, also known as <i>features</i>. For each model input, we’ll retrieve a high-dimensional vector representing the <b>contextual understanding of that input by the Transformer model</b>.</span>

If this doesn’t make sense, don’t worry about it. We’ll explain it all later.

<span style="color:blue">While these hidden states can be useful on their own, they’re usually inputs to another part of the model, known as the <b><i>head</i></b></span>. In Chapter 1, the different tasks could have been performed with the same architecture, but each of these tasks will have a different head associated with it.



### A high-dimensional vector?

<span style="color:blue">The vector output by the Transformer module is usually large</span>. It generally has three dimensions:

- <b>Batch size</b>: The number of sequences processed at a time (2 in our example).
- <b>Sequence length</b>: The length of the numerical representation of the sequence (16 in our example).
- <b>Hidden size</b>: The vector dimension of each model input.

It is said to be “high dimensional” because of the last value. The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).

We can see this if we feed the inputs we preprocessed to our model:

In [36]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])
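
To see where a `[batch size, sequence length, hidden size]` shape comes from, here is a toy sketch of an embedding lookup in plain Python. The sizes and the random values are hypothetical, and only the lookup step is modeled; the real model then transforms these vectors through its layers while keeping the same three-dimensional shape.

```python
# Toy sketch: hypothetical sizes and random values, not DistilBERT's weights.
import random

random.seed(0)
vocab_size, hidden_size = 10, 4   # the model in this section uses hidden_size = 768

# One vector of length hidden_size per token ID.
embedding_table = [[random.uniform(-1, 1) for _ in range(hidden_size)]
                   for _ in range(vocab_size)]

# A batch of 2 sequences, each padded to 3 token IDs.
toy_input_ids = [[2, 5, 3], [2, 7, 0]]

# Replace every ID by its vector: [batch, seq_len] -> [batch, seq_len, hidden].
toy_hidden = [[embedding_table[i] for i in seq] for seq in toy_input_ids]

shape = (len(toy_hidden), len(toy_hidden[0]), len(toy_hidden[0][0]))
print(shape)  # (2, 3, 4)
```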


Note that the outputs of 🤗 Transformers models behave like `namedtuples` or dictionaries. You can access the elements by attributes (like we did) or by key (`outputs["last_hidden_state"]`), or even by index if you know exactly where the thing you are looking for is (`outputs[0]`).
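
The three access styles can be illustrated with a small stand-in class. `ToyModelOutput` is hypothetical and is not the library’s implementation; it only mimics the behavior described above (attribute, key, and index access).

```python
# Hypothetical stand-in for illustration; NOT the library's implementation.
from dataclasses import dataclass, fields

@dataclass
class ToyModelOutput:
    last_hidden_state: list
    hidden_states: object = None

    def __getitem__(self, key):
        # Access by key: look the field up by name.
        if isinstance(key, str):
            return getattr(self, key)
        # Access by index: position among the fields that are not None.
        present = [getattr(self, f.name) for f in fields(self)
                   if getattr(self, f.name) is not None]
        return present[key]

toy_outputs = ToyModelOutput(last_hidden_state=[[0.1, 0.2]])
print(toy_outputs.last_hidden_state)     # by attribute
print(toy_outputs["last_hidden_state"])  # by key
print(toy_outputs[0])                    # by index
```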

### Model heads: Making sense out of numbers - <span style="color:blue">Need to revisit</span>

<span style="color:blue">The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension</span>. They are usually composed of one or a few linear layers:

<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/transformer_and_head.svg" style="width:1000px;" title="Transformer and head">

<span style="color:blue">The output of the Transformer model is sent directly to the model head to be processed.</span>

In this diagram, the model is represented by its embeddings layer and the subsequent layers. <span style="color:blue">The embeddings layer converts each input ID in the tokenized input into a vector that represents the associated token. The subsequent layers manipulate those vectors using the <b>attention mechanism</b> to produce the final representation of the sentences.</span>
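
As a rough picture of what “projecting onto a different dimension” means, the sketch below applies a single linear layer to one hidden vector. The weights, the bias, and the hidden size of 4 are hypothetical; a real head is learned during fine-tuning and may contain more than one layer.

```python
# Toy sketch with hypothetical weights; a real head is learned during fine-tuning.
def linear(vector, weight, bias):
    # weight has one row per output value:
    # out[j] = sum_i weight[j][i] * vector[i] + bias[j]
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]

hidden_vector = [0.5, -1.0, 2.0, 0.0]    # hidden size 4 here (768 in this section)
head_weight = [[1.0, 0.0, -1.0, 0.5],    # row producing the score for label 0
               [0.0, 2.0, 1.0, -0.5]]    # row producing the score for label 1
head_bias = [0.1, -0.1]

toy_logits = linear(hidden_vector, head_weight, head_bias)
print(len(hidden_vector), "->", len(toy_logits))  # 4 -> 2
print(toy_logits)
```

The input has one value per hidden dimension, while the output has one value per label.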

There are many different architectures available in 🤗 Transformers, with each one designed around tackling a specific task. Here is a non-exhaustive list:

- `*Model` (retrieve the hidden states)
- `*ForCausalLM`
- `*ForMaskedLM`
- `*ForMultipleChoice`
- `*ForQuestionAnswering`
- `*ForSequenceClassification`
- `*ForTokenClassification`
- and others 🤗

For our example, <span style="color:green">we will need a model with a <b>sequence classification head</b> (to be able to classify the sentences as positive or negative). So, we won’t actually use the `AutoModel` class, but </span><span style="color:blue">`AutoModelForSequenceClassification`, which works exactly like the `AutoModel` class, except that it builds a model with a sequence classification head.</span>

In [51]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print (outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


<span style="color:green">Now if we look at the shape of our outputs, the dimensionality will be much lower: the model head takes as input the high-dimensional vectors we saw before, and outputs vectors containing two values (one per label)</span>:

In [109]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


In [52]:
print(outputs.logits.shape)

torch.Size([2, 2])


Since we have just two sentences and two labels, the result we get from our model is of shape 2 x 2.

## Postprocessing the output


The values we get as output from our model don’t necessarily make sense by themselves. Let’s take a look:

Our model predicted $[-1.5607, 1.6123]$ for the first sentence and $[ 4.1692, -3.3464]$ for the second one. <span style="color:blue"><b>Those are not probabilities but <i>logits</i></b>, the raw, unnormalized scores outputted by the last layer of the model. <b>To be converted to probabilities, they need to go through a [SoftMax](https://en.wikipedia.org/wiki/Softmax_function) layer </b>(all 🤗 Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy)</span>:

In [10]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5981e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


Now we can see that the model predicted $[0.0402, 0.9598]$ for the first sentence and $[0.9995, 0.0005]$ for the second one. These are recognizable probability scores.
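
To check these numbers by hand, here is a small softmax written in plain Python and applied to the logits printed earlier. The function name `softmax` is our own; the logits are the four-decimal values shown above, so the last digits can differ slightly from the tensor output.

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]  # values printed by the model above
probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 4) for p in row])
# [0.0402, 0.9598]
# [0.9995, 0.0005]
```

Each row now sums to 1, which is what lets us read the values as probabilities.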

To get the labels corresponding to each position, we can inspect the `id2label` attribute of the model config (more on this in the next section):

In [54]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Now we can conclude that the model predicted the following:

- First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598 $\Rightarrow$ Positive sentiment
- Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005 $\Rightarrow$ Negative sentiment
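
The last step, picking the most likely label for each sentence, can be sketched as follows. The probabilities are the rounded values quoted above and `id2label` is copied from the model config output; the variable names are our own.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}             # copied from model.config.id2label
probabilities = [[0.0402, 0.9598], [0.9995, 0.0005]]  # rounded softmax outputs

results = []
for row in probabilities:
    # Index of the highest probability in this row.
    best = max(range(len(row)), key=row.__getitem__)
    results.append({"label": id2label[best], "score": row[best]})

print(results)
# [{'label': 'POSITIVE', 'score': 0.9598}, {'label': 'NEGATIVE', 'score': 0.9995}]
```

Up to rounding, this matches the labels and scores the pipeline returned at the start of this section.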
            
We have successfully reproduced the three steps of the pipeline: preprocessing with tokenizers, passing the inputs through the model, and postprocessing! Now let’s take some time to dive deeper into each of those steps.

## Clean version

In [55]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

<b>Processing with tokenizers</b>

In [60]:
from transformers import AutoTokenizer

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print (inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


<b>Going through the base transformer model - A high dimension vector</b>

In [68]:
from transformers import AutoModel

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)
print (outputs)
print(outputs.last_hidden_state.shape)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BaseModelOutput(last_hidden_state=tensor([[[-0.1798,  0.2333,  0.6321,  ..., -0.3017,  0.5008,  0.1481],
         [ 0.2758,  0.6497,  0.3200,  ..., -0.0760,  0.5136,  0.1329],
         [ 0.9046,  0.0985,  0.2950,  ...,  0.3352, -0.1407, -0.6464],
         ...,
         [ 0.1466,  0.5661,  0.3235,  ..., -0.3376,  0.5100, -0.0561],
         [ 0.7500,  0.0487,  0.1738,  ...,  0.4684,  0.0030, -0.6084],
         [ 0.0519,  0.3729,  0.5223,  ...,  0.3584,  0.6500, -0.3883]],

        [[-0.2937,  0.7283, -0.1497,  ..., -0.1187, -1.0227, -0.0422],
         [-0.2206,  0.9384, -0.0951,  ..., -0.3643, -0.6605,  0.2407],
         [-0.1536,  0.8988, -0.0728,  ..., -0.2189, -0.8528,  0.0710],
         ...,
         [-0.3017,  0.9002, -0.0200,  ..., -0.1082, -0.8412, -0.0861],
         [-0.3338,  0.9674, -0.0729,  ..., -0.1952, -0.8181, -0.0634],
         [-0.3454,  0.8824, -0.0426,  ..., -0.0993, -0.8329, -0.1065]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)

<b>Model heads: Making sense out of numbers</b>

In [72]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

outputs = model(**inputs)
print (outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


In [76]:
print (outputs.logits.shape)
print (outputs.logits)

torch.Size([2, 2])
tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


<b>Postprocessing the outputs</b>

In [83]:
import torch

predictions = torch.nn.functional.softmax(input=outputs.logits, dim=-1)
print (predictions)

tensor([[4.0195e-02, 9.5981e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


In [87]:
labels = model.config.id2label
print (labels)

{0: 'NEGATIVE', 1: 'POSITIVE'}


In [108]:
for i, pred in enumerate(predictions):
    print (raw_inputs[i])
    for idx, x in enumerate(pred):
        print (f">>> pred={float(x.detach()):.3f} => label={labels[idx]}")
    print ()

I've been waiting for a HuggingFace course my whole life.
>>> pred=0.040 => label=NEGATIVE
>>> pred=0.960 => label=POSITIVE

I hate this so much!
>>> pred=0.999 => label=NEGATIVE
>>> pred=0.001 => label=POSITIVE

