## Behind the Pipeline

Let’s start with a complete example, taking a look at what happened behind the scenes when we executed the following code in the `pipelines` notebook.

<img src='./images/arc.png' width='800' height='200' style='border-radius:10px; margin-left:auto; margin-right:auto' />



## Preprocessing with a tokenizer  

Like other neural networks, Transformer models can’t process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this we use a tokenizer, which will be responsible for:

- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model

All this preprocessing needs to be done in exactly the same way as when the model was pretrained, so we first need to download that information from the Model Hub. To do this, we use the `AutoTokenizer` class and its `from_pretrained()` method. Using the checkpoint name of our model, it will automatically fetch the data associated with the model’s tokenizer and cache it.
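
To make the three tokenizer responsibilities concrete, here is a toy sketch in plain Python. The vocabulary, the `toy_tokenize` and `toy_encode` helpers, and the ID values are all made up for illustration; a real tokenizer uses a learned subword vocabulary downloaded with the checkpoint.

```python
import re

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "hate": 5, "this": 6, "!": 7}

def toy_tokenize(text):
    """Step 1: lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_encode(text):
    tokens = toy_tokenize(text)
    # Step 2: map each token to an integer, falling back to [UNK] for unknown tokens.
    ids = [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]
    # Step 3: add extra inputs, here special tokens and an attention mask.
    input_ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

tokens = toy_tokenize("I hate this so much!")
encoded = toy_encode("I hate this so much!")
print(tokens)
print(encoded)
```

Words missing from the toy vocabulary ("so", "much") fall back to `[UNK]`, which is one reason real tokenizers split rare words into subwords instead.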

In [1]:
from transformers import AutoTokenizer

In [2]:
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'

In [3]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Once we have the tokenizer, we can directly pass our sentences to it and we’ll get back a dictionary that’s ready to feed to our model! The only thing left to do is to convert the list of input IDs to tensors.

In [27]:
raw_inputs = [
    "this is the best thing happened to me", 
    "I hate this so much!"
]

In [28]:
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors='tf')
inputs

{'input_ids': <tf.Tensor: shape=(2, 10), dtype=int32, numpy=
array([[ 101, 2023, 2003, 1996, 2190, 2518, 3047, 2000, 2033,  102],
       [ 101, 1045, 5223, 2023, 2061, 2172,  999,  102,    0,    0]],
      dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 10), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]], dtype=int32)>}

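The output itself is a dictionary containing two keys, `input_ids` and `attention_mask`. `input_ids` contains two rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence. `attention_mask` has the same shape and marks real tokens with 1 and padding positions with 0; attention masks are covered in more detail later in this notebook.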
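
The relationship between the two tensors above can be checked by hand. The sketch below copies the `input_ids` values from the output and assumes the padding ID is 0 (the trailing zeros in the second row). Comparing against the padding ID is only an illustration of the relationship, not necessarily how the library builds the mask.

```python
PAD_ID = 0  # assumption: matches the trailing zeros in the second row above

input_ids = [
    [101, 2023, 2003, 1996, 2190, 2518, 3047, 2000, 2033, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0],
]

# 1 where there is a real token, 0 where the position is padding.
attention_mask = [[0 if token == PAD_ID else 1 for token in row] for row in input_ids]

for row in attention_mask:
    print(row)
```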

We can download our pretrained model the same way we did with our tokenizer. 🤗 Transformers provides a `TFAutoModel` class which also has a `from_pretrained()` method:

In [29]:
from transformers import TFAutoModel 
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

In [30]:
model = TFAutoModel.from_pretrained(checkpoint)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight', 'classifier.weight']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


This architecture contains only the base Transformer module: given some inputs, it outputs what we'll call *hidden states*, also known as features. For each model input, we'll retrieve a high-dimensional vector representing the contextual understanding of that input by the Transformer model.

While these hidden states can be useful on their own, they’re usually inputs to another part of the model, known as the head. In the `pipelines` notebook, the different tasks could have been performed with the same architecture, but each of these tasks will have a different head associated with it.

The vector output by the Transformer module is usually large. It generally has three dimensions:

- Batch size: The number of sequences processed at a time (2 in our example).
- Sequence length: The length of the numerical representation of the sequence (10 in our example).
- Hidden size: The vector dimension of each model input.

It is said to be “high dimensional” because of the last value. The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).

We can see this if we feed the inputs we preprocessed to our model:

In [31]:
outputs = model(inputs)
print(outputs.last_hidden_state.shape)

(2, 10, 768)


*Note* that the outputs of 🤗 Transformers models behave like namedtuples or dictionaries: we can access their elements by attribute (`outputs.last_hidden_state`), by key (`outputs["last_hidden_state"]`), or by index (`outputs[0]`).
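
To illustrate what "behaves like a namedtuple or a dictionary" means, here is a toy stand-in written for this notebook. `ToyOutput` is a hypothetical class and does not try to replicate the library's actual output class; it only shows the three access styles side by side.

```python
class ToyOutput:
    """Hypothetical container whose fields can be read by attribute, key, or position."""

    def __init__(self, **fields):
        self._fields = dict(fields)  # dicts keep insertion order

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        fields = self.__dict__.get("_fields", {})
        if name in fields:
            return fields[name]
        raise AttributeError(name)

    def __getitem__(self, key):
        if isinstance(key, int):
            return list(self._fields.values())[key]
        return self._fields[key]

toy = ToyOutput(last_hidden_state="placeholder for the hidden states")
print(toy.last_hidden_state)
print(toy["last_hidden_state"])
print(toy[0])
```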

## Model Heads
The model heads take the high-dimensional vector of hidden states as input and project it onto a different dimension:


<img src='./images/arc2.png' width='800' height='200' style='border-radius:10px; margin-left:auto; margin-right:auto' />

The output of the Transformer model is sent directly to the model head to be processed.
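
As a rough sketch of what "projecting onto a different dimension" means, the snippet below applies a single linear layer to one hidden vector. The hidden size of 4, the weights, and the `linear_head` helper are invented for illustration; a real head works on much larger vectors and may contain more than one layer (the loading log above lists both `pre_classifier` and `classifier` weights).

```python
def linear_head(hidden_vector, weights, bias):
    """Project one hidden vector onto len(bias) output scores, one per label."""
    return [
        sum(w * h for w, h in zip(row, hidden_vector)) + b
        for row, b in zip(weights, bias)
    ]

hidden = [0.5, -1.0, 2.0, 0.0]        # pretend the hidden size is 4
weights = [[0.1, 0.2, 0.3, 0.4],      # one row of weights per label
           [-0.1, 0.0, 0.5, 0.2]]
bias = [0.0, 0.1]

scores = linear_head(hidden, weights, bias)
print(len(hidden), "->", len(scores))  # 4 input values become 2 scores
```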

For our example, we will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won’t actually use the `TFAutoModel` class, but `TFAutoModelForSequenceClassification`:

In [9]:
from transformers import TFAutoModelForSequenceClassification

In [32]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [33]:
outputs = model(inputs)
outputs.logits.shape

TensorShape([2, 2])

*Note* the shape of our outputs: the dimensionality is much lower. The model head takes as input the high-dimensional vectors we saw before and outputs vectors containing two values (one per label).

Since we have just two sentences and two labels, the result we get from our model is of shape 2 x 2.

## Post-Processing the Output

The values we get as output from our model don’t necessarily make sense by themselves. Let’s take a look:

In [34]:
outputs.logits

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-0.70115715,  1.0346943 ],
       [ 4.169231  , -3.3464472 ]], dtype=float32)>

Those are not probabilities but logits, the raw, unnormalized scores output by the last layer of the model. To be converted to probabilities, they need to go through a SoftMax layer. (All 🤗 Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy.)

In [35]:
import tensorflow as tf

predictions = tf.math.softmax(outputs.logits, axis=-1)
print(predictions)

tf.Tensor(
[[1.4984064e-01 8.5015941e-01]
 [9.9945587e-01 5.4418418e-04]], shape=(2, 2), dtype=float32)
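
For intuition, the same computation can be sketched in plain Python. The `softmax` helper below is our own minimal version (it subtracts the maximum before exponentiating, a common trick to avoid overflow), applied to the logits printed earlier.

```python
import math

def softmax(scores):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

logits = [[-0.70115715, 1.0346943], [4.169231, -3.3464472]]
probabilities = [softmax(row) for row in logits]

for row in probabilities:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and the rounded values agree with the TensorFlow output above.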


To get the labels corresponding to each position, we can inspect the `id2label` attribute of the model config:

In [36]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Now we can conclude that the model predicted the following:

First prediction [1.4984064e-01 8.5015941e-01]:
- NEGATIVE (0): 0.1498 or ~15% confidence
- POSITIVE (1): 0.8502 or ~85% confidence
Conclusion: The model strongly predicts this is a POSITIVE sentiment (85% confidence)

Second prediction [9.9945587e-01 5.4418418e-04]:
- NEGATIVE (0): 0.9995 or ~99.95% confidence
- POSITIVE (1): 0.0005 or ~0.05% confidence
Conclusion: The model is extremely confident this is a NEGATIVE sentiment (99.95% confidence)
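
The reading above can also be done programmatically. This small sketch copies the probabilities and the `id2label` mapping from the outputs above; `top_label` is a helper name we made up for the example.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
predictions = [
    [1.4984064e-01, 8.5015941e-01],
    [9.9945587e-01, 5.4418418e-04],
]

def top_label(probs, id2label):
    """Return the label with the highest probability together with that probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

results = [top_label(row, id2label) for row in predictions]
for label, score in results:
    print(f"{label}: {score:.4f}")
```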

# Model 

In this section we’ll take a closer look at creating and using a model. We’ll use the TFAutoModel class, which is handy when you want to instantiate any model from a checkpoint.

The `TFAutoModel` class and all of its relatives are wrappers over the wide variety of models available in the library. They are clever wrappers: they can automatically guess the appropriate model architecture for your checkpoint and then instantiate a model with this architecture.

However, if you know the type of model you want to use, you can use the class that defines its architecture directly. Let’s take a look at how this works with a BERT model.
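
One way to picture the "guessing" is a lookup from a model type to a class. The sketch below is purely hypothetical: `ToyBertModel`, `ToyDistilBertModel`, `TOY_REGISTRY`, and `toy_auto_model` are invented names, and we only assume that a checkpoint's configuration carries a `model_type` field (the `BertConfig` printed in this notebook has one). It is not the library's actual mechanism.

```python
class ToyBertModel:
    def __init__(self, config):
        self.config = config

class ToyDistilBertModel:
    def __init__(self, config):
        self.config = config

# Hypothetical registry mapping a model type to the class that implements it.
TOY_REGISTRY = {"bert": ToyBertModel, "distilbert": ToyDistilBertModel}

def toy_auto_model(config):
    """Pick the architecture class from the config, then instantiate it."""
    model_type = config["model_type"]
    if model_type not in TOY_REGISTRY:
        raise ValueError(f"Unknown model type: {model_type}")
    return TOY_REGISTRY[model_type](config)

toy_model = toy_auto_model({"model_type": "bert", "hidden_size": 768})
print(type(toy_model).__name__)
```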

## Creating a Transformer 

In [37]:
from transformers import BertConfig, TFBertModel

config = BertConfig()

model = TFBertModel(config)

In [38]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.47.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



Creating a model from the default configuration initializes it with random values. Loading a Transformer model that is already trained is simple — we can do this using the `from_pretrained()` method:

In [39]:
model = TFBertModel.from_pretrained('bert-base-cased')

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [40]:
# Save Model 
model.save_pretrained('model/')

In [41]:
sequences = ["Hello!", "Cool.", "Nice!"]

In [46]:
inputs = tokenizer(sequences, return_tensors='tf')

In [48]:
inputs.input_ids

<tf.Tensor: shape=(3, 4), dtype=int32, numpy=
array([[ 101, 7592,  999,  102],
       [ 101, 4658, 1012,  102],
       [ 101, 3835,  999,  102]], dtype=int32)>

In [50]:
model_inputs = tf.constant(inputs.input_ids)

In [51]:
output = model(model_inputs)

## Handling Multiple Sequences 

### Models expect a batch of Inputs 

In [52]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [53]:
sequence = "I've been waiting for a HuggingFace course my whole life."

In [54]:
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
# Wrap the IDs in a list to add the batch dimension the model expects
input_ids = tf.constant([ids])


In [55]:
model(input_ids)

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-2.7276208,  2.8789375]], dtype=float32)>, hidden_states=None, attentions=None)

In [58]:
tokenized_inputs = tokenizer(sequence, return_tensors='tf')
print(tokenized_inputs.input_ids)

tf.Tensor(
[[  101  1045  1005  2310  2042  3403  2005  1037 17662 12172  2607  2026
   2878  2166  1012   102]], shape=(1, 16), dtype=int32)


In [62]:
output = model(input_ids)
print("logits", output.logits)

logits tf.Tensor([[-2.7276208  2.8789375]], shape=(1, 2), dtype=float32)


Batching is the act of sending multiple sentences through the model, all at once. If you only have one sentence, you can just build a batch with a single sequence. Here we build a batch of two identical sequences:

In [66]:
batched_ids = [ids, ids]

In [67]:
input_ids = tf.constant(batched_ids)

In [68]:
output = model(input_ids)
output.logits

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-2.7276218,  2.8789384],
       [-2.7276218,  2.8789384]], dtype=float32)>
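
The difference between a single sequence and a batch is just one extra leading dimension. As a small sketch, the `nested_shape` helper below (our own, not part of any library) reports the shape of nested Python lists, using the token IDs of the example sentence without special tokens.

```python
def nested_shape(data):
    """Shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] gives (2, 2)."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0] if data else None
    return tuple(shape)

ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037,
       17662, 12172, 2607, 2026, 2878, 2166, 1012]

single = nested_shape(ids)          # one sequence, no batch dimension
batch_of_one = nested_shape([ids])  # a batch holding a single sequence
batch_of_two = nested_shape([ids, ids])
print(single, batch_of_one, batch_of_two)
```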

Batching allows the model to work when you feed it multiple sentences. Using multiple sequences is just as simple as building a batch with a single sequence. There’s a second issue, though. When you’re trying to batch together two (or more) sentences, they might be of different lengths. If you’ve ever worked with tensors before, you know that they need to be of rectangular shape, so you won’t be able to convert the list of input IDs into a tensor directly. To work around this problem, we usually pad the inputs.

### Padding the Inputs 

In [None]:
# The following list of lists cannot be converted to a tensor
batched_ids = [
    [200, 200, 200],
    [200, 200]
]
tf.constant(batched_ids)  # Raises ValueError: Can't convert non-rectangular Python sequence to Tensor.

In order to work around this, we’ll use padding to make our tensors have a rectangular shape. Padding makes sure all our sentences have the same length by adding a special word called the padding token to the sentences with fewer values. For example, if you have 10 sentences with 10 words and 1 sentence with 20 words, padding will ensure all the sentences have 20 words. In our example, the resulting tensor looks like this:

In [75]:
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]

tf.constant(batched_ids) # Converted to tensorflow tensor successfully

<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
array([[200, 200, 200],
       [200, 200, 100]], dtype=int32)>
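
Padding by hand does not scale, so the step can be wrapped in a small helper. This is a minimal sketch: `pad_batch` is a name we made up, and it pads on the right up to the longest sequence in the batch.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

padded = pad_batch([[200, 200, 200], [200, 200]], pad_id=100)
print(padded)
```

The result is rectangular, so it can be converted to a tensor without the `ValueError` shown earlier.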

The padding token ID can be found in `tokenizer.pad_token_id`. Let’s use it and send our two sentences through the model individually and batched together:

In [76]:
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [77]:
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]] 
batched_ids = [
    [200, 200, 200], 
    [200, 200, tokenizer.pad_token_id]
]

In [78]:
print(model(tf.constant(sequence1_ids)).logits)
print(model(tf.constant(sequence2_ids)).logits)
print(model(tf.constant(batched_ids)).logits)

tf.Tensor([[ 1.5693657 -1.3894563]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 0.5802985  -0.41252238]], shape=(1, 2), dtype=float32)
tf.Tensor(
[[ 1.5693657 -1.3894563]
 [ 1.3373483 -1.2163186]], shape=(2, 2), dtype=float32)


There’s something wrong with the logits in our batched predictions: the second row should be the same as the logits for the second sentence, but we’ve got completely different values!

This is because the key feature of Transformer models is attention layers that contextualize each token. These will take into account the padding tokens since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask.

### Attention Masks 

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).

Let’s complete the previous example with an attention mask:

In [79]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1,1,1], 
    [1,1,0]
]

outputs = model(tf.constant(batched_ids), attention_mask=tf.constant(attention_mask))

In [80]:
outputs.logits

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[ 1.5693657, -1.3894563],
       [ 0.5802984, -0.4125224]], dtype=float32)>

Now we get the same logits for the second sentence in the batch.

Notice how the last value of the second sequence is a padding ID, which is a 0 value in the attention mask.
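
To see why a 0 in the mask removes a token's influence, here is a sketch of a masked softmax over one row of attention scores. Replacing masked scores with negative infinity before the softmax is a common approach that we assume here for illustration; the source does not say how this particular model implements it, and the scores are invented.

```python
import math

def masked_softmax(scores, mask):
    """Softmax where positions with mask == 0 end up with exactly zero weight.

    Assumes at least one position is unmasked.
    """
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    shift = max(masked)
    exps = [math.exp(s - shift) for s in masked]  # exp(-inf) is 0.0
    total = sum(exps)
    return [value / total for value in exps]

scores = [2.0, 1.0, 3.0]  # made-up attention scores; the last one is a padding token
with_mask = masked_softmax(scores, [1, 1, 0])
without_padding = masked_softmax(scores[:2], [1, 1])
print(with_mask)
print(without_padding)
```

With the mask applied, the first two weights match what we get from the unpadded sequence, which mirrors the matching logits above.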

## Special tokens

If we take a look at the input IDs returned by the tokenizer, we will see they are a tiny bit different from what we had earlier:

In [81]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


One token ID was added at the beginning, and one at the end. Let’s decode the two sequences of IDs above to see what this is about: 

In [82]:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.


The tokenizer added the special word [CLS] at the beginning and the special word [SEP] at the end. This is because the model was pretrained with those, so to get the same results for inference we need to add them as well. 
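
We can confirm this with the IDs printed above. In this sketch the values 101 and 102 are read off the two printed lists as the IDs of `[CLS]` and `[SEP]` for this tokenizer; other checkpoints may use different special tokens and IDs.

```python
CLS_ID, SEP_ID = 101, 102  # taken from the printed input IDs above

# IDs from tokenize() + convert_tokens_to_ids(): no special tokens
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037,
       17662, 12172, 2607, 2026, 2878, 2166, 1012]

# IDs returned by calling the tokenizer directly
tokenizer_ids = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
                 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]

with_special = [CLS_ID] + ids + [SEP_ID]
print(with_special == tokenizer_ids)
```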