# Using 🤗 transformers

## [Introduction](https://huggingface.co/course/chapter2/1?fw=pt)
As you saw in [Chapter 1](https://huggingface.co/course/chapter1), Transformer models are usually very large. With millions to tens of *billions* of parameters, training and deploying these models is a complicated undertaking. Furthermore, with new models being released on a near-daily basis and each having its own implementation, trying them all out is no easy task.

The 🤗 Transformers library was created to solve this problem. Its goal is to provide a single API through which any Transformer model can be loaded, trained, and saved. The library's main features are:
- **Ease of use**: Downloading, loading, and using a state-of-the-art NLP model for inference can be done in just two lines of code.
- **Flexibility**: At their core, all models are simple PyTorch `nn.Module` or TensorFlow `tf.keras.Model` classes and can be handled like any other models in their respective machine learning (ML) frameworks.
- **Simplicity**: Hardly any abstractions are made across the library. "All in one file" is a core concept: a model's forward pass is entirely defined in a single file, so that the code itself is understandable and hackable.

This last feature makes 🤗 Transformers quite different from other ML libraries. The models are not built on modules that are shared across files; instead, each model has its own layers. In addition to making the models more approachable and understandable, this allows you to easily experiment on one model without affecting others.

This chapter will begin with an end-to-end example where we use a model and a tokenizer together to replicate the `pipeline()` function introduced in [Chapter 1](https://huggingface.co/course/chapter1). Next, we'll discuss the model API: we'll dive into the model and configuration classes, and show you how to load a model and how it processes numerical inputs to output predictions.

Then we'll look at the tokenizer API, which is the other main component of the `pipeline()` function. Tokenizers take care of the first and last processing steps, handling the conversion from text to numerical inputs for the neural network, and the conversion back to text when it is needed. Finally, we'll show you how to handle sending multiple sentences through a model in a prepared batch, then wrap it all up with a closer look at the high-level `tokenizer()` function.
> <font color="darkgreen">⚠️ In order to benefit from all features available with the Model Hub and 🤗 Transformers, we recommend [creating an account](https://huggingface.co/join).</font>

## [Behind the pipeline](https://huggingface.co/course/chapter2/2?fw=pt)
> <font color="darkgreen">This is the first section where the content is slightly different depending on whether you use PyTorch or TensorFlow. Toggle the switch on top of the title to select the platform you prefer!</font>

In [1]:
# https://gist.github.com/christopherlovell/e3e70880c0b0ad666e7b5fe311320a62
from IPython.display import HTML
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/1pedAIvTWXk" allowfullscreen></iframe>')



Let's start with a complete example, taking a look at what happened behind the scenes when we executed the following code in [Chapter 1](https://huggingface.co/course/chapter1) and obtained:

In [2]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!"
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9598046541213989},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

As we saw in [Chapter 1](https://huggingface.co/course/chapter1), this pipeline groups together three steps: preprocessing, passing the inputs through the model, and postprocessing:

<img style="float=center" width="80%" src="sections/section_2/images/full_nlp_pipeline.svg">

Let's quickly go over each of these.

### Preprocessing with a tokenizer
Like other neural networks, Transformer models can't process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this we use a *tokenizer*, which will be responsible for:
- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens.
- Mapping each token to an integer.
- Adding additional inputs that may be useful to the model.

All this preprocessing needs to be done in exactly the same way as when the model was pretrained, so we first need to download that information from the [Model Hub](https://huggingface.co/models). To do this, we use the `AutoTokenizer` class and its `from_pretrained()` method. Using the checkpoint name of our model, it will automatically fetch the data associated with the model's tokenizer and cache it (so it's only downloaded the first time you run the code below).

Since the default checkpoint of the `sentiment-analysis` pipeline is `distilbert-base-uncased-finetuned-sst-2-english` (you can see its model card [here](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)), we run the following:

In [3]:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Once we have the tokenizer, we can directly pass our sentences to it and we'll get back a dictionary that's ready to feed to our model! The only thing left to do is to convert the list of input IDs to tensors.

You can use 🤗 Transformers without having to worry about which ML framework is used as a backend; it might be PyTorch or TensorFlow, or Flax for some models. However, Transformer models only accept *tensors* as input. If this is your first time hearing about tensors, you can think of them as NumPy arrays instead. A NumPy array can be a scalar (0D), a vector (1D), a matrix (2D), or have more dimensions. It's effectively a tensor; other ML frameworks' tensors behave similarly, and are usually as simple to instantiate as NumPy arrays.

To specify the type of tensors we want to get back (PyTorch, TensorFlow, or plain NumPy), we use the `return_tensors` argument:

In [4]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
inputs

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

Don't worry about padding and truncation just yet; we'll explain those later. The main things to remember here are that you can pass one sentence or a list of sentences, and that you can specify the type of tensors you want to get back (if no type is passed, you will get a list of lists as a result).

The above output shows what the results look like as PyTorch tensors.

The output itself is a dictionary containing two keys, `input_ids` and `attention_mask`. The key `input_ids` contains two rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence. We'll explain what the `attention_mask` is later in this chapter.
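Without getting ahead of the later explanation, the output above already suggests how padding and the mask line up: the shorter sentence is filled with zeros, and the mask is 0 exactly where those zeros are. The following is a minimal plain-Python sketch of that idea, not the library's implementation; `pad_batch` and `PAD_ID` are hypothetical names, and the padding ID of 0 is an assumption read off the output.

```python
# Token IDs for the two example sentences, copied from the output above
# (without the trailing padding zeros).
sequences = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
     2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]

PAD_ID = 0  # assumption: inferred from the zeros in the output above


def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build a matching mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch(sequences)
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

Both rows end up with the same length, which is what allows them to be stacked into a single tensor.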

### Going through the model
We can download our pretrained model the same way we did with our tokenizer. 🤗 Transformers provides an `AutoModel` class which also has a `from_pretrained()` method:

In [5]:
from transformers import AutoModel
model = AutoModel.from_pretrained(checkpoint)

In this code snippet, we have downloaded the same checkpoint we used in our pipeline before (it should actually have been cached already) and instantiated a model with it.

This architecture contains only the base Transformer module: given some inputs, it outputs what we'll call *hidden states*, also known as *features*. For each model input, we'll retrieve a high-dimensional vector representing the **contextual understanding of that input by the Transformer model**.

If this doesn't make sense, don't worry about it. We'll explain it all later.

While these hidden states can be useful on their own, they're usually inputs to another part of the model, known as the *head*. The different tasks we saw in [Chapter 1](https://huggingface.co/course/chapter1) could have been performed with the same architecture, but each of these tasks has a different head associated with it.

#### A high-dimensional vector?
The vector output by the Transformer module is usually large. It generally has three dimensions:

- **Batch size**: The number of sequences processed at a time (2 in our example).
- **Sequence length**: The length of the numerical representation of the sequence (16 in our example).
- **Hidden size**: The vector dimension of each model input.

It is said to be "high dimensional" because of the last value. The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).

We can see this if we feed the inputs we preprocessed to our model:

In [6]:
outputs = model(**inputs)
outputs.last_hidden_state.shape

torch.Size([2, 16, 768])

Note that the outputs of 🤗 Transformers models behave like `namedtuples` or dictionaries. You can access the elements by attributes (like we did) or by key (`outputs["last_hidden_state"]`), or even by index if you know exactly where the thing you are looking for is (`outputs[0]`).

#### Model heads: Making sense out of numbers
The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers:
<img style="float=center" src="sections/section_2/images/transformer_and_head2.svg">
The output of the Transformer model is sent directly to the model head to be processed.

In this diagram, the model is represented by its embeddings layer and the subsequent layers. The embeddings layer converts each input ID in the tokenized input into a vector that represents the associated token. The subsequent layers manipulate those vectors using the attention mechanism to produce the final representation of the sentences.
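To make the idea of a head concrete, here is a toy sketch of a single linear layer that projects one hidden-state vector onto one score per label. Everything in it is an assumption for illustration: the sizes are tiny, the weights are random rather than trained, and `linear_head` is a hypothetical helper, not a 🤗 Transformers function.

```python
import random

random.seed(0)

HIDDEN_SIZE = 8  # toy value; the example model above uses 768
NUM_LABELS = 2   # for example NEGATIVE and POSITIVE

# Hypothetical head parameters: one weight row per label, plus a bias.
weights = [
    [random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]
    for _ in range(NUM_LABELS)
]
bias = [0.0] * NUM_LABELS


def linear_head(hidden_vector):
    """Project one hidden-state vector onto NUM_LABELS raw scores."""
    return [
        sum(w * h for w, h in zip(row, hidden_vector)) + b
        for row, b in zip(weights, bias)
    ]


hidden_vector = [random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]
logits = linear_head(hidden_vector)
print(len(hidden_vector), "->", len(logits))
```

The point is only the change in dimension: a vector of size `HIDDEN_SIZE` goes in, and a vector with one value per label comes out.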

There are many different architectures available in 🤗 Transformers, with each one designed around tackling a specific task. Here is a non-exhaustive list:
- `*Model` (retrieve the hidden states)
- `*ForCausalLM`
- `*ForMaskedLM`
- `*ForMultipleChoice`
- `*ForQuestionAnswering`
- `*ForSequenceClassification`
- `*ForTokenClassification`
- and others 🤗

For our example, we will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won't actually use the `AutoModel` class, but `AutoModelForSequenceClassification`:

In [7]:
from transformers import AutoModelForSequenceClassification
# checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

Now if we look at the shape of our outputs, the dimensionality will be much lower: the model head takes as input the high-dimensional vectors we saw before, and outputs vectors containing two values (one per label):

In [8]:
outputs.logits.shape

torch.Size([2, 2])

Since we have just two sentences and two labels, the result we get from our model is of shape `2 x 2`.

### Postprocessing the output
The values we get as output from our model don't necessarily make sense by themselves. Let's take a look:

In [9]:
outputs.logits

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

Our model predicted `[-1.5607, 1.6123]` for the first sentence and `[ 4.1692, -3.3464]` for the second one. Those are not probabilities but *logits*, the raw, unnormalized scores output by the last layer of the model. To be converted to probabilities, they need to go through a [SoftMax](https://en.wikipedia.org/wiki/Softmax_function) layer (all 🤗 Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy):

In [10]:
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predictions

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)

Now we can see that the model predicted `[0.0402, 0.9598]` for the first sentence and `[0.9995, 0.0005]` for the second one. These are recognizable probability scores.
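If you want to see what SoftMax does without PyTorch, the computation is short enough to write by hand. This is a minimal sketch using only the standard library; `softmax` here is our own helper, and the logits are the rounded values printed above, so the results should agree with the tensor output only up to rounding.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output above (rounded to four decimals).
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and the larger logit always receives the larger probability.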

To get the labels corresponding to each position, we can inspect the `id2label` attribute of the model config (more on this in the next section):

In [11]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Now we can conclude that the model predicted the following:
- First sentence: NEGATIVE: `0.0402`, POSITIVE: `0.9598`
- Second sentence: NEGATIVE: `0.9995`, POSITIVE: `0.0005`
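The last step, going from probabilities to a single label per sentence, can be sketched in a few lines. The snippet below assumes the pipeline simply keeps the most likely label and its score, which matches the output shown at the start of this section; `to_predictions` is a hypothetical helper written for this illustration.

```python
# Copied from model.config.id2label above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the softmax output above.
probabilities = [[4.0195e-02, 9.5980e-01], [9.9946e-01, 5.4418e-04]]


def to_predictions(rows, labels):
    """Keep the most likely label and its score for every row."""
    results = []
    for row in rows:
        best = max(range(len(row)), key=lambda i: row[i])
        results.append({"label": labels[best], "score": row[best]})
    return results


predictions = to_predictions(probabilities, id2label)
print(predictions)
```

The resulting list of dictionaries has the same shape as what `classifier(...)` returned earlier.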

We have successfully reproduced the three steps of the pipeline: preprocessing with tokenizers, passing the inputs through the model, and postprocessing! Now let's take some time to dive deeper into each of those steps.
> ✏️ Try it out! <font color="darkgreen">Choose two (or more) texts of your own and run them through the `sentiment-analysis` pipeline. Then replicate the steps you saw here yourself and check that you obtain the same results!</font>

In [12]:
# Trying it out
## inputs (https://www.amazon.de/-/en/Phoenix-Graphics-DisplayPort-PH-RTX3060-12G-V2-0GB4-M0NA10/dp/B0974XXKC1/...
# ... ref=sr_1_1?crid=3HZA50UFNRAXF&keywords=rtx+3060&qid=1649928271&refinements=p_36%3A43000-46000&rnid=...
# ... 428358031&sprefix=rtx+3060%2Caps%2C112&sr=8-1)
review_inputs = [
    "Asus RTX 3060 Phoenix V2: compact and silent power",                         # 5 stars
    "Not bad - but not good for the name",                                        # 3 stars
    "Way too overpriced, and brazen by Asus to sell the cards for so much money." # 1 star
]
for ri in review_inputs:
    print(ri)
## pipeline
classifier = pipeline("sentiment-analysis")
pipeline_results = classifier(review_inputs)
print(f"\nPIPELINE\n{pipeline_results}")
## model setup
from transformers import AutoModelForSequenceClassification
checkpoint = "ProsusAI/finbert" # all predictions > 0.975 (all 5 stars => bad performance)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
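## tokenize inputs with the tokenizer that matches the checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(review_inputs, padding=True, truncation=True, return_tensors="pt")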
## labels
labels = model.config.id2label
print(f"\nAUTOMODEL\nlabels:\t\t{labels}")
## predictions
outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(f"predictions:\t{predictions}\n")
keys = predictions.argmax(1).tolist()
for key in keys:
    print(labels[key])

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Asus RTX 3060 Phoenix V2: compact and silent power
Not bad - but not good for the name
Way too overpriced, and brazen by Asus to sell the cards for so much money.


Device set to use cuda:0



PIPELINE
[{'label': 'POSITIVE', 'score': 0.9847788214683533}, {'label': 'NEGATIVE', 'score': 0.9957661628723145}, {'label': 'NEGATIVE', 'score': 0.9994076490402222}]

AUTOMODEL
labels:		{0: 'positive', 1: 'negative', 2: 'neutral'}
predictions:	tensor([[0.0691, 0.0168, 0.9141],
        [0.1429, 0.8201, 0.0371],
        [0.0337, 0.4195, 0.5469]], grad_fn=<SoftmaxBackward0>)

neutral
negative
neutral


## [Models](https://huggingface.co/course/chapter2/3?fw=pt)

In [13]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/AhChOFRegn4" allowfullscreen></iframe>')



In this section we'll take a closer look at creating and using a model. We'll use the `AutoModel` class, which is handy when you want to instantiate any model from a checkpoint.

The `AutoModel` class and all of its relatives are actually simple wrappers over the wide variety of models available in the library. They are clever wrappers: they automatically guess the appropriate model architecture for your checkpoint and then instantiate a model with this architecture.

However, if you know the type of model you want to use, you can use the class that defines its architecture directly. Let's take a look at how this works with a BERT model.

### Creating a Transformer
The first thing we'll need to do to initialize a BERT model is load a configuration object:

In [14]:
from transformers import BertConfig, BertModel
# Building the config
config = BertConfig()
# Building the model from the config
model = BertModel(config)

The configuration contains many attributes that are used to build the model:

In [15]:
config

BertConfig {
  "_attn_implementation_autoset": true,
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.48.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

While you haven't seen what all of these attributes do yet, you should recognize some of them: the `hidden_size` attribute defines the size of the `hidden_states` vector, and `num_hidden_layers` defines the number of layers the Transformer model has.

#### Different loading methods
Creating a model from the default configuration initializes it with random values:

In [16]:
config = BertConfig()
model = BertModel(config)
# Model is randomly initialized!

The model can be used in this state, but it will output gibberish; it needs to be trained first. We could train the model from scratch on the task at hand, but as you saw in [Chapter 1](https://huggingface.co/course/chapter1/1?fw=pt), this would require a long time and a lot of data, and it would have a non-negligible environmental impact. To avoid unnecessary and duplicated effort, it's imperative to be able to share and reuse models that have already been trained.

Loading a Transformer model that is already trained is simple — we can do this using the `from_pretrained()` method:

In [17]:
model = BertModel.from_pretrained("bert-base-cased")

As you saw earlier, we could replace `BertModel` with the equivalent `AutoModel` class. We'll do this from now on as this produces checkpoint-agnostic code; if your code works for one checkpoint, it should work seamlessly with another. This applies even if the architecture is different, as long as the checkpoint was trained for a similar task (for example, a sentiment analysis task).

In the code sample above we didn't use `BertConfig`, and instead loaded a pretrained model via the `bert-base-cased` identifier. This is a model checkpoint that was trained by the authors of BERT themselves; you can find more details about it in its [model card](https://huggingface.co/bert-base-cased).

This model is now initialized with all the weights of the checkpoint. It can be used directly for inference on the tasks it was trained on, and it can also be fine-tuned on a new task. By training with pretrained weights rather than from scratch, we can quickly achieve good results.

The weights have been downloaded and cached (so future calls to the `from_pretrained()` method won't re-download them) in the cache folder, which defaults to *~/.cache/huggingface/hub* in recent versions of the library (older versions used *~/.cache/huggingface/transformers*). You can customize your cache folder by setting the `HF_HOME` environment variable.

The identifier used to load the model can be the identifier of any model on the Model Hub, as long as it is compatible with the BERT architecture. The entire list of available BERT checkpoints can be found [here](https://huggingface.co/models?filter=bert).

#### Saving methods
Saving a model is as easy as loading one — we use the `save_pretrained()` method, which is analogous to the `from_pretrained()` method:

In [18]:
model.save_pretrained("sections/section_2/logs/first_saves")

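This saves two files to your disk: the configuration and the model weights. The listing below shows more entries because the same folder also holds the tokenizer files written by `tokenizer.save_pretrained()` later in this chapter, as well as weights in both formats:

In [19]:
!ls "sections/section_2/logs/first_saves" # the model save writes config.json plus a weights file (pytorch_model.bin or model.safetensors)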

config.json	   pytorch_model.bin	    tokenizer_config.json
model.safetensors  special_tokens_map.json  vocab.txt


If you take a look at the *config.json* file, you'll recognize the attributes necessary to build the model architecture. This file also contains some metadata, such as where the checkpoint originated and what 🤗 Transformers version you were using when you last saved the checkpoint.

The *pytorch_model.bin* file is known as the *state dictionary*; it contains all your model's weights (recent versions of 🤗 Transformers save the weights as *model.safetensors* by default). The two files go hand in hand: the configuration is necessary to know your model's architecture, while the model weights are your model's parameters.

### Using a Transformer model for inference
Now that you know how to load and save a model, let's try using it to make some predictions. Transformer models can only process numbers — numbers that the tokenizer generates. But before we discuss tokenizers, let's explore what inputs the model accepts.

Tokenizers can take care of casting the inputs to the appropriate framework's tensors, but to help you understand what's going on, we'll take a quick look at what must be done before sending the inputs to the model.

Let's say we have a couple of sequences:
```python
sequences = ["Hello!", "Cool.", "Nice!"]
```
The tokenizer converts these to vocabulary indices which are typically called *input IDs*. Each sequence is now a list of numbers! The resulting output is:

In [20]:
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

This is a list of encoded sequences: a list of lists. Tensors only accept rectangular shapes (think matrices). This "array" is already of rectangular shape, so converting it to a tensor is easy:

In [21]:
model_inputs = torch.tensor(encoded_sequences)
model_inputs

tensor([[ 101, 7592,  999,  102],
        [ 101, 4658, 1012,  102],
        [ 101, 3835,  999,  102]])
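To make the "rectangular" requirement concrete, here is a small check you could run on a list of lists before converting it. It is only a sketch: `is_rectangular` is a hypothetical helper, and `ragged_sequences` is a made-up batch whose rows have different lengths, which a tensor cannot represent directly.

```python
def is_rectangular(batch):
    """Return True when every row has the same length."""
    return len({len(row) for row in batch}) <= 1


encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]
# Hypothetical batch with rows of different lengths.
ragged_sequences = [[101, 7592, 999, 102], [101, 102]]

print(is_rectangular(encoded_sequences))
print(is_rectangular(ragged_sequences))
```

When a batch is not rectangular, the shorter rows have to be padded (or the longer ones truncated) before a tensor can be built.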

#### Using the tensors as inputs to the model
Making use of the tensors with the model is extremely simple — we just call the model with the inputs:

In [22]:
output = model(model_inputs)
output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 4.4496e-01,  4.8276e-01,  2.7797e-01,  ..., -5.4032e-02,
           3.9393e-01, -9.4770e-02],
         [ 2.4943e-01, -4.4093e-01,  8.1772e-01,  ..., -3.1917e-01,
           2.2992e-01, -4.1172e-02],
         [ 1.3668e-01,  2.2518e-01,  1.4502e-01,  ..., -4.6915e-02,
           2.8224e-01,  7.5566e-02],
         [ 1.1789e+00,  1.6738e-01, -1.8187e-01,  ...,  2.4671e-01,
           1.0441e+00, -6.1972e-03]],

        [[ 3.6436e-01,  3.2464e-02,  2.0258e-01,  ...,  6.0110e-02,
           3.2451e-01, -2.0996e-02],
         [ 7.1866e-01, -4.8725e-01,  5.1740e-01,  ..., -4.4012e-01,
           1.4553e-01, -3.7545e-02],
         [ 3.3223e-01, -2.3271e-01,  9.4876e-02,  ..., -2.5268e-01,
           3.2172e-01,  8.1085e-04],
         [ 1.2523e+00,  3.5754e-01, -5.1320e-02,  ..., -3.7840e-01,
           1.0526e+00, -5.6255e-01]],

        [[ 2.4042e-01,  1.4718e-01,  1.2110e-01,  ...,  7.6062e-02,
           3.3564e-01,  2

While the model accepts a lot of different arguments, only the input IDs are necessary. We'll explain what the other arguments do and when they are required later, but first we need to take a closer look at the tokenizers that build the inputs that a Transformer model can understand.

## [Tokenizers](https://huggingface.co/course/chapter2/4?fw=pt)

In [23]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/VFp38yj8h3A" allowfullscreen></iframe>')

Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data. In this section, we'll explore exactly what happens in the tokenization pipeline.

In NLP tasks, the data that is generally processed is raw text. Here's an example of such text:

In [24]:
text = "Jim Henson was a puppeteer"
text

'Jim Henson was a puppeteer'

However, models can only process numbers, so we need to find a way to convert the raw text to numbers. That's what the tokenizers do, and there are a lot of ways to go about this. The goal is to find the most meaningful representation — that is, the one that makes the most sense to the model — and, if possible, the smallest representation.

Let's take a look at some examples of tokenization algorithms, and try to answer some of the questions you may have about tokenization.

### Word-based

In [25]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/nhJxYji1aho" allowfullscreen></iframe>')

The first type of tokenizer that comes to mind is *word-based*. It's generally very easy to set up and use with only a few rules, and it often yields decent results. For example, in the image below, the goal is to split the raw text into words and find a numerical representation for each of them:

<img style="float=center" width="80%" src="sections/section_2/images/word_based_tokenization.svg">

There are different ways to split the text. For example, we could use whitespace to tokenize the text into words by applying Python's `split()` method:

In [26]:
tokenized_text = text.split()
tokenized_text

['Jim', 'Henson', 'was', 'a', 'puppeteer']

There are also variations of word tokenizers that have extra rules for punctuation. With this kind of tokenizer, we can end up with some pretty large "vocabularies", where a vocabulary is defined by the total number of independent tokens that we have in our corpus.

Each word gets assigned an ID, starting from 0 and going up to the size of the vocabulary. The model uses these IDs to identify each word.

If we want to completely cover a language with a word-based tokenizer, we'll need to have an identifier for each word in the language, which will generate a huge number of tokens. For example, there are over 500,000 words in the English language, so to build a map from each word to an input ID we'd need to keep track of that many IDs. Furthermore, words like "dog" are represented differently from words like "dogs", and the model will initially have no way of knowing that "dog" and "dogs" are similar: it will identify the two words as unrelated. The same applies to other similar words, like "run" and "running", which the model will not see as being similar initially.

Finally, we need a custom token to represent words that are not in our vocabulary. This is known as the "unknown" token, often represented as `"[UNK]"` or `"<unk>"`. It's generally a bad sign if you see that the tokenizer is producing a lot of these tokens, as it wasn't able to retrieve a sensible representation of a word and you're losing information along the way. The goal when crafting the vocabulary is to do it in such a way that the tokenizer tokenizes as few words as possible into the unknown token.
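The whole word-based scheme, including the unknown token, fits in a few lines of plain Python. This is a toy sketch under obvious assumptions: the two-sentence corpus is made up, the vocabulary is built by whitespace splitting only, and `encode` is our own helper rather than a library function.

```python
UNK = "[UNK]"

# Toy corpus used to build the vocabulary.
corpus = ["Jim Henson was a puppeteer", "Jim was a puppeteer"]

# The unknown token gets ID 0, then every distinct word gets the next free ID.
vocab = {UNK: 0}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))


def encode(text):
    """Map each whitespace-separated word to its ID, or to the unknown ID."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]


print(vocab)
print(encode("Jim Henson was a magician"))
```

The word "magician" never appeared in the corpus, so it is mapped to the unknown ID and its meaning is lost to the model.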

One way to reduce the amount of unknown tokens is to go one level deeper, using a *character-based* tokenizer.

### Character-based

In [27]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/ssLq_EK2jLE" allowfullscreen></iframe>')

Character-based tokenizers split the text into characters, rather than words. This has two primary benefits:
- The vocabulary is much smaller.
- There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.

But here, too, some questions arise concerning spaces and punctuation:
<img style="float=center;" src="sections/section_2/images/character_based_tokenization.svg">
This approach isn't perfect either. Since the representation is now based on characters rather than words, one could argue that, intuitively, it's less meaningful: each character doesn't mean a lot on its own, whereas that is the case with words. However, this again differs according to the language; in Chinese, for example, each character carries more information than a character in a Latin language.

Another thing to consider is that we'll end up with a very large number of tokens to be processed by our model: whereas a word would only be a single token with a word-based tokenizer, it can easily turn into 10 or more tokens when converted into characters.
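The trade-off is easy to see on the example sentence used earlier. The sketch below simply compares the number of tokens produced by whitespace splitting and by splitting into characters; keeping spaces as tokens is a choice made for this illustration, not a rule.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()
char_tokens = list(text)  # spaces are kept as tokens in this sketch

print(len(word_tokens), "word tokens")
print(len(char_tokens), "character tokens")
print(len(set(char_tokens)), "distinct characters")
```

The character-level sequence is several times longer than the word-level one, while the number of distinct symbols stays small.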

To get the best of both worlds, we can use a third technique that combines the two approaches: *subword tokenization*.

### Subword tokenization

In [28]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/zHvTiHr506c" allowfullscreen></iframe>')

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.

For instance, "annoyingly" might be considered a rare word and could be decomposed into "annoying" and "ly". These are both likely to appear more frequently as standalone subwords, while at the same time the meaning of "annoyingly" is kept by the composite meaning of "annoying" and "ly".

Here is an example showing how a subword tokenization algorithm would tokenize the sequence "Let's do tokenization!":

<img style="float:center" width="70%" src="sections/section_2/images/bpe_subword.svg">

These subwords end up providing a lot of semantic meaning: for instance, in the example above "tokenization" was split into "token" and "ization", two tokens that have a semantic meaning while being space-efficient (only two tokens are needed to represent a long word). This allows us to have relatively good coverage with small vocabularies, and close to no unknown tokens.

This approach is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.
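To get a feel for how a word is decomposed once a subword vocabulary exists, here is a simplified greedy longest-match splitter. It is only a sketch: the vocabulary is hypothetical and hand-written, whereas real vocabularies are learned from a corpus, and `split_into_subwords` is our own helper and not the algorithm any particular tokenizer in the library uses.

```python
# Hypothetical subword vocabulary; real vocabularies are learned from data.
subword_vocab = {"token", "ization", "annoying", "ly", "let", "'s", "do", "!"}


def split_into_subwords(word, vocab):
    """Greedy longest-match-first split; returns ["[UNK]"] if no split exists."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known subword.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no known piece starts at this position
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces


print(split_into_subwords("tokenization", subword_vocab))
print(split_into_subwords("annoyingly", subword_vocab))
print(split_into_subwords("puppeteer", subword_vocab))
```

With this toy vocabulary, long words are covered by a couple of reusable pieces, and only words with no matching pieces fall back to the unknown token.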

**And more!**<br>
Unsurprisingly, there are many more techniques out there. To name a few:
- Byte-level BPE, as used in GPT-2,
- WordPiece, as used in BERT, and
- SentencePiece or Unigram, as used in several multilingual models.

You should now have sufficient knowledge of how tokenizers work to get started with the API.

### Loading and saving
Loading and saving tokenizers is as simple as it is with models. Actually, it's based on the same two methods: `from_pretrained()` and `save_pretrained()`. These methods will load or save the algorithm used by the tokenizer (a bit like the *architecture* of the model) as well as its vocabulary (a bit like the *weights* of the model).

Loading the BERT tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except we use the `BertTokenizer` class:

In [29]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

We can now use the tokenizer as shown in the previous section:

In [30]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Saving a tokenizer is identical to saving a model:

In [31]:
# save tokenizer_config.json, special_tokens_map.json, vocab.txt, and added_tokens.json
tokenizer.save_pretrained("sections/section_2/logs/first_saves")

('sections/section_2/logs/first_saves/tokenizer_config.json',
 'sections/section_2/logs/first_saves/special_tokens_map.json',
 'sections/section_2/logs/first_saves/vocab.txt',
 'sections/section_2/logs/first_saves/added_tokens.json')

We'll talk more about `token_type_ids` in Chapter 3, and we'll explain the `attention_mask` key a little later. First, let's see how the `input_ids` are generated. To do this, we'll need to look at the intermediate methods of the tokenizer.

### Encoding

In [32]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/Yffk5aydLzg" allowfullscreen></iframe>')

Translating text to numbers is known as *encoding*. Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.

As we've seen, the first step is to split the text into words (or parts of words, punctuation symbols, etc.), usually called *tokens*. There are multiple rules that can govern that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules that were used when the model was pretrained.

The second step is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model. To do this, the tokenizer has a *vocabulary*, which is the part we download when we instantiate it with the `from_pretrained()` method. Again, we need to use the same vocabulary used when the model was pretrained.
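Conceptually, this second step is a dictionary lookup. The sketch below is a toy illustration rather than the library's implementation: `toy_vocab` and `toy_convert_tokens_to_ids` are hypothetical names, and the five-entry vocabulary stands in for a real one with tens of thousands of entries.

```python
# Hypothetical five-entry vocabulary; real vocabularies are far larger.
toy_vocab = {"[UNK]": 0, "Using": 1, "a": 2, "Trans": 3, "##former": 4}


def toy_convert_tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its vocabulary index, falling back to the unknown token."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


toy_ids = toy_convert_tokens_to_ids(
    ["Using", "a", "Trans", "##former", "network"], toy_vocab
)
# "network" is not in the toy vocabulary, so it maps to the [UNK] ID.
print(toy_ids)
```

This is also why the vocabulary must match the one used during pretraining: the same token looked up in a different vocabulary would yield a different ID, and the model would see a different input.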

To get a better understanding of the two steps, we'll explore them separately. Note that we will use some methods that perform parts of the tokenization pipeline separately to show you the intermediate results of those steps, but in practice, you should call the tokenizer directly on your inputs (as shown in section 2).

#### Tokenization
The tokenization process is done by the `tokenize()` method of the tokenizer:

In [33]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
tokens

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']

This tokenizer is a subword tokenizer: it splits the words until it obtains tokens that can be represented by its vocabulary. That's the case here with `Transformer`, which is split into two tokens: `Trans` and `##former`.

#### From tokens to input IDs
The conversion to input IDs is handled by the `convert_tokens_to_ids()` tokenizer method:

In [34]:
ids = tokenizer.convert_tokens_to_ids(tokens)
ids

[7993, 170, 13809, 23763, 2443, 1110, 3014]

These outputs, once converted to the appropriate framework tensor, can then be used as inputs to a model as seen earlier in this chapter.
> ✏️ Try it out! <font color="darkgreen">Replicate the two last steps (tokenization and conversion to input IDs) on the input sentences we used in section 2 ("I've been waiting for a HuggingFace course my whole life." and "I hate this so much!"). Check that you get the same input IDs we got earlier!</font>

In [35]:
# Trying it out
try_tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
try_sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!"
]
try_tokens = try_tokenizer.tokenize(try_sequences)
print(f"tokens:\n{try_tokens}")
try_ids = try_tokenizer.convert_tokens_to_ids(try_tokens)
print(f"\nids:\n{try_ids}")
try_decoded_string = try_tokenizer.decode(try_ids)
print(f"\ndecoded string:\n{try_decoded_string}")

tokens:
['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.', 'i', 'hate', 'this', 'so', 'much', '!']

ids:
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 1045, 5223, 2023, 2061, 2172, 999]

decoded string:
i've been waiting for a huggingface course my whole life. i hate this so much!


### Decoding
*Decoding* is going the other way around: from vocabulary indices, we want to get a string. This can be done with the `decode()` method as follows:

In [36]:
decoded_string = tokenizer.decode(ids)
decoded_string

'Using a Transformer network is simple'

Note that the `decode` method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence. This behavior will be extremely useful when we use models that predict new text (either text generated from a prompt, or for sequence-to-sequence problems like translation or summarization).
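The grouping step can be sketched in a few lines. This is a simplified illustration, assuming BERT-style `##` continuation markers; `toy_decode` is a hypothetical helper, and the real `decode()` method also handles details such as spacing around punctuation and special tokens.

```python
def toy_decode(tokens):
    """Join tokens into a string, gluing "##" pieces onto the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation of the current word
        else:
            words.append(token)  # start of a new word
    return " ".join(words)


decoded = toy_decode(["Using", "a", "Trans", "##former", "network", "is", "simple"])
print(decoded)
```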

By now, you should understand the atomic operations a tokenizer can handle: tokenization, conversion to IDs, and converting IDs back to a string. However, we've only scratched the surface. In the following section, we'll push this approach to its limits and look at how to overcome them.

## [Handling multiple sequences](https://huggingface.co/course/chapter2/5?fw=pt)

In [37]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/M6adb1j2jPI" allowfullscreen></iframe>')

In the previous section, we explored the simplest of use cases: doing inference on a single sequence of a small length. However, some questions emerge already:
- How do we handle multiple sequences?
- How do we handle multiple sequences of *different lengths*?
- Are vocabulary indices the only inputs that allow a model to work well?
- Is there such a thing as too long a sequence?

Let's see what kinds of problems these questions pose, and how we can solve them using the 🤗 Transformers API.

### Models expect a batch of inputs
In the previous exercise you saw how sequences get translated into lists of numbers. Let's convert this list of numbers to a tensor and send it to the model:

In [38]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
# use "ids"
input_ids = torch.tensor(ids)
input_ids

tensor([ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
         2026,  2878,  2166,  1012])

```python
#This line would fail:
model(input_ids)

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
```
Oh no! Why would this fail? We followed the steps from the pipeline in section 2 (*Behind the pipeline* further above).

The problem is that we sent a single sequence to the model, whereas 🤗 Transformers models expect multiple sentences by default. Here we tried to do everything the tokenizer did behind the scenes when we applied it to a `sequence`, but if you look closely, you'll see that it didn't just convert the list of input IDs into a tensor, it added a dimension on top of it:

In [39]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"]) # This also adds the IDs for the "[CLS]" and "[SEP]" tokens (101 and 102)

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])


Let's try again and add a new dimension:

In [40]:
# use "[ids]"
input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)
output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


The cell prints the input IDs as well as the resulting logits; see the output above.
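The difference between `ids` and `[ids]` is only the extra outer dimension. The sketch below makes that visible without any tensor library; `nested_shape` is a hypothetical helper written for this illustration, and the short `ids` list is just the first few IDs from the example.

```python
def nested_shape(data):
    """Return the shape of a rectangular nested list, e.g. [[1, 2, 3]] -> (1, 3)."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0] if data else None
    return tuple(shape)


ids = [1045, 1005, 2310, 2042, 3403]
single_shape = nested_shape(ids)     # one dimension: just the sequence
batched_shape = nested_shape([ids])  # two dimensions: (batch size, sequence length)
print(single_shape, batched_shape)
```

The model expects the second form, where the first dimension counts the sequences in the batch, even when that count is 1.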

*Batching* is the act of sending multiple sentences through the model, all at once. If you only have one sentence, you can just build a batch with a single sequence:

In [41]:
batched_ids = [ids, ids]
batched_ids

[[1045,
  1005,
  2310,
  2042,
  3403,
  2005,
  1037,
  17662,
  12172,
  2607,
  2026,
  2878,
  2166,
  1012],
 [1045,
  1005,
  2310,
  2042,
  3403,
  2005,
  1037,
  17662,
  12172,
  2607,
  2026,
  2878,
  2166,
  1012]]

This is a batch of two identical sequences!
> ✏️ Try it out! <font color="darkgreen">Convert this `batched_ids` list into a tensor and pass it through your model. Check that you obtain the same logits as before (but twice)!</font>

In [42]:
# Trying it out
batched_input_ids = torch.tensor(batched_ids)
print("Batched input IDs:", batched_input_ids)
batched_output = model(batched_input_ids)
print("Batched logits:", batched_output.logits)

Batched input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012],
        [ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Batched logits: tensor([[-2.7276,  2.8789],
        [-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


Batching allows the model to work when you feed it multiple sentences. Using multiple sequences is just as simple as building a batch with a single sequence. There's a second issue, though. When you're trying to batch together two (or more) sentences, they might be of different lengths. If you've ever worked with tensors before, you know that they need to be of rectangular shape, so you won't be able to convert the list of input IDs into a tensor directly. To work around this problem, we usually *pad* the inputs.

### Padding the inputs
The following list of lists cannot be converted to a tensor:
```python
batched_ids = [
    [200, 200, 200],
    [200, 200]
]
```
In order to work around this, we'll use *padding* to make our tensors have a rectangular shape. Padding makes sure all our sentences have the same length by adding a special word called the *padding token* to the sentences with fewer values. For example, if you have 10 sentences with 10 words and 1 sentence with 20 words, padding will ensure all the sentences have 20 words. In our example, the resulting tensor looks like this:
```python
padding_id = 100
batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]
```
The padding token ID can be found in `tokenizer.pad_token_id`. Let's use it and send our two sentences through the model individually and batched together:

In [43]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id]
]
print(f"tokenizer pad token ID: {tokenizer.pad_token_id}")
# individually
print("\nindividually:")
print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
# batched together
print("\nbatched together:")
print(model(torch.tensor(batched_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tokenizer pad token ID: 0

individually:
tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)

batched together:
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


There's something wrong with the logits in our batched predictions: the second row should be the same as the logits for the second sentence, but we've got completely different values!

This is because the key feature of Transformer models is their attention layers, which *contextualize* each token. These will take the padding tokens into account since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask.

### Attention masks
*Attention masks* are tensors with the exact same shape as the input IDs tensor, filled with `0`s and `1`s: `1`s indicate the corresponding tokens should be attended to, and `0`s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).
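To see why the mask matters, consider a toy version of what one attention layer does for a single token: it averages values using softmax weights. The sketch below uses made-up scores and values, and `attention_weights` and `attend` are hypothetical helpers. Real implementations typically add a large negative number to the masked scores before the softmax, which has the same effect as zeroing the weights here.

```python
import math


def attention_weights(scores, mask):
    """Softmax over the scores, giving masked-out positions (mask == 0) zero weight."""
    exps = [math.exp(s) if m == 1 else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values, mask):
    """Weighted average of the values under the masked softmax weights."""
    weights = attention_weights(scores, mask)
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical attention scores and values for one query token.
unpadded = attend([2.0, 1.0], [10.0, 20.0], mask=[1, 1])
# The same sequence with one padding position appended.
padded_unmasked = attend([2.0, 1.0, 0.5], [10.0, 20.0, 99.0], mask=[1, 1, 1])
padded_masked = attend([2.0, 1.0, 0.5], [10.0, 20.0, 99.0], mask=[1, 1, 0])

print(unpadded, padded_unmasked, padded_masked)
```

Without the mask, the padding position leaks into the average and changes the result; with the mask, the padded sequence gives the same value as the unpadded one.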

Let's complete the previous example with an attention mask:

In [44]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id]
]
attention_mask = [
    [1, 1, 1],
    [1, 1, 0]
]
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)


Now we get the same logits for the second sentence in the batch.

Notice how the last value of the second sequence is a padding ID, which is a 0 value in the attention mask.
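Building the padded batch and its mask by hand quickly becomes tedious, so here is a minimal sketch of a helper that does both at once. `pad_batch` is a hypothetical function, not part of the library, and it assumes padding on the right, as in the example above.

```python
def pad_batch(sequences, pad_token_id):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# 0 is the pad token ID printed earlier for this tokenizer.
padded_ids, padded_mask = pad_batch([[200, 200, 200], [200, 200]], pad_token_id=0)
print(padded_ids)
print(padded_mask)
```

The two returned lists are rectangular, so each can be converted to a tensor and passed to the model as `input_ids` and `attention_mask`.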

> ✏️ Try it out! <font color="darkgreen">Apply the tokenization manually on the two sentences used in section 2 ("I've been waiting for a HuggingFace course my whole life." and "I hate this so much!"). Pass them through the model and check that you get the same logits as in section 2. Now batch them together using the padding token, then create the proper attention mask. Check that you obtain the same results when going through the model!</font>

In [45]:
# Trying it out
## sequences
sequence1 = "I've been waiting for a HuggingFace course my whole life."
sequence2 = "I hate this so much!"
sequences12 = [sequence1, sequence2]
## use checkpoint, model, and tokenizer from section 2 "Behind the pipeline"
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
## tokens, IDs, attention mask, and logits for sequence 1
tokenized_seq1 = tokenizer(sequence1, return_tensors="pt")
input_ids1 = tokenized_seq1["input_ids"]
print(f"input IDs 1:\n{input_ids1}")
attn_mask1 = tokenized_seq1["attention_mask"]
print(f"\nattention mask 1:\n{attn_mask1}")
logits1 = model(input_ids1, attention_mask=attn_mask1).logits  # already tensors, no torch.tensor() needed
print(f"\nlogits 1:\n{logits1}") # expected: [[-1.5607, 1.6123]]
## tokens, IDs, attention mask, and logits for sequence 2
tokenized_seq2 = tokenizer(sequence2, return_tensors="pt")
input_ids2 = tokenized_seq2["input_ids"]
print(f"\ninput IDs 2:\n{input_ids2}")
attn_mask2 = tokenized_seq2["attention_mask"]
print(f"\nattention mask 2:\n{attn_mask2}")
logits2 = model(input_ids2, attention_mask=attn_mask2).logits  # already tensors, no torch.tensor() needed
print(f"\nlogits 2:\n{logits2}") # expected: [[4.1692, -3.3464]]
## tokens, IDs, attention mask, and logits for sequences 1 and 2
tokenized_seqs12 = tokenizer(sequences12, padding=True, return_tensors="pt")
input_ids12 = tokenized_seqs12["input_ids"]
print(f"\ninput IDs 12:\n{input_ids12}")
attn_mask12 = tokenized_seqs12["attention_mask"]
print(f"\nattention mask 12:\n{attn_mask12}")
logits12 = model(input_ids12, attention_mask=attn_mask12).logits  # already tensors, no torch.tensor() needed
print(f"\nlogits 1 and 2:\n{logits12}") # expected: [[-1.5607, 1.6123], [4.1692, -3.3464]]

input IDs 1:
tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])

attention mask 1:
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

logits 1:
tensor([[-1.5607,  1.6123]], grad_fn=<AddmmBackward0>)

input IDs 2:
tensor([[ 101, 1045, 5223, 2023, 2061, 2172,  999,  102]])

attention mask 2:
tensor([[1, 1, 1, 1, 1, 1, 1, 1]])

logits 2:
tensor([[ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

input IDs 12:
tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]])

attention mask 12:
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])

logits 1 and 2:
tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)



### Longer sequences
With Transformer models, there is a limit to the lengths of the sequences we can pass the models. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. There are two solutions to this problem:
- Use a model with a longer supported sequence length.
- Truncate your sequences.

Models have different supported sequence lengths, and some specialize in handling very long sequences. [Longformer](https://huggingface.co/transformers/model_doc/longformer.html) is one example, and another is [LED](https://huggingface.co/transformers/model_doc/led.html). If you're working on a task that requires very long sequences, we recommend you take a look at those models.

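Otherwise, we recommend you truncate your sequences to a chosen `max_sequence_length`:
```python
# Assumes `sequence` holds a list of input IDs; 512 is an example limit, use your model's own
max_sequence_length = 512
sequence = sequence[:max_sequence_length]
```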

## [Putting it all together](https://huggingface.co/course/chapter2/6?fw=pt)
In the last few sections, we've been trying our best to do most of the work by hand. We've explored how tokenizers work and looked at tokenization, conversion to input IDs, padding, truncation, and attention masks.

However, as we saw in section 2, the 🤗 Transformers API can handle all of this for us with a high-level function that we'll dive into here. When you call your `tokenizer` directly on the sentence, you get back inputs that are ready to pass through your model:

In [46]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)
model_inputs

{'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Here, the `model_inputs` variable contains everything that's necessary for a model to operate well. For DistilBERT, that includes the input IDs as well as the attention mask. Other models that accept additional inputs will also have those output by the tokenizer object.

As we'll see in some examples below, this method is very powerful. First, it can tokenize a single sequence, as shown above.

It also handles multiple sequences at a time, with no change in the API:

In [47]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
model_inputs = tokenizer(sequences)
model_inputs

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 1045, 5223, 2023, 2061, 2172, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}

It can pad according to several objectives:

In [48]:
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")
print(f"longest:\n{model_inputs}\n")
# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")
print(f"max_length:\n{model_inputs}\n")
# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
print(f"max_length=8:\n{model_inputs}")

longest:
{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

max_length:
{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, ... (remaining output truncated)

It can also truncate sequences:

In [49]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
print("truncation to maximum supported length (=> no truncation, here):")
model_inputs = tokenizer(sequences, truncation=True)
print(model_inputs)
# Will truncate the sequences that are longer than the specified max length
print("\ntruncation to max_length=8:")
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
print(model_inputs)

truncation to maximum supported length (=> no truncation, here):
{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 1045, 5223, 2023, 2061, 2172, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}

truncation to max_length=8:
{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 102], [101, 1045, 5223, 2023, 2061, 2172, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}


The `tokenizer` object can handle the conversion to specific framework tensors, which can then be directly sent to the model. For example, in the following code sample we are prompting the tokenizer to return tensors from the different frameworks — `"pt"` returns PyTorch tensors, `"tf"` returns TensorFlow tensors, and `"np"` returns NumPy arrays:

In [50]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
# PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
print(f"PyTorch:\t pt\n{model_inputs}")
# TensorFlow tensors
#model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
print(f"\nTensorFlow:\t tf\n{'not installed'}")
#print(f"TensorFlow:\t tf\n{model_inputs}")
# NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
print(f"\nNumPy:\t\t np\n{model_inputs}")

PyTorch:	 pt
{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

TensorFlow:	 tf
not installed

NumPy:		 np
{'input_ids': array([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662,
        12172,  2607,  2026,  2878,  2166,  1012,   102],
       [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,
            0,     0,     0,     0,     0,     0,     0]]), 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


### Special tokens
If we take a look at the input IDs returned by the tokenizer, we will see they are a tiny bit different from what we had earlier:

In [51]:
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


One token ID was added at the beginning, and one at the end. Let's decode the two sequences of IDs above to see what this is about:

In [52]:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.


The tokenizer added the special word `[CLS]` at the beginning and the special word `[SEP]` at the end. This is because the model was pretrained with those, so to get the same results for inference we need to add them as well. Note that some models don't add special words, or add different ones; models may also add these special words only at the beginning, or only at the end. In any case, the tokenizer knows which ones are expected and will deal with this for you.
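As a rough sketch of what happens for this checkpoint, the helper below wraps a list of IDs in the `[CLS]` and `[SEP]` IDs (101 and 102, as seen in the outputs above) and, when a maximum length is given, truncates the sequence while keeping both special tokens. `toy_build_inputs` is a hypothetical function written for illustration; the real tokenizer supports several truncation strategies and model-specific templates.

```python
def toy_build_inputs(ids, cls_id=101, sep_id=102, max_length=None):
    """Wrap token IDs in [CLS] ... [SEP], truncating the IDs to fit max_length."""
    if max_length is not None:
        # Reserve two slots for the special tokens.
        ids = ids[: max(max_length - 2, 0)]
    return [cls_id] + ids + [sep_id]


ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
full_inputs = toy_build_inputs(ids)
short_inputs = toy_build_inputs(ids, max_length=8)
print(full_inputs)
print(short_inputs)
```

With `max_length=8`, only six of the original IDs are kept so that the two special tokens still fit, which is consistent with the truncated output shown earlier in this section.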

### Wrapping up: From tokenizer to model
Now that we've seen all the individual steps the `tokenizer` object uses when applied on texts, let's see one final time how it can handle multiple sequences (padding!), very long sequences (truncation!), and multiple types of tensors with its main API:

In [53]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
print(tokens)
output = model(**tokens)
output

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
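The logits above are raw scores, not probabilities. A minimal sketch of the SoftMax step, written in plain Python with the logit values copied from the output above, might look like this (the `softmax` helper is defined here for illustration; in practice you would use your framework's built-in function):

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


batch_logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]  # values printed above
batch_probs = [softmax(row) for row in batch_logits]
for row in batch_probs:
    print([round(p, 4) for p in row])
```

Assuming index 0 is the negative label and index 1 the positive one (check `model.config.id2label` to confirm), the first sentence is scored as strongly positive and the second as strongly negative.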

## [Basic usage completed!](https://huggingface.co/course/chapter2/7?fw=pt)
Great job following the course up to here! To recap, in this chapter you:
- Learned the basic building blocks of a Transformer model.
- Learned what makes up a tokenization pipeline.
- Saw how to use a Transformer model in practice.
- Learned how to leverage a tokenizer to convert text to tensors that are understandable by the model.
- Set up a tokenizer and a model together to get from text to predictions.
- Learned the limitations of input IDs, and learned about attention masks.
- Played around with versatile and configurable tokenizer methods.

From now on, you should be able to freely navigate the 🤗 Transformers docs: the vocabulary will sound familiar, and you've already seen the methods that you'll use the majority of the time.

## [End-of-chapter quiz](https://huggingface.co/course/chapter2/8?fw=pt)
**1. What is the order of the language modeling pipeline?**<br>
⚪️  First, the model, which handles text and returns raw predictions. The tokenizer then makes sense of these predictions and converts them back to text when needed.<br>
⚪️ First, the tokenizer, which handles text and returns IDs. The model handles these IDs and outputs a prediction, which can be some text.<br>
⚫️ The tokenizer handles text and returns IDs. The model handles these IDs and outputs a prediction. The tokenizer can then be used once again to convert these predictions back to some text.
> **Correct!** The tokenizer can be used for both tokenizing and de-tokenizing.

**2. How many dimensions does the tensor output by the base Transformer model have, and what are they?**<br>
⚪️ 2: The sequence length and the batch size.<br>
⚪️ 2: The sequence length and the hidden size.<br>
⚫️ 3: The sequence length, the batch size, and the hidden size.
> **Correct!**

**3. Which of the following is an example of subword tokenization?**<br>
⚫️ WordPiece.
> **Correct!** Yes, that's one example of subword tokenization!

⚪️ Character-based tokenization.<br>
⚪️ Splitting on whitespace and punctuation.<br>
⚫️ BPE.
> **Correct!** Yes, that's one example of subword tokenization!

⚫️ Unigram.
> **Correct!** Yes, that's one example of subword tokenization!

⚪️ None of the above.

**4. What is a model head?**<br>
⚪️ A component of the base Transformer network that redirects tensors to their correct layers.<br>
⚪️ Also known as the self-attention mechanism, it adapts the representation of a token according to the other tokens of the sequence.<br>
⚫️ An additional component, usually made up of one or a few layers, to convert the transformer predictions to a task-specific output.
> **Correct!** That's right. Adaptation heads, also known simply as heads, come up in different forms: language modeling heads, question answering heads, sequence classification heads...

**5. What is an AutoModel?**<br>
⚪️ A model that automatically trains on your data.<br>
⚫️ An object that returns the correct architecture based on the checkpoint.
> **Correct!** Exactly: the `AutoModel` only needs to know the checkpoint from which to initialize to return the correct architecture.

⚪️ A model that automatically detects the language used for its inputs to load the correct weights.<br>

**6. What are the techniques to be aware of when batching sequences of different lengths together?**<br>
⚫️ Truncating.
> **Correct!** Yes, truncation is a correct way of evening out sequences so that they fit in a rectangular shape. Is it the only one, though?

⚪️ Returning tensors.<br>
⚫️ Padding.
> **Correct!** Yes, padding is a correct way of evening out sequences so that they fit in a rectangular shape. Is it the only one, though?

⚫️ Attention masking.
> **Correct!** Absolutely! Attention masks are of prime importance when handling sequences of different lengths. That's not the only technique to be aware of, however.

**7. What is the point of applying a SoftMax function to the logits output by a sequence classification model?**<br>
⚪️ It softens the logits so that they're more reliable.<br>
⚫️ It applies a lower and upper bound so that they're understandable.
> **Correct!** The resulting values are bounded between 0 and 1. That's not the only reason we use a SoftMax function, though.

⚫️ The total sum of the output is then 1, resulting in a possible probabilistic interpretation.
> **Correct!** That's not the only reason we use a `SoftMax` function, though.

**8. What method is most of the tokenizer API centered around?**<br>
⚪️ The method `encode`, as it can encode text into IDs and IDs into predictions.<br>
⚫️ Calling the tokenizer object directly.
> **Correct!** Exactly! The `__call__` method of the tokenizer is a very powerful method which can handle pretty much anything. It is also the method used to retrieve predictions from a model.

⚪️ The method `pad`.<br>
⚪️ The method `tokenize`.<br>

**9. What does the `result` variable contain in this code sample?**<br>
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
result = tokenizer.tokenize("Hello!")
```
⚫️ A list of strings, each string being a token.
> **Correct!** Absolutely! Convert this to IDs, and send them to a model!

⚪️ A list of IDs.<br>
⚪️ A string containing all of the tokens.<br>

**10. Is there something wrong with the following code?**<br>
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("gpt2")
encoded = tokenizer("Hey!", return_tensors="pt")
result = model(**encoded)
```
⚪️ No, it seems correct.<br>
⚫️ The tokenizer and model should always be from the same checkpoint.
> **Correct!** Right!

⚪️ It's good practice to pad and truncate with the tokenizer as every input is a batch.