# Transformers

In this part we cover Transformer-based large pretrained models.

This notebook focuses on showing how you can use the widely known Hugging Face library to apply different types of transformer models to a range of tasks.

We hope you learn how to leverage pretrained transformer-based models and how to fine-tune them for a specific downstream task.

To dive deep into transformers, we recommend starting by reading
    
- http://jalammar.github.io/illustrated-transformer/

This gives a step-by-step explanation of the original paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762).

![4](https://lilianweng.github.io/lil-log/assets/images/transformer.png)

#### Code implementation

It's also good practice to try implementing the original model yourself before using a Transformers library:

- http://nlp.seas.harvard.edu/2018/04/03/attention.html

# Hugging Face 🤗

"Training a transformer model and deploying these models can be quite challenging. In general, these models have millions to tens of billions of parameters and require large amounts of data.

This becomes very costly in terms of time and compute resources. It even translates to environmental impact. Imagine if each time a research team, a student organization, or a company wanted to train a model, it did so from scratch. This would lead to huge, unnecessary global costs!


The Transformers library was created to solve this problem. Its goal is to provide a single API through which any Transformer model can be loaded, trained, and saved. 

The Hugging Face Transformers library provides the functionality to create and use those shared models."

For more details, look at the official [Hugging Face course](https://huggingface.co/course/chapter1/1).

## Setup

In [1]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
!pip3 install transformers
!pip3 install ipywidgets --user
!pip3 install torchtext

### Important:

After this installation, don't forget to restart the Jupyter kernel.

## Examples of transformer architectures available

As mentioned above, the Transformer architecture was introduced in 2017 in the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762), where the original research focused on translation tasks using encoder-decoder blocks.

This was followed by several influential models, including encoder-only models (e.g., BERT), decoder-only models (e.g., GPT), and a variety of encoder-decoder models (e.g., BART and T5).

### Encoder

Encoder models use only the encoder block of a Transformer model. 

These models are often characterized as having “bi-directional” attention, and are often called auto-encoding models.

Encoder models are best suited for tasks requiring an understanding of the full sentence, such as:
   - sentence classification
   - named entity recognition (word classification in general), 
   - extractive question answering
   - sentence representation (contextual embeddings)

    
The pretraining of these models usually revolves around somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence.
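To make the "corrupt and reconstruct" idea concrete, here is a minimal sketch of random masking in plain Python. The function name `mask_tokens`, the masking probability, and the whitespace tokenization are assumptions chosen for illustration; the actual BERT recipe is more involved (for example, it does not always substitute the mask symbol).

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder symbol, as used by BERT-style tokenizers


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with MASK_TOKEN.

    Returns the corrupted sequence and a dict {position: original token}
    holding the targets a model would be trained to reconstruct.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK_TOKEN)
            targets[position] = token
        else:
            corrupted.append(token)
    return corrupted, targets


sentence = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(corrupted)  # the input the model would see
print(targets)    # the words it would have to recover
```

During pretraining, the loss would only be computed on the positions stored in `targets`.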


There is a variety of encoder models available on Hugging Face. Some examples include:

- [BERT](https://huggingface.co/docs/transformers/model_doc/bert)
- [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)
- [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)

### Decoder

Decoder models use only the decoder of a Transformer model. At each stage, for a given word, the self-attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models.
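The difference between the encoder's "bi-directional" attention and the decoder's restricted attention can be sketched as a matrix of allowed connections. This is only an illustration in plain Python (the helper name `attention_mask` is made up here); real implementations apply such a mask to the attention scores inside the model.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix of 0/1 flags.

    mask[i][j] == 1 means position i may attend to position j.
    With causal=True, position i only sees positions up to and including itself.
    """
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


bidirectional = attention_mask(4, causal=False)  # encoder-style: everything visible
causal = attention_mask(4, causal=True)          # decoder-style: no looking ahead

for row in causal:
    print(row)
```

The causal matrix is lower-triangular, which is what allows a decoder to be trained on next-word prediction without seeing the answer.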

The pretraining of decoder models usually revolves around predicting the next word in the sentence.

These models are best suited for tasks involving text generation.

Some examples of decoder models available on Hugging Face include:
- [GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)
- [GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)
- [Transformer XL](https://huggingface.co/docs/transformers/model_doc/transfo-xl)

### Encoder-decoder

Encoder-decoder models (also called sequence-to-sequence models) use both parts of the Transformer architecture. 

Sequence-to-sequence models are best suited for tasks revolving around generating new sentences depending on a given input (conditional text generation), such as:
- summarization
- translation
- generative question answering

Representatives of this family of models include:

- [BART](https://huggingface.co/docs/transformers/model_doc/bart)
- [Marian](https://huggingface.co/docs/transformers/model_doc/marian)
- [T5](https://huggingface.co/docs/transformers/model_doc/t5)

# Pipeline

The most basic object in the Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer.

See all details of pipeline here: https://huggingface.co/docs/transformers/v4.16.0/en/main_classes/pipelines#transformers.TranslationPipeline

In [2]:
#pip install numpy==1.21



In [48]:
from transformers import pipeline
import torch

classifier = pipeline("sentiment-analysis")
classifier(["I've been waiting for a HuggingFace course my whole life.",
            "I hate this so much!"])

# By default, this pipeline selects a particular pretrained model 
# that has been fine-tuned for sentiment analysis in English. 

# The model is downloaded and cached when you create the classifier object. 
# If you rerun the command, the cached model will be used instead 
# and there is no need to download the model again.

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'label': 'POSITIVE', 'score': 0.9598050713539124},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

Try with your own text:

In [50]:
text = input()
classifier(text)

 Good soup!


[{'label': 'POSITIVE', 'score': 0.9998539686203003}]

Besides sentiment analysis, some of the currently available pipelines are:

- feature-extraction (get the vector representation of a text)
- fill-mask
- ner (named entity recognition)
- question-answering
- summarization
- text-generation
- translation
- zero-shot-classification

Let’s now check, for instance, how to use a pipeline to generate some text.

In [51]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to use transformer models and", do_sample=False)

# do_sample=False makes generation deterministic (greedy decoding: always pick the most likely next token).
# If set to True, this parameter enables decoding strategies such as multinomial sampling,
# beam-search multinomial sampling, Top-K sampling and Top-p sampling.
# All these strategies select the next token from the probability distribution over the entire
# vocabulary with various strategy-specific adjustments.

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use transformer models and how to use them in your own projects.\n\nThe first thing you will learn is how to use transformer models in your own projects.\n\nThe second thing you will learn'}]
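To illustrate what `do_sample` changes, the sketch below contrasts greedy selection with Top-K sampling for a single decoding step. The probabilities are invented for illustration and the helper names are hypothetical; this is not how the library implements generation, only the idea behind it.

```python
import random

# Hypothetical next-token probabilities, invented for illustration only.
next_token_probs = {"models": 0.5, "pipelines": 0.3, "tokenizers": 0.15, "bananas": 0.05}


def greedy_pick(probs):
    """do_sample=False style: always take the most probable token."""
    return max(probs, key=probs.get)


def top_k_sample(probs, k, rng):
    """Keep the k most probable tokens and draw one of them at random."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [p for _, p in top]  # random.choices treats these as relative weights
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)
print("greedy:", greedy_pick(next_token_probs))
print("top-k samples:", [top_k_sample(next_token_probs, k=2, rng=rng) for _ in range(5)])
```

Greedy selection returns the same token every time, which is why the generated text above tends to repeat itself, while sampling trades determinism for variety.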

The previous examples used the default model for the task at hand, but you can also choose a particular model for any of the above tasks. 

For the specific case of the generation task, you can also control:
- how many different sequences are generated, with the argument num_return_sequences
- the maximum total length of the output text (prompt plus generated tokens), with the argument max_length.

In [54]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    input(),
    max_length=30,
    num_return_sequences=3,
)

 Today I feel


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Today I feel so ashamed. I was never able to stop being angry at myself. I was never ever even able to fight for equality. I needed'},
 {'generated_text': 'Today I feel it may be the end of the year for me and my little family. It is not only fun to do so, but it really'},
 {'generated_text': 'Today I feel that the future of gaming is bright and a part of that. With the success of the past couple of times I wanted to see the'}]

Try it yourself with the different pipelines available.

To check how to use a specific one, see the documentation: https://huggingface.co/docs/transformers/v4.16.0/en/main_classes/pipelines

In [55]:
# Trying the machine translation pipeline
translator = pipeline("translation_en_to_fr")
translator("A man walks into a English bar.")

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


[{'translation_text': 'Un homme se promène dans un bar anglais.'}]


# Pipeline with a real dataset

Let's try this pipeline with sentences from a dataset, such as the IMDB dataset.
    

In [1]:
import torch
# Note: torchtext.legacy only exists in torchtext 0.9 to 0.11; it was removed in 0.12.
from torchtext.legacy import data, datasets

TEXT = data.Field()
LABEL = data.LabelField(dtype = torch.float)

_, test_data = datasets.IMDB.splits(TEXT, LABEL)

In [2]:
# Ideally we would run every sentence through the model and measure its performance on the full test (or validation) set.
# To run faster (you could also use a GPU), we just pick the first 50 and the last 50 sentences.
labels=list(test_data.label)[:50]+list(test_data.label)[-50:] 
sents_with_tokens=list(test_data.text)[:50]+list(test_data.text)[-50:]

In [3]:
# The pipeline needs the raw sentences (not tokens),
# so that it can tokenize the text itself according to the model's tokenization.
# To be faster (e.g., in the self-attention computation), we truncate each sentence to MAX_LEN characters.
MAX_LEN=500
sents=[" ".join(tokens)[:MAX_LEN] for tokens in sents_with_tokens] 

In [6]:
from tqdm import tqdm
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
preds=[]
for sent in tqdm(sents):
    preds.append(classifier(sent))

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
100%|██████████| 100/100 [00:21<00:00,  4.72it/s]


In [19]:
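# IMDB labels are "pos"/"neg", while the pipeline returns "POSITIVE"/"NEGATIVE",
# so a lowercase substring match is enough to compare them.
count_correct=0
for i in range(len(preds)):
    if labels[i] in preds[i][0].get("label").lower():
        count_correct +=1
print("acc", count_correct/len(labels))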

acc 0.78


# Dive deep into Hugging Face


Besides the currently available pipelines, we can dive deeper and use any model available on Hugging Face and apply it to a given task.

We’ll dive into the model and configuration classes, and show you how to load a model and how it processes numerical inputs to output predictions.


## Behind pipeline
Let's begin with an end-to-end example where we use a model and a tokenizer together to replicate the sentiment analysis pipeline() introduced before.

There are three main steps involved when you pass some text to a pipeline:
1. <b>Preprocessing with a tokenizer:</b> The text is preprocessed into a format the model can understand.
2. <b>Going through the model:</b> The preprocessed inputs are passed to the model.
3. <b>Postprocessing the output:</b> The predictions of the model are post-processed, so you can make sense of them.

![0](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)
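The three steps can be sketched with a toy, dependency-free stand-in before looking at the real classes. Everything here (the tiny vocabulary, the hand-set word scores, the function names) is hypothetical and only mirrors the shape of the real pipeline: text to ids, ids to logits, logits to a labeled score.

```python
import math

# Hypothetical stand-ins: a tiny vocabulary and hand-set word scores.
TOY_VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "this": 4}
WORD_SCORE = {2: 2.0, 3: -2.0}  # token id -> contribution to the "positive" logit
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def preprocess(text):
    """Step 1: text -> token ids (unknown words map to [UNK])."""
    return [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in text.lower().split()]


def toy_model(token_ids):
    """Step 2: token ids -> logits in the order [negative, positive]."""
    score = sum(WORD_SCORE.get(token_id, 0.0) for token_id in token_ids)
    return [-score, score]


def postprocess(logits):
    """Step 3: logits -> label and probability via softmax."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = probs.index(max(probs))
    return {"label": ID2LABEL[best], "score": probs[best]}


def toy_pipeline(text):
    return postprocess(toy_model(preprocess(text)))


print(toy_pipeline("I love this"))
print(toy_pipeline("I hate this"))
```

The real pipeline replaces `preprocess` with a tokenizer and `toy_model` with a pretrained Transformer, but the data flow is the same.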

### 1. Preprocessing with a tokenizer:

Like other neural networks, Transformer models can’t process raw text directly.

So the first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. 

To do this we use a <b>tokenizer</b>, which translates text into data that can be processed by the model.

The tokenizer is responsible for:

- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model (e.g., attention mask)
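The responsibilities listed above can be sketched in a few lines of plain Python. The vocabulary and the function `encode_batch` are assumptions made up for this illustration, and the whitespace splitting is far simpler than what a real tokenizer does; the point is only to show where `input_ids`, padding, and the attention mask come from.

```python
# Hypothetical toy vocabulary; real tokenizers ship one with tens of thousands of entries.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9}


def encode_batch(sentences):
    """Split into tokens, map to ids, add special tokens, pad to equal length."""
    batch = []
    for sentence in sentences:
        tokens = sentence.lower().replace("!", " !").split()
        ids = [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]
        batch.append([TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]])

    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [TOY_VOCAB["[PAD]"]] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that attention should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = encode_batch(["I hate this so much!", "I hate this"])
print(encoded["input_ids"])
print(encoded["attention_mask"])
```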

#### Define tokenizer 

The tokenizer and the model should always be from the same checkpoint. Therefore, we need to define the tokenizer with the checkpoint name of the corresponding model. 


To do this, we use the <b> AutoTokenizer class </b> and its <b> from_pretrained() </b> method with the checkpoint name of the corresponding model inside it.

This will automatically fetch the data associated with the model’s tokenizer (such as the vocabulary) and cache it (so it’s only downloaded the first time you run the code below).

In [20]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


As an alternative to the generic AutoTokenizer class, you can directly use the tokenizer class of the corresponding model.

In [21]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

You can use either of the two, but note that AutoTokenizer produces checkpoint-agnostic code:

- AutoTokenizer also works for checkpoints other than BERT (e.g., distilbert-base-uncased-finetuned-sst-2-english),
- whereas BertTokenizer is only meant for checkpoints that use a BERT tokenizer, as the warning in the next cell shows.

In [22]:
tokenizer = BertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 

The tokenizer class you load from this checkpoint is 'DistilBertTokenizer'. 

The class this function is called from is 'BertTokenizer'.


To mimic our sentiment analysis pipeline, let's actually use the DistilBERT model and thus the corresponding DistilBERT tokenizer (with checkpoint "distilbert-base-uncased-finetuned-sst-2-english").

In [23]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

#### Encode the sentences

Once we have the tokenizer, we can directly pass our sentences to it.

Translating text to numbers is known as encoding. Encoding is done in a two-step process: 
- the tokenization
- followed by the conversion to input IDs.

---------
To know more about each of the 2 steps:
The first step is to split the text into words (or parts of words, punctuation symbols, etc.), usually called tokens. There are multiple rules that can govern that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules that were used when the model was pretrained.

The second step is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model. To do this, the tokenizer has a vocabulary, which is the part we download when we instantiate it with the from_pretrained() method. Again, we need to use the same vocabulary used when the model was pretrained.


In [24]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
    "I find this very easy!"
]
inputs = tokenizer(
    raw_inputs,
    padding=True,        # sentences may differ in length, so pad them to the same size
    return_tensors="pt"  # return PyTorch tensors
)

# - Feeding the raw sentences to the tokenizer gives the corresponding input_ids
# - as well as the attention_mask, useful so that no attention is paid to padding tokens.

# - A BERT tokenizer would also produce "token_type_ids",
#   useful when the model receives sentences A and B together (e.g., for text similarity tasks):
#   0 corresponds to a sentence A token,
#   1 corresponds to a sentence B token.

print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,

          2607,  2026,  2878,  2166,  1012,   102],

        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,

             0,     0,     0,     0,     0,     0],

        [  101,  1045,  2424,  2023,  2200,  3733,   999,   102,     0,     0,

             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],

        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],

        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


Note that each word can correspond to more than one ID, since DistilBERT uses subword tokenization.

In [25]:
tokens = tokenizer.tokenize(raw_inputs[0])
print(tokens)

['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.']
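The split of "huggingface" into 'hugging' and '##face' comes from WordPiece-style tokenization, which greedily matches the longest vocabulary piece from left to right. Below is a minimal sketch of that idea with a made-up mini-vocabulary; the function name `wordpiece` and the vocabulary contents are assumptions for illustration, not the library's code.

```python
# Hypothetical mini-vocabulary; "##" marks a piece that continues a word.
TOY_PIECES = {"hugging", "##face", "course", "token", "##izer", "##s", "[UNK]"}


def wordpiece(word, vocab=TOY_PIECES):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


print(wordpiece("huggingface"))
print(wordpiece("tokenizers"))
print(wordpiece("xyz"))
```

Because rare words are broken into known pieces, the vocabulary can stay relatively small while still covering words never seen as a whole.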


Besides encoding the corresponding input text, we can also decode, going the other way around: from vocabulary indices to get the corresponding string.

    

In [26]:
tokens_id = tokenizer.encode(raw_inputs[0])
print("Tokens id:", tokens_id)

decode_inputs = tokenizer.decode(tokens_id)
print("\nDecoded inputs:", decode_inputs)

decode_inputs = tokenizer.decode(tokens_id, skip_special_tokens=True)
print("\nWithout special tokens:", decode_inputs)


Tokens id: [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]



Decoded inputs: [CLS] i've been waiting for a huggingface course my whole life. [SEP]



Without special tokens: i've been waiting for a huggingface course my whole life.


### 2. Going through the model


#### Define model

We can download our pretrained model in the same way that we did with our tokenizer:

- You can use the AutoModel class, which also has a from_pretrained() method,
- or you can replace AutoModel directly with the corresponding model class.

In [27]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

# we are loading the same checkpoint we used in our pipeline before,
# therefore it should already have been cached

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']

- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).

- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [28]:
from transformers import DistilBertModel

model = DistilBertModel.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']

- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).

- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


It's also possible to create a model from scratch, with randomly initialized weights instead of pretrained ones:

In [29]:
from transformers import DistilBertConfig

config = DistilBertConfig()
model_without_pretrained = DistilBertModel(config)

#### Feed the inputs

The pretrained model loaded above (model, not model_without_pretrained) is initialized with all the weights of the checkpoint. It can be used directly for inference on the tasks it was trained on, and it can also be fine-tuned on a new task. By training with pretrained weights rather than from scratch, we can quickly achieve good results.

We can now feed the inputs we preprocessed before (with the tokenizer) to our model:
    
    {'input_ids': 
    
        tensor([
            [  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
              2607,  2026,  2878,  2166,  1012,   102],
            [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
                 0,     0,     0,     0,     0,     0],
            [  101,  1045,  2424,  2023,  2200,  3733,   999,   102,     0,     0,
                 0,     0,     0,     0,     0,     0]]),
             
     'attention_mask': 
     
         tensor(
            [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
            [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
    

In [31]:
outputs = model(**inputs) # you can pass further arguments, e.g. output_attentions=True and output_hidden_states=True
print(outputs)

BaseModelOutput(last_hidden_state=tensor([[[-0.1798,  0.2333,  0.6321,  ..., -0.3017,  0.5008,  0.1481],

         [ 0.2758,  0.6497,  0.3200,  ..., -0.0760,  0.5136,  0.1329],

         [ 0.9046,  0.0985,  0.2950,  ...,  0.3352, -0.1407, -0.6464],

         ...,

         [ 0.1466,  0.5661,  0.3235,  ..., -0.3376,  0.5100, -0.0561],

         [ 0.7500,  0.0487,  0.1738,  ...,  0.4684,  0.0030, -0.6084],

         [ 0.0519,  0.3729,  0.5223,  ...,  0.3584,  0.6500, -0.3883]],



        [[-0.2937,  0.7283, -0.1497,  ..., -0.1187, -1.0227, -0.0422],

         [-0.2206,  0.9384, -0.0951,  ..., -0.3643, -0.6605,  0.2407],

         [-0.1536,  0.8988, -0.0728,  ..., -0.2189, -0.8528,  0.0710],

         ...,

         [-0.3017,  0.9002, -0.0200,  ..., -0.1082, -0.8412, -0.0861],

         [-0.3338,  0.9674, -0.0729,  ..., -0.1952, -0.8181, -0.0634],

         [-0.3454,  0.8824, -0.0426,  ..., -0.0993, -0.8329, -0.1065]],



        [[-0.3841, -0.1072,  0.3243,  ...,  0.2156,  0.2593,  0.08

Note that the outputs of Transformers models behave like named tuples or dictionaries.

You can access the elements:
- by attribute (outputs.last_hidden_state)
- by key (outputs["last_hidden_state"])
- or even by index, if you know exactly where the element you are looking for is (outputs[0]).


In [34]:
print("last hidden state", outputs.last_hidden_state)

last hidden state tensor([[[-0.1798,  0.2333,  0.6321,  ..., -0.3017,  0.5008,  0.1481],

         [ 0.2758,  0.6497,  0.3200,  ..., -0.0760,  0.5136,  0.1329],

         [ 0.9046,  0.0985,  0.2950,  ...,  0.3352, -0.1407, -0.6464],

         ...,

         [ 0.1466,  0.5661,  0.3235,  ..., -0.3376,  0.5100, -0.0561],

         [ 0.7500,  0.0487,  0.1738,  ...,  0.4684,  0.0030, -0.6084],

         [ 0.0519,  0.3729,  0.5223,  ...,  0.3584,  0.6500, -0.3883]],



        [[-0.2937,  0.7283, -0.1497,  ..., -0.1187, -1.0227, -0.0422],

         [-0.2206,  0.9384, -0.0951,  ..., -0.3643, -0.6605,  0.2407],

         [-0.1536,  0.8988, -0.0728,  ..., -0.2189, -0.8528,  0.0710],

         ...,

         [-0.3017,  0.9002, -0.0200,  ..., -0.1082, -0.8412, -0.0861],

         [-0.3338,  0.9674, -0.0729,  ..., -0.1952, -0.8181, -0.0634],

         [-0.3454,  0.8824, -0.0426,  ..., -0.0993, -0.8329, -0.1065]],



        [[-0.3841, -0.1072,  0.3243,  ...,  0.2156,  0.2593,  0.0866],

         [

Note also that, given the input, the model outputs a high-dimensional tensor representing the Transformer's contextual understanding of that input.

The tensor output by the Transformer module generally has three dimensions:

- Batch size: The number of sequences processed at a time (3 in our example).
- Sequence length: The length of the numerical representation of each sequence (16 in our example).
- Hidden size: The dimension of the vector that represents each token (768 in our example).



In [35]:
print("size", outputs.last_hidden_state.size())

size torch.Size([3, 16, 768])


### 3. Postprocessing the output
 
The hidden states that were output can be useful on their own, but for sentiment analysis we are more interested in using the corresponding classification head.

For our example, we will use the DistilBERT model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won’t actually use the DistilBertModel class but the DistilBertForSequenceClassification class (or the AutoModelForSequenceClassification class).

In [36]:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
    "I find this very easy!"
]
inputs = tokenizer(
    raw_inputs,
    padding=True,        # sentences may differ in length, so pad them to the same size
    return_tensors="pt"  # return PyTorch tensors
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
outputs = model(**inputs)
logits = outputs.logits.detach()

In [37]:
logits

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464],
        [-3.0649,  3.0434]])

Those are not probabilities but logits: the raw, unnormalized scores output by the last layer of the model.

To be converted to probabilities, they need to go through a softmax layer.

In [38]:
predictions = torch.nn.functional.softmax(logits, dim=-1) # A dimension along which Softmax will be computed (so every slice along dim will sum to 1).

print(predictions)

tensor([[4.0195e-02, 9.5980e-01],

        [9.9946e-01, 5.4418e-04],

        [2.2193e-03, 9.9778e-01]])
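To see what the softmax call computes, the same conversion can be reproduced in plain Python on the logits printed earlier. This is a sketch for intuition, not a replacement for `torch.nn.functional.softmax`; the helper name `softmax` and the max-subtraction trick for numerical stability are choices made here.

```python
import math

# Logits copied from the cell output above.
logits = [
    [-1.5607, 1.6123],
    [4.1692, -3.3464],
    [-3.0649, 3.0434],
]


def softmax(row):
    """Exponentiate and normalize so the values in the row sum to 1."""
    shift = max(row)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and the larger logit always receives the larger probability, so the predicted class does not change; only the scale does.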


In [39]:
# For each sentence, pick the class with the highest probability
# and map its index to a label name with model.config.id2label.
for sentence, probs in zip(raw_inputs, predictions):
    label_id = probs.argmax().item()
    print(sentence, model.config.id2label[label_id])

model.config.id2label

I've been waiting for a HuggingFace course my whole life. POSITIVE

I hate this so much! NEGATIVE

I find this very easy! POSITIVE


{0: 'NEGATIVE', 1: 'POSITIVE'}

We have successfully reproduced the three steps of the pipeline: preprocessing with tokenizers, passing the inputs through the model, and postprocessing!


# Fine-Tuning Transformers for Translation

In [1]:
# Transformers installation
! pip install transformers datasets
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as a sequence-to-sequence problem, a powerful framework for returning some output from an input, like translation or summarization. Translation systems are commonly used for translation between texts in different languages, but they can also be used for speech or some combination in between, like text-to-speech or speech-to-text.

![3](https://raw.githubusercontent.com/huggingface/notebooks/c5d94e54a771af91c6732a5313c49c2c42ac5cff/examples/images/translation.png)

This guide will show you how to:

1. Finetune [T5](https://huggingface.co/t5-small) on the English-French subset of the [OPUS Books](https://huggingface.co/datasets/opus_books) dataset to translate English text to French.
2. Use your finetuned model for inference.

The task illustrated in this tutorial is supported by the following model architectures:

[BART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bart), [BigBird-Pegasus](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bigbird_pegasus), [Blenderbot](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/blenderbot), [BlenderbotSmall](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/blenderbot-small), [Encoder decoder](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/encoder-decoder), [FairSeq Machine-Translation](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/fsmt), [GPTSAN-japanese](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gptsan-japanese), [LED](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/led), [LongT5](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/longt5), [M2M100](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/m2m_100), [Marian](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/marian), [mBART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mbart), [MT5](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mt5), [MVP](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mvp), [NLLB](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/nllb), [NLLB-MOE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/nllb-moe), [Pegasus](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/pegasus), [PEGASUS-X](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/pegasus_x), [PLBart](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/plbart), [ProphetNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/prophetnet), [SwitchTransformers](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/switch_transformers), 
[T5](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/t5), [XLM-ProphetNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-prophetnet)

<!--End of the generated tip-->

## Load OPUS Books dataset

Start by loading the English-French subset of the [OPUS Books](https://huggingface.co/datasets/opus_books) dataset from the Datasets library:

In [26]:
from datasets import load_dataset

books = load_dataset("opus_books", "en-fr")

Split the dataset into a train and test set with the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method:

In [27]:
books = books["train"].train_test_split(test_size=0.2)
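
To make the effect of `test_size=0.2` concrete, here is a minimal pure-Python sketch of a shuffled split. It only illustrates the idea and is not how 🤗 Datasets implements `train_test_split`; the helper name `split_indices` and the rounding rule are assumptions made for this example.

```python
import random


def split_indices(num_rows, test_size=0.2, seed=0):
    """Shuffle row indices and hold out a `test_size` fraction (hypothetical helper)."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is reproducible
    n_test = round(num_rows * test_size)
    return {"train": indices[n_test:], "test": indices[:n_test]}


splits = split_indices(10, test_size=0.2)
print(len(splits["train"]), len(splits["test"]))
```

Every row lands in exactly one of the two splits, which is the property that matters for evaluation.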

Then take a look at an example:

In [28]:
books["train"][0]

{'id': '6091',
 'translation': {'en': 'What a stroke was this for poor Jane! who would willingly have gone through the world without believing that so much wickedness existed in the whole race of mankind, as was here collected in one individual.',
  'fr': 'Quel coup pour la pauvre Jane qui aurait parcouru le monde entier sans s’imaginer qu’il existât dans toute l’humanité autant de noirceur qu’elle en découvrait en ce moment dans un seul homme !'}}

`translation`: a dictionary containing the English (`en`) text and its French (`fr`) translation.

## Preprocess

The next step is to load a T5 tokenizer to process the English-French language pairs:

In [29]:
from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The preprocessing function you want to create needs to:

1. Prefix the input with a prompt so T5 knows this is a translation task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Tokenize the input (English) and the target (French) separately: pass the French text through the `text_target` argument so that it is encoded as the labels rather than as part of the model input.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

In [30]:
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "


def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs
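
When the function is applied in batches, `examples["translation"]` is a list of dictionaries, one per sentence pair. The sketch below runs the same two list comprehensions on a hand-written toy batch, so you can see exactly which strings would be handed to the tokenizer. The two sentence pairs are invented for illustration.

```python
prefix = "translate English to French: "

# A toy batch shaped like the OPUS Books examples (contents invented for illustration).
examples = {
    "id": ["0", "1"],
    "translation": [
        {"en": "Good morning.", "fr": "Bonjour."},
        {"en": "Thank you.", "fr": "Merci."},
    ],
}

# Source sentences get the task prefix; target sentences are left untouched.
inputs = [prefix + pair["en"] for pair in examples["translation"]]
targets = [pair["fr"] for pair in examples["translation"]]

print(inputs)
print(targets)
```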

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:

In [31]:
tokenized_books = books.map(preprocess_function, batched=True)

Now create a batch of examples using [DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [32]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)  # padding=True by default: pad to the longest sequence in each batch
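
Dynamic padding can be sketched in a few lines of plain Python. This is an illustration of the idea, not the collator's actual code: the token ids are invented, `PAD_ID = 0` is an assumed pad token id, and the labels are padded with `-100`, the value that `compute_metrics` later replaces before decoding.

```python
def pad_batch(sequences, pad_value):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


PAD_ID = 0           # assumed pad token id for this toy example
IGNORE_INDEX = -100  # label value that the loss ignores

# Invented token ids for a batch of two examples of different lengths.
input_ids = [[13, 7, 1], [21, 5, 9, 4, 1]]
labels = [[8, 1], [3, 6, 2, 1]]

padded_inputs = pad_batch(input_ids, PAD_ID)
padded_labels = pad_batch(labels, IGNORE_INDEX)

# 1 marks a real token, 0 marks padding.
longest = max(len(seq) for seq in input_ids)
attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in input_ids]

print(padded_inputs)
print(padded_labels)
print(attention_mask)
```

Because the target length is computed per batch, a batch of short sentences is never padded out to the length of the longest sentence in the whole dataset.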

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [SacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu) metric (see the Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

In [35]:
#pip install transformers datasets evaluate sacrebleu

In [36]:
import evaluate

metric = evaluate.load("sacrebleu")  # SacreBLEU computes BLEU, a standard metric for translation tasks

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the SacreBLEU score:

In [37]:
import numpy as np


def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
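
Two steps in `compute_metrics` are easy to misread: replacing `-100` in the labels before decoding, and counting non-padding tokens to get `gen_len`. Here is a plain-Python equivalent of both steps on invented token ids, assuming a pad token id of `0` for the toy example.

```python
PAD_ID = 0  # assumed pad token id for this toy example

# Invented ids: labels were padded with -100, predictions with the pad token.
labels = [[8, 1, -100, -100], [3, 6, 2, 1]]
preds = [[0, 8, 1, 0, 0], [0, 3, 6, 2, 1]]

# Plain-Python equivalent of np.where(labels != -100, labels, pad_token_id):
# -100 is not a valid token id, so it must be swapped out before decoding.
clean_labels = [[tok if tok != -100 else PAD_ID for tok in seq] for seq in labels]

# Plain-Python equivalent of np.count_nonzero(pred != pad_token_id), then the mean.
prediction_lens = [sum(tok != PAD_ID for tok in seq) for seq in preds]
gen_len = sum(prediction_lens) / len(prediction_lens)

print(clean_labels)
print(prediction_lens, gen_len)
```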

## Train

<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load T5 with [AutoModelForSeq2SeqLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM):

In [38]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

At this point, only three steps remain:

1. Define your training hyperparameters in [Seq2SeqTrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments). The only required parameter is `output_dir` which specifies where to save your model. At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the SacreBLEU metric and save the training checkpoint.
2. Pass the training arguments to [Seq2SeqTrainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Seq2SeqTrainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [39]:
training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_opus_books_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=True,  # mixed precision requires a GPU; comment this line out when training on CPU
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_books["train"],
    eval_dataset=tokenized_books["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|
| 1 | 1.8603 | 1.628459 | 5.4527 | 17.6356 |
| 2 | 1.8073 | 1.60524 | 5.639 | 17.6262 |


TrainOutput(global_step=12710, training_loss=1.8756247459669735, metrics={'train_runtime': 3219.5736, 'train_samples_per_second': 63.156, 'train_steps_per_second': 3.948, 'total_flos': 5000491514068992.0, 'train_loss': 1.8756247459669735, 'epoch': 2.0})
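
As a sanity check, the `global_step=12710` reported above can be reproduced with a little arithmetic. The sketch assumes a single training device and no gradient accumulation, and uses the 127,085 sentence pairs in the English-French subset with 20% held out for testing.

```python
import math

total_pairs = 127_085   # sentence pairs in the en-fr subset
batch_size = 16         # per_device_train_batch_size, assuming a single device
num_epochs = 2
train_runtime = 3219.5736  # seconds, taken from the TrainOutput above

n_test = total_pairs // 5          # 20% held out (integer arithmetic avoids float rounding)
n_train = total_pairs - n_test

steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch may be partial
total_steps = steps_per_epoch * num_epochs

samples_per_second = n_train * num_epochs / train_runtime

print(n_train, steps_per_epoch, total_steps)
print(round(samples_per_second, 3))
```

If your own run reports a different step count, the usual causes are a different batch size, more than one device, or gradient accumulation.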

In [40]:
trainer.push_to_hub()

'https://huggingface.co/irodrigues/my_awesome_opus_books_model/commit/f483b0c90b7f945f6c14bbd26d8870ad114774ba'

<Tip>

For a more in-depth example of how to finetune a model for translation, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb).

</Tip>

## Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with some text you'd like to translate to another language. For T5, you need to prefix your input depending on the task you're working on. For translation from English to French, you should prefix your input as shown below:

In [41]:
text = "translate English to French: Legumes share resources with nitrogen-fixing bacteria."

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for translation with your model, and pass your text to it:

In [42]:
from transformers import pipeline

translator = pipeline("translation", model="irodrigues/my_awesome_opus_books_model")
translator(text)

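[{'translation_text': 'Legumes teilen Ressourcen mit Stickstoff-fixierenden Bakterien.'}]

Note that this output is German rather than French. A likely cause is that the generic `"translation"` task falls back to the first translation entry in the T5 configuration, which is English to German; requesting the task as `"translation_en_to_fr"` should avoid the fallback.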

You can also manually replicate the results of the `pipeline` if you'd like:

Tokenize the text and return the `input_ids` as PyTorch tensors:

In [43]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("irodrigues/my_awesome_opus_books_model")
inputs = tokenizer(text, return_tensors="pt").input_ids

Use the [generate()](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to create the translation. For more details about the different text generation strategies and parameters for controlling generation, check out the [Text Generation](https://huggingface.co/docs/transformers/main/en/tasks/../main_classes/text_generation) API.

In [44]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("irodrigues/my_awesome_opus_books_model")
outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
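
The `top_k` and `top_p` arguments restrict which tokens can be sampled at each step. The following is a simplified, pure-Python sketch of that filtering on an invented next-token distribution. It is not the library's implementation, and the helper name `top_k_top_p_filter` is hypothetical.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Apply top-k, then top-p (nucleus) filtering, and renormalize (simplified sketch).

    `probs` maps each candidate token to its probability.
    """
    # Top-k: keep only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)

    # Top-p: keep the smallest prefix whose renormalized cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {token: p / norm for token, p in kept}


# Invented next-token distribution for illustration.
probs = {"les": 0.5, "des": 0.3, "la": 0.15, "un": 0.05}
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.8)
print(filtered)
```

Sampling then draws the next token from the filtered distribution, which is why repeated calls with `do_sample=True` can return different translations.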

Decode the generated token ids back into text:

In [45]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

"Les légumes partagent les ressources avec des bactéries fixantes d'azote."

# Fine-Tuning Transformers for Text Classification

Text classification is a common NLP task that assigns a label or class to text. Some of the largest companies run text classification in production for a wide range of practical applications. One of the most popular forms of text classification is sentiment analysis, which assigns a label like 🙂 positive, 🙁 negative, or 😐 neutral to a sequence of text.

![1](https://github.com/huggingface/notebooks/blob/main/examples/images/text_classification.png?raw=1)


This guide will show you how to:

1. Finetune [DistilBERT](https://huggingface.co/distilbert-base-uncased) on the [IMDb](https://huggingface.co/datasets/imdb) dataset to determine whether a movie review is positive or negative.
2. Use your finetuned model for inference.

<Tip>
The task illustrated in this tutorial is supported by the following model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[ALBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/albert), [BART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bart), [BERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bert), [BigBird](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/big_bird), [BigBird-Pegasus](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bigbird_pegasus), [BioGpt](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/biogpt), [BLOOM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bloom), [CamemBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/camembert), [CANINE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/canine), [ConvBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/convbert), [CTRL](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/ctrl), [Data2VecText](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/data2vec-text), [DeBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/deberta), [DeBERTa-v2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/deberta-v2), [DistilBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/distilbert), [ELECTRA](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/electra), [ERNIE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/ernie), [ErnieM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/ernie_m), [ESM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/esm), [FlauBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/flaubert), [FNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/fnet), [Funnel Transformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/funnel), 
[GPT-Sw3](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt-sw3), [OpenAI GPT-2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt2), [GPTBigCode](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt_bigcode), [GPT Neo](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt_neo), [GPT NeoX](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt_neox), [GPT-J](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gptj), [I-BERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/ibert), [LayoutLM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/layoutlm), [LayoutLMv2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/layoutlmv2), [LayoutLMv3](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/layoutlmv3), [LED](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/led), [LiLT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/lilt), [LLaMA](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/llama), [Longformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/longformer), [LUKE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/luke), [MarkupLM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/markuplm), [mBART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mbart), [MEGA](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mega), [Megatron-BERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/megatron-bert), [MobileBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mobilebert), [MPNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mpnet), [MVP](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mvp), 
[Nezha](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/nezha), [Nyströmformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/nystromformer), [OpenLlama](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/open-llama), [OpenAI GPT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/openai-gpt), [OPT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/opt), [Perceiver](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/perceiver), [PLBart](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/plbart), [QDQBert](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/qdqbert), [Reformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/reformer), [RemBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/rembert), [RoBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roberta), [RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roberta-prelayernorm), [RoCBert](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roc_bert), [RoFormer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roformer), [SqueezeBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/squeezebert), [TAPAS](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/tapas), [Transformer-XL](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/transfo-xl), [XLM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm), [XLM-RoBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-roberta), [XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-roberta-xl), [XLNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlnet), [X-MOD](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xmod), 
[YOSO](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/yoso)



<!--End of the generated tip-->

</Tip>

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate
```

## Load IMDb dataset

Start by loading the IMDb dataset from the Datasets library:

In [1]:
from datasets import load_dataset

imdb = load_dataset("imdb")

Then take a look at an example:

In [2]:
imdb["test"][0]

{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as 

There are two fields in this dataset:

- `text`: the movie review text.
- `label`: a value that is either `0` for a negative review or `1` for a positive review.

## Preprocess

The next step is to load a DistilBERT tokenizer to preprocess the `text` field:

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Create a preprocessing function to tokenize `text` and truncate sequences to be no longer than DistilBERT's maximum input length:

In [4]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:

In [5]:
tokenized_imdb = imdb.map(preprocess_function, batched=True)

Now create a batch of examples using [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [6]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)



## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric (see the Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

In [8]:
#!pip install evaluate

In [9]:
import evaluate

accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the accuracy:

In [10]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
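
The two operations inside `compute_metrics` (an argmax over the class dimension, then the fraction of matches) can be checked by hand on a tiny example. The logits and labels below are invented, and the code is a plain-Python stand-in for `np.argmax` and the accuracy metric rather than a replacement for them.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


# Invented logits for three reviews: column 0 = negative, column 1 = positive.
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]]
labels = [0, 1, 1]

predictions = [argmax(row) for row in logits]
accuracy_value = sum(p == l for p, l in zip(predictions, labels)) / len(labels)

print(predictions, accuracy_value)
```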

## Train

Before you start training your model, create a map of the expected ids to their labels with `id2label` and `label2id`:

In [11]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
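
Writing both dictionaries by hand works for two classes, but they can drift apart as the label set grows. One option, shown here as a small sketch, is to derive `label2id` from `id2label` so the two mappings stay consistent:

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Invert the mapping instead of typing it twice.
label2id = {label: idx for idx, label in id2label.items()}

print(label2id)
```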

<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load DistilBERT with [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSequenceClassification) along with the number of expected labels, and the label mappings:

In [12]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classi

At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir` which specifies where to save your model. At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the accuracy and save the training checkpoint.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [15]:
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.2318 | 0.186235 | 0.92816 |
| 2 | 0.1494 | 0.241715 | 0.93076 |


TrainOutput(global_step=3126, training_loss=0.20540331512861196, metrics={'train_runtime': 1852.6272, 'train_samples_per_second': 26.989, 'train_steps_per_second': 1.687, 'total_flos': 6561288258498624.0, 'train_loss': 0.20540331512861196, 'epoch': 2.0})
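
The two epochs logged above disagree about which checkpoint is best: validation loss is lower after epoch 1, while accuracy is higher after epoch 2. According to the `TrainingArguments` documentation, `load_best_model_at_end=True` without `metric_for_best_model` compares checkpoints by evaluation loss. The sketch below applies both selection rules to the logged values; it only mimics the rule and does not call the `Trainer`.

```python
# Evaluation results copied from the training log above.
epochs = [
    {"epoch": 1, "eval_loss": 0.186235, "accuracy": 0.92816},
    {"epoch": 2, "eval_loss": 0.241715, "accuracy": 0.93076},
]

# Default rule: the lowest evaluation loss wins.
best_by_loss = min(epochs, key=lambda row: row["eval_loss"])

# Alternative rule: the highest accuracy wins.
best_by_accuracy = max(epochs, key=lambda row: row["accuracy"])

print(best_by_loss["epoch"], best_by_accuracy["epoch"])
```

If accuracy matters more for your use case, setting `metric_for_best_model="accuracy"` in `TrainingArguments` is the documented way to change the criterion.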

In [16]:
trainer.push_to_hub()

'https://huggingface.co/irodrigues/my_awesome_model/commit/4226df5cdba03134a1f8069debe7470b701e2486'

<Tip>

[Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) applies dynamic padding by default when you pass `tokenizer` to it. In this case, you don't need to specify a data collator explicitly.

</Tip>
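
To make "dynamic padding" concrete, here is a minimal pure-Python sketch of the idea: each batch is padded only to the length of its own longest sequence, not to a global maximum. The helper `pad_batch`, the token ids, and the `pad_id=0` default are hypothetical and chosen for illustration; this is not the library's implementation.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of its longest member.

    Hypothetical helper for illustration; `pad_id=0` is an assumed padding id.
    """
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two batches with made-up token ids; each is padded to its own maximum length.
short_batch = pad_batch([[101, 2023, 102], [101, 102]])
long_batch = pad_batch([[101, 2023, 2001, 1037, 102], [101, 2307, 102]])

print(short_batch["input_ids"])
print(long_batch["attention_mask"])
```

Because the padded width follows each batch, batches of short reviews waste less computation on padding tokens than they would if every example were padded to one fixed length.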

<Tip>

For a more in-depth example of how to fine-tune a model for text classification, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb).

</Tip>

## Inference

Great, now that you've fine-tuned a model, you can use it for inference!

Grab some text you'd like to run inference on:

In [17]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

The simplest way to try out your fine-tuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for sentiment analysis with your model, and pass your text to it:

In [19]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="irodrigues/my_awesome_model")
classifier(text)

[{'label': 'POSITIVE', 'score': 0.991104781627655}]

You can also manually replicate the results of the `pipeline` if you'd like:

Tokenize the text and return PyTorch tensors:

In [20]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("irodrigues/my_awesome_model")
inputs = tokenizer(text, return_tensors="pt")

Pass your inputs to the model and return the `logits`:

In [23]:
import torch

In [24]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("irodrigues/my_awesome_model")
with torch.no_grad():
    logits = model(**inputs).logits

Get the class with the highest probability, and use the model's `id2label` mapping to convert it to a text label:

In [25]:
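# logits has shape (1, num_labels) for a single input; argmax picks the highest-scoring class index.
predicted_class_id = logits.argmax().item()
# Map the class index back to its text label.
model.config.id2label[predicted_class_id]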

'POSITIVE'
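
The manual steps above recover the label but not the `score` that the `pipeline` reported. For a single-label classifier, that score is typically the softmax probability of the predicted class. The sketch below shows the computation in plain Python; the logit values and the `id2label` mapping are hypothetical stand-ins, not the values produced by the model above.

```python
import math


def softmax(values):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(values)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one input, ordered by class index.
example_logits = [-2.3, 2.4]
# Assumed label mapping; the real one lives in model.config.id2label.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": id2label[best], "score": probs[best]}
print(result)
```

With the real model, the same idea can be applied to the `logits` tensor using `torch.softmax(logits, dim=-1)`.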