## Hugging Face Transformers

This notebook draws on several resources, including the [Hugging Face documentation](https://huggingface.co/docs/transformers/index) and the official [tutorials](https://huggingface.co/docs/transformers/notebooks) for the library.

### Tokenizer

A tokenizer preprocesses text into an array of numbers that serves as input to a model. Several rules govern the tokenization process, including how a word is split and at what level (word, subword, or character) the splitting happens. The most important thing to remember is to instantiate the tokenizer from the same model name as the model itself, so that you use the same tokenization rules the model was pretrained with.
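To make the word-splitting rules concrete, here is a minimal sketch of greedy longest-match-first splitting, the idea behind the WordPiece scheme used by BERT-style tokenizers. The vocabulary `toy_vocab` and the function `wordpiece_tokenize` are hypothetical and only for illustration; the real `bert-base-cased` tokenizer has a much larger vocabulary and also handles normalization and punctuation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-word pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real bert-base-cased vocabulary.
toy_vocab = {"token", "##ize", "##izer", "learn", "##ing", "a"}

print(wordpiece_tokenize("tokenize", toy_vocab))
print(wordpiece_tokenize("tokenizer", toy_vocab))
print(wordpiece_tokenize("learning", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A word that is missing from the vocabulary as a whole can still be represented by smaller pieces, which is why a different vocabulary (a different model name) produces different ids for the same text.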


In [1]:
!pip install transformers torch



In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer("Let's learn how to tokenize a sentence using transformers library.")
print(encoding)


{'input_ids': [101, 2421, 112, 188, 3858, 1293, 1106, 22559, 3708, 170, 5650, 1606, 11303, 1468, 3340, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


Here `input_ids` are the vocabulary indices of the tokens, `token_type_ids` indicate which sequence each token belongs to when the input contains more than one (BERT uses this for sentence pairs), and `attention_mask` marks whether the model should attend to a token (for example, you don't want the model to attend to padding tokens).

You can get back the original sentence from the ids as:

In [3]:
tokenizer.decode(encoding["input_ids"])



"[CLS] Let's learn how to tokenize a sentence using transformers library. [SEP]"

Here `[CLS]` is a special token added at the start of the sequence, and `[SEP]` is a special separator token. When finetuning a model like BERT for classification, the representation of `[CLS]` is commonly used as the representation of the whole sentence, and the classification head is trained on top of it.
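The role of the special tokens and of `token_type_ids` is easiest to see with a sentence pair. The sketch below assembles BERT-style inputs by hand from already-tokenized id lists. The helper `build_pair_inputs` is hypothetical; the ids 101 and 102 are the `[CLS]` and `[SEP]` ids visible in the outputs above, and the content ids are borrowed from the batch example in this notebook.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] in the outputs above


def build_pair_inputs(ids_a, ids_b=None):
    """Assemble BERT-style inputs from already-tokenized id lists."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)  # first sequence is segment 0
    if ids_b is not None:
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)  # second sequence is segment 1
    attention_mask = [1] * len(input_ids)  # no padding, so attend everywhere
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


single = build_pair_inputs([1135, 1108, 1216])
pair = build_pair_inputs([1135, 1108, 1216], [1731, 1242])
print(single)
print(pair)
```

With a single sequence every `token_type_ids` entry is 0, which matches the tokenizer outputs shown so far.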

You can also process sentences in a batch:

In [4]:
batch_sentences = [
    "I was literally starving by that time since I had not eating anything that morning.",
    "It was such a good day.",
    "How many apples should I buy Mark?",
]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

{'input_ids': [[101, 146, 1108, 6290, 20285, 1118, 1115, 1159, 1290, 146, 1125, 1136, 5497, 1625, 1115, 2106, 119, 102], [101, 1135, 1108, 1216, 170, 1363, 1285, 119, 102], [101, 1731, 1242, 22888, 1431, 146, 4417, 2392, 136, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


#### Padding

As you might notice above, the sentences have different lengths, so the returned `input_ids`, `attention_mask`, etc. have different lengths too. We can pad the shorter sentences so that they match the length of the longest sentence (this allows you to use the returned values directly in your models):

In [5]:
batch_sentences = [
    "I was literally starving by that time since I had not eating anything that morning.",
    "It was such a good day.",
    "How many apples should I buy Mark?",
]
encoded_inputs = tokenizer(batch_sentences, padding=True)
print(encoded_inputs)

{'input_ids': [[101, 146, 1108, 6290, 20285, 1118, 1115, 1159, 1290, 146, 1125, 1136, 5497, 1625, 1115, 2106, 119, 102], [101, 1135, 1108, 1216, 170, 1363, 1285, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1731, 1242, 22888, 1431, 146, 4417, 2392, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
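What `padding=True` does to the batch can be sketched in a few lines of plain Python. The helper `pad_batch` is hypothetical; the pad id 0 is taken from the output above, and the short id lists are made-up inputs.

```python
PAD_ID = 0  # bert-base-cased pads with id 0, as in the output above


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 = real token, 0 = padding that the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 1135, 1108, 102], [101, 1731, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor.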


We can again decode the sentences. Notice that this time the shorter sentences have `[PAD]` tokens after the final `[SEP]`.

In [6]:
print(tokenizer.decode(encoded_inputs["input_ids"][0]))
print(tokenizer.decode(encoded_inputs["input_ids"][1]))
print(tokenizer.decode(encoded_inputs["input_ids"][2]))

assert len(encoded_inputs["input_ids"][0])==len(encoded_inputs["input_ids"][1])==len(encoded_inputs["input_ids"][2])

[CLS] I was literally starving by that time since I had not eating anything that morning. [SEP]
[CLS] It was such a good day. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[CLS] How many apples should I buy Mark? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]


#### Truncation

Sometimes a sentence is too long for a model to handle. In this case you can truncate it: with `truncation=True` and no explicit `max_length`, the input is cut to the maximum length accepted by the model (512 tokens for BERT).

Additionally, so that the returned values can be passed directly to a model, you can set `return_tensors="pt"`, which returns PyTorch tensors:

In [7]:
batch_sentences = [
    "I was literally starving by that time since I had not eating anything that morning.",
    "It was such a good day.",
    "How many apples should I buy Mark?",
]
encoded_inputs = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_inputs)

{'input_ids': tensor([[  101,   146,  1108,  6290, 20285,  1118,  1115,  1159,  1290,   146,
          1125,  1136,  5497,  1625,  1115,  2106,   119,   102],
        [  101,  1135,  1108,  1216,   170,  1363,  1285,   119,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  1731,  1242, 22888,  1431,   146,  4417,  2392,   136,   102,
             0,     0,     0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
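None of the three sentences above is long enough to be truncated, so here is a simplified sketch of the idea for a single sequence. It assumes truncation from the right and a budget that reserves room for `[CLS]` and `[SEP]`; the helper `truncate_single` and the content ids are hypothetical, and the real tokenizer supports further options.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] for bert-base-cased


def truncate_single(content_ids, max_length):
    """Cut content so that [CLS] + content + [SEP] fits in max_length."""
    budget = max_length - 2  # reserve room for the two special tokens
    return [CLS_ID] + content_ids[:budget] + [SEP_ID]


content = list(range(1000, 1010))  # 10 hypothetical content token ids
short = truncate_single(content, max_length=6)
print(short)
print(len(short))  # 6
```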


### Auto Classes

The library provides Auto Classes, which allow us to load tokenizers, configurations and models through a common interface instead of importing a separate class for each tokenizer and model. For models, there are different Auto Classes depending on the type of model (or task head) you need.

In [8]:
from transformers import AutoConfig

# Load model configurations -- this will download from the hub and cache it
config = AutoConfig.from_pretrained("bert-base-uncased")

# Check what the config looks like
print(config)


BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.34.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
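The `model_type` field in the config above is what lets an Auto Class pick a concrete class. The following is a toy illustration of that dispatch idea only; every class and the mapping here are hypothetical stand-ins, and the library's actual implementation differs.

```python
class ToyBertConfig:
    model_type = "bert"


class ToyGPT2Config:
    model_type = "gpt2"


class ToyBertForSequenceClassification:
    def __init__(self, config):
        self.config = config


class ToyGPT2ForSequenceClassification:
    def __init__(self, config):
        self.config = config


# Hypothetical registry: model_type string -> concrete model class
TOY_MAPPING = {
    "bert": ToyBertForSequenceClassification,
    "gpt2": ToyGPT2ForSequenceClassification,
}


class ToyAutoModelForSequenceClassification:
    @classmethod
    def from_config(cls, config):
        """Look up the concrete class for this config and instantiate it."""
        if config.model_type not in TOY_MAPPING:
            raise ValueError(f"Unsupported model_type: {config.model_type}")
        return TOY_MAPPING[config.model_type](config)


toy_model = ToyAutoModelForSequenceClassification.from_config(ToyBertConfig())
print(type(toy_model).__name__)  # ToyBertForSequenceClassification
```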



We've already looked at loading tokenizers using the `AutoTokenizer` class, so let's now focus on how to load a model using the Auto Classes.

In [9]:
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoModelForSeq2SeqLM, AutoModelForSequenceClassification

# # Load a model which is a causal LM (model with a LM head on top)
# model = AutoModelForCausalLM.from_pretrained("gpt2")

# # Load a masked language model
# model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# # Load a sequence-2-sequence model (i.e. encoder-decoder based)
# model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# Load a model with a classification head on top (you can finetune this on any classification task)
# e.g. BERT model with a classification head on top
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

There are many more Auto Classes for specific tasks, such as `AutoModelForTokenClassification` and `AutoModelForQuestionAnswering`. You can check the [documentation](https://huggingface.co/docs/transformers/model_doc/auto#auto-classes) for more details.


If you don't want to use these Auto Classes, you can always use the model-specific class directly.

In [11]:
from transformers import BertForSequenceClassification

In [12]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [14]:
del model

### Finetuning a Model

Next, let's look at how to load a pretrained model and finetune it on a dataset (for which we will use the `datasets` library from last time). For now, we will define the training loop manually, similar to what we covered in the PyTorch tutorial and what you had in HW2.


In [15]:
! pip install datasets tqdm evaluate

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [16]:
from datasets import load_dataset
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_scheduler
from tqdm.auto import tqdm
import evaluate


# Load the dataset
# dataset = load_dataset("yelp_review_full")
dataset = load_dataset("glue", "rte")


Found cached dataset glue (/home/yk2516/.cache/huggingface/datasets/glue/rte/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/3 [00:00<?, ?it/s]

In [17]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 2490
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 277
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3000
    })
})

In [18]:
dataset["train"][0]

{'sentence1': 'No Weapons of Mass Destruction Found in Iraq Yet.',
 'sentence2': 'Weapons of Mass Destruction Found in Iraq.',
 'label': 1,
 'idx': 0}

In [19]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


In [20]:
# # Define a function to use in map (preprocessing)
# def tokenize_function(examples):
#     return tokenizer(examples["text"], padding="max_length", truncation=True)

# # Tokenize the dataset
# tokenized_datasets = dataset.map(tokenize_function, batched=True)

def tokenize_function(example):
  tokenized_example = tokenizer(
      example["sentence1"],
      example["sentence2"],
      max_length=tokenizer.model_max_length,
      padding="max_length",
      truncation=True,
  )
  # Skip dummy label for test set
  if example["label"] != -1:
    tokenized_example["label"] = example["label"]
  return tokenized_example

tokenized_datasets = dataset.map(
    tokenize_function,
    remove_columns=["sentence1", "sentence2", "idx"]
)


Loading cached processed dataset at /home/yk2516/.cache/huggingface/datasets/glue/rte/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-6e6ad128f5133f68.arrow
Loading cached processed dataset at /home/yk2516/.cache/huggingface/datasets/glue/rte/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-bbdb435608fd1ff3.arrow


Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

In [21]:
# Prepare for training: the model expects the target column to be named "labels"
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")


In [22]:
# For the purpose of this notebook we will use smaller datasets
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["validation"].shuffle(seed=42).select(range(200))

# Let's first look at what an example in our data looks like
print(small_train_dataset[0])


Loading cached shuffled indices for dataset at /home/yk2516/.cache/huggingface/datasets/glue/rte/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-285397329eb2f882.arrow
Loading cached shuffled indices for dataset at /home/yk2516/.cache/huggingface/datasets/glue/rte/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-8ad81eff304d9187.arrow


{'labels': tensor(0), 'input_ids': tensor([  101,  5749,  1106,   170,  4265,  8214,  1113,  1103, 20728,  2597,
         1115,  1108,  2085,  1107,  2056,  1118,  1103,  5818,  1113,  1570,
         5820,  1107,  5135, 27290,  1174, 11763,  1104,  5469,   143,  3984,
         1605,  1105, 15527,   113,   140, 12150,  9919,   114,   117,  1103,
         2170, 15172,  1416,  1336,  1129,  1231, 27563,  1157, 14061,  1755,
        16137,   102,  1109,  8214,  1113, 20728,  2597,  1144,  1151,  3903,
         1107,  8547,  1103, 15172,  1121, 16137,   119,   102,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0, 

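Next, using the dataset, we will define the training loop for BERT. Once the model has been trained for 3 epochs, we will evaluate it on the (subsampled) validation data and compute the accuracy. We use the validation split because the test split only carries the dummy label -1.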

In [23]:
# Dataloader (this will give us an iterator with specified batch size)
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

# Define model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Define optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Learning rate scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

# Check if there is GPU and move model to device accordingly
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"device={device}")
model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


device=cuda


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [32]:
# Main training loop
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)



  0%|          | 0/375 [00:00<?, ?it/s]
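The 375 steps on the progress bar and the behaviour of the `"linear"` schedule can be checked with some simple arithmetic. The function `linear_lr` is a hand-written sketch of a linear warmup and decay rule, assuming the learning rate falls linearly to zero over the training steps; it is not the library's code.

```python
import math


def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given step under linear warmup, then linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


# 1000 training examples with batch size 8, trained for 3 epochs
steps_per_epoch = math.ceil(1000 / 8)
num_training_steps = 3 * steps_per_epoch
print(steps_per_epoch, num_training_steps)  # 125 375

for step in (0, 125, 250, 375):
    print(step, linear_lr(step, 5e-5, 0, num_training_steps))
```

With zero warmup steps, the learning rate starts at the optimizer's `lr=5e-5` and reaches 0 at the last step, which is why `lr_scheduler.step()` is called once per batch.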

In [33]:
# Evaluate the trained model
metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

print("Accuracy: ", metric.compute())

Accuracy:  {'accuracy': 0.62}
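The evaluation loop above turns logits into predictions with an argmax and then counts matches. The same computation on made-up numbers, in plain Python (the logits and labels below are hypothetical, and `accuracy` is a hand-written stand-in for the `evaluate` metric):

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical logits for 4 examples and 2 classes
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.1]]
toy_labels = [0, 1, 0, 0]
print(accuracy(toy_logits, toy_labels))  # 0.75
```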


For reference, the command below produces the following GLUE results for `bert-base-cased` (from https://github.com/microsoft/LoRA/blob/main/examples/NLU/examples/text-classification/README.md).

```bash
export TASK_NAME=mrpc

python run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir /tmp/$TASK_NAME/
```

| Task  | Metric                       | Result      | Training time |
|-------|------------------------------|-------------|---------------|
| CoLA  | Matthews corr.               | 56.53       | 3:17          |
| SST-2 | Accuracy                     | 92.32       | 26:06         |
| MRPC  | F1/Accuracy                  | 88.85/84.07 | 2:21          |
| STS-B | Pearson/Spearman corr.       | 88.64/88.48 | 2:13          |
| QQP   | Accuracy/F1                  | 90.71/87.49 | 2:22:26       |
| MNLI  | Matched acc./Mismatched acc. | 83.91/84.10 | 2:35:23       |
| QNLI  | Accuracy                     | 90.66       | 40:57         |
| RTE   | Accuracy                     | 65.70       | 57            |
| WNLI  | Accuracy                     | 56.34       | 24            |


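Our 62% accuracy on RTE, obtained with only 1000 training examples and measured on 200 validation examples, is close to the 65.70 reported in the table, so we're doing decently well.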

#### Trainer

The library provides a `Trainer` class for training (or finetuning) models, so you don't have to write your own training loop. It supports a wide range of features such as gradient accumulation and configurable evaluation strategies (e.g. evaluating every epoch). If the PyTorch extras it depends on are missing, install them with `pip install transformers[torch]`.

In [25]:
from transformers import TrainingArguments, Trainer
import evaluate
import numpy as np

# Define model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

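# Contains all the hyperparameters and arguments for training.
# report_to="none" (the string) disables logging integrations.
# Newer transformers releases rename evaluation_strategy to eval_strategy.
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch", report_to="none")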

# Define the function to use for evaluation
# All models return logits which we need to process to compute our metric (accuracy in this case)
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.67701,0.565
2,No log,0.662719,0.62
3,No log,1.188929,0.645


TrainOutput(global_step=375, training_loss=0.5087980550130208, metrics={'train_runtime': 73.8907, 'train_samples_per_second': 40.601, 'train_steps_per_second': 5.075, 'total_flos': 789333166080000.0, 'train_loss': 0.5087980550130208, 'epoch': 3.0})

In [32]:
trainer.state

TrainerState(epoch=3.0, global_step=375, max_steps=375, logging_steps=500, eval_steps=500, save_steps=500, num_train_epochs=3, total_flos=789333166080000.0, log_history=[{'eval_loss': 0.6770097613334656, 'eval_accuracy': 0.565, 'eval_runtime': 1.496, 'eval_samples_per_second': 133.688, 'eval_steps_per_second': 16.711, 'epoch': 1.0, 'step': 125}, {'eval_loss': 0.6627189517021179, 'eval_accuracy': 0.62, 'eval_runtime': 1.4991, 'eval_samples_per_second': 133.409, 'eval_steps_per_second': 16.676, 'epoch': 2.0, 'step': 250}, {'eval_loss': 1.1889294385910034, 'eval_accuracy': 0.645, 'eval_runtime': 1.4957, 'eval_samples_per_second': 133.714, 'eval_steps_per_second': 16.714, 'epoch': 3.0, 'step': 375}, {'train_runtime': 73.8907, 'train_samples_per_second': 40.601, 'train_steps_per_second': 5.075, 'total_flos': 789333166080000.0, 'train_loss': 0.5087980550130208, 'epoch': 3.0, 'step': 375}], best_metric=None, best_model_checkpoint=None, is_local_process_zero=True, is_world_process_zero=True, i

In the above code, we used the default values for the learning rate, optimizer etc. Make sure to check what these default values are in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) and change them as required.

### Text Generation

The above example covered how to finetune a model for text classification. In this part, we will focus on text generation. The training part is very similar to the above, so we will skip that and instead focus on how to do text generation with different decoding strategies.

In [33]:

# We'll try out using GPT2 and load its corresponding tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

# encode context the generation is conditioned on
input_ids = tokenizer('I enjoy walking with my cute dog', return_tensors='pt').input_ids

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))



Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]



Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll
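The repetition in the output above is typical of greedy decoding, which always picks the single most likely next token. A toy sketch shows how this can lock into a loop. `TOY_LM` is a hypothetical table that conditions only on the previous token, unlike GPT-2, which conditions on the whole context.

```python
# Hypothetical next-token probabilities, keyed by the previous token
TOY_LM = {
    "I": {"am": 0.6, "enjoy": 0.4},
    "am": {"not": 0.7, "happy": 0.3},
    "not": {"sure": 0.9, "happy": 0.1},
    "sure": {"I": 0.8, "<eos>": 0.2},
    "enjoy": {"walking": 1.0},
    "walking": {"<eos>": 1.0},
    "happy": {"<eos>": 1.0},
}


def greedy_decode(prompt, max_length):
    """Repeatedly append the most likely next token (max_length includes the prompt)."""
    tokens = list(prompt)
    while len(tokens) < max_length:
        next_probs = TOY_LM[tokens[-1]]
        next_token = max(next_probs, key=next_probs.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


print(" ".join(greedy_decode(["I"], max_length=9)))
# I am not sure I am not sure I
```

Because the highest-probability choice at each step leads back to an earlier token, the sequence repeats until `max_length` is reached.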


We can also use other decoding methods such as beam search decoding. For example:

In [34]:
# activate beam search and early_stopping
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure if I'll
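Beam search keeps the `num_beams` best partial sequences at every step instead of only one, so it can find a sequence whose overall probability is higher than the greedy choice. A minimal sketch on a hypothetical toy table follows; it uses no length normalization or early-stopping heuristics, which the library's `generate` does handle.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
TOY_LM = {
    "I": {"like": 0.6, "love": 0.4},
    "like": {"dogs": 0.4, "cats": 0.3, "birds": 0.3},
    "love": {"dogs": 0.9, "cats": 0.1},
    "dogs": {"<eos>": 1.0},
    "cats": {"<eos>": 1.0},
    "birds": {"<eos>": 1.0},
}


def beam_search(prompt, num_beams, max_length):
    """Return the best sequence found and its probability."""
    beams = [(0.0, list(prompt), False)]  # (log prob, tokens, finished)
    for _ in range(max_length - len(prompt)):
        candidates = []
        for score, tokens, finished in beams:
            if finished:
                candidates.append((score, tokens, True))
                continue
            for token, prob in TOY_LM[tokens[-1]].items():
                new_score = score + math.log(prob)
                if token == "<eos>":
                    candidates.append((new_score, tokens, True))
                else:
                    candidates.append((new_score, tokens + [token], False))
        # Keep only the num_beams highest-scoring candidates
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
        if all(finished for _, _, finished in beams):
            break
    best_score, best_tokens, _ = beams[0]
    return best_tokens, math.exp(best_score)


print(beam_search(["I"], num_beams=1, max_length=5))  # greedy: "I like dogs"
print(beam_search(["I"], num_beams=2, max_length=5))  # finds "I love dogs"
```

With one beam the search is greedy and commits to "like" (probability 0.6), ending with a total of 0.6 × 0.4 = 0.24. With two beams it keeps "love" alive and finds the sequence with probability 0.4 × 0.9 = 0.36.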


You can also use methods like top-k or top-p sampling and specify how many sequences should be returned during generation using the parameter `num_return_sequences`:

In [36]:

# set top_k to 50
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50,
    num_return_sequences=2
)

print("Output:\n" + 100 * '-')
for i, output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(output, skip_special_tokens=True)))


Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog," Kasei said. The dog is now able to go to the garage to wash his paws, but Kasei feathers them up and rides them for weeks. The dog can walk up and down stairs while
1: I enjoy walking with my cute dog in the spring and after that I get up and walk around the city everyday with her. I am not sure if she can handle it without me, but at least I have the freedom or something. Maybe I should
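The cell above only sets `top_k`. To show how top-k and top-p (nucleus) sampling differ, here is a sketch of both filters applied to one hypothetical next-token distribution. The helper names are made up for illustration, and the library applies these filters to logits over the full vocabulary rather than to a small dict.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    kept, mass = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept}


def sample(probs, rng):
    """Draw one token according to its (renormalized) probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical distribution over the next token
next_token_probs = {"dog": 0.5, "cat": 0.25, "park": 0.15, "car": 0.10}

rng = random.Random(0)
k_filtered = top_k_filter(next_token_probs, k=2)
p_filtered = top_p_filter(next_token_probs, p=0.8)
print(sorted(k_filtered))  # ['cat', 'dog']
print(sorted(p_filtered))  # ['cat', 'dog', 'park']
print(sample(k_filtered, rng), sample(p_filtered, rng))
```

Top-k always keeps a fixed number of candidates, while top-p keeps more candidates when the distribution is flat and fewer when it is peaked.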
