# Transformers: Fine-tune your own BERT Model

In [4]:
!pip install datasets evaluate transformers



In this notebook, we will

1) look at more functions from the 🤗 `transformers` library
2) learn how to work with 🤗 `datasets`
3) fine-tune a pre-trained BERT model

## What happens in `pipeline()`?

*This section is based on [Chapter 2](https://huggingface.co/learn/llm-course/chapter2/2) of the 🤗 LLM Course.*

You are already familiar with the `pipeline()` method from our preceding notebook. Here is an example to jog your memory:

In [5]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "My research is going to be whole different ballgame using LLMs!",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9055778980255127},
 {'label': 'NEGATIVE', 'score': 0.9966409206390381}]

Alright, so this little function takes text inputs and transforms them into label predictions. But since transformers cannot simply be fit on the raw texts, several things have to happen in the pipeline:

![Visualization from Chapter 2 of the Hugging Face Course](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)

You already know from our last session how to load the tokenizer of a specific model and tokenize input texts using `AutoTokenizer`. Let's repeat this briefly:

In [6]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english" # this is the default model for sentiment-analysis
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [7]:
raw_inputs = [
    "My research is going to be whole different ballgame using LLMs!",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  2026,  2470,  2003,  2183,  2000,  2022,  2878,  2367,  3608,
         16650,  2478,  2222,  5244,   999,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


*Can you explain what the `padding` and `truncation` parameters do here? What would the data look like if you did not specify them?*
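
One way to check your answer to the question above is to mimic the two parameters by hand. The sketch below is an illustration only, not the tokenizer's actual implementation: `pad_and_truncate` is a hypothetical helper, the pad ID `0` is read off the output above, and `max_length=512` is an assumed stand-in for the model's maximum input length. Real tokenizers also keep special tokens such as `[SEP]` when truncating, which this sketch ignores.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Hypothetical helper: cut each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs of the two example sentences, copied from the output above
batch = [
    [101, 2026, 2470, 2003, 2183, 2000, 2022, 2878, 2367, 3608,
     16650, 2478, 2222, 5244, 999, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]
encoded = pad_and_truncate(batch, max_length=512)
print(encoded["attention_mask"][1])
```

Without padding, the two lists would have different lengths (16 and 8) and could not be stacked into one rectangular tensor.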

Great. So we've already covered the first step! Let's look at how we could load a specific model and generate output embeddings of each text using our token IDs. We start by loading the model from the hub using `AutoModel()`:

In [8]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

This architecture contains only the base Transformer module: given some inputs, it returns the output embeddings or *hidden states* of the model, which might then be fed into a classification head downstream. We can simply use the assigned object to run the forward pass and return the model embeddings:

In [9]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


*Interpret these dimensions. What does each number describe?*

Now we need to transform these representations into useful classifications. As you heard earlier today, models 'translate' embeddings into classifications using a **classification head**. Classification heads exist for many different tasks, e.g. question answering or token classification. We are interested in classifying entire sequences. To do so, we load the model with a classification head for sequence classification using `AutoModelForSequenceClassification` instead of `AutoModel`.

In [10]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [12]:
print(outputs.logits.shape)

torch.Size([2, 2])
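
The shape `[2, 2]` means one row per input sentence and one logit per label. In its simplest form, a classification head is a linear layer applied to one vector per sequence. The toy sketch below uses made-up numbers and a hidden size of 4 instead of 768; `linear_head`, `pooled`, `weight` and `bias` are hypothetical names, and the actual head of this checkpoint may contain additional layers.

```python
def linear_head(hidden, weight, bias):
    """Map one hidden vector to one logit per label: logits = W @ h + b."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weight, bias)
    ]


# Made-up sequence representations for two sentences (hidden size 4, not 768)
pooled = [
    [0.5, -1.0, 0.25, 2.0],
    [-0.5, 1.0, 0.0, -2.0],
]
# Made-up head parameters: 2 labels x 4 hidden dimensions
weight = [
    [-1.0, 0.5, 0.0, -1.0],  # label 0
    [1.0, -0.5, 0.0, 1.0],   # label 1
]
bias = [0.0, 0.0]

logits = [linear_head(h, weight, bias) for h in pooled]
print(len(logits), len(logits[0]))  # 2 sentences x 2 labels
```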


Instead of outputting the embeddings, this model contains a classification head which transforms the original, high-dimensional representation into a vector containing two values (one for each label). The values we get as output from our model are not probabilities but so-called *logits*, that is, raw, unnormalized scores. We can transform them into probabilities (values between 0 and 1 that sum to 1 across the labels) using the `softmax` function you already saw earlier today.

In [21]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions.round(decimals=3))

tensor([[0.0940, 0.9060],
        [0.9990, 0.0010]], grad_fn=<RoundBackward1>)


To see which label corresponds to which prediction, you can check the model configuration:

In [23]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
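
Combining the probabilities with this mapping yields the same labels as the `pipeline()` call at the start. The sketch below is a plain-Python illustration using the rounded probabilities and the `id2label` mapping printed above; `results` is a name introduced here for illustration.

```python
# Values copied from the outputs above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [
    [0.0940, 0.9060],
    [0.9990, 0.0010],
]

results = []
for probs in probabilities:
    # Index of the largest probability is the predicted label ID
    best = max(range(len(probs)), key=probs.__getitem__)
    results.append({"label": id2label[best], "score": probs[best]})

print(results)
```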

## Working with 🤗 `datasets`

Before we get started with training our own models, we need to understand how to provide useful inputs to our models. Hugging Face provides a library called `datasets` to make our life a little easier.

The most basic function in the `datasets` library is probably `load_dataset`. Using this function, we can load a dataset of synthetically generated phone calls, annotated for whether a given conversation is a scam or not.

In [1]:
from datasets import load_dataset

dataset = load_dataset("BothBosu/scam-dialogue")



In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['dialogue', 'type', 'label'],
        num_rows: 1280
    })
    test: Dataset({
        features: ['dialogue', 'type', 'label'],
        num_rows: 320
    })
})

As you can see, we get a dictionary (specifically, a `DatasetDict`) containing two parts of this dataset: a train set and a test set (each part is referred to as a 'split' in `datasets` lingo). You can directly download either one by specifying the split you are interested in:

In [3]:
dataset_train = load_dataset("BothBosu/scam-dialogue", split='train')

In [4]:
dataset_train

Dataset({
    features: ['dialogue', 'type', 'label'],
    num_rows: 1280
})

We have three features: dialogue (containing the text of the conversation), the type of the conversation (which we will ignore here), and the label, that is the outcome of interest - whether the conversation was a scam call (1) or not (0). *You can check the [dataset card](https://huggingface.co/datasets/BothBosu/scam-dialogue) for more information.*

You can inspect observations in the data simply by subsetting to the row of interest:

In [5]:
dataset['train'][0]

{'dialogue': "caller: Hello, is this John? receiver: Yes, it is. Who's calling? caller: My name is Officer Johnson from the Social Security Administration. How are you today? receiver: I'm fine, thank you. What can I do for you? caller: We've been trying to reach you about a very important matter. Your social security number has been compromised and we need to take immediate action to protect your identity. receiver: Oh no, what happened? caller: We've received reports of suspicious activity on your account and we need to verify some information to ensure your benefits aren't interrupted. receiver: Okay, what do you need to know? caller: Can you please confirm your social security number for me? receiver: Wait, I'm not comfortable giving that out over the phone. Is this really the Social Security Administration? caller: Ma'am, I assure you this is a legitimate call. We're trying to help you. If you don't cooperate, we'll have to suspend your benefits. receiver: I don't think so. I'm go

Another neat function is `train_test_split`: it divides a dataset into a train and a test set. We can use it here to subdivide our training set into a training and a validation set. It is useful to set a seed for these operations in order to make the process reproducible.

In [6]:
train_val_set = dataset['train'].train_test_split(test_size=0.1, shuffle=True, seed=42)

In [7]:
dataset['train'] = train_val_set['train']
dataset['validation'] = train_val_set['test']

In [8]:
dataset

DatasetDict({
    train: Dataset({
        features: ['dialogue', 'type', 'label'],
        num_rows: 1152
    })
    test: Dataset({
        features: ['dialogue', 'type', 'label'],
        num_rows: 320
    })
    validation: Dataset({
        features: ['dialogue', 'type', 'label'],
        num_rows: 128
    })
})
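
The resulting sizes and the role of the seed can be illustrated in plain Python. This is a conceptual sketch only: `split_indices` is a hypothetical helper, and `datasets` uses its own shuffling internally, so the rows selected here will not match the ones in the actual split.

```python
import random


def split_indices(n_rows, test_size, seed):
    """Hypothetical helper: shuffle row indices with a fixed seed, then cut off a test share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, val_idx = split_indices(n_rows=1280, test_size=0.1, seed=42)
print(len(train_idx), len(val_idx))  # 1152 128

# Same seed, same split: rerunning gives identical indices
assert split_indices(1280, 0.1, 42) == (train_idx, val_idx)
```

With a different seed, the sizes stay the same but different rows end up in the validation set.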

Let's prepare the dataset by tokenizing the conversations. The `datasets` library offers a neat function called `map` to quickly apply any transformation to our data. We start by loading the tokenizer of a pre-trained BERT model and defining the function we want to apply:

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode(examples):
    return tokenizer(examples["dialogue"], truncation=True, padding="max_length")


We can then use `map` to encode each dataset in the dictionary:

In [10]:
dataset = dataset.map(encode, batched=True)
dataset

Map: 100%|██████████| 1152/1152 [00:00<00:00, 3570.20 examples/s]
Map: 100%|██████████| 320/320 [00:00<00:00, 4640.44 examples/s]
Map: 100%|██████████| 128/128 [00:00<00:00, 4693.13 examples/s]


DatasetDict({
    train: Dataset({
        features: ['dialogue', 'type', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1152
    })
    test: Dataset({
        features: ['dialogue', 'type', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 320
    })
    validation: Dataset({
        features: ['dialogue', 'type', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 128
    })
})

As you can see, we have added three columns to our dataset, containing the tokenizer output for each conversation. Note that the logic of map allows you to apply any transformation you want - you simply need to change the supplied function.
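
To illustrate swapping in a different function, here is a sketch of a batched function that adds a word count instead of tokenizer output. `add_word_count` and the column name `n_words` are hypothetical, and the batch is hand-made so the example runs without downloading anything.

```python
def add_word_count(examples):
    """Batched function: receives a dict of lists and returns the new column(s) as lists."""
    return {"n_words": [len(text.split()) for text in examples["dialogue"]]}


# A hand-made batch in the dict-of-lists shape that `map(..., batched=True)` passes in
batch = {
    "dialogue": [
        "caller: Hello, is this John? receiver: Yes, it is.",
        "caller: Hi! receiver: Who's calling?",
    ]
}
print(add_word_count(batch))

# With the real dataset, the call would mirror the one above:
# dataset = dataset.map(add_word_count, batched=True)
```

The function only returns the new column; `map` takes care of merging it with the existing ones.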

The model expects the outcome column to be called `labels`, so let's rename our `label` column:

In [12]:
dataset = dataset.rename_column("label", "labels")

Lastly, we select the columns of interest using `select_columns`. 

*For a full overview of pre-processing functions, check out the [datasets documentation](https://huggingface.co/docs/datasets/en/process).*

In [13]:
dataset = dataset.select_columns(["input_ids", "token_type_ids", "attention_mask", "labels"])

*If you want to learn more about 🤗 `datasets` you can check out the [`datasets` tutorial](https://huggingface.co/docs/datasets/en/quickstart) and the [chapter on `datasets` in the 🤗 LLM course](https://huggingface.co/learn/llm-course/chapter5/1).*

### Using your own data as a dataset

`datasets` also offers helpful functions to load your own data. There are [many functions](https://huggingface.co/docs/datasets/en/loading) to do this, including loading directly from disk. However, especially if you need to prepare the data beforehand, it is usually simplest to work with `pandas` and then convert the dataframe into a `datasets` dataset:

In [35]:
from datasets import Dataset
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
tmp_dataset = Dataset.from_pandas(df)

## Fine-tuning a Transformer Model

Now you should be prepared to understand how to fine-tune your own model. We start by loading a model with the appropriate classification head and defining the number of outcomes.

In [14]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
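model = model.to("cuda")  # requires a CUDA GPU; skip this cell otherwise (Trainer places the model on the available device itself)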

Hugging Face offers the `Trainer` class to make training as simple as possible for us. All we need to do is define some training arguments and supply the training and (optionally) validation data. We can specify many different hyperparameters in the training arguments, but possibly the most important one is the number of epochs, that is, the number of full passes through the training data. We also specify an evaluation strategy (once after every epoch) to assess how well our model predicts the validation data.

In [63]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    "test-trainer",
    num_train_epochs=1,
    eval_strategy="epoch"
    )

We should also specify an evaluation metric to understand how well our model performs.

In [64]:
import pandas as pd
import numpy as np

def acc_prec_rec_f1(predictions, references):
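    # Confusion matrix: rows are predictions, columns are references
    cm = pd.crosstab(predictions, references)
    accuracy = np.diag(cm).sum() / cm.sum().sum()

    # cm[col][row]: column = reference label, row = predicted label
    tn, fp, fn, tp = cm[0][0], cm[0][1], cm[1][0], cm[1][1]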

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return acc_prec_rec_f1(predictions=predictions, references=labels)

Quick test:

In [65]:
predictions = [0,1,0,0,1]
references =  [0,1,0,1,1]

acc_prec_rec_f1(predictions, references)

{'accuracy': 0.8, 'precision': 1.0, 'recall': 0.6666666666666666, 'f1': 0.8}
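
The same numbers can be verified by counting the four cells of the confusion matrix by hand. The sketch below is a plain-Python cross-check of the quick test, not a replacement for the function above; label `1` is treated as the positive class.

```python
predictions = [0, 1, 0, 0, 1]
references = [0, 1, 0, 1, 1]

pairs = list(zip(predictions, references))
tp = sum(p == 1 and r == 1 for p, r in pairs)  # predicted 1, actually 1
fp = sum(p == 1 and r == 0 for p, r in pairs)  # predicted 1, actually 0
fn = sum(p == 0 and r == 1 for p, r in pairs)  # predicted 0, actually 1
tn = sum(p == 0 and r == 0 for p, r in pairs)  # predicted 0, actually 0

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)
print(accuracy, precision, recall, f1)
```

This sketch, like the function above, is not guarded against degenerate cases such as a class that never occurs among the predictions or references.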

In [66]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics
)

You then train the model simply by running `trainer.train()`.

In [67]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | No log | 0.084474 | 0.992188 | 0.986111 | 1.0 | 0.993007 |




TrainOutput(global_step=144, training_loss=0.034307145410113864, metrics={'train_runtime': 265.7918, 'train_samples_per_second': 4.334, 'train_steps_per_second': 0.542, 'total_flos': 303103935774720.0, 'train_loss': 0.034307145410113864, 'epoch': 1.0})
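
The numbers in this output can be reproduced with a little arithmetic. The sketch below assumes the default `per_device_train_batch_size` of 8 and a single device, neither of which is set explicitly in the training arguments above.

```python
import math

n_train = 1152        # rows in dataset["train"]
batch_size = 8        # assumed: default per_device_train_batch_size, one device
num_epochs = 1        # num_train_epochs set above

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 144

# Cross-check against the reported throughput
train_runtime = 265.7918  # seconds, from the TrainOutput above
print(round(n_train * num_epochs / train_runtime, 3))  # 4.334
print(round(total_steps / train_runtime, 3))           # 0.542
```

The `No log` entry for the training loss is most likely due to the default logging interval of 500 steps, which this short run of 144 steps never reaches.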

*What is happening during `trainer.train()`?*

It is best to save the trained model right away:

In [69]:
model.save_pretrained("my_first_model")

You can now generate predictions from the trained model using `trainer.predict()`:

In [73]:
preds = trainer.predict(dataset['test'])



In [None]:
preds.predictions[0] # again, we get logits

array([-5.586289 ,  5.2830496], dtype=float32)
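
To see how confident the model is about this first test example, the two logits can be converted to probabilities by hand. The sketch below is a plain-Python softmax for illustration; `softmax` is defined here and is not the `torch` function used earlier.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-5.586289, 5.2830496]  # first test example, copied from the output above
probs = softmax(logits)
label = probs.index(max(probs))

print([round(p, 5) for p in probs])
print(label)  # 1
```

Because softmax preserves the ordering of the logits, taking the `argmax` of the logits directly gives the same label.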

In [77]:
## transform logits to labels
pred_labels = np.argmax(preds.predictions, axis=-1)

In [78]:
pred_labels

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Finally, we evaluate the model on our test set:

In [None]:
## evaluate
acc_prec_rec_f1(pred_labels, dataset['test']['labels'])

{'accuracy': 1.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}

*Why would it be necessary or useful to use a separate test set here?*

## Exercise

Time to train your own classifier! Look for data of interest, start a new notebook and fine-tune a model!

1. Start by identifying a reasonably sized dataset (max 100k rows) to use for sequence classification from the [Hugging Face Hub](https://huggingface.co/datasets?size_categories=or:%28size_categories:n%3C1K,size_categories:1K%3Cn%3C10K,size_categories:10K%3Cn%3C100K%29&sort=downloads). Load the data.

2. Preprocess the data appropriately. Split the data into train, validation and test sets.

3. Load a 'bert-base-uncased' model and train the model for at least one epoch. Explicitly define some other hyperparameters if you want to.

4. Evaluate the model's performance.