# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 2: Transformer Architecture - Encoder-only Models</font>

# <font color="#003660">Notebook 2: Text Classification with Transformers</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... will know the difference between the feature extraction and fine-tuning approaches for text classification with Transformers, <br>
        ... will know how to train a feature extraction model for text classification, <br>
        ... will know how to train a fine-tuned model for text classification, <br>
        ... will get to know the main libraries of the Hugging Face ecosystem (i.e., Datasets, Tokenizers, Transformers).
    </font>
</div>
</center>
</p>

The content of this notebook is heavily inspired by the following excellent sources:


*   Tunstall et al. (2021): Natural Language Processing with Transformers. O'Reilly. https://www.oreilly.com/library/view/natural-language-processing/9781098103231/
*   Hugging Face (2021): Transformer Models - Hugging Face Course. https://huggingface.co/course/



# Import Packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `numpy` is a library adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- `sklearn` is a free software machine learning library for the Python programming language.
- `transformers` provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with 32+ pretrained models in 100+ languages.
- `datasets` is an API for datasets from the makers of transformers.
- `torch` is an open source machine learning library used for applications such as computer vision and natural language processing, primarily developed by Facebook's AI Research lab.

In [1]:
#!pip install datasets

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import classification_report
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModel
from transformers import AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
import torch

# Overview

<center><br><img width=400 src="https://upload.wikimedia.org/wikipedia/en/f/f1/Bert_and_Ernie.JPG"/><br></center>



In this notebook we will use a famous Transformer model called **BERT**, short for Bidirectional Encoder Representations from Transformers.

We will use the three core libraries from the Hugging Face ecosystem: **Datasets**, **Tokenizers**, and **Transformers**. These libraries will allow us to quickly go from raw text to a fine-tuned model that can be used for predictions on new texts.

The following diagram illustrates the architecture of a BERT model for sequence classification at a high level:

<center><img width=600 src="https://raw.githubusercontent.com/olivermueller/amlta-2024/main/Session_03/bert.png"/></center>

# Load and Prepare Dataset

In the next code chunks we will use the **Datasets** library to retrieve a dataset from Hugging Face's datasets hub. This library is designed to load and process large datasets efficiently and share them with the community.

Load the `emotion` dataset, a collection of English Twitter messages labelled with six basic emotions: anger, fear, joy, love, sadness, and surprise. More information: https://huggingface.co/datasets/emotion

In [3]:
emotions = load_dataset("emotion")

Explore the structure of the dataset.

In [4]:
emotions

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

In [5]:
emotions["train"].features

{'text': Value('string'),
 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'])}

Show the first Tweet and its label.

In [6]:
emotions["train"][0]

{'text': 'i didnt feel humiliated', 'label': 0}

Get the name of the first label.

In [7]:
emotions["train"].features["label"].int2str(emotions["train"][0]["label"])

'sadness'

Store a list of all label names for later.

In [8]:
labels = emotions["train"].features["label"].names
labels

['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

# Tokenize Texts

Like other neural networks or machine learning models, Transformer models cannot process raw strings as input; instead they assume the text has been tokenized and turned into numerical vectors.

Most Transformers use a subword **tokenizer**. Subword tokenization is a compromise between character and word tokenization. On the one hand, we want to use characters, since they allow the model to deal with rare character combinations and misspellings. On the other hand, we want to keep frequent words and word parts as unique entities.

There are many different subword tokenization strategies. Using the right tokenizer for a given pretrained model is crucial for getting sensible results. The Transformers library provides convenient tools (e.g., AutoTokenizer, from_pretrained()) to load both objects (model and tokenizer) from the Hugging Face model hub.

In [9]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [10]:
tokenizer.vocab_size

30522

In [11]:
tokenizer.model_max_length

512

In [12]:
encoded_str = tokenizer.encode("this is a complicatedtest")
encoded_str

[101, 2023, 2003, 1037, 8552, 22199, 102]

Show special tokens used by the tokenizer. BERT uses the [MASK] token for the primary objective of masked language modeling and the [CLS] and [SEP] tokens for the secondary pretraining objective of predicting if two sentences are consecutive.

Below, we can observe two things. First, the [CLS] and [SEP] tokens have been added automatically to the start and end of the sequence. Second, the long word `complicatedtest` has been split into two tokens. The ## prefix in ##test signifies that the preceding character is not whitespace and that the token should be merged with the previous one.

In [13]:
for token in encoded_str:
    print(token, tokenizer.decode([token]))

101 [CLS]
2023 this
2003 is
1037 a
8552 complicated
22199 ##test
102 [SEP]
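The split of `complicatedtest` seen above can be imitated with a small greedy, longest-match-first procedure, which is how WordPiece-style tokenizers break up a word at inference time. The sketch below is purely illustrative: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and have nothing to do with the real 30,522-entry vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, then shrink it step by step
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark a continuation of the same word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: treat the whole word as unknown
        tokens.append(match)
        start = end
    return tokens


# hypothetical toy vocabulary, chosen only to reproduce the example above
toy_vocab = {"this", "is", "a", "complicated", "test", "##test", "##s"}

sentence = "this is a complicatedtest"
tokens = [tok for word in sentence.split()
          for tok in wordpiece_tokenize(word, toy_vocab)]
print(tokens)
```

Because `complicatedtest` is not in the toy vocabulary, the procedure falls back to the longest known prefix (`complicated`) and then continues with the remainder, which is only valid in its `##`-prefixed form.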


# BERT for Text Classification

BERT has been pretrained to predict masked words and next sentences in texts. Hence, we can’t use the model as-is for text classification and have to modify its architecture slightly.

The figure below illustrates the general architecture of a BERT text classification model.

<center><br><img width=600 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/bert.png"/><br></center>

First, the text is tokenized and represented as one-hot vectors whose dimension is the size of the tokenizer vocabulary (not shown in the diagram). Next, these token encodings are embedded in lower dimensions and passed through stacks of encoder layers to yield a hidden state for each input token. During pretraining, the hidden states are used for the task of language modeling. For the classification task, we replace the language modeling layer with a classification model.
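The first two steps can be made concrete with a toy example. Multiplying a one-hot vector by an embedding matrix simply selects one row of that matrix, which is why implementations can use a table lookup on the token ids instead. The sizes and the random matrix below are hypothetical stand-ins, not values taken from BERT.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 10, 4             # toy sizes (hypothetical)
embedding_matrix = rng.normal(size=(vocab_size, hidden_dim))

input_ids = np.array([1, 7, 3])            # three toy token ids
one_hot = np.eye(vocab_size)[input_ids]    # shape (3, 10): one row per token

# embedding as a matrix product with the one-hot encodings ...
embedded_via_matmul = one_hot @ embedding_matrix
# ... is the same as looking up the corresponding rows directly
embedded_via_lookup = embedding_matrix[input_ids]

print(one_hot.shape, embedded_via_matmul.shape)
print(np.allclose(embedded_via_matmul, embedded_via_lookup))
```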

We have two options to train the classification model:

* Feature extraction: We use BERT in inference mode and take the hidden states as features to train an "external" classifier on them.

* Fine-tuning: We add a classification head to the model and train the whole model end-to-end, which also updates the parameters of the pretrained BERT model.

# The Feature Extraction Approach

The figure below illustrates the idea behind the feature extraction approach. To use a Transformer as a feature extractor we use BERT just for inference and use the generated hidden states as features for a downstream classifier (e.g., logistic regression, random forest). The advantage of this approach is that we can quickly train a shallow model. This method is especially convenient if GPUs are unavailable since the hidden states can be computed relatively fast on a CPU.

<center><br><img width=600 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/feature_extraction.png"/><br></center>


See also Jay Alammar's blog post for using the feature-extraction approach for classification: https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

## Model: Load and Inference

Check if GPU is available.

In [14]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

(Down)load a pretrained model and move it to the selected device (the GPU if one is available, otherwise the CPU). The `AutoModel` class initializes an input encoder that translates the one-hot vectors to embeddings with positional encodings and feeds them through the encoder stack to return the hidden states. Note that the language model head, which normally takes the hidden states and decodes them into masked-token predictions, is excluded since it is only needed for pretraining.

In [15]:
model_name = "distilbert-base-uncased"

In [16]:
model = AutoModel.from_pretrained(model_name).to(device)

Test the model with a single short document.

In [17]:
text = "this is a test"
text_tensor = tokenizer.encode(text, return_tensors="pt").to(device)

The tokenized text is a tensor with the dimensionality [1, 6] (number of documents, number of tokens per document).

In [18]:
text_tensor.shape

torch.Size([1, 6])

The values of the tensor are integers, which are the indices of the tokens in the vocabulary (101 and 102 are special tokens that have been added during tokenization).

In [19]:
text_tensor

tensor([[ 101, 2023, 2003, 1037, 3231,  102]])

Let's feed the tensor through the model to get the hidden states.

In [20]:
output = model(text_tensor)

In [21]:
output

BaseModelOutput(last_hidden_state=tensor([[[-0.1565, -0.1862,  0.0528,  ..., -0.1188,  0.0662,  0.5470],
         [-0.3575, -0.6484, -0.0618,  ..., -0.3040,  0.3508,  0.5221],
         [-0.2772, -0.4459,  0.1818,  ..., -0.0948, -0.0076,  0.9958],
         [-0.2841, -0.3917,  0.3753,  ..., -0.2151, -0.1173,  1.0526],
         [ 0.2661, -0.5094, -0.3180,  ..., -0.4203,  0.0144, -0.2149],
         [ 0.9441,  0.0112, -0.4714,  ...,  0.1439, -0.7288, -0.1619]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)

The output contains the hidden states of the last encoder layer for each token. Hence, it has the dimensionality [1, 6, 768] (number of documents, number of tokens per document, dimension of the hidden states). These are the raw contextual embeddings generated by BERT.

In [22]:
output.last_hidden_state.shape

torch.Size([1, 6, 768])

In [23]:
output.last_hidden_state

tensor([[[-0.1565, -0.1862,  0.0528,  ..., -0.1188,  0.0662,  0.5470],
         [-0.3575, -0.6484, -0.0618,  ..., -0.3040,  0.3508,  0.5221],
         [-0.2772, -0.4459,  0.1818,  ..., -0.0948, -0.0076,  0.9958],
         [-0.2841, -0.3917,  0.3753,  ..., -0.2151, -0.1173,  1.0526],
         [ 0.2661, -0.5094, -0.3180,  ..., -0.4203,  0.0144, -0.2149],
         [ 0.9441,  0.0112, -0.4714,  ...,  0.1439, -0.7288, -0.1619]]],
       grad_fn=<NativeLayerNormBackward0>)

Let’s repeat the process for the whole dataset. First, we write a simple function that will tokenize our texts. The padding=True parameter will pad the examples with zeroes to the longest one in a batch, and truncation=True will truncate the examples to the model’s maximum sequence length.

In [24]:
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

Apply to the first three texts.

In [25]:
tokenize(emotions["train"][:3])

{'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636, 17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300, 102], [101, 10047, 9775, 1037, 3371, 2000, 2695, 1045, 2514, 20505, 3308, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
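The padding pattern in the output above can be reproduced by hand. The following sketch is a simplified stand-in for what the tokenizer does with `padding=True` and `truncation=True`; the helper name `pad_and_truncate` is hypothetical, and unlike a real tokenizer it cuts off the end of a sequence without preserving the final [SEP] token.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    # simplification: a real tokenizer keeps the final [SEP] token when truncating
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# token ids of the first and third tweets, copied from the output above
batch = [[101, 1045, 2134, 2102, 2514, 26608, 102],
         [101, 10047, 9775, 1037, 3371, 2000, 2695, 1045, 2514, 20505, 3308, 102]]

encoded = pad_and_truncate(batch, max_length=512)
print(encoded["input_ids"][0])
print(encoded["attention_mask"][0])

# with a small max_length, both sequences are cut to the same length
short = pad_and_truncate(batch, max_length=5)
print(short["input_ids"])
```

The attention mask is what later allows the model to tell real tokens from padding, since the padded positions carry the otherwise valid id 0.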

Repeat with the whole corpus.

In [26]:
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)

The above function call has added two new features to the dataset: input_ids and the attention mask.

In [27]:
emotions_encoded["train"].features

{'text': Value('string'),
 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']),
 'input_ids': List(Value('int32')),
 'attention_mask': List(Value('int8'))}

With `hidden_states = model(input_ids, attention_mask)` we could now generate the hidden states for each token of each document. For convenience, we define a custom function that takes a batch of tokenized documents, feeds them through the model, and adds a `cls_hidden_state` feature to the batch. Rather than using the hidden states of all tokens, the original BERT paper recommends using only the hidden state of the [CLS] token as the feature vector for classification. An alternative would be to average over the hidden states of all tokens.

In [28]:
def extract_features_cls(batch):
    # store inputs in separate variables
    input_ids = torch.tensor(batch["input_ids"]).to(device)
    attention_mask = torch.tensor(batch["attention_mask"]).to(device)

    # feed inputs into model without tracking gradients and save outputs
    with torch.no_grad():
        last_hidden_state = model(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
        last_hidden_state = last_hidden_state.cpu().numpy()

    # extract the hidden state of the CLS token (i.e., the first token)
    cls_lhs = last_hidden_state[:, 0, :]

    # return results
    batch["cls_hidden_state"] = cls_lhs
    return batch
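The averaging alternative mentioned above could look roughly like the sketch below. It uses random NumPy arrays in place of real hidden states, and the function names `cls_pooling` and `mean_pooling` are hypothetical. Excluding padded positions via the attention mask is a design choice of this sketch, made so that padding does not dilute the average.

```python
import numpy as np


def cls_pooling(last_hidden_state):
    """Use the hidden state of the first token ([CLS]) as document feature."""
    return last_hidden_state[:, 0, :]


def mean_pooling(last_hidden_state, attention_mask):
    """Average the hidden states of all real (non-padding) tokens."""
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)   # padded positions add 0
    counts = mask.sum(axis=1)                         # number of real tokens
    return summed / counts


rng = np.random.default_rng(0)
# stand-in for model output: 2 documents, 5 tokens, hidden size 8 (toy values)
last_hidden_state = rng.normal(size=(2, 5, 8))
attention_mask = np.array([[1, 1, 1, 0, 0],
                           [1, 1, 1, 1, 1]])

cls_features = cls_pooling(last_hidden_state)
mean_features = mean_pooling(last_hidden_state, attention_mask)
print(cls_features.shape, mean_features.shape)
```

Both variants yield one fixed-size vector per document, so either could be passed to the downstream classifier in the same way.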

In [29]:
emotions_encoded = emotions_encoded.map(extract_features_cls,
                                        batched=True,
                                        batch_size=16)

Show the dataset again.

In [30]:
emotions_encoded["train"].features

{'text': Value('string'),
 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']),
 'input_ids': List(Value('int32')),
 'attention_mask': List(Value('int8')),
 'cls_hidden_state': List(Value('float32'))}

Check the dimensions of the hidden states.

In [31]:
len(emotions_encoded["train"]["cls_hidden_state"][0])

768

The dataset now contains all the information we need to train a classifier on it. We will use the extracted hidden states of the CLS tokens as input features and the labels as targets. We can easily create the corresponding arrays in the well-known scikit-learn format as follows:

In [32]:
X_train = np.array(emotions_encoded["train"]["cls_hidden_state"])
X_valid = np.array(emotions_encoded["validation"]["cls_hidden_state"])
y_train = np.array(emotions_encoded["train"]["label"])
y_valid = np.array(emotions_encoded["validation"]["label"])
X_train.shape, X_valid.shape

((16000, 768), (2000, 768))

And now we can train a standard classifier on these data structures.

In [33]:
clf = LogisticRegression(max_iter=3000)
clf.fit(X_train, y_train)

LogisticRegression(max_iter=3000)


## Evaluate Model

Make predictions on the validation set and evaluate predictive accuracy.

In [34]:
y_preds = clf.predict(X_valid)

In [35]:
print(classification_report(y_valid, y_preds, target_names=labels))

              precision    recall  f1-score   support

     sadness       0.65      0.71      0.68       550
         joy       0.71      0.80      0.75       704
        love       0.49      0.30      0.37       178
       anger       0.51      0.44      0.47       275
        fear       0.55      0.56      0.55       212
    surprise       0.54      0.27      0.36        81

    accuracy                           0.63      2000
   macro avg       0.57      0.51      0.53      2000
weighted avg       0.62      0.63      0.62      2000
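The two averaged rows of the report differ in how they treat class size. As a worked example, the sketch below recomputes both F1 averages from the per-class values printed above. Since the inputs are already rounded to two decimals, the results only match the report up to rounding.

```python
# per-class F1 scores and supports, copied from the report above
f1_per_class = {"sadness": 0.68, "joy": 0.75, "love": 0.37,
                "anger": 0.47, "fear": 0.55, "surprise": 0.36}
support = {"sadness": 550, "joy": 704, "love": 178,
           "anger": 275, "fear": 212, "surprise": 81}

# macro average: every class counts equally, regardless of its size
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

# weighted average: every class counts in proportion to its support
total = sum(support.values())
weighted_f1 = sum(f1_per_class[name] * support[name]
                  for name in f1_per_class) / total

print(f"macro avg F1:    {macro_f1:.2f}")
print(f"weighted avg F1: {weighted_f1:.2f}")
```

The macro average is lower because the small classes `love` and `surprise`, on which the classifier does worst, weigh as much as the large classes `joy` and `sadness`.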



# The Fine-tuning Approach

Let’s now fine-tune a Transformer end-to-end. With the fine-tuning approach we do not use the hidden states as fixed features, but instead train them as shown in the figure below. Since we retrain all the parameters, this approach requires more compute than the feature extraction approach and typically requires a GPU.

<center><br><img width=600 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/fine_tuning.png"/><br></center>


## Model: Load Pre-trained Model and Fine-tune

In the following, we will use the Trainer API to simplify the training loop.

First, we need a pretrained model like the one we used in the feature extraction approach. The only difference is that we use the `AutoModelForSequenceClassification` class instead of `AutoModel`. This model has a classification head on top of the model outputs, which can be trained together with the base model.

In [36]:
model_name = "distilbert-base-uncased"

In [37]:
model = (AutoModelForSequenceClassification
         .from_pretrained(model_name, num_labels = 6)
         .to(device))

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
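The warning lists the parameters of the newly added classification head: a `pre_classifier` layer and a `classifier` layer. To give a feeling for what such a head computes, here is a NumPy sketch with random stand-in values. The layer names come from the warning; the ReLU between the two layers and the omission of dropout are assumptions of this sketch, not something shown in the output above.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden_dim, num_labels = 3, 6, 768, 6

# stand-in for the encoder output; random numbers, not real hidden states
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_dim))

# head parameters start from random values, which is what the warning refers to
pre_classifier_weight = rng.normal(scale=0.02, size=(hidden_dim, hidden_dim))
pre_classifier_bias = np.zeros(hidden_dim)
classifier_weight = rng.normal(scale=0.02, size=(hidden_dim, num_labels))
classifier_bias = np.zeros(num_labels)

# take the hidden state of the first token as the document representation
cls_state = last_hidden_state[:, 0, :]

# first linear layer followed by ReLU (assumed), then projection to 6 logits
hidden = np.maximum(cls_state @ pre_classifier_weight + pre_classifier_bias, 0.0)
logits = hidden @ classifier_weight + classifier_bias

print(logits.shape)
```

With untrained head weights the logits carry no information about the emotion labels, which is why the model has to be trained before it can be used for predictions.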


Furthermore, we define a function to monitor some metrics during training.

In [38]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}

With the dataset and metrics ready we can now instantiate a Trainer class. The main ingredient here is the TrainingArguments class to specify all the parameters of the training run, one of which is the output directory for the model checkpoints.

In [39]:
batch_size = 64
logging_steps = len(emotions_encoded["train"]) // batch_size

training_args = TrainingArguments(output_dir="results",
                                  num_train_epochs=2,
                                  learning_rate=3e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  load_best_model_at_end=True,
                                  weight_decay=0.01,
                                  eval_strategy="epoch",
                                  save_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps,
                                  report_to="none")

We can now instantiate and fine-tune our model with the Trainer.

In [40]:
trainer = Trainer(model=model,
                  args=training_args,
                  compute_metrics=compute_metrics,
                  train_dataset=emotions_encoded["train"],
                  eval_dataset=emotions_encoded["validation"])
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.6795 | 0.226784 | 0.9215 | 0.921907 |
| 2 | 0.1845 | 0.170033 | 0.9355 | 0.935557 |




TrainOutput(global_step=500, training_loss=0.4320345458984375, metrics={'train_runtime': 339.6264, 'train_samples_per_second': 94.221, 'train_steps_per_second': 1.472, 'total_flos': 720342861696000.0, 'train_loss': 0.4320345458984375, 'epoch': 2.0})
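The numbers in this output follow from the training configuration. The short calculation below assumes training on a single device without gradient accumulation, which matches the settings shown in this notebook.

```python
import math

num_train_examples = 16000   # size of the training split
batch_size = 64
num_train_epochs = 2
train_runtime = 339.6264     # seconds, copied from the output above

# one optimizer step per batch; the last batch may be smaller, hence ceil
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# same expression as used for logging_steps: log once per epoch
logging_steps = num_train_examples // batch_size

steps_per_second = total_steps / train_runtime
samples_per_second = num_train_examples * num_train_epochs / train_runtime

print(steps_per_epoch, total_steps, logging_steps)
print(round(steps_per_second, 3), round(samples_per_second, 3))
```

The total of 500 steps agrees with `global_step=500`, and the two throughput figures agree with `train_steps_per_second` and `train_samples_per_second` in the output.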

## Evaluate Model

Make predictions on the validation set and evaluate predictive accuracy.

In [41]:
preds_output = trainer.predict(emotions_encoded["validation"])



In [42]:
preds_output

PredictionOutput(predictions=array([[ 4.92795   , -0.7383838 , -1.1149226 , -0.9638449 , -1.2058758 ,
        -1.7956127 ],
       [ 4.8994846 , -0.925503  , -1.6025988 , -0.89169985, -0.7663219 ,
        -1.6876588 ],
       [-1.3763558 ,  2.5984538 ,  3.007816  , -1.0662369 , -2.2169294 ,
        -1.8496767 ],
       ...,
       [-1.0316542 ,  4.997706  ,  0.1024612 , -1.0043087 , -1.5416032 ,
        -1.620951  ],
       [-1.6368203 ,  3.7184246 ,  2.434468  , -1.4215966 , -2.308291  ,
        -1.5715662 ],
       [-1.2465651 ,  5.05565   , -0.16913122, -1.5041523 , -1.3891387 ,
        -0.5576928 ]], dtype=float32), label_ids=array([0, 0, 2, ..., 1, 1, 1]), metrics={'test_loss': 0.17003321647644043, 'test_accuracy': 0.9355, 'test_f1': 0.9355569427561781, 'test_runtime': 4.6277, 'test_samples_per_second': 432.177, 'test_steps_per_second': 6.915})

Above, you can see that the model outputs are raw logits, not probabilities. Nonetheless, we can simply take the highest value as the predicted label.

In [43]:
y_preds = np.argmax(preds_output.predictions, axis=1)
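Taking the argmax of the logits is sufficient because the softmax function, which would turn logits into probabilities, preserves their ordering. If probabilities are needed, they can be computed as in the following sketch, which uses two rows of logits copied from the output above; the helper `softmax` is written out here for illustration.

```python
import numpy as np


def softmax(logits):
    """Convert each row of logits into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# first and third rows of logits, copied from the prediction output above
logits = np.array([
    [4.92795, -0.7383838, -1.1149226, -0.9638449, -1.2058758, -1.7956127],
    [-1.3763558, 2.5984538, 3.007816, -1.0662369, -2.2169294, -1.8496767],
])

probs = softmax(logits)
print(probs.round(3))

# the predicted class is the same whether we use logits or probabilities
print(logits.argmax(axis=1), probs.argmax(axis=1))
```

For the second row, the logits for classes 1 and 2 are close, so the resulting probabilities show that the model is far less certain there than for the first row.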

In [44]:
print(classification_report(y_valid, y_preds, target_names=labels))

              precision    recall  f1-score   support

     sadness       0.96      0.97      0.96       550
         joy       0.97      0.94      0.95       704
        love       0.87      0.88      0.87       178
       anger       0.93      0.95      0.94       275
        fear       0.88      0.88      0.88       212
    surprise       0.86      0.85      0.86        81

    accuracy                           0.94      2000
   macro avg       0.91      0.91      0.91      2000
weighted avg       0.94      0.94      0.94      2000



The classification report reveals that the fine-tuning approach performs much better than the feature extraction approach: validation accuracy rises from 0.63 to 0.94, and the macro-averaged F1 score from 0.53 to 0.91.