# Transformers Fine-tuning Step
This notebook fine-tunes the pre-trained DistilBERT model on our data and then saves the model with its trained weights.

In this step we perform:
* Model definition;
* Fine-tuning the model;
* Model evaluation;
* Saving the model and trained weights.

> **Note**: an **article on Medium** (in Portuguese) covers the fine-tuning step of the pre-trained transformer in this system: [Sentiment Analysis About Marvel Comics (Part 3) - DistilBERT Fine-tuning](https://medium.com/@guineves.py/2648e14c9123).

## Table of Contents
* [Packages](#1)
* [Loading the Data](#2)
* [Transformers](#3)
* [Transfer Learning](#4)
    * [Fine-tuning](#4.1)

<a name="1"></a>
## Packages
Packages that were used in the system:
* [transformers](https://huggingface.co/docs/transformers/index): provides APIs and tools to easily download and train state-of-the-art pretrained models;
* [datasets](https://huggingface.co/docs/datasets/index): library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks;
* [scikit-learn](https://scikit-learn.org/stable/): open source machine learning library;
* [src](../src/): package with the code for the utility functions created for this system, located in the `../../src/` directory.

In [31]:
from transformers import DistilBertForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset

import os
import sys
PROJECT_ROOT = os.path.abspath( # Obtaining the absolute normalized version of the project root path
    os.path.join( # Concatenating the paths
        os.getcwd(), # Getting the path of the notebooks directory
        os.pardir, # Getting the constant string used by the OS to refer to the parent directory
        os.pardir
    )
)
# Adding path to the list of strings that specify the search path for modules
sys.path.append(PROJECT_ROOT)
from src.transformers_finetuning import *

> **Note**: the code for the utility functions used in this system is in the `transformers_finetuning.py` script within the `../../src/` directory.

<a name="2"></a>
## Loading the Data
Let's load each subset from its own directory inside `../../data/preprocessed/` and print a summary of each one.

In [8]:
train_set = Dataset.load_from_disk('../../data/preprocessed/train_dataset')
val_set = Dataset.load_from_disk('../../data/preprocessed/validation_dataset')
test_set = Dataset.load_from_disk('../../data/preprocessed/test_dataset')
print(f'Train set shape: {train_set}\nValidation set shape: {val_set}\nTest set shape: {test_set}')

Train set shape: Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 10156
})
Validation set shape: Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 3385
})
Test set shape: Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 3386
})


<a name="3"></a>
## Transformers
<img align='center' src='../../figures/transformers.png' style='width:400px;'>
The Transformer is a purely attention-based model developed by Google to solve some problems with RNNs. Because of their sequential structure, RNNs cannot fully exploit the advantages of parallel computing. In a seq2seq RNN, the encoder goes through each word of the input one at a time, and the decoder produces its output in the same sequential way. This leaves little room for parallel computation: the more words there are in the input sequence, the longer it takes to process.

With long sequences, that is, with many sequential steps $T$, information tends to get lost in the network (loss of information) and vanishing gradients appear as the input sequences grow longer. LSTMs and GRUs help a little with these problems, but even these architectures stop working well on very long sequences because of the `information bottleneck`.
* `Loss of information`: the further we move away from the subject of a sentence, the harder it is to know whether that subject was singular or plural.
* `Vanishing gradients`: during backpropagation, the gradients can become very small, and as a result the model stops learning.

Transformers are based on attention and do not require sequential computation within a layer: the whole sequence is processed in a single step. Furthermore, the gradient only needs one step to travel from the last output to the first input, whereas for RNNs the number of steps grows with the sequence length. As a result, transformers do not suffer from the vanishing gradients related to the length of the sequences.

Transformers do not use RNNs: attention is all we need, and usually only a few linear and non-linear transformations are added. The transformer model was introduced in 2017 by Google researchers, and since then the transformer architecture has become the standard for LLMs. Transformers have revolutionized the field of NLP.

The transformer model uses `scaled dot-product attention`. This form of attention is very efficient in terms of computation and memory, because it consists only of matrix multiplications and a softmax. This mechanism is the core of the model and allows the transformer to grow and become more complex, while being faster and using less memory than other comparable model architectures.

The transformer model uses the `Multi-Head Attention layer`. This layer runs $h$ scaled dot-product attention mechanisms in parallel, each one applied to its own linear transformations of the input queries, keys and values. The weights of these linear transformations are trainable parameters. Each of the $h$ mechanisms computes:
$$\text{Attention}(Q, K, V) = \mathrm{softmax} \left( \frac{Q K^T}{\sqrt{d_k}} \right) V$$
<img align='center' src='../../figures/attention.png' style='width:600px;'>
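
As a minimal sketch of the attention formula above, the following plain-Python code computes scaled dot-product attention for small matrices stored as nested lists. The helper names (`softmax`, `matmul`, `scaled_dot_product_attention`) and the toy values are illustrative assumptions, not part of this notebook or of 🤗 transformers; real implementations use batched tensor operations instead.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # K^T
    scores = matmul(q, k_t)  # Q K^T, one row of scores per query
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]  # one distribution per query
    return matmul(weights, v), weights


# Toy example: 2 queries, 3 key/value pairs, d_k = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, so every row of `output` is a weighted average of the rows of `V`. The first query matches the first and third keys equally well, so those two values receive the same weight.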

Transformer `encoders` start with a multi-head attention module that performs `self-attention` on the input sequence. This is followed by a residual connection and normalization, then a feed forward layer and another residual connection and normalization. The encoder layer is repeated $N_x$ times.
* Self-attention: each word in the input attends to every other word in the input sequence.
* Thanks to the self-attention layer, the encoder provides a contextual representation of each of our inputs.
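
In equation form, one encoder layer as described above can be sketched as follows. This assumes the post-normalization arrangement of the original Transformer, and the symbols $x$, $z$, $y$ and $\text{FFN}$ are notation introduced here only for illustration:
$$\begin{align*}
& z = \text{LayerNorm}\left( x + \text{MultiHead}(x, x, x) \right) \\
& y = \text{LayerNorm}\left( z + \text{FFN}(z) \right)
\end{align*}$$

Here $x$ is the input of the layer, $\text{FFN}$ is the feed forward layer and $y$ is the output passed to the next of the $N_x$ layers. In self-attention the same sequence $x$ plays the role of queries, keys and values.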


The `decoder` is built in a similar way to the encoder, with multi-head attention, residual connections and normalization modules. The first attention module is masked (`Masked Self-Attention`) so that each position only attends to itself and to the previous positions, which blocks the flow of information from future positions. The second attention module (`Encoder-Decoder Attention`) takes the encoder output and allows the decoder to attend to all input items. The entire decoder layer is also repeated $N_x$ times, one after the other. For a sequence of length 3, the mask matrix $M$ is shown below:
$$\text{Masked Self-Attention}(Q, K, V) = \mathrm{softmax} \left( \frac{Q K^T}{\sqrt{d_k}} + M \right) V, \quad M = \begin{pmatrix} 0 & -\infty & -\infty \\ 0 & 0 & -\infty \\ 0 & 0 & 0 \end{pmatrix}$$
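
To see what the mask does to the attention weights, here is a small plain-Python sketch. The function names `causal_mask` and `masked_softmax` are hypothetical helpers written for this illustration, and the raw scores are set to zero on purpose so that only the effect of the mask is visible.

```python
import math


def causal_mask(n):
    """n x n mask: 0 where key position <= query position, -inf elsewhere."""
    return [[0.0 if j <= i else float("-inf") for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Add the mask to the scores, then apply a row-wise softmax."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s + m for s, m in zip(score_row, mask_row)]
        top = max(masked)  # the diagonal is never masked, so top is finite
        exps = [math.exp(x - top) for x in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Uniform raw scores for a sequence of length 3
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
# weights[0] == [1.0, 0.0, 0.0]
# weights[1] == [0.5, 0.5, 0.0]
# weights[2] is approximately [1/3, 1/3, 1/3]
```

Adding $-\infty$ before the softmax turns the weight of every future position into exactly zero, so the first position can only attend to itself, the second to the first two, and so on.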

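Transformers also incorporate a `positional encoding stage` ($PE$), which encodes the position of each input in the sequence, that is, the sequential information. This is necessary because transformers don't use RNNs, but word order is relevant in any language. In the original Transformer the positional encoding is a fixed sinusoidal function of the position, as in the formulas below, where $\text{pos}$ is the position in the sequence, $i$ indexes the pairs of embedding dimensions and $d$ is the embedding size. Models such as BERT and DistilBERT instead learn position embeddings, which are trainable just like word embeddings.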
$$\begin{align*}
& \text{PE}_{(\text{pos, }2i)} = \text{sin}\left( \frac{\text{pos}}{10000^{\frac{2i}{d}}} \right) \\
& \text{PE}_{(\text{pos, }2i + 1)} = \text{cos}\left( \frac{\text{pos}}{10000^{\frac{2i}{d}}} \right)
\end{align*}$$
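
The sinusoidal formulas above can be sketched in a few lines of plain Python. The function name `positional_encoding` and the toy sizes are assumptions made for this example, and $d$ is assumed to be even.

```python
import math


def positional_encoding(num_positions, d):
    """Sinusoidal encodings: one row per position, d values per row (d even)."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d // 2):
            angle = pos / (10000 ** (2 * i / d))
            row.append(math.sin(angle))  # dimension 2i
            row.append(math.cos(angle))  # dimension 2i + 1
        table.append(row)
    return table


pe = positional_encoding(num_positions=4, d=6)
# Position 0 has angle 0 in every dimension, so its row alternates
# sin(0) = 0 and cos(0) = 1
```

Each row of `pe` is added to the embedding of the token at that position, so two identical words at different positions end up with different input vectors.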

First, the embedding of the input is computed and the positional encodings are added to it. The result goes to the encoder, which consists of several layers of Multi-Head Attention modules. The decoder then receives the output sequence shifted one step to the right together with the encoder outputs. The decoder output is transformed into output probabilities using a linear layer with a softmax activation. This architecture is easier to parallelize than RNN models and can be trained much more efficiently on multiple GPUs. It can also be scaled to learn multiple tasks on increasingly larger datasets. Transformers are a great alternative to RNNs and help overcome the problems described above in NLP and in many other fields that process sequential data.

We will fine-tune the DistilBERT model, which is a small, fast, cheap and lightweight Transformer model, trained by distilling the base BERT model. It has 40% fewer parameters than bert-base-uncased, runs 60% faster, and preserves more than 95% of BERT's performance as measured in the GLUE (General Language Understanding Evaluation) benchmark.

[Hugging Face](https://huggingface.co/) (🤗) is one of the main resources for pre-trained transformers. Its open source libraries make it simple to download, fine-tune, and use transformer models like DeepSeek, BERT, Llama, T5, Qwen, GPT-2, and more, and we can use them together with TensorFlow, PyTorch or Flax. In this notebook, we use 🤗 transformers to apply the `DistilBERT` model to sentiment classification. The pre-processing step (notebook `03_preprocessing.ipynb`) uses the tokenizer, and the code below fine-tunes the pre-trained DistilBERT checkpoint `distilbert-base-uncased-finetuned-sst-2-english`. To do this, we initialize the `DistilBertForSequenceClassification` class with the desired pre-trained checkpoint.

<a name="4"></a>
## Transfer Learning
<img align='center' src='../../figures/transfer_learning.png' style='width:600px;'>

One of the most powerful ideas in deep learning is that sometimes we can take the knowledge that the neural network learned in one task and apply that knowledge to a separate task. There are 3 main advantages to transfer learning:
* Reduces training time.
* Improves predictions.
* Allows us to use smaller datasets.

If we are creating a model, rather than training the weights from scratch, from a random initialization, we often make much faster progress by downloading weights that someone else has already trained for days, weeks or months on the same neural network architecture, using them as pre-training, and transferring them to a new, similar task that we are interested in. This means we can often download open-source weights that took other people weeks or months to train, and use them as a very good initialization for our neural network.

<a name="4.1"></a>
### Fine-tuning
We take the weights from an existing pre-trained model using transfer learning and then tweak them a bit to ensure they work for the specific task we are working on. Let's say we pre-trained a model that classifies movie reviews, and now we want a model that classifies course reviews. One way to do this is to lock (freeze) all the pre-trained weights and add a new output layer, or perhaps a new feed forward layer and an output layer, and then train only the layers we just added while the rest stays frozen. Afterwards, we can slowly unfreeze the remaining layers, one at a time.

Many of the low-level features that the pre-trained model learned from a very large corpus, such as the structure and nature of the text, can help our algorithm do better in the sentiment classification task, faster or with less data, because the model may already have learned enough about how different texts are structured for some of that knowledge to be useful. After deleting the output layer of a pre-trained model, we don't necessarily need to create just a new output layer: we can create several new layers.

We need to remove the output layer from the pre-trained model and add our own, because the original network may end in a softmax output layer that predicts, for example, one of 1000 possible labels that do not match our task. So we remove this output layer and create our own output layer, in this case with a sigmoid activation.

* With a small training set, we treat the rest of the layers as `frozen`: we freeze their parameters and only train the parameters of our output layer. This way we can obtain good performance even with a small training set.

* With a larger training set, we can freeze fewer layers and then train the layers that were not frozen and our new output layer. We can use the layers that are not frozen as initialization and use gradient descent from them, or we can also eliminate these layers that are not frozen and use our own new hidden layers and our own output layer. Any of these methods could be worth trying.

* With a much larger training set, we use this pre-trained neural network and its weights as initialization and train the entire neural network, just changing the output layer, with labels that we care about.
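
The first two strategies above can be sketched with 🤗 transformers by switching `requires_grad` off and on. To keep the example runnable without downloading weights, it builds a tiny, randomly initialized DistilBERT from a `DistilBertConfig`; the sizes are arbitrary assumptions and are not the ones used in this notebook. As far as the cells shown here go, the notebook itself does not freeze any layer, which corresponds to the third strategy.

```python
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Tiny, randomly initialized model (illustrative sizes, not the real checkpoint)
config = DistilBertConfig(
    vocab_size=100, dim=32, hidden_dim=64, n_layers=2, n_heads=2, num_labels=2
)
model = DistilBertForSequenceClassification(config)

# Small training set: freeze the whole DistilBERT body, train only the head
for param in model.distilbert.parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
# Only the classification head (pre_classifier and classifier) stays trainable

# Larger training set: additionally unfreeze the last transformer block
for param in model.distilbert.transformer.layer[-1].parameters():
    param.requires_grad = True

trainable_after = [name for name, p in model.named_parameters() if p.requires_grad]
```

Parameters with `requires_grad = False` receive no gradient, so the optimizer leaves them unchanged. The same loops would apply to a model loaded with `from_pretrained`.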

Setting the checkpoint of the pre-trained model that we will fine-tune.

In [12]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

Setting the hyperparameters and using the Hugging Face `Trainer` object to fine-tune the model.
* The fine-tuned model is saved in the `../../models/` directory, through the `output_dir` set in the hyperparameters.
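
The `Trainer` cell below receives `compute_metrics=f1_metric`, a utility function that lives in `src` and is not shown in this notebook. As a rough idea of what such a function can look like, here is a hypothetical stand-in. The name `f1_metric_sketch` and the `average='weighted'` choice are assumptions; only the `f1_score` key is inferred from the `eval_f1_score` entry read later in this notebook.

```python
import numpy as np
from sklearn.metrics import f1_score


def f1_metric_sketch(eval_pred):
    """Hypothetical stand-in for `f1_metric`; the real one lives in `src`."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely class per example
    # 'weighted' averaging is an assumption, not taken from the notebook
    score = f1_score(labels, predictions, average="weighted")
    # Trainer prefixes metric names with 'eval_', giving 'eval_f1_score'
    return {"f1_score": float(score)}


# Toy check with hand-made logits for 4 examples and 2 classes
toy_logits = np.array([[2.0, 0.1], [0.3, 1.5], [1.2, 0.4], [0.2, 0.9]])
toy_labels = np.array([0, 1, 0, 1])
result = f1_metric_sketch((toy_logits, toy_labels))
```

In the toy check every prediction matches its label, so the score is 1.0. During training the `Trainer` calls the function with the logits and labels of the evaluation set.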

In [14]:
# Fine-tuning hyperparameters
training_args = TrainingArguments(
    output_dir='../../models/transformers_results',
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=20,
    weight_decay=1e-1,
    eval_strategy='steps',
    lr_scheduler_type='reduce_lr_on_plateau',
    logging_steps=100
)
# Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set,
    eval_dataset=val_set,
    compute_metrics=f1_metric
)
trainer.train()

| Step | Training Loss | Validation Loss | F1 Score |
|------|---------------|-----------------|----------|
| 100  | 0.6902 | 0.592717 | 0.684184 |
| 200  | 0.5923 | 0.547271 | 0.697679 |
| 300  | 0.5239 | 0.58544  | 0.712016 |
| 400  | 0.5112 | 0.488176 | 0.756529 |
| 500  | 0.4867 | 0.502221 | 0.761378 |
| 600  | 0.5219 | 0.474515 | 0.775508 |
| 700  | 0.5099 | 0.474097 | 0.76456  |
| 800  | 0.4786 | 0.448737 | 0.778758 |
| 900  | 0.4754 | 0.487454 | 0.767549 |
| 1000 | 0.5016 | 0.445698 | 0.818367 |


TrainOutput(global_step=2540, training_loss=0.4370420598608302, metrics={'train_runtime': 25970.5016, 'train_samples_per_second': 0.782, 'train_steps_per_second': 0.098, 'total_flos': 2690677801500672.0, 'train_loss': 0.4370420598608302, 'epoch': 2.0})
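
As a quick sanity check on `global_step=2540` in the output above, the number of optimization steps can be recomputed from the training set size and the hyperparameters. This sketch assumes a single device and no gradient accumulation, which are the `Trainer` defaults.

```python
import math

train_rows = 10156  # rows in the training set loaded earlier
batch_size = 8  # per_device_train_batch_size
epochs = 2  # num_train_epochs

# Assumes one device and no gradient accumulation (the Trainer defaults)
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 1270 2540
```

The result matches the reported `global_step`, and the reported `train_runtime` of about 25970 seconds corresponds to roughly 7.2 hours of training.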

Evaluating the performance of the fine-tuned model on the training and validation sets.

In [52]:
print(f'Train set evaluate: {trainer.evaluate(train_set)["eval_f1_score"]:.4f}\nValidation set evaluation: {trainer.evaluate(val_set)["eval_f1_score"]:.4f}')

Train set evaluate: 0.9178
Validation set evaluation: 0.8523


Evaluating the performance of the final fine-tuned model on the test set.

In [51]:
print(f'Test set evaluate: {trainer.evaluate(test_set)["eval_f1_score"]:.4f}')

Test set evaluate: 0.8518
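
In this notebook the checkpoints are written by the `Trainer` into `output_dir`. For an explicit save and reload, a minimal sketch with `save_pretrained` and `from_pretrained` is shown below. It uses a tiny, randomly initialized model and a temporary directory as stand-ins, both of which are assumptions made so the example runs on its own.

```python
import tempfile

import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Tiny random model as a stand-in for the fine-tuned one (sizes are arbitrary)
config = DistilBertConfig(
    vocab_size=100, dim=32, hidden_dim=64, n_layers=2, n_heads=2, num_labels=2
)
model = DistilBertForSequenceClassification(config)

with tempfile.TemporaryDirectory() as save_dir:
    # Writes the configuration and the weights into save_dir
    model.save_pretrained(save_dir)
    # Rebuilds the architecture from the config and loads the saved weights
    reloaded = DistilBertForSequenceClassification.from_pretrained(save_dir)

# The reloaded model should carry exactly the same weights
same_weights = all(
    torch.equal(p, q)
    for p, q in zip(model.state_dict().values(), reloaded.state_dict().values())
)
```

With the fine-tuned model, the same round trip would use a permanent directory instead of a temporary one; `trainer.save_model()` is the `Trainer` method that writes the model to disk.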
