# Fine-Tuning DistilBERT on the Federal Aviation Administration (FAA) Service Difficulty Report (SDR).

Date
: 05/03/2025

Author
: Kaya Arkin (Stu No. 2105361)

Copyright
: Swansea University

### **Context**
This Jupyter notebook was authored by Kaya Arkin (Stu No. 2105361) as part of their Swansea University final year project *"Exploratory Research on Explainable LLMs (Airbus AI Research)"*. The project is supervised by Mark Hall, an employee at the Airbus AI Research Department, and Bertie Muller, a university-assigned supervisor. The project aims *"to provide an insightful set of findings and recommendations on fine-tuned local explanations for LLMs that can be utilised as a resource for future explainability implementations"* tailored towards Airbus AI Research.

Please ensure you have read `README.txt` before continuing.

---

### **Code Explanation**

The Jupyter notebook aims to fine-tune a DistilBERT model on an airline incidents dataset (FAA SDR).

#### What does the code do?
1. Loads the airline incident report dataset.
2. Formats the data.
3. Fine-tunes the model in T5's text-to-text format (the code loads the `t5-small` checkpoint).
4. Saves the fine-tuned model.

#### Why is DistilBERT used?
1. Abundance of online documentation, resources, & guidance.
2. 40% fewer parameters and 60% faster than BERT, while retaining about 97% of its performance; this allows quicker development iterations and lets the model run on lower-end hardware.
3. Compatibility and customisability with other Python libraries.

#### Why is T5 (Text-to-Text Transfer Transformer) used to train the model?
The scenario for our LLM is that for a given (unseen) incident report the LLM predicts the part failure. For example, inputting the incident *"FLIGHT CREW REPORTED OF A BAGGAGE/FUEL DOOR CAS MESSAGE ..."* the LLM should return a predicted part failure, *"FUEL DOOR DEFECTIVE"*.

Originally, I treated this task as a classification problem. However, the dataset contains over 10,000 unique part failures and still does not cover every possible part failure, so a fixed set of classes was not a suitable fit. T5 uses a *"text-to-text"* paradigm: it treats all tasks, including classification, as text generation. Inputs are formatted with prefixes, like *"Report: {text}"*, and outputs are textual labels (e.g. *"FUEL DOOR DEFECTIVE"*). This allows the model to generate failure types that it was not explicitly trained on.
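
As a minimal sketch of the text-to-text framing described above, the snippet below builds the source and target strings for one example. The templates mirror the `MODEL_INPUT` and `MODEL_OUTPUT` constants defined later in this notebook; the helper name `to_text_pair` is hypothetical and is not part of the notebook's code.

```python
# Templates copied from the preprocessing constants used later in this notebook.
MODEL_INPUT = "Report: {input_text}"
MODEL_OUTPUT = "Part Failure: {output_text}"


def to_text_pair(report: str, part_failure: str) -> tuple[str, str]:
    """Return the (source, target) strings for one training example.

    Hypothetical helper for illustration only.
    """
    source = MODEL_INPUT.format(input_text=report)
    target = MODEL_OUTPUT.format(output_text=part_failure)
    return source, target


source, target = to_text_pair(
    "FLIGHT CREW REPORTED OF A BAGGAGE/FUEL DOOR CAS MESSAGE ...",
    "FUEL DOOR DEFECTIVE",
)
print(source)  # text the encoder reads
print(target)  # text the decoder learns to generate
```

Because the label is plain text rather than a class index, nothing in this format limits the model to labels seen during training.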

---


## 1. Imports

See `README.txt` for installation requirements.

In [1]:
import torch, os
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments, DataCollatorForSeq2Seq
from datasets import load_from_disk, Dataset
import pandas as pd



---

## 2. Hardware Setup

1. Check whether a GPU is available; otherwise fall back to the CPU.
2. Clear the GPU cache.
3. Set the float32 matrix multiplication precision to speed up training on the GPU.

Note: For performance reasons GPU processing is highly recommended.

See `README.txt` for information on hardware requirements.

In [2]:
## Hardware Setup Constants
CPU_DEVICE_NAME = "cpu"
GPU_DEVICE_NAME = "cuda"
TORCH_MATRIX_MULTIPLICATION_PRECISION = "high"

In [3]:
## Set processing to GPU if available
if torch.cuda.is_available():
    device = GPU_DEVICE_NAME

    ## Empty GPU VRAM
    torch.cuda.empty_cache()

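    ## Sets the float32 matrix multiplication precision
    ## "high" - allows faster, slightly lower-precision kernels (e.g. TF32) where the hardware supports them;
    ## "highest" would keep full float32 precision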
    torch.set_float32_matmul_precision(TORCH_MATRIX_MULTIPLICATION_PRECISION)

    print("GPU processing enabled.")

else:
    device = CPU_DEVICE_NAME
    print("GPU processing not available.")

Processing set to: cuda


---

## 3. Preprocessing Dataset
The following code pre-processes the dataset for fine-tuning using T5. We perform the following:
1. Initialise the T5 tokeniser & model.
2. Load the dataset.
3. Tokenise the inputs & outputs for T5.
4. Split the dataset into training & testing.
5. Set up the data collator that batches and pads the input / output sequences during training.

In [4]:
## Preprocessing Constants
MODEL_NAME = "t5-small"
MODEL_INPUT = "Report: {input_text}"
MODEL_OUTPUT = "Part Failure: {output_text}"
MODEL_INPUT_MAX_LENGTH = 512
MODEL_OUTPUT_MAX_LENGTH = 128
DATASET_REPORT_COLUMN_TITLE = "report"
DATASET_PART_FAILURE_COLUMN_TITLE = "part failure"
DATASET_PATH = "airline_incidents.csv"
PREPROCESSED_DATASET_NANE = "processed_dataset"
PREPROCESSED_DATASET_PATH = f"./{PREPROCESSED_DATASET_NANE}"
TRAINING_TEST_SPLIT_RATIO = 0.2

The following code initialises a T5 tokenizer / model and moves the model to the specified device (either the CPU or GPU).


In [5]:
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME).to(device)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


This function preprocesses a single dataset example by formatting input and output text using predefined templates, tokenizing them for the T5 model, and setting the tokenized labels for training.

In [6]:
def preprocess_dataset(example):
    ## Formats one example with the prompt templates and tokenizes it for T5
    model_input = MODEL_INPUT.format(input_text=example[DATASET_REPORT_COLUMN_TITLE])
    target_text = MODEL_OUTPUT.format(output_text=example[DATASET_PART_FAILURE_COLUMN_TITLE])

    ## padding="max_length" pads every sequence to its fixed maximum length here,
    ## so no further padding is needed at batch time
    model_inputs = tokenizer(model_input, max_length=MODEL_INPUT_MAX_LENGTH, truncation=True, padding="max_length")
    labels = tokenizer(target_text, max_length=MODEL_OUTPUT_MAX_LENGTH, truncation=True, padding="max_length")

    ## Decoder labels; pad positions keep the pad token id, so they count towards the loss
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs
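
Since the labels above are padded to a fixed length, the pad positions are scored by the loss like any other token. A common alternative, shown here only as a sketch and not as what this notebook does, is to replace pad token ids in the labels with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The helper name `mask_padding` and the example ids are hypothetical; the pad id of `0` is the one used by the T5 tokenizer.

```python
PAD_TOKEN_ID = 0     # pad token id of the T5 tokenizer
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Return a copy of label_ids with pad positions replaced by IGNORE_INDEX.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return [IGNORE_INDEX if token == pad_token_id else token for token in label_ids]


# Made-up token ids: three real tokens followed by two pad tokens.
example_labels = [3, 2733, 1, 0, 0]
masked_labels = mask_padding(example_labels)
print(masked_labels)
```

With masking of this kind, the reported loss reflects only the real target tokens, so loss values would not be directly comparable with the ones recorded in this notebook.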

This code checks if a preprocessed dataset exists at the specified path (`PREPROCESSED_DATASET_PATH`). If it exists, it loads the dataset. Otherwise, it loads a raw CSV file (`DATASET_PATH`), removes missing values, shuffles the data, converts it into a `Dataset` object, applies a preprocessing function (`preprocess_dataset`) to format and tokenize it, and saves the preprocessed dataset to disk.

In [7]:
if os.path.exists(PREPROCESSED_DATASET_PATH):
    dataset = load_from_disk(PREPROCESSED_DATASET_PATH)

else:
    df = pd.read_csv(DATASET_PATH)  # Ensure the file is in the same directory
    df = df.dropna()  # Remove missing values
    df = df.sample(frac=1).reset_index(drop=True)  # Shuffle dataset

    dataset = Dataset.from_pandas(df)
    dataset = dataset.map(preprocess_dataset, remove_columns=[DATASET_REPORT_COLUMN_TITLE, DATASET_PART_FAILURE_COLUMN_TITLE])

    dataset.save_to_disk(PREPROCESSED_DATASET_PATH)  # Same path that is checked and loaded above

Map: 100%|██████████| 100028/100028 [01:09<00:00, 1440.14 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 100028/100028 [00:00<00:00, 227180.61 examples/s]


This code splits the dataset into training and testing subsets, with 80% used for training and 20% used for testing, based on the defined split ratio.

In [8]:
# Train-Test Split
dataset = dataset.train_test_split(test_size=TRAINING_TEST_SPLIT_RATIO)  # 80% training, 20% testing (used for evaluation during training)

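The data collator batches the tokenised examples and pads the input and output sequences in each batch to a uniform length, which the model needs in order to process them as tensors. Because `preprocess_dataset` already pads every example to its maximum length, the collator has no additional padding to add in this notebook.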

In [9]:
# Data Collator (Pads batch inputs dynamically)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
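
To illustrate what dynamic padding means, here is a plain-Python sketch that pads each sequence in a batch to the length of the longest one. It only illustrates the idea and is not the `DataCollatorForSeq2Seq` implementation; the function name `pad_batch` and the token ids are made up.

```python
def pad_batch(sequences, pad_value=0):
    """Pad every sequence to the length of the longest sequence in the batch.

    Hypothetical helper for illustration only.
    """
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


# Made-up token ids for a batch of three sequences with different lengths.
batch = [[5, 6, 7], [8], [9, 10]]
padded = pad_batch(batch)
print(padded)
```

Padding to the longest sequence in each batch, rather than to a global maximum such as 512, keeps batches of short reports small and is where the efficiency gain comes from.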

---

## 4. Training the T5 Model

This code initializes the hyperparameters and configuration required for training the T5 model using the `TrainingArguments` class from the Transformers library. It specifies settings such as batch sizes, gradient accumulation, mixed precision (fp16), learning rate, weight decay, checkpoint saving strategies, logging, and evaluation intervals.

In [10]:
training_args = TrainingArguments(
    output_dir = "../t5_airline_incidents",
    per_device_train_batch_size = 64,
    per_device_eval_batch_size = 64,
    gradient_accumulation_steps = 1,  # 1 = no accumulation; raise to simulate larger batches
    bf16 = False,
    fp16 = True,  # Mixed-precision training
    save_total_limit = 2,  # Keep at most 2 checkpoints
    evaluation_strategy = "epoch",  # Evaluate at the end of every epoch
    save_strategy = "epoch",
    learning_rate = 1e-3,
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    optim = "adamw_torch_fused",  # Optimized optimizer for ROCm
    report_to = "none",
    logging_strategy = "steps",
    logging_steps = 100,  # Log every 100 steps
    eval_steps = 500  # Only applies when evaluation_strategy = "steps"; unused here
)
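
The combination of `learning_rate = 1e-3` and `lr_scheduler_type = "linear"` means the learning rate decays linearly from its starting value to zero over the run. The sketch below reproduces that shape in plain Python, assuming no warmup steps (the default when none are configured). The function name `linear_lr` is hypothetical, and this is an illustration of the schedule rather than the library's own code.

```python
def linear_lr(step, total_steps, base_lr=1e-3, warmup_steps=0):
    """Learning rate at a given optimiser step under linear warmup and decay.

    Hypothetical helper for illustration only.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Example with a made-up run length of 2000 steps.
for step in (0, 500, 1000, 2000):
    print(step, linear_lr(step, total_steps=2000))
```

A starting rate of `1e-3` is comparatively high for fine-tuning, so the decay to zero matters: later epochs make much smaller updates than the first.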



This code initializes a `Trainer` object from the Transformers library, setting up the T5 model, training arguments, training and evaluation datasets, tokenizer, and data collator for model fine-tuning.

In [11]:
# Initialize Trainer
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = dataset["train"],
    eval_dataset = dataset["test"],
    tokenizer = tokenizer,
    data_collator = data_collator,
)



In [12]:
# Start Training
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.0445 | 0.037569 |
| 2 | 0.0359 | 0.032731 |
| 3 | 0.0327 | 0.031245 |


TrainOutput(global_step=3753, training_loss=0.0649011887901788, metrics={'train_runtime': 1752.3425, 'train_samples_per_second': 136.997, 'train_steps_per_second': 2.142, 'total_flos': 3.249096491217715e+16, 'train_loss': 0.0649011887901788, 'epoch': 3.0})
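
The `global_step=3753` reported above can be cross-checked with simple arithmetic. The sketch below uses the 100,028 examples shown in the mapping progress bar, the 0.2 test ratio, and the batch size of 64. It assumes a single device, three epochs (the `Trainer` default, consistent with `'epoch': 3.0` in the output), and that the test split size is rounded up; the rounding rule is an assumption.

```python
import math

num_examples = 100028  # from the "Map" progress bar in the preprocessing output
test_ratio = 0.2       # TRAINING_TEST_SPLIT_RATIO
batch_size = 64        # per_device_train_batch_size, assuming a single device
epochs = 3             # assumption: default number of training epochs

test_size = math.ceil(num_examples * test_ratio)  # assumption: test size rounds up
train_size = num_examples - test_size
steps_per_epoch = math.ceil(train_size / batch_size)  # the last batch may be partial
total_steps = steps_per_epoch * epochs

print(train_size, test_size, steps_per_epoch, total_steps)
```

Under these assumptions the result matches the reported step count, which suggests the run covered three full passes over the training split.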

In [13]:
# SAVE THE FINE-TUNED MODEL
model.save_pretrained("./t5_finetuned_airline_incidents")
tokenizer.save_pretrained("./t5_finetuned_airline_incidents")

('./t5_finetuned_airline_incidents/tokenizer_config.json',
 './t5_finetuned_airline_incidents/special_tokens_map.json',
 './t5_finetuned_airline_incidents/spiece.model',
 './t5_finetuned_airline_incidents/added_tokens.json')
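
Once saved, the model can be reloaded for inference. The following is a minimal sketch, assuming the directory written above exists and that `transformers` and `torch` are installed. The helper names `strip_output_prefix` and `predict_part_failure` are hypothetical, and the generation settings are illustrative rather than tuned.

```python
MODEL_DIR = "./t5_finetuned_airline_incidents"  # directory written by save_pretrained above
MODEL_INPUT = "Report: {input_text}"            # same input template as used for training
OUTPUT_PREFIX = "Part Failure: "                # prefix the model was trained to emit


def strip_output_prefix(text):
    """Remove the trained output prefix from generated text, if present."""
    text = text.strip()
    if text.startswith(OUTPUT_PREFIX):
        return text[len(OUTPUT_PREFIX):].strip()
    return text


def predict_part_failure(report, model_dir=MODEL_DIR):
    """Generate a part-failure prediction for a single incident report (runs on CPU)."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_dir)
    model = T5ForConditionalGeneration.from_pretrained(model_dir)
    model.eval()

    prompt = MODEL_INPUT.format(input_text=report)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return strip_output_prefix(decoded)


print(strip_output_prefix("Part Failure: FUEL DOOR DEFECTIVE"))
```

Calling `predict_part_failure("FLIGHT CREW REPORTED OF A BAGGAGE/FUEL DOOR CAS MESSAGE ...")` would then return the generated label as plain text; the function reloads the model on each call for simplicity, so for repeated predictions the tokenizer and model should be loaded once and reused.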