This notebook is part of the paper "Literary Metaphor Detection with LLM Fine-Tuning and Few-Shot Learning". The corresponding repository can be found on [GitHub](https://github.com/ma-spie/LLM_metaphor_detection).

# Training with Transformers

*goals of this notebook*:

* fine-tuning the [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) language model on the metaphor detection task using the [HuggingFace Transformers](https://huggingface.co/docs/transformers/index) framework with four datasets (PoFo, TroFi, MOH, PoFo_TroFi_MOH).

The normalisation and analysis of these datasets can be found in the `Preprocessing_analysis.ipynb` notebook.

Fine-tuning a sentence transformer model on the same task with the SetFit framework can be found in the `SetFit_training.ipynb` notebook.

This notebook is based on the [HuggingFace Task Guide: Text Classification](https://huggingface.co/docs/transformers/tasks/sequence_classification).

## Installations, imports, loading data

In [1]:
#required installations
!pip install transformers datasets evaluate --quiet
!pip install accelerate -U --quiet
!pip install codecarbon --quiet




In [2]:
#general imports
from google.colab import files     #only needed if Google colab is used
import pandas as pd
import evaluate
import numpy as np
from codecarbon import EmissionsTracker

#imports for training
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

Upload these files from the `preprocessed_datasets` folder of the repository:


*   PoFo_normalised.csv
*   TroFi_normalised.csv
*   MOH-X_normalised.csv
*   PoFo_TroFi_MOH.csv


In [3]:
uploaded_files = files.upload()     #only needed if Google Colab is used, otherwise these files have to be in the same folder as this notebook

To fine-tune DistilBERT on all four datasets, uncomment one dataset at a time and run the notebook once per dataset, four times in total.

In [4]:
inputfile = "PoFo_normalised.csv"
#inputfile = "TroFi_normalised.csv"
#inputfile = "MOH-X_normalised.csv"
#inputfile = "PoFo_TroFi_MOH.csv"

In [5]:
#load and split dataset (the files are tab-separated despite the .csv extension)
dataset = load_dataset("csv", data_files=inputfile, delimiter="\t")
split_dataset = dataset["train"].train_test_split(test_size=0.2, seed=42)

In [6]:
#checking outputs
print(split_dataset)
print(split_dataset["train"][0])
print(split_dataset["test"][0])
print("Train dataset size:", len(split_dataset["train"]))
print("Test dataset size:", len(split_dataset["test"]))

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 441
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 111
    })
})
{'text': "maybe these events are nature 's sleight of hand , and the real", 'label': 'metaphorical'}
{'text': 'where books were trees .', 'label': 'metaphorical'}
Train dataset size: 441
Test dataset size: 111
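
The split sizes printed above follow from simple rounding. The sketch below reproduces them in plain Python, assuming the scikit-learn-style convention that the test share is rounded up and the training set receives the remainder. The helper name `split_sizes` is hypothetical and not part of the `datasets` API.

```python
import math


def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the rest goes to training.
    """
    n_test = math.ceil(test_size * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test


# 441 + 111 = 552 rows in total, taken from the output above
n_train, n_test = split_sizes(552, 0.2)
print(n_train, n_test)
```

Because `seed=42` is fixed, rerunning the split should yield the same partition, which keeps the results of repeated runs comparable.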


## Preprocessing data



The dataset is tokenized and the string labels are binarized.

In [7]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print(f"Tokenizer: {tokenizer}")

Tokenizer: DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


In [8]:
id2label = {0: "literal", 1: "metaphorical"}
label2id = {"literal": 0, "metaphorical": 1}
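
The label binarisation can be tried in isolation before it is applied to the whole dataset. The following is a minimal sketch: the helper `encode_labels` is hypothetical (it is not used elsewhere in this notebook) and adds an explicit check for labels that are missing from `label2id`.

```python
id2label = {0: "literal", 1: "metaphorical"}
label2id = {"literal": 0, "metaphorical": 1}

# the two mappings should be exact inverses of each other
assert {name: idx for idx, name in id2label.items()} == label2id


def encode_labels(labels):
    """Map string labels to integer ids and fail loudly on unknown labels."""
    unknown = sorted(set(labels) - set(label2id))
    if unknown:
        raise ValueError(f"Unexpected label(s): {unknown}")
    return [label2id[label] for label in labels]


encoded = encode_labels(["metaphorical", "literal", "metaphorical"])
print(encoded)  # [1, 0, 1]
```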

In [9]:
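def preprocess_function(examples):
    # replace the string labels with their integer ids
    examples["label"] = [label2id[label] for label in examples["label"]]
    # truncate to the maximum input length; padding is left to the data collator
    return tokenizer(examples["text"], truncation=True, max_length=512)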

In [10]:
tokenized_dataset = split_dataset.map(preprocess_function, batched=True)
print(f"Train IDs: {tokenized_dataset['train']['input_ids'][0]}")
print(f"Test IDs: {tokenized_dataset['test']['input_ids'][0]}")

Train IDs: [101, 2672, 2122, 2824, 2024, 3267, 1005, 1055, 22889, 7416, 13900, 1997, 2192, 1010, 1998, 1996, 2613, 102]
Test IDs: [101, 2073, 2808, 2020, 3628, 1012, 102]


In [11]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
print(f"Data collator: {data_collator}")

Data collator: DataCollatorWithPadding(tokenizer=DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, padding=True, max_length=None, pad_to_multiple_of=None, return_te
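
With `padding=True`, the collator pads each batch to the length of its longest sequence instead of to the model maximum of 512 tokens. The sketch below is a simplified plain-Python stand-in for that behaviour, not the library implementation: the function `pad_batch` is hypothetical, and the real collator returns tensors rather than lists. The two sequences are the token IDs printed earlier in this notebook.

```python
PAD_TOKEN_ID = 0  # id of [PAD] according to the tokenizer output above


def pad_batch(sequences, pad_id=PAD_TOKEN_ID):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 2672, 2122, 2824, 2024, 3267, 1005, 1055, 22889, 7416,
     13900, 1997, 2192, 1010, 1998, 1996, 2613, 102],
    [101, 2073, 2808, 2020, 3628, 1012, 102],
])
print([len(row) for row in batch["input_ids"]])
print(batch["attention_mask"][1])
```

Padding per batch keeps short verse lines from being inflated to 512 tokens, which saves memory and computation on datasets like these.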

## Evaluation function

Define accuracy, F1 score, precision and recall as evaluation metrics.

In [12]:
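def compute_metrics(eval_pred):
    # bundle the four metrics so that one compute() call returns all of them
    metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
    # eval_pred holds the model logits and the gold label ids
    predictions, labels = eval_pred
    # pick the class with the highest logit for each example
    predictions = np.argmax(predictions, axis=1)
    metrics_results = metrics.compute(predictions=predictions, references=labels)
    return metrics_results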
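
To make the four metrics concrete, the sketch below recomputes them by hand in plain Python. It assumes binary averaging with label 1 ("metaphorical") as the positive class, which matches the defaults of the `evaluate` metrics as far as we are aware. The function `binary_metrics` and the toy logits are hypothetical and serve illustration only.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def binary_metrics(logits, labels, positive=1):
    """Accuracy, precision, recall and F1 for the positive class."""
    predictions = [argmax(row) for row in logits]
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    accuracy = sum(p == y for p, y in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# toy data, invented for this example
toy_logits = [[0.2, 0.8], [1.5, -0.3], [0.1, 0.4], [-0.2, 0.9], [0.7, 0.6]]
toy_labels = [1, 0, 0, 1, 1]
toy_result = binary_metrics(toy_logits, toy_labels)
print(toy_result)
```

In the toy data, two of the three predicted metaphors are correct and two of the three true metaphors are found, so precision, recall and F1 all equal 2/3 while accuracy is 3/5.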

## Training

In [13]:
#initialise model
model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
# start emissions tracker
emissions_tracker = EmissionsTracker(save_to_file=True, output_file="emissions.csv", on_csv_write="append")
emissions_tracker.start()

[codecarbon INFO @ 17:28:27] [setup] RAM Tracking...
[codecarbon INFO @ 17:28:27] [setup] GPU Tracking...
[codecarbon INFO @ 17:28:27] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 17:28:27] [setup] CPU Tracking...
[codecarbon INFO @ 17:28:28] CPU Model on constant consumption mode: AMD Ryzen 9 7950X 16-Core Processor
[codecarbon INFO @ 17:28:28] >>> Tracker's metadata:
[codecarbon INFO @ 17:28:28]   Platform system: Windows-10-10.0.22631-SP0
[codecarbon INFO @ 17:28:28]   Python version: 3.11.7
[codecarbon INFO @ 17:28:28]   CodeCarbon version: 2.3.5
[codecarbon INFO @ 17:28:28]   Available RAM : 31.118 GB
[codecarbon INFO @ 17:28:28]   CPU count: 32
[codecarbon INFO @ 17:28:28]   CPU model: AMD Ryzen 9 7950X 16-Core Processor
[codecarbon INFO @ 17:28:28]   GPU count: 1
[codecarbon INFO @ 17:28:28]   GPU model: 1 x NVIDIA GeForce RTX 4090


In [15]:
print(f"Processing dataset: {inputfile}")

training_args = TrainingArguments(
    output_dir=f"transformers_training_{inputfile}",
    seed=42,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    logging_dir="logs",
    logging_strategy="epoch",
    evaluation_strategy="epoch",    # evaluate at the end of each epoch, see https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.evaluation_strategy
    save_strategy="epoch",          # must match evaluation_strategy because load_best_model_at_end=True, see https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.evaluation_strategy
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

print(f"Processing of {inputfile} completed.")

Processing dataset: PoFo_normalised.csv


[codecarbon INFO @ 17:28:32] [setup] RAM Tracking...
[codecarbon INFO @ 17:28:32] [setup] GPU Tracking...
[codecarbon INFO @ 17:28:32] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 17:28:32] [setup] CPU Tracking...
[codecarbon INFO @ 17:28:33] CPU Model on constant consumption mode: AMD Ryzen 9 7950X 16-Core Processor
[codecarbon INFO @ 17:28:33] >>> Tracker's metadata:
[codecarbon INFO @ 17:28:33]   Platform system: Windows-10-10.0.22631-SP0
[codecarbon INFO @ 17:28:33]   Python version: 3.11.7
[codecarbon INFO @ 17:28:33]   CodeCarbon version: 2.3.5
[codecarbon INFO @ 17:28:33]   Available RAM : 31.118 GB
[codecarbon INFO @ 17:28:33]   CPU count: 32
[codecarbon INFO @ 17:28:33]   CPU model: AMD Ryzen 9 7950X 16-Core Processor
[codecarbon INFO @ 17:28:33]   GPU count: 1
[codecarbon INFO @ 17:28:33]   GPU model: 1 x NVIDIA GeForce RTX 4090


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.69 | 0.699233 | 0.477477 | 0.646341 | 0.477477 | 1.0 |
| 2 | 0.6657 | 0.688708 | 0.522523 | 0.66242 | 0.5 | 0.981132 |
| 3 | 0.6179 | 0.677654 | 0.603604 | 0.694444 | 0.549451 | 0.943396 |
| 4 | 0.5511 | 0.648937 | 0.648649 | 0.688 | 0.597222 | 0.811321 |
| 5 | 0.5134 | 0.654216 | 0.63964 | 0.701493 | 0.580247 | 0.886792 |
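
Since `load_best_model_at_end=True` is set without an explicit `metric_for_best_model`, the `Trainer` should, to the best of our understanding of the Transformers defaults, select the checkpoint with the lowest validation loss. The sketch below applies that rule to the values from the training log above and contrasts it with a selection by F1. The variable names are illustrative only.

```python
# (epoch, validation loss, F1) copied from the training log above
epoch_log = [
    (1, 0.699233, 0.646341),
    (2, 0.688708, 0.662420),
    (3, 0.677654, 0.694444),
    (4, 0.648937, 0.688000),
    (5, 0.654216, 0.701493),
]

best_by_loss = min(epoch_log, key=lambda row: row[1])  # lower loss is better
best_by_f1 = max(epoch_log, key=lambda row: row[2])    # higher F1 is better

print("best epoch by validation loss:", best_by_loss[0])  # 4
print("best epoch by F1:", best_by_f1[0])                 # 5
```

This is consistent with the evaluation results reported later in this notebook, which match the epoch 4 row. If F1 were the preferred selection criterion, `metric_for_best_model="f1"` could be passed to `TrainingArguments`, which in this run would favour epoch 5 instead.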


Checkpoint destination directory transformers_training_PoFo_normalised.csv\checkpoint-14 already exists and is non-empty.Saving will proceed but saved results may be invalid.
[codecarbon INFO @ 17:28:46] Energy consumed for RAM : 0.000049 kWh. RAM Power : 11.66927719116211 W
[codecarbon INFO @ 17:28:46] Energy consumed for all GPUs : 0.000443 kWh. Total GPU Power : 106.0826712779584 W
[codecarbon INFO @ 17:28:46] Energy consumed for all CPUs : 0.000177 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 17:28:46] 0.000669 kWh of electricity used since the beginning.
[codecarbon INFO @ 17:28:51] Energy consumed for RAM : 0.000049 kWh. RAM Power : 11.66927719116211 W
[codecarbon INFO @ 17:28:51] Energy consumed for all GPUs : 0.000591 kWh. Total GPU Power : 141.64158791102088 W
[codecarbon INFO @ 17:28:51] Energy consumed for all CPUs : 0.000178 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 17:28:51] 0.000818 kWh of electricity used since the beginning.
Checkpoint destination directory t

Processing of PoFo_normalised.csv completed.
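
The checkpoint name `checkpoint-14` in the log can be derived from the dataset and batch sizes. The following back-of-the-envelope sketch assumes a single device and no gradient accumulation, so that one optimizer step corresponds to one batch of 32 examples.

```python
import math

train_rows, eval_rows = 441, 111   # split sizes reported earlier in this notebook
batch_size, epochs = 32, 5         # values passed to TrainingArguments

steps_per_epoch = math.ceil(train_rows / batch_size)  # last batch is smaller
total_steps = steps_per_epoch * epochs
eval_batches = math.ceil(eval_rows / batch_size)

print(steps_per_epoch, total_steps, eval_batches)  # 14 70 4
```

With `save_strategy="epoch"`, a checkpoint is written after every 14 steps, so the checkpoints of this run would be numbered 14, 28, 42, 56 and 70 under these assumptions.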


In [16]:
# calculate emissions data for training
emissions: float = emissions_tracker.stop()
print(f"Emissions as CO₂-equivalents [CO₂eq] in kg for {inputfile}: {emissions}")

[codecarbon INFO @ 17:29:29] Energy consumed for RAM : 0.000186 kWh. RAM Power : 11.66927719116211 W
[codecarbon INFO @ 17:29:29] Energy consumed for all GPUs : 0.002717 kWh. Total GPU Power : 172.552037212497 W
[codecarbon INFO @ 17:29:29] Energy consumed for all CPUs : 0.000682 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 17:29:29] 0.003585 kWh of electricity used since the beginning.


Emissions as CO₂-equivalents [CO₂eq] in kg for PoFo_normalised.csv: 0.0013816308201039678


## Result documentation

In [17]:
# initialize empty dictionary to store results
evaluation_results = {}
metrics = trainer.evaluate(tokenized_dataset["test"])
evaluation_results[inputfile] = metrics
print(f"Evaluation results for {inputfile}: {metrics}")

# convert evaluation results to a DataFrame for easier access to data in further scripts (for example for visualisation)
evaluation_results_df = pd.DataFrame(evaluation_results)
evaluation_results_df.to_csv(f"DistilBERT_evaluation_results_df_{inputfile}.csv")
print(f"{evaluation_results_df}")

Evaluation results for PoFo_normalised.csv: {'eval_loss': 0.6489366888999939, 'eval_accuracy': 0.6486486486486487, 'eval_f1': 0.688, 'eval_precision': 0.5972222222222222, 'eval_recall': 0.8113207547169812, 'eval_runtime': 6.5954, 'eval_samples_per_second': 16.83, 'eval_steps_per_second': 0.606, 'epoch': 5.0}
                         PoFo_normalised.csv
eval_loss                           0.648937
eval_accuracy                       0.648649
eval_f1                             0.688000
eval_precision                      0.597222
eval_recall                         0.811321
eval_runtime                        6.595400
eval_samples_per_second            16.830000
eval_steps_per_second               0.606000
epoch                               5.000000


In [18]:
# save the trained model with a unique name for each dataset
model_name = f"{inputfile}_model_DistilBERT"
trainer.save_model(model_name)
print(f"Trained model saved as {model_name}")

Trained model saved as PoFo_normalised.csv_model_DistilBERT
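
Because `inputfile` includes its extension, the generated names contain `.csv` in the middle (for example `PoFo_normalised.csv_model_DistilBERT`). The sketch below shows an optional alternative that strips the extension with `pathlib`. It is not what this notebook does, and the variable `run_name` is hypothetical; the published artefacts keep the original names.

```python
from pathlib import Path

inputfile = "PoFo_normalised.csv"
run_name = Path(inputfile).stem  # file name without the .csv extension

output_dir = f"transformers_training_{run_name}"
model_name = f"{run_name}_model_DistilBERT"
results_csv = f"DistilBERT_evaluation_results_df_{run_name}.csv"

print(output_dir, model_name, results_csv, sep="\n")
```

Changing the names would also change the paths expected by any downstream script, so this is only worth doing if those scripts are adapted as well.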


In [19]:
# create a text file for documenting the results
with open(f"DistilBERT_model_evaluation_results_{inputfile}.txt", "a") as file:
    file.write(f"Model: {inputfile}\n")
    file.write(f"Evaluation Results: {metrics}\n")
    file.write(f"Emissions as CO2-equivalents [CO2eq] in kg for training this model: {emissions}\n\n")

As a result, we get the following for each dataset:

*   a model fine-tuned on the metaphor identification task (the models can be found on [Zenodo](https://doi.org/10.5281/zenodo.11624278))

*   the file `DistilBERT_model_evaluation_results_{inputfile}.txt` with the evaluation results

*   the emissions information in the `emissions.csv` file

*   the evaluation results as a DataFrame saved to the CSV file `DistilBERT_evaluation_results_df_{inputfile}.csv`