This notebook is part of the paper "Literary Metaphor Detection with LLM Fine-Tuning and Few-Shot Learning". The corresponding repository can be found on [GitHub](https://github.com/ma-spie/LLM_metaphor_detection).


# Training with SetFit



*Goal of this notebook*: fine-tune a sentence transformer model on the metaphor detection task using the [SetFit](https://huggingface.co/docs/setfit/v1.0.3/en/index) framework, once for each of four datasets (PoFo, TroFi, MOH, PoFo_TroFi_MOH).

The normalisation and analysis of these datasets can be found in the `Preprocessing_analysis.ipynb` notebook.

The training of a DistilBERT model with Transformers can be found in the `Transformers_training.ipynb` notebook.

This notebook is based on the [SetFit Quickstart](https://huggingface.co/docs/setfit/v1.0.3/en/quickstart).

## Installations and imports

In [1]:
# required installations
!pip install setfit --quiet
!pip install codecarbon --quiet


[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip



In [2]:
#general imports
from google.colab import files      #only needed if Google colab is used
import pandas as pd
from codecarbon import EmissionsTracker

#imports for training
from setfit import SetFitModel, Trainer, TrainingArguments
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

Upload these datasets from the `preprocessed_datasets` folder of the repository:

*   PoFo_normalised.csv
*   TroFi_normalised.csv
*   MOH-X_normalised.csv
*   PoFo_TroFi_MOH.csv


In [3]:
uploaded_files = files.upload()    # only needed if Google Colab is used; otherwise these files have to be in the same folder as this notebook
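
Outside Colab the upload step is skipped and the four files are expected next to the notebook. The following is a minimal sketch of a check that can be run before loading, assuming the file names used in the loading cell; `missing_files` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path

# File names as used in the loading cell; the helper itself is hypothetical.
REQUIRED_FILES = [
    "PoFo_normalised.csv",
    "TroFi_normalised.csv",
    "MOH-X_normalised.csv",
    "PoFo_TroFi_MOH.csv",
]


def missing_files(folder=".", required=REQUIRED_FILES):
    """Return the required file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in required if not (folder / name).is_file()]


missing = missing_files()
if missing:
    print("Missing dataset files:", ", ".join(missing))
else:
    print("All dataset files found.")
```

Running this before `load_dataset` gives a readable message instead of a file-not-found error raised from inside the loading cell.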

## Training preparation



*   load the preprocessed datasets
*   initialise a SetFit model by choosing a sentence transformer. Here the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model is used.
* define a `compute_metrics()` function to specify which metrics to evaluate on



In [4]:
#load datasets
datasets = {
    "PoFo": load_dataset("csv", data_files="PoFo_normalised.csv", delimiter="\t"),
    "TroFi": load_dataset("csv", data_files="TroFi_normalised.csv", delimiter="\t"),
    "MOH": load_dataset("csv", data_files="MOH-X_normalised.csv", delimiter="\t"),
    "PoFo_TroFi_MOH": load_dataset("csv", data_files="PoFo_TroFi_MOH.csv", delimiter="\t"),
}
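
Although the files carry a `.csv` extension, they are loaded with `delimiter="\t"`, so they are tab-separated. Because the trainer further down is created without a `column_mapping`, the files should provide the columns `text` and `label` that SetFit looks for by default. The sketch below uses only the standard library and two invented example rows to show how such a file can be inspected; the encoding 0 = literal, 1 = metaphorical is an assumption, and `label_counts` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Two invented rows in the assumed layout: tab-separated, columns "text" and "label".
# Assumed encoding: 0 = literal, 1 = metaphorical.
sample = "text\tlabel\nThe river runs to the sea.\t0\nTime devours all things.\t1\n"


def label_counts(handle):
    """Count how often each label occurs in a tab-separated file with a 'label' column."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["label"] for row in reader)


counts = label_counts(io.StringIO(sample))
print(counts)
```

For a real file, the same function can be called on `open("PoFo_normalised.csv", newline="", encoding="utf-8")` to see how balanced the two classes are.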

In [5]:
# initializing a SetFit model
setfit_model = SetFitModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2", labels = ["literal", "metaphorical"])

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


SetFit's default metric for evaluation is accuracy. To add further metrics (precision, recall and F1 score), see the
[SetFit documentation](https://huggingface.co/docs/setfit/v1.0.3/en/how_to/model_cards#custom-metrics) on creating custom metrics.

In [6]:
def compute_metrics(y_pred, y_test):
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    metrics = {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }
    return metrics
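
To make the four metrics concrete, the sketch below recomputes them from confusion-matrix counts in plain Python. It assumes binary labels with 1 as the positive class, which is the scikit-learn default for these functions. The toy labels are invented, and `binary_metrics` is a hypothetical helper that is not used elsewhere in this notebook.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary labels, computed by hand."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Return 0.0 when a denominator is empty instead of dividing by zero
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented toy example: 3 true positives, 2 false positives, 1 false negative, 2 true negatives
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```

Precision answers "how many sentences flagged as metaphorical really are metaphorical", recall answers "how many metaphorical sentences were found", and F1 is the harmonic mean of the two.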

## Training and evaluation

For each dataset:

1.   initialise the `EmissionsTracker` from [CodeCarbon](https://mlco2.github.io/codecarbon/index.html)
2.   split datasets
3.   define training hyperparameters
4.   initialise trainer and train model
5.   calculate the emissions
6.   evaluate the model
7.   document results in a text file
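
Step 2 holds out 20% of each dataset (`test_size=0.2`) with a fixed seed. The sample counts printed in the cell output further down are consistent with the test share being rounded up to the next whole sample. The sketch below reproduces those counts under that assumption; the dataset totals are the sums of the reported train and test sizes, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test


# Totals = train size + test size as reported in the cell output
totals = {"PoFo": 552, "TroFi": 3603, "MOH": 641, "PoFo_TroFi_MOH": 4796}

for name, n_samples in totals.items():
    n_train, n_test = split_sizes(n_samples)
    print(f"{name}: {n_train} train / {n_test} test")
```

The totals also add up across datasets: 552 + 3603 + 641 = 4796, the size of the combined PoFo_TroFi_MOH dataset.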

In [7]:
for dataset_name, dataset in datasets.items():
    print(f"Processing of {dataset_name} started...")

    # start emissions tracker
    emissions_tracker = EmissionsTracker(save_to_file=True, output_file="emissions.csv", on_csv_write="update")
    emissions_tracker.start()

    # split dataset into train and test sets
    dataset_split = dataset["train"].train_test_split(test_size=0.2, seed=42)
    train_dataset = dataset_split["train"]
    test_dataset = dataset_split["test"]

    print(f"The train dataset of {dataset_name} has {len(train_dataset)} samples.")
    print(f"The test dataset of {dataset_name} has {len(test_dataset)} samples.")

    # define training arguments
    training_args = TrainingArguments(
        batch_size=32,
        num_epochs=5,  
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    )
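    # initialise trainer (the same setfit_model object is passed in every iteration,
    # so each run continues from the weights left by the previous dataset)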
    trainer = Trainer(
        model=setfit_model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        metric=compute_metrics,
    )

    trainer.train()

    # calculate emissions data for training
    emissions: float = emissions_tracker.stop()
    print(f"Emissions as CO₂-equivalents [CO₂eq] in kg for {dataset_name}: {emissions}")

    # save the trained model with a unique name for each dataset
    model_name = f"{dataset_name}_best_model_SetFit"
    trainer.model.save_pretrained(model_name)
    print(f"Trained model saved as {model_name}")

    # evaluate the model on the test split and add the results to the evaluation_results dictionary
    print(f"Evaluating {dataset_name}...")
    evaluation_results = {}
    metrics = trainer.evaluate(dataset=test_dataset)
    evaluation_results[dataset_name] = metrics

    # print evaluation metrics
    print(f"Evaluation results of    {dataset_name}: {metrics}")

    # convert evaluation results to a DataFrame for easier access to data in further scripts (for example for visualisation)
    evaluation_results_df = pd.DataFrame(evaluation_results)
    evaluation_results_df.to_csv(f"setfit_evaluation_results_df_{dataset_name}.csv")
    print(f"{evaluation_results_df}")

    print(f"Processing of {dataset_name} completed.")

    # write results to a text file
    # the with-block closes the file automatically
    with open(f"SetFit_model_evaluation_results_{dataset_name}.txt", "a") as file:
        file.write(f"Model: {model_name}\n")
        file.write(f"Evaluation Results: {metrics}\n")
        file.write(f"Emissions as CO2-equivalents [CO2eq] in kg for training this model: {emissions}\n\n")


Processing of PoFo started...
The train dataset of PoFo has 441 samples.
The test dataset of PoFo has 111 samples.


***** Running training *****
  Num unique pairs = 98366
  Batch size = 32
  Num epochs = 5
  Total optimization steps = 15370


Epoch,Training Loss,Validation Loss,Embedding Loss,Rate
1,No log,No log,0.371,1.8e-05
2,No log,No log,0.3863,1.3e-05
3,No log,No log,0.4084,9e-06
4,No log,No log,0.3723,4e-06
5,No log,No log,0.4053,0.0



Loading best SentenceTransformer model from step 3074.
***** Running evaluation *****


Emissions as CO₂-equivalents [CO₂eq] in kg for PoFo: 0.02097287631204261
Trained model saved as PoFo_best_model_SetFit
Evaluating PoFo...
Evaluation results of    PoFo: {'accuracy': 0.7207207207207207, 'precision': 0.6527777777777778, 'recall': 0.8867924528301887, 'f1': 0.752}
               PoFo
accuracy   0.720721
precision  0.652778
recall     0.886792
f1         0.752000
Processing of PoFo completed.
Processing of TroFi started...
The train dataset of TroFi has 2882 samples.
The test dataset of TroFi has 721 samples.



***** Running training *****
  Num unique pairs = 4220644
  Batch size = 32
  Num epochs = 5
  Total optimization steps = 659480


Epoch,Training Loss,Validation Loss,Embedding Loss,Rate
1,No log,No log,0.4256,1.8e-05
2,No log,No log,0.4444,1.3e-05
3,No log,No log,0.4381,9e-06
4,No log,No log,0.444,4e-06
5,No log,No log,0.4431,0.0



Loading best SentenceTransformer model from step 131896.


Emissions as CO₂-equivalents [CO₂eq] in kg for TroFi: 0.7704291161175701


***** Running evaluation *****


Trained model saved as TroFi_best_model_SetFit
Evaluating TroFi...
Evaluation results of    TroFi: {'accuracy': 0.665742024965326, 'precision': 0.6147308781869688, 'recall': 0.6739130434782609, 'f1': 0.642962962962963}
              TroFi
accuracy   0.665742
precision  0.614731
recall     0.673913
f1         0.642963
Processing of TroFi completed.
Processing of MOH started...
The train dataset of MOH has 512 samples.
The test dataset of MOH has 129 samples.



***** Running training *****
  Num unique pairs = 131872
  Batch size = 32
  Num epochs = 5
  Total optimization steps = 20605


Epoch,Training Loss,Validation Loss,Embedding Loss,Rate
1,No log,No log,0.3853,1.8e-05
2,No log,No log,0.362,1.3e-05
3,No log,No log,0.357,9e-06
4,No log,No log,0.3617,4e-06
5,No log,No log,0.3747,0.0



Loading best SentenceTransformer model from step 12363.


Emissions as CO₂-equivalents [CO₂eq] in kg for MOH: 0.013456704483594938


***** Running evaluation *****


Trained model saved as MOH_best_model_SetFit
Evaluating MOH...
Evaluation results of    MOH: {'accuracy': 0.7364341085271318, 'precision': 0.7966101694915254, 'recall': 0.6811594202898551, 'f1': 0.734375}
                MOH
accuracy   0.736434
precision  0.796610
recall     0.681159
f1         0.734375
Processing of MOH completed.
Processing of PoFo_TroFi_MOH started...
The train dataset of PoFo_TroFi_MOH has 3836 samples.
The test dataset of PoFo_TroFi_MOH has 960 samples.



***** Running training *****
  Num unique pairs = 7420452
  Batch size = 32
  Num epochs = 5
  Total optimization steps = 1159450


Epoch,Training Loss,Validation Loss,Embedding Loss,Rate
1,No log,No log,0.2101,1.8e-05
2,No log,No log,0.2427,1.3e-05
3,No log,No log,0.2472,9e-06
4,No log,No log,0.2525,4e-06
5,No log,No log,0.2463,0.0



Loading best SentenceTransformer model from step 231890.


Emissions as CO₂-equivalents [CO₂eq] in kg for PoFo_TroFi_MOH: 1.3402467884240712


***** Running evaluation *****


Trained model saved as PoFo_TroFi_MOH_best_model_SetFit
Evaluating PoFo_TroFi_MOH...
Evaluation results of    PoFo_TroFi_MOH: {'accuracy': 0.8770833333333333, 'precision': 0.8894230769230769, 'recall': 0.8371040723981901, 'f1': 0.8624708624708625}
           PoFo_TroFi_MOH
accuracy         0.877083
precision        0.889423
recall           0.837104
f1               0.862471
Processing of PoFo_TroFi_MOH completed.
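
The numbers in the training logs can be cross-checked with a little arithmetic. The sketch below assumes that one optimization step consumes one batch of sentence pairs, so that an epoch has `ceil(num_pairs / batch_size)` steps. Under that assumption the reported totals are reproduced exactly, and each best-checkpoint step falls on an epoch boundary. The figures are copied from the output above; `steps_per_epoch` is a hypothetical helper.

```python
BATCH_SIZE = 32
NUM_EPOCHS = 5

# (num unique pairs, total optimization steps, best checkpoint step), copied from the logs
logs = {
    "PoFo": (98366, 15370, 3074),
    "TroFi": (4220644, 659480, 131896),
    "MOH": (131872, 20605, 12363),
    "PoFo_TroFi_MOH": (7420452, 1159450, 231890),
}


def steps_per_epoch(num_pairs, batch_size=BATCH_SIZE):
    """Number of batches needed to see every pair once (integer ceiling division)."""
    return (num_pairs + batch_size - 1) // batch_size


for name, (num_pairs, total_steps, best_step) in logs.items():
    per_epoch = steps_per_epoch(num_pairs)
    # The logged total should equal steps per epoch times the number of epochs
    assert per_epoch * NUM_EPOCHS == total_steps
    best_epoch = best_step // per_epoch
    print(f"{name}: {per_epoch} steps per epoch, best checkpoint after epoch {best_epoch}")
```

The best checkpoints fall at the end of epoch 1 (PoFo, TroFi, PoFo_TroFi_MOH) and epoch 3 (MOH), which in each case is the epoch with the lowest value in the "Embedding Loss" column.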


As a result, we get

*   four models (one for each dataset) fine-tuned on the metaphor identification task; the models are available on [Zenodo](https://doi.org/10.5281/zenodo.11624278)

*  four text files named `SetFit_model_evaluation_results_[datasetname].txt` with the evaluation results for each dataset

*  the emissions information in the `emissions.csv` file written by CodeCarbon

*  four CSV files named `setfit_evaluation_results_df_[datasetname].csv` with the evaluation results in DataFrame format
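
For further analysis, the per-dataset CSV files can be merged into one table. The following is a minimal sketch using only the standard library. It assumes the layout that `DataFrame.to_csv` produces in the training loop (metric names in the first column, the dataset name as the header of the second column); `read_results` and `merge_results` are hypothetical helpers.

```python
import csv
from pathlib import Path

DATASETS = ["PoFo", "TroFi", "MOH", "PoFo_TroFi_MOH"]


def read_results(path):
    """Read one results file into (dataset_name, {metric: value})."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = [row for row in csv.reader(handle) if row]
    dataset_name = rows[0][1]  # header row looks like ",PoFo"
    metrics = {row[0]: float(row[1]) for row in rows[1:]}
    return dataset_name, metrics


def merge_results(folder="."):
    """Collect the metrics of all result files found in `folder`."""
    merged = {}
    for name in DATASETS:
        path = Path(folder) / f"setfit_evaluation_results_df_{name}.csv"
        if path.is_file():  # skip datasets that have not been trained yet
            dataset_name, metrics = read_results(path)
            merged[dataset_name] = metrics
    return merged


for dataset_name, metrics in merge_results().items():
    print(dataset_name, {key: round(value, 3) for key, value in metrics.items()})
```

The resulting dictionary can be passed directly to `pd.DataFrame` to get one column per dataset.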