## Setup

In [5]:
# you can ignore this cell
import os
os.environ["WANDB_DISABLED"] = "true"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
# check if on colab
COLAB = True
try:
    import google.colab
except ImportError:
    # the package is only available inside Google Colab
    COLAB = False

if COLAB:
    !pip install -q umap-learn~=0.5.9.post2

Load the required libraries:

In [6]:
from pathlib import Path
import pandas as pd
from sklearn.model_selection import train_test_split

## Load and preprocess the data

Let's use the data from Fornaciari et al. ([2021](https://doi.org/10.18653/v1/2021.findings-acl.301)).
This dataset contains (English-language) sentences from party manifestos.
Each sentence is labeled as either a pledge or not a pledge.
The data was labeled applying Thomson et al.'s ([2017](https://doi.org/10.1111/ajps.12313), 532) definition:

> [A pledge is] a statement committing a party to one specific action or outcome that can be clearly determined to have occurred or not.

Define the path to the data:

In [7]:
base_path = Path("..", "..", "..")
data_path = base_path / "data" / "labeled" / "fornaciari_we_2021"

Define the file path and (down)load the data:

In [None]:
fp = data_path / "fornaciari_we_2021-pledge_binary.tsv"
if not fp.exists():
    # download the data if not present yet
    url = "https://cta-text-datasets.s3.eu-central-1.amazonaws.com/labeled/fornaciari_we_2021/fornaciari_we_2021-pledge_binary.tsv"
    # the file is tab-separated, so read and write it with sep="\t"
    df = pd.read_csv(url, sep="\t", encoding = "ISO-8859-1")
    fp.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(fp, sep="\t", index=False, encoding = "ISO-8859-1")

df = pd.read_csv(fp, sep="\t")

Let's inspect the data set by looking at the first few rows:

In [9]:
df.head(3)

| Unnamed: 0 | text_id | text | label | metadata__party | metadata__year | metadata__split |
|---|---|---|---|---|---|---|
| 0 | 1 | At the same time, corruption, scams and crime ... | 0 | BJP | 2014 | trn |
| 1 | 2 | There has been gross misuse and total denigrat... | 0 | BJP | 2014 | trn |
| 2 | 3 | There has also been erosion of authority of th... | 0 | BJP | 2014 | trn |


As you can see, it records the sentence, its label, and some metadata.

The labels are distributed as follows:

In [10]:
df.label.value_counts()

label
0    4246
1    1417
Name: count, dtype: int64

These numbers indicate:

- `1`: pledge
- `0`: not a pledge

So let's capture this in two dictionaries that map the numeric IDs to the label classes' names and _vice versa_:

In [11]:
id2label = {0: "no pledge", 1: "pledge"}
label2id = {"no pledge": 0, "pledge": 1}
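
As a small aside, the second dictionary does not have to be typed by hand. The sketch below derives it from the first one, which keeps the two mappings in sync if a label name ever changes. It assumes that the label names are unique.

```python
# map numeric IDs to label names (as defined above)
id2label = {0: "no pledge", 1: "pledge"}

# invert the mapping: label names become keys, numeric IDs become values
label2id = {label: idx for idx, label in id2label.items()}

print(label2id)
```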

One important aspect of the metadata is the column `metadata__split`.

It indicates that the sentences in the dataset have already been split into two "splits": a training (`trn`) and a development (`dev`) split.

In [12]:
df.metadata__split.value_counts()

metadata__split
trn    5089
dev     574
Name: count, dtype: int64

We will use the dev split as a "test" set for evaluating the model's performance after fine-tuning.

In [13]:
df_test = df[df['metadata__split']=='dev']

But for effective training, we will need an additional validation set (discussed further below).
So we will sample 10% of the examples in the original training split into a "validation" split:

In [14]:
df_train = df[df['metadata__split']=='trn']
df_train, df_val = train_test_split(df_train, test_size=0.1, stratify=df_train['label'], random_state=42)
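
To get a feeling for the resulting split sizes, here is a back-of-the-envelope sketch. It assumes that `train_test_split` rounds the size of the held-out part up to the next integer, which is my understanding of scikit-learn's behavior; the variable names are only illustrative.

```python
import math

n_trn = 5089      # size of the original `trn` split (see the counts above)
test_size = 0.1   # share of examples we hold out for validation

# assumption: the held-out part is rounded up, the rest stays in training
n_val = math.ceil(test_size * n_trn)
n_train = n_trn - n_val

print(n_train, n_val)  # → 4580 509
```

Because we pass `stratify=df_train['label']`, the share of pledges should be roughly the same in both parts.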

Now, we also keep only the columns relevant for training and evaluation:

In [15]:
cols = ['text', 'label']
# the label column is renamed to 'labels'; the 'text' column keeps its name
df_train = df_train[cols].rename(columns={'label':'labels'})
df_val = df_val[cols].rename(columns={'label':'labels'})
df_test = df_test[cols].rename(columns={'label':'labels'})

## Fine-tuning

In the era of large pre-trained language models, fine-tuning has become a crucial step in adapting these models to specific tasks and datasets.

**Fine-tuning** means taking a pre-trained model and training it further on a smaller, task-specific dataset. 

The key idea is that:

1. the pre-trained model has already learned a wealth of linguistic knowledge from vast amounts of text data, and 
2. by fine-tuning it on a specific task, we can adapt this knowledge to perform well on this task. 

In essence, fine-tuning allows us to harness a model's pre-trained capabilities while also learning the nuances of the new task.

We will proceed in three steps:

1. Prepare training: 
    - Specify (additional) classification performance metrics to compute when evaluating the model
    - Define training arguments that control the fine-tuning process
    - Create the classification model
2. Fine-tune the model on the training data (`df_train`) while monitoring its performance using the validation data (`df_val`)
3. Evaluate the fine-tuned model on the test data (`df_test`)

### Prepare training

#### Evaluation metrics

By default, `simpletransformers` always computes the Matthews correlation coefficient (MCC) during evaluation for classification tasks.

But we want to monitor additional metrics as well.
So let's define them:

In [16]:
from sklearn.metrics import precision_score, recall_score, f1_score, balanced_accuracy_score
extra_metrics_to_compute = dict(
    precision=precision_score,
    recall=recall_score,
    f1=f1_score, 
    ba=balanced_accuracy_score,
)

Here is what these metrics mean:

- The **precision** is the share of true positive predictions among all positive predictions made by the model.
    So it tells us how often the model was actually correct when it labeled an example as a positive instance (here, a pledge).

- The **recall** is the share of actual positive examples that were correctly identified by the model.
    So it tells us how many of all actual pledge instances the model was able to find.

- The **F1 score** combines these metrics into a single value by calculating their harmonic mean.
    It thus provides a balanced summary of precision and recall.

- The **balanced accuracy** is the average of recall obtained on each class.
    So it accounts for class imbalance by giving equal weight to the performance on both classes.
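
To make these definitions concrete, here is a small sketch that recomputes the four metrics by hand from confusion-matrix counts. The counts are those of the first evaluation (step 50) in the results table shown later in this notebook; the variable names are only illustrative.

```python
# confusion-matrix counts of the first evaluation in the results table
tp, tn, fp, fn = 55, 132, 54, 9

precision = tp / (tp + fp)      # correct among all predicted pledges
recall = tp / (tp + fn)         # found among all actual pledges
specificity = tn / (tn + fp)    # recall of the negative ("no pledge") class

# harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# average of the recall obtained on each of the two classes
balanced_accuracy = (recall + specificity) / 2

print(round(precision, 6), round(recall, 6), round(f1, 6), round(balanced_accuracy, 6))
# → 0.504587 0.859375 0.635838 0.784526
```

These values agree with the `precision`, `recall`, `f1`, and `ba` columns reported for that evaluation.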

#### The training arguments

In [17]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs
train_args = ClassificationArgs(
    
    # define a folder (relative to current working directory) to store outputs
    output_dir="outputs/",
    overwrite_output_dir=True,
    
    # main hyperparameters
    num_train_epochs = 5,  # how many times to iterate over the entire training set
    train_batch_size = 16, # how many examples per batch to use during training
    
    # switch on evaluation during training
    evaluate_during_training = True,
    # NOTE 
    #  Evaluation means that the model is applied to held-out data ("validation" examples).
    #  We provide these examples below when calling the `classifier.train_model()` method by passing 
    #   our `df_val` data frame to this method's `eval_df` argument.
    #  Based on these validation examples, we will be able to monitor the model's 
    #   classification performance (e.g. accuracy, f1, etc.) and how it changes during training.
    #  Using held-out data, that is, data that is not used to train the model, allows us to 
    #   compute a realistic estimate of model performance on "unseen" data.
    
    # Let's define some parameters relevant for evaluation:
    evaluate_during_training_steps = 50, # evaluate model after every 50 training steps
    eval_batch_size = 32, # number of examples per batch to use during evaluation
    evaluate_each_epoch = False, # switch off evaluation at the end of each epoch
    
    
    # switch on early stopping
    use_early_stopping = True,
    # NOTE: We use early stopping to avoid overfitting the model to the training examples.
    #  Overfitting means that the model learns to classify the training examples very well 
    #   but to the detriment of its ability to generalize to "unseen" (i.e., held-out) data.
    #  Early stopping means monitoring the model's performance on the validation examples
    #   during training (as specified above) and stopping training when this performance
    #   stops improving.
    early_stopping_metric = "f1", # what metric to monitor for early stopping?
    early_stopping_delta = 0.015, # by how much must this metric change to count as an improvement?
    early_stopping_metric_minimize = False, # does "improvement" mean that the metric should decrease?
    early_stopping_patience = 5, # how many evaluations to wait for an improvement before stopping training?

    manual_seed = 42,
    
    # finally, overwrite some of the default arguments defined by the ClassificationArgs
    no_save = True, # do not save model after training
    save_eval_checkpoints = False, # do not save model checkpoints during evaluation
    save_model_every_epoch = False, # do not save model after each epoch
    no_cache = True, # do not cache data
    use_multiprocessing = False, # do not use multiprocessing for data loading
    use_multiprocessing_for_evaluation = False, # do not use multiprocessing for evaluation data loading
    fp16 = False # do not use mixed precision training
)
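
To illustrate how the early stopping settings interact, here is a minimal sketch of the idea in plain Python. It is not the implementation used by `simpletransformers` (the library's exact counting may differ by one evaluation); the function name and the F1 values are hypothetical and only serve the illustration.

```python
def early_stopping_step(scores, delta, patience, minimize=False):
    """Return the index of the evaluation at which training would stop, or None."""
    best = None
    waited = 0  # evaluations since the last real improvement
    for i, score in enumerate(scores):
        if best is None:
            best = score
            continue
        gain = (best - score) if minimize else (score - best)
        if gain > delta:
            # the metric improved by more than `delta`: reset the counter
            best = score
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return i
    return None

# hypothetical validation F1 scores, one per evaluation
f1_scores = [0.60, 0.64, 0.645, 0.63, 0.65, 0.64, 0.62, 0.65]
stop_at = early_stopping_step(f1_scores, delta=0.015, patience=5)
print(stop_at)  # → 6
```

Note that the small gains from 0.64 to 0.645 and 0.65 do not reset the counter, because they are below the required `delta` of 0.015.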

In [18]:
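# compute inverse-proportional class weights so that the rarer class (pledges)
# counts more in the loss; sort by class ID so the list is ordered [weight_0, weight_1]
cnts = df_train['labels'].value_counts().sort_index()
class_weights = cnts.sum()/cnts
class_weights /= class_weights.sum()
class_weights = class_weights.tolist()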
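
To see what this weighting does, the following sketch repeats the calculation in plain Python. As an assumption for illustration, it uses the label counts of the full dataset shown earlier (4246 and 1417), since the exact counts in `df_train` are not printed in this notebook.

```python
# label counts of the full dataset (illustrative stand-in for the training counts)
counts = {0: 4246, 1: 1417}

total = sum(counts.values())
# inverse-proportional weights: the rarer a class, the larger its weight
inverse = {label: total / n for label, n in counts.items()}
# normalize so that the weights sum to one, ordered by class ID
norm = sum(inverse.values())
weights = [inverse[label] / norm for label in sorted(inverse)]

print([round(w, 3) for w in weights])  # → [0.25, 0.75]
```

With two classes, the normalized weight of each class equals the share of the other class, so errors on pledges weigh about three times as much as errors on non-pledges.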

Now we are ready to create our classifier by:

1. specifying the type and name of the model we want to fine-tune
2. passing the training arguments defined above
3. specifying the number of labels
4. passing the class weights computed above

In [20]:
model_name = "roberta-base"
classifier = ClassificationModel(
    model_type="roberta",
    model_name=model_name,
    args=train_args,
    num_labels=len(id2label),
    weight=class_weights
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**_Note:_** executing the above cell will raise a _warning_ message reading:

```
Some weights ... were not initialized from ...
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
```

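This is expected behavior ✅: the classification head (the `classifier.*` weights) is not part of the pre-trained checkpoint and is only learned during fine-tuning.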

In [21]:
steps, results = classifier.train_model(df_train.head(3_000), eval_df=df_val.head(250), **extra_metrics_to_compute)



Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Running Epoch 1 of 5:   0%|          | 0/188 [00:00<?, ?it/s]

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

Running Epoch 2 of 5:   0%|          | 0/188 [00:00<?, ?it/s]

Running Epoch 3 of 5:   0%|          | 0/188 [00:00<?, ?it/s]

Running Epoch 4 of 5:   0%|          | 0/188 [00:00<?, ?it/s]
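
The numbers in this output follow from the settings above. The sketch below reproduces them with simple arithmetic, assuming one training step per batch (that is, no gradient accumulation); the variable names are only illustrative.

```python
import math

n_train_examples = 3_000   # df_train.head(3_000)
train_batch_size = 16
num_train_epochs = 5
eval_every = 50            # evaluate_during_training_steps

# number of batches (= training steps) needed to see all examples once
steps_per_epoch = math.ceil(n_train_examples / train_batch_size)
# upper bound on training steps if early stopping never triggers
max_steps = steps_per_epoch * num_train_epochs
# upper bound on the number of evaluations during training
max_evaluations = max_steps // eval_every

# number of batches needed to evaluate the 574 test examples
test_batches = math.ceil(574 / 32)

print(steps_per_epoch, max_steps, max_evaluations, test_batches)  # → 188 940 18 18
```

The 188 steps per epoch match the `0/188` shown in the progress output, and the 18 test batches match the `0/18` shown when evaluating on the test set below.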

In [22]:
pd.DataFrame(results)

| Unnamed: 0 | global_step | train_loss | mcc | accuracy | f1_score | tp | tn | fp | fn | auroc | auprc | precision | recall | f1 | ba | eval_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 0.607441 | 0.500813 | 0.748 | 0.721589 | 55 | 132 | 54 | 9 | 0.84333 | 0.557808 | 0.504587 | 0.859375 | 0.635838 | 0.784526 | 0.480244 |
| 1 | 100 | 0.353002 | 0.507536 | 0.828 | 0.740808 | 31 | 176 | 10 | 33 | 0.843246 | 0.693335 | 0.756098 | 0.484375 | 0.590476 | 0.715306 | 0.6383 |
| 2 | 150 | 1.121513 | 0.445592 | 0.644 | 0.636028 | 62 | 99 | 87 | 2 | 0.868196 | 0.651725 | 0.416107 | 0.96875 | 0.58216 | 0.750504 | 0.639333 |
| 3 | 200 | 0.480991 | 0.572352 | 0.816 | 0.779254 | 51 | 153 | 33 | 13 | 0.879872 | 0.674425 | 0.607143 | 0.796875 | 0.689189 | 0.809728 | 0.50721 |
| 4 | 250 | 0.461299 | 0.532289 | 0.836 | 0.752863 | 32 | 177 | 9 | 32 | 0.860131 | 0.713904 | 0.780488 | 0.5 | 0.609524 | 0.725806 | 0.598129 |
| 5 | 300 | 0.507175 | 0.562249 | 0.844 | 0.774196 | 36 | 175 | 11 | 28 | 0.87584 | 0.727406 | 0.765957 | 0.5625 | 0.648649 | 0.75168 | 0.51467 |
| 6 | 350 | 0.450001 | 0.546176 | 0.828 | 0.773068 | 42 | 165 | 21 | 22 | 0.858283 | 0.729485 | 0.666667 | 0.65625 | 0.661417 | 0.771673 | 0.498372 |
| 7 | 400 | 0.547252 | 0.491874 | 0.82 | 0.739457 | 33 | 172 | 14 | 31 | 0.863827 | 0.717999 | 0.702128 | 0.515625 | 0.594595 | 0.720178 | 0.625993 |
| 8 | 450 | 0.158072 | 0.596019 | 0.836 | 0.795731 | 49 | 160 | 26 | 15 | 0.889113 | 0.759396 | 0.653333 | 0.765625 | 0.705036 | 0.81292 | 0.53716 |
| 9 | 500 | 0.023891 | 0.567346 | 0.844 | 0.779785 | 38 | 173 | 13 | 26 | 0.867776 | 0.706208 | 0.745098 | 0.59375 | 0.66087 | 0.761929 | 0.836609 |


In [23]:
eval_res, probs, *_ = classifier.eval_model(df_test, **extra_metrics_to_compute)

Map:   0%|          | 0/574 [00:00<?, ? examples/s]

Running Evaluation:   0%|          | 0/18 [00:00<?, ?it/s]

In [None]:
pd.DataFrame(eval_res, index=['test'])
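
Besides the aggregate metrics, the second value returned by `eval_model` can be turned into per-sentence predictions. The sketch below assumes that it holds one row of raw class scores (logits) per example, which is my understanding of the library's output but is not confirmed in this notebook. The scores used here are hypothetical.

```python
import math

id2label = {0: "no pledge", 1: "pledge"}

# hypothetical raw class scores for three sentences (one row per sentence)
model_outputs = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]]

def softmax(scores):
    """Convert raw scores into probabilities that sum to one."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = [softmax(row) for row in model_outputs]
# the predicted class is the one with the highest score
predicted_ids = [max(range(len(row)), key=row.__getitem__) for row in model_outputs]
predicted_labels = [id2label[i] for i in predicted_ids]

print(predicted_labels)  # → ['no pledge', 'pledge', 'no pledge']
```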

In [28]:
model_name = "roberta-base"
model_path = base_path / "models" / (model_name + "-pledge-binary")
classifier.args.no_save = False
classifier.save_model(output_dir=model_path, model=classifier.model)

In [29]:
# clean up
import shutil
shutil.rmtree("outputs/", ignore_errors=True)
shutil.rmtree("runs/", ignore_errors=True)