# HUHU@IberLEF2023 Task 2b (Regression)

Task: https://sites.google.com/view/huhuatiberlef23/huhu

This notebook contains the code to fine-tune several pre-trained transformers for the task of hurtful humour detection (regression).

In particular, the models are:

* BERT Multilingual: ``bert-base-multilingual-cased`` and ``bert-base-multilingual-uncased``
* RoBERTa: ``roberta-base``
* BETO: ``dccuchile/bert-base-spanish-wwm-cased`` and ``dccuchile/bert-base-spanish-wwm-uncased``
* DistilBERT Multilingual: ``distilbert-base-multilingual-cased``

To take advantage of these transformer models, ensembles are built from every possible combination of them.

Experiments show that combining the predictions of these models achieves better results than using each model independently.

# Setting up the environment

In [None]:
import torch

# Check GPU availability on Google Colab
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

use_cuda = torch.cuda.is_available()

In [None]:
# Install libraries
!pip install simpletransformers
!pip install datasets
!pip install ipywidgets
!pip install --upgrade huggingface_hub

In [None]:
# Define global variables

SEED = 42 # allow for experiments' reproducibility
WEIGHTED = True # use weighted ensemble (in favour of models with lower RMSE)

# Dataset load

In [None]:
from huggingface_hub import notebook_login
# Notebook login via HF's token
notebook_login()

In [None]:
from datasets import load_dataset, logging
import pandas as pd

# Avoid warnings
logging.set_verbosity_error()

# Load training, validation and test splits
train = pd.DataFrame(load_dataset("huhu2023/bin-huhu2023", split="train"))
val = pd.DataFrame(load_dataset("huhu2023/bin-huhu2023", split="validation"))
test = pd.DataFrame(load_dataset("huhu2023/bin-huhu2023", split="test"))

train.head()

In [None]:
# Function to rename fields and drop unnecessary ones
def get_text_and_label(df, original_dataset=True):
  return df.rename(columns={"tweet": "text", "mean_prejudice": "score"})[["text", "score"]]

# Get treated dataframe for training, validation and test splits
train = get_text_and_label(train)
val = get_text_and_label(val)
test = get_text_and_label(test)

print(f"Dataset size: <{len(train.index)}:{len(val.index)}:{len(test.index)}>")
train.head()

# Create output directory

This section defines the output directory structure. Each transformer model is saved along with its results. Metrics on the performance of the ensembles are also collected for further analysis.
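As a rough sketch of the layout described above, the snippet below creates one timestamped folder per run and one sub-folder per transformer. The temporary root folder and the two model names are placeholders chosen for illustration; the notebook itself writes to a Google Drive path.

```python
import tempfile
from datetime import datetime
from pathlib import Path

# Hypothetical root folder; the notebook uses a Google Drive path instead
root = Path(tempfile.mkdtemp())

# One folder per run, named after the current date and time
run_dir = root / datetime.now().strftime("%d-%m-%Y-%H-%M")
run_dir.mkdir(parents=True, exist_ok=True)

# One sub-folder per transformer (only two names are shown here)
for model_name in ["mbert-cased", "roberta"]:
    (run_dir / model_name).mkdir(exist_ok=True)

created = sorted(p.name for p in run_dir.iterdir())
print(created)
```

Using `parents=True` and `exist_ok=True` avoids errors when the parent folder is missing or the run folder already exists.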

In [None]:
# Load and mount the Drive helper
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
from datetime import datetime, timedelta
import os

# Define unique path for current experiment
PATH = "/path/to/task2b/outputs/{}/".format((datetime.now() + timedelta(hours=2)).strftime("%d-%m-%Y-%H-%M"))
print("Current working dir:", PATH)

# Create directory (including any missing parent folders)
os.makedirs(PATH, exist_ok=True)

# Models' definition

This section gathers the different transformers that will be evaluated. The implementation mainly relies on the ``simpletransformers`` Python library, which makes it possible to train and test transformers in a few steps.

For further information: https://simpletransformers.ai/

In [None]:
# Define transformers' initialization dictionary 
models = {
    "mbert-cased": {
        "model_type": "bert",
        "model_name": "bert-base-multilingual-cased"
    },
    "mbert-uncased": {
        "model_type": "bert",
        "model_name": "bert-base-multilingual-uncased"
    },
    "roberta": {
        "model_type": "roberta",
        "model_name": "roberta-base"
    },
    "beto-cased": {
        "model_type": "bert",
        "model_name": "dccuchile/bert-base-spanish-wwm-cased"
    },
    "beto-uncased": {
        "model_type": "bert",
        "model_name": "dccuchile/bert-base-spanish-wwm-uncased"
    },
    "distilbert-multi": {
        "model_type": "distilbert",
        "model_name": "distilbert-base-multilingual-cased"
    }
}

In [None]:
# Import pre-trained simpletransformers models for classification
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Define the number of labels for this task
num_labels = 1

# Define a dictionary where each key matches its corresponding transformer
# All transformers share the same classification arguments
for model, fields in models.items():    

  # Define models' classification arguments
  model_args = ClassificationArgs(
      overwrite_output_dir= True,
      regression=True,
      eval_batch_size=8,
      num_train_epochs=10,
      learning_rate = 8e-05,
      optimizer="Adafactor",
      manual_seed=SEED,
      use_early_stopping=True,
      save_model_every_epoch=False,
      adafactor_relative_step=False, adafactor_warmup_init=False
  )

  model_args.output_dir = os.path.join(PATH, model)
  # os.mkdir(model_args.output_dir)
  models[model] = ClassificationModel(fields["model_type"], fields["model_name"],
                                      args=model_args, num_labels=num_labels, use_cuda=use_cuda)

# Training

Each of the aforementioned models is trained separately on the entire training set.

This training is directly performed in the previously defined dictionary for convenience.

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# Define RMSE function
def root_mean_squared_error(y_true, y_pred):
    return sqrt(mean_squared_error(y_true, y_pred))

In [None]:
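# Train all models with training set instances
# Note: extra keyword arguments of train_model are treated as evaluation
# metrics (name=function), not as the training loss
for model_name, model in models.items():
  model.train_model(train, rmse=root_mean_squared_error)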

# Ensembles' definition

This section creates every ensemble of transformers that can be built from the previously trained models.

For convenience, a dictionary is created in which each ensemble is uniquely identified.
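As a small illustration of how the ensembles are enumerated, the sketch below lists every non-empty combination of three placeholder model names (`a`, `b` and `c` are hypothetical). With `n` models there are `2**n - 1` such combinations, so the six transformers above yield 63 ensembles.

```python
from itertools import combinations

# Hypothetical model names, used only for illustration
toy_models = ["a", "b", "c"]

# Collect every non-empty subset, from single models up to the full set
toy_ensembles = [
    list(combo)
    for size in range(1, len(toy_models) + 1)
    for combo in combinations(toy_models, size)
]

# Give each ensemble a unique, zero-padded identifier
toy_named = {f"ensemble{i:02d}": members for i, members in enumerate(toy_ensembles)}

print(len(toy_named))           # 7, i.e. 2**3 - 1
print(toy_named["ensemble03"])  # ['a', 'b']
```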

In [None]:
from itertools import combinations

# Define a list containing the lists of models of each ensemble
models_names = list(models.keys())
ensembles_list = list()

for i in range(1, len(models_names) + 1):
    ensembles_list += list(combinations(models_names, i))
ensembles_list = [list(ensemble) for ensemble in ensembles_list]

# Define a dictionary with the ensembles
ensembles = {}
for i in range(len(ensembles_list)):
  ensembles["ensemble{:02d}".format(i)] = {}
  ensembles["ensemble{:02d}".format(i)]["models"] = ensembles_list[i]
  ensembles["ensemble{:02d}".format(i)]["metrics"] = {}
ensembles

# Evaluation

Firstly, each transformer is individually evaluated using the validation split. Subsequently, the main evaluation metrics (R² score, MAE, MSE and RMSE) are stored.

Secondly, the predictions of each ensemble for the validation set instances are derived. After calculating their metrics, it is possible to determine which ensemble obtained the best RMSE. This will be the final ensemble used for the test dataset.

The ensembles' predictions are obtained through an averaging-based voting system: after computing the output that each of the ensemble's models produces for a given instance, the ensemble result is the mean of those scores.

The voting system can be non-weighted or weighted. In the latter, the prediction of each individual transformer is weighted by the normalized inverse of its validation RMSE, which gives greater importance to the most accurate models without disregarding the outputs of the other transformers.
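The weighting scheme can be sketched with plain Python as follows. The model names, RMSE values and predicted scores are made up for illustration; only the inverse-RMSE weighting followed by L1 normalization mirrors what this notebook does.

```python
# Hypothetical validation RMSE of three models (lower is better)
rmse_by_model = {"model_a": 0.5, "model_b": 1.0, "model_c": 1.0}

# Hypothetical scores predicted by each model for a single instance
predictions = {"model_a": 3.0, "model_b": 1.0, "model_c": 2.0}

# Invert the RMSE so that more accurate models get larger weights,
# then L1-normalize so that the weights add up to 1
inverse = {name: 1.0 / rmse for name, rmse in rmse_by_model.items()}
total = sum(inverse.values())
weights = {name: value / total for name, value in inverse.items()}

# Non-weighted vote: plain mean of the predictions
unweighted = sum(predictions.values()) / len(predictions)

# Weighted vote: each prediction is scaled by its model's weight
weighted = sum(predictions[name] * weights[name] for name in predictions)

print(weights)     # {'model_a': 0.5, 'model_b': 0.25, 'model_c': 0.25}
print(unweighted)  # 2.0
print(weighted)    # 2.25
```

In this toy case the weighted result moves towards the prediction of `model_a`, the model with the lowest RMSE.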

In [None]:
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Function which computes the evaluation metrics given two lists of true and
# predicted scores
def compute_metrics(y_true, y_pred):
  return {
      'r2_score': round(r2_score(y_true, y_pred), 7), 
      'mae': round(mean_absolute_error(y_true, y_pred), 7),
      'mse': round(mean_squared_error(y_true, y_pred), 7),
      'rmse': round(root_mean_squared_error(y_true, y_pred), 7)
  }

# Transformers' evaluation under the validation set
model_evaluation = {}
for model_name in models:
  model_evaluation[model_name] = {}
  # Storing the prediction outputs
  result, model_outputs, wrong_predictions = models[model_name].eval_model(val, metric=root_mean_squared_error)
  model_evaluation[model_name]["result"] = result                                               # Result
  model_evaluation[model_name]["val_scores"] = [round(output, 1) for output in model_outputs]   # Model predicted scores
  model_evaluation[model_name]["val_wrong_predictions"] = wrong_predictions                     # Wrongly-predicted instances
  
  # Storing the metrics
  model_evaluation[model_name]["metrics"] = compute_metrics(val.get("score"), model_evaluation[model_name].get("val_scores"))
  print(f"{model_name}\t", model_evaluation[model_name].get("metrics"))

The ``vote`` function determines the ensemble prediction based on the outcomes of its transformers. Its arguments are:
1.   ``predictions``: list of transformers' predicted scores
2.   ``weighted``: bool that determines whether a weighted voting system must be used
3.   ``weights``: list of weights (normalized inverse RMSE)

The ``predict_ensemble`` function calculates the predictions of a given ensemble (``ensemble_name``) for a given dataset split (``dataset_name``, ``dataset``).

In [None]:
from sklearn.preprocessing import normalize

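# Function which determines the ensemble prediction based on its
# transformers' predictions. A weighted voting system may be used
def vote(predictions, weighted=False, weights=None):
  # Weighted sum (weights are expected to add up to 1) or plain mean;
  # float() yields a built-in float, which keeps the results JSON-serializable
  return float(np.dot(predictions, weights)) if weighted else float(np.mean(predictions))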

# ensemble metrics
ensemble_evaluation = {}

# Function to predict the label of the instances in a dataset split (validation
# ("val") or test ("test")) for each ensemble
def predict_ensemble(ensemble_name, dataset_name, dataset, weighted=False):
  ensemble_evaluation[ensemble_name][f"{dataset_name}_scores"] = list()
  # Traverse each dataset instance
  for i in range(len(dataset.index)):
    predictions = list()
    ensemble_models = ensembles[ensemble_name].get("models")
    # Get the raw output of each model in the ensemble for the instance at hand
    for model_name in ensemble_models:
      curr_model_outputs = model_evaluation[model_name].get(f"{dataset_name}_scores")
      predictions.append(curr_model_outputs[i])
    
    # Define the list of weights if a weighted voting system must be used
    weights = list()
    if weighted:
      # The weights' list is obtained by L1-normalizing the inverse RMSE of
      # the models in the ensemble (a lower RMSE gives a higher weight)
      rmse_list = [model_evaluation[model_name]["metrics"].get("rmse")
                        for model_name in ensembles[ensemble_name].get("models")]
      weights = normalize([[1/rmse for rmse in rmse_list]], norm="l1")[0]

    # Append the computed scores to the predictions of the ensemble
    ensemble_pred = round(vote(predictions, weighted, weights), 1)
    ensemble_evaluation[ensemble_name][f"{dataset_name}_scores"].append(ensemble_pred)

# Ensembles' evaluation under the validation set
for ensemble_name in ensembles:
  ensemble_evaluation[ensemble_name] = {}
  ensemble_evaluation[ensemble_name]["val_scores"] = list()
  predict_ensemble(ensemble_name, "val", val, weighted=WEIGHTED)
  ensembles[ensemble_name]["metrics"] = compute_metrics(val.get("score"), ensemble_evaluation[ensemble_name].get("val_scores"))
  print(f"{ensemble_name}\t", ensembles[ensemble_name].get("metrics"))

In [None]:
import json

# Save ensembles to JSON file
with open(os.path.join(PATH, 'ensembles.json'), 'w', encoding='utf-8') as f:
    json.dump(ensembles, f, ensure_ascii=False, indent=4)

# Selecting the best ensemble

Once each ensemble has predicted a score for every validation instance, its metrics can be computed. Since this is a regression task, the best ensemble is the one with the lowest RMSE.
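The selection criterion can be illustrated with a toy example. The true scores and the two candidate prediction lists below are invented; the hand-written `rmse` helper is only a stand-in for the scikit-learn based function used in this notebook.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical true scores and predictions of two candidate ensembles
y_true = [1.0, 2.0, 3.0]
candidate_preds = {
    "ensemble00": [1.0, 2.0, 5.0],  # one large error
    "ensemble01": [1.0, 3.0, 3.0],  # one small error
}

# Compute the RMSE of each candidate and keep the one with the lowest value
rmse_by_ensemble = {name: rmse(y_true, preds) for name, preds in candidate_preds.items()}
best_name = min(rmse_by_ensemble, key=rmse_by_ensemble.get)

print(best_name)  # ensemble01
```

Because RMSE squares each error before averaging, the single large miss of `ensemble00` is penalized more heavily than the small miss of `ensemble01`.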

In [None]:
# Defining a dictionary with the RMSE of each ensemble
rmse_ensembles = {ensemble_name: ensembles[ensemble_name]["metrics"].get("rmse") for ensemble_name in ensemble_evaluation}
# Selecting the best ensemble
best_ensemble_name = min(rmse_ensembles, key=rmse_ensembles.get)
best_ensemble = {"name": best_ensemble_name,
                 "models": ensembles[best_ensemble_name].get("models"),
                 "metrics": ensembles[best_ensemble_name].get("metrics")
                 }

best_ensemble

# Predictions on test set

Finally, the ensemble that obtained the lowest RMSE is used to predict the score of each test instance.

These results are later used to draw some evaluation plots.

In [None]:
# Predicting the score of the test set's instances with each individual transformer
for model_name in models:
  _, model_raw_outputs = models.get(model_name).predict(test["text"].tolist())
  model_evaluation[model_name]["test_scores"] = model_raw_outputs

# Calculating the test predictions of the best ensemble (a single call is enough)
predict_ensemble(best_ensemble.get("name"), "test", test, weighted=WEIGHTED)

In [None]:
# Dump individual transformers' results
for model_name, curr_model in model_evaluation.items():

  # Converting NumPy values to plain Python floats, since the json module
  # cannot serialize NumPy arrays or float32 scalars
  curr_model["val_scores"] = [float(score) for score in curr_model.get("val_scores")]
  curr_model["test_scores"] = [float(score) for score in curr_model.get("test_scores")]
  
  # Adapting validation wrong predictions (if any)
  if curr_model.get("val_wrong_predictions"):
    curr_model["val_wrong_predictions_list"] = curr_model.get("val_wrong_predictions")
    curr_model["val_wrong_predictions"] = {}
    for pred in curr_model.get("val_wrong_predictions_list"):
      curr_model["val_wrong_predictions"][pred.guid] = {
          "text_a": pred.text_a,
          "text_b": pred.text_b,
          "label": pred.label
      }
    del curr_model["val_wrong_predictions_list"]

  with open(os.path.join(PATH, f'{model_name}/model-evaluation.json'), 'w', encoding='utf-8') as f:
    json.dump(curr_model, f, ensure_ascii=False, indent=4)

In [None]:
# Complete fields of best ensemble dictionary
best_ensemble["val_scores"] = ensemble_evaluation[best_ensemble.get("name")].get("val_scores")
best_ensemble["test_scores"] = ensemble_evaluation[best_ensemble.get("name")].get("test_scores")

# Save best ensemble to JSON file
with open(os.path.join(PATH, 'best-ensemble.json'), 'w', encoding='utf-8') as f:
    json.dump(best_ensemble, f, ensure_ascii=False, indent=4)

In [None]:
# Creating a new column of predicted scores in the test dataframe
test["predicted_score"] = ensemble_evaluation[best_ensemble.get("name")].get("test_scores")
test.head(10)

In [None]:
# Dump test predictions
test.to_csv(os.path.join(PATH, "test-predictions.csv"), index=False)  

# True vs. Predicted scores

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Global parameters
sns.reset_orig()
sns.set_style({"xtick.direction": "in","ytick.direction": "in"})

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(17,5))

# Plot 1
sns.kdeplot(data=test, x="score", y="predicted_score",
            fill=True, thresh=0, levels=20, cmap="rocket_r", ax=ax1)
ax1.set_xlabel("True score", fontsize=15)
ax1.set_ylabel("Predicted score", fontsize=15)

# Plot 2
sns.kdeplot(data=test, x="score", y="predicted_score",
            fill=False, thresh=0, levels=20, cmap="rocket_r", ax=ax2)
min_score = min(min(test["score"]), min(test["predicted_score"]))
max_score = max(max(test["score"]), max(test["predicted_score"]))
ax2.plot([min_score, max_score], [min_score, max_score], "k--")
ax2.set_xlabel("True score", fontsize=15)
ax2.set_ylabel("")

# Plot 3 (the target axes are passed explicitly, as in the other two plots)
sns.scatterplot(data=test, x="score", y="predicted_score", color="orange", ax=ax3)
ax3.plot([min_score, max_score], [min_score, max_score], "k--")
ax3.set_xlabel("True score", fontsize=15)
ax3.set_ylabel("")

# General specifications
fig.suptitle("True vs. Predicted scores", fontsize=20)

# Save true vs. predicted scores plot
fig.savefig(os.path.join(PATH, "true-vs-pred.png"))

plt.show()