# HUHU@IberLEF2023 Task 1 (Binary Classification)

Task: https://sites.google.com/view/huhuatiberlef23/huhu

This notebook contains the code to fine-tune several pre-trained transformers for the task of hurtful humour detection (binary classification).

In particular, the models are:

* BERT Multilingual: ``bert-base-multilingual-cased`` and ``bert-base-multilingual-uncased``
* RoBERTa: ``roberta-base``
* BETO: ``dccuchile/bert-base-spanish-wwm-cased`` and ``dccuchile/bert-base-spanish-wwm-uncased``
* DistilBERT Multilingual: ``distilbert-base-multilingual-cased``

To take advantage of these transformer models, ensembles are built from every possible combination of them.

Experiments show that combining the predictions of these models yields better results than using any of them on its own.

# Setting up the environment

In [None]:
import torch

# Check GPU availability on Google Colab
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

use_cuda = torch.cuda.is_available()

In [None]:
# Install libraries
!pip install simpletransformers
!pip install datasets
!pip install ipywidgets
!pip install --upgrade huggingface_hub

In [None]:
# Define global variables

SEED = 42 # allow for experiments' reproducibility
WEIGHTED = True # use weighted ensemble (in favour of models with higher F1-score)

# Dataset load

In [None]:
from huggingface_hub import notebook_login
# Notebook login via HF's token
notebook_login()

In [None]:
from datasets import load_dataset, logging
import pandas as pd

# Avoid warnings
logging.set_verbosity_error()

# Load training, validation and test splits
train = pd.DataFrame(load_dataset("huhu2023/bin-huhu2023", split="train"))
val = pd.DataFrame(load_dataset("huhu2023/bin-huhu2023", split="validation"))
test = pd.DataFrame(load_dataset("huhu2023/bin-huhu2023", split="test"))

train.head()

In [None]:
# Function to rename fields and drop unnecessary ones
def get_text_and_label(df, original_dataset=True):
  text_tag = "tweet" if original_dataset else "text"
  label_tag = "humor" if original_dataset else "is_humor"
  return df.rename(columns={text_tag: "text", label_tag: "label"})[["text", "label"]]

# Get treated dataframe for training, validation and test splits
train = get_text_and_label(train)
val = get_text_and_label(val)
test = get_text_and_label(test)

print(f"Dataset size: <{len(train.index)}:{len(val.index)}:{len(test.index)}>")
train.head()

# Create output directory

The output directory structure is defined here. Each of the transformer models will be saved, along with its results. Metrics on the performance of the ensembles will also be collected for further analysis.

In [None]:
# Load and mount the Drive helper
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
from datetime import datetime, timedelta
import os

# Define unique path for current experiment
PATH = "/path/to/task1/outputs/{}/".format((datetime.now() + timedelta(hours=2)).strftime("%d-%m-%Y-%H-%M"))
print("Current working dir:", PATH)

# Create directory
os.makedirs(PATH, exist_ok=True)  # also creates missing parent directories

# Models' definition

In this section, the different transformers that will be evaluated are gathered. For this purpose, the implementation mainly relies on the ``simpletransformers`` Python library, which makes it possible to train and test transformers in a few steps.

For further information: https://simpletransformers.ai/

**IMPORTANT NOTE:** although this is a binary classification task, it will be treated as a regression where a value between 0 and 1 must be predicted for each instance. Later, these predictions will be turned into binary values by the corresponding ensemble.
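
As a minimal sketch of the last step of this note, the snippet below turns raw regression scores into binary labels with a 0.5 cut-off, which is the threshold used later in this notebook. The function name ``to_binary_labels`` and the example scores are hypothetical. Note that a regression head is not constrained to the [0, 1] range, so scores slightly below 0 or above 1 can occur.

```python
def to_binary_labels(scores, threshold=0.5):
    """Map regression scores to class labels: 1 if score >= threshold, else 0."""
    return [0 if score < threshold else 1 for score in scores]

# Hypothetical raw outputs of a regression head for four tweets
example_scores = [0.03, 0.49, 0.50, 1.12]
example_labels = to_binary_labels(example_scores)
print(example_labels)  # [0, 0, 1, 1]
```

A score exactly equal to the threshold is assigned to the positive class.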

In [None]:
# Define transformers' initialization dictionary 
models = {
    "mbert-cased": {
        "model_type": "bert",
        "model_name": "bert-base-multilingual-cased"
    },
    "mbert-uncased": {
        "model_type": "bert",
        "model_name": "bert-base-multilingual-uncased"
    },
    "roberta": {
        "model_type": "roberta",
        "model_name": "roberta-base"
    },
    "beto-cased": {
        "model_type": "bert",
        "model_name": "dccuchile/bert-base-spanish-wwm-cased"
    },
    "beto-uncased": {
        "model_type": "bert",
        "model_name": "dccuchile/bert-base-spanish-wwm-uncased"
    },
    "distilbert-multi": {
        "model_type": "distilbert",
        "model_name": "distilbert-base-multilingual-cased"
    }
}

In [None]:
# Import pre-trained simpletransformers models for classification
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Define the number of labels for this task (a unique binary label)
num_labels = 1

# Define a dictionary where each key matches its corresponding transformer
# All transformers share the same classification arguments
for model, fields in models.items():

  # Define models' classification arguments
  model_args = ClassificationArgs(
      overwrite_output_dir=True,
      regression=True,
      eval_batch_size=8,
      num_train_epochs=5,
      learning_rate=4e-05,
      optimizer="AdamW",
      manual_seed=SEED,
      use_early_stopping=True,
      save_model_every_epoch=False
  )

  model_args.output_dir = os.path.join(PATH, model)
  # os.mkdir(model_args.output_dir)
  # Replace the initialization dictionary with the instantiated model
  models[model] = ClassificationModel(fields["model_type"], fields["model_name"],
                                      args=model_args, num_labels=num_labels, use_cuda=use_cuda)

# Training

Each of the aforementioned models is trained separately on the entire training set.

For convenience, training is performed directly on the models stored in the previously defined dictionary.

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# Define RMSE function
def root_mean_squared_error(y_true, y_pred):
    return sqrt(mean_squared_error(y_true, y_pred))

In [None]:
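# Train all models with training set instances.
# NOTE: simpletransformers treats extra keyword arguments of train_model as
# evaluation metrics (name=function), so RMSE is registered here as a metric
# named "loss_fct"; it does not replace the training loss
for model_name, model in models.items():
  model.train_model(train, loss_fct=root_mean_squared_error)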

# Ensembles' definition

All the ensembles of transformers that can be built from the previously trained models are now created.

For convenience, a dictionary is created that uniquely identifies each ensemble.
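
To get a feel for how many ensembles this produces, the sketch below enumerates every non-empty combination of the six model keys used in this notebook. The helper name ``all_nonempty_subsets`` is hypothetical; the count follows from the fact that a set of n elements has 2^n - 1 non-empty subsets.

```python
from itertools import combinations

def all_nonempty_subsets(names):
    """Return every non-empty combination of names, ordered by increasing size."""
    subsets = []
    for size in range(1, len(names) + 1):
        subsets.extend(list(combo) for combo in combinations(names, size))
    return subsets

model_keys = ["mbert-cased", "mbert-uncased", "roberta",
              "beto-cased", "beto-uncased", "distilbert-multi"]
subsets = all_nonempty_subsets(model_keys)
print(len(subsets))  # 63, i.e. 2**6 - 1
```

The first six combinations contain a single model each, so the individual transformers are also evaluated as one-member ensembles.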

In [None]:
from itertools import combinations

# Define a list containing the lists of models of each ensemble
# (every non-empty combination, ordered by increasing size)
models_names = list(models.keys())
ensembles_list = [list(ensemble)
                  for i in range(1, len(models_names) + 1)
                  for ensemble in combinations(models_names, i)]

# Define a dictionary with the ensembles
ensembles = {"ensemble{:02d}".format(i): {"models": ensemble_models, "metrics": {}}
             for i, ensemble_models in enumerate(ensembles_list)}
ensembles

# Evaluation

Firstly, each transformer is individually evaluated using the validation split. Subsequently, the main evaluation metrics (accuracy, F1-score, precision and recall) are stored.

Secondly, the predictions of each ensemble for the validation set instances are derived. After calculating their metrics, it is possible to determine which ensemble obtained the best F1-Score. This will be the final ensemble used for the test dataset.

Regarding the ensembles' predictions, these are obtained through a voting system over the raw outputs of the models: the outputs that the ensemble's models produce for a given instance are averaged, and the ensemble predicts the positive class when this average is at least 0.5.

The voting system can be non-weighted or weighted. In the latter, the output of each individual transformer is weighted according to its normalized F1-score, thus giving greater importance to the best model without disregarding the outputs of the other transformers.
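
The effect of the weights can be seen in a small worked example. This is a sketch in plain Python that assumes, as the ``vote`` function in this notebook does, that the vote is taken over the raw model outputs and thresholded at 0.5. The helper names and all the numbers are hypothetical.

```python
def l1_normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [value / total for value in values]

def weighted_vote(outputs, f1_scores, threshold=0.5):
    """Return (label, score), where score is the F1-weighted average of outputs."""
    weights = l1_normalize(f1_scores)
    score = sum(output * weight for output, weight in zip(outputs, weights))
    return (0 if score < threshold else 1), score

# Hypothetical raw outputs of three models for one instance, and their
# hypothetical validation F1-scores (normalized weights: 0.4, 0.3, 0.3)
outputs = [0.9, 0.2, 0.3]
f1_scores = [0.8, 0.6, 0.6]

label, score = weighted_vote(outputs, f1_scores)
plain_mean = sum(outputs) / len(outputs)

print(label, round(score, 2))  # 1 0.51
print(round(plain_mean, 2))    # 0.47
```

Here the plain average falls below 0.5, while the weighted average rises above it because the model with the highest F1-score is the one predicting the positive class.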

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import time

# Function which computes the evaluation metrics given two lists of true and
# predicted labels
def compute_metrics(y_true, y_pred):
  precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
  acc = accuracy_score(y_true, y_pred)
  return {
      'accuracy': round(acc, 5),
      'f1': round(f1, 5),
      'precision': round(precision, 5),
      'recall': round(recall, 5)
  }

# Transformers' evaluation under the validation set
model_evaluation = {}
for model_name in models:
  model_evaluation[model_name] = {}
  # Storing the prediction outputs
  result, model_outputs, wrong_predictions = models[model_name].eval_model(val, metric=root_mean_squared_error)
  model_evaluation[model_name]["result"] = result                                                           # Result
  model_evaluation[model_name]["val_model_outputs"] = model_outputs                                         # Raw model outputs
  model_evaluation[model_name]["val_predictions"] = [0 if output < 0.5 else 1 for output in model_outputs]  # Class prediction
  model_evaluation[model_name]["val_wrong_predictions"] = wrong_predictions                                 # Wrongly-predicted instances
  
  # Storing the metrics
  model_evaluation[model_name]["metrics"] = compute_metrics(val.get("label"), model_evaluation[model_name].get("val_predictions"))
  print(f"{model_name}\t", model_evaluation[model_name].get("metrics"))

The ``vote`` function determines the ensemble prediction based on the outputs of its transformers. Its arguments are:
1.   ``predictions``: list of the transformers' (raw) outputs
2.   ``weighted``: bool that determines whether a weighted voting system must be used
3.   ``weights``: list of weights (normalized F1-scores)

The ``predict_ensemble`` function calculates the predictions of an ensemble for a given dataset split (``dataset_name``, ``dataset``).

In [None]:
from sklearn.preprocessing import normalize

# Function which determines the ensemble prediction based on its
# transformers' raw outputs. A weighted voting system may be used
def vote(predictions, weighted=False, weights=None):
  predictions = np.asarray(predictions, dtype=float)
  # Weighted average (weights are expected to sum to 1) or plain average
  voting = np.dot(predictions, weights) if weighted else predictions.mean()
  return 0 if voting < 0.5 else 1

ensemble_evaluation = {}

# Function to predict the label of the instances in a dataset split (validation
# ("val") or test ("test")) for each ensemble
def predict_ensemble(ensemble_name, dataset_name, dataset, weighted=False):
  ensemble_models = ensembles[ensemble_name].get("models")

  # Define the list of weights if a weighted voting system must be used.
  # The weights are obtained by L1-normalizing the F1-scores of the models in
  # the ensemble, so they are the same for every instance
  weights = list()
  if weighted:
    f1_scores_list = [model_evaluation[model_name]["metrics"].get("f1")
                      for model_name in ensemble_models]
    weights = normalize([f1_scores_list], norm="l1")[0]

  ensemble_evaluation[ensemble_name][f"{dataset_name}_predictions"] = list()
  # Traverse each dataset instance
  for i in range(len(dataset.index)):
    # Get the raw output of each model in the ensemble for the instance at hand
    predictions = [model_evaluation[model_name].get(f"{dataset_name}_model_outputs")[i]
                   for model_name in ensemble_models]

    # Append the predicted label to the predictions of the ensemble
    ensemble_pred = vote(predictions, weighted, weights)
    ensemble_evaluation[ensemble_name][f"{dataset_name}_predictions"].append(ensemble_pred)

# Ensembles' evaluation under the validation set
for ensemble_name in ensembles:
  ensemble_evaluation[ensemble_name] = {}
  ensemble_evaluation[ensemble_name]["val_predictions"] = list()
  predict_ensemble(ensemble_name, "val", val, weighted=WEIGHTED)
  ensembles[ensemble_name]["metrics"] = compute_metrics(val.get("label"), ensemble_evaluation[ensemble_name].get("val_predictions"))
  print(f"{ensemble_name}\t", ensembles[ensemble_name].get("metrics"))

In [None]:
import json

# Save ensembles to JSON file
with open(os.path.join(PATH, 'ensembles.json'), 'w', encoding='utf-8') as f:
    json.dump(ensembles, f, ensure_ascii=False, indent=4)

# Selecting the best ensemble

Once the predicted labels of the validation instances have been calculated for each ensemble, their metrics can be computed. Since this is a binary classification task, the best ensemble is the one with the highest F1-score.
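
A point worth keeping in mind is how ties are resolved when the selection is done with Python's built-in ``max``, as in the cell that follows. The sketch below uses made-up F1-scores, and the dictionary name ``f1_by_ensemble`` is hypothetical.

```python
# Hypothetical validation F1-scores; ensemble01 and ensemble02 are tied
f1_by_ensemble = {"ensemble00": 0.81, "ensemble01": 0.84, "ensemble02": 0.84}

# max() scans the keys in insertion order and keeps the first maximum it finds
best_name = max(f1_by_ensemble, key=f1_by_ensemble.get)
print(best_name)  # ensemble01
```

Since the ensembles in this notebook are numbered in order of increasing size, a tie is resolved in favour of the earlier combination, which never has more models than the later one.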

In [None]:
# Defining a dictionary with the F1-score of each ensemble
f1_scores = {ensemble_name: ensembles[ensemble_name]["metrics"].get("f1") for ensemble_name in ensemble_evaluation}
# Selecting the best ensemble
best_ensemble_name = max(f1_scores, key=f1_scores.get)
best_ensemble = {"name": best_ensemble_name,
                 "models": ensembles[best_ensemble_name].get("models"),
                 "metrics": ensembles[best_ensemble_name].get("metrics")
                 }

best_ensemble

# Predictions on test set

Finally, the ensemble that obtained the highest F1-score is used to predict the label of each test instance.

These results are then used to draw some evaluation plots, including the confusion matrix (with "humour" as the positive class) and the ROC curve.

In [None]:
# Predicting the label of the test set's instances with each individual transformer
for model_name in models:
  model_predictions, model_raw_outputs = models.get(model_name).predict(test["text"].tolist())
  model_evaluation[model_name]["test_model_outputs"] = model_raw_outputs
  model_evaluation[model_name]["test_predictions"] = [0 if output < 0.5 else 1 for output in model_raw_outputs]

# Calculating the test predictions of the best ensemble
predict_ensemble(best_ensemble.get("name"), "test", test, weighted=WEIGHTED)

In [None]:
# Dump individual transformers' results
for model_name, curr_model in model_evaluation.items():

  # Converting ndarrays to lists of native Python numbers
  # (NumPy scalars such as float32 are not JSON-serializable)
  curr_model["val_model_outputs"] = np.asarray(curr_model.get("val_model_outputs")).tolist()
  curr_model["test_model_outputs"] = np.asarray(curr_model.get("test_model_outputs")).tolist()
  curr_model["val_predictions"] = np.asarray(curr_model.get("val_predictions")).tolist()
  curr_model["test_predictions"] = np.asarray(curr_model.get("test_predictions")).tolist()
  
  # Adapting validation wrong predictions (if any)
  if curr_model.get("val_wrong_predictions"):
    curr_model["val_wrong_predictions_list"] = curr_model.get("val_wrong_predictions")
    curr_model["val_wrong_predictions"] = {}
    for pred in curr_model.get("val_wrong_predictions_list"):
      curr_model["val_wrong_predictions"][pred.guid] = {
          "text_a": pred.text_a,
          "text_b": pred.text_b,
          "label": pred.label
      }
    del curr_model["val_wrong_predictions_list"]

  with open(os.path.join(PATH, f'{model_name}/model-evaluation.json'), 'w', encoding='utf-8') as f:
    json.dump(curr_model, f, ensure_ascii=False, indent=4)

In [None]:
# Complete fields of best ensemble dictionary
best_ensemble["val_predictions"] = ensemble_evaluation[best_ensemble.get("name")].get("val_predictions")
best_ensemble["test_predictions"] = ensemble_evaluation[best_ensemble.get("name")].get("test_predictions")

# Save best ensemble to JSON file
with open(os.path.join(PATH, 'best-ensemble.json'), 'w', encoding='utf-8') as f:
    json.dump(best_ensemble, f, ensure_ascii=False, indent=4)

In [None]:
# Creating a new column of predicted labels in the test dataframe
test["predicted_label"] = ensemble_evaluation[best_ensemble.get("name")].get("test_predictions")
test.head(10)

In [None]:
# Dump test predictions
test.to_csv(os.path.join(PATH, "test-predictions.csv"), index=False)  

## Classification report

In [None]:
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Classification report
labels = ["NO-Humor", "YES-Humor"]
cr_str = classification_report(y_true=test["label"].tolist(), y_pred=test["predicted_label"].tolist(), target_names=labels)
cr = classification_report(y_true=test["label"].tolist(), y_pred=test["predicted_label"].tolist(), target_names=labels, output_dict=True)

fig_cr = plt.figure(figsize = (10.8, 10.8))
sns.heatmap(pd.DataFrame(cr).iloc[:-1, :].T, annot = True, fmt = ".2f", cbar_kws = {"shrink" : 0.5}, annot_kws = {"size": 15})
plt.xlabel("Evaluated Metrics", fontsize = 15)
plt.ylabel("Classes & Metrics", fontsize = 15)
plt.title("Evaluation Metrics", fontsize = 20)

# Save CR
fig_cr.savefig(os.path.join(PATH, "CLASSIFICATION_REPORT.png"))
plt.show()

In [None]:
# Dump classification report
with open(os.path.join(PATH, "classification-report.txt"), "w") as f:
  f.write(cr_str)

## Confusion matrix

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

LABELS = ["NO-Humor", "YES-Humor"]
fig_cm = plt.figure(figsize = (10.8, 10.8))
cm = confusion_matrix(test.get("label"), test.get("predicted_label"), normalize = "true")
sns.heatmap(cm, vmin = 0, vmax = 1, square = True, annot = True, fmt = ".2f", cbar_kws = {"shrink" : 0.5}, xticklabels = LABELS, yticklabels = LABELS, annot_kws = {"size": 15})
plt.xlabel("Predicted Values", fontsize = 15)
plt.ylabel("True Values", fontsize = 15)
plt.title("Confusion Matrix", fontsize = 20)

# Save CM
fig_cm.savefig(os.path.join(PATH, "CONFUSION MATRIX.png"))
plt.show()

## ROC curve
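
The cell in this section passes the binary predicted labels to ``roc_curve``. With hard labels there is only one operating point, so the curve consists of two straight segments through (FPR, TPR), and the area under it equals (1 + TPR - FPR) / 2. The sketch below checks this in plain Python; the function name ``tpr_fpr`` and the labels are made up for illustration.

```python
def tpr_fpr(y_true, y_pred):
    """Return (TPR, FPR) for binary true labels and binary predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical labels: 4 positives and 4 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tpr, fpr = tpr_fpr(y_true, y_pred)
# Area under the polyline (0, 0) -> (fpr, tpr) -> (1, 1)
single_point_auc = (1 + tpr - fpr) / 2
print(tpr, fpr, single_point_auc)  # 0.75 0.25 0.75
```

A curve with more than one threshold would require continuous scores, for example the averaged raw outputs of the ensemble, which this notebook does not store.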

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

fig_roc = plt.figure(figsize = (10.8, 10.8))
fpr, tpr, _ = roc_curve(test["label"], test["predicted_label"])
plt.plot(fpr, tpr, color = "darkorange", label = "{} (AUC = {:0.2f})".format(best_ensemble.get("name"), auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], color = "darkblue", linestyle = "--", label = "Random Classifier (AUC = 0.5)") # AUC: Area Under Curve
plt.axis("square")
plt.xlabel("False Positive Rate (FPR)", fontsize = 15)
plt.ylabel("True Positive Rate (TPR)", fontsize = 15)
plt.title("ROC curve", fontsize = 20)
plt.tick_params(axis = "y",direction = "in")
plt.tick_params(axis = "x",direction = "in")
plt.legend()

# Save ROC curve
fig_roc.savefig(os.path.join(PATH, "ROC.png"))
plt.show()