<a href="https://colab.research.google.com/github/mpsdecamargo/ml-data-science-portfolio/blob/main/bert-deep-learning-project/Model_Evaluation_using_Transformers_library.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## INTRODUCTION

Notebook content: loading the trained models and evaluating them on validation loss, accuracy, precision, recall and F1 score, followed by a comparison between the models.

Note: This notebook was developed in Google Colab. The datasets are not publicly available due to copyright restrictions, and the models are not publicly available either, so the notebook cannot be reproduced as is. It is intended as a demonstration of problem-solving, data science and machine learning skills, and the code can be reused for similar tasks.

In [1]:
! pip install transformers sentencepiece datasets evaluate accelerate -U

In [2]:
from google.colab import drive
drive.mount('gdrive')

Mounted at gdrive


In [3]:
import pandas as pd
import numpy as np
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding, EarlyStoppingCallback
from sklearn.metrics import balanced_accuracy_score, precision_score, recall_score, f1_score, multilabel_confusion_matrix
import torch
import time
import evaluate
from datetime import datetime
import pytz
from transformers import AlbertForSequenceClassification, AlbertTokenizer
import sentencepiece

In [4]:
# Creation of the Dataset object

data_files = {
    "train": ["/content/gdrive/MyDrive/Datasets/dataset_verifato_covid_train.csv"],
    "test": ["/content/gdrive/MyDrive/Datasets/dataset_verifato_covid_test.csv"]
}

dataset = load_dataset('csv', data_files=data_files, delimiter=",")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 7894
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 1974
    })
})

In [5]:
# Tokenizes the text and converts it into fixed-length arrays of token IDs (padded or truncated to 512 tokens), which are then fed to the model.

def tokenize(dataset, tokenizer):
  return tokenizer(dataset["text"], truncation=True, max_length=512, padding='max_length', add_special_tokens=True, return_tensors='np')

In [6]:
# Definition of the metrics used to evaluate models on the binary classification task

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = evaluate.load('accuracy').compute(predictions=predictions, references=labels)
    precision = evaluate.load('precision').compute(predictions=predictions, references=labels)
    recall = evaluate.load('recall').compute(predictions=predictions, references=labels)
    f1 = evaluate.load('f1').compute(predictions=predictions, references=labels)

    metrics = {
        'accuracy': accuracy["accuracy"],
        'precision': precision["precision"],
        'recall': recall["recall"],
        'f1': f1["f1"],
    }

    return metrics
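
To make the four metrics concrete, here is a minimal pure-Python sketch of what `compute_metrics` reports for the positive class (label 1). The logits and labels are made-up toy values, and `binary_metrics` is a hypothetical helper, not part of the notebook or of the `evaluate` library.

```python
def binary_metrics(logits, labels):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    # argmax over the two class logits gives the predicted class per example
    # (ties resolve to class 0, as np.argmax would)
    predictions = [0 if row[0] >= row[1] else 1 for row in logits]

    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy values (assumed for illustration only)
toy_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3]]
toy_labels = [0, 1, 0, 1]
toy_result = binary_metrics(toy_logits, toy_labels)
print(toy_result)
```

Precision penalizes false positives and recall penalizes false negatives, which is why the two can rank the models differently in the results further down.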

In [7]:
# Definition of the metrics used to evaluate models on the multilabel classification task

def compute_metrics_multilabel(eval_pred):
    logits, labels = eval_pred
    # Note: np.round is applied to the raw logits here, not to sigmoid probabilities
    predictions = np.round(logits)

    # Flatten predictions and labels so that every (text, theme) pair is scored as a single item
    predictions_flat = predictions.flatten()
    labels_flat = labels.flatten()

    accuracy = balanced_accuracy_score(labels_flat, predictions_flat)
    precision = precision_score(labels_flat, predictions_flat, average='weighted')
    recall = recall_score(labels_flat, predictions_flat, average='weighted',zero_division=0)
    f1 = f1_score(labels_flat, predictions_flat, average='weighted')

    metrics = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }

    return metrics
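
A multilabel classification head outputs one raw logit per theme, and a common way to turn those into 0/1 decisions is to apply a sigmoid and threshold the probability at 0.5. The sketch below contrasts that with rounding the raw logits directly. The logits are toy values and `sigmoid_threshold` is a hypothetical helper, so treat this as an illustration rather than the method used to produce the results in this notebook.

```python
import math


def sigmoid_threshold(logits, threshold=0.5):
    """Map raw logits to 0/1 decisions via sigmoid probabilities."""
    return [
        [1 if 1.0 / (1.0 + math.exp(-z)) >= threshold else 0 for z in row]
        for row in logits
    ]


# Toy logits for 2 texts and 3 themes (assumed for illustration only)
toy_logits = [[2.3, -1.7, 0.2], [-0.4, 3.1, -2.6]]

thresholded = sigmoid_threshold(toy_logits)
rounded = [[round(z) for z in row] for row in toy_logits]

print(thresholded)  # every entry is 0 or 1
print(rounded)      # rounding raw logits can give values such as 2 or -3
```

With rounded logits, values outside {0, 1} are treated as extra classes by the weighted scikit-learn metrics, which may depress accuracy, recall and F1.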

In [19]:
# Paths on Google Drive to the saved models to be evaluated, keyed by model type

model_paths = {
               "BERTPT":"gdrive/My Drive/Modelos/BERTPT/BERTPT_Model_2024-01-02_23-23",
               "BERTPTL":"gdrive/My Drive/Modelos/BERTPTL/BERTPTL_Model_2024-01-03_16-05",
               "MBERT":"gdrive/My Drive/Modelos/MBERT/MBERT_Model_2024-01-03_01-17",
               "ELECTRA":"gdrive/My Drive/Modelos/ELECTRA/ELECTRA_Model_2024-01-02_22-21",
               "ROBERTA": "gdrive/MyDrive/Modelos/ROBERTA/ROBERTA_Model_2024-01-06_21-14",
               "XLMR":"gdrive/My Drive/Modelos/XLMR/XLMR_Model_2024-01-05_16-19",
               "DISTILBERT": "gdrive/My Drive/Modelos/DISTILBERT/DISTILBERT_Model_2024-01-02_21-13",
               "ALBERT":"gdrive/My Drive/Modelos/ALBERT/ALBERT_Model_2024-01-02_17-54",
               "DEBERTA":"gdrive/My Drive/Modelos/DEBERTA/DEBERTA_Model_2024-01-03_10-24",
               "BERTPT_ML": "gdrive/My Drive/Modelos/MultiLabel/BERTPT/BERTPT_MultiLabelModel_2024-01-03_20-24",
               "DEBERTA_ML": "gdrive/My Drive/Modelos/MultiLabel/DEBERTA/DEBERTA_MultiLabelModel_2024-01-03_20-43"
               }

In [9]:
# Function evaluate_loaded_model loads the tokenizer and the model, tokenizes the dataset,
# sets configuration of the parameters of the Trainer and evaluates the model.

def evaluate_loaded_model(model_type, dataset=dataset,  multilabel=False):
  loaded_tokenizer = AutoTokenizer.from_pretrained(model_paths[model_type])
  loaded_model = AutoModelForSequenceClassification.from_pretrained(model_paths[model_type])

  tokenized_dataset = dataset.map(tokenize, batched=True, fn_kwargs={"tokenizer": loaded_tokenizer}, remove_columns=["text"])

  training_args = TrainingArguments(
        output_dir=f"./test_trainer/{model_type}",
        do_train = False,
        do_eval = True,
        per_device_eval_batch_size=8,
        )

  if multilabel:
    loaded_model.config.problem_type = "multi_label_classification"

  trainer = Trainer(
    model=loaded_model,
    args=training_args,
    eval_dataset = tokenized_dataset["test"],
    compute_metrics = compute_metrics_multilabel if multilabel else compute_metrics,
  )

  results = trainer.evaluate()

  print(results)

# MODEL EVALUATION

In [None]:
evaluate_loaded_model("BERTPT")

{'eval_loss': 0.11576443165540695, 'eval_accuracy': 0.9604863221884499, 'eval_precision': 0.9397705544933078, 'eval_recall': 0.9849699398797596, 'eval_f1': 0.9618395303326811, 'eval_runtime': 65.7992, 'eval_samples_per_second': 30.0, 'eval_steps_per_second': 3.754}


In [None]:
evaluate_loaded_model("ALBERT")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


{'eval_loss': 0.20355546474456787, 'eval_accuracy': 0.952887537993921, 'eval_precision': 0.9380445304937076, 'eval_recall': 0.9709418837675351, 'eval_f1': 0.9542097488921714, 'eval_runtime': 39.3332, 'eval_samples_per_second': 50.187, 'eval_steps_per_second': 6.28}


In [None]:
evaluate_loaded_model("ELECTRA")

{'eval_loss': 0.20108181238174438, 'eval_accuracy': 0.955420466058764, 'eval_precision': 0.953187250996016, 'eval_recall': 0.9589178356713427, 'eval_f1': 0.9560439560439561, 'eval_runtime': 56.6003, 'eval_samples_per_second': 34.876, 'eval_steps_per_second': 4.364}


In [None]:
evaluate_loaded_model("MBERT")

{'eval_loss': 0.19372034072875977, 'eval_accuracy': 0.9549138804457954, 'eval_precision': 0.9416909620991254, 'eval_recall': 0.9709418837675351, 'eval_f1': 0.9560927479033053, 'eval_runtime': 62.1364, 'eval_samples_per_second': 31.769, 'eval_steps_per_second': 3.975}


In [None]:
evaluate_loaded_model("DISTILBERT")

{'eval_loss': 0.1748952865600586, 'eval_accuracy': 0.9604863221884499, 'eval_precision': 0.9554455445544554, 'eval_recall': 0.966933867735471, 'eval_f1': 0.9611553784860557, 'eval_runtime': 34.1714, 'eval_samples_per_second': 57.768, 'eval_steps_per_second': 7.228}


In [None]:
evaluate_loaded_model("DEBERTA")



{'eval_loss': 0.1452174186706543, 'eval_accuracy': 0.9579533941236069, 'eval_precision': 0.9673135852911133, 'eval_recall': 0.9488977955911824, 'eval_f1': 0.9580171977744056, 'eval_runtime': 94.5271, 'eval_samples_per_second': 20.883, 'eval_steps_per_second': 2.613}


In [22]:
evaluate_loaded_model("ROBERTA")

{'eval_loss': 0.2244383841753006, 'eval_accuracy': 0.9478216818642351, 'eval_precision': 0.9552390640895219, 'eval_recall': 0.9408817635270541, 'eval_f1': 0.9480060575466935, 'eval_runtime': 43.6982, 'eval_samples_per_second': 45.174, 'eval_steps_per_second': 5.652}


In [None]:
evaluate_loaded_model("XLMR")

{'eval_loss': 0.19355185329914093, 'eval_accuracy': 0.9569402228976697, 'eval_precision': 0.946236559139785, 'eval_recall': 0.969939879759519, 'eval_f1': 0.9579416130628401, 'eval_runtime': 56.9464, 'eval_samples_per_second': 34.664, 'eval_steps_per_second': 4.337}


In [None]:
evaluate_loaded_model("BERTPTL")

{'eval_loss': 0.19591018557548523, 'eval_accuracy': 0.9690982776089159, 'eval_precision': 0.9652432969215492, 'eval_recall': 0.9739478957915831, 'eval_f1': 0.9695760598503741, 'eval_runtime': 199.04, 'eval_samples_per_second': 9.918, 'eval_steps_per_second': 1.241}


## Results for the Best Models on the Binary Text Classification Task, Ordered by Accuracy

| Model                                       | Model Type | Validation Loss | Accuracy   | Precision  | Recall     | F1         |
|---------------------------------------------|------------|-----------------|------------|------------|------------|------------|
| neuralmind/bert-large-portuguese-cased      | BERTPTL    | 0.1959          | **96.91%** | 96.52%     | 97.39%     | **96.96%** |
| neuralmind/bert-base-portuguese-cased       | BERTPT     | **0.1158**      | 96.05%     | 93.98%     | **98.50%** | 96.18%     |
| distilbert-base-multilingual-cased          | DISTILBERT | 0.1749          | 96.05%     | 95.54%     | 96.69%     | 96.12%     |
| microsoft/mdeberta-v3-base                  | DEBERTA    | 0.1452          | 95.80%     | **96.73%** | 94.89%     | 95.80%     |
| xlm-roberta-base                            | XLMR       | 0.1936          | 95.69%     | 94.62%     | 96.99%     | 95.79%     |
| dlb/electra-base-portuguese-uncased-brwac   | ELECTRA    | 0.2011          | 95.54%     | 95.32%     | 95.89%     | 95.60%     |
| bert-base-multilingual-cased                | MBERT      | 0.1937          | 95.49%     | 94.17%     | 97.09%     | 95.61%     |
| josu/albert-pt-br                           | ALBERT     | 0.2036          | 95.29%     | 93.80%     | 97.09%     | 95.42%     |
| rdenadai/BR_BERTo                           | ROBERTA    | 0.2244          | 94.78%     | 95.52%     | 94.09%     | 94.80%     |

Note: For each model, the checkpoint with the lowest validation loss over 5 training epochs was selected. BERTPTL was trained with a train batch size of 4 (the default is 8) due to memory restrictions, and with a learning rate of 2e-5 instead of the default 5e-5 for better performance.
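
The ranking in the table can be rebuilt from the `eval_*` dictionaries printed by `evaluate_loaded_model`. The sketch below assumes those dictionaries were collected by hand into a `results` mapping (the function itself only prints them). Only three of the nine models are included, with values copied from the outputs above.

```python
# Values copied from the printed evaluation outputs (subset of the models).
# The `results` mapping itself is an assumption: the notebook does not build it.
results = {
    "BERTPT": {"eval_loss": 0.11576443165540695, "eval_accuracy": 0.9604863221884499},
    "BERTPTL": {"eval_loss": 0.19591018557548523, "eval_accuracy": 0.9690982776089159},
    "ROBERTA": {"eval_loss": 0.2244383841753006, "eval_accuracy": 0.9478216818642351},
}

# Sort model types by accuracy, highest first
ranking = sorted(results, key=lambda name: results[name]["eval_accuracy"], reverse=True)

# Format loss with 4 decimals and accuracy as a percentage with 2 decimals
rows = [
    f"| {name} | {results[name]['eval_loss']:.4f} | {results[name]['eval_accuracy']:.2%} |"
    for name in ranking
]
print("| Model Type | Validation Loss | Accuracy |")
print("|------------|-----------------|----------|")
print("\n".join(rows))
```

Formatting with `:.4f` and `:.2%` rounds rather than truncates, which keeps every cell consistent with the raw output.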

In [None]:
# Creation of the Dataset object for multilabel classification task

data_files = {
    "train": ["/content/gdrive/MyDrive/Datasets/dataset_verifato_covid_multilabel_train.csv"],
    "test": ["/content/gdrive/MyDrive/Datasets/dataset_verifato_covid_multilabel_test.csv"]
}

dataset = load_dataset('csv', data_files=data_files, delimiter=",")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'T01', 'T02', 'T03', 'T04', 'T05', 'T06', 'T07', 'T09'],
        num_rows: 552
    })
    test: Dataset({
        features: ['text', 'T01', 'T02', 'T03', 'T04', 'T05', 'T06', 'T07', 'T09'],
        num_rows: 139
    })
})

In [None]:
# Joining theme columns into "labels"

cols = dataset["train"].column_names
dataset_ml = dataset.map(lambda x : {"labels": [x[c] for c in cols if c in ['T01', 'T02', 'T03', 'T04', 'T05', 'T06', 'T07','T09']]})
dataset_ml

DatasetDict({
    train: Dataset({
        features: ['text', 'T01', 'T02', 'T03', 'T04', 'T05', 'T06', 'T07', 'T09', 'labels'],
        num_rows: 552
    })
    test: Dataset({
        features: ['text', 'T01', 'T02', 'T03', 'T04', 'T05', 'T06', 'T07', 'T09', 'labels'],
        num_rows: 139
    })
})
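
For a single row, the mapping above amounts to collecting the eight theme flags into one list, in column order. A minimal sketch with a made-up row (the text and the flag values are invented for illustration):

```python
theme_columns = ["T01", "T02", "T03", "T04", "T05", "T06", "T07", "T09"]

# Hypothetical row in the shape of the multilabel CSV
row = {"text": "example text", "T01": 1, "T02": 0, "T03": 0, "T04": 1,
       "T05": 0, "T06": 0, "T07": 0, "T09": 1}

# Same comprehension as in the dataset.map call, applied to one row
cols = list(row.keys())
labels = [row[c] for c in cols if c in theme_columns]
print(labels)
```

The original theme columns stay in the dataset next to the new `labels` column, as the `DatasetDict` summary shows.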

In [None]:
evaluate_loaded_model("BERTPT_ML", dataset=dataset_ml, multilabel=True)

{'eval_loss': 0.34109827876091003, 'eval_accuracy': 0.09836342519269349, 'eval_precision': 0.6732628425104952, 'eval_recall': 0.06474820143884892, 'eval_f1': 0.11419095451765258, 'eval_runtime': 4.0574, 'eval_samples_per_second': 34.258, 'eval_steps_per_second': 4.436}




In [None]:
evaluate_loaded_model("DEBERTA_ML",dataset=dataset_ml, multilabel=True)

{'eval_loss': 0.3753722310066223, 'eval_accuracy': 0.18231443353394575, 'eval_precision': 0.6592044011849344, 'eval_recall': 0.10611510791366907, 'eval_f1': 0.1709781111281188, 'eval_runtime': 6.1179, 'eval_samples_per_second': 22.72, 'eval_steps_per_second': 2.942}




## Results for the Best Models on the Multilabel Text Classification Task, Ordered by Precision

| Model                                       | Model Type | Validation Loss | Balanced Accuracy | Precision  | Recall     | F1         |
|---------------------------------------------|------------|-----------------|-------------------|------------|------------|------------|
| neuralmind/bert-base-portuguese-cased       | BERTPT_ML  | **0.3411**      | 9.84%             | **67.33%** | 6.47%      | 11.42%     |
| microsoft/mdeberta-v3-base                  | DEBERTA_ML | 0.3754          | **18.23%**        | 65.92%     | **10.61%** | **17.10%** |

Note: For each model, the checkpoint with the best precision over 5 training epochs was selected. After analysing the problem, precision was chosen as the main evaluation metric because the goal is to identify a theme correctly while avoiding false positives. In addition, because the dataset is imbalanced, precision is more informative than accuracy, and it can be assessed for each theme separately.
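
Since precision is the selection metric and can be read per theme, it may help to see how a per-theme precision is computed from 0/1 prediction and label matrices. This is a minimal sketch with made-up data. `per_theme_precision` is a hypothetical helper, and the theme names are reused from the dataset columns only as labels.

```python
def per_theme_precision(predictions, labels, themes):
    """Precision per theme: true positives / predicted positives in each column."""
    scores = {}
    for j, theme in enumerate(themes):
        predicted_pos = sum(row[j] for row in predictions)
        true_pos = sum(p[j] * y[j] for p, y in zip(predictions, labels))
        # No predicted positives for this theme: report 0.0
        scores[theme] = true_pos / predicted_pos if predicted_pos else 0.0
    return scores


# Toy 0/1 matrices: 4 texts x 3 themes (assumed for illustration only)
toy_predictions = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 0]]
toy_labels = [[1, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0]]
theme_scores = per_theme_precision(toy_predictions, toy_labels, ["T01", "T02", "T03"])
print(theme_scores)
```

A theme that is never predicted (here `T03`) gets a precision of 0.0 under this convention, so per-theme values are best read together with recall.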

# CONCLUSION

For the binary classification task, the best model by accuracy and F1 is BERT Large Portuguese (BERTPTL). However, it is also the most time- and resource-consuming model. BERT Base Portuguese was therefore chosen to carry on to the multilabel task, along with the DEBERTA model, which had the best precision.

For the multilabel classification task, the best overall model was DEBERTA: its precision is only slightly lower than that of BERT Base Portuguese (65.92% vs. 67.33%), while its accuracy, recall and F1 are considerably better.