<a href="https://colab.research.google.com/github/mpsdecamargo/ml-data-science-portfolio/blob/main/bert-deep-learning-project/Covid_related_Text_Theme_Multilabel_Classification_using_Transformers_library.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## INTRODUCTION

Application feature covered in this notebook: classifying Covid-related text into 8 of the 9 most recurring themes of Covid-related disinformation identified by UNESCO (2020).

Notebook content: importing the dataset, processing the data, configuring and training the models for multilabel classification, and saving the models to a personal Google Drive.

Note: The notebook was developed in Google Colab. The datasets are not publicly available due to copyright restrictions. This notebook serves as a demonstration of problem-solving, Data Science and Machine Learning skills; because the dataset and the models are not publicly available, it cannot be reproduced as is. However, the code can be adapted for similar tasks.

## ABOUT THE DATASET

The dataset was called dataset_verifato_temas (for this update, it is named dataset_verifato_multilabel) and has 691 samples labeled across 8 themes. All the content is Covid-related and was extracted from dataset_verifato_convid, which includes samples from COVID19.BR (A. D. F. Martins et al., "COVID19.br: A dataset of misinformation about COVID-19 in Brazilian Portuguese WhatsApp messages," in Proceedings of the III Dataset Showcase Workshop, SBC, 2021, pp. 138-147. Available: [Link](https://sol.sbc.org.br/index.php/dsw/article/view/17422/17258)).

Here it's possible to see the themes the samples correspond to (in Portuguese):

![Dataset Verifato Sources](https://raw.githubusercontent.com/mpsdecamargo/ml-data-science-portfolio/main/bert-deep-learning-project/images/dataset_verifato_themes.png)

The 9 themes of Covid-related disinformation originated from a UNESCO 2020 paper: J. Posetti and K. Bontcheva, "Disinfodemic: Deciphering COVID-19 Disinformation," Policy brief, 1. UNESCO, 2020. [Online]. Available: https://unesdoc.unesco.org/ark:/48223/pf0000374416

The 9 themes are:

1. Origins and spread of the coronavirus/COVID-19 disease

While scientists first identified cases of novel coronavirus (the virus that causes the disease COVID-19) connected to an animal market in the Chinese city of Wuhan, there are many conspiracy theories that blame other actors and causes. These extend from blaming the 5G network through to chemical weapons manufacturers. Using a label like “Chinese virus” instead of neutral terminology inflates location into an adjective, in an historical echo of early pandemics that gave a biased meaning to a noun.

2. False and misleading statistics

Often connected to the reported incidence of the disease and mortality rates.

3. Economic impacts.

This theme includes spreading false information about the economic and health impacts of the pandemic, suggestions that social isolation is not economically justified, and even claims that COVID-19 is overall creating jobs.

4. Discrediting of journalists and credible news outlets.

This is a theme often associated with political disinformation, with unsupported accusations that certain news outlets are themselves peddling in disinformation. This behaviour includes abuse levelled at journalists publicly, but it is also used by less visible disinformation campaigns to undermine trust in verified news produced in the public interest. Attacks on journalists in the time of COVID-19 have been associated with crackdowns on critical coverage of political actors and states.

5. Medical science: symptoms, diagnosis and treatment.

This theme includes dangerous disinformation about immunity, prevention, treatments and cures. For example, myriad ‘sticky’ memes claim that drinking or gargling cow urine, hot water, or salt water could prevent the infection reaching lungs. They cannot.

6. Impacts on society and the environment

This theme in the disinfodemic ranges from panic buying triggers and false information about lockdowns, through to the supposed re-emergence of dolphins in Venetian canals.

7. Politicisation

One-sided and positively-framed information is presented in an effort to negate the significance of facts that are inconvenient for certain actors in power. Other disinformation designed to mislead for political advantage includes: equating COVID-19 with flu; making baseless claims about the likely length of the pandemic; and assertions about the (un)availability of medical testing and equipment.

8. Content driven by fraudulent financial gain

This includes scams designed to steal people’s private data.

9. Celebrity-focused disinformation

This theme includes false stories about actors being diagnosed with COVID-19.

This research has identified nine key themes present in content associated with the disinfodemic. These themes frequently feature racism and xenophobia.

Note: Only 8 themes were used in the dataset, because theme 8 (Content driven by fraudulent financial gain) is considerably hard to assess. The labeling process was defined in the original paper:

However, due to a lack of available time to dedicate to labeling all the samples, a method was initially defined to find at least 100 samples for each theme. This method involved the manual evaluation of news articles by the authors, which was validated by an individual responsible for maintaining consistency in the assessment of the theme for each news article. (Camargo et al., 2022)

In [None]:
!pip install transformers --upgrade sentencepiece datasets evaluate scikit-multilearn



In [None]:
!pip install accelerate -U



In [None]:
from google.colab import drive
drive.mount('gdrive')

Drive already mounted at gdrive; to attempt to forcibly remount, call drive.mount("gdrive", force_remount=True).


In [None]:
import pandas as pd
import numpy as np
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding, EarlyStoppingCallback
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, precision_score, recall_score, f1_score, multilabel_confusion_matrix
import torch
import time
import evaluate
from datetime import datetime
import pytz
from transformers import AlbertForSequenceClassification, AlbertTokenizer
import sentencepiece
import torch.nn as nn

In [None]:
df = pd.read_csv('/content/gdrive/MyDrive/Datasets/dataset_verifato_multilabel.csv', sep=";")
df.head()

Unnamed: 0,text,T01,T02,T03,T04,T05,T06,T07,T09
0,"Globo/DF afirma, em reportagem de 2022, que má...",1,0,0,0,0,0,0,0
1,Uma criança morreu ao ser vacinada contra a Co...,0,1,0,0,1,0,0,0
2,A Espanha anunciou o fim da pandemia da Covid-...,0,0,1,0,0,1,0,0
3,Vacinas contra a Covid-19 causam trombose e em...,0,0,0,0,1,0,0,0
4,O desfile das escolas de samba no Carnaval 202...,0,0,1,0,0,1,0,0


In [None]:
df.shape

(691, 9)

In [None]:
# Counting the positive (1) and negative (0) samples for each theme label. Note that the dataset is very imbalanced.

df2 = df.drop(["text"], axis=1)
for column in df2.columns:
  print(df2[column].value_counts())


0    630
1     61
Name: T01, dtype: int64
0    622
1     69
Name: T02, dtype: int64
0    590
1    101
Name: T03, dtype: int64
0    660
1     31
Name: T04, dtype: int64
0    517
1    174
Name: T05, dtype: int64
0    524
1    167
Name: T06, dtype: int64
0    371
1    320
Name: T07, dtype: int64
0    589
1    102
Name: T09, dtype: int64
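The counts above show how unevenly the themes are represented. One common way to account for this, which this notebook does not use, is to weight the positive class of each label by its negative-to-positive ratio. The sketch below computes those ratios from the counts printed above. The name `pos_weight` mirrors the argument of `torch.nn.BCEWithLogitsLoss`; passing the values to the loss would be an extension of the training code and is only an assumption here.

```python
# Positive counts per theme, copied from the value_counts() output above.
positives = {
    "T01": 61, "T02": 69, "T03": 101, "T04": 31,
    "T05": 174, "T06": 167, "T07": 320, "T09": 102,
}
n_samples = 691  # total number of rows in the dataset


def positive_weights(positives, n_samples):
    """Return the negatives / positives ratio for each label."""
    return {label: (n_samples - pos) / pos for label, pos in positives.items()}


pos_weight = positive_weights(positives, n_samples)
for label, weight in pos_weight.items():
    rate = positives[label] / n_samples
    print(f"{label}: positive rate {rate:.3f}, pos_weight {weight:.2f}")
```

The rarest theme (T04, 31 positives) receives the largest weight, while T07, present in almost half of the samples, receives a weight close to 1.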


In [None]:
# Defining data to be used in Machine Learning

X = df["text"]
y = df.drop(["text"], axis=1).astype(float)
y.head()

Unnamed: 0,T01,T02,T03,T04,T05,T06,T07,T09
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


In [None]:
# Dividing dataset into training and test

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.2, random_state=42)
y_train

Unnamed: 0,T01,T02,T03,T04,T05,T06,T07,T09
110,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
82,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
51,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
218,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
568,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...
71,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
106,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
270,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
435,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
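Because the split above is random rather than stratified, a rare theme such as T04 may end up with very few positives in one of the splits. The following sketch shows one way to check this. The helper names and the toy rows are hypothetical and not taken from the dataset.

```python
def positives_per_label(rows, label_names):
    """Count how many rows are positive (1.0) for each label."""
    return {name: sum(1 for row in rows if row[name] == 1.0) for name in label_names}


def missing_labels(train_rows, test_rows, label_names):
    """Return the labels with no positive example in at least one split."""
    train_counts = positives_per_label(train_rows, label_names)
    test_counts = positives_per_label(test_rows, label_names)
    return [n for n in label_names if train_counts[n] == 0 or test_counts[n] == 0]


# Toy rows for illustration only.
label_names = ["T01", "T04"]
toy_train = [{"T01": 1.0, "T04": 0.0}, {"T01": 0.0, "T04": 1.0}]
toy_test = [{"T01": 1.0, "T04": 0.0}]

print(positives_per_label(toy_train, label_names))
print(missing_labels(toy_train, toy_test, label_names))
```

With the DataFrames in this notebook, the rows could be obtained with `y_train.to_dict("records")` and `y_test.to_dict("records")`.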


In [None]:
# Concatenating X and y for export to file

train_df = pd.DataFrame(pd.concat([X_train,y_train], axis=1,ignore_index=True))
train_df.columns = ["text", "T01","T02", "T03", "T04", "T05", "T06", "T07", "T09"]
test_df = pd.DataFrame(pd.concat([X_test,y_test], axis=1,ignore_index=True))
test_df.columns = ["text", "T01","T02", "T03", "T04", "T05", "T06", "T07", "T09"]

In [None]:
# Exporting to files for creation of Dataset object necessary for Machine Learning task

train_df.to_csv("/content/gdrive/MyDrive/Datasets/dataset_verifato_covid_multilabel_train.csv", sep=",", columns = ["text", "T01","T02", "T03", "T04", "T05", "T06", "T07", "T09"], index=False)
test_df.to_csv("/content/gdrive/MyDrive/Datasets/dataset_verifato_covid_multilabel_test.csv", sep=",", columns = ["text", "T01","T02", "T03", "T04", "T05", "T06", "T07", "T09"], index=False)

In [None]:
# Creation of the Dataset object

data_files = {
    "train": ["/content/gdrive/MyDrive/Datasets/dataset_verifato_covid_multilabel_train.csv"],
    "test": ["/content/gdrive/MyDrive/Datasets/dataset_verifato_covid_multilabel_test.csv"]
}

dataset = load_dataset('csv', data_files=data_files, delimiter=",")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'T01', 'T02', 'T03', 'T04', 'T05', 'T06', 'T07', 'T09'],
        num_rows: 552
    })
    test: Dataset({
        features: ['text', 'T01', 'T02', 'T03', 'T04', 'T05', 'T06', 'T07', 'T09'],
        num_rows: 139
    })
})

In [None]:
# Joining all theme label columns into a single "labels" feature (one multi-hot list per sample)

cols = dataset["train"].column_names
dataset = dataset.map(lambda x : {"labels": [x[c] for c in cols if c in ['T01', 'T02', 'T03', 'T04', 'T05', 'T06', 'T07','T09']]})
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'T01', 'T02', 'T03', 'T04', 'T05', 'T06', 'T07', 'T09', 'labels'],
        num_rows: 552
    })
    test: Dataset({
        features: ['text', 'T01', 'T02', 'T03', 'T04', 'T05', 'T06', 'T07', 'T09', 'labels'],
        num_rows: 139
    })
})

In [None]:
type(dataset["train"]["labels"])

list
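Each entry of `labels` is a multi-hot list with one position per theme column. As a minimal sketch of that encoding, assuming the same column order as above, the helpers below (hypothetical names `encode` and `decode`) build the list for one row and map it back to theme names. The example row mirrors the label values of the second sample shown in `df.head()`.

```python
THEMES = ["T01", "T02", "T03", "T04", "T05", "T06", "T07", "T09"]


def encode(row):
    """Collect the theme columns of a row into one multi-hot list of floats."""
    return [float(row[theme]) for theme in THEMES]


def decode(labels):
    """Return the names of the themes marked as present in a multi-hot list."""
    return [theme for theme, value in zip(THEMES, labels) if value == 1.0]


# Label values of the second sample in df.head(); the text is elided here.
row = {"text": "...", "T01": 0, "T02": 1, "T03": 0, "T04": 0,
       "T05": 1, "T06": 0, "T07": 0, "T09": 0}

labels = encode(row)
print(labels)
print(decode(labels))
```

The values are kept as floats because the multilabel loss used by the Transformers sequence classification models (`BCEWithLogitsLoss`) expects float targets.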

In [None]:
# Defining the repositories of the pretrained language models to be used

lang_models = {"BERTPT":"neuralmind/bert-base-portuguese-cased","BERTPTL":"neuralmind/bert-large-portuguese-cased","MBERT":"bert-base-multilingual-cased", "ELECTRA":"dlb/electra-base-portuguese-uncased-brwac","ROBERTA": "rdenadai/BR_BERTo","XLMR":"xlm-roberta-base","DISTILBERT": "distilbert-base-multilingual-cased","ALBERT":"josu/albert-pt-br","DEBERTA":"microsoft/mdeberta-v3-base"}


In [None]:
# Loading the tokenizer and creating the tokenize function.
# The tokenizer splits the text into tokens and maps each token to an integer id;
# these ids are the inputs used in training.


def startTokenizer(model_type):
  load = time.time()
  tokenizer = AutoTokenizer.from_pretrained(lang_models[model_type])
  end = time.time()
  tokenizationLoad = end - load
  print("Loading Tokenization: ", tokenizationLoad)
  return tokenizer

def tokenize(dataset, tokenizer):
  return tokenizer(dataset["text"], truncation=True, max_length=512, padding='max_length', add_special_tokens=True, return_tensors='np')
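To illustrate what `truncation=True`, `max_length` and `padding='max_length'` do to a sequence, here is a toy sketch. It is not the Hugging Face tokenizer: it splits on whitespace, and the vocabulary and special token ids are hypothetical. It assumes `max_length` is at least 2.

```python
PAD_ID, CLS_ID, SEP_ID, UNK_ID = 0, 1, 2, 3  # hypothetical special token ids


def toy_encode(text, vocab, max_length):
    """Map words to ids, then truncate or pad the sequence to max_length."""
    ids = [CLS_ID] + [vocab.get(word, UNK_ID) for word in text.lower().split()] + [SEP_ID]
    # Truncation: keep at most max_length ids and make sure the last one is SEP.
    if len(ids) > max_length:
        ids = ids[: max_length - 1] + [SEP_ID]
    attention_mask = [1] * len(ids)
    # Padding: fill up to max_length; padded positions get attention mask 0.
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


vocab = {"vacinas": 4, "contra": 5, "a": 6, "covid-19": 7}  # toy vocabulary

short = toy_encode("Vacinas contra a Covid-19", vocab, max_length=8)
long = toy_encode("Vacinas contra a Covid-19 causam trombose", vocab, max_length=5)
print(short)
print(long)
```

The short text is padded to 8 positions, and the long text is cut down to 5. In both cases every sequence in a batch ends up with the same length.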

In [None]:
# Definition of the metrics used to evaluate models

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # The model outputs raw logits: apply a sigmoid and threshold at 0.5 to obtain 0/1 predictions
    probabilities = 1 / (1 + np.exp(-logits))
    predictions = (probabilities >= 0.5).astype(int)

    # Flatten so that every (sample, theme) pair is scored as one binary decision
    predictions_flat = predictions.flatten()
    labels_flat = labels.flatten().astype(int)

    accuracy = balanced_accuracy_score(labels_flat, predictions_flat)
    precision = precision_score(labels_flat, predictions_flat, average='weighted', zero_division=0)
    recall = recall_score(labels_flat, predictions_flat, average='weighted', zero_division=0)
    f1 = f1_score(labels_flat, predictions_flat, average='weighted', zero_division=0)

    metrics = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }

    return metrics
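A multilabel classification head produces one logit per theme. A usual way to turn logits into 0/1 decisions is a sigmoid followed by a 0.5 threshold. The sketch below is a dependency-free illustration with made-up logits and labels. It also computes precision for each theme separately, which the flattened metrics above do not report. The function names are hypothetical.

```python
import math


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Turn a matrix of logits into 0/1 predictions, one decision per label."""
    return [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in logits]


def per_label_precision(y_true, y_pred):
    """Precision TP / (TP + FP) for each label column; 0.0 if nothing was predicted."""
    scores = []
    for j in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if p[j] == 1 and t[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p[j] == 1 and t[j] == 0)
        scores.append(tp / (tp + fp) if tp + fp else 0.0)
    return scores


# Made-up logits and labels for three samples and three labels.
logits = [[2.0, -1.5, 0.3], [-0.7, 3.1, 1.2], [1.1, -2.0, -0.4]]
y_true = [[1, 0, 0], [0, 1, 1], [0, 0, 0]]

y_pred = predict_labels(logits)
print(y_pred)
print(per_label_precision(y_true, y_pred))
```

A logit of 0 corresponds to a probability of 0.5, so thresholding the sigmoid at 0.5 is the same as checking whether the logit is non-negative.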

In [None]:
# Definition of the main functions of the notebook:
# Function build_model loads the tokenizer and the model, tokenizes the dataset,
# sets configuration of the parameters of the Trainer and trains the model.
# Function save_model saves the tokenizer and model in my personal Google Drive
# with the model type and date and time of the training start.

# Note: After analysing the problem, precision was defined as the best metric for model evaluation, because
# the goal is to correctly identify a theme, avoiding false positives. Also, as the dataset is imbalanced,
# precision is more valuable than accuracy, besides being assessed for each theme separately.

def save_model(tokenizer, model, model_type):
  timezone = pytz.timezone('America/Argentina/Buenos_Aires')
  now = str(datetime.now(timezone).strftime("%Y-%m-%d_%H-%M"))
  model_name = f"{model_type}_MultiLabelModel_{now}"
  path_name = f"gdrive/My Drive/Modelos/MultiLabel/{model_type}/{model_name}"
  model.save_pretrained(path_name)
  tokenizer.save_pretrained(path_name)
  print(f"Model Saved at: {path_name}")

def build_model(model_type, dataset=dataset):
  tokenizer = startTokenizer(model_type)
  tokenized_dataset = dataset.map(tokenize, batched=True,fn_kwargs={"tokenizer": tokenizer}, remove_columns=['text', 'T01', 'T02', 'T03', 'T04', 'T05', 'T06', 'T07', 'T09'])
  data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
  model = AutoModelForSequenceClassification.from_pretrained(lang_models[model_type], num_labels=8, return_dict=True)
  model.config.problem_type = "multi_label_classification"
  early_stopping = EarlyStoppingCallback(early_stopping_patience=6, early_stopping_threshold=0.001)
  training_args = TrainingArguments(
        output_dir=f"./test_trainer/MultiLabel/{model_type}/2",
        evaluation_strategy="epoch",
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=20,
        save_strategy="epoch",
        save_total_limit=5,
        learning_rate=2e-5,
        load_best_model_at_end=True,
        metric_for_best_model="precision"
        )
  trainer = Trainer(
          model=model,
          args=training_args,
          train_dataset=tokenized_dataset["train"],
          eval_dataset=tokenized_dataset["test"],
          compute_metrics=compute_metrics,
          data_collator=data_collator,
          callbacks=[early_stopping],
        )
  trainer.train()
  print("Training Process Finished")
  save_model(tokenizer, model, model_type)
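The `EarlyStoppingCallback` above is configured with a patience of 6 evaluations and a threshold of 0.001 on the precision metric. The following is a simplified sketch of that kind of rule, not the library's implementation. The function name `stopping_epoch` and the precision values are hypothetical.

```python
def stopping_epoch(metric_history, patience, threshold):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if the metric exceeds the best
    value seen so far by more than `threshold`. Training stops after
    `patience` consecutive evaluations without such an improvement.
    """
    best = None
    stale = 0
    for epoch, value in enumerate(metric_history, start=1):
        if best is None or value - best > threshold:
            best = value
            stale = 0
        else:
            stale += 1
        if stale >= patience:
            return epoch
    return None


# Hypothetical precision values, one per epoch.
history = [0.60, 0.70, 0.75, 0.7505, 0.74, 0.75, 0.73, 0.72, 0.75, 0.80]

print(stopping_epoch(history, patience=6, threshold=0.001))
print(stopping_epoch(history, patience=3, threshold=0.001))
```

In this toy history the gain at epoch 4 is below the threshold, so it does not reset the counter. A larger patience lets training run longer before stopping, which costs more epochs when the metric has truly plateaued.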

## MODEL TRAINING

Note: For Model Evaluation, see the respective notebook.

In [None]:
build_model("BERTPT")

In [None]:
build_model("DEBERTA")