<a href="https://colab.research.google.com/github/mpsdecamargo/ml-data-science-portfolio/blob/main/bert-deep-learning-project/Covid_related_Text_Binary_Classification_using_Transformers_library.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## INTRODUCTION

Application feature covered in this notebook: binary classification of Covid-related text as false or true.

Notebook content: importing the dataset, preprocessing the data, configuring and training the binary classification models, and saving the trained models to a personal Google Drive.

Note: This notebook was developed in Google Colab. The dataset is not publicly available due to copyright restrictions, and neither are the trained models, so the results cannot be reproduced as-is. The notebook is intended as a demonstration of problem-solving, data science and machine learning skills, and the code can be adapted to similar tasks.



## ABOUT THE DATASET

The dataset, named dataset_verifato_covid, has 9868 samples: half are false content (label = 1), comprising fake and misleading news and rumours, and half are news content (label = 0). All of the content is Covid-related and was extracted from various sources, including samples from COVID19.BR (A. D. F. Martins et al., "COVID19.br: A dataset of misinformation about COVID-19 in Brazilian Portuguese WhatsApp messages," in Proceedings of the III Dataset Showcase Workshop, SBC, 2021, pp. 138-147. Available: [Link](https://sol.sbc.org.br/index.php/dsw/article/view/17422/17258)).

The image below shows which sources the samples come from (in Portuguese):

![Dataset Verifato Sources](https://raw.githubusercontent.com/mpsdecamargo/ml-data-science-portfolio/main/bert-deep-learning-project/images/dataset_verifato_sources.png)


In [1]:
!pip install transformers datasets evaluate torch sentencepiece

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datas

In [2]:
pip install accelerate -U

Collecting accelerate
  Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.25.0


In [3]:
import pandas as pd
import numpy as np
from datasets import Dataset, DatasetDict, load_dataset, load_from_disk
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, get_scheduler, DataCollatorWithPadding, EarlyStoppingCallback
from sklearn.model_selection import train_test_split
import torch
import time
import evaluate
from datetime import datetime
import pytz

In [4]:
from google.colab import drive
drive.mount('gdrive')

Mounted at gdrive


In [5]:
df = pd.read_csv('/content/gdrive/MyDrive/Datasets/dataset_verifato_covid.csv', sep=";")
df.head()

Unnamed: 0,text,labels
0,Chegada dos médicos e enfermeiros pra ajudar n...,0
1,Hoje vi q um homen q morreu no pernambuco no a...,0
2,Informação que o Pai do Roberto Claudio falece...,0
3,Antes do carnaval o Corona vírus era brincadei...,0
4,Pra fica melhor maioria das pessoas tá pegado ...,0


In [6]:
# Checking if the dataset is balanced

df[["labels"]].value_counts()

labels
0         4934
1         4934
dtype: int64

In [7]:
# Defining the features (X) and the labels (y) used for training

X = df[['text']]

y = df[["labels"]].astype(int)
y.value_counts()

labels
0         4934
1         4934
dtype: int64

In [8]:
# Splitting the dataset into training and test sets (80/20)

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.2, random_state=42)
y_train

Unnamed: 0,labels
8921,1
5436,0
2885,0
5157,0
1888,1
...,...
5734,1
5191,0
5390,0
860,0


In [9]:
# Concatenating X and y for export to file

train_df = pd.DataFrame(pd.concat([X_train,y_train], axis=1,ignore_index=True))
train_df.columns = ["text", "labels"]
test_df = pd.DataFrame(pd.concat([X_test,y_test], axis=1,ignore_index=True))
test_df.columns = ["text", "labels"]

In [10]:
# Exporting both splits to CSV files, from which the Dataset object used for training is created

train_df.to_csv("/content/gdrive/MyDrive/Datasets/dataset_verifato_covid_train.csv", sep=",", columns = ["text", "labels"], index=False)
test_df.to_csv("/content/gdrive/MyDrive/Datasets/dataset_verifato_covid_test.csv", sep=",", columns = ["text", "labels"], index=False)

In [11]:
# Creation of the Dataset object

data_files = {
    "train": ["/content/gdrive/MyDrive/Datasets/dataset_verifato_covid_train.csv"],
    "test": ["/content/gdrive/MyDrive/Datasets/dataset_verifato_covid_test.csv"]
}

dataset = load_dataset('csv', data_files=data_files, delimiter=",")
dataset

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 7894
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 1974
    })
})
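
The split sizes in the output above follow from `test_size=0.2` applied to 9868 rows. As a quick arithmetic check, the sketch below assumes the test share is rounded up and the training set takes the remainder, which is how scikit-learn's `train_test_split` resolves a fractional `test_size` when `train_size` is not given. The helper name `split_sizes` is made up for illustration.

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # test share is rounded up
    n_train = n_samples - n_test               # training set takes the rest
    return n_train, n_test

n_train, n_test = split_sizes(9868, 0.2)
print(n_train, n_test)  # 7894 1974
```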

In [12]:
# Defining the repositories of the pretrained language models to be used

lang_models = {"BERTPT":"neuralmind/bert-base-portuguese-cased","BERTPTL":"neuralmind/bert-large-portuguese-cased","MBERT":"bert-base-multilingual-cased", "ELECTRA":"dlb/electra-base-portuguese-uncased-brwac","ROBERTA": "rdenadai/BR_BERTo","XLMR":"xlm-roberta-base","DISTILBERT": "distilbert-base-multilingual-cased","ALBERT":"josu/albert-pt-br","DEBERTA":"microsoft/mdeberta-v3-base"}


In [13]:
# Loading the tokenizer and defining the tokenize function.
# The tokenizer splits the text into tokens and maps each token to an integer ID; these IDs are the model's input during training.

def startTokenizer(model_type):
  load = time.time()
  if model_type == "DEBERTA":
    tokenizer = AutoTokenizer.from_pretrained(lang_models[model_type],use_fast=False)
  else:
    tokenizer = AutoTokenizer.from_pretrained(lang_models[model_type])
  end = time.time()
  tokenizationLoad = end - load
  print("Loading Tokenization: ", tokenizationLoad)
  return tokenizer

def tokenize(dataset, tokenizer):
  return tokenizer(dataset["text"], truncation=True, max_length=512, padding='max_length', add_special_tokens=True, return_tensors='np')
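
To make the effect of `truncation=True`, `max_length` and `padding='max_length'` concrete, here is a toy sketch that does not use the Transformers tokenizer at all. The whitespace split, the tiny vocabulary and the helper name `toy_encode` are assumptions made for illustration; real BERT-style tokenizers use subword vocabularies and add special tokens.

```python
# Toy vocabulary (hypothetical): id 0 is reserved for padding, id 1 for unknown words.
PAD_ID, UNK_ID = 0, 1
vocab = {"o": 2, "corona": 3, "chegou": 4, "na": 5, "cidade": 6}

def toy_encode(text: str, max_length: int) -> dict:
    """Map words to ids, truncate to max_length, then pad up to max_length."""
    ids = [vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_length]                     # truncation
    mask = [1] * len(ids)                      # 1 = real token
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * n_pad,   # padding to a fixed length
        "attention_mask": mask + [0] * n_pad,  # 0 = padding position
    }

encoded = toy_encode("O corona chegou ontem", max_length=6)
print(encoded["input_ids"])       # [2, 3, 4, 1, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 0, 0]
```

Because the notebook pads every sequence to 512 tokens at tokenization time, all examples already have the same length by the time they reach the data collator.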

In [14]:
# Definition of the metrics used to evaluate the models

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = evaluate.load('accuracy').compute(predictions=predictions, references=labels)
    precision = evaluate.load('precision').compute(predictions=predictions, references=labels)
    recall = evaluate.load('recall').compute(predictions=predictions, references=labels)
    f1 = evaluate.load('f1').compute(predictions=predictions, references=labels)

    metrics = {
        'accuracy': accuracy["accuracy"],
        'precision': precision["precision"],
        'recall': recall["recall"],
        'f1': f1["f1"],
    }

    return metrics
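
For reference, the four metrics can be written out by hand for the binary case. The sketch below is dependency-free and is not how the `evaluate` library is implemented; the function name `binary_metrics` and the example logits are made up for illustration. It treats label 1 as the positive class and returns 0.0 wherever a denominator would be zero, which is an assumption of this sketch.

```python
def binary_metrics(logits, labels):
    """Accuracy, precision, recall and F1 with label 1 as the positive class."""
    # argmax over the two logits of each sample gives the predicted class
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    correct = sum(p == y for p, y in zip(preds, labels))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "precision": precision,
            "recall": recall, "f1": f1}

# Made-up logits for four samples: columns are the scores for class 0 and class 1
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3]]
labels = [0, 1, 0, 1]
print(binary_metrics(logits, labels))
# {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```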

In [17]:
# Definition of the main functions of the notebook:
# build_model loads the tokenizer and the model, tokenizes the dataset, configures the Trainer and trains the model.
# save_model saves the tokenizer and the model to my personal Google Drive, in a folder named after the model type and the date and time at which the model is saved (i.e. after training finishes).

# Note: In the original paper, the best model was selected by accuracy, which was not optimal because it tended to favour overfitted models. Here, eval_loss is the selection metric.

def save_model(tokenizer, model, model_type):
  timezone = pytz.timezone('America/Argentina/Buenos_Aires')
  now = str(datetime.now(timezone).strftime("%Y-%m-%d_%H-%M"))
  model_name = f"{model_type}_Model_{now}"
  path_name = f"gdrive/My Drive/Modelos/{model_type}/{model_name}"
  model.save_pretrained(path_name)
  tokenizer.save_pretrained(path_name)
  print(f"Model Saved at: {path_name}")

def build_model(model_type, dataset=dataset, train_batch_size=8,learning_rate=5e-5):
  tokenizer = startTokenizer(model_type)
  tokenized_dataset = dataset.map(tokenize, batched=True,fn_kwargs={"tokenizer": tokenizer}, remove_columns=["text"])
  data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
  model = AutoModelForSequenceClassification.from_pretrained(lang_models[model_type], num_labels=2, return_dict=True)
  early_stopping = EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.001)
  training_args = TrainingArguments(
      output_dir=f"./test_trainer/{model_type}",
      evaluation_strategy="epoch",
      per_device_train_batch_size=train_batch_size,
      per_device_eval_batch_size=8,
      num_train_epochs=5,
      learning_rate = learning_rate,
      save_strategy="epoch",
      save_total_limit=5,
      load_best_model_at_end=True,
      metric_for_best_model="eval_loss"
      )
  trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["test"],
        compute_metrics=compute_metrics,
        data_collator=data_collator,
        callbacks=[early_stopping],
      )
  trainer.train()
  print("Training Process Finished")
  save_model(tokenizer, model, model_type)
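
The interaction between `early_stopping_patience`, `early_stopping_threshold` and best-model selection by `eval_loss` can be traced with a small simulation. This is a simplified sketch of the idea, not the Transformers implementation; the function name `epochs_run` and the loss values are hypothetical.

```python
def epochs_run(eval_losses, patience=3, threshold=0.001):
    """Return (epochs actually run, index of the best epoch), both 1-based."""
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if best_loss - loss > threshold:  # improved by more than the threshold
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1               # no meaningful improvement this epoch
            if bad_epochs >= patience:
                return epoch, best_epoch  # stop early
    return len(eval_losses), best_epoch

# Hypothetical eval losses for five epochs: the loss bottoms out at epoch 2
print(epochs_run([0.30, 0.22, 0.25, 0.27, 0.31]))  # (5, 2)
# Hypothetical run where the loss never improves after epoch 1
print(epochs_run([0.20, 0.25, 0.26, 0.27, 0.28]))  # (4, 1)
```

In this simplified model, with five epochs and a patience of 3, training only ends ahead of schedule when the loss never improves after the first epoch. In the other cases the benefit comes from restoring the best checkpoint at the end rather than from a shorter run.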


## MODEL TRAINING

Note: Model evaluation is covered in a separate notebook.

In [None]:
build_model("BERTPT")

In [None]:
build_model("BERTPTL", train_batch_size=4, learning_rate=2e-5)
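
Halving the batch size for the large model roughly doubles the number of optimizer steps per epoch. A rough check, assuming a single device, no gradient accumulation, and that the last partial batch is kept (the function name `steps_per_epoch` is illustrative):

```python
import math

def steps_per_epoch(n_train: int, batch_size: int) -> int:
    """Number of batches needed to see every training row once."""
    return math.ceil(n_train / batch_size)

n_train = 7894  # training rows reported in the DatasetDict output
print(steps_per_epoch(n_train, 8))  # 987
print(steps_per_epoch(n_train, 4))  # 1974
```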

In [None]:
build_model("MBERT")

In [None]:
build_model("ELECTRA")

In [None]:
build_model("ROBERTA")

In [None]:
build_model("XLMR")

In [None]:
build_model("DISTILBERT")

In [None]:
build_model("ALBERT")

In [None]:
build_model("DEBERTA")