<a href="https://colab.research.google.com/github/leonardoazzi/bert-misinfo-covid19/blob/bert/bertimbau-finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [138]:
%pip install pandas transformers datasets torch scikit-learn evaluate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


# Data preparation
Loads the dataset used for fine-tuning and selects the most relevant attributes.

In [70]:
import pandas as pd
import seaborn as sns
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

Downloads the annotated dataset into the ./data directory.

In [28]:
import os

if not os.path.exists('./data/covidbr_labeled.csv'):
    # -p avoids an error when ./data already exists but the file is missing
    !mkdir -p data
    !curl -L -o ./data/covidbr_labeled.csv https://zenodo.org/records/5193932/files/covidbr_labeled.csv
else:
    print("File already exists. Skipping download.")

File already exists. Skipping download.


In [13]:
original_dataset_df = pd.read_csv('./data/covidbr_labeled.csv')
original_dataset_df

Unnamed: 0,shares,text,misinformation,source,revision
0,27,"O ministro da Ciência, Tecnologia, Inovações e...",0,https://www.gov.br/pt-br/noticias/educacao-e-p...,
1,26,Pesquisa com mais de 6.000 médicos em 30 paíse...,1,https://www.aosfatos.org/noticias/e-falso-que-...,
2,25,É com muita alegria que comunico que mais um p...,0,http://portal.mec.gov.br/component/content/art...,
3,25,Renda Brasil unificará vários programas sociai...,0,https://agenciabrasil.ebc.com.br/politica/noti...,
4,24,O Secretário-Geral da OTAN Jens Stoltenberg ta...,0,,1.0
...,...,...,...,...,...
2894,1,A torcida do corona deve estar arrancando os c...,0,,
2895,1,“OS EUA E O CORONAVÍRUS :\n\nAcabei de assisti...,0,https://www.reuters.com/article/us-health-coro...,1.0
2896,1,Estatísticas falsas conforme depoimentos colhi...,1,,1.0
2897,1,"Atenção => 🇧🇷💓💓💓 *MUITO IMPORTANTE! ""Como é qu...",0,,


In [15]:
dataset_df = original_dataset_df[["text", "misinformation"]]
dataset_df

Unnamed: 0,text,misinformation
0,"O ministro da Ciência, Tecnologia, Inovações e...",0
1,Pesquisa com mais de 6.000 médicos em 30 paíse...,1
2,É com muita alegria que comunico que mais um p...,0
3,Renda Brasil unificará vários programas sociai...,0
4,O Secretário-Geral da OTAN Jens Stoltenberg ta...,0
...,...,...
2894,A torcida do corona deve estar arrancando os c...,0
2895,“OS EUA E O CORONAVÍRUS :\n\nAcabei de assisti...,0
2896,Estatísticas falsas conforme depoimentos colhi...,1
2897,"Atenção => 🇧🇷💓💓💓 *MUITO IMPORTANTE! ""Como é qu...",0


# Exploratory data analysis

The goal is to understand and summarize the characteristics of the data: analyze the number and types of attributes, check the distribution of the target attribute, identify patterns and anomalies, and remove attributes that appear irrelevant or problematic. Use plots and statistical summaries for the EDA. Check for potential problems in the data, such as the need to normalize attributes, balance classes, or remove instances or attributes because of inconsistencies.

- P1. How many attributes are there and what are their types? Are there inconsistencies?
  - Which attributes are available?
  - Are there inconsistencies in the attributes? (Empty attributes, potential errors, etc.)
  - Are there attributes that need to be removed or transformed?
- P2. What is the distribution of the target attribute?
  - What are the target classes? How are instances distributed across them? Is the distribution balanced or imbalanced?
- P3. What patterns and anomalies do the attributes show?



## P1. How many attributes are there and what are their types? Are there inconsistencies?

In [16]:
dataset_df.info(verbose = False, memory_usage = False, show_counts = True) # prints a dtype summary; pass verbose=True to list the non-null count of each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2899 entries, 0 to 2898
Columns: 2 entries, text to misinformation
dtypes: int64(1), object(1)

In [18]:
dataset_df.dtypes

Unnamed: 0,0
text,object
misinformation,int64


## P2. What is the distribution of the target attribute?

In [141]:
dataset_df['misinformation'].describe(include='all')

Unnamed: 0,misinformation
count,2899.0
mean,0.314591
std,0.464433
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0
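
Because `misinformation` is a 0/1 label, its mean is the share of the positive class, so the summary above already answers P2. The sketch below recovers the approximate class counts from the `count` and `mean` values printed above; the variable names are illustrative and not part of the notebook.

```python
# Values copied from the describe() output above
count = 2899
mean = 0.314591  # for a 0/1 label, the mean equals the positive-class share

# Recover the class counts (rounded, since the printed mean is truncated)
n_misinfo = round(count * mean)
n_other = count - n_misinfo

print(f"misinformation=1: {n_misinfo} ({n_misinfo / count:.1%})")
print(f"misinformation=0: {n_other} ({n_other / count:.1%})")
print(f"imbalance ratio (0:1): {n_other / n_misinfo:.2f}")
```

Roughly one instance in three is labeled as misinformation, so the classes are imbalanced.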


# Preprocessing

## Tokenization

Loads the tokenizer for `bert-base-portuguese-cased` (BERTimbau).

In [54]:
from transformers import AutoTokenizer  # Or BertTokenizer

tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased', do_lower_case=False)

config.json:   0%|          | 0.00/647 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/43.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/210k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Applies the tokenizer to every instance of `text`.

In [89]:
def tokenize_function(text):
    # Tokenize a single text, padding or truncating it to exactly 128 tokens
    return tokenizer(str(text), padding="max_length", truncation=True, max_length=128)

# Apply the tokenizer to the dataset
tokenized_datasets = dataset_df.apply(lambda row: tokenize_function(row["text"]), axis=1)

# Inspect tokenized samples
tokenized_df = pd.DataFrame(tokenized_datasets, columns=["tk_text"])
tokenized_df
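
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy pure-Python sketch of how a sequence ends up with a fixed length and a matching attention mask. It is an illustration, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the `[CLS]`/`[SEP]` ids are placeholders, and only `PAD_ID = 0` is taken from the model config.

```python
PAD_ID = 0                 # matches "pad_token_id": 0 in the model config
CLS_ID, SEP_ID = 101, 102  # placeholder ids; read the real ones from the tokenizer

def pad_or_truncate(token_ids, max_length=8):
    """Toy version of padding="max_length" + truncation=True for one sequence."""
    body = token_ids[: max_length - 2]        # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }

short = pad_or_truncate([7, 8, 9])             # shorter than max_length: padded
long = pad_or_truncate(list(range(10, 30)))    # longer than max_length: truncated

print(short)
print(long)
```

Both results have exactly `max_length` entries, which is what lets the examples be stacked into one batch tensor.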

In [95]:
data = pd.concat([tokenized_df, dataset_df["misinformation"]], axis=1, join="inner")
data

Unnamed: 0,tk_text,misinformation
0,"[input_ids, token_type_ids, attention_mask]",0
1,"[input_ids, token_type_ids, attention_mask]",1
2,"[input_ids, token_type_ids, attention_mask]",0
3,"[input_ids, token_type_ids, attention_mask]",0
4,"[input_ids, token_type_ids, attention_mask]",0
...,...,...
2894,"[input_ids, token_type_ids, attention_mask]",0
2895,"[input_ids, token_type_ids, attention_mask]",0
2896,"[input_ids, token_type_ids, attention_mask]",1
2897,"[input_ids, token_type_ids, attention_mask]",0


## Class balancing

Uses computed `class_weights`.

Source: https://medium.com/@heyamit10/fine-tuning-bert-for-classification-a-practical-guide-b8c1c56f252c

In [96]:
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

labels = data["misinformation"]
class_weights = compute_class_weight("balanced", classes=np.unique(labels), y=labels)
print(class_weights)

[0.7294917  1.58936404]
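
The `"balanced"` heuristic in scikit-learn assigns each class the weight `n_samples / (n_classes * class_count)`. The sketch below reproduces the two printed weights by hand; the class counts are an assumption derived from the `count` and `mean` shown in the `describe()` output (1987 negatives, 912 positives).

```python
# Class counts inferred from describe(): 2899 rows, ~31.46% positives
class_counts = {0: 1987, 1: 912}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "balanced" weighting: rarer classes receive proportionally larger weights
manual_weights = {
    label: n_samples / (n_classes * n) for label, n in class_counts.items()
}

for label, weight in manual_weights.items():
    print(label, round(weight, 8))
```

The minority class (misinformation) gets a weight about 2.2 times larger than the majority class, mirroring the imbalance ratio.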


# Fine-tuning

In [150]:
from datasets import Dataset
import datasets

dataset = Dataset.from_pandas(dataset_df)
dataset

Dataset({
    features: ['text', 'misinformation'],
    num_rows: 2899
})

In [157]:
split_data = dataset.train_test_split()
train_data = split_data["train"]
test_data = split_data["test"]
split_data

DatasetDict({
    train: Dataset({
        features: ['text', 'misinformation'],
        num_rows: 2174
    })
    test: Dataset({
        features: ['text', 'misinformation'],
        num_rows: 725
    })
})
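
The split above is random and unstratified, so the misinformation share can drift between train and test. As a hedged illustration of the alternative, the sketch below implements a stratified split in plain Python. `stratified_split` is a hypothetical helper, and the label list is a stand-in built from the class counts inferred earlier, not the real column.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Stand-in labels with the same class counts as the dataset (1987 zeros, 912 ones)
labels = [0] * 1987 + [1] * 912
train_idx, test_idx = stratified_split(labels)

test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(test_share, 4))
```

With `test_size=0.25` this yields the same 2174/725 sizes as the split above while keeping the positive share of the test set close to the overall 31.5%.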

In [122]:
from transformers import AutoModelForPreTraining  # Or BertForPreTraining for loading pretraining heads
from transformers import AutoModel  # or BertModel, for BERT without pretraining heads
from transformers import BertModel
from transformers import BertForSequenceClassification

model_name = 'neuralmind/bert-base-portuguese-cased'
model = BertForSequenceClassification.from_pretrained(model_name)

print(model.config)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at neuralmind/bert-base-portuguese-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.52.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 29794
}



In [133]:
# Freeze all layers except the classifier
for param in model.bert.parameters():
    param.requires_grad = False

# Keep only the classification head trainable
for param in model.classifier.parameters():
    param.requires_grad = True

print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")

Trainable parameters: 1538
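
The count of 1538 can be checked by hand: with the encoder frozen, only the linear classification head is trained, and it maps the 768-dimensional pooled output to the label logits. The sketch assumes the default of two labels, since the notebook does not pass `num_labels` when loading the model.

```python
hidden_size = 768   # "hidden_size" in the printed BertConfig
num_labels = 2      # library default when num_labels is not specified

weight_params = hidden_size * num_labels  # classifier.weight
bias_params = num_labels                  # classifier.bias
trainable = weight_params + bias_params

print(trainable)
```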


In [136]:
from transformers import TrainingArguments

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",           # Directory for saving model checkpoints
    #evaluation_strategy="epoch",     # Evaluate at the end of each epoch
    learning_rate=5e-5,              # Start with a small learning rate
    per_device_train_batch_size=16,  # Batch size per GPU
    per_device_eval_batch_size=16,
    num_train_epochs=3,              # Number of epochs
    weight_decay=0.01,               # Regularization
    save_total_limit=2,              # Limit checkpoints to save space
    #load_best_model_at_end=True,     # Automatically load the best checkpoint
    logging_dir="./logs",            # Directory for logs
    logging_steps=100,               # Log every 100 steps
    fp16=True                        # Enable mixed precision for faster training
)

print(training_args)

TrainingArguments(
_n_gpu=0,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.NO,
eval_use_gather_object=False,


In [140]:
from evaluate import load

# Load a metric (F1-score in this case)
f1_metric = load("f1")
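
For reference, the F1-score is the harmonic mean of precision and recall on the positive class. The pure-Python sketch below shows the calculation on made-up logits, including the argmax step that turns logits into class predictions; `argmax` and `binary_f1` are illustrative helpers, not functions from the `evaluate` library.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def binary_f1(predictions, references, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up logits for four examples (columns: class 0, class 1)
logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]]
references = [0, 1, 1, 1]

predictions = [argmax(row) for row in logits]
score = binary_f1(predictions, references)
print(predictions, score)
```

Here one of the three positives is missed, so recall is 2/3 while precision is 1.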

In [None]:
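from transformers import Trainer

def compute_metrics(eval_pred):
    # Turn logits into class predictions, then score them with F1
    logits, labels = eval_pred
    return f1_metric.compute(predictions=np.argmax(logits, axis=-1), references=labels)

# Tokenize both splits and rename the target column to "labels", which Trainer expects
tokenized_split = split_data.map(
    lambda batch: tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128),
    batched=True,
).rename_column("misinformation", "labels")

trainer = Trainer(
    model=model,                        # Pre-trained BERT model
    args=training_args,                 # Training arguments
    train_dataset=tokenized_split["train"],
    eval_dataset=tokenized_split["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,    # Custom metric (F1)
)

# Start training
trainer.train()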
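
The class weights computed in the balancing section are not passed to the trainer in this cell. One common way to use them is a weighted cross-entropy loss; the sketch below shows only the arithmetic in plain Python, using the weighted-mean reduction (sum of weighted losses divided by the sum of the weights of the targets). The helper names are hypothetical, and wiring this into `Trainer` is left out.

```python
import math

CLASS_WEIGHTS = [0.7294917, 1.58936404]  # values printed in the balancing section

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def weighted_cross_entropy(batch_logits, labels, weights):
    """Weighted mean of per-example negative log-likelihoods."""
    total, norm = 0.0, 0.0
    for logits, label in zip(batch_logits, labels):
        total += -weights[label] * log_softmax(logits)[label]
        norm += weights[label]
    return total / norm

# Both examples are predicted as class 0; the second one is actually class 1
batch_logits = [[2.0, 0.0], [2.0, 0.0]]
labels = [0, 1]

weighted = weighted_cross_entropy(batch_logits, labels, CLASS_WEIGHTS)
unweighted = weighted_cross_entropy(batch_logits, labels, [1.0, 1.0])
print(round(weighted, 4), round(unweighted, 4))
```

The missed misinformation example contributes more under the weighted loss, which is the intended effect of the balancing step.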