In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


This code loads the labeled data, splits it into training and validation sets (80/20), creates a BERTimbau classification model, fine-tunes it on the training split, and evaluates it on the validation split. It reports the F1-score, precision, recall, and confusion matrix. The simpletransformers package must be installed first.

In [None]:
!pip install simpletransformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting simpletransformers
  Downloading simpletransformers-0.63.9-py3-none-any.whl (250 kB)
Collecting transformers>=4.6.0
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
Collecting tokenizers
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting wandb>=0.10.32


In [None]:
# Based on https://github.com/ThilinaRajapakse/simpletransformers
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix
from simpletransformers.classification import ClassificationModel
from sklearn.model_selection import train_test_split
import numpy as np

# Replace the path below with the actual path to your training CSV file
arquivo_csv_treino = '/content/drive/MyDrive/Colab Notebooks/2023/SBBD_toxic/data/hate_speech_data_paula.csv'
dados_treino = pd.read_csv(arquivo_csv_treino)

# Select the input and label columns from the training data
X = dados_treino.iloc[:, 0]  # Column with the preprocessed text
y = dados_treino.iloc[:, 1]  # Column with the labels

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

train_df = pd.DataFrame({'text': X_train, 'labels': y_train})
eval_df = pd.DataFrame({'text': X_val, 'labels': y_val})
print(train_df.head())

# Configure and create the classification model
model_args = {
    'num_train_epochs': 3,
    'train_batch_size': 8,
    'eval_batch_size': 8,
    'overwrite_output_dir': True,
    'save_steps': -1,
    'save_model_every_epoch': False,
    'learning_rate': 3e-5,
    'fp16': True,
}

model = ClassificationModel(
    'bert',
    'neuralmind/bert-base-portuguese-cased',
    num_labels=2,
    args=model_args,
    use_cuda=False,  # Change to True if you are running on a GPU
)


# Train the model on the training split
model.train_model(train_df)

# Evaluate the model on the validation split
#result, model_outputs, wrong_predictions = model.eval_model(eval_df)
predictions, raw_outputs = model.predict(eval_df['text'].tolist())

predicted_labels = np.argmax(raw_outputs, axis=1)
print("Rótulos únicos em predicted_labels:", np.unique(predicted_labels))

f1 = f1_score(eval_df['labels'], predicted_labels)
precision = precision_score(eval_df['labels'], predicted_labels)
recall = recall_score(eval_df['labels'], predicted_labels)
conf_matrix = confusion_matrix(eval_df['labels'], predicted_labels)

print("F1-Score:", f1)
print("Precision:", precision)
print("Recall:", recall)
print("Matriz de Confusão:\n", conf_matrix)


                                                text  labels
0  Meu nivel de amizade com isis é ela ter meu in...     1.0
1  rt @user @user o cara adultera dados, que fora...     1.0
2  @user @user @user o cara só é simplesmente o m...     1.0
3  eu to chorando vei vsf e eu nem staneio izone ...     1.0
4  tem um do jack com a msm música e agr não sei ...     0.0


Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the

  0%|          | 0/16650 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 0 of 3:   0%|          | 0/2082 [00:00<?, ?it/s]
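
The evaluation step in the cell above takes the argmax over `raw_outputs`, which are assumed here to be per-class logits (one row per input text). The following is a minimal, dependency-free sketch of that step, with a softmax added to show how the same logits could be read as probabilities. The logit values are made up for illustration and the helper names `softmax` and `argmax` are hypothetical, not part of simpletransformers.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Return the index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

# Made-up logits for three texts, two classes each (NOT real model output).
raw_outputs = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]]

predicted_labels = [argmax(row) for row in raw_outputs]
probabilities = [softmax(row) for row in raw_outputs]

print(predicted_labels)
for row in probabilities:
    print([round(p, 3) for p in row])
```

Because softmax preserves the ordering of the logits, the argmax over the probabilities gives the same labels as the argmax over the logits. Working with probabilities would, however, allow a decision threshold other than 0.5 on class 1 to trade precision against recall.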

Results on nevasca



```
# Rótulos únicos em predicted_labels: [0 1]
F1-Score: 0.622432859399684
Precision: 0.6912280701754386
Recall: 0.5660919540229885
Matriz de Confusão:
 [[698  88]
 [151 197]]
```
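
As a sanity check, the reported scores can be re-derived by hand from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and that label 1 is the positive class, which is the default for `f1_score`, `precision_score`, and `recall_score`. The variable names are illustrative.

```python
# Confusion matrix copied from the results above.
# Assumed layout (scikit-learn convention): rows = true label, columns = predicted label.
conf_matrix = [[698, 88],
               [151, 197]]

tn, fp = conf_matrix[0]  # true class 0: correctly rejected, wrongly flagged
fn, tp = conf_matrix[1]  # true class 1: missed, correctly flagged

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

The three values agree with the scores reported above. They describe class 1 only, so they say nothing directly about how well class 0 is recognized.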

