# Imports

In [None]:
! pip install simpletransformers

In [5]:
import pandas as pd 
import torch 
from sklearn.model_selection import train_test_split
from simpletransformers.classification import ClassificationModel

In [7]:
gpu = torch.cuda.is_available()

In [8]:
gpu

True

# Data loading

In [16]:
df = pd.read_csv('https://raw.githubusercontent.com/LucasRotsen/tcc_case_study_tutorial/main/data/SMSSpamCollection.txt', sep='\t', names=['labels', 'text', 'a'])

In [17]:
df = df[['labels', 'text']]

In [18]:
df = df.dropna() ## drop rows with null values
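
The loading cells above assume a tab-separated file with a label column and a text column, and then discard incomplete rows. The same idea can be sketched with the standard library alone. The sample lines and the `parse_rows` helper below are hypothetical and are not part of the dataset or of this notebook.

```python
import csv
import io

# Hypothetical sample in the same label<TAB>text layout as the file above.
SAMPLE = "ham\tSee you at lunch\nspam\tWin a free prize now\nham\t\n"

def parse_rows(raw):
    """Return (label, text) pairs, skipping rows with a missing field."""
    reader = csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        # Roughly mirrors dropna(): keep only rows where both fields are present.
        if len(fields) >= 2 and fields[0] and fields[1]:
            rows.append((fields[0], fields[1]))
    return rows

rows = parse_rows(SAMPLE)
print(rows)
```

The third sample row has an empty text field, so it is dropped, much as `dropna()` drops rows that pandas read as null.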

# Modeling

Important note:
- A CUDA-enabled GPU is strongly recommended for this part of the tutorial and for carrying out the project.
- [Google Colaboratory](https://colab.research.google.com/) provides a free environment with GPU + CUDA for use in notebooks.
- This tutorial's repository includes a detailed guide to using Google Colab.

In [20]:
## encode the labels as 0 (ham) and 1 (spam)

df['labels'] = [1 if row == 'spam' else 0 for row in df['labels']]
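
The list comprehension above maps every value other than `'spam'` to 0, so a typo or an unexpected label would silently become "ham". A stricter variant, sketched here on a toy list rather than on `df`, uses an explicit mapping and fails on unknown values. `LABEL_TO_ID` and `encode_labels` are hypothetical names introduced for illustration only.

```python
# Explicit mapping; the names here are illustrative, not from the notebook.
LABEL_TO_ID = {"ham": 0, "spam": 1}

def encode_labels(labels):
    """Map string labels to integers, failing loudly on unknown values."""
    unknown = sorted(set(labels) - set(LABEL_TO_ID))
    if unknown:
        raise ValueError(f"unexpected labels: {unknown}")
    return [LABEL_TO_ID[label] for label in labels]

encoded = encode_labels(["ham", "spam", "ham"])
print(encoded)
```

With pandas, the same mapping could be applied through `df['labels'].map(LABEL_TO_ID)`, which leaves unknown labels as NaN instead of raising.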

In [23]:
## split into training and test sets

train, test = train_test_split(df, test_size=0.3)

In [25]:
train.shape[0]

979

In [26]:
test.shape[0]

420
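
Because `train_test_split` shuffles at random, the rows in each set change from run to run, and the class proportions in the two sets may differ. scikit-learn's `train_test_split` accepts `random_state` and `stratify` arguments to address both points. The idea behind a stratified, seeded split can be sketched without any library; `stratified_split` is a hypothetical helper and the labels are toy data, not the SMS dataset.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) keeping class proportions similar."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])    # first slice goes to the test set
        train_idx.extend(indices[n_test:])   # the rest goes to training
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 70 + [1] * 30  # toy data: 70 ham, 30 spam
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Each class contributes about 30% of its rows to the test set, so both sets keep roughly the original ham/spam ratio.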

In [27]:
## initialize the RoBERTa model (a BERT variant); training happens in the next cell

model = ClassificationModel(
    "roberta",
    "roberta-base",
    use_cuda=gpu
)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.de

In [28]:
## train the model

model.train_model(train)

(123, 0.35904825024488496)
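
According to the Simple Transformers documentation, `train_model` returns the number of global training steps and, when evaluation during training is disabled, the average training loss. The 123 steps reported above are consistent with one epoch over the 979 training rows at a batch size of 8. That batch size is assumed here to be the library default, since this notebook never sets it.

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(n_examples / batch_size)

# 979 training rows; batch size 8 is an assumption, not set in this notebook.
train_steps = steps_per_epoch(979, 8)
print(train_steps)
```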

In [29]:
## evaluate the model on the held-out, labeled test set

result, model_outputs, wrong_predictions = model.eval_model(test)

In [30]:
result 

{'auprc': 0.9880921142275862,
 'auroc': 0.9847834967320261,
 'eval_loss': 0.21121440293653956,
 'fn': 12,
 'fp': 10,
 'mcc': 0.8952216956842892,
 'tn': 194,
 'tp': 204}
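
The `result` dictionary reports the confusion-matrix counts (`tp`, `tn`, `fp`, `fn`) but not accuracy, precision, recall, or F1. These follow directly from the counts, as the sketch below shows using the values copied from the output above. It also recomputes the Matthews correlation coefficient as a sanity check against the reported `mcc`.

```python
import math

# Confusion-matrix counts copied from the `result` dictionary above.
tp, tn, fp, fn = 204, 194, 10, 12

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted spam that is really spam
recall = tp / (tp + fn)      # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} mcc={mcc:.4f}")
```

For these counts, accuracy is about 0.948, precision about 0.953, recall about 0.944, and F1 about 0.949; the recomputed MCC agrees with the reported value.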

In [31]:
## build the lookup table from numeric prediction to label name

cor_tab = {0: 'ham', 1: 'spam'}

In [39]:
text = input('Type any text to find out whether it is SPAM or HAM: ')

Type any text to find out whether it is SPAM or HAM: Hello Mr Arthur, you got a new car


In [40]:
predictions, raw_outputs = model.predict([text])

In [41]:
cor_tab[predictions[0]]

'ham'
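
`model.predict` also returns `raw_outputs`, which is not used above. For a classification model these are typically the per-class logits, although that is an assumption here and worth checking against the library documentation. Under that assumption, a softmax turns them into probabilities, which gives a rough sense of how confident a prediction is. The logits below are made up for illustration and are not real model output.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

cor_tab = {0: 'ham', 1: 'spam'}
example_logits = [2.1, -1.7]  # hypothetical values, not real model output
probs = softmax(example_logits)

# Index of the largest probability, mapped back to a label name.
best = max(range(len(probs)), key=probs.__getitem__)
print(cor_tab[best], round(probs[best], 3))
```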