# TRL Band Classification - BERTimbau + ML Models - 512 tokens

- BERTimbau Base (aka "bert-base-portuguese-cased")

- 512 *tokens*

## Code Version 1.0 - 21 Oct 23

- Classifiers used: *ML models*
- Confusion matrix;
- Seed 42

- This code adopts the simplification proposed in the dissertation: the vector representation (RV) is computed once, and the k-fold is run afterwards. This saves computation.
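The saving comes from running the expensive encoder once per document instead of once per fold, which is valid as long as the encoder is frozen and its output does not depend on the fold. A minimal sketch of the idea, in which `extract_features` is a hypothetical stand-in for the BERT forward pass and the documents and labels are placeholders (none of these names come from this notebook):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

calls = 0  # counts how many times the "expensive" extractor runs


def extract_features(texts):
    """Hypothetical stand-in for the BERT forward pass (returns random vectors)."""
    global calls
    calls += 1
    rng = np.random.default_rng(42)
    return rng.normal(size=(len(texts), 8))


texts = [f"abstract {i}" for i in range(20)]     # placeholder documents
labels = np.array([i % 3 for i in range(20)])    # placeholder labels 0..2

# Step 1: compute the vector representation once, outside the folds
rv = extract_features(texts)

# Step 2: reuse the same matrix in every fold
accuracies = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(rv):
    clf = KNeighborsClassifier(n_neighbors=3).fit(rv[train_idx], labels[train_idx])
    accuracies.append(clf.score(rv[test_idx], labels[test_idx]))

print("extractor calls:", calls, "| folds evaluated:", len(accuracies))
```

If the encoder were fine-tuned inside each fold, this shortcut would no longer apply, because the features would then depend on the training split.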

In [1]:
# dataset.csv   or  dataset_pre_processado_1.csv  or  dataset_pre_processado_stem_2.csv
#     CSV1                  CSV2                                   CSV3
dataset = "dataset.csv"

In [2]:
print("Lembre-se estamos usando o dataset: " + dataset)

Lembre-se estamos usando o dataset: dataset.csv


In [3]:
melhor_modelo = 'best_model_TRL_bertimbau_base_ml' + '.bin'
melhor_modelo

'best_model_TRL_bertimbau_base_ml.bin'

- Maximum number of tokens = 512

In [4]:
MAX_LEN = 512

## Setup

### Libraries and environment

In [5]:
!pip install -qq transformers

In [6]:
!pip install -q -U watermark

In [7]:
!nvidia-smi

Tue Dec 12 00:08:31 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [8]:
import transformers
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch

import numpy as np
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from sklearn.model_selection import train_test_split
from collections import defaultdict
import textwrap

from tqdm import tqdm
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)

HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]

sns.set_palette(sns.color_palette(HAPPY_COLORS_PALETTE))

rcParams['figure.figsize'] = 12, 8

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cuda', index=0)

In [9]:
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn import naive_bayes, svm
from sklearn.naive_bayes import ComplementNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.preprocessing import MinMaxScaler

In [10]:
%reload_ext watermark
%watermark -v -p numpy,pandas,torch,transformers

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0

numpy       : 1.23.5
pandas      : 1.5.3
torch       : 2.1.0+cu118
transformers: 4.35.2



# K-fold partitions

- For more details, see the notebook *1-kfold.ipynb*, in which the partitions were drawn. The cells below recreate the result obtained there.
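For readers without access to *1-kfold.ipynb*, the sketch below shows one way index arrays of this shape could be drawn with scikit-learn. The `KFold` settings (shuffling, seed 42) are assumptions; the original notebook may have used different ones, so the indices produced here need not match the hard-coded arrays.

```python
import numpy as np
from sklearn.model_selection import KFold

N_SAMPLES = 168  # number of abstracts in the dataset
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # assumed settings

folds_train, folds_test = [], []
for train_idx, test_idx in kf.split(np.arange(N_SAMPLES)):
    folds_train.append(train_idx)
    folds_test.append(test_idx)

# Sanity checks that any valid 5-fold partition must satisfy
for train_idx, test_idx in zip(folds_train, folds_test):
    assert len(np.intersect1d(train_idx, test_idx)) == 0   # train and test are disjoint
    assert len(train_idx) + len(test_idx) == N_SAMPLES      # together they cover everything

all_test = np.sort(np.concatenate(folds_test))
assert np.array_equal(all_test, np.arange(N_SAMPLES))       # each sample is tested exactly once

print([len(t) for t in folds_test])
```

The same three checks can be run on the hard-coded `kfold_train` and `kfold_test` lists to confirm they were copied correctly.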

In [11]:
array1 = np.array([  0,   1,   2,   3,   5,   7,   9,  10,  11,  12,  13,  14,  15,
         16,  17,  19,  20,  21,  22,  23,  25,  26,  27,  28,  30,  32,
         33,  34,  35,  36,  37,  39,  40,  44,  45,  46,  47,  48,  49,
         50,  52,  53,  54,  55,  56,  57,  59,  60,  61,  63,  64,  66,
         68,  69,  70,  71,  72,  75,  76,  77,  78,  79,  80,  81,  82,
         83,  84,  85,  87,  91,  92,  93,  94,  96,  97,  98,  99, 100,
        102, 103, 104, 105, 106, 108, 109, 110, 112, 113, 114, 115, 116,
        117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 129, 130,
        131, 132, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 145,
        146, 148, 150, 151, 153, 154, 155, 157, 158, 159, 160, 161, 163,
        164, 165, 166, 167])

array2 = np.array([  0,   1,   3,   4,   5,   6,   7,   8,  10,  13,  14,  15,  16,
         17,  18,  19,  20,  22,  23,  24,  25,  26,  27,  29,  30,  31,
         34,  36,  37,  38,  40,  41,  42,  43,  44,  45,  47,  49,  51,
         52,  53,  54,  55,  57,  58,  59,  60,  61,  62,  63,  64,  65,
         66,  67,  68,  69,  70,  71,  73,  74,  75,  76,  77,  80,  83,
         84,  86,  88,  89,  90,  92,  93,  94,  95,  96,  97,  98,  99,
        100, 101, 102, 104, 105, 107, 109, 110, 111, 112, 113, 114, 115,
        116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128,
        129, 130, 132, 133, 135, 136, 137, 138, 139, 142, 143, 144, 145,
        147, 149, 150, 151, 152, 153, 154, 155, 156, 157, 160, 161, 162,
        163, 165, 166, 167])
array3 = np.array([  1,   2,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,
         17,  18,  20,  21,  22,  23,  24,  25,  28,  29,  30,  31,  32,
         33,  35,  36,  37,  38,  39,  41,  42,  43,  44,  45,  46,  47,
         48,  50,  51,  52,  54,  55,  56,  57,  58,  59,  62,  63,  65,
         67,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,
         82,  84,  85,  86,  87,  88,  89,  90,  91,  93,  95,  96,  97,
         98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 111,
        112, 113, 114, 118, 120, 121, 122, 123, 124, 126, 128, 129, 131,
        132, 133, 134, 135, 136, 138, 139, 140, 141, 142, 144, 146, 147,
        148, 149, 150, 151, 152, 153, 155, 156, 158, 159, 160, 161, 162,
        163, 164, 165, 166])
array4 = np.array([  0,   2,   3,   4,   6,   8,   9,  10,  11,  12,  15,  16,  18,
         19,  20,  21,  22,  24,  25,  26,  27,  28,  29,  31,  32,  33,
         34,  35,  37,  38,  39,  40,  41,  42,  43,  45,  46,  47,  48,
         49,  50,  51,  53,  56,  57,  58,  60,  61,  62,  63,  64,  65,
         66,  67,  68,  69,  71,  72,  73,  74,  75,  76,  78,  79,  80,
         81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,
         94,  95,  98, 100, 101, 103, 104, 106, 107, 108, 109, 110, 111,
        112, 114, 115, 116, 117, 119, 121, 123, 124, 125, 127, 128, 129,
        130, 131, 133, 134, 135, 137, 139, 140, 141, 142, 143, 144, 145,
        146, 147, 148, 149, 150, 152, 153, 154, 155, 156, 157, 158, 159,
        162, 163, 164, 166, 167])
array5 = np.array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  11,  12,  13,
         14,  15,  16,  17,  18,  19,  21,  23,  24,  26,  27,  28,  29,
         30,  31,  32,  33,  34,  35,  36,  38,  39,  40,  41,  42,  43,
         44,  46,  48,  49,  50,  51,  52,  53,  54,  55,  56,  58,  59,
         60,  61,  62,  64,  65,  66,  67,  68,  69,  70,  72,  73,  74,
         77,  78,  79,  81,  82,  83,  85,  86,  87,  88,  89,  90,  91,
         92,  94,  95,  96,  97,  99, 101, 102, 103, 105, 106, 107, 108,
        110, 111, 113, 115, 116, 117, 118, 119, 120, 122, 125, 126, 127,
        128, 130, 131, 132, 133, 134, 136, 137, 138, 140, 141, 143, 144,
        145, 146, 147, 148, 149, 151, 152, 154, 156, 157, 158, 159, 160,
        161, 162, 164, 165, 167])
# Build the combined list
kfold_train = [array1, array2, array3, array4, array5]

In [12]:
# Create the five numpy arrays
array1 = np.array([  4,   6,   8,  18,  24,  29,  31,  38,  41,  42,  43,  51,  58, 62,  65,  67,  73,  74,  86,  88,  89,  90,  95, 101, 107, 111, 128, 133, 144, 147, 149, 152, 156, 162])
array2 = np.array([  2,   9,  11,  12,  21,  28,  32,  33,  35,  39,  46,  48,  50, 56,  72,  78,  79,  81,  82,  85,  87,  91, 103, 106, 108, 131, 134, 140, 141, 146, 148, 158, 159, 164])
array3 = np.array([  0,   3,  15,  16,  19,  26,  27,  34,  40,  49,  53,  60,  61, 64,  66,  68,  69,  83,  92,  94, 110, 115, 116, 117, 119, 125, 127, 130, 137, 143, 145, 154, 157, 167])
array4 = np.array([  1,   5,   7,  13,  14,  17,  23,  30,  36,  44,  52,  54,  55, 59,  70,  77,  96,  97,  99, 102, 105, 113, 118, 120, 122, 126, 132, 136, 138, 151, 160, 161, 165])
array5 = np.array([ 10,  20,  22,  25,  37,  45,  47,  57,  63,  71,  75,  76,  80,  84,  93,  98, 100, 104, 109, 112, 114, 121, 123, 124, 129, 135, 139, 142, 150, 153, 155, 163, 166])


# Build the combined list
kfold_test = [array1, array2, array3, array4, array5]

# Dataset

In [13]:
df = pd.read_csv(dataset)

In [14]:
novo_df = df[["resumo", "rotulo"]].copy()  # explicit copy avoids pandas' SettingWithCopyWarning below

In [15]:
def adapto_faixa(faixa):
    # Shift the band labels down by one so that they start at 0
    faixa -= 1
    return faixa

In [16]:
novo_df['rotulo'] = novo_df.rotulo.apply(adapto_faixa)


In [17]:
novo_df.tail()

Unnamed: 0,resumo,rotulo
163,Rio de Janeiro (RJ) – O Centro de Avaliações d...,2
164,Este trabalho apresenta um sistema para contro...,2
165,No contexto das comunicações táticas baseadas ...,2
166,O valor da velocidade de alvos móveis em image...,0
167,Neste artigo é apresentada a análise da seção ...,0


In [18]:
class_names = ['Faixa 1', 'Faixa 2', 'Faixa 3']

# BERTimbau

- https://huggingface.co/neuralmind/bert-base-portuguese-cased

```
@inproceedings{souza2020bertimbau,
  author    = {F{\'a}bio Souza and
               Rodrigo Nogueira and
               Roberto Lotufo},
  title     = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
  booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
  year      = {2020}
}

```

In [19]:
PRE_TRAINED_MODEL_NAME = 'neuralmind/bert-base-portuguese-cased'
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

tokenizer_config.json:   0%|          | 0.00/43.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/210k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/647 [00:00<?, ?B/s]

In [20]:
bert_model = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

In [21]:
frase = 'A avaliação de prontidão tecnológica em tecnologias de interesse militar'

In [22]:
encoding = tokenizer.encode_plus(
  frase,
  max_length=20,   # Note that this cuts one token from the sentence above
  truncation=True, # Truncate explicitly to max_length
  add_special_tokens=True, # Add '[CLS]' and '[SEP]'
  return_token_type_ids=False,
  padding='max_length', # Replaces the deprecated pad_to_max_length=True
  return_attention_mask=True,
  return_tensors='pt',  # Return PyTorch tensors
)

encoding.keys()


dict_keys(['input_ids', 'attention_mask'])
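What `max_length`, truncation, and padding do to a sequence can be sketched without the tokenizer. The helper `pad_or_truncate` below is hypothetical and the body token ids are made up. The special ids follow values visible in this notebook (`[CLS]` = 101 at the start of the encoded tensor, `pad_token_id` = 0 in the model config), while `[SEP]` = 102 is an assumption based on the usual BERT vocabulary layout.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # SEP_ID is an assumption (see the text above)


def pad_or_truncate(token_ids, max_length):
    """Hypothetical helper: add [CLS]/[SEP], truncate on the right, then right-pad."""
    body = token_ids[: max_length - 2]            # leave room for the 2 special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)         # 1 = real token
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad  # 0 = padding


# A short sequence is padded, a long one is truncated; both end up with length 8
short_ids, short_mask = pad_or_truncate([231, 6478, 7007], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(200, 230)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed along with `input_ids` later in the notebook.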

- Model information:

In [23]:
bert_model = bert_model.to(device)

In [24]:
bert_model.config.hidden_size

768

In [25]:
bert_model.config

BertConfig {
  "_name_or_path": "neuralmind/bert-base-portuguese-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 29794
}

# Applying the model to generate the vector representation

- Useful tutorials:

```
1 - https://towardsdatascience.com/feature-extraction-with-bert-for-text-classification-533dde44dc2f
```

```
2 - https://github.com/felipemaiapolo/legalnlp/blob/main/demo/BERT/BERT_TUTORIAL.ipynb
```

*I highly recommend the second one: it is a well-written notebook that comments on each step and explains BERT. It is also in Portuguese, and material of this quality is rare in our language.*

### Converting the data to tensors:

In [26]:
# Preparing the texts (the wrapper is used only to format them for display)
wrapper = textwrap.TextWrapper()
data_resumo = list(novo_df['resumo'])

- Previewing:

In [27]:
for text in data_resumo[:2]:
  print(wrapper.fill(text))
  print()

O crescente emprego de mísseis de ombro infravermelhos contra alvos
aéreos demanda a utilização de contramedidas cada vez mais modernas e
eficientes. Neste cenário, surge o Directed Infrared Countermeasure
(DIRCM), cujo objetivo é interferir no guiamento do míssil por meio de
pulsos de laser. Neste artigo, um seeker infravermelho do tipo rising
sun é modelado e simulado, sendo os efeitos da emissão de um DIRCM no
processamento do sinal avaliados. A influência de parâmetros de
frequência de repetição de pulsos e intensidade do laser são
evidenciados. Os resultados obtidos ressaltam a importância e a
necessidade do desenvolvimento de ferramentas computacionais mais
complexas, visando ao desenvolvimento da doutrina de emprego deste
tipo de contramedida.

Materiais dielétricos com baixas perdas e alta permissividade são
componentes essenciais que são utilizados em linhas de transmissões
não lineares capacitivas (LTNLs) na geração de RF. LTNLs possuem
grande potencial para gerar ondas de só

In [28]:
# Apply the BERT tokenizer with the maximum length defined above
encoded_inputs = tokenizer(data_resumo, padding=True, truncation=True, max_length=MAX_LEN, return_tensors="pt")


# encoded_inputs is a dictionary with 3 keys
encoded_inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

- Viewing how the abstract was tokenized, i.e., the `'input_ids'` key of `encoded_inputs`:

In [29]:
# Show the first text after tokenization
print(encoded_inputs['input_ids'][0])

print()

# Show the same text decoded
print(wrapper.fill(tokenizer.decode(encoded_inputs['input_ids'][0])))

tensor([  101,   231,  6478,  7007,   125, 16328,   125, 20462,  5771,   391,
         4020, 22281,   598, 14874, 19251,  8603,   123,  5353,   125,   598,
         8969,   591,  1078,   576,   325, 11177,   122, 10746, 22281,   119,
         3703,  5391,   117,  7872,   146,  2278,  8454, 22284,  5027,   900,
          430, 11357,   140,  2890, 14473, 22279,   113,   250, 15710, 22304,
        22311,   114,   117,  3596,  2630,   253, 12993,   307,   202, 13375,
          310,   171, 14994, 17513,   240,  1423,   125,  5995,  1409,   125,
        20909,   119,  3703,  4319,   117,   222,   176,  6505,   140,  5771,
          391,  4020,   171,  1903,  3979,   446,   233, 22285,   253, 12066,
          201,   122,  6235,   201,   117,   660,   259,  3997,   180, 11352,
          125,   222,   250, 15710, 22304, 22311,   202, 12152,   171,  4227,
        11175,   442,   119,   177,  2824,   125, 14492,   125,  5678,   125,
        21395,   125,  5995,  1409,   122,  8920,   171, 20909, 

In [30]:
# Move the tensors to the GPU for acceleration
input_ids = encoded_inputs['input_ids'].to(device)
attention_mask = encoded_inputs['attention_mask'].to(device)

In [31]:
# Create the list of features
features = []

- Applying BERT for feature extraction

In [32]:
for i in tqdm(range(len(data_resumo))):

    with torch.no_grad():
        # Output index [1] is the pooled output ([CLS] state after dense + tanh),
        # not the full last hidden state, despite the variable name
        last_hidden_states = bert_model(input_ids[i:(i+1)], attention_mask=attention_mask[i:(i+1)])[1].cpu().numpy().reshape(-1).tolist()
        # last_hidden_states = bert_model(input_ids[i:(i+1)])[1].cpu().numpy().reshape(-1).tolist()
    features.append(last_hidden_states)

100%|██████████| 168/168 [00:11<00:00, 15.01it/s]
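Index `[1]` of the `BertModel` output is the pooled output: the final hidden state of the `[CLS]` token passed through a dense layer and a `tanh`. That is why every feature value lies between -1 and 1. A NumPy sketch of that transformation follows, with random placeholders standing in for the hidden states and the pooler weights (the real ones come from the checkpoint):

```python
import numpy as np

rng = np.random.default_rng(42)
hidden_size, seq_len = 768, 512

# Placeholder for the last layer's hidden states of one document: (seq_len, hidden_size)
last_hidden_state = rng.normal(size=(seq_len, hidden_size))

# Placeholder pooler parameters (random here, learned in the real model)
W = rng.normal(scale=0.02, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

cls_vector = last_hidden_state[0]        # hidden state of the first token, [CLS]
pooled = np.tanh(W @ cls_vector + b)     # dense layer followed by tanh

print(pooled.shape)
print(bool(np.all(np.abs(pooled) < 1.0)))
```

Other choices, such as averaging the hidden states of all non-padding tokens, would also give one 768-dimensional vector per document; this notebook uses the pooled output.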


In [33]:
# Convert the features list into a numpy array of the extracted features
features = np.array(features)

- Viewing the result:

In [34]:
print('FEATURES: Número de linhas: ' + str(features.shape[0]) + ' Número de colunas: ', str(features.shape[1]))
print("FEATURES é um objeto " + str(type(features)))

FEATURES: Número de linhas: 168 Número de colunas:  768
FEATURES é um objeto <class 'numpy.ndarray'>


In [35]:
features[0]

array([ 3.06312777e-02,  1.49520472e-01,  1.32294476e-01, -2.43207872e-01,
       -3.00680637e-01,  1.93397328e-01,  9.97895956e-01, -2.27563903e-01,
       -1.58239052e-01, -9.89399031e-02, -9.99431074e-01,  2.17316195e-01,
        4.18445561e-03, -1.34307474e-01,  9.09332037e-02, -5.82002476e-02,
        2.44923443e-01, -8.39493349e-02,  9.99427617e-01, -7.62282372e-01,
       -1.94195554e-01, -1.48620486e-01,  2.05351431e-02,  1.76268876e-01,
       -3.32449317e-01,  8.30158964e-02,  7.46030509e-02, -1.56720895e-02,
        2.05082465e-02, -2.42098168e-01, -2.47283816e-01, -9.61611450e-01,
       -9.84005153e-01,  2.13541374e-01, -9.48253721e-02, -1.30424812e-01,
       -8.81861821e-02, -3.26771736e-02,  9.99302626e-01,  1.35206478e-02,
       -1.41007543e-01,  2.04127997e-01, -8.76255810e-01, -1.04388669e-02,
       -1.08732037e-01, -9.22489837e-02, -2.08393455e-01,  3.33597809e-01,
       -1.47162020e-01, -1.59100313e-02, -1.66461058e-02,  1.05229728e-01,
        5.19961156e-02,  

# Adapting the generated vector representation

- Note that so far we have a numpy array called **features**, with 168 rows and 768 columns.

- We will now turn the numpy array back into a dataframe:

In [36]:
df_repvet = pd.DataFrame(features)
repvet_rotulado = pd.concat([df_repvet, novo_df['rotulo']], axis = 1)
repvet_rotulado.shape

(168, 769)
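`pd.concat(..., axis=1)` aligns rows by index label, not by position, so the step above relies on both frames sharing the default `0..n-1` index. A small sketch with made-up values shows what happens when the indices match and when they do not:

```python
import numpy as np
import pandas as pd

features_demo = pd.DataFrame(np.arange(6, dtype=float).reshape(3, 2))   # index 0, 1, 2
labels_ok = pd.Series([0, 1, 2], name="rotulo")                         # index 0, 1, 2
labels_shifted = pd.Series([0, 1, 2], index=[1, 2, 3], name="rotulo")   # misaligned index

aligned = pd.concat([features_demo, labels_ok], axis=1)
misaligned = pd.concat([features_demo, labels_shifted], axis=1)

print(aligned.shape)                        # one extra column, same number of rows
print(misaligned.shape)                     # union of both indices, so a row is added
print(int(misaligned.isna().sum().sum()))   # missing values created by the misalignment
```

If the dataframe had been filtered or shuffled before this point, calling `reset_index(drop=True)` on it first would restore the positional alignment.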

In [37]:
df_repvet

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
0,0.030631,0.149520,0.132294,-0.243208,-0.300681,0.193397,0.997896,-0.227564,-0.158239,-0.098940,...,0.216849,-0.098750,-0.158954,0.123196,-0.038103,0.143516,0.056023,-0.026564,-0.212858,0.044322
1,0.121566,0.264066,0.046265,-0.087665,-0.114877,0.057351,0.914005,-0.251580,-0.094213,-0.042298,...,0.211784,-0.007474,-0.124507,0.747114,0.076023,0.075090,0.006695,-0.047540,-0.213926,-0.121684
2,0.144653,0.043282,0.140579,-0.252127,-0.194191,0.112558,0.992621,-0.204062,-0.121806,0.061342,...,0.231817,-0.152268,-0.139368,0.432061,0.078178,0.094881,-0.054676,-0.029966,-0.069750,-0.010176
3,0.168949,0.107046,-0.021332,-0.106200,-0.136551,-0.020100,0.955896,-0.237744,-0.073313,-0.012520,...,0.293984,-0.169405,-0.054496,0.572342,0.119459,0.050222,-0.172856,-0.062752,-0.100114,-0.045655
4,-0.026465,0.219727,0.280597,-0.274787,-0.095324,0.298268,0.670012,-0.059023,-0.145148,0.173758,...,0.186416,-0.133134,-0.068989,0.866958,0.174362,0.240498,0.040937,-0.179446,-0.036021,-0.071915
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
163,-0.095821,-0.070080,-0.022212,0.098373,-0.134492,0.104006,0.835263,0.039633,0.059085,-0.092660,...,0.131293,-0.293503,0.142235,0.874294,-0.157637,0.023442,-0.113615,-0.094647,-0.006693,-0.043183
164,0.118335,0.049550,0.133240,-0.159245,-0.237414,0.004430,0.960360,-0.194656,0.030932,-0.191618,...,0.185254,-0.316608,-0.030646,0.607250,0.108507,0.123180,-0.066611,0.151763,-0.162843,0.026763
165,0.112267,0.216307,0.156119,-0.369322,-0.217792,0.165465,0.994893,-0.383831,-0.050472,-0.008918,...,0.272632,-0.112599,-0.047830,0.439271,0.101899,0.213281,0.028534,0.023771,-0.207012,0.006931
166,0.072065,0.065555,0.134661,-0.128100,-0.149739,-0.045680,0.991981,-0.264321,-0.064478,-0.154887,...,0.232019,-0.136087,-0.125552,0.331902,0.136607,0.075867,-0.134309,-0.024143,-0.160347,0.059554


In [38]:
repvet_rotulado.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,759,760,761,762,763,764,765,766,767,rotulo
0,0.030631,0.14952,0.132294,-0.243208,-0.300681,0.193397,0.997896,-0.227564,-0.158239,-0.09894,...,-0.09875,-0.158954,0.123196,-0.038103,0.143516,0.056023,-0.026564,-0.212858,0.044322,0
1,0.121566,0.264066,0.046265,-0.087665,-0.114877,0.057351,0.914005,-0.25158,-0.094213,-0.042298,...,-0.007474,-0.124507,0.747114,0.076023,0.07509,0.006695,-0.04754,-0.213926,-0.121684,1
2,0.144653,0.043282,0.140579,-0.252127,-0.194191,0.112558,0.992621,-0.204062,-0.121806,0.061342,...,-0.152268,-0.139368,0.432061,0.078178,0.094881,-0.054676,-0.029966,-0.06975,-0.010176,0


# Creating a structure to store the results

In [39]:
# Classification algorithms
classifiers = [  # NB, KNN, SVM, Random Forest, AdaBoost
    ('Multinomial Naive Bayes', MultinomialNB()),
    ('Complement Naive Bayes Classifier', ComplementNB()),
    ('KNN', KNeighborsClassifier()),  # default n_neighbors is 5
    ('SVM', svm.SVC()),
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('AdaBoost', AdaBoostClassifier(random_state=42)),  # default n_estimators is 50
]

In [40]:
lista_classificador_nome = list()
for classifier_name, classifier in classifiers:
    lista_classificador_nome.append(classifier_name)

In [41]:
df_acc = pd.DataFrame(columns=['Classificador','Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media'])
df_f1 = pd.DataFrame(columns=['Classificador','Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media'])
df_f1_ponderado = pd.DataFrame(columns=['Classificador','Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media'])

In [42]:
for classifier_name, classifier in classifiers:
    nova_linha = pd.DataFrame({'Classificador': [classifier_name], 'Rodada 1':[0] , 'Rodada 2':[0], 'Rodada 3':[0], 'Rodada 4':[0], 'Rodada 5':[0], 'Media':[0]})
    df_acc = pd.concat([df_acc, nova_linha], ignore_index=True)
    df_f1 = pd.concat([df_f1, nova_linha], ignore_index=True)
    df_f1_ponderado = pd.concat([df_f1_ponderado, nova_linha], ignore_index=True)

In [43]:
df_f1

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,Multinomial Naive Bayes,0,0,0,0,0,0
1,Complement Naive Bayes Classifier,0,0,0,0,0,0
2,KNN,0,0,0,0,0,0
3,SVM,0,0,0,0,0,0
4,Random Forest,0,0,0,0,0,0
5,AdaBoost,0,0,0,0,0,0


In [44]:
df_f1_ponderado

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,Multinomial Naive Bayes,0,0,0,0,0,0
1,Complement Naive Bayes Classifier,0,0,0,0,0,0
2,KNN,0,0,0,0,0,0
3,SVM,0,0,0,0,0,0
4,Random Forest,0,0,0,0,0,0
5,AdaBoost,0,0,0,0,0,0


In [45]:
df_acc

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,Multinomial Naive Bayes,0,0,0,0,0,0
1,Complement Naive Bayes Classifier,0,0,0,0,0,0
2,KNN,0,0,0,0,0,0
3,SVM,0,0,0,0,0,0
4,Random Forest,0,0,0,0,0,0
5,AdaBoost,0,0,0,0,0,0


I used **'macro'** as the **average** parameter for the F1 score and **'weighted'** for the weighted F1:

**'weighted'**:

Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

**'macro'**:

Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

See the [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html).
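The difference between the two averages can be checked by hand. The sketch below takes the round-1 confusion matrix that Multinomial Naive Bayes produces further down in this notebook and recomputes both scores in plain Python; the variable names are illustrative.

```python
# Round-1 confusion matrix of Multinomial Naive Bayes (rows = true, columns = predicted)
cm = [[12, 7, 0],
      [3, 1, 4],
      [0, 0, 7]]

n_classes = len(cm)
f1_per_class, support = [], []
for k in range(n_classes):
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                                 # true class k, predicted as another
    fp = sum(cm[r][k] for r in range(n_classes)) - tp    # predicted k, truly another class
    f1_per_class.append(2 * tp / (2 * tp + fp + fn))     # F1 = 2TP / (2TP + FP + FN)
    support.append(sum(cm[k]))                           # number of true instances of class k

# 'macro': plain mean of the per-class scores
f1_macro = sum(f1_per_class) / n_classes
# 'weighted': mean of the per-class scores weighted by support
f1_weighted = sum(f * s for f, s in zip(f1_per_class, support)) / sum(support)

print(round(f1_macro, 4), round(f1_weighted, 4))
```

Because class 0 has the largest support (19 of 34 samples) and an above-average F1, the weighted score comes out higher than the macro score for this round.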

# Applying the classification models

In [46]:
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from sklearn.metrics import precision_recall_fscore_support as score

def evaluate_model(model, X_test, y_test):
    # Predict the labels
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)

    # Accuracy
    acc = accuracy_score(y_test, y_pred)

    # Macro-averaged F1-score
    f1 = f1_score(y_test, y_pred, average='macro')

    # Weighted F1-score
    f1_poderado = f1_score(y_test, y_pred, average='weighted')

    # Other metrics (with average='macro', support is returned as None)
    precision, recall, f1score, support = score(y_test, y_pred, average='macro')
    return cm, acc, f1, precision, recall, f1score, support, f1_poderado

In [47]:
df_repvet

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
0,0.030631,0.149520,0.132294,-0.243208,-0.300681,0.193397,0.997896,-0.227564,-0.158239,-0.098940,...,0.216849,-0.098750,-0.158954,0.123196,-0.038103,0.143516,0.056023,-0.026564,-0.212858,0.044322
1,0.121566,0.264066,0.046265,-0.087665,-0.114877,0.057351,0.914005,-0.251580,-0.094213,-0.042298,...,0.211784,-0.007474,-0.124507,0.747114,0.076023,0.075090,0.006695,-0.047540,-0.213926,-0.121684
2,0.144653,0.043282,0.140579,-0.252127,-0.194191,0.112558,0.992621,-0.204062,-0.121806,0.061342,...,0.231817,-0.152268,-0.139368,0.432061,0.078178,0.094881,-0.054676,-0.029966,-0.069750,-0.010176
3,0.168949,0.107046,-0.021332,-0.106200,-0.136551,-0.020100,0.955896,-0.237744,-0.073313,-0.012520,...,0.293984,-0.169405,-0.054496,0.572342,0.119459,0.050222,-0.172856,-0.062752,-0.100114,-0.045655
4,-0.026465,0.219727,0.280597,-0.274787,-0.095324,0.298268,0.670012,-0.059023,-0.145148,0.173758,...,0.186416,-0.133134,-0.068989,0.866958,0.174362,0.240498,0.040937,-0.179446,-0.036021,-0.071915
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
163,-0.095821,-0.070080,-0.022212,0.098373,-0.134492,0.104006,0.835263,0.039633,0.059085,-0.092660,...,0.131293,-0.293503,0.142235,0.874294,-0.157637,0.023442,-0.113615,-0.094647,-0.006693,-0.043183
164,0.118335,0.049550,0.133240,-0.159245,-0.237414,0.004430,0.960360,-0.194656,0.030932,-0.191618,...,0.185254,-0.316608,-0.030646,0.607250,0.108507,0.123180,-0.066611,0.151763,-0.162843,0.026763
165,0.112267,0.216307,0.156119,-0.369322,-0.217792,0.165465,0.994893,-0.383831,-0.050472,-0.008918,...,0.272632,-0.112599,-0.047830,0.439271,0.101899,0.213281,0.028534,0.023771,-0.207012,0.006931
166,0.072065,0.065555,0.134661,-0.128100,-0.149739,-0.045680,0.991981,-0.264321,-0.064478,-0.154887,...,0.232019,-0.136087,-0.125552,0.331902,0.136607,0.075867,-0.134309,-0.024143,-0.160347,0.059554


- Some algorithms, such as the Naive Bayes variants used here, only accept non-negative values, so the values in the BERT and GPT-2 vector representations need to be normalized:

In [48]:
# scaler = MinMaxScaler()
# df_repvet_pos = scaler.fit_transform(df_repvet)
# scaling_factor = 1000
# df_repvet_pos_i = (df_repvet_pos * scaling_factor).astype(int)
# df_repvet_pos_i = pd.DataFrame(df_repvet_pos_i, columns=df_repvet.columns)

In [49]:
# df_repvet_pos_i.describe()
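The training loop further down applies this normalization per round, fitting the scaler on the training split only. A hand-written NumPy sketch of the same min-max idea, with made-up numbers, shows one consequence: test values that fall outside the training range are mapped outside `[0, 1000]`, possibly below zero.

```python
import numpy as np

X_train = np.array([[-0.30, 0.10],
                    [0.20, 0.90],
                    [0.00, 0.50]])
X_test = np.array([[-0.40, 1.00]])   # deliberately outside the training range

# "Fit": learn the column minimum and range from the training rows only
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min


def min_max(X):
    """Apply the training-set min-max transform to any matrix."""
    return (X - col_min) / col_range


# Same post-processing as in the notebook: scale by 1000 and truncate to integers
train_int = (min_max(X_train) * 1000).astype(int)
test_int = (min_max(X_test) * 1000).astype(int)

print(train_int.min(), train_int.max())   # training values span the full range
print(test_int)                           # test values may leave that range
```

Whether out-of-range test values matter depends on the classifier, so it is worth checking the minimum of the scaled test split when a model requires non-negative input.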

In [50]:
classifiers

[('Multinomial Naive Bayes', MultinomialNB()),
 ('Complement Naive Bayes Classifier', ComplementNB()),
 ('KNN', KNeighborsClassifier()),
 ('SVM', SVC()),
 ('Random Forest', RandomForestClassifier(random_state=42)),
 ('AdaBoost', AdaBoostClassifier(random_state=42))]

In [51]:
classificador=0
for classifier_name, classifier in classifiers:
    print('---', classifier_name, '---')
    y_true = []
    y_pred = []
    contador = 0
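    # An explicit dtype avoids the pandas warning about empty Series
    serie_acc = pd.Series(dtype=float)
    serie_f1 = pd.Series(dtype=float)
    serie_f1_ponderado = pd.Series(dtype=float)

    # Swapping the dataframe (disabled)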
    # if (classifier_name == 'Multinomial Naive Bayes' or classifier_name == "Complement Naive Bayes Classifier"):
    #     df_substituto = df_repvet_pos
    # else:
    #     df_substituto = df_repvet
    # df_substituto= df_repvet_pos_i

    for i in range(0, 5):  # Iterate over the 5 rounds
        train_index = kfold_train[i]
        test_index = kfold_test[i]
        contador +=1


        # Select this round's rows from df_repvet
        X_train_n, X_test_n = df_repvet.iloc[train_index], df_repvet.iloc[test_index]
        y_train, y_test = repvet_rotulado.iloc[train_index]['rotulo'], repvet_rotulado.iloc[test_index]['rotulo']


        # Fit the scaler on the training split only, then apply it to the test split
        scaler = MinMaxScaler()
        df_repvet_pos_treino = scaler.fit_transform(X_train_n)
        df_repvet_pos_teste =  scaler.transform(X_test_n)

        # Scale by 1000 and truncate to integers
        scaling_factor = 1000
        df_repvet_pos_i_treino = (df_repvet_pos_treino * scaling_factor).astype(int)
        X_train = pd.DataFrame(df_repvet_pos_i_treino, columns=df_repvet.columns)

        df_repvet_pos_i_teste = (df_repvet_pos_teste * scaling_factor).astype(int)
        X_test = pd.DataFrame(df_repvet_pos_i_teste, columns=df_repvet.columns)

        # Train the model
        classifier.fit(X_train, y_train)
        cm, acc, f1, precision, recall, f1score, support,f1_poderado = evaluate_model(classifier, X_test, y_test)


        print(classifier_name + " Rodada " + str(contador) )
        print('Matriz de Confusão:')
        print(cm)
        print('Acurácia:', acc)
        print('F1-Score:', f1)
        print("outras métricas:")
        print('precision:', precision)
        print('recall:', recall)
        print('f1score:', f1score)
        print('support:', support)
        print('-------------------------------------')
        # serie_acc = serie_acc.append(pd.Series([acc]))
        serie_acc = pd.concat([serie_acc, pd.Series([acc])])
        # serie_f1 = serie_f1.append(pd.Series([f1]))
        serie_f1 = pd.concat([serie_f1, pd.Series([f1])])
        serie_f1_ponderado = pd.concat([serie_f1_ponderado, pd.Series([f1_poderado])])


    # Model evaluation: append the mean over the 5 rounds to each series
    media_acc = serie_acc[:5].mean()
    media_f1 = serie_f1[:5].mean()
    media_f1_ponderado = serie_f1_ponderado[:5].mean()
    # serie_acc = serie_acc.append(pd.Series([media_acc]))
    # serie_f1 = serie_f1.append(pd.Series([media_f1]))
    serie_acc = pd.concat([serie_acc, pd.Series([media_acc])])
    serie_f1 = pd.concat([serie_f1, pd.Series([media_f1])])
    serie_f1_ponderado = pd.concat([serie_f1_ponderado, pd.Series([media_f1_ponderado])])

    # print("Acurácia: " )
    # print(serie_acc)
    # print("F-1: " )
    # print(serie_f1)
    df_acc.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_acc.values
    df_f1.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_f1.values
    df_f1_ponderado.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_f1_ponderado.values
    classificador+=1
    print("=======================================================================================")
    # cm = confusion_matrix(y_true, y_pred)
    # acc = accuracy_score(y_true, y_pred)

--- Multinomial Naive Bayes ---
              precision    recall  f1-score   support

           0       0.80      0.63      0.71        19
           1       0.12      0.12      0.12         8
           2       0.64      1.00      0.78         7

    accuracy                           0.59        34
   macro avg       0.52      0.59      0.54        34
weighted avg       0.61      0.59      0.58        34

Multinomial Naive Bayes Rodada 1
Matriz de Confusão:
[[12  7  0]
 [ 3  1  4]
 [ 0  0  7]]
Acurácia: 0.5882352941176471
F1-Score: 0.5362200435729848
outras métricas:
precision: 0.5204545454545455
recall: 0.5855263157894737
f1score: 0.5362200435729848
support: None
-------------------------------------
              precision    recall  f1-score   support

           0       0.67      0.74      0.70        19
           1       0.12      0.12      0.12         8
           2       1.00      0.71      0.83         7

    accuracy                           0.59        34
   macro avg 

  serie_acc = pd.Series()
  serie_f1 = pd.Series()
  serie_f1_ponderado = pd.Series()


              precision    recall  f1-score   support

           0       0.67      0.84      0.74        19
           1       0.33      0.12      0.18         8
           2       0.86      0.86      0.86         7

    accuracy                           0.68        34
   macro avg       0.62      0.61      0.59        34
weighted avg       0.63      0.68      0.64        34

Multinomial Naive Bayes Rodada 3
Matriz de Confusão:
[[16  2  1]
 [ 7  1  0]
 [ 1  0  6]]
Acurácia: 0.6764705882352942
F1-Score: 0.5943823618242222
outras métricas:
precision: 0.6190476190476191
recall: 0.6080827067669173
f1score: 0.5943823618242222
support: None
-------------------------------------
              precision    recall  f1-score   support

           0       0.81      0.89      0.85        19
           1       0.33      0.25      0.29         8
           2       0.67      0.67      0.67         6

    accuracy                           0.70        33
   macro avg       0.60      0.60      0.60  

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.64      0.95      0.77        19
           1       0.00      0.00      0.00         8
           2       0.83      0.71      0.77         7

    accuracy                           0.68        34
   macro avg       0.49      0.55      0.51        34
weighted avg       0.53      0.68      0.59        34

Complement Naive Bayes Classifier Rodada 2
Matriz de Confusão:
[[18  0  1]
 [ 8  0  0]
 [ 2  0  5]]
Acurácia: 0.6764705882352942
F1-Score: 0.5117294053464266
outras métricas:
precision: 0.4920634920634921
recall: 0.5538847117794486
f1score: 0.5117294053464266
support: None
-------------------------------------
              precision    recall  f1-score   support

           0       0.69      0.95      0.80        19
           1       0.00      0.00      0.00         8
           2       0.75      0.86      0.80         7

    accuracy                           0.71        34
   macro avg       0.48      0.60  



              precision    recall  f1-score   support

           0       0.73      1.00      0.84        19
           1       0.00      0.00      0.00         8
           2       0.57      0.67      0.62         6

    accuracy                           0.70        33
   macro avg       0.43      0.56      0.49        33
weighted avg       0.52      0.70      0.60        33

Complement Naive Bayes Classifier Rodada 4
Matriz de Confusão:
[[19  0  0]
 [ 5  0  3]
 [ 2  0  4]]
Acurácia: 0.696969696969697
F1-Score: 0.4866096866096865
outras métricas:
precision: 0.434065934065934
recall: 0.5555555555555555
f1score: 0.4866096866096865
support: None
-------------------------------------
              precision    recall  f1-score   support

           0       0.68      0.94      0.79        18
           1       0.00      0.00      0.00         9
           2       0.62      0.83      0.71         6

    accuracy                           0.67        33
   macro avg       0.44      0.59    



              precision    recall  f1-score   support

           0       0.83      1.00      0.90        19
           1       1.00      0.12      0.22         8
           2       0.70      1.00      0.82         7

    accuracy                           0.79        34
   macro avg       0.84      0.71      0.65        34
weighted avg       0.84      0.79      0.73        34

KNN Rodada 1
Matriz de Confusão:
[[19  0  0]
 [ 4  1  3]
 [ 0  0  7]]
Acurácia: 0.7941176470588235
F1-Score: 0.6501711795829442
outras métricas:
precision: 0.8420289855072465
recall: 0.7083333333333334
f1score: 0.6501711795829442
support: None
-------------------------------------
              precision    recall  f1-score   support

           0       0.68      1.00      0.81        19
           1       0.00      0.00      0.00         8
           2       1.00      0.86      0.92         7

    accuracy                           0.74        34
   macro avg       0.56      0.62      0.58        34
weighted av



              precision    recall  f1-score   support

           0       0.67      0.95      0.78        19
           1       0.00      0.00      0.00         8
           2       1.00      0.86      0.92         7

    accuracy                           0.71        34
   macro avg       0.56      0.60      0.57        34
weighted avg       0.58      0.71      0.63        34

KNN Rodada 3
Matriz de Confusão:
[[18  1  0]
 [ 8  0  0]
 [ 1  0  6]]
Acurácia: 0.7058823529411765
F1-Score: 0.568561872909699
outras métricas:
precision: 0.5555555555555555
recall: 0.6015037593984962
f1score: 0.568561872909699
support: None
-------------------------------------
              precision    recall  f1-score   support

           0       0.72      0.95      0.82        19
           1       1.00      0.12      0.22         8
           2       0.57      0.67      0.62         6

    accuracy                           0.70        33
   macro avg       0.76      0.58      0.55        33
weighted avg 



              precision    recall  f1-score   support

           0       0.68      1.00      0.81        19
           1       0.00      0.00      0.00         8
           2       1.00      0.71      0.83         7

    accuracy                           0.71        34
   macro avg       0.56      0.57      0.55        34
weighted avg       0.59      0.71      0.62        34

SVM Rodada 2
Matriz de Confusão:
[[19  0  0]
 [ 8  0  0]
 [ 1  1  5]]
Acurácia: 0.7058823529411765
F1-Score: 0.5472813238770685
outras métricas:
precision: 0.5595238095238095
recall: 0.5714285714285715
f1score: 0.5472813238770685
support: None
-------------------------------------
              precision    recall  f1-score   support

           0       0.68      1.00      0.81        19
           1       0.00      0.00      0.00         8
           2       1.00      0.86      0.92         7

    accuracy                           0.74        34
   macro avg       0.56      0.62      0.58        34
weighted av



              precision    recall  f1-score   support

           0       0.70      1.00      0.83        19
           1       0.00      0.00      0.00         8
           2       0.67      0.67      0.67         6

    accuracy                           0.70        33
   macro avg       0.46      0.56      0.50        33
weighted avg       0.53      0.70      0.60        33

SVM Rodada 4
Matriz de Confusão:
[[19  0  0]
 [ 6  0  2]
 [ 2  0  4]]
Acurácia: 0.696969696969697
F1-Score: 0.4975845410628019
outras métricas:
precision: 0.4567901234567901
recall: 0.5555555555555555
f1score: 0.4975845410628019
support: None
-------------------------------------
              precision    recall  f1-score   support

           0       0.67      1.00      0.80        18
           1       0.00      0.00      0.00         9
           2       0.83      0.83      0.83         6

    accuracy                           0.70        33
   macro avg       0.50      0.61      0.54        33
weighted avg



--- Random Forest ---
              precision    recall  f1-score   support

           0       0.78      0.95      0.86        19
           1       0.00      0.00      0.00         8
           2       0.70      1.00      0.82         7

    accuracy                           0.74        34
   macro avg       0.49      0.65      0.56        34
weighted avg       0.58      0.74      0.65        34

Random Forest Rodada 1
Matriz de Confusão:
[[18  1  0]
 [ 5  0  3]
 [ 0  0  7]]
Acurácia: 0.7352941176470589
F1-Score: 0.5602240896358542
outras métricas:
precision: 0.4942028985507247
recall: 0.6491228070175438
f1score: 0.5602240896358542
support: None
-------------------------------------
              precision    recall  f1-score   support

           0       0.67      0.95      0.78        19
           1       0.00      0.00      0.00         8
           2       1.00      0.71      0.83         7

    accuracy                           0.68        34
   macro avg       0.56      0.55



              precision    recall  f1-score   support

           0       0.78      0.95      0.86        19
           1       0.20      0.12      0.15         8
           2       0.67      0.57      0.62         7

    accuracy                           0.68        34
   macro avg       0.55      0.55      0.54        34
weighted avg       0.62      0.68      0.64        34

AdaBoost Rodada 1
Matriz de Confusão:
[[18  1  0]
 [ 5  1  2]
 [ 0  3  4]]
Acurácia: 0.6764705882352942
F1-Score: 0.5421245421245421
outras métricas:
precision: 0.5497584541062802
recall: 0.5479323308270676
f1score: 0.5421245421245421
support: None
-------------------------------------
              precision    recall  f1-score   support

           0       0.60      0.63      0.62        19
           1       0.00      0.00      0.00         8
           2       1.00      0.29      0.44         7

    accuracy                           0.41        34
   macro avg       0.53      0.31      0.35        34
weight
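
The `_warn_prf(...)` lines in the output above come from scikit-learn's internal helper that raises `UndefinedMetricWarning`. Here it fires whenever a fold has no predictions for class 1, so that class's precision has a zero denominator and is reported as 0.00. The sketch below is not part of the notebook: it is a dependency-free illustration (the function name `metrics_from_confusion` is an assumption) that rebuilds the printed metrics from the "Complement Naive Bayes Classifier Rodada 2" confusion matrix, treating every zero-denominator ratio as 0.0.

```python
def metrics_from_confusion(cm):
    """Accuracy plus macro/weighted scores from a square confusion matrix.

    Rows are true classes and columns are predicted classes. Any ratio
    with a zero denominator is set to 0.0.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    support = [sum(cm[i]) for i in range(n)]                         # true count per class
    predicted = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # predicted count per class

    def ratio(num, den):
        return num / den if den else 0.0

    precision = [ratio(cm[k][k], predicted[k]) for k in range(n)]
    recall = [ratio(cm[k][k], support[k]) for k in range(n)]
    f1 = [ratio(2 * p * r, p + r) for p, r in zip(precision, recall)]

    return {
        "accuracy": ratio(sum(cm[k][k] for k in range(n)), total),
        "macro_precision": sum(precision) / n,
        "macro_recall": sum(recall) / n,
        "macro_f1": sum(f1) / n,
        "weighted_f1": ratio(sum(f * s for f, s in zip(f1, support)), total),
    }


# Confusion matrix printed for "Complement Naive Bayes Classifier Rodada 2".
cm_cnb_r2 = [[18, 0, 1],
             [8, 0, 0],
             [2, 0, 5]]

metrics = metrics_from_confusion(cm_cnb_r2)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Class 1 contributes 0 to every macro average, which is why the macro F1 (about 0.51) sits well below the accuracy (about 0.68) for that fold. If the warning itself is unwanted, `classification_report` and `precision_recall_fscore_support` accept a `zero_division` argument; passing `zero_division=0` should keep the same numbers without the warning.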

In [52]:
df_acc

|   | Classificador | Rodada 1 | Rodada 2 | Rodada 3 | Rodada 4 | Rodada 5 | Media |
|---|---|---|---|---|---|---|---|
| 0 | Multinomial Naive Bayes | 0.588235 | 0.588235 | 0.676471 | 0.69697 | 0.636364 | 0.637255 |
| 1 | Complement Naive Bayes Classifier | 0.764706 | 0.676471 | 0.705882 | 0.69697 | 0.666667 | 0.702139 |
| 2 | KNN | 0.794118 | 0.735294 | 0.705882 | 0.69697 | 0.636364 | 0.713725 |
| 3 | SVM | 0.735294 | 0.705882 | 0.735294 | 0.69697 | 0.69697 | 0.714082 |
| 4 | Random Forest | 0.735294 | 0.676471 | 0.735294 | 0.69697 | 0.575758 | 0.683957 |
| 5 | AdaBoost | 0.676471 | 0.411765 | 0.647059 | 0.727273 | 0.575758 | 0.607665 |
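
The `serie_acc = pd.Series()` lines echoed in the output are most likely the tail of a pandas warning about creating an empty Series without an explicit `dtype`. One way to avoid both the warning and the repeated `pd.concat` calls is to collect the fold scores in a plain list and write the whole row at once. The sketch below assumes that approach; `row_with_mean` and `df_demo` are hypothetical names, and the scores are the Multinomial Naive Bayes accuracies from the table above.

```python
import pandas as pd

COLUMNS = ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']


def row_with_mean(scores):
    """Hypothetical helper: the five fold scores followed by their mean."""
    if len(scores) != 5:
        raise ValueError("expected one score per fold (5 folds)")
    return list(scores) + [sum(scores) / len(scores)]


# Illustrative results table with a single classifier row (index 0).
df_demo = pd.DataFrame(index=[0], columns=COLUMNS, dtype=float)

# Accuracies of Multinomial Naive Bayes, copied from df_acc.
acc_per_round = [0.588235, 0.588235, 0.676471, 0.69697, 0.636364]
df_demo.loc[0, COLUMNS] = row_with_mean(acc_per_round)

print(df_demo.round(6))
```

If an empty Series is still preferred as the accumulator, `pd.Series(dtype=float)` states the dtype explicitly and should silence the warning.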


In [53]:
df_f1

|   | Classificador | Rodada 1 | Rodada 2 | Rodada 3 | Rodada 4 | Rodada 5 | Media |
|---|---|---|---|---|---|---|---|
| 0 | Multinomial Naive Bayes | 0.53622 | 0.552778 | 0.594382 | 0.600794 | 0.606725 | 0.57818 |
| 1 | Complement Naive Bayes Classifier | 0.560847 | 0.511729 | 0.533333 | 0.48661 | 0.501661 | 0.518836 |
| 2 | KNN | 0.650171 | 0.577196 | 0.568562 | 0.55193 | 0.572962 | 0.584164 |
| 3 | SVM | 0.560224 | 0.547281 | 0.577196 | 0.497585 | 0.544444 | 0.545346 |
| 4 | Random Forest | 0.560224 | 0.538647 | 0.641026 | 0.503704 | 0.489744 | 0.546669 |
| 5 | AdaBoost | 0.542125 | 0.353276 | 0.500107 | 0.69788 | 0.616254 | 0.541928 |


In [54]:
df_f1_ponderado

|   | Classificador | Rodada 1 | Rodada 2 | Rodada 3 | Rodada 4 | Rodada 5 | Media |
|---|---|---|---|---|---|---|---|
| 0 | Multinomial Naive Bayes | 0.584006 | 0.592157 | 0.63512 | 0.67987 | 0.621611 | 0.622553 |
| 1 | Complement Naive Bayes Classifier | 0.665733 | 0.586406 | 0.611765 | 0.598083 | 0.56116 | 0.604629 |
| 2 | KNN | 0.72744 | 0.64186 | 0.627385 | 0.636835 | 0.592586 | 0.645221 |
| 3 | SVM | 0.648542 | 0.623383 | 0.64186 | 0.596838 | 0.587879 | 0.6197 |
| 4 | Random Forest | 0.648542 | 0.608909 | 0.684163 | 0.607407 | 0.521678 | 0.61414 |
| 5 | AdaBoost | 0.641888 | 0.435395 | 0.613956 | 0.733742 | 0.586808 | 0.602358 |
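
The `Media` column hides how much each classifier varies between rounds. As a minimal sketch (not part of the original notebook), the per-round macro F1 values from `df_f1` can be summarised with their mean and sample standard deviation and then ranked. The dictionary below is copied by hand from that table, and the variable names are illustrative.

```python
from statistics import mean, stdev

# Macro F1 per round (Rodada 1..5), copied from df_f1.
f1_macro = {
    "Multinomial Naive Bayes": [0.53622, 0.552778, 0.594382, 0.600794, 0.606725],
    "Complement Naive Bayes Classifier": [0.560847, 0.511729, 0.533333, 0.48661, 0.501661],
    "KNN": [0.650171, 0.577196, 0.568562, 0.55193, 0.572962],
    "SVM": [0.560224, 0.547281, 0.577196, 0.497585, 0.544444],
    "Random Forest": [0.560224, 0.538647, 0.641026, 0.503704, 0.489744],
    "AdaBoost": [0.542125, 0.353276, 0.500107, 0.69788, 0.616254],
}

# (name, mean, sample standard deviation), best mean first.
summary = sorted(
    ((name, mean(scores), stdev(scores)) for name, scores in f1_macro.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, avg, spread in summary:
    print(f"{name:<35} mean={avg:.4f}  std={spread:.4f}")
```

Ranked by mean macro F1, KNN comes first (0.584164), whereas `df_acc` puts SVM marginally ahead on accuracy (0.714082 against 0.713725 for KNN). With only five rounds and test folds of 33 or 34 samples, differences of this size are small relative to the round-to-round spread, so the ranking is best read as indicative.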
