# TRL band classification - multilingual BERT + ML models

- bert-base-multilingual-cased

## Code version 1.0 - 21 Oct 2023

- Classifiers used: *ML models*
- Confusion matrix
- Seed 42

- This code adopts the simplification proposed in the dissertation: we first obtain the vector representation (RV) and only then run the k-fold. This saves computation, since BERT is applied to each text only once.

In [1]:
# dataset.csv   or  dataset_pre_processado_1.csv  or  dataset_pre_processado_stem_2.csv
#     CSV1                  CSV2                                   CSV3
dataset = "dataset.csv"

In [2]:
print("Reminder: we are using the dataset: " + dataset)

Reminder: we are using the dataset: dataset.csv


In [3]:
melhor_modelo = 'best_model_TRL_bert_base_ml_' + '.bin'
melhor_modelo

'best_model_TRL_bert_base_ml_.bin'

- Maximum number of tokens = 242

In [4]:
MAX_LEN = 242

## Setup

### Libraries and environment

In [5]:
!pip install -qq transformers

In [6]:
!pip install -q -U watermark


In [7]:
!nvidia-smi

Tue Dec 12 00:14:01 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [8]:
import transformers
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch

import numpy as np
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from sklearn.model_selection import train_test_split
from collections import defaultdict
import textwrap

from tqdm import tqdm
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)

HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]

sns.set_palette(sns.color_palette(HAPPY_COLORS_PALETTE))

rcParams['figure.figsize'] = 12, 8

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cuda', index=0)

In [9]:
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn import naive_bayes, svm
from sklearn.naive_bayes import ComplementNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.preprocessing import MinMaxScaler

In [10]:
%reload_ext watermark
%watermark -v -p numpy,pandas,torch,transformers

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0

numpy       : 1.23.5
pandas      : 1.5.3
torch       : 2.1.0+cu118
transformers: 4.35.2



# K-fold partitions

- For more details, see the notebook *1-kfold.ipynb*, in which the partitions were drawn. The cells below recreate the result obtained there.

In [11]:
array1 = np.array([  0,   1,   2,   3,   5,   7,   9,  10,  11,  12,  13,  14,  15,
         16,  17,  19,  20,  21,  22,  23,  25,  26,  27,  28,  30,  32,
         33,  34,  35,  36,  37,  39,  40,  44,  45,  46,  47,  48,  49,
         50,  52,  53,  54,  55,  56,  57,  59,  60,  61,  63,  64,  66,
         68,  69,  70,  71,  72,  75,  76,  77,  78,  79,  80,  81,  82,
         83,  84,  85,  87,  91,  92,  93,  94,  96,  97,  98,  99, 100,
        102, 103, 104, 105, 106, 108, 109, 110, 112, 113, 114, 115, 116,
        117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 129, 130,
        131, 132, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 145,
        146, 148, 150, 151, 153, 154, 155, 157, 158, 159, 160, 161, 163,
        164, 165, 166, 167])

array2 = np.array([  0,   1,   3,   4,   5,   6,   7,   8,  10,  13,  14,  15,  16,
         17,  18,  19,  20,  22,  23,  24,  25,  26,  27,  29,  30,  31,
         34,  36,  37,  38,  40,  41,  42,  43,  44,  45,  47,  49,  51,
         52,  53,  54,  55,  57,  58,  59,  60,  61,  62,  63,  64,  65,
         66,  67,  68,  69,  70,  71,  73,  74,  75,  76,  77,  80,  83,
         84,  86,  88,  89,  90,  92,  93,  94,  95,  96,  97,  98,  99,
        100, 101, 102, 104, 105, 107, 109, 110, 111, 112, 113, 114, 115,
        116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128,
        129, 130, 132, 133, 135, 136, 137, 138, 139, 142, 143, 144, 145,
        147, 149, 150, 151, 152, 153, 154, 155, 156, 157, 160, 161, 162,
        163, 165, 166, 167])
array3 = np.array([  1,   2,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,
         17,  18,  20,  21,  22,  23,  24,  25,  28,  29,  30,  31,  32,
         33,  35,  36,  37,  38,  39,  41,  42,  43,  44,  45,  46,  47,
         48,  50,  51,  52,  54,  55,  56,  57,  58,  59,  62,  63,  65,
         67,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,
         82,  84,  85,  86,  87,  88,  89,  90,  91,  93,  95,  96,  97,
         98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 111,
        112, 113, 114, 118, 120, 121, 122, 123, 124, 126, 128, 129, 131,
        132, 133, 134, 135, 136, 138, 139, 140, 141, 142, 144, 146, 147,
        148, 149, 150, 151, 152, 153, 155, 156, 158, 159, 160, 161, 162,
        163, 164, 165, 166])
array4 = np.array([  0,   2,   3,   4,   6,   8,   9,  10,  11,  12,  15,  16,  18,
         19,  20,  21,  22,  24,  25,  26,  27,  28,  29,  31,  32,  33,
         34,  35,  37,  38,  39,  40,  41,  42,  43,  45,  46,  47,  48,
         49,  50,  51,  53,  56,  57,  58,  60,  61,  62,  63,  64,  65,
         66,  67,  68,  69,  71,  72,  73,  74,  75,  76,  78,  79,  80,
         81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,
         94,  95,  98, 100, 101, 103, 104, 106, 107, 108, 109, 110, 111,
        112, 114, 115, 116, 117, 119, 121, 123, 124, 125, 127, 128, 129,
        130, 131, 133, 134, 135, 137, 139, 140, 141, 142, 143, 144, 145,
        146, 147, 148, 149, 150, 152, 153, 154, 155, 156, 157, 158, 159,
        162, 163, 164, 166, 167])
array5 = np.array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  11,  12,  13,
         14,  15,  16,  17,  18,  19,  21,  23,  24,  26,  27,  28,  29,
         30,  31,  32,  33,  34,  35,  36,  38,  39,  40,  41,  42,  43,
         44,  46,  48,  49,  50,  51,  52,  53,  54,  55,  56,  58,  59,
         60,  61,  62,  64,  65,  66,  67,  68,  69,  70,  72,  73,  74,
         77,  78,  79,  81,  82,  83,  85,  86,  87,  88,  89,  90,  91,
         92,  94,  95,  96,  97,  99, 101, 102, 103, 105, 106, 107, 108,
        110, 111, 113, 115, 116, 117, 118, 119, 120, 122, 125, 126, 127,
        128, 130, 131, 132, 133, 134, 136, 137, 138, 140, 141, 143, 144,
        145, 146, 147, 148, 149, 151, 152, 154, 156, 157, 158, 159, 160,
        161, 162, 164, 165, 167])
# Build the list of training folds
kfold_train = [array1, array2, array3, array4, array5]

In [12]:
# Create the five numpy arrays with the test indices
array1 = np.array([  4,   6,   8,  18,  24,  29,  31,  38,  41,  42,  43,  51,  58, 62,  65,  67,  73,  74,  86,  88,  89,  90,  95, 101, 107, 111, 128, 133, 144, 147, 149, 152, 156, 162])
array2 = np.array([  2,   9,  11,  12,  21,  28,  32,  33,  35,  39,  46,  48,  50, 56,  72,  78,  79,  81,  82,  85,  87,  91, 103, 106, 108, 131, 134, 140, 141, 146, 148, 158, 159, 164])
array3 = np.array([  0,   3,  15,  16,  19,  26,  27,  34,  40,  49,  53,  60,  61, 64,  66,  68,  69,  83,  92,  94, 110, 115, 116, 117, 119, 125, 127, 130, 137, 143, 145, 154, 157, 167])
array4 = np.array([  1,   5,   7,  13,  14,  17,  23,  30,  36,  44,  52,  54,  55, 59,  70,  77,  96,  97,  99, 102, 105, 113, 118, 120, 122, 126, 132, 136, 138, 151, 160, 161, 165])
array5 = np.array([ 10,  20,  22,  25,  37,  45,  47,  57,  63,  71,  75,  76,  80,  84,  93,  98, 100, 104, 109, 112, 114, 121, 123, 124, 129, 135, 139, 142, 150, 153, 155, 163, 166])


# Build the list of test folds
kfold_test = [array1, array2, array3, array4, array5]
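
As a sanity check on hard-coded partitions like the ones above, the following is a minimal sketch (not part of the original notebook) that verifies, for each round, that the training and test indices are disjoint and together cover every row, and that each row appears in exactly one test fold. The helper name `check_folds` and the toy folds are hypothetical; with the notebook's data one would call `check_folds(kfold_train, kfold_test, 168)`.

```python
import numpy as np


def check_folds(train_folds, test_folds, n_samples):
    """Raise AssertionError if the folds are not a valid k-fold partition."""
    all_indices = np.arange(n_samples)
    seen_in_test = []
    for train_idx, test_idx in zip(train_folds, test_folds):
        # Train and test of the same round must not share any index
        assert np.intersect1d(train_idx, test_idx).size == 0
        # Together they must cover every sample exactly once
        combined = np.sort(np.concatenate([train_idx, test_idx]))
        assert np.array_equal(combined, all_indices)
        seen_in_test.append(test_idx)
    # Across rounds, every sample appears in exactly one test fold
    assert np.array_equal(np.sort(np.concatenate(seen_in_test)), all_indices)
    return True


# Toy example with 6 samples and 3 folds (hypothetical data)
toy_test = [np.array([0, 3]), np.array([1, 4]), np.array([2, 5])]
toy_train = [np.setdiff1d(np.arange(6), t) for t in toy_test]
print(check_folds(toy_train, toy_test, 6))
```

Running such a check once, right after the arrays are defined, catches copy-and-paste mistakes before any model is trained.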

# Dataset

In [13]:
df = pd.read_csv(dataset)

In [14]:
novo_df = df[["resumo", "rotulo"]].copy()  # explicit copy, so the assignment below does not trigger SettingWithCopyWarning

In [15]:
def adapto_faixa(faixa):
    # Shift the labels by one so that they start at 0
    return faixa - 1

In [16]:
novo_df['rotulo'] = novo_df.rotulo.apply(adapto_faixa)


In [17]:
novo_df.tail()

Unnamed: 0,resumo,rotulo
163,Rio de Janeiro (RJ) – O Centro de Avaliações d...,2
164,Este trabalho apresenta um sistema para contro...,2
165,No contexto das comunicações táticas baseadas ...,2
166,O valor da velocidade de alvos móveis em image...,0
167,Neste artigo é apresentada a análise da seção ...,0


In [18]:
class_names = ['Faixa 1', 'Faixa 2', 'Faixa 3']

### BERT multilingual base model (cased)

- https://huggingface.co/bert-base-multilingual-cased

```
@article{DBLP:journals/corr/abs-1810-04805,
  author    = {Jacob Devlin and
               Ming{-}Wei Chang and
               Kenton Lee and
               Kristina Toutanova},
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
               Understanding},
  journal   = {CoRR},
  volume    = {abs/1810.04805},
  year      = {2018},
  url       = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint    = {1810.04805},
  timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

```


In [19]:
PRE_TRAINED_MODEL_NAME = 'bert-base-multilingual-cased'
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

In [20]:
bert_model = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)

model.safetensors:   0%|          | 0.00/714M [00:00<?, ?B/s]

In [21]:
frase = 'A avaliação de prontidão tecnológica em tecnologias de interesse militar'

In [22]:
encoding = tokenizer.encode_plus(
  frase,
  max_length=20,   # Note that this cuts one token from the sentence above
  truncation=True,  # Truncate explicitly to max_length
  add_special_tokens=True, # Add '[CLS]' and '[SEP]'
  return_token_type_ids=False,
  padding='max_length',  # Replaces the deprecated pad_to_max_length=True
  return_attention_mask=True,
  return_tensors='pt',  # Return PyTorch tensors
)

encoding.keys()

dict_keys(['input_ids', 'attention_mask'])
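
To make the effect of `max_length`, padding, and truncation concrete, here is a minimal pure-Python sketch of what happens to a list of token ids. It is an illustration only, not the tokenizer's actual implementation: the function name `pad_or_truncate`, the example ids, and the choice to keep the final id (standing in for `[SEP]`) on truncation are assumptions. The pad id 0 matches `pad_token_id` in the model configuration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    ids = list(token_ids)
    if len(ids) > max_length:
        # Keep the last id (the [SEP] stand-in) and drop the tokens before it
        ids = ids[:max_length - 1] + ids[-1:]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    # Pad on the right; padded positions get attention 0
    return ids + [pad_id] * padding, mask + [0] * padding


short = [101, 7, 8, 9, 102]          # made-up ids; 101/102 stand in for [CLS]/[SEP]
long = [101, 1, 2, 3, 4, 5, 6, 102]

print(pad_or_truncate(short, 6))  # ([101, 7, 8, 9, 102, 0], [1, 1, 1, 1, 1, 0])
print(pad_or_truncate(long, 6))   # ([101, 1, 2, 3, 4, 102], [1, 1, 1, 1, 1, 1])
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed to BERT together with the ids.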

- Model information:

In [23]:
bert_model = bert_model.to(device)

In [24]:
bert_model.config.hidden_size

768

In [25]:
bert_model.config

BertConfig {
  "_name_or_path": "bert-base-multilingual-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 119547
}

# Applying the model to generate the vector representation

- Useful tutorials:

```
1 - https://towardsdatascience.com/feature-extraction-with-bert-for-text-classification-533dde44dc2f
```

```
2 - https://github.com/felipemaiapolo/legalnlp/blob/main/demo/BERT/BERT_TUTORIAL.ipynb
```

*I strongly recommend the second one: it is a notebook, well written, that comments on each step and explains BERT. It is also in Portuguese (quality material of this kind is rare in our language).*

### Converting the data to tensors:

In [26]:
# Text wrapper for display, and the abstracts as a list
wrapper = textwrap.TextWrapper()
data_resumo = list(novo_df['resumo'])

- Viewing the first two abstracts:

In [27]:
for text in data_resumo[:2]:
  print(wrapper.fill(text))
  print()

O crescente emprego de mísseis de ombro infravermelhos contra alvos
aéreos demanda a utilização de contramedidas cada vez mais modernas e
eficientes. Neste cenário, surge o Directed Infrared Countermeasure
(DIRCM), cujo objetivo é interferir no guiamento do míssil por meio de
pulsos de laser. Neste artigo, um seeker infravermelho do tipo rising
sun é modelado e simulado, sendo os efeitos da emissão de um DIRCM no
processamento do sinal avaliados. A influência de parâmetros de
frequência de repetição de pulsos e intensidade do laser são
evidenciados. Os resultados obtidos ressaltam a importância e a
necessidade do desenvolvimento de ferramentas computacionais mais
complexas, visando ao desenvolvimento da doutrina de emprego deste
tipo de contramedida.

Materiais dielétricos com baixas perdas e alta permissividade são
componentes essenciais que são utilizados em linhas de transmissões
não lineares capacitivas (LTNLs) na geração de RF. LTNLs possuem
grande potencial para gerar ondas de só

In [28]:
# Apply the BERT tokenizer with the maximum length defined above
encoded_inputs = tokenizer(data_resumo, padding=True, truncation=True, max_length=MAX_LEN, return_tensors="pt")


# encoded_inputs is a dictionary with 3 keys
encoded_inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

- Viewing how the abstract was tokenized, i.e., the 'input_ids' key of encoded_inputs:

In [29]:
# View the first text after applying the tokenizer
print(encoded_inputs['input_ids'][0])

print()

# Show the same text decoded
print(wrapper.fill(tokenizer.decode(encoded_inputs['input_ids'][0])))

tensor([   101,    152,  83892,  10266,  98361,  10104,  83426,  12818,  10291,
         10104,  10209,  20923,  10106,  31162,  12563,  19390,  21948,  11473,
         10164,  15404,  88805,  10107,  40298,    169,  28518,  10922,  10104,
         11473,  96092,  11205,  11782,  11675,  10614, 104416,    173,  56331,
         49684,  10107,    119,  49022,  10794,  70271,    117,  69824,    183,
         52066,  10336,  84250,  85860,  41947,  47394,    113, 110014,  52932,
         11517,    114,    117,  61807,  23518,    263,  22021,  59908,  10129,
         10192,  75980,  89185,  10149,  83426,  28377,  10161,  10183,  25598,
         10104,  34597,  11747,  10310,  10104,  43136,    119,  49022,  52686,
           117,  10293,  48394,  10165,  10106,  31162,  12563,  19390,  10758,
         10149,  13113,  53816,  42230,    263,  13192,  11272,    173,  92304,
         98444,    117,  14085,  10427,  77989,  10143,  10266,  74489,  10104,
         10293, 110014,  52932,  11517, 

In [30]:
print(encoded_inputs['input_ids'].shape)  # batch size x seq length
print(encoded_inputs['attention_mask'].shape)

torch.Size([168, 242])
torch.Size([168, 242])


In [31]:
# Move the tensors to the selected device (the GPU, when available) for acceleration
input_ids = encoded_inputs['input_ids'].to(device)
attention_mask = encoded_inputs['attention_mask'].to(device)

In [32]:
# Create the list of features
features = []

- Applying BERT to extract the features

In [33]:
for i in tqdm(range(len(data_resumo))):

    with torch.no_grad():
        # Index [1] of the BertModel output is the pooled [CLS] representation (pooler_output), not the last hidden states
        pooled_output = bert_model(input_ids[i:(i+1)],attention_mask[i:(i+1)])[1].cpu().numpy().reshape(-1).tolist()
        # pooled_output = bert_model(input_ids[i:(i+1)])[1].cpu().numpy().reshape(-1).tolist()
    features.append(pooled_output)

100%|██████████| 168/168 [00:06<00:00, 24.60it/s]
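
Index `[1]` of the `BertModel` output is the pooled output: the final hidden state of the first token (`[CLS]`) passed through a dense layer and a tanh, which is why every extracted feature lies between -1 and 1. A minimal numpy sketch of that step follows, with random weights standing in for the trained pooler parameters (the names `pool_first_token`, `W`, and `b` are hypothetical, and the sizes are toy values).

```python
import numpy as np


def pool_first_token(hidden_states, W, b):
    """hidden_states: (batch, seq_len, hidden). Returns (batch, hidden)."""
    first_token = hidden_states[:, 0, :]       # hidden state at the [CLS] position
    return np.tanh(first_token @ W.T + b)      # dense layer followed by tanh


rng = np.random.default_rng(42)
hidden = 8                                       # toy size; the model above uses 768
hidden_states = rng.normal(size=(2, 5, hidden))  # 2 texts, 5 tokens each
W = rng.normal(size=(hidden, hidden))            # random stand-in for trained weights
b = np.zeros(hidden)

pooled = pool_first_token(hidden_states, W, b)
print(pooled.shape)  # (2, 8)
```

Each text therefore ends up as a single fixed-size vector, regardless of how many tokens it contains.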


In [34]:
# Convert the list of extracted features to a numpy array
features = np.array(features)

- Viewing the result:

In [35]:
print('FEATURES: number of rows: ' + str(features.shape[0]) + ' number of columns: ' + str(features.shape[1]))
print("FEATURES is an object of type " + str(type(features)))

FEATURES: number of rows: 168 number of columns: 768
FEATURES is an object of type <class 'numpy.ndarray'>


In [36]:
features[0]

array([ 3.14709902e-01, -6.01513907e-02,  1.65918082e-01, -3.74656707e-01,
       -1.38524979e-01,  2.77977914e-01,  2.35641122e-01,  3.21511835e-01,
       -4.38854843e-01,  4.30341095e-01, -1.45486116e-01, -2.25202709e-01,
       -1.52975753e-01, -3.04808170e-01,  1.69529811e-01, -4.04350311e-01,
        7.44322836e-01,  2.29949981e-01,  1.23505145e-01, -2.93096393e-01,
       -9.99990880e-01, -2.73889184e-01, -3.26407909e-01, -2.10124314e-01,
       -5.54806471e-01,  1.24046460e-01, -7.15522766e-02,  2.40168944e-01,
        3.40399474e-01, -1.03946477e-01,  8.95204619e-02, -9.99991536e-01,
        7.39571571e-01,  6.27558470e-01,  3.83014560e-01, -2.91030496e-01,
        1.66120872e-01,  2.96730340e-01,  2.31582478e-01, -4.38912958e-01,
       -1.14710115e-01,  1.48232475e-01,  3.19611207e-02,  2.43096009e-01,
       -2.26232499e-01, -3.33578408e-01, -2.09190220e-01,  2.46220812e-01,
       -5.17164648e-01,  3.83983642e-01, -2.20797975e-02,  2.83197999e-01,
        3.95977229e-01,  

# Adapting the generated vector representation

- Note that, up to this point, we have a numpy array called **features**, with 168 rows and 768 columns.

- We now turn the numpy array back into a dataframe:

In [37]:
df_repvet = pd.DataFrame(features)
repvet_rotulado = pd.concat([df_repvet, novo_df['rotulo']], axis = 1)
repvet_rotulado.shape

(168, 769)

In [38]:
df_repvet

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
0,0.314710,-0.060151,0.165918,-0.374657,-0.138525,0.277978,0.235641,0.321512,-0.438855,0.430341,...,0.105117,0.498755,0.022556,0.859591,0.318669,0.201685,0.343222,-0.387592,0.251425,0.337813
1,0.305746,-0.165856,0.165211,-0.228474,-0.045061,0.341329,0.194617,0.373827,-0.474146,0.420907,...,0.152223,0.504627,0.061583,0.899594,0.348450,0.298324,0.341645,-0.446765,0.269141,0.344139
2,0.325719,-0.156187,0.238672,-0.384598,-0.150140,0.400224,0.182729,0.322785,-0.483749,0.408693,...,0.195706,0.488232,0.050863,0.961977,0.343856,0.229315,0.313479,-0.474858,0.181446,0.221576
3,0.260132,-0.067558,0.105033,-0.219390,-0.114704,0.204282,0.233667,0.299719,-0.340289,0.345567,...,0.019989,0.366665,-0.036615,0.758595,0.271657,0.135321,0.310630,-0.188247,0.161128,0.210674
4,0.057594,-0.044796,0.087209,0.028736,0.057303,0.075167,0.099933,0.188639,-0.245330,0.186603,...,-0.031218,0.245269,0.047061,0.266991,0.251460,0.019276,0.161508,-0.159195,0.079074,0.089511
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
163,0.029056,0.119876,0.049575,-0.066374,0.062001,0.018950,0.039168,0.127449,-0.080557,-0.047690,...,0.011653,0.143067,-0.029939,0.544233,0.120853,0.012453,0.089121,0.009840,0.027879,0.084698
164,0.253423,-0.096248,0.159327,-0.297639,-0.162121,0.309384,0.154456,0.215525,-0.462364,0.315646,...,0.086044,0.436517,-0.047005,0.871293,0.332641,0.264444,0.358351,-0.308806,0.222832,0.247780
165,0.048474,-0.065786,0.096442,-0.081266,0.118146,-0.034410,0.022408,0.187511,-0.215431,0.235637,...,0.005803,0.208649,-0.036314,0.476860,0.118282,0.111448,0.183524,-0.170755,0.059210,0.094427
166,0.178728,-0.030873,0.089366,-0.123075,0.018669,0.014647,0.099844,0.292821,-0.228909,0.278503,...,0.082141,0.268723,-0.003959,0.708699,0.224944,0.153745,0.232333,-0.174973,0.145089,0.166183


In [39]:
repvet_rotulado.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,759,760,761,762,763,764,765,766,767,rotulo
0,0.31471,-0.060151,0.165918,-0.374657,-0.138525,0.277978,0.235641,0.321512,-0.438855,0.430341,...,0.498755,0.022556,0.859591,0.318669,0.201685,0.343222,-0.387592,0.251425,0.337813,0
1,0.305746,-0.165856,0.165211,-0.228474,-0.045061,0.341329,0.194617,0.373827,-0.474146,0.420907,...,0.504627,0.061583,0.899594,0.34845,0.298324,0.341645,-0.446765,0.269141,0.344139,1
2,0.325719,-0.156187,0.238672,-0.384598,-0.15014,0.400224,0.182729,0.322785,-0.483749,0.408693,...,0.488232,0.050863,0.961977,0.343856,0.229315,0.313479,-0.474858,0.181446,0.221576,0


# Creating a structure to store the results

In [40]:
# Classification algorithms
classifiers = [  # NB, KNN, SVM
    ('Multinomial Naive Bayes', MultinomialNB()),
    ('Complement Naive Bayes Classifier', ComplementNB()),
    ('KNN', KNeighborsClassifier()),  # default n_neighbors is 5
    ('SVM', svm.SVC()),
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('AdaBoost', AdaBoostClassifier(random_state=42)),  # default n_estimators is 50
]

In [41]:
lista_classificador_nome = [classifier_name for classifier_name, _ in classifiers]

In [42]:
df_acc = pd.DataFrame(columns=['Classificador','Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media'])
df_f1 = pd.DataFrame(columns=['Classificador','Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media'])
df_f1_ponderado = pd.DataFrame(columns=['Classificador','Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media'])

In [43]:
for classifier_name, classifier in classifiers:
    nova_linha = pd.DataFrame({'Classificador': [classifier_name], 'Rodada 1':[0] , 'Rodada 2':[0], 'Rodada 3':[0], 'Rodada 4':[0], 'Rodada 5':[0], 'Media':[0]})
    df_acc = pd.concat([df_acc, nova_linha], ignore_index=True)
    df_f1 = pd.concat([df_f1, nova_linha], ignore_index=True)
    df_f1_ponderado = pd.concat([df_f1_ponderado, nova_linha], ignore_index=True)

In [44]:
df_f1

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,Multinomial Naive Bayes,0,0,0,0,0,0
1,Complement Naive Bayes Classifier,0,0,0,0,0,0
2,KNN,0,0,0,0,0,0
3,SVM,0,0,0,0,0,0
4,Random Forest,0,0,0,0,0,0
5,AdaBoost,0,0,0,0,0,0


In [45]:
df_f1_ponderado

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,Multinomial Naive Bayes,0,0,0,0,0,0
1,Complement Naive Bayes Classifier,0,0,0,0,0,0
2,KNN,0,0,0,0,0,0
3,SVM,0,0,0,0,0,0
4,Random Forest,0,0,0,0,0,0
5,AdaBoost,0,0,0,0,0,0


In [46]:
df_acc

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,Multinomial Naive Bayes,0,0,0,0,0,0
1,Complement Naive Bayes Classifier,0,0,0,0,0,0
2,KNN,0,0,0,0,0,0
3,SVM,0,0,0,0,0,0
4,Random Forest,0,0,0,0,0,0
5,AdaBoost,0,0,0,0,0,0
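
The same layout (one row per classifier, one column per round, plus the mean) can also be built in one step once the per-round scores are known. The sketch below is an alternative to the pre-allocated frames above, not the notebook's method; the scores are made-up placeholders and `build_results_table` is a hypothetical helper.

```python
import pandas as pd


def build_results_table(scores_by_classifier):
    """scores_by_classifier: dict mapping classifier name -> list of per-round scores."""
    n_rounds = len(next(iter(scores_by_classifier.values())))
    columns = [f'Rodada {i + 1}' for i in range(n_rounds)]
    table = pd.DataFrame.from_dict(scores_by_classifier, orient='index', columns=columns)
    table['Media'] = table[columns].mean(axis=1)   # mean over the rounds
    # Move the classifier names from the index into a regular column
    return table.rename_axis('Classificador').reset_index()


# Placeholder scores, for illustration only
example = {'KNN': [0.5, 0.6, 0.7, 0.6, 0.6], 'SVM': [0.4, 0.4, 0.5, 0.5, 0.2]}
print(build_results_table(example))
```

Building the table from collected scores avoids filling placeholder zeros that must later be overwritten.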


I used **'macro'** as the **average** parameter for the F1 score, and **'weighted'** for the weighted F1 score.

**'weighted'**:

Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

**'macro'**:

Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

See the [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html).
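
As a worked example of the two averages, the sketch below recomputes them by hand from the confusion matrix printed for Multinomial Naive Bayes in round 1 of the results (rows are true labels, columns are predictions). The helper `f1_from_confusion` is hypothetical and written only for this illustration.

```python
def f1_from_confusion(cm):
    """Return (per-class F1, macro F1, weighted F1) for a square confusion matrix."""
    n = len(cm)
    f1s, supports = [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[r][k] for r in range(n))  # column sum
        true_k = sum(cm[k])                            # row sum = support
        # F1 = 2*TP / (predicted + actual); taken as 0 when the denominator is 0
        denom = predicted_k + true_k
        f1s.append(2 * tp / denom if denom else 0.0)
        supports.append(true_k)
    macro = sum(f1s) / n
    weighted = sum(f * s for f, s in zip(f1s, supports)) / sum(supports)
    return f1s, macro, weighted


cm = [[12, 4, 3],
      [1, 3, 4],
      [0, 5, 2]]
per_class, macro, weighted = f1_from_confusion(cm)
print([round(f, 2) for f in per_class])      # [0.75, 0.3, 0.25]
print(round(macro, 4), round(weighted, 4))   # 0.4333 0.5412
```

The macro value matches the `F1-Score` printed for that round, and the weighted value matches the `weighted avg` row of its classification report (0.54). The weighted score is higher because class 0, which has the best F1, also has the largest support.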

# Applying the classification models

In [47]:
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, classification_report
from sklearn.metrics import precision_recall_fscore_support as score

def evaluate_model(model, X_test, y_test):
    # Predict the labels
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)

    # Accuracy
    acc = accuracy_score(y_test, y_pred)

    # Macro-averaged F1 score
    f1 = f1_score(y_test, y_pred, average='macro')

    # Weighted F1 score
    f1_ponderado = f1_score(y_test, y_pred, average='weighted')

    # Other metrics (with average='macro', support is returned as None)
    precision, recall, f1score, support = score(y_test, y_pred, average='macro')
    return cm, acc, f1, precision, recall, f1score, support, f1_ponderado

In [48]:
df_repvet

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
0,0.314710,-0.060151,0.165918,-0.374657,-0.138525,0.277978,0.235641,0.321512,-0.438855,0.430341,...,0.105117,0.498755,0.022556,0.859591,0.318669,0.201685,0.343222,-0.387592,0.251425,0.337813
1,0.305746,-0.165856,0.165211,-0.228474,-0.045061,0.341329,0.194617,0.373827,-0.474146,0.420907,...,0.152223,0.504627,0.061583,0.899594,0.348450,0.298324,0.341645,-0.446765,0.269141,0.344139
2,0.325719,-0.156187,0.238672,-0.384598,-0.150140,0.400224,0.182729,0.322785,-0.483749,0.408693,...,0.195706,0.488232,0.050863,0.961977,0.343856,0.229315,0.313479,-0.474858,0.181446,0.221576
3,0.260132,-0.067558,0.105033,-0.219390,-0.114704,0.204282,0.233667,0.299719,-0.340289,0.345567,...,0.019989,0.366665,-0.036615,0.758595,0.271657,0.135321,0.310630,-0.188247,0.161128,0.210674
4,0.057594,-0.044796,0.087209,0.028736,0.057303,0.075167,0.099933,0.188639,-0.245330,0.186603,...,-0.031218,0.245269,0.047061,0.266991,0.251460,0.019276,0.161508,-0.159195,0.079074,0.089511
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
163,0.029056,0.119876,0.049575,-0.066374,0.062001,0.018950,0.039168,0.127449,-0.080557,-0.047690,...,0.011653,0.143067,-0.029939,0.544233,0.120853,0.012453,0.089121,0.009840,0.027879,0.084698
164,0.253423,-0.096248,0.159327,-0.297639,-0.162121,0.309384,0.154456,0.215525,-0.462364,0.315646,...,0.086044,0.436517,-0.047005,0.871293,0.332641,0.264444,0.358351,-0.308806,0.222832,0.247780
165,0.048474,-0.065786,0.096442,-0.081266,0.118146,-0.034410,0.022408,0.187511,-0.215431,0.235637,...,0.005803,0.208649,-0.036314,0.476860,0.118282,0.111448,0.183524,-0.170755,0.059210,0.094427
166,0.178728,-0.030873,0.089366,-0.123075,0.018669,0.014647,0.099844,0.292821,-0.228909,0.278503,...,0.082141,0.268723,-0.003959,0.708699,0.224944,0.153745,0.232333,-0.174973,0.145089,0.166183


- Some algorithms only handle non-negative values, so we need to normalize the values that appear in the vector representations from BERT and GPT-2:

In [49]:
# scaler = MinMaxScaler()
# df_repvet_pos = scaler.fit_transform(df_repvet)
# scaling_factor = 1000
# df_repvet_pos_i = (df_repvet_pos * scaling_factor).astype(int)
# df_repvet_pos_i = pd.DataFrame(df_repvet_pos_i, columns=df_repvet.columns)

In [50]:
# df_repvet_pos_i.describe()
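
The training loop below applies this scaling per round, fitting the scaler on the training fold only. Here is a minimal sketch on toy data (the arrays are made up) of why the step is needed and what it produces: `MultinomialNB` rejects negative inputs, while the min-max scaled and integer-converted features are accepted.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Toy features with negative values, like BERT's pooled output (made-up numbers)
X_train = np.array([[-0.9, 0.2], [0.1, -0.4], [0.8, 0.6], [-0.2, 0.9]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.5, 0.1]])

try:
    MultinomialNB().fit(X_train, y_train)
    raw_fit_failed = False
except ValueError:
    raw_fit_failed = True   # negative values are not accepted
print(raw_fit_failed)

scaler = MinMaxScaler()
X_train_pos = scaler.fit_transform(X_train)   # learn min/max on the training fold only
X_test_pos = scaler.transform(X_test)         # reuse the training min/max on the test fold

scaling_factor = 1000
X_train_int = (X_train_pos * scaling_factor).astype(int)
X_test_int = (X_test_pos * scaling_factor).astype(int)

model = MultinomialNB().fit(X_train_int, y_train)
print(X_train_int.min(), X_train_int.max())
print(model.predict(X_test_int))
```

Because the minimum and maximum come from the training fold, a test value outside that range can still fall below 0 or above 1 after `transform`; this is the usual trade-off for keeping test data out of the fit.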

In [51]:
classifiers

[('Multinomial Naive Bayes', MultinomialNB()),
 ('Complement Naive Bayes Classifier', ComplementNB()),
 ('KNN', KNeighborsClassifier()),
 ('SVM', SVC()),
 ('Random Forest', RandomForestClassifier(random_state=42)),
 ('AdaBoost', AdaBoostClassifier(random_state=42))]

In [52]:
classificador=0
for classifier_name, classifier in classifiers:
    print('---', classifier_name, '---')
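    y_true = []
    y_pred = []
    contador = 0
    # An explicit dtype avoids the pandas warning about the default dtype of an empty Series
    serie_acc = pd.Series(dtype=float)
    serie_f1 = pd.Series(dtype=float)
    serie_f1_ponderado = pd.Series(dtype=float)

    # Replacing the dataframe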
    # if (classifier_name == 'Multinomial Naive Bayes' or classifier_name == "Complement Naive Bayes Classifier"):
    #     df_substituto = df_repvet_pos
    # else:
    #     df_substituto = df_repvet
    # df_substituto= df_repvet_pos_i

    for i in range(0, 5):  # Loop over the 5 rounds
        train_index = kfold_train[i]
        test_index = kfold_test[i]
        contador +=1

        # Select the rows of df_repvet for this round
        X_train_n, X_test_n = df_repvet.iloc[train_index], df_repvet.iloc[test_index]
        y_train, y_test = repvet_rotulado.iloc[train_index]['rotulo'], repvet_rotulado.iloc[test_index]['rotulo']


        # Fit the scaler on the training fold only, then apply it to the test fold
        scaler = MinMaxScaler()
        df_repvet_pos_treino = scaler.fit_transform(X_train_n)
        df_repvet_pos_teste =  scaler.transform(X_test_n)

        # Convert the scaled values to integers
        scaling_factor = 1000
        df_repvet_pos_i_treino = (df_repvet_pos_treino * scaling_factor).astype(int)
        X_train = pd.DataFrame(df_repvet_pos_i_treino, columns=df_repvet.columns)

        df_repvet_pos_i_teste = (df_repvet_pos_teste * scaling_factor).astype(int)
        X_test = pd.DataFrame(df_repvet_pos_i_teste, columns=df_repvet.columns)

        # Train the model
        classifier.fit(X_train, y_train)
        cm, acc, f1, precision, recall, f1score, support,f1_poderado = evaluate_model(classifier, X_test, y_test)


        print(classifier_name + " Rodada " + str(contador) )
        print('Matriz de Confusão:')
        print(cm)
        print('Acurácia:', acc)
        print('F1-Score:', f1)
        print("outras métricas:")
        print('precision:', precision)
        print('recall:', recall)
        print('f1score:', f1score)
        print('support:', support)
        print('-------------------------------------')
        # serie_acc = serie_acc.append(pd.Series([acc]))
        serie_acc = pd.concat([serie_acc, pd.Series([acc])])
        # serie_f1 = serie_f1.append(pd.Series([f1]))
        serie_f1 = pd.concat([serie_f1, pd.Series([f1])])
        serie_f1_ponderado = pd.concat([serie_f1_ponderado, pd.Series([f1_poderado])])


    # Model evaluation: here we append the means to the series
    media_acc = serie_acc[:5].mean()
    media_f1 = serie_f1[:5].mean()
    media_f1_ponderado = serie_f1_ponderado[:5].mean()
    # serie_acc = serie_acc.append(pd.Series([media_acc]))
    # serie_f1 = serie_f1.append(pd.Series([media_f1]))
    serie_acc = pd.concat([serie_acc, pd.Series([media_acc])])
    serie_f1 = pd.concat([serie_f1, pd.Series([media_f1])])
    serie_f1_ponderado = pd.concat([serie_f1_ponderado, pd.Series([media_f1_ponderado])])

    # Store the five round scores and their mean in this classifier's row of each results table
    df_acc.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_acc.values
    df_f1.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_f1.values
    df_f1_ponderado.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_f1_ponderado.values
    classificador += 1
    print("=======================================================================================")

--- Multinomial Naive Bayes ---
              precision    recall  f1-score   support

           0       0.92      0.63      0.75        19
           1       0.25      0.38      0.30         8
           2       0.22      0.29      0.25         7

    accuracy                           0.50        34
   macro avg       0.47      0.43      0.43        34
weighted avg       0.62      0.50      0.54        34

Multinomial Naive Bayes Rodada 1
Matriz de Confusão:
[[12  4  3]
 [ 1  3  4]
 [ 0  5  2]]
Acurácia: 0.5
F1-Score: 0.4333333333333333
outras métricas:
precision: 0.4650997150997151
recall: 0.43076441102756896
f1score: 0.4333333333333333
support: None
-------------------------------------




              precision    recall  f1-score   support

           0       0.63      0.63      0.63        19
           1       0.20      0.25      0.22         8
           2       1.00      0.71      0.83         7

    accuracy                           0.56        34
   macro avg       0.61      0.53      0.56        34
weighted avg       0.61      0.56      0.58        34

Multinomial Naive Bayes Rodada 2
Matriz de Confusão:
[[12  7  0]
 [ 6  2  0]
 [ 1  1  5]]
Acurácia: 0.5588235294117647
F1-Score: 0.5623781676413255
outras métricas:
precision: 0.6105263157894737
recall: 0.5319548872180451
f1score: 0.5623781676413255
support: None
-------------------------------------
              precision    recall  f1-score   support

           0       0.62      0.53      0.57        19
           1       0.14      0.25      0.18         8
           2       0.75      0.43      0.55         7

    accuracy                           0.44        34
   macro avg       0.51      0.40      0.43  



              precision    recall  f1-score   support

           0       0.78      0.74      0.76        19
           1       0.00      0.00      0.00         8
           2       0.33      0.71      0.45         7

    accuracy                           0.56        34
   macro avg       0.37      0.48      0.40        34
weighted avg       0.50      0.56      0.52        34

Complement Naive Bayes Classifier Rodada 1
Matriz de Confusão:
[[14  0  5]
 [ 3  0  5]
 [ 1  1  5]]
Acurácia: 0.5588235294117647
F1-Score: 0.4037674037674037
outras métricas:
precision: 0.3703703703703704
recall: 0.48370927318295737
f1score: 0.4037674037674037
support: None
-------------------------------------
              precision    recall  f1-score   support

           0       0.65      0.79      0.71        19
           1       0.17      0.12      0.14         8
           2       1.00      0.71      0.83         7

    accuracy                           0.62        34
   macro avg       0.61      0.54 



              precision    recall  f1-score   support

           0       0.68      0.94      0.79        18
           1       0.00      0.00      0.00         9
           2       0.62      0.83      0.71         6

    accuracy                           0.67        33
   macro avg       0.44      0.59      0.50        33
weighted avg       0.48      0.67      0.56        33

Complement Naive Bayes Classifier Rodada 5
Matriz de Confusão:
[[17  0  1]
 [ 7  0  2]
 [ 1  0  5]]
Acurácia: 0.6666666666666666
F1-Score: 0.5016611295681064
outras métricas:
precision: 0.43500000000000005
recall: 0.5925925925925926
f1score: 0.5016611295681064
support: None
-------------------------------------
--- KNN ---
              precision    recall  f1-score   support

           0       0.73      1.00      0.84        19
           1       0.50      0.12      0.20         8
           2       0.67      0.57      0.62         7

    accuracy                           0.71        34
   macro avg       0.6



              precision    recall  f1-score   support

           0       0.70      1.00      0.83        19
           1       0.00      0.00      0.00         8
           2       0.57      0.57      0.57         7

    accuracy                           0.68        34
   macro avg       0.43      0.52      0.47        34
weighted avg       0.51      0.68      0.58        34

SVM Rodada 1
Matriz de Confusão:
[[19  0  0]
 [ 5  0  3]
 [ 3  0  4]]
Acurácia: 0.6764705882352942
F1-Score: 0.4658385093167701
outras métricas:
precision: 0.42504409171075835
recall: 0.5238095238095238
f1score: 0.4658385093167701
support: None
-------------------------------------
              precision    recall  f1-score   support

           0       0.63      1.00      0.78        19
           1       0.00      0.00      0.00         8
           2       1.00      0.57      0.73         7

    accuracy                           0.68        34
   macro avg       0.54      0.52      0.50        34
weighted a



 0.5009276437847866
support: None
-------------------------------------
              precision    recall  f1-score   support

           0       0.63      1.00      0.78        19
           1       0.00      0.00      0.00         8
           2       1.00      0.57      0.73         7

    accuracy                           0.68        34
   macro avg       0.54      0.52      0.50        34
weighted avg       0.56      0.68      0.58        34

SVM Rodada 3
Matriz de Confusão:
[[19  0  0]
 [ 8  0  0]
 [ 3  0  4]]
Acurácia: 0.6764705882352942
F1-Score: 0.5009276437847866
outras métricas:
precision: 0.5444444444444444
recall: 0.5238095238095238
f1score: 0.5009276437847866
support: None
-------------------------------------




              precision    recall  f1-score   support

           0       0.66      1.00      0.79        19
           1       0.00      0.00      0.00         8
           2       0.50      0.33      0.40         6

    accuracy                           0.64        33
   macro avg       0.39      0.44      0.40        33
weighted avg       0.47      0.64      0.53        33

SVM Rodada 4
Matriz de Confusão:
[[19  0  0]
 [ 6  0  2]
 [ 4  0  2]]
Acurácia: 0.6363636363636364
F1-Score: 0.3972222222222222
outras métricas:
precision: 0.3850574712643678
recall: 0.4444444444444444
f1score: 0.3972222222222222
support: None
-------------------------------------
              precision    recall  f1-score   support

           0       0.64      1.00      0.78        18
           1       0.00      0.00      0.00         9
           2       1.00      0.83      0.91         6

    accuracy                           0.70        33
   macro avg       0.55      0.61      0.56        33
weighted av



              precision    recall  f1-score   support

           0       0.83      0.79      0.81        19
           1       0.27      0.38      0.32         8
           2       0.40      0.29      0.33         7

    accuracy                           0.59        34
   macro avg       0.50      0.48      0.49        34
weighted avg       0.61      0.59      0.60        34

AdaBoost Rodada 1
Matriz de Confusão:
[[15  4  0]
 [ 2  3  3]
 [ 1  4  2]]
Acurácia: 0.5882352941176471
F1-Score: 0.4866445392761182
outras métricas:
precision: 0.5020202020202019
recall: 0.4833959899749374
f1score: 0.4866445392761182
support: None
-------------------------------------
              precision    recall  f1-score   support

           0       0.74      0.74      0.74        19
           1       0.25      0.38      0.30         8
           2       1.00      0.43      0.60         7

    accuracy                           0.59        34
   macro avg       0.66      0.51      0.55        34
weight
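
Class 1 is never predicted in several of the reports above. In SVM Rodada 1, for example, the middle column of the confusion matrix is all zeros, so the precision for that class is 0/0. scikit-learn reports such undefined values as 0.0 and warns about it; its metric functions, such as `precision_recall_fscore_support` and `classification_report`, accept a `zero_division` argument that makes this choice explicit. How `evaluate_model` calls scikit-learn is not shown in this section, so the block below is only a minimal pure-Python sketch of the same arithmetic. `macro_precision` is a hypothetical helper, not part of the notebook.

```python
def macro_precision(cm, zero_division=0.0):
    """Macro-averaged precision from a confusion matrix.

    Rows are true classes, columns are predicted classes.
    `zero_division` is the value used when a class is never predicted.
    """
    n = len(cm)
    per_class = []
    for k in range(n):
        predicted_k = sum(cm[i][k] for i in range(n))  # column total for class k
        if predicted_k == 0:
            # Precision is 0/0 here, so fall back to the chosen value.
            per_class.append(zero_division)
        else:
            per_class.append(cm[k][k] / predicted_k)
    return sum(per_class) / n, per_class


# Confusion matrix printed for "SVM Rodada 1" above.
svm_round_1 = [[19, 0, 0],
               [5, 0, 3],
               [3, 0, 4]]

macro, per_class = macro_precision(svm_round_1)
print([round(p, 2) for p in per_class])
print(macro)
```

With the fallback set to 0.0, the macro precision agrees with the `precision: 0.42504409171075835` line printed for that round. The `support: None` lines are consistent with averaged scores from `precision_recall_fscore_support`, which returns no per-class support when an average is requested; that is an inference from the output, not something this section confirms.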

In [53]:
df_acc

| | Classificador | Rodada 1 | Rodada 2 | Rodada 3 | Rodada 4 | Rodada 5 | Media |
|---|---|---|---|---|---|---|---|
| 0 | Multinomial Naive Bayes | 0.5 | 0.558824 | 0.441176 | 0.666667 | 0.69697 | 0.572727 |
| 1 | Complement Naive Bayes Classifier | 0.558824 | 0.617647 | 0.558824 | 0.666667 | 0.666667 | 0.613725 |
| 2 | KNN | 0.705882 | 0.647059 | 0.529412 | 0.636364 | 0.575758 | 0.618895 |
| 3 | SVM | 0.676471 | 0.676471 | 0.676471 | 0.636364 | 0.69697 | 0.672549 |
| 4 | Random Forest | 0.705882 | 0.676471 | 0.647059 | 0.636364 | 0.636364 | 0.660428 |
| 5 | AdaBoost | 0.588235 | 0.588235 | 0.617647 | 0.636364 | 0.454545 | 0.577005 |
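
The `Media` column is the mean of the five `Rodada` values in the same row. The loop builds each row by growing a Series one element at a time with `pd.concat`; collecting the scores in a plain list and computing the mean once gives the same six values with less bookkeeping. The block below is a sketch of that idea in standard-library Python only; `summarize_rounds` and `N_ROUNDS` are hypothetical names that do not appear in the notebook.

```python
from statistics import fmean

N_ROUNDS = 5  # number of folds, matching the 'Rodada 1'..'Rodada 5' columns
columns = [f"Rodada {i}" for i in range(1, N_ROUNDS + 1)] + ["Media"]


def summarize_rounds(scores):
    """Return the per-round scores followed by their mean."""
    if len(scores) != N_ROUNDS:
        raise ValueError(f"expected {N_ROUNDS} scores, got {len(scores)}")
    return list(scores) + [fmean(scores)]


# Accuracy of Multinomial Naive Bayes per round, copied from the table above.
mnb_acc = [0.5, 0.558824, 0.441176, 0.666667, 0.69697]
row = summarize_rounds(mnb_acc)
print(dict(zip(columns, row)))
```

The resulting list has the same length and order as the column list used in `df_acc.loc[...]`, so it could be assigned to a row in one step. Its last value agrees with the `Media` of 0.572727 shown for Multinomial Naive Bayes.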


In [54]:
df_f1

| | Classificador | Rodada 1 | Rodada 2 | Rodada 3 | Rodada 4 | Rodada 5 | Media |
|---|---|---|---|---|---|---|---|
| 0 | Multinomial Naive Bayes | 0.433333 | 0.562378 | 0.4329 | 0.614432 | 0.686751 | 0.545959 |
| 1 | Complement Naive Bayes Classifier | 0.403767 | 0.563492 | 0.451923 | 0.536819 | 0.501661 | 0.491533 |
| 2 | KNN | 0.553276 | 0.458503 | 0.432558 | 0.449365 | 0.442424 | 0.467226 |
| 3 | SVM | 0.465839 | 0.500928 | 0.500928 | 0.397222 | 0.5639 | 0.485763 |
| 4 | Random Forest | 0.568543 | 0.564646 | 0.52963 | 0.433333 | 0.5 | 0.51923 |
| 5 | AdaBoost | 0.486645 | 0.545614 | 0.637486 | 0.570053 | 0.451104 | 0.53818 |


In [55]:
df_f1_ponderado

| | Classificador | Rodada 1 | Rodada 2 | Rodada 3 | Rodada 4 | Rodada 5 | Media |
|---|---|---|---|---|---|---|---|
| 0 | Multinomial Naive Bayes | 0.541176 | 0.576797 | 0.474408 | 0.67114 | 0.700513 | 0.592807 |
| 1 | Complement Naive Bayes Classifier | 0.516476 | 0.604342 | 0.554299 | 0.62143 | 0.56116 | 0.571541 |
| 2 | KNN | 0.645651 | 0.556903 | 0.513406 | 0.58607 | 0.505785 | 0.561563 |
| 3 | SVM | 0.579284 | 0.583106 | 0.583106 | 0.528535 | 0.592167 | 0.57324 |
| 4 | Random Forest | 0.659027 | 0.636007 | 0.593791 | 0.551515 | 0.553719 | 0.598812 |
| 5 | AdaBoost | 0.596031 | 0.605882 | 0.644286 | 0.645244 | 0.479151 | 0.594119 |
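
`df_f1` and `df_f1_ponderado` differ only in how the per-class F1 scores are averaged. The macro F1 gives each class equal weight, while the weighted F1 weights each class by its support, so the majority class 0 dominates it. The sketch below recomputes both from the confusion matrix printed for Multinomial Naive Bayes, Rodada 1. `f1_summary` is a hypothetical helper written for illustration; the notebook's own `evaluate_model` is not shown in this section.

```python
def f1_summary(cm):
    """Per-class, macro and support-weighted F1 from a confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    n = len(cm)
    support = [sum(row) for row in cm]                               # true samples per class
    predicted = [sum(cm[i][k] for i in range(n)) for k in range(n)]  # predictions per class
    f1 = []
    for k in range(n):
        # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (support + predicted)
        denom = support[k] + predicted[k]
        f1.append(2 * cm[k][k] / denom if denom else 0.0)
    macro = sum(f1) / n
    weighted = sum(s * f for s, f in zip(support, f1)) / sum(support)
    return f1, macro, weighted


# Confusion matrix printed for "Multinomial Naive Bayes Rodada 1".
mnb_round_1 = [[12, 4, 3],
               [1, 3, 4],
               [0, 5, 2]]

f1, macro, weighted = f1_summary(mnb_round_1)
print([round(v, 2) for v in f1])
print(round(macro, 6), round(weighted, 6))
```

For this round the macro value agrees with the 0.433333 in `df_f1` and the weighted value with the 0.541176 in `df_f1_ponderado`. The gap comes from class 0, which has 19 of the 34 test samples and the highest per-class F1.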
