# Classification with Count Vectorizer

- Version dated **23 October 2023**

- Using __RESUMO__ (the abstract column)

The file has 2 main parts: importing the dataset and applying the classifiers.

- Description: there are 3 input files (*corpus*) with different levels of preprocessing. Stemming was added. The text is then vectorized with *Count Vectorizer*, and the 'MLP' classification algorithm is evaluated. A k-fold with k=5 is used. Sampling **is stratified**.

- Classifiers used: *Neural Network*
- **k-Fold cross-validation: this code ran the 5 rounds**.
- Confusion matrix;
- Seed 42
- 50 epochs
- BATCH_SIZE = 8
- optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
- I choose the model with the best accuracy on validation.

In [2]:
RANDOM_SEED=42

In [3]:
# dataset.csv   or  dataset_pre_processado_1.csv  or  dataset_pre_processado_stem_2.csv
#     CSV1                  CSV2                                   CSV3
dataset = "dataset_pre_processado_stem_2.csv"

In [4]:
print("Lembre-se estamos usando o dataset: " + dataset)

Lembre-se estamos usando o dataset: dataset_pre_processado_stem_2.csv


# Classification Task

### Importing libraries:

In [5]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier

In [6]:
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)

HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]

sns.set_palette(sns.color_palette(HAPPY_COLORS_PALETTE))

## Loading the Dataset

In [7]:
df = pd.read_csv(dataset)
df.head(2)

Unnamed: 0,id,titulo,autor,url,tipo_documento,rotulo,resumo,texto
0,88,estudo dos efeitos de dircm em mísseis infrave...,"caio augusto de melo silvestre, lester de abre...",https://www.sige.ita.br/edicoes-anteriores/201...,Artigo de Simpósio,1,cresc empreg missel ombr infravermelh contr al...,misseis infravermelhos especialmente tipo manp...
1,125,caracterização de capacitores cerâmicos na fai...,"silva neto, l. p., rossi, j. o., barroso j. j.",https://www.sige.ita.br/edicoes-anteriores/201...,Artigo de Simpósio,2,mater dieletr baix perd alt permiss compon ess...,introducao ceramicas dieletricas encontram imp...


In [8]:
novo_df = df[["resumo", "rotulo"]].copy()  # explicit copy, so the assignment below does not trigger SettingWithCopyWarning

In [9]:
def adapto_faixa(faixa):
    # Shift the labels from 1..3 to 0..2
    return faixa - 1

In [10]:
class_names = ['Faixa 1', 'Faixa 2', 'Faixa 3']

In [11]:
novo_df['rotulo'] = novo_df.rotulo.apply(adapto_faixa)


In [12]:
novo_df

Unnamed: 0,resumo,rotulo
0,cresc empreg missel ombr infravermelh contr al...,0
1,mater dieletr baix perd alt permiss compon ess...,1
2,recent disponibil dad publ sensor remot obt at...,0
3,disruptiontolerant network dtn evoluca mobil a...,0
4,nest artig apresent arquitet rad secundari ope...,1
...,...,...
163,rio jan rj centr avaliaco exercit caex camp pr...,2
164,trabalh apresent sistem control multipl aerona...,2
165,context comunicaco tatic base uso radi defin s...,2
166,val veloc alv movel imag obt rad abert sinte s...,0


## Vectorization

- Vectorization: done inside the for loop that iterates over the rounds, so the vocabulary is fitted on each round's training fold only
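To illustrate why the vectorizer is fitted inside the loop, here is a minimal sketch on made-up, already-stemmed documents (the toy strings are assumptions, not rows from the dataset). It shows that `fit_transform` learns the vocabulary from the training texts only and that `transform` silently drops terms it has never seen.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, stemmed like the `resumo` column
train_docs = ["rad sinal rad", "missel infravermelh", "sinal comunicaco"]
test_docs = ["rad missel desconhec"]  # "desconhec" never appears in training

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary from training only
X_test = vectorizer.transform(test_docs)        # reuses that vocabulary, ignores unseen terms

vocab = sorted(vectorizer.vocabulary_)          # terms in column order
print(vocab)
print(X_train.toarray())
print(X_test.toarray())
```

Both matrices share the same number of columns (the training vocabulary size), which is what lets the classifier trained on one be applied to the other.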

## K-fold

Note that **stratified sampling was used**.

In [13]:
# Model evaluation using k-fold
k = 5
# kf = KFold(n_splits=k, shuffle=True, random_state=42)
kf = StratifiedKFold(n_splits=k, shuffle=True, random_state=RANDOM_SEED)

- Alternatively, we can use the K-fold we already built; both give the same split, since they work on the same dataframe with the same seed.
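The two properties mentioned here (class proportions preserved in every fold, and identical folds when the data and seed are the same) can be checked with a small sketch. The label vector below is hypothetical; only `StratifiedKFold` with `n_splits=5`, `shuffle=True` and seed 42 comes from the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 20 of class 0, 10 of class 1, 10 of class 2
y = np.array([0] * 20 + [1] * 10 + [2] * 10)
X = np.zeros((len(y), 1))  # the features play no role in how the folds are drawn

def fold_indices(seed):
    """Return the test indices of each of the 5 stratified folds."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return [test_idx for _, test_idx in skf.split(X, y)]

folds_a = fold_indices(42)
folds_b = fold_indices(42)

for test_idx in folds_a:
    # Class counts inside each test fold keep the 2:1:1 ratio of the full set
    print(np.bincount(y[test_idx]))

# Same data + same seed + same splitter class -> identical folds
same_split = all(np.array_equal(a, b) for a, b in zip(folds_a, folds_b))
print(same_split)
```

The reproducibility holds for the same splitter class; a plain `KFold` with the same seed is not expected to return the same indices as `StratifiedKFold`.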

## Results dataframe

- Creating the evaluation tables:

In [14]:
# Classification algorithms
classifiers = ["MLP"]

In [15]:
lista_classificador_nome = list()
for classifier_name in classifiers:
    lista_classificador_nome.append(classifier_name)

In [16]:
df_acc = pd.DataFrame(columns=['Classificador','Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media'])
df_f1 = pd.DataFrame(columns=['Classificador','Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media'])
df_f1_ponderado = pd.DataFrame(columns=['Classificador','Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media'])

In [17]:
for classifier_name in classifiers:
    nova_linha = pd.DataFrame({'Classificador': [classifier_name], 'Rodada 1':[0] , 'Rodada 2':[0], 'Rodada 3':[0], 'Rodada 4':[0], 'Rodada 5':[0], 'Media':[0]})
    df_acc = pd.concat([df_acc, nova_linha], ignore_index=True)
    df_f1 = pd.concat([df_f1, nova_linha], ignore_index=True)
    df_f1_ponderado = pd.concat([df_f1_ponderado, nova_linha], ignore_index=True)

In [18]:
df_f1

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,MLP,0,0,0,0,0,0


In [19]:
df_acc

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,MLP,0,0,0,0,0,0


In [20]:
df_f1_ponderado

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,MLP,0,0,0,0,0,0


For the **average** parameter I used **'macro'** for F1 and **'weighted'** for the weighted F1.

**'weighted'**:

Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

**'macro'**:

Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

See the [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html).
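A small worked example may make the difference between the two averages concrete. The labels below are hypothetical (they are not taken from the dataset); the sketch computes the per-class F1, then reproduces `'macro'` as the plain mean and `'weighted'` as the support-weighted mean.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels: class 0 is the majority, class 1 is never predicted
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 2, 2])

per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
support = np.bincount(y_true)  # number of true instances per class

macro_by_hand = per_class.mean()                            # unweighted mean
weighted_by_hand = np.average(per_class, weights=support)   # mean weighted by support

macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print(per_class, macro, weighted)
```

Because the class with F1 = 0 has few true instances, it pulls the macro average down more than the weighted one, which is the same pattern seen when a minority class is missed.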

## MLP

- Explanatory tutorials:

1. https://analyticsindiamag.com/a-beginners-guide-to-scikit-learns-mlpclassifier/

2. https://towardsdatascience.com/multilayer-perceptron-explained-with-a-real-life-example-and-python-code-sentiment-analysis-cb408ee93141

3. https://michael-fuchs-python.netlify.app/2021/02/03/nn-multi-layer-perceptron-classifier-mlpclassifier/



In [21]:
EPOCHS = 50
BATCH = 8
classificador=0
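serie_acc = pd.Series(dtype=float)           # explicit dtype avoids the empty-Series warning
serie_f1 = pd.Series(dtype=float)
serie_cm = pd.Series(dtype=object)           # holds one confusion matrix per round
serie_f1_ponderado = pd.Series(dtype=float)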
contador=0


- http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

In [22]:
def MLP(x_treinamento, x_teste, y_treinamento, y_teste, epocas=50, batch=8):
    """Train and evaluate an MLP with 3 hidden layers of 150, 100 and 50 neurons."""
    # The defaults for epocas and batch already match the proposed setup, i.e. 50 and 8.
    # hidden_layer_sizes=(150, 100, 50): the i-th element of the tuple is the number of nodes
    # in the i-th hidden layer; the length of the tuple is the number of hidden layers.
    classificadorMLP = MLPClassifier(hidden_layer_sizes=(150, 100, 50), max_iter=epocas, activation='relu',
                                     solver='adam', verbose=10, random_state=RANDOM_SEED,
                                     learning_rate_init=0.00002, batch_size=batch)
    # Training
    classificadorMLP.fit(x_treinamento, y_treinamento)
    print("Parametros da MLP:")
    print(classificadorMLP.get_params())

    # Testing
    y_predicao = classificadorMLP.predict(x_teste)
    accuracy = accuracy_score(y_teste, y_predicao)  # compare expected and predicted labels
    f1 = f1_score(y_teste, y_predicao, average='macro')  # macro F1-score
    f1_ponderado = f1_score(y_teste, y_predicao, average='weighted')  # weighted F1-score
    print("Resultados:")
    print("Acurácia: " + str(accuracy))
    print("F1-macro: " + str(f1))
    cm = confusion_matrix(y_teste, y_predicao)
    print("Matriz de Confusão:")
    print(cm)
    # plot_confusion_matrix(classificadorMLP, x_teste, y_teste, cmap=plt.cm.Blues)
    # plt.show()
    return accuracy, f1, cm, f1_ponderado
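As a sanity check on what `hidden_layer_sizes=(150, 100, 50)` builds, the sketch below fits the same architecture on random data and inspects the weight matrices. The random data, the 20 input features and `max_iter=5` are assumptions made to keep the example fast; the convergence warning that such a short run usually raises is silenced on purpose.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(42)
X = rng.rand(30, 20)        # hypothetical data: 30 samples, 20 features
y = np.arange(30) % 3       # three classes, 10 samples each

with warnings.catch_warnings():
    # Very few epochs are used here, so the optimizer is not expected to converge
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    clf = MLPClassifier(hidden_layer_sizes=(150, 100, 50), max_iter=5,
                        batch_size=8, learning_rate_init=2e-5, random_state=42)
    clf.fit(X, y)

# One weight matrix per pair of consecutive layers
weight_shapes = [w.shape for w in clf.coefs_]
print(weight_shapes)   # input->150, 150->100, 100->50, 50->3 outputs
print(clf.n_layers_)   # input layer + 3 hidden layers + output layer
```

For the stochastic solvers (`'adam'`, `'sgd'`), `max_iter` counts passes over the training data, which is why the notebook passes the number of epochs through it.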


## Treinamento

In [23]:
for index_df_train, index_df_test in kf.split(novo_df['resumo'], novo_df['rotulo']):
    contador += 1
    print("="*50)
    print("Rodada:" + str(contador))
    print("="*50)

    # TRAIN / TEST SPLIT
    df_train, df_test = novo_df.iloc[index_df_train], novo_df.iloc[index_df_test]
    # Optional: carve a validation set out of the training fold
    # df_train, df_val = train_test_split(df_train, test_size=0.2, stratify=df_train['rotulo'], random_state=RANDOM_SEED)

    # Separate text (resumo) and label (rotulo)
    X_df_train = df_train['resumo'].values
    y_df_train = df_train['rotulo'].values
    # X_df_val = df_val['resumo'].values
    # y_df_val = df_val['rotulo'].values
    X_df_test = df_test['resumo'].values
    y_df_test = df_test['rotulo'].values

    # VECTORIZATION: term counts, with the vocabulary fitted on the training fold only
    vectorizer = CountVectorizer()
    X_train_counts = vectorizer.fit_transform(X_df_train)
    # X_val_counts = vectorizer.transform(X_df_val)
    X_test_counts = vectorizer.transform(X_df_test)
    print("Shape do treinamento - resumo:" + str(X_train_counts.shape))
    print("Shape do treinamento - resumo:" + str(X_test_counts.shape))  # shape of the test fold

    # Training and testing
    accuracy, f1, cm, f1_ponderado = MLP(X_train_counts, X_test_counts, y_df_train, y_df_test, epocas=EPOCHS, batch=BATCH)
    # pd.concat replaces Series.append, which was deprecated in pandas 1.4 and removed in 2.0
    serie_acc = pd.concat([serie_acc, pd.Series([accuracy])])
    serie_f1 = pd.concat([serie_f1, pd.Series([f1])])
    serie_cm = pd.concat([serie_cm, pd.Series([cm])])
    serie_f1_ponderado = pd.concat([serie_f1_ponderado, pd.Series([f1_ponderado])])

# Model evaluation: append the mean over the 5 rounds to each series
media_acc = serie_acc[:5].mean()
media_f1 = serie_f1[:5].mean()
media_f1_ponderado = serie_f1_ponderado[:5].mean()
serie_acc = pd.concat([serie_acc, pd.Series([media_acc])])
serie_f1 = pd.concat([serie_f1, pd.Series([media_f1])])
serie_f1_ponderado = pd.concat([serie_f1_ponderado, pd.Series([media_f1_ponderado])])

df_acc.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_acc.values
df_f1.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_f1.values
df_f1_ponderado.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_f1_ponderado.values

Rodada:1
Shape do treinamento - resumo:(134, 2539)
Shape do treinamento - resumo:(34, 2539)
Iteration 1, loss = 1.03382561
Iteration 2, loss = 1.00924009
Iteration 3, loss = 0.98788300
Iteration 4, loss = 0.96743783
Iteration 5, loss = 0.94721810
Iteration 6, loss = 0.92695748
Iteration 7, loss = 0.90689677
Iteration 8, loss = 0.88767401
Iteration 9, loss = 0.86842807
Iteration 10, loss = 0.84993454
Iteration 11, loss = 0.83076484
Iteration 12, loss = 0.81237008
Iteration 13, loss = 0.79427042
Iteration 14, loss = 0.77548632
Iteration 15, loss = 0.75783331
Iteration 16, loss = 0.74057783
Iteration 17, loss = 0.72287945
Iteration 18, loss = 0.70564265
Iteration 19, loss = 0.68872273
Iteration 20, loss = 0.67224884
Iteration 21, loss = 0.65556879
Iteration 22, loss = 0.63891662
Iteration 23, loss = 0.62260265
Iteration 24, loss = 0.60602550
Iteration 25, loss = 0.58998159
Iteration 26, loss = 0.57394072
Iteration 27, loss = 0.55823353
Iteration 28, loss = 0.54235095
Iteration 29, loss = 


Iteration 1, loss = 1.10029068
Iteration 2, loss = 1.07706744
Iteration 3, loss = 1.05598119
Iteration 4, loss = 1.03636339
Iteration 5, loss = 1.01609776
Iteration 6, loss = 0.99711469
Iteration 7, loss = 0.97874566
Iteration 8, loss = 0.96062847
Iteration 9, loss = 0.94302745
Iteration 10, loss = 0.92550592
Iteration 11, loss = 0.90873241
Iteration 12, loss = 0.89115709
Iteration 13, loss = 0.87376200
Iteration 14, loss = 0.85639202
Iteration 15, loss = 0.83865657
Iteration 16, loss = 0.82120920
Iteration 17, loss = 0.80367514
Iteration 18, loss = 0.78648893
Iteration 19, loss = 0.76854708
Iteration 20, loss = 0.75097734
Iteration 21, loss = 0.73321003
Iteration 22, loss = 0.71566050
Iteration 23, loss = 0.69818451
Iteration 24, loss = 0.68043203
Iteration 25, loss = 0.66310681
Iteration 26, loss = 0.64521320
Iteration 27, loss = 0.62802066
Iteration 28, loss = 0.61065594
Iteration 29, loss = 0.59287362
Iteration 30, loss = 0.57589924
Iteration 31, loss = 0.55862975
Iteration 32, los


Iteration 1, loss = 1.08149876
Iteration 2, loss = 1.05952814
Iteration 3, loss = 1.04006316
Iteration 4, loss = 1.02121622
Iteration 5, loss = 1.00280714
Iteration 6, loss = 0.98482595
Iteration 7, loss = 0.96730033
Iteration 8, loss = 0.94985125
Iteration 9, loss = 0.93316468
Iteration 10, loss = 0.91624651
Iteration 11, loss = 0.89972704
Iteration 12, loss = 0.88292123
Iteration 13, loss = 0.86605350
Iteration 14, loss = 0.84897332
Iteration 15, loss = 0.83205514
Iteration 16, loss = 0.81478759
Iteration 17, loss = 0.79806720
Iteration 18, loss = 0.78081751
Iteration 19, loss = 0.76361590
Iteration 20, loss = 0.74660028
Iteration 21, loss = 0.72962633
Iteration 22, loss = 0.71260409
Iteration 23, loss = 0.69525671
Iteration 24, loss = 0.67821951
Iteration 25, loss = 0.66135554
Iteration 26, loss = 0.64459565
Iteration 27, loss = 0.62789043
Iteration 28, loss = 0.61120671
Iteration 29, loss = 0.59437112
Iteration 30, loss = 0.57750375
Iteration 31, loss = 0.56112336
Iteration 32, los


Iteration 1, loss = 1.08159753
Iteration 2, loss = 1.05799556
Iteration 3, loss = 1.03754997
Iteration 4, loss = 1.01731527
Iteration 5, loss = 0.99828254
Iteration 6, loss = 0.97967950
Iteration 7, loss = 0.96163788
Iteration 8, loss = 0.94417280
Iteration 9, loss = 0.92755583
Iteration 10, loss = 0.91047078
Iteration 11, loss = 0.89367817
Iteration 12, loss = 0.87781082
Iteration 13, loss = 0.86098158
Iteration 14, loss = 0.84461891
Iteration 15, loss = 0.82822532
Iteration 16, loss = 0.81180099
Iteration 17, loss = 0.79532906
Iteration 18, loss = 0.77858181
Iteration 19, loss = 0.76181286
Iteration 20, loss = 0.74551315
Iteration 21, loss = 0.72919800
Iteration 22, loss = 0.71261136
Iteration 23, loss = 0.69631825
Iteration 24, loss = 0.67932249
Iteration 25, loss = 0.66271070
Iteration 26, loss = 0.64661024
Iteration 27, loss = 0.63042112
Iteration 28, loss = 0.61365661
Iteration 29, loss = 0.59732106
Iteration 30, loss = 0.58070171
Iteration 31, loss = 0.56453208
Iteration 32, los


Iteration 1, loss = 1.15679887
Iteration 2, loss = 1.12306749
Iteration 3, loss = 1.09276126
Iteration 4, loss = 1.06622864
Iteration 5, loss = 1.04104775
Iteration 6, loss = 1.01687125
Iteration 7, loss = 0.99466714
Iteration 8, loss = 0.97301352
Iteration 9, loss = 0.95321851
Iteration 10, loss = 0.93331033
Iteration 11, loss = 0.91429470
Iteration 12, loss = 0.89561870
Iteration 13, loss = 0.87775993
Iteration 14, loss = 0.85932827
Iteration 15, loss = 0.84171278
Iteration 16, loss = 0.82457617
Iteration 17, loss = 0.80719455
Iteration 18, loss = 0.78969028
Iteration 19, loss = 0.77273878
Iteration 20, loss = 0.75548153
Iteration 21, loss = 0.73852861
Iteration 22, loss = 0.72151500
Iteration 23, loss = 0.70389624
Iteration 24, loss = 0.68638466
Iteration 25, loss = 0.66916573
Iteration 26, loss = 0.65236916
Iteration 27, loss = 0.63484760
Iteration 28, loss = 0.61844273
Iteration 29, loss = 0.60163516
Iteration 30, loss = 0.58514947
Iteration 31, loss = 0.56848556
Iteration 32, los


In [24]:
def show_confusion_matrix(cm_df):
    # cm_df: confusion matrix as a DataFrame; the parameter no longer shadows sklearn's confusion_matrix
    hmap = sns.heatmap(cm_df, annot=True, fmt="d", cmap="Blues")
    hmap.yaxis.set_ticklabels(hmap.yaxis.get_ticklabels(), rotation=0, ha='right')
    hmap.xaxis.set_ticklabels(hmap.xaxis.get_ticklabels(), rotation=30, ha='right')
    plt.ylabel('Faixa TRL verdadeira')
    plt.xlabel('Faixa TRL predita')

In [25]:
serie_acc[0]

0    0.764706
0    0.676471
0    0.705882
0    0.696970
0    0.696970
0    0.708200
dtype: float64

In [26]:
for elemento in serie_cm:
    print(elemento)
    # df_cm = pd.DataFrame(elemento, index=class_names, columns=class_names)
    # show_confusion_matrix(df_cm)

[[19  0  0]
 [ 5  0  3]
 [ 0  0  7]]
[[19  0  0]
 [ 8  0  0]
 [ 2  1  4]]
[[19  0  0]
 [ 8  0  0]
 [ 1  1  5]]
[[19  0  0]
 [ 7  0  1]
 [ 2  0  4]]
[[18  0  0]
 [ 8  0  1]
 [ 1  0  5]]
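The metrics of a round can be recovered from its confusion matrix alone. The sketch below rebuilds label vectors from the round-1 matrix printed above (rows are true classes, columns are predicted classes) and recomputes accuracy, macro F1 and weighted F1; the reconstruction loop is an illustration, not part of the original pipeline.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Round-1 confusion matrix as printed above (rows: true class, columns: predicted class)
cm = np.array([[19, 0, 0],
               [5, 0, 3],
               [0, 0, 7]])

# Rebuild label vectors that produce exactly this matrix
y_true, y_pred = [], []
for true_label in range(cm.shape[0]):
    for pred_label in range(cm.shape[1]):
        count = int(cm[true_label, pred_label])
        y_true += [true_label] * count
        y_pred += [pred_label] * count

acc = accuracy_score(y_true, y_pred)
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
recall_per_class = cm.diagonal() / cm.sum(axis=1)

print(acc, f1_macro, f1_weighted)
print(recall_per_class)
```

The per-class recall makes one pattern explicit: the middle class (index 1, 'Faixa 2') has a diagonal entry of 0 in all five matrices, so none of its documents is classified correctly in any round, which is what drags the macro F1 well below the accuracy.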


In [27]:
df_acc

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,MLP,0.764706,0.676471,0.705882,0.69697,0.69697,0.7082


In [28]:
df_f1

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,MLP,0.569083,0.506313,0.547281,0.511928,0.544444,0.53581


In [29]:
df_f1_ponderado

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,MLP,0.663394,0.592135,0.623383,0.597738,0.587879,0.612906
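The `Media` column can be double-checked, and complemented with a measure of spread, directly from the per-round values in the three tables above. The sample standard deviation is an addition for illustration; the notebook itself reports only the mean.

```python
import numpy as np

# Per-round scores copied from the three result tables above
scores = {
    "accuracy": [0.764706, 0.676471, 0.705882, 0.69697, 0.69697],
    "f1_macro": [0.569083, 0.506313, 0.547281, 0.511928, 0.544444],
    "f1_weighted": [0.663394, 0.592135, 0.623383, 0.597738, 0.587879],
}

# Mean and sample standard deviation (ddof=1) over the 5 rounds
summary = {name: (np.mean(vals), np.std(vals, ddof=1)) for name, vals in scores.items()}

for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.4f} +/- {std:.4f}")
```

With only five folds of roughly 34 documents each, the spread between rounds is worth reporting next to the mean when comparing this setup with the other corpora.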


## MLP

- Another implementation (more complicated)

In [30]:
# from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, classification_report
# from sklearn.metrics import precision_recall_fscore_support as score

# def evaluate_model(model, X_test, y_test):
#     # Predict the labels
#     y_pred = model.predict(X_test)
#     print(classification_report(y_test, y_pred))

#     # Confusion matrix
#     cm = confusion_matrix(y_test, y_pred)

#     # Accuracy
#     acc = accuracy_score(y_test, y_pred)

#     # Macro F1-score
#     f1 = f1_score(y_test, y_pred, average='macro')

#     # Other metrics
#     precision, recall, f1score, support = score(y_test, y_pred, average='macro')
#     return cm, acc, f1, precision, recall, f1score, support

In [31]:
# X_textos = novo_df['resumo'].values
# y = novo_df['rotulo'].values

In [32]:
# from sklearn.feature_extraction.text import TfidfVectorizer  # needed by this variant
# accuracies, f1_scores = [], []  # per-round metrics appended inside the loop

# for index_df_train, index_df_test in kf.split(novo_df['resumo'], novo_df['rotulo']):

#     # Split into train and test
#     df_train, df_test = novo_df.iloc[index_df_train], novo_df.iloc[index_df_test]
#     # Create a validation set from the training fold
#     df_train, df_val = train_test_split(df_train, test_size=0.2, stratify= df_train['rotulo'], random_state=RANDOM_SEED)


#     # Separate label (rotulo) and text (resumo)
#     # df_train
#     X_df_train = df_train['resumo'].values
#     y_df_train = df_train['rotulo'].values
#     # df_val
#     X_df_val = df_val['resumo'].values
#     y_df_val = df_val['rotulo'].values
#     # df_test
#     X_df_test = df_test['resumo'].values
#     y_df_test = df_test['rotulo'].values



#     # Build a TF-IDF representation of the data
#     tfidf_vectorizer = TfidfVectorizer()
#     X_train_tfidf = tfidf_vectorizer.fit_transform(X_df_train)
#     X_val_tfidf = tfidf_vectorizer.transform(X_df_val)
#     X_test_tfidf = tfidf_vectorizer.transform(X_df_test)


#     # What we have at this point:
#     # X_train_tfidf     and     y_df_train
#     # X_val_tfidf       and     y_df_val
#     # X_test_tfidf      and     y_df_test



#     # Create and compile the MLP
#     model = Sequential()
#     model.add(Dense(128, input_dim=X_train_tfidf.shape[1], activation='relu'))
#     model.add(Dense(3, activation='softmax'))  # 3 classes in this case
#     model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=2e-5), metrics=['accuracy'])
#     model.summary()

#     # Convert the labels to one-hot encoding
#     y_train_encoded = np.eye(3)[y_df_train]  # 3 classes in this case
#     y_val_encoded = np.eye(3)[y_df_val]  # 3 classes in this case
#     y_test_encoded = np.eye(3)[y_df_test]  # 3 classes in this case


#     # Training
#     history = model.fit(X_train_tfidf, y_train_encoded, validation_data=(X_val_tfidf, y_val_encoded), epochs=EPOCHS, batch_size=BATCH, verbose=0, shuffle = True)


#     contador+=1
#     print("="*50)
#     print(f'Rodada {contador }')
#     print("="*50)

#     # Make predictions
#     y_pred = model.predict(X_test_tfidf)
#     y_pred_classes = np.argmax(y_pred, axis=1)

#     # Evaluation: compare integer labels with integer predictions (not the one-hot targets)
#     accuracy = accuracy_score(y_df_test, y_pred_classes)
#     f1 = f1_score(y_df_test, y_pred_classes, average='macro')
#     cm = confusion_matrix(y_df_test, y_pred_classes)

#     accuracies.append(accuracy)
#     f1_scores.append(f1)

#     print("Matriz de Confusão:")
#     print(cm)
#     print(f'Acurácia: {accuracy:.2f}')
#     print(f'F1-Score: {f1:.2f}')
#     print("\n")


#    # serie_acc = serie_acc.append(pd.Series([acc]))
#     serie_acc = pd.concat([serie_acc, pd.Series([accuracy])])
#     # serie_f1 = serie_f1.append(pd.Series([f1]))
#     serie_f1 = pd.concat([serie_f1, pd.Series([f1])])

#     # Plot the accuracy history
#     plt.plot(history.history['accuracy'], label='Acurácia de Treinamento')
#     plt.plot(history.history['val_accuracy'], label='Acurácia de Validação')
#     plt.xlabel('Épocas')
#     plt.ylabel('Acurácia')
#     plt.legend()
#     plt.show()


# # Model evaluation: append the mean over the rounds to each series
# media_acc = serie_acc[:5].mean()
# media_f1 = serie_f1[:5].mean()
# serie_acc = pd.concat([serie_acc, pd.Series([media_acc])])
# serie_f1 = pd.concat([serie_f1, pd.Series([media_f1])])
# df_acc.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_acc.values
# df_f1.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_f1.values

# # Compute the mean of the metrics
# mean_accuracy = np.mean(accuracies)
# mean_f1 = np.mean(f1_scores)
# print("===============  Dados Médios do Modelo =============")
# print(f'Acurácia Média: {mean_accuracy:.2f}')
# print(f'F1-Score Médio: {mean_f1:.2f}')
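The Keras variant relies on one-hot targets built with `np.eye(3)[y]` and on `argmax` to turn softmax outputs back into class indices. The sketch below shows that round trip on hypothetical labels and probabilities (no model is trained here). It also scores the predictions against the integer labels, since scikit-learn's classification metrics are not expected to accept a one-hot `y_true` paired with integer predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y = np.array([0, 2, 1, 2])      # hypothetical integer labels for 3 classes
y_onehot = np.eye(3)[y]         # row i is the one-hot vector of y[i]

# Hypothetical softmax outputs, one row per sample
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.5, 0.4, 0.1],
                  [0.2, 0.2, 0.6]])

y_pred = probs.argmax(axis=1)         # most probable class per sample
recovered = y_onehot.argmax(axis=1)   # argmax inverts the one-hot encoding

acc = accuracy_score(y, y_pred)       # integer labels on both sides
print(y_onehot.shape, y_pred, acc)
```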