# Classification with TF-IDF

- Version created on **October 23, 2023**

- Using the __RESUMO__ (abstract) field

The file is split into 2 main parts: importing the dataset, and applying the classifiers.

- Description: there are 3 input files (*corpus*) with different levels of preprocessing. Stemming was added. TF-IDF vectorization is then applied and the 'MLP' classification algorithm is evaluated. A k-fold with k=5 is used. Sampling **is stratified**.

- Classifier used: *Neural Network*
- **k-Fold cross-validation: this code ran 5 rounds**.
- Confusion matrix
- Seed 42
- 50 epochs
- BATCH_SIZE = 8
- Optimizer: Adam with learning rate 2e-5 (`solver='adam'`, `learning_rate_init=0.00002` in `MLPClassifier`)
- I pick the model with the best accuracy on the validation set.

In [28]:
RANDOM_SEED=42

In [29]:
# dataset.csv   or  dataset_pre_processado_1.csv  or  dataset_pre_processado_stem_2.csv
#     CSV1                  CSV2                                   CSV3
dataset = "dataset_pre_processado_stem_2.csv"

In [30]:
print("Lembre-se estamos usando o dataset: " + dataset)

Lembre-se estamos usando o dataset: dataset_pre_processado_stem_2.csv


# Classification Task

### Importing libraries:

In [31]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier

In [32]:
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)

HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]

sns.set_palette(sns.color_palette(HAPPY_COLORS_PALETTE))

## Loading the Dataset

In [33]:
df = pd.read_csv(dataset)
df.head(2)

Unnamed: 0,id,titulo,autor,url,tipo_documento,rotulo,resumo,texto
0,88,estudo dos efeitos de dircm em mísseis infrave...,"caio augusto de melo silvestre, lester de abre...",https://www.sige.ita.br/edicoes-anteriores/201...,Artigo de Simpósio,1,cresc empreg missel ombr infravermelh contr al...,misseis infravermelhos especialmente tipo manp...
1,125,caracterização de capacitores cerâmicos na fai...,"silva neto, l. p., rossi, j. o., barroso j. j.",https://www.sige.ita.br/edicoes-anteriores/201...,Artigo de Simpósio,2,mater dieletr baix perd alt permiss compon ess...,introducao ceramicas dieletricas encontram imp...


In [34]:
novo_df = df[["resumo", "rotulo"]]

In [35]:
def adapto_faixa(faixa):
    # Shift the label from the 1..3 range to 0..2, matching the order of class_names
    faixa -=1
    return faixa

In [36]:
class_names = ['Faixa 1', 'Faixa 2', 'Faixa 3']

In [37]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
novo_df = novo_df.assign(rotulo=novo_df['rotulo'].apply(adapto_faixa))


In [38]:
novo_df

Unnamed: 0,resumo,rotulo
0,cresc empreg missel ombr infravermelh contr al...,0
1,mater dieletr baix perd alt permiss compon ess...,1
2,recent disponibil dad publ sensor remot obt at...,0
3,disruptiontolerant network dtn evoluca mobil a...,0
4,nest artig apresent arquitet rad secundari ope...,1
...,...,...
163,rio jan rj centr avaliaco exercit caex camp pr...,2
164,trabalh apresent sistem control multipl aerona...,2
165,context comunicaco tatic base uso radi defin s...,2
166,val veloc alv movel imag obt rad abert sinte s...,0


## Vectorization

- Vectorization: done inside the `for` loop that iterates over the rounds, so the vectorizer is fitted on each round's training split only.
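
Fitting inside the loop matters because the vocabulary and the IDF weights should come only from the training split of each round. Below is a minimal pure-Python sketch of that fit/transform separation. The helpers `fit_tfidf` and `transform_tfidf` and the toy documents are hypothetical. The formula assumes scikit-learn's defaults (`smooth_idf=True`, `norm='l2'`) and uses plain whitespace splitting instead of the library tokenizer.

```python
import math
from collections import Counter

def fit_tfidf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform_tfidf(doc, idf):
    """Weight a document with the fitted IDF; terms unseen during fit are dropped."""
    counts = Counter(term for term in doc.split() if term in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

# Hypothetical stemmed abstracts, for illustration only
train_docs = ["rad sinal alv", "rad antena", "sensor remot dad"]
test_doc = "rad sensor missel"  # "missel" never appears in the training split

idf = fit_tfidf(train_docs)
vector = transform_tfidf(test_doc, idf)
print(sorted(vector))  # ['rad', 'sensor']
```

The test document keeps only the terms learned from the training split, which is why every round reports the same number of columns for train and test.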

In [39]:
# Vectorization using TF-IDF
# vectorizer = TfidfVectorizer()

## K-fold

Note that **STRATIFIED SAMPLING WAS USED**

In [40]:
# Model evaluation using k-fold
k = 5
# kf = KFold(n_splits=k, shuffle=True, random_state=42)
kf = StratifiedKFold(n_splits=k, shuffle=True, random_state=RANDOM_SEED)

- Alternatively, we can use the K-fold we already set up; both produce the same split, since they work with the same dataframe and the same seed.
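
To make the idea of stratification concrete, here is a simplified sketch that deals the samples of each class round-robin across the folds. It is an illustration only and does not reproduce the exact assignment made by scikit-learn's `StratifiedKFold`. The function `stratified_folds` is hypothetical, and the class counts (94, 41, 33) are an assumption obtained by summing the rows of the confusion matrices printed in this notebook.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed):
    """Assign each sample index to one of k folds, class by class (simplified)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Assumed class counts: Faixa 1 = 94, Faixa 2 = 41, Faixa 3 = 33
labels = [0] * 94 + [1] * 41 + [2] * 33
folds = stratified_folds(labels, k=5, seed=42)
for number, fold in enumerate(folds, start=1):
    print(number, sorted(Counter(labels[i] for i in fold).items()))
```

Each fold ends up with roughly the same class proportions as the full dataset, which is the property the stratified split is chosen for.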

## Results dataframe

- Creating the evaluation tables:

In [41]:
# Classification algorithms
classifiers = ["MLP"]

In [42]:
lista_classificador_nome = list()
for classifier_name in classifiers:
    lista_classificador_nome.append(classifier_name)

In [43]:
df_acc = pd.DataFrame(columns=['Classificador','Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media'])
df_f1 = pd.DataFrame(columns=['Classificador','Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media'])
df_f1_ponderado = pd.DataFrame(columns=['Classificador','Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media'])

In [44]:
for classifier_name in classifiers:
    nova_linha = pd.DataFrame({'Classificador': [classifier_name], 'Rodada 1':[0] , 'Rodada 2':[0], 'Rodada 3':[0], 'Rodada 4':[0], 'Rodada 5':[0], 'Media':[0]})
    df_acc = pd.concat([df_acc, nova_linha], ignore_index=True)
    df_f1 = pd.concat([df_f1, nova_linha], ignore_index=True)
    df_f1_ponderado = pd.concat([df_f1_ponderado, nova_linha], ignore_index=True)

In [45]:
df_f1

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,MLP,0,0,0,0,0,0


In [46]:
df_acc

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,MLP,0,0,0,0,0,0


In [47]:
df_f1_ponderado

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,MLP,0,0,0,0,0,0


I passed **'macro'** as the **average** parameter for the F1 and **'weighted'** for the weighted F1.

**'weighted'**:

Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

**'macro'**:

Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

See the [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html)
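
The difference between the two averages can be checked by hand. The sketch below computes both from a confusion matrix in plain Python; `f1_from_confusion` is a hypothetical helper, not a scikit-learn function. The matrix used is the one printed for round 1 in the output of this notebook.

```python
def f1_from_confusion(cm):
    """Return (macro_f1, weighted_f1) for a square confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    per_class_f1, supports = [], []
    for c in range(n_classes):
        true_positives = cm[c][c]
        predicted = sum(cm[r][c] for r in range(n_classes))
        support = sum(cm[c])
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / support if support else 0.0
        denom = precision + recall
        per_class_f1.append(2 * precision * recall / denom if denom else 0.0)
        supports.append(support)
    macro = sum(per_class_f1) / n_classes                                    # unweighted mean
    weighted = sum(f * s for f, s in zip(per_class_f1, supports)) / total   # mean weighted by support
    return macro, weighted

cm_round_1 = [[19, 0, 0], [7, 0, 1], [6, 0, 1]]
macro_f1, weighted_f1 = f1_from_confusion(cm_round_1)
print(round(macro_f1, 5), round(weighted_f1, 5))  # 0.32244 0.46213
```

Both values agree with the `Rodada 1` entries reported in `df_f1` and `df_f1_ponderado`. The weighted score is higher because the majority class, which has the best per-class F1, also has the largest support.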

## MLP

- Explanatory tutorials:

1. https://analyticsindiamag.com/a-beginners-guide-to-scikit-learns-mlpclassifier/

2. https://towardsdatascience.com/multilayer-perceptron-explained-with-a-real-life-example-and-python-code-sentiment-analysis-cb408ee93141

3. https://michael-fuchs-python.netlify.app/2021/02/03/nn-multi-layer-perceptron-classifier-mlpclassifier/



In [48]:
EPOCHS = 50
BATCH = 8
classificador=0
# Explicit dtypes avoid the pandas warning about empty Series created without a dtype
serie_acc = pd.Series(dtype=float)
serie_f1 = pd.Series(dtype=float)
serie_cm = pd.Series(dtype=object)
serie_f1_ponderado = pd.Series(dtype=float)
contador=0


- http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

In [49]:
def MLP(x_treinamento,x_teste, y_treinamento, y_teste, epocas=50,batch=8):  # MLP with 3 hidden layers of 150, 100 and 50 neurons
    # The defaults for epocas and batch are the proposed values, i.e., 50 and 8 respectively
    # hidden_layer_sizes=(150,100,50): the i-th element of the tuple is the number of nodes in the i-th hidden layer; the tuple length is the number of hidden layers
    classificadorMLP = MLPClassifier(hidden_layer_sizes=(150,100,50), max_iter=epocas, activation='relu', solver='adam', verbose=10, random_state=RANDOM_SEED, learning_rate_init= 0.00002, batch_size=batch)
    # Training
    classificadorMLP.fit(x_treinamento,y_treinamento)
    print("Parametros da MLP:")
    print(classificadorMLP.get_params())

    # Test
    y_predicao = classificadorMLP.predict(x_teste)
    accuracy = accuracy_score(y_teste, y_predicao)  # Compare expected and predicted labels
    # F1 score, macro-averaged
    f1 = f1_score(y_teste, y_predicao, average='macro')
    # F1 score, weighted by class support
    f1_ponderado = f1_score(y_teste, y_predicao, average='weighted')
    print("Resultados:")
    print("Acurácia: " + str(accuracy))
    print("F1-macro: " + str(f1))
    cm = confusion_matrix(y_teste, y_predicao)
    print("Matriz de Confusão:")
    print(cm)
    # plot_confusion_matrix(classificadorMLP, x_teste, y_teste, cmap=plt.cm.Blues)
    # plt.show()
    return accuracy,f1,cm, f1_ponderado
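
To see what `hidden_layer_sizes=(150,100,50)` implies, the sketch below counts the weights and biases of the resulting fully connected network. The function `count_mlp_parameters` is hypothetical. The input size of 2539 is the TF-IDF vocabulary size reported for round 1 in this notebook, and the output size of 3 assumes one output unit per class.

```python
def count_mlp_parameters(n_inputs, hidden_layer_sizes, n_outputs):
    """Count the weights and biases of a fully connected network."""
    layer_sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total

n_parameters = count_mlp_parameters(2539, (150, 100, 50), 3)
print(n_parameters)  # 401303
```

Roughly 400 thousand parameters are being fitted on 134 training abstracts per round, which is worth keeping in mind when reading the results.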


## Training

In [50]:
for index_df_train, index_df_test in kf.split(novo_df['resumo'], novo_df['rotulo']):
    contador+=1
    print("="*50)
    print("Rodada:" + str(contador))
    print("="*50)

    # TRAINING, VALIDATION and TEST
    # Split into train and test
    df_train, df_test = novo_df.iloc[index_df_train], novo_df.iloc[index_df_test]
    # Create a validation set from the training data (currently disabled)
    # df_train, df_val = train_test_split(df_train, test_size=0.2, stratify= df_train['rotulo'], random_state=RANDOM_SEED)
    # Separate label (rotulo) and abstract (resumo)
    # df_train
    X_df_train = df_train['resumo'].values
    y_df_train = df_train['rotulo'].values
    # #df_val
    # X_df_val = df_val['resumo'].values
    # y_df_val = df_val['rotulo'].values
    # df_test
    X_df_test = df_test['resumo'].values
    y_df_test = df_test['rotulo'].values

    # VECTORIZATION: fit on the training split only, then transform the test split
    vectorizer = TfidfVectorizer()
    X_train_tfidf = vectorizer.fit_transform(X_df_train)
    # X_val_tfidf = vectorizer.transform(X_df_val)
    X_test_tfidf = vectorizer.transform(X_df_test)
    print("Shape do treinamento - resumo:" + str(X_train_tfidf.shape))
    print("Shape do teste - resumo:" + str(X_test_tfidf.shape))

    # Training and test
    accuracy,f1,cm,f1_ponderado = MLP(X_train_tfidf,X_test_tfidf, y_df_train, y_df_test, epocas=EPOCHS,batch=BATCH)
    # Series.append was removed in pandas 2.0; pd.concat is the replacement
    serie_acc = pd.concat([serie_acc, pd.Series([accuracy])])
    serie_f1 = pd.concat([serie_f1, pd.Series([f1])])
    serie_cm = pd.concat([serie_cm, pd.Series([cm])])
    serie_f1_ponderado = pd.concat([serie_f1_ponderado, pd.Series([f1_ponderado])])

# Model evaluation: append the mean of the five rounds to each series
media_acc = serie_acc[:5].mean()
media_f1 = serie_f1[:5].mean()
media_f1_ponderado = serie_f1_ponderado[:5].mean()
serie_acc = pd.concat([serie_acc, pd.Series([media_acc])])
serie_f1 = pd.concat([serie_f1, pd.Series([media_f1])])
serie_f1_ponderado = pd.concat([serie_f1_ponderado, pd.Series([media_f1_ponderado])])

df_acc.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_acc.values
df_f1.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_f1.values
df_f1_ponderado.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_f1_ponderado.values

Rodada:1
Shape do treinamento - resumo:(134, 2539)
Shape do teste - resumo:(34, 2539)
Iteration 1, loss = 1.02253764
Iteration 2, loss = 1.01984995
Iteration 3, loss = 1.01721179
Iteration 4, loss = 1.01474169
Iteration 5, loss = 1.01197622
Iteration 6, loss = 1.00951354
Iteration 7, loss = 1.00704658
Iteration 8, loss = 1.00452718
Iteration 9, loss = 1.00189007
Iteration 10, loss = 0.99954587
Iteration 11, loss = 0.99679089
Iteration 12, loss = 0.99421113
Iteration 13, loss = 0.99154799
Iteration 14, loss = 0.98888932
Iteration 15, loss = 0.98600115
Iteration 16, loss = 0.98320014
Iteration 17, loss = 0.98032825
Iteration 18, loss = 0.97729215
Iteration 19, loss = 0.97420660
Iteration 20, loss = 0.97100037
Iteration 21, loss = 0.96760067
Iteration 22, loss = 0.96410679
Iteration 23, loss = 0.96037169
Iteration 24, loss = 0.95651697
Iteration 25, loss = 0.95257055
Iteration 26, loss = 0.94837794
Iteration 27, loss = 0.94412496
Iteration 28, loss = 0.93979049
Iteration 29, loss = 



Iteration 1, loss = 1.07851386
Iteration 2, loss = 1.07511297
Iteration 3, loss = 1.07193968
Iteration 4, loss = 1.06901186
Iteration 5, loss = 1.06598266
Iteration 6, loss = 1.06308188
Iteration 7, loss = 1.06034034
Iteration 8, loss = 1.05733341
Iteration 9, loss = 1.05457440
Iteration 10, loss = 1.05173481
Iteration 11, loss = 1.04912967
Iteration 12, loss = 1.04619041
Iteration 13, loss = 1.04349113
Iteration 14, loss = 1.04073762
Iteration 15, loss = 1.03797883
Iteration 16, loss = 1.03525184
Iteration 17, loss = 1.03254634
Iteration 18, loss = 1.02978017
Iteration 19, loss = 1.02687994
Iteration 20, loss = 1.02406947
Iteration 21, loss = 1.02117081
Iteration 22, loss = 1.01821102
Iteration 23, loss = 1.01510921
Iteration 24, loss = 1.01201260
Iteration 25, loss = 1.00870958
Iteration 26, loss = 1.00525755
Iteration 27, loss = 1.00196347
Iteration 28, loss = 0.99822829
Iteration 29, loss = 0.99446705
Iteration 30, loss = 0.99064943
Iteration 31, loss = 0.98653118
Iteration 32, los



Iteration 1, loss = 1.09249900
Iteration 2, loss = 1.08765345
Iteration 3, loss = 1.08302759
Iteration 4, loss = 1.07837821
Iteration 5, loss = 1.07351319
Iteration 6, loss = 1.06918379
Iteration 7, loss = 1.06444013
Iteration 8, loss = 1.05969864
Iteration 9, loss = 1.05518507
Iteration 10, loss = 1.05051257
Iteration 11, loss = 1.04577334
Iteration 12, loss = 1.04127522
Iteration 13, loss = 1.03668643
Iteration 14, loss = 1.03184652
Iteration 15, loss = 1.02738819
Iteration 16, loss = 1.02287866
Iteration 17, loss = 1.01851263
Iteration 18, loss = 1.01401204
Iteration 19, loss = 1.00965798
Iteration 20, loss = 1.00523703
Iteration 21, loss = 1.00087252
Iteration 22, loss = 0.99635811
Iteration 23, loss = 0.99181548
Iteration 24, loss = 0.98730822
Iteration 25, loss = 0.98271349
Iteration 26, loss = 0.97810667
Iteration 27, loss = 0.97338806
Iteration 28, loss = 0.96874463
Iteration 29, loss = 0.96395109
Iteration 30, loss = 0.95895649
Iteration 31, loss = 0.95408089
Iteration 32, los



Iteration 1, loss = 1.06580693
Iteration 2, loss = 1.06178959
Iteration 3, loss = 1.05794079
Iteration 4, loss = 1.05410933
Iteration 5, loss = 1.05026712
Iteration 6, loss = 1.04654387
Iteration 7, loss = 1.04289741
Iteration 8, loss = 1.03921273
Iteration 9, loss = 1.03579811
Iteration 10, loss = 1.03183619
Iteration 11, loss = 1.02819126
Iteration 12, loss = 1.02482824
Iteration 13, loss = 1.02105257
Iteration 14, loss = 1.01758613
Iteration 15, loss = 1.01389739
Iteration 16, loss = 1.01034141
Iteration 17, loss = 1.00683018
Iteration 18, loss = 1.00307379
Iteration 19, loss = 0.99946917
Iteration 20, loss = 0.99576792
Iteration 21, loss = 0.99209245
Iteration 22, loss = 0.98826889
Iteration 23, loss = 0.98453162
Iteration 24, loss = 0.98060419
Iteration 25, loss = 0.97663525
Iteration 26, loss = 0.97268496
Iteration 27, loss = 0.96879485
Iteration 28, loss = 0.96425735
Iteration 29, loss = 0.96005549
Iteration 30, loss = 0.95553552
Iteration 31, loss = 0.95090818
Iteration 32, los



Iteration 1, loss = 1.10772105
Iteration 2, loss = 1.10104841
Iteration 3, loss = 1.09466257
Iteration 4, loss = 1.08843938
Iteration 5, loss = 1.08274852
Iteration 6, loss = 1.07671185
Iteration 7, loss = 1.07106650
Iteration 8, loss = 1.06546798
Iteration 9, loss = 1.06013742
Iteration 10, loss = 1.05464612
Iteration 11, loss = 1.04949777
Iteration 12, loss = 1.04428110
Iteration 13, loss = 1.03915165
Iteration 14, loss = 1.03409662
Iteration 15, loss = 1.02907938
Iteration 16, loss = 1.02417869
Iteration 17, loss = 1.01929774
Iteration 18, loss = 1.01427947
Iteration 19, loss = 1.00944374
Iteration 20, loss = 1.00449899
Iteration 21, loss = 0.99960728
Iteration 22, loss = 0.99456928
Iteration 23, loss = 0.98961652
Iteration 24, loss = 0.98439528
Iteration 25, loss = 0.97950476
Iteration 26, loss = 0.97422525
Iteration 27, loss = 0.96902471
Iteration 28, loss = 0.96403348
Iteration 29, loss = 0.95899108
Iteration 30, loss = 0.95354972
Iteration 31, loss = 0.94834560
Iteration 32, los

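
A quick calculation helps put the logged losses in context. The sketch below assumes the round 1 training size of 134 samples with the batch size and epoch count set in this notebook, and compares against the cross-entropy of a uniform guess over 3 classes. The comparison is approximate, since the loss reported by the classifier may also include a regularization term.

```python
import math

n_train, batch_size, epochs = 134, 8, 50  # round 1 training size, BATCH, EPOCHS
n_classes = 3

updates_per_epoch = math.ceil(n_train / batch_size)  # the last batch is smaller
total_updates = updates_per_epoch * epochs
uniform_loss = math.log(n_classes)  # cross-entropy of a uniform 3-class prediction

print(updates_per_epoch, total_updates, round(uniform_loss, 4))  # 17 850 1.0986
```

The losses in the logs start between about 1.02 and 1.11 and are still above 0.93 after roughly 30 iterations, so they remain close to the uniform-guess value of about 1.10. With a learning rate of 2e-5 and 850 updates in total, the network may simply not have had enough steps to move far from its initial state.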


In [51]:
def show_confusion_matrix(cm_df):
  # cm_df: confusion matrix as a DataFrame labelled with class_names
  # (parameter renamed so it does not shadow sklearn's confusion_matrix function)
  hmap = sns.heatmap(cm_df, annot=True, fmt="d", cmap="Blues")
  hmap.yaxis.set_ticklabels(hmap.yaxis.get_ticklabels(), rotation=0, ha='right')
  hmap.xaxis.set_ticklabels(hmap.xaxis.get_ticklabels(), rotation=30, ha='right')
  plt.ylabel('Faixa TRL verdadeira')
  plt.xlabel('Faixa TRL predita')

In [52]:
serie_acc[0]

0    0.588235
0    0.558824
0    0.558824
0    0.575758
0    0.545455
0    0.565419
dtype: float64

In [53]:
for elemento in serie_cm:
    print(elemento)
    # df_cm = pd.DataFrame(elemento, index=class_names, columns=class_names)
    # show_confusion_matrix(df_cm)

[[19  0  0]
 [ 7  0  1]
 [ 6  0  1]]
[[19  0  0]
 [ 8  0  0]
 [ 7  0  0]]
[[19  0  0]
 [ 8  0  0]
 [ 7  0  0]]
[[19  0  0]
 [ 8  0  0]
 [ 6  0  0]]
[[18  0  0]
 [ 9  0  0]
 [ 6  0  0]]
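
The matrices above can be summarized with per-class recall and a majority-class baseline. This is a minimal sketch in plain Python; `summarize` is a hypothetical helper, and the matrices are copied from the output printed above.

```python
confusion_matrices = [
    [[19, 0, 0], [7, 0, 1], [6, 0, 1]],
    [[19, 0, 0], [8, 0, 0], [7, 0, 0]],
    [[19, 0, 0], [8, 0, 0], [7, 0, 0]],
    [[19, 0, 0], [8, 0, 0], [6, 0, 0]],
    [[18, 0, 0], [9, 0, 0], [6, 0, 0]],
]

def summarize(cm):
    """Return (per-class recall, accuracy, majority-class baseline) for one round."""
    total = sum(sum(row) for row in cm)
    recalls = [row[i] / sum(row) for i, row in enumerate(cm)]
    accuracy = sum(cm[i][i] for i in range(len(cm))) / total
    baseline = max(sum(row) for row in cm) / total  # always predicting the largest class
    return recalls, accuracy, baseline

results = [summarize(cm) for cm in confusion_matrices]
for round_number, (recalls, accuracy, baseline) in enumerate(results, start=1):
    print(round_number, [round(r, 3) for r in recalls], round(accuracy, 6), round(baseline, 6))
```

In rounds 2 to 5 every test sample is predicted as the first class, so the accuracy equals the majority-class baseline. Only round 1 differs, with a single correct prediction for the third class.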


In [54]:
df_acc

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,MLP,0.588235,0.558824,0.558824,0.575758,0.545455,0.565419


In [55]:
df_f1

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,MLP,0.32244,0.238994,0.238994,0.24359,0.235294,0.255862


In [56]:
df_f1_ponderado

Unnamed: 0,Classificador,Rodada 1,Rodada 2,Rodada 3,Rodada 4,Rodada 5,Media
0,MLP,0.46213,0.400666,0.400666,0.420746,0.385027,0.413847
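
The `Media` column is the plain average of the five rounds. The sketch below recomputes it from the values in the three tables above and also reports the sample standard deviation, which is not part of the original tables and is added here only as a measure of spread across rounds.

```python
from statistics import mean, stdev

rounds = {
    "accuracy": [0.588235, 0.558824, 0.558824, 0.575758, 0.545455],
    "f1_macro": [0.32244, 0.238994, 0.238994, 0.24359, 0.235294],
    "f1_weighted": [0.46213, 0.400666, 0.400666, 0.420746, 0.385027],
}

# (mean, sample standard deviation) per metric
summary = {name: (mean(values), stdev(values)) for name, values in rounds.items()}
for name, (average, spread) in summary.items():
    print(f"{name}: mean={average:.6f} std={spread:.6f}")
```

The recomputed means match the `Media` entries of `df_acc`, `df_f1` and `df_f1_ponderado`.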


## MLP

- Another implementation (more complicated)

In [57]:
# from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, classification_report
# from sklearn.metrics import precision_recall_fscore_support as score

# def evaluate_model(model, X_test, y_test):
#     # Predict the labels
#     y_pred = model.predict(X_test)
#     print(classification_report(y_test, y_pred))

#     # Compute the confusion matrix
#     cm = confusion_matrix(y_test, y_pred)

#     # Compute the accuracy
#     acc = accuracy_score(y_test, y_pred)

#     # Compute the F1 score
#     f1 = f1_score(y_test, y_pred, average='macro')

#     # Other metrics
#     precision, recall, f1score, support = score(y_test, y_pred, average='macro')
#     return cm, acc, f1, precision, recall, f1score, support

In [58]:
# X_textos = novo_df['resumo'].values
# y = novo_df['rotulo'].values

In [59]:
# accuracies, f1_scores = [], []  # per-round metrics, averaged after the loop
# for index_df_train, index_df_test in kf.split(novo_df['resumo'], novo_df['rotulo']):

#     # Split into train and test
#     df_train, df_test = novo_df.iloc[index_df_train], novo_df.iloc[index_df_test]
#     # Create a validation set from the training data
#     df_train, df_val = train_test_split(df_train, test_size=0.2, stratify= df_train['rotulo'], random_state=RANDOM_SEED)


#     # Separate label (rotulo) and abstract (resumo)
#     # df_train
#     X_df_train = df_train['resumo'].values
#     y_df_train = df_train['rotulo'].values
#     # #df_val
#     X_df_val = df_val['resumo'].values
#     y_df_val = df_val['rotulo'].values
#     #df_test
#     X_df_test = df_test['resumo'].values
#     y_df_test = df_test['rotulo'].values



#     # Build a TF-IDF representation of the data
#     tfidf_vectorizer = TfidfVectorizer()
#     X_train_tfidf = tfidf_vectorizer.fit_transform(X_df_train)
#     X_val_tfidf = tfidf_vectorizer.transform(X_df_val)
#     X_test_tfidf = tfidf_vectorizer.transform(X_df_test)


#     # What we have at this point:
#     # X_train_tfidf      and      y_df_train
#     # X_val_tfidf       and       y_df_val
#     # X_test_tfidf      and       y_df_test



#     # Create and compile the MLP
#     model = Sequential()
#     model.add(Dense(128, input_dim=X_train_tfidf.shape[1], activation='relu'))
#     model.add(Dense(3, activation='softmax'))  # 3 classes in this case
#     model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=2e-5), metrics=['accuracy'])
#     model.summary()

#     # One-hot encode the labels
#     y_train_encoded = np.eye(3)[y_df_train]  # 3 classes in this case
#     y_val_encoded = np.eye(3)[y_df_val]  # 3 classes in this case
#     y_test_encoded = np.eye(3)[y_df_test]  # 3 classes in this case


#     # Training
#     history = model.fit(X_train_tfidf, y_train_encoded, validation_data=(X_val_tfidf, y_val_encoded), epochs=EPOCHS, batch_size=BATCH, verbose=0, shuffle = True)


#     contador+=1
#     print("="*50)
#     print(f'Rodada {contador }')
#     print("="*50)

#     # Make predictions
#     y_pred = model.predict(X_test_tfidf)
#     y_pred_classes = np.argmax(y_pred, axis=1)

#     # Evaluation: compare the integer labels with the predicted classes
#     # (the one-hot targets are only needed for training)
#     accuracy = accuracy_score(y_df_test, y_pred_classes)
#     f1 = f1_score(y_df_test, y_pred_classes, average='macro')
#     cm = confusion_matrix(y_df_test, y_pred_classes)

#     accuracies.append(accuracy)
#     f1_scores.append(f1)

#     print("Matriz de Confusão:")
#     print(cm)
#     print(f'Acurácia: {accuracy:.2f}')
#     print(f'F1-Score: {f1:.2f}')
#     print("\n")


#    # serie_acc = serie_acc.append(pd.Series([acc]))
#     serie_acc = pd.concat([serie_acc, pd.Series([accuracy])])
#     # serie_f1 = serie_f1.append(pd.Series([f1]))
#     serie_f1 = pd.concat([serie_f1, pd.Series([f1])])

#     # Plot the accuracy history
#     plt.plot(history.history['accuracy'], label='Acurácia de Treinamento')
#     plt.plot(history.history['val_accuracy'], label='Acurácia de Validação')
#     plt.xlabel('Épocas')
#     plt.ylabel('Acurácia')
#     plt.legend()
#     plt.show()


#     # Model evaluation: append the mean values to the series
# media_acc = serie_acc[:5].mean()
# media_f1 = serie_f1[:5].mean()
# serie_acc = pd.concat([serie_acc, pd.Series([media_acc])])
# serie_f1 = pd.concat([serie_f1, pd.Series([media_f1])])
# df_acc.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_acc.values
# df_f1.loc[classificador, ['Rodada 1', 'Rodada 2', 'Rodada 3', 'Rodada 4', 'Rodada 5', 'Media']] = serie_f1.values

# # Compute the means of the metrics
# mean_accuracy = np.mean(accuracies)
# mean_f1 = np.mean(f1_scores)
# print("===============  Dados Médios do Modelo =============")
# print(f'Acurácia Média: {mean_accuracy:.2f}')
# print(f'F1-Score Médio: {mean_f1:.2f}')