<a href="https://colab.research.google.com/github/koba-works/enap_PLN_2024/blob/main/dnn_mlp_text_classification_comentado.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2. Deep Neural Networks

A deep feed-forward neural network (a multilayer perceptron, MLP) that classifies tweets by polarity: positive, negative, or neutral.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
from torch.utils.data import Dataset, DataLoader
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

In [None]:
class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

In [None]:
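class MLPClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)    # input -> hidden layer 1
        self.fc2 = nn.Linear(hidden_size, hidden_size)   # hidden layer 1 -> hidden layer 2
        self.fc3 = nn.Linear(hidden_size, output_size)   # hidden layer 2 -> class scores
        self.relu = nn.ReLU()
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        # Without an activation here, fc2 and fc3 would collapse into a single linear map.
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        # Returns log-probabilities. nn.CrossEntropyLoss applies log-softmax again,
        # which leaves log-probabilities unchanged, so this layer is redundant but harmless.
        return self.softmax(x)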
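
As a quick sanity check, the sketch below pushes a random batch through a small network with a similar three-layer layout and confirms the output shape. The sizes (10 features, 8 hidden units, 3 classes, a batch of 4) and the name `toy_model` are arbitrary choices made for illustration, not values from this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only for this illustration.
input_size, hidden_size, output_size = 10, 8, 3

toy_model = nn.Sequential(
    nn.Linear(input_size, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, output_size),
    nn.LogSoftmax(dim=1),
)

batch = torch.rand(4, input_size)   # 4 fake examples
log_probs = toy_model(batch)        # one row of log-probabilities per example
probs = log_probs.exp()             # convert back to probabilities

print(log_probs.shape)              # torch.Size([4, 3])
print(probs.sum(dim=1))             # each row sums to approximately 1
```

Exponentiating the output recovers ordinary probabilities, which is a convenient way to confirm that the last layer behaves as expected.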

Load the training and test data with pandas (the files are tab-separated text)

In [None]:
df_train = pd.read_csv('https://raw.githubusercontent.com/giacicunb/enap_pln2024/main/corpora/twitter-2016train-A.txt',sep='\t',encoding="UTF-8")
df_test = pd.read_csv('https://raw.githubusercontent.com/giacicunb/enap_pln2024/main/corpora/twitter-2016test-A.txt',sep='\t',encoding="UTF-8")

Number of classes (there are 3 tweet polarities)

In [None]:
num_classes = len(df_train['polarity'].unique())

Encode the polarity labels as integer codes, then extract the tweets and labels separately for the training and test data

In [None]:
df_train['polarity'] = pd.Categorical(df_train['polarity'])
df_train['polarity'] = df_train['polarity'].cat.codes

df_test['polarity'] = pd.Categorical(df_test['polarity'])
df_test['polarity'] = df_test['polarity'].cat.codes

train_tweets = df_train['text'].tolist()
train_labels = df_train['polarity'].tolist()

test_tweets = df_test['text'].tolist()
test_labels = df_test['polarity'].tolist()
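
The sketch below shows how pd.Categorical assigns the integer codes: the distinct labels are sorted and numbered from 0. The label strings here are hypothetical and the real files may use different ones. Because the training and test sets are encoded separately, the codes only agree if both sets contain the same set of labels.

```python
import pandas as pd

# Hypothetical toy labels; the real files may use different label strings.
toy = pd.DataFrame({'polarity': ['positive', 'negative', 'neutral', 'positive']})

categorical = pd.Categorical(toy['polarity'])
codes = categorical.codes.tolist()
mapping = dict(enumerate(categorical.categories))

print(codes)     # [2, 0, 1, 2]
print(mapping)   # {0: 'negative', 1: 'neutral', 2: 'positive'}
```

Keeping the mapping around makes it possible to translate the numeric classes in the final report back to label names.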

Compute the TF-IDF vectors for the training and test data. The vectorizer is fitted on the training tweets only and then reused to transform the test tweets.

In [None]:
tfidf_vectorizer = TfidfVectorizer()
train_tfidf = tfidf_vectorizer.fit_transform(train_tweets).toarray()

test_tfidf = tfidf_vectorizer.transform(test_tweets).toarray()
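
A minimal sketch of the fit/transform split on a made-up corpus: the vocabulary comes from the training documents only, so a word that appears only in the test data (here 'film') is ignored. The two tiny corpora are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ['good movie', 'bad movie']   # hypothetical toy training corpus
test_docs = ['good film']                  # 'film' never appears in training

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs).toarray()
test_matrix = vectorizer.transform(test_docs).toarray()

vocab = sorted(vectorizer.vocabulary_)
print(vocab)                # ['bad', 'good', 'movie']
print(train_matrix.shape)   # (2, 3): one column per vocabulary word
print(test_matrix)          # only the 'good' column is non-zero
```

The number of columns is the vocabulary size, which is why the network's input size is later read from the shape of the training tensor.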

Convert the TF-IDF vectors to tensors

In [None]:
train_tensor = torch.tensor(train_tfidf, dtype=torch.float32)
test_tensor = torch.tensor(test_tfidf, dtype=torch.float32)

torch_train_labels = torch.tensor(train_labels, dtype=torch.long)
torch_test_labels = torch.tensor(test_labels, dtype=torch.long)
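
The explicit dtypes matter: nn.Linear weights are float32 by default, and nn.CrossEntropyLoss expects class-index targets as integer (long) tensors. A small sketch with made-up values:

```python
import numpy as np
import torch

toy_features = np.array([[0.0, 0.5], [0.7, 0.0]])   # NumPy arrays default to float64
toy_labels = [2, 0]                                  # hypothetical class indices

x = torch.tensor(toy_features, dtype=torch.float32)
y = torch.tensor(toy_labels, dtype=torch.long)

print(toy_features.dtype)   # float64
print(x.dtype, y.dtype)     # torch.float32 torch.int64
```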

Create the Dataset and DataLoader instances

In [None]:
train_dataset = CustomDataset(train_tensor, torch_train_labels)
test_dataset = CustomDataset(test_tensor, torch_test_labels)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

The shuffle parameter shuffles the examples in the training DataLoader at every epoch, which reduces bias from the order in which the data is read.

batch_size=32 is generally a reasonable default, but keep in mind that the batch size directly affects memory consumption.
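
To make the effect of batch_size concrete, the sketch below batches 100 made-up examples. It uses TensorDataset from torch.utils.data instead of the custom Dataset class, purely to keep the example short; the data is random.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 100 examples with 5 features each.
features = torch.rand(100, 5)
targets = torch.randint(0, 3, (100,))
toy_dataset = TensorDataset(features, targets)

toy_loader = DataLoader(toy_dataset, batch_size=32, shuffle=True)

batch_sizes = [len(batch_targets) for _, batch_targets in toy_loader]
print(len(toy_loader))   # 4 batches per epoch
print(batch_sizes)       # [32, 32, 32, 4]: the last batch holds the remainder
```

Only one batch is held by the model at a time, which is why a larger batch_size raises memory use while reducing the number of update steps per epoch.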


Defining some hyperparameters:

*   Size of the input TF-IDF vector
*   Number of neurons in each hidden layer
*   Dimension of the output layer



In [None]:
input_size = train_tensor.shape[1]
hidden_size = 128
output_size = num_classes

### Terminology differences between frameworks:

A fully connected hidden layer is called:

PyTorch => Linear

TensorFlow => Dense
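
Whatever the name, the layer computes the same affine map. The sketch below checks that a PyTorch nn.Linear layer is equivalent to multiplying by its weight matrix and adding its bias; the sizes are toy values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(in_features=4, out_features=2)   # toy sizes, for illustration only
x = torch.rand(3, 4)                                # batch of 3 examples

manual = x @ layer.weight.T + layer.bias            # y = x W^T + b
matches = torch.allclose(layer(x), manual, atol=1e-6)

print(matches)              # True
print(layer.weight.shape)   # torch.Size([2, 4]): (out_features, in_features)
```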

Instantiate the deep neural network

In [None]:
dnn_model = MLPClassifier(input_size, hidden_size, output_size)

Note

With two or more hidden layers, a network is already considered deep learning.

To size each hidden layer independently, use one hyperparameter per layer, for example:

hidden_size_layer_1 <br>
hidden_size_layer_2
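
A minimal sketch of that idea, assuming a hypothetical class named `TwoSizeMLP` that takes one size per hidden layer. It is a variant written for illustration, not the class used in this notebook, and it returns raw scores (logits) rather than log-probabilities.

```python
import torch
import torch.nn as nn

class TwoSizeMLP(nn.Module):
    """Hypothetical variant of the classifier with one size per hidden layer."""

    def __init__(self, input_size, hidden_size_layer_1, hidden_size_layer_2, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size_layer_1)
        self.fc2 = nn.Linear(hidden_size_layer_1, hidden_size_layer_2)
        self.fc3 = nn.Linear(hidden_size_layer_2, output_size)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)   # raw logits, suitable for nn.CrossEntropyLoss

toy_model = TwoSizeMLP(20, 16, 8, 3)    # toy sizes, for illustration only
logits = toy_model(torch.rand(5, 20))
n_params = sum(p.numel() for p in toy_model.parameters())

print(logits.shape)   # torch.Size([5, 3])
print(n_params)       # 499 = (20*16 + 16) + (16*8 + 8) + (8*3 + 3)
```

Shrinking the second hidden layer is one common way to reduce the parameter count, which can matter when the TF-IDF input is very wide.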


Define the loss function and the Adam optimizer used to update the parameters

In [None]:
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(dnn_model.parameters(), lr=0.001)
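
The model ends in LogSoftmax, while nn.CrossEntropyLoss applies log-softmax internally as well. The sketch below, on random toy scores, illustrates that this double application does not change the loss: applying log-softmax to values that are already log-probabilities leaves them unchanged. It also shows the equivalent pairing of log-probabilities with nn.NLLLoss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)              # toy scores: 4 examples, 3 classes
targets = torch.tensor([0, 2, 1, 2])    # hypothetical class indices

log_probs = F.log_softmax(logits, dim=1)

ce_on_logits = nn.CrossEntropyLoss()(logits, targets)
ce_on_log_probs = nn.CrossEntropyLoss()(log_probs, targets)
nll_on_log_probs = nn.NLLLoss()(log_probs, targets)

# All three values agree up to floating-point error.
print(ce_on_logits.item(), ce_on_log_probs.item(), nll_on_log_probs.item())
```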

Note


Adam => optimizer <br>
lr => learning rate (a hyperparameter)
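
To get a feel for what lr controls, the sketch below runs a single Adam step on two toy parameters. On the very first step, Adam's bias-corrected update has a magnitude close to lr for each parameter, regardless of the size of the gradient. The parameter values and the quadratic loss are assumptions for illustration.

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0, -2.0], requires_grad=True)   # toy parameters
optimizer = optim.Adam([w], lr=0.001)

loss = (w ** 2).sum()    # toy loss with gradient 2 * w = [2, -4]
optimizer.zero_grad()
loss.backward()
optimizer.step()

updated = w.detach()
print(updated)           # approximately [0.9990, -1.9990]
```

Both parameters moved by about 0.001 even though one gradient is twice the other, which is the per-parameter scaling that distinguishes Adam from plain gradient descent.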

Model training loop

In [None]:
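num_epochs = 30
for epoch in range(num_epochs):

    dnn_model.train()                        # training mode (matters for layers such as dropout)

    total_loss = 0
    for text, labels in train_loader:

        optimizer.zero_grad()                # clear gradients from the previous batch
        outputs = dnn_model(text)            # forward pass

        loss = loss_function(outputs, labels)

        loss.backward()                      # backpropagate
        optimizer.step()                     # update the weights

        total_loss += loss.item()
    # Average training loss over all batches in this epoch
    print(f'Epoch {epoch+1}, Loss: {total_loss/len(train_loader)}')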

Epoch 1, Loss: 0.9812220296798608
Epoch 2, Loss: 0.6770056203389779
Epoch 3, Loss: 0.2800396441076046
Epoch 4, Loss: 0.06519245912726873
Epoch 5, Loss: 0.01762842723669914
Epoch 6, Loss: 0.007162704736961482
Epoch 7, Loss: 0.0038451227629616954
Epoch 8, Loss: 0.001888960220803244
Epoch 9, Loss: 0.001189758465699971
Epoch 10, Loss: 0.0008078704771693223
Epoch 11, Loss: 0.0005552305100979701
Epoch 12, Loss: 0.00040094052607086126
Epoch 13, Loss: 0.0002954009911870041
Epoch 14, Loss: 0.00023252273562180702
Epoch 15, Loss: 0.00017868839304122585
Epoch 16, Loss: 0.00014034393382481096
Epoch 17, Loss: 0.00011359571966843364
Epoch 18, Loss: 9.210995864496232e-05
Epoch 19, Loss: 7.699193912020956e-05
Epoch 20, Loss: 6.589717464195042e-05
Epoch 21, Loss: 5.384203713896195e-05
Epoch 22, Loss: 4.6685882708539466e-05
Epoch 23, Loss: 4.0033603250450164e-05
Epoch 24, Loss: 3.495854514021164e-05
Epoch 25, Loss: 3.070138045703247e-05
Epoch 26, Loss: 2.6927050466605204e-05
Epoch 27, Loss: 2.41393590651

In [None]:
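y_pred = []
y_test = []

dnn_model.eval()          # switch to inference mode
with torch.no_grad():     # gradients are not needed for evaluation
    for text, labels in test_loader:
        y_prob = dnn_model(text)                # log-probabilities per class
        _, predicted = torch.max(y_prob, 1)     # index of the most likely class
        y_pred.extend(predicted.tolist())
        y_test.extend(labels.tolist())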

In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.31      0.23      0.26      3231
           1       0.56      0.39      0.46     10342
           2       0.44      0.67      0.53      7059

    accuracy                           0.46     20632
   macro avg       0.43      0.43      0.42     20632
weighted avg       0.48      0.46      0.45     20632
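
The two averages in the report can be reproduced from the per-class rows. The sketch below recomputes them for the F1 column using the rounded values printed above, so the results agree with the report only to two decimals.

```python
# Per-class F1 and support, copied from the report above (already rounded to 2 decimals).
f1_scores = [0.26, 0.46, 0.53]
supports = [3231, 10342, 7059]

# Macro average: every class counts equally.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: each class counts in proportion to its number of test examples.
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(round(macro_f1, 2), round(weighted_f1, 2))   # 0.42 0.45
```

Class 1 has the most test examples, so the weighted average is pulled toward its F1 score, while the macro average is dragged down more by the weak result on class 0.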

