# **NLP Final Project – Support Ticket Classifier**

***Participants (RM - NAME):***<br>
RM352122 - Guilherme Ruy<br>
RM350785 - Alexandra Maria Rodrigues Marques Figueira<br>
RM352152 - Henrique da Silva Dergado<br>


**[1] = https://dados-ml-pln.s3.sa-east-1.amazonaws.com/tickets_reclamacoes_classificados.csv**

**[F1 Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)** with average='weighted'
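
To make the `average='weighted'` option concrete, here is a minimal pure-Python sketch of the same idea: compute F1 per class, then average with each class weighted by its number of true samples. The helper names and the toy labels are invented for illustration; in the notebook itself the score comes from `sklearn.metrics.f1_score`.

```python
from collections import Counter

def f1_per_class(y_true, y_pred, label):
    """F1 for one class: harmonic mean of its precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        f1_per_class(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )

# Toy labels (invented): class "a" has 3 true samples, class "b" has 1
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
score = weighted_f1(y_true, y_pred)
print(score)
```

Because larger classes carry more weight, a model that does poorly on a small class such as `Outros` is penalized less than it would be under `average='macro'`.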

In [None]:
# LOADING THE DATA FRAME
import pandas as pd
df = pd.read_csv('https://dados-ml-pln.s3.sa-east-1.amazonaws.com/tickets_reclamacoes_classificados.csv', delimiter=';')

# Download the file and use it locally during testing

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21072 entries, 0 to 21071
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   id_reclamacao         21072 non-null  int64 
 1   data_abertura         21072 non-null  object
 2   categoria             21072 non-null  object
 3   descricao_reclamacao  21072 non-null  object
dtypes: int64(1), object(3)
memory usage: 658.6+ KB


Happy coding!

Use this section to demonstrate the NLP techniques you applied (rules, preprocessing, data treatment, variety of models, pipeline organization, etc.).

Feel free to test and explore preprocessing techniques, NLP approaches, algorithms and libraries, but explain and justify your decisions along the way.

In [None]:
!pip install accelerate -U



### Test A: GenAI Model (BERT)
We chose BERT because it reads context bidirectionally, which helps it capture complex linguistic nuances, and because it is easy to fine-tune for a specific text classification task, which usually gives strong classification results.

In [None]:
import pandas as pd
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score, classification_report
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

url = 'https://dados-ml-pln.s3.sa-east-1.amazonaws.com/tickets_reclamacoes_classificados.csv'
df = pd.read_csv(url, delimiter=';')

nltk.download('stopwords')
nltk.download('punkt')

# 1. Data preprocessing

# Build the stop-word set once instead of rebuilding it on every call
stop_words = set(stopwords.words('portuguese'))

def remove_stopwords(text):
    tokens = word_tokenize(text.lower())
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return " ".join(filtered_tokens)

df['descricao_reclamacao'] = df['descricao_reclamacao'].apply(remove_stopwords)

# 2. Train/test split
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)

X_train = train_df['descricao_reclamacao'].tolist()
y_train = train_df['categoria'].tolist()
X_test = test_df['descricao_reclamacao'].tolist()
y_test = test_df['categoria'].tolist()

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

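# Note: 'bert-base-uncased' was pretrained on English text. If the ticket text is in
# Portuguese, a Portuguese or multilingual checkpoint may be a better fit (not tested here).
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')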
train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(X_test, truncation=True, padding=True, max_length=128)

class TicketDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = TicketDataset(train_encodings, y_train_encoded)
test_dataset = TicketDataset(test_encodings, y_test_encoded)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label_encoder.classes_))

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

# 3. Train the model
trainer.train()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
10,1.6254
20,1.5897
30,1.5883
40,1.5773
50,1.5831
60,1.5594
70,1.5571
80,1.5859
90,1.5249
100,1.5646


TrainOutput(global_step=5928, training_loss=0.5759547327452825, metrics={'train_runtime': 1356.8568, 'train_samples_per_second': 34.943, 'train_steps_per_second': 4.369, 'total_flos': 3118739342625792.0, 'train_loss': 0.5759547327452825, 'epoch': 3.0})

In [None]:
results = trainer.evaluate(eval_dataset=test_dataset)
print("Results:", results)

outputs = trainer.predict(test_dataset)
preds = torch.argmax(torch.tensor(outputs.predictions), axis=1)

# 4. Evaluate the model
f1_genai = f1_score(y_test_encoded, preds, average='weighted')
print(f'F1 Score with GenAI (weighted): {f1_genai}')

print(classification_report(y_test_encoded, preds, target_names=label_encoder.classes_))

Results: {'eval_loss': 0.6063342690467834, 'eval_runtime': 36.9509, 'eval_samples_per_second': 142.568, 'eval_steps_per_second': 17.834, 'epoch': 3.0}
F1 Score with GenAI (weighted): 0.8156947866295493
                                     precision    recall  f1-score   support

Cartão de crédito / Cartão pré-pago       0.82      0.79      0.81      1290
            Hipotecas / Empréstimos       0.84      0.87      0.86       922
                             Outros       0.75      0.75      0.75       549
       Roubo / Relatório de disputa       0.79      0.82      0.80      1204
         Serviços de conta bancária       0.85      0.83      0.84      1303

                           accuracy                           0.82      5268
                          macro avg       0.81      0.81      0.81      5268
                       weighted avg       0.82      0.82      0.82      5268



### Test B: Naive Bayes Model
We tested a Naive Bayes classifier, a simple and efficient model for text data.
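
As a rough illustration of what a multinomial Naive Bayes classifier does, the sketch below implements it in pure Python with Laplace smoothing on raw word counts. The helper names, toy documents and toy labels are invented. The notebook itself uses scikit-learn's `MultinomialNB` on TF-IDF features, so this only shows the idea and is not the code used in the test.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate log priors and Laplace-smoothed log likelihoods from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Add-one smoothing: every vocabulary word gets one extra count per class
        total = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(tokens, log_prior, log_like):
    """Return the class with the highest log posterior; unknown words are ignored."""
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in tokens if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy training data (invented, already tokenized)
docs = [
    ["cartao", "credito", "cobranca"],
    ["cartao", "fatura", "credito"],
    ["emprestimo", "hipoteca", "juros"],
    ["hipoteca", "parcela", "emprestimo"],
]
labels = ["cartao", "cartao", "emprestimo", "emprestimo"]

log_prior, log_like = train_nb(docs, labels)
prediction = predict_nb(["fatura", "cartao"], log_prior, log_like)
print(prediction)
```

The model treats words as independent given the class, which is what keeps training and prediction cheap.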

In [None]:
import pandas as pd

url = "https://dados-ml-pln.s3.sa-east-1.amazonaws.com/tickets_reclamacoes_classificados.csv"
data = pd.read_csv(url, delimiter=';')

# 1. Data preprocessing
# We remove special characters, digits and stop words and convert the text to lowercase to normalize the data and reduce noise.

import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split

nltk.download('stopwords')

def preprocess_text(descricao_reclamacao):
    descricao_reclamacao = re.sub(r'\W', ' ', descricao_reclamacao)
    descricao_reclamacao = re.sub(r'\d', '', descricao_reclamacao)
    descricao_reclamacao = descricao_reclamacao.lower()
    descricao_reclamacao = re.sub(r'\s+', ' ', descricao_reclamacao).strip()
    return descricao_reclamacao

data['descricao_reclamacao'] = data['descricao_reclamacao'].apply(preprocess_text)

stop_words = set(stopwords.words('portuguese'))
data['descricao_reclamacao'] = data['descricao_reclamacao'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))

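# 2. Train/test split

X = data['descricao_reclamacao']
# Fix: the target must be the category column. The original run used
# 'descricao_reclamacao' here by mistake, which is what produced the
# near-zero F1 score shown in the output below.
y = data['categoria']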

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

# 3. Train the model
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

y_pred = nb_model.predict(X_test_tfidf)

# 4. Evaluate the model
f1 = f1_score(y_test, y_pred, average='weighted')
print(f'F1 Score (weighted): {f1}')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


F1 Score (weighted): 0.0015540205175740482
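
A weighted F1 this close to zero usually points to a problem in the pipeline rather than a weak model. One quick check, sketched below with a hypothetical helper and invented toy data, is to compare the number of distinct target values with the number of rows: a classification target should have a handful of classes, not nearly one value per row.

```python
def distinct_ratio(values):
    """Number of distinct values divided by the number of rows."""
    values = list(values)
    return len(set(values)) / len(values)

# Toy stand-ins for the two columns (invented data, not the real dataset)
categories = ["Outros", "Outros", "Hipotecas / Empréstimos", "Outros"]
descriptions = ["texto um", "texto dois", "texto tres", "texto quatro"]

label_ratio = distinct_ratio(categories)    # few classes repeated over many rows
text_ratio = distinct_ratio(descriptions)   # free text: almost every row is unique

print(label_ratio, text_ratio)
```

On a pandas column, `data['categoria'].nunique()` gives the same information. A ratio near 1 on the target suggests that a free-text column was passed as `y`.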


### Test C: Word2Vec Model
Word2Vec is a popular technique for learning semantic representations of words. These dense vectors can then be used as features for NLP tasks such as classification and similarity analysis.
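
The sketch below illustrates how word vectors become a document representation, assuming a tiny hand-written embedding table. The 3-dimensional vectors and the token lists are invented. In the notebook the vectors are learned by gensim's `Word2Vec` and have 100 dimensions.

```python
import math

# Invented 3-dimensional embeddings, for illustration only
toy_vectors = {
    "cartao": [0.9, 0.1, 0.0],
    "credito": [0.8, 0.2, 0.1],
    "hipoteca": [0.1, 0.9, 0.2],
    "emprestimo": [0.2, 0.8, 0.1],
}

def average_vector(tokens, vectors, size):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    return [sum(column) / len(known) for column in zip(*known)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc_card = average_vector(["cartao", "credito"], toy_vectors, 3)
doc_card_2 = average_vector(["credito", "desconhecida"], toy_vectors, 3)  # unknown token is skipped
doc_loan = average_vector(["hipoteca", "emprestimo"], toy_vectors, 3)

sim_same_topic = cosine(doc_card, doc_card_2)
sim_other_topic = cosine(doc_card, doc_loan)
print(sim_same_topic, sim_other_topic)
```

Documents about the same topic end up with similar averaged vectors, which is what lets a linear classifier such as an SVM separate the categories.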

In [None]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score
nltk.download('punkt')

# Load the data from the CSV
url = 'https://dados-ml-pln.s3.sa-east-1.amazonaws.com/tickets_reclamacoes_classificados.csv'
df = pd.read_csv(url, delimiter=';')

# 1. Text preprocessing: lowercase, tokenize, keep only alphabetic tokens that are not stop words

def preprocess_text(descricao_reclamacao):
    tokens = nltk.word_tokenize(descricao_reclamacao.lower())

    stop_words = set(stopwords.words('portuguese'))
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]

    return tokens

df['tokens'] = df['descricao_reclamacao'].apply(preprocess_text)

# 2.1 Train/test split
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)

# 3.1 Train the Word2Vec model on the training split only
model_w2v = Word2Vec(sentences=train_df['tokens'], vector_size=100, window=5, min_count=1, workers=4)

def average_vector(tokens, model, vector_size):
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    if not vectors:
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)

# 2.2 Build train/test features (one averaged vector per ticket) and targets
X_train = np.array([average_vector(tokens, model_w2v, 100) for tokens in train_df['tokens']])
X_test = np.array([average_vector(tokens, model_w2v, 100) for tokens in test_df['tokens']])
y_train = train_df['categoria']
y_test = test_df['categoria']

# 3.2 Train the SVM classifier
svm_clf = SVC(kernel='linear')
svm_clf.fit(X_train, y_train)

y_pred = svm_clf.predict(X_test)

# 4. Evaluate the model
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1 Score using Word2Vec and SVM: {f1}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


F1 Score using Word2Vec and SVM: 0.8828623934466714


### Conclusions from Tests A, B and C

We used the weighted F1 score to evaluate the models. It computes the F1 score (the harmonic mean of precision and recall) for each class and averages them, weighting each class by its number of test samples.

Test A: F1 Score BERT: 0.82

Test B: F1 Score Naive Bayes: 0.0016. This run used the description column as the target `y` instead of `categoria`, so the score reflects that bug rather than what Naive Bayes can do on this task.

**Test C: F1 Score Word2Vec + SVM: 0.88**

Comparing the results, the highest F1 score came from Word2Vec combined with an SVM.

Word2Vec is relatively fast to train and use compared with models such as BERT, and it performed well on this text classification task when combined with a traditional classifier such as an SVM.

Consolidate only the scripts of your **champion model**: loading the data frame, splitting the samples, the treatments applied (functions, cleaning, etc.), creating the text vectorization objects, the trained model, and any other implementation used while developing the model.

The model must reach an F1 score above 75%.

In [None]:
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score
import pandas as pd
nltk.download('punkt')
nltk.download('stopwords')

# Load the data from the CSV
url = 'https://dados-ml-pln.s3.sa-east-1.amazonaws.com/tickets_reclamacoes_classificados.csv'
df = pd.read_csv(url, delimiter=';')

# 1. Text preprocessing: lowercase, tokenize, keep only alphabetic tokens that are not stop words

def preprocess_text(descricao_reclamacao):
    tokens = nltk.word_tokenize(descricao_reclamacao.lower())

    stop_words = set(stopwords.words('portuguese'))
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]

    return tokens

df['tokens'] = df['descricao_reclamacao'].apply(preprocess_text)

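# 2.1 Train/test split
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)

# 3.1 Train the Word2Vec model on the training split only
# Note: with workers=4, Word2Vec training is not fully deterministic, so scores can
# vary slightly between runs, which likely explains the small difference from Test C.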
model_w2v = Word2Vec(sentences=train_df['tokens'], vector_size=100, window=5, min_count=1, workers=4)

def average_vector(tokens, model, vector_size):
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    if not vectors:
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)

# 2.2 Build train/test features (one averaged vector per ticket) and targets
X_train = np.array([average_vector(tokens, model_w2v, 100) for tokens in train_df['tokens']])
X_test = np.array([average_vector(tokens, model_w2v, 100) for tokens in test_df['tokens']])
y_train = train_df['categoria']
y_test = test_df['categoria']

# 3.2 Train the SVM classifier
svm_clf = SVC(kernel='linear')
svm_clf.fit(X_train, y_train)

y_pred = svm_clf.predict(X_test)

# 4. Evaluate the model
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1 Score using Word2Vec and SVM: {f1}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


F1 Score using Word2Vec and SVM: 0.8811131304295705
