# IT Service Ticket Classification

This notebook implements an IT support ticket classification system using RAG (Retrieval-Augmented Generation) with LangGraph.

**Input:** ticket text (string)  
**Output:** `{"classe": "...", "justificativa": "..."}`
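
As a minimal sketch of the output contract above, the result can be modelled as a typed dictionary and checked before it is used downstream. The names `TicketClassification` and `validate_output` are hypothetical and not part of the `classifier` package; the keys `classe` and `justificativa` are kept exactly as the notebook defines them.

```python
from typing import TypedDict


class TicketClassification(TypedDict):
    """Output shape described above (keys kept as in the notebook)."""
    classe: str
    justificativa: str


def validate_output(result: dict, allowed_classes: list[str]) -> TicketClassification:
    """Hypothetical helper: check both keys exist and the class is a known one."""
    missing = {"classe", "justificativa"} - result.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if result["classe"] not in allowed_classes:
        raise ValueError(f"unknown class: {result['classe']!r}")
    return TicketClassification(
        classe=result["classe"],
        justificativa=result["justificativa"],
    )


example = validate_output(
    {"classe": "Storage", "justificativa": "The ticket asks to archive project folders."},
    allowed_classes=["Access", "Hardware", "Storage"],
)
print(example["classe"])
```

Rejecting unknown class names early is one way to catch a model response that drifts from the label set; whether the notebook's pipeline does this is not shown here.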

## 1. Loading and Preparing the Data

In [20]:
from classifier.data import load_dataset, train_test_split_stratified

In [21]:
# Load the dataset
df, classes = load_dataset()
print(f"Total tickets: {len(df):,}")
df.head()

Total tickets: 47,837


Unnamed: 0,Document,Topic_group
0,connection with icon icon dear please setup ic...,Hardware
1,work experience user work experience user hi w...,Access
2,requesting for meeting requesting meeting hi p...,Hardware
3,reset passwords for external accounts re expir...,Access
4,mail verification warning hi has got attached ...,Miscellaneous


In [22]:
# Classes obtained from the dataset
print(f"Classes ({len(classes)}):")
for c in classes:
    print(f"  - {c}")

Classes (8):
  - Access
  - Administrative rights
  - HR Support
  - Hardware
  - Internal Project
  - Miscellaneous
  - Purchase
  - Storage


In [23]:
# Stratified split: training set for RAG, test set for evaluation (200 tickets)
train_df, test_df = train_test_split_stratified(df, test_size=200)

print(f"Train: {len(train_df):,} tickets")
print(f"Test:  {len(test_df)} tickets")

Train: 47,637 tickets
Test:  200 tickets
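
The internals of `train_test_split_stratified` are not shown in this notebook. As a rough sketch of what a stratified split with an absolute test size can look like, the function below allocates test items to each class in proportion to its frequency, using largest-remainder rounding so the counts sum exactly to `test_size`. The function name `stratified_split`, the `seed` parameter, and the toy labels are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed=42):
    """Return (train_idx, test_idx), keeping class proportions in the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = len(labels)
    # Ideal (fractional) number of test items per class.
    quotas = {c: test_size * len(idxs) / total for c, idxs in by_class.items()}
    counts = {c: int(q) for c, q in quotas.items()}
    # Largest-remainder rounding: hand leftover slots to the biggest fractions.
    leftover = test_size - sum(counts.values())
    ranked = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for c in ranked[:leftover]:
        counts[c] += 1

    rng = random.Random(seed)
    test_idx = []
    for c, idxs in by_class.items():
        test_idx.extend(rng.sample(idxs, counts[c]))
    test_set = set(test_idx)
    train_idx = [i for i in range(total) if i not in test_set]
    return train_idx, sorted(test_idx)


# Toy labels (not taken from the dataset): 60% / 30% / 10% class mix.
labels = ["Hardware"] * 60 + ["Access"] * 30 + ["Storage"] * 10
train_idx, test_idx = stratified_split(labels, test_size=10)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), dict(test_counts))
```

Keeping the test set proportional to the class mix means per-class accuracy on a small evaluation set is less likely to be skewed by a class that happens to be over- or under-sampled.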


## 2. RAG - Retrieving Similar Tickets

The retriever uses sentence-transformers to generate ticket embeddings and finds the most similar tickets by cosine similarity.
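
To make the retrieval step concrete, here is a minimal, dependency-free sketch of top-k retrieval by cosine similarity. It is not the `TicketRetriever` implementation: the `embed` function is a toy bag-of-words stand-in for a sentence-transformers model, and `cosine`, `retrieve`, and the three-ticket corpus are hypothetical. The result dictionaries use the same `text`, `class`, and `score` keys that the retrieval cell in this notebook reads.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a sentence-transformers embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    """Score every indexed ticket against the query and return the k best."""
    q = embed(query)
    scored = [
        {"text": text, "class": label, "score": cosine(q, embed(text))}
        for text, label in corpus
    ]
    return sorted(scored, key=lambda t: t["score"], reverse=True)[:k]


# Illustrative corpus, not taken from the dataset.
corpus = [
    ("please archive folder on public storage", "Storage"),
    ("reset password for external account", "Access"),
    ("new laptop keyboard not working", "Hardware"),
]
results = retrieve("archive project folder storage", corpus, k=2)
for r in results:
    print(f"[{r['class']}] (score: {r['score']:.3f}) {r['text']}")
```

A real index would embed the training tickets once and reuse the stored vectors, rather than re-embedding the corpus on every query as this sketch does.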

In [24]:
from classifier.rag import TicketRetriever

In [25]:
# Index the training tickets
retriever = TicketRetriever()
retriever.index(train_df)

Batches:   0%|          | 0/1489 [00:00<?, ?it/s]

In [28]:
# Test retrieval with a ticket from the test set
test_ticket = test_df.iloc[0]
query = test_ticket["Document"]
true_class = test_ticket["Topic_group"]

similar = retriever.retrieve(query, k=5)

print(f"Test ticket (true class: {true_class}):")
print(f"{query[:200]}...\n")
print("Similar tickets retrieved:")
for i, ticket in enumerate(similar, 1):
    print(f"\n{i}. [{ticket['class']}] (score: {ticket['score']:.3f})")
    print(f"   {ticket['text']}")

Test ticket (true class: Storage):
archive project folders wednesday december pm archive folders please archive folder documentation after rd move backup process folder storage size attention folder includes videos sessions done launch...

Similar tickets retrieved:

1. [Storage] (score: 0.695)
   enable archive folder thursday march pm archive folder create archive folder thanks

2. [Storage] (score: 0.685)
   re new folder storage to be created thursday re folder storage created dear please give folder thanks july pm folder storage created dear please create folder storage internship give next users finish process internship copy information delete folder thank lot specialist

3. [Storage] (score: 0.681)
   marketing folder on public storage archive folder public storage archive hi has upload files stored folder public storage running low done older than right future never please archive folder archive october done thank specialist

4. [Storage] (score: 0.663)
   am week sto