# 🧠 EMBEDDING GENERATION - Stage 5/6

## 📋 What this notebook does

This notebook **converts text into vectors** with an embedding model so the chunks can be searched semantically:

- 📚 **Loads chunks** from `pipeline-data/chunks/chunks.jsonl`
- 🤖 **Uses the BAAI/bge-m3 model** (multilingual, covers Portuguese)
- ⚡ **Processes in batches** of 32 chunks at a time (configurable via `BATCH_SIZE`)
- 🎯 **Normalizes vectors** to unit length for cosine-similarity search
- 💾 **Saves the chunks enriched with embeddings** as JSONL

## 🔧 BAAI/bge-m3 model

- **1024 dimensions** per vector
- **Multilingual**: supports more than 100 languages, including Portuguese and English
- **L2 normalization** applied at encode time (`normalize_embeddings=True`), so cosine similarity reduces to a dot product
- **Designed for dense retrieval**, which is the semantic search use case of this pipeline
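
The normalization point can be checked with a few lines of arithmetic. The sketch below uses toy 3-dimensional vectors in place of real 1024-dimensional embeddings, and the helper names (`l2_normalize`, `dot`, `cosine_similarity`) are illustrative, not part of the notebook. It shows that the dot product of two unit-length vectors equals the cosine similarity of the original vectors.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity computed the long way, with explicit norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for real embeddings
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

cosine = cosine_similarity(a, b)
dot_of_unit = dot(l2_normalize(a), l2_normalize(b))

print(f"cosine similarity:   {cosine:.4f}")
print(f"dot of unit vectors: {dot_of_unit:.4f}")
```

Both values agree, which is why a vector store can rank normalized embeddings with a plain dot product.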

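## 📊 Expected output

322 chunks with embeddings saved to `pipeline-data/embeddings/embeddings.jsonl`. The vectors take about 1.3 MB as raw float32 values; the JSONL file itself is larger because each float is stored as text.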
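
The 1.3 MB figure follows from simple arithmetic on the vector count, dimension, and 4 bytes per float32 value. The sketch below reproduces it and, with a made-up sample component, illustrates why a float written as JSON text takes more than 4 bytes.

```python
import json

NUM_CHUNKS = 322         # chunks reported by this notebook
DIMENSIONS = 1024        # bge-m3 vector size
BYTES_PER_FLOAT32 = 4

raw_bytes = NUM_CHUNKS * DIMENSIONS * BYTES_PER_FLOAT32
raw_mb = raw_bytes / (1024 * 1024)
print(f"{raw_bytes} bytes = {raw_mb:.1f} MB")  # 1318912 bytes = 1.3 MB

# Hypothetical vector component, only to compare binary and text sizes
sample_value = -0.0327148438
text_bytes = len(json.dumps(sample_value))
print(f"one float as JSON text: {text_bytes} bytes vs {BYTES_PER_FLOAT32} bytes as float32")
```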

---

## 🔧 Configuration and Setup

In [1]:
import os
import json
from pathlib import Path
from sentence_transformers import SentenceTransformer
import numpy as np
import time
from datetime import datetime, timezone

# Mark the start of the run (UTC, so the "Z" suffix is accurate)
stage_start = time.time()
start_timestamp = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")

# Configuration (overridable through environment variables)
MODEL_NAME = os.getenv("EMBEDDING_MODEL", "BAAI/bge-m3")
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "32"))

# Directories
chunks_dir = Path("pipeline-data/chunks")
embeddings_dir = Path("pipeline-data/embeddings")
embeddings_dir.mkdir(parents=True, exist_ok=True)

# Remove files left in the embeddings directory by previous runs
for f in embeddings_dir.glob("*"):
    if f.is_file():
        f.unlink()

print(f"Model: {MODEL_NAME}")
print(f"Batch size: {BATCH_SIZE}")

Model: BAAI/bge-m3
Batch size: 32
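
Both settings come from environment variables with fallbacks, so they can be changed without editing the notebook. A minimal sketch of that lookup as a function follows; `read_config` and its positive-value check are assumptions added for illustration, not something the notebook defines.

```python
import os

def read_config(env=os.environ):
    """Read the model name and batch size, falling back to the notebook's defaults."""
    model_name = env.get("EMBEDDING_MODEL", "BAAI/bge-m3")
    batch_size = int(env.get("BATCH_SIZE", "32"))
    if batch_size < 1:
        # Assumed guard: the notebook itself does not validate the value
        raise ValueError(f"BATCH_SIZE must be a positive integer, got {batch_size}")
    return model_name, batch_size

defaults = read_config({})                      # nothing set: defaults apply
overridden = read_config({"BATCH_SIZE": "8"})   # smaller batches, e.g. for limited memory

print(defaults)
print(overridden)
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test.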


## 🤖 Model Loading

In [2]:
# Load the model
print("Loading embedding model...")
model = SentenceTransformer(MODEL_NAME)
embedding_dim = model.get_sentence_embedding_dimension()

print(f"✅ Model loaded: {MODEL_NAME}")
print(f"Dimensions: {embedding_dim}")

Loading embedding model...

✅ Model loaded: BAAI/bge-m3
Dimensions: 1024


In [3]:
# Load chunks
chunks_file = chunks_dir / "chunks.jsonl"

if not chunks_file.exists():
    raise FileNotFoundError(f"Chunks file not found: {chunks_file}")

chunks = []
with open(chunks_file, "r", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines, which json.loads would reject
        chunks.append(json.loads(line))

print(f"Chunks loaded: {len(chunks)}")

# Show a few examples (first 50 characters of each)
for i, chunk in enumerate(chunks[:3]):
    preview = chunk["text"][:50] + "..." if len(chunk["text"]) > 50 else chunk["text"]
    print(f"  {i+1}. {preview}")

Chunks loaded: 322
  1. # Visão Geral do Self Checkout

## Introdução

Est...
  2. O objetivo desta documentação é descrever o fluxo ...
  3. - Processo de identificação do cliente via CPF ou ...
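
Later cells read the fields `chunk_id`, `source_document`, `chunk_index`, `text`, and `char_count` from every chunk, so a malformed record would only fail after the model has run. A hedged sketch of an early check is shown below; `parse_chunks` and the sample record are hypothetical and only illustrate the idea.

```python
import json

# Fields that the later cells of this notebook read from each chunk
REQUIRED_KEYS = {"chunk_id", "source_document", "chunk_index", "text", "char_count"}

def parse_chunks(lines):
    """Parse JSONL lines, skipping blanks and rejecting records with missing fields."""
    chunks = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        chunk = json.loads(line)
        missing = REQUIRED_KEYS - set(chunk)
        if missing:
            raise ValueError(f"line {line_number}: missing fields {sorted(missing)}")
        chunks.append(chunk)
    return chunks

# Made-up sample input: one valid record followed by a blank line
sample_lines = [
    '{"chunk_id": "doc-0", "source_document": "doc.md", "chunk_index": 0, '
    '"text": "hello", "char_count": 5}',
    "",
]
parsed = parse_chunks(sample_lines)
print(len(parsed))
```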


## 📚 Chunk Processing

In [4]:
# Generate embeddings
print("Generating embeddings...")

# Extract the text of each chunk
texts = [chunk["text"] for chunk in chunks]

# Encode in batches; normalize_embeddings=True returns unit-length vectors
embeddings = model.encode(
    texts,
    batch_size=BATCH_SIZE,
    show_progress_bar=True,
    normalize_embeddings=True
)

print(f"✅ Embeddings generated: {len(embeddings)}")
print(f"Shape: {embeddings.shape}")

Generating embeddings...

✅ Embeddings generated: 322
Shape: (322, 1024)
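
With 322 texts and a batch size of 32, the encoder works through ceil(322 / 32) = 11 batches, the last one holding the 2 leftover texts. The sketch below only illustrates that arithmetic with placeholder strings. `iter_batches` is a hypothetical helper, and the library may order or group texts differently internally.

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"chunk {i}" for i in range(322)]  # placeholder texts, not real chunks
batches = list(iter_batches(texts, 32))

print(len(batches))              # 11
print(len(batches[-1]))          # 2
print(math.ceil(len(texts) / 32))  # 11
```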


In [5]:
# Combine each chunk with its embedding
chunks_with_embeddings = []

for chunk, embedding in zip(chunks, embeddings):
    chunk_with_embedding = {
        "chunk_id": chunk["chunk_id"],
        "source_document": chunk["source_document"],
        "chunk_index": chunk["chunk_index"],
        "text": chunk["text"],
        "char_count": chunk["char_count"],
        "embedding": embedding.tolist(),  # NumPy array -> JSON-serializable list
        "embedding_model": MODEL_NAME,
        "embedding_dimensions": len(embedding)
    }
    chunks_with_embeddings.append(chunk_with_embedding)

print(f"Chunks with embeddings: {len(chunks_with_embeddings)}")

Chunks with embeddings: 322


## 💾 Storage and Statistics

In [6]:
# Save embeddings
embeddings_file = embeddings_dir / "embeddings.jsonl"

with open(embeddings_file, "w", encoding="utf-8") as f:
    for chunk_data in chunks_with_embeddings:
        f.write(json.dumps(chunk_data, ensure_ascii=False) + "\n")

print(f"✅ Embeddings saved: {embeddings_file}")

# Statistics
# Size estimate assuming 4 bytes (float32) per value; the JSONL file on disk is larger
total_size_mb = (len(chunks_with_embeddings) * embedding_dim * 4) / (1024 * 1024)
avg_magnitude = float(np.mean([np.linalg.norm(chunk["embedding"]) for chunk in chunks_with_embeddings]))

print(f"\n📊 Statistics:")
print(f"  Total embeddings: {len(chunks_with_embeddings)}")
print(f"  Dimensions: {embedding_dim}")
print(f"  Estimated size (float32): {total_size_mb:.1f} MB")
print(f"  Mean magnitude: {avg_magnitude:.3f}")
print(f"  Model: {MODEL_NAME}")

✅ Embeddings saved: pipeline-data/embeddings/embeddings.jsonl

📊 Statistics:
  Total embeddings: 322
  Dimensions: 1024
  Estimated size (float32): 1.3 MB
  Mean magnitude: 1.000
  Model: BAAI/bge-m3
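
A mean magnitude of 1.000 confirms the vectors are unit length, so a later stage can rank them against a query with a plain dot product. The sketch below shows one way to read such a JSONL file back and pick the closest chunks. It uses a toy file with 2-dimensional vectors, and `load_embeddings` and `top_k` are hypothetical helpers, not part of this pipeline.

```python
import json
import tempfile
from pathlib import Path

def load_embeddings(path):
    """Read one JSON record per non-blank line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def top_k(query_vector, records, k=3):
    """Rank records by dot product with the query (cosine similarity for unit vectors)."""
    scored = [
        (sum(q * v for q, v in zip(query_vector, r["embedding"])), r["chunk_id"])
        for r in records
    ]
    return sorted(scored, reverse=True)[:k]

# Toy records with 2-dimensional unit vectors standing in for the real 1024-dimensional ones
toy_records = [
    {"chunk_id": "a", "text": "first", "embedding": [1.0, 0.0]},
    {"chunk_id": "b", "text": "second", "embedding": [0.0, 1.0]},
    {"chunk_id": "c", "text": "third", "embedding": [0.6, 0.8]},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in toy_records) + "\n", encoding="utf-8")
    records = load_embeddings(path)

# In practice the query vector would come from model.encode(question, normalize_embeddings=True)
query = [0.8, 0.6]
results = top_k(query, records, k=2)
print([chunk_id for _, chunk_id in results])
```

With NumPy, the same ranking is a single matrix-vector product over the stacked embeddings.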


## 📊 Execution Report

In [7]:
from datetime import timezone  # needed for the UTC timestamp below

# Compute the stage duration
stage_duration = time.time() - stage_start

# Load the existing report, if any
report_path = Path("pipeline-data/report.json")
if report_path.exists():
    with open(report_path, "r", encoding="utf-8") as f:
        report = json.load(f)
else:
    report = {"stages": [], "context": {}, "summary": {}}

# Record the embedding model in the context
report.setdefault("context", {})["embedding_model"] = MODEL_NAME

# Information for this stage
stage_report = {
    "stage": 5,
    "name": "Embedding Generation",
    "status": "SUCCESS" if len(chunks_with_embeddings) > 0 else "FAILED",
    "start_time": start_timestamp,
    "duration_seconds": round(stage_duration, 2),
    "results": {
        "chunks_loaded": len(chunks),
        "embeddings_generated": len(embeddings),
        "embedding_dimensions": embedding_dim,
        "batch_size": BATCH_SIZE,
        "storage_size_mb": round(total_size_mb, 2),
        "avg_magnitude": round(float(avg_magnitude), 3)
    }
}

# Replace the existing stage 5 entry, or append one if it is missing
stages = report.setdefault("stages", [])
for i, stage in enumerate(stages):
    if stage["stage"] == 5:
        stages[i] = stage_report
        break
else:
    stages.append(stage_report)

# Update the timestamp (UTC, so the "Z" suffix is accurate)
report.setdefault("summary", {})["last_update"] = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")

# Save the updated report
with open(report_path, "w", encoding="utf-8") as f:
    json.dump(report, f, indent=2, ensure_ascii=False)

print(f"\n📊 Report updated: {report_path}")
print(f"⏱️ Stage duration: {stage_duration:.2f}s")


📊 Report updated: pipeline-data/report.json
⏱️ Stage duration: 71.37s
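
The report update is an "insert or replace" on the list of stages, which keeps re-runs of this notebook from adding duplicate entries. A minimal sketch of that logic as a reusable function follows; `upsert_stage` is a hypothetical name and the sample report is made up.

```python
def upsert_stage(report, stage_report):
    """Replace the entry with the same stage number, or append it if none exists."""
    stages = report.setdefault("stages", [])
    for i, stage in enumerate(stages):
        if stage.get("stage") == stage_report["stage"]:
            stages[i] = stage_report
            return report
    stages.append(stage_report)
    return report

# Made-up report with one earlier stage already recorded
report = {"stages": [{"stage": 4, "status": "SUCCESS"}]}
upsert_stage(report, {"stage": 5, "status": "FAILED"})    # first run: appended
upsert_stage(report, {"stage": 5, "status": "SUCCESS"})   # re-run: replaced in place

print([(s["stage"], s["status"]) for s in report["stages"]])
```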
