# 💾 QDRANT STORAGE - Stage 6/6

## 📋 What this notebook does

This notebook **stores the embeddings in the Qdrant vector database** for semantic search:

- 🔗 **Connects to Qdrant** using the configured URL and API key
- 📊 **Creates or verifies the collection** with the embedding dimension and the COSINE distance metric
- 🔄 **Atomic reindexing**: deletes and reinserts the chunks of each file so the index stays consistent
- 🏷️ **Complete metadata**: includes repo, branch, commit, versioning, and hashes
- 🆔 **Deterministic IDs**: re-running the notebook does not create duplicates

## 🎯 Required configuration

- `QDRANT_URL`: URL of the Qdrant server (default: http://qdrant.codrstudio.dev:6333)
- `QDRANT_API_KEY`: access key (required)
- `QDRANT_COLLECTION`: collection name (default: "nic")
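
The three settings above could be read and validated in one place. The sketch below is illustrative only: `load_qdrant_config` is a hypothetical helper (the notebook reads the variables inline), and it accepts the environment as a plain mapping so it can be tried without touching `os.environ`.

```python
import os
from typing import Mapping, Optional

# Defaults taken from the configuration list above.
DEFAULT_URL = "http://qdrant.codrstudio.dev:6333"
DEFAULT_COLLECTION = "nic"


def load_qdrant_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: collect the settings and fail early if the key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("QDRANT_API_KEY")
    if not api_key:
        raise ValueError("QDRANT_API_KEY is required")
    return {
        "url": env.get("QDRANT_URL", DEFAULT_URL),
        "api_key": api_key,
        "collection": env.get("QDRANT_COLLECTION", DEFAULT_COLLECTION),
    }


# Only the required key is supplied, so the other two fall back to the defaults.
config = load_qdrant_config({"QDRANT_API_KEY": "example-key"})
print(config["url"], config["collection"])
```

Failing before any network call keeps a missing key from surfacing later as a less obvious authentication error.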

## 🔧 Main features

- **FULL mode**: complete per-file reindexing (DELETE + UPSERT)
- **Versioning**: tracks the model and tokenizer versions
- **Rich metadata**: full traceability from GitLab to Qdrant
- **Consistent IDs**: derived from a deterministic hash to avoid duplicates
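
As one illustration of how the stored metadata supports traceability, a payload's `content_sha256` can be recomputed from its `text` to confirm the chunk was not altered after indexing. This is a minimal sketch, assuming a payload dict shaped like the ones this notebook stores; `payload_is_intact` is a hypothetical helper and the payload values are made up.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 of the UTF-8 encoded chunk text, as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def payload_is_intact(payload: dict) -> bool:
    """Hypothetical check: does the stored hash still match the stored text?"""
    return payload.get("content_sha256") == content_hash(payload.get("text", ""))


# Made-up payload for illustration
payload = {"text": "Visão geral", "content_sha256": content_hash("Visão geral")}
print(payload_is_intact(payload))

# Changing a single character of the text invalidates the stored hash
tampered = dict(payload, text="Visão Geral")
print(payload_is_intact(tampered))
```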

## 📊 Expected output

All chunks (~200-400) inserted into Qdrant with complete metadata for efficient semantic search.

---

## 🔧 Configuration and Connection

In [1]:
import os
import json
import hashlib
from pathlib import Path
from datetime import datetime, timezone
from collections import defaultdict
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue, FilterSelector
)
import time

# Mark the start of execution
stage_start = time.time()
# Take the timestamp in UTC so the "Z" suffix is accurate
start_timestamp = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")

# Read the commit from the context saved by notebook 2
try:
    with open("pipeline-data/context.json", "r", encoding="utf-8") as f:
        context = json.load(f)
    GITLAB_COMMIT = context["gitlab_commit"]
    print(f"📍 Commit read from context: {GITLAB_COMMIT[:8]}")
except (OSError, KeyError, json.JSONDecodeError):
    # Missing file, missing key, or invalid JSON: fall back to a placeholder
    GITLAB_COMMIT = "unknown"
    print("⚠️ Context not found, using commit 'unknown'")

# Configuration
QDRANT_URL = os.getenv("QDRANT_URL", "http://qdrant.codrstudio.dev:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
COLLECTION_NAME = os.getenv("QDRANT_COLLECTION", "nic")

# GitLab configuration (for metadata)
GITLAB_REPO = "nic/documentacao/base-de-conhecimento"
GITLAB_BRANCH = os.getenv("GITLAB_BRANCH", "main")

# Versioning
EMBED_MODEL_MAJOR = "v1"  # Major version of the model
EMBED_MODEL_FULL = "BAAI/bge-m3"
TOKENIZER_MAJOR = "v1"
TOKENIZER_FULL = "RecursiveCharacterTextSplitter-500-100"

if not QDRANT_API_KEY:
    raise ValueError("QDRANT_API_KEY is required")

# Directories
embeddings_dir = Path("pipeline-data/embeddings")

print(f"Qdrant URL: {QDRANT_URL}")
print(f"Collection: {COLLECTION_NAME}")
print(f"Repository: {GITLAB_REPO}")
print(f"Branch: {GITLAB_BRANCH}")
print(f"API Key: ***{QDRANT_API_KEY[-4:] if len(QDRANT_API_KEY) > 4 else '***'}")

def calculate_content_hash(text: str) -> str:
    """Compute the SHA-256 hash of the content"""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

def generate_deterministic_id(repo: str, relpath: str, chunk_index: int,
                            tokenizer_major: str, embed_model_major: str) -> int:
    """Generate a deterministic integer ID for Qdrant points"""
    # Combine every element that makes the chunk unique
    id_string = f"{repo}:{relpath}:{chunk_index}:{tokenizer_major}:{embed_model_major}"

    # Hash the string to obtain a deterministic numeric ID
    hash_object = hashlib.sha256(id_string.encode('utf-8'))
    # Convert the first 8 bytes of the hash to an unsigned integer
    return int.from_bytes(hash_object.digest()[:8], byteorder='big')

📍 Commit read from context: e9c8a430
Qdrant URL: http://qdrant.codrstudio.dev:6333
Collection: nic
Repository: nic/documentacao/base-de-conhecimento
Branch: main
API Key: ***d857
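
The properties that `generate_deterministic_id` relies on can be checked in isolation. The sketch below repeats the same hashing scheme under a different name (`deterministic_id`) so that it runs on its own; the file path is made up for illustration.

```python
import hashlib


def deterministic_id(repo: str, relpath: str, chunk_index: int,
                     tokenizer_major: str, embed_model_major: str) -> int:
    """Same scheme as the cell above: first 8 bytes of a SHA-256, read as an unsigned integer."""
    key = f"{repo}:{relpath}:{chunk_index}:{tokenizer_major}:{embed_model_major}"
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], byteorder="big")


REPO = "nic/documentacao/base-de-conhecimento"
PATH = "docs/example.md"  # made-up path

a = deterministic_id(REPO, PATH, 0, "v1", "v1")
b = deterministic_id(REPO, PATH, 0, "v1", "v1")
c = deterministic_id(REPO, PATH, 1, "v1", "v1")
d = deterministic_id(REPO, PATH, 0, "v1", "v2")

print(a == b)          # identical inputs give the identical ID
print(a != c, a != d)  # another chunk index or model version gives another ID
print(0 <= a < 2**64)  # 8 bytes always fit in an unsigned 64-bit integer
```

Because the ID depends on the chunk index rather than on the chunk text, editing a chunk in place overwrites the same point instead of creating a new one.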


In [2]:
# Connect to Qdrant
client = QdrantClient(
    url=QDRANT_URL,
    api_key=QDRANT_API_KEY
)

# Verify the connection
collections = client.get_collections()
print("✅ Connected to Qdrant")
print(f"Existing collections: {len(collections.collections)}")

for col in collections.collections:
    print(f"  - {col.name}")

✅ Connected to Qdrant
Existing collections: 3
  - documents
  - nic_storage
  - nic


## 📊 Data Preparation

In [3]:
# Load embeddings
embeddings_file = embeddings_dir / "embeddings.jsonl"

if not embeddings_file.exists():
    raise FileNotFoundError(f"Embeddings file not found: {embeddings_file}")

embeddings_data = []
with open(embeddings_file, "r", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue  # Skip blank lines (e.g. a trailing newline)
        data = json.loads(line)
        embeddings_data.append(data)

print(f"Embeddings loaded: {len(embeddings_data)}")

# Check dimensions
if embeddings_data:
    embedding_dim = len(embeddings_data[0]["embedding"])
    print(f"Embedding dimensions: {embedding_dim}")
else:
    raise ValueError("No embeddings found")

# Enrich with complete metadata
timestamp_now = datetime.now().isoformat()

for embedding in embeddings_data:
    # Add expanded metadata
    embedding["metadata"] = {
        # Repository identification
        "repo": GITLAB_REPO,
        "branch": GITLAB_BRANCH,
        "relpath": embedding["source_document"],  # Relative path
        "commit": GITLAB_COMMIT,
        "last_updated": timestamp_now,

        # Embedding versioning
        "embed_model_major": EMBED_MODEL_MAJOR,
        "embed_model_full": EMBED_MODEL_FULL,
        "tokenizer_major": TOKENIZER_MAJOR,
        "tokenizer_full": TOKENIZER_FULL,

        # Hashes for validation
        "content_sha256": calculate_content_hash(embedding["text"]),
        # doc_sha256 would be the hash of the full document (implement if needed)

        # Other metadata
        "lang": "pt-BR",  # Brazilian Portuguese
        "processing_date": timestamp_now,
        "pipeline_version": "1.0.0"
    }

print("✅ Metadata added to all embeddings")

# Sample metadata
if embeddings_data:
    print("\nSample metadata:")
    sample = embeddings_data[0]["metadata"]
    for key, value in sample.items():
        print(f"  {key}: {value}")
        if len(str(value)) > 50:  # Stop after the first long value to keep the sample short
            break

Embeddings loaded: 322
Embedding dimensions: 1024
✅ Metadata added to all embeddings

Sample metadata:
  repo: nic/documentacao/base-de-conhecimento
  branch: main
  relpath: 30-Aprovados/Mapas/Visão Geral do Self Checkout
  commit: e9c8a430b8bc05c306cc8fb342f42e7b45a18744
  last_updated: 2025-08-18T11:53:57.896605
  embed_model_major: v1
  embed_model_full: BAAI/bge-m3
  tokenizer_major: v1
  tokenizer_full: RecursiveCharacterTextSplitter-500-100
  content_sha256: 369066d281dea4e27277770c635a21cc61edf93e0a73a18d5ef5b26f59ce3edf
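
The cell above reads the dimension from the first record only. A stricter loader could require every vector to have the same length before anything is sent to Qdrant. The following is a minimal sketch under that assumption; `load_embeddings` is a hypothetical helper and the sample records are made up (3 dimensions instead of 1024).

```python
import json
from typing import Iterable, List


def load_embeddings(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL records and require that all vectors share one dimension."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        raise ValueError("no embeddings found")
    expected = len(records[0]["embedding"])
    for i, rec in enumerate(records):
        actual = len(rec["embedding"])
        if actual != expected:
            raise ValueError(f"record {i}: dimension {actual}, expected {expected}")
    return records


# Tiny made-up input; the empty string stands for a trailing blank line.
sample_lines = [
    '{"chunk_id": "a", "embedding": [0.1, 0.2, 0.3]}',
    '{"chunk_id": "b", "embedding": [0.4, 0.5, 0.6]}',
    '',
]
records = load_embeddings(sample_lines)
print(len(records), len(records[0]["embedding"]))
```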


In [4]:
# Verify or create the collection
collection_exists = False
try:
    collection_info = client.get_collection(COLLECTION_NAME)
    collection_exists = True
    print(f"\n📊 Collection '{COLLECTION_NAME}' already exists:")
    print(f"  Points: {collection_info.points_count}")
    print(f"  Status: {collection_info.status}")

    # In FULL mode, everything is reindexed
    print("\n🔄 FULL mode: complete reindexing will be performed")
    print("  Existing files will be replaced atomically")

except Exception:
    # Any failure to fetch the collection is treated as "does not exist"
    print(f"Collection '{COLLECTION_NAME}' does not exist, creating...")

if not collection_exists:
    client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(
            size=embedding_dim,
            distance=Distance.COSINE
        )
    )
    print(f"✅ Collection '{COLLECTION_NAME}' created")
    print(f"  Dimensions: {embedding_dim}")
    print("  Distance: COSINE")
else:
    print(f"✅ Collection '{COLLECTION_NAME}' verified")


📊 Collection 'nic' already exists:
  Points: 644
  Status: green

🔄 FULL mode: complete reindexing will be performed
  Existing files will be replaced atomically
✅ Collection 'nic' verified
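
The collection is configured with the COSINE distance, which compares the direction of two vectors and ignores their length. The pure-Python sketch below shows the underlying formula on made-up vectors; it is an illustration of the metric, not of how Qdrant computes it internally.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """dot(u, v) / (|u| * |v|): close to 1 for the same direction, 0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


query = [1.0, 2.0, 3.0]
same_direction = cosine_similarity(query, [2.0, 4.0, 6.0])  # scaled copy of the query
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(round(same_direction, 6), orthogonal)
```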


In [5]:
# Group embeddings by file for atomic reindexing
chunks_by_file = defaultdict(list)

for embedding_data in embeddings_data:
    source_doc = embedding_data["source_document"]
    chunks_by_file[source_doc].append(embedding_data)

print(f"📁 Unique files: {len(chunks_by_file)}")
print(f"📊 Total chunks: {sum(len(chunks) for chunks in chunks_by_file.values())}")
print("\n📈 Chunk distribution per file:")

# Show statistics
for doc, chunks in sorted(chunks_by_file.items(),
                          key=lambda x: len(x[1]), reverse=True)[:5]:
    print(f"  {doc}: {len(chunks)} chunks")

if len(chunks_by_file) > 5:
    print(f"  ... and {len(chunks_by_file) - 5} more files")

# Per-file reindexing function
def reindex_file(client, collection_name, source_document, chunks_data):
    """
    Atomic per-file reindexing with complete metadata.
    Implements the guideline: DELETE by filter + UPSERT with deterministic IDs.
    The two steps are separate requests, so a failure in between leaves the
    file without chunks until the next run.
    """
    # STEP 1: Delete the existing chunks of this file.
    # A filter that matches nothing is not an error (first run), so any
    # exception raised here is a real failure and propagates to the caller.
    delete_filter = Filter(
        must=[
            FieldCondition(key="repo", match=MatchValue(value=GITLAB_REPO)),
            FieldCondition(key="branch", match=MatchValue(value=GITLAB_BRANCH)),
            FieldCondition(key="relpath", match=MatchValue(value=source_document))
        ]
    )

    client.delete(
        collection_name=collection_name,
        points_selector=FilterSelector(filter=delete_filter)
    )

    # STEP 2: Prepare the new points with deterministic IDs and complete metadata
    points = []
    for chunk_data in chunks_data:
        # Generate the deterministic ID
        point_id = generate_deterministic_id(
            repo=GITLAB_REPO,
            relpath=chunk_data["source_document"],
            chunk_index=chunk_data["chunk_index"],
            tokenizer_major=TOKENIZER_MAJOR,
            embed_model_major=EMBED_MODEL_MAJOR
        )

        # Complete payload with all metadata
        payload = {
            # Original chunk data
            "chunk_id": chunk_data["chunk_id"],
            "chunk_index": chunk_data["chunk_index"],
            "text": chunk_data["text"],
            "char_count": chunk_data["char_count"],

            # Expanded metadata (all fields required by the guideline)
            "repo": GITLAB_REPO,
            "branch": GITLAB_BRANCH,
            "relpath": chunk_data["source_document"],
            "source_document": chunk_data["source_document"],  # Kept for compatibility
            "commit": GITLAB_COMMIT,
            "last_updated": chunk_data["metadata"]["last_updated"],

            # Versioning
            "embed_model_major": EMBED_MODEL_MAJOR,
            "embed_model_full": EMBED_MODEL_FULL,
            "tokenizer_major": TOKENIZER_MAJOR,
            "tokenizer_full": TOKENIZER_FULL,
            "embedding_model": chunk_data["embedding_model"],  # Kept for compatibility

            # Hashes
            "content_sha256": chunk_data["metadata"]["content_sha256"],

            # Other
            "lang": "pt-BR",
            "processing_date": chunk_data["metadata"]["processing_date"],
            "pipeline_version": chunk_data["metadata"]["pipeline_version"]
        }

        point = PointStruct(
            id=point_id,
            vector=chunk_data["embedding"],
            payload=payload
        )
        points.append(point)

    # STEP 3: Upsert (insert or update) the new points
    if points:
        client.upsert(
            collection_name=collection_name,
            points=points,
            wait=True  # Wait for confirmation
        )

    return len(points)

# Preview the inputs of the function using the first file
test_file = list(chunks_by_file.keys())[0]
test_chunks = chunks_by_file[test_file][:1]  # Only 1 chunk for the test

print("\n🧪 Preparing reindexing:")
print(f"  Sample deterministic ID: {generate_deterministic_id(GITLAB_REPO, test_file, 0, TOKENIZER_MAJOR, EMBED_MODEL_MAJOR)}")
print(f"  Test file: {test_file}")

📁 Unique files: 31
📊 Total chunks: 322

📈 Chunk distribution per file:
  30-Aprovados/Mapas/Visão Geral do Self Checkout: 51 chunks
  30-Aprovados/Tópicos/Componentes principais do sistema: 19 chunks
  30-Aprovados/Tópicos/Pré-requisitos técnicos: 17 chunks
  30-Aprovados/Mapas/Visão Geral do NIC: 15 chunks
  30-Aprovados/Tópicos/Padrões de Documentação do NIC: 15 chunks
  ... and 26 more files

🧪 Preparing reindexing:
  Sample deterministic ID: 4458554930477711609
  Test file: 30-Aprovados/Mapas/Visão Geral do Self Checkout
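
Why the DELETE step matters even though the IDs are deterministic: when a file shrinks, an upsert alone would leave its old trailing chunks in the collection. The sketch below simulates this with a plain dictionary standing in for the collection. It does not use the Qdrant API; `reindex_file_in_memory` and the file names are hypothetical, and readable tuples replace the hashed integer IDs.

```python
from typing import Dict, List, Tuple

# In-memory stand-in for a collection: point ID -> payload.
Store = Dict[Tuple[str, int], dict]


def reindex_file_in_memory(store: Store, relpath: str, chunks: List[str]) -> int:
    """Mimic DELETE by filter followed by UPSERT for one file."""
    # Step 1: remove every point whose payload belongs to this file
    for point_id in [pid for pid, p in store.items() if p["relpath"] == relpath]:
        del store[point_id]
    # Step 2: write the new chunks under IDs derived from (file, index)
    for index, text in enumerate(chunks):
        store[(relpath, index)] = {"relpath": relpath, "chunk_index": index, "text": text}
    return len(chunks)


store: Store = {}
reindex_file_in_memory(store, "a.md", ["one", "two", "three"])
reindex_file_in_memory(store, "b.md", ["other"])

# The file shrinks from three chunks to two; the stale third chunk must disappear.
reindex_file_in_memory(store, "a.md", ["one (edited)", "two"])
print(sorted(store))
```

After the last call the store holds two points for `a.md` and one for `b.md`; without step 1 the old `("a.md", 2)` entry would still be returned by searches.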


## 💾 Reindexing and Storage

In [6]:
# FULL mode: reindex all files
print("🚀 STARTING FULL REINDEX")
print("  Mode: DELETE + UPSERT per file")
print(f"  Repo: {GITLAB_REPO}")
print(f"  Branch: {GITLAB_BRANCH}")
print(f"  Commit: {GITLAB_COMMIT}")
print("=" * 60)

total_inserted = 0
total_files = len(chunks_by_file)
errors = []

for idx, (source_document, file_chunks) in enumerate(chunks_by_file.items(), 1):
    try:
        # Reindex the file
        inserted = reindex_file(
            client=client,
            collection_name=COLLECTION_NAME,
            source_document=source_document,
            chunks_data=file_chunks
        )

        total_inserted += inserted

        # Simple progress bar, redrawn in place with a carriage return
        progress = idx / total_files * 100
        bar = "█" * int(progress / 5) + "░" * (20 - int(progress / 5))
        print(f"\r[{bar}] {progress:.1f}% - {idx}/{total_files} files", end="")

    except Exception as e:
        errors.append((source_document, str(e)))

print("\n" + "=" * 60)
print("✅ REINDEX COMPLETE!")
print(f"  📁 Files processed: {total_files}")
print(f"  📊 Chunks inserted: {total_inserted}")
print(f"  ❌ Errors: {len(errors)}")

if errors:
    print("\n⚠️ Files with errors:")
    for doc, error in errors[:5]:
        print(f"  {doc}: {error}")

# Collect the insertion metadata for the context
stage_metadata = {
    "files_processed": total_files,
    "chunks_inserted": total_inserted,
    "errors_count": len(errors),
    "qdrant_collection": COLLECTION_NAME,
    "reindex_mode": "FULL",
    "gitlab_commit": GITLAB_COMMIT,
    "pipeline_versions": {
        "embed_model_major": EMBED_MODEL_MAJOR,
        "embed_model_full": EMBED_MODEL_FULL,
        "tokenizer_major": TOKENIZER_MAJOR,
        "tokenizer_full": TOKENIZER_FULL
    },
    "errors": errors[:10] if errors else []  # Limit the number of saved errors
}

# Save to the context file (start from an empty context if it is missing)
context_path = Path("pipeline-data/context.json")
if context_path.exists():
    with open(context_path, "r", encoding="utf-8") as f:
        context_data = json.load(f)
else:
    context_data = {}

context_data["stage_06_armazenamento_qdrant"] = stage_metadata

with open(context_path, "w", encoding="utf-8") as f:
    json.dump(context_data, f, indent=2, ensure_ascii=False)

print("\n✅ Insertion metadata saved to the stage context")

🚀 STARTING FULL REINDEX
  Mode: DELETE + UPSERT per file
  Repo: nic/documentacao/base-de-conhecimento
  Branch: main
  Commit: e9c8a430b8bc05c306cc8fb342f42e7b45a18744

[████████████████████] 100.0% - 31/31 files
✅ REINDEX COMPLETE!
  📁 Files processed: 31
  📊 Chunks inserted: 322
  ❌ Errors: 0

✅ Insertion metadata saved to the stage context
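
The context file is read, updated, and rewritten in place, so an interruption during the write could leave it truncated for the next notebook. One possible hardening, sketched here as an assumption rather than something the pipeline does, is to write to a temporary file and swap it in with `os.replace`. `update_json_file` is a hypothetical helper, and the demonstration runs in a throwaway directory instead of `pipeline-data/`.

```python
import json
import os
import tempfile
from pathlib import Path


def update_json_file(path: Path, key: str, value: dict) -> dict:
    """Hypothetical helper: merge one key into a JSON file via a temp file and os.replace."""
    data = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    data[key] = value
    # Create the temp file next to the target so os.replace stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        os.replace(tmp_name, path)  # swap the finished file into place
    except BaseException:
        os.unlink(tmp_name)  # do not leave a half-written temp file behind
        raise
    return data


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "context.json"
    demo_path.write_text('{"gitlab_commit": "e9c8a430"}', encoding="utf-8")
    result = update_json_file(demo_path, "stage_06_armazenamento_qdrant", {"chunks_inserted": 322})
    reloaded = json.loads(demo_path.read_text(encoding="utf-8"))
    leftovers = [p.name for p in Path(tmp_dir).iterdir() if p.name != "context.json"]

print(sorted(reloaded))
```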


## 📊 Final Execution Report

In [7]:
from datetime import timezone  # Needed for the UTC timestamp below

# Compute the duration
stage_duration = time.time() - stage_start

# Load the existing report
report_path = Path("pipeline-data/report.json")
if report_path.exists():
    with open(report_path, "r", encoding="utf-8") as f:
        report = json.load(f)
else:
    report = {"stages": [], "context": {}, "summary": {}}

# Add final information to the context
report.setdefault("context", {}).update({
    "qdrant_url": QDRANT_URL,
    "qdrant_collection": COLLECTION_NAME
})

# Add the information for this stage
stage_report = {
    "stage": 6,
    "name": "Armazenamento Qdrant",
    "status": "SUCCESS" if len(errors) == 0 else "FAILED",
    "start_time": start_timestamp,
    "duration_seconds": round(stage_duration, 2),
    "results": {
        "embeddings_loaded": len(embeddings_data),
        "collection_name": COLLECTION_NAME,
        "collection_exists": collection_exists,
        "files_processed": total_files,
        "chunks_inserted": total_inserted,
        "insertion_errors": len(errors),
        "reindex_mode": "FULL"
    }
}

# If there were errors, add the details
if errors:
    stage_report["errors"] = [{"file": doc, "error": err} for doc, err in errors[:5]]

# Add or update this stage in the report
stages_updated = False
for i, stage in enumerate(report["stages"]):
    if stage["stage"] == 6:
        report["stages"][i] = stage_report
        stages_updated = True
        break

if not stages_updated:
    report["stages"].append(stage_report)

# Compute the final summary
total_duration = sum(stage.get("duration_seconds", 0) for stage in report["stages"])
failed_stages = [stage["name"] for stage in report["stages"] if stage["status"] == "FAILED"]

# Collect the totals of each stage (0 if the stage is absent from the report)
input_files = next((s["results"]["files_downloaded"] for s in report["stages"] if s["stage"] == 2), 0)
processed_docs = next((s["results"]["total_processed"] for s in report["stages"] if s["stage"] == 3), 0)
total_chunks = next((s["results"]["total_chunks"] for s in report["stages"] if s["stage"] == 4), 0)
embeddings_gen = next((s["results"]["embeddings_generated"] for s in report["stages"] if s["stage"] == 5), 0)
qdrant_vectors = next((s["results"]["chunks_inserted"] for s in report["stages"] if s["stage"] == 6), 0)

# Validate the data flow between stages
chunks_to_embeddings = "PASSED" if total_chunks == embeddings_gen else "FAILED"
embeddings_to_qdrant = "PASSED" if embeddings_gen == qdrant_vectors else "FAILED"

# Update the summary
report["summary"] = {
    "pipeline_status": "SUCCESS" if len(failed_stages) == 0 else "FAILED",
    "total_duration_seconds": round(total_duration, 2),
    # Take the timestamp in UTC so the "Z" suffix is accurate
    "last_update": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
    "data_flow": {
        "input_files": input_files,
        "processed_documents": processed_docs,
        "total_chunks": total_chunks,
        "embeddings_generated": embeddings_gen,
        "vectors_stored": qdrant_vectors
    },
    "validation": {
        "chunks_to_embeddings": chunks_to_embeddings,
        "embeddings_to_qdrant": embeddings_to_qdrant,
        "overall": "PASSED" if chunks_to_embeddings == "PASSED" and embeddings_to_qdrant == "PASSED" else "FAILED"
    }
}

if failed_stages:
    report["summary"]["failed_stages"] = failed_stages

# Update last_execution in pipeline_info
if "pipeline_info" in report:
    report["pipeline_info"]["last_execution"] = report["summary"]["last_update"]

# Save the final report
with open(report_path, "w", encoding="utf-8") as f:
    json.dump(report, f, indent=2, ensure_ascii=False)

print(f"📊 FINAL report saved: {report_path}")
print(f"⏱️ Stage duration: {stage_duration:.2f}s")
print(f"⏱️ Total pipeline duration: {total_duration:.2f}s")
print(f"\n✅ Pipeline Status: {report['summary']['pipeline_status']}")
print(f"📈 Data Flow Validation: {report['summary']['validation']['overall']}")

# Show the summary
print("\n📋 PIPELINE SUMMARY:")
print(f"  • Files downloaded: {input_files}")
print(f"  • Documents processed: {processed_docs}")
print(f"  • Chunks created: {total_chunks}")
print(f"  • Embeddings generated: {embeddings_gen}")
print(f"  • Vectors in Qdrant: {qdrant_vectors}")

📊 FINAL report saved: pipeline-data/report.json
⏱️ Stage duration: 1.06s
⏱️ Total pipeline duration: 100.35s

✅ Pipeline Status: SUCCESS
📈 Data Flow Validation: PASSED

📋 PIPELINE SUMMARY:
  • Files downloaded: 39
  • Documents processed: 31
  • Chunks created: 322
  • Embeddings generated: 322
  • Vectors in Qdrant: 322
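
The data flow validation in the report compares the counts of consecutive stages. The sketch below isolates that rule in a small function so it can be tried with other numbers; `validate_data_flow` is a hypothetical name, the first call uses the counts printed above, and the second uses made-up counts.

```python
from typing import Dict


def validate_data_flow(counts: Dict[str, int]) -> Dict[str, str]:
    """Compare consecutive stage counts the way the final report does."""
    def check(a: str, b: str) -> str:
        return "PASSED" if counts[a] == counts[b] else "FAILED"

    result = {
        "chunks_to_embeddings": check("total_chunks", "embeddings_generated"),
        "embeddings_to_qdrant": check("embeddings_generated", "vectors_stored"),
    }
    # The overall verdict passes only if every individual check passed
    result["overall"] = "PASSED" if all(v == "PASSED" for v in result.values()) else "FAILED"
    return result


# Counts from the run above
ok = validate_data_flow({"total_chunks": 322, "embeddings_generated": 322, "vectors_stored": 322})
# Made-up case: two vectors did not reach Qdrant
bad = validate_data_flow({"total_chunks": 322, "embeddings_generated": 322, "vectors_stored": 320})

print(ok["overall"], bad["embeddings_to_qdrant"], bad["overall"])
```

Files downloaded (39) and documents processed (31) are not part of the check, so a difference between them does not fail the validation.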
