# NIC ETL Pipeline

**Núcleo de Inteligência e Conhecimento (NIC) - Extract, Transform, Load Pipeline**

This notebook implements the complete ETL pipeline: it fetches documents from GitLab, extracts their content with Docling, generates embeddings with BAAI/bge-m3, and stores the resulting vectors in Qdrant.

## System Architecture

```
GitLab → Docling → Text Chunking → Embeddings → Qdrant
   ↓        ↓            ↓             ↓          ↓
 Docs  Structured     Chunks        Vectors    Search
```
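To make the data flow in the diagram concrete, here is a minimal sketch of a stage chain. It is not the notebook's orchestrator: every function below is a hypothetical placeholder (`extract_text` stands in for Docling, `split_paragraphs` for the chunker, `toy_embed` for BAAI/bge-m3), and the chain starts from already-fetched strings and stops before any vector store is involved.

```python
from typing import Callable, List, Sequence, Tuple

# A stage is a (name, function) pair; each function maps a list to a list.
Stage = Tuple[str, Callable[[list], list]]


def extract_text(documents: list) -> list:
    """Placeholder for the extraction step: strip whitespace, drop empty docs."""
    return [doc.strip() for doc in documents if doc.strip()]


def split_paragraphs(texts: list) -> list:
    """Placeholder for chunking: one chunk per blank-line-separated paragraph."""
    return [p.strip() for text in texts for p in text.split("\n\n") if p.strip()]


def toy_embed(chunks: list) -> list:
    """Placeholder for embedding: a two-number 'vector' per chunk."""
    return [(chunk, [float(len(chunk)), float(len(chunk.split()))]) for chunk in chunks]


def run_stages(stages: Sequence[Stage], items: list) -> Tuple[list, List[Tuple[str, int]]]:
    """Feed the output of each stage into the next and record item counts."""
    trace = []
    for name, stage in stages:
        items = stage(items)
        trace.append((name, len(items)))
    return items, trace


STAGES = [("extract", extract_text), ("chunk", split_paragraphs), ("embed", toy_embed)]

docs = ["First paragraph.\n\nSecond paragraph.", "   ", "Single paragraph."]
vectors, trace = run_stages(STAGES, docs)
for name, count in trace:
    print(f"{name}: {count} items")
```

Recording the item count after each stage is a cheap way to see where documents are dropped or multiplied along the chain.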

## Main Modules

1. **Configuration Management**: Centralized configuration management
2. **GitLab Integration**: Connects to GitLab and extracts documents
3. **Docling Processing**: Document processing with OCR
4. **Text Chunking**: Semantic text segmentation
5. **Embedding Generation**: Embedding generation with BAAI/bge-m3
6. **Qdrant Integration**: Vector storage
7. **Pipeline Orchestration**: Orchestration and monitoring

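As an illustration of what a chunking step produces, the sketch below splits text into overlapping word windows. This is a deliberately simpler technique than the semantic segmentation the `text_chunking` module is described as performing; the function name `chunk_words` and its parameters are assumptions made for this example only.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


for chunk in chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1):
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some words twice.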
## 1. Configuration and Imports

In [None]:
# Essential imports
import sys
import os
import logging
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Any, Optional
import json
import time

# Add modules to path
sys.path.insert(0, str(Path.cwd() / "modules"))

# Display and progress tracking
from IPython.display import display, HTML, clear_output

print("✅ Basic imports loaded")
print(f"📁 Working directory: {Path.cwd()}")
print(f"🕒 Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## 2. Module Dependency Check

In [None]:
import importlib

def check_module_dependencies():
    """Check that every pipeline module can be imported."""

    modules_to_check = [
        'configuration_management',
        'error_handling',
        'metadata_management',
        'gitlab_integration',
        'document_ingestion',
        'docling_processing',
        'text_chunking',
        'embedding_generation',
        'qdrant_integration',
        'pipeline_orchestration'
    ]

    available_modules = []
    missing_modules = []

    for module_name in modules_to_check:
        try:
            # Import by name; only the success or failure matters here
            importlib.import_module(module_name)
            available_modules.append(module_name)
            print(f"✅ {module_name}")
        except ImportError as e:
            missing_modules.append(module_name)
            print(f"❌ {module_name}: {e}")

    print("\n📊 Module status:")
    print(f"   ✅ Available: {len(available_modules)}/{len(modules_to_check)}")
    print(f"   ❌ Missing: {len(missing_modules)}")

    if missing_modules:
        print(f"\n⚠️  Missing modules: {', '.join(missing_modules)}")
        return False

    return True

# Run the check
modules_available = check_module_dependencies()

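A lighter variant of this check is sketched below, using the standard library's `importlib.util.find_spec`. It only locates each top-level module on the import path and does not execute it, so it will not surface errors raised while a module is being imported. The helper name `find_missing_modules` is an assumption for this example.

```python
import importlib.util
from typing import Iterable, List


def find_missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be located, without importing them."""
    missing = []
    for name in names:
        # find_spec returns None when no finder can locate a top-level module
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


print(find_missing_modules(["json", "module_that_does_not_exist_xyz"]))
```

This is useful as a fast pre-flight check; the import-based check above remains the stronger test because it also exercises each module's own dependencies.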
## 3. Pipeline Configuration

In [None]:
# System configuration
def setup_pipeline_configuration():
    """Set up the pipeline with its default configuration."""

    try:
        from configuration_management import create_configuration_manager

        # Create the configuration manager
        config_manager = create_configuration_manager(
            environment='development'
        )

        print("✅ Configuration Manager created")

        # Show the configuration (without secrets)
        config_summary = config_manager.export_configuration(include_secrets=False)
        print("\n📋 Current configuration:")

        # Parse the JSON and show the key settings
        config_dict = json.loads(config_summary)

        print(f"   🌍 Environment: {config_dict['environment']}")
        print(f"   🔗 GitLab URL: {config_dict['gitlab']['url']}")
        print(f"   📂 Target folder: {config_dict['gitlab']['folder_path']}")
        print(f"   🧠 Embedding model: {config_dict['embedding']['model_name']}")
        print(f"   📊 Qdrant collection: {config_dict['qdrant']['collection_name']}")
        print(f"   🔧 Max concurrent documents: {config_dict['pipeline']['max_concurrent_documents']}")

        return config_manager

    except Exception as e:
        print(f"❌ Configuration error: {e}")
        return None

# Configure the pipeline
config_manager = setup_pipeline_configuration()

## 4. Orchestrator Initialization

In [None]:
def initialize_pipeline_orchestrator(config_manager):
    """Initialize the main pipeline orchestrator."""

    if not config_manager:
        print("❌ Configuration Manager not available")
        return None

    try:
        from pipeline_orchestration import PipelineOrchestrator

        # Create the orchestrator
        orchestrator = PipelineOrchestrator(config_manager)
        print("✅ Pipeline Orchestrator created")

        # Show its statistics
        stats = orchestrator.get_orchestrator_statistics()
        print("\n📊 Orchestrator statistics:")
        for key, value in stats.items():
            print(f"   {key}: {value}")

        return orchestrator

    except Exception as e:
        print(f"❌ Error creating the orchestrator: {e}")
        return None

# Initialize the orchestrator
orchestrator = initialize_pipeline_orchestrator(config_manager)

## 5. Main Pipeline Execution

In [None]:
def run_nic_etl_pipeline(target_folder="30-Aprovados", enable_monitoring=True):
    """Run the complete NIC ETL pipeline.

    Note: enable_monitoring is only reported in the banner below; this
    function does not pass it on to the orchestrator.
    """

    if not orchestrator:
        print("❌ Orchestrator not available. Run the previous cells first.")
        return None

    print("🚀 Starting NIC ETL Pipeline...")
    print(f"📂 Target folder: {target_folder}")
    print(f"🔍 Monitoring: {'Enabled' if enable_monitoring else 'Disabled'}")
    print("\n" + "="*50)

    try:
        # Run the pipeline
        result = orchestrator.run_full_pipeline(
            target_folder=target_folder
        )

        # Show the results
        print("\n✅ Pipeline run completed successfully!")
        print("\n📊 Results:")
        print(f"   Total documents: {result.total_documents}")
        print(f"   Processed successfully: {result.processed_successfully}")
        print(f"   Failed: {result.failed_documents}")
        print(f"   Skipped: {result.skipped_documents}")
        print(f"   Total time: {result.total_processing_time:.2f}s")
        print(f"   Chunks generated: {result.total_chunks}")
        print(f"   Embeddings: {result.total_embeddings}")
        print(f"   Vectors stored: {result.total_vectors_stored}")

        if result.errors:
            # List only the first five errors to keep the output readable
            print(f"\n⚠️  Errors found ({len(result.errors)}):")
            for i, error in enumerate(result.errors[:5], 1):
                print(f"   {i}. {error}")
            if len(result.errors) > 5:
                print(f"   ... and {len(result.errors) - 5} more errors")

        return result

    except Exception as e:
        print(f"❌ Error while running the pipeline: {e}")
        import traceback
        traceback.print_exc()
        return None

# Execution interface
print("📝 To run the pipeline, use:")
print("   result = run_nic_etl_pipeline()")
print("\n🔧 Advanced options:")
print("   result = run_nic_etl_pipeline(target_folder='another-folder')")
print("   result = run_nic_etl_pipeline(enable_monitoring=False)")

## 6. Results Analysis

In [None]:
def analyze_pipeline_results(result):
    """Analyze the pipeline results and render an HTML report."""

    if not result:
        print("❌ No results available for analysis")
        return

    # Compute metrics, guarding every ratio against a zero denominator
    total_docs = result.total_documents
    elapsed = result.total_processing_time
    success_rate = result.processed_successfully / total_docs if total_docs > 0 else 0
    failure_rate = result.failed_documents / total_docs if total_docs > 0 else 0
    avg_time_per_doc = elapsed / total_docs if total_docs > 0 else 0
    throughput = total_docs / elapsed if elapsed > 0 else 0

    # Formatted HTML report
    html_report = f"""
    <div style="border: 2px solid #4CAF50; padding: 20px; border-radius: 10px; background-color: #f0f8f0;">
        <h2>📈 Results Analysis - NIC ETL Pipeline</h2>

        <div style="display: flex; justify-content: space-between; margin: 20px 0;">
            <div style="text-align: center; padding: 10px; background-color: white; border-radius: 5px; margin: 0 5px;">
                <h3 style="color: #4CAF50; margin: 0;">{result.processed_successfully}</h3>
                <p style="margin: 0;">Successes</p>
            </div>
            <div style="text-align: center; padding: 10px; background-color: white; border-radius: 5px; margin: 0 5px;">
                <h3 style="color: #FF9800; margin: 0;">{result.failed_documents}</h3>
                <p style="margin: 0;">Failures</p>
            </div>
            <div style="text-align: center; padding: 10px; background-color: white; border-radius: 5px; margin: 0 5px;">
                <h3 style="color: #2196F3; margin: 0;">{result.total_chunks}</h3>
                <p style="margin: 0;">Chunks</p>
            </div>
            <div style="text-align: center; padding: 10px; background-color: white; border-radius: 5px; margin: 0 5px;">
                <h3 style="color: #9C27B0; margin: 0;">{result.total_vectors_stored}</h3>
                <p style="margin: 0;">Vectors</p>
            </div>
        </div>

        <div style="margin: 20px 0;">
            <h3>📊 Performance Metrics</h3>
            <ul>
                <li><strong>Success Rate:</strong> {success_rate:.1%}</li>
                <li><strong>Failure Rate:</strong> {failure_rate:.1%}</li>
                <li><strong>Total Time:</strong> {elapsed:.2f} seconds</li>
                <li><strong>Average Time per Document:</strong> {avg_time_per_doc:.2f} seconds</li>
                <li><strong>Throughput:</strong> {throughput:.2f} docs/second</li>
            </ul>
        </div>
    </div>
    """

    display(HTML(html_report))

def format_pipeline_results(result):
    """Format the results for display."""
    if result:
        analyze_pipeline_results(result)
    else:
        print("❌ Run the pipeline first to see the results")

print("📝 To analyze the results, use:")
print("   format_pipeline_results(result)")

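The report's metrics can also be computed as plain numbers, which makes them easy to log or assert on. The sketch below assumes a result object exposing the same four attributes read by the report; `RunSummary` is a hypothetical stand-in used only for this example, not the orchestrator's actual result type.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunSummary:
    """Hypothetical stand-in carrying the attributes the report reads."""
    total_documents: int
    processed_successfully: int
    failed_documents: int
    total_processing_time: float


def safe_ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 instead of raising when the denominator is not positive."""
    return numerator / denominator if denominator > 0 else 0.0


def compute_metrics(summary: RunSummary) -> Dict[str, float]:
    """Derive rate and timing metrics from the raw counters."""
    docs = summary.total_documents
    elapsed = summary.total_processing_time
    return {
        "success_rate": safe_ratio(summary.processed_successfully, docs),
        "failure_rate": safe_ratio(summary.failed_documents, docs),
        "avg_seconds_per_doc": safe_ratio(elapsed, docs),
        "docs_per_second": safe_ratio(docs, elapsed),
    }


for name, value in compute_metrics(RunSummary(10, 8, 2, 4.0)).items():
    print(f"{name}: {value:.2f}")
```

Routing every division through `safe_ratio` means an empty run (no documents, no elapsed time) yields zeros rather than a `ZeroDivisionError`.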
## 7. Qdrant Search and Tests

In [None]:
def test_vector_search(query_text, limit=5):
    """Test semantic search against Qdrant."""

    if not config_manager:
        print("❌ Configuration Manager not available")
        return None

    try:
        # Import the required modules
        from qdrant_integration import create_qdrant_vector_store
        from embedding_generation import create_embedding_generator

        # Module configurations
        qdrant_config = config_manager.get_module_config('qdrant')
        embedding_config = config_manager.get_module_config('embedding')

        # Create the components
        vector_store = create_qdrant_vector_store(qdrant_config)
        embedding_generator = create_embedding_generator(embedding_config)

        print(f"🔍 Searching for: '{query_text}'")
        print(f"📊 Result limit: {limit}")

        # Generate the query embedding
        query_embedding = embedding_generator.generate_embeddings([query_text])[0]

        # Run the search
        search_results = vector_store.search_similar_vectors(
            query_vector=query_embedding,
            limit=limit
        )

        print(f"\n✅ Found {len(search_results)} results:")

        for i, hit in enumerate(search_results, 1):
            print(f"\n{i}. Similarity: {hit.score:.3f}")
            if hasattr(hit, 'payload') and hit.payload:
                print(f"   Document: {hit.payload.get('document_name', 'N/A')}")
                print(f"   Chunk ID: {hit.payload.get('chunk_id', 'N/A')}")
                content = hit.payload.get('content', '')
                # Show at most 200 characters; add an ellipsis only if text was cut
                preview = content[:200] + ('...' if len(content) > 200 else '')
                print(f"   Content: {preview}")

        return search_results

    except Exception as e:
        print(f"❌ Vector search error: {e}")
        return None

# Search interface
print("🔍 To test semantic search, use:")
print("   results = test_vector_search('your query here')")
print("\nExamples:")
print("   test_vector_search('security procedures')")
print("   test_vector_search('technical documentation', limit=10)")

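To build intuition for the similarity score printed above, the sketch below ranks a few toy vectors by cosine similarity in pure Python. It assumes the collection is configured for cosine distance, which the notebook does not state, and the helper names `cosine_similarity` and `top_k` are hypothetical. A vector database performs this ranking over an index rather than by scanning every vector.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int) -> List[Tuple[str, float]]:
    """Brute-force search: score every stored vector and keep the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
for name, score in top_k([1.0, 0.0], store, k=2):
    print(f"{name}: {score:.3f}")
```

Under this metric a score near 1 means the query and the chunk point in almost the same direction in embedding space, regardless of vector length.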
## 8. Utilities and Controls

In [None]:
def reset_pipeline():
    """Reset the pipeline for a new run."""

    global orchestrator, config_manager

    print("🔄 Resetting pipeline...")

    # Recreate the components
    config_manager = setup_pipeline_configuration()
    orchestrator = initialize_pipeline_orchestrator(config_manager)

    if orchestrator:
        print("✅ Pipeline reset successfully")
    else:
        print("❌ Error resetting the pipeline")

def test_connections():
    """Try to build a client for each external service from the current configuration.

    Returns True only if every client could be created.
    """

    if not config_manager:
        print("❌ Configuration Manager not available")
        return False

    print("🔗 Testing connections...")

    all_ok = True

    # GitLab check
    try:
        from gitlab_integration import create_gitlab_connector
        gitlab_config = config_manager.get_module_config('gitlab')
        create_gitlab_connector(gitlab_config)
        print("✅ GitLab: configuration loaded")
    except Exception as e:
        all_ok = False
        print(f"❌ GitLab: {e}")

    # Qdrant check
    try:
        from qdrant_integration import create_qdrant_vector_store
        qdrant_config = config_manager.get_module_config('qdrant')
        create_qdrant_vector_store(qdrant_config)
        print("✅ Qdrant: configuration loaded")
    except Exception as e:
        all_ok = False
        print(f"❌ Qdrant: {e}")

    # Embedding check
    try:
        from embedding_generation import create_embedding_generator
        embedding_config = config_manager.get_module_config('embedding')
        create_embedding_generator(embedding_config)
        print("✅ Embedding: configuration loaded")
    except Exception as e:
        all_ok = False
        print(f"❌ Embedding: {e}")

    return all_ok

def show_pipeline_status():
    """Show the current pipeline status."""

    print("📊 NIC ETL Pipeline Status")
    print("=" * 40)

    print(f"🔧 Config Manager: {'✅ OK' if config_manager else '❌ Not initialized'}")
    print(f"🎯 Orchestrator: {'✅ OK' if orchestrator else '❌ Not initialized'}")
    print(f"📦 Modules: {'✅ OK' if modules_available else '❌ Missing modules'}")

    if orchestrator:
        progress = orchestrator.monitor_progress()
        print("\n📈 Current progress:")
        print(f"   Stage: {progress.current_stage.value}")
        print(f"   Documents: {progress.documents_processed}/{progress.total_documents}")

# Control interface
print("🛠️  Available utilities:")
print("   reset_pipeline() - Resets the pipeline")
print("   test_connections() - Tests connections")
print("   show_pipeline_status() - Shows status")

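`test_connections()` reports whether each client could be created from the configuration. If an actual network probe is wanted, a plain TCP connection attempt is one option, sketched below with the standard library. The function name `is_port_open` is an assumption for this example, and the host and port must come from your own configuration; a successful TCP handshake only shows that something is listening, not that it is the expected service.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        # create_connection resolves the host and completes the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False
```

A call such as `is_port_open("localhost", 8080)` (placeholder values) could be added next to each client check to distinguish "misconfigured" from "unreachable".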
---

## 🏁 Pipeline Ready!

### Next Steps:

1. **Configure your credentials** in the `.env` file
2. **Run**: `result = run_nic_etl_pipeline()`
3. **Analyze the results**: `format_pipeline_results(result)`
4. **Test search**: `test_vector_search('your query')`

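For readers unfamiliar with `.env` files, the sketch below shows the general `KEY=value` format and a minimal standard-library parser for it. The key names in the sample are placeholders: the notebook does not list the variables that `configuration_management` actually reads, and the project may rely on a dedicated library for loading them.

```python
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blanks, comments, and lines without '='."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Trim whitespace and one layer of surrounding quotes from the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Placeholder keys and values, for illustration only
sample = """
# Example credentials file
EXAMPLE_SERVICE_URL="https://service.example.com"
EXAMPLE_COLLECTION=my_collection
"""
print(parse_env_text(sample))
```

Keep the real `.env` file out of version control, since it holds access tokens.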
### Useful Commands:

```python
# Status and diagnostics
show_pipeline_status()

# Custom run
result = run_nic_etl_pipeline(target_folder="another-folder")

# Semantic search
test_vector_search("technical documentation", limit=10)

# Reset if needed
reset_pipeline()
```

---

**📞 Support**: See `CLAUDE.md` for the full documentation

**🔧 Development**: Modules in `./modules/` • Logs in `./logs/` • Cache in `./cache/`