# 🏗️ Capstone Project 2: Self-Service Platform with GenAI

**Goal**: build a platform that lets non-technical users generate ETL pipelines, documentation, and reports using generative AI.

## Project Scope

- **Duration**: 6-8 hours
- **Difficulty**: Very high
- **Stack**: OpenAI, LangGraph, Airflow, Great Expectations, Streamlit

## Features

1. ✅ ETL pipeline generation from a natural-language description
2. ✅ Automatic validation of the generated code
3. ✅ Unit test creation
4. ✅ Technical documentation generation
5. ✅ Monitoring dashboard
6. ✅ Approval and deployment system

### 🏗️ **Self-Service Platform Architecture: From NL to Production**

**Overview: Data Democratization**

This platform represents the next step in the evolution of Data Engineering: turning natural-language descriptions into complete ETL pipelines that are tested, documented, and deployed.

```
┌─────────────────────────────────────────────────────────────────┐
│               SELF-SERVICE PLATFORM ARCHITECTURE                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  [Non-Technical User] → Streamlit UI                           │
│           ↓                                                     │
│  [1. NL Parser]                                                 │
│      • Intent extraction                                        │
│      • Entity recognition (sources, transformations, schedule) │
│      • Requirement validation                                   │
│           ↓                                                     │
│  [2. Code Generator]                                            │
│      • Template selection (batch/streaming/hybrid)             │
│      • Framework choice (Pandas/Spark/Polars)                  │
│      • Best practices injection                                │
│           ↓                                                     │
│  [3. Security Validator]                                        │
│      • AST parsing (syntax validation)                         │
│      • Security patterns (no eval, exec, os.system)            │
│      • Dependency check (vulnerable packages)                  │
│           ↓                                                     │
│  [4. Test Generator]                                            │
│      • Unit tests (per transformation)                         │
│      • Integration tests (mocked I/O)                          │
│      • Data quality tests (Great Expectations)                 │
│           ↓                                                     │
│  [5. Documentation Generator]                                   │
│      • README.md (architecture, setup, troubleshooting)        │
│      • API docs (if exposed)                                   │
│      • Runbook (operations guide)                              │
│           ↓                                                     │
│  [6. DAG Generator]                                             │
│      • Airflow DAG (orchestration)                             │
│      • Task dependencies                                       │
│      • Retry/alerting logic                                    │
│           ↓                                                     │
│  [7. Human Review]                                              │
│      • Side-by-side diff                                       │
│      • Approve/Reject/Request changes                          │
│           ↓                                                     │
│  [8. Deployment]                                                │
│      • Run tests (pytest)                                      │
│      • Create PR (GitHub)                                      │
│      • Deploy to staging → prod                                │
│           ↓                                                     │
│  [9. Monitoring]                                                │
│      • Execution metrics                                       │
│      • Data quality alerts                                     │
│      • Cost tracking                                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

**Key Components**

**1. NL Parser (Structured Requirement Extraction)**

```python
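# Prompt engineering for structured extraction.
# The JSON braces below are literal, so fill in the input with
# PARSER_PROMPT.replace("{user_input}", text) rather than str.format().
PARSER_PROMPT = """
Analyze this ETL pipeline requirement and extract structured information.

INPUT: {user_input}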

OUTPUT (JSON):
{
  "pipeline_name": "snake_case_name",
  "source": {
    "type": "s3|postgres|mysql|api|csv",
    "location": "bucket/path or connection_string",
    "credentials": "reference_to_secret_manager"
  },
  "transformations": [
    {
      "type": "filter|aggregate|join|pivot",
      "description": "...",
      "columns": ["col1", "col2"]
    }
  ],
  "destination": {
    "type": "redshift|postgres|s3|delta",
    "location": "...",
    "mode": "append|overwrite|upsert"
  },
  "schedule": {
    "frequency": "daily|hourly|weekly",
    "time": "02:00",
    "timezone": "UTC"
  },
  "validations": [
    {
      "type": "not_null|unique|range|regex",
      "column": "...",
      "threshold": 0.95
    }
  ],
  "sla": {
    "max_duration_minutes": 60,
    "alert_email": "team@example.com"
  }
}

RULES:
- Infer implicit transformations (e.g. "last 30 days" → date filter)
- Suggest validations when none are stated
- Use sensible defaults (schedule=daily if unspecified)
"""

# Example of robust parsing
def parse_with_validation(user_input: str) -> dict:
    requirements = parse_requirements(user_input)

    # Check completeness
    required_fields = ['pipeline_name', 'source', 'destination']
    missing = [f for f in required_fields if f not in requirements]

    if missing:
        # Ask the LLM to fill in the missing fields
        completion_prompt = f"Fill in these missing fields: {missing}"
        requirements = complete_requirements(requirements, completion_prompt)

    return requirements
```
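
The completeness check above only looks for missing top-level keys. A minimal sketch of a stricter check follows, assuming the allowed values are exactly those listed in `PARSER_PROMPT`; the function name `check_requirements` and the constant names are hypothetical and not part of the original project.

```python
from typing import Any

# Allowed values copied from the enumerations in PARSER_PROMPT
ALLOWED_SOURCE_TYPES = {"s3", "postgres", "mysql", "api", "csv"}
ALLOWED_DESTINATION_TYPES = {"redshift", "postgres", "s3", "delta"}
ALLOWED_MODES = {"append", "overwrite", "upsert"}


def check_requirements(requirements: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: list[str] = []

    # Required top-level fields
    for field in ("pipeline_name", "source", "destination"):
        if field not in requirements:
            problems.append(f"missing field: {field}")

    # The source type must be one the platform knows how to read
    source = requirements.get("source", {})
    if source and source.get("type") not in ALLOWED_SOURCE_TYPES:
        problems.append(f"unsupported source type: {source.get('type')}")

    # Destination type and load mode (mode defaults to append, an assumption)
    destination = requirements.get("destination", {})
    if destination:
        if destination.get("type") not in ALLOWED_DESTINATION_TYPES:
            problems.append(f"unsupported destination type: {destination.get('type')}")
        if destination.get("mode", "append") not in ALLOWED_MODES:
            problems.append(f"unsupported load mode: {destination.get('mode')}")

    return problems
```

Returning a list of problems rather than raising lets the caller show every issue to the user at once, or pass them back to the LLM in a follow-up prompt.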

**2. Code Generator (Multi-Template Strategy)**

```python
# Templates by workload type
TEMPLATES = {
    'batch': '''
import pandas as pd
from typing import Dict, Any
import logging

logger = logging.getLogger(__name__)

class {pipeline_name}Pipeline:
    def __init__(self, config: Dict[str, Any]):
        self.config = config
    
    def extract(self) -> pd.DataFrame:
        """Extract from {source_type}"""
        {extract_code}
    
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Apply transformations"""
        {transform_code}
    
    def validate(self, df: pd.DataFrame) -> bool:
        """Data quality checks"""
        {validation_code}
    
    def load(self, df: pd.DataFrame) -> None:
        """Load to {destination_type}"""
        {load_code}
    
    def run(self) -> None:
        logger.info("Starting pipeline...")
        df = self.extract()
        df = self.transform(df)
        
        if not self.validate(df):
            raise ValueError("Data quality checks failed")
        
        self.load(df)
        logger.info("Pipeline completed successfully")
''',
    'streaming': '''
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.appName("{pipeline_name}").getOrCreate()

# Read stream
df = spark.readStream \\
    .format("{source_format}") \\
    .option("kafka.bootstrap.servers", "{kafka_servers}") \\
    .load()

# Transformations
{transform_code}

# Write stream
query = df.writeStream \\
    .format("{sink_format}") \\
    .outputMode("append") \\
    .trigger(processingTime='1 minute') \\
    .start()

query.awaitTermination()
'''
}

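# Generation with context injection
def generate_with_context(requirements: dict) -> str:
    # Select the template
    workload_type = infer_workload_type(requirements)
    template = TEMPLATES[workload_type]

    # Generate the pipeline-specific snippets. Multi-line snippets must be
    # indented to match the method body they are injected into.
    extract_code = generate_extract_code(requirements['source'])
    transform_code = generate_transform_code(requirements['transformations'])
    validation_code = generate_validation_code(requirements['validations'])
    load_code = generate_load_code(requirements['destination'])

    # Fill in the template. str.format() raises KeyError for any placeholder
    # without a value, so the streaming-only fields are passed too; the
    # requirement keys used for them are assumptions, not part of the schema.
    code = template.format(
        pipeline_name=requirements['pipeline_name'],
        source_type=requirements['source']['type'],
        destination_type=requirements['destination']['type'],
        extract_code=extract_code,
        transform_code=transform_code,
        validation_code=validation_code,
        load_code=load_code,
        source_format=requirements['source'].get('format', 'kafka'),
        kafka_servers=requirements['source'].get('location', ''),
        sink_format=requirements['destination'].get('type', 'delta'),
    )

    return code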
```
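
One practical detail of template filling is indentation: a placeholder that sits inside a method body only indents the first line of whatever is injected. The sketch below shows one way to handle this with `textwrap.indent`, using a reduced template; `TEMPLATE`, `render`, and the sample snippet are illustrative names, not the project's actual helpers.

```python
import ast
import textwrap

# Reduced template: the placeholder starts at column 0 and the renderer
# is responsible for indenting the injected snippet.
TEMPLATE = '''\
class {pipeline_name}Pipeline:
    def transform(self, df):
        """Apply transformations"""
{transform_code}
'''


def render(pipeline_name: str, transform_code: str) -> str:
    # Normalize the snippet, then indent every line by 8 spaces
    # (the depth of a method body inside a class).
    body = textwrap.indent(textwrap.dedent(transform_code).strip(), " " * 8)
    return TEMPLATE.format(pipeline_name=pipeline_name, transform_code=body)


snippet = """
df = df[df["total"] > 0]
return df
"""

code = render("Sales", snippet)
ast.parse(code)  # raises SyntaxError if the indentation is wrong
print(code)
```

Parsing the rendered code with `ast.parse` right after rendering catches indentation mistakes before the code reaches the security validator.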

**3. Security Validator (Multi-Layer)**

```python
import ast
import re

def validate_security(code: str) -> dict:
    issues = []
    
    # Layer 1: AST parsing (syntax + structure)
    try:
        tree = ast.parse(code)

        for node in ast.walk(tree):
            # Detect dangerous imports ("import x" and "from x import y")
            if isinstance(node, ast.Import):
                modules = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                modules = [node.module or '']
            else:
                modules = []
            for module in modules:
                if module.split('.')[0] in ['pickle', 'subprocess', 'os']:
                    issues.append({
                        'severity': 'HIGH',
                        'type': 'DANGEROUS_IMPORT',
                        'message': f'Dangerous module imported: {module}',
                        'line': node.lineno
                    })

            # Detect eval/exec
            if isinstance(node, ast.Call):
                if isinstance(node.func, ast.Name):
                    if node.func.id in ['eval', 'exec']:
                        issues.append({
                            'severity': 'CRITICAL',
                            'type': 'CODE_INJECTION',
                            'message': f'Use of {node.func.id}() detected',
                            'line': node.lineno
                        })

    except SyntaxError as e:
        issues.append({
            'severity': 'CRITICAL',
            'type': 'SYNTAX_ERROR',
            'message': str(e),
            'line': e.lineno
        })

    # Layer 2: Regex patterns (SQL injection via f-strings, either quote style)
    if re.search(r'f["\']\s*SELECT\b.*\{.*\}', code, re.IGNORECASE):
        issues.append({
            'severity': 'HIGH',
            'type': 'SQL_INJECTION_RISK',
            'message': 'f-string used to build a SQL query'
        })

    # Layer 3: Dependency scanning
    imports = extract_imports(code)
    vulnerable = check_vulnerabilities(imports)
    issues.extend(vulnerable)

    # Only CRITICAL issues block the pipeline; HIGH issues are reported
    # so the human reviewer can decide
    return {
        'valid': all(i['severity'] != 'CRITICAL' for i in issues),
        'issues': issues
    }

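# Dependency scanning
def check_vulnerabilities(packages: list) -> list:
    # Integration with safety (https://pyup.io/safety/). `safety check`
    # scans the installed environment, so `packages` is unused here, and
    # the JSON layout depends on the safety version: adjust the keys below.
    import json
    import subprocess
    result = subprocess.run(
        ['safety', 'check', '--json'],
        capture_output=True
    )

    vulnerabilities = json.loads(result.stdout)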
    return [
        {
            'severity': 'HIGH',
            'type': 'VULNERABLE_DEPENDENCY',
            'message': f"{v['package']}: {v['vulnerability']}"
        }
        for v in vulnerabilities
    ]
```
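
The architecture diagram lists `os.system` among the forbidden patterns, but a check on `ast.Name` alone only sees bare calls such as `eval(...)`. A minimal sketch of how dotted calls can be resolved from the AST follows; `dotted_name`, `find_dangerous_usage`, and the two constant sets are hypothetical names, and the set of flagged calls is an assumption chosen for illustration.

```python
import ast

DANGEROUS_MODULES = {"pickle", "subprocess", "os"}
DANGEROUS_CALLS = {"eval", "exec", "os.system", "subprocess.run", "subprocess.Popen"}


def dotted_name(node: ast.AST) -> str:
    """Rebuild 'a.b.c' from nested Attribute/Name nodes ('' if not a plain name)."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return ""


def find_dangerous_usage(code: str) -> list[tuple[int, str]]:
    """Return sorted (line, description) pairs for risky imports and calls."""
    findings = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in DANGEROUS_MODULES:
                    findings.append((node.lineno, f"import {alias.name}"))
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in DANGEROUS_MODULES:
                findings.append((node.lineno, f"from {node.module} import ..."))
        elif isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in DANGEROUS_CALLS:
                findings.append((node.lineno, f"call to {name}()"))
    return sorted(findings)
```

This still misses aliased usage such as `import os as o; o.system(...)`, so it complements the human review step rather than replacing it.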

**4. Test Generator (Comprehensive Coverage)**

```python
def generate_tests(code: str, requirements: dict) -> str:
    test_template = '''
import pytest
from unittest.mock import Mock, patch
import pandas as pd
from {module} import {pipeline_class}

@pytest.fixture
def pipeline():
    config = {config_dict}
    return {pipeline_class}(config)

@pytest.fixture
def sample_data():
    return pd.DataFrame({sample_data_dict})

# Unit Tests (the `mocker` fixture comes from the pytest-mock plugin)
def test_extract(pipeline, mocker):
    """Test data extraction"""
    mock_read = mocker.patch('{read_function}')
    mock_read.return_value = pd.DataFrame({mock_data})
    
    df = pipeline.extract()
    
    assert not df.empty
    assert list(df.columns) == {expected_columns}

def test_transform(pipeline, sample_data):
    """Test transformations"""
    df = pipeline.transform(sample_data)
    
    # Pipeline-specific assertions
    {transform_assertions}

def test_validate_success(pipeline, sample_data):
    """Test validation with clean data"""
    assert pipeline.validate(sample_data)

def test_validate_failure(pipeline):
    """Test validation with dirty data"""
    dirty_data = pd.DataFrame({dirty_data_dict})
    assert not pipeline.validate(dirty_data)

def test_load(pipeline, sample_data, mocker):
    """Test data loading"""
    mock_write = mocker.patch('{write_function}')
    
    pipeline.load(sample_data)
    
    mock_write.assert_called_once()

# Integration Tests
@patch('{source_module}.read')
@patch('{destination_module}.write')
def test_full_pipeline(mock_write, mock_read, pipeline):
    """Test end-to-end execution"""
    mock_read.return_value = pd.DataFrame({mock_source_data})
    
    pipeline.run()
    
    mock_write.assert_called()
    written_df = mock_write.call_args[0][0]
    assert len(written_df) > 0

# Data Quality Tests
def test_no_null_values(pipeline, sample_data):
    """Test no nulls in required columns"""
    df = pipeline.transform(sample_data)
    required_cols = {required_columns}
    
    for col in required_cols:
        assert df[col].isnull().sum() == 0

def test_data_types(pipeline, sample_data):
    """Test correct data types"""
    df = pipeline.transform(sample_data)
    expected_types = {expected_types_dict}
    
    for col, dtype in expected_types.items():
        assert df[col].dtype == dtype
'''
    
    # Fill in the template from the requirements
    test_code = test_template.format(
        module=requirements['pipeline_name'],
        pipeline_class=to_camel_case(requirements['pipeline_name']),
        config_dict=generate_config_mock(requirements),
        sample_data_dict=generate_sample_data(requirements),
        # ... remaining placeholders (format() raises KeyError if any is missing)
    )
    
    return test_code
```

**Architecture Benefits**

| Aspect | Before (Manual) | After (Platform) |
|--------|-----------------|------------------|
| **Time to Production** | 2-4 weeks | 1-2 hours |
| **Code Quality** | Variable (depends on the developer) | Consistent (templates + validation) |
| **Documentation** | Often missing or outdated | Always current (auto-generated) |
| **Testing** | ~40% coverage | >80% coverage (auto-generated) |
| **Security** | Manual reviews | Automated scanning (on every generation) |
| **Onboarding** | Requires weeks of training | Any user can create pipelines |
| **Maintenance** | High (every pipeline is unique) | Low (standardized patterns) |

**ROI Calculation**

```python
# ROI example
MANUAL_PIPELINE = {
    'development_hours': 40,
    'testing_hours': 16,
    'documentation_hours': 8,
    'review_hours': 4,
    'total_hours': 68,
    'engineer_cost_hour': 75,  # USD
    'total_cost': 5100  # USD
}

PLATFORM_PIPELINE = {
    'generation_minutes': 10,
    'review_hours': 2,
    'total_hours': 2.17,
    'total_cost': 163  # USD
}

savings_per_pipeline = MANUAL_PIPELINE['total_cost'] - PLATFORM_PIPELINE['total_cost']  # $4,937
time_saved = MANUAL_PIPELINE['total_hours'] - PLATFORM_PIPELINE['total_hours']  # 65.83 hours

# If a team builds 20 pipelines/year
annual_savings = savings_per_pipeline * 20  # $98,740
annual_time_saved = time_saved * 20  # ~1,317 hours (≈ 165 eight-hour work days)
```
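
The totals above are hard-coded, which makes it easy for them to drift from their components. As a small sketch, the same figures can be derived from the inputs; it reuses the hourly rate and hour breakdown from the example, and `annual` is a hypothetical helper. Computed without rounding, the platform cost is $162.50 rather than the rounded $163.

```python
HOURLY_RATE = 75  # USD per engineer hour, from the example above

# Manual path: development + testing + documentation + review
manual_hours = 40 + 16 + 8 + 4
# Platform path: 10 minutes of generation + 2 hours of review
platform_hours = 10 / 60 + 2

manual_cost = manual_hours * HOURLY_RATE
platform_cost = platform_hours * HOURLY_RATE

savings_per_pipeline = manual_cost - platform_cost
hours_saved = manual_hours - platform_hours


def annual(value_per_pipeline: float, pipelines_per_year: int = 20) -> float:
    """Scale a per-pipeline figure to a year of pipeline creation."""
    return value_per_pipeline * pipelines_per_year


print(f"Savings per pipeline: ${savings_per_pipeline:,.2f}")
print(f"Annual savings: ${annual(savings_per_pipeline):,.2f}")
print(f"Annual hours saved: {annual(hours_saved):,.0f}")
```

These numbers cover only engineering time per pipeline; the cost of building and operating the platform itself and the LLM API usage are not part of the example.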

**Real-World Use Cases**

1. **Marketing Analytics**: "I need to aggregate Mixpanel events by user and day"
2. **Finance Reporting**: "Consolidate transactions from 5 databases into a daily report"
3. **ML Feature Engineering**: "Create user features for a churn model"
4. **Data Migration**: "Migrate tables from legacy MySQL to Snowflake"
5. **Real-time Alerting**: "Detect sales anomalies every 5 minutes"

---
**Author:** Luis J. Raigoso V. (LJRV)

## Part 1: System Architecture

```
User Input (NL)
    ↓
[1. Parser] → extracts requirements
    ↓
[2. Generator] → generates ETL code
    ↓
[3. Validator] → checks syntax and security
    ↓
[4. Tester] → generates and runs tests
    ↓
[5. Documenter] → creates the README
    ↓
[6. Reviewer] → human review
    ↓
[7. Deployer] → deploys to production
```

### 🔄 **LangGraph: Multi-Step Agent Orchestration**

**Why LangGraph for Complex Workflows?**

LangGraph lets you build workflows with:
- **State Management**: state that persists across nodes
- **Conditional Routing**: decision logic (if/else paths)
- **Human-in-the-Loop**: pause for human approval
- **Retry Logic**: retry failed nodes
- **Streaming**: watch progress in real time

**Workflow Architecture**

```python
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

# State shared by all nodes
class PipelineState(TypedDict):
    # Inputs
    user_input: str

    # Intermediate results
    requirements: dict
    code: str
    validation: dict
    tests: str
    docs: str
    dag: str

    # Outputs
    approved: bool
    errors: Annotated[list, operator.add]  # Append errors from any node

    # Metadata
    generation_time: float
    review_notes: str

# Each node receives the current state and returns only the keys it updates.
# Returning the whole state would feed the full `errors` list back through the
# operator.add reducer and duplicate its entries.
def parse_node(state: PipelineState) -> dict:
    """Extract structured requirements"""
    try:
        return {'requirements': parse_requirements(state['user_input'])}
    except Exception as e:
        return {'errors': [f"Parse error: {str(e)}"]}

def generate_node(state: PipelineState) -> dict:
    """Generate ETL code"""
    if state['errors']:
        return {'errors': []}  # Skip if earlier nodes reported errors

    import time
    start = time.time()

    code = generate_etl_pipeline(state['requirements'])
    return {'code': code, 'generation_time': time.time() - start}

def validate_node(state: PipelineState) -> dict:
    """Validate security and syntax"""
    validation_result = validate_code(state['code'])
    update = {'validation': validation_result}

    if not validation_result['valid']:
        update['errors'] = [
            f"{issue['type']}: {issue['message']}"
            for issue in validation_result['issues']
        ]

    return update

def should_continue_after_validation(state: PipelineState) -> str:
    """Conditional edge: decide whether to continue or stop"""
    if state['validation']['valid']:
        return "test"  # Go to test node
    else:
        return END  # Stop workflow

# Build the graph
workflow = StateGraph(PipelineState)

# Add nodes
workflow.add_node("parse", parse_node)
workflow.add_node("generate", generate_node)
workflow.add_node("validate", validate_node)
workflow.add_node("test", test_node)
workflow.add_node("document", document_node)
workflow.add_node("review", review_node)

# Add edges (define flow)
workflow.set_entry_point("parse")
workflow.add_edge("parse", "generate")
workflow.add_edge("generate", "validate")

# Conditional edge (branching)
workflow.add_conditional_edges(
    "validate",
    should_continue_after_validation,
    {
        "test": "test",
        END: END
    }
)

workflow.add_edge("test", "document")
workflow.add_edge("document", "review")
workflow.add_edge("review", END)

# Compile graph
app = workflow.compile()
```
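
The `errors: Annotated[list, operator.add]` field deserves a closer look: the annotation attaches a reducer, so a node's returned value for that key is combined with the existing value instead of replacing it. The sketch below is a plain-Python simulation of that merge rule, not LangGraph's actual implementation; `DemoState` and `apply_update` are names made up for the illustration.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class DemoState(TypedDict):
    code: str
    errors: Annotated[list, operator.add]


def apply_update(state: dict, update: dict, schema=DemoState) -> dict:
    """Merge a node's partial update into the state the way a reducer would."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducers = getattr(hints[key], "__metadata__", ())
        if reducers:
            # Annotated[..., reducer]: combine old and new values
            merged[key] = reducers[0](state[key], value)
        else:
            # Plain key: the last write wins
            merged[key] = value
    return merged


state = {"code": "", "errors": ["parse failed"]}

# A node that returns only what changed appends one entry
state = apply_update(state, {"code": "print(1)", "errors": ["syntax error"]})
print(state["errors"])

# A node that returns the whole state re-adds every existing entry
doubled = apply_update(state, state)
print(doubled["errors"])
```

The second call shows why nodes are better written to return partial updates: handing the full state back sends the existing `errors` list through the reducer again.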

**Graph Visualization**

```
        ┌─────────┐
        │  START  │
        └────┬────┘
             │
             v
        ┌─────────┐
        │  Parse  │  ← Extracts requirements
        └────┬────┘
             │
             v
        ┌──────────┐
        │ Generate │  ← Generates code
        └────┬─────┘
             │
             v
        ┌──────────┐
        │ Validate │  ← Checks security
        └────┬─────┘
             │
        ┌────┴────┐
        │ Valid?  │
        └─┬─────┬─┘
     Yes  │     │ No
          v     v
      ┌──────┐ END
      │ Test │
      └──┬───┘
         │
         v
    ┌──────────┐
    │ Document │
    └────┬─────┘
         │
         v
    ┌────────┐
    │ Review │  ← Human approval
    └────┬───┘
         │
         v
      ┌─────┐
      │ END │
      └─────┘
```

**State Streaming (Real-time Updates)**

```python
# Run with streaming
initial_state = {
    'user_input': "Sales pipeline...",
    'requirements': {},
    'code': '',
    'validation': {},
    'tests': '',
    'docs': '',
    'dag': '',
    'approved': False,
    'errors': [],
    'generation_time': 0.0,
    'review_notes': ''
}

# Stream: each item maps a node name to the update that node returned
for output in app.stream(initial_state):
    node_name = list(output.keys())[0]
    state = output[node_name] or {}

    print(f"\n✅ Completed: {node_name}")

    if node_name == "parse":
        print(f"Pipeline: {state.get('requirements', {}).get('pipeline_name')}")

    elif node_name == "generate":
        print(f"Code: {len(state.get('code', ''))} characters")
        print(f"Time: {state.get('generation_time', 0.0):.2f}s")

    elif node_name == "validate":
        status = "✅ OK" if state['validation']['valid'] else "❌ Errors"
        print(f"Validation: {status}")

        for error in state.get('errors', []):
            print(f"  - {error}")

# Final state. Note that invoke() runs the graph a second time; to avoid
# that, stream with stream_mode="values" and keep the last item instead.
final = app.invoke(initial_state)
```

**Human-in-the-Loop (Approval Gate)**

```python
import time

def review_node_interactive(state: PipelineState) -> dict:
    """Node that pauses and waits for human approval"""

    # Show a summary for the reviewer
    print("\n" + "="*60)
    print("🔍 REVIEW REQUIRED")
    print("="*60)
    print(f"\nPipeline: {state['requirements']['pipeline_name']}")
    print(f"Source: {state['requirements']['source']['type']}")
    print(f"Destination: {state['requirements']['destination']['type']}")
    print(f"\nValidation: {'✅ Passed' if state['validation']['valid'] else '❌ Errors'}")

    if state['errors']:
        print("\nErrors detected:")
        for error in state['errors']:
            print(f"  ❌ {error}")

    print(f"\nGenerated code ({len(state['code'])} characters):")
    print(state['code'][:300] + "...")

    # Wait for human input; return only the keys that change so the
    # operator.add reducer on `errors` does not duplicate earlier entries
    while True:
        decision = input("\n[A]pprove / [R]eject / [M]odify: ").strip().upper()

        if decision == 'A':
            return {'approved': True, 'review_notes': "Approved by reviewer"}

        elif decision == 'R':
            rejection_reason = input("Reason for rejection: ")
            return {
                'approved': False,
                'errors': [f"Rejected: {rejection_reason}"],
                'review_notes': rejection_reason
            }

        elif decision == 'M':
            modifications = input("Required changes: ")
            # Feed the reviewer's feedback back into the request. Re-running
            # generation requires a review → generate loop in the graph.
            return {
                'review_notes': f"Changes requested: {modifications}",
                'user_input': state['user_input'] + f"\n\nAdjustments: {modifications}"
            }

# Register in the workflow in place of the placeholder review_node
# (a node name can only be added once per graph)
workflow.add_node("review", review_node_interactive)
```
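
The modify branch above notes that sending feedback back to generation needs a loop in the graph. One way to sketch the routing side of that loop is a function suitable for `add_conditional_edges`. The `decision` and `revisions` state keys, the `"deploy"` node name, and the `MAX_REVISIONS` cap are all assumptions introduced for this example, not part of the original state schema.

```python
MAX_REVISIONS = 3  # hypothetical cap to avoid endless regenerate loops


def route_after_review(state: dict) -> str:
    """Return the name of the next step based on the reviewer's decision."""
    decision = state.get("decision")

    if decision == "approve":
        return "deploy"

    # Loop back to generation only while the revision budget lasts
    if decision == "modify" and state.get("revisions", 0) < MAX_REVISIONS:
        return "generate"

    # Rejected, unknown decision, or too many revision rounds
    return "end"


# Possible wiring, replacing the fixed review → END edge:
# workflow.add_conditional_edges(
#     "review",
#     route_after_review,
#     {"generate": "generate", "deploy": "deploy", "end": END},
# )
```

Capping the number of revision rounds keeps a reviewer and the generator from cycling indefinitely; the review node would need to increment `revisions` each time it asks for changes.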

**Retry Logic (Automatic Retries)**

```python
import ast
import time

def generate_with_retry(state: PipelineState, max_retries: int = 3) -> dict:
    """Node with automatic retries"""
    error_msg = ""

    for attempt in range(max_retries):
        try:
            # Try to generate the code
            code = generate_etl_pipeline(state['requirements'])

            # Check syntax
            ast.parse(code)

            # Success: return only the updated key
            return {'code': code}

        except SyntaxError as e:
            error_msg = f"Generation attempt {attempt+1} failed: {str(e)}"
            print(f"⚠️  {error_msg}")

            if attempt < max_retries - 1:
                print("🔄 Retrying...")
                time.sleep(2 ** attempt)  # Exponential backoff

    # All attempts failed: report the last error through the reducer
    return {'errors': [error_msg]}
```
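
The same retry pattern can be factored into a reusable helper so that other LLM-backed nodes share it. This is a minimal sketch under stated assumptions: `retry` is a hypothetical helper, the `sleep` parameter is injectable so the backoff can be tested without waiting, and only the exception types listed in `retry_on` trigger another attempt.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (SyntaxError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, backing off exponentially between tries."""
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # Out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... base_delay
    raise ValueError("max_retries must be at least 1")
```

A node could then call something like `retry(lambda: generate_and_parse(state))`, where `generate_and_parse` stands for whatever function generates the code and runs `ast.parse` on it.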

**Sub-Graphs (Nested Workflows)**

```python
# Sub-workflow for test generation. The extra keys are declared in a state
# schema that extends PipelineState so the sub-graph keeps them.
class TestGenState(PipelineState):
    unit_tests: str
    integration_tests: str
    quality_tests: str

test_workflow = StateGraph(TestGenState)

def unit_test_node(state: TestGenState) -> dict:
    return {'unit_tests': generate_unit_tests(state['code'])}

def integration_test_node(state: TestGenState) -> dict:
    return {'integration_tests': generate_integration_tests(state['code'])}

def quality_test_node(state: TestGenState) -> dict:
    return {'quality_tests': generate_quality_tests(state['requirements'])}

# Build sub-graph
test_workflow.add_node("unit", unit_test_node)
test_workflow.add_node("integration", integration_test_node)
test_workflow.add_node("quality", quality_test_node)

test_workflow.set_entry_point("unit")
test_workflow.add_edge("unit", "integration")
test_workflow.add_edge("integration", "quality")
test_workflow.add_edge("quality", END)

test_subgraph = test_workflow.compile()

# Integrate the sub-graph into the main workflow
def test_orchestrator(state: PipelineState) -> dict:
    """Run the test sub-workflow"""
    test_results = test_subgraph.invoke(state)

    # Combine the generated tests
    return {
        'tests': (
            test_results['unit_tests'] +
            test_results['integration_tests'] +
            test_results['quality_tests']
        )
    }

# Register under "test" in place of the placeholder test_node
workflow.add_node("test", test_orchestrator)
```

**Parallel Execution (Independent Nodes)**

```python
# Documentation and the DAG can be generated in parallel
workflow.add_node("document", document_node)
workflow.add_node("dag", dag_generator_node)

# Fan-out: both nodes receive the same state and run in the same step.
# Parallel nodes must return only their own keys (e.g. {'docs': ...} and
# {'dag': ...}); writing the same non-reducer key from both is an error.
workflow.add_edge("test", "document")
workflow.add_edge("test", "dag")

# Sync point: wait for both before continuing
workflow.add_node("combine", lambda s: {'errors': []})  # No-op combiner
workflow.add_edge("document", "combine")
workflow.add_edge("dag", "combine")
workflow.add_edge("combine", "review")
```

**Checkpointing (State Persistence)**

```python
from langgraph.checkpoint.memory import MemorySaver

# Save the state after each step
checkpointer = MemorySaver()

app = workflow.compile(checkpointer=checkpointer)

# Run with a checkpoint thread
config = {"configurable": {"thread_id": "pipeline_123"}}
final_state = app.invoke(initial_state, config=config)

# Retrieve the saved state for this thread
saved_state = app.get_state(config).values

# Resume from the last checkpoint (useful after a mid-run failure):
# passing None as the input continues the thread instead of restarting it
resumed_state = app.invoke(None, config=config)
```

**Advantages vs. Alternatives**

| Feature | LangGraph | Airflow | Prefect | n8n |
|---------|-----------|---------|---------|-----|
| **State Management** | ✅ Built-in | ⚠️ XCom (limited) | ✅ Flow runs | ❌ |
| **Conditional Logic** | ✅ Native | ⚠️ BranchPythonOperator | ✅ | ✅ |
| **Human-in-Loop** | ✅ Easy | ⚠️ Manual | ✅ Approvals | ✅ |
| **LLM Integration** | ✅ Optimized | ❌ Custom | ⚠️ Manual | ⚠️ |
| **Streaming** | ✅ Native | ❌ | ⚠️ Limited | ❌ |
| **Python-First** | ✅ | ✅ | ✅ | ❌ (UI-first) |
**Use Case: Multi-Agent Code Review**

```python
# Review system with multiple specialized agents
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class ReviewState(TypedDict):
    code: str
    security_review: dict
    performance_review: dict
    style_review: dict
    final_approval: bool

# Each agent returns only its own key. The three run in the same step, and
# two nodes writing the same plain key (such as `code`) would be rejected.
def security_agent(state: ReviewState) -> dict:
    """Agent specialized in security"""
    return {'security_review': {
        'sql_injection_risk': check_sql_injection(state['code']),
        'secrets_exposed': check_secrets(state['code']),
        'dangerous_imports': check_imports(state['code'])
    }}

def performance_agent(state: ReviewState) -> dict:
    """Agent specialized in performance"""
    return {'performance_review': {
        'complexity': calculate_complexity(state['code']),
        'memory_usage': estimate_memory(state['code']),
        'optimization_suggestions': suggest_optimizations(state['code'])
    }}

def style_agent(state: ReviewState) -> dict:
    """Agent specialized in style"""
    return {'style_review': {
        'pep8_compliance': check_pep8(state['code']),
        'docstring_coverage': check_docstrings(state['code']),
        'type_hints': check_type_hints(state['code'])
    }}

# Run in parallel
review_workflow = StateGraph(ReviewState)
review_workflow.add_node("security", security_agent)
review_workflow.add_node("performance", performance_agent)
review_workflow.add_node("style", style_agent)

# Parallel edges: fan out from START to all three agents
review_workflow.add_edge(START, "security")
review_workflow.add_edge(START, "performance")
review_workflow.add_edge(START, "style")

# Combine results
def aggregate_reviews(state: ReviewState) -> dict:
    all_passed = (
        all(v == 'pass' for v in state['security_review'].values()) and
        state['performance_review']['complexity'] < 10 and
        state['style_review']['pep8_compliance'] > 0.9
    )

    return {'final_approval': all_passed}

review_workflow.add_node("aggregate", aggregate_reviews)
review_workflow.add_edge("security", "aggregate")
review_workflow.add_edge("performance", "aggregate")
review_workflow.add_edge("style", "aggregate")
review_workflow.add_edge("aggregate", END)

multi_agent_review = review_workflow.compile()
```

---

## Part 2: Setup

In [None]:
# pip install openai langgraph streamlit pytest great-expectations
import os
import json
import ast
from openai import OpenAI
from typing import TypedDict, Annotated
import operator

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

print('✅ Setup complete')

## Part 3: Requirements Parser

In [None]:
def parse_requirements(user_input: str) -> dict:
    """Extract structured requirements from natural language."""
    prompt = f'''
Extract the requirements of this ETL pipeline as JSON:

User input:
{user_input}

Return only JSON with:
{{
  "pipeline_name": "descriptive_name",
  "source": {{"type": "csv/api/db", "details": "..."}},
  "transformations": ["list of transformations"],
  "destination": {{"type": "csv/db/s3", "details": "..."}},
  "schedule": "daily/hourly/manual",
  "validations": ["quality rules"]
}}
'''

    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role':'user','content':prompt}],
        temperature=0.1
    )

    # The model may wrap the JSON in prose or a Markdown fence, so parse
    # only the outermost {...} span
    content = resp.choices[0].message.content
    return json.loads(content[content.find('{'):content.rfind('}') + 1])

# Test
user_request = '''
I need a pipeline that:
1. Reads ventas.csv from the S3 bucket "raw-data"
2. Keeps only sales from the last 30 days
3. Adds a "mes" column (YYYY-MM)
4. Computes the total by month and category
5. Writes to the PostgreSQL table "ventas_mensual"
6. Runs daily at 2 AM
7. Validates that "total" has no nulls
'''

requirements = parse_requirements(user_request)
print('Extracted requirements:')
print(json.dumps(requirements, indent=2))
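
Chat models do not always return bare JSON: the object may arrive inside a Markdown fence or surrounded by a sentence of explanation, and `json.loads` fails on either. A minimal sketch of a tolerant extractor follows; `extract_json` is a hypothetical helper, and the fence marker is built from a variable only so the example can itself be shown inside a fenced block.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str) -> dict:
    """Parse a JSON object from an LLM reply, tolerating fences and chatter."""
    # Prefer the contents of a fenced block when there is one
    fenced = FENCED_BLOCK.search(text)
    candidate = fenced.group(1) if fenced else text

    # Keep only the outermost {...} span
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")

    return json.loads(candidate[start : end + 1])


reply = "Here are the requirements:\n" + FENCE + 'json\n{"pipeline_name": "ventas_mensual"}\n' + FENCE
print(extract_json(reply))
```

Raising `ValueError` (the base class of `json.JSONDecodeError`) in both failure modes gives the calling node a single exception type to catch and turn into a parse error.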

## Part 4: ETL Code Generator

In [None]:
def generate_etl_pipeline(requirements: dict) -> str:
    """Generate the complete Python code for the pipeline."""
    prompt = f'''
Generate a Python ETL pipeline based on these requirements:

{json.dumps(requirements, indent=2)}

The code must:
- Use pandas, boto3 (if S3), sqlalchemy (if DB)
- Include error handling with try/except
- Log in detail
- Use type hints
- Include docstrings
- Provide an executable main() function

Python code:
'''
    
    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role':'user','content':prompt}],
        temperature=0.2
    )
    
    return resp.choices[0].message.content.strip().replace('```python','').replace('```','')

etl_code = generate_etl_pipeline(requirements)
print('Generated code (preview):')
print(etl_code[:500] + '...')

## Part 5: Code Validator

In [None]:
def validate_code(code: str) -> dict:
    """Validate syntax and basic security rules."""
    issues = []
    
    # 1. Check Python syntax
    try:
        ast.parse(code)
    except SyntaxError as e:
        issues.append({'type': 'SYNTAX', 'message': str(e)})
    
    # 2. Detect unsafe patterns (plain substring match, so comments and strings also trigger it)
    dangerous_patterns = ['eval(', 'exec(', 'os.system(', '__import__']
    for pattern in dangerous_patterns:
        if pattern in code:
            issues.append({'type': 'SECURITY', 'message': f'Dangerous pattern detected: {pattern}'})
    
    # 3. Check that the required imports are present
    required_imports = ['pandas', 'logging']
    for imp in required_imports:
        if f'import {imp}' not in code:
            issues.append({'type': 'MISSING_IMPORT', 'message': f'Missing import: {imp}'})
    
    return {
        'valid': len(issues) == 0,
        'issues': issues
    }

validation = validate_code(etl_code)
print(f"\nValidation: {'✅ Approved' if validation['valid'] else '❌ Has errors'}")
if validation['issues']:
    for issue in validation['issues']:
        print(f"- [{issue['type']}] {issue['message']}")
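
The substring check above also fires on comments and string literals, and it cannot report where the call occurs. A sketch of an AST-based variant is shown below. `find_dangerous_calls` and `DANGEROUS_CALLS` are hypothetical names, and the approach still misses aliased imports such as `from os import system`, so it complements a tool like Bandit rather than replacing it.

```python
import ast

# Call names treated as unsafe; mirrors the patterns listed in validate_code.
DANGEROUS_CALLS = {"eval", "exec", "__import__", "os.system"}


def _call_name(node: ast.Call) -> str:
    """Return a dotted name such as 'os.system' for simple calls, else an empty string."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
        return f"{func.value.id}.{func.attr}"
    return ""


def find_dangerous_calls(code: str) -> list[tuple[int, str]]:
    """Return (line number, call name) pairs for every flagged call in the code."""
    found = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call) and _call_name(node) in DANGEROUS_CALLS:
            found.append((node.lineno, _call_name(node)))
    return sorted(found)
```

Because it raises `SyntaxError` on invalid code, it would run after the syntax check rather than instead of it.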

## Part 6: Test generator

In [None]:
def generate_tests(code: str, requirements: dict) -> str:
    """Generate unit tests with pytest."""
    # Only the first 1000 characters of the code are sent, to keep the prompt short
    prompt = f'''
Generate pytest unit tests for this ETL code:

Requirements:
{json.dumps(requirements, indent=2)}

Code:
{code[:1000]}...

Generate tests for:
1. Reading data (mock the source)
2. Transformations
3. Data quality validations
4. Writing (mock the destination)
5. Error handling

Test code (pytest):
'''
    
    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role':'user','content':prompt}],
        temperature=0.2
    )
    
    return resp.choices[0].message.content.strip().replace('```python','').replace('```','')

test_code = generate_tests(etl_code, requirements)
print('Generated tests (preview):')
print(test_code[:400] + '...')

## Part 7: Documentation generator

In [None]:
def generate_documentation(requirements: dict, code: str) -> str:
    """Generate a complete README.md."""
    # Note: only the requirements are sent to the model; `code` is currently unused
    prompt = f'''
Generate a complete README.md for this ETL pipeline:

Requirements:
{json.dumps(requirements, indent=2)}

Include:
- Overview
- Architecture (ASCII diagram)
- Requirements (dependencies)
- Configuration
- How to run
- Monitoring and troubleshooting
- Contact/owner

Markdown:
'''
    
    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role':'user','content':prompt}],
        temperature=0.3
    )
    
    return resp.choices[0].message.content.strip()

documentation = generate_documentation(requirements, etl_code)
print('Generated documentation (preview):')
print(documentation[:400] + '...')

## Part 8: Airflow DAG

In [None]:
def generate_airflow_dag(requirements: dict, etl_code: str) -> str:
    """Generate an Airflow DAG."""
    # Note: only the requirements are sent to the model; `etl_code` is currently unused
    prompt = f'''
Generate an Airflow DAG for this pipeline:

Requirements:
{json.dumps(requirements, indent=2)}

The DAG must have:
- A descriptive name
- A schedule that matches the requirements
- Tasks: extract, validate, transform, load
- Retry logic
- Email alerts on failure

Python code for the DAG:
'''
    
    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role':'user','content':prompt}],
        temperature=0.2
    )
    
    return resp.choices[0].message.content.strip().replace('```python','').replace('```','')

dag_code = generate_airflow_dag(requirements, etl_code)
print('Airflow DAG (preview):')
print(dag_code[:400] + '...')
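
The DAG prompt leaves the schedule to the model. Since the parsed requirements only allow `daily`, `hourly`, or `manual`, the schedule could instead be derived deterministically and passed into the prompt. The sketch below assumes those three values; `to_airflow_schedule` and its `hour` parameter are hypothetical.

```python
from typing import Optional

# Airflow cron presets for the schedule values the requirements parser can return.
SCHEDULE_PRESETS = {"daily": "@daily", "hourly": "@hourly", "manual": None}


def to_airflow_schedule(schedule: str, hour: Optional[int] = None) -> Optional[str]:
    """Map a requirements schedule to an Airflow schedule string (None = manual trigger only)."""
    key = schedule.strip().lower()
    if key not in SCHEDULE_PRESETS:
        raise ValueError(f"unsupported schedule: {schedule!r}")
    if key == "daily" and hour is not None:
        if not 0 <= hour <= 23:
            raise ValueError("hour must be between 0 and 23")
        return f"0 {hour} * * *"  # cron: minute 0 of the given hour, every day
    return SCHEDULE_PRESETS[key]
```

For the example request ("daily at 2 AM"), `to_airflow_schedule("daily", hour=2)` yields the cron expression `0 2 * * *`.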

## Part 9: Workflow with LangGraph

In [None]:
import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, END

class PipelineState(TypedDict):
    user_input: str
    requirements: dict
    code: str
    validation: dict
    tests: str
    docs: str
    dag: str
    approved: bool
    errors: Annotated[list, operator.add]  # reducer: lists returned by nodes are appended

# Each node returns only the keys it updates. Returning the full state would send the
# existing 'errors' list back through the operator.add reducer and duplicate its entries.
def parse_node(state: PipelineState) -> dict:
    return {'requirements': parse_requirements(state['user_input'])}

def generate_node(state: PipelineState) -> dict:
    return {'code': generate_etl_pipeline(state['requirements'])}

def validate_node(state: PipelineState) -> dict:
    validation = validate_code(state['code'])
    # An empty list is a no-op for the reducer when the code is valid
    errors = [i['message'] for i in validation['issues']]
    return {'validation': validation, 'errors': errors}

def test_node(state: PipelineState) -> dict:
    # Only generate tests for code that passed validation
    valid = state['validation']['valid']
    tests = generate_tests(state['code'], state['requirements']) if valid else ''
    return {'tests': tests}

def document_node(state: PipelineState) -> dict:
    return {
        'docs': generate_documentation(state['requirements'], state['code']),
        'dag': generate_airflow_dag(state['requirements'], state['code']),
    }

def review_node(state: PipelineState) -> dict:
    # In production: human-in-the-loop
    return {'approved': state['validation']['valid']}

# Build the graph
workflow = StateGraph(PipelineState)
workflow.add_node('parse', parse_node)
workflow.add_node('generate', generate_node)
workflow.add_node('validate', validate_node)
workflow.add_node('test', test_node)
workflow.add_node('document', document_node)
workflow.add_node('review', review_node)

workflow.set_entry_point('parse')
workflow.add_edge('parse', 'generate')
workflow.add_edge('generate', 'validate')
workflow.add_edge('validate', 'test')
workflow.add_edge('test', 'document')
workflow.add_edge('document', 'review')
workflow.add_edge('review', END)

app = workflow.compile()

# Run the full workflow
initial_state = {
    'user_input': user_request,
    'requirements': {},
    'code': '',
    'validation': {},
    'tests': '',
    'docs': '',
    'dag': '',
    'approved': False,
    'errors': []
}

final_state = app.invoke(initial_state)

print('\n🎉 Pipeline fully generated\n')
print(f"Approved: {'✅' if final_state['approved'] else '❌'}")
print(f"Errors: {len(final_state['errors'])}")
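
The graph above is strictly linear, so invalid code still flows through the test and documentation steps. One way to add a bounded retry is a routing function used with a conditional edge. This is only a sketch: `route_after_validation`, `MAX_ATTEMPTS`, and the `attempts` state field are hypothetical, and `generate_node` would have to increment `attempts` for the budget to take effect.

```python
MAX_ATTEMPTS = 3  # assumed retry budget


def route_after_validation(state: dict) -> str:
    """Decide the next step: regenerate on failure until the budget runs out."""
    if state["validation"].get("valid"):
        return "continue"
    if state.get("attempts", 0) < MAX_ATTEMPTS:
        return "retry"
    return "give_up"


# Wiring it in would replace the fixed validate -> test edge, roughly:
# workflow.add_conditional_edges(
#     'validate',
#     route_after_validation,
#     {'continue': 'test', 'retry': 'generate', 'give_up': 'review'},
# )
```

Keeping the routing logic in a plain function makes it easy to unit test without building the graph.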

## Part 10: Streamlit interface

In [None]:
# Save as platform_app.py

streamlit_app = '''
import streamlit as st
import json
# Assumes the LangGraph workflow above was saved as pipeline_generator.py
from pipeline_generator import app, PipelineState

st.set_page_config(page_title='GenAI Data Platform', page_icon='🏗️', layout='wide')

st.title('🏗️ Self-Service Pipeline Platform')
st.markdown('Generate complete ETL pipelines from natural language')

# Input
user_input = st.text_area(
    'Describe your pipeline:',
    height=200,
    placeholder='E.g.: I need to process sales data from S3, aggregate it by month, and load it into Redshift...'
)

if st.button('🚀 Generate Pipeline', type='primary'):
    if not user_input:
        st.warning('Please describe the pipeline')
    else:
        with st.spinner('Generating the complete pipeline...'):
            initial = {
                'user_input': user_input,
                'requirements': {}, 'code': '', 'validation': {},
                'tests': '', 'docs': '', 'dag': '',
                'approved': False, 'errors': []
            }
            
            result = app.invoke(initial)
            
            # Tabs
            tab1, tab2, tab3, tab4, tab5 = st.tabs(['📋 Requirements', '💻 Code', '🧪 Tests', '📄 Docs', '🛫 DAG'])
            
            with tab1:
                st.json(result['requirements'])
            
            with tab2:
                st.code(result['code'], language='python')
                st.download_button('Download code', result['code'], file_name='pipeline.py')
            
            with tab3:
                st.code(result['tests'], language='python')
            
            with tab4:
                st.markdown(result['docs'])
            
            with tab5:
                st.code(result['dag'], language='python')
            
            # Validation result
            if result['approved']:
                st.success('✅ Pipeline approved and ready for deployment')
            else:
                st.error(f"❌ Pipeline has errors: {result['errors']}")
'''

# The app source contains emoji, so write it explicitly as UTF-8
with open('platform_app.py', 'w', encoding='utf-8') as f:
    f.write(streamlit_app)

print('✅ Streamlit platform saved to platform_app.py')

## Part 11: Deployment system

In [None]:
import os
import subprocess

def deploy_pipeline(code: str, dag: str, tests: str, pipeline_name: str):
    """Deploy the pipeline to production."""
    # 1. Create the directory structure
    base_path = f'./pipelines/{pipeline_name}'
    os.makedirs(base_path, exist_ok=True)
    
    # 2. Save the files (UTF-8, since generated code may contain non-ASCII text)
    with open(f'{base_path}/pipeline.py', 'w', encoding='utf-8') as f:
        f.write(code)
    
    with open(f'{base_path}/test_pipeline.py', 'w', encoding='utf-8') as f:
        f.write(tests)
    
    with open(f'{base_path}/dag.py', 'w', encoding='utf-8') as f:
        f.write(dag)
    
    # 3. Run the tests
    test_result = subprocess.run(
        ['pytest', f'{base_path}/test_pipeline.py', '-v'],
        capture_output=True,
        text=True
    )
    
    if test_result.returncode != 0:
        return {
            'success': False,
            'message': 'Tests failed',
            'output': test_result.stdout
        }
    
    # 4. Copy the DAG to Airflow (left disabled in this notebook)
    # airflow_dags_path = '/opt/airflow/dags/'
    # shutil.copy(f'{base_path}/dag.py', f'{airflow_dags_path}/{pipeline_name}.py')
    
    return {
        'success': True,
        'message': f'Pipeline {pipeline_name} deployed successfully',
        'path': base_path
    }

# Deployment
if final_state['approved']:
    deploy_result = deploy_pipeline(
        code=final_state['code'],
        dag=final_state['dag'],
        tests=final_state['tests'],
        pipeline_name=final_state['requirements']['pipeline_name']
    )
    print(deploy_result['message'])
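
`pipeline_name` comes straight from the model output and is interpolated into a filesystem path, so a value such as `../../etc` would write outside `./pipelines`. A minimal sketch of a sanitizing step is shown below; `safe_pipeline_dir` is a hypothetical helper and the allowed character set is an assumption.

```python
import re
from pathlib import Path


def safe_pipeline_dir(pipeline_name: str, root: str = "./pipelines") -> Path:
    """Turn a model-supplied name into a directory path that stays inside root."""
    # Collapse every run of characters outside [a-z0-9_] into a single underscore
    slug = re.sub(r"[^a-z0-9_]+", "_", pipeline_name.lower()).strip("_")
    if not slug:
        raise ValueError("pipeline name has no usable characters")
    return Path(root) / slug
```

`deploy_pipeline` could use the returned path as `base_path` instead of formatting the raw name.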

### 🚀 **Deployment and Production: From Prototype to Enterprise**

**CI/CD Pipeline for GenAI Artifacts**

```
┌────────────────────────────────────────────────────────────┐
│               PRODUCTION DEPLOYMENT PIPELINE               │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  [1. Code Generation] → Generated pipeline code           │
│           ↓                                                │
│  [2. Version Control]                                      │
│      • Create feature branch                              │
│      • Commit: code + tests + docs                        │
│      • Tag: v1.0.0                                        │
│           ↓                                                │
│  [3. Automated Testing] (GitHub Actions/GitLab CI)        │
│      • pytest: unit + integration tests                   │
│      • coverage: ≥80% required                            │
│      • security: Bandit, Safety                           │
│      • linting: Black, Flake8, MyPy                       │
│           ↓                                                │
│  [4. Quality Gates]                                        │
│      ✅ All tests pass                                     │
│      ✅ Coverage ≥80%                                      │
│      ✅ No critical security issues                       │
│      ✅ Code review approved                              │
│           ↓                                                │
│  [5. Staging Deployment]                                   │
│      • Deploy to staging environment                      │
│      • Run integration tests with staging data            │
│      • Performance benchmarking                           │
│           ↓                                                │
│  [6. Production Approval]                                  │
│      • Manual sign-off (for critical pipelines)          │
│      • Automated (for low-risk changes)                   │
│           ↓                                                │
│  [7. Production Deployment]                                │
│      • Blue/Green deployment (zero downtime)              │
│      • Copy DAG to Airflow dags/ folder                   │
│      • Update Airflow variables                           │
│      • Trigger initial run                                │
│           ↓                                                │
│  [8. Monitoring]                                           │
│      • Datadog/New Relic APM                              │
│      • Custom metrics (rows processed, runtime)           │
│      • Alerting (PagerDuty, Slack)                        │
│                                                            │
└────────────────────────────────────────────────────────────┘
```
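
The quality gates in step 4 amount to a small decision function. The sketch below shows one way to express them; `CheckResults` and `evaluate_quality_gates` are hypothetical names, and the 80% default mirrors the coverage threshold in the diagram.

```python
from dataclasses import dataclass


@dataclass
class CheckResults:
    tests_passed: bool
    coverage_percent: float
    critical_security_issues: int
    review_approved: bool


def evaluate_quality_gates(results: CheckResults, min_coverage: float = 80.0) -> list[str]:
    """Return the names of the gates that failed; an empty list means the build may proceed."""
    failed = []
    if not results.tests_passed:
        failed.append("tests")
    if results.coverage_percent < min_coverage:
        failed.append("coverage")
    if results.critical_security_issues > 0:
        failed.append("security")
    if not results.review_approved:
        failed.append("review")
    return failed
```

Returning the failed gate names, rather than a single boolean, lets the UI tell a non-technical user which check blocked the deployment.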

**GitHub Actions Workflow**

```yaml
# .github/workflows/deploy-pipeline.yml
name: Deploy Generated Pipeline

on:
  push:
    paths:
      - 'generated_pipelines/**'
  pull_request:
    paths:
      - 'generated_pipelines/**'

env:
  PYTHON_VERSION: '3.11'
  COVERAGE_THRESHOLD: 80

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov bandit safety black flake8 mypy
      
      - name: Security scan
        run: |
          # Scan dependencies
          safety check --json
          
          # Scan code for vulnerabilities
          bandit -r generated_pipelines/ -f json -o bandit-report.json
      
      - name: Code quality
        run: |
          # Format check
          black --check generated_pipelines/
          
          # Linting
          flake8 generated_pipelines/ --max-line-length=100
          
          # Type checking
          mypy generated_pipelines/ --strict
      
      - name: Run tests
        run: |
          pytest generated_pipelines/ \
            --cov=generated_pipelines \
            --cov-report=xml \
            --cov-report=html \
            --cov-fail-under=${{ env.COVERAGE_THRESHOLD }}
      
      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml
  
  deploy-staging:
    needs: validate
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
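    steps:
      # Each job starts from a clean runner, so the repository must be checked out again
      - uses: actions/checkout@v3

      - name: Deploy to staging Airflow
        env:
          AIRFLOW_STAGING_HOST: ${{ secrets.AIRFLOW_STAGING_HOST }}
          AIRFLOW_API_KEY: ${{ secrets.AIRFLOW_API_KEY }}
        run: |
          # Copy DAG to staging
          scp generated_pipelines/*/dag.py \
            user@$AIRFLOW_STAGING_HOST:/opt/airflow/dags/
          
          # Trigger DAG (the endpoint expects a JSON body, even an empty one)
          curl -X POST \
            -H "Authorization: Bearer $AIRFLOW_API_KEY" \
            -H "Content-Type: application/json" \
            -d '{}' \
            https://$AIRFLOW_STAGING_HOST/api/v1/dags/pipeline_name/dagRuns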
      
      - name: Integration tests (staging)
        run: |
          pytest tests/integration/ \
            --env=staging \
            --pipeline=pipeline_name
  
  deploy-production:
    needs: validate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://airflow.prod.company.com
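    steps:
      # The deploy scripts live in the repository, so check it out first
      - uses: actions/checkout@v3

      - name: Blue/Green deployment
        env:
          # Assumption: the pipeline name is stored as a repository variable
          PIPELINE_NAME: ${{ vars.PIPELINE_NAME }}
        run: |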
          # Deploy to "green" environment first
          ./scripts/deploy.sh --env=prod-green --pipeline=$PIPELINE_NAME
          
          # Run smoke tests
          ./scripts/smoke-tests.sh --env=prod-green
          
          # Switch traffic to green (zero downtime)
          ./scripts/switch-traffic.sh --to=green
          
          # Decommission blue
          ./scripts/cleanup.sh --env=prod-blue
      
      - name: Notify
        uses: 8398a7/action-slack@v3
        with:
          status: ${{ job.status }}
          text: 'Pipeline deployed to production'
        env:
          # action-slack v3 reads the webhook from this environment variable
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
```

**Deployment Automation Script**

```python
# scripts/deploy_pipeline.py
import os
import shutil
import subprocess
from datetime import datetime
from pathlib import Path
from typing import Dict, Any
import boto3
import requests

class PipelineDeployer:
    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.s3 = boto3.client('s3')
        self.airflow_api = config['airflow_api_url']
        self.airflow_token = config['airflow_token']
    
    def deploy(self, pipeline_path: str, environment: str = 'production'):
        """Deploy a pipeline to Airflow."""
        
        # 1. Check that the required files exist
        required_files = ['pipeline.py', 'dag.py', 'tests.py', 'README.md']
        for file in required_files:
            if not Path(f"{pipeline_path}/{file}").exists():
                raise FileNotFoundError(f"Missing required file: {file}")
        
        # 2. Run the tests
        print("Running tests...")
        test_result = subprocess.run(
            ['pytest', f'{pipeline_path}/tests.py', '-v'],
            capture_output=True,
            text=True
        )
        
        if test_result.returncode != 0:
            raise RuntimeError(f"Tests failed:\n{test_result.stdout}")
        
        print("✅ Tests passed")
        
        # 3. Upload assets to S3
        print("Uploading to S3...")
        pipeline_name = Path(pipeline_path).name
        
        for file in ['pipeline.py', 'README.md']:
            self.s3.upload_file(
                f'{pipeline_path}/{file}',
                self.config['artifacts_bucket'],
                f'pipelines/{pipeline_name}/{file}'
            )
        
        # 4. Deploy DAG to Airflow
        print(f"Deploying to Airflow ({environment})...")
        
        # The stable Airflow REST API has no endpoint for uploading DAG files; the
        # scheduler discovers DAGs in its dags folder, so the file is copied there.
        # 'dags_folder' is an assumed config key; the scheduler may need a moment to
        # parse the new file before the API calls below can find the DAG.
        dags_folder = Path(self.config.get('dags_folder', '/opt/airflow/dags'))
        shutil.copy(f'{pipeline_path}/dag.py', dags_folder / f'{pipeline_name}.py')
        
        # 5. Set Airflow variables
        variables = {
            'pipeline_name': pipeline_name,
            's3_bucket': self.config['data_bucket'],
            'db_conn_id': 'postgres_default',
            'owner_email': self.config['owner_email']
        }
        
        for key, value in variables.items():
            requests.post(
                f"{self.airflow_api}/variables",
                headers={'Authorization': f'Bearer {self.airflow_token}'},
                json={'key': f'{pipeline_name}_{key}', 'value': value}
            )
        
        # 6. Unpause DAG
        requests.patch(
            f"{self.airflow_api}/dags/{pipeline_name}",
            headers={'Authorization': f'Bearer {self.airflow_token}'},
            json={'is_paused': False}
        )
        
        # 7. Trigger initial run (optional)
        if self.config.get('trigger_on_deploy'):
            requests.post(
                f"{self.airflow_api}/dags/{pipeline_name}/dagRuns",
                headers={'Authorization': f'Bearer {self.airflow_token}'},
                json={'conf': {'deployed_at': str(datetime.now())}}
            )
        
        print(f"✅ Pipeline {pipeline_name} deployed successfully")
        
        return {
            'pipeline_name': pipeline_name,
            'environment': environment,
            'dag_url': f"{self.airflow_api}/dags/{pipeline_name}",
            's3_path': f's3://{self.config["artifacts_bucket"]}/pipelines/{pipeline_name}'
        }

# Usage
deployer = PipelineDeployer({
    'airflow_api_url': 'https://airflow.company.com/api/v1',
    'airflow_token': os.getenv('AIRFLOW_TOKEN'),
    'artifacts_bucket': 'company-pipeline-artifacts',
    'data_bucket': 'company-data',
    'owner_email': 'data-team@company.com',
    'trigger_on_deploy': True
})

result = deployer.deploy('./generated_pipelines/ventas_pipeline', environment='production')
```

**Monitoring and Observability**

```python
# monitoring/pipeline_metrics.py
import functools
import os
import time

from datadog import initialize, statsd

initialize(api_key=os.getenv('DATADOG_API_KEY'))

class PipelineMetrics:
    def __init__(self, pipeline_name: str):
        self.pipeline_name = pipeline_name
    
    def track_execution(self, func):
        """Decorator that tracks execution metrics."""
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            start_time = time.time()
            
            # Increment the execution counter
            statsd.increment(
                'pipeline.executions',
                tags=[f'pipeline:{self.pipeline_name}']
            )
            
            try:
                result = func(*args, **kwargs)
                
                # Success metric
                statsd.increment(
                    'pipeline.success',
                    tags=[f'pipeline:{self.pipeline_name}']
                )
                
                return result
            
            except Exception as e:
                # Error metric
                statsd.increment(
                    'pipeline.errors',
                    tags=[
                        f'pipeline:{self.pipeline_name}',
                        f'error_type:{type(e).__name__}'
                    ]
                )
                raise
            
            finally:
                # Duration
                duration = time.time() - start_time
                statsd.histogram(
                    'pipeline.duration',
                    duration,
                    tags=[f'pipeline:{self.pipeline_name}']
                )
        
        return wrapper
    
    def track_data_volume(self, rows: int, stage: str):
        """Track the volume of data processed."""
        statsd.gauge(
            'pipeline.data_volume',
            rows,
            tags=[
                f'pipeline:{self.pipeline_name}',
                f'stage:{stage}'
            ]
        )
    
    def track_data_quality(self, metric: str, value: float):
        """Track data quality metrics."""
        statsd.gauge(
            f'pipeline.data_quality.{metric}',
            value,
            tags=[f'pipeline:{self.pipeline_name}']
        )

# Usage in a generated pipeline
metrics = PipelineMetrics('ventas_pipeline')

@metrics.track_execution
def run_pipeline():
    # Extract
    df = extract_data()
    metrics.track_data_volume(len(df), 'extract')
    
    # Transform
    df = transform_data(df)
    metrics.track_data_volume(len(df), 'transform')
    
    # Validate
    null_rate = df.isnull().sum().sum() / df.size
    metrics.track_data_quality('null_rate', null_rate)
    
    # Load
    load_data(df)
    metrics.track_data_volume(len(df), 'load')
```

**Alerting Strategy**

```python
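# monitoring/alerts.py
import os
import smtplib
import traceback
from email.mime.text import MIMEText
from typing import Optional

from slack_sdk import WebClient

class AlertManager:
    def __init__(self):
        self.slack = WebClient(token=os.getenv('SLACK_TOKEN'))
        self.pagerduty_key = os.getenv('PAGERDUTY_KEY')
    
    def send_alert(
        self,
        severity: str,  # 'info' | 'warning' | 'critical'
        pipeline_name: str,
        message: str,
        details: Optional[dict] = None
    ):
        """Send alerts through channels chosen by severity."""
        
        if severity == 'info':
            # Slack notification
            self._send_slack(pipeline_name, message, details)
        
        elif severity == 'warning':
            # Slack + Email
            self._send_slack(pipeline_name, message, details, urgent=True)
            self._send_email(pipeline_name, message, details)
        
        elif severity == 'critical':
            # Slack + Email + PagerDuty
            self._send_slack(pipeline_name, message, details, urgent=True)
            self._send_email(pipeline_name, message, details)
            self._trigger_pagerduty(pipeline_name, message, details)
    
    def _send_email(self, pipeline, message, details):
        # SMTP settings are read from the environment; these variable names are assumptions
        msg = MIMEText(f"{message}\n\n{details or ''}")
        msg['Subject'] = f'[{pipeline}] pipeline alert'
        msg['From'] = os.getenv('ALERT_EMAIL_FROM', 'alerts@localhost')
        msg['To'] = os.getenv('ALERT_EMAIL_TO', 'data-team@company.com')
        with smtplib.SMTP(os.getenv('SMTP_HOST', 'localhost')) as smtp:
            smtp.send_message(msg)
    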
    def _send_slack(self, pipeline, message, details, urgent=False):
        emoji = '🚨' if urgent else 'ℹ️'
        
        blocks = [
            {
                "type": "header",
                "text": {"type": "plain_text", "text": f"{emoji} {pipeline}"}
            },
            {
                "type": "section",
                "text": {"type": "mrkdwn", "text": message}
            }
        ]
        
        if details:
            blocks.append({
                "type": "section",
                "fields": [
                    {"type": "mrkdwn", "text": f"*{k}:*\n{v}"}
                    for k, v in details.items()
                ]
            })
        
        self.slack.chat_postMessage(
            channel='#data-alerts',
            blocks=blocks
        )
    
    def _trigger_pagerduty(self, pipeline, message, details):
        import requests
        
        requests.post(
            'https://events.pagerduty.com/v2/enqueue',
            json={
                'routing_key': self.pagerduty_key,
                'event_action': 'trigger',
                'payload': {
                    'summary': f'{pipeline}: {message}',
                    'severity': 'critical',
                    'source': 'genai-platform',
                    'custom_details': details
                }
            }
        )

# Integration in a pipeline
alerts = AlertManager()

try:
    run_pipeline()
except Exception as e:
    alerts.send_alert(
        severity='critical',
        pipeline_name='ventas_pipeline',
        message=f'Pipeline failed: {str(e)}',
        details={
            'Error Type': type(e).__name__,
            'Traceback': traceback.format_exc()[:500]
        }
    )
    raise
```

**Cost Optimization**

```python
# monitoring/cost_tracking.py
class CostTracker:
    """Track pipeline execution costs."""
    
    COST_PER_COMPUTE_HOUR = 0.50  # USD
    COST_PER_GB_STORAGE = 0.023   # USD/mes
    COST_PER_MILLION_REQUESTS = 0.20  # USD
    
    def calculate_pipeline_cost(
        self,
        runtime_minutes: float,
        data_volume_gb: float,
        api_calls: int
    ) -> dict:
        
        compute_cost = (runtime_minutes / 60) * self.COST_PER_COMPUTE_HOUR
        storage_cost = data_volume_gb * self.COST_PER_GB_STORAGE
        api_cost = (api_calls / 1_000_000) * self.COST_PER_MILLION_REQUESTS
        
        total = compute_cost + storage_cost + api_cost
        
        return {
            'compute': round(compute_cost, 4),
            'storage': round(storage_cost, 4),
            'api': round(api_cost, 4),
            'total': round(total, 4)
        }
    
    def optimize_suggestions(self, metrics: dict) -> list:
        """Return optimization suggestions."""
        suggestions = []
        
        if metrics['runtime_minutes'] > 60:
            suggestions.append("Consider partitioning data for parallel processing")
        
        if metrics['data_volume_gb'] > 100:
            suggestions.append("Use columnar format (Parquet) to reduce storage")
        
        if metrics['api_calls'] > 1_000_000:
            suggestions.append("Implement caching to reduce API calls")
        
        return suggestions

# In the Streamlit dashboard
import streamlit as st

cost_tracker = CostTracker()

# Example run metrics; in practice these would come from the monitoring backend
metrics = {'runtime_minutes': 15, 'data_volume_gb': 50, 'api_calls': 100_000}

st.subheader('💰 Cost Analysis')
costs = cost_tracker.calculate_pipeline_cost(**metrics)

col1, col2, col3, col4 = st.columns(4)
col1.metric('Compute', f"${costs['compute']}")
col2.metric('Storage', f"${costs['storage']}")
col3.metric('API', f"${costs['api']}")
col4.metric('Total', f"${costs['total']}", delta='-12%')  # placeholder delta

suggestions = cost_tracker.optimize_suggestions(metrics)
if suggestions:
    st.info('\n'.join(suggestions))
```

---
**Author:** Luis J. Raigoso V. (LJRV)

## Part 12: Monitoring and observability

In [None]:
def generate_monitoring_dashboard(pipeline_name: str):
    """Generate the code for a monitoring dashboard."""
    # Inside the f-string, literal braces are doubled ({{ and }}) so that only
    # {pipeline_name} is interpolated.
    dashboard_code = f'''
import streamlit as st
import pandas as pd
import plotly.express as px

st.title('📊 Monitoring: {pipeline_name}')

# Simulated metrics
col1, col2, col3, col4 = st.columns(4)
col1.metric('Successful runs', '95%', '↑ 2%')
col2.metric('Average runtime', '12.5 min', '↓ 1.2 min')
col3.metric('Records processed', '1.2M', '↑ 15K')
col4.metric('Errors', '3', '↓ 5')

# Run history chart
df_runs = pd.DataFrame({{
    'fecha': pd.date_range('2024-01-01', periods=30),
    'duracion': [10 + i*0.2 for i in range(30)],
    'status': ['success'] * 28 + ['failed'] * 2
}})

fig = px.line(df_runs, x='fecha', y='duracion', color='status')
st.plotly_chart(fig)

# Recent logs
st.subheader('Recent logs')
st.text_area('', value='2024-01-15 02:00 - [INFO] Pipeline started\\n2024-01-15 02:05 - [INFO] 10000 records processed', height=200)
'''
    
    return dashboard_code

monitoring_dash = generate_monitoring_dashboard(requirements['pipeline_name'])
print('Monitoring dashboard generated')

## Part 13: Future improvements

### Roadmap

**Phase 2**:
- Multi-agent: a team of specialized agents (data engineer, QA, DevOps)
- Cost estimation: predict infrastructure costs
- A/B testing: compare pipeline versions
- Auto-scaling: adjust resources to load

**Phase 3**:
- Self-healing: automatic error detection and correction
- Optimization: improvement suggestions based on metrics
- Federated learning: learn from other teams' pipelines
- Natural language alerting: alerts explained in natural language

## Evaluation

**Criteria**:

- ✅ Correct requirements parsing (15%)
- ✅ Functional code generation (25%)
- ✅ Validation and security (15%)
- ✅ Complete tests (15%)
- ✅ Clear documentation (10%)
- ✅ Orchestrated workflow (10%)
- ✅ Usable interface (10%)

## Conclusion

You have built a complete self-service platform that:

- ✅ Democratizes access to data engineering
- ✅ Aims to cut development time from days to minutes
- ✅ Maintains quality and security standards
- ✅ Generates documentation automatically
- ✅ Simplifies deployment and monitoring

**Congratulations on completing the GenAI module!** 🎉

### 🎓 **Course Conclusion: From Junior to GenAI Expert**

**The Full Journey: 40 Notebooks, 4 Levels**

```
┌──────────────────────────────────────────────────────────────┐
│           DATA ENGINEER COURSE: LEARNING PATH                │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  📚 JUNIOR LEVEL (10 notebooks)                              │
│  ├─ Fundamentals: Python, Pandas, SQL                        │
│  ├─ Data Manipulation: Basic ETL                             │
│  ├─ Visualization: Matplotlib, Seaborn                       │
│  └─ Git: Version control                                     │
│      ↓                                                       │
│  🔧 MID LEVEL (10 notebooks)                                 │
│  ├─ Airflow: Pipeline orchestration                          │
│  ├─ Streaming: Kafka, real-time processing                   │
│  ├─ Cloud: AWS (S3, Redshift, Lambda)                        │
│  ├─ Databases: PostgreSQL, MongoDB                           │
│  └─ FastAPI: Data services                                   │
│      ↓                                                       │
│  ⚡ SENIOR LEVEL (10 notebooks)                              │
│  ├─ Data Governance: Quality, lineage                        │
│  ├─ Lakehouse: Delta, Iceberg                                │
│  ├─ Spark Streaming: Distributed processing                  │
│  ├─ Architectures: Medallion, Lambda, Kappa                  │
│  ├─ ML Pipelines: Feature stores, MLOps                      │
│  └─ FinOps: Cost optimization                                │
│      ↓                                                       │
│  🤖 GENAI LEVEL (10 notebooks)                               │
│  ├─ 01: LLM Fundamentals & Prompting                         │
│  ├─ 02: NL2SQL (Natural Language → SQL)                      │
│  ├─ 03: ETL Code Generation                                  │
│  ├─ 04: RAG for Data Documentation                           │
│  ├─ 05: Embeddings & Data Similarity                         │
│  ├─ 06: Automation Agents                                    │
│  ├─ 07: Quality and Validation with LLMs                     │
│  ├─ 08: Data Synthesis and Augmentation                      │
│  ├─ 09: Capstone Project 1 (RAG+NL2SQL Chatbot)              │
│  └─ 10: Capstone Project 2 (Self-Service Platform) ✅       │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

**Skills Acquired by Level**

| Level | Core Skills | Tools Mastered | Target Roles |
|-------|-------------|----------------|--------------|
| **Junior** | • Python (Pandas, NumPy)<br>• Basic SQL<br>• Manual ETL<br>• Git | Jupyter, Pandas, PostgreSQL, Git | Entry-level Data Engineer<br>Data Analyst |
| **Mid** | • Airflow DAGs<br>• Streaming (Kafka)<br>• Cloud (AWS)<br>• APIs (FastAPI) | Airflow, Kafka, AWS, Docker, FastAPI, MongoDB | Mid-level Data Engineer<br>Data Platform Engineer |
| **Senior** | • Data Architecture<br>• Distributed Systems (Spark)<br>• Lakehouse (Delta/Iceberg)<br>• MLOps | Spark, Delta Lake, Databricks, Kubernetes, Terraform | Senior Data Engineer<br>Data Architect<br>Staff Engineer |
| **GenAI** | • LLM Integration<br>• RAG Systems<br>• Agentic Workflows<br>• Code Generation | OpenAI API, LangChain, ChromaDB, LangGraph, Streamlit | GenAI Data Engineer<br>ML Platform Engineer<br>AI Automation Specialist |

**Comparison: Traditional vs GenAI-Enhanced Skills**

```
TRADITIONAL DATA ENGINEER              GENAI-ENHANCED DATA ENGINEER
┌──────────────────────────┐          ┌──────────────────────────────┐
│                          │          │                              │
│ • Manual coding          │   →      │ • AI-assisted coding         │
│ • Static pipelines       │   →      │ • Self-adapting pipelines    │
│ • Manual documentation   │   →      │ • Auto-generated docs        │
│ • Fixed schemas          │   →      │ • Schema inference           │
│ • SQL only               │   →      │ • NL→SQL translation         │
│ • Manual data quality    │   →      │ • LLM-powered validation     │
│ • Code reviews (human)   │   →      │ • Multi-agent reviews        │
│ • Static dashboards      │   →      │ • Conversational analytics   │
│                          │          │                              │
└──────────────────────────┘          └──────────────────────────────┘

   Productivity: 1x                       Productivity: 3-5x
   Time to Market: weeks                  Time to Market: hours
   Error Rate: 5-10%                      Error Rate: 1-2%
```

**Final Project: Real Impact**

Capstone Project 2 (Self-Service Platform) represents the **highest level of automation** covered in this course:

**Without GenAI (Traditional)**:
```
User → files a ticket → DE receives it → DE develops (2-4 weeks)
→ QA testing (1 week) → Code review → Deploy → User receives the result
```
**With GenAI (Platform)**:
```
User → describes the need in NL → Platform generates (10 min) → Approval
→ Automatic deploy → User receives the result
```
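The platform flow above is a sequence of stages with an approval gate before deployment. The following is a minimal sketch of that idea, not the platform's actual implementation: `Stage`, `PipelineRequest`, and all method names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Stage(Enum):
    """Hypothetical stages mirroring the flow: describe, generate, approve, deploy."""
    DESCRIBED = auto()
    GENERATED = auto()
    APPROVED = auto()
    REJECTED = auto()
    DEPLOYED = auto()


@dataclass
class PipelineRequest:
    description: str
    stage: Stage = Stage.DESCRIBED
    history: list = field(default_factory=lambda: [Stage.DESCRIBED])

    def _move(self, expected: Stage, new: Stage) -> None:
        # Refuse any transition that skips a stage (e.g. deploy without approval).
        if self.stage is not expected:
            raise ValueError(f"cannot go from {self.stage.name} to {new.name}")
        self.stage = new
        self.history.append(new)

    def generate(self) -> None:
        # In a real platform this is where the LLM call would happen.
        self._move(Stage.DESCRIBED, Stage.GENERATED)

    def review(self, approved: bool) -> None:
        self._move(Stage.GENERATED, Stage.APPROVED if approved else Stage.REJECTED)

    def deploy(self) -> None:
        self._move(Stage.APPROVED, Stage.DEPLOYED)


req = PipelineRequest("Load daily sales from CSV into PostgreSQL")
req.generate()
req.review(approved=True)
req.deploy()
print([s.name for s in req.history])
```

Modeling the gate as an explicit state check means a rejected or unreviewed request cannot reach `deploy()`, which is the property the approval step is meant to guarantee.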

**Transformation Metrics**:
- ⏱️ **Time to Production**: 3-4 weeks → 2 hours (over 98% reduction)
- 💰 **Cost per Pipeline**: $5,100 → $163 (97% reduction)
- 👥 **Democratization**: DEs only → any user
- 📊 **Quality**: Variable → standardized (templates + validation)
- 🧪 **Test Coverage**: ~40% → >80%
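
The reduction percentages can be checked with a few lines of arithmetic. This sketch takes the cost and time figures from the list above; the 40-hour working week is an assumption (the course does not say how the weeks are counted), and `pct_reduction` is a hypothetical helper. Under that assumption the time reduction comes out above 98%, and counting calendar hours would make it larger still.

```python
def pct_reduction(before: float, after: float) -> float:
    """Return the percentage reduction when going from `before` to `after`."""
    return (before - after) / before * 100


# Cost per pipeline, in dollars (figures from the metrics list).
cost = pct_reduction(5100, 163)

# Time to production, in hours. ASSUMPTION: one week = 40 working hours.
HOURS_PER_WEEK = 40
time_low = pct_reduction(3 * HOURS_PER_WEEK, 2)   # 3 weeks → 2 hours
time_high = pct_reduction(4 * HOURS_PER_WEEK, 2)  # 4 weeks → 2 hours

print(f"Cost reduction: {cost:.1f}%")
print(f"Time reduction: between {time_low:.2f}% and {time_high:.2f}%")
```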

**Next Steps: Keep Learning**

**1. Go Deeper into GenAI**
```python
# Advanced areas
topics = [
    'Fine-tuning LLMs for a specific domain',
    'Advanced prompt engineering (Chain-of-Thought, ReAct)',
    'Multi-modal AI (text + images + tables)',
    'Autonomous agents (Auto-GPT, BabyAGI)',
    'LLM evaluation & benchmarking',
    'Cost optimization (caching, model selection)'
]
```

**2. Specializations**
- **ML Engineering**: Feature stores (Feast, Tecton), model serving (MLflow)
- **Real-time Analytics**: Flink, Kafka Streams, ksqlDB
- **Data Mesh**: Domain-oriented data platforms
- **Observability**: OpenTelemetry, distributed tracing

**3. Recommended Certifications**
```
Cloud:
├─ AWS Certified Data Engineer - Associate
├─ Google Professional Data Engineer
└─ Azure Data Engineer Associate

Data:
├─ Databricks Certified Data Engineer
├─ Snowflake SnowPro Core
└─ dbt Analytics Engineering

AI/ML:
├─ TensorFlow Developer Certificate
├─ MLOps Specialization (Coursera)
└─ Prompt Engineering for Developers (DeepLearning.AI)
```

**4. Build a Portfolio**
```
GitHub Portfolio Projects:
├─ 1. End-to-end ETL pipeline (batch + streaming)
│     Tech: Airflow, Spark, Delta Lake
│
├─ 2. RAG system for a specific domain
│     Tech: OpenAI, ChromaDB, LangChain, FastAPI
│
├─ 3. Real-time dashboard with NL2SQL
│     Tech: Streamlit, PostgreSQL, LLMs
│
├─ 4. Data quality framework
│     Tech: Great Expectations, dbt, Airflow
│
└─ 5. GenAI code generator (like this project)
      Tech: LangGraph, OpenAI, GitHub Actions
```

**5. Community and Networking**
- **Conferences**: Data Council, Data + AI Summit (formerly Spark Summit), MLOps World
- **Meetups**: Local data engineering & ML groups
- **Open Source**: Contribute to Airflow, dbt, LangChain
- **Writing**: Blog posts, technical tutorials
- **LinkedIn**: Share projects and lessons learned

**Final Reflection: The Future of Data Engineering**

```
┌────────────────────────────────────────────────────────┐
│          THE EVOLUTION OF DATA ENGINEERING             │
├────────────────────────────────────────────────────────┤
│                                                        │
│  2010s: Manual ETL, Hadoop Era                         │
│  ├─ Batch processing                                   │
│  └─ Heavy lifting, manual scaling                      │
│                                                        │
│  2020s: Cloud-Native, Streaming                        │
│  ├─ Real-time data                                     │
│  ├─ Serverless, auto-scaling                           │
│  └─ Infrastructure as Code                             │
│                                                        │
│  2024+: GenAI-Augmented Data Engineering 🚀            │
│  ├─ Natural language interfaces                        │
│  ├─ Self-healing pipelines                             │
│  ├─ Auto-generated code & docs                         │
│  ├─ Autonomous agents                                  │
│  └─ Democratized data access                           │
│                                                        │
│  Future (2030s): Fully Autonomous Data Platforms?      │
│  ├─ Self-optimizing architectures                      │
│  ├─ Predictive data quality                            │
│  ├─ Zero-touch operations                              │
│  └─ Human as orchestrator, not operator                │
│                                                        │
└────────────────────────────────────────────────────────┘
```

**Your Role in This Transformation**

As a Data Engineer with GenAI skills, you do more than build pipelines: **you shape how organizations work with data**. The tools you have learned let you:

- ✅ **Democratize data access** (any user can explore)
- ✅ **Accelerate innovation** (from weeks to hours)
- ✅ **Reduce operating costs** (intelligent automation)
- ✅ **Improve decision quality** (faster, more reliable insights)

**Thank You**

You have completed an intensive journey of **40 notebooks**, from the fundamentals to the leading edge of GenAI in Data Engineering.

**Course Statistics**:
- 📚 **40 notebooks** (Junior: 10, Mid: 10, Senior: 10, GenAI: 10)
- 💻 **~15,000 lines** of explanatory code
- ⏱️ **~120 hours** of estimated content
- 🔧 **50+ tools** covered
- 🚀 **10+ capstone** projects

This knowledge gives you a combination that is still uncommon among Data Engineers:
1. **Solid fundamentals** (SQL, Python, architecture)
2. **Cloud-native expertise** (AWS, Airflow, Spark)
3. **GenAI proficiency** (LLMs, RAG, agents)

**The world needs professionals like you who can build the bridge between data and AI.**

```python
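# Your next line of code starts now
from itertools import cycle, islice


def build_the_future(steps: int = 6) -> list[str]:
    """
    The course ends here, but your journey continues.
    Every pipeline you build, every problem you solve,
    every system you optimize shapes the future of data.
    """
    # The learn → build → share cycle never really ends;
    # islice bounds it here only so the example terminates.
    return list(islice(cycle(["learn", "build", "share"]), steps))


# Good luck in your career as a GenAI Data Engineer! 🚀
print(" → ".join(build_the_future()))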
```

---
**Author:** Luis J. Raigoso V. (LJRV)

**Congratulations on completing the full Data Engineer Course! 🎉🎊**

*"The best way to predict the future is to build it."* — Alan Kay