# 🔧 Automatic ETL Code Generation with LLMs

Goal: automate the creation of ETL pipelines, transformations, and data scripts with generative AI, backed by validation and best practices.

- Duration: 90-120 min
- Difficulty: Medium/High
- Prerequisites: GenAI 01-02, ETL experience

### 🏗️ ETL Code Generation Architecture

**Historical evolution:**

- **2015-2019:** Templates (Jinja2, Cookiecutter) - limited flexibility
- **2020-2022:** GPT-3/Codex - function-level autocomplete
- **2023-2024:** GPT-4 - complete pipelines with self-correction
- **2024+:** LLM + MCP + RAG - production-ready code with context

**6-layer pipeline:**

1. **Specification** - Parse natural-language requirements
2. **Context Retrieval (RAG)** - Search the codebase for similar patterns
3. **Code Generation** - Generate pipeline + config + tests
4. **Validation** - Syntax, linting, types, security
5. **Self-Correction** - Iterate up to 3 times, fixing the reported errors
6. **Testing** - Unit + integration tests with 80%+ coverage

**Final output:** pipeline.py, config.yaml, tests, Dockerfile, CI/CD, README
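The stages above can be wired together in a few lines. The sketch below is only an illustration: `GenerationRun`, `run_generation`, and the four injected callables are hypothetical names rather than part of any library, and stages 1 (specification parsing) and 6 (testing) are left out for brevity. Stub stages stand in for the LLM so the control flow can be run offline.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class GenerationRun:
    """Outcome of one generation request (hypothetical container)."""
    code: str = ""
    corrections: int = 0
    passed: bool = False
    log: List[str] = field(default_factory=list)


def run_generation(
    spec: str,
    retrieve: Callable[[str], str],
    generate: Callable[[str, str], str],
    validate: Callable[[str], Tuple[bool, List[str]]],
    correct: Callable[[str, List[str]], str],
    max_corrections: int = 3,
) -> GenerationRun:
    """Chain retrieval, generation, validation, and bounded self-correction."""
    run = GenerationRun()
    context = retrieve(spec)                      # Stage 2: context retrieval
    run.code = generate(spec, context)            # Stage 3: code generation
    ok, errors = validate(run.code)               # Stage 4: validation
    run.log.append(f"initial validation: {'pass' if ok else 'fail'}")
    while not ok and run.corrections < max_corrections:  # Stage 5: self-correction
        run.corrections += 1
        run.code = correct(run.code, errors)
        ok, errors = validate(run.code)
        run.log.append(f"correction {run.corrections}: {'pass' if ok else 'fail'}")
    run.passed = ok
    return run


# Stub stages so the sketch runs without an LLM
def syntax_only(code: str) -> Tuple[bool, List[str]]:
    """Minimal validator: the code only has to compile."""
    try:
        compile(code, "<generated>", "exec")
        return True, []
    except SyntaxError as exc:
        return False, [f"line {exc.lineno}: {exc.msg}"]


run = run_generation(
    spec="load sales",
    retrieve=lambda spec: "",
    generate=lambda spec, ctx: "def load(:",  # deliberately broken output
    validate=syntax_only,
    correct=lambda code, errors: "def load():\n    return []\n",
)
print(run.passed, run.corrections, run.log)
```

Passing the stages in as callables keeps the loop testable: the same function works with a real LLM client or with stubs like the ones shown here.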

### LLM Model Selection

| Model | Use Case | Strengths | Cost per 1M tokens (USD) |
|-------|----------|-----------|--------------------------|
| **GPT-4o** | Complex pipelines | 95% accuracy, excellent reasoning | $2.50 in / $10 out |
| **Claude 3.5 Sonnet** | SQL/Pandas transformations | Code quality, refactoring | $3 in / $15 out |
| **GPT-3.5-turbo** | Simple scripts, boilerplate | Fast (1-2s), inexpensive | $0.50 in / $1.50 out |
| **Codestral** | On-prem, open weights | Privacy, customizable | Self-hosted |
| **Gemini 1.5 Pro** | Large context (2M tokens) | Massive window, multimodal | $1.25 in / $5 out |

### Function: Building the Code Generation Prompt

This function builds an optimized prompt that follows prompt engineering best practices for ETL.

In [None]:
from typing import List, Dict, Optional

def build_code_generation_prompt(
    requirements: str,
    tech_stack: List[str],
    context: Optional[str] = None,
    constraints: Optional[Dict] = None
) -> str:
    """
    Build an optimized prompt for ETL code generation.
    
    Best practices:
    - State the tech stack explicitly
    - Include examples from the codebase (few-shot)
    - Define quality standards
    - Ask for explanations (reasoning)
    """
    
    prompt = f"""You are an expert Data Engineer specializing in production ETL pipelines.

**Requirements:**
{requirements}

**Tech Stack:**
{', '.join(tech_stack)}

**Coding Standards:**
- Python 3.11+ with type hints (PEP 484)
- Error handling with try-except and logging
- Docstrings (Google style)
- Modular functions (max 50 lines)
- Configuration externalized (YAML/env vars)
- Idempotent operations (safe to re-run)

**Quality Checklist:**
✅ No hardcoded credentials
✅ Parameterized queries (prevent SQL injection)
✅ Retry logic with exponential backoff
✅ Metrics instrumentation (Prometheus)
✅ Comprehensive logging (structured JSON)
✅ Unit tests (pytest, 80%+ coverage)

"""
    
    if context:
        prompt += f"\n**Codebase Context (existing patterns):**\n{context}\n"
    
    if constraints:
        prompt += "\n**Constraints:**\n"
        for key, value in constraints.items():
            prompt += f"- {key}: {value}\n"
    
    prompt += """
**Output Format:**
1. Main pipeline code (complete and executable)
2. Configuration file (YAML)
3. Unit tests (pytest)
4. Brief explanation of design decisions

Generate the code:
"""
    return prompt

print("✅ build_code_generation_prompt() function defined")

### Example: Complete Pipeline with OpenAI GPT-4o

End-to-end generation with GPT-4o, chosen here for its context handling and its reasoning about code:

In [None]:
from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Define the pipeline requirements
requirements = """
Create an ETL pipeline that:
1. Extracts data from PostgreSQL table 'sales_raw' (incremental load by timestamp)
2. Transforms: clean nulls, deduplicate by transaction_id, calculate revenue
3. Loads to S3 parquet partitioned by date (YYYY/MM/DD)
4. Updates metadata table with execution stats
"""

tech_stack = [
    "Python 3.11",
    "pandas 2.x",
    "psycopg2",
    "boto3",
    "pyarrow",
    "Apache Airflow 2.8"
]

constraints = {
    "max_memory": "2GB",
    "execution_time": "<10 minutes",
    "idempotency": "required (safe re-runs)",
    "logging": "structured JSON to CloudWatch"
}

# Build the prompt with context
prompt = build_code_generation_prompt(
    requirements=requirements,
    tech_stack=tech_stack,
    constraints=constraints,
    context="Existing pattern uses @task decorators, config from YAML, logger.info() for metrics"
)

# Generate code with GPT-4o
response = client.chat.completions.create(
    model="gpt-4o",  # Stronger on complex code
    messages=[
        {"role": "system", "content": "You are an expert Data Engineer with 10+ years in production ETL systems."},
        {"role": "user", "content": prompt}
    ],
    temperature=0.3,  # Low temperature = less random output (not fully deterministic)
    max_tokens=3000
)

generated_code = response.choices[0].message.content

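print("🔥 Code generated by GPT-4o:")
print("=" * 80)
print(generated_code)
print("=" * 80)
# Input and output tokens are billed at different rates
# (GPT-4o: $2.50 in / $10 out per 1M tokens, per the model table above)
usage = response.usage
cost = (usage.prompt_tokens * 2.50 + usage.completion_tokens * 10.00) / 1_000_000
print(f"\n💰 Tokens: {usage.total_tokens} | Approx. cost: ${cost:.4f}")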
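The prompt asks for four deliverables in a single reply, so the response usually has to be split before the code can be saved or validated. A minimal sketch, assuming the model wraps each deliverable in a Markdown fence with a language tag (which is common but not guaranteed); `extract_code_blocks` and `first_block` are hypothetical helper names.

```python
import re
from typing import List, Tuple

# A fenced block: three backticks, optional language tag, body, closing fence.
FENCE_RE = re.compile(
    r"^`{3}([\w+-]*)[ \t]*\n(.*?)^`{3}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def extract_code_blocks(markdown: str) -> List[Tuple[str, str]]:
    """Return (language, code) pairs for every fenced block in an LLM reply."""
    return [(lang.lower(), body) for lang, body in FENCE_RE.findall(markdown)]


def first_block(markdown: str, language: str) -> str:
    """Return the first block tagged with `language`, or '' if there is none."""
    for lang, body in extract_code_blocks(markdown):
        if lang == language:
            return body
    return ""


# Build a sample reply without writing literal fences in this example
fence = "`" * 3
reply = (
    f"Here is the pipeline:\n{fence}python\nprint('etl')\n{fence}\n"
    f"Config:\n{fence}yaml\nbatch_size: 500\n{fence}\n"
)
blocks = extract_code_blocks(reply)
print(blocks)
```

Nested fences (for example a README that itself contains code blocks) are not handled; a reply like that would need a real Markdown parser.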

### 🧠 RAG-Enhanced Generation: Using the Codebase Documentation

For large projects, the LLM needs access to existing patterns, architecture, and conventions. **RAG (Retrieval-Augmented Generation)** retrieves relevant fragments of the codebase before generating:

**Benefits:**
- ✅ Consistency with existing styles
- ✅ Reuse of utilities that are already implemented
- ✅ Avoids reinventing the wheel
- ✅ Respects earlier architectural decisions

In [None]:
from sentence_transformers import SentenceTransformer
import chromadb
from pathlib import Path

# 1. Index the existing codebase
def index_codebase(repo_path: str, collection_name: str = "etl_code"):
    """Index the project's Python files with embeddings."""
    
    chroma_client = chromadb.PersistentClient(path="./chroma_db")
    collection = chroma_client.get_or_create_collection(name=collection_name)
    
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    indexed = 0  # Counted explicitly so an empty repo does not raise NameError
    for file_path in sorted(Path(repo_path).rglob("*.py")):
        code_content = file_path.read_text(encoding='utf-8')
        if not code_content.strip():
            continue  # Skip empty files such as a bare __init__.py
        
        # Embed the file contents (long files are truncated at the model's max sequence length)
        embedding = model.encode(code_content).tolist()
        
        # upsert keeps re-runs idempotent; the file path serves as a stable ID
        collection.upsert(
            embeddings=[embedding],
            documents=[code_content],
            metadatas=[{"file": str(file_path), "type": "python"}],
            ids=[str(file_path)]
        )
        indexed += 1
    
    print(f"✅ Indexed {indexed} Python files")
    return collection

# 2. Retrieve code relevant to the task
def retrieve_relevant_code(query: str, collection, top_k: int = 3) -> str:
    """Retrieve the k files most semantically similar to the query."""
    
    # Must be the same embedding model that was used at indexing time
    model = SentenceTransformer('all-MiniLM-L6-v2')
    query_embedding = model.encode(query).tolist()
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    
    context = "\n\n--- Codebase example ---\n\n".join(
        [f"# {meta['file']}\n{doc}" for doc, meta in zip(results['documents'][0], results['metadatas'][0])]
    )
    
    return context

# 3. Generate with RAG context
collection = index_codebase("./scripts/etl")

relevant_context = retrieve_relevant_code(
    query="PostgreSQL incremental load with Airflow",
    collection=collection,
    top_k=2
)

# Use the retrieved context in the prompt
rag_prompt = build_code_generation_prompt(
    requirements=requirements,
    tech_stack=tech_stack,
    context=relevant_context,  # ← Similar existing code is injected here
    constraints=constraints
)

response_rag = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a Data Engineer. Reuse patterns from the codebase context."},
        {"role": "user", "content": rag_prompt}
    ],
    temperature=0.2
)

print("🧠 Code generated with RAG (codebase context):")
print(response_rag.choices[0].message.content)
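`index_codebase()` embeds each file as a single document. Sentence-embedding models have a maximum input length (the all-MiniLM-L6-v2 model card mentions truncation beyond 256 word pieces), so for longer files only the beginning influences the vector. One common alternative is to index one chunk per top-level function or class. The sketch below shows that idea with the standard `ast` module; `chunk_python_source` and the `file::name` ID scheme are illustrative choices, not part of chromadb or sentence-transformers.

```python
import ast
from typing import Dict, List


def chunk_python_source(source: str, file_name: str = "<unknown>") -> List[Dict[str, str]]:
    """Split a module into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Exact source text of the definition (decorator lines are not included)
            segment = ast.get_source_segment(source, node)
            if segment:
                chunks.append({
                    "id": f"{file_name}::{node.name}",
                    "name": node.name,
                    "code": segment,
                })
    return chunks


sample = '''
import os

def extract(path):
    return open(path).read()

class Loader:
    def load(self, rows):
        return len(rows)
'''

chunks = chunk_python_source(sample, "etl.py")
for chunk in chunks:
    print(chunk["id"], len(chunk["code"]), "chars")
```

Each chunk's `code` could then be embedded and stored with its `id`, in place of the whole file, so a query can match an individual function.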

### ⚡ Token and Cost Optimization

Long prompts (for example, with RAG context) should be trimmed to reduce cost and latency:

In [None]:
import tiktoken

def optimize_prompt(prompt: str, max_tokens: int = 1500, model: str = "gpt-4o") -> str:
    """
    Trim the prompt if it exceeds max_tokens, keeping its head and tail.
    """
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(prompt)
    
    if len(tokens) <= max_tokens:
        return prompt
    
    # Heuristic: keep the start (requirements) and the end (output instructions),
    # and drop the middle, which is where the RAG context usually sits
    keep_start = max_tokens // 2
    keep_end = max_tokens - keep_start
    
    truncated_tokens = tokens[:keep_start] + tokens[-keep_end:]
    optimized_prompt = encoding.decode(truncated_tokens)
    
    print(f"⚠️ Prompt truncated: {len(tokens)} → {len(truncated_tokens)} tokens")
    return optimized_prompt

# Example: a very long original prompt
long_prompt = rag_prompt  # Assume rag_prompt is around 3000 tokens

optimized = optimize_prompt(long_prompt, max_tokens=1500, model="gpt-4o")

encoding = tiktoken.encoding_for_model("gpt-4o")
print(f"Original tokens: {len(encoding.encode(long_prompt))}")
print(f"Optimized tokens: {len(encoding.encode(optimized))}")
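Cutting the middle of a finished prompt can slice a code snippet in half. A gentler option is to enforce the budget before the prompt is built, keeping or dropping whole retrieved snippets. The sketch below assumes the snippets arrive ranked by relevance; `approx_tokens` uses a rough 4-characters-per-token rule of thumb (an assumption, not a real tokenizer), and a tiktoken-based counter can be passed in its place.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def fit_snippets_to_budget(
    snippets: List[str],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> List[str]:
    """Keep snippets in ranked order while they fit; skip any that would overflow."""
    kept: List[str] = []
    used = 0
    for snippet in snippets:
        cost = count_tokens(snippet)
        if used + cost > budget_tokens:
            continue  # Too large for the remaining budget; a smaller one may still fit
        kept.append(snippet)
        used += cost
    return kept


ranked = ["a" * 400, "b" * 4000, "c" * 200]  # estimated 100, 1000, and 50 tokens
selected = fit_snippets_to_budget(ranked, budget_tokens=200)
print([len(s) for s in selected])
```

The selected snippets can then be joined and passed as `context` to `build_code_generation_prompt()`, so the coding standards and output instructions are never truncated.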

In [None]:
# Cost comparison by model (2024 list prices in USD per 1K tokens; check current provider pricing)
models_pricing = {
    "gpt-4o": {"input": 0.0025, "output": 0.010},  # $2.50 / $10 per 1M, as in the model table
    "gpt-4-turbo": {"input": 0.01, "output": 0.03},
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
    "claude-3.5-sonnet": {"input": 0.003, "output": 0.015},
    "codestral": {"input": 0.001, "output": 0.003}
}

prompt_tokens = 1500
completion_tokens = 2000

print("💰 Cost comparison for 1 generation (1.5K input + 2K output):\n")
for model, prices in models_pricing.items():
    cost = (prompt_tokens * prices["input"] / 1000) + (completion_tokens * prices["output"] / 1000)
    print(f"{model:20s} ${cost:.4f}")

print("\n📊 For 1,000 generations per month:")
for model, prices in models_pricing.items():
    cost = ((prompt_tokens * prices["input"] / 1000) + (completion_tokens * prices["output"] / 1000)) * 1000
    print(f"{model:20s} ${cost:.2f}")

## 🛡️ Code Validation & Security: Multi-Layer Quality Assurance

LLM-generated code is **not trustworthy by default**. It must pass a rigorous validation pipeline before it runs in production:

### Validation Pipeline Architecture (5 Layers)

```
Generated Code
     │
     ▼
┌─────────────────────────────────────┐
│  LAYER 1: SYNTAX VALIDATION         │
│  ├─ AST Parsing (ast.parse)         │
│  ├─ Compile Check (compile())       │
│  └─ Python Version Compatibility    │
│  → Result: Syntactically valid code │
└────────────┬────────────────────────┘
             │ PASS
             ▼
┌─────────────────────────────────────┐
│  LAYER 2: STATIC ANALYSIS           │
│  ├─ Linting (ruff/flake8/pylint)    │
│  ├─ Type Checking (mypy)            │
│  ├─ Complexity (radon: CC < 10)     │
│  ├─ Code Smells (pylint)            │
│  └─ Formatting (black/autopep8)     │
│  → Result: Clean, readable code     │
└────────────┬────────────────────────┘
             │ PASS
             ▼
┌─────────────────────────────────────┐
│  LAYER 3: SECURITY SCAN             │
│  ├─ Bandit (vulnerability detection)│
│  ├─ Safety (dependency vulnerabilities)│
│  ├─ Secrets Detection (detect-secrets)│
│  ├─ SQL Injection Check (sqlparse)  │
│  └─ Path Traversal Check            │
│  → Result: Secure code              │
└────────────┬────────────────────────┘
             │ PASS
             ▼
┌─────────────────────────────────────┐
│  LAYER 4: FUNCTIONAL TESTING        │
│  ├─ Unit Tests (pytest)             │
│  ├─ Integration Tests (testcontainers)│
│  ├─ Property-Based (hypothesis)     │
│  ├─ Coverage (pytest-cov > 80%)     │
│  └─ Performance (locust benchmarks) │
│  → Result: Functionally correct     │
└────────────┬────────────────────────┘
             │ PASS
             ▼
┌─────────────────────────────────────┐
│  LAYER 5: RUNTIME VALIDATION        │
│  ├─ Dry-Run with Sample Data        │
│  ├─ Memory Profiling (memory_profiler)│
│  ├─ Execution Time Check            │
│  └─ Resource Limits (cgroups)       │
│  → Result: Production-ready         │
└─────────────────────────────────────┘
```

### CodeValidator Class: Implementing the Validation System

A multi-layer validator that checks the syntax, linting, types, security, and complexity of the generated code:

In [None]:
import ast
import subprocess
from pathlib import Path
from typing import Dict, List, Tuple
import tempfile
import json

class CodeValidator:
    """Multi-layer validation system for LLM-generated code."""
    
    def __init__(self):
        self.validation_results = {}
    
    def validate_syntax(self, code: str) -> Tuple[bool, str]:
        """Layer 1: AST parsing and compilation."""
        try:
            tree = ast.parse(code)
            compile(code, '<generated>', 'exec')
            
            issues = []
            for node in ast.walk(tree):
                # Flag exec()/eval() wherever they are called, not only as bare statements
                if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                        and node.func.id in ('exec', 'eval')):
                    issues.append(f"⚠️ {node.func.id}() detected (security risk)")
                if isinstance(node, ast.ExceptHandler) and node.type is None:
                    issues.append("⚠️ Bare except: clause (catch specific exceptions)")
            
            if issues:
                return True, "Valid with warnings:\n" + "\n".join(issues)
            return True, "✅ Syntax valid"
        except SyntaxError as e:
            return False, f"❌ Syntax Error at line {e.lineno}: {e.msg}"
        except Exception as e:
            return False, f"❌ Compilation Error: {str(e)}"
    
    def validate_linting(self, code: str) -> Tuple[bool, str]:
        """Layer 2: Static analysis with ruff, falling back to flake8."""
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding='utf-8') as f:
            f.write(code)
            temp_path = f.name
        
        try:
            try:
                result = subprocess.run(
                    ['ruff', 'check', temp_path, '--output-format', 'json'],
                    capture_output=True, text=True, timeout=10
                )
            except FileNotFoundError:
                # ruff is not installed: fall back to flake8
                result = subprocess.run(
                    ['flake8', temp_path, '--max-line-length', '100'],
                    capture_output=True, text=True, timeout=10
                )
                if result.returncode == 0:
                    return True, "✅ No linting issues (flake8)"
                return False, f"❌ Flake8 issues:\n{result.stdout[:500]}"
            
            if result.returncode == 0:
                return True, "✅ No linting issues"
            
            issues = json.loads(result.stdout)
            errors = [f"Line {i['location']['row']}: {i['code']} - {i['message']}" 
                     for i in issues[:5]]
            return False, "❌ Linting issues:\n" + "\n".join(errors)
        
        except FileNotFoundError:
            # Neither ruff nor flake8 is available
            return True, "⚠️ No linter installed (ruff/flake8)"
        except json.JSONDecodeError:
            return False, "❌ Could not parse linter output"
        except subprocess.TimeoutExpired:
            return False, "❌ Linting timeout"
        finally:
            Path(temp_path).unlink()
    
    def validate_types(self, code: str) -> Tuple[bool, str]:
        """Layer 2.5: Type checking with mypy."""
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding='utf-8') as f:
            f.write(code)
            temp_path = f.name
        
        try:
            result = subprocess.run(
                ['mypy', temp_path, '--no-error-summary'],
                capture_output=True, text=True, timeout=15
            )
            
            if result.returncode == 0:
                return True, "✅ Type checking passed"
            
            # Ignore import/stub errors, which depend on the local environment
            errors = [line for line in result.stdout.splitlines()
                     if 'error:' in line and 'import' not in line.lower()]
            
            if not errors:
                return True, "✅ Type checking passed (ignoring imports)"
            return False, "❌ Type errors:\n" + "\n".join(errors[:3])
        except FileNotFoundError:
            return True, "⚠️ mypy not installed"
        except subprocess.TimeoutExpired:
            return False, "❌ Type checking timeout"
        finally:
            Path(temp_path).unlink()
    
    def validate_security(self, code: str) -> Tuple[bool, str]:
        """Layer 3: Security vulnerability scanning with bandit."""
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding='utf-8') as f:
            f.write(code)
            temp_path = f.name
        
        try:
            result = subprocess.run(
                ['bandit', '-r', temp_path, '-f', 'json'],
                capture_output=True, text=True, timeout=10
            )
            
            report = json.loads(result.stdout)
            critical = [i for i in report.get('results', [])
                       if i['issue_severity'] in ['HIGH', 'MEDIUM']]
            
            if not critical:
                return True, "✅ No security vulnerabilities"
            
            errors = [f"Line {i['line_number']}: [{i['issue_severity']}] {i['issue_text']}"
                     for i in critical[:3]]
            return False, "❌ Security issues:\n" + "\n".join(errors)
        
        except FileNotFoundError:
            # bandit is not installed: fall back to coarse substring checks
            # (these can also match comments and string literals)
            dangerous = [
                ('eval(', 'Code execution vulnerability'),
                ('exec(', 'Code execution vulnerability'),
                ('pickle.loads', 'Insecure deserialization'),
                ('password = "', 'Hardcoded password'),
                ('api_key = "', 'Hardcoded API key'),
            ]
            
            found = [f"⚠️ {desc}: '{pat}'" for pat, desc in dangerous if pat in code]
            if found:
                return False, "❌ Security issues:\n" + "\n".join(found)
            return True, "✅ Basic security check passed"
        except json.JSONDecodeError:
            return False, "❌ Could not parse bandit output"
        except subprocess.TimeoutExpired:
            return False, "❌ Security scan timeout"
        finally:
            Path(temp_path).unlink()
    
    def validate_complexity(self, code: str) -> Tuple[bool, str]:
        """Layer 2: Cyclomatic complexity with radon (fails above CC = 10)."""
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding='utf-8') as f:
            f.write(code)
            temp_path = f.name
        
        try:
            result = subprocess.run(
                ['radon', 'cc', temp_path, '-j'],
                capture_output=True, text=True, timeout=10
            )
            
            data = json.loads(result.stdout)
            complex_funcs = []
            for file_data in data.values():
                if not isinstance(file_data, list):
                    continue  # radon may report an error object instead of a list of blocks
                for item in file_data:
                    if item['complexity'] > 10:
                        complex_funcs.append(
                            f"{item['name']}: CC={item['complexity']} (line {item['lineno']})"
                        )
            
            if not complex_funcs:
                return True, "✅ Complexity OK"
            return False, "⚠️ High complexity:\n" + "\n".join(complex_funcs)
        except FileNotFoundError:
            return True, "⚠️ radon not installed"
        except subprocess.TimeoutExpired:
            return False, "❌ Complexity check timeout"
        finally:
            Path(temp_path).unlink()
    
    def validate_all(self, code: str) -> Dict:
        """Execute full validation pipeline."""
        results = {}
        
        # Layer 1: Syntax (critical)
        syntax_pass, syntax_msg = self.validate_syntax(code)
        results['syntax'] = {'passed': syntax_pass, 'message': syntax_msg}
        
        if not syntax_pass:
            results['is_valid'] = False
            results['overall_score'] = 0
            return results
        
        # Layer 2: Linting
        lint_pass, lint_msg = self.validate_linting(code)
        results['linting'] = {'passed': lint_pass, 'message': lint_msg}
        
        # Layer 2.5: Types
        types_pass, types_msg = self.validate_types(code)
        results['types'] = {'passed': types_pass, 'message': types_msg}
        
        # Layer 3: Security (critical)
        security_pass, security_msg = self.validate_security(code)
        results['security'] = {'passed': security_pass, 'message': security_msg}
        
        # Layer 2: Complexity
        complexity_pass, complexity_msg = self.validate_complexity(code)
        results['complexity'] = {'passed': complexity_pass, 'message': complexity_msg}
        
        # Calculate score
        weights = {
            'syntax': 30, 'security': 30, 'linting': 15, 'types': 15, 'complexity': 10
        }
        
        score = sum(weights[k] for k, v in results.items()
                   if k in weights and v.get('passed', False))
        
        results['overall_score'] = score
        # Security is a hard gate: otherwise a security failure alone still scores exactly 70 and passes
        results['is_valid'] = security_pass and score >= 70  # 70% threshold
        
        return results

print("✅ CodeValidator class loaded")

### Usage Example: Validating Generated Code

In [None]:
validator = CodeValidator()

generated_code = '''
def process_sales(data):
    import pandas as pd
    df = pd.DataFrame(data)
    df['total'] = df['quantity'] * df['price']
    return df.to_dict('records')
'''

results = validator.validate_all(generated_code)

print(f"Overall Score: {results['overall_score']}/100")
print(f"Production Ready: {results['is_valid']}")
print()

for check, result in results.items():
    if isinstance(result, dict) and 'message' in result:
        status = "✅" if result.get('passed', False) else "❌"
        print(f"{status} {check.upper()}: {result['message']}")
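`CodeValidator` covers layers 1 to 3 of the diagram. Layer 5 (dry run with an execution time check) can be approximated by running the candidate in a separate interpreter with a timeout. The sketch below is a minimal version of that idea; `dry_run` is a hypothetical helper, and it is not a sandbox: the child process still has the caller's file and network access, so real isolation would need containers or OS-level limits.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Dict


def dry_run(code: str, timeout_s: float = 5.0) -> Dict:
    """Run code in a separate interpreter; report exit status, output, and timeout."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I: isolated mode (ignores PYTHON* env vars and user site-packages)
                [sys.executable, "-I", str(script)],
                capture_output=True, text=True, timeout=timeout_s, cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising
            return {"passed": False, "timed_out": True, "stdout": "", "stderr": ""}
    return {
        "passed": proc.returncode == 0,
        "timed_out": False,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


ok = dry_run("print('ok')")
slow = dry_run("while True:\n    pass\n", timeout_s=1)
print(ok["passed"], slow["timed_out"])
```

The returned dictionary uses the same `passed` key as the validator results, so it could be added to `validate_all()` as an extra check.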

### 🔄 Self-Correction Loop: Iterative Fixing with an LLM

If the code fails validation, the LLM can **correct itself** using the reported errors:

In [None]:
def self_correct_code(
    code: str,
    validation_results: Dict,
    max_attempts: int = 3
) -> Tuple[str, bool]:
    """
    Iteratively fix code based on validation errors.
    
    Strategy:
    1. Extract error messages from validation
    2. Send to LLM with context: original code + errors
    3. Re-validate corrected code
    4. Repeat until valid or max_attempts reached
    """
    
    for attempt in range(max_attempts):
        if validation_results['is_valid']:
            return code, True
        
        # Collect errors
        errors = []
        for check, result in validation_results.items():
            if isinstance(result, dict) and not result.get('passed', True):
                errors.append(f"{check}: {result['message']}")
        
        # Correction prompt
        correction_prompt = f"""
Fix the following Python code based on validation errors:

**Original Code:**
```python
{code}
```

**Validation Errors:**
{chr(10).join(errors)}

**Instructions:**
- Fix all syntax, linting, type, and security issues
- Maintain the original functionality
- Return only the corrected code (no explanations)

**Corrected Code:**
"""
        
        # Call LLM for correction
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": correction_prompt}],
            temperature=0.1  # Deterministic fixes
        )
        
        corrected_code = response.choices[0].message.content.strip()
        corrected_code = corrected_code.replace('```python', '').replace('```', '').strip()
        
        # Re-validate
        validation_results = validator.validate_all(corrected_code)
        code = corrected_code
        
        print(f"Attempt {attempt + 1}: Score = {validation_results['overall_score']}/100")
        
        if validation_results['is_valid']:
            return code, True
    
    return code, False

# Example: Self-correction workflow
initial_code = '''
def risky_function(user_input):
    result = eval(user_input)  # Security issue!
    return result
'''

validation = validator.validate_all(initial_code)
print(f"Initial Score: {validation['overall_score']}/100")

corrected_code, success = self_correct_code(initial_code, validation)
if success:
    print("\n✅ Code successfully corrected and validated!")
    print(corrected_code)
else:
    print("\n❌ Failed to correct code after 3 attempts")

### 🔐 Secrets Detection & Environment Variable Enforcement

Automatic detection of hardcoded secrets, with replacement by environment variables:

In [None]:
import re
import os

def detect_secrets(code: str) -> List[str]:
    """Detect hardcoded secrets and suggest env var usage."""
    
    # (regex, description, suggested environment variable name)
    patterns = [
        (r'password\s*=\s*["\'][^"\']{3,}["\']', 'Hardcoded password', 'DB_PASSWORD'),
        (r'api_key\s*=\s*["\'][^"\']{10,}["\']', 'Hardcoded API key', 'API_KEY'),
        (r'secret\s*=\s*["\'][^"\']{10,}["\']', 'Hardcoded secret', 'APP_SECRET'),
        (r'token\s*=\s*["\'][^"\']{10,}["\']', 'Hardcoded token', 'API_TOKEN'),
        (r'AKIA[0-9A-Z]{16}', 'AWS Access Key', 'AWS_ACCESS_KEY_ID'),
        (r'postgres(?:ql)?://[^\s:@]+:[^\s@]+@', 'Database URL with credentials', 'DATABASE_URL'),
    ]
    
    findings = []
    for pattern, description, env_var in patterns:
        if re.search(pattern, code, re.IGNORECASE):
            findings.append(f"❌ {description} detected")
            findings.append(f"   → Replace with: os.getenv('{env_var}')")
    
    return findings

def fix_secrets(code: str) -> str:
    """Automatically replace hardcoded secrets with os.getenv()."""
    
    # Group 1 captures the original left-hand side so its name and casing are preserved
    replacements = [
        (r'(password\s*=\s*)["\'][^"\']{3,}["\']', r'\g<1>os.getenv("DB_PASSWORD")'),
        (r'(api_key\s*=\s*)["\'][^"\']{10,}["\']', r'\g<1>os.getenv("API_KEY")'),
    ]
    
    fixed_code = code
    needs_import = False
    
    for pattern, replacement in replacements:
        fixed_code, count = re.subn(pattern, replacement, fixed_code, flags=re.IGNORECASE)
        needs_import = needs_import or count > 0
    
    # Add 'import os' if needed
    if needs_import and not re.search(r'^\s*import os\b', fixed_code, re.MULTILINE):
        fixed_code = 'import os\n\n' + fixed_code
    
    return fixed_code

# Example
bad_code = '''
def connect_db():
    password = "super_secret_123"
    api_key = "sk_live_abc123xyz"
    return create_connection(password, api_key)
'''

print("Secrets detected:")
secrets = detect_secrets(bad_code)
print("\n".join(secrets))

print("\n✅ Fixed code:")
fixed_code = fix_secrets(bad_code)
print(fixed_code)
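Replacing a literal with `os.getenv()` moves the secret out of the code, but `os.getenv()` returns `None` when the variable is missing, and the failure then surfaces later at connection time. A small fail-fast helper can enforce the variables at startup. This is a sketch under assumed names: `MissingSecretError`, `require_env`, and `load_secrets` are hypothetical, and the demo values are placeholders.

```python
import os
from typing import Dict, Iterable


class MissingSecretError(RuntimeError):
    """Raised when a required environment variable is absent or empty."""


def require_env(name: str) -> str:
    """Return the variable's value, or fail immediately if it is missing."""
    value = os.environ.get(name, "")
    if not value:
        raise MissingSecretError(f"Environment variable {name} is not set")
    return value


def load_secrets(names: Iterable[str]) -> Dict[str, str]:
    """Resolve several variables at once and report all missing ones together."""
    wanted = list(names)
    missing = [n for n in wanted if not os.environ.get(n)]
    if missing:
        raise MissingSecretError("Missing environment variables: " + ", ".join(missing))
    return {n: os.environ[n] for n in wanted}


# Demo with placeholder values (never commit real secrets)
os.environ["DB_PASSWORD"] = "placeholder"
os.environ.pop("API_KEY", None)

print(require_env("DB_PASSWORD"))
try:
    load_secrets(["DB_PASSWORD", "API_KEY"])
except MissingSecretError as exc:
    print(exc)
```

Calling `load_secrets()` once at pipeline start gives a single, complete error message instead of one failure per missing variable.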

## 🧪 Test Generation: Automated Quality Assurance

LLMs can generate complete test suites for ETL code, including:
- ✅ Unit tests (individual functions)
- ✅ Integration tests (interconnected components)
- ✅ Property-based tests (invariants with Hypothesis)
- ✅ Performance tests (benchmarks)
- ✅ Edge cases (boundary values, expected errors)

### Function: Generating Tests Automatically with an LLM

In [None]:
def generate_tests(code: str, test_types: Optional[List[str]] = None) -> str:
    """
    Generate comprehensive test suite for given code.
    
    Args:
        code: Python code to generate tests for
        test_types: Test types to request, e.g. ['unit', 'integration',
            'edge_cases', 'property', 'performance']
    
    Returns:
        Complete pytest test suite as string
    """
    
    if test_types is None:
        test_types = ['unit', 'integration', 'edge_cases']
    
    test_generation_prompt = f"""
Generate a comprehensive pytest test suite for the following Python code.

**Code to Test:**
```python
{code}
```

**Required Test Types:**
{', '.join(test_types)}

**Test Requirements:**
1. Use pytest framework with fixtures
2. Cover happy path, edge cases, and error cases
3. Include parametrized tests (@pytest.mark.parametrize)
4. Mock external dependencies (files, databases, APIs)
5. Use descriptive test names (test_function_behavior_scenario)
6. Include docstrings explaining what each test verifies
7. Aim for 80%+ code coverage
8. Use tmp_path fixture for file operations

**Output Format:**
- Complete, executable test file
- Import all necessary modules (pytest, unittest.mock, etc.)
- Use proper assertions (assert, pytest.raises, etc.)

Generate the test suite:
"""
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a QA expert specializing in Python testing."},
            {"role": "user", "content": test_generation_prompt}
        ],
        temperature=0.3  # Slightly higher for diverse test cases
    )
    
    test_code = response.choices[0].message.content.strip()
    test_code = test_code.replace('```python', '').replace('```', '').strip()
    
    return test_code

print("✅ generate_tests() function defined")

### Example: Generating Tests for an ETL Pipeline

In [None]:
etl_code = '''
import pandas as pd
from typing import Dict

def extract_sales(file_path: str) -> pd.DataFrame:
    """Extract sales data from CSV file."""
    df = pd.read_csv(file_path)
    if df.empty:
        raise ValueError("CSV file is empty")
    return df

def transform_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Transform sales data: filter, clean, aggregate."""
    df_clean = df[df['amount'] > 0].copy()
    df_clean['month'] = pd.to_datetime(df_clean['date']).dt.to_period('M')
    df_agg = df_clean.groupby(['date', 'month']).agg({
        'amount': 'sum',
        'transaction_id': 'count'
    }).reset_index()
    return df_agg

def load_sales(df: pd.DataFrame, output_path: str) -> None:
    """Load transformed data to Parquet."""
    if len(df) == 0:
        raise ValueError("No data to load")
    df.to_parquet(output_path, index=False, compression='snappy')

def run_etl_pipeline(input_path: str, output_path: str) -> Dict:
    """Full ETL pipeline orchestration."""
    df_raw = extract_sales(input_path)
    df_transformed = transform_sales(df_raw)
    load_sales(df_transformed, output_path)
    return {
        'rows_extracted': len(df_raw),
        'rows_loaded': len(df_transformed),
        'success': True
    }
'''

generated_tests = generate_tests(etl_code)
print("Generated Test Suite:")
print(generated_tests[:1000])  # Show first 1000 chars
print("\n[... truncated for display ...]")

### 📊 Advanced Testing Patterns

Advanced patterns for testing ETL pipelines:

In [None]:
# 1. Property-Based Testing with Hypothesis
from hypothesis import given, strategies as st
import pytest
import pandas as pd

@given(st.lists(st.integers(min_value=1, max_value=1000), min_size=1))
def test_transform_preserves_positive_amounts(amounts):
    """Property: positive amounts survive the transformation and their total is preserved."""
    df = pd.DataFrame({
        'amount': amounts,
        'date': ['2024-01-01'] * len(amounts),
        'transaction_id': list(range(len(amounts)))
    })
    result = transform_sales(df)
    assert len(result) == 1  # Single date, so exactly one aggregate row
    assert result['amount'].sum() == sum(amounts)

# 2. Parametrized Tests for Edge Cases
@pytest.mark.parametrize("test_input,expected_error", [
    (pd.DataFrame(), KeyError),  # Empty dataframe: the 'amount' column is missing
    (pd.DataFrame({'amount': [-1, -2], 'date': ['2024-01-01', '2024-01-02'], 
                   'transaction_id': [1, 2]}), None),  # All negative
])
def test_transform_edge_cases(test_input, expected_error):
    if expected_error:
        with pytest.raises(expected_error):
            transform_sales(test_input)
    else:
        result = transform_sales(test_input)
        assert isinstance(result, pd.DataFrame)

# 3. Performance Test
import time

def test_pipeline_performance(tmp_path):
    """Performance test: 100K rows should complete in <2s."""
    large_df = pd.DataFrame({
        'transaction_id': range(100_000),
        'date': ['2024-01-01'] * 100_000,
        'amount': [100] * 100_000
    })
    
    input_file = tmp_path / "large.csv"
    large_df.to_csv(input_file, index=False)
    
    # perf_counter is a monotonic clock intended for measuring elapsed time
    start = time.perf_counter()
    output_file = tmp_path / "output.parquet"
    run_etl_pipeline(str(input_file), str(output_file))
    duration = time.perf_counter() - start
    
    assert duration < 2, f"Too slow: {duration:.2f}s"

print("✅ Advanced test patterns defined")

### 📈 Test Coverage Analysis & Improvement

Automated coverage analysis, plus generation of additional tests for the lines that are still uncovered:

In [None]:
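import json
import subprocess
from pathlib import Path
from typing import Dict

def run_tests_with_coverage(test_file: str, code_file: str) -> Dict:
    """
    Execute tests and generate coverage report.
    
    Returns dict with coverage metrics and test results.
    Requires pytest and pytest-cov to be installed.
    """
    
    # --cov is documented to take a package name or directory rather than a
    # file name, so measure the module that code_file defines
    # (e.g. pipeline.py -> pipeline). Assumes the module is importable from cwd.
    module_name = Path(code_file).stem
    
    # Run pytest with coverage
    result = subprocess.run(
        [
            'pytest', test_file,
            f'--cov={module_name}',
            '--cov-report=json:coverage.json',
            '-v'
        ],
        capture_output=True,
        text=True
    )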
    
    # Parse coverage JSON
    try:
        with open('coverage.json') as f:
            cov_data = json.load(f)
        
        file_cov = cov_data['files'].get(code_file, {})
        summary = file_cov.get('summary', {})
        
        return {
            'coverage_percent': summary.get('percent_covered', 0),
            'lines_covered': summary.get('covered_lines', 0),
            'lines_total': summary.get('num_statements', 0),
            'missing_lines': file_cov.get('missing_lines', []),
            'test_output': result.stdout
        }
    except (FileNotFoundError, KeyError, json.JSONDecodeError) as e:
        return {
            'coverage_percent': 0,
            'error': str(e),
            'test_output': result.stdout
        }

def improve_test_coverage(code: str, existing_tests: str, coverage_report: Dict) -> str:
    """
    Generate additional tests to improve coverage.
    
    Focuses on uncovered lines from coverage report.
    """
    
    improvement_prompt = f"""
The following code has {coverage_report.get('coverage_percent', 0):.1f}% test coverage.

**Code:**
```python
{code}
```

**Existing Tests:**
```python
{existing_tests}
```

**Uncovered Lines:** {coverage_report.get('missing_lines', [])}

**Task:** Generate additional pytest tests to cover the missing lines.
Focus on:
- Error handling branches
- Edge cases not yet tested
- Conditional logic branches

Generate only the NEW tests (don't repeat existing ones):
"""
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": improvement_prompt}],
        temperature=0.2
    )
    
    return response.choices[0].message.content.strip()

print("✅ Coverage analysis functions defined")

## 🚀 Production Deployment: End-to-End Code Generation

End-to-end generation of production-ready ETL projects, including:
- ✅ ETL pipeline code
- ✅ Complete tests (unit + integration)
- ✅ Optimized Dockerfile
- ✅ CI/CD pipeline (GitHub Actions)
- ✅ Airflow DAG
- ✅ Documentation (README.md)
- ✅ Configuration (YAML)

### ETLProjectGenerator Class: Complete Project Generator

In [None]:
from dataclasses import dataclass
from pathlib import Path
import yaml

@dataclass
class PipelineSpec:
    """Specification for ETL pipeline project."""
    name: str
    description: str
    source: Dict
    target: Dict
    transformations: List[str]
    owner: str
    sla_minutes: int
    compliance: List[str]

class ETLProjectGenerator:
    """End-to-end project generator for production ETL pipelines."""
    
    def __init__(self, spec: PipelineSpec, output_dir: str = "./generated_pipeline"):
        self.spec = spec
        self.output_dir = Path(output_dir)
        self.validator = CodeValidator()
    
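    def _save_file(self, filename: str, content: str) -> None:
        """Save file to output directory."""
        file_path = self.output_dir / filename
        file_path.parent.mkdir(parents=True, exist_ok=True)
        # Explicit UTF-8 so non-ASCII content is written consistently across platforms
        file_path.write_text(content, encoding="utf-8")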
    
    def generate_full_project(self) -> str:
        """
        Generate complete ETL project structure.
        
        Returns:
            Path to generated project directory
        """
        print(f"🚀 Generating project: {self.spec.name}")
        
        # 1. Generate main pipeline code
        pipeline_code = self._generate_pipeline_code()
        self._save_file("pipeline.py", pipeline_code)
        print("✅ Generated pipeline.py")
        
        # 2. Generate configuration
        config = self._generate_config()
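        # safe_dump emits plain YAML types only; sort_keys=False keeps the section order
        self._save_file("config.yaml", yaml.safe_dump(config, sort_keys=False))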
        print("✅ Generated config.yaml")
        
        # 3. Generate tests
        tests = self._generate_tests(pipeline_code)
        self._save_file("test_pipeline.py", tests)
        print("✅ Generated test_pipeline.py")
        
        # 4. Generate Dockerfile
        dockerfile = self._generate_dockerfile()
        self._save_file("Dockerfile", dockerfile)
        print("✅ Generated Dockerfile")
        
        # 5. Generate CI/CD
        ci_yaml = self._generate_ci_cd()
        self._save_file(".github/workflows/ci.yml", ci_yaml)
        print("✅ Generated CI/CD pipeline")
        
        # 6. Generate documentation
        readme = self._generate_readme()
        self._save_file("README.md", readme)
        print("✅ Generated README.md")
        
        # 7. Generate Airflow DAG
        dag_code = self._generate_airflow_dag()
        self._save_file("airflow_dag.py", dag_code)
        print("✅ Generated Airflow DAG")
        
        # 8. Validate all
        validation_results = self._validate_generated_code()
        print(f"\n📊 Validation Score: {validation_results['overall_score']}/100")
        
        return str(self.output_dir)
    
    def _generate_pipeline_code(self) -> str:
        """Generate main ETL pipeline code using LLM."""
        
        prompt = f"""
Generate production-ready ETL pipeline with:

**Name:** {self.spec.name}
**Source:** {self.spec.source['type']} → **Target:** {self.spec.target['type']}
**Transformations:** {', '.join(self.spec.transformations)}
**SLA:** {self.spec.sla_minutes} min | **Compliance:** {', '.join(self.spec.compliance)}

Requirements:
- Python 3.11+ with type hints
- Retry logic (exponential backoff, 3 max)
- Structured logging (JSON)
- Prometheus metrics
- YAML config
- Idempotent

Generate complete code:
"""
        
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2
        )
        
        code = response.choices[0].message.content.strip()
        return code.replace('```python', '').replace('```', '').strip()
    
    def _generate_config(self) -> Dict:
        """Generate YAML configuration."""
        return {
            'pipeline': {
                'name': self.spec.name,
                'version': '1.0.0',
                'owner': self.spec.owner,
                'sla_minutes': self.spec.sla_minutes
            },
            'source': self.spec.source,
            'target': self.spec.target,
            'retry': {'max_attempts': 3, 'backoff_factor': 2},
            'logging': {'level': 'INFO', 'format': 'json'}
        }
    
    def _generate_tests(self, pipeline_code: str) -> str:
        """Generate test suite."""
        return generate_tests(pipeline_code)
    
    def _generate_dockerfile(self) -> str:
        """Generate optimized Dockerfile."""
        return '''FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
# Install into a virtualenv outside /root so the non-root user can read it
RUN python -m venv /opt/venv && /opt/venv/bin/pip install --no-cache-dir -r requirements.txt

FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /opt/venv /opt/venv
COPY pipeline.py config.yaml ./
ENV PATH=/opt/venv/bin:$PATH
RUN useradd -m -u 1000 pipeline && chown -R pipeline:pipeline /app
USER pipeline
CMD ["python", "pipeline.py"]
'''
    
    def _generate_ci_cd(self) -> str:
        """Generate GitHub Actions CI/CD."""
        return f'''name: CI/CD - {self.spec.name}

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt pytest pytest-cov bandit ruff
      - run: ruff check .
      - run: bandit -r pipeline.py
      - run: pytest test_pipeline.py --cov=pipeline
'''
    
    def _generate_readme(self) -> str:
        """Generate README documentation."""
        return f'''# {self.spec.name}

{self.spec.description}

## Architecture
- **Source:** {self.spec.source['type']}
- **Target:** {self.spec.target['type']}
- **SLA:** {self.spec.sla_minutes} minutes

## Quickstart
```bash
docker build -t {self.spec.name} .
docker run {self.spec.name}
```
'''
    
    def _generate_airflow_dag(self) -> str:
        """Generate Airflow DAG."""
        return f'''from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {{
    'owner': '{self.spec.owner}',
    'retries': 3,
    'retry_delay': timedelta(minutes=5)
}}

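with DAG(
    '{self.spec.name}',
    default_args=default_args,
    schedule='@daily',  # Airflow 2.4+; older versions use schedule_interval
    start_date=datetime(2024, 1, 1),
    catchup=False  # do not backfill every day since start_date
) as dag:
    
    def _run_etl():
        # Assumes the generated pipeline.py exposes run_pipeline(config_path)
        from pipeline import run_pipeline
        return run_pipeline('config.yaml')
    
    task = PythonOperator(
        task_id='run_etl',
        python_callable=_run_etl
    )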
'''
    
    def _validate_generated_code(self) -> Dict:
        """Validate all generated code."""
        pipeline_path = self.output_dir / "pipeline.py"
        if pipeline_path.exists():
            code = pipeline_path.read_text()
            return self.validator.validate_all(code)
        return {'overall_score': 0}

print("✅ ETLProjectGenerator class defined")
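
`_generate_pipeline_code` cleans the model reply with chained `str.replace` calls that delete the fence markers, which keeps any prose the model writes before or after the code. A more defensive option is to extract only the fenced content. The helper below is a minimal sketch of that idea; `extract_python_code` is a hypothetical name, and the fence string is built at runtime only so that this example contains no literal fence.

```python
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence

# Matches a fenced block with an optional language tag and captures (tag, body).
_BLOCK_RE = re.compile(
    re.escape(FENCE) + r"[ \t]*([A-Za-z0-9_+-]*)[ \t]*\r?\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_python_code(response_text: str) -> str:
    """Return the code from an LLM reply, dropping surrounding prose.

    Preference order: the first block tagged python/py, then the first fenced
    block of any kind, then the whole reply if no fence is present.
    """
    blocks = _BLOCK_RE.findall(response_text)
    for lang, body in blocks:
        if lang.lower() in ("python", "py"):
            return body.strip()
    if blocks:
        return blocks[0][1].strip()
    return response_text.strip()


reply = (
    "Here is the pipeline:\n"
    + FENCE + "python\n"
    + "def run():\n    return 42\n"
    + FENCE + "\n"
    + "Let me know if you need changes."
)
print(extract_python_code(reply))
```

The fallback to the raw reply preserves the current behaviour when the model returns bare code, so the helper could replace the `replace` chain without changing the happy path.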

### Example: Generating a Complete ETL Project

Generate an end-to-end project from a specification:

In [None]:
# Define project specification
spec = PipelineSpec(
    name="sales_analytics_etl",
    description="Daily ETL pipeline to process sales data from PostgreSQL to S3 Data Lake",
    source={
        'type': 'postgresql',
        'host': 'db.example.com',
        'database': 'sales_db',
        'table': 'transactions'
    },
    target={
        'type': 's3',
        'bucket': 'datalake-prod',
        'prefix': 'sales/processed/',
        'format': 'parquet'
    },
    transformations=[
        'Filter out test transactions',
        'Deduplicate by transaction_id',
        'Calculate revenue (quantity * price)',
        'Aggregate by product and date',
        'Add data quality metrics'
    ],
    owner='data-engineering-team',
    sla_minutes=30,
    compliance=['GDPR', 'SOC2', 'PII-masking']
)

# Generate complete project
generator = ETLProjectGenerator(spec, output_dir="./sales_analytics_pipeline")
project_path = generator.generate_full_project()

print(f"\n✅ Project generated successfully at: {project_path}")
print("\nGenerated files:")
for file in Path(project_path).rglob("*"):
    if file.is_file():
        print(f"  - {file.relative_to(project_path)}")
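
Listing the files shows what was written, but not what is missing. The sketch below is a hypothetical completeness check: `missing_project_files` and `EXPECTED_FILES` are illustrative names. The list mirrors the `_save_file` calls in `generate_full_project` and adds `requirements.txt`, because the generated Dockerfile and CI workflow both reference that file even though the generator as shown does not write it.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List

# Files written by generate_full_project, plus requirements.txt, which the
# generated Dockerfile and CI workflow expect to find.
EXPECTED_FILES = (
    "pipeline.py",
    "config.yaml",
    "test_pipeline.py",
    "Dockerfile",
    ".github/workflows/ci.yml",
    "README.md",
    "airflow_dag.py",
    "requirements.txt",
)


def missing_project_files(
    project_dir: str, expected: Iterable[str] = EXPECTED_FILES
) -> List[str]:
    """Return the expected files that are absent or empty in project_dir."""
    root = Path(project_dir)
    missing = []
    for rel in expected:
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


# Demo on a throwaway directory that contains only two of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pipeline.py").write_text("print('ok')\n", encoding="utf-8")
    (Path(tmp) / "README.md").write_text("# demo\n", encoding="utf-8")
    demo_missing = missing_project_files(tmp)

print(demo_missing)
```

Run against the generated project directory, a non-empty result is a signal to stop before building the Docker image.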

## 1. Generating a Basic ETL Function

In [None]:
import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

def generate_etl_code(description: str) -> str:
    """Ask the model for ETL code that satisfies a natural-language description."""
    prompt = f'''
You are an expert in Python and ETL pipelines. Generate complete, executable Python code.

Requirements:
{description}

Include:
- Required imports
- Error handling
- Basic logging
- Docstrings
- Type hints

Code:
'''
    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0.2
    )
    # Remove Markdown fence markers, then trim the whitespace they leave behind
    return resp.choices[0].message.content.replace('```python', '').replace('```', '').strip()

desc = '''
Function that:
1. Reads a sales CSV from S3 (boto3)
2. Keeps only rows with total > 0
3. Adds a "mes" column (YYYY-MM) derived from "fecha"
4. Writes Parquet partitioned by mes
'''

code = generate_etl_code(desc)
print(code)

## 2. Generating a Pandas Transformation

In [None]:
transformation_spec = '''
Dataset: bank transactions (CSV)
Columns: trans_id, fecha, monto, tipo (debito/credito), cuenta_id

Transformations:
1. Convert fecha to datetime
2. Create an "anio_mes" column (YYYY-MM format)
3. Compute the running balance per cuenta_id, ordered by fecha
4. Add an is_anomaly flag when monto is more than 3 standard deviations from that account's mean
5. Export to CSV with UTF-8 encoding
'''

transform_code = generate_etl_code(transformation_spec)
print(transform_code[:500] + '...')  # Preview
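
A small hand-written baseline makes it easier to judge whatever the model returns for this spec. The sketch below is one reading of the five steps, not the model's output: `transform_transactions` is a hypothetical name, the sign convention (credits add, debits subtract) is an assumption the spec does not state, and the anomaly rule is interpreted as an absolute deviation from the per-account mean.

```python
import pandas as pd


def transform_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Hand-written baseline for steps 1-4 of transformation_spec."""
    out = df.copy()
    # Steps 1-2: parse dates and derive the YYYY-MM label
    out["fecha"] = pd.to_datetime(out["fecha"])
    out["anio_mes"] = out["fecha"].dt.strftime("%Y-%m")
    # Step 3: running balance per account, in date order
    out = out.sort_values(["cuenta_id", "fecha", "trans_id"]).reset_index(drop=True)
    # Assumption: "credito" adds to the balance and "debito" subtracts from it
    signed = out["monto"].where(out["tipo"] == "credito", -out["monto"])
    out["saldo_acumulado"] = signed.groupby(out["cuenta_id"]).cumsum()
    # Step 4: flag amounts more than 3 standard deviations from the account mean
    # (accounts with a single row have an undefined std and are never flagged)
    per_account = out.groupby("cuenta_id")["monto"]
    deviation = (out["monto"] - per_account.transform("mean")).abs()
    out["is_anomaly"] = deviation > 3 * per_account.transform("std")
    return out


sample = pd.DataFrame({
    "trans_id": [3, 1, 2, 4],
    "fecha": ["2024-02-01", "2024-01-05", "2024-01-10", "2024-01-07"],
    "monto": [50, 100, 30, 20],
    "tipo": ["credito", "credito", "debito", "credito"],
    "cuenta_id": ["A", "A", "A", "B"],
})
result = transform_transactions(sample)
print(result[["cuenta_id", "anio_mes", "saldo_acumulado", "is_anomaly"]])
# Step 5 would be: result.to_csv("transacciones_out.csv", index=False, encoding="utf-8")
```

One consequence of the 3-sigma rule is worth checking in generated code: with the sample standard deviation, an account needs at least 11 transactions before any single amount can exceed the threshold, so small accounts will never be flagged.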

## 3. Validating Generated Code

In [None]:
import ast
import os
import subprocess
import tempfile

def validate_python_syntax(code: str) -> tuple[bool, str]:
    """Validate Python syntax without executing the code."""
    try:
        ast.parse(code)
        return True, 'Valid syntax'
    except SyntaxError as e:
        return False, f'Syntax error: {e}'

def lint_code(code: str, tool: str = 'flake8') -> str:
    """Run a linter on the code (the tool must be installed)."""
    # Write to a unique temp file instead of a fixed /tmp path
    with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False, encoding='utf-8') as f:
        f.write(code)
        tmp_path = f.name
    try:
        result = subprocess.run([tool, tmp_path], capture_output=True, text=True)
        # flake8 prints findings to stdout and exits non-zero; stderr holds tool errors
        return result.stdout or result.stderr
    except FileNotFoundError:
        return f'Linting error: {tool} is not installed'
    finally:
        os.remove(tmp_path)

valid, msg = validate_python_syntax(code)
print(f'{"✅" if valid else "❌"} {msg}')
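
Syntax and lint checks do not cover the security part of validation. The sketch below shows one lightweight approach: walking the AST and flagging calls that are commonly considered risky, without executing the code. `find_risky_calls` and the `RISKY_CALLS` set are illustrative assumptions, and the list is deliberately short; a dedicated scanner such as bandit, which the generated CI workflow already runs, covers far more cases.

```python
import ast
from typing import List

# Call names treated as risky in this sketch (an illustrative list, not exhaustive).
RISKY_CALLS = {"eval", "exec", "os.system", "subprocess.call", "pickle.loads"}


def _call_name(node: ast.Call) -> str:
    """Return a dotted name such as 'os.system' for a call node, or '' if unknown."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
        return f"{func.value.id}.{func.attr}"
    return ""


def find_risky_calls(code: str) -> List[str]:
    """List 'name (line N)' entries for risky calls found in code, without running it."""
    findings = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            name = _call_name(node)
            if name in RISKY_CALLS:
                findings.append(f"{name} (line {node.lineno})")
    return sorted(findings)


sample = "import os\nos.system('rm -rf /tmp/x')\nvalue = eval('1 + 1')\n"
print(find_risky_calls(sample))
```

The check matches names literally, so aliased imports (for example `from os import system`) would slip through; it is meant as a fast first gate before the heavier tools run.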

## 4. Generating an Airflow DAG

In [None]:
dag_spec = '''
Airflow DAG for:
- Name: ventas_daily_etl
- Schedule: daily at 2 AM
- Tasks:
  1. extract_s3: download ventas.csv from S3
  2. validate: Great Expectations (non-null columns, total > 0)
  3. transform: add mes, compute metrics
  4. load_db: insert into PostgreSQL (table ventas_daily)
  5. send_alert: email if any step fails
- Retries: 2 with a 5 min delay
'''

dag_code = generate_etl_code(dag_spec)
print(dag_code[:600] + '...')
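
Before loading a generated DAG into Airflow, it can help to confirm that every task named in the spec actually appears in the code. The sketch below does this statically with `ast`, so Airflow does not need to be installed; `find_task_ids` is a hypothetical helper, and the sample source is a hand-written stand-in for model output.

```python
import ast
from typing import Set


def find_task_ids(dag_source: str) -> Set[str]:
    """Collect string literals passed as task_id=... anywhere in the source."""
    task_ids = set()
    for node in ast.walk(ast.parse(dag_source)):
        if not isinstance(node, ast.Call):
            continue
        for kw in node.keywords:
            is_literal = isinstance(kw.value, ast.Constant) and isinstance(kw.value.value, str)
            if kw.arg == "task_id" and is_literal:
                task_ids.add(kw.value.value)
    return task_ids


# Task names taken from dag_spec
EXPECTED_TASKS = {"extract_s3", "validate", "transform", "load_db", "send_alert"}

# Hand-written stand-in for generated code; it is parsed, never executed.
sample_dag = '''
extract = PythonOperator(task_id="extract_s3", python_callable=extract_fn)
validate = PythonOperator(task_id="validate", python_callable=validate_fn)
load = PythonOperator(task_id="load_db", python_callable=load_fn)
'''

missing_tasks = EXPECTED_TASKS - find_task_ids(sample_dag)
print(sorted(missing_tasks))
```

Tasks declared with the TaskFlow `@task` decorator and no explicit `task_id` would not be detected, so a missing entry is a prompt to look at the code rather than proof of an error.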

## 5. Iterating and Improving with Feedback

In [None]:
def refine_code(original_code: str, feedback: str) -> str:
    """Ask the model to revise code according to reviewer feedback."""
    prompt = f'''
Improve this Python code according to the feedback:

Original code:
{original_code}

Feedback:
{feedback}

Improved code:
'''
    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0.1
    )
    # Remove Markdown fence markers, then trim the whitespace they leave behind
    return resp.choices[0].message.content.replace('```python', '').replace('```', '').strip()

feedback_example = '''
- Add retry with exponential backoff to the S3 download
- Use a context manager for the DB connection
- Log the number of rows processed
'''

improved = refine_code(code, feedback_example)
print(improved[:400] + '...')
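
The same refinement call can be driven by validation errors instead of human feedback, which is the self-correction step of the pipeline (up to 3 iterations). The sketch below is a minimal version under stated assumptions: `self_correct` and `check_syntax` are hypothetical names, only syntax is checked, and `refine` is any callable with the `refine_code(code, feedback)` signature. A trivial stand-in is used here so the example runs without an API key.

```python
import ast
from typing import Callable, Tuple


def check_syntax(code: str) -> Tuple[bool, str]:
    """Return (is_valid, error_message) without executing the code."""
    try:
        ast.parse(code)
        return True, ""
    except SyntaxError as exc:
        return False, f"SyntaxError at line {exc.lineno}: {exc.msg}"


def self_correct(
    code: str, refine: Callable[[str, str], str], max_attempts: int = 3
) -> Tuple[str, bool, int]:
    """Validate, and on failure ask `refine` for a fix, up to max_attempts times.

    Returns (code, is_valid, refinements_used).
    """
    for attempt in range(max_attempts + 1):
        valid, error = check_syntax(code)
        if valid:
            return code, True, attempt
        if attempt == max_attempts:
            break
        code = refine(code, f"Fix this error without changing behaviour: {error}")
    return code, False, max_attempts


# Stand-in for refine_code: repairs one known typo instead of calling an LLM.
def fake_refine(code: str, feedback: str) -> str:
    return code.replace("def broken(:", "def broken():")


fixed, is_valid, used = self_correct("def broken(:\n    return 1\n", fake_refine)
print(is_valid, used)
```

Capping the number of attempts bounds both cost and latency; code that is still invalid after the last attempt is returned with `is_valid=False` so the caller can route it to human review.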

## 6. Generating Unit Tests

In [None]:
def generate_tests(code: str) -> str:
    """Ask the model for a pytest suite covering the given code."""
    prompt = f'''
Generate unit tests with pytest for this code:

{code}

Include:
- A normal-case test (happy path)
- A test with empty data
- Tests with bad input (null values, wrong types)
- Mocks for external I/O (S3, DB)

Tests:
'''
    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0.1
    )
    # Remove Markdown fence markers, then trim the whitespace they leave behind
    return resp.choices[0].message.content.replace('```python', '').replace('```', '').strip()

test_code = generate_tests(code)
print(test_code[:500] + '...')

## 7. Best Practices

- **Always review**: never run generated code without human inspection.
- **Automated validation**: syntax, linting, tests.
- **Versioning**: store generated code in Git with a descriptive commit message.
- **Templates**: use templates for a consistent structure.
- **Iteration**: refine with human feedback and regeneration.
- **Documentation**: generate the README and comments together with the code.
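
For the versioning practice, it may help to record where each generated file came from before committing it. The sketch below prepends a short provenance header; `add_provenance_header` and the header fields are hypothetical choices, not an established convention. The prompt is stored as a truncated SHA-256 hash so the file can be traced back to its prompt without embedding the full text.

```python
import hashlib
from datetime import datetime, timezone


def add_provenance_header(code: str, prompt: str, model: str) -> str:
    """Prefix generated code with comments describing how it was produced."""
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
    generated_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header = (
        "# AUTO-GENERATED: review before running.\n"
        f"# model: {model}\n"
        f"# prompt_sha256: {prompt_hash}\n"
        f"# generated_at: {generated_at}\n"
    )
    return header + "\n" + code


stamped = add_provenance_header("print('hello')\n", "Read a CSV and print it", "gpt-4")
print(stamped)
```

Because the header consists only of comments, the stamped file still parses and runs exactly like the original code.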

## 8. Exercises

1. Generate a script that migrates data from MongoDB to PostgreSQL.
2. Create a Spark (PySpark) pipeline that reads Parquet and writes Delta Lake.
3. Automate the generation of a data quality report with Great Expectations.
4. Build an LLM-generated CLI (Click/Typer) to run ETLs.