# 🔧 Automatic ETL Code Generation with LLMs

Objective: automate the creation of ETL pipelines, transformations, and data scripts using generative AI, with validation and best practices.

- Duration: 90-120 min
- Difficulty: Medium/High
- Prerequisites: GenAI 01-02, experience with ETL

### 🏗️ **Code Generation Architecture: From Prompt to Production Pipeline**

**Evolution of ETL Code Generation:**

```
2015-2019: Template-Based Generation
  ├─ Jinja2 templates with parameters
  ├─ Cookiecutter projects
  └─ Limited flexibility, manual configuration

2020-2022: GPT-3 Era (Codex)
  ├─ GitHub Copilot (autocomplete)
  ├─ Function-level generation
  └─ Still requires significant editing

2023-2024: GPT-4 + Specialized Models
  ├─ Full pipeline generation
  ├─ Multi-file projects
  ├─ Self-correction capabilities
  └─ Context-aware refactoring

2024+: LLM + MCP (Model Context Protocol)
  ├─ RAG with codebase context
  ├─ Real-time validation
  ├─ Automated testing generation
  └─ Production-ready code
```

**Code Generation System Architecture:**

```
┌─────────────────────────────────────────────────────────────┐
│  INPUT: Natural Language Requirements                       │
│  "Create ETL pipeline: S3 CSV → validate → Snowflake"      │
└─────────────────────┬───────────────────────────────────────┘
                      │
        ┌─────────────┴─────────────┐
        │   LAYER 1: SPECIFICATION   │
        │   ├─ Parse requirements    │
        │   ├─ Identify tech stack   │
        │   └─ Extract constraints    │
        └─────────────┬───────────────┘
                      │
        ┌─────────────┴─────────────┐
        │   LAYER 2: CONTEXT RETRIEVAL (RAG) │
        │   ├─ Search codebase (embeddings)  │
        │   ├─ Find similar patterns         │
        │   ├─ Retrieve docs/examples        │
        │   └─ Load company standards        │
        └─────────────┬────────────────────────┘
                      │
        ┌─────────────┴─────────────┐
        │   LAYER 3: CODE GENERATION │
        │   ├─ Main pipeline (LLM)   │
        │   ├─ Config files (YAML)   │
        │   ├─ Tests (pytest)        │
        │   └─ Docs (README)         │
        └─────────────┬───────────────┘
                      │
        ┌─────────────┴─────────────┐
        │   LAYER 4: VALIDATION      │
        │   ├─ Syntax check (AST)    │
        │   ├─ Linting (ruff/flake8) │
        │   ├─ Type check (mypy)     │
        │   ├─ Security (bandit)     │
        │   └─ Complexity (radon)    │
        └─────────────┬───────────────┘
                      │
        ┌─────────────┴─────────────┐
        │   LAYER 5: SELF-CORRECTION │
        │   ├─ Execute validation    │
        │   ├─ Parse error messages  │
        │   ├─ Regenerate fixes      │
        │   └─ Iterate (max 3 times) │
        └─────────────┬───────────────┘
                      │
        ┌─────────────┴─────────────┐
        │   LAYER 6: TESTING         │
        │   ├─ Generate unit tests   │
        │   ├─ Generate integration  │
        │   ├─ Execute test suite    │
        │   └─ Coverage report       │
        └─────────────┬───────────────┘
                      │
┌─────────────────────┴───────────────────────────────────────┐
│  OUTPUT: Production-Ready ETL Pipeline                      │
│  ├─ pipeline.py (main code)                                │
│  ├─ config.yaml (configuration)                            │
│  ├─ test_pipeline.py (tests with 80%+ coverage)           │
│  ├─ requirements.txt (dependencies)                        │
│  ├─ Dockerfile (containerization)                          │
│  ├─ README.md (documentation)                              │
│  └─ .github/workflows/ci.yml (CI/CD)                      │
└─────────────────────────────────────────────────────────────┘
```
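
One way to read the layered flow above is as a chain of stages that each receive and return a shared state object. The sketch below works under that assumption: `PipelineState`, `run_stages`, and the three stage functions are hypothetical names for illustration, only three of the six layers are shown, and the generation stage is a stub standing in for a real LLM call.

```python
import ast
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PipelineState:
    """Shared state passed from layer to layer (hypothetical structure)."""
    requirements: str
    code: str = ""
    errors: List[str] = field(default_factory=list)
    log: List[str] = field(default_factory=list)


Stage = Callable[[PipelineState], PipelineState]


def parse_spec(state: PipelineState) -> PipelineState:
    # Layer 1: a real system would extract tech stack and constraints here
    state.log.append("spec")
    return state


def generate_code(state: PipelineState) -> PipelineState:
    # Layer 3: stub standing in for the LLM call
    state.code = "def run():\n    return 'ok'\n"
    state.log.append("generate")
    return state


def validate_code(state: PipelineState) -> PipelineState:
    # Layer 4: only the syntax check is sketched; linters would be added here
    try:
        ast.parse(state.code)
    except SyntaxError as exc:
        state.errors.append(f"line {exc.lineno}: {exc.msg}")
    state.log.append("validate")
    return state


def run_stages(state: PipelineState, stages: List[Stage]) -> PipelineState:
    """Run each layer in order, stopping at the first layer that reports errors."""
    for stage in stages:
        state = stage(state)
        if state.errors:
            break
    return state


result = run_stages(
    PipelineState(requirements="S3 CSV -> validate -> Snowflake"),
    [parse_spec, generate_code, validate_code],
)
print(result.log, result.errors)
```

Keeping each layer behind the same function signature makes it straightforward to insert the retrieval, self-correction, and testing layers later without changing the runner.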

**Model Selection for Code Generation:**

| Model | Use Case | Strengths | Limitations | Cost/1M tokens |
|-------|----------|-----------|-------------|----------------|
| **GPT-4o** | Complex ETL pipelines, multi-step logic | Highest accuracy (95%), excellent reasoning, handles edge cases | Slower (4-8s), expensive | $2.50 in / $10 out |
| **Claude 3.5 Sonnet** | Data transformations, business logic | Strong code quality, good at SQL/Pandas, excellent refactoring, 200K-token context | Occasional verbosity | $3 in / $15 out |
| **GPT-3.5-turbo** | Simple scripts, boilerplate code | Fast (1-2s), cheap, good for templates | Less sophisticated logic, more errors | $0.50 in / $1.50 out |
| **Codestral (Mistral)** | Open-weight alternative, on-prem deployment | Open weights, fast inference, privacy | Lower accuracy (75%), needs fine-tuning | Self-hosted |
| **Code Llama 70B** | Self-hosted code generation | Free, customizable, no API limits | Requires GPU (A100), 70% accuracy | $0 (hardware only) |
| **Gemini 1.5 Pro** | Large context pipelines (2M tokens) | Massive context window, multimodal, cost-effective | Newer model, less proven | $1.25 in / $5 out |
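
The per-token prices in the table translate into a per-request cost with simple arithmetic. The sketch below copies the prices from the table as they appear here; the dictionary keys are illustrative labels rather than official model identifiers, and provider pricing changes over time, so the numbers should be re-checked before relying on them.

```python
from typing import Dict, Tuple

# (input, output) USD per 1M tokens, copied from the table above;
# actual provider pricing changes over time and should be re-checked.
PRICES_PER_MILLION: Dict[str, Tuple[float, float]] = {
    "gpt-4o": (2.50, 10.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "gpt-3.5-turbo": (0.50, 1.50),
    "gemini-1.5-pro": (1.25, 5.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request."""
    price_in, price_out = PRICES_PER_MILLION[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Example: a 1,500-token prompt that yields 4,000 tokens of generated code
for name in PRICES_PER_MILLION:
    print(f"{name}: ${estimate_cost(name, 1_500, 4_000):.4f}")
```

Because output tokens cost several times more than input tokens in every row, the length of the generated code usually dominates the bill for pipeline generation.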

**Prompt Engineering for Code Generation:**

```python
from typing import Dict, List, Optional


def build_code_generation_prompt(
    requirements: str,
    tech_stack: List[str],
    context: Optional[str] = None,
    constraints: Optional[Dict] = None
) -> str:
    """
    Build a prompt optimized for ETL code generation.
    
    Best Practices:
    - Specify the tech stack explicitly
    - Include examples from the codebase (few-shot)
    - Define quality standards
    - Ask for explanations (reasoning)
    """
    
    prompt = f"""You are an expert Data Engineer specializing in production ETL pipelines.

**Requirements:**
{requirements}

**Tech Stack:**
{', '.join(tech_stack)}

**Coding Standards:**
- Python 3.11+ with type hints (PEP 484)
- Error handling with try-except and logging
- Docstrings (Google style)
- Modular functions (max 50 lines)
- Configuration externalized (YAML/env vars)
- Idempotent operations (safe to re-run)

**Quality Checklist:**
✅ No hardcoded credentials
✅ Parameterized queries (prevent SQL injection)
✅ Retry logic with exponential backoff
✅ Metrics instrumentation (Prometheus)
✅ Comprehensive logging (structured JSON)
✅ Unit tests (pytest, 80%+ coverage)

"""
    
    if context:
        prompt += f"\n**Codebase Context (existing patterns):**\n{context}\n"
    
    if constraints:
        prompt += "\n**Constraints:**\n"
        for key, value in constraints.items():
            prompt += f"- {key}: {value}\n"
    
    prompt += """
**Output Format:**
1. Main pipeline code (complete and executable)
2. Configuration file (YAML)
3. Unit tests (pytest)
4. Brief explanation of design decisions

Generate the code:
"""
    return prompt
```

**Example: Full Pipeline Generation**

```python
from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

requirements = """
Create an ETL pipeline that:
1. Extracts daily sales data from PostgreSQL (table: raw_sales)
2. Transforms:
   - Deduplicate by (transaction_id, timestamp)
   - Filter out refunds (amount < 0)
   - Enrich with customer tier from Redis cache
   - Aggregate metrics: total_sales, avg_order_value by (date, customer_tier)
3. Loads to Snowflake (table: analytics.daily_sales_summary)
4. Send Slack notification with summary statistics
"""

tech_stack = [
    "Python 3.11",
    "pandas",
    "SQLAlchemy (PostgreSQL)",
    "redis-py",
    "snowflake-connector-python",
    "slack-sdk",
    "pydantic (config validation)"
]

prompt = build_code_generation_prompt(
    requirements=requirements,
    tech_stack=tech_stack,
    constraints={
        "max_memory": "2GB",
        "timeout": "15 minutes",
        "batch_size": "10,000 rows"
    }
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,  # Low temperature for more repeatable output (not strictly deterministic)
    max_tokens=4000
)

generated_code = response.choices[0].message.content
print(generated_code)
```
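
Because the prompt asks for pipeline code, a YAML configuration, and tests in one reply, the response usually arrives as Markdown with several fenced blocks. A minimal sketch for splitting such a reply by language tag follows; `extract_code_blocks` is a hypothetical helper, the reply is simulated rather than taken from a real model, and the pattern assumes well-formed, non-nested fences.

```python
import re
from typing import Dict, List

# Matches a fenced block: three backticks, optional language tag, body, three backticks
FENCED_BLOCK = re.compile(r"`{3}([\w+-]*)[ \t]*\n(.*?)`{3}", re.DOTALL)


def extract_code_blocks(markdown: str) -> Dict[str, List[str]]:
    """Group fenced code blocks by language tag ('' when untagged)."""
    blocks: Dict[str, List[str]] = {}
    for lang, body in FENCED_BLOCK.findall(markdown):
        blocks.setdefault(lang.lower(), []).append(body.strip())
    return blocks


# Simulated model reply; the fence is built at runtime to keep this example readable
FENCE = "`" * 3
reply = (
    "Here is the pipeline:\n"
    f"{FENCE}python\nprint('extract')\n{FENCE}\n"
    "And its configuration:\n"
    f"{FENCE}yaml\nbatch_size: 10000\n{FENCE}\n"
)

blocks = extract_code_blocks(reply)
print(blocks["python"][0])
print(blocks["yaml"][0])
```

Each extracted block can then be written to its own file (`pipeline.py`, `config.yaml`, `test_pipeline.py`) before validation.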

**RAG-Enhanced Code Generation (Context from Codebase):**

```python
from pathlib import Path
from typing import List

import chromadb
from sentence_transformers import SentenceTransformer

# Initialize embeddings
embedder = SentenceTransformer('all-MiniLM-L6-v2')
chroma_client = chromadb.Client()
code_collection = chroma_client.create_collection("codebase")

# Index existing codebase
def index_codebase(codebase_path: str):
    """Embeds all Python files for similarity search."""
    for file_path in Path(codebase_path).rglob("*.py"):
        with open(file_path) as f:
            code = f.read()
        
        embedding = embedder.encode(code)
        code_collection.add(
            ids=[str(file_path)],
            embeddings=[embedding.tolist()],
            documents=[code],
            metadatas=[{"path": str(file_path), "type": "python"}]
        )

# Retrieve similar code
def get_similar_code(query: str, n_results: int = 3) -> List[str]:
    """Finds similar code patterns from existing codebase."""
    query_embedding = embedder.encode(query)
    
    results = code_collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=n_results
    )
    
    return results['documents'][0]  # Top-k similar code snippets

# Enhanced generation with context
requirements = "Create Spark ETL for parquet → delta lake"
similar_code = get_similar_code(requirements, n_results=2)

prompt = f"""
{requirements}

**Reference implementations from codebase:**

Example 1:
{similar_code[0]}

Example 2:
{similar_code[1]}

Follow the same patterns and coding style. Generate code:
"""

# Generate with context (accuracy gain of ~20% is a rough estimate; measure on your own codebase)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2
)
```
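
Embedding whole files tends to work poorly for retrieval, since sentence-embedding models generally truncate long inputs and a single vector blurs unrelated functions together. One common alternative is to index one chunk per top-level function or class. The sketch below illustrates that idea with the standard `ast` module only; `chunk_python_source` and the chunk dictionary layout are assumptions for illustration, not part of the indexing code above.

```python
import ast
from typing import Dict, List


def chunk_python_source(source: str, path: str = "<memory>") -> List[Dict[str, str]]:
    """Split a module into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Recover the exact source text of this definition
            segment = ast.get_source_segment(source, node)
            if segment:
                chunks.append({
                    "id": f"{path}::{node.name}",
                    "name": node.name,
                    "code": segment,
                })
    return chunks


sample = '''
import pandas as pd

def extract(path):
    return pd.read_csv(path)

def transform(df):
    return df.drop_duplicates()
'''

for chunk in chunk_python_source(sample, path="etl/sales.py"):
    print(chunk["id"])
```

Each chunk's `id` and `code` could then be passed to the collection in place of the full file, so a query about deduplication retrieves only the relevant function.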

**Token Optimization for Cost Reduction:**

```python
import tiktoken

# gpt-4o input price from the model table above: $2.50 per 1M tokens
INPUT_PRICE_PER_TOKEN = 2.50 / 1_000_000
# Requires a tiktoken release that knows gpt-4o (o200k_base encoding)
ENCODING = tiktoken.encoding_for_model("gpt-4o")

def optimize_prompt(prompt: str, max_tokens: int = 2000) -> str:
    """
    Reduce prompt tokens while preserving essential information.

    Strategy:
    1. Keep the preamble plus the Requirements and Tech Stack sections
    2. Drop every other section (examples, checklists, output format)
    3. Hard-truncate to max_tokens if the result is still too long
    """
    if len(ENCODING.encode(prompt)) <= max_tokens:
        return prompt

    # Section headers are the bold lines emitted by the prompt builder
    essential_headers = ("**Requirements", "**Tech Stack")
    kept_lines, keep = [], True  # Lines before the first header are kept
    for line in prompt.split('\n'):
        if line.startswith("**"):
            keep = line.startswith(essential_headers)
        if keep:
            kept_lines.append(line)

    tokens = ENCODING.encode('\n'.join(kept_lines))
    return ENCODING.decode(tokens[:max_tokens])

# Cost comparison (input tokens only)
original_tokens = len(ENCODING.encode(prompt))
optimized_prompt = optimize_prompt(prompt, max_tokens=1500)
optimized_tokens = len(ENCODING.encode(optimized_prompt))

print(f"Original: {original_tokens} tokens → ${original_tokens * INPUT_PRICE_PER_TOKEN:.4f}")
print(f"Optimized: {optimized_tokens} tokens → ${optimized_tokens * INPUT_PRICE_PER_TOKEN:.4f}")
print(f"Savings: {((original_tokens - optimized_tokens) / original_tokens * 100):.1f}%")
```

---
**Author:** Luis J. Raigoso V. (LJRV)

### 🛡️ **Code Validation & Security: Multi-Layer Quality Assurance**

**Validation Pipeline Architecture:**

```
Generated Code
     │
     ▼
┌─────────────────────────────────────┐
│  LAYER 1: SYNTAX VALIDATION         │
│  ├─ AST Parsing (ast.parse)         │
│  ├─ Compile Check (compile())       │
│  └─ Python Version Compatibility    │
│  → Result: Syntactically valid code │
└────────────┬────────────────────────┘
             │ PASS
             ▼
┌─────────────────────────────────────┐
│  LAYER 2: STATIC ANALYSIS           │
│  ├─ Linting (ruff/flake8/pylint)    │
│  ├─ Type Checking (mypy)            │
│  ├─ Complexity (radon: CC < 10)     │
│  ├─ Code Smells (pylint)            │
│  └─ Formatting (black/autopep8)     │
│  → Result: Clean, readable code     │
└────────────┬────────────────────────┘
             │ PASS
             ▼
┌─────────────────────────────────────┐
│  LAYER 3: SECURITY SCAN             │
│  ├─ Bandit (vulnerability detection)│
│  ├─ Safety (dependency vulnerabilities)│
│  ├─ Secrets Detection (detect-secrets)│
│  ├─ SQL Injection Check (sqlparse)  │
│  └─ Path Traversal Check            │
│  → Result: Secure code              │
└────────────┬────────────────────────┘
             │ PASS
             ▼
┌─────────────────────────────────────┐
│  LAYER 4: FUNCTIONAL TESTING        │
│  ├─ Unit Tests (pytest)             │
│  ├─ Integration Tests (testcontainers)│
│  ├─ Property-Based (hypothesis)     │
│  ├─ Coverage (pytest-cov > 80%)     │
│  └─ Performance (locust benchmarks) │
│  → Result: Functionally correct     │
└────────────┬────────────────────────┘
             │ PASS
             ▼
┌─────────────────────────────────────┐
│  LAYER 5: RUNTIME VALIDATION        │
│  ├─ Dry-Run with Sample Data        │
│  ├─ Memory Profiling (memory_profiler)│
│  ├─ Execution Time Check            │
│  └─ Resource Limits (cgroups)       │
│  → Result: Production-ready         │
└─────────────────────────────────────┘
```

**Implementation: Comprehensive Validation Suite**

```python
import ast
import subprocess
from pathlib import Path
from typing import Dict, List, Tuple
import tempfile
import json

class CodeValidator:
    """
    Multi-layer validation system for LLM-generated code.
    
    Usage:
        validator = CodeValidator()
        results = validator.validate_all(generated_code)
        if results['is_valid']:
            print("✅ Code is production-ready")
        else:
            print(f"❌ Validation failed: {results['errors']}")
    """
    
    def __init__(self):
        self.validation_results = {}
    
    def validate_syntax(self, code: str) -> Tuple[bool, str]:
        """
        Layer 1: AST parsing and compilation.
        
        Catches:
        - SyntaxError (invalid Python)
        - IndentationError
        - TabError
        """
        try:
            # Parse to AST
            tree = ast.parse(code)
            
            # Compile (more thorough than parse)
            compile(code, '<generated>', 'exec')
            
            # Check for common anti-patterns
            issues = []
            for node in ast.walk(tree):
                # Detect 'exec' usage (security risk)
                if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
                    if hasattr(node.value.func, 'id') and node.value.func.id == 'exec':
                        issues.append("⚠️ Usage of 'exec()' detected (security risk)")
                
                # Detect bare 'except' (anti-pattern)
                if isinstance(node, ast.ExceptHandler) and node.type is None:
                    issues.append("⚠️ Bare 'except:' clause detected (catch specific exceptions)")
            
            if issues:
                return True, "Syntax valid but has warnings:\n" + "\n".join(issues)
            
            return True, "✅ Syntax valid"
        
        except SyntaxError as e:
            return False, f"❌ Syntax Error at line {e.lineno}: {e.msg}"
        except Exception as e:
            return False, f"❌ Compilation Error: {str(e)}"
    
    def validate_linting(self, code: str) -> Tuple[bool, str]:
        """
        Layer 2: Static analysis with ruff (faster than flake8).
        
        Checks:
        - PEP 8 compliance
        - Unused imports/variables
        - Undefined names
        - Line length violations
        """
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
            f.write(code)
            temp_path = f.name
        
        try:
            # Ruff (Rust-based, roughly 10-100x faster than flake8 per its own benchmarks)
            result = subprocess.run(
                ['ruff', 'check', temp_path, '--output-format', 'json'],
                capture_output=True,
                text=True,
                timeout=10
            )
            
            if result.returncode == 0:
                return True, "✅ No linting issues"
            
            # Parse JSON output
            issues = json.loads(result.stdout)
            error_summary = []
            for issue in issues[:5]:  # Show top 5
                error_summary.append(
                    f"Line {issue['location']['row']}: {issue['code']} - {issue['message']}"
                )
            
            return False, f"❌ Linting issues found:\n" + "\n".join(error_summary)
        
        except FileNotFoundError:
            # Fallback to flake8 if ruff not installed
            result = subprocess.run(
                ['flake8', temp_path, '--max-line-length', '100'],
                capture_output=True,
                text=True,
                timeout=10
            )
            
            if result.returncode == 0:
                return True, "✅ No linting issues (flake8)"
            
            return False, f"❌ Flake8 issues:\n{result.stdout[:500]}"
        
        except subprocess.TimeoutExpired:
            return False, "❌ Linting timeout (code too large)"
        
        finally:
            Path(temp_path).unlink()
    
    def validate_types(self, code: str) -> Tuple[bool, str]:
        """
        Layer 2.5: Type checking with mypy.
        
        Ensures:
        - Type hints are consistent
        - No type mismatches
        - Return types match annotations
        """
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
            f.write(code)
            temp_path = f.name
        
        try:
            result = subprocess.run(
                ['mypy', temp_path, '--no-error-summary', '--show-error-codes'],
                capture_output=True,
                text=True,
                timeout=15
            )
            
            if result.returncode == 0:
                return True, "✅ Type checking passed"
            
            # Filter errors (ignore missing imports for now)
            errors = [line for line in result.stdout.split('\n') 
                     if 'error:' in line and 'import' not in line.lower()]
            
            if not errors:
                return True, "✅ Type checking passed (ignoring import issues)"
            
            return False, f"❌ Type errors:\n" + "\n".join(errors[:3])
        
        except FileNotFoundError:
            return True, "⚠️ mypy not installed (skipping type check)"
        
        except subprocess.TimeoutExpired:
            return False, "❌ Type checking timeout"
        
        finally:
            Path(temp_path).unlink()
    
    def validate_security(self, code: str) -> Tuple[bool, str]:
        """
        Layer 3: Security vulnerability scanning with bandit.
        
        Detects:
        - Hardcoded credentials
        - SQL injection risks
        - Shell injection (subprocess)
        - Insecure cryptography
        - Path traversal vulnerabilities
        """
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
            f.write(code)
            temp_path = f.name
        
        try:
            result = subprocess.run(
                ['bandit', '-r', temp_path, '-f', 'json'],
                capture_output=True,
                text=True,
                timeout=10
            )
            
            report = json.loads(result.stdout)
            
            # Filter high/medium severity issues
            critical_issues = [
                issue for issue in report.get('results', [])
                if issue['issue_severity'] in ['HIGH', 'MEDIUM']
            ]
            
            if not critical_issues:
                return True, "✅ No security vulnerabilities detected"
            
            # Format error messages
            error_summary = []
            for issue in critical_issues[:3]:
                error_summary.append(
                    f"Line {issue['line_number']}: [{issue['issue_severity']}] "
                    f"{issue['issue_text']}"
                )
            
            return False, f"❌ Security issues found:\n" + "\n".join(error_summary)
        
        except FileNotFoundError:
            # Manual security checks if bandit not installed
            dangerous_patterns = [
                ('eval(', 'Code execution vulnerability'),
                ('exec(', 'Code execution vulnerability'),
                ('__import__', 'Dynamic import (potential code injection)'),
                ('pickle.loads', 'Insecure deserialization'),
                ('password = "', 'Hardcoded password'),
                ('api_key = "', 'Hardcoded API key'),
                ('token = "', 'Hardcoded token'),
            ]
            
            found_issues = []
            for pattern, description in dangerous_patterns:
                if pattern in code:
                    found_issues.append(f"⚠️ {description}: '{pattern}' found")
            
            if found_issues:
                return False, f"❌ Security issues:\n" + "\n".join(found_issues)
            
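            return True, "✅ Basic security check passed (install bandit for thorough scan)"
        
        except (subprocess.TimeoutExpired, json.JSONDecodeError) as e:
            # Fail closed: an unfinished or unreadable scan is not a pass
            return False, f"❌ Security scan did not complete: {e}"
        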
        finally:
            Path(temp_path).unlink()
    
    def validate_complexity(self, code: str) -> Tuple[bool, str]:
        """
        Layer 2: Cyclomatic complexity analysis with radon.
        
        Radon ranks (functions above rank B, i.e. CC > 10, fail this check):
        - A: CC 1-5 (simple)
        - B: CC 6-10 (well structured) ← threshold
        - C: CC 11-20 (slightly complex)
        - D: CC 21-30 (more complex)
        - E: CC 31-40 (complex)
        - F: CC 41+ (error-prone)
        """
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
            f.write(code)
            temp_path = f.name
        
        try:
            result = subprocess.run(
                ['radon', 'cc', temp_path, '-j'],  # JSON output
                capture_output=True,
                text=True,
                timeout=10
            )
            
            complexity_data = json.loads(result.stdout)
            
            # Check for functions with CC > 10
            complex_functions = []
            for file_data in complexity_data.values():
                for item in file_data:
                    if item['complexity'] > 10:
                        complex_functions.append(
                            f"{item['name']}: CC={item['complexity']} (line {item['lineno']})"
                        )
            
            if not complex_functions:
                return True, "✅ Complexity within acceptable range"
            
            return False, f"⚠️ High complexity detected:\n" + "\n".join(complex_functions)
        
        except FileNotFoundError:
            return True, "⚠️ radon not installed (skipping complexity check)"
        
        finally:
            Path(temp_path).unlink()
    
    def validate_all(self, code: str) -> Dict:
        """
        Execute full validation pipeline.
        
        Returns:
            {
                'is_valid': bool,
                'syntax': {'passed': bool, 'message': str},
                'linting': {...},
                'types': {...},
                'security': {...},
                'complexity': {...},
                'overall_score': float (0-100)
            }
        """
        results = {}
        
        # Layer 1: Syntax (critical - must pass)
        syntax_pass, syntax_msg = self.validate_syntax(code)
        results['syntax'] = {'passed': syntax_pass, 'message': syntax_msg}
        
        if not syntax_pass:
            results['is_valid'] = False
            results['overall_score'] = 0
            return results
        
        # Layer 2: Linting
        lint_pass, lint_msg = self.validate_linting(code)
        results['linting'] = {'passed': lint_pass, 'message': lint_msg}
        
        # Layer 2.5: Types
        types_pass, types_msg = self.validate_types(code)
        results['types'] = {'passed': types_pass, 'message': types_msg}
        
        # Layer 3: Security (critical)
        security_pass, security_msg = self.validate_security(code)
        results['security'] = {'passed': security_pass, 'message': security_msg}
        
        # Layer 2: Complexity
        complexity_pass, complexity_msg = self.validate_complexity(code)
        results['complexity'] = {'passed': complexity_pass, 'message': complexity_msg}
        
        # Calculate overall score
        weights = {
            'syntax': 30,       # Critical
            'security': 30,     # Critical
            'linting': 15,
            'types': 15,
            'complexity': 10
        }
        
        score = sum(
            weights[key] for key, value in results.items()
            if key in weights and value.get('passed', False)
        )
        
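        results['overall_score'] = score
        # Security is critical: a score of 70 is reachable with security failing
        # (30 + 15 + 15 + 10), so require it explicitly alongside the threshold.
        results['is_valid'] = security_pass and score >= 70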
        
        return results

# Example usage
validator = CodeValidator()

generated_code = '''
def process_sales(data):
    import pandas as pd
    df = pd.DataFrame(data)
    df['total'] = df['quantity'] * df['price']
    return df.to_dict('records')
'''

results = validator.validate_all(generated_code)
print(f"Overall Score: {results['overall_score']}/100")
print(f"Production Ready: {results['is_valid']}")
for check, result in results.items():
    if isinstance(result, dict) and 'message' in result:
        print(f"{check.upper()}: {result['message']}")
```
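
With these weights, a plain score threshold of 70 would still be met when the security check alone fails (30 + 15 + 15 + 10 = 70), so critical checks need an explicit gate in addition to the score. The sketch below isolates that scoring rule as a pure function; `score_checks` and the `CRITICAL` tuple are hypothetical names, and treating syntax and security as the critical set follows the "Critical" comments on the weights.

```python
from typing import Dict, Tuple

WEIGHTS = {"syntax": 30, "security": 30, "linting": 15, "types": 15, "complexity": 10}
CRITICAL = ("syntax", "security")  # Assumption: these must pass regardless of score


def score_checks(passed: Dict[str, bool], threshold: int = 70) -> Tuple[int, bool]:
    """Return (score, is_valid); is_valid needs the threshold AND all critical checks."""
    score = sum(weight for name, weight in WEIGHTS.items() if passed.get(name, False))
    critical_ok = all(passed.get(name, False) for name in CRITICAL)
    return score, critical_ok and score >= threshold


only_security_fails = {
    "syntax": True, "security": False, "linting": True, "types": True, "complexity": True,
}
print(score_checks(only_security_fails))  # → (70, False)
```

Separating the score from the gate also keeps the score useful as a progress signal during self-correction, even while the code is not yet acceptable.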

**Self-Correction Loop with LLM:**

```python
def self_correct_code(
    code: str,
    validation_results: Dict,
    max_attempts: int = 3
) -> Tuple[str, bool]:
    """
    Iteratively fix code based on validation errors.
    
    Strategy:
    1. Extract error messages from validation
    2. Send to LLM with context: original code + errors
    3. Re-validate corrected code
    4. Repeat until valid or max_attempts reached
    """
    
    for attempt in range(max_attempts):
        # Check if already valid
        if validation_results['is_valid']:
            return code, True
        
        # Collect error messages
        errors = []
        for check, result in validation_results.items():
            if isinstance(result, dict) and not result.get('passed', True):
                errors.append(f"{check}: {result['message']}")
        
        # Generate correction prompt (the fence is built at runtime so the
        # prompt can embed a fenced block)
        fence = "`" * 3
        correction_prompt = f"""
Fix the following Python code based on validation errors:

**Original Code:**
{fence}python
{code}
{fence}

**Validation Errors:**
{chr(10).join(errors)}

**Instructions:**
- Fix all syntax, linting, type, and security issues
- Maintain the original functionality
- Do not add unnecessary complexity
- Return only the corrected code (no explanations)

**Corrected Code:**
"""
        
        # Call LLM for correction
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": correction_prompt}],
            temperature=0.1  # Deterministic fixes
        )
        
        corrected_code = response.choices[0].message.content.strip()
        corrected_code = corrected_code.replace('```python', '').replace('```', '').strip()
        
        # Re-validate
        validation_results = validator.validate_all(corrected_code)
        code = corrected_code
        
        print(f"Attempt {attempt + 1}: Score = {validation_results['overall_score']}/100")
        
        if validation_results['is_valid']:
            return code, True
    
    return code, False  # Failed after max_attempts

# Example: Self-correction workflow
initial_code = '''
def risky_function(user_input):
    result = eval(user_input)  # Security issue!
    return result
'''

validation = validator.validate_all(initial_code)
print(f"Initial Score: {validation['overall_score']}/100")

corrected_code, success = self_correct_code(initial_code, validation)
if success:
    print("✅ Code successfully corrected and validated!")
    print(corrected_code)
else:
    print("❌ Failed to correct code after 3 attempts")
```
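
The correction loop is easier to test when the validator and the LLM call are passed in as functions, because both can then be replaced with stand-ins. The following sketch assumes that design: `correct_until_valid`, `fake_validate`, and `fake_fix` are hypothetical names, and the early exit when the fixer returns unchanged code is an addition not present in the loop above.

```python
from typing import Callable, List, Tuple


def correct_until_valid(
    code: str,
    validate: Callable[[str], List[str]],
    fix: Callable[[str, List[str]], str],
    max_attempts: int = 3,
) -> Tuple[str, bool, int]:
    """Return (code, is_valid, attempts_used).

    validate returns a list of error messages (empty means valid);
    fix returns a new candidate given the code and its errors.
    """
    for attempt in range(max_attempts + 1):
        errors = validate(code)
        if not errors:
            return code, True, attempt
        if attempt == max_attempts:
            break
        candidate = fix(code, errors)
        if candidate == code:
            # The fixer made no change, so further attempts would repeat it
            return code, False, attempt + 1
        code = candidate
    return code, False, max_attempts


# Stand-ins for the validator and the LLM call (both hypothetical)
def fake_validate(code: str) -> List[str]:
    return ["eval() is unsafe"] if "eval(" in code else []


def fake_fix(code: str, errors: List[str]) -> str:
    return code.replace("eval(", "json.loads(")


fixed, ok, attempts = correct_until_valid("result = eval(user_input)", fake_validate, fake_fix)
print(ok, attempts, fixed)
```

In a real run, `validate` would wrap the validation suite and `fix` would wrap the chat completion call, while unit tests keep using the stand-ins.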

**Secrets Detection & Environment Variable Enforcement:**

```python
import re
from typing import List

def detect_secrets(code: str) -> List[str]:
    """
    Detect hardcoded secrets and suggest env var usage.
    
    Patterns:
    - API keys (alphanumeric strings 20+ chars)
    - Passwords in assignments
    - AWS access keys
    - Database connection strings with credentials
    """
    
    patterns = [
        (r'password\s*=\s*["\'](.{3,})["\']', 'Hardcoded password'),
        (r'api_key\s*=\s*["\'](.{10,})["\']', 'Hardcoded API key'),
        (r'secret\s*=\s*["\'](.{10,})["\']', 'Hardcoded secret'),
        (r'token\s*=\s*["\'](.{10,})["\']', 'Hardcoded token'),
        (r'AKIA[0-9A-Z]{16}', 'AWS Access Key'),
        (r'postgres://.*:.*@', 'Database URL with credentials'),
    ]
    
    findings = []
    for pattern, description in patterns:
        matches = re.findall(pattern, code, re.IGNORECASE)
        if matches:
            findings.append(f"❌ {description} detected")
            findings.append(f"   → Replace with: os.getenv('{description.upper().replace(' ', '_')}')")
    
    return findings

# Auto-fix: Replace secrets with environment variables
def fix_secrets(code: str) -> str:
    """Automatically replace hardcoded secrets with os.getenv()."""
    
    # Capture the left-hand side so the original variable name and casing survive
    replacements = [
        (r'(password\s*=\s*)["\'][^"\']{3,}["\']', r'\1os.getenv("DB_PASSWORD")'),
        (r'(api_key\s*=\s*)["\'][^"\']{10,}["\']', r'\1os.getenv("API_KEY")'),
    ]
    
    fixed_code = code
    needs_import = False
    
    for pattern, replacement in replacements:
        if re.search(pattern, fixed_code, re.IGNORECASE):
            fixed_code = re.sub(pattern, replacement, fixed_code, flags=re.IGNORECASE)
            needs_import = True
    
    # Add 'import os' if needed
    if needs_import and 'import os' not in fixed_code:
        fixed_code = 'import os\n\n' + fixed_code
    
    return fixed_code

# Example
bad_code = '''
def connect_db():
    password = "super_secret_123"
    api_key = "sk_live_abc123xyz"
    return create_connection(password, api_key)
'''

secrets = detect_secrets(bad_code)
print("\n".join(secrets))

fixed_code = fix_secrets(bad_code)
print("\n✅ Fixed code:")
print(fixed_code)
```
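
Name-based patterns miss secrets assigned to unexpected variable names. A complementary heuristic, used in various forms by secret scanners, is to flag long quoted strings whose characters look random, measured by Shannon entropy. The sketch below is one such heuristic; `find_high_entropy_strings` is a hypothetical helper, and the 20-character minimum and 4.0-bit threshold are assumed starting values rather than calibrated ones.

```python
import math
import re
from collections import Counter
from typing import List


def shannon_entropy(text: str) -> float:
    """Bits of entropy per character, based on character frequencies."""
    if not text:
        return 0.0
    counts = Counter(text)
    return -sum((n / len(text)) * math.log2(n / len(text)) for n in counts.values())


# Quoted runs of 20+ key-like characters
CANDIDATE = re.compile(r"""["']([A-Za-z0-9_\-+/=]{20,})["']""")


def find_high_entropy_strings(code: str, min_entropy: float = 4.0) -> List[str]:
    """Return quoted literals that are long and random-looking.

    min_entropy is an assumed starting threshold, not a calibrated value.
    """
    return [
        literal
        for literal in CANDIDATE.findall(code)
        if shannon_entropy(literal) >= min_entropy
    ]


sample = '''
table = "analytics_daily_sales_summary"
key = "9fK2xQ7bLm4Zr8TpW1vYc3Nd"
pad = "aaaaaaaaaaaaaaaaaaaaaaaa"
'''
print(find_high_entropy_strings(sample))
```

In this sample only the random-looking key is reported: the table name is long but repeats characters often enough to stay below the threshold. Heuristics like this produce false positives on hashes and encoded data, so findings are best reviewed rather than auto-fixed.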

---

### 🧪 **Test Generation: Automated Quality Assurance**

**Test Generation Strategy:**

```
Generated ETL Pipeline Code
          │
          ▼
┌──────────────────────────────────┐
│  Test Type Selection             │
│  ├─ Unit Tests (functions)       │
│  ├─ Integration Tests (I/O)      │
│  ├─ Property Tests (invariants)  │
│  └─ End-to-End Tests (full flow) │
└──────────┬───────────────────────┘
           │
           ▼
┌──────────────────────────────────┐
│  Test Case Generation (LLM)      │
│  ├─ Happy path (normal inputs)   │
│  ├─ Edge cases (nulls, empty)    │
│  ├─ Error cases (exceptions)     │
│  ├─ Boundary values (min/max)    │
│  └─ Performance tests (large data)│
└──────────┬───────────────────────┘
           │
           ▼
┌──────────────────────────────────┐
│  Mock Generation                 │
│  ├─ Database connections (mock)  │
│  ├─ External APIs (mock)         │
│  ├─ File I/O (tmpdir)            │
│  └─ Environment variables        │
└──────────┬───────────────────────┘
           │
           ▼
┌──────────────────────────────────┐
│  Fixtures & Setup                │
│  ├─ pytest fixtures              │
│  ├─ Test data generation (Faker) │
│  ├─ Database seeding             │
│  └─ Teardown cleanup             │
└──────────┬───────────────────────┘
           │
           ▼
┌──────────────────────────────────┐
│  Execution & Coverage            │
│  ├─ pytest execution             │
│  ├─ Coverage report (pytest-cov) │
│  ├─ Assertion analysis           │
│  └─ Target: 80%+ coverage        │
└──────────────────────────────────┘
```
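
The mock generation step can be illustrated with the standard library alone. The sketch below tests a small load function against a mocked database connection using `unittest.mock.MagicMock`; `load_rows` is a hypothetical function written for this example, the table name is illustrative, and the `%s` placeholder style is an assumption since DB-API drivers differ in parameter syntax.

```python
from unittest.mock import MagicMock


def load_rows(conn, rows):
    """Insert rows through a DB-API style connection and commit once."""
    cursor = conn.cursor()
    cursor.executemany(
        "INSERT INTO daily_sales_summary (date, total_sales) VALUES (%s, %s)",
        rows,
    )
    conn.commit()
    return len(rows)


def test_load_rows_commits_once():
    # Arrange: a mock connection replaces the real database
    conn = MagicMock()
    rows = [("2024-01-01", 120.5), ("2024-01-02", 98.0)]

    # Act
    inserted = load_rows(conn, rows)

    # Assert: a batched insert was issued and the transaction committed
    assert inserted == 2
    conn.cursor.return_value.executemany.assert_called_once()
    conn.commit.assert_called_once()


test_load_rows_commits_once()
print("mock-based test passed")
```

The test follows the Arrange, Act, Assert layout and runs unchanged under pytest, which discovers it by its `test_` prefix.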

**Comprehensive Test Generation Implementation:**

```python
def generate_tests(
    code: str,
    test_framework: str = "pytest"
) -> str:
    """
    Generate comprehensive test suite for generated code.
    
    Coverage:
    - Unit tests for each function
    - Edge cases (nulls, empty, invalid types)
    - Error handling (exceptions)
    - Mocks for external dependencies
    - Integration tests for full pipeline
    """
    
    test_generation_prompt = f"""
You are an expert in test-driven development (TDD) and pytest.

**Code to Test:**
```python
{code}
```

**Generate comprehensive pytest test suite including:**

1. **Unit Tests:**
   - Test each function with normal inputs (happy path)
   - Test with edge cases: None, empty string/list, zeros
   - Test boundary values (min, max)
   - Test type validation (wrong types should raise errors)

2. **Error Handling Tests:**
   - Test that exceptions are raised correctly
   - Test error messages are descriptive
   - Test retry logic (if applicable)

3. **Mocks & Fixtures:**
   - Mock external dependencies (databases, APIs, file I/O)
   - Use @pytest.fixture for test data
   - Use unittest.mock.patch for external calls

4. **Integration Tests:**
   - Test full pipeline with realistic data
   - Test data flow from input to output
   - Verify side effects (file creation, DB inserts)

5. **Property-Based Tests (if applicable):**
   - Use hypothesis for property testing
   - Test invariants (e.g., output size == input size)

**Test Code Standards:**
- Use descriptive test names: test_<function>_<scenario>_<expected>
- AAA pattern: Arrange, Act, Assert
- Parametrize tests with @pytest.mark.parametrize
- Target: 80%+ code coverage
- Include docstrings for complex test logic

**Output Format:**
Complete pytest test file with all imports and fixtures.

Generate tests:
"""
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": test_generation_prompt}],
        temperature=0.3  # Slightly higher for diverse test cases
    )
    
    test_code = response.choices[0].message.content.strip()
    test_code = test_code.replace('```python', '').replace('```', '').strip()
    
    return test_code


# Example: Generate tests for ETL pipeline
etl_code = '''
import pandas as pd
from typing import List, Dict

def extract_sales(file_path: str) -> pd.DataFrame:
    """Extract sales data from CSV file."""
    df = pd.read_csv(file_path)
    if df.empty:
        raise ValueError("CSV file is empty")
    return df

def transform_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Transform sales data: filter, clean, aggregate."""
    # Filter out negative amounts
    df_clean = df[df['amount'] > 0].copy()
    
    # Add month column
    df_clean['month'] = pd.to_datetime(df_clean['date']).dt.to_period('M')
    
    # Calculate daily totals
    df_agg = df_clean.groupby(['date', 'month']).agg({
        'amount': 'sum',
        'transaction_id': 'count'
    }).reset_index()
    
    return df_agg

def load_sales(df: pd.DataFrame, output_path: str) -> None:
    """Load transformed data to Parquet."""
    if len(df) == 0:
        raise ValueError("No data to load")
    
    df.to_parquet(output_path, index=False, compression='snappy')

def run_etl_pipeline(input_path: str, output_path: str) -> Dict:
    """Full ETL pipeline orchestration."""
    df_raw = extract_sales(input_path)
    df_transformed = transform_sales(df_raw)
    load_sales(df_transformed, output_path)
    
    return {
        'rows_extracted': len(df_raw),
        'rows_loaded': len(df_transformed),
        'success': True
    }
'''

generated_tests = generate_tests(etl_code)
print("Generated Test Suite:")
print(generated_tests)
```

**Advanced Testing Patterns:**

```python
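# 1. Property-Based Testing with Hypothesis
import pandas as pd
import pytest
from hypothesis import given, strategies as st

@given(st.lists(st.integers(min_value=1, max_value=1000), min_size=1))
def test_transform_preserves_total_amount_property(amounts):
    """Property: with all-positive amounts, aggregation keeps the total and the row count."""
    # transform_sales reads 'amount', 'date' and 'transaction_id', so all three are required
    df = pd.DataFrame({
        'transaction_id': range(len(amounts)),
        'amount': amounts,
        'date': ['2024-01-01'] * len(amounts),
    })
    result = transform_sales(df)
    assert len(result) == 1  # Single date -> single aggregate row
    assert result['amount'].sum() == sum(amounts)
    assert result['transaction_id'].sum() == len(amounts)  # 'count' per group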

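# 2. Parametrized Tests for Edge Cases
def _sales_df(amounts):
    """Build a minimal frame with every column transform_sales reads."""
    return pd.DataFrame({
        'transaction_id': range(len(amounts)),
        'date': ['2024-01-01'] * len(amounts),
        'amount': amounts,
    })

@pytest.mark.parametrize("test_input,expected_error", [
    (pd.DataFrame(), KeyError),   # No columns at all: the 'amount' lookup fails
    (_sales_df([-1, -2]), None),  # All negative (filters to empty)
    (_sales_df([0, 0]), None),    # All zeros (also removed by amount > 0)
])
def test_transform_edge_cases(test_input, expected_error):
    if expected_error:
        with pytest.raises(expected_error):
            transform_sales(test_input)
    else:
        result = transform_sales(test_input)
        assert isinstance(result, pd.DataFrame)
        assert result.empty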

# 3. Integration Test with Test Containers
from testcontainers.postgres import PostgresContainer
import sqlalchemy

@pytest.fixture(scope='session')
def postgres_container():
    """Spin up real PostgreSQL for integration tests."""
    with PostgresContainer("postgres:15") as postgres:
        yield postgres

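def test_full_pipeline_with_real_db(postgres_container, tmp_path):
    """Integration test: CSV → Transform → Parquet → load into a real DB."""
    # Create test CSV
    input_file = tmp_path / "sales.csv"
    pd.DataFrame({
        'transaction_id': [1, 2, 3],
        'date': ['2024-01-01', '2024-01-02', '2024-01-03'],
        'amount': [100, 200, 150]
    }).to_csv(input_file, index=False)
    
    # Run ETL
    output_file = tmp_path / "output.parquet"
    result = run_etl_pipeline(str(input_file), str(output_file))
    
    # Verify
    assert result['success']
    assert result['rows_extracted'] == 3
    assert output_file.exists()
    
    # Load and verify output
    df_output = pd.read_parquet(output_file)
    assert len(df_output) > 0
    assert 'month' in df_output.columns
    
    # Write the result to the containerized Postgres and read the count back
    # (table name 'daily_sales' is illustrative; 'month' is cast to text for the insert)
    engine = sqlalchemy.create_engine(postgres_container.get_connection_url())
    df_output.astype({'month': str}).to_sql(
        'daily_sales', engine, index=False, if_exists='replace'
    )
    row_count = pd.read_sql('SELECT COUNT(*) AS n FROM daily_sales', engine)['n'].iloc[0]
    assert row_count == len(df_output)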

# 4. Performance Test
import time

def test_pipeline_performance_large_dataset(tmp_path):
    """Performance test: 1M rows should complete in <10s."""
    # Generate large dataset
    large_df = pd.DataFrame({
        'transaction_id': range(1_000_000),
        'date': pd.date_range('2024-01-01', periods=1_000_000, freq='s'),
        'amount': [100] * 1_000_000
    })
    
    input_file = tmp_path / "large_sales.csv"
    large_df.to_csv(input_file, index=False)
    
    # Measure execution time (perf_counter is monotonic, unlike time.time)
    start = time.perf_counter()
    output_file = tmp_path / "output.parquet"
    run_etl_pipeline(str(input_file), str(output_file))
    duration = time.perf_counter() - start
    
    assert duration < 10, f"Pipeline too slow: {duration:.2f}s (expected <10s)"

# 5. Mock External Dependencies
from unittest.mock import patch, MagicMock

@patch('boto3.client')
def test_extract_from_s3_with_mock(mock_boto_client, tmp_path):
    """Unit test with mocked S3 client."""
    # Mock S3 download
    mock_s3 = MagicMock()
    mock_boto_client.return_value = mock_s3
    
    # Create test file locally
    test_file = tmp_path / "test.csv"
    pd.DataFrame({'amount': [100]}).to_csv(test_file, index=False)
    
    # Mock download_file to copy local file
    def mock_download(bucket, key, local_path):
        import shutil
        shutil.copy(test_file, local_path)
    
    mock_s3.download_file.side_effect = mock_download
    
    # Test extraction (would normally download from S3)
    # extract_from_s3('bucket', 'key.csv', '/tmp/output.csv')
    
    # Verify S3 client was called
    # mock_s3.download_file.assert_called_once_with('bucket', 'key.csv', '/tmp/output.csv')
```
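**Runnable Variant of the S3 Mock Test:**

Pattern 5 above leaves the call under test commented out, so the test passes without asserting anything. The sketch below shows one way the same idea could run end to end. It assumes a hypothetical `extract_from_s3` that receives the S3 client as an argument (dependency injection) rather than creating it through `boto3.client`; that signature is an assumption for illustration, not something defined in this notebook. Only the standard library is used, so no AWS credentials or `boto3` install are needed.

```python
import csv
import shutil
import tempfile
from pathlib import Path
from unittest.mock import MagicMock


def extract_from_s3(s3_client, bucket: str, key: str, local_path: str) -> list:
    """Hypothetical extractor: download a CSV through the client, return rows as dicts."""
    s3_client.download_file(bucket, key, local_path)
    with open(local_path, newline="") as handle:
        return list(csv.DictReader(handle))


def test_extract_from_s3_with_mock(tmp_path: Path) -> None:
    # Arrange: a local fixture file stands in for the S3 object
    fixture_file = tmp_path / "fixture.csv"
    fixture_file.write_text("amount\n100\n")
    target = tmp_path / "downloaded.csv"

    mock_s3 = MagicMock()
    # The fake download copies the fixture to wherever the caller asked
    mock_s3.download_file.side_effect = (
        lambda bucket, key, path: shutil.copy(fixture_file, path)
    )

    # Act
    rows = extract_from_s3(mock_s3, "bucket", "key.csv", str(target))

    # Assert: both the parsed data and the interaction with the client
    assert rows == [{"amount": "100"}]
    mock_s3.download_file.assert_called_once_with("bucket", "key.csv", str(target))


# Outside pytest, the test can be exercised with a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    test_extract_from_s3_with_mock(Path(tmp))
```

Passing the client in keeps the test free of `@patch` target strings, which break silently when the import path of the code under test changes.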

**Test Coverage Analysis & Improvement:**

```python
import subprocess
import json
from pathlib import Path
from typing import Dict, List

def run_tests_with_coverage(test_file: str, code_file: str) -> Dict:
    """
    Execute tests and generate coverage report.
    
    Returns:
        {
            'coverage_percent': float,
            'lines_covered': int,
            'lines_total': int,
            'missing_lines': List[int],
            'passed_tests': int,
            'failed_tests': int
        }
    """
    
    # Run pytest with coverage (--cov needs pytest-cov; --json-report needs pytest-json-report)
    result = subprocess.run(
        [
            'pytest', test_file,
            f'--cov={code_file}',
            '--cov-report=json',
            '--json-report',
            '--json-report-file=/tmp/test_report.json'
        ],
        capture_output=True,
        text=True
    )
    
    # Parse coverage JSON
    with open('coverage.json') as f:
        cov_data = json.load(f)
    
    file_cov = cov_data['files'][code_file]
    
    # Parse test results
    with open('/tmp/test_report.json') as f:
        test_data = json.load(f)
    
    # pytest-json-report omits outcome keys whose count is zero, so default to 0
    summary = test_data['summary']
    
    return {
        'coverage_percent': file_cov['summary']['percent_covered'],
        'lines_covered': file_cov['summary']['covered_lines'],
        'lines_total': file_cov['summary']['num_statements'],
        'missing_lines': file_cov['missing_lines'],
        'passed_tests': summary.get('passed', 0),
        'failed_tests': summary.get('failed', 0)
    }

def improve_test_coverage(
    code: str,
    existing_tests: str,
    coverage_report: Dict
) -> str:
    """
    Generate additional tests to improve coverage.
    
    Strategy:
    - Identify uncovered lines from coverage report
    - Extract uncovered functions/branches
    - Generate targeted tests for missing coverage
    """
    
    improvement_prompt = f"""
The following code has {coverage_report['coverage_percent']:.1f}% test coverage.

**Code:**
```python
{code}
```

**Existing Tests:**
```python
{existing_tests}
```

**Uncovered Lines:** {coverage_report['missing_lines']}

**Task:**
Generate additional pytest tests to cover the missing lines and bring coverage to 80%+.

Focus on:
- Uncovered branches (if/else not tested)
- Exception handling paths
- Edge cases not yet covered

**Additional Tests:**
"""
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": improvement_prompt}],
        temperature=0.2
    )
    
    additional_tests = response.choices[0].message.content.strip()
    return additional_tests

# Example: Iterative coverage improvement
coverage_report = {
    'coverage_percent': 65.0,
    'missing_lines': [15, 16, 23, 24, 30],
    'passed_tests': 8,
    'failed_tests': 0
}

additional_tests = improve_test_coverage(etl_code, generated_tests, coverage_report)
print("Additional tests to improve coverage:")
print(additional_tests)
```
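**Mapping Uncovered Lines to Functions:**

The strategy in `improve_test_coverage` mentions extracting the uncovered functions, but the prompt only receives raw line numbers. One possible way to derive function names from `missing_lines` is sketched below with the standard `ast` module. The helper `uncovered_functions` is an illustrative name, not part of the original code, and nested functions would be reported under both the inner and the outer name.

```python
import ast
from typing import Dict, List


def uncovered_functions(code: str, missing_lines: List[int]) -> Dict[str, List[int]]:
    """Map each function name to the uncovered line numbers inside its body."""
    missing = set(missing_lines)
    result: Dict[str, List[int]] = {}
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno/end_lineno are 1-based and inclusive
            span = range(node.lineno, node.end_lineno + 1)
            hits = sorted(missing.intersection(span))
            if hits:
                result[node.name] = hits
    return result


sample = (
    "def extract(path):\n"              # line 1
    "    if not path:\n"                 # line 2
    "        raise ValueError('x')\n"    # line 3
    "    return path\n"                  # line 4
    "\n"                                 # line 5
    "def load(rows):\n"                  # line 6
    "    return len(rows)\n"             # line 7
)
print(uncovered_functions(sample, [3, 7, 99]))
# {'extract': [3], 'load': [7]}
```

The resulting mapping could be added to the improvement prompt next to the line numbers, so the model sees which functions still need tests.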

**Test Execution in CI/CD:**

```yaml
# .github/workflows/test-generated-code.yml
name: Test Generated ETL Code

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: |
          pip install pytest pytest-cov hypothesis pandas pyarrow
      
      - name: Run tests with coverage
        run: |
          pytest tests/ \
            --cov=src/ \
            --cov-report=xml \
            --cov-report=term \
            --cov-fail-under=80
      
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml
          fail_ci_if_error: true
```

**Test Quality Metrics:**

```python
import ast
from typing import Dict

def analyze_test_quality(test_code: str) -> Dict:
    """
    Assess quality of generated tests.
    
    Metrics:
    - Test count
    - Assertion count
    - Mock usage
    - Fixture usage
    - Parametrize usage
    - Descriptive names
    """
    
    tree = ast.parse(test_code)
    
    test_count = 0
    assertion_count = 0
    mock_count = 0
    fixture_count = 0
    parametrize_count = 0
    
    def _name_of(node: ast.AST) -> str:
        """Trailing name of a call target or decorator, e.g. 'fixture' or 'parametrize'."""
        if isinstance(node, ast.Call):
            node = node.func
        if isinstance(node, ast.Attribute):
            return node.attr
        if isinstance(node, ast.Name):
            return node.id
        return ''
    
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # Count test functions and the assertions inside each one
            if node.name.startswith('test_'):
                test_count += 1
                assertion_count += sum(
                    isinstance(child, ast.Assert) for child in ast.walk(node)
                )
            
            # Count fixtures and parametrized tests. Decorators may be bare
            # (@pytest.fixture) or called (@pytest.mark.parametrize(...)).
            for decorator in node.decorator_list:
                decorator_name = _name_of(decorator)
                if decorator_name == 'fixture':
                    fixture_count += 1
                elif decorator_name == 'parametrize':
                    parametrize_count += 1
        
        # Count mocks: patch(...), mock.patch(...), patch.object(...)
        if isinstance(node, ast.Call) and 'patch' in ast.unparse(node.func):
            mock_count += 1
    
    assertions_per_test = assertion_count / test_count if test_count > 0 else 0
    
    # Quality score: each component is capped so the maximum is 100
    score = (
        min(test_count * 5, 20) +            # More tests is better (up to 20)
        min(assertions_per_test * 10, 30) +  # ~3 assertions/test is ideal (up to 30)
        min(mock_count * 8, 24) +            # Good mocking practice (up to 24)
        min(fixture_count * 6, 18) +         # Reusable fixtures (up to 18)
        min(parametrize_count * 4, 8)        # Efficient parametrization (up to 8)
    )
    
    return {
        'test_count': test_count,
        'assertion_count': assertion_count,
        'assertions_per_test': assertions_per_test,
        'mock_count': mock_count,
        'fixture_count': fixture_count,
        'parametrize_count': parametrize_count,
        'quality_score': score
    }

# Analyze generated tests
quality = analyze_test_quality(generated_tests)
print(f"Test Quality Score: {quality['quality_score']:.1f}/100")
print(f"Tests: {quality['test_count']}, Assertions: {quality['assertion_count']}")
```

---

### 🚀 **Production Deployment: From Generated Code to Live Pipeline**

**End-to-End Production Workflow:**

```
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 1: REQUIREMENTS GATHERING                                │
│  ├─ Natural language specification from stakeholder             │
│  ├─ Technical constraints (SLA, budget, compliance)             │
│  ├─ Data schema documentation                                   │
│  └─ Integration requirements (source/target systems)            │
└────────────────────┬────────────────────────────────────────────┘
                     │
┌────────────────────┴────────────────────────────────────────────┐
│  PHASE 2: CODE GENERATION (GPT-4 + RAG)                         │
│  ├─ Main pipeline code (Python/PySpark)                         │
│  ├─ Configuration (YAML/JSON)                                   │
│  ├─ Unit tests (pytest, 80%+ coverage)                          │
│  ├─ Integration tests (testcontainers)                          │
│  ├─ Dockerfile (containerization)                               │
│  ├─ CI/CD pipeline (GitHub Actions)                             │
│  └─ Documentation (README, API docs)                            │
└────────────────────┬────────────────────────────────────────────┘
                     │
┌────────────────────┴────────────────────────────────────────────┐
│  PHASE 3: VALIDATION & QUALITY ASSURANCE                        │
│  ├─ Syntax validation (AST parsing)                             │
│  ├─ Security scan (bandit, secrets detection)                   │
│  ├─ Linting (ruff/flake8)                                       │
│  ├─ Type checking (mypy)                                        │
│  ├─ Test execution (pytest)                                     │
│  ├─ Coverage analysis (pytest-cov > 80%)                        │
│  └─ Self-correction loop (max 3 iterations)                     │
└────────────────────┬────────────────────────────────────────────┘
                     │
┌────────────────────┴────────────────────────────────────────────┐
│  PHASE 4: HUMAN REVIEW (Mandatory Gate)                         │
│  ├─ Code review by senior engineer                              │
│  ├─ Business logic verification                                 │
│  ├─ Performance review (query plans, partitioning)              │
│  ├─ Security audit (data access, PII handling)                  │
│  └─ Approval required before deployment                         │
└────────────────────┬────────────────────────────────────────────┘
                     │
┌────────────────────┴────────────────────────────────────────────┐
│  PHASE 5: STAGING DEPLOYMENT                                    │
│  ├─ Deploy to staging environment (Kubernetes pod)              │
│  ├─ Run with production-like data (last 7 days)                 │
│  ├─ Monitor metrics (latency, memory, CPU)                      │
│  ├─ Compare results with existing pipeline (if any)             │
│  ├─ Validate data quality (Great Expectations)                  │
│  └─ Smoke tests (end-to-end validation)                         │
└────────────────────┬────────────────────────────────────────────┘
                     │
┌────────────────────┴────────────────────────────────────────────┐
│  PHASE 6: PRODUCTION DEPLOYMENT (Blue-Green)                    │
│  ├─ Deploy to production (parallel to existing)                 │
│  ├─ Gradual traffic shift (0% → 10% → 50% → 100%)               │
│  ├─ Real-time monitoring (Prometheus + Grafana)                 │
│  ├─ Alerting (PagerDuty for failures)                           │
│  ├─ Rollback capability (one-click revert)                      │
│  └─ Post-deployment verification (24h monitoring)               │
└────────────────────┬────────────────────────────────────────────┘
                     │
┌────────────────────┴────────────────────────────────────────────┐
│  PHASE 7: CONTINUOUS MONITORING & ITERATION                     │
│  ├─ Performance metrics (p95 latency, throughput)               │
│  ├─ Data quality metrics (null rate, schema drift)              │
│  ├─ Cost tracking (compute, storage)                            │
│  ├─ Error tracking (Sentry, CloudWatch)                         │
│  ├─ User feedback collection                                    │
│  └─ Iterative improvements (regenerate with feedback)           │
└─────────────────────────────────────────────────────────────────┘
```
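**Sketch of the Phase 3 Self-Correction Loop:**

Phase 3 in the workflow above ends with a self-correction loop capped at 3 iterations. A minimal sketch of that control flow follows. The names `self_correct`, `syntax_errors` and `toy_fix` are illustrative; in a real setup the `fix` callable would presumably send the code and the error list back to the LLM, and `validate` would combine the linting, typing and test checks listed in the diagram.

```python
import ast
from typing import Callable, List, Tuple


def syntax_errors(code: str) -> List[str]:
    """Validator used for the demo: report a syntax error, if any."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return [f"line {exc.lineno}: {exc.msg}"]
    return []


def self_correct(
    code: str,
    validate: Callable[[str], List[str]],
    fix: Callable[[str, List[str]], str],
    max_iterations: int = 3,
) -> Tuple[str, List[str], int]:
    """Validate, ask for a fix, and re-validate until clean or out of budget."""
    errors = validate(code)
    attempts = 0
    while errors and attempts < max_iterations:
        code = fix(code, errors)
        attempts += 1
        errors = validate(code)
    return code, errors, attempts


def toy_fix(code: str, errors: List[str]) -> str:
    """Stand-in for the LLM repair call: patches one known typo."""
    return code.replace("def run(:", "def run():")


fixed, remaining, attempts = self_correct(
    "def run(:\n    return 1\n", syntax_errors, toy_fix
)
print(attempts, remaining)
```

Returning the remaining errors together with the attempt count lets the caller decide whether to escalate to the human review gate in Phase 4 or to discard the generation.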

**Complete Production Pipeline Generator:**

```python
from dataclasses import dataclass
from typing import List, Dict, Optional
from pathlib import Path
import yaml

@dataclass
class PipelineSpec:
    """Specification for production ETL pipeline."""
    name: str
    description: str
    source: Dict  # {'type': 's3', 'bucket': 'data', 'prefix': 'raw/sales/'}
    target: Dict  # {'type': 'snowflake', 'database': 'analytics', 'table': 'sales'}
    transformations: List[str]  # ['deduplicate', 'filter_nulls', 'aggregate_daily']
    schedule: str  # Cron expression: '0 2 * * *'
    sla_minutes: int  # 30
    owner: str  # 'data-team@company.com'
    compliance: List[str]  # ['GDPR', 'SOC2']

class ProductionPipelineGenerator:
    """
    Generates complete production-ready ETL pipeline.
    
    Outputs:
    - pipeline.py (main code)
    - config.yaml (configuration)
    - test_pipeline.py (tests)
    - Dockerfile (container)
    - .github/workflows/ci.yml (CI/CD)
    - README.md (documentation)
    - airflow_dag.py (orchestration)
    """
    
    def __init__(self, spec: PipelineSpec):
        self.spec = spec
        self.output_dir = Path(f"pipelines/{spec.name}")
    
    def generate_all(self):
        """Generate complete pipeline structure."""
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
        print(f"🚀 Generating pipeline: {self.spec.name}")
        
        # 1. Generate main pipeline code
        pipeline_code = self._generate_pipeline_code()
        self._save_file("pipeline.py", pipeline_code)
        print("✅ Generated pipeline.py")
        
        # 2. Generate configuration
        config = self._generate_config()
        self._save_file("config.yaml", yaml.dump(config))
        print("✅ Generated config.yaml")
        
        # 3. Generate tests
        test_code = self._generate_tests(pipeline_code)
        self._save_file("test_pipeline.py", test_code)
        print("✅ Generated test_pipeline.py")
        
        # 4. Generate Dockerfile
        dockerfile = self._generate_dockerfile()
        self._save_file("Dockerfile", dockerfile)
        print("✅ Generated Dockerfile")
        
        # 5. Generate CI/CD
        ci_yaml = self._generate_ci_cd()
        self._save_file(".github/workflows/ci.yml", ci_yaml)
        print("✅ Generated CI/CD pipeline")
        
        # 6. Generate documentation
        readme = self._generate_readme()
        self._save_file("README.md", readme)
        print("✅ Generated README.md")
        
        # 7. Generate Airflow DAG
        dag_code = self._generate_airflow_dag()
        self._save_file("airflow_dag.py", dag_code)
        print("✅ Generated Airflow DAG")
        
        # 8. Validate all generated code
        validation_results = self._validate_generated_code()
        print(f"\n📊 Validation Score: {validation_results['overall_score']}/100")
        
        return str(self.output_dir)
    
    def _generate_pipeline_code(self) -> str:
        """Generate main ETL pipeline code using LLM."""
        
        prompt = f"""
Generate production-ready ETL pipeline code with the following specifications:

**Pipeline Name:** {self.spec.name}
**Description:** {self.spec.description}

**Source:**
- Type: {self.spec.source['type']}
- Details: {self.spec.source}

**Target:**
- Type: {self.spec.target['type']}
- Details: {self.spec.target}

**Transformations:**
{chr(10).join(f'- {t}' for t in self.spec.transformations)}

**Requirements:**
- Python 3.11+ with type hints
- Error handling with exponential backoff (max 3 retries)
- Structured logging (JSON format)
- Metrics instrumentation (Prometheus)
- Configuration loaded from YAML
- Idempotent (safe to re-run)
- SLA: {self.spec.sla_minutes} minutes
- Compliance: {', '.join(self.spec.compliance)}

**Code Structure:**
```python
# Imports
from dataclasses import dataclass
from typing import Dict, List
import logging
import pandas as pd
import yaml

# Configuration
@dataclass
class PipelineConfig:
    # Load from config.yaml
    pass

# Extract
def extract(config: PipelineConfig) -> pd.DataFrame:
    # Implement with retry logic
    pass

# Transform
def transform(df: pd.DataFrame, config: PipelineConfig) -> pd.DataFrame:
    # Implement transformations
    pass

# Load
def load(df: pd.DataFrame, config: PipelineConfig) -> None:
    # Implement with transaction
    pass

# Main
def run_pipeline(config_path: str) -> Dict:
    # Orchestrate E-T-L
    # Return metrics
    pass

if __name__ == '__main__':
    run_pipeline('config.yaml')
```

Generate complete, production-ready code:
"""
        
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2
        )
        
        code = response.choices[0].message.content.strip()
        code = code.replace('```python', '').replace('```', '').strip()
        
        return code
    
    def _generate_config(self) -> Dict:
        """Generate YAML configuration."""
        return {
            'pipeline': {
                'name': self.spec.name,
                'version': '1.0.0',
                'owner': self.spec.owner,
                'sla_minutes': self.spec.sla_minutes
            },
            'source': self.spec.source,
            'target': self.spec.target,
            'transformations': self.spec.transformations,
            'retry': {
                'max_attempts': 3,
                'backoff_factor': 2,
                'timeout_seconds': 300
            },
            'logging': {
                'level': 'INFO',
                'format': 'json',
                'destination': 'stdout'
            },
            'monitoring': {
                'prometheus_port': 9090,
                'healthcheck_endpoint': '/health'
            }
        }
    
    def _generate_tests(self, pipeline_code: str) -> str:
        """Generate comprehensive test suite."""
        return generate_tests(pipeline_code)  # Reuse from earlier
    
    def _generate_dockerfile(self) -> str:
        """Generate optimized Dockerfile."""
        return f'''
# Multi-stage build for smaller image
FROM python:3.11-slim AS builder

WORKDIR /app

# Install dependencies into a virtualenv that can be copied as a unit
RUN python -m venv /opt/venv
ENV PATH=/opt/venv/bin:$PATH
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Final stage
FROM python:3.11-slim

WORKDIR /app

# Copy dependencies from builder (kept outside /root so the non-root user can read them)
COPY --from=builder /opt/venv /opt/venv

# Copy application code
COPY pipeline.py config.yaml ./

# Put the virtualenv's interpreter and scripts first on PATH
ENV PATH=/opt/venv/bin:$PATH

# Non-root user for security
RUN useradd -m -u 1000 pipeline && chown -R pipeline:pipeline /app
USER pipeline

# Health check (placeholder: it only proves the interpreter starts)
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s \\
  CMD python -c "import sys; sys.exit(0)"

# Run pipeline
CMD ["python", "pipeline.py"]
'''
    
    def _generate_ci_cd(self) -> str:
        """Generate GitHub Actions CI/CD pipeline."""
        return f'''
name: CI/CD Pipeline for {self.spec.name}

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov bandit ruff mypy
      
      - name: Run linting
        run: ruff check pipeline.py
      
      - name: Run type checking
        run: mypy pipeline.py --ignore-missing-imports
      
      - name: Run security scan
        run: bandit -r pipeline.py
      
      - name: Run tests
        run: pytest test_pipeline.py --cov=pipeline --cov-fail-under=80
      
      - name: Upload coverage
        uses: codecov/codecov-action@v3

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Build Docker image
        run: docker build -t {self.spec.name}:${{{{ github.sha }}}} .
      
      - name: Push to registry
        run: |
          echo "${{{{ secrets.DOCKER_PASSWORD }}}}" | docker login -u "${{{{ secrets.DOCKER_USERNAME }}}}" --password-stdin
          docker push {self.spec.name}:${{{{ github.sha }}}}

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/develop'
    steps:
      - name: Deploy to staging
        run: |
          kubectl set image deployment/{self.spec.name} app={self.spec.name}:${{{{ github.sha }}}} -n staging

  deploy-production:
    needs: build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - name: Deploy to production
        run: |
          kubectl set image deployment/{self.spec.name} app={self.spec.name}:${{{{ github.sha }}}} -n production
'''
    
    def _generate_readme(self) -> str:
        """Generate comprehensive documentation."""
        return f'''
# {self.spec.name}

{self.spec.description}

## Overview

**Owner:** {self.spec.owner}  
**Schedule:** {self.spec.schedule}  
**SLA:** {self.spec.sla_minutes} minutes  
**Compliance:** {', '.join(self.spec.compliance)}

## Architecture

```
{self.spec.source['type']} → Transform → {self.spec.target['type']}
```

### Transformations

{chr(10).join(f'- {t}' for t in self.spec.transformations)}

## Local Development

```bash
# Install dependencies
pip install -r requirements.txt

# Run tests
pytest test_pipeline.py

# Run pipeline locally
python pipeline.py

# Build Docker image
docker build -t {self.spec.name} .

# Run container
docker run {self.spec.name}
```

## Configuration

Edit `config.yaml` to customize pipeline behavior.

## Monitoring

- **Metrics:** http://localhost:9090/metrics (Prometheus)
- **Health:** http://localhost:8080/health

## Deployment

Automatic via GitHub Actions:
- Push to `develop` → deploys to staging
- Push to `main` → deploys to production (requires approval)

## Troubleshooting

See [TROUBLESHOOTING.md](TROUBLESHOOTING.md) for common issues.

## Contact

Questions? Contact {self.spec.owner}
'''
    
    def _generate_airflow_dag(self) -> str:
        """Generate Airflow DAG for orchestration."""
        return f'''
from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {{
    'owner': '{self.spec.owner}',
    'depends_on_past': False,
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'sla': timedelta(minutes={self.spec.sla_minutes})
}}

with DAG(
    dag_id='{self.spec.name}',
    default_args=default_args,
    description='{self.spec.description}',
    schedule_interval='{self.spec.schedule}',
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=['etl', 'generated', '{self.spec.target["type"]}']
) as dag:
    
    run_pipeline = DockerOperator(
        task_id='run_{self.spec.name}',
        image='{self.spec.name}:latest',
        auto_remove=True,
        docker_url='unix://var/run/docker.sock',
        network_mode='bridge'
    )
    
    def send_success_notification(**context):
        # Send Slack/email notification
        print(f"✅ Pipeline {self.spec.name} completed successfully")
    
    notify = PythonOperator(
        task_id='notify_success',
        python_callable=send_success_notification
    )
    
    run_pipeline >> notify
'''
    
    def _save_file(self, filename: str, content: str):
        """Save file to output directory."""
        file_path = self.output_dir / filename
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.write_text(content)
    
    def _validate_generated_code(self) -> Dict:
        """Validate all generated code."""
        validator = CodeValidator()
        pipeline_code = (self.output_dir / "pipeline.py").read_text()
        return validator.validate_all(pipeline_code)

# Example: Generate complete production pipeline
spec = PipelineSpec(
    name='daily_sales_aggregation',
    description='Aggregate daily sales from Postgres to Snowflake',
    source={
        'type': 'postgresql',
        'host': 'db.company.com',
        'database': 'sales',
        'table': 'raw_transactions'
    },
    target={
        'type': 'snowflake',
        'account': 'company',
        'database': 'analytics',
        'schema': 'marts',
        'table': 'daily_sales'
    },
    transformations=[
        'Deduplicate by transaction_id',
        'Filter out refunds (amount < 0)',
        'Enrich with customer tier from Redis',
        'Aggregate by date and customer tier',
        'Calculate rolling 7-day average'
    ],
    schedule='0 2 * * *',  # 2 AM daily
    sla_minutes=30,
    owner='data-engineering@company.com',
    compliance=['GDPR', 'SOC2']
)

generator = ProductionPipelineGenerator(spec)
output_path = generator.generate_all()

print(f"\n✅ Pipeline generated successfully at: {output_path}")
print(f"\nNext steps:")
print(f"1. Review generated code: cd {output_path}")
print(f"2. Run tests: pytest test_pipeline.py")
print(f"3. Commit to Git: git add {output_path} && git commit")
print(f"4. Push to GitHub: git push (CI/CD will auto-deploy)")
```
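**Validating the Spec Before Generation:**

`ProductionPipelineGenerator` interpolates `spec.name` into directory names, Docker image tags and the DAG id, and reads `spec.source['type']` without checking that the key exists. A lightweight pre-check, sketched below, could catch a malformed spec before any LLM call is made. The rules (lowercase slug, 5-field cron or `@` preset, positive SLA, at least one transformation) are assumptions chosen for illustration, not requirements stated in this notebook, and `validate_spec` is a hypothetical helper. The demo uses `SimpleNamespace` objects with the same attribute names as `PipelineSpec` so the block runs on its own.

```python
import re
from types import SimpleNamespace
from typing import List

# Assumed naming rule: lowercase letters and digits joined by '.', '_' or '-'
_NAME_RE = re.compile(r"^[a-z0-9]+(?:[._-][a-z0-9]+)*$")


def validate_spec(spec) -> List[str]:
    """Return human-readable problems; an empty list means the spec looks usable."""
    problems: List[str] = []
    if not _NAME_RE.match(spec.name):
        problems.append(f"name {spec.name!r} is not a lowercase slug")
    # Accept five whitespace-separated fields or an '@' preset such as '@daily'
    if not spec.schedule.startswith("@") and len(spec.schedule.split()) != 5:
        problems.append(f"schedule {spec.schedule!r} is not a 5-field cron expression")
    if spec.sla_minutes <= 0:
        problems.append("sla_minutes must be positive")
    for label in ("source", "target"):
        if "type" not in getattr(spec, label):
            problems.append(f"{label} is missing the 'type' key")
    if not spec.transformations:
        problems.append("at least one transformation is required")
    return problems


good = SimpleNamespace(
    name="daily_sales_aggregation", schedule="0 2 * * *", sla_minutes=30,
    source={"type": "postgresql"}, target={"type": "snowflake"},
    transformations=["Deduplicate by transaction_id"],
)
bad = SimpleNamespace(
    name="Daily Sales", schedule="every day", sla_minutes=0,
    source={}, target={"type": "snowflake"}, transformations=[],
)
print(validate_spec(good))
for problem in validate_spec(bad):
    print("-", problem)
```

Collecting every problem, rather than raising on the first one, gives the requester a complete list to fix in a single pass.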

**Cost Optimization Strategies:**

```python
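# Strategy 1: Result caching (an identical spec reuses earlier output, with no new LLM call)
import json
from dataclasses import asdict
from functools import lru_cache

@lru_cache(maxsize=100)
def _generate_from_spec_json(spec_json: str) -> str:
    """Keyed on the serialized spec, because PipelineSpec holds dicts and is unhashable."""
    # generate_pipeline is assumed to exist; it is not defined in this section
    return generate_pipeline(PipelineSpec(**json.loads(spec_json)))

def generate_pipeline_cached(spec: PipelineSpec) -> str:
    """Cache generated code for identical specifications."""
    return _generate_from_spec_json(json.dumps(asdict(spec), sort_keys=True))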

# Strategy 2: Use cheaper models for simple pipelines
def select_model_by_complexity(spec: PipelineSpec) -> str:
    """Choose model based on pipeline complexity."""
    complexity_score = (
        len(spec.transformations) * 10 +
        (1 if spec.source['type'] == 'kafka' else 0) * 20 +
        (1 if 'HIPAA' in spec.compliance else 0) * 15
    )
    
    if complexity_score > 50:
        return "gpt-4o"  # Complex pipelines
    elif complexity_score > 20:
        return "gpt-4o-mini"  # Medium complexity
    else:
        return "gpt-3.5-turbo"  # Simple pipelines (legacy tier; confirm it is still cheaper than gpt-4o-mini)

# Strategy 3: Incremental generation (generate only changed parts)
def generate_incremental(old_spec: PipelineSpec, new_spec: PipelineSpec) -> str:
    """Only regenerate changed components."""
    changes = []
    if old_spec.source != new_spec.source:
        changes.append("extract function")
    if old_spec.transformations != new_spec.transformations:
        changes.append("transform function")
    if old_spec.target != new_spec.target:
        changes.append("load function")
    
    # Only regenerate changed functions, which cuts both prompt and completion tokens
    return generate_partial(new_spec, components=changes)
```
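
Caching as in Strategy 1 only pays off if equal specifications map to the same key, regardless of dictionary key order. The following is a minimal sketch of that idea, assuming the spec is available as a plain dict; `spec_cache_key` and `GenerationCache` are hypothetical names, and `generate` stands in for whatever LLM-backed generator is used.

```python
import hashlib
import json
from typing import Callable

def spec_cache_key(spec_dict: dict) -> str:
    """Return a stable SHA-256 key for a spec, independent of dict key order."""
    canonical = json.dumps(spec_dict, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class GenerationCache:
    """In-memory cache keyed by spec hash (hypothetical helper).

    `generate` is any callable that turns a spec dict into code.
    """

    def __init__(self, generate: Callable[[dict], str]) -> None:
        self._generate = generate
        self._store: dict = {}
        self.misses = 0

    def get(self, spec_dict: dict) -> str:
        key = spec_cache_key(spec_dict)
        if key not in self._store:
            self.misses += 1  # only a miss triggers the expensive generator
            self._store[key] = self._generate(spec_dict)
        return self._store[key]
```

An in-memory dict is lost when the process exits; persisting `_store` to disk or a shared key-value store would be the natural next step if several users generate similar pipelines.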

**Production Monitoring Dashboard (Grafana):**

```json
{
  "dashboard": {
    "title": "Generated ETL Pipeline Monitoring",
    "panels": [
      {
        "title": "Pipeline Executions",
        "targets": [{
          "expr": "rate(pipeline_executions_total[5m])"
        }]
      },
      {
        "title": "Success Rate",
        "targets": [{
          "expr": "rate(pipeline_executions_total{status='success'}[1h]) / rate(pipeline_executions_total[1h])"
        }]
      },
      {
        "title": "P95 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum by (le) (rate(pipeline_duration_seconds_bucket[5m])))"
        }]
      },
      {
        "title": "Rows Processed",
        "targets": [{
          "expr": "sum(pipeline_rows_processed_total)"
        }]
      },
      {
        "title": "Cost per Execution",
        "targets": [{
          "expr": "sum(increase(pipeline_cost_dollars[1d])) / sum(increase(pipeline_executions_total[1d]))"
        }]
      }
    ]
  }
}
```

---
**Author:** Luis J. Raigoso V. (LJRV)

## 1. Generating a basic ETL function

In [None]:
import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

def generate_etl_code(description: str) -> str:
    prompt = f'''
You are an expert in Python and ETL pipelines. Generate complete, runnable Python code.

Requirements:
{description}

Include:
- Required imports
- Error handling
- Basic logging
- Docstrings
- Type hints

Code:
'''
    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role':'user','content':prompt}],
        temperature=0.2
    )
    return resp.choices[0].message.content.strip().replace('```python','').replace('```','')

desc = '''
Function that:
1. Reads a sales CSV from S3 (boto3)
2. Keeps only rows where total > 0
3. Adds a "mes" column (YYYY-MM) derived from "fecha"
4. Writes to Parquet partitioned by mes
'''

code = generate_etl_code(desc)
print(code)
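
The chained `.replace()` calls in `generate_etl_code` remove the fence markers but keep any prose the model writes around the code, which then fails to parse. The following is a sketch of a stricter extraction, assuming the model wraps its code in Markdown fences; `extract_code` is a hypothetical helper, not part of the OpenAI SDK.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays readable

# An opening fence with an optional language tag, then everything up to the
# next fence (non-greedy, spanning lines).
_FENCE_RE = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)

def extract_code(response_text: str) -> str:
    """Return the fenced code from an LLM reply, or the stripped text if unfenced."""
    blocks = _FENCE_RE.findall(response_text)
    if blocks:
        # Join multiple blocks so code split across several fences survives.
        return "\n\n".join(block.strip() for block in blocks)
    return response_text.strip()
```

With this helper, the last line of `generate_etl_code` could return `extract_code(resp.choices[0].message.content)` instead of chaining replacements.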

## 2. Generating a Pandas transformation

In [None]:
transformation_spec = '''
Dataset: bank transactions (CSV)
Columns: trans_id, fecha, monto, tipo (debito/credito), cuenta_id

Transformations:
1. Convert fecha to datetime
2. Create an "anio_mes" column (YYYY-MM format)
3. Compute the running balance per cuenta_id, ordered by fecha
4. Add an is_anomaly flag when monto is more than 3 standard deviations from that account's mean
5. Export to CSV with UTF-8 encoding
'''

transform_code = generate_etl_code(transformation_spec)
print(transform_code[:500] + '...')  # Preview

## 3. Validating generated code

In [None]:
import ast
import os
import subprocess
import tempfile

def validate_python_syntax(code: str) -> tuple[bool, str]:
    """Validate Python syntax without executing the code."""
    try:
        ast.parse(code)
        return True, 'Valid syntax'
    except SyntaxError as e:
        return False, f'Syntax error: {e}'

def lint_code(code: str, tool: str = 'flake8') -> str:
    """Run a linter on the code (the tool must be installed)."""
    # Write to a unique temp file instead of a fixed /tmp path.
    with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False, encoding='utf-8') as f:
        f.write(code)
    try:
        result = subprocess.run([tool, f.name], capture_output=True, text=True)
        # flake8 exits non-zero when it finds issues and reports them on stdout.
        return result.stdout or result.stderr
    except OSError as e:
        return f'Linting error: {e}'
    finally:
        os.unlink(f.name)

valid, msg = validate_python_syntax(code)
print(f'{"✅" if valid else "❌"} {msg}')
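
Syntax and lint checks say nothing about whether generated code does something risky, such as evaluating strings or shelling out. The following is a sketch of an extra static pass over the AST; the deny-lists and the `find_risky_constructs` name are illustrative assumptions, and a check like this reduces risk but is no substitute for sandboxing or human review.

```python
import ast

# Illustrative deny-lists (assumptions); tune them to your own policy.
BANNED_CALLS = {"eval", "exec", "compile", "__import__"}
BANNED_MODULES = {"subprocess", "ctypes", "pickle"}

def find_risky_constructs(code: str) -> list[str]:
    """Return findings sorted by line number; an empty list means nothing was flagged."""
    found: list[tuple[int, str]] = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BANNED_MODULES:
                    found.append((node.lineno, f"import {alias.name}"))
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BANNED_MODULES:
                found.append((node.lineno, f"from {node.module} import ..."))
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id in BANNED_CALLS:
                found.append((node.lineno, f"call to {func.id}()"))
            # os.system("...") parses as Attribute(value=Name("os"), attr="system")
            elif (isinstance(func, ast.Attribute) and func.attr == "system"
                  and isinstance(func.value, ast.Name) and func.value.id == "os"):
                found.append((node.lineno, "call to os.system()"))
    return [f"line {lineno}: {message}" for lineno, message in sorted(found)]
```

A non-empty result can be fed back to the model as feedback or used to block the code from reaching the execution step.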

## 4. Generating an Airflow DAG

In [None]:
dag_spec = '''
Airflow DAG for:
- Name: ventas_daily_etl
- Schedule: daily at 2 AM
- Tasks:
  1. extract_s3: download ventas.csv from S3
  2. validate: Great Expectations (non-null columns, total > 0)
  3. transform: add mes, compute metrics
  4. load_db: insert into PostgreSQL (table ventas_daily)
  5. send_alert: email if any step fails
- Retries: 2 with a 5-minute delay
'''

dag_code = generate_etl_code(dag_spec)
print(dag_code[:600] + '...')
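
Because `dag_spec` names five tasks, a cheap structural check is to confirm that each expected `task_id` appears in the generated DAG before handing it to Airflow. The sketch below assumes the generated code passes `task_id` as a string-literal keyword argument (the classic operator style); DAGs written with TaskFlow decorators would need a different check. `find_task_ids` and `missing_tasks` are hypothetical helpers.

```python
import ast

# Task ids taken from dag_spec.
EXPECTED_TASKS = {"extract_s3", "validate", "transform", "load_db", "send_alert"}

def find_task_ids(dag_source: str) -> set[str]:
    """Collect string literals passed as task_id=... anywhere in the source."""
    task_ids: set[str] = set()
    for node in ast.walk(ast.parse(dag_source)):
        if isinstance(node, ast.Call):
            for kw in node.keywords:
                if (kw.arg == "task_id" and isinstance(kw.value, ast.Constant)
                        and isinstance(kw.value.value, str)):
                    task_ids.add(kw.value.value)
    return task_ids

def missing_tasks(dag_source: str, expected: set[str] = EXPECTED_TASKS) -> set[str]:
    """Return the expected task ids that the generated DAG does not define."""
    return expected - find_task_ids(dag_source)
```

Since the source is only parsed and never imported, the check runs without Airflow installed; a non-empty result from `missing_tasks(dag_code)` is a good reason to regenerate or refine.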

## 5. Iterating and improving with feedback

In [None]:
def refine_code(original_code: str, feedback: str) -> str:
    prompt = f'''
Improve this Python code according to the feedback:

Original code:
{original_code}

Feedback:
{feedback}

Improved code:
'''
    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role':'user','content':prompt}],
        temperature=0.1
    )
    return resp.choices[0].message.content.strip().replace('```python','').replace('```','')

feedback_example = '''
- Add retry with exponential backoff to the S3 download
- Use a context manager for the DB connection
- Log the number of rows processed
'''

improved = refine_code(code, feedback_example)
print(improved[:400] + '...')
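
Feedback does not have to come from a person: a validator's error message can drive the refinement automatically. The following is a minimal sketch of such a loop, assuming a `refine` callable with the same `(code, feedback)` shape as `refine_code`; `syntax_error` and `refine_until_valid` are hypothetical names, and only syntax is checked here.

```python
import ast
from typing import Callable, Optional

def syntax_error(code: str) -> Optional[str]:
    """Return a syntax error message, or None when the code parses."""
    try:
        ast.parse(code)
        return None
    except SyntaxError as exc:
        return str(exc)

def refine_until_valid(
    code: str,
    refine: Callable[[str, str], str],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Refine `code` until it parses or `max_rounds` refinements are used up.

    Returns the final code and the number of refinement rounds performed.
    Raises ValueError if the code is still invalid afterwards.
    """
    error = syntax_error(code)
    rounds_used = 0
    while error is not None and rounds_used < max_rounds:
        code = refine(code, f"Fix this error: {error}")
        rounds_used += 1
        error = syntax_error(code)
    if error is not None:
        raise ValueError(f"Code still invalid after {max_rounds} rounds: {error}")
    return code, rounds_used
```

Capping the rounds matters for cost, because every refinement is another model call; in the notebook, `refine_until_valid(code, refine_code)` would wire the loop to the function defined above.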

## 6. Generating unit tests

In [None]:
def generate_tests(code: str) -> str:
    prompt = f'''
Generate pytest unit tests for this code:

{code}

Include:
- A normal-case test (happy path)
- A test with empty data
- Tests with bad input (null values, wrong types)
- Mocks for external I/O (S3, DB)

Tests:
'''
    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role':'user','content':prompt}],
        temperature=0.1
    )
    return resp.choices[0].message.content.strip().replace('```python','').replace('```','')

test_code = generate_tests(code)
print(test_code[:500] + '...')

## 7. Best practices

- **Always review**: never run generated code without human inspection.
- **Automated validation**: syntax, linting, tests.
- **Versioning**: store generated code in Git with a descriptive commit message.
- **Templates**: use templates for a consistent structure.
- **Iteration**: refine with human feedback and regeneration.
- **Documentation**: generate a README and comments along with the code.
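
For the versioning practice, it helps if each generated file records where it came from, so a reviewer can tell generated code from hand-written code in a Git diff. The following is a sketch that writes the code with a provenance header; the header layout and the `save_generated` name are hypothetical conventions, not an established standard.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def save_generated(code: str, prompt: str, model: str, path: Path) -> Path:
    """Write generated code with a provenance comment header and return the path."""
    # A short prompt hash links the file to its prompt without storing the prompt.
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    header = (
        "# GENERATED FILE: review before running\n"
        f"# model: {model}\n"
        f"# prompt_sha256: {prompt_hash}\n"
        f"# generated_at: {timestamp}\n\n"
    )
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(header + code.rstrip() + "\n", encoding="utf-8")
    return path
```

For example, `save_generated(code, desc, 'gpt-4', Path('generated/ventas_etl.py'))` would store the output of section 1 ready to commit; the target path is only an illustration.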

## 8. Exercises

1. Generate a script that migrates data from MongoDB to PostgreSQL.
2. Create a Spark (PySpark) pipeline that reads Parquet and writes Delta Lake.
3. Automate the generation of a data quality report with Great Expectations.
4. Build an LLM-generated CLI (Click/Typer) for running ETLs.