# 📊 Observability and Data Lineage

Objective: implement end-to-end observability with structured logs, metrics (Prometheus), traces (OpenTelemetry), data lineage (OpenLineage/DataHub), and SLOs.

- Duration: 120–150 min
- Difficulty: High
- Prerequisites: Senior 01 (Governance), pipelines in production

### 📡 **Observability: From Monitoring to Full Visibility**

**Evolution of Monitoring**

```
┌────────────────────────────────────────────────────────┐
│  TRADITIONAL MONITORING (2000s)                        │
│  • Metrics: CPU, memory, disk                          │
│  • Dashboards: Static graphs                           │
│  • Alerts: Threshold-based (CPU > 80%)                 │
│  • Problem: "What is broken?" ❌                       │
├────────────────────────────────────────────────────────┤
│  OBSERVABILITY (2020s)                                 │
│  • Logs + Metrics + Traces (3 pillars)                 │
│  • Distributed tracing                                 │
│  • High-cardinality queries                            │
│  • Problem: "Why is it broken?" ✅                     │
└────────────────────────────────────────────────────────┘
```

**Three Pillars of Observability**

```
┌─────────────────────────────────────────────────────────┐
│  1. LOGS (Events)                                       │
│  • What: Discrete events (errors, warnings, debug)      │
│  • Format: Structured JSON with context                 │
│  • Tools: ELK, Loki, CloudWatch Logs                    │
│  • Query: "Show me all errors in pipeline X last hour"  │
│                                                          │
│  2. METRICS (Aggregates)                                │
│  • What: Numeric measurements over time                 │
│  • Types: Counter, Gauge, Histogram, Summary            │
│  • Tools: Prometheus, InfluxDB, Datadog                 │
│  • Query: "What's p99 latency for pipeline X?"          │
│                                                          │
│  3. TRACES (Flows)                                      │
│  • What: Request flow through distributed system        │
│  • Standard: OpenTelemetry                              │
│  • Tools: Jaeger, Zipkin, Tempo                         │
│  • Query: "Why is this request slow? Where?"            │
└─────────────────────────────────────────────────────────┘
```

**Structured Logging: From Strings to Context**

```python
# ❌ BAD: String logging
import logging
logging.info("Pipeline ventas_etl started")
logging.error("Validation failed with 5 errors: [1,2,3,4,5]")

# Output: Unparseable strings
# 2025-10-30 10:00:00 INFO Pipeline ventas_etl started
# 2025-10-30 10:01:00 ERROR Validation failed with 5 errors: [1,2,3,4,5]

# ✅ GOOD: Structured logging
from loguru import logger
import sys
import json

# Configure JSON output
logger.remove()
logger.add(
    sys.stdout,
    format="{time} {level} {message}",
    serialize=True  # JSON output
)

# Add context
logger.bind(
    pipeline="ventas_etl",
    version="v1.2.3",
    environment="production"
).info("pipeline_started")

logger.bind(
    pipeline="ventas_etl",
    error_count=5,
    sample_ids=[1, 2, 3, 4, 5],
    validation_type="schema"
).error("validation_failed")

# Output: Queryable JSON
"""
{
  "text": "pipeline_started",
  "record": {
    "time": {"timestamp": 1730286000.123},
    "level": {"name": "INFO"},
    "message": "pipeline_started",
    "extra": {
      "pipeline": "ventas_etl",
      "version": "v1.2.3",
      "environment": "production"
    }
  }
}
"""

# These records can now be queried, e.g. "show all errors where pipeline=ventas_etl AND error_count > 3"
```
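
As a minimal sketch of what "queryable" means here, the snippet below filters loguru-style serialized records using only the standard library. The `filter_logs` helper and the sample records are hypothetical illustrations; in practice this query would run in a log backend such as Elasticsearch or Loki rather than in Python.

```python
import json

def filter_logs(lines, level, **conditions):
    """Yield parsed records that match the level and every extra-field predicate."""
    for line in lines:
        try:
            record = json.loads(line)["record"]
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # skip lines that are not serialized log records
        if record.get("level", {}).get("name") != level:
            continue
        extra = record.get("extra", {})
        if all(key in extra and check(extra[key]) for key, check in conditions.items()):
            yield record

def make_line(level, message, **extra):
    """Build a sample line shaped like loguru's serialize=True output (abridged)."""
    return json.dumps({"text": message, "record": {
        "level": {"name": level}, "message": message, "extra": extra}})

sample_lines = [
    make_line("ERROR", "validation_failed", pipeline="ventas_etl", error_count=5),
    make_line("INFO", "pipeline_started", pipeline="ventas_etl"),
    make_line("ERROR", "validation_failed", pipeline="ventas_etl", error_count=2),
    "plain text line that is not JSON",
]

# "Show all errors where pipeline=ventas_etl AND error_count > 3"
matches = list(filter_logs(
    sample_lines,
    "ERROR",
    pipeline=lambda value: value == "ventas_etl",
    error_count=lambda value: value > 3,
))
for record in matches:
    print(record["message"], record["extra"])
```

Because every field is a typed key instead of a fragment of a sentence, the filter needs no regular expressions and keeps working when the message wording changes.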

**Logging Best Practices for Data Pipelines**

```python
from loguru import logger
from datetime import datetime
import traceback
from functools import wraps

class DataPipelineLogger:
    """Logging wrapper for data pipelines"""
    
    def __init__(self, pipeline_name: str, run_id: str):
        self.pipeline_name = pipeline_name
        self.run_id = run_id
        
        # Configure logger
        logger.remove()
        logger.add(
            f"logs/{pipeline_name}/{{time}}.log",
            rotation="500 MB",
            retention="30 days",
            serialize=True,
            format="{time} {level} {message}"
        )
        
        # Add common context
        self.logger = logger.bind(
            pipeline=pipeline_name,
            run_id=run_id,
            environment="production"
        )
    
    def log_stage_start(self, stage: str, **kwargs):
        """Log the start of a stage"""
        self.logger.info(
            "stage_started",
            stage=stage,
            timestamp=datetime.now().isoformat(),
            **kwargs
        )

    def log_stage_complete(self, stage: str, records_processed: int, duration_sec: float):
        """Log the completion of a stage"""
        self.logger.info(
            "stage_completed",
            stage=stage,
            records_processed=records_processed,
            duration_sec=duration_sec,
            # Guard against a zero duration on very fast stages
            throughput_rps=records_processed / duration_sec if duration_sec > 0 else None
        )

    def log_data_quality_check(self, check_name: str, passed: bool, details: dict):
        """Log a data quality validation result"""
        level = "info" if passed else "warning"
        getattr(self.logger, level)(
            "quality_check",
            check_name=check_name,
            passed=passed,
            **details
        )

    def log_error(self, stage: str, error: Exception, **kwargs):
        """Log an error with its stack trace (call from inside an except block)"""
        self.logger.error(
            "stage_failed",
            stage=stage,
            error_type=type(error).__name__,
            error_message=str(error),
            stacktrace=traceback.format_exc(),
            **kwargs
        )

# Decorator for automatic logging
def log_execution(stage_name: str):
    """Decorator to log function execution"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            pipeline_logger = kwargs.get('logger')
            
            if pipeline_logger:
                pipeline_logger.log_stage_start(stage_name)
            
            start = datetime.now()
            try:
                result = func(*args, **kwargs)
                
                if pipeline_logger:
                    duration = (datetime.now() - start).total_seconds()
                    records = result.get('records_processed', 0) if isinstance(result, dict) else 0
                    pipeline_logger.log_stage_complete(stage_name, records, duration)
                
                return result
            
            except Exception as e:
                if pipeline_logger:
                    pipeline_logger.log_error(stage_name, e)
                raise
        
        return wrapper
    return decorator

# Usage (named pipeline_logger to avoid shadowing loguru's `logger`)
import pandas as pd

pipeline_logger = DataPipelineLogger("ventas_etl", run_id="run_20251030_100000")

@log_execution("extract")
def extract_data(logger=None):
    # Extraction logic
    return {"records_processed": 10000}

@log_execution("transform")
def transform_data(df, logger=None):
    # Transformation logic

    # Log data quality check
    null_count = int(df.isnull().sum().sum())
    if logger:
        logger.log_data_quality_check(
            "null_check",
            passed=null_count == 0,
            details={"null_count": null_count}
        )

    return {"records_processed": len(df)}

# Execute (the DataFrame is illustrative sample data)
df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, None, 5.5]})
extract_data(logger=pipeline_logger)
transform_data(df, logger=pipeline_logger)
```

**Correlation IDs: Tracing Across Services**

```python
import uuid
from contextvars import ContextVar
from typing import Optional

# Context-local variable: each thread and each asyncio task sees its own value
trace_id_var: ContextVar[Optional[str]] = ContextVar('trace_id', default=None)

def generate_trace_id() -> str:
    """Generate unique trace ID"""
    return str(uuid.uuid4())

def set_trace_id(trace_id: Optional[str] = None) -> str:
    """Set trace ID for current context"""
    if trace_id is None:
        trace_id = generate_trace_id()
    trace_id_var.set(trace_id)
    return trace_id

def get_trace_id() -> Optional[str]:
    """Get trace ID from current context (None if not set)"""
    return trace_id_var.get()

# FastAPI middleware to propagate trace ID
from fastapi import Request
from loguru import logger
from starlette.middleware.base import BaseHTTPMiddleware

class TraceIdMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        # Reuse the caller's trace ID if present, otherwise generate one
        trace_id = request.headers.get('X-Trace-Id') or generate_trace_id()
        set_trace_id(trace_id)

        # logger.bind() returns a new logger and has no effect if the result is
        # discarded; contextualize() attaches trace_id to every log in this context
        with logger.contextualize(trace_id=trace_id):
            response = await call_next(request)

        # Add to response header
        response.headers['X-Trace-Id'] = trace_id

        return response

# Propagate to downstream services
import httpx

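async def call_downstream_service(url: str):
    """Call another service, forwarding the current trace ID"""
    headers = {
        # Fall back to a fresh ID so the header value is never None
        'X-Trace-Id': get_trace_id() or generate_trace_id()
    }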
    
    async with httpx.AsyncClient() as client:
        response = await client.get(url, headers=headers)
        return response.json()

# Now all logs across services share same trace_id → easy debugging!
```
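
The reason a `ContextVar` is used instead of a module-level global can be shown with a small standard-library sketch. The `handle_request` coroutine and the request names are hypothetical; the point is only that two concurrent asyncio tasks each keep their own trace ID.

```python
import asyncio
import uuid
from contextvars import ContextVar
from typing import Dict, Optional

trace_id_var: ContextVar[Optional[str]] = ContextVar("trace_id", default=None)

async def handle_request(name: str, seen: Dict[str, Optional[str]]) -> None:
    """Simulate one request: set a trace ID, yield control, then read it back."""
    trace_id_var.set(f"{name}-{uuid.uuid4().hex[:8]}")
    await asyncio.sleep(0)  # let the other task run and set its own value
    seen[name] = trace_id_var.get()

async def main() -> Dict[str, Optional[str]]:
    seen: Dict[str, Optional[str]] = {}
    # gather() wraps each coroutine in a task, and each task gets a copy of the context
    await asyncio.gather(handle_request("a", seen), handle_request("b", seen))
    return seen

seen = asyncio.run(main())
print(seen)
```

Each task reads back the value it set, even though the other task wrote to the same variable in between. A plain global would have been overwritten by whichever request ran last.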

**Real-World Case: Uber's Observability Platform**

Uber processes **10M+ events/second** with:
- **Logs**: JSON structured logs → Kafka → HDFS (long-term) + Elasticsearch (search)
- **Metrics**: Prometheus (4M+ time series), M3DB (distributed storage)
- **Traces**: Jaeger (100K+ traces/sec), custom sampling
- **Cost**: $50M/year in observability infrastructure
- **ROI**: Reduces MTTR (Mean Time To Recovery) from 2h → 15min

---
**Author:** Luis J. Raigoso V. (LJRV)

### 📊 **Metrics & Prometheus: Measure What Matters**

**Metric Types**

```
┌────────────────────────────────────────────────────────┐
│  COUNTER (monotonic, only increases)                   │
│  • events_processed_total                              │
│  • errors_total                                        │
│  • bytes_transferred_total                             │
│  Query: rate(events_processed_total[5m])               │
├────────────────────────────────────────────────────────┤
│  GAUGE (can go up or down)                             │
│  • active_connections                                  │
│  • queue_size                                          │
│  • memory_usage_bytes                                  │
│  Query: avg_over_time(queue_size[1h])                  │
├────────────────────────────────────────────────────────┤
│  HISTOGRAM (distribution of values)                    │
│  • request_duration_seconds                            │
│  • record_size_bytes                                   │
│  Query: histogram_quantile(0.99, ...)  # p99           │
├────────────────────────────────────────────────────────┤
│  SUMMARY (client-side percentiles)                     │
│  • processing_time_seconds                             │
│  Query: processing_time_seconds{quantile="0.99"}       │
└────────────────────────────────────────────────────────┘
```
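
To make the histogram row concrete, here is a simplified sketch of how a quantile is estimated from cumulative bucket counts by linear interpolation, which is the idea behind `histogram_quantile`. The function below is an illustration and not Prometheus's implementation, and the bucket counts are made up.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(upper):
                # The rank falls in the +Inf bucket: the best available answer
                # is the highest finite bound
                return prev_bound
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound

# Hypothetical cumulative counts for request_duration_seconds
buckets = [(0.5, 10), (1.0, 60), (2.5, 90), (math.inf, 100)]
p50 = estimate_quantile(0.50, buckets)
p99 = estimate_quantile(0.99, buckets)
print(p50, p99)
```

In this sample the p99 cannot exceed 2.5 because that is the largest finite bucket, which is why bucket boundaries should cover the latencies you expect to alert on.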

**Prometheus Client Implementation**

```python
from prometheus_client import Counter, Gauge, Histogram, Summary
from prometheus_client import start_http_server, CollectorRegistry
import time
import random

# Create registry (allows multiple apps in same process)
registry = CollectorRegistry()

# 1. COUNTER: Monotonic increasing
events_processed = Counter(
    'events_processed_total',
    'Total number of events processed',
    ['pipeline', 'stage'],
    registry=registry
)

errors_total = Counter(
    'errors_total',
    'Total number of errors',
    ['pipeline', 'stage', 'error_type'],
    registry=registry
)

# 2. GAUGE: Current value
active_jobs = Gauge(
    'active_jobs',
    'Number of active jobs',
    ['pipeline'],
    registry=registry
)

queue_size = Gauge(
    'queue_size',
    'Number of items in queue',
    ['queue_name'],
    registry=registry
)

# 3. HISTOGRAM: Distribution with buckets
processing_duration = Histogram(
    'processing_duration_seconds',
    'Time spent processing records',
    ['pipeline', 'stage'],
    buckets=(0.1, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0, 120.0),
    registry=registry
)

record_size = Histogram(
    'record_size_bytes',
    'Size of processed records',
    ['pipeline'],
    buckets=(100, 1000, 10000, 100000, 1000000),
    registry=registry
)

# 4. SUMMARY: Client-side percentiles
processing_time = Summary(
    'processing_time_seconds',
    'Processing time summary',
    ['pipeline'],
    registry=registry
)

# Instrumented data pipeline
class InstrumentedPipeline:
    """Data pipeline with Prometheus metrics"""
    
    def __init__(self, pipeline_name: str):
        self.pipeline_name = pipeline_name
    
    def run_stage(self, stage_name: str, records: list):
        """Execute stage with metrics"""
        
        # Track active jobs
        active_jobs.labels(pipeline=self.pipeline_name).inc()
        
        try:
            # Time execution
            with processing_duration.labels(
                pipeline=self.pipeline_name,
                stage=stage_name
            ).time():
                
                # Process records
                for record in records:
                    # Increment counter
                    events_processed.labels(
                        pipeline=self.pipeline_name,
                        stage=stage_name
                    ).inc()
                    
                    # Record size
                    record_size.labels(pipeline=self.pipeline_name).observe(
                        len(str(record))
                    )
                    
                    # Simulate processing
                    time.sleep(random.uniform(0.001, 0.01))
        
        except Exception as e:
            # Track errors
            errors_total.labels(
                pipeline=self.pipeline_name,
                stage=stage_name,
                error_type=type(e).__name__
            ).inc()
            raise
        
        finally:
            # Decrease active jobs
            active_jobs.labels(pipeline=self.pipeline_name).dec()

# Start metrics server
start_http_server(8000, registry=registry)
# Metrics available at: http://localhost:8000/metrics

# Run pipeline
pipeline = InstrumentedPipeline("ventas_etl")
pipeline.run_stage("extract", [{"id": i} for i in range(1000)])
pipeline.run_stage("transform", [{"id": i} for i in range(1000)])

# Prometheus scrapes /metrics at its configured scrape_interval (e.g. every 15s)
# Example output (abridged):
"""
# HELP events_processed_total Total number of events processed
# TYPE events_processed_total counter
events_processed_total{pipeline="ventas_etl",stage="extract"} 1000.0
events_processed_total{pipeline="ventas_etl",stage="transform"} 1000.0

# HELP processing_duration_seconds Time spent processing records
# TYPE processing_duration_seconds histogram
processing_duration_seconds_bucket{pipeline="ventas_etl",stage="extract",le="0.1"} 0.0
processing_duration_seconds_bucket{pipeline="ventas_etl",stage="extract",le="0.5"} 0.0
processing_duration_seconds_bucket{pipeline="ventas_etl",stage="extract",le="1.0"} 0.0
processing_duration_seconds_bucket{pipeline="ventas_etl",stage="extract",le="2.5"} 0.0
processing_duration_seconds_bucket{pipeline="ventas_etl",stage="extract",le="5.0"} 1.0
processing_duration_seconds_bucket{pipeline="ventas_etl",stage="extract",le="+Inf"} 1.0
processing_duration_seconds_sum{pipeline="ventas_etl",stage="extract"} 4.523
processing_duration_seconds_count{pipeline="ventas_etl",stage="extract"} 1.0
"""
```

**PromQL Queries for Data Engineers**

```python
# Common queries
promql_queries = {
    # Throughput (events/second)
    "throughput": """
        rate(events_processed_total{pipeline="ventas_etl"}[5m])
    """,
    
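    # Error rate (%). Both sides are aggregated with sum() because errors_total
    # carries an extra error_type label, so a bare division would match no series
    "error_rate": """
        (
            sum(rate(errors_total{pipeline="ventas_etl"}[5m]))
            / 
            sum(rate(events_processed_total{pipeline="ventas_etl"}[5m]))
        ) * 100
    """,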
    
    # P99 latency
    "p99_latency": """
        histogram_quantile(0.99, 
            rate(processing_duration_seconds_bucket{pipeline="ventas_etl"}[5m])
        )
    """,
    
    # P50 (median) latency
    "p50_latency": """
        histogram_quantile(0.50, 
            rate(processing_duration_seconds_bucket{pipeline="ventas_etl"}[5m])
        )
    """,
    
    # Total errors last hour
    "errors_1h": """
        increase(errors_total{pipeline="ventas_etl"}[1h])
    """,
    
    # Average queue size
    "avg_queue_size": """
        avg_over_time(queue_size{queue_name="events"}[5m])
    """,
    
    # Success rate (inverse of error rate)
    "success_rate": """
        (1 - (
            sum(rate(errors_total[5m]))
            /
            sum(rate(events_processed_total[5m]))
        )) * 100
    """,
    
    # Top 5 slowest pipelines
    "slowest_pipelines": """
        topk(5, 
            avg by (pipeline) (
                rate(processing_duration_seconds_sum[5m])
                /
                rate(processing_duration_seconds_count[5m])
            )
        )
    """
}
```
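
Most of the queries above rely on `rate()` over a counter. The sketch below shows the core idea under simplifying assumptions: it sums increases between consecutive samples and treats any drop as a counter reset. The real PromQL function also extrapolates to the window boundaries, which is omitted here, and the sample values are invented.

```python
from typing import List, Optional, Tuple

def simple_rate(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Per-second increase of a counter from (timestamp_s, value) samples, oldest first."""
    if len(samples) < 2:
        return None  # a rate needs at least two points
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter went down, so the process restarted and counted from
            # zero again: everything in `current` is new increase
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None

# Scrapes every 15s; the pipeline restarts between t=15 and t=30
samples = [(0, 100), (15, 160), (30, 20), (45, 80)]
events_per_second = simple_rate(samples)
print(events_per_second)
```

This is why counters should be queried through `rate()` or `increase()` rather than read raw: a restart resets the raw value but does not distort the computed rate.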

**Alerting Rules (Prometheus Alertmanager)**

```yaml
# prometheus_rules.yml
groups:
  - name: data_pipeline_alerts
    interval: 30s
    rules:
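      # High error rate (aggregate by pipeline so both sides share the same labels)
      - alert: HighErrorRate
        expr: |
          (
            sum by (pipeline) (rate(errors_total{pipeline="ventas_etl"}[5m]))
            /
            sum by (pipeline) (rate(events_processed_total{pipeline="ventas_etl"}[5m]))
          ) > 0.05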
        for: 5m
        labels:
          severity: critical
          team: data-engineering
        annotations:
          summary: "High error rate in {{ $labels.pipeline }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
      
      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(processing_duration_seconds_bucket{pipeline="ventas_etl"}[5m])
          ) > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency in {{ $labels.pipeline }}"
          description: "P99 latency is {{ $value }}s (threshold: 30s)"
      
      # Pipeline stuck (no events processed)
      - alert: PipelineStuck
        expr: |
          rate(events_processed_total{pipeline="ventas_etl"}[5m]) == 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Pipeline {{ $labels.pipeline }} is stuck"
          description: "No events processed in last 15 minutes"
      
      # Queue too large
      - alert: QueueTooLarge
        expr: queue_size{queue_name="events"} > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue {{ $labels.queue_name }} is too large"
          description: "Queue size: {{ $value }}"
```
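
The `for:` clause in each rule means the expression must stay true across consecutive evaluations before the alert fires. A rough sketch of that state machine follows; the function name and the evaluation results are hypothetical, and this illustrates the semantics rather than Prometheus's own code.

```python
from typing import List

def alert_states(condition_by_eval: List[bool], eval_interval_s: int, for_s: int) -> List[str]:
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for index, is_true in enumerate(condition_by_eval):
        now = index * eval_interval_s
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states

# Evaluated every 30s with `for: 60s`; one dip at the third evaluation
states = alert_states([True, True, False, True, True, True], eval_interval_s=30, for_s=60)
print(states)
```

A single healthy evaluation resets the timer, so short spikes stay in `pending` and never page anyone. Longer `for:` values trade detection speed for fewer false alarms.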

**Grafana Dashboard (JSON)**

```python
grafana_dashboard = {
    "dashboard": {
        "title": "Data Pipeline Observability",
        "panels": [
            {
                "id": 1,
                "title": "Throughput (events/sec)",
                "targets": [
                    {
                        "expr": 'sum(rate(events_processed_total[5m])) by (pipeline)',
                        "legendFormat": "{{ pipeline }}"
                    }
                ],
                "type": "graph"
            },
            {
                "id": 2,
                "title": "Error Rate (%)",
                "targets": [
                    {
                        "expr": '''
                            (
                                sum(rate(errors_total[5m])) by (pipeline)
                                /
                                sum(rate(events_processed_total[5m])) by (pipeline)
                            ) * 100
                        ''',
                        "legendFormat": "{{ pipeline }}"
                    }
                ],
                "type": "graph",
                "alert": {
                    "conditions": [
                        {
                            "evaluator": {"type": "gt", "params": [5]},
                            "query": {"params": ["A", "5m", "now"]}
                        }
                    ]
                }
            },
            {
                "id": 3,
                "title": "P50, P95, P99 Latency",
                "targets": [
                    {
                        "expr": 'histogram_quantile(0.50, rate(processing_duration_seconds_bucket[5m]))',
                        "legendFormat": "P50"
                    },
                    {
                        "expr": 'histogram_quantile(0.95, rate(processing_duration_seconds_bucket[5m]))',
                        "legendFormat": "P95"
                    },
                    {
                        "expr": 'histogram_quantile(0.99, rate(processing_duration_seconds_bucket[5m]))',
                        "legendFormat": "P99"
                    }
                ],
                "type": "graph"
            },
            {
                "id": 4,
                "title": "Active Jobs",
                "targets": [
                    {
                        "expr": 'sum(active_jobs) by (pipeline)',
                        "legendFormat": "{{ pipeline }}"
                    }
                ],
                "type": "gauge"
            }
        ]
    }
}
```

**Custom Metrics for Data Quality**

```python
from prometheus_client import Gauge

# Data quality metrics
data_quality_score = Gauge(
    'data_quality_score',
    'Data quality score (0-100)',
    ['pipeline', 'table', 'check_type']
)

null_percentage = Gauge(
    'null_percentage',
    'Percentage of null values',
    ['pipeline', 'table', 'column']
)

duplicate_count = Gauge(
    'duplicate_count',
    'Number of duplicate records',
    ['pipeline', 'table']
)

# Update metrics
def run_quality_checks(df, pipeline_name, table_name):
    """Run data quality checks and emit metrics"""
    if len(df) == 0 or len(df.columns) == 0:
        return  # nothing to measure; also avoids division by zero

    # Null check (per column)
    null_pcts = []
    for col in df.columns:
        null_pct = (df[col].isnull().sum() / len(df)) * 100
        null_pcts.append(null_pct)
        null_percentage.labels(
            pipeline=pipeline_name,
            table=table_name,
            column=col
        ).set(null_pct)

    # Duplicate check
    dup_count = df.duplicated().sum()
    duplicate_count.labels(
        pipeline=pipeline_name,
        table=table_name
    ).set(dup_count)

    # Overall quality score: penalize the average null percentage across
    # all columns plus the percentage of duplicate rows
    avg_null_pct = sum(null_pcts) / len(null_pcts)
    score = 100 - (avg_null_pct + (dup_count / len(df) * 100))
    data_quality_score.labels(
        pipeline=pipeline_name,
        table=table_name,
        check_type='overall'
    ).set(max(0, score))

---

### 🔍 **Distributed Tracing: OpenTelemetry**

**Why Distributed Tracing?**

```
Traditional Logs (per service):
Service A: [10:00:01] Request received
Service B: [10:00:02] Processing data
Service C: [10:00:05] Database query
❌ Can't connect these logs across services!

Distributed Tracing:
Trace ID: abc123
├─ Span: API Request (Service A) [200ms]
│  ├─ Span: Extract Data (Service B) [150ms]
│  │  └─ Span: S3 Download [100ms]
│  └─ Span: Transform Data (Service C) [50ms]
│     └─ Span: DB Query [30ms]
✅ Full request flow visualization!
```

**OpenTelemetry Architecture**

```
┌──────────────────────────────────────────────────────┐
│  Application Code                                    │
│  • Instrumented with OpenTelemetry SDK               │
├──────────────────────────────────────────────────────┤
│  OpenTelemetry Collector                             │
│  • Receives traces from apps                         │
│  • Processes (sampling, filtering)                   │
│  • Exports to backends                               │
├──────────────────────────────────────────────────────┤
│  Backend Storage                                     │
│  • Jaeger (open source)                              │
│  • Tempo (Grafana)                                   │
│  • Datadog, New Relic, etc.                          │
└──────────────────────────────────────────────────────┘
```

**OpenTelemetry Implementation**

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
import requests
import time

# 1. SETUP TRACER
def setup_tracing(service_name: str):
    """Initialize OpenTelemetry tracing"""
    
    # Create resource (service metadata)
    resource = Resource.create({
        "service.name": service_name,
        "service.version": "1.2.3",
        "deployment.environment": "production"
    })
    
    # Create tracer provider
    tracer_provider = TracerProvider(resource=resource)
    
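    # Configure Jaeger exporter (Thrift over UDP to a local agent).
    # Note: this exporter is deprecated in recent OpenTelemetry Python releases;
    # Jaeger also accepts OTLP, so an OTLP span exporter is the usual replacement.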
    jaeger_exporter = JaegerExporter(
        agent_host_name="localhost",
        agent_port=6831,
    )
    
    # Add span processor
    span_processor = BatchSpanProcessor(jaeger_exporter)
    tracer_provider.add_span_processor(span_processor)
    
    # Set global tracer provider
    trace.set_tracer_provider(tracer_provider)
    
    # Auto-instrument libraries
    RequestsInstrumentor().instrument()  # HTTP calls
    SQLAlchemyInstrumentor().instrument()  # DB queries
    
    return trace.get_tracer(__name__)

# Initialize
tracer = setup_tracing("ventas_etl_pipeline")

# 2. MANUAL INSTRUMENTATION
def process_data_with_tracing(data):
    """Pipeline stage with tracing"""
    
    # Create parent span
    with tracer.start_as_current_span("process_data") as span:
        # Add attributes
        span.set_attribute("data.size", len(data))
        span.set_attribute("pipeline.name", "ventas_etl")
        
        # Extract
        with tracer.start_as_current_span("extract") as extract_span:
            extract_span.set_attribute("source", "s3")
            raw_data = extract_from_s3(data)
            extract_span.set_attribute("records.count", len(raw_data))
        
        # Transform
        with tracer.start_as_current_span("transform") as transform_span:
            transform_span.set_attribute("transformation.type", "cleansing")
            
            try:
                cleaned_data = transform(raw_data)
                transform_span.set_status(trace.Status(trace.StatusCode.OK))
            except Exception as e:
                transform_span.set_status(
                    trace.Status(trace.StatusCode.ERROR, str(e))
                )
                transform_span.record_exception(e)
                raise
        
        # Load
        with tracer.start_as_current_span("load") as load_span:
            load_span.set_attribute("destination", "postgres")
            load_to_database(cleaned_data)
        
        # Add event
        span.add_event("processing_completed", {
            "records_processed": len(cleaned_data)
        })
        
        return cleaned_data

# 3. CONTEXT PROPAGATION (across services)
def call_downstream_service(url: str):
    """Propagate the current trace context to a downstream service"""
    
    from opentelemetry.propagate import inject
    
    # Create child span
    with tracer.start_as_current_span("downstream_call") as span:
        span.set_attribute("http.url", url)
        
        # Inject trace context into headers
        headers = {}
        inject(headers)
        
        # Make HTTP call (trace context propagated)
        response = requests.get(url, headers=headers)
        
        span.set_attribute("http.status_code", response.status_code)
        
        return response.json()

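# 4. BAGGAGE (cross-cutting context)
from opentelemetry import baggage, context

# set_baggage() returns a new Context instead of mutating the current one,
# so chain the calls and attach the result to make it current
ctx = baggage.set_baggage("user.id", "user123")
ctx = baggage.set_baggage("tenant.id", "tenant456", context=ctx)
token = context.attach(ctx)

# Readable wherever this context is current (in downstream services, after
# the propagated headers have been extracted)
user_id = baggage.get_baggage("user.id")
tenant_id = baggage.get_baggage("tenant.id")

# Restore the previous context when done
context.detach(token)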
```
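
By default, `inject()` writes the trace context as a W3C Trace Context `traceparent` header. The sketch below builds and parses that header by hand with the standard library, only to show what travels between services. The helper names are hypothetical, and real code should rely on the OpenTelemetry propagators instead.

```python
import re
import secrets
from typing import Dict, Optional

# version "00" - 32 hex trace ID - 16 hex parent span ID - 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def build_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> Optional[Dict[str, object]]:
    """Return the parsed fields, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }

def child_headers(incoming: Optional[str]) -> Dict[str, str]:
    """Keep the incoming trace ID and sampling flag, but mint a new span ID."""
    parent = parse_traceparent(incoming) if incoming else None
    trace_id = str(parent["trace_id"]) if parent else secrets.token_hex(16)
    sampled = bool(parent["sampled"]) if parent else True
    return {"traceparent": build_traceparent(trace_id, secrets.token_hex(8), sampled)}

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(incoming)
outgoing = child_headers(incoming)
print(parsed)
print(outgoing)
```

The trace ID stays constant across every hop while the span ID changes at each one, which is what lets a backend such as Jaeger stitch spans from different services into a single tree.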

**Sampling Strategies**

```python
import random

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    ALWAYS_OFF,
    ALWAYS_ON,
    Decision,
    ParentBased,
    Sampler,
    SamplingResult,
    TraceIdRatioBased,
)

# 1. ALWAYS ON (dev/staging); ALWAYS_ON is already a sampler instance
sampler = ALWAYS_ON

# 2. PROBABILITY-BASED (10% of traces)
sampler = TraceIdRatioBased(0.1)

# 3. PARENT-BASED (follow parent decision)
sampler = ParentBased(
    root=TraceIdRatioBased(0.1),  # 10% for root spans
    remote_parent_sampled=ALWAYS_ON,  # If parent sampled, sample
    remote_parent_not_sampled=ALWAYS_OFF  # Else, don't sample
)

# 4. CUSTOM SAMPLER (sample errors always)
# Note: samplers run when a span starts, so only attributes passed at span
# creation are visible here; errors raised later cannot change the decision.
class ErrorSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        # Always sample if the span is flagged as an error at creation
        if attributes and attributes.get("error"):
            return SamplingResult(Decision.RECORD_AND_SAMPLE)

        # 1% for normal traces
        if random.random() < 0.01:
            return SamplingResult(Decision.RECORD_AND_SAMPLE)

        return SamplingResult(Decision.DROP)

    def get_description(self) -> str:
        return "ErrorSampler"

# Apply to tracer provider
tracer_provider = TracerProvider(sampler=ErrorSampler())
```
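
Ratio-based samplers are usually deterministic: the decision is derived from the trace ID itself, so every service that sees the same trace reaches the same verdict without coordination. The sketch below illustrates that idea under the assumption that the low 64 bits of the ID are compared against a threshold. It is not a copy of the SDK's code.

```python
import random

TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the low 64 bits of a 128-bit trace ID

def ratio_sample(trace_id: int, ratio: float) -> bool:
    """Sample when the low 64 bits of the trace ID fall below ratio * 2**64."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound

# Simulate 10,000 random 128-bit trace IDs with a fixed seed for repeatability
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
sampled_fraction = sum(ratio_sample(t, 0.1) for t in trace_ids) / len(trace_ids)
print(f"sampled fraction: {sampled_fraction:.3f}")

# The same trace ID always gets the same decision, in any service
first_decision = ratio_sample(trace_ids[0], 0.1)
second_decision = ratio_sample(trace_ids[0], 0.1)
```

Compare this with `random.random() < 0.01` in the custom sampler above. A per-process coin flip can sample a span in one service and drop its children in another, whereas a trace-ID-based decision keeps traces complete.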

**Spark Instrumentation**

```python
from opentelemetry import trace
from pyspark.sql import SparkSession

tracer = trace.get_tracer(__name__)

def spark_job_with_tracing():
    """Spark job with OpenTelemetry"""
    
    with tracer.start_as_current_span("spark_job") as job_span:
        job_span.set_attribute("spark.app.name", "ventas_etl")
        
        # Create Spark session
        spark = SparkSession.builder.appName("ventas_etl").getOrCreate()
        
        # Read
        with tracer.start_as_current_span("spark.read") as read_span:
            df = spark.read.parquet("s3://bucket/raw/")
            # count() triggers a full Spark job; drop it if the extra scan is too costly
            read_span.set_attribute("records.count", df.count())
        
        # Transform (lazy: this span times building the plan, not executing it)
        with tracer.start_as_current_span("spark.transform") as transform_span:
            df_transformed = df.filter(df['amount'] > 0)
            transform_span.set_attribute("transformation", "filter_positive")
        
        # Write
        with tracer.start_as_current_span("spark.write") as write_span:
            df_transformed.write.mode("overwrite").parquet("s3://bucket/curated/")
            write_span.set_attribute("output.path", "s3://bucket/curated/")

# Airflow instrumentation
from airflow import DAG
from airflow.operators.python import PythonOperator

def task_with_tracing(**context):
    """Airflow task with tracing"""
    
    # Extract trace context from Airflow
    dag_id = context['dag'].dag_id
    task_id = context['task'].task_id
    run_id = context['run_id']
    
    with tracer.start_as_current_span(f"airflow.task.{task_id}") as span:
        span.set_attribute("airflow.dag_id", dag_id)
        span.set_attribute("airflow.run_id", run_id)
        
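        # Task logic (process_data is a placeholder for the real work)
        result = process_data()
        
        # Event attribute values must be primitives, so stringify the result
        span.add_event("task_completed", {"result": str(result)})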

dag = DAG('ventas_etl', schedule_interval='@daily')

task = PythonOperator(
    task_id='process',
    python_callable=task_with_tracing,
    dag=dag
)
```

**Jaeger Query Examples**

```bash
# Find traces with errors
service=ventas_etl error=true

# Find slow traces (>5s duration)
service=ventas_etl minDuration=5s

# Find traces for specific user
service=ventas_etl baggage.user.id=user123

# Find traces with high DB latency
service=ventas_etl operation=db.query minDuration=1s
```

**Trace Analysis Patterns**

```python
# Pattern 1: Find bottlenecks
def analyze_trace_bottlenecks(trace_data):
    """Identify slowest spans in trace"""
    
    spans = sorted(trace_data['spans'], key=lambda x: x['duration'], reverse=True)
    
    bottlenecks = []
    for span in spans[:5]:
        bottlenecks.append({
            'operation': span['operationName'],
            'duration_ms': span['duration'] / 1000,
            'percentage': (span['duration'] / trace_data['duration']) * 100
        })
    
    return bottlenecks

# Pattern 2: Critical path
def find_critical_path(trace_data):
    """Find longest path through trace (sum of span durations below the root)"""
    
    def build_span_tree(spans):
        # Map parent spanID -> list of child spans
        tree = {}
        for span in spans:
            refs = span.get('references') or []  # root spans have no references
            parent_id = refs[0].get('spanID') if refs else None
            tree.setdefault(parent_id, []).append(span)
        return tree
    
    tree = build_span_tree(trace_data['spans'])
    
    def longest_path(span_id):
        children = tree.get(span_id, [])
        if not children:
            return 0
        return max(child['duration'] + longest_path(child['spanID']) for child in children)
    
    # The root is the span without a parent reference, not necessarily spans[0]
    root = tree[None][0]
    critical_path_duration = longest_path(root['spanID'])
    
    return critical_path_duration
```
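
Pattern 1 ranks spans by total duration, which tends to surface parent spans because a parent's duration includes the time spent in its children. A complementary view is each span's self time (its duration minus that of its direct children). The following is a minimal sketch, assuming Jaeger-style span dicts (`spanID`, `duration` in microseconds, `references`) and child spans that do not overlap. `compute_self_times` and the sample trace are illustrative, not part of any tracing library.

```python
from collections import defaultdict

def compute_self_times(spans):
    """Return {spanID: self_time}, where self_time = duration - sum of direct children.

    Assumes child spans do not overlap each other; with parallel children the
    subtraction over-counts, so the result is clamped at zero.
    """
    child_total = defaultdict(int)
    for span in spans:
        refs = span.get("references") or []
        if refs:
            child_total[refs[0]["spanID"]] += span["duration"]
    return {
        span["spanID"]: max(span["duration"] - child_total[span["spanID"]], 0)
        for span in spans
    }

# Hypothetical trace; durations in microseconds
sample_spans = [
    {"spanID": "a", "operationName": "etl.run", "duration": 10_000, "references": []},
    {"spanID": "b", "operationName": "spark.read", "duration": 2_000,
     "references": [{"refType": "CHILD_OF", "spanID": "a"}]},
    {"spanID": "c", "operationName": "spark.write", "duration": 7_000,
     "references": [{"refType": "CHILD_OF", "spanID": "a"}]},
]

self_times = compute_self_times(sample_spans)
slowest = max(sample_spans, key=lambda s: self_times[s["spanID"]])
print(slowest["operationName"], self_times[slowest["spanID"]])
```

Here the root span `etl.run` has the largest total duration, but most of the time is actually spent inside `spark.write`, which is what the self-time ranking reports.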

**Case Study: Netflix Tracing at Scale**

Netflix uses tracing for debugging at this scale:
- **15,000+ microservices**
- **1M+ requests/second**
- **Sampling**: 0.1% (1 in 1,000 traces)
- **Retention**: 7 days (hot), 90 days (cold)
- **Storage**: 5 PB of trace data
- **Cost savings**: cuts MTTR from hours to minutes, avoiding $10M+ in downtime per year
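
One common way to apply a fixed sampling rate like the 0.1% above is head-based sampling keyed on the trace ID, so that every service reaches the same keep/drop decision for a given trace. The sketch below shows that idea with only the standard library. `should_sample` and the random trace IDs are illustrative assumptions, not Netflix's implementation and not the OpenTelemetry API.

```python
import random

SAMPLE_RATE = 0.001  # 0.1%, i.e. roughly 1 in 1,000 traces

def should_sample(trace_id: int, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic decision: keep the trace if the low 64 bits of its ID
    fall below rate * 2**64."""
    threshold = int(rate * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < threshold

rng = random.Random(42)  # seeded only to make the demo repeatable
trace_ids = [rng.getrandbits(128) for _ in range(200_000)]
kept = sum(should_sample(t) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces ({kept / len(trace_ids):.4%})")
```

Because the decision depends only on the trace ID, a trace is either recorded in full or not at all, which avoids partial traces with missing spans.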

---
**Author:** Luis J. Raigoso V. (LJRV)

### 🔗 **Data Lineage: OpenLineage, DataHub & Compliance**

**Why Data Lineage?**

```
Without Lineage:
❌ "Where does this metric come from?"
❌ "What breaks if I change this table?"
❌ "Who uses this dataset?"
❌ "How did this bad data get here?"

With Lineage:
✅ Trace data from source → transformations → destination
✅ Impact analysis (downstream dependencies)
✅ Root cause analysis (upstream issues)
✅ Compliance (GDPR data discovery)
```

**Lineage Architecture**

```
┌──────────────────────────────────────────────────────┐
│  DATA SOURCES                                        │
│  • Databases (Postgres, MySQL, Redshift)             │
│  • Data Lakes (S3, ADLS, GCS)                        │
│  • Data Warehouses (Snowflake, BigQuery)             │
├──────────────────────────────────────────────────────┤
│  PROCESSING ENGINES (emit lineage events)            │
│  • Airflow (OpenLineage plugin)                      │
│  • Spark (OpenLineage listener)                      │
│  • dbt (metadata API)                                │
│  • Flink, Kafka, etc.                                │
├──────────────────────────────────────────────────────┤
│  LINEAGE BACKENDS (store & query)                    │
│  • OpenLineage → Marquez (open source)               │
│  • DataHub (LinkedIn, open source)                   │
│  • Amundsen (Lyft, open source)                      │
│  • Commercial: Collibra, Alation, Atlan              │
├──────────────────────────────────────────────────────┤
│  VISUALIZATION & QUERY                               │
│  • Graph UI (interactive lineage exploration)        │
│  • API (programmatic queries)                        │
│  • Search (find datasets, columns)                   │
└──────────────────────────────────────────────────────┘
```

**OpenLineage Standard**

```python
# OpenLineage Event Structure
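{
    "eventType": "START",  # START, RUNNING, COMPLETE, ABORT, FAIL, OTHER
    "eventTime": "2025-10-30T10:00:00.000Z",
    # A full event also carries the required "producer" and "schemaURL" fields
    "run": {
        "runId": "d46e465b-d358-4d32-83d4-df660ff614dd"  # must be a UUID
    },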
    "job": {
        "namespace": "data-platform",
        "name": "ventas_etl",
        "facets": {
            "documentation": {
                "description": "Daily sales ETL pipeline"
            },
            "sourceCode": {
                "language": "python",
                "sourceCode": "def process_sales()..."
            }
        }
    },
    "inputs": [
        {
            "namespace": "s3://data-lake",
            "name": "raw/sales.csv",
            "facets": {
                "schema": {
                    "fields": [
                        {"name": "order_id", "type": "INTEGER"},
                        {"name": "amount", "type": "FLOAT"},
                        {"name": "customer_id", "type": "INTEGER"}
                    ]
                },
                "dataSource": {
                    "name": "s3",
                    "uri": "s3://data-lake/raw/sales.csv"
                }
            }
        }
    ],
    "outputs": [
        {
            "namespace": "postgres://warehouse",
            "name": "analytics.sales_fact",
            "facets": {
                "schema": {
                    "fields": [
                        {"name": "sale_id", "type": "INTEGER"},
                        {"name": "total", "type": "DECIMAL"},
                        {"name": "customer_key", "type": "INTEGER"}
                    ]
                },
                "dataQuality": {
                    "rowCount": 10000,
                    "bytes": 500000,
                    "columnMetrics": {
                        "total": {
                            "nullCount": 0,
                            "min": 1.50,
                            "max": 9999.99
                        }
                    }
                }
            }
        }
    ]
}
```
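
A minimal sketch of assembling such events with only the standard library. The `build_run_event` helper and the `PRODUCER` URI are illustrative assumptions, and the `SCHEMA_URL` value is only an example: use the one that matches your OpenLineage client version. The point it shows is that the START and COMPLETE events of one run share the same UUID `runId`.

```python
import json
import uuid
from datetime import datetime, timezone

PRODUCER = "https://example.com/pipelines/ventas_etl"  # hypothetical producer URI
SCHEMA_URL = "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"  # example value

def build_run_event(event_type, run_id, job_name, inputs=(), outputs=()):
    """Assemble a minimal OpenLineage-style run event as a plain dict."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": run_id},
        "job": {"namespace": "data-platform", "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
    }

run_id = str(uuid.uuid4())  # START and COMPLETE must share the same runId
start = build_run_event(
    "START", run_id, "ventas_etl",
    inputs=[("s3://data-lake", "raw/sales.csv")],
)
complete = build_run_event(
    "COMPLETE", run_id, "ventas_etl",
    inputs=[("s3://data-lake", "raw/sales.csv")],
    outputs=[("postgres://warehouse", "analytics.sales_fact")],
)

print(json.dumps(complete, indent=2))
```

In practice these dicts would be POSTed to the backend's lineage endpoint (for Marquez, `/api/v1/lineage`), usually through an OpenLineage client instead of hand-built JSON.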

**Airflow + OpenLineage Integration**

```python
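# Install: pip install apache-airflow-providers-openlineage  (Airflow 2.7+)
# The [openlineage] section below is read by that provider; the older
# openlineage-airflow package is configured through OPENLINEAGE_URL instead.
# airflow.cfg
"""
[openlineage]
transport = {"type": "http", "url": "http://marquez:5000"}
namespace = data-platform
"""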

# DAG automatically emits lineage
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
from datetime import datetime

def extract_sales(**context):
    """Extract sales from S3"""
    # OpenLineage cannot inspect arbitrary Python code, so a PythonOperator
    # emits run events without datasets unless you declare them
    # (e.g. via the operator's inlets/outlets).
    pass

def transform_sales(**context):
    """Transform sales data"""
    # Same caveat: data passed through XCom is not recorded as a dataset.
    pass

with DAG('sales_etl', start_date=datetime(2025, 1, 1)) as dag:
    
    # Job and run metadata are captured automatically for every task
    extract = PythonOperator(
        task_id='extract',
        python_callable=extract_sales
    )
    
    transform = PythonOperator(
        task_id='transform',
        python_callable=transform_sales
    )
    
    # SQL lineage captured from query
    load = PostgresOperator(
        task_id='load',
        postgres_conn_id='warehouse',
        sql="""
            INSERT INTO analytics.sales_fact
            SELECT order_id, amount, customer_id
            FROM staging.sales_raw
        """
    )
    
    extract >> transform >> load

# View lineage in Marquez UI:
# http://marquez:3000/lineage/data-platform/sales_etl
```

**Spark + OpenLineage**

```python
# Configure Spark with OpenLineage
# (the openlineage-spark jar must be on the classpath, e.g. via spark.jars.packages)
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sales_etl")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://marquez:5000")
    .config("spark.openlineage.namespace", "data-platform")
    .getOrCreate()
)

# Read (input lineage captured)
df = spark.read.parquet("s3://data-lake/raw/sales/")

# Transform (column-level lineage)
df_transformed = (
    df
    .filter(df['amount'] > 0)
    .withColumn('total', df['amount'] * df['quantity'])
    .select('order_id', 'total', 'customer_id')
)

# Write (output lineage captured)
df_transformed.write.mode("overwrite").parquet("s3://data-lake/curated/sales/")

# OpenLineage automatically emits:
# - Input: s3://data-lake/raw/sales/
# - Output: s3://data-lake/curated/sales/
# - Column lineage: total = amount * quantity
```

**DataHub API Usage**

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    DatasetPropertiesClass,
    UpstreamLineageClass,
    UpstreamClass
)

# Initialize emitter
emitter = DatahubRestEmitter("http://datahub:8080")

# 1. REGISTER DATASET
dataset_urn = make_dataset_urn(
    platform="postgres",
    name="analytics.sales_fact",
    env="PROD"
)

dataset_properties = DatasetPropertiesClass(
    description="Daily sales fact table",
    customProperties={
        "owner": "data-team@company.com",
        "sla": "daily by 9am",
        "retention": "7 years"
    }
)

# emit_mcp expects a MetadataChangeProposalWrapper, not bare keyword arguments
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=dataset_properties
    )
)

# 2. EMIT LINEAGE
upstream_sales = make_dataset_urn("s3", "data-lake/raw/sales", "PROD")
upstream_customers = make_dataset_urn("postgres", "crm.customers", "PROD")

lineage = UpstreamLineageClass(
    upstreams=[
        UpstreamClass(
            dataset=upstream_sales,
            type=DatasetLineageTypeClass.TRANSFORMED
        ),
        UpstreamClass(
            dataset=upstream_customers,
            type=DatasetLineageTypeClass.TRANSFORMED
        )
    ]
)

emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=lineage
    )
)

# 3. QUERY LINEAGE (GraphQL API)
import requests

def get_downstream_tables(table_urn: str):
    """Find all tables that depend on this table"""
    
    query = """
    {
      dataset(urn: "%s") {
        downstreams(input: {query: "*", start: 0, count: 100}) {
          total
          relationships {
            entity {
              urn
              properties {
                name
                description
              }
            }
          }
        }
      }
    }
    """ % table_urn
    
    response = requests.post(
        "http://datahub:8080/api/graphql",
        json={"query": query}
    )
    
    return response.json()

# Find impact of changing sales_raw
downstreams = get_downstream_tables("urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/raw/sales,PROD)")

print(f"This table impacts {len(downstreams['data']['dataset']['downstreams']['relationships'])} downstream tables")
```

**Column-Level Lineage**

```python
from datahub.metadata.schema_classes import FineGrainedLineageClass

# Define column-level transformations
# (dataset URNs match the upstream registered earlier: s3, data-lake/raw/sales)
column_lineage = FineGrainedLineageClass(
    upstreamType="FIELD_SET",
    upstreams=[
        "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/raw/sales,PROD),amount)",
        "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/raw/sales,PROD),quantity)"
    ],
    downstreamType="FIELD",
    downstreams=[
        "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,analytics.sales_fact,PROD),total)"
    ],
    transformOperation="MULTIPLY"
)

# Attach it to the dataset's lineage aspect before emitting, e.g.
# UpstreamLineageClass(upstreams=[...], fineGrainedLineages=[column_lineage])

# Query: "Where does column 'total' come from?"
# Answer: total = amount * quantity (both from data-lake/raw/sales)
```
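
To make the "where does this column come from?" query concrete, here is a minimal in-memory sketch that follows column-level edges upstream until it reaches columns with no recorded parents. The `COLUMN_LINEAGE` mapping, the intermediate `staging.sales` table, and the `source_columns` helper are hypothetical, and the graph is assumed to be acyclic. A real catalog would answer this through its own lineage API.

```python
# (dataset, column) -> list of direct upstream (dataset, column) pairs
COLUMN_LINEAGE = {
    ("analytics.sales_fact", "total"): [
        ("staging.sales", "amount"),
        ("staging.sales", "quantity"),
    ],
    ("staging.sales", "amount"): [("raw.sales", "amount")],
    ("staging.sales", "quantity"): [("raw.sales", "quantity")],
}

def source_columns(column, lineage=COLUMN_LINEAGE):
    """Return the set of origin columns that feed `column`.

    A column with no recorded parents is treated as an origin.
    Assumes the lineage graph has no cycles.
    """
    parents = lineage.get(column, [])
    if not parents:
        return {column}
    sources = set()
    for parent in parents:
        sources |= source_columns(parent, lineage)
    return sources

origins = source_columns(("analytics.sales_fact", "total"))
print(sorted(origins))
```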

**Lineage for Compliance (GDPR)**

```python
from collections import deque

def find_pii_usage(column_name: str, table_name: str):
    """
    Find all downstream tables that carry a PII column
    (for GDPR right to erasure).

    get_table_schema and get_path_from_source are placeholder helpers
    to be implemented against your metadata store.
    """
    
    # Start from source
    source_urn = make_dataset_urn("postgres", table_name, "PROD")
    
    # BFS traversal of lineage graph
    seen = {source_urn}
    queue = deque([source_urn])
    pii_locations = []
    
    while queue:
        current_urn = queue.popleft()
        
        # get_downstream_tables returns the raw GraphQL response
        response = get_downstream_tables(current_urn)
        relationships = response['data']['dataset']['downstreams']['relationships']
        
        for downstream in relationships:
            table = downstream['entity']['urn']
            if table in seen:
                continue  # already reached through another path
            seen.add(table)
            
            # Match by column name only; renamed or derived columns are missed
            schema = get_table_schema(table)
            if any(col['name'] == column_name for col in schema['fields']):
                pii_locations.append({
                    'table': table,
                    'column': column_name,
                    'lineage_path': get_path_from_source(source_urn, table)
                })
            
            queue.append(table)
    
    return pii_locations

# Example: Find all uses of customer email
pii_usage = find_pii_usage('email', 'customers.users')

"""
Result:
[
  {
    'table': 'analytics.user_events',
    'column': 'email',
    'lineage_path': ['customers.users', 'staging.users', 'analytics.user_events']
  },
  {
    'table': 'ml.customer_features',
    'column': 'email',
    'lineage_path': ['customers.users', 'ml.customer_features']
  }
]
"""

# The email column can now be deleted from every location for GDPR compliance
```
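
The traversal above depends on helpers backed by a metadata service. To see the BFS and path tracking in isolation, here is a self-contained sketch over an in-memory graph. The `DOWNSTREAM` edges, the `SCHEMAS` contents, and the `find_column_copies` function are hypothetical stand-ins for what a lineage backend would return.

```python
from collections import deque

# Hypothetical lineage edges (table -> direct downstream tables) and schemas
DOWNSTREAM = {
    "customers.users": ["staging.users", "ml.customer_features"],
    "staging.users": ["analytics.user_events"],
}
SCHEMAS = {
    "staging.users": ["user_id", "email"],
    "analytics.user_events": ["event_id", "email"],
    "ml.customer_features": ["user_id", "email_domain"],
}

def find_column_copies(source, column):
    """BFS downstream from `source`, recording the path to each table that has `column`."""
    paths = {source: [source]}  # table -> shortest lineage path from source
    queue = deque([source])
    hits = []
    while queue:
        current = queue.popleft()
        for table in DOWNSTREAM.get(current, []):
            if table in paths:  # already reached via a path at least as short
                continue
            paths[table] = paths[current] + [table]
            if column in SCHEMAS.get(table, []):
                hits.append({"table": table, "column": column,
                             "lineage_path": paths[table]})
            queue.append(table)
    return hits

hits = find_column_copies("customers.users", "email")
for hit in hits:
    print(hit["table"], "<-", " -> ".join(hit["lineage_path"]))
```

Note that `ml.customer_features` is not reported: its `email_domain` column is derived from `email` but has a different name. Name matching alone misses such cases, which is where column-level lineage becomes necessary.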

**Impact Analysis**

```python
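def analyze_breaking_change(table_urn: str, column_to_remove: str):
    """
    Analyze impact of removing a column.

    get_column_lineage is a placeholder helper expected to return
    {'upstream_columns': [...]} for a downstream entity.
    """
    
    # Find all downstream consumers (raw GraphQL response -> entities)
    response = get_downstream_tables(table_urn)
    relationships = response['data']['dataset']['downstreams']['relationships']
    
    impact_report = {
        'affected_pipelines': [],
        'affected_dashboards': [],
        'affected_ml_models': []
    }
    
    for rel in relationships:
        downstream = rel['entity']
        # DataHub URNs look like urn:li:<entityType>:..., e.g. urn:li:dashboard:...
        entity_type = downstream['urn'].split(':')[2]
        
        # Check if column is used
        lineage = get_column_lineage(downstream['urn'])
        
        if column_to_remove in lineage['upstream_columns']:
            if entity_type in ('dataJob', 'dataFlow'):
                impact_report['affected_pipelines'].append(downstream)
            elif entity_type in ('dashboard', 'chart'):
                impact_report['affected_dashboards'].append(downstream)
            elif entity_type == 'mlModel':
                impact_report['affected_ml_models'].append(downstream)
    
    return impact_report

# Before dropping column
sales_fact_urn = make_dataset_urn("postgres", "analytics.sales_fact", "PROD")
impact = analyze_breaking_change(sales_fact_urn, 'legacy_id')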

print(f"⚠️ Removing 'legacy_id' will break:")
print(f"  - {len(impact['affected_pipelines'])} pipelines")
print(f"  - {len(impact['affected_dashboards'])} dashboards")
print(f"  - {len(impact['affected_ml_models'])} ML models")
```

**Case Study: LinkedIn DataHub**

LinkedIn uses DataHub to track:
- **100,000+ datasets** (Hadoop, Kafka, databases)
- **10,000+ pipelines** (Airflow, Azkaban)
- **5,000+ dashboards** (Superset, Tableau)
- **Column-level lineage**: 1M+ column relationships
- **Impact**: data incidents reduced from 50/month to 5/month
- **Compliance**: GDPR data discovery in minutes rather than weeks


## 1. The three pillars of observability

- **Logs**: structured events (JSON) with context (trace_id, user, pipeline).
- **Metrics**: counters, gauges, histograms (latency, throughput, errors).
- **Traces**: distributed execution flow (OpenTelemetry).

## 2. Structured logs with loguru

In [None]:
from loguru import logger
import sys, json
logger.remove()
logger.add(sys.stdout, serialize=True)  # JSON output

logger.info('pipeline_started', pipeline='ventas_etl', version='v1.2.3')
logger.error('validation_failed', pipeline='ventas_etl', error_count=5, sample_ids=[1,2,3])

## 3. Metrics with Prometheus (Python client)

In [None]:
# from prometheus_client import Counter, Histogram, start_http_server
# import time, random
#
# EVENTS_PROCESSED = Counter('events_processed_total', 'Total events processed', ['pipeline'])
# LATENCY = Histogram('processing_latency_seconds', 'Processing latency', ['pipeline'])
#
# start_http_server(8000)  # expose metrics at :8000/metrics
#
# while True:
#     with LATENCY.labels(pipeline='ventas_etl').time():
#         time.sleep(random.uniform(0.01, 0.1))
#         EVENTS_PROCESSED.labels(pipeline='ventas_etl').inc()
print('Example is commented out; uncomment it if prometheus_client is installed')

## 4. Data lineage with OpenLineage

- Open standard for tracking where data comes from and where it goes.
- Integrates with Airflow, Spark, dbt, Flink.
- Backends: Marquez, DataHub, Egeria.

In [None]:
openlineage_example = r'''
# Airflow integration (OpenLineage plugin)
# airflow.cfg
[openlineage]
transport = {"type": "http", "url": "http://marquez:5000"}

# On each DAG run, Airflow emits lineage events:
{
  "eventType": "START",
  "job": {"namespace": "data-platform", "name": "ventas_etl"},
  "inputs": [{"namespace": "s3", "name": "raw/ventas.csv"}],
  "outputs": [{"namespace": "s3", "name": "curated/ventas.parquet"}]
}
'''
# Print the first lines as text rather than as a Python list
print('\n'.join(openlineage_example.splitlines()[:15]))

## 5. SLOs (Service Level Objectives)

- Define measurable objectives, e.g. "99.9% of pipeline runs complete in under 30 min".
- Alert when an SLO is at risk (error budget nearly exhausted).
- Review SLOs monthly and adjust them to business needs.
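
The error budget behind an SLO is simple arithmetic: with a success target `t` over `N` runs, the budget is `(1 - t) * N` allowed failures, and the share consumed is `failures / budget`. A minimal sketch follows. The `error_budget` helper, the 80% alert threshold, and the sample numbers are illustrative assumptions.

```python
def error_budget(slo_target: float, total_runs: int, failed_runs: int) -> dict:
    """Compute allowed failures and how much of the budget has been spent."""
    allowed_failures = (1 - slo_target) * total_runs
    consumed = failed_runs / allowed_failures if allowed_failures else float("inf")
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "at_risk": consumed >= 0.8,  # illustrative alert threshold
    }

# Hypothetical month: 3000 pipeline runs against a 99.5% success SLO, 13 failures
report = error_budget(slo_target=0.995, total_runs=3000, failed_runs=13)
print(report)
```

With these numbers the budget is about 15 failed runs, so 13 failures consume most of it and the pipeline is flagged as at risk before the SLO is actually breached.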

In [None]:
import pandas as pd
slo_example = pd.DataFrame([
  {'Pipeline':'ventas_etl', 'SLO':'p99 latency < 30 min', 'Actual':'28 min', 'Status':'✅'},
  {'Pipeline':'streaming_kafka', 'SLO':'Errors < 0.1%', 'Actual':'0.05%', 'Status':'✅'},
  {'Pipeline':'ml_training', 'SLO':'Availability > 99.5%', 'Actual':'99.2%', 'Status':'⚠️'},
])
slo_example

## 6. Dashboards and alerts

- Grafana to visualize Prometheus metrics.
- Alertmanager to send notifications to Slack/PagerDuty.
- DataHub/Marquez UI to explore lineage interactively.
- Centralized logs in ELK/Loki for troubleshooting.