# 🔄 Mid Integrative Project 2: Kafka → Streaming → Data Lake and Monitoring

Objective: implement a near-real-time pipeline that ingests events into Kafka, validates and transforms them, writes partitioned Parquet output, and exposes basic processing metrics.

- Duration: 120–150 min
- Difficulty: Medium/High
- Prerequisites: Notebooks Mid 02 (Kafka), 05 (DataOps)

### 🎯 **Streaming Pipeline: Real-Time vs Near-Real-Time**

**Project Goal:**  
Build a **near-real-time** pipeline that processes events from Kafka into a Data Lake (partitioned Parquet), applying streaming, idempotency, and observability patterns.

**System Architecture:**

```
┌──────────────────────────────────────────────────────────────┐
│                        EVENT SOURCES                          │
│          (Web Apps, Mobile Apps, IoT Devices)                 │
└─────────────────────┬────────────────────────────────────────┘
                      │ Events/sec
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                      APACHE KAFKA                            │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐        │
│  │Partition│  │Partition│  │Partition│  │Partition│        │
│  │    0    │  │    1    │  │    2    │  │    3    │        │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘        │
│  Topic: pi2_events (Retention: 7 days)                      │
└─────────────────────┬───────────────────────────────────────┘
                      │ Poll batches (50-200 msgs)
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                   CONSUMER/PROCESSOR                         │
│                                                               │
│  ┌──────────────────────────────────────────────────┐       │
│  │  1. VALIDATE    → Cerberus schema enforcement    │       │
│  │  2. DEDUP       → Checkpoint DB (SQLite/Redis)   │       │
│  │  3. ENRICH      → Add processing_ts, normalize   │       │
│  │  4. TRANSFORM   → Business logic                 │       │
│  └──────────────────────────────────────────────────┘       │
│                                                               │
└─────────────────────┬───────────────────────────────────────┘
                      │ Micro-batches
       ┌──────────────┼──────────────┐
       ▼              ▼               ▼
┌──────────┐   ┌──────────┐   ┌──────────┐
│  SINK 1  │   │  SINK 2  │   │  SINK 3  │
│  Parquet │   │  Metrics │   │   DLQ    │
│  (Lake)  │   │  (Prom.) │   │ (Errors) │
└──────────┘   └──────────┘   └──────────┘
      │              │               │
      ▼              ▼               ▼
  Data Lake      Dashboard     Error Queue
 (Partitioned)   (Grafana)    (Reprocessing)
```

**Near-Real-Time vs Real-Time:**

| Characteristic | **Real-Time** | **Near-Real-Time** |
|----------------|---------------|---------------------|
| **Latency** | < 100 ms | 1-60 seconds |
| **Processing** | Event-by-event | Micro-batches (50-200 events) |
| **Throughput** | Lower | Higher (batch efficiency) |
| **Complexity** | High (state management) | Medium (simpler checkpointing) |
| **Use cases** | Trading, fraud detection | Analytics, monitoring, ETL |

**Why Near-Real-Time for this project?**

✅ **Latency/throughput balance**: Batches of 50-200 events amortize I/O  
✅ **Simpler idempotency**: One checkpoint per batch instead of per event  
✅ **Cost-effective**: Fewer resources than event-by-event streaming  
✅ **Good enough for analytics**: A 30-60 s lag is acceptable for dashboards
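
The micro-batching trade-off above can be sketched as a small generator that closes a batch when it reaches a size limit or when it has been open for too long. This is a minimal illustration, not the notebook's consumer: `micro_batches`, `max_size`, and `max_wait_s` are hypothetical names, and the injectable `clock` exists only to make the helper easy to test.

```python
import time
from typing import Any, Callable, Iterable, Iterator, List


def micro_batches(
    events: Iterable[Any],
    max_size: int = 200,
    max_wait_s: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[List[Any]]:
    """Group events into batches bounded by size and by age (hypothetical helper)."""
    batch: List[Any] = []
    started = clock()
    for evt in events:
        if not batch:
            started = clock()  # the age of a batch starts at its first event
        batch.append(evt)
        # Flush when the batch is full or has been open for too long.
        if len(batch) >= max_size or clock() - started >= max_wait_s:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch


batches = list(micro_batches(range(450), max_size=200))
print([len(b) for b in batches])
```

Note that the age check only runs when a new event arrives, so an idle source does not trigger a flush. A production consumer would typically combine this with a poll timeout.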

**Integrated Technologies:**

1. **Kafka** (Notebook 02): Event streaming, consumer groups, offset management
2. **Cerberus** (Notebook 06): Schema validation for events
3. **Idempotency**: Checkpoint DB to avoid duplicates (at-least-once delivery plus deduplication, which gives effectively-once results)
4. **Parquet** (Notebook 07): Date partitioning (Hive-style)
5. **Logging** (Notebook 05): Loguru structured logging for observability
6. **DataOps**: Metrics, alerting, DLQ (Dead Letter Queue)

**Real-World Use Cases:**

- **E-commerce**: Tracking clicks/views/purchases for near-real-time recommendations
- **FinTech**: Fraud detection (flagging suspicious transactions)
- **Gaming**: Player behavior analytics (sessions, achievements)
- **IoT**: Ingesting telemetry from industrial sensors

---
**Author:** Luis J. Raigoso V. (LJRV)

## 0) Requirements and execution

- You need a Kafka cluster, either local (Docker Compose) or remote.
- Optional dependencies: `kafka-python` or `confluent-kafka`. They are not enabled by default in `requirements.txt`.
- This notebook includes a simulation mode that runs without Kafka, so you can practice the validation/transformation/sink logic.
- Environment variables: `KAFKA_BOOTSTRAP_SERVERS`, `KAFKA_TOPIC`, `OUT_DIR` (default `datasets/processed/pi2/`).

Docker Compose example (for reference):
```yaml
version: '3.8'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on: [zookeeper]
    ports: ['9092:9092']
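    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Internal listener on 29092 for other containers; host listener on 9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      # Single broker: internal topics cannot use the default replication factor of 3
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1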
```

## 1) Event schema and validation

### 📝 **Schema Validation: Event-Driven Architecture**

**Why validate schemas in streaming?**

In event-driven architectures, **producers and consumers are decoupled**:

```
Producer A (v1.0) ───┐
Producer B (v1.2) ───┼──> Kafka Topic ───> Consumer C (v1.1)
Producer C (v2.0) ───┘                       ↓
                                      Which schema to expect?
```

**Problems without validation:**

❌ **Schema drift**: A producer adds a new field `discount` → a strict consumer crashes  
❌ **Type mismatches**: Producer sends `"123"` (string) → consumer expects `123` (int)  
❌ **Missing fields**: Producer omits the required field `usuario_id`  
❌ **Poison pills**: A malformed event blocks the consumer indefinitely

**Solution: Schema Validation with Cerberus**

```python
event_schema = {
    'event_id': {
        'type': 'string',
        'required': True,
        'regex': r'^e\d+$'  # Format: e1, e123, e99999
    },
    'ts': {
        'type': 'string',
        'required': True,
        'regex': r'^\d{4}-\d{2}-\d{2}T.*'  # ISO 8601 prefix; Cerberus appends '$' to the pattern, so '.*' must consume the rest
    },
    'usuario_id': {
        'type': 'integer',
        'required': True,
        'min': 1,
        'max': 1_000_000
    },
    'accion': {
        'type': 'string',
        'allowed': ['click', 'view', 'purchase'],
        'required': True
    },
    'monto': {
        'type': 'float',
        'nullable': True,  # Only present for 'purchase'
        'min': 0,
        'max': 100_000
    }
}
```

**Advantages of Cerberus:**

✅ **Dict-based**: Easy to serialize (store the schema as JSON/YAML)  
✅ **Composable**: `allow_unknown=True` lets older consumers tolerate newly added fields  
✅ **Custom validators**: Extensible with business logic

**Alternatives considered:**

| Tool | Pros | Cons |
|------|------|------|
| **Cerberus** | Lightweight, flexible | No type hints |
| **Pydantic** | Type safety, IDE support | More verbose |
| **JSON Schema** | Standard, language-agnostic | Less ergonomic in Python |
| **Avro** | Schema evolution, compact | Typically paired with a Schema Registry |

**Schema Evolution Strategy:**

```python
# v1.0 (initial)
{'event_id', 'ts', 'usuario_id', 'accion', 'monto'}

# v1.1 (backward compatible)
{'event_id', 'ts', 'usuario_id', 'accion', 'monto', 'session_id'}
# allow_unknown=True → a v1.0 consumer ignores 'session_id'

# v2.0 (breaking change)
{'event_id', 'ts', 'user_id', 'action', 'amount'}
# Renamed fields → requires a v2.0 consumer

# Strategy: topic versioning
# topics: pi2_events_v1, pi2_events_v2
```
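
One way to soften a breaking rename such as v2.0 is to translate new-style events back to the field names a v1 consumer expects before validating them. The sketch below assumes the v2 → v1 mapping shown above; `V2_TO_V1` and `to_v1` are hypothetical names introduced only for illustration.

```python
from typing import Any, Dict

# Field renames introduced by the v2.0 schema (taken from the example above).
V2_TO_V1 = {"user_id": "usuario_id", "action": "accion", "amount": "monto"}


def to_v1(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the event using v1 field names; v1 events pass through unchanged."""
    return {V2_TO_V1.get(key, key): value for key, value in event.items()}


v2_event = {"event_id": "e7", "ts": "2025-10-30T12:00:00Z",
            "user_id": 42, "action": "purchase", "amount": 19.9}
v1_event = {"event_id": "e8", "ts": "2025-10-30T12:00:01Z",
            "usuario_id": 7, "accion": "click", "monto": None}
print(to_v1(v2_event))
```

An adapter like this lets a single consumer read both versioned topics during a migration, at the cost of keeping the mapping in sync with the producers.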

**Dead Letter Queue (DLQ) Pattern:**

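```python
from datetime import datetime, timezone

def process_with_dlq(events):
    # `validator` is a Cerberus Validator; `producer` is a KafkaProducer
    # configured with a JSON value_serializer.
    good, bad = [], []
    for evt in events:
        if not validator.validate(evt):
            bad.append({
                'event': evt,
                'errors': validator.errors,
                'timestamp': datetime.now(timezone.utc).isoformat()
            })
            continue
        good.append(evt)

    # Send each rejected event to the DLQ (separate topic), one message per event
    for record in bad:
        producer.send('pi2_events_dlq', record)

    return good
```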

**Benefits of the DLQ:**

✅ **No events are lost**: Rejected events are stored for analysis  
✅ **Pipeline is not blocked**: The consumer keeps processing valid events  
✅ **Traceability**: Error details are kept for debugging  
✅ **Reprocessing**: The team can fix events and re-ingest them from the DLQ
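
The reprocessing step can be sketched without Kafka: take the dead-lettered records, apply a correction, and re-validate. This is a minimal illustration that assumes DLQ records shaped like the `{'event': ..., 'errors': ...}` dictionaries above. `reprocess_dlq` and `cast_user_id` are hypothetical helpers, and the sample error messages are made up for the example.

```python
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]


def reprocess_dlq(
    dlq: List[Dict[str, Any]],
    fix: Callable[[Event], Event],
    is_valid: Callable[[Event], bool],
) -> Tuple[List[Event], List[Dict[str, Any]]]:
    """Apply a correction to each dead-lettered event and re-validate it."""
    recovered, still_bad = [], []
    for record in dlq:
        candidate = fix(record["event"])
        if is_valid(candidate):
            recovered.append(candidate)
        else:
            still_bad.append(record)  # stays in the DLQ for another attempt
    return recovered, still_bad


def cast_user_id(evt: Event) -> Event:
    """Example correction: producers sent usuario_id as a numeric string."""
    fixed = dict(evt)
    if isinstance(fixed.get("usuario_id"), str) and fixed["usuario_id"].isdigit():
        fixed["usuario_id"] = int(fixed["usuario_id"])
    return fixed


dlq = [
    {"event": {"event_id": "e1", "usuario_id": "17"}, "errors": {"usuario_id": ["wrong type"]}},
    {"event": {"event_id": "e2", "usuario_id": "abc"}, "errors": {"usuario_id": ["wrong type"]}},
]
recovered, still_bad = reprocess_dlq(
    dlq, cast_user_id, lambda e: isinstance(e["usuario_id"], int)
)
print(recovered, len(still_bad))
```

In the real pipeline the `is_valid` callable would be the Cerberus validator, and recovered events would be produced back to the main topic.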


```python
from cerberus import Validator
event_schema = {
  'event_id': {'type':'string', 'required': True},
  'ts': {'type':'string', 'required': True},
  'usuario_id': {'type':'integer', 'required': True},
  'accion': {'type':'string', 'allowed':['click','view','purchase']},
  'monto': {'type':'float', 'nullable': True, 'min': 0}
}
validator = Validator(event_schema, allow_unknown=True)
def is_valid_event(evt):
    return validator.validate(evt)

# Example
is_valid_event({'event_id':'e1','ts':'2025-10-30T12:00:00Z','usuario_id':1,'accion':'click','monto':None})
```

## 2) Producer (Kafka) and simulation mode

```python
import os, json, time, random
from datetime import datetime, timezone
KAFKA_BOOTSTRAP = os.getenv('KAFKA_BOOTSTRAP_SERVERS','localhost:9092')
KAFKA_TOPIC = os.getenv('KAFKA_TOPIC','pi2_events')

def gen_event(i: int):
    return {
        'event_id': f'e{i}',
        'ts': datetime.now(timezone.utc).isoformat(),
        'usuario_id': random.randint(1,1000),
        'accion': random.choice(['click','view','purchase']),
        'monto': round(random.uniform(1,500),2) if random.random()>0.8 else None
    }

def produce_simulation(n=50):
    return [gen_event(i) for i in range(n)]

# Kafka producer (optional)
def produce_kafka(n=50):
    try:
        from kafka import KafkaProducer
        producer = KafkaProducer(bootstrap_servers=KAFKA_BOOTSTRAP, value_serializer=lambda v: json.dumps(v).encode('utf-8'))
        for i in range(n):
            evt = gen_event(i)
            producer.send(KAFKA_TOPIC, evt)
        producer.flush()
        return n
    except Exception as e:
        print('Kafka not available, use simulation mode:', e)
        return None

simulated = produce_simulation(100)
len(simulated)
```

## 3) Consumer/Processor: validation, enrichment, and idempotency

### 🔄 **Idempotency: Exactly-Once Semantics**

**The At-Least-Once Delivery Problem:**

When offsets are committed after processing, a Kafka consumer gets **at-least-once** delivery:

```
Consumer Poll Batch:
  Event e1 → Process ✅ → Commit offset ✅
  Event e2 → Process ✅ → Commit offset ❌ (crash)
  
Consumer Restart:
  Event e2 → Process ✅ (duplicated!)
  Event e3 → Process ✅
```

**Consequences:**

❌ **Duplicates in the DB**: `INSERT e2` runs twice → incorrect metrics  
❌ **Wrong aggregations**: `SUM(monto)` counts `e2` twice  
❌ **Inconsistencies**: The dashboard shows 150 events when only 100 are unique

**Solution 1: Checkpoint Database (Implemented in the Notebook)**

```python
import sqlite3

# SQLite as the checkpoint store (one shared connection for brevity)
conn = sqlite3.connect('checkpoint.sqlite')

def ensure_ckpt():
    conn.execute('CREATE TABLE IF NOT EXISTS seen (event_id TEXT PRIMARY KEY)')
    conn.commit()

def is_seen(event_id: str) -> bool:
    cur = conn.execute('SELECT 1 FROM seen WHERE event_id=?', (event_id,))
    return cur.fetchone() is not None

def mark_seen(event_id: str):
    conn.execute('INSERT OR IGNORE INTO seen (event_id) VALUES (?)', (event_id,))
    conn.commit()
```

**How it works:**

1. The consumer reads a batch from Kafka
2. For each event, it checks `is_seen(event_id)`
3. If duplicate → skip
4. If new → process + `mark_seen(event_id)`
5. Commit the Kafka offset at the end of the batch

**Trade-offs:**

✅ **Pros:**
- Effectively-once results: replayed events are filtered out by `event_id`
- Simple to implement
- Works with any backend (SQLite/Redis/PostgreSQL)

❌ **Cons:**
- DB lookup overhead per event (roughly 1-5 ms per query)
- The checkpoint DB grows without bound (needs TTL/cleanup)
- Does not work if `event_id` is not globally unique
- An event marked as seen before its output is written is skipped on replay if the process crashes in between, so mark events only after the sink write succeeds
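
The per-event lookup cost mentioned above can be reduced by reusing one connection and wrapping the whole batch in a single transaction. The following is a sketch under those assumptions, using an in-memory SQLite database. `dedup_batch` is a hypothetical helper, not the notebook's implementation, and it relies on `INSERT OR IGNORE` reporting a row count of 1 only for new keys.

```python
import sqlite3
from typing import Any, Dict, Iterable, List


def dedup_batch(conn: sqlite3.Connection, events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return only the events whose event_id has not been recorded yet."""
    conn.execute("CREATE TABLE IF NOT EXISTS seen (event_id TEXT PRIMARY KEY)")
    fresh = []
    with conn:  # one transaction per batch: commits on success, rolls back on error
        for evt in events:
            cur = conn.execute(
                "INSERT OR IGNORE INTO seen (event_id) VALUES (?)", (evt["event_id"],)
            )
            if cur.rowcount == 1:  # the row was inserted, so the event is new
                fresh.append(evt)
    return fresh


conn = sqlite3.connect(":memory:")
first = dedup_batch(conn, [{"event_id": "e1"}, {"event_id": "e2"}, {"event_id": "e1"}])
replay = dedup_batch(conn, [{"event_id": "e2"}, {"event_id": "e3"}])
print([e["event_id"] for e in first], [e["event_id"] for e in replay])
```

Because the transaction rolls back on an exception, a batch that fails midway leaves no half-recorded IDs behind, which keeps a retry of the same batch safe.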

**Solución 2: Kafka Transactions (Alternative)**

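```python
# Transactional producer (needs a client version with transaction support)
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    transactional_id='pi2-producer',
    enable_idempotence=True
)
producer.init_transactions()  # register the transactional_id with the broker, once
producer.begin_transaction()
producer.send('pi2_events', event)
producer.commit_transaction()

# Transactional consumer
consumer = KafkaConsumer(
    'pi2_events',
    bootstrap_servers='localhost:9092',
    isolation_level='read_committed',  # Only reads committed transactions
    enable_auto_commit=False
)
```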

**Advantages:**  
✅ Exactly-once is native to Kafka  
✅ No external checkpoint store required  
✅ No per-event lookup latency

**Disadvantages:**  
❌ Requires Kafka 0.11+ and more complex configuration  
❌ Does not protect against duplicates across different topics/pipelines, or in sinks outside Kafka such as Parquet files

**Solución 3: Idempotent Writes (DB-Level)**

```sql
-- PostgreSQL UPSERT
INSERT INTO events (event_id, ts, monto)
VALUES ('e1', '2025-10-30', 100.0)
ON CONFLICT (event_id) DO UPDATE SET
  ts = EXCLUDED.ts,
  monto = EXCLUDED.monto;
```

```python
# MongoDB upsert
db.events.update_one(
    {'event_id': 'e1'},
    {'$set': {'ts': '2025-10-30', 'monto': 100.0}},
    upsert=True
)
```

**TTL (Time-To-Live) para Checkpoint Cleanup:**

```python
# Redis with automatic TTL (7 days)
redis.setex(f'seen:{event_id}', 7*86400, '1')
```

```sql
-- PostgreSQL with an expiration timestamp
CREATE TABLE seen (
  event_id TEXT PRIMARY KEY,
  seen_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_seen_at ON seen (seen_at);

-- Cleanup job (daily)
DELETE FROM seen WHERE seen_at < NOW() - INTERVAL '7 days';
```

**Why 7 days?**

- Kafka's default retention is 7 days (`log.retention.hours=168`)
- An event older than 7 days has already been deleted from the topic, so it cannot be replayed from Kafka
- Balances storage against the safety window
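
For the SQLite checkpoint used in this notebook, the same retention idea can be sketched with a `seen_at` column and a periodic delete. This is a hedged illustration: the `seen_at` column, `RETENTION_SECONDS`, and `purge_expired` are assumptions added for the example and are not part of the notebook's schema.

```python
import sqlite3
import time

RETENTION_SECONDS = 7 * 86400  # mirrors the 7-day window discussed above


def purge_expired(conn: sqlite3.Connection, now: float) -> int:
    """Delete checkpoint rows older than the retention window; return rows removed."""
    with conn:
        cur = conn.execute(
            "DELETE FROM seen WHERE seen_at < ?", (now - RETENTION_SECONDS,)
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seen (event_id TEXT PRIMARY KEY, seen_at REAL NOT NULL)")
now = time.time()
conn.executemany(
    "INSERT INTO seen VALUES (?, ?)",
    [("e1", now - 8 * 86400), ("e2", now - 1 * 86400)],
)
removed = purge_expired(conn, now)
remaining = [row[0] for row in conn.execute("SELECT event_id FROM seen")]
print(removed, remaining)
```

Passing `now` explicitly keeps the cleanup deterministic in tests; a scheduled job would simply call it with `time.time()`.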

**Metrics to Monitor:**

```python
from prometheus_client import Counter, Gauge

duplicates_detected = Counter('duplicates_total', 'Total duplicate events')
checkpoint_db_size = Gauge('checkpoint_db_bytes', 'Checkpoint DB size')

def process_events(events):
    for evt in events:
        if is_seen(evt['event_id']):
            duplicates_detected.inc()
            continue
        # Process...
```


```python
from loguru import logger
import sqlite3
from typing import Iterable, Dict, Any

OUT_DIR = os.getenv('OUT_DIR','datasets/processed/pi2/')
os.makedirs(OUT_DIR, exist_ok=True)
CKPT_DB = os.path.join(OUT_DIR, 'checkpoint.sqlite')

def ensure_ckpt():
    conn = sqlite3.connect(CKPT_DB)
    cur = conn.cursor()
    cur.execute('CREATE TABLE IF NOT EXISTS seen (event_id TEXT PRIMARY KEY)')
    conn.commit(); conn.close()

def is_seen(event_id: str) -> bool:
    conn = sqlite3.connect(CKPT_DB)
    cur = conn.cursor()
    cur.execute('SELECT 1 FROM seen WHERE event_id=?', (event_id,))
    row = cur.fetchone()
    conn.close()
    return row is not None

def mark_seen(event_id: str):
    conn = sqlite3.connect(CKPT_DB)
    cur = conn.cursor()
    cur.execute('INSERT OR IGNORE INTO seen(event_id) VALUES (?)', (event_id,))
    conn.commit(); conn.close()

def enrich(evt: Dict[str,Any]) -> Dict[str,Any]:
    evt = dict(evt)
    evt['processing_ts'] = datetime.now(timezone.utc).isoformat()
    evt['monto'] = float(evt['monto']) if evt.get('monto') is not None else 0.0
    return evt

def process_events(events: Iterable[Dict[str,Any]]):
    ensure_ckpt()
    good, bad = [], []
    for evt in events:
        if not is_valid_event(evt):
            bad.append({'evt':evt, 'err':'schema'})
            continue
        if is_seen(evt['event_id']):
            logger.info(f"duplicate skipped: {evt['event_id']}")
            continue
        evt2 = enrich(evt)
        good.append(evt2)
        mark_seen(evt['event_id'])
    return good, bad

ok, ko = process_events(simulated)
len(ok), len(ko)
```

## 4) Sink: write Parquet partitioned by date

### 💾 **Parquet Sink: Partitioning Strategy**

**Why Parquet for Streaming?**

**Format comparison (approximate compression relative to CSV):**

| Format | Compression | Query Speed | Streaming Friendly | Tooling |
|--------|-------------|-------------|---------------------|---------|
| **CSV** | 1x | Slow (full scan) | ✅ Yes | Universal |
| **JSON** | 2x | Slow | ✅ Yes | Universal |
| **Avro** | 5x | Medium | ✅ Yes | Schema Registry |
| **Parquet** | 10x | Fast (columnar) | ⚠️ Micro-batches | Spark/Athena/Presto |
| **ORC** | 12x | Fast | ⚠️ Micro-batches | Hive-centric |

**Decision: Parquet**
- ✅ Better compression (about 10x vs CSV)
- ✅ Query performance (columnar scan)
- ✅ Compatible with Athena/Spark/pandas
- ⚠️ Requires buffering (no event-by-event writes)

**Partitioning Strategy:**

```
datasets/processed/pi2/
├── date=2025-10-30/
│   ├── events_1761825600.parquet  (timestamp: 12:00 UTC)
│   ├── events_1761829200.parquet  (timestamp: 13:00 UTC)
│   └── events_1761832800.parquet  (timestamp: 14:00 UTC)
├── date=2025-10-31/
│   ├── events_1761868800.parquet
│   └── events_1761872400.parquet
└── date=2025-11-01/
    └── events_1761955200.parquet
```

**Advantages of Hive-style Partitioning:**

1. **Partition Pruning (Athena/Spark):**
   ```sql
   SELECT SUM(monto) FROM events
   WHERE date = '2025-10-30'  -- Reads only 1 partition (3 files)
   -- e.g. a 300 MB scan instead of a 100 GB full scan
   ```

2. **Lifecycle Management:**
   ```bash
   # Delete old data (e.g. for GDPR retention limits)
   rm -rf datasets/processed/pi2/date=2025-01-*
   ```

3. **Incremental Processing:**
   ```python
   # Process only new partitions (ISO dates sort lexicographically)
   new_partitions = [d for d in os.listdir('pi2') 
                     if d > f'date={last_processed_date}']
   ```
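
The incremental-processing idea can be made a little more defensive by parsing the `date=` directory names instead of comparing raw strings, so stray directories are ignored. This is a minimal sketch: `new_partitions` is a hypothetical helper, and the temporary directory only stands in for the data lake root.

```python
import tempfile
from datetime import date
from pathlib import Path
from typing import List


def new_partitions(root: Path, last_processed: date) -> List[Path]:
    """Return date=YYYY-MM-DD directories strictly newer than last_processed, oldest first."""
    found = []
    for child in root.iterdir():
        if not (child.is_dir() and child.name.startswith("date=")):
            continue
        try:
            part_date = date.fromisoformat(child.name[len("date="):])
        except ValueError:
            continue  # ignore directories that do not follow the naming scheme
        if part_date > last_processed:
            found.append((part_date, child))
    return [path for _, path in sorted(found)]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["date=2025-10-30", "date=2025-10-31", "date=2025-11-01", "_tmp", "date=oops"]:
        (root / name).mkdir()
    names = [p.name for p in new_partitions(root, date(2025, 10, 30))]
print(names)
```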

**Small Files Problem:**

**Problem:**
```
date=2025-10-30/
├── events_1761825601.parquet  (12 KB)   ← Inefficient
├── events_1761825602.parquet  (8 KB)    ← Metadata overhead
├── events_1761825603.parquet  (15 KB)   ← Many small files
└── ... (1000 files)
```

**Impact:**
- Athena bills by data scanned (about $5/TB), but each extra file adds S3 requests and query-planning overhead
- Spark: Overhead of opening/closing file handles
- Target: **128-512 MB per file**
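
A quick back-of-the-envelope calculation shows how far typical micro-batches are from that target. The sketch below is illustrative only: the 64 bytes per event is an assumed on-disk size, not a measurement, and `events_per_file` and `files_per_day` are hypothetical helpers.

```python
import math


def events_per_file(target_file_mb: float, avg_event_bytes: float) -> int:
    """Events to buffer before a flush to reach roughly target_file_mb on disk."""
    return math.ceil(target_file_mb * 1024 * 1024 / avg_event_bytes)


def files_per_day(events_per_hour: int, events_in_each_file: int) -> int:
    """Files written per day if each flush holds events_in_each_file events."""
    return math.ceil(events_per_hour * 24 / events_in_each_file)


# Assumption: ~64 bytes per event after columnar compression.
per_file = events_per_file(target_file_mb=128, avg_event_bytes=64)
small_batches = files_per_day(10_000, 200)       # flush every 200 events
sized_batches = files_per_day(10_000, per_file)  # flush at the 128 MB target
print(per_file, small_batches, sized_batches)
```

At 10k events/hour, flushing every 200 events produces over a thousand tiny files per day, while a single day of data does not even fill one target-sized file. That gap is why buffering alone is rarely enough and a compaction job is usually added.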

**Solution 1: In-memory buffering**
```python
buffer = []
BUFFER_SIZE = 1000  # Events

def write_buffered(event):
    buffer.append(event)
    if len(buffer) >= BUFFER_SIZE:
        flush_to_parquet(buffer)
        buffer.clear()
```

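**Solution 2: Compaction Job (Airflow)**
```python
import os
import time
from glob import glob

import pandas as pd
from airflow.decorators import task

# Daily compaction DAG
@task
def compact_partition(date):
    small_files = glob(f'date={date}/*.parquet')
    if len(small_files) > 100:
        df = pd.concat([pd.read_parquet(f) for f in small_files])
        # Unique output name so the result is never in the list of files to delete
        out = f'date={date}/compacted_{int(time.time())}.parquet'
        df.to_parquet(out, index=False)
        for f in small_files:
            os.remove(f)
```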

**Atomic Writes (Avoiding Corrupt Files):**

```python
import os
import tempfile

def atomic_write_parquet(df, final_path):
    # 1. Write to a temp file in the destination directory (same filesystem)
    target_dir = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix='.parquet.tmp')
    os.close(fd)
    try:
        df.to_parquet(tmp_path)
        # 2. Rename into place (atomic on POSIX when source and target share a filesystem)
        os.replace(tmp_path, final_path)
    except BaseException:
        os.remove(tmp_path)
        raise
```

**Why it matters:**
- If the consumer crashes during `to_parquet()` → the file is left partially written
- Athena/Spark read the corrupt file → the query fails
- An atomic rename guarantees that the file is either complete or absent
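
The "complete or absent" guarantee can be checked with a small experiment that simulates a failure in the middle of a write. The sketch uses plain bytes instead of Parquet to stay dependency-free. `atomic_write`, `write_ok`, and `write_crash` are hypothetical names, and the temp file is created next to the target so the rename stays on one filesystem.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def atomic_write(final_path: Path, write_fn: Callable[[Path], None]) -> None:
    """Run write_fn against a temp file, then publish it with an atomic rename."""
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target, so the rename never crosses filesystems.
    fd, tmp_name = tempfile.mkstemp(dir=final_path.parent, suffix=".tmp")
    os.close(fd)
    tmp_path = Path(tmp_name)
    try:
        write_fn(tmp_path)
        os.replace(tmp_path, final_path)
    except BaseException:
        tmp_path.unlink(missing_ok=True)  # discard the partial file
        raise


def write_ok(path: Path) -> None:
    path.write_bytes(b"complete payload")


def write_crash(path: Path) -> None:
    path.write_bytes(b"partial")
    raise RuntimeError("simulated crash mid-write")


with tempfile.TemporaryDirectory() as tmp_dir:
    good = Path(tmp_dir) / "good.bin"
    bad = Path(tmp_dir) / "bad.bin"
    atomic_write(good, write_ok)
    try:
        atomic_write(bad, write_crash)
    except RuntimeError:
        pass
    good_content = good.read_bytes()
    bad_exists = bad.exists()
    leftovers = sorted(p.name for p in Path(tmp_dir).iterdir())
print(good_content, bad_exists, leftovers)
```

A hard kill of the process would skip the cleanup branch and leave a `.tmp` file behind, but readers that only glob for the final extension still never see a partial file.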

**Compression Codecs:**

```python
df.to_parquet(
    'events.parquet',
    compression='snappy',  # Default: Balance speed/ratio
    # compression='gzip',   # Better ratio, slower
    # compression='zstd',   # Modern: Fast + good ratio
    engine='pyarrow'
)
```

**Benchmark (illustrative figures; actual numbers depend on data and hardware):**

| Codec | Compression Ratio | Write Speed | Read Speed |
|-------|-------------------|-------------|------------|
| None | 1x | 100 MB/s | 200 MB/s |
| Snappy | 3x | 80 MB/s | 180 MB/s |
| Gzip | 5x | 30 MB/s | 100 MB/s |
| Zstd | 4.5x | 70 MB/s | 160 MB/s |

**Recommendation:** `snappy` for streaming (balances latency and compression)


```python
import time
import pandas as pd
from pathlib import Path

def write_parquet(events):
    if not events:
        return None
    df = pd.DataFrame(events)
    df['date'] = pd.to_datetime(df['ts']).dt.date.astype(str)
    for d, part in df.groupby('date'):
        part_dir = Path(OUT_DIR) / f'date={d}'
        part_dir.mkdir(parents=True, exist_ok=True)
        # Nanosecond timestamp so two batches written in the same second do not overwrite each other
        fp = part_dir / f'events_{time.time_ns()}.parquet'
        part.drop(columns=['date']).to_parquet(fp, index=False)
    return True

write_parquet(ok)
```

## 5) Metrics and logging

```python
import time
total = len(simulated)
validos = len(ok)
invalidos = len(ko)
metricas = f'total {total}\nvalidos {validos}\ninvalidos {invalidos}\n'
with open(os.path.join(OUT_DIR, 'metrics.txt'), 'w') as f:
    f.write(metricas)
logger.info(f'metrics total={total} validos={validos} invalidos={invalidos}')
metricas
```

## 6) Kafka consumer (optional, live)

```python
def consume_kafka(max_messages=100):
    try:
        from kafka import KafkaConsumer
        consumer = KafkaConsumer(
            KAFKA_TOPIC,
            bootstrap_servers=KAFKA_BOOTSTRAP,
            group_id='pi2-cg',  # manual commits require a consumer group
            auto_offset_reset='earliest',
            enable_auto_commit=False,
            consumer_timeout_ms=10000,  # stop iterating after 10 s without messages
            value_deserializer=lambda v: json.loads(v.decode('utf-8'))
        )

        def flush(batch):
            ok, ko = process_events(batch)
            write_parquet(ok)
            # commit offsets at the end of the batch
            consumer.commit()
            logger.info(f'batch size={len(batch)} ok={len(ok)} ko={len(ko)}')

        batch = []
        for i, msg in enumerate(consumer):
            batch.append(msg.value)
            if len(batch) >= 50 or i+1 >= max_messages:
                flush(batch)
                batch = []
                if i+1 >= max_messages:
                    break
        if batch:  # partial batch left when the consumer timed out
            flush(batch)
        consumer.close()
    except Exception as e:
        print('Kafka not available, skip this section:', e)

# consume_kafka(200)  # Uncomment for a live test
```

## 7) Best practices and extensions

### 📈 **Observability: Metrics, Logging & Alerting**

**Three Pillars of Observability:**

1. **Metrics**: What's happening? (aggregations, rates, gauges)
2. **Logs**: Why did it happen? (detailed context, debugging)
3. **Traces**: How did it flow? (distributed tracing, latency breakdown)

**Streaming Pipeline Metrics (RED Method):**

```python
from prometheus_client import Counter, Histogram, Gauge, Summary

# Rate: Events/sec processed
events_processed = Counter(
    'kafka_events_processed_total',
    'Total events processed',
    ['status']  # Labels: success, validation_error, duplicate
)

# Errors: Validation failures, exceptions
validation_errors = Counter(
    'kafka_validation_errors_total',
    'Schema validation errors',
    ['error_type']
)

# Duration: Latency distribution
processing_latency = Histogram(
    'kafka_processing_seconds',
    'Event processing latency',
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0]  # buckets above 1 s, so a P95 > 1s can be observed
)

# Additional metrics
kafka_lag = Gauge(
    'kafka_consumer_lag',
    'Consumer lag (log end offset minus committed offset)',
    ['topic', 'partition']
)

checkpoint_db_size = Gauge(
    'checkpoint_db_size_bytes',
    'Checkpoint database size'
)

batch_size = Summary(
    'kafka_batch_size',
    'Batch size distribution'
)
```

**Usage in the Pipeline:**

```python
import time

def process_batch(events):
    start = time.time()
    batch_size.observe(len(events))
    
    good, bad = [], []
    for evt in events:
        if not validator.validate(evt):
            events_processed.labels(status='validation_error').inc()
            validation_errors.labels(error_type='schema').inc()
            bad.append(evt)
            continue
        
        if is_seen(evt['event_id']):
            events_processed.labels(status='duplicate').inc()
            continue
        
        good.append(evt)
        events_processed.labels(status='success').inc()
    
    duration = time.time() - start
    processing_latency.observe(duration)
    
    return good, bad
```

**Structured Logging with Loguru:**

```python
from loguru import logger
import sys

# Configuration
logger.remove()  # Remove default handler
logger.add(
    sys.stderr,
    format="{time:YYYY-MM-DD HH:mm:ss} | {level: <8} | {extra[request_id]} | {message}",
    level="INFO",
    serialize=True  # JSON output
)

# Enriched context: bound values are stored in the record's "extra" dict
logger = logger.bind(request_id="batch-123", consumer_group="pi2-cg")

# Logging in the pipeline (keyword arguments are also added to "extra")
logger.info("Processing batch", batch_size=len(events), topic="pi2_events")
logger.warning("High validation errors", error_rate=0.15, threshold=0.10)
logger.error("Kafka consumer lagging", lag_seconds=300, max_lag=60)

# Output (simplified; the real serialized record nests these fields under "record" and "extra"):
{
  "timestamp": "2025-10-30 14:23:45",
  "level": "INFO",
  "message": "Processing batch",
  "request_id": "batch-123",
  "consumer_group": "pi2-cg",
  "batch_size": 150,
  "topic": "pi2_events"
}
```

**Alerting Rules (Prometheus/Alertmanager):**

```yaml
groups:
  - name: kafka_pipeline
    interval: 30s
    rules:
      # High processing latency
      - alert: HighProcessingLatency
        expr: histogram_quantile(0.95, sum(rate(kafka_processing_seconds_bucket[5m])) by (le)) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 1s"
          description: "{{ $value }}s processing time"
      
      # High consumer lag
      - alert: ConsumerLagging
        expr: kafka_consumer_lag > 10000
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Consumer lag > 10k events"
          description: "Topic {{ $labels.topic }} lag: {{ $value }}"
      
      # High error rate (rate() is per second, so multiply by 60 for a per-minute threshold)
      - alert: HighValidationErrors
        expr: sum(rate(kafka_validation_errors_total[5m])) * 60 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Validation errors > 10/min"
```

**Backpressure Handling:**

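```python
import json
import time
from kafka import KafkaConsumer

MAX_BATCH_SIZE = 500
MAX_PROCESSING_TIME = 30  # seconds

def consume_with_backpressure():
    consumer = KafkaConsumer(
        'pi2_events',
        bootstrap_servers='localhost:9092',
        group_id='pi2-cg',
        enable_auto_commit=False,
        value_deserializer=lambda v: json.loads(v.decode('utf-8')),
        max_poll_records=MAX_BATCH_SIZE,  # Limit batch size
        max_poll_interval_ms=300000       # 5 min timeout
    )

    while True:
        start = time.time()
        batch = []

        # Accumulate until the batch is full, the time budget is spent, or the topic is idle
        while len(batch) < MAX_BATCH_SIZE and time.time() - start < MAX_PROCESSING_TIME:
            # poll() returns {TopicPartition: [records]}
            records = consumer.poll(timeout_ms=100, max_records=MAX_BATCH_SIZE - len(batch))
            if not records:
                break
            for partition_records in records.values():
                batch.extend(r.value for r in partition_records)

        if not batch:
            continue

        # Process the batch, then commit its offsets
        process_batch(batch)
        consumer.commit()

        # Pause if backpressure is detected (optional)
        if len(batch) >= MAX_BATCH_SIZE * 0.9:
            logger.warning("Backpressure detected, pausing 5s")
            time.sleep(5)
```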

**Dashboard (Grafana Queries):**

```promql
# Throughput (events/sec)
rate(kafka_events_processed_total[1m])

# Error rate
sum(rate(kafka_events_processed_total{status!="success"}[5m])) 
/ 
sum(rate(kafka_events_processed_total[5m]))

# P95 latency
histogram_quantile(0.95, sum(rate(kafka_processing_seconds_bucket[5m])) by (le))

# Consumer lag trend
kafka_consumer_lag
```

**SLI/SLO Example:**

```python
# Service Level Indicator: % successful events
SLI = (success_events / total_events) * 100

# Service Level Objective: 99.5% success rate
SLO = 99.5

# Error Budget: 0.5% = 1,200 failed events/day (@ 10k events/hour = 240k events/day)
error_budget = (100 - SLO) / 100 * total_events

if SLI < SLO:
    alert("SLO breach: pausing non-critical deployments")
```
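
The error-budget arithmetic above can be made concrete with a couple of small functions. This is a sketch using the same figures (10k events/hour, 99.5% SLO). `error_budget` and `budget_remaining` are hypothetical helper names, and the SLO is assumed to be below 100% so the budget is never zero.

```python
def error_budget(total_events: int, slo_percent: float) -> int:
    """Number of events that may fail in the window without breaching the SLO."""
    return int(total_events * (100 - slo_percent) / 100)


def budget_remaining(total_events: int, failed_events: int, slo_percent: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget(total_events, slo_percent)
    return (budget - failed_events) / budget


daily_events = 10_000 * 24  # 10k events/hour, as in the example above
daily_budget = error_budget(daily_events, 99.5)
remaining = budget_remaining(daily_events, 300, 99.5)
print(daily_budget, remaining)
```

Tracking the remaining fraction rather than a pass/fail flag makes it possible to alert early, for example when less than a quarter of the daily budget is left.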


- Retries with a DLQ (queue for failed messages) and traceability by `event_id`.
- Idempotency with a durable checkpoint (SQLite/Redis/DB) and expiration.
- Backpressure: control batch size and latency limits.
- Observability: export metrics to Prometheus and emit structured logs.
- Security: keep PII out of logs; encrypt in transit and at rest where applicable.