# 🏛️ Modern Data Architectures: Lambda, Kappa, Delta, and Data Mesh

Objective: understand and contrast reference architectures (Lambda, Kappa, Delta) and organizational patterns (Data Mesh), with design examples and trade-offs.

- Duration: 120–150 min
- Difficulty: High
- Prerequisites: Mid level completed, experience with batch and streaming

## 1. Lambda Architecture

### 🏗️ **Lambda Architecture: Batch + Speed Layer**

**Origin and Motivation (Nathan Marz, 2011):**

```
The classic problem:
- Batch processing: accurate but slow (hours/days)
- Stream processing: fast but complex (state, failures)

The Lambda solution:
- Batch Layer: source of truth, immutable, reprocessable
- Speed Layer: fast approximation, eventually consistent
- Serving Layer: merges both views
```

**Full Architecture:**

```
┌─────────────────────────────────────────────────────────────┐
│                     DATA SOURCES                             │
│  • Kafka Topics                                              │
│  • Database CDC (Debezium)                                   │
│  • Application Logs                                          │
│  • IoT Streams                                               │
└─────────────────┬───────────────────────────────────────────┘
                  │
        ┌─────────┴──────────┐
        │                    │
        ▼                    ▼
┌──────────────┐    ┌──────────────┐
│ BATCH LAYER  │    │ SPEED LAYER  │
│              │    │              │
│ • Spark Batch│    │ • Flink      │
│ • Hadoop MR  │    │ • Spark SS   │
│ • Hive       │    │ • Storm      │
│              │    │              │
│ Runs: Daily  │    │ Runs: Real-  │
│       @2AM   │    │       time   │
└──────┬───────┘    └──────┬───────┘
       │                   │
       │ Batch Views       │ Real-time Views
       │ (Immutable)       │ (Mutable)
       │                   │
       ▼                   ▼
┌─────────────────────────────────┐
│      SERVING LAYER               │
│                                  │
│  Query(t) = Batch(0→t-1h) +     │
│             Speed(t-1h→t)        │
│                                  │
│  • Druid                         │
│  • Cassandra                     │
│  • ElasticSearch                 │
│  • BigQuery                      │
└─────────────┬───────────────────┘
              │
              ▼
        ┌──────────┐
        │   API    │
        │Dashboard │
        └──────────┘
```

**Detailed Components:**

**1. Batch Layer (Immutable Truth):**

```python
# Characteristics:
# - Processes the full history
# - Recomputable from scratch
# - Optimized for throughput (not latency)
# - Simple fault tolerance (restart)

# Example: daily aggregations with Spark
from pyspark.sql import SparkSession
# Note: this import shadows Python's built-in sum() with Spark's column sum
from pyspark.sql.functions import col, sum, count, countDistinct, window

spark = SparkSession.builder \
    .appName("BatchLayer") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

# Read the ENTIRE history (full recompute)
raw_events = spark.read \
    .format("parquet") \
    .load("s3://datalake/raw/events/")

# Heavy aggregations
user_stats_batch = raw_events \
    .groupBy("user_id", "date") \
    .agg(
        count("*").alias("total_events"),
        sum("revenue").alias("total_revenue"),
        countDistinct("session_id").alias("sessions")
    )

# Write to the serving layer
user_stats_batch.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("date") \
    .save("s3://datalake/serving/user_stats_batch")

# Schedule: daily @2AM (Airflow DAG)
# Duration: 2-4 hours to process 1 year of data
# Cost: $$$ (large cluster, but only once a day)
```

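**2. Speed Layer (Low Latency Incremental):**

```python
# Characteristics:
# - Recent data only (last few hours)
# - Low latency (<1 min)
# - Mutable state (incremental aggregations)
# - Exactly-once is hard to get right

# Example: Spark Structured Streaming
from delta.tables import DeltaTable
from pyspark.sql.functions import approx_count_distinct, from_json
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

# Assumed JSON payload of the 'events' topic
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("session_id", StringType()),
    StructField("revenue", DoubleType()),
    StructField("event_timestamp", TimestampType()),
])

# Read only NEW events; Kafka delivers bytes, so parse the JSON value first
events_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "events") \
    .option("startingOffsets", "latest") \
    .load() \
    .select(from_json(col("value").cast("string"), event_schema).alias("e")) \
    .select("e.*")

# Same logic as batch (but incremental).
# Exact countDistinct is not supported on streams, so sessions is approximate.
user_stats_speed = events_stream \
    .withWatermark("event_timestamp", "10 minutes") \
    .groupBy("user_id", window("event_timestamp", "1 hour")) \
    .agg(
        count("*").alias("total_events"),
        sum("revenue").alias("total_revenue"),
        approx_count_distinct("session_id").alias("sessions")
    )

SPEED_PATH = "s3://datalake/serving/user_stats_speed"

def upsert_windows(batch_df, batch_id):
    # The Delta sink accepts only append/complete output, so updated windows
    # are merged per micro-batch. Assumes the table at SPEED_PATH already exists.
    DeltaTable.forPath(spark, SPEED_PATH).alias("t").merge(
        batch_df.alias("s"),
        "t.user_id = s.user_id AND t.window.start = s.window.start"
    ).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

# Write to the serving layer (separate table from batch)
query = user_stats_speed.writeStream \
    .outputMode("update") \
    .foreachBatch(upsert_windows) \
    .option("checkpointLocation", "/checkpoints/speed") \
    .start()

# Runs: continuously
# Latency: 30s - 2 min
# Cost: $$ (small cluster, but 24/7)
```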
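The interaction between the 1-hour window and the 10-minute watermark is easier to see without a cluster. The following is a minimal pure-Python sketch of the idea, not Spark's implementation: `WindowedCounter` and its constants are hypothetical names, event times are plain epoch seconds, and the watermark advances per event (Spark advances it per micro-batch).

```python
from collections import defaultdict

WINDOW_SECONDS = 3600    # 1-hour tumbling windows
ALLOWED_LATENESS = 600   # 10-minute watermark delay


class WindowedCounter:
    """Counts events per tumbling window and drops events that arrive too late."""

    def __init__(self):
        self.counts = defaultdict(int)  # window_start -> event count
        self.max_event_time = 0         # highest event time seen so far

    def watermark(self) -> int:
        # Events older than this are assumed to have already arrived
        return self.max_event_time - ALLOWED_LATENESS

    def add(self, event_time: int) -> bool:
        """Add one event; return False if its window is already closed."""
        window_start = event_time - event_time % WINDOW_SECONDS
        window_end = window_start + WINDOW_SECONDS
        if window_end <= self.watermark():
            return False  # late event: the window was finalized, so drop it
        self.counts[window_start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        return True


counter = WindowedCounter()
for t in (100, 4000, 3500, 4300):
    counter.add(t)           # 3500 is out of order but still within lateness
dropped = not counter.add(200)  # window [0, 3600) closed once watermark hit 3700
```

The trade-off is the usual one: a longer lateness allowance accepts more out-of-order events but keeps window state open longer.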

**3. Serving Layer (Query Merge):**

```python
# Merge batch + speed at query time

from datetime import datetime, time
from typing import Optional

from pyspark.sql import functions as F

def get_user_stats(user_id: str, as_of: Optional[datetime] = None) -> dict:
    """
    Merge the batch and speed views for one user.
    Assumes both views are registered as tables and that the daily batch
    job has finished every day before as_of's date.
    """
    if as_of is None:
        as_of = datetime.now()

    # Align the boundary to midnight: the batch view is keyed by whole days,
    # so a mid-day cutoff would leave a gap between the two views.
    batch_cutoff = datetime.combine(as_of.date(), time.min)

    # Batch layer: history before the cutoff.
    # DataFrame filters avoid interpolating user_id into SQL text.
    batch_row = spark.table("user_stats_batch") \
        .where((F.col("user_id") == user_id)
               & (F.col("date") < F.lit(batch_cutoff.date()))) \
        .agg(F.sum("total_events").alias("events"),
             F.sum("total_revenue").alias("revenue")) \
        .first()

    # Speed layer: recent windows from the cutoff up to as_of
    speed_row = spark.table("user_stats_speed") \
        .where((F.col("user_id") == user_id)
               & (F.col("window.start") >= F.lit(batch_cutoff))
               & (F.col("window.start") <= F.lit(as_of))) \
        .agg(F.sum("total_events").alias("events"),
             F.sum("total_revenue").alias("revenue")) \
        .first()

    # A global aggregate always returns one row; sums are NULL if nothing matched
    total_events = (batch_row["events"] or 0) + (speed_row["events"] or 0)
    total_revenue = (batch_row["revenue"] or 0) + (speed_row["revenue"] or 0)

    return {
        'user_id': user_id,
        'total_events': total_events,
        'total_revenue': total_revenue,
        'as_of': as_of,
        'batch_cutoff': batch_cutoff
    }
```
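The merge rule `Query(t) = Batch + Speed` is only correct when the two ranges neither overlap nor leave a gap. One way to keep them contiguous is to align the cutoff to a day boundary, since the batch view is keyed by whole days. The sketch below shows that rule in plain Python; `merge_views` is a hypothetical helper, and it assumes the batch job has already completed every day before `as_of`'s date.

```python
from datetime import date, datetime, time


def merge_views(batch_rows, speed_rows, as_of):
    """
    batch_rows: list of (day: date, events: int) from the batch view
    speed_rows: list of (window_start: datetime, events: int) from the speed view
    Returns the total event count visible at as_of.
    """
    cutoff = datetime.combine(as_of.date(), time.min)  # midnight of as_of's day
    # Batch covers complete days strictly before the cutoff day
    batch_total = sum(n for day, n in batch_rows if day < cutoff.date())
    # Speed covers windows from the cutoff up to as_of
    speed_total = sum(n for start, n in speed_rows if cutoff <= start <= as_of)
    return batch_total + speed_total


batch_rows = [
    (date(2025, 10, 28), 10),
    (date(2025, 10, 29), 20),
    (date(2025, 10, 30), 5),   # partial row for today: ignored, speed covers it
]
speed_rows = [
    (datetime(2025, 10, 29, 23), 7),   # before cutoff: already in batch
    (datetime(2025, 10, 30, 0), 3),
    (datetime(2025, 10, 30, 9), 4),
    (datetime(2025, 10, 30, 11), 100), # after as_of: not visible yet
]
total = merge_views(batch_rows, speed_rows, datetime(2025, 10, 30, 10, 30))
```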

**Advantages of Lambda:**

```python
ventajas = {
    "1. Robustness": """
        Batch layer is immutable → easy debugging
        Speed layer fails → batch is still there as a fallback
        Disaster recovery: recompute from raw data
    """,
    
    "2. Guaranteed Accuracy": """
        Batch guarantees correctness (full recompute)
        Speed exists only for latency; batch corrects its errors
    """,
    
    "3. Separation of Concerns": """
        Batch: optimizes throughput (large partitions, columnar)
        Speed: optimizes latency (small batches, in-memory)
        Each layer uses the technology best suited to its job
    """,
    
    "4. Auditability": """
        Immutable raw data → audit trail for compliance (e.g. SOX);
        GDPR erasure requests still need a way to delete personal data
        History can be recomputed for audits
    """
}
```

**Disadvantages (Why Is Lambda in Decline?):**

```python
desventajas = {
    "1. Duplicated Code": """
        The SAME logic is implemented twice:
        - batch_aggregations.py (Spark Batch)
        - speed_aggregations.py (Spark Streaming)
        
        Logic change → update BOTH
        Testing → double the effort
        Bugs → the two versions can diverge silently
    """,
    
    "2. Operational Complexity": """
        Two pipelines to maintain:
        - Batch: Airflow DAG, cluster management, retry logic
        - Speed: Streaming query monitoring, checkpoint management
        
        2x infrastructure, 2x alerts, 2x on-call
    """,
    
    "3. Eventual Consistency": """
        There is a window during which batch and speed are not aligned
        
        Example:
        10:00 AM: Event arrives
        10:01 AM: Speed layer processes it → visible in the dashboard
        02:00 AM (next day): Batch recompute → corrects small errors
        
        Users may see slightly different numbers
    """,
    
    "4. Cost (illustrative figures)": """
        Batch cluster: Large (100 nodes) @ 2-4h/day
        Speed cluster: Medium (20 nodes) @ 24/7
        
        Total: ~$50K/month for a mid-sized org
        
        vs unified Lakehouse: ~$30K/month
    """
}
```

**Real-World Use Cases (When Lambda Makes Sense):**

```python
# 1. LEGACY SYSTEMS with an existing batch layer
"""
Company: Retailer with 20 years of Hadoop
Situation: 500 Hive tables, hundreds of Spark jobs
Need: Add real-time without rewriting everything

Solution: Lambda
- Keep the existing batch layer (legacy stays untouched)
- Add a speed layer with Flink for critical cases
- Gradual migration, domain by domain
"""

# 2. HIGH ACCURACY required
"""
Company: Financial trading
Situation: Regulations require exact recomputation
Need: Audits may ask to recalculate 5 years back

Solution: Lambda
- Batch layer guarantees exact reproducibility
- Speed layer for real-time trading
- Eventual consistency is acceptable (batch corrects)
"""

# 3. VERY DIFFERENT SLAs for batch vs stream
"""
Company: IoT with billions of sensors
Situation: 
  - Critical alerts: <100ms (anomalies)
  - Historical analysis: days are fine (trends)

Solution: Lambda
- Speed layer: Flink for ultra-fast alerts
- Batch layer: Hadoop/Hive for heavy analysis
- Very different technologies, hard to unify
"""
```

**Migrating from Lambda → Modern Stack:**

```python
# Strategy: unified batch+stream with Delta Lake

# BEFORE (Lambda):
# 1. Batch: daily Spark job (4h, 100 nodes)
# 2. Speed: Flink continuous (24/7, 20 nodes)
# 3. Serving: Druid (merge queries)

# AFTER (Delta/Lakehouse):
# 1. Unified: Spark Structured Streaming (24/7, 30 nodes)
# 2. Micro-batches every 5 min
# 3. Delta Lake: ACID transactions, no merge needed

# Migration timeline (illustrative figures):
"""
Month 1-2: Set up Delta Lake infrastructure
Month 3-4: Migrate batch pipelines to Delta
Month 5-6: Migrate speed pipelines to Structured Streaming
Month 7: Dual-run (Lambda + Delta) for validation
Month 8: Cut over to Delta, decommission Lambda
Month 9: Cleanup, optimize

Total: 9 months
Cost: $200K engineering + $50K infra = $250K
Savings: $20K/month ongoing (payback: $250K / $20K ≈ 12.5 months)
"""
```
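The payback period follows directly from the figures quoted in the timeline ($200K engineering plus $50K infrastructure, against $20K/month in savings). A minimal sketch of that arithmetic, with `payback_months` as a hypothetical helper and ignoring discounting or ramp-up:

```python
def payback_months(one_time_cost: float, monthly_savings: float) -> float:
    """Months until cumulative savings cover the one-time migration cost."""
    if monthly_savings <= 0:
        raise ValueError("monthly_savings must be positive")
    return one_time_cost / monthly_savings


migration_cost = 200_000 + 50_000   # engineering + infrastructure
months = payback_months(migration_cost, 20_000)
print(f"Payback: {months:.1f} months")
```

With these inputs the migration pays for itself in a little over a year; a real business case would also account for the dual-run month, during which both stacks are billed.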

**Real-World Example: LinkedIn (Lambda-Style Pipeline):**

```python
# The Lambda architecture was proposed by Nathan Marz (2011), not by LinkedIn.
# LinkedIn ran a Lambda-style batch + stream setup for roughly 2011-2018
# (dates and figures below are the author's summary).

arquitectura_linkedin = {
    "Batch Layer": {
        "Technology": "Hadoop MapReduce → Spark",
        "Data": "Kafka topics replicated to HDFS",
        "Frequency": "Daily @midnight",
        "Output": "Hive tables (member profiles, connections, jobs)"
    },
    
    "Speed Layer": {
        "Technology": "Storm → Samza → Kafka Streams",
        "Data": "Kafka topics read directly",
        "Latency": "<1 second",
        "Output": "Espresso (NoSQL) for low-latency reads"
    },
    
    "Serving Layer": {
        "Technology": "Espresso + Voldemort (key-value stores)",
        "Pattern": "API merges batch + speed at read time"
    },
    
    "2018 Migration": """
        LinkedIn moved to unified streaming (Samza + Kafka)
        Reason: complexity of maintaining the logic twice
        Result: -30% code, -40% operations
    """
}
```

---
**Author:** Luis J. Raigoso V. (LJRV)

### 🔄 **Kappa Architecture: Stream-Only Simplification**

**Origin (Jay Kreps, LinkedIn/Confluent, 2014):**

```
Critique of Lambda:
"Why maintain 2 systems if streaming can do everything?"

The Kappa proposal:
- Remove the batch layer entirely
- Everything is streaming (batch = replaying the log)
- An immutable log as the single source of truth
```

**Architecture:**

```
┌─────────────────────────────────────────┐
│         DATA SOURCES                     │
│  • Applications                          │
│  • Databases (CDC)                       │
│  • IoT Devices                           │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│    KAFKA (Immutable Log)                │
│                                          │
│  • Infinite retention (or very long)    │
│  • Partitioned & Replicated             │
│  • Serves as "Master Dataset"           │
│                                          │
│  Topics:                                 │
│  ├─ events (retention: 1 year)          │
│  ├─ transactions (retention: 5 years)   │
│  └─ user_actions (retention: 90 days)   │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│   STREAM PROCESSING                     │
│                                          │
│  Version 1: Flink Job                   │
│  ├─ Aggregations                        │
│  ├─ Joins                               │
│  └─ Output → Cassandra                  │
│                                          │
│  Version 2: New logic needed            │
│  ├─ Deploy new Flink job                │
│  ├─ Replay from offset 0                │
│  └─ Output → Cassandra v2               │
│                                          │
│  🔄 Reprocessing = Replay log           │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│       SERVING LAYER                     │
│  • Cassandra                            │
│  • ElasticSearch                        │
│  • Redis                                │
│  • PostgreSQL                           │
└─────────────────────────────────────────┘
```

**Key Principles:**

```python
principios_kappa = {
    "1. Everything is a Stream": """
        No distinction between batch and stream
        
        Traditional batch:
        SELECT SUM(revenue) FROM sales WHERE date = '2025-10-30'
        
        Kappa:
        Consume Kafka topic 'sales', aggregate, done
        (No conceptual difference from real-time)
    """,
    
    "2. Immutable Log as Source of Truth": """
        Kafka = database of record
        
        Advantages:
        - Time travel: replay from any offset
        - Debugging: reproduce a bug with the exact data
        - Migration: deploy the new version, replay, compare
        
        Example (illustrative):
        offset 0 → offset 1M (1 year of data)
        Reprocessing: ~4 hours with 50 partitions
    """,
    
    "3. Schema Evolution in Log": """
        The log contains every historical schema
        
        V1: {"user_id": 123, "action": "click"}
        V2: {"user_id": 123, "action": "click", "device": "mobile"}
        V3: {"user_id": 123, "event_type": "click", "metadata": {...}}
        
        Consumers must handle all versions gracefully
    """,
    
    "4. Reprocessing for Code Changes": """
        Logic change → no batch job to rewrite
        
        Workflow:
        1. Deploy the new stream processor (version 2)
        2. Replay from offset 0 (in parallel with v1)
        3. Validate that v2 output matches expectations
        4. Cut traffic over to v2
        5. Decommission v1
        
        Duration: hours/days (not weeks of rewriting batch)
    """
}
```
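Principle 3 puts the burden on consumers: a replay from offset 0 will encounter V1, V2, and V3 records in the same topic. A minimal sketch of a normalizing step is shown below. `normalize_event` is a hypothetical helper, and the assumption that V3 carries the device inside `metadata` is made up for illustration, since the source elides the V3 metadata contents.

```python
def normalize_event(raw: dict) -> dict:
    """Map any known schema version onto one internal shape."""
    # V3 renamed 'action' to 'event_type'; accept either
    action = raw.get("event_type", raw.get("action"))
    if action is None:
        raise ValueError("event has neither 'action' nor 'event_type'")
    # Assumption: V3 nests the device under 'metadata'; V1 has no device at all
    metadata = raw.get("metadata") or {}
    device = raw.get("device", metadata.get("device"))
    return {"user_id": raw["user_id"], "action": action, "device": device}


v1 = normalize_event({"user_id": 123, "action": "click"})
v2 = normalize_event({"user_id": 123, "action": "click", "device": "mobile"})
v3 = normalize_event(
    {"user_id": 123, "event_type": "click", "metadata": {"device": "tablet"}}
)
```

In practice a schema registry with compatibility rules usually takes over this role, but the principle is the same: downstream logic sees one shape regardless of when the record was written.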

**Implementation with Kafka + Flink:**

```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Kafka source over a long-retention topic
t_env.execute_sql("""
    CREATE TABLE events (
        user_id STRING,
        event_type STRING,
        product_id STRING,
        revenue DOUBLE,
        event_timestamp TIMESTAMP(3),
        WATERMARK FOR event_timestamp AS event_timestamp - INTERVAL '10' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'events',
        'properties.bootstrap.servers' = 'kafka:9092',
        'properties.group.id' = 'user-stats-processor-v2',
        'scan.startup.mode' = 'earliest-offset',  -- Replay from the beginning
        'format' = 'json',
        'json.timestamp-format.standard' = 'ISO-8601'
    )
""")

# Stream processing (unified batch+stream).
# DAY is a reserved keyword in Flink SQL, hence the column name stat_day.
user_stats = t_env.sql_query("""
    SELECT 
        user_id,
        TUMBLE_START(event_timestamp, INTERVAL '1' DAY) AS stat_day,
        COUNT(*) AS total_events,
        SUM(CASE WHEN event_type = 'purchase' THEN revenue ELSE 0 END) AS total_revenue,
        COUNT(DISTINCT product_id) AS unique_products
    FROM events
    GROUP BY 
        user_id,
        TUMBLE(event_timestamp, INTERVAL '1' DAY)
""")

# Output to the serving store. Flink ships no Cassandra connector for the
# Table/SQL API (only a DataStream sink), so this sketch writes to PostgreSQL
# through the JDBC connector. URL, table name, and credentials are placeholders.
t_env.execute_sql("""
    CREATE TABLE user_stats_sink (
        user_id STRING,
        stat_day TIMESTAMP(3),
        total_events BIGINT,
        total_revenue DOUBLE,
        unique_products BIGINT,
        PRIMARY KEY (user_id, stat_day) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://postgres:5432/analytics',
        'table-name' = 'user_stats_v2',
        'username' = '<user>',
        'password' = '<password>'
    )
""")

user_stats.execute_insert('user_stats_sink').wait()

# Reprocessing:
# 1. Deploy this job as v2 alongside the running v1 job
# 2. 'earliest-offset' makes it read the topic from the beginning
# 3. It reprocesses 1 year of data in ~4 hours (illustrative)
# 4. Validate user_stats_v2 against the v1 output, then switch readers over
# 5. Decommission v1
```

**Kafka Retention Compared:**

```python
# Typical configuration for Kappa (disk and cost figures are illustrative).
# "Compression" below means message compression, not log compaction
# (cleanup.policy=compact), which keeps only the latest record per key
# and would discard the history Kappa depends on.
kafka_retention = {
    "Short Retention (Anti-Kappa)": {
        "Config": "retention.ms = 604800000  # 7 days",
        "Disk": "1 TB / partition",
        "Cost": "$500/month",
        "Reprocessing": "❌ Only the last 7 days are available",
        "Use Case": "Lambda architecture (Kafka is only a buffer)"
    },
    
    "Long Retention (Kappa)": {
        "Config": "retention.ms = 31536000000  # 1 year",
        "Disk": "52 TB / partition (with compression)",
        "Cost": "$2,600/month (S3 tiered storage)",
        "Reprocessing": "✅ A full year is available",
        "Use Case": "Kappa architecture (Kafka is the source of truth)"
    },
    
    "Infinite Retention (Extreme Kappa)": {
        "Config": "retention.ms = -1  # Forever",
        "Disk": "Unbounded (requires tiered storage)",
        "Cost": "$5K/month (S3 archival)",
        "Reprocessing": "✅ Full history",
        "Use Case": "Event sourcing, strict auditability"
    }
}

# Kafka Tiered Storage (KIP-405: early access in Kafka 3.6, production-ready in 3.9)
# Archives older log segments to S3 and keeps recent ones on SSD
"""
server.properties (broker):
  remote.log.storage.system.enable=true
  remote.log.storage.manager.class.name=<RemoteStorageManager implementation>

Topic config:
  remote.storage.enable=true
  local.retention.ms=604800000   # keep 7 days on local disk
  
Layout:
  Local SSD: last 7 days (hot data)
  S3 bucket: >7 days (cold data, compressed)
  
Cost reduction: 70% vs all-SSD
"""
```
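The `retention.ms` values in the table are just day counts expressed in milliseconds, which is easy to get wrong by a factor of 1000. A small sketch for deriving them, where `retention_ms` and `topic_retention` are hypothetical names and the per-topic day counts are the ones shown in the Kappa diagram:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def retention_ms(days: int) -> int:
    """Convert a retention period in days to Kafka's retention.ms value."""
    if days < 0:
        raise ValueError("days must be non-negative; use -1 directly for 'forever'")
    return days * MS_PER_DAY


# Per-topic retention from the Kappa diagram, expressed as topic configs
topic_retention = {
    "events": {"retention.ms": str(retention_ms(365))},
    "transactions": {"retention.ms": str(retention_ms(5 * 365))},
    "user_actions": {"retention.ms": str(retention_ms(90))},
}
```

The resulting dictionaries can be passed to whatever admin tooling is in use; Kafka expects config values as strings.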

**Advantages of Kappa:**

```python
ventajas_kappa = {
    "1. Operational Simplicity": """
        One pipeline → one system to monitor
        One language/framework (Flink or Spark)
        One cluster to operate
        
        On-call rotation:
        Lambda: 2 systems × 2 teams = 4 rotations/week
        Kappa: 1 system × 1 team = 1 rotation/week
    """,
    
    "2. No Duplicated Code": """
        Same logic for batch and stream
        
        Lambda: 2000 lines batch.py + 2000 lines stream.py = 4000
        Kappa: 2000 lines unified.py = 2000 (50% less code)
        
        Bug fix: 1 change vs 2 changes
        Testing: 1 suite vs 2 suites
    """,
    
    "3. Fast Reprocessing": """
        Logic change → replay the log in hours
        
        Lambda:
        - Rewrite the batch job (days of dev)
        - Run it on a large cluster (4h)
        - Validate, debug, iterate
        Total: 1-2 weeks
        
        Kappa:
        - Modify the stream job (hours of dev)
        - Replay from offset 0 (4h)
        - Compare old vs new output
        Total: 1 day
    """,
    
    "4. Simpler Exactly-Once": """
        Kafka transactions + idempotent producers
        
        Flink + Kafka:
        - Kafka offsets stored in Flink checkpoints
        - Transactional writes to sinks
        - No need to merge batch+speed
        
        Lambda:
        - Batch is at-least-once (retries)
        - Speed is exactly-once (complex)
        - The merge can double-count without deduplication
    """
}
```

**Disadvantages of Kappa:**

```python
desventajas_kappa = {
    "1. Storage Cost": """
        Kafka with 1 year of retention = expensive
        
        Example: 10K events/s × 10 KB/event × 1 year
        = 3.15 PB/year raw
        With compression + tiered storage: 500 TB
        Cost: $5K/month S3 + $10K/month Kafka brokers
        
        vs Hadoop HDFS: $2K/month (but no streaming)
    """,
    
    "2. Long Reprocessing": """
        Replaying 1 year of data = hours/days
        
        1 TB data @ 100 MB/s = 2.8 hours ideal
        Real (with processing): 6-12 hours
        
        During reprocessing:
        - New code runs in parallel (double cost)
        - Dual output (old version + new version)
        - Manual validation is needed
        
        Lambda:
        - Batch recompute overnight (no user impact)
        - Speed layer keeps serving traffic
    """,
    
    "3. Complex State Management": """
        Streaming state for 1 year of data = large
        
        Example: user profiles with aggregations
        10M users × 1 KB state = 10 GB state
        
        Flink RocksDB:
        - State on disk (slower)
        - Long checkpointing (minutes)
        - Long recovery after a crash (restore state)
        
        Lambda batch:
        - No state (stateless join against full tables)
        - Simpler, but slower
    """,
    
    "4. No Per-Workload Optimization": """
        Streaming must serve ALL cases
        
        Lambda:
        - Batch: columnar Parquet, predicate pushdown, Z-order
        - Speed: row-based, in-memory, indexes
        
        Kappa:
        - One compromise format (optimal for neither)
        - Complex queries (5-table joins) are slow on a stream
        - Simple queries (count) are overkill on a stream
    """
}
```
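The storage and replay figures above are back-of-envelope estimates that can be reproduced in a few lines. This sketch uses decimal units (1 KB = 1,000 bytes, 1 PB = 10^15 bytes) and a 365-day year; the function names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def raw_bytes_per_year(events_per_sec: int, bytes_per_event: int) -> int:
    """Uncompressed log volume produced in one year."""
    return events_per_sec * bytes_per_event * SECONDS_PER_YEAR


def replay_hours(total_bytes: float, throughput_bytes_per_sec: float) -> float:
    """Ideal replay time: read-only, no processing overhead."""
    return total_bytes / throughput_bytes_per_sec / 3600


yearly = raw_bytes_per_year(10_000, 10_000)        # 10K events/s × 10 KB
yearly_pb = yearly / 10**15
one_tb_hours = replay_hours(10**12, 100 * 10**6)   # 1 TB at 100 MB/s

print(f"{yearly_pb:.2f} PB/year raw, {one_tb_hours:.1f} h to replay 1 TB")
```

Note that the ideal replay time scales linearly: at the same 100 MB/s aggregate throughput, the 500 TB compressed log would take months, which is why replay speed depends on partition count and consumer parallelism.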

**Ideal Use Cases for Kappa:**

```python
# 1. REAL-TIME FIRST with little history
"""
Company: Fintech fraud detection
Data: 90 days of transactions
Processing: <100ms latency required
Reprocessing: weekly (tuning ML models)

Kappa fits well:
- 90 days in Kafka (100 TB, manageable)
- Flink streaming for real-time detection
- Weekly replay to tune models (4h)
"""

# 2. Pure EVENT SOURCING
"""
Company: Booking platform
Pattern: event sourcing (everything is an immutable event)
Data: bookings, cancellations, modifications
Queries: rebuild current state from events

Kappa is a natural fit:
- Kafka as the event store (immutable log)
- Flink materializes views (current bookings)
- Replay to build new views (add "bookings by hotel")
"""

# 3. HIGH CHANGE FREQUENCY in logic
"""
Company: Ad-tech with constant A/B testing of models
Situation: logic changes 2-3x/week
Need: fast deploy, validate, iterate

Kappa advantage:
- Modify the stream job (1h dev)
- Deploy v2 in parallel with v1 (30 min)
- Replay the last day (30 min)
- Compare outputs, cut over (1h)
Total: 3 hours per change vs 3 days with Lambda
"""

# 4. SMALL-MEDIUM data scale
"""
Company: SaaS startup
Data: <100 TB total
Users: <1M
Traffic: <1K events/s

Kappa advantage:
- One Kafka+Flink cluster (simple)
- 1-year retention is easy (10 TB in Kafka)
- A small team can maintain it
"""
```
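The "compare outputs, cut over" step is where a parallel run earns its cost, and it is worth automating. Below is a minimal sketch of such a comparison, assuming both job versions write one numeric value per key (for example revenue per user). `diff_outputs` and the sample data are hypothetical.

```python
def diff_outputs(v1: dict, v2: dict, tolerance: float = 0.0) -> dict:
    """Compare keyed numeric outputs of two job versions."""
    missing = sorted(k for k in v1 if k not in v2)   # produced by v1 only
    extra = sorted(k for k in v2 if k not in v1)     # produced by v2 only
    changed = sorted(
        k for k in v1 if k in v2 and abs(v1[k] - v2[k]) > tolerance
    )
    return {"missing": missing, "extra": extra, "changed": changed}


v1_output = {"user_a": 1.0, "user_b": 2.0, "user_c": 3.0}
v2_output = {"user_a": 1.0, "user_b": 2.5, "user_d": 4.0}
report = diff_outputs(v1_output, v2_output)
```

A non-empty report is not necessarily a failure, since the new logic is expected to change some values; the point is to make every difference explicit so it can be reviewed before cutover.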

**Anti-Patterns (When NOT to Use Kappa):**

```python
# ❌ 1. HUGE data scale with complex queries
"""
Company: Retailer with 10+ years of history
Data: 100 PB in HDFS
Queries: complex SQL (20+ joins, window functions)

Problem with Kappa:
- 100 PB in Kafka = $500K/month (vs $50K on HDFS)
- Streaming joins are slow compared with columnar batch
- Replaying 10 years = weeks (not viable)

Better: Lambda or Delta Lakehouse
"""

# ❌ 2. BATCH-FIRST workloads
"""
Company: Data warehouse with nightly reports
Queries: batch ETL is 90% of the workload, streaming 10%
Latency: hours are fine for reports

Problem with Kappa:
- Streaming overhead for batch workloads
- Unnecessary Kafka infrastructure
- Spark batch is more efficient than Spark Streaming for batch

Better: traditional batch (Airflow + Spark)
"""

# ❌ 3. STRICT compliance where reprocessing is prohibited
"""
Company: Bank under strict regulation
Requirement: audits need exact "as-of" data
Prohibition: past calculations may not be "redone"

Problem with Kappa:
- Reprocessing is a core feature (but the regulator forbids it)
- Immutability requires a separate batch layer

Better: Lambda with an audited/certified batch layer
"""
```

**Real-World Example: Uber (Kappa in Production):**

```python
uber_kappa = {
    "Case": "Surge pricing (dynamic prices)",
    
    "Architecture": {
        "Events": "ride_requests, driver_locations, ride_completions",
        "Kafka": "500K events/s, 7-day retention (sufficient)",
        "Flink": "Windowed aggregations (supply/demand per area)",
        "Output": "Redis (current prices), Cassandra (history)",
        "Latency": "<200ms end-to-end"
    },
    
    "Kappa Benefits": """
        1. Pricing algorithm changes frequently (weekly)
        2. Replay 7 days to validate a new algorithm (2h)
        3. A/B testing: run 2 versions in parallel, compare
        4. No batch layer needed (only the last few days matter)
    """,
    
    "Evolution": """
        2015: Lambda (Spark batch + Storm streaming)
        2016: Migrated to Kappa (Samza streaming only)
        2018: Moved to Flink (better stateful processing)
        2020: Hybrid (Kappa for real-time, Hudi for analytics)
        
        Reason for hybrid: long-term analytics needs batch optimizations
    """
}
```

---

### 🏠 **Delta Architecture (Lakehouse): Unified Batch+Stream**

**Natural Evolution (2020+):**

```
Lambda (2011): Separate batch + speed layers
   ↓
Kappa (2014): Streaming only (but expensive, complex)
   ↓
Delta (2020): Unified on top of transactional storage
```

**Lakehouse Architecture:**

```
┌─────────────────────────────────────────────────────────────┐
│                    DATA SOURCES                              │
│  • Applications → Kafka                                      │
│  • Databases → CDC (Debezium)                                │
│  • Batch files → S3/ADLS                                     │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
          ┌──────────────────┐
          │  INGESTION LAYER │
          │                  │
          │  Spark Streaming │ ← Unified engine!
          │  (micro-batches) │
          └────────┬─────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────────────┐
│              DELTA LAKE / ICEBERG STORAGE                     │
│                                                               │
│  Bronze (Raw)           Silver (Cleaned)      Gold (Curated) │
│  ┌────────────┐        ┌────────────┐       ┌────────────┐  │
│  │ Append-only│        │ Deduplicated│      │Aggregations│  │
│  │ Parquet +  │   →    │ Validated   │  →   │Star Schema │  │
│  │ Delta Log  │        │ Enriched    │      │Business    │  │
│  │            │        │             │      │Logic       │  │
│  └────────────┘        └────────────┘       └────────────┘  │
│                                                               │
│  Features:                                                    │
│  ✅ ACID Transactions                                        │
│  ✅ Time Travel (rollback, audit)                           │
│  ✅ Schema Evolution                                         │
│  ✅ Upserts/Deletes                                          │
│  ✅ Unified Batch+Stream reads                               │
└────────────────────┬──────────────────────────────────────────┘
                     │
          ┌──────────┴─────────┐
          │                    │
          ▼                    ▼
  ┌───────────────┐    ┌──────────────┐
  │ BATCH QUERIES │    │  STREAMING   │
  │               │    │   QUERIES    │
  │ Spark SQL     │    │ Spark SS     │
  │ Presto/Trino  │    │ Flink        │
  │ Athena        │    │ Real-time    │
  └───────────────┘    └──────────────┘
          │                    │
          └──────────┬─────────┘
                     ▼
            ┌─────────────────┐
            │   CONSUMPTION   │
            │                 │
            │  • BI Tools     │
            │  • ML Training  │
            │  • APIs         │
            │  • Dashboards   │
            └─────────────────┘
```

**Key Features:**

```python
delta_features = {
    "1. ACID Transactions on Object Storage": """
        The traditional problem:
        S3/ADLS have no transactions
        → Race conditions, partial writes
        
        Delta Lake solution (log entry simplified):
        _delta_log/00000000000000000123.json
        {
          "commitInfo": {"timestamp": "2025-10-30T10:30:00"},
          "add": [
            {"path": "part-00000.parquet", "size": 1024000},
            {"path": "part-00001.parquet", "size": 1024000}
          ],
          "remove": []
        }
        
        Atomic commit: writing the log entry = the whole commit
        Readers see a consistent version (snapshot isolation)
    """,
    
    "2. Unified Batch + Stream": """
        SAME TABLE for batch and streaming
        
        Streaming write:
        df.writeStream
          .format("delta")
          .outputMode("append")
          .option("checkpointLocation", "/checkpoints/events")
          .start("/delta/events")
        
        Batch read:
        spark.read
          .format("delta")
          .load("/delta/events")
          .where("date = '2025-10-30'")
        
        Streaming read:
        spark.readStream
          .format("delta")
          .load("/delta/events")  # Reads incremental changes
        
        No separate batch layer needed!
    """,
    
    "3. Time Travel": """
        Access to historical versions
        
        # Read a specific version
        df = spark.read
          .format("delta")
          .option("versionAsOf", 42)
          .load("/delta/events")
        
        # Read as of a specific timestamp
        df = spark.read
          .format("delta")
          .option("timestampAsOf", "2025-10-29 00:00:00")
          .load("/delta/events")
        
        Use cases:
        - Rollback: bad deployment → revert
        - Audit: "What data did the model see yesterday?"
        - Debugging: "Reproduce the bug with the exact data"
        - Compliance: "Show data as-of Q3 2025"
    """,
    
    "4. Schema Evolution": """
        Add columns without downtime
        
        V1: {user_id, action, timestamp}
        V2: {user_id, action, timestamp, device}  ← Add column
        
        Delta:
        df_v2.write
          .format("delta")
          .mode("append")
          .option("mergeSchema", "true")
          .save("/delta/events")
        
        Rows written before the change: device = null
        Rows written after the change: device is populated
        
        No backfill needed!
    """,
    
    "5. Upserts (MERGE)": """
        CDC upserts built in (the MERGE below is SCD Type 1;
        SCD Type 2 needs extra logic to close out old rows)
        
        from delta.tables import DeltaTable
        
        delta_table = DeltaTable.forPath(spark, "/delta/users")
        
        delta_table.alias("target").merge(
            updates.alias("source"),
            "target.user_id = source.user_id"
        ).whenMatchedUpdate(set={
            "email": "source.email",
            "updated_at": "source.updated_at"
        }).whenNotMatchedInsert(values={
            "user_id": "source.user_id",
            "email": "source.email",
            "created_at": "source.created_at",
            "updated_at": "source.updated_at"
        }).execute()
        
        Streaming UPSERT:
        updates.writeStream
          .foreachBatch(lambda df, _: upsert_function(df))
          .start()
    """
}
```
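Features 1 and 3 both fall out of the same mechanism: a table version is whatever set of files results from replaying the commit log up to that point. The sketch below illustrates the idea in plain Python. It is a deliberately simplified model, not the Delta protocol (the real log stores one JSON action per line and uses periodic checkpoints); `snapshot_at` and the sample commits are hypothetical.

```python
def snapshot_at(commits: list, version: int) -> set:
    """Replay add/remove actions to get the live data files at a version."""
    if not 0 <= version < len(commits):
        raise ValueError(f"version {version} does not exist")
    files = set()
    for commit in commits[: version + 1]:
        files -= set(commit.get("remove", []))  # files logically deleted
        files |= set(commit.get("add", []))     # files newly written
    return files


# Version 0 writes two files; version 1 rewrites a.parquet as c.parquet
# (e.g. an UPDATE); version 2 appends d.parquet.
commits = [
    {"add": ["a.parquet", "b.parquet"]},
    {"add": ["c.parquet"], "remove": ["a.parquet"]},
    {"add": ["d.parquet"]},
]

latest = snapshot_at(commits, 2)
as_of_v0 = snapshot_at(commits, 0)   # "time travel" to the first version
```

Because a removed file is only dropped from the log's view and not deleted immediately, older versions stay readable until the underlying files are physically cleaned up.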

**Medallion Architecture (Bronze/Silver/Gold):**

```python
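# BRONZE: raw data, append-only
from pyspark.sql.functions import col

# "kafka:9092" is a placeholder broker address; the Kafka source requires
# kafka.bootstrap.servers to be set.
bronze_events = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "events") \
    .load() \
    .select(
        col("value").cast("string").alias("raw_json"),
        col("timestamp").alias("ingestion_time")
    )

bronze_events.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/checkpoints/bronze") \
    .option("mergeSchema", "true") \
    .start("/delta/bronze/events")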

# SILVER: cleaned, validated, deduplicated
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_timestamp", TimestampType())
])

# Validation: drop rows without user_id.
# Dedup: the watermark bounds the state kept by dropDuplicates
# ("1 hour" is an illustrative lateness bound, not a recommendation).
silver_events = spark.readStream \
    .format("delta") \
    .load("/delta/bronze/events") \
    .select(
        from_json(col("raw_json"), event_schema).alias("data")
    ) \
    .select("data.*") \
    .filter(col("user_id").isNotNull()) \
    .withWatermark("event_timestamp", "1 hour") \
    .dropDuplicates(["user_id", "event_timestamp"])

silver_events.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/silver") \
    .start("/delta/silver/events")

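# GOLD: business aggregations
from pyspark.sql.functions import window, count, approx_count_distinct

# The watermark lets Spark finalize each daily window and emit it once in
# append mode; the Delta sink accepts append/complete but not update.
# Exact distinct counts are not supported on streaming DataFrames, so an
# approximate count is used. "2 hours" is an illustrative lateness bound.
gold_user_stats = spark.readStream \
    .format("delta") \
    .load("/delta/silver/events") \
    .withWatermark("event_timestamp", "2 hours") \
    .groupBy("user_id", window("event_timestamp", "1 day")) \
    .agg(
        count("*").alias("daily_events"),
        approx_count_distinct("event_type").alias("event_types")
    )

gold_user_stats.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/gold") \
    .start("/delta/gold/user_daily_stats")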

# BI tool reads from Gold
df_dashboard = spark.read \
    .format("delta") \
    .load("/delta/gold/user_daily_stats") \
    .where("window.start >= current_date() - interval 7 days")
```
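The Bronze → Silver → Gold flow can also be followed on a handful of rows without Spark. The sketch below is a plain-Python stand-in, with hypothetical sample payloads and helper names (`to_silver`, `to_gold`); it mirrors the same three steps: keep raw JSON, validate and deduplicate, then aggregate per user and day.

```python
import json
from collections import defaultdict

# Hypothetical raw payloads, as they might land in the Bronze layer.
bronze = [
    '{"user_id": "u1", "event_type": "click", "event_timestamp": "2025-10-29T10:00:00"}',
    '{"user_id": "u1", "event_type": "click", "event_timestamp": "2025-10-29T10:00:00"}',
    '{"user_id": null, "event_type": "view", "event_timestamp": "2025-10-29T10:05:00"}',
    '{"user_id": "u1", "event_type": "view", "event_timestamp": "2025-10-29T11:00:00"}',
    '{"user_id": "u2", "event_type": "click", "event_timestamp": "2025-10-30T09:00:00"}',
]


def to_silver(raw_rows):
    """Parse JSON, drop rows without user_id, dedup on (user_id, event_timestamp)."""
    seen, silver_rows = set(), []
    for raw in raw_rows:
        event = json.loads(raw)
        if event.get("user_id") is None:
            continue  # validation: user_id is mandatory
        key = (event["user_id"], event["event_timestamp"])
        if key in seen:
            continue  # duplicate delivery of the same event
        seen.add(key)
        silver_rows.append(event)
    return silver_rows


def to_gold(silver_rows):
    """Aggregate per (user_id, day): event count and number of distinct event types."""
    groups = defaultdict(list)
    for event in silver_rows:
        day = event["event_timestamp"][:10]  # YYYY-MM-DD prefix of the ISO timestamp
        groups[(event["user_id"], day)].append(event["event_type"])
    return {
        key: {"daily_events": len(types), "event_types": len(set(types))}
        for key, types in groups.items()
    }


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Unlike the streaming version, this toy holds everything in memory and counts distinct event types exactly; it is only meant to show what each layer contributes.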

**Ventajas sobre Lambda/Kappa:**

```python
ventajas_delta = {
    "vs Lambda": {
        "Código Unificado": """
            Lambda: 2 pipelines (batch.py + stream.py)
            Delta: 1 pipeline (unified.py)
            
            Reduction: roughly half the pipeline code and tests to maintain (illustrative, not a measured figure)
        """,
        
        "Sin Merge Layer": """
            Lambda: Query = merge(batch, speed) en serving
            Delta: Query directo a Delta (single source)
            
            Latency: no query-time merge step
            Consistency: Guaranteed (ACID transactions)
        """,
        
        "Operaciones Simples": """
            Lambda: Mantener 2 clusters + serving layer
            Delta: 1 cluster Spark + Delta storage
            
            On-call: fewer systems to monitor, so fewer alert sources
        """
    },
    
    "vs Kappa": {
        "Costo Storage": """
            Kappa: Kafka 1 año = $15K/mes
            Delta: S3 object storage = $2K/mes
            
            Savings: 85% storage costs
        """,
        
        "Batch Optimized": """
            Kappa: Streaming para todo (no óptimo)
            Delta: Columnar Parquet + statistics
            
            Query performance: 10x faster para analytics
        """,
        
        "Reprocessing Flexible": """
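            Kappa: Replay Kafka (limited retention)
            Delta: Time travel within the configured retention
                   (log retention and VACUUM settings)
            
            Auditability: multi-year history is possible if retention
            is configured for it (at extra storage cost)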
        """
    },
    
    "vs Traditional Warehouse": {
        "Cost": """
            Warehouse: $25/TB/month (Snowflake)
            Delta: $0.023/GB/month (S3) = $23/TB/month
            
            Savings: ~8% on storage at these list prices ($2/TB/month)
        """,
        
        "Flexibility": """
            Warehouse: Vendor lock-in, SQL only
            Delta: Open format, Spark/Presto/Flink/Python
            
            ML Integration: Native (same storage)
        """,
        
        "Scalability": """
            Warehouse: Scale up (expensive)
            Delta: Scale out (S3 infinite)
            
            Growth: No limits, pay-as-you-go
        """
    }
}
```
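The storage percentages in this comparison come from simple arithmetic on the quoted prices. A small sketch to recompute them, or to plug in your own numbers, is shown here; the dollar figures are the ones quoted in the block above and the helper name `savings_pct` is hypothetical.

```python
def savings_pct(baseline, alternative):
    """Relative saving of `alternative` versus `baseline`, in percent."""
    return (baseline - alternative) / baseline * 100


# Figures quoted in the comparison above (list prices, storage only).
WAREHOUSE_USD_PER_TB_MONTH = 25.0
OBJECT_STORE_USD_PER_GB_MONTH = 0.023
object_store_usd_per_tb_month = OBJECT_STORE_USD_PER_GB_MONTH * 1000  # decimal TB

KAFKA_USD_PER_MONTH = 15_000
OBJECT_STORE_USD_PER_MONTH = 2_000

warehouse_vs_lake = savings_pct(WAREHOUSE_USD_PER_TB_MONTH, object_store_usd_per_tb_month)
kafka_vs_lake = savings_pct(KAFKA_USD_PER_MONTH, OBJECT_STORE_USD_PER_MONTH)

print(f"Warehouse vs object storage: {warehouse_vs_lake:.1f}% saved per TB")
print(f"Kafka retention vs object storage: {kafka_vs_lake:.1f}% saved per month")
```

This covers storage only; compute, egress and licensing usually dominate the total bill and are not modelled here.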

**Casos de Uso Reales:**

```python
# 1. Databricks - Inventor de Delta Lake
"""
Cliente: Comcast (Telecomunicaciones)
Datos: 2 PB diarios (network logs, streaming video, IoT)
Arquitectura anterior: Lambda (Hadoop batch + Kafka streaming)

Migration a Delta:
- 2018: Pilot con 10 TB (1 mes)
- 2019: Migration 500 TB (6 meses)
- 2020: Full adoption 2 PB/día

Resultados:
- Cost: -30% ($2M/año savings)
- Latency: 4h → 15 min para dashboards
- Code: -50% (unified batch+stream)
- Incidents: -60% (ACID transactions)

Stack:
- Ingestion: Spark Streaming
- Storage: Delta Lake en S3
- Compute: Databricks clusters
- Consumption: Tableau + ML models
"""

# 2. Riot Games (League of Legends)
"""
Caso: Game analytics + Player behavior
Datos: 100M+ games/day, petabytes de events
Latency: <5 min para anti-cheat, <1 hour para balance

Delta Architecture:
┌──────────────┐
│ Game Servers │ → Kafka (1M events/s)
└──────────────┘
       ↓
┌──────────────┐
│ Spark Stream │ → Delta Bronze (raw events)
└──────────────┘
       ↓
┌──────────────┐
│ Enrichment   │ → Delta Silver (validated)
└──────────────┘
       ↓
    ┌──┴──┐
    │     │
    ▼     ▼
┌────┐  ┌────┐
│ ML │  │ BI │ → Delta Gold (aggregations)
└────┘  └────┘

Benefits:
- Unified: Game designers query same data ML uses
- Time travel: "Replay match from 2 days ago"
- Schema evolution: Add features without downtime
"""

# 3. Adobe Experience Platform
"""
Challenge: 100s of tenants, isolated data, compliance
Solution: Delta Lake multi-tenant

Architecture:
/delta/tenant_001/events/
/delta/tenant_002/events/
...
/delta/tenant_500/events/

Features:
- Row-level security (GDPR compliance)
- Tenant isolation (Delta sharing)
- Unified operations (single platform)
- Time travel (data recovery per tenant)

Scale:
- 500+ tenants
- 10 PB total data
- 100K writes/s
- <100ms p99 latency
"""
```

**Migración Lambda → Delta:**

```python
migration_strategy = {
    "Phase 1: Setup (Month 1-2)": """
        1. Provision Delta Lake infrastructure
           - S3 bucket con versioning
           - Unity Catalog setup
           - Spark clusters (Databricks/EMR)
        
        2. Crear medallion structure
           /delta/bronze/
           /delta/silver/
           /delta/gold/
        
        3. Setup CI/CD pipelines
           - GitHub Actions para deploy
           - Testing framework (pytest)
           - Monitoring (Datadog/Prometheus)
    """,
    
    "Phase 2: Bronze Layer (Month 3-4)": """
        1. Migrate raw ingestion
           Kafka → Delta Bronze (append-only)
           
        2. Dual-write period
           - Write to both Lambda batch + Delta
           - Compare row counts, checksums
           
        3. Backfill histórico
           HDFS/Hive → Delta Bronze
           (1-time copy, then delete HDFS)
    """,
    
    "Phase 3: Silver Layer (Month 5-6)": """
        1. Migrate transformation logic
           Batch Spark jobs → Delta pipelines
           
        2. Unified batch+stream
           - Remove duplicate code
           - Single pipeline para ambos
           
        3. Data quality checks
           Great Expectations integration
    """,
    
    "Phase 4: Gold Layer (Month 7-8)": """
        1. Migrate aggregations
           Speed layer → Delta Gold micro-batches
           
        2. Decommission serving layer
           Druid/Cassandra → query Delta directly
           
        3. BI tools reconfigure
           Point Tableau/Power BI to Delta
    """,
    
    "Phase 5: Validation (Month 9)": """
        1. Parallel run
           - Lambda still running (backup)
           - Delta serving production traffic
           
        2. Compare metrics
           - Latency, accuracy, cost
           - User feedback (BI teams)
        
        3. Fix edge cases
    """,
    
    "Phase 6: Cutover (Month 10)": """
        1. Decommission Lambda
           - Stop batch jobs
           - Stop speed layer
           - Delete Druid/Cassandra
        
        2. Cleanup
           - Delete HDFS clusters
           - Reclaim savings
        
        3. Training
           - Team training on Delta
           - Documentation update
    """,
    
    "Total": """
        Duration: 10 months
        Cost: $300K engineering + $100K infra
        Ongoing savings: $50K/mes
        ROI: 8 months
    """
}
```
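The dual-write step in Phase 2 ("compare row counts, checksums") can be sketched in plain Python. The helper names below are hypothetical, and in practice the same idea would be expressed as aggregate queries on both systems; the point is that the checksum must not depend on row order, because the two pipelines rarely write rows in the same sequence.

```python
import hashlib
import json


def table_fingerprint(rows):
    """Return (row_count, checksum); the checksum ignores row order."""
    digest = 0
    for row in rows:
        canonical = json.dumps(row, sort_keys=True, default=str).encode("utf-8")
        row_hash = int.from_bytes(hashlib.sha256(canonical).digest()[:8], "big")
        # Sum instead of XOR, so duplicated rows do not cancel each other out.
        digest = (digest + row_hash) % (1 << 64)
    return len(rows), digest


def compare_dual_write(legacy_rows, delta_rows):
    """Compare the legacy batch output with the new Delta output."""
    legacy_count, legacy_sum = table_fingerprint(legacy_rows)
    delta_count, delta_sum = table_fingerprint(delta_rows)
    return {
        "row_count_match": legacy_count == delta_count,
        "checksum_match": legacy_sum == delta_sum,
    }


legacy = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]
same_rows_other_order = [{"id": 2, "amount": 5.5}, {"id": 1, "amount": 10.0}]
drifted = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.0}]

print(compare_dual_write(legacy, same_rows_other_order))
print(compare_dual_write(legacy, drifted))
```

A mismatch only says that something differs; finding which rows differ needs a keyed diff on top of this check.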

**Best Practices:**

```python
best_practices = {
    "1. Partition Strategy": """
        ✅ Date partitioning (daily/hourly)
        partitionBy("date")
        
        ⚠️ Z-ordering para high-cardinality
        OPTIMIZE events ZORDER BY (user_id, product_id)
        
        ❌ Over-partitioning
        partitionBy("date", "hour", "user_id")  # Millones de particiones!
    """,
    
    "2. Compaction": """
        Auto-compact for streaming (table property):
        ALTER TABLE events SET TBLPROPERTIES
          ('delta.autoOptimize.autoCompact' = 'true')
        
        Scheduled OPTIMIZE:
        OPTIMIZE events WHERE date >= current_date() - 7
        
        VACUUM for cleanup:
        VACUUM events RETAIN 168 HOURS  -- 7 days
    """,
    
    "3. Schema Evolution": """
        Use mergeSchema only for intended additive changes:
        .option("mergeSchema", "true")
        
        Schema validation:
        Delta enforces the table schema on write by default;
        a write with a mismatched schema fails unless
        mergeSchema is set, so no extra option is needed.
    """,
    
    "4. Monitoring": """
        Metrics to track:
        - Write throughput (records/s)
        - Read latency (p50, p95, p99)
        - File count (small files problem)
        - Version count (retention policy)
        - Transaction log size
        
        Alerts:
        - File count >100K per partition
        - Transaction log >10 MB
        - Write latency p99 >5s
    """
}
```
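Two of the practices above lend themselves to a quick calculation: how many partitions a `partitionBy` choice can produce, and whether the monitored metrics cross the alert thresholds listed under Monitoring. The sketch below uses hypothetical helper names and an assumed user count of 100,000; the thresholds are the ones from the list above.

```python
def estimated_partitions(*cardinalities):
    """Upper bound on partition count: product of the column cardinalities."""
    total = 1
    for cardinality in cardinalities:
        total *= cardinality
    return total


# One year of data, partitioned by date only.
by_date = estimated_partitions(365)
# Same year, partitioned by date, hour and user_id (assumed 100,000 users).
by_date_hour_user = estimated_partitions(365, 24, 100_000)

# Alert thresholds taken from the monitoring list above.
ALERT_THRESHOLDS = {
    "files_per_partition": 100_000,
    "transaction_log_mb": 10,
    "write_latency_p99_s": 5,
}


def breached(metrics, thresholds=ALERT_THRESHOLDS):
    """Return the names of the metrics that exceed their threshold."""
    return sorted(
        name for name, limit in thresholds.items() if metrics.get(name, 0) > limit
    )


sample_metrics = {
    "files_per_partition": 250_000,
    "transaction_log_mb": 4,
    "write_latency_p99_s": 7.5,
}

print(by_date, by_date_hour_user)
print(breached(sample_metrics))
```

The product is an upper bound, since not every combination of values occurs, but it is usually enough to spot an over-partitioned layout before writing any data.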

---
**Autor:** Luis J. Raigoso V. (LJRV)

### 🌐 **Data Mesh: Paradigma Organizacional Descentralizado**

**Problema: Data Platform Centralizado (Monolito)**

```
Organización tradicional:
┌────────────────────────────────────────────────┐
│         CENTRAL DATA TEAM (20 personas)        │
│                                                 │
│  Responsable de:                                │
│  • Todos los pipelines                         │
│  • Todos los data models                       │
│  • Todos los dashboards                        │
│  • Todos los ML models                         │
└────────────┬───────────────────────────────────┘
             │
   ┌─────────┴─────────┐
   │  DATA WAREHOUSE   │
   │   (Single DB)     │
   └─────────┬─────────┘
             │
    ┌────────┴────────┐
    │                 │
    ▼                 ▼
┌────────┐      ┌──────────┐
│ Ventas │      │Logística │ ... (10+ dominios)
│ Team   │      │ Team     │
└────────┘      └──────────┘
   ↓                 ↓
Request            Request
pipeline          pipeline
   ↓                 ↓
  ⏰ 3 meses        ⏰ 3 meses
  🐛 Bugs           🐛 Bugs
  📉 Low quality    📉 Low quality

Problems:
❌ Bottleneck: the central team is overloaded (backlog of 6+ months)
❌ Context loss: the data team does not understand the business domain
❌ Scalability: the team cannot grow linearly with demand
❌ Ownership: nobody is accountable for data quality
```

**Solución: Data Mesh (Zhamak Dehghani, 2019)**

```
Data Mesh: Descentralización por dominio
┌─────────────────────────────────────────────┐
│        FEDERATED GOVERNANCE                  │
│  (Políticas centrales, enforcement local)    │
└────┬──────────────┬──────────────┬───────────┘
     │              │              │
┌────▼─────┐  ┌────▼─────┐  ┌────▼─────┐
│ VENTAS   │  │LOGÍSTICA │  │ FINANZAS │
│ DOMAIN   │  │ DOMAIN   │  │ DOMAIN   │
│          │  │          │  │          │
│ Data     │  │ Data     │  │ Data     │
│ Product  │  │ Product  │  │ Product  │
│ Owner    │  │ Owner    │  │ Owner    │
│          │  │          │  │          │
│ • ETL    │  │ • ETL    │  │ • ETL    │
│ • Quality│  │ • Quality│  │ • Quality│
│ • API    │  │ • API    │  │ • API    │
│ • Docs   │  │ • Docs   │  │ • Docs   │
└────┬─────┘  └────┬─────┘  └────┬─────┘
     │              │              │
     └──────────────┴──────────────┘
                    │
         ┌──────────▼──────────┐
         │ SELF-SERVE PLATFORM │
         │ (Infra común)       │
         │ • Spark/Airflow     │
         │ • Monitoring        │
         │ • Data Catalog      │
         └─────────────────────┘
```

**4 Principios Fundamentales:**

```python
principios_data_mesh = {
    "1. Domain-Oriented Decentralization": """
        Datos pertenecen al dominio de negocio
        
        Ejemplo: E-commerce
        ┌────────────────────────────────────────┐
        │ DOMAIN: Ventas                          │
        │ Owner: VP of Sales                      │
        │ Data Products:                          │
        │  • orders (transaccional)               │
        │  • orders_aggregated (analítico)        │
        │  • customer_segments (ML)               │
        │                                         │
        │ Team composition:                       │
        │  • Product Manager (prioridades)        │
        │  • Data Engineer (pipelines)            │
        │  • Analytics Engineer (dbt models)      │
        │  • Data Analyst (consumers)             │
        └────────────────────────────────────────┘
        
        Ventajas:
        ✅ Domain expertise: Equipo entiende negocio
        ✅ Velocity: No esperar central team
        ✅ Accountability: Owner claro de calidad
    """,
    
    "2. Data as a Product": """
        Cada dataset es un producto con:
        
        📋 SLA (Service Level Agreement):
        - Latency: <15 min para orders_aggregated
        - Availability: 99.9% uptime
        - Freshness: Updated cada 5 min
        - Quality: >95% completeness
        
        📖 Documentation:
        - Schema: Columnas, tipos, constraints
        - Lineage: De dónde viene cada campo
        - Examples: Query samples
        - Contact: Slack channel, owner email
        
        🔒 Access Control:
        - Public: Todos pueden leer
        - Private: Solo equipo ventas
        - PII: Masked para no-autorizados
        
        📊 Observability:
        - Metrics: Row count, size, update time
        - Alerts: SLA violations
        - Changelog: Version history
        
        Ejemplo manifest.yaml:
        ```
        product_name: orders_aggregated
        owner: ventas-data@company.com
        sla:
          latency_minutes: 15
          availability_percent: 99.9
          quality_score: 0.95
        schema_url: s3://catalog/ventas/orders_agg/schema.json
        lineage:
          sources:
            - orders_raw
            - customers
            - products
        access:
          read: [sales_team, finance_team, executives]
          write: [sales_data_engineers]
        ```
    """,
    
    "3. Self-Serve Data Platform": """
        Infraestructura común para todos los dominios
        
        Platform Components:
        ┌─────────────────────────────────────┐
        │ COMPUTE                              │
        │  • Spark clusters (on-demand)       │
        │  • Airflow (orchestration)          │
        │  • dbt (transformations)            │
        └─────────────────────────────────────┘
        
        ┌─────────────────────────────────────┐
        │ STORAGE                              │
        │  • Delta Lake (lakehouse)           │
        │  • S3 buckets (per-domain)          │
        │  • Unity Catalog (metadata)         │
        └─────────────────────────────────────┘
        
        ┌─────────────────────────────────────┐
        │ OBSERVABILITY                        │
        │  • Datadog (metrics, logs)          │
        │  • Great Expectations (quality)     │
        │  • OpenLineage (lineage tracking)   │
        └─────────────────────────────────────┘
        
        ┌─────────────────────────────────────┐
        │ GOVERNANCE                           │
        │  • Data Catalog (Amundsen/DataHub)  │
        │  • Access Control (Ranger/Unity)    │
        │  • Policy Engine (OPA)              │
        └─────────────────────────────────────┘
        
        Self-service workflow:
        1. Team crea repo: github.com/company/ventas-data
        2. Define product: manifest.yaml + dbt models
        3. CI/CD deploy: Tests → Stage → Prod
        4. Auto-register: Catalog actualizado
        5. Monitoring: Dashboards auto-generated
        
        Platform team provee:
        - Templates (cookiecutter)
        - Best practices docs
        - On-call support
        - Cost monitoring
    """,
    
    "4. Federated Computational Governance": """
        Políticas globales + Autonomía local
        
        GLOBAL POLICIES (Central governance):
        ┌────────────────────────────────────┐
        │ 1. Security                         │
        │    • PII must be encrypted at rest │
        │    • Access logs retained 1 year   │
        │    • MFA required for production   │
        │                                     │
        │ 2. Compliance (GDPR, CCPA)         │
        │    • Right to be forgotten         │
        │    • Data retention max 7 years    │
        │    • Audit trail required          │
        │                                     │
        │ 3. Quality                          │
        │    • All products have tests       │
        │    • SLA violations reported       │
        │    • Schema changes versioned      │
        │                                     │
        │ 4. Interoperability                │
        │    • Standard formats (Parquet)    │
        │    • Common IDs (user_id format)   │
        │    • Shared ontology (terms)       │
        └────────────────────────────────────┘
        
        LOCAL AUTONOMY (Domain teams):
        ┌────────────────────────────────────┐
        │ • Choose transformation tools       │
        │   (dbt, Spark, Python)             │
        │ • Define data models               │
        │ • Set refresh frequency            │
        │ • Manage access within domain      │
        │ • Optimize costs                   │
        └────────────────────────────────────┘
        
        Enforcement:
        - Automated checks in CI/CD
        - Policy-as-code (OPA policies)
        - Failing checks block deploy
        - Dashboards show compliance
    """
}
```
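The "automated checks in CI/CD" idea can be illustrated with a small validator for the data product manifest shown under principle 2. This is a sketch in Python rather than an OPA/Rego policy, and the rule set (`REQUIRED_FIELDS`, `check_manifest`) is an assumption for illustration, not a standard; a real pipeline would fail the deploy when the returned list is not empty.

```python
REQUIRED_FIELDS = ("product_name", "owner", "sla", "schema_url", "lineage", "access")
REQUIRED_SLA_FIELDS = ("latency_minutes", "availability_percent", "quality_score")


def check_manifest(manifest):
    """Return a list of policy violations; an empty list means the manifest passes."""
    violations = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in manifest]
    if "sla" in manifest:
        violations += [
            f"missing sla field: {name}"
            for name in REQUIRED_SLA_FIELDS
            if name not in manifest["sla"]
        ]
    if "access" in manifest and not manifest["access"].get("read"):
        violations.append("access.read must list at least one group")
    return violations


# Manifest content mirrors the manifest.yaml example, already parsed into a dict.
manifest = {
    "product_name": "orders_aggregated",
    "owner": "ventas-data@company.com",
    "sla": {"latency_minutes": 15, "availability_percent": 99.9, "quality_score": 0.95},
    "schema_url": "s3://catalog/ventas/orders_agg/schema.json",
    "lineage": {"sources": ["orders_raw", "customers", "products"]},
    "access": {
        "read": ["sales_team", "finance_team", "executives"],
        "write": ["sales_data_engineers"],
    },
}

incomplete = {k: v for k, v in manifest.items() if k != "owner"}

print(check_manifest(manifest))
print(check_manifest(incomplete))
```

Keeping the rules in code means every domain is checked the same way, while each team still decides how to build the product behind the manifest.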

**Implementación Real: Data Mesh Stack**

```python
# Estructura organizacional
data_mesh_org = {
    "Domain Teams (Autónomos)": {
        "Ventas Domain": {
            "Members": "8 personas",
            "Products": [
                "orders (transactional)",
                "orders_daily_agg (analytical)",
                "customer_rfm_segments (ML)"
            ],
            "Tech Stack": "Spark, dbt, Delta Lake",
            "Budget": "$50K/mes"
        },
        
        "Logística Domain": {
            "Members": "6 personas",
            "Products": [
                "shipments (operational)",
                "delivery_performance (metrics)",
                "route_optimization (ML)"
            ],
            "Tech Stack": "Flink, Kafka, Iceberg",
            "Budget": "$40K/mes"
        },
        
        "Marketing Domain": {
            "Members": "5 personas",
            "Products": [
                "campaigns (metadata)",
                "attribution (analytics)",
                "propensity_scores (ML)"
            ],
            "Tech Stack": "Airflow, Python, Delta Lake",
            "Budget": "$30K/mes"
        }
    },
    
    "Platform Team (Enabling)": {
        "Members": "10 personas",
        "Responsibilities": [
            "Maintain Spark/Airflow infrastructure",
            "Operate Data Catalog",
            "Enforce governance policies",
            "Cost optimization",
            "Training & documentation"
        ],
        "Budget": "$100K/mes",
        "Ratio": "target of 1 platform engineer per 20 domain engineers at scale (this example: 10 per 19)"
    },
    
    "Governance Team (Federated)": {
        "Members": "3 personas",
        "Responsibilities": [
            "Define global policies",
            "Compliance (GDPR, SOX)",
            "Audit & reporting",
            "Cross-domain standards"
        ],
        "Budget": "$20K/mes"
    }
}

# Data Product Example
data_product_spec = '''
# orders_daily_agg Data Product

## Overview
Daily aggregated orders for BI dashboards and reporting.

## Owner
- Team: Ventas Data Engineering
- Contact: ventas-data@company.com
- Slack: #ventas-data

## SLA
- Freshness: Updated daily @6AM UTC
- Latency: <30 minutes processing time
- Availability: 99.9% (max 8.76h downtime/year)
- Quality: >99% completeness, >98% accuracy

## Schema
| Column | Type | Description | PII |
|--------|------|-------------|-----|
| date | DATE | Order date | No |
| country | STRING | Country code (ISO) | No |
| total_orders | BIGINT | Count of orders | No |
| total_revenue | DECIMAL(10,2) | Sum of revenue (USD) | No |
| avg_order_value | DECIMAL(10,2) | Average order value | No |

## Lineage
~~~
orders_raw (Bronze)
    ↓
orders_validated (Silver)
    ↓
orders_daily_agg (Gold) ← THIS PRODUCT
~~~

## Access
- Read: sales_team, finance_team, executives
- Write: sales_data_engineers
- Admin: sales_domain_owner

## Usage Examples
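~~~sql
-- Get last 7 days
SELECT * FROM orders_daily_agg
WHERE date >= CURRENT_DATE - INTERVAL 7 DAYS
ORDER BY date DESC;

-- YoY comparison per country (assumes one row per country per day, no gaps;
-- 365 rows back is off by one day across leap years)
SELECT 
    date,
    country,
    total_revenue,
    LAG(total_revenue, 365) OVER (PARTITION BY country ORDER BY date) AS prev_year_revenue
FROM orders_daily_agg
WHERE date >= '2024-01-01';
~~~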

## Metrics
- Dashboard: https://monitoring.company.com/ventas/orders_agg
- Alerts: PagerDuty team "sales-data-oncall"

## Changelog
- 2025-10-01: Added avg_order_value column
- 2025-09-15: Changed country to ISO codes
- 2025-08-01: Initial release
'''
```
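The SLA numbers in the product spec can be checked mechanically. The sketch below, with hypothetical helper names, converts an availability target into a yearly downtime budget and tests a freshness condition against a last-update timestamp; the 24-hour maximum age is an assumption derived from the "updated daily" line, not something the spec states.

```python
from datetime import datetime, timedelta, timezone

HOURS_PER_YEAR = 365 * 24


def downtime_budget_hours(availability_percent):
    """Hours of downtime per year allowed by an availability target."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


def is_fresh(last_updated, now, max_age):
    """True if the product was refreshed within max_age of `now`."""
    return now - last_updated <= max_age


budget = downtime_budget_hours(99.9)
print(f"{budget:.2f} hours/year")

# Fixed timestamps keep the example deterministic.
now = datetime(2025, 10, 30, 7, 0, tzinfo=timezone.utc)
last_run = datetime(2025, 10, 30, 6, 0, tzinfo=timezone.utc)
stale_run = datetime(2025, 10, 28, 6, 0, tzinfo=timezone.utc)

print(is_fresh(last_run, now, timedelta(hours=24)))
print(is_fresh(stale_run, now, timedelta(hours=24)))
```

Checks like these are what the Observability section of a data product would run on a schedule and route to the on-call alert.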

**Ventajas de Data Mesh:**

```python
ventajas_mesh = {
    "1. Scalability": """
        Traditional: one central team serves every domain
        10 domains ÷ 1 central team = bottleneck
        
        Data Mesh: capacity grows with the number of domain teams
        10 domains × 10 engineers each = 100 engineers working in parallel
        
        Growth: adding a domain does not slow down the others
    """,
    
    "2. Domain Expertise": """
        Central team:
        "¿Qué significa 'adjusted_revenue'?"
        → Ask business (delays)
        
        Domain team:
        Engineers trabajan con business daily
        → Deep understanding, mejor calidad
    """,
    
    "3. Velocity": """
        Backlog central: 6 meses
        Domain ownership: 2 semanas
        
        Feature request → deploy:
        Central: 3-6 meses
        Mesh: 1-2 sprints
    """,
    
    "4. Quality": """
        Central team: Best effort (no ownership)
        Domain team: Accountable (SLAs, metrics)
        
        Data quality:
        Central: 85% average
        Mesh: 95% average (incentivized)
    """
}
```

**Desafíos de Data Mesh:**

```python
desafios_mesh = {
    "1. Organizational Maturity": """
        Requiere:
        - DevOps culture (CI/CD, testing)
        - Data literacy en equipos de negocio
        - Executive buy-in (budget descentralizado)
        
        NO funciona si:
        - Organización muy jerárquica
        - Equipos no autónomos
        - Sin cultura de ownership
        
        Ejemplo fracaso:
        Company X intentó mesh sin DevOps
        → Dominios no saben deploy pipelines
        → Rollback a central team en 6 meses
    """,
    
    "2. Platform Complexity": """
        Self-serve platform = sophisticated
        
        Requiere:
        - Multi-tenancy (aislamiento dominios)
        - Cost tracking per domain
        - Automated provisioning
        - Standardized observability
        
        Platform team: 10-15 personas mínimo
        Cost: $1M+/año (salaries + infra)
        
        Break-even: ~100+ data engineers total
    """,
    
    "3. Duplication Risk": """
        Sin governance fuerte:
        - 3 equipos crean "customer_dim" diferente
        - Métricas inconsistentes cross-domain
        - Redundant data copies (cost)
        
        Mitigación:
        - Data Catalog mandatory
        - Shared domain for common entities
        - Regular cross-domain sync meetings
    """,
    
    "4. Skillset Gap": """
        Domain engineers necesitan:
        - Data engineering (ETL)
        - Analytics engineering (dbt)
        - DevOps (CI/CD)
        - Domain knowledge (business)
        
        Hiring difficult: Full-stack data engineers
        Training: 6-12 meses ramp-up
        
        Alternative: Embedded specialists
        - 1 data engineer per 2 domains
        - Rotates between teams
    """
}
```
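The duplication risk (several teams publishing their own "customer_dim") is something a catalog can detect automatically. The sketch below assumes a very simple catalog shape, a mapping from domain name to the product names it publishes, and a hypothetical helper `find_duplicate_products`; a real catalog such as DataHub or Amundsen would be queried through its own API instead.

```python
from collections import defaultdict


def find_duplicate_products(catalog):
    """Return product names published by more than one domain."""
    owners = defaultdict(set)
    for domain, products in catalog.items():
        for product in products:
            owners[product].add(domain)
    return {
        product: sorted(domains)
        for product, domains in owners.items()
        if len(domains) > 1
    }


# Hypothetical catalog content for illustration.
catalog = {
    "ventas": ["orders", "customer_dim"],
    "logistica": ["shipments", "customer_dim"],
    "marketing": ["campaigns", "customer_dim", "attribution"],
}

print(find_duplicate_products(catalog))
```

Name collisions are only a first signal; products with different names can still overlap, which is why the mitigation list also calls for a shared domain and regular cross-domain reviews.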

**Casos de Uso (Cuándo Data Mesh Funciona):**

```python
# ✅ 1. LARGE ORG con múltiples dominios
"""
Empresa: Zalando (E-commerce Europeo)
Tamaño: 10,000+ empleados, 20+ países
Dominios: 50+ (Fashion, Logistics, Payments, ...)

Data Mesh adoption (2020):
- 50 domain teams autónomos
- Platform team: 25 personas
- 200+ data products published
- Self-serve: 90% requests no need central

Results:
- Time-to-market: 6 meses → 2 semanas
- Data quality: 80% → 95%
- Team satisfaction: +40%
"""

# ✅ 2. DISTRIBUTED company
"""
Empresa: Confluent (remote-first)
Dominios: Customer Success, Engineering, Sales
Challenge: Central team = timezone hell

Data Mesh benefits:
- Teams own data in their timezone
- Async collaboration (via catalog)
- No central bottleneck
"""

# ❌ 3. SMALL STARTUP (<100 personas)
"""
Mesh overhead > benefits
Better: Small central team (5 personas)
Mesh when: Reach 200+ engineers
"""

# ❌ 4. HIGHLY REGULATED (sin autonomía)
"""
Banking: Regulators require central control
Mesh autonomy conflicts with compliance
Better: Centralized with strong governance
"""
```

**Migration Path: Central → Data Mesh**

```python
migration_roadmap = {
    "Year 1: Foundation": """
        Q1: Executive alignment
        - Present mesh vision
        - Secure budget
        - Identify pilot domain
        
        Q2: Platform MVP
        - Self-serve Spark/Airflow
        - Data Catalog (Amundsen)
        - Basic CI/CD templates
        
        Q3-Q4: Pilot domain
        - Choose 1 domain (e.g., Sales)
        - Migrate 3 products
        - Learn lessons, iterate
    """,
    
    "Year 2: Scale": """
        Q1-Q2: Onboard 5 domains
        - Training programs
        - Platform improvements
        - Governance policies drafted
        
        Q3-Q4: Federated governance
        - Policy engine (OPA)
        - Compliance automation
        - Cross-domain standards
    """,
    
    "Year 3: Maturity": """
        Q1-Q2: Onboard remaining domains
        - 100% domains on mesh
        - Decommission central team legacy
        
        Q3-Q4: Optimization
        - Cost optimization
        - Advanced features (ML platform)
        - Community of practice
    """,
    
    "Total Investment": """
        Platform team: 10 personas × $150K × 3 años = $4.5M
        Training: $500K
        Tooling: $1M
        Total: $6M
        
        Break-even: Year 4 (velocity gains)
    """
}
```
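The investment total in the roadmap is a straightforward sum, reproduced here so the inputs can be changed. The cost figures are the ones from the roadmap; the annual benefit passed to `payback_years` is a purely hypothetical input, since the roadmap does not quantify the velocity gains.

```python
# Figures from the "Total Investment" entry above.
PLATFORM_ENGINEERS = 10
COST_PER_ENGINEER_YEAR = 150_000
YEARS = 3
TRAINING = 500_000
TOOLING = 1_000_000

platform_cost = PLATFORM_ENGINEERS * COST_PER_ENGINEER_YEAR * YEARS
total_investment = platform_cost + TRAINING + TOOLING


def payback_years(investment, annual_benefit):
    """Years needed to recover the investment at a constant annual benefit."""
    if annual_benefit <= 0:
        raise ValueError("annual_benefit must be positive")
    return investment / annual_benefit


print(platform_cost, total_investment)
# Hypothetical benefit of $2M per year, only to show how the function is used.
print(payback_years(total_investment, 2_000_000))
```

Because the benefit side is an estimate, it is worth running the calculation with a pessimistic and an optimistic value before committing to a break-even year.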

---

- Batch layer: full history, recomputable, high latency (Spark/Hadoop).
- Speed layer: incremental streaming, low latency (Kafka/Flink/Spark Streaming).
- Serving layer: merges the batch and speed views for queries (Druid/Cassandra/BigQuery).
- Trade-offs: duplicated logic (batch + stream), operational complexity, eventual consistency.
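The serving-layer merge is the part of Lambda that is easiest to get wrong, so a minimal sketch may help. It assumes both views hold per-user counters, that the batch view covers everything up to the last batch run, and that the speed view only covers events after that cutoff; the function name `serving_query` is hypothetical.

```python
from collections import Counter


def serving_query(batch_view, speed_view):
    """Merge the precomputed batch view with the recent speed view."""
    merged = Counter(batch_view)
    merged.update(speed_view)  # adds counts key by key
    return dict(merged)


# Batch view: counts up to the last nightly run.
batch_view = {"u1": 100, "u2": 40}
# Speed view: counts for events that arrived after that run.
speed_view = {"u1": 3, "u3": 1}

print(serving_query(batch_view, speed_view))
```

When the next batch run finishes, the speed view has to be reset to the new cutoff; otherwise the overlapping events are counted twice, which is one source of the operational complexity listed above.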

```python
lambda_diagram = '''
┌────────────┐
│   Data     │
│  Sources   │
└─────┬──────┘
      │
   ┌──┴───┐
   │Queue │ (Kafka)
   └──┬───┘
      │
 ┌────┴─────┐
 │  Batch   │  (nightly Spark jobs)
 │  Layer   │
 └────┬─────┘
      │
   Master
   Dataset
      │
  ┌───┴───┐
  │Serving│ (Druid/BigQuery)
  │ Layer │
  └───────┘
      ↑
   ┌──┴───┐
   │Speed │ (Flink/Spark Streaming, reads from the same queue)
   │Layer │
   └──────┘
'''
print(lambda_diagram)
```

## 2. Kappa Architecture

- A single streaming pipeline for everything (batch = replay of the log).
- Simplifies operations and keeps one unified code path.
- Requires a log with long retention (Kafka with extended retention) and the capacity to reprocess it.
- Trade-offs: log storage cost, state management complexity, less batch optimization.
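"Batch = replay of the log" means the same processing function serves both live traffic and reprocessing. The sketch below uses an in-memory list as a stand-in for a retained Kafka topic and a hypothetical `process` function that keeps a running total per user.

```python
def process(events, state=None):
    """Single processing function used for both live events and replays."""
    state = dict(state or {})
    for event in events:
        user = event["user_id"]
        state[user] = state.get(user, 0) + event["amount"]
    return state


# Stand-in for a retained log (in Kappa this would be a Kafka topic).
log = [
    {"user_id": "u1", "amount": 10},
    {"user_id": "u2", "amount": 5},
    {"user_id": "u1", "amount": 7},
]

# Live path: events are processed one at a time as they arrive.
live_state = {}
for event in log:
    live_state = process([event], live_state)

# Reprocessing path: replay the whole log from offset 0 with the same code.
replayed_state = process(log)

print(live_state)
print(replayed_state)
```

Both paths end in the same state because there is only one implementation of the logic; fixing a bug means deploying the new function and replaying the log, which is why the retention requirement matters.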

## 3. Delta Architecture (Lakehouse)

- Unifies batch and streaming on a single transactional storage layer (Delta/Iceberg).
- Streaming writes transactional micro-batches; batch reads history with time travel.
- Removes code duplication and simplifies governance (one catalog, one lineage).
- Trade-offs: requires adopting the table format and a compatible engine (Spark/Trino).

## 4. Data Mesh: an organizational paradigm

- Data ownership is decentralized by business domain.
- Data products: APIs/tables with SLOs, documentation, lineage and contracts.
- Self-service platform: shared tooling (CI/CD, catalog, observability).
- Federated governance: global policies plus local autonomy.
- Trade-offs: requires organizational maturity, solid infrastructure and cultural change.

```python
mesh_principles = '''
1. Domain-oriented decentralization: domain teams (sales, logistics, finance) own their data.
2. Data as a product: each dataset has an owner, SLO, versioning, documentation and guaranteed quality.
3. Self-serve platform: shared infrastructure (orchestration, observability, catalog) without silos.
4. Federated governance: security, privacy and quality policies defined centrally, applied locally.
'''
print(mesh_principles)
```

## 5. Comparison and when to choose each one

```python
import pandas as pd
comp = pd.DataFrame([
  {'Architecture':'Lambda', 'Latency':'Batch: hours, Speed: seconds', 'Complexity':'High (duplicated logic)', 'Use cases':'Legacy systems with mixed requirements'},
  {'Architecture':'Kappa', 'Latency':'Low (streaming)', 'Complexity':'Medium (one pipeline)', 'Use cases':'Everything is streaming, reprocessing is feasible'},
  {'Architecture':'Delta', 'Latency':'Low to medium (micro-batch)', 'Complexity':'Medium (one storage layer)', 'Use cases':'Lakehouse, batch/stream unification'},
  {'Architecture':'Data Mesh', 'Latency':'Variable (per domain)', 'Complexity':'High (organizational)', 'Use cases':'Large organizations with many domains'}
])
comp
```

## 6. Design exercise

Design the data architecture for an e-commerce company with:
- Sales domain: real-time transactions (Kafka).
- Logistics domain: nightly batch (inventory/shipments).
- Analytics domain: dashboards with latency < 5 min.
- Requirements: auditability, GDPR, lineage, optimized costs.

Answer: Lambda, Kappa, Delta or a hybrid? Is Data Mesh applicable? Justify your choice with trade-offs.