# Apache Spark Streaming: Real-Time Processing

## Learning Objectives
- Master Spark Structured Streaming
- Implement windowed processing and aggregations
- Integrate with Kafka and other streaming sources
- Manage state and checkpointing
- Optimize streaming performance

## Requirements
- PySpark 3.x (3.4+ for the `applyInPandasWithState` examples)
- Python 3.8+
- Kafka (optional)
- Delta Lake (optional)

In [3]:
# Check for and install dependencies
try:
    import pyspark
    import pandas
    import numpy
    print("✅ All dependencies are already installed")
except ImportError as e:
    print(f"⚠️ Missing package: {e.name}")
    print("📦 Installing required dependencies...")
    import subprocess
    import sys
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", "pyspark", "pandas", "numpy", "--quiet"],
        capture_output=True,
        text=True,
        timeout=300  # 5 minutes maximum
    )
    if result.returncode == 0:
        print("✅ Dependencies installed successfully")
    else:
        print(f"❌ Error: {result.stderr}")

✅ All dependencies are already installed


In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, window, count, sum as spark_sum, avg, max as spark_max,
    current_timestamp, to_json, from_json, struct, expr
)
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, 
    DoubleType, TimestampType
)
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import time

print("Libraries imported successfully")

Libraries imported successfully


### 🌊 **Spark Structured Streaming: Architecture and Micro-batches**

**Evolution of streaming in Spark:**

```
Gen 1: Spark Streaming (DStreams) - 2013
├── RDDs in micro-batches
├── Low-level API
└── ❌ Complex, not unified with batch

Gen 2: Structured Streaming - 2016
├── Unified DataFrames/Datasets
├── Event-time processing
├── Exactly-once semantics
└── ✅ Declarative API, fault tolerance
```

**Micro-batch vs Continuous Processing:**

| Aspect | Micro-batch (default) | Continuous (experimental) |
|--------|----------------------|---------------------------|
| **Latency** | ~100ms - 1s | ~1ms |
| **Throughput** | ✅ High (10K+ events/s) | ⚠️ Medium |
| **Guarantees** | ✅ Exactly-once | ⚠️ At-least-once |
| **State** | ✅ Full support | ⚠️ Limited (map-like operations only) |
| **Use** | Analytics, aggregations | Ultra-low latency alerts |

**Micro-batch architecture:**

```
# Streaming query = unbounded table
# Each batch processes the rows that arrived since the previous batch

┌──────────────────────────────────────────┐
│  Input Source (Kafka, Kinesis, Files)    │
└──────────────┬───────────────────────────┘
               │ Batch 0: Records [1-100]
               ▼
┌──────────────────────────────────────────┐
│  Incremental Query Engine                │
│  ┌────────────────────────────────────┐  │
│  │ Batch 0: Transform + Aggregate     │  │
│  └────────────────────────────────────┘  │
└──────────────┬───────────────────────────┘
               │ Write Batch 0 Results
               ▼
┌──────────────────────────────────────────┐
│  Output Sink (Delta, Kafka, Console)     │
└──────────────────────────────────────────┘
               │
               │ Trigger interval (e.g., 10s)
               ▼
         (Repeat Batch 1, 2, 3...)
```
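The loop in the diagram can be mimicked without Spark. The following is a minimal pure-Python sketch, not Spark's engine: `run_micro_batches`, the list-based source, and the running total are illustrative assumptions. It shows the key idea that each trigger reads only the records after the last committed offset and folds them into state carried over from earlier batches.

```python
from typing import Dict, List


def run_micro_batches(source: List[int], batch_size: int) -> List[Dict]:
    """Simulate an incremental engine: each trigger reads only new records."""
    committed_offset = 0   # position just after the last committed batch
    running_total = 0      # state carried across batches
    results = []
    batch_id = 0
    while committed_offset < len(source):
        # Read the next slice of the "unbounded table"
        batch = source[committed_offset:committed_offset + batch_size]
        # Incremental update: combine previous state with the new rows only
        running_total += sum(batch)
        results.append({
            "batchId": batch_id,
            "numInputRows": len(batch),
            "runningTotal": running_total,
        })
        committed_offset += len(batch)
        batch_id += 1
    return results


batches = run_micro_batches(list(range(1, 11)), batch_size=4)
for b in batches:
    print(b)
```

Ten records with a batch size of four produce three batches; the last one is smaller because it only contains what was left in the source.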

**Triggers (execution frequency):**

```python
# 1. ProcessingTime: start a micro-batch every N seconds
query = df.writeStream \
    .trigger(processingTime='10 seconds') \
    .start()

# 2. Once: run a single micro-batch and stop (useful for testing/backfill;
#    recent releases favor availableNow instead)
query = df.writeStream \
    .trigger(once=True) \
    .start()

# 3. Continuous (experimental): ultra-low latency (~1ms);
#    the argument is the checkpoint interval, not the latency
query = df.writeStream \
    .trigger(continuous='1 second') \
    .start()

# 4. AvailableNow: process all available data, possibly in several batches,
#    then stop (Spark 3.3+)
query = df.writeStream \
    .trigger(availableNow=True) \
    .start()
```

**Output Modes:**

```python
# 1. Append: only new rows (default)
# ✅ Use: raw logs, events without aggregations
# ❌ Limitation: rows already written are never updated or deleted
df.writeStream \
    .outputMode("append") \
    .format("parquet") \
    .option("checkpointLocation", "/checkpoints/events") \
    .start("/data/events")

# 2. Complete: rewrite the entire result table on every trigger
# ✅ Use: small aggregations, dashboards
# ❌ Limitation: does not scale; only valid for aggregation queries
df.groupBy("category").count() \
    .writeStream \
    .outputMode("complete") \
    .format("memory") \
    .queryName("category_counts") \
    .start()

# 3. Update: only rows that changed since the last trigger
# ✅ Use: aggregations with a watermark, upserts
# ⚡ A middle ground between append and complete
# Note: file-based sinks, including Delta, do not accept update mode;
# use console or Kafka, or foreachBatch with MERGE to upsert into Delta
df.groupBy(window("timestamp", "1 hour"), "user_id") \
    .agg(count("*").alias("events")) \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .start()
```

**Event-time vs Processing-time:**

```python
from pyspark.sql.functions import current_timestamp

# ❌ Processing time: when Spark processes the event
df_processing_time = df.withColumn("processed_at", current_timestamp())
# Problem: late data lands in the wrong time window

# ✅ Event time: the timestamp carried by the original event
df_event_time = df.select("event_timestamp", "user_id", "action")
# Benefit: correct temporal aggregation even when events arrive late

# Example: click in the app at 10:00 AM
# - Device offline until 10:30 AM
# - Processing time: 10:30 AM ❌ (wrong window)
# - Event time: 10:00 AM ✅ (correct window)
```
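To make the 10:00 vs 10:30 example concrete, here is a small stdlib-only sketch of tumbling-window assignment. `tumbling_window_start` is a hypothetical helper, not a Spark function; it only illustrates that the same event falls into different 5-minute windows depending on which timestamp is used.

```python
from datetime import datetime, timedelta


def tumbling_window_start(ts: datetime, width: timedelta) -> datetime:
    """Return the start of the fixed-width window that contains ts."""
    epoch = datetime(1970, 1, 1)
    # Integer-divide the elapsed time by the window width, then scale back
    return epoch + ((ts - epoch) // width) * width


width = timedelta(minutes=5)
event_time = datetime(2025, 1, 1, 10, 0)     # when the click happened
arrival_time = datetime(2025, 1, 1, 10, 30)  # when Spark received it

event_window = tumbling_window_start(event_time, width)
processing_window = tumbling_window_start(arrival_time, width)

print("event-time window starts at:     ", event_window.time())
print("processing-time window starts at:", processing_window.time())
```

Grouping by the event-time window keeps the click in the 10:00 bucket regardless of how long the device stayed offline.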

**Backpressure and Rate Limiting:**

```python
# Cap ingestion so a backlog cannot overwhelm a single batch
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .option("maxOffsetsPerTrigger", 10000)  # at most 10K records per batch
    .option("minPartitions", 4)             # minimum read parallelism
    .load()
)

# Note: the spark.streaming.backpressure.* settings belong to the legacy
# DStream API and have no effect on Structured Streaming. Here, rate limiting
# is done with source options such as maxOffsetsPerTrigger (Kafka) or
# maxFilesPerTrigger (file sources).
```

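**Real-World Example: E-commerce Click Stream**

```python
from pyspark.sql.functions import approx_count_distinct, col, count, from_json, window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Source: Kafka topic with click events
clicks = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "user-clicks") \
    .option("startingOffsets", "latest") \
    .load()

# Schema of the JSON payload (only the fields used below)
click_schema = StructType([
    StructField("event_timestamp", TimestampType()),
    StructField("user_id", StringType()),
    StructField("product_id", StringType()),
])

# Parse the JSON payload
clicks_parsed = clicks.select(
    from_json(col("value").cast("string"), click_schema).alias("data")
).select("data.*")

# Aggregate over 5-minute windows. The watermark lets Spark finalize windows,
# which append mode requires. Exact distinct counts are not supported on
# streaming DataFrames, so the approximate version is used.
clicks_per_5min = clicks_parsed \
    .withWatermark("event_timestamp", "10 minutes") \
    .groupBy(
        window("event_timestamp", "5 minutes"),
        "product_id"
    ) \
    .agg(
        count("*").alias("clicks"),
        approx_count_distinct("user_id").alias("unique_users")
    )

# Write to Delta Lake with ACID guarantees
# (the Delta sink accepts append and complete modes, not update)
query = clicks_per_5min.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/clicks-5min") \
    .trigger(processingTime='30 seconds') \
    .start("/delta/clicks_aggregated")

# Real-time monitoring
print(f"Query ID: {query.id}")
print(f"Status: {query.status}")
print(f"Last Progress: {query.lastProgress}")
```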

**Performance Metrics:**

```python
# Read metrics from the most recent micro-batch
progress = query.lastProgress  # None until the first batch completes

if progress is not None:
    # Rate fields can be missing for a batch, so read them with a default
    input_rate = progress.get("inputRowsPerSecond", 0.0)
    processed_rate = progress.get("processedRowsPerSecond", 0.0)

    key_metrics = {
        "batchId": progress["batchId"],
        "inputRowsPerSecond": input_rate,
        "processedRowsPerSecond": processed_rate,
        "triggerExecutionMs": progress["durationMs"]["triggerExecution"],  # ms
        "numInputRows": progress["numInputRows"],
        "stateOperators": progress["stateOperators"]  # accumulated state
    }

    # Alert when processing is slower than ingestion
    if input_rate > processed_rate:
        print("⚠️ WARNING: Falling behind! Increase parallelism")
```
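A single slow batch is noisy, so an alert is usually more useful when it looks at several consecutive batches. The sketch below assumes progress dictionaries shaped like `query.lastProgress` (it reads only `durationMs.triggerExecution`); `falling_behind` and its `consecutive` parameter are hypothetical names chosen for illustration.

```python
from typing import Dict, List


def falling_behind(history: List[Dict], trigger_ms: int, consecutive: int = 3) -> bool:
    """True when the last `consecutive` batches all took longer than the trigger interval."""
    recent = history[-consecutive:]
    if len(recent) < consecutive:
        return False  # not enough batches yet to decide
    return all(p["durationMs"]["triggerExecution"] > trigger_ms for p in recent)


def fake_progress(duration_ms: int) -> Dict:
    # Stand-in for a progress dict; only the field used above is filled in
    return {"durationMs": {"triggerExecution": duration_ms}}


slow_history = [fake_progress(d) for d in (8000, 12000, 13000, 15000)]
mixed_history = [fake_progress(d) for d in (12000, 9000, 15000)]

print(falling_behind(slow_history, trigger_ms=10000))   # last three batches exceed 10 s
print(falling_behind(mixed_history, trigger_ms=10000))  # one batch kept up
```

In a real job the history could be collected from `query.recentProgress`.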

**Comparison with Flink:**

| Feature | Spark Structured Streaming | Apache Flink |
|----------------|-----------------|--------------|
| **Model** | Micro-batch | True streaming (record-at-a-time) |
| **Latency** | 100ms - 1s | 10ms - 100ms |
| **Throughput** | ✅ Very high | ✅ High |
| **State** | RocksDB, memory | RocksDB native |
| **SQL Support** | ✅ Excellent | ✅ Good |
| **Ecosystem** | ✅ Spark ML, Delta | ⚠️ Limited |
| **Learning Curve** | ⚡ Easy (if you know Spark) | ⚠️ Steeper |
| **Best For** | Analytics, ML, unified batch+stream | Ultra-low latency, complex CEP |

---
**Author:** Luis J. Raigoso V. (LJRV)

### ⏰ **Watermarking: Handling Late Data and Event-Time Windows**

**Problem: Late-Arriving Data**

```
Timeline:
10:00 AM: User clicks product (event_timestamp = 10:00)
10:15 AM: Network issues, event buffered
10:30 AM: Event arrives at Spark (processing_time = 10:30)

Processing-time windows:
- Event counted in window 10:30-10:35 ❌ (wrong)

Event-time windows without a watermark:
- Event counted in window 10:00-10:05 ✅ (correct)
- State for old windows is never dropped, so it grows without bound

Event-time windows with a watermark:
- Event counted in window 10:00-10:05 ✅ if it arrives within the allowed delay
- State for windows older than the watermark is cleaned up automatically
```

**Watermark = "how much delay to tolerate"**

```python
from pyspark.sql.functions import window, col

# 10-minute watermark: events more than 10 minutes behind the latest
# event time seen so far may be dropped
df_with_watermark = df \
    .withWatermark("event_timestamp", "10 minutes") \
    .groupBy(
        window("event_timestamp", "5 minutes"),
        "user_id"
    ) \
    .count()

# How it works:
# 1. Spark tracks max(event_timestamp) seen so far
# 2. Watermark = max(event_timestamp) - threshold (10 min)
# 3. Events whose window ends at or before the watermark are dropped
# 4. State for those windows is removed
# The watermark computed at the end of one batch takes effect in the next one

# Numeric example:
# Batch 1: max_event_time = 10:15, watermark = 10:05
#   → Window [10:00-10:05) is finalized; [10:05-10:10), [10:10-10:15), [10:15-10:20) stay open
# Batch 2: max_event_time = 10:25, watermark = 10:15
#   → Windows [10:05-10:10) and [10:10-10:15) are finalized; windows from 10:15 on stay open
```
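The tracking rule above can be sketched in a few lines of plain Python. This is a simplification under stated assumptions: `apply_watermark` is a hypothetical helper, it compares raw event timestamps against the watermark (Spark compares window boundaries for windowed aggregations), and it applies the watermark from one batch to the next batch.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def apply_watermark(batches: List[List[datetime]], delay: timedelta):
    """Split events into accepted and dropped using a moving watermark."""
    max_event_time: Optional[datetime] = None
    watermark: Optional[datetime] = None
    accepted, dropped = [], []
    for batch in batches:
        # The watermark from the previous batch filters the current one
        for ts in batch:
            if watermark is not None and ts < watermark:
                dropped.append(ts)
            else:
                accepted.append(ts)
        # After the batch, advance the watermark from the newest event time seen
        if batch:
            batch_max = max(batch)
            max_event_time = batch_max if max_event_time is None else max(max_event_time, batch_max)
            watermark = max_event_time - delay
    return accepted, dropped, watermark


def at(hour: int, minute: int) -> datetime:
    return datetime(2025, 1, 1, hour, minute)


accepted, dropped, watermark = apply_watermark(
    [[at(10, 1), at(10, 15)], [at(10, 4), at(10, 12), at(10, 25)]],
    delay=timedelta(minutes=10),
)
print("dropped:", [d.time() for d in dropped])
print("final watermark:", watermark.time())
```

After the first batch the watermark is 10:05, so the 10:04 event in the second batch is too late, while the 10:12 event is still accepted.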

**Window Types:**

```python
from pyspark.sql.functions import window, session_window

# 1. TUMBLING WINDOW (no overlap)
# Use: metrics every N minutes without double counting
tumbling = df.groupBy(
    window("timestamp", "10 minutes")  # [00:00-00:10), [00:10-00:20)
).count()

# 2. SLIDING WINDOW (overlapping)
# Use: moving averages, trends
sliding = df.groupBy(
    window("timestamp", "10 minutes", "5 minutes")
    # [00:00-00:10), [00:05-00:15), [00:10-00:20)
).count()

# 3. SESSION WINDOW (gap-based, Spark 3.2+)
# Use: user sessions, continuous activity
session = df.groupBy(
    "user_id",
    session_window("timestamp", "30 minutes")  # inactivity gap
).count()
# If a user is active at 00:00, 00:05, 00:40 → 2 sessions:
#   Session 1: [00:00-00:35) (last activity 00:05 + 30 min gap)
#   Session 2: [00:40-01:10)
```
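The window boundaries in the comments can be reproduced with two small functions. This is an illustrative sketch using minutes as plain integers; `sliding_windows` and `sessionize` are hypothetical helpers, not Spark APIs.

```python
from typing import List, Tuple


def sliding_windows(ts: int, size: int, slide: int) -> List[Tuple[int, int]]:
    """All [start, end) windows of length `size`, starting every `slide`, that contain ts."""
    last_start = (ts // slide) * slide
    # Walk backwards while the window still covers ts (start > ts - size)
    starts = range(last_start, ts - size, -slide)
    return sorted((s, s + size) for s in starts)


def sessionize(times: List[int], gap: int) -> List[Tuple[int, int]]:
    """Merge events into sessions that close after `gap` minutes of inactivity."""
    sessions: List[Tuple[int, int]] = []
    for t in sorted(times):
        if sessions and t < sessions[-1][1]:
            # Event falls inside the open session: extend its end
            sessions[-1] = (sessions[-1][0], t + gap)
        else:
            sessions.append((t, t + gap))
    return sessions


print(sliding_windows(7, size=10, slide=5))  # an event at minute 7 belongs to two windows
print(sessionize([0, 5, 40], gap=30))        # the user from the session example above
```

An event at minute 7 is counted in both `[0, 10)` and `[5, 15)`, which is why sliding windows hold more state than tumbling ones.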

**Choosing a Watermark:**

```python
# ❌ Watermark too short (1 minute)
df.withWatermark("timestamp", "1 minute")
# Problem: late data is dropped aggressively
# Use: only when delivery latency is reliably under 1 minute

# ✅ Balanced watermark (10-30 minutes)
df.withWatermark("timestamp", "10 minutes")
# Benefit: captures nearly all late events while keeping state manageable

# ⚠️ Watermark too long (2 hours)
df.withWatermark("timestamp", "2 hours")
# Problem: state grows large (risk of OutOfMemory)
# Use: only when delays really exceed 1 hour

# 🎯 Rule of thumb:
# Watermark = P99 latency of your data + buffer
# Example: if 99% of events arrive within 5 min → 10 min watermark
```
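The rule of thumb can be turned into a small calculation. The sketch below is an assumption-laden illustration: `suggest_watermark_seconds` is a hypothetical helper, it uses the nearest-rank percentile on a list of observed latencies, and the `buffer_factor` of 2 mirrors the 5 min → 10 min example.

```python
from typing import Sequence


def suggest_watermark_seconds(
    latencies: Sequence[float], pct: int = 99, buffer_factor: float = 2.0
) -> float:
    """Nearest-rank percentile of observed latencies, scaled by a safety buffer."""
    ordered = sorted(latencies)
    # Ceiling of pct * n / 100 using integer arithmetic
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1] * buffer_factor


# 100 observed delays in seconds: most are fast, a few take 5 minutes, one is an outlier
observed = [30] * 90 + [300] * 9 + [1800]
suggested = suggest_watermark_seconds(observed)
print(f"suggested watermark: {suggested / 60:.0f} minutes")
```

The single 30-minute outlier does not inflate the result; it would be dropped as late data, which is the trade-off the watermark makes.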

**Handling Late Data with Output Modes:**

```python
# How output modes interact with the watermark:

# 1. APPEND + watermark
# ✅ Writes each window once, after the watermark has passed it
# ⚡ Best practice: immutable output, no updates
df.withWatermark("timestamp", "10 minutes") \
    .groupBy(window("timestamp", "5 minutes")) \
    .count() \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()
# Window [10:00-10:05) is written once the watermark reaches 10:05

# 2. UPDATE + watermark
# ✅ Writes open windows as they change, plus finalized ones
# ⚡ Use: dashboards that need updates
df.withWatermark("timestamp", "10 minutes") \
    .groupBy(window("timestamp", "5 minutes")) \
    .count() \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .start()
# Window [10:00-10:05) can keep updating until the watermark passes it

# 3. COMPLETE (no watermark needed)
# ❌ Rewrites the whole result table on every batch
# Use: small aggregations (<1M rows)
```

**Real-World Case: IoT Device Metrics**

```python
from pyspark.sql.functions import (
    avg, col, count, from_json, max as spark_max, struct, to_json, window
)

# Devices send telemetry every minute
# Cellular networks add variable latency (1 s - 5 min)

# Schema of the JSON payload (only the fields used below)
iot_schema = "device_id STRING, temperature DOUBLE, event_timestamp TIMESTAMP"

iot_metrics = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "iot-telemetry")
    .load()
    .select(from_json(col("value").cast("string"), iot_schema).alias("data"))
    .select("data.*")
)

# Aggregate per device over 5-minute windows
device_stats = (
    iot_metrics
    .withWatermark("event_timestamp", "10 minutes")  # tolerate 10 min of delay
    .groupBy(window("event_timestamp", "5 minutes"), "device_id")
    .agg(
        avg("temperature").alias("avg_temp"),
        spark_max("temperature").alias("max_temp"),  # Spark's max, not Python's builtin
        count("*").alias("num_readings"),
    )
    .filter(col("max_temp") > 80)  # overheating alerts
)

# Write alerts to Kafka for immediate action
# (the Kafka sink expects a string or binary "value" column)
alerts_query = (
    device_stats
    .select(
        col("device_id").alias("key"),
        to_json(struct("window", "device_id", "avg_temp", "max_temp", "num_readings")).alias("value"),
    )
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("topic", "device-alerts")
    .option("checkpointLocation", "/checkpoints/iot-alerts")
    .outputMode("update")
    .start()
)
```

**Debugging Late Data:**

```python
# Add diagnostic columns
from pyspark.sql.functions import (
    avg, current_timestamp, expr, max as spark_max, unix_timestamp, window
)

df_debug = df \
    .withColumn("processing_time", current_timestamp()) \
    .withColumn("latency_seconds", 
        unix_timestamp("processing_time") - unix_timestamp("event_timestamp")
    )

# Analyze the latency distribution
latency_stats = df_debug.groupBy(
    window("processing_time", "1 minute")
).agg(
    avg("latency_seconds").alias("avg_latency"),
    expr("percentile_approx(latency_seconds, 0.95)").alias("p95_latency"),
    expr("percentile_approx(latency_seconds, 0.99)").alias("p99_latency"),
    spark_max("latency_seconds").alias("max_latency")  # Spark's max, not Python's builtin
)

# If p99_latency = 300s (5 min) → the watermark should be ≥10 min
```

**Watermarks with Joins:**

```python
from pyspark.sql.functions import expr

# Stream-stream joins: define a watermark on BOTH inputs so Spark can drop
# old buffered rows (optional for inner joins, required for outer joins)
clicks = clicks_stream.withWatermark("click_time", "5 minutes")
impressions = impressions_stream.withWatermark("impression_time", "10 minutes")

# Inner join with a time constraint
joined = clicks.join(
    impressions,
    expr("""
        click_user_id = impression_user_id AND
        click_time >= impression_time AND
        click_time <= impression_time + interval 1 hour
    """)
)

# Global watermark = the minimum of the two per-stream watermarks (default
# policy), so the slower stream holds it back
# State kept: impressions are buffered for about 1 hour (time constraint)
# plus the 5 minutes of lateness allowed for clicks
```

**Monitoring the Watermark:**

```python
from datetime import datetime, timedelta, timezone

# Read the current watermark from the query progress
progress = query.lastProgress
watermark_str = progress["eventTime"]["watermark"]
print(f"Current watermark: {watermark_str}")

# Example of the relevant part of the progress:
# {
#   "eventTime": {
#     "avg": "2025-10-30T14:25:00.000Z",
#     "max": "2025-10-30T14:30:00.000Z",
#     "min": "2025-10-30T14:20:00.000Z",
#     "watermark": "2025-10-30T14:20:00.000Z"  # max - 10 min
#   }
# }

# Alert when the watermark lags far behind wall-clock time
watermark = datetime.strptime(watermark_str, "%Y-%m-%dT%H:%M:%S.%fZ") \
    .replace(tzinfo=timezone.utc)
if watermark < datetime.now(timezone.utc) - timedelta(hours=1):
    print("⚠️ Watermark lagging! Check data source")
```

**Best Practices:**

1. ✅ **Always use a watermark** with event-time aggregations
2. ✅ **Measure the P99 latency** of your data before configuring it
3. ✅ **Append mode** with a watermark for immutable output
4. ✅ **Session windows** for behavior analysis
5. ⚠️ **Sliding windows** hold more state (overlap)
6. ❌ **No unbounded watermark** (state is never cleaned up)
7. ⚡ **Buffer of 2-3x P99** to avoid drops

---

### 💾 **State Management and Checkpointing: Fault Tolerance**

**What Is State?**

```
Stateless processing:
Input: {"user": "A", "action": "click"}
Output: {"user": "A", "action": "click", "processed": true}
✅ Each event is independent

Stateful processing:
Input Batch 1: {"user": "A", "action": "click"}
State: {A: 1 click}
Input Batch 2: {"user": "A", "action": "purchase"}
State: {A: 1 click, 1 purchase}
Output: {A: total_events=2, conversion_rate=1.0}
✅ Memory carried across batches
```
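The difference is easy to see in plain Python. In this minimal sketch (the names `stateless` and `UserEventCounter` are illustrative, not Spark APIs), the stateless function looks at one event at a time, while the counter keeps a dictionary that survives from one batch to the next.

```python
from collections import defaultdict
from typing import Dict, List


def stateless(event: Dict) -> Dict:
    """Each event is transformed on its own; nothing is remembered."""
    return {**event, "processed": True}


class UserEventCounter:
    """Keeps per-user, per-action counts between batches."""

    def __init__(self) -> None:
        self.state = defaultdict(lambda: defaultdict(int))

    def process_batch(self, batch: List[Dict]) -> Dict[str, int]:
        for event in batch:
            self.state[event["user"]][event["action"]] += 1
        # Output: total events seen so far for every user
        return {user: sum(actions.values()) for user, actions in self.state.items()}


counter = UserEventCounter()
after_batch_1 = counter.process_batch([{"user": "A", "action": "click"}])
after_batch_2 = counter.process_batch([{"user": "A", "action": "purchase"}])

print(stateless({"user": "A", "action": "click"}))
print(after_batch_1, after_batch_2)
```

The second batch reports two events for user A only because the first batch's count was kept; that retained dictionary is what Spark persists in its state store.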

**Stateful Operations in Spark:**

```python
from pyspark.sql.functions import count, sum as spark_sum, window

# 1. Aggregations (groupBy)
# State: accumulated values per key
user_stats = df.groupBy("user_id") \
    .agg(
        count("*").alias("total_events"),
        spark_sum("revenue").alias("total_revenue")  # Spark's sum, not Python's builtin
    )
# Stored state: {user_123: {count: 45, revenue: 1200.50}}

# 2. Windowed Aggregations
# State: values per window + key
window_stats = df \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window("timestamp", "5 minutes"),
        "product_id"
    ) \
    .count()
# State: {(window_10:00-10:05, product_42): 150}

# 3. Stream-Stream Joins
# State: buffered events from both streams
clicks = clicks_stream.withWatermark("click_time", "5 minutes")
views = views_stream.withWatermark("view_time", "5 minutes")
joined = clicks.join(views, "session_id")
# State: unmatched events; add a time-range condition to the join so the
# watermarks can actually expire them

# 4. Deduplication
# State: unique keys seen so far
deduplicated = df \
    .withWatermark("timestamp", "1 hour") \
    .dropDuplicates(["event_id", "timestamp"])
# Including the event-time column lets the watermark expire old keys
# (Spark 3.5+ also offers dropDuplicatesWithinWatermark)

# 5. Arbitrary stateful processing
# mapGroupsWithState / flatMapGroupsWithState (Scala/Java),
# applyInPandasWithState (PySpark 3.4+)
# State: user-defined
# Example: complex sessions, state machines
```

**State Store Backends:**

```python
# 1. HDFSBackedStateStoreProvider (default)
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider"
)
# State lives in executor JVM memory and is persisted to the checkpoint directory
# ⚠️ Limited by executor memory
# Use: local testing, small state

# 2. RocksDB (production, Spark 3.2+)
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider"
)
# Changelog checkpointing is only available in newer releases
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled", "true")
# ✅ State kept in native memory and on local disk; handles much larger
#    state than the default provider
# Use: production, large state

# 3. Custom (enterprise)
# A class implementing the state store provider interface, for example one
# backed by an external store such as Redis or Cassandra
```

**Checkpointing: Write-Ahead Log (WAL)**

```
Checkpoint Directory Structure:
/checkpoints/my-query/
├── commits/
│   ├── 0                    # Batch 0 commit marker
│   ├── 1
│   └── 2
├── offsets/
│   ├── 0                    # Kafka offsets for batch 0
│   ├── 1
│   └── 2
├── sources/
│   └── 0/
│       └── 0               # Source info
└── state/
    ├── 0/
    │   ├── 0/
    │   │   ├── 1.delta     # Incremental state changes
    │   │   └── 1.snapshot  # Full state snapshot
    │   └── 1/
    └── 1/
```

**Configuring Checkpointing:**

```python
# A checkpoint location is required for stateful queries
query = (
    df.writeStream
    .format("delta")
    .outputMode("append")  # the Delta sink accepts append and complete, not update
    .option("checkpointLocation", "/checkpoints/user-stats")  # REQUIRED
    .start("/delta/user_stats")
)

# Without a checkpoint → error:
# "checkpointLocation must be specified either through option("checkpointLocation", ...)
# or SparkSession.conf.set("spark.sql.streaming.checkpointLocation", ...)"

# Global setting (not recommended)
spark.conf.set("spark.sql.streaming.checkpointLocation", "/checkpoints/default")
```

**Recovery Process:**

```python
# Scenario: the query fails during Batch 5 and is restarted

# 1. On restart, the query reads its checkpoint
# 2. It finds the last committed batch (Batch 4)
#    - Offsets: Kafka partition 0 offset 1000, partition 1 offset 850
#    - State: User A → 45 events, User B → 23 events
# 3. It replays Batch 5
#    - Reads from offsets 1000 and 850 in Kafka
#    - Restores the state as of Batch 4
#    - Reprocesses Batch 5 with the correct state
# 4. It continues normally

# Guarantee: exactly-once results
# Offsets are logged before each batch runs and state is versioned per batch,
# so a replayed batch sees the same input. For the output to be exactly-once
# as well, the sink must be idempotent or transactional (as Delta is).
```
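The recovery steps can be simulated with a dictionary standing in for the checkpoint directory. This is a toy model under stated assumptions, not Spark's implementation: `new_checkpoint`, `run_query`, and the `crash_in_batch` switch are hypothetical, and the "state" is just a running sum. It follows the same order as the steps above: log the offsets first, process, then commit.

```python
from typing import Dict, List, Optional


def new_checkpoint() -> Dict:
    # offsets: batch id -> (start, end); commits: finished batch ids; state: versioned totals
    return {"offsets": {}, "commits": [], "state": {}}


def run_query(source: List[int], checkpoint: Dict, batch_size: int,
              crash_in_batch: Optional[int] = None) -> None:
    """Process `source` in micro-batches, resuming from whatever `checkpoint` holds."""
    while True:
        last_committed = checkpoint["commits"][-1] if checkpoint["commits"] else -1
        batch_id = last_committed + 1
        if batch_id in checkpoint["offsets"]:
            # Planned but never committed: replay exactly the same range
            start, end = checkpoint["offsets"][batch_id]
        else:
            start = checkpoint["offsets"][last_committed][1] if last_committed >= 0 else 0
            end = min(start + batch_size, len(source))
            if start == end:
                return  # no new data
            checkpoint["offsets"][batch_id] = (start, end)  # 1. write-ahead log
        if crash_in_batch == batch_id:
            raise RuntimeError(f"simulated failure in batch {batch_id}")
        previous_total = checkpoint["state"].get(last_committed, 0)
        checkpoint["state"][batch_id] = previous_total + sum(source[start:end])  # 2. new state version
        checkpoint["commits"].append(batch_id)  # 3. commit


source = list(range(1, 11))
checkpoint = new_checkpoint()
try:
    run_query(source, checkpoint, batch_size=4, crash_in_batch=1)
except RuntimeError as err:
    print(err)
commits_after_crash = list(checkpoint["commits"])

run_query(source, checkpoint, batch_size=4)  # restart with the same checkpoint
print(checkpoint["commits"], checkpoint["state"])
```

After the simulated crash only batch 0 is committed, yet the offsets for batch 1 are already logged, so the restart reprocesses the same range and the final total matches an uninterrupted run.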

**Example: Session Analytics with Custom State**

```python
from typing import Iterator, Tuple

import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

# Requires Spark 3.4+. State and output rows are tuples matching these schemas.
session_state_schema = (
    "start_time timestamp, last_activity timestamp, num_events int, total_revenue double"
)
session_output_schema = (
    "user_id string, session_duration long, num_events int, total_revenue double"
)
SESSION_GAP_MS = 30 * 60 * 1000  # close a session after 30 minutes of inactivity


def update_session_state(
    key: Tuple[str],
    batches: Iterator[pd.DataFrame],
    state: GroupState,
) -> Iterator[pd.DataFrame]:
    """Custom state management for user sessions."""
    user_id = key[0]

    if state.hasTimedOut:
        # No events arrived within the gap: emit the finished session and clear its state
        start_time, last_activity, num_events, total_revenue = state.get
        state.remove()
        yield pd.DataFrame([{
            "user_id": user_id,
            "session_duration": int((last_activity - start_time).total_seconds()),
            "num_events": num_events,
            "total_revenue": total_revenue,
        }])
        return

    # Read the previous state, or start a new session
    if state.exists:
        start_time, last_activity, num_events, total_revenue = state.get
    else:
        start_time, last_activity, num_events, total_revenue = None, None, 0, 0.0

    # Fold this micro-batch's events into the session
    for pdf in batches:
        if pdf.empty:
            continue
        first = pdf["timestamp"].min().to_pydatetime()
        last = pdf["timestamp"].max().to_pydatetime()
        start_time = first if start_time is None else min(start_time, first)
        last_activity = last if last_activity is None else max(last_activity, last)
        num_events += len(pdf)
        total_revenue += float(pdf["revenue"].fillna(0.0).sum())

    # Save the updated state and re-arm the inactivity timeout; no output yet
    state.update((start_time, last_activity, num_events, total_revenue))
    state.setTimeoutDuration(SESSION_GAP_MS)


# Apply the stateful function per user
sessions = df.groupBy("user_id").applyInPandasWithState(
    update_session_state,
    outputStructType=session_output_schema,
    stateStructType=session_state_schema,
    outputMode="update",
    timeoutConf=GroupStateTimeout.ProcessingTimeTimeout,
)
```

**State Size Monitoring:**

```python
# State metrics in lastProgress
progress = query.lastProgress

state_info = progress["stateOperators"][0]  # first stateful operator
print(f"Num state rows: {state_info['numRowsTotal']}")
print(f"State memory (MB): {state_info['memoryUsedBytes'] / 1024 / 1024}")
print(f"Custom metrics: {state_info['customMetrics']}")

# Example output:
# {
#   "numRowsTotal": 1234567,
#   "numRowsUpdated": 5678,
#   "memoryUsedBytes": 524288000,  # ~500 MB
#   "customMetrics": {
#     "loadedMapCacheHitCount": 1000,
#     "loadedMapCacheMissCount": 50,
#     "stateOnCurrentVersionSizeBytes": 450000000
#   }
# }

# Alert when state grows out of control
if state_info['numRowsTotal'] > 10_000_000:
    print("⚠️ WARNING: State size >10M rows, consider:")
    print("  1. Reduce watermark threshold")
    print("  2. Increase parallelism (more partitions)")
    print("  3. Use state TTL (time-to-live)")
```

**State Cleanup Strategies:**

```python
# 1. Watermark-based (automatic)
df.withWatermark("timestamp", "1 hour") \
    .groupBy(window("timestamp", "10 minutes")) \
    .count()
# State is dropped once a window falls behind the watermark

# 2. Time-based expiry (TTL) with state timeouts
# In applyInPandasWithState, call state.setTimeoutDuration(ms) (processing
# time) or state.setTimeoutTimestamp(ms) (event time); keys that time out are
# passed back to the function with state.hasTimedOut == True

# 3. Manual cleanup in applyInPandasWithState
def cleanup_old_state(key, events, state):
    if state.hasTimedOut:
        state.remove()  # explicit cleanup
        return iter([])
    # ... process events, update state, and re-arm the timeout
    return iter([])

# 4. Partition count to balance state
# State is partitioned by grouping key into spark.sql.shuffle.partitions
# partitions; the value is fixed when the checkpoint is first created
spark.conf.set("spark.sql.shuffle.partitions", "200")
```

**Checkpoint Management:**

```python
# ❌ NEVER change the checkpoint location of a production query without a plan
# A new checkpoint means lost state and reprocessing from the starting offsets

# ✅ Migration process:
# 1. Stop the query
query.stop()

# 2. Back up the checkpoint
# hdfs dfs -cp /checkpoints/old /checkpoints/backup

# 3. Start the new query with a new checkpoint
query_new = df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/checkpoints/new") \
    .start("/delta/user_stats")

# 4. Validate that the outputs are correct

# 5. Clean up the old checkpoint (after days or weeks)
# hdfs dfs -rm -r /checkpoints/old
```

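**Real-World Case: Fraud Detection with State**

```python
import pandas as pd
from pyspark.sql.functions import struct, to_json
from pyspark.sql.streaming.state import GroupStateTimeout

# Detect bursts of suspicious transactions from the same user (requires Spark 3.4+)
LOOKBACK_SECONDS = 600                # 10-minute look-back
HISTORY_TTL_MS = 24 * 60 * 60 * 1000  # drop a user's history after 24 h of inactivity

fraud_alert_schema = (
    "user_id string, alert_type string, risk_score double, timestamp timestamp"
)
user_fraud_state_schema = (
    "times array<timestamp>, amounts array<double>, countries array<string>"
)


def calculate_risk(num_txns, total_amount, num_countries):
    # Placeholder score in [0, 1]; the original leaves calculate_risk undefined
    return min(1.0, num_txns / 10 + total_amount / 20000 + num_countries / 4)


def detect_fraud_pattern(key, batches, state):
    user_id = key[0]

    if state.hasTimedOut:
        state.remove()  # no activity for 24 h: forget the history
        return

    # Restore state (recent transaction history)
    if state.exists:
        times, amounts, countries = (list(v) for v in state.get)
    else:
        times, amounts, countries = [], [], []
    alerts = []

    # Analyze new transactions in event-time order
    for pdf in batches:
        for txn in pdf.sort_values("transaction_time").itertuples(index=False):
            now = txn.transaction_time.to_pydatetime()
            times.append(now)
            amounts.append(float(txn.amount))
            countries.append(txn.country)

            # Keep only the transactions from the last 10 minutes
            recent = [
                (t, a, c) for t, a, c in zip(times, amounts, countries)
                if (now - t).total_seconds() < LOOKBACK_SECONDS
            ]
            times, amounts, countries = (list(x) for x in zip(*recent))
            total = sum(amounts)
            num_countries = len(set(countries))

            # Suspicious patterns within 10 minutes:
            # 1. more than 5 transactions
            # 2. total amount above $10,000
            # 3. more than 2 countries
            if len(times) > 5 or total > 10000 or num_countries > 2:
                alerts.append({
                    "user_id": user_id,
                    "alert_type": "FRAUD_SUSPECTED",
                    "risk_score": calculate_risk(len(times), total, num_countries),
                    "timestamp": now,
                })

    # Always persist the history (24 h TTL), then emit any alerts
    state.update((times, amounts, countries))
    state.setTimeoutDuration(HISTORY_TTL_MS)
    if alerts:
        yield pd.DataFrame(alerts)


fraud_detection = transactions \
    .withWatermark("transaction_time", "15 minutes") \
    .groupBy("user_id") \
    .applyInPandasWithState(
        detect_fraud_pattern,
        outputStructType=fraud_alert_schema,
        stateStructType=user_fraud_state_schema,
        outputMode="update",
        timeoutConf=GroupStateTimeout.ProcessingTimeTimeout
    )

# Write alerts to Kafka for immediate action
# (the Kafka sink expects a string or binary "value" column)
fraud_alerts = fraud_detection \
    .select(to_json(struct("*")).alias("value")) \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("topic", "fraud-alerts") \
    .option("checkpointLocation", "/checkpoints/fraud-detection") \
    .outputMode("update") \
    .start()
```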

**Best Practices:**

1. ✅ **Always configure a checkpoint** for stateful queries
2. ✅ **RocksDB in production** (the in-memory default only for dev)
3. ✅ **Monitor state size** with metrics
4. ✅ **Watermark** for automatic cleanup
5. ✅ **Back up checkpoints** before upgrades
6. ⚠️ **State timeouts (TTL)** to avoid unbounded growth
7. ❌ **Never change the checkpoint location** without a migration plan
8. ⚡ **Partition by key** to balance state (set `spark.sql.shuffle.partitions` before the first run)

---

### 🏗️ **Delta Lake Integration and Performance Optimization**

**Spark Streaming + Delta Lake = Streaming Lakehouse**

```
Advantages of Delta as a sink:
✅ ACID transactions: exactly-once writes without duplicates
✅ Schema evolution: add columns without interrupting the stream
✅ Time travel: roll back after incorrect processing
✅ Upserts/merges: CDC (Change Data Capture) in streaming
✅ Performance: Z-ordering, automatic compaction
```

**Writing a Stream to Delta:**

```python
# Basic configuration
query = (
    df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/events-delta")
    .option("path", "/delta/events")
    .start()
)

# Advanced configuration
query = (
    df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/events-delta")
    .option("mergeSchema", "true")    # automatic schema evolution
    .option("optimizeWrite", "true")  # bin-packing (support depends on Delta version/platform)
    .option("autoCompact", "true")    # auto compaction (often set as a table property instead)
    .partitionBy("date", "hour")      # partitioning
    .trigger(processingTime='30 seconds')
    .start("/delta/events")
)

# AvailableNow trigger (Spark 3.3+): process the whole backlog
query = (
    df.writeStream
    .format("delta")
    .trigger(availableNow=True)       # process everything, then stop
    .option("checkpointLocation", "/checkpoints/backfill")
    .start("/delta/events")
)
```

**Streaming Upserts (Merge):**

```python
from delta.tables import DeltaTable

# Merge function called for each micro-batch
def upsert_to_delta(batch_df, batch_id):
    """
    Merge batch into Delta table (UPSERT)
    """
    delta_table = DeltaTable.forPath(spark, "/delta/user_profiles")

    (
        delta_table.alias("target")
        .merge(batch_df.alias("source"), "target.user_id = source.user_id")
        .whenMatchedUpdateAll()     # update if the row exists
        .whenNotMatchedInsertAll()  # insert if it does not
        .execute()
    )

    print(f"Batch {batch_id} merged successfully")

# Apply to every batch
query = user_updates.writeStream \
    .foreachBatch(upsert_to_delta) \
    .option("checkpointLocation", "/checkpoints/user-upserts") \
    .start()

# Example: updating user profiles
# Batch 1: user_123 {name: "John", purchases: 5}
# Batch 2: user_123 {purchases: 6}  → MERGE updates purchases
# Batch 3: user_456 {name: "Jane"} → INSERT new user
# Note: whenMatchedUpdateAll overwrites every column, so each batch row
# should carry the full record
```

**Change Data Capture (CDC) Streaming:**

```python
# Read CDC events from Kafka (Debezium format)
cdc_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "mysql.prod.users") \
    .load() \
    .select(
        from_json(col("value").cast("string"), cdc_schema).alias("cdc")
    ).select("cdc.*")

# Apply the changes to Delta
def apply_cdc_changes(batch_df, batch_id):
    """
    Process CDC events: INSERT, UPDATE, DELETE
    """
    delta_table = DeltaTable.forPath(spark, "/delta/users")
    
    # Split by operation
    inserts = batch_df.filter(col("op") == "c")  # Create
    updates = batch_df.filter(col("op") == "u")  # Update
    deletes = batch_df.filter(col("op") == "d")  # Delete
    
    # Apply deletes
    # Note: deletes run before upserts here; if one key can be deleted and
    # re-created in the same batch, keep only its latest change first
    if deletes.count() > 0:
        delta_table.alias("t").merge(
            deletes.alias("s"),
            "t.user_id = s.user_id"
        ).whenMatchedDelete().execute()
    
    # Apply upserts (inserts + updates)
    upserts = inserts.union(updates)
    if upserts.count() > 0:
        delta_table.alias("t").merge(
            upserts.alias("s"),
            "t.user_id = s.user_id"
        ).whenMatchedUpdateAll() \
         .whenNotMatchedInsertAll() \
         .execute()

query = cdc_stream.writeStream \
    .foreachBatch(apply_cdc_changes) \
    .option("checkpointLocation", "/checkpoints/cdc") \
    .start()
```
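When one micro-batch contains several changes for the same key, the order in which they are applied matters. The sketch below models the table as a dictionary to show one way to handle this: reduce the batch to the latest change per key, then apply it. All names (`latest_change_per_key`, `apply_cdc`, the `ts` and `row` fields) are illustrative assumptions, not part of Debezium or Delta Lake.

```python
from typing import Dict, List, Optional


def latest_change_per_key(events: List[Dict]) -> Dict[int, Dict]:
    """Keep only the most recent change for each user_id."""
    latest: Dict[int, Dict] = {}
    for event in sorted(events, key=lambda e: e["ts"]):
        latest[event["user_id"]] = event  # later events overwrite earlier ones
    return latest


def apply_cdc(table: Dict[int, Optional[Dict]], events: List[Dict]) -> Dict[int, Optional[Dict]]:
    """Apply create ('c'), update ('u') and delete ('d') operations to a dict-backed table."""
    for key, event in latest_change_per_key(events).items():
        if event["op"] == "d":
            table.pop(key, None)
        else:
            table[key] = event["row"]  # upsert
    return table


table = {1: {"name": "John"}}
batch = [
    {"user_id": 1, "op": "u", "ts": 1, "row": {"name": "Johnny"}},
    {"user_id": 2, "op": "c", "ts": 2, "row": {"name": "Jane"}},
    {"user_id": 2, "op": "d", "ts": 3, "row": None},
    {"user_id": 3, "op": "c", "ts": 4, "row": {"name": "Ana"}},
]
result = apply_cdc(table, batch)
print(result)
```

User 2 is created and deleted within the same batch, so it never appears in the table; applying deletes before upserts without this reduction would have left the row in place.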

**Leer Stream desde Delta (Change Data Feed):**

```python
# Habilitar CDF en tabla Delta
spark.sql("""
    ALTER TABLE user_profiles
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read the changes as a stream, starting from a specific table version
changes_stream = spark.readStream \
    .format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 10) \
    .table("user_profiles")

# Extra columns exposed by CDF:
# _change_type: insert, update_preimage, update_postimage, delete
# _commit_version: Delta table version of the commit
# _commit_timestamp: when the change was committed

# Process updates only
updates_only = changes_stream.filter(col("_change_type") == "update_postimage")

# Materialize into another table
query = updates_only.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/checkpoints/cdf-consumer") \
    .start("/delta/user_updates_log")
```

**Performance Optimization: Kafka Source**

```python
# Example Kafka source configuration (a starting point; tune the values for your workload)
kafka_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")        # latest, earliest, {"topic":{"0":23,"1":-1}}
    .option("maxOffsetsPerTrigger", 50000)      # cap the records ingested per micro-batch
    .option("minPartitions", 10)                # minimum read parallelism
    .option("kafka.max.poll.records", 500)      # records per poll
    .option("kafka.session.timeout.ms", 30000)
    .option("kafka.request.timeout.ms", 40000)
    .option("failOnDataLoss", "false")          # tolerate data loss in dev only
    .load()
)

# Offsets and consumer groups
# - Spark tracks consumed offsets in the checkpoint, not in a committed Kafka consumer group
# - A new checkpoint location makes the query start over from startingOffsets
```

**Partitioning Strategies:**

```python
# ❌ Anti-pattern: over-partitioning (can create millions of partitions)
df.writeStream \
    .partitionBy("year", "month", "day", "hour", "user_id") \
    .start()

# ✅ Balanced partitioning (about 30 partitions per month)
df.writeStream \
    .partitionBy("date") \
    .start()

# ⚡ Dynamic partition overwrite (batch mode)
df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("partitionOverwriteMode", "dynamic") \
    .partitionBy("date") \
    .save("/delta/events")

# Z-ordering for frequently filtered columns
spark.sql("""
    OPTIMIZE events
    ZORDER BY (user_id, product_id)
""")
```
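
The difference between the two partitioning choices above is easy to quantify: the number of partition directories is bounded by the product of the distinct values of each partition column. The sketch below does that arithmetic for a hypothetical one-month workload; the figure of 100,000 active users is an assumption chosen for illustration, not a number from this notebook.

```python
def estimated_partitions(cardinalities: dict) -> int:
    """Upper bound on partition directories: product of distinct values per column."""
    total = 1
    for distinct_values in cardinalities.values():
        total *= distinct_values
    return total


# Hypothetical one-month workload with 100,000 active users.
over_partitioned = estimated_partitions(
    {"year": 1, "month": 1, "day": 30, "hour": 24, "user_id": 100_000}
)
balanced = estimated_partitions({"date": 30})

print(f"year/month/day/hour/user_id: up to {over_partitioned:,} partitions")
print(f"date only: {balanced:,} partitions")
```

High-cardinality columns such as `user_id` are better served by Z-ordering than by partitioning, since they multiply the directory count without improving file sizes.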

**Auto Compaction:**

```python
# Problem: small files produced by streaming
# Each micro-batch writes small files (10-100 MB)
# Result: millions of files after days/weeks

# Solution 1: auto-compaction (Databricks-style setting names; check support in your Delta Lake version)
spark.conf.set("spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite", "true")
spark.conf.set("spark.databricks.delta.properties.defaults.autoOptimize.autoCompact", "true")

# Solution 2: scheduled compaction job
from delta.tables import DeltaTable

def compact_table(table_path):
    """
    Compact small files
    """
    delta_table = DeltaTable.forPath(spark, table_path)
    
    # OPTIMIZE: compact files within the partitions of the last 7 days
    delta_table.optimize() \
        .where("date >= current_date() - interval 7 days") \
        .executeCompaction()
    
    # VACUUM: remove files that are no longer referenced, once past the retention period
    delta_table.vacuum(retentionHours=168)  # 7 days

# Run daily, for example from an Airflow DAG (assumes Apache Airflow is installed)
from airflow import DAG
optimize_dag = DAG('delta_optimize', schedule_interval='@daily')

# Solution 3: bin-packing at write time
# optimizeWrite bin-packs the data before it is written
df.writeStream \
    .format("delta") \
    .option("optimizeWrite", "true") \
    .start()
```
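
To make the idea of bin-packing concrete, the following sketch groups small files into bins that stay under a target size, using a greedy first-fit-decreasing strategy. It illustrates the concept only: `plan_compaction` is a hypothetical helper, the 128 MB target is an assumed value, and Delta's own OPTIMIZE implementation may group files differently.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[float], target_mb: float = 128.0) -> List[List[float]]:
    """Greedy first-fit-decreasing grouping of small files into bins of at most target_mb."""
    bins: List[List[float]] = []
    for size in sorted(file_sizes_mb, reverse=True):
        for group in bins:
            if sum(group) + size <= target_mb:
                group.append(size)  # fits in an existing bin
                break
        else:
            bins.append([size])     # no bin has room: open a new one
    return bins


small_files = [10, 40, 25, 60, 5, 90, 30, 15]
plan = plan_compaction(small_files, target_mb=128)

for i, group in enumerate(plan):
    print(f"output file {i}: {group} -> {sum(group)} MB")
```

Eight input files become three output files here, which is the effect compaction aims for: fewer, larger files for readers to open.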

**Monitoring and Alerts:**

```python
# Key metrics to monitor
def extract_stream_metrics(query):
    """
    Extract metrics for Prometheus/Datadog
    """
    progress = query.lastProgress
    
    if progress is None:
        return {}
    
    metrics = {
        # Throughput
        "input_rows_per_second": progress.get("inputRowsPerSecond", 0),
        "processed_rows_per_second": progress.get("processedRowsPerSecond", 0),
        
        # Latency
        "batch_duration_ms": progress.get("durationMs", {}).get("triggerExecution", 0),
        "batch_id": progress.get("batchId", 0),
        
        # State
        "num_state_rows": 0,
        "state_memory_mb": 0,
        
        # Lag
        "input_rows": progress.get("numInputRows", 0),
    }
    
    # State metrics, if the query has stateful operators (first operator only)
    if "stateOperators" in progress and len(progress["stateOperators"]) > 0:
        state_op = progress["stateOperators"][0]
        metrics["num_state_rows"] = state_op.get("numRowsTotal", 0)
        metrics["state_memory_mb"] = state_op.get("memoryUsedBytes", 0) / 1024 / 1024
    
    return metrics

# Alerts
def check_alerts(metrics):
    """
    Generate alerts when metrics look abnormal
    """
    alerts = []
    
    # Alerta 1: Falling behind
    if metrics["input_rows_per_second"] > metrics["processed_rows_per_second"] * 1.2:
        alerts.append({
            "severity": "WARNING",
            "message": f"Processing falling behind: {metrics['input_rows_per_second']:.0f} in/s vs {metrics['processed_rows_per_second']:.0f} out/s"
        })
    
    # Alerta 2: High latency
    if metrics["batch_duration_ms"] > 60000:  # > 1 minute
        alerts.append({
            "severity": "ERROR",
            "message": f"High batch latency: {metrics['batch_duration_ms']/1000:.1f}s"
        })
    
    # Alerta 3: State growing unbounded
    if metrics["num_state_rows"] > 50_000_000:  # > 50M rows
        alerts.append({
            "severity": "WARNING",
            "message": f"Large state size: {metrics['num_state_rows']:,} rows, {metrics['state_memory_mb']:.1f} MB"
        })
    
    return alerts

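# Monitoring loop. push_to_prometheus and send_alert are placeholders
# for your own integrations; they are not defined in this notebook.
import time
while query.isActive:
    time.sleep(60)  # check every minute
    metrics = extract_stream_metrics(query)
    if not metrics:
        continue  # no progress reported yet; check_alerts would fail on an empty dict
    alerts = check_alerts(metrics)
    
    # Send to Prometheus
    push_to_prometheus(metrics)
    
    # Send alerts to Slack/PagerDuty
    for alert in alerts:
        send_alert(alert)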
```
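
`extract_stream_metrics` reads only the first entry of `stateOperators`. A query with several stateful operators (for example a deduplication followed by an aggregation) reports one entry per operator, so summing over all of them gives a fuller picture. The sketch below shows that variant against a hand-written dictionary shaped like `query.lastProgress`; the sample values are made up and `summarize_state` is a hypothetical helper name.

```python
def summarize_state(progress: dict) -> dict:
    """Sum state-store metrics over every stateful operator in a progress report."""
    state_ops = progress.get("stateOperators", [])
    total_rows = sum(op.get("numRowsTotal", 0) for op in state_ops)
    total_bytes = sum(op.get("memoryUsedBytes", 0) for op in state_ops)
    return {
        "num_state_rows": total_rows,
        "state_memory_mb": total_bytes / 1024 / 1024,
    }


# Hand-written sample shaped like query.lastProgress (values are made up).
sample_progress = {
    "batchId": 42,
    "numInputRows": 1200,
    "stateOperators": [
        {"numRowsTotal": 1000, "memoryUsedBytes": 2 * 1024 * 1024},
        {"numRowsTotal": 500, "memoryUsedBytes": 1 * 1024 * 1024},
    ],
}
summary = summarize_state(sample_progress)
print(summary)
```

A stateless query simply yields zeros, so the same function can be used for every stream.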

**Real-World Case: Real-time Analytics Dashboard**

```python
# Full pipeline: Kafka → Spark Streaming → Delta → BI tool

# 1. Read events from Kafka
events = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "user-events") \
    .option("maxOffsetsPerTrigger", 10000) \
    .load() \
    .select(from_json(col("value").cast("string"), event_schema).alias("e")) \
    .select("e.*")

# 2. Transformations (stream-static join against the users dimension)
events_enriched = events \
    .withWatermark("event_timestamp", "10 minutes") \
    .alias("e") \
    .join(
        users_dim.alias("u"),
        col("e.user_id") == col("u.user_id"),
        "left"
    ) \
    .select(
        col("e.event_id"),
        col("e.event_timestamp"),
        col("e.user_id"),
        col("u.user_segment"),  # enrichment
        col("e.event_type"),
        col("e.revenue")
    )

# 3. Windowed aggregations
# Exact distinct counts are not supported on streaming DataFrames,
# so unique users are estimated with approx_count_distinct.
from pyspark.sql.functions import approx_count_distinct

dashboard_metrics = events_enriched \
    .groupBy(
        window("event_timestamp", "5 minutes"),
        "user_segment"
    ) \
    .agg(
        count("*").alias("total_events"),
        approx_count_distinct("user_id").alias("unique_users"),
        spark_sum("revenue").alias("total_revenue"),
        avg("revenue").alias("avg_revenue")
    )

# 4. Write to Delta with optimizations
# The Delta sink accepts the append and complete output modes; with the
# watermark, append emits each window once it has been finalized.
query = dashboard_metrics.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/dashboard") \
    .option("optimizeWrite", "true") \
    .option("autoCompact", "true") \
    .trigger(processingTime='30 seconds') \
    .start("/delta/dashboard_metrics")

# 5. The BI tool reads from Delta (Tableau, Power BI, Looker)
# SELECT * FROM delta.`/delta/dashboard_metrics`
# WHERE window.start >= current_timestamp() - interval 1 hour

# 6. Z-order for query performance
spark.sql("""
    OPTIMIZE delta.`/delta/dashboard_metrics`
    ZORDER BY (window, user_segment)
""")

# 7. Vacuum old files (weekly)
spark.sql("""
    VACUUM delta.`/delta/dashboard_metrics`
    RETAIN 168 HOURS
""")
```
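
The interaction between the 5-minute windows and the 10-minute watermark in the pipeline above can be traced without Spark. The sketch below assigns events to tumbling windows and drops an event when its window ended at or before the current watermark (the maximum event time seen minus the allowed delay). It is a simplification under stated assumptions: Spark advances the watermark between micro-batches rather than per event, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def window_start(ts: datetime, size: timedelta) -> datetime:
    """Start of the tumbling window that contains ts (aligned to the Unix epoch)."""
    return EPOCH + ((ts - EPOCH) // size) * size


def aggregate_with_watermark(events, size=timedelta(minutes=5), delay=timedelta(minutes=10)):
    """Count events per window, dropping events whose window closed before the watermark."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = None
    for ts in events:  # events are listed in arrival (processing) order
        if max_event_time is not None:
            watermark = max_event_time - delay
            if window_start(ts, size) + size <= watermark:
                dropped.append(ts)  # too late: its window is already finalized
                continue
        counts[window_start(ts, size)] += 1
        max_event_time = ts if max_event_time is None else max(max_event_time, ts)
    return dict(counts), dropped


t0 = datetime(2025, 1, 1, 12, 0, 0)
arrivals = [
    t0 + timedelta(minutes=1),
    t0 + timedelta(minutes=7),
    t0 + timedelta(minutes=21),
    t0 + timedelta(minutes=2),   # arrives after 12:21 was seen: window 12:00-12:05 is closed
    t0 + timedelta(minutes=16),  # late, but its window 12:15-12:20 is still open
]
counts, dropped = aggregate_with_watermark(arrivals)
print(counts)
print("dropped:", dropped)
```

The event at 12:02 is discarded because the watermark has already reached 12:11, while the event at 12:16 is still counted. A longer delay keeps more late data at the cost of holding state for longer.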

**Best Practices:**

1. ✅ **Delta as the primary sink** in production (ACID + performance)
2. ✅ **foreachBatch for custom logic** (upserts, alerts, etc.)
3. ✅ **Auto-compaction** or a scheduled OPTIMIZE
4. ✅ **Z-ordering** on frequently filtered columns
5. ✅ **Partition by date** (balances granularity against partition count)
6. ✅ **maxOffsetsPerTrigger** to control ingestion
7. ✅ **Monitor metrics** continuously (Prometheus, Datadog)
8. ⚠️ **Change Data Feed** for downstream consumers
9. ⚠️ **failOnDataLoss=false** only in dev (strict in prod)
10. ❌ **Do not over-partition** (avoid millions of partitions)

**Performance Benchmarks:**

```
Scenario: 1M events/s, 10 KB/event, 100 partitions

Kafka → Spark Streaming → Parquet:
- Latency: ~30s (trigger interval + processing)
- Small files: 1000+ files/hour
- OPTIMIZE needed: Daily
- Cost: $$

Kafka → Spark Streaming → Delta (optimized):
- Latency: ~30s
- Small files: Auto-compacted
- OPTIMIZE needed: Weekly
- Cost: $$ (slightly higher for optimization)
- Benefit: ACID, time travel, upserts

Kafka → Flink → Delta:
- Latency: ~5s
- More complex setup
- Cost: $$$
- Benefit: Lower latency, event-time watermarks native
```

---
**Author:** Luis J. Raigoso V. (LJRV)

## 1. Initialize the Spark Session

In [7]:
# Note: PySpark requires Java to be installed. For this educational example,
# we show the code and simulate the results with pandas

print("⚠️ NOTE: PySpark requires Java JDK 8/11/17 to be installed.")
print("📚 This notebook shows example code and simulations with pandas")
print("🔗 To use real Spark: https://spark.apache.org/downloads.html\n")

# Example code to create a Spark Session (requires Java)
spark_code = '''
spark = SparkSession.builder \\
    .appName("SparkStreamingAdvanced") \\
    .config("spark.sql.shuffle.partitions", "4") \\
    .config("spark.sql.streaming.schemaInference", "true") \\
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \\
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")
print(f"Spark Version: {spark.version}")
'''
print("📝 Example Spark Session code:")
print(spark_code)

⚠️ NOTE: PySpark requires Java JDK 8/11/17 to be installed.
📚 This notebook shows example code and simulations with pandas
🔗 To use real Spark: https://spark.apache.org/downloads.html

📝 Example Spark Session code:

spark = SparkSession.builder \
    .appName("SparkStreamingAdvanced") \
    .config("spark.sql.shuffle.partitions", "4") \
    .config("spark.sql.streaming.schemaInference", "true") \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")
print(f"Spark Version: {spark.version}")



## 2. Define Schemas for Streaming

In [9]:
# Schema for e-commerce events
ecommerce_schema = StructType([
    StructField("event_id", StringType(), False),
    StructField("timestamp", TimestampType(), False),
    StructField("user_id", StringType(), False),
    StructField("event_type", StringType(), False),
    StructField("product_id", StringType(), False),
    StructField("product_name", StringType(), True),
    StructField("category", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("quantity", IntegerType(), True)
])

# Schema for application logs
log_schema = StructType([
    StructField("timestamp", TimestampType(), False),
    StructField("level", StringType(), False),
    StructField("service", StringType(), False),
    StructField("message", StringType(), True),
    StructField("error_code", IntegerType(), True),
    StructField("user_id", StringType(), True)
])

print("Schemas defined")
print("\nE-commerce schema:")
print(ecommerce_schema)

Schemas defined

E-commerce schema:
StructType([StructField('event_id', StringType(), False), StructField('timestamp', TimestampType(), False), StructField('user_id', StringType(), False), StructField('event_type', StringType(), False), StructField('product_id', StringType(), False), StructField('product_name', StringType(), True), StructField('category', StringType(), True), StructField('price', DoubleType(), True), StructField('quantity', IntegerType(), True)])


## 3. Simulating a Streaming Source

In [11]:
# Create sample data to simulate streaming
def generate_sample_data(n_records=1000):
    """
    Generate sample data for streaming
    """
    np.random.seed(42)
    
    base_time = datetime.now()
    
    data = []
    for i in range(n_records):
        event = {
            'event_id': f'evt_{i:06d}',
            'timestamp': base_time + timedelta(seconds=i),
            'user_id': f'user_{np.random.randint(1, 101)}',
            'event_type': np.random.choice(['view', 'add_to_cart', 'purchase', 'remove'], p=[0.5, 0.25, 0.15, 0.1]),
            'product_id': f'prod_{np.random.randint(1, 51)}',
            'product_name': np.random.choice(['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones']),
            'category': np.random.choice(['Electronics', 'Accessories', 'Computers']),
            'price': round(np.random.uniform(10, 2000), 2),
            'quantity': np.random.randint(1, 5)
        }
        data.append(event)
    
    return data


# Generate the data with pandas
sample_data = generate_sample_data(1000)
df_sample = pd.DataFrame(sample_data)

print(f"✅ Generated {len(df_sample)} sample records")
print("\n📊 First 5 records:")
print(df_sample.head())

✅ Generated 1000 sample records

📊 First 5 records:
     event_id                  timestamp  user_id event_type product_id  \
0  evt_000000 2025-12-09 13:08:01.282407  user_52     remove    prod_43   
1  evt_000001 2025-12-09 13:08:02.282407  user_87       view    prod_24   
2  evt_000002 2025-12-09 13:08:03.282407  user_88   purchase    prod_38   
3  evt_000003 2025-12-09 13:08:04.282407  user_22       view    prod_25   
4  evt_000004 2025-12-09 13:08:05.282407  user_92     remove    prod_15   

  product_name     category    price  quantity  
0   Headphones  Electronics   320.48         3  
1   Headphones    Computers    50.96         2  
2        Mouse  Electronics  1238.79         2  
3       Laptop    Computers  1227.59         2  
4     Keyboard    Computers  1966.63         1  


## 4. Streaming with the Rate Source (Simulation)

In [12]:
# Simulating streaming with pandas
print("📝 Spark code for the rate source (requires Java):")
print('''
rate_stream = spark.readStream \\
    .format("rate") \\
    .option("rowsPerSecond", 10) \\
    .option("numPartitions", 2) \\
    .load()
''')

# Simulation with pandas: generate one micro-batch
print("\n🔄 Simulating a micro-batch with pandas:")
micro_batch = generate_sample_data(100)
df_micro_batch = pd.DataFrame(micro_batch)

print(f"✅ Micro-batch generated: {len(df_micro_batch)} records")
print(f"📊 Event types: {df_micro_batch['event_type'].value_counts().to_dict()}")

📝 Spark code for the rate source (requires Java):

rate_stream = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", 10) \
    .option("numPartitions", 2) \
    .load()


🔄 Simulating a micro-batch with pandas:
✅ Micro-batch generated: 100 records
📊 Event types: {np.str_('view'): 54, np.str_('add_to_cart'): 18, np.str_('remove'): 15, np.str_('purchase'): 13}


## 5. Time-Window Aggregations

In [13]:
# Simulating time-window aggregations with pandas
print("📝 Spark code for windowed aggregations:")
print('''
windowed_counts = stream \\
    .withWatermark("timestamp", "10 seconds") \\
    .groupBy(window(col("timestamp"), "30 seconds", "10 seconds"), col("event_type")) \\
    .agg(count("*").alias("event_count"), spark_sum(expr("price * quantity")).alias("total_revenue"))
''')

# Simulation with pandas. Simplification: tumbling 30-second windows and the
# mean price, rather than the sliding windows and revenue of the Spark snippet.
print("\n🔄 Simulation with pandas:")
df_micro_batch['timestamp'] = pd.to_datetime(df_micro_batch['timestamp'])
df_micro_batch = df_micro_batch.set_index('timestamp')

# Aggregate over 30-second windows ('30s'; the uppercase 'S' alias is deprecated)
windowed = df_micro_batch.groupby([pd.Grouper(freq='30s'), 'event_type']).agg({
    'event_id': 'count',
    'price': 'mean'
}).rename(columns={'event_id': 'event_count', 'price': 'avg_price'})

print("✅ Windowed aggregations computed")
print(windowed.head(10))

📝 Spark code for windowed aggregations:

windowed_counts = stream \
    .withWatermark("timestamp", "10 seconds") \
    .groupBy(window(col("timestamp"), "30 seconds", "10 seconds"), col("event_type")) \
    .agg(count("*").alias("event_count"), spark_sum(expr("price * quantity")).alias("total_revenue"))


🔄 Simulation with pandas:
✅ Windowed aggregations computed
                                 event_count    avg_price
timestamp           event_type                           
2025-12-09 13:08:30 add_to_cart            2  1926.795000
                    purchase               2   689.120000
                    remove                 3   776.293333
                    view                   5   919.804000
2025-12-09 13:09:00 add_to_cart            3  1399.356667
                    purchase               5  1062.180000
                    remove                 4   761.950000
                    view                  18  1066.811111
2025-12-09 13:09:30 add_to_cart           


## 6. Stateful Processing

In [14]:
# Simulating stateful per-user aggregations with pandas
print("📝 Spark code for stateful aggregations:")
print('''
# requires: from pyspark.sql.functions import when
user_aggregations = stream \\
    .groupBy("user_id") \\
    .agg(
        count("*").alias("total_events"),
        spark_sum(when(col("event_type") == "purchase", 1).otherwise(0)).alias("purchases")
    )
''')

# Simulation with pandas (named aggregation keeps each output column explicit)
print("\n🔄 Simulation with pandas:")
user_stats = df_sample.groupby('user_id').agg(
    total_events=('event_id', 'count'),
    purchases=('event_type', lambda s: (s == 'purchase').sum()),
    last_activity=('timestamp', 'max')
)

print("✅ Per-user statistics:")
print(user_stats.head(10))

📝 Spark code for stateful aggregations:

# requires: from pyspark.sql.functions import when
user_aggregations = stream \
    .groupBy("user_id") \
    .agg(
        count("*").alias("total_events"),
        spark_sum(when(col("event_type") == "purchase", 1).otherwise(0)).alias("purchases")
    )


🔄 Simulation with pandas:
✅ Per-user statistics:
          total_events  purchases              last_activity
user_id                                                     
user_1               9          2 2025-12-09 13:23:57.282407
user_10             10          2 2025-12-09 13:24:01.282407
user_100             7          2 2025-12-09 13:24:22.282407
user_11             12          3 2025-12-09 13:21:27.282407
user_12              7          0 2025-12-09 13:24:00.282407
user_13             13          2 2025-12-09 13:24:20.282407
user_14             17          3 2025-12-09 13:24:02.282407
user_15             13          2 2025-12-09 13:23:37.282407
user_16             15          0 2025-12-09 13:23:43.282407
user_17             13          4

## 7. Real-Time Pattern Detection

In [15]:
# Simulating anomaly detection with pandas
print("📝 Spark code for anomaly detection:")
print('''
anomaly_detection = stream \\
    .withWatermark("timestamp", "5 minutes") \\
    .groupBy(window(col("timestamp"), "1 minute"), col("user_id")) \\
    .agg(count("*").alias("events_per_minute")) \\
    .filter(col("events_per_minute") > 50)
''')

# Simulation with pandas (threshold lowered to 3 because the micro-batch has only 100 events)
print("\n🔄 Simulation with pandas:")
df_micro_batch_reset = df_micro_batch.reset_index()
anomalies = df_micro_batch_reset.groupby('user_id').size()
anomalies = anomalies[anomalies > 3]  # users with more than 3 events

print(f"✅ Users with anomalous behavior detected: {len(anomalies)}")
if len(anomalies) > 0:
    print(anomalies.head())

📝 Spark code for anomaly detection:

anomaly_detection = stream \
    .withWatermark("timestamp", "5 minutes") \
    .groupBy(window(col("timestamp"), "1 minute"), col("user_id")) \
    .agg(count("*").alias("events_per_minute")) \
    .filter(col("events_per_minute") > 50)


🔄 Simulation with pandas:
✅ Users with anomalous behavior detected: 3
user_id
user_63    4
user_89    4
user_92    4
dtype: int64
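
The pandas simulation above counts events per user over the whole micro-batch, whereas the Spark snippet counts per user within each 1-minute window. A minimal sketch of the windowed version, in plain Python so it runs without Java, could look like the following. The helper names, the sample events, and the threshold of 5 are assumptions chosen to keep the example small.

```python
from collections import Counter
from datetime import datetime, timedelta


def events_per_minute(events):
    """Count events per (1-minute window start, user_id)."""
    counts = Counter()
    for ts, user_id in events:
        minute = ts.replace(second=0, microsecond=0)  # start of the 1-minute tumbling window
        counts[(minute, user_id)] += 1
    return counts


def find_anomalies(events, threshold):
    """Return the (window, user) pairs whose event count exceeds the threshold."""
    return {key: n for key, n in events_per_minute(events).items() if n > threshold}


start = datetime(2025, 1, 1, 12, 0, 0)
# user_a fires 6 events within one minute; user_b fires 6 events spread over 6 minutes.
events = [(start + timedelta(seconds=5 * i), "user_a") for i in range(6)]
events += [(start + timedelta(minutes=i), "user_b") for i in range(6)]

anomalies = find_anomalies(events, threshold=5)
print(anomalies)
```

Both users produce six events in total, but only `user_a` is flagged, because the rate per window is what matters rather than the overall count.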


## 8. Example Query with Full Output

In [16]:
# Batch analysis with pandas (simulating Spark results)
print("\n=== BATCH ANALYSIS OF THE SAMPLE DATA ===")

print("\n1. Events by type:")
print(df_sample['event_type'].value_counts())

print("\n2. Revenue by category (purchases only):")
purchases = df_sample[df_sample['event_type'] == 'purchase'].copy()
purchases['revenue'] = purchases['price'] * purchases['quantity']
revenue_by_cat = purchases.groupby('category').agg({
    'event_id': 'count',
    'revenue': 'sum',
    'price': 'mean'
}).rename(columns={'event_id': 'num_purchases', 'revenue': 'total_revenue', 'price': 'avg_price'})
print(revenue_by_cat.sort_values('total_revenue', ascending=False))

print("\n3. Top 10 most active users:")
user_activity = df_sample.groupby('user_id').agg({
    'event_id': 'count',
    'event_type': lambda x: (x == 'purchase').sum()
}).rename(columns={'event_id': 'total_events', 'event_type': 'purchases'})
print(user_activity.sort_values('total_events', ascending=False).head(10))

print("\n4. Conversion rate by product:")
product_conv = df_sample.groupby('product_name').agg({
    'event_type': ['count', lambda x: (x == 'view').sum(), lambda x: (x == 'purchase').sum()]
})
product_conv.columns = ['total_events', 'views', 'purchases']
product_conv['conversion_rate'] = (product_conv['purchases'] / product_conv['views'] * 100).fillna(0)
print(product_conv.sort_values('conversion_rate', ascending=False).head(10))


=== BATCH ANALYSIS OF THE SAMPLE DATA ===

1. Events by type:
event_type
view           481
add_to_cart    257
purchase       161
remove         101
Name: count, dtype: int64

2. Revenue by category (purchases only):
             num_purchases  total_revenue    avg_price
category                                              
Electronics             61      158917.62  1010.036885
Computers               54      117024.44   997.727593
Accessories             46       89701.40   874.156522

3. Top 10 most active users:
         total_events  purchases
user_id                         
user_60            23          5
user_98            18          3
user_14            17          3
user_80            16          4
user_57            16          2
user_61            15          0
user_82            15          4
user_58            15          2
user_16            15          0
user_50            14          2

4. Conversion rate by product:
              total_events  views 

## 9. Writing the Stream to Different Sinks

In [17]:
# Example sink configurations, written as plain dicts so that nothing executes

# 1. Write to the console (development/debugging)
console_query_config = {
    'outputMode': 'complete',  # complete, append, update
    'format': 'console',
    'trigger': {'processingTime': '10 seconds'},
    'options': {
        'truncate': False,
        'numRows': 20
    }
}

# 2. Write to Parquet (data lake)
parquet_query_config = {
    'outputMode': 'append',
    'format': 'parquet',
    'path': '/path/to/output',
    'checkpointLocation': '/path/to/checkpoint',
    'trigger': {'processingTime': '1 minute'},
    'partitionBy': ['date'],  # partitionBy is a writer method, not an option
    'options': {
        'compression': 'snappy'
    }
}

# 3. Write to Kafka
kafka_query_config = {
    'outputMode': 'append',
    'format': 'kafka',
    'options': {
        'kafka.bootstrap.servers': 'localhost:9092',
        'topic': 'processed-events',
        'checkpointLocation': '/path/to/checkpoint'
    }
}

# 4. Write to Delta Lake
delta_query_config = {
    'outputMode': 'append',
    'format': 'delta',
    'path': '/path/to/delta',
    'checkpointLocation': '/path/to/checkpoint',
    'options': {
        'mergeSchema': True,
        'optimizeWrite': True
    }
}

print("Sink configurations defined (see the code for details)")

Sink configurations defined (see the code for details)
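
Config dicts like the ones above only become useful once they are turned into `writeStream` calls. The sketch below shows one possible translation. `apply_sink_config` is a hypothetical helper, not part of PySpark; it calls only methods that `DataStreamWriter` provides (`format`, `outputMode`, `option`, `partitionBy`, `trigger`, `start`). So that it runs without Java, it is exercised here against a small recording stand-in instead of a real `df.writeStream`, and it assumes `partitionBy` is given as a top-level list.

```python
def apply_sink_config(writer, config: dict):
    """Translate a sink-config dict into DataStreamWriter-style fluent calls."""
    writer = writer.format(config["format"]).outputMode(config["outputMode"])
    for key, value in config.get("options", {}).items():
        writer = writer.option(key, value)
    if "checkpointLocation" in config:
        writer = writer.option("checkpointLocation", config["checkpointLocation"])
    if "partitionBy" in config:
        writer = writer.partitionBy(*config["partitionBy"])
    if "trigger" in config:
        writer = writer.trigger(**config["trigger"])
    return writer.start(config["path"]) if "path" in config else writer.start()


class RecordingWriter:
    """Stand-in for DataStreamWriter that records calls instead of running Spark."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any unknown attribute becomes a chainable method that logs its arguments.
        def method(*args, **kwargs):
            self.calls.append((name, args, kwargs))
            return self
        return method


parquet_config = {
    "outputMode": "append",
    "format": "parquet",
    "path": "/path/to/output",
    "checkpointLocation": "/path/to/checkpoint",
    "trigger": {"processingTime": "1 minute"},
    "partitionBy": ["date"],
    "options": {"compression": "snappy"},
}

recorder = RecordingWriter()
apply_sink_config(recorder, parquet_config)
for call in recorder.calls:
    print(call)
```

With a real stream, the first argument would be `df.writeStream` and the return value would be the running `StreamingQuery`.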


## 10. Metrics and Monitoring

In [18]:
# Function to monitor the state of the stream
def monitor_stream_metrics(query):
    """
    Extract metrics from the streaming query
    """
    status = query.status
    
    metrics = {
        'isDataAvailable': status['isDataAvailable'],
        'isTriggerActive': status['isTriggerActive'],
        'message': status['message']
    }
    
    # Throughput rates are reported in lastProgress, not in status
    progress = query.lastProgress or {}
    
    if 'inputRowsPerSecond' in progress:
        metrics['inputRowsPerSecond'] = progress['inputRowsPerSecond']
    
    if 'processedRowsPerSecond' in progress:
        metrics['processedRowsPerSecond'] = progress['processedRowsPerSecond']
    
    return metrics


# Example metrics to monitor
print("\n=== KEY METRICS FOR MONITORING ===")
print("""
1. Input Rate: events received per second
2. Processing Rate: events processed per second
3. Batch Duration: processing time per batch
4. Trigger Interval: time between executions
5. Watermark: maximum lateness accepted
6. Query Status: active, inactive, error
7. Checkpoint Location: for failure recovery
""")


=== KEY METRICS FOR MONITORING ===

1. Input Rate: events received per second
2. Processing Rate: events processed per second
3. Batch Duration: processing time per batch
4. Trigger Interval: time between executions
5. Watermark: maximum lateness accepted
6. Query Status: active, inactive, error
7. Checkpoint Location: for failure recovery



## 11. Performance Optimization

In [19]:
# Optimization settings
optimization_configs = {
    # Partitioning
    'spark.sql.shuffle.partitions': '200',  # number of partitions for shuffles
    'spark.default.parallelism': '200',     # default parallelism
    
    # Memory
    'spark.executor.memory': '4g',
    'spark.driver.memory': '2g',
    'spark.memory.fraction': '0.8',
    
    # Streaming-specific
    'spark.sql.streaming.minBatchesToRetain': '100',
    'spark.sql.streaming.stateStore.providerClass': 'org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider',
    
    # Adaptive query execution
    'spark.sql.adaptive.enabled': 'true',
    'spark.sql.adaptive.coalescePartitions.enabled': 'true',
}

print("\n=== OPTIMIZATION BEST PRACTICES ===")
print("""
1. Use watermarks to clean up old state
2. Partition data by key columns
3. Set spark.sql.shuffle.partitions appropriately
4. Use time-based triggers to control frequency
5. Implement checkpointing for recovery
6. Monitor metrics constantly
7. Use Delta Lake for ACID transactions
8. Compact small files
9. Optimize schemas and avoid generic types
10. Consider micro-batching vs continuous processing
""")


=== OPTIMIZATION BEST PRACTICES ===

1. Use watermarks to clean up old state
2. Partition data by key columns
3. Set spark.sql.shuffle.partitions appropriately
4. Use time-based triggers to control frequency
5. Implement checkpointing for recovery
6. Monitor metrics constantly
7. Use Delta Lake for ACID transactions
8. Compact small files
9. Optimize schemas and avoid generic types
10. Consider micro-batching vs continuous processing



## Summary and Enterprise Architecture

### Typical Streaming Architecture:
```
Data Sources           Ingestion         Processing           Storage               Consumption
────────────────       ───────           ──────────────       ──────────────        ───────
Kafka/Kinesis    →    Spark       →     Transformations   →   Delta Lake      →    BI Tools
IoT Devices      →    Streaming   →     Aggregations      →   Data Lake       →    ML Models
APIs             →                →     Joins             →   Warehouse       →    Dashboards
Logs             →                →     Windows           →   Cache (Redis)   →    Alerts
```

### Advanced Patterns:

#### 1. Lambda Architecture
- **Batch Layer**: full historical processing
- **Speed Layer**: real-time processing
- **Serving Layer**: combines both views
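
As a rough illustration of how the serving layer combines the two views, the sketch below merges a precomputed batch view with a real-time speed view of event counts. The data and the `serve` function are hypothetical; a real serving layer would also have to make sure the two views do not overlap in time, so that no event is counted twice.

```python
from collections import Counter


def serve(batch_view: Counter, speed_view: Counter) -> Counter:
    """Serving layer: add the real-time increments to the precomputed batch totals."""
    return batch_view + speed_view


# Batch layer: totals up to the last completed batch run (made-up numbers).
batch_view = Counter({"view": 10_000, "purchase": 1_200})
# Speed layer: events that arrived since that run.
speed_view = Counter({"view": 35, "purchase": 4, "add_to_cart": 9})

merged = serve(batch_view, speed_view)
print(merged)
```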

#### 2. Kappa Architecture
- Streaming layer only
- All processing happens in real time
- Reprocessing replays the stream from the beginning

#### 3. Delta Architecture
- Built on Delta Lake
- ACID transactions
- Time travel
- Schema evolution

### Enterprise Use Cases:

1. **Real-Time Fraud Detection**
   - Analysis of suspicious patterns
   - Machine learning on streaming data
   - Automatic alerts

2. **Personalized Recommendations**
   - Real-time behavior tracking
   - User profile updates
   - Dynamic A/B testing

3. **Infrastructure Monitoring**
   - Real-time logs and metrics
   - Anomaly detection
   - Load-based auto-scaling

4. **IoT and Telemetry**
   - Sensor data processing
   - Predictive maintenance
   - Operations optimization

### Production Considerations:

- **High Availability**: cluster mode, multiple workers
- **Fault Tolerance**: checkpointing, write-ahead logs
- **Scalability**: auto-scaling, dynamic allocation
- **Security**: Kerberos, SSL/TLS, encryption at rest
- **Monitoring**: Prometheus, Grafana, CloudWatch
- **Testing**: unit tests, integration tests, chaos engineering

### Additional Resources:
- [Spark Structured Streaming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
- [Delta Lake Documentation](https://docs.delta.io/latest/index.html)
- [Databricks Streaming Best Practices](https://docs.databricks.com/structured-streaming/index.html)

In [22]:
# Clean up resources
# Note: in a real Spark environment you would run: spark.stop()
print("✅ Notebook completed successfully")
print("📚 You have learned the concepts of Spark Streaming through practical examples")
print("🔗 To run this on real Spark, install a Java JDK and configure PySpark")

✅ Notebook completed successfully
📚 You have learned the concepts of Spark Streaming through practical examples
🔗 To run this on real Spark, install a Java JDK and configure PySpark


