# 🏆 Senior Capstone Project 1: Complete Data Platform

Objective: design and implement a modern data platform with governance, a lakehouse, orchestration, observability, and compliance.

- Duration: 180+ min (multi-day project)
- Difficulty: Very high
- Prerequisites: all Senior notebooks 01–08

### 🏗️ **Architecture Design: Patterns and Trade-offs**

**1. Lambda vs Kappa Architecture: Hybrid Approach**

```
┌─────────────────────────────────────────────────────────────┐
│              Modern Data Platform Architecture              │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐        ┌──────────────┐                  │
│  │  Real-Time   │        │    Batch     │                  │
│  │  (Kafka)     │        │   (Airflow)  │                  │
│  └──────┬───────┘        └──────┬───────┘                  │
│         │                       │                           │
│         ▼                       ▼                           │
│  ┌────────────────────────────────────────┐                │
│  │     Unified Lakehouse Layer            │                │
│  │  (Delta Lake - ACID + Time Travel)     │                │
│  ├────────────────────────────────────────┤                │
│  │  raw/     (bronze - unprocessed)       │                │
│  │  curated/ (silver - validated)         │                │
│  │  gold/    (gold - aggregated)          │                │
│  └────────────────────────────────────────┘                │
│         │                                                    │
│         ├──────┬─────────┬─────────┬───────────┐           │
│         ▼      ▼         ▼         ▼           ▼           │
│     Athena  Trino   FastAPI   Tableau   ML Models          │
└─────────────────────────────────────────────────────────────┘
```

**Why hybrid?**

| Aspect | Streaming (Kafka + Spark) | Batch (Airflow) |
|--------|---------------------------|-----------------|
| **Latency** | <30s | 1h-24h |
| **Complexity** | High (state management) | Medium |
| **Cost** | High (always-on) | Low (on-demand) |
| **Use cases** | Fraud detection, alerts | Reports, ML training |

**Decision:** We use streaming for latency-critical data (transactions, user events) and batch for historical aggregations and complex ETL.

---

**2. Storage Layer: S3 + Delta Lake**

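**Why not plain Parquet?**

```python
# Problem with plain Parquet: no ACID transactions
# ❌ Two jobs overwriting the same path can leave a mix of files from both
df1.write.mode("overwrite").parquet("s3://bucket/data/")  # Job 1
df2.write.mode("overwrite").parquet("s3://bucket/data/")  # Job 2 (concurrent) → inconsistent or corrupt output

# ✅ Delta Lake commits each write atomically through its transaction log.
# Note: on S3, writers running on different clusters also need a LogStore
# that provides mutual exclusion (e.g. the DynamoDB-backed LogStore).
df1.write.format("delta").mode("append").save("s3://bucket/data/")
df2.write.format("delta").mode("append").save("s3://bucket/data/")  # Concurrent appends commit safely
```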

**Delta Lake Features:**

| Feature | Benefit |
|---------|---------|
| **ACID Transactions** | Readers never see partial or corrupt writes |
| **Time Travel** | ``SELECT * FROM delta.`table` VERSION AS OF 10`` |
| **Schema Evolution** | Add columns (and drop them, with column mapping enabled) without rewriting data |
| **OPTIMIZE** | Compacts small files for better read performance |
| **VACUUM** | Deletes files no longer referenced by the table, reducing storage cost |
| **MERGE** | Efficient upserts (CDC) |

**Directory Structure:**

```
s3://ecommerce-data/
├── raw/                    # Bronze - Immutable (7-day retention)
│   ├── ventas/
│   │   └── dt=2024-01-15/
│   │       └── kafka-offset-123.parquet
│   └── logistica/
│       └── dt=2024-01-15/
│           └── sftp-file-001.csv
│
├── curated/                # Silver - Validated and clean (1-year retention)
│   ├── ventas/
│   │   ├── _delta_log/    # Transaction log
│   │   └── region=LATAM/
│   │       └── dt=2024-01-15/
│   │           └── part-00000.parquet
│   └── clientes/
│       ├── _delta_log/
│       └── tipo=premium/
│           └── dt=2024-01-15/
│
└── gold/                   # Gold - Aggregated (indefinite retention)
    ├── ventas_diarias/
    │   └── dt=2024-01-15/
    │       └── region=LATAM_producto=laptop.parquet
    └── kpis/
        └── revenue_by_cohort/
```

**Partitioning Strategy:**

```python
# ❌ Bad: too many partition columns (small-files problem)
df.write.partitionBy("dt", "hora", "minuto", "cliente_id").parquet(...)
# Result: 1M+ files of ~100 KB → slow and expensive S3 LIST/GET operations

# ✅ Good: balanced partitions
df.write.partitionBy("dt", "region").format("delta").save(...)
# Result: ~100 files of ~128 MB (a good target size for Spark)

# Compact small files
spark.sql("OPTIMIZE delta.`s3://bucket/curated/ventas` ZORDER BY (cliente_id)")
# ZORDER co-locates rows by cliente_id, so filters on that column can skip most files
```
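To make the small-files argument concrete, the following sketch estimates how many partitions and files a partitioning scheme produces for one day of data. The function name `estimate_layout`, the 128 MB target, and the example cardinalities are assumptions for illustration, and the model assumes data is spread evenly across partition values, which real data rarely is.

```python
import math

GB = 1024 ** 3
MB = 1024 ** 2


def estimate_layout(daily_bytes, partition_cardinalities, target_file_bytes=128 * MB):
    """Estimate partitions and files for one day of data.

    partition_cardinalities maps each partition column (besides the date)
    to its number of distinct values. Assumes an even spread of data.
    """
    partitions = math.prod(partition_cardinalities.values())
    bytes_per_partition = daily_bytes / partitions
    files_per_partition = max(1, math.ceil(bytes_per_partition / target_file_bytes))
    return {
        "partitions": partitions,
        "bytes_per_partition": bytes_per_partition,
        "total_files": partitions * files_per_partition,
        # Flag layouts whose partitions hold far less than one target-size file
        "small_files": bytes_per_partition < target_file_bytes / 4,
    }


# Hypothetical inputs: 100 GB/day, 4 regions, 9,000 distinct clients
balanced = estimate_layout(100 * GB, {"region": 4})
too_fine = estimate_layout(100 * GB, {"hora": 24, "minuto": 60, "cliente_id": 9000})
print(balanced["total_files"], too_fine["partitions"], too_fine["small_files"])
```

With these inputs the `dt` + `region` layout yields a few hundred files of the target size, while the fine-grained layout produces millions of partitions that each hold only a few kilobytes.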

---

**3. Compute Layer: Spark vs Pandas vs Polars**

**Decision Matrix:**

| Framework | Volume | Performance | Ecosystem |
|-----------|--------|-------------|-----------|
| **Pandas** | <10 GB | Baseline | Extensive |
| **Polars** | <100 GB | 5-10x faster | Growing |
| **Spark** | >100 GB | Scalable | Industry std |

**Our choice:**
- **Streaming:** Spark Structured Streaming (mature, well-supported Kafka integration)
- **Simple batch:** Polars (daily reports <50 GB)
- **Complex batch:** Spark (large joins, ML pipelines)

```python
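# Example: batch with Polars (typically faster than Pandas)
import polars as pl

# Lazy scan: nothing is read until .collect()
df = pl.scan_parquet("s3://bucket/raw/ventas/*/*.parquet")  # also matches dt=.../ subfolders
result = (
    df
    .filter(pl.col("monto") > 100)
    .group_by("region")  # `groupby` was renamed to `group_by` in Polars 0.19
    .agg(pl.col("monto").sum())
    .collect(engine="streaming")  # Out-of-core processing; older releases use collect(streaming=True)
)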
```

---

**4. Orchestration: Airflow vs Dagster vs Prefect**

**Comparison:**

| Aspect | Airflow | Dagster | Prefect |
|--------|---------|---------|---------|
| **Maturity** | ✅ 10+ years | ⚠️ ~5 years | ⚠️ ~5 years |
| **Learning Curve** | Steep | Medium | Easy |
| **Data Testing** | Manual (GE) | Native | Via hooks |
| **Data Lineage** | Plugin (OpenLineage) | Native | Plugin |
| **Deployment** | Complex | Docker-first | Cloud-native |
| **Community** | Huge | Growing | Growing |

**Decision: Airflow**
- ✅ Mature ecosystem (large catalog of operators and providers)
- ✅ OpenLineage integration for lineage
- ✅ Existing skills on the team
- ⚠️ Operational complexity (mitigated with MWAA/Astronomer)

```python
# Modern Airflow Pattern: TaskFlow API
from airflow.decorators import dag, task
from datetime import datetime

@dag(
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["ecommerce", "ventas"]
)
def ventas_pipeline():
    
    @task
    def extract():
        return {"path": "s3://bucket/raw/ventas/2024-01-15/"}
    
    @task
    def validate(data):
        # Great Expectations
        return {"valid": True, "errors": []}
    
    @task
    def transform(data):
        # Spark job
        return {"output": "s3://bucket/curated/ventas/"}
    
    @task
    def publish_lineage(data):
        # OpenLineage event
        pass
    
    data = extract()
    validation = validate(data)
    transformed = transform(validation)
    publish_lineage(transformed)

dag = ventas_pipeline()
```

---

**5. Governance: Glue Catalog vs Unity Catalog vs DataHub**

**Comparison:**

|  | Glue Catalog | Unity Catalog | DataHub |
|--|--------------|---------------|---------|
| **Cloud** | AWS only | Databricks | Any |
| **Lineage** | Limited | Column-level | Column-level |
| **RBAC** | Lake Formation | Native | Native |
| **API** | Boto3 | REST/SQL | GraphQL |
| **Cost** | $1/million API calls | Included | Open-source |

**Our hybrid architecture:**

```
┌─────────────────────────────────────────┐
│         Governance Stack                │
├─────────────────────────────────────────┤
│                                          │
│  Glue Catalog (Technical Metadata)      │
│    - Table schemas                      │
│    - Partitions                         │
│    - Athena/Spark integration           │
│                                          │
│          ↓ (sync via API)               │
│                                          │
│  DataHub (Business Metadata)            │
│    - Ownership (Data Engineer: Jane)    │
│    - Tags (PII, GDPR, Confidential)     │
│    - Glossary (ARR = Annual Revenue)    │
│    - Lineage (upstream/downstream)      │
│                                          │
└─────────────────────────────────────────┘
```

**Sync Script:**

```python
import boto3
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

glue = boto3.client('glue')
emitter = DatahubRestEmitter('http://datahub:8080')

# Sync Glue → DataHub. get_tables is paginated, so iterate over every page.
paginator = glue.get_paginator('get_tables')
for page in paginator.paginate(DatabaseName='ecommerce'):
    for table in page['TableList']:
        dataset_urn = f"urn:li:dataset:(urn:li:dataPlatform:glue,ecommerce.{table['Name']},PROD)"

        properties = DatasetPropertiesClass(
            customProperties={
                "glue_database": "ecommerce",
                "owner": table.get("Owner", "unknown"),
                # Views and some table types have no storage location
                "location": table.get("StorageDescriptor", {}).get("Location", "")
            }
        )

        emitter.emit_mcp(
            MetadataChangeProposalWrapper(
                entityUrn=dataset_urn,
                aspect=properties
            )
        )
```

---

**6. Critical Trade-offs**

**Latency vs Cost:**

```
┌─────────────────────────────────────────┐
│  Streaming (Kafka + Spark)              │
│  - Latency: <30s                        │
│  - Cost: $3,000/month (EMR 24/7)        │
└─────────────────────────────────────────┘
           ↕ (business decision)
┌─────────────────────────────────────────┐
│  Mini-batch (Airflow every 5 min)       │
│  - Latency: 5-10 min                    │
│  - Cost: $800/month (spot instances)    │
└─────────────────────────────────────────┘
```

**Key question:** Does the business really need sub-minute latency?
- Fraud detection: **Yes** → Streaming
- Executive reports: **No** → Batch
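The latency question can be phrased as a small selection rule: among the options that meet the required latency, pick the cheapest. The sketch below uses the figures from the comparison above as illustrative estimates; `PathOption` and `cheapest_path` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathOption:
    name: str
    max_latency_s: int      # worst-case end-to-end latency
    monthly_cost_usd: int


# Figures from the comparison above; treat them as rough estimates.
OPTIONS = [
    PathOption("streaming", max_latency_s=30, monthly_cost_usd=3000),
    PathOption("mini-batch", max_latency_s=600, monthly_cost_usd=800),
]


def cheapest_path(required_latency_s, options=OPTIONS):
    """Return the cheapest option whose worst-case latency meets the requirement."""
    feasible = [o for o in options if o.max_latency_s <= required_latency_s]
    if not feasible:
        raise ValueError(f"No option meets a {required_latency_s}s latency requirement")
    return min(feasible, key=lambda o: o.monthly_cost_usd)


print(cheapest_path(60).name)     # fraud detection: needs sub-minute latency
print(cheapest_path(3600).name)   # executive reports: an hour is acceptable
```

Writing the requirement down as a number per use case forces the business conversation the section asks for, instead of defaulting everything to streaming.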

**Consistencia vs Disponibilidad (CAP Theorem):**

```python
# Option 1: Strong consistency (CP)
# - Use RDS/PostgreSQL for critical metadata
# - ACID transactions guaranteed
# - Possible downtime during a network partition

# Option 2: Eventual consistency (AP)
# - Use S3 + DynamoDB
# - High availability (99.99%)
# - Possible stale reads for <1s (e.g. DynamoDB eventually consistent reads;
#   S3 object reads have been strongly consistent since December 2020)

# Decision: Hybrid
# - Financial transactions → RDS (CP)
# - User events → S3 (AP)
```

**Open Source vs Managed Services:**

| Component | Open Source | Managed | Decision |
|-----------|-------------|---------|----------|
| **Kafka** | Self-hosted (EC2) | MSK | MSK (less ops work) |
| **Airflow** | Docker Compose | MWAA | MWAA (scaling) |
| **Spark** | Manually managed EMR | EMR Serverless | EMR Serverless |
| **DataHub** | K8s deployment | N/A | Self-hosted |

**Criterion:** If a managed option exists and costs less than 2x the self-hosted option, use managed (it reduces toil).
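The criterion can be written as a one-line rule. This is a minimal sketch; `prefer_managed` is a hypothetical helper and the dollar amounts in the usage lines are made-up inputs, not figures from this project.

```python
from typing import Optional


def prefer_managed(self_hosted_monthly_usd: float,
                   managed_monthly_usd: Optional[float],
                   max_cost_ratio: float = 2.0) -> bool:
    """Choose managed when it exists and costs less than max_cost_ratio x self-hosted."""
    if managed_monthly_usd is None:  # no managed offering (the DataHub row above)
        return False
    return managed_monthly_usd < max_cost_ratio * self_hosted_monthly_usd


# Hypothetical monthly costs in USD
print(prefer_managed(1000, 1800))  # managed is 1.8x → use managed
print(prefer_managed(1000, 2500))  # managed is 2.5x → stay self-hosted
print(prefer_managed(1000, None))  # nothing managed available
```

The self-hosted figure should include engineering time spent on operations, otherwise the ratio systematically understates the cost of running things yourself.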

---

**7. Disaster Recovery Strategy**

```
┌─────────────────────────────────────────┐
│         Backup & Recovery               │
├─────────────────────────────────────────┤
│                                          │
│  S3 (Data Lake)                         │
│    - Versioning: Enabled                │
│    - Replication: us-east-1 → eu-west-1 │
│    - RTO: <1 hour                       │
│    - RPO: <15 min                       │
│                                          │
│  RDS (Metadata)                         │
│    - Automated backups: Daily           │
│    - Multi-AZ: Enabled                  │
│    - RTO: <5 min (failover)             │
│    - RPO: <5 min (sync replica)         │
│                                          │
│  Glue Catalog                           │
│    - Backup: CloudFormation export      │
│    - Restore: boto3 script              │
│                                          │
└─────────────────────────────────────────┘
```

**Recovery Runbook:**

```bash
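# Scenario: objects accidentally deleted from a versioned bucket
# (versioning cannot recover a deleted bucket; fail over to the replica in that case)
# 1. List object versions and delete markers
aws s3api list-object-versions --bucket ecommerce-data --prefix curated/ventas/

# 2. Restore a specific version (quote the source so the shell does not expand '?')
aws s3api copy-object \
  --copy-source "ecommerce-data/curated/ventas/file.parquet?versionId=abc123" \
  --bucket ecommerce-data-recovered \
  --key curated/ventas/file.parquet

# 3. Fail over to the replica bucket by updating Glue catalog locations.
# update-table replaces the whole table definition, so in practice start from
# the output of `aws glue get-table` and change only Location.
aws glue update-table --database-name ecommerce --table-input '{
  "Name": "ventas",
  "StorageDescriptor": {
    "Location": "s3://ecommerce-data-replica/curated/ventas/"
  }
}'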
```

---

**8. Scaling Strategy**

**Current State (MVP):**
- 100 GB/day ingest
- 10 pipelines
- 5 data engineers
- $5K/month

**12-month projection:**
- 1 TB/day ingest (10x)
- 50 pipelines (5x)
- 15 engineers (3x)
- Budget: $15K/month (3x)
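The projection implies a unit-cost target: ingest grows 10x while budget grows 3x. A short calculation makes that explicit. It assumes a 30-day month and decimal terabytes (1 TB = 1,000 GB); `cost_per_tb` is a hypothetical helper.

```python
def cost_per_tb(monthly_budget_usd: float, daily_ingest_gb: float, days: int = 30) -> float:
    """Monthly budget divided by monthly ingest, in USD per (decimal) TB."""
    monthly_tb = daily_ingest_gb * days / 1000
    return monthly_budget_usd / monthly_tb


current = cost_per_tb(5_000, 100)        # MVP: 100 GB/day on $5K/month
projected = cost_per_tb(15_000, 1_000)   # 12 months: 1 TB/day on $15K/month
reduction = 1 - projected / current

print(f"current:   ${current:,.0f}/TB")
print(f"projected: ${projected:,.0f}/TB")
print(f"required unit-cost reduction: {reduction:.0%}")
```

Under these assumptions the cost per ingested terabyte has to fall by roughly 70%, which is why the scaling plan leans on compaction, archival, and cheaper compute rather than simply adding capacity.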

**Scaling plan:**

```python
# 1. Data scaling
# - Partition strategy: dt + region (avoid skew)
# - Compaction: OPTIMIZE every week
# - Archival: Move raw to Glacier after 30 days

# 2. Compute scaling
# - Airflow: Celery executor (10 workers → 50)
# - Spark: EMR Serverless (auto-scale 1-100 instances)
# - FastAPI: ECS Fargate (auto-scale on CPU)

# 3. Team scaling
# - Data Platform team: 2 SREs
# - Domain teams: 3x Data Engineers per domain
# - Self-service: DataHub + Airflow UI + Jupyter

# 4. Cost optimization
# - Spot instances: up to ~70% savings on EMR
# - S3 Intelligent-Tiering: up to ~40% savings on rarely accessed data
# - Athena: on-demand is billed per TB scanned; evaluate provisioned capacity for steady workloads
```

**Bottleneck analysis:**

```python
# Where will it fail first?
bottlenecks = {
    "S3 PUT rate": "3,500 req/s per prefix (→ increase prefix diversity)",
    "Glue Catalog API": "10 req/s (→ cache with Redis)",
    "Athena concurrency": "25 queries (→ upgrade to Trino)",
    "Airflow scheduler": "1000 tasks/s (→ HA scheduler)",
    "Team knowledge": "Onboarding (→ internal wiki + mentorship)"
}
```

---

**Author:** Luis J. Raigoso V. (LJRV)

### ⚡ **Streaming Path: Kafka → Spark → Delta Lake**

**1. Kafka Setup: Event-Driven Architecture**

**Docker Compose (Local Development):**

```yaml
# docker-compose.yml
version: '3.8'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    ports:
      - "2181:2181"
  
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Two listeners: one for containers on the Compose network, one for clients on the host
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"
      KAFKA_LOG_RETENTION_HOURS: 168  # 7 days
      KAFKA_LOG_SEGMENT_BYTES: 1073741824  # 1 GB
      KAFKA_COMPRESSION_TYPE: "snappy"
  
  schema-registry:
    image: confluentinc/cp-schema-registry:7.5.0
    depends_on:
      - kafka
    ports:
      - "8081:8081"
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: kafka:29092
```

**Production: AWS MSK Configuration:**

```python
import boto3

msk = boto3.client('kafka', region_name='us-east-1')

# Create MSK cluster
response = msk.create_cluster_v2(
    ClusterName='ecommerce-kafka',
    Serverless={
        'VpcConfigs': [{
            'SubnetIds': ['subnet-123', 'subnet-456'],
            'SecurityGroupIds': ['sg-789']
        }],
        'ClientAuthentication': {
            'Sasl': {
                'Iam': {'Enabled': True}  # IAM authentication
            }
        }
    },
    Tags={
        'Environment': 'production',
        'CostCenter': 'data-platform'
    }
)

# Estimated cost: ~$2,500/month (MSK Serverless scales capacity automatically; there are no brokers to size)
```

**Topic Design:**

```python
# topics.py
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers='localhost:9092')

topics = [
    NewTopic(
        name='ecommerce.ventas.v1',
        num_partitions=12,  # Up to 12 parallel consumers per consumer group
        replication_factor=3,  # High availability; needs ≥3 brokers (use 1 on the single-broker local setup)
        topic_configs={
            'retention.ms': '604800000',  # 7 days
            'cleanup.policy': 'delete',
            'compression.type': 'snappy',
            'max.message.bytes': '1048576',  # 1 MB
        }
    ),
    NewTopic(
        name='ecommerce.eventos_usuario.v1',
        num_partitions=24,  # High throughput
        replication_factor=3,
        topic_configs={
            'retention.ms': '86400000',  # 1 day (ephemeral events)
            'cleanup.policy': 'delete',
            'compression.type': 'lz4',  # Better suited to log-like events
        }
    )
]

admin.create_topics(topics)
```

**Partitioning Strategy:**

```python
# ❌ Bad: no key → records are spread across partitions (no per-client ordering)
producer.send('ecommerce.ventas.v1', value=mensaje)

# ✅ Good: key-based partitioning (ordering per client)
producer.send(
    'ecommerce.ventas.v1',
    key=str(cliente_id).encode('utf-8'),  # Same client → same partition
    value=json.dumps(mensaje).encode('utf-8')
)

# Result: events for client 123 always land in the same partition (e.g. partition 5),
# as long as the topic's partition count does not change
# - Ordering guaranteed per client
# - Enables stateful processing (aggregations)
```
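Key-based partitioning works because the partition is a deterministic function of the key. The sketch below illustrates the idea with CRC32 as a stand-in hash. This is an assumption for illustration only: Kafka's default partitioner uses murmur2, so the actual partition numbers will differ.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (CRC32 stand-in, not Kafka's murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# The same client always maps to the same partition...
first = partition_for("123", 12)
second = partition_for("123", 12)
print(first == second)

# ...while different clients spread across partitions.
used = {partition_for(str(cliente_id), 12) for cliente_id in range(1000, 2000)}
print(len(used))

# Changing the partition count can remap keys, which breaks per-key ordering
# across the resize. Plan partition counts with headroom.
print(partition_for("123", 12), partition_for("123", 24))
```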

---

**2. Event Producer: Transacciones Sintéticas**

```python
# producer.py
from kafka import KafkaProducer
from faker import Faker
import json
import time
import random
from datetime import datetime

fake = Faker()

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    key_serializer=lambda k: k.encode('utf-8') if k else None,
    acks='all',  # Wait for all replicas
    retries=3,
    max_in_flight_requests_per_connection=5,
    compression_type='snappy'
)

def generate_transaction():
    """Generate a realistic synthetic transaction"""
    cliente_id = random.randint(1000, 9999)
    productos = random.choices(
        ['laptop', 'phone', 'tablet', 'headphones', 'monitor'],
        weights=[0.1, 0.4, 0.2, 0.2, 0.1],  # phone is the most common
        k=random.randint(1, 3)
    )
    
    return {
        'transaccion_id': fake.uuid4(),
        'cliente_id': cliente_id,
        'email': fake.email(),
        'productos': productos,
        'total': round(sum(random.uniform(50, 2000) for _ in productos), 2),
        'metodo_pago': random.choice(['credit_card', 'debit_card', 'paypal']),
        'tarjeta_ultimos_4': fake.credit_card_number()[-4:],
        'region': random.choice(['LATAM', 'NA', 'EU', 'APAC']),
        'timestamp': datetime.utcnow().isoformat(),
        'metadata': {
            'ip': fake.ipv4(),
            'user_agent': fake.user_agent()
        }
    }

def produce_events(rate_per_second=100):
    """Produce events at a constant rate"""
    interval = 1.0 / rate_per_second
    
    while True:
        start = time.time()
        
        transaction = generate_transaction()
        
        # send() is asynchronous and returns a future
        future = producer.send(
            'ecommerce.ventas.v1',
            key=str(transaction['cliente_id']),
            value=transaction
        )
        
        # Blocking on the future confirms delivery but limits throughput;
        # for higher rates, attach future.add_callback / future.add_errback instead.
        try:
            metadata = future.get(timeout=10)
            print(f"✅ Sent to partition {metadata.partition}, offset {metadata.offset}")
        except Exception as e:
            print(f"❌ Error: {e}")
        
        # Rate limiting
        elapsed = time.time() - start
        time.sleep(max(0, interval - elapsed))

if __name__ == '__main__':
    print("🚀 Producing events at 100/sec...")
    produce_events(rate_per_second=100)
    # 100 events/sec = 8.6M events/day
```

**Schema Evolution with Avro:**

```python
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer  # legacy API; newer code uses SerializingProducer + AvroSerializer

# Schema Registry versions each registered schema automatically.
# The schema must be valid JSON, so it cannot contain comments.
# `email` is the new optional field: nullable with a default, which keeps the change backward compatible.
value_schema = avro.loads('''
{
  "type": "record",
  "name": "Venta",
  "namespace": "com.ecommerce",
  "fields": [
    {"name": "transaccion_id", "type": "string"},
    {"name": "cliente_id", "type": "int"},
    {"name": "total", "type": "double"},
    {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
''')

producer = AvroProducer({
    'bootstrap.servers': 'localhost:9092',
    'schema.registry.url': 'http://localhost:8081'
}, default_value_schema=value_schema)

# The registry rejects incompatible schema versions at registration time;
# consumers deserialize each message with the schema ID embedded in it
```
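Why a nullable field with a default is a safe change can be shown without Avro at all. The following is a simplified sketch of reader-side schema resolution on plain dictionaries; `V2_FIELDS`, `REQUIRED`, and `resolve` are hypothetical names, and real Avro resolution also handles type promotion and aliases, which are omitted here.

```python
REQUIRED = object()  # sentinel: the field has no default

# Reader schema v2: field name -> default value
V2_FIELDS = {
    "transaccion_id": REQUIRED,
    "cliente_id": REQUIRED,
    "total": REQUIRED,
    "timestamp": REQUIRED,
    "email": None,  # new optional field, default null
}


def resolve(record: dict, reader_fields: dict) -> dict:
    """Project a decoded record onto the reader schema, filling defaults for missing fields."""
    resolved = {}
    for name, default in reader_fields.items():
        if name in record:
            resolved[name] = record[name]
        elif default is REQUIRED:
            raise ValueError(f"Missing required field: {name}")
        else:
            resolved[name] = default
    return resolved


# An event written with the old schema (no email) is still readable with v2
old_event = {"transaccion_id": "t-1", "cliente_id": 1234, "total": 99.5, "timestamp": 1705312800000}
print(resolve(old_event, V2_FIELDS))
```

Had `email` been added as a required field without a default, the same old event would raise, which is the kind of change a registry configured for backward compatibility is meant to reject.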

---

**3. Spark Structured Streaming Consumer**

```python
# streaming_consumer.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from delta.tables import DeltaTable

# spark.streaming.kafka.* settings belong to the legacy DStreams API and are not used by
# Structured Streaming; the Kafka source is tuned through reader options instead.
spark = SparkSession.builder \
    .appName("EcommerceStreamingPipeline") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.shuffle.partitions", "12") \
    .getOrCreate()

# Event schema
schema = StructType([
    StructField("transaccion_id", StringType()),
    StructField("cliente_id", IntegerType()),
    StructField("email", StringType()),
    StructField("productos", ArrayType(StringType())),
    StructField("total", DoubleType()),
    StructField("metodo_pago", StringType()),
    StructField("tarjeta_ultimos_4", StringType()),
    StructField("region", StringType()),
    StructField("timestamp", StringType()),
    StructField("metadata", StructType([
        StructField("ip", StringType()),
        StructField("user_agent", StringType())
    ]))
])

# Read from Kafka
raw_stream = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "ecommerce.ventas.v1") \
    .option("startingOffsets", "latest") \
    .option("maxOffsetsPerTrigger", 10000) \
    .option("failOnDataLoss", "false") \
    .load()

# Parse JSON
parsed_stream = raw_stream \
    .select(
        col("key").cast("string").alias("kafka_key"),
        from_json(col("value").cast("string"), schema).alias("data"),
        col("topic"),
        col("partition"),
        col("offset"),
        col("timestamp").alias("kafka_timestamp")
    ) \
    .select("data.*", "kafka_key", "topic", "partition", "offset", "kafka_timestamp")

# Data Quality: Filter invalid records
validated_stream = parsed_stream \
    .filter(col("transaccion_id").isNotNull()) \
    .filter(col("cliente_id") > 0) \
    .filter(col("total") >= 0) \
    .filter(col("timestamp").isNotNull())

# Enrich: Parse timestamp
enriched_stream = validated_stream \
    .withColumn("processed_at", current_timestamp()) \
    .withColumn("event_date", to_date(col("timestamp"))) \
    .withColumn("event_hour", hour(col("timestamp")))

# PII Masking (GDPR compliance)
masked_stream = enriched_stream \
    .withColumn("email_masked", 
        regexp_replace(col("email"), r"(?<=.{2}).(?=[^@]*?@)", "*")
    ) \
    .withColumn("tarjeta_masked", 
        concat(lit("****-"), col("tarjeta_ultimos_4"))
    ) \
    .drop("email", "tarjeta_ultimos_4")

# Write to Delta Lake (curated layer)
query = masked_stream \
    .writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://ecommerce-data/checkpoints/ventas/") \
    .option("path", "s3://ecommerce-data/curated/ventas/") \
    .partitionBy("event_date", "region") \
    .trigger(processingTime="30 seconds") \
    .start()

query.awaitTermination()
```
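The masking expressions are easy to check outside Spark. This sketch applies the same regular expression with Python's `re` module; `mask_email` and `mask_card` are hypothetical helper names that mirror the two `withColumn` calls above.

```python
import re

# Same pattern as the Spark job: keep the first two characters of the local part,
# then mask every remaining character up to the "@"
EMAIL_MASK = re.compile(r"(?<=.{2}).(?=[^@]*?@)")


def mask_email(email: str) -> str:
    return EMAIL_MASK.sub("*", email)


def mask_card(last4: str) -> str:
    return f"****-{last4}"


print(mask_email("johndoe@example.com"))
print(mask_email("ab@x.com"))
print(mask_card("4242"))
```

Note the edge case in the second call: a local part of one or two characters is left unmasked, because the pattern only masks from the third character on. If that is not acceptable for compliance, hashing the local part is a stricter alternative.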

**Exactly-Once Semantics:**

```python
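# The checkpoint makes restarts idempotent:
# - Kafka offsets are stored in the checkpoint
# - If the job fails, it restarts from the last committed offsets
# - Delta Lake's transactional writes prevent a replayed micro-batch from being committed twice
# Duplicates sent by the producer itself (e.g. on retries) are not removed by this mechanism.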

# Verification query
spark.sql("""
    SELECT 
        event_date,
        COUNT(DISTINCT transaccion_id) as unique_transactions,
        COUNT(*) as total_rows
    FROM delta.`s3://ecommerce-data/curated/ventas/`
    GROUP BY event_date
    HAVING COUNT(*) != COUNT(DISTINCT transaccion_id)
""").show()
# Any row returned means duplicates for that date → investigate
```

---

**4. Stateful Aggregations: Windowing**

```python
# Real-time aggregations with watermarking.
# Window and watermark on event time (when the sale happened), not on processing time,
# so late events are assigned to the correct window.
windowed_agg = masked_stream \
    .withColumn("event_time", to_timestamp(col("timestamp"))) \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(
        window("event_time", "5 minutes"),
        "region",
        "metodo_pago"
    ) \
    .agg(
        count("*").alias("num_transacciones"),
        sum("total").alias("revenue_total"),
        avg("total").alias("ticket_promedio"),
        approx_count_distinct("cliente_id").alias("clientes_unicos")
    )

# Write aggregations to the gold layer.
# The Delta sink supports "append" and "complete" output modes, not "update";
# with "append", each window is written once the watermark has passed its end.
windowed_query = windowed_agg \
    .writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://ecommerce-data/checkpoints/ventas_agg/") \
    .option("path", "s3://ecommerce-data/gold/ventas_realtime/") \
    .trigger(processingTime="1 minute") \
    .start()
```
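The interaction between tumbling windows and the watermark can be simulated in a few lines of plain Python. This is a simplified model, not Spark's implementation: it advances the watermark after every event, whereas Spark updates it once per micro-batch. The constants mirror the 5-minute window and 10-minute watermark used above.

```python
from collections import defaultdict

WINDOW_S = 5 * 60       # 5-minute tumbling windows
WATERMARK_S = 10 * 60   # tolerate events up to 10 minutes late


def aggregate(events):
    """events: iterable of (event_time_s, amount) in arrival order.

    Returns ({window_start_s: total_amount}, dropped_count).
    """
    totals = defaultdict(float)
    max_event_time = 0
    dropped = 0
    for event_time, amount in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - WATERMARK_S
        window_start = event_time - event_time % WINDOW_S
        if window_start + WINDOW_S <= watermark:
            dropped += 1  # the window is already finalized: the event is too late
            continue
        totals[window_start] += amount
    return dict(totals), dropped


# The last event belongs to window [0, 300) but arrives after an event at t=1300,
# which has already pushed the watermark to t=700.
events = [(0, 10.0), (100, 5.0), (400, 7.0), (1300, 1.0), (50, 99.0)]
totals, dropped = aggregate(events)
print(totals, dropped)
```

A longer watermark drops fewer late events but delays when windows are finalized and keeps more state in memory, so the 10-minute value is a business trade-off rather than a technical constant.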

**Monitoring Dashboard Query:**

```sql
-- Athena query for the near-real-time dashboard
SELECT 
    window.start as window_start,
    region,
    num_transacciones,
    revenue_total,
    ticket_promedio
FROM gold.ventas_realtime
WHERE window.start >= current_timestamp - interval '1' hour
ORDER BY window.start DESC, revenue_total DESC
```

---

**5. Error Handling & Dead Letter Queue**

```python
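from pyspark.sql.functions import col, current_date, from_json

def write_to_dlq(df, error_type):
    """Write problematic records to the Dead Letter Queue"""
    df.withColumn("error_date", current_date()) \
        .write \
        .format("parquet") \
        .mode("append") \
        .partitionBy("error_date") \
        .save(f"s3://ecommerce-data/dlq/error_type={error_type}/")

# from_json does not raise on malformed input: it yields NULL (or a struct of NULL fields).
# Keep the raw payload next to the parsed result and route unparseable rows to the DLQ.
parsed_with_raw = raw_stream.select(
    col("value").cast("string").alias("raw_value"),
    from_json(col("value").cast("string"), schema).alias("data")
)

def route_batch(df, epoch_id):
    bad = df.filter(col("data").isNull() | col("data.transaccion_id").isNull())
    write_to_dlq(bad.select("raw_value"), "schema_error")

# The checkpoint path below is an example location
dlq_query = parsed_with_raw.writeStream \
    .foreachBatch(route_batch) \
    .option("checkpointLocation", "s3://ecommerce-data/checkpoints/ventas_dlq/") \
    .start()

# Alert on DLQ growth (error_type is read from the directory name)
spark.sql("""
    SELECT 
        error_type,
        COUNT(*) as error_count
    FROM parquet.`s3://ecommerce-data/dlq/`
    WHERE error_date = current_date()
    GROUP BY error_type
    HAVING COUNT(*) > 100  -- Alert threshold
""").show()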
```
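The routing logic behind a dead letter queue can be tested locally before wiring it into Spark. The sketch below splits raw JSON strings into valid records and dead letters. `route` and `REQUIRED_FIELDS` are hypothetical names, and the required fields are taken from the validation filters used in the streaming consumer.

```python
import json

REQUIRED_FIELDS = ("transaccion_id", "cliente_id", "total", "timestamp")


def route(raw_messages):
    """Split raw JSON strings into (valid_records, dead_letters)."""
    valid, dead = [], []
    for raw in raw_messages:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            dead.append({"raw": raw, "error_type": "malformed_json", "detail": str(exc)})
            continue
        if not isinstance(record, dict):
            dead.append({"raw": raw, "error_type": "schema_error", "detail": "not a JSON object"})
            continue
        missing = [f for f in REQUIRED_FIELDS if record.get(f) is None]
        if missing:
            dead.append({"raw": raw, "error_type": "schema_error", "detail": f"missing: {missing}"})
        else:
            valid.append(record)
    return valid, dead


messages = [
    '{"transaccion_id": "t-1", "cliente_id": 1234, "total": 50.0, "timestamp": "2024-01-15T10:00:00"}',
    '{"transaccion_id": "t-2"}',
    'not json',
]
valid, dead = route(messages)
print(len(valid), [d["error_type"] for d in dead])
```

Keeping the original payload in each dead letter is what makes later replay possible once the producer or the schema is fixed.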

---

**6. Performance Tuning**

```python
# Key optimizations
# AQE mainly benefits batch jobs; Spark 3.x disables it inside Structured Streaming queries
spark.conf.set("spark.sql.adaptive.enabled", "true")  # AQE
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Executor memory is fixed at launch, so it cannot be changed with spark.conf.set.
# Pass it at submit time instead:
#   --conf spark.executor.memory=8g --conf spark.executor.memoryOverhead=2g

# Efficient state backend; set this before the streaming query starts
spark.conf.set("spark.sql.streaming.stateStore.providerClass", 
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider"
)

# Kafka: Structured Streaming is tuned through source options such as
# maxOffsetsPerTrigger and minPartitions; spark.streaming.kafka.* only affects DStreams

# Delta Lake optimizations
spark.sql("OPTIMIZE delta.`s3://ecommerce-data/curated/ventas/` ZORDER BY (cliente_id)")
spark.sql("VACUUM delta.`s3://ecommerce-data/curated/ventas/` RETAIN 168 HOURS")  # 7 days
```

**Metrics Collection:**

```python
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server
from pyspark.sql import DataFrame

# Prometheus metrics (exposed from the driver process)
records_processed = Counter('streaming_records_processed_total', 'Records processed')
batch_duration = Histogram('streaming_batch_duration_seconds', 'Batch processing time')
current_lag = Gauge('streaming_kafka_lag', 'Kafka consumer lag')  # to be set from query progress

start_http_server(8000)  # scrape endpoint; the port is an example

def process_batch(df: DataFrame, epoch_id: int):
    start = time.time()
    
    df.persist()  # avoid recomputing the batch for both the write and the count
    try:
        # Write logic
        df.write.format("delta").mode("append").save("s3://...")
        n_records = df.count()
    finally:
        df.unpersist()
    
    # Update metrics
    elapsed = time.time() - start
    records_processed.inc(n_records)
    batch_duration.observe(elapsed)
    
    # Log progress
    print(f"Epoch {epoch_id}: {n_records} records in {elapsed:.2f}s")

query = parsed_stream \
    .writeStream \
    .foreachBatch(process_batch) \
    .start()
```

---

**7. Deployment: EMR Serverless**

```python
# deploy_streaming.py
import boto3

emr_serverless = boto3.client('emr-serverless', region_name='us-east-1')

# Create application
app_response = emr_serverless.create_application(
    name='ecommerce-streaming',
    releaseLabel='emr-6.12.0',
    type='SPARK',
    autoStartConfiguration={'enabled': True},
    autoStopConfiguration={
        'enabled': True,
        'idleTimeoutMinutes': 15
    },
    initialCapacity={
        'DRIVER': {
            'workerCount': 1,
            'workerConfiguration': {
                'cpu': '2 vCPU',
                'memory': '8 GB'
            }
        },
        'EXECUTOR': {
            'workerCount': 10,
            'workerConfiguration': {
                'cpu': '4 vCPU',
                'memory': '16 GB',
                'disk': '100 GB'
            }
        }
    },
    maximumCapacity={
        'cpu': '200 vCPU',
        'memory': '800 GB'
    }
)

# Submit streaming job
job_response = emr_serverless.start_job_run(
    applicationId=app_response['applicationId'],
    executionRoleArn='arn:aws:iam::123456:role/EMRServerlessRole',
    jobDriver={
        'sparkSubmit': {
            'entryPoint': 's3://ecommerce-code/streaming_consumer.py',
            'sparkSubmitParameters': (
                '--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension '
                '--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog '
                '--packages io.delta:delta-core_2.12:2.4.0,org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0'
            )
        }
    }
)

print(f"Job started: {job_response['jobRunId']}")
# Cost: ~$0.052/vCPU-hour + ~$0.0065/GB-hour (verify against current regional pricing)
# Estimate: ~$1,500/month (24/7 with auto-scaling)
```
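The monthly estimate can be sanity-checked from the rates and the initial capacity in the deployment script. This sketch assumes 730 hours per month, ignores storage charges, and takes the per-hour rates quoted above at face value; `monthly_cost` is a hypothetical helper.

```python
HOURS_PER_MONTH = 730
VCPU_HOUR_USD = 0.052    # rate quoted in the deployment script
GB_HOUR_USD = 0.0065     # rate quoted in the deployment script


def monthly_cost(workers: dict, utilization: float = 1.0) -> float:
    """workers maps a name to (count, vcpu_each, memory_gb_each)."""
    vcpus = sum(count * vcpu for count, vcpu, _ in workers.values())
    memory_gb = sum(count * mem for count, _, mem in workers.values())
    hourly = vcpus * VCPU_HOUR_USD + memory_gb * GB_HOUR_USD
    return hourly * HOURS_PER_MONTH * utilization


# Initial capacity from create_application: 1 driver + 10 executors
initial = {"driver": (1, 2, 8), "executor": (10, 4, 16)}

full = monthly_cost(initial)
implied_utilization = 1500 / full

print(f"initial capacity, 24/7: ${full:,.2f}/month")
print(f"utilization implied by a $1,500 estimate: {implied_utilization:.0%}")
```

Running the initial capacity around the clock costs noticeably more than $1,500 under these rates, so that estimate only holds if auto-scaling keeps average usage at roughly 60% of the initial capacity.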

---


### 🔄 **Batch Path: Airflow + Great Expectations + DataHub**

**1. Airflow DAG: Orchestration Best Practices**

```python
# dags/ventas_batch_pipeline.py
from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.amazon.aws.operators.emr import EmrServerlessStartJobOperator
from airflow.sensors.external_task import ExternalTaskSensor
from airflow.operators.python import BranchPythonOperator
from datetime import datetime, timedelta
import great_expectations as gx
from datahub.emitter.rest_emitter import DatahubRestEmitter

default_args = {
    'owner': 'data-platform',
    'depends_on_past': False,
    'email': ['data-team@ecommerce.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'retry_exponential_backoff': True,
    'max_retry_delay': timedelta(minutes=30),
    'sla': timedelta(hours=2),  # SLA: pipeline completes within 2h
}

@dag(
    dag_id='ventas_batch_pipeline_v1',
    default_args=default_args,
    description='Daily batch pipeline: SFTP → validate → transform → gold',
    schedule='0 2 * * *',  # 2 AM UTC daily
    start_date=datetime(2024, 1, 1),
    catchup=False,
    max_active_runs=1,
    tags=['ecommerce', 'ventas', 'batch', 'production'],
    doc_md="""
    # Ventas Batch Pipeline
    
    ## Purpose
    Daily ETL for historical sales data and aggregations.
    
    ## SLA
    - Completion: 2 hours
    - Data freshness: D+1 (next day at 4 AM)
    
    ## Dependencies
    - Upstream: SFTP files from SAP (uploaded at 1 AM)
    - Downstream: BI dashboards (Tableau refresh at 6 AM)
    
    ## Contacts
    - Owner: Data Platform Team
    - On-call: data-oncall@ecommerce.com
    """
)
def ventas_batch_pipeline():
    
    @task(task_id='check_sftp_files')
    def check_sftp_files(ds):
        """Check that the expected SFTP files exist"""
        from airflow.providers.sftp.hooks.sftp import SFTPHook
        
        sftp = SFTPHook(sftp_conn_id='sap_sftp')
        expected_files = [
            f'/exports/ventas_{ds}.csv',
            f'/exports/clientes_{ds}.csv',
            f'/exports/productos_{ds}.csv'
        ]
        
        missing = []
        for file_path in expected_files:
            if not sftp.path_exists(file_path):
                missing.append(file_path)
        
        if missing:
            raise FileNotFoundError(f"Missing files: {missing}")
        
        # File sizes come from stat() on the underlying paramiko SFTP client
        client = sftp.get_conn()
        return {
            'files': expected_files,
            'total_size': sum(client.stat(f).st_size for f in expected_files)
        }
    
    @task(task_id='download_to_s3')
    def download_to_s3(file_info, ds):
        """Download files from SFTP to the S3 raw layer"""
        from airflow.providers.sftp.hooks.sftp import SFTPHook
        
        sftp = SFTPHook(sftp_conn_id='sap_sftp')
        s3 = S3Hook(aws_conn_id='aws_default')
        
        for file_path in file_info['files']:
            local_path = f'/tmp/{file_path.split("/")[-1]}'
            s3_key = f'raw/ventas/dt={ds}/{file_path.split("/")[-1]}'
            
            # Download from SFTP
            sftp.retrieve_file(file_path, local_path)
            
            # Upload to S3
            s3.load_file(
                filename=local_path,
                key=s3_key,
                bucket_name='ecommerce-data',
                replace=True
            )
            
            print(f"✅ Uploaded {file_path} → s3://ecommerce-data/{s3_key}")
        
        return f's3://ecommerce-data/raw/ventas/dt={ds}/'
    
    @task(task_id='run_data_quality_checks')
    def run_data_quality_checks(s3_path, ds):
        """Great Expectations validations"""
        context = gx.get_context()
        
        # Data source (S3)
        datasource = context.sources.add_pandas_s3(
            name="s3_datasource",
            bucket="ecommerce-data",
            boto3_options={"region_name": "us-east-1"}
        )
        
        # Expectations
        expectations = [
            # Completeness
            {
                'expectation_type': 'expect_column_values_to_not_be_null',
                'kwargs': {'column': 'transaccion_id'}
            },
            {
                'expectation_type': 'expect_column_values_to_not_be_null',
                'kwargs': {'column': 'cliente_id'}
            },
            # Validity
            {
                'expectation_type': 'expect_column_values_to_be_between',
                'kwargs': {'column': 'total', 'min_value': 0, 'max_value': 100000}
            },
            # Uniqueness
            {
                'expectation_type': 'expect_column_values_to_be_unique',
                'kwargs': {'column': 'transaccion_id'}
            },
            # Freshness
            {
                'expectation_type': 'expect_column_max_to_be_between',
                'kwargs': {
                    'column': 'fecha_transaccion',
                    'min_value': ds,
                    'max_value': ds
                }
            },
            # Referential integrity
            {
                'expectation_type': 'expect_column_values_to_be_in_set',
                'kwargs': {
                    'column': 'region',
                    'value_set': ['LATAM', 'NA', 'EU', 'APAC']
                }
            },
            # Volume check
            {
                'expectation_type': 'expect_table_row_count_to_be_between',
                'kwargs': {
                    'min_value': 50000,  # At least 50K transactions
                    'max_value': 1000000
                }
            }
        ]
        
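        # Create suite (assumes the GX 0.17/0.18 fluent API; add_or_update keeps reruns idempotent)
        from great_expectations.core.expectation_configuration import ExpectationConfiguration
        
        suite = context.add_or_update_expectation_suite(
            expectation_suite_name=f"ventas_suite_{ds}"
        )
        
        # add_expectation expects an ExpectationConfiguration, not raw keyword arguments
        for exp in expectations:
            suite.add_expectation(ExpectationConfiguration(**exp))
        
        # Validate: register the day's CSV as an asset, then build a batch request from it
        asset = datasource.add_csv_asset(
            name=f"ventas_{ds}",
            batching_regex=rf"ventas_{ds}\.csv",
            s3_prefix=f"raw/ventas/dt={ds}/"
        )
        validator = context.get_validator(
            batch_request=asset.build_batch_request(),
            expectation_suite=suite
        )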
        
        results = validator.validate()
        
        if not results.success:
            # Log failures
            failures = [r for r in results.results if not r.success]
            print(f"❌ {len(failures)} validation failures:")
            for failure in failures:
                print(f"  - {failure.expectation_config.expectation_type}: {failure.result}")
            
            raise ValueError(f"Data quality checks failed: {len(failures)} issues")
        
        print(f"✅ All {len(expectations)} validation checks passed")
        return results.to_json_dict()
    
    @task.branch(task_id='check_validation_results')
    def check_validation_results(validation_results):
        """Branch based on validation results"""
        if validation_results['success']:
            return 'transform_to_curated'
        else:
            return 'send_failure_alert'
    
    @task(task_id='transform_to_curated')
    def transform_to_curated(s3_path, ds):
        """Spark job: raw → curated (cleaned + enriched)"""
        from pyspark.sql import SparkSession
        # `import *` inside a function is a SyntaxError in Python 3; use a namespaced import
        from pyspark.sql import functions as F
        
        spark = SparkSession.builder.appName("VentasCurated").getOrCreate()
        
        # Read raw (s3_path already ends with a slash)
        raw_df = spark.read.csv(f"{s3_path.rstrip('/')}/ventas_{ds}.csv", header=True, inferSchema=True)
        
        # Transformations: drop non-positive totals, add date parts, mask PII
        curated_df = raw_df \
            .filter(F.col("total") > 0) \
            .withColumn("fecha_procesamiento", F.current_date()) \
            .withColumn("year", F.year(F.col("fecha_transaccion"))) \
            .withColumn("month", F.month(F.col("fecha_transaccion"))) \
            .withColumn("email_masked", 
                F.regexp_replace(F.col("email"), r"(?<=.{2}).(?=[^@]*?@)", "*")
            ) \
            .withColumn("tarjeta_masked", 
                F.concat(F.lit("****-"), F.substring(F.col("tarjeta_numero"), -4, 4))
            ) \
            .drop("email", "tarjeta_numero")
        
        # Enrich with dimension tables
        clientes_df = spark.read.format("delta").load("s3://ecommerce-data/curated/clientes/")
        productos_df = spark.read.format("delta").load("s3://ecommerce-data/curated/productos/")
        
        enriched_df = curated_df \
            .join(clientes_df, "cliente_id", "left") \
            .join(productos_df, "producto_id", "left")
        
        # Write to Delta (curated)
        enriched_df.write \
            .format("delta") \
            .mode("overwrite") \
            .partitionBy("year", "month", "region") \
            .option("overwriteSchema", "true") \
            .save(f"s3://ecommerce-data/curated/ventas/dt={ds}/")
        
        # Optimize
        spark.sql(f"""
            OPTIMIZE delta.`s3://ecommerce-data/curated/ventas/dt={ds}/`
            ZORDER BY (cliente_id, producto_id)
        """)
        
        return {
            'rows_processed': enriched_df.count(),
            'partitions_written': enriched_df.select("region").distinct().count()
        }
    
    @task(task_id='create_gold_aggregations')
    def create_gold_aggregations(curated_info, ds):
        """Create gold layer aggregations"""
        from pyspark.sql import SparkSession
        # `import *` inside a function is a SyntaxError in Python 3; the namespaced
        # import also avoids shadowing the built-in sum() and max()
        from pyspark.sql import functions as F
        
        spark = SparkSession.builder.appName("VentasGold").getOrCreate()
        
        df = spark.read.format("delta").load(f"s3://ecommerce-data/curated/ventas/dt={ds}/")
        
        # Aggregation 1: Daily sales by region
        daily_region = df.groupBy("fecha_transaccion", "region") \
            .agg(
                F.count("*").alias("num_transacciones"),
                F.sum("total").alias("revenue_total"),
                F.avg("total").alias("ticket_promedio"),
                F.countDistinct("cliente_id").alias("clientes_unicos")
            )
        
        daily_region.write \
            .format("delta") \
            .mode("append") \
            .partitionBy("fecha_transaccion") \
            .save("s3://ecommerce-data/gold/ventas_diarias_region/")
        
        # Aggregation 2: Product performance
        product_perf = df.groupBy("producto_id", "producto_nombre") \
            .agg(
                F.count("*").alias("unidades_vendidas"),
                F.sum("total").alias("revenue_total"),
                F.avg("total").alias("precio_promedio")
            ) \
            .orderBy(F.desc("revenue_total"))
        
        product_perf.write \
            .format("delta") \
            .mode("overwrite") \
            .save(f"s3://ecommerce-data/gold/productos_performance/dt={ds}/")
        
        # Aggregation 3: Customer RFM (Recency, Frequency, Monetary)
        customer_rfm = df.groupBy("cliente_id") \
            .agg(
                F.max("fecha_transaccion").alias("ultima_compra"),
                F.count("*").alias("frecuencia"),
                F.sum("total").alias("valor_monetario")
            ) \
            .withColumn("recencia_dias", 
                F.datediff(F.lit(ds), F.col("ultima_compra"))
            ) \
            .withColumn("rfm_score",
                F.col("frecuencia") * 0.3 + F.col("valor_monetario") * 0.5 - F.col("recencia_dias") * 0.2
            )
        
        customer_rfm.write \
            .format("delta") \
            .mode("overwrite") \
            .save(f"s3://ecommerce-data/gold/customer_rfm/dt={ds}/")
        
        return {
            'aggregations_created': 3,
            'gold_path': f"s3://ecommerce-data/gold/"
        }
    
    @task(task_id='update_glue_catalog')
    def update_glue_catalog(gold_info, ds):
        """Update Glue Data Catalog partitions"""
        import boto3
        
        glue = boto3.client('glue', region_name='us-east-1')
        
        tables = [
            ('ecommerce', 'ventas_diarias_region'),
            ('ecommerce', 'productos_performance'),
            ('ecommerce', 'customer_rfm')
        ]
        
        for database, table in tables:
            try:
                # Add partition
                glue.create_partition(
                    DatabaseName=database,
                    TableName=table,
                    PartitionInput={
                        'Values': [ds],
                        'StorageDescriptor': {
                            'Location': f"s3://ecommerce-data/gold/{table}/dt={ds}/",
                            # Input/output formats must match the Parquet SerDe below
                            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
                            'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
                            'SerdeInfo': {
                                'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
                            }
                        }
                    }
                )
                print(f"✅ Partition added: {database}.{table} dt={ds}")
            except glue.exceptions.AlreadyExistsException:
                print(f"⚠️ Partition already exists: {database}.{table} dt={ds}")
    
    @task(task_id='emit_lineage_to_datahub')
    def emit_lineage_to_datahub(gold_info, ds):
        """Emit data lineage to DataHub"""
        from datahub.emitter.mcp import MetadataChangeProposalWrapper
        from datahub.emitter.rest_emitter import DatahubRestEmitter
        from datahub.metadata.schema_classes import DatasetLineageTypeClass, UpstreamClass, UpstreamLineageClass
        
        emitter = DatahubRestEmitter('http://datahub:8080')
        
        # Define lineage: raw → curated → gold
        lineage_map = {
            f"urn:li:dataset:(urn:li:dataPlatform:s3,ecommerce-data.curated.ventas.dt={ds},PROD)": [
                f"urn:li:dataset:(urn:li:dataPlatform:s3,ecommerce-data.raw.ventas.dt={ds},PROD)"
            ],
            f"urn:li:dataset:(urn:li:dataPlatform:s3,ecommerce-data.gold.ventas_diarias_region,PROD)": [
                f"urn:li:dataset:(urn:li:dataPlatform:s3,ecommerce-data.curated.ventas.dt={ds},PROD)"
            ]
        }
        
        for downstream_urn, upstream_urns in lineage_map.items():
            upstreams = [
                UpstreamClass(
                    dataset=urn,
                    type=DatasetLineageTypeClass.TRANSFORMED
                )
                for urn in upstream_urns
            ]
            
            lineage = UpstreamLineageClass(upstreams=upstreams)
            
            emitter.emit_mcp(
                MetadataChangeProposalWrapper(
                    entityUrn=downstream_urn,
                    aspect=lineage
                )
            )
            
            print(f"✅ Lineage emitted: {downstream_urn}")
    
    @task(task_id='export_metrics_to_prometheus')
    def export_metrics_to_prometheus(curated_info, gold_info, ds):
        """Push pipeline metrics to Prometheus Pushgateway"""
        from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
        
        registry = CollectorRegistry()
        
        # Define metrics
        rows_processed = Gauge('airflow_ventas_rows_processed', 'Rows processed', registry=registry)
        pipeline_duration = Gauge('airflow_ventas_pipeline_duration_seconds', 'Pipeline duration', registry=registry)
        
        rows_processed.set(curated_info['rows_processed'])
        # Duration is tracked by Airflow itself; pipeline_duration is not set in this task
        
        push_to_gateway('prometheus-pushgateway:9091', job='airflow_ventas_batch', registry=registry)
    
    @task(task_id='send_success_notification')
    def send_success_notification(ds, **context):
        """Send Slack notification on success"""
        from airflow.providers.slack.hooks.slack_webhook import SlackWebhookHook
        
        slack = SlackWebhookHook(slack_webhook_conn_id='slack_data_notifications')
        
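        # ti.duration is None while this task is still running; measure from the DAG run start
        from airflow.utils import timezone
        execution_time = (timezone.utcnow() - context['dag_run'].start_date).total_seconds()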
        
        message = f"""
        ✅ *Ventas Batch Pipeline Succeeded*
        
        • Date: {ds}
        • Duration: {execution_time:.0f}s
        • Rows: {context['ti'].xcom_pull(task_ids='transform_to_curated')['rows_processed']:,}
        • Dashboard: <https://grafana.ecommerce.com/d/ventas|View Metrics>
        """
        
        slack.send_text(message)
    
    @task(task_id='send_failure_alert', trigger_rule='one_failed')
    def send_failure_alert(ds, **context):
        """Send alert on failure"""
        from airflow.providers.slack.hooks.slack_webhook import SlackWebhookHook
        
        slack = SlackWebhookHook(slack_webhook_conn_id='slack_data_alerts')
        
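        # context['ti'] is this alert task itself; look up the task that actually failed
        from airflow.utils.state import TaskInstanceState
        failed = context['dag_run'].get_task_instances(state=[TaskInstanceState.FAILED])
        failed_task = failed[0].task_id if failed else 'unknown'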
        
        message = f"""
        🚨 *ALERT: Ventas Batch Pipeline Failed*
        
        • Date: {ds}
        • Failed Task: {failed_task}
        • Logs: <https://airflow.ecommerce.com/log?dag_id=ventas_batch_pipeline_v1&task_id={failed_task}|View Logs>
        • Runbook: <https://wiki.ecommerce.com/runbooks/ventas-pipeline|Troubleshooting Guide>
        
        @data-oncall please investigate
        """
        
        slack.send_text(message)
    
    # Define task dependencies
    file_info = check_sftp_files()
    s3_path = download_to_s3(file_info)
    validation_results = run_data_quality_checks(s3_path)
    branch = check_validation_results(validation_results)
    
    curated_info = transform_to_curated(s3_path)
    gold_info = create_gold_aggregations(curated_info)
    
    # Keep the returned task references: the decorated functions cannot be used with >>
    glue = update_glue_catalog(gold_info)
    lineage = emit_lineage_to_datahub(gold_info)
    metrics = export_metrics_to_prometheus(curated_info, gold_info)
    
    success = send_success_notification()
    failure = send_failure_alert()
    
    # Dependencies (curated → gold → publishers is already implied by the arguments above)
    branch >> [curated_info, failure]
    [glue, lineage, metrics] >> success

dag = ventas_batch_pipeline()
```
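
The PII masking in `transform_to_curated` is easier to reason about outside Spark. The sketch below applies the same email pattern and the same "last four digits" rule with Python's `re` module, used here as a stand-in for Spark's Java regex engine on the assumption that both behave the same for this pattern. The helper names `mask_email` and `mask_card` are hypothetical and not part of the pipeline.

```python
import re

# Same pattern as the Spark regexp_replace call: mask every character that has
# at least two characters before it and an "@" somewhere after it.
EMAIL_MASK = re.compile(r"(?<=.{2}).(?=[^@]*?@)")


def mask_email(email: str) -> str:
    """Keep the first two characters of the local part and the whole domain."""
    return EMAIL_MASK.sub("*", email)


def mask_card(card_number: str) -> str:
    """Keep only the last four characters, like concat('****-', substring(col, -4, 4))."""
    return "****-" + card_number[-4:]


if __name__ == "__main__":
    print(mask_email("juan.perez@mail.com"))  # ju********@mail.com
    print(mask_card("4111111111111111"))      # ****-1111
```

Note that the domain is left untouched and that local parts of two characters or fewer are not masked at all, which may or may not be acceptable for a given compliance policy.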

---

**2. Airflow Production Deployment: MWAA**

```python
# deploy_airflow.py
import boto3

mwaa = boto3.client('mwaa', region_name='us-east-1')

# Create MWAA environment
response = mwaa.create_environment(
    Name='ecommerce-airflow-prod',
    ExecutionRoleArn='arn:aws:iam::123456:role/MWAAExecutionRole',
    SourceBucketArn='arn:aws:s3:::ecommerce-airflow-bucket',
    DagS3Path='dags/',
    PluginsS3Path='plugins.zip',
    RequirementsS3Path='requirements.txt',
    NetworkConfiguration={
        'SubnetIds': ['subnet-123', 'subnet-456'],
        'SecurityGroupIds': ['sg-789']
    },
    LoggingConfiguration={
        'DagProcessingLogs': {'LogLevel': 'INFO', 'Enabled': True},
        'SchedulerLogs': {'LogLevel': 'INFO', 'Enabled': True},
        'TaskLogs': {'LogLevel': 'INFO', 'Enabled': True},
        'WorkerLogs': {'LogLevel': 'INFO', 'Enabled': True},
        'WebserverLogs': {'LogLevel': 'INFO', 'Enabled': True}
    },
    AirflowVersion='2.7.2',
    EnvironmentClass='mw1.medium',  # environment size; worker count is set by MinWorkers/MaxWorkers
    MaxWorkers=10,
    MinWorkers=2,
    Schedulers=2,  # HA
    AirflowConfigurationOptions={
        'core.parallelism': '32',
        'core.max_active_runs_per_dag': '3',
        'scheduler.catchup_by_default': 'False',
        'webserver.expose_config': 'True'
    },
    Tags={
        'Environment': 'production',
        'CostCenter': 'data-platform'
    }
)

# Cost: ~$1,000/month (rough estimate for mw1.medium with 2-10 workers)
```
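
`create_environment` returns as soon as the request is accepted, while provisioning continues in the background. A deploy script would therefore usually poll until the environment is usable. The following is a minimal sketch of such a loop, assuming the `get_environment` response shape (`Environment.Status`); the function name `wait_until_available` and the polling defaults are illustrative choices, not part of the original script.

```python
import time

# Statuses treated as unrecoverable for a fresh deployment (an assumption of this sketch).
TERMINAL_FAILURES = {"CREATE_FAILED", "UNAVAILABLE", "DELETED"}


def wait_until_available(client, name, poll_seconds=60, max_polls=90, sleep=time.sleep):
    """Poll get_environment until the environment reports AVAILABLE.

    `client` is anything with a boto3-style get_environment(Name=...) method,
    which also makes the loop easy to exercise with a stub.
    """
    for _ in range(max_polls):
        status = client.get_environment(Name=name)["Environment"]["Status"]
        if status == "AVAILABLE":
            return status
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"Environment {name} ended in status {status}")
        sleep(poll_seconds)
    raise TimeoutError(f"Environment {name} not available after {max_polls} polls")
```

With the client from the script above, the call would be `wait_until_available(mwaa, 'ecommerce-airflow-prod')`.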

---

**Author:** Luis J. Raigoso V. (LJRV)

### 🚀 **Operationalization and SRE: Production Readiness**

**1. Infrastructure as Code: Terraform**

```hcl
# infrastructure/main.tf
terraform {
  required_version = ">= 1.5"
  
  backend "s3" {
    bucket = "ecommerce-terraform-state"
    key    = "data-platform/terraform.tfstate"
    region = "us-east-1"
    encrypt = true
    dynamodb_table = "terraform-state-lock"
  }
  
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.region
  
  default_tags {
    tags = {
      Project     = "data-platform"
      ManagedBy   = "terraform"
      Environment = var.environment
      CostCenter  = "data-engineering"
    }
  }
}

# S3 Buckets with lifecycle policies
module "data_lake" {
  source = "./modules/s3-datalake"
  
  bucket_name = "ecommerce-data-${var.environment}"
  
  lifecycle_rules = [
    {
      id     = "raw-retention"
      prefix = "raw/"
      transitions = [
        { days = 7, storage_class = "INTELLIGENT_TIERING" },
        { days = 30, storage_class = "GLACIER" }
      ]
      expiration_days = 90
    },
    {
      id     = "curated-retention"
      prefix = "curated/"
      transitions = [
        { days = 90, storage_class = "INTELLIGENT_TIERING" }
      ]
      expiration_days = 365
    }
  ]
  
  versioning_enabled = true
  replication_config = {
    enabled = true
    destination_bucket = "ecommerce-data-replica-eu-west-1"
  }
  
  kms_key_id = module.kms.key_id
}

# MSK (Managed Kafka)
module "msk" {
  source = "./modules/msk"
  
  cluster_name    = "ecommerce-kafka-${var.environment}"
  kafka_version   = "3.5.1"
  broker_instance = "kafka.m5.large"
  broker_count    = 3
  
  ebs_volume_size = 1000  # GB per broker
  
  encryption_in_transit = true
  encryption_at_rest    = true
  
  monitoring_level = "PER_TOPIC_PER_PARTITION"
  
  subnets            = module.vpc.private_subnet_ids
  security_group_ids = [module.security_groups.msk_sg_id]
}

# EMR Serverless
module "emr_serverless" {
  source = "./modules/emr-serverless"
  
  application_name = "ecommerce-spark-${var.environment}"
  
  initial_capacity = {
    driver = {
      count = 1
      cpu   = "2 vCPU"
      memory = "8 GB"
    }
    executor = {
      count = 5
      cpu   = "4 vCPU"
      memory = "16 GB"
      disk   = "100 GB"
    }
  }
  
  maximum_capacity = {
    cpu    = "200 vCPU"
    memory = "800 GB"
    disk   = "2000 GB"
  }
  
  auto_stop_idle_timeout = 15  # minutes
  
  execution_role_arn = module.iam.emr_execution_role_arn
}

# MWAA (Managed Airflow)
module "mwaa" {
  source = "./modules/mwaa"
  
  environment_name = "ecommerce-airflow-${var.environment}"
  
  airflow_version    = "2.7.2"
  environment_class  = "mw1.medium"
  min_workers        = 2
  max_workers        = 10
  schedulers         = 2  # HA
  
  dag_s3_path          = "dags/"
  plugins_s3_path      = "plugins.zip"
  requirements_s3_path = "requirements.txt"
  
  logging_configuration = {
    dag_processing_logs = { enabled = true, log_level = "INFO" }
    scheduler_logs      = { enabled = true, log_level = "INFO" }
    task_logs           = { enabled = true, log_level = "INFO" }
    worker_logs         = { enabled = true, log_level = "INFO" }
    webserver_logs      = { enabled = true, log_level = "INFO" }
  }
  
  airflow_configuration_options = {
    "core.parallelism"                 = "32"
    "core.max_active_runs_per_dag"     = "3"
    "scheduler.catchup_by_default"     = "False"
    "secrets.backend"                  = "airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend"
  }
  
  subnets            = module.vpc.private_subnet_ids
  security_group_ids = [module.security_groups.mwaa_sg_id]
  
  execution_role_arn = module.iam.mwaa_execution_role_arn
}

# RDS for DataHub metadata
module "datahub_rds" {
  source = "./modules/rds"
  
  identifier = "datahub-metadata-${var.environment}"
  
  engine         = "postgres"
  engine_version = "15.3"
  instance_class = "db.r6g.xlarge"
  
  allocated_storage     = 100
  max_allocated_storage = 1000
  storage_encrypted     = true
  kms_key_id            = module.kms.key_id
  
  multi_az = true  # HA
  
  backup_retention_period = 7
  backup_window          = "03:00-04:00"
  maintenance_window     = "mon:04:00-mon:05:00"
  
  performance_insights_enabled = true
  
  subnets            = module.vpc.database_subnet_ids
  security_group_ids = [module.security_groups.rds_sg_id]
}

# ECS for DataHub services
module "datahub_ecs" {
  source = "./modules/ecs-fargate"
  
  cluster_name = "datahub-${var.environment}"
  
  services = {
    datahub-gms = {
      image            = "acryldata/datahub-gms:v0.12.0"
      cpu              = 2048
      memory           = 4096
      desired_count    = 2
      health_check_path = "/health"
      environment_variables = {
        DATAHUB_ANALYTICS_ENABLED = "true"
        ELASTICSEARCH_HOST        = module.elasticsearch.endpoint
      }
    }
    datahub-frontend = {
      image            = "acryldata/datahub-frontend-react:v0.12.0"
      cpu              = 1024
      memory           = 2048
      desired_count    = 2
      health_check_path = "/admin"
    }
  }
  
  subnets            = module.vpc.private_subnet_ids
  security_group_ids = [module.security_groups.ecs_sg_id]
  
  load_balancer = {
    enabled         = true
    certificate_arn = module.acm.certificate_arn
    domain_name     = "datahub.ecommerce.com"
  }
}

# Prometheus & Grafana (ECS)
module "observability" {
  source = "./modules/observability"
  
  cluster_name = "observability-${var.environment}"
  
  prometheus = {
    image         = "prom/prometheus:v2.47.0"
    cpu           = 2048
    memory        = 4096
    desired_count = 1
    storage_size  = 100  # GB (EBS)
    retention     = "30d"
  }
  
  grafana = {
    image         = "grafana/grafana:10.1.0"
    cpu           = 1024
    memory        = 2048
    desired_count = 2
    admin_password = data.aws_secretsmanager_secret_version.grafana_password.secret_string
  }
  
  alertmanager = {
    image         = "prom/alertmanager:v0.26.0"
    cpu           = 512
    memory        = 1024
    desired_count = 2
    slack_webhook = data.aws_secretsmanager_secret_version.slack_webhook.secret_string
  }
}

# Budgets & Cost Alerts
module "cost_management" {
  source = "./modules/cost-management"
  
  budgets = [
    {
      name   = "data-platform-monthly"
      amount = 5000
      threshold_percentages = [50, 80, 100, 120]
      notification_emails = ["data-leads@ecommerce.com"]
    },
    {
      name   = "emr-serverless-daily"
      amount = 150
      time_unit = "DAILY"
      threshold_percentages = [100]
    }
  ]
  
  cost_anomaly_detection = {
    enabled           = true
    monitor_name      = "data-platform-anomalies"
    threshold_dollars = 100
  }
}

# Outputs
output "data_lake_bucket" {
  value = module.data_lake.bucket_name
}

output "msk_bootstrap_brokers" {
  value = module.msk.bootstrap_brokers
}

output "mwaa_webserver_url" {
  value = module.mwaa.webserver_url
}

output "datahub_url" {
  value = "https://${module.datahub_ecs.load_balancer_dns}"
}

output "grafana_url" {
  value = "https://${module.observability.grafana_url}"
}
```
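
Lifecycle rules interact with the minimum storage durations of the colder S3 classes: an object deleted before that minimum is still billed for the remainder. The sketch below checks the two rules from the `data_lake` module against those minimums. The rule dictionaries mirror the Terraform values above; the function name `lifecycle_warnings` and the table of minimums (limited to classes whose minimum is well known) are assumptions of this example.

```python
# Minimum storage duration (days) before deletion stops incurring an early-deletion charge.
MIN_STORAGE_DAYS = {"STANDARD_IA": 30, "GLACIER": 90, "DEEP_ARCHIVE": 180}

RAW_RULE = {
    "id": "raw-retention",
    "transitions": [
        {"days": 7, "storage_class": "INTELLIGENT_TIERING"},
        {"days": 30, "storage_class": "GLACIER"},
    ],
    "expiration_days": 90,
}
CURATED_RULE = {
    "id": "curated-retention",
    "transitions": [{"days": 90, "storage_class": "INTELLIGENT_TIERING"}],
    "expiration_days": 365,
}


def lifecycle_warnings(rule: dict) -> list[str]:
    """Return human-readable warnings for one lifecycle rule."""
    warnings = []
    days = [t["days"] for t in rule["transitions"]]
    if days != sorted(days):
        warnings.append(f"{rule['id']}: transitions are not in ascending order")
    for i, transition in enumerate(rule["transitions"]):
        # An object leaves this class at the next transition, or at expiration.
        leaves_at = days[i + 1] if i + 1 < len(days) else rule["expiration_days"]
        held = leaves_at - transition["days"]
        minimum = MIN_STORAGE_DAYS.get(transition["storage_class"], 0)
        if held <= 0:
            warnings.append(f"{rule['id']}: {transition['storage_class']} is never used")
        elif held < minimum:
            warnings.append(
                f"{rule['id']}: {transition['storage_class']} held {held} days, "
                f"below the {minimum}-day minimum"
            )
    return warnings


if __name__ == "__main__":
    for rule in (RAW_RULE, CURATED_RULE):
        print(rule["id"], lifecycle_warnings(rule))
```

Under these assumptions the raw rule is flagged, because objects reach `GLACIER` on day 30 and expire on day 90, so they stay there for 60 days. Whether that charge matters depends on data volume.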

**Deploy Commands:**

```bash
# Initialize
terraform init

# Plan
terraform plan -var-file="environments/prod.tfvars" -out=tfplan

# Review changes
terraform show tfplan

# Apply
terraform apply tfplan

# Estimated monthly cost: $5,850 (sum of the line items below)
# - S3: $200
# - MSK: $2,500
# - EMR Serverless: $1,200
# - MWAA: $800
# - RDS: $300
# - ECS (DataHub): $400
# - Observability: $250
# - Data transfer: $200
```
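
A quick way to keep the cost estimate and the budget alerts in sync is to recompute the total from the individual line items and compare it with the `data-platform-monthly` budget defined in the `cost_management` module. This sketch copies the figures from the comments and the Terraform above; the function name `budget_report` is hypothetical.

```python
# Line items copied from the estimate above (USD per month).
MONTHLY_COST_USD = {
    "S3": 200,
    "MSK": 2500,
    "EMR Serverless": 1200,
    "MWAA": 800,
    "RDS": 300,
    "ECS (DataHub)": 400,
    "Observability": 250,
    "Data transfer": 200,
}

# Values from the data-platform-monthly budget in the cost_management module.
BUDGET_USD = 5000
THRESHOLD_PERCENTAGES = [50, 80, 100, 120]


def budget_report(costs, budget, thresholds):
    """Return (total, percent of budget used, thresholds already crossed)."""
    total = sum(costs.values())
    used_pct = 100 * total / budget
    crossed = [t for t in thresholds if used_pct >= t]
    return total, used_pct, crossed


if __name__ == "__main__":
    total, used_pct, crossed = budget_report(MONTHLY_COST_USD, BUDGET_USD, THRESHOLD_PERCENTAGES)
    print(f"total=${total:,} used={used_pct:.0f}% crossed={crossed}")
    # total=$5,850 used=117% crossed=[50, 80, 100]
```

With these numbers the estimate already exceeds the monthly budget, so either the budget amount or the sizing would need revisiting before the alerts become meaningful.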

---

**2. CI/CD Pipeline: GitHub Actions**

```yaml
# .github/workflows/data-platform-ci.yml
name: Data Platform CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  AWS_REGION: us-east-1
  TERRAFORM_VERSION: 1.5.7

jobs:
  code-quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python 3.11
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: |
          pip install -r requirements-dev.txt
      
      - name: Lint with Ruff
        run: |
          ruff check src/ tests/
      
      - name: Type check with mypy
        run: |
          mypy src/
      
      - name: Format check with Black
        run: |
          black --check src/ tests/
      
      - name: Security scan with Bandit
        run: |
          bandit -r src/ -ll
  
  unit-tests:
    runs-on: ubuntu-latest
    needs: code-quality
    
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: test
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov
      
      - name: Run tests with coverage
        run: |
          pytest tests/ \
            --cov=src \
            --cov-report=xml \
            --cov-report=html \
            --junitxml=test-results.xml
      
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml
          fail_ci_if_error: true
      
      - name: Check coverage threshold
        run: |
          coverage report --fail-under=80
  
  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    
    services:
      localstack:
        image: localstack/localstack:latest
        env:
          SERVICES: s3,glue,emr,mwaa
        ports:
          - 4566:4566
    
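    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest
      
      - name: Run integration tests
        run: |
          pytest tests/integration/ \
            --localstack-endpoint=http://localhost:4566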
  
  airflow-dags-validation:
    runs-on: ubuntu-latest
    needs: code-quality
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Validate Airflow DAGs
        run: |
          pip install apache-airflow==2.7.2 pytest
          
          # Import all DAGs (catch syntax errors)
          python -c "from airflow.models import DagBag; \
                     db = DagBag('dags/'); \
                     assert len(db.import_errors) == 0, db.import_errors"
      
      - name: Test DAG integrity
        run: |
          pytest tests/dags/ -v
  
  terraform-plan:
    runs-on: ubuntu-latest
    needs: [unit-tests, integration-tests]
    if: github.event_name == 'pull_request'
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: ${{ env.TERRAFORM_VERSION }}
      
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}
      
      - name: Terraform Init
        run: terraform init
        working-directory: infrastructure/
      
      - name: Terraform Validate
        run: terraform validate
        working-directory: infrastructure/
      
      - name: Terraform Plan
        run: terraform plan -var-file="environments/prod.tfvars" -no-color
        working-directory: infrastructure/
        continue-on-error: true
        id: plan
      
      - name: Comment Plan on PR
        uses: actions/github-script@v6
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `#### Terraform Plan 📖\n\`\`\`\n${{ steps.plan.outputs.stdout }}\n\`\`\``
            })
  
  deploy-staging:
    runs-on: ubuntu-latest
    needs: [unit-tests, integration-tests, airflow-dags-validation]
    if: github.ref == 'refs/heads/develop'
    environment: staging
    
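    steps:
      - uses: actions/checkout@v4
      
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}
      
      - name: Deploy Airflow DAGs to S3
        run: |
          aws s3 sync dags/ s3://ecommerce-airflow-staging/dags/ \
            --exclude "*.pyc" \
            --exclude "__pycache__/*" \
            --delete
      
      - name: Check MWAA environment status
        # MWAA picks up DAGs from S3 on its own; printing a CLI token to the logs is unnecessary
        run: |
          aws mwaa get-environment \
            --name ecommerce-airflow-staging \
            --query Environment.Status \
            --output text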
      
      - name: Deploy Spark jobs to S3
        run: |
          aws s3 sync src/spark/ s3://ecommerce-code-staging/spark/ \
            --delete
  
  deploy-production:
    runs-on: ubuntu-latest
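    # deploy-staging only runs on develop, and a job that needs a skipped job is skipped too
    needs: [unit-tests, integration-tests, airflow-dags-validation]
    if: github.ref == 'refs/heads/main'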
    environment: production
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_PROD_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}
      
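      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: ${{ env.TERRAFORM_VERSION }}
      
      - name: Deploy Terraform changes
        run: |
          terraform init
          terraform apply -var-file="environments/prod.tfvars" -auto-approve
        working-directory: infrastructure/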
      
      - name: Deploy Airflow DAGs
        run: |
          aws s3 sync dags/ s3://ecommerce-airflow-prod/dags/ --delete
      
      - name: Deploy Spark jobs
        run: |
          aws s3 sync src/spark/ s3://ecommerce-code-prod/spark/ --delete
      
      - name: Create GitHub Release
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          tag_name: v${{ github.run_number }}
          release_name: Release v${{ github.run_number }}
          body: |
            ## Changes
            ${{ github.event.head_commit.message }}
            
            ## Deployment
            - Terraform applied
            - Airflow DAGs updated
            - Spark jobs deployed
          draft: false
          prerelease: false
      
      - name: Notify Slack
        uses: 8398a7/action-slack@v3
        with:
          status: ${{ job.status }}
          text: '🚀 Production deployment finished: ${{ job.status }}'
        env:
          # action-slack v3 reads the webhook from this environment variable
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
        if: always()
```
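
The `airflow-dags-validation` job relies on `DagBag` to surface import errors, which requires a full Airflow install. A cheaper pre-check that could run earlier (for example in `code-quality`) is to compile every file under `dags/` without importing it. The sketch below uses only the standard library; the function name `find_syntax_errors` is hypothetical, and the check complements the `DagBag` import rather than replacing it.

```python
from pathlib import Path


def find_syntax_errors(dag_dir):
    """Compile every .py file under dag_dir and return {path: error message}.

    compile() is used instead of ast.parse() because some errors, such as
    `from module import *` inside a function, are only reported at compile time.
    """
    errors = {}
    for path in sorted(Path(dag_dir).rglob("*.py")):
        try:
            compile(path.read_text(encoding="utf-8"), str(path), "exec")
        except SyntaxError as exc:
            errors[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return errors


if __name__ == "__main__":
    problems = find_syntax_errors("dags/")
    for file_path, message in problems.items():
        print(f"{file_path}: {message}")
    raise SystemExit(1 if problems else 0)
```

Compiling does not execute the module, so missing packages or bad connections are not detected here; those still need the `DagBag` step.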

---

**3. SLO & SLI Definitions**

```yaml
# slos.yml
service_level_objectives:
  
  # Pipeline Availability
  - name: ventas_pipeline_availability
    description: "Percentage of time ventas pipeline completes successfully"
    objective: 99.5%
    window: 30d
    
    sli_query: |
      sum(rate(airflow_dag_run_success{dag_id="ventas_batch_pipeline_v1"}[30d]))
      /
      sum(rate(airflow_dag_run_total{dag_id="ventas_batch_pipeline_v1"}[30d]))
    
    error_budget: 0.5%  # 3.6 hours/month
    
    alerting:
      - severity: warning
        threshold: 99.0%
        window: 7d
      - severity: critical
        threshold: 98.0%
        window: 24h
  
  # Pipeline Latency
  - name: ventas_pipeline_latency_p99
    description: "99th percentile pipeline duration"
    objective: "<2 hours"
    window: 7d
    
    sli_query: |
      histogram_quantile(0.99,
        rate(airflow_dag_duration_seconds_bucket{dag_id="ventas_batch_pipeline_v1"}[7d])
      )
    
    alerting:
      - severity: warning
        threshold: 7200  # 2 hours
      - severity: critical
        threshold: 10800  # 3 hours
  
  # Data Freshness
  - name: data_freshness
    description: "Time since last successful data update"
    objective: "<90 minutes"
    
    sli_query: |
      time() - max(delta_table_last_updated{table="curated.ventas"})
    
    alerting:
      - severity: warning
        threshold: 5400  # 90 min
      - severity: critical
        threshold: 7200  # 2 hours
  
  # Data Quality
  - name: data_quality_pass_rate
    description: "Percentage of records passing validation"
    objective: 99.9%
    window: 7d
    
    sli_query: |
      sum(rate(great_expectations_validation_success[7d]))
      /
      sum(rate(great_expectations_validation_total[7d]))
    
    alerting:
      - severity: warning
        threshold: 99.5%
      - severity: critical
        threshold: 99.0%
  
  # API Performance
  - name: api_latency_p95
    description: "95th percentile API response time"
    objective: "<500ms"
    window: 24h
    
    sli_query: |
      histogram_quantile(0.95,
        rate(fastapi_request_duration_seconds_bucket{endpoint="/query"}[24h])
      )
    
    alerting:
      - severity: warning
        threshold: 0.5  # 500ms
      - severity: critical
        threshold: 1.0  # 1s
```
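
The `error_budget` comment above (0.5% of a 30-day window, about 3.6 hours) can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the availability SLI is measured as successful runs over total runs, as in the `sli_query`.

```python
def error_budget_minutes(objective_pct: float, window_days: int) -> float:
    """Time in the window that may be 'bad' without breaching the objective."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - objective_pct) / 100.0


def budget_remaining(objective_pct: float, good_runs: int, total_runs: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if total_runs == 0:
        return 1.0  # nothing has run yet, so nothing has been spent
    allowed_failure_ratio = (100.0 - objective_pct) / 100.0
    observed_failure_ratio = 1.0 - good_runs / total_runs
    return 1.0 - observed_failure_ratio / allowed_failure_ratio


if __name__ == "__main__":
    # 99.5% over 30 days leaves 216 minutes, i.e. 3.6 hours
    print(error_budget_minutes(99.5, 30) / 60)
    # 2 failed runs out of 720 spend a bit more than half of the budget
    print(budget_remaining(99.5, good_runs=718, total_runs=720))
```

Tracking the remaining fraction rather than the raw ratio makes it easier to decide when to pause risky deployments.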

**Prometheus Alert Rules:**

```yaml
# alerts/data-platform.yml
groups:
  - name: data_platform_alerts
    interval: 1m
    
    rules:
      - alert: DataPipelineFailure
        expr: |
          rate(airflow_dag_run_failure{dag_id=~".*_pipeline.*"}[5m]) > 0
        for: 1m
        labels:
          severity: critical
          team: data-platform
        annotations:
          summary: "Airflow DAG {{ $labels.dag_id }} failing"
          description: "{{ $value | humanize }} failed runs/s averaged over the last 5 min"
          runbook: "https://wiki.ecommerce.com/runbooks/airflow-failures"
      
      - alert: DataFreshnessViolation
        expr: |
          (time() - delta_table_last_updated{table="curated.ventas"}) > 5400
        for: 5m
        labels:
          severity: warning
          team: data-platform
        annotations:
          summary: "Stale data in {{ $labels.table }}"
          description: "Last update {{ $value | humanizeDuration }} ago"
      
      - alert: KafkaConsumerLag
        expr: |
          kafka_consumer_lag{topic="ecommerce.ventas.v1"} > 100000
        for: 10m
        labels:
          severity: warning
          team: data-platform
        annotations:
          summary: "High Kafka consumer lag on {{ $labels.topic }}"
          description: "Lag: {{ $value }} messages"
      
      - alert: S3CostAnomaly
        expr: |
          (
            delta(aws_s3_storage_bytes{bucket="ecommerce-data"}[1d])
            /
            delta(aws_s3_storage_bytes{bucket="ecommerce-data"}[1d] offset 7d)
          ) > 1.5
        for: 1h
        labels:
          severity: warning
          team: finops
        annotations:
          summary: "S3 storage growth anomaly detected"
          description: "Daily storage growth is {{ $value | humanize }}x the growth on the same day last week"
      
      - alert: DataQualityDegraded
        expr: |
          (
            sum(rate(great_expectations_validation_failure[1h]))
            /
            sum(rate(great_expectations_validation_total[1h]))
          ) > 0.01
        for: 30m
        labels:
          severity: warning
          team: data-platform
        annotations:
          summary: "Data quality degraded"
          description: "{{ $value | humanizePercentage }} validation failure rate"
      
      - alert: EMRServerlessCostRunaway
        expr: |
          sum(increase(aws_emr_serverless_vcore_hours[1h])) > 1000
        for: 1h
        labels:
          severity: critical
          team: finops
        annotations:
          summary: "EMR Serverless usage spike"
          description: "{{ $value }} vCore-hours in last hour (normal: <500)"
```
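
The freshness thresholds (5400 s warning, 7200 s critical in `slos.yml`) are easy to get wrong when copied between files, so it can help to unit-test them outside Prometheus. The following is a minimal sketch of the same comparison as `time() - delta_table_last_updated`; the function and constant names are hypothetical, and the `for:` hold period of the alert rule is not modelled.

```python
import time
from typing import Optional

WARNING_AFTER_S = 5400   # 90 min, from slos.yml
CRITICAL_AFTER_S = 7200  # 2 hours, from slos.yml


def freshness_severity(last_updated_ts: float, now: Optional[float] = None) -> str:
    """Classify a table by the age of its last successful update (Unix seconds)."""
    now = time.time() if now is None else now
    age_s = now - last_updated_ts
    if age_s > CRITICAL_AFTER_S:
        return "critical"
    if age_s > WARNING_AFTER_S:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # A table updated 100 minutes ago is past the warning threshold only
    print(freshness_severity(last_updated_ts=0.0, now=100 * 60))
```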

---

**4. Incident Response Runbook**

```markdown
# Runbook: Ventas Pipeline Failure

## Severity: P1 (Critical)

### Symptoms
- Airflow DAG `ventas_batch_pipeline_v1` status: FAILED
- Alert: "DataPipelineFailure" in PagerDuty
- Dashboard: Red status in Grafana

### Impact
- BI dashboards stale (>2 hours)
- Executive reports unavailable
- Customer RFM scores not updated

### Diagnosis Steps

1. **Check Airflow UI**
   ```bash
   # Open Airflow
   open https://airflow.ecommerce.com/dags/ventas_batch_pipeline_v1
   
   # Identify failed task
   # Common failures:
   # - check_sftp_files: Missing upstream data
   # - run_data_quality_checks: Validation failure
   # - transform_to_curated: Spark job OOM
   ```

2. **Check Logs**
   ```bash
   # CloudWatch Logs
   aws logs tail airflow-ecommerce-airflow-prod-Task \
     --follow \
     --filter-pattern "ERROR"
   
   # S3 Spark logs
   aws s3 ls s3://ecommerce-logs/spark/ventas/
   ```

3. **Check Dependencies**
   ```bash
   # SFTP files present?
   sftp sap_sftp_user@sftp.sap.com
   > ls /exports/ventas_2024-01-15.csv
   
   # S3 accessible?
   aws s3 ls s3://ecommerce-data/raw/ventas/
   
   # EMR Serverless healthy?
   aws emr-serverless list-applications
   ```

### Resolution Procedures

#### Scenario A: Missing SFTP Files
```bash
# 1. Contact SAP team (slack: #sap-integration)
# 2. Verify export schedule
# 3. Manual trigger if needed
# 4. Re-run DAG once files available

airflow dags trigger ventas_batch_pipeline_v1 \
  --conf '{"execution_date": "2024-01-15"}'
```

#### Scenario B: Data Quality Failure
```python
# 1. Review Great Expectations report
import great_expectations as gx
context = gx.get_context()

# 2. Check failed validations
results = context.get_validation_result("ventas_suite_2024-01-15")
failures = [r for r in results.results if not r.success]

# 3. Determine if acceptable
# - If data issue: Contact source system owner
# - If expectation too strict: Update suite

# 4. Override validation (emergency only). Run this from a shell, not from Python:
#    airflow tasks run ventas_batch_pipeline_v1 \
#      transform_to_curated \
#      2024-01-15 \
#      --ignore-dependencies
```

#### Scenario C: Spark OOM
```bash
# 1. Check memory usage
aws emr-serverless get-job-run \
  --application-id app-123 \
  --job-run-id jr-456 \
  | jq '.jobRun.totalExecutionDurationSeconds, .jobRun.totalResourceUtilization'

# 2. Increase memory allocation
# Edit DAG: increase executor memory 16GB → 32GB

# 3. Optimize query
# - Add .repartition(100) before expensive operations
# - Use broadcast joins for small tables
# - Enable AQE (Adaptive Query Execution)

# 4. Re-run
airflow dags trigger ventas_batch_pipeline_v1
```

### Escalation Path
- **L1 (On-call DE):** 0-15 min
- **L2 (Senior DE):** 15-30 min
- **L3 (Staff DE + Manager):** 30-60 min
- **Incident Commander:** >60 min or customer-facing

### Post-Incident
1. **Root Cause Analysis** (within 48h of resolution)
2. **Postmortem doc** (Confluence template)
3. **Action items** (Jira tickets)
4. **Knowledge base update**

### Related Links
- [Airflow UI](https://airflow.ecommerce.com)
- [Grafana Dashboard](https://grafana.ecommerce.com/d/ventas)
- [CloudWatch Logs](https://console.aws.amazon.com/cloudwatch/home#logsV2:log-groups)
- [PagerDuty Escalation](https://ecommerce.pagerduty.com)
```
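
The escalation path in the runbook is effectively a lookup table keyed by how long the incident has been open. A minimal sketch of that lookup follows; the names are hypothetical, and it assumes each upper bound is inclusive, which the runbook does not specify.

```python
from typing import List, Tuple

# (upper bound in minutes, responder), copied from the runbook's escalation path
ESCALATION_PATH: List[Tuple[float, str]] = [
    (15, "L1 (On-call DE)"),
    (30, "L2 (Senior DE)"),
    (60, "L3 (Staff DE + Manager)"),
]
INCIDENT_COMMANDER = "Incident Commander"


def current_owner(minutes_open: float, customer_facing: bool = False) -> str:
    """Return who should own the incident after `minutes_open` minutes."""
    if customer_facing:
        # Customer-facing incidents go straight to the Incident Commander
        return INCIDENT_COMMANDER
    for upper_bound, owner in ESCALATION_PATH:
        if minutes_open <= upper_bound:
            return owner
    return INCIDENT_COMMANDER  # open for more than 60 minutes


if __name__ == "__main__":
    print(current_owner(20))
    print(current_owner(5, customer_facing=True))
```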

---

**5. Cost Monitoring Dashboard**

```python
# monitoring/cost_dashboard.py
from prometheus_client import Gauge, start_http_server
import boto3
from datetime import datetime, timedelta
import time

# Metrics
s3_storage_bytes = Gauge('aws_s3_storage_bytes', 'S3 storage size', ['bucket', 'storage_class'])
s3_request_count = Gauge('aws_s3_requests_total', 'S3 requests', ['bucket', 'operation'])
emr_vcore_hours = Gauge('aws_emr_serverless_vcore_hours', 'EMR vCore hours', ['application'])
msk_throughput_mb = Gauge('aws_msk_throughput_mb', 'MSK throughput', ['cluster'])
rds_cpu_utilization = Gauge('aws_rds_cpu_percent', 'RDS CPU', ['instance'])

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
ce = boto3.client('ce', region_name='us-east-1')  # Cost Explorer

def collect_s3_metrics():
    """Collect the S3 bucket size from CloudWatch and export it as a gauge"""
    # BucketSizeBytes is published once per day, so look back two days
    # to make sure at least one datapoint falls inside the window.
    now = datetime.utcnow()
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/S3',
        MetricName='BucketSizeBytes',
        Dimensions=[
            {'Name': 'BucketName', 'Value': 'ecommerce-data'},
            {'Name': 'StorageType', 'Value': 'StandardStorage'}
        ],
        StartTime=now - timedelta(days=2),
        EndTime=now,
        Period=86400,
        Statistics=['Average']
    )

    if response['Datapoints']:
        # Datapoints are not returned in time order; use the most recent one
        latest = max(response['Datapoints'], key=lambda dp: dp['Timestamp'])
        s3_storage_bytes.labels(bucket='ecommerce-data', storage_class='standard').set(latest['Average'])

# Single gauge labelled by service (hypothetical metric name) instead of one gauge per service
service_daily_cost_usd = Gauge('aws_service_daily_cost_usd', 'Daily unblended cost per AWS service (USD)', ['service'])

def collect_cost_metrics():
    """Collect yesterday's cost per service from Cost Explorer"""
    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': (datetime.utcnow() - timedelta(days=1)).strftime('%Y-%m-%d'),
            'End': datetime.utcnow().strftime('%Y-%m-%d')
        },
        Granularity='DAILY',
        Metrics=['UnblendedCost'],
        GroupBy=[
            {'Type': 'DIMENSION', 'Key': 'SERVICE'}
        ],
        Filter={
            'Tags': {
                'Key': 'Project',
                'Values': ['data-platform']
            }
        }
    )

    for result in response['ResultsByTime']:
        for group in result['Groups']:
            service = group['Keys'][0]
            cost = float(group['Metrics']['UnblendedCost']['Amount'])

            # Export as a Prometheus metric, one time series per service
            service_daily_cost_usd.labels(service=service).set(cost)

if __name__ == '__main__':
    # Start Prometheus HTTP server
    start_http_server(8000)
    
    while True:
        collect_s3_metrics()
        collect_cost_metrics()
        time.sleep(300)  # Every 5 min
```

---

**Author:** Luis J. Raigoso V. (LJRV)

## 1. Context and requirements

**Company**: Global e-commerce business with 3 data domains (Sales, Logistics, Analytics).

**Functional requirements**:
- Ingest transactions in real time (Kafka) and in a nightly batch (SFTP files).
- Store data in a lakehouse (Parquet + Delta Lake) partitioned by date and region.
- Central catalog with metadata, lineage, and access policies (Glue/Unity/DataHub).
- Daily orchestration with Airflow: quality validations, transformations, reports.
- Serving APIs (FastAPI) for ad-hoc queries from BI tools and data scientists.

**Non-functional requirements**:
- GDPR compliance: PII masking, right to be forgotten.
- SLO: p99 latency < 30 min, availability > 99.5%.
- Cost: < $5000/month, with continuous optimization (FinOps).
- Observability: structured logs, metrics in Prometheus, lineage in DataHub.
- Security: least-privilege IAM, encryption at rest and in transit, auditing.
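
The $5000/month cost target can be checked mid-month with a simple linear projection. The sketch below is one possible approach, not a prescribed FinOps method: the function names are hypothetical, and it assumes spend is roughly uniform across days and that the current day counts as a full day.

```python
import calendar
from datetime import date

MONTHLY_BUDGET_USD = 5000.0  # from the non-functional requirements


def projected_month_end_cost(month_to_date_usd: float, today: date) -> float:
    """Average daily spend so far, extrapolated to the full month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return month_to_date_usd / today.day * days_in_month


def over_budget(month_to_date_usd: float, today: date,
                budget_usd: float = MONTHLY_BUDGET_USD) -> bool:
    """True when the linear projection exceeds the monthly budget."""
    return projected_month_end_cost(month_to_date_usd, today) > budget_usd


if __name__ == "__main__":
    # $1500 spent by January 10 projects to $150/day over 31 days
    print(projected_month_end_cost(1500.0, date(2024, 1, 10)))
    print(over_budget(1500.0, date(2024, 1, 10)))
```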

## 2. Proposed architecture

In [None]:
arquitectura_diagrama = '''
┌─────────────┐       ┌──────────────┐       ┌───────────────┐
│  Real-time  │──────▶│    Kafka     │──────▶│  Spark        │
│ transactions│       │  (streaming) │       │  Streaming    │
└─────────────┘       └──────────────┘       └───────┬───────┘
                                                     │
┌─────────────┐       ┌──────────────┐               │
│  SFTP batch │──────▶│   Airflow    │───────────────┤
│  files      │       │(orchestrator)│               │
└─────────────┘       └──────────────┘               │
                                                     ▼
                      ┌──────────────────────────────────┐
                      │  Data Lakehouse (S3 + Delta)     │
                      │  - raw/                          │
                      │  - curated/                      │
                      │  - gold/ (aggregates)            │
                      └─────────┬────────────────────────┘
                                │
                ┌───────────────┼───────────────┐
                ▼               ▼               ▼
         ┌──────────┐   ┌──────────┐   ┌──────────┐
         │  Athena  │   │ FastAPI  │   │   BI     │
         │  (SQL)   │   │ (APIs)   │   │ (Tableau)│
         └──────────┘   └──────────┘   └──────────┘

Observability: Prometheus + Grafana + DataHub (lineage)
Security: IAM, KMS, CloudTrail, PII masking
'''
print(arquitectura_diagrama)

## 3. Components to implement (checklist)

In [None]:
checklist = '''
☐ 1. Kafka cluster (local Docker Compose or MSK on AWS)
☐ 2. Simulated event producer (transactions)
☐ 3. Spark Streaming consumer → Delta Lake (S3)
☐ 4. Airflow batch DAG: SFTP → raw → validation → curated → gold
☐ 5. Data quality validations with Great Expectations
☐ 6. PII masking (email, card number)
☐ 7. Catalog with Glue Data Catalog or DataHub
☐ 8. Lineage with OpenLineage (Airflow plugin)
☐ 9. FastAPI endpoint for SQL queries (proxy to Athena/Trino)
☐ 10. Prometheus metrics exported by the pipelines
☐ 11. Grafana dashboard with SLOs and alerts
☐ 12. Least-privilege IAM policies
☐ 13. KMS encryption for S3 and RDS
☐ 14. CloudTrail auditing enabled
☐ 15. Cost budgets and alerts (AWS Budgets)
☐ 16. Technical documentation and runbooks
☐ 17. Integration tests (Pytest)
☐ 18. CI/CD with GitHub Actions (lint, test, deploy)
'''
print(checklist)

## 4. Step-by-step implementation

### 4.1 Initial setup
- Create an S3 bucket with the `raw/`, `curated/`, `gold/` structure.
- Configure the Glue Data Catalog with an `ecommerce` database.
- Start Kafka locally with Docker Compose (zookeeper + broker).

### 4.2 Streaming path
- Python producer: generates JSON events (transacción_id, cliente_id, monto, timestamp).
- Spark Structured Streaming: consumes from Kafka, validates the schema, writes to Delta in `curated/ventas/`.
- Checkpointing so that a restarted job resumes from its last committed offsets without duplicating writes.
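
As an illustration of the producer side, the sketch below generates one simulated event and checks it against the expected field types before it would be serialized for Kafka. It uses only the standard library; the ASCII field name `transaccion_id`, the value ranges, and the helper names are assumptions, and the actual send to a Kafka topic is left out.

```python
import json
import random
import uuid
from datetime import datetime, timezone

# Expected fields and types for a transaction event (assumed schema)
REQUIRED_FIELDS = {
    "transaccion_id": str,
    "cliente_id": int,
    "monto": float,
    "timestamp": str,
}


def make_event(rng: random.Random) -> dict:
    """Build one simulated transaction event."""
    return {
        "transaccion_id": str(uuid.UUID(int=rng.getrandbits(128))),
        "cliente_id": rng.randint(1, 10_000),
        "monto": round(rng.uniform(1.0, 500.0), 2),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def is_valid(event: dict) -> bool:
    """Check that the event has exactly the expected fields with the expected types."""
    return set(event) == set(REQUIRED_FIELDS) and all(
        isinstance(event[name], expected) for name, expected in REQUIRED_FIELDS.items()
    )


if __name__ == "__main__":
    event = make_event(random.Random(42))
    if is_valid(event):
        payload = json.dumps(event).encode("utf-8")  # bytes ready for a Kafka producer
        print(payload)
```

Validating on the producer side does not replace the schema check in Spark, but it catches malformed test data earlier.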

### 4.3 Batch path
- Airflow DAG: SFTP sensor → download → validate (GE) → transform (Pandas/Spark) → write Delta → optimize.
- Gold aggregations: sales by day/region/product.
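
The gold aggregation is a plain group-by over day, region, and product. In the platform this would run in Spark or Pandas; the standard-library sketch below only illustrates the logic, and the column names (`fecha`, `region`, `producto`, `monto`) and output fields are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_sales(rows: Iterable[dict]) -> Dict[Tuple[str, str, str], dict]:
    """Sum amounts and count sales per (fecha, region, producto)."""
    totals = defaultdict(lambda: {"total_monto": 0.0, "num_ventas": 0})
    for row in rows:
        key = (row["fecha"], row["region"], row["producto"])
        totals[key]["total_monto"] += row["monto"]
        totals[key]["num_ventas"] += 1
    return dict(totals)


if __name__ == "__main__":
    curated = [
        {"fecha": "2024-01-15", "region": "EU", "producto": "A", "monto": 10.0},
        {"fecha": "2024-01-15", "region": "EU", "producto": "A", "monto": 5.5},
        {"fecha": "2024-01-15", "region": "US", "producto": "B", "monto": 7.0},
    ]
    for key, metrics in aggregate_sales(curated).items():
        print(key, metrics)
```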

### 4.4 Governance and security
- Mask email and card number before writing to curated.
- Record lineage in DataHub via OpenLineage.
- Configure IAM roles for Spark, Airflow, and the APIs.
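
One possible way to mask the two PII fields is sketched below: the email's local part is replaced by a salted hash (so joins on the masked value still work) and the card number keeps only its last four digits. The function names, the salt handling, and the 12-character digest length are assumptions; a salted hash is pseudonymization rather than anonymization, so whether it is sufficient depends on the compliance review.

```python
import hashlib


def mask_email(email: str, salt: str = "change-me") -> str:
    """Replace the local part with a salted SHA-256 prefix; keep the domain."""
    local, _, domain = email.partition("@")
    digest = hashlib.sha256((salt + local.lower()).encode("utf-8")).hexdigest()[:12]
    return f"{digest}@{domain.lower()}"


def mask_card(card_number: str) -> str:
    """Keep only the last four digits; separators such as spaces are dropped."""
    digits = [c for c in card_number if c.isdigit()]
    if len(digits) < 4:
        return "*" * len(digits)
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


if __name__ == "__main__":
    print(mask_email("Ana.Perez@Example.com"))
    print(mask_card("4111 1111 1111 1111"))
```

In a real deployment the salt would come from a secret store rather than a default argument.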

### 4.5 Observability
- Export count, latency, and error metrics.
- Grafana dashboard with panels per pipeline.
- Slack alerts when an SLO is violated.

### 4.6 Query service
- FastAPI endpoint `/query` that runs SQL on Athena and returns JSON.
- Redis cache for repeated queries.
- Rate limiting per API key.
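
Rate limiting per API key is commonly done with a token bucket. The sketch below is an in-memory version for illustration only: the class name and parameters are hypothetical, it is not wired into FastAPI, and a service running several workers would need shared state (for example in the Redis instance already used for caching).

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Per-key token bucket: `burst` requests at once, refilled at `rate_per_s`."""

    def __init__(self, rate_per_s: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.clock = clock  # injectable so tests can control time
        self._state: Dict[str, Tuple[float, float]] = {}  # key -> (tokens, last_seen)

    def allow(self, api_key: str) -> bool:
        """Consume one token for `api_key` if available; return whether it was allowed."""
        now = self.clock()
        tokens, last_seen = self._state.get(api_key, (float(self.burst), now))
        # Refill according to elapsed time, capped at the burst size
        tokens = min(float(self.burst), tokens + (now - last_seen) * self.rate_per_s)
        if tokens >= 1.0:
            self._state[api_key] = (tokens - 1.0, now)
            return True
        self._state[api_key] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = TokenBucketLimiter(rate_per_s=1.0, burst=2)
    print([limiter.allow("demo-key") for _ in range(3)])
```

In the `/query` endpoint, a rejected call would typically map to an HTTP 429 response.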

## 5. Deliverables

- Git repository with the code (pipelines, DAGs, APIs, tests).
- Updated architecture diagram.
- Design document (technical decisions, trade-offs).
- Exported Grafana dashboard (JSON).
- Operations runbook (troubleshooting, rollback).
- Video/demo running the pipeline end to end.

## 6. Evaluation

- Functionality: do the pipelines run correctly?
- Quality: are validations and tests implemented?
- Observability: are metrics and lineage visible?
- Security: are IAM, encryption, and PII masking in place?
- Cost: is the budget respected?
- Documentation: is it clear and complete?