# 🏗️ Data Lakehouse with Parquet, Delta Lake, and Iceberg (concepts and light practice)

Objective: understand the principles of the Lakehouse and practice a basic workflow with Parquet (local), with notes on how to migrate to Delta Lake or Apache Iceberg.

- Duration: 120 min
- Difficulty: Medium/High
- Prerequisites: Mid 03 (AWS/S3) and 07 (Partitioning)

### 🏗️ **Lakehouse Architecture: Unificando Data Warehouse y Data Lake**

**The Evolution of Data Architectures:**

```
┌──────────────────────────────────────────────────────────┐
│  GENERACIÓN 1 (2000s): Data Warehouse                    │
│  ┌────────────────────────────────────┐                  │
│  │  Structured Data → RDBMS (Oracle)  │                  │
│  │  OLAP → Star/Snowflake Schema      │                  │
│  │  BI Tools → SQL Queries            │                  │
│  └────────────────────────────────────┘                  │
│  ✅ ACID, Performance      ❌ Expensive, no ML support   │
└──────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────┐
│  GENERACIÓN 2 (2010s): Data Lake                         │
│  ┌────────────────────────────────────┐                  │
│  │  All Data (structured + unstr.)    │                  │
│  │  → S3/HDFS (cheap storage)         │                  │
│  │  → Spark/Presto (compute layer)    │                  │
│  └────────────────────────────────────┘                  │
│  ✅ Scalable, Cheap      ❌ No ACID, Data Swamp          │
└──────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────┐
│  GENERACIÓN 3 (2020s): LAKEHOUSE                         │
│  ┌────────────────────────────────────┐                  │
│  │  ┌──────────────────────────────┐  │                  │
│  │  │  Metadata Layer (Delta/Ice)  │  │ ← Transactions   │
│  │  └──────────────────────────────┘  │                  │
│  │  ┌──────────────────────────────┐  │                  │
│  │  │  Columnar Format (Parquet)   │  │ ← Performance    │
│  │  └──────────────────────────────┘  │                  │
│  │  ┌──────────────────────────────┐  │                  │
│  │  │  Object Storage (S3/ADLS)    │  │ ← Scalability    │
│  │  └──────────────────────────────┘  │                  │
│  └────────────────────────────────────┘                  │
│  ✅ ACID + Scale + Cheap + ML/BI      ❌ Complexity      │
└──────────────────────────────────────────────────────────┘
```

**What is a Lakehouse?**

An architecture that combines the strengths of a **Data Warehouse** (transactions, performance) with those of a **Data Lake** (scalability, cost-effectiveness).

**Key Components:**

```python
lakehouse_stack = {
    'Storage Layer': {
        'Technology': 'S3, ADLS, GCS',
        'Format': 'Parquet (columnar)',
        'Cost': '~$0.023/GB-month (S3 Standard list price, varies by region)',
        'Benefit': 'Virtually unlimited, low-cost storage'
    },
    'Metadata Layer': {
        'Technology': 'Delta Lake, Apache Iceberg, Apache Hudi',
        'Features': 'ACID, Time Travel, Schema Evolution',
        'Benefit': 'Transactions on top of object storage'
    },
    'Catalog Layer': {
        'Technology': 'AWS Glue, Unity Catalog, Hive Metastore',
        'Features': 'Metadata management, Permissions',
        'Benefit': 'Single source of truth'
    },
    'Compute Layer': {
        'Technology': 'Spark, Presto, Athena, Trino',
        'Features': 'SQL queries, Distributed processing',
        'Benefit': 'Separation of storage and compute'
    }
}
```

**Table Formats Comparison:**

| Feature | **Delta Lake** | **Apache Iceberg** | **Apache Hudi** |
|---------|----------------|-------------------|-----------------|
| **Creator** | Databricks (2019) | Netflix (2017) | Uber (2016) |
| **ACID** | ✅ Optimistic locking | ✅ Snapshot isolation | ✅ MVCC |
| **Time Travel** | ✅ Version history | ✅ Snapshot-based | ✅ Commit timeline |
| **Schema Evolution** | ✅ ADD/DROP cols | ✅ Full evolution | ✅ Partial |
| **Partition Evolution** | ❌ Manual | ✅ Hidden partitions | ❌ Manual |
| **Streaming** | ✅ Spark Streaming | ✅ Flink, Spark | ✅ DeltaStreamer |
| **Engines** | Spark, Presto, Trino | Spark, Flink, Trino, Athena | Spark, Presto |
| **Maturity** | ⭐⭐⭐ High | ⭐⭐⭐ High | ⭐⭐ Medium |
| **Governance** | Linux Foundation (Databricks-led) | Apache (neutral) | Apache (neutral) |
| **Best For** | Databricks users | Multi-engine, AWS | Upserts, CDC |

**Real-World Adoption:**

```
Delta Lake:
  - Databricks customers
  - Comcast (Petabyte-scale)
  - Riot Games (Gaming analytics)

Apache Iceberg:
  - Netflix (originator, 100+ PB)
  - Apple (iCloud data)
  - Adobe (Experience Cloud)
  - AWS (Athena native support)

Apache Hudi:
  - Uber (ride-hailing data)
  - Amazon (internal use)
  - Disney+ (streaming analytics)
```

**Modern Data Architecture:**

```
┌─────────────────────────────────────────────────────────┐
│                     DATA SOURCES                         │
│  Databases, APIs, Streams, Files, SaaS, IoT             │
└──────────────────────┬──────────────────────────────────┘
                       │ Ingest (Fivetran, Airbyte, Custom)
                       ▼
┌─────────────────────────────────────────────────────────┐
│                    BRONZE LAYER                          │
│              (Raw data, append-only)                     │
│  Format: Parquet, Delta, Iceberg                        │
│  Partitioned by: ingestion_date                         │
└──────────────────────┬──────────────────────────────────┘
                       │ Transformation (dbt, Spark)
                       ▼
┌─────────────────────────────────────────────────────────┐
│                    SILVER LAYER                          │
│         (Cleaned, validated, deduplicated)              │
│  Format: Delta Lake (ACID needed)                       │
│  Partitioned by: business dimensions                    │
└──────────────────────┬──────────────────────────────────┘
                       │ Aggregation, Business Logic
                       ▼
┌─────────────────────────────────────────────────────────┐
│                     GOLD LAYER                           │
│        (Aggregated, business-ready datasets)            │
│  Format: Delta Lake or Iceberg                          │
│  Consumed by: BI Tools, ML Models, APIs                 │
└─────────────────────────────────────────────────────────┘
```

**Lakehouse Benefits:**

1. **Unified Platform**:
   - Single storage layer for BI, ML, and Data Science
   - No more "copy data from DW to DL for ML"

2. **Cost Reduction**:
   - Storage: ~$0.023/GB-month on S3 Standard (≈ $23/TB-month) vs ~$25/TB-month (Snowflake); the two are close once the units match, so most savings come from compute
   - Compute: Pay-per-query (Athena) or Spot instances

3. **Performance**:
   - Columnar format (often an order of magnitude faster for queries that read few columns)
   - Partition pruning (scan only the relevant data)
   - Caching + predicate pushdown

4. **Governance**:
   - ACID guarantees consistency
   - Time Travel for auditing
   - Fine-grained access control

5. **Flexibility**:
   - Multiple engines (less vendor lock-in)
   - Schema evolution without downtime
   - Supports structured + semi-structured data

**When to use a Lakehouse?**

✅ **Use it when:**
- Volumes > 100 TB
- You need ML + BI on the same data
- Multiple teams with different tools
- Limited budget compared with a traditional DW

❌ **Do NOT use it when:**
- Datasets < 1 TB (PostgreSQL is enough)
- Small team (< 5) with limited skills
- Critical latency < 100 ms (consider OLTP)
- Compliance requires on-prem (cloud limitations)

---
**Author:** Luis J. Raigoso V. (LJRV)

## 1. Lakehouse in a nutshell

### 📐 **Lakehouse Fundamentals: Storage + Metadata + Catalog**

**1. Storage Layer: Object Storage como Foundation**

```python
# Why S3/ADLS/GCS rather than HDFS or an RDBMS?

storage_comparison = {
    'HDFS (Hadoop)': {
        'cost': '$$$',
        'scalability': 'Limited by cluster',
        'durability': '99.9% (3 replicas)',
        'latency': 'Low (local)',
        'ops_complexity': 'High (manage cluster)',
        'verdict': '❌ Legacy, avoid for new projects'
    },
    'RDBMS (PostgreSQL)': {
        'cost': '$$$$',
        'scalability': 'Vertical (TB-scale)',
        'durability': '99.99% (replication)',
        'latency': 'Very Low (<10ms)',
        'ops_complexity': 'Medium',
        'verdict': '✅ For OLTP, ❌ for Analytics at scale'
    },
    'Object Storage (S3)': {
        'cost': '$',
        'scalability': 'Infinite (PB-EB scale)',
        'durability': '99.999999999% (11 nines)',
        'latency': 'Medium (network)',
        'ops_complexity': 'Very Low (managed)',
        'verdict': '✅ Perfect for Lakehouse'
    }
}
```

**S3 as Database? The Challenges:**

```python
# Illustrative pseudocode: the writer/list helpers are hypothetical.

# ❌ Problem 1: No multi-object ACID transactions
# Two simultaneous writers → race condition
writer_1_writes_file('s3://bucket/data/part-001.parquet')
writer_2_writes_file('s3://bucket/data/part-001.parquet')  # Overwrite!

# ❌ Problem 2: No table-level consistency
# S3 has offered strong read-after-write consistency per object since December 2020,
# but a reader listing the prefix mid-job can still see only some of the job's files.
list_objects('s3://bucket/data/')

# ❌ Problem 3: No indexing
# Finding the record with id=X → scan all files (slow!)

# ❌ Problem 4: No schema enforcement
# Nothing prevents someone from writing an incompatible schema

**2. Metadata Layer: The Secret Sauce**

**Delta Lake Transaction Log:**

```
table_path/
├── _delta_log/
│   ├── 00000000000000000000.json  ← Version 0 (CREATE TABLE)
│   ├── 00000000000000000001.json  ← Version 1 (INSERT)
│   ├── 00000000000000000002.json  ← Version 2 (UPDATE)
│   └── 00000000000000000003.json  ← Version 3 (DELETE)
├── part-00000-xxx.parquet
├── part-00001-xxx.parquet
└── part-00002-xxx.parquet

# Contents of 00000000000000000001.json (simplified: the real file stores one JSON action per line):
{
  "commitInfo": {
    "timestamp": 1730246400000,
    "operation": "WRITE",
    "operationMetrics": {
      "numFiles": "2",
      "numOutputRows": "1000"
    }
  },
  "add": {
    "path": "part-00000-xxx.parquet",
    "size": 52428800,
    "partitionValues": {"year": "2025", "month": "10"},
    "dataChange": true,
    "stats": "{\"numRecords\":500,\"minValues\":{\"id\":1},\"maxValues\":{\"id\":500}}"
  }
}
```

**How ACID works:**

```python
# Illustrative pseudocode: the helper functions below are not a real API.

# ✅ Atomicity: all-or-nothing
def write_to_delta(df, table_path):
    # Step 1: write new Parquet data files into the table directory.
    # Readers ignore them because no log entry references them yet.
    new_files = write_parquet_files(df, table_path)  # -> [(path, size), ...]

    # Step 2: build the transaction log entry
    version = get_next_version(table_path)
    log_entry = {
        "add": [{"path": path, "size": size} for path, size in new_files]
    }

    # Step 3: atomic commit (create the JSON file only if it does not exist yet)
    write_json_if_absent(f"{table_path}/_delta_log/{version:020d}.json", log_entry)
    # The data becomes visible only once this file exists.

    # A crash before Step 3 leaves orphaned data files, which readers never see
    # and which VACUUM can remove later.

# ✅ Consistency: schema enforcement
def validate_write(df, existing_schema):
    if df.schema != existing_schema and not compatible(df.schema, existing_schema):
        raise SchemaIncompatibleException()

# ✅ Isolation: optimistic concurrency control
def try_commit(base_version, changes):
    # Writers 1 and 2 both read version 5 and prepare changes based on it.
    # Writer 1 commits first: the table is still at v5, so it publishes v6.
    # Writer 2 then finds the table at v6, so its base is stale.
    if current_version() != base_version:
        raise ConcurrentModificationException()
        # The losing writer retries: read v6, re-apply changes, commit v7.
    commit_version(base_version + 1, changes)

# ✅ Durability: once the commit file is written, the data inherits the
# durability of the object store (S3 is designed for 11 nines).
```
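The commit step above can be tried locally. The following is a minimal sketch, assuming a local directory stands in for `_delta_log/` and that exclusive file creation (`open(..., "x")`) plays the role of the put-if-absent primitive. The function names `commit`, `latest_version`, and `commit_with_retry` are hypothetical and are not part of the Delta Lake API.

```python
import json
import os
import tempfile


def commit(log_dir: str, expected_version: int, actions: list) -> bool:
    """Try to publish `actions` as `expected_version`; False if it already exists."""
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"{expected_version:020d}.json")
    try:
        # Mode "x" fails if the file exists: this is the put-if-absent step.
        with open(path, "x", encoding="utf-8") as fh:
            for action in actions:
                fh.write(json.dumps(action) + "\n")
        return True
    except FileExistsError:
        return False


def latest_version(log_dir: str) -> int:
    """Return the highest committed version, or -1 for an empty log."""
    if not os.path.isdir(log_dir):
        return -1
    versions = [int(n[:-5]) for n in os.listdir(log_dir) if n.endswith(".json")]
    return max(versions, default=-1)


def commit_with_retry(log_dir: str, actions: list, max_attempts: int = 5) -> int:
    """Optimistic loop: read the latest version, then try to claim the next one."""
    for _ in range(max_attempts):
        target = latest_version(log_dir) + 1
        if commit(log_dir, target, actions):
            return target
    raise RuntimeError("too many concurrent commits")


log_dir = os.path.join(tempfile.mkdtemp(), "_delta_log")
v0 = commit_with_retry(log_dir, [{"add": {"path": "part-00000.parquet"}}])

# Two writers both read version 0 and race for version 1: only one wins.
writer_1_ok = commit(log_dir, 1, [{"add": {"path": "part-00001.parquet"}}])
writer_2_ok = commit(log_dir, 1, [{"add": {"path": "part-00002.parquet"}}])

# The losing writer retries against the new latest version.
v_retry = commit_with_retry(log_dir, [{"add": {"path": "part-00002.parquet"}}])
```

A real table format also checks whether the losing writer's changes still make sense against the newer version before retrying; this sketch only shows the version race.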

**Time Travel (Versioning):**

```python
# Read current version
df = spark.read.format("delta").load("s3://bucket/sales")

# Read table version 7 (versionAsOf is a commit number, not an age in days)
df_v7 = spark.read.format("delta").option("versionAsOf", 7).load(...)

# Read version at specific timestamp
df_oct20 = spark.read.format("delta") \
    .option("timestampAsOf", "2025-10-20") \
    .load(...)

# Use cases:
# - Reproducibility: "Show me data exactly as ML model saw it"
# - Auditing: "What changed between yesterday and today?"
# - Rollback: "Undo accidental DELETE"
# - A/B testing: "Compare old vs new transformations"
```
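Conceptually, reading an old version means replaying the log up to that version and keeping only the files that are still live. A minimal sketch of that idea follows, assuming the log is held in memory as a dict of version number to newline-delimited JSON; `files_at_version` is a hypothetical helper, and real readers start from a checkpoint rather than version 0.

```python
import json


def files_at_version(commits: dict, version: int) -> set:
    """Replay add/remove actions from version 0 up to and including `version`."""
    live = set()
    for v in sorted(commits):
        if v > version:
            break
        for line in commits[v].splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# Hypothetical log: v2 rewrites a.parquet as a2.parquet (for example, an UPDATE).
commits = {
    0: '{"add": {"path": "a.parquet"}}\n{"add": {"path": "b.parquet"}}',
    1: '{"add": {"path": "c.parquet"}}',
    2: '{"remove": {"path": "a.parquet"}}\n{"add": {"path": "a2.parquet"}}',
}

snapshot_v1 = files_at_version(commits, 1)
snapshot_v2 = files_at_version(commits, 2)
```

Because old data files stay on storage until they are vacuumed, `snapshot_v1` remains readable after version 2 is committed.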

**3. Catalog Layer: Metadata Management**

```
┌──────────────────────────────────────────────────┐
│              DATA CATALOG                         │
│                                                   │
│  Database: sales_prod                            │
│  ├─ Table: orders                                │
│  │   ├─ Location: s3://bucket/gold/orders/      │
│  │   ├─ Format: delta                            │
│  │   ├─ Schema: {id: bigint, total: double, ...}│
│  │   ├─ Partitions: [year, month]               │
│  │   ├─ Owner: data-team@company.com            │
│  │   ├─ Tags: [PII, critical]                   │
│  │   └─ Last Updated: 2025-10-30 14:30:00       │
│  ├─ Table: customers                             │
│  └─ Table: products                              │
└──────────────────────────────────────────────────┘
```

**Catalog Implementations:**

| Catalog | Provider | Best For | Limitations |
|---------|----------|----------|-------------|
| **Hive Metastore** | Apache | Open-source, Spark native | Single point of failure |
| **AWS Glue Catalog** | AWS | Serverless, AWS-integrated | AWS lock-in |
| **Unity Catalog** | Databricks | Unified governance | Open-sourced in 2024; most features still tied to Databricks |
| **Polaris** | Snowflake (donated to Apache) | Open-source Iceberg catalog | New (2024) |

**Example: Creating Table in Catalog:**

```python
# Spark + Delta Lake + Glue Catalog
spark.sql("""
  CREATE TABLE sales_prod.orders
  USING delta
  LOCATION 's3://bucket/gold/orders/'
  PARTITIONED BY (year, month)
  TBLPROPERTIES (
    'delta.dataSkippingNumIndexedCols' = '5',
    'delta.deletedFileRetentionDuration' = 'interval 7 days',
    'owner' = 'data-team@company.com',
    'pii' = 'true'
  )
  AS SELECT * FROM staging.orders_raw
""")

# Query from any engine that can read the shared catalog
# Athena (SQL):
#   SELECT * FROM sales_prod.orders WHERE year=2025 AND month=10

# Presto (SQL):
#   SELECT * FROM glue.sales_prod.orders WHERE total > 1000

# Spark:
spark.table("sales_prod.orders").filter("year = 2025").show()
```

**Metadata Caching & Performance:**

```python
# Problem: opening every data file to check its contents is slow
# Solution: keep per-file min/max statistics in the metadata layer

# Delta Lake: Stats in transaction log
{
  "add": {
    "path": "part-00000.parquet",
    "stats": "{\"numRecords\":1000,\"minValues\":{\"id\":1,\"date\":\"2025-10-01\"},\"maxValues\":{\"id\":1000,\"date\":\"2025-10-31\"}}"
  }
}

# Query: SELECT * FROM orders WHERE id = 500
# Engine reads stats → knows id ∈ [1, 1000] → scan this file
# Query: SELECT * FROM orders WHERE id = 5000
# Engine reads stats → knows id ∈ [1, 1000] → SKIP this file (data skipping)
```
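The skipping decision described above reduces to a range check per file. The sketch below assumes the statistics have already been parsed from the JSON string into dicts; `may_contain` and `files_to_scan` are hypothetical names, not engine APIs.

```python
def may_contain(stats: dict, column: str, value) -> bool:
    """Return False only when min/max statistics prove the value is absent."""
    lo = stats["minValues"].get(column)
    hi = stats["maxValues"].get(column)
    if lo is None or hi is None:
        return True  # no statistics for this column: the file must be read
    return lo <= value <= hi


def files_to_scan(files: list, column: str, value) -> list:
    """Keep only the files whose statistics cannot rule out the value."""
    return [f["path"] for f in files if may_contain(f["stats"], column, value)]


files = [
    {"path": "part-00000.parquet",
     "stats": {"minValues": {"id": 1}, "maxValues": {"id": 1000}}},
    {"path": "part-00001.parquet",
     "stats": {"minValues": {"id": 1001}, "maxValues": {"id": 2000}}},
    {"path": "part-00002.parquet",  # written without statistics
     "stats": {"minValues": {}, "maxValues": {}}},
]

scan_for_500 = files_to_scan(files, "id", 500)
scan_for_5000 = files_to_scan(files, "id", 5000)
```

Statistics can only exclude files, never confirm a match, so a file without statistics is always scanned.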

---

- Tables stored in a columnar format (Parquet) on object storage.
- Transactions and versioning through metadata layers (Delta/Iceberg/Hudi).
- Central catalog (Glue/Unity/Metastore) and integrated governance.
- Readers: SQL engines (Athena/Trino/Spark) + ML + BI.

## 2. Hands-on: local partitioned Parquet table

### 💾 **Parquet: The Columnar Storage Format**

**Why Parquet and not CSV/JSON?**

**Query Performance Example:**

```python
# Dataset: 1M rows × 100 columns = 100M values
# Query: SELECT AVG(salary) FROM employees WHERE department = 'Engineering'
# (read_csv, read_column, and mean are illustrative helpers; timings are indicative)

# ❌ CSV (Row-based):
salaries = []
for row in read_csv('employees.csv'):  # Reads ALL columns
    if row['department'] == 'Engineering':
        salaries.append(row['salary'])  # Uses only 2 columns

# I/O: Reads 100M values
# Time: ~30 seconds

# ✅ Parquet (Columnar):
departments = read_column('employees.parquet', 'department')
salaries = read_column('employees.parquet', 'salary')
avg = mean([s for d, s in zip(departments, salaries) if d == 'Engineering'])

# I/O: Reads 2M values (only 2 columns)
# Time: ~0.6 seconds (50x less data read)
```

**Parquet File Structure:**

```
file.parquet
├── Header (4 bytes magic: "PAR1")
├── Row Group 1 (default: 128 MB)
│   ├── Column Chunk: id
│   │   ├── Data Pages (compressed)
│   │   └── Statistics (min, max, null_count)
│   ├── Column Chunk: name
│   └── Column Chunk: salary
├── Row Group 2
│   ├── Column Chunk: id
│   ├── Column Chunk: name
│   └── Column Chunk: salary
├── Footer Metadata
│   ├── Schema
│   ├── Column statistics (per row group)
│   └── Compression codec
└── Footer (4 bytes magic: "PAR1")
```

**Compression Codecs:**

```python
df.to_parquet('data.parquet', compression='snappy')

# Benchmark: 1M rows, 10 columns
compression_results = {
    'none': {
        'size_mb': 100,
        'write_s': 2.1,
        'read_s': 1.5,
        'ratio': '1x'
    },
    'snappy': {
        'size_mb': 33,
        'write_s': 2.8,
        'read_s': 1.8,
        'ratio': '3x',
        'verdict': '✅ Default (balance speed/compression)'
    },
    'gzip': {
        'size_mb': 20,
        'write_s': 8.5,
        'read_s': 4.2,
        'ratio': '5x',
        'verdict': '⚠️ Better compression, slower'
    },
    'zstd': {
        'size_mb': 22,
        'write_s': 3.5,
        'read_s': 2.0,
        'ratio': '4.5x',
        'verdict': '✅ Modern (better than snappy)'
    },
    'lz4': {
        'size_mb': 35,
        'write_s': 2.2,
        'read_s': 1.6,
        'ratio': '2.8x',
        'verdict': '⚡ Fastest compression'
    }
}
```

**Encoding Schemes:**

```python
# Plain Encoding (values stored back to back; the fallback when no better encoding applies)
values = [100, 200, 150, 175, 100]
# INT32 values take 4 bytes each, little-endian
encoded = b'\x64\x00\x00\x00\xc8\x00\x00\x00\x96\x00\x00\x00\xaf\x00\x00\x00\x64\x00\x00\x00'

# Dictionary Encoding (for low cardinality)
# Column: ["Engineering", "Sales", "Engineering", "Engineering", "Sales"]
dictionary = ["Engineering", "Sales"]
indices = [0, 1, 0, 0, 1]  # Store indices (1 bit each for 2 distinct values vs up to 11 bytes per string)
# The saving grows with the number of repeated values

# Run-Length Encoding (for repeated values)
# Column: [100, 100, 100, 200, 200, 300]
rle = [(100, 3), (200, 2), (300, 1)]  # (value, count)

# Delta Encoding (for sorted/incremental data)
# Column: [1000, 1001, 1002, 1003, 1004]
delta = [1000, 1, 1, 1, 1]  # Base + deltas
```
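The three lightweight encodings above can be written as small round-trip functions. This is a conceptual sketch only: Parquet's actual encodings are bit-packed binary formats, and the function names here are hypothetical.

```python
from itertools import accumulate, groupby


def dictionary_encode(values):
    """Replace each value with its index in a dictionary of distinct values."""
    dictionary, index_of, indices = [], {}, []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    return [dictionary[i] for i in indices]


def rle_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def rle_decode(pairs):
    return [v for v, count in pairs for _ in range(count)]


def delta_encode(values):
    """Store the first value, then the difference to each previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    # A running sum restores the original values.
    return list(accumulate(deltas))


departments = ["Engineering", "Sales", "Engineering", "Engineering", "Sales"]
dictionary, indices = dictionary_encode(departments)
runs = rle_encode([100, 100, 100, 200, 200, 300])
deltas = delta_encode([1000, 1001, 1002, 1003, 1004])
```

Each encoder has an exact inverse, which is what makes these schemes lossless.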

**Partition Pruning (Data Skipping):**

```
s3://bucket/sales/
├── year=2023/
│   ├── month=01/data.parquet
│   └── month=02/data.parquet
├── year=2024/
│   ├── month=01/data.parquet
│   └── month=12/data.parquet
└── year=2025/
    ├── month=01/data.parquet
    └── month=10/data.parquet  ← Only scan this!

# Query:
SELECT * FROM sales WHERE year=2025 AND month=10

# Without partitioning: Scan 6 files
# With partitioning: Scan 1 file (~83% reduction!)
```
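Partition pruning on a Hive-style layout needs nothing more than the paths themselves. The sketch below, with hypothetical helpers `parse_partition` and `prune`, selects files by their `key=value` path segments without opening any of them.

```python
def parse_partition(path: str) -> dict:
    """Extract key=value segments from a Hive-style path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


def prune(paths: list, **wanted) -> list:
    """Keep paths whose partition values match every requested key.

    Values are compared as strings, so month=1 would not match 'month=01'.
    """
    wanted = {k: str(v) for k, v in wanted.items()}
    return [
        p for p in paths
        if all(parse_partition(p).get(k) == v for k, v in wanted.items())
    ]


paths = [
    "sales/year=2023/month=01/data.parquet",
    "sales/year=2023/month=02/data.parquet",
    "sales/year=2024/month=01/data.parquet",
    "sales/year=2024/month=12/data.parquet",
    "sales/year=2025/month=01/data.parquet",
    "sales/year=2025/month=10/data.parquet",
]

october_2025 = prune(paths, year=2025, month=10)
all_2025 = prune(paths, year=2025)
```

Query engines apply the same idea, but they take the partition values from the catalog or table metadata instead of re-parsing paths on every query.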

**Small Files Problem:**

```python
import time

# ❌ Anti-pattern: Too many small files
for record in stream:
    df = pd.DataFrame([record])
    df.to_parquet(f's3://bucket/data/record_{record["id"]}.parquet')

# Result: 1M files × 1 KB each = Overhead disaster!
# Athena bills $5/TB scanned; each extra file also adds an S3 request and open/list overhead

# ✅ Solution 1: Buffering
def flush(records):
    # time.time_ns() keeps batch file names unique
    pd.DataFrame(records).to_parquet(f's3://bucket/data/batch_{time.time_ns()}.parquet')

buffer = []
for record in stream:
    buffer.append(record)
    if len(buffer) >= 10000:
        flush(buffer)
        buffer = []
if buffer:  # flush the final partial batch
    flush(buffer)

# ✅ Solution 2: Compaction (OPTIMIZE in Delta)
spark.sql("OPTIMIZE sales_table")
# Merges small files into 128MB-1GB files
```
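Compaction starts with a planning step: deciding which small files to merge together. A minimal sketch of that step follows, assuming file sizes are already known; `plan_compaction` is a hypothetical helper using a simple greedy grouping, not the algorithm behind `OPTIMIZE`.

```python
def plan_compaction(file_sizes: dict, target_bytes: int) -> list:
    """Group small files into batches whose combined size stays <= target_bytes."""
    batches, current, current_size = [], [], 0
    for path, size in sorted(file_sizes.items()):
        if size >= target_bytes:
            continue  # already large enough, leave it alone
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    # Rewriting a single file gains nothing, so drop one-file batches.
    return [batch for batch in batches if len(batch) > 1]


MB = 1024 ** 2
file_sizes = {
    "a.parquet": 40 * MB,
    "b.parquet": 50 * MB,
    "c.parquet": 60 * MB,
    "d.parquet": 200 * MB,  # above the target, skipped
    "e.parquet": 10 * MB,
}

plan = plan_compaction(file_sizes, target_bytes=128 * MB)
```

Each batch in `plan` would then be read, concatenated, and written back as one file, with the originals removed in the same commit.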

**Partitioning Strategies:**

```python
# ❌ Over-partitioning (too granular)
df.write.partitionBy("year", "month", "day", "hour", "customer_id") \
    .parquet("s3://bucket/sales")
# Result: Millions of partitions → slow metadata operations

# ✅ Balanced partitioning
df.write.partitionBy("year", "month") \
    .parquet("s3://bucket/sales")
# Target: 128MB-1GB per partition
# Rule: Partition columns used in 80%+ of queries

# ⚡ Z-ordering (Delta Lake optimization)
spark.sql("OPTIMIZE sales_table ZORDER BY (customer_id, product_id)")
# Co-locates related data for better data skipping
```

**Schema Evolution:**

```python
# V1: Initial schema
df_v1 = pd.DataFrame({
    'id': [1, 2],
    'name': ['Alice', 'Bob']
})
df_v1.to_parquet('users_v1.parquet')

# V2: Add column (compatible)
df_v2 = pd.DataFrame({
    'id': [3, 4],
    'name': ['Charlie', 'Diana'],
    'email': ['c@x.com', 'd@x.com']  # New column
})
df_v2.to_parquet('users_v2.parquet')

# Reading both files:
df = pd.concat([
    pd.read_parquet('users_v1.parquet'),  # has no 'email' column
    pd.read_parquet('users_v2.parquet')
])
# pd.concat aligns columns by name, so rows from v1 get NaN for 'email'

# ❌ Breaking change: Rename/delete column
df_v3 = pd.DataFrame({
    'user_id': [5, 6],  # Renamed from 'id'
    'full_name': ['Eve', 'Frank']  # Renamed from 'name'
})
# This will break queries expecting 'id' column!
```
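The difference between a compatible and a breaking change can be checked mechanically before a write. The following sketch assumes a schema is represented as a plain dict of column name to type name; `schema_changes` and `is_backward_compatible` are hypothetical helpers, and the rule used (additions are safe, removals and type changes are not) is a deliberately strict simplification.

```python
def schema_changes(old: dict, new: dict) -> dict:
    """Classify the differences between two {column: type} schemas."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(c for c in set(old) & set(new) if old[c] != new[c]),
    }


def is_backward_compatible(old: dict, new: dict) -> bool:
    """Existing readers keep working if no column disappears or changes type."""
    changes = schema_changes(old, new)
    return not changes["removed"] and not changes["retyped"]


v1 = {"id": "int64", "name": "string"}
v2 = {"id": "int64", "name": "string", "email": "string"}  # adds a column
v3 = {"user_id": "int64", "full_name": "string"}           # renames both columns

v2_ok = is_backward_compatible(v1, v2)
v3_ok = is_backward_compatible(v1, v3)
v3_changes = schema_changes(v1, v3)
```

A rename shows up as one removal plus one addition, which is why name-based formats treat it as breaking unless columns are tracked by ID.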

**Parquet + Pandas/Polars:**

```python
import pandas as pd
import pyarrow.parquet as pq

# Read specific columns (projection)
df = pd.read_parquet('sales.parquet', columns=['id', 'total'])

# Read with filters (predicate pushdown)
df = pd.read_parquet('sales.parquet', 
                     filters=[('year', '=', 2025), ('total', '>', 1000)])

# Read row groups (chunked reading for large files)
parquet_file = pq.ParquetFile('sales.parquet')
for batch in parquet_file.iter_batches(batch_size=10000):
    df = batch.to_pandas()
    process(df)

# Read a partition directory from S3 directly (requires s3fs; glob patterns are not expanded)
df = pd.read_parquet('s3://bucket/sales/year=2025/month=10/')
```

**Monitoring Parquet Health:**

```python
import os

import pyarrow.parquet as pq

def analyze_parquet_file(path):
    pf = pq.ParquetFile(path)
    metadata = pf.metadata
    file_size_mb = os.path.getsize(path) / (1024**2)
    num_row_groups = metadata.num_row_groups

    metrics = {
        'num_row_groups': num_row_groups,
        'num_rows': metadata.num_rows,
        'num_columns': metadata.num_columns,
        'file_size_mb': file_size_mb,
        # Guard against empty files, which have zero row groups
        'avg_row_group_size_mb': file_size_mb / max(num_row_groups, 1),
        # The codec is recorded per column chunk, not in the schema metadata
        'compression': (metadata.row_group(0).column(0).compression
                        if num_row_groups else 'unknown'),
    }
    
    # Health checks
    if metrics['avg_row_group_size_mb'] < 64:
        print("⚠️ Row groups too small (target: 128MB)")
    
    if metrics['file_size_mb'] < 10:
        print("⚠️ File too small (target: >100MB)")
    
    if metrics['num_row_groups'] > 10:
        print("⚠️ Too many row groups (consider repartitioning)")
    
    return metrics
```

---

```python
import os, time
from pathlib import Path
import pandas as pd

BASE = Path('datasets/processed/lakehouse_demo')
(BASE).mkdir(parents=True, exist_ok=True)

df = pd.DataFrame({
  'id':[1,2,3,4],
  'fecha':['2025-10-01','2025-10-02','2025-10-02','2025-10-03'],
  'producto_id':[101,102,101,103],
  'cantidad':[1,2,1,3],
  'precio':[100.0,50.0,100.0,20.0]
})
df['total'] = df['cantidad'] * df['precio']
df['anio'] = pd.to_datetime(df['fecha']).dt.year
df['mes'] = pd.to_datetime(df['fecha']).dt.strftime('%Y-%m')

for (anio, mes), part in df.groupby(['anio','mes']):
    part_dir = BASE / f'anio={anio}' / f'mes={mes}'
    part_dir.mkdir(parents=True, exist_ok=True)
    fp = part_dir / f'ventas_{int(time.time())}.parquet'
    part.to_parquet(fp, index=False)
str(BASE)
```

### 2.1 Partitioned reads and manual pruning

```python
parts = list((BASE / 'anio=2025' / 'mes=2025-10').glob('*.parquet'))
pd.concat([pd.read_parquet(p) for p in parts]).head()
```

## 3. Delta Lake/Iceberg: how and when

### ⚡ **Delta Lake vs Apache Iceberg: Deep Dive**

**When to Choose Which?**

```python
decision_matrix = {
    'Use Delta Lake if': [
        'Already using Databricks',
        'Primary engine is Spark',
        'Need mature ecosystem (more tools)',
        'Streaming workloads (Spark Structured Streaming)',
        'Team familiar with Delta Lake'
    ],
    'Use Apache Iceberg if': [
        'Multi-engine strategy (Spark + Trino + Flink + Athena)',
        'AWS-centric (Athena native support)',
        'Need hidden partitioning (automatic partition management)',
        'Open governance important (Apache vs vendor)',
        'Future-proofing (growing adoption)'
    ],
    'Use Apache Hudi if': [
        'Heavy upsert/CDC workloads',
        'Record-level updates critical',
        'Uber-style use case'
    ]
}
```

**Feature Comparison:**

**1. ACID Transactions:**

```python
# Delta Lake: Optimistic Concurrency Control
@transaction
def update_delta_table():
    current_version = read_version()  # v5
    # ... perform transformation ...
    try:
        commit_new_version(v6, based_on=v5)
    except ConflictException:
        # Another writer committed v6 first
        retry_with_new_base(v6)

# Apache Iceberg: Snapshot Isolation
@transaction
def update_iceberg_table():
    snapshot_id = current_snapshot()
    # ... perform transformation ...
    new_snapshot = create_snapshot(changes)
    atomic_swap(current_snapshot, new_snapshot)
    # Readers on old snapshot unaffected
```

**2. Time Travel:**

```python
# Delta Lake
# Files: _delta_log/00000000000000000005.json
df = spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("s3://bucket/orders")

df = spark.read.format("delta") \
    .option("timestampAsOf", "2025-10-20 00:00:00") \
    .load("s3://bucket/orders")

# Apache Iceberg
# Files: metadata/snap-xxxx-1-yy.avro
df = spark.read.format("iceberg") \
    .option("snapshot-id", 12345678) \
    .load("catalog.db.orders")

df = spark.read.format("iceberg") \
    .option("as-of-timestamp", "1730246400000") \
    .load("catalog.db.orders")

# Both formats resolve an old version from metadata without scanning data files:
# Delta: load the nearest checkpoint, then replay the JSON commits after it
# Iceberg: look up the snapshot directly in the table metadata
```

**3. Schema Evolution:**

```python
# Delta Lake
# ✅ ADD COLUMN (supported)
spark.sql("ALTER TABLE orders ADD COLUMNS (discount DOUBLE)")

# ⚠️ RENAME COLUMN (direct rename requires column mapping, 'delta.columnMapping.mode' = 'name')
# Manual migration otherwise:
# Step 1: Add new column
spark.sql("ALTER TABLE orders ADD COLUMNS (customer_email STRING)")
# Step 2: Backfill
spark.sql("UPDATE orders SET customer_email = email")
# Step 3: Drop old column (DROP COLUMN itself also requires column mapping)
spark.sql("ALTER TABLE orders DROP COLUMN email")

# ❌ CHANGE COLUMN TYPE (generally not supported in place)
# Usually means rewriting the table

# Apache Iceberg
# ✅ ADD COLUMN
spark.sql("ALTER TABLE orders ADD COLUMN discount double")

# ✅ RENAME COLUMN (supported!)
spark.sql("ALTER TABLE orders RENAME COLUMN email TO customer_email")

# ✅ DROP COLUMN
spark.sql("ALTER TABLE orders DROP COLUMN discount")

# ⚠️ CHANGE TYPE (limited support)
spark.sql("ALTER TABLE orders ALTER COLUMN id TYPE bigint")  # int → bigint OK
# int → string: Not supported (data loss risk)
```

**4. Partition Evolution:**

```python
# Delta Lake: Manual partition management
# Initial: Partitioned by date
df.write.partitionBy("date").format("delta").save("orders")

# Later: Want to partition by date + country
# ❌ Problem: Must rewrite all data
df_all = spark.read.format("delta").load("orders")
df_all.write.partitionBy("date", "country") \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("orders_new")

# Apache Iceberg: Hidden partitions (automatic!)
# Initial: Partitioned by date
spark.sql("""
    CREATE TABLE orders (
        id bigint,
        date date,
        country string,
        total double
    )
    USING iceberg
    PARTITIONED BY (days(date))
""")

# Later: Change partitioning strategy (NO data rewrite!)
spark.sql("ALTER TABLE orders DROP PARTITION FIELD days(date)")
spark.sql("ALTER TABLE orders ADD PARTITION FIELD bucket(10, country)")

# Iceberg tracks partition evolution in metadata
# Old data: date partitions
# New data: country buckets
# Queries work seamlessly across both!
```
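Hidden partitioning works because the partition value is derived from a column by a transform, so queries filter on the column and never on a physical partition column. The sketch below illustrates the `days` and `bucket` transforms under stated assumptions: the function names are hypothetical, and `zlib.crc32` is only a stand-in for the 32-bit Murmur3 hash that Iceberg specifies, so the bucket numbers will not match a real Iceberg table.

```python
import zlib
from datetime import date

EPOCH = date(1970, 1, 1)


def days_transform(d: date) -> int:
    """Days since the Unix epoch, the value a days(date) partition stores."""
    return (d - EPOCH).days


def bucket_transform(value: str, num_buckets: int) -> int:
    """Hash a value into one of num_buckets buckets (crc32 used as a stand-in)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def partition_for(row: dict, spec: list) -> tuple:
    """Apply a partition spec, given as (field_name, transform) pairs, to a row."""
    return tuple(transform(row[field]) for field, transform in spec)


# Two specs for the same table: rows written before and after the evolution.
spec_v1 = [("date", days_transform)]
spec_v2 = [("country", lambda c: bucket_transform(c, 10))]

row = {"id": 1, "date": date(2025, 10, 20), "country": "ES", "total": 99.0}
old_partition = partition_for(row, spec_v1)
new_partition = partition_for(row, spec_v2)
```

Because each data file records which spec it was written with, a planner can apply the matching transform per file and prune across both layouts.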

**5. File Management (VACUUM/OPTIMIZE):**

```python
# Delta Lake
# OPTIMIZE: Compact small files
spark.sql("OPTIMIZE orders")
# Target: 1GB files

# Z-ORDER: Co-locate data
spark.sql("OPTIMIZE orders ZORDER BY (customer_id, product_id)")

# VACUUM: Delete old files
spark.sql("VACUUM orders RETAIN 168 HOURS")  # Keep 7 days
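# Deletes data files that the table no longer references and that are older
# than the retention window (for example, files replaced by OPTIMIZE).
# Transaction log files are not removed by VACUUM; they expire separately.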

# Apache Iceberg
# REWRITE DATA FILES (equivalent to OPTIMIZE)
spark.sql("""
    CALL catalog.system.rewrite_data_files(
        table => 'db.orders',
        options => map(
            'target-file-size-bytes', '1073741824',  -- 1GB
            'min-input-files', '5'
        )
    )
""")

# EXPIRE SNAPSHOTS (equivalent to VACUUM)
spark.sql("""
    CALL catalog.system.expire_snapshots(
        table => 'db.orders',
        older_than => TIMESTAMP '2025-10-23 00:00:00',
        retain_last => 7
    )
""")

# REMOVE ORPHAN FILES
spark.sql("""
    CALL catalog.system.remove_orphan_files(
        table => 'db.orders',
        older_than => TIMESTAMP '2025-10-20 00:00:00'
    )
""")
```
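The rule behind both `VACUUM` and orphan-file removal is the same: delete a file only if the table no longer references it and it is older than the retention window. A minimal sketch of that selection follows, assuming modification times are given as Unix seconds; `vacuum_candidates` is a hypothetical helper.

```python
def vacuum_candidates(all_files: dict, referenced: set, now: float,
                      retention_hours: float = 168) -> list:
    """Return unreferenced files whose modification time is before the cutoff."""
    cutoff = now - retention_hours * 3600
    return sorted(
        path for path, mtime in all_files.items()
        if path not in referenced and mtime < cutoff
    )


now = 1_000_000.0
all_files = {
    "live.parquet": 0.0,                              # old but still referenced
    "old_unreferenced.parquet": now - 200 * 3600,     # older than 7 days
    "recent_unreferenced.parquet": now - 3600,        # may belong to an open write
}
referenced = {"live.parquet"}

to_delete = vacuum_candidates(all_files, referenced, now)
```

The retention window protects two things: time travel to recent versions, and in-flight writers whose files exist on storage but are not committed yet.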

**6. Multi-Engine Support:**

| Engine | Delta Lake | Iceberg |
|--------|------------|---------|
| **Spark** | ✅ Native | ✅ Native |
| **Presto/Trino** | ✅ Connector | ✅ Native |
| **Flink** | ⚠️ Connector (limited) | ✅ Native |
| **AWS Athena** | ✅ Native reads (engine v3), manifests on older engines | ✅ Native |
| **Dremio** | ✅ | ✅ |
| **Snowflake** | ⚠️ Via external tables | ✅ Iceberg Tables |

**7. Metadata Performance:**

```python
# Scenario: Table with 10,000 partitions (all figures below are illustrative)

# Delta Lake:
# - Metadata: JSON commits plus periodic Parquet checkpoints in _delta_log/
# - List partitions: read the latest checkpoint, then replay later commits
# - Overhead: ~100MB metadata for large tables

# Iceberg:
# - Metadata: Avro files with manifest list → manifest files
# - List partitions: partition summaries in the manifest list let whole manifests be skipped
# - Overhead: ~10MB metadata for same table

# Indicative numbers, not a reproducible benchmark: SHOW PARTITIONS
# Delta: 15 seconds
# Iceberg: 0.5 seconds (30x faster!)
```

**8. Streaming Support:**

```python
# Delta Lake: Spark Structured Streaming native
stream_df = spark.readStream \
    .format("delta") \
    .load("orders")

stream_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints") \
    .start("orders_processed")

# Apache Iceberg: Flink + Spark Streaming
# Flink (primary streaming engine for Iceberg), shown here with PyFlink
from pyflink.table import EnvironmentSettings, TableEnvironment

table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
table_env.execute_sql("""
    CREATE TABLE orders (...)
    WITH ('connector' = 'iceberg', ...)
""")

table_env.execute_sql("""
    INSERT INTO orders_processed
    SELECT * FROM orders
    WHERE total > 1000
""")

# Spark Streaming (also supported)
spark.readStream \
    .format("iceberg") \
    .load("orders")
```

**Illustrative Migration Plan (Delta Lake → Iceberg):**

```python
# Hypothetical company; the durations, cost, and gains below are illustrative

# Phase 1: Dual-write (6 months)
df.write.format("delta").save("s3://bucket/delta/orders")
df.write.format("iceberg").save("s3://bucket/iceberg/orders")

# Phase 2: Validate (3 months)
delta_count = spark.read.format("delta").load(...).count()
iceberg_count = spark.read.format("iceberg").load(...).count()
assert delta_count == iceberg_count

# Phase 3: Cutover (1 week)
# Redirect readers to Iceberg
# Stop Delta writes

# Phase 4: Cleanup (1 month)
# Delete Delta files
# Reclaim storage

# Total timeline: 10 months
# Cost: ~$50K engineering effort
# Benefit: 30% query performance improvement
```
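Matching row counts in the validation phase do not prove that the contents match. A slightly stronger check is an order-independent fingerprint of the rows, sketched below with rows as plain dicts. `table_fingerprint` is a hypothetical helper; on large tables the same idea would be expressed as an aggregate inside the query engine rather than in Python.

```python
import hashlib


def table_fingerprint(rows) -> tuple:
    """Return (row_count, checksum), independent of row and column order."""
    count = 0
    checksum = 0
    for row in rows:
        # Sort the columns so that key order does not affect the hash.
        canonical = repr(sorted(row.items())).encode("utf-8")
        row_hash = int.from_bytes(hashlib.sha256(canonical).digest()[:8], "big")
        # Summing (rather than XOR) keeps duplicate rows from cancelling out.
        checksum = (checksum + row_hash) % 2**64
        count += 1
    return count, checksum


delta_rows = [
    {"id": 1, "total": 100.0},
    {"id": 2, "total": 50.0},
    {"id": 3, "total": 20.0},
]
iceberg_rows = list(reversed(delta_rows))            # same data, other order
drifted_rows = delta_rows[:2] + [{"id": 3, "total": 21.0}]

same = table_fingerprint(delta_rows) == table_fingerprint(iceberg_rows)
drift = table_fingerprint(delta_rows) == table_fingerprint(drifted_rows)
```

Here `drifted_rows` has the same count as `delta_rows`, so a count-only assertion would pass while the fingerprints differ.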

---

- Delta Lake adds ACID, time travel, and MERGE INTO on top of Parquet.
- Iceberg optimizes metadata management, hidden partitioning, and schema evolution.
- Both require an engine such as Spark/Trino/Flink and a catalog (Glue/REST).
- Cost/benefit: evaluate volume, concurrency, latency, and SLA.

### 3.1 Delta Lake example (PySpark reference) [optional]

```python
delta_demo = r'''
from pyspark.sql import SparkSession
spark = (SparkSession.builder
    .appName('DeltaDemo')
    .config('spark.sql.extensions','io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog','org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .getOrCreate())

df = spark.read.parquet('s3://bucket/curated/ventas/')
df.write.format('delta').mode('overwrite').save('s3://bucket/delta/ventas/')

delta = spark.read.format('delta').load('s3://bucket/delta/ventas/')
delta.createOrReplaceTempView('ventas')
spark.sql("SELECT mes, SUM(total) FROM ventas GROUP BY mes").show()
'''
print(delta_demo.splitlines()[:15])
```

## 4. Lakehouse best practices

- Define data contracts and schema versioning.
- Manage file sizes and compaction (OPTIMIZE/VACUUM).
- Cataloging and per-domain access policies (Data Mesh).
- Observability and lineage (OpenLineage/Marquez, DataHub).