# ☁️ Azure for Data Engineering: ADLS, Synapse, Data Factory, and Databricks

This notebook introduces the Microsoft Azure ecosystem for data engineering: storage with ADLS Gen2, transformations with Synapse Analytics and Azure Data Factory, and distributed compute with Azure Databricks.

**Author:** LuisRai (Luis J. Raigoso V.)  
**Level:** Mid  
**Duration:** 90-120 minutes

## ⚠️ IMPORTANT REMINDER

### 🚨 NOTEBOOKS vs PRODUCTION

This course uses notebooks for **teaching**, but in your real work:

**❌ Do NOT use notebooks for production pipelines**

**✅ USE:**
- Modular Python scripts in `src/`
- Azure Data Factory for orchestration
- Azure DevOps for CI/CD
- Azure Functions for event-driven workloads
- ARM Templates or Bicep for IaC

---

**Author:** LuisRai (Luis J. Raigoso V.) | © 2024-2025

---

## Requirements and Execution Notes

- To run real Azure code you need:
  - An Azure subscription (free tier or pay-as-you-go)
  - Azure CLI installed (`az login`)
  - A Service Principal or Managed Identity configured
  - Resources already created: Storage Account, Synapse Workspace, etc.
- **Never commit credentials to the repository**
- Use Azure Key Vault for secrets
- The examples in this notebook are runnable once Azure is configured

### ☁️ **Microsoft Azure: Enterprise Ecosystem for Data**

**Modern Azure Stack for Data Engineering:**

1. **Azure Data Lake Storage Gen2 (ADLS Gen2)**: Object storage optimized for analytics
   - Built on Blob Storage with a hierarchical namespace
   - HDFS-compatible (Hadoop File System)
   - POSIX ACLs (fine-grained permissions)
   - Pricing: ~$0.018/GB/month (Hot tier)

2. **Azure Synapse Analytics**: Unified data warehouse + analytics
   - SQL Pools (dedicated MPP, formerly Azure SQL DW)
   - Serverless SQL (on-demand queries)
   - Spark Pools (managed Apache Spark)
   - Built-in pipelines (same model as Data Factory)
   - Pricing: from $1.20/hour (DW100c) or $5/TB scanned (serverless)

3. **Azure Data Factory (ADF)**: ETL/ELT orchestration
   - 90+ native connectors
   - Data Flows (visual transformations)
   - Integration Runtime (on-prem connectivity)
   - Pricing: per activity run + data movement

4. **Azure Databricks**: Optimized Apache Spark
   - Optimized runtime (Photon engine)
   - Built-in Delta Lake
   - Collaborative notebooks
   - MLflow for the ML lifecycle
   - Pricing: DBU (Databricks Unit) + VM cost

5. **Azure Functions**: Event-driven serverless compute
   - Triggers: HTTP, Blob Storage, Event Hubs, Timer
   - Runtimes: Python 3.7-3.11, .NET, Node.js, Java
   - Consumption plan: pay only for executions
   - Pricing: first 1M executions per month free
**Reference Architecture:**
```
Sources → [Event Hubs] → ADLS Gen2 (raw) → [Data Factory/Databricks] 
                              ↓
                    ADLS Gen2 (curated) → Synapse (analytics)
                              ↓
                    Power BI (visualization)
```

**Advantages of Azure for Data:**
- **Microsoft integration**: Active Directory, Office 365, Power BI
- **Hybrid**: Azure Arc for unified on-prem + cloud management
- **Security**: Broad compliance coverage (ISO, HIPAA, FedRAMP)
- **Enterprise-grade**: 99.9% SLA, 24/7 support

**Comparison with AWS/GCP:**

| Service | Azure | AWS | GCP |
|----------|-------|-----|-----|
| Object Storage | ADLS Gen2 | S3 | Cloud Storage |
| Data Warehouse | Synapse (dedicated) | Redshift | BigQuery |
| DW Serverless | Synapse Serverless | Athena | BigQuery |
| ETL | Data Factory | Glue | Dataflow |
| Spark Managed | Databricks | EMR | Dataproc |
| Streaming | Event Hubs | Kinesis | Pub/Sub |
| Orchestration | Data Factory | Step Functions | Cloud Composer |
| Functions | Azure Functions | Lambda | Cloud Functions |

---

## 1. Azure Data Lake Storage Gen2 (ADLS Gen2)

### 🗄️ **ADLS Gen2: Fundamentals**

**What is ADLS Gen2?**

ADLS Gen2 = Azure Blob Storage + Hierarchical Namespace + HDFS compatibility

**Key Features:**

1. **Hierarchical Namespace (HNS):**
   - Real directories (not key prefixes as in S3)
   - Atomic operations (renaming a directory is a single metadata operation)
   - POSIX ACLs per directory/file

2. **Storage Tiers:**
   - **Hot**: Frequent access (~$0.018/GB/month)
   - **Cool**: Infrequent access (data kept for at least 30 days, $0.01/GB/month)
   - **Archive**: Long-term storage ($0.002/GB/month)

3. **Security:**
   - Azure AD authentication
   - RBAC (Role-Based Access Control)
   - ACLs (Access Control Lists)
   - Encryption at rest (Microsoft-managed or customer-managed keys)

**Recommended Layout:**

```
container: data-lake
├── raw/                     ← Raw data
│   ├── sales/
│   │   └── 2025/10/30/
│   └── customers/
├── staging/                 ← In-process data
└── curated/                 ← Clean data
    ├── sales_aggregated/
    └── customer_360/
```
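
The date-partitioned layout above can be generated consistently from code. The following is a minimal sketch; `lake_path` and its zone check are hypothetical helpers introduced for illustration, not part of any Azure SDK.

```python
from datetime import date

# Zones taken from the layout above
ZONES = {"raw", "staging", "curated"}


def lake_path(zone: str, dataset: str, day: date) -> str:
    """Build a date-partitioned folder path such as raw/sales/2025/10/30/."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    # %Y/%m/%d yields zero-padded year/month/day folders
    return f"{zone}/{dataset}/{day:%Y/%m/%d}/"


print(lake_path("raw", "sales", date(2025, 10, 30)))
```

Centralizing path construction in one function helps keep writers and readers of the lake agreeing on the same folder convention.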

**Lifecycle Management:**
```json
{
  "rules": [
    {
      "name": "MoveToCool",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "actions": {
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 30
            },
            "tierToArchive": {
              "daysAfterModificationGreaterThan": 90
            }
          }
        },
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["data-lake/raw/"]
        }
      }
    }
  ]
}
```
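
To reason about what this policy does to a given blob, the age thresholds can be expressed as a small function. This is only an illustrative sketch of the rule logic (strictly greater than 30 days moves to Cool, strictly greater than 90 days moves to Archive); `target_tier` is a hypothetical name, and the real evaluation is performed by the storage service.

```python
def target_tier(days_since_modified: int,
                cool_after: int = 30,
                archive_after: int = 90) -> str:
    """Return the tier the lifecycle rule would target for a blob of this age."""
    # "daysAfterModificationGreaterThan" is a strict comparison
    if days_since_modified > archive_after:
        return "Archive"
    if days_since_modified > cool_after:
        return "Cool"
    return "Hot"


for age in (10, 30, 31, 90, 91):
    print(age, target_tier(age))
```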

**Operations with Python:**
```python
from azure.storage.filedatalake import DataLakeServiceClient
from azure.identity import DefaultAzureCredential

# Authenticate with Azure AD
credential = DefaultAzureCredential()
service_client = DataLakeServiceClient(
    account_url="https://mystorageaccount.dfs.core.windows.net",
    credential=credential
)

# Create the filesystem (container); raises if it already exists
file_system_client = service_client.create_file_system(file_system="data-lake")

# Upload a file
file_client = file_system_client.get_file_client("raw/ventas/2025_10.csv")
with open("ventas.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)

# List files
paths = file_system_client.get_paths(path="raw/")
for path in paths:
    print(path.name)

# Download the file
download = file_client.download_file()
with open("downloaded.csv", "wb") as f:
    f.write(download.readall())
```

**ACLs POSIX:**
```python
# Set permissions on a directory (reuses file_system_client from above)
directory_client = file_system_client.get_directory_client("raw/sales")
acl = 'user::rwx,group::r-x,other::---'
directory_client.set_access_control(acl=acl)

# Grant read access to a specific user.
# Named entries reference the Azure AD object ID of the principal;
# <object-id> is a placeholder to replace with a real ID.
acl = 'user::rwx,user:<object-id>:r--,group::r-x,other::---'
directory_client.set_access_control(acl=acl)
```
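
ACL strings like the ones above are easy to mistype. The sketch below assembles and validates them before they are passed to `set_access_control`; `build_acl` is a hypothetical helper written for this notebook, and the principal IDs are whatever the caller supplies.

```python
import re
from typing import Dict, Optional

# Each permission triple is r or -, w or -, x or -
PERMS = re.compile(r"[r-][w-][x-]")


def build_acl(owner: str, group: str, other: str,
              named_users: Optional[Dict[str, str]] = None) -> str:
    """Assemble a POSIX-style ACL string in the order used in the examples above."""
    entries = [("user", "", owner)]
    for principal, perms in (named_users or {}).items():
        entries.append(("user", principal, perms))
    entries += [("group", "", group), ("other", "", other)]
    for _, _, perms in entries:
        if not PERMS.fullmatch(perms):
            raise ValueError(f"invalid permission triple: {perms!r}")
    return ",".join(f"{kind}:{who}:{perms}" for kind, who, perms in entries)


print(build_acl("rwx", "r-x", "---"))
print(build_acl("rwx", "r-x", "---", {"<object-id>": "r--"}))
```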

**Costs:**
- Storage Hot: $0.0184/GB/month
- Storage Cool: $0.01/GB/month
- Operations (read): $0.004 per 10,000
- Operations (write): $0.065 per 10,000
- Data transfer (egress): first 5 GB free, then $0.087/GB
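
A rough monthly estimate can be derived from these list prices. The sketch below hard-codes the figures quoted in this notebook (actual prices vary by region and over time) and ignores egress; `monthly_storage_cost` is a hypothetical helper.

```python
# Prices as quoted in this notebook (USD); treat them as illustrative
PRICE_PER_GB_MONTH = {"hot": 0.0184, "cool": 0.01, "archive": 0.002}
READ_PER_10K = 0.004
WRITE_PER_10K = 0.065


def monthly_storage_cost(gb_by_tier: dict, reads: int = 0, writes: int = 0) -> float:
    """Estimate monthly cost from GB stored per tier plus read/write operations."""
    storage = sum(PRICE_PER_GB_MONTH[tier] * gb for tier, gb in gb_by_tier.items())
    operations = reads / 10_000 * READ_PER_10K + writes / 10_000 * WRITE_PER_10K
    return round(storage + operations, 2)


estimate = monthly_storage_cost(
    {"hot": 1_000, "cool": 5_000, "archive": 20_000},
    reads=1_000_000,
    writes=100_000,
)
print(f"Estimated monthly cost: ${estimate}")
```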

---

In [None]:
# Configuration (placeholder values; nothing here calls Azure)
STORAGE_ACCOUNT = 'mydatalakestorage'
CONTAINER_NAME = 'data-lake'
TENANT_ID = 'your-tenant-id'
SUBSCRIPTION_ID = 'your-subscription-id'

print(f'📋 Storage Account: {STORAGE_ACCOUNT}')
print(f'📦 Container: {CONTAINER_NAME}')
print('⚠️ Requires: az login or a configured service principal')

### 1.1 Create a container and upload data

In [None]:
# Example code (uncomment once Azure is configured)
'''
from azure.core.exceptions import ResourceExistsError
from azure.storage.filedatalake import DataLakeServiceClient
from azure.identity import DefaultAzureCredential
import pandas as pd

# Authentication
credential = DefaultAzureCredential()
service_client = DataLakeServiceClient(
    account_url=f"https://{STORAGE_ACCOUNT}.dfs.core.windows.net",
    credential=credential
)

# Create the filesystem, or reuse it if it already exists
try:
    file_system_client = service_client.create_file_system(file_system=CONTAINER_NAME)
    print(f'✅ Container {CONTAINER_NAME} created')
except ResourceExistsError:
    print(f'ℹ️ Container {CONTAINER_NAME} already exists')
    file_system_client = service_client.get_file_system_client(file_system=CONTAINER_NAME)

# Upload CSV
df = pd.read_csv('../../datasets/raw/ventas.csv')
csv_string = df.to_csv(index=False)

file_client = file_system_client.get_file_client("raw/ventas/ventas_2025_10.csv")
file_client.upload_data(csv_string, overwrite=True)
print('📤 File uploaded to ADLS Gen2')
'''
print('Example code ready to run once Azure is configured')

## 2. Azure Synapse Analytics

### 📊 **Synapse Analytics: Unified Analytics Platform**

**What is Synapse?**

Synapse unifies:
- Data Warehouse (SQL Pools)
- Big Data (Spark Pools)
- Data Integration (Pipelines)
- Visualization (Power BI integration)

**Components:**

1. **SQL Pools (Dedicated):**
   - MPP (Massively Parallel Processing)
   - 60 distributions
   - Columnstore indexes
   - Pause/resume to save costs
   - Pricing: DWU (Data Warehouse Units), from $1.20/hour (DW100c)

2. **Serverless SQL Pool:**
   - On-demand queries over ADLS Gen2
   - No infrastructure to manage
   - Standard SQL (T-SQL)
   - Pricing: $5/TB scanned

3. **Spark Pools:**
   - Apache Spark 3.x
   - Auto-scaling (min-max nodes)
   - Integrated notebooks (Python, Scala, SQL)
   - Delta Lake support
   - Pricing: per vCore-hour

**Serverless SQL: Query ADLS directly**

```sql
-- External table over CSV files in ADLS (run in a user database, not master)
CREATE EXTERNAL DATA SOURCE adls_datasource
WITH (
    LOCATION = 'https://mystorageaccount.dfs.core.windows.net/data-lake'
);

CREATE EXTERNAL FILE FORMAT csv_format
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        FIRST_ROW = 2
    )
);

CREATE EXTERNAL TABLE ventas_external (
    venta_id INT,
    cliente_id INT,
    producto_id INT,
    cantidad INT,
    total FLOAT,
    fecha DATE
)
WITH (
    LOCATION = '/raw/ventas/',
    DATA_SOURCE = adls_datasource,
    FILE_FORMAT = csv_format
);

-- Query (you pay only for the data scanned)
SELECT 
    cliente_id,
    SUM(total) as total_ventas
FROM ventas_external
WHERE fecha >= '2025-10-01'
GROUP BY cliente_id;
```

**Dedicated SQL Pool: Optimizaciones**

```sql
-- Table with HASH distribution
CREATE TABLE ventas (
    venta_id INT,
    cliente_id INT,
    total DECIMAL(10,2),
    fecha DATE
)
WITH (
    DISTRIBUTION = HASH(cliente_id),
    CLUSTERED COLUMNSTORE INDEX
);

-- CTAS (Create Table As Select) to load data
CREATE TABLE ventas_2025
WITH (
    DISTRIBUTION = HASH(cliente_id),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM ventas_external
WHERE YEAR(fecha) = 2025;

-- Statistics (critical for query performance)
CREATE STATISTICS stats_cliente ON ventas(cliente_id);
CREATE STATISTICS stats_fecha ON ventas(fecha);
```
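
To see why the distribution column matters, the sketch below simulates how rows spread over the 60 distributions when the key is well spread versus heavily skewed. It uses SHA-256 purely as a stand-in; Synapse's internal hash function is not documented here, so only the qualitative effect should be taken from this example.

```python
import hashlib
from collections import Counter

N_DISTRIBUTIONS = 60  # a dedicated SQL pool always has 60 distributions


def distribution_for(key) -> int:
    """Map a key to one of 60 buckets (illustrative hash, not Synapse's own)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % N_DISTRIBUTIONS


def rows_per_distribution(keys) -> Counter:
    """Count how many rows land in each distribution."""
    return Counter(distribution_for(k) for k in keys)


# 60,000 rows with distinct cliente_id values: roughly even spread
even = rows_per_distribution(range(60_000))
# 50,000 rows for a single customer plus 10,000 others: one hot distribution
skewed = rows_per_distribution([1] * 50_000 + list(range(10_000)))

print("largest distribution (even keys):  ", max(even.values()))
print("largest distribution (skewed keys):", max(skewed.values()))
```

A column where a few values dominate concentrates work on a few distributions, which is why high-cardinality, evenly spread columns are usually preferred for `HASH`.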

**Partitioning:**
```sql
CREATE TABLE ventas_partitioned (
    venta_id INT,
    cliente_id INT,
    total DECIMAL(10,2),
    fecha DATE
)
WITH (
    DISTRIBUTION = HASH(cliente_id),
    PARTITION (fecha RANGE RIGHT FOR VALUES 
        ('2025-01-01', '2025-02-01', '2025-03-01')
    )
);
```
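
With `RANGE RIGHT`, each boundary value belongs to the partition on its right, so three boundaries produce four partitions. The following sketch reproduces that assignment with `bisect`; `partition_number` is a hypothetical helper for illustration only.

```python
from bisect import bisect_right
from datetime import date

# Boundaries from the CREATE TABLE statement above
BOUNDARIES = [date(2025, 1, 1), date(2025, 2, 1), date(2025, 3, 1)]


def partition_number(value: date, boundaries=BOUNDARIES) -> int:
    """Return the 1-based partition for a value under RANGE RIGHT semantics."""
    # bisect_right counts the boundaries <= value, so a value equal to a
    # boundary falls into the partition to the right of that boundary
    return bisect_right(boundaries, value) + 1


for d in (date(2024, 12, 31), date(2025, 1, 1), date(2025, 2, 15), date(2025, 3, 1)):
    print(d, "-> partition", partition_number(d))
```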

**COPY statement (bulk load from ADLS):**
```sql
COPY INTO ventas
FROM 'https://mystorageaccount.dfs.core.windows.net/data-lake/raw/ventas/*.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2
);
```

**Spark Pool with PySpark:**
```python
from pyspark.sql.functions import sum as _sum

# Read from ADLS Gen2
df = spark.read.csv(
    "abfss://data-lake@mystorageaccount.dfs.core.windows.net/raw/ventas/*.csv",
    header=True,
    inferSchema=True
)

# Transformations: total sales per customer
df_agg = df.groupBy("cliente_id") \
    .agg(_sum("total").alias("total_ventas"))

# Write as Delta Lake
df_agg.write.format("delta") \
    .mode("overwrite") \
    .save("abfss://data-lake@mystorageaccount.dfs.core.windows.net/curated/sales_summary")
```

**Cost Optimization:**
- Pause dedicated pools when they are not in use
- Use serverless for exploratory queries
- Partition large tables
- Compress data (Parquet/Delta)
- Result set caching (must be enabled per database; cached results are evicted after 48h without use)
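
The first two points can be made concrete with a break-even comparison. This sketch uses only the prices quoted in this notebook ($5/TB scanned for serverless, $1.20/hour for DW100c) and ignores storage, performance, and concurrency differences; the function names are hypothetical.

```python
SERVERLESS_USD_PER_TB = 5.0     # price quoted above
DW100C_USD_PER_HOUR = 1.20      # price quoted above


def serverless_cost(tb_scanned: float) -> float:
    """Cost of serverless queries for a given volume of scanned data."""
    return tb_scanned * SERVERLESS_USD_PER_TB


def dedicated_cost(hours_running: float) -> float:
    """Cost of keeping a DW100c dedicated pool running (not paused)."""
    return hours_running * DW100C_USD_PER_HOUR


def cheaper_option(tb_scanned: float, hours_running: float) -> str:
    """Pick the cheaper option on price alone."""
    if serverless_cost(tb_scanned) <= dedicated_cost(hours_running):
        return "serverless"
    return "dedicated"


# A month of light exploration vs. a pool left running 24/7 (720 hours)
print(cheaper_option(tb_scanned=2, hours_running=720))
# A month of heavy scanning
print(cheaper_option(tb_scanned=200, hours_running=720))
```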

---

In [None]:
# Synapse configuration
SYNAPSE_WORKSPACE = 'mysynapseworkspace'
DEDICATED_POOL = 'sqldw'
SERVERLESS_ENDPOINT = f'{SYNAPSE_WORKSPACE}-ondemand.sql.azuresynapse.net'

print(f'📊 Synapse Workspace: {SYNAPSE_WORKSPACE}')
print(f'💾 Dedicated Pool: {DEDICATED_POOL}')
print(f'⚡ Serverless: {SERVERLESS_ENDPOINT}')

### 2.1 Query serverless SQL from Python

In [None]:
# Example connection to Synapse Serverless
'''
import struct
import pyodbc
from azure.identity import DefaultAzureCredential

# Get an Azure AD token
credential = DefaultAzureCredential()
token = credential.get_token("https://database.windows.net/.default")

# The ODBC driver expects the token as a length-prefixed UTF-16-LE byte structure
token_bytes = token.token.encode("utf-16-le")
token_struct = struct.pack(f"<I{len(token_bytes)}s", len(token_bytes), token_bytes)

# Connection with AAD authentication
conn_string = f"""
    Driver={{ODBC Driver 18 for SQL Server}};
    Server={SERVERLESS_ENDPOINT};
    Database=master;
    Encrypt=yes;
    TrustServerCertificate=no;
    Connection Timeout=30;
"""

conn = pyodbc.connect(conn_string, attrs_before={
    1256: token_struct  # SQL_COPT_SS_ACCESS_TOKEN
})

# Run the query
cursor = conn.cursor()
query = """
    SELECT TOP 10
        cliente_id,
        SUM(total) as total_ventas
    FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/data-lake/raw/ventas/*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS ventas
    GROUP BY cliente_id
    ORDER BY total_ventas DESC
"""

cursor.execute(query)
for row in cursor:
    print(f'Cliente {row[0]}: ${row[1]:.2f}')

conn.close()
'''
print('Synapse Serverless example with Azure AD authentication')

## 3. Azure Data Factory (ADF)

### 🔄 **Azure Data Factory: ETL/ELT Orchestration**

**Core Concepts:**

1. **Linked Services**: Connections to data stores (ADLS, SQL, REST APIs)
2. **Datasets**: Representations of data (CSV, Parquet, tables)
3. **Pipelines**: A flow of activities
4. **Activities**: Individual tasks (Copy, Data Flow, Notebook, etc.)
5. **Triggers**: What starts a run (Schedule, Tumbling Window, Event-based)

**Common Activities:**

- **Copy Activity**: Move data between 90+ connectors
- **Data Flow**: Visual transformations (similar to SSIS)
- **Databricks Notebook**: Run Databricks notebooks
- **Stored Procedure**: Run SQL in databases
- **Web Activity**: HTTP/REST calls
- **Azure Function**: Invoke functions

**Pipeline JSON Example:**

```json
{
  "name": "CopyCSVtoADLS",
  "properties": {
    "activities": [
      {
        "name": "CopyVentas",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "SourceCSV",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "SinkADLS",
            "type": "DatasetReference"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "DelimitedTextSource"
          },
          "sink": {
            "type": "DelimitedTextSink"
          }
        }
      }
    ],
    "parameters": {
      "fechaEjecucion": {
        "type": "string"
      }
    }
  }
}
```

**Data Flow (GUI Transform):**

```
Source (ADLS CSV) 
  → Filter (total > 0) 
  → Aggregate (SUM by cliente_id) 
  → Sink (Synapse table)
```
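
The same Filter → Aggregate logic can be sketched in plain Python to make clear what the visual flow computes. This is an in-memory illustration with made-up rows; `run_data_flow` is a hypothetical name, and the real Data Flow runs on Spark clusters managed by ADF.

```python
from collections import defaultdict


def run_data_flow(rows):
    """Filter rows with total > 0, then sum total per cliente_id."""
    totals = defaultdict(float)
    for row in rows:
        if row["total"] > 0:                      # Filter step
            totals[row["cliente_id"]] += row["total"]  # Aggregate step
    return dict(totals)


# Made-up sample rows standing in for the ADLS CSV source
sample = [
    {"cliente_id": 1, "total": 10.0},
    {"cliente_id": 1, "total": -5.0},   # dropped by the filter
    {"cliente_id": 2, "total": 7.5},
    {"cliente_id": 1, "total": 2.5},
]
print(run_data_flow(sample))
```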

**Integration Runtime:**

- **Azure IR**: Cloud-to-cloud (managed by Azure)
- **Self-hosted IR**: On-prem-to-cloud (you install an agent)
- **Azure-SSIS IR**: SSIS packages in the cloud

**Parameterization:**

```json
{
  "activities": [
    {
      "name": "DynamicCopy",
      "inputs": [
        {
          "parameters": {
            "folderPath": {
              "value": "@concat('raw/', formatDateTime(utcNow(), 'yyyy/MM/dd'))",
              "type": "Expression"
            }
          }
        }
      ]
    }
  ]
}
```

**Triggers:**

```json
{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2025-01-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "VentasETL",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```

**Event-based Trigger (Blob created):**
```json
{
  "name": "BlobEventTrigger",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/data-lake/blobs/raw/ventas/",
      "blobPathEndsWith": ".csv",
      "ignoreEmptyBlobs": true,
      "events": ["Microsoft.Storage.BlobCreated"]
    }
  }
}
```
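
The filter properties of the trigger can be read as a simple predicate over the blob path and size. The sketch below is a simplified model of that matching, assuming paths in the `/<container>/blobs/<folder>/<file>` form that ADF uses for these filters; `trigger_fires` is a hypothetical function, and the actual filtering happens in Event Grid.

```python
def trigger_fires(blob_path: str, size_bytes: int,
                  begins_with: str, ends_with: str,
                  ignore_empty: bool = True) -> bool:
    """Return True if a created blob would match the trigger filters."""
    if ignore_empty and size_bytes == 0:
        return False
    return blob_path.startswith(begins_with) and blob_path.endswith(ends_with)


BEGINS = "/data-lake/blobs/raw/ventas/"
ENDS = ".csv"

print(trigger_fires("/data-lake/blobs/raw/ventas/2025_10.csv", 1024, BEGINS, ENDS))
print(trigger_fires("/data-lake/blobs/raw/ventas/2025_10.json", 1024, BEGINS, ENDS))
print(trigger_fires("/data-lake/blobs/raw/ventas/empty.csv", 0, BEGINS, ENDS))
```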

**Python SDK (create resources programmatically):**

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobFSLinkedService,
    AzureBlobFSLocation,
    DatasetResource,
    DelimitedTextDataset,
    LinkedServiceReference,
    LinkedServiceResource,
)

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, SUBSCRIPTION_ID)

# Create the linked service (ADLS Gen2); the SDK expects a resource wrapper
ls_adls = LinkedServiceResource(
    properties=AzureBlobFSLinkedService(
        url=f"https://{STORAGE_ACCOUNT}.dfs.core.windows.net"
    )
)

adf_client.linked_services.create_or_update(
    resource_group_name='my-rg',
    factory_name='my-adf',
    linked_service_name='ADLS_LS',
    linked_service=ls_adls
)

# Create the dataset (delimited text files under raw/ventas)
ds_adls = DatasetResource(
    properties=DelimitedTextDataset(
        linked_service_name=LinkedServiceReference(
            type='LinkedServiceReference', reference_name='ADLS_LS'
        ),
        location=AzureBlobFSLocation(
            file_system='data-lake', folder_path='raw/ventas'
        ),
        column_delimiter=',',
        first_row_as_header=True
    )
)

adf_client.datasets.create_or_update(
    resource_group_name='my-rg',
    factory_name='my-adf',
    dataset_name='VentasCSV',
    dataset=ds_adls
)
```

**Best Practices:**
- Use parameters for reusability
- Log with Azure Monitor
- Handle errors with retry policies
- Keep secrets in Key Vault and reference them from linked services
- CI/CD with Azure DevOps/GitHub Actions
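
The retry behavior mentioned above can be illustrated in plain Python. This sketch mirrors the fixed-interval style of an ADF activity policy (`retry` and `retryIntervalInSeconds`); `run_with_retry` and `flaky_copy` are hypothetical names, and the `sleep` parameter is injectable so the example does not actually wait.

```python
import time


def run_with_retry(task, retry=3, retry_interval_seconds=30, sleep=time.sleep):
    """Run task(); on failure wait a fixed interval and retry up to `retry` times."""
    for attempt in range(retry + 1):
        try:
            return task()
        except Exception:
            if attempt == retry:
                raise  # retries exhausted: surface the last error
            sleep(retry_interval_seconds)


# Simulated activity that fails twice with a transient error, then succeeds
attempts = {"n": 0}


def flaky_copy():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient failure")
    return "copied"


waits = []  # record the waits instead of sleeping
result = run_with_retry(flaky_copy, retry=3, retry_interval_seconds=30, sleep=waits.append)
print(result, waits)
```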

**Pricing:**
- Orchestration: $1 per 1,000 activity runs
- Data Movement: $0.25 per DIU-hour
- Data Flow: $0.27 per vCore-hour

---

In [None]:
# Example ADF pipeline defined in JSON
adf_pipeline_example = '''
{
  "name": "VentasDailyETL",
  "properties": {
    "activities": [
      {
        "name": "CopyFromSource",
        "type": "Copy",
        "inputs": [
          {"referenceName": "SourceSQL", "type": "DatasetReference"}
        ],
        "outputs": [
          {"referenceName": "SinkADLS", "type": "DatasetReference"}
        ],
        "typeProperties": {
          "source": {"type": "SqlSource"},
          "sink": {"type": "ParquetSink"}
        }
      },
      {
        "name": "RunDatabricksNotebook",
        "type": "DatabricksNotebook",
        "dependsOn": [
          {"activity": "CopyFromSource", "dependencyConditions": ["Succeeded"]}
        ],
        "typeProperties": {
          "notebookPath": "/notebooks/transform_ventas",
          "baseParameters": {
            "date": {
              "value": "@formatDateTime(utcNow(), 'yyyy-MM-dd')",
              "type": "Expression"
            }
          }
        }
      },
      {
        "name": "LoadToSynapse",
        "type": "Copy",
        "dependsOn": [
          {"activity": "RunDatabricksNotebook", "dependencyConditions": ["Succeeded"]}
        ],
        "inputs": [
          {"referenceName": "ProcessedADLS", "type": "DatasetReference"}
        ],
        "outputs": [
          {"referenceName": "SynapseTable", "type": "DatasetReference"}
        ],
        "typeProperties": {
          "source": {"type": "ParquetSource"},
          "sink": {
            "type": "SqlDWSink",
            "allowPolyBase": true
          }
        }
      }
    ],
    "parameters": {
      "fechaInicio": {"type": "string"},
      "fechaFin": {"type": "string"}
    }
  }
}
'''

print(adf_pipeline_example)
print('\n💡 Deploy with: az datafactory pipeline create')

## 4. Azure Databricks

### ⚡ **Azure Databricks: Optimized Spark**

**What is Databricks?**

Databricks is a unified analytics platform built on Apache Spark and optimized for Azure:

- **Photon Engine**: Vectorized query engine (up to 2x faster)
- **Delta Lake**: ACID transactions on data lakes
- **Unity Catalog**: Centralized governance
- **MLflow**: ML lifecycle management
- **Auto Loader**: Efficient incremental ingestion

**Components:**

1. **Workspace**: Collaborative environment with notebooks
2. **Cluster**: A set of VMs running the Spark runtime
3. **Jobs**: Scheduled workloads
4. **Delta Tables**: Tables with ACID guarantees

**Cluster Configuration:**

```python
# Cluster specs (via UI or API)
# Set either a fixed "num_workers" or an "autoscale" range, not both
{
    "cluster_name": "etl-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {
        "min_workers": 2,
        "max_workers": 8
    },
    "spark_conf": {
        "spark.databricks.delta.preview.enabled": "true"
    },
    "azure_attributes": {
        "availability": "ON_DEMAND_AZURE",
        "spot_bid_max_price": -1
    }
}
```

**Delta Lake:**

```python
from delta.tables import DeltaTable
from pyspark.sql.functions import col

# Write a Delta table
df = spark.read.csv(
    "abfss://data-lake@storage.dfs.core.windows.net/raw/ventas/*.csv",
    header=True,
    inferSchema=True
)

df.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/mnt/delta/ventas")

# Read the Delta table
df_delta = spark.read.format("delta").load("/mnt/delta/ventas")

# UPSERT (merge)
deltaTable = DeltaTable.forPath(spark, "/mnt/delta/ventas")

updates = spark.read.csv("new_data.csv", header=True)

deltaTable.alias("target") \
    .merge(
        updates.alias("source"),
        "target.venta_id = source.venta_id"
    ) \
    .whenMatchedUpdate(set={"total": col("source.total")}) \
    .whenNotMatchedInsertAll() \
    .execute()

# Time travel
df_v1 = spark.read.format("delta").option("versionAsOf", 1).load("/mnt/delta/ventas")
df_yesterday = spark.read.format("delta") \
    .option("timestampAsOf", "2025-10-29") \
    .load("/mnt/delta/ventas")

# Optimization
deltaTable.optimize().executeCompaction()
deltaTable.vacuum(168)  # Remove unreferenced files older than 168 hours (7 days)
```
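
The semantics of the merge above (update `total` when `venta_id` matches, insert the whole row otherwise) can be shown with plain dictionaries. This is only an in-memory illustration with made-up rows; `merge_upsert` is a hypothetical helper and does not provide Delta Lake's transactional guarantees.

```python
def merge_upsert(target, updates, key="venta_id"):
    """Apply upsert semantics: update `total` on match, insert the row otherwise."""
    merged = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in merged:
            # whenMatchedUpdate(set={"total": source.total})
            merged[row[key]]["total"] = row["total"]
        else:
            # whenNotMatchedInsertAll()
            merged[row[key]] = dict(row)
    return list(merged.values())


target_rows = [{"venta_id": 1, "cliente_id": 7, "total": 10.0}]
update_rows = [
    {"venta_id": 1, "cliente_id": 8, "total": 99.0},  # match: only total changes
    {"venta_id": 2, "cliente_id": 9, "total": 5.0},   # no match: inserted
]
print(merge_upsert(target_rows, update_rows))
```

Note that on a match only `total` is overwritten, so `cliente_id` keeps its original value, just as in the `whenMatchedUpdate` clause above.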

**Auto Loader (Structured Streaming):**

```python
df_stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/schema/ventas")
    .load("abfss://data-lake@storage.dfs.core.windows.net/raw/ventas/")
)

# Transformations
from pyspark.sql.functions import col, current_timestamp

df_processed = df_stream \
    .filter(col("total") > 0) \
    .withColumn("ingestion_timestamp", current_timestamp())

# Write the stream to Delta
query = (df_processed.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/ventas")
    .start("/mnt/delta/ventas_stream")
)

query.awaitTermination()
```

**Mount ADLS Gen2:**

```python
# Configure the service principal; the secret is read from a Databricks secret scope
# ("kv-scope" and "sp-client-secret" are placeholder names)
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<client-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="kv-scope", key="sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
    source=f"abfss://data-lake@{STORAGE_ACCOUNT}.dfs.core.windows.net/",
    mount_point="/mnt/data-lake",
    extra_configs=configs
)

# Use the mount
df = spark.read.csv("/mnt/data-lake/raw/ventas/*.csv")
```

**Notebook Widgets (parameterization):**

```python
# Create the widget
dbutils.widgets.text("fecha", "2025-10-30", "Execution date")

# Read its value
fecha = dbutils.widgets.get("fecha")
print(f"Processing data for {fecha}")

# Use it in dynamic paths
path = f"/mnt/data-lake/raw/ventas/{fecha}/*.csv"
df = spark.read.csv(path)
```

**MLflow Tracking:**

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

mlflow.set_experiment("/experiments/churn-prediction")

# X_train, X_test, y_train, y_test are assumed to be prepared earlier
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    
    accuracy = model.score(X_test, y_test)
    
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "model")
    
    print(f"Model accuracy: {accuracy}")
```

**Pricing:**
- DBU (Databricks Unit): unit of processing capacity
  - Jobs Compute: $0.15/DBU
  - All-Purpose Compute: $0.40/DBU
- VM cost (Azure): depends on the instance type (Standard_DS3_v2 ~$0.15/hour)
- Total = DBU cost + VM cost
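
The "Total = DBU cost + VM cost" formula can be turned into a quick estimate. The sketch below uses the prices quoted above; the DBU consumption rate per node-hour (`dbu_per_node_hour=0.75`) is an assumption for illustration and should be replaced with the rate published for the chosen instance type. `cluster_cost` is a hypothetical helper.

```python
# DBU prices as quoted in this notebook (USD per DBU)
DBU_PRICE = {"jobs": 0.15, "all_purpose": 0.40}


def cluster_cost(nodes: int, hours: float, workload: str,
                 dbu_per_node_hour: float = 0.75,   # assumed rate, check your instance type
                 vm_price_per_hour: float = 0.15) -> float:
    """Estimate cluster cost as (DBU cost + VM cost) per node-hour."""
    hourly_per_node = dbu_per_node_hour * DBU_PRICE[workload] + vm_price_per_hour
    return round(nodes * hours * hourly_per_node, 4)


# 1 driver + 2 workers running for 2 hours
print("Jobs Compute:       ", cluster_cost(3, 2, "jobs"))
print("All-Purpose Compute:", cluster_cost(3, 2, "all_purpose"))
```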

**Cost Optimization:**
- Use Jobs Compute (cheaper) for scheduled workloads
- Auto-terminate idle clusters
- Spot VMs (up to 90% discount)
- Photon acceleration (less compute time)

---

In [None]:
# Example Databricks notebook
databricks_notebook_example = '''
# Databricks notebook source
# MAGIC %md
# MAGIC # ETL Pipeline con Delta Lake

# COMMAND ----------

# MAGIC %md
# MAGIC ## 1. Configuración

# COMMAND ----------

# Widgets
dbutils.widgets.text("fecha", "2025-10-30")
fecha = dbutils.widgets.get("fecha")

# COMMAND ----------

# MAGIC %md
# MAGIC ## 2. Ingesta desde ADLS

# COMMAND ----------

from pyspark.sql.functions import col, current_timestamp

df_raw = spark.read.csv(
    f"/mnt/data-lake/raw/ventas/{fecha}/*.csv",
    header=True,
    inferSchema=True
)

print(f"Registros leídos: {df_raw.count()}")

# COMMAND ----------

# MAGIC %md
# MAGIC ## 3. Transformaciones

# COMMAND ----------

df_clean = (df_raw
    .filter(col("total") > 0)
    .filter(col("cliente_id").isNotNull())
    .withColumn("fecha_ingestion", current_timestamp())
)

# COMMAND ----------

# MAGIC %md
# MAGIC ## 4. Escribir a Delta Lake

# COMMAND ----------

df_clean.write.format("delta") \\
    .mode("append") \\
    .partitionBy("fecha") \\
    .save("/mnt/delta/ventas")

print("✅ Datos escritos a Delta Lake")

# COMMAND ----------

# MAGIC %md
# MAGIC ## 5. Validación

# COMMAND ----------

df_delta = spark.read.format("delta").load("/mnt/delta/ventas")
print(f"Total registros en Delta: {df_delta.count()}")

# COMMAND ----------

# MAGIC %sql
# MAGIC -- Query the Delta table with SQL
# MAGIC SELECT 
# MAGIC     cliente_id,
# MAGIC     SUM(total) as total_ventas,
# MAGIC     COUNT(*) as num_ventas
# MAGIC FROM delta.`/mnt/delta/ventas`
# MAGIC WHERE fecha = current_date()
# MAGIC GROUP BY cliente_id
# MAGIC ORDER BY total_ventas DESC
# MAGIC LIMIT 10
'''

print(databricks_notebook_example)
print('\n💡 This notebook can be run from ADF or Databricks Jobs')

## 5. Azure Container Instances and Web Apps: Flexible Compute

### 🐳 **Azure Container Instances (ACI) and Web Apps: Serverless/Managed Alternatives**

**Why Container Instances or Web Apps?**

Azure Functions has execution limits (at most 10 minutes on the Consumption plan; the Premium plan allows longer runs, with 60 minutes guaranteed). For heavier or more customized workloads, Azure offers:

1. **Azure Container Instances (ACI)**: Serverless containers without Kubernetes
2. **Azure Web Apps**: PaaS for web applications (comparable to App Engine on GCP)

**Comparison:**

| Aspect | Azure Functions | Container Instances | Web Apps |
|---------|----------------|---------------------|----------|
| **Time limit** | 10 min (Consumption) | None | None |
| **Languages** | Python, .NET, Node, Java | Any (Docker) | Python, .NET, Node, PHP |
| **Scaling** | Automatic | Manual (no built-in autoscaling) | Auto-scale rules |
| **Cold start** | ~1-2s | ~5-10s | None with Always On |
| **Pricing** | Pay-per-execution | Pay-per-second (CPU+RAM) | Pay-per-hour |
| **Use case** | Fast event-driven work | Batch jobs, long ETL | APIs, 24/7 services |

---

### **Azure Container Instances (ACI)**

**Features:**
- Runs Docker containers without managing infrastructure
- Well suited to batch jobs, long-running processing, and custom scripts
- Per-second billing (CPU + memory)
- Virtual Network integration

**Architecture for Data:**

```
Scheduler → ACI (ETL Python) → ADLS Gen2 / Synapse
              ↓
       Azure Monitor Logs (monitoring)
```

**Example: Python ETL on ACI**

**Dockerfile:**
```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the code
COPY etl_script.py .

# Run command
CMD ["python", "etl_script.py"]
```

**etl_script.py:**
```python
import io
import logging
import os

import pandas as pd
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Set through --environment-variables when the container is created
STORAGE_ACCOUNT = os.environ.get("STORAGE_ACCOUNT", "mystorageaccount")

def main():
    logger.info('Starting ETL process')
    
    # Authenticate with Managed Identity
    credential = DefaultAzureCredential()
    service_client = DataLakeServiceClient(
        account_url=f"https://{STORAGE_ACCOUNT}.dfs.core.windows.net",
        credential=credential
    )
    
    # Read the data
    file_system_client = service_client.get_file_system_client("data-lake")
    file_client = file_system_client.get_file_client("raw/ventas/ventas.csv")
    
    download = file_client.download_file()
    df = pd.read_csv(io.BytesIO(download.readall()))
    
    logger.info(f'Loaded {len(df)} rows')
    
    # Transformations (long-running work is allowed here)
    df_clean = df.dropna().copy()  # copy avoids SettingWithCopyWarning below
    df_clean['total'] = df_clean['total'].astype(float)
    df_agg = df_clean.groupby('cliente_id')['total'].sum().reset_index()
    
    # Write the results
    output_client = file_system_client.get_file_client("curated/ventas_summary.csv")
    csv_content = df_agg.to_csv(index=False)
    output_client.upload_data(csv_content, overwrite=True)
    
    logger.info(f'✅ ETL completed: {len(df_agg)} rows')

if __name__ == '__main__':
    main()
```

**Deploy to ACI:**
```bash
# Build and push to Azure Container Registry (ACR)
az acr build --registry myregistry \
    --image sales-etl:v1 \
    --file Dockerfile .

# Create the container instance
az container create \
    --resource-group my-rg \
    --name sales-etl-instance \
    --image myregistry.azurecr.io/sales-etl:v1 \
    --cpu 2 \
    --memory 4 \
    --restart-policy Never \
    --assign-identity [system] \
    --environment-variables STORAGE_ACCOUNT=mystorageaccount

# View logs
az container logs --resource-group my-rg --name sales-etl-instance
```

**Trigger with Logic Apps:**
```json
{
  "type": "Recurrence",
  "recurrence": {
    "frequency": "Day",
    "interval": 1,
    "startTime": "2025-01-01T02:00:00Z"
  },
  "actions": {
    "Create_Container_Instance": {
      "type": "ApiConnection",
      "inputs": {
        "host": {
          "connection": {
            "name": "@parameters('$connections')['aci']['connectionId']"
          }
        },
        "method": "post",
        "path": "/subscriptions/@{encodeURIComponent('sub-id')}/resourceGroups/@{encodeURIComponent('my-rg')}/providers/Microsoft.ContainerInstance/containerGroups/@{encodeURIComponent('sales-etl-instance')}"
      }
    }
  }
}
```

---

### **Azure Web Apps (App Service)**

**Features:**
- Fully managed PaaS
- No cold start when Always On is enabled
- Metric-based auto-scaling
- Deployment slots (staging, production)

**Example: Data API with Flask**

**app.py:**
```python
import io

from flask import Flask, request, jsonify
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient
import pandas as pd

app = Flask(__name__)

credential = DefaultAzureCredential()
service_client = DataLakeServiceClient(
    account_url="https://mystorageaccount.dfs.core.windows.net",
    credential=credential
)

@app.route('/api/sales', methods=['GET'])
def get_sales():
    """Return sales, optionally filtered by customer_id."""
    customer_id = request.args.get('customer_id')
    
    # Read from ADLS
    file_system_client = service_client.get_file_system_client("data-lake")
    file_client = file_system_client.get_file_client("curated/ventas.parquet")
    
    download = file_client.download_file()
    df = pd.read_parquet(io.BytesIO(download.readall()))
    
    if customer_id:
        df = df[df['cliente_id'] == int(customer_id)]
    
    return jsonify(df.to_dict(orient='records'))

@app.route('/health', methods=['GET'])
def health():
    return 'OK', 200

if __name__ == '__main__':
    app.run()
```
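
Query parameters arrive as strings, so `customer_id` has to be validated before it is compared with `cliente_id`. The sketch below shows one way to separate that parsing from the Flask route so it can be tested without Azure. `parse_customer_id` and `filter_sales` are hypothetical helpers, and the rows are plain dictionaries rather than a DataFrame.

```python
from typing import Dict, Iterable, List, Optional

def parse_customer_id(raw: Optional[str]) -> Optional[int]:
    """Return the id as an int, None when the parameter is absent, or raise ValueError."""
    if raw is None or raw == "":
        return None
    if not (raw.isascii() and raw.isdigit()):
        raise ValueError(f"customer_id must be a non-negative integer, got {raw!r}")
    return int(raw)

def filter_sales(rows: Iterable[Dict], customer_id: Optional[int]) -> List[Dict]:
    """Keep every row when no filter is given, otherwise only the matching customer."""
    if customer_id is None:
        return list(rows)
    return [row for row in rows if row["cliente_id"] == customer_id]

# Usage with in-memory rows standing in for the Parquet file
sales = [
    {"venta_id": 1, "cliente_id": 10, "total": 99.5},
    {"venta_id": 2, "cliente_id": 20, "total": 15.0},
]
print(filter_sales(sales, parse_customer_id("10")))
```

In the route, a `ValueError` from `parse_customer_id` would map naturally to an HTTP 400 response.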

**Deploy to App Service:**
```bash
# Create the App Service Plan
# S1: autoscale rules and deployment slots need Standard or higher (B1 only scales manually)
az appservice plan create \
    --name my-plan \
    --resource-group my-rg \
    --sku S1 \
    --is-linux

# Create the Web App from the container image
# (--runtime and a container image are mutually exclusive; the image carries the Python runtime)
az webapp create \
    --resource-group my-rg \
    --plan my-plan \
    --name sales-api \
    --deployment-container-image-name myregistry.azurecr.io/sales-api:v1

# Configure autoscale (it targets the App Service Plan, not the web app)
az monitor autoscale create \
    --resource-group my-rg \
    --resource my-plan \
    --resource-type Microsoft.Web/serverfarms \
    --name autoscale-rule \
    --min-count 1 \
    --max-count 10 \
    --count 2

# Add a rule: scale out by 1 instance when average CPU > 70% over 5 minutes
az monitor autoscale rule create \
    --resource-group my-rg \
    --autoscale-name autoscale-rule \
    --condition "CpuPercentage > 70 avg 5m" \
    --scale out 1
```

**Pricing (approximate, varies by region):**
- **ACI**: $0.0000012 per vCPU-second + $0.0000001 per GB-second
  - Example: 2 vCPU, 4 GB, 30 min/day = ~$3/month
- **Web Apps**: From $13/month (B1 Basic plan)
  - No cold start, always running

**When to use each:**
- **Functions**: Short event-driven work (<10 min)
- **ACI**: Batch jobs, long-running ETL, one-off scripts
- **Web Apps**: 24/7 APIs, services with a strict SLA
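
The rule of thumb above can be written down as a small decision function. This is only a sketch of the guidance in this section: `choose_compute` is a hypothetical helper, and the 10-minute threshold mirrors the bullet above rather than any configured Azure limit.

```python
def choose_compute(event_driven: bool, max_runtime_minutes: float, always_on: bool) -> str:
    """Pick a compute option following the rule of thumb in this section."""
    if always_on:
        # APIs that must answer 24/7 belong on an always-running plan
        return "Web Apps"
    if event_driven and max_runtime_minutes < 10:
        # Short reactions to events fit a Function
        return "Functions"
    # Everything else: batch jobs, long ETL, one-off scripts
    return "ACI"

print(choose_compute(event_driven=True, max_runtime_minutes=2, always_on=False))
print(choose_compute(event_driven=False, max_runtime_minutes=45, always_on=False))
```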

---

## 6. Azure Functions for Data

### ⚡ **Azure Functions: Event-Driven Data Processing**

**Triggers for Data Engineering:**

1. **Blob Storage Trigger**: Process files on upload
2. **Event Hubs Trigger**: Streaming events
3. **Timer Trigger**: Cron jobs
4. **HTTP Trigger**: REST APIs

**Example: Process a CSV when it is uploaded to ADLS**

```python
# function_app.py
import io
import logging

import azure.functions as func
import pandas as pd
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

app = func.FunctionApp()

CONTAINER = "data-lake"

@app.blob_trigger(
    arg_name="myblob",
    path="data-lake/raw/{name}",
    connection="AzureWebJobsStorage"
)
def process_csv(myblob: func.InputStream):
    logging.info(f"Processing blob: {myblob.name} ({myblob.length} bytes)")

    # Skip if not CSV
    if not myblob.name.endswith('.csv'):
        logging.info("Skipping non-CSV file")
        return

    # Read CSV
    df = pd.read_csv(io.BytesIO(myblob.read()))

    # Validate schema: report the columns that are actually missing
    required_cols = ['venta_id', 'cliente_id', 'total']
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        logging.error(f"Missing columns: {missing}")
        raise ValueError("Invalid schema")

    # Clean data (explicit copy avoids pandas chained-assignment warnings)
    df_clean = df.dropna(subset=['total']).copy()
    df_clean['total'] = df_clean['total'].astype(float)
    df_clean = df_clean[df_clean['total'] > 0]

    logging.info(f"Cleaned {len(df_clean)} rows (from {len(df)} original)")

    # Write to curated zone
    credential = DefaultAzureCredential()
    service_client = DataLakeServiceClient(
        account_url="https://mystorageaccount.dfs.core.windows.net",
        credential=credential
    )
    file_system_client = service_client.get_file_system_client(CONTAINER)

    # myblob.name is "<container>/raw/<file>"; the file client expects a path
    # relative to the container, so drop the container prefix first
    relative_path = myblob.name.split('/', 1)[1]
    output_path = relative_path.replace('raw/', 'curated/', 1).replace('.csv', '_clean.csv')
    file_client = file_system_client.get_file_client(output_path)

    csv_output = df_clean.to_csv(index=False)
    file_client.upload_data(csv_output, overwrite=True)

    logging.info(f"✅ Written to {output_path}")
```
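
The blob trigger reports the blob name with the container included (for example `data-lake/raw/ventas.csv`), while a Data Lake file client expects a path relative to the file system. The sketch below isolates that mapping so it can be unit tested without Azure; `curated_path` is a hypothetical helper and the `_clean` suffix simply follows the example above.

```python
from pathlib import PurePosixPath

def curated_path(blob_name: str, container: str = "data-lake") -> str:
    """Map '<container>/raw/<path>/x.csv' to 'curated/<path>/x_clean.csv'."""
    parts = PurePosixPath(blob_name).parts
    # Drop the container prefix if the trigger included it
    if parts and parts[0] == container:
        parts = parts[1:]
    if len(parts) < 2 or parts[0] != "raw":
        raise ValueError(f"Expected a blob under raw/, got {blob_name!r}")
    target = PurePosixPath("curated", *parts[1:])
    # Insert the suffix before the extension: ventas.csv -> ventas_clean.csv
    return str(target.with_name(target.stem + "_clean" + target.suffix))

print(curated_path("data-lake/raw/ventas/2025-01.csv"))
```

Working with path components instead of `str.replace` avoids accidental rewrites when a folder or file name happens to contain `raw` or `.csv` in the middle.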

**requirements.txt:**
```
azure-functions
azure-storage-file-datalake
azure-identity
pandas
```

**Event Hubs Trigger (Streaming):**

```python
import json
from typing import List

@app.event_hub_message_trigger(
    arg_name="events",
    event_hub_name="sales-events",
    connection="EventHubConnection",
    cardinality="many"  # receive a batch of events per invocation
)
@app.cosmos_db_output(
    arg_name="outputDocument",
    database_name="SalesDB",
    container_name="Events",
    connection="CosmosDBConnection"
)
def process_stream(events: List[func.EventHubEvent], outputDocument: func.Out[func.DocumentList]):
    documents = func.DocumentList()
    for event in events:
        body = event.get_body().decode('utf-8')
        logging.info(f"Processing event: {body}")

        # Parse event
        data = json.loads(body)

        # Enrich with the time Event Hubs enqueued the event
        data['processed_timestamp'] = event.enqueued_time.isoformat()

        documents.append(func.Document.from_dict(data))

    # Write the whole batch to Cosmos DB (calling set() per event keeps only the last one)
    outputDocument.set(documents)
```
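
In a streaming function a single malformed message should not fail the whole batch. The following sketch, which uses only the standard library and no Azure bindings, shows one possible way to split a batch of raw event bodies into enriched documents and rejected payloads. `to_documents` is a hypothetical helper, and using the processing time for `processed_timestamp` is an assumption of this example.

```python
import json
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Tuple

def to_documents(bodies: Iterable[bytes],
                 processed_at: Optional[datetime] = None) -> Tuple[List[dict], List[bytes]]:
    """Parse raw event bodies; return (documents, rejected_bodies)."""
    processed_at = processed_at or datetime.now(timezone.utc)
    documents, rejected = [], []
    for raw in bodies:
        try:
            data = json.loads(raw.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            rejected.append(raw)
            continue
        if not isinstance(data, dict):
            # Only JSON objects can become documents
            rejected.append(raw)
            continue
        data["processed_timestamp"] = processed_at.isoformat()
        documents.append(data)
    return documents, rejected

docs, bad = to_documents([b'{"venta_id": 1, "total": 10.5}', b"not json"])
print(len(docs), "documents,", len(bad), "rejected")
```

The rejected payloads could then be logged or written to a separate location for inspection instead of raising inside the function.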

**Timer Trigger (Daily Job):**

```python
import os
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

@app.timer_trigger(
    arg_name="timer",
    schedule="0 0 2 * * *",  # Daily at 02:00 UTC (sec min hour day month weekday)
    run_on_startup=False
)
def daily_aggregation(timer: func.TimerRequest):
    logging.info("Running daily aggregation")

    # Timer schedules run in UTC by default, so compute "yesterday" in UTC too
    yesterday = (datetime.now(timezone.utc) - timedelta(days=1)).strftime('%Y-%m-%d')

    # Call Data Factory pipeline
    # Subscription ID read from an app setting (AZURE_SUBSCRIPTION_ID is an assumed name)
    credential = DefaultAzureCredential()
    adf_client = DataFactoryManagementClient(credential, os.environ["AZURE_SUBSCRIPTION_ID"])

    run_response = adf_client.pipelines.create_run(
        resource_group_name='my-rg',
        factory_name='my-adf',
        pipeline_name='VentasDailyETL',
        parameters={'fecha': yesterday}
    )

    logging.info(f"Pipeline run ID: {run_response.run_id}")
```
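
Azure Functions timer schedules use six-field NCRONTAB expressions (`{second} {minute} {hour} {day} {month} {day-of-week}`), one field more than classic cron. The sketch below only names the fields so an expression such as `0 0 2 * * *` is easier to read; it does not evaluate the schedule, and `parse_ncrontab` is a hypothetical helper.

```python
from typing import NamedTuple

class NcrontabFields(NamedTuple):
    second: str
    minute: str
    hour: str
    day: str
    month: str
    day_of_week: str

def parse_ncrontab(expression: str) -> NcrontabFields:
    """Split a six-field NCRONTAB expression into named fields."""
    fields = expression.split()
    if len(fields) != 6:
        raise ValueError(
            f"NCRONTAB needs 6 fields (second first), got {len(fields)}: {expression!r}"
        )
    return NcrontabFields(*fields)

schedule = parse_ncrontab("0 0 2 * * *")
print(f"Runs at {int(schedule.hour):02d}:{int(schedule.minute):02d}:{int(schedule.second):02d} UTC")
```

A five-field cron string copied from another scheduler fails this check, which is a common source of timers that fire at unexpected times.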

**Deploy with Azure CLI:**

```bash
# Create the Function App (Python functions run on Linux)
az functionapp create \
    --resource-group my-rg \
    --consumption-plan-location eastus \
    --os-type Linux \
    --runtime python \
    --runtime-version 3.10 \
    --functions-version 4 \
    --name my-data-functions \
    --storage-account mystorageaccount

# Deploy the code (requires Azure Functions Core Tools)
func azure functionapp publish my-data-functions
```

**Durable Functions (Orchestration):**

```python
import azure.functions as func
import azure.durable_functions as df

# Orchestrator
def orchestrator_function(context: df.DurableOrchestrationContext):
    # Step 1: Validate
    valid = yield context.call_activity('validate_data', context.get_input())
    
    if not valid:
        return "Validation failed"
    
    # Step 2: Transform (parallel)
    tasks = [
        context.call_activity('transform_sales', None),
        context.call_activity('transform_customers', None)
    ]
    yield context.task_all(tasks)
    
    # Step 3: Load
    result = yield context.call_activity('load_to_warehouse', None)
    
    return result

main = df.Orchestrator.create(orchestrator_function)
```
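
The orchestrator above is a generator: each `yield` hands a task to the runtime, which later resumes the function with the task's result. The toy driver below imitates that hand-off in plain Python to make the control flow visible. It is not the Durable Functions runtime: there is no replay from history, the "parallel" step runs sequentially, and `run_orchestration` and the tuple-based task format are inventions of this sketch.

```python
def orchestrator(context):
    # Step 1: validate
    valid = yield ("activity", "validate_data", context["input"])
    if not valid:
        return "Validation failed"
    # Step 2: fan out two transforms, then wait for both
    yield ("all", [("activity", "transform_sales", None),
                   ("activity", "transform_customers", None)])
    # Step 3: load
    result = yield ("activity", "load_to_warehouse", None)
    return result

def run_orchestration(orchestrator_fn, activities, input_value):
    """Drive the generator: execute each yielded task and send its result back in."""
    gen = orchestrator_fn({"input": input_value})
    try:
        task = next(gen)
        while True:
            if task[0] == "all":
                outcome = [activities[name](arg) for _, name, arg in task[1]]
            else:
                _, name, arg = task
                outcome = activities[name](arg)
            task = gen.send(outcome)
    except StopIteration as finished:
        # The generator's return value travels on StopIteration
        return finished.value

activities = {
    "validate_data": lambda rows: len(rows) > 0,
    "transform_sales": lambda _: "sales ok",
    "transform_customers": lambda _: "customers ok",
    "load_to_warehouse": lambda _: "loaded",
}
print(run_orchestration(orchestrator, activities, [1, 2, 3]))
print(run_orchestration(orchestrator, activities, []))
```

Because the real runtime re-executes the orchestrator from the start on every resume, orchestrator code should stay deterministic and leave I/O to activities.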

**Best Practices:**
- Idempotent functions (rerun-safe)
- Application Insights for monitoring
- Key Vault for secrets
- Consumption plan for sporadic workloads
- Premium plan for VNet integration
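
Idempotency matters because triggers can deliver the same blob or event more than once. One possible approach, sketched below, is to derive a key from the input and skip work already recorded under that key. `idempotency_key` and `ProcessedLog` are hypothetical names, and the in-memory set stands in for a durable store such as a table or a marker blob.

```python
import hashlib
from typing import Callable

def idempotency_key(blob_name: str, content: bytes) -> str:
    """Same name and same bytes always produce the same key."""
    digest = hashlib.sha256(content).hexdigest()[:16]
    return f"{blob_name}:{digest}"

class ProcessedLog:
    """In-memory stand-in for a durable record of completed work."""

    def __init__(self) -> None:
        self._seen = set()

    def run_once(self, key: str, action: Callable[[], None]) -> bool:
        """Run `action` unless `key` was already processed; return True if it ran."""
        if key in self._seen:
            return False
        action()
        # Record the key only after the action succeeded, so failures are retried
        self._seen.add(key)
        return True

log = ProcessedLog()
key = idempotency_key("raw/ventas.csv", b"venta_id,total\n1,10\n")
print(log.run_once(key, lambda: print("processing")))
print(log.run_once(key, lambda: print("processing")))
```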

**Pricing:**
- Consumption: $0.20 per 1M executions + $0.000016/GB-s
- Premium: from about $150/month (dedicated instances)
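
To see how the Consumption rates combine, the sketch below estimates a monthly bill from the two prices listed above. It assumes the monthly free grant of 1 million executions and 400,000 GB-s, and it ignores billing details such as memory rounding and minimum execution time, so treat the result as an order-of-magnitude figure. `consumption_monthly_cost` is a hypothetical helper.

```python
def consumption_monthly_cost(executions: int,
                             avg_duration_s: float,
                             memory_gb: float,
                             price_per_million: float = 0.20,
                             price_per_gb_s: float = 0.000016,
                             free_executions: int = 1_000_000,
                             free_gb_s: float = 400_000) -> float:
    """Estimate the monthly cost in USD for a Consumption plan function."""
    gb_seconds = executions * avg_duration_s * memory_gb
    billable_executions = max(0, executions - free_executions)
    billable_gb_s = max(0.0, gb_seconds - free_gb_s)
    return (billable_executions / 1_000_000 * price_per_million
            + billable_gb_s * price_per_gb_s)

# 3M executions per month, 1 second each, 0.5 GB of memory
print(round(consumption_monthly_cost(3_000_000, 1.0, 0.5), 2))
```

For sporadic workloads the free grant often covers the whole month, which is why the Consumption plan is the default recommendation in the list above.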

---

In [None]:
# Complete Azure Function example
azure_function_complete = '''
# function_app.py
import azure.functions as func
import logging
import json
from datetime import datetime

app = func.FunctionApp()

@app.blob_trigger(
    arg_name="inputBlob",
    path="data-lake/raw/ventas/{name}",
    connection="AzureWebJobsStorage"
)
@app.blob_output(
    arg_name="outputBlob",
    path="data-lake/curated/ventas/{name}",
    connection="AzureWebJobsStorage"
)
def validate_and_move(inputBlob: func.InputStream, outputBlob: func.Out[bytes]):
    """
    Validates the CSV and moves it to the curated zone if it is valid.
    """
    logging.info(f"🔍 Validating: {inputBlob.name}")
    
    try:
        import pandas as pd
        from io import BytesIO
        
        # Read CSV
        content = inputBlob.read()
        df = pd.read_csv(BytesIO(content))
        
        # Validations
        errors = []
        
        if 'venta_id' not in df.columns:
            errors.append("Missing venta_id column")
        
        if 'total' not in df.columns:
            errors.append("Missing total column")
        else:
            if df['total'].isnull().sum() > 0:
                errors.append(f"{df['total'].isnull().sum()} null values in total")
        
        if errors:
            logging.error(f"❌ Validation failed: {errors}")
            # Write to DLQ (dead letter queue)
            return
        
        # Clean
        df_clean = df[df['total'] > 0].copy()  # copy before adding columns
        
        # Add metadata
        df_clean['validated_at'] = datetime.utcnow().isoformat()
        
        # Write to output
        csv_output = df_clean.to_csv(index=False).encode('utf-8')
        outputBlob.set(csv_output)
        
        logging.info(f"✅ Validated {len(df_clean)} rows")
        
    except Exception as e:
        logging.error(f"💥 Error: {str(e)}")
        raise

# requirements.txt:
# azure-functions
# pandas
'''

print(azure_function_complete)
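
The function above leaves the dead-letter step as a comment. The sketch below shows one way the validation and routing could be separated from the bindings, using only the standard library so it runs without Azure or pandas. `validate_sales_csv` and `route` are hypothetical helpers, and the `errors/` zone name is an assumption borrowed from the exercises.

```python
import csv
import io
from typing import Dict, List, Tuple

def validate_sales_csv(content: bytes) -> Tuple[List[Dict[str, str]], List[str]]:
    """Return (valid_rows, errors); valid_rows is empty whenever errors is not."""
    reader = csv.DictReader(io.StringIO(content.decode("utf-8")))
    columns = reader.fieldnames or []
    errors = [f"Missing {col} column" for col in ("venta_id", "total") if col not in columns]
    if errors:
        return [], errors

    rows = list(reader)
    nulls = sum(1 for row in rows if not (row["total"] or "").strip())
    if nulls:
        return [], [f"{nulls} null values in total"]

    valid = []
    for row in rows:
        try:
            if float(row["total"]) > 0:
                valid.append(row)
        except ValueError:
            return [], [f"Non-numeric total: {row['total']!r}"]
    return valid, []

def route(blob_name: str, errors: List[str]) -> str:
    """Send valid files to curated/ and invalid ones to errors/ (assumed zone name)."""
    zone = "errors" if errors else "curated"
    return blob_name.replace("/raw/", f"/{zone}/", 1)

rows, problems = validate_sales_csv(b"venta_id,total\n1,10.5\n2,-3\n")
print(len(rows), "valid rows ->", route("data-lake/raw/ventas/enero.csv", problems))
```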

## 7. Comparison: Azure vs AWS vs GCP

### 🔄 **Multi-Cloud: Deep Comparison**

| Category | Azure | AWS | GCP |
|-----------|-------|-----|-----|
| **Object Storage** | ADLS Gen2 (HNS + POSIX) | S3 (prefixes) | Cloud Storage (strong consistency) |
| **DW Dedicated** | Synapse dedicated ($1.20/h) | Redshift ($0.25/h node) | BigQuery flat-rate ($2K/mo) |
| **DW Serverless** | Synapse serverless ($5/TB) | Athena ($5/TB) | BigQuery on-demand ($5/TB) |
| **Spark Managed** | Databricks (DBU model) | EMR ($0.27/h + EC2) | Dataproc ($0.01/vCPU-hour + Compute) |
| **ETL Visual** | Data Factory | Glue Studio | Dataflow |
| **Streaming** | Event Hubs + Stream Analytics | Kinesis + Lambda | Pub/Sub + Dataflow |
| **Notebooks** | Synapse + Databricks | SageMaker + EMR | Vertex AI Workbench |
| **Governance** | Purview | Lake Formation | Dataplex |
| **ML Platform** | Azure ML | SageMaker | Vertex AI |
| **BI Native** | Power BI | QuickSight | Looker |

**Strengths by Cloud:**

**Azure:**
- ✅ **Hybrid**: Azure Arc, ExpressRoute (seamless on-prem + cloud)
- ✅ **Microsoft Stack**: AD, Office 365, Power BI integration
- ✅ **Databricks**: Deeper partnership than on AWS
- ✅ **Synapse**: Unified analytics (DW + Spark + Pipelines)
- ❌ More complex than AWS/GCP
- ❌ Scattered documentation

**AWS:**
- ✅ **Maturity**: Battle-tested services since 2006
- ✅ **Breadth**: 200+ services, more connectors
- ✅ **Community**: Widest adoption, more resources
- ✅ **Compliance**: FedRAMP High, HIPAA, PCI DSS
- ❌ Complex pricing
- ❌ Less cohesion between services

**GCP:**
- ✅ **BigQuery**: Best serverless DW on the market
- ✅ **Kubernetes**: GKE leads (creators of K8s)
- ✅ **ML/AI**: Native TensorFlow, TPUs, Vertex AI
- ✅ **Pricing**: Simpler and more transparent
- ❌ Fewer enterprise features
- ❌ Smaller market share (~10%)

**When to choose Azure:**

1. **Microsoft-centric organization:**
   - Active Directory for authentication
   - Office 365, Teams, SharePoint integrated
   - Critical Windows workloads

2. **Hybrid on-prem + cloud:**
   - Azure Stack for on-prem
   - ExpressRoute for dedicated connectivity
   - Arc for unified management

3. **Databricks-heavy:**
   - Closer Azure-Databricks partnership
   - Unity Catalog (governance)
   - Optimized Delta Lake performance

4. **Strict compliance:**
   - GDPR, HIPAA, ISO certifications
   - Government/Finance (common on Azure)

**Multi-Cloud Architecture:**

```
On-Premise (SQL Server)
    ↓ [ExpressRoute]
Azure Data Factory (ingestion)
    ↓
ADLS Gen2 (landing zone)
    ↓
Azure Databricks (transform) → Delta Lake
    ↓
Synapse Analytics (serving) → Power BI
    ↓ [Archive]
AWS S3 Glacier (long-term storage)
```

**IaC Comparison:**

```hcl
# Terraform (multi-cloud)
resource "azurerm_storage_account" "example" {
  name                     = "mystorageaccount"
  resource_group_name      = azurerm_resource_group.example.name
  location                 = "eastus"
  account_tier             = "Standard"
  account_replication_type = "LRS"
  is_hns_enabled           = true  # ADLS Gen2
}

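# vs AWS (buckets are private by default; with AWS provider v4+ ACLs
# are managed through a separate aws_s3_bucket_acl resource)
resource "aws_s3_bucket" "example" {
  bucket = "my-bucket"
}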

# vs GCP
resource "google_storage_bucket" "example" {
  name     = "my-bucket"
  location = "US"
}
```

**Cost Comparison (example workload):**

| Resource | Azure | AWS | GCP |
|----------|-------|-----|-----|
| Storage 1TB | $18/mo | $23/mo | $20/mo |
| DW Serverless 10TB | $50 | $50 | $50 |
| Spark (100 vCore-h) | $150 + DBU | $80 + EC2 | $10 + Compute |
| Orchestration | $100 | $120 | $80 |
| **Total** | **~$318** | **~$273** | **~$160** |

*Note: Approximate prices; they vary by region and commitment. Totals exclude the variable DBU/EC2/Compute charges in the Spark row.*
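
The totals in the table are plain sums of the fixed figures in each column. The short sketch below reproduces them so the workload can be re-estimated with other numbers; the dictionary keys are labels chosen for this example, and the variable compute charges (DBU, EC2, Compute) are left out, as in the table.

```python
monthly_costs = {
    "Azure": {"storage_1tb": 18, "dw_serverless_10tb": 50, "spark_base": 150, "orchestration": 100},
    "AWS":   {"storage_1tb": 23, "dw_serverless_10tb": 50, "spark_base": 80,  "orchestration": 120},
    "GCP":   {"storage_1tb": 20, "dw_serverless_10tb": 50, "spark_base": 10,  "orchestration": 80},
}

def totals(costs: dict) -> dict:
    """Sum the fixed monthly line items per cloud (USD)."""
    return {cloud: sum(items.values()) for cloud, items in costs.items()}

for cloud, total in totals(monthly_costs).items():
    print(f"{cloud}: ~${total}/month before variable compute")
```

Since the Spark compute charges are excluded everywhere, the ranking can change once cluster size and runtime are priced in.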

---

## 8. Practical Exercises

### 📝 **Exercises**

1. **ADLS Gen2 + Lifecycle:**
   - Create a Storage Account with HNS enabled
   - Upload CSVs to `raw/`
   - Configure a lifecycle policy (Cool at 30 days, Archive at 90 days)
   - Implement POSIX ACLs per directory

2. **Synapse Serverless:**
   - Create an external table over CSVs in ADLS
   - Query with OPENROWSET
   - Partition by date
   - Optimize the query (cost < $0.01)

3. **Data Factory Pipeline:**
   - Copy Activity: SQL → ADLS (Parquet)
   - Data Flow: Filter + Aggregate
   - Trigger: Event-based (blob created)
   - Parameterize with `@pipeline().parameters.date`

4. **Azure Function:**
   - Blob trigger on `raw/`
   - Validate the CSV schema
   - If valid → move to `curated/`
   - If invalid → write to `errors/` with a log

5. **Databricks Delta Lake:**
   - Mount ADLS Gen2
   - Write a partitioned Delta table
   - UPSERT with merge
   - Time travel (query a previous version)
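
As a starting point for the lifecycle part of exercise 1, the sketch below builds a management policy document with the Cool and Archive thresholds from the exercise. The rule name `raw-tiering` and the `data-lake/raw/` prefix are assumptions of this example; the generated JSON would still need to be applied to the storage account, for instance through the portal or the Azure CLI.

```python
import json

def lifecycle_policy(prefix: str, cool_after_days: int = 30, archive_after_days: int = 90) -> dict:
    """Build a lifecycle policy that tiers block blobs under `prefix`."""
    if cool_after_days >= archive_after_days:
        raise ValueError("Cool tiering must happen before archive tiering")
    return {
        "rules": [
            {
                "enabled": True,
                "name": "raw-tiering",  # hypothetical rule name
                "type": "Lifecycle",
                "definition": {
                    # prefixMatch starts with the container name
                    "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {"daysAfterModificationGreaterThan": cool_after_days},
                            "tierToArchive": {"daysAfterModificationGreaterThan": archive_after_days},
                        }
                    },
                },
            }
        ]
    }

print(json.dumps(lifecycle_policy("data-lake/raw/"), indent=2))
```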

**Resources:**
- [Azure Free Account](https://azure.microsoft.com/free/) ($200 credit)
- [Microsoft Learn](https://learn.microsoft.com/azure/data-factory/)
- [Databricks Academy](https://www.databricks.com/learn/training/home)
- Certification: [DP-203 Data Engineering on Azure](https://learn.microsoft.com/certifications/exams/dp-203)

## 9. Conclusion

### 🎯 **Key Takeaways**

**Azure Strengths for Data Engineering:**

1. **Synapse Analytics unifies everything:**
   - DW + Spark + Pipelines in a single service
   - Removes silos between SQL and Spark teams
   - Cost optimization with pause/resume

2. **ADLS Gen2 is superior for Big Data:**
   - Hierarchical namespace (atomic directory operations)
   - POSIX ACLs (granular security)
   - HDFS compatible (native for Spark/Hadoop)

3. **Databricks + Azure:**
   - Unity Catalog (governance)
   - Delta Lake ACID guarantees
   - Photon engine (performance)

4. **Microsoft integration:**
   - Native Power BI
   - Azure DevOps for CI/CD
   - Active Directory SSO

**Limitations:**

- Steep learning curve
- Docs are sometimes inconsistent
- Less "cool factor" than GCP
- Pricing can be opaque

**Next Steps:**

1. **Create an Azure account** (free tier with $200 credit for 30 days)
2. **Complete the labs:**
   - [Microsoft Learn DP-203](https://learn.microsoft.com/training/paths/data-engineering-with-azure-data-factory/)
   - [Databricks Academy](https://www.databricks.com/learn/training/lakehouse-fundamentals)
3. **Certification:** DP-203 (Azure Data Engineer Associate)
4. **Explore:** Unity Catalog, Purview (governance)

**Final Comparison:**

- **AWS**: More services, larger community, mature
- **GCP**: Excellent BigQuery, ML/AI leader, simple pricing
- **Azure**: Hybrid, Microsoft stack, enterprise-grade

**The choice depends on:**
- Current stack (Microsoft → Azure, Google → GCP)
- On-prem requirements (Azure Arc wins)
- Team skills (SQL → Synapse, Spark → Databricks)
- Budget (GCP tends to be cheaper)

**Happy data engineering on Azure! 🚀**

---
**Final Author:** LuisRai (Luis J. Raigoso V.)  
© 2024-2025 - Data Engineering Modular Course