# ☁️ GCP for Data Engineering: BigQuery, Cloud Storage, Dataflow, and Composer

This notebook introduces the Google Cloud Platform (GCP) ecosystem for data engineering. It covers storage in Cloud Storage, transformations with BigQuery and Dataflow, and orchestration with Cloud Composer.

**Author:** LuisRai (Luis J. Raigoso V.)  
**Level:** Mid  
**Duration:** 90-120 minutes

## ⚠️ IMPORTANT REMINDER

### 🚨 NOTEBOOKS vs PRODUCTION

This course uses notebooks for **teaching**, but in your real work:

**❌ Do NOT use notebooks for production pipelines**

**✅ USE:**
- Modular Python scripts in `src/`
- Cloud Composer DAGs (managed Airflow)
- Cloud Functions for event-driven processing
- Cloud Build for CI/CD
- Terraform for Infrastructure as Code

---

**Author:** LuisRai (Luis J. Raigoso V.) | © 2024-2025

---

## Requirements and Execution Notes

- To run real GCP code you need:
  - A GCP project
  - Configured credentials (`gcloud auth application-default login` for the client libraries, or a service account JSON)
  - Billing enabled
  - APIs enabled: BigQuery, Cloud Storage, Dataflow, Composer
- **Never commit credentials to the repository**
- Use environment variables or the Google Cloud SDK
- The examples in this notebook are runnable once you have a GCP project configured

### ☁️ **Google Cloud Platform: Ecosystem for Data Engineering**

**Modern GCP Data Stack:**

1. **Cloud Storage (GCS)**: Object storage for data lakes
   - Similar to S3, with strong consistency from the start
   - Classes: Standard, Nearline (30-day minimum storage duration), Coldline (90 days), Archive (365 days)
   - Pricing: ~$0.020/GB/month (Standard in a single region such as `us-central1`; multi-region is priced higher)

2. **BigQuery**: Serverless data warehouse with SQL
   - Compressed columnar storage
   - Separate compute and storage (with on-demand pricing, compute is billed only for the queries you run)
   - Pricing: $5/TB scanned on-demand when this course was written (the US list price has since risen to $6.25/TiB), or capacity-based slot pricing
   - Streaming inserts: $0.01 per 200 MB

3. **Dataflow**: Stream and batch processing with Apache Beam
   - Serverless, autoscaling
   - Unified model: the same code for batch and streaming
   - Pricing: per vCPU-hour plus GB-hour of worker memory

4. **Cloud Composer**: Fully managed Airflow
   - DAGs in Python, runs on GKE
   - Native integration with GCP services
   - Pricing: by environment size plus compute

5. **Cloud Functions**: Event-driven serverless functions
   - Triggers: HTTP, Cloud Storage, Pub/Sub, Firestore
   - Runtimes: Python 3.7-3.11, Node.js, Go, Java
   - Pricing: per invocation plus compute time

**Reference Architecture:**
```
Sources → [Pub/Sub] → Cloud Storage (raw) → [Dataflow/Cloud Functions]
                           ↓
                    Cloud Storage (curated) → BigQuery (analytics)
                           ↓
                    Cloud Composer (orchestration)
```

**Advantages of GCP for Data:**
- **BigQuery**: Very fast queries on a distributed MPP engine
- **ML integration**: BigQuery ML, Vertex AI
- **Strong consistency**: No eventual-consistency issues in Cloud Storage
- **Kubernetes-native**: GKE for custom workloads

**Comparison with AWS:**

| Service | GCP | AWS |
|----------|-----|-----|
| Object Storage | Cloud Storage | S3 |
| Data Warehouse | BigQuery | Redshift |
| ETL Serverless | Dataflow (Beam) | Glue (PySpark) |
| Streaming | Dataflow + Pub/Sub | Kinesis + Lambda |
| Orchestration | Cloud Composer | MWAA (Airflow) |
| Functions | Cloud Functions | Lambda |

---

## 1. Cloud Storage: Basic Data Lake

### 🗄️ **Cloud Storage: Fundamentals**

**Core Concepts:**

- **Bucket**: Container whose name is globally unique
  - Name: `mi-data-lake-gcp` (lowercase letters, numbers, hyphens)
  - Location: `us-central1`, `europe-west1`, `us` (multi-region)
  - Storage class: Standard, Nearline, Coldline, Archive

- **Object**: File with metadata
  - Key: `gs://bucket/path/to/file.csv`
  - Metadata: Content-Type, custom headers
  - Versioning: Object Versioning (similar to S3)

**Recommended Layout:**

```
gs://my-datalake/
├── raw/                    ← Raw data
│   ├── sales/
│   │   └── 2025/10/30/
│   └── customers/
├── staging/                ← Data in process
└── curated/                ← Processed data
    ├── sales_aggregated/
    └── customer_metrics/
```
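
A date-partitioned layout like the one above is easier to keep consistent when object names come from one helper. The sketch below is only an illustration: `raw_object_name` is a hypothetical function, and the `raw/<dataset>/YYYY/MM/DD/` convention is read off the tree, not prescribed by Cloud Storage.

```python
from datetime import date


def raw_object_name(dataset: str, day: date, filename: str) -> str:
    """Build an object name under raw/<dataset>/YYYY/MM/DD/."""
    # strftime-style format codes zero-pad the month and day
    return f"raw/{dataset}/{day:%Y/%m/%d}/{filename}"


name = raw_object_name("sales", date(2025, 10, 30), "ventas.csv")
print(name)  # raw/sales/2025/10/30/ventas.csv
```

The returned string can be passed directly to `bucket.blob(...)` when uploading.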

**Lifecycle Management:**
```json
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
        "condition": {"age": 30}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"age": 365}
      }
    ]
  }
}
```
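
To see what the lifecycle configuration above does to an object as it ages, the rules can be evaluated in plain Python. This is a simplified model, not the Cloud Storage implementation: it looks only at the `age` condition, and it assumes that `Delete` wins when several rules match. The helper `action_for_age` is hypothetical.

```python
import json

LIFECYCLE = json.loads("""
{
  "lifecycle": {
    "rule": [
      {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
       "condition": {"age": 30}},
      {"action": {"type": "Delete"}, "condition": {"age": 365}}
    ]
  }
}
""")


def action_for_age(config: dict, age_days: int):
    """Return the action type chosen for an object of this age, or None."""
    matched = [
        rule["action"]["type"]
        for rule in config["lifecycle"]["rule"]
        if age_days >= rule["condition"]["age"]
    ]
    if "Delete" in matched:  # assumption: Delete takes precedence
        return "Delete"
    return matched[0] if matched else None


for age in (10, 45, 400):
    print(age, action_for_age(LIFECYCLE, age))
```

With these rules an object stays in Standard for 30 days, moves to Nearline, and is deleted after a year.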

**Operations with the Python Client:**
```python
from google.cloud import storage

client = storage.Client()

# Create a bucket
bucket = client.create_bucket('my-bucket', location='us-central1')

# Upload a file
blob = bucket.blob('raw/ventas.csv')
blob.upload_from_filename('ventas.csv')

# List objects (separate loop variable so `blob` above is not overwritten)
for item in bucket.list_blobs(prefix='raw/'):
    print(item.name)

# Download the file uploaded above
blob.download_to_filename('downloaded.csv')
```

**Costs:**
- Storage: $0.020/GB/month (Standard, single region)
- Class A operations (writes, listings): $0.05 per 10,000 operations
- Class B operations (reads): $0.004 per 10,000 operations
- Network egress: $0.12/GB (to destinations outside GCP)
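
The rates above combine into a rough monthly estimate. The sketch below hard-codes the figures quoted in this section; real prices depend on region and change over time. The workload numbers are made up for illustration.

```python
# Rates quoted in this section (USD); treat them as illustrative
STORAGE_PER_GB_MONTH = 0.020
CLASS_A_PER_10K = 0.05
CLASS_B_PER_10K = 0.004
EGRESS_PER_GB = 0.12


def monthly_cost(stored_gb, class_a_ops, class_b_ops, egress_gb):
    """Sum storage, operation, and egress charges for one month."""
    return (
        stored_gb * STORAGE_PER_GB_MONTH
        + class_a_ops / 10_000 * CLASS_A_PER_10K
        + class_b_ops / 10_000 * CLASS_B_PER_10K
        + egress_gb * EGRESS_PER_GB
    )


# Hypothetical workload: 500 GB stored, 100k writes, 1M reads, 50 GB egress
cost = monthly_cost(stored_gb=500, class_a_ops=100_000,
                    class_b_ops=1_000_000, egress_gb=50)
print(f"${cost:.2f}")  # $16.90
```

In this example storage and egress dominate; operation charges add less than a dollar.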

---

In [None]:
# Configuration (do not run without a real GCP project)
PROJECT_ID = 'mi-proyecto-gcp'
BUCKET_NAME = f'{PROJECT_ID}-datalake'
LOCATION = 'us-central1'

print(f'📋 Configuration: {PROJECT_ID} / {BUCKET_NAME}')
print('⚠️ Requires: gcloud auth application-default login or a service account JSON')

### 1.1 Create a bucket and upload data

In [None]:
# Example code (uncomment with a real project)
'''
from google.api_core.exceptions import Conflict
from google.cloud import storage
import pandas as pd

# Initialize the client
client = storage.Client(project=PROJECT_ID)

# Create the bucket, or reuse it if it already exists
try:
    bucket = client.create_bucket(BUCKET_NAME, location=LOCATION)
    print(f'✅ Bucket {BUCKET_NAME} created')
except Conflict:
    print(f'ℹ️ Bucket {BUCKET_NAME} already exists')
    bucket = client.bucket(BUCKET_NAME)

# Upload the CSV
df = pd.read_csv('../../datasets/raw/ventas.csv')
csv_string = df.to_csv(index=False)

blob = bucket.blob('raw/ventas/ventas_2025_10.csv')
blob.upload_from_string(csv_string, content_type='text/csv')
print('📤 File uploaded to Cloud Storage')
'''
print('Example code ready to run with a GCP project')

## 2. BigQuery: Serverless Data Warehouse

### 📊 **BigQuery: SQL Analytics at Scale**

**Architecture:**

BigQuery separates **storage** and **compute**:
- Storage: Columnar format (Capacitor) with automatic compression
- Compute: Dremel engine (distributed MPP with thousands of workers)

**Benefits:**
- Queries over terabytes to petabytes in seconds to minutes
- No clusters to manage (fully serverless)
- Standard SQL (ANSI SQL:2011 compliant)
- Integration with BI tools (Looker, Tableau, Looker Studio, formerly Data Studio)

**Concepts:**

1. **Dataset**: Logical container for tables
   - Similar to a "database" in traditional SQL
   - Permissions can be granted at the dataset level

2. **Table**: Structured data
   - Native tables: Data in BigQuery storage
   - External tables: Data in GCS (federated queries)
   - Partitioned tables: By date or integer range (reduces bytes scanned)
   - Clustered tables: Sorted by specific columns (faster filtered queries)

3. **View**: Saved query
   - Authorized views: Fine-grained access control

**Partitioning:**
```sql
-- Table partitioned by date
CREATE TABLE `project.dataset.sales_partitioned`
PARTITION BY DATE(order_date)
AS SELECT * FROM `project.dataset.sales_raw`;

-- Optimized query (scans a single daily partition).
-- Assumes order_date is a TIMESTAMP or DATETIME, so filter on a one-day range.
SELECT * FROM `project.dataset.sales_partitioned`
WHERE order_date >= '2025-10-30'
  AND order_date < '2025-10-31';
```

**Clustering:**
```sql
-- Table clustered by customer_id, product_id
CREATE TABLE `project.dataset.sales_clustered`
PARTITION BY DATE(order_date)
CLUSTER BY customer_id, product_id
AS SELECT * FROM `project.dataset.sales_raw`;

-- Query that benefits from clustering (data is sorted by customer_id)
SELECT * FROM `project.dataset.sales_clustered`
WHERE customer_id = 12345;
```

**BigQuery ML (built-in machine learning):**
```sql
-- Create a logistic regression model.
-- customer_id is left out of the SELECT: an identifier is not a useful feature.
CREATE OR REPLACE MODEL `project.dataset.churn_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  age,
  total_purchases,
  churned AS label
FROM `project.dataset.customers`;

-- Predict
SELECT * FROM ML.PREDICT(
  MODEL `project.dataset.churn_model`,
  (SELECT * FROM `project.dataset.new_customers`)
);
```

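**Best Practices:**
- Partition large tables (>1 GB)
- Avoid `SELECT *` (list the columns you need)
- Use `_TABLE_SUFFIX` to filter wildcard tables
- Take advantage of result caching (cached results are free for about 24 hours)
- Use slot reservations for predictable workloads

**Pricing:**
- On-demand: $5/TB scanned when this course was written; the US list price has since risen to $6.25/TiB
- Flat-rate: from $2,000/month (100 slots) at the time; flat-rate plans have since been replaced by capacity-based BigQuery editions
- Storage: $0.020/GB active, $0.010/GB long-term (tables unmodified for 90+ days)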
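
On-demand billing makes the benefit of partitioning easy to quantify. The sketch below uses the $5/TB figure quoted in this section as an explicit parameter, so a current rate can be substituted. The table size (2 GB per day for a year) is a made-up example, and `on_demand_cost` is a hypothetical helper.

```python
def on_demand_cost(bytes_scanned: int, price_per_tb: float) -> float:
    """Cost in USD of an on-demand query, counting 10**12 bytes as 1 TB."""
    return bytes_scanned / 1e12 * price_per_tb


PRICE = 5.0  # USD per TB; the figure used in this notebook, not necessarily current

full_scan = on_demand_cost(365 * 2 * 10**9, PRICE)  # whole year, 2 GB per day
one_day = on_demand_cost(2 * 10**9, PRICE)          # a single daily partition
print(f"full: ${full_scan:.2f}, one partition: ${one_day:.4f}")
# full: $3.65, one partition: $0.0100
```

Filtering on the partition column cuts the bytes scanned, and with them the cost, by a factor of 365 in this example.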

---

In [None]:
# BigQuery configuration
DATASET_ID = 'data_engineering_course'
TABLE_ID = 'ventas'

print(f'📊 Dataset: {PROJECT_ID}.{DATASET_ID}')
print(f'📋 Table: {TABLE_ID}')

### 2.1 Create a dataset and load a table from Cloud Storage

In [None]:
# BigQuery example
'''
from google.cloud import bigquery

client = bigquery.Client(project=PROJECT_ID)

# Create the dataset (exists_ok=True makes the call safe to repeat)
dataset_ref = bigquery.DatasetReference(PROJECT_ID, DATASET_ID)
dataset = bigquery.Dataset(dataset_ref)
dataset.location = LOCATION
dataset = client.create_dataset(dataset, exists_ok=True)
print(f'✅ Dataset {DATASET_ID} ready')

# Load the CSV from GCS into BigQuery
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField('venta_id', 'INTEGER'),
        bigquery.SchemaField('cliente_id', 'INTEGER'),
        bigquery.SchemaField('producto_id', 'INTEGER'),
        bigquery.SchemaField('cantidad', 'INTEGER'),
        bigquery.SchemaField('total', 'FLOAT'),
        bigquery.SchemaField('fecha', 'DATE'),
    ],
    skip_leading_rows=1,
    source_format=bigquery.SourceFormat.CSV,
    write_disposition='WRITE_TRUNCATE',
)

uri = f'gs://{BUCKET_NAME}/raw/ventas/ventas_2025_10.csv'
table_ref = dataset_ref.table(TABLE_ID)

load_job = client.load_table_from_uri(uri, table_ref, job_config=job_config)
load_job.result()  # Wait for job to complete

print(f'✅ Loaded {load_job.output_rows} rows into {TABLE_ID}')
'''
print('BigQuery code ready to run')

### 2.2 SQL queries in BigQuery

In [None]:
# Example queries
'''
# Simple query
query = f"""
SELECT 
  cliente_id,
  SUM(total) as total_ventas,
  COUNT(*) as num_ventas
FROM `{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}`
GROUP BY cliente_id
ORDER BY total_ventas DESC
LIMIT 10
"""

query_job = client.query(query)
results = query_job.result()

for row in results:
    print(f'Customer {row.cliente_id}: ${row.total_ventas:.2f} ({row.num_ventas} sales)')

# Job metadata
PRICE_PER_TB = 5  # USD, on-demand rate assumed in this course; check current pricing
print(f'\n📊 Query Stats:')
print(f'  Bytes processed: {query_job.total_bytes_processed / 1e9:.2f} GB')
print(f'  Bytes billed: {query_job.total_bytes_billed / 1e9:.2f} GB')
print(f'  Cost estimate: ${(query_job.total_bytes_billed / 1e12) * PRICE_PER_TB:.4f}')
'''
print('BigQuery queries with cost estimation')

## 3. Dataflow: Processing with Apache Beam

### 🌊 **Dataflow: Unified Stream/Batch Processing**

**Apache Beam Concepts:**

Beam is a unified programming model for batch and streaming:

```
Pipeline → PCollection → Transform → PCollection → ...
```

- **Pipeline**: Graph of transformations
- **PCollection**: Distributed, immutable collection of data
- **Transform**: Operation on a PCollection (Map, Filter, GroupByKey, etc.)

**Runners:**
- DirectRunner: Local (testing)
- DataflowRunner: GCP managed service
- FlinkRunner: Apache Flink
- SparkRunner: Apache Spark

**Basic Pattern:**
```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--project=my-project',
    '--region=us-central1',
    '--runner=DataflowRunner',
    '--temp_location=gs://my-bucket/temp',
])

with beam.Pipeline(options=options) as pipeline:
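    (pipeline
     # skip_header_lines=1 drops the CSV header so float(row[2]) does not fail on it
     | 'Read' >> beam.io.ReadFromText('gs://input/*.csv', skip_header_lines=1)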
     | 'Parse' >> beam.Map(lambda line: line.split(','))
     | 'Filter' >> beam.Filter(lambda row: float(row[2]) > 100)
     | 'Format' >> beam.Map(lambda row: f'{row[0]},{row[1]}')
     | 'Write' >> beam.io.WriteToText('gs://output/result'))
```

**Windowing (Streaming):**
```python
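import apache_beam as beam
from apache_beam import window

# `events` is assumed to be an unbounded PCollection of (key, value) pairs,
# for example parsed from Pub/Sub messages.
(events
 | 'Window' >> beam.WindowInto(window.FixedWindows(60))  # 1-min windows
 | 'Sum' >> beam.CombinePerKey(sum)
 # WriteToBigQuery expects dict rows, so convert each (key, total) pair
 | 'ToRow' >> beam.Map(lambda kv: {'key': kv[0], 'total': kv[1]})
 | 'Write' >> beam.io.WriteToBigQuery(
     'project:dataset.table', schema='key:STRING,total:FLOAT'))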
```
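
Fixed windowing can be understood without Beam: each event is assigned to the window `[start, start + size)` that contains its event time, and aggregation then happens per key and per window. The following plain-Python sketch mimics that behavior for a small in-memory list. It ignores watermarks, triggers, and late data, and both helper functions are hypothetical.

```python
from collections import defaultdict


def fixed_window_start(timestamp: float, size: int) -> int:
    """Start of the fixed window [start, start + size) containing timestamp."""
    return int(timestamp - timestamp % size)


def sum_per_key_and_window(events, size=60):
    """events: iterable of (key, value, event_time_in_seconds)."""
    totals = defaultdict(float)
    for key, value, ts in events:
        totals[(key, fixed_window_start(ts, size))] += value
    return dict(totals)


events = [("a", 1.0, 5), ("a", 2.0, 59), ("a", 4.0, 60), ("b", 8.0, 61)]
print(sum_per_key_and_window(events))
# {('a', 0): 3.0, ('a', 60): 4.0, ('b', 60): 8.0}
```

The event at second 60 falls into the second window because window ends are exclusive.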

**Common Transforms:**

1. **ParDo** (Parallel Do):
   ```python
   class ExtractFields(beam.DoFn):
       def process(self, element):
           parts = element.split(',')
           yield {'id': int(parts[0]), 'value': float(parts[1])}
   
   data | beam.ParDo(ExtractFields())
   ```

2. **GroupByKey**:
   ```python
   # (key, value) pairs → (key, [values])
   pairs | beam.GroupByKey()
   ```

3. **CombinePerKey**:
   ```python
   # More efficient than GroupByKey + Map
   pairs | beam.CombinePerKey(sum)
   ```

4. **Flatten**:
   ```python
   # Merge multiple PCollections
   (pcoll1, pcoll2, pcoll3) | beam.Flatten()
   ```
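
The efficiency difference between `GroupByKey` followed by a sum and `CombinePerKey(sum)` comes from partial aggregation: a combiner can pre-aggregate on each worker and ship one subtotal per key instead of every value. The plain-Python sketch below models workers as lists ("shards"). It illustrates the idea only and does not describe how the Dataflow runner schedules work.

```python
from collections import defaultdict


def group_then_sum(pairs):
    """GroupByKey + Map(sum): every value is collected before summing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


def combine_per_key(shards):
    """CombinePerKey(sum): each shard pre-aggregates, then subtotals are merged."""
    partials = []
    for shard in shards:
        local = defaultdict(int)
        for key, value in shard:
            local[key] += value
        partials.append(local)  # one subtotal per key leaves each shard
    merged = defaultdict(int)
    for local in partials:
        for key, subtotal in local.items():
            merged[key] += subtotal
    return dict(merged)


shards = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]
flat = [pair for shard in shards for pair in shard]
assert group_then_sum(flat) == combine_per_key(shards)
print(combine_per_key(shards))  # {'a': 8, 'b': 7}
```

Both functions return the same totals; the second moves four subtotals between stages instead of five raw values, and the gap widens as shards grow.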

**Side Inputs (Broadcasting):**
```python
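# AsDict needs (key, value) tuples, so split each line once into a 2-tuple
lookup_table = (pipeline
                | 'Read Lookup' >> beam.io.ReadFromText('gs://lookup.csv')
                | 'Parse' >> beam.Map(lambda x: tuple(x.split(',', 1))))

# `enrich` stands for your own function; `table` arrives as a plain dict
main_data | beam.Map(
    lambda x, table: enrich(x, table),
    beam.pvalue.AsDict(lookup_table)
)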
```

**Dataflow vs Alternatives:**

| Aspect | Dataflow | Spark | Flink |
|---------|----------|-------|-------|
| **Model** | Beam (unified) | RDD/DataFrame | DataStream |
| **Serverless** | ✅ Yes | ❌ No (EMR/Databricks) | ❌ No |
| **Exactly-once** | ✅ Yes | ⚠️ Hard | ✅ Yes |
| **Late Data** | ✅ Excellent | ⚠️ Manual | ✅ Good |
| **Learning Curve** | Medium | Low | High |

**Use Cases:**
- Large-scale ETL (TB to PB)
- Real-time analytics (con Pub/Sub)
- ML feature engineering
- Data quality validation

---

In [None]:
# Example Dataflow pipeline
dataflow_example = '''
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class ParseCSV(beam.DoFn):
    def process(self, element):
        parts = element.split(',')
        yield {
            'venta_id': int(parts[0]),
            'cliente_id': int(parts[1]),
            'total': float(parts[4])
        }

class FilterHighValue(beam.DoFn):
    def process(self, element):
        if element['total'] > 100:
            yield element

options = PipelineOptions([
    '--project=PROJECT_ID',
    '--region=us-central1',
    '--runner=DataflowRunner',
    '--temp_location=gs://BUCKET/temp',
    '--staging_location=gs://BUCKET/staging',
])

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | 'Read GCS' >> beam.io.ReadFromText('gs://BUCKET/raw/ventas/*.csv')
     | 'Skip Header' >> beam.Filter(lambda line: not line.startswith('venta_id'))
     | 'Parse' >> beam.ParDo(ParseCSV())
     | 'Filter' >> beam.ParDo(FilterHighValue())
     | 'Aggregate' >> beam.Map(lambda x: (x['cliente_id'], x['total']))
     | 'Group' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
     | 'Write' >> beam.io.WriteToText('gs://BUCKET/curated/ventas_summary'))
'''

print(dataflow_example)
print('\n💡 To run: python dataflow_job.py')

## 4. Cloud Composer: Managed Airflow

### 🎼 **Cloud Composer: Airflow on GCP**

**What is Cloud Composer?**

A fully managed Apache Airflow service on GCP:
- Built on GKE (Google Kubernetes Engine)
- Autoscaling workers
- Native integration with GCP services
- Monitoring with Cloud Logging/Monitoring

**Components:**

1. **Environment**: Dedicated Airflow cluster
   - Web server (UI)
   - Scheduler
   - Workers (Celery or Kubernetes)
   - Database (Cloud SQL PostgreSQL)

2. **DAGs Folder**: Automatically created GCS bucket
   - `gs://[bucket]/dags/`
   - DAGs sync automatically after upload

**GCP Operators:**

```python
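from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateEmptyDatasetOperator,
    BigQueryInsertJobOperator,
)
from airflow.providers.google.cloud.operators.gcs import (
    GCSCreateBucketOperator,
    GCSDeleteBucketOperator,
)
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

PROJECT_ID = 'my-project'  # placeholder project ID

# start_date is required for scheduling; newer Airflow releases name the
# schedule argument `schedule` instead of `schedule_interval`
with DAG(
    'gcp_pipeline',
    schedule_interval='@daily',
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag: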
    
    create_dataset = BigQueryCreateEmptyDatasetOperator(
        task_id='create_dataset',
        dataset_id='my_dataset',
        project_id=PROJECT_ID,
    )
    
    load_to_bq = GCSToBigQueryOperator(
        task_id='load_csv_to_bq',
        bucket='my-bucket',
        source_objects=['raw/ventas/*.csv'],
        destination_project_dataset_table=f'{PROJECT_ID}.my_dataset.ventas',
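        # The schema must list every CSV column (same columns as in section 2.1)
        schema_fields=[
            {'name': 'venta_id', 'type': 'INTEGER'},
            {'name': 'cliente_id', 'type': 'INTEGER'},
            {'name': 'producto_id', 'type': 'INTEGER'},
            {'name': 'cantidad', 'type': 'INTEGER'},
            {'name': 'total', 'type': 'FLOAT'},
            {'name': 'fecha', 'type': 'DATE'},
        ],
        skip_leading_rows=1,  # the CSV has a header row
        write_disposition='WRITE_TRUNCATE',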
    )
    
    run_query = BigQueryInsertJobOperator(
        task_id='aggregate_sales',
        configuration={
            'query': {
                'query': '''
                    CREATE OR REPLACE TABLE `{}.my_dataset.sales_summary` AS
                    SELECT cliente_id, SUM(total) as total
                    FROM `{}.my_dataset.ventas`
                    GROUP BY cliente_id
                '''.format(PROJECT_ID, PROJECT_ID),
                'useLegacySql': False,
            }
        },
    )
    
    create_dataset >> load_to_bq >> run_query
```
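
The `>>` chain at the end of the DAG declares dependencies, and the scheduler only starts a task once its upstream tasks have succeeded. The resulting execution order can be reproduced with the standard library's `graphlib` (Python 3.9+). This is a sketch of the ordering logic only, not of Airflow's scheduler; the `upstream` dictionary is written by hand to mirror the DAG above.

```python
from graphlib import TopologicalSorter

# task -> set of upstream tasks, mirroring
# create_dataset >> load_to_bq >> run_query
upstream = {
    "create_dataset": set(),
    "load_to_bq": {"create_dataset"},
    "run_query": {"load_to_bq"},
}

order = list(TopologicalSorter(upstream).static_order())
print(order)  # ['create_dataset', 'load_to_bq', 'run_query']
```

Adding a cycle (for example making `create_dataset` depend on `run_query`) makes `static_order()` raise `graphlib.CycleError`, which is the same class of mistake Airflow rejects when it parses a DAG.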

**Dataflow Operator:**
```python
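# Recent Google provider releases deprecate this operator in favor of
# BeamRunPythonPipelineOperator from the Apache Beam provider.
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowCreatePythonJobOperator,
)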

run_dataflow = DataflowCreatePythonJobOperator(
    task_id='dataflow_etl',
    py_file='gs://my-bucket/dataflow/pipeline.py',
    job_name='ventas-etl',
    options={
        'project': PROJECT_ID,
        'region': 'us-central1',
        'tempLocation': 'gs://my-bucket/temp',
    },
)
```

**Environment Variables & Connections:**

```python
from airflow.models import Variable

# Airflow Variables (set in the UI or with the CLI)
PROJECT_ID = Variable.get('gcp_project_id')
BUCKET = Variable.get('data_bucket')

# Connections (GCP → Airflow Connection)
# ID: google_cloud_default
# Type: Google Cloud
# Keyfile JSON: [service account JSON]
```

**Pricing:**
- Environment: ~$300-500/month (small)
- Compute: Workers + scheduler
- Storage: GCS for DAGs + logs
- Database: Cloud SQL (managed)

**Alternatives:**
- Managed Airflow on AWS (MWAA)
- Self-hosted Airflow (Kubernetes)
- Cloud Workflows (simple orchestration)

---

In [None]:
# Example DAG for Composer
composer_dag = '''
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'start_date': datetime(2025, 1, 1),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'ventas_daily_etl',
    default_args=default_args,
    description='Daily sales pipeline',
    schedule_interval='0 2 * * *',  # 2 AM daily
    catchup=False,
    tags=['ventas', 'etl', 'gcp'],
) as dag:
    
    load_raw = GCSToBigQueryOperator(
        task_id='load_raw_data',
        bucket='my-datalake',
        source_objects=['raw/ventas/{{ ds }}/*.csv'],
        destination_project_dataset_table='project.dataset.ventas_raw',
        write_disposition='WRITE_APPEND',
        skip_leading_rows=1,
    )
    
    transform = BigQueryInsertJobOperator(
        task_id='transform_data',
        configuration={
            'query': {
                'query': """
                    CREATE OR REPLACE TABLE `project.dataset.ventas_clean` AS
                    SELECT 
                        venta_id,
                        cliente_id,
                        CAST(total AS FLOAT64) as total,
                        DATE(fecha) as fecha
                    FROM `project.dataset.ventas_raw`
                    WHERE fecha = '{{ ds }}'
                      AND total > 0
                """,
                'useLegacySql': False,
            }
        },
    )
    
    load_raw >> transform

# Upload to: gs://[composer-bucket]/dags/ventas_etl.py
'''

print(composer_dag)
print('\n💡 Upload to the Cloud Composer DAGs folder in GCS')
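
The DAG above relies on Airflow's `{{ ds }}` template variable, which Airflow replaces with the run's logical date in `YYYY-MM-DD` form before the operator executes. The sketch below imitates that substitution with plain string replacement so the resulting object path can be checked locally. `render_ds` is a hypothetical helper; Airflow itself renders templates with Jinja.

```python
from datetime import date


def render_ds(template: str, logical_date: date) -> str:
    """Imitate Airflow's `{{ ds }}` variable: the logical date as YYYY-MM-DD."""
    return template.replace("{{ ds }}", logical_date.isoformat())


print(render_ds("raw/ventas/{{ ds }}/*.csv", date(2025, 10, 30)))
# raw/ventas/2025-10-30/*.csv
```

Note that the rendered prefix uses dashes (`2025-10-30`), so the objects in the bucket need to follow that layout, or the template needs to be adjusted, for the load task to find any files.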

## 5. Cloud Functions: Event-Driven Processing

### ⚡ **Cloud Functions: Serverless for Data**

**Available Triggers:**

1. **HTTP**: API endpoints
2. **Cloud Storage**: Object created/deleted/updated
3. **Pub/Sub**: Message queue events
4. **Firestore**: Document changes
5. **Cloud Scheduler**: Cron jobs (delivered through an HTTP or Pub/Sub trigger)

**Example: Process a CSV when it is uploaded to GCS**

```python
# main.py
from google.cloud import bigquery
import pandas as pd

def process_csv(event, context):
    """
    Triggered by Cloud Storage when CSV uploaded.
    
    Args:
        event (dict): Event payload (file metadata)
        context (google.cloud.functions.Context): Event context
    """
    file_name = event['name']
    bucket_name = event['bucket']
    
    print(f'Processing file: gs://{bucket_name}/{file_name}')
    
    # Skip if not in raw/ prefix
    if not file_name.startswith('raw/'):
        print('Skipping non-raw file')
        return
    
    # Read CSV from GCS (pandas needs the gcsfs package to open gs:// paths)
    gcs_uri = f'gs://{bucket_name}/{file_name}'
    df = pd.read_csv(gcs_uri)
    
    # Transform
    df_clean = df.dropna()
    df_clean['total'] = df_clean['total'].astype(float)
    
    # Load to BigQuery
    client = bigquery.Client()
    table_id = 'project.dataset.ventas_processed'
    
    job_config = bigquery.LoadJobConfig(
        write_disposition='WRITE_APPEND',
    )
    
    job = client.load_table_from_dataframe(df_clean, table_id, job_config=job_config)
    job.result()
    
    print(f'Loaded {len(df_clean)} rows to {table_id}')
```

**requirements.txt:**
```
google-cloud-bigquery
google-cloud-storage
pandas
gcsfs
pyarrow
```

**Deploy:**
```bash
gcloud functions deploy process_csv \
  --runtime python310 \
  --trigger-resource my-bucket \
  --trigger-event google.storage.object.finalize \
  --entry-point process_csv \
  --region us-central1 \
  --memory 512MB \
  --timeout 300s
```

**Best Practices:**
- Write idempotent functions (the same event may be delivered more than once)
- Keep execution under 9 minutes (1st gen functions time out at 540 s)
- Keep dependencies lightweight (they affect cold-start time)
- Use Pub/Sub for retry logic
- Use Cloud Run for workloads longer than 9 minutes
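
One common way to make a storage-triggered function idempotent is to remember which object versions have already been handled and skip repeats. The sketch below keeps that record in an in-memory set purely for illustration; a deployed function would need a durable store, since instances do not share memory. `handle_once` is a hypothetical helper, and the key assumes the event carries the object's `bucket`, `name`, and `generation` fields.

```python
processed = set()  # stand-in for a durable store such as a database table


def handle_once(event: dict) -> bool:
    """Process a storage event unless the same object version was seen before."""
    key = (event["bucket"], event["name"], event["generation"])
    if key in processed:
        return False  # duplicate delivery: skip the side effects
    # ... validate and load the file here ...
    processed.add(key)
    return True


event = {"bucket": "my-bucket", "name": "raw/ventas.csv", "generation": "1"}
print(handle_once(event), handle_once(event))  # True False
```

Including `generation` in the key means a re-upload of the same path counts as new data, while a redelivered event for the same upload is ignored.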

**Pub/Sub + Cloud Functions:**
```python
import base64
import json

def process_message(event, context):
    """Triggered from Pub/Sub topic"""
    pubsub_message = base64.b64decode(event['data']).decode('utf-8')
    data = json.loads(pubsub_message)
    
    print(f'Processing message: {data}')
    # ETL logic here
```
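
A handler like the one above can be exercised locally by building an event the way Pub/Sub delivers it: the message body is base64-encoded under the `data` key. The sketch below is a minimal local harness under that assumption. `decode_pubsub` is a hypothetical helper that repeats the decoding step, and the payload reuses the sample message from this section.

```python
import base64
import json


def decode_pubsub(event: dict) -> dict:
    """Decode the base64-encoded JSON payload of a Pub/Sub-triggered event."""
    return json.loads(base64.b64decode(event["data"]).decode("utf-8"))


# Build a fake event: JSON text, UTF-8 bytes, then base64
payload = {"file": "gs://bucket/data.csv", "type": "sales"}
fake_event = {
    "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
}

print(decode_pubsub(fake_event))
# {'file': 'gs://bucket/data.csv', 'type': 'sales'}
```

Calling the real entry point with `fake_event` and `None` as the context is usually enough for a unit test that does not touch GCP.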

**Publish to Pub/Sub:**
```bash
gcloud pubsub topics publish data-events \
  --message '{"file": "gs://bucket/data.csv", "type": "sales"}'
```

---

In [None]:
# Complete Cloud Function example
cloud_function_example = '''
# main.py - Cloud Function that validates and loads data
from google.cloud import bigquery, storage
import pandas as pd
import logging

def validate_and_load(event, context):
    """
    Cloud Function triggered by Cloud Storage.
    Validates a CSV and loads it into BigQuery.
    """
    file_name = event['name']
    bucket_name = event['bucket']
    
    logging.info(f'File: gs://{bucket_name}/{file_name}')
    
    # Read from GCS
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(file_name)
    content = blob.download_as_text()
    
    # Parse CSV
    from io import StringIO
    df = pd.read_csv(StringIO(content))
    
    # Validations (report only the columns that are actually missing)
    required_cols = ['venta_id', 'cliente_id', 'total']
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        raise ValueError(f'Missing columns: {missing}')
    
    if df['total'].isnull().sum() > 0:
        raise ValueError('Null values found in total column')
    
    # Clean
    df['total'] = df['total'].astype(float)
    df = df[df['total'] > 0]
    
    # Load to BigQuery
    bq_client = bigquery.Client()
    table_id = 'project.dataset.ventas'
    
    job = bq_client.load_table_from_dataframe(
        df, 
        table_id,
        job_config=bigquery.LoadJobConfig(
            write_disposition='WRITE_APPEND'
        )
    )
    job.result()
    
    logging.info(f'✅ Loaded {len(df)} rows')
    return f'Success: {len(df)} rows'

# Deploy:
# gcloud functions deploy validate_and_load \\
#   --runtime python310 \\
#   --trigger-resource my-bucket \\
#   --trigger-event google.storage.object.finalize \\
#   --entry-point validate_and_load
'''

print(cloud_function_example)
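
The schema check in the function above can be tried without pandas or GCP by reading only the header row with the standard library. This is a minimal sketch: `missing_columns` is a hypothetical helper, and the required column names are the ones used in the example function.

```python
import csv
from io import StringIO

REQUIRED = ("venta_id", "cliente_id", "total")


def missing_columns(csv_text: str, required=REQUIRED) -> list:
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(StringIO(csv_text)), [])
    return [col for col in required if col not in header]


print(missing_columns("venta_id,cliente_id,total\n1,7,19.9\n"))  # []
print(missing_columns("venta_id,total\n1,19.9\n"))               # ['cliente_id']
```

Returning the list of missing names, instead of a boolean, gives the caller something specific to log or to publish to an error topic.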

## 6. Comparison: GCP vs AWS vs Azure

### 🔄 **Multi-Cloud Comparison**

| Service | GCP | AWS | Azure |
|----------|-----|-----|-------|
| **Object Storage** | Cloud Storage | S3 | Blob Storage |
| **Data Warehouse** | BigQuery | Redshift | Synapse Analytics |
| **ETL Serverless** | Dataflow (Beam) | Glue (PySpark) | Data Factory |
| **Streaming** | Pub/Sub + Dataflow | Kinesis + Lambda | Event Hubs + Stream Analytics |
| **Orchestration** | Cloud Composer | MWAA (Airflow) | Data Factory |
| **Serverless Compute** | Cloud Functions | Lambda | Functions |
| **Notebooks** | Vertex AI Workbench | SageMaker | Machine Learning Studio |
| **ML Platform** | Vertex AI | SageMaker | Azure ML |

**When to choose each cloud:**

**GCP:**
- ✅ BigQuery (arguably the best serverless data warehouse)
- ✅ Kubernetes-first (GKE is a leader)
- ✅ ML/AI (TensorFlow, Vertex AI)
- ✅ Transparent pricing
- ❌ Fewer services than AWS
- ❌ Smaller enterprise presence

**AWS:**
- ✅ Most services (200+)
- ✅ Widest adoption (roughly a third of the cloud market)
- ✅ Better documentation and community
- ✅ Broader compliance certifications
- ❌ Complex pricing
- ❌ Many legacy services

**Azure:**
- ✅ Microsoft integration (Active Directory, Office 365)
- ✅ Hybrid (on-prem + cloud with Azure Arc)
- ✅ Windows workloads
- ❌ Steep learning curve
- ❌ Inconsistent documentation

**Multi-Cloud Architecture:**
```
On-Premise → Azure (ingestion + AD)
              ↓
          Cloud Storage (staging)
              ↓
       BigQuery (analytics) ← Looker (BI)
              ↓
      S3 (archival) → Glacier
```

**Portability:**
- Terraform for multi-cloud IaC
- Apache Beam (portable across Dataflow/Flink/Spark)
- Kubernetes for portable compute
- Parquet/Avro for data formats

---

## 7. Practical Exercises

### 📝 **Exercises**

1. **Cloud Storage + BigQuery:**
   - Create a bucket with a lifecycle policy (Nearline after 30 days)
   - Upload a CSV partitioned by date
   - Load it into a partitioned BigQuery table
   - Run a query that costs less than $0.01

2. **Dataflow Pipeline:**
   - Read CSVs from GCS
   - Filter out invalid records
   - Aggregate by customer
   - Write to BigQuery

3. **Cloud Composer DAG:**
   - Task 1: Validate files in GCS
   - Task 2: Run a Dataflow job
   - Task 3: Run a BigQuery query
   - Task 4: Send an alert to Pub/Sub

4. **Cloud Function:**
   - Trigger: Object created in GCS
   - Validate the CSV schema
   - If valid → Pub/Sub topic "valid-data"
   - If invalid → Pub/Sub topic "invalid-data"

5. **BigQuery Optimization:**
   - Create a partitioned and clustered table
   - Compare query cost with and without the optimization
   - Use result caching

**Resources:**
- [GCP Free Tier](https://cloud.google.com/free)
- [BigQuery Sandbox](https://cloud.google.com/bigquery/docs/sandbox) (no billing required)
- [Qwiklabs GCP](https://www.qwiklabs.com/catalog?keywords=data%20engineering&cloud%5B%5D=GCP)

## 8. Conclusion

### 🎯 **Key Takeaways**

**GCP Strengths for Data Engineering:**

1. **BigQuery is exceptional:**
   - Very fast queries
   - Serverless (no cluster tuning)
   - Built-in BigQuery ML
   - Predictable flat-rate (capacity-based) pricing

2. **Dataflow offers flexibility:**
   - Unified model (batch + streaming)
   - Portable Apache Beam pipelines
   - Intelligent autoscaling

3. **Cohesive integration:**
   - Unified IAM
   - Centralized Cloud Logging/Monitoring (formerly Stackdriver) for observability

**Limitations:**

- Fewer services than AWS
- Lock-in with BigQuery (its SQL dialect is not 100% standard)
- Smaller community and fewer resources than AWS

**Next Steps:**

1. Create a GCP account (the free trial includes $300 in credit)
2. Complete the [Data Engineering Qwiklab](https://www.qwiklabs.com/quests/132)
3. Certification: [Professional Data Engineer](https://cloud.google.com/certification/data-engineer)
4. Explore Vertex AI for ML pipelines

**Happy data engineering on GCP! 🚀**

---
**Final Author:** LuisRai (Luis J. Raigoso V.)  
© 2024-2025 - Data Engineering Modular Course