# 🧪 Integrative Project Mid 1: API → DB → Parquet with Orchestration

Goal: build a reproducible pipeline that extracts data from an API, validates and transforms it, loads it into a database, and exports it to Parquet, orchestrated with Airflow.

- Duration: 120–150 min
- Difficulty: Medium
- Prerequisites: basic Airflow, SQL, data validation

### 🎯 **Integrative Project: End-to-End Architecture**

**Project Goal:**  
Implement a production-grade data pipeline that demonstrates mid-level Data Engineering best practices, integrating concepts from notebooks 01-08.

**Pipeline Architecture:**

```
┌─────────────────────────────────────────────────────────────┐
│                    ORCHESTRATION LAYER                       │
│                    Apache Airflow DAG                        │
│                  (Schedule: @daily, Retries: 2)              │
└──────────────────────┬──────────────────────────────────────┘
                       │
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
   ┌────────┐    ┌─────────┐    ┌──────────┐
   │EXTRACT │───>│VALIDATE │───>│TRANSFORM │
   └────────┘    └─────────┘    └──────────┘
       │              │               │
       │              │               │
   REST API      Pandera/GE     Normalization
   (Retry +      Schema         + Enrichment
   Backoff)      Validation
       │              │               │
       └──────────────┴───────────────┘
                      │
       ┌──────────────┼──────────────┐
       ▼              ▼               ▼
   ┌──────┐     ┌─────────┐    ┌──────────┐
   │ LOAD │     │ EXPORT  │    │ MONITOR  │
   └──────┘     └─────────┘    └──────────┘
       │             │               │
   PostgreSQL/   Parquet        Logging +
   SQLite        Partitioned    Alerting
   (UPSERT)      by Date
```

**Integrated Technologies:**

1. **Airflow** (Notebook 01): Orchestration, scheduling, retry logic
2. **API Consumption** (Notebook 06): HTTP retry, exponential backoff
3. **Data Validation** (Notebook 05): Pandera schemas, quality gates
4. **Database Load** (Notebook 04): SQLAlchemy UPSERT, idempotency
5. **Parquet Export** (Notebook 07): Partitioning, columnar storage
6. **DataOps** (Notebook 05): Testing, logging, monitoring

**Principles Applied:**

✅ **Idempotency**: Re-running the DAG does not duplicate data (UPSERT)  
✅ **Resilience**: Exponential backoff + automatic retries  
✅ **Observability**: Structured logging + metrics  
✅ **Quality**: Schema validation + data quality checks  
✅ **Reproducibility**: Schema versioning + data lineage  
✅ **Efficiency**: Partitioning + columnar format

**Real-World Use Cases:**

- **FinTech**: Daily ingestion of transactions from banking APIs
- **E-commerce**: Inventory synchronization from ERPs
- **Marketing**: Consolidation of metrics from platforms (Google Ads, Facebook)
- **Healthcare**: Loading medical records from legacy systems

---
**Author:** Luis J. Raigoso V. (LJRV)

## 1. Flow design

- Extract: public REST API with pagination and retries.
- Validate: schema with Pandera/GE plus quality checks (nulls, ranges).
- Transform: type normalization, simple enrichment.
- Load: upsert into the DB (PostgreSQL, or SQLite for the demo).
- Export: Parquet partitioned by date.
- Orchestration: daily DAG; retries + alerts via email/logs.
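
The Extract step lists pagination, which the component code in this notebook does not show. The following is a minimal sketch of page-by-page extraction, assuming a hypothetical API that takes a 1-based page number and returns an empty list once the data is exhausted. `get_page`, `max_pages`, and `fake_api` are illustrative names, not part of the project.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]

def fetch_all_pages(get_page: Callable[[int], List[Record]],
                    max_pages: int = 1000) -> List[Record]:
    """Collect records page by page until an empty page is returned.

    get_page is a hypothetical callable (for example a wrapper around an
    HTTP GET with retries) that takes a 1-based page number.
    """
    records: List[Record] = []
    for page in range(1, max_pages + 1):
        batch = get_page(page)
        if not batch:  # Empty page: no more data
            break
        records.extend(batch)
    return records

# Usage with an in-memory stand-in for the API
fake_api = {1: [{"API": "A"}, {"API": "B"}], 2: [{"API": "C"}]}
all_records = fetch_all_pages(lambda page: fake_api.get(page, []))
print(len(all_records))  # 3
```

The `max_pages` bound plays the same role as `max_retries` in the retry loop: it keeps a misbehaving API from causing an endless loop.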

### 📐 **Pipeline Design: Decisiones de Arquitectura**

**1. EXTRACT - API Consumption Strategy:**

**Why exponential backoff?**
```python
import time, requests

for i in range(max_retries):  # max_retries and url come from the caller
    response = requests.get(url, timeout=30)
    if response.status_code != 429:  # Not rate limited: stop retrying
        break
    time.sleep(2**i)  # Wait 1s, 2s, 4s, 8s, 16s
```

- The API may enforce rate limits (e.g. 100 req/min)
- Retrying immediately → risk of a temporary ban
- Backoff gives the rate-limit window time to reset

**Alternatives considered:**
- ❌ Plain `requests`: no automatic retry
- ❌ `while True` with no limit: risk of an infinite loop
- ✅ `max_retries=5` with backoff: balance between persistence and timeout
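
The chosen option can be sketched as a small reusable helper. This is an illustration under stated assumptions, not the project's implementation: `backoff_delays`, `call_with_backoff`, the `cap` parameter, and the injectable `sleep` argument are all names introduced here for the example.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")

def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 60.0) -> List[float]:
    """Delay after each failed attempt: base * 2**attempt, capped at `cap` seconds."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]

def call_with_backoff(fn: Callable[[], T], max_retries: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds, sleeping with exponential backoff after each failure."""
    last_error = None
    for delay in backoff_delays(max_retries):
        try:
            return fn()
        except Exception as exc:  # In real code, catch only retriable errors
            last_error = exc
            sleep(delay)
    raise RuntimeError(f"Max retries ({max_retries}) exceeded") from last_error

print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Passing `sleep` as a parameter lets a unit test record the waits instead of actually sleeping.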

**2. VALIDATE - Schema Enforcement:**

**Pandera vs Great Expectations:**
```python
# Pandera: code-first, lightweight
schema = DataFrameSchema({
    'API': Column(str, Check.str_length(min_value=1)),
    'HTTPS': Column(bool)
})
df = schema.validate(df)  # Raises an error on failure

# Great Expectations: config-first, enterprise
# (legacy-style API; `df` must be a GE dataset here, not a plain DataFrame)
expectation_suite = context.create_expectation_suite("api_suite")
df.expect_column_values_to_not_be_null("API")
results = df.validate()  # Returns a report
```

**Decision: Pandera for this project**
- ✅ Less setup (no Expectation Store required)
- ✅ Type hints integration
- ❌ GE: better suited to large teams with governance needs

**3. TRANSFORM - Data Normalization:**

**Operations applied:**
```python
df = df[columnas_relevantes].copy()              # Projection
df['API'] = df['API'].str.strip().str.title()    # Normalization
df['ingestion_ts'] = pd.Timestamp.now(tz='UTC')  # Processing metadata (added after the projection so it is kept)
```

**Why add `ingestion_ts`?**
- Debugging: know when each record was processed
- Data freshness: lag metrics (event_time vs ingestion_time)
- Auditing: compliance requirements
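
The freshness metric mentioned above is just the difference between the two timestamps. A minimal sketch follows, assuming each record carries a hypothetical `event_time` supplied by the source system (the demo API used in this notebook does not provide one):

```python
from datetime import datetime, timedelta, timezone

def freshness_lag(event_time: datetime, ingestion_time: datetime) -> timedelta:
    """Lag between when an event happened and when the pipeline ingested it."""
    return ingestion_time - event_time

# Hypothetical timestamps, both timezone-aware so the subtraction is well defined
event_time = datetime(2025, 10, 30, 1, 45, tzinfo=timezone.utc)
ingestion_ts = datetime(2025, 10, 30, 2, 0, tzinfo=timezone.utc)

lag = freshness_lag(event_time, ingestion_ts)
print(lag.total_seconds() / 60)  # 15.0
```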

**4. LOAD - Database Strategy:**

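**UPSERT vs INSERT:**
```python
# ❌ INSERT: duplicates data if the DAG is re-run
df.to_sql('table', engine, if_exists='append')

# ✅ UPSERT: idempotent (SQL statement, shown as a string)
upsert_sql = """
INSERT INTO table VALUES (...)
ON CONFLICT (id) DO UPDATE SET ...
"""
```

**SQLite limitation:** versions before 3.24 have no native UPSERT  
**Workaround:** DELETE + INSERT in a single transaction

```python
from sqlalchemy import bindparam, text

stmt = text("DELETE FROM apis WHERE API IN :apis").bindparams(bindparam("apis", expanding=True))
with engine.begin() as conn:  # One transaction: commit on success, rollback on error
    conn.execute(stmt, {"apis": df['API'].tolist()})  # expanding renders IN (?, ?, ...)
    df.to_sql('apis', conn, if_exists='append', index=False)
```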
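
On SQLite 3.24 or newer, the native `ON CONFLICT ... DO UPDATE` clause is an alternative to DELETE + INSERT. The sketch below uses the standard-library `sqlite3` module. The `apis` table layout, with `API` as primary key, is an assumption made for the example, as are the sample rows.

```python
import sqlite3

def upsert_apis(conn: sqlite3.Connection, rows: list) -> None:
    """Insert rows keyed by API; on a key conflict, update the other columns."""
    conn.executemany(
        """
        INSERT INTO apis (API, Category, HTTPS)
        VALUES (?, ?, ?)
        ON CONFLICT (API) DO UPDATE SET
            Category = excluded.Category,
            HTTPS = excluded.HTTPS
        """,
        rows,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apis (API TEXT PRIMARY KEY, Category TEXT, HTTPS INTEGER)")

rows = [("Cat Facts", "Animals", 1), ("Dog Facts", "Animals", 1)]
upsert_apis(conn, rows)
upsert_apis(conn, rows)                        # Re-running does not duplicate rows
upsert_apis(conn, [("Cat Facts", "Pets", 1)])  # Existing key: the row is updated

count = conn.execute("SELECT COUNT(*) FROM apis").fetchone()[0]
category = conn.execute("SELECT Category FROM apis WHERE API = 'Cat Facts'").fetchone()[0]
print(count, category)  # 2 Pets
```

The conflict target needs a unique constraint or primary key on the column, which `DataFrame.to_sql` does not create by default.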

**5. EXPORT - Parquet Partitioning:**

**Partitioning strategy:**
```
datasets/processed/pi1/
├── year=2025/
│   ├── month=10/
│   │   ├── day=30/
│   │   │   └── data.parquet
```

**Benefits:**
- Athena/Spark: automatic partition pruning
- Lifecycle: old partitions are easy to delete
- Compression: Parquet files are often several times smaller than the equivalent CSV (the ~10x figure depends on the data)

**Trade-off:** Small files problem  
- If there are very few rows per day → consider weekly aggregation
- Target: 128-512 MB per file
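
The directory layout above can be derived from the run date. This is a minimal sketch using only the standard library; `partition_path` is an illustrative helper, and the unpadded `month=`/`day=` values follow the tree shown above (many layouts zero-pad them instead).

```python
from datetime import date
from pathlib import PurePosixPath

def partition_path(base_dir: str, day: date, filename: str = "data.parquet") -> str:
    """Build a Hive-style year=/month=/day= path for a given date."""
    return str(
        PurePosixPath(base_dir)
        / f"year={day.year}"
        / f"month={day.month}"
        / f"day={day.day}"
        / filename
    )

print(partition_path("datasets/processed/pi1", date(2025, 10, 30)))
# datasets/processed/pi1/year=2025/month=10/day=30/data.parquet
```

Because the path depends only on the date, re-running the same day overwrites the same file instead of adding a new one.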

---

## 2. Pipeline components (reusable code)

### 🔧 **Reusable Components: Modularization**

**SOLID Principle: Single Responsibility**

Each function has **a single responsibility**:

```python
def fetch_data(url):      # EXTRACT
    """Only responsible for HTTP requests"""
    pass

def validate(df):         # VALIDATE
    """Only validates the schema"""
    pass

def transform(df):        # TRANSFORM
    """Only transforms data"""
    pass

def load_db(df):          # LOAD
    """Only writes to the DB"""
    pass

def export_parquet(df):   # EXPORT
    """Only writes Parquet"""
    pass
```

**Advantages of modularization:**

1. **Testability:**
   ```python
   def test_validate():
       df = pd.DataFrame({'API': ['Test'], 'HTTPS': [True]})
       result = validate(df)
       assert len(result) == 1  # Isolated unit test
   ```

2. **Reusability:**
   ```python
   # Pipeline 1: API → DB
   df = fetch_data()
   df = validate(df)
   load_db(df)
   
   # Pipeline 2: API → S3 (reutiliza fetch + validate)
   df = fetch_data()
   df = validate(df)
   upload_to_s3(df)
   ```

3. **Debugging:**
   - Error in `validate()` → only the validation function needs review
   - vs. monolith: error on line 456 of 1000

**Error Handling Pattern:**

```python
import logging
import time
from typing import Any, Dict, List

import requests

logger = logging.getLogger(__name__)

def fetch_data(url: str, max_retries: int = 5) -> List[Dict[str, Any]]:
    """
    Fetch data from API with exponential backoff.
    
    Args:
        url: API endpoint
        max_retries: Max retry attempts
        
    Returns:
        List of records
        
    Raises:
        RuntimeError: If max retries exceeded
        requests.HTTPError: For non-retriable errors (400, 401, 403)
    """
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            
            if response.status_code == 429:
                wait_time = 2 ** attempt
                logger.warning(f"Rate limited, waiting {wait_time}s (attempt {attempt+1}/{max_retries})")
                time.sleep(wait_time)
                continue
            
            response.raise_for_status()
            
            data = response.json().get('entries', [])
            logger.info(f"Fetched {len(data)} records from {url}")
            return data
            
        except requests.exceptions.Timeout:
            logger.error(f"Timeout on attempt {attempt+1}")
            if attempt == max_retries - 1:
                raise
        except requests.exceptions.RequestException as e:
            logger.error(f"Request failed: {e}")
            raise
    
    raise RuntimeError(f"Max retries ({max_retries}) exceeded")
```

**Configuration Management:**

```python
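import os
from dataclasses import dataclass, field

@dataclass
class PipelineConfig:
    # default_factory reads the environment on each instantiation, not once at import time
    api_url: str = field(default_factory=lambda: os.getenv('PI1_API_URL', 'https://api.example.com'))
    db_uri: str = field(default_factory=lambda: os.getenv('PI1_DB_URI', 'sqlite:///data.db'))
    export_dir: str = field(default_factory=lambda: os.getenv('PI1_EXPORT_DIR', './exports'))
    max_retries: int = field(default_factory=lambda: int(os.getenv('PI1_MAX_RETRIES', '5')))
    
    @classmethod
    def from_env(cls):
        """Load config from environment"""
        return cls()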

# Usage
config = PipelineConfig.from_env()
data = fetch_data(config.api_url, config.max_retries)
```

**Metrics & Observability:**

```python
import time
from contextlib import contextmanager

@contextmanager
def log_execution_time(step_name: str):
    """Context manager that measures a step's execution time"""
    start = time.time()
    logger.info(f"Starting {step_name}")
    try:
        yield
    finally:
        duration = time.time() - start
        logger.info(f"Completed {step_name} in {duration:.2f}s")

def run_pipeline():
    metrics = {}
    
    with log_execution_time("EXTRACT"):
        data = fetch_data()
        metrics['records_fetched'] = len(data)
    
    with log_execution_time("VALIDATE"):
        df = pd.DataFrame(data)
        df = validate(df)
        metrics['records_valid'] = len(df)
    
    with log_execution_time("LOAD"):
        rows_loaded = load_db(df)
        metrics['records_loaded'] = rows_loaded
    
    logger.info(f"Pipeline metrics: {metrics}")
    return metrics
```

---

```python
import os, time
import pandas as pd
import requests
from sqlalchemy import create_engine, text
from pandera import DataFrameSchema, Column, Check

BASE_URL = os.getenv('PI1_API_URL', 'https://api.publicapis.org/entries')
DB_URI = os.getenv('PI1_DB_URI', 'sqlite+pysqlite:///:memory:')
EXPORT_DIR = os.getenv('PI1_EXPORT_DIR', 'datasets/processed/pi1/')
os.makedirs(EXPORT_DIR, exist_ok=True)

def fetch_data(url=BASE_URL, max_retries=5):
    for i in range(max_retries):
        r = requests.get(url, timeout=30)
        if r.status_code == 429:
            time.sleep(2**i)
            continue
        r.raise_for_status()
        return r.json().get('entries', [])
    raise RuntimeError('Too many retries')

def validate(df: pd.DataFrame) -> pd.DataFrame:
    schema = DataFrameSchema({
        'API': Column(str),
        'HTTPS': Column(bool),
        'Category': Column(str)
    }, coerce=True)
    return schema.validate(df)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df[['API','Description','Auth','HTTPS','Cors','Link','Category']].copy()
    df['ingestion_ts'] = pd.Timestamp.now(tz='UTC')
    return df

def load_db(df: pd.DataFrame, uri=DB_URI):
    engine = create_engine(uri, future=True)
    df.to_sql('apis_publicas', engine, if_exists='append', index=False)
    return len(df)

def export_parquet(df: pd.DataFrame, out_dir=EXPORT_DIR):
    # NOTE: the original filename expression was lost; 'data.parquet' is a
    # placeholder assumption taken from the directory layout shown earlier.
    out_path = os.path.join(out_dir, 'data.parquet')
    df.to_parquet(out_path, index=False)
    return out_path

def run_pipeline():
    items = fetch_data()
    df = pd.DataFrame(items)
    df = validate(df)
    df = transform(df)
    n = load_db(df)
    fp = export_parquet(df)
    return {'rows': n, 'parquet': fp}

# Quick local test
res = run_pipeline()
res
```

## 3. Orchestration with Airflow (example DAG)

### 🔀 **Airflow DAG: Production Orchestration**

**DAG Anatomy:**

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.email import EmailOperator
from airflow.utils.trigger_rule import TriggerRule

default_args = {
    'owner': 'data-eng',
    'depends_on_past': False,  # Does not depend on the previous run
    'start_date': datetime(2025, 10, 1),
    'email': ['data-team@example.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'execution_timeout': timedelta(hours=1),  # Per-task timeout (default_args apply to every task)
}

with DAG(
    'pi1_api_to_db_parquet',
    default_args=default_args,
    description='Daily API ingestion pipeline',
    schedule_interval='0 2 * * *',  # 2 AM daily (cron)
    catchup=False,  # No automatic backfill
    max_active_runs=1,  # Only one active DAG run at a time
    tags=['production', 'api', 'daily']
) as dag:
    
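    extract_task = PythonOperator(
        task_id='extract_api_data',
        python_callable=fetch_data,
        op_kwargs={'url': '{{ var.value.api_url }}'},  # Airflow variable
        # provide_context is not needed in Airflow 2 (context is passed automatically)
    )
    
    validate_task = PythonOperator(
        task_id='validate_schema',
        python_callable=validate,
        # NOTE: templated values render as strings by default, so `validate`
        # would need to parse this value back into a DataFrame
        op_kwargs={'df': '{{ ti.xcom_pull(task_ids="extract_api_data") }}'}
    )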
    
    load_task = PythonOperator(
        task_id='load_to_database',
        python_callable=load_db
    )
    
    export_task = PythonOperator(
        task_id='export_to_parquet',
        python_callable=export_parquet
    )
    
    notify_success = EmailOperator(
        task_id='notify_success',
        to='data-team@example.com',
        subject='Pipeline Success - {{ ds }}',
        html_content='Pipeline completed successfully. Records: {{ ti.xcom_pull(...) }}',
        trigger_rule=TriggerRule.ALL_SUCCESS
    )
    
    notify_failure = EmailOperator(
        task_id='notify_failure',
        to='data-team@example.com',
        subject='Pipeline Failed - {{ ds }}',
        html_content='Pipeline failed. Check logs.',
        trigger_rule=TriggerRule.ONE_FAILED
    )
    
    # Dependencies
    extract_task >> validate_task >> [load_task, export_task] >> notify_success
    [extract_task, validate_task, load_task, export_task] >> notify_failure
```

**Key Concepts:**

**1. Schedule Interval (Cron):**
```
'0 2 * * *'  = Daily at 2 AM
'*/15 * * * *'  = Every 15 minutes
'0 */4 * * *'  = Every 4 hours
'@daily' = Alias for '0 0 * * *'
```

**2. Catchup:**
```python
catchup=False  # ✅ Recommended for daily pipelines
# If the DAG is paused for 7 days → only the latest run executes on unpause

catchup=True   # ❌ Automatic backfill
# If the DAG is paused for 7 days → the 7 missed runs execute on unpause
```

**3. XCom (Cross-Communication):**
```python
def extract_data(**context):
    data = fetch_from_api()
    context['ti'].xcom_push(key='raw_data', value=data)
    return len(data)

def transform_data(**context):
    data = context['ti'].xcom_pull(task_ids='extract_data', key='raw_data')
    transformed = process(data)
    return transformed
```

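**Limitation:** XCom values are stored in the metadata DB (JSON-serialized by default in Airflow 2; pickling is opt-in)  
**Max size:** keep values small; the often-cited ~48 KB cap dates from Airflow 1.x, and the real limit depends on the metadata DB backend  
**Solution:** For large datasets, pass file paths instead of the data itself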
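
The path-passing pattern can be sketched without Airflow. In this illustration `extract_to_file` and `transform_from_file` stand in for two task callables, and the JSON staging file, `staging_dir`, and `run_id` are assumptions made for the example; in a DAG, only the returned path string would travel through XCom.

```python
import json
import os
import tempfile
from typing import Any, Dict, List

def extract_to_file(records: List[Dict[str, Any]], staging_dir: str, run_id: str) -> str:
    """Write the extracted records to a staging file and return only its path."""
    path = os.path.join(staging_dir, f"raw_{run_id}.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh)
    return path  # A short string: small enough to hand to the next task

def transform_from_file(path: str) -> List[Dict[str, Any]]:
    """Read the staged records back from the path produced upstream."""
    with open(path, encoding="utf-8") as fh:
        records = json.load(fh)
    return [{**r, "API": r["API"].strip()} for r in records]

staging_dir = tempfile.mkdtemp()
raw_path = extract_to_file([{"API": " Cat Facts "}], staging_dir, run_id="2025-10-30")
result = transform_from_file(raw_path)
print(result)  # [{'API': 'Cat Facts'}]
```

When tasks can run on different workers, the staging location has to be shared storage (for example a network volume or object store) rather than a local temp directory.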

**4. Trigger Rules:**
```python
TriggerRule.ALL_SUCCESS     # Default: all parents succeeded
TriggerRule.ALL_FAILED      # All parents failed
TriggerRule.ONE_SUCCESS     # At least one parent succeeded
TriggerRule.ONE_FAILED      # At least one parent failed
TriggerRule.NONE_FAILED     # No parent failed (success or skipped)
```
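
To make the rules concrete, here is a simplified pure-Python model of how they evaluate a set of parent states. It is a sketch, not Airflow's scheduler logic: it assumes every parent has already finished as `success`, `failed`, or `skipped`, and it ignores states such as `upstream_failed`. `RULES` and `should_run` are names invented for the example.

```python
from typing import Callable, Dict, List

# Simplified model: each parent has already finished in one of three states
RULES: Dict[str, Callable[[List[str]], bool]] = {
    "all_success": lambda states: all(s == "success" for s in states),
    "all_failed": lambda states: all(s == "failed" for s in states),
    "one_success": lambda states: any(s == "success" for s in states),
    "one_failed": lambda states: any(s == "failed" for s in states),
    "none_failed": lambda states: not any(s == "failed" for s in states),
}

def should_run(rule: str, parent_states: List[str]) -> bool:
    """Return True if a task with this trigger rule would run in the model."""
    return RULES[rule](parent_states)

parents = ["success", "failed", "success"]
print(should_run("all_success", parents))                 # False
print(should_run("one_failed", parents))                  # True
print(should_run("none_failed", ["success", "skipped"]))  # True
```

This mirrors the DAG above: a task with `ONE_FAILED` (such as `notify_failure`) fires when any parent fails, while one with `ALL_SUCCESS` (such as `notify_success`) does not.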

**5. SLA (Service Level Agreement):**
```python
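def on_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    send_slack_alert(message=f"SLA missed in {dag.dag_id}: {task_list}")

# SLA callbacks are set on the DAG, not the task: DAG(..., sla_miss_callback=on_sla_miss)
task = PythonOperator(
    task_id='critical_task',
    python_callable=process,
    sla=timedelta(hours=1),  # SLA: expected to finish within 1h
)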
```

**Best Practices:**

✅ **Idempotency**: The DAG can be re-run without duplicating data  
✅ **Atomicity**: Each task is an atomic unit (success or fail)  
✅ **Monitoring**: Slack/email alerts on failures  
✅ **Timeouts**: `execution_timeout` prevents hung tasks  
✅ **Concurrency**: `max_active_runs=1` prevents race conditions

---

```python
dag_code = r'''
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.email import send_email

def run_pipeline_wrapper(**context):
    from pi1_module import run_pipeline
    res = run_pipeline()
    return res

default_args = {
  'owner': 'data-eng',
  'retries': 2,
  'retry_delay': timedelta(minutes=5),
}

with DAG('pi1_api_db_parquet', start_date=datetime(2025,10,1), schedule_interval='@daily', catchup=False, default_args=default_args) as dag:
    t1 = PythonOperator(task_id='run_pipeline', python_callable=run_pipeline_wrapper)
    t1
'''
print(dag_code.splitlines()[:25])
```

## 4. Validations and monitoring

### 📊 **Data Quality: Validations and Monitoring**

**Data Quality Dimensions:**

1. **Completeness** (Is all the data present?)
2. **Consistency** (Are values coherent?)
3. **Conformity** (Do values follow the expected format?)
4. **Accuracy** (Are values correct?)
5. **Integrity** (Are relationships valid?)
6. **Timeliness** (Is the data up to date?)

**Implementation in the Pipeline:**

```python
from dataclasses import dataclass
from typing import Dict, List

import pandas as pd
import pandera as pa
from pandera import Check, Column, DataFrameSchema

@dataclass
class DataQualityMetrics:
    total_records: int
    null_counts: Dict[str, int]
    duplicate_count: int
    validation_errors: List[str]
    
    @property
    def completeness_score(self) -> float:
        """% of fields without nulls"""
        total_nulls = sum(self.null_counts.values())
        total_fields = self.total_records * len(self.null_counts)
        return (1 - total_nulls / total_fields) * 100 if total_fields > 0 else 0

def validate_with_metrics(df: pd.DataFrame) -> tuple[pd.DataFrame, DataQualityMetrics]:
    """Validates and returns quality metrics"""
    
    # 1. Schema validation (Pandera)
    schema = DataFrameSchema({
        'API': Column(str, Check.str_length(min_value=1)),
        'HTTPS': Column(bool),
        'Category': Column(str, Check.isin(['Animals', 'Business', 'Tech', ...]))
    })
    
    errors = []
    try:
        # lazy=True collects every failure and raises SchemaErrors (plural)
        validated_df = schema.validate(df, lazy=True)
    except pa.errors.SchemaErrors as e:
        errors.append(str(e))
        validated_df = df
    
    # 2. Completeness check
    null_counts = df.isnull().sum().to_dict()
    
    # 3. Duplicates
    duplicates = df.duplicated(subset=['API']).sum()
    
    # 4. Custom business rules
    if len(df) < 10:
        errors.append("Too few records (expected >10)")
    
    https_rate = df['HTTPS'].mean()
    if https_rate < 0.5:
        errors.append(f"Low HTTPS rate: {https_rate:.1%} (expected >50%)")
    
    metrics = DataQualityMetrics(
        total_records=len(df),
        null_counts=null_counts,
        duplicate_count=duplicates,
        validation_errors=errors
    )
    
    return validated_df, metrics
```
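
As a worked example of the completeness formula used above: 4 records with 3 tracked columns give 12 fields, and 3 nulls in total yield (1 − 3/12) × 100 = 75%. The standalone function below mirrors the `completeness_score` property so the arithmetic can be checked without pandas; the null counts are made-up sample values.

```python
from typing import Dict

def completeness_score(total_records: int, null_counts: Dict[str, int]) -> float:
    """Percentage of non-null fields across all tracked columns."""
    total_nulls = sum(null_counts.values())
    total_fields = total_records * len(null_counts)
    return (1 - total_nulls / total_fields) * 100 if total_fields > 0 else 0

# Hypothetical counts: 4 records, 3 columns, 3 nulls in total
score = completeness_score(4, {"API": 0, "HTTPS": 1, "Category": 2})
print(score)  # 75.0
```

A score of 75% would fail a 95% completeness gate, and an empty input returns 0 rather than dividing by zero.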

**Alerting Strategy:**

```python
def check_quality_gates(metrics: DataQualityMetrics) -> bool:
    """Checks quality thresholds"""
    
    alerts = []
    
    # Gate 1: Completeness
    if metrics.completeness_score < 95:
        alerts.append(f"Completeness {metrics.completeness_score:.1f}% < 95%")
    
    # Gate 2: Volume
    if metrics.total_records < 100:
        alerts.append(f"Low volume: {metrics.total_records} records")
    
    # Gate 3: Duplicates (guard against division by zero on empty input)
    dup_rate = metrics.duplicate_count / metrics.total_records if metrics.total_records else 0.0
    if dup_rate > 0.01:
        alerts.append(f"High duplicates: {dup_rate:.1%}")
    
    # Gate 4: Validation errors
    if metrics.validation_errors:
        alerts.append(f"Validation errors: {len(metrics.validation_errors)}")
    
    if alerts:
        send_slack_alert(
            channel="#data-quality",
            message=f"⚠️ Quality gates failed:\n" + "\n".join(alerts),
            severity="warning"
        )
        return False
    
    return True

def run_pipeline_with_quality():
    df = pd.DataFrame(fetch_data())  # fetch_data returns a list of records
    df, metrics = validate_with_metrics(df)
    
    # Log metrics
    logger.info(f"Quality metrics: {metrics}")
    
    # Check gates
    if not check_quality_gates(metrics):
        raise ValueError("Quality gates failed - aborting pipeline")
    
    # Continue pipeline
    load_db(df)
    export_parquet(df)
```

**Monitoring Dashboard (Metrics to Track):**

```python
from prometheus_client import Counter, Histogram, Gauge

# Counters
records_processed = Counter('pipeline_records_total', 'Total records processed')
pipeline_failures = Counter('pipeline_failures_total', 'Pipeline failures')

# Histograms (latency)
extract_duration = Histogram('extract_duration_seconds', 'Extract step duration')
transform_duration = Histogram('transform_duration_seconds', 'Transform duration')

# Gauges (current state)
data_freshness = Gauge('data_freshness_minutes', 'Minutes since last update')
quality_score = Gauge('data_quality_score', 'Quality score 0-100')

@extract_duration.time()
def fetch_data():
    data = requests.get(API_URL, timeout=30).json().get('entries', [])  # Records, not top-level keys
    records_processed.inc(len(data))
    return data

# Expose metrics endpoint
from prometheus_client import start_http_server
start_http_server(8000)  # Metrics at http://localhost:8000/metrics
```

**Data Lineage (Tracking):**

```python
def track_lineage(df: pd.DataFrame, step: str, context: dict):
    """Records lineage metadata (context: the Airflow task context)"""
    lineage = {
        'timestamp': pd.Timestamp.now(tz='UTC'),
        'step': step,
        'records_count': len(df),
        'columns': df.columns.tolist(),
        'source': 'api.publicapis.org',
        'pipeline_run_id': context['dag_run'].run_id
    }
    
    # Save to a metadata store (e.g. PostgreSQL)
    store_lineage(lineage)
```

---

- Add quality validations (minimum row count, no duplicates).
- Alert via email/Slack on failures or out-of-range metrics.
- Publish artifacts (Parquet) with idempotent or versioned names.
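
The last point can be sketched as follows. The naming scheme is an assumption for illustration, not the project's convention: `artifact_name` derives the file name from the dataset, a schema version, and the run's logical date, and `content_fingerprint` offers a hash suffix for content-versioned names.

```python
import hashlib
from datetime import date

def artifact_name(dataset: str, logical_date: date, schema_version: str = "v1") -> str:
    """Deterministic name: the same run date and schema version give the same file."""
    return f"{dataset}_{schema_version}_{logical_date.isoformat()}.parquet"

def content_fingerprint(payload: bytes) -> str:
    """Short SHA-256 prefix, usable as a suffix for content-versioned names."""
    return hashlib.sha256(payload).hexdigest()[:12]

name_first_run = artifact_name("apis_publicas", date(2025, 10, 30))
name_rerun = artifact_name("apis_publicas", date(2025, 10, 30))
print(name_first_run)                # apis_publicas_v1_2025-10-30.parquet
print(name_first_run == name_rerun)  # True
```

Using the run's logical date rather than the wall-clock time is what makes a re-run overwrite the earlier artifact instead of creating a second one.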