# ♻️ DataOps and CI/CD for Data Pipelines

Objective: implement DataOps practices with quality control, automated tests, Git hooks, and CI/CD pipelines (GitHub Actions) for data workflows.

- Duration: 90 min
- Difficulty: Medium
- Prerequisites: basic Pytest, Git, and GitHub

### ♻️ **DataOps: DevOps for Data**

**Definition:**  
DataOps applies DevOps practices (CI/CD, IaC, monitoring) to data pipelines to increase the speed, quality, and reliability of analytics development.

**Core Principles:**

1. **Automation First:**
   - Automated data quality tests
   - Automated pipeline deployment
   - Automated alerts on anomalies

2. **Version Control Everything:**
   - Code (Python, SQL)
   - Configuration (YAML, JSON)
   - Schemas (Avro, Protobuf)
   - Infrastructure (Terraform, CloudFormation)

3. **Observability:**
   - Structured logging (JSON logs)
   - Metrics: latency, throughput, error rate
   - Lineage: source → destination traceability

4. **Testing Pyramid for Data:**

```
       ▲
      /E2E\           ← End-to-End (few, slow)
     /─────\
    /Integr.\        ← Integration tests (medium)
   /─────────\
  /Unit Tests \      ← Unit tests (many, fast)
 /─────────────\
  Data Quality       ← Schema validation, nulls, ranges
```

**Comparison with Software Engineering:**

| Software DevOps | DataOps |
|-----------------|---------|
| Unit tests | Schema validation |
| Integration tests | Pipeline tests |
| Code review | Data quality review |
| Blue/Green deploy | Dual-write patterns |
| Monitoring (APM) | Data observability |

**Tools:**

- **Testing**: Great Expectations, Pandera, dbt tests
- **CI/CD**: GitHub Actions, GitLab CI, Jenkins
- **Orchestration**: Airflow, Prefect, Dagster
- **Observability**: DataDog, Monte Carlo, Bigeye
- **Version Control**: Git + DVC (Data Version Control)

**Impact (indicative targets):**

- ⬇️ 80% reduction in data incidents
- ⬆️ 10x deployment speed
- ⬆️ 95%+ confidence in data for decision-making

---
**Author:** Luis J. Raigoso V. (LJRV)

## 1. Data tests with Great Expectations and Pandera

### 🛡️ **Data Quality Testing: Great Expectations vs Pandera**

**Great Expectations:**

```python
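# Declarative approach with expectations.
# Assumes the legacy (pre-1.0) Great Expectations API and an existing pandas `df`;
# the expect_* methods live on the wrapped dataset, not on a plain DataFrame.
import great_expectations as ge

gdf = ge.from_pandas(df)
gdf.expect_column_values_to_not_be_null('user_id')
gdf.expect_column_values_to_be_between('age', 0, 120)
gdf.expect_column_values_to_be_in_set('status', ['active', 'inactive'])
gdf.expect_column_values_to_match_regex('email', r'^[\w\.-]+@[\w\.-]+\.\w+$')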
```

**Features:**
- 300+ predefined expectations
- Data Docs: automatic HTML reports
- Checkpoints: validation at pipeline stages
- Profiling: auto-generates expectations from data samples

**Pandera (Type Hints for DataFrames):**

```python
import pandas as pd
import pandera as pa
from pandera import DataFrameSchema, Column, Check

schema = DataFrameSchema({
    'user_id': Column(int, Check.gt(0), nullable=False),
    'age': Column(int, Check.in_range(0, 120)),
    'email': Column(str, Check.str_matches(r'^[\w\.-]+@')),
    'created_at': Column('datetime64[ns]')
})

@pa.check_input(schema)
def process_users(df: pd.DataFrame) -> pd.DataFrame:
    # The input DataFrame is validated against `schema` at runtime, on every call
    return df
```

**Advantages:**
- Native integration with type hints (mypy compatible)
- Lightweight (no heavy dependencies)
- Built-in statistical hypothesis testing

**Comparison:**

| Aspect | Great Expectations | Pandera |
|--------|-------------------|---------|
| **Learning curve** | Medium-high | Low |
| **Reporting** | Excellent (Data Docs) | Basic |
| **Performance** | Slower (overhead) | Faster |
| **Type Safety** | No | Yes (mypy) |
| **Use Case** | Enterprise, governance | Fast prototyping |

**Hybrid Strategy:**
- Pandera: local development + CI tests
- Great Expectations: production + auditing

**Validation Levels:**
1. **Schema**: columns, types, nullability
2. **Range**: min, max, percentiles
3. **Relationships**: foreign keys, duplicates
4. **Distribution**: mean, std, outliers
5. **Business**: custom rules (e.g., revenue >= costs)
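
To make the five levels concrete, here is a minimal, library-free sketch that applies one check per level to a handful of rows. The column names, the thresholds, and the `validate` helper are assumptions chosen for illustration; in practice Pandera or Great Expectations would express the same rules declaratively.

```python
from statistics import mean

# Hypothetical sample rows; column names and thresholds are illustrative only.
rows = [
    {"order_id": 1, "customer_id": 10, "revenue": 120.0, "costs": 80.0},
    {"order_id": 2, "customer_id": 11, "revenue": 60.0, "costs": 45.0},
    {"order_id": 3, "customer_id": 10, "revenue": 95.0, "costs": 70.0},
]
known_customers = {10, 11, 12}
EXPECTED_TYPES = {"order_id": int, "customer_id": int, "revenue": float, "costs": float}


def validate(rows, known_customers):
    """Return a list of human-readable failures, one per violated rule."""
    failures = []

    # 1. Schema: every column present, correctly typed, and not null.
    for i, row in enumerate(rows):
        for col, typ in EXPECTED_TYPES.items():
            if not isinstance(row.get(col), typ):
                failures.append(f"schema: row {i} column {col!r} is not {typ.__name__}")
    if failures:
        return failures  # the remaining checks assume a valid schema

    # 2. Range: revenue must fall inside a plausible interval.
    for i, row in enumerate(rows):
        if not 0 <= row["revenue"] <= 10_000:
            failures.append(f"range: row {i} revenue outside [0, 10000]")

    # 3. Relationships: unique primary key and valid foreign key.
    ids = [row["order_id"] for row in rows]
    if len(ids) != len(set(ids)):
        failures.append("relationship: duplicate order_id")
    for i, row in enumerate(rows):
        if row["customer_id"] not in known_customers:
            failures.append(f"relationship: row {i} unknown customer_id")

    # 4. Distribution: the batch mean should stay inside an expected band.
    if rows:
        avg = mean(row["revenue"] for row in rows)
        if not 50 <= avg <= 500:
            failures.append(f"distribution: mean revenue {avg:.2f} outside [50, 500]")

    # 5. Business: custom rule, revenue >= costs.
    for i, row in enumerate(rows):
        if row["revenue"] < row["costs"]:
            failures.append(f"business: row {i} revenue < costs")
    return failures


print(validate(rows, known_customers))
```

Returning a list of failures instead of raising on the first one lets a pipeline report every violated rule for a batch in a single run.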


In [None]:
import pandas as pd
try:
    import great_expectations as ge
    from pandera import DataFrameSchema, Column, Check
    df = pd.DataFrame({
        'venta_id':[1,2,3],
        'total':[100.0, 50.0, 25.5],
        'metodo_pago':['tarjeta','cash','tarjeta']
    })
    # Great Expectations, quick style (legacy pre-1.0 API)
    gdf = ge.from_pandas(df)
    gdf.expect_column_values_to_not_be_null('venta_id')
    gdf.expect_column_values_to_be_between('total', 0, 10000)
    print('GE checks:', gdf.validate().to_json_dict()['statistics'])
    # Pandera schema
    schema = DataFrameSchema({
        'venta_id': Column(int, Check.gt(0)),
        'total': Column(float, Check.ge(0)),
        'metodo_pago': Column(str)
    })
    schema.validate(df)  # raises SchemaError if any check fails
    print('Pandera OK')
except (ImportError, AttributeError) as e:
    # ImportError: a package is missing; AttributeError: the installed
    # great-expectations no longer provides from_pandas
    print('Install pandera and a pre-1.0 great-expectations to run this block:', e)

## 2. Pytest: minimal structure

### 🧪 **Pytest: Testing Framework for Data Pipelines**

**Project Structure:**
```
proyecto/
├── src/
│   ├── extract.py
│   ├── transform.py
│   └── load.py
├── tests/
│   ├── conftest.py         ← Shared fixtures
│   ├── test_extract.py
│   ├── test_transform.py
│   └── test_load.py
├── pytest.ini              ← Configuration
└── requirements-dev.txt
```

**Fixtures (Setup/Teardown):**
```python
# conftest.py
import pytest
import pandas as pd

@pytest.fixture
def sample_df():
    """Reusable fixture for tests"""
    return pd.DataFrame({
        'id': [1, 2, 3],
        'value': [10, 20, 30]
    })

@pytest.fixture
def db_connection():
    conn = create_connection()  # placeholder: your project's connection factory
    yield conn  # the test runs here
    conn.close()  # automatic cleanup, even if the test fails
```

**Parametrization (Data-Driven Tests):**
```python
import pytest
from src.transform import clean_total

@pytest.mark.parametrize("raw, expected", [
    (10, 10.0),
    (-5, 0.0),
    ('invalid', 0.0),
    (None, 0.0)
])
def test_clean_total(raw, expected):  # `raw` avoids shadowing the built-in `input`
    assert clean_total(raw) == expected
```

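**Mocking (Dependency Isolation):**
```python
from unittest.mock import patch

# Assumes a project function extract_from_api(url) that calls requests.get(url)
# and returns the "data" list from the JSON response.
from src.extract import extract_from_api

@patch('requests.get')
def test_api_extractor(mock_get):
    mock_get.return_value.json.return_value = {'data': [1, 2, 3]}
    result = extract_from_api('https://api.example.com')
    assert len(result) == 3
    mock_get.assert_called_once()
```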
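
The test above depends on `requests` and on a project function that is not shown. As a self-contained variant of the same idea, the sketch below patches `urllib.request.urlopen` from the standard library; the `extract_from_api` body here is a hypothetical stand-in, not the project's real extractor.

```python
import json
import urllib.request
from unittest.mock import MagicMock, patch


def extract_from_api(url):
    """Hypothetical extractor: fetch JSON and return its 'data' list."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)["data"]


def test_api_extractor():
    # Fake HTTP response: a context manager whose read() returns JSON bytes.
    fake_response = MagicMock()
    fake_response.__enter__.return_value.read.return_value = b'{"data": [1, 2, 3]}'

    # Replace urlopen only for the duration of the with-block; no network call is made.
    with patch("urllib.request.urlopen", return_value=fake_response) as mock_urlopen:
        result = extract_from_api("https://api.example.com")

    assert result == [1, 2, 3]
    mock_urlopen.assert_called_once_with("https://api.example.com")


test_api_extractor()
```

Patching where the name is looked up (`urllib.request.urlopen`, accessed at call time) is what makes the replacement take effect inside `extract_from_api`.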

**Coverage:**
```bash
pytest --cov=src --cov-report=html   # requires the pytest-cov plugin
# Generates htmlcov/index.html showing covered and uncovered lines
```

**Markers (Test Categorization):**
```python
import pytest

# Register custom markers (slow, integration) in pytest.ini to avoid warnings
@pytest.mark.slow
def test_large_dataset():
    # Test that takes >10s
    pass

@pytest.mark.integration
def test_db_connection():
    # Test that requires a real DB
    pass

# Run only the fast tests (shell command, not Python):
#   pytest -m "not slow"
```

**Best Practices:**
- Independent tests (no reliance on execution order)
- Descriptive names: `test_transform_handles_null_values`
- Arrange-Act-Assert pattern
- One assert per test (or a group of closely related asserts)
- Tests < 1s (unit), < 10s (integration)
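
A short sketch of the Arrange-Act-Assert pattern combined with a descriptive test name. The `fill_missing_values` transform is hypothetical and exists only to give the tests something to exercise.

```python
def fill_missing_values(records, default=0.0):
    """Hypothetical transform: replace None in 'value' with a default."""
    return [
        {**record, "value": default if record["value"] is None else record["value"]}
        for record in records
    ]


def test_transform_handles_null_values():
    # Arrange: build the smallest input that exercises the behaviour.
    records = [{"id": 1, "value": 10.0}, {"id": 2, "value": None}]

    # Act: call the unit under test exactly once.
    result = fill_missing_values(records)

    # Assert: one concept per test (nulls are replaced, other values kept).
    assert [record["value"] for record in result] == [10.0, 0.0]


def test_transform_does_not_mutate_input():
    # Arrange
    records = [{"id": 1, "value": None}]

    # Act
    fill_missing_values(records)

    # Assert: the caller's data is left untouched.
    assert records[0]["value"] is None


test_transform_handles_null_values()
test_transform_does_not_mutate_input()
```

Neither test relies on the other or on shared state, so they can run in any order.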


In [None]:
# Prints the contents of two example files: src/transform.py and tests/test_transform.py
sample_code = r'''
# file: src/transform.py
def clean_total(x):
    try:
        v = float(x)
        return max(v, 0.0)
    except (TypeError, ValueError):  # None or non-numeric input
        return 0.0

# file: tests/test_transform.py
from src.transform import clean_total
def test_clean_total():
    assert clean_total(10) == 10.0
    assert clean_total(-5) == 0.0
    assert clean_total('oops') == 0.0
'''
print(sample_code)

## 3. Pre-commit hooks (lint, format, quick tests)

### 🪝 **Pre-commit Hooks: Local Quality Gates**

**What are Pre-commit Hooks?**  
Scripts that run automatically **before each commit** to validate code, formatting, security, and so on. If any hook fails, the commit is blocked.

**Installation:**
```bash
pip install pre-commit
# Create .pre-commit-config.yaml at the repository root
pre-commit install  # Activates the hooks in .git/hooks/
```

**Essential Hooks for Data Engineering:**

1. **Black (Code Formatter):**
   - Auto-formats Python code in a PEP 8-compliant style
   - "The uncompromising formatter"
   ```python
   # Before
   x=[1,2,3];y={'a':1,'b':2}
   
   # After
   x = [1, 2, 3]
   y = {"a": 1, "b": 2}
   ```

2. **isort (Import Organizer):**
   - Sorts imports alphabetically and groups them by section (standard library, third-party, local)
   ```python
   # Before
   import sys
   from myproject import utils
   import os
   
   # After
   import os
   import sys

   from myproject import utils
   ```

3. **flake8 (Linter):**
   - Detects syntax errors, unused variables, and unused imports
   - Example warnings: E501 (line longer than 79 characters), F841 (local variable assigned but never used)

4. **mypy (Type Checker):**
   - Validates type hints
   ```python
   def add(x: int, y: int) -> int:
       return x + y
   
   add("1", "2")  # ❌ mypy reports an incompatible argument type
   ```

5. **detect-secrets:**
   - Scans for hardcoded API keys and passwords
   - Blocks commits containing `AWS_SECRET_ACCESS_KEY = "xxx"`

**Custom Hooks for Data:**
```yaml
- repo: local
  hooks:
    - id: check-sql-syntax
      name: Validate SQL files
      entry: sqlfluff lint
      language: system
      files: \.sql$
    
    - id: validate-schemas
      name: Check Avro schemas
      entry: python scripts/validate_schemas.py
      language: python
      files: schemas/.*\.avsc$
```
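
The `validate-schemas` hook above points at `scripts/validate_schemas.py`, which the notebook does not show. The following is one possible sketch of such a script, assuming every `.avsc` file holds a single Avro record schema; the function names and the specific checks are illustrative, not a full Avro validator.

```python
"""Sketch of scripts/validate_schemas.py; pre-commit passes matching file paths as arguments."""
import json

REQUIRED_RECORD_KEYS = {"type", "name", "fields"}


def check_schema_text(text):
    """Return a list of problems found in one Avro record schema (empty if none)."""
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(schema, dict):
        return ["top-level schema is not a JSON object"]

    problems = []
    missing = REQUIRED_RECORD_KEYS - schema.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if schema.get("type") != "record":
        problems.append("type is not 'record'")

    fields = schema.get("fields", [])
    if not isinstance(fields, list):
        return problems + ["fields is not a list"]
    for field in fields:
        if not isinstance(field, dict) or not {"name", "type"} <= field.keys():
            problems.append(f"field without name/type: {field!r}")
    return problems


def main(paths):
    """Check every file; return 1 if any problem is found so the commit is blocked."""
    exit_code = 0
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for problem in check_schema_text(handle.read()):
                print(f"{path}: {problem}")
                exit_code = 1
    return exit_code

# As a hook script, the file would end with:
#   import sys; sys.exit(main(sys.argv[1:]))
```

A non-zero exit code is what pre-commit interprets as a failed hook, so the script only needs to print the problems and return 1.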

**Workflow:**
```bash
git add src/transform.py
git commit -m "Fix bug"
  ↓
[pre-commit] black........................Passed
[pre-commit] isort.......................Passed
[pre-commit] flake8......................Failed
  - src/transform.py:10:1: F401 'pandas' imported but unused
  ↓
[Commit blocked: fix the errors and retry]
```

**Skip Hooks (Emergency):**
```bash
git commit --no-verify -m "Critical hotfix"
```

**Benefits:**
- ⬆️ Consistent code quality across the team
- ⬇️ Fewer issues in code review
- ⚡ Instant feedback instead of waiting for CI


In [None]:
pre_commit_cfg = r'''
repos:
  - repo: https://github.com/psf/black
    rev: 22.6.0
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/isort
    rev: 5.10.1
    hooks:
      - id: isort
  - repo: https://github.com/pycqa/flake8
    rev: 5.0.4
    hooks:
      - id: flake8
  - repo: local
    hooks:
      - id: pytest-quick
        name: pytest quick
        entry: pytest -q
        language: system
        types: [python]
        pass_filenames: false  # run the whole suite, not the staged files as test paths
'''
print(pre_commit_cfg)

## 4. GitHub Actions: CI to validate the repository

### 🔄 **GitHub Actions: Automated CI/CD**

**Concept:**  
GitHub Actions runs workflows automatically in response to events (push, PR, schedule) on GitHub-hosted runners (Ubuntu/Windows/macOS).

**Anatomy of a Workflow:**

```yaml
name: ci                          # Workflow name
on: [push, pull_request]          # Triggers

jobs:
  test:                           # Job ID
    runs-on: ubuntu-latest        # Runner environment
    steps:
      - uses: actions/checkout@v3 # Official action (git clone)
      
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
          cache: 'pip'            # Dependency cache
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
      
      - name: Run tests
        run: pytest --cov=src --cov-report=xml
      
      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml
```

**Advanced Triggers:**

1. **Schedule (Cron):**
   ```yaml
   on:
     schedule:
       - cron: '0 2 * * *'  # Daily at 02:00 UTC
   ```

2. **Manual Dispatch:**
   ```yaml
   on:
     workflow_dispatch:
       inputs:
         environment:
           description: 'Target environment'
           required: true
           default: 'staging'
   ```

3. **Path Filters:**
   ```yaml
   on:
     push:
       paths:
         - 'src/**'
         - 'tests/**'
   ```

**Parallel Jobs (Matrix Strategy):**
```yaml
jobs:
  test:
    strategy:
      matrix:
        python-version: ['3.8', '3.9', '3.10', '3.11']
        os: [ubuntu-latest, windows-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
```

**Secrets Management:**
```yaml
- name: Deploy to AWS
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  run: aws s3 sync dist/ s3://my-bucket/
```

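**Complete Pipeline for Data:**
```yaml
# Skeleton: each job runs on a fresh runner, so the checkout, Python setup,
# and dependency-install steps (omitted here) are needed in every job.
jobs: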
  lint:
    runs-on: ubuntu-latest
    steps:
      - run: flake8 src/
  
  test:
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - run: pytest -v
  
  data-quality:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - run: great_expectations checkpoint run validation_suite
  
  deploy:
    needs: [test, data-quality]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - run: docker build -t pipeline:${{ github.sha }} .
      - run: docker push pipeline:${{ github.sha }}
```

**Artifacts (Sharing Between Jobs):**
```yaml
- name: Generate report
  run: python generate_report.py

- uses: actions/upload-artifact@v4  # v3 of this action is deprecated
  with:
    name: data-report
    path: reports/*.html
```

**Cost Optimization:**
- Public repos: free on standard GitHub-hosted runners
- Private repos: 2,000 free min/month on the Free plan (then $0.008/min for Linux runners; check current pricing)
- Self-hosted runners to reduce costs
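
As a rough worked example, the sketch below estimates a monthly bill for a private repository using only the two figures quoted above. The workload numbers are hypothetical, and per-OS minute multipliers and plan differences are deliberately ignored.

```python
FREE_MINUTES = 2_000        # free minutes per month quoted above for private repos
PRICE_PER_MINUTE = 0.008    # USD per extra minute quoted above


def monthly_actions_cost(minutes_used, free_minutes=FREE_MINUTES, price=PRICE_PER_MINUTE):
    """Estimated monthly bill in USD: only minutes beyond the free quota are charged."""
    billable = max(0, minutes_used - free_minutes)
    return round(billable * price, 2)


# Hypothetical workload: 20 workflow runs per day, 6 minutes each, over 30 days.
minutes = 20 * 6 * 30
cost = monthly_actions_cost(minutes)
print(f"{minutes} min -> ${cost:.2f}")  # prints "3600 min -> $12.80"
```

Path filters and dependency caching, both shown earlier in this section, reduce `minutes_used` directly and are usually the first levers to try before moving to self-hosted runners.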


In [None]:
gha_yaml = r'''
name: ci
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install deps
        run: |
          pip install -r curso_ingenieria_datos/requirements.txt
      - name: Lint & Test
        run: |
          pip install pytest flake8 black
          flake8 .
          pytest -q
'''
print(gha_yaml)

## 5. Observability: minimal logs and metrics

### 📊 **Observability: Logs, Metrics, and Alerts**

**Structured Logging with Loguru:**

```python
from loguru import logger

# Configuration
logger.add(
    "logs/pipeline_{time}.log",
    rotation="500 MB",
    retention="10 days",
    level="INFO",
    format="{time:YYYY-MM-DD HH:mm:ss} | {level} | {message}",
    serialize=True,  # write each record as JSON; this format alone would omit the bound fields
)

# Structured context
logger.bind(pipeline="etl", dataset="ventas").info(
    "Processed batch",
    records=1000,
    latency_ms=150.5,
    status="success"
)
```

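**JSON Output (simplified; parseable with standard tools):**

Loguru writes JSON when a sink is added with `serialize=True`, nesting these values under a `record` key (bound context and keyword arguments appear in `record.extra`). The flat shape below is a simplified view.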
```json
{
  "timestamp": "2025-10-30T14:23:45.123Z",
  "level": "INFO",
  "pipeline": "etl",
  "dataset": "ventas",
  "message": "Processed batch",
  "records": 1000,
  "latency_ms": 150.5,
  "status": "success"
}
```
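
A flat, one-object-per-line shape like the one above can also be produced with the standard library alone. The sketch below is one way to do it; the `JsonFormatter` class, the logger name, and the `context` attribute are assumptions made for illustration.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one flat JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} are merged at the top level.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for a log file so the sketch has no side effects
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("pipeline_sketch")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False  # keep the record out of the root logger's handlers

log.info(
    "Processed batch",
    extra={"context": {"pipeline": "etl", "dataset": "ventas", "records": 1000,
                       "latency_ms": 150.5, "status": "success"}},
)

# Each line parses back into a dict, which is what log tooling relies on.
entry = json.loads(stream.getvalue())
print(entry["pipeline"], entry["records"], entry["status"])
```

Keeping every field at the top level makes the lines easy to filter with generic JSON tools, at the cost of possible key collisions between context fields and the built-in ones.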

**Essential Metrics (RED Pattern):**

1. **Rate (Throughput):**
   ```python
   records_processed = 10000
   duration_seconds = 60
   throughput = records_processed / duration_seconds
   logger.info(f"throughput={throughput:.2f} records/sec")
   ```

2. **Errors (Error Rate):**
   ```python
   total_records = 10000
   failed_records = 50
   error_rate = (failed_records / total_records) * 100
   logger.warning(f"error_rate={error_rate:.2f}%")
   
   if error_rate > 5.0:
       raise RuntimeError("High error rate detected")  # stand-in for a project-specific alert exception
   ```

3. **Duration (Latency):**
   ```python
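   import time
   start = time.perf_counter()  # monotonic clock, suited to measuring durations
   process_batch()  # placeholder for the work being timed
   latency = time.perf_counter() - start
   logger.info(f"latency={latency:.3f}s")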
   ```
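
The three RED measurements can be collected together in one pass over a batch. The following is a minimal sketch; `BatchMetrics`, `run_with_metrics`, and `parse_amount` are hypothetical names, and the injectable `clock` parameter exists only to make the timing testable.

```python
import time
from dataclasses import dataclass


@dataclass
class BatchMetrics:
    records: int
    failed: int
    duration_s: float

    @property
    def throughput(self):
        """Rate: records per second (0.0 if no time elapsed)."""
        return self.records / self.duration_s if self.duration_s > 0 else 0.0

    @property
    def error_rate_pct(self):
        """Errors: percentage of records whose handler raised."""
        return 100.0 * self.failed / self.records if self.records else 0.0


def run_with_metrics(items, handler, clock=time.perf_counter):
    """Apply handler to each item, counting failures instead of aborting the batch."""
    start = clock()
    failed = 0
    for item in items:
        try:
            handler(item)
        except Exception:
            failed += 1
    return BatchMetrics(records=len(items), failed=failed, duration_s=clock() - start)


def parse_amount(raw):
    return float(raw)  # raises ValueError on non-numeric input


metrics = run_with_metrics(["10", "oops", "3.5", "7"], parse_amount)
print(f"error_rate={metrics.error_rate_pct:.1f}% duration={metrics.duration_s:.6f}s")
```

Counting failures per record instead of stopping at the first exception is what makes an error-rate threshold, such as the 5% check above, meaningful.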

**Prometheus Metrics (Production):**
```python
from prometheus_client import Counter, Histogram, Gauge

records_processed = Counter('records_processed_total', 'Total records')
pipeline_duration = Histogram('pipeline_duration_seconds', 'Duration')
active_pipelines = Gauge('active_pipelines', 'Currently running')

@pipeline_duration.time()
def run_pipeline():
    records_processed.inc(1000)
    # Pipeline logic
```

**Data Observability Platforms:**

1. **Monte Carlo / Bigeye:**
   - Automatic anomaly detection (distribution, volume, freshness)
   - Visual lineage (upstream/downstream dependencies)
   - Alerts: "Table X not updated in 4 hours"

2. **DataDog / New Relic:**
   - APM for pipelines
   - Customizable dashboards
   - Distributed tracing

3. **dbt Cloud:**
   - Historical test results
   - Model timing trends
   - Exposures: which dashboards depend on which models

**Alerting Strategy:**
```python
class DataQualityAlert:
    def __init__(self):
        self.thresholds = {
            'null_rate': 0.05,      # Max 5% nulls
            'duplicate_rate': 0.01, # Max 1% duplicates
            'freshness_hours': 4    # Data < 4h old
        }
    
    def check_and_alert(self, metrics):
        # send_slack_alert and send_pagerduty are placeholders for your alerting integrations
        if metrics['null_rate'] > self.thresholds['null_rate']:
            send_slack_alert(f"⚠️ High null rate: {metrics['null_rate']:.2%}")
            send_pagerduty(severity='high')
```

**Golden Signals for Data:**
- **Freshness**: When did the latest data arrive?
- **Volume**: Is the record count as expected?
- **Schema**: Did columns or types change?
- **Distribution**: Are values out of range?
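
The four signals translate into small checks that can run after every load. This is a library-free sketch under stated assumptions: the `check_golden_signals` function, its parameters, and the sample rows are hypothetical, and the 4-hour freshness limit mirrors the threshold used earlier in this section.

```python
from datetime import datetime, timedelta, timezone


def check_golden_signals(rows, last_loaded_at, now, expected_columns,
                         min_rows, value_ranges, max_age=timedelta(hours=4)):
    """Return (signal, message) pairs for every check that fails."""
    issues = []

    # Freshness: how long ago did the latest data arrive?
    age = now - last_loaded_at
    if age > max_age:
        issues.append(("freshness", f"last load was {age} ago (limit {max_age})"))

    # Volume: did at least the expected number of records arrive?
    if len(rows) < min_rows:
        issues.append(("volume", f"{len(rows)} rows, expected at least {min_rows}"))

    # Schema: did the set of columns change? (types are not checked in this sketch)
    for i, row in enumerate(rows):
        if set(row) != set(expected_columns):
            issues.append(("schema", f"row {i} has columns {sorted(row)}"))
            break  # one report is enough

    # Distribution: are values outside their expected range?
    for column, (low, high) in value_ranges.items():
        outliers = [r[column] for r in rows
                    if r.get(column) is not None and not low <= r[column] <= high]
        if outliers:
            issues.append(("distribution",
                           f"{len(outliers)} value(s) of {column!r} outside [{low}, {high}]"))
    return issues


now = datetime(2025, 10, 30, 14, 0, tzinfo=timezone.utc)
rows = [{"venta_id": 1, "total": 100.0}, {"venta_id": 2, "total": -5.0}]
issues = check_golden_signals(
    rows,
    last_loaded_at=now - timedelta(hours=6),
    now=now,
    expected_columns=["venta_id", "total"],
    min_rows=2,
    value_ranges={"total": (0, 10_000)},
)
for signal, message in issues:
    print(f"{signal}: {message}")
```

Passing `now` explicitly, instead of calling `datetime.now()` inside the function, keeps the freshness check deterministic in tests.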


In [None]:
from loguru import logger
import time, random

def process_batch(n=5):
    for i in range(n):
        start = time.time()
        try:
            if random.random() < 0.1:
                raise ValueError('Random failure')
            time.sleep(0.05)
            latency = time.time() - start
            logger.info(f'item={i} status=ok latency={latency:.3f}s')
        except Exception as e:
            logger.error(f'item={i} status=error err={e}')
process_batch(10)

## 6. Exercises

1. Add a pre-commit rule that blocks files larger than 2 MB.
2. Create an additional workflow that runs data validations with Great Expectations.
3. Add test coverage and publish a badge in the README.