# 🌐 Advanced Connectors: REST, GraphQL, and SFTP

Objective: build robust integrations with REST and GraphQL APIs and transfer files over SFTP, applying authentication (API key/OAuth2), pagination, rate limiting, retries with exponential backoff, schema validation, and secure storage.

- Duration: 90–120 min
- Difficulty: Intermediate
- Prerequisites: Python, requests/httpx, networking fundamentals

## 0. Dependencies and configuration

- REST: `requests` (sync) or `httpx` (sync and async).
- GraphQL: plain `requests` works; `gql` is optional.
- SFTP: `paramiko` (optional, not installed by default).
- Suggested environment variables: `API_BASE_URL`, `API_KEY`, `OAUTH_TOKEN_URL`, `OAUTH_CLIENT_ID`, `OAUTH_CLIENT_SECRET`, `SFTP_HOST`, `SFTP_USER`.
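
One way to gather the environment variables listed above is a small settings object built once at startup. The following is a minimal sketch, not part of the original notebook: the names `ConnectorSettings` and `load_settings` are hypothetical, and treating `API_BASE_URL` as mandatory is an assumption made for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ConnectorSettings:
    """Connection settings read from the environment (illustrative name)."""
    api_base_url: str
    api_key: Optional[str]
    oauth_token_url: Optional[str]
    oauth_client_id: Optional[str]
    oauth_client_secret: Optional[str]
    sftp_host: Optional[str]
    sftp_user: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> ConnectorSettings:
    # Assumption of this sketch: API_BASE_URL is required, everything else is optional.
    base_url = env.get("API_BASE_URL")
    if not base_url:
        raise ValueError("API_BASE_URL is not set")
    return ConnectorSettings(
        api_base_url=base_url.rstrip("/"),  # normalize so paths can be appended safely
        api_key=env.get("API_KEY"),
        oauth_token_url=env.get("OAUTH_TOKEN_URL"),
        oauth_client_id=env.get("OAUTH_CLIENT_ID"),
        oauth_client_secret=env.get("OAUTH_CLIENT_SECRET"),
        sftp_host=env.get("SFTP_HOST"),
        sftp_user=env.get("SFTP_USER"),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to test, for example `load_settings({"API_BASE_URL": "https://api.example.com/"})`.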

### 🔌 **Connectors: Integrating External Data**

**Definition:**  
Connectors are components that hide the complexity of integrating with external sources (APIs, files, databases) behind a consistent interface for data extraction.

**Types of Connectors:**

1. **REST APIs (REpresentational State Transfer):**
   - Stateless architecture built on HTTP verbs (GET, POST, PUT, DELETE)
   - JSON/XML responses
   - Authentication: API key, OAuth2, JWT
   - Usage: the most common style among public APIs

2. **GraphQL:**
   - Query language that lets the client request exactly the fields it needs
   - Single endpoint (vs multiple REST endpoints)
   - Reduces over-fetching and under-fetching
   - Usage: modern APIs (GitHub, Shopify, Facebook)

3. **SFTP (SSH File Transfer Protocol):**
   - Secure file transfer over SSH
   - Common in legacy/enterprise integrations
   - Usage: EDI (Electronic Data Interchange), batch files

4. **Database Connectors:**
   - JDBC/ODBC for SQL databases
   - Native drivers (psycopg2, pymongo)

**Common Challenges:**

| Challenge | Solution |
|-----------|----------|
| Rate limiting | Exponential backoff, request throttling |
| Authentication | Secrets manager, automatic token refresh |
| Pagination | Cursor-based vs offset-based |
| Idempotency | Deduplication by hash/timestamp |
| Resilience | Circuit breaker, retry policies |
| Dynamic schemas | Schema registry (Avro, Protobuf) |
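
The "deduplication by hash" entry in the table can be illustrated with a content fingerprint. This is a minimal sketch under assumptions: `record_fingerprint` and `deduplicate` are hypothetical helper names, records are JSON-serializable dictionaries, and the set of seen hashes lives in memory (a real pipeline would likely persist it).

```python
import hashlib
import json
from typing import Any, Dict, Iterable, Iterator, Set


def record_fingerprint(record: Dict[str, Any]) -> str:
    """Stable SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(records: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield each distinct record once, keeping the first occurrence."""
    seen: Set[str] = set()
    for record in records:
        key = record_fingerprint(record)
        if key in seen:
            continue  # already processed: skipping makes re-runs idempotent
        seen.add(key)
        yield record
```

Because the fingerprint depends only on content, replaying the same page after a retry does not produce duplicate rows downstream.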

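**Generic Connector Pattern:**
```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator


class DataConnector(ABC):
    def __init__(self, config: Dict[str, Any]):
        self.auth = self._authenticate(config)

    @abstractmethod
    def _authenticate(self, config: Dict[str, Any]) -> Any:
        """Build credentials (API key header, OAuth2 token, SSH key...)."""

    @abstractmethod
    def extract(self, params: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
        """Yield records page by page."""

    @abstractmethod
    def validate(self, record: Dict[str, Any]) -> bool:
        """Check a record against the expected schema."""

    def _handle_errors(self, exception: Exception) -> None:
        """Hook for retry logic and alerting."""
```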

---
**Author:** Luis J. Raigoso V. (LJRV)

## 1. REST with requests: pagination, backoff, and validation

### 🌐 **REST APIs: Exponential Backoff and Rate Limiting**

**Rate Limiting:**  
Limits that APIs impose to prevent abuse:
- **429 Too Many Requests**: you exceeded the limit
- Common headers:
  - `X-RateLimit-Limit: 5000` (requests allowed per window, for example per hour)
  - `X-RateLimit-Remaining: 4999`
  - `Retry-After: 60` (seconds to wait before retrying; it can also be an HTTP date)
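
The `Retry-After` header above can carry either a number of seconds or an HTTP date, so a client needs to handle both. A minimal sketch using only the standard library follows; `retry_after_seconds` is a hypothetical helper name, and the exact-case header lookup assumes a plain dictionary (`requests` exposes a case-insensitive mapping, so there the lookup works regardless of case).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_after_seconds(headers: Mapping[str, str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Return how long to wait according to Retry-After, or None if absent or invalid."""
    value = headers.get("Retry-After")
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "60"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

When the header is missing, the caller can fall back to exponential backoff.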

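**Exponential Backoff Pattern:**
```python
# attempt is zero-based: 0 for the first retry, 1 for the second, ...
wait_time = base_delay * (2 ** attempt) + jitter

# With base_delay = 1 and no jitter:
# Retry 1 (attempt=0): 1s
# Retry 2 (attempt=1): 2s
# Retry 3 (attempt=2): 4s
# Retry 4 (attempt=3): 8s
# Retry 5 (attempt=4): 16s
```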

**Jitter (randomness):**
- Avoids the "thundering herd" problem (many clients retrying at the same moment)
```python
import random
jitter = random.uniform(0, 1)
wait = (2 ** attempt) + jitter
```
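
Putting the backoff formula and the jitter together, the wait before each retry could be computed as in the sketch below. The function names, the `max_delay` cap, and the injectable `rng` are assumptions added for illustration; the source only gives the formula itself.

```python
import random
from typing import List, Optional


def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (zero-based): capped exponential plus jitter."""
    rng = rng or random
    exponential = min(max_delay, base_delay * (2 ** attempt))
    jitter = rng.uniform(0, base_delay)  # spreads out clients that failed together
    return min(max_delay, exponential + jitter)


def backoff_schedule(retries: int, **kwargs) -> List[float]:
    """Delays for retries 0..retries-1, useful for logging or tests."""
    return [backoff_delay(attempt, **kwargs) for attempt in range(retries)]
```

Capping the delay keeps a long outage from producing waits of many minutes, and passing a seeded `random.Random` makes the schedule reproducible in tests.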

**Robust Implementation:**
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry_strategy = Retry(
    total=5,
    backoff_factor=1,  # sleep between retries grows exponentially, scaled by this factor
    status_forcelist=[429, 500, 502, 503, 504],
    # POST is not idempotent: keep it here only if the API tolerates repeated requests
    allowed_methods=["HEAD", "GET", "OPTIONS", "POST"]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)
```

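**Circuit Breaker Pattern:**
```python
import time

class CircuitBreakerOpen(Exception):
    pass

class CircuitBreaker:
    CLOSED = 0     # Normal operation
    OPEN = 1       # Too many failures, block requests
    HALF_OPEN = 2  # Test if service recovered

    def __init__(self, max_failures=5, reset_timeout=30):  # illustrative defaults
        self.state = self.CLOSED
        self.failures = 0
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.next_retry = 0.0

    def call(self, func):
        if self.state == self.OPEN:
            if time.time() > self.next_retry:
                self.state = self.HALF_OPEN
            else:
                raise CircuitBreakerOpen()
        try:
            result = func()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failures, self.state = 0, self.CLOSED

    def _on_failure(self):
        self.failures += 1
        if self.state == self.HALF_OPEN or self.failures >= self.max_failures:
            self.state = self.OPEN
            self.next_retry = time.time() + self.reset_timeout
```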

**Timeouts:**
```python
# Connect timeout: 3s, Read timeout: 10s
response = requests.get(url, timeout=(3, 10))
```

**Best Practices:**
- Cache responses where possible (`requests-cache`)
- Use `Session()` for connection pooling
- Log the request_id of each call for debugging
- Monitor P50, P95, and P99 latency


```python
import os
import time
from typing import Any, Dict, List, Optional

import requests

BASE_URL = os.getenv('API_BASE_URL', 'https://api.publicapis.org')
API_KEY = os.getenv('API_KEY')  # if applicable

def get_with_backoff(url: str, headers: Optional[Dict[str, str]] = None,
                     params: Optional[Dict[str, Any]] = None, max_retries: int = 5):
    """GET with automatic retries and exponential backoff on HTTP 429."""
    for i in range(max_retries):
        try:
            resp = requests.get(url, headers=headers, params=params, timeout=30)
            if resp.status_code == 429:  # rate limit
                wait = 2 ** i
                print(f'⏳ Rate limit detected, waiting {wait}s...')
                time.sleep(wait)
                continue
            resp.raise_for_status()
            return resp.json()
        except requests.exceptions.ConnectionError:
            print('❌ Connection error (no internet or API unavailable)')
            # Demo only: return simulated data so the notebook runs offline.
            # A production connector would retry or re-raise here.
            return {'entries': [{'API': 'Example', 'Description': 'Demo API'}], 'count': 1}
    raise RuntimeError('Max retries exceeded')

# Try a real call (falls back to simulated data if it fails)
try:
    data = get_with_backoff(f'{BASE_URL}/entries')
    print(f'✅ Total entries fetched: {len(data.get("entries", []))}')
except Exception as e:
    print('ℹ️ Using simulated data (no real connection)')
    data = {'entries': [{'API': 'Demo', 'Description': 'Simulation'}], 'count': 1}

print(f'📊 Result: {len(data.get("entries", []))} records')
```

```
❌ Connection error (no internet or API unavailable)
✅ Total entries fetched: 1
📊 Result: 1 records
```


### 1.1 Cursor/offset pagination and incremental storage

### 📄 **Pagination: Strategies for Large Datasets**

**1. Offset-based Pagination:**
```text
GET /api/items?offset=0&limit=100
GET /api/items?offset=100&limit=100
GET /api/items?offset=200&limit=100
```

**Advantages:**
- Simple to implement
- Allows jumping straight to any page (for example page 5)

**Disadvantages:**
- ❌ Inefficient for large offsets (the database scans and discards the skipped rows)
- ❌ Inconsistent results if the data changes while paginating
- ❌ Rows may be duplicated or skipped

**2. Cursor-based Pagination (Recommended):**
```text
GET /api/items?cursor=initial
Response: {
  "data": [...],
  "next_cursor": "eyJpZCI6MTAwfQ=="
}

GET /api/items?cursor=eyJpZCI6MTAwfQ==
```

**Advantages:**
- ✅ Consistent (the cursor marks an exact position)
- ✅ Fast (uses indexes)
- ✅ Tolerates data changes during pagination

**Cursor Implementation:**
```python
import base64, json

def encode_cursor(last_id: int) -> str:
    # Compact separators so encode_cursor(100) == "eyJpZCI6MTAwfQ=="
    payload = json.dumps({'id': last_id}, separators=(',', ':'))
    return base64.b64encode(payload.encode()).decode()

def decode_cursor(cursor: str) -> int:
    return json.loads(base64.b64decode(cursor))['id']

# Server-side query: bind decode_cursor(cursor) as the parameter
QUERY = "SELECT * FROM items WHERE id > ? ORDER BY id LIMIT 100"
```

**3. Page-based Pagination:**
```text
GET /api/items?page=1&per_page=100
GET /api/items?page=2&per_page=100
```

**Advantages:**
- Intuitive for users (UI)

**Disadvantages:**
- Same performance issues as offset-based pagination

**4. Keyset Pagination (Seek Method):**
```sql
SELECT * FROM items 
WHERE (created_at, id) > ('2025-10-30 12:00:00', 12345)
ORDER BY created_at, id
LIMIT 100
```
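
To see the seek method above run end to end, here is a small sketch against an in-memory SQLite database. The table contents and the `keyset_pages` helper are invented for the demonstration, and the row-value comparison `(created_at, id) > (?, ?)` assumes SQLite 3.15 or newer.

```python
import sqlite3
from typing import Iterator, List, Optional, Tuple

Row = Tuple[str, int]  # (created_at, id)


def keyset_pages(conn: sqlite3.Connection, page_size: int = 2) -> Iterator[List[Row]]:
    """Yield pages ordered by (created_at, id), seeking past the last row seen."""
    last: Optional[Row] = None
    while True:
        if last is None:
            rows = conn.execute(
                "SELECT created_at, id FROM items ORDER BY created_at, id LIMIT ?",
                (page_size,),
            ).fetchall()
        else:
            rows = conn.execute(
                "SELECT created_at, id FROM items "
                "WHERE (created_at, id) > (?, ?) "
                "ORDER BY created_at, id LIMIT ?",
                (last[0], last[1], page_size),
            ).fetchall()
        if not rows:
            return
        yield rows
        last = rows[-1]  # the next page starts strictly after this row


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO items (id, created_at) VALUES (?, ?)",
    [(1, "2025-10-30 12:00:00"), (2, "2025-10-30 12:00:00"),
     (3, "2025-10-30 12:05:00"), (4, "2025-10-30 12:10:00"),
     (5, "2025-10-30 12:10:00")],
)
pages = list(keyset_pages(conn))
```

Including `id` as a tie-breaker matters here: several rows share the same `created_at`, and ordering by the timestamp alone could skip or repeat them at page boundaries.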

**Incremental Storage:**
```python
import pickle

def fetch_all_incremental(url, checkpoint_file='checkpoint.pkl'):
    # fetch_page(url, cursor) is assumed to return {'items': [...], 'next_cursor': ...}
    # Resume from the last saved cursor, if there is one
    try:
        with open(checkpoint_file, 'rb') as f:
            cursor = pickle.load(f)
    except FileNotFoundError:
        cursor = None
    
    while True:
        data = fetch_page(url, cursor)
        yield from data['items']
        
        cursor = data.get('next_cursor')
        if not cursor:
            break
        
        # Checkpoint for recovery (only load pickle files you wrote yourself)
        with open(checkpoint_file, 'wb') as f:
            pickle.dump(cursor, f)
```

**Rate Limiting with Pagination:**
```python
import time

for page in paginate(url):
    process(page)
    time.sleep(1)  # 1 request/second
```
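
A fixed `time.sleep(1)` always waits a full second, even when processing the page already took longer. A small throttle that sleeps only for the remaining time is sketched below; the `Throttle` class is hypothetical, and the injectable `clock` and `sleep` parameters exist only to make it testable.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between calls (1.0 means at most 1 request/second)."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep only as long as needed; return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay  # moment at which the next request is released
        return delay
```

Calling `throttle.wait()` right before each page request keeps the request rate bounded without adding delay when the pipeline is already slower than the limit.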


```python
def fetch_paginated(base_url: str, page_param='page', per_page=50, limit=3):
    """Fetch up to `limit` pages and stop early on the first empty page."""
    all_items: List[Dict[str, Any]] = []
    for p in range(1, limit + 1):
        payload = get_with_backoff(base_url, params={page_param: p, 'per_page': per_page})
        items = payload.get('data') or payload.get('entries') or []
        if not items:
            break
        all_items.extend(items)
    return all_items

items = fetch_paginated(f'{BASE_URL}/entries', page_param='page', per_page=20, limit=2)
len(items)
```

```
❌ Connection error (no internet or API unavailable)
❌ Connection error (no internet or API unavailable)

2
```

### 1.2 Schema validation with Cerberus/Pandera

### ✅ **Schema Validation: Cerberus vs Pydantic**

**Cerberus (Flexible Validation):**

```python
from cerberus import Validator

schema = {
    'user_id': {'type': 'integer', 'required': True, 'min': 1},
    'email': {'type': 'string', 'regex': r'^[\w\.-]+@[\w\.-]+\.\w+$'},
    'age': {'type': 'integer', 'min': 0, 'max': 120, 'nullable': True},
    'status': {'type': 'string', 'allowed': ['active', 'inactive', 'pending']},
    'tags': {'type': 'list', 'schema': {'type': 'string'}},
    'metadata': {'type': 'dict', 'allow_unknown': True}
}

v = Validator(schema)

# Illustrative document with two invalid fields
document = {'user_id': 1, 'email': 'not-an-email', 'age': 130}

# Validation
if v.validate(document):
    print("Valid!")
else:
    print("Errors:", v.errors)
    # {'email': ['value does not match regex ...'], 'age': ['max value is 120']}
```

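**Custom Rules:**
```python
from cerberus import Validator

class MyValidator(Validator):
    def _validate_is_even(self, is_even, field, value):
        """Check that the value is even.

        The rule's arguments are validated against this schema:
        {'type': 'boolean'}
        """
        if is_even and value % 2 != 0:
            self._error(field, "Must be even")

schema = {'count': {'type': 'integer', 'is_even': True}}
v = MyValidator(schema)
```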

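**Pydantic (Type-Safe Models):**

```python
# Pydantic v1 API. In v2, @validator becomes @field_validator and
# class Config becomes model_config = ConfigDict(...).
from pydantic import BaseModel, EmailStr, Field, validator
from typing import List, Literal, Optional

class User(BaseModel):
    user_id: int = Field(..., gt=0)
    email: EmailStr  # requires the email-validator package
    age: Optional[int] = Field(None, ge=0, le=120)
    status: Literal['active', 'inactive', 'pending']
    tags: List[str] = []
    
    @validator('age')
    def validate_age(cls, v):
        # Compare against None explicitly so that age 0 is not skipped
        if v is not None and v < 18:
            raise ValueError('Must be 18+')
        return v
    
    class Config:
        validate_assignment = True  # Also validate on attribute updates

# Automatic parsing (api_response is the decoded JSON payload)
user = User(**api_response)  # Raises ValidationError if invalid
```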

**Comparison:**

| Aspect | Cerberus | Pydantic |
|--------|----------|----------|
| **Definition** | Dict-based schema | Class-based models |
| **Type Safety** | Runtime only | IDE autocomplete + mypy |
| **Performance** | Slower (pure Python) | Faster (v1 compiled with Cython; v2 core written in Rust) |
| **Coercion** | Explicit (`coerce` rule) | Automatic (str→int) |
| **Serialization** | Manual | `.json()`, `.dict()` (v1); `.model_dump_json()`, `.model_dump()` (v2) |
| **Use Case** | Dynamic schemas | API contracts |

**Recommended Strategy:**
- **Pydantic**: API inputs/outputs, configuration
- **Cerberus**: validating external data with dynamic schemas
- **Pandera**: DataFrames (columnar data)

**Schema Registry (Production):**
```python
# Confluent Schema Registry (Avro)
from confluent_kafka.schema_registry import SchemaRegistryClient

sr_client = SchemaRegistryClient({'url': 'http://localhost:8081'})
schema = sr_client.get_latest_version('user-value')

# Validate against the versioned schema
```

**Fail Fast vs Fail Late:**
- **Fail Fast**: validate at the boundary (API ingestion) and reject bad input immediately
- **Fail Late**: keep processing and log the errors (allows partial success)
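
The difference between the two strategies shows up clearly on a batch with a few bad rows. The sketch below is illustrative only: `is_valid` encodes a made-up rule (a non-empty string under `API`), and both function names are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def is_valid(record: Record) -> bool:
    # Hypothetical rule for the demo: 'API' must be a non-empty string.
    return isinstance(record.get("API"), str) and bool(record["API"])


def validate_fail_fast(records: Iterable[Record],
                       check: Callable[[Record], bool] = is_valid) -> List[Record]:
    """Abort the whole batch on the first invalid record."""
    accepted = []
    for index, record in enumerate(records):
        if not check(record):
            raise ValueError(f"invalid record at position {index}: {record!r}")
        accepted.append(record)
    return accepted


def validate_fail_late(records: Iterable[Record],
                       check: Callable[[Record], bool] = is_valid
                       ) -> Tuple[List[Record], List[Record]]:
    """Keep going: return (accepted, rejected) so the good rows still load."""
    accepted: List[Record] = []
    rejected: List[Record] = []
    for record in records:
        (accepted if check(record) else rejected).append(record)
    return accepted, rejected
```

With fail-late, the rejected list can be written to a quarantine location for later inspection while the accepted rows continue through the pipeline.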


```python
from cerberus import Validator
schema = {
  'API': {'type':'string', 'required': True},
  'Description': {'type':'string', 'nullable': True},
  'HTTPS': {'type':'boolean'},
}
v = Validator(schema, allow_unknown=True)
valid_count = sum(1 for it in items if v.validate(it))
valid_count, len(items)
```

```
(2, 2)
```

## 2. OAuth2 Client Credentials (httpx) [optional]

### 🔐 **OAuth2: Authentication Flows**

**OAuth2 Grant Types:**

1. **Client Credentials (Machine-to-Machine):**
   ```
   App → Token Server: client_id + client_secret
   Token Server → App: access_token
   App → API: Authorization: Bearer {access_token}
   ```
   
   **Usage:** backend services, data pipelines
   
   **Code:**
   ```python
   token_response = requests.post(
       'https://oauth.example.com/token',
       data={
           'grant_type': 'client_credentials',
           'client_id': CLIENT_ID,
           'client_secret': CLIENT_SECRET,
           'scope': 'read:data write:data'
       }
   )
   access_token = token_response.json()['access_token']
   ```

2. **Authorization Code (User Authorization):**
   ```
   User → App: starts login
   App → Auth Server: redirect with client_id
   User → Auth Server: authorizes the app
   Auth Server → App: redirect with code
   App → Token Server: exchanges code for token
   ```
   
   **Usage:** web applications with end users

3. **Refresh Token Flow:**
   ```python
   # access_token expires in 1h, refresh_token in 30 days
   if is_token_expired(access_token):
       new_tokens = requests.post(
           token_url,
           data={
               'grant_type': 'refresh_token',
               'refresh_token': refresh_token,
               'client_id': CLIENT_ID
           }
       ).json()
       access_token = new_tokens['access_token']
   ```

**Token Management:**
```python
import time
from threading import Lock

import requests

class TokenManager:
    def __init__(self, token_url, client_id, client_secret):
        self.token_url = token_url
        self.client_id = client_id
        self.client_secret = client_secret
        self._token = None
        self._expires_at = 0
        self._lock = Lock()
    
    def get_token(self):
        with self._lock:
            if time.time() >= self._expires_at - 60:  # Refresh 60s before expiry
                self._refresh_token()
            return self._token
    
    def _refresh_token(self):
        response = requests.post(
            self.token_url,
            data={
                'grant_type': 'client_credentials',
                'client_id': self.client_id,
                'client_secret': self.client_secret,
            },
            timeout=30,
        )
        response.raise_for_status()
        payload = response.json()
        self._token = payload['access_token']
        self._expires_at = time.time() + payload['expires_in']
```

**PKCE (Proof Key for Code Exchange):**
- Extra security for mobile apps and SPAs
- Generates a `code_verifier` and a derived `code_challenge`
- Prevents authorization code interception

**Best Practices:**
- 🔐 Never hardcode secrets (use a secrets manager)
- 🔄 Implement automatic token refresh
- ⏰ Cache tokens until they expire
- 🚨 Renew before expiry (1-5 min buffer)
- 📝 Log token refreshes for debugging


```python
import asyncio
import os

import httpx

async def fetch_oauth_token():
    """Request a client-credentials token; returns None when OAUTH_TOKEN_URL is unset."""
    token_url = os.getenv('OAUTH_TOKEN_URL')
    if not token_url:
        return None
    data = {
      'grant_type':'client_credentials',
      'client_id': os.getenv('OAUTH_CLIENT_ID'),
      'client_secret': os.getenv('OAUTH_CLIENT_SECRET'),
      'scope': os.getenv('OAUTH_SCOPE','')
    }
    async with httpx.AsyncClient(timeout=30) as client:
        r = await client.post(token_url, data=data)
        r.raise_for_status()
        return r.json().get('access_token')

# Usage in a script: token = asyncio.run(fetch_oauth_token())
# Inside a notebook (an event loop is already running): token = await fetch_oauth_token()
```

## 3. GraphQL: queries and pagination

### 🔍 **GraphQL: Efficient Queries and Pagination**

**Advantages over REST:**

1. **Exact Data Fetching:**
   ```graphql
   # REST: GET /users/1 → returns EVERYTHING
   # GraphQL: only what you need
   query {
     user(id: 1) {
       name
       email
       # address, phone, etc. are not returned
     }
   }
   ```

2. **No Under-fetching:**
   - REST: 3 endpoints → 3 requests
   - GraphQL: 1 query with nested fields

3. **Strongly Typed Schema:**
   ```graphql
   type User {
     id: ID!           # ! = required
     name: String!
     age: Int
     posts: [Post!]!   # Non-null list of non-null Posts
   }
   ```

**Query Anatomy:**
```graphql
query GetUserPosts($userId: ID!, $first: Int = 10) {
  user(id: $userId) {
    name
    posts(first: $first) {
      edges {
        node {
          id
          title
          createdAt
        }
        cursor
      }
      pageInfo {
        hasNextPage
        endCursor
      }
    }
  }
}
```

**Variables:**
```python
variables = {"userId": "123", "first": 20}
response = requests.post(
    graphql_url,
    json={"query": query, "variables": variables}
)
```

**Relay-style Pagination (Cursor-based):**
```graphql
query {
  users(first: 10, after: "cursor123") {
    edges {
      cursor
      node { id name }
    }
    pageInfo {
      hasNextPage
      endCursor
    }
  }
}
```

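**Python Implementation:**
```python
import requests

def fetch_all_graphql(url, query, variables):
    """Yield every edge of `users`; the query must declare an $after variable."""
    has_next = True
    cursor = None
    
    while has_next:
        page_vars = {**variables, 'after': cursor}  # avoid shadowing the built-in vars()
        resp = requests.post(url, json={'query': query, 'variables': page_vars}, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        if data.get('errors'):  # GraphQL reports errors in the body, often with HTTP 200
            raise RuntimeError(data['errors'])
        
        yield from data['data']['users']['edges']
        
        page_info = data['data']['users']['pageInfo']
        has_next = page_info['hasNextPage']
        cursor = page_info['endCursor']
```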

**Mutations (Writes):**
```graphql
mutation CreateUser($input: CreateUserInput!) {
  createUser(input: $input) {
    user {
      id
      name
    }
    errors {
      field
      message
    }
  }
}
```

**Introspection (Schema Discovery):**
```graphql
query {
  __schema {
    types {
      name
      fields {
        name
        type { name }
      }
    }
  }
}
```

**gql Library (Python):**
```python
from gql import gql, Client
from gql.transport.requests import RequestsHTTPTransport

transport = RequestsHTTPTransport(url=graphql_url)
client = Client(transport=transport, fetch_schema_from_transport=True)

query = gql("""
  query GetUser($id: ID!) {
    user(id: $id) { name email }
  }
""")

result = client.execute(query, variable_values={"id": "123"})
```

**Performance:**
- ⚠️ N+1 query problem: a resolver may issue one query per item
- ✅ Solution: DataLoader (batching + caching)
- 🚦 Rate limiting: based on query complexity, not request count


```python
GQL_URL = os.getenv('GQL_URL', 'https://countries.trevorblades.com/')

query = """
query getCountries($continent: ID!) {
  continent(code: $continent) {
    name
    countries {
      code
      name
      currency
      emoji
    }
  }
}
"""

variables = {"continent": "SA"}  # South America as an example

resp = requests.post(GQL_URL, json={"query": query, "variables": variables}, timeout=30)
resp.raise_for_status()
data = resp.json()

# Show the first 5 entries
data['data']['continent']['countries'][:5]
```

```
[{'code': 'AR', 'currency': 'ARS', 'emoji': '🇦🇷', 'name': 'Argentina'},
 {'code': 'BO', 'currency': 'BOB,BOV', 'emoji': '🇧🇴', 'name': 'Bolivia'},
 {'code': 'BR', 'currency': 'BRL', 'emoji': '🇧🇷', 'name': 'Brazil'},
 {'code': 'CL', 'currency': 'CLF,CLP', 'emoji': '🇨🇱', 'name': 'Chile'},
 {'code': 'CO', 'currency': 'COP', 'emoji': '🇨🇴', 'name': 'Colombia'}]
```

