## 🎯 Notebook Context

### What?
An incremental pipeline that loads only the orders created or modified since the last run.
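
A plain append captures new orders, but a modified order would show up as a second row for the same order. One common way to handle this is to merge by key and keep the most recent version. The sketch below is only an illustration: the `order_id` and `status` columns and the `upsert` helper are hypothetical and not part of this notebook's dataset or code.

```python
import pandas as pd


def upsert(existing: pd.DataFrame, changes: pd.DataFrame,
           key: str = "order_id") -> pd.DataFrame:
    """Merge `changes` into `existing`, keeping the latest row per key."""
    combined = pd.concat([existing, changes], ignore_index=True)
    # keep="last" lets rows from `changes` override rows already in `existing`
    return combined.drop_duplicates(subset=key, keep="last").reset_index(drop=True)


# Hypothetical data: order 2 was modified, order 3 is new
existing = pd.DataFrame({"order_id": [1, 2], "status": ["open", "open"]})
changes = pd.DataFrame({"order_id": [2, 3], "status": ["shipped", "open"]})
merged = upsert(existing, changes)
print(merged)
```

This relies on the changed rows being concatenated after the existing ones; if the source exposes a modification timestamp, sorting by it before deduplicating would be a more robust choice.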

### Why?
Reprocessing the full order history every day is expensive. An incremental pipeline reduces run time and resource usage.

### What for?
- Keep the lakehouse up to date without full reloads
- Enable near-real-time analytics
- Optimize infrastructure usage

### When?
Run every hour or every 4 hours, depending on how critical the data is to the business.

### How?
1. Read the last processed date from the checkpoint
2. Filter orders with `date > last_processed`
3. Append to the existing table
4. Update the checkpoint
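
The four steps above can be sketched as a single function. This is a minimal, self-contained illustration rather than the notebook's implementation: `run_incremental` and the sample data are hypothetical, and it writes CSV to a temporary directory instead of Parquet so it runs without extra dependencies.

```python
import tempfile
from pathlib import Path

import pandas as pd


def run_incremental(source: pd.DataFrame, output: Path, checkpoint: Path,
                    default_start: str = "2024-01-01") -> int:
    """Append rows newer than the checkpoint; return how many were added."""
    # 1. Read the last processed timestamp (use a default on the first run)
    if checkpoint.exists():
        last = pd.Timestamp(checkpoint.read_text().strip())
    else:
        last = pd.Timestamp(default_start)
    # 2. Keep only rows strictly newer than the checkpoint
    new_rows = source[source["date"] > last]
    if new_rows.empty:
        return 0  # nothing to load, leave the checkpoint untouched
    # 3. Append to the existing table, or create it on the first run
    if output.exists():
        existing = pd.read_csv(output, parse_dates=["date"])
        combined = pd.concat([existing, new_rows], ignore_index=True)
    else:
        combined = new_rows
    combined.to_csv(output, index=False)
    # 4. Advance the checkpoint to the newest timestamp just loaded
    checkpoint.write_text(new_rows["date"].max().isoformat())
    return len(new_rows)


with tempfile.TemporaryDirectory() as tmp:
    out, ckpt = Path(tmp) / "orders.csv", Path(tmp) / "checkpoint.txt"
    batch = pd.DataFrame({
        "date": pd.to_datetime(["2024-01-02", "2024-01-03"]),
        "amount": [10.0, 20.0],
    })
    first = run_incremental(batch, out, ckpt)   # both rows are new
    second = run_incremental(batch, out, ckpt)  # nothing newer than the checkpoint
    print(first, second)
```

Running the same batch twice loads the rows once and then nothing, which is the property that makes an incremental run safe to repeat.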

In [None]:
import pandas as pd
from src.utils.paths import DATA_RAW, DATA_PROCESSED, ensure_dirs
from src.utils.logging import get_logger
ensure_dirs()
logger = get_logger('DE-02')
logger.info('Starting incremental orders pipeline')

In [None]:
# Simulate a checkpoint: a text file holding the last processed date
checkpoint_file = DATA_PROCESSED / 'checkpoint_orders.txt'
if checkpoint_file.exists():
    last_date = checkpoint_file.read_text().strip()
else:
    last_date = '2024-01-01'
logger.info(f'Last processed date: {last_date}')

In [None]:
orders = pd.read_csv(DATA_RAW / 'orders.csv')
orders['date'] = pd.to_datetime(orders['date'])
# Compare against an explicit Timestamp rather than a raw string
new_orders = orders[orders['date'] > pd.Timestamp(last_date)]
logger.info(f'New orders: {len(new_orders)}')
print(new_orders.head())

In [None]:
# Append to the processed table
output = DATA_PROCESSED / 'orders_incremental.parquet'
if output.exists():
    existing = pd.read_parquet(output)
    combined = pd.concat([existing, new_orders], ignore_index=True)
else:
    combined = new_orders
combined.to_parquet(output, index=False)
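# Update the checkpoint only when this run loaded rows; keep the full timestamp
# so same-day orders are not picked up again on the next run
if not new_orders.empty:
    checkpoint_file.write_text(new_orders['date'].max().isoformat())
logger.info('✅ Incremental pipeline completed')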