# 🎲 Synthetic Data Generation with LLMs

Objective: use generative AI to create realistic test data, augment datasets, anonymize sensitive data, and generate edge cases for testing.

- Duration: 90 min
- Difficulty: Medium
- Stack: OpenAI, Faker, SDV

### 🎲 **Synthetic Data with LLMs: From Testing to Production**

**Why Generate Synthetic Data?**

```
Traditional Problem                   LLM-Based Solution
───────────────────                   ──────────────────
❌ Sensitive real data (PII)         ✅ Intelligent anonymization
❌ Small datasets for ML             ✅ Realistic data augmentation
❌ Edge cases hard to create         ✅ Prompt-guided generation
❌ Testing with production data      ✅ Synthetic data on demand
❌ Cold start (new products)         ✅ Realistic seed data
```

**Evolution of Techniques:**

| Era | Technique | Limitations | Example |
|-----|-----------|-------------|---------|
| **2010s** | Random data | Unrealistic, uncorrelated | `random.randint(1, 100)` |
| **2015** | Faker/Mimesis | Predefined patterns, no context | `faker.name()` → "John Doe" |
| **2018** | GANs/VAEs | Requires training, hard to interpret | CTGAN for tabular data |
| **2020** | SDV (Synthetic Data Vault) | Preserves correlations, but slow with complex schemas | Gaussian Copula |
| **2023+** | **LLMs (GPT-4, Claude)** | Slow and costly at scale, output needs validation (strengths: **context-aware, flexible, explainable**) | Prompt → coherent data |

**LLM Generation Architecture:**

```
┌─────────────────────────────────────────────────────────────┐
│                   SYNTHETIC DATA PIPELINE                    │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  1️⃣ SCHEMA DEFINITION                                        │
│     ┌─────────────────────────────────┐                     │
│     │ JSON Schema / Pydantic Model    │                     │
│     │ + Business Rules                │                     │
│     │ + Data Relationships            │                     │
│     └─────────────────────────────────┘                     │
│                    ↓                                          │
│  2️⃣ PROMPT ENGINEERING                                       │
│     ┌─────────────────────────────────┐                     │
│     │ Schema → Structured Prompt      │                     │
│     │ + Examples (few-shot)           │                     │
│     │ + Constraints (range, format)   │                     │
│     │ + Context (business domain)     │                     │
│     └─────────────────────────────────┘                     │
│                    ↓                                          │
│  3️⃣ LLM GENERATION                                           │
│     ┌─────────────────────────────────┐                     │
│     │ GPT-4: High quality, expensive  │                     │
│     │ GPT-3.5: Fast, cheap, good      │                     │
│     │ Batch API: 50% cheaper, 24h     │                     │
│     │ Temperature: 0.7-0.9 diversity  │                     │
│     └─────────────────────────────────┘                     │
│                    ↓                                          │
│  4️⃣ VALIDATION & POST-PROCESSING                             │
│     ┌─────────────────────────────────┐                     │
│     │ Pydantic validation             │                     │
│     │ Great Expectations checks       │                     │
│     │ Statistical distribution match  │                     │
│     │ Uniqueness constraints          │                     │
│     └─────────────────────────────────┘                     │
│                    ↓                                          │
│  5️⃣ OUTPUT STORAGE                                           │
│     ┌─────────────────────────────────┐                     │
│     │ CSV/Parquet (batch)             │                     │
│     │ Database (structured)           │                     │
│     │ S3 (data lake)                  │                     │
│     └─────────────────────────────────┘                     │
└─────────────────────────────────────────────────────────────┘
```
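The five stages above can be sketched end to end without calling a real model. This is a minimal illustration under stated assumptions: `fake_llm` is a hypothetical stub standing in for the API call, the hand-rolled `validate` stands in for Pydantic, and the two-field schema is invented for the example.

```python
import csv
import io
import json

# 1. Schema definition: field name -> (type, min, max); illustrative only
SCHEMA = {"customer_id": (int, 1, 999_999), "age": (int, 18, 100)}

def build_prompt(schema: dict, count: int) -> str:
    """2. Turn the schema into a structured prompt."""
    lines = [f"Generate {count} records as a JSON object with key 'records'."]
    for name, (typ, lo, hi) in schema.items():
        lines.append(f"- {name}: {typ.__name__} in [{lo}, {hi}]")
    return "\n".join(lines)

def fake_llm(prompt: str) -> str:
    """3. Hypothetical stand-in for the LLM call; returns canned JSON."""
    return json.dumps({"records": [
        {"customer_id": 1001, "age": 34},
        {"customer_id": 1002, "age": 7},   # violates the age constraint
    ]})

def validate(schema: dict, records: list) -> list:
    """4. Keep only records whose fields match type and range."""
    valid = []
    for rec in records:
        ok = all(
            isinstance(rec.get(name), typ) and lo <= rec[name] <= hi
            for name, (typ, lo, hi) in schema.items()
        )
        if ok:
            valid.append(rec)
    return valid

def to_csv(records: list) -> str:
    """5. Serialize validated records to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(SCHEMA))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

raw = json.loads(fake_llm(build_prompt(SCHEMA, 2)))["records"]
clean = validate(SCHEMA, raw)
print(to_csv(clean))
```

Only the first canned record survives validation, which is the point of stage 4: the model's output is treated as untrusted input.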

**Synthetic Data Generator (Production):**

```python
# Written against the Pydantic v1 API (`regex=`, `@validator`, `.dict()`, `__fields__`);
# Pydantic v2 renames these to `pattern=`, `@field_validator`, `.model_dump()`, `model_fields`.
from pydantic import BaseModel, Field, validator
from openai import OpenAI
from typing import List, Optional
import json

class CustomerSchema(BaseModel):
    """Schema with automatic validation"""
    customer_id: int = Field(..., gt=0, lt=1000000)
    name: str = Field(..., min_length=2, max_length=100)
    email: str = Field(..., regex=r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
    age: int = Field(..., ge=18, le=100)
    city: str
    country: str = Field(default='USA')
    signup_date: str = Field(..., regex=r'^\d{4}-\d{2}-\d{2}$')
    lifetime_value: float = Field(..., ge=0, le=100000)
    
    @validator('email')
    def validate_email_domain(cls, v):
        """Reject placeholder/test email domains"""
        # Compare the exact domain so that e.g. 'contest.com' is not rejected by accident
        domain = v.rsplit('@', 1)[-1].lower()
        if domain in {'test.com', 'example.com', 'fake.com'}:
            raise ValueError('Email domain not allowed')
        return v

class SyntheticDataGenerator:
    """Robust synthetic data generator"""
    
    # JSON mode (used in generate) needs a model that supports it, e.g. gpt-4-turbo
    def __init__(self, model: str = 'gpt-4-turbo', temperature: float = 0.8):
        self.client = OpenAI()
        self.model = model
        self.temperature = temperature
    
    def generate(
        self, 
        schema: type[BaseModel], 
        count: int,
        context: Optional[str] = None,
        few_shot_examples: Optional[List[dict]] = None
    ) -> List[dict]:
        """
        Generate synthetic data with validation.
        
        Args:
            schema: Pydantic model class that defines the structure
            count: Number of records to generate
            context: Business context for realism
            few_shot_examples: Examples to guide generation
        
        Returns:
            List of validated dictionaries (may be shorter than count
            if some records fail validation)
        """
        # Build the structured prompt
        prompt_parts = [
            f"Generate {count} realistic synthetic records following this schema:\n",
            self._schema_to_prompt(schema),
        ]
        
        if context:
            prompt_parts.append(f"\nBusiness context: {context}")
        
        if few_shot_examples:
            prompt_parts.append("\nExamples of desired output:")
            prompt_parts.append(json.dumps(few_shot_examples, indent=2))
        
        # JSON mode returns a single object, so ask for the array under a known key
        prompt_parts.append(
            f'\n\nReturn a JSON object of the form {{"records": [...]}} with exactly {count} records.'
        )
        prompt_parts.append("Ensure diversity, realism, and consistency.")
        
        prompt = "\n".join(prompt_parts)
        
        # LLM call
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "You are a data generation expert. Always return valid JSON."},
                {"role": "user", "content": prompt}
            ],
            temperature=self.temperature,
            response_format={"type": "json_object"}  # JSON mode: syntactically valid JSON, schema not guaranteed
        )
        
        # Parse the response
        raw_data = json.loads(response.choices[0].message.content)
        
        # Extract the array (handles wrapped responses)
        if isinstance(raw_data, dict) and 'records' in raw_data:
            raw_data = raw_data['records']
        elif isinstance(raw_data, dict) and 'data' in raw_data:
            raw_data = raw_data['data']
        
        # Validate each record with Pydantic
        validated = []
        errors = []
        
        for i, record in enumerate(raw_data):
            try:
                validated_record = schema(**record)
                validated.append(validated_record.dict())
            except Exception as e:
                errors.append(f"Record {i}: {str(e)}")
        
        if errors:
            print(f"⚠️  Validation errors: {len(errors)}/{len(raw_data)}")
            for error in errors[:3]:  # Show the first 3
                print(f"   {error}")
        
        return validated
    
    def _schema_to_prompt(self, schema: type[BaseModel]) -> str:
        """Convert a Pydantic schema into a readable description"""
        lines = ["Schema:"]
        for field_name, field in schema.__fields__.items():
            field_type = field.outer_type_.__name__
            constraints = []
            
            if hasattr(field, 'field_info'):
                if field.field_info.ge is not None:
                    constraints.append(f">= {field.field_info.ge}")
                if field.field_info.le is not None:
                    constraints.append(f"<= {field.field_info.le}")
                if field.field_info.min_length:
                    constraints.append(f"min_length={field.field_info.min_length}")
                if field.field_info.max_length:
                    constraints.append(f"max_length={field.field_info.max_length}")
            
            constraint_str = f" ({', '.join(constraints)})" if constraints else ""
            lines.append(f"  - {field_name}: {field_type}{constraint_str}")
        
        return "\n".join(lines)

# Usage (the model must support JSON mode, e.g. gpt-4-turbo)
generator = SyntheticDataGenerator(model='gpt-4-turbo', temperature=0.8)

customers = generator.generate(
    schema=CustomerSchema,
    count=100,
    context="B2B SaaS company targeting enterprise customers in tech industry",
    few_shot_examples=[
        {
            "customer_id": 1001,
            "name": "Acme Corp",
            "email": "contact@acme-corp.com",
            "age": 45,  # Company age
            "city": "San Francisco",
            "country": "USA",
            "signup_date": "2024-03-15",
            "lifetime_value": 15000.00
        }
    ]
)

print(f"✅ Generated {len(customers)} valid records")
```
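Because validation can discard records, a generator like the one above may return fewer records than requested. One way to handle this is to request top-ups until enough valid records accumulate. The sketch below assumes a hypothetical `generate_fn(n)` callable that returns already-validated records (for example, a thin wrapper around `generator.generate`); a stub replaces the LLM so the loop can be run offline.

```python
import itertools
from typing import Callable, List

def generate_until(
    generate_fn: Callable[[int], List[dict]],
    count: int,
    max_rounds: int = 5,
) -> List[dict]:
    """Call generate_fn repeatedly until `count` valid records are collected."""
    collected: List[dict] = []
    for _ in range(max_rounds):
        missing = count - len(collected)
        if missing <= 0:
            break
        collected.extend(generate_fn(missing))
    return collected[:count]

_ids = itertools.count(1)

def flaky_generator(n: int) -> List[dict]:
    """Stub: simulates validation losses by returning one record fewer than asked (min 1)."""
    return [{"customer_id": next(_ids)} for _ in range(max(n - 1, 1))]

records = generate_until(flaky_generator, count=5)
print(len(records))
```

The `max_rounds` cap bounds API spend if the model keeps producing invalid records.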

**Advantages vs Traditional Techniques:**

| Aspect | Faker | SDV (Copula) | LLMs (GPT-4) |
|--------|-------|--------------|--------------|
| **Semantic realism** | ❌ Low (random names) | ⚠️ Medium (preserves correlations) | ✅ High (business context) |
| **Cross-field coherence** | ❌ No (email does not match name) | ⚠️ Limited | ✅ Yes (email consistent with company) |
| **Flexibility** | ⚠️ Predefined providers | ❌ Requires training | ✅ Prompt engineering |
| **Edge cases** | ❌ Hard to specify | ❌ Not guaranteed | ✅ Explicit in the prompt |
| **Speed** | ✅ Very fast (~1M/min) | ⚠️ Medium (~1K/min) | ❌ Slow (~100/min) |
| **Cost** | ✅ Free | ✅ Free | ❌ Paid per token; about $0.03 per record with one GPT-4 call per record, less when batched |
| **Setup** | ✅ Trivial | ❌ Complex (fit model) | ✅ Simple (API key) |

**When to Use Each Technique:**

```python
# 1. FAKER: high volume, simple structure
from faker import Faker
fake = Faker()
[fake.name() for _ in range(10000)]  # Fast and cheap

# 2. SDV: preserve complex statistical distributions
# (SDV < 1.0 API; SDV 1.x exposes sdv.single_table.GaussianCopulaSynthesizer instead)
from sdv.tabular import GaussianCopula
model = GaussianCopula()
model.fit(real_data)  # real_data: your real DataFrame; learns correlations
synthetic = model.sample(1000)  # Replicates distributions

# 3. LLMs: semantic realism, contextual coherence
generator.generate(
    schema=ComplexBusinessSchema,  # placeholder for your own Pydantic model
    context="Healthcare provider with HIPAA compliance",
    count=100
)
```

**Hybrid Strategy (Best Practice):**

```python
import json

import pandas as pd
from faker import Faker
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_hybrid_dataset(count: int) -> pd.DataFrame:
    """
    Combine Faker's speed with LLM intelligence.
    
    Faker: ~95% of fields (structure, IDs, timestamps)
    LLM: ~5% of fields (free text, complex relationships)
    """
    # Step 1: base structure with Faker (fast)
    base_data = []
    fake = Faker()
    
    for i in range(count):
        base_data.append({
            'customer_id': i + 1000,
            'email': fake.email(),
            'phone': fake.phone_number(),
            'created_at': fake.date_time_between(start_date='-2y'),
            'city': fake.city(),
            'country': fake.country()
        })
    
    # Step 2: batch enrichment with the LLM (cost-effective)
    # Process in batches of 50 for efficiency
    batch_size = 50
    
    for i in range(0, len(base_data), batch_size):
        batch = base_data[i:i+batch_size]
        
        # A single LLM call for the whole batch
        prompt = f"""
        Enrich these {len(batch)} customer records with:
        - company_name (realistic, matching city/country)
        - industry (choose from: tech, finance, healthcare, retail, manufacturing)
        - company_size (employees: 10-50, 51-200, 201-1000, 1001+)
        
        Input records:
        {json.dumps(batch, indent=2, default=str)}
        
        Return JSON array with original fields + new fields.
        """
        
        response = client.chat.completions.create(
            model='gpt-3.5-turbo',  # Cheaper model for this task
            messages=[{'role': 'user', 'content': prompt}],
            temperature=0.7
        )
        
        # Assumes the reply is a bare JSON array in the same order as the input
        enriched_batch = json.loads(response.choices[0].message.content)
        
        # Merge results by position
        for j, enriched in enumerate(enriched_batch):
            base_data[i+j].update(enriched)
    
    return pd.DataFrame(base_data)

# Rough estimates: 10K records in ~2 min, cost ~$0.50
# vs 100% LLM: ~20 min, cost ~$5
# vs 100% Faker: ~10s, but without semantic coherence
```
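The positional merge in the hybrid function relies on the model returning every record, in order. A more defensive variant merges by key and copies only the newly requested fields. This is a sketch, not part of the original pipeline: `merge_enriched` and its parameters are hypothetical names, and the field list mirrors the three fields requested in the enrichment prompt.

```python
from typing import Dict, List

def merge_enriched(
    base: List[dict],
    enriched: List[dict],
    key: str = "customer_id",
    new_fields: tuple = ("company_name", "industry", "company_size"),
) -> int:
    """Merge LLM-added fields into base records by key; return how many were merged."""
    by_key: Dict[object, dict] = {rec[key]: rec for rec in base}
    merged = 0
    for item in enriched:
        target = by_key.get(item.get(key))
        if target is None:
            continue  # unknown or missing ID: skip rather than misalign
        # Copy only the new fields so Faker-generated values are not overwritten
        target.update({f: item[f] for f in new_fields if f in item})
        merged += 1
    return merged

base = [{"customer_id": 1000, "city": "Austin"}, {"customer_id": 1001, "city": "Lyon"}]
llm_output = [  # reordered, with one altered field and one unknown ID
    {"customer_id": 1001, "city": "Paris", "industry": "retail"},
    {"customer_id": 9999, "industry": "tech"},
]
merged_count = merge_enriched(base, llm_output)
print(merged_count, base)
```

Records the model dropped simply stay unenriched, and can be retried in a later batch.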

**Cost Optimization (Production):**

```python
# generate_one_record / generate_batch are placeholders for your own LLM wrappers
from more_itertools import chunked

# ❌ Expensive: 1 call per record
for i in range(1000):
    generate_one_record()  # $30 total

# ✅ Efficient: batches of 50 records per call
for batch in chunked(range(1000), 50):
    generate_batch(50)  # $1.50 total (20x cheaper)

# ✅ Most efficient: Batch API (50% discount, results within 24h)
from openai import OpenAI
client = OpenAI()

# batch_requests.jsonl holds one chat-completions request per line
with open('batch_requests.jsonl', 'rb') as f:
    batch_file = client.files.create(file=f, purpose='batch')

batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint='/v1/chat/completions',
    completion_window='24h'
)
# Same result: $0.75 (40x cheaper than individual calls)
```
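The arithmetic behind those three totals can be made explicit. The per-call costs below are back-derived from the totals quoted in the comments ($30 for 1,000 single-record calls, $1.50 for 20 batched calls); they are illustrative assumptions, not actual pricing, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(
    records: int,
    records_per_call: int,
    cost_per_call: float,
    batch_api_discount: float = 0.0,
) -> float:
    """Estimate total cost as (number of calls) x (cost per call) x (1 - discount)."""
    calls = -(-records // records_per_call)  # ceiling division
    return calls * cost_per_call * (1 - batch_api_discount)

individual = estimate_cost(1000, records_per_call=1, cost_per_call=0.03)
batched = estimate_cost(1000, records_per_call=50, cost_per_call=0.075)
batch_api = estimate_cost(1000, records_per_call=50, cost_per_call=0.075,
                          batch_api_discount=0.5)

print(f"individual=${individual:.2f} batched=${batched:.2f} batch_api=${batch_api:.2f}")
print(f"savings: {individual / batched:.0f}x and {individual / batch_api:.0f}x")
```

A batched call costs more than a single-record call because it carries more tokens, but far less than 50 separate calls, since the instructions and schema are sent once.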

**Quality Metrics:**

```python
import numpy as np

def evaluate_synthetic_quality(real_df, synthetic_df):
    """Compare synthetic data against real data"""
    
    # 1. Statistical distributions (KS test)
    from scipy.stats import ks_2samp
    
    for col in real_df.select_dtypes(include=[np.number]).columns:
        statistic, pvalue = ks_2samp(real_df[col], synthetic_df[col])
        print(f"{col}: KS statistic={statistic:.3f}, p-value={pvalue:.3f}")
        # p-value > 0.05 → no significant difference detected between distributions ✅
    
    # 2. Preserved correlations (numeric columns only)
    real_corr = real_df.corr(numeric_only=True)
    synth_corr = synthetic_df.corr(numeric_only=True)
    corr_diff = np.abs(real_corr - synth_corr).mean().mean()
    print(f"Correlation MAE: {corr_diff:.3f}")  # Lower is better
    
    # 3. Privacy: no memorization
    # Synthetic data must NOT contain records identical to real ones
    merged = real_df.merge(synthetic_df, how='inner')
    print(f"Identical records: {len(merged)} (should be 0)")
    
    # 4. Utility: ML performance (assumes numeric features and a binary 'target' column)
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    
    # Train on synthetic, test on real
    model = RandomForestClassifier()
    model.fit(synthetic_df.drop('target', axis=1), synthetic_df['target'])
    pred = model.predict(real_df.drop('target', axis=1))
    f1 = f1_score(real_df['target'], pred)
    print(f"F1 score (synth→real): {f1:.3f}")  # Closer to a real→real baseline is better
```
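To make the first metric less of a black box: the KS statistic is the largest vertical gap between the two empirical CDFs. The sketch below computes it with the standard library only. It is for illustration (no p-value) and does not replace `scipy.stats.ks_2samp`; `ks_statistic` is a hypothetical helper name.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Max vertical distance between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The gap can only change at observed values, so checking those is enough
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d

real_ages = [23, 35, 41, 52, 60]
same = ks_statistic(real_ages, real_ages)                 # identical samples
disjoint = ks_statistic(real_ages, [70, 75, 80, 85, 90])  # non-overlapping samples
partial = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])        # half overlapping
print(same, disjoint, partial)
```

A statistic near 0 means the synthetic column tracks the real one closely; a value near 1 means the two barely overlap.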

---
**Author:** Luis J. Raigoso V. (LJRV)

## 1. Simple record generation

```python
import os
import json
from openai import OpenAI

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

def generate_synthetic_records(schema: dict, count: int) -> list:
    """Generate synthetic records according to a schema."""
    prompt = f'''
Generate {count} realistic data records in JSON format according to this schema:

{json.dumps(schema, indent=2)}

Return a JSON array with {count} objects.
'''
    
    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role':'user','content':prompt}],
        temperature=0.8
    )
    
    # Assumes the reply is bare JSON with no surrounding text
    return json.loads(resp.choices[0].message.content)

# Customer schema
customer_schema = {
    'customer_id': 'int',
    'name': 'string (realistic full name)',
    'email': 'string (valid email)',
    'age': 'int (18-75)',
    'city': 'string (US city)',
    'signup_date': 'date (YYYY-MM-DD, last 2 years)'
}

customers = generate_synthetic_records(customer_schema, 5)
print(json.dumps(customers, indent=2))
```
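Without JSON mode, chat models sometimes wrap their reply in a Markdown code fence, which makes a direct `json.loads` call fail. A small tolerant parser helps here. This is a sketch with a hypothetical helper name (`parse_llm_json`); it handles a surrounding fence only, not arbitrary prose around the JSON.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker

def parse_llm_json(text: str):
    """Parse JSON from a model reply, tolerating a surrounding Markdown fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (with optional language tag)...
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # ...and everything from the closing fence onward
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    return json.loads(cleaned)

fenced_reply = FENCE + 'json\n[{"customer_id": 1, "age": 30}]\n' + FENCE
plain_reply = '[{"customer_id": 2, "age": 41}]'

print(parse_llm_json(fenced_reply))
print(parse_llm_json(plain_reply))
```

If the text still is not valid JSON, `json.loads` raises `json.JSONDecodeError`, which the caller can catch to trigger a retry.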

### 🔄 **Data Augmentation: From Few Data Points to Many**

**The Small-Data Problem in ML:**

```
Common Scenario                       Impact on Models
───────────────                       ──────────────────
Startup with 100 customers            → Severe overfitting
New product category (10 sales)       → Cannot train a model
Rare cases (fraud 0.1%)               → Class imbalance
System migration (history lost)       → Cold start
```

**Augmentation Techniques by Data Type:**

```
┌────────────────────────────────────────────────────────────────┐
│              DATA AUGMENTATION STRATEGIES                      │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│  📊 TABULAR DATA                                               │
│     Traditional:                                                │
│       • SMOTE (Synthetic Minority Over-sampling)               │
│       • ADASYN (Adaptive Synthetic Sampling)                   │
│       • Random Over-sampling with noise                        │
│     LLM-based:                                                  │
│       ✅ Context-aware variations                              │
│       ✅ Relationship preservation                             │
│       ✅ Domain knowledge injection                            │
│                                                                 │
│  📝 TEXT DATA                                                  │
│     Traditional:                                                │
│       • Synonym replacement (WordNet)                          │
│       • Back-translation (EN→ES→EN)                            │
│       • Random insertion/deletion                              │
│     LLM-based:                                                  │
│       ✅ Paraphrase generation (maintains meaning)             │
│       ✅ Style transfer (formal ↔ casual)                      │
│       ✅ Entity substitution (coherent)                        │
│                                                                 │
│  🖼️ IMAGE DATA (less common in data engineering)              │
│     • Rotation, flip, crop                                     │
│     • Color jittering                                          │
│     • Mixup, CutMix                                            │
│                                                                 │
│  ⏱️ TIME SERIES                                                │
│     Traditional:                                                │
│       • Jittering (add noise)                                  │
│       • Scaling, magnitude warping                             │
│       • Time warping                                           │
│     LLM-based:                                                  │
│       ✅ Seasonal pattern generation                           │
│       ✅ Anomaly injection (realistic)                         │
│       ✅ Multi-variate correlation                             │
└────────────────────────────────────────────────────────────────┘
```
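As a baseline for the LLM-based strategies, the "random over-sampling with noise" entry from the tabular section can be sketched in a few lines of standard-library Python. The function name, the ±5% multiplicative jitter, and the sample rows are illustrative assumptions.

```python
import random

def oversample_with_noise(rows, n_new, numeric_fields, noise_pct=0.05, seed=0):
    """Create n_new rows by copying random originals and jittering numeric fields."""
    rng = random.Random(seed)  # seeded for reproducibility
    new_rows = []
    for _ in range(n_new):
        row = dict(rng.choice(rows))  # shallow copy of a random original
        for field in numeric_fields:
            # Multiplicative jitter of at most +/- noise_pct
            row[field] = row[field] * (1 + rng.uniform(-noise_pct, noise_pct))
        new_rows.append(row)
    return new_rows

fraud = [
    {"amount": 120.0, "merchant": "Amazon"},
    {"amount": 900.0, "merchant": "Target"},
]
synthetic = oversample_with_noise(fraud, n_new=4, numeric_fields=["amount"])
for row in synthetic:
    print(row)
```

Categorical fields are copied unchanged, which is exactly the limitation the LLM-based approaches aim to lift: noise cannot invent a plausible new merchant.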

**Tabular Data Augmentation with LLMs:**

```python
from typing import List, Dict
import pandas as pd
from openai import OpenAI
import json

class LLMDataAugmenter:
    """Augment tabular datasets with LLMs while keeping records coherent"""
    
    # JSON mode (used below) needs a model that supports it, e.g. gpt-4-turbo
    def __init__(self, model: str = 'gpt-4-turbo'):
        self.client = OpenAI()
        self.model = model
    
    def augment_records(
        self,
        original_records: List[Dict],
        target_count: int,
        augmentation_strategy: str = 'variation'
    ) -> List[Dict]:
        """
        Generate synthetic records based on the originals.
        
        Strategies:
            - 'variation': subtle variations (same pattern)
            - 'interpolation': interpolation between records
            - 'extrapolation': extrapolation (more extreme cases)
        """
        
        if augmentation_strategy == 'variation':
            return self._augment_by_variation(original_records, target_count)
        elif augmentation_strategy == 'interpolation':
            return self._augment_by_interpolation(original_records, target_count)
        elif augmentation_strategy == 'extrapolation':
            return self._augment_by_extrapolation(original_records, target_count)
        raise ValueError(f"Unknown augmentation_strategy: {augmentation_strategy!r}")
    
    def _augment_by_variation(self, records: List[Dict], count: int) -> List[Dict]:
        """Generate realistic variations of existing records"""
        
        prompt = f"""
        You have {len(records)} original data records. Generate {count} new synthetic records
        that are VARIATIONS of the originals.
        
        Requirements:
        - Maintain the same schema/fields
        - Preserve statistical distributions (ranges, types)
        - Ensure diversity (don't just copy)
        - Keep business logic consistent (e.g., if price depends on category)
        - Introduce realistic variations (similar but not identical)
        
        Original records:
        {json.dumps(records, indent=2)}
        
        Return a JSON object of the form {{"records": [...]}} containing {count} new records.
        """
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "You are a data augmentation expert."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.8,  # High, for diversity
            response_format={"type": "json_object"}
        )
        
        result = json.loads(response.choices[0].message.content)
        return result.get('records', result.get('data', []))
    
    def _augment_by_interpolation(self, records: List[Dict], count: int) -> List[Dict]:
        """Create records that lie 'between' two existing ones (needs at least 2 records)"""
        
        # Pick pairs of records to interpolate; pairs may repeat when count exceeds the number of pairs
        import random
        pairs = [(records[i], records[j]) for i in range(len(records)) 
                 for j in range(i+1, len(records))]
        selected_pairs = random.choices(pairs, k=count)
        
        augmented = []
        for record1, record2 in selected_pairs:
            prompt = f"""
            Create a synthetic record that is an "interpolation" between these two:
            
            Record A: {json.dumps(record1)}
            Record B: {json.dumps(record2)}
            
            Generate a record that represents a middle point:
            - Numeric fields: average or intermediate value
            - Categorical fields: pick one or create logical intermediate
            - Text fields: combine or blend characteristics
            
            Return single JSON object.
            """
            
            response = self.client.chat.completions.create(
                model='gpt-3.5-turbo',  # Cheaper model for a simple task
                messages=[{'role': 'user', 'content': prompt}],
                temperature=0.5,
                response_format={"type": "json_object"}  # the prompt asks for a single JSON object
            )
            
            augmented.append(json.loads(response.choices[0].message.content))
        
        return augmented
    
    def _augment_by_extrapolation(self, records: List[Dict], count: int) -> List[Dict]:
        """Generate more extreme or rare cases based on observed patterns"""
        
        prompt = f"""
        Based on these {len(records)} examples, generate {count} EXTRAPOLATED records.
        
        Extrapolation means:
        - Extend beyond the observed range (but stay realistic)
        - Create edge cases (high values, low values, unusual combinations)
        - Maintain domain validity (no impossible values)
        
        Original records:
        {json.dumps(records, indent=2)}
        
        Return a JSON object of the form {{"records": [...]}} with {count} extrapolated records.
        """
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{'role': 'user', 'content': prompt}],
            temperature=0.7,
            response_format={"type": "json_object"}  # ensures a dict, so .get() below is safe
        )
        
        result = json.loads(response.choices[0].message.content)
        return result.get('records', result.get('data', []))

# Example: augment a small dataset
original_customers = [
    {"name": "Tech Startup Inc", "industry": "SaaS", "employees": 25, "revenue": 500000},
    {"name": "Finance Corp", "industry": "Banking", "employees": 150, "revenue": 5000000},
    {"name": "Retail Shop", "industry": "E-commerce", "employees": 10, "revenue": 200000}
]

augmenter = LLMDataAugmenter(model='gpt-4-turbo')  # must support JSON mode

# Strategy 1: variations (most common)
variations = augmenter.augment_records(
    original_customers, 
    target_count=10,
    augmentation_strategy='variation'
)
print(f"✅ Generated {len(variations)} variations")

# Strategy 2: interpolations (create intermediate records)
interpolations = augmenter.augment_records(
    original_customers,
    target_count=5,
    augmentation_strategy='interpolation'
)
print(f"✅ Generated {len(interpolations)} interpolations")

# Strategy 3: extrapolations (extreme cases)
extrapolations = augmenter.augment_records(
    original_customers,
    target_count=5,
    augmentation_strategy='extrapolation'
)
print(f"✅ Generated {len(extrapolations)} extrapolations")

# Combine the originals with all augmented records
df_original = pd.DataFrame(original_customers)
df_augmented = pd.DataFrame(original_customers + variations + interpolations + extrapolations)

print(f"\n📊 Dataset growth: {len(df_original)} → {len(df_augmented)} records")
```
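Before combining, it may help to screen what the model returned: drop records with missing or extra fields, wrong value types, or exact copies of the originals. A minimal sketch, assuming flat records with scalar values; `filter_augmented` is a hypothetical helper, and its type check is deliberately strict (a float where the originals use int is rejected).

```python
def filter_augmented(originals, candidates):
    """Keep candidates that share the originals' fields/types and are not copies."""
    reference = originals[0]
    seen = {tuple(sorted(rec.items())) for rec in originals}
    kept = []
    for rec in candidates:
        same_fields = rec.keys() == reference.keys()
        same_types = same_fields and all(
            isinstance(rec[k], type(reference[k])) for k in reference
        )
        if not same_types:
            continue
        fingerprint = tuple(sorted(rec.items()))
        if fingerprint not in seen:  # neither an original nor an earlier candidate
            seen.add(fingerprint)
            kept.append(rec)
    return kept

originals = [{"name": "Tech Startup Inc", "employees": 25}]
candidates = [
    {"name": "Tech Startup Inc", "employees": 25},   # exact copy of an original
    {"name": "Cloud Nine Labs", "employees": 40},    # valid
    {"name": "Bad Row", "employees": "many"},        # wrong type
    {"name": "No Count"},                            # missing field
    {"name": "Cloud Nine Labs", "employees": 40},    # duplicate of a kept record
]
kept = filter_augmented(originals, candidates)
print(kept)
```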

**Text Augmentation (Paraphrasing):**

```python
def augment_text_data(texts: List[str], multiplier: int = 3) -> List[str]:
    """
    Augment a text dataset with paraphrases.
    Useful for NLP tasks (sentiment analysis, classification).
    Assumes a module-level `client = OpenAI()`.
    """
    augmented = []
    
    for text in texts:
        prompt = f"""
        Generate {multiplier} paraphrases of this text.
        
        Requirements:
        - Maintain the same meaning
        - Use different wording/structure
        - Keep the same sentiment/tone
        - Realistic variations (not just synonym swap)
        
        Original text: "{text}"
        
        Return JSON: {{"paraphrases": ["version1", "version2", ...]}}
        """
        
        response = client.chat.completions.create(
            model='gpt-3.5-turbo',
            messages=[{'role': 'user', 'content': prompt}],
            temperature=0.8
        )
        
        result = json.loads(response.choices[0].message.content)
        augmented.extend(result['paraphrases'])
    
    return augmented

# Example: product reviews
original_reviews = [
    "This product is amazing! Highly recommend.",
    "Terrible quality, waste of money."
]

augmented_reviews = augment_text_data(original_reviews, multiplier=5)
# Original: 2 reviews
# Augmented: 10 reviews (5 variations each)
```
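Paraphrases that come back nearly identical to the source text add little training signal. One option is to filter them by string similarity before use. The sketch below uses `difflib.SequenceMatcher` from the standard library; the 0.9 threshold and the helper name `filter_paraphrases` are assumptions to tune per dataset.

```python
from difflib import SequenceMatcher

def filter_paraphrases(original, paraphrases, max_similarity=0.9):
    """Drop paraphrases too similar to the original or to one already kept."""
    kept = []
    for text in paraphrases:
        too_close = any(
            SequenceMatcher(None, text.lower(), other.lower()).ratio() > max_similarity
            for other in [original] + kept
        )
        if not too_close:
            kept.append(text)
    return kept

original = "This product is amazing! Highly recommend."
paraphrases = [
    "This product is amazing! Highly recommend.",        # verbatim copy
    "this product is amazing! highly recommend.",        # differs only in case
    "I love this item and would suggest it to anyone.",  # genuine rewording
]
kept = filter_paraphrases(original, paraphrases)
print(kept)
```

Character-level similarity says nothing about meaning, so this catches copies but not paraphrases that drift semantically.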

**Time Series Augmentation:**

```python
def augment_timeseries(
    series: pd.DataFrame,
    num_synthetic_series: int,
    preserve_pattern: str
) -> List[pd.DataFrame]:
    """
    Generate synthetic time series with similar patterns.
    
    Args:
        series: DataFrame with columns [date, value]
        num_synthetic_series: How many series to generate
        preserve_pattern: 'trend', 'seasonality', 'both'
    """
    
    # Statistical summary of the original series
    stats = {
        'mean': series['value'].mean(),
        'std': series['value'].std(),
        'min': series['value'].min(),
        'max': series['value'].max(),
        'trend': 'increasing' if series['value'].iloc[-1] > series['value'].iloc[0] else 'decreasing'
    }
    
    prompt = f"""
    Generate {num_synthetic_series} synthetic time series similar to this one.
    
    Original series statistics:
    - Mean: {stats['mean']:.2f}
    - Std Dev: {stats['std']:.2f}
    - Range: [{stats['min']:.2f}, {stats['max']:.2f}]
    - Trend: {stats['trend']}
    - Length: {len(series)} data points
    
    Preserve pattern: {preserve_pattern}
    
    Sample of original series:
    {series.head(10).to_dict('records')}
    ...
    {series.tail(10).to_dict('records')}
    
    Generate {num_synthetic_series} series with similar characteristics.
    Each series should have {len(series)} points.
    
    Return JSON: {{"series": [[values1], [values2], ...]}}
    """
    
    response = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0.7
    )
    
    result = json.loads(response.choices[0].message.content)
    
    # Convert to DataFrames, reusing the original start date (daily frequency assumed)
    synthetic_series = []
    for values in result['series']:
        df = pd.DataFrame({
            'date': pd.date_range(start=series['date'].iloc[0], periods=len(values), freq='D'),
            'value': values
        })
        synthetic_series.append(df)
    
    return synthetic_series

# Example: daily sales (only 30 real days); requires `import numpy as np`
real_sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=30, freq='D'),
    'value': [1000 + 50*i + np.random.randn()*100 for i in range(30)]
})

# Generate 10 similar synthetic series
synthetic_sales = augment_timeseries(real_sales, num_synthetic_series=10, preserve_pattern='both')

# We now have 11 series (1 real + 10 synthetic) to train a forecasting model
```
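A model asked for N points per series does not always return exactly N, and values can drift far outside the observed range. A lightweight check on each returned series catches both before training. This is a sketch: `check_series` is a hypothetical helper, and the 25% tolerance band around the original range is an arbitrary assumption.

```python
def check_series(values, expected_len, lo, hi, tolerance=0.25):
    """Return a list of problems found in one LLM-generated series."""
    problems = []
    if len(values) != expected_len:
        problems.append(f"length {len(values)} != {expected_len}")
    if not all(isinstance(v, (int, float)) for v in values):
        problems.append("non-numeric value")
        return problems  # range check is meaningless for non-numeric data
    span = hi - lo
    if values and (min(values) < lo - tolerance * span
                   or max(values) > hi + tolerance * span):
        problems.append("values outside tolerated range")
    return problems

ok = check_series([1000, 1100, 1200], expected_len=3, lo=1000, hi=2450)
bad = check_series([1000, 9000], expected_len=3, lo=1000, hi=2450)
print(ok)
print(bad)
```

Series with a non-empty problem list can be dropped or regenerated rather than silently fed to the forecasting model.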

**Class Balancing (Minority Class Augmentation):**

```python
def balance_classes_with_llm(df: pd.DataFrame, target_col: str, minority_class: str):
    """
    Augment the minority class to balance the dataset.
    
    Useful for:
    - Fraud detection (fraud is <1% of transactions)
    - Churn prediction (churn rate ~5%)
    - Anomaly detection (rare anomalies)
    """
    
    # Compute the imbalance
    class_counts = df[target_col].value_counts()
    majority_count = class_counts.max()
    minority_count = class_counts[minority_class]
    
    needed = majority_count - minority_count
    
    print(f"📊 Class distribution:")
    print(class_counts)
    print(f"\n🎯 Need to generate {needed} {minority_class} samples")
    
    # Take up to 10 minority-class examples for the prompt
    minority_samples = df[df[target_col] == minority_class].to_dict('records')
    examples_json = json.dumps(minority_samples[:10], indent=2, default=str)
    
    # Generate synthetic records (for a large `needed`, split into several smaller calls)
    prompt = f"""
    Generate {needed} synthetic records for the minority class: {minority_class}
    
    These are examples of {minority_class} records:
    {examples_json}
    
    Generate {needed} new similar records that:
    - Maintain the characteristics of {minority_class} class
    - Are diverse (not just copies)
    - Are realistic and coherent
    
    Return JSON array.
    """
    
    response = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0.8
    )
    
    synthetic_records = json.loads(response.choices[0].message.content)
    
    # Make sure every record carries the correct target
    for record in synthetic_records:
        record[target_col] = minority_class
    
    # Combine with the original dataset
    df_synthetic = pd.DataFrame(synthetic_records)
    df_balanced = pd.concat([df, df_synthetic], ignore_index=True)
    
    print(f"\n✅ Balanced dataset:")
    print(df_balanced[target_col].value_counts())
    
    return df_balanced

# Example: transactions dataset (99% legitimate, 1% fraud)
transactions = pd.DataFrame({
    'amount': np.random.lognormal(5, 1, 1000),
    'merchant': np.random.choice(['Amazon', 'Walmart', 'Target'], 1000),
    'is_fraud': ['No'] * 990 + ['Yes'] * 10  # Imbalanced!
})

# Balance by generating more fraud cases
balanced_transactions = balance_classes_with_llm(transactions, 'is_fraud', 'Yes')
# Now: 990 legitimate vs 990 fraud (10 real + 980 synthetic) → 50/50 balance
```
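
The function above hands the model's reply straight to `json.loads`, which fails whenever the reply is wrapped in a Markdown fence or surrounded by prose. A minimal sketch of a more tolerant parser follows; `extract_json_array` and `FENCE` are hypothetical names introduced here, not part of any library.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep this block fence-safe

def extract_json_array(text: str) -> list:
    """Hypothetical helper: pull a JSON array out of raw LLM output."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (with optional language tag) and the closing fence
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rstrip()
        if cleaned.endswith(FENCE):
            cleaned = cleaned[: -len(FENCE)]
    # Keep only the outermost [...] span, ignoring any surrounding prose
    start, end = cleaned.find("["), cleaned.rfind("]")
    if start == -1 or end < start:
        raise ValueError("No JSON array found in model output")
    return json.loads(cleaned[start:end + 1])
```

It could replace the direct `json.loads(response.choices[0].message.content)` call. Malformed JSON still raises, since `json.JSONDecodeError` is a subclass of `ValueError`.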

**Validating Augmented Data:**

```python
def validate_augmentation_quality(original: pd.DataFrame, augmented: pd.DataFrame):
    """
    Check that augmented data is valid and useful.
    
    Checks:
    1. Schema consistency
    2. Statistical similarity
    3. No memorization (no exact copies of original rows)
    4. Diversity (records are not all alike)
    5. ML utility (model performance improves)
    """
    
    print("🔍 AUGMENTATION QUALITY CHECKS\n")
    
    # 1. Schema consistency
    assert list(original.columns) == list(augmented.columns), "Schema mismatch!"
    print("✅ Schema consistent")
    
    # 2. Statistical similarity (numeric columns)
    for col in original.select_dtypes(include=[np.number]).columns:
        orig_mean = original[col].mean()
        aug_mean = augmented[col].mean()
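        # Guard against a zero mean, which would otherwise divide by zero
        diff_pct = abs(orig_mean - aug_mean) / abs(orig_mean) * 100 if orig_mean != 0 else float('nan')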
        
        print(f"📊 {col}:")
        print(f"   Original mean: {orig_mean:.2f}")
        print(f"   Augmented mean: {aug_mean:.2f}")
        print(f"   Difference: {diff_pct:.1f}%")
        
        if diff_pct > 20:
            print(f"   ⚠️  WARNING: Significant distribution shift")
    
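    # 3. No memorization: count augmented rows that exactly copy an original row
    #    (duplicates already present inside `original` are not counted)
    copies = augmented.merge(original.drop_duplicates(), how='inner', on=list(original.columns))
    duplicates = len(copies)
    print(f"\n🔒 Duplicate check: {duplicates} augmented rows copy an original row")
    assert duplicates == 0, "Augmented data contains exact copies!"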
    
    # 4. Diversity
    unique_ratio = augmented.drop_duplicates().shape[0] / augmented.shape[0]
    print(f"🎨 Diversity: {unique_ratio*100:.1f}% unique records")
    assert unique_ratio > 0.9, "Low diversity in augmented data"
    
    # 5. ML utility (only when a 'target' column exists; features must be numeric)
    if 'target' in original.columns:
        from sklearn.model_selection import train_test_split
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import f1_score
        
        # Baseline: train and test on original data (fixed seed for a repeatable split)
        X_train, X_test, y_train, y_test = train_test_split(
            original.drop('target', axis=1), original['target'],
            test_size=0.2, random_state=42
        )
        
        model_baseline = RandomForestClassifier(random_state=42)
        model_baseline.fit(X_train, y_train)
        f1_baseline = f1_score(y_test, model_baseline.predict(X_test), average='macro')
        
        # With augmentation: train on original + augmented, test on held-out original rows
        X_aug = pd.concat([X_train, augmented.drop('target', axis=1)])
        y_aug = pd.concat([y_train, augmented['target']])
        
        model_aug = RandomForestClassifier(random_state=42)
        model_aug.fit(X_aug, y_aug)
        f1_aug = f1_score(y_test, model_aug.predict(X_test), average='macro')
        
        print(f"\n🤖 ML Performance:")
        print(f"   Baseline F1: {f1_baseline:.3f}")
        print(f"   With augmentation F1: {f1_aug:.3f}")
        print(f"   Improvement: {(f1_aug - f1_baseline)*100:+.1f} points")
        
        if f1_aug > f1_baseline:
            print("   ✅ Augmentation improves model!")
        else:
            print("   ⚠️  Augmentation does not help (or hurts)")
    
    print("\n" + "="*50)
```
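
Check 2 above compares only means, so two very different distributions with the same mean would pass. As a sketch of a stronger test, the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) can be computed with the standard library alone. `ks_statistic` is a name introduced here for illustration; `scipy.stats.ks_2samp` provides the same statistic together with a p-value.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b) -> float:
    """Two-sample KS statistic: max distance between the empirical CDFs (0 = identical)."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("Both samples must be non-empty")
    largest_gap = 0.0
    # The CDFs only change at observed values, so checking those points is enough
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap
```

A possible use is `ks_statistic(original[col], augmented[col])` per numeric column, with a warning threshold chosen for the dataset at hand.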

**Best Practices:**

```python
# ✅ DO:
# 1. Start small, validate, then scale
#    (`augment` is a placeholder for whichever augmentation function you use)
augmented_sample = augment(original[:10], count=20)
validate_augmentation_quality(original[:10], augmented_sample)
# If validation passes, scale up to the full dataset

# 2. Preserve distributions
# Compare histograms before and after augmentation
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
original['price'].hist(ax=axes[0], bins=30)
augmented['price'].hist(ax=axes[1], bins=30)
axes[0].set_title('Original')
axes[1].set_title('Augmented')

# 3. Use temperature wisely
# High temperature (0.8-1.0) → more diversity (good for augmentation)
# Low temperature (0.3-0.5) → more conservative (good for subtle variations)

# ❌ DON'T:
# 1. Don't augment the test set (train only!)
X_train_aug = augment(X_train)  # ✅
X_test_aug = augment(X_test)    # ❌ Contamination!

# 2. Don't augment without validating
# Always verify that augmentation improves (and does not hurt) the model

# 3. Don't over-augment
# Rule of thumb: keep original:augmented at or below roughly 1:5
# Beyond that, the model may start learning "synthetic patterns"
```
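
The two rules "augment only the training set" and "do not over-augment" can be enforced in one place. The sketch below splits first, augments only the training rows, and caps the synthetic volume. `split_then_augment` and `jitter_augment` are hypothetical names, and `jitter_augment` is a trivial stand-in for an LLM-based augmenter.

```python
import random

def split_then_augment(records, augment_fn, test_fraction=0.2, max_ratio=5, seed=42):
    """Split first, then augment only the training portion (illustrative sketch)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    synthetic = augment_fn(train)
    # Cap synthetic rows at max_ratio times the number of real training rows
    synthetic = synthetic[: max_ratio * len(train)]
    return train + synthetic, test

def jitter_augment(rows):
    """Stand-in augmenter: 10 slightly perturbed copies per row, flagged as synthetic."""
    return [
        {**row, "price": round(row["price"] * 1.05, 2), "synthetic": True}
        for row in rows
        for _ in range(10)
    ]
```

The test rows never reach `augment_fn`, so no synthetic record can be derived from them.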

---
**Autor:** Luis J. Raigoso V. (LJRV)

## 2. Generation with Business Context

```python
def generate_sales_data(num_records: int, context: str) -> list:
    """Generate sales data grounded in a business context."""
    prompt = f'''
Business context: {context}

Generate {num_records} realistic sales transactions with:
- transaction_id (UUID)
- date (within the last 30 days)
- product (name consistent with the context)
- quantity (int)
- unit_price (float)
- total (quantity * unit_price)
- payment_method (cash, card, online)

JSON array:
'''
    
    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role':'user','content':prompt}],
        temperature=0.7
    )
    
    return json.loads(resp.choices[0].message.content)

sales = generate_sales_data(
    num_records=3,
    context='Electronics store specializing in laptops and accessories'
)

for sale in sales:
    print(sale)
```
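
The prompt above states rules (UUID ids, dates in the last 30 days, `total = quantity * unit_price`, three payment methods) that the model is not guaranteed to follow. A minimal sketch of a post-generation check is shown below; `validate_sale` and `ALLOWED_PAYMENT_METHODS` are names assumed here for illustration.

```python
import uuid
from datetime import date, timedelta
from typing import List, Optional

ALLOWED_PAYMENT_METHODS = {"cash", "card", "online"}

def validate_sale(record: dict, today: Optional[date] = None) -> List[str]:
    """Return the rule violations for one generated sale (empty list = valid)."""
    today = today or date.today()
    errors = []
    try:
        uuid.UUID(str(record.get("transaction_id")))
    except ValueError:
        errors.append("transaction_id is not a UUID")
    try:
        sale_date = date.fromisoformat(str(record.get("date")))
        if not (today - timedelta(days=30) <= sale_date <= today):
            errors.append("date outside the last 30 days")
    except ValueError:
        errors.append("date is not ISO formatted")
    qty, price, total = record.get("quantity"), record.get("unit_price"), record.get("total")
    if not isinstance(qty, int) or qty <= 0:
        errors.append("quantity must be a positive int")
    elif not isinstance(price, (int, float)) or not isinstance(total, (int, float)):
        errors.append("unit_price and total must be numeric")
    elif abs(qty * price - total) > 0.01:
        errors.append("total != quantity * unit_price")
    if record.get("payment_method") not in ALLOWED_PAYMENT_METHODS:
        errors.append("unknown payment_method")
    return errors
```

Records with a non-empty error list can be dropped or sent back to the model for regeneration.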

## 3. Edge Case Generation

```python
def generate_edge_cases(normal_schema: dict) -> list:
    """Generate extreme cases for testing."""
    prompt = f'''
Generate 10 edge/extreme cases for testing based on this schema:

{json.dumps(normal_schema, indent=2)}

Include cases such as:
- Null values
- Empty strings
- Negative/zero numbers
- Extreme future/past dates
- Special characters
- Very long values

JSON array with an additional 'edge_case_type' field:
'''
    
    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role':'user','content':prompt}],
        temperature=0.5
    )
    
    return json.loads(resp.choices[0].message.content)

edge_cases = generate_edge_cases(customer_schema)
print(f'Generated {len(edge_cases)} edge cases:\n')
for case in edge_cases[:3]:
    print(f"Type: {case.get('edge_case_type')}")
    print(case)
    print()
```
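
Generated edge cases become useful once they are run against the code under test. The sketch below feeds each case to a validator and tallies the outcome per `edge_case_type`. `summarize_edge_case_results` and `validate_customer` are hypothetical, and the validator stands in for whatever function is being tested.

```python
from collections import defaultdict

def summarize_edge_case_results(edge_cases, validator):
    """Group edge cases by 'edge_case_type' and count accepted vs. rejected inputs."""
    summary = defaultdict(lambda: {"accepted": 0, "rejected": 0})
    for case in edge_cases:
        case_type = case.get("edge_case_type", "unknown")
        # Strip the label so the validator only sees the actual payload
        payload = {k: v for k, v in case.items() if k != "edge_case_type"}
        try:
            validator(payload)
            summary[case_type]["accepted"] += 1
        except (ValueError, TypeError):
            summary[case_type]["rejected"] += 1
    return dict(summary)

def validate_customer(customer: dict) -> None:
    """Stand-in for the code under test (assumed rules, for illustration only)."""
    if not customer.get("name"):
        raise ValueError("name is required")
    if not isinstance(customer.get("age"), int) or customer["age"] < 0:
        raise ValueError("age must be a non-negative int")
```

An edge-case type that is always accepted may point to a missing validation rule.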

## 4. Smart Anonymization

### 🔒 **Smart Anonymization: Privacy-Preserving Synthetic Data**

**Why Anonymize with LLMs?**

```
Traditional Methods               Problems
───────────────────               ────────
🔹 Masking (XXX-XX-1234)          → Loses utility, obvious patterns
🔹 Hashing (SHA256)               → Irreversible but not realistic
🔹 Randomization                  → Breaks correlations
🔹 K-anonymity                    → Hard with high-dimensional data
🔹 Differential Privacy           → Requires mathematical expertise

LLMs                              Advantages
────                              ──────────
✅ Context-aware replacement      → Keeps semantic realism
✅ Relationship preservation      → Cross-field coherence
✅ Flexible privacy levels        → Utility/privacy trade-off
✅ Explainable transformations    → Auditable
```

**Anonymization Levels:**

```
┌───────────────────────────────────────────────────────────────┐
│                  ANONYMIZATION SPECTRUM                        │
├───────────────────────────────────────────────────────────────┤
│                                                                │
│  Level 1: REDACTION (Lowest Utility)                          │
│     Original: "John Smith, john@acme.com, 555-1234"          │
│     Result:   "████ █████, ████@████.com, ███-████"         │
│     Use case: Public display, low risk                        │
│                                                                │
│  Level 2: DETERMINISTIC MASKING                               │
│     Original: "John Smith, john@acme.com, 555-1234"          │
│     Result:   "User_12345, user12345@example.com, ***-****"  │
│     Use case: Internal analytics, need consistency            │
│                                                                │
│  Level 3: PSEUDONYMIZATION (Reversible)                       │
│     Original: "John Smith, john@acme.com, 555-1234"          │
│     Result:   "Jane Doe, jane@techcorp.com, 555-5678"        │
│              + Mapping table (encrypted)                      │
│     Use case: Development, testing, need to reverse           │
│                                                                │
│  Level 4: SYNTHETIC REPLACEMENT (Highest Utility)             │
│     Original: "John Smith, 35, NYC, Software Engineer, $120K" │
│     Result:   "Emma Wilson, 33, Boston, Data Scientist, $115K"│
│     Properties: Similar distribution, no reverse mapping      │
│     Use case: ML training, external sharing, research         │
└───────────────────────────────────────────────────────────────┘
```
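
Levels 1 and 2 need no LLM at all. The sketch below shows one way to implement them with the standard library: redaction by character replacement, and deterministic masking with a keyed hash (HMAC) so the same input always yields the same token. The function names and the `User_` token format are assumptions made for this example.

```python
import hashlib
import hmac
import re

def redact(value: str) -> str:
    """Level 1: replace every letter and digit with a block character."""
    return re.sub(r"[^\W_]", "█", value)

def mask_deterministic(value: str, secret_key: bytes, prefix: str = "User") -> str:
    """Level 2: same input + same key → same token, without exposing the input."""
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:10]}"
```

Using a secret key rather than a plain hash matters for low-entropy values such as phone numbers, which could otherwise be recovered by hashing every candidate.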

**LLM-Powered Anonymizer (Production-Ready):**

```python
from typing import Dict, List, Optional, Literal
from pydantic import BaseModel
from openai import OpenAI
import hashlib
import json

class AnonymizationConfig(BaseModel):
    """Per-field anonymization settings"""
    field_name: str
    pii_type: Literal['name', 'email', 'phone', 'ssn', 'address', 'date_of_birth', 'free_text']
    anonymization_level: Literal['redact', 'mask', 'pseudonymize', 'synthetic']
    preserve_format: bool = True  # Keep the format (e.g. an email still looks like user@domain.com)
    preserve_domain: bool = False  # Emails only: keep the original domain (@acme.com)
    preserve_distribution: bool = True  # Numeric fields: keep the statistical distribution

class SmartAnonymizer:
    """LLM-based smart anonymizer"""
    
    def __init__(self, model: str = 'gpt-4'):
        self.client = OpenAI()
        self.model = model
        self.pseudonym_cache = {}  # For consistency (same input → same output)
    
    def anonymize_dataset(
        self,
        data: List[Dict],
        config: List[AnonymizationConfig],
        consistency: bool = True
    ) -> List[Dict]:
        """
        Anonymize a full dataset while preserving coherence.
        
        Args:
            data: List of dicts to anonymize
            config: Per-field configuration
            consistency: If True, the same original value always maps to the same anonymized value
        
        Returns:
            Anonymized dataset
        """
        
        # Build field-level instructions
        field_configs = {cfg.field_name: cfg for cfg in config}
        
        # Small sample of records, passed along to the prompt builder
        sample = data[:5]
        
        prompt = self._build_anonymization_prompt(sample, field_configs)
        
        # Process in batches for efficiency
        batch_size = 50
        anonymized_data = []
        
        for i in range(0, len(data), batch_size):
            batch = data[i:i+batch_size]
            
            batch_prompt = f"""
            Anonymize this batch of {len(batch)} records following these rules:
            
            {prompt}
            
            Records to anonymize:
            {json.dumps(batch, indent=2, default=str)}
            
            Return a JSON object of the form {{"records": [...]}} containing the
            anonymized records in the same order and with the same schema.
            """
            
            # JSON mode returns a single JSON object (never a bare array) and needs a
            # model that supports `response_format`; older gpt-4 snapshots may reject it.
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": "You are a privacy expert. Anonymize data while preserving utility."},
                    {"role": "user", "content": batch_prompt}
                ],
                temperature=0.7,
                response_format={"type": "json_object"}
            )
            
            result = json.loads(response.choices[0].message.content)
            anonymized_batch = result.get('records', result.get('data', []))
            
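            # The model may drop or add records; pairing them positionally with the
            # originals would then silently misalign the data.
            if len(anonymized_batch) != len(batch):
                raise ValueError(
                    f"Expected {len(batch)} anonymized records, got {len(anonymized_batch)}"
                )
            
            # Apply consistency if needed
            if consistency:
                anonymized_batch = self._ensure_consistency(batch, anonymized_batch, field_configs)
            
            anonymized_data.extend(anonymized_batch)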
        
        return anonymized_data
    
    def _build_anonymization_prompt(
        self, 
        sample: List[Dict], 
        configs: Dict[str, AnonymizationConfig]
    ) -> str:
        """Build explicit instructions for the LLM (`sample` is currently unused)"""
        
        rules = ["ANONYMIZATION RULES:"]
        
        for field_name, config in configs.items():
            rule = f"\n{field_name} ({config.pii_type}):"
            
            if config.anonymization_level == 'redact':
                rule += " → REDACT completely (use '████' or '[REDACTED]')"
            
            elif config.anonymization_level == 'mask':
                rule += " → MASK with generic placeholder"
                if config.pii_type == 'email':
                    rule += " (use 'user{id}@example.com')"
                elif config.pii_type == 'phone':
                    rule += " (use 'XXX-XXX-{last4}')"
            
            elif config.anonymization_level == 'pseudonymize':
                rule += " → REPLACE with realistic alternative"
                if config.preserve_format:
                    rule += ", MAINTAIN format"
                if config.preserve_domain and config.pii_type == 'email':
                    rule += ", KEEP domain"
            
            elif config.anonymization_level == 'synthetic':
                rule += " → GENERATE synthetic data"
                if config.preserve_distribution:
                    rule += ", PRESERVE statistical distribution"
            
            rules.append(rule)
        
        rules.append("\n\nGENERAL PRINCIPLES:")
        rules.append("- Maintain coherence (if name is 'John', email shouldn't be 'maria@...')")
        rules.append("- Preserve relationships between fields")
        rules.append("- Keep data realistic and useful for analytics")
        rules.append("- Same original value → same anonymized value (consistency)")
        
        return "\n".join(rules)
    
    def _ensure_consistency(
        self,
        original_batch: List[Dict],
        anonymized_batch: List[Dict],
        configs: Dict[str, AnonymizationConfig]
    ) -> List[Dict]:
        """
        Ensure repeated values are anonymized consistently.
        
        Example: if "john@acme.com" appears 3 times, all 3 must map
                 to the same anonymized value "jane@techcorp.com".
        """
        
        for field_name in configs.keys():
            # Reuse the first anonymized value seen for each original value
            for orig, anon in zip(original_batch, anonymized_batch):
                orig_value = orig.get(field_name)
                anon_value = anon.get(field_name)
                
                if orig_value is not None:
                    # Key on field name + hashed original value, so equal values
                    # in different fields do not share one pseudonym
                    key = f"{field_name}:{self._hash_value(orig_value)}"
                    
                    if key not in self.pseudonym_cache:
                        self.pseudonym_cache[key] = anon_value
                    
                    # Apply consistent mapping
                    anon[field_name] = self.pseudonym_cache[key]
        
        return anonymized_batch
    
    def _hash_value(self, value) -> str:
        """Deterministic hash used as the cache key"""
        return hashlib.sha256(str(value).encode()).hexdigest()

# Example: anonymize an employee database
employees = [
    {
        "employee_id": 1001,
        "name": "John Smith",
        "email": "john.smith@acme.com",
        "phone": "555-123-4567",
        "salary": 120000,
        "department": "Engineering",
        "performance_review": "Excellent developer, strong communication skills"
    },
    {
        "employee_id": 1002,
        "name": "Maria Garcia",
        "email": "maria.garcia@acme.com",
        "phone": "555-234-5678",
        "salary": 95000,
        "department": "Marketing",
        "performance_review": "Great team player, needs to improve presentation skills"
    },
    # ... more employees
]

# Configure anonymization
config = [
    AnonymizationConfig(
        field_name='name',
        pii_type='name',
        anonymization_level='synthetic',
        preserve_format=True
    ),
    AnonymizationConfig(
        field_name='email',
        pii_type='email',
        anonymization_level='pseudonymize',
        preserve_format=True,
        preserve_domain=True  # Keep @acme.com
    ),
    AnonymizationConfig(
        field_name='phone',
        pii_type='phone',
        anonymization_level='mask',
        preserve_format=True
    ),
    AnonymizationConfig(
        field_name='performance_review',
        pii_type='free_text',
        anonymization_level='synthetic',  # Rewrite without names or specific details
        preserve_format=False
    )
]

anonymizer = SmartAnonymizer(model='gpt-4')
anonymized_employees = anonymizer.anonymize_dataset(employees, config, consistency=True)

print("Original:", employees[0])
print("\nAnonymized:", anonymized_employees[0])
```

**Example Output:**

```
Original: {
    "employee_id": 1001,
    "name": "John Smith",
    "email": "john.smith@acme.com",
    "phone": "555-123-4567",
    "salary": 120000,
    "department": "Engineering",
    "performance_review": "Excellent developer, strong communication skills"
}

Anonymized: {
    "employee_id": 1001,  # Not in config, kept as-is
    "name": "Alex Thompson",  # Synthetic name
    "email": "alex.thompson@acme.com",  # Preserved domain
    "phone": "XXX-XXX-4567",  # Masked (last 4 preserved)
    "salary": 120000,  # Not in config, kept as-is
    "department": "Engineering",  # Not in config, kept as-is
    "performance_review": "Strong technical contributor with good team collaboration"  # Rewritten
}
```
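
Because the model rewrites free text, an original name or email can resurface in a field other than the one it came from. A minimal sketch of a leakage scan follows, assuming `original` and `anonymized` are aligned lists of dicts; `find_pii_leaks` is a hypothetical helper.

```python
def find_pii_leaks(original, anonymized, pii_fields):
    """Return (record_index, field, value) for original PII values that survive
    anywhere in the matching anonymized record."""
    leaks = []
    for idx, (orig, anon) in enumerate(zip(original, anonymized)):
        # Search every anonymized field, not just the one the value came from
        haystack = " ".join(str(v) for v in anon.values()).lower()
        for field in pii_fields:
            value = str(orig.get(field, "")).strip().lower()
            if value and value in haystack:
                leaks.append((idx, field, orig[field]))
    return leaks
```

This only catches verbatim matches. Partial leaks, such as a first name alone, would need tokenization or an NER pass.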

**Privacy Metrics:**

```python
def calculate_privacy_metrics(
    original: List[Dict],
    anonymized: List[Dict],
    pii_fields: List[str]
) -> Dict:
    """
    Measure how well privacy is preserved.
    
    Metrics:
    - PII Removal Rate: % of PII values that changed
    - Uniqueness: whether unique quasi-identifier combinations shrink (a rough proxy, not true k-anonymity)
    - Re-identification Risk: likelihood of linking a record back to the original
    """
    
    metrics = {}
    
    # 1. PII Removal Rate
    pii_removed = 0
    pii_total = 0
    
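    for orig, anon in zip(original, anonymized):
        for field in pii_fields:
            pii_total += 1
            # .get() avoids a KeyError when the model drops a field
            if orig.get(field) != anon.get(field):
                pii_removed += 1
    
    # A changed value is not necessarily safe (e.g. a masked phone keeps its last 4 digits)
    metrics['pii_removal_rate'] = pii_removed / pii_total * 100 if pii_total else 0.0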
    
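    # 2. Quasi-identifier uniqueness (a proxy for k-anonymity)
    # Count distinct combinations of quasi-identifiers
    orig_combinations = set()
    anon_combinations = set()
    
    quasi_ids = ['age', 'zip_code', 'gender']  # Example only; absent from `employees`, so the metric is 0 there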
    
    for record in original:
        combo = tuple(record.get(q) for q in quasi_ids if q in record)
        orig_combinations.add(combo)
    
    for record in anonymized:
        combo = tuple(record.get(q) for q in quasi_ids if q in record)
        anon_combinations.add(combo)
    
    metrics['original_unique_combinations'] = len(orig_combinations)
    metrics['anonymized_unique_combinations'] = len(anon_combinations)
    metrics['k_anonymity_improvement'] = (
        (len(orig_combinations) - len(anon_combinations)) / len(orig_combinations) * 100
    )
    
    # 3. Re-identification risk (simplified)
    # Share of records whose non-PII fields are all unchanged
    perfect_matches = 0
    for orig, anon in zip(original, anonymized):
        # Identical non-PII fields make a 1:1 match back to the original easy
        non_pii_match = all(
            orig.get(k) == anon.get(k) 
            for k in orig.keys() 
            if k not in pii_fields
        )
        if non_pii_match:
            perfect_matches += 1
    
    metrics['re_identification_risk'] = perfect_matches / len(original) * 100
    
    return metrics

# Evaluate
privacy_metrics = calculate_privacy_metrics(
    original=employees,
    anonymized=anonymized_employees,
    pii_fields=['name', 'email', 'phone', 'performance_review']
)

print("📊 PRIVACY METRICS:")
print(f"PII Removal Rate: {privacy_metrics['pii_removal_rate']:.1f}%")
print(f"K-anonymity Improvement: {privacy_metrics['k_anonymity_improvement']:.1f}%")
print(f"Re-identification Risk: {privacy_metrics['re_identification_risk']:.1f}%")
```
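
The `k_anonymity_improvement` metric above counts distinct quasi-identifier combinations. k-anonymity itself is defined differently: a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records. A short sketch of that computation follows; the function name is chosen here for illustration.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers) -> int:
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    if not records:
        raise ValueError("records must be non-empty")
    groups = Counter(
        tuple(record.get(q) for q in quasi_identifiers) for record in records
    )
    return min(groups.values())
```

A result of 1 means at least one record is unique on its quasi-identifiers and therefore easier to single out.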

**Utility Preservation:**

```python
def calculate_utility_metrics(
    original: pd.DataFrame,
    anonymized: pd.DataFrame,
    numeric_cols: List[str],
    categorical_cols: List[str]
) -> Dict:
    """
    Measure how useful the anonymized dataset remains.
    
    Metrics:
    - Statistical similarity (numeric)
    - Categorical distribution preservation
    - Correlation preservation
    - ML model performance
    """
    
    metrics = {}
    
    # 1. Statistical Similarity (numeric)
    for col in numeric_cols:
        orig_mean = original[col].mean()
        anon_mean = anonymized[col].mean()
        
        orig_std = original[col].std()
        anon_std = anonymized[col].std()
        
        mean_diff = abs(orig_mean - anon_mean) / orig_mean * 100
        std_diff = abs(orig_std - anon_std) / orig_std * 100
        
        metrics[f'{col}_mean_diff_pct'] = mean_diff
        metrics[f'{col}_std_diff_pct'] = std_diff
    
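    # 2. Categorical Distribution
    from scipy.spatial.distance import jensenshannon
    for col in categorical_cols:
        orig_dist = original[col].value_counts(normalize=True)
        anon_dist = anonymized[col].value_counts(normalize=True)
        
        # Align both distributions on the same categories (missing ones get 0);
        # otherwise the values are compared by position, not by category
        categories = orig_dist.index.union(anon_dist.index)
        p = orig_dist.reindex(categories, fill_value=0).to_numpy()
        q = anon_dist.reindex(categories, fill_value=0).to_numpy()
        
        # Jensen-Shannon distance; base=2 bounds it to [0, 1]
        js_dist = jensenshannon(p, q, base=2)
        metrics[f'{col}_distribution_similarity'] = 1 - js_dist  # 1 = identical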
    
    # 3. Correlation Preservation
    orig_corr = original[numeric_cols].corr()
    anon_corr = anonymized[numeric_cols].corr()
    
    corr_mae = np.abs(orig_corr - anon_corr).mean().mean()
    metrics['correlation_mae'] = corr_mae
    
    # 4. ML Performance (if target exists)
    if 'target' in original.columns:
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import f1_score
        from sklearn.model_selection import train_test_split
        
        # Train on original (features must be numeric or already encoded)
        X_orig = original.drop('target', axis=1)
        y_orig = original['target']
        X_train, X_test, y_train, y_test = train_test_split(
            X_orig, y_orig, test_size=0.2, random_state=42
        )
        
        model_orig = RandomForestClassifier(random_state=42)
        model_orig.fit(X_train, y_train)
        f1_orig = f1_score(y_test, model_orig.predict(X_test), average='macro')
        
        # Train on the anonymized counterparts of the training rows only, so the
        # held-out test rows never leak into training (assumes aligned row indices)
        X_anon = anonymized.drop('target', axis=1).loc[X_train.index]
        y_anon = anonymized['target'].loc[X_train.index]
        
        model_anon = RandomForestClassifier(random_state=42)
        model_anon.fit(X_anon, y_anon)
        f1_anon = f1_score(y_test, model_anon.predict(X_test), average='macro')
        
        metrics['ml_f1_original'] = f1_orig
        metrics['ml_f1_anonymized'] = f1_anon
        metrics['ml_utility_retention'] = (f1_anon / f1_orig) * 100
    
    return metrics

# Evaluate utility
utility_metrics = calculate_utility_metrics(
    original=pd.DataFrame(employees),
    anonymized=pd.DataFrame(anonymized_employees),
    numeric_cols=['salary'],
    categorical_cols=['department']
)

print("\n📈 UTILITY METRICS:")
print(f"Salary mean diff: {utility_metrics.get('salary_mean_diff_pct', 0):.1f}%")
print(f"Department distribution similarity: {utility_metrics.get('department_distribution_similarity', 0):.3f}")
print(f"ML utility retention: {utility_metrics.get('ml_utility_retention', 0):.1f}%")
```

**Compliance & Auditing:**

```python
class AnonymizationAudit:
    """Builds an audit trail for compliance work (GDPR, HIPAA, etc.)"""
    
    def __init__(self):
        self.audit_log = []
    
    def log_anonymization(
        self,
        timestamp: str,
        dataset_name: str,
        num_records: int,
        config: List[AnonymizationConfig],
        privacy_metrics: Dict,
        utility_metrics: Dict,
        user: str
    ):
        """Record one anonymization operation"""
        
        entry = {
            'timestamp': timestamp,
            'dataset': dataset_name,
            'record_count': num_records,
            'user': user,
            'configuration': [
                {
                    'field': cfg.field_name,
                    'pii_type': cfg.pii_type,
                    'method': cfg.anonymization_level
                }
                for cfg in config
            ],
            'privacy_metrics': privacy_metrics,
            'utility_metrics': utility_metrics,
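            # Informational labels only: they note which provisions the run is meant
            # to support and are not a legal determination of compliance.
            'compliance': {
                'gdpr_article_6': 'Lawful processing - anonymization',
                'gdpr_article_89': 'Safeguards for processing for research purposes',
                'hipaa_safe_harbor': (
                    'All configured PII fields changed (Safe Harbor still requires review of all 18 identifier types)'
                    if privacy_metrics['pii_removal_rate'] == 100
                    else 'Does not meet Safe Harbor'
                )
            }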
        }
        
        self.audit_log.append(entry)
        
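        # Export for regulators. Create the folder if needed and avoid ':' in the
        # file name, which Windows does not allow.
        from pathlib import Path
        log_dir = Path('audit_logs')
        log_dir.mkdir(parents=True, exist_ok=True)
        log_path = log_dir / f"anonymization_{timestamp.replace(':', '-')}.json"
        with open(log_path, 'w') as f:
            json.dump(entry, f, indent=2, default=str)
        
        print(f"✅ Audit log created: {log_path}")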
    
    def generate_compliance_report(self) -> str:
        """Generate a report for auditors"""
        
        report = [
            "ANONYMIZATION COMPLIANCE REPORT",
            "=" * 50,
            f"\nTotal Operations: {len(self.audit_log)}",
        ]
        
        for i, entry in enumerate(self.audit_log, 1):
            report.append(f"\n{i}. {entry['dataset']} ({entry['timestamp']})")
            report.append(f"   Records: {entry['record_count']}")
            report.append(f"   PII Removal: {entry['privacy_metrics']['pii_removal_rate']:.1f}%")
            report.append(f"   Utility Retention: {entry['utility_metrics'].get('ml_utility_retention', 'N/A')}")
            report.append(f"   Compliance: {entry['compliance']['hipaa_safe_harbor']}")
        
        return "\n".join(report)

# Usage
audit = AnonymizationAudit()

audit.log_anonymization(
    timestamp='2024-10-31T10:00:00Z',
    dataset_name='employees_q3_2024',
    num_records=len(employees),
    config=config,
    privacy_metrics=privacy_metrics,
    utility_metrics=utility_metrics,
    user='data_engineer_ljrv'
)

print(audit.generate_compliance_report())
```

**Best Practices:**

1. **Privacy First:**
   - Default to highest privacy level
   - Explicit consent for lower levels
   - Document rationale for each field

2. **Test Re-identification:**
   - Attempt to match anonymized→original
   - Hire external auditors (red team)
   - Use linkage attack simulations

3. **Layered Defense:**
   ```python
   # Do not rely on anonymization alone
   anonymized_data = anonymize(data)
   # + Differential privacy noise
   # + k-anonymity grouping
   # + Access controls
   # + Audit logs
   # = Defense in depth
   ```

4. **Update Regularly:**
   - De-anonymization techniques keep improving
   - Re-anonymize periodically
   - Monitor for new attacks
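
The linkage attack simulation mentioned under "Test Re-identification" can be sketched as a join between the anonymized data and an external dataset that still carries identities. The code below counts how many anonymized records match exactly one external person on the chosen quasi-identifiers. `linkage_attack_rate` and the sample field names are assumptions for illustration.

```python
from collections import defaultdict

def linkage_attack_rate(anonymized, external, quasi_identifiers) -> float:
    """Share of anonymized records that match exactly one external identity."""
    index = defaultdict(list)
    for person in external:
        index[tuple(person.get(q) for q in quasi_identifiers)].append(person)
    unique_matches = sum(
        1
        for record in anonymized
        if len(index.get(tuple(record.get(q) for q in quasi_identifiers), [])) == 1
    )
    return unique_matches / len(anonymized) if anonymized else 0.0
```

A high rate suggests the untouched fields act as quasi-identifiers and need generalization or perturbation as well.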


```python
def anonymize_data(sensitive_records: list) -> list:
    """Anonymize data while keeping it realistic."""
    prompt = f'''
Anonymize these records while keeping their structure and realism:
- Replace names with fictitious names
- Change emails but keep the format
- Alter IDs but keep the type
- Preserve statistical patterns (ages, dates)

Original data:
{json.dumps(sensitive_records, indent=2)}

Anonymized data (same JSON format):
'''
    
    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role':'user','content':prompt}],
        temperature=0.7
    )
    
    return json.loads(resp.choices[0].message.content)

real_data = [
    {'id': 12345, 'name': 'John Smith', 'email': 'john@company.com', 'salary': 75000},
    {'id': 12346, 'name': 'Maria Garcia', 'email': 'maria@company.com', 'salary': 82000}
]

anonymized = anonymize_data(real_data)
print('Original:', real_data[0])
print('Anonymized:', anonymized[0])
```

## 5. Data Augmentation

```python
def augment_dataset(original_records: list, target_count: int) -> list:
    """Augment a dataset by generating variations."""
    prompt = f'''
You have these {len(original_records)} original records:

{json.dumps(original_records, indent=2)}

Generate {target_count} new records that are similar but contain realistic variations.
Keep the distributions and patterns.

JSON array:
'''
    
    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role':'user','content':prompt}],
        temperature=0.8
    )
    
    return json.loads(resp.choices[0].message.content)

original_products = [
    {'name': 'Laptop Pro 15', 'category': 'Electronics', 'price': 1299},
    {'name': 'Wireless Mouse', 'category': 'Accessories', 'price': 29}
]

# `augmented` holds only the new records; the originals are not included
augmented = augment_dataset(original_products, 5)
print(f'Generated {len(augmented)} augmented records:\n')
for item in augmented:
    print(item)
```

## 6. Time Series Generation

```python
def generate_timeseries(metric_name: str, days: int, pattern: str) -> list:
    """Generate a synthetic time series."""
    prompt = f'''
Generate a time series of {days} days for the metric: {metric_name}
Pattern: {pattern}

JSON format:
[
  {{"date": "YYYY-MM-DD", "value": float}},
  ...
]

The values must follow the described pattern.
'''
    
    resp = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role':'user','content':prompt}],
        temperature=0.5
    )
    
    return json.loads(resp.choices[0].message.content)

revenue_data = generate_timeseries(
    metric_name='daily_revenue',
    days=7,
    pattern='Upward trend with weekend peaks, values between 5000 and 15000'
)

for entry in revenue_data:
    print(f"{entry['date']}: ${entry['value']:,.2f}")
```
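
A model can return the wrong number of points, skip a day, or drift outside the requested range. The sketch below checks those three properties on the returned list; `validate_timeseries` is a hypothetical helper, and the range bounds are passed in by the caller.

```python
from datetime import date, timedelta
from typing import List

def validate_timeseries(series, days: int, min_value: float, max_value: float) -> List[str]:
    """Return problems found in a generated daily series (empty list = valid)."""
    errors = []
    if len(series) != days:
        errors.append(f"expected {days} points, got {len(series)}")
    try:
        dates = [date.fromisoformat(point["date"]) for point in series]
    except (KeyError, TypeError, ValueError):
        return errors + ["every point needs an ISO 'date'"]
    # Consecutive points must be exactly one day apart
    for prev, cur in zip(dates, dates[1:]):
        if cur - prev != timedelta(days=1):
            errors.append(f"gap or disorder between {prev} and {cur}")
    for point in series:
        value = point.get("value")
        if not isinstance(value, (int, float)) or not (min_value <= value <= max_value):
            errors.append(f"value out of range on {point['date']}: {value}")
    return errors
```

For the example above, a call such as `validate_timeseries(revenue_data, 7, 5000, 15000)` would flag any violation of the requested shape. Checking the trend and weekend peaks would need additional logic.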

## 7. Integration with Faker

```python
# pip install faker
from faker import Faker
import pandas as pd

fake = Faker()

def generate_hybrid_dataset(count: int) -> pd.DataFrame:
    """Combine Faker (structure) with an LLM (semantics)."""
    # Base structure from Faker
    base_data = [{
        'user_id': fake.uuid4(),
        'name': fake.name(),
        'email': fake.email(),
        'city': fake.city()
    } for _ in range(count)]
    
    # Enrich with the LLM (e.g. a bio consistent with the record); one API call per record
    for record in base_data:
        prompt = f"Write a one-sentence bio for {record['name']}, who lives in {record['city']} and works in tech."
        resp = client.chat.completions.create(
            model='gpt-3.5-turbo',
            messages=[{'role':'user','content':prompt}],
            temperature=0.7,
            max_tokens=50
        )
        record['bio'] = resp.choices[0].message.content.strip()
    
    return pd.DataFrame(base_data)

df_hybrid = generate_hybrid_dataset(3)
print(df_hybrid)
```

## 8. Best Practices

- **Validation**: verify that synthetic data satisfies your constraints.
- **Distributions**: compare statistics against the real data.
- **Diversity**: raise the temperature for more variety.
- **Batches**: generate in batches for efficiency.
- **Costs**: use GPT-3.5 for volume and GPT-4 for quality.
- **Complement**: combine LLMs with traditional libraries (Faker, SDV).
- **Testing**: validate with Great Expectations.
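
The batching advice can be illustrated on the Faker example, which issues one API call per record. The sketch below groups records into chunks, builds one prompt per chunk, and merges the model's answers back by `user_id`. All three helper names are assumptions, and the actual API call is left out.

```python
import json

def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_bio_prompt(batch) -> str:
    """One prompt covering a whole batch instead of one call per record."""
    people = [{"user_id": r["user_id"], "name": r["name"], "city": r["city"]} for r in batch]
    return (
        "Write a one-sentence bio for each person below (all work in tech).\n"
        'Return a JSON array of objects with "user_id" and "bio".\n\n'
        + json.dumps(people, indent=2)
    )

def merge_bios(batch, llm_rows):
    """Attach bios by user_id, so the order of the model's answer does not matter."""
    bios = {row["user_id"]: row["bio"] for row in llm_rows}
    return [{**record, "bio": bios.get(record["user_id"])} for record in batch]
```

Matching on `user_id` rather than position means a reordered or incomplete reply leaves a `None` bio that can be retried, instead of attaching a bio to the wrong person.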

### 🏭 **Synthetic Data in Production: Architecture and ROI**

**Synthetic Data Generation Pipeline (Enterprise):**

```
┌──────────────────────────────────────────────────────────────────┐
│           SYNTHETIC DATA PRODUCTION PIPELINE                      │
├──────────────────────────────────────────────────────────────────┤
│                                                                   │
│  1️⃣ DATA INGESTION & PROFILING                                   │
│     ┌────────────────────────────────────┐                      │
│     │ • Real data sampling               │                      │
│     │ • Schema extraction                │                      │
│     │ • Statistical profiling            │                      │
│     │ • PII detection (NER, regex)       │                      │
│     │ • Relationship mapping             │                      │
│     └────────────────────────────────────┘                      │
│            ↓                                                      │
│  2️⃣ CONFIGURATION & POLICY                                       │
│     ┌────────────────────────────────────┐                      │
│     │ • Anonymization rules              │                      │
│     │ • Generation strategy              │                      │
│     │ • Quality thresholds               │                      │
│     │ • Compliance requirements          │                      │
│     └────────────────────────────────────┘                      │
│            ↓                                                      │
│  3️⃣ GENERATION (Multi-Strategy)                                  │
│     ┌────────────────────────────────────┐                      │
│     │ LLM (GPT-4)     →  5% high-value   │                      │
│     │ LLM (GPT-3.5)   →  15% medium      │                      │
│     │ Faker           →  70% simple      │                      │
│     │ SDV (Copula)    →  10% complex     │                      │
│     └────────────────────────────────────┘                      │
│            ↓                                                      │
│  4️⃣ QUALITY VALIDATION                                           │
│     ┌────────────────────────────────────┐                      │
│     │ • Schema validation (Pydantic)     │                      │
│     │ • Statistical tests (KS, Chi²)     │                      │
│     │ • Business rules (Great Expect.)   │                      │
│     │ • PII leakage check                │                      │
│     │ • ML utility test                  │                      │
│     └────────────────────────────────────┘                      │
│            ↓                                                      │
│  5️⃣ STORAGE & DELIVERY                                           │
│     ┌────────────────────────────────────┐                      │
│     │ • Versioning (DVC, lakeFS)         │                      │
│     │ • Access control (RBAC)            │                      │
│     │ • Audit logging                    │                      │
│     │ • API endpoints (FastAPI)          │                      │
│     │ • Self-service portal              │                      │
│     └────────────────────────────────────┘                      │
└──────────────────────────────────────────────────────────────────┘
```
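
The multi-strategy mix in step 3 can be turned into concrete record counts per generator. The sketch below assumes the 5/15/70/10 split shown in the diagram; the function name `allocate_by_strategy` and the largest-remainder rounding are illustrative choices, not part of the original pipeline.

```python
def allocate_by_strategy(total, mix):
    """Split `total` records across strategies using largest-remainder rounding."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("strategy shares must sum to 1.0")
    # Exact (fractional) share of records per strategy
    exact = {name: total * share for name, share in mix.items()}
    # Round down first, then hand the leftover records to the largest remainders
    counts = {name: int(value) for name, value in exact.items()}
    leftover = total - sum(counts.values())
    by_remainder = sorted(
        mix, key=lambda name: exact[name] - counts[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts

# Shares taken from the diagram above
GENERATION_MIX = {'gpt-4': 0.05, 'gpt-3.5-turbo': 0.15, 'faker': 0.70, 'sdv': 0.10}

plan = allocate_by_strategy(10_000, GENERATION_MIX)
print(plan)
```

Largest-remainder rounding guarantees the counts always add up to the requested total, which plain `round()` per strategy does not.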

**Production-Grade Generator (Orchestrated):**

```python
import json
from datetime import datetime, timedelta
from typing import Any, Dict

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from faker import Faker
from openai import OpenAI

class SyntheticDataPlatform:
    """
    Enterprise platform for synthetic data generation.
    
    Features:
    - Multi-strategy generation (LLM + Faker + SDV)
    - Quality gates (fail if metrics below threshold)
    - Cost optimization (smart routing)
    - Versioning & lineage
    - Self-service API
    """
    
    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.llm_client = OpenAI()
        self.faker = Faker()
        
        # Metrics tracking
        self.metrics = {
            'records_generated': 0,
            'cost_usd': 0.0,
            'quality_score': 0.0,
            'generation_time_sec': 0.0
        }
    
    def generate_dataset(
        self,
        schema: Dict,
        count: int,
        quality_threshold: float = 0.85
    ) -> pd.DataFrame:
        """
        Generate a dataset, enforcing quality gates.
        
        Pipeline:
        1. Profile schema → detect complexity
        2. Route to appropriate generator
        3. Validate quality
        4. If quality < threshold → regenerate or fail
        """
        
        import time
        start_time = time.time()
        
        # Step 1: Complexity analysis
        complexity_score = self._analyze_complexity(schema)
        
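        # Step 2: Smart routing
        # _generate_with_llm and _generate_with_faker are not defined in this
        # listing; they are assumed to return a DataFrame, like _generate_hybrid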
        if complexity_score > 0.8:
            print("🧠 High complexity → GPT-4")
            df = self._generate_with_llm(schema, count, model='gpt-4')
        elif complexity_score > 0.5:
            print("⚡ Medium complexity → GPT-3.5")
            df = self._generate_with_llm(schema, count, model='gpt-3.5-turbo')
        elif complexity_score > 0.2:
            print("📊 Low complexity → Hybrid (Faker + GPT-3.5)")
            df = self._generate_hybrid(schema, count)
        else:
            print("🚀 Simple schema → Faker only")
            df = self._generate_with_faker(schema, count)
        
        # Step 3: Quality validation
        quality_metrics = self._validate_quality(df, schema)
        
        self.metrics['quality_score'] = quality_metrics['overall_score']
        self.metrics['records_generated'] = len(df)
        self.metrics['generation_time_sec'] = time.time() - start_time
        
        # Step 4: Quality gate
        if quality_metrics['overall_score'] < quality_threshold:
            raise ValueError(
                f"Quality score {quality_metrics['overall_score']:.2f} "
                f"below threshold {quality_threshold}"
            )
        
        # Step 5: Audit log
        self._log_generation(df, quality_metrics)
        
        return df
    
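    def _analyze_complexity(self, schema: Dict) -> float:
        """
        Score schema complexity in [0, 1] to drive routing.
        
        Complexity factors:
        - Free text fields (high complexity)
        - Cross-field dependencies (high complexity)
        - Numeric ranges (low complexity)
        - Categorical with enum (low complexity)
        """
        
        if not schema:
            return 0.0  # Empty schema: nothing to generate
        
        max_weight = 0.3  # Weight of the most complex field kind (free text)
        complexity = 0.0
        
        for field_name, field_def in schema.items():
            if 'text' in field_def.get('type', '').lower():
                complexity += 0.3  # Free text is complex
            elif 'depends_on' in field_def:
                complexity += 0.2  # Dependencies are complex
            elif 'enum' in field_def:
                complexity += 0.05  # Simple categorical
            else:
                complexity += 0.1  # Numeric/simple
        
        # Normalize by the maximum attainable total so an all-text schema
        # scores 1.0. Dividing by the field count alone caps the score at 0.3,
        # which leaves the 0.5 and 0.8 routing thresholds unreachable.
        # Rounding avoids float noise right at a threshold.
        score = complexity / (len(schema) * max_weight)
        return min(round(score, 4), 1.0)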
    
    def _generate_hybrid(self, schema: Dict, count: int) -> pd.DataFrame:
        """
        Hybrid strategy: Faker for structure, LLM for semantics.
        
        Estimated at roughly 70% cheaper and 3x faster than LLM-only
        generation; the actual gain depends on the share of LLM fields.
        """
        
        # Step 1: Structure with Faker
        base_data = []
        for _ in range(count):
            record = {}
            for field_name, field_def in schema.items():
                if field_def.get('generator') == 'faker':
                    # Use Faker for simple fields
                    faker_method = getattr(self.faker, field_def['faker_method'])
                    record[field_name] = faker_method()
                else:
                    # Placeholder to be filled in by the LLM
                    record[field_name] = None
            base_data.append(record)
        
        # Step 2: Enrich with the LLM (batched)
        llm_fields = [f for f, d in schema.items() if d.get('generator') == 'llm']
        
        if llm_fields:
            # Process in batches of 50 (lower this if responses get truncated)
            batch_size = 50
            for i in range(0, len(base_data), batch_size):
                batch = base_data[i:i+batch_size]
                
                # The prompt states the exact JSON shape that the merge step expects
                prompt = (
                    f"Enrich these {len(batch)} records by generating values for: "
                    f"{', '.join(llm_fields)}\n\n"
                    f"Schema:\n"
                    f"{json.dumps({f: schema[f] for f in llm_fields}, indent=2)}\n\n"
                    f"Existing data:\n"
                    f"{json.dumps(batch, indent=2, default=str)}\n\n"
                    'Return a JSON object of the form {"records": [...]} with one '
                    "entry per input record, in the same order."
                )
                
                response = self.llm_client.chat.completions.create(
                    model='gpt-3.5-turbo',
                    messages=[{'role': 'user', 'content': prompt}],
                    temperature=0.7,
                    # JSON mode (available on recent gpt-3.5-turbo and gpt-4 models)
                    response_format={'type': 'json_object'}
                )
                
                enriched = json.loads(response.choices[0].message.content)
                
                # Merge only the LLM-owned fields so Faker values are never
                # overwritten; zip() ignores surplus records from the model
                for record, enriched_record in zip(batch, enriched.get('records', [])):
                    for field in llm_fields:
                        if field in enriched_record:
                            record[field] = enriched_record[field]
                
                # Track cost (rough estimate of ~$0.001 per record)
                self.metrics['cost_usd'] += 0.001 * len(batch)
        
        return pd.DataFrame(base_data)
    
    def _validate_quality(self, df: pd.DataFrame, schema: Dict) -> Dict:
        """
        Comprehensive quality validation.
        
        Checks:
        1. Schema compliance (100% must pass)
        2. Statistical validity (distributions)
        3. Business rules (custom validators)
        4. PII leakage (no real data leaked)
        5. Uniqueness (sufficient diversity)
        """
        
        metrics = {
            'schema_compliance': 0.0,
            'statistical_validity': 0.0,
            'business_rules': 0.0,
            'pii_leakage': 0.0,
            'uniqueness': 0.0,
            'overall_score': 0.0
        }
        
        # Nothing to validate: leave every score at 0.0 so the quality gate fails
        if df.empty or not schema:
            return metrics
        
        # 1. Schema compliance: share of schema fields present as columns
        required_cols = set(schema.keys())
        actual_cols = set(df.columns)
        metrics['schema_compliance'] = len(required_cols & actual_cols) / len(required_cols)
        
        # 2. Statistical validity: share of in-range values, averaged over the
        #    columns that declare a 'range' (columns without one are skipped
        #    instead of counting as failures)
        range_scores = []
        for col in df.columns:
            if col in schema and 'range' in schema[col]:
                expected_min, expected_max = schema[col]['range']
                range_scores.append(float(df[col].between(expected_min, expected_max).mean()))
        
        metrics['statistical_validity'] = (
            sum(range_scores) / len(range_scores) if range_scores else 1.0
        )
        
        # 3. Business rules (Great Expectations). PandasDataset belongs to the
        #    legacy Dataset API; newer releases use a validator-based API instead.
        rule_results = []
        
        # Example: email format
        if 'email' in df.columns:
            from great_expectations.dataset import PandasDataset
            ge_df = PandasDataset(df)
            result = ge_df.expect_column_values_to_match_regex(
                'email',
                r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
            )
            rule_results.append(bool(result.success))
        
        # No applicable rule means no violation
        metrics['business_rules'] = (
            sum(rule_results) / len(rule_results) if rule_results else 1.0
        )
        
        # 4. PII leakage check (placeholder - use Presidio in prod)
        # Check if synthetic data contains known real values
        metrics['pii_leakage'] = 1.0  # Assume no leakage for now
        
        # 5. Uniqueness
        duplicate_ratio = float(df.duplicated().sum()) / len(df)
        metrics['uniqueness'] = 1 - duplicate_ratio
        
        # Overall score (weighted average)
        weights = {
            'schema_compliance': 0.3,
            'statistical_validity': 0.2,
            'business_rules': 0.2,
            'pii_leakage': 0.2,
            'uniqueness': 0.1
        }
        
        metrics['overall_score'] = sum(
            metrics[k] * weights[k] for k in weights
        )
        
        return metrics
    
    def _log_generation(self, df: pd.DataFrame, quality_metrics: Dict):
        """Audit trail for compliance and debugging."""
        
        log_entry = {
            'timestamp': datetime.now().isoformat(),
            'record_count': len(df),
            'quality_metrics': quality_metrics,
            'cost_usd': self.metrics['cost_usd'],
            'generation_time_sec': self.metrics['generation_time_sec'],
            'cost_per_record': self.metrics['cost_usd'] / len(df) if len(df) > 0 else 0
        }
        
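        # Write to audit log (S3, database, etc)
        import os
        os.makedirs('audit_logs', exist_ok=True)  # open() fails if the directory is missing
        log_path = f'audit_logs/generation_{datetime.now().strftime("%Y%m%d_%H%M%S")}.json'
        with open(log_path, 'w') as f:
            # default=str keeps non-JSON-native values (e.g. NumPy scalars) serializable
            json.dump(log_entry, f, indent=2, default=str)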

# Airflow DAG for scheduled generation
default_args = {
    'owner': 'data_engineering',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG(
    'synthetic_data_generation',
    default_args=default_args,
    description='Generate synthetic data for testing environments',
    schedule_interval='@weekly',  # Regenerate every week (newer Airflow versions name this parameter `schedule`)
    catchup=False
)

def generate_test_data(**context):
    """Task that generates the test data."""
    
    platform = SyntheticDataPlatform(config={})
    
    schema = {
        'customer_id': {'type': 'int', 'generator': 'faker', 'faker_method': 'random_int'},
        'name': {'type': 'str', 'generator': 'faker', 'faker_method': 'name'},
        'email': {'type': 'str', 'generator': 'faker', 'faker_method': 'email'},
        'bio': {'type': 'text', 'generator': 'llm'}  # Only bio uses the LLM
    }
    
    df = platform.generate_dataset(schema, count=10000, quality_threshold=0.85)
    
    # Save to S3
    df.to_parquet(f's3://synthetic-data-bucket/test_customers_{datetime.now().date()}.parquet')
    
    # Push metrics to XCom
    context['task_instance'].xcom_push(key='metrics', value=platform.metrics)

generate_task = PythonOperator(
    task_id='generate_test_data',
    python_callable=generate_test_data,
    dag=dag
)
```
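
The PII leakage step in `_validate_quality` is left as a placeholder. One way it could be filled in is sketched below, assuming a set of known real values is available for comparison. The function name `pii_leakage_score`, the two regular expressions, and the sample rows are hypothetical, and this is far simpler than a dedicated tool such as Presidio.

```python
import re

# Illustrative patterns only; real deployments need a broader detector
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')
SSN_RE = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

def pii_leakage_score(synthetic_rows, known_real_values, text_fields=()):
    """Return the fraction of rows with no detected leak (1.0 means clean)."""
    if not synthetic_rows:
        return 1.0
    real = {str(v).strip().lower() for v in known_real_values}
    leaked = 0
    for row in synthetic_rows:
        # Exact-match leak: any field reproduces a known real value
        exact = any(str(v).strip().lower() in real for v in row.values())
        # Pattern leak: PII-shaped strings embedded in free-text fields
        pattern = any(
            EMAIL_RE.search(str(row.get(f, ''))) or SSN_RE.search(str(row.get(f, '')))
            for f in text_fields
        )
        if exact or pattern:
            leaked += 1
    return 1 - leaked / len(synthetic_rows)

rows = [
    {'name': 'Ana Ruiz', 'email': 'ana@example.org', 'bio': 'Backend engineer in Lima.'},
    {'name': 'John Smith', 'email': 'js@example.org', 'bio': 'Data analyst.'},
    {'name': 'Li Wei', 'email': 'li@example.org', 'bio': 'Reach me at li.wei@corp.example.com'},
    {'name': 'Omar Said', 'email': 'omar@example.org', 'bio': 'Loves hiking.'},
]
known_real = {'John Smith', 'jane.doe@realcorp.com'}

score = pii_leakage_score(rows, known_real, text_fields=('bio',))
print(score)
```

In this sample the second row reproduces a known real name and the third embeds an email address in free text, so half of the rows are flagged.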

**Self-Service API (FastAPI):**

```python
from datetime import datetime
from typing import Dict
import uuid

from fastapi import BackgroundTasks, FastAPI, HTTPException
from fastapi.responses import FileResponse
from pydantic import BaseModel, Field

# SyntheticDataPlatform is the class defined in the previous block

app = FastAPI(title="Synthetic Data API")

class GenerationRequest(BaseModel):
    # Named schema_ because `schema` clashes with a BaseModel attribute;
    # the alias keeps "schema" as the key in the JSON request body
    schema_: Dict = Field(..., alias='schema')
    count: int
    quality_threshold: float = 0.85
    format: str = 'json'  # json, csv, parquet (this example always writes parquet)

class GenerationResponse(BaseModel):
    job_id: str
    status: str
    estimated_time_sec: int

# In-memory job store (use Redis in production)
jobs = {}

def run_generation(job_id: str, request: GenerationRequest):
    """
    Background task that runs the generation.
    
    Declared as a plain function: BackgroundTasks runs sync callables in a
    thread pool, so the blocking generation does not stall the event loop.
    """
    
    try:
        jobs[job_id]['status'] = 'running'
        
        platform = SyntheticDataPlatform(config={})
        df = platform.generate_dataset(
            schema=request.schema_,
            count=request.count,
            quality_threshold=request.quality_threshold
        )
        
        # Save to disk
        output_path = f'/tmp/synthetic_{job_id}.parquet'
        df.to_parquet(output_path)
        
        jobs[job_id].update({
            'status': 'completed',
            'output_path': output_path,
            'metrics': platform.metrics,
            'completed_at': datetime.now().isoformat()
        })
        
    except Exception as e:
        jobs[job_id].update({
            'status': 'failed',
            'error': str(e),
            'failed_at': datetime.now().isoformat()
        })

@app.post('/v1/generate', response_model=GenerationResponse)
async def generate_dataset(request: GenerationRequest, background_tasks: BackgroundTasks):
    """
    Endpoint that generates synthetic datasets.
    
    Usage:
        POST /v1/generate
        {
            "schema": {...},
            "count": 1000,
            "quality_threshold": 0.85
        }
    
    Returns:
        {"job_id": "uuid", "status": "queued", "estimated_time_sec": 120}
    """
    
    job_id = str(uuid.uuid4())
    
    jobs[job_id] = {
        'status': 'queued',
        'created_at': datetime.now().isoformat(),
        'request': request.dict(by_alias=True)
    }
    
    # Run generation in background
    background_tasks.add_task(run_generation, job_id, request)
    
    # Estimate time based on count
    estimated_time = request.count / 100  # ~100 records/sec
    
    return GenerationResponse(
        job_id=job_id,
        status='queued',
        estimated_time_sec=int(estimated_time)
    )

@app.get('/v1/jobs/{job_id}')
async def get_job_status(job_id: str):
    """Check generation job status"""
    
    if job_id not in jobs:
        raise HTTPException(404, "Job not found")
    
    return jobs[job_id]

@app.get('/v1/download/{job_id}')
async def download_dataset(job_id: str):
    """Download generated dataset"""
    
    job = jobs.get(job_id)
    
    if not job:
        raise HTTPException(404, "Job not found")
    
    if job['status'] != 'completed':
        raise HTTPException(400, f"Job status: {job['status']}")
    
    # FileResponse streams the file from disk, which suits large datasets
    return FileResponse(
        path=job['output_path'],
        media_type='application/octet-stream',
        filename=f'synthetic_data_{job_id}.parquet'
    )
```
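
On the client side, the queued job has to be polled until it finishes. The helper below is a sketch of that loop under stated assumptions: `wait_for_job` and its parameters are hypothetical, and `get_status` stands in for whatever function performs `GET /v1/jobs/{job_id}` and returns the decoded JSON. It is injected as a callable so the example runs without a server.

```python
import time

def wait_for_job(get_status, job_id, timeout_sec=300.0, poll_interval_sec=2.0,
                 sleep=time.sleep):
    """Poll `get_status(job_id)` until the job completes, fails, or times out."""
    deadline = time.monotonic() + timeout_sec
    while True:
        job = get_status(job_id)
        if job['status'] == 'completed':
            return job
        if job['status'] == 'failed':
            raise RuntimeError(
                f"Job {job_id} failed: {job.get('error', 'unknown error')}"
            )
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job {job_id} still {job['status']} after {timeout_sec}s"
            )
        sleep(poll_interval_sec)

# Stand-in for the status endpoint: reports queued, running, then completed
_responses = iter([
    {'status': 'queued'},
    {'status': 'running'},
    {'status': 'completed', 'output_path': '/tmp/synthetic_demo.parquet'},
])

result = wait_for_job(
    lambda job_id: next(_responses),
    'demo',
    sleep=lambda _seconds: None,  # skip real waiting in this demo
)
print(result['status'])
```

Against a running service, `get_status` would wrap an HTTP client call, and the returned `output_path` would be replaced by a request to the download endpoint.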

**ROI Analysis:**

```python
from datetime import datetime, timedelta
from typing import Dict

class SyntheticDataROI:
    """Computes the ROI of adopting a synthetic data platform."""
    
    @staticmethod
    def calculate_roi(scenarios: Dict) -> Dict:
        """
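        Compare the cost of synthetic data against the alternatives.
        `scenarios` is currently unused: every input below is a hard-coded,
        illustrative assumption.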
        
        Scenarios:
        - Manual data creation
        - Production data copies
        - Synthetic data platform
        """
        
        # Baseline: Manual data creation
        manual_cost_per_hour = 50  # Engineer hourly rate
        manual_hours_per_dataset = 8  # One day to build a realistic dataset
        datasets_per_month = 4  # Sprints
        
        manual_monthly_cost = (
            manual_cost_per_hour * 
            manual_hours_per_dataset * 
            datasets_per_month
        )
        
        # Alternative: Production copies (risk costs)
        prod_copy_compute_cost = 100  # Monthly storage/compute
        data_breach_risk = 0.05  # 5% chance per year
        breach_cost = 500000  # Assumed cost per breach (illustrative; IBM's Cost of a Data Breach report puts the global average in the millions)
        
        prod_copy_monthly_cost = (
            prod_copy_compute_cost + 
            (data_breach_risk / 12) * breach_cost
        )
        
        # Synthetic data platform
        platform_llm_cost = 200  # GPT API
        platform_infra_cost = 150  # Hosting, storage
        platform_dev_cost_amortized = 5000 / 12  # $5K development / 12 months
        
        synthetic_monthly_cost = (
            platform_llm_cost +
            platform_infra_cost +
            platform_dev_cost_amortized
        )
        
        # Calculate savings
        manual_savings = manual_monthly_cost - synthetic_monthly_cost
        prod_copy_savings = prod_copy_monthly_cost - synthetic_monthly_cost
        
        # Time savings
        manual_time_hours = manual_hours_per_dataset * datasets_per_month
        synthetic_time_hours = 1 * datasets_per_month  # 1 hour to configure and review
        time_saved_hours = manual_time_hours - synthetic_time_hours
        
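        # ROI calculation
        # Caveats: summing both savings assumes the platform replaces manual
        # creation AND production copies, and the $5K development cost is
        # counted both in synthetic_monthly_cost (amortized) and here.
        annual_savings = (manual_savings + prod_copy_savings) * 12
        initial_investment = 5000  # Development cost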
        
        roi_pct = (annual_savings - initial_investment) / initial_investment * 100
        payback_months = initial_investment / (manual_savings + prod_copy_savings)
        
        return {
            'costs': {
                'manual_monthly': manual_monthly_cost,
                'prod_copy_monthly': prod_copy_monthly_cost,
                'synthetic_monthly': synthetic_monthly_cost
            },
            'savings': {
                'vs_manual_monthly': manual_savings,
                'vs_prod_copy_monthly': prod_copy_savings,
                'annual_total': annual_savings,
                'time_saved_hours_monthly': time_saved_hours
            },
            'roi': {
                'roi_pct': roi_pct,
                'payback_months': payback_months,
                'break_even_date': (
                    datetime.now() + timedelta(days=30*payback_months)
                ).strftime('%Y-%m-%d')
            }
        }

# Compute ROI
roi_analysis = SyntheticDataROI.calculate_roi({})

print("💰 SYNTHETIC DATA ROI ANALYSIS\n")
print("Monthly Costs:")
print(f"  Manual creation: ${roi_analysis['costs']['manual_monthly']:,.2f}")
print(f"  Prod copies: ${roi_analysis['costs']['prod_copy_monthly']:,.2f}")
print(f"  Synthetic platform: ${roi_analysis['costs']['synthetic_monthly']:,.2f}")

print("\nSavings:")
print(f"  vs Manual: ${roi_analysis['savings']['vs_manual_monthly']:,.2f}/month")
print(f"  vs Prod copies: ${roi_analysis['savings']['vs_prod_copy_monthly']:,.2f}/month")
print(f"  Annual total: ${roi_analysis['savings']['annual_total']:,.2f}")
print(f"  Time saved: {roi_analysis['savings']['time_saved_hours_monthly']:.0f} hours/month")

print("\nROI:")
print(f"  ROI: {roi_analysis['roi']['roi_pct']:.0f}%")
print(f"  Payback: {roi_analysis['roi']['payback_months']:.1f} months")
print(f"  Break-even: {roi_analysis['roi']['break_even_date']}")

# Typical output:
# 💰 SYNTHETIC DATA ROI ANALYSIS
#
# Monthly Costs:
#   Manual creation: $1,600.00
#   Prod copies: $2,183.33
#   Synthetic platform: $766.67
#
# Savings:
#   vs Manual: $833.33/month
#   vs Prod copies: $1,416.67/month
#   Annual total: $27,000.00
#   Time saved: 28 hours/month
#
# ROI:
#   ROI: 440%
#   Payback: 2.2 months
#   Break-even: 2025-01-15  (varies with the run date)
```
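
The combined figure above adds the savings against both alternatives. When the platform replaces only one of them, a per-baseline calculation gives a more conservative picture. The sketch below reuses the same illustrative inputs; the function name `roi_against_baseline` is hypothetical, and the development cost is treated purely as an upfront investment rather than also being amortized into the monthly cost.

```python
def roi_against_baseline(baseline_monthly, platform_monthly_opex, initial_investment):
    """First-year ROI and payback when the platform replaces ONE baseline practice."""
    monthly_savings = baseline_monthly - platform_monthly_opex
    if monthly_savings <= 0:
        # The platform costs more to run than the baseline: no payback
        return {'monthly_savings': monthly_savings, 'roi_pct': None, 'payback_months': None}
    annual_savings = monthly_savings * 12
    return {
        'monthly_savings': monthly_savings,
        'roi_pct': (annual_savings - initial_investment) / initial_investment * 100,
        'payback_months': initial_investment / monthly_savings,
    }

# Same illustrative inputs as calculate_roi: LLM API + infrastructure per month
platform_opex = 200 + 150

vs_manual = roi_against_baseline(1600, platform_opex, 5000)
vs_prod_copy = roi_against_baseline(100 + (0.05 / 12) * 500000, platform_opex, 5000)

print(f"vs manual:    ROI {vs_manual['roi_pct']:.0f}%, payback {vs_manual['payback_months']:.1f} months")
print(f"vs prod copy: ROI {vs_prod_copy['roi_pct']:.0f}%, payback {vs_prod_copy['payback_months']:.1f} months")
```

Under these assumptions the platform still pays for itself within the first year against either baseline, but the ROI is well below the combined figure.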

**Success Metrics (Dashboard):**

```python
from typing import Dict, List

import plotly.graph_objects as go

def create_synthetic_data_dashboard(metrics_history: List[Dict]):
    """Build a dashboard to monitor the platform."""
    
    fig = go.Figure()
    
    # Metric 1: generated volume
    fig.add_trace(go.Scatter(
        x=[m['date'] for m in metrics_history],
        y=[m['records_generated'] for m in metrics_history],
        name='Records Generated',
        mode='lines+markers'
    ))
    
    # Metric 2: quality score
    fig.add_trace(go.Scatter(
        x=[m['date'] for m in metrics_history],
        y=[m['quality_score'] for m in metrics_history],
        name='Quality Score',
        yaxis='y2'
    ))
    
    fig.update_layout(
        title='Synthetic Data Platform Metrics',
        xaxis=dict(title='Date'),
        yaxis=dict(title='Records Generated'),
        yaxis2=dict(title='Quality Score', overlaying='y', side='right'),
        hovermode='x unified'
    )
    
    return fig

# KPIs to track (example target values, not measurements):
kpis = {
    'volume': {
        'records_generated_monthly': 50000,
        'growth_mom': 0.15  # 15% growth month-over-month
    },
    'quality': {
        'avg_quality_score': 0.92,
        'quality_gate_pass_rate': 0.95  # 95% of runs pass the quality gate
    },
    'cost': {
        'cost_per_1k_records': 0.50,
        'cost_reduction_vs_manual': 0.70  # 70% cheaper than manual creation
    },
    'adoption': {
        'active_users': 25,
        'datasets_per_week': 12,
        'api_requests_per_day': 150
    },
    'impact': {
        'testing_coverage_increase': 0.40,  # 40% more tests thanks to synthetic data
        'prod_incidents_reduced': 0.25,  # 25% fewer bugs from better testing
        'time_to_env_setup_reduced': 0.60  # 60% faster environment setup
    }
}
```
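
The quality and cost KPIs can be derived from the same `metrics_history` list that feeds the dashboard. The sketch below assumes each entry also carries the `cost_usd` value tracked in `platform.metrics`; the function name `summarize_kpis` and the sample history are hypothetical.

```python
def summarize_kpis(metrics_history, quality_threshold=0.85):
    """Aggregate per-run metrics into the quality and cost KPIs."""
    if not metrics_history:
        raise ValueError("metrics_history is empty")
    runs = len(metrics_history)
    total_records = sum(m['records_generated'] for m in metrics_history)
    total_cost = sum(m['cost_usd'] for m in metrics_history)
    # A run passes the gate when its score meets the threshold
    passed = sum(1 for m in metrics_history if m['quality_score'] >= quality_threshold)
    return {
        'avg_quality_score': sum(m['quality_score'] for m in metrics_history) / runs,
        'quality_gate_pass_rate': passed / runs,
        'cost_per_1k_records': (
            total_cost / total_records * 1000 if total_records else 0.0
        ),
    }

# Hypothetical weekly runs
history = [
    {'date': '2024-01-01', 'records_generated': 10000, 'cost_usd': 5.0, 'quality_score': 0.92},
    {'date': '2024-01-08', 'records_generated': 12000, 'cost_usd': 6.0, 'quality_score': 0.80},
    {'date': '2024-01-15', 'records_generated': 8000, 'cost_usd': 4.0, 'quality_score': 0.95},
    {'date': '2024-01-22', 'records_generated': 10000, 'cost_usd': 5.0, 'quality_score': 0.90},
]

summary = summarize_kpis(history)
print(summary)
```

Comparing these computed values against the targets in `kpis` shows which targets a given period actually met.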

**Best Practices (Production):**

1. **Start Simple, Scale Gradually:**
   ```
   Phase 1 (Month 1-2): Generate test data for 1 team
   Phase 2 (Month 3-4): Add anonymization for prod copies
   Phase 3 (Month 5-6): Self-service API for all teams
   Phase 4 (Month 7+): Advanced features (time series, ML augmentation)
   ```

2. **Monitor Obsessively:**
   - Quality score trends (alert if drops below 0.85)
   - Cost per record (budget alerts)
   - Generation latency (p95 < 5 min)
   - API uptime (99.5% SLA)

3. **Version Everything:**
   - Schema versions (backward compatibility)
   - Generated datasets (reproducibility)
   - Generation configs (audit trail)

4. **Fail Fast:**
   - Quality gates (no bad data downstream)
   - Cost limits ($500/day max)
   - PII detection (block if detected)
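
The monitoring and fail-fast rules above can be sketched as two small guards. The thresholds come from the list (quality floor 0.85, p95 latency under 5 minutes, $500/day cost limit); the names `CostGuard`, `BudgetExceeded`, and `check_alerts` are hypothetical, and the daily reset of the budget is deliberately left out.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spending past the daily limit."""

class CostGuard:
    def __init__(self, daily_limit_usd=500.0):
        self.daily_limit_usd = daily_limit_usd
        self.spent_usd = 0.0

    def charge(self, amount_usd):
        # Fail fast: refuse the charge BEFORE the money is spent
        if self.spent_usd + amount_usd > self.daily_limit_usd:
            raise BudgetExceeded(
                f"charge of ${amount_usd:.2f} would exceed the "
                f"${self.daily_limit_usd:.2f} daily limit"
            )
        self.spent_usd += amount_usd

def check_alerts(snapshot, quality_floor=0.85, p95_latency_limit_sec=300.0):
    """Return the list of alert messages triggered by a metrics snapshot."""
    alerts = []
    if snapshot['quality_score'] < quality_floor:
        alerts.append('quality_score below floor')
    if snapshot['p95_latency_sec'] > p95_latency_limit_sec:
        alerts.append('p95 latency above limit')
    return alerts

guard = CostGuard(daily_limit_usd=500.0)
guard.charge(450.0)
try:
    guard.charge(100.0)  # would bring the total to $550
    blocked = False
except BudgetExceeded:
    blocked = True

alerts = check_alerts({'quality_score': 0.81, 'p95_latency_sec': 120.0})
print(blocked, alerts)
```

Calling `guard.charge(...)` before each LLM batch turns the cost limit into a hard stop rather than an after-the-fact report.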

---
**Author:** Luis J. Raigoso V. (LJRV)

## 9. Exercises

1. Generate a complete e-commerce dataset (customers, products, orders) with coherent relationships.
2. Create a generator of realistic application logs for testing pipelines.
3. Implement anonymization that preserves differential privacy.
4. Build a data augmenter for training ML models from few examples.