# 💰 Cost Optimization and FinOps in the Cloud

Objective: apply cloud cost optimization techniques (AWS/GCP/Azure), set up budgets, alerts, and benchmarks, and track efficiency metrics per workload.

- Duration: 120 min
- Difficulty: High
- Prerequisites: Mid 03 (AWS), experience with cloud pipelines

### 💡 **FinOps Framework: Cloud Cost Management as a Strategic Practice**

**What is FinOps?**

FinOps (Financial Operations) is the practice of bringing financial accountability to the variable spending model of the cloud, bringing together finance, technology, and business teams.

**Phases of the FinOps Lifecycle:**

```
┌─────────────────────────────────────────────────────────┐
│               FinOps Lifecycle                          │
├─────────────────────────────────────────────────────────┤
│  1. INFORM (Visibility)                                 │
│     • Cost allocation & tagging                         │
│     • Showback/Chargeback per team                      │
│     • Anomaly detection                                 │
│                                                         │
│  2. OPTIMIZE (Efficiency)                               │
│     • Rightsizing (reduce over-provisioning)            │
│     • Reserved Instances / Savings Plans                │
│     • Spot/Preemptible instances                        │
│     • Storage lifecycle policies                        │
│                                                         │
│  3. OPERATE (Culture)                                   │
│     • Budget owners per domain                          │
│     • Cost awareness in development                     │
│     • Continuous improvement                            │
└─────────────────────────────────────────────────────────┘
```

**Core Principles:**

1. **Teams Need to Collaborate**: Engineering, Finance, Product
2. **Everyone Takes Ownership**: Each team is responsible for its own spend
3. **Centralized Team Drives FinOps**: The FinOps team acts as facilitator
4. **Reports Should Be Accessible**: Self-service dashboards
5. **Decisions Driven by Business Value**: Cost vs. business impact
6. **Take Advantage of Variable Cost**: Elasticity as an advantage

**Responsibility Model:**

| Role | Responsibilities |
|------|------------------|
| **FinOps Team** | Policies, tooling, reporting, training |
| **Engineers** | Implement optimizations, cost-aware design |
| **Finance** | Forecasting, budgets, business case analysis |
| **Executives** | Approve investments, set strategic cost targets |
| **Product** | Prioritize features vs. cost, unit economics |

**Cost Allocation Strategy (Tagging):**

```python
# AWS resource tagging standard
tagging_strategy = {
    # Mandatory tags (billing)
    "Environment": ["dev", "staging", "prod"],
    "CostCenter": ["engineering", "data", "marketing"],
    "Project": ["data-platform", "ml-ops", "analytics"],
    "Owner": "email@company.com",
    
    # Optional (technical)
    "Application": "spark-etl",
    "ManagedBy": "terraform",
    "Compliance": ["gdpr", "sox", "hipaa"],
}
```

```hcl
# Terraform enforcement
resource "aws_s3_bucket" "data_lake" {
  bucket = "my-data-lake"
  
  tags = {
    Environment = "prod"
    CostCenter  = "data"
    Project     = "data-platform"
    Owner       = "data-team@company.com"
  }
}
```

```bash
# Tag compliance policy (AWS Config)
aws configservice put-config-rule --config-rule '{
  "ConfigRuleName": "required-tags",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "REQUIRED_TAGS"
  },
  "InputParameters": "{\"tag1Key\":\"CostCenter\",\"tag2Key\":\"Project\"}"
}'
```
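
The tagging standard can also be checked in code before resources reach billing. The following is a minimal sketch, assuming the allowed values from the standard above; `REQUIRED_TAGS` and `find_tag_violations` are hypothetical names introduced here for illustration and are not part of any AWS API.

```python
# Hypothetical validator for the tagging standard (illustrative only).
REQUIRED_TAGS = {
    "Environment": {"dev", "staging", "prod"},
    "CostCenter": {"engineering", "data", "marketing"},
    "Project": {"data-platform", "ml-ops", "analytics"},
}


def find_tag_violations(tags: dict) -> list:
    """Return human-readable violations for one resource's tags."""
    violations = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key)
        if value is None:
            violations.append(f"missing tag: {key}")
        elif value not in allowed:
            violations.append(f"invalid value for {key}: {value!r}")
    # Owner is mandatory and is assumed to be an email address
    if "@" not in tags.get("Owner", ""):
        violations.append("Owner must be an email address")
    return violations


compliant = {
    "Environment": "prod",
    "CostCenter": "data",
    "Project": "data-platform",
    "Owner": "data-team@company.com",
}
print(find_tag_violations(compliant))               # no violations
print(find_tag_violations({"Environment": "qa"}))   # several violations
```

A check like this could run in CI against Terraform plan output, so untagged resources are caught before they are created.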

**Showback vs Chargeback:**

```python
# Showback: report costs to each team (no internal billing)
monthly_report = {
    "team": "Data Engineering",
    "month": "2025-10",
    "breakdown": {
        "compute": {"ec2": 1200, "emr": 3500},
        "storage": {"s3": 800, "ebs": 400},
        "networking": {"data_transfer": 300},
        "total": 6200
    },
    "trend": "+15% vs last month",
    "top_resources": [
        {"resource": "emr-cluster-prod", "cost": 2800},
        {"resource": "ec2-spark-workers", "cost": 900}
    ]
}

# Chargeback: bill each team internally
chargeback_policy = {
    "data_team": {
        "budget": 10000,
        "actual": 6200,
        "remaining": 3800,
        "action": "Approved" if 6200 < 10000 else "Budget Review"
    }
}
```
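
The chargeback figures above are hardcoded. As a sketch of how they could be derived from a showback breakdown, the hypothetical helper `chargeback_status` below sums the per-service costs and compares them with the budget; the function name and the `utilization_pct` field are assumptions added for illustration.

```python
def chargeback_status(budget: float, breakdown: dict) -> dict:
    """Derive chargeback numbers from a {category: {service: cost}} breakdown."""
    actual = sum(
        cost for services in breakdown.values() for cost in services.values()
    )
    return {
        "budget": budget,
        "actual": actual,
        "remaining": budget - actual,
        "utilization_pct": round(actual / budget * 100, 1),
        "action": "Approved" if actual < budget else "Budget Review",
    }


breakdown = {
    "compute": {"ec2": 1200, "emr": 3500},
    "storage": {"s3": 800, "ebs": 400},
    "networking": {"data_transfer": 300},
}
status = chargeback_status(10000, breakdown)
print(status)
```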

**Unit Economics for Data Platforms:**

```
┌────────────────────────────────────────────────┐
│  Efficiency Metrics                            │
├────────────────────────────────────────────────┤
│  • Cost per TB Processed:  $5 - $50            │
│    (depends on transformation complexity)      │
│                                                │
│  • Cost per Million Events: $0.10 - $5.00      │
│    (streaming platforms)                       │
│                                                │
│  • Cost per Query: $0.001 - $1.00              │
│    (BigQuery, Athena)                          │
│                                                │
│  • Cost per ML Training Run: $10 - $500        │
│    (model size, GPU hours)                     │
│                                                │
│  • Storage Cost per TB/month: $1 - $24         │
│    (Glacier Deep Archive → Standard)           │
└────────────────────────────────────────────────┘
```
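
Each metric above is a total cost divided by a unit of work. A minimal sketch follows; the 400 TB volume is an assumed figure for illustration, and `unit_cost` is a hypothetical helper name.

```python
def unit_cost(total_cost_usd: float, units: float) -> float:
    """Cost per unit of work (TB processed, million events, queries, ...)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost_usd / units


# Assumed example: $6,200 of monthly platform spend, 400 TB processed
cost_per_tb = unit_cost(6200, 400)
print(f"Cost per TB processed: ${cost_per_tb:.2f}")
```

Tracking this ratio over time separates growth in spend that follows growth in workload from growth caused by inefficiency.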

**Case Study: Netflix FinOps**

Netflix reportedly processes ~15 PB/day on an AWS budget of ~$300M/year:

- **Tagging Strategy**: 100% of resources tagged with a show/movie ID
- **Spot Instances**: 80% of batch workloads on Spot (70% savings)
- **S3 Lifecycle**: 90% of content in Glacier ($50M/year savings)
- **Regional Optimization**: Data replication only in active regions
- **Custom Metrics**: Cost per streaming hour per user

**FinOps Dashboard (Grafana):**

```python
# Prometheus queries for the dashboard
queries = {
    "total_monthly_cost": 'sum(aws_cost_total{period="monthly"})',
    "cost_by_service": 'sum by (service) (aws_cost_total)',
    "cost_by_team": 'sum by (tag_costcenter) (aws_cost_total)',
    "savings_opportunity": 'sum(aws_rightsizing_recommendations)',
    # rate() returns a per-second value: scale to per-day, then to 30 days
    "budget_burn_rate": 'rate(aws_cost_total[7d]) * 86400 * 30',
}

# Alerts
alerts = {
    "budget_80_percent": "Monthly cost > 80% of budget",
    "cost_spike": "Cost increase > 50% vs last week",
    "untagged_resources": "Resources without CostCenter tag > 5%",
}
```
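
The budget alerts above can be approximated outside Prometheus with a simple run-rate projection. This is a sketch under the assumption of roughly linear spend within a month; `project_month_end`, `budget_alert`, and the `projected_overrun` label are hypothetical names introduced here.

```python
import calendar
from datetime import date
from typing import Optional


def project_month_end(spend_to_date: float, today: date) -> float:
    """Linear run-rate projection of month-end spend."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def budget_alert(spend_to_date: float, budget: float, today: date,
                 threshold: float = 0.8) -> Optional[str]:
    """Return an alert name, or None when spend is on track."""
    if spend_to_date >= threshold * budget:
        return "budget_80_percent"
    if project_month_end(spend_to_date, today) > budget:
        return "projected_overrun"
    return None


print(project_month_end(3100, date(2025, 10, 10)))
print(budget_alert(5000, 10000, date(2025, 10, 10)))
```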

---
**Author:** Luis J. Raigoso V. (LJRV)

### ⚡ **Compute Optimization: Rightsizing, Spot & Autoscaling**

**Rightsizing: Eliminating Over-Provisioning**

An estimated 30-40% of cloud resources are over-provisioned. Rightsizing matches instance size to actual usage.

**AWS EC2 Rightsizing Process:**

```python
import boto3
import pandas as pd
from datetime import datetime, timedelta

# CloudWatch metrics para rightsizing
cloudwatch = boto3.client('cloudwatch')
ce = boto3.client('ce')  # Cost Explorer

def get_instance_utilization(instance_id, days=14):
    """Summarize CPU utilization for one instance.

    Memory is not covered: EC2 only reports it when the CloudWatch agent is installed.
    """
    end = datetime.utcnow()
    start = end - timedelta(days=days)
    
    # CPU utilization, one datapoint per hour
    cpu_response = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=start,
        EndTime=end,
        Period=3600,  # 1 hour
        Statistics=['Average', 'Maximum']
    )
    
    datapoints = cpu_response['Datapoints']
    if not datapoints:
        # Stopped or newly launched instances may report no data
        return {
            'instance_id': instance_id,
            'cpu_avg': float('nan'),
            'cpu_max': float('nan'),
            'recommendation': 'insufficient-data'
        }
    
    cpu_avg = sum(dp['Average'] for dp in datapoints) / len(datapoints)
    cpu_max = max(dp['Maximum'] for dp in datapoints)
    
    return {
        'instance_id': instance_id,
        'cpu_avg': cpu_avg,
        'cpu_max': cpu_max,
        'recommendation': 'downsize' if cpu_avg < 20 else 'keep' if cpu_avg < 70 else 'upsize'
    }

# Rightsizing recommendations
def generate_rightsizing_report(ec2_client):
    instances = ec2_client.describe_instances()
    
    recommendations = []
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            inst_id = instance['InstanceId']
            inst_type = instance['InstanceType']
            
            metrics = get_instance_utilization(inst_id)
            
            # Recommendation logic. get_instance_price() and get_smaller_instance()
            # are helpers that must be defined elsewhere (e.g., backed by a price list).
            if metrics['cpu_avg'] < 20 and metrics['cpu_max'] < 50:
                # Over-provisioned
                current_price = get_instance_price(inst_type)
                recommended_type = get_smaller_instance(inst_type)
                recommended_price = get_instance_price(recommended_type)
                
                recommendations.append({
                    'instance_id': inst_id,
                    'current_type': inst_type,
                    'current_cost': current_price * 730,  # monthly (730 hours)
                    'recommended_type': recommended_type,
                    'recommended_cost': recommended_price * 730,
                    'savings_pct': ((current_price - recommended_price) / current_price) * 100
                })
    
    return pd.DataFrame(recommendations)

# Example output
"""
instance_id      current_type  current_cost  recommended_type  recommended_cost  savings_pct
i-0123456789   m5.4xlarge     $500          m5.2xlarge        $250              50%
i-abcdef1234   c5.9xlarge     $1200         c5.4xlarge        $600              50%
Total potential savings: $850/month ($10,200/year)
"""
```
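
The report above calls `get_smaller_instance`, which the source does not define. One possible sketch is shown below, assuming a hand-maintained size ladder per instance family; the `FAMILY_SIZES` table is an illustrative assumption that covers only the m5 and c5 families and would need to be extended for real use.

```python
from typing import Optional

# Assumed size ladders, smallest to largest (illustrative, not exhaustive)
FAMILY_SIZES = {
    "m5": ["large", "xlarge", "2xlarge", "4xlarge",
           "8xlarge", "12xlarge", "16xlarge", "24xlarge"],
    "c5": ["large", "xlarge", "2xlarge", "4xlarge",
           "9xlarge", "12xlarge", "18xlarge", "24xlarge"],
}


def get_smaller_instance(instance_type: str) -> Optional[str]:
    """Return the next size down in the same family, or None if unavailable."""
    family, _, size = instance_type.partition(".")
    sizes = FAMILY_SIZES.get(family)
    if sizes is None or size not in sizes:
        return None  # Unknown family or size: make no recommendation
    index = sizes.index(size)
    if index == 0:
        return None  # Already the smallest size in the ladder
    return f"{family}.{sizes[index - 1]}"


print(get_smaller_instance("m5.4xlarge"))
print(get_smaller_instance("c5.9xlarge"))
```

Callers should handle the `None` case before looking up a price for the recommended type.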

**Spot Instances: 70-90% Savings**

Spot instances are spare AWS/GCP/Azure capacity sold at steep discounts, ideal for workloads that tolerate interruptions.

**Spot Use Cases:**
- ✅ Batch ETL (retryable)
- ✅ ML Training (checkpoints)
- ✅ Data analytics (stateless)
- ✅ CI/CD builds
- ❌ Databases (stateful)
- ❌ Real-time APIs (low latency)

**EMR with Spot Instances:**

```python
# AWS EMR cluster: 8 Spot task nodes + 3 On-Demand nodes (~73% Spot).
# Shape follows the boto3 `emr.run_job_flow` request; other required
# fields (ReleaseLabel, IAM roles, ...) are omitted here.
emr_cluster_config = {
    "Name": "etl-cluster-spot",
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Master",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
                "Market": "ON_DEMAND"  # Master always On-Demand
            },
            {
                "Name": "Core",
                "InstanceRole": "CORE",
                "InstanceType": "r5.2xlarge",
                "InstanceCount": 2,
                "Market": "ON_DEMAND"  # Core On-Demand (holds HDFS data)
            },
            {
                "Name": "Task - Spot",
                "InstanceRole": "TASK",
                "InstanceType": "r5.4xlarge",
                "InstanceCount": 8,
                "Market": "SPOT"  # No BidPrice: capped at the On-Demand price
            }
        ]
    },
    "Applications": [{"Name": "Spark"}],
    "Configurations": [
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.speculation": "true",  # Re-launch slow-running tasks
                "spark.speculation.multiplier": "2",
                "spark.task.maxFailures": "8"  # Tolerate tasks lost to reclaimed nodes
            }
        }
    ]
}

# Cost comparison (USD per hour)
costs = {
    "on_demand": {
        "master": 0.192 * 1,
        "core": 0.504 * 2,
        "task": 1.008 * 8,
        "total_hourly": 9.264
    },
    "spot_mix": {
        "master": 0.192 * 1,
        "core": 0.504 * 2,
        "task": 0.303 * 8,  # ~70% discount
        "total_hourly": 3.624
    },
    "savings": (9.264 - 3.624) / 9.264 * 100  # ~60.9% savings
}
```
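
The totals in the comparison can be computed rather than typed by hand, which avoids arithmetic slips. A minimal sketch, reusing the hourly prices from the block above; `cluster_hourly_cost` and `savings_pct` are hypothetical helper names.

```python
def cluster_hourly_cost(groups: list) -> float:
    """Sum price * count over (hourly_price, instance_count) pairs."""
    return sum(price * count for price, count in groups)


def savings_pct(baseline: float, optimized: float) -> float:
    """Percentage saved relative to the baseline cost."""
    return (baseline - optimized) / baseline * 100


# (hourly price in USD, instance count) for master, core, and task nodes
on_demand = cluster_hourly_cost([(0.192, 1), (0.504, 2), (1.008, 8)])
spot_mix = cluster_hourly_cost([(0.192, 1), (0.504, 2), (0.303, 8)])

print(f"On-Demand: ${on_demand:.3f}/h")
print(f"Spot mix:  ${spot_mix:.3f}/h")
print(f"Savings:   {savings_pct(on_demand, spot_mix):.1f}%")
```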

**Spot Interruption Handling:**

```python
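# Graceful termination using the 2-minute Spot interruption notice
import requests

IMDS = 'http://169.254.169.254/latest'

def check_spot_termination():
    """Poll EC2 instance metadata (IMDSv2) for a Spot interruption notice.

    save_checkpoint() and shutdown_gracefully() are application-specific
    hooks that must be defined elsewhere.
    """
    try:
        # IMDSv2 requires a short-lived session token
        token = requests.put(
            f'{IMDS}/api/token',
            headers={'X-aws-ec2-metadata-token-ttl-seconds': '60'},
            timeout=1
        ).text
        response = requests.get(
            f'{IMDS}/meta-data/spot/instance-action',
            headers={'X-aws-ec2-metadata-token': token},
            timeout=1
        )
    except requests.RequestException:
        # Metadata service unreachable (e.g., not running on EC2)
        return False

    if response.status_code != 200:
        # 404 means no interruption is scheduled
        return False

    # The Spot instance will be reclaimed in about 2 minutes
    action = response.json()
    print(f"⚠️ Spot termination: {action['action']} at {action['time']}")

    save_checkpoint()       # Persist progress
    shutdown_gracefully()   # Drain work and exit
    return True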

# Spark settings for resilience on Spot
spark_conf = {
    "spark.speculation": "true",
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.checkpoint.dir": "s3://bucket/checkpoints/"
}
```

**Autoscaling: Dynamic Capacity**

```python
# AWS Auto Scaling Group with target tracking
asg_config = {
    "AutoScalingGroupName": "data-workers-asg",
    "MinSize": 2,
    "MaxSize": 20,
    "DesiredCapacity": 5,
    "LaunchTemplate": {
        "LaunchTemplateId": "lt-xxx",
        "Version": "$Latest"
    },
    "VPCZoneIdentifier": "subnet-a,subnet-b,subnet-c",
    "HealthCheckType": "ELB",
    "HealthCheckGracePeriod": 300,
    "TargetGroupARNs": ["arn:aws:elasticloadbalancing:..."],
    
    # Scaling policies (shown inline for brevity; with boto3 each one is
    # attached separately via put_scaling_policy)
    "TargetTrackingConfiguration": [
        {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 70.0  # Keep average CPU near 70%
        },
        {
            # Custom metric: SQS queue depth
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateNumberOfMessagesVisible",
                "Namespace": "AWS/SQS",
                "Statistic": "Average",
                "Unit": "Count"
            },
            "TargetValue": 100  # Scale out when the queue exceeds ~100 messages
        }
    ]
}

# Kubernetes HPA (Horizontal Pod Autoscaler)
"""
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: spark-driver
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: spark-driver
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: pending_tasks
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5min before scale-down
      policies:
      - type: Percent
        value: 50  # Scale down max 50% at a time
        periodSeconds: 60
"""
```

**Case Study: Lyft - 75% Spot Coverage**

Lyft runs all of its analytics on EMR with:
- **Task Nodes**: 100% Spot (10,000+ instances)
- **Core Nodes**: On-Demand (HDFS persistence)
- **Checkpointing**: To S3 every 5 minutes
- **Interruption Rate**: ~10% (tolerated via retries)
- **Annual savings**: ~$20M vs. full On-Demand

**Best Practices:**
1. **Diversify Instance Types**: Spread across Spot pools (c5, m5, r5)
2. **Capacity Rebalancing**: AWS proactively replaces instances before termination
3. **Allocation Strategy**: `price-capacity-optimized`
4. **Checkpointing**: Save state every N minutes
5. **Monitoring**: Track the interruption rate in CloudWatch

---

### 💾 **Storage Optimization: Lifecycle, Compression & Deduplication**

**S3 Storage Classes: Cost vs Access Trade-off**

The seven S3 storage classes below trade price against access latency:

| Storage Class | Use Case | Latency | Cost/GB/month | Retrieval Cost |
|---------------|----------|---------|---------------|----------------|
| **S3 Standard** | Hot data (frequent access) | ms | $0.023 | $0 |
| **S3 Intelligent-Tiering** | Auto-optimization | ms | $0.0125-$0.023 | $0 |
| **S3 Standard-IA** | Infrequent access | ms | $0.0125 | $0.01/GB |
| **S3 One Zone-IA** | Non-critical, infrequent | ms | $0.01 | $0.01/GB |
| **S3 Glacier Instant** | Archive, instant access | ms | $0.004 | $0.03/GB |
| **S3 Glacier Flexible** | Archive, minutes to hours | 1-5 min (expedited) to hours | $0.0036 | $0.02/GB |
| **S3 Glacier Deep Archive** | Long-term archive | 12 hours | $0.00099 | $0.02/GB |
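
Retrieval fees mean a colder class is only cheaper below a certain read frequency. The sketch below estimates that break-even point from the table's prices; it deliberately ignores request charges, minimum storage durations, and minimum object sizes, and both function names are hypothetical.

```python
def monthly_cost_per_gb(storage_price: float, retrieval_price: float,
                        reads_per_month: float) -> float:
    """Storage plus retrieval cost for 1 GB that is fully read N times a month."""
    return storage_price + retrieval_price * reads_per_month


def break_even_reads(hot_storage: float, cold_storage: float,
                     cold_retrieval: float) -> float:
    """Full reads per month above which the colder class costs more."""
    if cold_retrieval <= 0:
        raise ValueError("cold_retrieval must be positive")
    return (hot_storage - cold_storage) / cold_retrieval


# S3 Standard ($0.023, free retrieval) vs. Standard-IA ($0.0125 + $0.01/GB)
threshold = break_even_reads(0.023, 0.0125, 0.01)
print(f"Standard-IA is cheaper below {threshold:.2f} full reads per month")
```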

**Lifecycle Policy (Automated Tiering):**

```python
import boto3

s3 = boto3.client('s3')

lifecycle_policy = {
    'Rules': [
        {
            'Id': 'data-lake-lifecycle',
            'Status': 'Enabled',
            'Filter': {'Prefix': 'raw/'},
            'Transitions': [
                {
                    'Days': 30,
                    'StorageClass': 'STANDARD_IA'  # After 30 days
                },
                {
                    'Days': 90,
                    'StorageClass': 'GLACIER_IR'  # After 90 days
                },
                {
                    'Days': 365,
                    'StorageClass': 'DEEP_ARCHIVE'  # After 1 year
                }
            ],
            'Expiration': {
                'Days': 2555  # Delete after 7 years (compliance)
            },
            'NoncurrentVersionTransitions': [
                {
                    'NoncurrentDays': 30,
                    'StorageClass': 'GLACIER_IR'
                }
            ],
            'NoncurrentVersionExpiration': {
                'NoncurrentDays': 90
            }
        },
        {
            'Id': 'delete-incomplete-multipart-uploads',
            'Status': 'Enabled',
            'Filter': {},
            'AbortIncompleteMultipartUpload': {
                'DaysAfterInitiation': 7  # Clean up failed uploads
            }
        },
        {
            'Id': 'logs-retention',
            'Status': 'Enabled',
            'Filter': {'Prefix': 'logs/'},
            'Expiration': {
                'Days': 90  # Keep logs for 90 days only
            }
        }
    ]
}

s3.put_bucket_lifecycle_configuration(
    Bucket='my-data-lake',
    LifecycleConfiguration=lifecycle_policy
)

# Cost calculation (average monthly cost over the first year)
storage_breakdown = {
    "raw_data": {
        "size_tb": 100,
        "access_pattern": "write-once, read-rarely",
        "standard_cost": 100 * 1024 * 0.023,  # $2,355/month
        "with_lifecycle": {
            # Each tier is weighted by the share of the year spent in it
            "standard_30d": 100 * 1024 * 0.023 * (30/365),      # ~$194
            "ia_60d": 100 * 1024 * 0.0125 * (60/365),           # ~$210
            "glacier_ir_275d": 100 * 1024 * 0.004 * (275/365),  # ~$309
            "total": 713  # ~$713/month (~70% savings)
        }
    }
}
```
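
The time-weighted estimate can be generalized to any tiering schedule. The sketch below assumes the 30/90-day transitions of the lifecycle policy and the per-GB prices from the storage class table; `average_monthly_cost` is a hypothetical helper, and the model ignores transition requests and retrieval fees.

```python
def average_monthly_cost(size_gb: float, tiers: list,
                         horizon_days: int = 365) -> float:
    """Average monthly storage cost over the horizon.

    tiers: (start_day, price_per_gb_month) pairs sorted by start_day.
    """
    total = 0.0
    for i, (start, price) in enumerate(tiers):
        # A tier lasts until the next one starts, or until the horizon ends
        end = tiers[i + 1][0] if i + 1 < len(tiers) else horizon_days
        end = min(end, horizon_days)
        if end > start:
            total += size_gb * price * (end - start) / horizon_days
    return total


size_gb = 100 * 1024
baseline = average_monthly_cost(size_gb, [(0, 0.023)])
tiered = average_monthly_cost(size_gb, [(0, 0.023), (30, 0.0125), (90, 0.004)])
print(f"Standard only:  ${baseline:,.0f}/month")
print(f"With lifecycle: ${tiered:,.0f}/month")
print(f"Savings:        {(1 - tiered / baseline) * 100:.0f}%")
```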

**Compression: Reduce Storage & Transfer Costs**

The choice of codec and file format has a significant impact on cost and performance.

**Compression Codecs Comparison:**

```python
import os
import time

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Dataset: 1 billion rows, 10 columns (illustrative scale; lower the row
# count to run this on a single machine)
sample_data = pd.DataFrame({
    'user_id': range(1_000_000_000),
    'timestamp': pd.date_range('2020-01-01', periods=1_000_000_000, freq='s'),
    'event_type': ['click'] * 500_000_000 + ['purchase'] * 500_000_000,
    'amount': [10.5] * 1_000_000_000,
    # ... 6 more columns
})

# Benchmark different codecs
codecs = ['snappy', 'gzip', 'zstd', 'lz4']
results = []

# Convert once; the same Arrow table is written with every codec
table = pa.Table.from_pandas(sample_data)

for codec in codecs:
    path = f'output_{codec}.parquet'

    # Write Parquet
    pq.write_table(
        table,
        path,
        compression=codec,
        compression_level=3 if codec == 'zstd' else None
    )
    
    # Measure file size
    file_size = os.path.getsize(path) / (1024**3)  # GB
    
    # Read benchmark
    start = time.time()
    df = pq.read_table(path).to_pandas()
    read_time = time.time() - start
    
    results.append({
        'codec': codec,
        'file_size_gb': file_size,
        'compression_ratio': 100 / file_size,  # Original 100 GB
        'read_time_sec': read_time,
        'storage_cost_monthly': file_size * 0.023  # S3 Standard
    })

df_results = pd.DataFrame(results)
"""
codec    file_size_gb  compression_ratio  read_time_sec  storage_cost_monthly
snappy   15.2          6.6x               45             $0.35
gzip     8.4           11.9x              120            $0.19
zstd     7.1           14.1x              55             $0.16  ← Winner
lz4      16.8          5.9x               38             $0.39
"""
```

**Recomendaciones por Codec:**

```
┌────────────────────────────────────────────────────────┐
│  SNAPPY                                                │
│  • Use: Spark default, good balance                    │
│  • Compression: 6-8x (moderate)                        │
│  • CPU: Low overhead                                   │
│  • Splittable: ✅ (with Parquet)                       │
│  • Use case: daily ETL, frequent queries               │
├────────────────────────────────────────────────────────┤
│  ZSTD (Zstandard)                                      │
│  • Use: Best ratio, good speed                         │
│  • Compression: 12-15x (high)                          │
│  • CPU: Moderate                                       │
│  • Splittable: ✅                                      │
│  • Use case: archives, long-term storage               │
├────────────────────────────────────────────────────────┤
│  GZIP                                                  │
│  • Use: Legacy, high compression                       │
│  • Compression: 10-12x                                 │
│  • CPU: High overhead (slow decode)                    │
│  • Splittable: ❌ (without Parquet)                    │
│  • Use case: logs, plain text                          │
├────────────────────────────────────────────────────────┤
│  LZ4                                                   │
│  • Use: Ultra-fast, low compression                    │
│  • Compression: 4-6x                                   │
│  • CPU: Very low                                       │
│  • Splittable: ✅                                      │
│  • Use case: streaming, low-latency queries            │
└────────────────────────────────────────────────────────┘
```

**Parquet Optimization (Row Groups & Page Size):**

```python
# Tuned Parquet write settings
pq.write_table(
    table,
    'optimized.parquet',
    compression='zstd',
    compression_level=3,
    # row_group_size is a maximum number of ROWS, not bytes; pick a value
    # that puts row groups near the target size (e.g., ~128 MB)
    row_group_size=1_000_000,
    data_page_size=1024 * 1024,  # 1 MB pages (the default)
    # Dictionary-encode only the low-cardinality column; high-cardinality
    # columns such as user_id fall back to plain encoding. (column_encoding
    # cannot be combined with dictionary encoding in pyarrow.)
    use_dictionary=['event_type'],
)

# Savings from dictionary encoding
"""
Column: event_type (1B rows, 10 unique values)
Without dict: 1B * 20 bytes = 20 GB
With dict:    10 * 20 bytes + 1B * 4 bytes (indices) = 4 GB (80% savings!)
"""
```
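
The estimate above can be reproduced with a few lines of arithmetic. The sketch below follows the same simplified model (fixed-width values and indices) and adds a bit-packed variant, since Parquet stores dictionary indices with a bit width of about log2 of the number of distinct values; run-length encoding, which can shrink the column further, is not modeled. All function names are hypothetical.

```python
import math


def plain_size_bytes(rows: int, avg_value_bytes: int) -> int:
    """Size when every value is stored in full."""
    return rows * avg_value_bytes


def dict_size_bytes(rows: int, unique: int, avg_value_bytes: int,
                    index_bytes: int = 4) -> int:
    """Dictionary page plus one fixed-width index per row."""
    return unique * avg_value_bytes + rows * index_bytes


def dict_size_bitpacked_bytes(rows: int, unique: int, avg_value_bytes: int) -> int:
    """Dictionary page plus bit-packed indices (before run-length encoding)."""
    bit_width = max(1, math.ceil(math.log2(unique)))
    return unique * avg_value_bytes + rows * bit_width // 8


rows, unique, width = 1_000_000_000, 10, 20
print(f"Plain:      {plain_size_bytes(rows, width) / 1e9:.1f} GB")
print(f"Dictionary: {dict_size_bytes(rows, unique, width) / 1e9:.1f} GB")
print(f"Bit-packed: {dict_size_bitpacked_bytes(rows, unique, width) / 1e9:.1f} GB")
```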

**Deduplication: Eliminating Redundancy**

```python
# Delta Lake deduplication
from delta.tables import DeltaTable

# Merge with deduplication
delta_table = DeltaTable.forPath(spark, 's3://bucket/delta-table')

# Drop duplicates inside the incoming batch first: MERGE alone inserts
# every unmatched source row, including repeated ones
new_data = (
    spark.read.parquet('s3://bucket/new-data/')
    .dropDuplicates(['id', 'timestamp'])
)

# Insert only rows that do not already exist in the table
delta_table.alias('old').merge(
    new_data.alias('new'),
    'old.id = new.id AND old.timestamp = new.timestamp'
).whenNotMatchedInsertAll().execute()

# Z-Order to reduce file scanning (performance + cost)
delta_table.optimize().executeZOrderBy('date', 'customer_id')

# Vacuum old versions (reduce storage)
delta_table.vacuum(retentionHours=168)  # 7-day retention
```

**Case Study: Pinterest - $20M Storage Savings**

Pinterest cut its S3 costs from $50M to $30M/year:

1. **Lifecycle Policies**: 70% of data moved to Glacier (after 90 days)
2. **Compression**: CSV → Parquet+ZSTD (10x compression)
3. **Deduplication**: Delta Lake merge (removes 30% duplicates)
4. **Intelligent Tiering**: Auto-move cold data (15% savings)
5. **Vacuum**: Delete old versions (retain 7 days instead of 30)

**Terraform IaC for Lifecycle:**

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "tiered-archive"  # Fixed-schedule tiering (not S3 Intelligent-Tiering)
    status = "Enabled"

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR"
    }

    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }

    expiration {
      days = 2555  # 7 years
    }
  }

  rule {
    id     = "abort-incomplete-uploads"
    status = "Enabled"

    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }
}
```

**Cost Monitoring Query (AWS Cost Explorer API):**

```python
import boto3

ce = boto3.client('ce')

response = ce.get_cost_and_usage(
    TimePeriod={
        'Start': '2025-10-01',
        'End': '2025-11-01'  # End date is exclusive; this covers all of October
    },
    Granularity='DAILY',
    Metrics=['UnblendedCost'],
    GroupBy=[
        {'Type': 'DIMENSION', 'Key': 'SERVICE'},
        {'Type': 'DIMENSION', 'Key': 'USAGE_TYPE'}
    ],
    Filter={
        'Dimensions': {
            'Key': 'SERVICE',
            'Values': ['Amazon Simple Storage Service']
        }
    }
)

# Break down costs by usage type (which reflects the storage class)
for result in response['ResultsByTime']:
    print(f"Date: {result['TimePeriod']['Start']}")
    for group in result['Groups']:
        service, usage = group['Keys']
        cost = float(group['Metrics']['UnblendedCost']['Amount'])
        if cost > 0:
            print(f"  {usage}: ${cost:.2f}")
```

---

### 📊 **Cost Monitoring & Anomaly Detection: Observability with FinOps**

**Cost Observability Stack**

```
┌──────────────────────────────────────────────────────┐
│  Cost Data Sources                                   │
├──────────────────────────────────────────────────────┤
│  • AWS Cost & Usage Report (CUR) → S3 → Athena      │
│  • GCP Billing Export → BigQuery                     │
│  • Azure Cost Management → Export to Storage         │
│                                                       │
│  ↓                                                    │
│  Data Processing                                     │
│  • Spark/Airflow ETL (enrich with tags)             │
│  • Cost allocation (chargeback logic)                │
│  • Anomaly detection (ML models)                     │
│                                                       │
│  ↓                                                    │
│  Visualization & Alerting                            │
│  • Grafana dashboards (real-time)                    │
│  • Slack/PagerDuty alerts                            │
│  • Weekly reports (email)                            │
└──────────────────────────────────────────────────────┘
```

**AWS Cost & Usage Report (CUR) Setup:**

```python
import boto3
import pandas as pd

# Enable CUR
cur = boto3.client('cur', region_name='us-east-1')

cur.put_report_definition(
    ReportDefinition={
        'ReportName': 'data-platform-cur',
        'TimeUnit': 'HOURLY',  # Hourly granularity
        'Format': 'Parquet',  # Columnar; cheaper to query than CSV
        'Compression': 'Parquet',
        'AdditionalSchemaElements': ['RESOURCES', 'SPLIT_COST_ALLOCATION_DATA'],
        'S3Bucket': 'cost-reports-bucket',
        'S3Prefix': 'cur/',
        'S3Region': 'us-east-1',
        'AdditionalArtifacts': ['ATHENA'],  # Generates the Athena integration files
        'RefreshClosedReports': True,
        'ReportVersioning': 'OVERWRITE_REPORT'
    }
)

# Athena query to analyze the CUR
athena = boto3.client('athena')

query = """
SELECT 
    line_item_usage_account_id,
    product_servicename,
    resource_tags_user_cost_center,
    resource_tags_user_project,
    DATE(line_item_usage_start_date) as date,
    SUM(line_item_unblended_cost) as daily_cost
FROM cur_database.cost_usage_report
WHERE year = '2025' 
  AND month = '10'
  AND line_item_line_item_type = 'Usage'
GROUP BY 1, 2, 3, 4, 5
ORDER BY daily_cost DESC
LIMIT 100
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'cur_database'},
    ResultConfiguration={'OutputLocation': 's3://query-results/'}
)
```

**Anomaly Detection con Machine Learning:**

```python
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from datetime import datetime, timedelta

# Load historical cost data
cost_history = pd.read_parquet('s3://cost-data/daily_costs.parquet')

# Feature engineering
cost_history['day_of_week'] = pd.to_datetime(cost_history['date']).dt.dayofweek
cost_history['week_of_year'] = pd.to_datetime(cost_history['date']).dt.isocalendar().week

# Train anomaly detector
features = ['daily_cost', 'day_of_week', 'week_of_year']
X = cost_history[features]

model = IsolationForest(
    contamination=0.05,  # Expect ~5% of days to be anomalous
    random_state=42
)
model.fit(X)

# Score today's cost. get_today_cost() and send_alert() are
# project-specific helpers defined elsewhere.
today_cost = get_today_cost()
now = datetime.now()
today_features = pd.DataFrame(
    [[today_cost, now.weekday(), now.isocalendar()[1]]],
    columns=features  # Same column names the model was fitted on
)

anomaly_score = model.decision_function(today_features)[0]
is_anomaly = model.predict(today_features)[0] == -1

if is_anomaly:
    # Baseline: mean cost on the same weekday. Also matching week_of_year
    # would leave at most one sample per year of history.
    expected_cost = cost_history[
        cost_history['day_of_week'] == now.weekday()
    ]['daily_cost'].mean()
    
    deviation_pct = ((today_cost - expected_cost) / expected_cost) * 100
    
    send_alert(
        title=f"⚠️ Cost Anomaly Detected",
        message=f"Today's cost: ${today_cost:.2f}\n"
                f"Expected: ${expected_cost:.2f}\n"
                f"Deviation: {deviation_pct:+.1f}%",
        severity='high' if abs(deviation_pct) > 50 else 'medium'
    )
```
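
When a trained model is more than the situation needs, a z-score against recent history is a common lightweight alternative. The sketch below swaps Isolation Forest for that simpler statistical test; the threshold of 3 standard deviations and the name `is_cost_anomaly` are assumptions for illustration.

```python
from statistics import mean, stdev


def is_cost_anomaly(history: list, today_cost: float,
                    z_threshold: float = 3.0) -> bool:
    """Flag today's cost when it deviates strongly from recent history."""
    if len(history) < 2:
        return False  # Not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today_cost != mu  # Flat history: any change is unusual
    return abs(today_cost - mu) / sigma > z_threshold


recent_costs = [100, 102, 98, 101, 99]  # Made-up daily costs in USD
print(is_cost_anomaly(recent_costs, 150))
print(is_cost_anomaly(recent_costs, 101))
```

Comparing only against the same weekday, as the baseline in the block above does, keeps weekly seasonality from triggering false alarms.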

**Prophet (Facebook) for Forecasting:**

```python
from prophet import Prophet

# Prepare data
df = cost_history[['date', 'daily_cost']].rename(columns={'date': 'ds', 'daily_cost': 'y'})

# Train model
model = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=True,
    daily_seasonality=False
)
model.fit(df)

# Forecast the next 30 days
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)

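# Budget alert: sum predictions for the current calendar month only.
# `forecast` also contains every historical date, so filtering on the
# month alone would add up the same month from previous years.
monthly_budget = 10000
today = pd.Timestamp.today()
in_current_month = (
    (forecast['ds'].dt.year == today.year) &
    (forecast['ds'].dt.month == today.month)
)
forecasted_month_cost = forecast.loc[in_current_month, 'yhat'].sum()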

if forecasted_month_cost > monthly_budget:
    send_alert(
        title="📈 Budget Overrun Forecasted",
        message=f"Forecasted: ${forecasted_month_cost:.2f}\n"
                f"Budget: ${monthly_budget:.2f}\n"
                f"Overrun: ${forecasted_month_cost - monthly_budget:.2f}"
    )

# Plot
import matplotlib.pyplot as plt
model.plot(forecast)
model.plot_components(forecast)
```

**Real-Time Alerting con CloudWatch + Lambda:**

```python
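# Lambda function triggered by a CloudWatch Alarm (via SNS)
import json
from datetime import datetime, timedelta

import boto3
import requests  # Not in the Lambda runtime by default; bundle it or use a layer

def lambda_handler(event, context):
    """Send an alert when the cost alarm fires"""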
    
    # Parse CloudWatch alarm
    message = json.loads(event['Records'][0]['Sns']['Message'])
    alarm_name = message['AlarmName']
    new_state = message['NewStateValue']
    reason = message['NewStateReason']
    
    if new_state == 'ALARM':
        # Get current cost
        ce = boto3.client('ce')
        response = ce.get_cost_and_usage(
            TimePeriod={
                'Start': datetime.now().strftime('%Y-%m-%d'),
                'End': (datetime.now() + timedelta(days=1)).strftime('%Y-%m-%d')
            },
            Granularity='DAILY',
            Metrics=['UnblendedCost']
        )
        
        current_cost = float(response['ResultsByTime'][0]['Total']['UnblendedCost']['Amount'])
        
        # Get top services
        top_services = ce.get_cost_and_usage(
            TimePeriod={
                'Start': datetime.now().strftime('%Y-%m-%d'),
                'End': (datetime.now() + timedelta(days=1)).strftime('%Y-%m-%d')
            },
            Granularity='DAILY',
            Metrics=['UnblendedCost'],
            GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
        )
        
        # Format Slack message
        blocks = [
            {
                "type": "header",
                "text": {"type": "plain_text", "text": "⚠️ Cost Spike Detected"}
            },
            {
                "type": "section",
                "fields": [
                    {"type": "mrkdwn", "text": f"*Current Cost:*\n${current_cost:.2f}"},
                    {"type": "mrkdwn", "text": f"*Alarm:*\n{alarm_name}"},
                    {"type": "mrkdwn", "text": f"*Reason:*\n{reason}"}
                ]
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": "*Top Services:*\n" + "\n".join([
                        f"• {g['Keys'][0]}: ${float(g['Metrics']['UnblendedCost']['Amount']):.2f}"
                        for g in sorted(
                            top_services['ResultsByTime'][0]['Groups'],
                            key=lambda x: float(x['Metrics']['UnblendedCost']['Amount']),
                            reverse=True
                        )[:5]
                    ])
                }
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "View in Cost Explorer"},
                        "url": "https://console.aws.amazon.com/cost-management/home"
                    }
                ]
            }
        ]
        
        # Send to Slack (replace the placeholder webhook URL)
        requests.post(
            'https://hooks.slack.com/services/YOUR/WEBHOOK/URL',
            json={"blocks": blocks},
            timeout=10  # avoid hanging until the Lambda timeout
        )

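# CloudWatch alarm (Terraform), kept as a string for reference.
# Note: AWS/Billing EstimatedCharges is a month-to-date cumulative total and is
# published only in us-east-1, so the threshold below applies to the running
# monthly total rather than to a single day's spend.
"""
resource "aws_cloudwatch_metric_alarm" "daily_cost_spike" {
  alarm_name          = "daily-cost-spike"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "EstimatedCharges"
  namespace           = "AWS/Billing"
  period              = 21600  # 6 hours
  statistic           = "Maximum"
  threshold           = 500    # USD, month-to-date
  alarm_description   = "Alert when month-to-date estimated charges exceed $500"
  alarm_actions       = [aws_sns_topic.cost_alerts.arn]

  dimensions = {
    Currency = "USD"
  }
}
"""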
```
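To exercise the handler's parsing logic locally, it helps to build an event in the shape SNS delivers to Lambda. The sketch below is a minimal example: the helper name `parse_alarm_event` and the sample values are hypothetical, and only the fields read by the handler above are included.

```python
import json


def parse_alarm_event(event):
    """Extract the alarm fields from an SNS-wrapped CloudWatch alarm event."""
    # SNS delivers the alarm as a JSON string inside Records[0].Sns.Message
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return {
        "alarm_name": message["AlarmName"],
        "new_state": message["NewStateValue"],
        "reason": message["NewStateReason"],
    }


# Hypothetical test event containing only the fields used above
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "AlarmName": "daily-cost-spike",
                        "NewStateValue": "ALARM",
                        "NewStateReason": "Threshold Crossed (sample reason text)",
                    }
                )
            }
        }
    ]
}

alarm = parse_alarm_event(sample_event)
print(alarm["alarm_name"], alarm["new_state"])
```

Keeping the parsing in its own function makes it testable without AWS credentials or a Slack webhook.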

**Grafana Dashboard (FinOps):**

```python
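# Prometheus queries for a simplified panel list (not the full Grafana dashboard JSON schema).
# Metric names such as aws_cost_total depend on the exporter in use.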
grafana_dashboard = {
    "panels": [
        {
            "title": "Total Monthly Cost",
            "query": 'sum(aws_billing_estimated_charges{currency="USD"})',
            "type": "stat"
        },
        {
            "title": "Daily Cost Trend",
            # increase() gives growth over the window; rate() would return a per-second value
            "query": 'sum by (service) (increase(aws_cost_total[1d]))',
            "type": "graph"
        },
        {
            "title": "Cost by Team",
            "query": 'sum by (tag_costcenter) (aws_cost_total)',
            "type": "piechart"
        },
        {
            "title": "Budget Burn Rate",
            # Average daily spend over the last 7 days, projected to a 30-day month
            "query": 'sum(increase(aws_cost_total[7d])) / 7 * 30',
            "type": "gauge",
            "thresholds": [8000, 9000, 10000]  # Budget: $10k
        },
        {
            "title": "Top 10 Resources",
            "query": 'topk(10, sum by (resource_id) (aws_cost_total))',
            "type": "table"
        },
        {
            "title": "Savings Opportunities",
            "query": 'sum(aws_rightsizing_savings_potential)',
            "type": "stat"
        }
    ]
}

# Export to JSON
import json
with open('finops_dashboard.json', 'w') as f:
    json.dump(grafana_dashboard, f, indent=2)
```
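The "Budget Burn Rate" panel extrapolates recent spend to a 30-day month and compares it against the $10k budget thresholds. The same arithmetic in plain Python, as a minimal sketch; the daily costs are hypothetical and the status labels are illustrative names, not Grafana terminology.

```python
def projected_monthly_cost(daily_costs, days_in_month=30):
    """Extrapolate the average of recent daily costs to a full month."""
    if not daily_costs:
        raise ValueError("daily_costs must not be empty")
    return sum(daily_costs) / len(daily_costs) * days_in_month


def burn_status(projected, thresholds=(8000, 9000, 10000)):
    """Map a projected monthly cost onto the gauge thresholds."""
    warn, high, budget = thresholds
    if projected >= budget:
        return "over_budget"
    if projected >= high:
        return "critical"
    if projected >= warn:
        return "warning"
    return "ok"


# Hypothetical spend for the last 7 days (USD)
last_7_days = [280.0, 295.0, 310.0, 305.0, 290.0, 315.0, 305.0]
projection = projected_monthly_cost(last_7_days)
print(f"Projected: ${projection:,.2f} -> {burn_status(projection)}")
```

A 7-day window smooths out weekday/weekend variation; a shorter window reacts faster but is noisier.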

**Weekly Cost Report (Automated):**

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import smtplib

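def generate_weekly_report():
    """Generate the weekly cost report.

    The get_* helpers are assumed to be defined elsewhere
    (for example, thin wrappers around Cost Explorer).
    """
    
    # Get data
    last_week_cost = get_cost_for_period(days=7)
    previous_week_cost = get_cost_for_period(days=14, offset=7)
    
    # Week-over-week change; guard against a zero baseline
    wow_change = (
        (last_week_cost - previous_week_cost) / previous_week_cost * 100
        if previous_week_cost else 0.0
    )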
    
    # Top services
    top_services = get_cost_by_service(days=7)
    
    # Top resources
    top_resources = get_cost_by_resource(days=7)
    
    # Recommendations
    recommendations = get_cost_recommendations()
    
    # Generate HTML report
    html = f"""
    <html>
      <body>
        <h1>📊 Weekly Cost Report</h1>
        <h2>Overview</h2>
        <p><strong>Last Week:</strong> ${last_week_cost:.2f}</p>
        <p><strong>Previous Week:</strong> ${previous_week_cost:.2f}</p>
        <p><strong>Change:</strong> {wow_change:+.1f}%</p>
        
        <h2>Top Services</h2>
        <table border="1">
          <tr><th>Service</th><th>Cost</th><th>% of Total</th></tr>
          {''.join([f"<tr><td>{s['name']}</td><td>${s['cost']:.2f}</td><td>{s['pct']:.1f}%</td></tr>" for s in top_services[:5]])}
        </table>
        
        <h2>💰 Savings Opportunities</h2>
        <ul>
          {''.join([f"<li>{r['description']}: <strong>${r['savings']:.2f}/month</strong></li>" for r in recommendations])}
        </ul>
      </body>
    </html>
    """
    
    # Send email
    msg = MIMEMultipart('alternative')
    msg['Subject'] = f"Weekly Cost Report - ${last_week_cost:.2f}"
    msg['From'] = 'finops@company.com'
    msg['To'] = 'team@company.com'
    
    msg.attach(MIMEText(html, 'html'))
    
    # The context manager closes the connection even if sending fails
    with smtplib.SMTP('smtp.company.com') as smtp:
        smtp.send_message(msg)

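# Airflow DAG that schedules the report
"""
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

dag = DAG(
    'weekly_cost_report',
    schedule_interval='0 9 * * MON',  # Mondays at 9am
    start_date=datetime(2024, 1, 1),  # placeholder; any past date works
    catchup=False,                    # do not backfill missed runs
    default_args={'owner': 'finops'}
)

PythonOperator(
    task_id='generate_report',
    python_callable=generate_weekly_report,
    dag=dag
)
"""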
```

**Real-World Case: Airbnb - Cost Attribution System**

Airbnb built an internal FinOps system:
- **Real-time Dashboards**: Grafana with cost per service every 15 min
- **ML Anomaly Detection**: Prophet + IsolationForest
- **Chargeback**: Each team sees its spend in real time
- **Budget Enforcement**: Alerts when a team exceeds 80% of its budget
- **Result**: 30% reduction in waste, $100M+ in savings per year

---
**Author:** Luis J. Raigoso V. (LJRV)

## 1. FinOps principles

- Visibility: resource tagging and cost allocation by project/team.
- Accountability: each team/domain is responsible for its own budget.
- Continuous optimization: rightsizing, reservations, spot instances, lifecycle policies.
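Tag-based cost allocation can be sketched as a simple aggregation over billing line items. This is a minimal illustration: the line-item structure, the `team` tag key, and the function names are assumptions rather than any provider's export format.

```python
from collections import defaultdict


def showback_by_team(line_items, tag_key="team"):
    """Sum cost per team tag; untagged spend is surfaced explicitly."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost_usd"]
    return dict(totals)


def tag_coverage(line_items, tag_key="team"):
    """Share of total spend that carries the allocation tag (0..1)."""
    total = sum(i["cost_usd"] for i in line_items)
    tagged = sum(i["cost_usd"] for i in line_items if tag_key in i.get("tags", {}))
    return tagged / total if total else 0.0


# Hypothetical billing line items
line_items = [
    {"resource": "emr-cluster-1", "cost_usd": 120.0, "tags": {"team": "analytics"}},
    {"resource": "rds-orders", "cost_usd": 80.0, "tags": {"team": "platform"}},
    {"resource": "s3-raw", "cost_usd": 40.0, "tags": {"team": "analytics"}},
    {"resource": "ec2-legacy", "cost_usd": 60.0, "tags": {}},
]
print(showback_by_team(line_items))
print(f"Tag coverage: {tag_coverage(line_items):.0%}")
```

Tracking tag coverage alongside the per-team totals shows how much spend still has no accountable owner.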

## 2. Optimization strategies

### 2.1 Compute
- Spot/Preemptible for interruption-tolerant workloads (batch, training).
- Reserved/Committed for predictable workloads expected to run for at least a year (commitments are typically 1- or 3-year terms).
- Auto-scaling driven by metrics (CPU, latency, queue depth).
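Whether a commitment pays off depends on how many hours the workload actually runs. The sketch below computes the break-even utilization and compares yearly costs; the hourly rates are hypothetical placeholders, not real provider prices.

```python
HOURS_PER_YEAR = 8760


def break_even_utilization(on_demand_hourly, committed_hourly):
    """Fraction of the year an instance must run for the commitment to pay off."""
    return committed_hourly / on_demand_hourly


def cheaper_option(utilization, on_demand_hourly, committed_hourly):
    """Compare yearly cost of on-demand vs a 1-year commitment at a given utilization."""
    on_demand_cost = on_demand_hourly * HOURS_PER_YEAR * utilization
    committed_cost = committed_hourly * HOURS_PER_YEAR  # billed whether used or not
    name = "committed" if committed_cost < on_demand_cost else "on_demand"
    return name, round(min(on_demand_cost, committed_cost), 2)


# Hypothetical hourly rates (USD); substitute your provider's pricing
ON_DEMAND, COMMITTED = 0.10, 0.06

print(f"Break-even utilization: {break_even_utilization(ON_DEMAND, COMMITTED):.0%}")
for utilization in (0.4, 0.9):
    print(utilization, cheaper_option(utilization, ON_DEMAND, COMMITTED))
```

With these placeholder rates, a workload running less than about 60% of the time is cheaper on demand, which is why commitments suit steady baseline load rather than bursty jobs.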

### 2.2 Storage
- Lifecycle policies: S3 Standard → IA → Glacier → Deep Archive.
- Compression and efficient formats (Parquet/ORC with Snappy/ZSTD).
- Deletion of old snapshots and object versions.
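The Standard → IA → Glacier → Deep Archive tiering can be expressed as an S3 lifecycle configuration. The sketch below only builds the configuration dictionary; the helper name, the day thresholds, the `raw/` prefix, and the bucket name in the comment are all assumptions to adapt.

```python
import json


def build_lifecycle_rule(prefix, ia_days=30, glacier_days=90,
                         deep_archive_days=365, noncurrent_expiry_days=30):
    """Build one S3 lifecycle rule: Standard -> IA -> Glacier -> Deep Archive."""
    days = [ia_days, glacier_days, deep_archive_days]
    if days != sorted(set(days)):
        raise ValueError("transition days must be strictly increasing")
    return {
        "ID": f"tiering-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
            {"Days": deep_archive_days, "StorageClass": "DEEP_ARCHIVE"},
        ],
        # Remove old object versions in versioned buckets
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_expiry_days},
    }


lifecycle = {"Rules": [build_lifecycle_rule("raw/")]}
print(json.dumps(lifecycle, indent=2))

# Applying it would look like this (requires AWS credentials; bucket name is a placeholder):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake", LifecycleConfiguration=lifecycle)
```

Validating the ordering before calling the API catches misconfigured thresholds early, since each tier must be reached later than the previous one.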

### 2.3 Networking
- Minimize inter-region data transfer.
- Use VPC endpoints and PrivateLink to avoid routing traffic over the internet (and the egress/NAT charges that come with it).
- CDN (CloudFront/Cloud CDN) for frequently accessed data.
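A rough estimate of transfer cost per traffic path helps decide which flows to optimize first. In the sketch below, the per-GB rates, flow names, and volumes are hypothetical placeholders; check the provider's current price list before relying on any figure.

```python
# Hypothetical per-GB rates in USD; check your provider's current price list
RATES_USD_PER_GB = {
    "same_region_private": 0.00,
    "inter_region": 0.02,
    "internet_egress": 0.09,
}


def monthly_transfer_cost(flows, rates=RATES_USD_PER_GB, days=30):
    """Return the monthly cost per flow name, given daily volume and path type."""
    return {
        f["name"]: round(f["gb_per_day"] * days * rates[f["path"]], 2)
        for f in flows
    }


flows = [
    {"name": "replica-sync", "path": "inter_region", "gb_per_day": 500},
    {"name": "api-downloads", "path": "internet_egress", "gb_per_day": 100},
    {"name": "s3-via-endpoint", "path": "same_region_private", "gb_per_day": 2000},
]
costs = monthly_transfer_cost(flows)
for name, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: ${cost:,.2f}/month")
```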

## 3. Example: cost benchmark per TB processed

```python
import pandas as pd

data = [
  {'Pipeline':'Daily ETL', 'TB_procesados':5.2, 'Costo_USD':42.0, 'USD_per_TB':8.08},
  {'Pipeline':'Streaming Kafka', 'TB_procesados':1.8, 'Costo_USD':95.0, 'USD_per_TB':52.78},
  {'Pipeline':'ML Training', 'TB_procesados':0.3, 'Costo_USD':120.0, 'USD_per_TB':400.0},
]
df = pd.DataFrame(data)
df
```

## 4. Alerts and budgets (AWS Budgets example)

```python
budget_demo = r'''
# AWS CLI: create a $1000 monthly budget with alerts at 80% and 100%
aws budgets create-budget \
  --account-id 123456789012 \
  --budget file://budget.json \
  --notifications-with-subscribers file://notifications.json

# budget.json
{
  "BudgetName": "data-platform-monthly",
  "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST"
}

# notifications.json (alerts at 80% and 100%)
'''
print(budget_demo)
```
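The contents of `notifications.json` are not shown above. One possible shape for the 80% and 100% alerts, following the structure that `--notifications-with-subscribers` expects, is sketched below; the helper name and the recipient address are placeholders, and the choice of `ACTUAL` (rather than `FORECASTED`) spend is an assumption.

```python
import json


def build_notifications(thresholds_pct, email):
    """Build the list expected by --notifications-with-subscribers."""
    return [
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": pct,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
        }
        for pct in thresholds_pct
    ]


# The recipient address is a placeholder
notifications = build_notifications([80, 100], "finops@company.com")
print(json.dumps(notifications, indent=2))
```

Saving the printed JSON as `notifications.json` next to `budget.json` gives the CLI command both files it references.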

## 5. Key FinOps metrics

- **Cost per TB processed** (USD/TB).
- **Cost per event** (USD/million events).
- **Resource utilization** (% of CPU/RAM used vs provisioned).
- **Savings from optimization** (Spot, RI, rightsizing).
- **Monthly trend** (spend growth vs business growth).
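These metrics reduce to a few ratios. The sketch below uses hypothetical function names and sample inputs; only the 42 USD / 5.2 TB figures come from the benchmark table in section 3.

```python
def usd_per_tb(cost_usd, tb_processed):
    """Unit cost of processing one terabyte."""
    return cost_usd / tb_processed


def usd_per_million_events(cost_usd, events):
    """Unit cost per one million events."""
    return cost_usd / (events / 1_000_000)


def utilization_pct(used, provisioned):
    """Share of provisioned capacity actually used, in percent."""
    return used / provisioned * 100


def savings_pct(baseline_usd, optimized_usd):
    """Relative saving of the optimized cost against the baseline, in percent."""
    return (baseline_usd - optimized_usd) / baseline_usd * 100


print(f"USD/TB: {usd_per_tb(42.0, 5.2):.2f}")
print(f"USD/million events: {usd_per_million_events(95.0, 250_000_000):.2f}")
print(f"CPU utilization: {utilization_pct(12, 32):.1f}%")
print(f"Savings: {savings_pct(1000.0, 700.0):.1f}%")
```

Unit costs such as USD/TB make pipelines of very different sizes comparable, which raw monthly totals do not.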

## 6. Tools and dashboards

- AWS Cost Explorer, GCP Billing Reports, Azure Cost Management.
- Terraform/CDK for IaC with cost policies.
- Third-party: Cloudability, CloudHealth, Vantage.
- Custom dashboard with Grafana + Prometheus exporters.