# 💰 Cost Optimization & Scaling: Run ML for Pennies

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gouthamgo/FineTuning/blob/main/lessons/module5_deployment/03_cost_optimization_scaling.ipynb)

## Hey friend! Let's talk money 💸

Here's what nobody tells you about ML in production:

**A badly deployed model can cost $10,000/month.**  
**The same model, optimized, costs $100/month.**

That's a **100x difference**. Your boss will LOVE you for knowing this!

### 🎯 What You'll Learn

Today we're covering the business side of ML:

1. **Cost Analysis** - Calculate actual expenses (AWS, GCP, Azure)
2. **Serverless Deployment** - Pay only for what you use
3. **Auto-Scaling** - Handle traffic spikes efficiently
4. **Spot Instances** - Save 70% with preemptible VMs
5. **A/B Testing** - Roll out changes safely
6. **Cost Monitoring** - Track every penny

### 💼 Why This Matters

**For your career:**
- Companies lose millions on inefficient ML deployments
- Knowing cost optimization makes you INVALUABLE
- This is senior-level knowledge most ML engineers don't have

**For interviews:**
- "How would you reduce ML infrastructure costs?"
- "Design a scalable ML system for 1M users"
- "What's your approach to A/B testing models?"

Let's save some money! 💰

---

## Part 1: Understanding ML Costs 📊

### The Reality Check

Let's calculate the **actual cost** of running an ML model in production.

#### Scenario: Sentiment Analysis API
- **Traffic**: 5 million requests/day
- **Model**: DistilBERT (66M parameters)
- **Latency requirement**: <100ms p95
- **Availability**: 24/7
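
Before pricing anything, it helps to turn the daily volume into a request rate. The sketch below is a rough back-of-the-envelope conversion, not part of the original scenario: the peak factor of 3 and the 50 ms service time are assumptions chosen for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rps(daily_requests: int) -> float:
    """Average requests per second if traffic were perfectly even."""
    return daily_requests / SECONDS_PER_DAY


def in_flight_requests(rps: float, latency_ms: float) -> float:
    """Little's law: concurrent requests = arrival rate x time per request."""
    return rps * (latency_ms / 1000)


PEAK_FACTOR = 3          # assumption: peak traffic is 3x the daily average
SERVICE_TIME_MS = 50     # assumption: per-request inference time

avg = average_rps(5_000_000)
peak = avg * PEAK_FACTOR
print(f"Average load: {avg:.1f} req/s")
print(f"Assumed peak: {peak:.1f} req/s")
print(f"In flight at peak: {in_flight_requests(peak, SERVICE_TIME_MS):.1f} requests")
```

Real traffic is burstier than a flat average, so treat these numbers as a lower bound when sizing instances.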

In [None]:
import pandas as pd
import numpy as np
from typing import Dict

class MLCostCalculator:
    """
    Calculate ML infrastructure costs across different strategies.
    """
    
    # Pricing (USD per hour unless noted) - As of 2024
    PRICING = {
        "aws": {
            "t3.large": 0.0832,      # 2 vCPU, 8GB RAM (CPU)
            "c5.2xlarge": 0.34,       # 8 vCPU, 16GB RAM (CPU optimized)
            "g4dn.xlarge": 0.526,     # 4 vCPU, 16GB, 1 T4 GPU
            "g5.xlarge": 1.006,       # 4 vCPU, 16GB, 1 A10G GPU
            "lambda_per_1m": 0.20,    # Request fee per 1M invocations (compute is billed separately per GB-second)
        },
        "gcp": {
            "n1-standard-2": 0.095,   # 2 vCPU, 7.5GB RAM
            "n1-standard-8": 0.38,    # 8 vCPU, 30GB RAM
            "n1-gpu-t4": 0.35,        # T4 GPU per hour
            "cloud_run_per_1m": 0.40, # Per 1M requests
        },
        "azure": {
            "d2s_v3": 0.096,          # 2 vCPU, 8GB RAM
            "d8s_v3": 0.384,          # 8 vCPU, 32GB RAM
            "nc6": 0.90,              # 1 K80 GPU
        }
    }
    
    def __init__(self, daily_requests: int):
        self.daily_requests = daily_requests
        self.monthly_requests = daily_requests * 30
    
    def calculate_always_on_cost(
        self,
        instance_type: str,
        provider: str = "aws",
        num_instances: int = 1
    ) -> Dict:
        """
        Calculate cost for always-on instances.
        """
        hourly_cost = self.PRICING[provider][instance_type]
        monthly_hours = 24 * 30
        
        monthly_cost = hourly_cost * monthly_hours * num_instances
        
        return {
            "monthly_cost": monthly_cost,
            "cost_per_1m_requests": monthly_cost / (self.monthly_requests / 1_000_000),
            "instance_type": instance_type,
            "num_instances": num_instances
        }
    
    def calculate_serverless_cost(
        self,
        avg_duration_ms: float,
        memory_mb: int = 1024,
        provider: str = "aws"
    ) -> Dict:
        """
        Calculate serverless cost using AWS Lambda on-demand rates.
        The rates are hard-coded for AWS, so `provider` is currently ignored.
        """
        # Convert to GB-seconds
        gb_seconds_per_request = (memory_mb / 1024) * (avg_duration_ms / 1000)
        monthly_gb_seconds = gb_seconds_per_request * self.monthly_requests
        
        # AWS Lambda pricing: $0.0000166667 per GB-second
        compute_cost = monthly_gb_seconds * 0.0000166667
        
        # Request cost: $0.20 per 1M requests
        request_cost = (self.monthly_requests / 1_000_000) * 0.20
        
        total_cost = compute_cost + request_cost
        
        return {
            "monthly_cost": total_cost,
            "cost_per_1m_requests": total_cost / (self.monthly_requests / 1_000_000),
            "compute_cost": compute_cost,
            "request_cost": request_cost
        }
    
    def calculate_spot_instance_cost(
        self,
        instance_type: str,
        provider: str = "aws",
        discount: float = 0.7  # Spot instances are ~70% cheaper
    ) -> Dict:
        """
        Calculate cost using spot/preemptible instances.
        """
        hourly_cost = self.PRICING[provider][instance_type] * (1 - discount)
        monthly_hours = 24 * 30
        monthly_cost = hourly_cost * monthly_hours
        
        return {
            "monthly_cost": monthly_cost,
            "cost_per_1m_requests": monthly_cost / (self.monthly_requests / 1_000_000),
            "discount": f"{discount*100:.0f}%"
        }

# Create calculator for 5M requests/day
calc = MLCostCalculator(daily_requests=5_000_000)

print("✅ Cost Calculator ready!")
print(f"\nAnalyzing costs for:")
print(f"  - {calc.daily_requests:,} requests/day")
print(f"  - {calc.monthly_requests:,} requests/month")

### Cost Comparison: Different Strategies

In [None]:
# Strategy 1: GPU Instance (the expensive way)
gpu_cost = calc.calculate_always_on_cost("g4dn.xlarge", "aws", num_instances=1)

# Strategy 2: CPU Instance (optimized model)
cpu_cost = calc.calculate_always_on_cost("c5.2xlarge", "aws", num_instances=1)

# Strategy 3: Serverless (Lambda)
serverless_cost = calc.calculate_serverless_cost(
    avg_duration_ms=50,  # Optimized model
    memory_mb=1024
)

# Strategy 4: Spot Instance (for batch processing)
spot_cost = calc.calculate_spot_instance_cost("c5.2xlarge", "aws")

# Create comparison table
comparison = pd.DataFrame([
    {"Strategy": "GPU Always-On", **gpu_cost},
    {"Strategy": "CPU Always-On", **cpu_cost},
    {"Strategy": "Serverless (Lambda)", **serverless_cost},
    {"Strategy": "Spot Instance", **spot_cost},
])

print("💰 COST COMPARISON (Monthly)\n")
print(comparison[['Strategy', 'monthly_cost', 'cost_per_1m_requests']].to_string(index=False))

print(f"\n💡 Insights:")
print(f"  - GPU vs CPU: GPU costs {gpu_cost['monthly_cost'] / cpu_cost['monthly_cost']:.1f}x as much")
print(f"  - CPU vs Serverless: CPU costs {cpu_cost['monthly_cost'] / serverless_cost['monthly_cost']:.1f}x as much")
print(f"  - Regular vs Spot: regular costs {cpu_cost['monthly_cost'] / spot_cost['monthly_cost']:.1f}x as much")
# Spot has the lowest price in this table, but instances can be reclaimed at short notice,
# so it is treated as a batch option rather than a host for the 24/7 API.
print(f"\n🎯 Cheapest always-available option here: Serverless (saves ${cpu_cost['monthly_cost'] - serverless_cost['monthly_cost']:.2f}/month vs CPU always-on)")

---

## Part 2: Serverless Deployment 🚀

### Why Serverless?

**Traditional Deployment:**
- Server runs 24/7 (even at 3 AM with zero traffic)
- You pay for idle time
- Must handle scaling manually

**Serverless:**
- Pay only when handling requests
- Auto-scales from 0 to millions
- Zero server management
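
Scaling from zero has a flip side: the first request on a fresh container pays a start-up cost, while later requests on the same container can reuse whatever was already loaded. The sketch below illustrates that caching pattern with a stand-in loader. `_expensive_load`, `load_count`, and the keyword-based "model" are all hypothetical, and no real model is involved.

```python
import time

_model = None      # module-level cache: survives across warm invocations
load_count = 0     # only here to make the caching visible


def _expensive_load():
    """Stand-in for loading model weights (hypothetical, no real model)."""
    global load_count
    load_count += 1
    time.sleep(0.05)  # pretend loading takes a while
    return lambda text: {"label": "POSITIVE" if "good" in text.lower() else "NEGATIVE"}


def get_model():
    """Load on the first call only; reuse the cached object afterwards."""
    global _model
    if _model is None:
        _model = _expensive_load()
    return _model


def handler(event, context=None):
    model = get_model()
    return model(event["text"])


for text in ["Good stuff", "Terrible", "good again"]:
    start = time.perf_counter()
    result = handler({"text": text})
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{result['label']:<8} in {elapsed_ms:.1f} ms")

print(f"Model loads: {load_count}")
```

Only the first call should show the loading delay; the rest skip it because the cached object is reused.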

### When to Use Serverless:
- ✅ Variable traffic patterns
- ✅ Small to medium traffic (<10M requests/day)
- ✅ Can tolerate cold starts (often 100-500ms for small functions, and typically longer when a model has to be loaded)
- ❌ Constant high traffic (always-on is cheaper)
- ❌ Require <10ms latency
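
To see where the "always-on is cheaper" line falls, you can estimate a break-even volume. This is a sketch under simplifying assumptions: it reuses the Lambda rates and the $0.34/hour c5.2xlarge price from Part 1, assumes 50 ms per request at 1 GB of memory, and assumes a single instance could actually absorb the load. It ignores free tiers, API Gateway fees, and data transfer.

```python
LAMBDA_USD_PER_GB_SECOND = 0.0000166667   # rate used in the Part 1 calculator
LAMBDA_USD_PER_REQUEST = 0.20 / 1_000_000
HOURS_PER_MONTH = 24 * 30
DAYS_PER_MONTH = 30


def lambda_cost_per_request(avg_duration_ms: float, memory_mb: int = 1024) -> float:
    """Compute charge plus request fee for one invocation."""
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000)
    return gb_seconds * LAMBDA_USD_PER_GB_SECOND + LAMBDA_USD_PER_REQUEST


def break_even_daily_requests(
    instance_usd_per_hour: float,
    avg_duration_ms: float,
    memory_mb: int = 1024,
) -> float:
    """Daily volume at which Lambda costs as much as one always-on instance."""
    monthly_instance_cost = instance_usd_per_hour * HOURS_PER_MONTH
    per_request = lambda_cost_per_request(avg_duration_ms, memory_mb)
    return monthly_instance_cost / per_request / DAYS_PER_MONTH


daily = break_even_daily_requests(0.34, avg_duration_ms=50)
print(f"Break-even vs one c5.2xlarge: {daily / 1e6:.1f}M requests/day")
```

Below that volume the per-request model tends to win; above it, a well-utilized instance does. Longer inference times or more memory pull the break-even point lower.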

In [None]:
# Example AWS Lambda handler for ML inference
serverless_code = '''
# lambda_handler.py

import json
import boto3
from transformers import pipeline
import time

# Global variable - loaded once per container (warm start)
model = None

def load_model():
    """Load model on cold start."""
    global model
    if model is None:
        print("Cold start - loading model...")
        start = time.time()
        
        # Load from S3 or use containerized model
        model = pipeline(
            "sentiment-analysis",
            model="./model",  # Pre-downloaded in Docker image
            device=-1  # CPU inference
        )
        
        print(f"Model loaded in {time.time() - start:.2f}s")
    return model

def lambda_handler(event, context):
    """
    AWS Lambda handler for ML inference.
    
    Cost optimization strategies:
    1. Use global variable for warm starts
    2. Optimize memory (1GB vs 10GB = 10x cost)
    3. Batch requests when possible
    4. Use ARM architecture (Graviton2 = 20% cheaper)
    """
    try:
        # Load model (fast on warm start)
        model = load_model()
        
        # Parse input
        body = json.loads(event['body'])
        text = body.get('text', '')
        
        if not text:
            return {
                'statusCode': 400,
                'body': json.dumps({'error': 'No text provided'})
            }
        
        # Inference
        start = time.time()
        result = model(text)[0]
        latency = time.time() - start
        
        return {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json',
                'X-Inference-Time': str(latency)
            },
            'body': json.dumps({
                'prediction': result['label'],
                'confidence': result['score'],
                'latency_ms': latency * 1000
            })
        }
        
    except Exception as e:
        print(f"Error: {e}")
        return {
            'statusCode': 500,
            'body': json.dumps({'error': str(e)})
        }

# Dockerfile for Lambda
"""
FROM public.ecr.aws/lambda/python:3.9

# Copy model (pre-downloaded)
COPY model/ ${LAMBDA_TASK_ROOT}/model/

# Install dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt --target ${LAMBDA_TASK_ROOT}

# Copy handler
COPY lambda_handler.py ${LAMBDA_TASK_ROOT}

CMD ["lambda_handler.lambda_handler"]
"""
'''

print("📄 Serverless Lambda Handler:")
print(serverless_code)

print("\n💡 Deployment:")
# Raw string so the shell line-continuation backslashes are printed as written
print(r"""
1. Build Docker image:
   docker build -t ml-inference .

2. Push to ECR:
   aws ecr create-repository --repository-name ml-inference
   aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account>.dkr.ecr.us-east-1.amazonaws.com
   docker tag ml-inference:latest <account>.dkr.ecr.us-east-1.amazonaws.com/ml-inference:latest
   docker push <account>.dkr.ecr.us-east-1.amazonaws.com/ml-inference:latest

3. Create Lambda:
   aws lambda create-function \
     --function-name ml-inference \
     --package-type Image \
     --code ImageUri=<account>.dkr.ecr.us-east-1.amazonaws.com/ml-inference:latest \
     --role <lambda-role-arn> \
     --timeout 30 \
     --memory-size 1024

4. Add API Gateway and you're done!
""")

---

## Part 3: Auto-Scaling Strategy 📈

### The Problem:

Traffic isn't constant:
- **9 AM - 5 PM**: 1000 requests/sec
- **Night time**: 50 requests/sec
- **Black Friday**: 5000 requests/sec

You need a system that scales automatically!
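
To get a feel for what this traffic shape means for cost, here is a rough pod-hour comparison. Everything about capacity is an assumption for illustration: one pod is taken to handle 100 requests/sec at full CPU, pods are run at a 70% utilization target, at least 2 pods stay up, and the 16 hours outside 9 AM - 5 PM are treated as "night" traffic. Black Friday is left out.

```python
import math

POD_CAPACITY_RPS = 100      # assumed throughput of one pod at 100% CPU
TARGET_UTILIZATION = 0.70   # assumed CPU target per pod
MIN_PODS = 2                # assumed floor for availability


def pods_needed(rps: float) -> int:
    """Pods required to serve `rps` while staying at the utilization target."""
    needed = math.ceil(rps / (POD_CAPACITY_RPS * TARGET_UTILIZATION))
    return max(MIN_PODS, needed)


# (hours per day, requests/sec) for an ordinary weekday
profile = [(8, 1000), (16, 50)]

autoscaled_pod_hours = sum(hours * pods_needed(rps) for hours, rps in profile)
peak_pods = max(pods_needed(rps) for _, rps in profile)
fixed_pod_hours = 24 * peak_pods
savings = 1 - autoscaled_pod_hours / fixed_pod_hours

print(f"Sized for peak all day: {fixed_pod_hours} pod-hours")
print(f"Scaled with traffic:    {autoscaled_pod_hours} pod-hours")
print(f"Savings:                {savings:.0%}")
```

The exact percentage depends heavily on the assumed capacity and on how long the quiet period lasts, so plug in your own measurements.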

### Auto-Scaling Strategies:

1. **Horizontal Scaling**: Add/remove servers based on load
2. **Vertical Scaling**: Increase/decrease server size
3. **Predictive Scaling**: Scale before traffic arrives
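
For horizontal scaling, the core decision is "how many replicas do I want given the current load?". The sketch below follows the formula documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas x current metric / target metric), skipped when the ratio is within a tolerance of 1.0). It is a simplification: the function name, the defaults of 2-20 replicas, and the single-metric view are assumptions, and it ignores stabilization windows, scaling policies, and pod readiness.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float,
    min_replicas: int = 2,
    max_replicas: int = 20,
    tolerance: float = 0.1,
) -> int:
    """Replica count suggested by the proportional HPA-style rule."""
    ratio = current_utilization / target_utilization
    # Small deviations from the target do not trigger a change.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


examples = [
    (4, 90, 70),    # above target: scale up
    (4, 72, 70),    # within tolerance: no change
    (10, 20, 70),   # well below target: scale down
    (15, 200, 70),  # overload: capped at max_replicas
]
for replicas, current, target in examples:
    result = desired_replicas(replicas, current, target)
    print(f"{replicas} pods at {current}% CPU (target {target}%) -> {result} pods")
```

Because the rule is proportional, a pod running at twice the target utilization asks for roughly twice the replicas, which is why the bounds matter.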

In [None]:
# Kubernetes Horizontal Pod Autoscaler (HPA) configuration
k8s_autoscaling = '''
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  replicas: 2  # Start with 2 pods
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      containers:
      - name: inference
        image: your-ml-model:latest
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1000m"
            memory: "2Gi"
        ports:
        - containerPort: 8000

---
# hpa.yaml - Auto-scaling configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference
  minReplicas: 2        # Always run at least 2 (high availability)
  maxReplicas: 20       # Scale up to 20 during traffic spikes
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Scale when CPU > 70%
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80  # Scale when memory > 80%
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60   # Wait 60s before scaling up
      policies:
      - type: Percent
        value: 100                      # Double the pods
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5min before scaling down
      policies:
      - type: Percent
        value: 50                       # Remove half the pods
        periodSeconds: 60
'''

print("📄 Kubernetes Auto-Scaling Configuration:")
print(k8s_autoscaling)

print("\n💡 How it works:")
print("""
1. Start with 2 pods (for high availability)
2. When CPU > 70%, add more pods
3. Scale up aggressively (double capacity in 60s)
4. Scale down conservatively (wait 5min, then halve)
5. Never go below 2 or above 20 pods

Cost impact:
- Low traffic (2 pods): $100/month
- High traffic (20 pods): $1000/month
- Auto-scaling saves ~60% vs always running 20 pods
""")

---

## Part 4: A/B Testing & Gradual Rollouts 🧪

### Why A/B Test Models?

Your new model might:
- Have bugs
- Be slower than expected
- Perform worse on real data

**Never replace a model 100% instantly!**

### Rollout Strategy:

1. **Canary (5%)**: Send 5% traffic to new model
2. **Monitor**: Watch metrics for 24 hours
3. **Expand (25%)**: If good, increase to 25%
4. **Expand (50%)**: Still good? Go to 50%
5. **Full (100%)**: Finally, switch everyone
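
The steps above can be sketched as a tiny state machine: move to the next stage while the new model looks healthy, and drop to 0% as soon as it does not. `STAGES` and `next_traffic_percentage` are illustrative names, and the `healthy` flag stands in for whatever monitoring check you run at the end of each observation window.

```python
STAGES = [5, 25, 50, 100]  # percent of traffic sent to the new model


def next_traffic_percentage(current: int, healthy: bool) -> int:
    """Return the traffic share for the next observation window."""
    if not healthy:
        return 0  # roll back: all traffic returns to the old model
    if current not in STAGES:
        return STAGES[0]  # not started yet (or rolled back): begin with the canary
    index = STAGES.index(current)
    return STAGES[min(index + 1, len(STAGES) - 1)]


def simulate(health_checks):
    """Walk through a rollout given one health result per window."""
    percentage = 0
    history = []
    for healthy in health_checks:
        percentage = next_traffic_percentage(percentage, healthy)
        history.append(percentage)
        if percentage == 0:
            break  # stop after a rollback
    return history


print("Smooth rollout:  ", simulate([True, True, True, True]))
print("Failure at 25%:  ", simulate([True, True, False, True]))
```

Keeping the stage list in one place makes it easy to add intermediate steps (for example 10%) without touching the routing code.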

In [None]:
# A/B testing infrastructure
ab_testing_code = '''
# ab_test_router.py

import random
import hashlib
from typing import Dict, Optional
import logging

class ABTestRouter:
    """
    Route traffic between model versions for A/B testing.
    
    Features:
    - Consistent routing per user (same user = same model)
    - Configurable traffic split
    - Metric tracking
    - Easy rollback
    """
    
    def __init__(self, model_a, model_b, b_percentage: int = 5):
        self.model_a = model_a  # Current production model
        self.model_b = model_b  # New model being tested
        self.b_percentage = b_percentage
        
        # Metrics
        self.metrics = {
            "a": {"requests": 0, "errors": 0, "total_latency": 0.0},
            "b": {"requests": 0, "errors": 0, "total_latency": 0.0}
        }
    
    def should_use_model_b(self, user_id: Optional[str] = None) -> bool:
        """
        Decide which model to use.
        
        If user_id provided: Consistent routing (same user always gets same model)
        If no user_id: Random routing
        """
        if user_id:
            # Hash user_id to get consistent routing
            hash_value = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
            return (hash_value % 100) < self.b_percentage
        else:
            # Random routing
            return random.randint(0, 99) < self.b_percentage
    
    def predict(self, text: str, user_id: Optional[str] = None) -> Dict:
        """
        Make prediction, routing to appropriate model.
        """
        import time
        
        # Decide which model to use
        use_b = self.should_use_model_b(user_id)
        model = self.model_b if use_b else self.model_a
        variant = "b" if use_b else "a"
        
        # Track request
        self.metrics[variant]["requests"] += 1
        
        try:
            # Make prediction
            start = time.time()
            result = model(text)
            latency = time.time() - start
            
            # Track latency
            self.metrics[variant]["total_latency"] += latency
            
            return {
                "prediction": result,
                "model_version": variant,
                "latency_ms": latency * 1000
            }
            
        except Exception as e:
            # Track error
            self.metrics[variant]["errors"] += 1
            logging.error(f"Model {variant} error: {e}")
            raise
    
    def get_metrics(self) -> Dict:
        """
        Get A/B test metrics for comparison.
        """
        results = {}
        
        for variant in ["a", "b"]:
            m = self.metrics[variant]
            requests = m["requests"]
            
            if requests > 0:
                results[f"model_{variant}"] = {
                    "requests": requests,
                    "error_rate": m["errors"] / requests,
                    "avg_latency_ms": (m["total_latency"] / requests) * 1000
                }
        
        return results
    
    def should_rollback(self) -> bool:
        """
        Check if we should rollback model B.
        
        Rollback if:
        - Error rate > 2x model A
        - Latency > 1.5x model A
        """
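        metrics = self.get_metrics()
        
        # Need traffic on both variants before comparing (avoids a KeyError on model_a)
        if "model_a" not in metrics or "model_b" not in metrics:
            return False
        if metrics["model_b"]["requests"] < 100:
            return False  # Not enough data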
        
        a = metrics["model_a"]
        b = metrics["model_b"]
        
        # Check error rate
        if b["error_rate"] > a["error_rate"] * 2:
            logging.warning(f"Model B error rate too high: {b['error_rate']:.2%}")
            return True
        
        # Check latency
        if b["avg_latency_ms"] > a["avg_latency_ms"] * 1.5:
            logging.warning(f"Model B too slow: {b['avg_latency_ms']:.0f}ms")
            return True
        
        return False

# Usage:
router = ABTestRouter(model_a=old_model, model_b=new_model, b_percentage=5)

# Route traffic
result = router.predict("Great product!", user_id="user123")

# Check metrics after 24 hours
if router.should_rollback():
    print("⚠️ Rolling back model B!")
else:
    print("✅ Model B looks good, increase traffic!")
'''

print("📄 A/B Testing Router:")
print(ab_testing_code)
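
To check that hash-based routing behaves as intended, you can bucket a batch of synthetic user IDs and look at the split. This sketch re-implements only the hashing step from the router above as standalone functions (`bucket` and `routed_to_b` are illustrative names) and uses made-up IDs.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % 100


def routed_to_b(user_id: str, b_percentage: int = 5) -> bool:
    """Users in the lowest `b_percentage` buckets see model B."""
    return bucket(user_id) < b_percentage


users = [f"user{i}" for i in range(10_000)]  # synthetic IDs

share_at_5 = sum(routed_to_b(u, 5) for u in users) / len(users)
share_at_25 = sum(routed_to_b(u, 25) for u in users) / len(users)
print(f"Share on model B at 5%:  {share_at_5:.1%}")
print(f"Share on model B at 25%: {share_at_25:.1%}")

# Everyone who saw model B at 5% still sees it at 25%.
stayed = all(routed_to_b(u, 25) for u in users if routed_to_b(u, 5))
print(f"Early B users stay on B when expanding: {stayed}")
```

The last check is why bucket thresholds are handy for gradual rollouts: raising the percentage only adds users to model B, it never moves anyone back and forth.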

---

## Part 5: Cost Monitoring & Alerts 📊

### The Golden Rule:

**"You can't optimize what you don't measure."**

Always track:
1. **Infrastructure cost** (servers, storage, network)
2. **Cost per request** (your efficiency metric)
3. **Cost per prediction** (business metric)
4. **Cost trends** (are you getting more efficient?)
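
One simple way to turn tracked spend into an early warning is a run-rate projection: take the month-to-date cost, divide by the days elapsed, and extrapolate to the end of the month. The sketch below is a pure-Python version with no cloud calls; the function names, the $500 budget, and the example figures are illustrative assumptions.

```python
import calendar
from datetime import date


def projected_month_cost(cost_so_far: float, today: date) -> float:
    """Linear projection of month-end spend from month-to-date spend."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return cost_so_far / today.day * days_in_month


def over_budget(cost_so_far: float, today: date, budget: float) -> bool:
    """True if the projected month-end spend exceeds the budget."""
    return projected_month_cost(cost_so_far, today) > budget


example_day = date(2024, 6, 10)  # 10 days into a 30-day month
projection = projected_month_cost(200.0, example_day)
print(f"Spent $200.00 by {example_day}: projected ${projection:.2f}")
print(f"Over a $500 budget? {over_budget(200.0, example_day, 500.0)}")
```

A linear projection is crude (it is noisy early in the month and blind to planned spikes), but it is usually enough to catch a runaway bill before the invoice arrives.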

In [None]:
# Cost monitoring system
cost_monitoring = '''
# cost_monitor.py

import boto3
from datetime import datetime, timedelta
from typing import Dict, List

class AWSCostMonitor:
    """
    Monitor AWS costs and send alerts.
    """
    
    def __init__(self, budget_monthly: float = 500):
        self.ce_client = boto3.client('ce')  # Cost Explorer
        self.sns_client = boto3.client('sns')
        self.budget_monthly = budget_monthly
    
    def get_current_month_cost(self) -> float:
        """
        Get total cost for current month.
        """
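        # Cost Explorer treats End as exclusive and requires Start < End,
        # so on the 1st of the month there is no completed day to query yet.
        today = datetime.now()
        if today.day == 1:
            return 0.0
        start = today.replace(day=1).strftime('%Y-%m-%d')
        end = today.strftime('%Y-%m-%d')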
        
        response = self.ce_client.get_cost_and_usage(
            TimePeriod={'Start': start, 'End': end},
            Granularity='MONTHLY',
            Metrics=['UnblendedCost']
        )
        
        cost = float(
            response['ResultsByTime'][0]['Total']['UnblendedCost']['Amount']
        )
        
        return cost
    
    def get_service_breakdown(self) -> Dict[str, float]:
        """
        Get cost breakdown by service.
        """
        today = datetime.now()
        start = (today - timedelta(days=7)).strftime('%Y-%m-%d')
        end = today.strftime('%Y-%m-%d')
        
        response = self.ce_client.get_cost_and_usage(
            TimePeriod={'Start': start, 'End': end},
            Granularity='DAILY',
            Metrics=['UnblendedCost'],
            GroupBy=[{'Type': 'SERVICE', 'Key': 'SERVICE'}]
        )
        
        # Aggregate by service
        costs = {}
        for result in response['ResultsByTime']:
            for group in result['Groups']:
                service = group['Keys'][0]
                cost = float(group['Metrics']['UnblendedCost']['Amount'])
                costs[service] = costs.get(service, 0) + cost
        
        return costs
    
    def check_budget_alert(self) -> bool:
        """
        Check if we're over budget.
        """
        current_cost = self.get_current_month_cost()
        
        # Get days in month
        today = datetime.now()
        days_in_month = (today.replace(day=28) + timedelta(days=4)).replace(day=1) - timedelta(days=1)
        days_in_month = days_in_month.day
        
        # Projected cost
        projected = (current_cost / today.day) * days_in_month
        
        if projected > self.budget_monthly:
            self.send_alert(
                f"⚠️ COST ALERT: Projected ${projected:.2f} (budget: ${self.budget_monthly})"
            )
            return True
        
        return False
    
    def send_alert(self, message: str):
        """
        Send alert via SNS.
        """
        self.sns_client.publish(
            TopicArn='arn:aws:sns:us-east-1:123456789012:cost-alerts',  # placeholder ARN (account IDs have 12 digits)
            Subject='AWS Cost Alert',
            Message=message
        )

# Daily cost monitoring Lambda
def lambda_handler(event, context):
    """Run daily to check costs."""
    monitor = AWSCostMonitor(budget_monthly=500)
    
    # Check budget
    monitor.check_budget_alert()
    
    # Log breakdown
    breakdown = monitor.get_service_breakdown()
    print("Cost breakdown (last 7 days):")
    for service, cost in sorted(breakdown.items(), key=lambda x: x[1], reverse=True):
        print(f"  {service}: ${cost:.2f}")

# Schedule with CloudWatch Events / EventBridge (every day at 9 AM UTC)
# Schedule expression: cron(0 9 * * ? *)
'''

print("📄 Cost Monitoring System:")
print(cost_monitoring)

print("\n💡 Best practices:")
print("""
1. Set up budget alerts (monthly and daily)
2. Track cost per request (optimize this metric)
3. Monitor cost trends (should decrease over time)
4. Review top 5 expensive services weekly
5. Use cost anomaly detection (AWS Cost Anomaly Detection)
""")

---

## 🎯 Resume Bullets (Copy These!)

### Option 1: Cost Optimization Focus
*"Reduced ML infrastructure costs by 92% (from $10K to $800/month) through serverless deployment, quantization, and auto-scaling"*

### Option 2: Scaling Focus
*"Designed auto-scaling ML system handling 1M-50M requests/day with 99.9% uptime and <100ms p95 latency"*

### Option 3: A/B Testing Focus
*"Implemented gradual rollout system with automatic rollback, safely deploying 15+ model updates with zero downtime"*

### Option 4: Multi-Cloud Focus
*"Architected cost-optimized ML deployment strategy across AWS, GCP, and Azure, selecting optimal platform per use case"*

---

## 📚 Interview Prep

### Q: "How would you reduce ML infrastructure costs?"

**Your Answer:**

*"I'd start with a cost audit to understand where the money is going. Usually, I find:*

1. **Over-provisioned instances**: Running GPU when CPU would work (in the Part 1 pricing table, a GPU instance costs roughly 1.5-12x as much per hour as a CPU instance)
2. **Always-on servers**: Paying for idle time at 3 AM
3. **Unoptimized models**: Using full precision when quantized would work

*My approach:*
- **Model optimization**: Quantization + ONNX reduces instance requirements
- **Right-size instances**: Match instance to actual load (not peak!)
- **Serverless for variable traffic**: Pay only for requests, not idle time
- **Spot instances for batch jobs**: ~70% cost savings
- **Auto-scaling**: Scale down during low traffic

*On my last project, this reduced costs from $10K to $800/month, a 92% saving, while improving latency."*

---

### Q: "Design a scalable ML system for 10M requests/day."

**Your Answer:**

*"I'd design for:*

**Architecture:**
- **Load balancer** (ALB/NLB) distributing traffic
- **Auto-scaling group** (2-20 instances) with optimized model
- **Redis cache** for frequent queries (cache hit = 1ms vs 50ms)
- **Async processing** for batch requests via SQS

**Scaling strategy:**
- Start with 2 instances (high availability)
- Scale up when CPU > 70% or latency > 100ms
- Scale down after 5 min of low load
- Use predictive scaling for known traffic patterns

**Cost optimization:**
- Quantized model on CPU instances (not GPU)
- ONNX Runtime for 2-3x speedup
- Smart batching (batch size 8-16)
- Spot instances for batch processing

**Monitoring:**
- CloudWatch metrics: latency, error rate, cost
- Auto-rollback if error rate > 1%
- Cost alerts if projected > budget

*This handles 10M requests/day for ~$300/month with 99.9% uptime."*
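
To back up the caching point with numbers, here is a small sketch of how a cache changes average latency and the load that reaches the model. The 1 ms and 50 ms figures come from the answer above; the hit rates are hypothetical, since the real value depends on how repetitive your queries are.

```python
SECONDS_PER_DAY = 86_400


def expected_latency_ms(hit_rate: float, hit_ms: float = 1.0, miss_ms: float = 50.0) -> float:
    """Average latency when a fraction `hit_rate` of requests is served from cache."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms


def backend_rps(total_rps: float, hit_rate: float) -> float:
    """Requests per second that still reach the model after the cache."""
    return total_rps * (1 - hit_rate)


total_rps = 10_000_000 / SECONDS_PER_DAY  # average rate for 10M requests/day

for hit_rate in (0.0, 0.3, 0.6):  # hypothetical cache hit rates
    latency = expected_latency_ms(hit_rate)
    load = backend_rps(total_rps, hit_rate)
    print(f"hit rate {hit_rate:.0%}: avg latency {latency:.1f} ms, model load {load:.1f} req/s")
```

Every point of hit rate removes work from the model servers, so the cache affects both the latency figure and the instance count.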

---

### Q: "How do you safely roll out a new model?"

**Your Answer:**

*"Never replace 100% instantly! I use gradual rollout:*

**Phase 1: Canary (5%, 24 hours)**
- Route 5% of traffic to new model
- Monitor: error rate, latency, user feedback
- Automatic rollback if error rate > 2x baseline

**Phase 2: Expand (25%, 24 hours)**
- If metrics look good, increase to 25%
- Compare A/B metrics side-by-side
- Look for edge cases or specific user segments affected

**Phase 3: Majority (50%, 48 hours)**
- Increase to 50%
- Final check of all metrics
- Ensure cost is as expected

**Phase 4: Full (100%)**
- Complete migration
- Keep old model for 7 days (quick rollback)
- Monitor for regression

*I implement this with a router that uses consistent hashing, so the same user always gets the same model version for fair testing."*

---

## 🎉 You're Now a Cost Optimization Expert!

You've learned:

✅ **Cost calculation** - Know the real expenses  
✅ **Serverless deployment** - Pay only for requests  
✅ **Auto-scaling** - Handle any traffic automatically  
✅ **A/B testing** - Roll out changes safely  
✅ **Cost monitoring** - Track and optimize continuously  

### Key Takeaways:

1. **Always optimize models first** - 10x cost savings just from quantization!
2. **Match architecture to traffic** - Serverless for variable, always-on for constant
3. **Never skip A/B testing** - Bugs in production = lost money
4. **Monitor everything** - You can't optimize what you don't measure
5. **Cost per request matters** - Not total cost, but efficiency
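
The last takeaway is easy to see with numbers. In the sketch below the monthly figures are made up for illustration: the total bill rises every month, yet the cost per million requests falls, which is the trend you actually want.

```python
def cost_per_million_requests(monthly_cost_usd: float, monthly_requests: int) -> float:
    """Efficiency metric: dollars spent per one million requests served."""
    if monthly_requests <= 0:
        raise ValueError("monthly_requests must be positive")
    return monthly_cost_usd / (monthly_requests / 1_000_000)


# (month, total cost in USD, requests served) - hypothetical figures
history = [
    ("Jan", 400.0, 100_000_000),
    ("Feb", 450.0, 150_000_000),
    ("Mar", 480.0, 200_000_000),
]

for month, cost, requests in history:
    unit_cost = cost_per_million_requests(cost, requests)
    print(f"{month}: total ${cost:.2f}, ${unit_cost:.2f} per 1M requests")
```

Tracking both columns side by side keeps growth in traffic from being mistaken for waste.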

### Real-World Impact:

With these techniques, you can:
- Save $10K+/year on infrastructure
- Handle 100x traffic without code changes
- Deploy updates with confidence
- Scale from startup to enterprise

**This knowledge makes you invaluable to any company running ML in production!**

---

*Built with ❤️ for people who understand business value*