# Lab: Model Interfaces and Deployment

## Learning Objectives

- Set up and configure local inference engines (Ollama, vLLM)
- Implement OpenAI-compatible interfaces across different providers
- Build production-ready deployment architectures
- Compare performance metrics across deployment options
- Implement security, monitoring, and scaling solutions

## Prerequisites

Install required packages:

In [None]:
# Install required packages
# !pip install openai huggingface_hub vllm requests python-dotenv psutil numpy

## Setup and Configuration

Let's start by setting up our environment and API credentials:

In [None]:
import os
import json
import time
import requests
import psutil
from typing import Dict, Any, List, Optional
from dataclasses import dataclass
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# API Keys (use environment variables in production)
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-openai-key-here")
HF_TOKEN = os.getenv("HF_TOKEN", "your-huggingface-token-here")

print("Environment setup complete!")
print(f"OpenAI API Key configured: {'Yes' if OPENAI_API_KEY != 'your-openai-key-here' else 'No'}")
print(f"HuggingFace Token configured: {'Yes' if HF_TOKEN != 'your-huggingface-token-here' else 'No'}")

## Exercise 1: Universal OpenAI-Compatible Client

### Why a Universal Client? The Provider Lock-in Problem

**The Challenge:**

Different AI providers have **different APIs**:
- OpenAI: `client.chat.completions.create(...)`
- HuggingFace: `client.text_generation(...)`
- Anthropic: `client.messages.create(...)`
- Ollama: `requests.post("http://localhost:11434/api/generate", ...)`

**The Problem:**
- **Vendor Lock-in**: Hard to switch providers
- **Code Duplication**: Rewrite logic for each provider
- **Testing Complexity**: Test each provider separately
- **Maintenance Burden**: Update code in multiple places

**The Solution: Universal Client**

A **single interface** that works with multiple providers:
- ✅ **Same Code**: Write once, works everywhere
- ✅ **Easy Switching**: Change provider with one parameter
- ✅ **Unified Testing**: One test suite runs against every provider
- ✅ **Future-Proof**: Add new providers easily

---

### Understanding OpenAI-Compatible APIs

**What is "OpenAI-Compatible"?**

Many providers now offer **OpenAI-compatible APIs**:
- Same endpoint structure: `/v1/chat/completions`
- Same request format: `messages`, `model`, `temperature`
- Same response format: `choices[0].message.content`

**Why This Matters:**
- **Standardization**: Industry moving toward common interface
- **Compatibility**: Can use OpenAI SDK with other providers
- **Portability**: Easy to switch between providers
- **Ecosystem**: Works with existing tools and libraries

**Providers with OpenAI-Compatible APIs:**
- OpenAI (original)
- HuggingFace Inference API
- Ollama (local)
- vLLM (local)
- Together AI
- Anyscale
- Many more...

---

### The Universal Client Architecture

**Design Pattern: Adapter Pattern**

```
Your Application
    ↓
Universal Client (adapter)
    ↓
Provider-Specific Clients
    ↓
OpenAI / HuggingFace / Ollama / etc.
```

**Key Components:**

**1. Provider Detection**
```python
if provider == "openai":
    return OpenAI(api_key=api_key)
elif provider == "huggingface":
    return OpenAI(api_key=api_key, base_url="https://api-inference.huggingface.co/v1")
```
**Why**: Each provider needs different initialization

**2. Unified Interface**
```python
def chat_completion(messages, model, **kwargs):
    # Same interface for all providers
    return client.chat.completions.create(...)
```
**Why**: Your code doesn't need to know which provider

**3. Response Normalization**
```python
return {
    'success': True,
    'response': response.choices[0].message.content,
    'usage': {...},
    'provider': self.provider
}
```
**Why**: Consistent response format regardless of provider

---

### Handling Provider Differences

**Challenge 1: Different Base URLs**
```python
base_urls = {
    'openai': 'https://api.openai.com/v1',
    'huggingface': 'https://api-inference.huggingface.co/v1',
    'ollama': 'http://localhost:11434/v1',
    'vllm': 'http://localhost:8000/v1'
}
```
**Solution**: Map provider to base URL

**Challenge 2: Different Model Names**
```python
default_models = {
    'openai': 'gpt-3.5-turbo',
    'huggingface': 'microsoft/DialoGPT-medium',
    'ollama': 'llama2',
    'vllm': 'microsoft/DialoGPT-medium'
}
```
**Solution**: Provider-specific defaults

**Challenge 3: Different Authentication**
```python
if provider == "openai":
    client = OpenAI(api_key=api_key)
elif provider == "ollama":
    # No real auth needed, but the SDK must be pointed at the local server
    client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")
```
**Solution**: Handle auth per provider

**Challenge 4: Different Response Formats**
```python
# Normalize all responses to same format
response_data = {
    'response': extract_content(response),
    'usage': extract_usage(response),
    'model': extract_model(response)
}
```
**Solution**: Extract and normalize
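
The base URL, default model, and authentication mappings above can be folded into a single lookup table. The following is a minimal sketch, assuming a hypothetical `ProviderConfig` dataclass and `resolve_provider` helper (neither belongs to any SDK); the URLs and model names are the ones listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProviderConfig:
    """Connection settings for one provider (hypothetical helper type)."""
    base_url: str
    default_model: str
    needs_api_key: bool


# Values copied from the mappings above; adjust them to your own deployment.
PROVIDERS = {
    "openai": ProviderConfig("https://api.openai.com/v1", "gpt-3.5-turbo", True),
    "huggingface": ProviderConfig(
        "https://api-inference.huggingface.co/v1", "microsoft/DialoGPT-medium", True
    ),
    "ollama": ProviderConfig("http://localhost:11434/v1", "llama2", False),
    "vllm": ProviderConfig("http://localhost:8000/v1", "microsoft/DialoGPT-medium", False),
}


def resolve_provider(name: str, api_key: Optional[str] = None) -> dict:
    """Return keyword arguments for an OpenAI-style client constructor."""
    try:
        config = PROVIDERS[name]
    except KeyError:
        raise ValueError(f"Unsupported provider: {name}") from None
    if config.needs_api_key and not api_key:
        raise ValueError(f"Provider {name!r} requires an API key")
    # Local servers ignore the key, so a placeholder is used when none is given.
    return {"base_url": config.base_url, "api_key": api_key or name}
```

With this table, adding a provider means adding one entry rather than another `elif` branch, for example `OpenAI(**resolve_provider("ollama"))`.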

---

### Benefits of Universal Client

**1. Flexibility**
- Switch providers without code changes
- Test with different providers easily
- Use cheapest/fastest provider per use case

**2. Reliability**
- Automatic failover between providers
- Redundancy for critical applications
- Provider-agnostic error handling

**3. Cost Optimization**
- Route requests to cheapest provider
- Use local models when possible
- Balance cost vs quality

**4. Development Speed**
- Write code once
- Test with multiple providers
- Deploy with confidence

---

### Real-World Use Cases

**1. Multi-Provider Strategy**
```python
# Try OpenAI first, fallback to HuggingFace
try:
    result = universal_client.chat_completion(..., provider="openai")
except Exception:
    result = universal_client.chat_completion(..., provider="huggingface")
```
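
One caveat: the `UniversalAIClient.chat_completion` method built later in this exercise reports failures through a `success` flag in its result dictionary instead of raising, so a `try`/`except` alone would not trigger the fallback. A minimal sketch of a fallback helper that handles both cases, assuming each provider call is wrapped in a zero-argument callable (`first_successful` is a hypothetical name):

```python
from typing import Any, Callable, Dict, List, Tuple

# A provider call takes no arguments and returns a dict with a 'success' key.
ProviderCall = Callable[[], Dict[str, Any]]


def first_successful(calls: List[Tuple[str, ProviderCall]]) -> Dict[str, Any]:
    """Try each (name, call) pair in order and return the first successful result."""
    errors = {}
    for name, call in calls:
        try:
            result = call()
        except Exception as exc:  # network errors, SDK errors, ...
            errors[name] = str(exc)
            continue
        if result.get("success"):
            return result
        # The call returned normally but reported a failure.
        errors[name] = result.get("error", "unknown error")
    return {"success": False, "error": "all providers failed", "details": errors}
```

A call site could then look like `first_successful([("openai", lambda: openai_client.chat_completion(msgs)), ("huggingface", lambda: hf_client.chat_completion(msgs))])`, where the two client objects are assumed to exist already.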

**2. Cost Optimization**
```python
# Use local Ollama for development
# Use OpenAI for production
provider = "ollama" if is_development else "openai"
```

**3. A/B Testing**
```python
# Test different providers
for provider in ["openai", "huggingface", "ollama"]:
    result = universal_client.chat_completion(..., provider=provider)
    compare_results(result)
```

**4. Geographic Routing**
```python
# Use local provider in EU for GDPR compliance
provider = "ollama" if user_location == "EU" else "openai"
```

---

### Best Practices

**1. Error Handling**
- Provider-specific error messages
- Graceful degradation
- Clear error reporting

**2. Logging**
- Log which provider was used
- Track provider performance
- Monitor provider availability

**3. Configuration**
- Environment variables for API keys
- Configurable provider selection
- Easy to add new providers

**4. Testing**
- Test with all providers
- Mock providers for unit tests
- Integration tests with real providers

Now let's build the universal client:

In [None]:
from openai import OpenAI

class UniversalAIClient:
    """Universal client for multiple AI providers with OpenAI-compatible APIs."""
    
    def __init__(self, provider: str, api_key: str, base_url: str = None):
        self.provider = provider
        self.api_key = api_key
        self.base_url = base_url
        self.client = self._initialize_client()
    
    def _initialize_client(self):
        """Initialize the appropriate client based on provider."""
        if self.provider == "openai":
            return OpenAI(api_key=self.api_key)
        elif self.provider in ["huggingface", "hf"]:
            return OpenAI(
                api_key=self.api_key,
                base_url=self.base_url or "https://api-inference.huggingface.co/v1"
            )
        elif self.provider == "ollama":
            return OpenAI(
                api_key="ollama",  # Ollama doesn't require auth
                base_url=self.base_url or "http://localhost:11434/v1"
            )
        elif self.provider == "vllm":
            return OpenAI(
                api_key="vllm",  # vLLM doesn't require auth
                base_url=self.base_url or "http://localhost:8000/v1"
            )
        else:
            raise ValueError(f"Unsupported provider: {self.provider}")
    
    def chat_completion(self, messages: List[Dict[str, str]], model: str = None, **kwargs) -> Dict[str, Any]:
        """Create chat completion with unified interface."""
        # Use appropriate model for each provider
        if not model:
            model = self._get_default_model()
        
        try:
            start_time = time.time()
            response = self.client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=kwargs.get('temperature', 0.7),
                max_tokens=kwargs.get('max_tokens', 150),
                top_p=kwargs.get('top_p', 1.0),
                frequency_penalty=kwargs.get('frequency_penalty', 0),
                presence_penalty=kwargs.get('presence_penalty', 0),
                stream=kwargs.get('stream', False)
            )
            # Wall-clock latency of the API call itself, in seconds
            latency = time.time() - start_time
            
            return {
                'success': True,
                'provider': self.provider,
                'model': model,
                'response': response.choices[0].message.content,
                'usage': {
                    'prompt_tokens': getattr(response.usage, 'prompt_tokens', 0),
                    'completion_tokens': getattr(response.usage, 'completion_tokens', 0),
                    'total_tokens': getattr(response.usage, 'total_tokens', 0)
                },
                'latency': latency
            }
            
        except Exception as e:
            return {
                'success': False,
                'provider': self.provider,
                'error': str(e),
                'model': model
            }
    
    def _get_default_model(self) -> str:
        """Get default model for each provider."""
        defaults = {
            'openai': 'gpt-3.5-turbo',
            'huggingface': 'microsoft/DialoGPT-medium',
            'hf': 'microsoft/DialoGPT-medium',
            'ollama': 'llama2',
            'vllm': 'microsoft/DialoGPT-medium'
        }
        return defaults.get(self.provider, 'gpt-3.5-turbo')

# Test the universal client
print("Testing Universal AI Client...")

# Test with available providers
providers_to_test = []
if OPENAI_API_KEY != "your-openai-key-here":
    providers_to_test.append(("openai", OPENAI_API_KEY))
if HF_TOKEN != "your-huggingface-token-here":
    providers_to_test.append(("huggingface", HF_TOKEN))

test_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]

for provider, api_key in providers_to_test:
    try:
        print(f"\n--- Testing {provider.upper()} ---")
        client = UniversalAIClient(provider, api_key)
        result = client.chat_completion(test_messages)
        
        if result['success']:
            print(f"✅ Response: {result['response'][:100]}...")
            print(f"Tokens used: {result['usage']['total_tokens']}")
        else:
            print(f"❌ Error: {result['error']}")
            
    except Exception as e:
        print(f"Error testing {provider}: {e}")

## Exercise 2: HuggingFace Inference Provider Management

### Why Provider Management? The Reliability Challenge

**The Reality of Cloud Services:**

Cloud AI providers are **not 100% reliable**:
- **Rate Limits**: Too many requests → temporary blocks
- **Service Outages**: Providers go down occasionally
- **Geographic Issues**: Some regions slower than others
- **Model Availability**: Models may be temporarily unavailable

**The Impact:**
- **User Experience**: Failed requests = unhappy users
- **Business Impact**: Downtime = lost revenue
- **Reliability**: Single point of failure = risky

**The Solution: Provider Management**

Intelligent routing and failover:
- ✅ **Multiple Providers**: Don't rely on just one
- ✅ **Automatic Failover**: Switch on failure
- ✅ **Performance Tracking**: Use best-performing provider
- ✅ **Load Balancing**: Distribute requests

---

### Understanding HuggingFace Provider Options

**HuggingFace Inference API Providers:**

HuggingFace offers multiple inference providers:
- **hf-inference**: HuggingFace's own infrastructure
- **auto**: Automatically selects best available
- **Custom endpoints**: Your own deployed models

**Why Multiple Providers?**

**1. Redundancy**
- If one provider fails, others available
- No single point of failure
- Higher uptime

**2. Performance**
- Different providers have different speeds
- Geographic proximity matters
- Load balancing improves response times

**3. Cost**
- Some providers cheaper than others
- Route based on cost requirements
- Optimize spending

---

### The Failover Pattern

**How Failover Works:**

```
Request 1: Try Provider A
    ↓
Provider A fails
    ↓
Request 2: Try Provider B (failover)
    ↓
Provider B succeeds
    ↓
Continue using Provider B
```

**Implementation Strategy:**

**1. Preferred Provider List**
```python
preferred_providers = ["hf-inference", "auto"]
```
**Why**: Try best providers first

**2. Retry Logic**
```python
for attempt in range(max_retries):
    for provider in preferred_providers:
        try:
            response = client.chat_completion(..., provider=provider)
            return response  # Success!
        except Exception as e:
            continue  # Try next provider
```
**Why**: Automatic retry with different providers

**3. Exponential Backoff**
```python
time.sleep(2 ** attempt)  # Wait longer each retry
```
**Why**: Don't overwhelm failing providers
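
An uncapped `2 ** attempt` grows quickly, and many clients retrying on the same schedule tend to hit the provider at the same moment. A common refinement is to cap the delay and add random jitter. The sketch below assumes a hypothetical `backoff_delay` helper; the `base` and `cap` defaults are illustrative, not recommendations from any provider.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a delay in seconds for a zero-based retry attempt.

    The upper bound doubles with each attempt up to `cap`; the actual delay
    is drawn uniformly between 0 and that bound ("full jitter").
    """
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0, ceiling)
```

In a retry loop this would replace the fixed sleep, as in `time.sleep(backoff_delay(attempt))`.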

---

### Performance Tracking

**Why Track Performance?**

- **Optimization**: Use fastest providers
- **Monitoring**: Detect performance degradation
- **Decision Making**: Data-driven provider selection
- **Cost Analysis**: Balance speed vs cost

**Metrics to Track:**

**1. Success Rate**
```python
success_rate = successful_requests / total_requests * 100
```
**Why**: Reliability indicator

**2. Average Latency**
```python
avg_latency = sum(latencies) / len(latencies)
```
**Why**: Speed indicator

**3. Percentile Latencies**
```python
p50_latency = median(latencies)  # Typical user experience
p95_latency = percentile(latencies, 95)  # Worst-case for most users
p99_latency = percentile(latencies, 99)  # Extreme cases
```
**Why**: Understand latency distribution

**4. Error Types**
```python
error_counts = {
    'timeout': 5,
    'rate_limit': 2,
    'model_unavailable': 1
}
```
**Why**: Understand failure modes
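
Counts like these have to come from somewhere. One rough way to produce them is to classify each caught exception by type and message, sketched here with a hypothetical `classify_error` function. The matching is a heuristic on message text; real SDKs expose their own exception classes, which are a more reliable signal when available.

```python
from collections import Counter


def classify_error(exc: Exception) -> str:
    """Map an exception to a coarse error category (heuristic, message-based)."""
    text = f"{type(exc).__name__} {exc}".lower()
    if isinstance(exc, TimeoutError) or "timeout" in text or "timed out" in text:
        return "timeout"
    if "429" in text or "rate limit" in text or "ratelimit" in text:
        return "rate_limit"
    if "404" in text or "not found" in text or "unavailable" in text:
        return "model_unavailable"
    return "other"


# Tally a few simulated failures.
error_counts = Counter()
for exc in [TimeoutError("read timed out"),
            RuntimeError("429 Too Many Requests"),
            RuntimeError("model not found")]:
    error_counts[classify_error(exc)] += 1
```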

---

### Provider Selection Strategy

**Strategy 1: Fastest First**
```python
# Sort providers by average latency
providers_sorted = sorted(providers, key=lambda p: avg_latency[p])
```
**Use when**: Speed is critical

**Strategy 2: Most Reliable First**
```python
# Sort providers by success rate
providers_sorted = sorted(providers, key=lambda p: success_rate[p], reverse=True)
```
**Use when**: Reliability is critical

**Strategy 3: Cost-Optimized**
```python
# Sort providers by cost per request
providers_sorted = sorted(providers, key=lambda p: cost_per_request[p])
```
**Use when**: Cost is critical

**Strategy 4: Balanced**
```python
# Weighted score (higher is better): reward reliability, penalize latency and cost
score = (reliability_weight * success_rate) - (latency_weight * latency) - (cost_weight * cost)
```
**Use when**: Need balance
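
Latency, success rate, and cost live on very different scales, so a weighted sum only behaves sensibly once each metric is normalized and oriented so that higher is better. A minimal sketch under those assumptions (the `rank_providers` name and the metric keys are hypothetical):

```python
def rank_providers(metrics: dict, weights: dict) -> list:
    """Rank providers by a weighted score in which higher is always better.

    metrics: {provider: {"latency": seconds, "success_rate": 0-100, "cost": dollars}}
    weights: {"latency": w1, "success_rate": w2, "cost": w3}
    """
    def normalise(values, higher_is_better):
        lo, hi = min(values.values()), max(values.values())
        span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
        return {
            p: ((v - lo) / span if higher_is_better else (hi - v) / span)
            for p, v in values.items()
        }

    # Latency and cost are "lower is better"; success rate is "higher is better".
    directions = {"latency": False, "success_rate": True, "cost": False}
    scores = {p: 0.0 for p in metrics}
    for key, higher_is_better in directions.items():
        column = normalise({p: m[key] for p, m in metrics.items()}, higher_is_better)
        for p, value in column.items():
            scores[p] += weights.get(key, 0.0) * value
    return sorted(scores, key=scores.get, reverse=True)
```

Min-max normalization is only one option; it is sensitive to outliers, so a production ranking might prefer percentile-based scaling.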

---

### Real-World Failover Scenarios

**Scenario 1: Rate Limit Hit**
```
Request → Provider A (rate limited)
    ↓
Automatic failover → Provider B
    ↓
Success!
```

**Scenario 2: Service Outage**
```
Request → Provider A (timeout)
    ↓
Retry with exponential backoff
    ↓
Still failing → Switch to Provider B
    ↓
Success!
```

**Scenario 3: Geographic Routing**
```
EU User → EU Provider (faster, GDPR compliant)
US User → US Provider (faster)
```

---

### Best Practices

**1. Monitor Continuously**
- Track provider performance in real-time
- Alert on degradation
- Update provider rankings

**2. Test Failover Regularly**
- Simulate failures
- Verify failover works
- Measure failover time

**3. Document Provider Differences**
- Different capabilities
- Different costs
- Different SLAs

**4. Set Timeouts**
- Don't wait forever
- Fail fast, failover quickly
- Good user experience

Now let's implement provider management:

In [None]:
import numpy as np  # used for the latency percentiles in get_provider_statistics()

class HuggingFaceProviderManager:
    """Manages HuggingFace inference providers with failover support."""
    
    def __init__(self, token: str, preferred_providers: List[str] = None):
        self.token = token
        self.preferred_providers = preferred_providers or ["auto"]
        self.client = self._initialize_client()
        self.provider_performance = {}
    
    def _initialize_client(self):
        """Initialize HuggingFace Inference Client."""
        try:
            from huggingface_hub import InferenceClient
            return InferenceClient(token=self.token)
        except ImportError:
            print("HuggingFace Hub not installed. Install with: pip install huggingface_hub")
            return None
    
    def chat_completion_with_failover(self, messages: List[Dict[str, str]], 
                                    model: str, max_retries: int = 3) -> Dict[str, Any]:
        """Attempt chat completion with provider failover."""
        if not self.client:
            return {'success': False, 'error': 'HuggingFace client not available'}
        
        for attempt in range(max_retries):
            try:
                # Try with specified providers
                for provider in self.preferred_providers:
                    try:
                        start_time = time.time()
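                        # `provider` is an InferenceClient constructor argument,
                        # not a chat_completion() parameter
                        from huggingface_hub import InferenceClient
                        provider_client = InferenceClient(provider=provider, token=self.token)
                        response = provider_client.chat_completion(
                            messages=messages,
                            model=model,
                            max_tokens=150,
                            temperature=0.7
                        )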
                        
                        latency = time.time() - start_time
                        self._record_provider_performance(provider, latency, True)
                        
                        return {
                            'success': True,
                            'provider': provider,
                            'response': response.choices[0].message.content,
                            'latency': latency,
                            'attempt': attempt + 1
                        }
                        
                    except Exception as e:
                        print(f"Provider {provider} failed: {e}")
                        self._record_provider_performance(provider, 0, False)
                        continue
                
                # Fallback to auto selection
                start_time = time.time()
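                # `provider` is an InferenceClient constructor argument,
                # not a chat_completion() parameter
                from huggingface_hub import InferenceClient
                auto_client = InferenceClient(provider="auto", token=self.token)
                response = auto_client.chat_completion(
                    messages=messages,
                    model=model,
                    max_tokens=150,
                    temperature=0.7
                )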
                
                latency = time.time() - start_time
                self._record_provider_performance("auto", latency, True)
                
                return {
                    'success': True,
                    'provider': 'auto',
                    'response': response.choices[0].message.content,
                    'latency': latency,
                    'attempt': attempt + 1
                }
                
            except Exception as e:
                print(f"Attempt {attempt + 1} failed: {e}")
                if attempt == max_retries - 1:
                    return {
                        'success': False,
                        'error': str(e),
                        'attempts': max_retries
                    }
                time.sleep(2 ** attempt)  # Exponential backoff
        
        return {
            'success': False,
            'error': 'All providers failed after maximum retries',
            'attempts': max_retries
        }
    
    def _record_provider_performance(self, provider: str, latency: float, success: bool):
        """Record provider performance metrics."""
        if provider not in self.provider_performance:
            self.provider_performance[provider] = {
                'total_requests': 0,
                'successful_requests': 0,
                'total_latency': 0,
                'latencies': []
            }
        
        self.provider_performance[provider]['total_requests'] += 1
        if success:
            self.provider_performance[provider]['successful_requests'] += 1
            self.provider_performance[provider]['total_latency'] += latency
            self.provider_performance[provider]['latencies'].append(latency)
    
    def get_provider_statistics(self) -> Dict[str, Any]:
        """Get performance statistics for all providers."""
        stats = {}
        
        for provider, data in self.provider_performance.items():
            if data['total_requests'] > 0:
                success_rate = (data['successful_requests'] / data['total_requests']) * 100
                avg_latency = data['total_latency'] / data['successful_requests'] if data['successful_requests'] > 0 else 0
                
                stats[provider] = {
                    'total_requests': data['total_requests'],
                    'successful_requests': data['successful_requests'],
                    'success_rate': success_rate,
                    'average_latency': avg_latency,
                    'p50_latency': np.median(data['latencies']) if data['latencies'] else 0,
                    'p95_latency': np.percentile(data['latencies'], 95) if len(data['latencies']) > 1 else 0
                }
        
        return stats

# Test HuggingFace provider management
if HF_TOKEN != "your-huggingface-token-here":
    print("\n--- Testing HuggingFace Provider Management ---")
    
    hf_manager = HuggingFaceProviderManager(
        token=HF_TOKEN,
        preferred_providers=["hf-inference", "auto"]
    )
    
    test_messages = [
        {"role": "user", "content": "What are the benefits of renewable energy?"}
    ]
    
    # Test multiple requests to gather statistics
    for i in range(3):
        result = hf_manager.chat_completion_with_failover(
            test_messages,
            model="microsoft/DialoGPT-medium"
        )
        
        if result['success']:
            print(f"✅ Request {i+1}: Provider={result['provider']}, Latency={result['latency']:.2f}s")
        else:
            print(f"❌ Request {i+1}: {result['error']}")
    
    # Show provider statistics
    stats = hf_manager.get_provider_statistics()
    print("\nProvider Statistics:")
    for provider, data in stats.items():
        print(f"{provider}: {data['success_rate']:.1f}% success, {data['average_latency']:.2f}s avg latency")
else:
    print("HuggingFace token not configured. Skipping provider management test.")

## Exercise 3: Local Ollama Integration

### Why Local Integration? Revisiting the Benefits

**Local Models in Production:**

While we covered Ollama basics earlier, here we focus on **production integration**:
- **Health Checks**: Verify Ollama is running
- **Model Management**: List and select available models
- **Error Handling**: Graceful degradation if Ollama unavailable
- **Performance Monitoring**: Track local model performance

**Production Considerations:**

**1. Availability**
- Ollama must be running
- Models must be downloaded
- System resources must be sufficient

**2. Reliability**
- Local hardware can fail
- Models can be slow
- Need fallback options

**3. Monitoring**
- Track local model performance
- Monitor resource usage
- Alert on failures

---

### Health Check Pattern

**Why Health Checks?**

Before using Ollama, verify:
- ✅ Service is running
- ✅ Models are available
- ✅ System has resources
- ✅ API is responding

**Implementation:**

```python
def check_ollama_health():
    try:
        response = requests.get("http://localhost:11434/api/tags", timeout=5)
        if response.status_code == 200:
            return True, "Ollama is healthy"
        else:
            return False, f"Ollama returned {response.status_code}"
    except requests.exceptions.ConnectionError:
        return False, "Ollama is not running"
    except requests.exceptions.Timeout:
        return False, "Ollama connection timed out"
```

**When to Check:**
- **Startup**: Verify before accepting requests
- **Periodically**: Monitor ongoing health
- **Before Requests**: Quick check before each request
- **On Failure**: Verify after errors
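
Probing before every request adds a network round trip each time. One way to combine the periodic and per-request checks is to cache the probe result for a short interval. The sketch below assumes a hypothetical `CachedHealthCheck` wrapper around any probe that returns `(ok, message)`, such as `check_ollama_health` above; the 10-second default is illustrative.

```python
import time
from typing import Callable, Optional, Tuple


class CachedHealthCheck:
    """Cache a health probe's result for `ttl` seconds (hypothetical helper)."""

    def __init__(self, probe: Callable[[], Tuple[bool, str]], ttl: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self._probe = probe
        self._ttl = ttl
        self._clock = clock  # injectable so tests can control time
        self._checked_at: Optional[float] = None
        self._result: Tuple[bool, str] = (False, "not checked yet")

    def status(self) -> Tuple[bool, str]:
        """Return the cached result, re-probing once the TTL has expired."""
        now = self._clock()
        if self._checked_at is None or now - self._checked_at >= self._ttl:
            self._result = self._probe()
            self._checked_at = now
        return self._result

    def invalidate(self) -> None:
        """Force a fresh probe on the next call, e.g. after a failed request."""
        self._checked_at = None
```

Calling `invalidate()` from the request error handler covers the "On Failure" case: the next `status()` call probes again instead of trusting a stale result.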

---

### Model Discovery and Selection

**Why Model Discovery?**

Different Ollama installations have different models:
- User may have downloaded different models
- Models vary in size and capability
- Need to select appropriate model

**Implementation:**

```python
def get_available_models():
    response = requests.get("http://localhost:11434/api/tags", timeout=5)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
    models = response.json().get('models', [])
    return [model['name'] for model in models]

def select_best_model(available_models, preferred_models):
    # Try preferred models first
    for preferred in preferred_models:
        if preferred in available_models:
            return preferred
    # Fallback to first available
    return available_models[0] if available_models else None
```

**Model Selection Strategy:**
- **Preferred Models**: Try best models first
- **Fallback**: Use any available model
- **Capability Matching**: Match model to task
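
Model names returned by `/api/tags` usually carry a tag suffix such as `llama2:latest`, so an exact membership test for `llama2` can miss a model that is installed. A tag-aware variant might look like the following sketch (`match_model` is a hypothetical name):

```python
from typing import List, Optional


def match_model(available: List[str], preferred: List[str]) -> Optional[str]:
    """Return the first installed model matching a preferred name.

    A preference without a tag ("llama2") matches any tag ("llama2:latest",
    "llama2:13b"); a preference with a tag must match exactly. If nothing
    matches, fall back to the first installed model, or None if there is none.
    """
    for wanted in preferred:
        for name in available:
            base = name.split(":", 1)[0]
            if name == wanted or (":" not in wanted and base == wanted):
                return name
    return available[0] if available else None
```

Returning the full installed name (tag included) means the value can be passed straight back to the API as the `model` field.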

---

### Error Handling and Fallback

**Error Scenarios:**

**1. Ollama Not Running**
```python
except requests.exceptions.ConnectionError:
    # Fallback to cloud provider
    return use_cloud_provider()
```

**2. Model Not Available**
```python
if model not in available_models:
    # Use alternative model
    model = select_alternative_model()
```

**3. Timeout**
```python
except requests.exceptions.Timeout:
    # Ollama too slow, use cloud
    return use_cloud_provider()
```

**4. Resource Exhaustion**
```python
if system_memory < required_memory:
    # Not enough resources, use cloud
    return use_cloud_provider()
```

---

### Hybrid Local/Cloud Strategy

**When to Use Local:**
- ✅ Sensitive data (privacy)
- ✅ High volume (cost savings)
- ✅ Offline requirements
- ✅ Development/testing

**When to Use Cloud:**
- ✅ Low latency requirements
- ✅ Limited local resources
- ✅ Need latest models
- ✅ High reliability needs

**Hybrid Approach:**
```python
def route_request(request, data_sensitivity):
    if data_sensitivity == "high":
        return use_local_ollama(request)
    elif ollama_available and ollama_fast_enough:
        return use_local_ollama(request)
    else:
        return use_cloud_provider(request)
```

---

### Performance Monitoring

**Metrics to Track:**

**1. Response Time**
- Local models can be slower
- Track vs cloud providers
- Optimize if needed

**2. Resource Usage**
- CPU usage
- Memory usage
- GPU usage (if available)

**3. Availability**
- Uptime percentage
- Failure rate
- Recovery time

**4. Cost Comparison**
- Local: Hardware + electricity
- Cloud: Per-request pricing
- Calculate break-even point

Now let's test the integration:

In [None]:
def test_ollama_integration():
    """Test integration with local Ollama instance."""
    try:
        # Check if Ollama is running
        ollama_response = requests.get("http://localhost:11434/api/tags", timeout=5)
        if ollama_response.status_code == 200:
            print("Ollama is running!")
            models = ollama_response.json().get('models', [])
            print(f"Available models: {[model['name'] for model in models]}")
            
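            # Nothing to test against if no model has been pulled yet
            if not models:
                print("No models installed. Pull one with: ollama pull llama2")
                return False
            
            # Test with a simple prompt; prefer an installed llama2 variant
            test_payload = {
                "model": next((m['name'] for m in models if 'llama2' in m['name']), models[0]['name']),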
                "prompt": "Generate a simple JSON object with 'name' and 'age' fields.",
                "format": "json",
                "stream": False
            }
            
            response = requests.post(
                "http://localhost:11434/api/generate",
                json=test_payload,
                timeout=30
            )
            
            if response.status_code == 200:
                result = response.json()
                print(f"Ollama response: {result.get('response', 'No response')}")
                return True
            else:
                print(f"Ollama API error: {response.status_code}")
                return False
        else:
            print("Ollama is not responding properly")
            return False
            
    except requests.exceptions.ConnectionError:
        print("Ollama is not running. Start it with: ollama serve")
        return False
    except requests.exceptions.Timeout:
        print("Ollama connection timed out")
        return False
    except Exception as e:
        print(f"Error connecting to Ollama: {e}")
        return False

# Test Ollama integration
print("Testing Ollama integration...")
ollama_available = test_ollama_integration()

## Exercise 4: Performance Benchmarking Framework

### Why Benchmark? Making Data-Driven Decisions

**The Decision Problem:**

When choosing AI providers, you need to answer:
- **Which is fastest?** (User experience)
- **Which is cheapest?** (Cost optimization)
- **Which is most reliable?** (Uptime)
- **Which has best quality?** (Output quality)

**Without Benchmarking:**
- ❌ Guess based on marketing
- ❌ Anecdotal evidence
- ❌ Inconsistent testing
- ❌ Wrong decisions

**With Benchmarking:**
- ✅ Data-driven decisions
- ✅ Objective comparisons
- ✅ Systematic testing
- ✅ Right choices

---

### What to Benchmark

**1. Latency (Response Time)**
- **What**: Time from request to response
- **Why**: Users notice slow responses
- **How**: Measure end-to-end time
- **Target**: <2s for most use cases

**2. Throughput (Tokens/Second)**
- **What**: How fast tokens are generated
- **Why**: Affects total response time
- **How**: Tokens generated / time taken
- **Target**: Higher is better

**3. Cost per Request**
- **What**: Total cost divided by requests
- **Why**: Cost optimization critical
- **How**: Track API costs + infrastructure
- **Target**: Balance with quality

**4. Reliability (Success Rate)**
- **What**: Percentage of successful requests
- **Why**: Failures = bad user experience
- **How**: Successful requests / total requests
- **Target**: >99% for production

**5. Quality (Output Quality)**
- **What**: How good are the outputs?
- **Why**: A fast, cheap answer has little value if it is wrong
- **How**: Human evaluation or automated metrics
- **Target**: Depends on use case

---

### Benchmarking Methodology

**1. Test Design**

**Diverse Test Prompts:**
- Different lengths
- Different complexities
- Different domains
- Real-world scenarios

**Why Diversity Matters:**
- Some providers better at certain tasks
- Avoid bias toward one provider
- Real-world performance

**2. Statistical Rigor**

**Multiple Runs:**
- Run each test multiple times
- Account for variability
- Calculate confidence intervals

**Why Multiple Runs:**
- AI outputs are probabilistic
- Network conditions vary
- Need statistical significance
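
As a rough illustration of the confidence-interval step, the sketch below computes a mean latency with an approximate 95% interval using only the standard library. It assumes the normal approximation (`z = 1.96`); the function name is hypothetical.

```python
import math
import statistics


def mean_confidence_interval(samples, z: float = 1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval.

    Uses the normal approximation. For fewer than about 30 samples, a
    Student-t critical value would give a wider, more honest interval.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean = statistics.mean(samples)
    # Standard error of the mean, scaled by the critical value
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, half_width
```

If the intervals of two providers overlap heavily, the measured difference between them may be noise, and more runs are needed before drawing a conclusion.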

**3. Controlled Environment**

**Same Conditions:**
- Same test prompts
- Same time of day (if network matters)
- Same hardware (for local)
- Same parameters

**Why Control Matters:**
- Fair comparison
- Reproducible results
- Valid conclusions

---

### Understanding Latency Metrics

**Average Latency:**
```python
avg_latency = sum(latencies) / len(latencies)
```
**Use for**: Overall performance indicator

**Median Latency (P50):**
```python
median_latency = median(latencies)
```
**Use for**: Typical user experience

**P95 Latency:**
```python
p95_latency = percentile(latencies, 95)
```
**Use for**: Worst-case for most users (5% are slower)

**P99 Latency:**
```python
p99_latency = percentile(latencies, 99)
```
**Use for**: Extreme cases (1% are slower)

**Why Percentiles Matter:**
- Average can hide outliers
- P95/P99 show worst-case scenarios
- Users remember bad experiences
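
To see how an average can hide outliers, here is a small sketch on made-up latencies. The helper `nearest_rank_percentile` is a hypothetical name; it uses the nearest-rank definition, which always returns an observed value and can differ slightly from interpolating implementations such as `np.percentile`.

```python
import statistics

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    # Ceiling of pct * n / 100 using integer arithmetic
    rank = -(-pct * len(ordered) // 100)
    return ordered[rank - 1]

# Hypothetical sample: 18 fast requests, one slightly slow, one very slow
latencies = [0.5] * 18 + [0.6, 8.0]

mean_latency = statistics.mean(latencies)
median_latency = statistics.median(latencies)
p95_latency = nearest_rank_percentile(latencies, 95)
p99_latency = nearest_rank_percentile(latencies, 99)

print(f"mean={mean_latency:.2f}s median={median_latency:.2f}s "
      f"p95={p95_latency:.2f}s p99={p99_latency:.2f}s")
```

In this sample the mean sits well above the median because of a single slow request, and only P99 reveals how slow that request actually was.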

---

### Cost Analysis

**Total Cost of Ownership (TCO):**

**Cloud Providers:**
```
Cost = (requests × cost_per_request) + infrastructure
```

**Local Models:**
```
Cost = hardware_cost + electricity + maintenance
```

**Break-Even Analysis:**
```
Break-even requests = hardware_cost / (cloud_cost_per_request - local_cost_per_request)
```

**Example:**
- Hardware: $5,000
- Cloud: $0.01/request
- Local: $0.001/request (electricity)
- Break-even: 5,000 / (0.01 - 0.001) = ~555,556 requests
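
The same arithmetic as a small function, using the figures from the example. The function name is hypothetical, and the sketch deliberately ignores maintenance and hardware depreciation, so treat the result as a lower bound rather than a full TCO calculation.

```python
import math

def break_even_requests(hardware_cost, cloud_cost_per_request, local_cost_per_request):
    """Number of requests after which local hardware pays for itself."""
    savings_per_request = cloud_cost_per_request - local_cost_per_request
    if savings_per_request <= 0:
        raise ValueError("local must be cheaper per request to break even")
    # Round up: a fraction of a request cannot be served
    return math.ceil(hardware_cost / savings_per_request)

requests_needed = break_even_requests(5000, 0.01, 0.001)
print(f"Break-even after {requests_needed:,} requests")
```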

---

### Benchmarking Best Practices

**1. Test Realistic Workloads**
- Use actual production prompts
- Test at production scale
- Include edge cases

**2. Measure What Matters**
- Don't optimize wrong metrics
- Focus on user experience
- Balance multiple factors

**3. Document Everything**
- Test conditions
- Provider versions
- System configuration
- Results and analysis

**4. Regular Re-benchmarking**
- Providers improve over time
- New models released
- Your needs change
- Keep benchmarks current

---

### Interpreting Results

**Example Results:**

```
Provider A:
  Avg Latency: 1.2s
  P95 Latency: 2.5s
  Cost: $0.01/request
  Success Rate: 99.5%

Provider B:
  Avg Latency: 0.5s
  P95 Latency: 1.0s
  Cost: $0.05/request
  Success Rate: 99.9%

Provider C:
  Avg Latency: 3.0s
  P95 Latency: 5.0s
  Cost: $0.001/request
  Success Rate: 98.0%
```

**Decision Framework:**
- **Speed critical**: Choose Provider B
- **Cost critical**: Choose Provider C
- **Balanced**: Choose Provider A

**Trade-offs:**
- Speed vs Cost
- Quality vs Speed
- Reliability vs Cost
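
One way to make the decision framework explicit is to state your limits and pick the cheapest provider that satisfies them. The sketch below applies that idea to the example figures above; `ProviderResult` and `cheapest_within` are hypothetical names, and "cheapest within constraints" is only one possible selection rule.

```python
from dataclasses import dataclass

@dataclass
class ProviderResult:
    name: str
    p95_latency_s: float
    cost_per_request: float
    success_rate: float  # percent

# Figures copied from the example results above
RESULTS = [
    ProviderResult("A", 2.5, 0.01, 99.5),
    ProviderResult("B", 1.0, 0.05, 99.9),
    ProviderResult("C", 5.0, 0.001, 98.0),
]

def cheapest_within(results, max_p95_s=float("inf"), min_success_rate=0.0):
    """Cheapest provider that meets the latency and reliability limits, or None."""
    eligible = [
        r for r in results
        if r.p95_latency_s <= max_p95_s and r.success_rate >= min_success_rate
    ]
    return min(eligible, key=lambda r: r.cost_per_request) if eligible else None

speed_choice = cheapest_within(RESULTS, max_p95_s=1.0)
cost_choice = cheapest_within(RESULTS)
balanced_choice = cheapest_within(RESULTS, max_p95_s=3.0, min_success_rate=99.0)
print(speed_choice.name, cost_choice.name, balanced_choice.name)
```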

Now let's build the benchmarking framework:

In [None]:
class InferenceBenchmark:
    """Comprehensive benchmarking framework for AI inference engines."""
    
    def __init__(self):
        self.results = {}
        self.test_prompts = self.load_test_prompts()
    
    def load_test_prompts(self):
        """Load diverse test prompts for benchmarking."""
        return [
            "Write a short story about artificial intelligence.",
            "Explain quantum computing in simple terms.",
            "What are the benefits of renewable energy?",
            "How does machine learning work?",
            "Describe the process of photosynthesis.",
            "What are the key principles of software engineering?",
            "Explain the difference between SQL and NoSQL databases.",
            "How do neural networks learn?",
            "What is the importance of data structures?",
            "Describe a sustainable city of the future."
        ]
    
    def benchmark_openai(self):
        """Benchmark OpenAI API."""
        try:
            from openai import OpenAI
            client = OpenAI()
            
            results = {
                'latencies': [],
                'token_counts': [],
                'throughput': [],
                'cost_estimates': []
            }
            
            for prompt in self.test_prompts:
                start_time = time.time()
                
                response = client.chat.completions.create(
                    model="gpt-3.5-turbo",
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=150
                )
                
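                latency = time.time() - start_time
                token_count = response.usage.total_tokens
                # Throughput counts generated tokens only; prompt tokens are not generated
                generated_tokens = response.usage.completion_tokens
                
                results['latencies'].append(latency)
                results['token_counts'].append(token_count)
                results['throughput'].append(generated_tokens / latency)
                # Rough estimate assuming a flat $0.002 per 1K tokens; check current pricing
                results['cost_estimates'].append(token_count * 0.002 / 1000)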
            
            return self.summarize_results(results)
            
        except Exception as e:
            return {'error': str(e)}
    
    def benchmark_huggingface(self):
        """Benchmark HuggingFace Inference API."""
        try:
            from huggingface_hub import InferenceClient
            client = InferenceClient(token=HF_TOKEN)
            
            results = {
                'latencies': [],
                'token_counts': [],
                'throughput': []
            }
            
            for prompt in self.test_prompts:
                start_time = time.time()
                
                response = client.text_generation(
                    prompt,
                    model="microsoft/DialoGPT-medium",
                    max_new_tokens=150
                )
                
                latency = time.time() - start_time
                # Word count is only a rough proxy for tokens; no tokenizer is used here
                token_count = len(response.split())
                
                results['latencies'].append(latency)
                results['token_counts'].append(token_count)
                results['throughput'].append(token_count / latency)
            
            return self.summarize_results(results)
            
        except Exception as e:
            return {'error': str(e)}
    
    def summarize_results(self, results: Dict[str, List]) -> Dict[str, Any]:
        """Summarize benchmark results."""
        import numpy as np
        
        summary = {}
        for metric, values in results.items():
            if values:
                summary[metric] = {
                    'mean': np.mean(values),
                    'median': np.median(values),
                    'std': np.std(values),
                    'min': np.min(values),
                    'max': np.max(values)
                }
        
        return summary
    
    def run_comprehensive_benchmark(self):
        """Run comprehensive benchmark across all available providers."""
        print("Running comprehensive benchmark...")
        
        # Test OpenAI if available
        if OPENAI_API_KEY != "your-openai-key-here":
            print("\n--- Benchmarking OpenAI ---")
            openai_results = self.benchmark_openai()
            if 'error' not in openai_results:
                self.results['openai'] = openai_results
                print("✅ OpenAI benchmark completed")
            else:
                print(f"❌ OpenAI benchmark failed: {openai_results['error']}")
        
        # Test HuggingFace if available
        if HF_TOKEN != "your-huggingface-token-here":
            print("\n--- Benchmarking HuggingFace ---")
            hf_results = self.benchmark_huggingface()
            if 'error' not in hf_results:
                self.results['huggingface'] = hf_results
                print("✅ HuggingFace benchmark completed")
            else:
                print(f"❌ HuggingFace benchmark failed: {hf_results['error']}")
        
        return self.results

# Run comprehensive benchmark
benchmark = InferenceBenchmark()
results = benchmark.run_comprehensive_benchmark()

# Display results
print("\n" + "="*50)
print("BENCHMARK RESULTS SUMMARY")
print("="*50)

for provider, metrics in results.items():
    print(f"\n{provider.upper()}:")
    for metric, stats in metrics.items():
        print(f"  {metric}:")
        print(f"    Mean: {stats['mean']:.3f}")
        print(f"    Median: {stats['median']:.3f}")
        print(f"    Std Dev: {stats['std']:.3f}")
        print(f"    Range: {stats['min']:.3f} - {stats['max']:.3f}")

## Exercise 5: Production Deployment Architecture

### What Makes Deployment "Production-Ready"?

**Development vs Production:**

**Development:**
- ✅ Works on your machine
- ✅ Manual testing
- ✅ No monitoring
- ✅ Single user

**Production:**
- ✅ Works reliably for thousands of users
- ✅ Automated testing and validation
- ✅ Comprehensive monitoring
- ✅ Handles failures gracefully

**Production Requirements:**

1. **Reliability**: 99.9%+ uptime
2. **Performance**: Fast response times
3. **Scalability**: Handle growing load
4. **Monitoring**: Know what's happening
5. **Security**: Protect data and systems
6. **Maintainability**: Easy to update and fix

---

### The Production Architecture

**Core Components:**

**1. Application Layer**
- Your AI application code
- Request handling
- Response formatting

**2. Provider Management**
- Multi-provider support
- Failover logic
- Load balancing

**3. Monitoring Layer**
- Health checks
- Performance metrics
- Error tracking
- Alerting

**4. Infrastructure Layer**
- Servers/containers
- Load balancers
- Databases
- Caching

---

### Health Check System

**Why Health Checks?**

Health checks verify:
- ✅ Services are running
- ✅ Providers are available
- ✅ System has resources
- ✅ Dependencies are working

**Types of Health Checks:**

**1. Liveness Check**
```python
def liveness_check():
    # Is the service running?
    return service_is_running()
```
**Purpose**: Detect if service crashed
**Action**: Restart if failed

**2. Readiness Check**
```python
def readiness_check():
    # Can the service handle requests?
    return providers_available() and resources_sufficient()
```
**Purpose**: Detect whether the service is ready to handle traffic
**Action**: Don't route traffic if not ready

**3. Startup Probe**
```python
def startup_check():
    # Has the service finished starting?
    return initialization_complete()
```
**Purpose**: Detect slow startups
**Action**: Give more time or restart
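
The three checks above can be sketched in one small class. This is a minimal illustration, not a prescribed design: `ServiceHealth`, `mark_initialized`, and the provider check callables are hypothetical names, and the readiness rule (at least one provider responding) is an assumption you may want to tighten.

```python
from typing import Callable, Dict

class ServiceHealth:
    """Tracks the state needed to answer the three check types."""

    def __init__(self, provider_checks: Dict[str, Callable[[], bool]]):
        self.provider_checks = provider_checks  # name -> callable returning bool
        self.initialized = False

    def mark_initialized(self) -> None:
        """Call once model loading and other startup work has finished."""
        self.initialized = True

    def liveness(self) -> bool:
        # If this code runs at all, the process is alive
        return True

    def startup(self) -> bool:
        return self.initialized

    def readiness(self) -> bool:
        # Ready only after startup and when at least one provider responds
        if not self.initialized:
            return False
        for check in self.provider_checks.values():
            try:
                if check():
                    return True
            except Exception:
                continue  # a failing check counts as "not available"
        return False

health = ServiceHealth({"primary": lambda: False, "backup": lambda: True})
before = (health.liveness(), health.startup(), health.readiness())
health.mark_initialized()
after = (health.liveness(), health.startup(), health.readiness())
print(before, after)
```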

---

### Monitoring and Metrics

**What to Monitor:**

**1. Request Metrics**
- Total requests
- Successful requests
- Failed requests
- Request rate (requests/second)

**2. Performance Metrics**
- Average latency
- P50/P95/P99 latencies
- Throughput
- Error rate

**3. System Metrics**
- CPU usage
- Memory usage
- Disk usage
- Network usage

**4. Business Metrics**
- Cost per request
- User satisfaction
- Conversion rates
- Revenue impact

---

### Alerting Strategy

**When to Alert:**

**Critical Alerts (Immediate Action):**
- Service down
- Error rate > 5%
- Latency > 10s
- Cost spike

**Warning Alerts (Investigate):**
- Error rate > 1%
- Latency > 2s
- Resource usage > 80%
- Unusual patterns

**Info Alerts (Monitor):**
- New deployment
- Configuration changes
- Performance trends

**Alert Best Practices:**
- **Actionable**: Alert should trigger action
- **Not Noisy**: Don't alert on everything
- **Context**: Include relevant information
- **Escalation**: Critical alerts need immediate attention
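
The critical and warning thresholds listed above can be expressed as a small classifier. This is a sketch: the function name and argument names are hypothetical, and cost spikes and "unusual patterns" are left out because they need a baseline to compare against.

```python
def classify_alert(error_rate_pct, latency_s, resource_usage_pct, service_up=True):
    """Map current metrics to 'critical', 'warning', or 'ok'."""
    # Critical: service down, error rate > 5%, or latency > 10s
    if not service_up or error_rate_pct > 5 or latency_s > 10:
        return "critical"
    # Warning: error rate > 1%, latency > 2s, or resource usage > 80%
    if error_rate_pct > 1 or latency_s > 2 or resource_usage_pct > 80:
        return "warning"
    return "ok"

print(classify_alert(error_rate_pct=0.5, latency_s=1.0, resource_usage_pct=50))
print(classify_alert(error_rate_pct=2.0, latency_s=1.0, resource_usage_pct=50))
print(classify_alert(error_rate_pct=6.0, latency_s=1.0, resource_usage_pct=50))
```

Checking the critical conditions first means a metric that crosses both thresholds is reported once, at the higher severity.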

---

### Scaling Strategies

**Horizontal Scaling (Scale Out):**
- Add more servers/instances
- Distribute load across instances
- **Pros**: No downtime, handle more load
- **Cons**: More complex, need load balancer

**Vertical Scaling (Scale Up):**
- Increase server resources (CPU, RAM)
- **Pros**: Simple, no code changes
- **Cons**: Limited by hardware, usually requires downtime to resize

**Auto-Scaling:**
```python
if cpu_usage > 80:      # percent
    add_instance()
elif cpu_usage < 30:    # percent
    remove_instance()
```
**Benefits**: Automatic, cost-efficient
**Challenges**: Configuration, scaling delays
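
A slightly fuller sketch of the same rule, with minimum and maximum instance counts so the scaler can neither scale to zero nor grow without limit. The function name, the bounds, and the one-step-at-a-time policy are assumptions for illustration; managed auto-scalers typically add cooldown periods as well.

```python
def desired_instances(current, cpu_usage_pct, min_instances=1, max_instances=10,
                      scale_up_at=80, scale_down_at=30):
    """Return the instance count to aim for, moving one step at a time."""
    if cpu_usage_pct > scale_up_at:
        target = current + 1
    elif cpu_usage_pct < scale_down_at:
        target = current - 1
    else:
        target = current
    # Clamp so we never drop below the minimum or exceed the cap
    return max(min_instances, min(max_instances, target))

print(desired_instances(current=2, cpu_usage_pct=90))
print(desired_instances(current=2, cpu_usage_pct=20))
print(desired_instances(current=1, cpu_usage_pct=20))
```

The gap between the two thresholds matters: if they were close together, a small fluctuation in CPU usage could cause instances to be added and removed repeatedly.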

---

### Error Handling and Recovery

**Error Categories:**

**1. Transient Errors**
- Network timeouts
- Rate limits
- Temporary unavailability
- **Action**: Retry with backoff

**2. Permanent Errors**
- Invalid API key
- Model not found
- Malformed request
- **Action**: Return error, don't retry

**3. Resource Exhaustion**
- Out of memory
- Too many connections
- **Action**: Scale up or throttle

**Recovery Strategies:**

**1. Retry with Exponential Backoff**
```python
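for attempt in range(max_retries):
    try:
        return make_request()
    except TransientError:
        if attempt == max_retries - 1:
            raise  # Out of retries: surface the error instead of returning None
        time.sleep(2 ** attempt)  # Wait 1s, 2s, 4s, ... between attempts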
```

**2. Circuit Breaker**
```python
if error_rate > threshold:
    stop_sending_requests()
    wait_for_recovery()
    try_again()
```
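
A minimal runnable version of the circuit breaker idea is sketched below. Note one simplification: it trips on a count of consecutive failures rather than on an error rate, which is easier to show in a few lines. The class name, parameter names, and the injectable `clock` (used here so the example does not have to sleep) are all hypothetical.

```python
import time

class CircuitBreaker:
    """Opens after consecutive failures, then allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, recovery_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let a trial request through (half-open)
        return self.clock() - self.opened_at >= self.recovery_timeout_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # (Re)start the cooldown from the most recent failure
            self.opened_at = self.clock()

# Usage with a fake clock so the cooldown can be simulated instantly
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, recovery_timeout_s=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print("allowed right after tripping:", breaker.allow_request())
now[0] = 10.0
print("allowed after cooldown:", breaker.allow_request())
```

The caller is expected to check `allow_request()` before each call and report the outcome with `record_success()` or `record_failure()`.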

**3. Graceful Degradation**
```python
if primary_provider_fails():
    use_backup_provider()
    notify_team()
```
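
Graceful degradation across providers might look like the sketch below. The provider callables are stand-ins defined only for the example, and `call_with_fallback` is a hypothetical helper; a real version would catch provider-specific exceptions instead of `Exception` and would notify the team when a fallback is used.

```python
from typing import Callable, List, Tuple

def call_with_fallback(providers: List[Tuple[str, Callable[[str], str]]], prompt: str):
    """Try each provider in order; return (provider_name, response)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in real code, catch provider-specific errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Stand-in providers for the example
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")

def backup(prompt: str) -> str:
    return f"backup answer to: {prompt}"

used, answer = call_with_fallback(
    [("primary", flaky_primary), ("backup", backup)], "Hello"
)
print(used, "->", answer)
```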

---

### Security Considerations

**1. API Key Management**
- Store in environment variables
- Use secret management services
- Rotate keys regularly
- Never commit to version control

**2. Data Privacy**
- Encrypt sensitive data
- Use local models for sensitive data
- Comply with regulations (GDPR, HIPAA)
- Audit data access

**3. Rate Limiting**
- Prevent abuse
- Protect from DDoS
- Fair resource usage
- Cost control

**4. Input Validation**
- Validate all inputs
- Sanitize user data
- Prevent injection attacks
- Size limits
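
For the rate limiting item, a token bucket is one common approach: each client gets a bucket that refills at a steady rate and allows short bursts up to its capacity. The sketch below is a single-process illustration with hypothetical names; it is not thread-safe, and a multi-instance deployment would need shared state.

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the example can simulate time
        self.tokens = float(capacity)
        self.last_refill = clock()

    def allow(self, cost=1.0) -> bool:
        now = self.clock()
        # Add tokens for the elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Usage with a fake clock: 1 request/second, bursts of up to 2
now = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: now[0])
burst = [bucket.allow() for _ in range(3)]
now[0] = 1.0
after_wait = bucket.allow()
print(burst, after_wait)
```

Passing a larger `cost` for expensive requests (for example, long prompts) lets the same bucket double as a simple cost control.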

---

### Deployment Checklist

**Before Production:**

- [ ] **Testing**: Comprehensive test suite
- [ ] **Monitoring**: Metrics and alerting set up
- [ ] **Documentation**: Runbooks and procedures
- [ ] **Security**: Keys, encryption, access control
- [ ] **Scaling**: Auto-scaling configured
- [ ] **Backup**: Disaster recovery plan
- [ ] **Rollback**: Can revert if needed
- [ ] **Load Testing**: Test under production load

**Ongoing:**

- [ ] **Monitor**: Watch metrics continuously
- [ ] **Update**: Keep dependencies current
- [ ] **Optimize**: Improve based on data
- [ ] **Document**: Update runbooks
- [ ] **Review**: Regular architecture reviews

Now let's build the production deployment system:

In [None]:
class ProductionDeployment:
    """Production-ready deployment architecture with monitoring and scaling."""
    
    def __init__(self):
        self.health_status = {}
        self.request_metrics = {
            'total_requests': 0,
            'successful_requests': 0,
            'failed_requests': 0,
            'average_latency': 0,
            'p95_latency': 0,
            'p99_latency': 0
        }
        self.latencies = []
    
    def health_check(self, provider: str, client) -> Dict[str, Any]:
        """Perform health check on a provider."""
        try:
            start_time = time.time()
            
            # Simple health check request
            response = client.chat_completion([
                {"role": "user", "content": "Health check"}
            ])
            
            latency = time.time() - start_time
            
            health_status = {
                'status': 'healthy' if response['success'] else 'unhealthy',
                'latency': latency,
                'timestamp': time.time(),
                'error': response.get('error', None)
            }
            
            self.health_status[provider] = health_status
            return health_status
            
        except Exception as e:
            health_status = {
                'status': 'unhealthy',
                'latency': -1,
                'timestamp': time.time(),
                'error': str(e)
            }
            self.health_status[provider] = health_status
            return health_status
    
    def record_request_metrics(self, success: bool, latency: float):
        """Record request metrics for monitoring."""
        self.request_metrics['total_requests'] += 1
        
        if success:
            self.request_metrics['successful_requests'] += 1
            self.latencies.append(latency)
            
            # Update latency metrics
            if self.latencies:
                import numpy as np
                self.request_metrics['average_latency'] = np.mean(self.latencies)
                self.request_metrics['p95_latency'] = np.percentile(self.latencies, 95)
                self.request_metrics['p99_latency'] = np.percentile(self.latencies, 99)
        else:
            self.request_metrics['failed_requests'] += 1
    
    def get_system_metrics(self) -> Dict[str, Any]:
        """Get current system metrics."""
        import psutil
        
        return {
            'cpu_percent': psutil.cpu_percent(interval=1),
            'memory_percent': psutil.virtual_memory().percent,
            'disk_usage': psutil.disk_usage('/').percent,
            # Note: on macOS, net_connections() raises psutil.AccessDenied unless run as root
            'network_connections': len(psutil.net_connections()),
            'process_count': len(psutil.pids())
        }
    
    def generate_deployment_report(self) -> Dict[str, Any]:
        """Generate comprehensive deployment report."""
        success_rate = (self.request_metrics['successful_requests'] / 
                       max(self.request_metrics['total_requests'], 1)) * 100
        
        return {
            'timestamp': time.time(),
            'health_status': self.health_status,
            'request_metrics': self.request_metrics,
            'system_metrics': self.get_system_metrics(),
            'success_rate': success_rate,
            'recommendations': self.generate_recommendations()
        }
    
    def generate_recommendations(self) -> List[str]:
        """Generate deployment recommendations based on metrics."""
        recommendations = []
        
        # Check success rate
        success_rate = (self.request_metrics['successful_requests'] / 
                       max(self.request_metrics['total_requests'], 1)) * 100
        
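        # Skip this check until at least one request has been recorded
        if self.request_metrics['total_requests'] > 0 and success_rate < 95: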
            recommendations.append("Consider implementing circuit breaker pattern for failed requests")
        
        # Check latency
        if self.request_metrics['p95_latency'] > 2.0:
            recommendations.append("P95 latency is high. Consider scaling up resources or optimizing model")
        
        # Check system resources
        system_metrics = self.get_system_metrics()
        
        if system_metrics['cpu_percent'] > 80:
            recommendations.append("High CPU usage detected. Consider horizontal scaling")
        
        if system_metrics['memory_percent'] > 85:
            recommendations.append("High memory usage detected. Consider increasing memory allocation")
        
        return recommendations

# Test production deployment monitoring
print("Testing Production Deployment Monitoring...")

deployment = ProductionDeployment()

# Test with available providers
if OPENAI_API_KEY != "your-openai-key-here":
    print("\n--- Testing OpenAI Health Check ---")
    openai_client = UniversalAIClient("openai", OPENAI_API_KEY)
    health_status = deployment.health_check("openai", openai_client)
    print(f"OpenAI Health: {health_status['status']}, Latency: {health_status['latency']:.3f}s")
    
    # Record some test metrics
    deployment.record_request_metrics(True, 0.5)
    deployment.record_request_metrics(True, 0.7)
    deployment.record_request_metrics(False, 1.2)

# Generate deployment report
report = deployment.generate_deployment_report()
print("\n--- Deployment Report ---")
print(f"Success Rate: {report['success_rate']:.1f}%")
print(f"Total Requests: {report['request_metrics']['total_requests']}")
print(f"Average Latency: {report['request_metrics']['average_latency']:.3f}s")
print(f"P95 Latency: {report['request_metrics']['p95_latency']:.3f}s")

if report['recommendations']:
    print("\nRecommendations:")
    for rec in report['recommendations']:
        print(f"  - {rec}")
else:
    print("\nNo recommendations - system is performing well!")

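print("\nSystem Metrics:")
# Only some metrics are percentages; connection and process counts are plain numbers
percent_metrics = {'cpu_percent', 'memory_percent', 'disk_usage'}
for metric, value in report['system_metrics'].items():
    unit = '%' if metric in percent_metrics else ''
    print(f"  {metric}: {value}{unit}")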

## Summary and Next Steps

Congratulations! You've completed the Model Interfaces and Deployment lab. Here's what you've learned:

### Key Skills Acquired:
- ✅ Universal OpenAI-compatible client implementation
- ✅ HuggingFace provider management with failover
- ✅ Local Ollama integration and testing
- ✅ Comprehensive performance benchmarking
- ✅ Production deployment monitoring and health checks
- ✅ System metrics collection and analysis

### Best Practices:
- Always implement proper error handling and retry mechanisms
- Use provider failover for high availability
- Monitor system resources and performance metrics
- Implement health checks for all services
- Use environment variables for sensitive configuration
- Design for scalability from the start

### Production Considerations:
- **Security**: Implement proper authentication and authorization
- **Monitoring**: Set up comprehensive logging and alerting
- **Scaling**: Design for horizontal scaling with load balancing
- **Backup**: Implement backup and disaster recovery procedures
- **Compliance**: Ensure compliance with data privacy regulations

### Next Steps:
1. Set up a local Ollama instance and test with different models
2. Implement a load balancer for multiple inference engines
3. Add comprehensive logging and monitoring dashboards
4. Create automated deployment pipelines
5. Implement A/B testing for model performance comparison
6. Add support for streaming responses
7. Implement rate limiting and quota management
8. Set up automated scaling based on demand

Remember: Production deployment requires careful planning, monitoring, and continuous optimization. Always test thoroughly in staging environments before deploying to production!