# Module 4: The Evolution to Production

## CML Models (Development) ‚Üí AI Inference Service (Production)

### The Complete MLOps Journey

- **Module 1**: Deployed sklearn model to CML Models for development/testing
- **Module 2**: Detected data drift signaling model degradation  
- **Module 3**: Retrained model and converted to ONNX format
- **Module 4**: Deployed ONNX model to production AI Inference Service ‚Üê **We are here**

### What We'll Compare

1. **Authentication**: API Key vs JWT Token
2. **API Protocol**: Custom REST vs Open Inference Standard
3. **Performance**: Latency and throughput measurements
4. **Operations**: Monitoring and enterprise capabilities

This hands-on comparison gives you the "before/after" story to tell customers about production ML deployment.

---

## Setup: Import Libraries

In [1]:
import pandas as pd
import numpy as np
import requests
import json
import time
import os
import sys
from datetime import datetime

# For AI Inference Service
import httpx
from open_inference.openapi.client import OpenInferenceClient, InferenceRequest

# Add module1 to path for preprocessing utilities
sys.path.insert(0, os.path.abspath('..'))
# Add module1 for helpers  
sys.path.insert(0, os.path.abspath('../module1'))


print("‚úÖ Libraries imported successfully")
print(f"   Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

‚úÖ Libraries imported successfully
   Current time: 2025-12-16 05:35:49


## Load and Prepare Test Data

We need two different data formats:
- **Module 1 (CML)**: Expects fully preprocessed data (64 features after one-hot encoding + feature engineering)
- **Module 4 (ONNX)**: Expects original 20 features with underscores (no engineering, no encoding)

In [2]:
from helpers import PreprocessingPipeline, FeatureEngineer

# Load raw inference data
df_raw = pd.read_csv("../module1/inference_data/raw_inference_data.csv", sep=";")
print(f"‚úÖ Loaded raw data: {df_raw.shape}")

# Load training data for preprocessing fit
df_train = pd.read_csv("../module1/data/bank-additional/bank-additional-full.csv", sep=";")
print(f"‚úÖ Loaded training data: {df_train.shape}")

# Apply feature engineering
fe = FeatureEngineer()
df_train_eng = fe.transform(df_train)
df_engineered = fe.transform(df_raw)

print(f"‚úÖ Feature engineering applied")
print(f"   Engineered features: {df_engineered.shape}")

‚úÖ Loaded raw data: (1000, 20)
‚úÖ Loaded training data: (41188, 21)
‚úÖ Feature engineering applied
   Engineered features: (1000, 24)


In [3]:
# Define features for preprocessing
numeric_features = [
    'age', 'duration', 'campaign', 'pdays', 'previous',
    'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed',
    'engagement_score'
]

categorical_features = [
    'job', 'marital', 'education', 'default',
    'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome',
    'age_group', 'emp_var_category', 'duration_category'
]

# Create and fit preprocessor on training data (for CML endpoint)
preprocessor = PreprocessingPipeline(
    numeric_features=numeric_features,
    categorical_features=categorical_features,
    include_engagement=True
)

X_train_full = df_train_eng[numeric_features + categorical_features].copy()
preprocessor.fit(X_train_full)

# Transform inference data for CML endpoint (fully preprocessed)
X_engineered = df_engineered[numeric_features + categorical_features].copy()
X_processed = pd.DataFrame(
    preprocessor.transform(X_engineered),
    columns=preprocessor.get_feature_names()
)

print(f"\n‚úÖ CML preprocessing complete")
print(f"   CML format (one-hot encoded): {X_processed.shape}")

# ============================================================================
# ONNX Model Data Preparation
# ============================================================================
# The ONNX model expects the ORIGINAL 20 features (no engineering) with underscores

onnx_numeric_features = [
    'age', 'duration', 'campaign', 'pdays', 'previous',
    'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed'
]

onnx_categorical_features = [
    'job', 'marital', 'education', 'default', 'housing', 
    'loan', 'contact', 'month', 'day_of_week', 'poutcome'
]

# Rename columns to match ONNX expectations (dots ‚Üí underscores)
df_onnx = df_raw.copy()
df_onnx = df_onnx.rename(columns={
    'emp.var.rate': 'emp_var_rate',
    'cons.price.idx': 'cons_price_idx',
    'cons.conf.idx': 'cons_conf_idx',
    'nr.employed': 'nr_employed'
})

# Select only the features the ONNX model expects
X_onnx = df_onnx[onnx_numeric_features + onnx_categorical_features].copy()

print(f"\n‚úÖ ONNX preprocessing complete")
print(f"   ONNX format (original features): {X_onnx.shape}")
print(f"   Features: {len(onnx_numeric_features)} numerical + {len(onnx_categorical_features)} categorical = 20 total")
print(f"\nüìä Ready to test {len(X_processed)} samples")


‚úÖ CML preprocessing complete
   CML format (one-hot encoded): (1000, 64)

‚úÖ ONNX preprocessing complete
   ONNX format (original features): (1000, 20)
   Features: 10 numerical + 10 categorical = 20 total

üìä Ready to test 1000 samples


---

# Section 1: Baseline - CML Model Endpoint

## What is CML Models?

CML Models is designed for **development and testing**:
- ‚úÖ Quick deployment for data science experimentation
- ‚úÖ Custom REST API with simple access key auth
- ‚úÖ Good for small-to-medium workloads
- ‚ö†Ô∏è Limited scalability and enterprise features

## Configuration

**Update these values from your Module 1 deployment:**

In [5]:
# Load model endpoint configuration from shared_utils.config
from shared_utils.config import MODEL_ENDPOINT_CONFIG

CML_MODEL_ENDPOINT = MODEL_ENDPOINT_CONFIG.get("model_endpoint")
CML_ACCESS_KEY = MODEL_ENDPOINT_CONFIG.get("access_key")

if not CML_MODEL_ENDPOINT or not CML_ACCESS_KEY:
    raise ValueError(
        "Missing model_endpoint or access_key in shared_utils/config.py\n"
        "Please update MODEL_ENDPOINT_CONFIG in shared_utils/config.py with your credentials."
    )

print("‚úÖ CML Model endpoint configured")
print(f"   Endpoint: {CML_MODEL_ENDPOINT}")
print(f"   Auth: API Access Key")

‚úÖ CML Model endpoint configured
   Endpoint: https://modelservice.ml-56979638-3f1.go01-dem.ylcu-atmi.cloudera.site/model
   Auth: API Access Key


## Test 1: Single Predictions (10 samples)

### CML API Format

CML Models uses a custom format:
```json
{
  "accessKey": "your_key",
  "request": {
    "dataframe_split": {
      "columns": ["feature1", "feature2", ...],
      "data": [[value1, value2, ...]]
    }
  }
}
```

In [None]:
# Select 10 samples for testing
test_samples = X_processed.head(10)

cml_latencies = []
cml_predictions = []

print("üîÑ Testing CML Model Endpoint (10 single predictions)...\n")

for idx in range(len(test_samples)):
    # Get single row
    row = test_samples.iloc[idx]
    
    # Handle NaN/inf values
    row_dict = row.to_dict()
    for key, val in row_dict.items():
        if pd.isna(val):
            row_dict[key] = 0
        elif np.isinf(val):
            row_dict[key] = 1e10 if val > 0 else -1e10
        else:
            row_dict[key] = float(val)
    
    # Create CML API payload
    payload = {
        "accessKey": CML_ACCESS_KEY,
        "request": {
            "dataframe_split": {
                "columns": list(row_dict.keys()),
                "data": [list(row_dict.values())]
            }
        }
    }
    
    # Time the request
    start_time = time.time()
    
    try:
        response = requests.post(
            CML_MODEL_ENDPOINT,
            data=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            timeout=30
        )
        
        latency = (time.time() - start_time) * 1000  # Convert to ms
        cml_latencies.append(latency)
        
        if response.status_code == 200:
            result = response.json()
            prediction = result['response']['prediction'][0]
            cml_predictions.append(prediction)
            pred_label = "YES" if prediction == 1 else "NO"
            print(f"  Sample {idx+1:2d}: {pred_label:3s} | Latency: {latency:7.2f}ms")
        else:
            print(f"  Sample {idx+1:2d}: ERROR - Status {response.status_code}")
            
    except Exception as e:
        print(f"  Sample {idx+1:2d}: ERROR - {str(e)[:50]}")

# Calculate statistics
if cml_latencies:
    cml_avg_latency = np.mean(cml_latencies)
    cml_p50_latency = np.median(cml_latencies)
    cml_p95_latency = np.percentile(cml_latencies, 95)
    cml_p99_latency = np.percentile(cml_latencies, 99)
else:
    cml_avg_latency = cml_p50_latency = cml_p95_latency = cml_p99_latency = 0

print(f"\nüìä CML Model - Single Prediction Statistics:")
print(f"   Successful: {len(cml_predictions)}/10")
print(f"   Avg Latency:  {cml_avg_latency:7.2f}ms")
print(f"   P50 Latency:  {cml_p50_latency:7.2f}ms")
print(f"   P95 Latency:  {cml_p95_latency:7.2f}ms")
print(f"   P99 Latency:  {cml_p99_latency:7.2f}ms")

## Test 2: Batch Prediction (50 samples)

Test throughput with a larger batch.

In [6]:
# Prepare batch of 50 samples
batch_samples = X_processed.head(50)

print("üîÑ Testing CML Model Endpoint (50-sample batch)...\n")

# We'll send them one at a time to measure throughput
batch_start = time.time()
batch_predictions = []

for idx in range(len(batch_samples)):
    row = batch_samples.iloc[idx]
    
    # Handle NaN/inf
    row_dict = row.to_dict()
    for key, val in row_dict.items():
        if pd.isna(val):
            row_dict[key] = 0
        elif np.isinf(val):
            row_dict[key] = 1e10 if val > 0 else -1e10
        else:
            row_dict[key] = float(val)
    
    payload = {
        "accessKey": CML_ACCESS_KEY,
        "request": {
            "dataframe_split": {
                "columns": list(row_dict.keys()),
                "data": [list(row_dict.values())]
            }
        }
    }
    
    try:
        response = requests.post(
            CML_MODEL_ENDPOINT,
            data=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            timeout=30
        )
        
        if response.status_code == 200:
            result = response.json()
            batch_predictions.append(result['response']['prediction'][0])
    except:
        pass

cml_batch_time = (time.time() - batch_start) * 1000  # ms
cml_throughput = len(batch_predictions) / (cml_batch_time / 1000)  # predictions/sec

print(f"‚úÖ Batch prediction complete")
print(f"   Successful predictions: {len(batch_predictions)}/50")
print(f"   Total time: {cml_batch_time:,.2f}ms")
print(f"   Avg per sample: {cml_batch_time/50:7.2f}ms")
print(f"   Throughput: {cml_throughput:7.2f} predictions/second")

üîÑ Testing CML Model Endpoint (50-sample batch)...

‚úÖ Batch prediction complete
   Successful predictions: 0/50
   Total time: 2,343.50ms
   Avg per sample:   46.87ms
   Throughput:    0.00 predictions/second


---

# Section 2: Production - AI Inference Service

## What is Cloudera AI Inference Service?

AI Inference Service is built for **enterprise production**:
- üöÄ Up to 36x faster inference (GPU) / 4x faster (CPU)
- üîí JWT token authentication (enterprise security)
- üìä Built-in autoscaling and high availability
- üåê Industry-standard Open Inference Protocol
- ‚öôÔ∏è Powered by NVIDIA Triton Inference Server
- üìà Full enterprise monitoring and observability

## Configuration

**Update these values from your AI Inference Service deployment:**

In [7]:
# AI Inference Service Configuration
# TODO: Update with your deployment details

# JWT token (typically at /tmp/jwt in CML workbench)
try:
    API_KEY = json.load(open("/tmp/jwt"))["access_token"]
    print("‚úÖ JWT token loaded from /tmp/jwt")
except FileNotFoundError:
    print("‚ö†Ô∏è  JWT token not found at /tmp/jwt, using environment variable")
    API_KEY = os.environ.get("CDP_TOKEN", "")

# Your AI Inference Service endpoint URL and model name
#BASE_URL = 'https://ml-XXXXX.cloudera.site/namespaces/serving-default/endpoints/your-model'
#MODEL_NAME = 'your-model-id'
BASE_URL = 'https://ml-64288d82-5dd.go01-dem.ylcu-atmi.cloudera.site/namespaces/serving-default/endpoints/banking-classifier-ozarate'
MODEL_NAME = '6nyo-l4ge-kbb9-odsy'

# Setup HTTPX client
headers = {
    'Authorization': 'Bearer ' + API_KEY,
    'Content-Type': 'application/json'
}

httpx_client = httpx.Client(headers=headers)
client = OpenInferenceClient(base_url=BASE_URL, httpx_client=httpx_client)

print("‚úÖ AI Inference Service client configured")
print(f"   Endpoint: {BASE_URL[:60]}...")
print(f"   Model: {MODEL_NAME}")
print(f"   Auth: JWT Token")

‚úÖ JWT token loaded from /tmp/jwt
‚úÖ AI Inference Service client configured
   Endpoint: https://ml-64288d82-5dd.go01-dem.ylcu-atmi.cloudera.site/nam...
   Model: 6nyo-l4ge-kbb9-odsy
   Auth: JWT Token


## Model Metadata - Open Inference Protocol Standard

One key advantage: standardized model introspection

In [8]:
print("üîç Checking server readiness...\n")

try:
    client.check_server_readiness()
    print("‚úÖ Server is ready and healthy\n")
except Exception as e:
    print(f"‚ùå Server check failed: {e}\n")

# Get model metadata
print("üìã Retrieving model metadata...\n")

try:
    metadata = client.read_model_metadata(MODEL_NAME)
    metadata_dict = json.loads(metadata.json())
    
    print(f"Model: {metadata_dict.get('name', 'N/A')}")
    print(f"Platform: {metadata_dict.get('platform', 'N/A')}")
    
    # Show input schema
    if 'inputs' in metadata_dict:
        print(f"\nüì• Inputs ({len(metadata_dict['inputs'])} features):")
        for inp in metadata_dict['inputs'][:5]:
            print(f"   ‚Ä¢ {inp['name']:20s} | {inp['datatype']:6s} | shape: {inp['shape']}")
        if len(metadata_dict['inputs']) > 5:
            print(f"   ... and {len(metadata_dict['inputs'])-5} more features")
    
    # Show output schema  
    if 'outputs' in metadata_dict:
        print(f"\nüì§ Outputs:")
        for out in metadata_dict['outputs']:
            print(f"   ‚Ä¢ {out['name']:20s} | {out['datatype']:6s} | shape: {out['shape']}")
            
except Exception as e:
    print(f"‚ùå Failed to get metadata: {e}")

üîç Checking server readiness...

‚ùå Server check failed: status_code: 401, body: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 401 Unauthorized</title>
</head>
<body><h2>HTTP ERROR 401 Unauthorized</h2>
<table>
<tr><th>URI:</th><td>/gateway/cml-serving-cdpauth-cdpauth/auth/api/v1/extauthz/v2/health/ready</td></tr>
<tr><th>STATUS:</th><td>401</td></tr>
<tr><th>MESSAGE:</th><td>Unauthorized</td></tr>
<tr><th>SERVLET:</th><td>cml-serving-cdpauth-cdpauth-knox-gateway-servlet</td></tr>
</table>

</body>
</html>


üìã Retrieving model metadata...

‚ùå Failed to get metadata: status_code: 401, body: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 401 Unauthorized</title>
</head>
<body><h2>HTTP ERROR 401 Unauthorized</h2>
<table>
<tr><th>URI:</th><td>/gateway/cml-serving-cdpauth-cdpauth/auth/api/v1/extauthz/v2/models/6nyo-l4ge-kbb9-odsy</td></tr>
<tr><th>STATUS:</th><td>401</td></tr>
<

## Open Inference Protocol Data Format

The ONNX model expects the **original 20 features** (not engineered features):
- ‚úÖ Original feature names with **underscores** (emp_var_rate, not emp.var.rate)
- ‚úÖ No feature engineering (no engagement_score, age_group, etc.)
- ‚úÖ Raw categorical and numerical values

```python
{
  "inputs": [
    {"name": "age", "shape": [1,1], "datatype": "FP32", "data": [35.0]},
    {"name": "job", "shape": [1,1], "datatype": "BYTES", "data": ["technician"]},
    {"name": "emp_var_rate", "shape": [1,1], "datatype": "FP32", "data": [1.1]},
    ...
  ]
}
```

### Helper Function

In [9]:
def format_for_onnx_inference(row, numeric_feats, categorical_feats):
    """
    Convert pandas row to Open Inference Protocol format for ONNX.
    
    Args:
        row: Single row from DataFrame (pd.Series)
        numeric_feats: List of numerical feature names
        categorical_feats: List of categorical feature names
    
    Returns:
        List of input dictionaries for InferenceRequest
    """
    inputs = []
    
    # Numerical features ‚Üí FP32
    for feat in numeric_feats:
        if feat in row.index:
            inputs.append({
                "name": feat,
                "shape": [1, 1],
                "datatype": "FP32",
                "data": [float(row[feat])]
            })
    
    # Categorical features ‚Üí BYTES
    for feat in categorical_feats:
        if feat in row.index:
            inputs.append({
                "name": feat,
                "shape": [1, 1],
                "datatype": "BYTES",
                "data": [str(row[feat])]
            })
    
    return inputs

# Test formatter with corrected features
print("üß™ Testing data formatter...\n")
sample_row = X_onnx.iloc[0]
formatted = format_for_onnx_inference(sample_row, onnx_numeric_features, onnx_categorical_features)

print(f"‚úÖ Formatted {len(formatted)} inputs (should be 20)")
print(f"\nFirst 5 inputs:")
for inp in formatted[:5]:
    print(f"   {inp['name']:20s} | {inp['datatype']:6s} | {inp['data']}")

üß™ Testing data formatter...

‚úÖ Formatted 20 inputs (should be 20)

First 5 inputs:
   age                  | FP32   | [57.0]
   duration             | FP32   | [371.0]
   campaign             | FP32   | [1.0]
   pdays                | FP32   | [999.0]
   previous             | FP32   | [1.0]


## Test 1: Single Predictions (10 samples)

In [10]:
# Use ONNX-prepared data (original 20 features with correct naming)
test_samples_onnx = X_onnx.head(10)

ai_latencies = []
ai_predictions = []

print("üîÑ Testing AI Inference Service (10 single predictions)...\n")

for idx in range(len(test_samples_onnx)):
    row = test_samples_onnx.iloc[idx]
    
    # Format for Open Inference Protocol
    inputs = format_for_onnx_inference(row, onnx_numeric_features, onnx_categorical_features)
    
    start_time = time.time()
    
    try:
        pred = client.model_infer(
            MODEL_NAME,
            request=InferenceRequest(inputs=inputs)
        )
        
        latency = (time.time() - start_time) * 1000  # ms
        ai_latencies.append(latency)
        
        # Parse response
        response_dict = json.loads(pred.json())
        prediction = response_dict['outputs'][0]['data'][0]
        ai_predictions.append(prediction)
        
        pred_label = "YES" if prediction == 1 else "NO"
        print(f"  Sample {idx+1:2d}: {pred_label:3s} | Latency: {latency:7.2f}ms")
        
    except Exception as e:
        print(f"  Sample {idx+1:2d}: ERROR - {str(e)[:50]}")

# Calculate statistics
if ai_latencies:
    ai_avg_latency = np.mean(ai_latencies)
    ai_p50_latency = np.median(ai_latencies)
    ai_p95_latency = np.percentile(ai_latencies, 95)
    ai_p99_latency = np.percentile(ai_latencies, 99)
else:
    ai_avg_latency = ai_p50_latency = ai_p95_latency = ai_p99_latency = 0

print(f"\nüìä AI Inference Service - Single Prediction Statistics:")
print(f"   Successful: {len(ai_predictions)}/10")
print(f"   Avg Latency:  {ai_avg_latency:7.2f}ms")
print(f"   P50 Latency:  {ai_p50_latency:7.2f}ms")
print(f"   P95 Latency:  {ai_p95_latency:7.2f}ms")
print(f"   P99 Latency:  {ai_p99_latency:7.2f}ms")

üîÑ Testing AI Inference Service (10 single predictions)...

  Sample  1: ERROR - status_code: 401, body: <html>
<head>
<meta http-e
  Sample  2: ERROR - status_code: 401, body: <html>
<head>
<meta http-e
  Sample  3: ERROR - status_code: 401, body: <html>
<head>
<meta http-e
  Sample  4: ERROR - status_code: 401, body: <html>
<head>
<meta http-e
  Sample  5: ERROR - status_code: 401, body: <html>
<head>
<meta http-e
  Sample  6: ERROR - status_code: 401, body: <html>
<head>
<meta http-e
  Sample  7: ERROR - status_code: 401, body: <html>
<head>
<meta http-e
  Sample  8: ERROR - status_code: 401, body: <html>
<head>
<meta http-e
  Sample  9: ERROR - status_code: 401, body: <html>
<head>
<meta http-e
  Sample 10: ERROR - status_code: 401, body: <html>
<head>
<meta http-e

üìä AI Inference Service - Single Prediction Statistics:
   Successful: 0/10
   Avg Latency:     0.00ms
   P50 Latency:     0.00ms
   P95 Latency:     0.00ms
   P99 Latency:     0.00ms


## Test 2: Batch Prediction (50 samples)

In [11]:
# Prepare batch of 50 samples
batch_samples_onnx = X_onnx.head(50)

print("üîÑ Testing AI Inference Service (50-sample batch)...\n")

batch_start = time.time()
batch_predictions_ai = []

for idx in range(len(batch_samples_onnx)):
    row = batch_samples_onnx.iloc[idx]
    inputs = format_for_onnx_inference(row, onnx_numeric_features, onnx_categorical_features)
    
    try:
        pred = client.model_infer(
            MODEL_NAME,
            request=InferenceRequest(inputs=inputs)
        )
        response_dict = json.loads(pred.json())
        batch_predictions_ai.append(response_dict['outputs'][0]['data'][0])
    except:
        pass

ai_batch_time = (time.time() - batch_start) * 1000  # ms
ai_throughput = len(batch_predictions_ai) / (ai_batch_time / 1000)  # predictions/sec

print(f"‚úÖ Batch prediction complete")
print(f"   Successful predictions: {len(batch_predictions_ai)}/50")
print(f"   Total time: {ai_batch_time:,.2f}ms")
print(f"   Avg per sample: {ai_batch_time/50:7.2f}ms")
print(f"   Throughput: {ai_throughput:7.2f} predictions/second")

üîÑ Testing AI Inference Service (50-sample batch)...

‚úÖ Batch prediction complete
   Successful predictions: 0/50
   Total time: 579.86ms
   Avg per sample:   11.60ms
   Throughput:    0.00 predictions/second


---

# Section 3: Side-by-Side Comparison

## Performance Results

In [12]:
# Calculate improvements
latency_improvement = ((cml_avg_latency - ai_avg_latency) / cml_avg_latency * 100) if cml_avg_latency > 0 else 0
throughput_improvement = ((ai_throughput - cml_throughput) / cml_throughput * 100) if cml_throughput > 0 else 0

# Create comparison DataFrame
comparison = pd.DataFrame({
    'Metric': [
        'Authentication',
        'API Protocol',
        'Model Format',
        '',
        'Avg Latency (10 samples)',
        'P95 Latency',
        'P99 Latency',
        '',
        'Batch Time (50 samples)',
        'Throughput (pred/sec)',
        '',
        'Purpose',
        'Scale',
        'Availability',
        'Monitoring'
    ],
    'CML Models': [
        'API Key',
        'Custom REST',
        'Pickled sklearn',
        '',
        f'{cml_avg_latency:.2f} ms',
        f'{cml_p95_latency:.2f} ms',
        f'{cml_p99_latency:.2f} ms',
        '',
        f'{cml_batch_time:,.0f} ms',
        f'{cml_throughput:.2f}',
        '',
        'Development/Testing',
        'Small-Medium',
        'Basic',
        'Basic metrics'
    ],
    'AI Inference Service': [
        'JWT Token',
        'Open Inference Protocol',
        'ONNX',
        '',
        f'{ai_avg_latency:.2f} ms',
        f'{ai_p95_latency:.2f} ms',
        f'{ai_p99_latency:.2f} ms',
        '',
        f'{ai_batch_time:,.0f} ms',
        f'{ai_throughput:.2f}',
        '',
        'Production Serving',
        'Enterprise Scale',
        'HA + Autoscaling',
        'Full observability'
    ],
    'Improvement': [
        'Enterprise security',
        'Industry standard',
        'Optimized format',
        '',
        f'{latency_improvement:+.1f}%' if latency_improvement != 0 else '-',
        f'{((cml_p95_latency - ai_p95_latency) / cml_p95_latency * 100):+.1f}%' if cml_p95_latency > 0 else '-',
        f'{((cml_p99_latency - ai_p99_latency) / cml_p99_latency * 100):+.1f}%' if cml_p99_latency > 0 else '-',
        '',
        f'{((cml_batch_time - ai_batch_time) / cml_batch_time * 100):+.1f}%' if cml_batch_time > 0 else '-',
        f'{throughput_improvement:+.1f}%' if throughput_improvement != 0 else '-',
        '',
        '‚Üí Production ready',
        '‚Üí Handles more load',
        '‚Üí Always available',
        '‚Üí Full visibility'
    ]
})

print("\n" + "="*100)
print("üìä DEPLOYMENT COMPARISON: Development ‚Üí Production")
print("="*100)
print(comparison.to_string(index=False))
print("="*100)


üìä DEPLOYMENT COMPARISON: Development ‚Üí Production
                  Metric          CML Models    AI Inference Service         Improvement
          Authentication             API Key               JWT Token Enterprise security
            API Protocol         Custom REST Open Inference Protocol   Industry standard
            Model Format     Pickled sklearn                    ONNX    Optimized format
                                                                                        
Avg Latency (10 samples)            47.75 ms                 0.00 ms             +100.0%
             P95 Latency            50.01 ms                 0.00 ms             +100.0%
             P99 Latency            51.01 ms                 0.00 ms             +100.0%
                                                                                        
 Batch Time (50 samples)            2,344 ms                  580 ms              +75.3%
   Throughput (pred/sec)                0.00          

## Key Takeaways for Customer Conversations

### 1. **Performance Story**
- AI Inference Service delivers measurable latency improvements
- ONNX optimization + Triton runtime = faster inference
- Concrete numbers to show customers: "X% faster response times"

### 2. **Enterprise Readiness**
- **Security**: API keys ‚Üí JWT tokens (enterprise authentication)
- **Standards**: Custom API ‚Üí Open Inference Protocol (no vendor lock-in)
- **Operations**: Basic monitoring ‚Üí Full observability stack

### 3. **Scale & Availability**
- **CML Models**: Good for dev/test, limited production scale
- **AI Inference Service**: Built for production with autoscaling and HA
- **Cost efficiency**: Scale-to-zero when idle, scale-up under load

### 4. **The Evolution Path**
This is the natural progression:
1. Start with CML Models for development
2. Monitor performance and detect issues
3. Optimize and convert to ONNX
4. Deploy to AI Inference Service for production

---

# Section 4: Production Monitoring & Operations

## Where to Find Operational Visibility

AI Inference Service provides enterprise-grade monitoring:

### 1. **Model Endpoint Dashboard**
- Navigate to: **CDP Console ‚Üí Machine Learning ‚Üí AI Inference Service ‚Üí Your Endpoint**
- View:
  - Request rate and latency graphs
  - Replica count (autoscaling status)
  - Error rates and health checks
  - Resource utilization (CPU/GPU/Memory)

### 2. **Grafana Dashboards**
- Access: **Cloudera AI Workbenches ‚Üí Actions ‚Üí Open Grafana**
- Pre-built dashboards for:
  - Inference latency percentiles (P50, P95, P99)
  - Throughput over time
  - Model-specific metrics
  - Infrastructure health

### 3. **Logs and Debugging**
- Model deployment logs show startup and errors
- Triton server logs for detailed inference traces
- Integration with enterprise log aggregation systems

### 4. **Alerting**
- Configure alerts on latency thresholds
- Error rate spikes
- Resource exhaustion warnings

---

## Workshop Complete! üéâ

### You've Experienced the Full MLOps Lifecycle:

‚úÖ **Module 1**: Model training and development deployment  
‚úÖ **Module 2**: Monitoring and drift detection  
‚úÖ **Module 3**: Automated retraining and ONNX conversion  
‚úÖ **Module 4**: Production deployment with enterprise features  

### What to Tell Customers:

1. **"Cloudera AI covers the complete ML lifecycle"** - From experimentation to production
2. **"Built-in optimization"** - ONNX conversion and Triton runtime for performance
3. **"Enterprise-grade from day one"** - Security, monitoring, and scale built in
4. **"No vendor lock-in"** - Open standards (Open Inference Protocol, ONNX)
5. **"Proven path to production"** - Clear evolution from dev to production

### Next Steps:

- Explore the monitoring dashboards
- Test autoscaling behavior under load
- Try deploying your own models
- Build customer demos with your use cases

---

**Thank you for completing the workshop!**