# Intent-Based Model Router with the Foundry Local SDK

**CPU-Optimized Multi-Model Routing System**

This notebook demonstrates a routing system that automatically selects the best small language model for the user's intent. It suits edge deployment scenarios where you want to use multiple specialized models efficiently.

## 🎯 What You'll Learn

- **Intent Detection**: Automatically classify prompts (code, summarization, classification, general)
- **Smart Model Selection**: Route to the most capable model for each task
- **CPU Optimization**: Memory-efficient models that run on CPU-only hardware
- **Multi-Model Management**: Keep multiple models loaded with `--retain true`
- **Production Patterns**: Retry logic, error handling, and token tracking

## 📋 Scenario Overview

This pattern demonstrates:

1. **Intent Detection**: Classify each user prompt (code, summarization, classification, or general)
2. **Model Selection**: Automatically choose the most suitable small language model based on its capabilities
3. **Local Execution**: Route to models running locally through the Foundry Local service
4. **Unified Interface**: A single chat entry point that routes to multiple specialized models

**Ideal for**: Edge deployments with multiple specialized models where you want requests routed automatically, without manual model selection.

## 🔧 Prerequisites

- **Foundry Local** installed and the service running
- **Python 3.8+** with pip
- **8GB+ of RAM** (16GB+ recommended for multiple models)
- **workshop_utils** module (in ../samples/)

## 🚀 Quick Start

The notebook will:
1. Detect your system's memory
2. Recommend appropriate CPU models
3. Load the models automatically with `--retain true`
4. Verify that all models are ready
5. Route test prompts to specialized models

**Estimated setup time**: 5-7 minutes (includes model loading)


## 📦 Step 1: Install Dependencies

Install the official Foundry Local SDK and the required libraries:

- **foundry-local-sdk**: Official Python SDK for local model management
- **openai**: OpenAI-compatible client for chat completions
- **psutil**: System memory detection and monitoring


In [107]:
# Install core dependencies
!pip install -q foundry-local-sdk openai psutil

## 💻 Step 2: System Memory Detection

Detect the system's memory to determine which CPU models can run efficiently. The cell reports both total and available memory and picks a model tier from the total.


In [108]:
import psutil

# Get system memory information
total_memory_gb = psutil.virtual_memory().total / (1024**3)
available_memory_gb = psutil.virtual_memory().available / (1024**3)

print('🖥️  System Memory Information')
print('=' * 70)
print(f'Total Memory:     {total_memory_gb:.2f} GB')
print(f'Available Memory: {available_memory_gb:.2f} GB')
print()

# Recommend models based on available memory
# Using model aliases - Foundry Local will automatically select CPU variant
model_aliases = []

if total_memory_gb >= 32:
    model_aliases = ['phi-4-mini', 'phi-3.5-mini', 'qwen2.5-0.5b', 'qwen2.5-coder-0.5b']
    print('✅ High Memory System (32GB+)')
    print('   Can run 3-4 models simultaneously')
elif total_memory_gb >= 16:
    model_aliases = ['phi-4-mini', 'qwen2.5-0.5b', 'phi-3.5-mini']
    print('✅ Medium Memory System (16-32GB)')
    print('   Can run 2-3 models simultaneously')
elif total_memory_gb >= 8:
    model_aliases = ['qwen2.5-0.5b', 'phi-3.5-mini']
    print('⚠️  Lower Memory System (8-16GB)')
    print('   Recommended: 2 smaller models')
else:
    model_aliases = ['qwen2.5-0.5b']
    print('⚠️  Limited Memory System (<8GB)')
    print('   Recommended: Use only smallest model')

print()
print('📋 Recommended Model Aliases for Your System:')
for model in model_aliases:
    print(f'   • {model}')

print()
print('💡 About Model Aliases:')
print('   ✓ Use base alias (e.g., phi-4-mini, not phi-4-mini-cpu)')
print('   ✓ Foundry Local automatically selects CPU variant for your hardware')
print('   ✓ No GPU required - optimized for CPU inference')
print('   ✓ Predictable memory usage and consistent performance')
print('=' * 70)

🖥️  System Memory Information
Total Memory:     63.30 GB
Available Memory: 16.19 GB

✅ High Memory System (32GB+)
   Can run 3-4 models simultaneously

📋 Recommended Model Aliases for Your System:
   • phi-4-mini
   • phi-3.5-mini
   • qwen2.5-0.5b
   • qwen2.5-coder-0.5b

💡 About Model Aliases:
   ✓ Use base alias (e.g., phi-4-mini, not phi-4-mini-cpu)
   ✓ Foundry Local automatically selects CPU variant for your hardware
   ✓ No GPU required - optimized for CPU inference
   ✓ Predictable memory usage and consistent performance
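
The tiering logic in this cell can also be written as a small pure function, which makes the thresholds easy to check without touching the real hardware. This is a minimal sketch under that assumption; `MEMORY_TIERS`, `SMALLEST_ONLY`, and `recommend_aliases` are hypothetical names introduced here, not part of the notebook or the SDK.

```python
from typing import List

# Tiers mirror the thresholds used in the cell above: (minimum total RAM in GB, aliases).
# Ordered from largest to smallest so the first threshold that is met wins.
MEMORY_TIERS = [
    (32, ['phi-4-mini', 'phi-3.5-mini', 'qwen2.5-0.5b', 'qwen2.5-coder-0.5b']),
    (16, ['phi-4-mini', 'qwen2.5-0.5b', 'phi-3.5-mini']),
    (8, ['qwen2.5-0.5b', 'phi-3.5-mini']),
]
SMALLEST_ONLY = ['qwen2.5-0.5b']


def recommend_aliases(total_memory_gb: float) -> List[str]:
    """Return the alias list for the first tier whose threshold is met."""
    for min_gb, aliases in MEMORY_TIERS:
        if total_memory_gb >= min_gb:
            return list(aliases)
    return list(SMALLEST_ONLY)


for gb in (63.3, 16.0, 12.0, 4.0):
    print(f'{gb:>5.1f} GB -> {recommend_aliases(gb)}')
```

Note that the cell tiers on total memory. In the recorded run only 16.19 GB of 63.30 GB was available, so passing `available_memory_gb` instead may be the more conservative choice on a busy machine.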


## 🤖 Step 3: Automatic Model Loading

This cell automatically:
1. Starts the Foundry Local service (if it is not already running)
2. Loads the recommended models with `--retain true` (keeps multiple models in memory)
3. Verifies that all models are ready using the SDK

⏱️ **Estimated time**: 3-5 minutes for all models
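
The verification in step 3 is a polling loop: probe every model that is not ready yet, wait, and repeat until all respond or the attempts run out. A minimal sketch of that loop, decoupled from Foundry Local so it can be tested on its own, might look like this. `wait_until_ready` and the `is_ready` probe are hypothetical names. In the notebook, the probe would be the short chat completion sent to each model.

```python
import time
from typing import Callable, Iterable, List


def wait_until_ready(
    models: Iterable[str],
    is_ready: Callable[[str], bool],
    max_attempts: int = 30,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll each model until it reports ready or the attempts run out.

    `is_ready` is a caller-supplied probe (hypothetical): it returns True once
    a model answers and may raise while the model is still loading.
    """
    pending = list(models)
    ready: List[str] = []
    for attempt in range(1, max_attempts + 1):
        for model in list(pending):  # iterate over a copy; pending is mutated
            try:
                if is_ready(model):
                    ready.append(model)
                    pending.remove(model)
            except Exception:
                # Treat any probe failure as "not ready yet" and retry later.
                pass
        if not pending:
            break
        if attempt < max_attempts:
            sleep(delay_s)
    return ready


# Fake probe for illustration: each model becomes ready after a fixed number of probes.
_probe_counts = {}
_ready_after = {'a': 1, 'b': 3}


def fake_probe(model: str) -> bool:
    _probe_counts[model] = _probe_counts.get(model, 0) + 1
    return _probe_counts[model] >= _ready_after[model]


print(wait_until_ready(['a', 'b'], fake_probe, max_attempts=5, sleep=lambda s: None))
```

Injecting `sleep` keeps the example fast to run. Real use would leave the default `time.sleep`.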


In [109]:
import subprocess
import time
import sys
import os

# Add samples directory for workshop_utils (Foundry SDK pattern)
sys.path.append(os.path.join('..', 'samples'))

print('🚀 Automatic Model Loading with SDK Verification')
print('=' * 70)

# Use top 3 recommended models (aliases)
# Foundry will automatically load CPU variants
REQUIRED_MODELS = model_aliases[:3]
print(f'📋 Loading {len(REQUIRED_MODELS)} models: {REQUIRED_MODELS}')
print('💡 Using model aliases - Foundry will load CPU variants automatically')
print()

# Step 1: Ensure Foundry Local service is running
print('📡 Step 1: Checking Foundry Local service...')
try:
    result = subprocess.run(['foundry', 'service', 'status'], 
                          capture_output=True, text=True, timeout=5)
    if result.returncode == 0:
        print('   ✅ Service is already running')
    else:
        print('   ⚙️  Starting Foundry Local service...')
        subprocess.run(['foundry', 'service', 'start'], 
                      capture_output=True, text=True, timeout=30)
        time.sleep(5)
        print('   ✅ Service started')
except Exception as e:
    print(f'   ⚠️  Could not verify service: {e}')
    print('   💡 Try manually: foundry service start')

# Step 2: Load each model with --retain true
print(f'\n🤖 Step 2: Loading models with retention...')
for i, model in enumerate(REQUIRED_MODELS, 1):
    print(f'   [{i}/{len(REQUIRED_MODELS)}] Starting {model}...')
    try:
        subprocess.Popen(['foundry', 'model', 'run', model, '--retain', 'true'],
                        stdout=subprocess.DEVNULL,
                        stderr=subprocess.DEVNULL)
        print(f'       ✅ {model} loading in background')
    except Exception as e:
        print(f'       ❌ Error starting {model}: {e}')

# Step 3: Verify models are ready
print(f'\n✅ Step 3: Verifying models (this may take 2-3 minutes)...')
print('=' * 70)

try:
    from workshop_utils import get_client
    
    ready_models = []
    max_attempts = 30
    attempt = 0
    
    while len(ready_models) < len(REQUIRED_MODELS) and attempt < max_attempts:
        attempt += 1
        print(f'\n   Attempt {attempt}/{max_attempts}...')
        
        for model in REQUIRED_MODELS:
            if model in ready_models:
                continue
                
            try:
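                # get_client accepts a single positional argument (the alias);
                # passing a second one raises the TypeError shown in the recorded output
                manager, client, model_id = get_client(model)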
                response = client.chat.completions.create(
                    model=model_id,
                    messages=[{"role": "user", "content": "test"}],
                    max_tokens=5,
                    temperature=0
                )
                
                if response and response.choices:
                    ready_models.append(model)
                    print(f'   ✅ {model} is READY')
                    
            except Exception as e:
                error_msg = str(e).lower()
                if 'connection' in error_msg or 'timeout' in error_msg:
                    print(f'   ⏳ {model} still loading...')
                else:
                    print(f'   ⚠️  {model} error: {str(e)[:60]}...')
        
        if len(ready_models) == len(REQUIRED_MODELS):
            break
            
        if len(ready_models) < len(REQUIRED_MODELS):
            time.sleep(10)
    
    # Final status
    print('\n' + '=' * 70)
    print(f'📦 Final Status: {len(ready_models)}/{len(REQUIRED_MODELS)} models ready')
    
    for model in REQUIRED_MODELS:
        if model in ready_models:
            print(f'   ✅ {model} - READY (retained in memory)')
        else:
            print(f'   ❌ {model} - NOT READY')
    
    if len(ready_models) == len(REQUIRED_MODELS):
        print('\n🎉 All models loaded and verified!')
        print('   ✅ Ready for intent-based routing')
    else:
        print(f'\n⚠️  Some models not ready. Check: foundry model ls')
        
except ImportError as e:
    print(f'\n❌ Cannot import workshop_utils: {e}')
    print('   💡 Ensure workshop_utils.py is in ../samples/')
except Exception as e:
    print(f'\n❌ Verification error: {e}')

🚀 Automatic Model Loading with SDK Verification
📋 Loading 3 models: ['phi-4-mini', 'phi-3.5-mini', 'qwen2.5-0.5b']
💡 Using model aliases - Foundry will load CPU variants automatically

📡 Step 1: Checking Foundry Local service...
   ✅ Service is already running

🤖 Step 2: Loading models with retention...
   [1/3] Starting phi-4-mini...
       ✅ phi-4-mini loading in background
   [2/3] Starting phi-3.5-mini...
       ✅ phi-3.5-mini loading in background
   [3/3] Starting qwen2.5-0.5b...
       ✅ qwen2.5-0.5b loading in background

✅ Step 3: Verifying models (this may take 2-3 minutes)...

   Attempt 1/30...
   ⚠️  phi-4-mini error: get_client() takes 1 positional argument but 2 were given...
   ⚠️  phi-3.5-mini error: get_client() takes 1 positional argument but 2 were given...
   ⚠️  qwen2.5-0.5b error: get_client() takes 1 positional argument but 2 were given...

   Attempt 2/30...
   ⚠️  phi-4-mini error: get_client() takes 1 positional argument but 2 were given...
   ⚠️  phi-3.5-min

## 🎯 Step 4: Configure Intent Detection and the Model Catalog

Set up the routing system with:
- **Intent Rules**: Regex patterns that classify requests
- **Model Catalog**: Maps model capabilities to intent categories
- **Priority System**: Decides which model is selected when several match

**Advantages of CPU Models**:
- ✅ No GPU required
- ✅ Consistent performance
- ✅ Lower power consumption
- ✅ Predictable memory usage


In [110]:
import re

# Model capability catalog (maps model aliases to capabilities)
# Use base aliases - Foundry Local will automatically select CPU variants
CATALOG = {
    'phi-4-mini': {
        'capabilities': ['general', 'summarize', 'reasoning'],
        'priority': 3
    },
    'qwen2.5-0.5b': {
        'capabilities': ['classification', 'fast', 'general'],
        'priority': 1
    },
    'phi-3.5-mini': {
        'capabilities': ['code', 'refactor', 'technical'],
        'priority': 2
    },
    'qwen2.5-coder-0.5b': {
        'capabilities': ['code', 'programming', 'debug'],
        'priority': 1
    }
}

# Filter to only include models recommended for this system
CATALOG = {k: v for k, v in CATALOG.items() if k in model_aliases}

print('📋 Active Model Catalog (Hardware-Optimized Aliases)')
print('=' * 70)
print('💡 Using model aliases - Foundry automatically selects CPU variants')
print()
for model, info in CATALOG.items():
    caps = ', '.join(info['capabilities'])
    print(f'   • {model}')
    print(f'     Capabilities: {caps}')
    print(f'     Priority: {info["priority"]}')
    print()

# Intent detection rules (regex pattern -> intent label)
INTENT_RULES = [
    (re.compile(r'code|refactor|function|debug|program', re.I), 'code'),
    (re.compile(r'summar|abstract|tl;?dr|brief', re.I), 'summarize'),
    (re.compile(r'classif|categor|label|sentiment', re.I), 'classification'),
    (re.compile(r'explain|teach|describe', re.I), 'general'),
]

def detect_intent(prompt: str) -> str:
    """Detect intent from prompt using regex patterns.
    
    Args:
        prompt: User input text
        
    Returns:
        Intent label: 'code', 'summarize', 'classification', or 'general'
    """
    for pattern, intent in INTENT_RULES:
        if pattern.search(prompt):
            return intent
    return 'general'

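from typing import Optional

def pick_model(intent: str) -> Optional[str]:
    """Select best model for intent based on capabilities and priority.
    
    Args:
        intent: Detected intent category
        
    Returns:
        Highest-priority model alias whose capabilities include the intent;
        otherwise the first model in CATALOG, or None if CATALOG is empty
    """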
    candidates = [
        (alias, info['priority']) 
        for alias, info in CATALOG.items() 
        if intent in info['capabilities']
    ]
    
    if candidates:
        # Sort by priority (higher = better)
        candidates.sort(key=lambda x: x[1], reverse=True)
        return candidates[0][0]
    
    # Fallback to first available model
    return list(CATALOG.keys())[0] if CATALOG else None

print('✅ Intent detection and model selection configured')
print('=' * 70)

📋 Active Model Catalog (Hardware-Optimized Aliases)
💡 Using model aliases - Foundry automatically selects CPU variants

   • phi-4-mini
     Capabilities: general, summarize, reasoning
     Priority: 3

   • qwen2.5-0.5b
     Capabilities: classification, fast, general
     Priority: 1

   • phi-3.5-mini
     Capabilities: code, refactor, technical
     Priority: 2

   • qwen2.5-coder-0.5b
     Capabilities: code, programming, debug
     Priority: 1

✅ Intent detection and model selection configured
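
To see how the priority value resolves overlaps, it can help to list every capable model in order rather than only the winner. The sketch below copies the catalog from the cell above into a local variable. `ranked_models` is a hypothetical helper, not something defined in the notebook.

```python
from typing import Dict, List

catalog: Dict[str, dict] = {
    'phi-4-mini': {'capabilities': ['general', 'summarize', 'reasoning'], 'priority': 3},
    'qwen2.5-0.5b': {'capabilities': ['classification', 'fast', 'general'], 'priority': 1},
    'phi-3.5-mini': {'capabilities': ['code', 'refactor', 'technical'], 'priority': 2},
    'qwen2.5-coder-0.5b': {'capabilities': ['code', 'programming', 'debug'], 'priority': 1},
}


def ranked_models(intent: str, catalog: Dict[str, dict]) -> List[str]:
    """Aliases whose capabilities include `intent`, highest priority first."""
    matches = [alias for alias, info in catalog.items() if intent in info['capabilities']]
    # sorted() is stable, so models with equal priority keep catalog order.
    return sorted(matches, key=lambda alias: catalog[alias]['priority'], reverse=True)


for intent in ('code', 'general', 'translation'):
    print(f'{intent:12s} -> {ranked_models(intent, catalog)}')
```

Both `phi-3.5-mini` and `qwen2.5-coder-0.5b` advertise `code`, and the higher priority of `phi-3.5-mini` puts it first. An intent with no capable model yields an empty list, which is the case where `pick_model` falls back to the first catalog entry.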


## 🧪 Step 5: Test Intent Detection

Verify that the intent detection system classifies different kinds of requests correctly.


In [111]:
# Test intent detection with sample prompts
test_prompts = [
    'Refactor this Python function for better readability',
    'Summarize the key points of this article',
    'Classify this customer feedback as positive or negative',
    'Explain how edge AI differs from cloud AI',
    'Write a function to calculate fibonacci numbers',
    'Give me a brief overview of small language models'
]

print('🧪 Testing Intent Detection')
print('=' * 70)

for prompt in test_prompts:
    intent = detect_intent(prompt)
    model = pick_model(intent)
    print(f'\nPrompt: {prompt[:50]}...')
    print(f'   Intent: {intent:15s} → Model: {model}')

print('\n' + '=' * 70)
print('✅ Intent detection working correctly')

🧪 Testing Intent Detection

Prompt: Refactor this Python function for better readabili...
   Intent: code            → Model: phi-3.5-mini

Prompt: Summarize the key points of this article...
   Intent: summarize       → Model: phi-4-mini

Prompt: Classify this customer feedback as positive or neg...
   Intent: classification  → Model: qwen2.5-0.5b

Prompt: Explain how edge AI differs from cloud AI...
   Intent: general         → Model: phi-4-mini

Prompt: Write a function to calculate fibonacci numbers...
   Intent: code            → Model: phi-3.5-mini

Prompt: Give me a brief overview of small language models...
   Intent: summarize       → Model: phi-4-mini

✅ Intent detection working correctly
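
Two properties of the regex rules are worth keeping in mind: the first matching rule wins, and the patterns match substrings. The sketch below reproduces the rules from Step 4 so it runs on its own. `first_match` and `STRICT_RULES` are hypothetical names, and the word-boundary variant is one possible tightening, not the notebook's behavior.

```python
import re

RULES = [
    (re.compile(r'code|refactor|function|debug|program', re.I), 'code'),
    (re.compile(r'summar|abstract|tl;?dr|brief', re.I), 'summarize'),
    (re.compile(r'classif|categor|label|sentiment', re.I), 'classification'),
    (re.compile(r'explain|teach|describe', re.I), 'general'),
]


def first_match(prompt, rules=RULES):
    """Return the intent of the first rule whose pattern is found in the prompt."""
    for pattern, intent in rules:
        if pattern.search(prompt):
            return intent
    return 'general'


# Rule order decides the outcome when several patterns match:
# both "Summarize" and "code" appear, but the 'code' rule is listed first.
print(first_match('Summarize this code review'))

# Patterns match substrings, so "decode" also triggers the 'code' rule.
print(first_match('Explain how to decode base64'))

# Possible tightening (assumption): require whole words for the 'code' rule.
STRICT_RULES = [
    (re.compile(r'\b(?:code|refactor\w*|function\w*|debug\w*|program\w*)\b', re.I), 'code'),
] + RULES[1:]
print(first_match('Explain how to decode base64', STRICT_RULES))
```

If summarization should take precedence over code for mixed prompts, reordering the list is enough. No other change to the matching logic is needed.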


## 🚀 Step 6: Implement the Routing Function

Create the main routing function, which:
1. Detects the intent from the prompt
2. Selects the best-matching model
3. Executes the request through the Foundry Local SDK
4. Tracks token usage and errors

**Uses the workshop_utils pattern**:
- Automatic retry with exponential backoff
- OpenAI-compatible API
- Token tracking and error handling
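
The implementation inside `workshop_utils` is not shown in this notebook. As a rough illustration of what "retry with exponential backoff" means, here is a minimal, hypothetical helper: `retry_with_backoff` and its parameters are assumptions made for this sketch, not the module's actual API.

```python
import time
from typing import Callable, TypeVar

T = TypeVar('T')


def retry_with_backoff(
    fn: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); after each failure wait base_delay * factor**attempt, then retry."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            sleep(base_delay * factor ** attempt)
    raise RuntimeError('unreachable')  # keeps type checkers satisfied


# Illustration: a call that fails twice before succeeding.
attempts = {'n': 0}


def flaky() -> str:
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise ConnectionError('service not ready')
    return 'ok'


delays = []  # record the waits instead of actually sleeping
print(retry_with_backoff(flaky, sleep=delays.append), delays)
```

The waits grow geometrically (1 s, 2 s, 4 s with the defaults), which gives a model that is still loading progressively more time before the next attempt.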


In [112]:
import os
from workshop_utils import chat_once

# Sanitize RETRY_BACKOFF in case it carries a trailing comment (e.g. "1.0  # seconds")
if 'RETRY_BACKOFF' in os.environ:
    parts = os.environ['RETRY_BACKOFF'].split()
    try:
        # IndexError if the value is empty, ValueError if it is not numeric
        float(parts[0])
        os.environ['RETRY_BACKOFF'] = parts[0]
    except (IndexError, ValueError):
        os.environ['RETRY_BACKOFF'] = '1.0'

def route(prompt: str, max_tokens: int = 200, temperature: float = 0.7):
    """Route prompt to appropriate model based on intent.
    
    Pipeline:
    1. Detect intent using regex patterns
    2. Select best model by capability + priority
    3. Execute via Foundry Local SDK
    
    Args:
        prompt: User input text
        max_tokens: Maximum tokens in response
        temperature: Sampling temperature (0-1)
        
    Returns:
        Dict with: intent, model, output, tokens, usage, error
    """
    intent = detect_intent(prompt)
    model_alias = pick_model(intent)
    
    if not model_alias:
        return {
            'intent': intent,
            'model': None,
            'output': '',
            'tokens': None,
            'usage': {},
            'error': 'No suitable model found'
        }
    
    try:
        # Call Foundry Local via workshop_utils
        text, usage = chat_once(
            model_alias,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=temperature
        )
        
        # Extract token information
        usage_info = {}
        if usage:
            usage_info['prompt_tokens'] = getattr(usage, 'prompt_tokens', None)
            usage_info['completion_tokens'] = getattr(usage, 'completion_tokens', None)
            usage_info['total_tokens'] = getattr(usage, 'total_tokens', None)
        
        # Estimate if not provided
        if not usage_info.get('total_tokens'):
            est_prompt = len(prompt) // 4
            est_completion = len(text or '') // 4
            usage_info['estimated_tokens'] = est_prompt + est_completion
        
        return {
            'intent': intent,
            'model': model_alias,
            'output': (text or '').strip(),
            'tokens': usage_info.get('total_tokens') or usage_info.get('estimated_tokens'),
            'usage': usage_info,
            'error': None
        }
    
    except Exception as e:
        return {
            'intent': intent,
            'model': model_alias,
            'output': '',
            'tokens': None,
            'usage': {},
            'error': f'{type(e).__name__}: {str(e)}'
        }

print('✅ Routing function ready')
print('   Using Foundry Local SDK via workshop_utils')
print('   Token tracking: Enabled')
print('   Retry logic: Automatic with exponential backoff')

✅ Routing function ready
   Using Foundry Local SDK via workshop_utils
   Token tracking: Enabled
   Retry logic: Automatic with exponential backoff


## 🎯 Step 7: Run Routing Tests

Test the complete routing system with a range of prompts to demonstrate:
- Automatic intent detection
- Smart model selection
- Routing across multiple retained models
- Token tracking and performance metrics


In [None]:
# Test prompts covering all intent categories
test_cases = [
    {
        'prompt': 'Refactor this Python function to make it more efficient and readable',
        'expected_intent': 'code'
    },
    {
        'prompt': 'Summarize the key benefits of using small language models at the edge',
        'expected_intent': 'summarize'
    },
    {
        'prompt': 'Classify this user feedback: The app is slow but the UI looks great',
        'expected_intent': 'classification'
    },
    {
        'prompt': 'Explain the difference between local and cloud inference',
        'expected_intent': 'general'
    },
    {
        'prompt': 'Write a Python function to calculate the Fibonacci sequence',
        'expected_intent': 'code'
    },
    {
        'prompt': 'Give me a brief overview of the Phi model family',
        'expected_intent': 'summarize'
    }
]

print('🎯 Running Intent-Based Routing Tests')
print('=' * 80)

results = []
for i, test in enumerate(test_cases, 1):
    print(f'\n[{i}/{len(test_cases)}] Testing prompt...')
    print(f'Prompt: {test["prompt"]}')
    
    result = route(test['prompt'], max_tokens=150)
    results.append(result)
    
    print(f'   Expected Intent: {test["expected_intent"]}')
    print(f'   Detected Intent: {result["intent"]} {"✅" if result["intent"] == test["expected_intent"] else "⚠️"}')
    print(f'   Selected Model:  {result["model"]}')
    
    if result['error']:
        print(f'   ❌ Error: {result["error"]}')
    else:
        output_preview = result['output'][:100] + '...' if len(result['output']) > 100 else result['output']
        print(f'   ✅ Response: {output_preview}')
        
        tokens = result.get('tokens', 0)
        if tokens:
            usage = result.get('usage', {})
            if 'estimated_tokens' in usage:
                print(f'   📊 Tokens: ~{tokens} (estimated)')
            else:
                print(f'   📊 Tokens: {tokens}')

# Summary statistics
print('\n' + '=' * 80)
print('📊 ROUTING SUMMARY')
print('=' * 80)

success_count = sum(1 for r in results if not r['error'])
total_tokens = sum(r.get('tokens', 0) or 0 for r in results if not r['error'])
intent_accuracy = sum(1 for i, r in enumerate(results) if r['intent'] == test_cases[i]['expected_intent'])

print(f'Total Prompts:        {len(results)}')
print(f'✅ Successful:         {success_count}/{len(results)}')
print(f'❌ Failed:             {len(results) - success_count}')
print(f'🎯 Intent Accuracy:    {intent_accuracy}/{len(results)} ({intent_accuracy/len(results)*100:.1f}%)')
print(f'📊 Total Tokens Used:  {total_tokens}')

# Model usage distribution
print('\n📋 Model Usage Distribution:')
model_counts = {}
for r in results:
    if r['model']:
        model_counts[r['model']] = model_counts.get(r['model'], 0) + 1

for model, count in sorted(model_counts.items(), key=lambda x: x[1], reverse=True):
    percentage = (count / len(results)) * 100
    print(f'   • {model}: {count} requests ({percentage:.1f}%)')

if success_count == len(results):
    print('\n🎉 All routing tests passed successfully!')
else:
    print(f'\n⚠️  {len(results) - success_count} test(s) failed')
    print('   Check Foundry Local service: foundry service status')
    print('   Verify models loaded: foundry model ls')

print('=' * 80)

🎯 Running Intent-Based Routing Tests

[1/6] Testing prompt...
Prompt: Refactor this Python function to make it more efficient and readable


   Expected Intent: code
   Detected Intent: code ✅
   Selected Model:  phi-3.5-mini
   ✅ Response: To refactor a Python function for efficiency and readability, I would need to see the specific funct...
   📊 Tokens: ~158 (estimated)

[2/6] Testing prompt...
Prompt: Summarize the key benefits of using small language models at the edge
   Expected Intent: summarize
   Detected Intent: summarize ✅
   Selected Model:  phi-4-mini
   ❌ Error: APIConnectionError: Connection error.

[3/6] Testing prompt...
Prompt: Classify this user feedback: The app is slow but the UI looks great
   Expected Intent: classification
   Detected Intent: classification ✅
   Selected Model:  qwen2.5-0.5b
   ❌ Error: APIConnectionError: Connection error.

[4/6] Testing prompt...
Prompt: Explain the difference between local and cloud inference
   Expected Intent: general
   Detected Intent: general ✅
   Selected Model:  phi-4-mini
   ❌ Error: APIConnectionError: Connection error.

[5/6] Testing prompt...
Prompt: Wr

## 🔧 Step 8: Interactive Testing

Try your own prompts to see the routing system in action!


In [None]:
# Interactive testing - modify the prompt and run this cell
custom_prompt = "Explain how model quantization reduces memory usage"

print('🎯 Interactive Routing Test')
print('=' * 80)
print(f'Your prompt: {custom_prompt}')
print()

result = route(custom_prompt, max_tokens=200)

print(f'Detected Intent: {result["intent"]}')
print(f'Selected Model:  {result["model"]}')
print()

if result['error']:
    print(f'❌ Error: {result["error"]}')
else:
    print('✅ Response:')
    print('-' * 80)
    print(result['output'])
    print('-' * 80)
    
    if result['tokens']:
        print(f'\n📊 Tokens used: {result["tokens"]}')

print('\n💡 Try different prompts to test routing behavior!')

🎯 Interactive Routing Test
Your prompt: Explain how model quantization reduces memory usage

Detected Intent: general
Selected Model:  phi-4-mini

✅ Response:
--------------------------------------------------------------------------------
Model quantization is a technique used to reduce the memory footprint of a machine learning model, particularly deep learning models. It works by converting the high-precision weights of a neural network, typically represented as 32-bit floating-point numbers, into lower-precision representations, such as 8-bit integers or even binary values.


The primary reason for quantization is to decrease the amount of memory required to store the model's parameters. Since floating-point numbers take up more space than integers, by quantizing the weights, we can significantly reduce the model's size. This reduction in size not only saves memory but also can lead to faster computation during inference, as integer operations are generally faster than floating-poi

## 📊 Step 9: Performance Analysis

Analyze the routing system's performance and how the models are used.


In [None]:
import time

# Performance benchmark
benchmark_prompts = [
    'Write a hello world function',
    'Summarize: AI at the edge is powerful',
    'Classify: Good product',
    'Explain edge computing'
]

print('⚡ Performance Benchmark')
print('=' * 80)

timings = []
for prompt in benchmark_prompts:
    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    result = route(prompt, max_tokens=50)
    duration = time.perf_counter() - start
    timings.append(duration)
    
    print(f'\nPrompt: {prompt[:40]}...')
    print(f'   Model: {result["model"]}')
    print(f'   Time: {duration:.2f}s')
    if result.get('tokens'):
        print(f'   Tokens: {result["tokens"]}')

print('\n' + '=' * 80)
print('📊 Performance Statistics:')
print(f'   Average response time: {sum(timings)/len(timings):.2f}s')
print(f'   Fastest response:      {min(timings):.2f}s')
print(f'   Slowest response:      {max(timings):.2f}s')
print('\n💡 Note: First request may be slower due to model initialization')
print('=' * 80)

⚡ Performance Benchmark

Prompt: Write a hello world function...
   Model: phi-3.5-mini
   Time: 3.31s
   Tokens: 60

Prompt: Summarize: AI at the edge is powerful...
   Model: phi-4-mini
   Time: 49.67s
   Tokens: 84

Prompt: Classify: Good product...
   Model: qwen2.5-0.5b
   Time: 7.21s
   Tokens: 69

Prompt: Explain edge computing...
   Model: phi-4-mini
   Time: 49.67s
   Tokens: 72

📊 Performance Statistics:
   Average response time: 27.46s
   Fastest response:      3.31s
   Slowest response:      49.67s

💡 Note: First request may be slower due to model initialization
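
An overall average hides large differences between models, so grouping the timings by model can be more informative. The sketch below uses the four (model, seconds) pairs from the recorded output above as sample data. The grouping approach itself is an addition for illustration, not part of the notebook.

```python
from collections import defaultdict
from statistics import mean, median

# (model, seconds) pairs copied from the benchmark output above.
samples = [
    ('phi-3.5-mini', 3.31),
    ('phi-4-mini', 49.67),
    ('qwen2.5-0.5b', 7.21),
    ('phi-4-mini', 49.67),
]

# Group the timings by the model that served each request.
by_model = defaultdict(list)
for model, seconds in samples:
    by_model[model].append(seconds)

# Report the fastest models first.
for model, times in sorted(by_model.items(), key=lambda kv: mean(kv[1])):
    print(f'{model:<14} n={len(times)}  mean={mean(times):.2f}s  max={max(times):.2f}s')

all_times = [seconds for _, seconds in samples]
print(f'overall mean={mean(all_times):.2f}s  median={median(all_times):.2f}s')
```

With so few samples these figures are only indicative. A longer run, with the first request per model discarded as warm-up, would give a fairer comparison.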


## 🎓 Key Takeaways and Next Steps

### ✅ What You've Learned

1. **Intent-Based Routing**: Automatically classify prompts and direct them to specialized models  
2. **Memory-Aware Selection**: Choose CPU models based on the system's RAM  
3. **Multi-Model Retention**: Use `--retain true` to keep multiple models loaded  
4. **Production Patterns**: Retry logic, error handling, and token tracking  
5. **CPU Optimization**: Deploy efficiently without a GPU  

### 🚀 Ideas to Experiment With

1. **Add Custom Intents**:  
   ```python
   INTENT_RULES.append(
       (re.compile(r'translate|convert', re.I), 'translation')
   )
   ```
  
2. **Load Additional Models**:  
   ```bash
   foundry model run llama-3.2-1b-cpu --retain true
   ```
  
3. **Tune Model Selection**:  
   - Change the priority values in CATALOG  
   - Add more capability tags  
   - Implement fallback strategies  

4. **Monitor Performance**:  
   ```python
   import psutil
   print(f"Memory: {psutil.virtual_memory().percent}%")
   ```
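
One caveat on the first idea: registering a new intent only changes what is detected. Until some catalog entry lists the matching capability, selection falls back to the first model in the catalog. The sketch below shows both halves with small stand-ins for `INTENT_RULES` and `CATALOG`. The names `intent_rules`, `catalog`, `detect`, and `pick` are hypothetical, and tagging `qwen2.5-0.5b` for translation is an arbitrary choice made only to show the mechanism.

```python
import re

# Stand-ins for the notebook's INTENT_RULES and CATALOG so this sketch runs on its own.
intent_rules = [
    (re.compile(r'code|refactor|function|debug|program', re.I), 'code'),
    (re.compile(r'explain|teach|describe', re.I), 'general'),
]
catalog = {
    'phi-4-mini': {'capabilities': ['general', 'summarize', 'reasoning'], 'priority': 3},
    'qwen2.5-0.5b': {'capabilities': ['classification', 'fast', 'general'], 'priority': 1},
}


def detect(prompt):
    """First matching rule wins; unmatched prompts are 'general'."""
    for pattern, intent in intent_rules:
        if pattern.search(prompt):
            return intent
    return 'general'


def pick(intent):
    """Highest-priority model with the capability, else the first catalog entry."""
    matches = [alias for alias, info in catalog.items() if intent in info['capabilities']]
    if matches:
        return max(matches, key=lambda alias: catalog[alias]['priority'])
    return next(iter(catalog), None)


# 1) Register the new intent, as in the first idea above.
intent_rules.append((re.compile(r'translate|convert', re.I), 'translation'))
prompt = 'Translate this sentence into French'
print(detect(prompt), '->', pick(detect(prompt)))  # fallback: no model is tagged yet

# 2) Tag a model with the capability (arbitrary choice, for illustration only).
catalog['qwen2.5-0.5b']['capabilities'].append('translation')
print(detect(prompt), '->', pick(detect(prompt)))
```

Because the new rule is appended at the end, a prompt such as "Convert this function to Rust" is still classified as `code`. Insert the rule earlier in the list if translation should take precedence.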
  

### 📚 Additional Resources

- **Foundry Local SDK**: https://github.com/microsoft/Foundry-Local  
- **Workshop Samples**: ../samples/  
- **Edge AI Course**: ../../Module08/  

### 💡 Best Practices

✅ Use CPU models for consistent behavior across platforms  
✅ Always check system memory before loading multiple models  
✅ Use `--retain true` for routing scenarios  
✅ Implement proper error handling and retries  
✅ Track token usage to optimize cost and performance  
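
As one way to combine error handling with the catalog, a request can fall through to the next candidate when the preferred model is unreachable, as happened with `phi-4-mini` in the Step 7 output. This is a minimal sketch. `candidates_for`, `call_with_fallback`, and `fake_call` are hypothetical names, and the backend is simulated so the example runs without Foundry Local.

```python
from typing import Callable, Dict, List, Optional, Tuple

catalog: Dict[str, dict] = {
    'phi-4-mini': {'capabilities': ['general', 'summarize', 'reasoning'], 'priority': 3},
    'qwen2.5-0.5b': {'capabilities': ['classification', 'fast', 'general'], 'priority': 1},
    'phi-3.5-mini': {'capabilities': ['code', 'refactor', 'technical'], 'priority': 2},
}


def candidates_for(intent: str) -> List[str]:
    """Capable models by descending priority, then the rest as last resorts."""
    capable = sorted(
        (alias for alias, info in catalog.items() if intent in info['capabilities']),
        key=lambda alias: catalog[alias]['priority'],
        reverse=True,
    )
    return capable + [alias for alias in catalog if alias not in capable]


def call_with_fallback(
    intent: str,
    prompt: str,
    call_model: Callable[[str, str], str],
) -> Tuple[Optional[str], str, List[str]]:
    """Try each candidate in turn; return (alias, text, errors collected so far)."""
    errors: List[str] = []
    for alias in candidates_for(intent):
        try:
            return alias, call_model(alias, prompt), errors
        except Exception as exc:
            errors.append(f'{alias}: {type(exc).__name__}: {exc}')
    return None, '', errors


def fake_call(alias: str, prompt: str) -> str:
    """Simulated backend (assumption): pretends phi-4-mini is unreachable."""
    if alias == 'phi-4-mini':
        raise ConnectionError('Connection error.')
    return f'[{alias}] reply to: {prompt}'


print(call_with_fallback('general', 'Explain edge computing', fake_call))
```

In the notebook, `call_model` could wrap `chat_once` with the same arguments that `route` passes. Whether falling back to a less capable model is acceptable depends on the task, so the collected `errors` are returned to let the caller decide.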

---

**🎉 Congratulations!** You've built a production-ready intent-based model router using the Foundry Local SDK with CPU-optimized models.



---

**Disclaimer**:  
This document was translated using the machine translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
