# Intent-Based Model Router with the Foundry Local SDK

**CPU-Optimized Multi-Model Routing System**

This notebook demonstrates an intelligent routing system that automatically selects the most suitable small language model based on the user's intent. It fits edge deployment scenarios where you want to use several specialized models efficiently.

## 🎯 What You'll Learn

- **Intent detection**: Automatically classify prompts (code, summarize, classification, general)
- **Smart model selection**: Route each task to the most capable model
- **CPU optimization**: Memory-efficient models that run on CPU-only hardware
- **Multi-model management**: Keep several models loaded with `--retain true`
- **Production patterns**: Retry logic, error handling, and token tracking

## 📋 Scenario Overview

This pattern demonstrates:

1. **Intent detection**: Classify each user prompt (code, summarize, classification, or general)
2. **Model selection**: Automatically choose the most suitable small language model by capability
3. **Local execution**: Route to models running locally through the Foundry Local service
4. **Unified interface**: A single chat entry point that routes to multiple specialized models

**Ideal for**: Edge deployments with multiple specialized models where you want intelligent request routing without manual model selection.

## 🔧 Prerequisites

- **Foundry Local** installed and the service running
- **Python 3.8+** with pip
- **8GB+ RAM** (16GB+ recommended for multiple models)
- **workshop_utils** module (in ../samples/)

## 🚀 Quick Start

The notebook will:
1. Detect your system memory
2. Recommend suitable CPU models
3. Load the models automatically with `--retain true`
4. Verify that all models are ready
5. Route test prompts to specialized models

**Estimated setup time**: 5-7 minutes (includes model loading)


## 📦 Step 1: Install Dependencies

Install the official Foundry Local SDK and the required libraries:

- **foundry-local-sdk**: Official Python SDK for local model management
- **openai**: OpenAI-compatible API for chat completions
- **psutil**: System memory detection and monitoring


In [107]:
# Install core dependencies
!pip install -q foundry-local-sdk openai psutil
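
After the install cell it can help to confirm that the packages are importable from the running kernel. The sketch below assumes the import names `foundry_local`, `openai`, and `psutil`; the helper `missing_modules` is hypothetical and not part of any of these libraries.

```python
import importlib.util

# Assumed import names: 'foundry_local' for foundry-local-sdk,
# 'openai' and 'psutil' for the packages of the same name.
REQUIRED_MODULES = ['foundry_local', 'openai', 'psutil']

def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print(f'Missing modules: {missing} (re-run the pip install cell)')
else:
    print('All required modules are importable')
```

If a module is reported missing even though pip succeeded, the kernel is probably running a different Python environment than the one pip installed into.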

## 💻 Step 2: Detect System Memory

Detect system memory to determine which CPU models can run effectively. The cell reports both total and available memory and selects a model tier from the total, so the recommendation matches your hardware.


In [108]:
import psutil

# Get system memory information
total_memory_gb = psutil.virtual_memory().total / (1024**3)
available_memory_gb = psutil.virtual_memory().available / (1024**3)

print('🖥️  System Memory Information')
print('=' * 70)
print(f'Total Memory:     {total_memory_gb:.2f} GB')
print(f'Available Memory: {available_memory_gb:.2f} GB')
print()

# Recommend models based on total memory; available memory is printed above for reference
# Using model aliases - Foundry Local will automatically select CPU variant
model_aliases = []

if total_memory_gb >= 32:
    model_aliases = ['phi-4-mini', 'phi-3.5-mini', 'qwen2.5-0.5b', 'qwen2.5-coder-0.5b']
    print('✅ High Memory System (32GB+)')
    print('   Can run 3-4 models simultaneously')
elif total_memory_gb >= 16:
    model_aliases = ['phi-4-mini', 'qwen2.5-0.5b', 'phi-3.5-mini']
    print('✅ Medium Memory System (16-32GB)')
    print('   Can run 2-3 models simultaneously')
elif total_memory_gb >= 8:
    model_aliases = ['qwen2.5-0.5b', 'phi-3.5-mini']
    print('⚠️  Lower Memory System (8-16GB)')
    print('   Recommended: 2 smaller models')
else:
    model_aliases = ['qwen2.5-0.5b']
    print('⚠️  Limited Memory System (<8GB)')
    print('   Recommended: Use only smallest model')

print()
print('📋 Recommended Model Aliases for Your System:')
for model in model_aliases:
    print(f'   • {model}')

print()
print('💡 About Model Aliases:')
print('   ✓ Use base alias (e.g., phi-4-mini, not phi-4-mini-cpu)')
print('   ✓ Foundry Local automatically selects CPU variant for your hardware')
print('   ✓ No GPU required - optimized for CPU inference')
print('   ✓ Predictable memory usage and consistent performance')
print('=' * 70)

🖥️  System Memory Information
Total Memory:     63.30 GB
Available Memory: 16.19 GB

✅ High Memory System (32GB+)
   Can run 3-4 models simultaneously

📋 Recommended Model Aliases for Your System:
   • phi-4-mini
   • phi-3.5-mini
   • qwen2.5-0.5b
   • qwen2.5-coder-0.5b

💡 About Model Aliases:
   ✓ Use base alias (e.g., phi-4-mini, not phi-4-mini-cpu)
   ✓ Foundry Local automatically selects CPU variant for your hardware
   ✓ No GPU required - optimized for CPU inference
   ✓ Predictable memory usage and consistent performance
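
The tiers above are chosen from total memory, while the recorded run shows far less memory actually available (16.19 GB of 63.30 GB). A minimal sketch of an alternative that budgets against available memory follows. The per-model footprints in `ASSUMED_FOOTPRINT_GB` are placeholder values for illustration, not measurements, and `fit_models` is a hypothetical helper.

```python
# Hypothetical per-model RAM footprints in GB: placeholder values, not measurements.
ASSUMED_FOOTPRINT_GB = {
    'phi-4-mini': 4.0,
    'phi-3.5-mini': 3.0,
    'qwen2.5-0.5b': 1.0,
    'qwen2.5-coder-0.5b': 1.0,
}

def fit_models(preferred, available_gb, headroom_gb=2.0):
    """Greedily keep models, in preference order, while they fit the memory budget."""
    budget = available_gb - headroom_gb  # leave room for the OS and the notebook
    chosen = []
    for alias in preferred:
        need = ASSUMED_FOOTPRINT_GB.get(alias)
        if need is None:
            continue  # unknown footprint: skip rather than guess
        if need <= budget:
            chosen.append(alias)
            budget -= need
    return chosen

print(fit_models(['phi-4-mini', 'phi-3.5-mini', 'qwen2.5-0.5b'], available_gb=8.0))
```

In the notebook, `available_gb` could be the `available_memory_gb` value computed in the cell above.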


## 🤖 Step 3: Automatic Model Loading

This cell automatically:
1. Starts the Foundry Local service (if it is not already running)
2. Loads the recommended models with `--retain true` (keeps multiple models in memory)
3. Uses the SDK to verify that all models are ready

⏱️ **Expected time**: 3-5 minutes for all models


In [109]:
import subprocess
import time
import sys
import os

# Add samples directory for workshop_utils (Foundry SDK pattern)
sys.path.append(os.path.join('..', 'samples'))

print('🚀 Automatic Model Loading with SDK Verification')
print('=' * 70)

# Use top 3 recommended models (aliases)
# Foundry will automatically load CPU variants
REQUIRED_MODELS = model_aliases[:3]
print(f'📋 Loading {len(REQUIRED_MODELS)} models: {REQUIRED_MODELS}')
print('💡 Using model aliases - Foundry will load CPU variants automatically')
print()

# Step 1: Ensure Foundry Local service is running
print('📡 Step 1: Checking Foundry Local service...')
try:
    result = subprocess.run(['foundry', 'service', 'status'], 
                          capture_output=True, text=True, timeout=5)
    if result.returncode == 0:
        print('   ✅ Service is already running')
    else:
        print('   ⚙️  Starting Foundry Local service...')
        subprocess.run(['foundry', 'service', 'start'], 
                      capture_output=True, text=True, timeout=30)
        time.sleep(5)
        print('   ✅ Service started')
except Exception as e:
    print(f'   ⚠️  Could not verify service: {e}')
    print('   💡 Try manually: foundry service start')

# Step 2: Load each model with --retain true
print(f'\n🤖 Step 2: Loading models with retention...')
for i, model in enumerate(REQUIRED_MODELS, 1):
    print(f'   [{i}/{len(REQUIRED_MODELS)}] Starting {model}...')
    try:
        subprocess.Popen(['foundry', 'model', 'run', model, '--retain', 'true'],
                        stdout=subprocess.DEVNULL,
                        stderr=subprocess.DEVNULL)
        print(f'       ✅ {model} loading in background')
    except Exception as e:
        print(f'       ❌ Error starting {model}: {e}')

# Step 3: Verify models are ready
print(f'\n✅ Step 3: Verifying models (this may take 2-3 minutes)...')
print('=' * 70)

try:
    from workshop_utils import get_client
    
    ready_models = []
    max_attempts = 30
    attempt = 0
    
    while len(ready_models) < len(REQUIRED_MODELS) and attempt < max_attempts:
        attempt += 1
        print(f'\n   Attempt {attempt}/{max_attempts}...')
        
        for model in REQUIRED_MODELS:
            if model in ready_models:
                continue
                
            try:
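                # Pass only the alias: the recorded run failed with
                # "get_client() takes 1 positional argument but 2 were given".
                manager, client, model_id = get_client(model)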
                response = client.chat.completions.create(
                    model=model_id,
                    messages=[{"role": "user", "content": "test"}],
                    max_tokens=5,
                    temperature=0
                )
                
                if response and response.choices:
                    ready_models.append(model)
                    print(f'   ✅ {model} is READY')
                    
            except Exception as e:
                error_msg = str(e).lower()
                if 'connection' in error_msg or 'timeout' in error_msg:
                    print(f'   ⏳ {model} still loading...')
                else:
                    print(f'   ⚠️  {model} error: {str(e)[:60]}...')
        
        if len(ready_models) == len(REQUIRED_MODELS):
            break
            
        if len(ready_models) < len(REQUIRED_MODELS):
            time.sleep(10)
    
    # Final status
    print('\n' + '=' * 70)
    print(f'📦 Final Status: {len(ready_models)}/{len(REQUIRED_MODELS)} models ready')
    
    for model in REQUIRED_MODELS:
        if model in ready_models:
            print(f'   ✅ {model} - READY (retained in memory)')
        else:
            print(f'   ❌ {model} - NOT READY')
    
    if len(ready_models) == len(REQUIRED_MODELS):
        print('\n🎉 All models loaded and verified!')
        print('   ✅ Ready for intent-based routing')
    else:
        print(f'\n⚠️  Some models not ready. Check: foundry model ls')
        
except ImportError as e:
    print(f'\n❌ Cannot import workshop_utils: {e}')
    print('   💡 Ensure workshop_utils.py is in ../samples/')
except Exception as e:
    print(f'\n❌ Verification error: {e}')

🚀 Automatic Model Loading with SDK Verification
📋 Loading 3 models: ['phi-4-mini', 'phi-3.5-mini', 'qwen2.5-0.5b']
💡 Using model aliases - Foundry will load CPU variants automatically

📡 Step 1: Checking Foundry Local service...
   ✅ Service is already running

🤖 Step 2: Loading models with retention...
   [1/3] Starting phi-4-mini...
       ✅ phi-4-mini loading in background
   [2/3] Starting phi-3.5-mini...
       ✅ phi-3.5-mini loading in background
   [3/3] Starting qwen2.5-0.5b...
       ✅ qwen2.5-0.5b loading in background

✅ Step 3: Verifying models (this may take 2-3 minutes)...

   Attempt 1/30...
   ⚠️  phi-4-mini error: get_client() takes 1 positional argument but 2 were given...
   ⚠️  phi-3.5-mini error: get_client() takes 1 positional argument but 2 were given...
   ⚠️  qwen2.5-0.5b error: get_client() takes 1 positional argument but 2 were given...

   Attempt 2/30...
   ⚠️  phi-4-mini error: get_client() takes 1 positional argument but 2 were given...
   ⚠️  phi-3.5-min
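
The verification loop above is a polling pattern: probe every model, wait, and repeat until all are ready or the attempts run out. A minimal sketch of that pattern with a time-based deadline is shown here. `wait_until_ready` and its `probe` callable are hypothetical names; in the notebook the probe would wrap the short chat-completion test call.

```python
import time

def wait_until_ready(names, probe, timeout_s=300.0, interval_s=10.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll probe(name) until every name is ready or the deadline passes.

    probe is any callable that returns True once the model answers.
    Exceptions from probe are treated as "not ready yet".
    Returns a (ready, pending) pair of lists.
    """
    deadline = clock() + timeout_s
    pending = list(names)
    ready = []
    while pending:
        for name in list(pending):  # iterate over a copy while removing
            try:
                ok = probe(name)
            except Exception:
                ok = False
            if ok:
                pending.remove(name)
                ready.append(name)
        if not pending or clock() >= deadline:
            break
        sleep(interval_s)
    return ready, pending

# Demo with a stub probe that reports every model as ready.
ready, pending = wait_until_ready(['model-a'], lambda name: True, interval_s=0)
print(ready, pending)
```

Injecting `clock` and `sleep` keeps the function testable without real waiting, and a deadline bounds the total wait even when individual probes are slow.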

## 🎯 Step 4: Configure Intent Detection and the Model Catalog

Set up the routing system:
- **Intent rules**: Regex patterns that classify prompts
- **Model catalog**: Maps model capabilities to intent categories
- **Priority system**: Decides which model is chosen when several match

**Benefits of CPU models**:
- ✅ No GPU required
- ✅ Consistent performance
- ✅ Lower power consumption
- ✅ Predictable memory usage


In [110]:
import re

# Model capability catalog (maps model aliases to capabilities)
# Use base aliases - Foundry Local will automatically select CPU variants
CATALOG = {
    'phi-4-mini': {
        'capabilities': ['general', 'summarize', 'reasoning'],
        'priority': 3
    },
    'qwen2.5-0.5b': {
        'capabilities': ['classification', 'fast', 'general'],
        'priority': 1
    },
    'phi-3.5-mini': {
        'capabilities': ['code', 'refactor', 'technical'],
        'priority': 2
    },
    'qwen2.5-coder-0.5b': {
        'capabilities': ['code', 'programming', 'debug'],
        'priority': 1
    }
}

# Filter to only include models recommended for this system
CATALOG = {k: v for k, v in CATALOG.items() if k in model_aliases}

print('📋 Active Model Catalog (Hardware-Optimized Aliases)')
print('=' * 70)
print('💡 Using model aliases - Foundry automatically selects CPU variants')
print()
for model, info in CATALOG.items():
    caps = ', '.join(info['capabilities'])
    print(f'   • {model}')
    print(f'     Capabilities: {caps}')
    print(f'     Priority: {info["priority"]}')
    print()

# Intent detection rules (regex pattern -> intent label)
INTENT_RULES = [
    (re.compile(r'code|refactor|function|debug|program', re.I), 'code'),
    (re.compile(r'summar|abstract|tl;?dr|brief', re.I), 'summarize'),
    (re.compile(r'classif|categor|label|sentiment', re.I), 'classification'),
    (re.compile(r'explain|teach|describe', re.I), 'general'),
]

def detect_intent(prompt: str) -> str:
    """Detect intent from prompt using regex patterns.
    
    Args:
        prompt: User input text
        
    Returns:
        Intent label: 'code', 'summarize', 'classification', or 'general'
    """
    for pattern, intent in INTENT_RULES:
        if pattern.search(prompt):
            return intent
    return 'general'

def pick_model(intent: str) -> str:
    """Select best model for intent based on capabilities and priority.
    
    Args:
        intent: Detected intent category
        
    Returns:
        Model alias string, or first available model if no match
    """
    candidates = [
        (alias, info['priority']) 
        for alias, info in CATALOG.items() 
        if intent in info['capabilities']
    ]
    
    if candidates:
        # Sort by priority (higher = better)
        candidates.sort(key=lambda x: x[1], reverse=True)
        return candidates[0][0]
    
    # Fallback to first available model
    return list(CATALOG.keys())[0] if CATALOG else None

print('✅ Intent detection and model selection configured')
print('=' * 70)

📋 Active Model Catalog (Hardware-Optimized Aliases)
💡 Using model aliases - Foundry automatically selects CPU variants

   • phi-4-mini
     Capabilities: general, summarize, reasoning
     Priority: 3

   • qwen2.5-0.5b
     Capabilities: classification, fast, general
     Priority: 1

   • phi-3.5-mini
     Capabilities: code, refactor, technical
     Priority: 2

   • qwen2.5-coder-0.5b
     Capabilities: code, programming, debug
     Priority: 1

✅ Intent detection and model selection configured
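
The rules above match substrings anywhere in the prompt, so a prompt such as "Explain how to decode base64" is classified as `code` because "decode" contains "code". A sketch of a stricter variant that anchors each keyword at a word boundary is shown below. `STRICT_RULES` and `detect_intent_strict` are illustrative names, not part of the notebook.

```python
import re

# Same keywords as the notebook's rules, anchored with \b so they only match
# at the start of a word ("programming" still matches, "decode" does not).
STRICT_RULES = [
    (re.compile(r'\b(code|refactor|function|debug|program)', re.I), 'code'),
    (re.compile(r'\b(summar|abstract|tl;?dr|brief)', re.I), 'summarize'),
    (re.compile(r'\b(classif|categor|label|sentiment)', re.I), 'classification'),
    (re.compile(r'\b(explain|teach|describe)', re.I), 'general'),
]

def detect_intent_strict(prompt: str) -> str:
    """Return the intent of the first rule that matches, else 'general'."""
    for pattern, intent in STRICT_RULES:
        if pattern.search(prompt):
            return intent
    return 'general'

prompt = 'Explain how to decode base64'
loose = 'code' if re.search(r'code', prompt, re.I) else 'other'
print(f'substring rule: {loose}, word-boundary rule: {detect_intent_strict(prompt)}')
```

Rule order still matters: the first matching rule wins, so broader rules belong later in the list.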


## 🧪 Step 5: Test Intent Detection

Verify that the intent detection system classifies different kinds of prompts correctly.


In [111]:
# Test intent detection with sample prompts
test_prompts = [
    'Refactor this Python function for better readability',
    'Summarize the key points of this article',
    'Classify this customer feedback as positive or negative',
    'Explain how edge AI differs from cloud AI',
    'Write a function to calculate fibonacci numbers',
    'Give me a brief overview of small language models'
]

print('🧪 Testing Intent Detection')
print('=' * 70)

for prompt in test_prompts:
    intent = detect_intent(prompt)
    model = pick_model(intent)
    print(f'\nPrompt: {prompt[:50]}...')
    print(f'   Intent: {intent:15s} → Model: {model}')

print('\n' + '=' * 70)
print('✅ Intent detection working correctly')

🧪 Testing Intent Detection

Prompt: Refactor this Python function for better readabili...
   Intent: code            → Model: phi-3.5-mini

Prompt: Summarize the key points of this article...
   Intent: summarize       → Model: phi-4-mini

Prompt: Classify this customer feedback as positive or neg...
   Intent: classification  → Model: qwen2.5-0.5b

Prompt: Explain how edge AI differs from cloud AI...
   Intent: general         → Model: phi-4-mini

Prompt: Write a function to calculate fibonacci numbers...
   Intent: code            → Model: phi-3.5-mini

Prompt: Give me a brief overview of small language models...
   Intent: summarize       → Model: phi-4-mini

✅ Intent detection working correctly
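
The cell above prints its success message regardless of what was detected. A minimal sketch of a table-driven check that compares each result with an expected label follows. The rules are copied from step 4 so the snippet runs on its own, the expected labels are taken from the recorded output above, and `check_intents` is a hypothetical helper.

```python
import re

# Copied from step 4 so the snippet is self-contained.
INTENT_RULES = [
    (re.compile(r'code|refactor|function|debug|program', re.I), 'code'),
    (re.compile(r'summar|abstract|tl;?dr|brief', re.I), 'summarize'),
    (re.compile(r'classif|categor|label|sentiment', re.I), 'classification'),
    (re.compile(r'explain|teach|describe', re.I), 'general'),
]

def detect_intent(prompt: str) -> str:
    for pattern, intent in INTENT_RULES:
        if pattern.search(prompt):
            return intent
    return 'general'

EXPECTED = [
    ('Refactor this Python function for better readability', 'code'),
    ('Summarize the key points of this article', 'summarize'),
    ('Classify this customer feedback as positive or negative', 'classification'),
    ('Explain how edge AI differs from cloud AI', 'general'),
    ('Write a function to calculate fibonacci numbers', 'code'),
    ('Give me a brief overview of small language models', 'summarize'),
]

def check_intents(cases):
    """Return (prompt, expected, actual) for every misclassified prompt."""
    failures = []
    for prompt, expected in cases:
        actual = detect_intent(prompt)
        if actual != expected:
            failures.append((prompt, expected, actual))
    return failures

mismatches = check_intents(EXPECTED)
print('All intents match' if not mismatches else f'Mismatches: {mismatches}')
```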


## 🚀 Step 6: Implement the Routing Function

Create the main routing function that:
1. Detects intent from the prompt
2. Selects the appropriate model
3. Executes the request through the Foundry Local SDK
4. Tracks token usage and errors

**Uses the workshop_utils pattern**:
- Automatic retries with exponential backoff
- OpenAI-compatible API
- Token tracking and error handling


In [112]:
import os
from workshop_utils import chat_once

# Normalize RETRY_BACKOFF if it carries a trailing comment (e.g. "1.0  # seconds")
if 'RETRY_BACKOFF' in os.environ:
    parts = os.environ['RETRY_BACKOFF'].split()
    retry_val = parts[0] if parts else ''  # an empty value falls through to the default
    try:
        float(retry_val)
        os.environ['RETRY_BACKOFF'] = retry_val
    except ValueError:
        os.environ['RETRY_BACKOFF'] = '1.0'

def route(prompt: str, max_tokens: int = 200, temperature: float = 0.7):
    """Route prompt to appropriate model based on intent.
    
    Pipeline:
    1. Detect intent using regex patterns
    2. Select best model by capability + priority
    3. Execute via Foundry Local SDK
    
    Args:
        prompt: User input text
        max_tokens: Maximum tokens in response
        temperature: Sampling temperature (0-1)
        
    Returns:
        Dict with: intent, model, output, tokens, usage, error
    """
    intent = detect_intent(prompt)
    model_alias = pick_model(intent)
    
    if not model_alias:
        return {
            'intent': intent,
            'model': None,
            'output': '',
            'tokens': None,
            'usage': {},
            'error': 'No suitable model found'
        }
    
    try:
        # Call Foundry Local via workshop_utils
        text, usage = chat_once(
            model_alias,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=temperature
        )
        
        # Extract token information
        usage_info = {}
        if usage:
            usage_info['prompt_tokens'] = getattr(usage, 'prompt_tokens', None)
            usage_info['completion_tokens'] = getattr(usage, 'completion_tokens', None)
            usage_info['total_tokens'] = getattr(usage, 'total_tokens', None)
        
        # Estimate if not provided
        if not usage_info.get('total_tokens'):
            est_prompt = len(prompt) // 4
            est_completion = len(text or '') // 4
            usage_info['estimated_tokens'] = est_prompt + est_completion
        
        return {
            'intent': intent,
            'model': model_alias,
            'output': (text or '').strip(),
            'tokens': usage_info.get('total_tokens') or usage_info.get('estimated_tokens'),
            'usage': usage_info,
            'error': None
        }
    
    except Exception as e:
        return {
            'intent': intent,
            'model': model_alias,
            'output': '',
            'tokens': None,
            'usage': {},
            'error': f'{type(e).__name__}: {str(e)}'
        }

print('✅ Routing function ready')
print('   Using Foundry Local SDK via workshop_utils')
print('   Token tracking: Enabled')
print('   Retry logic: Automatic with exponential backoff')

✅ Routing function ready
   Using Foundry Local SDK via workshop_utils
   Token tracking: Enabled
   Retry logic: Automatic with exponential backoff
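
The notebook relies on `workshop_utils` for retries with exponential backoff, and its internals are not shown here. The following is a generic sketch of the technique, not that module's implementation. `call_with_backoff` and its parameters are hypothetical names.

```python
import time

def call_with_backoff(fn, retries=3, base_delay=1.0, factor=2.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * factor**attempt, then retry.

    Makes at most retries + 1 calls and re-raises the last exception.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the original error
            sleep(base_delay * factor ** attempt)

# Demo: a function that fails twice before succeeding; delays are recorded, not slept.
attempts = {'n': 0}

def flaky():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise ConnectionError('not ready')
    return 'ok'

delays = []
print(call_with_backoff(flaky, sleep=delays.append), delays)
```

Doubling the delay gives a loading model progressively more time to come up without hammering the local service.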


## 🎯 Step 7: Run Routing Tests

Test the complete routing system with a variety of prompts to demonstrate:
- Automatic intent detection
- Intelligent model selection
- Multi-model routing with retained models
- Token tracking and performance metrics


In [None]:
# Test prompts covering all intent categories
test_cases = [
    {
        'prompt': 'Refactor this Python function to make it more efficient and readable',
        'expected_intent': 'code'
    },
    {
        'prompt': 'Summarize the key benefits of using small language models at the edge',
        'expected_intent': 'summarize'
    },
    {
        'prompt': 'Classify this user feedback: The app is slow but the UI looks great',
        'expected_intent': 'classification'
    },
    {
        'prompt': 'Explain the difference between local and cloud inference',
        'expected_intent': 'general'
    },
    {
        'prompt': 'Write a Python function to calculate the Fibonacci sequence',
        'expected_intent': 'code'
    },
    {
        'prompt': 'Give me a brief overview of the Phi model family',
        'expected_intent': 'summarize'
    }
]

print('🎯 Running Intent-Based Routing Tests')
print('=' * 80)

results = []
for i, test in enumerate(test_cases, 1):
    print(f'\n[{i}/{len(test_cases)}] Testing prompt...')
    print(f'Prompt: {test["prompt"]}')
    
    result = route(test['prompt'], max_tokens=150)
    results.append(result)
    
    print(f'   Expected Intent: {test["expected_intent"]}')
    print(f'   Detected Intent: {result["intent"]} {"✅" if result["intent"] == test["expected_intent"] else "⚠️"}')
    print(f'   Selected Model:  {result["model"]}')
    
    if result['error']:
        print(f'   ❌ Error: {result["error"]}')
    else:
        output_preview = result['output'][:100] + '...' if len(result['output']) > 100 else result['output']
        print(f'   ✅ Response: {output_preview}')
        
        tokens = result.get('tokens', 0)
        if tokens:
            usage = result.get('usage', {})
            if 'estimated_tokens' in usage:
                print(f'   📊 Tokens: ~{tokens} (estimated)')
            else:
                print(f'   📊 Tokens: {tokens}')

# Summary statistics
print('\n' + '=' * 80)
print('📊 ROUTING SUMMARY')
print('=' * 80)

success_count = sum(1 for r in results if not r['error'])
total_tokens = sum(r.get('tokens', 0) or 0 for r in results if not r['error'])
intent_accuracy = sum(1 for i, r in enumerate(results) if r['intent'] == test_cases[i]['expected_intent'])

print(f'Total Prompts:        {len(results)}')
print(f'✅ Successful:         {success_count}/{len(results)}')
print(f'❌ Failed:             {len(results) - success_count}')
print(f'🎯 Intent Accuracy:    {intent_accuracy}/{len(results)} ({intent_accuracy/len(results)*100:.1f}%)')
print(f'📊 Total Tokens Used:  {total_tokens}')

# Model usage distribution
print('\n📋 Model Usage Distribution:')
model_counts = {}
for r in results:
    if r['model']:
        model_counts[r['model']] = model_counts.get(r['model'], 0) + 1

for model, count in sorted(model_counts.items(), key=lambda x: x[1], reverse=True):
    percentage = (count / len(results)) * 100
    print(f'   • {model}: {count} requests ({percentage:.1f}%)')

if success_count == len(results):
    print('\n🎉 All routing tests passed successfully!')
else:
    print(f'\n⚠️  {len(results) - success_count} test(s) failed')
    print('   Check Foundry Local service: foundry service status')
    print('   Verify models loaded: foundry model ls')

print('=' * 80)

🎯 Running Intent-Based Routing Tests

[1/6] Testing prompt...
Prompt: Refactor this Python function to make it more efficient and readable


   Expected Intent: code
   Detected Intent: code ✅
   Selected Model:  phi-3.5-mini
   ✅ Response: To refactor a Python function for efficiency and readability, I would need to see the specific funct...
   📊 Tokens: ~158 (estimated)

[2/6] Testing prompt...
Prompt: Summarize the key benefits of using small language models at the edge
   Expected Intent: summarize
   Detected Intent: summarize ✅
   Selected Model:  phi-4-mini
   ❌ Error: APIConnectionError: Connection error.

[3/6] Testing prompt...
Prompt: Classify this user feedback: The app is slow but the UI looks great
   Expected Intent: classification
   Detected Intent: classification ✅
   Selected Model:  qwen2.5-0.5b
   ❌ Error: APIConnectionError: Connection error.

[4/6] Testing prompt...
Prompt: Explain the difference between local and cloud inference
   Expected Intent: general
   Detected Intent: general ✅
   Selected Model:  phi-4-mini
   ❌ Error: APIConnectionError: Connection error.

[5/6] Testing prompt...
Prompt: Wr
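
The recorded run above shows `APIConnectionError` for some models while others respond. One way to keep the router useful in that situation is to fall back to the next model that supports the same intent. The sketch below assumes a hypothetical `send(alias, prompt)` callable standing in for `chat_once`, and `ranked_models` and `route_with_fallback` are illustrative names.

```python
# Catalog entries copied from step 4 (the three models loaded in the recorded run).
CATALOG = {
    'phi-4-mini': {'capabilities': ['general', 'summarize', 'reasoning'], 'priority': 3},
    'qwen2.5-0.5b': {'capabilities': ['classification', 'fast', 'general'], 'priority': 1},
    'phi-3.5-mini': {'capabilities': ['code', 'refactor', 'technical'], 'priority': 2},
}

def ranked_models(intent, catalog):
    """Aliases that list the intent as a capability, highest priority first."""
    matches = [(alias, info['priority']) for alias, info in catalog.items()
               if intent in info['capabilities']]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return [alias for alias, _ in matches]

def route_with_fallback(prompt, intent, send, catalog):
    """Try each candidate in priority order and stop at the first success."""
    errors = []
    for alias in ranked_models(intent, catalog):
        try:
            text = send(alias, prompt)
        except Exception as exc:
            errors.append(f'{alias}: {type(exc).__name__}')
            continue
        return {'model': alias, 'output': text, 'errors': errors}
    return {'model': None, 'output': '', 'errors': errors}

# Demo: simulate an outage of the highest-priority model.
def flaky_send(alias, prompt):
    if alias == 'phi-4-mini':
        raise ConnectionError('simulated outage')
    return f'[{alias}] reply'

print(route_with_fallback('Explain edge computing', 'general', flaky_send, CATALOG))
```

The `errors` list records which models were skipped, which helps when diagnosing why a lower-priority model answered.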

## 🔧 Step 8: Interactive Testing

Try your own prompts to see the routing system in action!


In [None]:
# Interactive testing - modify the prompt and run this cell
custom_prompt = "Explain how model quantization reduces memory usage"

print('🎯 Interactive Routing Test')
print('=' * 80)
print(f'Your prompt: {custom_prompt}')
print()

result = route(custom_prompt, max_tokens=200)

print(f'Detected Intent: {result["intent"]}')
print(f'Selected Model:  {result["model"]}')
print()

if result['error']:
    print(f'❌ Error: {result["error"]}')
else:
    print('✅ Response:')
    print('-' * 80)
    print(result['output'])
    print('-' * 80)
    
    if result['tokens']:
        print(f'\n📊 Tokens used: {result["tokens"]}')

print('\n💡 Try different prompts to test routing behavior!')

🎯 Interactive Routing Test
Your prompt: Explain how model quantization reduces memory usage

Detected Intent: general
Selected Model:  phi-4-mini

✅ Response:
--------------------------------------------------------------------------------
Model quantization is a technique used to reduce the memory footprint of a machine learning model, particularly deep learning models. It works by converting the high-precision weights of a neural network, typically represented as 32-bit floating-point numbers, into lower-precision representations, such as 8-bit integers or even binary values.


The primary reason for quantization is to decrease the amount of memory required to store the model's parameters. Since floating-point numbers take up more space than integers, by quantizing the weights, we can significantly reduce the model's size. This reduction in size not only saves memory but also can lead to faster computation during inference, as integer operations are generally faster than floating-poi

## 📊 Step 9: Performance Analysis

Analyze the routing system's performance and model usage.


In [None]:
import time

# Performance benchmark
benchmark_prompts = [
    'Write a hello world function',
    'Summarize: AI at the edge is powerful',
    'Classify: Good product',
    'Explain edge computing'
]

print('⚡ Performance Benchmark')
print('=' * 80)

timings = []
for prompt in benchmark_prompts:
    start = time.perf_counter()  # monotonic, high-resolution timer for intervals
    result = route(prompt, max_tokens=50)
    duration = time.perf_counter() - start
    timings.append(duration)
    
    print(f'\nPrompt: {prompt[:40]}...')
    print(f'   Model: {result["model"]}')
    print(f'   Time: {duration:.2f}s')
    if result.get('tokens'):
        print(f'   Tokens: {result["tokens"]}')

print('\n' + '=' * 80)
print('📊 Performance Statistics:')
print(f'   Average response time: {sum(timings)/len(timings):.2f}s')
print(f'   Fastest response:      {min(timings):.2f}s')
print(f'   Slowest response:      {max(timings):.2f}s')
print('\n💡 Note: First request may be slower due to model initialization')
print('=' * 80)

⚡ Performance Benchmark

Prompt: Write a hello world function...
   Model: phi-3.5-mini
   Time: 3.31s
   Tokens: 60

Prompt: Summarize: AI at the edge is powerful...
   Model: phi-4-mini
   Time: 49.67s
   Tokens: 84

Prompt: Classify: Good product...
   Model: qwen2.5-0.5b
   Time: 7.21s
   Tokens: 69

Prompt: Explain edge computing...
   Model: phi-4-mini
   Time: 49.67s
   Tokens: 72

📊 Performance Statistics:
   Average response time: 27.46s
   Fastest response:      3.31s
   Slowest response:      49.67s

💡 Note: First request may be slower due to model initialization
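
Because the first request to a model can include initialization time, averages over a handful of prompts are easily skewed. A sketch of a benchmark that runs untimed warm-up rounds and reports the median alongside the mean is shown here. `benchmark` is a hypothetical helper, and `fn` stands in for a call such as `route(prompt, max_tokens=50)`.

```python
import statistics
import time

def benchmark(fn, prompts, warmup_rounds=1, clock=time.perf_counter):
    """Time fn(prompt) per prompt after warmup_rounds untimed passes over all prompts."""
    for _ in range(warmup_rounds):
        for prompt in prompts:
            fn(prompt)  # untimed: absorbs one-off initialization cost per model
    timings = []
    for prompt in prompts:
        start = clock()
        fn(prompt)
        timings.append(clock() - start)
    return {
        'mean': statistics.mean(timings),
        'median': statistics.median(timings),
        'max': max(timings),
    }

# Demo with a trivial stand-in for the routing call.
print(benchmark(lambda p: p.upper(), ['a', 'bb']))
```

Warming up with every prompt, rather than only the first, touches each model the router may select before timing starts.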

## 🎓 Key Takeaways and Next Steps

### ✅ What You Learned

1. **Intent-based routing**: Automatically classify prompts and route them to specialized models
2. **Memory-aware selection**: Choose CPU models according to system RAM
3. **Multi-model retention**: Keep several models loaded with `--retain true`
4. **Production patterns**: Retry logic, error handling, and token tracking
5. **CPU optimization**: Deploy efficiently without a GPU

### 🚀 Ideas to Try

1. **Add custom intents**:  
   ```python
   INTENT_RULES.append(
       (re.compile(r'translate|convert', re.I), 'translation')
   )
   ```
  
2. **Load more models**:  
   ```bash
   foundry model run llama-3.2-1b-cpu --retain true
   ```
  
3. **Tune model selection**:  
   - Adjust the priority values in CATALOG  
   - Add more capability tags  
   - Implement fallback strategies  

4. **Monitor performance**:  
   ```python
   import psutil
   print(f"Memory: {psutil.virtual_memory().percent}%")
   ```
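
A note on the custom-intent idea in the list above: a new rule only changes routing if some catalog model lists the new capability, since `pick_model` otherwise falls back to the first catalog entry. Rule position also matters because the first match wins. The sketch below registers both pieces together. `add_intent` and `detect` are hypothetical helpers, and tagging `phi-4-mini` with `translation` is an assumption made purely for illustration.

```python
import re

def detect(prompt, rules):
    """Return the intent of the first matching rule, else 'general'."""
    for pattern, intent in rules:
        if pattern.search(prompt):
            return intent
    return 'general'

def add_intent(rules, catalog, pattern, intent, alias, first=True):
    """Register a rule and tag an existing catalog model with the new capability."""
    rule = (re.compile(pattern, re.I), intent)
    if first:
        rules.insert(0, rule)  # checked before the broader built-in rules
    else:
        rules.append(rule)
    if intent not in catalog[alias]['capabilities']:
        catalog[alias]['capabilities'].append(intent)

rules = [
    (re.compile(r'code|refactor|function|debug|program', re.I), 'code'),
    (re.compile(r'explain|teach|describe', re.I), 'general'),
]
catalog = {
    'phi-4-mini': {'capabilities': ['general', 'summarize', 'reasoning'], 'priority': 3},
    'phi-3.5-mini': {'capabilities': ['code', 'refactor', 'technical'], 'priority': 2},
}

add_intent(rules, catalog, r'translate|convert', 'translation', 'phi-4-mini')
print(detect('Translate this function name into French', rules))
```

With `first=False` the same prompt would be classified as `code`, because the word "function" matches the earlier rule.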
  

### 📚 Additional Resources

- **Foundry Local SDK**: https://github.com/microsoft/Foundry-Local  
- **Workshop samples**: ../samples/  
- **Edge AI course**: ../../Module08/  

### 💡 Best Practices

✅ Use CPU models for consistent cross-platform behavior  
✅ Always check system memory before loading multiple models  
✅ Use `--retain true` for routing scenarios  
✅ Implement proper error handling and retries  
✅ Track token usage for cost/performance optimization  

---

**🎉 Congratulations!** You have built an intent-based model router with CPU-optimized models using the Foundry Local SDK, including the retry, error-handling, and token-tracking patterns a production deployment needs.



---

**Disclaimer**:  
This document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
