# Session 4 – SLM vs LLM Comparison

Compare the latency and response quality of a small language model and a larger model, both running through Foundry Local.


## ⚡ Quick Start

**Memory-optimized configuration (updated):**
1. Model variants are selected automatically (CPU variants work on any hardware)
2. Uses `qwen2.5-3b` instead of 7B (saves ~4 GB of RAM)
3. Automatic port detection (no manual configuration)
4. Total RAM needed: ~8 GB recommended (models + operating system)

**Terminal setup (30 seconds):**
```bash
foundry service start
foundry model run phi-4-mini
foundry model run qwen2.5-3b
```

Then run this notebook! 🚀


### Explanation: Installing dependencies
Installs the minimal packages (`foundry-local-sdk`, `openai`, `numpy`, `requests`) needed for timing and chat requests. Safe to re-run: packages that are already installed are skipped.


# Scenario
Compare a small language model (SLM) with a larger model on a single prompt to illustrate the trade-offs:
- **Latency difference** (wall-clock seconds)
- **Token usage** (if available) as a throughput indicator
- **Qualitative sample output** for a quick sanity check
- **Speedup calculation** to quantify the performance gain
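
As a minimal sketch of the speedup calculation named in the last bullet, assuming both latencies are wall-clock seconds for the same prompt: the helper `speedup` and the sample numbers below are illustrative and not part of the workshop code.

```python
def speedup(slm_seconds: float, llm_seconds: float) -> float:
    """Return how many times faster the SLM was than the LLM.

    A value above 1.0 means the SLM answered faster; a value below 1.0
    means the larger model was actually quicker for this prompt.
    """
    if slm_seconds <= 0 or llm_seconds <= 0:
        raise ValueError("latencies must be positive")
    return llm_seconds / slm_seconds


# Hypothetical latencies, chosen only to illustrate the arithmetic.
print(f"SLM is {speedup(2.0, 5.0):.2f}x faster")
```

A ratio is easier to compare across machines than raw seconds, but a single prompt is a small sample, so treat the number as indicative only.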

**Environment variables:**
- `SLM_ALIAS` - Small language model (default: phi-4-mini, ~4 GB of RAM)
- `LLM_ALIAS` - Larger language model (default: qwen2.5-3b, ~3 GB of RAM)
- `COMPARE_PROMPT` - Test prompt for the comparison
- `COMPARE_RETRIES` - Retry attempts for resilience (default: 2)
- `FOUNDRY_LOCAL_ENDPOINT` - Override for the service endpoint (auto-detected if not set)
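
One possible way to read these variables in a single place is sketched below. The `CompareConfig` type and `load_config` helper are hypothetical names introduced for illustration; the defaults mirror the notebook's configuration cell.

```python
import os
from typing import Mapping, NamedTuple, Optional


class CompareConfig(NamedTuple):
    slm_alias: str
    llm_alias: str
    prompt: str
    retries: int
    endpoint: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> CompareConfig:
    """Read the comparison settings, falling back to the notebook defaults."""
    return CompareConfig(
        slm_alias=env.get("SLM_ALIAS", "phi-4-mini"),
        llm_alias=env.get("LLM_ALIAS", "qwen2.5-3b"),
        prompt=env.get("COMPARE_PROMPT", "List 5 benefits of local AI inference."),
        retries=int(env.get("COMPARE_RETRIES", "2")),
        # An empty or missing value means "let the SDK auto-detect".
        endpoint=env.get("FOUNDRY_LOCAL_ENDPOINT") or None,
    )


# Passing a plain dict makes defaults and overrides easy to inspect.
print(load_config({}))
print(load_config({"LLM_ALIAS": "qwen2.5-7b", "COMPARE_RETRIES": "3"}))
```

Accepting any mapping instead of reading `os.environ` directly keeps the helper easy to exercise without touching the real environment.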

**How it works (official SDK pattern):**
1. **FoundryLocalManager** initializes and manages the Foundry Local service
2. The service starts automatically if it is not already running (no manual setup required)
3. Aliases are resolved automatically to concrete model IDs
4. A hardware-optimized variant is selected (CUDA, NPU, or CPU)
5. The OpenAI-compatible client performs chat completions
6. Metrics are captured: latency, tokens, output quality
7. Results are compared to compute the speedup ratio
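
Steps 3 and 4 can be pictured with a toy resolver. This is only a conceptual sketch: the `CATALOG` dictionary, the `resolve` function, and the CUDA, NPU, CPU preference order are assumptions made for illustration and do not describe how the Foundry Local SDK is implemented. The model IDs are the ones that appear elsewhere in this notebook.

```python
from typing import Dict, List, Sequence

# Toy catalog: alias -> concrete model IDs (hypothetical structure).
CATALOG: Dict[str, List[str]] = {
    "phi-4-mini": ["Phi-4-mini-instruct-cuda-gpu:4", "phi-4-mini-instruct-cpu"],
    "qwen2.5-7b": ["qwen2.5-7b-instruct-cuda-gpu:3"],
}


def resolve(alias: str, available: Sequence[str] = ("cuda", "npu", "cpu")) -> str:
    """Pick the first variant that matches the hardware list, in preference order."""
    variants = CATALOG.get(alias)
    if not variants:
        raise KeyError(f"alias '{alias}' not found in catalog")
    for hardware in available:
        for model_id in variants:
            if f"-{hardware}" in model_id.lower():
                return model_id
    raise LookupError(f"no variant of '{alias}' for hardware {list(available)}")


print(resolve("phi-4-mini"))                      # machine with a CUDA GPU
print(resolve("phi-4-mini", available=("cpu",)))  # CPU-only machine
```

The point of the sketch is that the alias you type stays stable while the concrete ID depends on the machine, which is why the notebook always passes the resolved `model_id` to the chat API.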

This micro-comparison helps you decide when routing to a larger model is justified for your use case.

**SDK reference:**
- Python SDK: https://github.com/microsoft/Foundry-Local/tree/main/sdk/python/foundry_local
- Workshop utilities: uses the official pattern from ../samples/workshop_utils.py

**Key benefits:**
- ✅ Automatic service discovery and initialization
- ✅ Automatic service startup if it is not running
- ✅ Built-in model resolution and caching
- ✅ Hardware optimization (CUDA/NPU/CPU)
- ✅ Compatibility with the OpenAI SDK
- ✅ Robust error handling with retries
- ✅ Local inference (no cloud API required)


## 🚨 Prerequisite: Foundry Local must be running!

**Before running this notebook**, make sure the Foundry Local service is set up:

### Quick start commands (run in a terminal):

```bash
# 1. Start the Foundry Local service
foundry service start

# 2. Load the default models used in this comparison (CPU-optimized)
foundry model run phi-4-mini
foundry model run qwen2.5-3b

# 3. Verify models are loaded
foundry model ls

# 4. Check service health
foundry service status
```

### Alternative models (if the defaults are not available):

```bash
# Even smaller alternatives (if memory is very limited)
foundry model run phi-3.5-mini
foundry model run qwen2.5-0.5b

# Or update the environment variables in this notebook:
# SLM_ALIAS = 'phi-3.5-mini'
# LLM_ALIAS = 'qwen2.5-1.5b'  # Or qwen2.5-0.5b for minimum memory
```

⚠️ **If you skip these steps**, you will see an `APIConnectionError` when running the notebook cells below.


In [29]:
# Install dependencies
!pip install -q foundry-local-sdk openai numpy requests

### Explanation: Core imports
Brings in the timing utilities and the Foundry Local / OpenAI clients used to fetch model information and perform chat completions.


In [30]:
import os, time, json
from foundry_local import FoundryLocalManager
from openai import OpenAI
import sys
sys.path.append('../samples')
from workshop_utils import get_client, chat_once

### Explanation: Aliases and prompt configuration
Defines environment-configurable aliases for a small model and a larger model, along with a comparison prompt. Adjust the environment variables to experiment with different model families or tasks.


In [31]:
# Default to CPU models for better memory efficiency
SLM = os.getenv('SLM_ALIAS', 'phi-4-mini')  # Auto-selects CPU variant
LLM = os.getenv('LLM_ALIAS', 'qwen2.5-3b')  # Smaller LLM, more memory-friendly
PROMPT = os.getenv('COMPARE_PROMPT', 'List 5 benefits of local AI inference.')
# Endpoint is now managed by FoundryLocalManager - it auto-detects or can be overridden
ENDPOINT = os.getenv('FOUNDRY_LOCAL_ENDPOINT', None)

### 💡 Memory-Optimized Configuration

**This notebook defaults to memory-efficient models:**
- `phi-4-mini` → ~4 GB of RAM (Foundry Local picks a suitable variant automatically, such as the CPU variant)
- `qwen2.5-3b` → ~3 GB of RAM (instead of 7B, which needs ~7 GB or more)

**Automatic port detection:**
- Foundry Local may use different ports (often 55769 or 59959)
- The diagnostic cell below detects the correct port automatically
- No manual configuration needed!
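
For a dependency-free illustration of port detection, the sketch below simply tries a TCP connection to each candidate port. It is a simplified stand-in, not the notebook's diagnostic (which issues HTTP requests), and the helper names `is_port_open` and `first_open_port` are hypothetical. An open port only shows that something is listening, not that it is Foundry Local.

```python
import socket
from typing import Iterable, Optional


def is_port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_open_port(ports: Iterable[int], host: str = "127.0.0.1") -> Optional[int]:
    """Return the first port that accepts connections, or None if none does."""
    for port in ports:
        if is_port_open(host, port):
            return port
    return None


# 55769 and 59959 are the ports mentioned in this notebook; others are possible.
port = first_open_port([59959, 55769])
print(f"Candidate port: {port}" if port else "No candidate port is open")
```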

**If you have limited RAM (<8 GB), use even smaller models:**
```python
SLM = 'phi-3.5-mini'      # ~2GB
LLM = 'qwen2.5-0.5b'      # ~500MB
```
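
The choice between the two model pairs can be expressed as a small rule of thumb. This is a sketch under the assumption that the approximate RAM figures quoted in this notebook hold on your machine; `pick_models` and `APPROX_RAM_GB` are hypothetical names, and real memory use varies with the variant and context length.

```python
from typing import Tuple

# Approximate RAM needs in GB, taken from the figures quoted in this notebook.
APPROX_RAM_GB = {
    "phi-4-mini": 4.0,
    "qwen2.5-3b": 3.0,
    "phi-3.5-mini": 2.0,
    "qwen2.5-0.5b": 0.5,
}


def pick_models(total_ram_gb: float) -> Tuple[str, str]:
    """Return an (SLM, LLM) alias pair for the given amount of total RAM."""
    if total_ram_gb >= 8:
        return "phi-4-mini", "qwen2.5-3b"
    return "phi-3.5-mini", "qwen2.5-0.5b"


for ram in (16, 6):
    slm, llm = pick_models(ram)
    needed = APPROX_RAM_GB[slm] + APPROX_RAM_GB[llm]
    print(f"{ram} GB RAM -> SLM={slm}, LLM={llm} (~{needed:.1f} GB for models)")
```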


In [32]:
# Display current configuration
print("="*60)
print("CURRENT CONFIGURATION")
print("="*60)
print(f"SLM Model:     {SLM}")
print(f"LLM Model:     {LLM}")
print(f"SDK Pattern:   FoundryLocalManager (official)")
print(f"Endpoint:      {ENDPOINT or 'Auto-detect'}")
print(f"Test Prompt:   {PROMPT[:50]}...")
print(f"Retry Count:   {os.getenv('COMPARE_RETRIES', '2')}")
print("="*60)
print("\n💡 Using official Foundry SDK pattern from workshop_utils")
print("   → FoundryLocalManager handles service lifecycle")
print("   → Automatic model resolution and hardware optimization")
print("   → OpenAI-compatible API for inference")

CURRENT CONFIGURATION
SLM Model:     phi-4-mini
LLM Model:     qwen2.5-7b
SDK Pattern:   FoundryLocalManager (official)
Endpoint:      Auto-detect
Test Prompt:   List 5 benefits of local AI inference....
Retry Count:   2

💡 Using official Foundry SDK pattern from workshop_utils
   → FoundryLocalManager handles service lifecycle
   → Automatic model resolution and hardware optimization
   → OpenAI-compatible API for inference


### Explanation: Execution helpers (Foundry SDK pattern)
Uses the official Foundry Local SDK pattern as documented in the workshop samples:

**Approach:**
- **FoundryLocalManager** - Initializes and manages the Foundry Local service
- **Auto-detection** - Discovers the endpoint automatically and handles the service lifecycle
- **Model resolution** - Resolves aliases to full model IDs (e.g., phi-4-mini → phi-4-mini-instruct-cpu)
- **Hardware optimization** - Selects the best variant for the available hardware (CUDA, NPU, or CPU)
- **OpenAI client** - Configured with the manager's endpoint for OpenAI-compatible API access

**Resilience features:**
- Retry logic with exponential backoff (configurable via the environment)
- Automatic service startup if it is not running
- Connection verification after initialization
- Error handling with detailed reporting
- Model caching to avoid repeated initialization
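
The retry behaviour in the first bullet can be isolated into a generic helper. The sketch below is not the notebook's `setup()` function: `retry_with_backoff` is a hypothetical name, and the `sleep` parameter is an assumption added so the demo runs instantly. The 2-second starting delay and the doubling match what `setup()` does.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    retries: int = 3,
    initial_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` up to `retries` times, doubling the wait after each failure."""
    delay = initial_delay
    for attempt in range(1, retries + 1):
        try:
            return action()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= 2
    raise ValueError("retries must be at least 1")


# Demo: a fake action that fails twice, with waits recorded instead of slept.
calls = {"n": 0}
waits: List[float] = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service not ready")
    return "connected"


print(retry_with_backoff(flaky, sleep=waits.append), waits)
```

Doubling the delay gives a service that is still starting more time on each attempt without making the first retry slow.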

**Result structure:**
- Latency measurement (wall-clock time)
- Token usage tracking (if available)
- Sample output (truncated for readability)
- Error details for failed requests
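
The "if available" qualifier on token usage matters because not every response carries usage data. A minimal sketch of the fallback, assuming the rough rule of about 4 characters per token for English text that the notebook's `run()` helper also uses; `token_count` is a hypothetical helper name.

```python
from typing import Dict, Optional


def token_count(usage: Optional[Dict[str, int]], prompt: str, content: str) -> Dict[str, object]:
    """Return the reported total, or a rough estimate that is flagged as such."""
    if usage and usage.get("total_tokens"):
        return {"tokens": usage["total_tokens"], "estimated": False}
    # Rough heuristic for English text: about 4 characters per token.
    estimate = len(prompt) // 4 + len(content) // 4
    return {"tokens": estimate, "estimated": True}


print(token_count({"total_tokens": 120}, "hi", "hello"))
print(token_count(None, "List 5 benefits of local AI inference.", "x" * 400))
```

Keeping the `estimated` flag next to the number prevents a guessed count from being mistaken for one reported by the API.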

This pattern builds on the workshop_utils module, which follows the official SDK pattern.

**SDK reference:**
- Main repository: https://github.com/microsoft/Foundry-Local
- Python SDK: https://github.com/microsoft/Foundry-Local/tree/main/sdk/python/foundry_local
- Workshop utils: ../samples/workshop_utils.py


In [39]:
def setup(alias: str, endpoint: str = None, retries: int = 3):
    """
    Initialize a Foundry Local model connection using official SDK pattern.
    
    This follows the workshop_utils pattern which uses FoundryLocalManager
    to properly initialize the Foundry Local service and resolve models.
    
    Args:
        alias: Model alias (e.g., 'phi-4-mini', 'qwen2.5-3b')
        endpoint: Optional endpoint override (usually auto-detected)
        retries: Number of connection attempts (default: 3)
    
    Returns:
        tuple: (manager, client, model_id, metadata) or (None, None, alias, error_metadata) if failed
    """
    import time
    
    last_err = None
    current_delay = 2  # seconds
    
    for attempt in range(1, retries + 1):
        try:
            print(f"[Init] Connecting to '{alias}' (attempt {attempt}/{retries})...")
            
            # Use the workshop utility which follows the official SDK pattern
            manager, client, model_id = get_client(alias, endpoint=endpoint)
            
            print(f"[OK] Connected to '{alias}' -> {model_id}")
            print(f"     Endpoint: {manager.endpoint}")
            
            return manager, client, model_id, {
                'endpoint': manager.endpoint,
                'resolved': model_id,
                'attempts': attempt,
                'status': 'success'
            }
            
        except Exception as e:
            last_err = e
            error_msg = str(e)
            
            # Provide helpful error messages
            if "Connection error" in error_msg or "connection refused" in error_msg.lower():
                print(f"[ERROR] Cannot connect to Foundry Local service")
                print(f"        → Is the service running? Try: foundry service start")
                print(f"        → Is the model loaded? Try: foundry model run {alias}")
            elif "not found" in error_msg.lower():
                print(f"[ERROR] Model '{alias}' not found in catalog")
                print(f"        → Available models: Run 'foundry model ls' in terminal")
                print(f"        → Download model: Run 'foundry model download {alias}'")
            else:
                print(f"[ERROR] Setup failed: {type(e).__name__}: {error_msg}")
            
            if attempt < retries:
                print(f"[Retry] Waiting {current_delay:.1f}s before retry...")
                time.sleep(current_delay)
                current_delay *= 2  # Exponential backoff
    
    # All retries failed - provide actionable guidance
    print(f"\n❌ Failed to initialize '{alias}' after {retries} attempts")
    print(f"   Last error: {type(last_err).__name__}: {str(last_err)}")
    print(f"\n💡 Troubleshooting steps:")
    print(f"   1. Ensure Foundry Local service is running:")
    print(f"      → foundry service status")
    print(f"      → foundry service start (if not running)")
    print(f"   2. Ensure model is loaded:")
    print(f"      → foundry model run {alias}")
    print(f"   3. Check available models:")
    print(f"      → foundry model ls")
    print(f"   4. Try alternative models if '{alias}' isn't available")
    
    return None, None, alias, {
        'error': f"{type(last_err).__name__}: {str(last_err)}",
        'endpoint': endpoint or 'auto-detect',
        'attempts': retries,
        'status': 'failed'
    }


def run(client, model_id: str, prompt: str, max_tokens: int = 180, temperature: float = 0.5):
    """
    Run inference with the configured model using OpenAI SDK.
    
    Args:
        client: OpenAI client instance (configured for Foundry Local)
        model_id: Model identifier (resolved from alias)
        prompt: Input prompt
        max_tokens: Maximum response tokens (default: 180)
        temperature: Sampling temperature (default: 0.5)
    
    Returns:
        dict: Response with timing, tokens, and content
    """
    import time
    
    start = time.time()
    
    try:
        response = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=temperature
        )
        
        elapsed = time.time() - start
        
        # Extract response details
        content = response.choices[0].message.content
        
        # Try to extract token usage from multiple possible locations
        usage_info = {}
        if hasattr(response, 'usage') and response.usage:
            usage_info['prompt_tokens'] = getattr(response.usage, 'prompt_tokens', None)
            usage_info['completion_tokens'] = getattr(response.usage, 'completion_tokens', None)
            usage_info['total_tokens'] = getattr(response.usage, 'total_tokens', None)
        
        # Calculate approximate token count if API doesn't provide it
        # Rough estimate: ~4 characters per token for English text
        if not usage_info.get('total_tokens'):
            estimated_prompt_tokens = len(prompt) // 4
            estimated_completion_tokens = len(content) // 4
            estimated_total = estimated_prompt_tokens + estimated_completion_tokens
            usage_info['estimated_tokens'] = estimated_total
            usage_info['estimated_prompt_tokens'] = estimated_prompt_tokens
            usage_info['estimated_completion_tokens'] = estimated_completion_tokens
        
        return {
            'status': 'success',
            'content': content,
            'elapsed_sec': elapsed,
            'tokens': usage_info.get('total_tokens') or usage_info.get('estimated_tokens'),
            'usage': usage_info,
            'model': model_id
        }
        
    except Exception as e:
        elapsed = time.time() - start
        return {
            'status': 'error',
            'error': f"{type(e).__name__}: {str(e)}",
            'elapsed_sec': elapsed,
            'model': model_id
        }


print("✅ Execution helpers defined: setup(), run()")
print("   → Uses workshop_utils for proper SDK integration")
print("   → setup() initializes with FoundryLocalManager")
print("   → run() executes inference via OpenAI-compatible API")
print("   → Token counting: Uses API data or estimates if unavailable")

✅ Execution helpers defined: setup(), run()
   → Uses workshop_utils for proper SDK integration
   → setup() initializes with FoundryLocalManager
   → run() executes inference via OpenAI-compatible API
   → Token counting: Uses API data or estimates if unavailable


### Explanation: Pre-launch self-test
Runs a quick connectivity check with FoundryLocalManager for both models. This verifies:
- Service reachability
- Model initialization
- Resolution of aliases to actual model IDs
- Connection stability before running the comparison

The setup() function uses the official SDK pattern from workshop_utils.


In [34]:
# Simplified diagnostic: Just verify service is accessible
import requests

def check_foundry_service():
    """Quick diagnostic to verify Foundry Local is running."""
    # Try common ports
    endpoints_to_try = [
        "http://localhost:59959",
        "http://127.0.0.1:59959", 
        "http://localhost:55769",
        "http://127.0.0.1:55769",
    ]
    
    print("[Diagnostic] Checking Foundry Local service...")
    
    for endpoint in endpoints_to_try:
        try:
            response = requests.get(f"{endpoint}/health", timeout=2)
            if response.status_code == 200:
                print(f"✅ Service is running at {endpoint}")
                
                # Try to list models
                try:
                    models_response = requests.get(f"{endpoint}/v1/models", timeout=2)
                    if models_response.status_code == 200:
                        models_data = models_response.json()
                        model_count = len(models_data.get('data', []))
                        print(f"✅ Found {model_count} models available")
                        if model_count > 0:
                            print("   Models:", [m.get('id', 'unknown') for m in models_data.get('data', [])[:5]])
                except Exception as e:
                    print(f"⚠️  Could not list models: {e}")
                
                return endpoint
        except requests.exceptions.ConnectionError:
            continue
        except Exception as e:
            print(f"⚠️  Error checking {endpoint}: {e}")
    
    print("\n❌ Foundry Local service not found!")
    print("\n💡 To fix this:")
    print("   1. Open a terminal")
    print("   2. Run: foundry service start")
    print("   3. Run: foundry model run phi-4-mini")
    print("   4. Run: foundry model run qwen2.5-3b")
    print("   5. Re-run this notebook")
    return None

# Run diagnostic
discovered_endpoint = check_foundry_service()

if discovered_endpoint:
    print(f"\n✅ Service detected (will be managed by FoundryLocalManager)")
else:
    print(f"\n⚠️  No service detected - FoundryLocalManager will attempt to start it")

[Diagnostic] Checking Foundry Local service...

❌ Foundry Local service not found!

💡 To fix this:
   1. Open a terminal
   2. Run: foundry service start
   3. Run: foundry model run phi-4-mini
   4. Run: foundry model run qwen2.5-3b
   5. Re-run this notebook

⚠️  No service detected - FoundryLocalManager will attempt to start it


In [35]:
# Quick Fix: Start service and load models from notebook
# Uncomment the commands you need:

# !foundry service start
# !foundry model run phi-4-mini
# !foundry model run qwen2.5-3b
# !foundry model ls

print("⚠️  The commands above are commented out.")
print("Uncomment them if you want to start the service from the notebook.")
print("")
print("💡 Recommended: Run these commands in a separate terminal instead:")
print("   foundry service start")
print("   foundry model run phi-4-mini")
print("   foundry model run qwen2.5-3b")

⚠️  The commands above are commented out.
Uncomment them if you want to start the service from the notebook.

💡 Recommended: Run these commands in a separate terminal instead:
   foundry service start
   foundry model run phi-4-mini
   foundry model run qwen2.5-3b


### 🛠️ Quick fix: Start Foundry Local from the notebook (optional)

If the diagnostic above reports that the service is not running, you can try starting it by uncommenting the commands in the quick-fix cell above.

**Note:** This works best on Windows. On other platforms, run the commands in a terminal.


### ⚠️ Resolving connection errors

If you see an `APIConnectionError`, the Foundry Local service may not be running or the models may not be loaded. Try these steps:

**1. Check the service status:**
```bash
# In a terminal (not in notebook):
foundry service status
```

**2. Start the service (if it is not running):**
```bash
foundry service start
```

**3. Load the required models:**
```bash
# Load the models needed for comparison
foundry model run phi-4-mini
foundry model run qwen2.5-3b

# Or use smaller alternative models:
foundry model run phi-3.5-mini
foundry model run qwen2.5-0.5b
```

**4. Verify that the models are available:**
```bash
foundry model ls
```

**Common problems:**
- ❌ Service not started → Run `foundry service start`
- ❌ Models not loaded → Run `foundry model run <model-name>`
- ❌ Port conflicts → Check whether another service is using the port
- ❌ Firewall blocking → Make sure local connections are allowed
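
The first three problems can be told apart by looking at the error text, in the same spirit as the string checks inside `setup()`. The `diagnose` helper below is a hypothetical sketch; matching on message fragments is a heuristic, and the exact wording of errors can differ between library versions and operating systems.

```python
def diagnose(error_message: str) -> str:
    """Map an error message to a suggested next step (heuristic string matching)."""
    msg = error_message.lower()
    if "connection error" in msg or "connection refused" in msg:
        return "Service unreachable: run `foundry service start`"
    if "not found" in msg:
        return "Model missing: run `foundry model run <model-name>`"
    if "address already in use" in msg:
        return "Port conflict: check whether another service uses the port"
    return "Unrecognized error: check `foundry service status`"


for sample in ("APIConnectionError: Connection error.", "Model 'foo' not found"):
    print(f"{sample!r} -> {diagnose(sample)}")
```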

**Quick fix:** Run the diagnostic cell above before the pre-flight check.


In [36]:
preflight = {}
retries = int(os.getenv('COMPARE_RETRIES', '2'))  # Number of retry attempts (COMPARE_RETRIES, default 2)

for a in (SLM, LLM):
    mgr, c, mid, info = setup(a, endpoint=ENDPOINT, retries=retries)
    # Keep the original status from info (either 'success' or 'failed')
    preflight[a] = info

print('\n[Pre-flight Check]')
for alias, details in preflight.items():
    status_icon = '✅' if details['status'] == 'success' else '❌'
    print(f"  {status_icon} {alias}: {details['status']} - {details.get('resolved', details.get('error', 'unknown'))}")

preflight

[Init] Connecting to 'phi-4-mini' (attempt 1/2)...
[OK] Connected to 'phi-4-mini' -> Phi-4-mini-instruct-cuda-gpu:4
     Endpoint: http://127.0.0.1:59959/v1
[Init] Connecting to 'qwen2.5-7b' (attempt 1/2)...
[OK] Connected to 'qwen2.5-7b' -> qwen2.5-7b-instruct-cuda-gpu:3
     Endpoint: http://127.0.0.1:59959/v1

[Pre-flight Check]
  ✅ phi-4-mini: success - Phi-4-mini-instruct-cuda-gpu:4
  ✅ qwen2.5-7b: success - qwen2.5-7b-instruct-cuda-gpu:3


{'phi-4-mini': {'endpoint': 'http://127.0.0.1:59959/v1',
  'resolved': 'Phi-4-mini-instruct-cuda-gpu:4',
  'attempts': 1,
  'status': 'success'},
 'qwen2.5-7b': {'endpoint': 'http://127.0.0.1:59959/v1',
  'resolved': 'qwen2.5-7b-instruct-cuda-gpu:3',
  'attempts': 1,
  'status': 'success'}}

### ✅ Pre-flight check: Model availability

The cell above verifies that both models are reachable at the configured endpoint before the comparison runs.


### Explanation: Running the comparison and collecting results
Iterates over both aliases using the official Foundry SDK pattern:
1. Initialize each model with setup() (uses FoundryLocalManager)
2. Run inference through the OpenAI-compatible API
3. Capture latency, tokens, and a sample output
4. Produce a JSON summary with a comparative analysis

This follows the same pattern as the workshop sample in session04/model_compare.py.
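
Step 4 can be sketched independently of any running service. The `summarize` helper and the sample data below are hypothetical: the input dictionaries use the same keys as the notebook's result rows (`alias`, `status`, `elapsed_sec`, `tokens`, `content`), and the numbers are made up for illustration.

```python
import json
from typing import Any, Dict, List


def summarize(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Condense per-model results into a small JSON-friendly summary."""
    rows = [
        {
            "alias": r["alias"],
            "status": r["status"],
            "latency_sec": round(r.get("elapsed_sec", 0.0), 3),
            "tokens": r.get("tokens"),
            "sample": (r.get("content") or "")[:60],  # truncated for readability
        }
        for r in results
    ]
    succeeded = [row for row in rows if row["status"] == "success"]
    fastest = min(succeeded, key=lambda row: row["latency_sec"])["alias"] if succeeded else None
    return {"models": rows, "fastest": fastest}


# Made-up results, for illustration only.
demo = [
    {"alias": "phi-4-mini", "status": "success", "elapsed_sec": 2.0, "tokens": 150,
     "content": "1. Reduced latency ..."},
    {"alias": "qwen2.5-3b", "status": "success", "elapsed_sec": 5.0, "tokens": 160,
     "content": "1. Privacy ..."},
]
print(json.dumps(summarize(demo), indent=2))
```

Truncating the sample keeps the summary readable while the full response stays available in the raw results.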


In [40]:
results = []
retries = int(os.getenv('COMPARE_RETRIES', '2'))  # Number of retry attempts (COMPARE_RETRIES, default 2)

for alias in (SLM, LLM):
    mgr, client, mid, info = setup(alias, endpoint=ENDPOINT, retries=retries)
    if client:
        r = run(client, mid, PROMPT)
        results.append({'alias': alias, **r})
    else:
        # If setup failed, record error
        results.append({
            'alias': alias,
            'status': 'error',
            'error': info.get('error', 'Setup failed'),
            'elapsed_sec': 0,
            'tokens': None,
            'model': alias
        })

# Display results
print(json.dumps(results, indent=2))

# Quick comparative view
print('\n' + '='*80)
print('COMPARISON SUMMARY')
print('='*80)
print(f"{'Alias':<20} {'Status':<15} {'Latency(s)':<15} {'Tokens':<15}")
print('-'*80)

for row in results:
    status = row.get('status', 'unknown')
    status_icon = '✅' if status == 'success' else '❌'
    latency_str = f"{row.get('elapsed_sec', 0):.3f}" if row.get('elapsed_sec') else 'N/A'
    
    # Handle token display - show if available or indicate estimated
    tokens = row.get('tokens')
    usage = row.get('usage', {})
    if tokens:
        if 'estimated_tokens' in usage:
            tokens_str = f"~{tokens} (est.)"
        else:
            tokens_str = str(tokens)
    else:
        tokens_str = 'N/A'
    
    print(f"{status_icon} {row['alias']:<18} {status:<15} {latency_str:<15} {tokens_str:<15}")

print('-'*80)

# Show detailed token breakdown if available
print("\nDetailed Token Usage:")
for row in results:
    if row.get('status') == 'success' and row.get('usage'):
        usage = row['usage']
        print(f"\n  {row['alias']}:")
        if 'prompt_tokens' in usage and usage['prompt_tokens']:
            print(f"    Prompt tokens:     {usage['prompt_tokens']}")
            print(f"    Completion tokens: {usage['completion_tokens']}")
            print(f"    Total tokens:      {usage['total_tokens']}")
        elif 'estimated_tokens' in usage:
            print(f"    Estimated prompt:     {usage['estimated_prompt_tokens']}")
            print(f"    Estimated completion: {usage['estimated_completion_tokens']}")
            print(f"    Estimated total:      {usage['estimated_tokens']}")
            print(f"    (API did not provide token counts - using ~4 chars/token estimate)")

print('\n' + '='*80)

# Calculate speedup if both succeeded
if len(results) == 2 and all(r.get('status') == 'success' and r.get('elapsed_sec') for r in results):
    speedup = results[1]['elapsed_sec'] / results[0]['elapsed_sec']
    print(f"\n💡 SLM is {speedup:.2f}x faster than LLM for this prompt")
    
    # Compare token throughput if available
    slm_tokens = results[0].get('tokens', 0)
    llm_tokens = results[1].get('tokens', 0)
    if slm_tokens and llm_tokens:
        slm_tps = slm_tokens / results[0]['elapsed_sec']
        llm_tps = llm_tokens / results[1]['elapsed_sec']
        print(f"   SLM throughput: {slm_tps:.1f} tokens/sec")
        print(f"   LLM throughput: {llm_tps:.1f} tokens/sec")
        
elif any(r.get('status') == 'error' for r in results):
    print(f"\n⚠️  Some models failed - check errors above")
    print("   Ensure Foundry Local is running: foundry service start")
    print("   Ensure models are loaded: foundry model run <model-name>")

results

[Init] Connecting to 'phi-4-mini' (attempt 1/2)...
[OK] Connected to 'phi-4-mini' -> Phi-4-mini-instruct-cuda-gpu:4
     Endpoint: http://127.0.0.1:59959/v1
[Init] Connecting to 'qwen2.5-7b' (attempt 1/2)...
[OK] Connected to 'qwen2.5-7b' -> qwen2.5-7b-instruct-cuda-gpu:3
     Endpoint: http://127.0.0.1:59959/v1
[Init] Connecting to 'qwen2.5-7b' (attempt 1/2)...
[OK] Connected to 'qwen2.5-7b' -> qwen2.5-7b-instruct-cuda-gpu:3
     Endpoint: http://127.0.0.1:59959/v1
[
  {
    "alias": "phi-4-mini",
    "status": "success",
    "content": "1. Reduced Latency: Local AI inference can significantly reduce latency by processing data closer to the source, which is particularly beneficial for real-time applications such as autonomous vehicles or augmented reality.\n\n2. Enhanced Privacy: By keeping data processing local, sensitive information is less likely to be exposed to external networks, thereby enhancing privacy and security.\n\n3. Lower Bandwidth Usage: Local AI inference reduces the n

[{'alias': 'phi-4-mini',
  'status': 'success',
  'content': '1. Reduced Latency: Local AI inference can significantly reduce latency by processing data closer to the source, which is particularly beneficial for real-time applications such as autonomous vehicles or augmented reality.\n\n2. Enhanced Privacy: By keeping data processing local, sensitive information is less likely to be exposed to external networks, thereby enhancing privacy and security.\n\n3. Lower Bandwidth Usage: Local AI inference reduces the need for data transmission over the network, which can save bandwidth and reduce the risk of network congestion.\n\n4. Improved Reliability: Local processing can be more reliable, as it is less dependent on network connectivity. This is particularly important in scenarios where network connectivity is unreliable or intermittent.\n\n5. Scalability: Local AI inference can be easily scaled by adding more local processing units, making it easier to handle increasing data volumes or m

### Interpreting the results

**Key metrics:**
- **Latency**: Lower is better; it means the model responded faster
- **Tokens**: Total tokens processed (prompt plus completion); dividing by latency gives throughput in tokens/sec, where higher is better
- **Route**: Confirms which API endpoint was used
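
To make the throughput figure concrete, here is a small worked example. The helper `tokens_per_second` and the numbers are illustrative assumptions. Because the notebook's token figure counts prompt plus completion tokens, the result is a rough indicator rather than a pure generation speed.

```python
def tokens_per_second(tokens: int, elapsed_sec: float) -> float:
    """Return throughput as tokens processed per wall-clock second."""
    if elapsed_sec <= 0:
        raise ValueError("elapsed_sec must be positive")
    return tokens / elapsed_sec


# Hypothetical measurements: same token count, different latencies.
for name, tokens, seconds in (("SLM", 180, 2.0), ("LLM", 180, 6.0)):
    print(f"{name}: {tokens_per_second(tokens, seconds):.1f} tokens/sec")
```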

**When to use an SLM vs an LLM:**
- **SLM (Small Language Model)**: Fast responses, low resource usage, well suited to simple tasks
- **LLM (Large Language Model)**: Higher quality and better reasoning; prefer it when quality is critical

**Next steps:**
1. Try different prompts to see how complexity affects the comparison
2. Experiment with other model pairs
3. Use the workshop's routing examples (Session 06) to route intelligently based on task complexity
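
As a taste of complexity-based routing, the sketch below sends short, simple prompts to the SLM and everything else to the LLM. It is a deliberately crude heuristic and is not taken from the Session 06 samples: `choose_model`, the keyword list, and the 40-word threshold are all assumptions chosen for illustration.

```python
# Keywords that loosely suggest a reasoning-heavy request (assumed list).
REASONING_HINTS = ("why", "explain", "compare", "prove", "step by step", "analyze")


def choose_model(prompt: str, slm: str = "phi-4-mini", llm: str = "qwen2.5-3b",
                 max_slm_words: int = 40) -> str:
    """Send short, simple prompts to the SLM and everything else to the LLM."""
    text = prompt.lower()
    is_long = len(text.split()) > max_slm_words
    needs_reasoning = any(hint in text for hint in REASONING_HINTS)
    return llm if (is_long or needs_reasoning) else slm


print(choose_model("List 5 benefits of local AI inference."))
print(choose_model("Explain step by step why quantization reduces memory use."))
```

Substring matching is easy to fool (for example, "improve" contains "prove"), so a real router would need a more careful signal, such as measured quality on your own prompts.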


In [38]:
# Final Validation Check
print("="*70)
print("VALIDATION SUMMARY")
print("="*70)
print(f"✅ SLM Model: {SLM}")
print(f"✅ LLM Model: {LLM}")
print(f"✅ Using Foundry SDK Pattern: workshop_utils with FoundryLocalManager")
print(f"✅ Pre-flight passed: {all(v['status'] == 'success' for v in preflight.values()) if 'preflight' in dir() else 'Not run yet'}")
print(f"✅ Comparison completed: {len(results) == 2 if 'results' in dir() else 'Not run yet'}")
print(f"✅ Both models responded: {all(r.get('status') == 'success' for r in results) if 'results' in dir() and results else 'Not run yet'}")
print("="*70)

# Check for common configuration issues
issues = []
if 'LLM' in dir() and LLM not in ['qwen2.5-3b', 'qwen2.5-0.5b', 'qwen2.5-1.5b', 'qwen2.5-7b', 'phi-3.5-mini']:
    issues.append(f"⚠️  LLM is '{LLM}' - expected qwen2.5-3b for memory efficiency")
if 'preflight' in dir() and not all(v['status'] == 'success' for v in preflight.values()):
    issues.append("⚠️  Pre-flight check failed - models not accessible")
if 'results' in dir() and results and not all(r.get('status') == 'success' for r in results):
    issues.append("⚠️  Comparison incomplete - check for errors above")

if not issues and 'results' in dir() and results and all(r.get('status') == 'success' for r in results):
    print("🎉 ALL CHECKS PASSED! Notebook completed successfully.")
    print(f"   SLM ({SLM}) vs LLM ({LLM}) comparison completed.")
    if len(results) == 2:
        speedup = results[1]['elapsed_sec'] / results[0]['elapsed_sec'] if results[0]['elapsed_sec'] > 0 else 0
        print(f"   Performance: SLM is {speedup:.2f}x faster")
elif issues:
    print("\n⚠️  Issues detected:")
    for issue in issues:
        print(f"   {issue}")
    print("\n💡 Troubleshooting:")
    print("   1. Ensure service is running: foundry service start")
    # Suggest the aliases actually configured rather than a hardcoded pair
    print(f"   2. Load models: foundry model run {SLM} && foundry model run {LLM}")
    print("   3. Check model list: foundry model ls")
else:
    print("\n💡 Run all cells above first, then re-run this validation.")
print("="*70)

VALIDATION SUMMARY
✅ SLM Model: phi-4-mini
✅ LLM Model: qwen2.5-7b
✅ Using Foundry SDK Pattern: workshop_utils with FoundryLocalManager
✅ Pre-flight passed: True
✅ Comparison completed: True
✅ Both models responded: True
🎉 ALL CHECKS PASSED! Notebook completed successfully.
   SLM (phi-4-mini) vs LLM (qwen2.5-7b) comparison completed.
   Performance: SLM is 5.14x faster



---

**Disclaimer**:  
This document was translated using the machine translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
