# Lab 2: On-Device LLM Quickstart and Testing

**Purpose:** Use the Foundry Local model to handle a simple query in a notebook, demonstrating how to interact with the local LLM via code. This lab focuses on verifying **low-latency local inference** for basic tasks.

## Overview

In this lab, we'll:
- Connect to the local Foundry model via Python
- Send simple queries to test functionality
- Measure response times to demonstrate low latency
- Understand the capabilities and limitations of local models

## Step 2.1: Load Configuration and Initialize Libraries

In [None]:
import os
import sys
# Add parent directory for module imports
sys.path.append(os.path.dirname(os.getcwd()))
from modules.config import Config, AzureFoundryConfig

af = AzureFoundryConfig()
af.azure_ai_foundry_endpoint

# Config().get_azure_foundry_endpoint()

In [None]:
af

In [None]:
import os
import sys
import time
# Add parent directory for module imports
sys.path.append(os.path.dirname(os.getcwd()))
import requests
from openai import OpenAI
# from modules.config import config

# # Test configuration access and show available configurations
# print("Configuration loaded successfully!")
# print(f"Debug mode: {config.is_debug_mode()}")
# print(f"Local model endpoint: {config.get_local_model_endpoint()}")
# print(f"Telemetry enabled: {config.is_telemetry_enabled()}")

# # Access specific configurations using the helper methods
# local_endpoint = config.get_local_model_endpoint()
# openai_key = config.get_azure_openai_key()
# complexity_threshold = config.get_complexity_threshold()

# print(f"\nConfiguration values:")
# print(f"  Local endpoint: {local_endpoint}")
# print(f"  OpenAI key configured: {'Yes' if openai_key else 'No'}")
# print(f"  Complexity threshold: {complexity_threshold}")

# # Display environment info for debugging
# env_info = config.get_environment_info()
# print(f"\nEnvironment configuration status:")
# for section, details in env_info.items():
#     print(f"  {section}: {details}")

# Local model configuration
LOCAL_ENDPOINT = os.environ["LOCAL_MODEL_ENDPOINT"] 
LOCAL_MODEL_ALIAS = os.environ["LOCAL_MODEL_NAME"]
AZURE_OPENAI_API_VERSION = os.environ["AZURE_OPENAI_API_VERSION"]

print(f"Local endpoint: {LOCAL_ENDPOINT}")
print(f"Local model alias: {LOCAL_MODEL_ALIAS}")

In [26]:
from foundry_local import FoundryLocalManager

# Initialize and optionally bootstrap with a model
manager = FoundryLocalManager(alias_or_model_id=None, bootstrap=True)

# List models in cache
local_models = manager.list_cached_models()
print(f"Models in cache: {local_models}")
print(f"Model alias: {local_models[0].alias}")
print(f"Local endpoint: {manager.endpoint}")


Models in cache: [FoundryModelInfo(alias=phi-3.5-mini, id=Phi-3.5-mini-instruct-generic-cpu, execution_provider=CPUExecutionProvider, device_type=CPU, file_size=2590 MB, license=MIT)]
Model alias: phi-3.5-mini
Local endpoint: http://127.0.0.1:57149/v1


## Step 2.2: Connect to Foundry Local via Python

We'll verify that the local model service is reachable and configure the `OpenAI` client to use the local endpoint.

In [None]:
# Verify local service is running
def check_local_service():
    try:
        response = requests.get(f"{LOCAL_ENDPOINT}/openai/status", timeout=5)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False

if check_local_service():
    print("✅ Local model service is running")
else:
    print("❌ Local model service is not accessible")
    print("Please ensure 'foundry model run phi-3.5-mini' is running in a terminal")
    
# Configure OpenAI client for local endpoint
local_client = OpenAI(
    base_url=f"{LOCAL_ENDPOINT}/v1",  # Foundry Local typically uses OpenAI-compatible API
    api_key="not-needed"  # Local service doesn't require authentication
)

print("✅ Local OpenAI client configured")
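
The cached-model output in Step 2.1 shows `manager.endpoint` already ending in `/v1`, while the client above appends `/v1` to `LOCAL_ENDPOINT`. If `LOCAL_MODEL_ENDPOINT` were set to a value that already carries the suffix, the base URL would end in `/v1/v1`. The sketch below normalizes either form; `openai_base_url` is a hypothetical helper name, not part of Foundry Local or the OpenAI SDK.

```python
from urllib.parse import urlsplit, urlunsplit


def openai_base_url(endpoint: str) -> str:
    """Return `endpoint` with exactly one trailing '/v1' path segment.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(endpoint.strip())
    path = parts.path.rstrip("/")
    if not path.endswith("/v1"):
        path = f"{path}/v1"
    # Drop any query string or fragment; a base URL should not carry them
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(openai_base_url("http://127.0.0.1:57149"))      # http://127.0.0.1:57149/v1
print(openai_base_url("http://127.0.0.1:57149/v1/"))  # http://127.0.0.1:57149/v1
```

With a helper like this, `OpenAI(base_url=openai_base_url(LOCAL_ENDPOINT), api_key="not-needed")` would accept the endpoint with or without the suffix.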

## Step 2.3: Single-turn Query to Local Model

Let's test the local model with a simple factual question that should be handled well by a lightweight model.

In [None]:
# Test streaming response
try:
    response = local_client.chat.completions.create(
        model=local_models[0].id,
        messages=[
            {"role": "user", "content": "Hello, what is the capital of France?"}
        ],
        max_tokens=150,
        temperature=0.7,
        stream=True
    )
    
    print("Streaming response:")
    for chunk in response:
        # Skip chunks that carry no choices or no text
        if chunk.choices and chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="")
    print()  # New line after streaming
    
except Exception as e:
    print(f"❌ Streaming error: {e}")
    print("Streaming might not be supported by this local service")

In [None]:
def query_local_model(prompt):
    """Send a query to the local model and return the response with timing."""
    try:
        start_time = time.time()
        
        # Use the client configured in Step 2.2
        response = local_client.chat.completions.create(
            model=local_models[0].id,  # Full model ID reported by the Foundry Local cache
            messages=[
                {"role": "user", "content": prompt}
            ],
            max_tokens=150,
            temperature=0.7,
            stream=False
        )
        
        end_time = time.time()
        response_time = end_time - start_time
        
        content = response.choices[0].message.content
        return content, response_time
        
    except Exception as e:
        return f"Error: {str(e)}", 0

# Test with a simple factual question
test_prompt = "Hello, what is the capital of France?"
response, response_time = query_local_model(test_prompt)

print(f"Query: {test_prompt}")
print(f"Local model response: {response}")
print(f"Response time: {response_time:.3f} seconds")
print("\n" + "="*50)

## Step 2.4: Test Various Simple Queries

Let's test the local model with different types of simple queries to understand its capabilities:

In [None]:
# Test queries that should work well with a local model
simple_queries = [
    "What is 2 + 2?",
    "Hi there! How are you today?",
    "What is the largest planet in our solar system?",
    "Convert 100 degrees Fahrenheit to Celsius.",
    "What year was Python first released?"
]

print("Testing local model with simple queries:")
print("=" * 60)

total_time = 0
successful_queries = 0

for i, query in enumerate(simple_queries, 1):
    print(f"\nQuery {i}: {query}")
    response, response_time = query_local_model(query)
    
    if not response.startswith("Error:"):
        successful_queries += 1
        total_time += response_time
        print(f"Response: {response}")
        print(f"Time: {response_time:.3f} seconds")
    else:
        print(f"❌ {response}")
    
    print("-" * 40)

if successful_queries > 0:
    avg_response_time = total_time / successful_queries
    print(f"\n📊 Summary:")
    print(f"Successful queries: {successful_queries}/{len(simple_queries)}")
    print(f"Average response time: {avg_response_time:.3f} seconds")
    print("✅ All timings above came from the local model; compare the average with your latency target")
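
An average over five queries can be skewed by a single slow call, for example the first request after the model loads. The sketch below reports the median and the extremes alongside the mean. `summarize_latencies` is a hypothetical helper, and the sample timings are made-up placeholder values, not measurements.

```python
import statistics


def summarize_latencies(samples):
    """Return count, min, median, mean, and max for a list of timings in seconds."""
    if not samples:
        raise ValueError("no latency samples to summarize")
    return {
        "count": len(samples),
        "min": min(samples),
        "median": statistics.median(samples),
        "mean": statistics.fmean(samples),
        "max": max(samples),
    }


# Placeholder timings (seconds) with one slow outlier; not real measurements
example = [0.4, 0.5, 0.6, 0.5, 3.0]
for name, value in summarize_latencies(example).items():
    print(f"{name}: {value}")
```

In the loop above, appending each `response_time` to a list and passing that list to the helper would give the same kind of summary for real runs.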

## Step 2.5: Test Local Model Limitations

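Let's try some more complex queries to see where the local model might struggle, which motivates a cloud fallback. Note that `query_local_model` caps output at `max_tokens=150`, so long answers are cut off regardless of model capability; judge the responses on quality as well as length: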

In [None]:
# Test queries that might challenge a local model
complex_queries = [
    "Explain quantum computing in detail and discuss its implications for cryptography.",
    "Write a comprehensive business plan for a sustainable energy startup.",
    "Analyze the economic impact of artificial intelligence on employment over the next decade."
]

print("Testing local model with complex queries:")
print("=" * 60)

for i, query in enumerate(complex_queries, 1):
    print(f"\nComplex Query {i}: {query}")
    response, response_time = query_local_model(query)
    
    if not response.startswith("Error:"):
        print(f"Response: {response[:200]}..." if len(response) > 200 else f"Response: {response}")
        print(f"Response length: {len(response)} characters")
        print(f"Time: {response_time:.3f} seconds")
    else:
        print(f"❌ {response}")
    
    print("-" * 40)

print("\n💡 Observations:")
print("- Local models excel at simple, factual queries")
print("- Complex queries may receive shorter or less detailed responses")
print("- This demonstrates the need for intelligent routing to cloud models")

## Step 2.6: Demonstrate Offline Capability

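One key advantage of local models is that they can serve requests without an internet connection once the model is in the local cache. The cell below does not disconnect the network; it sends a prompt containing device-specific details to the local endpoint, so the request is handled entirely on-device:

In [None]:
# Send a device-specific prompt; the local service handles it without any cloud call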
import platform
import getpass

# Device-specific information (works offline)
device_info_prompt = f"""
I'm running on a {platform.system()} system. 
The current user is {getpass.getuser()}.
Can you help me with basic system information or simple calculations?
"""

print("Testing offline/device-specific capability:")
print("=" * 50)

response, response_time = query_local_model(device_info_prompt)

print(f"Device-specific query: {device_info_prompt.strip()}")
print(f"Local response: {response}")
print(f"Response time: {response_time:.3f} seconds")

print("\n🔒 Privacy Benefits:")
print("- All processing happens locally")
print("- No data sent to external servers")
print("- Works without internet connection")
print("- Ideal for sensitive or personal queries")
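
The privacy claims above hold only if the client really points at this machine. One inexpensive check is to confirm that the endpoint host is a loopback address. The sketch below does that with the standard library; `is_loopback_endpoint` is a hypothetical helper, and it assumes the name `localhost` resolves to a loopback address, as it does on a default configuration.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_endpoint(url: str) -> bool:
    """True if the URL's host is 'localhost' or a loopback IP address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True  # assumed to resolve to loopback
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Host is a DNS name other than localhost; treat it as remote
        return False


print(is_loopback_endpoint("http://127.0.0.1:57149/v1"))  # True
print(is_loopback_endpoint("https://example.com/v1"))     # False
```

Running `is_loopback_endpoint(LOCAL_ENDPOINT)` before sending sensitive prompts gives a quick guard against a misconfigured environment variable.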

## Step 2.7: Performance Analysis

Let's analyze the performance characteristics of our local model:

In [None]:
# Measure performance with varying query lengths

queries_by_length = [
    ("Hi", "Very short"),
    ("What is the weather like today?", "Short"),
    ("Can you explain what machine learning is and how it works in simple terms?", "Medium"),
    ("Please provide a detailed explanation of how neural networks function, including the concepts of forward propagation, backpropagation, and gradient descent, along with practical applications in modern AI systems.", "Long")
]

response_times = []
query_lengths = []
labels = []

print("Analyzing performance vs query complexity:")
print("=" * 50)

for query, label in queries_by_length:
    response, response_time = query_local_model(query)
    
    if not response.startswith("Error:"):
        response_times.append(response_time)
        query_lengths.append(len(query))
        labels.append(label)
        
        print(f"\n{label} query ({len(query)} chars): {response_time:.3f}s")
        print(f"Query: {query[:50]}{'...' if len(query) > 50 else ''}")

# Simple performance summary
if response_times:
    print(f"\n📈 Performance Summary:")
    print(f"Fastest response: {min(response_times):.3f}s")
    print(f"Slowest response: {max(response_times):.3f}s")
    print(f"Average response time: {sum(response_times)/len(response_times):.3f}s")
    
    # Show that local responses are consistently fast
    if max(response_times) < 2.0:  # If all responses under 2 seconds
        print("✅ All local responses were under 2 seconds - excellent for user experience!")
    else:
        print("⚠️  Some responses took longer - consider query complexity")
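
Response time for a non-streamed call usually depends more on how many tokens the model generates than on how long the query is, so comparing raw seconds across prompts can mislead. The sketch below normalizes a timing by the number of generated tokens. It assumes the response includes `usage.completion_tokens`, as the OpenAI chat completions format defines; a local service may leave `usage` empty, in which case the helper returns `None`. `tokens_per_second` is a hypothetical helper.

```python
def tokens_per_second(completion_tokens, elapsed_seconds):
    """Return generation throughput, or None if it cannot be computed."""
    if not completion_tokens or elapsed_seconds <= 0:
        return None
    return completion_tokens / elapsed_seconds


# Placeholder numbers for illustration, not measurements
print(tokens_per_second(150, 6.0))   # 25.0
print(tokens_per_second(None, 6.0))  # None
```

To apply this to real calls, `query_local_model` would need to return `response.usage.completion_tokens` alongside the content and timing.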

## Step 2.8: Create Helper Functions for Future Labs

Let's create reusable functions that we'll use in subsequent labs:

In [None]:
def get_local_client():
    """Get configured OpenAI client for local model."""
    return OpenAI(
        base_url=f"{LOCAL_ENDPOINT}/v1",
        api_key="not-needed"
    )

def query_local_with_history(prompt, chat_history=None):
    """Query local model with optional chat history."""
    if chat_history is None:
        chat_history = []
    
    # Build the message list without mutating the caller's history
    messages = chat_history + [{"role": "user", "content": prompt}]
    
    try:
        start_time = time.time()
        
        response = get_local_client().chat.completions.create(
            model=LOCAL_MODEL_ALIAS,
            messages=messages,
            max_tokens=150,
            temperature=0.7,
            stream=False
        )
        
        end_time = time.time()
        response_time = end_time - start_time
        
        content = response.choices[0].message.content
        return content, response_time, True
        
    except Exception as e:
        return f"Error: {str(e)}", 0, False

print("✅ Helper functions defined for use in later labs")
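
`query_local_with_history` does not modify the `chat_history` it receives, so the caller is responsible for recording each exchange before the next turn. The sketch below shows one way to do that; `append_turn` is a hypothetical helper, and the replies are hard-coded so the example runs without a model service.

```python
def append_turn(chat_history, user_prompt, assistant_reply):
    """Return a new history list with one user/assistant exchange appended."""
    return chat_history + [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": assistant_reply},
    ]


history = []
history = append_turn(history, "What is the capital of France?", "Paris.")
history = append_turn(history, "And of Italy?", "Rome.")

for message in history:
    print(f"{message['role']}: {message['content']}")
```

In a real conversation, the second argument would be the prompt passed to `query_local_with_history` and the third would be the content it returned, appended only when the call succeeded.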

## 🎉 Lab 2 Complete!

### What You've Accomplished:
- ✅ Successfully connected to the local Foundry model via Python
- ✅ Demonstrated low-latency responses for simple queries
- ✅ Identified the strengths and limitations of local models
- ✅ Measured performance characteristics
- ✅ Created reusable functions for future labs

### Key Findings:
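- **Speed**: Local inference avoids network round trips; actual latency depends on hardware, model size, and output length, so use the timings measured in Steps 2.4 and 2.7 rather than a fixed figure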
- **Privacy**: All processing happens on-device with no external data transmission
- **Availability**: Works offline without internet connectivity
- **Limitations**: Less capable with complex reasoning or lengthy generation tasks

### Next Steps:
- Proceed to Lab 3 to set up and test the Azure cloud model
- Keep your local model service running
- The helper functions created here will be used for the hybrid routing system

### Key Takeaways for Hybrid Architecture:
1. **Local models excel at**: Simple Q&A, basic calculations, greetings, quick responses
2. **Local models struggle with**: Complex reasoning, long-form content, specialized knowledge
3. **This justifies hybrid routing**: Use local for speed, cloud for complexity

The stage is now set to compare these local capabilities with cloud model performance in Lab 3!