# Inference Stress Test with Kamiwaza SDK

This notebook demonstrates how to perform a multi-threaded stress test on deployed models using the Kamiwaza SDK.

The test sends many concurrent requests to the inference endpoint and measures performance metrics such as:
- Response times
- Token throughput (tokens/second)
- Average token usage per request

This is especially important for engines like vLLM that use continuous batching: their throughput advantage only shows up when several requests are in flight at once.
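
To illustrate why the measurement needs concurrency, here is a minimal sketch that compares sequential and concurrent dispatch against a simulated endpoint. `simulated_request` is a hypothetical stand-in that only sleeps. It does not model server-side continuous batching, just the client-side effect of overlapping requests.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def simulated_request(latency_s: float = 0.05) -> float:
    """Hypothetical stand-in for one inference call: sleep, then return the latency."""
    time.sleep(latency_s)
    return latency_s

def run_sequential(n: int) -> float:
    """Issue n simulated requests one after another and return the wall-clock time."""
    start = time.perf_counter()
    for _ in range(n):
        simulated_request()
    return time.perf_counter() - start

def run_concurrent(n: int, workers: int) -> float:
    """Issue n simulated requests through a thread pool and return the wall-clock time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda _: simulated_request(), range(n)))
    return time.perf_counter() - start

sequential_s = run_sequential(8)
concurrent_s = run_concurrent(8, workers=8)
print(f"sequential: {sequential_s:.2f}s, concurrent: {concurrent_s:.2f}s")
```

Against a real deployment the gap depends on how well the engine batches overlapping requests, so treat the simulated numbers as an illustration only.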

## Initialize the Kamiwaza Client

First, we connect to the Kamiwaza server and list all active deployments.

In [1]:
from kamiwaza_sdk import kamiwaza_sdk as kz

# Initialize the client
client = kz("http://localhost:7777/api/")

# List all active deployments
deployments = client.serving.list_active_deployments()

if not deployments:
    print("No active deployments found. Please deploy a model first.")
else:
    print(f"Found {len(deployments)} active deployment(s):")
    for i, dep in enumerate(deployments):
        print(f"  {i+1}. {dep.m_name} (Status: {dep.status}, Endpoint: {dep.endpoint})")
    
    # Select the first deployment for testing
    selected_deployment = deployments[0]
    print(f"\nUsing deployment: {selected_deployment.m_name}")

Found 1 active deployment(s):
  1. Qwen3-8B-GGUF (Status: DEPLOYED, Endpoint: http://localhost:61100/v1)

Using deployment: Qwen3-8B-GGUF
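
When more than one deployment is active, taking the first entry may not pick the model you intend to test. The sketch below selects a deployment by name instead. `pick_deployment` and `FakeDeployment` are hypothetical helpers, not part of the SDK, and the sketch assumes only the `m_name` attribute used in the cell above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class FakeDeployment:
    """Hypothetical stand-in exposing only the fields this sketch needs."""
    m_name: str
    status: str

def pick_deployment(deployments: Sequence, name: Optional[str] = None):
    """Return the deployment whose m_name equals `name`, or the first one if name is None."""
    if not deployments:
        raise ValueError("No active deployments found")
    if name is None:
        return deployments[0]
    for dep in deployments:
        if dep.m_name == name:
            return dep
    raise ValueError(f"No active deployment named {name!r}")

# Demo data with made-up model names.
demo = [FakeDeployment("model-a", "DEPLOYED"), FakeDeployment("model-b", "DEPLOYED")]
chosen = pick_deployment(demo, "model-b")
print(chosen.m_name)
```

In the notebook, the same helper could be called with the `deployments` list returned by the SDK in place of `demo`.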


## Configure Test Parameters

Set the number of prompts to run and prepare the test configuration.

In [2]:
# For a gentle test, set this to a positive number (e.g., 5)
# For full stress test, set to -1 to run all prompts
PROMPTS_TO_RUN = 5  # Start with 5 for testing
# PROMPTS_TO_RUN = -1  # Uncomment for full stress test

# Number of concurrent workers
MAX_WORKERS = 128  # Adjust based on your system capabilities

# Request timeout in seconds
REQUEST_TIMEOUT = 600  # 10 minutes

import logging
logging.basicConfig(level=logging.WARNING)

# Reduce noise from all loggers
for logger_name in logging.root.manager.loggerDict:
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.WARNING)

## Prepare Test Prompts

We'll use a variety of coding-related prompts to test the model's performance.

In [3]:
import random

# Core test prompts
core_prompts = [
    "write me code in python to efficiently print prime numbers 0..1b",
    "write me a fastapi server that handles notes and can get/put them",
    "write me code that demonstrates a simple version of doing a 3-param linear regression model for home prices with price, sqft, rooms to predict prices in that market given sqft/rooms, training set in /tmp/trainingdata.json named 'price', 'sqft', 'rooms' as columns, count the number of rows in the code and split 90/10 training/testing",
    "write me a fastapi server that has endpoints for playing tic tac toe",
    "write me a fastapi server that has endpoints for playing chess; ensure all moves are valid; use chess notation for moves",
    "write me a list of the 50 most influential ai/compsci pioneers, 1 per line, numbered 1..50, with 10-12 words on why they belong on the list; you must not repeat any name",
    "write me an example of factoring an exceedingly large prime with a quantum computer",
    "show me three examples of messy python code that can be improved to be more idiomatic/pythonic with a comprehension",
    "come up with a sqlalchemy schema for a D&D 5e character"
]

# Extended test prompts
extended_prompts = [
    "Python script that scrapes weather data from a website and saves it in a CSV file",
    "create a Django app for a blog where users can post, edit, and delete their blogs",
    "write a Python function to convert a given text into Morse Code",
    "design a Python program that uses machine learning to predict stock market prices",
    "develop a command-line tool in Python for renaming files in bulk according to a specified pattern",
    "write a Python script to automate login to a website and download a specific file",
    "create a Flask API for a todo list application with CRUD operations",
    "write a Python script that uses OpenCV for facial recognition",
    "develop a Python-based text editor with basic functionalities like open, edit, and save files",
    "create a Python script for a basic chatbot that can answer FAQs",
    "write a Python program that solves Sudoku puzzles",
    "develop a web scraper in Python to collect data from multiple pages of a website",
    "write a Python script to monitor and log CPU and memory usage over time",
    "create a Python program that encrypts and decrypts text using a custom cipher",
    "write a Python script that visualizes data from a CSV file using Matplotlib",
    "develop a Python-based system for tracking and managing inventory",
    "create a Python program to analyze and plot cryptocurrency trends",
    "write a Python script that automates sending emails with attachments",
    "develop a Python application that converts speech to text",
    "write a Python-based calculator for complex mathematical expressions",
    "create a Python script to compare two text files and highlight differences",
    "develop a Python program that uses natural language processing to summarize text",
    "write a Python script for a password generator that creates strong, unique passwords",
    "create a Python-based tool for organizing and renaming photos based on date and location",
    "develop a Python script that tracks prices of products on e-commerce websites and sends alerts",
    "write a Python program for a simple expense tracker",
    "create a Python script to automate data entry into a web form",
    "develop a Python tool for visualizing geographic data on a map",
    "write a Python script to analyze and predict weather patterns",
    "create a Python-based server monitoring tool that sends alerts for downtime"
]

# Additional challenging prompts
additional_prompts = [
    "write a Python script to organize and analyze personal finance data from bank statements",
    "create a Python-based GUI application for a digital address book",
    "develop a Python script that automates the creation of PowerPoint presentations from a text outline",
    "write a Python program to generate music playlists based on mood analysis",
    "create a Python script for a basic file encryption and decryption tool",
    "develop a Python-based web application for real-time sports scores and statistics",
    "write a Python program to automate the process of image tagging using AI",
    "create a Python script for real-time chat translation for multiple languages",
    "develop a Python tool for tracking and managing personal health data",
    "write a Python-based system for managing library books and members",
    "create a Python script to convert handwriting to text using machine learning",
    "develop a Python program for a restaurant reservation system",
    "write a Python script for a tool that optimizes travel routes and schedules",
    "create a Python-based application for tracking and managing event tickets",
    "develop a Python script for a workout planner and tracker",
    "write a Python program for a real-time collaborative whiteboard",
    "create a Python script to generate and analyze fantasy sports teams",
    "develop a Python-based tool for home automation control",
    "write a Python script for an automated plant watering system",
    "create a Python program for a virtual reality tour guide"
]

# Combine and shuffle all prompts
all_prompts = core_prompts + extended_prompts + additional_prompts
random.shuffle(all_prompts)

print(f"Total prompts available: {len(all_prompts)}")
print(f"Will run: {min(PROMPTS_TO_RUN, len(all_prompts)) if PROMPTS_TO_RUN > 0 else len(all_prompts)} prompts")

Total prompts available: 59
Will run: 5 prompts
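
The PROMPTS_TO_RUN convention (a positive number limits the run, anything else runs every prompt) can be captured in a small helper. `select_prompts` is a hypothetical name used here for illustration. It is a sketch of the same rule the notebook applies inline later.

```python
from typing import List

def select_prompts(prompts: List[str], limit: int) -> List[str]:
    """Return the first `limit` prompts, or all prompts when limit is not positive."""
    if limit > 0:
        # Slicing past the end is safe: it simply returns every prompt.
        return prompts[:limit]
    return list(prompts)

sample = ["a", "b", "c"]
print(len(select_prompts(sample, 2)))
print(len(select_prompts(sample, -1)))
print(len(select_prompts(sample, 10)))
```

Because slicing never overruns the list, a limit larger than the number of available prompts runs all of them.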


## Get OpenAI Client from SDK

We'll use the SDK's OpenAI client interface to interact with the deployed model.

In [4]:
if deployments:
    # Get the OpenAI client for the selected deployment
    # Use the model name directly
    model_name = selected_deployment.m_name
    
    print(f"Getting OpenAI client for model: {model_name}")
    openai_client = client.openai.get_client(model_name)
    print(f"OpenAI client configured successfully")
    print(f"Endpoint: {selected_deployment.endpoint}")

Getting OpenAI client for model: Qwen3-8B-GGUF
OpenAI client configured successfully
Endpoint: http://localhost:61100/v1
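
Before launching many concurrent requests, it can help to confirm that a single small request succeeds. The sketch below assumes the returned object behaves like the standard OpenAI Python client (`chat.completions.create`). `smoke_test` and the fake client are hypothetical and exist only so the example runs without a server.

```python
from types import SimpleNamespace

def smoke_test(openai_client, model_name: str = "model", timeout: float = 60.0):
    """Send one tiny chat request and return the reply text, raising on any failure."""
    response = openai_client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": "Reply with the single word: ok"}],
        max_tokens=16,
        timeout=timeout,
    )
    return response.choices[0].message.content

# Hypothetical fake client that mimics only the response fields read above.
def _fake_create(**kwargs):
    message = SimpleNamespace(content="ok")
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])

fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)
print(smoke_test(fake_client))
```

With a real deployment, pass the client obtained from the SDK instead of `fake_client`. Some models may return empty content when `max_tokens` is this small, so a non-exception result is the main signal.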


## Define Stress Test Functions

These functions handle concurrent request execution and performance measurement.

In [5]:
import time
import concurrent.futures
from typing import Dict, List, Tuple, Any

def fetch_response(prompt: str, openai_client, model_name: str = "model") -> Tuple[str, Dict, float, Dict]:
    """
    Fetch a single response from the model.
    
    Returns:
        Tuple of (prompt, response, duration, usage)
    """
    start_time = time.time()
    
    try:
        response = openai_client.chat.completions.create(
            model=model_name,
            messages=[
                {
                    "role": "system", 
                    "content": "You are an elite AI assistant, expert at coding. You always do everything you can to be helpful to the user"
                },
                {
                    "role": "user", 
                    "content": prompt
                }
            ],
            timeout=REQUEST_TIMEOUT
        )
        
        end_time = time.time()
        duration = end_time - start_time
        
        # Extract usage statistics
        usage = {
            'prompt_tokens': response.usage.prompt_tokens if response.usage else 0,
            'completion_tokens': response.usage.completion_tokens if response.usage else 0,
            'total_tokens': response.usage.total_tokens if response.usage else 0
        }
        
        # Extract the response content
        response_dict = {
            'content': response.choices[0].message.content,
            'model': response.model,
            'id': response.id
        }
        
        return prompt, response_dict, duration, usage
        
    except Exception as e:
        end_time = time.time()
        duration = end_time - start_time
        error_response = {'content': f'Error: {str(e)}', 'model': model_name, 'id': 'error'}
        error_usage = {'prompt_tokens': 0, 'completion_tokens': 0, 'total_tokens': 0}
        return prompt, error_response, duration, error_usage


def run_stress_test(prompts: List[str], openai_client, max_workers: int = 128) -> None:
    """
    Run the stress test with multiple concurrent requests.
    """
    print(f"\n{'='*60}")
    print(f"Starting stress test with {len(prompts)} prompts")
    print(f"Max concurrent workers: {max_workers}")
    print(f"{'='*60}\n")
    
    start_time = time.time()
    total_prompt_tokens = 0
    total_completion_tokens = 0
    total_tokens = 0
    successful_responses = 0
    failed_responses = 0
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all tasks
        futures = {
            executor.submit(fetch_response, prompt, openai_client): prompt 
            for prompt in prompts
        }
        
        # Process completed tasks
        for i, future in enumerate(concurrent.futures.as_completed(futures), 1):
            try:
                prompt, response, duration, usage = future.result()
                
                # Update statistics
                total_prompt_tokens += usage['prompt_tokens']
                total_completion_tokens += usage['completion_tokens']
                total_tokens += usage['total_tokens']
                
                if response['id'] != 'error':
                    successful_responses += 1
                    print(f"[{i}/{len(prompts)}] ✓ Completed in {duration:.2f}s")
                    print(f"  Prompt: {prompt[:80]}...")
                    print(f"  Tokens: {usage['total_tokens']} (prompt: {usage['prompt_tokens']}, completion: {usage['completion_tokens']})")
                    print(f"  Response preview: {response['content'][:150]}...\n")
                else:
                    failed_responses += 1
                    print(f"[{i}/{len(prompts)}] ✗ Failed after {duration:.2f}s")
                    print(f"  Prompt: {prompt[:80]}...")
                    print(f"  Error: {response['content']}\n")
                    
            except Exception as e:
                failed_responses += 1
                print(f"[{i}/{len(prompts)}] ✗ Exception: {str(e)}\n")
    
    # Calculate and display final statistics
    total_time = time.time() - start_time
    num_prompts = len(prompts)
    
    print(f"\n{'='*60}")
    print(f"STRESS TEST RESULTS")
    print(f"{'='*60}")
    print(f"Total execution time: {total_time:.2f} seconds")
    print(f"Total prompts: {num_prompts}")
    print(f"Successful: {successful_responses} | Failed: {failed_responses}")
    
    if successful_responses > 0:
        avg_prompt_tokens = total_prompt_tokens / successful_responses
        avg_completion_tokens = total_completion_tokens / successful_responses
        avg_total_tokens = total_tokens / successful_responses
        tokens_per_second = total_tokens / total_time if total_time > 0 else 0
        
        print(f"\nToken Statistics:")
        print(f"  Total tokens: {total_tokens:,}")
        print(f"  - Prompt tokens: {total_prompt_tokens:,}")
        print(f"  - Completion tokens: {total_completion_tokens:,}")
        print(f"\nAverage per successful request:")
        print(f"  - Prompt tokens: {avg_prompt_tokens:.1f}")
        print(f"  - Completion tokens: {avg_completion_tokens:.1f}")
        print(f"  - Total tokens: {avg_total_tokens:.1f}")
        print(f"\nPerformance:")
        print(f"  - Overall throughput: {tokens_per_second:.1f} tokens/second")
        print(f"  - Wall-clock time per prompt (total time / prompts): {total_time/num_prompts:.2f} seconds")
    
    print(f"{'='*60}\n")
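
The summary above reports totals and averages. Under concurrent load, the spread of per-request durations is often just as informative. The following sketch summarizes a list of durations with a nearest-rank 95th percentile. `summarize_latencies` is a hypothetical helper and the sample durations are made up for illustration.

```python
import math
import statistics
from typing import Dict, List

def summarize_latencies(durations: List[float]) -> Dict[str, float]:
    """Return min, median, nearest-rank p95, and max for durations in seconds."""
    if not durations:
        raise ValueError("durations must not be empty")
    ordered = sorted(durations)
    # Nearest-rank percentile: the smallest sample with at least 95% of values at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }

# Illustrative durations in seconds (not measured results).
summary = summarize_latencies([1.0, 2.0, 3.0, 4.0, 10.0])
print(summary)
```

To use it with the stress test, one option is to collect each `duration` returned by `fetch_response` into a list inside the result loop and pass that list to the helper afterwards.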

## Run the Stress Test

Execute the stress test with the configured parameters.

**Note**: 
- For initial testing, start with a small number of prompts (5-10)
- For full stress testing, use all prompts with PROMPTS_TO_RUN = -1
- Performance will vary significantly based on:
  - Model size and quantization
  - Hardware (CPU vs GPU)
  - Inference engine (vLLM, LlamaCpp, etc.)

In [6]:
if deployments and openai_client:
    # Select the prompts to run
    prompts_to_test = all_prompts[:PROMPTS_TO_RUN] if PROMPTS_TO_RUN > 0 else all_prompts
    
    print(f"Testing with {len(prompts_to_test)} prompts...")
    print(f"Model: {selected_deployment.m_name}")
    print(f"Deployment ID: {selected_deployment.id}")
    print(f"Status: {selected_deployment.status}")
    print(f"Instances: {len(selected_deployment.instances)}")
    
    # Run the stress test
    run_stress_test(prompts_to_test, openai_client, max_workers=MAX_WORKERS)
else:
    print("Cannot run stress test - no active deployments or client not configured")

Testing with 5 prompts...
Model: Qwen3-8B-GGUF
Deployment ID: acae8693-1373-4595-bdaa-7570d2fe9c05
Status: DEPLOYED
Instances: 1

Starting stress test with 5 prompts
Max concurrent workers: 128

[1/5] ✓ Completed in 70.16s
  Prompt: create a Django app for a blog where users can post, edit, and delete their blog...
  Tokens: 2724 (prompt: 54, completion: 2670)
  Response preview: Here's a step-by-step guide to create a Django blog app with post creation, editing, and deletion:

### 1. Project Setup

```bash
django-admin startpr...

[2/5] ✓ Completed in 123.67s
  Prompt: create a Flask API for a todo list application with CRUD operations...
  Tokens: 2143 (prompt: 48, completion: 2095)
  Response preview: Here's a complete Flask API for a todo list application with CRUD operations:

```python
from flask import Flask, jsonify, request, abort
from flask_s...

[3/5] ✓ Completed in 188.76s
  Prompt: develop a Python tool for tracking and managing personal health data...
  Tokens: 2543 (prom

## Performance Expectations

Based on typical hardware and model configurations:

| Configuration | Expected Throughput | Notes |
|--------------|-------------------|-------|
| vLLM + RTX 4090/A100 | 500-600 tokens/s | Benefits from continuous batching |
| LlamaCpp + M2 Max (small model) | 50-80 tokens/s | Good for local development |
| LlamaCpp + M2 Max (large model) | 20-30 tokens/s | May time out with many prompts |
| CPU inference | 5-20 tokens/s | Not recommended for stress testing |

**Tips for optimal performance:**
1. Use GPU acceleration when available
2. Choose an appropriate quantization (q4_k or q5_k to balance output quality against speed and memory use)
3. Adjust MAX_WORKERS based on your hardware
4. Monitor system resources during testing

## Optional: Stop Deployment After Testing

If you want to free up resources after testing, uncomment and run the cell below.

In [None]:
# Uncomment to stop the deployment after testing
# if deployments and model_name:
#     print(f"Stopping deployment for {selected_deployment.m_name}...")
#     result = client.serving.stop_deployment(model_name)
#     print(f"Deployment stopped: {result}")