LLM Performance Testing Tool

A focused load testing tool for LLM inference servers with OpenAI-compatible APIs. Test performance metrics through a Gradio web interface with comprehensive CSV export capabilities.

Features

  • Multiple API Endpoints: Support for both /v1/completions and /v1/chat/completions endpoints
  • Streaming Support: Test streaming responses with Time To First Token (TTFT) and tokens/second metrics
  • Multi-dataset Support: Integration with HuggingFace datasets (Alpaca-GPT4, OpenOrca, ShareGPT-Vicuna, LIMA)
  • Configurable Load Testing: Simulate 1-100 concurrent users with adjustable think time
  • Comprehensive Metrics: Latency percentiles (P50, P95, P99), throughput, error rates, streaming metrics
  • CSV Export: Detailed summary and raw results for further analysis
  • Easy-to-Use Interface: Gradio web UI requiring no technical expertise
  • Real-time Progress: Monitor test execution and request completion
  • Error Categorization: Detailed error tracking and reporting

Requirements

  • Python 3.9 or higher
  • 2GB RAM minimum (4GB recommended)
  • 1GB disk space for datasets
  • Network connectivity to inference API and HuggingFace

Installation

Using uv (Recommended)

uv is a fast Python package manager. Install it first:

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

Then install the project:

# Clone the repository
git clone <repository-url>
cd llmperf

# Install with uv (creates venv automatically)
uv sync

# Activate the virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Using pip (Traditional)

# Clone the repository
git clone <repository-url>
cd llmperf

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Using pip install (if packaged)

pip install llmperf
# or
uv pip install llmperf

Quick Start

Testing with Cloud APIs

1. Launch the Web Interface

python -m src.ui.app

The Gradio interface will open at http://localhost:7860

Testing with Ollama (Local Models)

For testing local models with Ollama, use the launcher script:

# Option 1: Use launcher script (easiest)
./launch_ollama_ui.sh        # macOS/Linux
launch_ollama_ui.bat         # Windows

# Option 2: Run example script
python example_ollama.py

# Option 3: Manual (set env vars first!)
export LLMPERF_ALLOW_HTTP=true
export LLMPERF_REQUIRE_AUTH=false
python -m src.ui.app

Ollama Configuration:

  • Base URL: http://localhost:11434
  • JWT Token: dummy (any value, Ollama doesn't require auth)
  • Model Name: Your Ollama model (e.g., llama2, mistral)

See OLLAMA_QUICKSTART.md for quick reference or OLLAMA_SETUP.md for detailed instructions.

2. Configure Your Test

Fill in the configuration form:

API Settings:

  • Base URL: Your inference API endpoint (must be HTTPS unless LLMPERF_ALLOW_HTTP=true; see Troubleshooting)
    • Example: https://api.example.com
  • JWT Token: Authentication token for Bearer auth
  • Model Name: Model identifier
    • Example: gpt-3.5-turbo, llama-2-70b

Endpoint Settings:

  • Endpoint Type: Choose between /v1/completions or /v1/chat/completions
  • Enable Streaming: Enable streaming responses (chat endpoint only)
    • When enabled, tracks Time To First Token (TTFT) and tokens per second

Test Parameters:

  • Max Output Tokens: 1-4096 (default: 100)
  • Dataset: Choose from 4 pre-configured datasets
  • Concurrent Users: 1-100 (default: 10)
  • Duration: 10-3600 seconds (default: 60s)
  • Think Time: 0-60 seconds (default: 0s)

3. Run the Test

  1. Click "Start Test"
  2. Monitor progress in the status panel
  3. Wait for completion (or click "Stop Test" to terminate early)

4. Export Results

  1. Click "Export to CSV"
  2. Download both files:
    • Summary CSV: Aggregate metrics and configuration
    • Raw Results CSV: Individual request details

Usage Guide

Understanding Test Parameters

Concurrent Users

  • Simulates parallel users making requests
  • Higher values = more load on server
  • Start with 5-10 for initial testing

Duration

  • How long the test runs (in seconds)
  • Longer tests = more reliable metrics
  • Recommended: 60s minimum for meaningful results

Think Time

  • Wait time between requests per user
  • 0 seconds: Maximum load (continuous requests)
  • 1-5 seconds: Realistic user simulation
  • > 5 seconds: Light load testing

Max Tokens

  • Controls response size
  • Higher values = longer processing time
  • Affects latency measurements

Test Execution Flow

  1. Initialization (2-5 seconds)

    • Dataset downloads/loads
    • Workers created
  2. Execution (configured duration)

    • All workers run concurrently
    • Random prompts selected from dataset
    • Requests sent with configured think time
  3. Completion (instant)

    • Results aggregated
    • Metrics calculated
    • Report generated
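The per-worker behavior in the execution phase above can be sketched as follows (a minimal illustration; the actual implementation in src/load_test/worker.py may differ in names and structure):

```python
import random
import time

def run_worker(prompts, send_request, duration_s, think_time_s):
    """Sketch of one concurrent user: pick random prompts, send
    requests, and honor the configured think time until the test
    duration elapses. Returns (latency_ms, success) tuples."""
    results = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        prompt = random.choice(prompts)      # random prompt from dataset
        start = time.monotonic()
        ok = send_request(prompt)            # blocking API call (stubbed here)
        latency_ms = (time.monotonic() - start) * 1000
        results.append((latency_ms, ok))
        if think_time_s:
            time.sleep(think_time_s)         # wait between requests
    return results
```

With think time set to 0 the loop issues requests back to back, which is why 0s produces maximum load.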

Configuration Examples

High Load Test

Goal: Stress test the server at maximum capacity

Concurrent Users: 100
Duration: 300s (5 minutes)
Think Time: 0s
Max Tokens: 100

Expected Results: High throughput, tests server limits

Realistic User Simulation

Goal: Simulate actual user behavior

Concurrent Users: 20
Duration: 600s (10 minutes)
Think Time: 2s
Max Tokens: 500

Expected Results: Realistic latency, moderate throughput

Quick Smoke Test

Goal: Verify configuration is working

Concurrent Users: 5
Duration: 60s
Think Time: 1s
Max Tokens: 50

Expected Results: Fast execution, basic validation

Latency Baseline Test

Goal: Measure minimum latency without concurrency

Concurrent Users: 1
Duration: 120s
Think Time: 0s
Max Tokens: 100

Expected Results: Best-case latency measurements

Understanding Metrics

Latency Metrics (milliseconds)

| Metric | Description | Interpretation |
|---|---|---|
| Average | Mean latency | General performance indicator |
| Median (P50) | 50th percentile | Typical user experience |
| P95 | 95th percentile | Near-worst case (5% of requests are slower) |
| P99 | 99th percentile | Worst case (1% of requests are slower) |
| Min | Fastest request | Best-case performance |
| Max | Slowest request | Absolute worst case |
| Std Dev | Standard deviation | Consistency measure |

Guidelines:

  • P95 < 2x Median: Consistent performance ✅
  • P99 < 5x Median: Acceptable outliers ✅
  • High Std Dev: Investigate variability ⚠️
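The percentile metrics in the table can be reproduced from a raw latency list, for example (a sketch using nearest-rank percentiles; the tool's own collector in src/metrics/collector.py may use a different interpolation):

```python
import statistics

def latency_summary(latencies_ms):
    """Compute the latency metrics reported by the tool from a list
    of per-request latencies in milliseconds."""
    s = sorted(latencies_ms)

    def pct(p):
        # nearest-rank percentile on the sorted sample
        idx = min(len(s) - 1, max(0, round(p / 100 * len(s)) - 1))
        return s[idx]

    return {
        "avg": statistics.mean(s),
        "p50": statistics.median(s),
        "p95": pct(95),
        "p99": pct(99),
        "min": s[0],
        "max": s[-1],
        "std": statistics.stdev(s) if len(s) > 1 else 0.0,
    }
```

This is also a convenient starting point for custom analysis of the Raw Results CSV.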

Streaming Metrics (Chat Completions Only)

When streaming is enabled with the chat completions endpoint, additional metrics are tracked:

| Metric | Description | Interpretation |
|---|---|---|
| Avg TTFT | Average Time To First Token | How quickly streaming starts |
| Median TTFT | Median Time To First Token | Typical streaming start time |
| P95 TTFT | 95th percentile TTFT | Near-worst case streaming start |
| Avg Tokens/sec | Average generation speed | Token throughput during streaming |
| Median Tokens/sec | Median generation speed | Typical token generation rate |

Guidelines:

  • TTFT < 200ms: Excellent responsiveness ✅
  • TTFT < 500ms: Good user experience ✅
  • TTFT > 1000ms: May feel sluggish ⚠️
  • Tokens/sec > 50: Good generation speed for most applications ✅

Throughput Metrics

| Metric | Description | Calculation |
|---|---|---|
| Throughput (RPS) | Requests per second | total_requests / duration |
| Successful RPS | Successful requests per second | successful_requests / duration |

Reliability Metrics

| Metric | Description | Target |
|---|---|---|
| Error Rate | % failed requests | < 1% for production |
| Success Rate | % successful requests | > 99% for production |
| Total Requests | All requests sent | - |
| Failed Requests | Count of failures | Monitor for patterns |
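The throughput and reliability metrics above follow directly from the request counts and test duration, e.g. (a sketch; function name is illustrative):

```python
def reliability_metrics(total, successful, duration_s):
    """Throughput and reliability metrics as defined above, from the
    total/successful request counts and the actual test duration."""
    failed = total - successful
    return {
        "throughput_rps": total / duration_s,
        "successful_rps": successful / duration_s,
        "error_rate_pct": 100.0 * failed / total if total else 0.0,
        "success_rate_pct": 100.0 * successful / total if total else 0.0,
    }
```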

Datasets

ShareGPT-Vicuna (Recommended)

  • Size: ~90,000 examples
  • Type: Real-world conversations
  • Sample: 1,000 prompts
  • Use Case: General purpose testing
  • HF ID: anon8231489123/ShareGPT_Vicuna_unfiltered

OpenOrca

  • Size: ~4.2M examples
  • Type: Instruction-following (GPT-4 generated)
  • Sample: 1,000 prompts
  • Use Case: Testing instruction-tuned models
  • HF ID: Open-Orca/OpenOrca

Alpaca-GPT4

  • Size: ~52,000 examples
  • Type: Clean synthetic instructions
  • Sample: 1,000 prompts
  • Use Case: Consistent, shorter prompts
  • HF ID: vicgalle/alpaca-gpt4

LIMA

  • Size: ~1,000 examples
  • Type: High-quality curated examples
  • Sample: All prompts
  • Use Case: Quality-focused testing
  • HF ID: GAIR/lima

CSV Export Format

Summary CSV (1 row per test)

Columns:

test_id, timestamp, base_url, model_name, max_tokens, concurrent_users,
duration_seconds, dataset_name, think_time_seconds, total_requests,
successful_requests, failed_requests, error_rate_percent, avg_latency_ms,
median_latency_ms, p95_latency_ms, p99_latency_ms, min_latency_ms,
max_latency_ms, std_latency_ms, throughput_rps, actual_duration_seconds

Use Cases:

  • Compare multiple test runs
  • Track performance over time
  • Analyze different configurations

Raw Results CSV (1 row per request)

Columns:

test_id, timestamp, latency_ms, prompt_length, success,
error_message, status_code, tokens_generated

Use Cases:

  • Detailed investigation of failures
  • Temporal analysis (performance over time)
  • Individual request debugging
  • Custom metric calculations

File Naming

Format: {type}_{YYYYMMDD}_{HHMMSS}_{test_id_short}.csv

Examples:

  • summary_20260120_143022_a1b2c3d4.csv
  • raw_results_20260120_143022_a1b2c3d4.csv
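The naming scheme can be reproduced like this (a sketch; the exporter in src/metrics/exporter.py may construct names differently):

```python
import uuid
from datetime import datetime

def report_filename(kind, test_id=None, now=None):
    """Build a report filename following
    {type}_{YYYYMMDD}_{HHMMSS}_{test_id_short}.csv."""
    now = now or datetime.now()
    short_id = (test_id or uuid.uuid4().hex)[:8]  # first 8 chars of the test id
    return f"{kind}_{now:%Y%m%d}_{now:%H%M%S}_{short_id}.csv"
```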

API Compatibility

Supported Endpoints

This tool supports two OpenAI-compatible endpoints:

1. /v1/completions (Text Completions)

Request Format:

{
  "model": "model-name",
  "prompt": "the prompt text",
  "max_tokens": 100
}

Response Format:

{
  "id": "cmpl-xxx",
  "object": "text_completion",
  "created": 1234567890,
  "model": "model-name",
  "choices": [
    {
      "text": "completion text",
      "index": 0,
      "finish_reason": "length"
    }
  ]
}
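A request matching the format above can be built with the standard library, for example (a sketch; the base URL, token, and model are placeholders, and the tool's own client in src/client/api_client.py may differ):

```python
import json
import urllib.request

def build_completion_request(base_url, jwt_token, model, prompt, max_tokens=100):
    """Build a POST request for /v1/completions with Bearer auth."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {jwt_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it (requires a reachable server):
# with urllib.request.urlopen(build_completion_request(...), timeout=300) as r:
#     text = json.load(r)["choices"][0]["text"]
```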

2. /v1/chat/completions (Chat Completions)

Request Format (Non-Streaming):

{
  "model": "model-name",
  "messages": [
    {
      "role": "user",
      "content": "the prompt text"
    }
  ],
  "max_tokens": 100
}

Request Format (Streaming):

{
  "model": "model-name",
  "messages": [
    {
      "role": "user",
      "content": "the prompt text"
    }
  ],
  "max_tokens": 100,
  "stream": true
}

Response Format (Non-Streaming):

{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "model-name",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "response text"
      },
      "finish_reason": "stop"
    }
  ]
}

Response Format (Streaming): Server-Sent Events (SSE) stream:

data: {"choices":[{"delta":{"content":"response"}}]}
data: {"choices":[{"delta":{"content":" text"}}]}
data: [DONE]
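Consuming such a stream and deriving the streaming metrics can be sketched as follows (illustrative only; it treats each content delta as one token, whereas the tool may count tokens differently):

```python
import json
import time

def consume_sse(lines):
    """Parse an SSE chat-completion stream like the example above.
    Returns (assembled_text, ttft_ms, tokens_per_sec), where each
    content delta is counted as one token."""
    start = time.monotonic()
    ttft_ms = None
    chunks = []
    for line in lines:
        if not line.startswith("data: "):
            continue                         # skip keep-alives / blanks
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break                            # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        content = delta.get("content", "")
        if content and ttft_ms is None:
            ttft_ms = (time.monotonic() - start) * 1000  # first token arrived
        chunks.append(content)
    elapsed = time.monotonic() - start
    tps = len(chunks) / elapsed if elapsed > 0 else 0.0
    return "".join(chunks), ttft_ms, tps
```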

Authentication:

  • Bearer token in Authorization header
  • Format: Authorization: Bearer {jwt_token}
  • Optional for local servers (set LLMPERF_REQUIRE_AUTH=false)

Compatible Inference Servers

  • OpenAI API: Full support for both endpoints and streaming
  • vLLM: Full support for both endpoints and streaming
  • Text Generation Inference (TGI): Full support for both endpoints and streaming
  • Ollama: Full support for both endpoints and streaming
  • FastChat: Full support for completions, check docs for chat/streaming
  • Any OpenAI-compatible API: Should work if following OpenAI's API spec

Troubleshooting

Configuration Issues

Problem: "Error: Base URL must use HTTPS" Solutions:

  • This is expected for local servers (like Ollama) that use HTTP
  • Set environment variables BEFORE launching the UI or running scripts:
    # macOS/Linux
    export LLMPERF_ALLOW_HTTP=true
    export LLMPERF_REQUIRE_AUTH=false
    python -m src.ui.app
    
    # Windows PowerShell
    $env:LLMPERF_ALLOW_HTTP="true"
    $env:LLMPERF_REQUIRE_AUTH="false"
    python -m src.ui.app
  • Or use the provided example_ollama.py script which sets these automatically
  • See OLLAMA_QUICKSTART.md for details

Dataset Issues

Problem: Dataset download fails Solutions:

  • Check internet connection
  • Verify HuggingFace is accessible
  • Try a different dataset
  • Clear cache: rm -rf data/cache/
  • Check disk space (need ~1GB)

Request Issues

Problem: All requests timing out Solutions:

  • Verify API endpoint URL is correct
  • Check network connectivity
  • Increase timeout value (default: 300s)
  • Confirm API server is running
  • Test with curl first

Problem: High error rate (>10%) Solutions:

  • Verify JWT token validity
  • Check API rate limits
  • Reduce concurrent users
  • Confirm model name is correct
  • Check server logs for errors

Problem: Authentication errors (401) Solutions:

  • Regenerate JWT token
  • Check token expiration
  • Verify token format
  • Confirm bearer auth is supported

Performance Issues

Problem: Lower throughput than expected Solutions:

  • Reduce think time (0s for max throughput)
  • Increase concurrent users
  • Check server capacity
  • Verify network bandwidth

Problem: High latency variability Solutions:

  • Run longer tests (>5 minutes)
  • Check server resource usage
  • Monitor network conditions
  • Reduce concurrent load

Development

Setup Development Environment

# Using uv (recommended)
uv sync --extra dev

# Or using pip
pip install -e ".[dev]"

Running Tests

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src --cov-report=html --cov-report=term

# Run specific test file
uv run pytest tests/test_dataset_manager.py

# Run specific test
uv run pytest tests/test_api_client.py::TestAPIClient::test_successful_request

# Run only integration tests
uv run pytest tests/test_integration.py

Code Quality

# Format code
uv run black src/ tests/

# Type checking
uv run mypy src/

# Linting
uv run pylint src/

# Run all checks
uv run black src/ tests/ && uv run mypy src/ && uv run pytest

Project Structure

llmperf/
├── README.md                   # This file
├── requirements.txt            # Python dependencies
├── pytest.ini                  # Pytest configuration
├── setup.py                    # Package setup (optional)
├── .gitignore                  # Git ignore rules
│
├── src/                        # Source code
│   ├── __init__.py
│   ├── config.py               # Configuration constants
│   ├── models.py               # Pydantic data models
│   │
│   ├── dataset/                # Dataset management
│   │   ├── __init__.py
│   │   ├── manager.py          # Dataset loading & caching
│   │   └── extractors.py       # Prompt extraction methods
│   │
│   ├── client/                 # API client
│   │   ├── __init__.py
│   │   └── api_client.py       # HTTP client implementation
│   │
│   ├── load_test/              # Load testing engine
│   │   ├── __init__.py
│   │   ├── worker.py           # Worker (concurrent user)
│   │   └── orchestrator.py     # Test coordinator
│   │
│   ├── metrics/                # Metrics & export
│   │   ├── __init__.py
│   │   ├── collector.py        # Metrics calculation
│   │   └── exporter.py         # CSV export
│   │
│   └── ui/                     # User interface
│       ├── __init__.py
│       └── app.py              # Gradio web app
│
├── tests/                      # Test suite
│   ├── __init__.py
│   ├── conftest.py             # Pytest fixtures
│   ├── test_dataset_manager.py
│   ├── test_api_client.py
│   ├── test_metrics.py
│   ├── test_exporter.py
│   ├── test_integration.py
│   └── fixtures/
│       └── mock_responses.json
│
├── data/                       # Data storage
│   └── cache/                  # Cached datasets
│
└── outputs/                    # Output files
    └── reports/                # CSV reports

Architecture

System Components

┌─────────────────────────────────────────────────────────────┐
│                    Gradio Web Interface                     │
│  ┌────────────────┐  ┌──────────────┐  ┌────────────────┐  │
│  │  Configuration │  │ Test Control │  │ Results Display│  │
│  └────────────────┘  └──────────────┘  └────────────────┘  │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                  Test Orchestrator Layer                    │
│  ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐   │
│  │   Dataset    │  │   Worker     │  │    Metrics      │   │
│  │   Manager    │  │   Manager    │  │   Aggregator    │   │
│  └──────────────┘  └──────────────┘  └─────────────────┘   │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                    Worker Thread Pool                       │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
│  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │  │    ...   │    │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘    │
└───────┼─────────────┼─────────────┼─────────────┼──────────┘
        │             │             │             │
        └─────────────┴─────────────┴─────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              OpenAI-Compatible Inference API                │
│           /v1/completions  /v1/chat/completions            │
└─────────────────────────────────────────────────────────────┘

License

MIT License - see LICENSE file for details

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Support

  • Issues: Open an issue on GitHub
  • Documentation: See this README and code comments
  • Examples: Check the Configuration Examples section

Acknowledgments

  • Built with Gradio for the web interface
  • Datasets from HuggingFace
  • Inspired by real-world LLM deployment challenges

Version: 1.0 Last Updated: January 2026
