LLM Performance Testing Tool

A focused load testing tool for LLM inference servers with OpenAI-compatible APIs. Test performance metrics through a Gradio web interface with comprehensive CSV export capabilities.

Features

  • Multiple API Endpoints: Support for both /v1/completions and /v1/chat/completions endpoints
  • Streaming Support: Test streaming responses with Time To First Token (TTFT) and tokens/second metrics
  • Multi-dataset Support: Integration with HuggingFace datasets (Alpaca-GPT4, OpenOrca, ShareGPT-Vicuna, LIMA)
  • Configurable Load Testing: Simulate 1-100 concurrent users with adjustable think time
  • Comprehensive Metrics: Latency percentiles (P50, P95, P99), throughput, error rates, streaming metrics
  • CSV Export: Detailed summary and raw results for further analysis
  • Easy-to-Use Interface: Gradio web UI requiring no technical expertise
  • Real-time Progress: Monitor test execution and request completion
  • Error Categorization: Detailed error tracking and reporting

Requirements

  • Python 3.9 or higher
  • 2GB RAM minimum (4GB recommended)
  • 1GB disk space for datasets
  • Network connectivity to inference API and HuggingFace

Installation

Using uv (Recommended)

uv is a fast Python package manager. Install it first:

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

Then install the project:

# Clone the repository
git clone <repository-url>
cd llmperf

# Install with uv (creates venv automatically)
uv sync

# Activate the virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Using pip (Traditional)

# Clone the repository
git clone <repository-url>
cd llmperf

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Using pip install (if packaged)

pip install llmperf
# or
uv pip install llmperf

Quick Start

Testing with Cloud APIs

1. Launch the Web Interface

python -m src.ui.app

The Gradio interface will open at http://localhost:7860

Testing with Ollama (Local Models)

For testing local models with Ollama, use the launcher script:

# Option 1: Use launcher script (easiest)
./launch_ollama_ui.sh        # macOS/Linux
launch_ollama_ui.bat         # Windows

# Option 2: Run example script
python example_ollama.py

# Option 3: Manual (set env vars first!)
export LLMPERF_ALLOW_HTTP=true
export LLMPERF_REQUIRE_AUTH=false
python -m src.ui.app

Ollama Configuration:

  • Base URL: http://localhost:11434
  • JWT Token: dummy (any value, Ollama doesn't require auth)
  • Model Name: Your Ollama model (e.g., llama2, mistral)

See OLLAMA_QUICKSTART.md for quick reference or OLLAMA_SETUP.md for detailed instructions.

2. Configure Your Test

Fill in the configuration form:

API Settings:

  • Base URL: Your inference API endpoint (must be HTTPS unless LLMPERF_ALLOW_HTTP=true; see Troubleshooting)
    • Example: https://api.example.com
  • JWT Token: Authentication token for Bearer auth
  • Model Name: Model identifier
    • Example: gpt-3.5-turbo, llama-2-70b

Endpoint Settings:

  • Endpoint Type: Choose between /v1/completions or /v1/chat/completions
  • Enable Streaming: Enable streaming responses (chat endpoint only)
    • When enabled, tracks Time To First Token (TTFT) and tokens per second

Test Parameters:

  • Max Output Tokens: 1-4096 (default: 100)
  • Dataset: Choose from 4 pre-configured datasets
  • Concurrent Users: 1-100 (default: 10)
  • Duration: 10-3600 seconds (default: 60s)
  • Think Time: 0-60 seconds (default: 0s)

3. Run the Test

  1. Click "Start Test"
  2. Monitor progress in the status panel
  3. Wait for completion (or click "Stop Test" to terminate early)

4. Export Results

  1. Click "Export to CSV"
  2. Download both files:
    • Summary CSV: Aggregate metrics and configuration
    • Raw Results CSV: Individual request details

Usage Guide

Understanding Test Parameters

Concurrent Users

  • Simulates parallel users making requests
  • Higher values = more load on server
  • Start with 5-10 for initial testing

Duration

  • How long the test runs (in seconds)
  • Longer tests = more reliable metrics
  • Recommended: 60s minimum for meaningful results

Think Time

  • Wait time between requests per user
  • 0 seconds: Maximum load (continuous requests)
  • 1-5 seconds: Realistic user simulation
  • > 5 seconds: Light load testing

Max Tokens

  • Controls response size
  • Higher values = longer processing time
  • Affects latency measurements

Test Execution Flow

  1. Initialization (2-5 seconds)

    • Dataset downloads/loads
    • Workers created
  2. Execution (configured duration)

    • All workers run concurrently
    • Random prompts selected from dataset
    • Requests sent with configured think time
  3. Completion (instant)

    • Results aggregated
    • Metrics calculated
    • Report generated
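The per-worker behavior in the execution phase above can be sketched as follows (a minimal illustration; the actual implementation in src/load_test/worker.py may differ in names and structure):

```python
import random
import time

def run_worker(prompts, send_request, duration_s, think_time_s):
    """Sketch of one concurrent user: pick random prompts, send
    requests, and honor the configured think time until the test
    duration elapses. Returns (latency_ms, success) tuples."""
    results = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        prompt = random.choice(prompts)      # random prompt from dataset
        start = time.monotonic()
        ok = send_request(prompt)            # blocking API call (stubbed here)
        latency_ms = (time.monotonic() - start) * 1000
        results.append((latency_ms, ok))
        if think_time_s:
            time.sleep(think_time_s)         # wait between requests
    return results
```

With think time set to 0 the loop issues requests back to back, which is why 0s produces maximum load.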

Configuration Examples

High Load Test

Goal: Stress test the server at maximum capacity

Concurrent Users: 100
Duration: 300s (5 minutes)
Think Time: 0s
Max Tokens: 100

Expected Results: High throughput, tests server limits

Realistic User Simulation

Goal: Simulate actual user behavior

Concurrent Users: 20
Duration: 600s (10 minutes)
Think Time: 2s
Max Tokens: 500

Expected Results: Realistic latency, moderate throughput

Quick Smoke Test

Goal: Verify configuration is working

Concurrent Users: 5
Duration: 60s
Think Time: 1s
Max Tokens: 50

Expected Results: Fast execution, basic validation

Latency Baseline Test

Goal: Measure minimum latency without concurrency

Concurrent Users: 1
Duration: 120s
Think Time: 0s
Max Tokens: 100

Expected Results: Best-case latency measurements

Understanding Metrics

Latency Metrics (milliseconds)

| Metric | Description | Interpretation |
|---|---|---|
| Average | Mean latency | General performance indicator |
| Median (P50) | 50th percentile | Typical user experience |
| P95 | 95th percentile | Near-worst case (5% of requests are slower) |
| P99 | 99th percentile | Worst case (1% of requests are slower) |
| Min | Fastest request | Best-case performance |
| Max | Slowest request | Absolute worst case |
| Std Dev | Standard deviation | Consistency measure |

Guidelines:

  • P95 < 2x Median: Consistent performance ✅
  • P99 < 5x Median: Acceptable outliers ✅
  • High Std Dev: Investigate variability ⚠️
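The percentile metrics in the table can be reproduced from a raw latency list, for example (a sketch using nearest-rank percentiles; the tool's own collector in src/metrics/collector.py may use a different interpolation):

```python
import statistics

def latency_summary(latencies_ms):
    """Compute the latency metrics reported by the tool from a list
    of per-request latencies in milliseconds."""
    s = sorted(latencies_ms)

    def pct(p):
        # nearest-rank percentile on the sorted sample
        idx = min(len(s) - 1, max(0, round(p / 100 * len(s)) - 1))
        return s[idx]

    return {
        "avg": statistics.mean(s),
        "p50": statistics.median(s),
        "p95": pct(95),
        "p99": pct(99),
        "min": s[0],
        "max": s[-1],
        "std": statistics.stdev(s) if len(s) > 1 else 0.0,
    }
```

This is also a convenient starting point for custom analysis of the Raw Results CSV.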

Streaming Metrics (Chat Completions Only)

When streaming is enabled with the chat completions endpoint, additional metrics are tracked:

| Metric | Description | Interpretation |
|---|---|---|
| Avg TTFT | Average Time To First Token | How quickly streaming starts |
| Median TTFT | Median Time To First Token | Typical streaming start time |
| P95 TTFT | 95th percentile TTFT | Near-worst case streaming start |
| Avg Tokens/sec | Average generation speed | Token throughput during streaming |
| Median Tokens/sec | Median generation speed | Typical token generation rate |

Guidelines:

  • TTFT < 200ms: Excellent responsiveness ✅
  • TTFT < 500ms: Good user experience ✅
  • TTFT > 1000ms: May feel sluggish ⚠️
  • Tokens/sec > 50: Good generation speed for most applications ✅

Throughput Metrics

| Metric | Description | Calculation |
|---|---|---|
| Throughput (RPS) | Requests per second | total_requests / duration |
| Successful RPS | Successful requests per second | successful_requests / duration |

Reliability Metrics

| Metric | Description | Target |
|---|---|---|
| Error Rate | % failed requests | < 1% for production |
| Success Rate | % successful requests | > 99% for production |
| Total Requests | All requests sent | - |
| Failed Requests | Count of failures | Monitor for patterns |
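The throughput and reliability metrics above follow directly from the request counts and test duration, e.g. (a sketch; function name is illustrative):

```python
def reliability_metrics(total, successful, duration_s):
    """Throughput and reliability metrics as defined above, from the
    total/successful request counts and the actual test duration."""
    failed = total - successful
    return {
        "throughput_rps": total / duration_s,
        "successful_rps": successful / duration_s,
        "error_rate_pct": 100.0 * failed / total if total else 0.0,
        "success_rate_pct": 100.0 * successful / total if total else 0.0,
    }
```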

Datasets

ShareGPT-Vicuna (Recommended)

  • Size: ~90,000 examples
  • Type: Real-world conversations
  • Sample: 1,000 prompts
  • Use Case: General purpose testing
  • HF ID: anon8231489123/ShareGPT_Vicuna_unfiltered

OpenOrca

  • Size: ~4.2M examples
  • Type: Instruction-following (GPT-4 generated)
  • Sample: 1,000 prompts
  • Use Case: Testing instruction-tuned models
  • HF ID: Open-Orca/OpenOrca

Alpaca-GPT4

  • Size: ~52,000 examples
  • Type: Clean synthetic instructions
  • Sample: 1,000 prompts
  • Use Case: Consistent, shorter prompts
  • HF ID: vicgalle/alpaca-gpt4

LIMA

  • Size: ~1,000 examples
  • Type: High-quality curated examples
  • Sample: All prompts
  • Use Case: Quality-focused testing
  • HF ID: GAIR/lima

CSV Export Format

Summary CSV (1 row per test)

Columns:

test_id, timestamp, base_url, model_name, max_tokens, concurrent_users,
duration_seconds, dataset_name, think_time_seconds, total_requests,
successful_requests, failed_requests, error_rate_percent, avg_latency_ms,
median_latency_ms, p95_latency_ms, p99_latency_ms, min_latency_ms,
max_latency_ms, std_latency_ms, throughput_rps, actual_duration_seconds

Use Cases:

  • Compare multiple test runs
  • Track performance over time
  • Analyze different configurations

Raw Results CSV (1 row per request)

Columns:

test_id, timestamp, latency_ms, prompt_length, success,
error_message, status_code, tokens_generated

Use Cases:

  • Detailed investigation of failures
  • Temporal analysis (performance over time)
  • Individual request debugging
  • Custom metric calculations

File Naming

Format: {type}_{YYYYMMDD}_{HHMMSS}_{test_id_short}.csv

Examples:

  • summary_20260120_143022_a1b2c3d4.csv
  • raw_results_20260120_143022_a1b2c3d4.csv
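The naming scheme can be reproduced like this (a sketch; the exporter in src/metrics/exporter.py may construct names differently):

```python
import uuid
from datetime import datetime

def report_filename(kind, test_id=None, now=None):
    """Build a report filename following
    {type}_{YYYYMMDD}_{HHMMSS}_{test_id_short}.csv."""
    now = now or datetime.now()
    short_id = (test_id or uuid.uuid4().hex)[:8]  # first 8 chars of the test id
    return f"{kind}_{now:%Y%m%d}_{now:%H%M%S}_{short_id}.csv"
```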

API Compatibility

Supported Endpoints

This tool supports two OpenAI-compatible endpoints:

1. /v1/completions (Text Completions)

Request Format:

{
  "model": "model-name",
  "prompt": "the prompt text",
  "max_tokens": 100
}

Response Format:

{
  "id": "cmpl-xxx",
  "object": "text_completion",
  "created": 1234567890,
  "model": "model-name",
  "choices": [
    {
      "text": "completion text",
      "index": 0,
      "finish_reason": "length"
    }
  ]
}
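A request matching the format above can be built with the standard library, for example (a sketch; the base URL, token, and model are placeholders, and the tool's own client in src/client/api_client.py may differ):

```python
import json
import urllib.request

def build_completion_request(base_url, jwt_token, model, prompt, max_tokens=100):
    """Build a POST request for /v1/completions with Bearer auth."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {jwt_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it (requires a reachable server):
# with urllib.request.urlopen(build_completion_request(...), timeout=300) as r:
#     text = json.load(r)["choices"][0]["text"]
```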

2. /v1/chat/completions (Chat Completions)

Request Format (Non-Streaming):

{
  "model": "model-name",
  "messages": [
    {
      "role": "user",
      "content": "the prompt text"
    }
  ],
  "max_tokens": 100
}

Request Format (Streaming):

{
  "model": "model-name",
  "messages": [
    {
      "role": "user",
      "content": "the prompt text"
    }
  ],
  "max_tokens": 100,
  "stream": true
}

Response Format (Non-Streaming):

{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "model-name",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "response text"
      },
      "finish_reason": "stop"
    }
  ]
}

Response Format (Streaming): Server-Sent Events (SSE) stream:

data: {"choices":[{"delta":{"content":"response"}}]}
data: {"choices":[{"delta":{"content":" text"}}]}
data: [DONE]
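Consuming such a stream and deriving the streaming metrics can be sketched as follows (illustrative only; it treats each content delta as one token, whereas the tool may count tokens differently):

```python
import json
import time

def consume_sse(lines):
    """Parse an SSE chat-completion stream like the example above.
    Returns (assembled_text, ttft_ms, tokens_per_sec), where each
    content delta is counted as one token."""
    start = time.monotonic()
    ttft_ms = None
    chunks = []
    for line in lines:
        if not line.startswith("data: "):
            continue                         # skip keep-alives / blanks
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break                            # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        content = delta.get("content", "")
        if content and ttft_ms is None:
            ttft_ms = (time.monotonic() - start) * 1000  # first token arrived
        chunks.append(content)
    elapsed = time.monotonic() - start
    tps = len(chunks) / elapsed if elapsed > 0 else 0.0
    return "".join(chunks), ttft_ms, tps
```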

Authentication:

  • Bearer token in Authorization header
  • Format: Authorization: Bearer {jwt_token}
  • Optional for local servers (set LLMPERF_REQUIRE_AUTH=false)

Compatible Inference Servers

  • OpenAI API: Full support for both endpoints and streaming
  • vLLM: Full support for both endpoints and streaming
  • Text Generation Inference (TGI): Full support for both endpoints and streaming
  • Ollama: Full support for both endpoints and streaming
  • FastChat: Full support for completions, check docs for chat/streaming
  • Any OpenAI-compatible API: Should work if following OpenAI's API spec

Troubleshooting

Configuration Issues

Problem: "Error: Base URL must use HTTPS" Solutions:

  • This is expected for local servers (like Ollama) that use HTTP
  • Set environment variables BEFORE launching the UI or running scripts:
    # macOS/Linux
    export LLMPERF_ALLOW_HTTP=true
    export LLMPERF_REQUIRE_AUTH=false
    python -m src.ui.app
    
    # Windows PowerShell
    $env:LLMPERF_ALLOW_HTTP="true"
    $env:LLMPERF_REQUIRE_AUTH="false"
    python -m src.ui.app
  • Or use the provided example_ollama.py script which sets these automatically
  • See OLLAMA_QUICKSTART.md for details

Dataset Issues

Problem: Dataset download fails Solutions:

  • Check internet connection
  • Verify HuggingFace is accessible
  • Try a different dataset
  • Clear cache: rm -rf data/cache/
  • Check disk space (need ~1GB)

Request Issues

Problem: All requests timing out Solutions:

  • Verify API endpoint URL is correct
  • Check network connectivity
  • Increase timeout value (default: 300s)
  • Confirm API server is running
  • Test with curl first

Problem: High error rate (>10%) Solutions:

  • Verify JWT token validity
  • Check API rate limits
  • Reduce concurrent users
  • Confirm model name is correct
  • Check server logs for errors

Problem: Authentication errors (401) Solutions:

  • Regenerate JWT token
  • Check token expiration
  • Verify token format
  • Confirm bearer auth is supported

Performance Issues

Problem: Lower throughput than expected Solutions:

  • Reduce think time (0s for max throughput)
  • Increase concurrent users
  • Check server capacity
  • Verify network bandwidth

Problem: High latency variability Solutions:

  • Run longer tests (>5 minutes)
  • Check server resource usage
  • Monitor network conditions
  • Reduce concurrent load

Development

Setup Development Environment

# Using uv (recommended)
uv sync --extra dev

# Or using pip
pip install -e ".[dev]"

Running Tests

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src --cov-report=html --cov-report=term

# Run specific test file
uv run pytest tests/test_dataset_manager.py

# Run specific test
uv run pytest tests/test_api_client.py::TestAPIClient::test_successful_request

# Run only integration tests
uv run pytest tests/test_integration.py

Code Quality

# Format code
uv run black src/ tests/

# Type checking
uv run mypy src/

# Linting
uv run pylint src/

# Run all checks
uv run black src/ tests/ && uv run mypy src/ && uv run pytest

Project Structure

llmperf/
├── README.md                   # This file
├── requirements.txt            # Python dependencies
├── pytest.ini                  # Pytest configuration
├── setup.py                    # Package setup (optional)
├── .gitignore                  # Git ignore rules
│
├── src/                        # Source code
│   ├── __init__.py
│   ├── config.py               # Configuration constants
│   ├── models.py               # Pydantic data models
│   │
│   ├── dataset/                # Dataset management
│   │   ├── __init__.py
│   │   ├── manager.py          # Dataset loading & caching
│   │   └── extractors.py       # Prompt extraction methods
│   │
│   ├── client/                 # API client
│   │   ├── __init__.py
│   │   └── api_client.py       # HTTP client implementation
│   │
│   ├── load_test/              # Load testing engine
│   │   ├── __init__.py
│   │   ├── worker.py           # Worker (concurrent user)
│   │   └── orchestrator.py     # Test coordinator
│   │
│   ├── metrics/                # Metrics & export
│   │   ├── __init__.py
│   │   ├── collector.py        # Metrics calculation
│   │   └── exporter.py         # CSV export
│   │
│   └── ui/                     # User interface
│       ├── __init__.py
│       └── app.py              # Gradio web app
│
├── tests/                      # Test suite
│   ├── __init__.py
│   ├── conftest.py             # Pytest fixtures
│   ├── test_dataset_manager.py
│   ├── test_api_client.py
│   ├── test_metrics.py
│   ├── test_exporter.py
│   ├── test_integration.py
│   └── fixtures/
│       └── mock_responses.json
│
├── data/                       # Data storage
│   └── cache/                  # Cached datasets
│
└── outputs/                    # Output files
    └── reports/                # CSV reports

Architecture

System Components

┌─────────────────────────────────────────────────────────────┐
│                    Gradio Web Interface                     │
│  ┌────────────────┐  ┌──────────────┐  ┌────────────────┐  │
│  │  Configuration │  │ Test Control │  │ Results Display│  │
│  └────────────────┘  └──────────────┘  └────────────────┘  │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                  Test Orchestrator Layer                    │
│  ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐   │
│  │   Dataset    │  │   Worker     │  │    Metrics      │   │
│  │   Manager    │  │   Manager    │  │   Aggregator    │   │
│  └──────────────┘  └──────────────┘  └─────────────────┘   │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                    Worker Thread Pool                       │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
│  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │  │    ...   │    │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘    │
└───────┼─────────────┼─────────────┼─────────────┼──────────┘
        │             │             │             │
        └─────────────┴─────────────┴─────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              OpenAI-Compatible Inference API                │
│           /v1/completions  /v1/chat/completions            │
└─────────────────────────────────────────────────────────────┘

License

MIT License - see LICENSE file for details

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Support

  • Issues: Open an issue on GitHub
  • Documentation: See this README and code comments
  • Examples: Check the Configuration Examples section

Acknowledgments

  • Built with Gradio for the web interface
  • Datasets from HuggingFace
  • Inspired by real-world LLM deployment challenges

Version: 1.0 Last Updated: January 2026
