A focused load testing tool for LLM inference servers with OpenAI-compatible APIs. Test performance metrics through a Gradio web interface with comprehensive CSV export capabilities.
- Features
- Requirements
- Installation
- Quick Start
- Usage Guide
- Configuration Examples
- Understanding Metrics
- Datasets
- CSV Export Format
- API Compatibility
- Troubleshooting
- Development
- Project Structure
- License
- Multiple API Endpoints: Support for both `/v1/completions` and `/v1/chat/completions` endpoints
- Streaming Support: Test streaming responses with Time To First Token (TTFT) and tokens/second metrics
- Multi-dataset Support: Integration with HuggingFace datasets (Alpaca-GPT4, OpenOrca, ShareGPT-Vicuna, LIMA)
- Configurable Load Testing: Simulate 1-100 concurrent users with adjustable think time
- Comprehensive Metrics: Latency percentiles (P50, P95, P99), throughput, error rates, streaming metrics
- CSV Export: Detailed summary and raw results for further analysis
- Easy-to-Use Interface: Gradio web UI that requires no technical expertise to operate
- Real-time Progress: Monitor test execution and request completion
- Error Categorization: Detailed error tracking and reporting
- Python 3.9 or higher
- 2GB RAM minimum (4GB recommended)
- 1GB disk space for datasets
- Network connectivity to inference API and HuggingFace
uv is a fast Python package manager. Install it first:

```bash
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
```

Then install the project:
```bash
# Clone the repository
git clone <repository-url>
cd llmperf

# Install with uv (creates venv automatically)
uv sync

# Activate the virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

Or, using pip with a manually created virtual environment:

```bash
# Clone the repository
git clone <repository-url>
cd llmperf

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Alternatively, install from PyPI:

```bash
pip install llmperf
# or
uv pip install llmperf
```

Launch the web interface:

```bash
python -m src.ui.app
```

The Gradio interface will open at http://localhost:7860
For testing local models with Ollama, use the launcher script:

```bash
# Option 1: Use launcher script (easiest)
./launch_ollama_ui.sh    # macOS/Linux
launch_ollama_ui.bat     # Windows

# Option 2: Run example script
python example_ollama.py

# Option 3: Manual (set env vars first!)
export LLMPERF_ALLOW_HTTP=true
export LLMPERF_REQUIRE_AUTH=false
python -m src.ui.app
```

Ollama Configuration:
- Base URL: `http://localhost:11434`
- JWT Token: `dummy` (any value; Ollama doesn't require auth)
- Model Name: Your Ollama model (e.g., `llama2`, `mistral`)
See OLLAMA_QUICKSTART.md for quick reference or OLLAMA_SETUP.md for detailed instructions.
Fill in the configuration form:
API Settings:
- Base URL: Your inference API endpoint (must be HTTPS)
  - Example: `https://api.example.com`
- JWT Token: Authentication token for Bearer auth
- Model Name: Model identifier
  - Example: `gpt-3.5-turbo`, `llama-2-70b`
Endpoint Settings:
- Endpoint Type: Choose between `/v1/completions` or `/v1/chat/completions`
- Enable Streaming: Enable streaming responses (chat endpoint only)
- When enabled, tracks Time To First Token (TTFT) and tokens per second
Test Parameters:
- Max Output Tokens: 1-4096 (default: 100)
- Dataset: Choose from 4 pre-configured datasets
- Concurrent Users: 1-100 (default: 10)
- Duration: 10-3600 seconds (default: 60s)
- Think Time: 0-60 seconds (default: 0s)
- Click "Start Test"
- Monitor progress in the status panel
- Wait for completion (or click "Stop Test" to terminate early)
- Click "Export to CSV"
- Download both files:
- Summary CSV: Aggregate metrics and configuration
- Raw Results CSV: Individual request details
Concurrent Users
- Simulates parallel users making requests
- Higher values = more load on server
- Start with 5-10 for initial testing
Duration
- How long the test runs (in seconds)
- Longer tests = more reliable metrics
- Recommended: 60s minimum for meaningful results
Think Time
- Wait time between requests per user
- 0 seconds: Maximum load (continuous requests)
- 1-5 seconds: Realistic user simulation
- > 5 seconds: Light load testing
Max Tokens
- Controls response size
- Higher values = longer processing time
- Affects latency measurements
1. Initialization (2-5 seconds)
   - Dataset downloads/loads
   - Workers created
2. Execution (configured duration)
   - All workers run concurrently
   - Random prompts selected from dataset
   - Requests sent with configured think time
3. Completion (instant)
   - Results aggregated
   - Metrics calculated
   - Report generated
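The execution phase above can be sketched as a plain threaded worker loop: one thread per simulated user, each picking a random prompt and pausing for the think time between requests. This is an illustrative sketch, not the tool's actual implementation; `send_request` here only simulates a call:

```python
import random
import threading
import time

def send_request(prompt: str) -> float:
    """Stand-in for the real API call; returns latency in seconds."""
    start = time.monotonic()
    time.sleep(0.01)  # simulate network + inference time
    return time.monotonic() - start

def worker(prompts, deadline, think_time, results, lock):
    """One simulated user: loop until the deadline, pausing between requests."""
    while time.monotonic() < deadline:
        prompt = random.choice(prompts)  # random prompt from the dataset
        latency = send_request(prompt)
        with lock:
            results.append(latency)
        time.sleep(think_time)

def run_test(prompts, concurrent_users=5, duration_s=1.0, think_time=0.0):
    deadline = time.monotonic() + duration_s
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(prompts, deadline, think_time, results, lock))
        for _ in range(concurrent_users)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

latencies = run_test(["hello", "world"], concurrent_users=3, duration_s=0.2)
print(f"{len(latencies)} requests completed")
```

With think time 0, each worker loops back-to-back, which is why 0 seconds produces maximum load.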
Goal: Stress test the server at maximum capacity
Concurrent Users: 100
Duration: 300s (5 minutes)
Think Time: 0s
Max Tokens: 100
Expected Results: High throughput, tests server limits
Goal: Simulate actual user behavior
Concurrent Users: 20
Duration: 600s (10 minutes)
Think Time: 2s
Max Tokens: 500
Expected Results: Realistic latency, moderate throughput
Goal: Verify configuration is working
Concurrent Users: 5
Duration: 60s
Think Time: 1s
Max Tokens: 50
Expected Results: Fast execution, basic validation
Goal: Measure minimum latency without concurrency
Concurrent Users: 1
Duration: 120s
Think Time: 0s
Max Tokens: 100
Expected Results: Best-case latency measurements
| Metric | Description | Interpretation |
|---|---|---|
| Average | Mean latency | General performance indicator |
| Median (P50) | 50th percentile | Typical user experience |
| P95 | 95th percentile | Near-worst case (5% slower) |
| P99 | 99th percentile | Worst case (1% slower) |
| Min | Fastest request | Best-case performance |
| Max | Slowest request | Absolute worst case |
| Std Dev | Standard deviation | Consistency measure |
Guidelines:
- P95 < 2x Median: Consistent performance ✅
- P99 < 5x Median: Acceptable outliers ✅
- High Std Dev: Investigate variability ⚠️
When streaming is enabled with the chat completions endpoint, additional metrics are tracked:
| Metric | Description | Interpretation |
|---|---|---|
| Avg TTFT | Average Time To First Token | How quickly streaming starts |
| Median TTFT | Median Time To First Token | Typical streaming start time |
| P95 TTFT | 95th percentile TTFT | Near-worst case streaming start |
| Avg Tokens/sec | Average generation speed | Token throughput during streaming |
| Median Tokens/sec | Median generation speed | Typical token generation rate |
Guidelines:
- TTFT < 200ms: Excellent responsiveness ✅
- TTFT < 500ms: Good user experience ✅
- TTFT > 1000ms: May feel sluggish ⚠️
- Tokens/sec > 50: Good generation speed for most applications ✅
| Metric | Description | Calculation |
|---|---|---|
| Throughput (RPS) | Requests per second | total_requests / duration |
| Successful RPS | Successful requests/sec | successful / duration |
| Metric | Description | Target |
|---|---|---|
| Error Rate | % failed requests | < 1% for production |
| Success Rate | % successful requests | > 99% for production |
| Total Requests | All requests sent | - |
| Failed Requests | Count of failures | Monitor for patterns |
- Size: ~90,000 examples
- Type: Real-world conversations
- Sample: 1,000 prompts
- Use Case: General purpose testing
- HF ID: `anon8231489123/ShareGPT_Vicuna_unfiltered`
- Size: ~4.2M examples
- Type: Instruction-following (GPT-4 generated)
- Sample: 1,000 prompts
- Use Case: Testing instruction-tuned models
- HF ID: `Open-Orca/OpenOrca`
- Size: ~52,000 examples
- Type: Clean synthetic instructions
- Sample: 1,000 prompts
- Use Case: Consistent, shorter prompts
- HF ID: `vicgalle/alpaca-gpt4`
- Size: ~1,000 examples
- Type: High-quality curated examples
- Sample: All prompts
- Use Case: Quality-focused testing
- HF ID: `GAIR/lima`
Columns:
test_id, timestamp, base_url, model_name, max_tokens, concurrent_users,
duration_seconds, dataset_name, think_time_seconds, total_requests,
successful_requests, failed_requests, error_rate_percent, avg_latency_ms,
median_latency_ms, p95_latency_ms, p99_latency_ms, min_latency_ms,
max_latency_ms, std_latency_ms, throughput_rps, actual_duration_seconds
Use Cases:
- Compare multiple test runs
- Track performance over time
- Analyze different configurations
Columns:
test_id, timestamp, latency_ms, prompt_length, success,
error_message, status_code, tokens_generated
Use Cases:
- Detailed investigation of failures
- Temporal analysis (performance over time)
- Individual request debugging
- Custom metric calculations
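For custom metric calculations, the raw CSV can be processed with the standard library alone. The sketch below uses the column names listed above; the sample rows are invented for illustration:

```python
import csv
import io
import statistics

# Two made-up successful rows and one failure, in the raw-results format
raw_csv = """test_id,timestamp,latency_ms,prompt_length,success,error_message,status_code,tokens_generated
a1b2,2026-01-20T14:30:22,120.5,42,True,,200,100
a1b2,2026-01-20T14:30:23,980.1,87,True,,200,100
a1b2,2026-01-20T14:30:24,0.0,13,False,timeout,,0
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))
successes = [r for r in rows if r["success"] == "True"]
error_rate = 100 * (len(rows) - len(successes)) / len(rows)
avg_latency = statistics.mean(float(r["latency_ms"]) for r in successes)

print(f"error rate: {error_rate:.1f}%, avg latency: {avg_latency:.1f} ms")
```

In a real analysis, replace `io.StringIO(raw_csv)` with `open("raw_results_....csv")`.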
Format: `{type}_{YYYYMMDD}_{HHMMSS}_{test_id_short}.csv`

Examples:
- `summary_20260120_143022_a1b2c3d4.csv`
- `raw_results_20260120_143022_a1b2c3d4.csv`
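Assuming `test_id_short` is the first eight hex characters of the test UUID (an assumption for illustration, not confirmed from the tool's source), the naming scheme can be reproduced like so:

```python
import uuid
from datetime import datetime

def report_filename(file_type: str, test_id: uuid.UUID, when: datetime) -> str:
    """Build {type}_{YYYYMMDD}_{HHMMSS}_{test_id_short}.csv."""
    short_id = test_id.hex[:8]  # assumed: first 8 hex chars of the test id
    return f"{file_type}_{when:%Y%m%d_%H%M%S}_{short_id}.csv"

name = report_filename(
    "summary",
    uuid.UUID("a1b2c3d4-0000-0000-0000-000000000000"),
    datetime(2026, 1, 20, 14, 30, 22),
)
print(name)  # summary_20260120_143022_a1b2c3d4.csv
```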
This tool supports two OpenAI-compatible endpoints:
This tool supports two OpenAI-compatible endpoints.

`/v1/completions`:

Request Format:

```json
{
  "model": "model-name",
  "prompt": "the prompt text",
  "max_tokens": 100
}
```

Response Format:

```json
{
  "id": "cmpl-xxx",
  "object": "text_completion",
  "created": 1234567890,
  "model": "model-name",
  "choices": [
    {
      "text": "completion text",
      "index": 0,
      "finish_reason": "length"
    }
  ]
}
```

`/v1/chat/completions`:

Request Format (Non-Streaming):

```json
{
  "model": "model-name",
  "messages": [
    {
      "role": "user",
      "content": "the prompt text"
    }
  ],
  "max_tokens": 100
}
```

Request Format (Streaming):

```json
{
  "model": "model-name",
  "messages": [
    {
      "role": "user",
      "content": "the prompt text"
    }
  ],
  "max_tokens": 100,
  "stream": true
}
```

Response Format (Non-Streaming):

```json
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "model-name",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "response text"
      },
      "finish_reason": "stop"
    }
  ]
}
```

Response Format (Streaming): Server-Sent Events (SSE) stream:

```
data: {"choices":[{"delta":{"content":"response"}}]}
data: {"choices":[{"delta":{"content":" text"}}]}
data: [DONE]
```
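Time To First Token is measured by timestamping the first content delta in the stream. A minimal parser for SSE lines in the format above (a sketch; real chunks carry additional fields such as `id` and `finish_reason`):

```python
import json
import time

def parse_sse_stream(lines, start_time, now=time.monotonic):
    """Collect content deltas from an SSE stream and record time-to-first-token."""
    ttft = None
    chunks = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        content = delta.get("content")
        if content:
            if ttft is None:
                ttft = now() - start_time  # first token arrived
            chunks.append(content)
    return "".join(chunks), ttft

text, ttft = parse_sse_stream(
    [
        'data: {"choices":[{"delta":{"content":"response"}}]}',
        'data: {"choices":[{"delta":{"content":" text"}}]}',
        "data: [DONE]",
    ],
    time.monotonic(),
)
print(text)  # response text
```

Tokens/second then follows from the chunk count divided by the time between the first and last delta.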
Authentication:
- Bearer token in Authorization header
- Format: `Authorization: Bearer {jwt_token}`
- Optional for local servers (set `LLMPERF_REQUIRE_AUTH=false`)
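For reference, assembling the Bearer header and chat payload with the standard library might look like this (a sketch; `base_url`, the token, and the model name are placeholders):

```python
import json
import urllib.request

def build_chat_request(base_url: str, jwt_token: str, model: str,
                       prompt: str, max_tokens: int = 100):
    """Construct (but do not send) a chat-completions request with Bearer auth."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {jwt_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("https://api.example.com", "my-token", "llama-2-70b", "hello")
print(req.get_header("Authorization"))  # Bearer my-token
```

Sending the request is then a matter of `urllib.request.urlopen(req)` (or the equivalent in your HTTP client of choice).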
- OpenAI API: Full support for both endpoints and streaming
- vLLM: Full support for both endpoints and streaming
- Text Generation Inference (TGI): Full support for both endpoints and streaming
- Ollama: Full support for both endpoints and streaming
- FastChat: Full support for completions, check docs for chat/streaming
- Any OpenAI-compatible API: Should work if following OpenAI's API spec
Problem: "Error: Base URL must use HTTPS" Solutions:
- This is expected for local servers (like Ollama) that use HTTP
- Set environment variables BEFORE launching the UI or running scripts:
```bash
# macOS/Linux
export LLMPERF_ALLOW_HTTP=true
export LLMPERF_REQUIRE_AUTH=false
python -m src.ui.app
```

```powershell
# Windows PowerShell
$env:LLMPERF_ALLOW_HTTP="true"
$env:LLMPERF_REQUIRE_AUTH="false"
python -m src.ui.app
```
- Or use the provided `example_ollama.py` script, which sets these automatically
- See OLLAMA_QUICKSTART.md for details
Problem: Dataset download fails Solutions:
- Check internet connection
- Verify HuggingFace is accessible
- Try a different dataset
- Clear cache: `rm -rf data/cache/`
- Check disk space (need ~1GB)
Problem: All requests timing out Solutions:
- Verify API endpoint URL is correct
- Check network connectivity
- Increase timeout value (default: 300s)
- Confirm API server is running
- Test with curl first
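A quick reachability check before a full run can save time; a minimal sketch using the standard library (the URL below is a placeholder — any HTTP response, even an error status, proves the server is up):

```python
import urllib.error
import urllib.request

def reachable(base_url: str, timeout: float = 5.0):
    """Pre-flight check: can we open an HTTP connection to the server at all?"""
    try:
        urllib.request.urlopen(base_url, timeout=timeout)
        return True, None
    except urllib.error.HTTPError:
        # The server answered (even with an error status), so it is reachable.
        return True, None
    except Exception as exc:  # DNS failure, connection refused, timeout, ...
        return False, str(exc)

ok, err = reachable("http://127.0.0.1:9", timeout=1.0)  # port 9 is almost certainly closed
print(ok, err)
```

If this returns `False`, fix connectivity (URL, DNS, firewall, server process) before adjusting any load-test settings.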
Problem: High error rate (>10%) Solutions:
- Verify JWT token validity
- Check API rate limits
- Reduce concurrent users
- Confirm model name is correct
- Check server logs for errors
Problem: Authentication errors (401) Solutions:
- Regenerate JWT token
- Check token expiration
- Verify token format
- Confirm bearer auth is supported
Problem: Lower throughput than expected Solutions:
- Reduce think time (0s for max throughput)
- Increase concurrent users
- Check server capacity
- Verify network bandwidth
Problem: High latency variability Solutions:
- Run longer tests (>5 minutes)
- Check server resource usage
- Monitor network conditions
- Reduce concurrent load
```bash
# Using uv (recommended)
uv sync --extra dev

# Or using pip
pip install -e ".[dev]"
```

```bash
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src --cov-report=html --cov-report=term

# Run specific test file
uv run pytest tests/test_dataset_manager.py

# Run specific test
uv run pytest tests/test_api_client.py::TestAPIClient::test_successful_request

# Run only integration tests
uv run pytest tests/test_integration.py
```

```bash
# Format code
uv run black src/ tests/

# Type checking
uv run mypy src/

# Linting
uv run pylint src/

# Run all checks
uv run black src/ tests/ && uv run mypy src/ && uv run pytest
```

llmperf/
├── README.md # This file
├── requirements.txt # Python dependencies
├── pytest.ini # Pytest configuration
├── setup.py # Package setup (optional)
├── .gitignore # Git ignore rules
│
├── src/ # Source code
│ ├── __init__.py
│ ├── config.py # Configuration constants
│ ├── models.py # Pydantic data models
│ │
│ ├── dataset/ # Dataset management
│ │ ├── __init__.py
│ │ ├── manager.py # Dataset loading & caching
│ │ └── extractors.py # Prompt extraction methods
│ │
│ ├── client/ # API client
│ │ ├── __init__.py
│ │ └── api_client.py # HTTP client implementation
│ │
│ ├── load_test/ # Load testing engine
│ │ ├── __init__.py
│ │ ├── worker.py # Worker (concurrent user)
│ │ └── orchestrator.py # Test coordinator
│ │
│ ├── metrics/ # Metrics & export
│ │ ├── __init__.py
│ │ ├── collector.py # Metrics calculation
│ │ └── exporter.py # CSV export
│ │
│ └── ui/ # User interface
│ ├── __init__.py
│ └── app.py # Gradio web app
│
├── tests/ # Test suite
│ ├── __init__.py
│ ├── conftest.py # Pytest fixtures
│ ├── test_dataset_manager.py
│ ├── test_api_client.py
│ ├── test_metrics.py
│ ├── test_exporter.py
│ ├── test_integration.py
│ └── fixtures/
│ └── mock_responses.json
│
├── data/ # Data storage
│ └── cache/ # Cached datasets
│
└── outputs/ # Output files
└── reports/ # CSV reports
┌─────────────────────────────────────────────────────────────┐
│ Gradio Web Interface │
│ ┌────────────────┐ ┌──────────────┐ ┌────────────────┐ │
│ │ Configuration │ │ Test Control │ │ Results Display│ │
│ └────────────────┘ └──────────────┘ └────────────────┘ │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Test Orchestrator Layer │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ Dataset │ │ Worker │ │ Metrics │ │
│ │ Manager │ │ Manager │ │ Aggregator │ │
│ └──────────────┘ └──────────────┘ └─────────────────┘ │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Worker Thread Pool │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Worker 1 │ │ Worker 2 │ │ Worker 3 │ │ ... │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
└───────┼─────────────┼─────────────┼─────────────┼──────────┘
│ │ │ │
└─────────────┴─────────────┴─────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ OpenAI-Compatible Inference API │
│ /v1/completions │
└─────────────────────────────────────────────────────────────┘
MIT License - see LICENSE file for details
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
- Issues: Open an issue on GitHub
- Documentation: See this README and code comments
- Examples: Check the Configuration Examples section
- Built with Gradio for the web interface
- Datasets from HuggingFace
- Inspired by real-world LLM deployment challenges
Version: 1.0 Last Updated: January 2026