# Benchmaker

<p align="center">
<img src="app-white.png" alt="Benchmaker Application Screenshot" width="800" />
</p>

A modular, visual LLM benchmarking desktop application for scientifically comparing multiple Large Language Models under identical conditions.

## Overview
Benchmaker enables AI researchers, prompt engineers, and technical teams to:
- **Execute benchmarks in parallel** across all selected models
- **Score results** using objective rules and/or LLM judge models
- **Persist and compare results** for regression testing and historical analysis
- **Analyze performance trends** with a comprehensive analytics dashboard

## Features

### Core Benchmarking
- **Multi-model parallel execution** - Run the same prompts against multiple LLMs simultaneously
- **Real-time response streaming** - Watch responses as they're generated
- **Modular scoring system** - Exact match, regex, fuzzy matching, and LLM-as-judge scoring
- **Configurable inference parameters** - Temperature, top_p, max_tokens per run
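
Parallel execution and streaming go through OpenRouter's OpenAI-compatible chat completions endpoint. A minimal sketch, assuming a plain `fetch`-based client; the function name, error handling, and exact payload are illustrative, not the app's actual `openrouter.ts`:

```ts
// Hypothetical sketch of a streamed OpenRouter request; the real client lives in src/services/openrouter.ts.
interface InferenceParams {
  temperature?: number;
  top_p?: number;
  max_tokens?: number;
}

async function streamCompletion(
  apiKey: string,
  model: string,
  systemPrompt: string,
  userPrompt: string,
  params: InferenceParams,
  onToken: (text: string) => void,
): Promise<void> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      stream: true,
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: userPrompt },
      ],
      ...params, // temperature, top_p, max_tokens
    }),
  });
  if (!res.ok || !res.body) throw new Error(`OpenRouter request failed: ${res.status}`);

  // OpenRouter streams OpenAI-style server-sent events: lines prefixed with "data: ".
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep any partial line for the next chunk
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue; // skip SSE comments and keep-alives
      const data = line.slice("data: ".length).trim();
      if (data === "[DONE]") continue;
      const delta = JSON.parse(data).choices?.[0]?.delta?.content;
      if (delta) onToken(delta);
    }
  }
}
```

Fanning this out across the selected models is then essentially a `Promise.all` over calls like this one.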

### Scoring System
- **Exact match** - Precise string comparison
- **Regex match** - Pattern-based validation
- **Numeric tolerance** - Numerical comparison with configurable tolerance
- **Boolean match** - Contains-based validation
- **LLM-as-Judge** - AI-powered evaluation with customizable judge prompts
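
A rough sketch of what the rule-based scorers boil down to; the actual implementations live in `src/scoring/`, so treat the names and signatures here as illustrative:

```ts
// Illustrative scorer signatures; the real modules are in src/scoring/ (exact-match.ts, regex-match.ts, ...).
type Scorer = (response: string, expected: string) => number; // 0 = fail, 1 = pass

const exactMatch: Scorer = (response, expected) =>
  response.trim() === expected.trim() ? 1 : 0;

const regexMatch: Scorer = (response, expected) =>
  new RegExp(expected).test(response) ? 1 : 0;

// Numeric tolerance: parse the first number in the response and compare within an absolute tolerance.
const numericTolerance =
  (tolerance: number): Scorer =>
  (response, expected) => {
    const got = parseFloat(response.match(/-?\d+(\.\d+)?/)?.[0] ?? "NaN");
    const want = parseFloat(expected);
    return Math.abs(got - want) <= tolerance ? 1 : 0;
  };

// Boolean/contains match: pass if the expected text appears anywhere, case-insensitively.
const booleanMatch: Scorer = (response, expected) =>
  response.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0;
```

The LLM-as-Judge path instead sends the response, the expected output, and a customizable judge prompt to a judge model and parses its verdict into a score.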

### AI-Assisted Tools
- **Benchmark Generator** - AI-powered generation of complete test suites from descriptions
- **Test Case Generator** - Automatically generate test cases based on system prompts
- **Prompt Enhancer** - AI-assisted improvement of system and judge prompts

### Analytics Dashboard
- **Model Leaderboard** - Ranked performance comparison across all runs
- **Performance Trends** - Track model improvements over time
- **Interesting Facts** - AI-generated insights about benchmark results
- **Per-suite filtering** - Analyze results by specific test suite

### Data Management
- **SQLite persistence** - Local database for test suites and run history
- **Data Vault** - Inspect and patch live JSON store directly
- **Import/Export** - Reproducible benchmark configurations

### User Experience
- **Monaco editor integration** - Rich code editor for prompt authoring
- **Custom Tauri window** - Native desktop experience with custom title bar
- **Auto-updates** - Checks GitHub releases on startup and installs new versions
- **Dark/Light mode** - Theme support for comfortable usage

## Tech Stack

**Frontend:**
- React 19 + TypeScript
- Vite 7 build system
- Tailwind CSS 4
- Zustand state management
- Radix UI primitives (custom-styled)
- Monaco Editor
- Lucide React icons

**Backend:**
- Tauri (Rust)

1. **Clone the repository:**
```bash
git clone https://github.com/oshtz/Benchmaker.git
cd benchmaker
```

```
benchmaker/
├── src/ # React frontend
│ ├── components/
│ │ ├── arena/ # Test execution controls
│ │ ├── analytics/ # Analytics dashboard
│ │ ├── prompt-manager/ # Test suite creation & AI tools
│ │ ├── results/ # Results and reporting
│ │ ├── data/ # Data management (Data Vault)
│ │ ├── settings/ # Settings UI
│ │ ├── layout/ # App layout components
│ │ └── ui/ # Reusable UI primitives
│ ├── services/ # Business logic
│ │ ├── analytics.ts # Analytics computation
│ │ ├── benchmarkGenerator.ts # AI benchmark generation
│ │ ├── execution.ts # Benchmark execution
│ │ ├── localDb.ts # SQLite operations
│ │ ├── openrouter.ts # OpenRouter API client
│ │ ├── promptEnhancer.ts # AI prompt enhancement
│ │ ├── testCaseGenerator.ts # AI test case generation
│ │ └── updater.ts # Auto-update service
│ ├── stores/ # Zustand state stores
│ │ ├── modelStore.ts # Model selection state
│ │ ├── runStore.ts # Benchmark run state
│ │ ├── settingsStore.ts # App settings
│ │ ├── testSuiteStore.ts # Test suite state
│ │ └── updateStore.ts # Update status state
│ ├── scoring/ # Scoring implementations
│ │ ├── exact-match.ts
│ │ ├── regex-match.ts
│ │ ├── numeric-tolerance.ts
│ │ └── llm-judge.ts
│ ├── types/ # TypeScript definitions
│ └── lib/ # Utilities
├── src-tauri/ # Rust backend
│ ├── src/main.rs # Tauri app + SQLite
│ └── tauri.conf.json # Tauri configuration
├── package.json
├── tsconfig.json
└── vite.config.ts
```
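
The split between `src/services/localDb.ts` and `src-tauri/src/main.rs` follows the usual Tauri pattern: the React side calls Rust commands over IPC, and the Rust side owns the SQLite database. A hedged sketch of what that boundary might look like; the command names and payload shapes are assumptions, not the app's actual API:

```ts
// Hypothetical wrapper around Tauri IPC; the actual commands in src-tauri/src/main.rs may differ.
import { invoke } from "@tauri-apps/api/core"; // "@tauri-apps/api/tauri" on Tauri v1

interface RunRecord {
  id: string;
  suiteId: string;
  model: string;
  score: number;
  createdAt: string;
}

// Persist a finished benchmark run via a Rust command that writes to SQLite.
export async function saveRun(run: RunRecord): Promise<void> {
  await invoke("save_run", { run });
}

// Load historical runs for a suite, e.g. for regression comparison in the Analytics tab.
export async function listRuns(suiteId: string): Promise<RunRecord[]> {
  return invoke<RunRecord[]>("list_runs", { suiteId });
}
```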

## Usage

### 1. Configure API Key
Enter your OpenRouter API key in Settings (gear icon in the header).

### 2. Create a Test Suite
- Navigate to the **Prompts** tab
- Define a system prompt that applies to all test cases
- Add individual test cases with prompts and expected outputs
- Configure scoring methods per test case (exact-match, regex, numeric, boolean, or llm-judge)
- Optionally use **AI tools** to generate test cases or enhance prompts
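
As a rough picture of what a suite holds, a test case pairs a prompt with an expected output and a scoring method. The field names below are assumptions; see `src/types/` for the real definitions:

```ts
// Illustrative shapes only; the actual types live in src/types/.
type ScoringMethod = "exact-match" | "regex" | "numeric" | "boolean" | "llm-judge";

interface TestCase {
  prompt: string;
  expected: string;
  scoring: ScoringMethod;
  judgePrompt?: string; // only used when scoring is "llm-judge"
}

interface TestSuite {
  name: string;
  systemPrompt: string; // applied to every test case in the suite
  cases: TestCase[];
}

const example: TestSuite = {
  name: "Arithmetic sanity checks",
  systemPrompt: "Answer with only the final number.",
  cases: [
    { prompt: "What is 17 * 24?", expected: "408", scoring: "exact-match" },
    { prompt: "Approximate pi to two decimals.", expected: "3.14", scoring: "numeric" },
  ],
};
```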

### 3. Select Models
- Go to the **Arena** tab
- View progress and status per model
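
Execution fans out across the selected models in parallel. A simplified sketch of the kind of orchestration `src/services/execution.ts` performs; names are assumed, and the real service also streams tokens and scores each response:

```ts
// Simplified fan-out; the real orchestration lives in src/services/execution.ts.
async function runBenchmark(
  models: string[],
  cases: { prompt: string; expected: string }[],
  complete: (model: string, prompt: string) => Promise<string>, // one model call, e.g. via OpenRouter
): Promise<Map<string, string[]>> {
  const entries = await Promise.all(
    models.map(async (model) => {
      // Run each model's test cases sequentially so per-model progress is easy to report.
      const responses: string[] = [];
      for (const c of cases) {
        responses.push(await complete(model, c.prompt));
      }
      return [model, responses] as const;
    }),
  );
  return new Map(entries); // model -> responses, ready for scoring and display
}
```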

### 5. Analyze Results
- Switch to the **Results** tab for detailed response comparison
- Use the **Analytics** tab for:
- Model leaderboard rankings
- Performance trends over time
- AI-generated insights and interesting facts
- Results are automatically saved for future reference
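
Conceptually, the leaderboard is an average score per model across saved runs. A simplified sketch of the aggregation that `src/services/analytics.ts` presumably performs, with field names assumed:

```ts
// Simplified leaderboard aggregation; the real logic lives in src/services/analytics.ts.
interface ScoredRun {
  model: string;
  score: number; // 0..1 aggregate score for one run
}

function leaderboard(runs: ScoredRun[]): { model: string; avgScore: number; runs: number }[] {
  const byModel = new Map<string, number[]>();
  for (const r of runs) {
    const scores = byModel.get(r.model) ?? [];
    scores.push(r.score);
    byModel.set(r.model, scores);
  }
  return [...byModel.entries()]
    .map(([model, scores]) => ({
      model,
      avgScore: scores.reduce((a, b) => a + b, 0) / scores.length,
      runs: scores.length,
    }))
    .sort((a, b) => b.avgScore - a.avgScore); // best model first
}
```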

### 6. Manage Data
- Use the **Data** tab to inspect the live JSON store
- Patch data directly for reproducible runs
- Export/import benchmark configurations
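
Exported configurations are plain JSON, so a suite can be re-imported unchanged for reproducible runs. A hedged sketch of an export helper, assuming the standard Tauri dialog and filesystem plugins; the app's actual import/export code may work differently:

```ts
// Hypothetical export helper; Benchmaker's real import/export may use different plugins or paths.
import { save } from "@tauri-apps/plugin-dialog";      // "@tauri-apps/api/dialog" on Tauri v1
import { writeTextFile } from "@tauri-apps/plugin-fs"; // "@tauri-apps/api/fs" on Tauri v1

export async function exportSuite(suite: { name: string }): Promise<void> {
  const path = await save({
    defaultPath: `${suite.name}.benchmark.json`,
    filters: [{ name: "Benchmark suite", extensions: ["json"] }],
  });
  if (!path) return; // user cancelled the dialog
  await writeTextFile(path, JSON.stringify(suite, null, 2));
}
```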

### Updates
- The app checks for updates on startup
- Click the version button in the header (e.g. `v0.0.9`) to view update status, release notes, or manually re-check
- Updates are pulled from GitHub Releases and expect a `Benchmaker-Portable.exe` asset on the latest tag
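
Under the release layout described above (tags `vX.Y.Z`, a `Benchmaker-Portable.exe` asset on the latest release), the check boils down to one call to the GitHub Releases API. A simplified sketch, not the actual `src/services/updater.ts`:

```ts
// Simplified update check; the real logic lives in src/services/updater.ts.
interface UpdateInfo {
  available: boolean;
  latestVersion: string;
  downloadUrl?: string;
  notes?: string;
}

export async function checkForUpdate(currentVersion: string): Promise<UpdateInfo> {
  const res = await fetch("https://api.github.com/repos/oshtz/Benchmaker/releases/latest", {
    headers: { Accept: "application/vnd.github+json" },
  });
  if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
  const release = await res.json();

  const latestVersion = String(release.tag_name ?? "").replace(/^v/, ""); // "v0.0.9" -> "0.0.9"
  const asset = (release.assets ?? []).find(
    (a: { name: string }) => a.name === "Benchmaker-Portable.exe",
  );

  return {
    // Naive string comparison; a production check would compare semver properly.
    available: latestVersion !== currentVersion && !!asset,
    latestVersion,
    downloadUrl: asset?.browser_download_url,
    notes: release.body, // release notes shown behind the version button
  };
}
```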

## Development

| `npm run tauri build` | Build production desktop app |

### Release Notes
- Version is sourced from `package.json` and `src-tauri/tauri.conf.json`
- GitHub releases should be tagged as `vX.Y.Z` and include `Benchmaker-Portable.exe`

### Architecture Notes


### Areas Open for Contribution

- Additional scoring plugins
- Export functionality (JSON, CSV, PDF reports)
- UI/UX improvements
- Performance optimizations
## Roadmap
- [ ] Code execution sandbox scoring
- [ ] Cost-aware benchmarking (track API costs)
- [ ] Prompt diff/comparison tools
- [ ] Export to JSON/CSV/PDF
- [ ] Public shareable benchmark URLs
- [ ] CI-style automated regression runs
- [ ] Team workspaces and collaboration
- [ ] Model cost tracking and optimization suggestions

## License

## Acknowledgments
- [OpenRouter](https://openrouter.ai/) for unified LLM API access
- [Tauri](https://tauri.app/) for the desktop framework
- [Radix UI](https://www.radix-ui.com/) for accessible component primitives
- [Monaco Editor](https://microsoft.github.io/monaco-editor/) for the code editing experience
- [Lucide](https://lucide.dev/) for beautiful icons