# Benchmaker

<p align="center">
<img src="app-white.png" alt="Benchmaker Application Screenshot" width="800" />
</p>

A modular, visual LLM benchmarking desktop application for scientifically comparing multiple Large Language Models under identical conditions.

## Overview
Benchmaker enables AI researchers, prompt engineers, and technical teams to:
- **Execute benchmarks in parallel** across all selected models
- **Score results** using objective rules and/or LLM judge models
- **Persist and compare results** for regression testing and historical analysis
- **Analyze performance trends** with a comprehensive analytics dashboard

## Features

### Core Benchmarking
- **Multi-model parallel execution** - Run the same prompts against multiple LLMs simultaneously
- **Real-time response streaming** - Watch responses as they're generated
- **Modular scoring system** - Exact match, regex, fuzzy matching, and LLM-as-judge scoring
- **Configurable inference parameters** - Temperature, top_p, max_tokens per run
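
Parallel execution and streaming go through OpenRouter's OpenAI-compatible chat completions endpoint. A minimal sketch, assuming a plain `fetch`-based client; the function name, error handling, and exact payload are illustrative, not the app's actual `openrouter.ts`:

```ts
// Hypothetical sketch of a streamed OpenRouter request; the real client lives in src/services/openrouter.ts.
interface InferenceParams {
  temperature?: number;
  top_p?: number;
  max_tokens?: number;
}

async function streamCompletion(
  apiKey: string,
  model: string,
  systemPrompt: string,
  userPrompt: string,
  params: InferenceParams,
  onToken: (text: string) => void,
): Promise<void> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      stream: true,
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: userPrompt },
      ],
      ...params, // temperature, top_p, max_tokens
    }),
  });
  if (!res.ok || !res.body) throw new Error(`OpenRouter request failed: ${res.status}`);

  // OpenRouter streams OpenAI-style server-sent events: lines prefixed with "data: ".
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep any partial line for the next chunk
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue; // skip SSE comments and keep-alives
      const data = line.slice("data: ".length).trim();
      if (data === "[DONE]") continue;
      const delta = JSON.parse(data).choices?.[0]?.delta?.content;
      if (delta) onToken(delta);
    }
  }
}
```

Fanning this out across the selected models is then essentially a `Promise.all` over calls like this one.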

### Scoring System
- **Exact match** - Precise string comparison
- **Regex match** - Pattern-based validation
- **Numeric tolerance** - Numerical comparison with configurable tolerance
- **Boolean match** - Contains-based validation
- **LLM-as-Judge** - AI-powered evaluation with customizable judge prompts
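
A rough sketch of what the rule-based scorers boil down to; the actual implementations live in `src/scoring/`, so treat the names and signatures here as illustrative:

```ts
// Illustrative scorer signatures; the real modules are in src/scoring/ (exact-match.ts, regex-match.ts, ...).
type Scorer = (response: string, expected: string) => number; // 0 = fail, 1 = pass

const exactMatch: Scorer = (response, expected) =>
  response.trim() === expected.trim() ? 1 : 0;

const regexMatch: Scorer = (response, expected) =>
  new RegExp(expected).test(response) ? 1 : 0;

// Numeric tolerance: parse the first number in the response and compare within an absolute tolerance.
const numericTolerance =
  (tolerance: number): Scorer =>
  (response, expected) => {
    const got = parseFloat(response.match(/-?\d+(\.\d+)?/)?.[0] ?? "NaN");
    const want = parseFloat(expected);
    return Math.abs(got - want) <= tolerance ? 1 : 0;
  };

// Boolean/contains match: pass if the expected text appears anywhere, case-insensitively.
const booleanMatch: Scorer = (response, expected) =>
  response.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0;
```

The LLM-as-Judge path instead sends the response, the expected output, and a customizable judge prompt to a judge model and parses its verdict into a score.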

### AI-Assisted Tools
- **Benchmark Generator** - AI-powered generation of complete test suites from descriptions
- **Test Case Generator** - Automatically generate test cases based on system prompts
- **Prompt Enhancer** - AI-assisted improvement of system and judge prompts

### Analytics Dashboard
- **Model Leaderboard** - Ranked performance comparison across all runs
- **Performance Trends** - Track model improvements over time
- **Interesting Facts** - AI-generated insights about benchmark results
- **Per-suite filtering** - Analyze results by specific test suite

### Data Management
- **SQLite persistence** - Local database for test suites and run history
- **Data Vault** - Inspect and patch live JSON store directly
- **Import/Export** - Reproducible benchmark configurations

### User Experience
- **Monaco editor integration** - Rich code editor for prompt authoring
- **Custom Tauri window** - Native desktop experience with custom title bar
- **Auto-updates** - Checks GitHub releases on startup and installs new versions
- **Dark/Light mode** - Theme support for comfortable usage

## Tech Stack

**Frontend:**
- React 19 + TypeScript
- Vite 7 build system
- Tailwind CSS 4
- Zustand state management
- Radix UI primitives (custom-styled)
- Monaco Editor
- Lucide React icons

**Backend:**
- Tauri (Rust)

1. **Clone the repository:**
```bash
git clone https://github.com/oshtz/Benchmaker.git
cd benchmaker
```

```
benchmaker/
├── src/ # React frontend
│ ├── components/
│ │ ├── arena/ # Test execution controls
│ │ ├── analytics/ # Analytics dashboard
│ │ ├── prompt-manager/ # Test suite creation & AI tools
│ │ ├── results/ # Results and reporting
│ │ ├── data/ # Data management (Data Vault)
│ │ ├── settings/ # Settings UI
│ │ ├── layout/ # App layout components
│ │ └── ui/ # Reusable UI primitives
│ ├── services/ # Business logic
│ │ ├── analytics.ts # Analytics computation
│ │ ├── benchmarkGenerator.ts # AI benchmark generation
│ │ ├── execution.ts # Benchmark execution
│ │ ├── localDb.ts # SQLite operations
│ │ ├── openrouter.ts # OpenRouter API client
│ │ ├── promptEnhancer.ts # AI prompt enhancement
│ │ ├── testCaseGenerator.ts # AI test case generation
│ │ └── updater.ts # Auto-update service
│ ├── stores/ # Zustand state stores
│ │ ├── modelStore.ts # Model selection state
│ │ ├── runStore.ts # Benchmark run state
│ │ ├── settingsStore.ts # App settings
│ │ ├── testSuiteStore.ts # Test suite state
│ │ └── updateStore.ts # Update status state
│ ├── scoring/ # Scoring implementations
│ │ ├── exact-match.ts
│ │ ├── regex-match.ts
│ │ ├── numeric-tolerance.ts
│ │ └── llm-judge.ts
│ ├── types/ # TypeScript definitions
│ └── lib/ # Utilities
├── src-tauri/ # Rust backend
│ ├── src/main.rs # Tauri app + SQLite
│ └── tauri.conf.json # Tauri configuration
├── package.json
├── tsconfig.json
└── vite.config.ts
```
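
The split between `src/services/localDb.ts` and `src-tauri/src/main.rs` follows the usual Tauri pattern: the React side calls Rust commands over IPC, and the Rust side owns the SQLite database. A hedged sketch of what that boundary might look like; the command names and payload shapes are assumptions, not the app's actual API:

```ts
// Hypothetical wrapper around Tauri IPC; the actual commands in src-tauri/src/main.rs may differ.
import { invoke } from "@tauri-apps/api/core"; // "@tauri-apps/api/tauri" on Tauri v1

interface RunRecord {
  id: string;
  suiteId: string;
  model: string;
  score: number;
  createdAt: string;
}

// Persist a finished benchmark run via a Rust command that writes to SQLite.
export async function saveRun(run: RunRecord): Promise<void> {
  await invoke("save_run", { run });
}

// Load historical runs for a suite, e.g. for regression comparison in the Analytics tab.
export async function listRuns(suiteId: string): Promise<RunRecord[]> {
  return invoke<RunRecord[]>("list_runs", { suiteId });
}
```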

## Usage

### 1. Configure API Key
Enter your OpenRouter API key in Settings (gear icon in the header).

### 2. Create a Test Suite
- Navigate to the **Prompts** tab
- Define a system prompt that applies to all test cases
- Add individual test cases with prompts and expected outputs
- Configure scoring methods per test case (exact-match, regex, numeric, boolean, or llm-judge)
- Optionally use **AI tools** to generate test cases or enhance prompts
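
As a rough picture of what a suite holds, a test case pairs a prompt with an expected output and a scoring method. The field names below are assumptions; see `src/types/` for the real definitions:

```ts
// Illustrative shapes only; the actual types live in src/types/.
type ScoringMethod = "exact-match" | "regex" | "numeric" | "boolean" | "llm-judge";

interface TestCase {
  prompt: string;
  expected: string;
  scoring: ScoringMethod;
  judgePrompt?: string; // only used when scoring is "llm-judge"
}

interface TestSuite {
  name: string;
  systemPrompt: string; // applied to every test case in the suite
  cases: TestCase[];
}

const example: TestSuite = {
  name: "Arithmetic sanity checks",
  systemPrompt: "Answer with only the final number.",
  cases: [
    { prompt: "What is 17 * 24?", expected: "408", scoring: "exact-match" },
    { prompt: "Approximate pi to two decimals.", expected: "3.14", scoring: "numeric" },
  ],
};
```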

### 3. Select Models
- Go to the **Arena** tab
- View progress and status per model
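
Execution fans out across the selected models in parallel. A simplified sketch of the kind of orchestration `src/services/execution.ts` performs; names are assumed, and the real service also streams tokens and scores each response:

```ts
// Simplified fan-out; the real orchestration lives in src/services/execution.ts.
async function runBenchmark(
  models: string[],
  cases: { prompt: string; expected: string }[],
  complete: (model: string, prompt: string) => Promise<string>, // one model call, e.g. via OpenRouter
): Promise<Map<string, string[]>> {
  const entries = await Promise.all(
    models.map(async (model) => {
      // Run each model's test cases sequentially so per-model progress is easy to report.
      const responses: string[] = [];
      for (const c of cases) {
        responses.push(await complete(model, c.prompt));
      }
      return [model, responses] as const;
    }),
  );
  return new Map(entries); // model -> responses, ready for scoring and display
}
```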

### 5. Analyze Results
- Switch to the **Results** tab for detailed response comparison
- Use the **Analytics** tab for:
- Model leaderboard rankings
- Performance trends over time
- AI-generated insights and interesting facts
- Results are automatically saved for future reference
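
Conceptually, the leaderboard is an average score per model across saved runs. A simplified sketch of the aggregation that `src/services/analytics.ts` presumably performs, with field names assumed:

```ts
// Simplified leaderboard aggregation; the real logic lives in src/services/analytics.ts.
interface ScoredRun {
  model: string;
  score: number; // 0..1 aggregate score for one run
}

function leaderboard(runs: ScoredRun[]): { model: string; avgScore: number; runs: number }[] {
  const byModel = new Map<string, number[]>();
  for (const r of runs) {
    const scores = byModel.get(r.model) ?? [];
    scores.push(r.score);
    byModel.set(r.model, scores);
  }
  return [...byModel.entries()]
    .map(([model, scores]) => ({
      model,
      avgScore: scores.reduce((a, b) => a + b, 0) / scores.length,
      runs: scores.length,
    }))
    .sort((a, b) => b.avgScore - a.avgScore); // best model first
}
```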

### 6. Manage Data
- Use the **Data** tab to inspect the live JSON store
- Patch data directly for reproducible runs
- Export/import benchmark configurations
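
Exported configurations are plain JSON, so a suite can be re-imported unchanged for reproducible runs. A hedged sketch of an export helper, assuming the standard Tauri dialog and filesystem plugins; the app's actual import/export code may work differently:

```ts
// Hypothetical export helper; Benchmaker's real import/export may use different plugins or paths.
import { save } from "@tauri-apps/plugin-dialog";      // "@tauri-apps/api/dialog" on Tauri v1
import { writeTextFile } from "@tauri-apps/plugin-fs"; // "@tauri-apps/api/fs" on Tauri v1

export async function exportSuite(suite: { name: string }): Promise<void> {
  const path = await save({
    defaultPath: `${suite.name}.benchmark.json`,
    filters: [{ name: "Benchmark suite", extensions: ["json"] }],
  });
  if (!path) return; // user cancelled the dialog
  await writeTextFile(path, JSON.stringify(suite, null, 2));
}
```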

### Updates
- The app checks for updates on startup
- Click the version button in the header (e.g. `v0.0.9`) to view update status, release notes, or manually re-check
- Updates are pulled from GitHub Releases and expect a `Benchmaker-Portable.exe` asset on the latest tag
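
Under the release layout described above (tags `vX.Y.Z`, a `Benchmaker-Portable.exe` asset on the latest release), the check boils down to one call to the GitHub Releases API. A simplified sketch, not the actual `src/services/updater.ts`:

```ts
// Simplified update check; the real logic lives in src/services/updater.ts.
interface UpdateInfo {
  available: boolean;
  latestVersion: string;
  downloadUrl?: string;
  notes?: string;
}

export async function checkForUpdate(currentVersion: string): Promise<UpdateInfo> {
  const res = await fetch("https://api.github.com/repos/oshtz/Benchmaker/releases/latest", {
    headers: { Accept: "application/vnd.github+json" },
  });
  if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
  const release = await res.json();

  const latestVersion = String(release.tag_name ?? "").replace(/^v/, ""); // "v0.0.9" -> "0.0.9"
  const asset = (release.assets ?? []).find(
    (a: { name: string }) => a.name === "Benchmaker-Portable.exe",
  );

  return {
    // Naive string comparison; a production check would compare semver properly.
    available: latestVersion !== currentVersion && !!asset,
    latestVersion,
    downloadUrl: asset?.browser_download_url,
    notes: release.body, // release notes shown behind the version button
  };
}
```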

## Development

| `npm run tauri build` | Build production desktop app |

### Release Notes
- Version is sourced from `package.json` and `src-tauri/tauri.conf.json`
- GitHub releases should be tagged as `vX.Y.Z` and include `Benchmaker-Portable.exe`

### Architecture Notes


### Areas Open for Contribution

- Additional scoring plugins
- Export functionality (JSON, CSV, PDF reports)
- UI/UX improvements
- Performance optimizations
## Roadmap
- [ ] Code execution sandbox scoring
- [ ] Cost-aware benchmarking (track API costs)
- [ ] Prompt diff/comparison tools
- [ ] Export to JSON/CSV/PDF
- [ ] Public shareable benchmark URLs
- [ ] CI-style automated regression runs
- [ ] Team workspaces and collaboration
- [ ] Model cost tracking and optimization suggestions

## License

## Acknowledgments
- [OpenRouter](https://openrouter.ai/) for unified LLM API access
- [Tauri](https://tauri.app/) for the desktop framework
- [Radix UI](https://www.radix-ui.com/) for accessible component primitives
- [Monaco Editor](https://microsoft.github.io/monaco-editor/) for the code editing experience
- [Lucide](https://lucide.dev/) for beautiful icons