Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
304 changes: 70 additions & 234 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,303 +1,139 @@
# SvelteBench

An LLM benchmark for Svelte 5 based on the HumanEval methodology from OpenAI's paper "Evaluating Large Language Models Trained on Code". This benchmark evaluates LLMs' ability to generate functional Svelte 5 components with proper use of runes and modern Svelte features.
An LLM benchmark for Svelte 5 based on HumanEval methodology. Evaluates LLM-generated Svelte components through automated tests and calculates pass@k metrics.

## Overview

SvelteBench evaluates LLM-generated Svelte components by testing them against predefined test suites. It works by sending prompts to LLMs, generating Svelte components, and verifying their functionality through automated tests. The benchmark calculates pass@k metrics (typically pass@1 and pass@10) to measure model performance.

## Supported Providers

SvelteBench supports multiple LLM providers:

- **OpenAI** - GPT-4, GPT-4o, o1, o3, o4 models
- **Anthropic** - Claude 3.5, Claude 4 models
- **Google** - Gemini 2.5 models
- **OpenRouter** - Access to 100+ models through a unified API
- **Ollama** - Run models locally (Llama, Mistral, etc.)
- **Z.ai** - GLM-4 and other models

## Setup
## Quick Start

```bash
nvm use
# Install dependencies
pnpm install

# Create .env file from example
# Setup environment
cp .env.example .env
# Edit .env and add your API keys for providers you want to test
```

Then edit the `.env` file and add your API keys:

```bash
# OpenAI (optional)
OPENAI_API_KEY=your_openai_api_key_here

# Anthropic (optional)
ANTHROPIC_API_KEY=your_anthropic_api_key_here

# Google Gemini (optional)
GEMINI_API_KEY=your_gemini_api_key_here

# OpenRouter (optional)
OPENROUTER_API_KEY=your_openrouter_api_key_here
OPENROUTER_SITE_URL=https://github.com/khromov/svelte-bench # Optional
OPENROUTER_SITE_NAME=SvelteBench # Optional
OPENROUTER_PROVIDER=deepseek # Optional - preferred provider routing

# Ollama (optional - defaults to http://127.0.0.1:11434)
OLLAMA_HOST=http://127.0.0.1:11434

# Z.ai (optional)
Z_AI_API_KEY=your_z_ai_api_key_here
```

You only need to configure the providers you want to test with.

## Running the Benchmark

### Standard Execution

```bash
# Run the full benchmark (sequential execution)
pnpm start

# Run with parallel sample generation (faster)
PARALLEL_EXECUTION=true pnpm start

# Run tests only (without building visualization)
pnpm run run-tests
```

**NOTE: This will run all providers and models that are available!**
## Usage

### New CLI Interface

You can also use the new CLI interface with provider:model syntax:
### Basic Commands

```bash
# Basic syntax: pnpm start [provider:model] [options]

# Run with specific provider and model
# Run benchmark with specific model
pnpm start anthropic:claude-3-haiku

# Run with MCP tools for Svelte enhancements
# Run with MCP tools (Svelte-specific enhancements)
pnpm start google:gemini-2.5-flash --mcp

# Run with parallel execution
# Run with parallel execution (faster)
pnpm start openai:gpt-4o --parallel

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Include openrouter example

# Run with context file and short flags
# Run with context file
pnpm start moonshot:kimi-k2 -m -c ./context/svelte.dev/llms-small.txt

# Show help
pnpm start --help
```

**Available Options:**
### Options

- `-h, --help` - Show help message
- `-p, --parallel` - Enable parallel execution
- `-m, --mcp` - Enable MCP tools for Svelte support
- `-h, --help` - Show help
- `-p, --parallel` - Parallel execution (faster)
- `-m, --mcp` - Enable MCP tools
- `-c, --context <file>` - Load context file

### Execution Modes

SvelteBench supports two execution modes:
### Debug Mode (legacy)

- **Sequential (default)**: Tests and samples run one at a time. More reliable with detailed progress output.
- **Parallel**: Tests run sequentially, but samples within each test are generated in parallel. Faster execution with `PARALLEL_EXECUTION=true`.
Use `.env` for quick development testing:

### Debug Mode

For faster development, or to run just one provider/model, you can enable debug mode in your `.env` file:

```
```bash
DEBUG_MODE=true
DEBUG_PROVIDER=anthropic
DEBUG_MODEL=claude-3-7-sonnet-20250219
DEBUG_TEST=counter
```

Debug mode runs only one provider/model combination, making it much faster for testing during development.

#### Running Multiple Models in Debug Mode

You can now specify multiple models to test in debug mode by providing a comma-separated list:

DEBUG_TEST=counter # Optional: specific test
```
DEBUG_MODE=true
DEBUG_PROVIDER=anthropic
DEBUG_MODEL=claude-3-7-sonnet-20250219,claude-opus-4-20250514,claude-sonnet-4-20250514
```

This will run tests with all three models sequentially while still staying within the same provider.

### Running with Context
Multiple models supported: `DEBUG_MODEL=model1,model2,model3`

You can provide a context file (like Svelte documentation) to help the LLM generate better components:
### Environment Variables (legacy)

```bash
# Run with a context file
pnpm run run-tests -- --context ./context/svelte.dev/llms-small.txt && pnpm run build
```

The context file will be included in the prompt to the LLM, providing additional information for generating components.

## Visualizing Results

After running the benchmark, you can visualize the results using the built-in visualization tool:

```bash
pnpm run build
```

You can now find the visualization in the `dist` directory.

## Adding New Tests

To add a new test:

1. Create a new directory in `src/tests/` with the name of your test
2. Add a `prompt.md` file with instructions for the LLM
3. Add a `test.ts` file with Vitest tests for the generated component
4. Add a `Reference.svelte` file with a reference implementation for validation

Example structure:
# Run all providers (legacy interface)
pnpm start

# Parallel execution (legacy)
PARALLEL_EXECUTION=true pnpm start
```
src/tests/your-test/
├── prompt.md # Instructions for the LLM
├── test.ts # Tests for the generated component
└── Reference.svelte # Reference implementation
```

## Benchmark Results

### Output Files

After running the benchmark, results are saved in multiple formats:

- **JSON Results**: `benchmarks/benchmark-results-{timestamp}.json` - Machine-readable results with pass@k metrics
- **HTML Visualization**: `benchmarks/benchmark-results-{timestamp}.html` - Interactive visualization of results
- **Individual Model Results**: `benchmarks/benchmark-results-{provider}-{model}-{timestamp}.json` - Per-model results

When running with a context file, the results filename will include "with-context" in the name.

### Versioning System
## Supported Providers

**Current Results**: All new benchmark runs produce current results with:
Via **Vercel AI SDK** unified interface:

- Fixed test prompts and improved error handling
- Corrected Svelte syntax examples
- Standard naming without version suffixes
- **Native SDK Providers**: OpenAI, Anthropic, Google Gemini, OpenRouter, Moonshot, Z.ai, Ollama
- **AI SDK Registry**: Azure OpenAI, xAI (Grok), Mistral, Groq, DeepSeek, Cerebras, Fireworks, Together.ai, Perplexity, DeepInfra, Cohere, Amazon Bedrock, and more

**Legacy Results (v1)**: Historical results from the original test suite with known issues in the "inspect" test prompt (stored in `benchmarks/v1/`).
See `.env.example` for API key configuration.

### Merging Results
## Results & Visualization

You can merge multiple benchmark results into a single file:
Results are automatically saved to `benchmarks/` with timestamps. Build visualization:

```bash
# Merge current results (recommended)
pnpm run merge

# Merge legacy results (if needed)
pnpm run merge-v1

# Build visualization from current results
pnpm run build

# Build visualization from legacy results
pnpm run build-v1
pnpm run build # Creates merged visualization
```

This creates merged JSON and HTML files:
Output files:
- `benchmark-results-{timestamp}.json` - Raw results with pass@k metrics
- `benchmark-results-merged.html` - Interactive visualization

- `pnpm run merge``benchmarks/benchmark-results-merged.{json,html}` (current results)
- `pnpm run merge-v1``benchmarks/v1/benchmark-results-merged.{json,html}` (legacy results)
## Test Suite

The standard build process uses current results by default.
Tests for core Svelte 5 features:

## Advanced Features
- **hello-world** - Basic component rendering
- **counter** - State management (`$state`)
- **derived** - Computed values (`$derived`)
- **derived-by** - Advanced derived state (`$derived.by`)
- **effect** - Side effects (`$effect`)
- **props** - Component props (`$props`)
- **each** - List rendering (`{#each}`)
- **snippets** - Reusable templates
- **inspect** - Debug utilities (`$inspect`)

### Checkpoint & Resume
### Adding Tests

SvelteBench automatically saves checkpoints at the sample level, allowing you to resume interrupted benchmark runs:
Create directory in `src/tests/` with:
- `prompt.md` - LLM instructions
- `test.ts` - Vitest tests
- `Reference.svelte` - Reference implementation

- Checkpoints are saved in `tmp/checkpoint/` after each sample completion
- If a run is interrupted, it will automatically resume from the last checkpoint
- Checkpoints are cleaned up after successful completion
## Features

### Retry Mechanism
### Checkpoint & Resume
Automatic sample-level checkpointing in `tmp/checkpoint/` - interrupted runs resume automatically.

API calls have configurable retry logic with exponential backoff. Configure in `.env`:
### HumanEval Metrics
- **pass@1** - Probability single sample passes
- **pass@10** - Probability ≥1 of 10 samples passes
- Default: 10 samples/test (1 for expensive models)

### Retry Logic
Configurable exponential backoff via `.env`:
```bash
RETRY_MAX_ATTEMPTS=3 # Maximum retry attempts (default: 3)
RETRY_INITIAL_DELAY_MS=1000 # Initial delay before retry (default: 1000ms)
RETRY_MAX_DELAY_MS=30000 # Maximum delay between retries (default: 30s)
RETRY_BACKOFF_FACTOR=2 # Exponential backoff factor (default: 2)
RETRY_MAX_ATTEMPTS=3
RETRY_INITIAL_DELAY_MS=1000
RETRY_MAX_DELAY_MS=30000
RETRY_BACKOFF_FACTOR=2
```

### Model Validation

Before running benchmarks, models are automatically validated to ensure they're available and properly configured. Invalid models are skipped with appropriate warnings.

### HumanEval Metrics

The benchmark calculates pass@k metrics based on the HumanEval methodology:

- **pass@1**: Probability that a single sample passes all tests
- **pass@10**: Probability that at least one of 10 samples passes all tests
- Default: 10 samples per test (1 sample for expensive models)

### Test Verification

Verify that all tests have proper structure:
## Utility Commands

```bash
pnpm run verify
pnpm run verify # Verify test structure
pnpm run merge # Merge all results
pnpm run merge-v1 # Merge legacy results (legacy)
pnpm run build-v1 # Build legacy visualization (legacy)
```

This checks that each test has required files (prompt.md, test.ts, Reference.svelte).

## Current Test Suite

The benchmark includes tests for core Svelte 5 features:

- **hello-world**: Basic component rendering
- **counter**: State management with `$state` rune
- **derived**: Computed values with `$derived` rune
- **derived-by**: Advanced derived state with `$derived.by`
- **effect**: Side effects with `$effect` rune
- **props**: Component props with `$props` rune
- **each**: List rendering with `{#each}` blocks
- **snippets**: Reusable template snippets
- **inspect**: Debug utilities with `$inspect` rune

## Troubleshooting

### Common Issues

1. **Models not found**: Ensure API keys are correctly set in `.env`
2. **Tests failing**: Check that you're using Node.js 20+ and have run `pnpm install`
3. **Parallel execution errors**: Try sequential mode (remove `PARALLEL_EXECUTION=true`)
4. **Memory issues**: Reduce the number of samples or run in debug mode with fewer models

### Debugging

Enable detailed logging by examining the generated components in `tmp/samples/` directories and test outputs in the console.

## Contributing

Contributions are welcome! Please ensure:

1. New tests include all required files (prompt.md, test.ts, Reference.svelte)
2. Tests follow the existing structure and naming conventions
3. Reference implementations are correct and pass all tests
4. Documentation is updated for new features

## License

MIT
Loading