Streamline README and document UX improvement opportunities #54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged (+222 −234)
@@ -1,303 +1,139 @@

# SvelteBench

An LLM benchmark for Svelte 5 based on the HumanEval methodology from OpenAI's paper "Evaluating Large Language Models Trained on Code". This benchmark evaluates LLMs' ability to generate functional Svelte 5 components with proper use of runes and modern Svelte features.
An LLM benchmark for Svelte 5 based on HumanEval methodology. Evaluates LLM-generated Svelte components through automated tests and calculates pass@k metrics.

## Overview

SvelteBench evaluates LLM-generated Svelte components by testing them against predefined test suites. It works by sending prompts to LLMs, generating Svelte components, and verifying their functionality through automated tests. The benchmark calculates pass@k metrics (typically pass@1 and pass@10) to measure model performance.

## Supported Providers

SvelteBench supports multiple LLM providers:

- **OpenAI** - GPT-4, GPT-4o, o1, o3, o4 models
- **Anthropic** - Claude 3.5, Claude 4 models
- **Google** - Gemini 2.5 models
- **OpenRouter** - Access to 100+ models through a unified API
- **Ollama** - Run models locally (Llama, Mistral, etc.)
- **Z.ai** - GLM-4 and other models

## Setup
## Quick Start

```bash
nvm use
# Install dependencies
pnpm install

# Create .env file from example
# Setup environment
cp .env.example .env
# Edit .env and add your API keys for providers you want to test
```

Then edit the `.env` file and add your API keys:

```bash
# OpenAI (optional)
OPENAI_API_KEY=your_openai_api_key_here

# Anthropic (optional)
ANTHROPIC_API_KEY=your_anthropic_api_key_here

# Google Gemini (optional)
GEMINI_API_KEY=your_gemini_api_key_here

# OpenRouter (optional)
OPENROUTER_API_KEY=your_openrouter_api_key_here
OPENROUTER_SITE_URL=https://github.com/khromov/svelte-bench # Optional
OPENROUTER_SITE_NAME=SvelteBench # Optional
OPENROUTER_PROVIDER=deepseek # Optional - preferred provider routing

# Ollama (optional - defaults to http://127.0.0.1:11434)
OLLAMA_HOST=http://127.0.0.1:11434

# Z.ai (optional)
Z_AI_API_KEY=your_z_ai_api_key_here
```

You only need to configure the providers you want to test with.

## Running the Benchmark

### Standard Execution

```bash
# Run the full benchmark (sequential execution)
pnpm start

# Run with parallel sample generation (faster)
PARALLEL_EXECUTION=true pnpm start

# Run tests only (without building visualization)
pnpm run run-tests
```

**NOTE: This will run all providers and models that are available!**
## Usage

### New CLI Interface

You can also use the new CLI interface with provider:model syntax:
### Basic Commands

```bash
# Basic syntax: pnpm start [provider:model] [options]

# Run with specific provider and model
# Run benchmark with specific model
pnpm start anthropic:claude-3-haiku

# Run with MCP tools for Svelte enhancements
# Run with MCP tools (Svelte-specific enhancements)
pnpm start google:gemini-2.5-flash --mcp

# Run with parallel execution
# Run with parallel execution (faster)
pnpm start openai:gpt-4o --parallel

# Run with context file and short flags
# Run with context file
pnpm start moonshot:kimi-k2 -m -c ./context/svelte.dev/llms-small.txt

# Show help
pnpm start --help
```

**Available Options:**
### Options

- `-h, --help` - Show help message
- `-p, --parallel` - Enable parallel execution
- `-m, --mcp` - Enable MCP tools for Svelte support
- `-h, --help` - Show help
- `-p, --parallel` - Parallel execution (faster)
- `-m, --mcp` - Enable MCP tools
- `-c, --context <file>` - Load context file

### Execution Modes

SvelteBench supports two execution modes:
### Debug Mode (legacy)

- **Sequential (default)**: Tests and samples run one at a time. More reliable with detailed progress output.
- **Parallel**: Tests run sequentially, but samples within each test are generated in parallel. Faster execution with `PARALLEL_EXECUTION=true`.
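
Parallel mode amounts to fanning out sample generation per test. The following is a minimal TypeScript sketch of that idea; `generateSample` and `Sample` are hypothetical stand-ins, not SvelteBench's actual internals:

```typescript
// Illustrative sketch only: how parallel sample generation per test could work.
type Sample = { code: string };

async function generateSample(prompt: string, index: number): Promise<Sample> {
  // In the real benchmark this would call the configured LLM provider.
  return { code: `<!-- sample ${index} for: ${prompt} -->` };
}

async function generateSamples(prompt: string, n: number, parallel: boolean): Promise<Sample[]> {
  if (parallel) {
    // Parallel mode: all samples for one test are requested at once.
    return Promise.all(Array.from({ length: n }, (_, i) => generateSample(prompt, i)));
  }
  // Sequential mode: one sample at a time, easier to follow in the logs.
  const samples: Sample[] = [];
  for (let i = 0; i < n; i++) samples.push(await generateSample(prompt, i));
  return samples;
}
```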
Use `.env` for quick development testing:

### Debug Mode

For faster development, or to run just one provider/model, you can enable debug mode in your `.env` file:

```
```bash
DEBUG_MODE=true
DEBUG_PROVIDER=anthropic
DEBUG_MODEL=claude-3-7-sonnet-20250219
DEBUG_TEST=counter
```

Debug mode runs only one provider/model combination, making it much faster for testing during development.

#### Running Multiple Models in Debug Mode

You can now specify multiple models to test in debug mode by providing a comma-separated list:

DEBUG_TEST=counter # Optional: specific test
```
DEBUG_MODE=true
DEBUG_PROVIDER=anthropic
DEBUG_MODEL=claude-3-7-sonnet-20250219,claude-opus-4-20250514,claude-sonnet-4-20250514
```

This will run tests with all three models sequentially while still staying within the same provider.

### Running with Context
Multiple models supported: `DEBUG_MODEL=model1,model2,model3`

You can provide a context file (like Svelte documentation) to help the LLM generate better components:
### Environment Variables (legacy)

```bash
# Run with a context file
pnpm run run-tests -- --context ./context/svelte.dev/llms-small.txt && pnpm run build
```

The context file will be included in the prompt to the LLM, providing additional information for generating components.
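
As a rough illustration of what "included in the prompt" means, a context file can simply be read and prepended to the test prompt before it is sent to the model. This is a minimal sketch under that assumption, not SvelteBench's actual prompt-assembly code:

```typescript
import { readFile } from "node:fs/promises";

// Minimal sketch: prepend an optional context file (e.g. Svelte docs) to a test prompt.
// The exact prompt layout used by SvelteBench may differ.
async function buildPrompt(promptPath: string, contextPath?: string): Promise<string> {
  const prompt = await readFile(promptPath, "utf8");
  if (!contextPath) return prompt;
  const context = await readFile(contextPath, "utf8");
  return `Use the following documentation as reference:\n\n${context}\n\n---\n\n${prompt}`;
}
```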

## Visualizing Results

After running the benchmark, you can visualize the results using the built-in visualization tool:

```bash
pnpm run build
```

You can now find the visualization in the `dist` directory.

## Adding New Tests

To add a new test:

1. Create a new directory in `src/tests/` with the name of your test
2. Add a `prompt.md` file with instructions for the LLM
3. Add a `test.ts` file with Vitest tests for the generated component
4. Add a `Reference.svelte` file with a reference implementation for validation

Example structure:
# Run all providers (legacy interface)
pnpm start

# Parallel execution (legacy)
PARALLEL_EXECUTION=true pnpm start
```
src/tests/your-test/
├── prompt.md # Instructions for the LLM
├── test.ts # Tests for the generated component
└── Reference.svelte # Reference implementation
```
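
For orientation, a new test's `test.ts` is a plain Vitest suite run against the generated component. The sketch below assumes `@testing-library/svelte` and a counter-style component; the actual SvelteBench harness and import paths may differ:

```typescript
import { describe, it, expect } from "vitest";
import { render, screen, fireEvent } from "@testing-library/svelte";
// Hypothetical import path: the harness decides where the generated component lives.
import Component from "./Component.svelte";

describe("counter component", () => {
  it("renders an initial count of 0", () => {
    render(Component);
    expect(screen.getByText("0")).toBeTruthy();
  });

  it("increments when the button is clicked", async () => {
    render(Component);
    await fireEvent.click(screen.getByRole("button"));
    expect(screen.getByText("1")).toBeTruthy();
  });
});
```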

## Benchmark Results

### Output Files

After running the benchmark, results are saved in multiple formats:

- **JSON Results**: `benchmarks/benchmark-results-{timestamp}.json` - Machine-readable results with pass@k metrics
- **HTML Visualization**: `benchmarks/benchmark-results-{timestamp}.html` - Interactive visualization of results
- **Individual Model Results**: `benchmarks/benchmark-results-{provider}-{model}-{timestamp}.json` - Per-model results

When running with a context file, the results filename will include "with-context" in the name.
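
As a rough mental model of the JSON output, each entry ties a provider/model/test to its samples and pass@k scores. The field names below are illustrative assumptions, not the exact schema; inspect a generated file in `benchmarks/` for the real structure:

```typescript
// Illustrative shape only; the actual schema may differ. Check a real
// benchmarks/benchmark-results-*.json file for the authoritative fields.
interface BenchmarkEntry {
  provider: string;   // e.g. "anthropic"
  model: string;      // e.g. "claude-3-haiku"
  test: string;       // e.g. "counter"
  numSamples: number; // samples generated for this test
  numCorrect: number; // samples that passed all Vitest assertions
  passAt1: number;    // pass@1 estimate
  passAt10?: number;  // pass@10 estimate, when 10 samples were generated
}
```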

### Versioning System
## Supported Providers

**Current Results**: All new benchmark runs produce current results with:
Via **Vercel AI SDK** unified interface:

- Fixed test prompts and improved error handling
- Corrected Svelte syntax examples
- Standard naming without version suffixes
- **Native SDK Providers**: OpenAI, Anthropic, Google Gemini, OpenRouter, Moonshot, Z.ai, Ollama
- **AI SDK Registry**: Azure OpenAI, xAI (Grok), Mistral, Groq, DeepSeek, Cerebras, Fireworks, Together.ai, Perplexity, DeepInfra, Cohere, Amazon Bedrock, and more

**Legacy Results (v1)**: Historical results from the original test suite with known issues in the "inspect" test prompt (stored in `benchmarks/v1/`).
See `.env.example` for API key configuration.

### Merging Results
## Results & Visualization

You can merge multiple benchmark results into a single file:
Results are automatically saved to `benchmarks/` with timestamps. Build visualization:

```bash
# Merge current results (recommended)
pnpm run merge

# Merge legacy results (if needed)
pnpm run merge-v1

# Build visualization from current results
pnpm run build

# Build visualization from legacy results
pnpm run build-v1
pnpm run build # Creates merged visualization
```

This creates merged JSON and HTML files:
Output files:
- `benchmark-results-{timestamp}.json` - Raw results with pass@k metrics
- `benchmark-results-merged.html` - Interactive visualization

- `pnpm run merge` → `benchmarks/benchmark-results-merged.{json,html}` (current results)
- `pnpm run merge-v1` → `benchmarks/v1/benchmark-results-merged.{json,html}` (legacy results)
## Test Suite

The standard build process uses current results by default.
Tests for core Svelte 5 features:

## Advanced Features
- **hello-world** - Basic component rendering
- **counter** - State management (`$state`)
- **derived** - Computed values (`$derived`)
- **derived-by** - Advanced derived state (`$derived.by`)
- **effect** - Side effects (`$effect`)
- **props** - Component props (`$props`)
- **each** - List rendering (`{#each}`)
- **snippets** - Reusable templates
- **inspect** - Debug utilities (`$inspect`)

### Checkpoint & Resume
### Adding Tests

SvelteBench automatically saves checkpoints at the sample level, allowing you to resume interrupted benchmark runs:
Create directory in `src/tests/` with:
- `prompt.md` - LLM instructions
- `test.ts` - Vitest tests
- `Reference.svelte` - Reference implementation

- Checkpoints are saved in `tmp/checkpoint/` after each sample completion
- If a run is interrupted, it will automatically resume from the last checkpoint
- Checkpoints are cleaned up after successful completion
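
The checkpoint mechanism can be pictured as writing a small progress file after every completed sample and reading it back on startup. A hypothetical sketch follows; the file layout and record fields are assumptions, not the actual `tmp/checkpoint/` format:

```typescript
import { mkdir, readFile, writeFile } from "node:fs/promises";
import { existsSync } from "node:fs";

// Hypothetical checkpoint record; the real format in tmp/checkpoint/ may differ.
interface Checkpoint {
  test: string;
  completedSamples: number;
}

const checkpointPath = (test: string) => `tmp/checkpoint/${test}.json`;

async function saveCheckpoint(cp: Checkpoint): Promise<void> {
  await mkdir("tmp/checkpoint", { recursive: true });
  await writeFile(checkpointPath(cp.test), JSON.stringify(cp));
}

async function loadCheckpoint(test: string): Promise<Checkpoint | null> {
  const path = checkpointPath(test);
  if (!existsSync(path)) return null; // no prior run to resume
  return JSON.parse(await readFile(path, "utf8"));
}
```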
## Features

### Retry Mechanism
### Checkpoint & Resume
Automatic sample-level checkpointing in `tmp/checkpoint/` - interrupted runs resume automatically.

API calls have configurable retry logic with exponential backoff. Configure in `.env`:
### HumanEval Metrics
- **pass@1** - Probability single sample passes
- **pass@10** - Probability ≥1 of 10 samples passes
- Default: 10 samples/test (1 for expensive models)

### Retry Logic
Configurable exponential backoff via `.env`:
```bash
RETRY_MAX_ATTEMPTS=3 # Maximum retry attempts (default: 3)
RETRY_INITIAL_DELAY_MS=1000 # Initial delay before retry (default: 1000ms)
RETRY_MAX_DELAY_MS=30000 # Maximum delay between retries (default: 30s)
RETRY_BACKOFF_FACTOR=2 # Exponential backoff factor (default: 2)
RETRY_MAX_ATTEMPTS=3
RETRY_INITIAL_DELAY_MS=1000
RETRY_MAX_DELAY_MS=30000
RETRY_BACKOFF_FACTOR=2
```
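
These settings map onto a standard exponential-backoff loop: wait the initial delay, multiply it by the backoff factor after each failure, cap it at the maximum delay, and give up after the maximum number of attempts. A generic sketch of that pattern (not SvelteBench's actual retry helper):

```typescript
// Generic exponential-backoff retry, driven by the RETRY_* variables above.
async function withRetry<T>(fn: () => Promise<T>): Promise<T> {
  const maxAttempts = Number(process.env.RETRY_MAX_ATTEMPTS ?? 3);
  const initialDelay = Number(process.env.RETRY_INITIAL_DELAY_MS ?? 1000);
  const maxDelay = Number(process.env.RETRY_MAX_DELAY_MS ?? 30000);
  const factor = Number(process.env.RETRY_BACKOFF_FACTOR ?? 2);

  let delay = initialDelay;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err; // out of attempts, surface the error
      await new Promise((resolve) => setTimeout(resolve, delay));
      delay = Math.min(delay * factor, maxDelay); // exponential growth, capped
    }
  }
}
```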

### Model Validation

Before running benchmarks, models are automatically validated to ensure they're available and properly configured. Invalid models are skipped with appropriate warnings.

### HumanEval Metrics

The benchmark calculates pass@k metrics based on the HumanEval methodology:

- **pass@1**: Probability that a single sample passes all tests
- **pass@10**: Probability that at least one of 10 samples passes all tests
- Default: 10 samples per test (1 sample for expensive models)
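
For reference, the HumanEval paper's unbiased estimator for pass@k, given n generated samples of which c pass, is 1 - C(n - c, k) / C(n, k). A numerically stable TypeScript version of that formula, as a sketch (SvelteBench's own implementation may be organized differently):

```typescript
// Unbiased pass@k estimator from the HumanEval paper:
// pass@k = 1 - C(n - c, k) / C(n, k), computed as a stable running product.
// n = total samples, c = samples that passed, k = samples "drawn".
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1; // every size-k draw contains at least one passing sample
  let failProb = 1;
  for (let i = n - c + 1; i <= n; i++) {
    failProb *= 1 - k / i;
  }
  return 1 - failProb;
}

// Example: 10 samples, 3 passing -> pass@1 = 0.3, pass@10 = 1.
console.log(passAtK(10, 3, 1), passAtK(10, 3, 10));
```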

### Test Verification

Verify that all tests have proper structure:
## Utility Commands

```bash
pnpm run verify
pnpm run verify # Verify test structure
pnpm run merge # Merge all results
pnpm run merge-v1 # Merge legacy results (legacy)
pnpm run build-v1 # Build legacy visualization (legacy)
```

This checks that each test has required files (prompt.md, test.ts, Reference.svelte).

## Current Test Suite

The benchmark includes tests for core Svelte 5 features:

- **hello-world**: Basic component rendering
- **counter**: State management with `$state` rune
- **derived**: Computed values with `$derived` rune
- **derived-by**: Advanced derived state with `$derived.by`
- **effect**: Side effects with `$effect` rune
- **props**: Component props with `$props` rune
- **each**: List rendering with `{#each}` blocks
- **snippets**: Reusable template snippets
- **inspect**: Debug utilities with `$inspect` rune

## Troubleshooting

### Common Issues

1. **Models not found**: Ensure API keys are correctly set in `.env`
2. **Tests failing**: Check that you're using Node.js 20+ and have run `pnpm install`
3. **Parallel execution errors**: Try sequential mode (remove `PARALLEL_EXECUTION=true`)
4. **Memory issues**: Reduce the number of samples or run in debug mode with fewer models

### Debugging

For detailed debugging, examine the generated components in the `tmp/samples/` directories and the test output in the console.

## Contributing

Contributions are welcome! Please ensure:

1. New tests include all required files (prompt.md, test.ts, Reference.svelte)
2. Tests follow the existing structure and naming conventions
3. Reference implementations are correct and pass all tests
4. Documentation is updated for new features

## License

MIT
Include an OpenRouter example