diff --git a/README.md b/README.md
index 4a8d21b..2e8554e 100644
--- a/README.md
+++ b/README.md
@@ -1,303 +1,139 @@
 # SvelteBench
 
-An LLM benchmark for Svelte 5 based on the HumanEval methodology from OpenAI's paper "Evaluating Large Language Models Trained on Code". This benchmark evaluates LLMs' ability to generate functional Svelte 5 components with proper use of runes and modern Svelte features.
+An LLM benchmark for Svelte 5 based on HumanEval methodology. Evaluates LLM-generated Svelte components through automated tests and calculates pass@k metrics.
 
-## Overview
-
-SvelteBench evaluates LLM-generated Svelte components by testing them against predefined test suites. It works by sending prompts to LLMs, generating Svelte components, and verifying their functionality through automated tests. The benchmark calculates pass@k metrics (typically pass@1 and pass@10) to measure model performance.
-
-## Supported Providers
-
-SvelteBench supports multiple LLM providers:
-
-- **OpenAI** - GPT-4, GPT-4o, o1, o3, o4 models
-- **Anthropic** - Claude 3.5, Claude 4 models
-- **Google** - Gemini 2.5 models
-- **OpenRouter** - Access to 100+ models through a unified API
-- **Ollama** - Run models locally (Llama, Mistral, etc.)
-- **Z.ai** - GLM-4 and other models
-
-## Setup
+## Quick Start
 
 ```bash
-nvm use
+# Install dependencies
 pnpm install
 
-# Create .env file from example
+# Setup environment
 cp .env.example .env
+# Edit .env and add your API keys for providers you want to test
 ```
 
-Then edit the `.env` file and add your API keys:
-
-```bash
-# OpenAI (optional)
-OPENAI_API_KEY=your_openai_api_key_here
-
-# Anthropic (optional)
-ANTHROPIC_API_KEY=your_anthropic_api_key_here
-
-# Google Gemini (optional)
-GEMINI_API_KEY=your_gemini_api_key_here
-
-# OpenRouter (optional)
-OPENROUTER_API_KEY=your_openrouter_api_key_here
-OPENROUTER_SITE_URL=https://github.com/khromov/svelte-bench # Optional
-OPENROUTER_SITE_NAME=SvelteBench # Optional
-OPENROUTER_PROVIDER=deepseek # Optional - preferred provider routing
-
-# Ollama (optional - defaults to http://127.0.0.1:11434)
-OLLAMA_HOST=http://127.0.0.1:11434
-
-# Z.ai (optional)
-Z_AI_API_KEY=your_z_ai_api_key_here
-```
-
-You only need to configure the providers you want to test with.
-
-## Running the Benchmark
-
-### Standard Execution
-
-```bash
-# Run the full benchmark (sequential execution)
-pnpm start
-
-# Run with parallel sample generation (faster)
-PARALLEL_EXECUTION=true pnpm start
-
-# Run tests only (without building visualization)
-pnpm run run-tests
-```
-
-**NOTE: This will run all providers and models that are available!**
+## Usage
 
-### New CLI Interface
-
-You can also use the new CLI interface with provider:model syntax:
+### Basic Commands
 
 ```bash
-# Basic syntax: pnpm start [provider:model] [options]
-
-# Run with specific provider and model
+# Run benchmark with specific model
 pnpm start anthropic:claude-3-haiku
 
-# Run with MCP tools for Svelte enhancements
+# Run with MCP tools (Svelte-specific enhancements)
 pnpm start google:gemini-2.5-flash --mcp
 
-# Run with parallel execution
+# Run with parallel execution (faster)
 pnpm start openai:gpt-4o --parallel
 
-# Run with context file and short flags
+# Run with context file
 pnpm start moonshot:kimi-k2 -m -c ./context/svelte.dev/llms-small.txt
 
 # Show help
 pnpm start --help
 ```
 
-**Available Options:**
+### Options
 
-- `-h, --help` - Show help message
-- `-p, --parallel` - Enable parallel execution
-- `-m, --mcp` - Enable MCP tools for Svelte support
+- `-h, --help` - Show help
+- `-p, --parallel` - Parallel execution (faster)
+- `-m, --mcp` - Enable MCP tools
 - `-c, --context ` - Load context file
 
-### Execution Modes
-
-SvelteBench supports two execution modes:
+### Debug Mode (legacy)
 
-- **Sequential (default)**: Tests and samples run one at a time. More reliable with detailed progress output.
-- **Parallel**: Tests run sequentially, but samples within each test are generated in parallel. Faster execution with `PARALLEL_EXECUTION=true`.
+Use `.env` for quick development testing:
 
-### Debug Mode
-
-For faster development, or to run just one provider/model, you can enable debug mode in your `.env` file:
-
-```
+```bash
 DEBUG_MODE=true
 DEBUG_PROVIDER=anthropic
 DEBUG_MODEL=claude-3-7-sonnet-20250219
-DEBUG_TEST=counter
-```
-
-Debug mode runs only one provider/model combination, making it much faster for testing during development.
-
-#### Running Multiple Models in Debug Mode
-
-You can now specify multiple models to test in debug mode by providing a comma-separated list:
-
+DEBUG_TEST=counter # Optional: specific test
 ```
-DEBUG_MODE=true
-DEBUG_PROVIDER=anthropic
-DEBUG_MODEL=claude-3-7-sonnet-20250219,claude-opus-4-20250514,claude-sonnet-4-20250514
-```
-
-This will run tests with all three models sequentially while still staying within the same provider.
 
-### Running with Context
+Multiple models supported: `DEBUG_MODEL=model1,model2,model3`
 
-You can provide a context file (like Svelte documentation) to help the LLM generate better components:
+### Environment Variables (legacy)
 
 ```bash
-# Run with a context file
-pnpm run run-tests -- --context ./context/svelte.dev/llms-small.txt && pnpm run build
-```
-
-The context file will be included in the prompt to the LLM, providing additional information for generating components.
-
-## Visualizing Results
-
-After running the benchmark, you can visualize the results using the built-in visualization tool:
-
-```bash
-pnpm run build
-```
-
-You can now find the visualization in the `dist` directory.
-
-## Adding New Tests
-
-To add a new test:
-
-1. Create a new directory in `src/tests/` with the name of your test
-2. Add a `prompt.md` file with instructions for the LLM
-3. Add a `test.ts` file with Vitest tests for the generated component
-4. Add a `Reference.svelte` file with a reference implementation for validation
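
The `Usage` section above boils the CLI down to a single `provider:model` argument plus a few short/long flags. As a rough illustration of that interface (not SvelteBench's actual implementation — the function name, option shape, and flag handling here are hypothetical), parsing it could look like this:

```ts
// Hypothetical sketch of parsing "provider:model" plus the documented flags.
// Illustrative only; SvelteBench's real CLI code may differ.
interface CliOptions {
  provider?: string;
  model?: string;
  parallel: boolean;
  mcp: boolean;
  contextPath?: string;
}

export function parseCliArgs(argv: string[]): CliOptions {
  const opts: CliOptions = { parallel: false, mcp: false };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "-p" || arg === "--parallel") opts.parallel = true;
    else if (arg === "-m" || arg === "--mcp") opts.mcp = true;
    else if (arg === "-c" || arg === "--context") opts.contextPath = argv[++i];
    else if (arg.includes(":")) {
      // "provider:model" — keep any extra ":" inside the model id
      const [provider, ...rest] = arg.split(":");
      opts.provider = provider;
      opts.model = rest.join(":");
    }
  }
  return opts;
}
```

For example, `parseCliArgs(["anthropic:claude-3-haiku", "--mcp"])` would yield `{ provider: "anthropic", model: "claude-3-haiku", mcp: true, parallel: false }`.
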
-
-Example structure:
+# Run all providers (legacy interface)
+pnpm start
 
+# Parallel execution (legacy)
+PARALLEL_EXECUTION=true pnpm start
 ```
-src/tests/your-test/
-├── prompt.md # Instructions for the LLM
-├── test.ts # Tests for the generated component
-└── Reference.svelte # Reference implementation
-```
-
-## Benchmark Results
-
-### Output Files
-
-After running the benchmark, results are saved in multiple formats:
-
-- **JSON Results**: `benchmarks/benchmark-results-{timestamp}.json` - Machine-readable results with pass@k metrics
-- **HTML Visualization**: `benchmarks/benchmark-results-{timestamp}.html` - Interactive visualization of results
-- **Individual Model Results**: `benchmarks/benchmark-results-{provider}-{model}-{timestamp}.json` - Per-model results
-
-When running with a context file, the results filename will include "with-context" in the name.
-### Versioning System
+## Supported Providers
 
-**Current Results**: All new benchmark runs produce current results with:
+Via **Vercel AI SDK** unified interface:
 
-- Fixed test prompts and improved error handling
-- Corrected Svelte syntax examples
-- Standard naming without version suffixes
+- **Native SDK Providers**: OpenAI, Anthropic, Google Gemini, OpenRouter, Moonshot, Z.ai, Ollama
+- **AI SDK Registry**: Azure OpenAI, xAI (Grok), Mistral, Groq, DeepSeek, Cerebras, Fireworks, Together.ai, Perplexity, DeepInfra, Cohere, Amazon Bedrock, and more
 
-**Legacy Results (v1)**: Historical results from the original test suite with known issues in the "inspect" test prompt (stored in `benchmarks/v1/`).
+See `.env.example` for API key configuration.
 
-### Merging Results
+## Results & Visualization
 
-You can merge multiple benchmark results into a single file:
+Results are automatically saved to `benchmarks/` with timestamps. Build visualization:
 
 ```bash
-# Merge current results (recommended)
-pnpm run merge
-
-# Merge legacy results (if needed)
-pnpm run merge-v1
-
-# Build visualization from current results
-pnpm run build
-
-# Build visualization from legacy results
-pnpm run build-v1
+pnpm run build # Creates merged visualization
 ```
 
-This creates merged JSON and HTML files:
+Output files:
+- `benchmark-results-{timestamp}.json` - Raw results with pass@k metrics
+- `benchmark-results-merged.html` - Interactive visualization
 
-- `pnpm run merge` → `benchmarks/benchmark-results-merged.{json,html}` (current results)
-- `pnpm run merge-v1` → `benchmarks/v1/benchmark-results-merged.{json,html}` (legacy results)
+## Test Suite
 
-The standard build process uses current results by default.
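
Each test described above pairs a `prompt.md` with a Vitest `test.ts` and a `Reference.svelte`. For orientation, here is a minimal sketch of what such a `test.ts` might contain — the component import path, the use of `@testing-library/svelte`, and the assertion are assumptions for illustration, not the repository's actual test code:

```ts
import { describe, expect, it } from "vitest";
import { render, screen } from "@testing-library/svelte";
// Hypothetical path: the benchmark is assumed to place the generated
// component next to the test before running it.
import Counter from "./Component.svelte";

describe("counter", () => {
  it("renders the initial count", () => {
    render(Counter);
    // Assumed markup; a real test would assert the behavior required by prompt.md.
    expect(screen.getByText("Count: 0")).toBeTruthy();
  });
});
```
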
+
 
-## Advanced Features
+Tests for core Svelte 5 features:
 
-### Checkpoint & Resume
+- **hello-world** - Basic component rendering
+- **counter** - State management (`$state`)
+- **derived** - Computed values (`$derived`)
+- **derived-by** - Advanced derived state (`$derived.by`)
+- **effect** - Side effects (`$effect`)
+- **props** - Component props (`$props`)
+- **each** - List rendering (`{#each}`)
+- **snippets** - Reusable templates
+- **inspect** - Debug utilities (`$inspect`)
 
-SvelteBench automatically saves checkpoints at the sample level, allowing you to resume interrupted benchmark runs:
+### Adding Tests
 
-- Checkpoints are saved in `tmp/checkpoint/` after each sample completion
-- If a run is interrupted, it will automatically resume from the last checkpoint
-- Checkpoints are cleaned up after successful completion
+Create directory in `src/tests/` with:
+- `prompt.md` - LLM instructions
+- `test.ts` - Vitest tests
+- `Reference.svelte` - Reference implementation
 
-### Retry Mechanism
+## Features
 
-API calls have configurable retry logic with exponential backoff. Configure in `.env`:
+### Checkpoint & Resume
+Automatic sample-level checkpointing in `tmp/checkpoint/` - interrupted runs resume automatically.
 
+### HumanEval Metrics
+- **pass@1** - Probability single sample passes
+- **pass@10** - Probability ≥1 of 10 samples passes
+- Default: 10 samples/test (1 for expensive models)
 
+### Retry Logic
+Configurable exponential backoff via `.env`:
 ```bash
-RETRY_MAX_ATTEMPTS=3 # Maximum retry attempts (default: 3)
-RETRY_INITIAL_DELAY_MS=1000 # Initial delay before retry (default: 1000ms)
-RETRY_MAX_DELAY_MS=30000 # Maximum delay between retries (default: 30s)
-RETRY_BACKOFF_FACTOR=2 # Exponential backoff factor (default: 2)
+RETRY_MAX_ATTEMPTS=3
+RETRY_INITIAL_DELAY_MS=1000
+RETRY_MAX_DELAY_MS=30000
+RETRY_BACKOFF_FACTOR=2
 ```
 
-### Model Validation
-
-Before running benchmarks, models are automatically validated to ensure they're available and properly configured. Invalid models are skipped with appropriate warnings.
-
-### HumanEval Metrics
-
-The benchmark calculates pass@k metrics based on the HumanEval methodology:
-
-- **pass@1**: Probability that a single sample passes all tests
-- **pass@10**: Probability that at least one of 10 samples passes all tests
-- Default: 10 samples per test (1 sample for expensive models)
-
-### Test Verification
-
-Verify that all tests have proper structure:
+## Utility Commands
 
 ```bash
-pnpm run verify
+pnpm run verify # Verify test structure
+pnpm run merge # Merge all results
+pnpm run merge-v1 # Merge legacy results (legacy)
+pnpm run build-v1 # Build legacy visualization (legacy)
 ```
 
-This checks that each test has required files (prompt.md, test.ts, Reference.svelte).
-
-## Current Test Suite
-
-The benchmark includes tests for core Svelte 5 features:
-
-- **hello-world**: Basic component rendering
-- **counter**: State management with `$state` rune
-- **derived**: Computed values with `$derived` rune
-- **derived-by**: Advanced derived state with `$derived.by`
-- **effect**: Side effects with `$effect` rune
-- **props**: Component props with `$props` rune
-- **each**: List rendering with `{#each}` blocks
-- **snippets**: Reusable template snippets
-- **inspect**: Debug utilities with `$inspect` rune
-
-## Troubleshooting
-
-### Common Issues
-
-1. **Models not found**: Ensure API keys are correctly set in `.env`
-2. **Tests failing**: Check that you're using Node.js 20+ and have run `pnpm install`
-3. **Parallel execution errors**: Try sequential mode (remove `PARALLEL_EXECUTION=true`)
-4. **Memory issues**: Reduce the number of samples or run in debug mode with fewer models
-
-### Debugging
-
-Enable detailed logging by examining the generated components in `tmp/samples/` directories and test outputs in the console.
-
-## Contributing
-
-Contributions are welcome! Please ensure:
-
-1. New tests include all required files (prompt.md, test.ts, Reference.svelte)
-2. Tests follow the existing structure and naming conventions
-3. Reference implementations are correct and pass all tests
-4. Documentation is updated for new features
-
 ## License
 
 MIT
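
The `HumanEval Metrics` section in the README above reports pass@1 and pass@10. These come from the standard unbiased HumanEval estimator, pass@k = 1 − C(n−c, k)/C(n, k) for n samples of which c pass. A small TypeScript sketch of that estimator in its numerically stable product form (SvelteBench's own code may differ in detail):

```ts
// Unbiased pass@k estimator from the HumanEval paper.
// n = samples generated for a test, c = samples that passed, k = the k in pass@k.
export function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1; // every k-sized subset contains at least one passing sample
  let failAll = 1;
  for (let i = n - c + 1; i <= n; i++) {
    failAll *= 1 - k / i; // equals C(n - c, k) / C(n, k) once the loop finishes
  }
  return 1 - failAll;
}

// Example: 10 samples with 3 passes -> passAtK(10, 3, 1) ≈ 0.3, passAtK(10, 3, 10) = 1
```
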
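The `Retry Logic` section above exposes four `RETRY_*` variables. The loop they describe is a plain exponential backoff with a delay cap; here is a sketch under those settings — illustrative only, the repository's actual retry helper may be structured differently:

```ts
// Exponential backoff driven by the documented RETRY_* environment variables.
const maxAttempts = Number(process.env.RETRY_MAX_ATTEMPTS ?? 3);
const initialDelayMs = Number(process.env.RETRY_INITIAL_DELAY_MS ?? 1000);
const maxDelayMs = Number(process.env.RETRY_MAX_DELAY_MS ?? 30000);
const backoffFactor = Number(process.env.RETRY_BACKOFF_FACTOR ?? 2);

export async function withRetry<T>(call: () => Promise<T>): Promise<T> {
  let delay = initialDelayMs;
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (error) {
      if (attempt >= maxAttempts) throw error; // out of attempts: surface the error
      await new Promise((resolve) => setTimeout(resolve, delay));
      delay = Math.min(delay * backoffFactor, maxDelayMs); // grow the delay, capped
    }
  }
}
```
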
diff --git a/suggestions.md b/suggestions.md
new file mode 100644
index 0000000..9cf6838
--- /dev/null
+++ b/suggestions.md
@@ -0,0 +1,152 @@
+# UX Improvement Suggestions
+
+## Simplification Opportunities
+
+### 1. Remove or Consolidate Legacy Interfaces
+- **Current state**: Multiple ways to run benchmarks (CLI, environment variables, DEBUG_MODE)
+- **Suggestion**: Deprecate environment variable interface (`PARALLEL_EXECUTION`, running all providers at once)
+- **Impact**: Reduces cognitive load, clearer documentation, easier onboarding
+- **Implementation**: Add deprecation warnings when using old interface, remove in next major version
+
+### 2. Simplify Provider Configuration
+- **Current state**: 30+ provider API key options in `.env.example`
+- **Suggestion**: Group providers by category with commented sections, hide rarely-used providers
+- **Impact**: Less overwhelming for new users, faster setup
+- **Implementation**:
+  - Create "Common Providers" section (OpenAI, Anthropic, Google, OpenRouter)
+  - Move AI SDK registry providers to "Advanced Providers" section
+  - Move media providers to separate section or remove if not used
+
+### 3. Default to Modern CLI Interface
+- **Current state**: `pnpm start` runs all providers (legacy behavior)
+- **Suggestion**: Make `pnpm start` show help/usage instead, require explicit provider:model
+- **Impact**: Prevents accidental expensive runs, clearer intent
+- **Implementation**: Check if args are provided, show help if not
+
+### 4. Streamline Debug Mode
+- **Current state**: Multiple DEBUG_* environment variables
+- **Suggestion**: Replace with CLI flags: `pnpm start provider:model --debug --test counter`
+- **Impact**: Consistent interface, no .env file editing needed
+- **Implementation**: Add --debug flag, integrate DEBUG_TEST into CLI
+
+## Feature Enhancements
+
+### 5. Interactive Model Selection
+- **Suggestion**: Add interactive prompt when no provider:model specified
+- **Example**:
+  ```
+  $ pnpm start
+  ? Select provider: (Use arrow keys)
+  ❯ OpenAI
+    Anthropic
+    Google
+    OpenRouter
+  ```
+- **Impact**: Better discoverability, reduced errors
+- **Tools**: inquirer or prompts npm package
+
+### 6. Quick Start Template
+- **Suggestion**: Add `pnpm run setup` that creates .env with guided prompts
+- **Example**: Ask which providers user wants, only add those API keys
+- **Impact**: Faster onboarding, less manual editing
+
+### 7. Preset Configurations
+- **Suggestion**: Add named presets for common scenarios
+- **Examples**:
+  - `pnpm start --preset fast` (uses cheapest/fastest models)
+  - `pnpm start --preset comprehensive` (runs multiple models)
+  - `pnpm start --preset local` (uses Ollama)
+- **Impact**: Easier for new users, clear use cases
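
Suggestion #5 above names `inquirer` or `prompts` as candidate libraries. A minimal sketch of such a picker using the `prompts` package — the provider list is a placeholder and the wiring into the existing CLI is left out:

```ts
import prompts from "prompts";

// Sketch of an interactive provider/model picker (suggestion #5).
export async function pickTarget(): Promise<string> {
  const { provider } = await prompts({
    type: "select",
    name: "provider",
    message: "Select provider",
    choices: [
      { title: "OpenAI", value: "openai" },
      { title: "Anthropic", value: "anthropic" },
      { title: "Google", value: "google" },
      { title: "OpenRouter", value: "openrouter" },
    ],
  });
  const { model } = await prompts({
    type: "text",
    name: "model",
    message: "Model id (e.g. gpt-4o)",
  });
  return `${provider}:${model}`; // matches the CLI's provider:model syntax
}
```
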
+
+### 8. Results Dashboard
+- **Suggestion**: Add `pnpm run dashboard` that watches for new results and auto-refreshes
+- **Impact**: Better live monitoring during long benchmarks
+- **Tools**: chokidar for file watching, live-server for auto-refresh
+
+### 9. Cost Estimation
+- **Suggestion**: Show estimated cost before running benchmark
+- **Example**: "This benchmark will use ~50k tokens, estimated cost: $0.15"
+- **Impact**: Prevents surprise API bills, informed decisions
+
+### 10. Progress Indicators
+- **Suggestion**: Add better progress visualization
+- **Current**: Text-based progress
+- **Proposed**: Use ora spinner or progress bars showing:
+  - Current test (X/Y)
+  - Current sample (X/Y)
+  - Estimated time remaining
+- **Impact**: Better user experience during long runs
+
+## Documentation Improvements
+
+### 11. Visual Quickstart Guide
+- **Suggestion**: Add diagram showing benchmark flow
+- **Content**: Prompt → LLM → Component → Tests → Results
+- **Impact**: Faster understanding for new users
+
+### 12. Video Tutorial
+- **Suggestion**: Create 2-minute screencast showing:
+  - Installation
+  - Running first benchmark
+  - Viewing results
+- **Impact**: Reduced support questions, faster adoption
+
+### 13. Example Gallery
+- **Suggestion**: Add `examples/` directory with common use cases
+- **Examples**:
+  - Compare two models
+  - Test with custom context
+  - Run specific tests only
+- **Impact**: Learning by example, reduced questions
+
+## Code Quality
+
+### 14. TypeScript Strictness
+- **Suggestion**: Enable strict mode in tsconfig.json
+- **Impact**: Catch more bugs, better IDE support
+
+### 15. Configuration Validation
+- **Suggestion**: Validate .env on startup, show clear errors
+- **Example**: "Missing OPENAI_API_KEY for provider 'openai:gpt-4'"
+- **Impact**: Faster debugging, clearer error messages
+
+### 16. Automated Setup Testing
+- **Suggestion**: Add `pnpm run doctor` that checks:
+  - Node version
+  - Dependencies installed
+  - API keys configured
+  - Test files valid
+- **Impact**: Self-service troubleshooting
+
+## Performance
+
+### 17. Smart Caching
+- **Suggestion**: Cache LLM responses by prompt hash
+- **Impact**: Faster re-runs during development, reduced costs
+- **Note**: Optional flag to disable for production benchmarks
+
+### 18. Parallel Test Execution
+- **Current**: Parallel samples within tests
+- **Suggestion**: Also parallelize tests across multiple test files
+- **Impact**: 2-3x faster benchmarks
+- **Consideration**: Resource usage, rate limits
+
+## Priority Recommendations
+
+**High Priority** (Quick wins, high impact):
+1. Remove/deprecate legacy interfaces (#1)
+2. Default to help on `pnpm start` (#3)
+3. Add cost estimation (#9)
+4. Better progress indicators (#10)
+
+**Medium Priority** (Good ROI):
+5. Interactive model selection (#5)
+6. Configuration validation (#15)
+7. Simplify provider config (#2)
+8. Quick start template (#6)
+
+**Low Priority** (Nice to have):
+9. Results dashboard (#8)
+10. Visual quickstart guide (#11)
+11. Preset configurations (#7)
+12. Example gallery (#13)
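
To make suggestion #17 concrete: caching LLM responses by prompt hash can be as small as the sketch below. The cache directory and function names are hypothetical, and a real implementation would also want the disable flag the note above mentions:

```ts
import { createHash } from "node:crypto";
import { mkdir, readFile, writeFile } from "node:fs/promises";
import { join } from "node:path";

const CACHE_DIR = join("tmp", "llm-cache"); // hypothetical location

function cacheKey(provider: string, model: string, prompt: string): string {
  return createHash("sha256").update(`${provider}:${model}\n${prompt}`).digest("hex");
}

// Wraps a caller-supplied LLM call and reuses the stored response when the
// same provider/model/prompt combination has been seen before.
export async function cachedCompletion(
  provider: string,
  model: string,
  prompt: string,
  generate: () => Promise<string>,
): Promise<string> {
  await mkdir(CACHE_DIR, { recursive: true });
  const file = join(CACHE_DIR, `${cacheKey(provider, model, prompt)}.txt`);
  try {
    return await readFile(file, "utf8"); // cache hit
  } catch {
    const result = await generate(); // cache miss: call the model
    await writeFile(file, result, "utf8");
    return result;
  }
}
```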