A prompt-based benchmark suite for evaluating local LLMs on frontend implementation, game development, GUI work, simulation, rendering, persistence, and larger self-contained software projects.
The tests are written as standalone prompt files. Paste a full prompt into a local model or coding agent, run the generated project exactly as requested, then judge the result against the explicit requirements in the prompt.
Most prompts intentionally forbid external dependencies, generated assets, package installs, CDNs, build steps, and network access. The goal is to test whether a model can produce runnable software from first principles inside tight local constraints.
| File | Focus | What it evaluates |
|---|---|---|
| 01_Browser_FPS_Raycasting_Game_Test.txt | Browser FPS game | First-person rendering, collision, enemy AI, shooting, HUD, game states, and single-file execution. |
| 02_Browser_3D_Flight_Simulation_Game_Test.txt | Browser flight simulation | Custom 3D math, aircraft controls, simplified flight physics, checkpoints, terrain, instruments, and mission flow. |
| 03_Memory_Card_Matching_Game_Test.txt | Browser card game | Card state, shuffle logic, matching rules, localStorage best score, responsive UI, and clean restart behavior. |
| 04_Breakout_Canvas_Game_Test.txt | Canvas arcade game | Game loop structure, collision handling, paddle control, scoring, lives, win/loss states, and restart flow. |
| 05_Personal_Expense_Tracker_App_Test.txt | Browser CRUD app | Expense CRUD workflows, validation, localStorage persistence, filtering, summaries, theme persistence, and practical UX. |
| 06_Top_Down_2D_Driving_Game_Test.txt | Canvas driving game | Camera movement, track sections, arcade driving controls, obstacle collision, reset behavior, and finish-line progression. |
| 07_Kanban_Board_App_Test.txt | Browser productivity app | Drag-and-drop, task/column management, archive/restore, combined filters, persistence, keyboard fallback, and polished UX. |
| 08_Python_GUI_Photoshop_Clone_Test.txt | Python GUI editor | Tkinter architecture, custom raster model, drawing tools, selections, layers, filters, undo/redo, and file formats. |
| 09_Python_Software_Rendered_3D_Game_Test.txt | Python 3D game | Standard-library software renderer, 3D math, collision, entities, input, game states, performance, and HUD rendering. |
| 10_Single_File_Browser_Operating_System_Test.txt | Browser OS | Single-file OS simulation with window manager, virtual filesystem, built-in apps, procedural audio, persistence, and two 3D games. |
- Pick a test prompt.
- Paste the full prompt into the model or coding agent being evaluated.
- Do not add missing requirements for the model unless you are intentionally measuring assisted iteration.
- Run or inspect the generated output exactly as the prompt requires.
- Score the result against the prompt's explicit requirements before considering polish or extra features.
For each generated project:
- Confirm it uses only the technologies and dependencies allowed by the prompt.
- Run it using the prompt's specified command or direct browser-open workflow.
- Exercise the main user path from start to completion, including restart or reload behavior.
- Check the prompt's "Before finishing" list.
- Inspect the console or terminal for runtime errors.
- Review code organization only after the core behavior works.
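The dependency check in the list above can be partly automated for browser tasks. The following is a minimal sketch, not part of the benchmark itself: the function name `find_external_refs` and the regular expression are illustrative assumptions, and the pattern only catches `src` and `href` attributes that point at `http://`, `https://`, or protocol-relative URLs. It will also flag ordinary outbound links, so treat hits as candidates for manual review rather than automatic failures.

```python
import re

# Heuristic pattern (an assumption for illustration, not a benchmark rule):
# a src or href attribute whose value starts with http://, https://, or //.
EXTERNAL_REF = re.compile(
    r"""(?:src|href)\s*=\s*["']?\s*((?:https?:)?//[^"'\s>]+)""",
    re.IGNORECASE,
)


def find_external_refs(html: str) -> list[str]:
    """Return every URL in a src/href attribute that leaves the local file."""
    return [match.group(1) for match in EXTERNAL_REF.finditer(html)]


if __name__ == "__main__":
    sample = (
        '<script src="https://cdn.example.com/lib.js"></script>'
        '<link href="style.css">'
    )
    for url in find_external_refs(sample):
        print("external reference:", url)
```

An empty result does not prove the project is offline-safe, since scripts can still call `fetch` or build URLs at runtime. Checking the browser's network panel while exercising the app remains the more reliable test.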
Use a simple 0-5 score per category:
- Correctness: Does the output satisfy the requested behavior?
- Constraint adherence: Did it obey technology, dependency, formatting, and output restrictions?
- Completeness: Are all required features present?
- Code quality: Is the result readable, maintainable, and reasonably organized?
- Usability: Is the final app or game usable without hidden setup?
- Robustness: Does it handle restart, reload, invalid input, and ordinary edge cases without breaking?
For coding tasks, a model should not receive full credit if the project cannot run locally under the restrictions in the prompt, even if the code looks plausible. A visually impressive app with broken core behavior should score lower than a plain app that works end-to-end.
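One way to record these scores consistently is a small helper like the sketch below. It is only an illustration: the category keys, the function name `score_result`, and especially `NON_RUNNABLE_CAP` are assumptions. The benchmark says a project that cannot run locally should not receive full credit, but it does not say by how much to reduce the score, so the cap on correctness and usability here is one possible interpretation.

```python
CATEGORIES = (
    "correctness",
    "constraint_adherence",
    "completeness",
    "code_quality",
    "usability",
    "robustness",
)
MAX_PER_CATEGORY = 5

# Assumption: the benchmark does not define this number. It only states that
# a project which cannot run locally should not receive full credit.
NON_RUNNABLE_CAP = 2


def score_result(scores: dict[str, int], runs_locally: bool) -> int:
    """Validate per-category scores (0-5) and return the total out of 30."""
    missing = set(CATEGORIES) - scores.keys()
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    for name in CATEGORIES:
        if not 0 <= scores[name] <= MAX_PER_CATEGORY:
            raise ValueError(f"{name} must be between 0 and {MAX_PER_CATEGORY}")

    adjusted = {name: scores[name] for name in CATEGORIES}
    if not runs_locally:
        # Plausible-looking code that does not run cannot score well on
        # the categories that depend on actually running it.
        for name in ("correctness", "usability"):
            adjusted[name] = min(adjusted[name], NON_RUNNABLE_CAP)
    return sum(adjusted.values())


if __name__ == "__main__":
    example = dict.fromkeys(CATEGORIES, 4)
    print(score_result(example, runs_locally=True))
```

Keeping the per-category numbers alongside the total makes it easier to compare models that fail in different ways, for example one that ignores constraints versus one that omits features.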
- 01-02: Advanced browser games with custom rendering or simulation.
- 03-07: Browser apps and 2D games focused on state, UI, persistence, and interaction.
- 08-09: Python standard-library challenges requiring custom data models and rendering.
- 10: Extreme single-file integration benchmark for architecture, UI, games, persistence, and performance.
- Browser tasks generally expect plain HTML, CSS, and vanilla JavaScript, with Canvas or WebGL only where explicitly allowed.
- Python tasks generally require Python 3 standard library only.
- The benchmark favors reliable execution and requirement coverage over visual polish.
- Extra features should not compensate for missing required behavior.