Local LLM Coding Benchmark

A prompt-based benchmark suite for evaluating local LLMs on frontend implementation, game development, GUI work, simulation, rendering, persistence, and larger self-contained software projects.

The tests are written as standalone prompt files. Paste a full prompt into a local model or coding agent, run the generated project exactly as requested, then judge the result against the explicit requirements in the prompt.

Most prompts intentionally forbid external dependencies, generated assets, package installs, CDNs, build steps, and network access. The goal is to test whether a model can produce runnable software from first principles inside tight local constraints.

Test Set

| File | Focus | What it evaluates |
| --- | --- | --- |
| `01_Browser_FPS_Raycasting_Game_Test.txt` | Browser FPS game | First-person rendering, collision, enemy AI, shooting, HUD, game states, and single-file execution. |
| `02_Browser_3D_Flight_Simulation_Game_Test.txt` | Browser flight simulation | Custom 3D math, aircraft controls, simplified flight physics, checkpoints, terrain, instruments, and mission flow. |
| `03_Memory_Card_Matching_Game_Test.txt` | Browser card game | Card state, shuffle logic, matching rules, localStorage best score, responsive UI, and clean restart behavior. |
| `04_Breakout_Canvas_Game_Test.txt` | Canvas arcade game | Game loop structure, collision handling, paddle control, scoring, lives, win/loss states, and restart flow. |
| `05_Personal_Expense_Tracker_App_Test.txt` | Browser CRUD app | Expense CRUD workflows, validation, localStorage persistence, filtering, summaries, theme persistence, and practical UX. |
| `06_Top_Down_2D_Driving_Game_Test.txt` | Canvas driving game | Camera movement, track sections, arcade driving controls, obstacle collision, reset behavior, and finish-line progression. |
| `07_Kanban_Board_App_Test.txt` | Browser productivity app | Drag-and-drop, task/column management, archive/restore, combined filters, persistence, keyboard fallback, and polished UX. |
| `08_Python_GUI_Photoshop_Clone_Test.txt` | Python GUI editor | Tkinter architecture, custom raster model, drawing tools, selections, layers, filters, undo/redo, and file formats. |
| `09_Python_Software_Rendered_3D_Game_Test.txt` | Python 3D game | Standard-library software renderer, 3D math, collision, entities, input, game states, performance, and HUD rendering. |
| `10_Single_File_Browser_Operating_System_Test.txt` | Browser OS | Single-file OS simulation with window manager, virtual filesystem, built-in apps, procedural audio, persistence, and two 3D games. |

How To Use

Pick a test prompt.
Paste the full prompt into the model or coding agent being evaluated.
Do not add missing requirements for the model unless you are intentionally measuring assisted iteration.
Run or inspect the generated output exactly as the prompt requires.
Score the result against the prompt's explicit requirements before considering polish or extra features.

Evaluation Workflow

For each generated project:

Confirm it uses only the technologies and dependencies allowed by the prompt.
Run it using the prompt's specified command or direct browser-open workflow.
Exercise the main user path from start to completion, including restart or reload behavior.
Check the prompt's "Before finishing" list.
Inspect the console or terminal for runtime errors.
Review code organization only after the core behavior works.
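
The first check in the workflow above (allowed technologies and dependencies) can be partly automated for browser tasks. The following is a minimal sketch, not part of the benchmark itself: it scans generated HTML for resource-loading tags that point at a remote URL, which would usually indicate a CDN or other network dependency. The function name `find_external_references` and the choice of tags in `RESOURCE_TAGS` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these are the tags treated as "loading a resource".
# Plain <a href="..."> links are deliberately ignored.
RESOURCE_TAGS = {"script", "link", "img", "iframe", "audio", "video", "source"}
EXTERNAL_PREFIXES = ("http://", "https://", "//")


class ExternalReferenceFinder(HTMLParser):
    """Collects (tag, attribute, url) triples for remote src/href values."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag not in RESOURCE_TAGS:
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                if value.strip().lower().startswith(EXTERNAL_PREFIXES):
                    self.found.append((tag, name, value))


def find_external_references(html_text):
    """Return remote resource references found in the given HTML text."""
    finder = ExternalReferenceFinder()
    finder.feed(html_text)
    finder.close()
    return finder.found
```

An empty result does not prove the output is dependency-free: URLs built in JavaScript, `fetch` calls, and CSS `url(...)` or `@import` rules are not inspected, so a manual look at the browser's network panel is still worthwhile.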

Suggested Scoring

Use a simple 0-5 score per category:

Correctness: Does the output satisfy the requested behavior?
Constraint adherence: Did it obey technology, dependency, formatting, and output restrictions?
Completeness: Are all required features present?
Code quality: Is the result readable, maintainable, and reasonably organized?
Usability: Is the final app or game usable without hidden setup?
Robustness: Does it handle restart, reload, invalid input, and ordinary edge cases without breaking?

For coding tasks, a model should not receive full credit if the project cannot run locally under the restrictions in the prompt, even if the code looks plausible. A visually impressive app with broken core behavior should score lower than a plain app that works end-to-end.
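
One possible way to record the rubric is sketched below. The six category keys mirror the list above; everything else is an assumption. In particular, the README does not define a numeric penalty for projects that fail to run, so `NON_RUNNABLE_CAP` is a hypothetical value chosen only to show how "no full credit" could be enforced.

```python
CATEGORIES = (
    "correctness",
    "constraint_adherence",
    "completeness",
    "code_quality",
    "usability",
    "robustness",
)
MAX_PER_CATEGORY = 5
# Assumption: illustrative ceiling for projects that do not run locally.
NON_RUNNABLE_CAP = 15


def total_score(scores, runs_locally):
    """Sum six 0-5 category scores, capping the total if the project cannot run."""
    missing = [c for c in CATEGORIES if c not in scores]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    for category in CATEGORIES:
        value = scores[category]
        if not isinstance(value, int) or not 0 <= value <= MAX_PER_CATEGORY:
            raise ValueError(f"{category} must be an integer from 0 to 5")
    total = sum(scores[c] for c in CATEGORIES)
    if not runs_locally:
        total = min(total, NON_RUNNABLE_CAP)
    return total
```

With every category scored 4, `total_score(scores, True)` gives 24 out of 30, while the same scores with `runs_locally=False` are capped at 15. Evaluators who prefer a different penalty can change the cap or replace it with a multiplier.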

Difficulty Curve

01-02: Advanced browser games with custom rendering or simulation.
03-07: Browser apps and 2D games focused on state, UI, persistence, and interaction.
08-09: Python standard-library challenges requiring custom data models and rendering.
10: Extreme single-file integration benchmark for architecture, UI, games, persistence, and performance.

Notes

Browser tasks generally expect plain HTML, CSS, and vanilla JavaScript, with Canvas or WebGL only where explicitly allowed.
Python tasks generally require Python 3 standard library only.
The benchmark favors reliable execution and requirement coverage over visual polish.
Extra features should not compensate for missing required behavior.
