psydox/Local-LLM-Coding-Benchmark

Local LLM Coding Benchmark

A prompt-based benchmark suite for evaluating local LLMs on frontend implementation, game development, GUI work, simulation, rendering, persistence, and larger self-contained software projects.

The tests are written as standalone prompt files. Paste a full prompt into a local model or coding agent, run the generated project exactly as requested, then judge the result against the explicit requirements in the prompt.

Most prompts intentionally forbid external dependencies, generated assets, package installs, CDNs, build steps, and network access. The goal is to test whether a model can produce runnable software from first principles inside tight local constraints.

Test Set

| File | Focus | What it evaluates |
| --- | --- | --- |
| 01_Browser_FPS_Raycasting_Game_Test.txt | Browser FPS game | First-person rendering, collision, enemy AI, shooting, HUD, game states, and single-file execution. |
| 02_Browser_3D_Flight_Simulation_Game_Test.txt | Browser flight simulation | Custom 3D math, aircraft controls, simplified flight physics, checkpoints, terrain, instruments, and mission flow. |
| 03_Memory_Card_Matching_Game_Test.txt | Browser card game | Card state, shuffle logic, matching rules, localStorage best score, responsive UI, and clean restart behavior. |
| 04_Breakout_Canvas_Game_Test.txt | Canvas arcade game | Game loop structure, collision handling, paddle control, scoring, lives, win/loss states, and restart flow. |
| 05_Personal_Expense_Tracker_App_Test.txt | Browser CRUD app | Expense CRUD workflows, validation, localStorage persistence, filtering, summaries, theme persistence, and practical UX. |
| 06_Top_Down_2D_Driving_Game_Test.txt | Canvas driving game | Camera movement, track sections, arcade driving controls, obstacle collision, reset behavior, and finish-line progression. |
| 07_Kanban_Board_App_Test.txt | Browser productivity app | Drag-and-drop, task/column management, archive/restore, combined filters, persistence, keyboard fallback, and polished UX. |
| 08_Python_GUI_Photoshop_Clone_Test.txt | Python GUI editor | Tkinter architecture, custom raster model, drawing tools, selections, layers, filters, undo/redo, and file formats. |
| 09_Python_Software_Rendered_3D_Game_Test.txt | Python 3D game | Standard-library software renderer, 3D math, collision, entities, input, game states, performance, and HUD rendering. |
| 10_Single_File_Browser_Operating_System_Test.txt | Browser OS | Single-file OS simulation with window manager, virtual filesystem, built-in apps, procedural audio, persistence, and two 3D games. |

How To Use

  1. Pick a test prompt.
  2. Paste the full prompt into the model or coding agent being evaluated.
  3. Do not supplement the prompt with extra requirements or hints unless you are intentionally measuring assisted iteration.
  4. Run or inspect the generated output exactly as the prompt requires.
  5. Score the result against the prompt's explicit requirements before considering polish or extra features.

Evaluation Workflow

For each generated project:

  1. Confirm it uses only the technologies and dependencies allowed by the prompt.
  2. Run it using the prompt's specified command or direct browser-open workflow.
  3. Exercise the main user path from start to completion, including restart or reload behavior.
  4. Check the prompt's "Before finishing" list.
  5. Inspect the console or terminal for runtime errors.
  6. Review code organization only after the core behavior works.

Suggested Scoring

Use a simple 0-5 score per category:

  • Correctness: Does the output satisfy the requested behavior?
  • Constraint adherence: Did it obey technology, dependency, formatting, and output restrictions?
  • Completeness: Are all required features present?
  • Code quality: Is the result readable, maintainable, and reasonably organized?
  • Usability: Is the final app or game usable without hidden setup?
  • Robustness: Does it handle restart, reload, invalid input, and ordinary edge cases without breaking?

For coding tasks, a model should not receive full credit if the project cannot run locally under the restrictions in the prompt, even if the code looks plausible. A visually impressive app with broken core behavior should score lower than a plain app that works end-to-end.
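One way to record these scores is sketched below. This is an illustration under stated assumptions rather than an official rubric: the category keys, the `total_score` function, and the `NON_RUNNABLE_CAP` value of 2 are all hypothetical. The README only says that a project that cannot run locally should not receive full credit; capping correctness and usability is one possible way to encode that.

```python
CATEGORIES = (
    "correctness",
    "constraint_adherence",
    "completeness",
    "code_quality",
    "usability",
    "robustness",
)

# Assumption: if the project cannot run locally, correctness and usability
# are capped at this value. The README does not specify a number.
NON_RUNNABLE_CAP = 2


def total_score(scores: dict[str, int], runs_locally: bool) -> int:
    """Sum 0-5 category scores, capping two categories when the
    project does not run under the prompt's restrictions."""
    if set(scores) != set(CATEGORIES):
        raise ValueError("scores must contain exactly the six categories")
    for name, value in scores.items():
        if not 0 <= value <= 5:
            raise ValueError(f"{name} must be between 0 and 5")
    adjusted = dict(scores)
    if not runs_locally:
        for name in ("correctness", "usability"):
            adjusted[name] = min(adjusted[name], NON_RUNNABLE_CAP)
    return sum(adjusted.values())


perfect = {name: 5 for name in CATEGORIES}
print(total_score(perfect, runs_locally=True))   # prints 30
print(total_score(perfect, runs_locally=False))  # prints 24
```

Keeping the per-category values alongside the total makes it easier to compare models that fail in different ways.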

Difficulty Curve

  • 01-02: Advanced browser games with custom rendering or simulation.
  • 03-07: Browser apps and 2D games focused on state, UI, persistence, and interaction.
  • 08-09: Python standard-library challenges requiring custom data models and rendering.
  • 10: Extreme single-file integration benchmark for architecture, UI, games, persistence, and performance.

Notes

  • Browser tasks generally expect plain HTML, CSS, and vanilla JavaScript, with Canvas or WebGL only where explicitly allowed.
  • Python tasks generally require Python 3 standard library only.
  • The benchmark favors reliable execution and requirement coverage over visual polish.
  • Extra features should not compensate for missing required behavior.
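For the browser tasks, a quick scan can flag obvious violations of the no-CDN, no-network constraint. The sketch below is a rough aid, not the benchmark's own tooling: `ExternalRefFinder`, `find_external_refs`, and the list of resource-loading tags are hypothetical choices. It does not catch `fetch()` calls, CSS `@import` rules, or URLs assembled in JavaScript, so the network tab of the browser's developer tools remains the more reliable check.

```python
from html.parser import HTMLParser

# Assumption: these are the tags worth checking for remote resources.
RESOURCE_TAGS = {"script", "link", "img", "iframe", "audio", "video", "source"}


class ExternalRefFinder(HTMLParser):
    """Collect (tag, url) pairs for src/href attributes that point off-page."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag not in RESOURCE_TAGS:
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                if value.lower().startswith(("http://", "https://", "//")):
                    self.external.append((tag, value))


def find_external_refs(html: str) -> list[tuple[str, str]]:
    finder = ExternalRefFinder()
    finder.feed(html)
    finder.close()
    return finder.external


page = (
    '<script src="https://cdn.example.com/lib.js"></script>'
    '<img src="sprite.png">'
    '<link rel="stylesheet" href="//fonts.example.com/a.css">'
)
for tag, url in find_external_refs(page):
    print(tag, url)
```

Relative paths such as `sprite.png` are not flagged here, although most prompts also forbid separate asset files, so they deserve a manual check as well.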

About

A benchmark suite of prompt-based coding and reasoning tests for evaluating local LLMs, from simple instruction-following tasks to complex browser, Python GUI, and software-rendered 3D projects.
