loreval

LLM Logic and Reasoning Evaluation — a CLI for measuring language model performance on grid-based spatial puzzles.

Two evaluation dimensions:

  • Solving — given a puzzle it hasn't seen before, can the model produce a valid move sequence?
  • Designing — given a set of constraints, can the model produce a well-formed, solvable puzzle?

Outcomes are discrete and verifiable. A puzzle is either solved or it isn't. A generated puzzle is either solvable or it isn't. No scoring rubrics or LLM-as-judge.


How it works

Puzzles are defined in a small text-based DSL (.lrev files). A level specifies a grid of tiles, a legend mapping characters to tile types, and agent declarations with start positions and goals.

level "Switch Puzzle" 7x3

grid = [
  W W W W W W W,
  W R S R D G W,
  W W W W W W W,
]

tile S = tiles.switch(blue)
tile D = tiles.door(blue)
tile G = tiles.goal(orange)

agent(orange) start(1,1) and reach(5,1)

The engine parses the DSL into a game state and executes moves according to fixed rules: walls block, doors open when a matching switch is hit, paint tiles change agent color, one-way tiles restrict direction, agents block each other like walls.
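
To make the rules concrete, here is a minimal sketch of how a single move could be resolved. Everything in it (the Tile, Agent, and State classes and the step function) is illustrative only, not the actual engine API:

from dataclasses import dataclass, field

DIRS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

@dataclass
class Tile:
    kind: str                 # e.g. "floor", "wall", "void", "door", "switch", "paint", "one_way", "lock", "goal"
    color: str = ""
    direction: str = ""       # only meaningful for one-way tiles

@dataclass
class Agent:
    x: int
    y: int
    color: str

@dataclass
class State:
    grid: list                # grid[y][x] -> Tile
    agents: list              # list[Agent]
    open_doors: set = field(default_factory=set)

def step(state: State, agent: Agent, direction: str) -> bool:
    """Try to move one agent one tile; return False if the move is blocked."""
    dx, dy = DIRS[direction]
    nx, ny = agent.x + dx, agent.y + dy
    tile = state.grid[ny][nx]

    if tile.kind in ("wall", "void"):
        return False                                   # walls and void always block
    if tile.kind == "door" and tile.color not in state.open_doors:
        return False                                   # closed doors block until their switch fires
    if tile.kind == "lock" and agent.color != tile.color:
        return False                                   # locks admit only matching-color agents
    if tile.kind == "one_way" and tile.direction != direction:
        return False                                   # one-way tiles admit a single entry direction
    if any(a is not agent and (a.x, a.y) == (nx, ny) for a in state.agents):
        return False                                   # agents block each other like walls

    agent.x, agent.y = nx, ny
    if tile.kind == "switch":
        state.open_doors.add(tile.color)               # stepping on a switch opens matching doors
    elif tile.kind == "paint":
        agent.color = tile.color                       # paint recolors the stepping agent
    return True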

For solving: the DSL is sent to a Claude model with a system prompt describing the rules. The model returns a move sequence. The engine replays it and checks whether all goal agents reached their targets.
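
In outline, a solve run looks roughly like the following. The Anthropic SDK calls are real, but the system prompt wording, the move format, and the replay helpers are assumptions, not loreval's actual prompts.py or client.py code:

import anthropic

# Assumed prompt wording and move format, for illustration only.
SYSTEM = ("You are given a grid puzzle in a small DSL. "
          "Reply with one move per line: agent color, then up/down/left/right.")

def solve_once(dsl_text: str, model: str = "claude-sonnet-4-6") -> list[str]:
    client = anthropic.Anthropic()                     # reads ANTHROPIC_API_KEY from the environment
    reply = client.messages.create(
        model=model,
        max_tokens=2000,
        system=SYSTEM,
        messages=[{"role": "user", "content": dsl_text}],
    )
    return [line.strip() for line in reply.content[0].text.splitlines() if line.strip()]

# The engine then replays the moves and checks the win condition, e.g.:
# moves = solve_once(open("puzzle.lrev").read())
# passed = replay_and_check(level, moves)              # hypothetical replay helper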

For designing: the model is given a difficulty level, grid size, and required mechanics, and returns a puzzle in the DSL. The engine parses it, validates it with a BFS solvability check, and reports which mechanics the solution actually uses and which are only decorative.
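
Conceptually, the solvability check is a breadth-first search over reachable game states, cut off at a state cap (200k in validator.py). A simplified sketch, reusing the hypothetical State, DIRS, and step names from the earlier snippet and a deliberately simplified win condition:

import copy
from collections import deque

def encode(state: State) -> tuple:
    # Hashable snapshot: agent positions and colors plus the set of opened doors.
    return (tuple((a.x, a.y, a.color) for a in state.agents), frozenset(state.open_doors))

def all_goals_reached(state: State) -> bool:
    # Simplified win check: every agent stands on a goal tile of its own color.
    return all(state.grid[a.y][a.x].kind == "goal" and state.grid[a.y][a.x].color == a.color
               for a in state.agents)

def is_solvable(start: State, state_cap: int = 200_000):
    """BFS over game states: returns the shortest move count, None if the state cap
    is hit (undecided), or -1 if the search space is exhausted without a solution."""
    seen = {encode(start)}
    queue = deque([(start, 0)])
    while queue:
        state, depth = queue.popleft()
        if all_goals_reached(state):
            return depth
        for i in range(len(state.agents)):
            for direction in DIRS:
                nxt = copy.deepcopy(state)
                if not step(nxt, nxt.agents[i], direction):
                    continue
                key = encode(nxt)
                if key in seen:
                    continue
                if len(seen) >= state_cap:
                    return None
                seen.add(key)
                queue.append((nxt, depth + 1))
    return -1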

The loop command combines both: one model designs, another solves. This directly tests the core research question — can a model solve what it designs?


Tile types

DSL Behavior
W Wall — always impassable
R Floor — always passable
. Void — out of bounds
tiles.goal(color) Goal — agent wins by stepping on it (color must match)
tiles.commonGoal() Universal goal — any designated agent may claim it
tiles.door(color) Blocked until a matching switch is activated
tiles.switch(color) Opens all doors of matching color when stepped on
tiles.paint(color) Changes the stepping agent's color
tiles.one-way(dir) Passable only when entering from dir (up/down/left/right)
tiles.lock(color) Blocked until a matching-color agent steps on it

Install

git clone https://github.com/integral-quality/loreval
cd loreval
pip install -e .

API key — copy .env.example to .env and add your Anthropic key:

cp .env.example .env
# paste ANTHROPIC_API_KEY=sk-ant-... into .env

Or set it as an environment variable directly:

export ANTHROPIC_API_KEY=sk-ant-...

Commands

validate — check a puzzle file

Parses the DSL and runs a BFS solvability check. Reports which mechanics the solution actually uses and which are only decorative.

loreval validate puzzle.lrev

✓ Parsed  Switch Puzzle  (7×3)
  Agents with goal     1
  Helper agents        0
  Mechanics present    door, goal, switch

Checking solvability…
✓ Solvable  (4 moves)
  Mechanics used in solution: door, goal, switch

solve — run a model against a puzzle

Sends the DSL to the model, replays the returned move sequence, and reports pass/fail with move count and first failure step.

loreval solve puzzle.lrev
loreval solve puzzle.lrev --model claude-haiku-4-5-20251001 --runs 5

Switch Puzzle  (7×3)  model=claude-sonnet-4-6  runs=3

  Run 1: ✓ pass  (4 moves)
  Run 2: ✓ pass  (4 moves)
  Run 3: ✗ fail  (6 moves)  first failure at move 3

Pass rate: 2/3

Options:

Flag Default Description
--model, -m claude-sonnet-4-6 Model to use
--runs, -r 1 Number of attempts

design — ask a model to create a puzzle

The model generates a DSL given difficulty, size, and required mechanics. The output is parsed and validated automatically.

loreval design --difficulty medium --size 8x6 --mechanics switch,door
loreval design --difficulty hard --size 10x10 --mechanics paint,lock,multi-agent --out puzzle.lrev
loreval design --prompt "a puzzle where two agents must coordinate" --size 8x8 --out collab.lrev

✓ Parsed  Corridor Lockdown  (8×6)
Checking solvability…
✓ Solvable  (18 moves)
  Mechanics used: door, goal, switch

Design notes:
The orange agent must first activate the blue switch at (3,2) before it
can pass through the door at (5,2). The indirect route forces planning
ahead rather than a direct path to the goal.

✓ Saved → puzzle.lrev

Options:

Flag Default Description
--model, -m claude-sonnet-4-6 Model to use
--difficulty, -d medium easy, medium, or hard
--size, -s 8x8 Grid size as WxH
--mechanics (none) Comma-separated: switches, doors, paint, one-way, locks, multi-agent
--prompt, -p (none) Free-form design brief
--out, -o (print to stdout) Save output to .lrev file

loop — design then solve

One model designs a puzzle; another model tries to solve it. The most direct test of the core research question.

loreval loop
loreval loop --designer claude-opus-4-6 --solver claude-haiku-4-5-20251001 --difficulty hard --runs 5
loreval loop --designer claude-sonnet-4-6 --solver claude-sonnet-4-6 --out-dir results/

Loop evaluation
  Designer: claude-opus-4-6
  Solver:   claude-haiku-4-5-20251001
  Puzzle:   hard  10×10  mechanics=any
  Runs:     5

✓ Designed: The Relay  (10×10)
  Validated solvable (31 moves)

Solver runs (5):
  Run 1: ✓ pass  (31 moves)
  Run 2: ✗ fail  (28 moves)  first failure at move 14
  Run 3: ✗ fail  (35 moves)  first failure at move 14
  Run 4: ✓ pass  (31 moves)
  Run 5: ✗ fail  (30 moves)  first failure at move 14

Result:  2/5 solved
Design solvable: yes

Options:

Flag Default Description
--designer claude-sonnet-4-6 Model that designs the puzzle
--solver claude-sonnet-4-6 Model that solves the puzzle
--difficulty medium easy, medium, or hard
--size 8x8 Grid size
--mechanics (none) Required mechanics
--runs 3 Solve attempts per designed puzzle
--out-dir (none) Save designed puzzles to directory

eval — batch evaluation

Runs solve or design evaluations over a directory of .lrev files and writes results to CSV.

# solve: run all .lrev files in benchmark/
loreval eval benchmark/ --model claude-haiku-4-5-20251001 --runs 3 --out results.csv

# design: generate N puzzles and validate each
loreval eval benchmark/ --task design --model claude-sonnet-4-6 --runs 10 --out design-results.csv

Output columns (solve):

Column Description
file Puzzle filename
model Model used
run Attempt number
outcome pass, fail, or error
moves Total moves in the model's response
first_failure_step Index of first blocked/wrong-position move
failure_reason wrong_position, blocked, incomplete, or empty

Output columns (design):

Column Description
parse_success Whether the DSL parsed without errors
is_solvable True, False, or None (state cap hit)
mechanics_required Pipe-separated mechanics used in solution
mechanics_decorative Pipe-separated mechanics not needed to solve
solution_length BFS solution move count

Options:

Flag Default Description
--model, -m claude-sonnet-4-6 Model to use
--task, -t solve solve or design
--runs, -r 1 Attempts per file (solve) or puzzles to generate (design)
--out, -o results.csv Output CSV path
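
The CSV is plain enough to post-process directly. A small sketch that computes a per-puzzle pass rate from a solve run, assuming only the column names listed above (the file path is whatever you passed to --out):

import csv
from collections import defaultdict

passes, totals = defaultdict(int), defaultdict(int)
with open("results.csv", newline="") as fh:
    for row in csv.DictReader(fh):
        totals[row["file"]] += 1
        if row["outcome"] == "pass":
            passes[row["file"]] += 1

for name in sorted(totals):
    print(f"{name}: {passes[name]}/{totals[name]} solved")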

Supported models

ID Name
claude-haiku-4-5-20251001 Claude Haiku 4.5
claude-sonnet-4-6 Claude Sonnet 4.6 (default)
claude-opus-4-6 Claude Opus 4.6

Project structure

loreval/
  engine/
    models.py     dataclasses: Level, Tile, Entity, GameState
    parser.py     .lrev text → Level
    game.py       move execution, passability, win condition
    validator.py  BFS solvability check (cap: 200k states)
  ai/
    prompts.py    solve + design system prompts, message builders, output parsers
    client.py     Anthropic SDK wrapper
  cli/
    validate.py   loreval validate
    solve.py      loreval solve
    design.py     loreval design
    loop.py       loreval loop
    eval.py       loreval eval
