♟ Zugzwang

A reproducible research engine for pushing LLMs to their limits in chess

"Zugzwang" — the chess position where every move you make worsens your situation. We use this as a crucible: can LLMs reason their way out?

What is Zugzwang?

Zugzwang is a modular, reproducible research platform for studying how far Large Language Models can be pushed in chess using only prompt engineering, RAG, few-shot learning, chain-of-thought, tool-use, and multi-agent orchestration — no fine-tuning.

Chess is used not as the end goal, but as a microscope. The structured, verifiable nature of chess makes it an ideal domain for rigorously measuring the gap between raw LLM capability and augmented performance, move by move.

This project builds on the LLM Chess benchmark (Kolasani, Saplin et al., NeurIPS FoRLM 2025, arXiv:2512.01992), the definitive framework for evaluating LLMs through chess play, by systematically exploring the techniques that paper identified as gaps: structured prompting, few-shot calibration, retrieval-augmented generation, and mixture-of-agents orchestration.

The Research Question

Using only LLM manipulation techniques — system prompts, RAG, few-shot, chain-of-thought, tool-use, multi-agent orchestration — and without fine-tuning any model, how far can a general-purpose LLM be pushed in chess?

Motivation & Background

The LLM Chess paper (Kolasani, Saplin et al., 2025) established that:

Most LLMs cannot beat a random player — they fail at instruction-following, not chess per se
Only reasoning-enhanced models (o3, o4-mini, Grok 3 Mini) reliably win against random play
The best model tested (o3 low) reaches only Elo ~758 against a calibrated engine — barely above the average chess.com player
FEN format outperforms Unicode boards by up to +21.7 pp for some models
Providing move history reduces blunders dramatically (11.2% → 1.6% for o4-mini)
Mixture-of-Agents combining strong-reasoning + strong-instruction-following models can double win rates and achieve 100% game completion

However, that benchmark used a simple, generic prompt with no few-shot examples, no RAG, no structured chain-of-thought, and no feedback-rich retry loop. Zugzwang is built to fill those gaps, rigorously and reproducibly.

Additional foundations:

GPT-3.5-turbo-instruct plays at ~1750 Elo when fed raw PGN, suggesting LLMs have latent chess knowledge that instruction tuning suppresses (Carlini, 2023)
Three trivial few-shot examples dramatically improve GPT-4o's chess performance (Dynomight, 2024)
Chess-playing transformers develop linear world models of board state (Karvonen, 2024)
LLMs fail at chess primarily due to knowledge access, not reasoning capacity (arXiv:2507.00726)

Architecture

Zugzwang is built in eight progressive layers (Layer 0 through Layer 7), each independently testable:

Layer 0 — Infrastructure      Config loading, secret management, env validation
Layer 1 — Core Game Engine     BoardManager, game loop, LLM/Random/Engine players
Layer 2 — Evaluation           Stockfish scoring, move quality, Elo MLE estimation
Layer 3 — Strategy             Prompt library, context assembly, few-shot, validation
Layer 4 — Knowledge / RAG      Phase-aware retrieval: openings, tactics, endgames
Layer 5 — Multi-Agent          Capability-MoA, specialist agents, hybrid phase router
Layer 6 — Experiment Runner    Batch execution, resume, budget guardrails, scheduling
Layer 7 — Analysis             Statistics, plots, reports, React dashboard
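
Layer 2 mentions Elo MLE estimation. As a rough illustration of the idea (not the project's actual implementation), the sketch below fits a single rating by maximum likelihood under the standard logistic Elo model. The function names, the search bracket, and the choice to count a draw as half a point are all assumptions made for this example.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard logistic Elo expectation for `rating` against `opponent`."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))


def estimate_elo_mle(
    games: list[tuple[float, float]],
    lo: float = -1000.0,
    hi: float = 4000.0,
    tol: float = 1e-6,
) -> float:
    """Estimate a rating from (opponent_rating, score) pairs, score in {0, 0.5, 1}.

    The log-likelihood gradient is proportional to sum(score - expected),
    which decreases strictly as the rating grows, so bisection finds its root.
    """
    total = sum(score for _, score in games)
    if total <= 0 or total >= len(games):
        raise ValueError("MLE is unbounded for all-win or all-loss records")

    def surplus(rating: float) -> float:
        # Points actually scored minus points the model expects at this rating.
        return total - sum(expected_score(rating, opp) for opp, _ in games)

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if surplus(mid) > 0:
            lo = mid  # scored more than expected: the rating must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0
```

An all-win or all-loss record has no finite maximum-likelihood rating, which is why the sketch raises instead of returning the edge of the bracket.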

Key design invariants:

No illegal move is ever applied to the board
Stockfish evaluation is never exposed to the LLM during live play
Every game artifact is self-contained and reproducible from its seed
Config is immutable after an experiment starts
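
To make the seed invariant concrete, here is a minimal sketch of one way local randomness (for example, a random opponent) could be made reproducible per game. The helper names `game_seed` and `random_choices` are hypothetical and not taken from the codebase, and the sketch says nothing about the reproducibility of remote LLM sampling.

```python
import hashlib
import random


def game_seed(run_seed: int, game_index: int) -> int:
    """Derive a stable per-game seed from the run seed and the game number.

    hashlib is used instead of hash() because hash() of a string changes
    between Python processes.
    """
    digest = hashlib.sha256(f"{run_seed}:{game_index}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def random_choices(seed: int, options: list[str], n: int) -> list[str]:
    """Draw n items using a private RNG so global random state is untouched."""
    rng = random.Random(seed)
    return [rng.choice(options) for _ in range(n)]
```

Because each game gets its own generator, replaying game 7 does not require replaying games 1 through 6 first.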

Repository Layout

zugzwang-engine/
├── zugzwang/
│   ├── core/           # BoardManager, game loop, players, protocol
│   ├── providers/      # z.ai, GPT, Claude, Gemini, Grok, DeepSeek, Kimi, MiniMax, mock
│   ├── evaluation/     # Stockfish, move quality, Elo, metrics
│   ├── strategy/       # Prompts, context assembler, few-shot, validator
│   ├── knowledge/      # RAG: indexer, retriever, embeddings, vectordb
│   │   └── sources/    #   ECO openings, Lichess heuristics, endgames
│   ├── agents/         # Capability MoA, tactical, positional, endgame, critic
│   ├── experiments/    # Runner, scheduler, tracker, resume
│   ├── analysis/       # Statistics, plots, reports
│   └── api/            # FastAPI layer for UI/backend integration
├── zugzwang-ui/        # Vite + React + TypeScript frontend
├── configs/
│   ├── defaults.yaml
│   ├── baselines/      # benchmark_compat.yaml, best_known_start.yaml
│   ├── ablations/      # Experiment condition configs
│   └── models/         # Per-model overrides
├── data/               # ECO, puzzles, annotated games, vectordb (gitignored)
├── results/            # Run artifacts and reports (gitignored)
└── tests/

Quickstart

Prerequisites: Python 3.11+, a provider API key (or use the mock provider for local tests), and optionally Stockfish for evaluation.

# Install
pip install -e ".[dev]"   # quoted so shells such as zsh do not expand the brackets

# Validate environment
zugzwang env-check --config configs/baselines/best_known_start.yaml

# Dry run (no API calls, no games)
zugzwang run --config configs/baselines/best_known_start.yaml --dry-run

# Play a single game
zugzwang play --config configs/baselines/best_known_start.yaml

# Run a full experiment (30 games, saves artifacts to results/)
zugzwang run --config configs/baselines/best_known_start.yaml

# Evaluate move quality with Stockfish
zugzwang evaluate --run-dir results/runs/<run-id>

# Start the API server
zugzwang api

Environment Setup

Copy .env.example to .env and fill in your API keys:

cp .env.example .env
# Edit .env and set provider keys (e.g., ZAI_API_KEY / OPENAI_API_KEY / ANTHROPIC_API_KEY / GEMINI_API_KEY / XAI_API_KEY / DEEPSEEK_API_KEY / MOONSHOT_API_KEY / KIMI_CODE_API_KEY / MINIMAX_API_KEY)
# For Stockfish: set STOCKFISH_PATH=/path/to/stockfish

CLI Reference

Command	Description
`zugzwang run --config <path>`	Run a full experiment
`zugzwang run --config <path> --dry-run`	Validate config without running
`zugzwang run --config <path> --resume`	Resume latest matching run
`zugzwang run --config <path> --resume-run-id <id>`	Resume specific run
`zugzwang play --config <path>`	Play a single game interactively
`zugzwang env-check --config <path>`	Validate provider credentials
`zugzwang evaluate --run-dir <path>`	Post-run Stockfish evaluation
`zugzwang api`	Start the API server (port 8000)

Config Overrides

Any config key can be overridden inline with --set:

zugzwang play --config configs/baselines/best_known_start.yaml \
  --set players.black.model=claude-opus-4-5 \
  --set strategy.board_format=fen \
  --set strategy.few_shot.enabled=true \
  --set strategy.few_shot.num_examples=3
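
The dotted `--set key=value` syntax can be pictured as a merge into a nested dictionary. The sketch below is an illustration only: `apply_overrides` and `parse_value` are hypothetical names, and parsing values as JSON scalars is an assumption about how types might be inferred.

```python
import copy
import json


def parse_value(raw: str):
    """Read 'true', '3', or '0.5' as JSON scalars; otherwise keep the string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of config with each 'a.b.c=value' override applied."""
    result = copy.deepcopy(config)  # leave the caller's config untouched
    for item in overrides:
        dotted, _, raw = item.partition("=")
        *parents, leaf = dotted.split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})  # create missing sections on the way
        node[leaf] = parse_value(raw)
    return result
```

Working on a deep copy keeps the base config unchanged, which fits the invariant that config is fixed once an experiment starts.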

Key Features

Two Baselines

Baseline	Config	Purpose
`benchmark_compat`	`configs/baselines/benchmark_compat.yaml`	Faithful reproduction of LLM Chess protocol
`best_known_start`	`configs/baselines/best_known_start.yaml`	Direct mode + FEN + legal moves + history (best empirical config)

Strategy Pipeline

The strategy block controls everything the LLM sees:

strategy:
  board_format: fen          # fen | ascii | combined | unicode (default: fen)
  provide_legal_moves: true
  provide_history: last_n
  history_length: 10
  few_shot:
    enabled: true
    num_examples: 3
    phase_specific: true     # Different examples per opening/middlegame/endgame
  validation:
    enabled: true
    max_retries: 3
    feedback_level: rich     # minimal | moderate | rich
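
The `validation` block describes a retry loop with graded feedback. A minimal sketch of that loop follows, with loud assumptions: `ask_model` stands in for any provider call, the wording of each feedback level is invented for the example, and moves are compared as plain strings against a precomputed legal-move list.

```python
from typing import Callable, Optional


def build_feedback(level: str, attempt: str, legal_moves: list[str]) -> str:
    """Compose a retry message; the text per level is an assumption."""
    if level == "minimal":
        return "Illegal move. Try again."
    if level == "moderate":
        return f"'{attempt}' is not a legal move. Try again."
    # "rich": name the rejected move and restate the legal options.
    return f"'{attempt}' is not a legal move. Legal moves: {', '.join(legal_moves)}"


def choose_move(
    ask_model: Callable[[str], str],
    prompt: str,
    legal_moves: list[str],
    max_retries: int = 3,
    feedback_level: str = "rich",
) -> Optional[str]:
    """Ask once, then retry up to max_retries times; None if every try fails."""
    message = prompt
    for _ in range(1 + max_retries):
        reply = ask_model(message).strip()
        if reply in legal_moves:
            return reply  # only a validated move ever reaches the board
        message = prompt + "\n" + build_feedback(feedback_level, reply, legal_moves)
    return None
```

Returning `None` rather than a fallback move leaves the caller to decide how a failed turn ends the game.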

RAG (Phase 4 — Available)

Phase-aware knowledge retrieval from local deterministic sources:

zugzwang play --config configs/baselines/best_known_start.yaml \
  --set strategy.rag.enabled=true \
  --set strategy.rag.max_chunks=3 \
  --set strategy.rag.include_sources.eco=true \
  --set strategy.rag.include_sources.lichess=true \
  --set strategy.rag.include_sources.endgames=true

Sources: ECO opening principles, Lichess tactical/positional heuristics, endgame theory.

RAG ablation config: configs/ablations/rag_variants.yaml
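
To illustrate what "phase-aware" retrieval could mean, the sketch below guesses the game phase from a FEN string and filters chunks by source. Everything here is an assumption for illustration: the piece-count and move-number thresholds, the phase-to-source mapping, and the plain filter that stands in for the embedding-based retriever.

```python
def game_phase(fen: str) -> str:
    """Rough phase heuristic based on material left and the fullmove number."""
    fields = fen.split()
    placement, fullmove = fields[0], int(fields[5])
    # Count every piece and pawn except the two kings.
    pieces = sum(1 for ch in placement if ch.isalpha() and ch not in "Kk")
    if pieces <= 10:
        return "endgame"
    if fullmove <= 12:
        return "opening"
    return "middlegame"


# Hypothetical mapping from phase to the source names used in the --set flags.
PHASE_SOURCES = {"opening": "eco", "middlegame": "lichess", "endgame": "endgames"}


def retrieve(fen: str, chunks: list[dict], max_chunks: int = 3) -> list[str]:
    """Return up to max_chunks texts from the source matching the phase."""
    source = PHASE_SOURCES[game_phase(fen)]
    return [c["text"] for c in chunks if c["source"] == source][:max_chunks]
```

Selecting by phase before ranking keeps endgame theory out of opening prompts, where it would only consume context.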

Multi-Agent Modes (Phase 5 — Available)

Mixture-of-Agents orchestration currently supports three config-level modes:

capability_moa: capability-style proposers (e.g. reasoning/compliance/safety)
specialist_moa: specialist proposers (tactical/positional/endgame)
hybrid_phase_router: proposer roles routed by game phase

zugzwang play --config configs/baselines/best_known_start.yaml \
  --set strategy.multi_agent.enabled=true \
  --set strategy.multi_agent.mode=capability_moa \
  --set strategy.multi_agent.proposer_count=2
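
The proposer/aggregator pattern behind these modes can be sketched in a few lines. This example swaps in a simple majority vote where the real aggregator role would run; the role names, the `Proposer` type, and the tie-breaking rule are assumptions, not the project's API.

```python
from collections import Counter
from typing import Callable, Optional

# A proposer takes a position (FEN) and returns one candidate move.
Proposer = Callable[[str], str]


def aggregate_by_vote(
    fen: str,
    proposers: dict[str, Proposer],
    legal_moves: list[str],
) -> tuple[Optional[str], dict[str, str]]:
    """Collect one proposal per role and pick the most popular legal move.

    Returns (move, trace), where trace records what each role proposed.
    """
    trace = {role: propose(fen).strip() for role, propose in proposers.items()}
    votes = Counter(move for move in trace.values() if move in legal_moves)
    if not votes:
        return None, trace  # every proposal was illegal
    # On a tie, most_common keeps the move that was proposed first.
    best, _ = votes.most_common(1)[0]
    return best, trace
```

Keeping the per-role trace alongside the chosen move is what lets a replay view show which agent argued for what.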

Available ablation configs:

configs/ablations/moa_capability.yaml
configs/ablations/moa_specialist.yaml
configs/ablations/moa_hybrid_phase.yaml

Role-level model routing (same provider, per-role model overrides):

zugzwang play --config configs/baselines/best_known_start.yaml \
  --set strategy.multi_agent.enabled=true \
  --set strategy.multi_agent.mode=hybrid_phase_router \
  --set strategy.multi_agent.provider_policy=role_model_overrides \
  --set strategy.multi_agent.role_models.aggregator=mock-1

Budget & Reliability Guardrails

budget:
  max_total_usd: 5.00                         # Hard stop
  estimated_avg_cost_per_game_usd: 0.55       # For projected stop

runtime:
  timeout_policy:
    enabled: true
    min_games_before_enforcement: 5
    max_provider_timeout_game_rate: 0.30      # Stop if >30% games timeout
    min_observed_completion_rate: 0.60
    action: stop_run
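
One plausible reading of these guardrails is sketched below. It assumes the projected stop fires when one more average-cost game would exceed the cap, and that the timeout policy stays silent until the minimum number of games has been played. The function and class names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    max_total_usd: float
    estimated_avg_cost_per_game_usd: float


def budget_decision(budget: Budget, spent_usd: float) -> str:
    """Return 'hard_stop', 'projected_stop', or 'continue'."""
    if spent_usd >= budget.max_total_usd:
        return "hard_stop"
    # Assumed rule: stop early if the next average game would break the cap.
    if spent_usd + budget.estimated_avg_cost_per_game_usd > budget.max_total_usd:
        return "projected_stop"
    return "continue"


def timeout_policy_triggered(
    games_done: int,
    timed_out: int,
    completed: int,
    min_games: int = 5,
    max_timeout_rate: float = 0.30,
    min_completion_rate: float = 0.60,
) -> bool:
    """True when the run should stop for reliability reasons."""
    if games_done < min_games:
        return False  # too few games to judge the rates
    timeout_rate = timed_out / games_done
    completion_rate = completed / games_done
    return timeout_rate > max_timeout_rate or completion_rate < min_completion_rate
```

With the values shown in the config, a run that has spent 4.50 USD would stop before starting another game, since 4.50 + 0.55 exceeds the 5.00 cap.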

Engine Player (UCI)

Play against Stockfish with native UCI Elo strength:

zugzwang play --config configs/baselines/best_known_start.yaml \
  --set players.white.type=engine \
  --set players.white.uci_limit_strength=true \
  --set players.white.uci_elo=1600

z.ai / GLM-5 Integration

zugzwang env-check --config configs/baselines/best_known_start_zai_glm5.yaml
zugzwang play --config configs/baselines/best_known_start_zai_glm5.yaml

Frontend - FastAPI + React

The project uses a split architecture: a FastAPI API server over Python services and a Vite + React + TypeScript frontend in zugzwang-ui/.

Start the API server:

pip install -e ".[api]"
zugzwang api                         # serves on localhost:8000
zugzwang api --reload                # dev mode with auto-reload

In development, run the frontend separately:

cd zugzwang-ui && npm install && npm run dev   # Vite on localhost:5173

In production, zugzwang api serves the built frontend as static files — single process, single port.

Frontend pages:

Page	Route	Description
Dashboard	`/dashboard`	Active jobs, recent runs, total spend
Run Lab	`/lab`	Configure, validate, and launch experiments
Job Monitor	`/dashboard/jobs/:id`	Live log streaming (SSE), progress bar, cancel
Run Explorer	`/runs`	Browse all runs, filter, sort
Run Detail	`/runs/:id`	Metrics tabs, move quality, config, evaluate
Game Replay	`/runs/:id/game/:n`	Board replay, per-ply metrics, MoA agent trace
Compare	`/compare`	Side-by-side run comparison with overlaid charts
Settings	`/settings`	Provider env check status

Stack: FastAPI · Uvicorn · Vite · React 19 · TypeScript · TanStack Router · TanStack Query · Zustand · shadcn/ui · Tailwind · react-chessboard

TypeScript types are auto-generated from the FastAPI OpenAPI schema — never written by hand:

npx openapi-typescript http://localhost:8000/openapi.json -o src/api/schema.ts

Run Artifacts

Each run creates a directory in results/runs/<run-id>/:

results/runs/<run-id>/
├── resolved_config.yaml         # Full merged config
├── config_hash.txt              # Deterministic config fingerprint
├── _run.json                    # Run metadata (secrets redacted)
├── games/
│   ├── game_0001.json           # Per-game artifact with full move trace
│   ├── game_0002.json
│   └── ...
├── experiment_report.json       # Aggregated metrics
└── experiment_report_evaluated.json  # Move quality + Elo (after evaluate)

Each GameRecord includes: move sequence, retry metadata, token usage, per-move latency, cost, termination reason, and RAG/MoA traces when enabled.
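
A deterministic fingerprint like the one stored in config_hash.txt can be produced by hashing a canonical serialization of the resolved config. The sketch below shows the general technique only: the choice of SHA-256 and canonical JSON is an assumption, and `config_fingerprint` is a hypothetical name.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Hash a config so that key order and whitespace do not affect the result."""
    canonical = json.dumps(
        config,
        sort_keys=True,          # same keys in any order give the same text
        separators=(",", ":"),   # no incidental whitespace
        ensure_ascii=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two runs with the same fingerprint used the same settings, which makes the hash a convenient key for matching a resumed run to its original.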

Experimental Roadmap

Phase	Status	What it enables
Phase 0 — Bootstrap	✅ Done	Reproducible config, CLI, env validation
Phase 1 — Core Engine	✅ Done	Legal games, all player types, protocol modes
Phase 2 — Evaluation	✅ Functional	Stockfish scoring, ACPL, Elo MLE, blunder rate
Phase 3 — Strategy	✅ Functional	Prompts, context assembly, few-shot, validation
Phase 4 — RAG	✅ MVP	Phase-aware local retrieval, ablation configs
Phase 5 — Multi-Agent	🔄 Baseline+	Capability, specialist, and hybrid phase-router MoA modes
Phase 6 — Experiment Runner	🔄 Partial	Batch + resume + budget; queue scheduler pending
Phase 7 — Analysis	🔄 Partial	FastAPI + React dashboard in progress

Next targets: specialist/hybrid MoA, queue scheduler, comparative visualizations.

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest -q

# Install with API dependencies
pip install -e ".[api]"
zugzwang api --host 127.0.0.1 --port 8000

Tests cover: board legality, config hashing, move parsing, retry policies, Elo math, RAG retrieval, MoA orchestration, runner resume/dedup, budget enforcement.

References

Primary References

Kolasani, S., Saplin, M. et al. (2025). LLM CHESS: Benchmarking Reasoning and Instruction-Following in LLMs through Chess. NeurIPS FoRLM 2025. arXiv:2512.01992
Karvonen, A. (2024). Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models. COLM 2024. arXiv:2403.15498
Feng, X. et al. (2023). ChessGPT: Bridging Policy Learning and Language Modeling. NeurIPS 2023. arXiv:2306.09200
Zhang, Y. et al. (2025). Complete Chess Games Enable LLM Become A Chess Master. NAACL 2025. arXiv:2501.17186
Monroe, D. & Leela Chess Zero Team (2024). Mastering Chess with a Transformer Model. arXiv:2409.12272
Ruoss, A. et al. (2024). Amortized Planning with Large-Scale Transformers: A Case Study on Chess. NeurIPS 2024. arXiv:2402.04494
Anonymous (2025). Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess. arXiv:2507.00726

Blog Posts & Analyses

Carlini, N. (2023). Playing chess with large language models. nicholas.carlini.com
Dynomight (2024). Something weird is happening with LLMs and chess. dynomight.net
Dynomight (2024). OK, I can partly explain the LLM chess weirdness now. dynomight.net
Karvonen, A. (2024). Chess-GPT's Internal World Model. adamkarvonen.github.io

License

MIT. See LICENSE.

Built with rigor, curiosity, and a deep respect for the game.
