This benchmark evaluates LLM chess-playing ability by having models play games against calibrated engine anchors and other LLMs. Ratings are calculated with the Glicko-2 rating system and anchored to approximate Lichess Classical ratings.
Results can be seen at https://chessbenchllm.onrender.com.

The author can be contacted at dfj2106@columbia.edu.
- **Gameplay**: LLMs receive the current position (FEN + ASCII board) and must return a single UCI move. Illegal moves get one retry with a warning; a second illegal move is a forfeit. A sketch of the move prompt is shown after this list.
- **Anchor Engines**: Games are played against engines with known Lichess Classical ratings in order to anchor our rating pool to the Lichess Classical pool.
- **Rating Calculation**: Glicko-2 ratings are calculated from game outcomes. FIDE ratings are estimated using ChessGoals.com FIDE conversion data.
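For illustration, here is a minimal sketch of how such a move prompt can be assembled with python-chess. The benchmark's actual template lives in llm/prompts.py and its wording may differ; `build_move_prompt` is a hypothetical helper name.

```python
import chess


def build_move_prompt(board: chess.Board) -> str:
    """Illustrative prompt: FEN plus an ASCII board, asking for one UCI move.

    This is only a sketch of the idea; the real template is in llm/prompts.py.
    """
    side = "White" if board.turn else "Black"
    return (
        f"You are playing chess as {side}.\n"
        f"FEN: {board.fen()}\n"
        f"Board:\n{board}\n"  # str(board) renders an ASCII diagram
        "Reply with a single legal move in UCI notation (e.g. e2e4)."
    )


board = chess.Board()
print(build_move_prompt(board))
```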
More general methodology notes are on the website.
```bash
pip install -r requirements.txt
```

Using anchor engines requires installing them individually: Maia, Eubos.
```bash
export OPENROUTER_API_KEY="your-key"
```

```bash
# LLM vs Stockfish (default engine)
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --stockfish-skill 5

# LLM vs LLM
python cli.py manual --white-model meta-llama/llama-4-maverick --black-model deepseek/deepseek-chat-v3-0324

# Multiple games (alternates colors each game)
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --games 10

# Against different engine types
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --engine-type maia-1100
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --engine-type random
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --engine-type eubos

# With reasoning models (use --max-tokens 0 for extended thinking)
python cli.py manual --white-model deepseek/deepseek-r1 --black-engine --white-reasoning-effort high --max-tokens 0

# Enable reasoning mode for hybrid models
python cli.py manual --white-model deepseek/deepseek-chat --black-engine --reasoning

# Don't save the game
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --no-save
```

Manual command engine presets: `stockfish`, `maia-1100`, `maia-1900`, `random`, `eubos`
Note: `eubos` is a hardcoded preset. For custom UCI engines in benchmarks, use `type: uci` in the config.
Run the full benchmark:

```bash
python cli.py run -c config/benchmark.yaml -v
```

View the leaderboard:

```bash
python cli.py leaderboard --min-games 5
python cli.py leaderboard --sort legal  # Sort by legal move %
python cli.py leaderboard --sort cost   # Sort by $/game
```

Recalculate all ratings from stored game results (useful after playing manual games or changing anchor ratings):

```bash
python cli.py recalculate -c config/benchmark.yaml
```

A hosted instance is available at https://chessbenchllm.onrender.com. To run the web interface locally:

```bash
python web/app.py
# Open http://localhost:5000
```

Features:
- Leaderboard with Glicko-2 ratings, FIDE estimates, confidence intervals, legal move rates, $/game, and release dates
- Game library with filtering by player and pagination
- Interactive game viewer with move-by-move navigation
- Client-side Stockfish analysis (toggle-able eval bar + top engine lines)
- Timeline chart showing rating progression over time
- Cost vs Rating chart with efficiency frontier
- Methodology page explaining the rating system
- JSON API at `/api/leaderboard`, `/api/games`, `/api/game/<id>` (see the example below)
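The API endpoints can be queried with a short script. This is a sketch only: nothing about the payload shape is assumed beyond it being JSON, and the base URL can be swapped for a local instance.

```python
import requests

# Hosted instance; use "http://localhost:5000" when running web/app.py locally.
BASE_URL = "https://chessbenchllm.onrender.com"

for endpoint in ("/api/leaderboard", "/api/games"):
    resp = requests.get(f"{BASE_URL}{endpoint}", timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    # The response schema is not documented here, so just report its shape.
    size = len(payload) if isinstance(payload, (list, dict)) else "n/a"
    print(f"{endpoint}: {type(payload).__name__} with {size} entries")
```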
Edit config/benchmark.yaml to configure:
- LLM models to benchmark (via OpenRouter)
- Engine anchors (Stockfish, Maia, Random, or any UCI engine)
- Games per matchup and concurrency settings
Example:
```yaml
benchmark:
  games_vs_anchor_per_color: 10
  games_vs_llm_per_color: 5
  max_concurrent: 4
  max_moves: 200
  rating_threshold: 600  # Only pair players within this rating difference

engines:
  - player_id: "random-bot"
    type: random
    rating: 400
  - player_id: "maia-1100"
    type: maia
    lc0_path: "/opt/homebrew/bin/lc0"
    weights_path: "maia-1100.pb.gz"
    rating: 1628
  - player_id: "eubos"
    type: uci  # Generic UCI engine
    path: "/path/to/engine"
    rating: 2200
    initial_time: 900  # Clock-based time control (seconds)
    increment: 10

llms:
  - player_id: "llama-4-maverick"
    model_name: "meta-llama/llama-4-maverick"
    temperature: 0.0
    max_tokens: 10
  - player_id: "deepseek-r1"
    model_name: "deepseek/deepseek-r1"
    reasoning_effort: "medium"  # minimal, low, medium, high
```

Engine types: `stockfish`, `maia`, `random`, `uci` (generic UCI engine)
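The YAML above maps naturally onto typed entries. The repo's actual Pydantic data models live in game/models.py; the class names, fields, and defaults below are illustrative assumptions, shown only to make the config shape concrete.

```python
# Illustrative sketch only: the real models live in game/models.py and may differ.
from typing import Literal, Optional

import yaml
from pydantic import BaseModel


class EngineEntry(BaseModel):
    player_id: str
    type: Literal["stockfish", "maia", "random", "uci"]
    rating: int
    path: Optional[str] = None          # generic UCI engines
    lc0_path: Optional[str] = None      # maia
    weights_path: Optional[str] = None  # maia
    initial_time: Optional[int] = None  # seconds
    increment: Optional[int] = None     # seconds


class LLMEntry(BaseModel):
    player_id: str
    model_name: str
    temperature: float = 0.0            # assumed default
    max_tokens: Optional[int] = None
    reasoning_effort: Optional[str] = None


with open("config/benchmark.yaml") as f:
    raw = yaml.safe_load(f)

engines = [EngineEntry(**e) for e in raw["engines"]]
llms = [LLMEntry(**m) for m in raw["llms"]]
print(f"{len(engines)} engine anchors, {len(llms)} LLM players")
```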
```
├── cli.py                       # Main CLI entrypoint
├── config/
│   └── benchmark.yaml           # Benchmark configuration
├── engines/                     # Chess engine wrappers
│   ├── base_engine.py           # Base engine class
│   ├── stockfish_engine.py
│   ├── maia_engine.py
│   ├── random_engine.py
│   └── uci_engine.py            # Generic UCI engine wrapper
├── llm/                         # LLM player clients
│   ├── base_llm.py              # Base LLM player class
│   ├── openrouter_client.py
│   └── prompts.py               # Chess prompt templates
├── game/                        # Game execution
│   ├── game_runner.py           # Core game loop
│   ├── match_scheduler.py       # Parallel game execution
│   ├── models.py                # Pydantic data models
│   ├── pgn_logger.py            # PGN/result saving
│   └── stats_collector.py       # Win/loss/draw stats
├── rating/                      # Rating system
│   ├── glicko2.py               # Glicko-2 implementation
│   ├── rating_store.py          # Local JSON storage
│   ├── leaderboard.py           # Leaderboard formatting
│   ├── fide_estimate.py         # FIDE rating estimation
│   └── cost_calculator.py       # API cost calculation
├── web/                         # Web interface
│   ├── app.py                   # Flask application
│   ├── timeline_chart.py        # Rating timeline visualization
│   ├── cost_chart.py            # Cost vs rating visualization
│   ├── templates/               # HTML templates
│   └── static/                  # CSS/JS assets
└── data/                        # Output (gitignored)
    ├── games/                   # PGN files
    ├── results/                 # JSON game results
    ├── ratings.json             # Current ratings
    ├── lichess_to_fide.json     # FIDE conversion data
    └── model_publish_dates.json # Model release dates
```
Uses Glicko-2 with:
- Rating (μ): Estimated skill level (starts at 1500)
- Rating Deviation (RD): Uncertainty (decreases with more games)
- Volatility (σ): Expected rating fluctuation
- FIDE Estimate: Approximate FIDE rating based on ChessGoals.com Lichess-to-FIDE conversion (valid for 1715-2500 range)
- Legal Move Rate: Percentage of moves that were legal on first attempt
Engine anchors have fixed ratings based on their approximate Elo and are never updated.
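As a worked illustration of how fixed anchors tie the pool to the Lichess scale, the standard Glicko-2 expected-score formula can be evaluated against an anchor. This is a sketch of the textbook math, not the code in rating/glicko2.py, and the anchor's rating deviation below is an assumption.

```python
import math

GLICKO2_SCALE = 173.7178  # standard Glicko-2 scaling constant


def to_glicko2(rating: float, rd: float) -> tuple[float, float]:
    """Convert a rating/RD pair to the internal Glicko-2 scale (mu, phi)."""
    return (rating - 1500.0) / GLICKO2_SCALE, rd / GLICKO2_SCALE


def expected_score(mu: float, mu_j: float, phi_j: float) -> float:
    """Textbook Glicko-2 expected score of mu against opponent (mu_j, phi_j)."""
    g = 1.0 / math.sqrt(1.0 + 3.0 * phi_j**2 / math.pi**2)
    return 1.0 / (1.0 + math.exp(-g * (mu - mu_j)))


# A fresh LLM enters at 1500; maia-1100 is anchored at 1628 in the example config.
mu, _ = to_glicko2(1500.0, 350.0)
mu_anchor, phi_anchor = to_glicko2(1628.0, 50.0)  # anchor RD of 50 is an assumption
print(f"Expected score vs maia-1100: {expected_score(mu, mu_anchor, phi_anchor):.2f}")
```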
- First illegal move: Warning sent, LLM gets one retry
- Second illegal move: Immediate forfeit (loss), following FIDE rules
The retry prompt tells the LLM which move was illegal but does not provide a list of legal moves.
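A sketch of that retry/forfeit flow is below. The helper names are hypothetical; the benchmark's real loop lives in game/game_runner.py.

```python
from typing import Callable, Optional

import chess


def get_move_with_retry(
    board: chess.Board,
    ask_llm: Callable[[chess.Board, Optional[str]], str],
) -> Optional[chess.Move]:
    """Request a move, allowing one retry after an illegal or unparseable reply.

    Returns None on the second failure, which the caller treats as a forfeit.
    `ask_llm` is a hypothetical callable returning the model's raw UCI string.
    """
    warning: Optional[str] = None
    for _ in range(2):
        reply = ask_llm(board, warning).strip()
        try:
            move = chess.Move.from_uci(reply)
            if move in board.legal_moves:
                return move
        except ValueError:
            pass  # reply was not even parseable as UCI
        # Name the offending move, but do not list the legal moves.
        warning = f"Your move '{reply}' is illegal in this position. Try again."
    return None  # second illegal move: forfeit
```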