Chess LLM Benchmark

This benchmark evaluates LLM chess-playing ability by having models play games against calibrated engine anchors and other LLMs. Ratings are calculated using the Glicko-2 rating system, calibrated to approximate Lichess Classical ratings.

Results can be seen at https://chessbenchllm.onrender.com

The author can be contacted at dfj2106@columbia.edu

How It Works

Gameplay: LLMs receive the current position (FEN plus an ASCII board) and must return a single UCI move. An illegal move gets one retry with a warning; a second illegal move is an immediate forfeit.
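
A minimal sketch of that per-move loop using python-chess (illustrative only; the project's actual loop lives in game/game_runner.py, and request_move here is a stand-in for the LLM call):

```python
import chess

def play_turn(board: chess.Board, request_move) -> bool:
    """Ask the player for a UCI move; allow one retry on an illegal move.

    `request_move(fen, ascii_board, warning)` is a stand-in for the LLM call.
    Returns False if the player forfeits (two illegal moves in a row).
    """
    warning = None
    for _attempt in range(2):
        reply = request_move(board.fen(), str(board), warning)
        try:
            move = chess.Move.from_uci(reply.strip())
        except ValueError:
            move = None
        if move is not None and move in board.legal_moves:
            board.push(move)
            return True
        warning = f"Your move '{reply}' was illegal. Reply with a single legal UCI move."
    return False  # second illegal move = forfeit
```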

Anchor Engines: Games are played against engines with known Lichess Classical ratings in order to anchor our rating pool to the Lichess Classical pool.

Rating Calculation: Glicko-2 ratings are calculated from game outcomes. FIDE ratings are estimated using ChessGoals.com Lichess-to-FIDE conversion data.

More general methodology notes are on the website.

Installation

pip install -r requirements.txt

Using anchor engines requires installing them individually (Maia, Eubos).

Usage

Set API Key

export OPENROUTER_API_KEY="your-key"
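
Moves are requested through OpenRouter's OpenAI-compatible chat completions endpoint. A minimal sketch of such a request (the project's real client is llm/openrouter_client.py; the prompt text here is illustrative, not the actual template from llm/prompts.py):

```python
import os
import requests

def ask_for_move(model: str, fen: str, ascii_board: str) -> str:
    """Request a single UCI move from an LLM via OpenRouter (illustrative prompt)."""
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [{
                "role": "user",
                "content": f"You are playing chess.\nFEN: {fen}\n{ascii_board}\n"
                           "Reply with a single legal move in UCI notation.",
            }],
            "temperature": 0.0,
            "max_tokens": 10,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()
```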

Run Manual Games

# LLM vs Stockfish (default engine)
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --stockfish-skill 5

# LLM vs LLM
python cli.py manual --white-model meta-llama/llama-4-maverick --black-model deepseek/deepseek-chat-v3-0324

# Multiple games (alternates colors each game)
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --games 10

# Against different engine types
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --engine-type maia-1100
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --engine-type random
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --engine-type eubos

# With reasoning models (use max-tokens 0 for extended thinking)
python cli.py manual --white-model deepseek/deepseek-r1 --black-engine --white-reasoning-effort high --max-tokens 0

# Enable reasoning mode for hybrid models
python cli.py manual --white-model deepseek/deepseek-chat --black-engine --reasoning

# Don't save the game
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --no-save

Manual command engine presets: stockfish, maia-1100, maia-1900, random, eubos

Note: eubos is a hardcoded preset. For custom UCI engines in benchmarks, use type: uci in config.

Run Full Benchmark

python cli.py run -c config/benchmark.yaml -v

View Leaderboard

python cli.py leaderboard --min-games 5
python cli.py leaderboard --sort legal   # Sort by legal move %
python cli.py leaderboard --sort cost    # Sort by $/game

Recalculate Ratings

Recalculate all ratings from stored game results (useful after playing manual games or changing anchor ratings):

python cli.py recalculate -c config/benchmark.yaml

Web Interface

A hosted instance is available at https://chessbenchllm.onrender.com. To run it locally:

python web/app.py
# Open http://localhost:5000

Features:

  • Leaderboard with Glicko-2 ratings, FIDE estimates, confidence intervals, legal move rates, $/game, and release dates
  • Game library with filtering by player and pagination
  • Interactive game viewer with move-by-move navigation
  • Client-side Stockfish analysis (toggle-able eval bar + top engine lines)
  • Timeline chart showing rating progression over time
  • Cost vs Rating chart with efficiency frontier
  • Methodology page explaining the rating system
  • JSON API at /api/leaderboard, /api/games, /api/game/<id>
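
For example, the leaderboard endpoint can be queried directly (a sketch; the exact response fields are whatever the Flask app returns):

```python
import requests

# Query the hosted instance, or http://localhost:5000 when running web/app.py locally.
data = requests.get("https://chessbenchllm.onrender.com/api/leaderboard", timeout=30).json()
print(data)
```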

Configuration

Edit config/benchmark.yaml to configure:

  • LLM models to benchmark (via OpenRouter)
  • Engine anchors (Stockfish, Maia, Random, or any UCI engine)
  • Games per matchup and concurrency settings

Example:

benchmark:
  games_vs_anchor_per_color: 10
  games_vs_llm_per_color: 5
  max_concurrent: 4
  max_moves: 200
  rating_threshold: 600  # Only pair players within this rating difference

engines:
  - player_id: "random-bot"
    type: random
    rating: 400

  - player_id: "maia-1100"
    type: maia
    lc0_path: "/opt/homebrew/bin/lc0"
    weights_path: "maia-1100.pb.gz"
    rating: 1628

  - player_id: "eubos"
    type: uci                    # Generic UCI engine
    path: "/path/to/engine"
    rating: 2200
    initial_time: 900            # Clock-based time control (seconds)
    increment: 10

llms:
  - player_id: "llama-4-maverick"
    model_name: "meta-llama/llama-4-maverick"
    temperature: 0.0
    max_tokens: 10

  - player_id: "deepseek-r1"
    model_name: "deepseek/deepseek-r1"
    reasoning_effort: "medium"  # minimal, low, medium, high

Engine types: stockfish, maia, random, uci (generic UCI engine)
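
Because the configuration is plain YAML, it can also be inspected programmatically. A small sketch, assuming PyYAML is installed and only the keys shown in the example above:

```python
import yaml

with open("config/benchmark.yaml") as f:
    cfg = yaml.safe_load(f)

# List the fixed-rating engine anchors and the LLMs to benchmark.
for engine in cfg.get("engines", []):
    print(f"anchor {engine['player_id']}: fixed rating {engine['rating']}")
for llm in cfg.get("llms", []):
    print(f"llm {llm['player_id']}: model {llm['model_name']}")
```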

Project Structure

├── cli.py                 # Main CLI entrypoint
├── config/
│   └── benchmark.yaml     # Benchmark configuration
├── engines/               # Chess engine wrappers
│   ├── base_engine.py     # Base engine class
│   ├── stockfish_engine.py
│   ├── maia_engine.py
│   ├── random_engine.py
│   └── uci_engine.py      # Generic UCI engine wrapper
├── llm/                   # LLM player clients
│   ├── base_llm.py        # Base LLM player class
│   ├── openrouter_client.py
│   └── prompts.py         # Chess prompt templates
├── game/                  # Game execution
│   ├── game_runner.py     # Core game loop
│   ├── match_scheduler.py # Parallel game execution
│   ├── models.py          # Pydantic data models
│   ├── pgn_logger.py      # PGN/result saving
│   └── stats_collector.py # Win/loss/draw stats
├── rating/                # Rating system
│   ├── glicko2.py         # Glicko-2 implementation
│   ├── rating_store.py    # Local JSON storage
│   ├── leaderboard.py     # Leaderboard formatting
│   ├── fide_estimate.py   # FIDE rating estimation
│   └── cost_calculator.py # API cost calculation
├── web/                   # Web interface
│   ├── app.py             # Flask application
│   ├── timeline_chart.py  # Rating timeline visualization
│   ├── cost_chart.py      # Cost vs rating visualization
│   ├── templates/         # HTML templates
│   └── static/            # CSS/JS assets
└── data/                  # Output (gitignored)
    ├── games/             # PGN files
    ├── results/           # JSON game results
    ├── ratings.json       # Current ratings
    ├── lichess_to_fide.json      # FIDE conversion data
    └── model_publish_dates.json  # Model release dates

Rating System

Uses Glicko-2 with:

  • Rating (μ): Estimated skill level (starts at 1500)
  • Rating Deviation (RD): Uncertainty (decreases with more games)
  • Volatility (σ): Expected rating fluctuation
  • FIDE Estimate: Approximate FIDE rating based on ChessGoals.com Lichess-to-FIDE conversion (valid for 1715-2500 range)
  • Legal Move Rate: Percentage of moves that were legal on first attempt

Engine anchors have fixed ratings based on their approximate Elo and are never updated.
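
The FIDE estimate is a lookup against the ChessGoals conversion data. A sketch of how such a conversion could be interpolated, assuming data/lichess_to_fide.json maps Lichess ratings to FIDE ratings (the file's actual format may differ; see rating/fide_estimate.py):

```python
import json

def estimate_fide(lichess_rating: float, table_path: str = "data/lichess_to_fide.json"):
    """Linearly interpolate a FIDE estimate from Lichess/FIDE anchor points (assumed format)."""
    with open(table_path) as f:
        # Assumed format: {"1715": 1600, "1800": 1690, ...}
        points = sorted((int(k), v) for k, v in json.load(f).items())
    if not (points[0][0] <= lichess_rating <= points[-1][0]):
        return None  # outside the valid conversion range
    for (lo_l, lo_f), (hi_l, hi_f) in zip(points, points[1:]):
        if lo_l <= lichess_rating <= hi_l:
            frac = (lichess_rating - lo_l) / (hi_l - lo_l)
            return lo_f + frac * (hi_f - lo_f)
    return None
```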

Illegal Move Policy

  • First illegal move: Warning sent, LLM gets one retry
  • Second illegal move: Immediate forfeit (loss), following FIDE rules

The retry prompt tells the LLM which move was illegal but does not provide a list of legal moves.
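
A sketch of what such a retry message might look like (illustrative; the real templates live in llm/prompts.py):

```python
def retry_prompt(illegal_move: str, fen: str, ascii_board: str) -> str:
    """Build a retry warning: names the illegal move, does NOT list legal moves."""
    return (
        f"Your previous move '{illegal_move}' is illegal in this position.\n"
        f"FEN: {fen}\n{ascii_board}\n"
        "Reply with a single legal move in UCI notation. "
        "A second illegal move forfeits the game."
    )
```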
