MILTON

A self-learning chess engine that trains itself, gates its own promotions, and plays rated games on Lichess — running on a single Mac Mini.
Self-play. Train. Arena. Deploy. Forever.


Abstract

Milton is a recursive, self-improving chess engine. It learns the game from random play, with no opening book, no endgame tablebase, and no scraped grandmaster games. Every line of theory it knows, it discovered itself.

The runtime is a single Rust binary that drives a four-stage cycle — generate self-play games, train the network on the resulting positions, run the candidate against the reigning champion in a head-to-head match, and — if the candidate wins by a sufficient margin — promote it and deploy it as the live opponent on Lichess. The loop never breaks.

Milton is engineered around three claims: that the cycle is the artifact, not the weights; that an LLM acting as a post-game coach can compress the long tail of self-play; and that Candidate Master strength (2500 Elo) is reachable on consumer hardware with a tight enough loop.


Design Principles

| Principle | Manifestation |
| --- | --- |
| The loop is the product | Network architecture is fixed early; the wins come from tightening the cycle. |
| Zero human knowledge | No opening book. No endgame tablebase. No GM games. Strategy emerges from MCTS-improved self-play targets alone. |
| Arena gate is non-negotiable | A candidate network must beat the champion at >= 55% to be promoted. Below that, it is discarded. |
| Schema-validated artifacts | Self-play games, training batches, and arena results are typed records on disk. Nothing is parsed twice. |
| Single-machine | One Mac Mini M4. MPS for inference, CPU pool for tree search. No cloud. No queue. No SLURM. |
| Public scoreboard | The current champion is always live on Lichess. Anyone can challenge it. |

The Loop

                       +-------------------+
                       |  CHAMPION NETWORK |
                       +---------+---------+
                                 |
              +------------------+------------------+
              |                                     |
              v                                     v
       (1) SELF-PLAY                          (4) DEPLOYMENT
        100 games / iter                       Lichess @magnusgrok
        MCTS, 200 sims                          rated games vs humans
              |                                     ^
              v                                     |
      ~6,000 training samples                       |
              |                                     |
              v                                     |
        (2) TRAINING                          (3) ARENA
        residual CNN                          new vs champion
        policy + value head                   100 games, win >= 55%
              |                                     ^
              +--------------> CANDIDATE -----------+
                                NETWORK

Every stage emits structured artifacts on disk. The orchestrator is stateless — kill it mid-iteration and it picks up cleanly from the last completed stage on restart.
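The resume behavior described above can be sketched as a pure decision over which per-iteration artifacts already exist on disk. `Stage` and `next_stage` are hypothetical names for illustration, not the actual orchestrator API:

```rust
// Sketch of stateless resume logic: infer the next stage purely from
// which artifacts the last run left behind (hypothetical helper).

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Stage {
    SelfPlay,
    Train,
    Arena,
    Deploy,
}

/// Decide where to resume, given which artifacts exist for this iteration.
pub fn next_stage(has_games: bool, has_candidate: bool, has_arena_result: bool) -> Stage {
    match (has_games, has_candidate, has_arena_result) {
        (false, _, _) => Stage::SelfPlay,    // no self-play games yet
        (true, false, _) => Stage::Train,    // games exist, no candidate checkpoint
        (true, true, false) => Stage::Arena, // candidate exists, gate not yet run
        (true, true, true) => Stage::Deploy, // gate decided; deploy or advance
    }
}
```

Because each stage's output is a typed record on disk, killing the process mid-iteration loses at most the in-flight stage, never a completed one.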


Stage 1: Self-Play

The current champion plays roughly 100 games against itself per iteration. Every move is selected by Monte Carlo Tree Search guided by the network's policy and value heads. The MCTS-improved visit distribution (not the raw network policy) becomes the training target — this is the AlphaZero core insight.

// crates/selfplay/src/runner.rs
use crate::mcts::{Search, SearchConfig};
use crate::net::Network; // assumed crate-local path for the network type
use crate::position::{encode_position, Outcome, Position};
use crate::record::{Game, TrainingSample};
use std::sync::Arc;

pub struct SelfPlayConfig {
    pub simulations: u32,
    pub temperature_moves: u32,
    pub dirichlet_alpha: f32,
    pub dirichlet_epsilon: f32,
    pub resign_threshold: f32,
}

pub async fn play_game(net: Arc<Network>, cfg: &SelfPlayConfig) -> Game {
    let mut pos = Position::startpos();
    let mut samples: Vec<TrainingSample> = Vec::with_capacity(80);

    while !pos.is_terminal() {
        let temperature = if pos.fullmove() <= cfg.temperature_moves { 1.0 } else { 0.0 };
        let mut search = Search::new(net.clone(), SearchConfig::from(cfg));
        let result = search.run(&pos, cfg.simulations).await;

        samples.push(TrainingSample {
            position: encode_position(&pos),
            policy: result.visit_distribution(),
            value: 0.0, // labeled at game end
            ply: pos.ply(),
        });

        let mv = result.sample_move(temperature);
        pos.play_unchecked(&mv);

        if let Some(outcome) = result.early_resign(cfg.resign_threshold) {
            return label(samples, outcome);
        }
    }

    label(samples, pos.outcome())
}

fn label(mut samples: Vec<TrainingSample>, outcome: Outcome) -> Game {
    let final_value = outcome.as_value();
    // Consecutive samples come from alternating sides to move, so the
    // value target flips sign with the sample index.
    for (i, sample) in samples.iter_mut().enumerate() {
        sample.value = if i % 2 == 0 { final_value } else { -final_value };
    }
    }
    Game { samples, outcome }
}

A standard iteration produces ~6,000 training samples. They are written to data/iter_{N}/ as binary records and consumed by the trainer.
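The binary record format is not specified above; one plausible layout is simple length-prefixed framing. This is a hypothetical sketch, not the actual schema-validated on-disk format:

```rust
use std::io::{self, Read, Write};

// Hypothetical length-prefixed framing for serialized training samples:
// a 4-byte little-endian length, then the payload bytes.
pub fn write_record<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    w.write_all(&(payload.len() as u32).to_le_bytes())?; // length header
    w.write_all(payload)                                  // record body
}

pub fn read_record<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut len = [0u8; 4];
    r.read_exact(&mut len)?;
    let mut buf = vec![0u8; u32::from_le_bytes(len) as usize];
    r.read_exact(&mut buf)?;
    Ok(buf)
}
```

Framing like this keeps reads sequential and append-only writes cheap, which suits a trainer that streams the whole iteration once per epoch.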


Stage 2: Training

A 9.6M-parameter residual CNN is updated on the freshly generated samples. Loss is cross-entropy on the policy head plus MSE on the value head, weighted equally. Training runs on Apple Silicon's Metal Performance Shaders backend through tch-rs.

// crates/train/src/step.rs
use tch::{nn, nn::OptimizerConfig, Tensor};

pub struct TrainStep<'a> {
    pub net: &'a Network,
    pub opt: &'a mut nn::Optimizer,
    pub batch: &'a Batch,
    pub policy_weight: f64,
    pub value_weight: f64,
}

pub fn step(s: &mut TrainStep) -> StepLoss {
    let (policy_logits, value) = s.net.forward(&s.batch.positions);

    let log_p = policy_logits.log_softmax(-1, tch::Kind::Float);
    let policy_loss = -(log_p * &s.batch.policies).sum_dim_intlist(
        Some(vec![1].as_slice()),
        false,
        tch::Kind::Float,
    ).mean(tch::Kind::Float);

    let value_loss = (value - &s.batch.values).pow_tensor_scalar(2).mean(tch::Kind::Float);

    let total = &policy_loss * s.policy_weight + &value_loss * s.value_weight;

    s.opt.zero_grad();
    total.backward();
    s.opt.clip_grad_norm(1.0);
    s.opt.step();

    StepLoss {
        total: total.double_value(&[]),
        policy: policy_loss.double_value(&[]),
        value: value_loss.double_value(&[]),
    }
}

A typical training run consumes 8 to 12 epochs over the latest sample window (last 4 iterations) and ends with a candidate checkpoint at data/iter_{N}/candidate.safetensors.
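The window selection is simple arithmetic over iteration indices. A sketch, with `window_iters` as a hypothetical helper name:

```rust
// Which iteration directories feed the trainer: the last `window`
// iterations up to and including `latest` (hypothetical helper).
pub fn window_iters(latest: u32, window: u32) -> Vec<u32> {
    let start = latest.saturating_sub(window.saturating_sub(1));
    (start..=latest).collect()
}
```

With `buffer_iterations = 4` and a latest iteration of 42, the trainer would read `data/iter_39` through `data/iter_42`.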


Stage 3: Arena

The candidate fights the reigning champion in a 100-game match. Colors alternate. Both engines run with the same MCTS budget. Promotion requires a win rate at or above 55% (counting draws as half).

// crates/arena/src/match_runner.rs
use crate::engine::Engine;
use crate::game::{play_engine_game, Color, GameOutcome};

pub struct ArenaResult {
    pub wins: u32,
    pub losses: u32,
    pub draws: u32,
    pub win_rate: f32,
    pub promote: bool,
}

pub async fn run_match(
    challenger: Engine,
    champion: Engine,
    games: u32,
    threshold: f32,
) -> ArenaResult {
    let mut wins = 0;
    let mut losses = 0;
    let mut draws = 0;

    for g in 0..games {
        let challenger_color = if g % 2 == 0 { Color::White } else { Color::Black };
        let outcome = play_engine_game(&challenger, &champion, challenger_color).await;
        match outcome {
            GameOutcome::Win(c) if c == challenger_color => wins += 1,
            GameOutcome::Win(_) => losses += 1,
            GameOutcome::Draw => draws += 1,
        }
    }

    let win_rate = (wins as f32 + 0.5 * draws as f32) / games as f32;
    ArenaResult {
        wins,
        losses,
        draws,
        win_rate,
        promote: win_rate >= threshold,
    }
}

If the gate passes, the candidate becomes the new champion and the next iteration's self-play immediately uses it. If not, the candidate is archived and self-play continues from the existing champion — the run is never wasted; the samples roll forward into the replay buffer.
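For intuition on the 55% gate: under the standard logistic Elo model, a 55% score corresponds to roughly +35 Elo over the champion per promotion. An illustrative helper (not part of the codebase):

```rust
// Elo difference implied by an expected score, from the standard
// logistic model: score = 1 / (1 + 10^(-diff/400)). Illustrative only.
pub fn elo_diff(score: f64) -> f64 {
    -400.0 * (1.0 / score - 1.0).log10()
}
```

So each successful promotion is a statistically meaningful step, not noise around a coin flip.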


Stage 4: Deployment

The current champion is always live on Lichess as @magnusgrok. Deployment is a hot swap: the bot daemon watches a champion.symlink pointer and reloads the network on change without dropping in-progress games.

// crates/lichess/src/bot.rs
use crate::stream::{Event, LichessClient};
use crate::engine::Engine;
use std::sync::Arc;
use tokio::sync::RwLock;

pub async fn run_bot(engine: Arc<RwLock<Engine>>, client: LichessClient) -> anyhow::Result<()> {
    let mut events = client.stream_events().await?;

    while let Some(event) = events.next().await? {
        match event {
            Event::Challenge(c) if c.variant.is_standard() => {
                client.accept_challenge(&c.id).await?;
            }
            Event::GameStart(g) => {
                let engine = engine.clone();
                let client = client.clone();
                tokio::spawn(async move {
                    if let Err(e) = play_game(engine, client, g).await {
                        tracing::warn!(?e, "game ended with error");
                    }
                });
            }
            Event::ChampionSwap(path) => {
                tracing::info!(?path, "hot-swapping champion network");
                engine.write().await.reload(&path)?;
            }
            _ => {}
        }
    }
    Ok(())
}

Lichess games do not feed the training set — they are a public scoreboard, not an oracle. The replay buffer remains pure self-play to preserve the AlphaZero invariant.


Neural Network Architecture

| Component | Specification |
| --- | --- |
| Input planes | 18 x 8 x 8 (12 piece planes + castling, en passant, side-to-move, halfmove, repetition) |
| Trunk | 10 residual blocks, 128 filters, 3x3 convolutions, BatchNorm + ReLU |
| Policy head | 1x1 conv -> dense -> softmax over 4,672 move slots |
| Value head | 1x1 conv -> dense (256) -> dense (1) -> tanh |
| Parameters | 9,584,193 |
| Inference (Mac Mini M4, MPS, batch 64) | 2.1 ms / position |
| Training throughput | ~14k samples / minute |

// crates/net/src/model.rs
use tch::nn;

pub fn build(vs: &nn::Path) -> Network {
    let conv = nn::conv2d(vs / "stem", 18, 128, 3, nn::ConvConfig { padding: 1, ..Default::default() });
    let bn   = nn::batch_norm2d(vs / "stem_bn", 128, Default::default());

    let blocks: Vec<ResidualBlock> = (0..10)
        .map(|i| ResidualBlock::new(&(vs / format!("res_{i}")), 128))
        .collect();

    let policy_head = PolicyHead::new(&(vs / "policy"), 128, 4672);
    let value_head  = ValueHead::new(&(vs / "value"), 128);

    Network { conv, bn, blocks, policy_head, value_head }
}

MCTS Implementation

Tree search uses the PUCT formula from AlphaZero: each child's score is the empirical action value plus an exploration bonus weighted by the network's prior and the parent's visit count. Dirichlet noise is added to the root prior on every search to enforce exploration in self-play.

// crates/mcts/src/select.rs
use rand::Rng;
use rand_distr::{Dirichlet, Distribution};

#[inline]
pub fn puct_score(child: &Node, parent_visits: f32, c_puct: f32) -> f32 {
    let q = if child.visits == 0 {
        0.0
    } else {
        child.value_sum / child.visits as f32
    };
    let u = c_puct * child.prior * parent_visits.sqrt() / (1.0 + child.visits as f32);
    q + u
}

pub fn select_child<'a>(parent: &'a Node, arena: &'a Arena, c_puct: f32) -> &'a Edge {
    let pv = parent.visits as f32;
    parent.edges.iter()
        .max_by(|a, b| {
            puct_score(&arena[a.child], pv, c_puct)
                .partial_cmp(&puct_score(&arena[b.child], pv, c_puct))
                .unwrap()
        })
        .expect("non-terminal nodes always have at least one edge")
}

pub fn add_dirichlet_noise(priors: &mut [f32], alpha: f32, epsilon: f32, rng: &mut impl Rng) {
    let dist = Dirichlet::new_with_size(alpha, priors.len()).unwrap();
    let noise: Vec<f32> = dist.sample(rng);
    for (p, n) in priors.iter_mut().zip(noise) {
        *p = (1.0 - epsilon) * *p + epsilon * n;
    }
}
| Hyperparameter | Self-Play | Arena |
| --- | --- | --- |
| Simulations / move | 200 | 400 |
| c_puct | 1.5 | 1.5 |
| Dirichlet alpha | 0.3 | 0.0 |
| Dirichlet epsilon | 0.25 | 0.0 |
| Temperature (first 30 moves) | 1.0 | 0.0 |
| Resign threshold | -0.95 | disabled |

LLM Coach Integration

Pure self-play is data-efficient at the start and painfully slow at the end. The engine plateaus once its blunders become subtle. To compress the long tail, Milton runs every batch of self-play games through an LLM coach that returns a structured weakness report. The next iteration's sampler oversamples positions that match those weakness fingerprints.

// crates/coach/src/grok.rs
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
pub struct WeaknessReport {
    pub weaknesses: Vec<Weakness>,
}

#[derive(Serialize, Deserialize, Debug)]
pub struct Weakness {
    pub category: WeaknessCategory,
    pub description: String,
    pub fingerprint: PositionFingerprint,
    pub severity: f32,
}

#[derive(Serialize, Deserialize, Debug)]
#[serde(rename_all = "snake_case")]
pub enum WeaknessCategory {
    OpeningRepertoire,
    PawnStructure,
    PieceCoordination,
    KingSafety,
    EndgameTechnique,
    TacticalAwareness,
}

pub async fn analyze(games: &[Pgn], client: &CoachClient) -> anyhow::Result<WeaknessReport> {
    let prompt = format!(
        "You are an elite chess coach reviewing {} games played by a single engine. \
         Identify systematic positional weaknesses, not single-move blunders. \
         Return structured JSON matching the WeaknessReport schema.",
        games.len(),
    );

    let response = client
        .completion()
        .system(prompt)
        .user(serialize_pgns(games))
        .response_schema::<WeaknessReport>()
        .send()
        .await?;

    Ok(response.parsed)
}

Each Weakness carries a PositionFingerprint that maps to a deterministic feature filter over the replay buffer. The next sampler weights positions matching the fingerprint by 1.0 + severity.
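The weighting rule can be sketched directly. The assumption that severities of multiple matching fingerprints add is mine, not stated in the source; `sample_weight` is a hypothetical helper:

```rust
// Sampler weight for one position, given the severities of every
// weakness fingerprint it matches. Non-matching positions keep
// baseline weight 1.0. (Summing severities across multiple matches
// is an assumption for illustration.)
pub fn sample_weight(matched_severities: &[f32]) -> f32 {
    1.0 + matched_severities.iter().sum::<f32>()
}
```

A position matching one severity-0.5 fingerprint is drawn 1.5x as often as an unmatched one, so the next iteration spends disproportionate search effort where the coach found systematic weakness.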


Configuration

Single file at ~/.milton/milton.toml:

[loop]
iterations = 0           # 0 = run forever
samples_per_iter = 6000
temperature_moves = 30

[selfplay]
games = 100
simulations = 200
dirichlet_alpha = 0.3
dirichlet_epsilon = 0.25

[train]
batch_size = 256
learning_rate = 1e-3
weight_decay = 1e-4
epochs_per_iter = 10
buffer_iterations = 4

[arena]
games = 100
simulations = 400
promotion_threshold = 0.55

[coach]
provider = "grok"
model = "grok-4"
api_key = "env:XAI_API_KEY"

[lichess]
enabled = true
token = "env:LICHESS_TOKEN"
account = "magnusgrok"
accept_variants = ["standard"]
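The `"env:XAI_API_KEY"` values suggest secrets are resolved from the environment at load time. A plausible resolver, assuming the `env:` prefix convention shown above (`resolve_secret` is a hypothetical helper, not necessarily the actual implementation):

```rust
use std::env;

// Resolve a config value: "env:VAR" is read from the environment,
// anything else is taken literally. Hypothetical sketch.
pub fn resolve_secret(value: &str) -> Option<String> {
    match value.strip_prefix("env:") {
        Some(var) => env::var(var).ok(), // None if the variable is unset
        None => Some(value.to_string()),
    }
}
```

This keeps tokens out of the TOML file while letting non-secret fields stay inline.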

Quick Start

# Clone and build
git clone https://github.com/pranaveight/Milton.git
cd Milton
cargo build --release

# Initialize a fresh run
./target/release/milton init --network random

# Start the loop in the foreground
./target/release/milton loop --config ~/.milton/milton.toml

# Or run a single self-play iteration
./target/release/milton selfplay --games 100 --out data/iter_42

# Train against the latest sample window
./target/release/milton train --window 4 --epochs 10

# Run an ad-hoc arena match between two checkpoints
./target/release/milton arena --a data/iter_41/champion.safetensors \
                              --b data/iter_42/candidate.safetensors \
                              --games 100

# Connect the live champion to Lichess
./target/release/milton lichess --account magnusgrok

Performance Targets

Measured on a single Mac Mini M4 (16 GB), no external compute:

| Metric | Target | Measured |
| --- | --- | --- |
| Self-play games per hour | >= 60 | 78 |
| MCTS positions per second | >= 18,000 | 23,400 |
| Training step (batch 256) | < 110 ms | 84 ms |
| Inference (batch 64, MPS) | < 3 ms | 2.1 ms |
| Iteration wall time (full cycle) | < 90 min | 68 min |
| Memory footprint, idle | < 400 MB | 312 MB |
| Memory footprint, training | < 4 GB | 3.1 GB |

Lichess Bot

The current champion plays as @magnusgrok. Anyone can challenge it. Standard time controls only. Variants are rejected by the challenge filter.

| Setting | Value |
| --- | --- |
| Account | @magnusgrok |
| Variants | Standard |
| Time controls | 1+0 to 30+0 |
| Concurrent games | 8 |
| Reload on champion swap | hot, no game drop |

Games appear in real time on the dashboard at milton.bot along with the live Elo trajectory and per-iteration arena results.


Documentation

| Document | Description |
| --- | --- |
| docs/loop.md | The four-stage cycle in detail |
| docs/selfplay.md | Self-play game generation |
| docs/training.md | Training schedule, replay buffer, optimizer |
| docs/arena.md | Arena match runner and promotion rules |
| docs/network.md | Residual CNN architecture |
| docs/mcts.md | PUCT, Dirichlet noise, tree reuse |
| docs/coach.md | LLM coach integration |
| docs/lichess.md | Lichess bot configuration |
| docs/configuration.md | Full configuration reference |
| milton.md | Engine identity and behavioral directives |


A self-learning chess engine. One machine. One loop. Forever.
