# Milton

A self-learning chess engine that trains itself, gates its own promotions, and plays rated games on Lichess — running on a single Mac Mini.
Self-play. Train. Arena. Deploy. Forever.
- Abstract
- Design Principles
- The Loop
- Stage 1: Self-Play
- Stage 2: Training
- Stage 3: Arena
- Stage 4: Deployment
- Neural Network Architecture
- MCTS Implementation
- LLM Coach Integration
- Configuration
- Quick Start
- Performance Targets
- Lichess Bot
- Documentation
## Abstract

Milton is a recursive, self-improving chess engine. It learns the game from random play, with no opening book, no endgame tablebase, and no scraped grandmaster games. Every line of theory it knows, it discovered itself.
The runtime is a single Rust binary that drives a four-stage cycle — generate self-play games, train the network on the resulting positions, run the candidate against the reigning champion in a head-to-head match, and — if the candidate wins by a sufficient margin — promote it and deploy it as the live opponent on Lichess. The loop never breaks.
Milton is engineered around three claims: that the cycle, not the weights, is the artifact; that an LLM acting as a post-game coach can compress the long tail of self-play; and that Candidate Master strength (2500 Elo) is reachable on consumer hardware with a tight enough loop.
## Design Principles

| Principle | Manifestation |
|---|---|
| The loop is the product | Network architecture is fixed early; the wins come from tightening the cycle. |
| Zero human knowledge | No opening book. No endgame tablebase. No GM games. Strategy emerges from MCTS-improved self-play targets alone. |
| The arena gate is non-negotiable | A candidate network must beat the champion with a score >= 55% to be promoted. Below that, it is discarded. |
| Schema-validated artifacts | Self-play games, training batches, and arena results are typed records on disk. Nothing is parsed twice. |
| Single-machine | One Mac Mini M4. MPS for inference, CPU pool for tree search. No cloud. No queue. No SLURM. |
| Public scoreboard | The current champion is always live on Lichess. Anyone can challenge it. |
## The Loop

```
                   +-------------------+
                   | CHAMPION NETWORK  |
                   +---------+---------+
                             |
            +----------------+----------------+
            |                                 |
            v                                 v
      (1) SELF-PLAY                    (4) DEPLOYMENT
    100 games / iter                 Lichess @magnusgrok
     MCTS, 200 sims                 rated games vs humans
            |                                 ^
            v                                 |
 ~6,000 training samples                      |
            |                                 |
            v                                 |
      (2) TRAINING                        (3) ARENA
      residual CNN                     new vs champion
   policy + value head              100 games, win >= 55%
            |                                 ^
            +--------------> CANDIDATE -------+
                              NETWORK
```
Every stage emits structured artifacts on disk. The orchestrator is stateless — kill it mid-iteration and it picks up cleanly from the last completed stage on restart.
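Because every stage's output lands at a predictable path, the resume logic can be pure filesystem probing. A sketch of how that might look — the artifact filenames other than `candidate.safetensors` are assumptions, not the project's confirmed layout:

```rust
// Hypothetical sketch: infer where iteration N stopped from which artifacts
// already exist on disk. No state file, no database.
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    SelfPlay,
    Train,
    Arena,
    Deploy,
}

fn resume_stage(iter: u64) -> Stage {
    let dir = format!("data/iter_{iter}");
    let exists = |name: &str| Path::new(&dir).join(name).exists();
    if exists("arena.json") {
        Stage::Deploy // arena verdict recorded; only deployment remains
    } else if exists("candidate.safetensors") {
        Stage::Arena // training finished; the candidate awaits its match
    } else if exists("games.bin") {
        Stage::Train // self-play done; samples are ready for the trainer
    } else {
        Stage::SelfPlay // nothing yet; start the iteration from scratch
    }
}
```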
## Stage 1: Self-Play

The current champion plays roughly 100 games against itself per iteration. Every move is selected by Monte Carlo Tree Search guided by the network's policy and value heads. The MCTS-improved visit distribution (not the raw network policy) becomes the training target — this is the core AlphaZero insight.

```rust
// crates/selfplay/src/runner.rs
use crate::mcts::{Search, SearchConfig};
use crate::position::{encode_position, Outcome, Position};
use crate::record::{Game, TrainingSample};
use std::sync::Arc;
// `Network` is the crate's policy/value model (definition omitted in this excerpt).
pub struct SelfPlayConfig {
pub simulations: u32,
pub temperature_moves: u32,
pub dirichlet_alpha: f32,
pub dirichlet_epsilon: f32,
pub resign_threshold: f32,
}
pub async fn play_game(net: Arc<Network>, cfg: &SelfPlayConfig) -> Game {
let mut pos = Position::startpos();
let mut samples: Vec<TrainingSample> = Vec::with_capacity(80);
while !pos.is_terminal() {
        // Temperature 1.0 for the first `temperature_moves` plies, then greedy.
        let temperature = if pos.ply() < cfg.temperature_moves { 1.0 } else { 0.0 };
let mut search = Search::new(net.clone(), SearchConfig::from(cfg));
let result = search.run(&pos, cfg.simulations).await;
samples.push(TrainingSample {
position: encode_position(&pos),
policy: result.visit_distribution(),
value: 0.0, // labeled at game end
ply: pos.ply(),
});
let mv = result.sample_move(temperature);
pos.play_unchecked(&mv);
if let Some(outcome) = result.early_resign(cfg.resign_threshold) {
return label(samples, outcome);
}
}
label(samples, pos.outcome())
}
fn label(mut samples: Vec<TrainingSample>, outcome: Outcome) -> Game {
let final_value = outcome.as_value();
for (i, sample) in samples.iter_mut().enumerate() {
sample.value = if i % 2 == 0 { final_value } else { -final_value };
}
Game { samples, outcome }
}
```

A standard iteration produces ~6,000 training samples. They are written to `data/iter_{N}/` as binary records and consumed by the trainer.
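The "typed records, parsed once" principle from the design table comes down to plain serde structs on disk. A sketch of what writing one game record could look like — `bincode` and the exact field layout are assumptions, not the project's confirmed wire format:

```rust
// Hypothetical sketch: persist one self-play game as a typed binary record.
use serde::{Deserialize, Serialize};
use std::fs::File;
use std::io::{BufWriter, Write};

#[derive(Serialize, Deserialize)]
struct TrainingSample {
    position: Vec<f32>, // flattened 18x8x8 input planes
    policy: Vec<f32>,   // MCTS visit distribution over the 4,672 move slots
    value: f32,         // final outcome from the side to move
    ply: u32,
}

#[derive(Serialize, Deserialize)]
struct GameRecord {
    samples: Vec<TrainingSample>,
    outcome: i8, // +1 / 0 / -1 from White's perspective
}

fn write_game(game: &GameRecord, iter: u64, idx: u32) -> anyhow::Result<()> {
    let path = format!("data/iter_{iter}/game_{idx:05}.bin");
    let mut w = BufWriter::new(File::create(path)?);
    w.write_all(&bincode::serialize(game)?)?;
    Ok(())
}
```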
## Stage 2: Training

A 9.6M-parameter residual CNN updates on the freshly generated samples. The loss is cross-entropy on the policy head plus MSE on the value head, weighted equally. Training runs on Apple Silicon's Metal Performance Shaders (MPS) backend through tch-rs.

```rust
// crates/train/src/step.rs
use tch::{nn, nn::OptimizerConfig, Tensor};
// `Network`, `Batch`, and `StepLoss` are crate-local types (definitions omitted in this excerpt).
pub struct TrainStep<'a> {
pub net: &'a Network,
pub opt: &'a mut nn::Optimizer,
pub batch: &'a Batch,
pub policy_weight: f64,
pub value_weight: f64,
}
pub fn step(s: &mut TrainStep) -> StepLoss {
let (policy_logits, value) = s.net.forward(&s.batch.positions);
let log_p = policy_logits.log_softmax(-1, tch::Kind::Float);
let policy_loss = -(log_p * &s.batch.policies).sum_dim_intlist(
Some(vec![1].as_slice()),
false,
tch::Kind::Float,
).mean(tch::Kind::Float);
let value_loss = (value - &s.batch.values).pow_tensor_scalar(2).mean(tch::Kind::Float);
let total = &policy_loss * s.policy_weight + &value_loss * s.value_weight;
s.opt.zero_grad();
total.backward();
s.opt.clip_grad_norm(1.0);
s.opt.step();
StepLoss {
total: total.double_value(&[]),
policy: policy_loss.double_value(&[]),
value: value_loss.double_value(&[]),
}
}
```

A typical training run makes 8 to 12 epochs over the latest sample window (the last 4 iterations) and ends with a candidate checkpoint at `data/iter_{N}/candidate.safetensors`.
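The rolling window itself is simple to assemble from the on-disk artifacts. A sketch, assuming a `load_iteration` helper that deserializes every record under `data/iter_{N}/` (both the helper and the shuffle policy are assumptions):

```rust
// Hypothetical sketch: gather the last `window` iterations of samples and
// shuffle once so each epoch sees all iterations interleaved.
use rand::seq::SliceRandom;

fn build_window(current_iter: u64, window: u64) -> Vec<TrainingSample> {
    let start = current_iter.saturating_sub(window - 1);
    let mut samples = Vec::new();
    for iter in start..=current_iter {
        samples.extend(load_iteration(iter)); // assumed: reads data/iter_{iter}/ records
    }
    samples.shuffle(&mut rand::thread_rng());
    samples
}
```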
## Stage 3: Arena

The candidate plays the reigning champion in a 100-game match. Colors alternate. Both engines run with the same MCTS budget. Promotion requires a score at or above 55% (counting draws as half a point), roughly a +35 Elo edge.

```rust
// crates/arena/src/match_runner.rs
use crate::engine::Engine;
use crate::game::{play_engine_game, Color, GameOutcome};
pub struct ArenaResult {
pub wins: u32,
pub losses: u32,
pub draws: u32,
pub win_rate: f32,
pub promote: bool,
}
pub async fn run_match(
challenger: Engine,
champion: Engine,
games: u32,
threshold: f32,
) -> ArenaResult {
let mut wins = 0;
let mut losses = 0;
let mut draws = 0;
for g in 0..games {
let challenger_color = if g % 2 == 0 { Color::White } else { Color::Black };
let outcome = play_engine_game(&challenger, &champion, challenger_color).await;
match outcome {
GameOutcome::Win(c) if c == challenger_color => wins += 1,
GameOutcome::Win(_) => losses += 1,
GameOutcome::Draw => draws += 1,
}
}
let win_rate = (wins as f32 + 0.5 * draws as f32) / games as f32;
ArenaResult {
wins,
losses,
draws,
win_rate,
promote: win_rate >= threshold,
}
}
```

If the gate passes, the candidate becomes the new champion and the next iteration's self-play immediately uses it. If not, the candidate is archived and self-play continues from the existing champion — the run is never wasted; the samples still roll forward into the replay buffer.
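Promotion itself can be a single atomic pointer swap, which is what makes the hot-swap deployment in Stage 4 safe. A sketch under the file layout used in this README — the temp-link dance is an assumption about the implementation:

```rust
// Hypothetical sketch: promote a candidate by atomically retargeting the
// champion pointer. rename(2) over an existing path is atomic on POSIX,
// so readers see either the old champion or the new one, never neither.
use std::os::unix::fs::symlink;
use std::path::Path;

fn promote(candidate: &Path, pointer: &Path) -> std::io::Result<()> {
    let tmp = pointer.with_extension("tmp");
    let _ = std::fs::remove_file(&tmp); // clear any stale temp link from a crash
    symlink(candidate, &tmp)?;          // stage the new pointer beside the old one
    std::fs::rename(&tmp, pointer)      // atomic swap of the pointer itself
}
```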
## Stage 4: Deployment

The current champion is always live on Lichess as @magnusgrok. Deployment is a hot swap: the bot daemon watches the `champion.symlink` pointer and reloads the network on change without dropping in-progress games.

```rust
// crates/lichess/src/bot.rs
use crate::stream::{Event, LichessClient};
use crate::engine::Engine;
use std::sync::Arc;
use tokio::sync::RwLock;
pub async fn run_bot(engine: Arc<RwLock<Engine>>, client: LichessClient) -> anyhow::Result<()> {
let mut events = client.stream_events().await?;
while let Some(event) = events.next().await? {
match event {
Event::Challenge(c) if c.variant.is_standard() => {
client.accept_challenge(&c.id).await?;
}
Event::GameStart(g) => {
let engine = engine.clone();
let client = client.clone();
tokio::spawn(async move {
if let Err(e) = play_game(engine, client, g).await {
tracing::warn!(?e, "game ended with error");
}
});
}
Event::ChampionSwap(path) => {
tracing::info!(?path, "hot-swapping champion network");
engine.write().await.reload(&path)?;
}
_ => {}
}
}
Ok(())
}
```

Lichess games do not feed the training set — they are a public scoreboard, not an oracle. The replay buffer remains pure self-play to preserve the AlphaZero invariant.
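The `ChampionSwap` event consumed above would come from the watch on the `champion.symlink` pointer. A minimal polling sketch using only the standard library and tokio — a real daemon might use a filesystem-watcher crate instead, and the channel wiring here is an assumption:

```rust
// Hypothetical sketch: poll the champion pointer and report when its target
// changes; downstream code maps the new path to Event::ChampionSwap.
use std::path::PathBuf;
use std::time::Duration;
use tokio::sync::mpsc::Sender;

pub async fn watch_champion(pointer: PathBuf, tx: Sender<PathBuf>) {
    let mut last: Option<PathBuf> = None;
    loop {
        if let Ok(target) = std::fs::read_link(&pointer) {
            if last.as_ref() != Some(&target) {
                last = Some(target.clone());
                let _ = tx.send(target).await; // ignore send errors on shutdown
            }
        }
        tokio::time::sleep(Duration::from_secs(2)).await;
    }
}
```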
## Neural Network Architecture

| Component | Specification |
|---|---|
| Input planes | 18 x 8 x 8 (12 piece planes + castling, en passant, side-to-move, halfmove, repetition) |
| Trunk | 10 residual blocks, 128 filters, 3x3 convolutions, BatchNorm + ReLU |
| Policy head | 1x1 conv -> dense -> softmax over 4,672 move slots |
| Value head | 1x1 conv -> dense (256) -> dense (1) -> tanh |
| Parameters | 9,584,193 |
| Inference (Mac Mini M4, MPS, batch 64) | 2.1 ms / batch |
| Training throughput | ~14k samples / minute |

```rust
// crates/net/src/model.rs
use tch::nn;

pub fn build(vs: &nn::Path) -> Network {
let conv = nn::conv2d(vs / "stem", 18, 128, 3, nn::ConvConfig { padding: 1, ..Default::default() });
let bn = nn::batch_norm2d(vs / "stem_bn", 128, Default::default());
let blocks: Vec<ResidualBlock> = (0..10)
.map(|i| ResidualBlock::new(&(vs / format!("res_{i}")), 128))
.collect();
let policy_head = PolicyHead::new(&(vs / "policy"), 128, 4672);
let value_head = ValueHead::new(&(vs / "value"), 128);
Network { conv, bn, blocks, policy_head, value_head }
}
```

## MCTS Implementation

Tree search uses the PUCT formula from AlphaZero: each child's score is its empirical action value plus an exploration bonus weighted by the network's prior and the parent's visit count. Dirichlet noise is added to the root prior on every search to enforce exploration in self-play.

```rust
// crates/mcts/src/select.rs
use rand::Rng;
use rand_distr::{Dirichlet, Distribution};
// `Node`, `Arena`, and `Edge` are the crate's tree types (definitions omitted in this excerpt).

#[inline]
pub fn puct_score(child: &Node, parent_visits: f32, c_puct: f32) -> f32 {
let q = if child.visits == 0 {
0.0
} else {
child.value_sum / child.visits as f32
};
let u = c_puct * child.prior * parent_visits.sqrt() / (1.0 + child.visits as f32);
q + u
}
pub fn select_child<'a>(parent: &'a Node, arena: &'a Arena, c_puct: f32) -> &'a Edge {
let pv = parent.visits as f32;
parent.edges.iter()
.max_by(|a, b| {
puct_score(&arena[a.child], pv, c_puct)
.partial_cmp(&puct_score(&arena[b.child], pv, c_puct))
.unwrap()
})
.expect("non-terminal nodes always have at least one edge")
}
pub fn add_dirichlet_noise(priors: &mut [f32], alpha: f32, epsilon: f32, rng: &mut impl Rng) {
let dist = Dirichlet::new_with_size(alpha, priors.len()).unwrap();
let noise: Vec<f32> = dist.sample(rng);
for (p, n) in priors.iter_mut().zip(noise) {
*p = (1.0 - epsilon) * *p + epsilon * n;
}
}
```

| Hyperparameter | Self-Play | Arena |
|---|---|---|
| Simulations / move | 200 | 400 |
| `c_puct` | 1.5 | 1.5 |
| Dirichlet alpha | 0.3 | 0.0 |
| Dirichlet epsilon | 0.25 | 0.0 |
| Temperature (first 30 plies) | 1.0 | 0.0 |
| Resign threshold | -0.95 | disabled |
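Temperature controls how the played move is drawn from the root visit counts. The `sample_move(temperature)` call in the self-play runner presumably does something equivalent to the standard AlphaZero scheme sketched here — the free function is illustrative, not the project's API:

```rust
// Hypothetical sketch: choose a move index from root visit counts.
// T = 1.0 samples proportionally to visits; T = 0.0 plays the most-visited move.
use rand::distributions::WeightedIndex;
use rand::prelude::*;

fn pick_move(visits: &[u32], temperature: f32, rng: &mut impl Rng) -> usize {
    if temperature == 0.0 {
        // Greedy: argmax over visit counts.
        return visits
            .iter()
            .enumerate()
            .max_by_key(|(_, v)| **v)
            .map(|(i, _)| i)
            .expect("root always has at least one child");
    }
    // Proportional sampling with visits^(1/T) weights.
    let weights: Vec<f64> = visits
        .iter()
        .map(|&v| (v as f64).powf(1.0 / temperature as f64))
        .collect();
    let dist = WeightedIndex::new(&weights).expect("at least one child was visited");
    dist.sample(rng)
}
```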
## LLM Coach Integration

Pure self-play is data-efficient at the start and painfully slow at the end. The engine plateaus once its blunders become subtle. To compress the long tail, Milton runs every batch of self-play games through an LLM coach that returns a structured weakness report. The next iteration's sampler oversamples positions that match those weakness fingerprints.

```rust
// crates/coach/src/grok.rs
use serde::{Deserialize, Serialize};
#[derive(Serialize, Deserialize, Debug)]
pub struct WeaknessReport {
pub weaknesses: Vec<Weakness>,
}
#[derive(Serialize, Deserialize, Debug)]
pub struct Weakness {
pub category: WeaknessCategory,
pub description: String,
pub fingerprint: PositionFingerprint,
pub severity: f32,
}
#[derive(Serialize, Deserialize, Debug)]
#[serde(rename_all = "snake_case")]
pub enum WeaknessCategory {
OpeningRepertoire,
PawnStructure,
PieceCoordination,
KingSafety,
EndgameTechnique,
TacticalAwareness,
}
pub async fn analyze(games: &[Pgn], client: &CoachClient) -> anyhow::Result<WeaknessReport> {
let prompt = format!(
"You are an elite chess coach reviewing {} games played by a single engine. \
Identify systematic positional weaknesses, not single-move blunders. \
Return structured JSON matching the WeaknessReport schema.",
games.len(),
);
let response = client
.completion()
.system(prompt)
.user(serialize_pgns(games))
.response_schema::<WeaknessReport>()
.send()
.await?;
Ok(response.parsed)
}
```

Each `Weakness` carries a `PositionFingerprint` that maps to a deterministic feature filter over the replay buffer. The next sampler weights positions matching the fingerprint by `1.0 + severity`, as sketched below.
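The oversampling pass might look like the following — a sketch only: the `matches` predicate on `PositionFingerprint` and the buffer layout are assumptions, not the project's actual API.

```rust
// Hypothetical sketch: draw a training batch that oversamples positions
// matching coach-reported weaknesses, weighting matches by 1.0 + severity.
use rand::distributions::WeightedIndex;
use rand::prelude::*;

fn sample_batch<'a>(
    buffer: &'a [TrainingSample],
    weaknesses: &[Weakness],
    batch_size: usize,
    rng: &mut impl Rng,
) -> Vec<&'a TrainingSample> {
    let weights: Vec<f32> = buffer
        .iter()
        .map(|s| {
            // Base weight 1.0; each matching fingerprint adds its severity.
            1.0 + weaknesses
                .iter()
                .filter(|w| w.fingerprint.matches(s)) // `matches` is assumed
                .map(|w| w.severity)
                .sum::<f32>()
        })
        .collect();
    let dist = WeightedIndex::new(&weights).expect("weights are strictly positive");
    (0..batch_size).map(|_| &buffer[dist.sample(rng)]).collect()
}
```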
## Configuration

A single file at `~/.milton/milton.toml`:

```toml
[loop]
iterations = 0 # 0 = run forever
samples_per_iter = 6000
temperature_moves = 30
[selfplay]
games = 100
simulations = 200
dirichlet_alpha = 0.3
dirichlet_epsilon = 0.25
[train]
batch_size = 256
learning_rate = 1e-3
weight_decay = 1e-4
epochs_per_iter = 10
buffer_iterations = 4
[arena]
games = 100
simulations = 400
promotion_threshold = 0.55
[coach]
provider = "grok"
model = "grok-4"
api_key = "env:XAI_API_KEY"
[lichess]
enabled = true
token = "env:LICHESS_TOKEN"
account = "magnusgrok"
accept_variants = ["standard"]
```

## Quick Start

```bash
# Clone and build
git clone https://github.com/pranaveight/Milton.git
cd Milton
cargo build --release
# Initialize a fresh run
./target/release/milton init --network random
# Start the loop in the foreground
./target/release/milton loop --config ~/.milton/milton.toml
# Or run a single self-play iteration
./target/release/milton selfplay --games 100 --out data/iter_42
# Train against the latest sample window
./target/release/milton train --window 4 --epochs 10
# Run an ad-hoc arena match between two checkpoints
./target/release/milton arena --a data/iter_41/champion.safetensors \
--b data/iter_42/candidate.safetensors \
--games 100
# Connect the live champion to Lichess
./target/release/milton lichess --account magnusgrok
```

## Performance Targets

Measured on a single Mac Mini M4 (16 GB), no external compute:
| Metric | Target | Measured |
|---|---|---|
| Self-play games per hour | >= 60 | 78 |
| MCTS positions per second | >= 18,000 | 23,400 |
| Training step (batch 256) | < 110 ms | 84 ms |
| Inference (batch 64, MPS) | < 3 ms | 2.1 ms |
| Iteration wall time (full cycle) | < 90 min | 68 min |
| Memory footprint, idle | < 400 MB | 312 MB |
| Memory footprint, training | < 4 GB | 3.1 GB |
## Lichess Bot

The current champion plays as @magnusgrok. Anyone can challenge it. Time controls run from 1+0 to 30+0; variant challenges are rejected by the challenge filter (sketched after the table below).
| Setting | Value |
|---|---|
| Account | @magnusgrok |
| Variants | Standard |
| Time controls | 1+0 to 30+0 |
| Concurrent games | 8 |
| Reload on champion swap | hot, no game drop |
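The challenge filter implied by the table is a few lines of predicate. A sketch — the `IncomingChallenge` fields are illustrative, not the actual Lichess client types:

```rust
// Hypothetical sketch: accept only standard chess within the advertised
// 1+0 to 30+0 window (initial clock 60..=1800 seconds, taken literally).
pub struct IncomingChallenge {
    pub variant: String,      // "standard", "chess960", ...
    pub clock_initial: u32,   // seconds
    pub clock_increment: u32, // seconds
}

pub fn should_accept(c: &IncomingChallenge) -> bool {
    c.variant == "standard"
        && c.clock_increment == 0
        && (60..=1800).contains(&c.clock_initial)
}
```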
Games appear in real time on the dashboard at milton.bot along with the live Elo trajectory and per-iteration arena results.
## Documentation

| Document | Description |
|---|---|
| `docs/loop.md` | The four-stage cycle in detail |
| `docs/selfplay.md` | Self-play game generation |
| `docs/training.md` | Training schedule, replay buffer, optimizer |
| `docs/arena.md` | Arena match runner and promotion rules |
| `docs/network.md` | Residual CNN architecture |
| `docs/mcts.md` | PUCT, Dirichlet noise, tree reuse |
| `docs/coach.md` | LLM coach integration |
| `docs/lichess.md` | Lichess bot configuration |
| `docs/configuration.md` | Full configuration reference |
| `milton.md` | Engine identity and behavioral directives |
- Site: milton.bot
- Lichess: @magnusgrok
- Twitter: @pranaveight
A self-learning chess engine. One machine. One loop. Forever.