To kindle is to start a fire from nothing. This agent does the same with intelligence — a cold network that bootstraps its own understanding from environment-agnostic primitives, with no pretraining and no handed-down reward.
Built on meganeura — a cross-platform Rust neural network library with GPU-accelerated training and inference via blade-graphics.
- Vision
- Architecture Overview
- Modules
- Reward Circuit (Frozen)
- Temporal Buffer
- Tri-Level Learning Loop
- Continual Learning Strategy
- Meganeura Confidence Plan
- Milestones
- Diagnostics and Observability
- Python Bindings
- Repository Structure
- Dependencies
Three beats.
Cold start. The network is initialised Xavier-uniform and sees its first observation with no prior. There is no pretraining, no offline dataset, no demonstration corpus. Intelligence has to ignite from contact with the environment.
Self-grounded reward. Reward is not handed down. It is derived from four primitives the agent can evaluate without external supervision: surprise (did the world do what I expected?), novelty (have I been here before?), homeostatic balance (am I in a healthy range?), and order (am I making structure?). The circuit that combines these is frozen — its definition of "good" does not drift with the policy.
Emergent order. Exploration and consolidation are duals. Novelty rewards the agent for venturing into unseen phase space; order rewards it for concentrating its recent trajectory into a smaller subset of the digested observation space than its historical average. A healthy agent oscillates between them on its own.
Stability and correctness come first, performance second. This is a research and engineering project.
Raw Observation (o_t)
│
┌────▼────────────────┐
│ Env Adapter │ obs_dim → OBS_TOKEN_DIM (universal)
└────┬────────────────┘
│ obs_token
┌────▼────────────────┐
│ Encoder E │──────────────────────────────────┐
└────┬────────────────┘ │
│ latent z_t │
┌────▼─────────────────────────────────────────────┐ │
│ World Model W(z_t, a_t) → ẑ_{t+1} │ │
│ Loss: || ẑ_{t+1} − stop_grad(z_{t+1}) || │ │
└────┬─────────────────────────────────────────────┘ │
│ │
┌────▼─────────────────────────────────────────────┐ │
│ Reward Circuit (FROZEN) │ │
│ Four primitives → r_t │ │
│ · surprise (world model prediction err) │ │
│ · novelty (inverse-sqrt visit count) │ │
│ · homeostatic (env-defined signals) │ │
│ · order (causal entropy reduction) │ │
└────┬─────────────────────────────────────────────┘ │
│ r_t │
┌────▼─────────────────────────────────────────────┐ │
│ Credit Assigner C(r_t, history[t-H:t]) │ │
│ Causal attention → α_i per past timestep │ │
│ credit_i = r_t × α_i │ │
└────┬─────────────────────────────────────────────┘ │
│ credit[t-H:t] │
┌────▼─────────────────────────────────────────────┐ │
│ Value Head V(z_t) → V̂ │◄─────┘
└────┬─────────────────────────────────────────────┘
│
┌────▼─────────────────────────────────────────────┐
│ Policy π(z_t) → action distribution │
└────┬─────────────────────────────────────────────┘
│
┌────▼─────────────────────────────────────────────┐
│ Env Adapter │ MAX_ACTION_DIM → env-native action
└──────────────────────┘
All modules except the reward circuit train continuously through experience. The adapter layer keeps the core graph's tensor shapes universal (OBS_TOKEN_DIM, MAX_ACTION_DIM) so the agent can hop between environments at runtime without recompiling any GPU graph. See docs/universal-actions.md for the cross-environment design.
Converts observation tokens into a compact latent z_t. Shared backbone — all other modules consume z_t, not raw observations. The encoder is forced to learn representations that are simultaneously useful for world modelling, credit attribution, and policy generation. Gradient flow: world model loss (primary), policy gradient (secondary), value TD error (secondary).
Forward dynamics predictor W(z_t, a_t) → ẑ_{t+1} with MSE loss against stop_grad(z_{t+1}). The stop-gradient prevents encoder collapse (BYOL / Dreamer v3 trick). Serves two roles: self-supervised training signal, and surprise-component input to the reward circuit.
Answers which past actions caused this reward? via causal self-attention over the recent history buffer:
CreditNet(r_t, [(z_{t-H}, a_{t-H}), …, (z_t, a_t)]) → α ∈ R^H
credit_i = r_t × α_i (α softmaxed; attention is strictly causal)
Trained contrastively: when similar latent states at different times receive different rewards, CreditNet should attribute the difference to the diverging action sequences. The diagnostic H_eff = Σ_i (i × α_i) tracks effective temporal scope — growing H_eff means the agent is learning longer-horizon consequences.
Small MLP V̂(z_t) (actor-critic baseline) and stochastic policy π(z_t). Both branches use the same universal continuous-Gaussian graph — for discrete envs the adapter interprets the first n head dims as logits and samples categorically. Updated by credit-weighted policy gradient with an entropy bonus that never anneals below a floor (preserving exploration forever).
Four primitives, one frozen weighted sum:
r_t = w_s·r_surprise + w_n·r_novelty + w_h·r_homeo + w_o·r_order
The circuit receives no gradient updates. Freezing it is a stability claim: because what is good doesn't drift with the policy, the whole tri-level system is less prone to runaway feedback during early training. Unfreezing is explicitly scoped to a post-stability milestone.
| Primitive | Rewards | Decays when | Role |
|---|---|---|---|
surprise ‖ẑ−z‖ |
world behaves unexpectedly | world model converges | shapes z to be predictable |
novelty 1/√N(z) |
unvisited latent regions | region is revisited | exploration pressure |
homeostatic −Σ deviations |
staying in env-defined targets | agent is alive and happy | survival pressure |
order H_ref − H_recent |
concentrating recent trajectory vs historical baseline | recent converges to reference | consolidation pressure |
Novelty and order are the two ends of the exploration/consolidation axis. Novelty rewards entropy in the visited-state distribution (go somewhere new); order rewards negentropy in the recent-observation distribution (make the place you're in legible). A healthy agent trades off between them, and the weights anneal accordingly: w_n decays as phase space is covered; w_o can remain on indefinitely because its baseline moves with the agent.
At agent construction the circuit samples a frozen random linear digest φ: R^{OBS_TOKEN_DIM} → R^{d} (small matrix, d = 4) and fixed per-dim bucket edges. φ never trains. Each step:
bucket_id = quantize(tanh(φ · obs_token_t))- Push onto two ring buffers —
recent(W = 64) andreference(W_ref = 512). H_recent,H_reference= Shannon entropies of the bucket histograms.r_order = H_reference − H_recent(zero during reference warmup).
Why this formulation:
- Grounded. φ and the bucket grid are fixed at init, preserving the frozen-circuit invariant.
- Causal. The reference is the agent's own past — every step's signal depends only on the agent's recent actions vs its own longer-term behaviour.
- Environment-agnostic. Operates on adapter-normalized obs tokens, not env-specific feature shapes.
- Ungameable by latent collapse. The digest reads the observation token, not the encoder latent, so the encoder cannot cheaply maximise order by shrinking its representation.
The old −H(|o_i| / Σ|o_j|) definition is gone — it was trivially maxed by any one-hot observation and mostly measured the env's encoding scheme, not the agent's behaviour.
A single shared circular buffer is the agent's sole persistent memory:
pub struct ExperienceBuffer {
observations: RingBuffer<Vec<f32>>, // obs_token
latents: RingBuffer<Vec<f32>>, // z_t
actions: RingBuffer<Vec<f32>>, // action_token
rewards: RingBuffer<f32>,
credits: RingBuffer<f32>,
pred_errors: RingBuffer<f32>,
env_ids: RingBuffer<u32>, // which env produced this transition
env_boundary: RingBuffer<bool>, // true on the first step after switch_env
visit_counts: HashMap<StateKey, u32>, // for novelty
}No episodic boundary — experience accumulates continuously. Cross-env hops are tagged so the credit assigner and world-model replay don't try to attribute dynamics or reward across a switch. Replay mixing: ~20% of steps are shadowed by a gradient update on a random historical sample, which re-anchors the encoder against catastrophic forgetting without needing EWC machinery.
| Module | LR | Frequency | Notes |
|---|---|---|---|
| Encoder + World Model | lr_wm (base) |
every step | self-supervised; most stable |
| Credit Assigner | lr_credit (0.3× base) |
every step | slower; contrastive signal is noisy early |
| Policy + Value | lr_policy (0.5× base) |
every step | gated on warmup + entropy floor |
Policy updates are suppressed for the first N_warmup steps (so the value baseline can stabilise) and whenever policy entropy falls below a floor (so the agent can always explore).
- Replay mixing — 20% shadow gradient steps on random historical samples, active from day one.
- Representation drift monitor — a held-out probe set of observations is captured at warmup completion; every
drift_intervalsteps the encoder's output on the probe is compared against the stored reference. Excessive drift auto-scales the encoder LR down; low drift lets it recover. - Entropy floor — the policy entropy bonus is never annealed below a floor. The agent never becomes fully deterministic.
- Frozen reward circuit — the definition of good does not drift.
Meganeura is young. Using it as the training backbone for a complex multi-module system requires building active confidence in its correctness. This is part of the project, not a precondition.
Finite-difference gradient checks on every graph primitive used (linear, attention, layer norm, softmax). Tolerance: relative error < 1e-4 for f32. Lives in kindle/tests/grad_check.rs (planned) — currently covered indirectly by the convergence canaries.
OptLevel::{None, Full} flag isolates meganeura's e-graph search. Every module is tested with both levels and outputs compared within tight tolerance. Tests in kindle/tests/opt_parity.rs. The None path also serves as a debugging escape hatch during training-time anomalies.
| Canary | Expected | Failure signal |
|---|---|---|
| XOR classification (MLP, 4 samples) | Loss → 0 in < 500 steps | Broken optimizer or backprop |
| Next-step prediction on random walk | Prediction error < 0.01 within 10k steps | Broken world model training |
| Full kindle encoder + world model | Loss decreases by ≥ 2× on deterministic env | Broken graph fusion or encoder gradient |
These live in kindle/tests/canaries.rs. GPU-gated tests run via cargo test -- --ignored.
A one-time numerical cross-check of the causal-attention credit assigner against a PyTorch reference was performed during Phase 2 and archived. The PyTorch code is not maintained; it was a verification artifact, not a development dependency.
A long training run (target: 1M steps on a single env, multi-env variant to follow) logs loss curves, gradient norms, H_eff, drift, throughput. Re-run at each milestone; results committed to docs/stability_runs/.
Outcome gates, not calendar weeks. Each gate has a measurable exit condition; we move on when the gate closes.
- M1 — World model ignites.
loss_world_modeldecreases monotonically on random-walk and GridWorld; reaches < 0.01 prediction error within 10k steps. Status: closed. - M2 — Reward signal carries information. All four primitives produce non-zero, finite values that vary across states;
reward_meanseparates trajectories that lead to homeostatic violations from those that don't. Status: closed. - M3 — Policy learns on its own signal. Kindle agent solves CartPole from a cold start using intrinsic reward only (mean episode length > 195 within 50k steps). Status: open.
- M4 — Cross-env generalisation. Agent hops GridWorld → CartPole → MountainCar → Taxi → Acrobot → Pendulum without divergence; encoder representations remain meaningful after switches (drift < threshold). Status: partial — no divergence, generalisation TBD.
- M5 — Long-run stability. 1M-step run on real GPU hardware with no NaN, no gradient explosion,
H_efftrending upward, entropy above floor. Status: open. - M6 — Learnable reward circuit (unfreeze). Only attempted once M3–M5 are closed. Adds gradient updates to the reward weights under tight constraints documented at the time.
Structured JSON diagnostics every step (no external dependency required):
| Metric | What | Healthy range |
|---|---|---|
loss_world_model |
World model prediction MSE | Decreasing over time |
loss_policy |
Policy gradient loss | Noisy; trending down |
loss_credit |
Credit contrastive loss | Slowly decreasing |
loss_replay |
MSE on replay batches | Within an order of loss_wm |
reward_mean |
Mean r_t over window | Increasing over time |
reward_surprise |
Surprise component | Decreasing as world is learned |
reward_novelty |
Novelty component | Decreasing as space is explored |
reward_homeo |
Homeostatic component | Near zero when agent is healthy |
reward_order |
Order component (H_ref − H_recent) |
Positive when concentrating |
h_eff |
Effective credit scope (steps) | Increasing over training |
policy_entropy |
Policy distribution entropy | Above floor; not collapsing |
repr_drift |
Encoder drift on probe set | Below threshold |
buffer_len |
Experience buffer length | Up to buffer_capacity |
Kindle ships an optional pyo3 extension under python/, built via maturin:
pip install maturin
cd python && maturin develop --releaseThen in Python:
import gymnasium as gym
import kindle
env = gym.make("CartPole-v1")
agent = kindle.Agent(obs_dim=4, num_actions=2, env_id=0, seed=0)
# A tiny adapter that returns obs as a plain list of floats.
class Wrapped:
def __init__(self, env): self.env = env
def reset(self):
obs, _ = self.env.reset()
return [float(x) for x in obs]
def step(self, action):
obs, r, term, trunc, info = self.env.step(int(action) % 2)
if term or trunc:
obs, _ = self.env.reset()
return [float(x) for x in obs], float(r), bool(term), bool(trunc), info
agent.train(Wrapped(env), steps=10_000)
print(agent.diagnostics())The Python extension is a thin wrapper — training still runs on GPU through meganeura. Python is for scripting and for dropping kindle into a gymnasium loop; it is not a second implementation.
This is a Cargo workspace.
kindle/
├── Cargo.toml # [workspace]
├── kindle/ # core library crate (publishable)
│ ├── Cargo.toml
│ ├── src/ # agent, encoder, world_model, reward, credit, policy, …
│ └── tests/ # canaries, opt_parity, reward_signal
├── kindle-gym/ # Rust mini-gym + runnable examples
│ ├── Cargo.toml
│ ├── src/ # grid_world, cart_pole, acrobot, …
│ ├── examples/ # cargo run -p kindle-gym --example cart_pole
│ └── tests/ # stability (GPU-gated)
├── python/ # pyo3 cdylib (maturin; excluded from workspace)
│ ├── Cargo.toml
│ ├── pyproject.toml
│ ├── src/lib.rs
│ ├── kindle/__init__.py
│ └── tests/
├── docs/
│ ├── universal-actions.md
│ └── stability_runs/
└── .github/workflows/ci.yml
| Crate | Role |
|---|---|
meganeura |
Neural network training and inference |
blade-graphics |
GPU backend (Vulkan / Metal) |
rand |
Sampling, exploration noise |
serde / serde_json |
Diagnostic logging |
hashbrown |
Fast hash map for novelty visit counts |
pyo3 (python only) |
Optional Python bindings |
The core Rust crate has no Python dependency. The python/ crate is an optional extension module.