What if you could optimize any software system the same way you train a neural network?
AutoPilot is a PyTorch/Lightning-inspired framework for generalized optimization. It brings the rigor and developer experience of deep learning to non-differentiable systems. Structured feedback replaces numerical gradients. State mutations (like code edits or config updates) replace weight updates. The same forward -> loss -> backward -> optimizer.step() loop that trains neural networks now optimizes prompts, heuristics, rule engines, agents, and configurations -- deterministically, with memory, rollback, and policy gating.
Building complex, non-differentiable systems -- like AI agents, RAG pipelines, fraud detection heuristics, or rule-based engines -- is a manual, informal process today. You tweak a prompt or a regex rule, run the system, look at the output, decide if it got better, and repeat.
This process lacks the structured feedback loop that made deep learning iteration so fast:
- No Memory: There is no automatic log of what was already tried. You often re-try the same failed strategy multiple times.
- No Structured Feedback: Evaluation is often "looks right to me." There is no quantitative tracking of metrics across held-out validation sets.
- No Automatic Rollback: When a change makes things worse, you undo it by hand. If you're not sure, you guess.
- No Scalability: One person manually iterating is slow. There is no way to run this overnight, no way to hand it to an autonomous system, and no way to reproduce what happened three experiments ago.
Karpathy's autoresearch proved the loop works: give an agent code, let it modify and evaluate, keep improvements, discard regressions, repeat. It ran 100 experiments overnight on a single file. But the entire orchestration lives in a markdown prompt -- when to keep, when to discard, how to log results, and when to revert are all natural-language instructions the agent must interpret correctly every time.
AutoPilot solves this by formalizing the iteration loop into the same structural abstractions that powered the deep learning revolution.
Optimizing any iterative system follows the same structure as training a neural network. AutoPilot formalizes this mapping into a real, typed interface.
In deep learning, you pass data through a model (forward pass). A loss function scores the output. Backpropagation computes gradients that explain how parameters should change. An optimizer reads those gradients and updates the weights. You repeat this in epochs, validate on held-out data, and checkpoint good states.
AutoPilot applies this exact structure to general software optimization:
- `Module` is your system (agent, rule engine, pipeline), exactly like `nn.Module` (see the sketch after this list).
- `Loss` wraps an evaluator (`Judge`, profiler, or test suite) that produces structured feedback (gradients).
- `Parameter` marks what can be edited (prompts, JSON configs, source files via `PathParameter`).
- `Optimizer` applies changes based on gradients -- this can be an AI coding agent or a deterministic algorithm.
- `backward()` propagates structured feedback through the computation graph.
- `step()` triggers the update to the underlying parameters.
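To ground the mapping, here is a minimal sketch of a `Module` whose only editable state is a prompt file. The `autopilot.core.parameter` import path, the `PathParameter` constructor arguments, and the `run_llm` helper are illustrative assumptions, not confirmed API:

```python
from autopilot.core.module import Module
from autopilot.core.parameter import PathParameter  # import path assumed

class SupportAgent(Module):
    """A system whose only optimizable state is its system prompt."""

    def __init__(self):
        super().__init__()
        # Marks the prompt file as editable state; the optimizer may rewrite
        # it between epochs. Constructor signature is illustrative.
        self.prompt = PathParameter("prompts/system.md")

    def forward(self, batch):
        # Run the system on a batch of eval items; run_llm is a hypothetical
        # stand-in for your actual inference call.
        return [run_llm(self.prompt.render(), item) for item in batch]
```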
The difference is in what flows through the loop. Gradients can be text, JSON, or any arbitrary Python object. Weight updates can be code edits, file rewrites, or config tweaks. But the structure, the separation of concerns, and the lifecycle are identical to the PyTorch experience you already know.
AutoPilot is built for extreme extensibility. As long as you can define a forward pass and a way to score the result, you can optimize it:
- Prompt & AI Pipelines: Tune system prompts, RAG chunking parameters, or multi-agent routing logic based on LLM-judged evaluations.
- Heuristic & Rule Engines: Evolve fraud detection thresholds, spam filters, or trading algorithms where loss is based on precision/recall metrics.
- Configuration Tuning: Optimize database settings, cache eviction policies, or compiler flags using performance profiling reports as structured gradients.
- Simulation & Game Balancing: Adjust unit stats, physics parameters, or generation seeds based on win-rate or equilibrium metrics.
- Code Performance: Refactor SQL queries or tight loops using `EXPLAIN ANALYZE` plans and profiler outputs as structured feedback for a coding optimizer.
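As a concrete instance of the heuristic case, a fraud rule engine might expose its thresholds as optimizable state. This is a sketch under assumptions: the `Parameter` import path, wrapping a plain dict in a `Parameter`, and the `.value` accessor are all illustrative:

```python
from autopilot.core.module import Module
from autopilot.core.parameter import Parameter  # import path assumed

class FraudRules(Module):
    """Fraud heuristic whose thresholds are the optimizable state."""

    def __init__(self):
        super().__init__()
        # Wrapping a config dict in a Parameter is illustrative; a custom
        # Parameter subclass for JSON configs would fit the same slot.
        self.thresholds = Parameter({"max_amount": 5_000, "tx_per_hour": 3})

    def forward(self, batch):
        t = self.thresholds.value  # .value accessor is an assumption
        return [
            txn["amount"] > t["max_amount"] or txn["count_1h"] > t["tx_per_hour"]
            for txn in batch
        ]
```

A precision/recall-based loss would then score these flags against labeled transactions and emit feedback describing which thresholds misfire.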
AutoPilot isn't just a borrowed analogy; it's a structural equivalent that transfers everything ML practitioners know about training loops directly to software engineering:
| ML workflow | AutoPilot workflow |
|---|---|
| Training data | Eval dataset (test cases with ground truth) |
| Forward pass (`model(x)`) | Run the system on eval items (`module(batch)`) |
| Loss computation | Evaluator scores outputs, accumulates structured feedback |
| Backward (`loss.backward()`) | Feedback flows back to fill `param.grad` with "gradients" |
| Optimizer step (`optimizer.step()`) | Optimizer reads gradients and applies state mutations |
| Validation | Run on held-out split to check for regressions |
| Epoch | One full cycle: run all items -> judge -> gradient -> update -> redeploy |
| Overfitting | System tuned for train set quirks, failing on val/test |
| Checkpoint | `Store` snapshots code/config at each epoch, enabling rollback |
- Model your system as a `Module` with `forward(batch)`. Declare what can change as `Parameter` attributes -- files via `PathParameter`, or custom subclasses for configs, prompts, thresholds.
- Define a `Loss` that accumulates per-batch feedback in `forward()` and fills `param.grad` with a structured `Gradient` in `backward()`. This isn't just a number -- it tells the optimizer WHERE something failed and WHAT to fix (see the sketch after this list).
- Choose an `Optimizer`: deterministic (like `RuleOptimizer` -- reads gradients, applies heuristic fixes with zero LLM calls) or LLM-backed (`AgentOptimizer` with `ClaudeCodeAgent` -- reads gradients, edits code and prompts).
- Run the loop -- either a manual PyTorch-style `for epoch` loop, or `Trainer.fit()`, which handles batching, validation, callbacks, and gradient accumulation automatically.
- Wire the experiment lifecycle for production: `Experiment` manages the manifest and an optional `Store` for content-addressed snapshots. `StoreCheckpointCallback` auto-snapshots each epoch. `Policy` gates progression and triggers rollback on regression. `Memory` blocks failed strategies so the optimizer doesn't repeat mistakes.
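Here is what the `Loss` step might look like for the fraud-rules example above. The `Loss` base-class hooks, how parameters reach the loss, and the `TextGradient` constructor are assumptions inferred from the names in this list:

```python
from autopilot.core.loss import Loss            # import path assumed
from autopilot.ai.gradient import TextGradient  # TextGradient is listed under ai/

class PrecisionRecallLoss(Loss):
    """Scores rule-engine outputs and turns failures into text gradients."""

    def __init__(self, parameters):
        super().__init__()
        self._parameters = list(parameters)
        self._failures = []

    def forward(self, data, batch):
        # Accumulate per-batch feedback: which items failed and why.
        for pred, item in zip(data, batch):
            if pred != item["label"]:
                self._failures.append(
                    f"item {item['id']}: predicted {pred}, expected {item['label']}"
                )

    def backward(self):
        # Fill param.grad with structured feedback -- WHERE it failed and
        # WHAT to fix -- rather than a bare score. Constructor is assumed.
        report = "Misclassified items:\n" + "\n".join(self._failures)
        for param in self._parameters:
            param.grad = TextGradient(report)
        self._failures.clear()
```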
Two entry points: library (import and compose in Python) and CLI (`uv run autopilot ...`) for workspace operations -- experiments, store history, memory queries, status, proposals.
A hand-rolled `for epoch: run(); eval(); if bad: revert()` works for one-off tweaking. It breaks down when you need:
- Structured feedback that tells the optimizer WHERE and WHAT to fix -- `Loss.backward()` produces a typed `Gradient` on each `Parameter`, not just "accuracy dropped"
- Gradient accumulation across batches with correct step boundaries -- `accumulate_grad_batches` on `Trainer`, automatic `_should_step` logic in `EpochLoop`
- Train/val split discipline with separate metric phases -- `EpochLoop` switches `module.eval()`, runs `validation_step`, calls `experiment.on_validation_complete` after val
- Policy gating with automatic rollback to the correct epoch via content-addressed snapshots -- `Policy` returns pass/fail; `EpochOrchestrator` calls `experiment.rollback(best_epoch)` (see the wiring sketch after this list)
- Persistent memory of what was tried, what failed, and which strategies are blocked -- `FileMemory` with `learn()`, `recall()`, `block_strategy()`
- Reproducible experiment records with manifests, events, and artifacts -- `Experiment` with `JSONLogger`, `JSONCheckpoint`, epoch directories
- The same `Module` working in both a manual loop and an automated `Trainer` -- progressive disclosure from explicit to orchestrated
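A wiring sketch for the orchestrated path. Only the `Trainer` keywords shown in the snippets below (`callbacks`, `policy`, `experiment`) and the `accumulate_grad_batches` option named above come from this document; the import paths and constructor arguments are illustrative assumptions:

```python
from autopilot.core.trainer import Trainer
# Import paths below are assumed from the package layout.
from autopilot.core.experiment import Experiment
from autopilot.core.store import StoreCheckpointCallback
from autopilot.policy import Policy, MinGate

# Constructor arguments are illustrative throughout.
experiment = Experiment("tune-fraud-rules")
policy = Policy(gates=[MinGate("precision", 0.85)])

trainer = Trainer(
    callbacks=[StoreCheckpointCallback()],  # snapshot code/config each epoch
    policy=policy,                          # gate progression, roll back on regression
    experiment=experiment,                  # manifest, store, best-epoch tracking
    accumulate_grad_batches=4,              # one optimizer step per 4 batches
)
trainer.fit(module, train_dataloaders=loader, max_epochs=10)
```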
AutoPilot standardizes all of this into a composable protocol with the same separation of Module / Loss / Optimizer / Trainer that made PyTorch productive for ML.
Like PyTorch + Lightning, AutoPilot offers two orchestration layers:
Manual loop (PyTorch-style) -- full control, plain Python objects:

```python
from autopilot.ai.coding import ClaudeCodeAgent
from autopilot.ai.gradient import ConcatCollator
from autopilot.ai.loss import JudgeLoss
from autopilot.ai.optimizer import AgentOptimizer
from autopilot.core.module import Module

module = MyModule()
loss = JudgeLoss(judge=MyJudge(), collator=ConcatCollator())
optimizer = AgentOptimizer(agent=ClaudeCodeAgent(), parameters=module.parameters())

module.train()
for epoch in range(5):
    for batch in train_loader:
        data = module(batch)
        loss(data, batch)
        loss.backward()        # structured feedback fills param.grad
        optimizer.step()       # optimizer applies improvements (e.g. edits code)
        optimizer.zero_grad()
```

Automated loop (Lightning-style) -- define steps, let Trainer handle the rest:
```python
from autopilot.ai.coding import ClaudeCodeAgent
from autopilot.ai.optimizer import AgentOptimizer
from autopilot.core.module import AutoPilotModule
from autopilot.core.trainer import Trainer

class MyModule(AutoPilotModule):
    def training_step(self, batch):
        return self.forward(batch)

    def configure_optimizers(self):
        return AgentOptimizer(agent=ClaudeCodeAgent(), parameters=self.parameters())

trainer = Trainer(callbacks=[...], policy=my_policy, experiment=my_experiment)
trainer.fit(module, train_dataloaders=loader, max_epochs=10)
```

| PyTorch / Lightning | AutoPilot |
|---|---|
| `nn.Module` | `Module` |
| `LightningModule` | `AutoPilotModule` |
| Lightning `Trainer` | `Trainer` |
| `nn.CrossEntropyLoss` | `Loss` / `JudgeLoss` |
| `optim.Adam` | `Optimizer` / `AgentOptimizer` |
| `nn.Parameter` | `Parameter` / `PathParameter` |
| `Tensor` | `Datum` / `Gradient` (can be any object) |
| `torchmetrics.Metric` | `Metric` |
| `EarlyStopping` | `Policy` + `Gate` |
| `ModelCheckpoint` | `Store` + `StoreCheckpointCallback` |
| Autograd engine | `Graph` / `Node` (propagates arbitrary objects) |
| `Dataset` / `DataLoader` | `ListDataset` / `DataLoader` |
| Lightning `Callback` | `Callback` |
| Lightning `FitLoop` | `Loop` / `EpochLoop` |
| No equivalent | `Memory` (persistent cross-epoch learning) |
| No equivalent | `DataGenerator` (structured dataset creation) |
| No equivalent | `Judge` (structured output scoring) |
See `examples/` for runnable, self-contained projects:
- `textmatch` -- Deterministic Rule Optimization. Optimizes regex rules using a deterministic `RuleOptimizer` and zero LLM calls. Shows the power of the framework without AI.
- `protim` -- Agent-Driven Prompt Optimization. Optimizes a prompt file using `AgentOptimizer` and Claude Code.
Each example is its own uv package. Clone, `cd examples/<name>`, `uv sync`, `uv run python run.py`.
```
uv sync && uv run autopilot --help
```

- Uniform, Typed Interface: Compose systems the same way you compose PyTorch components. No string registries, no YAML configs. Instantiate objects, pass them in, call methods.
- Structured Feedback: `backward()` fills `param.grad` with actionable feedback, not just opaque scores. The optimizer reads `param.grad.render()` and `param.render()` to make targeted fixes.
- Real Code/State Versioning: `FileStore` uses SHA-256 content addressing, snapshot manifests, and atomic writes. `store.checkout(epoch)` restores any previous state.
- Persistent Memory: `FileMemory` records what was tried, what failed, and which strategies are blocked across epochs. `MemoryCallback` captures this automatically (see the sketch after this list).
- Policy Gating: Use `MinGate`, `MaxGate`, `RangeGate`, and `CustomGate` to enforce quality bars and automate early stopping with rollback.
- Experiment Lifecycle: `Experiment` manages the store, lifecycle hooks (`on_epoch_complete`, `on_validation_complete`, `on_loop_complete`), rollback, and best-epoch tracking above the training loop.
- Production Infrastructure: Built-in CLI for experiments, project health, dataset management, diagnostics, and audit trails via `--expose`.
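A usage sketch for persistent memory: the method names `learn()`, `recall()`, and `block_strategy()` come from the list above, but their signatures, the constructor argument, and the import path are assumptions:

```python
from autopilot.core.memory import FileMemory  # import path assumed

memory = FileMemory("experiments/memory")  # constructor argument illustrative

# Record what was tried and how it went; exact signatures are assumed.
memory.learn("Lowered max_amount threshold to 2500; precision dropped to 0.71")
memory.block_strategy("lower-max-amount")   # never propose this again

# Query past attempts before planning the next epoch.
for note in memory.recall("max_amount"):
    print(note)
```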
| Command | Role |
|---|---|
| `optimize` | Drive the optimization loop |
| `ai` | Dataset generation and judging |
| `experiment` | Create, list, and manage experiment slugs and manifests |
| `project` | Create, list, and check project health |
| `store` | Content-addressed code versioning |
| `status` | Experiment overview (epoch, metrics, stop reason) |
| `memory` | Query, record, trends, and context |
| `diagnose` | Trace diagnostics and node heatmaps |
| `propose` | Create, verify, revert, and list proposals |
| `promote` | Promotion decisions and workflow |
Run `uv run autopilot <command> --help` for subcommands and flags.
```
src/autopilot/
  core/      # Module, Trainer, Loss, Optimizer, Parameter, Gradient, Graph, Metric, Memory, Store, Experiment
  data/      # Dataset, ListDataset, StreamingDataset, DataLoader, DataModule
  ai/        # DataGenerator, Judge, Agent, AgentOptimizer, JudgeLoss, TextGradient, GradientCollator, step workflows
  cli/       # argparse CLI, commands, context, output
  tracking/  # manifest, events, command history
  policy/    # Policy, Gate base classes
```
AutoPilot supports multiple projects in one workspace under `autopilot/projects/<name>/`:
```
workspace/
  autopilot/
    pyproject.toml
    projects/
      my-project/
        cli.py
        trainer.py
        ai/
        experiments/
        datasets/
```
Each project has a `cli.py` that subclasses `AutoPilotCLI` and wires components in `__init__`:
```python
from autopilot.cli.main import AutoPilotCLI

class MyCLI(AutoPilotCLI, project='my-project'):
    def __init__(self):
        super().__init__()
        self.module = my_module
        self.generator = MyGenerator()
        self.judge = MyJudge()

MyCLI()()
```

Comprehensive documentation lives in source docstrings. See PHILOSOPHY.md for design principles. CLI command details are in the cli-conventions skill and source docstrings.