exprag is an experiment memory for coding agents with zero dependencies.
It is intentionally small: one JSONL file per run, plus enough structured context for an agent to answer questions about past experiments, compare runs, and recover the exact code state that produced a result.
This is not trying to be another ML dashboard. The main interface is your agent.
Ask things like:
"Which run had the best validation accuracy?"
"Compare the latest two runs and explain what changed."
"Find the best run, inspect the code that produced it, and tell me why it won."
"Restore the repository to the code state from the run with the lowest loss."
Every run records git state at startup: commit, branch, dirty status, status
output. For dirty repositories, exprag creates a dedicated run/<uuid> branch
that captures the exact commit plus any uncommitted edits that existed when
the run started. That means an agent can always reconstruct the true code state.
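For example (the branch name below is hypothetical; real branches are named after the run's UUID), an agent can later recover that state with ordinary git:

```bash
# Hypothetical branch name; exprag uses run/<uuid>.
git checkout run/3f2a9c1e            # exact code state at run start, uncommitted edits included
git diff run/3f2a9c1e^ run/3f2a9c1e  # the uncommitted edits that were captured, relative to the base commit
```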
The result is a lightweight loop:
- Run experiments from normal Python.
- Track structured values with short semantic notes.
- Let an agent inspect `.exprag/runs/*.jsonl`.
- Ask the agent to compare, explain, or roll code back to any run.
Install:

```bash
uv pip install exprag
```

Development install:

```bash
uv pip install -e . --group=dev
```

```python
from exprag import Experiment

exp = Experiment(
    "training my neural network",
    # this metadata is captured only once at the experiment start
    metadata={
        "hparams": {
            "learning_rate": 0.03,
            "batch_size": 32,
        }
    },
)

for step in range(5):
    loss = 1.0 / (step + 1)
    acc = 0.6 + step * 0.05
    exp.track(
        {"step": step, "metrics": {"loss": loss, "acc": acc}},
        note="training metrics after each step",
    )
```

Run:
```bash
python examples/track_experiment.py
```

The runs are written to:

```
.exprag/runs/<run_id>.jsonl
```
Each run starts with a `run_start` record containing process, host, metadata,
and git state. Each `track` record contains your structured value, wall-clock
time, monotonic `elapsed_ms`, and optional note context for the agent.
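For illustration only, a run file might look roughly like the two lines below; the exact key names and layout are an assumption, so check a real file under `.exprag/runs/` rather than relying on this sketch:

```jsonl
{"type": "run_start", "host": "my-laptop", "process": {"cwd": "/home/me/project", "argv": ["python", "examples/track_experiment.py"]}, "metadata": {"hparams": {"learning_rate": 0.03, "batch_size": 32}}, "git": {"commit": "abc1234", "branch": "main", "dirty": false}}
{"type": "track", "value": {"step": 0, "metrics": {"loss": 1.0, "acc": 0.6}}, "note": "training metrics after each step", "elapsed_ms": 12}
```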
Write the SKILL.md to the appropriate place so your agent finds it:
```bash
exprag-skill --write .claude/skills/exprag/SKILL.md
exprag-skill --write .agents/skills/exprag/SKILL.md
exprag-skill --write .opencode/skills/exprag/SKILL.md
```

Then ask your agent questions in terms of outcomes, not files:
"Which run in the last two weeks has the highest accuracy?"
"Show me the git diff between the two latest runs."
"Which learning rates result in accuracies above 90%?"
"Compare the best run against the latest run."
"Show the metric history for the run where batch size was 32."
"Restore the code to the state from the run that achieved the highest accuracy."
"Check out the exact code for run X."
The powerful part is that exprag captures git context per run.
A run_start record includes enough information for an agent to reason about
the source tree at experiment time:
- `commit`: the commit to check out (the snapshot commit for dirty runs, HEAD for clean runs)
- `branch`: the branch to check out (`run/<uuid>` for dirty runs, the original branch for clean runs)
- `dirty`: whether the worktree had uncommitted changes
- process `cwd` and `argv`
That lets an agent perform a workflow like:
- Find the run with the best metric.
- Read its `run_start` git state.
- `git checkout <branch>` always works: the branch is either `run/<uuid>` (dirty) or the original branch (clean).
- Compare with `git diff` between any two runs.
- To get the base commit for a dirty run: `git log --oneline -1 <commit>^`.
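A minimal sketch of the first two steps, assuming the illustrative record layout shown earlier (plain stdlib; adjust the key names to whatever the real records contain):

```python
import json
from pathlib import Path

best = None  # (accuracy, git state) of the best run seen so far

for run_file in Path(".exprag/runs").glob("*.jsonl"):
    records = [json.loads(line) for line in run_file.read_text().splitlines() if line.strip()]
    # Key names below follow the illustrative schema above, not a guaranteed layout.
    start = next((r for r in records if r.get("type") == "run_start"), None)
    accs = [
        r["value"]["metrics"]["acc"]
        for r in records
        if r.get("type") == "track" and "acc" in r.get("value", {}).get("metrics", {})
    ]
    if start and accs and (best is None or max(accs) > best[0]):
        best = (max(accs), start["git"])

if best:
    acc, git_state = best
    print(f"best accuracy: {acc:.3f}")
    # branch is run/<uuid> for dirty runs, the original branch for clean runs
    print(f"git checkout {git_state['branch']}")
```

From there, the remaining steps are the git commands listed above.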
So a prompt like this is meaningful:
"Find the run with the best validation accuracy, reconstruct the code from that run, and show me the exact code changes compared with my current checkout."