A new agent type for pydantic-ai: agents that spin up mini-programs running in context.
Most AI agents stop at the first plausible response.
ContextAgent is built to keep going.
It turns quality into explicit criteria, generates candidate outputs, evaluates them, mutates its strategy, and repeats until it converges or reaches a defined limit.
That makes it a better fit for work where omission and vagueness have real consequences.
One-shot chat is often enough for brainstorming.
It is not enough for work that must be complete, auditable, and aligned to visible criteria before a human should trust it.
ContextAgent is for that second category.
- Healthcare: discharge instructions, care-gap summaries, prior authorization drafts, and structured health-data repair
- Public service: permit packages, code-compliance reviews, and eligibility guidance
- Interoperability and validation: structured artifacts that must satisfy validators, profiles, or completeness checks
- Scientific and regulatory work: protocols, consent materials, and submission sections where traceability and rigor matter more than speed
The Karpathy autoresearch pattern (generate → evaluate → mutate → repeat) demonstrated the value of structured iterative refinement, but the original loop was still orchestrated in Python.
This project lifts that pattern into a reusable agent runtime abstraction, so the loop becomes a first-class typed program rather than one-off notebook or script glue.
But pydantic-ai's current Agent is a single-turn request/response abstraction. To build
an autoresearch loop today, you'd wire together multiple agents with manual state threading,
hand-rolled loops, and ad-hoc scoring logic. The graph primitives exist in pydantic_graph,
but there's no high-level API that says: "here's a program — run it iteratively until it
converges."
A ContextAgent is a pydantic-ai agent type that runs context programs —
structured, stateful mini-programs with explicit loop state, typed evaluation criteria,
and convergence logic. The current implementation uses pydantic_graph to orchestrate
the loop while feeding the accumulated program state back into each model call.
+-------------------------------------------------------+
| ContextAgent |
| pydantic_graph runtime |
| |
| +-------------+ +-------------+ +-------------+ |
| | Generate |->| Evaluate |->| Decide | |
| | node | | node | | node | |
| +------+------+ +------+------+ +------+------+ |
| | | | |
| v v v |
| +-------------+ +-------------+ +-------------+ |
| | Generator | | Evaluator | | Mutate | |
| | agent | | agent |<-| node | |
| +------+------+ +-------------+ +------+------+ |
| ^ | |
| +---------- Mutator agent <-------+ |
| |
| LoopState: cycle, best_score, best_output, |
| current_instructions, stalled_cycles, |
| history |
+-------------------------------------------------------+
| Concept | Description |
|---|---|
| ContextProgram | A typed definition of a mini-program: what to generate, how to evaluate, when to stop |
| Criterion | A single PASS/FAIL evaluation check with typed conditions |
| CycleResult | Immutable snapshot of one iteration: candidates, scores, mutations, and whether the run improved |
| LoopState | Mutable runtime state carried across iterations: score, instructions, stop conditions, and history |
| ContextAgent | Orchestrator that runs the program loop using pydantic_graph underneath |
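To make the concepts above concrete, here is a minimal sketch of the core types as plain dataclasses. This is a simplification for illustration: the real models are richer typed models, and any field not named in the table, diagram, or quick-start example below is an assumption.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Criterion:
    # A single PASS/FAIL check; field names follow the quick-start example.
    name: str
    pass_when: str
    fail_when: str

@dataclass(frozen=True)
class CycleResult:
    # Immutable snapshot of one iteration (simplified field set).
    cycle: int
    score: int
    output: str
    improved: bool

@dataclass
class LoopState:
    # Mutable runtime state carried across iterations; fields mirror
    # the LoopState box in the architecture diagram.
    cycle: int = 0
    best_score: int = -1
    best_output: str = ""
    current_instructions: str = ""
    stalled_cycles: int = 0
    history: list = field(default_factory=list)  # list of CycleResult
```

The key design split is immutable per-cycle snapshots (`CycleResult`) versus one mutable accumulator (`LoopState`), which is what lets later phases inspect the full run history.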
Today's agent patterns treat LLMs as black-box call-response units wired together by Python
glue code. ContextAgent pulls the iterative refinement loop into a reusable runtime shape:
generate, evaluate, mutate, and stop on convergence. That makes search and refinement a
first-class agent capability rather than bespoke application logic.
The important distinction here is not that Python disappears. It does not. The distinction is that the runtime preserves and reuses the full loop state across iterations instead of treating each model call as an isolated step.
- Ad hoc orchestration: manual loops, manual state threading, ad hoc scoring logic
- ContextAgent: explicit program state, typed criteria, graph-driven transitions, and model calls that see the accumulated context needed for the current phase
This matters because each phase can reason over the accumulated working state: prior candidates, failure patterns, current instructions, and cycle history. The loop is still externally orchestrated, but it is no longer just one-off glue code.
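As an illustration of "phases reason over accumulated state", here is a hedged sketch of how a mutator's working context might be assembled from loop history. The helper name and the `(cycle, score, failed_criteria)` tuple shape are illustrative simplifications, not the actual implementation.

```python
def build_mutator_context(current_instructions, history):
    """Assemble the mutator's working context from accumulated loop state.

    `history` is a list of (cycle, score, failed_criteria) tuples — a
    stand-in for the real per-cycle snapshots.
    """
    lines = [f"Current strategy: {current_instructions}", "Prior cycles:"]
    for cycle, score, failed in history:
        failed_str = ", ".join(failed) if failed else "none"
        lines.append(f"  cycle {cycle}: score {score}, failed: {failed_str}")
    return "\n".join(lines)
```

Because every cycle's failures stay visible, the mutator can avoid re-proposing strategies that already failed — the concrete payoff of carrying shared loop state instead of making isolated model calls.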
The auto-improve skill is the same generate -> evaluate -> mutate pattern written as an
in-agent procedure. ContextAgent promotes that pattern into a reusable runtime abstraction:
ContextProgram, Criterion, CycleResult, graph nodes, and shared loop state.
When context programs can target other agent artifacts (prompts, tools, system instructions),
agents can improve their own infrastructure. The auto-improve skill is a manual version of
this — ContextAgent makes it a reusable pattern.
The broader framing is that ContextAgent is a reusable runtime pattern first and a
pydantic-ai implementation second.
context-agent/
├── README.md
├── .env.example # Sample runtime configuration
├── pyproject.toml # Package metadata and CLI entry points
├── uv.lock # Locked dependency set for uv
├── src/context_agent/
│ ├── __init__.py # Public API exports
│ ├── agent.py # ContextAgent orchestration entry point
│ ├── bootstrap.py # Prompt -> ContextProgram decomposition
│ ├── program.py # ContextProgram, Criterion, CycleResult, ProgramResult
│ ├── nodes.py # pydantic_graph loop nodes
│ ├── cli.py # context-agent CLI
│ ├── ui.py # Gradio UI entry point
│ ├── data_sources.py # Pull-based context sources (FHIR, etc.)
│ ├── runtime_connectors.py # Runtime toolsets, MCP, and plugin loading
│ ├── plugin_examples.py # Example runtime plugin factory
│ ├── eval_types.py # Evaluation result models
│ ├── defaults.py # Shared model defaults
│ └── model_retry.py # Model retry and transient failure handling
├── examples/
│ ├── prompt_runner.py # Shared runner for prompt-driven examples
│ ├── autoresearch_skill_improver.py # Bespoke auto-improve loop example
│ ├── fhir_a1c_optimizer.py
│ ├── fhir_care_gaps.py
│ ├── fhir_wearable_monitor.py
│ ├── healthcare_cds_optimizer.py
│ ├── k8s_hardener.py
│ ├── permit_compliance_reviewer.py
│ ├── crop_rotation_optimizer.py
│ ├── spc_rule_optimizer.py
│ ├── trial_protocol_optimizer.py
│ └── sample_inputs/
├── tests/
│ ├── test_context_agent.py
│ ├── test_cli.py
│ ├── test_runtime_connectors.py
│ └── test_ui.py
├── pydantic-ai-context-agent.patch # Upstream patch/proposal artifact
└── vendor/ # Vendored upstream reference code
ContextAgent is implemented in this repository as its own runtime layer on top of pydantic-ai and pydantic_graph.
from pydantic_ai import Agent
from context_agent import ContextAgent, ContextProgram, Criterion
# Define evaluation criteria
criteria = [
    Criterion(name="clear_trigger", pass_when="specific testable scenarios", fail_when="vague guidance"),
    Criterion(name="actionable_steps", pass_when="concrete named actions", fail_when="'consider' or 'ensure'"),
]
# Define the program
program = ContextProgram(
    name="improve-skill",
    generator_instructions="Rewrite this SKILL.md to fix all failing criteria...",
    evaluator_instructions="Score this SKILL.md against the criteria...",
    mutator_instructions="Update the rewrite strategy based on failures...",
    criteria=criteria,
    max_cycles=3,
)
# Run it
agent = ContextAgent('openrouter:nvidia/nemotron-3-super-120b-a12b:free')
result = await agent.run_program(program, input_text=original_skill_md)
print(f"Final score: {result.best_score}/{len(criteria)}")
print(result.best_output)

The fastest way to use context-agent: describe what you want. The agent bootstraps its own evaluation criteria, generator, and mutation strategy from your prompt.
# Install
cd context-agent && uv venv && uv sync --extra dev
# Install UI and OAuth extras when using context-agent-ui
uv sync --extra dev --extra ui
# If you already had the project installed before dependency changes, resync it
uv sync --extra dev
# Optional: start from the sample environment file
cp .env.example .env
# Open-ended: agent self-assembles a program from your prompt
uv run context-agent "Write a production-ready Dockerfile for a Python FastAPI app"
# Improve an existing file
uv run context-agent "Improve this system prompt for clarity and specificity" --input prompt.txt
# Evaluate without rewriting
uv run context-agent "Evaluate this API schema" --input openapi.yaml --eval-only
# More cycles + verbose output
uv run context-agent "Optimize this SQL query for performance" --input query.sql --cycles 5 --verbose
# Explicit model override
uv run context-agent "Harden this Terraform module" --input main.tf --model openrouter:nvidia/nemotron-3-super-120b-a12b:free
# Backward-compat: auto-improve a SKILL.md
uv run context-agent improve-skill --target path/to/SKILL.md
# Optional: expose the in-repo example runtime plugin factory
export CONTEXT_AGENT_TOOL_FACTORIES=context_agent.plugin_examples:build_example_runtime_plugins
# Inspect which runtime connection keys are available right now
uv run context-agent --list-connections
# Let the bootstrapper decide that CLI inspection is needed
CONTEXT_AGENT_ENABLE_CLI=1 uv run context-agent \
"Determine what operating system this machine is running on. Use the appropriate tool if needed and answer briefly."
# Or force the CLI connector explicitly for deterministic OS checks
CONTEXT_AGENT_ENABLE_CLI=1 uv run context-agent \
"Determine what operating system this machine is running on. Use the available command tool, prefer uname and sw_vers when available, and answer briefly." \
--connection cli

When CONTEXT_AGENT_TOOL_FACTORIES or CONTEXT_AGENT_MCP_CONFIG are set, the UI shows any discovered tool: and mcp: keys at startup so you can see exactly which runtime surfaces came from env/config before a run begins.
ContextAgent defaults to the native pydantic-ai openrouter: provider. That provider still uses the openai Python client underneath, so the base install includes the openrouter extra from pydantic-ai-slim to bring in the required client library automatically.
Most one-shot scripts in examples/ are now thin prompt wrappers around the same bootstrap path as the CLI. The only intentionally bespoke examples left are the continuous monitoring loop and the auto-improve rubric, where the program structure itself is part of the example.
How it works under the hood:
- Bootstrap: An LLM call decomposes your prompt into 3-6 typed evaluation criteria + domain-specific instructions
- Generate: The generator agent produces candidate outputs
- Evaluate: The evaluator agent scores each candidate against the criteria (PASS/FAIL)
- Decide: Keep the best candidate if it improved the score
- Mutate: Analyze failures and evolve the generation strategy
- Repeat: Loop until convergence or max cycles
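The steps above can be compressed into a pure-Python control-flow sketch. No LLM calls are made here: `generate`, `evaluate`, and `mutate` are stand-ins for the real agents, and the stall threshold of 2 non-improving cycles is an assumed default, not the project's actual setting.

```python
def run_program(generate, evaluate, mutate, instructions,
                n_criteria, max_cycles=3, stall_limit=2):
    """Sketch of the generate -> evaluate -> mutate loop."""
    best_score, best_output = -1, None
    stalled = 0
    for cycle in range(max_cycles):
        candidate = generate(instructions)       # Generate
        score = evaluate(candidate)              # Evaluate: count of PASS criteria
        if score > best_score:                   # Decide: keep only improvements
            best_score, best_output, stalled = score, candidate, 0
        else:
            stalled += 1
        # Stop on convergence (all criteria pass) or when progress stalls
        if best_score == n_criteria or stalled >= stall_limit:
            break
        # Mutate: evolve the generation strategy from the latest failure
        instructions = mutate(instructions, candidate, score)
    return best_output, best_score
```

Run with toy callables, the loop keeps the first candidate (score 1), mutates the strategy, then converges as soon as a candidate passes all 3 criteria.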