Telemetry-Driven Inference Troubleshooting and Optimization Loop
Deploy a model as an inference service via dstack, continuously fetch profiling telemetry from Graphsignal, analyze performance, and redeploy with improved configuration — all autonomously.
Inspired by karpathy/autoresearch. Where autoresearch lets an AI agent iterate on training code overnight, autodebug lets an AI agent iterate on inference deployment configuration: tuning batch sizes, caching strategies, parallelism, and engine parameters to minimize latency and maximize throughput.
The agent follows an optimization loop defined in program.md:
- Deploy an inference service (e.g. SGLang, vLLM) on dstack with Graphsignal telemetry enabled.
- Benchmark the endpoint with targeted request patterns (parallel, sequential, long prompts, etc.).
- Fetch telemetry from Graphsignal — profiling data, traces, metrics, and errors.
- Analyze performance: compute prefill throughput, decode throughput, token throughput, and identify bottlenecks.
- Redeploy with an optimized dstack configuration reflecting the improvements.
- Repeat indefinitely.
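The analysis step above reduces to simple ratios over per-request timings. A minimal sketch, assuming the timings have already been extracted from Graphsignal traces (the `RequestSample` type and its field names are hypothetical, not part of Graphsignal's API):

```python
from dataclasses import dataclass

@dataclass
class RequestSample:
    prompt_tokens: int       # tokens in the prompt (processed during prefill)
    completion_tokens: int   # tokens generated (produced during decode)
    ttft_s: float            # time to first token, seconds
    total_s: float           # total request latency, seconds

def throughput_metrics(samples: list[RequestSample]) -> dict[str, float]:
    """Aggregate prefill, decode, and overall token throughput (tokens/sec)."""
    prefill = sum(s.prompt_tokens for s in samples) / sum(s.ttft_s for s in samples)
    decode = sum(s.completion_tokens for s in samples) / sum(s.total_s - s.ttft_s for s in samples)
    overall = sum(s.prompt_tokens + s.completion_tokens for s in samples) / sum(s.total_s for s in samples)
    return {"prefill_tok_s": prefill, "decode_tok_s": decode, "token_tok_s": overall}

samples = [
    RequestSample(prompt_tokens=1000, completion_tokens=200, ttft_s=0.5, total_s=4.5),
    RequestSample(prompt_tokens=500, completion_tokens=100, ttft_s=0.25, total_s=2.25),
]
print(throughput_metrics(samples))
```

A low prefill number under long-prompt load versus a low decode number under parallel load points at different bottlenecks (prompt processing vs. batch scheduling), which is what drives the redeploy decision.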
Each iteration is logged to a separate sessions/debug-<ISO>.md file (findings and rationale), building a complete record of the optimization journey.
Graphsignal provides inference observability — profiling, tracing, and metrics. Sign up and obtain an API key.
Install the debug CLI and log in:
```
uv tool install graphsignal-debug
graphsignal-debug login
```

dstack manages cloud infrastructure for inference services. Set up your dstack project and log in:
```
uv tool install dstack
dstack login
```

Download the skill files so the agent has full context:
```
mkdir -p ~/.claude/skills/graphsignal-python ~/.claude/skills/graphsignal-debug ~/.claude/skills/dstack
curl -sL https://raw.githubusercontent.com/graphsignal/graphsignal-python/main/SKILL.md -o ~/.claude/skills/graphsignal-python/SKILL.md
curl -sL https://raw.githubusercontent.com/graphsignal/graphsignal-debug/main/SKILL.md -o ~/.claude/skills/graphsignal-debug/SKILL.md
curl -sL https://raw.githubusercontent.com/dstackai/dstack/master/.skills/dstack/SKILL.md -o ~/.claude/skills/dstack/SKILL.md
```

Repository layout:

```
program.md           — agent instructions (the optimization loop)
dstack-baseline.yml  — baseline dstack service configuration
sessions/
  .gitkeep
  debug-<ISO>.md     — per-session findings and planned changes (created by agent)
  dstack-<ISO>.yml   — per-session optimized configurations (created by agent)
```
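As a rough sketch of what the baseline service configuration could contain (the image, model, port, and resource values below are illustrative assumptions, not the contents of the repo's actual `dstack-baseline.yml`):

```yaml
type: service
name: llm-baseline

# Assumption: serving with SGLang; vLLM or another engine works the same way.
image: lmsysorg/sglang:latest
env:
  - GRAPHSIGNAL_API_KEY  # passed through so the service reports telemetry
commands:
  - python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --port 8000
port: 8000

resources:
  gpu: 24GB
```

Each iteration's `dstack-<ISO>.yml` is a variation on this file: the agent adjusts engine flags, batch and cache settings, or resources, then redeploys.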
Point your AI agent at this repo and prompt:
```
Read program.md and let's set up an optimization session.
```
The agent will verify prerequisites, deploy the initial service, and begin the autonomous optimization loop. You can walk away — it will keep iterating, logging results, and redeploying improvements until you stop it.
MIT