A distributed AI agent stack — task queue, worker fleet, and interactive CLI for running Claude, Gemini, Codex, and Groq across multiple machines over a private Tailscale network.
Three machines. Four AI agents. Two ways to work.
- Local — run an agent directly on your MacBook, get an instant answer, stay in flow
- Queued — describe a task, pick a machine and agent, it runs in the background while you keep working
MacBook Pro (Orchestrator)
├── FastAPI queue server (port 8000, SQLite-backed)
├── da CLI — the control plane
└── Tailscale IP — reachable by all workers
Worker machines (Mac Mini, ThinkPad, …)
└── FastAPI worker server (port 8001)
    ├── Poller: claims tasks every 10 s
    ├── Dispatches to the right handler (agent_run / build / script / …)
    └── Reports: done / failed / needs_human
All machines connect over Tailscale mesh VPN — no port-forwarding, no firewall rules.
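The worker cycle in the diagram (claim, dispatch, report) can be sketched roughly as below. This is an illustration under stated assumptions, not the project's actual code: `poll_once`, `run_forever`, the `claim` and `report` callables, and the handler signature are hypothetical names. Only the 10 s interval, the task type used for dispatch, and the three outcome statuses come from this README.

```python
import time
from typing import Callable, Optional

Task = dict
Handler = Callable[[dict], dict]


def poll_once(
    claim: Callable[[], Optional[Task]],
    handlers: dict[str, Handler],
    report: Callable[[str, str, dict], None],
) -> Optional[str]:
    """Claim at most one task, run its handler, and report the outcome.

    Returns the reported status, or None when there was nothing to claim.
    """
    task = claim()
    if task is None:
        return None

    handler = handlers.get(task["type"])
    if handler is None:
        # Assumption: an unknown task type is escalated to a human.
        status, result = "needs_human", {"error": f"no handler for {task['type']}"}
    else:
        try:
            status, result = "done", handler(task.get("payload", {}))
        except Exception as exc:  # keep the poller alive when a handler fails
            status, result = "failed", {"error": str(exc)}

    report(task["id"], status, result)
    return status


def run_forever(claim, handlers, report, interval_seconds: float = 10.0) -> None:
    """Poll on a fixed interval; 10 s matches the interval shown in the diagram."""
    while True:
        poll_once(claim, handlers, report)
        time.sleep(interval_seconds)
```

Passing `claim` and `report` in as callables keeps the loop independent of the HTTP client, which makes it easy to exercise without a running orchestrator.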
| Machine | Role | Highlights |
|---|---|---|
| MacBook Pro | Orchestrator | Queue server, da CLI |
| Mac Mini (Intel, macOS) | Worker | iOS/Xcode, Flutter, Swift, Cloudflare deploys |
| ThinkPad (Ubuntu) | Worker | Android, Gradle, Python/Node backend, Docker |
Declare your own fleet in config/machines.yaml (see Configuration).
git clone https://github.com/your-username/distributed-infra.git
cd distributed-infra

cp .env.example .env
# Edit: set MACHINE_NAME, MACHINE_ROLE, SECRET_KEY (same value on all machines)
openssl rand -hex 32 # generate a SECRET_KEY

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

cp config/machines.yaml.example config/machines.yaml
# Fill in each machine's Tailscale IP: tailscale ip -4

# MacBook: orchestrator
uvicorn orchestrator.main:app --host 0.0.0.0 --port 8000
# Mac Mini / ThinkPad: worker
uvicorn worker.main:app --host 0.0.0.0 --port 8001

./da # from the repo root
# or if symlinked:
ln -sf "$(pwd)/da" /usr/local/bin/da
da

╭──────────────────────────────────────────────────────────────────╮
│ Distributed Agents │
│ macbook-pro · mac-mini · thinkpad-x1 │
│ 2/2 workers online │
╰──────────────────────────────────────────────────────────────────╯
Type help for commands, exit to quit.
da ›
macOS (launchd)
cp scripts/com.techstartups.orchestrator.plist.example \
~/Library/LaunchAgents/com.techstartups.orchestrator.plist
# Edit: fill in YOUR_USERNAME and YOUR_SECRET_KEY
launchctl load ~/Library/LaunchAgents/com.techstartups.orchestrator.plist

Ubuntu (systemd)
# Worker service is declared in scripts/worker.service
# Copy it, fill in paths, enable:
sudo cp scripts/worker.service /etc/systemd/system/infra-worker.service
sudo systemctl daemon-reload
sudo systemctl enable --now infra-worker

da › run claude explain the Riverpod provider pattern
da › run gemini summarise the last 10 commits
da › run codex refactor the auth module
da › test # smoke-test all four agents
da › test gemini # test one agent
da › assign <description> [--machine=X] [--agent=Y] [--type=Z]
Fully explicit — recommended when you know exactly where it should run:
da › assign build the iOS release --machine=mac-mini --agent=gemini --type=ios_build
da › assign run the test suite --machine=thinkpad-x1 --agent=claude --type=test_run
da › assign deploy the site --machine=mac-mini --agent=- --type=run_script
Auto-routed — Claude picks the best machine and agent:
da › assign refactor the payment service for better error handling
→ Asking Claude for routing recommendation…
→ Suggested: thinkpad-x1 / claude (Python backend, no build tools needed)
Machine thinkpad-x1
Agent claude
Task type agent_run
Reason Python backend work, no mobile build tooling required
Confirm? [Y/n]
Partial flags — override just what you care about:
da › assign write unit tests for the cart module --machine=thinkpad-x1
# Claude picks the agent; you lock the machine
Validation: assign checks before queuing:
- Blocks if the machine doesn't have the capability for the task type
- Warns if the requested agent isn't listed for that machine
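The two checks above could look roughly like this. It is a sketch only: `validate_assignment`, `Verdict`, and the `MACHINES` dict are hypothetical names, the dict mirrors the shape of config/machines.yaml rather than loading it, and treating `-` as "no agent" and an unknown machine as an error are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    ok: bool
    errors: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)


def validate_assignment(machines: dict, machine: str, agent: str, task_type: str) -> Verdict:
    """Block on a missing capability; only warn on an unlisted agent."""
    spec = machines.get(machine)
    if spec is None:
        # Assumption: an unknown machine name is a hard error.
        return Verdict(False, errors=[f"unknown machine: {machine}"])

    verdict = Verdict(ok=True)
    if task_type not in spec.get("capabilities", []):
        verdict.ok = False
        verdict.errors.append(f"{machine} does not have capability {task_type}")
    # Assumption: "-" means "no agent", as in the run_script example above.
    if agent != "-" and agent not in spec.get("agents", []):
        verdict.warnings.append(f"{agent} is not listed for {machine}")
    return verdict


# Trimmed stand-in for the parsed contents of config/machines.yaml.
MACHINES = {
    "mac-mini": {
        "capabilities": ["ios_build", "agent_run", "run_script"],
        "agents": ["claude", "gemini", "codex"],
    },
    "thinkpad-x1": {
        "capabilities": ["android_build", "agent_run", "run_script"],
        "agents": ["claude", "gemini"],
    },
}
```

For example, `validate_assignment(MACHINES, "thinkpad-x1", "claude", "ios_build")` comes back with `ok` set to `False`, while asking for `codex` on `thinkpad-x1` passes with a warning.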
da › queue # all tasks, newest first
da › queue --status=pending # filter: pending / claimed / in_progress
da › queue --status=needs_human # done / failed / needs_human
Queue columns:
- Machine: shows →mac-mini for pending tasks (targeted but not yet claimed), and the actual machine name once claimed
- Agent: which AI agent handled or will handle the task
- Task / Notes: prompt preview or error summary
da › review # all tasks waiting for your attention
da › failures # failed tasks with error details + re-queue hints
da › resolve a3f29c1d done # mark a specific task done
da › resolve a3f29c1d pending # re-queue (retry from scratch)
da › resolve a3f29c1d failed # record as failure
da › resolve a3f29c1d done --notes="handled manually"
da › resolve all # bulk-close every needs_human task as done
da › resolve all pending # bulk re-queue all needs_human tasks
da › status # health, active tasks, done/failed counts, top LLM per machine
da › ssh mac-mini # open SSH session to a worker
da › skills # declared capabilities per machine
da › skills available # full registry + install status (SSH check)
da › skills available --category=mobile # filter by category
da › skills list mac-mini # SSH-verify what's actually installed
da › skills install thinkpad-x1 docker # install a skill via SSH
da › skills add mac-mini swiftlint # register new capability in machines.yaml
da › skills create deploy # scaffold a new custom skill handler
| Type | What it does | Typical machine |
|---|---|---|
| agent_run | Runs Claude / Gemini / Codex / Groq on a prompt | Any worker |
| run_script | Executes a shell script | Any worker |
| git_pull | Pulls latest on a repo | Any worker |
| ios_build | xcodebuild / Flutter / CocoaPods | macOS worker |
| android_build | Gradle | Ubuntu worker |
| npm_build | npm run <script> | Any worker |
| test_run | pytest / jest / gradle test / xcode test | Any worker |
| lint | ruff / eslint / ktlint / swiftlint | Any worker |
Custom types are auto-discovered from worker/handlers/<type>.py — no wiring needed.
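One way such auto-discovery can work is to import every module in the handlers directory and register it under its file name. The sketch below is an assumption about the mechanism, not the worker's real loader: `discover_handlers` and the `handle` entry point are hypothetical names.

```python
import importlib.util
from pathlib import Path
from typing import Callable


def discover_handlers(handlers_dir: Path) -> dict[str, Callable]:
    """Map task type -> handle() for every <type>.py file in handlers_dir."""
    registry: dict[str, Callable] = {}
    for path in sorted(handlers_dir.glob("*.py")):
        if path.stem.startswith("_"):
            continue  # skip __init__.py and private helpers
        spec = importlib.util.spec_from_file_location(f"handlers_{path.stem}", path)
        if spec is None or spec.loader is None:
            continue
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        handle = getattr(module, "handle", None)  # hypothetical entry point name
        if callable(handle):
            registry[path.stem] = handle  # file name doubles as the task type
    return registry
```

Calling `discover_handlers(Path("worker/handlers"))` at worker start-up would pick up a new `summarise_pr.py` without any registration step, which is the behaviour described above.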
Every task payload is a JSON object sent with the task. Common fields:
{
"agent": "claude",
"prompt": "Refactor the auth module to use dependency injection",
"cwd": "~/Projects/my-app",
"model": "claude-opus-4-5"
}
- cwd: working directory for the agent. Claude can read and write files relative to this path. Always set it for tasks that touch files.
- model: optional model override (defaults to each agent's default model)
{
"script": "cd ~/Projects/my-app && npm run build 2>&1",
"cwd": "~/Projects/my-app",
"timeout": 300
}
- timeout: seconds before the script is killed (default: 120)
- _target_machine: set automatically by assign --machine=X; can also be set directly
{
"repo_path": "~/Projects/my-app",
"branch": "main"
}

agent_run is best for tasks that need creativity or reasoning: writing code, designing architecture, summarising. For deterministic file operations (deploy, build, scaffold), use run_script. It is faster, more predictable, and the output is always captured.
# Good — agent thinks, script acts
da › assign write the authentication handler --machine=thinkpad-x1 --agent=claude --type=agent_run
da › assign deploy to production --machine=mac-mini --type=run_script
Without cwd, Claude runs from the worker's project directory (distributed-infra) and cannot write to your actual project. Always include it:
da › assign add error handling to all API routes --machine=thinkpad-x1 --agent=claude --type=agent_run
Then manually add "cwd": "~/Projects/my-backend" to the payload — or submit via the API directly:
curl -X POST http://localhost:8000/tasks \
-H "x-secret-key: $SECRET_KEY" \
-H "Content-Type: application/json" \
-d '{
"type": "agent_run",
"payload": {
"agent": "claude",
"prompt": "Add error handling to all API routes",
"cwd": "~/Projects/my-backend",
"_target_machine": "thinkpad-x1"
}
}'

The agent runs in a single non-interactive claude -p call, so it cannot ask follow-up questions. Give it everything it needs upfront: file paths, constraints, output format, examples.
# Too vague — agent will ask for clarification it can't receive
"update the website"
# Self-contained — agent can act immediately
"In ~/Projects/motethansen-site, update index.html to add a Projects section
with cards for winedragons.asia and urbanlife.works. Use the existing CSS
variables. Do not change the navigation or footer."
_target_machine in the task payload is checked by the orchestrator before returning a task to a claiming worker. A machine that doesn't match the target will never see the task — no race conditions.
If a task fails or times out, resolve <id> pending puts it back in the queue with the same payload. Only create a new task if you want to change the prompt or routing.
da › resolve 3cd295f8 pending # retry with same payload
resolve all done closes every needs_human task in one shot. Run review first to make sure nothing important is hiding behind a generic needs_human status.
assign "..." --machine=mac-mini --agent=claude --type=agent_run
       │     │                  │              │
       │     │                  │              └── task type (capability)
       │     │                  └── which AI agent runs the prompt
       │     └── which machine claims the task (enforced in DB)
       └── natural language description → stored as notes + prompt
If any flag is omitted, Claude analyses the description and recommends the missing values. You confirm before anything is queued.
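Splitting an assign line into its description and flags might be done along these lines. The function names are hypothetical and the real da CLI may parse differently. A flag that comes back as `None` is one the routing step would need to fill in.

```python
KNOWN_FLAGS = ("machine", "agent", "type")


def parse_assign(line: str) -> dict:
    """Split an assign line into a description plus its optional flags."""
    parsed = {name: None for name in KNOWN_FLAGS}
    words = []
    for token in line.split():
        name, sep, value = token.partition("=")
        if sep and name.startswith("--") and name[2:] in KNOWN_FLAGS:
            parsed[name[2:]] = value
        else:
            words.append(token)  # anything else is part of the description
    # Tolerate a description wrapped in double quotes.
    parsed["description"] = " ".join(words).strip('"')
    return parsed


def missing_flags(parsed: dict) -> list[str]:
    """Flags the user left out, in the order machine, agent, type."""
    return [name for name in KNOWN_FLAGS if parsed[name] is None]
```

Whitespace splitting is used instead of shell-style tokenising so that a description containing an apostrophe does not raise a parse error.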
Enforcement layers:
- da assign validates: blocks if the machine lacks the capability; warns if the agent isn't listed for that machine
- Orchestrator DB filters: json_extract(payload, '$._target_machine') in the SQL claim query, so only the named machine can claim the task
- Agent handler dispatches: payload.agent is passed directly to the agent's CLI subprocess
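The database-level filter can be illustrated with a small SQLite sketch. The table schema, the function names, and the rule that untargeted tasks may be claimed by any worker are assumptions made for this example. Only the `json_extract(payload, '$._target_machine')` expression and the pending/claimed statuses come from this README. It also assumes an SQLite build with the JSON functions available, which is the case for most current Python distributions.

```python
import json
import sqlite3
from typing import Optional

CLAIM_SQL = """
SELECT id, payload FROM tasks
WHERE status = 'pending'
  AND (json_extract(payload, '$._target_machine') IS NULL
       OR json_extract(payload, '$._target_machine') = ?)
ORDER BY id
LIMIT 1
"""


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    # isolation_level=None: transactions are managed explicitly below.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY, status TEXT NOT NULL DEFAULT 'pending', "
        "claimed_by TEXT, payload TEXT NOT NULL)"
    )
    return conn


def enqueue(conn: sqlite3.Connection, payload: dict) -> int:
    cur = conn.execute("INSERT INTO tasks (payload) VALUES (?)", (json.dumps(payload),))
    return cur.lastrowid


def claim_next(conn: sqlite3.Connection, machine: str) -> Optional[dict]:
    """Claim the oldest pending task that this machine is allowed to run."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(CLAIM_SQL, (machine,)).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE tasks SET status = 'claimed', claimed_by = ? WHERE id = ?",
                (machine, row[0]),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    if row is None:
        return None
    return {"id": row[0], "payload": json.loads(row[1])}
```

Because the select and the update share one immediate transaction, two workers polling at the same moment cannot both claim the same row, and a worker whose name does not match the target never receives the task at all.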
config/skills.yaml is the source of truth for installable tools — install recipes, check commands, task type mappings, and handler paths for 20+ skills across four categories.
da › skills create summarise-pr
Description: Summarise a pull request using git log and diff
Category [custom]: backend
Check command: git --version
Install (macos): brew install git
Install (linux): sudo apt-get install -y git
Task type: summarise_pr
✓ Created handler: worker/handlers/summarise_pr.py
✓ Registered in skills.yaml
Next steps:
1. Implement worker/handlers/summarise_pr.py
2. Add to machines.yaml: skills add <machine> summarise_pr
The handler is auto-discovered by the worker — no changes to __init__.py needed.
machines:
  macbook-pro:
    tailscale_ip: "YOUR_MACBOOK_TAILSCALE_IP"
    role: orchestrator
    os: macos
    queue_port: 8000
  mac-mini:
    tailscale_ip: "YOUR_MACMINI_TAILSCALE_IP"
    role: worker
    os: macos
    capabilities:
      - ios_build
      - xcode
      - swift
      - agent_run
      - run_script
      - git_pull
    agents:
      - claude
      - gemini
      - codex
    worker_port: 8001
  thinkpad-x1:
    tailscale_ip: "YOUR_THINKPAD_TAILSCALE_IP"
    role: worker
    os: linux
    aliases: ["old-hostname"] # historical names kept for stats continuity
    capabilities:
      - android_build
      - python_backend
      - docker
      - agent_run
      - run_script
    agents:
      - claude
      - gemini
    worker_port: 8001

config/machines.yaml is gitignored (it contains your real Tailscale IPs). Track config/machines.yaml.example in git.
MACHINE_NAME=macbook-pro # must match a key in machines.yaml
MACHINE_ROLE=orchestrator # orchestrator | worker
SECRET_KEY=<openssl rand -hex 32> # same on all machines
TAILSCALE_IP=100.x.x.x
ORCHESTRATOR_URL=http://100.x.x.x:8000 # worker only
WORKER_PORT=8001 # worker only
POLL_INTERVAL_SECONDS=10

All four agents run through their CLI tools. Claude, Gemini, and Codex authenticate via their own login sessions, so no API keys for them live in config files; Groq reads GROQ_API_KEY from .env:
| Agent | CLI install | Auth |
|---|---|---|
| Claude | npm install -g @anthropic-ai/claude-code | claude login |
| Gemini | npm install -g @google/gemini-cli | gemini login |
| Codex | npm install -g @openai/codex | codex login |
| Groq | install via pip / npm | set GROQ_API_KEY in .env |
Agents run non-interactively in the queue:
- Claude: claude -p "<prompt>" --dangerously-skip-permissions
- Gemini: gemini --yolo -p "<prompt>"
- Codex: codex --approval-mode full-auto -q "<prompt>"
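A handler could turn a task payload into one of these invocations roughly as follows. The command-line flags are copied from the list above. The function names and the 600 s default timeout are hypothetical, and Groq is left out because this README does not show its command line.

```python
import os
import subprocess

# Flags copied from the list above; one builder per agent.
AGENT_ARGV = {
    "claude": lambda prompt: ["claude", "-p", prompt, "--dangerously-skip-permissions"],
    "gemini": lambda prompt: ["gemini", "--yolo", "-p", prompt],
    "codex": lambda prompt: ["codex", "--approval-mode", "full-auto", "-q", prompt],
}


def build_agent_command(agent: str, prompt: str) -> list[str]:
    """Return the argv list for a non-interactive run of the given agent."""
    try:
        return AGENT_ARGV[agent](prompt)
    except KeyError:
        raise ValueError(f"no non-interactive command known for agent {agent!r}") from None


def run_agent(payload: dict, timeout: float = 600) -> subprocess.CompletedProcess:
    """Run the agent named in the payload, from the payload's cwd if given."""
    cwd = payload.get("cwd")
    return subprocess.run(
        build_agent_command(payload["agent"], payload["prompt"]),
        cwd=os.path.expanduser(cwd) if cwd else None,  # "~/Projects/x" -> absolute path
        capture_output=True,  # keep stdout/stderr for the task result
        text=True,
        timeout=timeout,
    )
```

Passing the prompt as a single argv element avoids shell quoting problems, and expanding `~` in `cwd` matters because `subprocess.run` does not do it for you.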
Worker keeps showing "Poller error: ConnectError"
The worker can't reach the orchestrator. Check: (1) Tailscale is running on both machines, (2) ORCHESTRATOR_URL in worker .env matches the orchestrator's Tailscale IP, (3) orchestrator is running on port 8000.
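Checks (2) and (3) can be combined into a quick reachability probe run from the worker. This is a generic TCP check, not part of the da CLI, and `can_reach` is a hypothetical helper name.

```python
import socket
from urllib.parse import urlsplit


def can_reach(url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    if parts.hostname is None:
        return False  # not a usable URL, e.g. missing scheme
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running `can_reach(os.environ["ORCHESTRATOR_URL"])` on the worker (after `import os`) separates a network or address problem from an application-level one: if it returns `False`, look at Tailscale and the URL before looking at the orchestrator logs.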
Task goes to wrong machine
Make sure the orchestrator was restarted after any changes to db.py. The _target_machine filter runs in the orchestrator's SQLite query — stale code means the old query runs without the filter.
Agent task has empty output / needs_human with no details
Check da review first: current versions preserve stdout/stderr in the result even when a task fails. If the output is still empty, the worker may be running old code, so run git pull and restart the worker.
Claude writes to wrong directory
Add "cwd": "~/Projects/your-project" to the task payload. Without it, Claude runs from the worker's distributed-infra directory and cannot reach other projects.
da assign blocks with "machine doesn't have capability"
Add the capability to config/machines.yaml (skills add <machine> <type>) or use a different machine.
Task stuck in_progress after orchestrator restart
The worker's HTTP connection broke mid-task. Mark it manually:
da › resolve <task-id> failed --notes="orphaned after restart"
Then resolve <id> pending to retry.
MIT — see LICENSE.