English | 繁體中文
The platform that makes local LLMs ship work.
Local models alone are weak. Wrap them in OpenTeddy and you get a real agent: hardened orchestration, a self-growing skills library, and just enough commercial-LLM escalation to finish what local can't.
Web: openteddy-72cee.web.app · Source: github.com/m31527/OpenTeddy
A 2B / 4B / 7B local model on its own is a toy. It hallucinates, it loops, it stops mid-task. The model isn't the product; the platform around it is. OpenTeddy is that platform:
- A hardened agent loop that knows when to give up, when to retry, and when to call Claude, with no infinite "let me try that again" doom-spirals.
- A self-growing skills library that turns repeated work into plain Python functions, so the second time you ask the same question it costs zero LLM calls.
- Hardware-tuned model presets for everything from a 16 GB MacBook to a DGX Spark: the right `num_ctx`, `max_tokens`, and timeout per tier.
- Commercial-LLM escalation as a safety net, not a bill: Claude only gets called when local genuinely can't finish, and the Usage dashboard shows you how much GPT-4 would've charged for the same work.
The result: your $0/token local hardware actually finishes the job, and the savings counter in the sidebar is what makes you stop worrying about Claude Pro auto-renewing.
If this resonates with you, or you just want to cheer the project on, please drop a ⭐ on the repo. It genuinely helps and keeps me motivated to ship more. → github.com/m31527/OpenTeddy
- Local-first: planning (Gemma) and execution (Qwen) run on your machine via Ollama; Claude is only called when local models struggle.
- Auto-escalation to Claude: timeouts, low confidence, repeated failures, hard-failure signals in tool output (e.g. `unhealthy` containers, `ERROR 1045`), or failed health checks all trigger Claude intervention automatically.
- Self-growing skills: repeated tasks are promoted into reusable Python skills, cutting LLM calls over time.
- Streaming UI: both the orchestrator's planning and the executor's answer stream token-by-token via WebSocket; no more staring at a spinner while the model thinks.
- Per-step deliverable verification: an LLM-as-judge confirms each produced file actually matches the goal, catching the "wrote a report about the game instead of the game" failure mode. Toggleable for big-model setups where extra calls are too costly.
- Loop hardening for small models: adaptive prompts, a parallel low-risk tool fan-out, per-tool-name caps, a circuit breaker, discovery memos, and a context watchdog that compresses old turns before busting `num_ctx`.
- Reconnect-safe streaming: the WebSocket carries a 600-event ring buffer, so a flaky network or a tab refresh replays the missed events instead of leaving the UI stuck.
- Web dashboard: submit tasks, watch tool calls stream live, review pending approvals, manage memory, render Markdown/GFM tables, embed Chart.js datalabels in HTML reports, and tune settings.
- Native macOS desktop client: Tauri 2.x shell with onboarding wizard (Ollama install + tier-based model pull), language picker, mode-locked sessions, auto-update against GitHub Releases, and one-click diagnostics download. See `desktop/`.
- Analytic / report mode: first-class `csv_describe` + `python_exec` tools and an HTML report generator that renders charts with value labels.
- Human-in-the-loop: high-risk shell commands (`rm`, `sudo`, `mv`, …) pause for approval before running.
- Persistent memory: ChromaDB-backed long-term memory feeds relevant context back into future plans.
- 22-locale i18n: UI strings live in `static/i18n.js`; a build-hash check auto-reloads the dashboard when it is updated.
- Hot-reloadable settings: change models, thresholds, performance toggles, and endpoints from the UI without restarting the server.
```
                   User Goal
                       │
                       ▼
┌────────────────────────────────────────────────────┐
│ Orchestrator (Gemma via Ollama)                    │
│ • Decomposes goal into ordered SubTasks            │
│ • Streams plan tokens to the UI as it thinks       │
│ • Retrieves long-term memory for context           │
│ • Drives execution + escalation loop               │
└──────────────────────┬─────────────────────────────┘
                       │ SubTasks
                       ▼
┌────────────────────────────────────────────────────┐
│ Executor (Qwen via Ollama, function calling)       │
│ • Runs a matching Skill if available               │
│ • Uses tools: shell, file, http, db, gcp, package, │
│   csv_describe, python_exec, generate_report       │
│ • Streams answer tokens; parallelises low-risk     │
│   tool calls; caps per-tool-name retries           │
│ • Compresses old turns when context fills up       │
│ • Reports confidence (clamped on hard failures)    │
└──────────────────────┬─────────────────────────────┘
                       │ produced files
                       ▼
┌────────────────────────────────────────────────────┐
│ Deliverable Verifier (LLM-as-judge, Qwen)          │
│ • Reads the produced HTML/MD/Py/etc.               │
│ • Verdict: PASS or FAIL → forces retry on FAIL     │
│ • Skipped via verification_enabled = false         │
└──────────────────────┬─────────────────────────────┘
                       │ low conf / timeout / failure signal / unhealthy
                       ▼
┌────────────────────────────────────────────────────┐
│ Escalation Agent (Claude via API)                  │
│ • Resolves hard subtasks with full diagnostics     │
│ • Synthesises the final summary                    │
└──────────────────────┬─────────────────────────────┘
                       │
                       ▼
┌────────────────────────────────────────────────────┐
│ Skill Factory (Claude via API)                     │
│ • Generates new Python skills on demand            │
│ • Promotes skills after N successes                │
│ • Saves skills to disk + SQLite DB                 │
└────────────────────────────────────────────────────┘
```
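The exact SubTask schema lives in `models.py` and isn't documented here; conceptually each plan step carries enough for the executor to run it and for the escalation checks to fire. A hypothetical sketch:

```python
# Hypothetical sketch of a plan step -- the real Pydantic models live in
# models.py and may differ; the field names here are illustrative only.
from pydantic import BaseModel

class SubTask(BaseModel):
    index: int                      # position in the ordered plan
    description: str                # what the executor should achieve
    confidence: float = 1.0         # executor's self-reported confidence
    failures: int = 0               # consecutive failures (escalation trigger)
    deliverable: str | None = None  # path of a produced file, if any
```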
The agent loop has been progressively hardened to make small / mid-size local models (Gemma 3:4B, Qwen 2.5:3B class) reliable enough to ship work end-to-end, not just fast enough to look impressive on a single tool call:
| Mechanism | What it does |
|---|---|
| Adaptive prompts | Compact system prompts on small models; richer guidance only when context allows. |
| Parallel tool fan-out | Low-risk tool calls (file reads, shell `ls`, HTTP GETs, `csv_describe`) inside a single round are dispatched with `asyncio.gather` instead of serially. |
| Per-step deliverable verification | After each successful subtask, an LLM-as-judge reviews the produced HTML/MD/code file. If it looks like a description of the goal rather than the actual deliverable (the "Snake Game report" failure pattern), the subtask is forced to retry with feedback. |
| Context watchdog | When the prompt size approaches `num_ctx`, the executor compresses earlier turns into a recap and pins discovery memos to the system prompt, keeping recent tool context intact instead of letting Ollama silently truncate. |
| Discovery memos | Useful one-off facts learned from tool calls (e.g. "the workspace already contains data.csv with columns X/Y/Z") are pinned to the system prompt so the model doesn't re-discover them every round. |
| Per-tool-name cap | Each tool name is capped at 5 calls per subtask, which stops the model from re-running `csv_describe` on the same file ten times. |
| Circuit breaker | After 5 cumulative tool failures the loop is forced to commit to a final answer instead of looping forever. |
| Common error hints | Twelve frequent stack-trace patterns (`ModuleNotFoundError`, `KeyError`, `PermissionError`, …) are matched against tool stderr and converted into one-line hints so the model corrects itself instead of repeating the same mistake. |
| WS reconnect + replay | The dashboard WebSocket carries a 600-event ring buffer keyed by sequence number, so a refreshed tab or a Wi-Fi blip replays missed events on reconnect. |
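Three of these guards in sketch form. This is illustrative pseudologic, not the actual `executor.py` implementation; the `LoopGuard` name is hypothetical, while the thresholds are the documented defaults:

```python
# Illustrative sketch of the loop guards described above -- not the real
# executor.py code, just the shape of the checks.
from collections import Counter

MAX_CALLS_PER_TOOL = 5    # per-tool-name cap
MAX_TOTAL_FAILURES = 5    # circuit breaker
COMPRESS_AT = 0.7         # fraction of num_ctx (CONTEXT_COMPRESS_AT)

class LoopGuard:
    def __init__(self, num_ctx: int = 16384):
        self.num_ctx = num_ctx
        self.calls = Counter()    # tool name -> calls this subtask
        self.failures = 0         # cumulative tool failures

    def allow_tool(self, name: str) -> bool:
        """Reject a call once a tool name hits its per-subtask cap."""
        self.calls[name] += 1
        return self.calls[name] <= MAX_CALLS_PER_TOOL

    def record_failure(self) -> bool:
        """Return True when the circuit breaker trips and the model
        must commit to a final answer."""
        self.failures += 1
        return self.failures >= MAX_TOTAL_FAILURES

    def should_compress(self, prompt_tokens: int) -> bool:
        """Context watchdog: compress old turns before busting num_ctx."""
        return prompt_tokens >= self.num_ctx * COMPRESS_AT
```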
```
OpenTeddy/
├── config.py            # Config via .env / environment variables
├── models.py            # Pydantic models + SQLite schema
├── tracker.py           # Async SQLite persistence (aiosqlite) + perf stats
├── skill_factory.py     # Claude-powered skill generation & loader
├── executor.py          # Qwen executor: function calling, streaming,
│                        #   parallel low-risk tools, context watchdog,
│                        #   discovery memos, per-tool cap, circuit breaker
├── escalation.py        # Claude escalation agent
├── orchestrator.py      # Gemma orchestrator (plan → execute → verify →
│                        #   escalate) + per-step deliverable judge
├── memory.py            # ChromaDB long-term memory
├── approval_store.py    # Human-in-the-loop approval queue
├── settings_store.py    # Hot-reloadable settings (SQLite-backed)
├── tool_registry.py     # Tool registration + risk gating
├── tools/               # shell / file / http / db / gcp / package /
│                        #   analytic (csv_describe, python_exec) /
│                        #   report_tool (HTML + Chart.js datalabels)
├── skills/              # Auto-generated skill .py files
├── static/              # Web dashboard (index.html, i18n.js with 22 locales,
│                        #   OpenTeddy-logo.svg)
├── desktop/             # Native macOS Tauri 2.x client (own repo)
├── main.py              # FastAPI server + CLI entry point + WS ring buffer
└── .env.example         # Environment variable template
```
- Python 3.11+
- Ollama running locally:

  ```bash
  ollama pull gemma3:4b
  ollama pull qwen2.5:3b
  ```

- An Anthropic API key (used for escalation and skill generation)
```bash
git clone https://github.com/m31527/OpenTeddy.git
cd OpenTeddy
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env: at minimum set ANTHROPIC_API_KEY
uvicorn main:app --reload
# Dashboard → http://localhost:8000
# API docs  → http://localhost:8000/docs
```

| Method | Endpoint | Description |
|---|---|---|
| POST | `/run` | Submit a task |
| GET | `/tasks/{id}` | Check task status |
| GET | `/tasks` | List recent tasks (filter by `session_id`) |
| GET | `/skills` | List all skills |
| POST | `/skills/generate?name=…&description=…` | Manually create a skill |
| GET | `/tools` | List available tools |
| GET | `/approvals` | Pending human approvals |
| POST | `/approvals/{id}/approve`, `/approvals/{id}/reject` | Resolve an approval |
| GET | `/memory` | Browse long-term memory |
| GET | `/usage`, `/usage/summary` | Token usage & estimated cost |
| GET | `/benchmark/stats` | Per-model token-throughput stats (#6) |
| GET, POST | `/settings` | Read/update runtime settings |
| GET | `/settings/ollama/models`, `/settings/ollama/status` | Local model management |
| POST | `/settings/ollama/pull` | Pull a model (streamed progress) |
| GET | `/version` | Build hash + version (used by UI auto-reload) |
| GET | `/update/check` | Check GitHub Releases for a newer version |
| POST | `/update/apply` | Apply an available update |
| POST | `/optimize_prompt` | Rewrite a draft goal via Claude |
| GET | `/admin/diagnostics` | Download a zipped diagnostic bundle |
| GET | `/health` | Health check |
| WS | `/ws?since=N` | Live event stream; `since` replays the ring buffer from sequence N |
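For programmatic use, here is a minimal Python client sketch against these endpoints. It assumes the default localhost:8000 server and the `httpx` / `websockets` packages; the `seq` field name on events is an assumption, so check the real payloads:

```python
# Minimal client sketch: submit a task, then follow the live event stream
# with reconnect + ring-buffer replay. Assumes `pip install httpx websockets`;
# the "seq" field name is hypothetical -- inspect the actual event payloads.
import asyncio
import json

import httpx
import websockets

BASE = "http://localhost:8000"

async def main() -> None:
    async with httpx.AsyncClient() as client:
        r = await client.post(f"{BASE}/run", json={
            "goal": "Summarise the key benefits of async Python",
            "priority": 7,
        })
        r.raise_for_status()
        print("submitted:", r.json())

    last_seq = 0
    while True:
        try:
            # `since` replays the 600-event ring buffer from sequence N,
            # so a dropped connection doesn't lose events.
            url = f"ws://localhost:8000/ws?since={last_seq}"
            async with websockets.connect(url) as ws:
                async for raw in ws:
                    event = json.loads(raw)
                    last_seq = event.get("seq", last_seq)
                    print(event)
        except (websockets.ConnectionClosed, OSError):
            await asyncio.sleep(1)  # brief backoff, then replay from last_seq

asyncio.run(main())
```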
```bash
curl -X POST http://localhost:8000/run \
  -H 'Content-Type: application/json' \
  -d '{"goal": "Summarise the key benefits of async Python", "priority": 7}'
```

OpenTeddy tries to keep every task local. Claude is called only when the local path breaks down:
| Trigger | Where | Default |
|---|---|---|
| Subtask timeout (local model hangs) | `orchestrator._run_subtask` | 120 s |
| Low self-reported confidence | `executor._qwen_execute` | < 0.6 |
| Repeated failures in a row | `orchestrator._run_subtask` | 3 |
| Hard-failure signal in tool output (`unhealthy`, `Exited`, `ERROR 1045`, `Error response from daemon`, …) | `executor._finalize_response` | confidence clamped to 0.3 → escalates |
| Container health check fails after a Docker task | `orchestrator._inspect_docker_health` | auto-pulls `docker logs` + `inspect`, then escalates |
| Deliverable verifier returns FAIL | `orchestrator._verify_deliverable` | confidence clamped to 0.3 → retry, then escalate |
| Circuit breaker tripped (5 cumulative tool failures) | `executor._qwen_execute` | forces final-answer commit; escalation kicks in if confidence is still low |
This keeps cost low for everyday work while still guaranteeing you get a real answer when the local model cannot deliver one. All triggers can be globally disabled via `ESCALATION_ENABLED=false` (or the per-session "Local-only" toggle in the UI).
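In sketch form, the trigger logic amounts to something like this. Illustrative only: the function and field names are hypothetical, the thresholds are the documented defaults, and in the real flow a verifier FAIL forces a retry before escalating:

```python
# Illustrative sketch of the escalation decision -- not the actual
# orchestrator code. Thresholds are the documented defaults.
from dataclasses import dataclass

ESCALATION_ENABLED = True
ESCALATION_THRESHOLD = 0.6      # min local confidence
ESCALATION_FAILURE_LIMIT = 3    # consecutive failures
SUBTASK_TIMEOUT = 120           # seconds

HARD_FAILURE_SIGNALS = ("unhealthy", "Exited", "ERROR 1045",
                        "Error response from daemon")

@dataclass
class SubtaskResult:
    confidence: float
    elapsed_s: float
    consecutive_failures: int
    tool_output: str
    verifier_passed: bool

def should_escalate(r: SubtaskResult) -> bool:
    if not ESCALATION_ENABLED:
        return False  # stay local; surface the failure to the user
    if any(sig in r.tool_output for sig in HARD_FAILURE_SIGNALS):
        r.confidence = min(r.confidence, 0.3)  # clamp on hard failures
    if not r.verifier_passed:
        r.confidence = min(r.confidence, 0.3)  # judge said FAIL (retry first)
    return (
        r.elapsed_s > SUBTASK_TIMEOUT
        or r.confidence < ESCALATION_THRESHOLD
        or r.consecutive_failures >= ESCALATION_FAILURE_LIMIT
    )
```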
- When Qwen executes a subtask it suggests a skill name if a reusable function would have helped.
- The Executor calls `SkillFactory.generate_skill()` in the background.
- Claude writes the skill as an `async def run(input_data: dict) -> str` function and saves it to `skills/<name>.py` (sketched below).
- The skill starts in TESTING status. After `SKILL_PROMOTION_THRESHOLD` successful invocations it is promoted to ACTIVE.
- Future tasks automatically match and invoke active skills; no LLM call needed.
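For illustration, a generated skill file might look like the following. The `async def run(input_data: dict) -> str` contract is from the list above; the skill name and body are hypothetical:

```python
# skills/word_count.py -- hypothetical example of a generated skill.
# The `async def run(input_data: dict) -> str` contract is the documented
# shape; everything else here is illustrative.
from pathlib import Path

async def run(input_data: dict) -> str:
    """Count words in a text file given as input_data['path']."""
    path = Path(input_data["path"])
    if not path.exists():
        return f"error: {path} does not exist"
    words = path.read_text(encoding="utf-8").split()
    return f"{path.name} contains {len(words)} words"
```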
Most of these can also be edited live from the dashboard's Settings panel; changes are persisted to SQLite and `config.reload_from_store()` re-applies them without a server restart.
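Settings can also be flipped over the documented `POST /settings` endpoint. A hedged sketch, assuming the payload is a flat JSON object of setting keys (check `/docs` for the real schema):

```python
# Hypothetical sketch: flip the per-step verifier off at runtime via the
# documented POST /settings endpoint. The flat-JSON payload shape is an
# assumption -- consult the live API docs for the actual schema.
import httpx

resp = httpx.post("http://localhost:8000/settings",
                  json={"verification_enabled": False})
resp.raise_for_status()
print(httpx.get("http://localhost:8000/settings").json())
```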
| Variable | Default | Description |
|---|---|---|
| `ANTHROPIC_API_KEY` | – | Required only if escalation is enabled. Anthropic API key. |
| `CLAUDE_MODEL` | `claude-opus-4-6` | Claude model for escalation. |
| `GEMMA_BASE_URL` | `http://localhost:11434` | Ollama base URL for the orchestrator. |
| `GEMMA_MODEL` | `gemma3:4b` | Orchestrator model tag. |
| `QWEN_BASE_URL` | `http://localhost:11434` | Ollama base URL for the executor. |
| `QWEN_MODEL` | `qwen2.5:3b` | Executor model tag. |
| `DB_PATH` | `openteddy.db` | SQLite database path. |
| `MEMORY_DB_PATH` | `./memory_db` | ChromaDB directory. |
| `SKILLS_DIR` | `skills` | Directory for skill files. |

| Variable | Default | Description |
|---|---|---|
| `ESCALATION_ENABLED` | `true` | Master kill-switch for Claude. When `false`, low-confidence / timeout / failure-signal triggers stay local and surface a failure to the user instead of calling Claude. |
| `ESCALATION_THRESHOLD` | `0.6` | Minimum Qwen confidence before escalation. |
| `ESCALATION_FAILURE_LIMIT` | `3` | Max consecutive failures before escalation. |
| `SUBTASK_TIMEOUT` | `120` | Seconds before a subtask is treated as hung. |
| `SKILL_PROMOTION_THRESHOLD` | `5` | Successes needed to promote a skill. |
These settings matter most on big-model setups: turn them off to trade safety nets for speed.
| Variable | Default | Description |
|---|---|---|
| `STREAMING_ENABLED` | `true` | Stream LLM tokens to the chat as they generate. A major perceived-latency win on small thinking models. |
| `VERIFICATION_ENABLED` | `true` | Run the per-step LLM-as-judge verifier after each successful subtask. Set to `false` on big-model setups (DGX Spark, qwen3.5:35b) where each judge call takes 5–60 s. |
| `QWEN_NUM_CTX` | `16384` | Ollama `num_ctx` for the executor. Larger means more tool-round history before the watchdog has to compress, but more VRAM. |
| `GEMMA_NUM_CTX` | `16384` | Same, for the orchestrator. |
| `CONTEXT_COMPRESS_AT` | `0.7` | Trigger context compression when prompt-token usage crosses this fraction of `num_ctx`. |
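Put together, a minimal `.env` for a stock local setup might look like this. Illustrative only: the values are the documented defaults and the API key is a placeholder:

```bash
# .env -- minimal example built from the documented defaults above.
ANTHROPIC_API_KEY=sk-ant-...   # placeholder; required only for escalation
GEMMA_MODEL=gemma3:4b
QWEN_MODEL=qwen2.5:3b
ESCALATION_ENABLED=true
VERIFICATION_ENABLED=true
QWEN_NUM_CTX=16384
CONTEXT_COMPRESS_AT=0.7
```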
OpenTeddy ships with a native macOS shell built on Tauri 2.x that wraps the
web dashboard inside a polished launcher. Source lives in `desktop/`
(its own repo, gitignored from the main repo).
What you get on top of the web UI:
- Onboarding wizard: language picker, Privacy Policy gate, hardware tier select (Beginner / Advanced / Flagship), one-click Ollama install, streaming model-pull progress.
- Mode-locked sessions: once a session has its first task, the Chat / Analytic / Build mode is locked so the agent's tool palette stays consistent for that conversation.
- Custom dialogs: replaces native `confirm` / `alert` / `prompt` (which Tauri blocks) with in-app modals that match the chrome.
- Auto-update against GitHub Releases: periodic poll, in-app changelog, one-click apply.
- Diagnostics download: single-click `app.log` + tasks/usage/settings zip for bug reports.
- Returning launches skip the splash: once you've finished onboarding, the splash goes straight into `enter_main`, so subsequent starts land on the main window immediately.
```bash
cd desktop
npm install
npx tauri dev                                 # hot-reload dev (still needs uvicorn running separately)
./scripts/build_macos.sh                      # package: desktop/dist/OpenTeddy-<ver>-<arch>.dmg
./scripts/build_macos.sh --target universal   # universal2 (arm64 + x86_64)
```

The packaged .dmg is unsigned until an Apple Developer ID is wired up, so
first-run users need to right-click → Open, or:

```bash
xattr -dr com.apple.quarantine /Applications/OpenTeddy.app
```
| Platform | Status | Notes |
|---|---|---|
| macOS (Intel / Apple Silicon) | ✅ Fully supported | Primary development target. |
| Linux | ✅ Fully supported | Any distro with Python 3.11+ and Ollama. |
| Windows (native) | ⚠ Partial | See caveats below. |
| Windows (WSL2) | ✅ Fully supported | Behaves like Linux. Recommended on Windows. |
The codebase itself is cross-platform Python (uses `pathlib`, `os.path.join`,
`asyncio`), and `package_tool.py` already handles the Windows venv layout
(`Scripts\pip.exe`). The things that actually trip Windows users are:
- The executor LLM generates POSIX shell commands. When Qwen decides to run `ls`, `rm -rf`, `grep`, `chmod`, or pipes like `cmd1 | tee file`, those are executed through the system shell, which is cmd.exe / PowerShell on native Windows, so they fail. Running OpenTeddy under WSL2 makes this a non-issue.
- `lsof` / `ps` are not available on native Windows. The deploy-tool helpers that inspect port occupancy (`port_probe`, `port_free` in `tools/deploy_tool.py`) degrade: `port_probe` returns a bound/free flag but no PID/process name; `port_free` returns an error and cannot kill by port.
- Ollama on Windows is officially supported (install from ollama.com); pulling and running Gemma/Qwen works the same as on Mac/Linux.
Recommendation: on Windows, install Ollama natively on the host, then run OpenTeddy itself inside WSL2 Ubuntu. That gives you GPU-accelerated local inference + a POSIX userspace for the shell-heavy parts of the agent.
`docker-compose.yml` uses `extra_hosts: ["host-gateway:host-gateway"]` so
the container can reach Ollama running on the host. This requires Docker
Engine 20.10+ on Linux, and Ollama must be bound to `0.0.0.0`, not
just `127.0.0.1`; otherwise the container's bridged traffic can't reach
it. Set `OLLAMA_HOST=0.0.0.0:11434` before `ollama serve`. On Docker
Desktop (Mac / Windows) this "just works".
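For reference, a hypothetical excerpt of the relevant compose wiring. The `extra_hosts` entry matches the shipped file as described above; pointing the documented model base URLs at the `host-gateway` alias is an assumption:

```yaml
# Hypothetical excerpt -- the extra_hosts entry matches the shipped
# docker-compose.yml; the environment overrides are illustrative.
services:
  openteddy:
    extra_hosts:
      - "host-gateway:host-gateway"   # resolve host-gateway to the Docker host
    environment:
      - GEMMA_BASE_URL=http://host-gateway:11434
      - QWEN_BASE_URL=http://host-gateway:11434
```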
```bash
cp .env.example .env
# Fill in ANTHROPIC_API_KEY
docker compose up -d
# Open http://localhost:8000
```

Notes:
- Ollama must be running on the host (`ollama serve`).
- The container reaches host Ollama via the `host-gateway` alias set in `docker-compose.yml`.
- Skills and the usage database persist in the `openteddy_data` Docker volume.
- Rebuild the image with `docker compose up -d --build`.
The default `docker-compose.yml` only mounts an isolated named volume
(`openteddy_data` → `/app/data`). It does not bind-mount your home
directory, Desktop, Downloads, or any other host folder. That means:
- Tasks like "read
~/Documents/report.pdf", "tidy up my Downloads folder", or "run this script on my Desktop" will not work in the Docker setup β the container simply cannot see those files. - The agent's shell/file/python tools operate entirely inside the
container. Any files it reads or writes live in
/app/dataand disappear if the volume is removed.
If you need the agent to operate on files on your machine, run OpenTeddy
directly with uvicorn (see Quick Start) instead of Docker.
The native process has full access to your filesystem (subject to your user's
permissions), which is what most "local assistant" use cases actually want.
Alternatively, if you really want to stay on Docker, you can add a bind mount
to `docker-compose.yml`, e.g.:

```yaml
volumes:
  - openteddy_data:/app/data
  - ${HOME}/openteddy-workspace:/workspace   # ← exposed host folder
```

…and then point the agent at `/workspace` inside the container. Only the
folders you explicitly mount are visible; everything else stays isolated.
OpenTeddy is a solo side-project trying to prove that a small open stack can get close to the big commercial agents. If you want to see it keep growing:
- ⭐ Star the repo (github.com/m31527/OpenTeddy); it's the single biggest encouragement I get.
- Open an issue if something breaks or a model setup confuses you.
- Share a skill you built on top of OpenTeddy; PRs welcome.
MIT