
OpenTeddy

The platform that makes local LLMs ship work.

Local models alone are weak. Wrap them in OpenTeddy and you get a real agent: hardened orchestration, a self-growing skills library, and just enough commercial-LLM escalation to finish what local can't.


🌐 Web: openteddy-72cee.web.app · 📦 Source: github.com/m31527/OpenTeddy


Why this exists

A 2B / 4B / 7B local model on its own is a toy. It hallucinates, it loops, it stops mid-task. The model isn't the product; the platform around it is. OpenTeddy is that platform:

  • A hardened agent loop that knows when to give up, when to retry, and when to call Claude: no infinite "let me try that again" doom-spirals.
  • A self-growing skills library that turns repeated work into plain Python functions, so the second time you ask the same question it costs zero LLM calls.
  • Hardware-tuned model presets for everything from a 16 GB MacBook to a DGX Spark: the right num_ctx, max_tokens, and timeout per tier.
  • Commercial-LLM escalation as a safety net, not a bill: Claude only gets called when local genuinely can't finish, and the Usage dashboard shows you how much GPT-4 would've charged for the same work.

The result: your $0/token local hardware actually finishes the job, and the savings counter in the sidebar is what makes you stop worrying about Claude Pro auto-renewing.

If this resonates with you, or you just want to cheer the project on, please drop a ⭐ on the repo. It genuinely helps and keeps me motivated to ship more. → github.com/m31527/OpenTeddy

Highlights

  • Local-first: planning (Gemma) and execution (Qwen) run on your machine via Ollama; Claude is only called when local models struggle.
  • Auto-escalation to Claude: timeouts, low confidence, repeated failures, hard-failure signals in tool output (e.g. unhealthy containers, ERROR 1045), or failed health checks all trigger Claude intervention automatically.
  • Self-growing skills: repeated tasks are promoted into reusable Python skills, cutting LLM calls over time.
  • Streaming UI: both the orchestrator's planning and the executor's answer stream token-by-token via WebSocket; no more staring at a spinner while the model thinks.
  • Per-step deliverable verification: LLM-as-judge confirms each produced file actually matches the goal, catching the "wrote a report about the game instead of the game" failure mode. Toggleable for big-model setups where extra calls are too costly.
  • Loop hardening for small models: adaptive prompts, a parallel low-risk tool fan-out, per-tool-name caps, a circuit breaker, discovery memos, and a context watchdog that compresses old turns before busting num_ctx.
  • Reconnect-safe streaming: the WebSocket carries a 600-event ring buffer, so a flaky network or a tab refresh replays the missed events instead of leaving the UI stuck.
  • Web dashboard: submit tasks, watch tool calls stream live, review pending approvals, manage memory, render Markdown/GFM tables, embed Chart.js datalabels in HTML reports, and tune settings.
  • Native macOS desktop client: Tauri 2.x shell with onboarding wizard (Ollama install + tier-based model pull), language picker, mode-locked sessions, auto-update against GitHub Releases, and one-click diagnostics download. See desktop/.
  • Analytic / report mode: first-class csv_describe + python_exec tools and an HTML report generator that renders charts with value labels.
  • Human-in-the-loop: high-risk shell commands (rm, sudo, mv, …) pause for approval before running.
  • Persistent memory: ChromaDB-backed long-term memory feeds relevant context back into future plans.
  • 22-locale i18n: UI strings live in static/i18n.js; a build-hash check auto-reloads the page when the dashboard is updated.
  • Hot-reloadable settings: change models, thresholds, performance toggles, and endpoints from the UI without restarting the server.

Architecture

User Goal
   │
   ▼
┌────────────────────────────────────────────────────┐
│  Orchestrator  (Gemma via Ollama)                  │
│  • Decomposes goal into ordered SubTasks           │
│  • Streams plan tokens to the UI as it thinks      │
│  • Retrieves long-term memory for context          │
│  • Drives execution + escalation loop              │
└────────────────────┬───────────────────────────────┘
                     │ SubTasks
                     ▼
┌────────────────────────────────────────────────────┐
│  Executor  (Qwen via Ollama, function calling)     │
│  • Runs a matching Skill if available              │
│  • Uses tools: shell, file, http, db, gcp, package,│
│    csv_describe, python_exec, generate_report      │
│  • Streams answer tokens; parallelises low-risk    │
│    tool calls; caps per-tool-name retries          │
│  • Compresses old turns when context fills up      │
│  • Reports confidence (clamped on hard failures)   │
└────────────────────┬───────────────────────────────┘
                     │ produced files
                     ▼
┌────────────────────────────────────────────────────┐
│  Deliverable Verifier  (LLM-as-judge, Qwen)        │
│  • Reads the produced HTML/MD/Py/etc.              │
│  • Verdict: PASS or FAIL - forces retry on FAIL    │
│  • Skipped via `verification_enabled = false`      │
└────────────────────┬───────────────────────────────┘
    low conf │ timeout │ failure signal │ unhealthy
                     ▼
┌────────────────────────────────────────────────────┐
│  Escalation Agent  (Claude via API)                │
│  • Resolves hard subtasks with full diagnostics    │
│  • Synthesises the final summary                   │
└────────────────────┬───────────────────────────────┘
                     ▼
┌────────────────────────────────────────────────────┐
│  Skill Factory  (Claude via API)                   │
│  • Generates new Python skills on demand           │
│  • Promotes skills after N successes               │
│  • Saves skills to disk + SQLite DB                │
└────────────────────────────────────────────────────┘
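
In pseudocode, the loop the diagram describes looks roughly like this. This is a minimal sketch: the class and method names (plan, run, passes, resolve, summarise) are illustrative, not OpenTeddy's actual internals, and only the thresholds come from the Configuration Reference below.

import asyncio

SUBTASK_TIMEOUT = 120          # seconds (see Configuration Reference)
ESCALATION_THRESHOLD = 0.6     # minimum local confidence
ESCALATION_FAILURE_LIMIT = 3   # local attempts before Claude

async def run_goal(orchestrator, executor, verifier, claude, goal):
    subtasks = await orchestrator.plan(goal)       # Gemma decomposes the goal
    for sub in subtasks:
        result = None
        for _ in range(ESCALATION_FAILURE_LIMIT):  # local attempts
            try:
                result = await asyncio.wait_for(executor.run(sub),
                                                timeout=SUBTASK_TIMEOUT)
            except asyncio.TimeoutError:
                continue                           # hung: retry, then escalate
            if (result.confidence >= ESCALATION_THRESHOLD
                    and await verifier.passes(sub, result)):
                break                              # subtask finished locally
        else:
            result = await claude.resolve(sub)     # escalate to Claude
        sub.result = result
    return await claude.summarise(subtasks)        # per the diagram, Claude
                                                   # synthesises the summary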

Loop Hardening (small-model resilience)

The agent loop has been progressively hardened to make small / mid-size local models (Gemma 3:4B, Qwen 2.5:3B class) reliable enough to ship work end-to-end, not just fast enough to look impressive on a single tool call:

| Mechanism | What it does |
| --- | --- |
| Adaptive prompts | Compact system prompts on small models; richer guidance only when context allows. |
| Parallel tool fan-out | Low-risk tool calls (file reads, shell ls, HTTP GETs, csv_describe) inside a single round are dispatched with asyncio.gather instead of serially. |
| Per-step deliverable verification | After each successful subtask, an LLM-as-judge reviews the produced HTML/MD/code file. If it looks like a description of the goal rather than the actual deliverable (the "Snake Game report" failure pattern), the subtask is forced to retry with feedback. |
| Context watchdog | When the prompt size approaches num_ctx, the executor compresses earlier turns into a recap and pins discovery memos to the system prompt, keeping recent tool context intact instead of letting Ollama silently truncate. |
| Discovery memos | Useful one-off facts learned from tool calls (e.g. "the workspace already contains data.csv with columns X/Y/Z") are pinned to the system prompt so the model doesn't re-discover them every round. |
| Per-tool-name cap | Each tool name is capped at 5 calls per subtask, which stops the model from re-running csv_describe on the same file ten times. |
| Circuit breaker | After 5 cumulative tool failures the loop is forced to commit to a final answer instead of looping forever. |
| Common error hints | Twelve frequent stack-trace patterns (ModuleNotFoundError, KeyError, PermissionError, …) are matched against tool stderr and converted into one-line hints so the model corrects itself instead of repeating the same mistake. |
| WS reconnect + replay | The dashboard WebSocket carries a 600-event ring buffer keyed by sequence number; a refreshed tab or a wifi blip replays missed events on reconnect. |
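
The fan-out, per-tool-name cap, and circuit breaker compose roughly as follows. This is an illustrative sketch: the tool names, helper signature, and state handling are assumptions, not executor.py's real internals.

import asyncio
from collections import Counter

LOW_RISK = {"file_read", "http_get", "csv_describe"}  # assumed low-risk names
PER_TOOL_CAP = 5        # max calls per tool name per subtask
BREAKER_LIMIT = 5       # cumulative failures before forcing a final answer

async def run_round(tool_calls, dispatch, counts: Counter, failures: int):
    # Drop calls whose tool name has already hit the per-tool cap.
    allowed = [c for c in tool_calls if counts[c.name] < PER_TOOL_CAP]
    for call in allowed:
        counts[call.name] += 1
    # Low-risk calls fan out in parallel; risky ones stay serial.
    low = [c for c in allowed if c.name in LOW_RISK]
    high = [c for c in allowed if c.name not in LOW_RISK]
    results = list(await asyncio.gather(*(dispatch(c) for c in low),
                                        return_exceptions=True))
    for call in high:
        try:
            results.append(await dispatch(call))
        except Exception as exc:
            results.append(exc)
    failures += sum(isinstance(r, Exception) for r in results)
    if failures >= BREAKER_LIMIT:
        raise RuntimeError("circuit breaker: commit to a final answer now")
    return results, failures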

File Structure

OpenTeddy/
├── config.py          # Config via .env / environment variables
├── models.py          # Pydantic models + SQLite schema
├── tracker.py         # Async SQLite persistence (aiosqlite) + perf stats
├── skill_factory.py   # Claude-powered skill generation & loader
├── executor.py        # Qwen executor - function calling, streaming,
│                      #   parallel low-risk tools, context watchdog,
│                      #   discovery memos, per-tool cap, circuit breaker
├── escalation.py      # Claude escalation agent
├── orchestrator.py    # Gemma orchestrator (plan → execute → verify →
│                      #   escalate) + per-step deliverable judge
├── memory.py          # ChromaDB long-term memory
├── approval_store.py  # Human-in-the-loop approval queue
├── settings_store.py  # Hot-reloadable settings (SQLite-backed)
├── tool_registry.py   # Tool registration + risk gating
├── tools/             # shell / file / http / db / gcp / package /
│                      #   analytic (csv_describe, python_exec) /
│                      #   report_tool (HTML + Chart.js datalabels)
├── skills/            # Auto-generated skill .py files
├── static/            # Web dashboard (index.html, i18n.js with 22 locales,
│                      #   OpenTeddy-logo.svg)
├── desktop/           # Native macOS Tauri 2.x client (own repo)
├── main.py            # FastAPI server + CLI entry point + WS ring buffer
└── .env.example       # Environment variable template

Quick Start

1. Prerequisites

  • Python 3.11+
  • Ollama running locally:
    ollama pull gemma3:4b
    ollama pull qwen2.5:3b
  • An Anthropic API key (used for escalation and skill generation)

2. Install

git clone https://github.com/m31527/OpenTeddy.git
cd OpenTeddy
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt

3. Configure

cp .env.example .env
# Edit .env; at minimum set ANTHROPIC_API_KEY

4. Run

uvicorn main:app --reload
# Dashboard → http://localhost:8000
# API docs  → http://localhost:8000/docs

API

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /run | Submit a task |
| GET | /tasks/{id} | Check task status |
| GET | /tasks | List recent tasks (filter by session_id) |
| GET | /skills | List all skills |
| POST | /skills/generate?name=…&description=… | Manually create a skill |
| GET | /tools | List available tools |
| GET | /approvals | Pending human approvals |
| POST | /approvals/{id}/approve, /reject | Resolve an approval |
| GET | /memory | Browse long-term memory |
| GET | /usage, /usage/summary | Token usage & estimated cost |
| GET | /benchmark/stats | Per-model token-throughput stats (#6) |
| GET, POST | /settings | Read/update runtime settings |
| GET | /settings/ollama/models, /status | Local model management |
| POST | /settings/ollama/pull | Pull a model (streamed progress) |
| GET | /version | Build hash + version (used by UI auto-reload) |
| GET | /update/check | Check GitHub Releases for a newer version |
| POST | /update/apply | Apply an available update |
| POST | /optimize_prompt | Rewrite a draft goal via Claude |
| GET | /admin/diagnostics | Download a zipped diagnostic bundle |
| GET | /health | Health check |
| WS | /ws?since=N | Live event stream; since replays the ring buffer from sequence N |

Example request

curl -X POST http://localhost:8000/run \
  -H 'Content-Type: application/json' \
  -d '{"goal": "Summarise the key benefits of async Python", "priority": 7}'

How Claude Steps In

OpenTeddy tries to keep every task local. Claude is called only when the local path breaks down:

| Trigger | Where | Default |
| --- | --- | --- |
| Subtask timeout (local model hangs) | orchestrator._run_subtask | 120 s |
| Low self-reported confidence | executor._qwen_execute | < 0.6 |
| Repeated failures in a row | orchestrator._run_subtask | 3 |
| Hard-failure signal in tool output (unhealthy, Exited, ERROR 1045, Error response from daemon, …) | executor._finalize_response | confidence clamped to 0.3 → escalates |
| Container health check fails after a Docker task | orchestrator._inspect_docker_health | auto-pulls docker logs + inspect, then escalates |
| Deliverable verifier returns FAIL | orchestrator._verify_deliverable | confidence clamped to 0.3 → retry, then escalate |
| Circuit breaker tripped (5 cumulative tool failures) | executor._qwen_execute | forces final-answer commit; escalation kicks in if confidence is still low |

This keeps cost low for everyday work while still guaranteeing you get a real answer when the local model cannot deliver one. All triggers can be globally disabled via ESCALATION_ENABLED=false (or the per-session "Local-only" toggle in the UI).
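
The hard-failure clamp in the table boils down to a string scan over tool output. Roughly (a sketch, not the actual executor._finalize_response code):

HARD_FAILURE_SIGNALS = ("unhealthy", "Exited", "ERROR 1045",
                        "Error response from daemon")

def clamp_confidence(tool_output: str, reported: float) -> float:
    # A hard-failure signal caps self-reported confidence at 0.3, which is
    # below ESCALATION_THRESHOLD (0.6) and therefore forces escalation.
    if any(sig in tool_output for sig in HARD_FAILURE_SIGNALS):
        return min(reported, 0.3)
    return reported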

Self-Growth Mechanism

  1. When Qwen executes a subtask it suggests a skill name if a reusable function would have helped.
  2. The Executor calls SkillFactory.generate_skill() in the background.
  3. Claude writes the skill as an async def run(input_data: dict) -> str function and saves it to skills/<name>.py.
  4. The skill starts in TESTING status. After SKILL_PROMOTION_THRESHOLD successful invocations it is promoted to ACTIVE.
  5. Future tasks automatically match and invoke active skills, with no LLM call needed.
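
A generated skill file might look like this. The example itself is hypothetical; only the module contract (an async def run(input_data: dict) -> str saved under skills/) comes from the steps above.

# skills/disk_usage.py  (hypothetical auto-generated skill)
import asyncio

async def run(input_data: dict) -> str:
    """Report disk usage for a path, with no LLM call involved."""
    path = input_data.get("path", ".")
    proc = await asyncio.create_subprocess_exec(
        "du", "-sh", path,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, err = await proc.communicate()
    return out.decode().strip() or err.decode().strip()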

Configuration Reference

Most of these can also be edited live from the dashboard's Settings panel; changes are persisted to SQLite and config.reload_from_store() re-applies them without a server restart.
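
For example, flipping the verifier off at runtime is a single request. A sketch: the payload key mirrors the VERIFICATION_ENABLED variable below, but the exact settings schema is an assumption.

import requests  # pip install requests

base = "http://localhost:8000"
print(requests.get(f"{base}/settings").json())        # read current settings
requests.post(f"{base}/settings",
              json={"VERIFICATION_ENABLED": False}).raise_for_status()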

Core

| Variable | Default | Description |
| --- | --- | --- |
| ANTHROPIC_API_KEY | (none) | Anthropic API key. Required only if escalation is enabled. |
| CLAUDE_MODEL | claude-opus-4-6 | Claude model for escalation. |
| GEMMA_BASE_URL | http://localhost:11434 | Ollama base URL for the orchestrator. |
| GEMMA_MODEL | gemma3:4b | Orchestrator model tag. |
| QWEN_BASE_URL | http://localhost:11434 | Ollama base URL for the executor. |
| QWEN_MODEL | qwen2.5:3b | Executor model tag. |
| DB_PATH | openteddy.db | SQLite database path. |
| MEMORY_DB_PATH | ./memory_db | ChromaDB directory. |
| SKILLS_DIR | skills | Directory for skill files. |

Escalation

| Variable | Default | Description |
| --- | --- | --- |
| ESCALATION_ENABLED | true | Master kill-switch for Claude. When false, low-confidence / timeout / failure-signal triggers stay local and surface a failure to the user instead of calling Claude. |
| ESCALATION_THRESHOLD | 0.6 | Minimum Qwen confidence before escalation. |
| ESCALATION_FAILURE_LIMIT | 3 | Max consecutive failures before escalation. |
| SUBTASK_TIMEOUT | 120 | Seconds before a subtask is treated as hung. |
| SKILL_PROMOTION_THRESHOLD | 5 | Successes needed to promote a skill. |

Performance toggles (loop hardening)

These toggles cost the most on big-model setups, where each extra call is slow; turn them off there to trade safety nets for speed.

| Variable | Default | Description |
| --- | --- | --- |
| STREAMING_ENABLED | true | Stream LLM tokens to the chat as they are generated. A major perceived-latency win on small thinking models. |
| VERIFICATION_ENABLED | true | Run the per-step LLM-as-judge verifier after each successful subtask. Set to false on big-model setups (DGX Spark, qwen3.5:35b) where each judge call is 5–60 s. |
| QWEN_NUM_CTX | 16384 | Ollama num_ctx for the executor. Larger means more tool-round history before the watchdog has to compress, but more VRAM. |
| GEMMA_NUM_CTX | 16384 | Same, for the orchestrator. |
| CONTEXT_COMPRESS_AT | 0.7 | Trigger context compression when prompt-token usage crosses this fraction of num_ctx. |
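
The watchdog trigger is simple arithmetic over the last two knobs; a sketch of the condition (the function name is illustrative):

QWEN_NUM_CTX = 16384
CONTEXT_COMPRESS_AT = 0.7

def should_compress(prompt_tokens: int) -> bool:
    # Compress older turns once the prompt crosses 70% of num_ctx
    # (16384 * 0.7 ≈ 11469 tokens), before Ollama silently truncates.
    return prompt_tokens >= QWEN_NUM_CTX * CONTEXT_COMPRESS_AT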

Desktop Client (macOS)

OpenTeddy ships with a native macOS shell built on Tauri 2.x that wraps the web dashboard inside a polished launcher. Source lives in desktop/ (its own repo, gitignored from the main repo).

What you get on top of the web UI:

  • Onboarding wizard: language picker, Privacy Policy gate, hardware tier-select (Beginner / Advanced / Flagship), one-click Ollama install, streaming model-pull progress.
  • Mode-locked sessions: once a session has its first task, the Chat / Analytic / Build mode is locked so the agent's tool palette stays consistent for that conversation.
  • Custom dialogs: replaces native confirm / alert / prompt (which Tauri blocks) with in-app modals that match the chrome.
  • Auto-update against GitHub Releases: periodic poll, in-app changelog, one-click apply.
  • Diagnostics download: single-click app.log + tasks/usage/settings zip for bug reports.
  • Returning launches skip the splash: once you've finished onboarding, the splash goes straight into enter_main, so subsequent starts land on the main window immediately.

cd desktop
npm install
npx tauri dev          # hot-reload dev (still needs uvicorn running separately)
./scripts/build_macos.sh             # package: desktop/dist/OpenTeddy-<ver>-<arch>.dmg
./scripts/build_macos.sh --target universal   # universal2 (arm64 + x86_64)

The packaged .dmg is unsigned until an Apple Developer ID is wired up; first-run users need to right-click → Open, or: xattr -dr com.apple.quarantine /Applications/OpenTeddy.app.

Platform Support

| Platform | Status | Notes |
| --- | --- | --- |
| macOS (Intel / Apple Silicon) | ✅ Fully supported | Primary development target. |
| Linux | ✅ Fully supported | Any distro with Python 3.11+ and Ollama. |
| Windows (native) | ⚠️ Partial; use WSL2 if possible | See caveats below. |
| Windows (WSL2) | ✅ Fully supported | Behaves like Linux. Recommended on Windows. |

Windows caveats

The codebase itself is cross-platform Python (uses pathlib, os.path.join, asyncio), and package_tool.py already handles the Windows venv layout (Scripts\pip.exe). The things that actually trip Windows users are:

  • The executor LLM generates POSIX shell commands. When Qwen decides to run ls, rm -rf, grep, chmod, or pipes like cmd1 | tee file, those are executed through the system shell, which is cmd.exe / PowerShell on native Windows, so they fail. Running OpenTeddy under WSL2 makes this a non-issue.
  • lsof / ps are not available on native Windows. The deploy-tool helpers that inspect port occupancy (port_probe, port_free in tools/deploy_tool.py) degrade: port_probe returns a bound/free flag but no PID/process name; port_free returns an error and cannot kill by port.
  • Ollama on Windows is officially supported (install from ollama.com); pulling and running Gemma/Qwen works the same as on Mac/Linux.

Recommendation: on Windows, install Ollama natively on the host, then run OpenTeddy itself inside WSL2 Ubuntu. That gives you GPU-accelerated local inference + a POSIX userspace for the shell-heavy parts of the agent.

Docker network caveat (Linux hosts)

docker-compose.yml uses extra_hosts: ["host-gateway:host-gateway"] so the container can reach Ollama running on the host. This requires Docker Engine 20.10+ on Linux, and Ollama must be bound to 0.0.0.0, not just 127.0.0.1; otherwise the container's bridged traffic can't reach it. Set OLLAMA_HOST=0.0.0.0:11434 before ollama serve. On Docker Desktop (Mac / Windows) this "just works".

Docker Deployment

cp .env.example .env
# Fill in ANTHROPIC_API_KEY
docker compose up -d
# Open http://localhost:8000

Notes:

  • Ollama must be running on the host (ollama serve).
  • The container reaches host Ollama via the host-gateway alias set in docker-compose.yml.
  • Skills and the usage database persist in the openteddy_data Docker volume.
  • Rebuild image: docker compose up -d --build.

⚠️ Docker cannot touch your host filesystem

The default docker-compose.yml only mounts an isolated named volume (openteddy_data → /app/data). It does not bind-mount your home directory, Desktop, Downloads, or any other host folder. That means:

  • Tasks like "read ~/Documents/report.pdf", "tidy up my Downloads folder", or "run this script on my Desktop" will not work in the Docker setup; the container simply cannot see those files.
  • The agent's shell/file/python tools operate entirely inside the container. Any files it reads or writes live in /app/data and disappear if the volume is removed.

If you need the agent to operate on files on your machine, run OpenTeddy directly with uvicorn (see Quick Start) instead of Docker. The native process has full access to your filesystem (subject to your user's permissions), which is what most "local assistant" use cases actually want.

Alternatively, if you really want to stay on Docker, you can add a bind mount to docker-compose.yml, e.g.:

    volumes:
      - openteddy_data:/app/data
      - ${HOME}/openteddy-workspace:/workspace   # ← exposed host folder

…and then point the agent at /workspace inside the container. Only the folders you explicitly mount are visible; everything else stays isolated.

Support the project

OpenTeddy is a solo side-project trying to prove that a small open stack can get close to the big commercial agents. If you want to see it keep growing:

  • ⭐ Star the repo (github.com/m31527/OpenTeddy): it's the single biggest encouragement I get.
  • 🐛 Open an issue if something breaks or a model setup confuses you.
  • 🧠 Share a skill you built on top of OpenTeddy; PRs welcome.

License

MIT

About

Self-growing multi-agent system: Gemma Orchestrator + Qwen Executor + Claude Escalation + Skill Factory
