You serve Gemma 4 31B locally via llama.cpp or vLLM or LM Studio.
Your chat frontend (Open WebUI, LibreChat, Lobe Chat, a custom React thing)
expects responses that look like OpenAI's structured `tool_calls` array:

```json
{ "choices": [{ "message": { "tool_calls": [{
  "id": "call_a1b2",
  "type": "function",
  "function": { "name": "search", "arguments": "{\"query\":\"weather\"}" }
}]}}]}
```

But Gemma 4 emits tool calls as text inside the assistant message, looking roughly like:

```text
<|tool_call|>call:search{"query":"weather"}<tool_call|>
```

Your frontend sees text, not a tool_calls array. The tool never fires. The
chat hangs.

tooltalk sits between your chat frontend and your Gemma 4 server and transparently rewrites the text-format tool calls into the structured shape your frontend already knows how to render. It is streaming-safe, quote-tolerant, and partial-token-aware.

```text
┌──────────────┐    OpenAI-shape     ┌──────────────┐    OpenAI-shape     ┌─────────────────┐
│  Your chat   │ ──── (in/out) ────▶ │   tooltalk   │ ──── (in/out) ────▶ │ Gemma 4 server  │
│  (OWUI,      │                     │    :8054     │                     │ (llama.cpp,     │
│  LibreChat,  │ ◀────────────────── │              │ ◀────────────────── │  vLLM, LM       │
│  anything)   │  tool_calls: [...]  │   rewrites   │     text-format     │  Studio...)     │
└──────────────┘                     └──────────────┘                     └─────────────────┘
```

The frontend doesn't know Gemma 4 emits text-format tool calls. The Gemma 4 server doesn't know it's being normalized. Nobody changes their code.
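
To make the rewrite concrete, here is a minimal non-streaming sketch of the translation step. It is not tooltalk's actual code: the function name `extract_tool_calls` and the regex are assumptions for illustration, the tag pattern simply tolerates either pipe placement, and the arguments are assumed to be valid JSON already.

```python
import json
import re
import secrets

# Assumed tag pattern: accepts <|tool_call|>, <|tool_call> and <tool_call|>.
TAG = r"<\|?tool_call\|?>"
TOOL_CALL_RE = re.compile(
    TAG + r"\s*call:(?P<name>[\w.\-]+)\s*(?P<args>\{.*?\})\s*" + TAG,
    re.DOTALL,
)


def extract_tool_calls(text):
    """Return (remaining_text, tool_calls) for one complete assistant message."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        args = json.loads(match.group("args"))  # assumes well-formed JSON
        calls.append({
            "id": "call_" + secrets.token_hex(4),  # "call_" plus 8 hex characters
            "type": "function",
            "function": {
                "name": match.group("name"),
                # The OpenAI shape carries arguments as a JSON string, not an object.
                "arguments": json.dumps(args, separators=(",", ":")),
            },
        })
    remaining = TOOL_CALL_RE.sub("", text).strip()
    return remaining, calls


content, calls = extract_tool_calls(
    '<|tool_call|>call:search{"query":"weather"}<tool_call|>'
)
print(calls[0]["function"])
# → {'name': 'search', 'arguments': '{"query":"weather"}'}
```

Because the pattern is applied with `finditer`, back-to-back calls in one message each become their own entry in the list.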
This pattern is in production at three separate stacks I know of (all operator-grade, none of them public). Each one wrote their own version because the surface looks trivial — until you handle:

- Streaming SSE chunks that split a tool-call literal across packets
- Unquoted JSON keys Gemma emits (`{query:"x"}` instead of `{"query":"x"}`)
- Smart-quote contamination from the chat-template (`“` vs `"`)
- Asymmetric pipe placement in Gemma's open/close tags (`<|tool_call>` vs `<tool_call|>`)
- Multi-call turns where Gemma emits 2+ tool calls back-to-back
- The "no tool calls" path that must be byte-identical pass-through so non-tool turns aren't measurably slower

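
The two quoting problems in that list can be illustrated with a small repair pass. This is a sketch under stated assumptions, not the project's implementation: `repair_arguments` is a hypothetical name, and the regex heuristic can misfire when a string value itself contains something like `, word:` or intentional curly quotes, which is why valid JSON is tried first and left untouched.

```python
import json
import re

# Map curly quotes to their ASCII equivalents.
SMART_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})
# A bare identifier right after "{" or "," and right before ":" is treated as a key.
BARE_KEY_RE = re.compile(r'([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*:)')


def repair_arguments(raw):
    """Best-effort parse of a tool-call argument object."""
    try:
        return json.loads(raw)  # fast path: already valid JSON, nothing rewritten
    except json.JSONDecodeError:
        pass
    fixed = raw.translate(SMART_QUOTES)
    fixed = BARE_KEY_RE.sub(r'\1"\2"\3', fixed)
    return json.loads(fixed)  # still raises if the repair was not enough


print(repair_arguments('{query:"x"}'))
# → {'query': 'x'}
```

Letting the final `json.loads` raise keeps failures visible instead of silently forwarding malformed arguments.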
Get any of those wrong and the chat hangs silently. tooltalk is the
result of getting them all right in one place.
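
The streaming case is the least obvious one, so here is a hedged sketch of a buffer that survives a tag split across chunks. It assumes fixed literals for the open and close tags (taken from the example near the top) and a hypothetical `ToolCallSplitter` class; tooltalk's real parser may be structured differently. Text is released as soon as it cannot be the start of a tag, so a stream without tool calls comes out with the same content it went in with.

```python
OPEN, CLOSE = "<|tool_call|>", "<tool_call|>"  # assumed literal tags


def _partial_open_len(buf):
    """Length of the longest suffix of buf that is a proper prefix of OPEN."""
    for n in range(min(len(buf), len(OPEN) - 1), 0, -1):
        if buf.endswith(OPEN[:n]):
            return n
    return 0


class ToolCallSplitter:
    def __init__(self):
        self.buf = ""

    def feed(self, chunk):
        """Return ("text", s) and ("tool_call", body) events that are safe to emit."""
        self.buf += chunk
        events = []
        while True:
            start = self.buf.find(OPEN)
            if start == -1:
                # No complete open tag: hold back only a possible partial tag.
                keep = _partial_open_len(self.buf)
                safe = self.buf[:len(self.buf) - keep]
                if safe:
                    events.append(("text", safe))
                self.buf = self.buf[len(self.buf) - keep:]
                return events
            end = self.buf.find(CLOSE, start + len(OPEN))
            if end == -1:
                # Open tag seen but not closed yet: emit the text before it, keep the rest.
                if start:
                    events.append(("text", self.buf[:start]))
                    self.buf = self.buf[start:]
                return events
            if start:
                events.append(("text", self.buf[:start]))
            events.append(("tool_call", self.buf[start + len(OPEN):end]))
            self.buf = self.buf[end + len(CLOSE):]

    def flush(self):
        """Call at end of stream; releases anything still held back as text."""
        rest, self.buf = self.buf, ""
        return [("text", rest)] if rest else []


splitter = ToolCallSplitter()
for chunk in ["Hi <|tool", '_call|>call:search{"query":', '"weather"}<tool_', "call|> bye"]:
    print(splitter.feed(chunk))
```

In a proxy, each `tool_call` event would then be parsed and sent downstream as a `delta.tool_calls` chunk, while `text` events are forwarded as ordinary content.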

```bash
docker run --rm -d \
  --name tooltalk \
  -p 8054:8054 \
  -e TOOLTALK_UPSTREAM=http://your-gemma-host:8006 \
  ghcr.io/karany97/tooltalk:latest
```

```bash
git clone https://github.com/karany97/tooltalk.git
cd tooltalk
pip install -r requirements.txt
TOOLTALK_UPSTREAM=http://your-gemma-host:8006 \
  uvicorn app:app --host 0.0.0.0 --port 8054
```

```bash
# Health
curl http://localhost:8054/health
# → {"ok":true,"upstream":"http://your-gemma-host:8006"}

# Model passthrough
curl http://localhost:8054/v1/models

# Chat with a tool that should fire
curl http://localhost:8054/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "gemma-4-31b",
  "messages": [{"role":"user","content":"what time is it in Tokyo?"}],
  "tools": [{
    "type":"function",
    "function": {
      "name":"get_time",
      "description":"return the current time in a named city",
      "parameters": {
        "type":"object",
        "properties":{"city":{"type":"string"}},
        "required":["city"]
      }
    }
  }]
}'
```

You should see a `tool_calls: [...]` array in the response, with `function.name = "get_time"` and `function.arguments` containing `{"city":"Tokyo"}`, even though Gemma 4 itself emitted `<|tool_call|>call:get_time{"city":"Tokyo"}<tool_call|>` in plain text.

| Env var | Default | What it does |
|---|---|---|
| `TOOLTALK_UPSTREAM` | `http://localhost:8006` | Your Gemma 4 server's base URL |
| `TOOLTALK_TIMEOUT` | `600` | Stream timeout in seconds |
| `PORT` | `8054` | tooltalk's listen port |
| `TOOLTALK_MODEL_NAME` | (passthrough) | Override model name in `/v1/models` (optional) |
| `TOOLTALK_LOG_LEVEL` | `INFO` | `DEBUG` to see every translation |

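
For readers wiring these variables into their own tooling, the table can be read as the following settings loader. This is only a sketch of how the defaults fit together; the `Settings` dataclass and `load_settings` function are hypothetical names, not part of tooltalk.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    upstream: str
    timeout: float
    port: int
    model_name: Optional[str]  # None means "pass the upstream model name through"
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    return Settings(
        upstream=env.get("TOOLTALK_UPSTREAM", "http://localhost:8006").rstrip("/"),
        timeout=float(env.get("TOOLTALK_TIMEOUT", "600")),
        port=int(env.get("PORT", "8054")),
        model_name=env.get("TOOLTALK_MODEL_NAME") or None,
        log_level=env.get("TOOLTALK_LOG_LEVEL", "INFO").upper(),
    )


print(load_settings({"TOOLTALK_UPSTREAM": "http://your-gemma-host:8006/"}))
```
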
| Surface | Status |
|---|---|
| `POST /v1/chat/completions` (stream + non-stream) | ✅ |
| `GET /v1/models` (passthrough) | ✅ |
| `GET /health` | ✅ |
| Multi-tool turns (2+ calls back-to-back) | ✅ |
| Tool-call mid-stream-token splitting | ✅ |
| Unquoted JSON keys from Gemma | ✅ |
| Smart-quote normalization | ✅ |
| OpenAI-compatible `id` generation (`call_<8 hex>`) | ✅ |
| `tool_choice: required` enforcement | ✅ |
| `parallel_tool_calls: false` enforcement | ✅ |
| `/v1/embeddings`, `/v1/audio/*` passthrough | ✅ |
| Anthropic `/v1/messages` shape | ❌ Out of scope; use a different shim |


What tooltalk leaves to other layers:

- Authentication. Put a reverse-proxy in front (Caddy, Cloudflare Tunnel, mythos-gate, etc.). tooltalk treats every incoming request as authorized.
- Rate limiting. Same — put it upstream.
- Multi-backend routing. One upstream, one outlet. For multi-backend, point tooltalk at LiteLLM.
- Tool execution. tooltalk only translates the tool-call shape. Executing the tool is the frontend's job (or use mcpo to bridge MCP).
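
Since tool execution stays with the frontend, it may help to see what that side of the exchange looks like once the structured `tool_calls` array arrives. The sketch below follows the OpenAI chat format for tool results; the `get_time` implementation, the `TOOLS` registry, and `run_tool_calls` are hypothetical stand-ins for whatever your frontend actually does.

```python
import json


def get_time(city):
    # Hypothetical tool body; a real one would consult a time zone database.
    return {"city": city, "time": "unknown"}


TOOLS = {"get_time": get_time}


def run_tool_calls(assistant_message, registry=TOOLS):
    """Execute each structured tool call and build the follow-up "tool" messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = registry[call["function"]["name"]]
        kwargs = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result to the call that produced it
            "content": json.dumps(fn(**kwargs)),
        })
    return results
```

The frontend appends the assistant message and these `tool` messages to the conversation and sends the next chat turn through tooltalk as usual.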

| | Stream TTFT | Per-token overhead | Memory |
|---|---|---|---|
| With tooltalk in path | +3 ms p50 | +0.04 ms p50 | 40 MB |
| Without (direct upstream) | — | — | — |

The overhead is one regex scan per chunk plus an occasional `json.loads`. At
~50 tool calls per minute it's invisible. The streaming path uses
`httpx.AsyncClient.stream()`, so chunks pass through with the same
backpressure they had on the upstream connection.
Measured on Ubuntu 22.04, Ryzen 9 7950X, gemma-4-31b-Q8 via llama.cpp at 25 tok/s.

```bash
pip install -r requirements-dev.txt
pytest tests/   # 28 tests, ~0.4 s
```

Tests use a recorded-cassette fixture of real Gemma 4 outputs (`tests/fixtures/`),
so they don't require a live model. To regenerate the fixtures against your
own model, see `tests/RECORD.md`.

| You're using | Plug tooltalk in like |
|---|---|
| Open WebUI | Settings → Connections → OpenAI API: http://tooltalk:8054/v1 |
| LibreChat | librechat.yaml → endpoints → custom: baseURL: http://tooltalk:8054/v1 |
| Lobe Chat | Settings → API: http://tooltalk:8054/v1 |
| LiteLLM as a model entry | api_base: http://tooltalk:8054/v1 |
| A custom React thing | fetch('http://tooltalk:8054/v1/chat/completions', ...) |
| mcpo as the tool executor | tooltalk → frontend → tool call → mcpo → result → next chat turn |
MIT. Fork it, sell it, ship it. The one ask: if you find a
Gemma-4 edge case tooltalk doesn't handle, please open an issue with
the verbatim chunk you saw — that's the whole reason this exists as a
project not a snippet.
This pattern was first ironed out for the Destiny Atelier chat. Pulled out into its own MIT repo because the same translator solves the same problem for everyone else who serves Gemma 4 locally + wants OpenAI-shape tool_calls downstream.
Built on the shoulders of:
- ggml.cpp / llama.cpp — the Gemma 4 server most people use
- Google's Gemma 4 release — the model whose text-format tool calls created the problem
- httpx + FastAPI — the boring-and-correct backbone