tooltalk

A drop-in middleware that lets local Gemma 4 (and look-alikes) speak
OpenAI-compatible `tool_calls` to any frontend that expects them.

The problem

You serve Gemma 4 31B locally via llama.cpp, vLLM, or LM Studio. Your chat frontend (Open WebUI, LibreChat, Lobe Chat, a custom React thing) expects responses that carry OpenAI's structured tool_calls array:

{ "choices": [{ "message": { "tool_calls": [{
  "id": "call_a1b2",
  "type": "function",
  "function": { "name": "search", "arguments": "{\"query\":\"weather\"}" }
}]}}]}

But Gemma 4 emits tool calls as text inside the assistant message, looking roughly like:

<|tool_call>call:search{"query":"weather"}<tool_call|>

Your frontend sees text, not a tool_calls array. The tool never fires. The chat hangs.

What tooltalk does

It sits between your chat frontend and your Gemma 4 server, transparently rewriting the text-format tool calls into the structured shape your frontend already knows how to render — streaming-safe, quote-tolerant, partial-token-aware.

┌──────────────┐    OpenAI-shape    ┌──────────────┐    OpenAI-shape    ┌─────────────────┐
│  Your chat   │ ──── (in/out) ────▶│   tooltalk   │ ──── (in/out) ────▶│  Gemma 4 server │
│  (OWUI,      │                    │    :8054     │                    │  (llama.cpp,    │
│   LibreChat, │◀───────────────────│              │◀───────────────────│   vLLM, LM      │
│   anything)  │  tool_calls: [...] │  rewrites    │  text-format       │   Studio...)    │
└──────────────┘                    └──────────────┘                    └─────────────────┘

The frontend doesn't know Gemma 4 emits text-format tool calls. The Gemma 4 server doesn't know it's being normalized. Nobody changes their code.
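
For the non-streaming case, the rewrite described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in app.py: the names TOOL_CALL_RE and translate_message are hypothetical, the tag shape is assumed from the examples in this README, and the arguments are assumed to already be strict JSON.

```python
import json
import re
import uuid

# Assumed tag shape: <|tool_call>call:NAME{...}<tool_call|>. The optional
# trailing pipe on the opener also accepts a <|tool_call|> variant.
TOOL_CALL_RE = re.compile(
    r"<\|tool_call\|?>call:(?P<name>[\w.-]+)(?P<args>\{.*?\})<tool_call\|>",
    re.DOTALL,
)


def translate_message(content: str) -> dict:
    """Turn assistant text with embedded tool calls into an OpenAI-shape message."""
    tool_calls = []
    for match in TOOL_CALL_RE.finditer(content):
        # Re-serialize so "arguments" is always a compact, valid JSON string.
        args = json.loads(match.group("args"))
        tool_calls.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {
                "name": match.group("name"),
                "arguments": json.dumps(args, separators=(",", ":")),
            },
        })
    if not tool_calls:
        # No tool calls: hand the text back untouched.
        return {"role": "assistant", "content": content}
    remaining = TOOL_CALL_RE.sub("", content).strip()
    return {"role": "assistant", "content": remaining or None, "tool_calls": tool_calls}
```

Because the pattern is non-greedy and anchored on the close tag, two calls emitted back-to-back come out as two separate entries in tool_calls.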

Why this exists (and why it isn't on GitHub already)

This pattern is in production in three separate stacks I know of (all operator-grade, none of them public). Each team wrote its own version because the surface looks trivial, until you have to handle:

Streaming SSE chunks that split a tool-call literal across packets
Unquoted JSON keys Gemma emits ({query:"x"} instead of {"query":"x"})
Smart-quote contamination from the chat template (“ and ” vs ")
Asymmetric pipe placement in Gemma's open/close tags (<|tool_call> vs <tool_call|>)
Multi-call turns where Gemma emits 2+ tool calls back-to-back
The "no tool calls" path that must be byte-identical pass-through so non-tool turns aren't measurably slower

Get any of those wrong and the chat hangs silently. tooltalk is the result of getting them all right in one place.
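
Two of the cases above, unquoted keys and smart quotes, can be handled by a repair step that only runs when strict JSON parsing fails. The sketch below is one way to do that and is not taken from tooltalk's source; parse_arguments, SMART_QUOTES, and BARE_KEY_RE are hypothetical names. The key-quoting regex is deliberately simple, so a string value that itself contains text like ", word:" could be altered on the repair path.

```python
import json
import re

# Curly double quotes mapped to their ASCII equivalent.
SMART_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"'})

# A bare identifier directly after "{" or "," and directly before ":".
BARE_KEY_RE = re.compile(r'([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*:)')


def parse_arguments(raw: str) -> dict:
    """Parse a tool-call argument blob, repairing it only when strict JSON fails."""
    try:
        # Well-formed input takes this path and is never modified.
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    repaired = raw.translate(SMART_QUOTES)
    repaired = BARE_KEY_RE.sub(r'\1"\2"\3', repaired)
    # Still raises json.JSONDecodeError if the blob cannot be repaired.
    return json.loads(repaired)
```

Trying strict parsing first means a legitimate curly quote inside a correctly quoted string value is preserved rather than normalized away.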

Install

Docker (recommended)

docker run --rm -d \
  --name tooltalk \
  -p 8054:8054 \
  -e TOOLTALK_UPSTREAM=http://your-gemma-host:8006 \
  ghcr.io/karany97/tooltalk:latest

From source

git clone https://github.com/karany97/tooltalk.git
cd tooltalk
pip install -r requirements.txt
TOOLTALK_UPSTREAM=http://your-gemma-host:8006 \
  uvicorn app:app --host 0.0.0.0 --port 8054

Verify

# Health
curl http://localhost:8054/health
# → {"ok":true,"upstream":"http://your-gemma-host:8006"}

# Model passthrough
curl http://localhost:8054/v1/models

# Chat with a tool that should fire
curl http://localhost:8054/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "gemma-4-31b",
  "messages": [{"role":"user","content":"what time is it in Tokyo?"}],
  "tools": [{
    "type":"function",
    "function": {
      "name":"get_time",
      "description":"return the current time in a named city",
      "parameters": {
        "type":"object",
        "properties":{"city":{"type":"string"}},
        "required":["city"]
      }
    }
  }]
}'

You should see a tool_calls: [...] array in the response, with function.name = "get_time" and function.arguments holding the JSON string {"city":"Tokyo"}, even though Gemma 4 itself emitted <|tool_call>call:get_time{"city":"Tokyo"}<tool_call|> in plain text.
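
If you would rather check the response programmatically, note that in the OpenAI shape function.arguments is a JSON-encoded string, not an object, so it needs a second decode. A small helper along these lines (extract_tool_calls is a hypothetical name, not part of tooltalk) makes that explicit:

```python
import json


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (name, decoded arguments) pairs from a chat-completions response."""
    message = response["choices"][0]["message"]
    # "tool_calls" may be missing or null on turns where no tool fired.
    calls = message.get("tool_calls") or []
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]
```

An empty list from this helper on the request above would suggest the translation did not happen and the call is still sitting in message.content as text.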

Configure

Env var	Default	What it does
`TOOLTALK_UPSTREAM`	`http://localhost:8006`	Your Gemma 4 server's base URL
`TOOLTALK_TIMEOUT`	`600`	Stream timeout in seconds
`PORT`	`8054`	tooltalk's listen port
`TOOLTALK_MODEL_NAME`	`(passthrough)`	Override model name in `/v1/models` (optional)
`TOOLTALK_LOG_LEVEL`	`INFO`	`DEBUG` to see every translation
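
As an illustration of how these variables and defaults fit together, here is a sketch of a settings loader. It is an assumption about shape only: the Settings class and load_settings function are hypothetical, and stripping a trailing slash from the upstream URL is a convenience added here, not documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    upstream: str
    timeout: float
    port: int
    model_name: Optional[str]
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented env vars, falling back to the documented defaults."""
    return Settings(
        # Trailing slash removed so paths like "/v1/models" can be appended safely.
        upstream=env.get("TOOLTALK_UPSTREAM", "http://localhost:8006").rstrip("/"),
        timeout=float(env.get("TOOLTALK_TIMEOUT", "600")),
        port=int(env.get("PORT", "8054")),
        # Unset or empty means "pass the upstream model name through".
        model_name=env.get("TOOLTALK_MODEL_NAME") or None,
        log_level=env.get("TOOLTALK_LOG_LEVEL", "INFO").upper(),
    )
```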

What's supported

Surface	Status
`POST /v1/chat/completions` (stream + non-stream)	✅
`GET /v1/models` (passthrough)	✅
`GET /health`	✅
Multi-tool turns (2+ calls back-to-back)	✅
Tool-call mid-stream-token splitting	✅
Unquoted JSON keys from Gemma	✅
Smart-quote normalization	✅
OpenAI-compatible `id` generation (`call_<8 hex>`)	✅
`tool_choice: required` enforcement	✅
`parallel_tool_calls: false` enforcement	✅
`/v1/embeddings`, `/v1/audio/*` passthrough	✅
Anthropic `/v1/messages` shape	❌ Out of scope — use a different shim
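
The table lists enforcement of tool_choice: required and parallel_tool_calls: false without saying how it is applied. One plausible reading, sketched here as an assumption rather than a description of tooltalk's actual behavior, is to reject a turn with no calls when a call is required, and to keep only the first call when parallel calls are disabled. enforce_request_flags and ToolChoiceError are hypothetical names.

```python
class ToolChoiceError(ValueError):
    """Raised when tool_choice is "required" but the model called no tool."""


def enforce_request_flags(request: dict, tool_calls: list[dict]) -> list[dict]:
    """Apply the request's tool flags to the calls parsed from the model output."""
    if request.get("tool_choice") == "required" and not tool_calls:
        raise ToolChoiceError("tool_choice is 'required' but no tool call was emitted")
    # parallel_tool_calls defaults to true in the OpenAI API, so only an
    # explicit False truncates the list to a single call.
    if request.get("parallel_tool_calls") is False:
        return tool_calls[:1]
    return tool_calls
```

Whether a violated "required" should surface as an error or trigger a retry against the upstream is a design choice this sketch leaves open.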

What's NOT supported (deliberate)

Authentication. Put a reverse-proxy in front (Caddy, Cloudflare Tunnel, mythos-gate, etc.). tooltalk treats every incoming request as authorized.
Rate limiting. Same — put it upstream.
Multi-backend routing. One upstream, one outlet. For multi-backend, point tooltalk at LiteLLM.
Tool execution. tooltalk only translates the tool-call shape. Executing the tool is the frontend's job (or use mcpo to bridge MCP).
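
To make the division of labor concrete: once tooltalk has produced a structured tool_calls array, the frontend runs each tool and sends the results back as role "tool" messages on the next turn. The sketch below shows that frontend-side step under the standard OpenAI message convention; run_tool_calls and the registry mapping are hypothetical, and real frontends add error handling around the tool invocation.

```python
import json
from typing import Callable

ToolFn = Callable[..., object]


def run_tool_calls(assistant_message: dict, registry: dict[str, ToolFn]) -> list[dict]:
    """Execute each tool call and return the messages to append for the next turn."""
    # The assistant message that requested the tools must precede the results.
    follow_up = [assistant_message]
    for call in assistant_message.get("tool_calls") or []:
        fn = registry[call["function"]["name"]]
        kwargs = json.loads(call["function"]["arguments"])
        result = fn(**kwargs)
        follow_up.append({
            "role": "tool",
            # Ties the result back to the call id generated by the translator.
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return follow_up
```

The returned list is appended to the conversation's messages before the next POST to /v1/chat/completions.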

Performance

	Stream TTFT	Per-token overhead	Memory
With tooltalk in path	+3 ms p50	+0.04 ms p50	40 MB
Without (direct upstream)	—	—	—

The overhead is one regex scan per chunk plus an occasional json.loads. At ~50 tool calls per minute it's invisible. The streaming path uses httpx.AsyncClient.stream() so chunks pass through with the same back-pressure they had on the upstream connection.

Measured on Ubuntu 22.04, Ryzen 9 7950X, gemma-4-31b-Q8 via llama.cpp at 25 tok/s.

Testing

pip install -r requirements-dev.txt
pytest tests/        # 28 tests, ~0.4 s

Tests use a recorded-cassette fixture of real Gemma 4 outputs (tests/fixtures/) so they don't require a live model. To regenerate the fixtures against your own model, see tests/RECORD.md.

Compose with other tools

You're using	Plug tooltalk in like
Open WebUI	Settings → Connections → OpenAI API: `http://tooltalk:8054/v1`
LibreChat	`librechat.yaml` → endpoints → custom: `baseURL: http://tooltalk:8054/v1`
Lobe Chat	Settings → API: `http://tooltalk:8054/v1`
LiteLLM as a model entry	`api_base: http://tooltalk:8054/v1`
A custom React thing	`fetch('http://tooltalk:8054/v1/chat/completions', ...)`
mcpo as the tool executor	tooltalk → frontend → tool call → mcpo → result → next chat turn

License

MIT. Fork it, sell it, ship it. The one ask: if you find a Gemma 4 edge case tooltalk doesn't handle, please open an issue with the verbatim chunk you saw. That's the whole reason this exists as a project rather than a snippet.

Acknowledgements

This pattern was first ironed out for the Destiny Atelier chat. It was pulled out into its own MIT repo because the same translator solves the same problem for everyone else who serves Gemma 4 locally and wants OpenAI-shape tool_calls downstream.

Built on the shoulders of:

ggml / llama.cpp: the Gemma 4 server most people use
Google's Gemma 4 release: the model whose text-format tool calls created the problem
httpx + FastAPI: the boring-and-correct backbone
