karany97/tooltalk

tooltalk

A drop-in middleware that lets local Gemma 4 (and look-alikes) speak
OpenAI-compatible tool_calls to any frontend that expects them.

License: MIT | Built for: local LLMs | Surface: FastAPI


The problem

You serve Gemma 4 31B locally via llama.cpp, vLLM, or LM Studio. Your chat frontend (Open WebUI, LibreChat, Lobe Chat, a custom React thing) expects responses that carry OpenAI's structured tool_calls array:

{ "choices": [{ "message": { "tool_calls": [{
  "id": "call_a1b2",
  "type": "function",
  "function": { "name": "search", "arguments": "{\"query\":\"weather\"}" }
}]}}]}

But Gemma 4 emits tool calls as text inside the assistant message, looking roughly like:

<|tool_call>call:search{"query":"weather"}<tool_call|>

Your frontend sees text, not a tool_calls array. The tool never fires. The chat hangs.

What tooltalk does

It sits between your chat frontend and your Gemma 4 server, transparently rewriting the text-format tool calls into the structured shape your frontend already knows how to render — streaming-safe, quote-tolerant, partial-token-aware.

┌──────────────┐    OpenAI-shape    ┌──────────────┐    OpenAI-shape    ┌─────────────────┐
│  Your chat   │ ──── (in/out) ────▶│   tooltalk   │ ──── (in/out) ────▶│  Gemma 4 server │
│  (OWUI,      │                    │    :8054     │                    │  (llama.cpp,    │
│   LibreChat, │◀───────────────────│              │◀───────────────────│   vLLM, LM      │
│   anything)  │  tool_calls: [...] │  rewrites    │  text-format       │   Studio...)    │
└──────────────┘                    └──────────────┘                    └─────────────────┘

The frontend doesn't know Gemma 4 emits text-format tool calls. The Gemma 4 server doesn't know it's being normalized. Nobody changes their code.
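As a rough illustration of the rewrite step, here is a minimal non-streaming sketch. It assumes the text format shown in the example above (an open tag, `call:`, a function name, a JSON object, a close tag) and accepts the open tag with or without a trailing pipe. The names `TOOL_CALL_RE` and `rewrite_message` are hypothetical and are not taken from tooltalk's source; the sketch only handles strictly valid JSON arguments.

```python
import json
import re
import secrets

# Assumed wire format: <|tool_call>call:NAME{JSON}<tool_call|>
# The open tag is matched with an optional trailing pipe to stay lenient.
TOOL_CALL_RE = re.compile(
    r"<\|tool_call\|?>call:(?P<name>[A-Za-z_][\w.-]*)(?P<args>\{.*?\})<tool_call\|>",
    re.DOTALL,
)


def rewrite_message(content: str) -> dict:
    """Turn assistant text with text-format tool calls into an OpenAI-shape message."""
    calls = []
    for m in TOOL_CALL_RE.finditer(content):
        args = json.loads(m.group("args"))  # strict JSON only in this sketch
        calls.append({
            "id": "call_" + secrets.token_hex(4),  # call_<8 hex>
            "type": "function",
            "function": {
                "name": m.group("name"),
                # OpenAI carries arguments as a JSON string, not an object.
                "arguments": json.dumps(args, separators=(",", ":")),
            },
        })
    if not calls:
        # No tool calls found: hand the text back untouched.
        return {"role": "assistant", "content": content}
    remaining = TOOL_CALL_RE.sub("", content).strip()
    return {"role": "assistant", "content": remaining or None, "tool_calls": calls}
```

Because the pattern is applied with `finditer`, two calls emitted back-to-back in one turn come out as two entries in `tool_calls`.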

Why this exists (and why it isn't on GitHub already)

This pattern runs in production in three separate stacks I know of (all operator-grade, none of them public). Each one wrote its own version because the surface looks trivial, until you have to handle:

  • Streaming SSE chunks that split a tool-call literal across packets
  • Unquoted JSON keys Gemma emits ({query:"x"} instead of {"query":"x"})
  • Smart-quote contamination from the chat-template (“ and ” vs ")
  • Asymmetric pipe placement in Gemma's open/close tags (<|tool_call> vs <tool_call|>)
  • Multi-call turns where Gemma emits 2+ tool calls back-to-back
  • The "no tool calls" path that must be byte-identical pass-through so non-tool turns aren't measurably slower

Get any of those wrong and the chat hangs silently. tooltalk is the result of getting them all right in one place.
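To make the first and last items of that list concrete, here is a hedged sketch of one way to buffer a text stream so that a tool-call literal split across chunks is never forwarded half-finished, while ordinary text passes through unchanged. It works on plain text deltas only (real SSE chunks wrap the text in JSON), and `StreamSplitter`, `held_back`, and the tag constants are hypothetical names, not tooltalk's actual code.

```python
import re

OPEN_RE = re.compile(r"<\|tool_call\|?>")  # tolerate both spellings of the open tag
CLOSE_TAG = "<tool_call|>"
# Every proper prefix of either open-tag spelling is a prefix of this string.
LONGEST_PARTIAL = "<|tool_call|"


def held_back(text: str) -> int:
    """Number of trailing characters that could be the start of an open tag."""
    for n in range(min(len(LONGEST_PARTIAL), len(text)), 0, -1):
        if text.endswith(LONGEST_PARTIAL[:n]):
            return n
    return 0


class StreamSplitter:
    """Separates plain text from tool-call literals across chunk boundaries."""

    def __init__(self) -> None:
        self._buf = ""
        self.calls: list[str] = []  # raw text found between open and close tags

    def feed(self, chunk: str) -> str:
        """Add a chunk and return the text that is safe to forward right now."""
        self._buf += chunk
        out = []
        while True:
            m = OPEN_RE.search(self._buf)
            if m is None:
                # No complete open tag: emit all but a possible partial tag.
                cut = len(self._buf) - held_back(self._buf)
                out.append(self._buf[:cut])
                self._buf = self._buf[cut:]
                break
            out.append(self._buf[:m.start()])
            end = self._buf.find(CLOSE_TAG, m.end())
            if end == -1:
                # Inside an unfinished call: keep it buffered until it closes.
                self._buf = self._buf[m.start():]
                break
            self.calls.append(self._buf[m.end():end])
            self._buf = self._buf[end + len(CLOSE_TAG):]
        return "".join(out)

    def flush(self) -> str:
        """At end of stream, release whatever is still buffered as plain text."""
        rest, self._buf = self._buf, ""
        return rest
```

The design choice here is to hold back only the shortest suffix that could still become an open tag, so a turn with no tool calls is forwarded chunk for chunk.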

Install

Docker (recommended)

docker run --rm -d \
  --name tooltalk \
  -p 8054:8054 \
  -e TOOLTALK_UPSTREAM=http://your-gemma-host:8006 \
  ghcr.io/karany97/tooltalk:latest

From source

git clone https://github.com/karany97/tooltalk.git
cd tooltalk
pip install -r requirements.txt
TOOLTALK_UPSTREAM=http://your-gemma-host:8006 \
  uvicorn app:app --host 0.0.0.0 --port 8054

Verify

# Health
curl http://localhost:8054/health
# → {"ok":true,"upstream":"http://your-gemma-host:8006"}

# Model passthrough
curl http://localhost:8054/v1/models

# Chat with a tool that should fire
curl http://localhost:8054/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "gemma-4-31b",
  "messages": [{"role":"user","content":"what time is it in Tokyo?"}],
  "tools": [{
    "type":"function",
    "function": {
      "name":"get_time",
      "description":"return the current time in a named city",
      "parameters": {
        "type":"object",
        "properties":{"city":{"type":"string"}},
        "required":["city"]
      }
    }
  }]
}'

You should see a tool_calls: [...] array in the response, with function.name = "get_time" and function.arguments containing {"city":"Tokyo"}, even though Gemma 4 itself emitted <|tool_call>call:get_time{"city":"Tokyo"}<tool_call|> as plain text.
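If you would rather check the response programmatically than by eye, a small helper along these lines may be enough. It is a sketch that assumes the standard OpenAI chat-completions response shape; `extract_tool_calls` and the `sample` payload (including its `id`) are made up for illustration.

```python
import json


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (function name, decoded arguments) for each tool call in the first choice."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [
        (c["function"]["name"], json.loads(c["function"]["arguments"]))
        for c in calls
    ]


# Hypothetical response body, shaped like the one the curl command should return.
sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_a1b2c3d4",
                "type": "function",
                "function": {"name": "get_time", "arguments": "{\"city\":\"Tokyo\"}"},
            }],
        }
    }]
}

print(extract_tool_calls(sample))
```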

Configure

| Env var | Default | What it does |
| --- | --- | --- |
| TOOLTALK_UPSTREAM | http://localhost:8006 | Your Gemma 4 server's base URL |
| TOOLTALK_TIMEOUT | 600 | Stream timeout in seconds |
| PORT | 8054 | tooltalk's listen port |
| TOOLTALK_MODEL_NAME | (passthrough) | Override model name in /v1/models (optional) |
| TOOLTALK_LOG_LEVEL | INFO | DEBUG to see every translation |
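For readers wiring the same variables into their own tooling, the table above could be read roughly as follows. This is a sketch under assumptions: the `Settings` dataclass and `load_settings` function are hypothetical, and only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    upstream: str
    timeout: float
    port: int
    model_name: Optional[str]  # None means "pass the upstream model name through"
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented environment variables, falling back to the documented defaults."""
    return Settings(
        upstream=env.get("TOOLTALK_UPSTREAM", "http://localhost:8006"),
        timeout=float(env.get("TOOLTALK_TIMEOUT", "600")),
        port=int(env.get("PORT", "8054")),
        model_name=env.get("TOOLTALK_MODEL_NAME") or None,
        log_level=env.get("TOOLTALK_LOG_LEVEL", "INFO").upper(),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise in tests.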

What's supported

| Surface | Status |
| --- | --- |
| POST /v1/chat/completions (stream + non-stream) | |
| GET /v1/models (passthrough) | |
| GET /health | |
| Multi-tool turns (2+ calls back-to-back) | |
| Tool-call mid-stream-token splitting | |
| Unquoted JSON keys from Gemma | |
| Smart-quote normalization | |
| OpenAI-compatible id generation (call_<8 hex>) | |
| tool_choice: required enforcement | |
| parallel_tool_calls: false enforcement | |
| /v1/embeddings, /v1/audio/* passthrough | |
| Anthropic /v1/messages shape | ❌ Out of scope. Use a different shim |
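Two of the rows above, unquoted JSON keys and smart-quote normalization, lend themselves to a short sketch. The version below tries a strict parse first and only repairs the text if that fails. It is an illustration under assumptions, not tooltalk's implementation: `normalize_arguments` and both constants are hypothetical, and the key-quoting regex can misfire on string values that themselves contain `, word:` patterns.

```python
import json
import re

# Map curly double quotes to straight ones.
SMART_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"'})
# A bare identifier right after "{" or "," and right before ":" is treated as a key.
BARE_KEY_RE = re.compile(r'([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*:)')


def normalize_arguments(raw: str) -> str:
    """Return compact, strictly valid JSON for a tool call's argument object."""
    try:
        # Fast path: already valid JSON, so no repair is attempted.
        return json.dumps(json.loads(raw), separators=(",", ":"))
    except json.JSONDecodeError:
        pass
    fixed = raw.translate(SMART_QUOTES)
    fixed = BARE_KEY_RE.sub(r'\1"\2"\3', fixed)
    # Still raises json.JSONDecodeError if the repair was not enough.
    return json.dumps(json.loads(fixed), separators=(",", ":"))
```

Trying the strict parse first means well-formed arguments are never touched by the regex, which limits the blast radius of the heuristic.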

What's NOT supported (deliberate)

  • Authentication. Put a reverse-proxy in front (Caddy, Cloudflare Tunnel, mythos-gate, etc.). tooltalk treats every incoming request as authorized.
  • Rate limiting. Same: handle it in the reverse proxy in front of tooltalk.
  • Multi-backend routing. One upstream, one outlet. For multi-backend, point tooltalk at LiteLLM.
  • Tool execution. tooltalk only translates the tool-call shape. Executing the tool is the frontend's job (or use mcpo to bridge MCP).
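Since tool execution is left to the frontend, it may help to see what that side typically looks like. The sketch below assumes the standard OpenAI convention of answering each tool call with a `role: "tool"` message that echoes the call's `id`. The `get_time` stub, the `TOOLS` registry, and `run_tool_calls` are hypothetical.

```python
import json
from typing import Callable


def get_time(city: str) -> str:
    """Hypothetical stub standing in for a real tool."""
    return f"12:00 in {city}"


# Registry mapping tool names to callables (illustrative only).
TOOLS: dict[str, Callable[..., str]] = {"get_time": get_time}


def run_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each tool call and build the tool-role messages for the next chat turn."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = TOOLS[call["function"]["name"]]
        kwargs = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": fn(**kwargs),
        })
    return results
```

The returned messages are appended to the conversation after the assistant message, and the whole history is sent back through tooltalk for the next turn.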

Performance

| | Stream TTFT | Per-token overhead | Memory |
| --- | --- | --- | --- |
| With tooltalk in path | +3 ms p50 | +0.04 ms p50 | 40 MB |
| Without (direct upstream) | | | |

The overhead is one regex scan per chunk plus an occasional json.loads. At ~50 tool calls per minute it's invisible. The streaming path uses httpx.AsyncClient.stream() so chunks pass through with the same back-pressure they had on the upstream connection.

Measured on Ubuntu 22.04, Ryzen 9 7950X, gemma-4-31b-Q8 via llama.cpp at 25 tok/s.

Testing

pip install -r requirements-dev.txt
pytest tests/        # 28 tests, ~0.4 s

Tests use a recorded-cassette fixture of real Gemma 4 outputs (tests/fixtures/) so they don't require a live model. To regenerate the fixtures against your own model, see tests/RECORD.md.

Compose with other tools

| You're using | Plug tooltalk in like |
| --- | --- |
| Open WebUI | Settings → Connections → OpenAI API: http://tooltalk:8054/v1 |
| LibreChat | librechat.yaml → endpoints → custom: baseURL: http://tooltalk:8054/v1 |
| Lobe Chat | Settings → API: http://tooltalk:8054/v1 |
| LiteLLM as a model entry | api_base: http://tooltalk:8054/v1 |
| A custom React thing | fetch('http://tooltalk:8054/v1/chat/completions', ...) |
| mcpo as the tool executor | tooltalk → frontend → tool call → mcpo → result → next chat turn |

License

MIT. Fork it, sell it, ship it. The one ask: if you find a Gemma 4 edge case tooltalk doesn't handle, please open an issue with the verbatim chunk you saw. That's the whole reason this exists as a project, not a snippet.

Acknowledgements

This pattern was first ironed out for the Destiny Atelier chat. It was pulled out into its own MIT repo because the same translator solves the same problem for everyone else who serves Gemma 4 locally and wants OpenAI-shape tool_calls downstream.

Built on the shoulders of:

About

Drop-in middleware translating Gemma 4 text-format tool calls into OpenAI structured tool_calls. Streaming-safe, quote-tolerant. 972 LOC FastAPI.
