Features
- Load models from a direct URL: provide a direct URL to a .gguf model, and Paddler downloads and caches it on the agent
- OpenAI Responses API: new `/v1/responses` endpoint
- Token usage and classification: responses report per-kind token counts, and generated tokens are classified as content, reasoning, tool call, or undeterminable
- Structured tool-call parsing: set `parse_tool_calls` to get parsed tool calls in the response
- New endpoint: Get balancer applicable state
Breaking changes
- Inference parameters: `batch_n_tokens` renamed to `n_batch`
- Inference parameters now require `embedding_batch_size`
- Streamed tokens now arrive under kind-specific keys (`ContentToken`, `ReasoningToken`, `ToolCallToken`, `UndeterminableToken`) instead of `Token`, and the stream ends with a `Done` event carrying token usage
- Minimum supported Rust version is now 1.95.0
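
For clients migrating off the single `Token` key, the streaming change might be handled roughly as in the sketch below. The variant names come from the notes above; everything else is an assumption made for illustration: the payload shapes, the fields of the hypothetical `TokenUsage` struct, and the `collect_stream` helper are not Paddler's actual types, and wire-format decoding is left out entirely.

```rust
// Hypothetical client-side model of the streamed events.
// Variant names follow the release notes; payload shapes are assumptions.
#[derive(Debug, Clone, PartialEq)]
enum StreamEvent {
    ContentToken(String),
    ReasoningToken(String),
    ToolCallToken(String),
    UndeterminableToken(String),
    // The notes only say `Done` carries token usage; this shape is assumed.
    Done(TokenUsage),
}

// Assumed per-kind token counts (illustrative field names).
#[derive(Debug, Clone, Default, PartialEq)]
struct TokenUsage {
    content: usize,
    reasoning: usize,
    tool_call: usize,
    undeterminable: usize,
}

// Text accumulated per token kind, plus the usage reported at the end.
#[derive(Debug, Default, PartialEq)]
struct Collected {
    content: String,
    reasoning: String,
    tool_call: String,
    undeterminable: String,
    usage: Option<TokenUsage>,
}

// Routes each token to the buffer for its kind and stops at `Done`.
fn collect_stream<I>(events: I) -> Collected
where
    I: IntoIterator<Item = StreamEvent>,
{
    let mut out = Collected::default();
    for event in events {
        match event {
            StreamEvent::ContentToken(t) => out.content.push_str(&t),
            StreamEvent::ReasoningToken(t) => out.reasoning.push_str(&t),
            StreamEvent::ToolCallToken(t) => out.tool_call.push_str(&t),
            StreamEvent::UndeterminableToken(t) => out.undeterminable.push_str(&t),
            StreamEvent::Done(usage) => {
                // `Done` terminates the stream; anything after it is ignored.
                out.usage = Some(usage);
                break;
            }
        }
    }
    out
}
```

Code that previously concatenated every `Token` value into one string would instead read `content` for user-visible text and decide separately what to do with reasoning and tool-call tokens. A `usage` of `None` after the loop indicates that the stream ended without a `Done` event.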