LocalLLM v1.2.0

@mlnomadpy mlnomadpy released this 13 May 17:20

Stage 1 of the AI roadmap. Two coherent additions to the OpenAI-compatible
HTTP API that both already had runtime support: function/tool calling and
multimodal image input.

Added

Tool / function calling

  • POST /v1/chat/completions accepts OpenAI-shaped tools + tool_choice.
    Each ToolDef (type + function: {name, description, parameters}) is
    wrapped as a LiteRT-LM OpenApiTool (one tool per provider) and threaded
    into ConversationConfig.tools. tool_choice is honored at the gateway:
    "none" strips tools before the conversation is built; "auto" and the
    {type:"function", function:{name:"..."}} object form both pass the full
    set through (LiteRT-LM does not expose a single-tool selector, so the
    object form degrades to "auto").
  • Server returns tool_calls and finish_reason: "tool_calls" when the
    model elects to invoke a function. Each LiteRT-LM ToolCall is translated
    into a ToolCallApi with a stable-ish ID (call_${entry.id}_${index}),
    type: "function", and function: {name, arguments} where arguments is
    the JSON-encoded argument map per the OpenAI contract.
  • Streaming path emits a final delta.tool_calls chunk with
    finish_reason: "tool_calls" instead of "stop" when a tool call lands.
    Text deltas still stream as before for messages that mix text + tool use.
  • Two-turn protocol round-trips correctly. role: "tool" follow-up
    messages with tool_call_id and a serialized content are translated to
    a LiteRT-LM Role.TOOL message carrying a Content.ToolResponse. The
    session-reuse path treats a single new tool turn the same as a single
    new user turn so the KV cache survives the round trip.
  • automaticToolCalling = false on the conversation — the server
    forwards the tool call to the HTTP client rather than executing it
    in-process. (The OpenApiTool.execute shim is implemented defensively to
    return a structured error if the runtime ever tries to auto-call it.)
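
The gateway-side tool_choice handling and the call ID format described above can be sketched as follows. This is a minimal illustration, not the project's code: ToolSketch, ToolChoiceSketch, effectiveTools and toolCallId are hypothetical names, the entry ID is assumed to be printable as a string, and treating a missing tool_choice like "auto" is an assumption borrowed from the OpenAI convention.

```kotlin
// Hypothetical stand-in for the release's ToolDef; only the name is modeled.
data class ToolSketch(val name: String)

// Hypothetical model of the three tool_choice shapes listed in the notes.
sealed interface ToolChoiceSketch {
    object None : ToolChoiceSketch
    object Auto : ToolChoiceSketch
    data class Function(val name: String) : ToolChoiceSketch
}

// "none" strips the tools before the conversation is built. "auto" and the
// object form both pass the full set through, because the runtime has no
// single-tool selector. A null (absent) tool_choice is ASSUMED to mean "auto".
fun effectiveTools(tools: List<ToolSketch>, choice: ToolChoiceSketch?): List<ToolSketch> =
    when (choice) {
        ToolChoiceSketch.None -> emptyList()
        ToolChoiceSketch.Auto, is ToolChoiceSketch.Function, null -> tools
    }

// Follows the call_${entry.id}_${index} pattern quoted in the notes.
fun toolCallId(entryId: String, index: Int): String = "call_${entryId}_${index}"
```

Note that in this sketch the object form returns every tool, not just the named one, which matches the documented degradation to "auto".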

Multimodal image input

  • POST /v1/chat/completions accepts the OpenAI content array with
    {type:"text",...} and {type:"image_url",...} parts. Plain string
    content still works unchanged (polymorphic JsonElement on the wire,
    inspected at the call site).
  • data:image/...;base64,... URLs decode immediately to bytes via
    android.util.Base64. http://localhost(:port)/... URLs are fetched
    via OkHttp with a 5 MB cap, 10s read timeout. Every other scheme — public
    HTTP, file:, custom schemes — is rejected with a 400 for SSRF
    protection.
  • Image downscaling: any image exceeding 1024×1024 is decoded with
    BitmapFactory.inSampleSize and re-encoded as JPEG@85% before being
    handed to LiteRT-LM. Saves prefill time on phone-camera-sized inputs.
  • EngineConfig.visionBackend = Backend.CPU() is now always set.
    This adds a startup cost (hundreds of MB resident, a few hundred ms of
    init) so the first multimodal request doesn't have to rebuild the engine.
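
To make the downscaling step concrete, here is one way a power-of-two sample size could be chosen for the 1024×1024 limit. The notes do not say how the project computes inSampleSize, so sampleSizeFor and its rule (smallest power of two that brings both sides within the limit) are assumptions; the Android decoder may also round the final dimensions slightly differently than integer division does.

```kotlin
// Hypothetical helper: picks the smallest power-of-two sample size such that
// both dimensions, divided by it, fit inside maxSide. Powers of two are used
// because BitmapFactory rounds inSampleSize down to the nearest power of two.
fun sampleSizeFor(width: Int, height: Int, maxSide: Int = 1024): Int {
    require(width > 0 && height > 0 && maxSide > 0) { "dimensions must be positive" }
    var sample = 1
    while (width / sample > maxSide || height / sample > maxSide) {
        sample *= 2
    }
    return sample
}
```

Under this rule a 4032×3024 phone-camera frame gets a sample size of 4 (about 1008×756), while an image already at 1024×1024 is left at 1.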

API types

  • Message.content is now polymorphic (JsonElement?) — string,
    parts array, or null. Backwards-compatible: existing text-only clients
    see no behavior change.
  • New types: ToolDef, FunctionDef, ToolCallApi, ToolCallFunction,
    sealed ContentPart.{TextPart, ImagePart}, plus extension helpers
    Message.contentString(), Message.contentParts(), Message.textChars(),
    and JsonElement.toContentParts().
  • StreamDelta gains an optional tool_calls field for the streaming
    tool-call emission.
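
As a rough picture of the parts model, the sketch below defines a sealed ContentPart hierarchy and a text-only character count. The type names come from the notes, but the field names (text, url) and the choice to hang textChars() on a list of parts rather than on Message are assumptions made to keep the example self-contained.

```kotlin
// Field names are assumptions; the notes only name the types themselves.
sealed class ContentPart {
    data class TextPart(val text: String) : ContentPart()
    data class ImagePart(val url: String) : ContentPart()
}

// Counts characters across text parts only. Image parts are skipped, so a
// large base64 data URL does not count toward a text-length limit.
fun List<ContentPart>.textChars(): Int =
    filterIsInstance<ContentPart.TextPart>().sumOf { it.text.length }
```

A plain string body would correspond to a single TextPart in this model, which is why text-only clients would see the same count as before.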

Changed

  • Prompt-size cap now counts characters across text parts rather than
    the old content.length. Image parts don't contribute to the limit.
  • messagesPrefixHash mixes in tool_call_id and tool_calls so a
    client that swaps a tool turn mid-session correctly invalidates the
    cached conversation.
  • runInferenceBlocking returns a LlmMessage (not just text) so the
    route handler can inspect toolCalls and choose the right finish_reason.
    runInferenceStreaming similarly tracks the last non-empty toolCalls
    snapshot of the Flow.
  • ChatBubble renders a [tool: pending — see API response] placeholder
    for empty assistant messages (defensive — the in-app Chat tab doesn't
    send tools, so this is reachable only when an external client drives
    the local server).
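
The cache-invalidation idea behind mixing tool fields into the prefix hash can be illustrated like this. The notes do not describe the real messagesPrefixHash algorithm, so everything here is an assumption: MsgSketch, prefixHashSketch, the use of SHA-256, and the length-prefixed field encoding are illustrative choices only.

```kotlin
import java.nio.ByteBuffer
import java.security.MessageDigest

// Hypothetical, flattened view of a chat message for hashing purposes.
data class MsgSketch(
    val role: String,
    val text: String? = null,
    val toolCallId: String? = null,
    val toolCallsJson: String? = null,
)

// Length-prefix each field so that field boundaries cannot be confused,
// and encode null as length -1 so it differs from an empty string.
private fun MessageDigest.putField(value: String?) {
    val bytes = value?.toByteArray(Charsets.UTF_8)
    update(ByteBuffer.allocate(4).putInt(bytes?.size ?: -1).array())
    if (bytes != null) update(bytes)
}

// Hashes role, text, tool_call_id and tool_calls of every message in order.
fun prefixHashSketch(messages: List<MsgSketch>): String {
    val digest = MessageDigest.getInstance("SHA-256")
    for (m in messages) {
        digest.putField(m.role)
        digest.putField(m.text)
        digest.putField(m.toolCallId)
        digest.putField(m.toolCallsJson)
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}
```

With the tool fields included, two histories that differ only in a tool turn hash differently, so a reused session would not be matched against the wrong prefix.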

Fixed during the v1.2.0 cycle

  • automaticToolCalling = false is now passed explicitly. LiteRT-LM
    0.11.0's 4-arg ConversationConfig overload defaults this to true,
    not false as the initial Stage 1 implementation assumed. The runtime
    was auto-executing our OpenApiTool.execute() stub instead of
    surfacing the tool call to the HTTP client. The OpenAI contract is
    "model emits tool_calls, client executes, client sends a role:tool
    follow-up" — and that round-trip now works as designed.
  • Better ChatRequest parse-error logging. Root-cause exception
    class + message surface in the 400 response and via LogManager.e
    instead of Ktor's opaque "Failed to convert request body".
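
Surfacing the root cause of a wrapped parse failure might look roughly like the following. rootCauseSummary is a hypothetical helper, not the project's function, and the "Class: message" output format is an assumption.

```kotlin
// Walks the cause chain to the innermost exception and summarizes it as
// "SimpleClassName: message". The self-reference check avoids spinning on a
// cause that points back at the same exception.
fun rootCauseSummary(error: Throwable): String {
    var root = error
    while (true) {
        val next = root.cause ?: break
        if (next === root) break
        root = next
    }
    return "${root.javaClass.simpleName}: ${root.message}"
}
```

A summary string like this could be placed in the 400 response body and passed to the logger, in place of the outer wrapper's generic message.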

Extracted helpers

  • MessageHelpers.kt collects six pure top-level functions
    (messagesPrefixHash, isLoopbackHttpUrl, decodeDataImageUrl,
    parseToolArguments, jsonToAny, buildToolDescriptionJson) extracted
    from LLMServerService.kt so they're independently unit-testable on
    the JVM without spinning up the Service or LiteRT-LM JNI.
  • Fixed an IPv6 bracket-notation bug in isLoopbackHttpUrl discovered
    via the new tests — http://[::1]/img was previously mis-rejected.
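
A loopback check that handles IPv6 bracket notation could be sketched as below. This is not the project's isLoopbackHttpUrl: the function name, the reliance on java.net.URI, the accepted host list (including 127.0.0.1), and the rejection of https are all assumptions. The detail it illustrates is that java.net.URI returns an IPv6 host with its brackets, so comparing the raw host against "::1" fails unless the brackets are stripped first.

```kotlin
import java.net.URI
import java.net.URISyntaxException

// Hypothetical sketch: accepts only plain-http URLs whose host is a loopback
// name or literal. Long-form IPv6 loopback spellings are not normalized here.
fun isLoopbackHttpUrlSketch(url: String): Boolean {
    val uri = try {
        URI(url)
    } catch (e: URISyntaxException) {
        return false
    }
    if (!uri.scheme.equals("http", ignoreCase = true)) return false
    val host = uri.host ?: return false
    // URI.getHost() yields "[::1]" for IPv6 literals, so drop the brackets.
    val bare = host.removePrefix("[").removeSuffix("]")
    return bare.equals("localhost", ignoreCase = true) ||
        bare == "127.0.0.1" ||
        bare == "::1"
}
```

Comparing the parsed host rather than matching on a string prefix also keeps hosts such as localhost.example.com from slipping through.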

Tests

  • ApiTypesTest.kt — pure-JVM Gson round-trip tests for both
    polymorphic content shapes (string + parts array), null content on
    tool-call assistant messages, tool follow-up turns, tools +
    tool_choice envelope deserialization, tool-call response shape. 11
    cases. Verifies the v1.1.0 text-only request contract is preserved
    byte-for-byte.
  • MessageHelpersTest.kt — 25 cases covering the extracted helpers.
    Total project test count is now 77, all green.

End-to-end verification on Pixel 6 (Tensor G1, CPU backend)

  • Tool calling: round 1 emits finish_reason: "tool_calls" +
    tool_calls[0].function.name = "get_weather"; round 2 with a
    role: "tool" follow-up produces a natural-language answer using the
    injected result.
  • Multimodal image: 3.2 KB JPEG → vision encoder → text description
    correctly identifying the colors and overlaid text.