Skip to content

Releases: mlnomadpy/localllm

LocalLLM v1.2.0

13 May 17:20

Choose a tag to compare

Stage 1 of the AI roadmap. Two coherent additions to the OpenAI-compatible
HTTP API that both already had runtime support: function/tool calling and
multimodal image input.

Added

Tool / function calling

  • POST /v1/chat/completions accepts OpenAI-shaped tools + tool_choice.
    Each ToolDef (type + function: {name, description, parameters}) is
    wrapped as a LiteRT-LM OpenApiTool (one tool per provider) and threaded
    into ConversationConfig.tools. tool_choice is honored at the gateway:
    "none" strips tools before the conversation is built; "auto" and the
    {type:"function", function:{name:"..."}} object form both pass the full
    set through (LiteRT-LM does not expose a single-tool selector, so the
    object form degrades to "auto").
  • Server returns tool_calls and finish_reason: "tool_calls" when the
    model elects to invoke a function. Each LiteRT-LM ToolCall is translated
    into a ToolCallApi with a stable-ish ID (call_${entry.id}_${index}),
    type: "function", and function: {name, arguments} where arguments is
    the JSON-encoded argument map per the OpenAI contract.
  • Streaming path emits a final delta.tool_calls chunk with
    finish_reason: "tool_calls" instead of "stop" when a tool call lands.
    Text deltas still stream as before for messages that mix text + tool use.
  • Two-turn protocol round-trips correctly. role: "tool" follow-up
    messages with tool_call_id and a serialized content are translated to
    a LiteRT-LM Role.TOOL message carrying a Content.ToolResponse. The
    session-reuse path treats a single new tool turn the same as a single
    new user turn so the KV cache survives the round trip.
  • automaticToolCalling = false on the conversation — the server
    forwards the tool call to the HTTP client rather than executing it
    in-process. (The OpenApiTool.execute shim is implemented defensively to
    return a structured error if the runtime ever tries to auto-call it.)

Multimodal image input

  • POST /v1/chat/completions accepts the OpenAI content array with
    {type:"text",...} and {type:"image_url",...} parts. Plain string
    content still works unchanged (polymorphic JsonElement on the wire,
    inspected at the call site).
  • data:image/...;base64,... URLs decode immediately to bytes via
    android.util.Base64. http://localhost(:port)/... URLs are fetched
    via OkHttp with a 5 MB cap, 10s read timeout. Every other scheme — public
    HTTP, file:, custom schemes — is rejected with a 400 for SSRF
    protection.
  • Image downscaling: any image exceeding 1024×1024 is decoded with
    BitmapFactory.inSampleSize and re-encoded as JPEG@85% before being
    handed to LiteRT-LM. Saves prefill time on phone-camera-sized inputs.
  • EngineConfig.visionBackend = Backend.CPU() is now always set.
    Adds a small startup cost (~hundreds of MB resident, a few hundred ms
    init) so the first multimodal request doesn't have to rebuild the engine.

API types

  • Message.content is now polymorphic (JsonElement?) — string,
    parts array, or null. Backwards-compatible: existing text-only clients
    see no behavior change.
  • New types: ToolDef, FunctionDef, ToolCallApi, ToolCallFunction,
    sealed ContentPart.{TextPart, ImagePart}, plus extension helpers
    Message.contentString(), Message.contentParts(), Message.textChars(),
    and JsonElement.toContentParts().
  • StreamDelta gains an optional tool_calls field for the streaming
    tool-call emission.

Changed

  • Prompt-size cap now counts characters across text parts rather than
    the old content.length. Image parts don't contribute to the limit.
  • messagesPrefixHash mixes in tool_call_id and tool_calls so a
    client that swaps a tool turn mid-session correctly invalidates the
    cached conversation.
  • runInferenceBlocking returns a LlmMessage (not just text) so the
    route handler can inspect toolCalls and choose the right finish_reason.
    runInferenceStreaming similarly tracks the last non-empty toolCalls
    snapshot of the Flow.
  • ChatBubble renders a [tool: pending — see API response] placeholder
    for empty assistant messages (defensive — the in-app Chat tab doesn't
    send tools, so this is reachable only when an external client drives
    the local server).

Fixed during the v1.2.0 cycle

  • automaticToolCalling = false is now passed explicitly. LiteRT-LM
    0.11.0's 4-arg ConversationConfig overload defaults this to true,
    not false as the initial Stage 1 implementation assumed. The runtime
    was auto-executing our OpenApiTool.execute() stub instead of
    surfacing the tool call to the HTTP client. The OpenAI contract is
    "model emits tool_calls, client executes, client sends a role:tool
    follow-up" — and that round-trip now works as designed.
  • Better ChatRequest parse-error logging. Root-cause exception
    class + message surface in the 400 response and via LogManager.e
    instead of Ktor's opaque "Failed to convert request body".

Extracted helpers

  • MessageHelpers.kt collects five pure top-level functions
    (messagesPrefixHash, isLoopbackHttpUrl, decodeDataImageUrl,
    parseToolArguments, jsonToAny, buildToolDescriptionJson) extracted
    from LLMServerService.kt so they're independently unit-testable on
    the JVM without spinning up the Service or LiteRT-LM JNI.
  • Fixed an IPv6 bracket-notation bug in isLoopbackHttpUrl discovered
    via the new tests — http://[::1]/img was previously mis-rejected.

Tests

  • ApiTypesTest.kt — pure-JVM Gson round-trip tests for both
    polymorphic content shapes (string + parts array), null content on
    tool-call assistant messages, tool follow-up turns, tools +
    tool_choice envelope deserialization, tool-call response shape. 11
    cases. Verifies the v1.1.0 text-only request contract is preserved
    byte-for-byte.
  • MessageHelpersTest.kt — 25 cases covering the extracted helpers.
    Total project test count is now 77, all green.

End-to-end verification on Pixel 6 (Tensor G1, CPU backend)

  • Tool calling: round 1 emits finish_reason: "tool_calls" +
    tool_calls[0].function.name = "get_weather"; round 2 with a
    role: "tool" follow-up produces a natural-language answer using the
    injected result.
  • Multimodal image: 3.2 KB JPEG → vision encoder → text description
    correctly identifying the colors and overlaid text.

LocalLLM v1.1.0

13 May 01:25

Choose a tag to compare

Production-readiness pass + tab-by-tab UX overhaul. Inference layer is
unchanged; this is all the operational and visual scaffolding around it.

Added

Release & build

  • R8 + resource shrinking on the release buildType. ProGuard rules
    already covered LiteRT-LM, Ktor, Netty, Gson, Compose — no new keep
    rules surfaced. :app:assembleRelease and :app:bundleRelease both
    green.
  • Per-ABI APK splits. arm64-v8a only (LiteRT-LM 0.11.0 ships JNI
    .so files for arm64-v8a + x86_64 only — no armeabi-v7a). The
    arm64-v8a release APK is ~28 MB, the universal is ~39 MB, the
    .aab is ~33 MB.
  • signingConfigs.release reading from ~/.gradle/gradle.properties
    or environment (LOCALLLM_KEYSTORE_PATH / _PASSWORD / _ALIAS /
    _PASSWORD). Gracefully falls back to the debug signing key when any
    of the four is missing — so contributors run :app:assembleRelease
    without needing the production keystore.
  • scripts/release.sh — one-command release: assembleDebug + mkdocs
    gh-deploy + tag + push + gh release create with notes scraped from
    this CHANGELOG.
  • .github/workflows/docs.yml — Material site build + Pages deploy,
    triggered on docs/ / mkdocs.yml changes. Pages currently sourced
    from the gh-pages branch (legacy mode) because GitHub Actions is
    administratively restricted on the hosting account — the workflow
    auto-resumes once Actions is re-enabled.
  • .github/dependabot.yml — weekly Monday updates for gradle,
    github-actions, and the pip-based docs requirements; Compose / Kotlin /
    Ktor each in their own update group; LiteRT-LM explicitly pinned
    (manual bumps only — model-side smoke test required).

Performance & lifecycle

  • Baseline Profiles via a new :macrobenchmark module
    (com.android.test + androidx.baselineprofile). StartupBenchmark
    measures cold-start under CompilationMode.None / Partial / Full;
    BaselineProfileGenerator walks Catalog → Dashboard → Console → Chat
    → Settings. Run on a device with
    ./gradlew :app:generateReleaseBaselineProfile.
  • onTrimMemory engine eviction. RUNNING_LOW / MODERATE shrinks
    the engine LRU to 1; RUNNING_CRITICAL / COMPLETE evicts everything.
    Both gated by inferenceMutex.tryLock so eviction never interrupts
    an active request.
  • Lifecycle.Event.ON_START re-kick in MainActivity. If the OS
    killed the foreground service while the Activity was backgrounded
    and autostart is on, the service comes back up the next time the
    user returns to the app.
  • START_STICKY contract documented on onStartCommand.

AUTO backend with real fallback

  • AUTO now tries Backend.GPU first; on Engine.initialize() failure
    (the common case on stock Pixel images missing libvndksupport.so),
    logs a warning and rebuilds on Backend.CPU. Explicit CPU / GPU
    selections stay strict (no fallback) so the user can debug them.

  • New engines array in GET /health surfaces the backend each
    cached engine actually initialized on:

    "engines": [
      { "key": "gemma-4-e2b_model_AUTO", "backend": "CPU" }
    ]

Settings layer

  • SettingsRepository backed by androidx.datastore.preferences: 1.1.1 with SharedPreferencesMigration("settings") so existing prefs
    carry over. Compose UI observes StateFlows instead of re-reading
    SharedPreferences on every recomposition (slider drag was triggering
    ~60 disk reads/sec before).
  • Public Settings.xxx(context) API preserved byte-for-byte — every
    existing caller (LLMServerService, BootReceiver, etc.) keeps working
    unchanged.

Debug-build hygiene

  • StrictMode thread + VM policies installed under BuildConfig. DEBUG. detectDiskReads / detectDiskWrites / detectNetwork / detectLeakedClosableObjects / detectActivityLeaks, all with
    penaltyLog only — never penaltyDeath.

Catalog tab (UX overhaul)

  • LinearProgressIndicator with "X.X MB / Y.Y GB" subtitle and
    inline Cancel (Icons.Outlined.Close) — replaces the text-only
    percentage.
  • SHA-256 verified badge (Icons.Outlined.Verified for built-ins
    with a known hash; Icons.Outlined.Info for custom URLs).
  • File size + last-used relative time on installed models, via
    Formatter.formatShortFileSize and DateUtils.getRelativeTimeSpanString.
  • "Get started" hero card when nothing is installed yet.
  • OutlinedCard hierarchy with proper M3 spacing, icons on every
    action (Download, Delete, UploadFile, Close).

Chat tab (markdown + visual polish)

  • MarkdownText composable backed by org.commonmark:commonmark: 0.22.0 — renders assistant messages with code blocks, lists (capped
    at depth 2), inline code, headings, bold/italic, block quotes, and
    links. Code blocks have a copy-to-clipboard icon. No WebView.
  • Bubble overhaul: role icons (Icons.Outlined.Person /
    Icons.Outlined.AutoAwesome), right-aligned timestamps, asymmetric
    rounded corners, 90% max-width, primaryContainer vs surfaceVariant
    backgrounds.
  • Streaming reveal animation: Animatable fades trailing delta
    characters from 0.5α to full opacity over tween(200ms). Swaps to
    MarkdownText rendering once streaming completes.
  • Empty-state hero: Icons.Outlined.AutoAwesome 56dp + title + body
    • 4 AssistChip sample prompts. Tap a chip to fill the input — never
      auto-sends.
  • Send / Stop buttons get icons (AutoMirrored.Outlined.Send,
    Icons.Outlined.Stop with errorContainer colors).
  • UiMessage.timestampMs field added (default-valued, backwards
    compatible).

Settings tab (restructure + Pixel-6 awareness)

  • Six collapsible domain sections with leading icons: Server
    (Dns, expanded by default), Inference (Memory), Security (Lock),
    Background (Battery5Bar), Limits (Speed), Startup
    (PowerSettingsNew). Animated chevron rotation.
  • Per-row Help expandables (Icons.Outlined.HelpOutline) — tap to
    toggle inline description without crowding the surface.
  • Backend description rewrite: removed the old MediaPipe / Pixel 10 /
    Tensor G5 / "NPU auto" claims. New copy describes AUTO as
    GPU-first-then-CPU fallback, CPU as ~6–12 tok/s on Pixel-class
    hardware for Gemma 4 E2B, GPU as strict-no-fallback. The selected
    mode's line gets a primary-container-tinted background.
  • Chipset hint above the backend selector, driven by Build.SOC_MODEL
    (API 31+). Renders "Your device: Pixel 6 (Tensor). GPU delegate often
    fails; AUTO will fall back to CPU."
    on Tensor SoCs (gs101+),
    "…(Snapdragon). NPU variant .litertlm files in the catalog should
    work."
    on Snapdragon, otherwise "AUTO is the safe choice."
  • Port-in-use validator: ServerSocket(port).also{close} attempt
    500 ms after the port field changes. On IOException the field
    shows error-tinted helper text without blocking save.

Dashboard tab

  • 2×2 stat-card grid with leading icons: Total / Avg latency / Avg
    tok/s / Error rate. Error rate severity-colored (green <1%, amber <5%,
    error >5%).
  • Tok/s sparkline via pure Compose Canvas — no chart library
    added. Catmull-Rom → cubic Bezier smoothing, 20%-alpha fill under
    the line, max-Y label top-right. Handles empty history / single
    point / NaN / all-zeros cleanly.
  • Promoted in-flight card with rotating Icons.Outlined.Bolt and
    indeterminate LinearProgressIndicator. Collapses to "Idle" with
    Icons.Outlined.Pause when nothing is running.
  • Status-icon history rows: CheckCircle / Cancel / Error
    leading icons. Tap to expand and see full request details inline.

Console tab

  • Debounced search (300 ms via snapshotFlow + debounce) with
    Icons.Outlined.Search leading icon and Icons.Outlined.Close
    clear-query trailing icon.
  • Level FilterChips (DEBUG / INFO / WARN / ERROR) — each chip's
    leading dot is colored to match its corresponding log-level text
    color.
  • Top-5 tag FilterChips parsed from [tag] message prefixes, with
    a "More…" overflow dropdown when the buffer has more than 5 distinct
    tags.
  • Auto-scroll toggle (Icons.Outlined.VerticalAlignBottom).
  • Color-coded log lines by level.
  • Long-press copy writes the full [time] LEVEL message line to
    the clipboard with a "Copied" toast.
  • "No matching log entries" empty state with a "Clear filters"
    TextButton.

Chrome restructure (header + tabs + theme)

  • Scaffold layout replacing the bespoke Column { Header + ScrollableTabRow + Box }. The old 2-row LIVE banner (~120dp of
    vertical chrome) is gone.
  • Compact CenterAlignedTopAppBar (56dp): status dot in the
    leading slot, middle-ellipsized URL as the title, context-aware
    trailing actions (Tune + Refresh on Chat tab; Copy URL elsewhere).
  • Top ScrollableTabRow → bottom NavigationBar with proper M3
    icons (FolderOpen / BarChart / Terminal /
    AutoMirrored.Outlined.Chat / Settings). Better one-handed reach
    on a 6.4" phone, more content above the fold.
  • Palette overhaul: primary desaturated #4ECDC4 → #6BD3CC,
    full M3 surface tonal scale (background #0E1113, surface
    #14181A, surfaceVariant #222729), brand teal reserved for the
    status dot, primary CTAs, progress, and user-message bubbles.
    WCAG-AA contrast verified.
  • Header.kt trimmed to a StatusDot(status) helper used by the
    app bar's leading slot.
  • Chat bubble redo: assistant messages are now borderless
    full-bleed text with a 3dp primary-tinted left rail (no card
    outline); user messages are tighter right-aligned pills (80%
    max-width, 20dp radius). Role icons removed — alignment + tint
    carry the signal.
  • Chat input row: rounded Surface containing a borderless
    BasicTextField and one circular Send/Stop button that swaps icon
    • tint based on isChatting. No more OutlinedTextField chrome.
  • System prompt moved out of the chat body into ...
Read more

LocalLLM v1.0.0 — Gemma 4 on Android

12 May 19:07

Choose a tag to compare

First public release. On-device, OpenAI-compatible LLM HTTP server for Android, powered by Google's LiteRT-LM runtime and Gemma 4.

Highlights

  • Gemma 4 E2B + E4B out of the box, downloaded from litert-community on HuggingFace and verified by SHA-256.
  • OpenAI-compatible POST /v1/chat/completions — both blocking and SSE streaming, with session_id-based KV cache reuse across turns.
  • AUTO backend with real fallback — tries GPU first, transparently falls back to CPU on init failure. The chosen backend is exposed via /health.
  • Foreground service with proper specialUse declaration and Play-required PROPERTY_SPECIAL_USE_FGS_SUBTYPE justification.
  • Polished Compose UI — scrollable tabs, friendly model labels, Stop button mid-stream, live tok/s counter, long-press copy, collapsible system prompt, distinct M3 primary/secondary/tertiary/error palette.
  • Quality-of-life ops — SSE error chunks on failure (no silent connection drops), atomic queue cap with 429 Retry-After, partial wake lock only while inference runs, idle eviction of GB-sized engines, GitHub Actions CI gate.

Install

Download app-debug.apk below and `adb install -r app-debug.apk`, or transfer the APK to your phone and open it (requires "install from unknown sources").

After install, open the app and tap Catalog → Download on Gemma 4 E2B IT (~2.6 GB). The server autostarts once a model is on disk. Verify with:

```bash
adb forward tcp:8099 tcp:8099
curl http://localhost:8099/health
```

Notes

  • Debug-signed APK. Not suitable for the Play Store yet (minify is off, no release keystore).
  • Requires Android 10 (API 29) or newer and ~6 GB free storage.
  • See the full docs in `docs/`mkdocs serve from the repo root.