Skip to content

LocalLLM v1.0.0 — Gemma 4 on Android

Choose a tag to compare

@mlnomadpy mlnomadpy released this 12 May 19:07

First public release. On-device, OpenAI-compatible LLM HTTP server for Android, powered by Google's LiteRT-LM runtime and Gemma 4.

Highlights

  • Gemma 4 E2B + E4B out of the box, downloaded from litert-community on HuggingFace and verified by SHA-256.
  • OpenAI-compatible POST /v1/chat/completions — both blocking and SSE streaming, with session_id-based KV cache reuse across turns.
  • AUTO backend with real fallback — tries GPU first, transparently falls back to CPU on init failure. The chosen backend is exposed via /health.
  • Foreground service with proper specialUse declaration and Play-required PROPERTY_SPECIAL_USE_FGS_SUBTYPE justification.
  • Polished Compose UI — scrollable tabs, friendly model labels, Stop button mid-stream, live tok/s counter, long-press copy, collapsible system prompt, distinct M3 primary/secondary/tertiary/error palette.
  • Quality-of-life ops — SSE error chunks on failure (no silent connection drops), atomic queue cap with 429 Retry-After, partial wake lock only while inference runs, idle eviction of GB-sized engines, GitHub Actions CI gate.

Install

Download app-debug.apk below and `adb install -r app-debug.apk`, or transfer the APK to your phone and open it (requires "install from unknown sources").

After install, open the app and tap Catalog → Download on Gemma 4 E2B IT (~2.6 GB). The server autostarts once a model is on disk. Verify with:

```bash
adb forward tcp:8099 tcp:8099
curl http://localhost:8099/health
```

Notes

  • Debug-signed APK. Not suitable for the Play Store yet (minify is off, no release keystore).
  • Requires Android 10 (API 29) or newer and ~6 GB free storage.
  • See the full docs in `docs/`mkdocs serve from the repo root.