LocalLLM v1.0.0 — Gemma 4 on Android
First public release. On-device, OpenAI-compatible LLM HTTP server for Android, powered by Google's LiteRT-LM runtime and Gemma 4.
Highlights
- Gemma 4 E2B + E4B out of the box, downloaded from
litert-communityon HuggingFace and verified by SHA-256. - OpenAI-compatible
POST /v1/chat/completions— both blocking and SSE streaming, withsession_id-based KV cache reuse across turns. - AUTO backend with real fallback — tries GPU first, transparently falls back to CPU on init failure. The chosen backend is exposed via
/health. - Foreground service with proper
specialUsedeclaration and Play-requiredPROPERTY_SPECIAL_USE_FGS_SUBTYPEjustification. - Polished Compose UI — scrollable tabs, friendly model labels, Stop button mid-stream, live tok/s counter, long-press copy, collapsible system prompt, distinct M3 primary/secondary/tertiary/error palette.
- Quality-of-life ops — SSE error chunks on failure (no silent connection drops), atomic queue cap with
429 Retry-After, partial wake lock only while inference runs, idle eviction of GB-sized engines, GitHub Actions CI gate.
Install
Download app-debug.apk below and `adb install -r app-debug.apk`, or transfer the APK to your phone and open it (requires "install from unknown sources").
After install, open the app and tap Catalog → Download on Gemma 4 E2B IT (~2.6 GB). The server autostarts once a model is on disk. Verify with:
```bash
adb forward tcp:8099 tcp:8099
curl http://localhost:8099/health
```
Notes
- Debug-signed APK. Not suitable for the Play Store yet (minify is off, no release keystore).
- Requires Android 10 (API 29) or newer and ~6 GB free storage.
- See the full docs in `docs/` —
mkdocs servefrom the repo root.