Add LiteRT-LM native backend support by leehack · Pull Request #167 · leehack/llamadart

leehack · 2026-05-26T13:32:36Z

Summary

add a LiteRT-LM native-assets hook path that downloads pinned litert-lm-native v0.12.0 runtime archives
require and emit the LiteRT-LM StreamProxy helper library alongside LiteRtLm on Android, iOS, macOS, Linux, and Windows
route native .litertlm model bundles to a worker-isolate LiteRT-LM backend while keeping GGUF and unknown formats on llama.cpp
keep .litertlm models on the same ModelSource/download/cache path as GGUF and add Pixel/macOS benchmark entrypoints
infer Gemma 4 metadata and chat-template data for .litertlm bundle names so high-level LlamaEngine chat templating uses the Gemma 4 handler
run blocking LiteRT-LM send_message in a helper isolate so the backend worker can process cancellation while StreamProxy is unavailable
redownload stale LiteRT-LM runtime archives when cached checksums no longer match pinned release digests
validate the full platform-specific LiteRT-LM companion runtime library set after extraction so broken native bundles fail during build, not at runtime
invalidate already extracted LiteRT-LM runtime caches when the pinned runtime checksum changes
keep LlamaEngine.chatTemplate and ChatSession usable when the selected backend does not expose tokenization, using null template token counts and conservative prompt-size estimates for history pruning
reject unsupported LiteRT-LM LoRA operations consistently, including clearLoras, across service, worker, backend, and high-level engine paths
promote the LiteRT-LM runtime client/result/metrics API out of the experimental benchmark namespace, with deprecated benchmark wrappers for compatibility
cover the public LiteRT-LM runtime API exports on VM and web, including native-only web stubs that fail with UnsupportedError
keep the web-safe LiteRtLmBackend placeholder constructor aligned with the native constructor so cross-platform code compiles before reporting the native-only unsupported error
align the macOS/Pixel benchmark app metric output so LiteRT-LM and llama.cpp report comparable target-token, early-EOS, sampling-inclusive, and wall-throughput fields
harden direct LiteRtLmRuntimeClient validation and native handle cleanup for reinitialization, initialization-failure, and streaming-failure paths
give Pixel benchmark runs separate LiteRT-LM, llama.cpp, and combined log timeouts so full Gemma 4 comparisons can finish on slower GGUF/Vulkan runs
coalesce concurrent LiteRT-LM worker startup requests so simultaneous pre-load diagnostics cannot leak extra backend isolates
reject concurrent LiteRT-LM generations at the Dart backend before a second stream can overwrite active generation cleanup
cancel and close active LiteRT-LM generation streams before direct backend reload, context free, or model free requests
serialize regular LiteRT-LM worker requests through one service queue while keeping cancellation responsive, including queued-generation cancellation before native runtime entry
prevent immediate LiteRT-LM stream cancellation from sending a stale generation request after the caller response port has already closed
exclude the LiteRT-LM native FFI boundary from unit line coverage, leaving it to real-runtime smoke coverage, while keeping public runtime result/metrics value types unit-covered
clear disposed LiteRT-LM service runtime clients when replacement initialization fails so later tokenization or generation retries start from a clean client
dispose LiteRT-LM service runtime clients and clear context-scoped metrics when contexts are freed or replaced
align LiteRT-LM maxTokens <= 0 generation semantics with llama.cpp by treating such requests as a no-op that does not initialize runtime generation
report backend/model-specific LiteRT-LM engine creation failures and make Pixel benchmark runs fail on BENCHMARK: ERROR
normalize direct LiteRT-LM CPU/GPU/NPU backend selectors across runtime, service, and direct-backend APIs before native initialization
reject non-positive LiteRT-LM contextSize values at model load instead of allowing a later native initialization failure
refresh any LiteRT-LM runtime client that was initialized before context creation so tokenization before contextCreate cannot leave later calls using stale native context settings
reject explicit LiteRT-LM backend changes and unsupported llama.cpp-only context-time model parameters during LiteRT-LM context creation
validate macOS LiteRT-LM cache and app framework directories against the full ABI-specific runtime library set before selection
validate LiteRT-LM web context creation with the same model-parameter rules as native, including explicit NPU rejection
make the benchmark app choose a fair llama.cpp backend by platform and record the requested backend in metrics
document LiteRT-LM platform support in the website quickstart, model lifecycle, support matrix, and native build hook guides
reject LiteRT-LM detokenize(..., special: true) requests instead of silently ignoring the llama.cpp-only special-token flag
reject all LiteRT-LM multimodal backend operations consistently instead of allowing no-op projector frees or false capability probes
reject multimodal content passed to direct LiteRT-LM chat-template application instead of stringifying image/audio maps into text prompts
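
The startup coalescing mentioned in the summary (simultaneous pre-load diagnostics sharing one backend isolate) can be sketched with a cached Future. This is a minimal illustration, not the package's implementation: `CoalescedStarter` and its `_spawn` callback are hypothetical names, and the retry-after-failure behavior is an assumption.

```dart
import 'dart:async';

/// Hypothetical helper: coalesces concurrent start() calls onto one Future
/// so only a single worker is ever spawned per startup attempt.
class CoalescedStarter<T> {
  CoalescedStarter(this._spawn);

  final Future<T> Function() _spawn;
  Future<T>? _pending;

  /// Returns the in-flight (or completed) startup Future, spawning only once.
  Future<T> start() {
    final existing = _pending;
    if (existing != null) return existing;
    final created = _spawn();
    _pending = created;
    // Assumption: forget a failed startup so a later call can retry.
    created.then<void>((_) {}, onError: (Object _, StackTrace __) {
      if (identical(_pending, created)) _pending = null;
    });
    return created;
  }
}
```

Two callers that invoke `start()` back to back receive the identical Future, so the spawn callback runs once.
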

Native release state

litert-lm-native PR #2 ("Feat/build infra refinement") merged: StreamProxy runtime helper packaging
litert-lm-native PR #3 ("feat: implement LoRA support and training pipeline") merged: iOS CLiteRTLM.xcframework slices packaged as dylib-style runtime archives plus iOS StreamProxy
litert-lm-native release v0.12.0 refreshed with runtime archives for Android, iOS device/simulators, macOS, Linux, and Windows
native release workflow passed: https://github.com/leehack/litert-lm-native/actions/runs/26502345176
release manifest validation passed locally: Release manifest lists 21 required runtime artifacts
all current v0.12.0 runtime archive SHA-256 digests are pinned in hook/build.dart
iOS archive contents verified locally:
- ios/arm64/libLiteRtLm.dylib, ios/arm64/libStreamProxy.dylib
- ios/arm64-sim/libLiteRtLm.dylib, ios/arm64-sim/libStreamProxy.dylib
- ios/x64-sim/libLiteRtLm.dylib, ios/x64-sim/libStreamProxy.dylib

Verification

dart analyze
dart test test/unit/backends/litert_lm/litert_lm_service_test.dart test/unit/backends/litert_lm/worker_test.dart
- verified detokenize(..., special: true) fails explicitly before initializing the LiteRT-LM runtime.
- verified multimodal direct chat-template content returns a typed unsupported LiteRT-LM worker error instead of stringifying media maps.
- verified failed LiteRT-LM service runtime-client replacement disposes both old and failed clients, then retries tokenization with a fresh client.
dart test test/unit/backends/litert_lm/litert_lm_service_test.dart test/unit/backends/litert_lm/worker_test.dart test/unit/backends/litert_lm/litert_lm_backend_test.dart
- verified unsupported LiteRT-LM multimodal service, worker, and direct-backend calls fail explicitly.
dart run tool/testing/check_platform_boundaries.dart
dart test test/unit/hook/build_hook_litert_lm_integration_test.dart test/unit/backends/native/native_backend_test.dart test/unit/backends/litert_lm test/unit/core/engine/engine_test.dart test/unit/core/engine/chat_session_test.dart
dart test test/unit/backends/litert_lm/litert_lm_public_api_test.dart test/unit/backends/litert_lm/litert_lm_runtime_test.dart test/unit/experimental/litert_lm/litert_lm_benchmark_test.dart
dart test test/unit/backends/litert_lm/litert_lm_runtime_test.dart
- verified macOS LiteRT-LM extracted-runtime cache discovery uses the
  current runtime ABI, including macos/x64 for Intel macOS instead of the
  arm64 cache layout.
dart test -p chrome test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart
- verified the web-safe LiteRT-LM backend placeholder accepts the native constructor shape before failing with UnsupportedError.
dart test test/unit/backends/litert_lm/litert_lm_public_api_test.dart test/unit/backends/litert_lm/litert_lm_runtime_test.dart test/unit/backends/litert_lm/litert_lm_backend_test.dart test/unit/backends/native/native_backend_test.dart
dart test test/unit/backends/litert_lm/litert_lm_backend_test.dart
- verified direct LiteRT-LM backend reload, context free, and model free cancel
  and close an active generation stream before sending teardown/reload requests.
dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "context"
- verified LiteRT-LM context free and context recreation dispose the active
  runtime client and clear stale context-scoped performance metrics.
dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "maxTokens"
- verified LiteRT-LM maxTokens <= 0 emits no chunks, does not start runtime
  generation, and clears stale context-scoped performance metrics.
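
The maxTokens <= 0 behavior verified above can be sketched as an early return in an async generator. This is an illustration under assumptions: `generate`, `startRuntime`, and `clearMetrics` are hypothetical stand-ins for the service's real entry points.

```dart
/// Hypothetical wrapper around a runtime generation call.
/// [startRuntime] stands in for the native generation entry point and
/// [clearMetrics] for the context-scoped metrics reset.
Stream<String> generate(
  String prompt, {
  required int maxTokens,
  required Stream<String> Function(String prompt, int maxTokens) startRuntime,
  required void Function() clearMetrics,
}) async* {
  if (maxTokens <= 0) {
    // No-op path: drop stale metrics, emit no chunks, never start the runtime.
    clearMetrics();
    return;
  }
  yield* startRuntime(prompt, maxTokens);
}
```

Because the guard sits before `yield*`, a listener on the no-op path sees an empty stream that simply completes.
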
dart test test/unit/backends/litert_lm/litert_lm_runtime_test.dart -n "engine create"
- verified LiteRT-LM engine-creation diagnostics include backend, model, and
  fallback guidance for NPU/GPU failures.
dart test test/unit/backends/litert_lm/litert_lm_runtime_test.dart -n "backend"
dart test test/unit/backends/litert_lm/litert_lm_backend_test.dart -n "preferred backend diagnostics"
dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "backend preference"
- verified direct LiteRT-LM backend selectors are normalized or rejected before
  native initialization.
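
Normalizing or rejecting a selector before native initialization might look like the sketch below. The enum, function name, and alias table are assumptions for illustration; the only alias suggested by this description is "metal", which the macOS smoke runs pass and which is reported back as gpu.

```dart
enum LiteRtBackend { cpu, gpu, npu }

/// Hypothetical normalizer; the accepted alias set is an assumption and
/// not the package's actual list.
LiteRtBackend normalizeBackend(String selector) {
  switch (selector.trim().toLowerCase()) {
    case 'cpu':
      return LiteRtBackend.cpu;
    case 'gpu':
    case 'metal': // the macOS smoke runs pass "metal" and report "gpu"
      return LiteRtBackend.gpu;
    case 'npu':
      return LiteRtBackend.npu;
    default:
      // Fail in Dart, before any native handle is created.
      throw ArgumentError.value(
          selector, 'selector', 'Unsupported LiteRT-LM backend');
  }
}
```
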
dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "load-time model params"
- verified non-positive LiteRT-LM context sizes fail during model load.
dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "createContext disposes pre-context runtime client"
- verified tokenization before context creation disposes the pre-context runtime client and reinitializes with the created context size.
dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "backend changes during context creation"
- verified explicit LiteRT-LM context-time backend switches are rejected instead of silently reusing the model-load backend.
dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "loads local litertlm bundles"
dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "backend preference"
dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one concise sentence about on-device Gemma." 16
- latest Gemma 4 macOS LiteRT-LM e2e passed on head f089360ea: backend LiteRT-LM gpu, load 13 ms, wall 1017 ms, backend init 908.695 ms, prefill 117.54 tok/s, decode 45.83 tok/s for 16 decode tokens.
bash -n tool/litert_lm_pixel_benchmark.sh
dart test test/unit/backends/litert_lm/litert_lm_service_test.dart test/unit/backends/litert_lm/litert_lm_backend_test.dart test/unit/backends/litert_lm/worker_test.dart
dart test test/unit/backends/litert_lm
dart test test/unit/backends/native/native_backend_test.dart -n "litertlm"
dart analyze
dart run tool/testing/check_platform_boundaries.dart
git diff --check
dart test test/unit/backends/litert_lm/worker_test.dart
dart test test/unit/backends/litert_lm/litert_lm_service_test.dart test/unit/backends/litert_lm/litert_lm_backend_test.dart
- verified LiteRT-LM reloads issue fresh model/context handles and reject
  stale direct-backend handles for metadata, context lookup, context free,
  and model free
dart test test/unit/backends/native/native_backend_test.dart
- verified pre-load LlamaBackend() native GPU/VRAM diagnostics probe through
  llama.cpp without selecting a final backend, reuse that probe for GGUF
  loads, and dispose it before .litertlm routing
dart test test/unit/backends/native test/unit/backends/litert_lm
dart test test/unit/backends/litert_lm test/unit/backends/native/native_backend_test.dart
dart test test/unit/backends/litert_lm test/unit/experimental/litert_lm test/unit/backends/native
flutter pub publish --dry-run
dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one sentence about on-device Gemma." 32
- latest Gemma 4 macOS LiteRT-LM e2e passed through public LlamaEngine(LlamaBackend()) auto-routing: backend LiteRT-LM gpu, load 11 ms, wall 1147 ms, prefill 190.61 tok/s, decode 76.73 tok/s for 32 decode tokens
- output: On-device Gemma refers to the deployment of the Gemma model directly onto local hardware for processing, offering privacy and reduced latency.
dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one sentence about on-device Gemma." 16
- post-runtime-hardening Gemma 4 macOS LiteRT-LM e2e passed: backend LiteRT-LM gpu, load 11 ms, wall 623 ms, backend init 850.76 ms, decode 125.41 tok/s for 16 decode tokens
dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one concise sentence about on-device Gemma." 16
- post-maxTokens-parity Gemma 4 macOS LiteRT-LM e2e passed: backend LiteRT-LM gpu, load 15 ms, wall 952 ms, backend init 1051.79 ms, prefill 215.01 tok/s, decode 54.36 tok/s for 16 decode tokens
ADB=/Users/jhin.lee/Library/Android/sdk/platform-tools/adb DEVICE='adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp' TARGETS=litert_lm RUNS=1 WARMUPS=0 OUTPUT_TOKENS=16 BACKEND=gpu LITERT_LM_LOG_TIMEOUT=300 tool/litert_lm_pixel_benchmark.sh
- post-maxTokens-parity Pixel 9 Pro Gemma 4 LiteRT-LM/GPU e2e passed: load 1826 ms, backend init 48587.78 ms, wall 38692 ms, prefill 15.19 tok/s, decode 2.30 tok/s, wall 0.41 tok/s, 16/16 eval tokens, no early EOS
ADB=/Users/jhin.lee/Library/Android/sdk/platform-tools/adb DEVICE='adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp' TARGETS=litert_lm RUNS=1 WARMUPS=0 OUTPUT_TOKENS=16 BACKEND=npu LITERT_LM_LOG_TIMEOUT=300 tool/litert_lm_pixel_benchmark.sh
- Pixel 9 Pro Gemma 4 LiteRT-LM/NPU currently fails at engine creation with a clear diagnostic: LiteRT-LM engine creation failed for backend "npu" and model "gemma-4-E2B-it.litertlm". The Android NPU delegate may not support this device, OS, model, or bundle; try backend "gpu" or backend "cpu".
- verified the Pixel benchmark script now exits nonzero when the app logs BENCHMARK: ERROR.
ADB=/Users/jhin.lee/Library/Android/sdk/platform-tools/adb DEVICE='adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp' TARGETS=litert_lm RUNS=1 WARMUPS=0 OUTPUT_TOKENS=16 BACKEND=gpu LITERT_LM_LOG_TIMEOUT=300 tool/litert_lm_pixel_benchmark.sh
- post-benchmark-error-detection Pixel 9 Pro Gemma 4 LiteRT-LM/GPU e2e still passes: load 295 ms, backend init 8625.13 ms, wall 6177 ms, prefill 119.92 tok/s, decode 13.71 tok/s, wall 2.59 tok/s, 16/16 eval tokens, no early EOS
DECODE_TOKENS=256 tool/macos_fair_litert_vs_llamadart.sh
- llama.cpp GGUF/Metal (gemma-4-E2B-it-Q4_K_S.gguf): load 887 ms, wall 1942 ms, prefill 24819.86 tok/s, decode 1280.29 tok/s before sampling, decode-with-sampling 135.57 tok/s, wall 131.31 tok/s, 255/256 eval tokens
- LiteRT-LM/Metal (gemma-4-E2B-it.litertlm): high-level load 122 ms, backend init 4792.70 ms, wall 2219 ms, prefill 2685.39 tok/s, decode 116.80 tok/s, wall 115.37 tok/s, 256/256 eval tokens
- macOS conclusion for this local run: warm generation is close but llama.cpp/GGUF is faster on wall throughput; LiteRT-LM has substantially higher cold backend init cost on macOS. This is not the Pixel/NPU result.
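
The wall-throughput figures above appear consistent with eval tokens divided by wall-clock seconds. That reading is inferred from the reported numbers rather than from a stated formula, and `wallTokensPerSecond` is a hypothetical helper name.

```dart
/// Wall throughput in tokens per second: generated tokens / wall-clock seconds.
double wallTokensPerSecond(int evalTokens, int wallMs) {
  if (wallMs <= 0) {
    throw ArgumentError.value(wallMs, 'wallMs', 'must be positive');
  }
  return evalTokens / (wallMs / 1000);
}
```

For the llama.cpp run, `wallTokensPerSecond(255, 1942)` rounds to 131.31, and for the LiteRT-LM run, `wallTokensPerSecond(256, 2219)` rounds to 115.37, matching the wall tok/s values reported for both runs.
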
fresh macOS LiteRT-LM archive and extracted-cache marker verified:
- 6fe694ccc895c904b173f2952b73b7698097eda18d8bff0210ea9fcf10ca3da9 .dart_tool/llamadart/litert_lm/0.12.0/litert-lm-native-runtime-macos-arm64-v0.12.0.tar.gz
- .dart_tool/llamadart/litert_lm/0.12.0/macos/arm64/.llamadart_litert_lm.sha256 contains the same digest
- extracted runtime contains libLiteRtLm.dylib, libStreamProxy.dylib, libGemmaModelConstraintProvider.dylib, libLiteRt.dylib, libLiteRtMetalAccelerator.dylib, libLiteRtTopKMetalSampler.dylib, libLiteRtTopKWebGpuSampler.dylib, and libLiteRtWebGpuAccelerator.dylib
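
The extracted-cache marker check can be sketched as a comparison between the marker file and the pinned digest. Only the `.llamadart_litert_lm.sha256` file name comes from this description; the function names, the trimming and case-insensitive comparison, and the delete-on-mismatch step are assumptions.

```dart
import 'dart:io';

/// Returns true when the extracted runtime's marker matches the pinned digest.
/// Comparison rules (trim, case-insensitive hex) are assumptions.
bool extractedCacheIsFresh(File markerFile, String pinnedSha256) {
  if (!markerFile.existsSync()) return false;
  final recorded = markerFile.readAsStringSync().trim().toLowerCase();
  return recorded == pinnedSha256.trim().toLowerCase();
}

/// Deletes a stale extracted runtime so a later build can extract it again.
void invalidateIfStale(Directory runtimeDir, String pinnedSha256) {
  final marker = File('${runtimeDir.path}/.llamadart_litert_lm.sha256');
  if (runtimeDir.existsSync() &&
      !extractedCacheIsFresh(marker, pinnedSha256)) {
    runtimeDir.deleteSync(recursive: true);
  }
}
```
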
dart run tool/macos_litert_lm_cancel_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal 250 512 15000
- Gemma 4 macOS LiteRT-LM packaged-runtime cancellation e2e passed: completed before timeout in 1640 ms, 0 chunks, 0 characters, no stream error
DEVICE='adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp' TARGETS='litert_lm,llamadart' RUNS=3 WARMUPS=1 OUTPUT_TOKENS=256 BACKEND=gpu LOG_TIMEOUT=1200 tool/litert_lm_pixel_benchmark.sh
- Pixel 9 Pro Gemma 4 LiteRT-LM/GPU completed the full 256-token fair target: load 2151 ms, backend init 56159.13 ms, wall 60770 ms, prefill 327.01 tok/s, decode 3.63 tok/s, wall 4.21 tok/s, 256/256 eval tokens, no early EOS
- same Pixel 9 Pro llama.cpp/GGUF/Vulkan target did not reach BENCHMARK_DONE within 1200s for 256 tokens with 1 warmup and 3 measured runs
DEVICE='adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp' TARGETS='litert_lm,llamadart' RUNS=1 WARMUPS=0 OUTPUT_TOKENS=32 BACKEND=gpu LOG_TIMEOUT=900 tool/litert_lm_pixel_benchmark.sh
- LiteRT-LM/GPU 32-token cold run: load 1999 ms, backend init 39619.01 ms, wall 29760 ms, prefill 32.21 tok/s, decode 4.24 tok/s, wall 1.08 tok/s, 32/32 eval tokens, no early EOS
- llama.cpp/GGUF/Vulkan 32-token cold run: load 28363 ms, wall 73169 ms, resolved GPU layers 999, prefill 1.52 tok/s, decode 1.05 tok/s, decode-with-sampling 0.99 tok/s, wall 0.42 tok/s, 31/32 eval tokens, early EOS
- Pixel conclusion for these runs: LiteRT-LM is materially faster on Pixel 9 Pro than the current llama.cpp/GGUF Vulkan path for Gemma 4; the 256-token llama.cpp comparison needs a longer timeout or fewer measured runs to finish.
dart test test/unit/hook
dart test test/unit/backends/litert_lm test/unit/hook
dart test
- full local suite passed with 1434 passing and 67 skipped
dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=coverage/lcov.info --report-on=lib --check-ignore && dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70
- LCOV line coverage: 77.45% (8503/10979)
- lib/src/backends/litert_lm/litert_lm_runtime.dart reports LF:10/LH:10 after excluding the native FFI boundary; public runtime result/metrics value types remain covered by unit tests
git diff --check
bash tool/docs/build_site.sh
- verified the Docusaurus docs build after installing local website dependencies with npm ci.
previous PR head b528d23b passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26545476874
previous PR head 9fe15218d passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26546215623
previous PR head 2c231ddac passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26546737845
previous PR head f395eaade passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26547300385
previous PR head 1bbe148ff passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26547977565
previous PR head 801d4a2c passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26548444620
previous PR head 5d1aad4d passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26548920357
previous PR head c603f502 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26549404276
previous PR head f35302ee passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26549985727
previous PR head 6fe7c5b2 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26550451076
previous PR head fbfc68e2 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26550993411
previous PR head 49d6f4b7a passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26551609361
previous PR head a6a8ecc93 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26552258626
previous PR head 264bc897f passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26552774614
previous PR head f2916cb0e passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26553204346
previous PR head 7ac7e7bcc passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26553701141
previous PR head f089360ea passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26554262975
current PR head 985e9a9ea passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26554765555
current PR head b6a5181b local coverage hardening passed:
- dart test test/unit/backends/litert_lm/litert_lm_backend_test.dart
- dart test test/unit/backends/litert_lm/worker_test.dart
- dart test test/unit/backends/litert_lm
- dart test test/unit/backends/native/native_backend_test.dart
- dart analyze
- dart run tool/testing/check_platform_boundaries.dart
- git diff --check
- dart test -p vm -j 1 --exclude-tags local-only --coverage=coverage
- dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=coverage/lcov.info --report-on=lib --check-ignore && dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70
- LCOV line coverage: 78.62% (8728/11102); file coverage: LiteRT-LM backend 92.01%, LiteRT-LM worker 94.52%, LiteRT-LM service 89.09%, native router 90.36%
current PR head b6a5181b passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26555534894
current PR head 9e057a1f1 LiteRT-LM service coverage hardening passed:
- dart test test/unit/backends/litert_lm/litert_lm_service_test.dart
- dart test test/unit/backends/litert_lm
- dart analyze
- dart run tool/testing/check_platform_boundaries.dart
- git diff --check
- dart test -p vm -j 1 --exclude-tags local-only --coverage=coverage
- dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=coverage/lcov.info --report-on=lib --check-ignore
- dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70
- LCOV line coverage: 78.99% (8769/11102); LiteRT-LM service coverage: 99.49% (392/394), with only the internal no-model guard and macOS/Android cache-dir creation branch uncovered locally.
current PR head 9e057a1f1 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26556282010
current PR head a0f043da4 LiteRT-LM production-hardening coverage passed:
- dart test test/unit/backends/litert_lm/litert_lm_backend_test.dart
- dart test test/unit/backends/litert_lm
- dart test test/unit/backends/native/native_backend_test.dart
- dart test test/unit/backends/native
- dart test test/unit/backends/litert_lm/worker_test.dart
- dart test test/unit/backends/litert_lm/worker_messages_test.dart
- dart analyze
- dart run tool/testing/check_platform_boundaries.dart
- git diff --check
- dart test -p vm -j 1 --exclude-tags local-only --coverage=coverage
- dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=coverage/lcov.info --report-on=lib --check-ignore
- dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70
- LCOV line coverage: 79.23% (8796/11102); key file coverage: native_backend.dart 100.00%, worker_messages.dart 100.00%, worker.dart 97.95%, litert_lm_service.dart 99.49%, litert_lm_backend.dart 93.75%.
current PR head a0f043da4 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26557380437
dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one concise sentence about on-device Gemma." 16
- current-head Gemma 4 macOS LiteRT-LM e2e passed: backend LiteRT-LM gpu, load 14 ms, wall 1038 ms, backend init 935.581 ms, prefill 114.07 tok/s, decode 45.18 tok/s for 16/16 decode tokens.
- output: On-device Gemma allows for powerful, privacy-preserving AI processing directly on local
current PR head 55d718b7 adds LiteRT-LM web backend routing:
- .litertlm web URLs now route through a browser LiteRtLmBackend wrapper around official @litert-lm/core Engine.create(...) / createConversation(...) instead of falling through to the llama.cpp WebGPU bridge.
- WebAutoBackend keeps .gguf and unknown sources on WebGpuLlamaBackend, switches delegates by model format, and disposes the previous delegate on format changes.
- web LiteRT-LM supports URL/path loading, CPU/GPU backend selection, context sizing, streaming generation, stop-sequence cancellation, and high-level LlamaEngine / ChatSession flow.
- web LiteRT-LM explicitly rejects unsupported tokenizer, embeddings, state persistence, LoRA, grammar, multimodal, stream-batching, thread-tuning, and NPU operations.
- example/chat_app/web/index.html sets a default @litert-lm/core ESM module URL for .litertlm loads, while still allowing apps to override window.__llamadartLiteRtLmModuleUrl.
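
Format-based routing, with .litertlm sources going to LiteRT-LM and .gguf or unknown sources staying on llama.cpp, can be sketched as an extension check. `BackendKind` and `routeModelSource` are hypothetical names, and extension matching is an assumed detection rule; the real router may inspect more than the file name.

```dart
enum BackendKind { liteRtLm, llamaCpp }

/// Hypothetical router: picks a backend from the model source's extension.
BackendKind routeModelSource(String source) {
  // Uri.path drops query strings and fragments from URL sources.
  final path = (Uri.tryParse(source)?.path ?? source).toLowerCase();
  return path.endsWith('.litertlm')
      ? BackendKind.liteRtLm
      : BackendKind.llamaCpp; // GGUF and unknown formats stay on llama.cpp
}
```
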
current PR head 55d718b7 web LiteRT-LM verification:
- dart analyze
- dart test -p chrome test/unit/backends/litert_lm/litert_lm_web_backend_test.dart test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart test/unit/backends/web/web_backend_test.dart
- dart test test/unit/backends/litert_lm
- dart run tool/testing/check_platform_boundaries.dart
- dart test test/unit/backends/web test/unit/core/engine/engine_test.dart
- git diff --check
- dart test -p chrome test/unit/backends/webgpu/webgpu_backend_test.dart
- dart test test/unit/backends/native/native_backend_test.dart test/unit/core/models/model_source_test.dart
- ./tool/docs/build_site.sh
current PR head 6deefda9 fixes the CI mirrored-test-name gate by renaming the web LiteRT-LM test to test/unit/backends/litert_lm/litert_lm_backend_web_test.dart.
- dart test test/unit/test_structure/mirrored_unit_structure_test.dart
- dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart test/unit/backends/web/web_backend_test.dart
- dart analyze
- git diff --check
current PR head 6deefda9 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26558781841
current PR head 3744a1f18 documents LiteRT-LM web backend support across README, website quickstart, model lifecycle, platform support matrix, and the chat example README.
- ./tool/docs/build_site.sh
- git diff --check
current PR head 3744a1f18 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26559439872
current PR head 019486a80 adds a Gemma 4 LiteRT-LM chat app preset and keeps chat-app LiteRT-LM loads free of llama.cpp-only Android batch/thread tuning.
- flutter test test/chat_service_test.dart test/model_asset_source_test.dart test/model_download_controller_adapter_test.dart
- flutter test test/manage_models_screen_download_test.dart
- dart analyze
- flutter analyze from example/chat_app
- git diff --check
current PR head 019486a80 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26560234889
current PR head b5590d305 adds repeatable Gemma 4 LiteRT-LM web smoke coverage and fixes the high-level web chat path:
- .litertlm web GPU selection now uses GPU_ARTISAN when available, which is the backend accepted by the Gemma 4 web bundle.
- web LiteRT-LM metadata exposes a passthrough latest-message template so @litert-lm/core Conversation can apply the bundle's own Gemma 4 template instead of receiving a pre-rendered prompt.
- native and web LiteRT-LM generation reject only real media parts; text/tool/thinking parts already represented in the rendered prompt no longer block high-level LlamaEngine.create / ChatSession generation.
- the chat app labels LiteRT-LM web GPU execution as WEBGPU instead of falling back to CPU.
- tool/testing/run_local_e2e.dart now includes chat-app-web-litert-gemma4-smoke, prefers the repo-local Playwright Python venv when present, and exits cleanly after background server cleanup.
current PR head b5590d305 verification:
- python3 -m py_compile tool/testing/playwright_chat_app_real_model_smoke.py
- dart test test/unit/tooling/run_local_e2e_test.dart
- dart test test/unit/backends/litert_lm/litert_lm_backend_test.dart
- dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart
- dart test test/unit/backends/litert_lm/litert_lm_service_test.dart
- flutter test test/backend_utils_test.dart from example/chat_app
- dart analyze
- git diff --check
- flutter build web --base-href=/example/chat_app/build/web/ from example/chat_app
- dart run tool/testing/run_local_e2e.dart --scenario chat-app-web-litert-gemma4-smoke --skip-build
  - real Gemma 4 web LiteRT-LM e2e passed through example/chat_app: model load 39.8s, response source litert, expected response 4, runtime label WEBGPU, settings backend 2 / GPU_ARTISAN, final body reported avg 14.1 tok/s, first token 141 ms, total 142 ms.
current PR head 6a5dc083 fixes the Windows dry-run assertion for repo-local Playwright Python paths.
- dart test test/unit/tooling/run_local_e2e_test.dart
- dart format test/unit/tooling/run_local_e2e_test.dart
- git diff --check
current PR head 6a5dc083 passed all CI checks after rerunning the transient Windows setup failure: https://github.com/leehack/llamadart/actions/runs/26563948801
current PR head 61eb4eccb hardens LiteRT-LM web backend production introspection:
- WebAutoBackend now implements BackendEmbeddingsSupport and reports active delegate embedding support, so LiteRT-LM web advertises unsupported embeddings reliably.
- LiteRT-LM web metadata now includes architecture, model name, context length, backend, and source URL alongside the passthrough chat template.
- LiteRT-LM web multimodal capability/free calls now reject unsupported operations explicitly instead of silently no-oping or returning false.
current PR head 61eb4eccb verification:
- dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart test/unit/backends/web/web_backend_test.dart
- dart analyze
- dart run tool/testing/check_platform_boundaries.dart
- git diff --check
current PR head 61eb4eccb passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26564974130
current PR head dc11792e1 hardens LiteRT-LM web direct-backend lifecycle parity:
- Web LiteRT-LM now issues monotonically increasing model/context handles instead of always returning 1.
- Stale web model/context handles are rejected after reload for metadata, context lookup/free, and model free, matching the native LiteRT-LM lifecycle hardening.
- Direct web chat-template calls now extract text parts and reject media content instead of stringifying image/audio maps into prompts.
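
Monotonically increasing handles with stale-handle rejection can be sketched as a small registry. `HandleRegistry` is a hypothetical class for illustration; the actual backend tracks model and context handles separately and may use different error types.

```dart
/// Hypothetical registry: issues increasing handles and rejects stale ones.
class HandleRegistry {
  int _next = 1;
  int? _active;

  /// Issues a fresh handle; any previously issued handle becomes stale.
  int issue() {
    final handle = _next++;
    _active = handle;
    return handle;
  }

  /// Throws unless [handle] is the currently active handle.
  void check(int handle) {
    if (handle != _active) {
      throw StateError('Stale or unknown handle: $handle');
    }
  }

  /// Releases [handle] after validating it.
  void release(int handle) {
    check(handle);
    _active = null;
  }
}
```

Since a reload calls `issue()` again, a handle held from before the reload no longer equals the active one and fails `check`.
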
current PR head dc11792e1 verification:
- dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart
- dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart test/unit/backends/web/web_backend_test.dart
- dart analyze
- dart run tool/testing/check_platform_boundaries.dart
- git diff --check
current PR head dc11792e1 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26565806067
current PR head 0bff13ffb aligns LiteRT-LM web direct diagnostics and LoRA lifecycle:
- Direct web LiteRT-LM getVramInfo() now reports zero VRAM without requiring a loaded model, matching native LiteRT-LM/WebAuto diagnostic behavior.
- Web LiteRT-LM LoRA adapter calls now validate the active context handle before reporting unsupported LoRA, so stale/no-context direct calls fail as lifecycle errors instead of masking stale handles.
current PR head 0bff13ffb verification:
- dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart
- dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart test/unit/backends/web/web_backend_test.dart
- dart analyze
- dart run tool/testing/check_platform_boundaries.dart
- git diff --check
current PR head 0bff13ffb passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26566533318
current PR head 9da6d5d3b hardens remaining production-grade parity and benchmark fairness:
- macOS cache/framework discovery now accepts only complete ABI-specific LiteRT-LM runtime installs before selecting them.
- LiteRT-LM web contextCreate validates full ModelParams and rejects explicit NPU selection up front.
- benchmark app llama.cpp/GGUF leg now defaults to Metal on Apple, Vulkan on Android/Linux/Windows, records requestedBackend, and supports LLAMADART_BACKEND override.
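The default selection for the llama.cpp/GGUF leg described above can be sketched roughly like this. It is a Python illustration rather than the benchmark app's Dart code; the platform strings, the function name, and the error raised for unknown platforms are assumptions, while `LLAMADART_BACKEND` and `requestedBackend` come from the change itself.

```python
APPLE = {"macos", "ios"}                      # assumed platform labels
VULKAN_DEFAULT = {"android", "linux", "windows"}


def resolve_gguf_backend(platform, env):
    """Pick the requested llama.cpp backend for a benchmark run."""
    override = env.get("LLAMADART_BACKEND", "").strip().lower()
    if override:
        requested = override   # explicit override always wins
    elif platform in APPLE:
        requested = "metal"
    elif platform in VULKAN_DEFAULT:
        requested = "vulkan"
    else:
        # Assumption: the sketch refuses to guess for unknown platforms.
        raise ValueError(f"no default backend for platform: {platform}")
    # Recorded so results state which backend was asked for.
    return {"platform": platform, "requestedBackend": requested}
```

For example, `resolve_gguf_backend("android", {"LLAMADART_BACKEND": "cpu"})` records `cpu` instead of the Vulkan default.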
current PR head 9da6d5d3b verification:
- dart test test/unit/backends/litert_lm/litert_lm_runtime_test.dart
- dart test test/unit/backends/litert_lm
- dart test test/unit/hook/build_hook_litert_lm_integration_test.dart
- dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart
- flutter test test/litert_lm_benchmark_app_test.dart from example/chat_app
- dart analyze
- dart run tool/testing/check_platform_boundaries.dart
- git diff --check
- dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one concise sentence about on-device Gemma." 16
  - current-head Gemma 4 macOS LiteRT-LM e2e passed after the benchmark-fairness changes: backend LiteRT-LM gpu, load 13 ms, wall 1048 ms, backend init 933.472 ms, prefill 111.32 tok/s, decode 44.78 tok/s for 16/16 decode tokens.
  - output: On-device Gemma allows for powerful, privacy-preserving AI processing directly on local
- DEVICE=adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp TARGETS=litert_lm RUNS=1 WARMUPS=0 OUTPUT_TOKENS=16 BACKEND=gpu LITERT_LM_LOG_TIMEOUT=300 tool/litert_lm_pixel_benchmark.sh
  - current-head Pixel 9 Pro Gemma 4 LiteRT-LM/GPU e2e passed after the benchmark-fairness changes: load 222 ms, backend init 9498.49 ms, wall 6965 ms, prefill 82.25 tok/s, decode 13.49 tok/s, wall 2.30 tok/s, 16/16 eval tokens, no early EOS.
current PR head 9da6d5d3b passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26567730946
current PR head 6e14a4fae forwards the scripted llama.cpp backend override for Pixel benchmark runs:
- tool/litert_lm_pixel_benchmark.sh now reads LLAMADART_BACKEND, prints it in the benchmark configuration, and passes it as a Flutter dart-define so scripted GGUF runs can force auto, cpu, vulkan, metal, etc.
current PR head 6e14a4fae verification:
- bash -n tool/litert_lm_pixel_benchmark.sh
- git diff --check HEAD
- DEVICE=adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp TARGETS=llamadart RUNS=1 WARMUPS=0 OUTPUT_TOKENS=16 BACKEND=gpu LLAMADART_BACKEND=vulkan LLAMADART_LOG_TIMEOUT=900 tool/litert_lm_pixel_benchmark.sh
  - current-head Pixel 9 Pro Gemma 4 llama.cpp/GGUF e2e passed with requestedBackend: vulkan: load 9431 ms, wall 22889 ms, resolved GPU layers 999, prefill 5.41 tok/s, decode 1.42 tok/s, decode-with-sampling 1.37 tok/s, wall 0.66 tok/s, 15/16 eval tokens, early EOS.
current PR head 6e14a4fae passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26568475041
current PR head 792b7d5e3 tightens LiteRT-LM platform hook asset coverage:
- Android, Linux x64, and Windows hook integration tests now assert LiteRT-LM companion runtime library filenames and package code-asset IDs, so platform hook coverage fails if .litertlm native runtime assets stop being emitted even when llama.cpp assets still pass.
current PR head 792b7d5e3 verification:
- dart test test/unit/hook/build_hook_android_integration_test.dart test/unit/hook/build_hook_linux_integration_test.dart test/unit/hook/build_hook_integration_test.dart
- dart analyze test/unit/hook/build_hook_android_integration_test.dart test/unit/hook/build_hook_linux_integration_test.dart test/unit/hook/build_hook_integration_test.dart
- dart test test/unit/hook/build_hook_litert_lm_integration_test.dart
- dart test test/unit/hook
- git diff --check
current PR head 792b7d5e3 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26569144857
current PR head f9e909e60 hardens macOS LiteRT-LM runtime packaging:
- tool/macos_litert_lm_prepare_app.sh now validates complete ABI-specific LiteRT-LM runtime directories before installing app frameworks, supports arm64/x64 runtime cache layouts, installs the full arm64 companion framework set, and fails early on partial explicit runtime dirs.
- macOS runtime framework/cache/fallback loading now uses ABI-specific companion sets, so x64 does not inherit arm64-only Metal/WebGPU companion assumptions while arm64 validates and opens the full packaged companion set.
current PR head f9e909e60 verification:
- bash -n tool/macos_litert_lm_prepare_app.sh
- dart test test/unit/backends/litert_lm/litert_lm_runtime_test.dart
- dart test test/unit/tooling/macos_litert_lm_prepare_app_script_test.dart
- dart analyze lib/src/backends/litert_lm/litert_lm_runtime.dart test/unit/backends/litert_lm/litert_lm_runtime_test.dart test/unit/tooling/macos_litert_lm_prepare_app_script_test.dart
- dart test test/unit/backends/litert_lm
- dart test test/unit/tooling
- dart run tool/testing/check_platform_boundaries.dart
- dart analyze
- git diff --check
- dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one concise sentence about on-device Gemma." 16
  - Gemma 4 macOS LiteRT-LM e2e passed after the macOS runtime packaging hardening: backend LiteRT-LM gpu, load 13 ms, wall 2009 ms, backend init 1163.554 ms, prefill 18.65 tok/s, decode 43.73 tok/s for 16 decode tokens.
current PR head f9e909e60 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26569935734
current PR head a227134c5 clarifies backend-aware state-persistence unsupported messages:
- LlamaEngine.stateSaveFile and stateLoadFile now preserve WebGPU bridge guidance for WebGPU backends while reporting LiteRT-LM-specific unsupported-state guidance for .litertlm models.
- Tests cover generic, WebGPU, and LiteRT-LM unsupported-state paths so LiteRT-LM users no longer receive WebGPU bridge-version instructions.
current PR head a227134c5 verification:
- dart test -p chrome test/unit/backends/web/web_backend_test.dart
- dart test test/unit/core/engine/engine_test.dart test/unit/backends/native/native_backend_test.dart
- dart analyze
- git diff --check
current PR head a227134c5 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26571165587

codecov-commenter · 2026-05-26T15:33:46Z

Codecov Report

❌ Patch coverage is 94.97374% with 67 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.37%. Comparing base (47b8497) to head (91cb325).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| lib/src/backends/litert_lm/litert_lm_backend.dart | 93.28% | 19 Missing ⚠️ |
| lib/src/hook/native_bundle_config.dart | 71.42% | 18 Missing ⚠️ |
| lib/src/core/engine/engine.dart | 86.81% | 12 Missing ⚠️ |
| lib/src/backends/litert_lm/litert_lm_service.dart | 98.45% | 6 Missing ⚠️ |
| ...rc/experimental/litert_lm/litert_lm_benchmark.dart | 75.00% | 6 Missing ⚠️ |
| lib/src/backends/litert_lm/worker.dart | 97.88% | 3 Missing ⚠️ |
| lib/src/backends/litert_lm/litert_lm_runtime.dart | 98.18% | 1 Missing ⚠️ |
| lib/src/core/engine/chat_session.dart | 90.00% | 1 Missing ⚠️ |
| lib/src/core/template/handlers/gemma4_handler.dart | 95.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #167      +/-   ##
==========================================
+ Coverage   78.48%   80.37%   +1.88%     
==========================================
  Files          76       84       +8     
  Lines        9975    11244    +1269     
==========================================
+ Hits         7829     9037    +1208     
- Misses       2146     2207      +61

| Flag | Coverage Δ |
|---|---|
| unittests | `80.37% <94.97%> (+1.88%)` ⬆️ |

leehack · 2026-05-28T14:02:01Z

Rebased this PR onto origin/main at 29291222e (llamadart-native / llama.cpp tag b9371) and reran the Pixel 9 Pro benchmark against Gemma 4 E2B.

Setup:

Device: Pixel 9 Pro, wireless ADB adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp
LiteRT-LM model: gemma-4-E2B-it.litertlm, backend gpu
llama.cpp model: gemma-4-E2B-it-Q4_K_S.gguf, backend vulkan, gpuLayers=999
Runs: RUNS=1, WARMUPS=0

128 output tokens:

| Backend | Load | Wall | Prefill | Decode | Wall tok/s | Notes |
|---|---|---|---|---|---|---|
| LiteRT-LM gpu | 2.86s | 52.64s | 17.28 tok/s | 4.64 tok/s | 2.43 | 128/128 tokens |
| llama.cpp Vulkan b9371 | 20.79s | 87.68s | 5.47 tok/s | 1.78 tok/s | 1.45 | EOS at 127/128 |

32 output tokens:

| Backend | Load | Wall | Prefill | Decode | Wall tok/s | Notes |
|---|---|---|---|---|---|---|
| LiteRT-LM gpu | 0.18s | 7.12s | 106.54 tok/s | 14.99 tok/s | 4.50 | 32/32 tokens |
| llama.cpp Vulkan b9371 | 8.14s | 25.80s | 8.39 tok/s | 1.81 tok/s | 1.20 | EOS at 31/32 |

Conclusion: the updated llama.cpp/Vulkan backend in b9371 is much better than the previous benchmarked backend, especially load/prefill, but LiteRT-LM is still ahead on Pixel 9 Pro for this Gemma 4 E2B setup. The current gap is about 2.6x on 128-token decode and 1.7x on 128-token wall throughput; for short 32-token output it is about 8.3x decode and 3.7x wall throughput. The 128-token wall gap is smaller because LiteRT-LM still has a large cold backend init cost, while llama.cpp hit EOS slightly before the requested output budget.
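
For anyone checking the ratios quoted above, the arithmetic can be reproduced from the two tables. This is a small Python sketch; only the numbers come from the benchmark tables, and the dictionary layout and names are illustrative.

```python
# Decode and wall throughput (tok/s) copied from the tables above.
results = {
    128: {"litert_lm": {"decode": 4.64, "wall": 2.43},
          "llama_cpp": {"decode": 1.78, "wall": 1.45}},
    32: {"litert_lm": {"decode": 14.99, "wall": 4.50},
         "llama_cpp": {"decode": 1.81, "wall": 1.20}},
}

ratios = {}
for tokens, row in results.items():
    ratios[tokens] = {
        metric: row["litert_lm"][metric] / row["llama_cpp"][metric]
        for metric in ("decode", "wall")
    }
    print(f"{tokens} tokens: "
          f"decode {ratios[tokens]['decode']:.2f}x, "
          f"wall {ratios[tokens]['wall']:.2f}x")
# 128 tokens: decode 2.61x, wall 1.68x
# 32 tokens: decode 8.28x, wall 3.75x
```

The 32-token wall ratio works out to 3.75, which is the figure reported above as about 3.7x.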

Local validation before push:

dart test test/unit/backends/native/native_backend_test.dart -n "litertlm" passed, 12 tests.
flutter clean && flutter test test/litert_lm_benchmark_app_test.dart test/chat_service_test.dart test/backend_utils_test.dart test/unit_test.dart passed, 57 tests.
flutter analyze passed.
git diff --check passed.

leehack · 2026-05-28T14:30:24Z

Follow-up on the Pixel 9 Pro benchmark: the earlier rerun may have been affected by device power state. After checking the device, dumpsys power reported mWakefulness=Dozing. I reran with the phone awake, plugged in, and the benchmark app focused.

Important caveat: the benchmark app currently reports the last measured run when RUNS > 1, not an average/median, so multi-run output should be treated as a warm-run sample until the harness is improved.

Screen-awake 128-token run (RUNS=1, WARMUPS=0):

| Backend | Load | Wall | Prefill | Decode | Wall tok/s | Notes |
|---|---|---|---|---|---|---|
| LiteRT-LM gpu | 0.25s | 13.92s | 81.48 tok/s | 15.23 tok/s | 9.20 | 128/128 tokens |
| llama.cpp Vulkan b9371 | 12.52s | 137.97s | 5.49 tok/s | 1.05 tok/s | 0.92 | EOS at 127/128 |

Screen-awake 32-token cold run (RUNS=1, WARMUPS=0):

| Backend | Load | Wall | Prefill | Decode | Wall tok/s | Notes |
|---|---|---|---|---|---|---|
| LiteRT-LM gpu | 0.18s | 7.34s | 95.90 tok/s | 14.74 tok/s | 4.36 | 32/32 tokens |
| llama.cpp Vulkan b9371 | 8.38s | 38.74s | 8.32 tok/s | 1.05 tok/s | 0.80 | EOS at 31/32 |

Screen-awake 32-token warm sample (RUNS=3, WARMUPS=1, last measured run reported):

| Backend | Load | Wall | Prefill | Decode | Wall tok/s | Notes |
|---|---|---|---|---|---|---|
| LiteRT-LM gpu | 0.17s | 2.10s | 435.05 tok/s | 16.62 tok/s | 15.27 | last measured run |
| llama.cpp Vulkan b9371 | 8.48s | 38.94s | 14.87 tok/s | 0.93 tok/s | 0.80 | last measured run, EOS at 31/32 |

Conclusion update: the previous screen-off/dozing numbers should not be treated as final. The screen-awake rerun makes LiteRT-LM look much faster for this Pixel 9 Pro / Gemma 4 setup, but llama.cpp/Vulkan also showed significant variance versus the previous run. Before using these as publishable benchmark numbers, we should improve the harness to keep the device awake/unlocked, record thermal/cooling state, report every run or median/p95, and separate cold-start from steady-state measurements.
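
One possible shape for the reporting part of that harness change, sketched in Python rather than in the benchmark app's Dart: report every run, summarize the steady-state runs with median and p95, and keep the cold run separate. The function names are hypothetical, treating the first run as the cold start is an assumption, and the sample values are synthetic placeholders, not measurements.

```python
import math
import statistics


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(decode_tok_s):
    """Split a list of per-run decode rates into cold and steady-state stats."""
    if len(decode_tok_s) < 2:
        raise ValueError("need one cold run plus at least one steady-state run")
    cold, steady = decode_tok_s[0], decode_tok_s[1:]  # assumption: run 0 is cold
    return {
        "runs": list(decode_tok_s),   # report every run, not only the last
        "cold": cold,
        "steady_median": statistics.median(steady),
        "steady_p95": p95(steady),
    }


# Synthetic placeholder values, for illustration only.
summary = summarize([10.0, 12.0, 14.0, 16.0])
```

Thermal state and wakefulness would still need to be captured on the device side; this only covers how the per-run numbers are aggregated.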

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

leehack · 2026-05-29T00:05:08Z

Review findings from this pass:

P1: CI is currently red from a stale Gemma 4 thinking expectation. test/unit/backends/native/native_backend_test.dart:576 still expects the LiteRT-LM prompt to end with <|turn>model\n, but the current Gemma 4 template correctly leaves thinking open and now ends with <|turn>model\n<|channel>thought\n. This is the failure in Linux/Web, macOS native, and Windows native jobs. The test expectation needs to be updated before this is mergeable.
P1: LiteRT-LM web does not really support the claimed high-level chat/session/tool path yet. lib/src/backends/litert_lm/litert_lm_backend_web.dart:38 exposes a template that renders only the last message; generate() creates a fresh JS conversation at :196 and deletes it at :217; metadata documents this as a passthrough at :266. That means LlamaEngine.create(...) / ChatSession on web drop system prompts, prior history, and tool declarations before reaching @litert-lm/core. The docs currently claim high-level flow and thinking/tool parsing (README.md:465, website/docs/platforms/support-matrix.md:113), while the real web smoke disables both tools and thinking (tool/testing/playwright_chat_app_real_model_smoke.py:217). Either implement a real web chat-message/tool mapping or scope the docs/tests to simple single-turn text generation on LiteRT-LM web.
P2: Android chat-app presets silently override Gemma 4 LiteRT-LM to very small limits. The Gemma 4 LiteRT preset declares contextSize: 8192 and maxTokens: 1024 in example/chat_app/lib/models/downloadable_model.dart:512, but applyModelPreset() clamps every Android .litertlm model to context 512 and max output 128 at example/chat_app/lib/providers/chat_provider.dart:1940 and :1964. That makes manual Gemma 4 thinking/tool testing easy to truncate and does not match the preset or benchmark docs. If we keep this mobile safety default, it should be explicit/model-specific and not silently override the preset used for validation.

Also filed the structured-output follow-up as #174.

leehack · 2026-05-29T00:23:30Z

Addressed the review findings in 25c5ccf3f:

Updated the stale Gemma 4 LiteRT-LM test expectation for the open thinking channel.
Scoped LiteRT-LM web to single-turn text in backend metadata, tests, README, support matrix, backend-selection docs, and release notes.
Removed the hidden Android LiteRT-LM 512/128 preset clamp so the Gemma 4 LiteRT preset values are used for manual testing/benchmarking.

Validation:

dart test test/unit/backends/native/native_backend_test.dart
dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart
flutter test test/unit_test.dart in example/chat_app
dart analyze
flutter analyze in example/chat_app
git diff --check
npm run build in website

GitHub checks are green on the latest PR run.

leehack changed the title ~~Add LiteRT-LM benchmark backend POC~~ Add LiteRT-LM native backend support May 27, 2026

leehack force-pushed the litert-lm-poc branch from 1030ad2 to b4dd730 May 28, 2026 14:00

leehack force-pushed the litert-lm-poc branch 2 times, most recently from a29ac6d to bcaefcf May 28, 2026 15:39

leehack marked this pull request as ready for review May 28, 2026 19:07

Copilot AI review requested due to automatic review settings May 28, 2026 19:07

Copilot AI reviewed May 28, 2026

This was referenced May 28, 2026:

- MTP support? #168 (Open)
- Add LiteRT-LM LoRA adapter support #173 (Open)
- Support strict structured output for LiteRT-LM backends #174 (Open)

leehack added 15 commits May 29, 2026 09:43

Add LiteRT-LM benchmark backend POC

b755832

Add LiteRT-LM POC unit tests

e1beadb

Keep LiteRT-LM POC exports web-safe

e1d7c53

Match LiteRT-LM stub API surface

1e72e0e

Run LiteRT-LM backend in a worker isolate

e1fd3c7

Fix WebGPU test length lint

5745491

Route native backend by model format

903a2ee

Document native model format routing

1ca2ad4

Use dedicated LiteRT-LM runtime assets

d0dc7e4

Add LiteRT backend mirrored tests

ef48095

Remove LiteRT-LM POC runtime naming

0565d3b

Document LiteRT-LM model source caching

b51d5b1

Stabilize LiteRT-LM callback handling

f492a8c

Add Gemma 4 metadata for LiteRT-LM bundles

0c7ce36

Improve LiteRT-LM cancellation responsiveness

861ada0

leehack added 27 commits May 29, 2026 09:45

Forward llama.cpp backend override in Pixel benchmark

819a185

Cover LiteRT-LM platform hook assets

a973303

Harden macOS LiteRT-LM runtime packaging

50623e4

Clarify LiteRT-LM state persistence errors

ff96dbf

Preserve WebGPU state persistence guidance

fcb2a4e

Keep chat app LiteRT-LM auto on GPU

58df71c

Warm up chat app LiteRT-LM loads

61c40f3

Use native perf counters in chat metrics

34b83af

Clarify LiteRT chat metric chips

95399e8

Reconcile chat token chips with native metrics

1dd208a

Raise Android LiteRT chat output budget

b6be642

Update chat app lockfile after rebase

1f82c54

Document llama.cpp and LiteRT-LM backend selection

efcbdfd

Add native runtime selection and LiteRT smoke coverage

80c17de

Add backend benchmark documentation

b9d8b76

Clarify web GGUF benchmark result

cd0a780

Clarify web GGUF sanity checks

8872921

Fix web benchmark backend selection

9210c2c

Fix web GGUF benchmark path

c4a16c9

Clean up web benchmark docs and runner

4a278c0

Clean up LiteRT-LM backend docs and helpers

713d55e

Add LiteRT-LM platform helper tests

784e999

Prepare 0.7.0 release

0e90a1c

Fix LiteRT-LM iOS loading and tool parsing

8dac3af

Cover LiteRT Gemma chat features

2692428

Fix LiteRT Gemma thinking channel

c141cd4

Clarify LiteRT-LM web chat limits

91cb325

leehack force-pushed the litert-lm-poc branch from 25c5ccf to 91cb325 May 29, 2026 13:50

leehack merged commit 5d5f2b7 into main May 30, 2026
8 checks passed

leehack deleted the litert-lm-poc branch May 30, 2026 00:52

leehack merged 112 commits into main from litert-lm-poc

3 participants
