Add LiteRT-LM native backend support #167

Merged
leehack merged 112 commits into main from litert-lm-poc
May 30, 2026

leehack (Owner) commented May 26, 2026

Summary

  • add a LiteRT-LM native-assets hook path that downloads pinned litert-lm-native v0.12.0 runtime archives

  • require and emit the LiteRT-LM StreamProxy helper library alongside LiteRtLm on Android, iOS, macOS, Linux, and Windows

  • route native .litertlm model bundles to a worker-isolate LiteRT-LM backend while keeping GGUF and unknown formats on llama.cpp

  • keep .litertlm models on the same ModelSource/download/cache path as GGUF and add Pixel/macOS benchmark entrypoints

  • infer Gemma 4 metadata and chat-template data for .litertlm bundle names so high-level LlamaEngine chat templating uses the Gemma 4 handler

  • run blocking LiteRT-LM send_message in a helper isolate so the backend worker can process cancellation while StreamProxy is unavailable

  • redownload stale LiteRT-LM runtime archives when cached checksums no longer match pinned release digests

  • validate the full platform-specific LiteRT-LM companion runtime library set after extraction so broken native bundles fail during build, not at runtime

  • invalidate already extracted LiteRT-LM runtime caches when the pinned runtime checksum changes

  • keep LlamaEngine.chatTemplate and ChatSession usable when the selected backend does not expose tokenization, using null template token counts and conservative prompt-size estimates for history pruning

  • reject unsupported LiteRT-LM LoRA operations consistently, including clearLoras, across service, worker, backend, and high-level engine paths

  • promote the LiteRT-LM runtime client/result/metrics API out of the experimental benchmark namespace, with deprecated benchmark wrappers for compatibility

  • cover the public LiteRT-LM runtime API exports on VM and web, including native-only web stubs that fail with UnsupportedError

  • keep the web-safe LiteRtLmBackend placeholder constructor aligned with the native constructor so cross-platform code compiles before reporting the native-only unsupported error

  • align the macOS/Pixel benchmark app metric output so LiteRT-LM and llama.cpp report comparable target-token, early-EOS, sampling-inclusive, and wall-throughput fields

  • harden direct LiteRtLmRuntimeClient validation and native handle cleanup for reinitialization, initialization-failure, and streaming-failure paths

  • give Pixel benchmark runs separate LiteRT-LM, llama.cpp, and combined log timeouts so full Gemma 4 comparisons can finish on slower GGUF/Vulkan runs

  • coalesce concurrent LiteRT-LM worker startup requests so simultaneous pre-load diagnostics cannot leak extra backend isolates

  • reject concurrent LiteRT-LM generations at the Dart backend before a second stream can overwrite active generation cleanup

  • cancel and close active LiteRT-LM generation streams before direct backend reload, context free, or model free requests

  • serialize regular LiteRT-LM worker requests through one service queue while keeping cancellation responsive, including queued-generation cancellation before native runtime entry

  • prevent immediate LiteRT-LM stream cancellation from sending a stale generation request after the caller response port has already closed

  • align LiteRT-LM native FFI coverage handling with real-runtime smoke coverage while keeping public runtime result/metrics value types unit-covered

  • clear disposed LiteRT-LM service runtime clients when replacement initialization fails so later tokenization or generation retries start from a clean client

  • dispose LiteRT-LM service runtime clients and clear context-scoped metrics when contexts are freed or replaced

  • align LiteRT-LM maxTokens <= 0 generation semantics with llama.cpp by treating such requests as a no-op that never initializes runtime generation

  • report backend/model-specific LiteRT-LM engine creation failures and make Pixel benchmark runs fail on BENCHMARK: ERROR

  • normalize direct LiteRT-LM CPU/GPU/NPU backend selectors across runtime, service, and direct-backend APIs before native initialization

  • reject non-positive LiteRT-LM contextSize values at model load instead of allowing a later native initialization failure

  • refresh any LiteRT-LM runtime client that was initialized before context creation so tokenization before contextCreate cannot leave later calls using stale native context settings

  • reject explicit LiteRT-LM backend changes and unsupported llama.cpp-only context-time model parameters during LiteRT-LM context creation

  • validate macOS LiteRT-LM cache and app framework directories against the full ABI-specific runtime library set before selection

  • validate LiteRT-LM web context creation with the same model-parameter rules as native, including explicit NPU rejection

  • make the benchmark app choose a fair llama.cpp backend by platform and record the requested backend in metrics

  • document LiteRT-LM platform support in the website quickstart, model lifecycle, support matrix, and native build hook guides

  • reject LiteRT-LM detokenize(..., special: true) requests instead of silently ignoring the llama.cpp-only special-token flag

  • reject all LiteRT-LM multimodal backend operations consistently instead of allowing no-op projector frees or false capability probes

  • reject multimodal content passed to direct LiteRT-LM chat-template application instead of stringifying image/audio maps into text prompts
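
As a rough illustration of the runtime-cache checks in the list above (invalidating an extracted cache when the pinned checksum changes, and validating the full companion library set), here is a minimal shell sketch. The helper name `cache_is_current` is hypothetical, and the layout it assumes, with the libraries sitting directly beside the marker file, is an assumption. The marker filename `.llamadart_litert_lm.sha256` is the one quoted in the verification notes. The real checks live in the Dart build hook and may differ.

```shell
#!/usr/bin/env bash
# Sketch only: cache_is_current is a hypothetical helper, not the hook's code.
set -u

# Succeed only when the marker digest equals the pinned digest and every
# listed library file exists in the extracted runtime directory.
cache_is_current() {
  local dir="$1" pinned="$2"
  shift 2
  local marker="$dir/.llamadart_litert_lm.sha256"
  [ -f "$marker" ] || return 1
  # Strip whitespace so a trailing newline in the marker does not matter.
  local recorded
  recorded=$(tr -d '[:space:]' < "$marker")
  [ "$recorded" = "$pinned" ] || return 1
  local lib
  for lib in "$@"; do
    [ -f "$dir/$lib" ] || return 1
  done
}

# Demo against a throwaway directory with placeholder files and a fake digest.
demo=$(mktemp -d)
printf '%s\n' "pinned-digest" > "$demo/.llamadart_litert_lm.sha256"
touch "$demo/libLiteRtLm.dylib" "$demo/libStreamProxy.dylib"
if cache_is_current "$demo" "pinned-digest" libLiteRtLm.dylib libStreamProxy.dylib; then
  echo "cache ok"
else
  echo "cache stale: re-extract"
fi
rm -rf "$demo"
```

Run as-is, the demo prints `cache ok`. Changing the pinned digest argument or removing one of the placeholder libraries makes the function fail, which is the point at which a build step would re-extract or redownload.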

Native release state

  • litert-lm-native PR #2 merged: StreamProxy runtime helper packaging
  • litert-lm-native PR #3 merged: iOS CLiteRTLM.xcframework slices packaged as dylib-style runtime archives plus iOS StreamProxy
  • litert-lm-native release v0.12.0 refreshed with runtime archives for Android, iOS device/simulators, macOS, Linux, and Windows
  • native release workflow passed: https://github.com/leehack/litert-lm-native/actions/runs/26502345176
  • release manifest validation passed locally: Release manifest lists 21 required runtime artifacts
  • all current v0.12.0 runtime archive SHA-256 digests are pinned in hook/build.dart
  • iOS archive contents verified locally:
    • ios/arm64/libLiteRtLm.dylib, ios/arm64/libStreamProxy.dylib
    • ios/arm64-sim/libLiteRtLm.dylib, ios/arm64-sim/libStreamProxy.dylib
    • ios/x64-sim/libLiteRtLm.dylib, ios/x64-sim/libStreamProxy.dylib
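
A local spot check like the ones listed above (digest pinning and archive contents) could be scripted roughly as follows. This is a sketch under stated assumptions: `sha256_of` and `archive_contains` are hypothetical helper names, and the demo archive and its `ios/arm64/...` layout are placeholders built on the fly, not the real release artifacts.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are hypothetical and are not taken from hook/build.dart.
set -u

# Print the SHA-256 hex digest of a file using whichever tool is installed.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

# Succeed when every expected path appears in the tar.gz listing.
archive_contains() {
  local archive="$1"
  shift
  local listing entry
  # Drop a leading "./" so "ios/..." and "./ios/..." entries compare equal.
  listing=$(tar -tzf "$archive" | sed 's|^\./||')
  for entry in "$@"; do
    if ! printf '%s\n' "$listing" | grep -Fxq -- "$entry"; then
      echo "missing: $entry" >&2
      return 1
    fi
  done
}

# Demo: build a throwaway archive with placeholder files, then check it.
work=$(mktemp -d)
mkdir -p "$work/ios/arm64"
touch "$work/ios/arm64/libLiteRtLm.dylib" "$work/ios/arm64/libStreamProxy.dylib"
tar -czf "$work/demo.tar.gz" -C "$work" ios
pinned=$(sha256_of "$work/demo.tar.gz")  # stand-in for a pinned release digest
if [ "$(sha256_of "$work/demo.tar.gz")" = "$pinned" ] &&
   archive_contains "$work/demo.tar.gz" \
     ios/arm64/libLiteRtLm.dylib ios/arm64/libStreamProxy.dylib; then
  echo "archive ok"
fi
rm -rf "$work"
```

For a real check, `pinned` would be the digest recorded for the release artifact rather than one computed from the same file, and the expected entries would be the platform's full library list.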

Verification

  • dart analyze

  • dart test test/unit/backends/litert_lm/litert_lm_service_test.dart test/unit/backends/litert_lm/worker_test.dart

    • verified detokenize(..., special: true) fails explicitly before initializing the LiteRT-LM runtime.
    • verified multimodal direct chat-template content returns a typed unsupported LiteRT-LM worker error instead of stringifying media maps.
    • verified failed LiteRT-LM service runtime-client replacement disposes both old and failed clients, then retries tokenization with a fresh client.
  • dart test test/unit/backends/litert_lm/litert_lm_service_test.dart test/unit/backends/litert_lm/worker_test.dart test/unit/backends/litert_lm/litert_lm_backend_test.dart

    • verified unsupported LiteRT-LM multimodal service, worker, and direct-backend calls fail explicitly.
  • dart run tool/testing/check_platform_boundaries.dart

  • dart test test/unit/hook/build_hook_litert_lm_integration_test.dart test/unit/backends/native/native_backend_test.dart test/unit/backends/litert_lm test/unit/core/engine/engine_test.dart test/unit/core/engine/chat_session_test.dart

  • dart test test/unit/backends/litert_lm/litert_lm_public_api_test.dart test/unit/backends/litert_lm/litert_lm_runtime_test.dart test/unit/experimental/litert_lm/litert_lm_benchmark_test.dart

  • dart test test/unit/backends/litert_lm/litert_lm_runtime_test.dart

    • verified macOS LiteRT-LM extracted-runtime cache discovery uses the
      current runtime ABI, including macos/x64 for Intel macOS instead of the
      arm64 cache layout.
  • dart test -p chrome test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart

    • verified the web-safe LiteRT-LM backend placeholder accepts the native constructor shape before failing with UnsupportedError.
  • dart test test/unit/backends/litert_lm/litert_lm_public_api_test.dart test/unit/backends/litert_lm/litert_lm_runtime_test.dart test/unit/backends/litert_lm/litert_lm_backend_test.dart test/unit/backends/native/native_backend_test.dart

  • dart test test/unit/backends/litert_lm/litert_lm_backend_test.dart

    • verified direct LiteRT-LM backend reload, context free, and model free cancel
      and close an active generation stream before sending teardown/reload requests.
  • dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "context"

    • verified LiteRT-LM context free and context recreation dispose the active
      runtime client and clear stale context-scoped performance metrics.
  • dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "maxTokens"

    • verified LiteRT-LM maxTokens <= 0 emits no chunks, does not start runtime
      generation, and clears stale context-scoped performance metrics.
  • dart test test/unit/backends/litert_lm/litert_lm_runtime_test.dart -n "engine create"

    • verified LiteRT-LM engine-creation diagnostics include backend, model, and
      fallback guidance for NPU/GPU failures.
  • dart test test/unit/backends/litert_lm/litert_lm_runtime_test.dart -n "backend"

  • dart test test/unit/backends/litert_lm/litert_lm_backend_test.dart -n "preferred backend diagnostics"

  • dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "backend preference"

    • verified direct LiteRT-LM backend selectors are normalized or rejected before
      native initialization.
  • dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "load-time model params"

    • verified non-positive LiteRT-LM context sizes fail during model load.
  • dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "createContext disposes pre-context runtime client"

    • verified tokenization before context creation disposes the pre-context runtime client and reinitializes with the created context size.
  • dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "backend changes during context creation"

    • verified explicit LiteRT-LM context-time backend switches are rejected instead of silently reusing the model-load backend.
  • dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "loads local litertlm bundles"

  • dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "backend preference"

  • dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one concise sentence about on-device Gemma." 16

    • latest Gemma 4 macOS LiteRT-LM e2e passed on head f089360ea: backend LiteRT-LM gpu, load 13 ms, wall 1017 ms, backend init 908.695 ms, prefill 117.54 tok/s, decode 45.83 tok/s for 16 decode tokens.
  • bash -n tool/litert_lm_pixel_benchmark.sh

  • dart test test/unit/backends/litert_lm/litert_lm_service_test.dart test/unit/backends/litert_lm/litert_lm_backend_test.dart test/unit/backends/litert_lm/worker_test.dart

  • dart test test/unit/backends/litert_lm

  • dart test test/unit/backends/native/native_backend_test.dart -n "litertlm"

  • dart analyze

  • dart run tool/testing/check_platform_boundaries.dart

  • git diff --check

  • dart test test/unit/backends/litert_lm/worker_test.dart

  • dart test test/unit/backends/litert_lm/litert_lm_service_test.dart test/unit/backends/litert_lm/litert_lm_backend_test.dart

    • verified LiteRT-LM reloads issue fresh model/context handles and reject
      stale direct-backend handles for metadata, context lookup, context free,
      and model free
  • dart test test/unit/backends/native/native_backend_test.dart

    • verified pre-load LlamaBackend() native GPU/VRAM diagnostics probe through
      llama.cpp without selecting a final backend, reuse that probe for GGUF
      loads, and dispose it before .litertlm routing
  • dart test test/unit/backends/native test/unit/backends/litert_lm

  • dart test test/unit/backends/litert_lm test/unit/backends/native/native_backend_test.dart

  • dart test test/unit/backends/litert_lm test/unit/experimental/litert_lm test/unit/backends/native

  • flutter pub publish --dry-run

  • dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one sentence about on-device Gemma." 32

    • latest Gemma 4 macOS LiteRT-LM e2e passed through public LlamaEngine(LlamaBackend()) auto-routing: backend LiteRT-LM gpu, load 11 ms, wall 1147 ms, prefill 190.61 tok/s, decode 76.73 tok/s for 32 decode tokens
    • output: On-device Gemma refers to the deployment of the Gemma model directly onto local hardware for processing, offering privacy and reduced latency.
  • dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one sentence about on-device Gemma." 16

    • post-runtime-hardening Gemma 4 macOS LiteRT-LM e2e passed: backend LiteRT-LM gpu, load 11 ms, wall 623 ms, backend init 850.76 ms, decode 125.41 tok/s for 16 decode tokens
  • dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one concise sentence about on-device Gemma." 16

    • post-maxTokens-parity Gemma 4 macOS LiteRT-LM e2e passed: backend LiteRT-LM gpu, load 15 ms, wall 952 ms, backend init 1051.79 ms, prefill 215.01 tok/s, decode 54.36 tok/s for 16 decode tokens
  • ADB=/Users/jhin.lee/Library/Android/sdk/platform-tools/adb DEVICE='adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp' TARGETS=litert_lm RUNS=1 WARMUPS=0 OUTPUT_TOKENS=16 BACKEND=gpu LITERT_LM_LOG_TIMEOUT=300 tool/litert_lm_pixel_benchmark.sh

    • post-maxTokens-parity Pixel 9 Pro Gemma 4 LiteRT-LM/GPU e2e passed: load 1826 ms, backend init 48587.78 ms, wall 38692 ms, prefill 15.19 tok/s, decode 2.30 tok/s, wall 0.41 tok/s, 16/16 eval tokens, no early EOS
  • ADB=/Users/jhin.lee/Library/Android/sdk/platform-tools/adb DEVICE='adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp' TARGETS=litert_lm RUNS=1 WARMUPS=0 OUTPUT_TOKENS=16 BACKEND=npu LITERT_LM_LOG_TIMEOUT=300 tool/litert_lm_pixel_benchmark.sh

    • Pixel 9 Pro Gemma 4 LiteRT-LM/NPU currently fails at engine creation with a clear diagnostic: LiteRT-LM engine creation failed for backend "npu" and model "gemma-4-E2B-it.litertlm". The Android NPU delegate may not support this device, OS, model, or bundle; try backend "gpu" or backend "cpu".
    • verified the Pixel benchmark script now exits nonzero when the app logs BENCHMARK: ERROR.
  • ADB=/Users/jhin.lee/Library/Android/sdk/platform-tools/adb DEVICE='adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp' TARGETS=litert_lm RUNS=1 WARMUPS=0 OUTPUT_TOKENS=16 BACKEND=gpu LITERT_LM_LOG_TIMEOUT=300 tool/litert_lm_pixel_benchmark.sh

    • post-benchmark-error-detection Pixel 9 Pro Gemma 4 LiteRT-LM/GPU e2e still passes: load 295 ms, backend init 8625.13 ms, wall 6177 ms, prefill 119.92 tok/s, decode 13.71 tok/s, wall 2.59 tok/s, 16/16 eval tokens, no early EOS
  • DECODE_TOKENS=256 tool/macos_fair_litert_vs_llamadart.sh

    • llama.cpp GGUF/Metal (gemma-4-E2B-it-Q4_K_S.gguf): load 887 ms, wall 1942 ms, prefill 24819.86 tok/s, decode 1280.29 tok/s before sampling, decode-with-sampling 135.57 tok/s, wall 131.31 tok/s, 255/256 eval tokens
    • LiteRT-LM/Metal (gemma-4-E2B-it.litertlm): high-level load 122 ms, backend init 4792.70 ms, wall 2219 ms, prefill 2685.39 tok/s, decode 116.80 tok/s, wall 115.37 tok/s, 256/256 eval tokens
    • macOS conclusion for this local run: warm generation is close but llama.cpp/GGUF is faster on wall throughput; LiteRT-LM has substantially higher cold backend init cost on macOS. This is not the Pixel/NPU result.
  • fresh macOS LiteRT-LM archive and extracted-cache marker verified:

    • 6fe694ccc895c904b173f2952b73b7698097eda18d8bff0210ea9fcf10ca3da9 .dart_tool/llamadart/litert_lm/0.12.0/litert-lm-native-runtime-macos-arm64-v0.12.0.tar.gz
    • .dart_tool/llamadart/litert_lm/0.12.0/macos/arm64/.llamadart_litert_lm.sha256 contains the same digest
    • extracted runtime contains libLiteRtLm.dylib, libStreamProxy.dylib, libGemmaModelConstraintProvider.dylib, libLiteRt.dylib, libLiteRtMetalAccelerator.dylib, libLiteRtTopKMetalSampler.dylib, libLiteRtTopKWebGpuSampler.dylib, and libLiteRtWebGpuAccelerator.dylib
  • dart run tool/macos_litert_lm_cancel_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal 250 512 15000

    • Gemma 4 macOS LiteRT-LM packaged-runtime cancellation e2e passed: completed before timeout in 1640 ms, 0 chunks, 0 characters, no stream error
  • DEVICE='adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp' TARGETS='litert_lm,llamadart' RUNS=3 WARMUPS=1 OUTPUT_TOKENS=256 BACKEND=gpu LOG_TIMEOUT=1200 tool/litert_lm_pixel_benchmark.sh

    • Pixel 9 Pro Gemma 4 LiteRT-LM/GPU completed the full 256-token fair target: load 2151 ms, backend init 56159.13 ms, wall 60770 ms, prefill 327.01 tok/s, decode 3.63 tok/s, wall 4.21 tok/s, 256/256 eval tokens, no early EOS
    • same Pixel 9 Pro llama.cpp/GGUF/Vulkan target did not reach BENCHMARK_DONE within 1200s for 256 tokens with 1 warmup and 3 measured runs
  • DEVICE='adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp' TARGETS='litert_lm,llamadart' RUNS=1 WARMUPS=0 OUTPUT_TOKENS=32 BACKEND=gpu LOG_TIMEOUT=900 tool/litert_lm_pixel_benchmark.sh

    • LiteRT-LM/GPU 32-token cold run: load 1999 ms, backend init 39619.01 ms, wall 29760 ms, prefill 32.21 tok/s, decode 4.24 tok/s, wall 1.08 tok/s, 32/32 eval tokens, no early EOS
    • llama.cpp/GGUF/Vulkan 32-token cold run: load 28363 ms, wall 73169 ms, resolved GPU layers 999, prefill 1.52 tok/s, decode 1.05 tok/s, decode-with-sampling 0.99 tok/s, wall 0.42 tok/s, 31/32 eval tokens, early EOS
    • Pixel conclusion for these runs: LiteRT-LM is materially faster on Pixel 9 Pro than the current llama.cpp/GGUF Vulkan path for Gemma 4; the 256-token llama.cpp comparison needs a longer timeout or fewer measured runs to finish.
  • dart test test/unit/hook

  • dart test test/unit/backends/litert_lm test/unit/hook

  • dart test

    • full local suite passed with 1434 passing and 67 skipped
  • dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=coverage/lcov.info --report-on=lib --check-ignore && dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70

    • LCOV line coverage: 77.45% (8503/10979)
    • lib/src/backends/litert_lm/litert_lm_runtime.dart reports LF:10/LH:10 after excluding the native FFI boundary; public runtime result/metrics value types remain covered by unit tests
  • git diff --check

  • bash tool/docs/build_site.sh

    • verified the Docusaurus docs build after installing local website dependencies with npm ci.
  • previous PR head b528d23b passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26545476874

  • previous PR head 9fe15218d passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26546215623

  • previous PR head 2c231ddac passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26546737845

  • previous PR head f395eaade passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26547300385

  • previous PR head 1bbe148ff passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26547977565

  • previous PR head 801d4a2c passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26548444620

  • previous PR head 5d1aad4d passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26548920357

  • previous PR head c603f502 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26549404276

  • previous PR head f35302ee passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26549985727

  • previous PR head 6fe7c5b2 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26550451076

  • previous PR head fbfc68e2 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26550993411

  • previous PR head 49d6f4b7a passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26551609361

  • previous PR head a6a8ecc93 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26552258626

  • previous PR head 264bc897f passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26552774614

  • previous PR head f2916cb0e passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26553204346

  • previous PR head 7ac7e7bcc passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26553701141

  • previous PR head f089360ea passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26554262975

  • current PR head 985e9a9ea passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26554765555

  • current PR head b6a5181b local coverage hardening passed:

    • dart test test/unit/backends/litert_lm/litert_lm_backend_test.dart
    • dart test test/unit/backends/litert_lm/worker_test.dart
    • dart test test/unit/backends/litert_lm
    • dart test test/unit/backends/native/native_backend_test.dart
    • dart analyze
    • dart run tool/testing/check_platform_boundaries.dart
    • git diff --check
    • dart test -p vm -j 1 --exclude-tags local-only --coverage=coverage
    • dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=coverage/lcov.info --report-on=lib --check-ignore && dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70
    • LCOV line coverage: 78.62% (8728/11102); file coverage: LiteRT-LM backend 92.01%, LiteRT-LM worker 94.52%, LiteRT-LM service 89.09%, native router 90.36%
  • current PR head b6a5181b passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26555534894

  • current PR head 9e057a1f1 LiteRT-LM service coverage hardening passed:

    • dart test test/unit/backends/litert_lm/litert_lm_service_test.dart
    • dart test test/unit/backends/litert_lm
    • dart analyze
    • dart run tool/testing/check_platform_boundaries.dart
    • git diff --check
    • dart test -p vm -j 1 --exclude-tags local-only --coverage=coverage
    • dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=coverage/lcov.info --report-on=lib --check-ignore
    • dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70
    • LCOV line coverage: 78.99% (8769/11102); LiteRT-LM service coverage: 99.49% (392/394), with only the internal no-model guard and macOS/Android cache-dir creation branch uncovered locally.
  • current PR head 9e057a1f1 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26556282010

  • current PR head a0f043da4 LiteRT-LM production-hardening coverage pass:

    • dart test test/unit/backends/litert_lm/litert_lm_backend_test.dart
    • dart test test/unit/backends/litert_lm
    • dart test test/unit/backends/native/native_backend_test.dart
    • dart test test/unit/backends/native
    • dart test test/unit/backends/litert_lm/worker_test.dart
    • dart test test/unit/backends/litert_lm/worker_messages_test.dart
    • dart analyze
    • dart run tool/testing/check_platform_boundaries.dart
    • git diff --check
    • dart test -p vm -j 1 --exclude-tags local-only --coverage=coverage
    • dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=coverage/lcov.info --report-on=lib --check-ignore
    • dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70
    • LCOV line coverage: 79.23% (8796/11102); key file coverage: native_backend.dart 100.00%, worker_messages.dart 100.00%, worker.dart 97.95%, litert_lm_service.dart 99.49%, litert_lm_backend.dart 93.75%.
  • current PR head a0f043da4 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26557380437

  • dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one concise sentence about on-device Gemma." 16

    • current-head Gemma 4 macOS LiteRT-LM e2e passed: backend LiteRT-LM gpu, load 14 ms, wall 1038 ms, backend init 935.581 ms, prefill 114.07 tok/s, decode 45.18 tok/s for 16/16 decode tokens.
    • output: On-device Gemma allows for powerful, privacy-preserving AI processing directly on local
  • current PR head 55d718b7 adds LiteRT-LM web backend routing:

    • .litertlm web URLs now route through a browser LiteRtLmBackend wrapper around the official @litert-lm/core Engine.create(...) / createConversation(...) API instead of falling through to the llama.cpp WebGPU bridge.
    • WebAutoBackend keeps .gguf and unknown sources on WebGpuLlamaBackend, switches delegates by model format, and disposes the previous delegate on format changes.
    • web LiteRT-LM supports URL/path loading, CPU/GPU backend selection, context sizing, streaming generation, stop-sequence cancellation, and high-level LlamaEngine / ChatSession flow.
    • web LiteRT-LM explicitly rejects unsupported tokenizer, embeddings, state persistence, LoRA, grammar, multimodal, stream-batching, thread-tuning, and NPU operations.
    • example/chat_app/web/index.html sets a default @litert-lm/core ESM module URL for .litertlm loads, while still allowing apps to override window.__llamadartLiteRtLmModuleUrl.
  • current PR head 55d718b7 web LiteRT-LM verification:

    • dart analyze
    • dart test -p chrome test/unit/backends/litert_lm/litert_lm_web_backend_test.dart test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart test/unit/backends/web/web_backend_test.dart
    • dart test test/unit/backends/litert_lm
    • dart run tool/testing/check_platform_boundaries.dart
    • dart test test/unit/backends/web test/unit/core/engine/engine_test.dart
    • git diff --check
    • dart test -p chrome test/unit/backends/webgpu/webgpu_backend_test.dart
    • dart test test/unit/backends/native/native_backend_test.dart test/unit/core/models/model_source_test.dart
    • ./tool/docs/build_site.sh
  • current PR head 6deefda9 fixes the CI mirrored-test-name gate by renaming the web LiteRT-LM test to test/unit/backends/litert_lm/litert_lm_backend_web_test.dart.

    • dart test test/unit/test_structure/mirrored_unit_structure_test.dart
    • dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart test/unit/backends/web/web_backend_test.dart
    • dart analyze
    • git diff --check
  • current PR head 6deefda9 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26558781841

  • current PR head 3744a1f18 documents LiteRT-LM web backend support across README, website quickstart, model lifecycle, platform support matrix, and the chat example README.

    • ./tool/docs/build_site.sh
    • git diff --check
  • current PR head 3744a1f18 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26559439872

  • current PR head 019486a80 adds a Gemma 4 LiteRT-LM chat app preset and keeps chat-app LiteRT-LM loads free of llama.cpp-only Android batch/thread tuning.

    • flutter test test/chat_service_test.dart test/model_asset_source_test.dart test/model_download_controller_adapter_test.dart
    • flutter test test/manage_models_screen_download_test.dart
    • dart analyze
    • flutter analyze from example/chat_app
    • git diff --check
  • current PR head 019486a80 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26560234889

  • current PR head b5590d305 adds repeatable Gemma 4 LiteRT-LM web smoke coverage and fixes the high-level web chat path:

    • .litertlm web GPU selection now uses GPU_ARTISAN when available, which is the backend accepted by the Gemma 4 web bundle.
    • web LiteRT-LM metadata exposes a passthrough latest-message template so @litert-lm/core Conversation can apply the bundle's own Gemma 4 template instead of receiving a pre-rendered prompt.
    • native and web LiteRT-LM generation reject only real media parts; text/tool/thinking parts already represented in the rendered prompt no longer block high-level LlamaEngine.create / ChatSession generation.
    • the chat app labels LiteRT-LM web GPU execution as WEBGPU instead of falling back to CPU.
    • tool/testing/run_local_e2e.dart now includes chat-app-web-litert-gemma4-smoke, prefers the repo-local Playwright Python venv when present, and exits cleanly after background server cleanup.
  • current PR head b5590d305 verification:

    • python3 -m py_compile tool/testing/playwright_chat_app_real_model_smoke.py
    • dart test test/unit/tooling/run_local_e2e_test.dart
    • dart test test/unit/backends/litert_lm/litert_lm_backend_test.dart
    • dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart
    • dart test test/unit/backends/litert_lm/litert_lm_service_test.dart
    • flutter test test/backend_utils_test.dart from example/chat_app
    • dart analyze
    • git diff --check
    • flutter build web --base-href=/example/chat_app/build/web/ from example/chat_app
    • dart run tool/testing/run_local_e2e.dart --scenario chat-app-web-litert-gemma4-smoke --skip-build
      • real Gemma 4 web LiteRT-LM e2e passed through example/chat_app: model load 39.8s, response source litert, expected response 4, runtime label WEBGPU, settings backend 2 / GPU_ARTISAN, final body reported avg 14.1 tok/s, first token 141 ms, total 142 ms.
  • current PR head 6a5dc083 fixes the Windows dry-run assertion for repo-local Playwright Python paths.

    • dart test test/unit/tooling/run_local_e2e_test.dart
    • dart format test/unit/tooling/run_local_e2e_test.dart
    • git diff --check
  • current PR head 6a5dc083 passed all CI checks after rerunning the transient Windows setup failure: https://github.com/leehack/llamadart/actions/runs/26563948801

  • current PR head 61eb4eccb hardens LiteRT-LM web backend production introspection:

    • WebAutoBackend now implements BackendEmbeddingsSupport and reports active delegate embedding support, so LiteRT-LM web advertises unsupported embeddings reliably.
    • LiteRT-LM web metadata now includes architecture, model name, context length, backend, and source URL alongside the passthrough chat template.
    • LiteRT-LM web multimodal capability/free calls now reject unsupported operations explicitly instead of silently no-oping or returning false.
  • current PR head 61eb4eccb verification:

    • dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart test/unit/backends/web/web_backend_test.dart
    • dart analyze
    • dart run tool/testing/check_platform_boundaries.dart
    • git diff --check
  • current PR head 61eb4eccb passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26564974130

  • current PR head dc11792e1 hardens LiteRT-LM web direct-backend lifecycle parity:

    • Web LiteRT-LM now issues monotonically increasing model/context handles instead of always returning 1.
    • Stale web model/context handles are rejected after reload for metadata, context lookup/free, and model free, matching the native LiteRT-LM lifecycle hardening.
    • Direct web chat-template calls now extract text parts and reject media content instead of stringifying image/audio maps into prompts.
  • current PR head dc11792e1 verification:

    • dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart
    • dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart test/unit/backends/web/web_backend_test.dart
    • dart analyze
    • dart run tool/testing/check_platform_boundaries.dart
    • git diff --check
  • current PR head dc11792e1 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26565806067
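
The handle behavior described for dc11792e1 (monotonically increasing handles, stale handles rejected after reload) can be illustrated with a minimal sketch. This is not the web backend's Dart code: `HandleRegistry`, `StaleHandleError`, and the method names are hypothetical and exist only to show the idea.

```python
from typing import Optional


class StaleHandleError(Exception):
    """Raised when a caller presents a handle that is no longer active."""


class HandleRegistry:
    """Hypothetical registry: every load gets a fresh, larger handle."""

    def __init__(self) -> None:
        self._next_handle = 1
        self._active: Optional[int] = None

    def load(self) -> int:
        # A reload replaces the active handle, so older handles become stale.
        handle = self._next_handle
        self._next_handle += 1
        self._active = handle
        return handle

    def require(self, handle: int) -> int:
        # Metadata lookups, context lookup/free, and model free would all
        # go through a check like this before doing any work.
        if handle != self._active:
            raise StaleHandleError(f"handle {handle} is not the active handle")
        return handle

    def free(self, handle: int) -> None:
        self.require(handle)
        self._active = None
```

Because handles are never reused, a caller that kept a handle from before a reload gets a lifecycle error instead of silently operating on the newly loaded model.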

  • current PR head 0bff13ffb aligns LiteRT-LM web direct diagnostics and LoRA lifecycle:

    • Direct web LiteRT-LM getVramInfo() now reports zero VRAM without requiring a loaded model, matching native LiteRT-LM/WebAuto diagnostic behavior.
    • Web LiteRT-LM LoRA adapter calls now validate the active context handle before reporting unsupported LoRA, so stale/no-context direct calls fail as lifecycle errors instead of masking stale handles.
  • current PR head 0bff13ffb verification:

    • dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart
    • dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart test/unit/backends/web/web_backend_test.dart
    • dart analyze
    • dart run tool/testing/check_platform_boundaries.dart
    • git diff --check
  • current PR head 0bff13ffb passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26566533318

  • current PR head 9da6d5d3b hardens remaining production-grade parity and benchmark fairness:

    • macOS cache/framework discovery now accepts only complete ABI-specific LiteRT-LM runtime installs before selecting them.
    • LiteRT-LM web contextCreate validates full ModelParams and rejects explicit NPU selection up front.
    • The benchmark app's llama.cpp/GGUF leg now defaults to Metal on Apple platforms and Vulkan on Android/Linux/Windows, records requestedBackend, and supports a LLAMADART_BACKEND override.
  • current PR head 9da6d5d3b verification:

    • dart test test/unit/backends/litert_lm/litert_lm_runtime_test.dart
    • dart test test/unit/backends/litert_lm
    • dart test test/unit/hook/build_hook_litert_lm_integration_test.dart
    • dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart
    • flutter test test/litert_lm_benchmark_app_test.dart from example/chat_app
    • dart analyze
    • dart run tool/testing/check_platform_boundaries.dart
    • git diff --check
    • dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one concise sentence about on-device Gemma." 16
      • current-head Gemma 4 macOS LiteRT-LM e2e passed after the benchmark-fairness changes: backend LiteRT-LM gpu, load 13 ms, wall 1048 ms, backend init 933.472 ms, prefill 111.32 tok/s, decode 44.78 tok/s for 16/16 decode tokens.
      • output: On-device Gemma allows for powerful, privacy-preserving AI processing directly on local
    • DEVICE=adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp TARGETS=litert_lm RUNS=1 WARMUPS=0 OUTPUT_TOKENS=16 BACKEND=gpu LITERT_LM_LOG_TIMEOUT=300 tool/litert_lm_pixel_benchmark.sh
      • current-head Pixel 9 Pro Gemma 4 LiteRT-LM/GPU e2e passed after the benchmark-fairness changes: load 222 ms, backend init 9498.49 ms, wall 6965 ms, prefill 82.25 tok/s, decode 13.49 tok/s, wall 2.30 tok/s, 16/16 eval tokens, no early EOS.
  • current PR head 9da6d5d3b passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26567730946
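
The llama.cpp/GGUF default-backend rule from 9da6d5d3b can be sketched as below. This is an illustration of the selection logic only: the function names and the platform strings are hypothetical, the `auto` fallback for unlisted platforms is an assumption, and the override is modeled as a plain mapping although the thread says the script forwards it as a Flutter dart-define.

```python
from typing import Mapping

APPLE_PLATFORMS = {"macos", "ios"}
VULKAN_PLATFORMS = {"android", "linux", "windows"}


def default_llama_cpp_backend(platform: str) -> str:
    """Default GPU backend for the llama.cpp/GGUF benchmark leg."""
    if platform in APPLE_PLATFORMS:
        return "metal"
    if platform in VULKAN_PLATFORMS:
        return "vulkan"
    # Assumption: anything else falls back to automatic selection.
    return "auto"


def requested_backend(platform: str, overrides: Mapping[str, str]) -> str:
    """Apply an explicit LLAMADART_BACKEND override, else the default."""
    override = overrides.get("LLAMADART_BACKEND", "").strip().lower()
    return override or default_llama_cpp_backend(platform)
```

Recording the value returned by `requested_backend` next to each result is what lets a reader tell a Vulkan run from a CPU fallback when comparing numbers.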

  • current PR head 6e14a4fae forwards the scripted llama.cpp backend override for Pixel benchmark runs:

    • tool/litert_lm_pixel_benchmark.sh now reads LLAMADART_BACKEND, prints it in the benchmark configuration, and passes it as a Flutter dart-define so scripted GGUF runs can force auto, cpu, vulkan, metal, etc.
  • current PR head 6e14a4fae verification:

    • bash -n tool/litert_lm_pixel_benchmark.sh
    • git diff --check HEAD
    • DEVICE=adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp TARGETS=llamadart RUNS=1 WARMUPS=0 OUTPUT_TOKENS=16 BACKEND=gpu LLAMADART_BACKEND=vulkan LLAMADART_LOG_TIMEOUT=900 tool/litert_lm_pixel_benchmark.sh
      • current-head Pixel 9 Pro Gemma 4 llama.cpp/GGUF e2e passed with requestedBackend: vulkan: load 9431 ms, wall 22889 ms, resolved GPU layers 999, prefill 5.41 tok/s, decode 1.42 tok/s, decode-with-sampling 1.37 tok/s, wall 0.66 tok/s, 15/16 eval tokens, early EOS.
  • current PR head 6e14a4fae passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26568475041

  • current PR head 792b7d5e3 tightens LiteRT-LM platform hook asset coverage:

    • Android, Linux x64, and Windows hook integration tests now assert LiteRT-LM companion runtime library filenames and package code-asset IDs, so platform hook coverage fails if .litertlm native runtime assets stop being emitted even when llama.cpp assets still pass.
  • current PR head 792b7d5e3 verification:

    • dart test test/unit/hook/build_hook_android_integration_test.dart test/unit/hook/build_hook_linux_integration_test.dart test/unit/hook/build_hook_integration_test.dart
    • dart analyze test/unit/hook/build_hook_android_integration_test.dart test/unit/hook/build_hook_linux_integration_test.dart test/unit/hook/build_hook_integration_test.dart
    • dart test test/unit/hook/build_hook_litert_lm_integration_test.dart
    • dart test test/unit/hook
    • git diff --check
  • current PR head 792b7d5e3 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26569144857

  • current PR head f9e909e60 hardens macOS LiteRT-LM runtime packaging:

    • tool/macos_litert_lm_prepare_app.sh now validates complete ABI-specific LiteRT-LM runtime directories before installing app frameworks, supports arm64/x64 runtime cache layouts, installs the full arm64 companion framework set, and fails early on partial explicit runtime dirs.
    • macOS runtime framework/cache/fallback loading now uses ABI-specific companion sets, so x64 does not inherit arm64-only Metal/WebGPU companion assumptions while arm64 validates and opens the full packaged companion set.
  • current PR head f9e909e60 verification:

    • bash -n tool/macos_litert_lm_prepare_app.sh
    • dart test test/unit/backends/litert_lm/litert_lm_runtime_test.dart
    • dart test test/unit/tooling/macos_litert_lm_prepare_app_script_test.dart
    • dart analyze lib/src/backends/litert_lm/litert_lm_runtime.dart test/unit/backends/litert_lm/litert_lm_runtime_test.dart test/unit/tooling/macos_litert_lm_prepare_app_script_test.dart
    • dart test test/unit/backends/litert_lm
    • dart test test/unit/tooling
    • dart run tool/testing/check_platform_boundaries.dart
    • dart analyze
    • git diff --check
    • dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one concise sentence about on-device Gemma." 16
      • Gemma 4 macOS LiteRT-LM e2e passed after the macOS runtime packaging hardening: backend LiteRT-LM gpu, load 13 ms, wall 2009 ms, backend init 1163.554 ms, prefill 18.65 tok/s, decode 43.73 tok/s for 16 decode tokens.
  • current PR head f9e909e60 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26569935734
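
The "accept only complete ABI-specific runtime installs" rule from f9e909e60 can be sketched as follows. The file names in `REQUIRED_BY_ABI` are placeholders, not the real LiteRT-LM companion library names, and both helper functions are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List, Optional

# Placeholder companion sets. The real file names are not listed in this
# thread; the point is only that arm64 and x64 require different sets.
REQUIRED_BY_ABI = {
    "arm64": ("runtime.dylib", "stream_proxy.dylib", "gpu_companion.dylib"),
    "x64": ("runtime.dylib", "stream_proxy.dylib"),
}


def missing_files(runtime_dir: Path, abi: str) -> List[str]:
    """Return the required files that are absent from runtime_dir."""
    required = REQUIRED_BY_ABI[abi]  # KeyError for an unknown ABI
    return [name for name in required if not (runtime_dir / name).is_file()]


def select_runtime_dir(candidates: Iterable[Path], abi: str) -> Optional[Path]:
    """Pick the first candidate that holds the complete set for this ABI."""
    for candidate in candidates:
        if candidate.is_dir() and not missing_files(candidate, abi):
            return candidate
    return None
```

Returning the list of missing files, rather than a bare boolean, makes it straightforward for a packaging script to fail early with a message that names the incomplete directory and what it lacks.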

  • current PR head a227134c5 clarifies backend-aware state-persistence unsupported messages:

    • LlamaEngine.stateSaveFile and stateLoadFile now preserve WebGPU bridge guidance for WebGPU backends while reporting LiteRT-LM-specific unsupported-state guidance for .litertlm models.
    • Tests cover generic, WebGPU, and LiteRT-LM unsupported-state paths so LiteRT-LM users no longer receive WebGPU bridge-version instructions.
  • current PR head a227134c5 verification:

    • dart test -p chrome test/unit/backends/web/web_backend_test.dart
    • dart test test/unit/core/engine/engine_test.dart test/unit/backends/native/native_backend_test.dart
    • dart analyze
    • git diff --check
  • current PR head a227134c5 passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26571165587
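
The backend-aware unsupported-state messages from a227134c5 amount to checking the most specific case first. The sketch below is an assumption about the shape of that logic: the function name and all three message strings are invented for illustration and are not the text LlamaEngine actually emits.

```python
def unsupported_state_message(backend_name: str, model_path: str) -> str:
    """Choose guidance for an unsupported stateSaveFile/stateLoadFile call."""
    # Check the .litertlm case first so LiteRT-LM users never receive
    # WebGPU bridge instructions, even when the backend name mentions WebGPU.
    if model_path.lower().endswith(".litertlm"):
        return "State persistence is not supported for LiteRT-LM models."
    if "webgpu" in backend_name.lower():
        return "State persistence needs a WebGPU bridge version that supports it."
    return "State persistence is not supported by the active backend."
```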

@codecov-commenter

codecov-commenter commented May 26, 2026

Codecov Report

❌ Patch coverage is 94.97374% with 67 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.37%. Comparing base (47b8497) to head (91cb325).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| lib/src/backends/litert_lm/litert_lm_backend.dart | 93.28% | 19 Missing ⚠️ |
| lib/src/hook/native_bundle_config.dart | 71.42% | 18 Missing ⚠️ |
| lib/src/core/engine/engine.dart | 86.81% | 12 Missing ⚠️ |
| lib/src/backends/litert_lm/litert_lm_service.dart | 98.45% | 6 Missing ⚠️ |
| ...rc/experimental/litert_lm/litert_lm_benchmark.dart | 75.00% | 6 Missing ⚠️ |
| lib/src/backends/litert_lm/worker.dart | 97.88% | 3 Missing ⚠️ |
| lib/src/backends/litert_lm/litert_lm_runtime.dart | 98.18% | 1 Missing ⚠️ |
| lib/src/core/engine/chat_session.dart | 90.00% | 1 Missing ⚠️ |
| lib/src/core/template/handlers/gemma4_handler.dart | 95.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #167      +/-   ##
==========================================
+ Coverage   78.48%   80.37%   +1.88%     
==========================================
  Files          76       84       +8     
  Lines        9975    11244    +1269     
==========================================
+ Hits         7829     9037    +1208     
- Misses       2146     2207      +61     

| Flag | Coverage Δ |
| --- | --- |
| unittests | 80.37% <94.97%> (+1.88%) ⬆️ |


@leehack leehack changed the title Add LiteRT-LM benchmark backend POC Add LiteRT-LM native backend support May 27, 2026
@leehack
Owner Author

leehack commented May 28, 2026

Rebased this PR onto origin/main at 29291222e (llamadart-native / llama.cpp tag b9371) and reran the Pixel 9 Pro benchmark against Gemma 4 E2B.

Setup:

  • Device: Pixel 9 Pro, wireless ADB adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp
  • LiteRT-LM model: gemma-4-E2B-it.litertlm, backend gpu
  • llama.cpp model: gemma-4-E2B-it-Q4_K_S.gguf, backend vulkan, gpuLayers=999
  • Runs: RUNS=1, WARMUPS=0

128 output tokens:

| Backend | Load | Wall | Prefill | Decode | Wall tok/s | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| LiteRT-LM gpu | 2.86s | 52.64s | 17.28 tok/s | 4.64 tok/s | 2.43 | 128/128 tokens |
| llama.cpp Vulkan b9371 | 20.79s | 87.68s | 5.47 tok/s | 1.78 tok/s | 1.45 | EOS at 127/128 |

32 output tokens:

| Backend | Load | Wall | Prefill | Decode | Wall tok/s | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| LiteRT-LM gpu | 0.18s | 7.12s | 106.54 tok/s | 14.99 tok/s | 4.50 | 32/32 tokens |
| llama.cpp Vulkan b9371 | 8.14s | 25.80s | 8.39 tok/s | 1.81 tok/s | 1.20 | EOS at 31/32 |

Conclusion: the updated llama.cpp/Vulkan backend in b9371 is much better than the previous benchmarked backend, especially load/prefill, but LiteRT-LM is still ahead on Pixel 9 Pro for this Gemma 4 E2B setup. The current gap is about 2.6x on 128-token decode and 1.7x on 128-token wall throughput; for short 32-token output it is about 8.3x decode and 3.7x wall throughput. The 128-token wall gap is smaller because LiteRT-LM still has a large cold backend init cost, while llama.cpp hit EOS slightly before the requested output budget.
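
The ratios quoted in this conclusion follow directly from the decode and wall tok/s columns of the two tables above. A short worked check, using only numbers already reported in this comment (the dictionary layout and the `speedup` helper are just for illustration):

```python
# tok/s values copied from the 128-token and 32-token tables in this comment.
RESULTS = {
    128: {
        "litert_lm": {"decode": 4.64, "wall": 2.43},
        "llama_cpp": {"decode": 1.78, "wall": 1.45},
    },
    32: {
        "litert_lm": {"decode": 14.99, "wall": 4.50},
        "llama_cpp": {"decode": 1.81, "wall": 1.20},
    },
}


def speedup(tokens: int, metric: str) -> float:
    """LiteRT-LM throughput divided by llama.cpp throughput."""
    row = RESULTS[tokens]
    return row["litert_lm"][metric] / row["llama_cpp"][metric]


for tokens in (128, 32):
    for metric in ("decode", "wall"):
        print(f"{tokens} tokens, {metric}: {speedup(tokens, metric):.2f}x")
```

The 32-token wall ratio is 4.50 / 1.20 = 3.75, which the comment reports as about 3.7x; the other three figures round to the quoted 2.6x, 1.7x, and 8.3x.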

Local validation before push:

  • dart test test/unit/backends/native/native_backend_test.dart -n "litertlm" passed, 12 tests.
  • flutter clean && flutter test test/litert_lm_benchmark_app_test.dart test/chat_service_test.dart test/backend_utils_test.dart test/unit_test.dart passed, 57 tests.
  • flutter analyze passed.
  • git diff --check passed.

@leehack
Owner Author

leehack commented May 28, 2026

Follow-up on the Pixel 9 Pro benchmark: the earlier rerun may have been affected by device power state. After checking the device, dumpsys power reported mWakefulness=Dozing. I reran with the phone awake, plugged in, and the benchmark app focused.

Important caveat: the benchmark app currently reports the last measured run when RUNS > 1, not an average/median, so multi-run output should be treated as a warm-run sample until the harness is improved.

Screen-awake 128-token run (RUNS=1, WARMUPS=0):

| Backend | Load | Wall | Prefill | Decode | Wall tok/s | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| LiteRT-LM gpu | 0.25s | 13.92s | 81.48 tok/s | 15.23 tok/s | 9.20 | 128/128 tokens |
| llama.cpp Vulkan b9371 | 12.52s | 137.97s | 5.49 tok/s | 1.05 tok/s | 0.92 | EOS at 127/128 |

Screen-awake 32-token cold run (RUNS=1, WARMUPS=0):

| Backend | Load | Wall | Prefill | Decode | Wall tok/s | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| LiteRT-LM gpu | 0.18s | 7.34s | 95.90 tok/s | 14.74 tok/s | 4.36 | 32/32 tokens |
| llama.cpp Vulkan b9371 | 8.38s | 38.74s | 8.32 tok/s | 1.05 tok/s | 0.80 | EOS at 31/32 |

Screen-awake 32-token warm sample (RUNS=3, WARMUPS=1, last measured run reported):

| Backend | Load | Wall | Prefill | Decode | Wall tok/s | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| LiteRT-LM gpu | 0.17s | 2.10s | 435.05 tok/s | 16.62 tok/s | 15.27 | last measured run |
| llama.cpp Vulkan b9371 | 8.48s | 38.94s | 14.87 tok/s | 0.93 tok/s | 0.80 | last measured run, EOS at 31/32 |

Conclusion update: the previous screen-off/dozing numbers should not be treated as final. The screen-awake rerun makes LiteRT-LM look much faster for this Pixel 9 Pro / Gemma 4 setup, but llama.cpp/Vulkan also showed significant variance versus the previous run. Before using these as publishable benchmark numbers, we should improve the harness to keep the device awake/unlocked, record thermal/cooling state, report every run or median/p95, and separate cold-start from steady-state measurements.
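
One possible shape for the "report every run or median/p95" improvement is sketched below. It is not the benchmark app's code: `summarize_runs` and `percentile` are hypothetical helpers, and the nearest-rank percentile is one of several reasonable definitions.

```python
import math
import statistics
from typing import Dict, List, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_runs(decode_tok_s: Sequence[float]) -> Dict[str, object]:
    """Keep every measured run and add median/p95 instead of only the last run."""
    runs: List[float] = list(decode_tok_s)
    return {
        "runs": runs,
        "median": statistics.median(runs),
        "p95": percentile(runs, 95),
        "last": runs[-1],
    }
```

Keeping the raw `runs` list alongside the summary preserves the current last-run value for comparison with earlier results, while making variance between runs visible. Cold-start and steady-state samples would still need to be summarized separately.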

@leehack leehack force-pushed the litert-lm-poc branch 2 times, most recently from a29ac6d to bcaefcf Compare May 28, 2026 15:39
@leehack leehack marked this pull request as ready for review May 28, 2026 19:07
Copilot AI review requested due to automatic review settings May 28, 2026 19:07

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@leehack
Owner Author

leehack commented May 29, 2026

Review findings from this pass:

  1. P1: CI is currently red from a stale Gemma 4 thinking expectation. test/unit/backends/native/native_backend_test.dart:576 still expects the LiteRT-LM prompt to end with <|turn>model\n, but the current Gemma 4 template correctly leaves thinking open and now ends with <|turn>model\n<|channel>thought\n. This is the failure in Linux/Web, macOS native, and Windows native jobs. The test expectation needs to be updated before this is mergeable.

  2. P1: LiteRT-LM web does not really support the claimed high-level chat/session/tool path yet. lib/src/backends/litert_lm/litert_lm_backend_web.dart:38 exposes a template that renders only the last message; generate() creates a fresh JS conversation at :196 and deletes it at :217; metadata documents this as a passthrough at :266. That means LlamaEngine.create(...) / ChatSession on web drop system prompts, prior history, and tool declarations before reaching @litert-lm/core. The docs currently claim high-level flow and thinking/tool parsing (README.md:465, website/docs/platforms/support-matrix.md:113), while the real web smoke disables both tools and thinking (tool/testing/playwright_chat_app_real_model_smoke.py:217). Either implement a real web chat-message/tool mapping or scope the docs/tests to simple single-turn text generation on LiteRT-LM web.

  3. P2: Android chat-app presets silently override Gemma 4 LiteRT-LM to very small limits. The Gemma 4 LiteRT preset declares contextSize: 8192 and maxTokens: 1024 in example/chat_app/lib/models/downloadable_model.dart:512, but applyModelPreset() clamps every Android .litertlm model to context 512 and max output 128 at example/chat_app/lib/providers/chat_provider.dart:1940 and :1964. That makes manual Gemma 4 thinking/tool testing easy to truncate and does not match the preset or benchmark docs. If we keep this mobile safety default, it should be explicit/model-specific and not silently override the preset used for validation.

Also filed the structured-output follow-up as #174.

@leehack
Copy link
Copy Markdown
Owner Author

leehack commented May 29, 2026

Addressed the review findings in 25c5ccf3f:

  • Updated the stale Gemma 4 LiteRT-LM test expectation for the open thinking channel.
  • Scoped LiteRT-LM web to single-turn text in backend metadata, tests, README, support matrix, backend-selection docs, and release notes.
  • Removed the hidden Android LiteRT-LM 512/128 preset clamp so the Gemma 4 LiteRT preset values are used for manual testing/benchmarking.

Validation:

  • dart test test/unit/backends/native/native_backend_test.dart
  • dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart
  • flutter test test/unit_test.dart in example/chat_app
  • dart analyze
  • flutter analyze in example/chat_app
  • git diff --check
  • npm run build in website

GitHub checks are green on the latest PR run.

@leehack leehack merged commit 5d5f2b7 into main May 30, 2026
8 checks passed
@leehack leehack deleted the litert-lm-poc branch May 30, 2026 00:52