Add LiteRT-LM native backend support #167
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #167 +/- ##
==========================================
+ Coverage 78.48% 80.37% +1.88%
==========================================
Files 76 84 +8
Lines 9975 11244 +1269
==========================================
+ Hits 7829 9037 +1208
- Misses 2146 2207 +61
Rebased this PR onto
Setup:
128 output tokens:
32 output tokens:
Conclusion: the updated llama.cpp/Vulkan backend in
Local validation before push:
Follow-up on the Pixel 9 Pro benchmark: the earlier rerun may have been affected by device power state. After checking the device,
Important caveat: the benchmark app currently reports the last measured run when
Screen-awake 128-token run (
Screen-awake 32-token cold run (
Screen-awake 32-token warm sample (
Conclusion update: the previous screen-off/dozing numbers should not be treated as final. The screen-awake rerun makes LiteRT-LM look much faster for this Pixel 9 Pro / Gemma 4 setup, but llama.cpp/Vulkan also showed significant variance versus the previous run. Before using these as publishable benchmark numbers, we should improve the harness to keep the device awake/unlocked, record thermal/cooling state, report every run or median/p95, and separate cold-start from steady-state measurements.
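The reporting suggestion above (every run or median/p95, with cold-start kept apart from steady-state) could look roughly like the sketch below. It is a minimal illustration only: `median`, `percentile`, and the sample values are hypothetical and are not taken from the benchmark app or from any run in this PR. The percentile uses the nearest-rank method, which is one of several common definitions.

```dart
import 'dart:math' as math;

/// Hypothetical helper: median of [samples], averaging the two middle values
/// when the sample count is even.
double median(List<double> samples) {
  if (samples.isEmpty) {
    throw ArgumentError('samples must not be empty');
  }
  final sorted = [...samples]..sort();
  final mid = sorted.length ~/ 2;
  return sorted.length.isOdd
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
}

/// Hypothetical helper: nearest-rank percentile, with [p] in the range 0..100.
double percentile(List<double> samples, double p) {
  if (samples.isEmpty) {
    throw ArgumentError('samples must not be empty');
  }
  if (p < 0 || p > 100) {
    throw ArgumentError('p must be within 0..100');
  }
  final sorted = [...samples]..sort();
  // Nearest rank is ceil(p / 100 * n), with a floor of rank 1 for p == 0.
  final rank = math.max(1, (p / 100 * sorted.length).ceil());
  return sorted[rank - 1];
}

void main() {
  // Made-up wall-time samples in milliseconds, NOT measurements from this PR.
  final coldStartMs = <double>[2400.0];
  final steadyStateMs = <double>[1200.0, 1400.0, 1300.0, 1100.0];

  // Cold-start runs are listed individually instead of being averaged in.
  print('cold-start runs (ms): $coldStartMs');
  print('steady-state runs (ms): $steadyStateMs');
  print('steady-state median (ms): ${median(steadyStateMs)}');
  print('steady-state p95 (ms): ${percentile(steadyStateMs, 95)}');
}
```

Printing every run next to the summary keeps outliers visible, which matters when device power or thermal state can change between runs.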
a29ac6d to bcaefcf
Review findings from this pass:
Addressed the review findings in
Validation:
GitHub checks are green on the latest PR run.
Summary
- add a LiteRT-LM native-assets hook path that downloads pinned `litert-lm-native` `v0.12.0` runtime archives
- require and emit the LiteRT-LM StreamProxy helper library alongside `LiteRtLm` on Android, iOS, macOS, Linux, and Windows
- route native `.litertlm` model bundles to a worker-isolate LiteRT-LM backend while keeping GGUF and unknown formats on llama.cpp
- keep `.litertlm` models on the same ModelSource/download/cache path as GGUF and add Pixel/macOS benchmark entrypoints
- infer Gemma 4 metadata and chat-template data for `.litertlm` bundle names so high-level `LlamaEngine` chat templating uses the Gemma 4 handler
- run blocking LiteRT-LM `send_message` in a helper isolate so the backend worker can process cancellation while StreamProxy is unavailable
- redownload stale LiteRT-LM runtime archives when cached checksums no longer match pinned release digests
- validate the full platform-specific LiteRT-LM companion runtime library set after extraction so broken native bundles fail during build, not at runtime
- invalidate already extracted LiteRT-LM runtime caches when the pinned runtime checksum changes
- keep `LlamaEngine.chatTemplate` and `ChatSession` usable when the selected backend does not expose tokenization, using `null` template token counts and conservative prompt-size estimates for history pruning
- reject unsupported LiteRT-LM LoRA operations consistently, including `clearLoras`, across service, worker, backend, and high-level engine paths
- promote the LiteRT-LM runtime client/result/metrics API out of the experimental benchmark namespace, with deprecated benchmark wrappers for compatibility
- cover the public LiteRT-LM runtime API exports on VM and web, including native-only web stubs that fail with `UnsupportedError`
- keep the web-safe `LiteRtLmBackend` placeholder constructor aligned with the native constructor so cross-platform code compiles before reporting the native-only unsupported error
- align the macOS/Pixel benchmark app metric output so LiteRT-LM and llama.cpp report comparable target-token, early-EOS, sampling-inclusive, and wall-throughput fields
- harden direct `LiteRtLmRuntimeClient` validation and native handle cleanup for reinitialization, initialization-failure, and streaming-failure paths
- give Pixel benchmark runs separate LiteRT-LM, llama.cpp, and combined log timeouts so full Gemma 4 comparisons can finish on slower GGUF/Vulkan runs
- coalesce concurrent LiteRT-LM worker startup requests so simultaneous pre-load diagnostics cannot leak extra backend isolates
- reject concurrent LiteRT-LM generations at the Dart backend before a second stream can overwrite active generation cleanup
- cancel and close active LiteRT-LM generation streams before direct backend reload, context free, or model free requests
- serialize regular LiteRT-LM worker requests through one service queue while keeping cancellation responsive, including queued-generation cancellation before native runtime entry
- prevent immediate LiteRT-LM stream cancellation from sending a stale generation request after the caller response port has already closed
- align LiteRT-LM native FFI coverage handling with real-runtime smoke coverage while keeping public runtime result/metrics value types unit-covered
- clear disposed LiteRT-LM service runtime clients when replacement initialization fails so later tokenization or generation retries start from a clean client
- dispose LiteRT-LM service runtime clients and clear context-scoped metrics when contexts are freed or replaced
- align LiteRT-LM `maxTokens <= 0` generation semantics with llama.cpp by making them no-op without initializing runtime generation
- report backend/model-specific LiteRT-LM engine creation failures and make Pixel benchmark runs fail on `BENCHMARK: ERROR`
- normalize direct LiteRT-LM CPU/GPU/NPU backend selectors across runtime, service, and direct-backend APIs before native initialization
- reject non-positive LiteRT-LM `contextSize` values at model load instead of allowing a later native initialization failure
- refresh any LiteRT-LM runtime client that was initialized before context creation so tokenization before `contextCreate` cannot leave later calls using stale native context settings
- reject explicit LiteRT-LM backend changes and unsupported llama.cpp-only context-time model parameters during LiteRT-LM context creation
- validate macOS LiteRT-LM cache and app framework directories against the full ABI-specific runtime library set before selection
- validate LiteRT-LM web context creation with the same model-parameter rules as native, including explicit NPU rejection
- make the benchmark app choose a fair llama.cpp backend by platform and record the requested backend in metrics
- document LiteRT-LM platform support in the website quickstart, model lifecycle, support matrix, and native build hook guides
- reject LiteRT-LM `detokenize(..., special: true)` requests instead of silently ignoring the llama.cpp-only special-token flag
- reject all LiteRT-LM multimodal backend operations consistently instead of allowing no-op projector frees or false capability probes
- reject multimodal content passed to direct LiteRT-LM chat-template application instead of stringifying image/audio maps into text prompts

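As an illustration of the "coalesce concurrent worker startup requests" item above, one common Dart pattern is to cache the in-flight startup `Future` so that simultaneous callers share it. The sketch below is a hedged example of that general pattern, not the code in this PR: `WorkerHandle`, `WorkerStarter`, and `ensureStarted` are hypothetical names, and the spawn callback stands in for whatever actually starts the backend isolate.

```dart
import 'dart:async';

/// Hypothetical stand-in for a backend worker handle (not the PR's type).
class WorkerHandle {
  WorkerHandle(this.id);
  final int id;
}

/// Hypothetical helper that starts a worker at most once per successful start.
class WorkerStarter {
  WorkerStarter(this._spawn);

  final Future<WorkerHandle> Function() _spawn;
  Future<WorkerHandle>? _pending;

  /// Returns the shared startup future. Callers that arrive while a start is
  /// in flight (or after it succeeded) receive the same future.
  Future<WorkerHandle> ensureStarted() {
    final pending = _pending;
    if (pending != null) return pending;

    final started = _spawn();
    _pending = started;
    // If startup fails, forget the failed future so a later call can retry.
    started.then<void>((_) {}, onError: (Object _, StackTrace __) {
      if (identical(_pending, started)) _pending = null;
    });
    return started;
  }
}

Future<void> main() async {
  var spawnCount = 0;
  final starter = WorkerStarter(() async {
    spawnCount++;
    // Simulated startup latency; a real worker would spawn an isolate here.
    await Future<void>.delayed(const Duration(milliseconds: 10));
    return WorkerHandle(spawnCount);
  });

  // Two simultaneous requests share a single startup.
  final handles = await Future.wait([
    starter.ensureStarted(),
    starter.ensureStarted(),
  ]);
  print('spawned: $spawnCount');
  print('same handle: ${identical(handles[0], handles[1])}');
}
```

Clearing the cached future on failure is a design choice in this sketch: it lets a retry start fresh rather than replaying the old error to every later caller.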
Native release state
- `litert-lm-native` PR Feat/build infra refinement #2 merged: StreamProxy runtime helper packaging
- `litert-lm-native` PR feat: implement LoRA support and training pipeline #3 merged: iOS `CLiteRTLM.xcframework` slices packaged as dylib-style runtime archives plus iOS StreamProxy
- `litert-lm-native` release `v0.12.0` refreshed with runtime archives for Android, iOS device/simulators, macOS, Linux, and Windows
- Release manifest lists 21 required runtime artifacts
- `v0.12.0` runtime archive SHA-256 digests are pinned in `hook/build.dart`
- `ios/arm64/libLiteRtLm.dylib`, `ios/arm64/libStreamProxy.dylib`
- `ios/arm64-sim/libLiteRtLm.dylib`, `ios/arm64-sim/libStreamProxy.dylib`
- `ios/x64-sim/libLiteRtLm.dylib`, `ios/x64-sim/libStreamProxy.dylib`

Verification
- `dart analyze`
- `dart test test/unit/backends/litert_lm/litert_lm_service_test.dart test/unit/backends/litert_lm/worker_test.dart`
  - `detokenize(..., special: true)` fails explicitly before initializing the LiteRT-LM runtime.
- `dart test test/unit/backends/litert_lm/litert_lm_service_test.dart test/unit/backends/litert_lm/worker_test.dart test/unit/backends/litert_lm/litert_lm_backend_test.dart`
- `dart run tool/testing/check_platform_boundaries.dart`
- `dart test test/unit/hook/build_hook_litert_lm_integration_test.dart test/unit/backends/native/native_backend_test.dart test/unit/backends/litert_lm test/unit/core/engine/engine_test.dart test/unit/core/engine/chat_session_test.dart`
- `dart test test/unit/backends/litert_lm/litert_lm_public_api_test.dart test/unit/backends/litert_lm/litert_lm_runtime_test.dart test/unit/experimental/litert_lm/litert_lm_benchmark_test.dart`
- `dart test test/unit/backends/litert_lm/litert_lm_runtime_test.dart`
  - current runtime ABI, including `macos/x64` for Intel macOS instead of the arm64 cache layout.
- `dart test -p chrome test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart`
  - `UnsupportedError`.
- `dart test test/unit/backends/litert_lm/litert_lm_public_api_test.dart test/unit/backends/litert_lm/litert_lm_runtime_test.dart test/unit/backends/litert_lm/litert_lm_backend_test.dart test/unit/backends/native/native_backend_test.dart`
- `dart test test/unit/backends/litert_lm/litert_lm_backend_test.dart`
  - and close an active generation stream before sending teardown/reload requests.
- `dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "context"`
  - runtime client and clear stale context-scoped performance metrics.
- `dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "maxTokens"`
  - `maxTokens <= 0` emits no chunks, does not start runtime generation, and clears stale context-scoped performance metrics.
- `dart test test/unit/backends/litert_lm/litert_lm_runtime_test.dart -n "engine create"`
  - fallback guidance for NPU/GPU failures.
- `dart test test/unit/backends/litert_lm/litert_lm_runtime_test.dart -n "backend"`
- `dart test test/unit/backends/litert_lm/litert_lm_backend_test.dart -n "preferred backend diagnostics"`
- `dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "backend preference"`
  - native initialization.
- `dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "load-time model params"`
- `dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "createContext disposes pre-context runtime client"`
- `dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "backend changes during context creation"`
- `dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "loads local litertlm bundles"`
- `dart test test/unit/backends/litert_lm/litert_lm_service_test.dart -n "backend preference"`
- `dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one concise sentence about on-device Gemma." 16`
  - `f089360ea`: backend `LiteRT-LM gpu`, load 13 ms, wall 1017 ms, backend init 908.695 ms, prefill 117.54 tok/s, decode 45.83 tok/s for 16 decode tokens.
- `bash -n tool/litert_lm_pixel_benchmark.sh`
- `dart test test/unit/backends/litert_lm/litert_lm_service_test.dart test/unit/backends/litert_lm/litert_lm_backend_test.dart test/unit/backends/litert_lm/worker_test.dart`
- `dart test test/unit/backends/litert_lm`
- `dart test test/unit/backends/native/native_backend_test.dart -n "litertlm"`
- `dart analyze`
- `dart run tool/testing/check_platform_boundaries.dart`
- `git diff --check`
- `dart test test/unit/backends/litert_lm/worker_test.dart`
- `dart test test/unit/backends/litert_lm/litert_lm_service_test.dart test/unit/backends/litert_lm/litert_lm_backend_test.dart`
  - stale direct-backend handles for metadata, context lookup, context free, and model free
- `dart test test/unit/backends/native/native_backend_test.dart`
  - `LlamaBackend()` native GPU/VRAM diagnostics probe through llama.cpp without selecting a final backend, reuse that probe for GGUF loads, and dispose it before `.litertlm` routing
- `dart test test/unit/backends/native test/unit/backends/litert_lm`
- `dart test test/unit/backends/litert_lm test/unit/backends/native/native_backend_test.dart`
- `dart test test/unit/backends/litert_lm test/unit/experimental/litert_lm test/unit/backends/native`
- `flutter pub publish --dry-run`
- `dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one sentence about on-device Gemma." 32`
  - `LlamaEngine(LlamaBackend())` auto-routing: backend `LiteRT-LM gpu`, load 11 ms, wall 1147 ms, prefill 190.61 tok/s, decode 76.73 tok/s for 32 decode tokens
  - `On-device Gemma refers to the deployment of the Gemma model directly onto local hardware for processing, offering privacy and reduced latency.`
- `dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one sentence about on-device Gemma." 16`
  - `LiteRT-LM gpu`, load 11 ms, wall 623 ms, backend init 850.76 ms, decode 125.41 tok/s for 16 decode tokens
- `dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one concise sentence about on-device Gemma." 16`
  - `LiteRT-LM gpu`, load 15 ms, wall 952 ms, backend init 1051.79 ms, prefill 215.01 tok/s, decode 54.36 tok/s for 16 decode tokens
- `ADB=/Users/jhin.lee/Library/Android/sdk/platform-tools/adb DEVICE='adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp' TARGETS=litert_lm RUNS=1 WARMUPS=0 OUTPUT_TOKENS=16 BACKEND=gpu LITERT_LM_LOG_TIMEOUT=300 tool/litert_lm_pixel_benchmark.sh`
- `ADB=/Users/jhin.lee/Library/Android/sdk/platform-tools/adb DEVICE='adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp' TARGETS=litert_lm RUNS=1 WARMUPS=0 OUTPUT_TOKENS=16 BACKEND=npu LITERT_LM_LOG_TIMEOUT=300 tool/litert_lm_pixel_benchmark.sh`
  - `LiteRT-LM engine creation failed for backend "npu" and model "gemma-4-E2B-it.litertlm". The Android NPU delegate may not support this device, OS, model, or bundle; try backend "gpu" or backend "cpu".`
  - `BENCHMARK: ERROR`.
- `ADB=/Users/jhin.lee/Library/Android/sdk/platform-tools/adb DEVICE='adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp' TARGETS=litert_lm RUNS=1 WARMUPS=0 OUTPUT_TOKENS=16 BACKEND=gpu LITERT_LM_LOG_TIMEOUT=300 tool/litert_lm_pixel_benchmark.sh`
- `DECODE_TOKENS=256 tool/macos_fair_litert_vs_llamadart.sh`
  - `gemma-4-E2B-it-Q4_K_S.gguf`): load 887 ms, wall 1942 ms, prefill 24819.86 tok/s, decode 1280.29 tok/s before sampling, decode-with-sampling 135.57 tok/s, wall 131.31 tok/s, 255/256 eval tokens
  - `gemma-4-E2B-it.litertlm`): high-level load 122 ms, backend init 4792.70 ms, wall 2219 ms, prefill 2685.39 tok/s, decode 116.80 tok/s, wall 115.37 tok/s, 256/256 eval tokens
- fresh macOS LiteRT-LM archive and extracted-cache marker verified:
  - `6fe694ccc895c904b173f2952b73b7698097eda18d8bff0210ea9fcf10ca3da9 .dart_tool/llamadart/litert_lm/0.12.0/litert-lm-native-runtime-macos-arm64-v0.12.0.tar.gz`
  - `.dart_tool/llamadart/litert_lm/0.12.0/macos/arm64/.llamadart_litert_lm.sha256` contains the same digest
  - `libLiteRtLm.dylib`, `libStreamProxy.dylib`, `libGemmaModelConstraintProvider.dylib`, `libLiteRt.dylib`, `libLiteRtMetalAccelerator.dylib`, `libLiteRtTopKMetalSampler.dylib`, `libLiteRtTopKWebGpuSampler.dylib`, and `libLiteRtWebGpuAccelerator.dylib`
- `dart run tool/macos_litert_lm_cancel_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal 250 512 15000`
- `DEVICE='adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp' TARGETS='litert_lm,llamadart' RUNS=3 WARMUPS=1 OUTPUT_TOKENS=256 BACKEND=gpu LOG_TIMEOUT=1200 tool/litert_lm_pixel_benchmark.sh`
  - `BENCHMARK_DONE` within 1200s for 256 tokens with 1 warmup and 3 measured runs
- `DEVICE='adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp' TARGETS='litert_lm,llamadart' RUNS=1 WARMUPS=0 OUTPUT_TOKENS=32 BACKEND=gpu LOG_TIMEOUT=900 tool/litert_lm_pixel_benchmark.sh`
- `dart test test/unit/hook`
- `dart test test/unit/backends/litert_lm test/unit/hook`
- `dart test`
- `dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=coverage/lcov.info --report-on=lib --check-ignore && dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70`
  - `lib/src/backends/litert_lm/litert_lm_runtime.dart` reports LF:10/LH:10 after excluding the native FFI boundary; public runtime result/metrics value types remain covered by unit tests
- `git diff --check`
- `bash tool/docs/build_site.sh`
  - `website` dependencies with `npm ci`.
- previous PR head
`b528d23b` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26545476874
- previous PR head `9fe15218d` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26546215623
- previous PR head `2c231ddac` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26546737845
- previous PR head `f395eaade` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26547300385
- previous PR head `1bbe148ff` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26547977565
- previous PR head `801d4a2c` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26548444620
- previous PR head `5d1aad4d` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26548920357
- previous PR head `c603f502` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26549404276
- previous PR head `f35302ee` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26549985727
- previous PR head `6fe7c5b2` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26550451076
- previous PR head `fbfc68e2` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26550993411
- previous PR head `49d6f4b7a` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26551609361
- previous PR head `a6a8ecc93` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26552258626
- previous PR head `264bc897f` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26552774614
- previous PR head `f2916cb0e` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26553204346
- previous PR head `7ac7e7bcc` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26553701141
- previous PR head `f089360ea` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26554262975
- current PR head `985e9a9ea` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26554765555
- current PR head
`b6a5181b` local coverage hardening passed:
  - `dart test test/unit/backends/litert_lm/litert_lm_backend_test.dart`
  - `dart test test/unit/backends/litert_lm/worker_test.dart`
  - `dart test test/unit/backends/litert_lm`
  - `dart test test/unit/backends/native/native_backend_test.dart`
  - `dart analyze`
  - `dart run tool/testing/check_platform_boundaries.dart`
  - `git diff --check`
  - `dart test -p vm -j 1 --exclude-tags local-only --coverage=coverage`
  - `dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=coverage/lcov.info --report-on=lib --check-ignore && dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70`
- current PR head `b6a5181b` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26555534894
- current PR head `9e057a1f1` LiteRT-LM service coverage hardening passed:
  - `dart test test/unit/backends/litert_lm/litert_lm_service_test.dart`
  - `dart test test/unit/backends/litert_lm`
  - `dart analyze`
  - `dart run tool/testing/check_platform_boundaries.dart`
  - `git diff --check`
  - `dart test -p vm -j 1 --exclude-tags local-only --coverage=coverage`
  - `dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=coverage/lcov.info --report-on=lib --check-ignore`
  - `dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70`
- current PR head `9e057a1f1` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26556282010
- current PR head `a0f043da4` LiteRT-LM production-hardening coverage pass:
  - `dart test test/unit/backends/litert_lm/litert_lm_backend_test.dart`
  - `dart test test/unit/backends/litert_lm`
  - `dart test test/unit/backends/native/native_backend_test.dart`
  - `dart test test/unit/backends/native`
  - `dart test test/unit/backends/litert_lm/worker_test.dart`
  - `dart test test/unit/backends/litert_lm/worker_messages_test.dart`
  - `dart analyze`
  - `dart run tool/testing/check_platform_boundaries.dart`
  - `git diff --check`
  - `dart test -p vm -j 1 --exclude-tags local-only --coverage=coverage`
  - `dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=coverage/lcov.info --report-on=lib --check-ignore`
  - `dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70`
  - `native_backend.dart` 100.00%, `worker_messages.dart` 100.00%, `worker.dart` 97.95%, `litert_lm_service.dart` 99.49%, `litert_lm_backend.dart` 93.75%.
- current PR head `a0f043da4` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26557380437
- `dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one concise sentence about on-device Gemma." 16`
  - `LiteRT-LM gpu`, load 14 ms, wall 1038 ms, backend init 935.581 ms, prefill 114.07 tok/s, decode 45.18 tok/s for 16/16 decode tokens.
  - `On-device Gemma allows for powerful, privacy-preserving AI processing directly on local`
- current PR head
`55d718b7` adds LiteRT-LM web backend routing:
  - `.litertlm` web URLs now route through a browser `LiteRtLmBackend` wrapper around official `@litert-lm/core` `Engine.create(...)` / `createConversation(...)` instead of falling through to the llama.cpp WebGPU bridge.
  - `WebAutoBackend` keeps `.gguf` and unknown sources on `WebGpuLlamaBackend`, switches delegates by model format, and disposes the previous delegate on format changes.
  - `LlamaEngine` / `ChatSession` flow.
  - `example/chat_app/web/index.html` sets a default `@litert-lm/core` ESM module URL for `.litertlm` loads, while still allowing apps to override `window.__llamadartLiteRtLmModuleUrl`.
- current PR head `55d718b7` web LiteRT-LM verification:
  - `dart analyze`
  - `dart test -p chrome test/unit/backends/litert_lm/litert_lm_web_backend_test.dart test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart test/unit/backends/web/web_backend_test.dart`
  - `dart test test/unit/backends/litert_lm`
  - `dart run tool/testing/check_platform_boundaries.dart`
  - `dart test test/unit/backends/web test/unit/core/engine/engine_test.dart`
  - `git diff --check`
  - `dart test -p chrome test/unit/backends/webgpu/webgpu_backend_test.dart`
  - `dart test test/unit/backends/native/native_backend_test.dart test/unit/core/models/model_source_test.dart`
  - `./tool/docs/build_site.sh`
- current PR head `6deefda9` fixes the CI mirrored-test-name gate by renaming the web LiteRT-LM test to `test/unit/backends/litert_lm/litert_lm_backend_web_test.dart`.
  - `dart test test/unit/test_structure/mirrored_unit_structure_test.dart`
  - `dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart test/unit/backends/web/web_backend_test.dart`
  - `dart analyze`
  - `git diff --check`
- current PR head `6deefda9` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26558781841
- current PR head `3744a1f18` documents LiteRT-LM web backend support across README, website quickstart, model lifecycle, platform support matrix, and the chat example README.
  - `./tool/docs/build_site.sh`
  - `git diff --check`
- current PR head `3744a1f18` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26559439872
- current PR head `019486a80` adds a Gemma 4 LiteRT-LM chat app preset and keeps chat-app LiteRT-LM loads free of llama.cpp-only Android batch/thread tuning.
  - `flutter test test/chat_service_test.dart test/model_asset_source_test.dart test/model_download_controller_adapter_test.dart`
  - `flutter test test/manage_models_screen_download_test.dart`
  - `dart analyze`
  - `flutter analyze` from `example/chat_app`
  - `git diff --check`
- current PR head `019486a80` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26560234889
- current PR head `b5590d305` adds repeatable Gemma 4 LiteRT-LM web smoke coverage and fixes the high-level web chat path:
  - `.litertlm` web GPU selection now uses `GPU_ARTISAN` when available, which is the backend accepted by the Gemma 4 web bundle.
  - `@litert-lm/core` Conversation can apply the bundle's own Gemma 4 template instead of receiving a pre-rendered prompt.
  - `LlamaEngine.create` / `ChatSession` generation.
  - `WEBGPU` instead of falling back to `CPU`.
  - `tool/testing/run_local_e2e.dart` now includes `chat-app-web-litert-gemma4-smoke`, prefers the repo-local Playwright Python venv when present, and exits cleanly after background server cleanup.
- current PR head `b5590d305` verification:
  - `python3 -m py_compile tool/testing/playwright_chat_app_real_model_smoke.py`
  - `dart test test/unit/tooling/run_local_e2e_test.dart`
  - `dart test test/unit/backends/litert_lm/litert_lm_backend_test.dart`
  - `dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart`
  - `dart test test/unit/backends/litert_lm/litert_lm_service_test.dart`
  - `flutter test test/backend_utils_test.dart` from `example/chat_app`
  - `dart analyze`
  - `git diff --check`
  - `flutter build web --base-href=/example/chat_app/build/web/` from `example/chat_app`
  - `dart run tool/testing/run_local_e2e.dart --scenario chat-app-web-litert-gemma4-smoke --skip-build`
  - `example/chat_app`: model load 39.8s, response source `litert`, expected response `4`, runtime label `WEBGPU`, settings backend `2/GPU_ARTISAN`, final body reported avg 14.1 tok/s, first token 141 ms, total 142 ms.
- current PR head `6a5dc083` fixes the Windows dry-run assertion for repo-local Playwright Python paths.
  - `dart test test/unit/tooling/run_local_e2e_test.dart`
  - `dart format test/unit/tooling/run_local_e2e_test.dart`
  - `git diff --check`
- current PR head `6a5dc083` passed all CI checks after rerunning the transient Windows setup failure: https://github.com/leehack/llamadart/actions/runs/26563948801
- current PR head
`61eb4eccb` hardens LiteRT-LM web backend production introspection:
  - `WebAutoBackend` now implements `BackendEmbeddingsSupport` and reports active delegate embedding support, so LiteRT-LM web advertises unsupported embeddings reliably.
- current PR head `61eb4eccb` verification:
  - `dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart test/unit/backends/web/web_backend_test.dart`
  - `dart analyze`
  - `dart run tool/testing/check_platform_boundaries.dart`
  - `git diff --check`
- current PR head `61eb4eccb` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26564974130
- current PR head `dc11792e1` hardens LiteRT-LM web direct-backend lifecycle parity:
  - `1`.
- current PR head `dc11792e1` verification:
  - `dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart`
  - `dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart test/unit/backends/web/web_backend_test.dart`
  - `dart analyze`
  - `dart run tool/testing/check_platform_boundaries.dart`
  - `git diff --check`
- current PR head `dc11792e1` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26565806067
- current PR head `0bff13ffb` aligns LiteRT-LM web direct diagnostics and LoRA lifecycle:
  - `getVramInfo()` now reports zero VRAM without requiring a loaded model, matching native LiteRT-LM/WebAuto diagnostic behavior.
- current PR head `0bff13ffb` verification:
  - `dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart`
  - `dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart test/unit/backends/litert_lm/litert_lm_web_public_api_test.dart test/unit/backends/web/web_backend_test.dart`
  - `dart analyze`
  - `dart run tool/testing/check_platform_boundaries.dart`
  - `git diff --check`
- current PR head `0bff13ffb` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26566533318
- current PR head `9da6d5d3b` hardens remaining production-grade parity and benchmark fairness:
  - `contextCreate` validates full `ModelParams` and rejects explicit NPU selection up front.
  - `requestedBackend`, and supports `LLAMADART_BACKEND` override.
- current PR head `9da6d5d3b` verification:
  - `dart test test/unit/backends/litert_lm/litert_lm_runtime_test.dart`
  - `dart test test/unit/backends/litert_lm`
  - `dart test test/unit/hook/build_hook_litert_lm_integration_test.dart`
  - `dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart`
  - `flutter test test/litert_lm_benchmark_app_test.dart` from `example/chat_app`
  - `dart analyze`
  - `dart run tool/testing/check_platform_boundaries.dart`
  - `git diff --check`
  - `dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one concise sentence about on-device Gemma." 16`
    - `LiteRT-LM gpu`, load 13 ms, wall 1048 ms, backend init 933.472 ms, prefill 111.32 tok/s, decode 44.78 tok/s for 16/16 decode tokens.
    - `On-device Gemma allows for powerful, privacy-preserving AI processing directly on local`
  - `DEVICE=adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp TARGETS=litert_lm RUNS=1 WARMUPS=0 OUTPUT_TOKENS=16 BACKEND=gpu LITERT_LM_LOG_TIMEOUT=300 tool/litert_lm_pixel_benchmark.sh`
- current PR head `9da6d5d3b` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26567730946
- current PR head
`6e14a4fae` forwards the scripted llama.cpp backend override for Pixel benchmark runs:
  - `tool/litert_lm_pixel_benchmark.sh` now reads `LLAMADART_BACKEND`, prints it in the benchmark configuration, and passes it as a Flutter dart-define so scripted GGUF runs can force `auto`, `cpu`, `vulkan`, `metal`, etc.
- current PR head `6e14a4fae` verification:
  - `bash -n tool/litert_lm_pixel_benchmark.sh`
  - `git diff --check HEAD`
  - `DEVICE=adb-47031FDAP0011K-gqNYae._adb-tls-connect._tcp TARGETS=llamadart RUNS=1 WARMUPS=0 OUTPUT_TOKENS=16 BACKEND=gpu LLAMADART_BACKEND=vulkan LLAMADART_LOG_TIMEOUT=900 tool/litert_lm_pixel_benchmark.sh`
    - `requestedBackend: vulkan`: load 9431 ms, wall 22889 ms, resolved GPU layers 999, prefill 5.41 tok/s, decode 1.42 tok/s, decode-with-sampling 1.37 tok/s, wall 0.66 tok/s, 15/16 eval tokens, early EOS.
- current PR head `6e14a4fae` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26568475041
- current PR head `792b7d5e3` tightens LiteRT-LM platform hook asset coverage:
  - `.litertlm` native runtime assets stop being emitted even when llama.cpp assets still pass.
- current PR head `792b7d5e3` verification:
  - `dart test test/unit/hook/build_hook_android_integration_test.dart test/unit/hook/build_hook_linux_integration_test.dart test/unit/hook/build_hook_integration_test.dart`
  - `dart analyze test/unit/hook/build_hook_android_integration_test.dart test/unit/hook/build_hook_linux_integration_test.dart test/unit/hook/build_hook_integration_test.dart`
  - `dart test test/unit/hook/build_hook_litert_lm_integration_test.dart`
  - `dart test test/unit/hook`
  - `git diff --check`
- current PR head `792b7d5e3` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26569144857
- current PR head `f9e909e60` hardens macOS LiteRT-LM runtime packaging:
  - `tool/macos_litert_lm_prepare_app.sh` now validates complete ABI-specific LiteRT-LM runtime directories before installing app frameworks, supports arm64/x64 runtime cache layouts, installs the full arm64 companion framework set, and fails early on partial explicit runtime dirs.
- current PR head `f9e909e60` verification:
  - `bash -n tool/macos_litert_lm_prepare_app.sh`
  - `dart test test/unit/backends/litert_lm/litert_lm_runtime_test.dart`
  - `dart test test/unit/tooling/macos_litert_lm_prepare_app_script_test.dart`
  - `dart analyze lib/src/backends/litert_lm/litert_lm_runtime.dart test/unit/backends/litert_lm/litert_lm_runtime_test.dart test/unit/tooling/macos_litert_lm_prepare_app_script_test.dart`
  - `dart test test/unit/backends/litert_lm`
  - `dart test test/unit/tooling`
  - `dart run tool/testing/check_platform_boundaries.dart`
  - `dart analyze`
  - `git diff --check`
  - `dart run tool/macos_litert_lm_engine_smoke.dart .dart_tool/litert_lm_models/gemma-4-E2B-it.litertlm metal "Write one concise sentence about on-device Gemma." 16`
    - `LiteRT-LM gpu`, load 13 ms, wall 2009 ms, backend init 1163.554 ms, prefill 18.65 tok/s, decode 43.73 tok/s for 16 decode tokens.
- current PR head `f9e909e60` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26569935734
- current PR head `a227134c5` clarifies backend-aware state-persistence unsupported messages:
  - `LlamaEngine.stateSaveFile` and `stateLoadFile` now preserve WebGPU bridge guidance for WebGPU backends while reporting LiteRT-LM-specific unsupported-state guidance for `.litertlm` models.
- current PR head `a227134c5` verification:
  - `dart test -p chrome test/unit/backends/web/web_backend_test.dart`
  - `dart test test/unit/core/engine/engine_test.dart test/unit/backends/native/native_backend_test.dart`
  - `dart analyze`
  - `git diff --check`
- current PR head `a227134c5` passed all CI checks: https://github.com/leehack/llamadart/actions/runs/26571165587
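
For readers following the extracted-cache marker check mentioned in the verification notes (the `.llamadart_litert_lm.sha256` file containing the pinned digest), the comparison could be sketched as below. This is an assumption-laden illustration, not the logic in `hook/build.dart`: `isExtractedCacheFresh` is a hypothetical function, the digests are placeholders, and only the marker file name is taken from this PR.

```dart
import 'dart:io';

/// Hypothetical check: returns true when [cacheDir] contains a marker file
/// whose recorded digest equals [pinnedDigest]. A missing or mismatching
/// marker means the extracted cache should be treated as stale.
bool isExtractedCacheFresh(Directory cacheDir, String pinnedDigest) {
  final marker = File('${cacheDir.path}/.llamadart_litert_lm.sha256');
  if (!marker.existsSync()) return false;
  final recorded = marker.readAsStringSync().trim().toLowerCase();
  return recorded == pinnedDigest.trim().toLowerCase();
}

void main() {
  final dir = Directory.systemTemp.createTempSync('litert_lm_cache_demo');
  try {
    // Placeholder digests for the demo, NOT real release checksums.
    const previouslyPinned = 'aaaa1111';
    const newlyPinned = 'bbbb2222';

    File('${dir.path}/.llamadart_litert_lm.sha256')
        .writeAsStringSync('$previouslyPinned\n');

    // The marker matches the digest it was written with.
    print(isExtractedCacheFresh(dir, previouslyPinned));
    // After the pinned digest changes, the same cache is reported stale.
    print(isExtractedCacheFresh(dir, newlyPinned));
  } finally {
    dir.deleteSync(recursive: true);
  }
}
```

Comparing a small marker file avoids rehashing every extracted library on each build, at the cost of trusting that the marker was written only after a verified extraction.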