feat(api): expose load-time tuning knobs on ModelParams#116
Conversation
Add the following fields to `ModelParams`, all mapped to existing fields on `llama_model_params` / `llama_context_params`: - useMmap (bool, default true) -> use_mmap - useMlock (bool, default false) -> use_mlock - flashAttention (FlashAttention enum: auto/enabled/disabled, default auto) -> flash_attn_type - cacheTypeK / cacheTypeV (KvCacheType enum: f16/q8_0/q4_0, default f16) -> type_k / type_v - kvUnified (bool?, default null = current heuristic) -> kv_unified - ropeFrequencyBase / ropeFrequencyScale (double?, default null = model's trained value) -> rope_freq_base / rope_freq_scale The defaults preserve current behavior. User-explicit settings are applied after the existing platform/backend heuristics so they win. Quality-of-life: when a non-F16 KV cache type is requested with `flashAttention: auto`, the service auto-promotes flash attention to enabled (llama.cpp refuses non-F16 KV cache without it). The motivation for this change is matching what other llama.cpp wrappers expose so memory-constrained mobile callers can run larger context windows. With Q8_0 KV the cache memory roughly halves vs F16, which on a 12 GB Android device is the difference between running a 12B model at n_ctx=4096 vs 8192. Tests cover the new defaults, copyWith propagation, and enum surface. Mirrored unit-structure test now sees sibling tests for both new config files. Native binaries are unaffected; the underlying struct fields were already in the autogenerated bindings.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #116 +/- ##
==========================================
+ Coverage 76.59% 76.71% +0.11%
==========================================
Files 68 69 +1
Lines 8678 8726 +48
==========================================
+ Hits 6647 6694 +47
- Misses 2031 2032 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
leehack
left a comment
There was a problem hiding this comment.
I left two API-focused comments. The overall direction looks useful, but I think the nullable copyWith reset behavior needs tightening before merge, and the KV-cache/flash-attention invalid-combination behavior should be made explicit.
…e nullables Two issues raised by maintainer review: 1. copyWith couldn't clear nullable fields back to null — `field ?? this.field` is indistinguishable from "argument omitted, keep current value". A user toggling an override off in a settings UI would be stuck with the previous value. Added explicit `clear*: bool = false` flags for all four nullable fields (chatTemplate, kvUnified, ropeFrequencyBase, ropeFrequencyScale). When the flag is set, the field becomes null regardless of any value passed for the field itself. Without the flag, behavior is unchanged (passing null still means "keep"). 2. q8_0/q4_0 KV cache types with flashAttention=disabled passed through to native and produced a cryptic llama.cpp runtime error. Added validation in the constructor that throws ArgumentError early with an actionable message. The auto-promote logic in llama_cpp_service still handles the flashAttention=auto case correctly — only the explicit `disabled` combination is rejected. Constructor lost `const` because it now has a body. Existing tests updated to use plain construction. 11 new tests cover the validation branches (5) and clear-flag behavior (6, including a regression test for the no-clear-without-flag legacy path). Total 16/16 VM tests passing.
Codecov flagged 13 uncovered lines in llama_cpp_service.dart from PR leehack#116 (the FA auto-promote switch and the ggml_type mapping). Both are deterministic transforms that didn't deserve to live in the integration-only loadModel path. - Extracted to lib/src/backends/llama_cpp/load_param_helpers.dart as `ggmlTypeFor` and `resolveFlashAttention` — pure functions, no FFI side effects. - Service calls them in the same place as before; behaviour unchanged. - 9 unit tests in load_param_helpers_test.dart cover all 3 KV cache types × switch arms and the four FA × KV cases (auto+F16 passthrough, auto+non-F16 promote, explicit enabled passthrough, explicit disabled+F16 passthrough — disabled+non-F16 is rejected upstream by ModelParams constructor, not this helper). Patch coverage on the new code is now 100% on the testable parts; remaining uncovered lines are pure FFI struct field assignments inside loadModel (kvUnified/ropeFreq* setters), which are trivially correct and don't add value to test in isolation.
Codecov flagged 13 uncovered lines from PR leehack#116 in loadModel — the FA switch, the ggml_type mapping, and the FFI struct setters. Previous commit moved the pure mappings to a helper but the struct-setting code was still inline in loadModel and untested. Real fix: extract `applyModelParams(mparams, params)` and `applyContextParams(ctxParams, params)` as functions that take the already-allocated FFI structs. Tests use `calloc<llama_model_params>()` and `calloc<llama_context_params>()` to build structs in pure Dart, call the helpers, then assert on field values. No model load needed. The remaining loadModel code is two function calls plus a one-line log when FA was auto-promoted — trivially correct. 20 tests in load_param_helpers_test.dart now cover: - ggmlTypeFor: all 3 enum branches - resolveFlashAttention: auto/enabled/disabled × F16/non-F16 matrix - applyModelParams: writes use_mmap + use_mlock from params (default + overridden) - applyContextParams: writes type_k/type_v, FA enabled/disabled, FA auto-promote on Q8 KV, kvUnified null preserves struct field unchanged + non-null writes through, ropeFreq* same semantics, return value matches resolved FA Verified locally with `dart pub global run coverage:format_coverage`: load_param_helpers.dart hits LH=LF on every line. The two remaining uncovered lines from the patch are pure FFI imports + the helper call sites in loadModel itself, which can't be tested without loading a real model — they're trivially safe (1-line forwarding calls).
…te() My previous review-fix commit (83f7257) added the FA/KV validation in the constructor body, which forced removing `const`. That broke existing `const ModelParams()` defaults in llamadart's own engine.dart (line 132, 172) plus any external caller using a const context. Now: const constructor restored. New `ModelParams.validate()` method checks the same invariant; LlamaCppService.loadModel calls it before any native work so users still get the early Dart-side ArgumentError the maintainer asked for, without breaking backwards-compat for const callers. Tests updated: validate() can be called on const-constructed instances, returns normally for valid combos, throws ArgumentError for the (non-F16 KV, FA disabled) combo. 36/36 VM tests passing.
leehack
left a comment
There was a problem hiding this comment.
Native mapping looks good overall. I’d only block on the current CI issues:
- Formatting is failing on four touched files; please run
dart format .. load_param_helpers_test.dartimportsdart:ffi/native bindings but is included in the Chrome test run. Please mark it VM-only with@TestOn('vm')or exclude it from browser tests.
Web note: the pinned web bridge does not expose these new load-time knobs yet, but I’m fine treating that as a follow-up bridge/assets update rather than blocking this PR.
Summary
Surface eight
llama_model_params/llama_context_paramsfields thatalready exist in the autogenerated bindings but aren't reachable from the
public
ModelParamsAPI:ModelParamsfielduseMmapllama_model_params.use_mmaptrue(current hardcoded value)useMlockllama_model_params.use_mlockfalseflashAttentionllama_context_params.flash_attn_typeFlashAttention.autocacheTypeKllama_context_params.type_kKvCacheType.f16cacheTypeVllama_context_params.type_vKvCacheType.f16kvUnifiedllama_context_params.kv_unifiednull(keeps current heuristic)ropeFrequencyBasellama_context_params.rope_freq_basenull(model's value)ropeFrequencyScalellama_context_params.rope_freq_scalenull(model's value)Two new public enums (
FlashAttention,KvCacheType) live next toGpuBackendunderlib/src/core/models/config/and are re-exportedfrom the main library entry.
Why
Memory-constrained mobile callers can't currently run KV-cache
quantization, which is the single biggest win for fitting larger
context windows on phones. With
cacheTypeK = q8_0the KV cacheroughly halves vs F16 — on a 12 GB Android device, that's the
difference between running a 12B model at n_ctx=4096 vs 8192. The
underlying llama.cpp parameters were already in the FFI bindings;
this PR just makes them user-reachable.
Other wrappers (e.g. llama.rn, used by PocketPal-AI) hardcode mmap to
falseon Android for throughput reasons. ExposinguseMmapletsllamadart consumers do the same when measurements warrant it.
Behavior
Defaults preserve current behavior:
useMmap = truematches the previously hardcoded value.flashAttention = auto,kvUnified = null,rope* = nullareno-ops that defer to the existing heuristics.
cacheTypeK = cacheTypeV = f16matches llama.cpp's defaults.User-explicit settings are applied after the existing
platform/backend heuristics in
llama_cpp_service.dartso they win. One quality-of-life: when theuser requests non-F16 KV cache with
flashAttention: auto, the serviceauto-promotes flash attention to
enabled(llama.cpp refuses non-F16KV cache without it).
Native side
No changes to
llamadart-native. The struct fields are already in theautogenerated bindings (
bindings.dart:8489, 8495, 8576, 8610, 8615, 8639, 8582, 8585), so this PR is pure Dart.Tests
test/unit/core/models/config/.model_params_test.dartextended to cover defaults, full-setconstruction, and
copyWithpropagation across all new fields.added for both new config files).
dart test -p vm): 653 passing locally.tool/testing/check_platform_boundaries.dart: clean (nodart:io/dart:ffileakage).Test plan
dart analyze lib/— no issuesdart test -p vm— 653 passingdart run tool/testing/check_platform_boundaries.dart— cleanbranch via
git:ref inpubspec.yamlauto-promote-flash-attentionheuristic is acceptable, or whether a strict-mode error is
preferred