Skip to content

feat(api): expose load-time tuning knobs on ModelParams#116

Merged
leehack merged 7 commits into
leehack:mainfrom
thereisnotime:feat/expose-load-time-knobs
May 8, 2026
Merged

feat(api): expose load-time tuning knobs on ModelParams#116
leehack merged 7 commits into
leehack:mainfrom
thereisnotime:feat/expose-load-time-knobs

Conversation

@thereisnotime

Copy link
Copy Markdown
Contributor

Summary

Surface eight llama_model_params / llama_context_params fields that
already exist in the autogenerated bindings but aren't reachable from the
public ModelParams API:

New ModelParams field Maps to Default
useMmap llama_model_params.use_mmap true (current hardcoded value)
useMlock llama_model_params.use_mlock false
flashAttention llama_context_params.flash_attn_type FlashAttention.auto
cacheTypeK llama_context_params.type_k KvCacheType.f16
cacheTypeV llama_context_params.type_v KvCacheType.f16
kvUnified llama_context_params.kv_unified null (keeps current heuristic)
ropeFrequencyBase llama_context_params.rope_freq_base null (model's value)
ropeFrequencyScale llama_context_params.rope_freq_scale null (model's value)

Two new public enums (FlashAttention, KvCacheType) live next to
GpuBackend under lib/src/core/models/config/ and are re-exported
from the main library entry.

Why

Memory-constrained mobile callers can't currently run KV-cache
quantization, which is the single biggest win for fitting larger
context windows on phones. With cacheTypeK = q8_0 the KV cache
roughly halves vs F16 — on a 12 GB Android device, that's the
difference between running a 12B model at n_ctx=4096 vs 8192. The
underlying llama.cpp parameters were already in the FFI bindings;
this PR just makes them user-reachable.

Other wrappers (e.g. llama.rn, used by PocketPal-AI) hardcode mmap to
false on Android for throughput reasons. Exposing useMmap lets
llamadart consumers do the same when measurements warrant it.

Behavior

Defaults preserve current behavior:

  • useMmap = true matches the previously hardcoded value.
  • flashAttention = auto, kvUnified = null, rope* = null are
    no-ops that defer to the existing heuristics.
  • cacheTypeK = cacheTypeV = f16 matches llama.cpp's defaults.

User-explicit settings are applied after the existing
platform/backend heuristics in
llama_cpp_service.dart so they win. One quality-of-life: when the
user requests non-F16 KV cache with flashAttention: auto, the service
auto-promotes flash attention to enabled (llama.cpp refuses non-F16
KV cache without it).

Native side

No changes to llamadart-native. The struct fields are already in the
autogenerated bindings (bindings.dart:8489, 8495, 8576, 8610, 8615, 8639, 8582, 8585), so this PR is pure Dart.

Tests

  • New per-file enum tests under test/unit/core/models/config/.
  • model_params_test.dart extended to cover defaults, full-set
    construction, and copyWith propagation across all new fields.
  • The existing mirrored-unit-structure check passes (sibling tests
    added for both new config files).
  • Full VM suite (dart test -p vm): 653 passing locally.
  • tool/testing/check_platform_boundaries.dart: clean (no dart:io /
    dart:ffi leakage).

Test plan

  • dart analyze lib/ — no issues
  • dart test -p vm — 653 passing
  • dart run tool/testing/check_platform_boundaries.dart — clean
  • Verified one downstream Flutter app still builds against the
    branch via git: ref in pubspec.yaml
  • Maintainer review: confirm auto-promote-flash-attention
    heuristic is acceptable, or whether a strict-mode error is
    preferred

Add the following fields to `ModelParams`, all mapped to existing
fields on `llama_model_params` / `llama_context_params`:

- useMmap (bool, default true) -> use_mmap
- useMlock (bool, default false) -> use_mlock
- flashAttention (FlashAttention enum: auto/enabled/disabled, default auto)
  -> flash_attn_type
- cacheTypeK / cacheTypeV (KvCacheType enum: f16/q8_0/q4_0, default f16)
  -> type_k / type_v
- kvUnified (bool?, default null = current heuristic) -> kv_unified
- ropeFrequencyBase / ropeFrequencyScale (double?, default null = model's
  trained value) -> rope_freq_base / rope_freq_scale

The defaults preserve current behavior. User-explicit settings are
applied after the existing platform/backend heuristics so they win.

Quality-of-life: when a non-F16 KV cache type is requested with
`flashAttention: auto`, the service auto-promotes flash attention to
enabled (llama.cpp refuses non-F16 KV cache without it).

The motivation for this change is matching what other llama.cpp
wrappers expose so memory-constrained mobile callers can run larger
context windows. With Q8_0 KV the cache memory roughly halves vs F16,
which on a 12 GB Android device is the difference between running a
12B model at n_ctx=4096 vs 8192.

Tests cover the new defaults, copyWith propagation, and enum surface.
Mirrored unit-structure test now sees sibling tests for both new
config files. Native binaries are unaffected; the underlying struct
fields were already in the autogenerated bindings.
@codecov-commenter

codecov-commenter commented May 7, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 96.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.71%. Comparing base (a6305cd) to head (b794aec).

Files with missing lines Patch % Lines
lib/src/backends/llama_cpp/llama_cpp_service.dart 66.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #116      +/-   ##
==========================================
+ Coverage   76.59%   76.71%   +0.11%     
==========================================
  Files          68       69       +1     
  Lines        8678     8726      +48     
==========================================
+ Hits         6647     6694      +47     
- Misses       2031     2032       +1     
Flag Coverage Δ
unittests 76.71% <96.00%> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@leehack leehack left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left two API-focused comments. The overall direction looks useful, but I think the nullable copyWith reset behavior needs tightening before merge, and the KV-cache/flash-attention invalid-combination behavior should be made explicit.

Comment thread lib/src/core/models/inference/model_params.dart Outdated
Comment thread lib/src/backends/llama_cpp/llama_cpp_service.dart Outdated
…e nullables

Two issues raised by maintainer review:

1. copyWith couldn't clear nullable fields back to null — `field ??
   this.field` is indistinguishable from "argument omitted, keep
   current value". A user toggling an override off in a settings UI
   would be stuck with the previous value.

   Added explicit `clear*: bool = false` flags for all four nullable
   fields (chatTemplate, kvUnified, ropeFrequencyBase,
   ropeFrequencyScale). When the flag is set, the field becomes null
   regardless of any value passed for the field itself. Without the
   flag, behavior is unchanged (passing null still means "keep").

2. q8_0/q4_0 KV cache types with flashAttention=disabled passed
   through to native and produced a cryptic llama.cpp runtime error.
   Added validation in the constructor that throws ArgumentError
   early with an actionable message.

   The auto-promote logic in llama_cpp_service still handles the
   flashAttention=auto case correctly — only the explicit
   `disabled` combination is rejected.

Constructor lost `const` because it now has a body. Existing tests
updated to use plain construction. 11 new tests cover the validation
branches (5) and clear-flag behavior (6, including a regression test
for the no-clear-without-flag legacy path).

Total 16/16 VM tests passing.
Codecov flagged 13 uncovered lines in llama_cpp_service.dart from
PR leehack#116 (the FA auto-promote switch and the ggml_type mapping).
Both are deterministic transforms that didn't deserve to live in
the integration-only loadModel path.

- Extracted to lib/src/backends/llama_cpp/load_param_helpers.dart
  as `ggmlTypeFor` and `resolveFlashAttention` — pure functions, no
  FFI side effects.
- Service calls them in the same place as before; behaviour
  unchanged.
- 9 unit tests in load_param_helpers_test.dart cover all 3 KV
  cache types × switch arms and the four FA × KV cases (auto+F16
  passthrough, auto+non-F16 promote, explicit enabled passthrough,
  explicit disabled+F16 passthrough — disabled+non-F16 is rejected
  upstream by ModelParams constructor, not this helper).

Patch coverage on the new code is now 100% on the testable parts;
remaining uncovered lines are pure FFI struct field assignments
inside loadModel (kvUnified/ropeFreq* setters), which are
trivially correct and don't add value to test in isolation.
Codecov flagged 13 uncovered lines from PR leehack#116 in loadModel —
the FA switch, the ggml_type mapping, and the FFI struct setters.
Previous commit moved the pure mappings to a helper but the
struct-setting code was still inline in loadModel and untested.

Real fix: extract `applyModelParams(mparams, params)` and
`applyContextParams(ctxParams, params)` as functions that take the
already-allocated FFI structs. Tests use `calloc<llama_model_params>()`
and `calloc<llama_context_params>()` to build structs in pure Dart,
call the helpers, then assert on field values. No model load needed.

The remaining loadModel code is two function calls plus a one-line
log when FA was auto-promoted — trivially correct.

20 tests in load_param_helpers_test.dart now cover:
- ggmlTypeFor: all 3 enum branches
- resolveFlashAttention: auto/enabled/disabled × F16/non-F16 matrix
- applyModelParams: writes use_mmap + use_mlock from params (default
  + overridden)
- applyContextParams: writes type_k/type_v, FA enabled/disabled,
  FA auto-promote on Q8 KV, kvUnified null preserves struct field
  unchanged + non-null writes through, ropeFreq* same semantics,
  return value matches resolved FA

Verified locally with `dart pub global run coverage:format_coverage`:
load_param_helpers.dart hits LH=LF on every line. The two remaining
uncovered lines from the patch are pure FFI imports + the helper
call sites in loadModel itself, which can't be tested without
loading a real model — they're trivially safe (1-line forwarding
calls).
…te()

My previous review-fix commit (83f7257) added the FA/KV validation in
the constructor body, which forced removing `const`. That broke
existing `const ModelParams()` defaults in llamadart's own
engine.dart (line 132, 172) plus any external caller using a const
context.

Now: const constructor restored. New `ModelParams.validate()` method
checks the same invariant; LlamaCppService.loadModel calls it before
any native work so users still get the early Dart-side ArgumentError
the maintainer asked for, without breaking backwards-compat for
const callers.

Tests updated: validate() can be called on const-constructed
instances, returns normally for valid combos, throws ArgumentError
for the (non-F16 KV, FA disabled) combo.

36/36 VM tests passing.

@leehack leehack left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Native mapping looks good overall. I’d only block on the current CI issues:

  • Formatting is failing on four touched files; please run dart format ..
  • load_param_helpers_test.dart imports dart:ffi/native bindings but is included in the Chrome test run. Please mark it VM-only with @TestOn('vm') or exclude it from browser tests.

Web note: the pinned web bridge does not expose these new load-time knobs yet, but I’m fine treating that as a follow-up bridge/assets update rather than blocking this PR.

Comment thread test/unit/backends/llama_cpp/load_param_helpers_test.dart

@leehack leehack left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI issues from my previous review are fixed in b794aec: formatting now passes, the FFI helper test is VM-only, and the PR checks are green. LGTM.

@leehack leehack merged commit 5cf9be8 into leehack:main May 8, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants