feat(api): expose load-time tuning knobs on ModelParams by thereisnotime · Pull Request #116 · leehack/llamadart

thereisnotime · 2026-05-07T20:09:19Z

Summary

Surface eight llama_model_params / llama_context_params fields that
already exist in the autogenerated bindings but aren't reachable from the
public ModelParams API:

| New `ModelParams` field | Maps to | Default |
| --- | --- | --- |
| `useMmap` | `llama_model_params.use_mmap` | `true` (current hardcoded value) |
| `useMlock` | `llama_model_params.use_mlock` | `false` |
| `flashAttention` | `llama_context_params.flash_attn_type` | `FlashAttention.auto` |
| `cacheTypeK` | `llama_context_params.type_k` | `KvCacheType.f16` |
| `cacheTypeV` | `llama_context_params.type_v` | `KvCacheType.f16` |
| `kvUnified` | `llama_context_params.kv_unified` | `null` (keeps current heuristic) |
| `ropeFrequencyBase` | `llama_context_params.rope_freq_base` | `null` (model's value) |
| `ropeFrequencyScale` | `llama_context_params.rope_freq_scale` | `null` (model's value) |

Two new public enums (FlashAttention, KvCacheType) live next to
GpuBackend under lib/src/core/models/config/ and are re-exported
from the main library entry.

Why

Memory-constrained mobile callers can't currently run KV-cache
quantization, which is the single biggest win for fitting larger
context windows on phones. With cacheTypeK = q8_0 the KV cache
roughly halves vs F16 — on a 12 GB Android device, that's the
difference between running a 12B model at n_ctx=4096 vs 8192. The
underlying llama.cpp parameters were already in the FFI bindings;
this PR just makes them user-reachable.
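The "roughly halves" figure follows from ggml's block layouts, and can be checked with a back-of-the-envelope estimate. The sketch below is only illustrative: `kvCacheBytes` and `bytesPerBlock` are hypothetical helpers (not part of llamadart), the local `KvCacheType` enum merely mirrors the one this PR adds, and the estimate ignores padding and any other per-context buffers.

```dart
/// Local stand-in for the enum added by this PR (illustration only).
enum KvCacheType { f16, q8_0, q4_0 }

/// Bytes used by one 32-element block in ggml's storage formats:
/// F16 stores 32 two-byte values; Q8_0 stores 32 int8 values plus one
/// two-byte scale; Q4_0 stores 16 bytes of packed nibbles plus one
/// two-byte scale.
int bytesPerBlock(KvCacheType type) => switch (type) {
      KvCacheType.f16 => 64,
      KvCacheType.q8_0 => 34,
      KvCacheType.q4_0 => 18,
    };

/// Rough size of the K and V caches combined, in bytes.
/// Assumes nLayers * nCtx * nKvHeads * headDim is a multiple of 32.
int kvCacheBytes({
  required int nLayers,
  required int nCtx,
  required int nKvHeads,
  required int headDim,
  required KvCacheType typeK,
  required KvCacheType typeV,
}) {
  final elementsPerTensor = nLayers * nCtx * nKvHeads * headDim;
  final blocks = elementsPerTensor ~/ 32;
  return blocks * (bytesPerBlock(typeK) + bytesPerBlock(typeV));
}
```

With made-up dimensions (`nLayers: 40`, `nKvHeads: 8`, `headDim: 128`), this gives 640 MiB for F16 at `nCtx: 4096` and 680 MiB for q8_0 at `nCtx: 8192`, since a Q8_0 block is 34/64 ≈ 0.53 of the F16 size.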

Other wrappers (e.g. llama.rn, used by PocketPal-AI) hardcode mmap to
false on Android for throughput reasons. Exposing useMmap lets
llamadart consumers do the same when measurements warrant it.

Behavior

Defaults preserve current behavior:

- useMmap = true matches the previously hardcoded value.
- flashAttention = auto, kvUnified = null, and rope* = null are no-ops that defer to the existing heuristics.
- cacheTypeK = cacheTypeV = f16 matches llama.cpp's defaults.

User-explicit settings are applied after the existing
platform/backend heuristics in
llama_cpp_service.dart, so they win. One quality-of-life addition: when the
user requests a non-F16 KV cache with flashAttention: auto, the service
auto-promotes flash attention to enabled (llama.cpp refuses a non-F16
KV cache without it).
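The auto-promotion rule described above can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the PR's actual code: the enums are redeclared locally so the snippet stands alone, and the function name and signature are guesses at a plausible shape.

```dart
// Local stand-ins for the enums added by this PR (illustration only).
enum FlashAttention { auto, enabled, disabled }

enum KvCacheType { f16, q8_0, q4_0 }

/// Returns the flash-attention mode to hand to the native layer.
/// Only `auto` is ever rewritten; explicit choices pass through untouched.
FlashAttention resolveFlashAttention(
  FlashAttention requested,
  KvCacheType cacheTypeK,
  KvCacheType cacheTypeV,
) {
  final quantizedKv =
      cacheTypeK != KvCacheType.f16 || cacheTypeV != KvCacheType.f16;
  if (requested == FlashAttention.auto && quantizedKv) {
    // A quantized KV cache was requested, so promote auto to enabled.
    return FlashAttention.enabled;
  }
  return requested;
}
```

Under this sketch, `auto` with an F16 cache stays `auto`, and an explicit `disabled` is never overridden, which is why the disabled plus non-F16 combination needs separate handling.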

Native side

No changes to llamadart-native. The struct fields are already in the
autogenerated bindings (bindings.dart:8489, 8495, 8576, 8610, 8615, 8639, 8582, 8585), so this PR is pure Dart.

Tests

- New per-file enum tests under test/unit/core/models/config/.
- model_params_test.dart extended to cover defaults, full-set construction, and copyWith propagation across all new fields.
- The existing mirrored-unit-structure check passes (sibling tests added for both new config files).
- Full VM suite (dart test -p vm): 653 passing locally.
- tool/testing/check_platform_boundaries.dart: clean (no dart:io / dart:ffi leakage).

Test plan

- dart analyze lib/: no issues
- dart test -p vm: 653 passing
- dart run tool/testing/check_platform_boundaries.dart: clean
- Verified one downstream Flutter app still builds against the branch via git: ref in pubspec.yaml
- Maintainer review: confirm auto-promote-flash-attention heuristic is acceptable, or whether a strict-mode error is preferred

Add the following fields to `ModelParams`, all mapped to existing fields on `llama_model_params` / `llama_context_params`:

- useMmap (bool, default true) -> use_mmap
- useMlock (bool, default false) -> use_mlock
- flashAttention (FlashAttention enum: auto/enabled/disabled, default auto) -> flash_attn_type
- cacheTypeK / cacheTypeV (KvCacheType enum: f16/q8_0/q4_0, default f16) -> type_k / type_v
- kvUnified (bool?, default null = current heuristic) -> kv_unified
- ropeFrequencyBase / ropeFrequencyScale (double?, default null = model's trained value) -> rope_freq_base / rope_freq_scale

The defaults preserve current behavior. User-explicit settings are applied after the existing platform/backend heuristics so they win. Quality-of-life: when a non-F16 KV cache type is requested with `flashAttention: auto`, the service auto-promotes flash attention to enabled (llama.cpp refuses non-F16 KV cache without it).

The motivation for this change is matching what other llama.cpp wrappers expose so memory-constrained mobile callers can run larger context windows. With Q8_0 KV the cache memory roughly halves vs F16, which on a 12 GB Android device is the difference between running a 12B model at n_ctx=4096 vs 8192.

Tests cover the new defaults, copyWith propagation, and enum surface. Mirrored unit-structure test now sees sibling tests for both new config files. Native binaries are unaffected; the underlying struct fields were already in the autogenerated bindings.

codecov-commenter · 2026-05-07T20:32:44Z

Codecov Report

❌ Patch coverage is 96.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.71%. Comparing base (a6305cd) to head (b794aec).

Files with missing lines	Patch %	Lines
lib/src/backends/llama_cpp/llama_cpp_service.dart	66.66%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #116      +/-   ##
==========================================
+ Coverage   76.59%   76.71%   +0.11%     
==========================================
  Files          68       69       +1     
  Lines        8678     8726      +48     
==========================================
+ Hits         6647     6694      +47     
- Misses       2031     2032       +1

Flag	Coverage Δ
unittests	`76.71% <96.00%> (+0.11%)`	⬆️


leehack

I left two API-focused comments. The overall direction looks useful, but I think the nullable copyWith reset behavior needs tightening before merge, and the KV-cache/flash-attention invalid-combination behavior should be made explicit.

…e nullables

Two issues raised by maintainer review:

1. copyWith couldn't clear nullable fields back to null: with `field ?? this.field`, passing null is indistinguishable from "argument omitted, keep current value". A user toggling an override off in a settings UI would be stuck with the previous value. Added explicit `clear*: bool = false` flags for all four nullable fields (chatTemplate, kvUnified, ropeFrequencyBase, ropeFrequencyScale). When the flag is set, the field becomes null regardless of any value passed for the field itself. Without the flag, behavior is unchanged (passing null still means "keep").
2. q8_0/q4_0 KV cache types with flashAttention=disabled passed through to native and produced a cryptic llama.cpp runtime error. Added validation in the constructor that throws ArgumentError early with an actionable message. The auto-promote logic in llama_cpp_service still handles the flashAttention=auto case correctly; only the explicit `disabled` combination is rejected.

Constructor lost `const` because it now has a body. Existing tests updated to use plain construction. 11 new tests cover the validation branches (5) and clear-flag behavior (6, including a regression test for the no-clear-without-flag legacy path). Total 16/16 VM tests passing.
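The clear-flag pattern from item 1 can be sketched on a cut-down class. Everything here is illustrative: `Overrides` is a hypothetical stand-in for `ModelParams`, and the flag names `clearKvUnified` / `clearRopeFrequencyBase` are assumed spellings of the `clear*` flags the commit mentions.

```dart
/// Hypothetical stand-in for ModelParams with two nullable overrides.
class Overrides {
  const Overrides({this.kvUnified, this.ropeFrequencyBase});

  final bool? kvUnified;
  final double? ropeFrequencyBase;

  /// Passing null for a field means "keep the current value".
  /// Setting the matching clear flag resets the field to null,
  /// regardless of any value passed for the field itself.
  Overrides copyWith({
    bool? kvUnified,
    bool clearKvUnified = false,
    double? ropeFrequencyBase,
    bool clearRopeFrequencyBase = false,
  }) {
    return Overrides(
      kvUnified: clearKvUnified ? null : (kvUnified ?? this.kvUnified),
      ropeFrequencyBase: clearRopeFrequencyBase
          ? null
          : (ropeFrequencyBase ?? this.ropeFrequencyBase),
    );
  }
}
```

For example, starting from `Overrides(kvUnified: true)`, calling `copyWith(kvUnified: null)` keeps `true`, while `copyWith(clearKvUnified: true)` yields `null`. A sentinel-object default would avoid the extra flags, but boolean flags keep the signature simple and leave the existing "null means keep" behavior intact.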

Codecov flagged 13 uncovered lines in llama_cpp_service.dart from PR leehack#116 (the FA auto-promote switch and the ggml_type mapping). Both are deterministic transforms that didn't deserve to live in the integration-only loadModel path.

- Extracted to lib/src/backends/llama_cpp/load_param_helpers.dart as `ggmlTypeFor` and `resolveFlashAttention`: pure functions, no FFI side effects.
- Service calls them in the same place as before; behaviour unchanged.
- 9 unit tests in load_param_helpers_test.dart cover all 3 KV cache types × switch arms and the four FA × KV cases (auto+F16 passthrough, auto+non-F16 promote, explicit enabled passthrough, explicit disabled+F16 passthrough). disabled+non-F16 is rejected upstream by the ModelParams constructor, not this helper.

Patch coverage on the new code is now 100% on the testable parts; remaining uncovered lines are pure FFI struct field assignments inside loadModel (kvUnified/ropeFreq* setters), which are trivially correct and don't add value to test in isolation.
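A pure enum-to-`ggml_type` mapping of the kind described might look like the sketch below. It is an assumption-laden illustration: the real helper presumably returns whatever type the autogenerated bindings use for `ggml_type`, whereas this version returns the raw integer values of the C enum from ggml.h, and the constant names are made up for the example. It uses a Dart 3 switch expression.

```dart
/// Local stand-in for the enum added by this PR (illustration only).
enum KvCacheType { f16, q8_0, q4_0 }

// Integer values of the `ggml_type` C enum in ggml.h.
// The Dart constant names are hypothetical.
const int ggmlTypeF16 = 1;
const int ggmlTypeQ4_0 = 2;
const int ggmlTypeQ8_0 = 8;

/// Maps the public Dart enum to the native ggml_type value.
/// The switch is exhaustive, so adding an enum value without a
/// mapping becomes a compile-time error.
int ggmlTypeFor(KvCacheType type) => switch (type) {
      KvCacheType.f16 => ggmlTypeF16,
      KvCacheType.q8_0 => ggmlTypeQ8_0,
      KvCacheType.q4_0 => ggmlTypeQ4_0,
    };
```

Keeping the mapping free of FFI calls is what makes it testable without loading a model: a unit test can simply iterate `KvCacheType.values` and compare results.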

Codecov flagged 13 uncovered lines from PR leehack#116 in loadModel: the FA switch, the ggml_type mapping, and the FFI struct setters. Previous commit moved the pure mappings to a helper but the struct-setting code was still inline in loadModel and untested.

Real fix: extract `applyModelParams(mparams, params)` and `applyContextParams(ctxParams, params)` as functions that take the already-allocated FFI structs. Tests use `calloc<llama_model_params>()` and `calloc<llama_context_params>()` to build structs in pure Dart, call the helpers, then assert on field values. No model load needed. The remaining loadModel code is two function calls plus a one-line log when FA was auto-promoted, which is trivially correct.

20 tests in load_param_helpers_test.dart now cover:

- ggmlTypeFor: all 3 enum branches
- resolveFlashAttention: auto/enabled/disabled × F16/non-F16 matrix
- applyModelParams: writes use_mmap + use_mlock from params (default + overridden)
- applyContextParams: writes type_k/type_v, FA enabled/disabled, FA auto-promote on Q8 KV, kvUnified null preserves struct field unchanged + non-null writes through, ropeFreq* same semantics, return value matches resolved FA

Verified locally with `dart pub global run coverage:format_coverage`: load_param_helpers.dart hits LH=LF on every line. The two remaining uncovered lines from the patch are pure FFI imports + the helper call sites in loadModel itself, which can't be tested without loading a real model; they're trivially safe (1-line forwarding calls).

…te()

My previous review-fix commit (83f7257) added the FA/KV validation in the constructor body, which forced removing `const`. That broke existing `const ModelParams()` defaults in llamadart's own engine.dart (line 132, 172) plus any external caller using a const context.

Now: const constructor restored. New `ModelParams.validate()` method checks the same invariant; LlamaCppService.loadModel calls it before any native work so users still get the early Dart-side ArgumentError the maintainer asked for, without breaking backwards-compat for const callers.

Tests updated: validate() can be called on const-constructed instances, returns normally for valid combos, throws ArgumentError for the (non-F16 KV, FA disabled) combo. 36/36 VM tests passing.
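The trade-off in this commit, keeping the constructor `const` while moving the invariant check into a separate method, can be sketched as follows. `LoadParams` is a hypothetical cut-down class rather than the real `ModelParams`, and the error message wording is invented for the example.

```dart
// Local stand-ins for the enums added by this PR (illustration only).
enum FlashAttention { auto, enabled, disabled }

enum KvCacheType { f16, q8_0, q4_0 }

/// Hypothetical cut-down version of ModelParams.
class LoadParams {
  /// No constructor body, so `const LoadParams()` keeps working.
  const LoadParams({
    this.flashAttention = FlashAttention.auto,
    this.cacheTypeK = KvCacheType.f16,
    this.cacheTypeV = KvCacheType.f16,
  });

  final FlashAttention flashAttention;
  final KvCacheType cacheTypeK;
  final KvCacheType cacheTypeV;

  /// Throws [ArgumentError] for the one rejected combination:
  /// a non-F16 KV cache with flash attention explicitly disabled.
  void validate() {
    final quantizedKv =
        cacheTypeK != KvCacheType.f16 || cacheTypeV != KvCacheType.f16;
    if (quantizedKv && flashAttention == FlashAttention.disabled) {
      throw ArgumentError(
        'A non-F16 KV cache requires flash attention; '
        'use FlashAttention.auto or FlashAttention.enabled.',
      );
    }
  }
}
```

The cost of this design is that an invalid instance can exist until something calls `validate()`, so the loader has to call it before doing any native work, as the commit describes.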

leehack

Native mapping looks good overall. I’d only block on the current CI issues:

- Formatting is failing on four touched files; please run `dart format .`.
- load_param_helpers_test.dart imports dart:ffi/native bindings but is included in the Chrome test run. Please mark it VM-only with `@TestOn('vm')` or exclude it from browser tests.

Web note: the pinned web bridge does not expose these new load-time knobs yet, but I’m fine treating that as a follow-up bridge/assets update rather than blocking this PR.

leehack

CI issues from my previous review are fixed in b794aec: formatting now passes, the FFI helper test is VM-only, and the PR checks are green. LGTM.

leehack reviewed May 7, 2026


Comment thread lib/src/core/models/inference/model_params.dart Outdated

Comment thread lib/src/backends/llama_cpp/llama_cpp_service.dart Outdated

thereisnotime added 4 commits May 8, 2026 12:28

leehack requested changes May 8, 2026


Comment thread test/unit/backends/llama_cpp/load_param_helpers_test.dart

thereisnotime and others added 2 commits May 8, 2026 17:22

Merge branch 'main' into feat/expose-load-time-knobs

e9ccdab

fix(api): address load param test CI failures

b794aec

leehack approved these changes May 8, 2026


leehack merged commit 5cf9be8 into leehack:main May 8, 2026
6 checks passed

leehack merged 7 commits into leehack:main from thereisnotime:feat/expose-load-time-knobs

3 participants
