feat(config): default prompt_cache_all to true (#9951)

Merged: mudler merged 1 commit into master from default-cache-prompt-true on May 22, 2026

@localai-bot (Collaborator)

Summary

Make the per-request cache_prompt knob default to on, matching upstream llama.cpp.

backend/cpp/llama-cpp/grpc-server.cpp unconditionally forwards the proto field:

data["cache_prompt"] = predict->promptcacheall();   // grpc-server.cpp:197

Since the YAML-loaded Go value was a plain bool (zero = false), any model that didn't explicitly set prompt_cache_all: true ended up sending cache_prompt=false to llama.cpp — overriding upstream's own default (common/common.h:592: bool cache_prompt = true). With kv_unified=true and cache_idle_slots=true already default in parse_options, this was the last piece keeping the per-request prompt cache from being usable out of the box.
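
The zero-value problem described above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: it uses the standard library's `encoding/json` as a stand-in for the YAML loader, and `plainConfig`, `tristateConfig`, and `describe` are hypothetical names, not LocalAI types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// plainConfig is a hypothetical stand-in for the old field shape: a plain bool.
type plainConfig struct {
	PromptCacheAll bool `json:"prompt_cache_all"`
}

// tristateConfig is a hypothetical stand-in for the new shape: nil means "not set".
type tristateConfig struct {
	PromptCacheAll *bool `json:"prompt_cache_all"`
}

// describe decodes the same document into both shapes and reports the plain
// bool value alongside whether the pointer field was set at all.
func describe(doc string) (plain bool, isSet bool, err error) {
	var p plainConfig
	if err = json.Unmarshal([]byte(doc), &p); err != nil {
		return false, false, err
	}
	var t tristateConfig
	if err = json.Unmarshal([]byte(doc), &t); err != nil {
		return false, false, err
	}
	return p.PromptCacheAll, t.PromptCacheAll != nil, nil
}

func main() {
	// An absent key and an explicit false look identical through the plain
	// bool; only the pointer field tells them apart.
	for _, doc := range []string{`{}`, `{"prompt_cache_all": false}`} {
		plain, isSet, err := describe(doc)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-30s plain=%v set=%v\n", doc, plain, isSet)
	}
}
```

With a plain bool, both inputs decode to `false`, so there is nothing left to tell a defaulting step that the user never chose a value.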

Change

  • Make LLMConfig.PromptCacheAll tristate (*bool), mirroring MMap, MMlock, Reranking, etc.
  • In SetDefaults, when nil → set to true.
  • Dereference at the proto boundary (gRPCPredictOpts is post-SetDefaults, so non-nil by contract — same idiom as the surrounding *c.Temperature, *c.TopK lines).
  • Add three Ginkgo specs covering: default, explicit false, explicit true. Mirrors the existing enable_prefix_caching precedent in hooks_test.go.
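
The steps listed above can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: `modelConfig`, `predictOptions`, `setDefaults`, `toPredictOptions`, and `resolve` are hypothetical stand-ins for `LLMConfig`, the generated `pb.PredictOptions`, `SetDefaults`, and `gRPCPredictOpts`.

```go
package main

import "fmt"

// modelConfig is a hypothetical, trimmed-down stand-in for LLMConfig.
type modelConfig struct {
	PromptCacheAll *bool // nil means "not set in the model YAML"
}

// setDefaults fills unset fields; an explicit user value is left untouched.
func (c *modelConfig) setDefaults() {
	if c.PromptCacheAll == nil {
		enabled := true
		c.PromptCacheAll = &enabled
	}
}

// predictOptions is a hypothetical stand-in for the generated proto message,
// whose field stays a plain bool.
type predictOptions struct {
	PromptCacheAll bool
}

// toPredictOptions must only be called after setDefaults, so the pointer is
// non-nil by contract and can be dereferenced directly.
func toPredictOptions(c modelConfig) predictOptions {
	return predictOptions{PromptCacheAll: *c.PromptCacheAll}
}

func boolPtr(b bool) *bool { return &b }

// resolve runs the whole path: configured value, defaults, proto boundary.
func resolve(configured *bool) bool {
	c := modelConfig{PromptCacheAll: configured}
	c.setDefaults()
	return toPredictOptions(c).PromptCacheAll
}

func main() {
	fmt.Println(resolve(nil))            // unset: defaults to true
	fmt.Println(resolve(boolPtr(false))) // explicit false is preserved
	fmt.Println(resolve(boolPtr(true)))  // explicit true is preserved
}
```

The three calls in `main` correspond to the three cases the new specs cover: default, explicit false, and explicit true.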

Notes

  • No proto change. Generated pb.PredictOptions.PromptCacheAll stays bool.
  • Users can still opt out with prompt_cache_all: false in the model YAML.
  • No in-tree YAML currently sets prompt_cache_all: false, so the behavior flip lands as a pure improvement.

Test plan

  • go test ./core/config/ ./core/backend/ — 102 specs (up from 99) pass
  • go vet ./core/config/ ./core/backend/ clean
  • CI lint job (golangci-lint) — verify in PR
  • Manual smoke: serve a chat model without prompt_cache_all in YAML, confirm second request with shared prefix is faster than the first

🤖 Generated with Claude Code

Upstream llama.cpp defaults `cache_prompt = true` (common/common.h),
but `parse_options` in the grpc-server backend unconditionally forwards
the proto `PromptCacheAll` field, so any model that didn't set
`prompt_cache_all: true` in its YAML was getting `cache_prompt=false` —
silently overriding llama.cpp's own default. With `kv_unified` and
`cache_idle_slots` already on by default, this was the last piece
preventing the per-request prompt cache from being usable out of the
box.

Make `PromptCacheAll` tristate (`*bool`), default it to `true` in
`SetDefaults`, and dereference at the proto boundary. Users can still
opt out with an explicit `prompt_cache_all: false`. Same pattern as
`MMap`, `MMlock`, `Reranking`, etc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mudler merged commit c500461 into master on May 22, 2026
56 checks passed
@mudler deleted the default-cache-prompt-true branch on May 22, 2026 20:06
@localai-bot added the enhancement label on May 24, 2026