feat(config): default prompt_cache_all to true #9951
Merged
Upstream llama.cpp defaults `cache_prompt = true` (common/common.h), but `parse_options` in the grpc-server backend unconditionally forwards the proto `PromptCacheAll` field, so any model that didn't set `prompt_cache_all: true` in its YAML was getting `cache_prompt=false`, silently overriding llama.cpp's own default. With `kv_unified` and `cache_idle_slots` already on by default, this was the last piece preventing the per-request prompt cache from being usable out of the box.

Make `PromptCacheAll` tristate (`*bool`), default it to `true` in `SetDefaults`, and dereference at the proto boundary. Users can still opt out with an explicit `prompt_cache_all: false`. Same pattern as `MMap`, `MMlock`, `Reranking`, etc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
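The tristate pattern described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual diff: `LLMConfig` and `PredictOptions` here are pared-down stand-ins for the real config struct and the generated proto message, and `toPredictOptions` and `boolPtr` are hypothetical helper names.

```go
package main

import "fmt"

// LLMConfig is a pared-down stand-in for the real config struct (assumption).
type LLMConfig struct {
	// nil means "not set in YAML"; otherwise it holds the user's explicit choice.
	PromptCacheAll *bool `yaml:"prompt_cache_all,omitempty"`
}

// PredictOptions stands in for the proto message, whose field stays a plain bool.
type PredictOptions struct {
	PromptCacheAll bool
}

// SetDefaults fills in unset fields; an unset PromptCacheAll becomes true.
func (c *LLMConfig) SetDefaults() {
	if c.PromptCacheAll == nil {
		t := true
		c.PromptCacheAll = &t
	}
}

// toPredictOptions (hypothetical name) dereferences at the proto boundary.
// It must only be called after SetDefaults, so the pointer is non-nil.
func toPredictOptions(c LLMConfig) PredictOptions {
	return PredictOptions{PromptCacheAll: *c.PromptCacheAll}
}

// boolPtr is a small helper for building explicit values.
func boolPtr(b bool) *bool { return &b }

func main() {
	cases := []struct {
		name string
		cfg  LLMConfig
	}{
		{"unset", LLMConfig{}},
		{"explicit false", LLMConfig{PromptCacheAll: boolPtr(false)}},
		{"explicit true", LLMConfig{PromptCacheAll: boolPtr(true)}},
	}
	for _, tc := range cases {
		tc.cfg.SetDefaults()
		fmt.Printf("%s: cache_prompt=%v\n", tc.name, toPredictOptions(tc.cfg).PromptCacheAll)
	}
}
```

With a plain `bool` field, the "unset" and "explicit false" cases would be indistinguishable, which is why the pointer is needed to apply a default of `true` without overriding an explicit opt-out.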
Summary
Make the per-request `cache_prompt` knob default to on, matching upstream llama.cpp.

`backend/cpp/llama-cpp/grpc-server.cpp` unconditionally forwards the proto field. Since the YAML-loaded Go value was a plain `bool` (zero = `false`), any model that didn't explicitly set `prompt_cache_all: true` ended up sending `cache_prompt=false` to llama.cpp, overriding upstream's own default (`common/common.h:592`: `bool cache_prompt = true`). With `kv_unified=true` and `cache_idle_slots=true` already default in `parse_options`, this was the last piece keeping the per-request prompt cache from being usable out of the box.

Change
- `LLMConfig.PromptCacheAll` becomes tristate (`*bool`), mirroring `MMap`, `MMlock`, `Reranking`, etc.
- In `SetDefaults`, when nil → set to `true`.
- Dereference at the proto boundary (`gRPCPredictOpts` is post-`SetDefaults`, so non-nil by contract; same idiom as the surrounding `*c.Temperature`, `*c.TopK` lines).
- Tests cover the unset case, explicit `false`, and explicit `true`. Mirrors the existing `enable_prefix_caching` precedent in `hooks_test.go`.

Notes
- `pb.PredictOptions.PromptCacheAll` stays `bool`.
- Users can still opt out with `prompt_cache_all: false` in the model YAML.
- […] `prompt_cache_all: false`, so the behavior flip lands as a pure improvement.

Test plan
- `go test ./core/config/ ./core/backend/`: 102 specs (up from 99) pass
- `go vet ./core/config/ ./core/backend/` clean
- Lint (`golangci-lint`): verify in PR
- With no `prompt_cache_all` in YAML, confirm that a second request with a shared prefix is faster than the first

🤖 Generated with Claude Code
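The manual check at the end of the test plan could be approximated with a small timing harness. The sketch below is only illustrative: `sendFunc` is a hypothetical callback that issues one completion request and blocks until the response arrives, and `newFakeBackend` merely simulates a warm prompt cache so the example runs without a server. None of these names come from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// sendFunc is a hypothetical callback that issues one completion request
// and blocks until the full response has arrived.
type sendFunc func(prompt string) error

// timeRequest measures the wall-clock latency of a single request.
func timeRequest(send sendFunc, prompt string) (time.Duration, error) {
	start := time.Now()
	err := send(prompt)
	return time.Since(start), err
}

// comparePrefixReuse sends two prompts that share a prefix, one after the
// other, and returns the latency of each.
func comparePrefixReuse(send sendFunc, prefix, q1, q2 string) (first, second time.Duration, err error) {
	if first, err = timeRequest(send, prefix+q1); err != nil {
		return first, 0, err
	}
	second, err = timeRequest(send, prefix+q2)
	return first, second, err
}

// newFakeBackend simulates a server whose prompt cache is warm after the
// first call: the first call is slow, later calls are fast.
// Swap it for a real client when checking an actual deployment.
func newFakeBackend() sendFunc {
	warm := false
	return func(string) error {
		if !warm {
			warm = true
			time.Sleep(50 * time.Millisecond)
			return nil
		}
		time.Sleep(5 * time.Millisecond)
		return nil
	}
}

func main() {
	prefix := "You are a helpful assistant. "
	first, second, err := comparePrefixReuse(newFakeBackend(), prefix, "Question one?", "Question two?")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Printf("first=%v second=%v second faster=%v\n", first, second, second < first)
}
```

Against a real server, a single pair of requests is noisy, so repeating the comparison a few times with fresh prefixes would give a more trustworthy signal.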