Is there a way to cache multiple prompt prefixes? #6969

ghnp5 · 2025-05-10T21:28:04Z

Hi,

I'm using the OpenAI-compatible API, running a GGUF model on CPU with the llama.cpp loader.

--streaming-llm is very useful for caching the last prompt prefix: on the next request, the prompt only has to be processed from the first token that differs.

However, in my case I have about 8 prompt prefixes that rotate constantly, so the cached prefix almost never matches the next request. This makes --streaming-llm mostly useless.
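To illustrate the problem, here is a toy simulation of a cache that only remembers the last prompt. It is not the loader's actual code, and the token IDs are made up. It counts how many tokens would need processing per request when the prefix stays the same versus when it rotates.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tokens_to_process(prompts):
    """Simulate a single-entry prefix cache.

    Each prompt is compared only with the previous prompt; the shared
    leading tokens are reused and the rest must be processed.
    """
    cached = []
    costs = []
    for prompt in prompts:
        reused = common_prefix_len(cached, prompt)
        costs.append(len(prompt) - reused)
        cached = prompt  # the cache now holds only this prompt
    return costs


# Hypothetical token IDs: two 4-token prefixes, each followed by a 1-token suffix.
prefix_a = [1, 2, 3, 4]
prefix_b = [9, 8, 7, 6]

same = tokens_to_process([prefix_a + [10], prefix_a + [11], prefix_a + [12]])
rotating = tokens_to_process([prefix_a + [10], prefix_b + [11], prefix_a + [12]])

print(same)      # [5, 1, 1]
print(rotating)  # [5, 5, 5]
```

With a single prefix, only the suffix is processed after the first request. With rotating prefixes, every request pays the full cost.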

Is there a way to cache all 8 prompt prefixes, while still letting me append suffixes that are always different and not expected to be cached?

Many thanks!

EDIT - I suspect this may be possible by enabling "slots" and then tuning --keep and -sps, --slot-prompt-similarity SIMILARITY. I don't think I can pass arbitrary parameters to llama-server from CMD_FLAGS.txt, though.
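A minimal sketch of the slot idea, untested against my setup. It assumes llama-server is run directly (outside the webui) with --parallel 8 so that it has 8 slots, and that its /completion endpoint accepts id_slot and cache_prompt in the request body, as I understand recent llama.cpp versions do. The server URL, the PREFIXES list, and the helper names are hypothetical. Each prefix is pinned to its own slot, so a request is only compared against the prompt cached in that slot.

```python
import json
import urllib.request

# Assumption: llama-server listening locally, started with --parallel 8.
SERVER_URL = "http://127.0.0.1:8080"

# Hypothetical placeholders for the 8 rotating prompt prefixes.
PREFIXES = [f"System prompt variant {i}.\n" for i in range(8)]


def build_payload(prefix_index, suffix, prefixes=PREFIXES, n_predict=128):
    """Build a /completion request body that pins a prefix to one slot."""
    if not 0 <= prefix_index < len(prefixes):
        raise IndexError("prefix_index must match one of the configured slots")
    return {
        "prompt": prefixes[prefix_index] + suffix,
        "id_slot": prefix_index,  # one slot per prefix
        "cache_prompt": True,     # reuse the slot's cached prompt where it matches
        "n_predict": n_predict,
    }


def complete(payload, server_url=SERVER_URL):
    """Send the payload to llama-server and return the generated text."""
    request = urllib.request.Request(
        server_url + "/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["content"]


# Example: the suffix changes on every call, the prefix and slot stay paired.
payload = build_payload(3, "User question goes here.")
# text = complete(payload)  # requires a running llama-server
```

One caveat I'm not sure about: depending on the llama.cpp version, the total context size may be divided among the slots, so the context size might need to be raised accordingly.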

ghnp5 · 2025-06-11T02:00:16Z

This tutorial may show how to do this: ggml-org/llama.cpp#13606

But oobabooga/text-generation-webui would need to support passing these parameters through to llama.cpp.
