Replies: 1 comment
This tutorial may show how to do this: ggml-org/llama.cpp#13606. But we'd need
Hi,
I'm using the OpenAI-compatible API, running GGUF on a CPU, with the llama.cpp loader.
`--streaming-llm` is very useful for caching the last prompt prefix, so that the next run only has to process the prompt from the first token that differs.

However, in my case, I will have about 8 prompt prefixes rotating all the time. This makes `--streaming-llm` mostly useless.

Is there a way to cache 8 variations of the prompt prefix (while still allowing me to inject suffixes that will always be different and are not expected to be cached)?
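To make the problem concrete, here is a minimal sketch of why a single cached prefix stops helping once prefixes rotate. It is not the loader's actual code: the token IDs and the helper names `common_prefix_len` and `tokens_to_process` are made up for illustration, and it only assumes the behaviour described above (reuse everything up to the first differing token).

```python
def common_prefix_len(cached, new):
    """Count the leading tokens shared by two token sequences."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def tokens_to_process(cached, new):
    """Tokens that still need evaluating when only the shared prefix is reused."""
    return len(new) - common_prefix_len(cached, new)


# Hypothetical token IDs: two prefixes that share only their first token.
prefix_a = [1, 10, 11, 12, 13]
prefix_b = [1, 20, 21, 22, 23]

cached = prefix_a + [99]            # what the previous request left in the cache
same_prefix = prefix_a + [42, 43]   # next request reuses prefix A
other_prefix = prefix_b + [42, 43]  # next request rotates to prefix B

print(tokens_to_process(cached, same_prefix))   # only the new suffix
print(tokens_to_process(cached, other_prefix))  # almost the whole prompt
```

With one cache, every rotation to a different prefix pays close to the full prompt-processing cost again, which is why a separate cache per prefix would be needed.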
Many thanks!
EDIT - I suspect this may be possible by enabling "slots" and then playing with `--keep` and `-sps, --slot-prompt-similarity SIMILARITY`. I don't think I can pass "any parameters" to `llama-server` from `CMD_FLAGS.txt`, though.
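A rough sketch of the slot idea, under several assumptions that are not confirmed in this thread: that `llama-server` is started directly with at least 8 slots (for example via `--parallel 8`), and that its native `/completion` endpoint accepts the `id_slot` and `cache_prompt` request fields. The prefix strings and the `build_request` helper are hypothetical, and whether the OpenAI-compatible route forwards these fields is untested here.

```python
import json

# Hypothetical prefixes; in practice these would be the 8 rotating prompt prefixes.
PREFIXES = [f"System prompt variant {i}.\n" for i in range(8)]


def build_request(prefix_index, suffix):
    """Build a request body that pins each prefix to its own server slot."""
    return {
        "prompt": PREFIXES[prefix_index] + suffix,
        "id_slot": prefix_index,   # assumption: one slot per prefix
        "cache_prompt": True,      # assumption: reuse that slot's cached prefix
    }


# JSON body that could be POSTed to the server's /completion endpoint.
body = json.dumps(build_request(3, "User question goes here"))
print(body)
```

The design choice is to route explicitly rather than rely on similarity matching: each prefix always lands on the same slot, so that slot's cache should only ever hold that prefix plus the most recent suffix.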