multiple llama-server processes to achieve the parallel sequence processing? #1267
magikRUKKOLA started this conversation in General
Replies: 3 comments 4 replies
This would require a new buffer type that makes the tensors loaded in RAM available to other processes via shared memory.
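To illustrate what "available to other processes via shared memory" means at the OS level, here is a minimal sketch using Python's standard `multiprocessing.shared_memory` module. It is not how llama-server or its buffer types work; `publish_tensor` and `read_tensor` are hypothetical helper names, and the float32 list only stands in for tensor data. The second handle is attached in the same process for brevity; a separate process would attach by the same block name.

```python
from array import array
from multiprocessing import shared_memory


def publish_tensor(values):
    """Copy float32 values into a new named shared-memory block.

    Returns the owning handle; its .name is what other processes attach to.
    (Hypothetical helper, for illustration only.)
    """
    data = array("f", values)
    nbytes = len(data) * data.itemsize
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    shm.buf[:nbytes] = data.tobytes()
    return shm


def read_tensor(name, count):
    """Attach to an existing block by name and read `count` float32 values."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        itemsize = array("f").itemsize
        # The block may be rounded up to a page size, so slice explicitly.
        raw = bytes(shm.buf[: count * itemsize])
        out = array("f")
        out.frombytes(raw)
        return list(out)
    finally:
        shm.close()  # detach only; the owner is responsible for unlink()


if __name__ == "__main__":
    owner = publish_tensor([1.0, 2.5, -3.0])
    try:
        print(read_tensor(owner.name, 3))
    finally:
        owner.close()
        owner.unlink()  # remove the block once no reader needs it
```

The weights are written into RAM once and any number of readers map the same block by name. A real implementation would also have to handle tensor layout, alignment, and lifetime across server instances.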
Have you tried? The
What problems?
Is it possible to run several llama-server processes on a 512GB machine, for example two processes that each use a separate GPU for the head, the KV cache, etc. of a MoE LLM such as Kimi-K2.5 IQ3_K (~460GB total), to achieve parallel processing? Instead of using the --parallel option and dealing with the multiple problems related to it, I could simply run several instances of llama-server with the same LLM, so that all the processes share the same CPU-offloaded parts of the model.
[EDIT]: If this is possible, I wonder what the total decode speed would be in such a case.