multiple llama-server processes to achieve the parallel sequence processing? #1267

magikRUKKOLA · 2026-02-14T07:02:18Z

Is it possible to run several llama-server processes on a 512GB machine, with each process using separate GPUs for the head, the KV-cache, etc. of a MoE LLM such as Kimi-K2.5 IQ3_K (~460GB total), to achieve parallel processing? Instead of using the --parallel option and dealing with the multiple problems related to it, I could simply run several instances of llama-server with the same LLM, so that all the processes share the same CPU-offloaded parts of the LLM in question.

[EDIT]: If it is possible, I wonder what the total decode speed would be in such a case.

ikawrakow · 2026-02-14T07:20:11Z

ikawrakow
Feb 14, 2026
Maintainer

This would require a new buffer type that makes the tensors loaded in RAM available to other processes via shared memory.

1 reply

magikRUKKOLA Feb 14, 2026
Author

But ... what about the mmap and MAP_SHARED ?

ikawrakow · 2026-02-14T08:08:11Z

ikawrakow
Feb 14, 2026
Maintainer

But ... what about the mmap and MAP_SHARED ?

Have you tried? The mmap system call uses MAP_SHARED, so theoretically it could work. One cannot use any flags that would turn off mmap (-rtr, -mqkv, -muge come to mind).
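To illustrate the mechanism being discussed, here is a minimal sketch (Unix only) in which two separate processes each map the same file read-only with MAP_SHARED and see identical bytes. The temporary file is a stand-in for a model file, and the helper names are hypothetical. The sketch only shows that independent mappings of one file are possible; it does not measure whether the kernel actually backs them with the same physical pages, and it says nothing about how llama-server behaves in practice.

```python
import hashlib
import mmap
import os
import subprocess
import sys
import tempfile

# Script run by the second process: it maps the file on its own and prints a digest.
CHILD_SCRIPT = """
import hashlib, mmap, os, sys
fd = os.open(sys.argv[1], os.O_RDONLY)
m = mmap.mmap(fd, 0, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
os.close(fd)
print(hashlib.sha256(m[:]).hexdigest())
m.close()
"""


def map_shared_readonly(path):
    """Map a whole file read-only with MAP_SHARED (hypothetical helper)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Python duplicates the descriptor, so the mapping outlives os.close().
        return mmap.mmap(fd, 0, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
    finally:
        os.close(fd)


def two_processes_see_same_bytes():
    """Return True if a second process mapping the file reads identical content."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"stand-in for model weights" * 4096)
        path = f.name
    try:
        mapping = map_shared_readonly(path)
        parent_digest = hashlib.sha256(mapping[:]).hexdigest()
        mapping.close()
        child = subprocess.run(
            [sys.executable, "-c", CHILD_SCRIPT, path],
            capture_output=True, text=True, check=True,
        )
        return child.stdout.strip() == parent_digest
    finally:
        os.unlink(path)
```

Whether the resident memory is really shared between two server processes would still have to be checked on the target machine, for example by comparing total RAM use with one and with two instances running.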

1 reply

magikRUKKOLA Feb 14, 2026
Author

Have you tried?

Unfortunately no, I haven't tried anything.

Details

I had been thinking about using RPC-connected CUDA devices to, say, speed up the prefill and hence reduce the need to increase -b and -ub. Theoretically, it should be possible to offload the prefill over the network (via the RPC-connected instance), dump the KV-cache from there, load it into the main instance (with low batch sizes), and proceed with the decode. This way it should be possible to optimally offload the layers of MoE MLA LLMs when dealing with small-sized GPUs and the split mode layer ...
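The dump-and-load step could be sketched roughly as follows. This assumes the slot save/restore HTTP endpoints documented for the upstream llama.cpp server (`POST /slots/{id}?action=save` and `action=restore`, enabled with `--slot-save-path`) are available in the build being used, and that both instances can read the same save directory. Neither assumption is confirmed in this thread, and it is untested whether a cache written by an instance with a different device layout can be restored. The function names, hosts, and file name are hypothetical; the code only builds the requests and does not send them.

```python
import json
import urllib.request


def slot_request(base_url, slot_id, action, filename):
    """Build (but do not send) a slot save/restore request.

    Assumes the upstream llama.cpp server slot endpoints; `action` is
    either "save" or "restore".
    """
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action}")
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def kv_handoff_requests(prefill_url, decode_url, filename, slot_id=0):
    """Return the two requests for a prefill -> decode KV-cache handoff.

    The first saves the slot on the instance that did the prefill, the
    second restores it on the instance that will do the decode.
    """
    return (
        slot_request(prefill_url, slot_id, "save", filename),
        slot_request(decode_url, slot_id, "restore", filename),
    )
```

Each request would then be sent with `urllib.request.urlopen(req)`, the save first and the restore only after the save has completed.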

saood06 · 2026-02-14T19:27:33Z

saood06
Feb 14, 2026
Collaborator

-- so that instead of --parallel option and multiple problems related to it

What problems?

2 replies

magikRUKKOLA Feb 14, 2026
Author

@saood06

What problems?

Constant VRAM consumption. With --parallel 2, the KV-cache allocation is such that one has to sacrifice context length to fit two parallel sessions.
But if one has a spare GPU in the system, it would be much cleaner to allocate another llama-server on demand, which would use the shared RAM segments but run on the spare GPUs (possibly including RPC, i.e. RDMA). This way the context length that can be supported is always better than with --parallel.
That is what I meant.
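A rough sketch of what launching such an extra instance on demand might look like. It builds the command line and environment for one llama-server process pinned to specific GPUs via `CUDA_VISIBLE_DEVICES`, and refuses the flags mentioned earlier in the thread as disabling mmap (with `--no-mmap` added). The helper name, the `-ngl 99` value, and the `exps=CPU` tensor override pattern are illustrative assumptions, not a tested configuration for this model.

```python
import os

# Flags named in this thread as turning off mmap, plus --no-mmap itself.
MMAP_DISABLING_FLAGS = {"-rtr", "-mqkv", "-muge", "--no-mmap"}


def instance_launch_spec(model_path, gpu_ids, port, ctx_size, extra_args=()):
    """Return (argv, env) for one llama-server instance (hypothetical helper).

    Every instance points at the same model file so that the mmap-ed,
    CPU-resident tensors can in principle be shared between processes.
    """
    clash = MMAP_DISABLING_FLAGS.intersection(extra_args)
    if clash:
        raise ValueError(f"flags would disable mmap: {sorted(clash)}")
    argv = [
        "llama-server",
        "-m", model_path,
        "--port", str(port),
        "-c", str(ctx_size),
        "-ngl", "99",         # assumption: offload all layers that fit
        "-ot", "exps=CPU",    # assumption: keep expert tensors in CPU RAM
        *extra_args,
    ]
    # Restrict this process to its own GPUs; other instances get other ids.
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    return argv, env
```

A second instance would then be started with something like `subprocess.Popen(argv, env=env)`, using a different port and a disjoint set of GPU ids, so that each process gets the full context length its own GPUs can hold.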

saood06 Feb 14, 2026
Collaborator

That is what I meant.

Thanks for the clarification.
