feat: Align --parallel context allocation with llama.cpp semantics (per-slot context = ctx_size / n_parallel)

## Problem / Background

`mlxcel-server` and llama.cpp's `llama-server` both expose a `--parallel N` flag alongside `--ctx-size C` for controlling concurrent request slots, but they implement fundamentally different semantics for how context is allocated across those slots. This divergence breaks any downstream client that targets both engines through a unified configuration surface — most notably Backend.AI GO, which emits the same flag values to both engines from a single code path.

For the same invocation `--ctx-size C --parallel N`:

| Engine | Per-slot context | Total KV cache memory |
|---|---|---|
| llama.cpp / llama-server | `C / N` tokens (total context **divided** across slots) | roughly proportional to `C` only (constant in `N`) |
| mlxcel-server | `C` tokens per slot (each slot gets **full** context) | roughly proportional to `C * N` (linear in `N`) |

### Evidence in the mlxcel source

- CLI definition (matches `--parallel` byte-for-byte): `src/bin/mlx_server.rs:175` — `parallel: usize, default 1, env LLAMA_ARG_N_PARALLEL`
- Pass-through, no division: `src/server/startup.rs:646` — `context_size: startup.ctx_size` (1:1, not divided by `n_parallel`)
- Independent fields in config: `src/server/config.rs:224-225` — `pub context_size: usize` and `pub n_parallel: usize` stored separately
- KV cache pool sized at full depth: `src/server/scheduler.rs:392` — `let pool_capacity = max_batch_size + max_queue_depth;` — each cache allocated at full `context_size`
- Docs make no mention of context division: `docs/en/getting-started/configuration.md` describes `--n-parallel` only as a concurrency control, and `docs/CONTINUOUS_BATCHING.md` confirms the continuous-batching design with full-depth per-sequence KV caches

### Evidence in llama.cpp (upstream)

llama.cpp computes `n_ctx_per_seq = n_ctx / n_seq_max`, so the total context is split evenly across sequences. This is documented in:

- https://github.com/ggml-org/llama.cpp/issues/11681
- https://github.com/ggml-org/llama.cpp/discussions/4130

### Why this matters for downstream clients

Backend.AI GO (`lablup/backend.ai-go`) emits `--parallel N` to both engines from the same `ServerConfig.to_args()` code path (`src-tauri/src/process/types.rs:996-1002`). A single i18n description in the user-facing UI says *"Total context is divided equally among slots (e.g., 4096 context with 2 parallel slots gives 2048 tokens per slot)"*. That description is correct for llama-server but wrong for mlxcel-server, where each slot receives the full context. Users following the same UI guidance therefore get very different memory budgets and per-slot context windows depending on which engine they use.

Concrete consequences:

- Memory planning is broken across engines: a user setting `--parallel 4 --ctx-size 32768` on a 70B model expects llama.cpp-shaped memory (KV cache for 32768 tokens in total), but mlxcel allocates four full-depth caches (131072 tokens in total) and can run out of memory.
- Application-level defaults that make sense for llama.cpp (e.g., "set parallel to 2 for agent workloads") become memory-doubling decisions on mlxcel.
- There is no pre-flight memory check in mlxcel: the startup path in `src/server/startup.rs` performs no validation and applies no explicit clamp. The mismatch is not reported at startup and only surfaces at runtime.
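
The arithmetic behind the first consequence can be sketched as follows. This is an illustrative model only: `Semantics` and `kv_budget` are hypothetical names, not part of either code base, and the budget is counted in tokens rather than bytes because the byte cost per token depends on model parameters that this issue does not specify.

```rust
/// Which engine's `--parallel` semantics to model (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Semantics {
    /// llama.cpp: the total context is divided across slots.
    Divided,
    /// Current mlxcel behavior as described above: every slot gets the full context.
    FullPerSlot,
}

/// Returns `(per_slot_context, total_kv_tokens)` for a `(ctx_size, n_parallel)` pair.
fn kv_budget(ctx_size: usize, n_parallel: usize, semantics: Semantics) -> (usize, usize) {
    // Treat 0 as 1 so the sketch never divides by zero.
    let n = n_parallel.max(1);
    match semantics {
        Semantics::Divided => {
            let per_slot = ctx_size / n; // integer division rounds down
            (per_slot, per_slot * n)
        }
        Semantics::FullPerSlot => (ctx_size, ctx_size * n),
    }
}
```

For `--ctx-size 32768 --parallel 4`, the divided model gives 8192 tokens per slot and 32768 tokens of KV cache in total, while the full-per-slot model gives 32768 tokens per slot and 131072 tokens in total.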

## Proposed Solution

Align mlxcel's `--parallel` semantics with llama.cpp's behavior so the same flag value produces equivalent per-slot context windows and equivalent memory footprints across the two engines.

Specifically:

- When `--parallel N` is passed alongside `--ctx-size C`, each active slot's per-sequence context should be `C / N`, computed as `max(1, floor(C / N))`. In addition, startup should enforce a sensible minimum: for example, reject configurations where `C / N < 512` with a clear error message.
- The KV cache pool at `src/server/scheduler.rs:392` should allocate per-slot caches at the divided depth, not the full depth.
- Total KV memory for a given `(C, N)` pair should be roughly constant in `N` — matching llama.cpp.
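
A minimal sketch of the division and the startup check, assuming a 512-token floor as suggested above. The function name `per_slot_context`, the constant `MIN_CTX_PER_SLOT`, and the error wording are hypothetical and do not reflect existing mlxcel code.

```rust
/// Hypothetical minimum per-slot context; 512 is the value suggested in this issue.
const MIN_CTX_PER_SLOT: usize = 512;

/// Computes the per-slot context for `--ctx-size C --parallel N`,
/// or returns a startup error message when the result is too small.
fn per_slot_context(ctx_size: usize, n_parallel: usize) -> Result<usize, String> {
    if n_parallel == 0 {
        return Err("--parallel must be at least 1".to_string());
    }
    // Integer division on usize rounds down, i.e. floor(C / N).
    let per_slot = ctx_size / n_parallel;
    if per_slot < MIN_CTX_PER_SLOT {
        return Err(format!(
            "--ctx-size {} with --parallel {} leaves {} tokens per slot; the minimum is {}",
            ctx_size, n_parallel, per_slot, MIN_CTX_PER_SLOT
        ));
    }
    Ok(per_slot)
}
```

With this sketch, `per_slot_context(4096, 2)` yields `Ok(2048)`, matching the UI example quoted earlier, while `per_slot_context(1024, 4)` is rejected because 256 tokens per slot is below the floor.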

### Migration / backward compatibility options

This is a behavior change that may break existing mlxcel deployments relying on the current "full context per slot" semantics. Maintainers should choose one of the following — we have no strong preference; we just need parity:

1. **Direct change with CHANGELOG entry.** Accept the break, document in CHANGELOG, bump minor version. Cleanest semantics; mirrors llama.cpp exactly.
2. **New flag `--ctx-size-per-seq`.** Make per-slot context size an explicit, separate knob. When `--ctx-size-per-seq` is unset, treat `--ctx-size` as the total and apply llama.cpp-style division; when it is set, use it as the per-slot depth. Most flexible, most code.
3. **Compatibility mode flag `--llama-server-compat`.** Opt-in to llama.cpp semantics; default keeps current behavior with a startup warning that the default will flip in a future release. Safest for existing users.

If a tiebreaker is needed, we lean toward (1) or (3) for clean long-term semantics.
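
To make the difference between the three options concrete, the sketch below models how each would resolve the per-slot context. `ContextPolicy` and `resolve_per_slot` are hypothetical names used only for illustration, and the handling of option 2 reflects one reading of it (an explicit per-sequence value wins, otherwise divide).

```rust
/// Hypothetical model of the three migration options.
enum ContextPolicy {
    /// Option 1: always divide the total context (llama.cpp semantics).
    Divide,
    /// Option 2: `--ctx-size-per-seq` overrides the per-slot depth; divide when unset.
    PerSeqOverride(Option<usize>),
    /// Option 3: divide only when `--llama-server-compat` is set; otherwise keep
    /// the current full-context-per-slot behavior.
    CompatFlag(bool),
}

/// Resolves the per-slot context in tokens under the given policy.
fn resolve_per_slot(ctx_size: usize, n_parallel: usize, policy: &ContextPolicy) -> usize {
    // Treat 0 as 1 so the sketch never divides by zero.
    let divided = ctx_size / n_parallel.max(1);
    match policy {
        ContextPolicy::Divide => divided,
        ContextPolicy::PerSeqOverride(Some(per_seq)) => *per_seq,
        ContextPolicy::PerSeqOverride(None) => divided,
        ContextPolicy::CompatFlag(true) => divided,
        ContextPolicy::CompatFlag(false) => ctx_size,
    }
}
```

Only the `CompatFlag(false)` arm preserves the current behavior, which is why option 3 is the least disruptive for existing deployments and option 1 is the simplest to reason about.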

## Acceptance Criteria

- [ ] `--parallel N --ctx-size C` results in per-slot KV cache sized for `C / N` tokens (with the chosen migration strategy above).
- [ ] `/slots` endpoint (already exposed when `--slots` is enabled, per `src/server/startup.rs:647`) reports the per-slot context size, not the total.
- [ ] Updated `--help` text on the `--parallel` flag (currently *"Number of parallel request slots"* at `src/bin/mlx_server.rs:175`) explicitly states the context-division semantics.
- [ ] Documentation updated:
  - [ ] `docs/en/getting-started/configuration.md`
  - [ ] Korean mirror under `docs/ko/`
  - [ ] `docs/CONTINUOUS_BATCHING.md`
  - [ ] Man page `docs/man/mlxcel-server.1`
- [ ] CHANGELOG entry under the version where this lands.
- [ ] Test (integration or unit) confirming per-slot KV allocation matches `ctx_size / n_parallel` for several `(C, N)` pairs.
- [ ] Error message at startup if `ctx_size / n_parallel` falls below a sensible floor (suggested: 512 tokens).
- [ ] Total KV memory footprint for a given `(C, N)` pair is verified to be roughly constant in `N` (matching llama.cpp).

## Technical Considerations

### Cross-references

- Upstream llama.cpp behavior: https://github.com/ggml-org/llama.cpp/issues/11681 and https://github.com/ggml-org/llama.cpp/discussions/4130
- Downstream consumer with current UI mismatch — `lablup/backend.ai-go`:
  - `src-tauri/src/process/types.rs:996-1002` (arg emission)
  - `src/types/modelConfig.ts:363-373` (slider)
  - `src/components/ModelConfigDrawer/ContextTab.tsx:227-245` (control)
  - i18n keys `modelConfig.context.parallelRequests` / `parallelRequestsDesc` (currently llama.cpp-shaped)

### Downstream coordination

A separate `backend.ai-go` memory-guard issue is already filed to add pre-flight memory checks downstream. Once this mlxcel-side change lands, the memory math on the Backend.AI GO side becomes engine-independent and that guard can compute a single number for both engines.

### Impact scope

This is a breaking change candidate that affects every downstream embedder of `mlxcel-server`, not just Backend.AI GO. The migration path (option 1 / 2 / 3 above) is the most important call for maintainers to make first.
