feat: Align --parallel context allocation with llama.cpp semantics (per-slot context = ctx_size / n_parallel) #57

@inureyes

Problem / Background

mlxcel-server and llama.cpp's llama-server both expose a --parallel N flag alongside --ctx-size C for controlling concurrent request slots, but they implement fundamentally different semantics for how context is allocated across those slots. This divergence breaks any downstream client that targets both engines through a unified configuration surface — most notably Backend.AI GO, which emits the same flag values to both engines from a single code path.

For the same invocation --ctx-size C --parallel N:

| Engine | Per-slot context | Total KV cache memory |
| --- | --- | --- |
| llama.cpp / llama-server | C / N tokens (total context divided across slots) | roughly proportional to C only (constant in N) |
| mlxcel-server | C tokens per slot (each slot gets full context) | roughly proportional to C * N (linear in N) |
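
To make the table concrete, here is a minimal sketch of the two interpretations. The enum and function names are hypothetical and exist only for illustration; they are not taken from either codebase, and the "total" is counted in tokens of KV capacity across N slots rather than in bytes. mlxcel's real pool size depends on its scheduler settings, so treat the second variant as an approximation of the behavior described above.

```rust
/// How an engine interprets `--ctx-size C --parallel N`.
/// (Hypothetical type, for illustration only.)
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ContextSemantics {
    /// llama-server: C is the total, split evenly across N slots.
    DividedAcrossSlots,
    /// mlxcel-server today: every slot gets the full C.
    FullPerSlot,
}

/// Tokens of context available to a single slot.
pub fn per_slot_context(sem: ContextSemantics, ctx_size: usize, n_parallel: usize) -> usize {
    let n = n_parallel.max(1); // guard against division by zero
    match sem {
        ContextSemantics::DividedAcrossSlots => ctx_size / n,
        ContextSemantics::FullPerSlot => ctx_size,
    }
}

/// Tokens of KV cache capacity summed over all N slots.
pub fn total_kv_tokens(sem: ContextSemantics, ctx_size: usize, n_parallel: usize) -> usize {
    per_slot_context(sem, ctx_size, n_parallel) * n_parallel.max(1)
}
```

With `--ctx-size 4096 --parallel 2`, the divided interpretation gives 2048 tokens per slot and 4096 tokens in total, while the full-per-slot interpretation gives 4096 tokens per slot and 8192 in total.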

Evidence in the mlxcel source

  • CLI definition (matches llama-server's --parallel byte-for-byte): src/bin/mlx_server.rs:175 defines parallel: usize, default 1, env LLAMA_ARG_N_PARALLEL
  • Pass-through, no division: src/server/startup.rs:646 sets context_size: startup.ctx_size (1:1, not divided by n_parallel)
  • Independent fields in config: src/server/config.rs:224-225 declares pub context_size: usize and pub n_parallel: usize as separate fields
  • KV cache pool sized at full depth: src/server/scheduler.rs:392 computes let pool_capacity = max_batch_size + max_queue_depth; and each cache is allocated at full context_size
  • Docs make no mention of context division: docs/en/getting-started/configuration.md describes --n-parallel only as a concurrency control, and docs/CONTINUOUS_BATCHING.md confirms the continuous-batching design with full-depth per-sequence KV caches

Evidence in llama.cpp (upstream)

llama.cpp computes n_ctx_per_seq = n_ctx / n_seq_max. Documented in:

Why this matters for downstream clients

Backend.AI GO (lablup/backend.ai-go) emits --parallel N to both engines from the same ServerConfig.to_args() code path (src-tauri/src/process/types.rs:996-1002). A single i18n description in the user-facing UI says "Total context is divided equally among slots (e.g., 4096 context with 2 parallel slots gives 2048 tokens per slot)". That is correct for llama-server but wrong for mlxcel-server, where the same settings give each of the 2 slots the full 4096 tokens. Users following the same UI guidance therefore get very different memory budgets and per-slot context windows depending on which engine they're using.

Concrete consequences:

  • Memory planning is broken across engines: a user setting --parallel 4 --ctx-size 32768 on a 70B model expects llama.cpp-shaped memory (four slots of 8192 tokens each) and instead gets four slots of 32768 tokens each on mlxcel, roughly 4x the KV cache, which can OOM.
  • Application-level defaults that make sense for llama.cpp (e.g., "set parallel to 2 for agent workloads") become memory-doubling decisions on mlxcel.
  • There is no pre-flight memory check in mlxcel (the startup path in src/server/startup.rs performs no validation and applies no explicit clamp), so the mismatch goes unreported until it fails at runtime.
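
A rough back-of-the-envelope estimate of the first consequence is sketched below. The model shape is an assumption and is not stated anywhere in this issue: a Llama-style 70B layout with 80 layers, 8 KV heads (grouped-query attention), head dimension 128, and a 16-bit KV cache. The struct and function names are hypothetical. Real numbers will differ with KV quantization, other architectures, and each engine's actual allocation strategy, and model weights are not included.

```rust
/// Assumed model shape, used only for this estimate.
pub struct KvShape {
    pub n_layers: u64,
    pub n_kv_heads: u64,
    pub head_dim: u64,
    pub bytes_per_elem: u64,
}

impl KvShape {
    /// Bytes of KV cache per token: one K and one V vector per layer.
    pub fn bytes_per_token(&self) -> u64 {
        2 * self.n_layers * self.n_kv_heads * self.head_dim * self.bytes_per_elem
    }
}

/// Total KV cache bytes for `slots` caches of `tokens_per_slot` tokens each.
pub fn kv_bytes(shape: &KvShape, tokens_per_slot: u64, slots: u64) -> u64 {
    shape.bytes_per_token() * tokens_per_slot * slots
}

pub const GIB: u64 = 1 << 30;

/// The assumed 70B shape described in the lead-in.
pub fn assumed_70b_shape() -> KvShape {
    KvShape { n_layers: 80, n_kv_heads: 8, head_dim: 128, bytes_per_elem: 2 }
}
```

Under these assumptions, `kv_bytes(&assumed_70b_shape(), 8192, 4)` is 10 GiB (the llama.cpp interpretation of `--ctx-size 32768 --parallel 4`), while `kv_bytes(&assumed_70b_shape(), 32768, 4)` is 40 GiB (the current mlxcel interpretation).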

Proposed Solution

Align mlxcel's --parallel semantics with llama.cpp's behavior so the same flag value produces equivalent per-slot context windows and equivalent memory footprints across the two engines.

Specifically:

  • When --parallel N is passed alongside --ctx-size C, each active slot's per-sequence context should be C / N, computed as max(1, floor(C / N)). Configurations that fall below a sensible minimum (e.g., C / N < 512) should be rejected at startup with a clear error message.
  • The KV cache pool at src/server/scheduler.rs:392 should allocate per-slot caches at the divided depth, not the full depth.
  • Total KV memory for a given (C, N) pair should be roughly constant in N — matching llama.cpp.
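
The division and floor check described above could look roughly like the following sketch. The constant, error type, and function name are hypothetical and do not exist in mlxcel today; the 512-token floor is the suggested value from this issue, not a settled one. Because anything below the floor is rejected, the max(1, ...) clamp is subsumed by the floor check here.

```rust
use std::fmt;

/// Suggested minimum per-slot context; the final value is the maintainers' call.
pub const MIN_CTX_PER_SLOT: usize = 512;

#[derive(Debug, PartialEq, Eq)]
pub enum SlotContextError {
    /// `--parallel 0` cannot be divided into.
    ZeroParallel,
    /// The divided context is smaller than the allowed floor.
    BelowFloor { ctx_size: usize, n_parallel: usize, per_slot: usize },
}

impl fmt::Display for SlotContextError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::ZeroParallel => write!(f, "--parallel must be at least 1"),
            Self::BelowFloor { ctx_size, n_parallel, per_slot } => write!(
                f,
                "--ctx-size {} with --parallel {} leaves {} tokens per slot; minimum is {}",
                ctx_size, n_parallel, per_slot, MIN_CTX_PER_SLOT
            ),
        }
    }
}

/// Per-slot context under llama.cpp-style division, validated at startup.
pub fn resolve_per_slot_context(
    ctx_size: usize,
    n_parallel: usize,
) -> Result<usize, SlotContextError> {
    if n_parallel == 0 {
        return Err(SlotContextError::ZeroParallel);
    }
    let per_slot = ctx_size / n_parallel; // integer division is floor(C / N)
    if per_slot < MIN_CTX_PER_SLOT {
        return Err(SlotContextError::BelowFloor { ctx_size, n_parallel, per_slot });
    }
    Ok(per_slot)
}
```

For example, `resolve_per_slot_context(32768, 4)` yields 8192 tokens per slot, while `resolve_per_slot_context(1024, 4)` is rejected because 256 tokens per slot is below the floor.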

Migration / backward compatibility options

This is a behavior change that may break existing mlxcel deployments relying on the current "full context per slot" semantics. Maintainers should choose one of the following — we have no strong preference; we just need parity:

  1. Direct change with CHANGELOG entry. Accept the break, document in CHANGELOG, bump minor version. Cleanest semantics; mirrors llama.cpp exactly.
  2. New flag --ctx-size-per-seq. Make per-slot context size an explicit, separate knob. When --ctx-size-per-seq is unset, treat --ctx-size as the total and apply llama.cpp-style division. Most flexible, most code.
  3. Compatibility mode flag --llama-server-compat. Opt-in to llama.cpp semantics; default keeps current behavior with a startup warning that the default will flip in a future release. Safest for existing users.

If a recommendation helps, (1) or (3) gives the cleanest long-term semantics.
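
As an illustration of option 3 only, the sketch below shows how an opt-in flag could select the semantics and attach a startup warning when the legacy behavior is in effect. Everything here is hypothetical: --llama-server-compat is the flag proposed above and does not exist yet, and the struct, function name, and warning text are placeholders.

```rust
/// Outcome of resolving the per-slot context at startup (hypothetical type).
pub struct ResolvedContext {
    pub per_slot_context: usize,
    /// Message to log at startup, if any.
    pub warning: Option<String>,
}

/// Option 3: llama.cpp-style division only when the compat flag is set;
/// otherwise keep full context per slot and warn that the default will flip.
pub fn resolve_with_compat(
    ctx_size: usize,
    n_parallel: usize,
    llama_server_compat: bool,
) -> ResolvedContext {
    let n = n_parallel.max(1); // guard against division by zero
    if llama_server_compat {
        ResolvedContext { per_slot_context: ctx_size / n, warning: None }
    } else {
        // Only warn when the two interpretations actually differ.
        let warning = (n > 1).then(|| {
            format!(
                "--parallel {} currently gives each slot the full --ctx-size {}; \
                 a future release will divide it across slots \
                 (pass --llama-server-compat to opt in now)",
                n, ctx_size
            )
        });
        ResolvedContext { per_slot_context: ctx_size, warning }
    }
}
```

With a single slot the two interpretations agree, so no warning is produced in that case.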

Acceptance Criteria

  • --parallel N --ctx-size C results in per-slot KV cache sized for C / N tokens (with the chosen migration strategy above).
  • /slots endpoint (already exposed when --slots is enabled, per src/server/startup.rs:647) reports the per-slot context size, not the total.
  • Updated --help text on the --parallel flag (currently "Number of parallel request slots" at src/bin/mlx_server.rs:175) explicitly states the context-division semantics.
  • Documentation updated:
    • docs/en/getting-started/configuration.md
    • Korean mirror under docs/ko/
    • docs/CONTINUOUS_BATCHING.md
    • Man page docs/man/mlxcel-server.1
  • CHANGELOG entry under the version where this lands.
  • Test (integration or unit) confirming per-slot KV allocation matches ctx_size / n_parallel for several (C, N) pairs.
  • Error message at startup if ctx_size / n_parallel falls below a sensible floor (suggested: 512 tokens).
  • Total KV memory footprint for a given (C, N) pair is verified to be roughly constant in N (matching llama.cpp).
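
The test-related criteria above could be exercised with a table-driven check along these lines. This is a sketch under assumptions: the helper name and the (C, N) pairs are hypothetical, and the closure stands in for whatever mlxcel API ends up reporting the tokens allocated per slot. The second check follows from the first, but it is spelled out so that each criterion maps to one assertion.

```rust
/// Checks an allocator against the acceptance criteria for several (C, N) pairs.
/// `allocated_tokens_per_slot(ctx_size, n_parallel)` is a stand-in for the real
/// per-slot KV allocation, which would be queried from the server in a real test.
pub fn check_per_slot_allocation<F>(allocated_tokens_per_slot: F) -> Result<(), String>
where
    F: Fn(usize, usize) -> usize,
{
    // (ctx_size, n_parallel) pairs, including one that does not divide evenly.
    const CASES: [(usize, usize); 5] =
        [(4096, 1), (4096, 2), (8192, 3), (32768, 4), (131072, 8)];

    for &(ctx_size, n_parallel) in CASES.iter() {
        let expected = ctx_size / n_parallel;
        let got = allocated_tokens_per_slot(ctx_size, n_parallel);

        // Criterion: per-slot allocation matches ctx_size / n_parallel.
        if got != expected {
            return Err(format!(
                "C={} N={}: expected {} tokens per slot, got {}",
                ctx_size, n_parallel, expected, got
            ));
        }

        // Criterion: total capacity is roughly constant in N, i.e. it never
        // exceeds C and loses fewer than N tokens to rounding.
        let total = got * n_parallel;
        if total > ctx_size || ctx_size - total >= n_parallel {
            return Err(format!(
                "C={} N={}: total capacity {} is not within rounding of C",
                ctx_size, n_parallel, total
            ));
        }
    }
    Ok(())
}
```

Passing a divided allocator such as `|c, n| c / n` satisfies the check, while the current full-context behavior, modeled as `|c, _n| c`, fails on the first pair with N > 1.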

Technical Considerations

Cross-references

Downstream coordination

A separate backend.ai-go memory-guard issue is already filed to add pre-flight memory checks downstream. Once this mlxcel-side change lands, the memory math on the Backend.AI GO side becomes engine-independent and that guard can compute a single number for both engines.

Impact scope

This is a breaking change candidate that affects every downstream embedder of mlxcel-server, not just Backend.AI GO. The migration path (option 1 / 2 / 3 above) is the most important call for maintainers to make first.

Metadata

    Labels

      • area:cli (Command-line interface / CLI flags)
      • area:core (mlxcel-core: MLX FFI, primitives, KV cache, layers)
      • area:docs (User and developer documentation)
      • area:inference (Generation, sampling, decoding (incl. speculative, DRY))
      • impact:breaking (Breaking change requiring migration)
      • priority:high (High priority)
      • status:ready (Ready to be worked on)
      • type:enhancement (New features, capabilities, or significant additions)
