feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map #9563
Conversation
```proto
// EngineArgs carries a JSON-encoded map of backend-native engine arguments
// applied verbatim to the backend's engine constructor (e.g. vLLM AsyncEngineArgs).
// Unknown keys produce an error at LoadModel time.
string EngineArgs = 73;
```
while I'm ok with it in general, we already carry a `repeated string Options = 62;` that is already used for this purpose (which already calls for refactoring, as it would make much more sense to have a `map<string, string>` instead)
I'm not sure what you mean, but EngineArgs is a JSON string because it carries a nested structure, so if we made it a `map<string, string>` the values would still need to be JSON in some cases, or else we'd have to flatten the structure.
I don't have a strong opinion on whether to flatten it, although if we did, Options could be kept as-is or switched to a map.
ok cool, I thought it was merely used for key/value pairs, and since we already had some options I didn't want to reintroduce duplicates, but I'm good with it; no strong opinion
Force-pushed from 42faf29 to 2bbda59
| "github.com/mudler/LocalAI/core/config" | ||
| ) | ||
|
|
||
| func TestGrpcModelOpts_EngineArgsSerialization(t *testing.T) { |
this should be a Ginkgo test, for consistency with all the other code
This is becoming quite a recurring issue, so I'll try creating a linter to prevent it in another PR.
Force-pushed from 2bbda59 to bef490b
Force-pushed from 2d8745c to ad0d523
@mudler you may want to re-review, as I had to add a bunch of stuff to get the latest vLLM running on CUDA 13. I have similar changes coming for Intel as well, but worse; I'll probably do that in a different PR.
Force-pushed from ad0d523 to 7f9b2dd
LocalAI's vLLM backend wraps a small typed subset of vLLM's
AsyncEngineArgs (quantization, tensor_parallel_size, dtype, etc.).
Anything outside that subset -- pipeline/data/expert parallelism,
speculative_config, kv_transfer_config, all2all_backend, prefix
caching, chunked prefill, etc. -- requires a new protobuf field, a
Go struct field, an options.go line, and a backend.py mapping per
feature. That cadence is the bottleneck on shipping vLLM's
production feature set.
Add a generic `engine_args:` map on the model YAML that is
JSON-serialised into a new ModelOptions.EngineArgs proto field and
applied verbatim to AsyncEngineArgs at LoadModel time. Validation
is done by the Python backend via dataclasses.fields(); unknown
keys fail with the closest valid name as a hint.
dataclasses.replace() is used so vLLM's __post_init__ re-runs and
auto-converts dict values into nested config dataclasses
(CompilationConfig, AttentionConfig, ...). speculative_config and
kv_transfer_config flow through as dicts; vLLM converts them at
engine init.
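
A minimal sketch of that validate-then-apply flow, assuming the backend constructs AsyncEngineArgs first and then layers the decoded engine_args map on top; the function name apply_engine_args is illustrative, not the actual backend.py symbol:

```python
# Illustrative sketch only; the real backend.py may be structured differently.
import dataclasses
import difflib

from vllm.engine.arg_utils import AsyncEngineArgs


def apply_engine_args(base: AsyncEngineArgs, overrides: dict) -> AsyncEngineArgs:
    valid = {f.name for f in dataclasses.fields(AsyncEngineArgs)}
    for key in overrides:
        if key not in valid:
            # Fail fast, suggesting the closest valid field name as a hint.
            close = difflib.get_close_matches(key, sorted(valid), n=1)
            hint = f" (did you mean '{close[0]}'?)" if close else ""
            raise ValueError(f"unknown engine_args key '{key}'{hint}")
    # dataclasses.replace() builds a new instance, so __post_init__ runs again
    # and dict values are converted into vLLM's nested config objects.
    return dataclasses.replace(base, **overrides)
```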
Operators can now write:

```yaml
engine_args:
  data_parallel_size: 8
  enable_expert_parallel: true
  all2all_backend: deepep_low_latency
  speculative_config:
    method: deepseek_mtp
    num_speculative_tokens: 3
  kv_cache_dtype: fp8
```

without further proto/Go/Python plumbing per field.
Production defaults seeded by hooks_vllm.go: enable_prefix_caching
and enable_chunked_prefill default to true unless explicitly set.
Existing typed YAML fields (gpu_memory_utilization,
tensor_parallel_size, etc.) remain for back-compat; engine_args
overrides them when both are set.
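
A rough sketch of how that conditional seeding could look on the Go side, assuming hooks_vllm.go operates on the decoded engine_args map; the function name and structure are illustrative only:

```go
// Illustrative only; the real hooks_vllm.go may differ.
func seedVLLMDefaults(engineArgs map[string]any) map[string]any {
	if engineArgs == nil {
		engineArgs = map[string]any{}
	}
	// Seed production defaults only when the operator has not set them explicitly.
	for key, def := range map[string]any{
		"enable_prefix_caching":  true,
		"enable_chunked_prefill": true,
	} {
		if _, ok := engineArgs[key]; !ok {
			engineArgs[key] = def
		}
	}
	return engineArgs
}
```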
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't load on a cu130 host. Switch the cublas13 build to vLLM's per-tag cu130 simple-index (https://wheels.vllm.ai/0.20.0/cu130/) and pin vllm==0.20.0. The cu130-flavoured wheel ships libcudart.so.13 and includes the DFlash speculative-decoding method that landed in 0.20.0.

cublas13 install gets --index-strategy=unsafe-best-match so uv consults both the cu130 index and PyPI when resolving — PyPI also publishes vllm==0.20.0, but with cu12 binaries that error at import time.

Verified: Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash loads and serves chat completions on RTX 5070 Ti (sm_120, cu130).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
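For illustration, one plausible shape of the resulting install step, assuming uv pip is invoked directly (the actual Makefile/requirements wiring may differ):

```sh
# Sketch only; the real cublas13 build wiring may differ.
uv pip install "vllm==0.20.0" \
  --extra-index-url https://wheels.vllm.ai/0.20.0/cu130/ \
  --index-strategy unsafe-best-match
```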
vLLM's cu130 wheel index URL is itself version-locked (wheels.vllm.ai/&lt;TAG&gt;/cu130/, no /latest/ alias upstream), so a vLLM bump means rewriting two values atomically — the URL segment and the version constraint. bump_deps.sh handles git-sha-in-Makefile only; add a sibling bump_vllm_wheel.sh and a matching workflow job that mirrors the existing matrix's PR-creation pattern.

The bumper queries /releases/latest (which excludes prereleases), strips the leading 'v', and seds both lines unconditionally. When the file is already on the latest tag the rewrite is a no-op and peter-evans/create-pull-request opens no PR.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
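A minimal sketch of what such a bumper could look like; the requirements file path and sed patterns below are assumptions, not the merged script:

```bash
#!/usr/bin/env bash
# Illustrative sketch of bump_vllm_wheel.sh; paths and patterns are assumed.
set -euo pipefail

REQ_FILE="backend/python/vllm/requirements-cublas13.txt"  # assumed location

# /releases/latest excludes prereleases, so only stable tags are picked up.
tag=$(curl -fsSL https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r '.tag_name')
version="${tag#v}"  # strip the leading 'v'

# Rewrite both version-locked values; a no-op when already on the latest tag.
sed -i "s|wheels\.vllm\.ai/[^/]*/cu130/|wheels.vllm.ai/${version}/cu130/|" "$REQ_FILE"
sed -i "s|vllm==.*|vllm==${version}|" "$REQ_FILE"
```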
The new engine_args: map plumbs arbitrary AsyncEngineArgs through to vLLM, but the public docs only covered the basic typed fields. Add a short subsection in the vLLM section explaining the typed/generic split and showing a worked DFlash speculative-decoding config, with pointers to vLLM's SpeculativeConfig reference and z-lab's drafter collection.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Force-pushed from 892d6bd to ecfc956
Description
Allow arbitrary vLLM options to be set, exposing far more of vLLM's features.
Amongst many other things, this unlocks speculative decoding with drafter models like DFlash!
Example config:
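A sketch of one possible config, reconstructed from the commit messages above; the speculative_config method string is an assumption rather than vLLM's confirmed DFlash identifier:

```yaml
# Reconstructed example; the 'method' value is an assumption.
name: qwen3.5-4b-dflash
backend: vllm
parameters:
  model: Qwen3.5-4B
engine_args:
  speculative_config:
    method: dflash            # assumed; check vLLM's SpeculativeConfig reference
    model: z-lab/Qwen3.5-4B-DFlash
    num_speculative_tokens: 3
```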
Notes for Reviewers
Signed commits