
feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map #9563

Merged

mudler merged 6 commits into master from feat/vllm-conf on Apr 28, 2026

Conversation

@richiejp (Collaborator) commented Apr 25, 2026

Description

Allow arbitrary vLLM options to be set, exposing far more of vLLM's features.

Amongst many other things, this unlocks speculative decoding with drafter models like dflash!

Example config:

backend: vllm
context_size: 8192
engine_args:
    enable_chunked_prefill: true
    enable_prefix_caching: true
    max_num_batched_tokens: 8192
    speculative_config:
        method: dflash
        model: z-lab/Qwen3.5-4B-DFlash
        num_speculative_tokens: 15
function:
    disable_no_action: true
    grammar:
        disable: true
gpu_memory_utilization: 0.9
max_model_len: 8192
name: qwen3.5-4b-dflash
parameters:
    model: Qwen/Qwen3.5-4B
quantization: fp8
template:
    use_tokenizer_template: true
trust_remote_code: true

Notes for Reviewers

Signed commits

  • Yes, I signed my commits.

Comment thread on backend/backend.proto
// EngineArgs carries a JSON-encoded map of backend-native engine arguments
// applied verbatim to the backend's engine constructor (e.g. vLLM AsyncEngineArgs).
// Unknown keys produce an error at LoadModel time.
string EngineArgs = 73;
mudler (Owner):

While I'm OK with it in general, we already carry a repeated string Options = 62; that is already used for this purpose (which itself calls for refactoring, as it would make much more sense to have a map<string, string> instead).

richiejp (Collaborator, Author):

I'm not sure what you mean, but EngineArgs is a JSON string because it is a nested structure, so if we made it a map<string, string> the values would still need to be JSON in some cases, or else we would have to flatten the structure.

I don't have a strong opinion on whether to flatten it, although if we did, Options could be kept as-is or switched to a map.
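
For illustration, a minimal Python sketch of the JSON round-trip described above; the ModelOptions.EngineArgs field name comes from the proto snippet at the top of this thread, everything else here is illustrative:

import json

# Nested engine_args as written in the model YAML. A flat map<string, string>
# could not carry speculative_config without a second layer of encoding,
# which is why the whole map is JSON-encoded into one string proto field.
engine_args = {
    "enable_prefix_caching": True,
    "speculative_config": {"method": "dflash", "num_speculative_tokens": 15},
}

wire_value = json.dumps(engine_args)   # stored in ModelOptions.EngineArgs (string)
decoded = json.loads(wire_value)       # decoded again by the Python backend
assert decoded["speculative_config"]["method"] == "dflash"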

mudler (Owner):

OK, cool. I thought it was merely used for key/value pairs, and since we already had some options I didn't want to reintroduce duplicates, but I'm good with it; I have no strong opinion.

Comment thread on core/backend/options_internal_test.go (outdated)
"github.com/mudler/LocalAI/core/config"
)

func TestGrpcModelOpts_EngineArgsSerialization(t *testing.T) {
mudler (Owner):

This should be a Ginkgo test, for consistency with the rest of the codebase.

richiejp (Collaborator, Author):

This is becoming quite a recurring issue, so I'll try creating a linter to prevent it in another PR.

mudler previously approved these changes Apr 27, 2026
@mudler mudler marked this pull request as ready for review April 27, 2026 12:35
@mudler mudler enabled auto-merge (squash) April 27, 2026 12:35
@richiejp (Collaborator, Author):

@mudler you may want to re-review, as I had to add a bunch of stuff to get the latest vLLM running on CUDA 13. I have similar changes coming for Intel as well, but they are worse, so I will probably do those in a different PR.

LocalAI's vLLM backend wraps a small typed subset of vLLM's
AsyncEngineArgs (quantization, tensor_parallel_size, dtype, etc.).
Anything outside that subset -- pipeline/data/expert parallelism,
speculative_config, kv_transfer_config, all2all_backend, prefix
caching, chunked prefill, etc. -- requires a new protobuf field, a
Go struct field, an options.go line, and a backend.py mapping per
feature. That cadence is the bottleneck on shipping vLLM's
production feature set.

Add a generic `engine_args:` map on the model YAML that is
JSON-serialised into a new ModelOptions.EngineArgs proto field and
applied verbatim to AsyncEngineArgs at LoadModel time. Validation
is done by the Python backend via dataclasses.fields(); unknown
keys fail with the closest valid name as a hint.
dataclasses.replace() is used so vLLM's __post_init__ re-runs and
auto-converts dict values into nested config dataclasses
(CompilationConfig, AttentionConfig, ...). speculative_config and
kv_transfer_config flow through as dicts; vLLM converts them at
engine init.
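
Illustratively, the overlay path boils down to something like the
following sketch (not the backend's literal code: the helper name and
error wording are assumptions, and difflib is one plausible way to
produce the closest-name hint):

import dataclasses
import difflib
import json

from vllm.engine.arg_utils import AsyncEngineArgs

def apply_engine_args(args: AsyncEngineArgs, raw_json: str) -> AsyncEngineArgs:
    # Overlay the JSON-encoded engine_args map onto the typed AsyncEngineArgs.
    overrides = json.loads(raw_json or "{}")
    valid = {f.name for f in dataclasses.fields(AsyncEngineArgs)}
    for key in overrides:
        if key not in valid:
            hint = difflib.get_close_matches(key, sorted(valid), n=1)
            suffix = f" (did you mean {hint[0]!r}?)" if hint else ""
            raise ValueError(f"unknown engine_args key {key!r}{suffix}")
    # dataclasses.replace() re-runs __post_init__, so dict values such as
    # speculative_config are converted into vLLM's nested config dataclasses.
    return dataclasses.replace(args, **overrides)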

Operators can now write:

  engine_args:
    data_parallel_size: 8
    enable_expert_parallel: true
    all2all_backend: deepep_low_latency
    speculative_config:
      method: deepseek_mtp
      num_speculative_tokens: 3
    kv_cache_dtype: fp8

without further proto/Go/Python plumbing per field.

Production defaults seeded by hooks_vllm.go: enable_prefix_caching
and enable_chunked_prefill default to true unless explicitly set.

Existing typed YAML fields (gpu_memory_utilization,
tensor_parallel_size, etc.) remain for back-compat; engine_args
overrides them when both are set.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't
load on a cu130 host. Switch the cublas13 build to vLLM's per-tag cu130
simple-index (https://wheels.vllm.ai/0.20.0/cu130/) and pin
vllm==0.20.0. The cu130-flavoured wheel ships libcudart.so.13 and
includes the DFlash speculative-decoding method that landed in 0.20.0.

The cublas13 install gets --index-strategy=unsafe-best-match so that uv
consults both the cu130 index and PyPI when resolving; PyPI also publishes
vllm==0.20.0, but with cu12 binaries that error at import time.

Verified: Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash loads and serves chat
completions on RTX 5070 Ti (sm_120, cu130).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
vLLM's cu130 wheel index URL is itself version-locked
(wheels.vllm.ai/<TAG>/cu130/, no /latest/ alias upstream), so a vLLM
bump means rewriting two values atomically — the URL segment and the
version constraint. bump_deps.sh handles git-sha-in-Makefile only;
add a sibling bump_vllm_wheel.sh and a matching workflow job that
mirrors the existing matrix's PR-creation pattern.

The bumper queries /releases/latest (which excludes prereleases),
strips the leading 'v', and seds both lines unconditionally. When the
file is already on the latest tag the rewrite is a no-op and
peter-evans/create-pull-request opens no PR.
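
As a rough illustration of that flow (the real bump_vllm_wheel.sh is a
shell script; the requirements path below is a hypothetical stand-in):

import json
import re
import urllib.request

REQUIREMENTS_FILE = "backend/python/vllm/requirements-cublas13.txt"  # assumed path

# /releases/latest excludes prereleases, unlike /releases or /tags.
with urllib.request.urlopen(
    "https://api.github.com/repos/vllm-project/vllm/releases/latest"
) as resp:
    tag = json.load(resp)["tag_name"]        # e.g. "v0.20.0"
version = tag.removeprefix("v")              # e.g. "0.20.0"

with open(REQUIREMENTS_FILE) as f:
    text = f.read()

# Rewrite both version-locked values unconditionally; if the file is already
# on the latest tag this is a no-op and the PR-creation step has nothing to commit.
text = re.sub(r"wheels\.vllm\.ai/[^/]+/cu130/", f"wheels.vllm.ai/{version}/cu130/", text)
text = re.sub(r"vllm==[\d.]+", f"vllm=={version}", text)

with open(REQUIREMENTS_FILE, "w") as f:
    f.write(text)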

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The new engine_args: map plumbs arbitrary AsyncEngineArgs through to
vLLM, but the public docs only covered the basic typed fields. Add a
short subsection in the vLLM section explaining the typed/generic
split and showing a worked DFlash speculative-decoding config, with
pointers to vLLM's SpeculativeConfig reference and z-lab's drafter
collection.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
@mudler (Owner) left a comment:

looking good!

@mudler mudler disabled auto-merge April 28, 2026 22:49
@mudler mudler merged commit 4916f8c into master Apr 28, 2026
51 checks passed
@mudler mudler deleted the feat/vllm-conf branch April 28, 2026 22:49
@localai-bot localai-bot added the enhancement (New feature or request) label May 9, 2026

Labels

enhancement (New feature or request)

3 participants