feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map #9563
Conversation
```proto
// EngineArgs carries a JSON-encoded map of backend-native engine arguments
// applied verbatim to the backend's engine constructor (e.g. vLLM AsyncEngineArgs).
// Unknown keys produce an error at LoadModel time.
string EngineArgs = 73;
```
while I'm ok with it in general, we already carry a `repeated string Options = 62;` that is already used for this purpose (which already calls for refactoring, as it would make much more sense to have a `map<string, string>` instead)
I'm not sure what you mean, but EngineArgs is a JSON string because it carries a nested structure, so if we made it a `map<string, string>` the values would still need to be JSON in some cases, or else we'd have to flatten the structure.
I don't have a strong opinion on whether to flatten it, although if we did, Options could be kept as-is or switched to a map.
ok cool, I thought it was merely used for key/value pairs, and since we already had some options I didn't want to reintroduce duplicates, but I'm good with it; no strong opinion
Force-pushed from 42faf29 to 2bbda59
| "github.com/mudler/LocalAI/core/config" | ||
| ) | ||
|
|
||
| func TestGrpcModelOpts_EngineArgsSerialization(t *testing.T) { |
this should be a Ginkgo test, for consistency with all the other code
This is becoming quite a recurring issue, so I'll try creating a linter to prevent it in another PR.
Force-pushed from 2bbda59 to bef490b
Force-pushed from 2d8745c to ad0d523
@mudler you may want to re-review, as I had to add a bunch of stuff to get the latest vLLM running on CUDA 13. I have similar changes coming for Intel as well, but worse; I'll probably do that in a different PR.
Force-pushed from ad0d523 to 7f9b2dd
LocalAI's vLLM backend wraps a small typed subset of vLLM's
AsyncEngineArgs (quantization, tensor_parallel_size, dtype, etc.).
Anything outside that subset -- pipeline/data/expert parallelism,
speculative_config, kv_transfer_config, all2all_backend, prefix
caching, chunked prefill, etc. -- requires a new protobuf field, a
Go struct field, an options.go line, and a backend.py mapping per
feature. That cadence is the bottleneck on shipping vLLM's
production feature set.
Add a generic `engine_args:` map on the model YAML that is
JSON-serialised into a new ModelOptions.EngineArgs proto field and
applied verbatim to AsyncEngineArgs at LoadModel time. Validation
is done by the Python backend via dataclasses.fields(); unknown
keys fail with the closest valid name as a hint.
dataclasses.replace() is used so vLLM's __post_init__ re-runs and
auto-converts dict values into nested config dataclasses
(CompilationConfig, AttentionConfig, ...). speculative_config and
kv_transfer_config flow through as dicts; vLLM converts them at
engine init.
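
A minimal sketch of that validate-then-apply flow, assuming the backend constructs AsyncEngineArgs first and then layers the decoded engine_args map on top; the function name apply_engine_args is illustrative, not the actual backend.py symbol:

```python
# Illustrative sketch only; the real backend.py may be structured differently.
import dataclasses
import difflib

from vllm.engine.arg_utils import AsyncEngineArgs


def apply_engine_args(base: AsyncEngineArgs, overrides: dict) -> AsyncEngineArgs:
    valid = {f.name for f in dataclasses.fields(AsyncEngineArgs)}
    for key in overrides:
        if key not in valid:
            # Fail fast, suggesting the closest valid field name as a hint.
            close = difflib.get_close_matches(key, sorted(valid), n=1)
            hint = f" (did you mean '{close[0]}'?)" if close else ""
            raise ValueError(f"unknown engine_args key '{key}'{hint}")
    # dataclasses.replace() builds a new instance, so __post_init__ runs again
    # and dict values are converted into vLLM's nested config objects.
    return dataclasses.replace(base, **overrides)
```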
Operators can now write:

```yaml
engine_args:
  data_parallel_size: 8
  enable_expert_parallel: true
  all2all_backend: deepep_low_latency
  speculative_config:
    method: deepseek_mtp
    num_speculative_tokens: 3
  kv_cache_dtype: fp8
```

without further proto/Go/Python plumbing per field.
Production defaults seeded by hooks_vllm.go: enable_prefix_caching
and enable_chunked_prefill default to true unless explicitly set.
Existing typed YAML fields (gpu_memory_utilization,
tensor_parallel_size, etc.) remain for back-compat; engine_args
overrides them when both are set.
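
A rough sketch of how that conditional seeding could look on the Go side, assuming hooks_vllm.go operates on the decoded engine_args map; the function name and structure are illustrative only:

```go
// Illustrative only; the real hooks_vllm.go may differ.
func seedVLLMDefaults(engineArgs map[string]any) map[string]any {
	if engineArgs == nil {
		engineArgs = map[string]any{}
	}
	// Seed production defaults only when the operator has not set them explicitly.
	for key, def := range map[string]any{
		"enable_prefix_caching":  true,
		"enable_chunked_prefill": true,
	} {
		if _, ok := engineArgs[key]; !ok {
			engineArgs[key] = def
		}
	}
	return engineArgs
}
```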
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't load on a cu130 host. Switch the cublas13 build to vLLM's per-tag cu130 simple-index (https://wheels.vllm.ai/0.20.0/cu130/) and pin vllm==0.20.0. The cu130-flavoured wheel ships libcudart.so.13 and includes the DFlash speculative-decoding method that landed in 0.20.0.

cublas13 install gets --index-strategy=unsafe-best-match so uv consults both the cu130 index and PyPI when resolving — PyPI also publishes vllm==0.20.0, but with cu12 binaries that error at import time.

Verified: Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash loads and serves chat completions on RTX 5070 Ti (sm_120, cu130).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
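For illustration, one plausible shape of the resulting install step, assuming uv pip is invoked directly (the actual Makefile/requirements wiring may differ):

```sh
# Sketch only; the real cublas13 build wiring may differ.
uv pip install "vllm==0.20.0" \
  --extra-index-url https://wheels.vllm.ai/0.20.0/cu130/ \
  --index-strategy unsafe-best-match
```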
vLLM's cu130 wheel index URL is itself version-locked (wheels.vllm.ai/&lt;TAG&gt;/cu130/, no /latest/ alias upstream), so a vLLM bump means rewriting two values atomically — the URL segment and the version constraint. bump_deps.sh handles git-sha-in-Makefile only; add a sibling bump_vllm_wheel.sh and a matching workflow job that mirrors the existing matrix's PR-creation pattern.

The bumper queries /releases/latest (which excludes prereleases), strips the leading 'v', and seds both lines unconditionally. When the file is already on the latest tag the rewrite is a no-op and peter-evans/create-pull-request opens no PR.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
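A minimal sketch of what such a bumper could look like; the requirements file path and sed patterns below are assumptions, not the merged script:

```bash
#!/usr/bin/env bash
# Illustrative sketch of bump_vllm_wheel.sh; paths and patterns are assumed.
set -euo pipefail

REQ_FILE="backend/python/vllm/requirements-cublas13.txt"  # assumed location

# /releases/latest excludes prereleases, so only stable tags are picked up.
tag=$(curl -fsSL https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r '.tag_name')
version="${tag#v}"  # strip the leading 'v'

# Rewrite both version-locked values; a no-op when already on the latest tag.
sed -i "s|wheels\.vllm\.ai/[^/]*/cu130/|wheels.vllm.ai/${version}/cu130/|" "$REQ_FILE"
sed -i "s|vllm==.*|vllm==${version}|" "$REQ_FILE"
```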
The new engine_args: map plumbs arbitrary AsyncEngineArgs through to vLLM, but the public docs only covered the basic typed fields. Add a short subsection in the vLLM section explaining the typed/generic split and showing a worked DFlash speculative-decoding config, with pointers to vLLM's SpeculativeConfig reference and z-lab's drafter collection.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Force-pushed from 892d6bd to ecfc956
Description
Allow arbitrary vLLM options to be set, exposing far more of vLLM's features.
Amongst many other things, this unlocks speculative decoding with drafter models like DFlash!
Example config:
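A sketch of one possible config, reconstructed from the commit messages above; the speculative_config method string is an assumption rather than vLLM's confirmed DFlash identifier:

```yaml
# Reconstructed example; the 'method' value is an assumption.
name: qwen3.5-4b-dflash
backend: vllm
parameters:
  model: Qwen3.5-4B
engine_args:
  speculative_config:
    method: dflash            # assumed; check vLLM's SpeculativeConfig reference
    model: z-lab/Qwen3.5-4B-DFlash
    num_speculative_tokens: 3
```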
Notes for Reviewers
Signed commits