Bugfix release — vLLM (safetensors) models now load and chat correctly.
Fixed
- Chat on vLLM no longer fails with "Engine returned 400." Tool definitions were attached to every engine, but vLLM rejects a
toolsarray unless launched with--enable-auto-tool-choice+ a--tool-call-parser. Tools are now sent only to engines that accept them (the llama.cpp family). Tool-calling on vLLM remains unsupported for now. - Correct quant classification for vLLM/safetensors models. Compressed-tensors checkpoints were mislabeled as MLX
fp16; the quant is now read fromquantization_config(e.g.w4a16), so the model card shows the real quant instead of "MLX". - The vLLM "Max model length" control is settable again. Multimodal configs nest
max_position_embeddingsundertext_config; the scanner now reads it, so a model's native context length is no longer reported as0(which had clamped the input to 0).