v1.4.2

@mohitsoni48 mohitsoni48 released this 23 Jun 16:08
· 1 commit to main since this release
524087e

Bugfix release: vLLM (safetensors) models now load and chat correctly.

Fixed

  • Chat on vLLM no longer fails with "Engine returned 400." Tool definitions were attached to requests for every engine, but vLLM rejects a tools array unless the server was launched with --enable-auto-tool-choice and a --tool-call-parser. Tools are now sent only to engines that accept them (the llama.cpp family). Tool calling on vLLM remains unsupported for now.
  • Correct quant classification for vLLM/safetensors models. Compressed-tensors checkpoints were mislabeled as MLX fp16; the quant is now read from quantization_config (e.g. w4a16), so the model card shows the real quant instead of "MLX".
  • The vLLM "Max model length" control is settable again. Multimodal configs nest max_position_embeddings under text_config; the scanner now reads it, so a model's native context length is no longer reported as 0 (which had clamped the input to 0).
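
To illustrate the tool-gating fix, here is a minimal sketch in Python of attaching tools only for engines that accept them. It is not the project's actual code: the function name build_chat_payload, the engine identifiers, and the TOOL_CAPABLE_ENGINES set are hypothetical and used only for illustration.

```python
from typing import Any, Optional

# Hypothetical engine identifiers. The release notes only say that tools are
# sent to "the llama.cpp family"; the exact names are an assumption.
TOOL_CAPABLE_ENGINES = {"llama.cpp"}


def build_chat_payload(
    engine: str,
    model: str,
    messages: list[dict[str, Any]],
    tools: Optional[list[dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Build a chat request body, adding `tools` only for capable engines."""
    payload: dict[str, Any] = {"model": model, "messages": messages}
    # Omit the key entirely for other engines: a vLLM server started without
    # --enable-auto-tool-choice and a --tool-call-parser rejects a tools array.
    if tools and engine in TOOL_CAPABLE_ENGINES:
        payload["tools"] = tools
    return payload
```

Leaving the key out, rather than sending an empty array, keeps the request identical to a plain chat request for engines that do not take tools.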
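
The two scanner fixes can be sketched the same way: a rough Python example that reads a parsed config.json dictionary. The function names are hypothetical, and the quant lookup assumes a compressed-tensors layout in which quantization_config holds config_groups entries with weights and input_activations sections; treating missing activation settings as 16-bit is also an assumption, not something the release notes state.

```python
from typing import Any, Optional


def native_context_length(config: dict[str, Any]) -> int:
    """Return max_position_embeddings, also checking a nested text_config."""
    # Multimodal configs nest the value under text_config, so check both levels.
    for section in (config, config.get("text_config")):
        if isinstance(section, dict):
            value = section.get("max_position_embeddings")
            if isinstance(value, int) and value > 0:
                return value
    return 0  # unknown; callers should not clamp user input to this value


def quant_label(config: dict[str, Any]) -> Optional[str]:
    """Derive a label such as 'w4a16' from quantization_config, if present."""
    qc = config.get("quantization_config")
    if not isinstance(qc, dict):
        return None
    groups = qc.get("config_groups")  # assumed compressed-tensors layout
    if not isinstance(groups, dict):
        return None
    for group in groups.values():
        if not isinstance(group, dict):
            continue
        weights = group.get("weights")
        if not isinstance(weights, dict) or "num_bits" not in weights:
            continue
        acts = group.get("input_activations")
        # Assumption: activations without their own settings stay at 16 bits.
        act_bits = acts.get("num_bits", 16) if isinstance(acts, dict) else 16
        return f"w{weights['num_bits']}a{act_bits}"
    return None
```

Returning None when no quantization_config is found lets the caller fall back to its existing classification instead of guessing a label.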