v0.9.0

Latest

Latest

DongheJin released this 14 Apr 11:44

· 317 commits to main since this release

145c290

Highlights

Model Support

NPU

Support GLM-5 model.
Support GLM4.7-Flash model.
Support Qwen3-next model.
Support OneRec model.
Support Qwen3.5/Qwen3.5-MoE model.

CUDA

Support LongCat-Image model.
Support LongCat-Image-Edit model.

MLU

Support DeepSeek-V3.2 W4A8 MoE model.
Support GLM-5 W8A8 model.

ILU

Support Qwen3-8B model.
Support Qwen3-30B-MoE model.

Feature

Adapt NPU builds to CANN 8.5 and PyTorch 2.7.1.
Support graph mode for the LLM part of VLM models on NPU devices.
Support context parallelism for NPU DeepSeek-V3.2 / GLM-5.
Support DeepSeek-V3.2 prefill sequence parallel on MLU devices.
Support rolling weight loading and loading model weights with varied prefixes.
Support dynamic and scalable multi-model serving.
Support bidirectional remote-host to local-device KV cache transfer and batch offload.
Support Qwen3 xattention on NPU devices.
Support prefix cache for DeepSeek-V3.2.
Support chunked prefill on CUDA devices.
Support embedding interface for all generate LLM models.
Support Anthropic Messages API.
Support the new v1/sample interface.
Support a single xLLM instance connecting to multiple xLLM services.
Support startup progress bar, worker health check, and unified request statistics logging.
Optimize Qwen3 MoE performance on NPU devices.
Add CUDA Graph Executor and piecewise prefill graph.
Support KV cache quantization on MLU devices.
Add VMM-based allocators to reuse graph buffers and physical memory.
Improve FP8 GEMM, fused RMSNorm, fused MoE, xattention, and activation kernel performance.

Bugfix

Support the new compressed-tensors FP8 config and fix Qwen2 prompt length.
Fix Qwen3 MoE VL parameter settings on MLU devices.
Fix Qwen VL issues on MLU devices and Qwen2.5 chunked-prefill accuracy on NPU.
Fix DeepSeek tool-call, prefix-cache, DP/MTP, and PD-disagg related issues.
Fix GLM-4.7 streaming function call issues and GLM detector stability issues.
Fix graph mode, schedule overlap, KV cache, and REC multi-round stability issues.
Fix multiple compile, link, env setup, and worker lifecycle issues.

Assets 2