Skip to content

v0.9.0

Latest

Choose a tag to compare

@DongheJin DongheJin released this 14 Apr 11:44
· 317 commits to main since this release
145c290

Highlights

Model Support

NPU

  • Support GLM-5 model.
  • Support GLM4.7-Flash model.
  • Support Qwen3-next model.
  • Support OneRec model.
  • Support Qwen3.5/Qwen3.5-MoE model.

CUDA

  • Support LongCat-Image model.
  • Support LongCat-Image-Edit model.

MLU

  • Support DeepSeek-V3.2 W4A8 MoE model.
  • Support GLM-5 W8A8 model.

ILU

  • Support Qwen3-8B model.
  • Support Qwen3-30B-MoE model.

Feature

  • Adapt NPU builds to CANN 8.5 and PyTorch 2.7.1.
  • Support graph mode for the LLM part of VLM models on NPU devices.
  • Support context parallelism for NPU DeepSeek-V3.2 / GLM-5.
  • Support DeepSeek-V3.2 prefill sequence parallel on MLU devices.
  • Support rolling weight loading and loading model weights with varied prefixes.
  • Support dynamic and scalable multi-model serving.
  • Support bidirectional remote-host to local-device KV cache transfer and batch offload.
  • Support Qwen3 xattention on NPU devices.
  • Support prefix cache for DeepSeek-V3.2.
  • Support chunked prefill on CUDA devices.
  • Support embedding interface for all generate LLM models.
  • Support Anthropic Messages API.
  • Support the new v1/sample interface.
  • Support a single xLLM instance connecting to multiple xLLM services.
  • Support startup progress bar, worker health check, and unified request statistics logging.
  • Optimize Qwen3 MoE performance on NPU devices.
  • Add CUDA Graph Executor and piecewise prefill graph.
  • Support KV cache quantization on MLU devices.
  • Add VMM-based allocators to reuse graph buffers and physical memory.
  • Improve FP8 GEMM, fused RMSNorm, fused MoE, xattention, and activation kernel performance.

Bugfix

  • Support the new compressed-tensors FP8 config and fix Qwen2 prompt length.
  • Fix Qwen3 MoE VL parameter settings on MLU devices.
  • Fix Qwen VL issues on MLU devices and Qwen2.5 chunked-prefill accuracy on NPU.
  • Fix DeepSeek tool-call, prefix-cache, DP/MTP, and PD-disagg related issues.
  • Fix GLM-4.7 streaming function call issues and GLM detector stability issues.
  • Fix graph mode, schedule overlap, KV cache, and REC multi-round stability issues.
  • Fix multiple compile, link, env setup, and worker lifecycle issues.