From 1d28589f6984ed365433d11f5b6ebe3d6161a13f Mon Sep 17 00:00:00 2001
From: "Zhang, Weiwei1"
Date: Tue, 28 Oct 2025 15:06:47 +0800
Subject: [PATCH 1/3] update readme for sglang support

Signed-off-by: Zhang, Weiwei1
---
 README.md | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/README.md b/README.md
index 7303e77b4..a0c353057 100644
--- a/README.md
+++ b/README.md
@@ -27,6 +27,8 @@ and [fbaldassarri](https://huggingface.co/fbaldassarri). For usage instructions,
 
 ## 🆕 What's New
 
+[2025/10] AutoRound has been integrated into **SGLang**. You can now run models in the AutoRound format directly using the latest SGLang master branch.
+
 [2025/10] We enhanced the RTN mode (--iters 0) to significantly reduce quantization cost compared to the default tuning mode. Check out [this doc](./docs/opt_rtn.md) for some accuracy results. If you don’t have sufficient resources, you can use this mode for 4-bit quantization.
 
 [2025/10] We proposed a fast algorithm to generate **mixed bits/datatypes** schemes in minutes. Please
@@ -287,6 +289,26 @@ for output in outputs:
     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 ```
 
+
+### SGLang (Intel GPU/CUDA)
+Please note that support for MoE models and visual language models is currently limited.
+
+```python
+import sglang as sgl
+
+llm = sgl.Engine(model_path="Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound")
+prompts = [
+    "Hello, my name is",
+]
+sampling_params = {"temperature": 0.6, "top_p": 0.95}
+
+outputs = llm.generate(prompts, sampling_params)
+for prompt, output in zip(prompts, outputs):
+    print("===============================")
+    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
+```
+
+
 ### Transformers (CPU/Intel GPU/Gaudi/CUDA)
@@ -318,3 +340,4 @@ If you find AutoRound helpful, please ⭐ star the repo and share it with your c
+

From 7ed2dc148e9b67c4df72a7ca6374a15670d6aad2 Mon Sep 17 00:00:00 2001
From: "Zhang, Weiwei1"
Date: Tue, 28 Oct 2025 15:43:35 +0800
Subject: [PATCH 2/3] refine doc

Signed-off-by: Zhang, Weiwei1
---
 README.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/README.md b/README.md
index a0c353057..cee99cb6d 100644
--- a/README.md
+++ b/README.md
@@ -27,7 +27,7 @@ and [fbaldassarri](https://huggingface.co/fbaldassarri). For usage instructions,
 
 ## 🆕 What's New
 
-[2025/10] AutoRound has been integrated into **SGLang**. You can now run models in the AutoRound format directly using the latest SGLang master branch.
+[2025/10] AutoRound has been integrated into **SGLang**. You can now run models in the AutoRound format directly with SGLang at commit caa4819bfcdc1b0e081d2b93500ea3d4d2cb8e00 or later.
 
 [2025/10] We enhanced the RTN mode (--iters 0) to significantly reduce quantization cost compared to the default tuning mode. Check out [this doc](./docs/opt_rtn.md) for some accuracy results. If you don’t have sufficient resources, you can use this mode for 4-bit quantization.
 
 [2025/10] We proposed a fast algorithm to generate **mixed bits/datatypes** schemes in minutes. Please
@@ -270,7 +270,6 @@ ar.quantize_and_save(output_dir)
 
 ## Model Inference
 
 ### vLLM (CPU/Intel GPU/CUDA)
-Please note that support for the MoE models and visual language models is currently limited.
 
 ```python
 from vllm import LLM, SamplingParams

From c2063023d173d3ff7c4f980f9b3c7a22fd45081e Mon Sep 17 00:00:00 2001
From: Wenhua Cheng
Date: Tue, 28 Oct 2025 15:51:47 +0800
Subject: [PATCH 3/3] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index cee99cb6d..a321f3934 100644
--- a/README.md
+++ b/README.md
@@ -27,7 +27,7 @@ and [fbaldassarri](https://huggingface.co/fbaldassarri). For usage instructions,
 
 ## 🆕 What's New
 
-[2025/10] AutoRound has been integrated into **SGLang**. You can now run models in the AutoRound format directly with SGLang at commit caa4819bfcdc1b0e081d2b93500ea3d4d2cb8e00 or later.
+[2025/10] AutoRound has been integrated into **SGLang**. You can now run models in the AutoRound format directly with SGLang versions later than v0.5.4.
 
 [2025/10] We enhanced the RTN mode (--iters 0) to significantly reduce quantization cost compared to the default tuning mode. Check out [this doc](./docs/opt_rtn.md) for some accuracy results. If you don’t have sufficient resources, you can use this mode for 4-bit quantization.
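
A server-side counterpart to the offline `sgl.Engine` example the patches add to the README: a minimal sketch, assuming an SGLang build recent enough to include the AutoRound integration and SGLang's OpenAI-compatible endpoint. The port number and placeholder API key are illustrative assumptions, and the sampling values simply mirror the offline example.

```python
# Minimal sketch (assumes an SGLang build that can load AutoRound checkpoints).
# Start the server first, e.g.:
#   python -m sglang.launch_server \
#       --model-path Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound --port 30000
# SGLang exposes an OpenAI-compatible API, so the standard openai client can query it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")  # local server; key is unused

response = client.chat.completions.create(
    model="Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound",
    messages=[{"role": "user", "content": "Hello, my name is"}],
    temperature=0.6,  # same sampling settings as the offline example
    top_p=0.95,
)
print(response.choices[0].message.content)
```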