modelscope · Jintao-Huang · May 29, 2024 · May 19, 2024 · May 19, 2024 · May 19, 2024
diff --git a/docs/source/LLM/LLM量化文档.md b/docs/source/LLM/LLM量化文档.md
@@ -1,20 +1,17 @@
 # LLM量化文档
-swift支持使用awq, gptq, bnb, hqq, eetq技术对模型进行量化. 其中awq, gptq量化技术支持vllm进行推理加速, 且量化后的模型支持qlora微调.
+swift支持使用awq、gptq、bnb、hqq、eetq技术对模型进行量化。其中awq、gptq量化技术支持vllm进行推理加速，需要使用校准数据集，量化性能更好，但量化速度较慢。而bnb、hqq、eetq无需校准数据，量化速度较快。这五种量化方法都支持qlora微调。
 
-**注意** 量化在不同指令下的作用不同
-- sft lora训练中指定量化用于`qlora`，用于降低训练所需显存
-- export中指定量化用于量化模型并保存。
-- infer中指定量化用于量化模型并推理。
+awq、gptq需要使用`swift export`进行量化。而bnb、hqq、eetq可以直接在sft和infer时进行快速量化。
 
-其中bnb,hqq,eetq无需校准数据，量化速度较快，在 sft lora 训练 和 infer 中使用，指定`--quant_method bnb/hqq/eetq`
 
-awq,gptq需要校准数据，在 export 中使用，`--quant_method awq/gptq`
+从vllm推理加速支持的角度来看，更推荐使用awq和gptq进行量化。从量化效果的角度来看，更推荐使用awq、hqq和gptq进行量化。而从量化速度的角度来看，更推荐使用hqq进行量化。
+
 
 ## 目录
 - [环境准备](#环境准备)
-- [量化微调(qlora)](#量化微调(qlora))
 - [原始模型](#原始模型)
 - [微调后模型](#微调后模型)
+- [QLoRA微调](#QLoRA微调)
 - [推送模型](#推送模型)
 
 ## 环境准备
@@ -35,6 +32,9 @@ pip install autoawq -U
 # auto_gptq和cuda版本有对应关系，请按照`https://github.com/PanQiWei/AutoGPTQ#quick-installation`选择版本
 pip install auto_gptq -U
 
+# 使用bnb量化：
+pip install bitsandbytes -U
+
 # 使用hqq量化：
 # 需要transformers版本>4.40，从源码安装
 pip install git+https://github.com/huggingface/transformers
@@ -58,54 +58,9 @@ pip install -r requirements/framework.txt  -U
 pip install -r requirements/llm.txt  -U
 ```
 
-## 量化微调(qlora)
-在sft lora训练中指定`--quant_method`和`--quantization_bit`来执行qlora，显著减少训练所需显存
-```bash
-CUDA_VISIBLE_DEVICES=0 swift sft \
-    --model_type qwen1half-7b-chat \
-    --sft_type lora \
-    --dataset alpaca-zh#5000 \
-    --quant_method hqq \
-    --quantization_bit 4 \
-
-CUDA_VISIBLE_DEVICES=0 swift sft \
-    --model_type qwen1half-7b-chat \
-    --sft_type lora \
-    --dataset alpaca-zh#5000 \
-    --quant_method eetq \
-    --dtype fp16 \
-
-CUDA_VISIBLE_DEVICES=0 swift sft \
-    --model_type qwen1half-7b-chat \
-    --sft_type lora \
-    --dataset alpaca-zh#5000 \
-    --quant_method bnb \
-    --quantization_bit 4 \
-    --dtype fp16 \
-```
-**注意**
-- hqq支持更多自定义参数，比如为不同网络层指定不同量化配置，具体请见[命令行参数](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md)
-- eetq量化为8bit量化，无需指定quantization_bit。目前不支持bf16，需要指定dtype为fp16
-- eetq目前qlora速度比较慢，推荐使用hqq。参考[issue](https://github.com/NetEase-FuXi/EETQ/issues/17)
-
 ## 原始模型
-使用bnb,hqq,eetq量化模型并推理
-```bash
-CUDA_VISIBLE_DEVICES=0 swift infer \
-    --model_type qwen1half-7b-chat \
-    --quant_method bnb \
-    --quantization_bit 4
 
-CUDA_VISIBLE_DEVICES=0 swift infer \
-    --model_type qwen1half-7b-chat \
-    --quant_method hqq \
-    --quantization_bit 4
-
-CUDA_VISIBLE_DEVICES=0 swift infer \
-    --model_type qwen1half-7b-chat \
-    --quant_method eetq \
-    --dtype fp16
-```
+### awq、gptq
 
 这里展示对qwen1half-7b-chat进行awq, gptq量化.
 ```bash
@@ -234,6 +189,25 @@ CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen1half-7b-chat
 ```
 
 
+### bnb、hqq、eetq
+对于bnb、hqq、eetq，我们只需要使用swift infer来进行快速量化并推理。
+```bash
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --model_type qwen1half-7b-chat \
+    --quant_method bnb \
+    --quantization_bit 4
+
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --model_type qwen1half-7b-chat \
+    --quant_method hqq \
+    --quantization_bit 4
+
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --model_type qwen1half-7b-chat \
+    --quant_method eetq \
+    --dtype fp16
+```
+
 ## 微调后模型
 
 假设你使用lora微调了qwen1half-4b-chat, 模型权重目录为: `output/qwen1half-4b-chat/vx-xxx/checkpoint-xxx`.
@@ -281,6 +255,65 @@ curl http://localhost:8000/v1/chat/completions \
 }'
 ```
 
+## QLoRA微调
+
+### awq、gptq
+如果想要对awq、gptq量化的模型进行qlora微调，你需要进行提前量化。例如可以对原始模型使用`swift export`进行量化。然后使用以下命令进行微调，你需要指定`--quant_method`来指定对应量化的方式：
+
+```bash
+# awq
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type qwen1half-7b-chat \
+    --model_id_or_path qwen1half-7b-chat-awq-int4 \
+    --quant_method awq \
+    --sft_type lora \
+    --dataset alpaca-zh#5000 \
+
+# gptq
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type qwen1half-7b-chat \
+    --model_id_or_path qwen1half-7b-chat-gptq-int4 \
+    --quant_method gptq \
+    --sft_type lora \
+    --dataset alpaca-zh#5000 \
+```
+
+
+### bnb、hqq、eetq
+如果想要使用bnb、hqq、eetq进行qlora微调，你需要在训练中指定`--quant_method`和`--quantization_bit`：
+
+```bash
+# bnb
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type qwen1half-7b-chat \
+    --sft_type lora \
+    --dataset alpaca-zh#5000 \
+    --quant_method bnb \
+    --quantization_bit 4 \
+    --dtype fp16 \
+
+# hqq
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type qwen1half-7b-chat \
+    --sft_type lora \
+    --dataset alpaca-zh#5000 \
+    --quant_method hqq \
+    --quantization_bit 4 \
+
+# eetq
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type qwen1half-7b-chat \
+    --sft_type lora \
+    --dataset alpaca-zh#5000 \
+    --quant_method eetq \
+    --dtype fp16 \
+```
+
+**注意**
+- hqq支持更多自定义参数，比如为不同网络层指定不同量化配置，具体请见[命令行参数](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md)
+- eetq量化为8bit量化，无需指定quantization_bit。目前不支持bf16，需要指定dtype为fp16
+- eetq目前qlora速度比较慢，推荐使用hqq。参考[issue](https://github.com/NetEase-FuXi/EETQ/issues/17)
+
 
 ## 推送模型
 假设你使用lora微调了qwen1half-4b-chat, 模型权重目录为: `output/qwen1half-4b-chat/vx-xxx/checkpoint-xxx`.

diff --git a/docs/source/LLM/Qwen1.5全流程最佳实践.md b/docs/source/LLM/Qwen1.5全流程最佳实践.md
@@ -413,7 +413,9 @@ for query in ['78654+657=?', '晚上睡不着觉怎么办']:
 
     print(f'query: {query}')
     print('response: ', end='')
+    response = ''
     for chunk in stream_resp:
+        response += chunk.choices[0].delta.content
         print(chunk.choices[0].delta.content, end='', flush=True)
     print()
     messages.append({'role': 'assistant', 'content': response})
@@ -574,7 +576,9 @@ for query in ['78654+657=?', '晚上睡不着觉怎么办']:
 
     print(f'query: {query}')
     print('response: ', end='')
+    response = ''
     for chunk in stream_resp:
+        response += chunk.choices[0].delta.content
         print(chunk.choices[0].delta.content, end='', flush=True)
     print()
     messages.append({'role': 'assistant', 'content': response})