modelscope · Jintao-Huang · May 29, 2024 · May 19, 2024
diff --git a/docs/source/LLM/LLM推理文档.md b/docs/source/LLM/LLM推理文档.md
@@ -183,26 +183,47 @@ model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': '
 
 template = get_template(template_type, tokenizer)
 seed_everything(42)
+
 query = '浙江的省会在哪里？'
 gen = inference_stream(model, template, query)
 print(f'query: {query}')
 for response, history in gen:
-    print(f'response: {response}')
+    pass
+print(f'response: {response}')
+
+# Method 1: print the full accumulated response on every step
 query = '这有什么好吃的？'
-gen = inference_stream(model, template, query, history)
+old_history = history
+gen = inference_stream(model, template, query, old_history)
 print(f'query: {query}')
 for response, history in gen:
     print(f'response: {response}')
 print(f'history: {history}')
 
+# Method 2: print only the newly generated text (the delta) on every step
+query = '这有什么好吃的？'
+gen = inference_stream(model, template, query, old_history)
+print_idx = 0
+print(f'query: {query}\nresponse: ', end='')
+for response, history in gen:
+    delta = response[print_idx:]
+    print(delta, end='', flush=True)
+    print_idx = len(response)
+print(f'\nhistory: {history}')
+
 """Out[0]
 query: 浙江的省会在哪里？
-...
 response: 浙江省的省会是杭州。
 query: 这有什么好吃的？
+response: 杭
+response: 杭州
+response: 杭州市有
 ...
-response: 杭州市有很多著名的美食，例如西湖醋鱼、龙井虾仁、糖醋排骨、毛血旺等。此外，还有杭州特色的点心，如桂花糕、荷花酥、艾窝窝等。
-history: [('浙江的省会在哪里？', '浙江省的省会是杭州。'), ('这有什么好吃的？', '杭州市有很多著名的美食，例如西湖醋鱼、龙井虾仁、糖醋排骨、毛血旺等。此外，还有杭州特色的点心，如桂花糕、荷花酥、艾窝窝等。')]
+response: 杭州市有很多著名的美食，例如西湖醋鱼、龙井虾仁、糖醋排骨、毛血旺等。此外，还有杭州特色的点心，如桂花酥饼、抹茶糕点等。
+history: [['浙江的省会在哪里？', '浙江省的省会是杭州。'], ['这有什么好吃的？', '杭州市有很多著名的美食，例如西湖醋鱼、龙井虾仁、糖醋排骨、毛血旺等。此外，还有杭州特色的点心，如桂花酥饼、抹茶糕点等。']]
+query: 这有什么好吃的？
+response: 杭州有许多美食，比如西湖醋鱼、龙井虾仁、酱鸭等。此外，还有许多小吃，如烧麦、春卷、油条等，都是浙江特色美食。
+history: [['浙江的省会在哪里？', '浙江省的省会是杭州。'], ['这有什么好吃的？', '杭州有许多美食，比如西湖醋鱼、龙井虾仁、酱鸭等。此外，还有许多小吃，如烧麦、春卷、油条等，都是浙江特色美食。']]
 """
 ```
 

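Reviewer note on the streaming hunk above: the second variant (`方式2`) prints only the text produced since the previous step. A minimal, self-contained sketch of that pattern follows. `fake_inference_stream` is a hypothetical stand-in that only mimics the `(response, history)` pairs seen in the diff; it is not swift's `inference_stream`, and the sample sentence is made up.

```python
from typing import Iterator, List, Tuple

History = List[List[str]]


def fake_inference_stream(query: str, history: History) -> Iterator[Tuple[str, History]]:
    """Hypothetical stand-in: yields the accumulated response so far plus the updated history."""
    full = 'Hangzhou has many famous dishes.'  # made-up sample text
    response = ''
    for ch in full:
        response += ch
        yield response, history + [[query, response]]


def collect_deltas(gen: Iterator[Tuple[str, History]]) -> Tuple[List[str], History]:
    """Print only the new text on each step, as Method 2 in the diff does."""
    print_idx = 0          # how many characters have already been printed
    pieces: List[str] = []
    history: History = []
    for response, history in gen:
        delta = response[print_idx:]      # text generated since the last step
        print(delta, end='', flush=True)
        pieces.append(delta)
        print_idx = len(response)
    print()
    return pieces, history


pieces, history = collect_deltas(fake_inference_stream('What is good to eat here?', []))
```

Saving `old_history` before the first loop matters in the diff because the loop rebinds `history` on every step; reusing it for the second call would start from a history that already contains an answer to the same query.
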
diff --git a/docs/source/LLM/LLM量化文档.md b/docs/source/LLM/LLM量化文档.md
@@ -1,20 +1,17 @@
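 # LLM Quantization Documentation
-swift supports quantizing models with awq, gptq, bnb, hqq, and eetq. awq and gptq support vllm for inference acceleration, and the quantized models support qlora fine-tuning.
+swift supports quantizing models with awq, gptq, bnb, hqq, and eetq. awq and gptq support vllm for inference acceleration and require a calibration dataset; they give better quantization quality but quantize more slowly. bnb, hqq, and eetq need no calibration data and quantize faster. All five methods support qlora fine-tuning.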
 
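-**Note** Quantization plays a different role under each command:
-- In sft lora training, specifying quantization enables `qlora`, which reduces the GPU memory needed for training.
-- In export, specifying quantization quantizes the model and saves it.
-- In infer, specifying quantization quantizes the model and then runs inference.
+awq and gptq must be quantized with `swift export`, whereas bnb, hqq, and eetq can be quantized on the fly during sft and infer.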
 
-bnb, hqq, and eetq need no calibration data and quantize quickly; use them in sft lora training and in infer by specifying `--quant_method bnb/hqq/eetq`.
 
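-awq and gptq need calibration data; use them in export with `--quant_method awq/gptq`.
+For vllm inference acceleration, awq and gptq are the recommended choices. For quantization quality, awq, hqq, and gptq are recommended. For quantization speed, hqq is recommended.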
+
 
 ## 目录
 - [环境准备](#环境准备)
-- [量化微调(qlora)](#量化微调(qlora))
 - [原始模型](#原始模型)
 - [微调后模型](#微调后模型)
+- [QLoRA微调](#QLoRA微调)
 - [推送模型](#推送模型)
 
 ## 环境准备
@@ -35,6 +32,9 @@ pip install autoawq -U
 # auto_gptq versions are tied to CUDA versions; choose a version according to `https://github.com/PanQiWei/AutoGPTQ#quick-installation`
 pip install auto_gptq -U
 
+# For bnb quantization:
+pip install bitsandbytes -U
+
 # For hqq quantization:
 # requires transformers version > 4.40; install from source
 pip install git+https://github.com/huggingface/transformers
@@ -58,54 +58,9 @@ pip install -r requirements/framework.txt  -U
 pip install -r requirements/llm.txt  -U
 ```
 
-## 量化微调(qlora)
-In sft lora training, specify `--quant_method` and `--quantization_bit` to run qlora, which significantly reduces the GPU memory needed for training.
-```bash
-CUDA_VISIBLE_DEVICES=0 swift sft \
-    --model_type qwen1half-7b-chat \
-    --sft_type lora \
-    --dataset alpaca-zh#5000 \
-    --quant_method hqq \
-    --quantization_bit 4 \
-
-CUDA_VISIBLE_DEVICES=0 swift sft \
-    --model_type qwen1half-7b-chat \
-    --sft_type lora \
-    --dataset alpaca-zh#5000 \
-    --quant_method eetq \
-    --dtype fp16 \
-
-CUDA_VISIBLE_DEVICES=0 swift sft \
-    --model_type qwen1half-7b-chat \
-    --sft_type lora \
-    --dataset alpaca-zh#5000 \
-    --quant_method bnb \
-    --quantization_bit 4 \
-    --dtype fp16 \
-```
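-**Note**
-- hqq supports more custom parameters, such as specifying different quantization configurations for different network layers. See [Command-line arguments](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md) for details.
-- eetq is 8-bit quantization, so quantization_bit does not need to be specified. It does not currently support bf16; dtype must be set to fp16.
-- qlora with eetq is currently slow, so hqq is recommended. See this [issue](https://github.com/NetEase-FuXi/EETQ/issues/17).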
-
 ## 原始模型
-Quantize the model with bnb, hqq, or eetq and run inference
-```bash
-CUDA_VISIBLE_DEVICES=0 swift infer \
-    --model_type qwen1half-7b-chat \
-    --quant_method bnb \
-    --quantization_bit 4
 
-CUDA_VISIBLE_DEVICES=0 swift infer \
-    --model_type qwen1half-7b-chat \
-    --quant_method hqq \
-    --quantization_bit 4
-
-CUDA_VISIBLE_DEVICES=0 swift infer \
-    --model_type qwen1half-7b-chat \
-    --quant_method eetq \
-    --dtype fp16
-```
+### awq、gptq
 
 This section shows awq and gptq quantization of qwen1half-7b-chat.
 ```bash
@@ -234,6 +189,25 @@ CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen1half-7b-chat
 ```
 
 
+### bnb、hqq、eetq
+For bnb, hqq, and eetq, `swift infer` alone is enough to quantize on the fly and run inference.
+```bash
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --model_type qwen1half-7b-chat \
+    --quant_method bnb \
+    --quantization_bit 4
+
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --model_type qwen1half-7b-chat \
+    --quant_method hqq \
+    --quantization_bit 4
+
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --model_type qwen1half-7b-chat \
+    --quant_method eetq \
+    --dtype fp16
+```
+
 ## 微调后模型
 
 Suppose you fine-tuned qwen1half-4b-chat with lora and the model weights directory is `output/qwen1half-4b-chat/vx-xxx/checkpoint-xxx`.
@@ -281,6 +255,65 @@ curl http://localhost:8000/v1/chat/completions \
 }'
 ```
 
+## QLoRA微调
+
+### awq、gptq
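+To run qlora fine-tuning on an awq- or gptq-quantized model, you must quantize the model in advance, for example by running `swift export` on the original model. Then fine-tune with the commands below, setting `--quant_method` to the matching quantization method: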
+
+```bash
+# awq
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type qwen1half-7b-chat \
+    --model_id_or_path qwen1half-7b-chat-awq-int4 \
+    --quant_method awq \
+    --sft_type lora \
+    --dataset alpaca-zh#5000
+
+# gptq
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type qwen1half-7b-chat \
+    --model_id_or_path qwen1half-7b-chat-gptq-int4 \
+    --quant_method gptq \
+    --sft_type lora \
+    --dataset alpaca-zh#5000
+```
+
+
+### bnb、hqq、eetq
+To run qlora fine-tuning with bnb, hqq, or eetq, specify `--quant_method` and `--quantization_bit` during training:
+
+```bash
+# bnb
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type qwen1half-7b-chat \
+    --sft_type lora \
+    --dataset alpaca-zh#5000 \
+    --quant_method bnb \
+    --quantization_bit 4 \
+    --dtype fp16
+
+# hqq
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type qwen1half-7b-chat \
+    --sft_type lora \
+    --dataset alpaca-zh#5000 \
+    --quant_method hqq \
+    --quantization_bit 4
+
+# eetq
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type qwen1half-7b-chat \
+    --sft_type lora \
+    --dataset alpaca-zh#5000 \
+    --quant_method eetq \
+    --dtype fp16
+```
+
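+**Note**
+- hqq supports more custom parameters, such as specifying different quantization configurations for different network layers. See [Command-line arguments](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md) for details.
+- eetq is 8-bit quantization, so quantization_bit does not need to be specified. It does not currently support bf16; dtype must be set to fp16.
+- qlora with eetq is currently slow, so hqq is recommended. See this [issue](https://github.com/NetEase-FuXi/EETQ/issues/17).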
+
 
 ## 推送模型
 Suppose you fine-tuned qwen1half-4b-chat with lora and the model weights directory is `output/qwen1half-4b-chat/vx-xxx/checkpoint-xxx`.

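Reviewer note: the split described in this file (awq and gptq go through `swift export` first, while bnb, hqq, and eetq are applied at `swift infer` time) can be summarized as a small helper that assembles the on-the-fly command line. This is only a sketch: `build_infer_args` is a hypothetical name, not part of swift, and it uses only the flags that appear in the hunks above (`--model_type`, `--quant_method`, `--quantization_bit`, `--dtype`).

```python
from typing import List

ON_THE_FLY = {'bnb', 'hqq', 'eetq'}   # no calibration data needed
NEEDS_EXPORT = {'awq', 'gptq'}        # calibrated via `swift export` beforehand


def build_infer_args(model_type: str, quant_method: str,
                     quantization_bit: int = 4) -> List[str]:
    """Assemble a `swift infer` command for on-the-fly quantization (hypothetical helper)."""
    if quant_method in NEEDS_EXPORT:
        raise ValueError(
            f'{quant_method} needs a calibration dataset; quantize with `swift export` first')
    if quant_method not in ON_THE_FLY:
        raise ValueError(f'unknown quant_method: {quant_method}')
    args = ['swift', 'infer', '--model_type', model_type,
            '--quant_method', quant_method]
    if quant_method == 'eetq':
        # Per the notes in the diff: eetq is 8-bit only and does not support bf16.
        args += ['--dtype', 'fp16']
    else:
        args += ['--quantization_bit', str(quantization_bit)]
    return args


print(' '.join(build_infer_args('qwen1half-7b-chat', 'hqq')))
```

Raising for awq and gptq mirrors the documentation's point that those two methods cannot be applied at inference time without a prior export step.
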
diff --git a/docs/source/LLM/Qwen1.5全流程最佳实践.md b/docs/source/LLM/Qwen1.5全流程最佳实践.md
@@ -413,7 +413,9 @@ for query in ['78654+657=?', '晚上睡不着觉怎么办']:
 
     print(f'query: {query}')
     print('response: ', end='')
+    response = ''
     for chunk in stream_resp:
+        response += chunk.choices[0].delta.content
         print(chunk.choices[0].delta.content, end='', flush=True)
     print()
     messages.append({'role': 'assistant', 'content': response})
@@ -574,7 +576,9 @@ for query in ['78654+657=?', '晚上睡不着觉怎么办']:
 
     print(f'query: {query}')
     print('response: ', end='')
+    response = ''
     for chunk in stream_resp:
+        response += chunk.choices[0].delta.content
         print(chunk.choices[0].delta.content, end='', flush=True)
     print()
     messages.append({'role': 'assistant', 'content': response})

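Reviewer note: the two hunks above make `response` accumulate the streamed text before it is appended to `messages`. With the OpenAI Python client, `delta.content` is optional and can be `None` on some chunks (for example a final chunk that only carries a finish reason), in which case `response += ...` would raise a `TypeError`. Whether the server used in this document ever sends such a chunk is not shown here, so the guard below is a defensive sketch. `Chunk`, `Choice`, and `Delta` are hypothetical stand-ins that mimic only the attributes the loop reads.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Delta:
    content: Optional[str]


@dataclass
class Choice:
    delta: Delta


@dataclass
class Chunk:
    choices: List[Choice]


def accumulate(stream: Iterable[Chunk]) -> str:
    """Print each streamed piece and return the full response text."""
    response = ''
    for chunk in stream:
        piece = chunk.choices[0].delta.content or ''  # treat None as empty text
        response += piece
        print(piece, end='', flush=True)
    print()
    return response


# Hypothetical stream whose last chunk carries no content.
demo = [Chunk([Choice(Delta(text))]) for text in ['786', '54+657', '=79311', None]]
full_response = accumulate(demo)
```
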
diff --git a/docs/source/LLM/支持的模型和数据集.md b/docs/source/LLM/支持的模型和数据集.md
@@ -121,8 +121,8 @@
 |llama-3-chinese-8b-instruct|[ChineseAlpacaGroup/llama-3-chinese-8b-instruct](https://modelscope.cn/models/ChineseAlpacaGroup/llama-3-chinese-8b-instruct/summary)|q_proj, k_proj, v_proj|llama3|&#x2714;|&#x2714;||-|[hfl/llama-3-chinese-8b-instruct](https://huggingface.co/hfl/llama-3-chinese-8b-instruct)|
 |atom-7b|[FlagAlpha/Atom-7B](https://modelscope.cn/models/FlagAlpha/Atom-7B/summary)|q_proj, k_proj, v_proj|default-generation|&#x2714;|&#x2714;||-|[FlagAlpha/Atom-7B](https://huggingface.co/FlagAlpha/Atom-7B)|
 |atom-7b-chat|[FlagAlpha/Atom-7B-Chat](https://modelscope.cn/models/FlagAlpha/Atom-7B-Chat/summary)|q_proj, k_proj, v_proj|atom|&#x2714;|&#x2714;||-|[FlagAlpha/Atom-7B-Chat](https://huggingface.co/FlagAlpha/Atom-7B-Chat)|
-|llava1d6-mistral-7b-instruct|[AI-ModelScope/llava-v1.6-mistral-7b](https://modelscope.cn/models/AI-ModelScope/llava-v1.6-mistral-7b/summary)|q_proj, k_proj, v_proj|llava-mistral-instruct|&#x2714;|&#x2718;|transformers>=4.34|multi-modal, vision|[liuhaotian/llava-v1.6-mistral-7b](https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b)|
-|llava1d6-yi-34b-instruct|[AI-ModelScope/llava-v1.6-34b](https://modelscope.cn/models/AI-ModelScope/llava-v1.6-34b/summary)|q_proj, k_proj, v_proj|llava-yi-instruct|&#x2714;|&#x2718;||multi-modal, vision|[liuhaotian/llava-v1.6-34b](https://huggingface.co/liuhaotian/llava-v1.6-34b)|
+|llava1_6-mistral-7b-instruct|[AI-ModelScope/llava-v1.6-mistral-7b](https://modelscope.cn/models/AI-ModelScope/llava-v1.6-mistral-7b/summary)|q_proj, k_proj, v_proj|llava-mistral-instruct|&#x2714;|&#x2718;|transformers>=4.34|multi-modal, vision|[liuhaotian/llava-v1.6-mistral-7b](https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b)|
+|llava1_6-yi-34b-instruct|[AI-ModelScope/llava-v1.6-34b](https://modelscope.cn/models/AI-ModelScope/llava-v1.6-34b/summary)|q_proj, k_proj, v_proj|llava-yi-instruct|&#x2714;|&#x2718;||multi-modal, vision|[liuhaotian/llava-v1.6-34b](https://huggingface.co/liuhaotian/llava-v1.6-34b)|
 |llama3-llava-next-8b|[AI-Modelscope/llama3-llava-next-8b](https://modelscope.cn/models/AI-Modelscope/llama3-llava-next-8b/summary)|q_proj, k_proj, v_proj|llama-llava-next|&#x2714;|&#x2718;||multi-modal, vision|[lmms-lab/llama3-llava-next-8b](https://huggingface.co/lmms-lab/llama3-llava-next-8b)|
 |llava-next-72b|[AI-Modelscope/llava-next-72b](https://modelscope.cn/models/AI-Modelscope/llava-next-72b/summary)|q_proj, k_proj, v_proj|llava-qwen-instruct|&#x2714;|&#x2718;||multi-modal, vision|[lmms-lab/llava-next-72b](https://huggingface.co/lmms-lab/llava-next-72b)|
 |llava-next-110b|[AI-Modelscope/llava-next-110b](https://modelscope.cn/models/AI-Modelscope/llava-next-110b/summary)|q_proj, k_proj, v_proj|llava-qwen-instruct|&#x2714;|&#x2718;||multi-modal, vision|[lmms-lab/llava-next-110b](https://huggingface.co/lmms-lab/llava-next-110b)|
@@ -236,7 +236,7 @@
 |baichuan2-13b-chat|[baichuan-inc/Baichuan2-13B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Chat/summary)|W_pack|baichuan|&#x2718;|&#x2714;||-|[baichuan-inc/Baichuan2-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat)|
 |baichuan2-13b-chat-int4|[baichuan-inc/Baichuan2-13B-Chat-4bits](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Chat-4bits/summary)|W_pack|baichuan|&#x2718;|&#x2718;|bitsandbytes<0.41.2, accelerate<0.26|-|[baichuan-inc/Baichuan2-13B-Chat-4bits](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat-4bits)|
 |mplug-owl2-chat|[iic/mPLUG-Owl2](https://modelscope.cn/models/iic/mPLUG-Owl2/summary)|q_proj, k_proj.multiway.0, k_proj.multiway.1, v_proj.multiway.0, v_proj.multiway.1|mplug-owl2|&#x2714;|&#x2718;|transformers<4.35, icecream|-|[MAGAer13/mplug-owl2-llama2-7b](https://huggingface.co/MAGAer13/mplug-owl2-llama2-7b)|
-|mplug-owl2d1-chat|[iic/mPLUG-Owl2.1](https://modelscope.cn/models/iic/mPLUG-Owl2.1/summary)|c_attn.multiway.0, c_attn.multiway.1|mplug-owl2|&#x2714;|&#x2718;|transformers<4.35, icecream|-|[Mizukiluke/mplug_owl_2_1](https://huggingface.co/Mizukiluke/mplug_owl_2_1)|
+|mplug-owl2_1-chat|[iic/mPLUG-Owl2.1](https://modelscope.cn/models/iic/mPLUG-Owl2.1/summary)|c_attn.multiway.0, c_attn.multiway.1|mplug-owl2|&#x2714;|&#x2718;|transformers<4.35, icecream|-|[Mizukiluke/mplug_owl_2_1](https://huggingface.co/Mizukiluke/mplug_owl_2_1)|
 |yuan2-2b-instruct|[YuanLLM/Yuan2.0-2B-hf](https://modelscope.cn/models/YuanLLM/Yuan2.0-2B-hf/summary)|q_proj, k_proj, v_proj|yuan|&#x2714;|&#x2718;||-|[IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf)|
 |yuan2-2b-janus-instruct|[YuanLLM/Yuan2-2B-Janus-hf](https://modelscope.cn/models/YuanLLM/Yuan2-2B-Janus-hf/summary)|q_proj, k_proj, v_proj|yuan|&#x2714;|&#x2718;||-|[IEITYuan/Yuan2-2B-Janus-hf](https://huggingface.co/IEITYuan/Yuan2-2B-Janus-hf)|
 |yuan2-51b-instruct|[YuanLLM/Yuan2.0-51B-hf](https://modelscope.cn/models/YuanLLM/Yuan2.0-51B-hf/summary)|q_proj, k_proj, v_proj|yuan|&#x2714;|&#x2718;||-|[IEITYuan/Yuan2-51B-hf](https://huggingface.co/IEITYuan/Yuan2-51B-hf)|

diff --git a/docs/source/LLM/自定义与拓展.md b/docs/source/LLM/自定义与拓展.md
@@ -8,8 +8,8 @@
 We support three ways to use **custom datasets**.
 
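 1. [Recommended] **Command-line arguments**: **the most convenient way to support a custom dataset**. Supports four dataset formats (i.e. uses `SmartPreprocessor`) and accepts both `dataset_id` and `dataset_path`.
-2. Add the dataset to `dataset_info.json`: more flexible than the first method. Supports two preprocessors with configurable parameters, `RenameColumnsPreprocessor` and `ConversationsPreprocessor` (`SmartPreprocessor` is the default). You can either modify swift's built-in `dataset_info.json` directly or pass an external json file via `--dataset_info_path xxx.json` (convenient for users who installed with pip install rather than git clone).
-3. **Registering the dataset**: more flexible than methods 1 and 2. Supports preprocessing the dataset with a function. Methods 1 and 2 are implemented on top of method 3. You can either modify the source code directly or pass a file via `--custom_register_path xxx.py`, which the script parses (convenient for pip install users).
+2. Add the dataset to `dataset_info.json`: more flexible than the first method but more cumbersome. Supports two preprocessors with configurable parameters, `RenameColumnsPreprocessor` and `ConversationsPreprocessor` (`SmartPreprocessor` is the default). You can either modify swift's built-in `dataset_info.json` directly or pass an external json file via `--dataset_info_path xxx.json` (convenient for users who installed with pip install rather than git clone).
+3. **Registering the dataset**: more flexible than methods 1 and 2 but more cumbersome. Supports preprocessing the dataset with a function. Methods 1 and 2 are implemented on top of method 3. You can either modify the source code directly or pass a file via `--custom_register_path xxx.py`, which the script parses (convenient for pip install users).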
 
 ### 📌 【推荐】命令行参数的形式
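 You can pass a custom **dataset_id** (compatible with MS and HF) or **dataset_path** directly, and you can pass several custom datasets together with their sample counts; the script preprocesses and concatenates them automatically. If you pass a `dataset_id`, the 'default' subset of that dataset\_id is used and split is set to 'train' by default. If the dataset\_id is already registered, the subsets, split, and preprocessing function supplied at registration are used instead. If you pass a `dataset_path`, it may be relative or absolute; relative paths are resolved against the current working directory.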