modelscope · Jintao-Huang · Jan 5, 2024
diff --git a/docs/source/LLM/VLLM推理加速与部署.md b/docs/source/LLM/VLLM推理加速与部署.md
@@ -235,7 +235,7 @@ CUDA_VISIBLE_DEVICES=0 swift app-ui --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx-merged
 ## Deployment
 swift uses VLLM as the inference backend and is compatible with the OpenAI API style.
 
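-For the server-side deployment command-line arguments, see [deploy command-line arguments](命令行参数.md#deploy-命令行参数).
+For the server-side deployment command-line arguments, see: [deploy command-line arguments](命令行参数.md#deploy-命令行参数).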
 
 For the client-side OpenAI API parameters, see: https://platform.openai.com/docs/api-reference/introduction.
 
@@ -251,23 +251,23 @@ CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen-7b-chat
 
 Using swift:
 ```python
-from swift.llm import get_model_list_client, XRequest, inference_client
+from swift.llm import get_model_list_client, XRequestConfig, inference_client
 
 model_list = get_model_list_client()
 model_type = model_list.data[0].id
 print(f'model_type: {model_type}')
 
 query = '浙江的省会在哪里?'
-request_kwargs = XRequest(model=model_type, seed=42)
-resp = inference_client(query, request_kwargs=request_kwargs)
+request_config = XRequestConfig(seed=42)
+resp = inference_client(model_type, query, request_config=request_config)
 response = resp.choices[0].message.content
 print(f'query: {query}')
 print(f'response: {response}')
 
 history = [(query, response)]
 query = '这有什么好吃的?'
-request_kwargs = XRequest(model=model_type, stream=True, seed=42)
-stream_resp = inference_client(query, history, request_kwargs=request_kwargs)
+request_config = XRequestConfig(stream=True, seed=42)
+stream_resp = inference_client(model_type, query, history, request_config=request_config)
 print(f'query: {query}')
 print('response: ', end='')
 for chunk in stream_resp:
@@ -342,21 +342,21 @@ CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen-7b
 
 Using swift:
 ```python
-from swift.llm import get_model_list_client, XRequest, inference_client
+from swift.llm import get_model_list_client, XRequestConfig, inference_client
 
 model_list = get_model_list_client()
 model_type = model_list.data[0].id
 print(f'model_type: {model_type}')
 
 query = '浙江 -> 杭州\n安徽 -> 合肥\n四川 ->'
-request_kwargs = XRequest(model=model_type, max_tokens=32, temperature=0.1, seed=42)
-resp = inference_client(query, request_kwargs=request_kwargs)
+request_config = XRequestConfig(max_tokens=32, temperature=0.1, seed=42)
+resp = inference_client(model_type, query, request_config=request_config)
 response = resp.choices[0].text
 print(f'query: {query}')
 print(f'response: {response}')
 
-request_kwargs.stream = True
-stream_resp = inference_client(query, request_kwargs=request_kwargs)
+request_config.stream = True
+stream_resp = inference_client(model_type, query, request_config=request_config)
 print(f'query: {query}')
 print('response: ', end='')
 for chunk in stream_resp:

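The client examples above now pass the model name as a separate positional argument, while the request config carries only sampling options. The sketch below illustrates that split without importing swift: `RequestConfigSketch` and `build_payload` are hypothetical names, the field set is an assumption limited to the options used in the examples, and the payload logic only mirrors the dict comprehension shown in the `inference_client` change for the completion case.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class RequestConfigSketch:
    # Hypothetical stand-in for XRequestConfig; only the options used in the
    # documentation examples are modeled here.
    max_tokens: Optional[int] = None
    temperature: Optional[float] = None
    seed: Optional[int] = None
    stream: bool = False


def build_payload(model_type: str,
                  query: str,
                  request_config: Optional[RequestConfigSketch] = None) -> Dict[str, Any]:
    # The config is optional: fall back to defaults when it is omitted.
    if request_config is None:
        request_config = RequestConfigSketch()
    # Copy the config fields, then add the model name separately.
    data = {k: v for k, v in request_config.__dict__.items() if not k.startswith('__')}
    data['model'] = model_type
    data['prompt'] = query
    return data


payload = build_payload('qwen-7b', 'hello', RequestConfigSketch(max_tokens=32, seed=42))
print(payload)
```

Keeping the model name out of the config means one config object can be reused across models, which is what the completion example does when it flips `stream` on the same `request_config`.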
diff --git a/docs/source/LLM/命令行参数.md b/docs/source/LLM/命令行参数.md
@@ -19,28 +19,30 @@
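 - `--seed`: Global seed, default `42`. Used to reproduce training results.
 - `--resume_from_checkpoint`: Used to resume training from a checkpoint, default `None`. You can set it to a checkpoint path, e.g. `'output/qwen-7b-chat/vx_xxx/checkpoint-xxx'`, to resume training.
 - `--dtype`: The torch_dtype used when loading the base model, default `'AUTO'`, i.e. the dtype is chosen automatically: if the machine does not support bf16, fp16 is used; if the corresponding model in `MODEL_MAPPING` specifies a torch_dtype, that dtype is used; otherwise bf16 is used. Available values: 'bf16', 'fp16', 'fp32'.
-- `--dataset`: Selects the training dataset(s), default `None`. Available datasets are listed in `DATASET_MAPPING.keys()`. To train on multiple datasets, separate them with ',' or ' ', e.g. `alpaca-en,alpaca-zh` or `alpaca-en alpaca-zh`.
+- `--dataset`: Selects the training dataset(s), default `[]`. Available datasets are listed in `DATASET_MAPPING.keys()`. To train on multiple datasets, separate them with ',' or ' ', e.g. `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`.
 - `--dataset_seed`: Seed for dataset processing, default `42`. It is used as a random_state and does not affect the global seed.
 - `--dataset_test_ratio`: Ratio used to split a sub-dataset into training and validation sets, default `0.01`. If the sub-dataset is already split into training and validation sets, this parameter has no effect.
 - `--train_dataset_sample`: Number of samples drawn from the training set, default `20000`, used to speed up training. This parameter avoids overly long epochs when the dataset is very large. LoRA usually converges quickly and does not need many samples for fine-tuning. If set to `-1`, the full training set is used, which is typical for full-parameter fine-tuning.
-- `--val_dataset_sample`: Number of samples drawn from the validation set, default `None`. If set to `-1`, the full validation set is used for validation.
+- `--val_dataset_sample`: Number of samples drawn from the validation set, default `None`, in which case a suitable number of validation samples is selected automatically. If set to `-1`, the full validation set is used for validation.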
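 - `--system`: The system prompt used in the chat template, default `None`, i.e. the model's default system is used.
 - `--max_length`: Maximum token length, default `2048`. This avoids OOM caused by a few overly long samples. With `--truncation_strategy delete`, any sample longer than max_length is deleted. With `--truncation_strategy truncation_left`, the leading tokens are cut off: `input_ids[-max_length:]`. If set to -1, there is no limit.
 - `--truncation_strategy`: Default `'delete'`, which removes sentences longer than max_length from the dataset. `'truncation_left'` cuts off the left side of over-length text; this may cut into special tokens and hurt performance, so it is not recommended.
 - `--check_dataset_strategy`: Default `'none'`, i.e. no check. If the model you are training is an LLM, `'warning'` is recommended as the dataset check strategy. If your training target is a task such as sentence classification, `'none'` is recommended.
-- `--custom_train_dataset_path`: Default `None`. For details, see [Customization and Extension](./自定义与拓展.md).
-- `--custom_val_dataset_path`: Default `None`. For details, see [Customization and Extension](./自定义与拓展.md).
+- `--custom_train_dataset_path`: Default `[]`. For details, see [Customization and Extension](./自定义与拓展.md).
+- `--custom_val_dataset_path`: Default `[]`. For details, see [Customization and Extension](./自定义与拓展.md).
 - `--self_cognition_sample`: Number of samples drawn from the self-cognition dataset, default `0`. When you set this value to >0, you must also specify `--model_name` and `--model_author`. For more information, see [Best Practices for Self-Cognition Fine-Tuning](./自我认知微调最佳实践.md).
-- `--model_name`: Default `None`. If self-cognition dataset sampling is enabled (i.e. self_cognition_sample>0), you need to pass two values: the model's Chinese name and English name. For example: `--model_name 小黄 'Xiao Huang'`.
-- `--model_author`: Default `None`. If self-cognition dataset sampling is enabled, you need to pass two values: the author's Chinese name and English name. For example: `--model_author 魔搭 ModelScope`.
+- `--model_name`: Default `[None, None]`. If self-cognition dataset sampling is enabled (i.e. self_cognition_sample>0), you need to pass two values: the model's Chinese name and English name. For example: `--model_name 小黄 'Xiao Huang'`.
+- `--model_author`: Default `[None, None]`. If self-cognition dataset sampling is enabled, you need to pass two values: the author's Chinese name and English name. For example: `--model_author 魔搭 ModelScope`.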
 - `--quantization_bit`: Specifies whether to quantize and the number of quantization bits, default `0`, i.e. no quantization. To use 4-bit QLoRA, set `--sft_type lora --quantization_bit 4`.
 - `--bnb_4bit_comp_dtype`: With 4-bit quantization, the model must be dequantized during the forward and backward passes. This parameter specifies the torch_dtype after dequantization. Default `'AUTO'`, i.e. the same as `dtype`. Available values: 'fp16', 'bf16', 'fp32'. Has no effect when quantization_bit is 0.
 - `--bnb_4bit_quant_type`: Quantization type for 4-bit quantization, default `'nf4'`. Available values: 'nf4', 'fp4'. Has no effect when quantization_bit is 0.
 - `--bnb_4bit_use_double_quant`: Whether to enable double quantization for 4-bit quantization, default `True`. Has no effect when quantization_bit is 0.
-- `--lora_target_modules`: Specifies the LoRA modules, default `None`. If lora_target_modules is None, or `'DEFAULT'` or `'AUTO'` is passed, the `lora_target_modules` in `MODEL_MAPPING` are looked up by `model_type` (qkv by default). If `ALL` is passed, all Linear layers (excluding the head) are used as LoRA modules. If memory allows, 'ALL' is recommended. Only takes effect when `sft_type` is 'lora'.
+- `--lora_target_modules`: Specifies the LoRA modules, default `['DEFAULT']`. If lora_target_modules is None, or `'DEFAULT'` or `'AUTO'` is passed, the `lora_target_modules` in `MODEL_MAPPING` are looked up by `model_type` (qkv by default). If `ALL` is passed, all Linear layers (excluding the head) are used as LoRA modules. If memory allows, 'ALL' is recommended. Only takes effect when `sft_type` is 'lora'.
 - `--lora_rank`: Default `8`. Only takes effect when `sft_type` is 'lora'.
 - `--lora_alpha`: Default `32`. Only takes effect when `sft_type` is 'lora'.
 - `--lora_dropout_p`: Default `0.05`. Only takes effect when `sft_type` is 'lora'.
+- `--lora_bias_trainable`: Default `'none'`. Available values: 'none', 'all'. To make all biases trainable, set it to `'all'`.
+- `--lora_modules_to_save`: Default `[]`. To train the embedding, lm_head, or layer_norm, set this parameter, e.g. `--lora_modules_to_save wte ln1 ln_2 ln_f lm_head`.
 - `--lora_dtype`: Default `'fp32'`; specifies the dtype of the LoRA modules. With `AUTO`, the dtype of the original module is used. Available values: 'fp16', 'bf16', 'fp32', 'AUTO'.
 - `--neftune_alpha`: Noise coefficient added by `NEFTune`.
 - `--gradient_checkpointing`: Whether to enable gradient checkpointing, default `True`. This saves GPU memory at the cost of slightly slower training. It is most effective when max_length and batch_size are large.
@@ -74,10 +76,13 @@
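 - `--preprocess_num_proc`: Use multiple processes when preprocessing the dataset (tokenizing text). Default `1`. Like the `lazy_tokenize` command-line argument, it addresses slow preprocessing. However, it does not reduce memory usage, so for very large datasets `lazy_tokenize` is recommended. Recommended values: 4, 8. Note: with qwen-audio, this parameter is forced to 1, because qwen-audio's preprocessing function uses torch multiprocessing, which causes incompatibility.
 - `--use_flash_attn`: Whether to use flash attn, default `None`. For the flash_attn installation steps, see [https://github.com/Dao-AILab/flash-attention](https://github.com/Dao-AILab/flash-attention). For models that support flash_attn, see [Models supported by LLM](./支持的模型和数据集.md#模型).
 - `--ignore_args_error`: Whether to ignore Errors raised by invalid command-line arguments, default `False`. Set it to True if you need to copy the code into a notebook to run.
-- `--logging_dir`: Default `None`, i.e. set to `f'{self.output_dir}/runs'`, the storage path for tensorboard files.
 - `--check_model_is_latest`: Checks whether the model is the latest, default `True`. If you need to train offline, set this to `False`.
+- `--logging_dir`: Default `None`, i.e. set to `f'{self.output_dir}/runs'`, the storage path for tensorboard files.
+- `--report_to`: Default `['all']`.
+- `--acc_strategy`: Default `'token'`. Available values: 'token', 'sentence'.
 - `--save_on_each_node`: Takes effect in multi-node training, default `True`.
 - `--save_strategy`: Strategy for saving checkpoints, default `'steps'`. Available values: 'steps', 'no'.
+- `--save_safetensors`: Default `True`.
 - `--max_new_tokens`: Default `2048`. Only takes effect when `predict_with_generate` is set to True.
 - `--do_sample`: Default `True`. Only takes effect when `predict_with_generate` is set to True.
 - `--temperature`: Default `0.3`. Only takes effect when `predict_with_generate` is set to True.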
@@ -101,16 +106,16 @@
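 - `--eval_human`: Whether to evaluate with the validation split of the dataset or manually. Default `None`, i.e. chosen automatically: if no dataset (including custom datasets) is passed, manual evaluation is used; if a dataset is passed, dataset-based evaluation is used.
 - `--seed`: Default `42`. See `sft.sh command-line arguments` for details.
 - `--dtype`: Default `'AUTO'`. See `sft.sh command-line arguments` for details.
-- `--dataset`: Default `None`. See `sft.sh command-line arguments` for details.
+- `--dataset`: Default `[]`. See `sft.sh command-line arguments` for details.
 - `--dataset_seed`: Default `42`. See `sft.sh command-line arguments` for details.
 - `--dataset_test_ratio`: Default `0.01`. See `sft.sh command-line arguments` for details.
 - `--val_dataset_sample`: Number of validation samples to evaluate and display, default `10`.
 - `--system`: Default `None`. See `sft.sh command-line arguments` for details.
 - `--max_length`: Default `2048`. See `sft.sh command-line arguments` for details.
 - `--truncation_strategy`: Default `'delete'`. See `sft.sh command-line arguments` for details.
 - `--check_dataset_strategy`: Default `'none'`. See `sft.sh command-line arguments` for details.
-- `--custom_train_dataset_path`: Default `None`. For details, see the `Custom Dataset` section in README.md.
-- `--custom_val_dataset_path`: Default `None`. For details, see the `Custom Dataset` section in README.md.
+- `--custom_train_dataset_path`: Default `[]`. For details, see the `Custom Dataset` section in README.md.
+- `--custom_val_dataset_path`: Default `[]`. For details, see the `Custom Dataset` section in README.md.
 - `--quantization_bit`: Default `0`. See `sft.sh command-line arguments` for details.
 - `--bnb_4bit_comp_dtype`: Default `'AUTO'`. See `sft.sh command-line arguments` for details. Has no effect if `quantization_bit` is set to 0.
 - `--bnb_4bit_quant_type`: Default `'nf4'`. See `sft.sh command-line arguments` for details. Has no effect if `quantization_bit` is set to 0.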
@@ -139,4 +144,4 @@
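 - `--port`: Default `8000`.
 - `--ssl_keyfile`: Default `None`.
 - `--ssl_certfile`: Default `None`.
-- Other arguments are inherited from the infer command-line arguments
+- Other arguments are inherited from the infer command-line arguments.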
diff --git a/docs/source/LLM/自定义与拓展.md b/docs/source/LLM/自定义与拓展.md
@@ -20,11 +20,11 @@
 
 For the corresponding sh example script, see [here](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/tongyi_finance_14b_chat_int4/qlora/sft.sh).
 
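-1. `--custom_train_dataset_path`: Default `None`, meaning no custom dataset is used. You can specify it as `--custom_train_dataset_path alpaca.csv`, or specify multiple training datasets as `--custom_train_dataset_path alpaca.csv chatml.jsonl swift.jsonl`; the script preprocesses and concatenates them automatically.
+1. `--custom_train_dataset_path`: Default `[]`, meaning no custom dataset is used. You can specify it as `--custom_train_dataset_path alpaca.csv`, or specify multiple training datasets as `--custom_train_dataset_path alpaca.csv chatml.jsonl swift.jsonl`; the script preprocesses and concatenates them automatically.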
 
    > You can train on a combination of public and custom datasets: `--dataset blossom-math-zh --custom_train_dataset_path custom_math.jsonl`.
 
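-2. `--custom_val_dataset_path`: Default `None`, meaning no custom validation dataset is used. If you specify `custom_train_dataset_path`, the validation set of the custom dataset is split off according to the `dataset_test_ratio` command-line argument.
+2. `--custom_val_dataset_path`: Default `[]`, meaning no custom validation dataset is used. If you specify `custom_train_dataset_path`, the validation set of the custom dataset is split off according to the `dataset_test_ratio` command-line argument.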
 
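 The script supports the `csv`, `jsonl`, and `json` file formats. The files you pass must conform to the following dataset formats. csv files support only instruction tuning, i.e. no history. jsonl files support system and history.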
 

diff --git a/swift/llm/tuner.py b/swift/llm/tuner.py
@@ -24,7 +24,9 @@ def prepare_model(model, args: SftArguments):
                 'r': args.lora_rank,
                 'target_modules': args.lora_target_modules,
                 'lora_alpha': args.lora_alpha,
-                'lora_dropout': args.lora_dropout_p
+                'lora_dropout': args.lora_dropout_p,
+                'bias': args.lora_bias_trainable,
+                'modules_to_save': args.lora_modules_to_save,
             }
             if args.sft_type == 'lora':
                 if args.tuner_backend == 'swift':

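The two keys added to the LoRA kwargs above forward `lora_bias_trainable` as `bias` and `lora_modules_to_save` as `modules_to_save`. A minimal sketch of that mapping, assuming a plain namespace in place of `SftArguments`; `build_lora_kwargs` is a hypothetical helper for illustration, not a function in swift, and the argument values are examples only.

```python
from types import SimpleNamespace
from typing import Any, Dict


def build_lora_kwargs(args: Any) -> Dict[str, Any]:
    # Same keys as the dict in the diff, including the two new entries.
    return {
        'r': args.lora_rank,
        'target_modules': args.lora_target_modules,
        'lora_alpha': args.lora_alpha,
        'lora_dropout': args.lora_dropout_p,
        'bias': args.lora_bias_trainable,
        'modules_to_save': args.lora_modules_to_save,
    }


# Hypothetical argument values standing in for parsed command-line arguments.
args = SimpleNamespace(
    lora_rank=8,
    lora_target_modules=['DEFAULT'],
    lora_alpha=32,
    lora_dropout_p=0.05,
    lora_bias_trainable='all',
    lora_modules_to_save=['ln_f', 'lm_head'],
)
lora_kwargs = build_lora_kwargs(args)
print(lora_kwargs['bias'], lora_kwargs['modules_to_save'])
```

With the documented defaults (`'none'` and `[]`), the two new keys leave existing runs unchanged, so the extra entries only matter when a user opts in.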
diff --git a/swift/llm/utils/__init__.py b/swift/llm/utils/__init__.py
@@ -24,7 +24,7 @@
                        CompletionResponseChoice,
                        CompletionResponseStreamChoice,
                        CompletionStreamResponse, DeltaMessage, Model,
-                       ModelList, UsageInfo, XRequest, random_uuid)
+                       ModelList, UsageInfo, XRequestConfig, random_uuid)
 from .template import (DEFAULT_SYSTEM, TEMPLATE_MAPPING, History, Prompt,
                        StopWords, Template, TemplateType, get_template,
                        register_template)

diff --git a/swift/llm/utils/argument.py b/swift/llm/utils/argument.py
@@ -88,6 +88,9 @@ class SftArguments:
     lora_rank: int = 8
     lora_alpha: int = 32
     lora_dropout_p: float = 0.05
+    lora_bias_trainable: Literal['none', 'all'] = 'none'
+    # e.g. ['wte', 'ln1', 'ln_2', 'ln_f', 'lm_head']
+    lora_modules_to_save: List[str] = field(default_factory=list)
     lora_dtype: Literal['fp16', 'bf16', 'fp32', 'AUTO'] = 'fp32'
 
     neftune_alpha: float = 0.0

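The new `lora_modules_to_save` field uses `field(default_factory=list)` rather than a bare `[]`. The sketch below shows why, using a hypothetical `ArgsSketch` dataclass in place of `SftArguments`: each instance gets its own list, and Python's `dataclasses` module rejects a bare mutable default when the class is defined.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ArgsSketch:
    # Hypothetical stand-in for SftArguments with only the new list field.
    lora_modules_to_save: List[str] = field(default_factory=list)


a = ArgsSketch()
b = ArgsSketch()
a.lora_modules_to_save.append('lm_head')
# Each instance owns a separate list, so b is unaffected by the append on a.
print(a.lora_modules_to_save, b.lora_modules_to_save)

# A bare mutable default is refused when the dataclass is created.
error_message = ''
try:
    @dataclass
    class BadArgs:
        modules: List[str] = []
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```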
diff --git a/swift/llm/utils/client_utils.py b/swift/llm/utils/client_utils.py
@@ -8,7 +8,7 @@
 from .model import get_default_template_type
 from .protocol import (ChatCompletionResponse, ChatCompletionStreamResponse,
                        CompletionResponse, CompletionStreamResponse, ModelList,
-                       XRequest)
+                       XRequestConfig)
 from .template import History
 from .utils import history_to_messages
 
@@ -30,24 +30,28 @@ def _parse_stream_data(data: bytes) -> Optional[str]:
 
 
 def inference_client(
+    model_type: str,
     query: str,
     history: Optional[History] = None,
     system: Optional[str] = None,
     *,
-    request_kwargs: Optional[XRequest],
+    request_config: Optional[XRequestConfig] = None,
     host: str = '127.0.0.1',
     port: str = '8000',
     is_chat_request: Optional[bool] = None,
 ) -> Union[ChatCompletionResponse, CompletionResponse,
            Iterator[ChatCompletionStreamResponse],
            Iterator[CompletionStreamResponse]]:
+    if request_config is None:
+        request_config = XRequestConfig()
     if is_chat_request is None:
-        template_type = get_default_template_type(request_kwargs.model)
+        template_type = get_default_template_type(model_type)
         is_chat_request = 'generation' not in template_type
     data = {
         k: v
-        for k, v in request_kwargs.__dict__.items() if not k.startswith('__')
+        for k, v in request_config.__dict__.items() if not k.startswith('__')
     }
+    data['model'] = model_type
     if is_chat_request:
         data['messages'] = history_to_messages(history, query, system)
         url = f'http://{host}:{port}/v1/chat/completions'
@@ -57,7 +61,7 @@ def inference_client(
         )
         data['prompt'] = query
         url = f'http://{host}:{port}/v1/completions'
-    if request_kwargs.stream:
+    if request_config.stream:
         if is_chat_request:
             ret_cls = ChatCompletionStreamResponse
         else: