diff --git a/README.md b/README.md index 967208d854..a7597220da 100644 --- a/README.md +++ b/README.md @@ -51,9 +51,48 @@ Users can check the [documentation of Swift](docs/source/GetStarted/Introduction ## LLM SFT Example Press [this link](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm) to view the detail documentation of these examples. +### Basic Usage +```bash +git clone https://github.com/modelscope/swift.git +cd swift +pip install .[llm] +``` + +```python +# Experimental environment: A10, 3090, A100, ... +# 16GB GPU memory +import os +os.environ['CUDA_VISIBLE_DEVICES'] = '0' + +import torch + +from swift.llm import DatasetName, InferArguments, ModelType, SftArguments +from swift.llm.run import infer_main, sft_main + +model_type = ModelType.qwen_7b_chat_int4 +sft_args = SftArguments( + model_type=model_type, + eval_steps=50, + train_dataset_sample=2000, + dataset=[DatasetName.leetcode_python_en], + output_dir='output', + gradient_checkpointing=True) +best_ckpt_dir = sft_main(sft_args) +print(f'best_ckpt_dir: {best_ckpt_dir}') +torch.cuda.empty_cache() +infer_args = InferArguments( + model_type=sft_args.model_type, + ckpt_dir=best_ckpt_dir, + dataset=sft_args.dataset, + stream=True, + show_dataset_sample=5) +infer_main(infer_args) +``` + + ### Features - Supported SFT Methods: [lora](https://arxiv.org/abs/2106.09685), [qlora](https://arxiv.org/abs/2305.14314), full(full parameter fine-tuning) -- Supported Features: quantization, DDP, model parallelism, gradient checkpointing, gradient accumulation, pushing to modelscope hub, custom datasets, multimodal and agent SFT, mutli-round chat, ... +- Supported Features: quantization, DDP, model parallelism, gradient checkpointing, pushing to modelscope hub, custom datasets, multimodal and agent SFT, mutli-round chat, ... 
- Supported Models: - 🔥 qwen series: [qwen-7b](https://modelscope.cn/models/qwen/Qwen-7B/summary), [qwen-7b-chat](https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary), [qwen-14b](https://modelscope.cn/models/qwen/Qwen-14B/summary), [qwen-14b-chat](https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary), [qwen-7b-chat-int4](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary), [qwen-14b-chat-int4](https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary), [qwen-7b-chat-int8](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int8/summary), [qwen-14b-chat-int8](https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int8/summary) - 🔥 qwen-vl series: [qwen-vl](https://modelscope.cn/models/qwen/Qwen-VL/summary), [qwen-vl-chat](https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary), [qwen-vl-chat-int4](https://modelscope.cn/models/qwen/Qwen-VL-Chat-Int4/summary) @@ -65,6 +104,7 @@ Press [this link](https://github.com/modelscope/swift/tree/main/examples/pytorch - xverse series: [xverse-7b](https://modelscope.cn/models/xverse/XVERSE-7B/summary), [xverse-7b-chat](https://modelscope.cn/models/xverse/XVERSE-7B-Chat/summary), [xverse-13b](https://modelscope.cn/models/xverse/XVERSE-13B/summary), [xverse-13b-chat](https://modelscope.cn/models/xverse/XVERSE-13B-Chat/summary) - mistral series: [mistral-7b](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-v0.1/summary), [mistral-7b-chat](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-Instruct-v0.1/summary) - ziya series: [ziya2-13b](https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary), [ziya2-13b-chat](https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Chat/summary) + - skywork series: [skywork-13b](https://modelscope.cn/models/skywork/Skywork-13B-base/summary), [skywork-13b-chat](https://modelscope.cn/models/skywork/Skywork-13B-chat/summary) - other: [polylm-13b](https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation/summary), [seqgpt-560m](https://modelscope.cn/models/damo/nlp_seqgpt-560m/summary) - Supported Datasets: - NLP: @@ -81,8 +121,8 @@ Press [this link](https://github.com/modelscope/swift/tree/main/examples/pytorch - Multi-Modal: 🔥[coco-en](https://modelscope.cn/datasets/modelscope/coco_2014_caption/summary) - Custom Dataset - Supported Templates: - - Text Generation: default-generation, chatglm2-generation - - Chat: chatml(qwen), baichuan, chatglm2, chatglm3, llama, openbuddy-llama, default, internlm, xverse + - Text Generation: default-generation, chatglm-generation + - Chat: chatml(qwen), baichuan, chatglm2, chatglm3, llama, openbuddy-llama, default, internlm, xverse, skywork # Installation diff --git a/README_CN.md b/README_CN.md index 854230d14b..8da179875c 100644 --- a/README_CN.md +++ b/README_CN.md @@ -49,9 +49,48 @@ SWIFT(Scalable lightWeight Infrastructure for Fine-Tuning)是一个可扩展 ## 大模型微调的例子 可以[在这里](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm) 查看LLM微调的使用文档。 +### 简单使用 +```bash +git clone https://github.com/modelscope/swift.git +cd swift +pip install .[llm] +``` + +```python +# Experimental environment: A10, 3090, A100, ... 
+# 16GB GPU memory +import os +os.environ['CUDA_VISIBLE_DEVICES'] = '0' + +import torch + +from swift.llm import DatasetName, InferArguments, ModelType, SftArguments +from swift.llm.run import infer_main, sft_main + +model_type = ModelType.qwen_7b_chat_int4 +sft_args = SftArguments( + model_type=model_type, + eval_steps=50, + train_dataset_sample=2000, + dataset=[DatasetName.leetcode_python_en], + output_dir='output', + gradient_checkpointing=True) +best_ckpt_dir = sft_main(sft_args) +print(f'best_ckpt_dir: {best_ckpt_dir}') +torch.cuda.empty_cache() +infer_args = InferArguments( + model_type=sft_args.model_type, + ckpt_dir=best_ckpt_dir, + dataset=sft_args.dataset, + stream=True, + show_dataset_sample=5) +infer_main(infer_args) +``` + + ### 特性 - 支持的SFT方法: [lora](https://arxiv.org/abs/2106.09685), [qlora](https://arxiv.org/abs/2305.14314), 全参数微调 -- 支持的特性: 模型量化, DDP, 模型并行, gradient checkpointing, 梯度累加, 支持推送ModelScope Hub, 自定义数据集, 多模态和Agent SFT, 多轮对话, ... +- 支持的特性: 模型量化, DDP, 模型并行, gradient checkpointing, 支持推送ModelScope Hub, 自定义数据集, 多模态和Agent SFT, 多轮对话, ... - 支持的模型 - 🔥 qwen 系列: [qwen-7b](https://modelscope.cn/models/qwen/Qwen-7B/summary), [qwen-7b-chat](https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary), [qwen-14b](https://modelscope.cn/models/qwen/Qwen-14B/summary), [qwen-14b-chat](https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary), [qwen-7b-chat-int4](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary), [qwen-14b-chat-int4](https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary), [qwen-7b-chat-int8](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int8/summary), [qwen-14b-chat-int8](https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int8/summary) - 🔥 qwen-vl 系列: [qwen-vl](https://modelscope.cn/models/qwen/Qwen-VL/summary), [qwen-vl-chat](https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary), [qwen-vl-chat-int4](https://modelscope.cn/models/qwen/Qwen-VL-Chat-Int4/summary) @@ -63,6 +102,7 @@ SWIFT(Scalable lightWeight Infrastructure for Fine-Tuning)是一个可扩展 - xverse 系列: [xverse-7b](https://modelscope.cn/models/xverse/XVERSE-7B/summary), [xverse-7b-chat](https://modelscope.cn/models/xverse/XVERSE-7B-Chat/summary), [xverse-13b](https://modelscope.cn/models/xverse/XVERSE-13B/summary), [xverse-13b-chat](https://modelscope.cn/models/xverse/XVERSE-13B-Chat/summary) - mistral 系列: [mistral-7b](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-v0.1/summary), [mistral-7b-chat](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-Instruct-v0.1/summary) - ziya 系列: [ziya2-13b](https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary), [ziya2-13b-chat](https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Chat/summary) + - skywork 系列: [skywork-13b](https://modelscope.cn/models/skywork/Skywork-13B-base/summary), [skywork-13b-chat](https://modelscope.cn/models/skywork/Skywork-13B-chat/summary) - other: [polylm-13b](https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation/summary), [seqgpt-560m](https://modelscope.cn/models/damo/nlp_seqgpt-560m/summary) - 支持的数据集: - NLP: @@ -79,8 +119,8 @@ SWIFT(Scalable lightWeight Infrastructure for Fine-Tuning)是一个可扩展 - 多模态: 🔥[coco-en](https://modelscope.cn/datasets/modelscope/coco_2014_caption/summary) - 自定义数据集 - 支持的对话模板: - - 文本生成: default-generation, chatglm2-generation - - 对话: chatml(qwen), baichuan, chatglm2, chatglm3, llama, openbuddy-llama, default, internlm, xverse + - 文本生成: default-generation, chatglm-generation + - 对话: chatml(qwen), baichuan, chatglm2, chatglm3, llama, openbuddy-llama, default, internlm, xverse, skywork # 安装 
diff --git a/examples/pytorch/llm/README.md b/examples/pytorch/llm/README.md index fa427d782e..3ffe37565d 100644 --- a/examples/pytorch/llm/README.md +++ b/examples/pytorch/llm/README.md @@ -3,7 +3,7 @@
@@ -17,7 +17,7 @@ ## Features - Supported SFT Methods: [lora](https://arxiv.org/abs/2106.09685), [qlora](https://arxiv.org/abs/2305.14314), full(full parameter fine-tuning) -- Supported Features: quantization, DDP, model parallelism, gradient checkpointing, gradient accumulation, pushing to modelscope hub, custom datasets, multimodal and agent SFT, mutli-round chat, ... +- Supported Features: quantization, DDP, model parallelism, gradient checkpointing, pushing to modelscope hub, custom datasets, multimodal and agent SFT, mutli-round chat, ... - Supported Models: - 🔥 qwen series: [qwen-7b](https://modelscope.cn/models/qwen/Qwen-7B/summary), [qwen-7b-chat](https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary), [qwen-14b](https://modelscope.cn/models/qwen/Qwen-14B/summary), [qwen-14b-chat](https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary), [qwen-7b-chat-int4](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary), [qwen-14b-chat-int4](https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary), [qwen-7b-chat-int8](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int8/summary), [qwen-14b-chat-int8](https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int8/summary) - 🔥 qwen-vl series: [qwen-vl](https://modelscope.cn/models/qwen/Qwen-VL/summary), [qwen-vl-chat](https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary), [qwen-vl-chat-int4](https://modelscope.cn/models/qwen/Qwen-VL-Chat-Int4/summary) @@ -29,6 +29,7 @@ - xverse series: [xverse-7b](https://modelscope.cn/models/xverse/XVERSE-7B/summary), [xverse-7b-chat](https://modelscope.cn/models/xverse/XVERSE-7B-Chat/summary), [xverse-13b](https://modelscope.cn/models/xverse/XVERSE-13B/summary), [xverse-13b-chat](https://modelscope.cn/models/xverse/XVERSE-13B-Chat/summary) - mistral series: [mistral-7b](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-v0.1/summary), [mistral-7b-chat](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-Instruct-v0.1/summary) - ziya series: [ziya2-13b](https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary), [ziya2-13b-chat](https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Chat/summary) + - skywork series: [skywork-13b](https://modelscope.cn/models/skywork/Skywork-13B-base/summary), [skywork-13b-chat](https://modelscope.cn/models/skywork/Skywork-13B-chat/summary) - other: [polylm-13b](https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation/summary), [seqgpt-560m](https://modelscope.cn/models/damo/nlp_seqgpt-560m/summary) - Supported Datasets: - NLP: @@ -45,27 +46,28 @@ - Multi-Modal: 🔥[coco-en](https://modelscope.cn/datasets/modelscope/coco_2014_caption/summary) - Custom Dataset - Supported Templates: - - Text Generation: default-generation, chatglm2-generation - - Chat: chatml(qwen), baichuan, chatglm2, chatglm3, llama, openbuddy-llama, default, internlm, xverse + - Text Generation: default-generation, chatglm-generation + - Chat: chatml(qwen), baichuan, chatglm2, chatglm3, llama, openbuddy-llama, default, internlm, xverse, skywork ## News -- 🔥 2023.10.27: Support for chatglm3 series models: chatglm3-6b-base, chatglm3-6b, chatglm3-6b-32k. The corresponding shell script can be found in `scripts/chatglm3_6b_32k`. -- 🔥 2023.10.24: Use the registration mechanism to add models, datasets, and chat templates. To customize models, datasets, and chat templates, refer to the "User Guide" section. The corresponding Python file can be found in `custom.py`, and the corresponding shell script can be found in `scripts/custom/tigerbot_13b_chat`. 
-- 🔥 2023.10.17: Supported int4, int8 models: qwen-7b-chat-int4, qwen-14b-chat-int4, qwen-vl-chat-int4, baichuan2-7b-chat-int4, baichuan2-13b-chat-int4, qwen-7b-chat-int8, qwen-14b-chat-int8. The corresponding shell script can be found at `scripts/qwen_7b_chat_int4`, `scripts/qwen_14b_chat_int4`, `scripts/qwen_vl_chat_int4`, `scripts/qwen_7b_chat_int8`, `scripts/qwen_14b_chat_int8`. -- 2023.10.15: Supported ziya2-13b model series: ziya2-13b, ziya2-13b-chat. The corresponding shell script can be found at `scripts/ziya2_13b_chat`. -- 2023.10.12: Supported mistral-7b model series: openbuddy-mistral-7b-chat, mistral-7b, mistral-7b-chat. The corresponding shell script can be found at `scripts/openbuddy_mistral_7b_chat`, `scripts/mistral_7b_chat`. -- 🔥 2023.10.7: Supported DeepSpeed ZeRO-2, enabling LoRA (not just QLoRA) to run DDP on 2*A10. The corresponding shell script can be found at `scripts/qwen_7b_chat/lora_ddp_ds/sft.sh`. +- 2023.10.30: Support for **skywork-13b** series models: skywork-13b, skywork-13b-chat. The corresponding shell script can be found in `scripts/skywork_13b`. +- 🔥 2023.10.27: Support for **chatglm3** series models: chatglm3-6b-base, chatglm3-6b, chatglm3-6b-32k. The corresponding shell script can be found in `scripts/chatglm3_6b_32k`. +- 🔥 2023.10.24: Use the **registration mechanism** to add models, **datasets**, and chat templates. To customize models, datasets, and chat templates, refer to the "User Guide" section. The corresponding Python file can be found in `custom.py`, and the corresponding shell script can be found in `scripts/custom/tigerbot_13b_chat`. +- 🔥 2023.10.17: Supported **int4, int8** models: qwen-7b-chat-int4, qwen-14b-chat-int4, qwen-vl-chat-int4, baichuan2-7b-chat-int4, baichuan2-13b-chat-int4, qwen-7b-chat-int8, qwen-14b-chat-int8. The corresponding shell script can be found at `scripts/qwen_7b_chat_int4`, `scripts/qwen_14b_chat_int4`, `scripts/qwen_vl_chat_int4`, `scripts/qwen_7b_chat_int8`, `scripts/qwen_14b_chat_int8`. +- 2023.10.15: Supported **ziya2-13b** model series: ziya2-13b, ziya2-13b-chat. The corresponding shell script can be found at `scripts/ziya2_13b_chat`. +- 2023.10.12: Supported **mistral-7b** model series: openbuddy-mistral-7b-chat, mistral-7b, mistral-7b-chat. The corresponding shell script can be found at `scripts/openbuddy_mistral_7b_chat`, `scripts/mistral_7b_chat`. +- 🔥 2023.10.7: Supported **DeepSpeed ZeRO-2**, enabling LoRA (not just QLoRA) to run DDP on 2*A10. The corresponding shell script can be found at `scripts/qwen_7b_chat/lora_ddp_ds/sft.sh`. - 2023.10.4: Supported datasets in the fields of mathematics, law, SQL, and coding: blossom-math-zh, school-math-zh, text2sql-en, sql-create-context-en, lawyer-llama-zh, tigerbot-law-zh, leetcode-python-en. -- 🔥 2023.9.25: Supported qwen-14b model series: qwen-14b, qwen-14b-chat. The corresponding shell script can be found at `scripts/qwen_14b`, `scripts/qwen_14b_chat`. -- 2023.9.18: Supported internlm-20b model series: internlm-20b, internlm-20b-chat. The corresponding shell script can be found at `scripts/internlm_20b`, `scripts/internlm_20b_chat`. -- 2023.9.12: Supported training with MP+DDP to accelerate full-parameter fine-tuning speed. The corresponding shell script can be found at `scripts/qwen_7b_chat/full_mp_ddp/sft.sh`. +- 🔥 2023.9.25: Supported **qwen-14b** model series: qwen-14b, qwen-14b-chat. The corresponding shell script can be found at `scripts/qwen_14b`, `scripts/qwen_14b_chat`. 
+- 2023.9.18: Supported **internlm-20b** model series: internlm-20b, internlm-20b-chat. The corresponding shell script can be found at `scripts/internlm_20b`, `scripts/internlm_20b_chat`. +- 2023.9.12: Supported training with **MP+DDP** to accelerate full-parameter fine-tuning speed. The corresponding shell script can be found at `scripts/qwen_7b_chat/full_mp_ddp/sft.sh`. - 2023.9.5: Supported training that only saves model weights without saving intermediate states such as optimizer weights required for checkpoint resumption, avoiding long checkpoint-saving times and large storage space in full-parameter fine-tuning. You can check the command-line parameter `--only_save_model` in the `sft.sh` script. -- 2023.9.5: Supported openbuddy-llama2-70b-chat model. The corresponding shell script can be found at `scripts/openbuddy_llama2_70b_chat`. -- 2023.9.3: Supported baichuan2 model series: baichuan2-7b, baichuan2-7b-chat, baichuan2-13b, baichuan2-13b-chat. The corresponding shell script can be found at `scripts/baichuan2_7b`, `scripts/baichuan2_7b_chat`, `scripts/baichuan2_13b_chat`. +- 2023.9.5: Supported **openbuddy-llama2-70b-chat** model. The corresponding shell script can be found at `scripts/openbuddy_llama2_70b_chat`. +- 2023.9.3: Supported **baichuan2** model series: baichuan2-7b, baichuan2-7b-chat, baichuan2-13b, baichuan2-13b-chat. The corresponding shell script can be found at `scripts/baichuan2_7b`, `scripts/baichuan2_7b_chat`, `scripts/baichuan2_13b_chat`. -## Prepare the Environment +## Preparing the Environment Experimental environment: A10, 3090, V100, A100, ... ```bash # Setting up a global mirror for pip and installing related Python packages @@ -89,7 +91,7 @@ pip install bitsandbytes -U ## Basic Usage -The following examples can be used to test the environment. Please make sure you have read the "Preparing the Experimental Environment" section. +The following examples can be used to **test the environment**. Please make sure you have read the "Preparing the Environment" section. ```python # Experimental environment: A10, 3090, A100, ... # 16GB GPU memory @@ -122,23 +124,23 @@ infer_main(infer_args) ``` ## Run SFT and Inference -Performace: full(nice) > lora > qlora +Performace: full(nice) > lora > qlora(auto_gptq) > qlora(bnb) Training GPU memory: qlora(low,3090) > lora > full(2*A100) **Tips**: -- You can set `--gradient_checkpointing true` during training to save GPU memory, but this will slightly decrease the training speed. This is useful if you need to train LLM on consumer-grade GPU, e.g. 3090. -- If you want to use quantization based on auto_gptq, you need to install auto_gptq first: `pip install auto_gptq -U`. +- You can set `--gradient_checkpointing true` during training to **save GPU memory**, but this will slightly decrease the training speed. This is useful if you need to train LLM on **consumer-grade GPU**, e.g. 3090. +- If you want to use quantization based on **auto_gptq**, you need to install auto_gptq first: `pip install auto_gptq -U`. The models available with auto_gptq are: `qwen-7b-chat-int4`, `qwen-14b-chat-int4`, `qwen-7b-chat-int8`, `qwen-14b-chat-int8`. - If the script provides multiple versions of qlora SFT, including both non-quantized models and int4/int8 models, it is recommended to use the script for the int4/int8 model versions. + If the script provides multiple versions of qlora SFT, including both non-quantized models and int4/int8 models, it is **recommended to use the script for the int4/int8 model versions**. 
- If you want to use the quantization parameter `quantization_bit`, you need to install `bitsandbytes` first: `pip install bitsandbytes -U`. -- If you want to use deepspeed, you need to `pip install deepspeed -U`. Using deepspeed can save GPU memory, but this may slightly decrease the training speed. -- If you are using older GPUs like V100, you need to set `--dtype fp16`, because they do not support bf16. -- qwen recommends installing [flash-attn](https://github.com/Dao-AILab/flash-attention), which will accelerate the training and inference speed and reduce GPU memory usage (A10, 3090, V100 machines do not support flash-attn). -- If you want to perform second pre-training instead of SFT, you can refer to the `DatasetName.tigerbot_law_zh` dataset and its corresponding sh file: `scripts/qwen_7b/qlora_ddp`. +- If you want to use deepspeed, you need to `pip install deepspeed -U`. Using deepspeed can **save GPU memory**, but this may slightly decrease the training speed. +- If you are using older GPUs like **V100**, you need to set `--dtype fp16`, because they do not support bf16. +- qwen recommends installing [**flash-attn**](https://github.com/Dao-AILab/flash-attention), which will accelerate the training and inference speed and reduce GPU memory usage (A10, 3090, V100 machines do not support flash-attn). +- If you want to perform **second pre-training** instead of SFT, you can refer to the `DatasetName.tigerbot_law_zh` dataset and its corresponding sh file: `scripts/qwen_7b/qlora_ddp`. - If you want to push weights to the ModelScope Hub during training, you need to set `--push_to_hub true`. -- If you want to merge LoRA weights and save them during inference, you need to set `--merge_lora_and_save true`. It is not recommended to merge quantized models, as this can result in performance degradation, specifically in the case of qlora. -- Below is a shell script for running `qwen_7b_chat` directly (you just need to specify `ckpt_dir` during inference to execute it smoothly). For more model scripts, you can check the `scripts` folder. If you want to customize a shell script, it is recommended to refer to the script in `scripts/qwen_7b_chat`. +- If you want to merge LoRA weights and save them during inference, you need to set `--merge_lora_and_save true`. It is **not recommended to merge quantized models**, as this can result in performance degradation, specifically in the case of qlora. +- Below is a shell script for running `qwen_7b_chat` directly (you just need to specify `ckpt_dir` during inference to execute it smoothly). For more model scripts, you can check the `scripts` folder. If you want to **customize a shell script**, it is recommended to refer to the script in `scripts/qwen_7b_chat`. ```bash # sft(qlora) and infer qwen-7b-chat-int8, Requires 16GB GPU memory. # Recommended experimental environment: V100, A10, 3090 @@ -190,6 +192,7 @@ bash scripts/qwen_7b_chat/full_mp/infer.sh bash scripts/qwen_7b_chat/full_mp_ddp/sft.sh bash scripts/qwen_7b_chat/full_mp_ddp/infer.sh +# The qlora script based on bnb below is no longer recommended for use. Please prioritize using the qlora script based on auto_gptq. # sft(qlora) and infer qwen-7b-chat, Requires 13GB GPU memory. # Recommended experimental environment: A10, 3090 bash scripts/qwen_7b_chat/qlora/sft.sh @@ -209,7 +212,7 @@ bash scripts/qwen_7b_chat/qlora_ddp_ds/infer.sh ## User Guide ### Custom Model -Here is an example of a custom model. Running the shell script for this custom model can be found in `scripts/custom/tigerbot_13b_chat`. 
+Here is an example of a **custom model**. Running the shell script for this custom model can be found in `scripts/custom/tigerbot_13b_chat`. ```python from swift.llm import ( @@ -268,7 +271,54 @@ The `register_model` function registers the model in `MODEL_MAPPING`, and its pa ### Custom Dataset -Here is an example of a custom dataset. Running the shell script for this custom dataset can be found in `scripts/custom/tigerbot_13b_chat`. +We support two methods for **customizing datasets**. +1. **Command line arguments**: It is **more convenient for using local custom datasets**. +2. **Registering datasets**: It is more flexible and allows for **further extension and development of swift**, but it requires some programming skills. Method 1 is implemented on top of Method 2. + +#### Command Line Arguments +Explanation of the command line arguments: +1. `--custom_train_dataset_path`: The default value is `None`, which means no custom dataset is used. You can specify it in the following format: `--custom_train_dataset_path alpaca.csv`, or specify multiple training datasets like `--custom_train_dataset_path alpaca.csv chatml.jsonl swift.jsonl`. The script will automatically preprocess and concatenate them. You can also combine public datasets with custom datasets for training: `--dataset blossom-math-zh --custom_train_dataset_path custom_math.jsonl`. +2. `--custom_val_dataset_path`: The default value is `None`, which means no custom validation dataset is used. If you specify `custom_train_dataset_path`, the validation split of the custom dataset will be carved out according to the command line argument `dataset_test_ratio`. The input format is the same as for `--custom_train_dataset_path`. + +The script supports `csv` and `jsonl` files. The files you pass in must conform to one of the following dataset formats. csv files only support instruction fine-tuning, i.e. without history. jsonl files support system and history.
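+
+To make the expected file layout concrete before the format listings below, here is a minimal sketch that writes a few training samples in the `query`/`response`/`history` layout (Format 2 below) using only the Python standard library. The sample texts are placeholders, the reuse of the `custom_math.jsonl` name from the example above is purely illustrative, and treating an empty `history` list as a valid single-turn sample is an assumption.
+
+```python
+# Minimal, illustrative sketch: write a custom dataset file in the
+# query/response/history layout (Format 2 below).
+import json
+
+samples = [
+    # Assumption: an empty history list is accepted for single-turn samples.
+    {'query': 'What is 1 + 1?', 'response': '1 + 1 = 2', 'history': []},
+    {'query': 'Now add 3 to the result.', 'response': 'The result is 5.',
+     'history': [['What is 1 + 1?', '1 + 1 = 2']]},
+]
+
+with open('custom_math.jsonl', 'w', encoding='utf-8') as f:
+    for sample in samples:
+        f.write(json.dumps(sample, ensure_ascii=False) + '\n')
+
+# The file can then be passed to training, e.g.:
+#   python llm_sft.py ... --custom_train_dataset_path custom_math.jsonl
+```
+
+Such a file can also be combined with a public dataset, as in the `--dataset blossom-math-zh --custom_train_dataset_path custom_math.jsonl` example above.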
+ +Format 1: +```csv +instruction,input,output +11111,22222,33333 +aaaaa,bbbbb,ccccc +AAAAA,BBBBB,CCCCC +``` + +```jsonl +{"instruction": "11111", "input": "aaaaa", "output": "AAAAA"} +{"instruction": "22222", "input": "bbbbb", "output": "BBBBB"} +{"instruction": "33333", "input": "ccccc", "output": "CCCCC"} +``` + +Format 2: +```jsonl +{"query": "55555", "response": "66666", "history": [["11111", "22222"], ["33333", "44444"]]} +{"query": "eeeee", "response": "fffff", "history": [["aaaaa", "bbbbb"], ["ccccc", "ddddd"]]} +{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]} +``` + +Format 3: +```jsonl +{"conversations": [{"from": "user", "value": "11111"}, {"from": "assistant", "value": "22222"}, {"from": "user", "value": "33333"}, {"from": "assistant", "value": "44444"}]} +{"conversations": [{"from": "user", "value": "aaaaa"}, {"from": "assistant", "value": "bbbbb"}, {"from": "user", "value": "ccccc"}, {"from": "assistant", "value": "ddddd"}]} +{"conversations": [{"from": "user", "value": "AAAAA"}, {"from": "assistant", "value": "BBBBB"}, {"from": "user", "value": "CCCCC"}, {"from": "assistant", "value": "DDDDD"}]} +``` + +Format 4: +```jsonl +{"messages": [{"role": "user", "content": "11111"}, {"role": "assistant", "content": "22222"}, {"role": "user", "content": "33333"}, {"role": "assistant", "content": "44444"}]} +{"messages": [{"role": "user", "content": "aaaaa"}, {"role": "assistant", "content": "bbbbb"}, {"role": "user", "content": "ccccc"}, {"role": "assistant", "content": "ddddd"}]} +{"messages": [{"role": "user", "content": "AAAAA"}, {"role": "assistant", "content": "BBBBB"}, {"role": "user", "content": "CCCCC"}, {"role": "assistant", "content": "DDDDD"}]} +``` + + +#### Registering Datasets +Here is an example of **registering a dataset**. Running the shell script for this custom dataset can be found in `scripts/custom/tigerbot_13b_chat`. ```python import ast @@ -302,7 +352,7 @@ register_dataset( if __name__ == '__main__': train_dataset, _ = get_dataset([CustomDatasetName.agent_instruct_all_en], - 0.) + 0., check_dataset_strategy='warning') print(train_dataset) print(train_dataset[0].keys()) ``` @@ -321,7 +371,7 @@ The `register_dataset` function registers the dataset in the `DATASET_MAPPING`. ### Custom Chat Template -Here is an example of a custom template. Running the shell script for this custom template can be found in `scripts/custom/tigerbot_13b_chat`. +Here is an example of a **custom template**. Running the shell script for this custom template can be found in `scripts/custom/tigerbot_13b_chat`. ```python from swift.llm import ( @@ -378,6 +428,9 @@ The template initialization function retrieves the complete chat template based - `--train_dataset_sample`: Samples from the complete training dataset, default is `20000`, to speed up training. This parameter is used to avoid the issue of training time being too long for a single epoch when the dataset is large. LoRA convergence is usually fast and does not require a large number of data samples for fine-tuning. If you specify `-1`, the full training dataset will be used for training, which is typically used in the setting of full-parameter fine-tuning. - `--system`: The system used in the dialogue template, default is `'you are a helpful assistant!'`. - `--max_length`: Maximum token length, default is `2048`. This helps to avoid out-of-memory (OOM) issues caused by individual samples that are too long.
If a data sample exceeds the `max_length`, the frontmost tokens will be truncated: `input_ids[-max_length:]`. If set to -1, there is no restriction. +- `--check_dataset_strategy`: The default value is `'none'`, which means no checking will be done. If you are training an LLM model, it is recommended to use `'warning'` as the data checking strategy. If your training objective is sentence classification or Masked LM tasks, it is suggested to set it as `'none'`. +- `--custom_train_dataset_path`: The default value is `None`. Please refer to the `Custom Dataset` module in the README.md for specific meanings. +- `--custom_val_dataset_path`: The default value is `None`. Please refer to the `Custom Dataset` module in the README.md for specific meanings. - `--quantization_bit`: Specifies whether to perform quantization and the number of quantization bits, default is `0`, which means no quantization. Quantization is only supported for the lora fine-tuning method and not for full-parameter fine-tuning. - `--bnb_4bit_comp_dtype`: When performing 4-bit quantization, we need to dequantize it during the model's forward and backward passes. This parameter specifies the torch_dtype after dequantization. Default is `None`, which means it remains consistent with `dtype`. The possible values are: 'fp16', 'bf16', 'fp32'. This parameter is ignored when `quantization_bit` is 0. - `--bnb_4bit_quant_type`: The quantization type for 4-bit quantization, default is `'nf4'`. The possible values are: 'nf4', 'fp4'. This parameter is ignored when `quantization_bit` is 0. @@ -439,6 +492,9 @@ The template initialization function retrieves the complete chat template based - `--show_dataset_sample`: Indicates the number of samples from the validation set to evaluate and display. Default value is `10`. This parameter only takes effect when `eval_human` is set to False. - `--system`: Default value is `'you are a helpful assistant!'`. For specific parameter details, please refer to the `sft.sh Command Line Arguments`. - `--max_length`: Default value is `2048`. For specific parameter details, please refer to the `sft.sh Command Line Arguments`. +- `--check_dataset_strategy`: The default value is `'none'`. For specific parameter details, please refer to the `sft.sh Command Line Arguments`. +- `--custom_train_dataset_path`: The default value is `None`. Please refer to the `Custom Dataset` module in the README.md for specific meanings. +- `--custom_val_dataset_path`: The default value is `None`. Please refer to the `Custom Dataset` module in the README.md for specific meanings. - `--quantization_bit`: Default value is 0. For specific parameter details, please refer to the `sft.sh Command Line Arguments`. - `--bnb_4bit_comp_dtype`: Default value is `None`. For specific parameter details, please refer to the `sft.sh Command Line Arguments`. This parameter is not effective if `quantization_bit` is set to 0. - `--bnb_4bit_quant_type`: Default value is `'nf4'`. For specific parameter details, please refer to the `sft.sh Command Line Arguments`. This parameter is not effective if `quantization_bit` is set to 0. diff --git a/examples/pytorch/llm/README_CN.md b/examples/pytorch/llm/README_CN.md index 1eefa9c55b..fe450ff389 100644 --- a/examples/pytorch/llm/README_CN.md +++ b/examples/pytorch/llm/README_CN.md @@ -3,7 +3,7 @@ @@ -17,7 +17,7 @@ ## 特性 - 支持的SFT方法: [lora](https://arxiv.org/abs/2106.09685), [qlora](https://arxiv.org/abs/2305.14314), 全参数微调 -- 支持的特性: 模型量化, DDP, 模型并行, gradient checkpointing, 梯度累加, 支持推送ModelScope Hub, 自定义数据集, 多模态和Agent SFT, 多轮对话, ... +- 支持的特性: 模型量化, DDP, 模型并行, gradient checkpointing, 支持推送ModelScope Hub, 自定义数据集, 多模态和Agent SFT, 多轮对话, ...
- 支持的模型 - 🔥 qwen 系列: [qwen-7b](https://modelscope.cn/models/qwen/Qwen-7B/summary), [qwen-7b-chat](https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary), [qwen-14b](https://modelscope.cn/models/qwen/Qwen-14B/summary), [qwen-14b-chat](https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary), [qwen-7b-chat-int4](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary), [qwen-14b-chat-int4](https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary), [qwen-7b-chat-int8](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int8/summary), [qwen-14b-chat-int8](https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int8/summary) - 🔥 qwen-vl 系列: [qwen-vl](https://modelscope.cn/models/qwen/Qwen-VL/summary), [qwen-vl-chat](https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary), [qwen-vl-chat-int4](https://modelscope.cn/models/qwen/Qwen-VL-Chat-Int4/summary) @@ -29,6 +29,7 @@ - xverse 系列: [xverse-7b](https://modelscope.cn/models/xverse/XVERSE-7B/summary), [xverse-7b-chat](https://modelscope.cn/models/xverse/XVERSE-7B-Chat/summary), [xverse-13b](https://modelscope.cn/models/xverse/XVERSE-13B/summary), [xverse-13b-chat](https://modelscope.cn/models/xverse/XVERSE-13B-Chat/summary) - mistral 系列: [mistral-7b](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-v0.1/summary), [mistral-7b-chat](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-Instruct-v0.1/summary) - ziya 系列: [ziya2-13b](https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary), [ziya2-13b-chat](https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Chat/summary) + - skywork 系列: [skywork-13b](https://modelscope.cn/models/skywork/Skywork-13B-base/summary), [skywork-13b-chat](https://modelscope.cn/models/skywork/Skywork-13B-chat/summary) - other: [polylm-13b](https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation/summary), [seqgpt-560m](https://modelscope.cn/models/damo/nlp_seqgpt-560m/summary) - 支持的数据集: - NLP: @@ -45,23 +46,24 @@ - 多模态: 🔥[coco-en](https://modelscope.cn/datasets/modelscope/coco_2014_caption/summary) - 自定义数据集 - 支持的对话模板: - - 文本生成: default-generation, chatglm2-generation - - 对话: chatml(qwen), baichuan, chatglm2, chatglm3, llama, openbuddy-llama, default, internlm, xverse + - 文本生成: default-generation, chatglm-generation + - 对话: chatml(qwen), baichuan, chatglm2, chatglm3, llama, openbuddy-llama, default, internlm, xverse, skywork ## 新闻 -- 🔥 2023.10.27: 支持chatglm3系列模型: chatglm3-6b-base, chatglm3-6b, chatglm3-6b-32k. 对应的sh脚本可以查看`scripts/chatglm3_6b_32k`. -- 🔥 2023.10.24: 使用注册机制来新增模型, 数据集和对话模板. 如何自定义模型, 数据集和对话模板可以查看`使用文档`部分, 其对应的py文件可以查看`custom.py`, 其对应的sh脚本可以查看`scripts/custom/tigerbot_13b_chat`. -- 🔥 2023.10.17: 支持int4, int8模型的SFT: qwen-7b-chat-int4, qwen-14b-chat-int4, qwen-vl-chat-int4, baichuan2-7b-chat-int4, baichuan2-13b-chat-int4, qwen-7b-chat-int8, qwen-14b-chat-int8. 对应的sh脚本可以查看`scripts/qwen_7b_chat_int4`, `scripts/qwen_14b_chat_int4`, `scripts/qwen_vl_chat_int4`, `scripts/qwen_7b_chat_int8`, `scripts/qwen_14b_chat_int8`. -- 2023.10.15: 支持ziya2-13b系列模型: ziya2-13b, ziya2-13b-chat. 对应的sh脚本可以查看`scripts/ziya2_13b_chat`. -- 2023.10.12: 支持mistral-7b系列模型: openbuddy-mistral-7b-chat, mistral-7b, mistral-7b-chat. 对应的sh脚本可以查看`scripts/openbuddy_mistral_7b_chat`, `scripts/mistral_7b_chat`. -- 🔥 2023.10.7: 支持DeepSpeed ZeRO-2, 使得lora(不仅仅是qlora)可以在双卡A10上运行DDP. 对应的sh脚本可以查看`scripts/qwen_7b_chat/lora_ddp_ds/sft.sh`. +- 2023.10.30: 支持**skywork-13b**系列模型: skywork-13b, skywork-13b-chat. T对应的sh脚本可以查看`scripts/skywork_13b`. +- 🔥 2023.10.27: 支持**chatglm3**系列模型: chatglm3-6b-base, chatglm3-6b, chatglm3-6b-32k. 
对应的sh脚本可以查看`scripts/chatglm3_6b_32k`. +- 🔥 2023.10.24: 使用**注册机制**来新增模型, **数据集**和对话模板. 如何自定义模型, 数据集和对话模板可以查看`使用文档`部分, 其对应的py文件可以查看`custom.py`, 其对应的sh脚本可以查看`scripts/custom/tigerbot_13b_chat`. +- 🔥 2023.10.17: 支持**int4, int8**模型的SFT: qwen-7b-chat-int4, qwen-14b-chat-int4, qwen-vl-chat-int4, baichuan2-7b-chat-int4, baichuan2-13b-chat-int4, qwen-7b-chat-int8, qwen-14b-chat-int8. 对应的sh脚本可以查看`scripts/qwen_7b_chat_int4`, `scripts/qwen_14b_chat_int4`, `scripts/qwen_vl_chat_int4`, `scripts/qwen_7b_chat_int8`, `scripts/qwen_14b_chat_int8`. +- 2023.10.15: 支持**ziya2-13b**系列模型: ziya2-13b, ziya2-13b-chat. 对应的sh脚本可以查看`scripts/ziya2_13b_chat`. +- 2023.10.12: 支持**mistral-7b**系列模型: openbuddy-mistral-7b-chat, mistral-7b, mistral-7b-chat. 对应的sh脚本可以查看`scripts/openbuddy_mistral_7b_chat`, `scripts/mistral_7b_chat`. +- 🔥 2023.10.7: 支持**DeepSpeed ZeRO-2**, 使得lora(不仅仅是qlora)可以在双卡A10上运行DDP. 对应的sh脚本可以查看`scripts/qwen_7b_chat/lora_ddp_ds/sft.sh`. - 2023.10.4: 支持更多数学, 法律, SQL, 代码领域的数据集: blossom-math-zh, school-math-zh, text2sql-en, sql-create-context-en, lawyer-llama-zh, tigerbot-law-zh, leetcode-python-en. - 🔥 2023.9.25: 支持**qwen-14b**系列模型: qwen-14b, qwen-14b-chat. 对应的sh脚本可以查看`scripts/qwen_14b`, `scripts/qwen_14b_chat`. -- 2023.9.18: 支持internlm-20b系列模型: internlm-20b, internlm-20b-chat. 对应的sh脚本可以查看`scripts/internlm_20b`, `scripts/internlm_20b_chat`. -- 2023.9.12: 支持MP+DDP的方式训练, 加快全参数微调的速度, 对应的sh脚本可以查看`scripts/qwen_7b_chat/full_mp_ddp/sft.sh`. +- 2023.9.18: 支持**internlm-20b**系列模型: internlm-20b, internlm-20b-chat. 对应的sh脚本可以查看`scripts/internlm_20b`, `scripts/internlm_20b_chat`. +- 2023.9.12: 支持**MP+DDP**的方式训练, 加快全参数微调的速度, 对应的sh脚本可以查看`scripts/qwen_7b_chat/full_mp_ddp/sft.sh`. - 2023.9.5: 支持训练只保存模型权重, 而不保存断点续训所需的优化器权重等中间状态, 避免全参数微调保存checkpoint所需时间过长和空间过大的问题. 可以查看`sft.sh`中的命令行参数: `--only_save_model`. -- 2023.9.5: 支持openbuddy-llama2-70b-chat模型. 对应的sh脚本可以查看`scripts/openbuddy_llama2_70b_chat`. +- 2023.9.5: 支持**openbuddy-llama2-70b-chat**模型. 对应的sh脚本可以查看`scripts/openbuddy_llama2_70b_chat`. - 2023.9.3: 支持baichuan2系列模型: baichuan2-7b, baichuan2-7b-chat, baichuan2-13b, baichuan2-13b-chat. 对应的sh脚本可以查看`scripts/baichuan2_7b`, `scripts/baichuan2_7b_chat`, `scripts/baichuan2_13b_chat`. @@ -88,8 +90,8 @@ pip install bitsandbytes -U ``` -## 简单的使用 -以下案例可以用于测试环境. 请确保您已经阅读了`准备实验环境`部分. +## 简单使用 +以下案例可以用于**测试环境**. 请确保您已经阅读了`准备实验环境`部分. ```python # Experimental environment: A10, 3090, A100, ... # 16GB GPU memory @@ -123,23 +125,23 @@ infer_main(infer_args) ## 微调和推理 -性能: full(优) > lora > qlora +性能: full(优) > lora > qlora(auto_gptq) > qlora(bnb) 训练显存: qlora(低,3090) > lora > full(2*A100) **提示**: -- 你可以在训练时设置`--gradient_checkpointing true`来节约显存, 但这会略微降低训练速度. 如果你需要在消费级显卡中训练大模型, 这很有用, 例如: 3090. +- 你可以在训练时设置`--gradient_checkpointing true`来**节约显存**, 但这会略微降低训练速度. 如果你需要在**消费级显卡**中训练大模型, 这很有用, 例如: 3090. - 如果你想要使用量化参数`quantization_bit`, 你需要先安装bnb: `pip install bitsandbytes -U`. -- 如果你想要使用基于auto_gptq的量化, 你需要先安装auto_gptq: `pip install auto_gptq -U`. +- 如果你想要使用基于**auto_gptq**的量化, 你需要先安装auto_gptq: `pip install auto_gptq -U`. 使用auto_gptq的模型包含: `qwen-7b-chat-int4`, `qwen-14b-chat-int4`, `qwen-7b-chat-int8`, `qwen-14b-chat-int8`. - 如果脚本提供了非量化模型和int4/int8模型的多个版本的qlora SFT版本, 推荐使用int4/int8模型版本的脚本. -- 如果你想要使用deepspeed, 你需要`pip install deepspeed -U`. 使用deepspeed可以节约显存, 但可能会略微降低训练速度. -- 如果你使用的是V100等较老的GPU, 你需要设置`--dtype fp16`, 因为其不支持bf16. -- 如果你的机器是A100等高性能显卡, 且使用的是qwen系列模型, 推荐你安装[flash-attn](https://github.com/Dao-AILab/flash-attention), 这将会加快训练和推理的速度以及显存占用(A10, 3090, V100等显卡不支持flash-attn进行训练). 
-- 如果你要进行二次预训练而不是SFT, 你可以参考`DatasetName.tigerbot_law_zh`数据集和其对于的sh文件: `scripts/qwen_7b/qlora_ddp`. + 如果脚本提供了非量化模型和int4/int8模型的多个版本的qlora SFT版本, **推荐使用int4/int8模型版本的脚本**. +- 如果你想要使用deepspeed, 你需要`pip install deepspeed -U`. 使用deepspeed可以**节约显存**, 但可能会略微降低训练速度. +- 如果你使用的是**V100**等较老的GPU, 你需要设置`--dtype fp16`, 因为其不支持bf16. +- 如果你的机器是A100等高性能显卡, 且使用的是qwen系列模型, 推荐你安装[**flash-attn**](https://github.com/Dao-AILab/flash-attention), 这将会加快训练和推理的速度以及显存占用(A10, 3090, V100等显卡不支持flash-attn进行训练). +- 如果你要进行**二次预训练**而不是SFT, 你可以参考`DatasetName.tigerbot_law_zh`数据集和其对于的sh文件: `scripts/qwen_7b/qlora_ddp`. - 如果你想在训练时, 将权重push到ModelScope Hub中, 你需要设置`--push_to_hub true`. -- 如何你想要在推理时, 合并LoRA权重并保存,你需要设置`--merge_lora_and_save true`. 不推荐对量化的模型进行merge, 这会存在精度损失, 即qlora. -- 以下提供了可以直接运行的`qwen_7b_chat`的sh脚本(你只需要在推理时指定`ckpt_dir`即可顺利执行). 更多模型的scripts脚本, 可以查看`scripts`文件夹. 如果你想要自定义sh脚本, 推荐你参考`scripts/qwen_7b_chat`中的脚本进行书写. +- 如何你想要在推理时, 合并LoRA权重并保存,你需要设置`--merge_lora_and_save true`. **不推荐对量化的模型进行merge**, 这会存在精度损失, 即qlora. +- 以下提供了可以直接运行的`qwen_7b_chat`的sh脚本(你只需要在推理时指定`ckpt_dir`即可顺利执行). 更多模型的scripts脚本, 可以查看`scripts`文件夹. 如果你想要**自定义sh脚本**, 推荐你参考`scripts/qwen_7b_chat`中的脚本进行书写. ```bash # 微调(qlora)+推理 qwen-7b-chat-int8, 需要16GB显存. # 推荐的实验环境: V100, A10, 3090 @@ -191,6 +193,7 @@ bash scripts/qwen_7b_chat/full_mp/infer.sh bash scripts/qwen_7b_chat/full_mp_ddp/sft.sh bash scripts/qwen_7b_chat/full_mp_ddp/infer.sh +# 以下基于bnb的qlora脚本已不再推荐使用. 请优先使用基于auto_gptq的qlora脚本. # 微调(qlora)+推理 qwen-7b-chat, 需要13GB显存. # 推荐的实验环境: A10, 3090 bash scripts/qwen_7b_chat/qlora/sft.sh @@ -210,7 +213,7 @@ bash scripts/qwen_7b_chat/qlora_ddp_ds/infer.sh ## 使用文档 ### 自定义模型 -以下是一个自定义模型的案例. 运行该自定义模型的sh可以查看`scripts/custom/tigerbot_13b_chat`. +以下是一个**自定义模型**的案例. 运行该自定义模型的sh可以查看`scripts/custom/tigerbot_13b_chat`. ```python from swift.llm import ( @@ -269,7 +272,55 @@ if __name__ == '__main__': ### 自定义数据集 -以下是一个自定义数据集的案例. 运行该自定义数据集的sh可以查看`scripts/custom/tigerbot_13b_chat`. +我们支持两种**自定义数据集**的方法. +1. **命令行参数**的形式: **更加方便支持本地自定义数据集**. +2. **注册数据集**的方式: 更加灵活, 可以对swift**进一步拓展和开发**, 但需要一定的编程门槛. 方法一在实现上借助了方法二. + +#### 命令行参数的形式 +命令行参数含义介绍: +1. `--custom_train_dataset_path`: 默认值为`None`, 表示不使用自定义数据集. 你可以像如下形式进行指定: `--custom_train_dataset_path alpaca.csv`或者指定多个训练数据集`--custom_train_dataset_path alpaca.csv chatml.jsonl swift.jsonl`, 脚本会进行自动的预处理和拼接. 你也可以通过公开数据集和自定义数据集结合的方式进行训练: `--dataset blossom-math-zh --custom_train_dataset_path custom_math.jsonl`. +2. `--custom_val_dataset_path`: 默认值为`None`, 表示不使用自定义验证数据集. 如果你指定了`custom_train_dataset_path`, 则自定义数据集的验证集将按照命令行参数`dataset_test_ratio`进行切割. 命令行传入的格式可以参考`--custom_train_dataset_path`. + +脚本支持的文件格式包含`csv`和`jsonl`格式. 你需要将传入的文件符合以下数据集格式. csv格式的文件只支持指令微调, 即没有history的情况. jsonl格式的文件支持system, history. 
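+
+为了让文件格式更直观, 在下面的格式说明之前先给出一个最小示意(仅供参考): 使用Python标准库把若干训练样本写成下文格式2的`query`/`response`/`history`形式. 其中的示例文本为占位内容, 沿用上文的`custom_math.jsonl`文件名也仅作演示, 空的`history`列表是否可用属于假设.
+
+```python
+# 最小示意(仅供参考): 将自定义数据集写成下文格式2(query/response/history)的jsonl文件.
+import json
+
+samples = [
+    # 假设: 单轮样本可以使用空的history列表.
+    {'query': '1 + 1等于几?', 'response': '1 + 1 = 2', 'history': []},
+    {'query': '再加3呢?', 'response': '结果是5',
+     'history': [['1 + 1等于几?', '1 + 1 = 2']]},
+]
+
+with open('custom_math.jsonl', 'w', encoding='utf-8') as f:
+    for sample in samples:
+        f.write(json.dumps(sample, ensure_ascii=False) + '\n')
+
+# 随后即可用于训练, 例如:
+#   python llm_sft.py ... --custom_train_dataset_path custom_math.jsonl
+```
+
+该文件也可以与公开数据集组合训练, 即上文的`--dataset blossom-math-zh --custom_train_dataset_path custom_math.jsonl`.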
+ +格式1: +```csv +instruction,input,output +11111,22222,33333 +aaaaa,bbbbb,ccccc +AAAAA,BBBBB,CCCCC +``` + +```jsonl +{"instruction": "11111", "input": "aaaaa", "output": "AAAAA"} +{"instruction": "22222", "input": "bbbbb", "output": "BBBBB"} +{"instruction": "33333", "input": "ccccc", "output": "CCCCC"} +``` + +格式2: +```jsonl +{"query": "55555", "response": "66666", "history": [["11111", "22222"], ["33333", "44444"]]} +{"query": "eeeee", "response": "fffff", "history": [["aaaaa", "bbbbb"], ["ccccc", "ddddd"]]} +{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]} +``` + +格式3: +```jsonl +{"conversations": [{"from": "user", "value": "11111"}, {"from": "assistant", "value": "22222"}, {"from": "user", "value": "33333"}, {"from": "assistant", "value": "44444"}]} +{"conversations": [{"from": "user", "value": "aaaaa"}, {"from": "assistant", "value": "bbbbb"}, {"from": "user", "value": "ccccc"}, {"from": "assistant", "value": "ddddd"}]} +{"conversations": [{"from": "user", "value": "AAAAA"}, {"from": "assistant", "value": "BBBBB"}, {"from": "user", "value": "CCCCC"}, {"from": "assistant", "value": "DDDDD"}]} +``` + +格式4: +```jsonl +{"messages": [{"role": "user", "content": "11111"}, {"role": "assistant", "content": "22222"}, {"role": "user", "content": "33333"}, {"role": "assistant", "content": "44444"}]} +{"messages": [{"role": "user", "content": "aaaaa"}, {"role": "assistant", "content": "bbbbb"}, {"role": "user", "content": "ccccc"}, {"role": "assistant", "content": "ddddd"}]} +{"messages": [{"role": "user", "content": "AAAAA"}, {"role": "assistant", "content": "BBBBB"}, {"role": "user", "content": "CCCCC"}, {"role": "assistant", "content": "DDDDD"}]} +``` + + +#### 注册数据集的方式 +以下是一个**注册数据集**的案例. 运行该自定义数据集的sh可以查看`scripts/custom/tigerbot_13b_chat`. ```python import ast @@ -303,7 +354,7 @@ register_dataset( if __name__ == '__main__': train_dataset, _ = get_dataset([CustomDatasetName.agent_instruct_all_en], - 0.) + 0., check_dataset_strategy='warning') print(train_dataset) print(train_dataset[0].keys()) ``` @@ -321,7 +372,7 @@ if __name__ == '__main__': - `**kwargs`: 其他用于注释数据集的参数. 该参数一般不需要设置. ### 自定义对话模板 -以下是一个自定义对话模板的案例. 运行该自定义对话模板的sh可以查看`scripts/custom/tigerbot_13b_chat`. +以下是一个**自定义对话模板**的案例. 运行该自定义对话模板的sh可以查看`scripts/custom/tigerbot_13b_chat`. ```python from swift.llm import ( @@ -377,6 +428,9 @@ if __name__ == '__main__': - `--train_dataset_sample`: 对完整训练集进行采样, 默认是`20000`, 用于加快训练的速度. 该参数是为了避免数据集过大, 单个epoch训练时间过长的问题. LoRA的收敛通常较快, 不需要过多数据样本的微调. 如果你指定为`-1`, 则使用完整的训练集进行训练, 该情况一般出现在全参数微调的设置下. - `--system`: 对话模板中使用的system, 默认为`'you are a helpful assistant!'`. - `--max_length`: token的最大长度, 默认为`2048`. 可以避免个别过长的数据样本造成OOM的问题. 如果某数据样本长度超过max_length, 我们会切除最前面的token: `input_ids[-max_length:]`. 如果设置为-1, 则无限制. +- `--check_dataset_strategy`: 默认值为`'none'`, 即不做检查. 如果你训练的模型是LLM, 则推荐使用`'warning'`作为数据检查的策略. 如果你的训练目标为句子分类等任务, 则建议设置为'`none`'. +- `--custom_train_dataset_path`: 默认值为`None`. 具体的含义参考README.md中的`自定义数据集`模块. +- `--custom_val_dataset_path`: 默认值为`None`. 具体的含义参考README.md中的`自定义数据集`模块. - `--quantization_bit`: 用于指定是否进行量化和量化的bit数, 默认为`0`, 即不进行量化. 量化情况下, 只支持lora的微调方式, 不支持全参数的微调方式. - `--bnb_4bit_comp_dtype`: 在进行4bit量化时, 我们需要在模型的forward和backward时, 将其进行反量化. 该参数用于指定反量化后的torch_dtype. 默认为`None`, 即与`dtype`保持一致. 可选择的值包括: 'fp16', 'bf16', 'fp32'. 当quantization_bit为0时, 该参数无效. - `--bnb_4bit_quant_type`: 4bit量化时的量化方式, 默认是`'nf4'`. 可选择的值包括: 'nf4', 'fp4'. 当quantization_bit为0时, 该参数无效. @@ -438,6 +492,9 @@ if __name__ == '__main__': - `--show_dataset_sample`: 表示想要评估和展示的验证集的数量, 默认值为`10`. 
该参数只有在`eval_human`设置为False时才生效. - `--system`: 默认值为`'you are a helpful assistant!'`. 具体的参数介绍可以在`sft.sh命令行参数`中查看. - `--max_length`: 默认值为`2048`. 具体的参数介绍可以在`sft.sh命令行参数`中查看. +- `--check_dataset_strategy`: 默认值为`'none'`, 具体的参数介绍可以在`sft.sh命令行参数`中查看. +- `--custom_train_dataset_path`: 默认值为`None`. 具体的含义参考README.md中的`自定义数据集`模块. +- `--custom_val_dataset_path`: 默认值为`None`. 具体的含义参考README.md中的`自定义数据集`模块. - `--quantization_bit`: 默认值为0. 具体的参数介绍可以在`sft.sh命令行参数`中查看. - `--bnb_4bit_comp_dtype`: 默认值为`None`. 具体的参数介绍可以在`sft.sh命令行参数`中查看. 若`quantization_bit`设置为0, 则该参数失效. - `--bnb_4bit_quant_type`: 默认值为`'nf4'`. 具体的参数介绍可以在`sft.sh命令行参数`中查看. 若`quantization_bit`设置为0, 则该参数失效. diff --git a/examples/pytorch/llm/custom.py b/examples/pytorch/llm/custom.py index 71b9b6c925..474299f32a 100644 --- a/examples/pytorch/llm/custom.py +++ b/examples/pytorch/llm/custom.py @@ -76,7 +76,8 @@ def repair_conversations_agent_instruct(s: str) -> str: # The Shell script can view `scripts/custom/tigerbot_13b_chat`. # test train_dataset, _ = get_dataset([CustomDatasetName.agent_instruct_all_en], - 0.) + 0., + check_dataset_strategy='warning') model, tokenizer = get_model_tokenizer( CustomModelType.tigerbot_13b_chat, use_flash_attn=True) template = get_template(CustomTemplateType.tigerbot, tokenizer) diff --git a/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_ddp_ds/infer.sh index 380433a108..c1f7abcd57 100644 --- a/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_ddp_ds/infer.sh @@ -12,9 +12,11 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_ddp_ds/sft.sh index b34ad7ed67..06fca28e34 100644 --- a/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_mp_ddp/infer.sh b/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_mp_ddp/infer.sh index 4a5482cdd3..f9fbb9c5a2 100644 --- a/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_mp_ddp/infer.sh +++ b/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_mp_ddp/infer.sh @@ -12,9 +12,11 @@ python llm_infer.py \ --eval_human false \ --dataset blossom-math-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_mp_ddp/sft.sh b/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_mp_ddp/sft.sh index b826eba99a..ac9e7d8735 100644 --- a/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_mp_ddp/sft.sh +++ b/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_mp_ddp/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules W_pack \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/baichuan2_13b_chat/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/baichuan2_13b_chat/qlora_ddp_ds/infer.sh index 3191db97a2..ba6ba7a9d4 100644 --- a/examples/pytorch/llm/scripts/baichuan2_13b_chat/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/baichuan2_13b_chat/qlora_ddp_ds/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/baichuan2_13b_chat/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/baichuan2_13b_chat/qlora_ddp_ds/sft.sh index fc2f70f91f..f880a5f529 100644 --- a/examples/pytorch/llm/scripts/baichuan2_13b_chat/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/baichuan2_13b_chat/qlora_ddp_ds/sft.sh @@ -20,6 +20,7 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -28,7 +29,7 @@ torchrun \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/baichuan2_13b_chat_int4/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/baichuan2_13b_chat_int4/qlora_ddp_ds/infer.sh index 9b3e4870fe..6817c5cd49 100644 --- a/examples/pytorch/llm/scripts/baichuan2_13b_chat_int4/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/baichuan2_13b_chat_int4/qlora_ddp_ds/infer.sh @@ -11,9 +11,11 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/baichuan2_13b_chat_int4/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/baichuan2_13b_chat_int4/qlora_ddp_ds/sft.sh index ffbc8923bc..d5aaa664b5 100644 --- a/examples/pytorch/llm/scripts/baichuan2_13b_chat_int4/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/baichuan2_13b_chat_int4/qlora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/baichuan2_7b/qlora/infer.sh b/examples/pytorch/llm/scripts/baichuan2_7b/qlora/infer.sh index 3a8366431f..0fdca91a5d 100644 --- a/examples/pytorch/llm/scripts/baichuan2_7b/qlora/infer.sh +++ b/examples/pytorch/llm/scripts/baichuan2_7b/qlora/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset advertise-gen-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/baichuan2_7b/qlora/sft.sh b/examples/pytorch/llm/scripts/baichuan2_7b/qlora/sft.sh index be8bdacd3a..b858593680 100644 --- a/examples/pytorch/llm/scripts/baichuan2_7b/qlora/sft.sh +++ b/examples/pytorch/llm/scripts/baichuan2_7b/qlora/sft.sh @@ -14,6 +14,7 @@ python llm_sft.py \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -22,7 +23,7 @@ python llm_sft.py \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp/infer.sh b/examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp/infer.sh index ac514a5ebe..1354acb249 100644 --- a/examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp/infer.sh +++ b/examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp/infer.sh @@ -11,9 +11,11 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp/sft.sh b/examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp/sft.sh index 6f760ea936..9bb4297a5b 100644 --- a/examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp/sft.sh +++ b/examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing false \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp_ds/infer.sh index 939e040104..ed1c617846 100644 --- a/examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp_ds/infer.sh @@ -11,9 +11,11 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp_ds/sft.sh index 2ecd4db9e3..976f5c1fa9 100644 --- a/examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/baichuan2_7b_chat/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/baichuan2_7b_chat/qlora_ddp_ds/infer.sh index 7431a47ff9..f02accca2c 100644 --- a/examples/pytorch/llm/scripts/baichuan2_7b_chat/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/baichuan2_7b_chat/qlora_ddp_ds/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/baichuan2_7b_chat/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/baichuan2_7b_chat/qlora_ddp_ds/sft.sh index 236e32927b..7f508771f3 100644 --- a/examples/pytorch/llm/scripts/baichuan2_7b_chat/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/baichuan2_7b_chat/qlora_ddp_ds/sft.sh @@ -20,6 +20,7 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -28,7 +29,7 @@ torchrun \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/baichuan2_7b_chat_int4/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/baichuan2_7b_chat_int4/qlora_ddp_ds/infer.sh index d8e998580e..683bff786b 100644 --- a/examples/pytorch/llm/scripts/baichuan2_7b_chat_int4/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/baichuan2_7b_chat_int4/qlora_ddp_ds/infer.sh @@ -11,9 +11,11 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/baichuan2_7b_chat_int4/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/baichuan2_7b_chat_int4/qlora_ddp_ds/sft.sh index 25cb175b93..6ead6b7770 100644 --- a/examples/pytorch/llm/scripts/baichuan2_7b_chat_int4/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/baichuan2_7b_chat_int4/qlora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/baichuan_13b_chat/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/baichuan_13b_chat/qlora_ddp_ds/infer.sh index a485a9cc54..b3c266ffd7 100644 --- a/examples/pytorch/llm/scripts/baichuan_13b_chat/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/baichuan_13b_chat/qlora_ddp_ds/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset blossom-math-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/baichuan_13b_chat/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/baichuan_13b_chat/qlora_ddp_ds/sft.sh index 714efadc80..15e260cb4a 100644 --- a/examples/pytorch/llm/scripts/baichuan_13b_chat/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/baichuan_13b_chat/qlora_ddp_ds/sft.sh @@ -20,6 +20,7 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -28,7 +29,7 @@ torchrun \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp/infer.sh b/examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp/infer.sh index 01178c70c3..11dca858f7 100644 --- a/examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp/infer.sh +++ b/examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp/infer.sh @@ -11,9 +11,11 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp/sft.sh b/examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp/sft.sh index 6438fea128..64f10e4f8f 100644 --- a/examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp/sft.sh +++ b/examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing false \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp_ds/infer.sh index 48e48c7e8f..f15a4653ef 100644 --- a/examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp_ds/infer.sh @@ -11,9 +11,11 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp_ds/sft.sh index d95f98cb3f..7c833e0502 100644 --- a/examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/chatglm3_6b_32k/lora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/chatglm3_6b_32k/lora_ddp_ds/infer.sh index bb18292ca4..ddf0eea488 100644 --- a/examples/pytorch/llm/scripts/chatglm3_6b_32k/lora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/chatglm3_6b_32k/lora_ddp_ds/infer.sh @@ -11,9 +11,11 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/chatglm3_6b_32k/lora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/chatglm3_6b_32k/lora_ddp_ds/sft.sh index 85c301a4d2..ad6f5df02b 100644 --- a/examples/pytorch/llm/scripts/chatglm3_6b_32k/lora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/chatglm3_6b_32k/lora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules AUTO \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/custom/tigerbot_13b_chat/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/custom/tigerbot_13b_chat/qlora_ddp_ds/infer.sh index 4176561631..059e0f6529 100644 --- a/examples/pytorch/llm/scripts/custom/tigerbot_13b_chat/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/custom/tigerbot_13b_chat/qlora_ddp_ds/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset agent-instruct-all-en \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/custom/tigerbot_13b_chat/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/custom/tigerbot_13b_chat/qlora_ddp_ds/sft.sh index c541515126..f88c37924c 100644 --- a/examples/pytorch/llm/scripts/custom/tigerbot_13b_chat/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/custom/tigerbot_13b_chat/qlora_ddp_ds/sft.sh @@ -20,6 +20,7 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -28,7 +29,7 @@ torchrun \ --lora_target_modules AUTO \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/internlm_20b/lora_ddp/infer.sh b/examples/pytorch/llm/scripts/internlm_20b/lora_ddp/infer.sh index d14dbfa4e3..f41f004740 100644 --- a/examples/pytorch/llm/scripts/internlm_20b/lora_ddp/infer.sh +++ b/examples/pytorch/llm/scripts/internlm_20b/lora_ddp/infer.sh @@ -11,9 +11,11 @@ python llm_infer.py \ --eval_human false \ --dataset jd-sentiment-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/internlm_20b/lora_ddp/sft.sh b/examples/pytorch/llm/scripts/internlm_20b/lora_ddp/sft.sh index 2592e58d73..daa2189553 100644 --- a/examples/pytorch/llm/scripts/internlm_20b/lora_ddp/sft.sh +++ b/examples/pytorch/llm/scripts/internlm_20b/lora_ddp/sft.sh @@ -19,13 +19,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules q_proj k_proj v_proj \ --gradient_checkpointing false \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/internlm_20b/qlora/infer.sh b/examples/pytorch/llm/scripts/internlm_20b/qlora/infer.sh index 36170e4ebf..ac73a5737e 100644 --- a/examples/pytorch/llm/scripts/internlm_20b/qlora/infer.sh +++ b/examples/pytorch/llm/scripts/internlm_20b/qlora/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset advertise-gen-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/internlm_20b/qlora/sft.sh b/examples/pytorch/llm/scripts/internlm_20b/qlora/sft.sh index 9eef462e5d..08272133c3 100644 --- a/examples/pytorch/llm/scripts/internlm_20b/qlora/sft.sh +++ b/examples/pytorch/llm/scripts/internlm_20b/qlora/sft.sh @@ -14,6 +14,7 @@ python llm_sft.py \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -22,7 +23,7 @@ python llm_sft.py \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/internlm_20b_chat/lora_ddp/infer.sh b/examples/pytorch/llm/scripts/internlm_20b_chat/lora_ddp/infer.sh index 3a730746c7..617a1d5827 100644 --- a/examples/pytorch/llm/scripts/internlm_20b_chat/lora_ddp/infer.sh +++ b/examples/pytorch/llm/scripts/internlm_20b_chat/lora_ddp/infer.sh @@ -12,9 +12,11 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/internlm_20b_chat/lora_ddp/sft.sh b/examples/pytorch/llm/scripts/internlm_20b_chat/lora_ddp/sft.sh index 760106e198..02c150f55d 100644 --- a/examples/pytorch/llm/scripts/internlm_20b_chat/lora_ddp/sft.sh +++ b/examples/pytorch/llm/scripts/internlm_20b_chat/lora_ddp/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/internlm_20b_chat/qlora/infer.sh b/examples/pytorch/llm/scripts/internlm_20b_chat/qlora/infer.sh index 27d359cdaf..d895c7a516 100644 --- a/examples/pytorch/llm/scripts/internlm_20b_chat/qlora/infer.sh +++ b/examples/pytorch/llm/scripts/internlm_20b_chat/qlora/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/internlm_20b_chat/qlora/sft.sh b/examples/pytorch/llm/scripts/internlm_20b_chat/qlora/sft.sh index 693dca4e6c..14bff68921 100644 --- a/examples/pytorch/llm/scripts/internlm_20b_chat/qlora/sft.sh +++ b/examples/pytorch/llm/scripts/internlm_20b_chat/qlora/sft.sh @@ -14,6 +14,7 @@ python llm_sft.py \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -22,7 +23,7 @@ python llm_sft.py \ --lora_target_modules q_proj v_proj \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/internlm_20b_chat/qlora_ddp/infer.sh b/examples/pytorch/llm/scripts/internlm_20b_chat/qlora_ddp/infer.sh index 27d359cdaf..d895c7a516 100644 --- a/examples/pytorch/llm/scripts/internlm_20b_chat/qlora_ddp/infer.sh +++ b/examples/pytorch/llm/scripts/internlm_20b_chat/qlora_ddp/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/internlm_20b_chat/qlora_ddp/sft.sh b/examples/pytorch/llm/scripts/internlm_20b_chat/qlora_ddp/sft.sh index 6ddac11f80..44aa389bbe 100644 --- a/examples/pytorch/llm/scripts/internlm_20b_chat/qlora_ddp/sft.sh +++ b/examples/pytorch/llm/scripts/internlm_20b_chat/qlora_ddp/sft.sh @@ -19,6 +19,7 @@ torchrun \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -27,7 +28,7 @@ torchrun \ --lora_target_modules q_proj v_proj \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/llama2_13b_chat/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/llama2_13b_chat/qlora_ddp_ds/infer.sh index 9e6c51c054..c7337a1a3c 100644 --- a/examples/pytorch/llm/scripts/llama2_13b_chat/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/llama2_13b_chat/qlora_ddp_ds/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset leetcode-python-en \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/llama2_13b_chat/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/llama2_13b_chat/qlora_ddp_ds/sft.sh index d80ee01782..5599ae8b4e 100644 --- a/examples/pytorch/llm/scripts/llama2_13b_chat/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/llama2_13b_chat/qlora_ddp_ds/sft.sh @@ -20,6 +20,7 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -28,7 +29,7 @@ torchrun \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_ddp_ds/infer.sh index c09c89d505..4d8461e624 100644 --- a/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_ddp_ds/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset leetcode-python-en \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_ddp_ds/sft.sh index ff0390b1fe..b9a42a5e24 100644 --- a/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_ddp_ds/sft.sh @@ -20,6 +20,7 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -28,7 +29,7 @@ torchrun \ --lora_target_modules q_proj v_proj \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_mp/infer.sh b/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_mp/infer.sh index 0988b94564..f170bb84df 100644 --- a/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_mp/infer.sh +++ b/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_mp/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset sql-create-context-en \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_mp/sft.sh b/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_mp/sft.sh index dc352adabb..cfa4438048 100644 --- a/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_mp/sft.sh +++ b/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_mp/sft.sh @@ -15,6 +15,7 @@ python llm_sft.py \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -23,7 +24,7 @@ python llm_sft.py \ --lora_target_modules q_proj v_proj \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/mistral_7b_chat/lora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/mistral_7b_chat/lora_ddp_ds/infer.sh index 00a6c78400..84729446d8 100644 --- a/examples/pytorch/llm/scripts/mistral_7b_chat/lora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/mistral_7b_chat/lora_ddp_ds/infer.sh @@ -12,9 +12,11 @@ python llm_infer.py \ --eval_human false \ --dataset leetcode-python-en \ --max_length 4096 \ + --check_dataset_strategy warning \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/mistral_7b_chat/lora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/mistral_7b_chat/lora_ddp_ds/sft.sh index 0fd5fbf313..231330c8e2 100644 --- a/examples/pytorch/llm/scripts/mistral_7b_chat/lora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/mistral_7b_chat/lora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/openbuddy_llama2_13b_chat/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/openbuddy_llama2_13b_chat/qlora_ddp_ds/infer.sh index 20c6d98134..4efa4e3362 100644 --- a/examples/pytorch/llm/scripts/openbuddy_llama2_13b_chat/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/openbuddy_llama2_13b_chat/qlora_ddp_ds/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset blossom-math-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/openbuddy_llama2_13b_chat/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/openbuddy_llama2_13b_chat/qlora_ddp_ds/sft.sh index 4769651237..f41c8d2007 100644 --- a/examples/pytorch/llm/scripts/openbuddy_llama2_13b_chat/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/openbuddy_llama2_13b_chat/qlora_ddp_ds/sft.sh @@ -20,6 +20,7 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -28,7 +29,7 @@ torchrun \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/openbuddy_llama2_70b_chat/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/openbuddy_llama2_70b_chat/qlora_ddp_ds/infer.sh index ade974f815..09d3a545ce 100644 --- a/examples/pytorch/llm/scripts/openbuddy_llama2_70b_chat/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/openbuddy_llama2_70b_chat/qlora_ddp_ds/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset blossom-math-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/openbuddy_llama2_70b_chat/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/openbuddy_llama2_70b_chat/qlora_ddp_ds/sft.sh index 6dbf350f5a..6327677e67 100644 --- a/examples/pytorch/llm/scripts/openbuddy_llama2_70b_chat/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/openbuddy_llama2_70b_chat/qlora_ddp_ds/sft.sh @@ -20,6 +20,7 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -28,7 +29,7 @@ torchrun \ --lora_target_modules q_proj v_proj \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/openbuddy_llama2_70b_chat/qlora_mp/infer.sh b/examples/pytorch/llm/scripts/openbuddy_llama2_70b_chat/qlora_mp/infer.sh index 222d07ee44..ae31bc4923 100644 --- a/examples/pytorch/llm/scripts/openbuddy_llama2_70b_chat/qlora_mp/infer.sh +++ b/examples/pytorch/llm/scripts/openbuddy_llama2_70b_chat/qlora_mp/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset blossom-math-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/openbuddy_llama2_70b_chat/qlora_mp/sft.sh b/examples/pytorch/llm/scripts/openbuddy_llama2_70b_chat/qlora_mp/sft.sh index de79b30e9a..a25251b0a6 100644 --- a/examples/pytorch/llm/scripts/openbuddy_llama2_70b_chat/qlora_mp/sft.sh +++ b/examples/pytorch/llm/scripts/openbuddy_llama2_70b_chat/qlora_mp/sft.sh @@ -14,6 +14,7 @@ python llm_sft.py \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -22,7 +23,7 @@ python llm_sft.py \ --lora_target_modules q_proj v_proj \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/openbuddy_mistral_7b_chat/lora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/openbuddy_mistral_7b_chat/lora_ddp_ds/infer.sh index 1317ce4147..0472900125 100644 --- a/examples/pytorch/llm/scripts/openbuddy_mistral_7b_chat/lora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/openbuddy_mistral_7b_chat/lora_ddp_ds/infer.sh @@ -12,9 +12,11 @@ python llm_infer.py \ --eval_human false \ --dataset blossom-math-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/openbuddy_mistral_7b_chat/lora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/openbuddy_mistral_7b_chat/lora_ddp_ds/sft.sh index 82aaa0ddf7..cc5871975a 100644 --- a/examples/pytorch/llm/scripts/openbuddy_mistral_7b_chat/lora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/openbuddy_mistral_7b_chat/lora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/polylm_13b/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/polylm_13b/qlora_ddp_ds/infer.sh index 200c7be100..b866064835 100644 --- a/examples/pytorch/llm/scripts/polylm_13b/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/polylm_13b/qlora_ddp_ds/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset advertise-gen-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/polylm_13b/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/polylm_13b/qlora_ddp_ds/sft.sh index 1e9330e663..181a055ed6 100644 --- a/examples/pytorch/llm/scripts/polylm_13b/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/polylm_13b/qlora_ddp_ds/sft.sh @@ -21,6 +21,7 @@ torchrun \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -29,7 +30,7 @@ torchrun \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_14b/lora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/qwen_14b/lora_ddp_ds/infer.sh index 27df43c917..bb86f50819 100644 --- a/examples/pytorch/llm/scripts/qwen_14b/lora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_14b/lora_ddp_ds/infer.sh @@ -12,10 +12,12 @@ python llm_infer.py \ --eval_human false \ --dataset dureader-robust-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --use_flash_attn true \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_14b/lora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/qwen_14b/lora_ddp_ds/sft.sh index c713836cc7..66f8c56b0e 100644 --- a/examples/pytorch/llm/scripts/qwen_14b/lora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_14b/lora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_14b/qlora/infer.sh b/examples/pytorch/llm/scripts/qwen_14b/qlora/infer.sh index 26b014595b..65c3abf551 100644 --- a/examples/pytorch/llm/scripts/qwen_14b/qlora/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_14b/qlora/infer.sh @@ -11,6 +11,7 @@ python llm_infer.py \ --eval_human false \ --dataset dureader-robust-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --use_flash_attn false \ @@ -18,5 +19,6 @@ python llm_infer.py \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_14b/qlora/sft.sh b/examples/pytorch/llm/scripts/qwen_14b/qlora/sft.sh index 531a6569d4..d0db88f940 100644 --- a/examples/pytorch/llm/scripts/qwen_14b/qlora/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_14b/qlora/sft.sh @@ -14,6 +14,7 @@ python llm_sft.py \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -22,7 +23,7 @@ python llm_sft.py \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_14b/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/qwen_14b/qlora_ddp_ds/infer.sh index 26b014595b..65c3abf551 100644 --- a/examples/pytorch/llm/scripts/qwen_14b/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_14b/qlora_ddp_ds/infer.sh @@ -11,6 +11,7 @@ python llm_infer.py \ --eval_human false \ --dataset dureader-robust-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --use_flash_attn false \ @@ -18,5 +19,6 @@ python llm_infer.py \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_14b/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/qwen_14b/qlora_ddp_ds/sft.sh index 374d5c778e..058edd7e0d 100644 --- a/examples/pytorch/llm/scripts/qwen_14b/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_14b/qlora_ddp_ds/sft.sh @@ -20,6 +20,7 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -28,7 +29,7 @@ torchrun \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_14b_chat/lora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/qwen_14b_chat/lora_ddp_ds/infer.sh index 04c8ba0137..fd4458459a 100644 --- a/examples/pytorch/llm/scripts/qwen_14b_chat/lora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_14b_chat/lora_ddp_ds/infer.sh @@ -12,10 +12,12 @@ python llm_infer.py \ --eval_human false \ --dataset blossom-math-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --use_flash_attn true \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_14b_chat/lora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/qwen_14b_chat/lora_ddp_ds/sft.sh index 22f6dcc48d..86dcaa4aa8 100644 --- a/examples/pytorch/llm/scripts/qwen_14b_chat/lora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_14b_chat/lora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_14b_chat/qlora/infer.sh b/examples/pytorch/llm/scripts/qwen_14b_chat/qlora/infer.sh index db968457e0..22ce1738c7 100644 --- a/examples/pytorch/llm/scripts/qwen_14b_chat/qlora/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_14b_chat/qlora/infer.sh @@ -11,6 +11,7 @@ python llm_infer.py \ --eval_human false \ --dataset blossom-math-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --use_flash_attn false \ @@ -18,5 +19,6 @@ python llm_infer.py \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_14b_chat/qlora/sft.sh b/examples/pytorch/llm/scripts/qwen_14b_chat/qlora/sft.sh index 8aedd646c7..8f366f392b 100644 --- a/examples/pytorch/llm/scripts/qwen_14b_chat/qlora/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_14b_chat/qlora/sft.sh @@ -14,6 +14,7 @@ python llm_sft.py \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -22,7 +23,7 @@ python llm_sft.py \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_14b_chat/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/qwen_14b_chat/qlora_ddp_ds/infer.sh index f93be0d932..9fdaea4a80 100644 --- a/examples/pytorch/llm/scripts/qwen_14b_chat/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_14b_chat/qlora_ddp_ds/infer.sh @@ -11,6 +11,7 @@ python llm_infer.py \ --eval_human false \ --dataset blossom-math-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --use_flash_attn false \ @@ -18,5 +19,6 @@ python llm_infer.py \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_14b_chat/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/qwen_14b_chat/qlora_ddp_ds/sft.sh index 9cb8f009c5..3b29cb451d 100644 --- a/examples/pytorch/llm/scripts/qwen_14b_chat/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_14b_chat/qlora_ddp_ds/sft.sh @@ -20,6 +20,7 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -28,7 +29,7 @@ torchrun \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora/infer.sh b/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora/infer.sh index bcf0ef526e..f38c5cdd52 100644 --- a/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora/infer.sh @@ -11,10 +11,12 @@ python llm_infer.py \ --eval_human false \ --dataset leetcode-python-en \ --max_length 4096 \ + --check_dataset_strategy warning \ --use_flash_attn false \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora/sft.sh b/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora/sft.sh index 9ce0194701..fa3649142b 100644 --- a/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora/sft.sh @@ -14,13 +14,14 @@ python llm_sft.py \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora_ddp_ds/infer.sh index 2d70b8d63a..f887c97590 100644 --- a/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora_ddp_ds/infer.sh @@ -11,10 +11,12 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --use_flash_attn false \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora_ddp_ds/sft.sh index c1a6271a0b..75a54b874a 100644 --- a/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora/infer.sh b/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora/infer.sh index fa6e491f92..aa1dcc96c1 100644 --- a/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora/infer.sh @@ -11,10 +11,12 @@ python llm_infer.py \ --eval_human false \ --dataset blossom-math-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --use_flash_attn false \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora/sft.sh b/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora/sft.sh index 2b9fbf6bbe..3d19f9e971 100644 --- a/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora/sft.sh @@ -14,13 +14,14 @@ python llm_sft.py \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora_ddp_ds/infer.sh index 61abcb0ce0..32441d1219 100644 --- a/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora_ddp_ds/infer.sh @@ -11,10 +11,12 @@ python llm_infer.py \ --eval_human false \ --dataset lawyer-llama-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --use_flash_attn false \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora_ddp_ds/sft.sh index 39caf3605c..a431f64d8b 100644 --- a/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_7b/lora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/qwen_7b/lora_ddp_ds/infer.sh index bece6e86ce..e8d76af1b7 100644 --- a/examples/pytorch/llm/scripts/qwen_7b/lora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_7b/lora_ddp_ds/infer.sh @@ -12,10 +12,12 @@ python llm_infer.py \ --eval_human false \ --dataset dureader-robust-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --use_flash_attn false \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_7b/lora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/qwen_7b/lora_ddp_ds/sft.sh index fee282527f..8a4dbe5568 100644 --- a/examples/pytorch/llm/scripts/qwen_7b/lora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_7b/lora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules c_attn c_proj \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_7b/qlora_ddp/infer.sh b/examples/pytorch/llm/scripts/qwen_7b/qlora_ddp/infer.sh index 0803de2885..14dba2e106 100644 --- a/examples/pytorch/llm/scripts/qwen_7b/qlora_ddp/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_7b/qlora_ddp/infer.sh @@ -10,6 +10,7 @@ python llm_infer.py \ --ckpt_dir "output/qwen-7b/vx_xxx/checkpoint-xxx" \ --eval_human true \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --use_flash_attn false \ @@ -17,5 +18,6 @@ python llm_infer.py \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_7b/qlora_ddp/sft.sh b/examples/pytorch/llm/scripts/qwen_7b/qlora_ddp/sft.sh index 5256b70520..cd42d98061 100644 --- a/examples/pytorch/llm/scripts/qwen_7b/qlora_ddp/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_7b/qlora_ddp/sft.sh @@ -20,6 +20,7 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -28,7 +29,7 @@ torchrun \ --lora_target_modules AUTO \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp/infer.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp/infer.sh index 7e8957a8ef..a46d131bb1 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp/infer.sh @@ -11,9 +11,11 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-zh \ --max_length 6144 \ + --check_dataset_strategy warning \ --use_flash_attn true \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp/sft.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp/sft.sh index b138c40442..48b6fb5793 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp/sft.sh @@ -15,6 +15,7 @@ python llm_sft.py \ --train_dataset_sample 200000 \ --num_train_epochs 1 \ --max_length 8192 \ + --check_dataset_strategy warning \ --gradient_checkpointing false \ --batch_size 1 \ --weight_decay 0.01 \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp_ddp/infer.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp_ddp/infer.sh index 69b85a9e93..871c1a50d7 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp_ddp/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp_ddp/infer.sh @@ -11,9 +11,11 @@ python llm_infer.py \ --eval_human false \ --dataset medical-en medical-zh \ --max_length 6144 \ + --check_dataset_strategy warning \ --use_flash_attn true \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp_ddp/sft.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp_ddp/sft.sh index e5a2e77981..480d3294a1 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp_ddp/sft.sh +++ 
b/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp_ddp/sft.sh @@ -20,6 +20,7 @@ torchrun \ --train_dataset_sample 200000 \ --num_train_epochs 1 \ --max_length 8192 \ + --check_dataset_strategy warning \ --gradient_checkpointing false \ --batch_size 1 \ --weight_decay 0.01 \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/lora/infer.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/lora/infer.sh index 0ca68c9f0f..c750a9002b 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/lora/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/lora/infer.sh @@ -12,10 +12,12 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --use_flash_attn true \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/lora/sft.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/lora/sft.sh index 2f546a750a..79c8a3c5e5 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/lora/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/lora/sft.sh @@ -16,13 +16,14 @@ python llm_sft.py \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing false \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp/infer.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp/infer.sh index 0ca68c9f0f..c750a9002b 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp/infer.sh @@ -12,10 +12,12 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --use_flash_attn true \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp/sft.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp/sft.sh index 3fc7b313d7..9e05ecd98e 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp/sft.sh @@ -22,13 +22,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing false \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp_ds/infer.sh index 8fe65220dd..662907e108 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp_ds/infer.sh @@ -12,10 +12,12 @@ python llm_infer.py \ --eval_human false \ --dataset advertise-gen-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --use_flash_attn false \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp_ds/sft.sh index a37e833980..191f72eb65 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/lora_mp_ddp/infer.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/lora_mp_ddp/infer.sh index 95adbaad09..b0c3061c21 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/lora_mp_ddp/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/lora_mp_ddp/infer.sh @@ -12,10 +12,12 @@ python llm_infer.py \ --eval_human false \ --dataset advertise-gen-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --use_flash_attn false \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/lora_mp_ddp/sft.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/lora_mp_ddp/sft.sh index b6426de860..ea3079c51b 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/lora_mp_ddp/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/lora_mp_ddp/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules c_attn \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/qlora/infer.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/qlora/infer.sh index 8999920d09..199731cd04 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/qlora/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/qlora/infer.sh @@ -11,6 +11,7 @@ python llm_infer.py \ --eval_human false \ --dataset leetcode-python-en \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --use_flash_attn false \ @@ -18,5 +19,6 @@ python llm_infer.py \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/qlora/sft.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/qlora/sft.sh index ae283cca29..12c0815ed7 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/qlora/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/qlora/sft.sh @@ -14,6 +14,7 @@ python llm_sft.py \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -22,7 +23,7 @@ python llm_sft.py \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp/infer.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp/infer.sh index 17c243e17a..c408f1edf4 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp/infer.sh @@ -11,6 +11,7 @@ python llm_infer.py \ --eval_human false \ --dataset advertise-gen-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --use_flash_attn false \ @@ -18,5 +19,6 @@ python llm_infer.py \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp/sft.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp/sft.sh index 7054a6bcf4..d7eeeade80 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp/sft.sh @@ -20,6 +20,7 @@ torchrun \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -28,7 +29,7 @@ torchrun \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp_ds/infer.sh index ec9d5dbe9b..247ca2712e 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp_ds/infer.sh @@ -11,6 +11,7 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --use_flash_attn false \ @@ -18,5 +19,6 @@ python llm_infer.py \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp_ds/sft.sh index 9ff0941292..284e410d7c 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp_ds/sft.sh @@ -20,6 +20,7 @@ torchrun \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -28,7 +29,7 @@ torchrun \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora/infer.sh b/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora/infer.sh index 2716a12de8..329f651cc8 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora/infer.sh @@ -11,10 +11,12 @@ python llm_infer.py \ --eval_human false \ --dataset leetcode-python-en \ --max_length 4096 \ + --check_dataset_strategy warning \ --use_flash_attn false \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora/sft.sh b/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora/sft.sh index b4e1315e57..2843318294 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora/sft.sh @@ -14,13 +14,14 @@ python llm_sft.py \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora_ddp_ds/infer.sh index 5cd54d0757..4e392caa6e 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora_ddp_ds/infer.sh @@ -11,10 +11,12 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --use_flash_attn false \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora_ddp_ds/sft.sh index b9a1218afd..fd815d2dad 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora/infer.sh b/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora/infer.sh index f6992a8ace..7f6f378055 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora/infer.sh @@ -11,10 +11,12 @@ python llm_infer.py \ --eval_human false \ --dataset leetcode-python-en \ --max_length 4096 \ + --check_dataset_strategy warning \ --use_flash_attn false \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora/sft.sh b/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora/sft.sh index ff903dd12a..df1433a76e 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora/sft.sh @@ -14,13 +14,14 @@ python llm_sft.py \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora_ddp_ds/infer.sh index f13440538f..25f27a2f13 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora_ddp_ds/infer.sh @@ -11,10 +11,12 @@ python llm_infer.py \ --eval_human false \ --dataset damo-agent-mini-zh \ --max_length 4096 \ + --check_dataset_strategy warning \ --use_flash_attn false \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora_ddp_ds/sft.sh index 514d26c010..d48220cae9 100644 --- a/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 4096 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_vl/lora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/qwen_vl/lora_ddp_ds/infer.sh index 271073ea30..dab450c625 100644 --- a/examples/pytorch/llm/scripts/qwen_vl/lora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_vl/lora_ddp_ds/infer.sh @@ -11,10 +11,12 @@ python llm_infer.py \ --eval_human false \ --dataset coco-en \ --max_length 2048 \ + --check_dataset_strategy warning \ --use_flash_attn false \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_vl/lora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/qwen_vl/lora_ddp_ds/sft.sh index 9a87f65682..c1d661a688 100644 --- a/examples/pytorch/llm/scripts/qwen_vl/lora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_vl/lora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules c_attn attn.c_proj \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_vl_chat/lora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/qwen_vl_chat/lora_ddp_ds/infer.sh index e32251a452..8ba037f996 100644 --- a/examples/pytorch/llm/scripts/qwen_vl_chat/lora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_vl_chat/lora_ddp_ds/infer.sh @@ -12,10 +12,12 @@ python llm_infer.py \ --eval_human false \ --dataset coco-en \ --max_length 2048 \ + --check_dataset_strategy warning \ --use_flash_attn false \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_vl_chat/lora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/qwen_vl_chat/lora_ddp_ds/sft.sh index 2ad1ab133c..27ae7178de 100644 --- a/examples/pytorch/llm/scripts/qwen_vl_chat/lora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_vl_chat/lora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules c_attn attn.c_proj \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_vl_chat/qlora/infer.sh b/examples/pytorch/llm/scripts/qwen_vl_chat/qlora/infer.sh index 90c5742109..b1aff1546b 100644 --- a/examples/pytorch/llm/scripts/qwen_vl_chat/qlora/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_vl_chat/qlora/infer.sh @@ -11,6 +11,7 @@ python llm_infer.py \ --eval_human false \ --dataset coco-en \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --use_flash_attn false \ @@ -18,5 +19,6 @@ python llm_infer.py \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_vl_chat/qlora/sft.sh b/examples/pytorch/llm/scripts/qwen_vl_chat/qlora/sft.sh index 97956e6233..1d19cd3dd8 100644 --- a/examples/pytorch/llm/scripts/qwen_vl_chat/qlora/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_vl_chat/qlora/sft.sh @@ -14,6 +14,7 @@ python llm_sft.py \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -22,7 +23,7 @@ python llm_sft.py \ --lora_target_modules c_attn attn.c_proj \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora/infer.sh b/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora/infer.sh index ba4bc6b952..8e40d41aea 100644 --- a/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora/infer.sh @@ -11,10 +11,12 @@ python llm_infer.py \ --eval_human false \ --dataset coco-en \ --max_length 2048 \ + --check_dataset_strategy warning \ --use_flash_attn false \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora/sft.sh b/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora/sft.sh index e4062f14ec..1395bf840e 100644 --- a/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora/sft.sh @@ -14,13 +14,14 @@ python llm_sft.py \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules c_attn attn.c_proj \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora_ddp_ds/infer.sh index ba4bc6b952..8e40d41aea 100644 --- a/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora_ddp_ds/infer.sh @@ -11,10 +11,12 @@ python llm_infer.py \ --eval_human false \ --dataset coco-en \ --max_length 2048 \ + --check_dataset_strategy warning \ --use_flash_attn false \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora_ddp_ds/sft.sh index 730fd4f61e..d377053943 100644 --- a/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora_ddp_ds/sft.sh @@ -20,13 +20,14 @@ torchrun \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --lora_rank 8 \ --lora_alpha 32 \ --lora_dropout_p 0.05 \ --lora_target_modules c_attn attn.c_proj \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/seqgpt_560m/full/infer.sh b/examples/pytorch/llm/scripts/seqgpt_560m/full/infer.sh index 69257300d8..ef3d9afb41 100644 --- a/examples/pytorch/llm/scripts/seqgpt_560m/full/infer.sh +++ b/examples/pytorch/llm/scripts/seqgpt_560m/full/infer.sh @@ -11,8 +11,10 @@ python llm_infer.py \ --eval_human false \ --dataset ner-jave-zh \ --max_length 1024 \ + --check_dataset_strategy warning \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ diff --git a/examples/pytorch/llm/scripts/seqgpt_560m/full/sft.sh b/examples/pytorch/llm/scripts/seqgpt_560m/full/sft.sh index 0614a87477..3c739f38d8 100644 --- a/examples/pytorch/llm/scripts/seqgpt_560m/full/sft.sh +++ b/examples/pytorch/llm/scripts/seqgpt_560m/full/sft.sh @@ -13,6 +13,7 @@ python llm_sft.py \ --train_dataset_sample -1 \ --num_train_epochs 3 \ --max_length 1024 \ + --check_dataset_strategy warning \ --gradient_checkpointing true \ --batch_size 4 \ --weight_decay 0.01 \ diff --git a/examples/pytorch/llm/scripts/seqgpt_560m/full_ddp/infer.sh b/examples/pytorch/llm/scripts/seqgpt_560m/full_ddp/infer.sh index 69257300d8..ef3d9afb41 100644 --- a/examples/pytorch/llm/scripts/seqgpt_560m/full_ddp/infer.sh +++ b/examples/pytorch/llm/scripts/seqgpt_560m/full_ddp/infer.sh @@ -11,8 +11,10 @@ python llm_infer.py \ --eval_human false \ --dataset ner-jave-zh \ --max_length 1024 \ + --check_dataset_strategy warning \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ diff --git a/examples/pytorch/llm/scripts/seqgpt_560m/full_ddp/sft.sh b/examples/pytorch/llm/scripts/seqgpt_560m/full_ddp/sft.sh index b52bcc5c9f..6bd887c862 100644 --- a/examples/pytorch/llm/scripts/seqgpt_560m/full_ddp/sft.sh +++ b/examples/pytorch/llm/scripts/seqgpt_560m/full_ddp/sft.sh @@ -19,6 +19,7 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 3 \ --max_length 1024 \ + --check_dataset_strategy warning \ --gradient_checkpointing true \ --batch_size 4 \ --weight_decay 0.01 \ diff --git a/examples/pytorch/llm/scripts/skywork_13b/qlora/infer.sh b/examples/pytorch/llm/scripts/skywork_13b/qlora/infer.sh new file mode 100644 index 0000000000..9e1cf2506b --- /dev/null +++ b/examples/pytorch/llm/scripts/skywork_13b/qlora/infer.sh @@ -0,0 +1,23 @@ +# Experimental environment: A10, 3090 +PYTHONPATH=../../.. \ +CUDA_VISIBLE_DEVICES=0 \ +python llm_infer.py \ + --model_id_or_path skywork/Skywork-13B-base \ + --model_revision master \ + --sft_type lora \ + --template_type default-generation \ + --dtype bf16 \ + --ckpt_dir "output/skywork-13b/vx_xxx/checkpoint-xxx" \ + --eval_human false \ + --dataset advertise-gen-zh \ + --max_length 2048 \ + --check_dataset_strategy warning \ + --quantization_bit 4 \ + --bnb_4bit_comp_dtype bf16 \ + --max_new_tokens 2048 \ + --temperature 0.9 \ + --top_k 20 \ + --top_p 0.9 \ + --repetition_penalty 1.05 \ + --do_sample true \ + --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/skywork_13b/qlora/sft.sh b/examples/pytorch/llm/scripts/skywork_13b/qlora/sft.sh new file mode 100644 index 0000000000..daf9198368 --- /dev/null +++ b/examples/pytorch/llm/scripts/skywork_13b/qlora/sft.sh @@ -0,0 +1,38 @@ +# Experimental environment: A10, 3090 +# 16GB GPU memory +PYTHONPATH=../../.. 
\ +CUDA_VISIBLE_DEVICES=0 \ +python llm_sft.py \ + --model_id_or_path skywork/Skywork-13B-base \ + --model_revision master \ + --sft_type lora \ + --tuner_backend swift \ + --template_type default-generation \ + --dtype bf16 \ + --output_dir output \ + --dataset advertise-gen-zh \ + --train_dataset_sample 20000 \ + --num_train_epochs 1 \ + --max_length 2048 \ + --check_dataset_strategy warning \ + --quantization_bit 4 \ + --bnb_4bit_comp_dtype bf16 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --lora_dropout_p 0.05 \ + --lora_target_modules AUTO \ + --gradient_checkpointing true \ + --batch_size 1 \ + --weight_decay 0.01 \ + --learning_rate 1e-4 \ + --gradient_accumulation_steps 16 \ + --max_grad_norm 0.5 \ + --warmup_ratio 0.03 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 10 \ + --push_to_hub false \ + --hub_model_id skywork-13b-qlora \ + --hub_private_repo true \ + --hub_token 'your-sdk-token' \ diff --git a/examples/pytorch/llm/scripts/xverse_13b/qlora/infer.sh b/examples/pytorch/llm/scripts/xverse_13b/qlora/infer.sh index 45de9099fa..28d3aa98b3 100644 --- a/examples/pytorch/llm/scripts/xverse_13b/qlora/infer.sh +++ b/examples/pytorch/llm/scripts/xverse_13b/qlora/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset advertise-gen-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/xverse_13b/qlora/sft.sh b/examples/pytorch/llm/scripts/xverse_13b/qlora/sft.sh index ee871d9ead..9dd5d2724d 100644 --- a/examples/pytorch/llm/scripts/xverse_13b/qlora/sft.sh +++ b/examples/pytorch/llm/scripts/xverse_13b/qlora/sft.sh @@ -14,6 +14,7 @@ python llm_sft.py \ --train_dataset_sample 20000 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -22,7 +23,7 @@ python llm_sft.py \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/ziya2_13b_chat/qlora/infer.sh b/examples/pytorch/llm/scripts/ziya2_13b_chat/qlora/infer.sh index 57b981c529..90a82d64ac 100644 --- a/examples/pytorch/llm/scripts/ziya2_13b_chat/qlora/infer.sh +++ b/examples/pytorch/llm/scripts/ziya2_13b_chat/qlora/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset lawyer-llama-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/ziya2_13b_chat/qlora/sft.sh b/examples/pytorch/llm/scripts/ziya2_13b_chat/qlora/sft.sh index 4b90d92464..72c0a15b12 100644 --- a/examples/pytorch/llm/scripts/ziya2_13b_chat/qlora/sft.sh +++ b/examples/pytorch/llm/scripts/ziya2_13b_chat/qlora/sft.sh @@ -15,6 +15,7 @@ python llm_sft.py \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -23,7 +24,7 @@ python llm_sft.py \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. \ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps 16 \ --max_grad_norm 0.5 \ diff --git a/examples/pytorch/llm/scripts/ziya2_13b_chat/qlora_ddp_ds/infer.sh b/examples/pytorch/llm/scripts/ziya2_13b_chat/qlora_ddp_ds/infer.sh index 57b981c529..90a82d64ac 100644 --- a/examples/pytorch/llm/scripts/ziya2_13b_chat/qlora_ddp_ds/infer.sh +++ b/examples/pytorch/llm/scripts/ziya2_13b_chat/qlora_ddp_ds/infer.sh @@ -11,11 +11,13 @@ python llm_infer.py \ --eval_human false \ --dataset lawyer-llama-zh \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --max_new_tokens 2048 \ --temperature 0.9 \ --top_k 20 \ --top_p 0.9 \ + --repetition_penalty 1.05 \ --do_sample true \ --merge_lora_and_save false \ diff --git a/examples/pytorch/llm/scripts/ziya2_13b_chat/qlora_ddp_ds/sft.sh b/examples/pytorch/llm/scripts/ziya2_13b_chat/qlora_ddp_ds/sft.sh index b60ce8a026..016197ae09 100644 --- a/examples/pytorch/llm/scripts/ziya2_13b_chat/qlora_ddp_ds/sft.sh +++ b/examples/pytorch/llm/scripts/ziya2_13b_chat/qlora_ddp_ds/sft.sh @@ -20,6 +20,7 @@ torchrun \ --train_dataset_sample -1 \ --num_train_epochs 1 \ --max_length 2048 \ + --check_dataset_strategy warning \ --quantization_bit 4 \ --bnb_4bit_comp_dtype bf16 \ --lora_rank 8 \ @@ -28,7 +29,7 @@ torchrun \ --lora_target_modules ALL \ --gradient_checkpointing true \ --batch_size 1 \ - --weight_decay 0. 
\ + --weight_decay 0.01 \ --learning_rate 1e-4 \ --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ --max_grad_norm 0.5 \ diff --git a/merge_lora_weights_to_model.py b/merge_lora_weights_to_model.py index 8565aa65db..56faf11a50 100644 --- a/merge_lora_weights_to_model.py +++ b/merge_lora_weights_to_model.py @@ -4,5 +4,4 @@ if __name__ == '__main__': args, remaining_argv = parse_args(InferArguments, None) - args.init_argument() merge_lora(args, replace_if_exists=True) diff --git a/swift/llm/infer.py b/swift/llm/infer.py index 5c8063faaa..94ef2540af 100644 --- a/swift/llm/infer.py +++ b/swift/llm/infer.py @@ -107,20 +107,21 @@ def llm_infer(args: InferArguments) -> None: if args.eval_human: while True: query = input('<<< ') - data = {'query': query} - input_ids = template.encode(data)['input_ids'] - inference(input_ids, model, tokenizer, args.stream) + inference(model, template, query, stream=args.stream) else: _, val_dataset = get_dataset(args.dataset, args.dataset_test_ratio, args.dataset_seed) mini_val_dataset = val_dataset.select( range(min(args.show_dataset_sample, val_dataset.shape[0]))) for data in mini_val_dataset: - response = data['response'] - data['response'] = None - input_ids = template.encode(data)['input_ids'] - inference(input_ids, model, tokenizer, args.stream) + inference( + model, + template, + data.get('query'), + data.get('history'), + data.get('system'), + stream=args.stream) print() - print(f'[LABELS]{response}') + print(f"[LABELS]{data.get('response')}") print('-' * 80) # input('next[ENTER]') diff --git a/swift/llm/sft.py b/swift/llm/sft.py index 8d29eaa5bc..529eac3bc1 100644 --- a/swift/llm/sft.py +++ b/swift/llm/sft.py @@ -112,9 +112,11 @@ def llm_sft(args: SftArguments) -> str: # ### Loading Dataset random_state = np.random.RandomState(args.dataset_seed) - train_dataset, val_dataset = get_dataset(args.dataset, - args.dataset_test_ratio, - random_state) + train_dataset, val_dataset = get_dataset( + args.dataset, + args.dataset_test_ratio, + random_state, + check_dataset_strategy=args.check_dataset_strategy) if args.train_dataset_sample >= 0: args.train_dataset_sample = min(args.train_dataset_sample, len(train_dataset)) @@ -199,6 +201,7 @@ def llm_sft(args: SftArguments) -> str: local_rank=local_rank, only_save_model=args.only_save_model, train_sampler_random=args.train_sampler_random, + report_to=args.report_to, deepspeed=args.deepspeed) if args.gradient_checkpointing: diff --git a/swift/llm/utils/argument.py b/swift/llm/utils/argument.py index fc3bf90122..de61d6f2c5 100644 --- a/swift/llm/utils/argument.py +++ b/swift/llm/utils/argument.py @@ -12,8 +12,9 @@ from swift import get_logger from swift.hub import HubApi, ModelScopeConfig from swift.utils import (add_version_to_work_dir, broadcast_string, - get_dist_setting, is_dist, is_master) -from .dataset import DATASET_MAPPING, DatasetName + get_dist_setting, is_dist, is_master, read_from_jsonl) +from .dataset import (DATASET_MAPPING, DatasetName, get_custom_dataset, + register_dataset) from .model import MODEL_MAPPING, ModelType, dtype_mapping from .template import TEMPLATE_MAPPING @@ -58,6 +59,11 @@ class SftArguments: train_dataset_sample: int = 20000 # -1: all dataset system: str = 'you are a helpful assistant!' 
max_length: int = 2048 # -1: no limit + check_dataset_strategy: str = field( + default='none', + metadata={'choices': ['none', 'discard', 'error', 'warning']}) + custom_train_dataset_path: Optional[List[str]] = None + custom_val_dataset_path: Optional[List[str]] = None # If you want to use qlora, set the quantization_bit to 8 or 4. # And you need to install bitsandbytes: `pip install bitsandbytes -U` @@ -126,6 +132,7 @@ class SftArguments: use_flash_attn: Optional[bool] = None ignore_args_error: bool = False # True: notebook compatibility logging_dir: Optional[str] = None + report_to: List[str] = None # generation config max_new_tokens: int = 2048 @@ -135,11 +142,11 @@ top_p: float = 0.9 repetition_penalty: float = 1.05 - def init_argument(self): - # Can be manually initialized, unlike __post_init__ + def __post_init__(self): handle_compatibility(self) set_model_type(self) - handle_dir(self) + register_custom_dataset(self) + handle_path(self) if self.add_output_dir_suffix: self.output_dir = os.path.join(self.output_dir, self.model_type) if is_master(): @@ -213,6 +220,8 @@ def init_argument(self): logger.info(f'Using deepspeed: {self.deepspeed}') if self.logging_dir is None: self.logging_dir = f'{self.output_dir}/runs' + if self.report_to is None: + self.report_to = ['all'] @dataclass @@ -247,6 +256,11 @@ class InferArguments: show_dataset_sample: int = 10 system: str = 'you are a helpful assistant!' max_length: int = 2048 # -1: no limit + check_dataset_strategy: str = field( + default='none', + metadata={'choices': ['none', 'discard', 'error', 'warning']}) + custom_train_dataset_path: Optional[List[str]] = None + custom_val_dataset_path: Optional[List[str]] = None quantization_bit: int = field(default=0, metadata={'choices': [0, 4, 8]}) bnb_4bit_comp_dtype: str = field( @@ -269,14 +283,14 @@ class InferArguments: merge_lora_and_save: bool = False overwrite_generation_config: bool = False - def init_argument(self): - # Can be manually initialized, unlike __post_init__ + def __post_init__(self): handle_compatibility(self) if not os.path.isdir(self.ckpt_dir): raise ValueError(f'Please enter a valid ckpt_dir: {self.ckpt_dir}') logger.info(f'ckpt_dir: {self.ckpt_dir}') set_model_type(self) - handle_dir(self) + register_custom_dataset(self) + handle_path(self) self.torch_dtype, _, _ = select_dtype(self) if self.template_type is None: @@ -304,11 +318,10 @@ class RomeArguments(InferArguments): 'to get the format' }) - def init_argument(self): - # Can be manually initialized, unlike __post_init__ + def __post_init__(self): handle_compatibility(self) set_model_type(self) - handle_dir(self) + handle_path(self) self.torch_dtype, _, _ = select_dtype(self) if self.template_type is None: @@ -377,6 +390,8 @@ def handle_compatibility(args: Union[SftArguments, InferArguments]) -> None: if args.dataset is not None and len( args.dataset) == 1 and ',' in args.dataset[0]: args.dataset = args.dataset[0].split(',') + if args.template_type == 'chatglm2-generation': + args.template_type = 'chatglm-generation' def set_model_type(args: Union[SftArguments, InferArguments]) -> None: @@ -421,7 +436,7 @@ def prepare_push_ms_hub(args: SftArguments) -> None: logger.info('hub login successful!') -def handle_dir(args: Union[SftArguments, InferArguments]) -> None: +def handle_path(args: Union[SftArguments, InferArguments]) -> None: for k in [ 'model_cache_dir', 'output_dir', 'ckpt_dir', 'resume_from_checkpoint', 'deepspeed_config_path', 'logging_dir' @@ -431,3 +446,19 @@ def handle_dir(args:
Union[SftArguments, InferArguments]) -> None: value = os.path.expanduser(value) value = os.path.abspath(value) setattr(args, k, value) + + +def register_custom_dataset(args: Union[SftArguments, InferArguments]) -> None: + if args.custom_train_dataset_path is None: + assert args.custom_val_dataset_path is None + return + register_dataset( + '_custom_dataset', + '_custom_dataset', + args.custom_train_dataset_path, + args.custom_val_dataset_path, + get_function=get_custom_dataset) + if args.dataset is None: + args.dataset = ['_custom_dataset'] + else: + args.dataset.append('_custom_dataset') diff --git a/swift/llm/utils/dataset.py b/swift/llm/utils/dataset.py index 06b8ca534b..405af25c41 100644 --- a/swift/llm/utils/dataset.py +++ b/swift/llm/utils/dataset.py @@ -7,13 +7,16 @@ import json import numpy as np +import pandas as pd from datasets import Dataset as HfDataset from datasets import concatenate_datasets from modelscope import MsDataset from numpy.random import RandomState +from pandas import DataFrame from tqdm.auto import tqdm -from swift.utils import get_logger, get_seed +from swift.utils import (get_logger, get_seed, read_from_jsonl, + transform_jsonl_to_df) from .preprocess import (AlpacaPreprocessor, ClsPreprocessor, ComposePreprocessor, ConversationsPreprocessor, PreprocessFunc, RenameColumnsPreprocessor, @@ -186,7 +189,7 @@ def get_dataset_from_repo( dataset_id: str, train_subset_split_list: List[Tuple[str, str]], val_subset_split_list: List[Tuple[str, str]], - preprocess_func: List[Tuple[str, str]], + preprocess_func: PreprocessFunc, remove_useless_columns: bool = True, dataset_sample: int = -1, ) -> Union[HfDataset, Tuple[HfDataset, HfDataset]]: @@ -702,7 +705,7 @@ def get_dataset( dataset_test_ratio: float = 0., dataset_seed: Union[RandomState, int] = 42, check_dataset_strategy: Literal['none', 'discard', 'error', - 'warning'] = 'warning' + 'warning'] = 'none' ) -> Tuple[HfDataset, Optional[HfDataset]]: """Returns train_dataset and val_dataset""" train_dataset_list: List[HfDataset] = [] @@ -712,7 +715,7 @@ def get_dataset( random_state = RandomState(dataset_seed) for dataset_name in dataset_name_list: dataset_info = DATASET_MAPPING[dataset_name] - get_function = dataset_info['get_function'] + get_function: GetDatasetFunction = dataset_info['get_function'] dataset = get_function( dataset_info['dataset_id_or_path'], train_subset_split_list=dataset_info['train_subset_split_list'], @@ -740,3 +743,36 @@ def get_dataset( train_dataset = _check_dataset(train_dataset, check_dataset_strategy) val_dataset = _check_dataset(val_dataset, check_dataset_strategy) return train_dataset, val_dataset + + +def load_dataset_from_local( + dataset_path_list: Optional[Union[str, List[str]]], + preprocess_func: PreprocessFunc) -> Optional[HfDataset]: + if isinstance(dataset_path_list, str): + dataset_path_list = [dataset_path_list] + if dataset_path_list is None or len(dataset_path_list) == 0: + return None + assert isinstance(dataset_path_list, (list, tuple)) + + dataset_list = [] + for dataset_path in dataset_path_list: + assert isinstance(dataset_path, str) + df: DataFrame + if dataset_path.endswith('.csv'): + df = pd.read_csv(dataset_path) + elif dataset_path.endswith('.jsonl'): + df = transform_jsonl_to_df(read_from_jsonl(dataset_path)) + dataset = HfDataset.from_dict(df.to_dict(orient='list')) + dataset_list.append(preprocess_func(dataset)) + return concatenate_datasets(dataset_list) + + +def get_custom_dataset(_: str, train_subset_split_list: List[str], + val_subset_split_list: List[str], + 
preprocess_func: PreprocessFunc, + **kwargs) -> Tuple[HfDataset, Optional[HfDataset]]: + train_dataset = load_dataset_from_local(train_subset_split_list, + preprocess_func) + val_dataset = load_dataset_from_local(val_subset_split_list, + preprocess_func) + return train_dataset, val_dataset diff --git a/swift/llm/utils/model.py b/swift/llm/utils/model.py index dffa28a352..9efb19a83e 100644 --- a/swift/llm/utils/model.py +++ b/swift/llm/utils/model.py @@ -86,6 +86,9 @@ class ModelType: # ziya ziya2_13b = 'ziya2-13b' ziya2_13b_chat = 'ziya2-13b-chat' + # skywork + skywork_13b_chat = 'skywork-13b-chat' + skywork_13b = 'skywork-13b' # other polylm_13b = 'polylm-13b' seqgpt_560m = 'seqgpt-560m' @@ -630,6 +633,27 @@ def get_model_tokenizer_qwen_intx(model_dir: str, return model, tokenizer +register_model(ModelType.skywork_13b, 'skywork/Skywork-13B-base', + LoRATM.llama2, TemplateType.default, + get_model_tokenizer_from_repo) + + +@register_model(ModelType.skywork_13b_chat, 'skywork/Skywork-13B-chat', + LoRATM.llama2, TemplateType.skywork) +def get_skywork_model_tokenizer(model_dir: str, + torch_dtype: Dtype, + model_kwargs: Dict[str, Any], + load_model: bool = True, + **kwargs): + model, tokenizer = get_model_tokenizer_from_repo(model_dir, torch_dtype, + model_kwargs, load_model, + **kwargs) + tokenizer.add_tokens('[USER]') + tokenizer.add_tokens('[BOT]') + tokenizer.add_tokens('[SEP]') + return model, tokenizer + + def get_model_tokenizer( model_type: str, torch_dtype: Optional[Dtype] = None, diff --git a/swift/llm/utils/preprocess.py b/swift/llm/utils/preprocess.py index 87efec592b..74c4ad4704 100644 --- a/swift/llm/utils/preprocess.py +++ b/swift/llm/utils/preprocess.py @@ -120,12 +120,6 @@ def __call__(self, dataset: HfDataset) -> HfDataset: return dataset -class ChatmlPreprocessor: - - def __init__(self): - pass - - class ComposePreprocessor: def __init__(self, preprocessor_list: List[PreprocessFunc]) -> None: @@ -163,6 +157,14 @@ def __init__(self) -> None: 'conversations': { 'required': ['conversations'], 'preprocessor': ConversationsPreprocessor() + }, + 'chatml': { + 'required': ['messages'], + 'preprocessor': + ConversationsPreprocessor( + conversations_key='messages', + from_key='role', + value_key='content') } } diff --git a/swift/llm/utils/template.py b/swift/llm/utils/template.py index 89a6ca1f26..fd5425d364 100644 --- a/swift/llm/utils/template.py +++ b/swift/llm/utils/template.py @@ -13,13 +13,14 @@ class TemplateType: chatml = 'chatml' baichuan = 'baichuan' chatglm2 = 'chatglm2' - chatglm2_generation = 'chatglm2-generation' + chatglm_generation = 'chatglm-generation' chatglm3 = 'chatglm3' llama = 'llama' openbuddy = 'openbuddy' internlm = 'internlm' xverse = 'xverse' ziya = 'ziya' + skywork = 'skywork' Prompt = List[Union[str, List[Union[str, int]]]] @@ -102,7 +103,7 @@ def register_template(template_type: str, template: Template) -> None: ['\n\n'], [['eos_token_id']])) register_template( - TemplateType.chatglm2_generation, + TemplateType.chatglm_generation, Template([[64790, 64792]], ['{{QUERY}}'], None, [['eos_token_id']])) register_template( @@ -135,6 +136,10 @@ def register_template(template_type: str, template: Template) -> None: Template([['bos_token_id']], ['
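
The `argument.py` and `dataset.py` hunks above add `--custom_train_dataset_path` / `--custom_val_dataset_path` and route local `.jsonl`/`.csv` files through `register_dataset` and `get_custom_dataset`. Below is a minimal sketch of exercising the new local loader directly; the module path follows the file location in the diff, the identity preprocess function is a stand-in, and the `query`/`response` column names are only an illustration, not a schema mandated by the patch.

```python
# Minimal sketch of the new local-dataset path; assumptions are noted inline.
import json

# Module path mirrors swift/llm/utils/dataset.py in the diff; whether the helper is
# re-exported at a higher level is not shown there.
from swift.llm.utils.dataset import get_custom_dataset

# Write a tiny .jsonl file. The 'query'/'response' keys are an illustration only;
# the real schema is whatever the preprocess function you pass expects.
rows = [{'query': 'Write a short ad for a teapot.',
         'response': 'A hand-glazed teapot that keeps tea warm for hours.'}]
with open('my_train.jsonl', 'w', encoding='utf-8') as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + '\n')


def identity(dataset):
    # Stand-in PreprocessFunc; a real run would use one of the registered preprocessors.
    return dataset


train_ds, val_ds = get_custom_dataset(
    '',                   # dataset_id is unused for local files
    ['my_train.jsonl'],   # what --custom_train_dataset_path feeds in
    [],                   # no validation files, so val_ds comes back as None
    identity)
print(train_ds[0], val_ds)
```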
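The `preprocess.py` hunk registers a `chatml` dataset format handled by `ConversationsPreprocessor(conversations_key='messages', from_key='role', value_key='content')`. A record in that layout looks roughly like the sketch below; only the `messages`/`role`/`content` keys come from the diff, the contents are invented for illustration.

```python
# Illustrative record for the new 'chatml' preprocessing entry; key names come from
# the ConversationsPreprocessor arguments in the hunk, the contents are made up.
record = {
    'messages': [
        {'role': 'system', 'content': 'you are a helpful assistant!'},
        {'role': 'user', 'content': 'What does QLoRA change compared to LoRA?'},
        {'role': 'assistant',
         'content': 'It trains LoRA adapters on top of a 4-bit quantized base model.'},
    ]
}
print(record['messages'][-1]['content'])
```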
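The `model.py` hunk registers `skywork-13b` and `skywork-13b-chat`, and the chat getter adds `[USER]`, `[BOT]` and `[SEP]` tokens. A hedged loading sketch follows; passing the dtype as the second positional argument matches the partial `get_model_tokenizer` signature visible above, the bf16 choice mirrors the example scripts, and running this of course downloads the 13B checkpoint.

```python
# Hedged sketch: load the newly registered skywork-13b-chat through the factory.
import torch

from swift.llm.utils.model import ModelType, get_model_tokenizer

model, tokenizer = get_model_tokenizer(ModelType.skywork_13b_chat, torch.bfloat16)

# The custom getter shown in the hunk adds these conversation markers:
print(tokenizer.convert_tokens_to_ids(['[USER]', '[BOT]', '[SEP]']))
```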
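Finally, the `infer.py` hunk changes `inference` to take the model, the template object and the raw query (plus optional history and system strings) instead of pre-encoded `input_ids`. The helper below only demonstrates the call shape; it assumes `model` and `template` are prepared the same way `llm_infer` prepares them, and the import location of `inference` is an assumption since its definition is not part of this diff.

```python
# Call-shape sketch for the refactored inference(); see assumptions in the comments.
from swift.llm.infer import inference  # assumed import path; not confirmed by the diff


def run_query(model, template, query: str, stream: bool = True) -> None:
    """Invoke inference() the way llm_infer() does after the refactor.

    `model` and `template` must be prepared by the caller (e.g. as llm_infer builds
    them); history and system are passed positionally, matching the diff's call sites.
    """
    inference(model, template, query, None, None, stream=stream)
```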