Merged
34 changes: 20 additions & 14 deletions examples/pytorch/llm/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,19 +48,20 @@


## News
- 2023.10.7: Supported DeepSpeed ZeRO-2, enabling LoRA (not just QLoRA) to run DDP on 2*A10. The corresponding shell script can be found at `scripts/qwen_7b_chat/lora_ddp_ds/sft.sh`, `scripts/baichuan2_7b_chat/lora_ddp_ds/sft.sh`.
- 2023.10.4: Supported datasets in the fields of mathematics, law, SQL, and coding: blossom-math-zh, school-math-zh, text2sql-en, sql-create-context-en, lawyer-llama-zh, tigerbot-law-zh, leetcode-python-en.
- 2023.9.26: Supported xverse model series: xverse-7b, xverse-7b-chat, xverse-13b, xverse-13b-chat.
- 2023.9.25: Supported qwen-14b model series: qwen-14b, qwen-14b-chat.
- 2023.9.20: Supported incremental weight merging from LoRA and QLoRA training methods into base model weights, and saved the complete model weights for easy deployment by users.
- 2023.9.18: Supported internlm-20b model series: internlm-20b, internlm-20b-chat.
- 2023.9.26: Supported xverse model series: xverse-7b, xverse-7b-chat, xverse-13b, xverse-13b-chat. The corresponding shell script can be found at `scripts/xverse_13b`.
- 2023.9.25: Supported qwen-14b model series: qwen-14b, qwen-14b-chat. The corresponding shell script can be found at `scripts/qwen_14b`, `scripts/qwen_14b_chat`.
- 2023.9.20: Supported incremental weight merging from LoRA and QLoRA training methods into base model weights, and saved the complete model weights for easy deployment by users. You can check the command-line parameter `--merge_lora_and_save` in the `infer.sh` script.
- 2023.9.18: Supported internlm-20b model series: internlm-20b, internlm-20b-chat. The corresponding shell script can be found at `scripts/internlm_20b`, `scripts/internlm_20b_chat`.
- 2023.9.12: Supported training with MP+DDP to accelerate full-parameter fine-tuning speed. The corresponding shell script can be found at `scripts/qwen_7b_chat/full_mp_ddp/sft.sh`.
- 2023.9.5: Supported training that only saves model weights without saving intermediate states such as optimizer weights required for checkpoint resumption, avoiding long checkpoint-saving times and large storage space in full-parameter fine-tuning.
- 2023.9.5: Supported openbuddy-llama2-70b model.
- 2023.9.3: Supported baichuan-13b model series: baichuan-13b, baichuan-13b-chat.
- 2023.9.5: Supported training that only saves model weights without saving intermediate states such as optimizer weights required for checkpoint resumption, avoiding long checkpoint-saving times and large storage space in full-parameter fine-tuning. You can check the command-line parameter `--only_save_model` in the `sft.sh` script.
- 2023.9.5: Supported openbuddy-llama2-70b model. The corresponding shell script can be found at `scripts/openbuddy_llama2_70b`.
- 2023.9.3: Supported baichuan2 model series: baichuan2-7b, baichuan2-7b-chat, baichuan2-13b, baichuan2-13b-chat. The corresponding shell script can be found at `scripts/baichuan2_7b`, `scripts/baichuan2_7b_chat`.


## Prepare the Environment
Experimental environment: V100, A10, 3090, A100, ...
Experimental environment: A10, 3090, V100, A100, ...
```bash
# Installing miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Expand Down Expand Up @@ -88,13 +89,13 @@ Performance: full(nice) > lora > qlora
Training GPU memory: qlora(low,3090) > lora > full(2*A100)

Tips:
- You can set `--gradient_checkpointing true` during training to save GPU memory, but this will slightly decrease the training speed.
- You can set `--gradient_checkpointing true` during training to save GPU memory, but this will slightly decrease the training speed. This is useful if you need to train an LLM on a consumer-grade GPU such as the 3090.
- If you want to push weights to the ModelScope Hub during training, you need to set `--push_to_hub true`.
- If you want to merge LoRA weights and save during inference, you need to set `--merge_lora_and_save true`.
- If you want to use quantization, you need to install `bitsandbytes` first: `pip install bitsandbytes -U`.
- If you want to use deepspeed, you need to `pip install deepspeed -U`.
- If you want to use deepspeed, you need to `pip install deepspeed -U`. Using deepspeed can save GPU memory, but this may slightly decrease the training speed.
- If you are using older GPUs like V100, you need to set `--dtype fp16`, because they do not support bf16.
- qwen recommends installing [flash-attn](https://github.com/Dao-AILab/flash-attention), which will accelerate the training and inference speed and reduce GPU memory usage (V100, 3090, A10 machines do not support flash-attn).
- qwen recommends installing [flash-attn](https://github.com/Dao-AILab/flash-attention), which will accelerate the training and inference speed and reduce GPU memory usage (A10, 3090, V100 machines do not support flash-attn).
- Below is a shell script for running `qwen_7b_chat` directly (you just need to specify `ckpt_dir` during inference to execute it smoothly). For more model scripts, you can check the `scripts` folder. If you want to customize a shell script, it is recommended to refer to the script in `scripts/qwen_7b_chat`.
```bash
# sft lora and infer qwen-7b-chat, Requires 38GB GPU memory.
Expand All @@ -113,20 +114,25 @@ bash scripts/qwen_7b_chat/lora_ddp_ds/sft.sh
bash scripts/qwen_7b_chat/lora_ddp_ds/infer.sh

# sft(lora+mp+ddp) and infer qwen-7b-chat, Requires 4*15GB GPU memory.
# Recommended experimental environment: V100, A10, 3090
# Recommended experimental environment: A10, 3090
bash scripts/qwen_7b_chat/lora_mp_ddp/sft.sh
bash scripts/qwen_7b_chat/lora_mp_ddp/infer.sh

# sft(qlora) and infer qwen-7b-chat, Requires 10GB GPU memory.
# Recommended experimental environment: V100, A10, 3090
# Recommended experimental environment: A10, 3090
bash scripts/qwen_7b_chat/qlora/sft.sh
bash scripts/qwen_7b_chat/qlora/infer.sh

# sft(qlora+ddp) and infer qwen-7b-chat, Requires 2*14GB GPU memory.
# Recommended experimental environment: V100, A10, 3090
# Recommended experimental environment: A10, 3090
bash scripts/qwen_7b_chat/qlora_ddp/sft.sh
bash scripts/qwen_7b_chat/qlora_ddp/infer.sh

# sft(qlora+ddp+deepspeed) and infer qwen-7b-chat, Requires 2*16GB GPU memory.
# Recommended experimental environment: A10, 3090
bash scripts/qwen_7b_chat/qlora_ddp_ds/sft.sh
bash scripts/qwen_7b_chat/qlora_ddp_ds/infer.sh

# sft(full+mp) and infer qwen-7b-chat, Requires 2*75GB GPU memory.
# Recommended experimental environment: A100
bash scripts/qwen_7b_chat/full_mp/sft.sh
Expand Down
34 changes: 20 additions & 14 deletions examples/pytorch/llm/README_CN.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,19 +48,20 @@


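## News
- 2023.10.7: Supported DeepSpeed ZeRO-2, enabling LoRA (not just QLoRA) to run DDP on two A10 GPUs. The corresponding shell scripts can be found at `scripts/qwen_7b_chat/lora_ddp_ds/sft.sh`, `scripts/baichuan2_7b_chat/lora_ddp_ds/sft.sh`.
- 2023.10.4: Supported more datasets in the fields of mathematics, law, SQL, and code: blossom-math-zh, school-math-zh, text2sql-en, sql-create-context-en, lawyer-llama-zh, tigerbot-law-zh, leetcode-python-en.
- 2023.9.26: Supported the xverse model series: xverse-7b, xverse-7b-chat, xverse-13b, xverse-13b-chat.
- 2023.9.25: Supported the **qwen-14b** model series: qwen-14b, qwen-14b-chat
- 2023.9.20: Supported merging the incremental weights into the base model weights after LoRA or QLoRA training and saving the complete model weights, for easier deployment.
- 2023.9.18: Supported the internlm-20b model series: internlm-20b, internlm-20b-chat
- 2023.9.26: Supported the xverse model series: xverse-7b, xverse-7b-chat, xverse-13b, xverse-13b-chat. The corresponding shell script can be found at `scripts/xverse_13b`.
- 2023.9.25: Supported the **qwen-14b** model series: qwen-14b, qwen-14b-chat. The corresponding shell scripts can be found at `scripts/qwen_14b`, `scripts/qwen_14b_chat`.
- 2023.9.20: Supported merging the incremental weights into the base model weights after LoRA or QLoRA training and saving the complete model weights, for easier deployment. See the command-line parameter `--merge_lora_and_save` in `infer.sh`.
- 2023.9.18: Supported the internlm-20b model series: internlm-20b, internlm-20b-chat. The corresponding shell scripts can be found at `scripts/internlm_20b`, `scripts/internlm_20b_chat`.
- 2023.9.12: Supported MP+DDP training to speed up full-parameter fine-tuning. The corresponding shell script can be found at `scripts/qwen_7b_chat/full_mp_ddp/sft.sh`.
- 2023.9.5: Supported saving only the model weights during training, without the intermediate states needed to resume from a checkpoint (such as optimizer weights), which avoids the long save times and large storage footprint of checkpoints in full-parameter fine-tuning.
- 2023.9.5: Supported the openbuddy-llama2-70b model.
- 2023.9.3: Supported the baichuan-13b model series: baichuan-13b, baichuan-13b-chat.
- 2023.9.5: Supported saving only the model weights during training, without the intermediate states needed to resume from a checkpoint (such as optimizer weights), which avoids the long save times and large storage footprint of checkpoints in full-parameter fine-tuning. See the command-line parameter `--only_save_model` in `sft.sh`.
- 2023.9.5: Supported the openbuddy-llama2-70b model. The corresponding shell script can be found at `scripts/openbuddy_llama2_70b`.
- 2023.9.3: Supported the baichuan2 model series: baichuan2-7b, baichuan2-7b-chat, baichuan2-13b, baichuan2-13b-chat. The corresponding shell scripts can be found at `scripts/baichuan2_7b`, `scripts/baichuan2_7b_chat`.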


## Prepare the Experimental Environment
Experimental environment: any of V100, A10, 3090, A100.
Experimental environment: any of A10, 3090, V100, A100.
```bash
# Install miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Expand Down Expand Up @@ -89,13 +90,13 @@ pip install -r requirements.txt -U
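Training GPU memory: qlora(low,3090) > lora > full(2*A100)

Tips:
- You can set `--gradient_checkpointing true` during training to save GPU memory, but this slightly reduces training speed.
- You can set `--gradient_checkpointing true` during training to save GPU memory, but this slightly reduces training speed. This is useful if you need to train a large model on a consumer-grade GPU, for example the 3090.
- If you want to push the weights to the ModelScope Hub during training, you need to set `--push_to_hub true`.
- If you want to merge the LoRA weights and save them during inference, you need to set `--merge_lora_and_save true`.
- If you want to use quantization, you need to install bnb first: `pip install bitsandbytes -U`.
- If you want to use deepspeed, you need to `pip install deepspeed -U`.
- If you want to use deepspeed, you need to `pip install deepspeed -U`. Using deepspeed can save GPU memory, but it may slightly reduce training speed.
- If you are using an older GPU such as V100, you need to set `--dtype fp16`, because it does not support bf16.
- If your machine has a high-performance GPU such as A100 and you are using a qwen series model, it is recommended to install [flash-attn](https://github.com/Dao-AILab/flash-attention), which speeds up training and inference and reduces GPU memory usage (GPUs such as V100, 3090, and A10 do not support training with flash-attn).
- If your machine has a high-performance GPU such as A100 and you are using a qwen series model, it is recommended to install [flash-attn](https://github.com/Dao-AILab/flash-attention), which speeds up training and inference and reduces GPU memory usage (GPUs such as A10, 3090, and V100 do not support training with flash-attn).
- Below are shell scripts for `qwen_7b_chat` that can be run directly (you only need to specify `ckpt_dir` during inference). For scripts for more models, see the `scripts` folder. If you want to write a custom shell script, it is recommended to refer to the scripts in `scripts/qwen_7b_chat`.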
```bash
# sft(lora) and infer qwen-7b-chat, requires 38GB GPU memory.
Expand All @@ -114,20 +115,25 @@ bash scripts/qwen_7b_chat/lora_ddp_ds/sft.sh
bash scripts/qwen_7b_chat/lora_ddp_ds/infer.sh

# sft(lora+mp+ddp) and infer qwen-7b-chat, requires 4*15GB GPU memory.
# Recommended experimental environment: V100, 3090, A10
# Recommended experimental environment: A10, 3090
bash scripts/qwen_7b_chat/lora_mp_ddp/sft.sh
bash scripts/qwen_7b_chat/lora_mp_ddp/infer.sh

# sft(qlora) and infer qwen-7b-chat, requires 10GB GPU memory.
# Recommended experimental environment: V100, 3090, A10
# Recommended experimental environment: A10, 3090
bash scripts/qwen_7b_chat/qlora/sft.sh
bash scripts/qwen_7b_chat/qlora/infer.sh

# sft(qlora+ddp) and infer qwen-7b-chat, requires 2*14GB GPU memory.
# Recommended experimental environment: V100, 3090, A10
# Recommended experimental environment: A10, 3090
bash scripts/qwen_7b_chat/qlora_ddp/sft.sh
bash scripts/qwen_7b_chat/qlora_ddp/infer.sh

# sft(qlora+ddp+deepspeed) and infer qwen-7b-chat, requires 2*16GB GPU memory.
# Recommended experimental environment: A10, 3090
bash scripts/qwen_7b_chat/qlora_ddp_ds/sft.sh
bash scripts/qwen_7b_chat/qlora_ddp_ds/infer.sh

# sft(full+mp) and infer qwen-7b-chat, requires 2*75GB GPU memory.
# Recommended experimental environment: A100
bash scripts/qwen_7b_chat/full_mp/sft.sh
Expand Down
4 changes: 4 additions & 0 deletions examples/pytorch/llm/requirements.txt
Original file line number Diff line number Diff line change
@@ -1,6 +1,10 @@
accelerate
charset_normalizer
cpm_kernels
matplotlib
modelscope>=1.9
sentencepiece
tensorboard
tiktoken
transformers
transformers_stream_generator
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python src/llm_infer.py \
--model_type baichuan2-7b-chat \
--sft_type lora \
--template_type baichuan \
--dtype bf16 \
--ckpt_dir "output/baichuan2-7b-chat/vx_xxx/checkpoint-xxx" \
--eval_human false \
--dataset damo-agent-mini-zh \
--max_length 4096 \
--max_new_tokens 2048 \
--temperature 0.9 \
--top_k 20 \
--top_p 0.9 \
--do_sample true \
--merge_lora_and_save false \
40 changes: 40 additions & 0 deletions examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp_ds/sft.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# Experimental environment: 2 * A10
# 2 * 21GB GPU memory
nproc_per_node=2

PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0,1 \
torchrun \
--nproc_per_node=$nproc_per_node \
--master_port 29500 \
src/llm_sft.py \
--model_type baichuan2-7b-chat \
--sft_type lora \
--template_type baichuan \
--dtype bf16 \
--output_dir output \
--ddp_backend nccl \
--dataset damo-agent-mini-zh \
--train_dataset_sample -1 \
--num_train_epochs 1 \
--max_length 4096 \
--lora_rank 8 \
--lora_alpha 32 \
--lora_dropout_p 0. \
--lora_target_modules ALL \
--gradient_checkpointing true \
--batch_size 1 \
--weight_decay 0. \
--learning_rate 1e-4 \
--gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
--max_grad_norm 0.5 \
--warmup_ratio 0.03 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 2 \
--logging_steps 10 \
--push_to_hub false \
--hub_model_id baichuan2-7b-chat-lora \
--hub_private_repo true \
--hub_token 'your-sdk-token' \
--deepspeed_config_path 'ds_config/zero2.json' \
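The expression `$(expr 16 / $nproc_per_node)` in the script above appears intended to hold the global batch size at 16 samples per optimizer step regardless of how many DDP processes are launched. A minimal sketch of that arithmetic follows; the helper name `grad_accum_steps` and the variable `target_global_batch` are illustrative and do not appear in the repository.

```bash
#!/usr/bin/env bash
# Illustrative helper (not part of the repository): gradient accumulation
# steps that keep the global batch size fixed across DDP process counts.
target_global_batch=16   # the constant 16 used in the sft.sh scripts
batch_size=1             # per-process micro-batch, as in --batch_size 1

grad_accum_steps() {
  # $1 = number of DDP processes (nproc_per_node); integer division
  echo $(( target_global_batch / $1 ))
}

for nproc in 1 2 4; do
  steps=$(grad_accum_steps "$nproc")
  global_batch=$(( batch_size * nproc * steps ))
  echo "nproc_per_node=$nproc gradient_accumulation_steps=$steps global_batch=$global_batch"
done
```

With `--batch_size 1` and `nproc_per_node=2`, each process accumulates 8 micro-batches, so one optimizer step sees 1 * 2 * 8 = 16 samples. Integer division truncates, so a process count that does not divide 16 (for example 3) would yield a slightly smaller global batch.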
7 changes: 4 additions & 3 deletions examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp/infer.sh
Original file line number Diff line number Diff line change
Expand Up @@ -7,10 +7,11 @@ python src/llm_infer.py \
--dtype bf16 \
--ckpt_dir "output/chatglm2-6b/vx_xxx/checkpoint-xxx" \
--eval_human false \
--dataset code-python-zh \
--max_length 8192 \
--max_new_tokens 1024 \
--dataset damo-agent-mini-zh \
--max_length 4096 \
--max_new_tokens 2048 \
--temperature 0.9 \
--top_k 20 \
--top_p 0.9 \
--do_sample true \
--merge_lora_and_save false \
6 changes: 3 additions & 3 deletions examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp/sft.sh
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# Experimental environment: 2 * A100
# 2 * 50GB GPU memory
# 2 * 35GB GPU memory
nproc_per_node=2

PYTHONPATH=../../.. \
Expand All @@ -14,10 +14,10 @@ torchrun \
--dtype bf16 \
--output_dir output \
--ddp_backend nccl \
--dataset code-python-zh \
--dataset damo-agent-mini-zh \
--train_dataset_sample -1 \
--num_train_epochs 1 \
--max_length 8192 \
--max_length 4096 \
--lora_rank 8 \
--lora_alpha 32 \
--lora_dropout_p 0. \
Expand Down
17 changes: 17 additions & 0 deletions examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp_ds/infer.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python src/llm_infer.py \
--model_type chatglm2-6b \
--sft_type lora \
--template_type chatglm2 \
--dtype bf16 \
--ckpt_dir "output/chatglm2-6b/vx_xxx/checkpoint-xxx" \
--eval_human false \
--dataset damo-agent-mini-zh \
--max_length 4096 \
--max_new_tokens 2048 \
--temperature 0.9 \
--top_k 20 \
--top_p 0.9 \
--do_sample true \
--merge_lora_and_save false \
40 changes: 40 additions & 0 deletions examples/pytorch/llm/scripts/chatglm2_6b/lora_ddp_ds/sft.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# Experimental environment: 2 * A10
# 2 * 18GB GPU memory
nproc_per_node=2

PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0,1 \
torchrun \
--nproc_per_node=$nproc_per_node \
--master_port 29500 \
src/llm_sft.py \
--model_type chatglm2-6b \
--sft_type lora \
--template_type chatglm2 \
--dtype bf16 \
--output_dir output \
--ddp_backend nccl \
--dataset damo-agent-mini-zh \
--train_dataset_sample -1 \
--num_train_epochs 1 \
--max_length 4096 \
--lora_rank 8 \
--lora_alpha 32 \
--lora_dropout_p 0. \
--lora_target_modules ALL \
--gradient_checkpointing true \
--batch_size 1 \
--weight_decay 0. \
--learning_rate 1e-4 \
--gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
--max_grad_norm 0.5 \
--warmup_ratio 0.03 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 2 \
--logging_steps 10 \
--push_to_hub false \
--hub_model_id chatglm2-6b-lora \
--hub_private_repo true \
--hub_token 'your-sdk-token' \
--deepspeed_config_path 'ds_config/zero2.json' \
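The `--deepspeed_config_path 'ds_config/zero2.json'` argument points at a ZeRO-2 configuration whose contents are not part of this diff. ZeRO stage 2 partitions optimizer states and gradients across the data-parallel processes, which is how it reduces per-GPU memory relative to plain DDP. As a rough sketch of what such a file commonly looks like, the snippet below writes a minimal config to a temporary file. The field values are assumptions rather than the repository's actual settings, and the `"auto"` placeholders are only resolved when the file is consumed through the Hugging Face Trainer DeepSpeed integration, which is also assumed here.

```bash
#!/usr/bin/env bash
# Illustrative only: this is NOT the repository's ds_config/zero2.json.
# Writes a minimal ZeRO stage 2 config to a temporary file.
config_path="$(mktemp)"

cat > "$config_path" <<'EOF'
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
EOF

echo "wrote example ZeRO-2 config to $config_path"
```

The `"auto"` values let the trainer fill in settings such as the micro-batch size and accumulation steps from its own command-line arguments, which avoids keeping two copies of the same numbers in sync.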
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ python src/llm_infer.py \
--eval_human false \
--dataset dureader-robust-zh \
--max_length 2048 \
--use_flash_attn true \
--use_flash_attn false \
--max_new_tokens 1024 \
--temperature 0.9 \
--top_k 20 \
Expand Down
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
# Experimental environment: 2 * A10
# 2 * 19GB GPU memory (without flash_attn)
nproc_per_node=2

PYTHONPATH=../../.. \
Expand All @@ -20,7 +22,7 @@ torchrun \
--lora_alpha 32 \
--lora_dropout_p 0. \
--lora_target_modules c_attn c_proj \
--gradient_checkpointing false \
--gradient_checkpointing true \
--batch_size 1 \
--weight_decay 0. \
--learning_rate 1e-4 \
Expand All @@ -31,8 +33,9 @@ torchrun \
--save_steps 100 \
--save_total_limit 2 \
--logging_steps 10 \
--use_flash_attn true \
--use_flash_attn false \
--push_to_hub false \
--hub_model_id qwen-7b-lora \
--hub_private_repo true \
--hub_token 'your-sdk-token' \
--deepspeed_config_path 'ds_config/zero2.json' \
4 changes: 2 additions & 2 deletions examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp_ds/sft.sh
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# Experimental environment: 2 * A10
# 2 * 18GB GPU memory
# 2 * 18GB GPU memory (without flash_attn)
nproc_per_node=2

PYTHONPATH=../../.. \
Expand Down Expand Up @@ -38,4 +38,4 @@ torchrun \
--hub_model_id qwen-7b-chat-lora \
--hub_private_repo true \
--hub_token 'your-sdk-token' \
--deepspeed_config_path ds_config/zero2.json \
--deepspeed_config_path 'ds_config/zero2.json' \
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ python src/llm_infer.py \
--model_type qwen-7b-chat \
--sft_type lora \
--template_type chatml \
--dtype fp16 \
--dtype bf16 \
--ckpt_dir "output/qwen-7b-chat/vx_xxx/checkpoint-xxx" \
--eval_human false \
--dataset advertise-gen \
Expand Down
6 changes: 3 additions & 3 deletions examples/pytorch/llm/scripts/qwen_7b_chat/lora_mp_ddp/sft.sh
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# Experimental environment: 4 * V100
# 4 * 15GB GPU memory
# Experimental environment: 4 * 3090
# 4 * 15GB GPU memory (without flash_attn)
nproc_per_node=2

PYTHONPATH=../../.. \
Expand All @@ -11,7 +11,7 @@ torchrun \
--model_type qwen-7b-chat \
--sft_type lora \
--template_type chatml \
--dtype fp16 \
--dtype bf16 \
--output_dir output \
--ddp_backend nccl \
--dataset advertise-gen \
Expand Down
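The switch from `--dtype fp16` on V100 to `--dtype bf16` on 3090 follows the README tip that older GPUs such as V100 do not support bf16. One way to choose the flag automatically is to look at the CUDA compute capability: V100 reports 7.0, while A100 (8.0) and A10 and 3090 (both 8.6) are Ampere parts with bf16 support. The helper `pick_dtype` below is hypothetical and not part of the repository, and the `compute_cap` query field assumes a reasonably recent NVIDIA driver.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): map a CUDA compute
# capability such as "7.0" or "8.6" to a value for --dtype.
pick_dtype() {
  local major="${1%%.*}"          # keep the part before the first dot
  case "$major" in
    ''|*[!0-9]*) echo fp16 ;;     # unknown input: fall back to fp16
    *)
      if [ "$major" -ge 8 ]; then
        echo bf16                 # Ampere or newer
      else
        echo fp16                 # e.g. V100 (7.0)
      fi
      ;;
  esac
}

# Query the first visible GPU when nvidia-smi is available (assumes the
# driver supports the compute_cap field).
if command -v nvidia-smi >/dev/null 2>&1; then
  cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)"
  echo "compute capability '${cap}' -> --dtype $(pick_dtype "$cap")"
fi
```

A script could then pass `--dtype "$(pick_dtype "$cap")"` instead of hard-coding the value. Falling back to fp16 for unrecognized input is a conservative choice, since fp16 works on every GPU listed in the README.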
2 changes: 1 addition & 1 deletion examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp/sft.sh
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Experimental environment: 2 * 3090
# Experimental environment: 2 * A10
# 2 * 14GB GPU memory
nproc_per_node=2

Expand Down