
LoRA training of deepseek 6.7B base on 24 GB VRAM: OOM partway through training #3310

Closed
1 task done
webdxq opened this issue Apr 17, 2024 · 10 comments
Labels
solved This problem has been already solved

Comments

@webdxq

webdxq commented Apr 17, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

train.sh

MODEL="/tmp/pretrain_model"

accelerate launch \
    --config_file ac_config.yaml \
    src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path $MODEL \
    --dataset evol_instruct_code_80k \
    --dataset_dir data \
    --template deepseekcoder \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir /tmp/finetuned_model \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 8192 \
    --save_on_each_node \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 10 \
    --eval_steps 10 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 5000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16 \
    --load_best_model_at_end False \
    --report_to wandb

ac_config.yaml

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: 192.168.0.1
main_process_port: 29555
main_training_function: main
mixed_precision: fp16
num_machines: 2 # the number of nodes
num_processes: 16 # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Expected behavior

No response

System Info

No response

Others

No response

@tanghui315

Isn't cutoff_len set too high?

@webdxq
Author

webdxq commented Apr 17, 2024 via email

@codemayq codemayq added the pending This problem is yet to be addressed label Apr 17, 2024
@codemayq
Collaborator

It looks like cutoff_len is set too high; VRAM usage grows quadratically with sequence length.
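
As a rough back-of-envelope check (a sketch only, assuming fp16 activations and 32 attention heads as in deepseek-coder-6.7b), the score matrix that vanilla attention materializes for a single layer already scales with the square of cutoff_len:

# Rough estimate of the fp16 attention-score matrix that vanilla attention
# materializes for ONE layer: batch * heads * seq_len^2 * 2 bytes.
# Assumption: 32 attention heads (deepseek-coder-6.7b-like), fp16.
def attn_scores_gib(batch_size: int, seq_len: int, num_heads: int = 32) -> float:
    return batch_size * num_heads * seq_len ** 2 * 2 / 2 ** 30

for cutoff_len in (1024, 2048, 4096, 8192):
    print(f"cutoff_len={cutoff_len}: ~{attn_scores_gib(2, cutoff_len):.2f} GiB per layer")

At cutoff_len 8192 with per-device batch size 2 this is about 8 GiB for one layer's scores alone, which is why the quadratic term dominates a 24 GB card; at 1024 it drops to roughly 0.13 GiB. FlashAttention avoids materializing this matrix entirely.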

@webdxq
Author

webdxq commented Apr 17, 2024

But if the VRAM were simply insufficient, my understanding is that the OOM would appear right when the job starts. Why does it only show up after a few training steps?

@hiyouga
Owner

hiyouga commented Apr 17, 2024

Use FlashAttention.
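
In practice this means computing attention with the FlashAttention-2 kernel instead of materializing the full score matrix. LLaMA-Factory wires this through its own launch options (the exact flag depends on the version, so check the arguments of the release you use); at the transformers level the mechanism looks roughly like the sketch below, assuming the flash-attn package is installed and the GPU is Ampere or newer.

# Minimal sketch of what enabling FlashAttention means at the transformers level.
# Assumptions: flash-attn is installed and the GPU supports it.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-6.7b-base",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # avoids the seq_len^2 score matrix
)

With FlashAttention the attention activations grow roughly linearly in sequence length, which removes the quadratic term estimated above.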

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Apr 17, 2024
@hiyouga hiyouga closed this as completed Apr 17, 2024
@webdxq
Author

webdxq commented Apr 17, 2024

@hiyouga Hello, why doesn't the same batch size OOM when running single-node multi-GPU?

@webdxq
Author

webdxq commented Apr 18, 2024

@codemayq Hi, I reduced cutoff_len to 1024 and set the batch size to 2, and it still OOMs.

@couldn

couldn commented Apr 24, 2024

Probably caused by VRAM fluctuation during training; the exact cause needs further investigation.
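
One way to test the fluctuation hypothesis is to log peak CUDA memory per step: if the spikes line up with particular steps, the OOM is triggered by whichever micro-batches happen to contain the longest samples (and by the first optimizer step, when optimizer states are allocated), not by a constant footprint. A minimal sketch using a transformers TrainerCallback; the class name PeakMemoryCallback is hypothetical, and you would need to hook it into the training workflow yourself.

# Sketch of a callback that logs peak CUDA memory per step, to check whether
# OOM correlates with specific (long) batches rather than a constant footprint.
import torch
from transformers import TrainerCallback

class PeakMemoryCallback(TrainerCallback):
    def on_step_end(self, args, state, control, **kwargs):
        if torch.cuda.is_available():
            peak_gib = torch.cuda.max_memory_allocated() / 2 ** 30
            print(f"step {state.global_step}: peak allocated {peak_gib:.2f} GiB")
            torch.cuda.reset_peak_memory_stats()  # start fresh for the next step

If you drive the Trainer directly, it can be registered with trainer.add_callback(PeakMemoryCallback()).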

@xd2333
Contributor

xd2333 commented May 16, 2024

I found a trick: try changing the seed. After changing it a few times it may stop OOMing.
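
A plausible explanation for why re-rolling the seed can help (my reading, not stated in the thread): the seed controls dataset shuffling, so a different seed changes which long samples land in the same micro-batch and therefore whether any single step exceeds the 24 GB budget. A toy illustration of the order changing with the seed:

# Toy illustration: the shuffle order, and hence batch composition,
# changes with the seed, which moves the worst-case memory peak around.
import torch

for seed in (42, 43):
    g = torch.Generator().manual_seed(seed)
    print(seed, torch.randperm(8, generator=g).tolist())

This only moves the worst-case batch around, though; lowering cutoff_len or enabling FlashAttention addresses the peak itself.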

@webdxq
Author

webdxq commented May 16, 2024 via email
