
LoRA training of deepseek 6.7B base on 24 GB VRAM: OOM partway through training #3310

Closed
1 task done
webdxq opened this issue Apr 17, 2024 · 10 comments
Labels
solved This problem has been already solved

Comments

@webdxq

webdxq commented Apr 17, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

train.sh

MODEL="/tmp/pretrain_model"

accelerate launch \
    --config_file ac_config.yaml \
    src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path $MODEL \
    --dataset evol_instruct_code_80k \
    --dataset_dir data \
    --template deepseekcoder \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir /tmp/finetuned_model \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 8192 \
    --save_on_each_node \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 10 \
    --eval_steps 10 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 5000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16 \
    --load_best_model_at_end False \
    --report_to wandb

ac_config.yaml

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: 192.168.0.1
main_process_port: 29555
main_training_function: main
mixed_precision: fp16
num_machines: 2 # the number of nodes
num_processes: 16 # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Expected behavior

No response

System Info

No response

Others

No response

@tanghui315

Isn't cutoff_len set too high?

@webdxq
Author

webdxq commented Apr 17, 2024 via email

@codemayq codemayq added the pending This problem is yet to be addressed label Apr 17, 2024
@codemayq
Collaborator

It looks like cutoff_len is set too high; VRAM usage grows quadratically with sequence length.
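
As a rough back-of-envelope check (a sketch only, assuming fp16 activations and 32 attention heads as in deepseek-coder-6.7b), the score matrix that vanilla attention materializes for a single layer already scales with the square of cutoff_len:

# Rough estimate of the fp16 attention-score matrix that vanilla attention
# materializes for ONE layer: batch * heads * seq_len^2 * 2 bytes.
# Assumption: 32 attention heads (deepseek-coder-6.7b-like), fp16.
def attn_scores_gib(batch_size: int, seq_len: int, num_heads: int = 32) -> float:
    return batch_size * num_heads * seq_len ** 2 * 2 / 2 ** 30

for cutoff_len in (1024, 2048, 4096, 8192):
    print(f"cutoff_len={cutoff_len}: ~{attn_scores_gib(2, cutoff_len):.2f} GiB per layer")

At cutoff_len 8192 with per-device batch size 2 this is about 8 GiB for one layer's scores alone, which is why the quadratic term dominates a 24 GB card; at 1024 it drops to roughly 0.13 GiB. FlashAttention avoids materializing this matrix entirely.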

@webdxq
Author

webdxq commented Apr 17, 2024

But if the VRAM were simply insufficient, my understanding is that the OOM would appear right when the job starts. Why does it only show up after a few training steps?

@hiyouga
Owner

hiyouga commented Apr 17, 2024

Use FlashAttention.
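
In practice this means computing attention with the FlashAttention-2 kernel instead of materializing the full score matrix. LLaMA-Factory wires this through its own launch options (the exact flag depends on the version, so check the arguments of the release you use); at the transformers level the mechanism looks roughly like the sketch below, assuming the flash-attn package is installed and the GPU is Ampere or newer.

# Minimal sketch of what enabling FlashAttention means at the transformers level.
# Assumptions: flash-attn is installed and the GPU supports it.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-6.7b-base",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # avoids the seq_len^2 score matrix
)

With FlashAttention the attention activations grow roughly linearly in sequence length, which removes the quadratic term estimated above.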

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Apr 17, 2024
@hiyouga hiyouga closed this as completed Apr 17, 2024
@webdxq
Author

webdxq commented Apr 17, 2024

@hiyouga Hello, why doesn't the same batch size OOM when running single-node multi-GPU?

@webdxq
Author

webdxq commented Apr 18, 2024

@codemayq Hi, I reduced cutoff_len to 1024 and set the batch size to 2, and it still OOMs.

@couldn

couldn commented Apr 24, 2024

Probably caused by VRAM fluctuation during training; the exact cause needs further investigation.
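
One way to test the fluctuation hypothesis is to log peak CUDA memory per step: if the spikes line up with particular steps, the OOM is triggered by whichever micro-batches happen to contain the longest samples (and by the first optimizer step, when optimizer states are allocated), not by a constant footprint. A minimal sketch using a transformers TrainerCallback; the class name PeakMemoryCallback is hypothetical, and you would need to hook it into the training workflow yourself.

# Sketch of a callback that logs peak CUDA memory per step, to check whether
# OOM correlates with specific (long) batches rather than a constant footprint.
import torch
from transformers import TrainerCallback

class PeakMemoryCallback(TrainerCallback):
    def on_step_end(self, args, state, control, **kwargs):
        if torch.cuda.is_available():
            peak_gib = torch.cuda.max_memory_allocated() / 2 ** 30
            print(f"step {state.global_step}: peak allocated {peak_gib:.2f} GiB")
            torch.cuda.reset_peak_memory_stats()  # start fresh for the next step

If you drive the Trainer directly, it can be registered with trainer.add_callback(PeakMemoryCallback()).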

@xd2333
Contributor

xd2333 commented May 16, 2024

I found a trick: try changing the seed. After changing it a few times it may stop OOMing.
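
A plausible explanation for why re-rolling the seed can help (my reading, not stated in the thread): the seed controls dataset shuffling, so a different seed changes which long samples land in the same micro-batch and therefore whether any single step exceeds the 24 GB budget. A toy illustration of the order changing with the seed:

# Toy illustration: the shuffle order, and hence batch composition,
# changes with the seed, which moves the worst-case memory peak around.
import torch

for seed in (42, 43):
    g = torch.Generator().manual_seed(seed)
    print(seed, torch.randperm(8, generator=g).tolist())

This only moves the worst-case batch around, though; lowering cutoff_len or enabling FlashAttention addresses the peak itself.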

@webdxq
Author

webdxq commented May 16, 2024 via email
