LoRA training of DeepSeek 6.7B base with 24 GB VRAM OOMs partway through training #3310
Labels
solved
This problem has already been solved
Comments
cutoff_len is probably set too large.
It does look like cutoff_len is set too large; memory usage grows roughly quadratically with sequence length.

But if VRAM were simply insufficient, I would expect an OOM right when the job starts. Why does it OOM only after a few training steps?
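To illustrate the quadratic growth mentioned above, here is a rough back-of-the-envelope sketch of the attention-score activation memory for standard (non-flash) attention: one L × L matrix per head per layer in fp16. The layer and head counts below are assumptions for a deepseek-coder-6.7b-like model and may not match the actual architecture exactly; the point is only the L² scaling.

```python
# Rough estimate of the L x L attention-score activations alone, in bytes,
# for standard (non-flash) attention. Model shape numbers (32 layers,
# 32 heads) are assumptions for a ~6.7B model, not exact values.

def attn_score_bytes(seq_len, n_layers=32, n_heads=32, batch=1, bytes_per_el=2):
    """Memory of the L x L attention matrices alone, in bytes (fp16)."""
    return batch * n_layers * n_heads * seq_len * seq_len * bytes_per_el

GIB = 1024 ** 3
print(f"cutoff_len=8192: {attn_score_bytes(8192) / GIB:.1f} GiB")  # → 128.0 GiB
print(f"cutoff_len=1024: {attn_score_bytes(1024) / GIB:.2f} GiB")  # → 2.00 GiB
```

Going from cutoff_len 8192 to 1024 shrinks this term by 64x, which is why lowering cutoff_len (or using an attention implementation that never materializes the L × L matrix) is the usual first fix.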
Use FlashAttention.
@hiyouga Hi, why doesn't the same batch size OOM when running single-node multi-GPU?
@codemayq Hi, I changed cutoff_len to 1024 and the batch size to 2, and it still OOMs.
Probably caused by VRAM fluctuation during training; the exact cause still needs further investigation.

I found a trick: try changing the seed. After changing it a few times, it may stop running out of memory.
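One plausible explanation for both the mid-training OOM and the seed trick (this is an assumption, not something confirmed in the thread): if batches are padded to the longest sample they contain, peak memory depends on *which* long samples land together, and the shuffle order is determined by the seed. A long batch can then appear only after several successful steps. A minimal sketch with a hypothetical length distribution:

```python
import random

def first_long_batch_step(lengths, batch_size, threshold, seed):
    """1-based step index of the first batch whose longest sample exceeds
    `threshold`, for a given shuffle seed; None if no batch does."""
    rng = random.Random(seed)
    order = lengths[:]
    rng.shuffle(order)
    for step, i in enumerate(range(0, len(order), batch_size), start=1):
        if max(order[i:i + batch_size]) > threshold:
            return step
    return None

# Hypothetical corpus: mostly short samples, a few near the 8192 cutoff.
lengths = [512] * 990 + [8000] * 10
for seed in (1, 2, 3):
    print(seed, first_long_batch_step(lengths, batch_size=2, threshold=4096, seed=seed))
```

Under this model, reseeding only moves the dangerous batch around rather than removing it, so it may merely delay the OOM; sorting or filtering by length would be the more reliable fix.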
Reminder
Reproduction
train.sh

```shell
MODEL="/tmp/pretrain_model"
accelerate launch \
    --config_file ac_config.yaml \
    src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path $MODEL \
    --dataset evol_instruct_code_80k \
    --dataset_dir data \
    --template deepseekcoder \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir /tmp/finetuned_model \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 8192 \
    --save_on_each_node \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 10 \
    --eval_steps 10 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 5000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16 \
    --load_best_model_at_end False \
    --report_to wandb
```

(Note: the original command passed `--load_best_model_at_end` twice, once bare and once as `False`; the duplicate has been removed, keeping the later `False` value.)
ac_config.yaml

```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: 192.168.0.1
main_process_port: 29555
main_training_function: main
mixed_precision: fp16
num_machines: 2  # the number of nodes
num_processes: 16  # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
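A side note on the config above: it sets `zero3_init_flag: true` while `zero_stage: 1`, which is inconsistent (the ZeRO-3 init flag only applies at stage 3). With ZeRO stage 1, optimizer states are sharded but no parameter or offload savings apply. One common memory-saving variant, sketched here as an assumption rather than a verified fix for this issue, is ZeRO-2 with the optimizer offloaded to CPU:

```yaml
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: cpu   # move Adam states to host RAM
  offload_param_device: none
  zero3_init_flag: false          # only meaningful at stage 3
  zero3_save_16bit_model: true
  zero_stage: 2                   # shard gradients as well as optimizer states
```

This trades GPU memory for host RAM and PCIe traffic, so it slows training; whether it resolves this particular OOM would need to be tested.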
Expected behavior
No response
System Info
No response
Others
No response