
LLaMA3 8B LoRA training on 8× A800 GPUs always OOMs after running for a while #3631

Closed
1 task done
807660937 opened this issue May 8, 2024 · 3 comments
Labels
solved This problem has been already solved

Comments


807660937 commented May 8, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

USE_MODELSCOPE_HUB=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.run \
        --nproc_per_node 8 \
        --nnodes 1 \
        --standalone \
        src/train.py examples/water/0508_wa_llama3_8b_lora_sft.yaml 
# model
model_name_or_path: LLM-Research/Meta-Llama-3-8B-Instruct

# method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj

# ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json

# dataset
dataset: identity_water,alpaca_gpt4_en,alpaca_gpt4_zh,lima,glaive_toolcall,oaast_sft_zh,ruozhiba,identity_water
template: llama3
cutoff_len: 8192
max_samples: 
val_size: 0.01
overwrite_cache: true
preprocessing_num_workers: 32

# output
output_dir: saves/LLM-Research/Meta-Llama-3-8B-Instruct/lora/sft_wa_0508
logging_steps: 4
save_steps: 200
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 6
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0.1
bf16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 100

This reproduces almost consistently across two runs; it looks like GPU memory usage keeps increasing?
15%|█▍ | 129/882 [40:17<3:41:05, 17.62s/it]Traceback (most recent call last):
8%|▊ | 55/663 [24:56<5:05:23, 30.14s/it]Traceback (most recent call last):

OOM reliably occurs after training for a while.

Expected behavior

No response

System Info

No response

Others

No response

hiyouga (Owner) commented May 8, 2024

Lower the batch size; GPU memory usage fluctuates because the sequence lengths differ.
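
A minimal sketch of this kind of adjustment, assuming the goal is to keep the effective batch size at 384 while lowering per-GPU peak memory (the values below are illustrative, not a tested recommendation):

# train (same effective batch size: 2 x 24 x 8 GPUs = 384)
per_device_train_batch_size: 2
gradient_accumulation_steps: 24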

hiyouga added the pending (This problem is yet to be addressed) label May 8, 2024
@807660937 (Author)

OK, I'll give that a try. So cutoff_len only truncates sequences that are too long, meaning no padding is applied, right?

hiyouga (Owner) commented May 8, 2024

Padding would significantly slow down training.

hiyouga added the solved (This problem has been already solved) label and removed the pending label May 8, 2024
hiyouga closed this as completed May 8, 2024