
LLaMA3 8B LoRA training on 8× A800 GPUs always OOMs after running for a while #3631

Closed
1 task done
807660937 opened this issue May 8, 2024 · 3 comments
Labels
solved This problem has been already solved

Comments


807660937 commented May 8, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

USE_MODELSCOPE_HUB=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.run \
        --nproc_per_node 8 \
        --nnodes 1 \
        --standalone \
        src/train.py examples/water/0508_wa_llama3_8b_lora_sft.yaml 
# model
model_name_or_path: LLM-Research/Meta-Llama-3-8B-Instruct

# method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj

# ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json

# dataset
dataset: identity_water,alpaca_gpt4_en,alpaca_gpt4_zh,lima,glaive_toolcall,oaast_sft_zh,ruozhiba,identity_water
template: llama3
cutoff_len: 8192
max_samples: 
val_size: 0.01
overwrite_cache: true
preprocessing_num_workers: 32

# output
output_dir: saves/LLM-Research/Meta-Llama-3-8B-Instruct/lora/sft_wa_0508
logging_steps: 4
save_steps: 200
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 6
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0.1
bf16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 100

This reproduces almost consistently across two runs; it looks like GPU memory usage keeps increasing?
15%|█▍ | 129/882 [40:17<3:41:05, 17.62s/it]Traceback (most recent call last):
8%|▊ | 55/663 [24:56<5:05:23, 30.14s/it]Traceback (most recent call last):

OOM reliably occurs after training for a while.

Expected behavior

No response

System Info

No response

Others

No response

hiyouga (Owner) commented May 8, 2024

Lower the batch size; GPU memory usage fluctuates because the sequence lengths differ.
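
A minimal sketch of this kind of adjustment, assuming the goal is to keep the effective batch size at 384 while lowering per-GPU peak memory (the values below are illustrative, not a tested recommendation):

# train (same effective batch size: 2 x 24 x 8 GPUs = 384)
per_device_train_batch_size: 2
gradient_accumulation_steps: 24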

hiyouga added the pending (This problem is yet to be addressed) label May 8, 2024
@807660937 (Author)

OK, I'll give that a try. So cutoff_len only truncates sequences that are too long, meaning no padding is applied, right?

hiyouga (Owner) commented May 8, 2024

Padding would significantly slow down training.

hiyouga added the solved (This problem has been already solved) label and removed the pending label May 8, 2024
hiyouga closed this as completed May 8, 2024