OOM when fine-tuning 13B models such as chinese-llama-2-13b-hf and Baichuan2-13B-Chat on an A100-40G #2908

Closed
1 task done
liboaccn opened this issue Mar 20, 2024 · 5 comments
Labels
solved This problem has been already solved

Comments

@liboaccn

Reminder

  • I have read the README and searched the existing issues.

Reproduction

CUDA_VISIBLE_DEVICES=$GPU_NO python /home/users/xxx/code/LLaMA-Factory/src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path $MODEL_PATH \
    --dataset $DATASET \
    --dataset_dir $DATASET_DIR \
    --template $TEMP \
    --finetuning_type lora \
    --lora_target $TARGET \
    --output_dir $SFT_CHECKPOINT \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16

Expected behavior

I have tried several 13B models, and every one of them fails with an OOM error during fine-tuning.

Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 107.94 MiB is free. Including non-PyTorch memory, this process has 39.28 GiB memory in use. Of the allocated memory 37.86 GiB is allocated by PyTorch, and 93.47 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
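
The figures in the error message can be set next to a rough estimate of the weight footprint. The sketch below is back-of-the-envelope arithmetic only. It assumes roughly 13e9 parameters held at 2 bytes each (fp16), which is an assumption and not something the log reports, and `weights_gib` is a hypothetical helper name.

```python
GIB = 1024 ** 3

def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed just to hold the model weights, in GiB."""
    return num_params * bytes_per_param / GIB

# Values copied from the CUDA OOM message above.
total_gib = 39.39
allocated_gib = 37.86
reserved_unallocated_mib = 93.47

# Assumption: about 13e9 parameters stored in fp16 (2 bytes each).
weights = weights_gib(13e9)
remaining = total_gib - weights

print(f"fp16 weights (assumed 13e9 params): {weights:.2f} GiB")
print(f"left for activations, LoRA gradients, optimizer state: {remaining:.2f} GiB")
print(f"reserved but unallocated: {reserved_unallocated_mib / 1024:.2f} GiB")
```

Under these assumptions the weights alone take about 24 GiB, so everything else has to fit in roughly 15 GiB. The reserved-but-unallocated figure is under 0.1 GiB, so the fragmentation hint in the message (`expandable_segments:True`) probably would not help much on its own here.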

System Info

accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=4,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/home/users/xxx/code/csaft/sft/../output/Baichuan2-13B-Chat/lawyer_llama_sh/,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=/home/users/xxx/code/csaft/sft/../output/Baichuan2-13B-Chat/lawyer_llama_sh,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=4,
predict_with_generate=False,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=/home/users/xxx/code/csaft/sft/../output/Baichuan2-13B-Chat/lawyer_llama_sh,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=1000,
save_strategy=steps,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
sortish_sampler=False,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)

Others

No response

@hiyouga
Owner

hiyouga commented Mar 20, 2024

--per_device_train_batch_size 1
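
Lowering the per-device batch size shrinks the activations held per step, but it also changes the effective batch size unless gradient accumulation is raised to compensate. The sketch below works that out for the command in the report (4 x 4). It assumes a single GPU, which the report does not state explicitly, and both helper names are hypothetical.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer update."""
    return per_device * grad_accum * num_devices

def grad_accum_for(target: int, per_device: int, num_devices: int = 1) -> int:
    """Accumulation steps needed to reach `target` samples per update."""
    samples_per_step = per_device * num_devices
    if target % samples_per_step != 0:
        raise ValueError("target is not a multiple of per_device * num_devices")
    return target // samples_per_step

# Settings from the original command: batch size 4, accumulation 4.
original = effective_batch_size(per_device=4, grad_accum=4)

# With --per_device_train_batch_size 1, how many accumulation steps keep it equal?
needed = grad_accum_for(target=original, per_device=1)

print(f"original effective batch size: {original}")
print(f"gradient_accumulation_steps needed at batch size 1: {needed}")
```

If the effective batch size should stay the same, the result is the value to pass as `--gradient_accumulation_steps` alongside `--per_device_train_batch_size 1`.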

@hiyouga hiyouga added the solved This problem has been already solved label Mar 20, 2024
@hiyouga hiyouga closed this as completed Mar 20, 2024
@liboaccn
Author

liboaccn commented Mar 20, 2024

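Additional information: an earlier version of [LLaMA-Factory] fine-tuned 13B models without any problem. The error appeared after I upgraded to the latest version and upgraded the dependencies:
torch 2.2.0
torchvision 0.17.1
accelerate 0.28.0
peft 0.9.0
transformers 4.38.0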
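
To compare an environment against a known-good set of versions, the installed versions can be read programmatically instead of copied by hand. This is a minimal sketch using the standard-library `importlib.metadata`. The package list mirrors the one above, `installed_versions` is a hypothetical helper, and the versions recommended by the project README are not reproduced here.

```python
from importlib import metadata

# Packages named in the report above.
PACKAGES = ["torch", "torchvision", "accelerate", "peft", "transformers"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name} {version if version is not None else 'not installed'}")
```

Running this in both the working and the failing environment gives two lists that can be compared line by line.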

@hiyouga
Owner

hiyouga commented Mar 20, 2024

This does not look like a version problem.

@liboaccn
Author

--per_device_train_batch_size 1

I have already tried this and the error persists. Switching models does not help either: Qwen, LLaMA, and Baichuan at 13B/14B all fail.

@hiyouga
Owner

hiyouga commented Mar 20, 2024

Try the versions recommended in the README.


2 participants