```shell
USE_MODELSCOPE_HUB=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.run \
    --nproc_per_node 8 \
    --nnodes 1 \
    --standalone \
    src/train.py examples/water/0508_wa_llama3_8b_lora_sft.yaml
```
```yaml
# model
model_name_or_path: LLM-Research/Meta-Llama-3-8B-Instruct

# method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj

# ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json

# dataset
dataset: identity_water,alpaca_gpt4_en,alpaca_gpt4_zh,lima,glaive_toolcall,oaast_sft_zh,ruozhiba,identity_water
template: llama3
cutoff_len: 8192
max_samples:
val_size: 0.01
overwrite_cache: true
preprocessing_num_workers: 32

# output
output_dir: saves/LLM-Research/Meta-Llama-3-8B-Instruct/lora/sft_wa_0508
logging_steps: 4
save_steps: 200
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 6
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0.1
bf16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 100
```
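For context on the `cutoff_len: 8192` setting above, here is a minimal sketch of cutoff-style truncation (assumed behavior for illustration, not the actual LLaMA-Factory preprocessing code): sequences longer than the cutoff are clipped, and shorter ones are left unpadded.

```python
# Hypothetical sketch of cutoff_len-style truncation (illustration only,
# not the real LLaMA-Factory implementation): clip overlong token
# sequences, leave shorter ones untouched and unpadded.
def truncate_to_cutoff(token_ids, cutoff_len=8192):
    return token_ids[:cutoff_len]

short_seq = list(range(100))
long_seq = list(range(10_000))
assert len(truncate_to_cutoff(short_seq)) == 100   # left as-is
assert len(truncate_to_cutoff(long_seq)) == 8192   # truncated
```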
The OOM reproduces almost deterministically across two runs, and GPU memory usage appears to keep increasing.

```
 15%|█▍ | 129/882 [40:17<3:41:05, 17.62s/it]Traceback (most recent call last):
  8%|▊ | 55/663 [24:56<5:05:23, 30.14s/it]Traceback (most recent call last):
```
Training reliably hits OOM after running for a while.
Lower the batch size. Because sequence lengths differ between batches, memory usage will fluctuate.
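To illustrate the maintainer's point with hypothetical numbers (not measurements from this run): when sequences are not padded to a fixed length, the token count per micro-batch depends on which sequence lengths happen to be sampled, so peak activation memory varies from step to step and can occasionally spike.

```python
# Toy illustration (assumed numbers): tokens per micro-batch vary with the
# sampled sequence lengths, so peak memory fluctuates between steps.
import random

random.seed(0)

def batch_tokens(batch_size=6, cutoff_len=8192):
    # Sample one micro-batch of sequence lengths and sum the tokens.
    return sum(random.randint(32, cutoff_len) for _ in range(batch_size))

counts = [batch_tokens() for _ in range(1000)]
print(min(counts), max(counts))  # wide spread -> occasional memory spikes
```

A step whose batch happens to contain several near-cutoff sequences needs far more activation memory than a typical step, which is why an OOM can appear only after many successful iterations.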
OK, I'll try that. So `cutoff_len` only truncates sequences that exceed the limit? And no padding is applied, right?
Padding would significantly slow down training.
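A toy calculation with assumed sample lengths shows why: padding every sequence in a micro-batch up to the longest one spends compute on pad tokens that carry no information.

```python
# Toy calculation (hypothetical sequence lengths): fraction of compute
# wasted when a micro-batch is padded to its longest sequence.
lengths = [120, 350, 800, 95, 4000, 60]      # assumed sample lengths
padded_tokens = max(lengths) * len(lengths)  # tokens processed if padded
real_tokens = sum(lengths)                   # tokens that carry content
waste = 1 - real_tokens / padded_tokens
print(f"compute wasted on padding: {waste:.0%}")  # → 77%
```

The skew is worst when one long sequence shares a batch with many short ones, which is common in mixed instruction datasets like the one configured above.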
Reminder
Reproduction
[image attachment: screenshot of the error; expired signed URL removed]
Expected behavior
No response
System Info
No response
Others
No response