FT-Data Ranker-1b OOM finetuning on single GPU #39
Comments
Suggest changing the DeepSpeed config file to

@zhijianma Thanks for the reply. I switched the DeepSpeed config file as suggested; the log: ORI NUMBER: 23237, AFTER FILETER: 22564, DROP NUMBER: 673
Total 22564 samples [ 6.48M tokens] in training!
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/ubuntu/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.2267723083496094 seconds
Parameter Offload: Total persistent parameters: 643072 in 194 params
[2023-10-20 12:49:31,065] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 179742
[2023-10-20 12:49:31,067] [ERROR] [launch.py:321:sigkill_handler] ['/home/ubuntu/Softwares/anaconda3/envs/dj_comp/bin/python', '-u', 'train.py', '--local_rank=0', '--model_name_or_path', '../data/models/falcon-rw-1b', '--tokenizer', '../data/models/falcon-rw-1b', '--data_path', '../data/1b_data/v1/train_data.jsonl', '--output_dir', '../data/finetune/v1', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '32', '--lang', 'en', '--bf16', 'True', '--gradient_checkpointing_enable', 'True', '--num_train_epochs', '3', '--model_max_length', '1024', '--learning_rate', '2.5e-5', '--weight_decay', '0', '--warmup_ratio', '0.03', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--save_steps', '-1', '--save_total_limit', '999', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--deepspeed', '/home/ubuntu/Projects/ft_data_ranker_1b/competition_kit/lm-training/train_scripts/deepspeed_configs/ds_config_stage3_offload-opt_offload-para.json'] exits with return code = -9
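Return code -9 means the process was killed by the OS, typically the Linux OOM killer when host RAM runs out: with ZeRO-3 offloading both optimizer states and parameters to CPU, the fp32 master weights, Adam momentum, Adam variance, and fp32 gradients all live in host memory, roughly 16 bytes per parameter. A back-of-envelope sketch, assuming a ~1.3B-parameter model like falcon-rw-1b (both the parameter count and the bytes-per-parameter figures are approximations, not measured values):

```python
# Rough memory estimate for ZeRO-3 with full CPU offload.
# Assumptions (illustrative only):
#   - ~1.3e9 parameters (falcon-rw-1b)
#   - host side: fp32 weights (4 B) + Adam momentum (4 B)
#     + Adam variance (4 B) + fp32 gradients (4 B) = 16 B/param
#   - GPU side: bf16 working copy (2 B/param) + activations on top
params = 1.3e9

cpu_bytes_per_param = 4 + 4 + 4 + 4   # weights + momentum + variance + grads
gpu_bytes_per_param = 2               # bf16 copy gathered for compute

cpu_gb = params * cpu_bytes_per_param / 2**30
gpu_gb = params * gpu_bytes_per_param / 2**30

print(f"host RAM for offloaded states: ~{cpu_gb:.1f} GiB")
print(f"GPU for bf16 params (before activations): ~{gpu_gb:.1f} GiB")
```

That is roughly 19 GiB of host RAM just for the offloaded states; on a machine or Docker container with a tighter memory limit, the kernel kills the run with signal 9 regardless of how much GPU memory is free, which would match the log above.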
I switched the environment from my local machine to a Docker instance and tried offloading to both CPU and disk; both failed :(
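One knob that sometimes helps on a 24 GB card, independent of where the offload goes, is shrinking ZeRO-3's working buffers so less memory is held live at once. Below is a minimal sketch of such a config; the key names follow DeepSpeed's JSON config schema, but every value here is an illustrative starting point (a guess to lower peak memory), not a tuned setting for this task:

```python
import json

# Illustrative ZeRO-3 config with CPU offload and shrunken buffers.
# Values are assumptions chosen to reduce peak memory, not benchmarks.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        # Smaller buckets trade throughput for lower peak memory.
        "sub_group_size": 1e8,
        "reduce_bucket_size": 1e7,
        "stage3_prefetch_bucket_size": 1e7,
        "stage3_max_live_parameters": 1e8,
        "stage3_max_reuse_distance": 1e8,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# Write it out so it can be passed via --deepspeed (hypothetical filename).
with open("ds_config_small_buffers.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

Note that shrinking these buffers does not help if the -9 kill comes from host RAM exhaustion during offload; in that case the container's memory limit is the thing to raise.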
This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this issue will be closed in 3 days.
Close this stale issue.
Before Asking
I have read the README carefully.
I have pulled the latest code of the main branch to run again and the problem still existed.
Search before asking
Question
Using the code provided by the competition, I fine-tuned falcon-rw-1b with DeepSpeed on a single GPU (3090, 24 GB). I tried adjusting the training parameters and the DeepSpeed config, but every run hits OOM.
Additional
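Before retrying CPU offload, it is worth checking how much host RAM the environment (especially a Docker container) actually has, since offloaded optimizer states land there. A small helper that reads Linux's /proc/meminfo (the helper name is ours, not part of any toolkit; values in that file are reported in kB):

```python
def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into a dict of {field: value_in_kB} (Linux only)."""
    info = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            value = rest.strip().split()[0]  # "32768000 kB" -> "32768000"
            info[key] = int(value)
    return info

mem = read_meminfo()
# kB -> GiB: kB * 1024 / 2**30 == kB / 2**20
print(f"MemTotal:     {mem['MemTotal'] / 2**20:.1f} GiB")
print(f"MemAvailable: {mem['MemAvailable'] / 2**20:.1f} GiB")
```

If MemAvailable is well below the roughly 19-20 GiB that fully offloaded Adam states for a ~1.3B model would need, a signal-9 kill is expected no matter how the GPU-side settings are tuned.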