FT-Data Ranker-1b OOM finetuning on single GPU #39

Closed
xnuohz opened this issue Oct 19, 2023 · 6 comments
Labels: competition:FT-Data Ranker, question, stale-issue

Comments

@xnuohz

xnuohz commented Oct 19, 2023

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of the main branch and run it again, and the problem still exists.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

Using the code provided by the competition kit, I fine-tuned falcon-rw-1b with DeepSpeed on a single GPU (RTX 3090, 24 GB). I have tried adjusting the training parameters and the DeepSpeed configuration, but every run ends in OOM.

Additional

  • Script
#!/bin/bash

set -e 
export CUDA_DEVICE_MAX_CONNECTIONS=1

if [ -z "$XDG_CACHE_HOME" ]; then
    export XDG_CACHE_HOME=$HOME/.cache
fi

if [[ $# -ne 3 ]]; then
    echo "Three arguments required! " >&2
    exit 2
fi

# Model Path
# e.g /home/model/baichuan2-7b/
model_path=${1} #/path/to/your/model/
tokenizer=${model_path}

# Data Path
# e.g /home/data/train.jsonl
data_path=${2} # /path/to/your/dataset.jsonl

# Output Path
# e.g ${WORK_DIR}/checkpoints/baichuan2-7b/
output_path=${3} #/path/to/your/output/

mkdir -p ${output_path}/

WORK_DIR=$(echo `cd $(dirname $0); pwd | xargs dirname`)
cd ${WORK_DIR}

# Deepspeed
# ds_config_file=${WORK_DIR}/train_scripts/deepspeed_configs/ds_config_stage3.json
ds_config_file=${WORK_DIR}/train_scripts/deepspeed_configs/ds_config_stage3_offload-para.json

# Train Parameter
bs_per_gpu=1
num_nodes=1
# nproc_per_node=`nvidia-smi | grep MiB | wc -l`
nproc_per_node=1
master_port=50000

# grad_acc=`expr 256 / ${bs_per_gpu} / ${num_nodes} / ${nproc_per_node}`
grad_acc=`expr 32 / ${bs_per_gpu} / ${num_nodes} / ${nproc_per_node}`
deepspeed --num_gpus ${nproc_per_node} --num_nodes ${num_nodes} --master_port ${master_port} train.py \
    --model_name_or_path ${model_path} \
    --tokenizer ${tokenizer} \
    --data_path ${data_path} \
    --output_dir ${output_path} \
    --per_device_train_batch_size ${bs_per_gpu} \
    --gradient_accumulation_steps ${grad_acc} \
    --lang en \
    --bf16 True \
    --gradient_checkpointing_enable True \
    --num_train_epochs 3 \
    --model_max_length 1024 \
    --learning_rate 2.5e-5 \
    --weight_decay 0 \
    --warmup_ratio 0.03 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --save_steps -1 \
    --save_total_limit 999 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --deepspeed ${ds_config_file} | tee ${output_path}/training_log.txt
  • Log
ORI NUMBER: 23237, AFTER FILETER: 22564, DROP NUMBER: 673
Total 22564 samples [ 6.48M tokens] in training!
  0%|                                                                                                                                                                                                                                                             | 0/705 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/ubuntu/Projects/ft_data_ranker_1b/competition_kit/lm-training/train.py", line 465, in <module>
    train()
  File "/home/ubuntu/Projects/ft_data_ranker_1b/competition_kit/lm-training/train.py", line 457, in train
    trainer.train()
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/transformers/trainer.py", line 1971, in _inner_training_loop
    self.optimizer.step()
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step
    self.optimizer.step(closure)
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/adamw.py", line 184, in step
    adamw(
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/adamw.py", line 335, in adamw
    func(
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/adamw.py", line 599, in _multi_tensor_adamw
    exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacty of 23.69 GiB of which 36.38 MiB is free. Including non-PyTorch memory, this process has 23.64 GiB memory in use. Of the allocated memory 22.50 GiB is allocated by PyTorch, and 105.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
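
For context, the hint at the end of the traceback refers to PyTorch's caching-allocator options. A minimal sketch of how that environment variable is usually set before launching the script (the 128 MiB value is only an illustration, not a recommendation from this thread):

# Reduce CUDA caching-allocator fragmentation, as suggested by the OOM message; the value is illustrative.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
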
@xnuohz added the question label Oct 19, 2023
@zhijianma
Collaborator

We suggest switching the DeepSpeed config file to ds_config_stage3_offload-opt.json or ds_config_stage3_offload-opt_offload-para.json. Training time differs between configurations, so choose according to your own GPU resources.
We tested on T4 and A100; the resource consumption is roughly as follows:

| ds_config | T4 (16G, fp16) GPU Memory | A100 (40G, bf16) GPU Memory |
| --- | --- | --- |
| ds_config_stage3_offload-opt.json | ~14428MiB | ~8588MiB |
| ds_config_stage3_offload-opt_offload-para.json | ~8204MiB | ~8682MiB |
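
For reference, a minimal sketch of what a ZeRO stage-3 config with optimizer offload to CPU typically looks like, written out as a standalone file. This is an assumption based on standard DeepSpeed options and the HuggingFace Trainer integration ("auto" values), not the exact contents of the competition kit's ds_config_stage3_offload-opt.json:

# Write an illustrative ZeRO-3 config with optimizer offload to CPU (file name is just an example).
cat > ds_config_stage3_offload-opt.example.json <<'EOF'
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}
EOF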

@xnuohz
Author

xnuohz commented Oct 20, 2023

@zhijianma Thanks for the reply. After switching the DeepSpeed config file to ds_config_stage3_offload-opt.json or ds_config_stage3_offload-opt_offload-para.json, the error below appears. My environment is PyTorch 2.1.0 / CUDA 11.6
(possibly the CUDA version torch was built with does not match the locally installed CUDA).
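
A quick way to check whether the torch build's CUDA version matches the locally installed toolkit that DeepSpeed's JIT builder picks up (generic commands, not taken from the original report):

# CUDA version PyTorch was compiled against
python -c "import torch; print(torch.version.cuda)"
# CUDA toolkit version found locally (used to JIT-build ops such as cpu_adam)
nvcc --version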

ORI NUMBER: 23237, AFTER FILETER: 22564, DROP NUMBER: 673
Total 22564 samples [ 6.48M tokens] in training!
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/ubuntu/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.2267723083496094 seconds
Parameter Offload: Total persistent parameters: 643072 in 194 params
[2023-10-20 12:49:31,065] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 179742
[2023-10-20 12:49:31,067] [ERROR] [launch.py:321:sigkill_handler] ['/home/ubuntu/Softwares/anaconda3/envs/dj_comp/bin/python', '-u', 'train.py', '--local_rank=0', '--model_name_or_path', '../data/models/falcon-rw-1b', '--tokenizer', '../data/models/falcon-rw-1b', '--data_path', '../data/1b_data/v1/train_data.jsonl', '--output_dir', '../data/finetune/v1', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '32', '--lang', 'en', '--bf16', 'True', '--gradient_checkpointing_enable', 'True', '--num_train_epochs', '3', '--model_max_length', '1024', '--learning_rate', '2.5e-5', '--weight_decay', '0', '--warmup_ratio', '0.03', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--save_steps', '-1', '--save_total_limit', '999', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--deepspeed', '/home/ubuntu/Projects/ft_data_ranker_1b/competition_kit/lm-training/train_scripts/deepspeed_configs/ds_config_stage3_offload-opt_offload-para.json'] exits with return code = -9
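
A return code of -9 means the launcher's subprocess was killed with SIGKILL, which on Linux is typically the kernel's OOM killer reclaiming host memory rather than a GPU error. A generic way to confirm this (commands are not from the original report):

# Look for OOM-killer events around the time of the crash
dmesg -T | grep -i -E "killed process|out of memory"
# Check available host RAM and swap
free -h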

@zhijianma
Collaborator

zhijianma commented Oct 20, 2023

Similar problems have been reported in the DeepSpeed repository (#3463, #3824, #2788): after enabling offload, some models run into CPU OOM.
Here, I think you can first try setting pin_memory to false in the ds_config:

"offload_param": {
    "device": "cpu",
    "pin_memory": false

If the above configuration still fails, you can further try offloading the parameters to disk (NVMe):

"offload_param": {
    "device": "nvme",
    "nvme_path": "/your_nvme_path",

@xnuohz
Author

xnuohz commented Oct 20, 2023

I switched from my local environment to a Docker instance and tried offloading to both CPU and disk; both fail :(

  • Offloading to CPU
ORI NUMBER: 23237, AFTER FILETER: 22564, DROP NUMBER: 673
Total 22564 samples [ 6.48M tokens] in training!
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu117/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[3/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o 
[4/4] c++ cpu_adam.o cpu_adam_impl.o custom_cuda_kernel.cuda.o -shared -lcurand -L/opt/conda/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 17.70175266265869 seconds
Parameter Offload: Total persistent parameters: 643072 in 194 params
[2023-10-20 15:00:32,785] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 59
[2023-10-20 15:00:32,790] [ERROR] [launch.py:321:sigkill_handler] ['/opt/conda/bin/python', '-u', 'train.py', '--local_rank=0', '--model_name_or_path', '../data/models/falcon-rw-1b', '--tokenizer', '../data/models/falcon-rw-1b', '--data_path', '../data/1b_data/v1/train_data.jsonl', '--output_dir', '../data/finetune/v1', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '32', '--lang', 'en', '--bf16', 'True', '--gradient_checkpointing_enable', 'True', '--num_train_epochs', '3', '--model_max_length', '1024', '--learning_rate', '2.5e-5', '--weight_decay', '0', '--warmup_ratio', '0.03', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--save_steps', '-1', '--save_total_limit', '999', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--deepspeed', '/workspace/competition_kit/lm-training/train_scripts/deepspeed_configs/ds_config_stage3_offload-opt_offload-para.json'] exits with return code = -9
  • Offloading to disk (NVMe)
Loading model from ../data/models/falcon-rw-1b
Traceback (most recent call last):
  File "/workspace/competition_kit/lm-training/train.py", line 465, in <module>
    train()
  File "/workspace/competition_kit/lm-training/train.py", line 360, in train
    model = transformers.AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2961, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 924, in __init__
    self.param_swapper = param_swapper or AsyncPartitionedParameterSwapper(_ds_config, self.dtype)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 40, in __init__
    aio_op = AsyncIOBuilder().load(verbose=False)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 446, in load
    return self.jit_load(verbose)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 450, in jit_load
    raise RuntimeError(
RuntimeError: Unable to JIT load the async_io op due to it not being compatible due to hardware/software issue. None
[2023-10-20 15:04:40,690] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 398
[2023-10-20 15:04:40,691] [ERROR] [launch.py:321:sigkill_handler] ['/opt/conda/bin/python', '-u', 'train.py', '--local_rank=0', '--model_name_or_path', '../data/models/falcon-rw-1b', '--tokenizer', '../data/models/falcon-rw-1b', '--data_path', '../data/1b_data/v1/train_data.jsonl', '--output_dir', '../data/finetune/v1', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '32', '--lang', 'en', '--bf16', 'True', '--gradient_checkpointing_enable', 'True', '--num_train_epochs', '3', '--model_max_length', '1024', '--learning_rate', '2.5e-5', '--weight_decay', '0', '--warmup_ratio', '0.03', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--save_steps', '-1', '--save_total_limit', '999', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--deepspeed', '/workspace/competition_kit/lm-training/train_scripts/deepspeed_configs/ds_config_stage3_offload-opt_offload-para.json'] exits with return code = 1
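
The async_io op that fails to JIT-load here is DeepSpeed's AIO extension required for NVMe offload, and it depends on the libaio headers being available at build time. A hedged sketch of installing and verifying that prerequisite (package name assumes a Debian/Ubuntu-based image):

# Install libaio so DeepSpeed can JIT-compile the async_io op (Debian/Ubuntu)
apt-get update && apt-get install -y libaio-dev
# Report which DeepSpeed ops are compatible with the current environment
ds_report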

@HYLcool added the competition:FT-Data Ranker label Oct 26, 2023
This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this issue will be closed in 3 days.

Close this stale issue.
