[question] how to apply model parallelism to solve CUDA memory error #1510

Closed

yananchen1989 opened this issue Apr 6, 2024 · 7 comments

@yananchen1989

Hi team. I am using the SFT and PPO example scripts to train my model: https://github.com/huggingface/trl/tree/main/examples/scripts.

Due to the long context length and the 7B model size, I am hitting CUDA out-of-memory errors on my single GPU.

Is there a straightforward way to use multiple GPUs on my server to train the model with the SFT and PPO scripts, for example by splitting the model across GPUs (model parallelism)? Are there any arguments I can pass directly to the training script?
Thanks a lot.

export CUDA_VISIBLE_DEVICES='7'; python examples/scripts/sft_travel.py \
    --model_name_or_path="mistralai/Mistral-7B-Instruct-v0.2"         \
    --report_to="wandb" \
    --learning_rate=5e-5 \
    --per_device_train_batch_size=4 \
    --gradient_accumulation_steps=16 \
    --logging_steps=1 \
    --num_train_epochs=120 \
    --lr_scheduler_type "constant" \
    --max_steps=-1 \
    --gradient_checkpointing \
    --max_seq_length 16000 \
    --output_dir "8bit" \
    --overwrite_output_dir True \
    --logging_strategy  "epoch" \
    --evaluation_strategy "no"
@yananchen1989 (Author)

I have also tried using examples/accelerate_configs/deepspeed_zero3.yaml:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'yes'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

With max_seq_length set to 4096, training runs in a model-parallel (ZeRO-3) manner on 8 GPUs (A40, 48 GB each).

But when I increase max_seq_length to a higher value, for example 10000, it crashes with an OOM error.
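
For context, ZeRO-3 shards parameters, gradients, and optimizer states but not the per-sequence activations, so attention buffers still grow roughly with the square of the sequence length under the default attention implementation. A minimal sketch of loading the model with FlashAttention-2 to reduce that growth (assuming the flash-attn package is installed and supported on these GPUs):

# Sketch only: load the model with FlashAttention-2 so the attention buffers
# scale roughly linearly with sequence length instead of quadratically.
# Assumes flash-attn is installed and a transformers version that accepts
# the attn_implementation argument.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)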

@yananchen1989 (Author)

My launch script:

accelerate launch \
    --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/sft_tp.py \
    --model_name_or_path 'mistralai/Mistral-7B-Instruct-v0.2'\
    --report_to="wandb" \
    --learning_rate=4.41e-5 \
    --per_device_train_batch_size=4 \
    --gradient_accumulation_steps=1 \
    --output_dir="tp_deepspeed" \
    --logging_steps=1 \
    --num_train_epochs=100 \
    --max_steps=-1 \
    --gradient_checkpointing \
    --bf16 True \
    --do_eval True \
    --evaluation_strategy 'epoch' \
    --max_seq_length 4096

@yananchen1989 (Author)

Just changing max_seq_length from 4096 to 5000 or 5120, with no other changes, also causes an OOM error:

/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/accelerator.py:436: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
warnings.warn(
Installed CUDA version 12.0 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /home/chenyanan/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/chenyanan/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.356724500656128 seconds
[... the same CUDA-version and cpu_adam extension-loading messages repeat for each of the 8 ranks ...]
Parameter Offload: Total persistent parameters: 266240 in 65 params
wandb: Currently logged in as: yananchen1116. Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.16.6
wandb: Run data is saved locally in /home/chenyanan/trl/wandb/run-20240406_153830-vb22d3qy
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run final-cube-98
wandb: ⭐️ View project at https://wandb.ai/yananchen1116/huggingface
wandb: 🚀 View run at https://wandb.ai/yananchen1116/huggingface/runs/vb22d3qy
0%| | 0/200 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/chenyanan/trl/examples/scripts/sft_tp.py", line 159, in <module>
    trainer.train()
  File "/home/chenyanan/trl/trl/trainer/sft_trainer.py", line 360, in train
    output = super().train(*args, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/transformers/trainer.py", line 2118, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/transformers/trainer.py", line 3045, in training_step
    self.accelerator.backward(loss)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/accelerator.py", line 2007, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2213, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.31 GiB. GPU 5 has a total capacity of 47.54 GiB of which 33.77 GiB is free. Process 2567931 has 2.62 GiB memory in use. Including non-PyTorch memory, this process has 11.12 GiB memory in use. Of the allocated memory 8.73 GiB is allocated by PyTorch, and 1.90 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[... the same traceback repeats, partially interleaved, for the remaining ranks; only the per-GPU OOM messages differ ...]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.31 GiB. GPU 6 has a total capacity of 47.54 GiB of which 34.13 GiB is free. ...
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.31 GiB. GPU 3 has a total capacity of 47.54 GiB of which 33.91 GiB is free. ...
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.31 GiB. GPU 7 has a total capacity of 47.54 GiB of which 36.31 GiB is free. ...
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.31 GiB. GPU 4 has a total capacity of 47.54 GiB of which 33.77 GiB is free. ...
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.31 GiB. GPU 2 has a total capacity of 47.54 GiB of which 33.77 GiB is free. ...
[2024-04-06 15:38:49,201] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3235888 closing signal SIGTERM
[2024-04-06 15:38:49,202] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3235889 closing signal SIGTERM
[2024-04-06 15:38:50,371] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 3235890) of binary: /home/chenyanan/anaconda3/envs/mp/bin/python
Traceback (most recent call last):
  File "/home/chenyanan/anaconda3/envs/mp/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1060, in launch_command
    deepspeed_launcher(args)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 764, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

examples/scripts/sft_tp.py FAILED

Failures:
[1]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 3235891)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 3235892)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 3235893)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 3235894)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 3235895)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 3235890)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
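
As the OOM message itself suggests, expandable segments can help when fragmentation is part of the problem. A minimal sketch of setting the allocator option before CUDA is initialized (it can equally be exported in the shell):

# Sketch: the OOM message recommends trying expandable segments to reduce
# fragmentation. The variable must be set before the first CUDA allocation,
# e.g. at the very top of the training script (or exported in the shell).
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")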

@arivero commented Apr 12, 2024

Besides training, model parallelism in the trl chat would be welcome too.

@iFe1er commented Apr 20, 2024

@yananchen1989 any suggestions here?

@iFe1er commented Apr 24, 2024

any updates?

@yananchen1989 (Author)

I went back to single-GPU training, especially for PPO. There are some issues when using multiple GPUs with deepspeed_zero3.yaml.
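
For readers hitting the same limit, one common way to fit a 7B model on a single GPU is 4-bit quantization plus LoRA. The sketch below uses bitsandbytes and peft with illustrative hyperparameters; it is not the setup the author ultimately ran:

# Illustrative sketch of a memory-frugal single-GPU setup (QLoRA-style): 4-bit
# base weights via bitsandbytes plus a small LoRA adapter trained on top.
# Assumes bitsandbytes and peft are installed; the values are examples, not the
# settings used in this issue.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map={"": 0},  # keep the quantized base model on a single GPU
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# The wrapped model can then be passed to SFTTrainer, ideally with
# gradient checkpointing enabled and a modest max_seq_length.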
