[question] how to apply model parallelism to solve CUDA memory error #1510

Closed

yananchen1989 opened this issue Apr 6, 2024 · 7 comments

@yananchen1989

Hi team. I am using the SFT and PPO example scripts to train my model: https://github.com/huggingface/trl/tree/main/examples/scripts.

Due to the long context length and the 7B model size, I am hitting CUDA out-of-memory errors on my single GPU.

Is there a straightforward way to use multiple GPUs on my server to train the model with the SFT and PPO scripts, for example by splitting the model across GPUs (model parallelism)? Are there any arguments I can pass directly to the training script?
Thanks a lot.

export CUDA_VISIBLE_DEVICES='7'; python examples/scripts/sft_travel.py \
    --model_name_or_path="mistralai/Mistral-7B-Instruct-v0.2"         \
    --report_to="wandb" \
    --learning_rate=5e-5 \
    --per_device_train_batch_size=4 \
    --gradient_accumulation_steps=16 \
    --logging_steps=1 \
    --num_train_epochs=120 \
    --lr_scheduler_type "constant" \
    --max_steps=-1 \
    --gradient_checkpointing \
    --max_seq_length 16000 \
    --output_dir "8bit" \
    --overwrite_output_dir True \
    --logging_strategy  "epoch" \
    --evaluation_strategy "no"
@yananchen1989 (Author)

I have also tried using examples/accelerate_configs/deepspeed_zero3.yaml:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'yes'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

With max_seq_length set to 4096, training runs in a model-parallel (ZeRO-3) manner on 8 GPUs (A40, 48 GB each).

But when I increase max_seq_length to a higher value, for example 10000, it crashes with an OOM error.
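
For context, ZeRO-3 shards parameters, gradients, and optimizer states but not the per-sequence activations, so attention buffers still grow roughly with the square of the sequence length under the default attention implementation. A minimal sketch of loading the model with FlashAttention-2 to reduce that growth (assuming the flash-attn package is installed and supported on these GPUs):

# Sketch only: load the model with FlashAttention-2 so the attention buffers
# scale roughly linearly with sequence length instead of quadratically.
# Assumes flash-attn is installed and a transformers version that accepts
# the attn_implementation argument.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)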

@yananchen1989 (Author)

My launch script:

accelerate launch \
    --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/sft_tp.py \
    --model_name_or_path 'mistralai/Mistral-7B-Instruct-v0.2'\
    --report_to="wandb" \
    --learning_rate=4.41e-5 \
    --per_device_train_batch_size=4 \
    --gradient_accumulation_steps=1 \
    --output_dir="tp_deepspeed" \
    --logging_steps=1 \
    --num_train_epochs=100 \
    --max_steps=-1 \
    --gradient_checkpointing \
    --bf16 True \
    --do_eval True \
    --evaluation_strategy 'epoch' \
    --max_seq_length 4096

@yananchen1989 (Author)

Just changing max_seq_length from 4096 to 5000 or 5120, with no other changes, also causes an OOM error:

/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/accelerator.py:436: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
warnings.warn(
Installed CUDA version 12.0 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /home/chenyanan/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/chenyanan/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.356724500656128 seconds
[... the same CUDA-version and cpu_adam extension-loading messages repeat for each of the 8 ranks ...]
Parameter Offload: Total persistent parameters: 266240 in 65 params
wandb: Currently logged in as: yananchen1116. Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.16.6
wandb: Run data is saved locally in /home/chenyanan/trl/wandb/run-20240406_153830-vb22d3qy
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run final-cube-98
wandb: ⭐️ View project at https://wandb.ai/yananchen1116/huggingface
wandb: 🚀 View run at https://wandb.ai/yananchen1116/huggingface/runs/vb22d3qy
0%| | 0/200 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/chenyanan/trl/examples/scripts/sft_tp.py", line 159, in <module>
    trainer.train()
  File "/home/chenyanan/trl/trl/trainer/sft_trainer.py", line 360, in train
    output = super().train(*args, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/transformers/trainer.py", line 2118, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/transformers/trainer.py", line 3045, in training_step
    self.accelerator.backward(loss)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/accelerator.py", line 2007, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2213, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.31 GiB. GPU 5 has a total capacity of 47.54 GiB of which 33.77 GiB is free. Process 2567931 has 2.62 GiB memory in use. Including non-PyTorch memory, this process has 11.12 GiB memory in use. Of the allocated memory 8.73 GiB is allocated by PyTorch, and 1.90 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[... the same traceback repeats, partially interleaved, for the remaining ranks; only the per-GPU OOM messages differ ...]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.31 GiB. GPU 6 has a total capacity of 47.54 GiB of which 34.13 GiB is free. ...
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.31 GiB. GPU 3 has a total capacity of 47.54 GiB of which 33.91 GiB is free. ...
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.31 GiB. GPU 7 has a total capacity of 47.54 GiB of which 36.31 GiB is free. ...
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.31 GiB. GPU 4 has a total capacity of 47.54 GiB of which 33.77 GiB is free. ...
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.31 GiB. GPU 2 has a total capacity of 47.54 GiB of which 33.77 GiB is free. ...
[2024-04-06 15:38:49,201] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3235888 closing signal SIGTERM
[2024-04-06 15:38:49,202] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3235889 closing signal SIGTERM
[2024-04-06 15:38:50,371] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 3235890) of binary: /home/chenyanan/anaconda3/envs/mp/bin/python
Traceback (most recent call last):
  File "/home/chenyanan/anaconda3/envs/mp/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1060, in launch_command
    deepspeed_launcher(args)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 764, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

examples/scripts/sft_tp.py FAILED

Failures:
[1]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 3235891)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 3235892)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 3235893)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 3235894)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 3235895)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 3235890)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
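
As the OOM message itself suggests, expandable segments can help when fragmentation is part of the problem. A minimal sketch of setting the allocator option before CUDA is initialized (it can equally be exported in the shell):

# Sketch: the OOM message recommends trying expandable segments to reduce
# fragmentation. The variable must be set before the first CUDA allocation,
# e.g. at the very top of the training script (or exported in the shell).
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")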

@arivero commented Apr 12, 2024

Besides training, model parallelism in the trl chat would be welcome too.

@iFe1er commented Apr 20, 2024

@yananchen1989 any suggestions here?

@iFe1er commented Apr 24, 2024

any updates?

@yananchen1989 (Author)

I went back to single-GPU training, especially for PPO. There are some issues when using multiple GPUs with deepspeed_zero3.yaml.
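
For readers hitting the same limit, one common way to fit a 7B model on a single GPU is 4-bit quantization plus LoRA. The sketch below uses bitsandbytes and peft with illustrative hyperparameters; it is not the setup the author ultimately ran:

# Illustrative sketch of a memory-frugal single-GPU setup (QLoRA-style): 4-bit
# base weights via bitsandbytes plus a small LoRA adapter trained on top.
# Assumes bitsandbytes and peft are installed; the values are examples, not the
# settings used in this issue.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map={"": 0},  # keep the quantized base model on a single GPU
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# The wrapped model can then be passed to SFTTrainer, ideally with
# gradient checkpointing enabled and a modest max_seq_length.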
