"raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)" occurs when training starts #6695

@dahui-y

Describe the bug

I ran train_text_to_image.py with the command described in the instructions. However, during VAE encoding the process took a long time and then raised the error shown below.
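For a sense of scale, the log in this report prints a rearranged pixel_values tensor of shape [176, 3, 512, 512] just before the crash. The following is a rough sketch of how much memory that input alone occupies. The helper name `tensor_bytes` is hypothetical, and the figure covers only the input tensor, not the VAE's intermediate activations, which are typically many times larger. It does not by itself explain the segmentation fault.

```python
import math


def tensor_bytes(shape, bytes_per_element):
    """Return the storage size in bytes of a dense tensor with the given shape."""
    return math.prod(shape) * bytes_per_element


# Shape taken from the "rearrange pixel_values" line in the log.
shape = (176, 3, 512, 512)

fp32_mib = tensor_bytes(shape, 4) / 2**20  # 4 bytes per float32 element
fp16_mib = tensor_bytes(shape, 2) / 2**20  # 2 bytes per float16 element

print(f"fp32: {fp32_mib:.0f} MiB, fp16: {fp16_mib:.0f} MiB")
# prints "fp32: 528 MiB, fp16: 264 MiB"
```

If memory turns out to be the limiting factor, passing the frames through the VAE in smaller chunks is one option worth trying.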

01/24/2024 11:00:18 - INFO - __main__ - ***** Running training *****
01/24/2024 11:00:18 - INFO - __main__ - Num examples = 834
01/24/2024 11:00:18 - INFO - __main__ - Num Epochs = 72
01/24/2024 11:00:18 - INFO - __main__ - Instantaneous batch size per device = 1
01/24/2024 11:00:18 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
01/24/2024 11:00:18 - INFO - __main__ - Gradient Accumulation steps = 4
01/24/2024 11:00:18 - INFO - __main__ - Total optimization steps = 15000
Steps: 0%| | 0/15000 [00:00<?, ?it/s]
Shape of pixel_values: torch.Size([1, 3, 176, 512, 512])
Shape of rearrange pixel_values: torch.Size([176, 3, 512, 512])
Traceback (most recent call last):
  File "/home/vipuser/.conda/envs/diffusion/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/vipuser/.conda/envs/diffusion/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/vipuser/.conda/envs/diffusion/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
    simple_launcher(args)
  File "/home/vipuser/.conda/envs/diffusion/lib/python3.8/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/vipuser/.conda/envs/diffusion/bin/python', '/home/vipuser/Downloads/diffusers/examples/text_to_image/temp_3D_v2.py', '--pretrained_model_name_or_path=/home/vipuser/Downloads/stable-diffusion-v1-4', '--train_data_dir=/data/dataset-NKI', '--use_ema', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--gradient_checkpointing', '--max_train_steps=15000', '--learning_rate=1e-05', '--max_grad_norm=1', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--output_dir=output_3D']' died with <Signals.SIGSEGV: 11>.
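
The CalledProcessError above is raised by the accelerate launcher, which only reports how the child Python process ended. The phrase "died with <Signals.SIGSEGV: 11>" means the training script itself was killed by signal 11. On POSIX systems, Python's subprocess module reports that as a return code of -11. A minimal sketch of decoding such a return code follows; the helper name `describe_returncode` is hypothetical and not part of accelerate.

```python
import signal


def describe_returncode(returncode):
    """Explain a child process return code as reported by `subprocess` on POSIX."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # A negative value -N means the child was terminated by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


# -11 corresponds to the "died with <Signals.SIGSEGV: 11>" message above.
print(describe_returncode(-11))
```

A segmentation fault originates in native code (for example, a compiled extension or driver library), so the Python traceback from the launcher does not show where the crash actually happened.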

Reproduction

accelerate launch --mixed_precision="fp16" /home/Downloads/diffusers/examples/text_to_image/train_text_to_image.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--train_data_dir="/data/dataset" \
--use_ema \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--output_dir="output"
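
Because the child process dies from a segmentation fault, no Python traceback is printed for the training script itself. One way to get more detail, sketched here as a suggestion rather than a confirmed fix, is to enable the standard library's faulthandler module near the top of the training script, so that the Python stack is dumped when a fatal signal arrives.

```python
import faulthandler

# Dump the Python stack of all threads to stderr if the interpreter
# receives a fatal signal such as SIGSEGV.
faulthandler.enable()
```

The same effect is available without editing the script by setting the environment variable PYTHONFAULTHANDLER=1 before running the launch command, since the child Python process inherits the environment.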

Logs

No response

System Info

  • diffusers version: 0.26.0.dev0
  • Platform: Linux-5.4.0-164-generic-x86_64-with-glibc2.17
  • Python version: 3.8.18
  • PyTorch version (GPU?): 2.3.0.dev20240123+cu118 (True)
  • Huggingface_hub version: 0.20.3
  • Transformers version: 4.37.0
  • Accelerate version: 0.26.1
  • xFormers version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@sayakpaul @patrickvonplaten


Labels

bug, stale
