Description
Describe the bug
I ran train_text_to_image.py with the command given in the instructions. During the VAE encoding step, the process ran for a long time and then crashed with the error shown below.
01/24/2024 11:00:18 - INFO - main - ***** Running training *****
01/24/2024 11:00:18 - INFO - main - Num examples = 834
01/24/2024 11:00:18 - INFO - main - Num Epochs = 72
01/24/2024 11:00:18 - INFO - main - Instantaneous batch size per device = 1
01/24/2024 11:00:18 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 4
01/24/2024 11:00:18 - INFO - main - Gradient Accumulation steps = 4
01/24/2024 11:00:18 - INFO - main - Total optimization steps = 15000
Steps: 0%| | 0/15000 [00:00<?, ?it/s]Shape of pixel_values: torch.Size([1, 3, 176, 512, 512])
Shape of rearrange pixel_values: torch.Size([176, 3, 512, 512])
Traceback (most recent call last):
  File "/home/vipuser/.conda/envs/diffusion/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/vipuser/.conda/envs/diffusion/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/vipuser/.conda/envs/diffusion/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
    simple_launcher(args)
  File "/home/vipuser/.conda/envs/diffusion/lib/python3.8/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/vipuser/.conda/envs/diffusion/bin/python', '/home/vipuser/Downloads/diffusers/examples/text_to_image/temp_3D_v2.py', '--pretrained_model_name_or_path=/home/vipuser/Downloads/stable-diffusion-v1-4', '--train_data_dir=/data/dataset-NKI', '--use_ema', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--gradient_checkpointing', '--max_train_steps=15000', '--learning_rate=1e-05', '--max_grad_norm=1', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--output_dir=output_3D']' died with <Signals.SIGSEGV: 11>.
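The log above shows 176 slices of 512x512 being passed to the VAE in a single call after the rearrange. Whether that is what triggers the SIGSEGV is not confirmed by the log. As a minimal sketch for testing that hypothesis, the slices could be encoded in smaller chunks. The helper names `chunk_bounds` and `encode_in_chunks` and the default chunk size of 8 are assumptions made for illustration and are not part of the training script.

```python
from typing import Callable, Iterator, List, Tuple


def chunk_bounds(total: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) index pairs that cover range(total) in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total)


def encode_in_chunks(frames, encode_fn: Callable, chunk_size: int = 8) -> List:
    """Apply encode_fn to consecutive slices of frames along the first axis.

    frames only needs len() and slicing, so a tensor of shape
    [176, 3, 512, 512] works as well as a plain list.
    Returns one result per chunk; joining them is left to the caller.
    """
    return [
        encode_fn(frames[start:stop])
        for start, stop in chunk_bounds(len(frames), chunk_size)
    ]


# 176 slices in chunks of 8 gives 22 calls instead of one large call.
print(len(list(chunk_bounds(176, 8))))  # 22
```

In the training script, `encode_fn` would presumably wrap the existing call, for example `lambda x: vae.encode(x).latent_dist.sample()`, with the per-chunk results joined by `torch.cat(..., dim=0)`. If the crash disappears with a small chunk size, that points at the size of the single encode call; if it persists, the cause lies elsewhere.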
Reproduction
accelerate launch --mixed_precision="fp16" /home/Downloads/diffusers/examples/text_to_image/train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir="/data/dataset" \
  --use_ema \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="output"
Logs
No response
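The launcher traceback only reports that the child process died with SIGSEGV, so it does not show where in the training script the crash occurred. A minimal sketch for capturing a Python-level stack at the moment of the crash uses the standard-library `faulthandler` module. The function name `enable_crash_traces` and the log file name are assumptions for illustration; the call would go near the top of the training script.

```python
import faulthandler


def enable_crash_traces(log_path: str):
    """Write Python tracebacks to log_path when a fatal signal arrives.

    faulthandler handles SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
    The returned file object must stay open while the handler is enabled.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical usage near the top of the training script:
# crash_log = enable_crash_traces("segfault_trace.log")
```

Setting the environment variable `PYTHONFAULTHANDLER=1` before running `accelerate launch` should have a similar effect without editing the script, with the traceback written to stderr instead of a file.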
System Info
- diffusers version: 0.26.0.dev0
- Platform: Linux-5.4.0-164-generic-x86_64-with-glibc2.17
- Python version: 3.8.18
- PyTorch version (GPU?): 2.3.0.dev20240123+cu118 (True)
- Huggingface_hub version: 0.20.3
- Transformers version: 4.37.0
- Accelerate version: 0.26.1
- xFormers version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?: