Description
Describe the bug
Hi, I've spent a couple of days trying to get DreamBooth to run, and I can't get past this error:
Steps: 0%| | 0/800 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/scratch/StableDiffusion/diffusers/examples/dreambooth/train_dreambooth.py", line 765, in <module>
main()
File "/scratch/StableDiffusion/diffusers/examples/dreambooth/train_dreambooth.py", line 712, in main
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1673, in forward
loss = self.module(*inputs, **kwargs)
File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/diffusers/models/unet_2d_condition.py", line 287, in forward
emb = self.time_embedding(t_emb)
File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/diffusers/models/embeddings.py", line 75, in forward
sample = self.linear_1(sample)
File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())
Steps: 0%| | 0/800 [00:00<?, ?it/s]
[2022-10-31 12:46:24,888] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 711745
[2022-10-31 12:46:24,889] [ERROR] [launch.py:292:sigkill_handler] ['/home/stablediffusion/.conda/envs/diffusers/bin/python', '-u', 'train_dreambooth.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--instance_data_dir=training/dataset', '--class_data_dir=classes', '--output_dir=output', '--instance_prompt=MyObject dragon', '--class_prompt=dragon', '--seed=3434554', '--resolution=512', '--center_crop', '--train_batch_size=1', '--mixed_precision=fp16', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=100', '--sample_batch_size=4', '--max_train_steps=800'] exits with return code = 1
Traceback (most recent call last):
File "/home/stablediffusion/.conda/envs/diffusers/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 827, in launch_command
deepspeed_launcher(args)
File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 540, in deepspeed_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['deepspeed', '--no_local_rank', '--num_gpus', '1', 'train_dreambooth.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--instance_data_dir=training/dataset', '--class_data_dir=classes', '--output_dir=output', '--instance_prompt=MyObject dragon', '--class_prompt=dragon', '--seed=3434554', '--resolution=512', '--center_crop', '--train_batch_size=1', '--mixed_precision=fp16', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=100', '--sample_batch_size=4', '--max_train_steps=800']' returned non-zero exit status 1.
I can run other CUDA apps just fine. No other GPU-using apps are running.
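To narrow this down, it may help to check whether a single fp16 linear layer runs in plain PyTorch, outside accelerate and DeepSpeed. The following is a minimal sketch, not a definitive diagnosis: the function name `fp16_linear_check` is hypothetical, and the 320 and 1280 feature sizes are an assumption about the shape of the time-embedding layer that appears in the traceback.

```python
import importlib.util


def fp16_linear_check(in_features=320, out_features=1280):
    """Run one fp16 linear layer on the GPU and report what happened.

    The default sizes are an assumption about the failing time-embedding
    layer; adjust them if the real layer differs.
    """
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch

    if not torch.cuda.is_available():
        return "no CUDA device visible"
    try:
        layer = torch.nn.Linear(in_features, out_features).half().cuda()
        x = torch.randn(1, in_features, dtype=torch.float16, device="cuda")
        y = layer(x)
        # CUDA kernels run asynchronously; synchronize so that any error
        # is raised here rather than at some later, unrelated call.
        torch.cuda.synchronize()
    except RuntimeError as err:
        return f"failed: {err}"
    return f"ok: output shape {tuple(y.shape)}"


print(fp16_linear_check())
```

If this reports "ok", the fp16 matmul itself works on the card and the failure is more likely tied to the training setup (for example the DeepSpeed wrapper or the dtype of the inputs reaching the layer). If it reports "failed", the problem probably sits in the PyTorch/CUDA installation.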
Reproduction
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export INSTANCE_DIR="training/dataset"
export CLASS_DIR="classes"
export OUTPUT_DIR="output"
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="MyObject dragon" \
  --class_prompt="dragon" \
  --seed=3434554 \
  --resolution=512 \
  --center_crop \
  --train_batch_size=1 \
  --mixed_precision="fp16" \
  --use_8bit_adam \
  --gradient_accumulation_steps=1 --gradient_checkpointing \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=100 \
  --sample_batch_size=4 \
  --max_train_steps=800
Logs
See above.
System Info
- diffusers version: 0.7.0.dev0
- Platform: Linux-5.19.16-200.fc36.x86_64-x86_64-with-glibc2.35
- Python version: 3.9.13
- PyTorch version (GPU?): 1.13.0+cu116 (True)
- Huggingface_hub version: 0.10.1
- Transformers version: 4.23.1
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
The GPU is an RTX 3060 (12 GB), hence the need to limit memory usage.