Dreambooth: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul` #1082

### Describe the bug

Hi, I've spent a couple of days trying to get Dreambooth to run and can't get past this error, which is raised on the very first training step (0/800):

_Steps:   0%|                                                                                                                                                                                                        | 0/800 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/scratch/StableDiffusion/diffusers/examples/dreambooth/train_dreambooth.py", line 765, in <module>
    main()
  File "/scratch/StableDiffusion/diffusers/examples/dreambooth/train_dreambooth.py", line 712, in main
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1673, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/diffusers/models/unet_2d_condition.py", line 287, in forward
    emb = self.time_embedding(t_emb)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/diffusers/models/embeddings.py", line 75, in forward
    sample = self.linear_1(sample)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())`
Steps:   0%|                                                                                                                                                                                                        | 0/800 [00:00<?, ?it/s]
[2022-10-31 12:46:24,888] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 711745
[2022-10-31 12:46:24,889] [ERROR] [launch.py:292:sigkill_handler] ['/home/stablediffusion/.conda/envs/diffusers/bin/python', '-u', 'train_dreambooth.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--instance_data_dir=training/dataset', '--class_data_dir=classes', '--output_dir=output', '--instance_prompt=MyObject dragon', '--class_prompt=dragon', '--seed=3434554', '--resolution=512', '--center_crop', '--train_batch_size=1', '--mixed_precision=fp16', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=100', '--sample_batch_size=4', '--max_train_steps=800'] exits with return code = 1
Traceback (most recent call last):
  File "/home/stablediffusion/.conda/envs/diffusers/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 827, in launch_command
    deepspeed_launcher(args)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 540, in deepspeed_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['deepspeed', '--no_local_rank', '--num_gpus', '1', 'train_dreambooth.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--instance_data_dir=training/dataset', '--class_data_dir=classes', '--output_dir=output', '--instance_prompt=MyObject dragon', '--class_prompt=dragon', '--seed=3434554', '--resolution=512', '--center_crop', '--train_batch_size=1', '--mixed_precision=fp16', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=100', '--sample_batch_size=4', '--max_train_steps=800']' returned non-zero exit status 1._

I can run other CUDA apps just fine. No other GPU-using apps are running.
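
One way to narrow this down is to run the same kind of operation outside the training stack. The traceback ends in `F.linear` inside the UNet's time embedding, so the sketch below runs a single half-precision `torch.nn.Linear` on the GPU with no DeepSpeed, accelerate, or bitsandbytes involved. The function name `check_half_linear` and the 320 to 1280 layer size are assumptions chosen for illustration, not values read from the training script.

```python
def check_half_linear(batch=1, in_features=320, out_features=1280):
    """Run one fp16 Linear forward pass on the GPU and report what happened.

    The layer size is an assumed stand-in for the time-embedding projection;
    any small size should exercise the same cuBLAS matmul path.
    """
    try:
        import torch
    except ImportError:
        return "torch-not-installed"
    if not torch.cuda.is_available():
        return "cuda-not-available"

    # Half-precision layer and input, both placed on the GPU.
    layer = torch.nn.Linear(in_features, out_features).cuda().half()
    x = torch.randn(batch, in_features, device="cuda", dtype=torch.float16)
    try:
        y = layer(x)
        # Kernel launches are asynchronous; synchronize so a failure
        # surfaces here rather than at some later call.
        torch.cuda.synchronize()
    except RuntimeError as exc:
        return f"failed: {exc}"
    return "ok" if tuple(y.shape) == (batch, out_features) else "unexpected-shape"


if __name__ == "__main__":
    print(check_half_linear())
```

If this prints `ok`, a plain fp16 matmul works on the card, which would suggest looking at how the launcher and mixed-precision setup prepare the model rather than at cuBLAS itself. If it fails with the same error, the driver, CUDA, and PyTorch build combination is the more likely suspect.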

### Reproduction

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export INSTANCE_DIR="training/dataset"
export CLASS_DIR="classes"
export OUTPUT_DIR="output"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="MyObject dragon" \
  --class_prompt="dragon" \
  --seed=3434554 \
  --resolution=512 \
  --center_crop \
  --train_batch_size=1 \
  --mixed_precision="fp16" \
  --use_8bit_adam \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=100 \
  --sample_batch_size=4 \
  --max_train_steps=800


### Logs

```shell
See above.
```
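
Because CUDA kernels are launched asynchronously, the Python frame shown in a traceback like the one above is not always the call that actually failed. Setting `CUDA_LAUNCH_BLOCKING=1` makes launches synchronous and usually gives a more trustworthy stack trace. The sketch below builds the same command as in the reproduction with that variable set. The helper name `build_blocking_run` is hypothetical, and it assumes the launcher passes its environment through to the training process.

```python
import os
import shlex

# Flags copied from the reproduction command in this report.
TRAIN_ARGS = [
    "--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5",
    "--instance_data_dir=training/dataset",
    "--class_data_dir=classes",
    "--output_dir=output",
    "--instance_prompt=MyObject dragon",
    "--class_prompt=dragon",
    "--seed=3434554",
    "--resolution=512",
    "--center_crop",
    "--train_batch_size=1",
    "--mixed_precision=fp16",
    "--use_8bit_adam",
    "--gradient_accumulation_steps=1",
    "--gradient_checkpointing",
    "--learning_rate=5e-6",
    "--lr_scheduler=constant",
    "--lr_warmup_steps=0",
    "--num_class_images=100",
    "--sample_batch_size=4",
    "--max_train_steps=800",
]


def build_blocking_run(train_args=TRAIN_ARGS):
    """Return (cmd, env) for a rerun with synchronous CUDA kernel launches."""
    env = dict(os.environ)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    cmd = ["accelerate", "launch", "train_dreambooth.py", *train_args]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_blocking_run()
    # Print the equivalent shell line; to execute it instead, use
    # subprocess.run(cmd, env=env, check=True).
    print("CUDA_LAUNCH_BLOCKING=1", shlex.join(cmd))
```

The script only prints the command by default, so it can be inspected before anything is launched.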


### System Info

- `diffusers` version: 0.7.0.dev0
- Platform: Linux-5.19.16-200.fc36.x86_64-x86_64-with-glibc2.35
- Python version: 3.9.13
- PyTorch version (GPU?): 1.13.0+cu116 (True)
- Huggingface_hub version: 0.10.1
- Transformers version: 4.23.1
- Using GPU in script?: Yes (RTX 3060, 12 GB)
- Using distributed or parallel set-up in script?: No (single GPU; `accelerate launch` goes through the DeepSpeed launcher with `--num_gpus 1`)

The GPU is an RTX 3060 (12 GB), hence the memory-saving options in the command above (`fp16` mixed precision, 8-bit Adam, gradient checkpointing).
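
The list above does not include versions for several packages that appear in the traceback or are pulled in by the flags used (`accelerate`, `deepspeed`, and `bitsandbytes` for `--use_8bit_adam`). A small standard-library sketch for collecting them is shown here; the function name `collect_versions` and the package list are assumptions made for illustration.

```python
from importlib import metadata

# Packages involved in this run; adjust the list as needed.
PACKAGES = [
    "torch",
    "diffusers",
    "transformers",
    "accelerate",
    "deepspeed",
    "bitsandbytes",
]


def collect_versions(packages=PACKAGES):
    """Map each package name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"- {name}: {version}")
```

Running it inside the same conda environment as the training script prints one line per package in the same bullet style as the list above.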

