Description
Describe the bug
I got "RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x1536 and 768x3072)" when running "train_dreambooth_lora_flux_advanced.py" with the latest version of diffusers (and with v0.35.1). The problem went away when I downgraded to v0.31.0, including all of its dependencies. I ran the scripts on Modal (a serverless GPU cloud), on an L40S 48GB, with the same training parameters/arguments in both cases.
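For reference, the shapes in the error follow the standard matrix-multiply rule: (2x1536) cannot be multiplied by (768x3072) because the inner dimensions (1536 vs 768) disagree, and 1536 is exactly twice the 768 input features that the failing `linear_1` layer expects. A minimal sketch of that check (plain Python, no torch; `can_matmul` is just an illustration, not a diffusers API):

```python
def can_matmul(mat1_shape, mat2_shape):
    """Matrix multiply is defined only when mat1's columns equal mat2's rows."""
    return mat1_shape[1] == mat2_shape[0]

# Shapes from the traceback: F.linear computes input @ weight.T, so mat2
# here is the transposed weight of nn.Linear(in_features=768, out_features=3072).
pooled = (2, 1536)      # pooled projection the script passes in
weight_t = (768, 3072)  # (in_features, out_features)

assert not can_matmul(pooled, weight_t)
# The pooled embedding is exactly twice as wide as the layer expects:
assert pooled[1] == 2 * weight_t[0]
```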
Reproduction
Note: I ran "accelerate env" while the image build was still in progress, so the GPU description is missing from the output below.
For the latest version of diffusers, I used this config:
- Accelerate version: 1.10.0
- Platform: Linux-4.4.0-x86_64-with-glibc2.39
- accelerate bash location: /usr/local/bin/accelerate
- Python version: 3.11.5
- Numpy version: 2.3.4
- PyTorch version: 2.8.0+cu129
- PyTorch accelerator: N/A
- System RAM: 167.58 GB
Accelerate default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: NO
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: False
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: False
- tpu_use_cluster: False
- tpu_use_sudo: False
For the oldest version of diffusers, I used this config:
- Accelerate version: 1.2.1
- Platform: Linux-4.4.0-x86_64-with-glibc2.35
- accelerate bash location: /usr/local/bin/accelerate
- Python version: 3.11.5
- Numpy version: 2.3.3
- PyTorch version (GPU?): 2.5.1+cu124 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 167.58 GB
Accelerate default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: NO
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: False
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: False
- tpu_use_cluster: False
- tpu_use_sudo: False
Logs
The latest version of diffusers:
10/15/2025 16:56:04 - INFO - __main__ - ***** Running training *****
10/15/2025 16:56:04 - INFO - __main__ - Num examples = 338
10/15/2025 16:56:04 - INFO - __main__ - Num batches each epoch = 338
10/15/2025 16:56:04 - INFO - __main__ - Num Epochs = 10
10/15/2025 16:56:04 - INFO - __main__ - Instantaneous batch size per device = 1
10/15/2025 16:56:04 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
10/15/2025 16:56:04 - INFO - __main__ - Gradient Accumulation steps = 1
10/15/2025 16:56:04 - INFO - __main__ - Total optimization steps = 3380
Steps: 0%| | 0/3380 [00:00<?, ?it/s]
[2025-10-15 16:56:05,548] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
[2025-10-15 16:56:09,437] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
Traceback (most recent call last):
File "/root/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_flux_advanced.py", line 2470, in <module>
main(args)
File "/root/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_flux_advanced.py", line 2224, in main
model_pred = transformer(
^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 818, in forward
return model_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 806, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/diffusers/src/diffusers/models/transformers/transformer_flux.py", line 696, in forward
else self.time_text_embed(timestep, guidance, pooled_projections)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/diffusers/src/diffusers/models/embeddings.py", line 1614, in forward
pooled_projections = self.text_embedder(pooled_projection)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/diffusers/src/diffusers/models/embeddings.py", line 2207, in forward
hidden_states = self.linear_1(caption)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
return F.linear(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x1536 and 768x3072)
Steps: 0%| | 0/3380 [00:07<?, ?it/s]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 10, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
args.func(args)
File "/usr/local/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1235, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.11/site-packages/accelerate/commands/launch.py", line 823, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/local/bin/python', 'train_dreambooth_lora_flux_advanced.py', '--pretrained_model_name_or_path=black-forest-labs/FLUX.1-dev', '--instance_data_dir=/root/r41s4', '--token_abstraction=r4is4', '--instance_prompt=a photo of a r4is4 woman', '--class_data_dir=/root/class_images', '--class_prompt=a photo of a woman', '--with_prior_preservation', '--prior_loss_weight=0.3', '--num_class_images=338', '--output_dir=/root/output_lora', '--lora_layers=attn.to_k,attn.to_q,attn.to_v,attn.to_out.0', '--mixed_precision=bf16', '--optimizer=prodigy', '--train_transformer_frac=1', '--train_text_encoder_ti', '--train_text_encoder_ti_frac=.25', '--weighting_scheme=none', '--resolution=1024', '--train_batch_size=1', '--guidance_scale=1', '--repeats=10', '--learning_rate=1.0', '--gradient_accumulation_steps=1', '--rank=16', '--num_train_epochs=10', '--checkpointing_steps=100', '--cache_latents', '--mixed_precision=bf16', '--gradient_checkpointing']' returned non-zero exit status 1.
Traceback (most recent call last):
File "/pkg/modal/_runtime/container_io_manager.py", line 778, in handle_input_exception
yield
File "/pkg/modal/_container_entrypoint.py", line 243, in run_input_sync
res = io_context.call_finalized_function()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pkg/modal/_runtime/container_io_manager.py", line 197, in call_finalized_function
res = self.finalized_function.callable(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/training_baru11_flux_full.py", line 159, in mulai_training
subprocess.run(jalankan_training, cwd="/root/diffusers/examples/advanced_diffusion_training", check=True)
File "/usr/local/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['accelerate', 'launch', '--config_file', '/root/accelerate_config.yaml', 'train_dreambooth_lora_flux_advanced.py', '--pretrained_model_name_or_path=black-forest-labs/FLUX.1-dev', '--instance_data_dir=/root/r41s4', '--token_abstraction=r4is4', '--instance_prompt=a photo of a r4is4 woman', '--class_data_dir=/root/class_images', '--class_prompt=a photo of a woman', '--with_prior_preservation', '--prior_loss_weight=0.3', '--num_class_images=338', '--output_dir=/root/output_lora', '--lora_layers=attn.to_k,attn.to_q,attn.to_v,attn.to_out.0', '--mixed_precision=bf16', '--optimizer=prodigy', '--train_transformer_frac=1', '--train_text_encoder_ti', '--train_text_encoder_ti_frac=.25', '--weighting_scheme=none', '--resolution=1024', '--train_batch_size=1', '--guidance_scale=1', '--repeats=10', '--learning_rate=1.0', '--gradient_accumulation_steps=1', '--rank=16', '--num_train_epochs=10', '--checkpointing_steps=100', '--cache_latents', '--mixed_precision=bf16', '--gradient_checkpointing']' returned non-zero exit status 1.
The oldest version of diffusers:
10/15/2025 17:00:37 - INFO - __main__ - ***** Running training *****
10/15/2025 17:00:37 - INFO - __main__ - Num examples = 338
10/15/2025 17:00:37 - INFO - __main__ - Num batches each epoch = 338
10/15/2025 17:00:37 - INFO - __main__ - Num Epochs = 10
10/15/2025 17:00:37 - INFO - __main__ - Instantaneous batch size per device = 1
10/15/2025 17:00:37 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
10/15/2025 17:00:37 - INFO - __main__ - Gradient Accumulation steps = 1
10/15/2025 17:00:37 - INFO - __main__ - Total optimization steps = 3380
Steps: 0%| | 0/3380 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 1/3380 [00:05<5:08:01, 5.47s/it, loss=0.582, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 2/3380 [00:10<4:53:58, 5.22s/it, loss=0.737, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 3/3380 [00:15<4:49:15, 5.14s/it, loss=0.694, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 4/3380 [00:20<4:47:07, 5.10s/it, loss=0.518, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 5/3380 [00:25<4:45:53, 5.08s/it, loss=0.536, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 6/3380 [00:30<4:45:17, 5.07s/it, loss=0.381, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 7/3380 [00:35<4:44:57, 5.07s/it, loss=0.692, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 8/3380 [00:40<4:44:49, 5.07s/it, loss=1.08, lr=1] Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 9/3380 [00:45<4:44:47, 5.07s/it, loss=0.59, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 10/3380 [00:50<4:44:35, 5.07s/it, loss=0.687, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 11/3380 [00:56<4:44:33, 5.07s/it, loss=0.7, lr=1] Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 12/3380 [01:01<4:44:29, 5.07s/it, loss=0.747, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 13/3380 [01:06<4:44:25, 5.07s/it, loss=0.48, lr=1] Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 14/3380 [01:11<4:44:16, 5.07s/it, loss=0.448, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 15/3380 [01:16<4:44:13, 5.07s/it, loss=0.722, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 16/3380 [01:21<4:44:16, 5.07s/it, loss=0.578, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 1%| | 17/3380 [01:26<4:44:09, 5.07s/it, loss=0.683, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
...
System Info
Since it's somewhat complex (for me) to run the CLI inside the Modal container, I'm providing the system info via the image definition used for each script:
For the latest version of diffusers, I used this image:
image = (
modal.Image.from_registry(
"nvidia/cuda:12.9.1-devel-ubuntu24.04", add_python="3.11"
)
.apt_install("git")
    .pip_install("uv==0.8.12", "ninja<=1.13.0")  # uv was ==0.5.5 in the old image
.run_commands("git clone https://github.com/huggingface/diffusers.git /root/diffusers && cd /root/diffusers && uv pip install --system -e .")
.uv_pip_install("huggingface_hub[hf_transfer]==0.34.4", #0.1.8
"accelerate>=0.31.0,<=1.10.0",
"transformers>=4.41.2,<=4.55.2",
"ftfy<=6.2.3",
"tensorboard<=2.20.0",
"Jinja2<=3.1.6",
"peft>=0.11.1,<=0.17.0",
"sentencepiece<=0.2.1",
"wheel<=0.41.1",
"wandb<=0.21.1",
"bitsandbytes<=0.47.0",
"datasets<=4.0.0",
"pyarrow<=21.0.0",
"prodigyopt<=1.1.2",
"deepspeed<=0.17.4",
"xformers<=0.0.32.post2",
"triton<=3.4.0",
"torch==2.8.0",
"torchaudio==2.8.0",
"torchvision==0.23.0",
extra_index_url="https://download.pytorch.org/whl/cu129"
)
.uv_pip_install("https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.3.14/flash_attn-2.8.2+cu129torch2.8-cp311-cp311-linux_x86_64.whl")
.run_function(setup_accelerate, gpu="L40S")
.env({"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"})
.add_local_dir(DATASET_LOCAL_PATH, DATASET_DIR)
.add_local_dir(CLASS_LOCAL_PATH, CLASS_DIR)
)
For the oldest version of diffusers, I used this image:
image = (
modal.Image.from_registry(
"nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04", add_python="3.11"
)
.apt_install("git")
.pip_install("uv==0.5.5",)
.run_commands("git clone -b v0.31.0 https://github.com/huggingface/diffusers.git /root/diffusers && cd /root/diffusers && uv pip install --system -e .")
.uv_pip_install("huggingface_hub[hf_transfer]==0.26.0",
"accelerate>=0.31.0,<=1.2.1",
"transformers>=4.41.2,<=4.47.0",
"ftfy==6.3.1",
"tensorboard==2.18.0",
"Jinja2==3.1.4",
"peft>=0.11.1,<=0.14.0",
"sentencepiece<=0.2.0",
"wheel<=0.44.0",
"bitsandbytes<=0.44.1",
"datasets<=3.0.1",
"pyarrow<=20.0.0",
"prodigyopt<=1.0",
"deepspeed<=0.15.3",
"xformers<=0.0.28.post3",
"triton<=3.1.0",
"torch==2.5.1",
"torchaudio==2.5.1",
"torchvision==0.20.1",
extra_index_url="https://download.pytorch.org/whl/cu124"
)
.uv_pip_install("flash-attn<=2.7.2.post1", extra_options="--no-build-isolation")
.run_function(setup_accelerate, gpu="L40S")
.env({"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"})
.add_local_dir(DATASET_LOCAL_PATH, DATASET_DIR)
.add_local_dir(CLASS_LOCAL_PATH, CLASS_DIR)
)
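Not a confirmed diagnosis, but one difference between the two images worth ruling out: the first image clones the example script from `main` (unpinned), while the second pins the clone to a release tag (`-b v0.31.0`), so script and library are guaranteed to match. A tiny helper to pick the git ref matching an installed release (hypothetical helper, `examples_ref_for` is not a diffusers API):

```python
def examples_ref_for(installed_version: str) -> str:
    """Map an installed diffusers version to the git ref its examples live at.

    A '.dev' version means an editable checkout of main, so the examples on
    main already match; otherwise use the matching release tag, e.g.
    `git clone -b v0.35.1 https://github.com/huggingface/diffusers.git`.
    """
    if "dev" in installed_version:
        return "main"
    return f"v{installed_version}"
```

For example, `examples_ref_for("0.35.1")` gives `"v0.35.1"`, which could replace the unpinned clone in the first image's `run_commands` step.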