
I think there is something wrong with the new/latest scripts: RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x1536 and 768x3072) #12494


Description

@gerylavin

Describe the bug

I got "RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x1536 and 768x3072)" when running "train_dreambooth_lora_flux_advanced.py" with the latest version of diffusers (and with v0.35.1). The problem went away when I downgraded to v0.31.0, including all its dependencies. I ran the scripts on Modal (a serverless GPU cloud), on an L40S 48GB, with the same training parameters/arguments in both cases.
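The shapes in the error are suggestive: the pooled projection reaching `text_embedder.linear_1` is 1536 wide, exactly twice the 768-dim pooled CLIP output that FLUX.1-dev's `linear_1` expects. That hints (an assumption on my part, not confirmed) that the pooled embedding is being duplicated or concatenated somewhere upstream in the newer script. A minimal pure-Python sketch of the inner-dimension check behind PyTorch's error message:

```python
def can_matmul(mat1_shape, mat2_shape):
    """Mimic the inner-dimension check behind PyTorch's
    'mat1 and mat2 shapes cannot be multiplied' error:
    (m, k1) @ (k2, n) is only valid when k1 == k2."""
    _, k1 = mat1_shape
    k2, _ = mat2_shape
    return k1 == k2

# The failing case from the traceback: a (2, 1536) pooled projection
# hitting a linear layer whose weight expects 768 input features.
assert not can_matmul((2, 1536), (768, 3072))

# What the layer expects: a 768-dim pooled text embedding.
assert can_matmul((2, 768), (768, 3072))

# 1536 is exactly 2 x 768 -- consistent with (but not proof of) the
# pooled embedding being duplicated/concatenated upstream.
assert 1536 == 2 * 768
```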

Reproduction

Note: I ran "accelerate env" while the image build was still in progress, so there is no GPU description in the output below.

For the latest version of diffusers, I used this config:

  • Accelerate version: 1.10.0
  • Platform: Linux-4.4.0-x86_64-with-glibc2.39
  • accelerate bash location: /usr/local/bin/accelerate
  • Python version: 3.11.5
  • Numpy version: 2.3.4
  • PyTorch version: 2.8.0+cu129
  • PyTorch accelerator: N/A
  • System RAM: 167.58 GB
  • Accelerate default config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: NO
    • mixed_precision: bf16
    • use_cpu: False
    • debug: False
    • num_processes: 1
    • machine_rank: 0
    • num_machines: 1
    • rdzv_backend: static
    • same_network: False
    • main_training_function: main
    • enable_cpu_affinity: False
    • downcast_bf16: False
    • tpu_use_cluster: False
    • tpu_use_sudo: False

For the older version of diffusers (v0.31.0), I used this config:

  • Accelerate version: 1.2.1
  • Platform: Linux-4.4.0-x86_64-with-glibc2.35
  • accelerate bash location: /usr/local/bin/accelerate
  • Python version: 3.11.5
  • Numpy version: 2.3.3
  • PyTorch version (GPU?): 2.5.1+cu124 (False)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • PyTorch MLU available: False
  • PyTorch MUSA available: False
  • System RAM: 167.58 GB
  • Accelerate default config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: NO
    • mixed_precision: bf16
    • use_cpu: False
    • debug: False
    • num_processes: 1
    • machine_rank: 0
    • num_machines: 1
    • rdzv_backend: static
    • same_network: False
    • main_training_function: main
    • enable_cpu_affinity: False
    • downcast_bf16: False
    • tpu_use_cluster: False
    • tpu_use_sudo: False

Logs

The latest version of diffusers:
10/15/2025 16:56:04 - INFO - __main__ - ***** Running training *****
10/15/2025 16:56:04 - INFO - __main__ -   Num examples = 338
10/15/2025 16:56:04 - INFO - __main__ -   Num batches each epoch = 338
10/15/2025 16:56:04 - INFO - __main__ -   Num Epochs = 10
10/15/2025 16:56:04 - INFO - __main__ -   Instantaneous batch size per device = 1
10/15/2025 16:56:04 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
10/15/2025 16:56:04 - INFO - __main__ -   Gradient Accumulation steps = 1
10/15/2025 16:56:04 - INFO - __main__ -   Total optimization steps = 3380
Steps:   0%|          | 0/3380 [00:00<?, ?it/s]
[2025-10-15 16:56:05,548] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
[2025-10-15 16:56:09,437] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
Traceback (most recent call last):
  File "/root/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_flux_advanced.py", line 2470, in <module>
    main(args)
  File "/root/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_flux_advanced.py", line 2224, in main
    model_pred = transformer(
                 ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 818, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 806, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/diffusers/src/diffusers/models/transformers/transformer_flux.py", line 696, in forward
    else self.time_text_embed(timestep, guidance, pooled_projections)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/diffusers/src/diffusers/models/embeddings.py", line 1614, in forward
    pooled_projections = self.text_embedder(pooled_projection)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/diffusers/src/diffusers/models/embeddings.py", line 2207, in forward
    hidden_states = self.linear_1(caption)
                    ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x1536 and 768x3072)
Steps:   0%|          | 0/3380 [00:07<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
    args.func(args)
  File "/usr/local/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1235, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.11/site-packages/accelerate/commands/launch.py", line 823, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/local/bin/python', 'train_dreambooth_lora_flux_advanced.py', '--pretrained_model_name_or_path=black-forest-labs/FLUX.1-dev', '--instance_data_dir=/root/r41s4', '--token_abstraction=r4is4', '--instance_prompt=a photo of a r4is4 woman', '--class_data_dir=/root/class_images', '--class_prompt=a photo of a woman', '--with_prior_preservation', '--prior_loss_weight=0.3', '--num_class_images=338', '--output_dir=/root/output_lora', '--lora_layers=attn.to_k,attn.to_q,attn.to_v,attn.to_out.0', '--mixed_precision=bf16', '--optimizer=prodigy', '--train_transformer_frac=1', '--train_text_encoder_ti', '--train_text_encoder_ti_frac=.25', '--weighting_scheme=none', '--resolution=1024', '--train_batch_size=1', '--guidance_scale=1', '--repeats=10', '--learning_rate=1.0', '--gradient_accumulation_steps=1', '--rank=16', '--num_train_epochs=10', '--checkpointing_steps=100', '--cache_latents', '--mixed_precision=bf16', '--gradient_checkpointing']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/pkg/modal/_runtime/container_io_manager.py", line 778, in handle_input_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 243, in run_input_sync
    res = io_context.call_finalized_function()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pkg/modal/_runtime/container_io_manager.py", line 197, in call_finalized_function
    res = self.finalized_function.callable(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/training_baru11_flux_full.py", line 159, in mulai_training
    subprocess.run(jalankan_training, cwd="/root/diffusers/examples/advanced_diffusion_training", check=True)
  File "/usr/local/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['accelerate', 'launch', '--config_file', '/root/accelerate_config.yaml', 'train_dreambooth_lora_flux_advanced.py', '--pretrained_model_name_or_path=black-forest-labs/FLUX.1-dev', '--instance_data_dir=/root/r41s4', '--token_abstraction=r4is4', '--instance_prompt=a photo of a r4is4 woman', '--class_data_dir=/root/class_images', '--class_prompt=a photo of a woman', '--with_prior_preservation', '--prior_loss_weight=0.3', '--num_class_images=338', '--output_dir=/root/output_lora', '--lora_layers=attn.to_k,attn.to_q,attn.to_v,attn.to_out.0', '--mixed_precision=bf16', '--optimizer=prodigy', '--train_transformer_frac=1', '--train_text_encoder_ti', '--train_text_encoder_ti_frac=.25', '--weighting_scheme=none', '--resolution=1024', '--train_batch_size=1', '--guidance_scale=1', '--repeats=10', '--learning_rate=1.0', '--gradient_accumulation_steps=1', '--rank=16', '--num_train_epochs=10', '--checkpointing_steps=100', '--cache_latents', '--mixed_precision=bf16', '--gradient_checkpointing']' returned non-zero exit status 1.


The older version of diffusers (v0.31.0):
10/15/2025 17:00:37 - INFO - __main__ - ***** Running training *****
10/15/2025 17:00:37 - INFO - __main__ -   Num examples = 338
10/15/2025 17:00:37 - INFO - __main__ -   Num batches each epoch = 338
10/15/2025 17:00:37 - INFO - __main__ -   Num Epochs = 10
10/15/2025 17:00:37 - INFO - __main__ -   Instantaneous batch size per device = 1
10/15/2025 17:00:37 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
10/15/2025 17:00:37 - INFO - __main__ -   Gradient Accumulation steps = 1
10/15/2025 17:00:37 - INFO - __main__ -   Total optimization steps = 3380
Steps:   0%|          | 0/3380 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 1/3380 [00:05<5:08:01,  5.47s/it, loss=0.582, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 2/3380 [00:10<4:53:58,  5.22s/it, loss=0.737, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 3/3380 [00:15<4:49:15,  5.14s/it, loss=0.694, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 4/3380 [00:20<4:47:07,  5.10s/it, loss=0.518, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 5/3380 [00:25<4:45:53,  5.08s/it, loss=0.536, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 6/3380 [00:30<4:45:17,  5.07s/it, loss=0.381, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 7/3380 [00:35<4:44:57,  5.07s/it, loss=0.692, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 8/3380 [00:40<4:44:49,  5.07s/it, loss=1.08, lr=1] Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 9/3380 [00:45<4:44:47,  5.07s/it, loss=0.59, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 10/3380 [00:50<4:44:35,  5.07s/it, loss=0.687, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 11/3380 [00:56<4:44:33,  5.07s/it, loss=0.7, lr=1]  Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 12/3380 [01:01<4:44:29,  5.07s/it, loss=0.747, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 13/3380 [01:06<4:44:25,  5.07s/it, loss=0.48, lr=1] Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 14/3380 [01:11<4:44:16,  5.07s/it, loss=0.448, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 15/3380 [01:16<4:44:13,  5.07s/it, loss=0.722, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 16/3380 [01:21<4:44:16,  5.07s/it, loss=0.578, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   1%|          | 17/3380 [01:26<4:44:09,  5.07s/it, loss=0.683, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
..............................................

System Info

Since it's somewhat complex (for me) to run the CLI inside a Modal container, I provide the system info via the image definition used for each script:

For the latest version of diffusers, I used this image:
image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.9.1-devel-ubuntu24.04", add_python="3.11"
    )
    .apt_install("git")
    .pip_install("uv==0.8.12", "ninja<=1.13.0")  # ==0.5.5
    .run_commands(
        "git clone https://github.com/huggingface/diffusers.git /root/diffusers && cd /root/diffusers && uv pip install --system -e ."
    )
    .uv_pip_install(
        "huggingface_hub[hf_transfer]==0.34.4",  # 0.1.8
        "accelerate>=0.31.0,<=1.10.0",
        "transformers>=4.41.2,<=4.55.2",
        "ftfy<=6.2.3",
        "tensorboard<=2.20.0",
        "Jinja2<=3.1.6",
        "peft>=0.11.1,<=0.17.0",
        "sentencepiece<=0.2.1",
        "wheel<=0.41.1",
        "wandb<=0.21.1",
        "bitsandbytes<=0.47.0",
        "datasets<=4.0.0",
        "pyarrow<=21.0.0",
        "prodigyopt<=1.1.2",
        "deepspeed<=0.17.4",
        "xformers<=0.0.32.post2",
        "triton<=3.4.0",
        "torch==2.8.0",
        "torchaudio==2.8.0",
        "torchvision==0.23.0",
        extra_index_url="https://download.pytorch.org/whl/cu129",
    )
    .uv_pip_install(
        "https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.3.14/flash_attn-2.8.2+cu129torch2.8-cp311-cp311-linux_x86_64.whl"
    )
    .run_function(setup_accelerate, gpu="L40S")
    .env({"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"})
    .add_local_dir(DATASET_LOCAL_PATH, DATASET_DIR)
    .add_local_dir(CLASS_LOCAL_PATH, CLASS_DIR)
)

For the older version of diffusers (v0.31.0), I used this image:
image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04", add_python="3.11"
    )
    .apt_install("git")
    .pip_install("uv==0.5.5")
    .run_commands(
        "git clone -b v0.31.0 https://github.com/huggingface/diffusers.git /root/diffusers && cd /root/diffusers && uv pip install --system -e ."
    )
    .uv_pip_install(
        "huggingface_hub[hf_transfer]==0.26.0",
        "accelerate>=0.31.0,<=1.2.1",
        "transformers>=4.41.2,<=4.47.0",
        "ftfy==6.3.1",
        "tensorboard==2.18.0",
        "Jinja2==3.1.4",
        "peft>=0.11.1,<=0.14.0",
        "sentencepiece<=0.2.0",
        "wheel<=0.44.0",
        "bitsandbytes<=0.44.1",
        "datasets<=3.0.1",
        "pyarrow<=20.0.0",
        "prodigyopt<=1.0",
        "deepspeed<=0.15.3",
        "xformers<=0.0.28.post3",
        "triton<=3.1.0",
        "torch==2.5.1",
        "torchaudio==2.5.1",
        "torchvision==0.20.1",
        extra_index_url="https://download.pytorch.org/whl/cu124",
    )
    .uv_pip_install("flash-attn<=2.7.2.post1", extra_options="--no-build-isolation")
    .run_function(setup_accelerate, gpu="L40S")
    .env({"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"})
    .add_local_dir(DATASET_LOCAL_PATH, DATASET_DIR)
    .add_local_dir(CLASS_LOCAL_PATH, CLASS_DIR)
)

Who can help?

@sayakpaul
