
I think there is something wrong with the new/latest scripts: RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x1536 and 768x3072) #12494


Description

@gerylavin

Describe the bug

I got "RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x1536 and 768x3072)" when running "train_dreambooth_lora_flux_advanced.py" with the latest version of diffusers (and with v0.35.1). The problem went away when I downgraded to v0.31.0, including all its dependencies. I ran the scripts on Modal (a serverless GPU cloud), on an L40S 48GB, with the same training parameters/arguments in both cases.
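The shapes in the error are suggestive: the pooled projection reaching `text_embedder.linear_1` is 1536 wide, exactly twice the 768-dim pooled CLIP output that FLUX.1-dev's `linear_1` expects. That hints (an assumption on my part, not confirmed) that the pooled embedding is being duplicated or concatenated somewhere upstream in the newer script. A minimal pure-Python sketch of the inner-dimension check behind PyTorch's error message:

```python
def can_matmul(mat1_shape, mat2_shape):
    """Mimic the inner-dimension check behind PyTorch's
    'mat1 and mat2 shapes cannot be multiplied' error:
    (m, k1) @ (k2, n) is only valid when k1 == k2."""
    _, k1 = mat1_shape
    k2, _ = mat2_shape
    return k1 == k2

# The failing case from the traceback: a (2, 1536) pooled projection
# hitting a linear layer whose weight expects 768 input features.
assert not can_matmul((2, 1536), (768, 3072))

# What the layer expects: a 768-dim pooled text embedding.
assert can_matmul((2, 768), (768, 3072))

# 1536 is exactly 2 x 768 -- consistent with (but not proof of) the
# pooled embedding being duplicated/concatenated upstream.
assert 1536 == 2 * 768
```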

Reproduction

Note: I ran "accelerate env" while the image build was still in progress, so there is no GPU description in the output below.

For the latest version of diffusers, I used this config:

  • Accelerate version: 1.10.0
  • Platform: Linux-4.4.0-x86_64-with-glibc2.39
  • accelerate bash location: /usr/local/bin/accelerate
  • Python version: 3.11.5
  • Numpy version: 2.3.4
  • PyTorch version: 2.8.0+cu129
  • PyTorch accelerator: N/A
  • System RAM: 167.58 GB
  • Accelerate default config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: NO
    • mixed_precision: bf16
    • use_cpu: False
    • debug: False
    • num_processes: 1
    • machine_rank: 0
    • num_machines: 1
    • rdzv_backend: static
    • same_network: False
    • main_training_function: main
    • enable_cpu_affinity: False
    • downcast_bf16: False
    • tpu_use_cluster: False
    • tpu_use_sudo: False

For the older version of diffusers (v0.31.0), I used this config:

  • Accelerate version: 1.2.1
  • Platform: Linux-4.4.0-x86_64-with-glibc2.35
  • accelerate bash location: /usr/local/bin/accelerate
  • Python version: 3.11.5
  • Numpy version: 2.3.3
  • PyTorch version (GPU?): 2.5.1+cu124 (False)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • PyTorch MLU available: False
  • PyTorch MUSA available: False
  • System RAM: 167.58 GB
  • Accelerate default config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: NO
    • mixed_precision: bf16
    • use_cpu: False
    • debug: False
    • num_processes: 1
    • machine_rank: 0
    • num_machines: 1
    • rdzv_backend: static
    • same_network: False
    • main_training_function: main
    • enable_cpu_affinity: False
    • downcast_bf16: False
    • tpu_use_cluster: False
    • tpu_use_sudo: False

Logs

The latest version of diffusers:
10/15/2025 16:56:04 - INFO - __main__ - ***** Running training *****
10/15/2025 16:56:04 - INFO - __main__ -   Num examples = 338
10/15/2025 16:56:04 - INFO - __main__ -   Num batches each epoch = 338
10/15/2025 16:56:04 - INFO - __main__ -   Num Epochs = 10
10/15/2025 16:56:04 - INFO - __main__ -   Instantaneous batch size per device = 1
10/15/2025 16:56:04 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
10/15/2025 16:56:04 - INFO - __main__ -   Gradient Accumulation steps = 1
10/15/2025 16:56:04 - INFO - __main__ -   Total optimization steps = 3380
Steps:   0%|          | 0/3380 [00:00<?, ?it/s]
[2025-10-15 16:56:05,548] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
[2025-10-15 16:56:09,437] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
Traceback (most recent call last):
  File "/root/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_flux_advanced.py", line 2470, in <module>
    main(args)
  File "/root/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_flux_advanced.py", line 2224, in main
    model_pred = transformer(
                 ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 818, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 806, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/diffusers/src/diffusers/models/transformers/transformer_flux.py", line 696, in forward
    else self.time_text_embed(timestep, guidance, pooled_projections)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/diffusers/src/diffusers/models/embeddings.py", line 1614, in forward
    pooled_projections = self.text_embedder(pooled_projection)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/diffusers/src/diffusers/models/embeddings.py", line 2207, in forward
    hidden_states = self.linear_1(caption)
                    ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x1536 and 768x3072)
Steps:   0%|          | 0/3380 [00:07<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
    args.func(args)
  File "/usr/local/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1235, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.11/site-packages/accelerate/commands/launch.py", line 823, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/local/bin/python', 'train_dreambooth_lora_flux_advanced.py', '--pretrained_model_name_or_path=black-forest-labs/FLUX.1-dev', '--instance_data_dir=/root/r41s4', '--token_abstraction=r4is4', '--instance_prompt=a photo of a r4is4 woman', '--class_data_dir=/root/class_images', '--class_prompt=a photo of a woman', '--with_prior_preservation', '--prior_loss_weight=0.3', '--num_class_images=338', '--output_dir=/root/output_lora', '--lora_layers=attn.to_k,attn.to_q,attn.to_v,attn.to_out.0', '--mixed_precision=bf16', '--optimizer=prodigy', '--train_transformer_frac=1', '--train_text_encoder_ti', '--train_text_encoder_ti_frac=.25', '--weighting_scheme=none', '--resolution=1024', '--train_batch_size=1', '--guidance_scale=1', '--repeats=10', '--learning_rate=1.0', '--gradient_accumulation_steps=1', '--rank=16', '--num_train_epochs=10', '--checkpointing_steps=100', '--cache_latents', '--mixed_precision=bf16', '--gradient_checkpointing']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/pkg/modal/_runtime/container_io_manager.py", line 778, in handle_input_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 243, in run_input_sync
    res = io_context.call_finalized_function()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pkg/modal/_runtime/container_io_manager.py", line 197, in call_finalized_function
    res = self.finalized_function.callable(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/training_baru11_flux_full.py", line 159, in mulai_training
    subprocess.run(jalankan_training, cwd="/root/diffusers/examples/advanced_diffusion_training", check=True)
  File "/usr/local/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['accelerate', 'launch', '--config_file', '/root/accelerate_config.yaml', 'train_dreambooth_lora_flux_advanced.py', '--pretrained_model_name_or_path=black-forest-labs/FLUX.1-dev', '--instance_data_dir=/root/r41s4', '--token_abstraction=r4is4', '--instance_prompt=a photo of a r4is4 woman', '--class_data_dir=/root/class_images', '--class_prompt=a photo of a woman', '--with_prior_preservation', '--prior_loss_weight=0.3', '--num_class_images=338', '--output_dir=/root/output_lora', '--lora_layers=attn.to_k,attn.to_q,attn.to_v,attn.to_out.0', '--mixed_precision=bf16', '--optimizer=prodigy', '--train_transformer_frac=1', '--train_text_encoder_ti', '--train_text_encoder_ti_frac=.25', '--weighting_scheme=none', '--resolution=1024', '--train_batch_size=1', '--guidance_scale=1', '--repeats=10', '--learning_rate=1.0', '--gradient_accumulation_steps=1', '--rank=16', '--num_train_epochs=10', '--checkpointing_steps=100', '--cache_latents', '--mixed_precision=bf16', '--gradient_checkpointing']' returned non-zero exit status 1.


The older version of diffusers (v0.31.0):
10/15/2025 17:00:37 - INFO - __main__ - ***** Running training *****
10/15/2025 17:00:37 - INFO - __main__ -   Num examples = 338
10/15/2025 17:00:37 - INFO - __main__ -   Num batches each epoch = 338
10/15/2025 17:00:37 - INFO - __main__ -   Num Epochs = 10
10/15/2025 17:00:37 - INFO - __main__ -   Instantaneous batch size per device = 1
10/15/2025 17:00:37 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
10/15/2025 17:00:37 - INFO - __main__ -   Gradient Accumulation steps = 1
10/15/2025 17:00:37 - INFO - __main__ -   Total optimization steps = 3380
Steps:   0%|          | 0/3380 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 1/3380 [00:05<5:08:01,  5.47s/it, loss=0.582, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 2/3380 [00:10<4:53:58,  5.22s/it, loss=0.737, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 3/3380 [00:15<4:49:15,  5.14s/it, loss=0.694, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 4/3380 [00:20<4:47:07,  5.10s/it, loss=0.518, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 5/3380 [00:25<4:45:53,  5.08s/it, loss=0.536, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 6/3380 [00:30<4:45:17,  5.07s/it, loss=0.381, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 7/3380 [00:35<4:44:57,  5.07s/it, loss=0.692, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 8/3380 [00:40<4:44:49,  5.07s/it, loss=1.08, lr=1] Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 9/3380 [00:45<4:44:47,  5.07s/it, loss=0.59, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 10/3380 [00:50<4:44:35,  5.07s/it, loss=0.687, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 11/3380 [00:56<4:44:33,  5.07s/it, loss=0.7, lr=1]  Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 12/3380 [01:01<4:44:29,  5.07s/it, loss=0.747, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 13/3380 [01:06<4:44:25,  5.07s/it, loss=0.48, lr=1] Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 14/3380 [01:11<4:44:16,  5.07s/it, loss=0.448, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 15/3380 [01:16<4:44:13,  5.07s/it, loss=0.722, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|          | 16/3380 [01:21<4:44:16,  5.07s/it, loss=0.578, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   1%|          | 17/3380 [01:26<4:44:09,  5.07s/it, loss=0.683, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
..............................................

System Info

Since it's somewhat complex (for me) to run the CLI inside a Modal container, I provide the system info via the image definition used for each script:

For the latest version of diffusers, I used this image:
image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.9.1-devel-ubuntu24.04", add_python="3.11"
    )
    .apt_install("git")
    .pip_install("uv==0.8.12", "ninja<=1.13.0")  # ==0.5.5
    .run_commands(
        "git clone https://github.com/huggingface/diffusers.git /root/diffusers && cd /root/diffusers && uv pip install --system -e ."
    )
    .uv_pip_install(
        "huggingface_hub[hf_transfer]==0.34.4",  # 0.1.8
        "accelerate>=0.31.0,<=1.10.0",
        "transformers>=4.41.2,<=4.55.2",
        "ftfy<=6.2.3",
        "tensorboard<=2.20.0",
        "Jinja2<=3.1.6",
        "peft>=0.11.1,<=0.17.0",
        "sentencepiece<=0.2.1",
        "wheel<=0.41.1",
        "wandb<=0.21.1",
        "bitsandbytes<=0.47.0",
        "datasets<=4.0.0",
        "pyarrow<=21.0.0",
        "prodigyopt<=1.1.2",
        "deepspeed<=0.17.4",
        "xformers<=0.0.32.post2",
        "triton<=3.4.0",
        "torch==2.8.0",
        "torchaudio==2.8.0",
        "torchvision==0.23.0",
        extra_index_url="https://download.pytorch.org/whl/cu129",
    )
    .uv_pip_install(
        "https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.3.14/flash_attn-2.8.2+cu129torch2.8-cp311-cp311-linux_x86_64.whl"
    )
    .run_function(setup_accelerate, gpu="L40S")
    .env({"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"})
    .add_local_dir(DATASET_LOCAL_PATH, DATASET_DIR)
    .add_local_dir(CLASS_LOCAL_PATH, CLASS_DIR)
)

For the older version of diffusers (v0.31.0), I used this image:
image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04", add_python="3.11"
    )
    .apt_install("git")
    .pip_install("uv==0.5.5")
    .run_commands(
        "git clone -b v0.31.0 https://github.com/huggingface/diffusers.git /root/diffusers && cd /root/diffusers && uv pip install --system -e ."
    )
    .uv_pip_install(
        "huggingface_hub[hf_transfer]==0.26.0",
        "accelerate>=0.31.0,<=1.2.1",
        "transformers>=4.41.2,<=4.47.0",
        "ftfy==6.3.1",
        "tensorboard==2.18.0",
        "Jinja2==3.1.4",
        "peft>=0.11.1,<=0.14.0",
        "sentencepiece<=0.2.0",
        "wheel<=0.44.0",
        "bitsandbytes<=0.44.1",
        "datasets<=3.0.1",
        "pyarrow<=20.0.0",
        "prodigyopt<=1.0",
        "deepspeed<=0.15.3",
        "xformers<=0.0.28.post3",
        "triton<=3.1.0",
        "torch==2.5.1",
        "torchaudio==2.5.1",
        "torchvision==0.20.1",
        extra_index_url="https://download.pytorch.org/whl/cu124",
    )
    .uv_pip_install("flash-attn<=2.7.2.post1", extra_options="--no-build-isolation")
    .run_function(setup_accelerate, gpu="L40S")
    .env({"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"})
    .add_local_dir(DATASET_LOCAL_PATH, DATASET_DIR)
    .add_local_dir(CLASS_LOCAL_PATH, CLASS_DIR)
)

Who can help?

@sayakpaul
