Describe the bug
hidden_states = F.scaled_dot_product_attention(
query, key, value, dropout_p=0.0, scale=attn.scale, is_causal=False
)
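To illustrate why the missing mask matters, here is a minimal pure-Python sketch of scaled dot-product attention (not the diffusers implementation; toy 1-D embeddings and a hand-rolled softmax are assumptions for illustration). Without a mask, a padded key position still receives a non-zero attention weight; with the mask applied as -inf before the softmax, its weight is exactly zero.

```python
import math

def sdpa_weights(q, keys, mask=None):
    """Attention weights of one query over a list of keys.
    q, keys: lists of floats (toy embeddings); mask: list of bools,
    True marks a padded position that should be excluded."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    if mask is not None:
        # Masked positions get -inf, so exp(score) becomes 0 below.
        scores = [float("-inf") if m else s for s, m in zip(scores, mask)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # last key is padding
unmasked = sdpa_weights(q, keys)
masked = sdpa_weights(q, keys, mask=[False, False, True])
print(unmasked[-1] > 0)    # True: padding attracts attention weight
print(masked[-1] == 0.0)   # True: padding correctly zeroed out
```

This is exactly the leakage described below: every padded token dilutes the softmax over the real tokens.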
Reproduction
When we calculate the hidden states inside the AuraFlow attention class, we do not pass the attention mask into the SDPA call shown above.
This leads to non-zero attention scores on the padded positions of the input. When training on long sequence lengths, the model is then unnecessarily perturbed by those positions; the loss can be as high as 2.0, which is about as bad as reparameterising the model.
That is an issue for another day, but we should at least make the attention mask an optional argument of the transformer __call__ method and pass it through to the attention class, similar to how DeepFloyd handles it as an input to the UNet __call__ method.
Logs
No response
System Info
diffusers git
Who can help?
@sayakpaul @yiyixuxu (and @DN6 since your tag is related to SD3)