[Feature] SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation

### Feature Summary

SEGA dynamically rescales attention across RoPE components based on the latent's spatial-frequency content, enabling stable high-resolution generation without retraining.

### Detailed Description

<img width="640" height="424" alt="Image" src="https://github.com/user-attachments/assets/ba4e1654-bc0e-4249-bada-25ae7edf03af" />

Source: https://rajabi2001.github.io/sega/
Paper: https://arxiv.org/html/2605.22668v1

According to the project page, SEGA resolves the trade-off between structure and detail preservation, achieving coherent synthesis at ultra-high resolutions up to 36 megapixels across multiple models and target resolutions.

#### Abstract

**Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.**

#### Method Overview

- How SEGA Works: SEGA replaces fixed, uniform attention scaling with dynamic, content-aware scaling by measuring the latent's spatial-frequency content at each denoising step and scaling each RoPE component accordingly.
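
To make the idea concrete for whoever picks this up, here is a minimal sketch of content-aware scaling. It is not the paper's formula. The function names `radial_band_energy` and `component_scales`, the radial banding, the one-band-per-RoPE-component mapping, and the `strength` parameter are all assumptions made for illustration. It uses a naive pure-Python DFT on a toy latent; a real integration would presumably run a tensor FFT on the latent inside the denoising loop.

```python
import cmath
import math


def radial_band_energy(latent, num_bands):
    """Return the fraction of spectral energy in each radial frequency band.

    `latent` is a 2D list of floats (one latent channel). Band 0 holds the
    lowest spatial frequencies and the last band the highest.
    """
    h, w = len(latent), len(latent[0])
    max_radius = math.hypot(0.5, 0.5)  # Nyquist corner in normalized units
    energy = [0.0] * num_bands
    for u in range(h):
        for v in range(w):
            # Naive 2D DFT coefficient; acceptable only for a toy-sized latent.
            coeff = sum(
                latent[y][x] * cmath.exp(-2j * math.pi * (u * y / h + v * x / w))
                for y in range(h)
                for x in range(w)
            )
            # Map DFT indices to signed normalized frequencies in [-0.5, 0.5].
            fu = (u if u <= h // 2 else u - h) / h
            fv = (v if v <= w // 2 else v - w) / w
            radius = math.hypot(fu, fv) / max_radius
            band = min(int(radius * num_bands), num_bands - 1)
            energy[band] += abs(coeff) ** 2
    total = sum(energy) or 1.0  # avoid dividing by zero for an all-zero latent
    return [e / total for e in energy]


def component_scales(band_energy, base_scale, strength=0.5):
    """Hypothetical rule (an assumption, not SEGA's): bands with above-average
    energy get a scale above `base_scale`, others a scale below it."""
    uniform = 1.0 / len(band_energy)
    return [base_scale * (1.0 + strength * (e - uniform)) for e in band_energy]


# Toy latents: a flat one (all low-frequency) and a checkerboard (all high-frequency).
smooth = [[1.0] * 4 for _ in range(4)]
checker = [[(-1.0) ** (x + y) for x in range(4)] for y in range(4)]

smooth_scales = component_scales(radial_band_energy(smooth, 4), base_scale=1.0)
checker_scales = component_scales(radial_band_energy(checker, 4), base_scale=1.0)
```

With this rule the scales always average to `base_scale`, so the sketch only redistributes a uniform scaling factor across components depending on where the latent's energy sits. The flat latent boosts the lowest band and the checkerboard boosts the highest one.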

### Alternatives you considered

_No response_

### Additional context

_No response_

[Feature] SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation #1553
