Feature Summary
SEGA dynamically rescales attention across RoPE components based on the latent's spatial-frequency content, enabling stable high-resolution generation without retraining.
Detailed Description
Source: https://rajabi2001.github.io/sega/
Paper: https://arxiv.org/html/2605.22668v1
SEGA resolves the trade-off between preserving global structure and recovering fine detail, achieving coherent synthesis at resolutions up to 36 megapixels across multiple models and target resolutions.
Abstract
Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
Method Overview
- How SEGA Works
SEGA turns fixed attention scaling into dynamic, content-aware scaling by looking at the latent's frequency content during denoising.
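To make the request more concrete, here is a rough NumPy sketch of what "content-aware scaling across RoPE components" could look like. It is not the paper's algorithm: the spectral statistic (`high_frequency_ratio`), the `cutoff` value, the interpolation rule in `component_scales`, and the `base_scale` parameter are all assumptions made for illustration. Only the RoPE frequency formula is the standard one.

```python
import numpy as np


def rope_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE angular frequencies, one per rotated channel pair.

    Index 0 is the highest-frequency component, the last index the lowest.
    """
    return base ** (-np.arange(0, dim, 2) / dim)


def high_frequency_ratio(latent: np.ndarray, cutoff: float = 0.25) -> float:
    """Share of the latent's spectral energy above `cutoff` cycles/sample.

    Hypothetical statistic (assumption, not from the paper).
    `latent` has shape (C, H, W); the DC term is ignored.
    """
    power = (np.abs(np.fft.fft2(latent, axes=(-2, -1))) ** 2).sum(axis=0)
    fy = np.fft.fftfreq(latent.shape[-2])[:, None]
    fx = np.fft.fftfreq(latent.shape[-1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    power[0, 0] = 0.0  # drop the mean so it does not dominate the ratio
    total = power.sum()
    if total == 0.0:
        return 0.0
    return float(power[radius > cutoff].sum() / total)


def component_scales(freqs: np.ndarray, hf_ratio: float, base_scale: float) -> np.ndarray:
    """Per-component attention scale in [1, base_scale].

    Hypothetical rule (assumption): when the latent is mostly low-frequency,
    the low-frequency RoPE components receive the full `base_scale`; as
    high-frequency energy grows, the emphasis shifts to the high-frequency
    components. A fixed scheme would instead return `base_scale` everywhere.
    """
    log_f = np.log(freqs)
    # rank = 0 for the lowest-frequency component, 1 for the highest
    rank = (log_f - log_f.min()) / (log_f.max() - log_f.min())
    weight = hf_ratio * rank + (1.0 - hf_ratio) * (1.0 - rank)
    return 1.0 + (base_scale - 1.0) * weight
```

In an integration, the scales would be recomputed at each denoising step from the current latent and applied to the per-component contributions of the query-key product (for example by multiplying each rotated pair of `q` and `k` by the square root of its scale). Where exactly that hook belongs in the attention processor is an open question for this feature.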
Alternatives you considered
No response
Additional context
No response