I was looking into the RoPE implementation for ERNIE-Image and noticed something interesting that I wanted to ask about.
It looks like the freqs_cis tensor uses an interleaved format (e.g., repeating pairs like [1.0, 1.0, 0.707, 0.707...]), but the actual rotation logic uses Megatron-style blocked chunking. Here is the snippet for context:
# Apply RoPE: same rotate_half logic as Megatron _apply_rotary_pos_emb_bshd (rotary_interleaved=False)
# x_in: [B, S, heads, head_dim], freqs_cis: [B, S, 1, head_dim] with angles [θ0,θ0,θ1,θ1,...]
def apply_rotary_emb(x_in: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    rot_dim = freqs_cis.shape[-1]
    x, x_pass = x_in[..., :rot_dim], x_in[..., rot_dim:]
    cos_ = torch.cos(freqs_cis).to(x.dtype)
    sin_ = torch.sin(freqs_cis).to(x.dtype)
    # Non-interleaved rotate_half: [-x2, x1]
    x1, x2 = x.chunk(2, dim=-1)
    x_rotated = torch.cat((-x2, x1), dim=-1)
    return torch.cat((x * cos_ + x_rotated * sin_, x_pass), dim=-1)
If I am understanding the code correctly, applying interleaved frequencies to the [-x2, x1] blocks means the two halves of a coordinate pair would be rotated by different angles. For example, the first element x[0] is paired with -x[D/2] and uses the angle theta_0, but x[D/2] is paired with x[0] and uses the angle theta_{D/4}, because that is the entry at index D/2 of the interleaved tensor.
To match the x.chunk(2, dim=-1) logic, shouldn't freqs_cis be formatted in blocks like [theta_0, theta_1, ..., theta_0, theta_1, ...] instead of being interleaved?
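To make the concern concrete, here is a minimal sketch in plain Python (no torch) that mirrors the rotate_half arithmetic for a single vector. The head dimension, the angles, and the input values are made up for illustration and are not taken from ERNIE-Image; the helper names apply_rotate_half and pair_norms are my own. It compares the norm of each (x[i], x[i + D/2]) pair before and after applying the two frequency layouts.

```python
import math

def apply_rotate_half(x, angles):
    """Plain-Python mirror of apply_rotary_emb for one vector, with rot_dim == len(x)."""
    half = len(x) // 2
    rotated = [-v for v in x[half:]] + list(x[:half])  # rotate_half: [-x2, x1]
    return [xi * math.cos(a) + ri * math.sin(a)
            for xi, ri, a in zip(x, rotated, angles)]

def pair_norms(v):
    """Norm of each (v[i], v[i + D/2]) pair that rotate_half couples together."""
    half = len(v) // 2
    return [math.hypot(v[i], v[i + half]) for i in range(half)]

D = 8                                                  # hypothetical head_dim
thetas = [0.3 * (k + 1) for k in range(D // 2)]        # made-up angles θ0..θ3
interleaved = [t for t in thetas for _ in range(2)]    # [θ0, θ0, θ1, θ1, ...]
blocked = thetas + thetas                              # [θ0, θ1, ..., θ0, θ1, ...]

x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 0.25, 1.5]        # arbitrary input

out_inter = apply_rotate_half(x, interleaved)
out_block = apply_rotate_half(x, blocked)

print("input pair norms:      ", pair_norms(x))
print("blocked pair norms:    ", pair_norms(out_block))
print("interleaved pair norms:", pair_norms(out_inter))
```

In this toy setup the blocked layout leaves every pair norm unchanged, which is what a true 2D rotation should do, while the interleaved layout changes them because the two elements of a pair see different angles. This only illustrates the arithmetic; it says nothing about what the trained weights expect.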
Is this asymmetric rotation a known behavior from training, or am I missing something about how this specific implementation is supposed to work mathematically? I tried changing the alignment so the frequencies match the blocks, and, as expected for a model trained with the current layout, the output turned into garbage.