[X86] Suboptimal codegen for in-lane broadcast after combining two xmm registers into a ymm #58585

@TellowKrinkle

Godbolt link: https://gcc.godbolt.org/z/cr9W1PhjW

When compiling for AVX2, the following function:

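#include <immintrin.h>

__m256 combine_and_broadcast(__m128 a, __m128 b) {
	// Put a in the low 128-bit lane and b in the high lane.
	__m256 ab = _mm256_insertf128_ps(_mm256_castps128_ps256(a), b, 1);
	// Broadcast element 0 within each lane.
	return _mm256_shuffle_ps(ab, ab, _MM_SHUFFLE(0, 0, 0, 0));
}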

has its broadcast hoisted above the insert that combines the two registers, which adds an extra instruction:

vbroadcastss    xmm0, xmm0
vbroadcastss    xmm1, xmm1
vinsertf128     ymm0, ymm0, xmm1, 1

instead of the two-instruction sequence:

vinsertf128     ymm0, ymm0, xmm1, 1
vpermilps       ymm0, ymm0, 0           # ymm0 = ymm0[0,0,0,0,4,4,4,4]

(This only seems to happen when broadcasting element 0, probably because there is a dedicated instruction, vbroadcastss, for that case.)
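
For reference, here is a rough scalar model in plain C of why the two instruction sequences compute the same result. It is only a sketch of the lane semantics, not the intrinsics themselves, and the names vec4, vec8, model_* and check_example are made up for illustration. It assumes the usual AVX layout where elements 0-3 are the low 128-bit lane and elements 4-7 are the high lane.

```c
#include <assert.h>

/* Hypothetical scalar stand-ins for __m128 and __m256. */
typedef struct { float e[4]; } vec4;
typedef struct { float e[8]; } vec8;

/* Models _mm256_insertf128_ps(_mm256_castps128_ps256(a), b, 1):
   a fills the low lane, b fills the high lane. */
static vec8 model_combine(vec4 a, vec4 b) {
    vec8 r;
    for (int i = 0; i < 4; i++) {
        r.e[i] = a.e[i];
        r.e[4 + i] = b.e[i];
    }
    return r;
}

/* Models _mm256_shuffle_ps(v, v, _MM_SHUFFLE(k, k, k, k)):
   each 128-bit lane broadcasts its own element k. */
static vec8 model_inlane_broadcast(vec8 v, int k) {
    vec8 r;
    for (int lane = 0; lane < 2; lane++)
        for (int i = 0; i < 4; i++)
            r.e[lane * 4 + i] = v.e[lane * 4 + k];
    return r;
}

/* Models broadcasting element k across a single 128-bit register. */
static vec4 model_broadcast128(vec4 v, int k) {
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = v.e[k];
    return r;
}

/* Order written in the source: one insert, then one in-lane shuffle. */
static vec8 insert_then_broadcast(vec4 a, vec4 b, int k) {
    return model_inlane_broadcast(model_combine(a, b), k);
}

/* Order emitted for k == 0: two broadcasts, then one insert. */
static vec8 broadcast_then_insert(vec4 a, vec4 b, int k) {
    return model_combine(model_broadcast128(a, k), model_broadcast128(b, k));
}

/* Returns 1 if both orders give the same 8 elements, else 0. */
static int orders_agree(vec4 a, vec4 b, int k) {
    vec8 x = insert_then_broadcast(a, b, k);
    vec8 y = broadcast_then_insert(a, b, k);
    for (int i = 0; i < 8; i++)
        if (x.e[i] != y.e[i])
            return 0;
    return 1;
}

/* Small self-check with arbitrary sample values. */
static void check_example(void) {
    vec4 a = {{1.0f, 2.0f, 3.0f, 4.0f}};
    vec4 b = {{5.0f, 6.0f, 7.0f, 8.0f}};
    vec8 r = insert_then_broadcast(a, b, 0);
    for (int i = 0; i < 4; i++) {
        assert(r.e[i] == 1.0f);      /* low lane holds a[0] */
        assert(r.e[4 + i] == 5.0f);  /* high lane holds b[0] */
    }
    for (int k = 0; k < 4; k++)
        assert(orders_agree(a, b, k));
}
```

Since both orders agree for every element index in this model, the rewrite is semantically valid. The complaint is only that the order chosen for element 0 takes three instructions where two would do.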
