Godbolt link: https://gcc.godbolt.org/z/cr9W1PhjW
When compiling for AVX2, the following function:
#include <immintrin.h>

__m256 combine_and_broadcast(__m128 a, __m128 b) {
    __m256 ab = _mm256_insertf128_ps(_mm256_castps128_ps256(a), b, 1);
    return _mm256_shuffle_ps(ab, ab, _MM_SHUFFLE(0, 0, 0, 0));
}
is compiled with the broadcast hoisted above the insert, so each 128-bit half is broadcast separately and an extra instruction is emitted:
vbroadcastss xmm0, xmm0
vbroadcastss xmm1, xmm1
vinsertf128 ymm0, ymm0, xmm1, 1
instead of
vinsertf128 ymm0, ymm0, xmm1, 1
vpermilps ymm0, ymm0, 0 # ymm0 = ymm0[0,0,0,0,4,4,4,4]
(This seems to happen only when broadcasting element 0, probably because there is a dedicated instruction for that case, vbroadcastss.)
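
For anyone checking a fix, here is a small sketch that pins down what the function is supposed to compute: lanes 0..3 hold a[0] and lanes 4..7 hold b[0], which both instruction sequences above should produce. The names combine_and_broadcast_store and check_example are made up for this illustration and are not part of the original report. The target attribute is a GCC/Clang extension, added on the assumption that the file may be built without -mavx, and the pointer-based signature is only there to avoid passing __m256 across a non-AVX call boundary.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical wrapper around the same two intrinsics as in the report.
   It loads from and stores to plain float arrays so callers do not need
   AVX enabled themselves. */
__attribute__((target("avx")))
static void combine_and_broadcast_store(const float *a4, const float *b4,
                                        float *out8) {
    __m128 a = _mm_loadu_ps(a4);
    __m128 b = _mm_loadu_ps(b4);
    /* Low 128 bits = a, high 128 bits = b. */
    __m256 ab = _mm256_insertf128_ps(_mm256_castps128_ps256(a), b, 1);
    /* Broadcast element 0 within each 128-bit half. */
    __m256 r = _mm256_shuffle_ps(ab, ab, _MM_SHUFFLE(0, 0, 0, 0));
    _mm256_storeu_ps(out8, r);
}

/* Hypothetical self-check: every low lane must equal a[0] and every
   high lane must equal b[0]. */
static void check_example(void) {
    const float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float b[4] = {5.0f, 6.0f, 7.0f, 8.0f};
    float out[8];
    combine_and_broadcast_store(a, b, out);
    for (int i = 0; i < 4; i++) {
        assert(out[i] == 1.0f);
        assert(out[i + 4] == 5.0f);
    }
}
```

Calling check_example() from main on an AVX-capable machine should pass regardless of which of the two sequences the compiler picks, since the issue here is instruction count and not correctness.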