[X86] Suboptimal codegen for in-lane broadcast after combining two xmm registers into a ymm #58585

Godbolt link: https://gcc.godbolt.org/z/cr9W1PhjW

When compiling for AVX2, the following function:
```c
#include <immintrin.h>

__m256 combine_and_broadcast(__m128 a, __m128 b) {
	__m256 ab = _mm256_insertf128_ps(_mm256_castps128_ps256(a), b, 1);
	return _mm256_shuffle_ps(ab, ab, _MM_SHUFFLE(0, 0, 0, 0));
}
```
has the broadcast hoisted above the insert, so each 128-bit half is broadcast separately before the two registers are combined, which costs an extra instruction:
```asm
vbroadcastss    xmm0, xmm0
vbroadcastss    xmm1, xmm1
vinsertf128     ymm0, ymm0, xmm1, 1
```
instead of
```asm
vinsertf128     ymm0, ymm0, xmm1, 1
vpermilps       ymm0, ymm0, 0           # ymm0 = ymm0[0,0,0,0,4,4,4,4]
```
(This only seems to happen when broadcasting element 0, probably because there is a dedicated instruction, `vbroadcastss`, for that case.)
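For reference, the two instruction orders above can be written out as intrinsics to confirm they compute the same value, so the difference is purely one of instruction count. This is only a sketch: the names `combine_then_broadcast`, `broadcast_then_combine`, `orderings_agree` and the `AVX_FN` macro are made up for illustration, and it assumes GCC or Clang (for the `target` attribute) on a CPU with AVX. The per-half broadcast is written with `_mm_shuffle_ps` rather than a broadcast intrinsic, so it mirrors the emitted order but not necessarily the emitted instructions.

```c
#include <immintrin.h>
#include <string.h>

/* Assumption: GCC or Clang. The target attribute enables AVX per function,
   so the file also builds without -mavx on the command line. */
#define AVX_FN __attribute__((target("avx")))

/* Order from the report's source: combine first, then one in-lane shuffle. */
AVX_FN static __m256 combine_then_broadcast(__m128 a, __m128 b) {
    __m256 ab = _mm256_insertf128_ps(_mm256_castps128_ps256(a), b, 1);
    return _mm256_shuffle_ps(ab, ab, _MM_SHUFFLE(0, 0, 0, 0));
}

/* Order of the emitted assembly: splat element 0 of each half, then combine. */
AVX_FN static __m256 broadcast_then_combine(__m128 a, __m128 b) {
    __m128 a0 = _mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 0, 0, 0));
    __m128 b0 = _mm_shuffle_ps(b, b, _MM_SHUFFLE(0, 0, 0, 0));
    return _mm256_insertf128_ps(_mm256_castps128_ps256(a0), b0, 1);
}

/* Returns 1 if both orders yield a[0] four times followed by b[0] four times. */
AVX_FN int orderings_agree(void) {
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_setr_ps(5.0f, 6.0f, 7.0f, 8.0f);
    const float expected[8] = {1, 1, 1, 1, 5, 5, 5, 5};
    float x[8], y[8];

    _mm256_storeu_ps(x, combine_then_broadcast(a, b));
    _mm256_storeu_ps(y, broadcast_then_combine(a, b));

    return memcmp(x, expected, sizeof x) == 0 &&
           memcmp(y, expected, sizeof y) == 0;
}
```

Calling `orderings_agree()` from a small `main` should return 1 on an AVX-capable machine. The check only shows that the two orders are equivalent; which instructions a given compiler version emits for either function still has to be inspected in the assembly output.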
