Description
Changing the instruction order from:
```c
const __m128i va0 = _mm_broadcastq_epi64(_mm_loadl_epi64((const __m128i*) a0));
const __m256i vxa0 = _mm256_cvtepi8_epi16(va0);
a0 += 8;
const __m128i va1 = _mm_broadcastq_epi64(_mm_loadl_epi64((const __m128i*) a1));
const __m256i vxa1 = _mm256_cvtepi8_epi16(va1);
a1 += 8;
const __m128i va2 = _mm_broadcastq_epi64(_mm_loadl_epi64((const __m128i*) a2));
const __m256i vxa2 = _mm256_cvtepi8_epi16(va2);
a2 += 8;
```
to this:
```c
const __m128i va0 = _mm_broadcastq_epi64(_mm_loadl_epi64((const __m128i*) a0));
const __m128i va1 = _mm_broadcastq_epi64(_mm_loadl_epi64((const __m128i*) a1));
const __m128i va2 = _mm_broadcastq_epi64(_mm_loadl_epi64((const __m128i*) a2));
const __m256i vxa0 = _mm256_cvtepi8_epi16(va0);
a0 += 8;
const __m256i vxa1 = _mm256_cvtepi8_epi16(va1);
a1 += 8;
const __m256i vxa2 = _mm256_cvtepi8_epi16(va2);
a2 += 8;
```
causes the microkernel to grow from 46 to 60 instructions because one vector register gets spilled.
The generated code also contains quite a few vmovdqa instructions that only shuffle values between registers.
The preprocessed source is attached.