Jump to conversation
Unresolved conversations (2)
@gchatelet gchatelet Jan 9, 2024
It appears that this code is not well optimized by the compiler, using GPRs instead of vector registers. https://github.com/llvm/llvm-project/issues/77459 The vector code is 21 cycles whether the GPR is 25 according to llvm-mca. I'll benchmark both of them and report.
libc/src/string/memory_utils/op_x86.h
@nafi3000 nafi3000 Jan 6, 2024
Consider doing this after calling the `_mm256_movemask_epi8`. Then a `rorx` (1 cycle) will be sufficient. E.g. ```cc static inline uint32_t SwapWords(uint32_t x) { return (x << 16) | (x >> 16); } int cmp_neq(__m256i a, __m256i b) { __m256i vmax = _mm256_max_epu8(a, b); __m256i a_le_b = _mm256_cmpeq_epi8(vmax, b); __m256i a_ge_b = _mm256_cmpeq_epi8(vmax, a); const __m256i indices = _mm256_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, // 8, 9, 10, 11, 12, 13, 14, 15, // 0, 1, 2, 3, 4, 5, 6, 7, // 8, 9, 10, 11, 12, 13, 14, 15); uint32_t le = SwapWords(_mm256_movemask_epi8(_mm256_shuffle_epi8(a_le_b, indices))); uint32_t ge = SwapWords(_mm256_movemask_epi8(_mm256_shuffle_epi8(a_ge_b, indices))); return le < ge ? 5 : -5; } ``` https://godbolt.org/z/6bhjE35j3 Version|llvm-mca|skylake latency|znver3 latency|skylake latency without mask loading|znver3 latency without mask loading :-:|:-:|:-:|:-:|:-:|:-: [With bug](https://godbolt.org/z/ean49e9Pn)|[link](https://godbolt.org/z/d5Ej6K5PP)|13|13|8|7 [#else version above](https://godbolt.org/z/nj8o9eK5M)|[link](https://godbolt.org/z/rPxcnKzhh)|17|19|12|13 [#if version above](https://godbolt.org/z/jesrK8568)|[link](https://godbolt.org/z/r77bdb4fq)|15|17|10|10 [shuffle-movemask-rorx version](https://godbolt.org/z/6bhjE35j3)|[link](https://godbolt.org/z/11rbWzbnv)|14|14|9|8 I have tested the shuffle-movemask-rorx version: https://godbolt.org/z/as5Kqjof5. The tests fail without the `rorx` fix: https://godbolt.org/z/ccndGYM6z. What do you think?
Outdated
libc/src/string/memory_utils/op_x86.h
Resolved conversations (2)
@michaelrj-google michaelrj-google Jan 5, 2024
nit: I think `ASSERT_GE(..., 1)` is clearer
Outdated
libc/test/src/string/memcmp_test.cpp
@topperc topperc Jan 5, 2024
accross -> across
Outdated
libc/src/string/memory_utils/op_x86.h