Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement NEON vector algorithms #2597

Closed
wants to merge 5 commits into from
Closed

Conversation

cbezault
Copy link
Contributor

@cbezault cbezault commented Feb 27, 2022

Fixes #813
This PR adds NEON versions of all the vector algorithms currently implemented for SSE/AVX2.

Benchmarks

Benchmark using google benchmark and borrowed from @BillyONeal all run on an SQ1 Surface Pro X.

Summary

Interestingly it appears that for small sizes and for unsigned long long the cost of crossing the ARM64EC thunk boundary potentially outweighs the gains that a native implementation provides. However, the runs of the x64 binary had a lot more variation than the runs of the native ARM64 binaries and the results of any individual run should be taken with a grain of salt (note the differences between the runs of BenchUnsignedIntSseReverse which was unchanged between the two test runs).

Benchmarks

Benchmark Code
#include <algorithm>
#include <benchmark/benchmark.h>
#include <deque>
#include <functional>
#include <list>
#include <numeric>
#include <stdlib.h>
#include <utility>
#include <vector>

using namespace std;

void verify(bool b)
{
  if (!b)
  {
    exit(1);
  }
}

template <class _BidIt>
void plain_bidi_reverse(_BidIt _First, _BidIt _Last)
{
  for (; _First != _Last && _First != --_Last; ++_First)
  {
    const auto _Temp = *_First;
    *_First = *_Last;
    *_Last = _Temp;
  }
}

template <class Container, class TestedFn>
inline void RunTest(benchmark::State &state, size_t dataSize, TestedFn fn)
{
  Container data(dataSize);
  iota(data.begin(), data.end(),
       static_cast<typename Container::value_type>(1));
  fn(data);
  verify(is_sorted(data.begin(), data.end(), greater<>{}));
  fn(data);
  verify(is_sorted(data.begin(), data.end(), less<>{}));
  for (auto _ : state)
  {
    (void)_;
    fn(data);
  }
}

template <class Container>
void BenchPlainBidiReverse(benchmark::State &state)
{
  RunTest<Container>(state, static_cast<size_t>(state.range(0)),
                     [](auto &c)
                     { plain_bidi_reverse(c.begin(), c.end()); });
}

template <class Container>
void BenchStdReverse(benchmark::State &state)
{
  RunTest<Container>(state, static_cast<size_t>(state.range(0)),
                     [](auto &c)
                     { reverse(c.begin(), c.end()); });
}

BENCHMARK_TEMPLATE(BenchStdReverse, deque<unsigned int>)->Range(8, 100'000);
BENCHMARK_TEMPLATE(BenchStdReverse, list<unsigned int>)->Range(8, 100'000);

BENCHMARK_TEMPLATE(BenchPlainBidiReverse, vector<unsigned char>)->Range(8, 255);
BENCHMARK_TEMPLATE(BenchStdReverse, vector<unsigned char>)->Range(8, 255);
BENCHMARK_TEMPLATE(BenchPlainBidiReverse, vector<unsigned short>)->Range(8, 65535);
BENCHMARK_TEMPLATE(BenchStdReverse, vector<unsigned short>)->Range(8, 65535);
BENCHMARK_TEMPLATE(BenchPlainBidiReverse, vector<unsigned int>)->Range(8, 100'000);
BENCHMARK_TEMPLATE(BenchStdReverse, vector<unsigned int>)->Range(8, 100'000);

#include <arm64_neon.h>
extern "C" void _cdecl __std_neon_reverse_trivially_copyable_4(
    unsigned int *_First, unsigned int *_Last) throw()
{
  if (_Last - _First >= 8)
  {
    unsigned int *_Stop_at = _First + ((_Last - _First) >> 3 << 2);
    do
    {
      _Last -= 4;
      // 128-bit loads
      const __n128 _Left = neon_ld1r_q32(reinterpret_cast<__int32 *>(_First));
      const __n128 _Right = neon_ld1r_q32(reinterpret_cast<__int32 *>(_Last));
      // Reverse the bytes of each 64-bit DWORDs
      const __n128 _Left_dword_reversed = neon_rev64q_32(_Left);
      const __n128 _Right_dword_reversed = neon_rev64q_32(_Right);
      // Swap the 64-bit DWORDS
      const __n128 _Left_reversed = neon_extq64(_Left_dword_reversed, _Left_dword_reversed, 1);
      const __n128 _Right_reversed = neon_extq64(_Right_dword_reversed, _Right_dword_reversed, 1);
      // 128-bit stores
      neon_st1m_q32(reinterpret_cast<__int32 *>(_Last), _Left_reversed);
      neon_st1m_q32(reinterpret_cast<__int32 *>(_First), _Right_reversed);
      _First += 4;
    } while (_First != _Stop_at);
  }

  for (; _First != _Last && _First != --_Last; ++_First)
  {
    const unsigned int _Temp = *_First;
    *_First = *_Last;
    *_Last = _Temp;
  }
}

void BenchUnsignedIntNeonReverse(benchmark::State &state)
{
  RunTest<vector<unsigned int>>(state, static_cast<size_t>(state.range(0)), [](auto &c)
                                { __std_neon_reverse_trivially_copyable_4(&*c.begin(), &*c.end()); });
}

BENCHMARK(BenchUnsignedIntNeonReverse)->Range(8, 100'000);

extern "C" void _cdecl __std_neon_unrolled1_reverse_trivially_copyable_4(
    unsigned int *_First, unsigned int *_Last) throw()
{
  if (_Last - _First >= 16)
  {
    unsigned int *_Stop_at = _First + ((_Last - _First) >> 4 << 3);
    do
    {
      _Last -= 8;

      const __n128x2 _Left = neon_ld1m2_q32(reinterpret_cast<__int32 *>(_First));
      const __n128x2 _Right = neon_ld1m2_q32(reinterpret_cast<__int32 *>(_Last));
      const __n128 _Left_dword_reversed1 = neon_rev64q_32(_Left.val[0]);
      const __n128 _Left_dword_reversed2 = neon_rev64q_32(_Left.val[1]);
      const __n128 _Right_dword_reversed2 = neon_rev64q_32(_Right.val[0]);
      const __n128 _Right_dword_reversed1 = neon_rev64q_32(_Right.val[1]);
      const __n128 _Left_reversed1 = neon_extq64(_Left_dword_reversed1, _Left_dword_reversed1, 1);
      const __n128 _Left_reversed2 = neon_extq64(_Left_dword_reversed2, _Left_dword_reversed2, 1);
      const __n128 _Right_reversed2 = neon_extq64(_Right_dword_reversed2, _Right_dword_reversed2, 1);
      const __n128 _Right_reversed1 = neon_extq64(_Right_dword_reversed1, _Right_dword_reversed1, 1);
      neon_st1m2_q32(reinterpret_cast<__int32 *>(_Last), __n128x2{_Left_reversed2, _Left_reversed1});
      neon_st1m2_q32(reinterpret_cast<__int32 *>(_First), __n128x2{_Right_reversed1, _Right_reversed2});
      _First += 8;
    } while (_First != _Stop_at);
  }

  if (_Last - _First >= 8)
  {
    unsigned int *_Stop_at = _First + ((_Last - _First) >> 3 << 2);
    do
    {
      _Last -= 4;
      // 128-bit loads
      const __n128 _Left = neon_ld1r_q32(reinterpret_cast<__int32 *>(_First));
      const __n128 _Right = neon_ld1r_q32(reinterpret_cast<__int32 *>(_Last));
      // Reverse the bytes of each 64-bit DWORDs
      const __n128 _Left_dword_reversed = neon_rev64q_32(_Left);
      const __n128 _Right_dword_reversed = neon_rev64q_32(_Right);
      // Swap the 64-bit DWORDS
      const __n128 _Left_reversed = neon_extq64(_Left_dword_reversed, _Left_dword_reversed, 1);
      const __n128 _Right_reversed = neon_extq64(_Right_dword_reversed, _Right_dword_reversed, 1);
      // 128-bit stores
      neon_st1m_q32(reinterpret_cast<__int32 *>(_Last), _Left_reversed);
      neon_st1m_q32(reinterpret_cast<__int32 *>(_First), _Right_reversed);
      _First += 4;
    } while (_First != _Stop_at);
  }

  for (; _First != _Last && _First != --_Last; ++_First)
  {
    const unsigned int _Temp = *_First;
    *_First = *_Last;
    *_Last = _Temp;
  }
}

void BenchUnsignedIntNeonUnrolledReverse(benchmark::State &state)
{
  RunTest<vector<unsigned int>>(state, static_cast<size_t>(state.range(0)), [](auto &c)
                                { __std_neon_unrolled1_reverse_trivially_copyable_4(&*c.begin(), &*c.end()); });
}

BENCHMARK(BenchUnsignedIntNeonUnrolledReverse)->Range(8, 100'000);

BENCHMARK_TEMPLATE(BenchPlainBidiReverse, vector<unsigned long long>)
    ->Range(8, 100'000);
BENCHMARK_TEMPLATE(BenchStdReverse, vector<unsigned long long>)
    ->Range(8, 100'000);

BENCHMARK_MAIN();
Old Results (ARM64 native)
---------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations
---------------------------------------------------------------------------------------------------
BenchStdReverse<vector<unsigned char>>/8                       4.65 ns         4.59 ns    160000000
BenchStdReverse<vector<unsigned char>>/64                      29.5 ns         29.5 ns     24888889
BenchStdReverse<vector<unsigned char>>/255                      115 ns          114 ns      5600000
BenchStdReverse<vector<unsigned short>>/8                      4.47 ns         4.45 ns    154482759
BenchStdReverse<vector<unsigned short>>/64                     29.6 ns         29.5 ns     24888889
BenchStdReverse<vector<unsigned short>>/512                     217 ns          220 ns      2986667
BenchStdReverse<vector<unsigned short>>/4096                   1893 ns         1880 ns       407273
BenchStdReverse<vector<unsigned short>>/32768                 14067 ns        13951 ns        44800
BenchStdReverse<vector<unsigned short>>/65535                 28419 ns        27902 ns        22400
BenchStdReverse<vector<unsigned int>>/8                        4.81 ns         4.85 ns    154482759
BenchStdReverse<vector<unsigned int>>/64                       30.2 ns         30.5 ns     23578947
BenchStdReverse<vector<unsigned int>>/512                       246 ns          246 ns      2986667
BenchStdReverse<vector<unsigned int>>/4096                     1744 ns         1768 ns       344615
BenchStdReverse<vector<unsigned int>>/32768                   16464 ns        16322 ns        49778
BenchStdReverse<vector<unsigned int>>/100000                  43053 ns        43247 ns        14452
BenchUnsignedIntNeonReverse/8                                  4.65 ns         4.62 ns    172307692
BenchUnsignedIntNeonReverse/64                                 13.0 ns         13.0 ns     40727273
BenchUnsignedIntNeonReverse/512                                97.9 ns         98.4 ns      7466667
BenchUnsignedIntNeonReverse/4096                                753 ns          750 ns       896000
BenchUnsignedIntNeonReverse/32768                              7462 ns         7530 ns       107906
BenchUnsignedIntNeonReverse/100000                            18512 ns        18415 ns        37333
BenchUnsignedIntNeonUnrolledReverse/8                          4.20 ns         4.24 ns    165925926
BenchUnsignedIntNeonUnrolledReverse/64                         12.0 ns         12.2 ns     64000000
BenchUnsignedIntNeonUnrolledReverse/512                         104 ns          105 ns      6400000
BenchUnsignedIntNeonUnrolledReverse/4096                        711 ns          711 ns      1120000
BenchUnsignedIntNeonUnrolledReverse/32768                      5940 ns         5999 ns       112000
BenchUnsignedIntNeonUnrolledReverse/100000                    18114 ns        17997 ns        37333
BenchStdReverse<vector<unsigned long long>>/8                  4.12 ns         4.17 ns    172307692
BenchStdReverse<vector<unsigned long long>>/64                 26.3 ns         26.8 ns     28000000
BenchStdReverse<vector<unsigned long long>>/512                 206 ns          205 ns      3200000
BenchStdReverse<vector<unsigned long long>>/4096               1653 ns         1674 ns       448000
BenchStdReverse<vector<unsigned long long>>/32768             13722 ns        13811 ns        49778
BenchStdReverse<vector<unsigned long long>>/100000            43225 ns        43316 ns        16593
New Results (ARM64 Native, std::reverse = NeonUnrolledReverse)
---------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations
---------------------------------------------------------------------------------------------------
BenchStdReverse<vector<unsigned char>>/8                       6.11 ns         6.14 ns    112000000
BenchStdReverse<vector<unsigned char>>/64                      4.83 ns         4.74 ns    112000000
BenchStdReverse<vector<unsigned char>>/255                     24.1 ns         24.1 ns     29866667
BenchStdReverse<vector<unsigned short>>/8                      6.02 ns         6.00 ns    112000000
BenchStdReverse<vector<unsigned short>>/64                     6.11 ns         6.14 ns    112000000
BenchStdReverse<vector<unsigned short>>/512                    44.1 ns         43.5 ns     15448276
BenchStdReverse<vector<unsigned short>>/4096                    349 ns          353 ns      2036364
BenchStdReverse<vector<unsigned short>>/32768                  2750 ns         2727 ns       263529
BenchStdReverse<vector<unsigned short>>/65535                  6456 ns         6417 ns       112000
BenchStdReverse<vector<unsigned int>>/8                        4.02 ns         4.01 ns    179200000
BenchStdReverse<vector<unsigned int>>/64                       11.5 ns         11.5 ns     64000000
BenchStdReverse<vector<unsigned int>>/512                      88.4 ns         87.9 ns      7466667
BenchStdReverse<vector<unsigned int>>/4096                      688 ns          680 ns       896000
BenchStdReverse<vector<unsigned int>>/32768                    5779 ns         5859 ns       112000
BenchStdReverse<vector<unsigned int>>/100000                  17519 ns        17648 ns        40727
BenchUnsignedIntNeonReverse/8                                  3.87 ns         3.84 ns    179200000
BenchUnsignedIntNeonReverse/64                                 11.3 ns         11.2 ns     64000000
BenchUnsignedIntNeonReverse/512                                89.0 ns         87.9 ns      7466667
BenchUnsignedIntNeonReverse/4096                                766 ns          767 ns       896000
BenchUnsignedIntNeonReverse/32768                              5993 ns         5999 ns       112000
BenchUnsignedIntNeonReverse/100000                            18191 ns        18415 ns        37333
BenchUnsignedIntNeonUnrolledReverse/8                          3.82 ns         3.84 ns    179200000
BenchUnsignedIntNeonUnrolledReverse/64                         11.7 ns         11.7 ns     64000000
BenchUnsignedIntNeonUnrolledReverse/512                        89.1 ns         87.9 ns      7466667
BenchUnsignedIntNeonUnrolledReverse/4096                        702 ns          711 ns      1120000
BenchUnsignedIntNeonUnrolledReverse/32768                      5692 ns         5720 ns       112000
BenchUnsignedIntNeonUnrolledReverse/100000                    18090 ns        18090 ns        40727
BenchStdReverse<vector<unsigned long long>>/8                  4.03 ns         4.08 ns    172307692
BenchStdReverse<vector<unsigned long long>>/64                 16.5 ns         16.4 ns     44800000
BenchStdReverse<vector<unsigned long long>>/512                 131 ns          129 ns      4977778
BenchStdReverse<vector<unsigned long long>>/4096               1116 ns         1123 ns       640000
BenchStdReverse<vector<unsigned long long>>/32768              8808 ns         8789 ns        74667
BenchStdReverse<vector<unsigned long long>>/100000            27140 ns        26995 ns        24889
Old Results (x64 on ARM64, std::reverse = SseReverse)
BenchStdReverse<vector<unsigned char>>/8                       8.14 ns         8.20 ns     89600000
BenchStdReverse<vector<unsigned char>>/64                      7.45 ns         7.32 ns     89600000
BenchStdReverse<vector<unsigned char>>/255                     33.6 ns         34.4 ns     21333333
BenchStdReverse<vector<unsigned short>>/8                      7.79 ns         7.85 ns     89600000
BenchStdReverse<vector<unsigned short>>/64                     10.3 ns         10.3 ns     74666667
BenchStdReverse<vector<unsigned short>>/512                    47.7 ns         47.6 ns     14451613
BenchStdReverse<vector<unsigned short>>/4096                    350 ns          345 ns      1947826
BenchStdReverse<vector<unsigned short>>/32768                  2808 ns         2825 ns       248889
BenchStdReverse<vector<unsigned short>>/65535                  5973 ns         5999 ns       112000
BenchStdReverse<vector<unsigned int>>/8                        4.49 ns         4.45 ns    154482759
BenchStdReverse<vector<unsigned int>>/64                       14.3 ns         14.6 ns     44800000
BenchStdReverse<vector<unsigned int>>/512                      89.3 ns         88.9 ns      8960000
BenchStdReverse<vector<unsigned int>>/4096                      701 ns          698 ns      1120000
BenchStdReverse<vector<unsigned int>>/32768                    5589 ns         5625 ns       100000
BenchStdReverse<vector<unsigned int>>/100000                  17345 ns        17264 ns        40727
BenchUnsignedIntSseReverse/8                                   5.73 ns         5.72 ns    112000000
BenchUnsignedIntSseReverse/64                                  10.7 ns         10.5 ns     64000000
BenchUnsignedIntSseReverse/512                                 76.2 ns         76.7 ns      8960000
BenchUnsignedIntSseReverse/4096                                 624 ns          625 ns      1000000
BenchUnsignedIntSseReverse/32768                               5102 ns         5156 ns       100000
BenchUnsignedIntSseReverse/100000                             15445 ns        15695 ns        44800
BenchStdReverse<vector<unsigned long long>>/8                  6.01 ns         6.00 ns    112000000
BenchStdReverse<vector<unsigned long long>>/64                 20.1 ns         19.9 ns     34461538
BenchStdReverse<vector<unsigned long long>>/512                 134 ns          135 ns      4977778
BenchStdReverse<vector<unsigned long long>>/4096               1041 ns         1025 ns       746667
BenchStdReverse<vector<unsigned long long>>/32768              9676 ns         9626 ns        74667
BenchStdReverse<vector<unsigned long long>>/100000            30541 ns        30483 ns        23579
New Results (x64 on ARM64, std::reverse = Native NEON)
BenchStdReverse<vector<unsigned char>>/8                       8.70 ns         8.58 ns     74666667
BenchStdReverse<vector<unsigned char>>/64                      7.54 ns         7.67 ns    112000000
BenchStdReverse<vector<unsigned char>>/255                     36.5 ns         36.8 ns     20363636
BenchStdReverse<vector<unsigned short>>/8                      8.02 ns         8.02 ns     89600000
BenchStdReverse<vector<unsigned short>>/64                     10.3 ns         10.3 ns     64000000
BenchStdReverse<vector<unsigned short>>/512                    48.3 ns         49.2 ns     14933333
BenchStdReverse<vector<unsigned short>>/4096                    353 ns          353 ns      2036364
BenchStdReverse<vector<unsigned short>>/32768                  2813 ns         2846 ns       263529
BenchStdReverse<vector<unsigned short>>/65535                  5902 ns         5999 ns       112000
BenchStdReverse<vector<unsigned int>>/8                        5.07 ns         5.16 ns    100000000
BenchStdReverse<vector<unsigned int>>/64                       13.2 ns         13.1 ns     56000000
BenchStdReverse<vector<unsigned int>>/512                      79.3 ns         80.2 ns      8960000
BenchStdReverse<vector<unsigned int>>/4096                      610 ns          614 ns      1120000
BenchStdReverse<vector<unsigned int>>/32768                    5071 ns         5156 ns       100000
BenchStdReverse<vector<unsigned int>>/100000                  15918 ns        16044 ns        44800
BenchUnsignedIntSseReverse/8                                   5.74 ns         5.72 ns    112000000
BenchUnsignedIntSseReverse/64                                  10.7 ns         10.9 ns     74666667
BenchUnsignedIntSseReverse/512                                 77.4 ns         78.5 ns      8960000
BenchUnsignedIntSseReverse/4096                                 724 ns          725 ns      1120000
BenchUnsignedIntSseReverse/32768                               5642 ns         5580 ns       112000
BenchUnsignedIntSseReverse/100000                             18198 ns        17648 ns        40727
BenchStdReverse<vector<unsigned long long>>/8                  6.38 ns         6.28 ns    112000000
BenchStdReverse<vector<unsigned long long>>/64                 23.8 ns         23.5 ns     29866667
BenchStdReverse<vector<unsigned long long>>/512                 144 ns          142 ns      4072727
BenchStdReverse<vector<unsigned long long>>/4096               1134 ns         1130 ns       746667
BenchStdReverse<vector<unsigned long long>>/32768             10490 ns        10324 ns        56000
BenchStdReverse<vector<unsigned long long>>/100000            33041 ns        33692 ns        19478

Codegen

Summary

The codegen looks quite good and is comparable to gcc/llvm codegen except for some random movs that get inserted. (See this godbolt link). LLVM prefers to emit ldp/stp instructions with quad registers vs. the ld1/st1 instructions with explicit vector registers I ended up having to use and that gcc uses. If I force LLVM to use ld1 it does not interleave the loads with the rev64 instruction and it ends up looking almost identical to the gcc codegen.

Code

Code rewritten for LLVM/GCC
#include <arm_neon.h>

extern "C" void __std_neon_unrolled1_reverse_trivially_copyable_4(
    unsigned int *_First, unsigned int *_Last) throw()
{
  if (_Last - _First >= 16)
  {
    unsigned int *_Stop_at = _First + ((_Last - _First) >> 6 << 5);
    do
    {
      _Last -= 8;
      // 128-bit loads
#ifdef __llvm__
      const uint64x2_t Left1 = (uint64x2_t)vldrq_p128(_First);
      const uint64x2_t Left2 = (uint64x2_t)vldrq_p128(_First + 4);
      const uint64x2_t Right2 = (uint64x2_t)vldrq_p128(_Last);
      const uint64x2_t Right1 = (uint64x2_t)vldrq_p128(_Last + 4);
      const uint64x2x2_t _Left = {Left1[0], Left1[1], Left2[0], Left2[1]};
      const uint64x2x2_t _Right = {Right1[0], Right1[1], Right2[0], Right2[1]};
#else
      const uint64x2x2_t _Left = vld1q_u64_x2((uint64_t*)_First);
      const uint64x2x2_t _Right = vld1q_u64_x2((uint64_t*)_Last);
#endif
      // Reverse the bytes of each 64-bit DWORDs
      uint32x4_t _Left_dword_reversed1 = vrev64q_u32((uint32x4_t)_Left.val[0]);
      uint32x4_t _Left_dword_reversed2 = vrev64q_u32((uint32x4_t)_Left.val[1]);
      uint32x4_t _Right_dword_reversed2 = vrev64q_u32((uint32x4_t)_Right.val[0]);
      uint32x4_t _Right_dword_reversed1 = vrev64q_u32((uint32x4_t)_Right.val[1]);
      // Swap the 64-bit DWORDS
      const uint64x2_t _Left_reversed1 = vextq_u64((uint64x2_t)_Left_dword_reversed1, (uint64x2_t)_Left_dword_reversed1, 1);
      const uint64x2_t _Left_reversed2 = vextq_u64((uint64x2_t)_Left_dword_reversed2, (uint64x2_t)_Left_dword_reversed2, 1);
      const uint64x2_t _Right_reversed2 = vextq_u64((uint64x2_t)_Right_dword_reversed2, (uint64x2_t)_Right_dword_reversed2, 1);
      const uint64x2_t _Right_reversed1 = vextq_u64((uint64x2_t)_Right_dword_reversed1, (uint64x2_t)_Right_dword_reversed1, 1);
      // 128-bit stores
#ifdef __llvm__
      vstrq_p128((poly128_t*)_Last, (poly128_t)_Left_reversed2);
      vstrq_p128((poly128_t*)(_Last + 4), (poly128_t)_Left_reversed1);
      vstrq_p128((poly128_t*)_First, (poly128_t)_Right_reversed1);
      vstrq_p128((poly128_t*)(_First + 4), (poly128_t)_Right_reversed2);
#else
      vst1q_u64_x2((uint64_t*)_Last, uint64x2x2_t{_Left_reversed2, _Left_reversed1});
      vst1q_u64_x2((uint64_t*)_First, uint64x2x2_t{_Right_reversed1, _Right_reversed2});
#endif

      _First += 8;
    } while (_First != _Stop_at);
  }

  for (; _First != _Last && _First != --_Last; ++_First)
  {
    const unsigned int _Temp = *_First;
    *_First = *_Last;
    *_Last = _Temp;
  }
}
MSVC codegen
|$LL4@std_neon_u|
	ld1         {v0.4s,v1.4s},[x0]
	sub         x1,x1,#0x20
	mov         v16.16b,v1.16b
	mov         v17.16b,v0.16b
	ld1         {v0.4s,v1.4s},[x1]
	rev64       v16.4s,v16.4s
	rev64       v17.4s,v17.4s
	mov         v18.16b,v1.16b
	mov         v19.16b,v0.16b
	ext8        v0.16b,v16.16b,v16.16b,#8
	ext8        v1.16b,v17.16b,v17.16b,#8
	rev64       v19.4s,v19.4s
	rev64       v18.4s,v18.4s
	st1         {v0.4s,v1.4s},[x1]
	ext8        v1.16b,v19.16b,v19.16b,#8
	ext8        v0.16b,v18.16b,v18.16b,#8
	st1         {v0.4s,v1.4s},[x0]
	add         x0,x0,#0x20
	cmp         x0,x10
	bne         |$LL4@std_neon_u|
LLVM codegen
.LBB0_2:                                // =>This Inner Loop Header: Depth=1
        ldp     q0, q1, [x0]
        rev64   v0.4s, v0.4s
        ldp     q2, q3, [x1, #-32]!
        rev64   v1.4s, v1.4s
        ext     v0.16b, v0.16b, v0.16b, #8
        rev64   v2.4s, v2.4s
        rev64   v3.4s, v3.4s
        ext     v1.16b, v1.16b, v1.16b, #8
        ext     v2.16b, v2.16b, v2.16b, #8
        ext     v3.16b, v3.16b, v3.16b, #8
        stp     q1, q0, [x1]
        stp     q3, q2, [x0], #32
        cmp     x0, x8
        b.ne    .LBB0_2
GCC codegen
.L4:
        sub     x3, x3, #32
        ld1     {v2.2d - v3.2d}, [x2]
        ld1     {v0.2d - v1.2d}, [x3]
        rev64   v16.4s, v3.4s
        rev64   v2.4s, v2.4s
        rev64   v3.4s, v1.4s
        rev64   v0.4s, v0.4s
        ext     v4.16b, v16.16b, v16.16b, #8
        ext     v5.16b, v2.16b, v2.16b, #8
        ext     v6.16b, v3.16b, v3.16b, #8
        ext     v7.16b, v0.16b, v0.16b, #8
        st1     {v4.2d - v5.2d}, [x3]
        st1     {v6.2d - v7.2d}, [x2], 32
        cmp     x4, x2
        bne     .L4

llvm-mca

All results run with llvm-mca -all-stats --march=arm64 -mcpu=kryo -mattr=+kryo,+a76,+neon -dispatch=8 -lqueue=68 -squeue=72 -register-file-size=128 this doesn't exactly match reality as the SQ1 has a 160 entry reorder buffer not the 128 entry one that is specified for A76 processors but there isn't a super easy way to tune that parameter.

Summary

llvm-mca indicates that the unrolled version of our code will perform less well for all data sizes tested by a small margin.
This is primarily due to the unnecessary movs that are inserted by the compiler. If the assembly is directly modified to remove those movs we achieve parity and start outstripping the unrolled code when the data is of size 256 bytes or larger. This aligns more closely to the empirical results presented above.

llvm-mca output

Vanilla Assembly (not unrolled)
	sub         x1,x1,#0x10
	ld1r        {v16.4s},[x0]
	ld1r        {v17.4s},[x1]
	rev64       v16.4s,v16.4s
	rev64       v17.4s,v17.4s
	ext         v16.16b,v16.16b,v16.16b,#8
	ext         v17.16b,v17.16b,v17.16b,#8
	st1         {v16.4s},[x1]
	st1         {v17.4s},[x0]
	add         x0,x0,#0x10
	cmp         x0,x10
	bne         .loop
Unrolled Assembly
	ld1         {v0.4s,v1.4s},[x0]
	sub         x1,x1,#0x20
	mov         v16.16b,v1.16b
	mov         v17.16b,v0.16b
	ld1         {v0.4s,v1.4s},[x1]
	rev64       v16.4s,v16.4s
	rev64       v17.4s,v17.4s
	mov         v18.16b,v1.16b
	mov         v19.16b,v0.16b
	ext         v0.16b,v16.16b,v16.16b,#8
	ext         v1.16b,v17.16b,v17.16b,#8
	rev64       v19.4s,v19.4s
	rev64       v18.4s,v18.4s
	st1         {v0.4s,v1.4s},[x1]
	ext         v1.16b,v19.16b,v19.16b,#8
	ext         v0.16b,v18.16b,v18.16b,#8
	st1         {v0.4s,v1.4s},[x0]
	add         x0,x0,#0x20
	cmp         x0,x10
	bne         .loop
Vanilla results size 16 bytes
Iterations:        2
Instructions:      24
Total Cycles:      15
Total uOps:        38

Dispatch Width:    8
uOps Per Cycle:    2.53
IPC:               1.60
Block RThroughput: 2.4


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        sub	x1, x1, #16
 1      3     0.50    *                   ld1r	{ v16.4s }, [x0]
 1      3     0.50    *                   ld1r	{ v17.4s }, [x1]
 2      1     0.50                        rev64	v16.4s, v16.4s
 2      1     0.50                        rev64	v17.4s, v17.4s
 2      1     1.00                        ext	v16.16b, v16.16b, v16.16b, #8
 2      1     1.00                        ext	v17.16b, v17.16b, v17.16b, #8
 2      0     0.50           *            st1	{ v16.4s }, [x1]
 2      0     0.50           *            st1	{ v17.4s }, [x0]
 1      1     0.25                        add	x0, x0, #16
 2      2     0.50                        cmp	x0, x10
 1      1     0.25                        b.ne	.B0


Dynamic Dispatch Stall Cycles:
RAT     - Register unavailable:                      0
RCU     - Retire tokens unavailable:                 0
SCHEDQ  - Scheduler full:                            0
LQ      - Load queue full:                           0
SQ      - Store queue full:                          0
GROUP   - Static restrictions on the dispatch group: 0


Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
[# dispatched], [# cycles]
 0,              10  (66.7%)
 7,              2  (13.3%)
 8,              3  (20.0%)


Schedulers - number of cycles where we saw N micro opcodes issued:
[# issued], [# cycles]
 0,          4  (26.7%)
 1,          2  (13.3%)
 2,          2  (13.3%)
 4,          4  (26.7%)
 5,          2  (13.3%)
 6,          1  (6.7%)

Scheduler's queue usage:
No scheduler resources used.


Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
 0,           5  (33.3%)
 1,           4  (26.7%)
 2,           3  (20.0%)
 4,           1  (6.7%)
 5,           2  (13.3%)

Total ROB Entries:                128
Max Used ROB Entries:             37  ( 28.9% )
Average Used ROB Entries per cy:  17  ( 13.3% )


Register File statistics:
Total number of mappings created:    20
Max number of mappings used:         19


Resources:
[0]   - KryoUnitLSA
[1]   - KryoUnitLSB
[2]   - KryoUnitXA
[3]   - KryoUnitXB
[4]   - KryoUnitYA
[5]   - KryoUnitYB


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    
2.00   2.00   4.00   3.50   3.50   4.00   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    Instructions:
 -      -      -     0.50    -     0.50   sub	x1, x1, #16
 -     1.00    -      -      -      -     ld1r	{ v16.4s }, [x0]
1.00    -      -      -      -      -     ld1r	{ v17.4s }, [x1]
 -      -     1.00    -      -     1.00   rev64	v16.4s, v16.4s
 -      -      -      -     2.00    -     rev64	v17.4s, v17.4s
 -      -     1.00   1.00    -      -     ext	v16.16b, v16.16b, v16.16b, #8
 -      -     1.00   1.00    -      -     ext	v17.16b, v17.16b, v17.16b, #8
 -     1.00    -      -      -     1.00   st1	{ v16.4s }, [x1]
1.00    -      -      -     1.00    -     st1	{ v17.4s }, [x0]
 -      -      -     0.50   0.50    -     add	x0, x0, #16
 -      -     1.00    -      -     1.00   cmp	x0, x10
 -      -      -     0.50    -     0.50   b.ne	.B0

Unrolled results size 16 bytes
Iterations:        1
Instructions:      20
Total Cycles:      17
Total uOps:        43

Dispatch Width:    8
uOps Per Cycle:    2.53
IPC:               1.18
Block RThroughput: 5.8


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 2      3     1.00    *                   ld1	{ v0.4s, v1.4s }, [x0]
 1      1     0.25                        sub	x1, x1, #32
 2      1     0.50                        mov	v16.16b, v1.16b
 2      1     0.50                        mov	v17.16b, v0.16b
 2      3     1.00    *                   ld1	{ v0.4s, v1.4s }, [x1]
 2      1     0.50                        rev64	v16.4s, v16.4s
 2      1     0.50                        rev64	v17.4s, v17.4s
 2      1     0.50                        mov	v18.16b, v1.16b
 2      1     0.50                        mov	v19.16b, v0.16b
 2      1     1.00                        ext	v0.16b, v16.16b, v16.16b, #8
 2      1     1.00                        ext	v1.16b, v17.16b, v17.16b, #8
 2      1     0.50                        rev64	v19.4s, v19.4s
 2      1     0.50                        rev64	v18.4s, v18.4s
 5      1     1.00           *            st1	{ v0.4s, v1.4s }, [x1]
 2      1     1.00                        ext	v1.16b, v19.16b, v19.16b, #8
 2      1     1.00                        ext	v0.16b, v18.16b, v18.16b, #8
 5      1     1.00           *            st1	{ v0.4s, v1.4s }, [x0]
 1      1     0.25                        add	x0, x0, #32
 2      2     0.50                        cmp	x0, x10
 1      1     0.25                        b.ne	.loop


Dynamic Dispatch Stall Cycles:
RAT     - Register unavailable:                      0
RCU     - Retire tokens unavailable:                 0
SCHEDQ  - Scheduler full:                            0
LQ      - Load queue full:                           0
SQ      - Store queue full:                          0
GROUP   - Static restrictions on the dispatch group: 0


Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
[# dispatched], [# cycles]
 0,              10  (58.8%)
 1,              1  (5.9%)
 4,              1  (5.9%)
 7,              2  (11.8%)
 8,              3  (17.6%)


Schedulers - number of cycles where we saw N micro opcodes issued:
[# issued], [# cycles]
 0,          5  (29.4%)
 1,          1  (5.9%)
 2,          2  (11.8%)
 3,          1  (5.9%)
 4,          5  (29.4%)
 5,          3  (17.6%)

Scheduler's queue usage:
No scheduler resources used.


Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
 0,           6  (35.3%)
 1,           4  (23.5%)
 2,           5  (29.4%)
 3,           2  (11.8%)

Total ROB Entries:                128
Max Used ROB Entries:             39  ( 30.5% )
Average Used ROB Entries per cy:  18  ( 14.1% )


Register File statistics:
Total number of mappings created:    18
Max number of mappings used:         16


Resources:
[0]   - KryoUnitLSA
[1]   - KryoUnitLSB
[2]   - KryoUnitXA
[3]   - KryoUnitXB
[4]   - KryoUnitYA
[5]   - KryoUnitYB


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    
4.00   4.00   8.00   10.00  8.00   9.00   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    Instructions:
 -     2.00    -      -      -      -     ld1	{ v0.4s, v1.4s }, [x0]
 -      -      -      -      -     1.00   sub	x1, x1, #32
 -      -      -      -     2.00    -     mov	v16.16b, v1.16b
 -      -      -     2.00    -      -     mov	v17.16b, v0.16b
2.00    -      -      -      -      -     ld1	{ v0.4s, v1.4s }, [x1]
 -      -     2.00    -      -      -     rev64	v16.4s, v16.4s
 -      -      -      -      -     2.00   rev64	v17.4s, v17.4s
 -      -      -      -     2.00    -     mov	v18.16b, v1.16b
 -      -      -     2.00    -      -     mov	v19.16b, v0.16b
 -      -     2.00    -      -      -     ext	v0.16b, v16.16b, v16.16b, #8
 -      -      -     2.00    -      -     ext	v1.16b, v17.16b, v17.16b, #8
 -      -      -      -      -     2.00   rev64	v19.4s, v19.4s
 -      -      -      -     2.00    -     rev64	v18.4s, v18.4s
 -     2.00   1.00    -      -     2.00   st1	{ v0.4s, v1.4s }, [x1]
 -      -      -     2.00    -      -     ext	v1.16b, v19.16b, v19.16b, #8
 -      -     2.00    -      -      -     ext	v0.16b, v18.16b, v18.16b, #8
2.00    -      -      -     1.00   2.00   st1	{ v0.4s, v1.4s }, [x0]
 -      -      -      -     1.00    -     add	x0, x0, #32
 -      -      -     2.00    -      -     cmp	x0, x10
 -      -     1.00    -      -      -     b.ne	.loop

Vanilla results size 32 bytes
Iterations:        4
Instructions:      48
Total Cycles:      21
Total uOps:        76

Dispatch Width:    8
uOps Per Cycle:    3.62
IPC:               2.29
Block RThroughput: 2.4


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        sub	x1, x1, #16
 1      3     0.50    *                   ld1r	{ v16.4s }, [x0]
 1      3     0.50    *                   ld1r	{ v17.4s }, [x1]
 2      1     0.50                        rev64	v16.4s, v16.4s
 2      1     0.50                        rev64	v17.4s, v17.4s
 2      1     1.00                        ext	v16.16b, v16.16b, v16.16b, #8
 2      1     1.00                        ext	v17.16b, v17.16b, v17.16b, #8
 2      0     0.50           *            st1	{ v16.4s }, [x1]
 2      0     0.50           *            st1	{ v17.4s }, [x0]
 1      1     0.25                        add	x0, x0, #16
 2      2     0.50                        cmp	x0, x10
 1      1     0.25                        b.ne	.B0


Dynamic Dispatch Stall Cycles:
RAT     - Register unavailable:                      0
RCU     - Retire tokens unavailable:                 0
SCHEDQ  - Scheduler full:                            0
LQ      - Load queue full:                           0
SQ      - Store queue full:                          0
GROUP   - Static restrictions on the dispatch group: 0


Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
[# dispatched], [# cycles]
 0,              11  (52.4%)
 7,              4  (19.0%)
 8,              6  (28.6%)


Schedulers - number of cycles where we saw N micro opcodes issued:
[# issued], [# cycles]
 0,          2  (9.5%)
 1,          1  (4.8%)
 2,          3  (14.3%)
 3,          2  (9.5%)
 4,          6  (28.6%)
 5,          5  (23.8%)
 6,          1  (4.8%)
 8,          1  (4.8%)

Scheduler's queue usage:
No scheduler resources used.


Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
 0,           5  (23.8%)
 1,           5  (23.8%)
 2,           5  (23.8%)
 3,           1  (4.8%)
 4,           1  (4.8%)
 5,           2  (9.5%)
 6,           1  (4.8%)
 10,          1  (4.8%)

Total ROB Entries:                128
Max Used ROB Entries:             52  ( 40.6% )
Average Used ROB Entries per cy:  31  ( 24.2% )


Register File statistics:
Total number of mappings created:    40
Max number of mappings used:         27


Resources:
[0]   - KryoUnitLSA
[1]   - KryoUnitLSB
[2]   - KryoUnitXA
[3]   - KryoUnitXB
[4]   - KryoUnitYA
[5]   - KryoUnitYB


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    
2.00   2.00   3.75   4.00   3.50   3.75   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    Instructions:
 -      -      -     0.50    -     0.50   sub	x1, x1, #16
0.25   0.75    -      -      -      -     ld1r	{ v16.4s }, [x0]
1.00    -      -      -      -      -     ld1r	{ v17.4s }, [x1]
 -      -     0.50    -     0.50   1.00   rev64	v16.4s, v16.4s
 -      -      -     0.50   1.00   0.50   rev64	v17.4s, v17.4s
 -      -     1.50   0.50    -      -     ext	v16.16b, v16.16b, v16.16b, #8
 -      -     1.00   1.00    -      -     ext	v17.16b, v17.16b, v17.16b, #8
 -     1.00    -      -     0.25   0.75   st1	{ v16.4s }, [x1]
0.75   0.25    -      -     0.50   0.50   st1	{ v17.4s }, [x0]
 -      -      -     0.50   0.50    -     add	x0, x0, #16
 -      -     0.50   0.50   0.50   0.50   cmp	x0, x10
 -      -     0.25   0.50   0.25    -     b.ne	.B0

Unrolled results size 32 bytes
Iterations:        2
Instructions:      40
Total Cycles:      24
Total uOps:        86

Dispatch Width:    8
uOps Per Cycle:    3.58
IPC:               1.67
Block RThroughput: 5.8


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 2      3     1.00    *                   ld1	{ v0.4s, v1.4s }, [x0]
 1      1     0.25                        sub	x1, x1, #32
 2      1     0.50                        mov	v16.16b, v1.16b
 2      1     0.50                        mov	v17.16b, v0.16b
 2      3     1.00    *                   ld1	{ v0.4s, v1.4s }, [x1]
 2      1     0.50                        rev64	v16.4s, v16.4s
 2      1     0.50                        rev64	v17.4s, v17.4s
 2      1     0.50                        mov	v18.16b, v1.16b
 2      1     0.50                        mov	v19.16b, v0.16b
 2      1     1.00                        ext	v0.16b, v16.16b, v16.16b, #8
 2      1     1.00                        ext	v1.16b, v17.16b, v17.16b, #8
 2      1     0.50                        rev64	v19.4s, v19.4s
 2      1     0.50                        rev64	v18.4s, v18.4s
 5      1     1.00           *            st1	{ v0.4s, v1.4s }, [x1]
 2      1     1.00                        ext	v1.16b, v19.16b, v19.16b, #8
 2      1     1.00                        ext	v0.16b, v18.16b, v18.16b, #8
 5      1     1.00           *            st1	{ v0.4s, v1.4s }, [x0]
 1      1     0.25                        add	x0, x0, #32
 2      2     0.50                        cmp	x0, x10
 1      1     0.25                        b.ne	.loop


Dynamic Dispatch Stall Cycles:
RAT     - Register unavailable:                      0
RCU     - Retire tokens unavailable:                 0
SCHEDQ  - Scheduler full:                            0
LQ      - Load queue full:                           0
SQ      - Store queue full:                          0
GROUP   - Static restrictions on the dispatch group: 0


Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
[# dispatched], [# cycles]
 0,              11  (45.8%)
 1,              1  (4.2%)
 4,              2  (8.3%)
 7,              3  (12.5%)
 8,              7  (29.2%)


Schedulers - number of cycles where we saw N micro opcodes issued:
[# issued], [# cycles]
 0,          4  (16.7%)
 2,          1  (4.2%)
 3,          1  (4.2%)
 4,          11  (45.8%)
 5,          6  (25.0%)
 7,          1  (4.2%)

Scheduler's queue usage:
No scheduler resources used.


Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
 0,           6  (25.0%)
 1,           5  (20.8%)
 2,           7  (29.2%)
 3,           4  (16.7%)
 4,           1  (4.2%)
 5,           1  (4.2%)

Total ROB Entries:                128
Max Used ROB Entries:             55  ( 43.0% )
Average Used ROB Entries per cy:  32  ( 25.0% )


Register File statistics:
Total number of mappings created:    36
Max number of mappings used:         23


Resources:
[0]   - KryoUnitLSA
[1]   - KryoUnitLSB
[2]   - KryoUnitXA
[3]   - KryoUnitXB
[4]   - KryoUnitYA
[5]   - KryoUnitYB


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    
4.00   4.00   8.50   9.00   8.50   9.00   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    Instructions:
 -     2.00    -      -      -      -     ld1	{ v0.4s, v1.4s }, [x0]
 -      -     0.50    -      -     0.50   sub	x1, x1, #32
 -      -      -      -     2.00    -     mov	v16.16b, v1.16b
 -      -      -     2.00    -      -     mov	v17.16b, v0.16b
2.00    -      -      -      -      -     ld1	{ v0.4s, v1.4s }, [x1]
 -      -     2.00    -      -      -     rev64	v16.4s, v16.4s
 -      -      -      -      -     2.00   rev64	v17.4s, v17.4s
 -      -      -      -     2.00    -     mov	v18.16b, v1.16b
 -      -      -     2.00    -      -     mov	v19.16b, v0.16b
 -      -     2.00    -      -      -     ext	v0.16b, v16.16b, v16.16b, #8
 -      -      -     2.00    -      -     ext	v1.16b, v17.16b, v17.16b, #8
 -      -      -      -      -     2.00   rev64	v19.4s, v19.4s
 -      -      -      -     2.00    -     rev64	v18.4s, v18.4s
 -     2.00   1.00    -      -     2.00   st1	{ v0.4s, v1.4s }, [x1]
 -      -      -     2.00    -      -     ext	v1.16b, v19.16b, v19.16b, #8
 -      -     2.00    -      -      -     ext	v0.16b, v18.16b, v18.16b, #8
2.00    -      -      -     1.00   2.00   st1	{ v0.4s, v1.4s }, [x0]
 -      -      -      -     1.00    -     add	x0, x0, #32
 -      -     1.00   1.00    -      -     cmp	x0, x10
 -      -      -      -     0.50   0.50   b.ne	.loop

Vanilla results size 64 bytes
Iterations:        8
Instructions:      96
Total Cycles:      36
Total uOps:        152

Dispatch Width:    8
uOps Per Cycle:    4.22
IPC:               2.67
Block RThroughput: 2.4


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        sub	x1, x1, #16
 1      3     0.50    *                   ld1r	{ v16.4s }, [x0]
 1      3     0.50    *                   ld1r	{ v17.4s }, [x1]
 2      1     0.50                        rev64	v16.4s, v16.4s
 2      1     0.50                        rev64	v17.4s, v17.4s
 2      1     1.00                        ext	v16.16b, v16.16b, v16.16b, #8
 2      1     1.00                        ext	v17.16b, v17.16b, v17.16b, #8
 2      0     0.50           *            st1	{ v16.4s }, [x1]
 2      0     0.50           *            st1	{ v17.4s }, [x0]
 1      1     0.25                        add	x0, x0, #16
 2      2     0.50                        cmp	x0, x10
 1      1     0.25                        b.ne	.B0


Dynamic Dispatch Stall Cycles:
RAT     - Register unavailable:                      0
RCU     - Retire tokens unavailable:                 0
SCHEDQ  - Scheduler full:                            0
LQ      - Load queue full:                           0
SQ      - Store queue full:                          0
GROUP   - Static restrictions on the dispatch group: 0


Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
[# dispatched], [# cycles]
 0,              16  (44.4%)
 7,              8  (22.2%)
 8,              12  (33.3%)


Schedulers - number of cycles where we saw N micro opcodes issued:
[# issued], [# cycles]
 0,          2  (5.6%)
 1,          1  (2.8%)
 2,          6  (16.7%)
 3,          4  (11.1%)
 4,          6  (16.7%)
 5,          8  (22.2%)
 6,          3  (8.3%)
 7,          3  (8.3%)
 8,          3  (8.3%)

Scheduler's queue usage:
No scheduler resources used.


Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
 0,           8  (22.2%)
 1,           7  (19.4%)
 2,           10  (27.8%)
 3,           2  (5.6%)
 4,           1  (2.8%)
 5,           3  (8.3%)
 6,           1  (2.8%)
 8,           1  (2.8%)
 10,          3  (8.3%)

Total ROB Entries:                128
Max Used ROB Entries:             86  ( 67.2% )
Average Used ROB Entries per cy:  48  ( 37.5% )


Register File statistics:
Total number of mappings created:    80
Max number of mappings used:         44


Resources:
[0]   - KryoUnitLSA
[1]   - KryoUnitLSB
[2]   - KryoUnitXA
[3]   - KryoUnitXB
[4]   - KryoUnitYA
[5]   - KryoUnitYB


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    
2.00   2.00   3.88   3.75   3.75   3.63   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    Instructions:
 -      -      -     0.38   0.25   0.38   sub	x1, x1, #16
0.38   0.63    -      -      -      -     ld1r	{ v16.4s }, [x0]
0.75   0.25    -      -      -      -     ld1r	{ v17.4s }, [x1]
 -      -     0.75    -     0.50   0.75   rev64	v16.4s, v16.4s
 -      -     0.50   0.25   1.00   0.25   rev64	v17.4s, v17.4s
 -      -     1.25   0.75    -      -     ext	v16.16b, v16.16b, v16.16b, #8
 -      -     0.75   1.25    -      -     ext	v17.16b, v17.16b, v17.16b, #8
0.38   0.63    -      -     0.25   0.75   st1	{ v16.4s }, [x1]
0.50   0.50    -      -     0.63   0.38   st1	{ v17.4s }, [x0]
 -      -     0.13   0.38   0.38   0.13   add	x0, x0, #16
 -      -     0.25   0.25   0.75   0.75   cmp	x0, x10
 -      -     0.25   0.50    -     0.25   b.ne	.B0

Unrolled results size 64 bytes
Iterations:        4
Instructions:      80
Total Cycles:      42
Total uOps:        172

Dispatch Width:    8
uOps Per Cycle:    4.10
IPC:               1.90
Block RThroughput: 5.8


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 2      3     1.00    *                   ld1	{ v0.4s, v1.4s }, [x0]
 1      1     0.25                        sub	x1, x1, #32
 2      1     0.50                        mov	v16.16b, v1.16b
 2      1     0.50                        mov	v17.16b, v0.16b
 2      3     1.00    *                   ld1	{ v0.4s, v1.4s }, [x1]
 2      1     0.50                        rev64	v16.4s, v16.4s
 2      1     0.50                        rev64	v17.4s, v17.4s
 2      1     0.50                        mov	v18.16b, v1.16b
 2      1     0.50                        mov	v19.16b, v0.16b
 2      1     1.00                        ext	v0.16b, v16.16b, v16.16b, #8
 2      1     1.00                        ext	v1.16b, v17.16b, v17.16b, #8
 2      1     0.50                        rev64	v19.4s, v19.4s
 2      1     0.50                        rev64	v18.4s, v18.4s
 5      1     1.00           *            st1	{ v0.4s, v1.4s }, [x1]
 2      1     1.00                        ext	v1.16b, v19.16b, v19.16b, #8
 2      1     1.00                        ext	v0.16b, v18.16b, v18.16b, #8
 5      1     1.00           *            st1	{ v0.4s, v1.4s }, [x0]
 1      1     0.25                        add	x0, x0, #32
 2      2     0.50                        cmp	x0, x10
 1      1     0.25                        b.ne	.loop


Dynamic Dispatch Stall Cycles:
RAT     - Register unavailable:                      0
RCU     - Retire tokens unavailable:                 0
SCHEDQ  - Scheduler full:                            0
LQ      - Load queue full:                           0
SQ      - Store queue full:                          0
GROUP   - Static restrictions on the dispatch group: 0


Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
[# dispatched], [# cycles]
 0,              17  (40.5%)
 1,              1  (2.4%)
 4,              4  (9.5%)
 7,              5  (11.9%)
 8,              15  (35.7%)


Schedulers - number of cycles where we saw N micro opcodes issued:
[# issued], [# cycles]
 0,          4  (9.5%)
 1,          2  (4.8%)
 2,          5  (11.9%)
 3,          2  (4.8%)
 4,          10  (23.8%)
 5,          8  (19.0%)
 6,          6  (14.3%)
 7,          4  (9.5%)
 10,          1  (2.4%)

Scheduler's queue usage:
No scheduler resources used.


Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
 0,           10  (23.8%)
 1,           9  (21.4%)
 2,           10  (23.8%)
 3,           7  (16.7%)
 4,           3  (7.1%)
 5,           1  (2.4%)
 6,           1  (2.4%)
 7,           1  (2.4%)

Total ROB Entries:                128
Max Used ROB Entries:             86  ( 67.2% )
Average Used ROB Entries per cy:  49  ( 38.3% )


Register File statistics:
Total number of mappings created:    72
Max number of mappings used:         36


Resources:
[0]   - KryoUnitLSA
[1]   - KryoUnitLSB
[2]   - KryoUnitXA
[3]   - KryoUnitXB
[4]   - KryoUnitYA
[5]   - KryoUnitYB


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    
4.00   4.00   8.50   9.00   8.75   8.75   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    Instructions:
0.50   1.50    -      -      -      -     ld1	{ v0.4s, v1.4s }, [x0]
 -      -     0.25    -     0.25   0.50   sub	x1, x1, #32
 -      -      -     1.00   0.50   0.50   mov	v16.16b, v1.16b
 -      -     0.50   1.00    -     0.50   mov	v17.16b, v0.16b
2.00    -      -      -      -      -     ld1	{ v0.4s, v1.4s }, [x1]
 -      -     1.00    -     1.00    -     rev64	v16.4s, v16.4s
 -      -     0.50   0.50    -     1.00   rev64	v17.4s, v17.4s
 -      -      -     0.50   0.50   1.00   mov	v18.16b, v1.16b
 -      -     1.00   1.00    -      -     mov	v19.16b, v0.16b
 -      -     1.00   1.00    -      -     ext	v0.16b, v16.16b, v16.16b, #8
 -      -     1.00   1.00    -      -     ext	v1.16b, v17.16b, v17.16b, #8
 -      -      -      -     1.00   1.00   rev64	v19.4s, v19.4s
 -      -      -      -     1.00   1.00   rev64	v18.4s, v18.4s
 -     2.00   0.50   0.25   1.50   0.75   st1	{ v0.4s, v1.4s }, [x1]
 -      -     0.50   1.50    -      -     ext	v1.16b, v19.16b, v19.16b, #8
 -      -     1.50   0.50    -      -     ext	v0.16b, v18.16b, v18.16b, #8
1.50   0.50    -     0.25   1.50   1.25   st1	{ v0.4s, v1.4s }, [x0]
 -      -      -      -     0.75   0.25   add	x0, x0, #32
 -      -     0.50   0.50   0.50   0.50   cmp	x0, x10
 -      -     0.25    -     0.25   0.50   b.ne	.loop
Vanilla results size 128 bytes
Iterations:        16
Instructions:      192
Total Cycles:      67
Total uOps:        304

Dispatch Width:    8
uOps Per Cycle:    4.54
IPC:               2.87
Block RThroughput: 2.4


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        sub	x1, x1, #16
 1      3     0.50    *                   ld1r	{ v16.4s }, [x0]
 1      3     0.50    *                   ld1r	{ v17.4s }, [x1]
 2      1     0.50                        rev64	v16.4s, v16.4s
 2      1     0.50                        rev64	v17.4s, v17.4s
 2      1     1.00                        ext	v16.16b, v16.16b, v16.16b, #8
 2      1     1.00                        ext	v17.16b, v17.16b, v17.16b, #8
 2      0     0.50           *            st1	{ v16.4s }, [x1]
 2      0     0.50           *            st1	{ v17.4s }, [x0]
 1      1     0.25                        add	x0, x0, #16
 2      2     0.50                        cmp	x0, x10
 1      1     0.25                        b.ne	.B0


Dynamic Dispatch Stall Cycles:
RAT     - Register unavailable:                      0
RCU     - Retire tokens unavailable:                 2  (3.0%)
SCHEDQ  - Scheduler full:                            0
LQ      - Load queue full:                           0
SQ      - Store queue full:                          0
GROUP   - Static restrictions on the dispatch group: 0


Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
[# dispatched], [# cycles]
 0,              26  (38.8%)
 3,              1  (1.5%)
 4,              1  (1.5%)
 7,              15  (22.4%)
 8,              24  (35.8%)


Schedulers - number of cycles where we saw N micro opcodes issued:
[# issued], [# cycles]
 0,          3  (4.5%)
 1,          1  (1.5%)
 2,          6  (9.0%)
 3,          8  (11.9%)
 4,          13  (19.4%)
 5,          15  (22.4%)
 6,          10  (14.9%)
 7,          8  (11.9%)
 8,          3  (4.5%)

Scheduler's queue usage:
No scheduler resources used.


Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
 0,           12  (17.9%)
 1,           11  (16.4%)
 2,           22  (32.8%)
 3,           4  (6.0%)
 4,           1  (1.5%)
 5,           8  (11.9%)
 8,           4  (6.0%)
 9,           1  (1.5%)
 10,          4  (6.0%)

Total ROB Entries:                128
Max Used ROB Entries:             128  ( 100.0% )
Average Used ROB Entries per cy:  75  ( 58.6% )


Register File statistics:
Total number of mappings created:    160
Max number of mappings used:         67


Resources:
[0]   - KryoUnitLSA
[1]   - KryoUnitLSB
[2]   - KryoUnitXA
[3]   - KryoUnitXB
[4]   - KryoUnitYA
[5]   - KryoUnitYB


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    
2.00   2.00   3.81   3.75   3.81   3.63   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    Instructions:
 -      -     0.19   0.25   0.25   0.31   sub	x1, x1, #16
0.31   0.69    -      -      -      -     ld1r	{ v16.4s }, [x0]
0.75   0.25    -      -      -      -     ld1r	{ v17.4s }, [x1]
 -      -     0.50   0.38   0.50   0.63   rev64	v16.4s, v16.4s
 -      -     0.50   0.38   0.75   0.38   rev64	v17.4s, v17.4s
 -      -     1.00   1.00    -      -     ext	v16.16b, v16.16b, v16.16b, #8
 -      -     1.13   0.88    -      -     ext	v17.16b, v17.16b, v17.16b, #8
0.50   0.50    -      -     0.25   0.75   st1	{ v16.4s }, [x1]
0.44   0.56    -      -     0.69   0.31   st1	{ v17.4s }, [x0]
 -      -     0.13   0.31   0.38   0.19   add	x0, x0, #16
 -      -     0.13   0.25   0.88   0.75   cmp	x0, x10
 -      -     0.25   0.31   0.13   0.31   b.ne	.B0
Unrolled results size 128 bytes
Iterations:        8
Instructions:      160
Total Cycles:      77
Total uOps:        344

Dispatch Width:    8
uOps Per Cycle:    4.47
IPC:               2.08
Block RThroughput: 5.8


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 2      3     1.00    *                   ld1	{ v0.4s, v1.4s }, [x0]
 1      1     0.25                        sub	x1, x1, #32
 2      1     0.50                        mov	v16.16b, v1.16b
 2      1     0.50                        mov	v17.16b, v0.16b
 2      3     1.00    *                   ld1	{ v0.4s, v1.4s }, [x1]
 2      1     0.50                        rev64	v16.4s, v16.4s
 2      1     0.50                        rev64	v17.4s, v17.4s
 2      1     0.50                        mov	v18.16b, v1.16b
 2      1     0.50                        mov	v19.16b, v0.16b
 2      1     1.00                        ext	v0.16b, v16.16b, v16.16b, #8
 2      1     1.00                        ext	v1.16b, v17.16b, v17.16b, #8
 2      1     0.50                        rev64	v19.4s, v19.4s
 2      1     0.50                        rev64	v18.4s, v18.4s
 5      1     1.00           *            st1	{ v0.4s, v1.4s }, [x1]
 2      1     1.00                        ext	v1.16b, v19.16b, v19.16b, #8
 2      1     1.00                        ext	v0.16b, v18.16b, v18.16b, #8
 5      1     1.00           *            st1	{ v0.4s, v1.4s }, [x0]
 1      1     0.25                        add	x0, x0, #32
 2      2     0.50                        cmp	x0, x10
 1      1     0.25                        b.ne	.loop


Dynamic Dispatch Stall Cycles:
RAT     - Register unavailable:                      0
RCU     - Retire tokens unavailable:                 5  (6.5%)
SCHEDQ  - Scheduler full:                            0
LQ      - Load queue full:                           0
SQ      - Store queue full:                          0
GROUP   - Static restrictions on the dispatch group: 0


Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
[# dispatched], [# cycles]
 0,              27  (35.1%)
 2,              2  (2.6%)
 3,              1  (1.3%)
 4,              7  (9.1%)
 6,              1  (1.3%)
 7,              9  (11.7%)
 8,              30  (39.0%)


Schedulers - number of cycles where we saw N micro opcodes issued:
[# issued], [# cycles]
 0,          5  (6.5%)
 1,          2  (2.6%)
 2,          9  (11.7%)
 3,          2  (2.6%)
 4,          19  (24.7%)
 5,          18  (23.4%)
 6,          11  (14.3%)
 7,          6  (7.8%)
 8,          3  (3.9%)
 10,          2  (2.6%)

Scheduler's queue usage:
No scheduler resources used.


Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
 0,           14  (18.2%)
 1,           17  (22.1%)
 2,           19  (24.7%)
 3,           16  (20.8%)
 4,           4  (5.2%)
 5,           3  (3.9%)
 6,           2  (2.6%)
 7,           2  (2.6%)

Total ROB Entries:                128
Max Used ROB Entries:             127  ( 99.2% )
Average Used ROB Entries per cy:  77  ( 60.2% )


Register File statistics:
Total number of mappings created:    144
Max number of mappings used:         53


Resources:
[0]   - KryoUnitLSA
[1]   - KryoUnitLSB
[2]   - KryoUnitXA
[3]   - KryoUnitXB
[4]   - KryoUnitYA
[5]   - KryoUnitYB


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    
4.00   4.00   8.75   8.88   8.75   8.63   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    Instructions:
0.50   1.50    -      -      -      -     ld1	{ v0.4s, v1.4s }, [x0]
 -      -     0.38    -     0.13   0.50   sub	x1, x1, #32
 -      -     0.50   0.75   0.25   0.50   mov	v16.16b, v1.16b
 -      -     0.50   0.50    -     1.00   mov	v17.16b, v0.16b
1.75   0.25    -      -      -      -     ld1	{ v0.4s, v1.4s }, [x1]
 -      -     0.50   0.25   1.25    -     rev64	v16.4s, v16.4s
 -      -     0.25   1.00   0.25   0.50   rev64	v17.4s, v17.4s
 -      -     0.75   0.25   0.25   0.75   mov	v18.16b, v1.16b
 -      -     0.75   0.75    -     0.50   mov	v19.16b, v0.16b
 -      -     1.00   1.00    -      -     ext	v0.16b, v16.16b, v16.16b, #8
 -      -     1.00   1.00    -      -     ext	v1.16b, v17.16b, v17.16b, #8
 -      -      -      -     1.50   0.50   rev64	v19.4s, v19.4s
 -      -      -      -     0.50   1.50   rev64	v18.4s, v18.4s
0.25   1.75   0.25   0.50   1.75   0.50   st1	{ v0.4s, v1.4s }, [x1]
 -      -     1.25   0.75    -      -     ext	v1.16b, v19.16b, v19.16b, #8
 -      -     0.75   1.25    -      -     ext	v0.16b, v18.16b, v18.16b, #8
1.50   0.50   0.25   0.13   1.88   0.75   st1	{ v0.4s, v1.4s }, [x0]
 -      -      -      -     0.38   0.63   add	x0, x0, #32
 -      -     0.50   0.50   0.50   0.50   cmp	x0, x10
 -      -     0.13   0.25   0.13   0.50   b.ne	.loop
Vanilla results size 256 bytes
Iterations:        32
Instructions:      384
Total Cycles:      126
Total uOps:        608

Dispatch Width:    8
uOps Per Cycle:    4.83
IPC:               3.05
Block RThroughput: 2.4


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        sub	x1, x1, #16
 1      3     0.50    *                   ld1r	{ v16.4s }, [x0]
 1      3     0.50    *                   ld1r	{ v17.4s }, [x1]
 2      1     0.50                        rev64	v16.4s, v16.4s
 2      1     0.50                        rev64	v17.4s, v17.4s
 2      1     1.00                        ext	v16.16b, v16.16b, v16.16b, #8
 2      1     1.00                        ext	v17.16b, v17.16b, v17.16b, #8
 2      0     0.50           *            st1	{ v16.4s }, [x1]
 2      0     0.50           *            st1	{ v17.4s }, [x0]
 1      1     0.25                        add	x0, x0, #16
 2      2     0.50                        cmp	x0, x10
 1      1     0.25                        b.ne	.B0


Dynamic Dispatch Stall Cycles:
RAT     - Register unavailable:                      0
RCU     - Retire tokens unavailable:                 37  (29.4%)
SCHEDQ  - Scheduler full:                            0
LQ      - Load queue full:                           0
SQ      - Store queue full:                          0
GROUP   - Static restrictions on the dispatch group: 0


Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
[# dispatched], [# cycles]
 0,              30  (23.8%)
 1,              3  (2.4%)
 2,              2  (1.6%)
 3,              16  (12.7%)
 4,              2  (1.6%)
 5,              6  (4.8%)
 6,              2  (1.6%)
 7,              17  (13.5%)
 8,              48  (38.1%)


Schedulers - number of cycles where we saw N micro opcodes issued:
[# issued], [# cycles]
 0,          2  (1.6%)
 1,          1  (0.8%)
 2,          7  (5.6%)
 3,          10  (7.9%)
 4,          30  (23.8%)
 5,          34  (27.0%)
 6,          24  (19.0%)
 7,          15  (11.9%)
 8,          3  (2.4%)

Scheduler's queue usage:
No scheduler resources used.


Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
 0,           21  (16.7%)
 1,           17  (13.5%)
 2,           41  (32.5%)
 3,           10  (7.9%)
 4,           1  (0.8%)
 5,           17  (13.5%)
 6,           1  (0.8%)
 7,           1  (0.8%)
 8,           7  (5.6%)
 9,           3  (2.4%)
 10,          7  (5.6%)

Total ROB Entries:                128
Max Used ROB Entries:             128  ( 100.0% )
Average Used ROB Entries per cy:  100  ( 78.1% )


Register File statistics:
Total number of mappings created:    320
Max number of mappings used:         68


Resources:
[0]   - KryoUnitLSA
[1]   - KryoUnitLSB
[2]   - KryoUnitXA
[3]   - KryoUnitXB
[4]   - KryoUnitYA
[5]   - KryoUnitYB


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    
2.00   2.00   3.78   3.78   3.78   3.66   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    Instructions:
 -      -     0.25   0.28   0.22   0.25   sub	x1, x1, #16
0.28   0.72    -      -      -      -     ld1r	{ v16.4s }, [x0]
0.75   0.25    -      -      -      -     ld1r	{ v17.4s }, [x1]
 -      -     0.50   0.44   0.38   0.69   rev64	v16.4s, v16.4s
 -      -     0.56   0.31   0.75   0.38   rev64	v17.4s, v17.4s
 -      -     0.94   1.06    -      -     ext	v16.16b, v16.16b, v16.16b, #8
 -      -     0.94   1.06    -      -     ext	v17.16b, v17.16b, v17.16b, #8
0.53   0.47    -      -     0.31   0.69   st1	{ v16.4s }, [x1]
0.44   0.56    -      -     0.66   0.34   st1	{ v17.4s }, [x0]
 -      -     0.16   0.22   0.41   0.22   add	x0, x0, #16
 -      -     0.19   0.19   0.88   0.75   cmp	x0, x10
 -      -     0.25   0.22   0.19   0.34   b.ne	.B0
Unrolled results size 256 bytes
Iterations:        16
Instructions:      320
Total Cycles:      147
Total uOps:        688

Dispatch Width:    8
uOps Per Cycle:    4.68
IPC:               2.18
Block RThroughput: 5.8


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 2      3     1.00    *                   ld1	{ v0.4s, v1.4s }, [x0]
 1      1     0.25                        sub	x1, x1, #32
 2      1     0.50                        mov	v16.16b, v1.16b
 2      1     0.50                        mov	v17.16b, v0.16b
 2      3     1.00    *                   ld1	{ v0.4s, v1.4s }, [x1]
 2      1     0.50                        rev64	v16.4s, v16.4s
 2      1     0.50                        rev64	v17.4s, v17.4s
 2      1     0.50                        mov	v18.16b, v1.16b
 2      1     0.50                        mov	v19.16b, v0.16b
 2      1     1.00                        ext	v0.16b, v16.16b, v16.16b, #8
 2      1     1.00                        ext	v1.16b, v17.16b, v17.16b, #8
 2      1     0.50                        rev64	v19.4s, v19.4s
 2      1     0.50                        rev64	v18.4s, v18.4s
 5      1     1.00           *            st1	{ v0.4s, v1.4s }, [x1]
 2      1     1.00                        ext	v1.16b, v19.16b, v19.16b, #8
 2      1     1.00                        ext	v0.16b, v18.16b, v18.16b, #8
 5      1     1.00           *            st1	{ v0.4s, v1.4s }, [x0]
 1      1     0.25                        add	x0, x0, #32
 2      2     0.50                        cmp	x0, x10
 1      1     0.25                        b.ne	.loop


Dynamic Dispatch Stall Cycles:
RAT     - Register unavailable:                      0
RCU     - Retire tokens unavailable:                 49  (33.3%)
SCHEDQ  - Scheduler full:                            0
LQ      - Load queue full:                           0
SQ      - Store queue full:                          0
GROUP   - Static restrictions on the dispatch group: 0


Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
[# dispatched], [# cycles]
 0,              35  (23.8%)
 1,              2  (1.4%)
 2,              14  (9.5%)
 3,              1  (0.7%)
 4,              15  (10.2%)
 6,              13  (8.8%)
 7,              19  (12.9%)
 8,              48  (32.7%)


Schedulers - number of cycles where we saw N micro opcodes issued:
[# issued], [# cycles]
 0,          5  (3.4%)
 1,          4  (2.7%)
 2,          17  (11.6%)
 3,          2  (1.4%)
 4,          37  (25.2%)
 5,          38  (25.9%)
 6,          21  (14.3%)
 7,          12  (8.2%)
 8,          7  (4.8%)
 10,          4  (2.7%)

Scheduler's queue usage:
No scheduler resources used.


Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
 0,           22  (15.0%)
 1,           33  (22.4%)
 2,           37  (25.2%)
 3,           34  (23.1%)
 4,           6  (4.1%)
 5,           7  (4.8%)
 6,           4  (2.7%)
 7,           4  (2.7%)

Total ROB Entries:                128
Max Used ROB Entries:             128  ( 100.0% )
Average Used ROB Entries per cy:  100  ( 78.1% )


Register File statistics:
Total number of mappings created:    288
Max number of mappings used:         54


Resources:
[0]   - KryoUnitLSA
[1]   - KryoUnitLSB
[2]   - KryoUnitXA
[3]   - KryoUnitXB
[4]   - KryoUnitYA
[5]   - KryoUnitYB


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    
4.00   4.00   8.75   8.81   8.75   8.69   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    Instructions:
0.50   1.50    -      -      -      -     ld1	{ v0.4s, v1.4s }, [x0]
 -      -     0.44    -     0.19   0.38   sub	x1, x1, #32
 -      -     0.75   0.63   0.13   0.50   mov	v16.16b, v1.16b
 -      -     0.50   0.25    -     1.25   mov	v17.16b, v0.16b
1.63   0.38    -      -      -      -     ld1	{ v0.4s, v1.4s }, [x1]
 -      -     0.25   0.38   1.38    -     rev64	v16.4s, v16.4s
 -      -     0.13   1.25   0.38   0.25   rev64	v17.4s, v17.4s
 -      -     1.13   0.13   0.13   0.63   mov	v18.16b, v1.16b
 -      -     0.63   0.63    -     0.75   mov	v19.16b, v0.16b
 -      -     1.00   1.00    -      -     ext	v0.16b, v16.16b, v16.16b, #8
 -      -     1.00   1.00    -      -     ext	v1.16b, v17.16b, v17.16b, #8
 -      -      -      -     1.75   0.25   rev64	v19.4s, v19.4s
 -      -      -      -     0.25   1.75   rev64	v18.4s, v18.4s
0.63   1.38   0.13   0.63   1.88   0.38   st1	{ v0.4s, v1.4s }, [x1]
 -      -     1.63   0.38    -      -     ext	v1.16b, v19.16b, v19.16b, #8
 -      -     0.38   1.63    -      -     ext	v0.16b, v18.16b, v18.16b, #8
1.25   0.75   0.25   0.06   1.94   0.75   st1	{ v0.4s, v1.4s }, [x0]
 -      -      -      -     0.19   0.81   add	x0, x0, #32
 -      -     0.50   0.50   0.50   0.50   cmp	x0, x10
 -      -     0.06   0.38   0.06   0.50   b.ne	.loop

@cbezault cbezault requested a review from a team as a code owner February 27, 2022 19:50
@CaseyCarter CaseyCarter added the performance Must go faster label Feb 28, 2022
@CaseyCarter CaseyCarter added this to Initial Review in Code Reviews via automation Feb 28, 2022
@cbezault
Copy link
Contributor Author

cbezault commented Feb 28, 2022

Note, according to this article I should be using the NEON versions of these function under ARM64EC but _M_X64 will also be defined.
https://techcommunity.microsoft.com/t5/windows-kernel-internals-blog/getting-to-know-arm64ec-defines-and-intrinsic-functions/ba-p/2957235

@cbezault
Copy link
Contributor Author

As noted on Discord, there is the question of whether or not we might want to unroll the inner loop at all.
We start seeing wins for vector<unsigned int> with size >= 512.

@StephanTLavavej StephanTLavavej added the ARM64 Related to the ARM64 architecture label Mar 1, 2022
@cbezault
Copy link
Contributor Author

cbezault commented Mar 1, 2022

After further refining the unrolled version of the code we now have equal performance for vector<unsigned int> of size >= 64.

@StephanTLavavej StephanTLavavej self-assigned this Mar 9, 2022
@cbezault
Copy link
Contributor Author

@rhuijben
Copy link

rhuijben commented Apr 9, 2022

Opened a devcom issue for poor codegen here: https://developercommunity.visualstudio.com/t/arm64-neon-ld1m2-q32-intrinsic-results-in-extraneo/1690469?from=email

Issue was closed there, because some questions were not answered in time.

@AlexGuteniev
Copy link
Contributor

This PR adds NEON versions of all the vector algorithms currently implemented for SSE/AVX2.

Guess not all anymore after #2434

@cbezault
Copy link
Contributor Author

@rhuijben I didn't bother answering because I got confirmation from a backend dev that it'll be a won't fix. They're working a full overhaul of their register allocator that will fix it.

@ghost
Copy link

ghost commented Apr 29, 2022

CLA assistant check
All CLA requirements met.

@StephanTLavavej StephanTLavavej removed their assignment May 4, 2022
@StephanTLavavej StephanTLavavej moved this from Initial Review to Work In Progress in Code Reviews May 4, 2022
Copy link
Member

@barcharcraz barcharcraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code looks correct to me, however I'm not sure we want to take this approach to adding vectorized algorithms for additional architectures. Trying to bolt on additional code with the preprocessor is probably not too bad for vector reverse, but I have a feeling it might become a nightmare for more complicated algorithms, and it makes things harder to follow.

const __m256i _Left = _mm256_loadu_si256(static_cast<__m256i*>(_First1));
const __m256i _Right = _mm256_loadu_si256(static_cast<__m256i*>(_First2));
_mm256_storeu_si256(static_cast<__m256i*>(_First1), _Right);
_mm256_storeu_si256(static_cast<__m256i*>(_First2), _Left);
#elif defined(_VECTOR_ARM64) // ^^^ _M_IX86 || _VECTOR_X64 ^^^ // vvv _VECTOR_ARM64 vvv
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should have a seperate function for arm, I'm not a huge fan of trying to weave together all the varients using the preprocessor like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure we could do that. I just felt that the logic was so similar that we might as well just inline it.

@barcharcraz
Copy link
Member

There's also a merge issue with find/count

@barcharcraz barcharcraz added this to @barcharcraz in Maintainer Priorities Aug 31, 2022
@StephanTLavavej
Copy link
Member

Hi @cbezault! We talked about this PR at the weekly maintainer meeting - sorry for not reviewing it earlier, but the codebase has changed significantly since this was originally opened. @barcharcraz says, based on her earlier review, that in addition to the source-level conflicts that have accumulated, some of the algorithms' approaches have changed, so more substantial rework would be required. Additionally, we would need benchmarks added to the new benchmarks suite, to re-confirm that the changed code is still an improvement. Finally, she mentioned that a whole bunch of vectorized algorithms have been added in the meantime (@AlexGuteniev has been busy 😹 😻), although that isn't a blocking issue strictly speaking since we can always make incremental improvements.

When I thought only source-level conflicts were the issue, I was getting ready to pick up this PR, but after talking with @barcharcraz, I believe that it would be as much or more work to get this PR ready as it would be to start from scratch. Therefore we're going to close this PR without merging. We still appreciate the effort you put into it, and I apologize for not being able to land it!

Code Reviews automation moved this from Work In Progress to Done Feb 7, 2024
Maintainer Priorities automation moved this from @barcharcraz to Done Feb 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ARM64 Related to the ARM64 architecture performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Provide ARM64 implementations of vector algorithms
6 participants