Implementing a New Function

Michael R. Crusoe edited this page Jan 18, 2023 · 7 revisions

Adding a new function to SIMDe, including tests to make sure it's implemented correctly, is pretty straightforward once you know where things are.

This document will guide you through adding everything you need in order to get a patch accepted. We'll use an AVX2 function, _mm256_add_epi32, as an example, but the process is similar for other ISA extensions.

Stub Out An Implementation

First, we'll create a stub implementation of our function in the right place. For AVX2, that's simde/x86/avx2.h. The functions are generally in alphabetical order, but if it makes more sense to put yours next to a similar function, you can do that instead.

We don't need to worry much about the portable implementation right now; we just want something that will compile and run so we can generate a test vector. I generally just copy an existing function with the same prototype if possible (for example, the body of our _mm256_add_epi32 could just be a copy of _mm256_sub_epi32 with the function name changed for now).

We do need to change the native implementation to call the right function, though, since we'll be using it soon to generate a test vector.

Adding a Test

All functions in SIMDe require at least one test case; patches missing a test will not be accepted. Writing these tests can be quite repetitive, but SIMDe has lots of code to help you get started. Since we're writing an AVX2 function, the first thing you'll want to do is open up test/x86/skel.c and find a function that corresponds with the types you want to use; in the case of _mm256_add_epi32, we're adding two 256-bit vectors of 8 32-bit integers each, so you'll want to use test_simde_mm256_xxx_epi32. Note the naming convention; you should be able to find the necessary function quickly by searching instead of reading through the whole thing.

Here is what that function looks like at the time this document was written:

static int
test_simde_mm256_xxx_epi32(SIMDE_MUNIT_TEST_ARGS) {
  (void) params;
  (void) data;

  const struct {
    int32_t a[8];
    int32_t b[8];
    int32_t r[8];
  } test_vec[8] = {

  };

  /* Generator: print random inputs and their results as test vectors, then stop. */
  printf("\n");
  for (size_t i = 0 ; i < (sizeof(test_vec) / (sizeof(test_vec[0]))) ; i++) {
    simde__m256i a = simde_test_x86_random_i32x8();
    simde__m256i b = simde_test_x86_random_i32x8();
    simde__m256i r = simde_mm256_xxx_epi32(a, b);
    simde_test_x86_write_i32x8(2, a, SIMDE_TEST_VEC_POS_FIRST);
    simde_test_x86_write_i32x8(2, b, SIMDE_TEST_VEC_POS_MIDDLE);
    simde_test_x86_write_i32x8(2, r, SIMDE_TEST_VEC_POS_LAST);
  }
  return 1;

  /* Verifier: unreachable until the generator code above is deleted. */
  for (size_t i = 0 ; i < (sizeof(test_vec) / sizeof(test_vec[0])); i++) {
    simde__m256i a = simde_x_mm256_loadu_epi32(test_vec[i].a);
    simde__m256i b = simde_x_mm256_loadu_epi32(test_vec[i].b);
    simde__m256i r = simde_mm256_xxx_epi32(a, b);
    simde_test_x86_assert_equal_i32x8(r, simde_x_mm256_loadu_epi32(test_vec[i].r));
  }

  return 0;
}

Copy the code into the file with all the other test cases for the ISA extension you're working on (in this case, test/x86/avx2.c) and change the name to reflect the function you're working on, which is test_simde_mm256_add_epi32 for this example. Don't forget to also change the function calls in the body of the function; you can usually just do a find and replace in the file to replace "xxx" with whatever you need ("add" in this case).

You'll also need to tell the test runner about the function. At the end of the file there is a list of macro calls that starts with SIMDE_TEST_FUNC_LIST_BEGIN; just add an entry to it:

SIMDE_TEST_FUNC_LIST_BEGIN
  /* ... */
  SIMDE_TEST_FUNC_LIST_ENTRY(mm256_add_epi32),
  /* ... */

Now our test function will run when the test suite runs.
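
To see why the list entry is written as mm256_add_epi32 rather than as the full function name, here is a minimal sketch of how an entry macro of this kind could work. The EXAMPLE_ and example_ names are hypothetical stand-ins, not SIMDe's actual macros, whose definitions may differ. The sketch only assumes that the macro pastes a test_simde_ prefix onto the name it is given.

```c
#include <stddef.h>

/* Hypothetical stand-ins for real test functions; 0 means "passed". */
static int test_simde_mm256_add_epi32(void) { return 0; }
static int test_simde_mm256_sub_epi32(void) { return 0; }

typedef struct {
  const char *name;    /* short name, used as a label */
  int (*func)(void);   /* the test function itself */
} example_test_entry;

/* Assumed behavior: paste the prefix onto the short name and keep the
 * short name as a string. */
#define EXAMPLE_TEST_FUNC_LIST_ENTRY(name) { #name, test_simde_##name }

static const example_test_entry example_tests[] = {
  EXAMPLE_TEST_FUNC_LIST_ENTRY(mm256_add_epi32),
  EXAMPLE_TEST_FUNC_LIST_ENTRY(mm256_sub_epi32),
};

/* Run every registered test and return the number of failures. */
static int example_run_all(void) {
  int failures = 0;
  for (size_t i = 0; i < sizeof(example_tests) / sizeof(example_tests[0]); i++) {
    if (example_tests[i].func() != 0) {
      failures++;
    }
  }
  return failures;
}
```

With a scheme like this, forgetting the list entry means the test compiles but never runs, which is why the entry is easy to overlook.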

Most of the test function is currently not actually used for testing, but rather to generate a test vector. If you run the tests on a machine which supports the instruction it should print out a test vector for you to copy into the test function. For example, for _mm256_add_epi32 you might end up with something like:

    { simde_mm256_set_epi32(INT32_C( 1102687755), INT32_C( 1275949869), INT32_C( -388043296), INT32_C( 1616523445),
                            INT32_C( -312991452), INT32_C(-1980926618), INT32_C( 1274012126), INT32_C(  -45808693)),
      simde_mm256_set_epi32(INT32_C(-1821401638), INT32_C( 1143218625), INT32_C(-1072188421), INT32_C( -228883992),
                            INT32_C( 1453787917), INT32_C(-1686415046), INT32_C(-1856178723), INT32_C(-1344248495)),
      simde_mm256_set_epi32(INT32_C( -718713883), INT32_C(-1875798802), INT32_C(-1460231717), INT32_C( 1387639453),
                            INT32_C( 1140796465), INT32_C(  627625632), INT32_C( -582166597), INT32_C(-1390057188)) },
    { simde_mm256_set_epi32(INT32_C( -511556352), INT32_C(  512138684), INT32_C( 2115720361), INT32_C( -345092241),
                            INT32_C( -115713034), INT32_C( 1435785542), INT32_C( -578341737), INT32_C(  626663856)),
      simde_mm256_set_epi32(INT32_C( 1905028737), INT32_C(  164639990), INT32_C(-1952346601), INT32_C( 1853095591),
                            INT32_C(-1825217200), INT32_C(-1102744367), INT32_C(-1105586227), INT32_C(-1908622941)),
      simde_mm256_set_epi32(INT32_C( 1393472385), INT32_C(  676778674), INT32_C(  163373760), INT32_C( 1508003350),
                            INT32_C(-1940930234), INT32_C(  333041175), INT32_C(-1683927964), INT32_C(-1281959085)) },
    { simde_mm256_set_epi32(INT32_C(  841608097), INT32_C(-2001797484), INT32_C(-1658305288), INT32_C(  966942303),
                            INT32_C(  842108123), INT32_C(  697774066), INT32_C(-1273233002), INT32_C( -331057125)),
      simde_mm256_set_epi32(INT32_C(  824745259), INT32_C( 1162513122), INT32_C( 1536105364), INT32_C( 1572988069),
                            INT32_C( 1601630355), INT32_C(  105174023), INT32_C( -548723565), INT32_C(  342919548)),
      simde_mm256_set_epi32(INT32_C( 1666353356), INT32_C( -839284362), INT32_C( -122199924), INT32_C(-1755036924),
                            INT32_C(-1851228818), INT32_C(  802948089), INT32_C(-1821956567), INT32_C(   11862423)) },
    { simde_mm256_set_epi32(INT32_C(-1982661498), INT32_C( -454967885), INT32_C( 1606399367), INT32_C( 1911771725),
                            INT32_C( -320200723), INT32_C( 2055189331), INT32_C( 1782567162), INT32_C(  617047003)),
      simde_mm256_set_epi32(INT32_C(-1988185598), INT32_C( 1350171177), INT32_C( -741176174), INT32_C( 1024642864),
                            INT32_C( 1174775607), INT32_C(-1489493977), INT32_C( 2114610376), INT32_C(-1150946108)),
      simde_mm256_set_epi32(INT32_C(  324120200), INT32_C(  895203292), INT32_C(  865223193), INT32_C(-1358552707),
                            INT32_C(  854574884), INT32_C(  565695354), INT32_C( -397789758), INT32_C( -533899105)) },
    { simde_mm256_set_epi32(INT32_C(-1636237507), INT32_C(-2022044523), INT32_C( 1298417038), INT32_C( -498789244),
                            INT32_C(-1120565370), INT32_C(  -10552717), INT32_C( 1267811859), INT32_C( 1736112342)),
      simde_mm256_set_epi32(INT32_C(   30746202), INT32_C( 1464439343), INT32_C( 1694184093), INT32_C(-1066802952),
                            INT32_C( -664495133), INT32_C(-2016253412), INT32_C(-1975304715), INT32_C(  -70672826)),
      simde_mm256_set_epi32(INT32_C(-1605491305), INT32_C( -557605180), INT32_C(-1302366165), INT32_C(-1565592196),
                            INT32_C(-1785060503), INT32_C(-2026806129), INT32_C( -707492856), INT32_C( 1665439516)) },
    { simde_mm256_set_epi32(INT32_C(  289000373), INT32_C( 1573632519), INT32_C(  -39248751), INT32_C( -989305129),
                            INT32_C( -946333511), INT32_C( -275686449), INT32_C(  -98660627), INT32_C(-1519479102)),
      simde_mm256_set_epi32(INT32_C(  297476793), INT32_C(  436731799), INT32_C(  124294563), INT32_C(-1635813332),
                            INT32_C(  263383074), INT32_C( -533172755), INT32_C( 1125990821), INT32_C( -786980387)),
      simde_mm256_set_epi32(INT32_C(  586477166), INT32_C( 2010364318), INT32_C(   85045812), INT32_C( 1669848835),
                            INT32_C( -682950437), INT32_C( -808859204), INT32_C( 1027330194), INT32_C( 1988507807)) },
    { simde_mm256_set_epi32(INT32_C(  518182194), INT32_C(-1204047142), INT32_C(  -66070725), INT32_C(  499109808),
                            INT32_C(-2041576579), INT32_C( -621515360), INT32_C(  566201077), INT32_C(  301667364)),
      simde_mm256_set_epi32(INT32_C(-1846226401), INT32_C(-1479610627), INT32_C( -205605694), INT32_C( 2074175879),
                            INT32_C(  797873427), INT32_C(  232260429), INT32_C( 2122451120), INT32_C(-1502060759)),
      simde_mm256_set_epi32(INT32_C(-1328044207), INT32_C( 1611309527), INT32_C( -271676419), INT32_C(-1721681609),
                            INT32_C(-1243703152), INT32_C( -389254931), INT32_C(-1606315099), INT32_C(-1200393395)) },
    { simde_mm256_set_epi32(INT32_C(  405834501), INT32_C(-1910761465), INT32_C(  957239954), INT32_C( -786856288),
                            INT32_C(  843920617), INT32_C(  327146567), INT32_C( -333483012), INT32_C(-1269489720)),
      simde_mm256_set_epi32(INT32_C( -343554450), INT32_C( -768698719), INT32_C(-1629325598), INT32_C(  -86112156),
                            INT32_C(-1762054840), INT32_C(-1230219631), INT32_C(-1955142376), INT32_C(  681367456)),
      simde_mm256_set_epi32(INT32_C(   62280051), INT32_C( 1615507112), INT32_C( -672085644), INT32_C( -872968444),
                            INT32_C( -918134223), INT32_C( -903073064), INT32_C( 2006341908), INT32_C( -588122264)) }

Go ahead and copy that, and paste it right into the empty test_vec array near the beginning of the function. Then you can get rid of all the code that generates the test vector, which is everything from the printf("\n"); to the return 1;.
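
After this cleanup the test is just a table plus a loop. The sketch below is a self-contained analogue that compiles without SIMDe. The example_ names and the plain int32_t[8] layout are assumptions made for illustration, and only the first vector from the output above is reproduced. Because the operation is element-wise, the lane order used here does not affect the result.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portable add. Unsigned arithmetic gives the wraparound
 * behavior of the instruction without signed-overflow undefined behavior;
 * the conversion back to int32_t is implementation-defined but wraps on
 * mainstream compilers. */
static void example_add_epi32(const int32_t a[8], const int32_t b[8], int32_t r[8]) {
  for (size_t i = 0; i < 8; i++) {
    r[i] = (int32_t) ((uint32_t) a[i] + (uint32_t) b[i]);
  }
}

/* Returns 0 on success and 1 on the first mismatch. */
static int example_test_add_epi32(void) {
  static const struct {
    int32_t a[8];
    int32_t b[8];
    int32_t r[8];
  } test_vec[] = {
    { { INT32_C( 1102687755), INT32_C( 1275949869), INT32_C( -388043296), INT32_C( 1616523445),
        INT32_C( -312991452), INT32_C(-1980926618), INT32_C( 1274012126), INT32_C(  -45808693) },
      { INT32_C(-1821401638), INT32_C( 1143218625), INT32_C(-1072188421), INT32_C( -228883992),
        INT32_C( 1453787917), INT32_C(-1686415046), INT32_C(-1856178723), INT32_C(-1344248495) },
      { INT32_C( -718713883), INT32_C(-1875798802), INT32_C(-1460231717), INT32_C( 1387639453),
        INT32_C( 1140796465), INT32_C(  627625632), INT32_C( -582166597), INT32_C(-1390057188) } }
  };

  for (size_t i = 0; i < sizeof(test_vec) / sizeof(test_vec[0]); i++) {
    int32_t r[8];
    example_add_epi32(test_vec[i].a, test_vec[i].b, r);
    for (size_t j = 0; j < 8; j++) {
      if (r[j] != test_vec[i].r[j]) {
        return 1;
      }
    }
  }

  return 0;
}
```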

Now you're done with the test, and you can get on with the implementation.

Implementing the Function

Head back over to your implementation stub in simde/x86/avx2.h. You'll probably be able to figure out the rest from looking at other functions in the same file, so feel free to skip the rest of this document.

First, take a look at Intel's documentation; remember, our example is for _mm256_add_epi32. The documentation isn't always easy to understand, but between it and the test vector you should be able to figure it out. In our case, we just have to add the elements in a to the elements in b.

The simde__m256i type is a giant union, and you should probably take a look at the definition if you haven't already. It's in the header for the first version of the ISA extension that uses the type, so in this case simde/x86/avx.h. You'll see members for various different integer sizes, both signed and unsigned, with a consistent naming scheme (32-bit integers are i32, 64-bit unsigned integers are u64, etc.). You should also note the n (for "native") member, which exists if the compiler supports the datatype natively in the current configuration. Finally, you can treat a simde__m256i as an array of two 128-bit vectors through the m128i member; this makes it easy to create a fallback using two 128-bit operations.
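
The layout described above can be pictured with a simplified union. This is a sketch, not SIMDe's actual definition: the example_ type names are hypothetical, the native member is omitted, and only a few of the lane views are shown.

```c
#include <stdint.h>

/* Hypothetical 128-bit half, itself a union of lane views. */
typedef union {
  int32_t  i32[4];
  uint64_t u64[2];
} example_m128i;

/* Hypothetical 256-bit vector: every member aliases the same 32 bytes. */
typedef union {
  int8_t        i8[32];
  int32_t       i32[8];
  uint32_t      u32[8];
  uint64_t      u64[4];
  example_m128i m128i[2];  /* low half, then high half */
} example_m256i;

/* Write a lane through the 32-bit view, then read it back through the
 * 128-bit halves to show that both views share storage. */
static int32_t example_lane_via_half(int lane, int32_t value) {
  example_m256i v = { .i32 = { 0 } };
  v.i32[lane] = value;
  return v.m128i[lane / 4].i32[lane % 4];
}
```

Lane 5 of the 32-bit view, for instance, is lane 1 of the high half, which is what makes a fallback built from two 128-bit operations possible.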

An initial version of our function might look like this:

SIMDE__FUNCTION_ATTRIBUTES
simde__m256i
simde_mm256_add_epi32 (simde__m256i a, simde__m256i b) {
  simde__m256i r;

#if defined(SIMDE_AVX2_NATIVE)
  r.n = _mm256_add_epi32(a.n, b.n);
#else
  for (size_t i = 0 ; i < (sizeof(r.i32) / sizeof(r.i32[0])) ; i++) {
    r.i32[i] = a.i32[i] + b.i32[i];
  }
#endif

  return r;
}

The first thing you'll probably want to do is provide a hint to the compiler that it can (and should) try to automatically vectorize the loop. We do this using the SIMDE__VECTORIZE macro, which is declared in the simde/simde-common.h header. It and its variants use either OpenMP 4 SIMD, Cilk Plus, or compiler-specific pragmas, depending on which compiler is in use. Usually you'll just need SIMDE__VECTORIZE, not SIMDE__VECTORIZE_SAFELEN, SIMDE__VECTORIZE_REDUCTION, or SIMDE__VECTORIZE_ALIGNED. Just place the macro before the loop:

  SIMDE__VECTORIZE
  for (size_t i = 0 ; i < (sizeof(r.i32) / sizeof(r.i32[0])) ; i++) {
    r.i32[i] = a.i32[i] + b.i32[i];
  }
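
For a sense of what such a hint can expand to, here is a minimal sketch of a macro in the same spirit. EXAMPLE_VECTORIZE is a hypothetical name and the selection logic is an assumption; the real definitions live in simde/simde-common.h and cover more cases.

```c
#include <stddef.h>
#include <stdint.h>

/* Pick a vectorization hint based on what the compiler advertises.
 * Clang is checked before GCC because Clang also defines __GNUC__. */
#if defined(_OPENMP) && (_OPENMP >= 201307L)
#  define EXAMPLE_VECTORIZE _Pragma("omp simd")
#elif defined(__clang__)
#  define EXAMPLE_VECTORIZE _Pragma("clang loop vectorize(enable)")
#elif defined(__GNUC__) && (__GNUC__ >= 5)
#  define EXAMPLE_VECTORIZE _Pragma("GCC ivdep")
#else
#  define EXAMPLE_VECTORIZE /* no hint available; the loop is still correct */
#endif

/* Element-wise 32-bit add over eight lanes, with the hint placed directly
 * before the loop. Unsigned arithmetic is used so overflow wraps. */
static void example_add_i32x8(const int32_t a[8], const int32_t b[8], int32_t r[8]) {
  EXAMPLE_VECTORIZE
  for (size_t i = 0; i < 8; i++) {
    r[i] = (int32_t) ((uint32_t) a[i] + (uint32_t) b[i]);
  }
}
```

The hint never changes the result; it only gives the compiler permission to vectorize, so the function behaves identically when the macro expands to nothing.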

For a simple function like add, the compiler will likely be able to generate vector instructions even for architectures we haven't explicitly created an implementation for, but it certainly doesn't hurt to add optimized fallbacks. Let's go ahead and create one for SSE2:

SIMDE__FUNCTION_ATTRIBUTES
simde__m256i
simde_mm256_add_epi32 (simde__m256i a, simde__m256i b) {
  simde__m256i r;

#if defined(SIMDE_AVX2_NATIVE)
  r.n = _mm256_add_epi32(a.n, b.n);
#elif defined(SIMDE_SSE2_NATIVE)
  r.m128i[0] = _mm_add_epi32(a.m128i[0], b.m128i[0]);
  r.m128i[1] = _mm_add_epi32(a.m128i[1], b.m128i[1]);
#else
  SIMDE__VECTORIZE
  for (size_t i = 0 ; i < (sizeof(r.i32) / sizeof(r.i32[0])) ; i++) {
    r.i32[i] = a.i32[i] + b.i32[i];
  }
#endif

  return r;
}
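
The same two-halves idea can be tried outside SIMDe with plain arrays. This sketch assumes an x86 compiler that defines __SSE2__ when SSE2 is enabled, and it falls back to a scalar loop otherwise. example_add_epi32_halves is a hypothetical name, and the unaligned load and store intrinsics stand in for the union member access used above.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#  include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Add eight 32-bit lanes, using two 128-bit SSE2 additions when available. */
static void example_add_epi32_halves(const int32_t a[8], const int32_t b[8], int32_t r[8]) {
#if defined(__SSE2__)
  for (size_t half = 0; half < 2; half++) {
    /* Each half covers four consecutive 32-bit lanes. */
    __m128i va = _mm_loadu_si128((const __m128i *) (a + 4 * half));
    __m128i vb = _mm_loadu_si128((const __m128i *) (b + 4 * half));
    _mm_storeu_si128((__m128i *) (r + 4 * half), _mm_add_epi32(va, vb));
  }
#else
  /* Portable fallback; unsigned arithmetic so overflow wraps. */
  for (size_t i = 0; i < 8; i++) {
    r[i] = (int32_t) ((uint32_t) a[i] + (uint32_t) b[i]);
  }
#endif
}
```

Both branches should produce identical output for every input, so comparing this function against a plain scalar loop is a cheap way to check that the paths agree.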

Note: running the tests locally currently doesn't exercise every code path, so it is advisable to temporarily comment out individual implementations (AVX2_NATIVE, SSE2_NATIVE, SIMDE__SHUFFLE_VECTOR, etc.) and re-run the tests to double-check each one.

And you're done. If your implementations are correct, the tests should pass, and you're ready to submit a pull request (please!).