Pruning elements in SIMD vectors (i.e., packing left elements)
C
Switch branches/tags
Nothing to show
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
scripts
.gitignore
LICENSE
README.md
benchmark.c
demo.c
simdprune.h
simdprune_tables.h

README.md

simdprune

Pruning elements in SIMD vectors

Suppose that you are given an vector like 0,1,1,0,3,1,1,4 and you want to remove all 1s to get 0,0,3,4,... One way to do this is to compare the original vector with the vector 1,1,1,1,1,1,1,1 to get the mask 0b01100110 (where a 1 appears if and only if the corresponding elements are equal). We then want to pass the mask 0b01100110 and the vector 0,1,1,0,3,1,1,4 to some function that will produce a vector that begins with 0,0,3,4, skipping the 1s.

The AVX-512 instruction sets offer vcompress instructions for this purpose, but other instructions sets like SSSE3 or AVX2 provide no help.

That's where this library comes in.

Further documentation: Quickly pruning elements in SIMD vectors using the simdprune library

On processors benefiting from advanced AVX-512 instructions

The AVX-512 instruction set vcompress helps but is limited to 32-bit and 64-bit words.

On ARM processors

Some ARM processors will benefit from Scalable Vector Extensions. It seems that a sequence of SPLICE and INCP instructions can effectively achieve arbitrary prunning. ARM Scalable Vector Extensions also support a related instruction (compact) but it also seems to be limited to 32-bit and 64-bit words.

Practical examples

Usage

To prune every other value:

  // 128-bit vectors (SSSE3)
  prune_epi8(x,0b1010101010101010);
  prune_epi16(x,0b10101010);
  prune_epi32(x,0b1010);
  // 256-bit vectors (AVX2)
  prune256_epi32(x,0b10101010);

Replacing the various masks by, say, 0b1 would prune just the first value.

How fast is it?

The throughput of these functions is likely quite good. The latency spans several cycles, however. Especially expensive is prune_epi8 due to its large table.

These numbers assume that one is able to hide the cache/RAM latency by prefetching the bit masks. Table lookups from RAM take dozens of cycles at least.

$ gcc -o benchmark benchmark.c -mavx2 -O3 && ./benchmark
This test measures the latency in CPU cycles.
rdtsc_overhead set to 30
runprune_epi8(bitmasks, N, &x)                              	:  4.163 cycles per operation (best) 	4.629 cycles per operation (avg)
runprune_epi16(bitmasks, N, &x)                             	:  2.404 cycles per operation (best) 	2.462 cycles per operation (avg)
runprune_epi32(bitmasks, N, &x)                             	:  2.369 cycles per operation (best) 	2.444 cycles per operation (avg)
runprune256_epi32(bitmasks, N, &xx)                         	:  3.135 cycles per operation (best) 	3.155 cycles per operation (avg)

How to install

Just copy the header files and include them. Look at demo.c for an example.