#include <stdint.h>
uint64_t example_1(uint64_t len, const unsigned char* input) {
    uint64_t total = 0;
    for (uint64_t i = 0; i < len; i++) {
        // example computation to be vectorized
        unsigned char output = input[i] ^ 0x07;
        // accumulator
        total += output;
    }
    return total;
}
The loop vectorizer handles the function above poorly: it chooses a vectorization width of 2, when it should be able to use a much higher width, i.e. 16.
If you pick a narrower accumulator (i.e., change the type of total to uint8_t), the vectorizer chooses a high width, as expected.
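
For comparison, the narrower-accumulator variant described above might look like the following sketch. The name example_1_narrow is hypothetical, and the function is not equivalent to the original: the sum wraps modulo 256, so it is only useful for comparing what the vectorizer does with each accumulator type.

```c
#include <stdint.h>

/* Hypothetical variant of example_1: identical loop body, but the
 * accumulator is uint8_t instead of uint64_t. The sum therefore wraps
 * modulo 256 and does NOT match the result of the original function. */
uint64_t example_1_narrow(uint64_t len, const unsigned char* input) {
    uint8_t total = 0;
    for (uint64_t i = 0; i < len; i++) {
        // same example computation as in example_1
        unsigned char output = input[i] ^ 0x07;
        // narrow accumulator: truncated back to 8 bits on every iteration
        total += output;
    }
    // widened to the 64-bit return type only after the loop
    return total;
}
```

Assuming Clang is the compiler in use, building both functions with -O2 -Rpass=loop-vectorize should emit a remark for each loop stating the vectorization width that was chosen, which makes the difference easy to confirm.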