[AArch64] Excessive vectorization of scalar rotates results in poor performance #166808

@zeux

Compiling the attached file, test.cpp, for AArch64 with -O3 results in suboptimal code generation. Specifically, the tail of the loop in the source is written to move the vector results into scalar registers so that the rotates run as scalar instructions (rotateleft64 is a wrapper around __builtin_rotateleft64):

		// store results to stack so that we can rotate using scalar instructions
		uint64_t res[4];
		vst1q_u64((uint64_t*)&res[0], res_0);
		vst1q_u64((uint64_t*)&res[2], res_1);

		// rotate and store
		uint64_t* out = reinterpret_cast<uint64_t*>(&data[i * 4]);

		out[0] = rotateleft64(res[0], data[(i + 0) * 4 + 3] << 4);
		out[1] = rotateleft64(res[1], data[(i + 1) * 4 + 3] << 4);
		out[2] = rotateleft64(res[2], data[(i + 2) * 4 + 3] << 4);
		out[3] = rotateleft64(res[3], data[(i + 3) * 4 + 3] << 4);

... however, clang vectorizes this instead. Because A64 doesn't have vector forms of rotates, the vectorization is unprofitable: it needs to synthesize the rotates out of shifts, ands, adds, and ors, and it also has to get the shift mask into the right registers. For comparison, when res is marked volatile, the loop body (which uses vector stack stores and GPR loads) has 60 instructions; without volatile, the vectorized loop body has 70. Running the code confirms that this is a pessimization: on a Graviton 3 CPU, the volatile variant runs at ~5 GB/s and the non-volatile variant runs at ~3.6 GB/s. (Note: the performance delta is likely at least partially due to the fact that scalar ops can run on ports that are otherwise free while the next iteration runs vector instructions.)
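To make the cost of a synthesized rotate concrete, here is a minimal portable sketch of a per-lane rotate built only from ands, a subtract, shifts, and an or. It is a model of the operation mix, not the code clang actually emits, and the name rotl64_lanes is hypothetical (it does not appear in test.cpp).

```cpp
#include <cstdint>

// Hypothetical helper: rotates each of four 64-bit lanes left by its own
// amount without using a rotate instruction or builtin. Each lane needs
// a mask, a complementary amount, two shifts, and an or.
inline void rotl64_lanes(const uint64_t x[4], const uint64_t n[4], uint64_t out[4]) {
    for (int lane = 0; lane < 4; ++lane) {
        uint64_t left  = n[lane] & 63;       // and: reduce the amount modulo 64
        uint64_t right = (64 - left) & 63;   // sub + and: complementary right shift
        uint64_t hi = x[lane] << left;       // shift: bits that move up
        uint64_t lo = x[lane] >> right;      // shift: bits that wrap around
        out[lane] = hi | lo;                 // or: combine both halves
    }
}
```

With a scalar rotate instruction available, each of these lanes can instead be handled by far fewer operations in a general-purpose register, which is the trade-off the instruction counts above reflect.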

It would be nice if clang... didn't do this. gcc doesn't, and gets 5 GB/s on Graviton 3 without volatile.

Godbolt link for easy testing: https://gcc.godbolt.org/z/azPoGh1ns (uncomment volatile on line 70 to compare codegen)
