This is observed with -x hip targeting AMD MI300 (gfx942).
Compiler Explorer link: https://godbolt.org/z/xrfhhaaeY. For completeness, the clang flags are -O3 --cuda-device-only -x hip -nogpuinc -nogpulib --offload-arch=gfx942.
Testcase:
__attribute__((device))
int64_t a(int64_t i) {
  return i % 3;
}
This compiles to 80 instructions.
By contrast, the same testcase with int64_t replaced by int32_t compiles to just 8 instructions.
I expected the int64 variant to need slightly more than 2x the instructions of the int32 variant, since the target has to rewrite int64 ops into pairs of int32 ops. The observed ratio is 10x.
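To make that expectation concrete, here is a rough host-side C++ sketch of how i % 3 on int64 could be strength-reduced using only 32x32->64 multiplies. It uses the standard multiply-high "magic number" technique for signed division by 3 (magic constant 0x5555555555555556, shift 0). The helper names mulhu64, rem3 and self_check are hypothetical, and I have not checked whether this matches what the backend actually emits; it only illustrates that the operation fits in a handful of 32-bit multiplies and adds.

```cpp
#include <cassert>
#include <cstdint>

// High 64 bits of an unsigned 64x64 multiply, built from four 32x32->64
// multiplies, as a target without native 64-bit multiply would need.
// (Hypothetical helper, for illustration only.)
static inline uint64_t mulhu64(uint64_t a, uint64_t b) {
  const uint64_t a_lo = static_cast<uint32_t>(a), a_hi = a >> 32;
  const uint64_t b_lo = static_cast<uint32_t>(b), b_hi = b >> 32;

  const uint64_t lo_lo = a_lo * b_lo;
  const uint64_t hi_lo = a_hi * b_lo;
  const uint64_t lo_hi = a_lo * b_hi;
  const uint64_t hi_hi = a_hi * b_hi;

  // Sum the middle partial products; this cannot overflow 64 bits.
  const uint64_t cross = (lo_lo >> 32) + static_cast<uint32_t>(hi_lo) + lo_hi;
  return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

// i % 3 with C++ truncating semantics, without a divide instruction.
static inline int64_t rem3(int64_t n) {
  const uint64_t magic = 0x5555555555555556ULL;  // ceil(2^64 / 3)
  const uint64_t un = static_cast<uint64_t>(n);

  // Signed multiply-high derived from the unsigned one (magic is positive).
  uint64_t q = mulhu64(un, magic) - (n < 0 ? magic : 0);
  q += un >> 63;  // add 1 for negative n to round the quotient toward zero

  return n - static_cast<int64_t>(q) * 3;
}

// Spot-check against the built-in operator.
static inline void self_check() {
  const int64_t samples[] = {0, 1, 2, 3, 7, -1, -2, -3, -7,
                             INT64_MAX, INT64_MIN, 1234567890123LL};
  for (int64_t v : samples) assert(rem3(v) == v % 3);
}
```

Calling self_check() from a small test program compares rem3 against the built-in % on a few values, including INT64_MIN and INT64_MAX.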
The above Compiler Explorer link shows the same happening with i / 3 instead of i % 3.