[AMDGPU] `int64` modulo-constant `x % 3` and divide-by-constant `x / 3` compile to 80 instructions. #100383

This is observed when compiling with `-x hip` for AMD MI300 (`gfx942`).

Compiler Explorer link: https://godbolt.org/z/xrfhhaaeY. For completeness, the clang flags are `-O3 --cuda-device-only -x hip -nogpuinc -nogpulib --offload-arch=gfx942`.

Testcase:

```c++
#include <cstdint>  // for int64_t

// Device function: signed 64-bit remainder by the constant 3.
__attribute__((device))
int64_t a(int64_t i) {
    return i % 3;
}
```

This compiles to 80 instructions.

By contrast, the same testcase with `int64_t` replaced by `int32_t` compiles to just 8 instructions.

I was expecting the `int64` variant to generate slightly over 2x as many instructions as the `int32` variant (since the target requires rewriting `int64` ops into pairs of `int32` ops), not 10x.
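For reference, here is a host-side sketch of the usual multiply-by-magic-constant way to divide a signed 64-bit value by 3 without a divide instruction. I am not claiming this is what LLVM emits or should emit for `gfx942`; it only illustrates the kind of short sequence I had in mind. The names `mulhi_s64`, `div3`, and `mod3` are made up for this sketch, and `__int128` is a GCC/Clang extension used here to get the high half of the product (on a target with 32-bit registers that multiply-high would itself expand into several 32-bit multiplies and adds).

```c++
#include <cstdint>

// Signed 64-bit multiply-high: the upper 64 bits of the 128-bit product.
// Assumption: __int128 is available (GCC/Clang extension on 64-bit hosts).
constexpr int64_t mulhi_s64(int64_t a, int64_t b) {
    return static_cast<int64_t>((static_cast<__int128>(a) * b) >> 64);
}

// x / 3 with C++ semantics (truncation toward zero), no divide instruction.
constexpr int64_t div3(int64_t x) {
    // Magic constant (2^64 + 2) / 3.
    constexpr int64_t kMagic = 0x5555555555555556LL;
    // For x >= 0 this is already the quotient; for x < 0 it is one too low.
    int64_t q = mulhi_s64(x, kMagic);
    // Add the sign bit of x (1 if x is negative, else 0) to correct that.
    q += static_cast<int64_t>(static_cast<uint64_t>(x) >> 63);
    return q;
}

// x % 3 derived from the quotient: x - 3 * (x / 3).
constexpr int64_t mod3(int64_t x) {
    return x - 3 * div3(x);
}

// Compile-time spot checks against the built-in operators.
static_assert(div3(7) == 7 / 3, "positive quotient");
static_assert(div3(-7) == -7 / 3, "negative quotient truncates toward zero");
static_assert(mod3(7) == 7 % 3, "positive remainder");
static_assert(mod3(-7) == -7 % 3, "negative remainder keeps the sign of x");
static_assert(div3(INT64_MIN) == INT64_MIN / 3, "most negative input");
```

That is one multiply-high, one shift, and one add for the quotient, plus a multiply and a subtract for the remainder.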

The above Compiler Explorer link shows the same happening with `i / 3` instead of `i % 3`.
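For convenience, the two other variants mentioned above can be written out as follows. This is a sketch rather than a copy of the Compiler Explorer source: the function names `a32` and `a_div` and the `DEVICE_FN` macro are hypothetical, and the macro exists only so the same file also builds as plain host C++ (it assumes clang defines `__HIP__` when invoked with `-x hip`).

```c++
#include <cstdint>

// Hypothetical helper: apply the device attribute only in HIP mode,
// so this file also compiles as ordinary host C++.
#if defined(__HIP__)
#define DEVICE_FN __attribute__((device))
#else
#define DEVICE_FN
#endif

// The int32 variant of the testcase: same body, narrower type.
DEVICE_FN
int32_t a32(int32_t i) {
    return i % 3;
}

// The int64 divide-by-constant variant: i / 3 instead of i % 3.
DEVICE_FN
int64_t a_div(int64_t i) {
    return i / 3;
}
```

Compiling this with the flags listed above should let the two functions be compared side by side with the original `int64` modulo testcase.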
