[OpenMP] amdgpu bad choice of max_flat_workgroup_size #64816

Closed

ye-luo opened this issue Aug 18, 2023 · 16 comments

ye-luo (Contributor) commented Aug 18, 2023

Currently clang always sets max_flat_workgroup_size to 1024, which causes register spills:

    .max_flat_workgroup_size: 1024
    .name:           __omp_offloading_32_7a3077cd__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413
    .private_segment_fixed_size: 264
    .sgpr_count:     60
    .sgpr_spill_count: 0
    .symbol:         __omp_offloading_32_7a3077cd__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413.kd
    .vgpr_count:     128
    .vgpr_spill_count: 66
    .wavefront_size: 64

I tested overriding the default using ompx_attribute(__attribute__((amdgpu_flat_work_group_size(128, 256)))) and got a 2x kernel speed-up:

    .max_flat_workgroup_size: 256
    .name:           __omp_offloading_32_7a4aee56__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413
    .private_segment_fixed_size: 0
    .sgpr_count:     58
    .sgpr_spill_count: 0
    .symbol:         __omp_offloading_32_7a4aee56__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413.kd
    .vgpr_count:     166
    .vgpr_spill_count: 0
    .wavefront_size: 64
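
For reference, here is a minimal sketch of how such an override can be attached to an offloaded loop; only the ompx_attribute clause and its (128, 256) bounds come from this report, while the function, loop, and map clause are illustrative stand-ins rather than the actual QMCPACK kernel:

    // Minimal sketch, assuming a generic offloaded loop. Only the
    // ompx_attribute clause and the (128, 256) bounds are taken from this issue.
    void scale(float *y, int n) {
    #pragma omp target teams distribute parallel for map(tofrom : y[0 : n]) \
        ompx_attribute(__attribute__((amdgpu_flat_work_group_size(128, 256))))
      for (int i = 0; i < n; ++i)
        y[i] *= 2.0f;
    }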

The default of 1024 is clearly very bad in this case. When I write CUDA, even though 1024 is supported, I rarely use 1024; mostly I use 128 or 256.

  1. Can max_flat_workgroup_size be chosen at link time, once the required VGPR count is known?
  2. When I specify a thread_limit(192) clause, can the compiler take advantage of it?
arsenm (Contributor) commented Aug 18, 2023

1024 is the conservative default that is always executable. CUDA, unlike OpenCL, doesn't require checking the supported workgroup size before launching, so for compatibility the backend has to support the maximum workgroup size by default.

The default clang chooses for OpenCL is a maximum of 256. If OpenMP only executes kernels in controlled situations, the default could be decreased.

llvmbot (Collaborator) commented Aug 18, 2023

@llvm/issue-subscribers-openmp

llvmbot (Collaborator) commented Aug 18, 2023

@llvm/issue-subscribers-backend-amdgpu

jhuber6 (Contributor) commented Aug 18, 2023

The problem with OpenMP is that it sets a maximum of 1024 and allows environment variables to change the number of threads. That means that even if the number of threads is stated statically in the program, we need to assume it can change at runtime.

ye-luo (Contributor, Author) commented Aug 18, 2023

> The problem with OpenMP is that it sets a maximum of 1024 and allows environment variables to change the number of threads. That means that even if the number of threads is stated statically in the program, we need to assume it can change at runtime.

Which environment variables were you referring to?

ye-luo (Contributor, Author) commented Aug 18, 2023

Given that the kernel is compiled with max_flat_workgroup_size, an environment variable can only reduce, not increase, the workgroup size at runtime, right?

I saw MaxFlatWorkgroupSize in the plugin, but it doesn't seem to be used. On AMD GPUs, I noticed 256 as the default workgroup size at runtime (shown by rocprof).

ye-luo (Contributor, Author) commented Aug 18, 2023

A related question was asked of AOMP: ROCm/aomp#614

jhuber6 (Contributor) commented Aug 18, 2023

It's just that, in general, there are environment variables such as OMP_NUM_THREADS that can be set by the user at any time. Because of that we can't just assume the thread count, even when it is constant. I believe AOMP has optimizations that specifically turn this off, in spite of the standard.

ye-luo (Contributor, Author) commented Aug 18, 2023

OMP_NUM_THREADS never changes the number of GPU threads; every implementation should respect that. I guess you meant OMP_THREAD_LIMIT.

jdoerfert (Member) commented:

There are (supposed to be) env vars for devices too. That said, in your case it's somewhat easy since you already provide a constant thread_limit. We'll look into this.

ye-luo (Contributor, Author) commented Aug 18, 2023

I don't always set thread_limit. Since the runtime usually goes with 256 at runtime, I would prefer a consistent default that users can override via a command-line option or attributes in the source code.

jdoerfert (Member) commented:

I understand, but that requires more machinery or at least verification that we won't go over the 256 limit at runtime. I think we need to record assumptions such that the runtime can pick them up and guarantee them. The constant case is easier though.

ye-luo (Contributor, Author) commented Aug 19, 2023

I also checked adding ompx_attribute(__attribute__((amdgpu_flat_work_group_size(45, 90)))) with thread_limit removed. I got .max_flat_workgroup_size: 90. However, the runtime still dispatches a workgroup size of 256 (the runtime default). This is a bug.

jdoerfert (Member) commented:

Only partially fixed with df8d33f.

jdoerfert reopened this Aug 19, 2023

jdoerfert (Member) commented:

> I also checked adding ompx_attribute(__attribute__((amdgpu_flat_work_group_size(45, 90)))) with thread_limit removed. I got .max_flat_workgroup_size: 90. However, the runtime still dispatches a workgroup size of 256 (the runtime default). This is a bug.

No, it is not. This is a user error. The attribute states that it is the user's responsibility. We could/should warn or error, but it is still a user bug.

ye-luo (Contributor, Author) commented Aug 19, 2023

Right now the runtime ignores the max_flat_workgroup_size info from the kernel and blindly runs with the runtime default. I consider this a bug.

razmser pushed a commit to SuduIDE/llvm-project that referenced this issue Oct 2, 2023
…ounts

If we know the thread count statically and it is a constant, we can set
the "amdgpu-flat-work-group-size" kernel attribute.

Fixes llvm#64816 in parts.
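
Below is a hedged sketch of the situation this commit targets; the thread_limit(192) value is borrowed from the question earlier in this thread, and the axpy-style loop is an illustrative assumption rather than code from the commit:

    // Illustrative only: with a compile-time-constant thread_limit, the
    // compiler can derive the "amdgpu-flat-work-group-size" kernel attribute
    // from the known bound instead of falling back to the conservative 1024.
    void axpy(float a, const float *x, float *y, int n) {
    #pragma omp target teams distribute parallel for thread_limit(192) \
        map(to : x[0 : n]) map(tofrom : y[0 : n])
      for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
    }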