[libclc] Refine __clc_get_sub_group_size with fast full sub-group path #163149
base: main
Add a fast path for the common case where the total work-group size is a multiple of the maximum sub-group size, avoiding the need to calculate the number of sub-groups.
Pull Request Overview
This PR optimizes the __clc_get_sub_group_size() function by adding a fast path for the common case where the total work-group size is a multiple of the maximum sub-group size. This optimization avoids the need to calculate the number of sub-groups in many cases.
Key changes:
- Replaced conditional logic with a more efficient calculation using a modulo operation
- Simplified the function to use a single boolean condition to determine if a sub-group is full-sized
- Reduced the number of function calls and intermediate calculations
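The logic summarized above can be sketched in plain C. This is only an illustration, not the libclc source: the `_sketch` function names are hypothetical, the device queries are replaced by ordinary parameters, and the number of sub-groups is assumed to be the ceiling of the work-group size divided by the maximum sub-group size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the number of sub-groups is ceil(linear_size / max). */
static uint32_t num_sub_groups_sketch(size_t linear_size, uint32_t max) {
  return (uint32_t)((linear_size + max - 1) / max);
}

/* Hypothetical stand-in for the refined function. Inputs that the real
   code queries from builtins are passed in as parameters here. */
static uint32_t sub_group_size_sketch(size_t linear_size,
                                      uint32_t max_sub_group_size,
                                      uint32_t sub_group_id) {
  assert(max_sub_group_size != 0);
  uint32_t remainder = (uint32_t)(linear_size % max_sub_group_size);
  /* Fast path: when remainder == 0, || short-circuits and the
     sub-group count is never computed. */
  bool full_sub_group =
      (remainder == 0) ||
      (sub_group_id <
       num_sub_groups_sketch(linear_size, max_sub_group_size) - 1);
  return full_sub_group ? max_sub_group_size : remainder;
}
```

With a work-group of 70 items and a maximum sub-group size of 32, sub-groups 0 and 1 are full and sub-group 2 holds the remaining 6 items. With 64 items, every sub-group is full and only the remainder check runs.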
  uint linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
                     __clc_get_local_size(2);
Copilot (AI), Oct 13, 2025
The function calls to __clc_get_local_size() are repeated three times. Consider caching these values in local variables to avoid potential multiple function call overhead.
-  uint linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
-                     __clc_get_local_size(2);
+  uint local_size_0 = __clc_get_local_size(0);
+  uint local_size_1 = __clc_get_local_size(1);
+  uint local_size_2 = __clc_get_local_size(2);
+  uint linear_size = local_size_0 * local_size_1 * local_size_2;
  uint linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
                     __clc_get_local_size(2);
  uint remainder = linear_size % __clc_get_max_sub_group_size();
  bool full_sub_group = (remainder == 0) || (__clc_get_sub_group_id() <
                                             __clc_get_num_sub_groups() - 1);

  return full_sub_group ? __clc_get_max_sub_group_size() : remainder;
Copilot (AI), Oct 13, 2025
Type inconsistency: __clc_get_local_size() returns size_t but the result is assigned to uint. This could cause truncation on 64-bit systems where size_t is larger than uint. Consider using size_t for linear_size and remainder variables.
-  uint linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
-                     __clc_get_local_size(2);
-  uint remainder = linear_size % __clc_get_max_sub_group_size();
-  bool full_sub_group = (remainder == 0) || (__clc_get_sub_group_id() <
-                                             __clc_get_num_sub_groups() - 1);
-  return full_sub_group ? __clc_get_max_sub_group_size() : remainder;
+  size_t linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
+                       __clc_get_local_size(2);
+  size_t remainder = linear_size % __clc_get_max_sub_group_size();
+  bool full_sub_group = (remainder == 0) || (__clc_get_sub_group_id() <
+                                             __clc_get_num_sub_groups() - 1);
+  return (uint)(full_sub_group ? __clc_get_max_sub_group_size() : remainder);
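The truncation concern can be illustrated with a small C sketch. The function names and the input values are hypothetical and chosen only to force an overflow of 32 bits; real work-group sizes are normally far smaller. `uint64_t` stands in for a 64-bit `size_t` and `uint32_t` for `uint`.

```c
#include <stdint.h>

/* Keep the product in 64 bits and narrow only the remainder, which is
   always smaller than max_sub_group_size and therefore fits in 32 bits. */
static uint32_t remainder_wide(uint64_t x, uint64_t y, uint64_t z,
                               uint32_t max_sub_group_size) {
  uint64_t linear_size = x * y * z;
  return (uint32_t)(linear_size % max_sub_group_size);
}

/* Narrow the product to 32 bits first, as assigning to uint would. */
static uint32_t remainder_narrow(uint64_t x, uint64_t y, uint64_t z,
                                 uint32_t max_sub_group_size) {
  uint32_t linear_size = (uint32_t)(x * y * z); /* drops the high bits */
  return linear_size % max_sub_group_size;
}
```

For a product of 2^32 and a maximum sub-group size of 48, the wide version yields 16 while the narrow version yields 0. When the maximum sub-group size is a power of two, both versions agree, because the dropped high bits are a multiple of that size.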
  return linear_size - uniform_size;
  size_t linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
                       __clc_get_local_size(2);
  uint remainder = linear_size % __clc_get_max_sub_group_size();
Not sure how this is faster? The old implementation carefully avoided division, and this introduces urem?
> Not sure how this is faster? The old implementation carefully avoided division, and this introduces urem?
In most cases __clc_get_max_sub_group_size() is a power of 2, so the modulo generates the same code as & (__clc_get_max_sub_group_size() - 1).
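The equivalence mentioned in this reply can be checked with a short C sketch. The helper names are hypothetical. Note the assumption involved: a compiler can typically replace the remainder with a mask only when it can prove the divisor is a power of two, for example when it is a compile-time constant.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when n has exactly one bit set. */
static bool is_power_of_two(uint32_t n) {
  return n != 0 && (n & (n - 1)) == 0;
}

/* Equals value % n only when n is a power of two. */
static uint32_t remainder_by_mask(uint32_t value, uint32_t n) {
  return value & (n - 1);
}
```

For example, 70 % 32 and 70 & 31 both give 6, whereas for a size of 48 the mask gives 6 and the true remainder is 22, so the shortcut does not apply there.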
kindly ping
