17 changes: 7 additions & 10 deletions libclc/clc/lib/generic/workitem/clc_get_sub_group_size.cl
@@ -13,14 +13,11 @@
 #include <clc/workitem/clc_get_sub_group_size.h>
 
 _CLC_OVERLOAD _CLC_DEF uint __clc_get_sub_group_size() {
-  if (__clc_get_sub_group_id() != __clc_get_num_sub_groups() - 1) {
-    return __clc_get_max_sub_group_size();
-  }
-  size_t size_x = __clc_get_local_size(0);
-  size_t size_y = __clc_get_local_size(1);
-  size_t size_z = __clc_get_local_size(2);
-  size_t linear_size = size_z * size_y * size_x;
-  size_t uniform_groups = __clc_get_num_sub_groups() - 1;
-  size_t uniform_size = __clc_get_max_sub_group_size() * uniform_groups;
-  return linear_size - uniform_size;
+  size_t linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
+                       __clc_get_local_size(2);
+  uint remainder = linear_size % __clc_get_max_sub_group_size();
Contributor

Not sure how this is faster? The old implementation carefully avoided division, and this introduces a `urem`.

Contributor Author

@wenju-he, Oct 13, 2025

> Not sure how this is faster? The old implementation carefully avoided division, and this introduces urem?

In most cases `__clc_get_max_sub_group_size()` is a power of 2, so the modulo compiles to the same code as `& (__clc_get_max_sub_group_size() - 1)`.
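
To make the power-of-2 argument concrete, here is a minimal host-side C sketch of the identity being relied on. The helper names `is_pow2` and `mod_pow2` are hypothetical and not part of libclc. Whether a compiler actually replaces `urem` with a mask depends on it being able to prove that the divisor is a power of 2, for example when the maximum sub-group size is a compile-time constant for the target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: true when n has exactly one bit set. */
int is_pow2(uint32_t n) { return n != 0 && (n & (n - 1)) == 0; }

/* Hypothetical helper: computes x % n without a division.
 * Only valid when n is a power of 2, which the assert checks. */
uint32_t mod_pow2(uint32_t x, uint32_t n) {
  assert(is_pow2(n));
  return x & (n - 1);
}
```

For a work-group of 70 work-items and a maximum sub-group size of 32, `mod_pow2(70, 32)` gives the same value as `70 % 32`.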

Contributor Author

Below are the diff of nvptx64--nvidiacl.bc and the diff of the `llc -march=nvptx64` output.
There are three improvements:

  • The return value range is tightened.
  • There is a fast path for when the total work-group size is a multiple of the max sub-group size.
  • The number of PTX registers is reduced from 18 to 15.
image

Contributor Author

Kindly ping.

+  bool full_sub_group = (remainder == 0) || (__clc_get_sub_group_id() <
+                                             __clc_get_num_sub_groups() - 1);
+
+  return full_sub_group ? __clc_get_max_sub_group_size() : remainder;
 }
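
As a sanity check on the change, the following host-side C sketch models the old and new computations side by side. It is an illustration only, not libclc code: the function names are hypothetical, the device builtins are replaced by plain parameters, and it assumes the number of sub-groups is the linear work-group size divided by the max sub-group size, rounded up.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: sub-group count = ceil(linear_size / max_sg_size). */
uint32_t model_num_sub_groups(size_t linear_size, uint32_t max_sg_size) {
  return (uint32_t)((linear_size + max_sg_size - 1) / max_sg_size);
}

/* Model of the removed code: subtract the uniform sub-groups from the
 * linear size, but only for the last sub-group. */
uint32_t sub_group_size_old(size_t linear_size, uint32_t max_sg_size,
                            uint32_t sg_id) {
  uint32_t num_sg = model_num_sub_groups(linear_size, max_sg_size);
  if (sg_id != num_sg - 1)
    return max_sg_size;
  size_t uniform_size = (size_t)max_sg_size * (num_sg - 1);
  return (uint32_t)(linear_size - uniform_size);
}

/* Model of the added code: a zero remainder means every sub-group is full,
 * otherwise only the last sub-group holds the remainder. */
uint32_t sub_group_size_new(size_t linear_size, uint32_t max_sg_size,
                            uint32_t sg_id) {
  uint32_t num_sg = model_num_sub_groups(linear_size, max_sg_size);
  uint32_t remainder = (uint32_t)(linear_size % max_sg_size);
  int full_sub_group = (remainder == 0) || (sg_id < num_sg - 1);
  return full_sub_group ? max_sg_size : remainder;
}

/* Returns 1 when both models agree for every sub-group id. */
int implementations_agree(size_t linear_size, uint32_t max_sg_size) {
  uint32_t num_sg = model_num_sub_groups(linear_size, max_sg_size);
  for (uint32_t id = 0; id < num_sg; ++id) {
    if (sub_group_size_old(linear_size, max_sg_size, id) !=
        sub_group_size_new(linear_size, max_sg_size, id))
      return 0;
  }
  return 1;
}
```

Under these assumptions the two models return the same value for every sub-group, so the change affects code generation rather than results. For example, with 70 work-items and a max sub-group size of 32, both report 32 for sub-groups 0 and 1 and 6 for sub-group 2.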