Skip to content

Conversation

@wenju-he
Copy link
Contributor

Add a fast path for the common case that total work-group size is multiple of max sub-group size, avoiding need to calculate number of sub-groups.

Add a fast path for the common case that total work-group size is
multiple of max sub-group size, avoiding need to calculate number of
sub-groups.
@wenju-he wenju-he requested a review from Copilot October 13, 2025 08:55
@llvmbot llvmbot added the libclc libclc OpenCL library label Oct 13, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR optimizes the __clc_get_sub_group_size() function by adding a fast path for the common case where the total work-group size is a multiple of the maximum sub-group size. This optimization avoids the need to calculate the number of sub-groups in many cases.

Key changes:

  • Replaced conditional logic with a more efficient calculation using modulo operation
  • Simplified the function to use a single boolean condition to determine if a sub-group is full-sized
  • Reduced the number of function calls and intermediate calculations

Comment on lines 16 to 17
uint linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
__clc_get_local_size(2);
Copy link

Copilot AI Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function calls to __clc_get_local_size() are repeated three times. Consider caching these values in local variables to avoid potential multiple function call overhead.

Suggested change
uint linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
__clc_get_local_size(2);
uint local_size_0 = __clc_get_local_size(0);
uint local_size_1 = __clc_get_local_size(1);
uint local_size_2 = __clc_get_local_size(2);
uint linear_size = local_size_0 * local_size_1 * local_size_2;

Copilot uses AI. Check for mistakes.

Comment on lines 16 to 22
uint linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
__clc_get_local_size(2);
uint remainder = linear_size % __clc_get_max_sub_group_size();
bool full_sub_group = (remainder == 0) || (__clc_get_sub_group_id() <
__clc_get_num_sub_groups() - 1);

return full_sub_group ? __clc_get_max_sub_group_size() : remainder;
Copy link

Copilot AI Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type inconsistency: __clc_get_local_size() returns size_t but the result is assigned to uint. This could cause truncation on 64-bit systems where size_t is larger than uint. Consider using size_t for linear_size and remainder variables.

Suggested change
uint linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
__clc_get_local_size(2);
uint remainder = linear_size % __clc_get_max_sub_group_size();
bool full_sub_group = (remainder == 0) || (__clc_get_sub_group_id() <
__clc_get_num_sub_groups() - 1);
return full_sub_group ? __clc_get_max_sub_group_size() : remainder;
size_t linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
__clc_get_local_size(2);
size_t remainder = linear_size % __clc_get_max_sub_group_size();
bool full_sub_group = (remainder == 0) || (__clc_get_sub_group_id() <
__clc_get_num_sub_groups() - 1);
return (uint)(full_sub_group ? __clc_get_max_sub_group_size() : remainder);

Copilot uses AI. Check for mistakes.

return linear_size - uniform_size;
size_t linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
__clc_get_local_size(2);
uint remainder = linear_size % __clc_get_max_sub_group_size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how this is faster? The old implementation carefully avoided division, and this introduces urem?

Copy link
Contributor Author

@wenju-he wenju-he Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how this is faster? The old implementation carefully avoided division, and this introduces urem?

in most cases __clc_get_max_sub_group_size() is a power of 2 and modulo has the same code as & (__clc_get_max_sub_group_size() - 1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following is diff on nvptx64--nvidiacl.bc and diff on llc -march=nvptx64 output:
There are 3 improvements:

  • return value range is tightened
  • fast path for total work-group size being multiple of max sub-group size
  • number of ptx register is reduced from 18 to 15
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kindly ping

@wenju-he wenju-he requested a review from arsenm October 14, 2025 23:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

libclc libclc OpenCL library

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants