[libclc] Refine __clc_get_sub_group_size with fast full sub-group path #163149

wenju-he · 2025-10-13T08:55:26Z

Add a fast path for the common case where the total work-group size is a multiple of the max sub-group size, avoiding the need to calculate the number of sub-groups.

Copilot

Pull Request Overview

This PR optimizes the __clc_get_sub_group_size() function by adding a fast path for the common case where the total work-group size is a multiple of the maximum sub-group size. This optimization avoids the need to calculate the number of sub-groups in many cases.

Key changes:

Replaced conditional logic with a more efficient calculation using a modulo operation
Simplified the function to use a single boolean condition to determine if a sub-group is full-sized
Reduced the number of function calls and intermediate calculations
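As a rough illustration of the logic summarized above, the sketch below models the sub-group size computation in plain host C. The names `model_sub_group_size` and `model_num_sub_groups` are hypothetical, the parameters stand in for the libclc built-ins, and the ceiling-division sub-group count is an assumption rather than something taken from the libclc source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for __clc_get_num_sub_groups().
   Assumption: a partial trailing sub-group still counts, so this is a
   ceiling division. */
static uint32_t model_num_sub_groups(uint32_t linear_size, uint32_t max_sg) {
  return (linear_size + max_sg - 1) / max_sg;
}

/* Hypothetical host-side model of the refined __clc_get_sub_group_size().
   linear_size  : product of the three local sizes
   max_sg       : stand-in for __clc_get_max_sub_group_size()
   sub_group_id : stand-in for __clc_get_sub_group_id() */
static uint32_t model_sub_group_size(uint32_t linear_size, uint32_t max_sg,
                                     uint32_t sub_group_id) {
  assert(linear_size != 0 && max_sg != 0);
  uint32_t remainder = linear_size % max_sg;
  /* Fast path: when remainder == 0, the || short-circuits and the
     sub-group count is never computed. */
  int full_sub_group =
      (remainder == 0) ||
      (sub_group_id < model_num_sub_groups(linear_size, max_sg) - 1);
  /* Only the last sub-group of a non-multiple work-group is partial. */
  return full_sub_group ? max_sg : remainder;
}
```

For a work-group of 70 work-items and a max sub-group size of 32, this model gives sizes 32, 32, and 6 for sub-group IDs 0, 1, and 2, which sum back to 70.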

Copilot · 2025-10-13T08:56:11Z

libclc/clc/lib/generic/workitem/clc_get_sub_group_size.cl

+  uint linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
+                     __clc_get_local_size(2);


__clc_get_local_size() is called three times in one expression. Consider caching the results in local variables to avoid the potential overhead of repeated function calls.

Suggested change

-  uint linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
-                     __clc_get_local_size(2);
+  uint local_size_0 = __clc_get_local_size(0);
+  uint local_size_1 = __clc_get_local_size(1);
+  uint local_size_2 = __clc_get_local_size(2);
+  uint linear_size = local_size_0 * local_size_1 * local_size_2;

Copilot · 2025-10-13T08:56:11Z

libclc/clc/lib/generic/workitem/clc_get_sub_group_size.cl

+  uint linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
+                     __clc_get_local_size(2);
+  uint remainder = linear_size % __clc_get_max_sub_group_size();
+  bool full_sub_group = (remainder == 0) || (__clc_get_sub_group_id() <
+                                             __clc_get_num_sub_groups() - 1);
+
+  return full_sub_group ? __clc_get_max_sub_group_size() : remainder;


Type inconsistency: __clc_get_local_size() returns size_t but the result is assigned to uint. This could cause truncation on 64-bit systems where size_t is larger than uint. Consider using size_t for linear_size and remainder variables.

Suggested change

-  uint linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
-                     __clc_get_local_size(2);
-  uint remainder = linear_size % __clc_get_max_sub_group_size();
-  bool full_sub_group = (remainder == 0) || (__clc_get_sub_group_id() <
-                                             __clc_get_num_sub_groups() - 1);
-  return full_sub_group ? __clc_get_max_sub_group_size() : remainder;
+  size_t linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
+                       __clc_get_local_size(2);
+  size_t remainder = linear_size % __clc_get_max_sub_group_size();
+  bool full_sub_group = (remainder == 0) || (__clc_get_sub_group_id() <
+                                             __clc_get_num_sub_groups() - 1);
+  return (uint)(full_sub_group ? __clc_get_max_sub_group_size() : remainder);
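To make the truncation concern in the comment above concrete, here is a minimal host-side sketch. It assumes a 64-bit `size_t` (modelled as `uint64_t`) and a 32-bit `uint` (modelled as `uint32_t`); the function names are hypothetical. Device limits on work-group size are typically far below 2^32, so the overflowing input is illustrative rather than realistic.

```c
#include <stdint.h>

/* Product stored in a 32-bit variable: only the low 32 bits survive.
   Unsigned narrowing is well defined in C (reduction modulo 2^32). */
static uint32_t linear_size_narrow(uint64_t x, uint64_t y, uint64_t z) {
  return (uint32_t)(x * y * z);
}

/* Product kept at full width, as in the size_t variant of the suggestion. */
static uint64_t linear_size_wide(uint64_t x, uint64_t y, uint64_t z) {
  return x * y * z;
}
```

With local sizes 65536 x 65536 x 1 the wide product is 2^32 while the narrow one wraps to 0; for a small work-group such as 8 x 8 x 4 both give 256.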

arsenm · 2025-10-13T10:43:46Z

libclc/clc/lib/generic/workitem/clc_get_sub_group_size.cl

-  return linear_size - uniform_size;
+  size_t linear_size = __clc_get_local_size(0) * __clc_get_local_size(1) *
+                       __clc_get_local_size(2);
+  uint remainder = linear_size % __clc_get_max_sub_group_size();


Not sure how this is faster? The old implementation carefully avoided division, and this introduces urem?

wenju-he · 2025-10-13

In most cases __clc_get_max_sub_group_size() is a power of 2, so the modulo compiles to the same code as & (__clc_get_max_sub_group_size() - 1).

The diffs on nvptx64--nvidiacl.bc and on the llc -march=nvptx64 output show 3 improvements:

The return value range is tightened.
There is a fast path for the case where the total work-group size is a multiple of the max sub-group size.
The number of PTX registers is reduced from 18 to 15.
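The power-of-2 argument in this reply can be checked with a small host-side sketch. The helper names are hypothetical; the point is only that `x % n` and `x & (n - 1)` agree for unsigned `x` when `n` is a power of 2, and differ otherwise.

```c
#include <stdint.h>

/* True when n has exactly one bit set. */
static int is_power_of_two(uint32_t n) {
  return n != 0 && (n & (n - 1)) == 0;
}

/* Remainder via unsigned division (urem in LLVM IR terms). */
static uint32_t rem_by_modulo(uint32_t x, uint32_t n) {
  return x % n;
}

/* Remainder via bit mask; only equal to x % n when n is a power of 2. */
static uint32_t rem_by_mask(uint32_t x, uint32_t n) {
  return x & (n - 1);
}
```

A compiler can only make this substitution when it can prove the divisor is a power of 2, for example when it is a compile-time constant. Whether that holds for every target is not shown in this thread.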

kindly ping

[libclc] Refine __clc_get_sub_group_size with fast full sub-group path

d8a9602


wenju-he requested a review from Copilot October 13, 2025 08:55

llvmbot added the libclc libclc OpenCL library label Oct 13, 2025

wenju-he requested review from arsenm and frasercrmck October 13, 2025 08:55

Copilot AI reviewed Oct 13, 2025

uint -> size_t

954a9e2

arsenm reviewed Oct 13, 2025

wenju-he requested a review from arsenm October 14, 2025 23:37
