[clang][OpenMP] Improve loop structure for distributed loops by ro-i · Pull Request #201670 · llvm/llvm-project

ro-i · 2026-06-04T19:30:41Z

This is a part of a series of patches that rework OpenMP cross-team reductions.

This patch wires up the existing
kmp_sched_distr_static_chunk_sched_static_chunkone schedule so that CodeGen uses it.

Example of the intended change of this patch:

target teams distribute parallel for reduction(+:s)
  for (i = 0; i < N; i++) s += a[i];

Before:

__kmpc_distribute_static_init(91)
for (team_lb = team*nthreads; team_lb < N; team_lb += nteams*nthreads) {
  __kmpc_for_static_init(33)
  for (iv = team_lb + tid; iv < team_lb + nthreads; iv += nthreads) {
    priv += a[iv];
  }
  __kmpc_nvptx_parallel_reduce_nowait_v2
}
__kmpc_nvptx_teams_reduce_nowait_v2

After:

__kmpc_for_static_init(93)
for (iv = team*nthreads + tid;
     iv < N;
     iv += nteams*nthreads) {
    priv += a[iv];
}
__kmpc_nvptx_parallel_reduce_nowait_v2
__kmpc_nvptx_teams_reduce_nowait_v2
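
As a sanity check on the two loop structures above, here is a minimal host-side sketch (not part of the patch) that replays both index patterns sequentially and records which global thread id, team*nthreads + tid, runs each iteration. The names `Owners`, `record`, `ownersBefore`, and `ownersAfter` are hypothetical, and the clamp of the inner bound to N in the "Before" variant is an assumption that mirrors the `UB = min(UB, GlobalUB)` step in the existing CodeGen.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// owners[iv] is the global thread id (team * nthreads + tid) that ran
// iteration iv, -1 if nobody ran it, or -2 if it was run more than once.
using Owners = std::vector<std::int64_t>;

static void record(Owners &owners, std::int64_t iv, std::int64_t who) {
  owners[iv] = (owners[iv] == -1) ? who : -2;
}

// "Before": outer per-team chunk loop, inner per-thread loop over one chunk.
Owners ownersBefore(std::int64_t N, std::int64_t nteams,
                    std::int64_t nthreads) {
  assert(N >= 0 && nteams > 0 && nthreads > 0);
  Owners owners(static_cast<std::size_t>(N), -1);
  for (std::int64_t team = 0; team < nteams; ++team)
    for (std::int64_t tid = 0; tid < nthreads; ++tid)
      for (std::int64_t team_lb = team * nthreads; team_lb < N;
           team_lb += nteams * nthreads)
        // Assumption: the chunk's upper bound is clamped to N.
        for (std::int64_t iv = team_lb + tid;
             iv < team_lb + nthreads && iv < N; iv += nthreads)
          record(owners, iv, team * nthreads + tid);
  return owners;
}

// "After": one flattened loop over the team/thread iteration space.
Owners ownersAfter(std::int64_t N, std::int64_t nteams,
                   std::int64_t nthreads) {
  assert(N >= 0 && nteams > 0 && nthreads > 0);
  Owners owners(static_cast<std::size_t>(N), -1);
  for (std::int64_t team = 0; team < nteams; ++team)
    for (std::int64_t tid = 0; tid < nthreads; ++tid)
      for (std::int64_t iv = team * nthreads + tid; iv < N;
           iv += nteams * nthreads)
        record(owners, iv, team * nthreads + tid);
  return owners;
}
```

Comparing `ownersBefore(N, nteams, nthreads)` with `ownersAfter(N, nteams, nthreads)` for a few small sizes should give identical vectors with no -1 or -2 entries. That would indicate the fused form assigns every iteration to the same thread as before and changes only the loop nest and the placement of the parallel reduction.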

Performance:
All performance tests can be reproduced with
https://github.com/ro-i/xteam-test @ commit
6025e5afc14dd6e65ee2658e5001c16e9b9245ff. To reproduce, create a local.mk file in the cloned directory with a suitable OFFLOAD_ARCH for your machine, and set CXX_trunk and CXX_trunk_cg to the paths of the clang++ binaries for llvm/main and this patch, respectively. (llvm/main should ideally be at the commit that is currently the base of this PR; at the moment, this is 69f7aeb.) Then run make trunk trunk_cg to build the benchmark binaries for 208 and 10400 teams. Run them with ./run_bench.sh -rq -n10 red_trunk_208 red_trunk_cg_208 red_trunk_10400 red_trunk_cg_10400 to get the average performance numbers over 10 rounds. This tests multiple reduction workloads, including reductions that run in Generic-SPMD mode, with 208 teams and with 10400 teams, each with 512 threads, and with a reduction array size of 177,777,777. I tested on a gfx942 and found the following numbers, which show the performance of this patch relative to the baseline:

red_comb_sep_arr_32    double   change for 208 teams:    +0.01%   change for 10400 teams:    +5.53%
red_sum_arr_32         double   change for 208 teams:  +570.47%   change for 10400 teams:    -2.23%
red_comb               double   change for 208 teams:  +350.30%   change for 10400 teams:    +0.72%
red_comb_sep           double   change for 208 teams:    +4.82%   change for 10400 teams:    +2.18%
red_dot                double   change for 208 teams:  +202.45%   change for 10400 teams:    +3.48%
red_indirect           double   change for 208 teams:  +239.33%   change for 10400 teams:    +4.63%
red_kernel_part        double   change for 208 teams:    +3.30%   change for 10400 teams:    +3.43%
red_max                double   change for 208 teams:  +273.46%   change for 10400 teams:    +5.12%
red_mult               double   change for 208 teams:  +239.50%   change for 10400 teams:    +5.23%
red_sum                double   change for 208 teams:  +239.47%   change for 10400 teams:    +5.15%
red_pi                 double   change for 208 teams:   +90.06%   change for 10400 teams:   +78.67%
red_comb_sep_arr_32    uint     change for 208 teams:    -0.16%   change for 10400 teams:   +26.98%
red_sum_arr_32         uint     change for 208 teams:  +139.64%   change for 10400 teams:   -14.55%
red_dot                uint     change for 208 teams:  +202.92%   change for 10400 teams:    +5.11%
red_max                uint     change for 208 teams:  +221.41%   change for 10400 teams:    +6.54%
red_sum                uint     change for 208 teams:  +220.83%   change for 10400 teams:    +7.80%
red_comb_sep_arr_32    ulong    change for 208 teams:    -0.19%   change for 10400 teams:    +5.80%
red_sum_arr_32         ulong    change for 208 teams:  +523.98%   change for 10400 teams:    -3.17%
red_dot                ulong    change for 208 teams:  +232.14%   change for 10400 teams:    +3.57%
red_max                ulong    change for 208 teams:  +279.87%   change for 10400 teams:    +6.17%
red_sum                ulong    change for 208 teams:  +261.54%   change for 10400 teams:    +5.72%
red_comb_sep_arr_32    Value    change for 208 teams:    +0.22%   change for 10400 teams:    +0.04%
red_sum_arr_32         Value    change for 208 teams:  +423.38%   change for 10400 teams:    +9.08%
red_dot                Value    change for 208 teams:  +153.87%   change for 10400 teams:    -2.62%
red_max                Value    change for 208 teams: +1097.62%   change for 10400 teams:  +261.16%
red_sum                Value    change for 208 teams:  +358.88%   change for 10400 teams:   +21.44%
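
One way to read the gap between the 208-team and 10400-team columns, going only by the Before/After pseudocode in this description: in the "Before" structure, `__kmpc_nvptx_parallel_reduce_nowait_v2` sits inside the outer `team_lb` loop, so each team performs one parallel reduction per outer trip, whereas the "After" structure performs one per team. The sketch below computes that trip count for team 0. `outerTripsBefore` is a hypothetical helper, and whether this count actually accounts for the measured differences is an assumption that was not verified here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the patch): number of trips the outer
// team_lb loop of the "Before" pseudocode makes for team 0, which equals the
// number of in-loop parallel reductions that team performs. Other teams make
// this many trips or one fewer.
std::int64_t outerTripsBefore(std::int64_t N, std::int64_t nteams,
                              std::int64_t nthreads) {
  assert(N >= 0 && nteams > 0 && nthreads > 0);
  const std::int64_t stride = nteams * nthreads; // team_lb += nteams*nthreads
  return (N + stride - 1) / stride;              // ceil(N / stride)
}
```

With the benchmark parameters quoted above (array size 177,777,777 and 512 threads per team), `outerTripsBefore(177777777, 208, 512)` evaluates to 1670 and `outerTripsBefore(177777777, 10400, 512)` to 34.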

Claude assisted with this patch.


llvmorg-github-actions · 2026-06-04T19:31:18Z

@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-offload

@llvm/pr-subscribers-backend-amdgpu

Author: Robert Imschweiler (ro-i)

Changes

This is a part of a series of patches that rework OpenMP cross-team reductions.

This patch wires up the existing
kmp_sched_distr_static_chunk_sched_static_chunkone schedule so that CodeGen uses it.

Example of the intended change of this patch:

target teams distribute parallel for reduction(+:s)
  for (i = 0; i < N; i++) s += a[i];

Before:

__kmpc_distribute_static_init(91)
for (team_lb = team*nthreads; team_lb < N; team_lb += nteams*nthreads) {
  __kmpc_for_static_init(33)
  for (iv = team_lb + tid; iv < team_lb + nthreads; iv += nthreads) {
    priv += a[iv];
  }
  __kmpc_nvptx_parallel_reduce_nowait_v2
}
__kmpc_nvptx_teams_reduce_nowait_v2

After:

__kmpc_for_static_init(93)
for (iv = team*nthreads + tid;
     iv < N;
     iv += nteams*nthreads) {
    priv += a[iv];
}
__kmpc_nvptx_parallel_reduce_nowait_v2
__kmpc_nvptx_teams_reduce_nowait_v2

Performance:
All performance tests can be reproduced with
https://github.com/ro-i/xteam-test @ commit
6025e5afc14dd6e65ee2658e5001c16e9b9245ff. To reproduce, create a local.mk file in the cloned directory with a suitable OFFLOAD_ARCH for your machine, and set CXX_trunk and CXX_trunk_cg to the paths of the clang++ binaries for llvm/main and this patch, respectively. (llvm/main should ideally be at the commit that is currently the base of this PR; at the moment, this is 69f7aeb.) Then run make trunk trunk_cg to build the benchmark binaries for 208 and 10400 teams. Run them with ./run_bench.sh -rq -n10 red_trunk_208 red_trunk_cg_208 red_trunk_10400 red_trunk_cg_10400 to get the average performance numbers over 10 rounds. This tests multiple reduction workloads, including reductions that run in Generic-SPMD mode, with 208 teams and with 10400 teams, each with 512 threads, and with a reduction array size of 177,777,777. I tested on a gfx942 and found the following numbers, which show the performance of this patch relative to the baseline:

red_comb_sep_arr_32    double   change for 208 teams:    +0.01%   change for 10400 teams:    +5.53%
red_sum_arr_32         double   change for 208 teams:  +570.47%   change for 10400 teams:    -2.23%
red_comb               double   change for 208 teams:  +350.30%   change for 10400 teams:    +0.72%
red_comb_sep           double   change for 208 teams:    +4.82%   change for 10400 teams:    +2.18%
red_dot                double   change for 208 teams:  +202.45%   change for 10400 teams:    +3.48%
red_indirect           double   change for 208 teams:  +239.33%   change for 10400 teams:    +4.63%
red_kernel_part        double   change for 208 teams:    +3.30%   change for 10400 teams:    +3.43%
red_max                double   change for 208 teams:  +273.46%   change for 10400 teams:    +5.12%
red_mult               double   change for 208 teams:  +239.50%   change for 10400 teams:    +5.23%
red_sum                double   change for 208 teams:  +239.47%   change for 10400 teams:    +5.15%
red_pi                 double   change for 208 teams:   +90.06%   change for 10400 teams:   +78.67%
red_comb_sep_arr_32    uint     change for 208 teams:    -0.16%   change for 10400 teams:   +26.98%
red_sum_arr_32         uint     change for 208 teams:  +139.64%   change for 10400 teams:   -14.55%
red_dot                uint     change for 208 teams:  +202.92%   change for 10400 teams:    +5.11%
red_max                uint     change for 208 teams:  +221.41%   change for 10400 teams:    +6.54%
red_sum                uint     change for 208 teams:  +220.83%   change for 10400 teams:    +7.80%
red_comb_sep_arr_32    ulong    change for 208 teams:    -0.19%   change for 10400 teams:    +5.80%
red_sum_arr_32         ulong    change for 208 teams:  +523.98%   change for 10400 teams:    -3.17%
red_dot                ulong    change for 208 teams:  +232.14%   change for 10400 teams:    +3.57%
red_max                ulong    change for 208 teams:  +279.87%   change for 10400 teams:    +6.17%
red_sum                ulong    change for 208 teams:  +261.54%   change for 10400 teams:    +5.72%
red_comb_sep_arr_32    Value    change for 208 teams:    +0.22%   change for 10400 teams:    +0.04%
red_sum_arr_32         Value    change for 208 teams:  +423.38%   change for 10400 teams:    +9.08%
red_dot                Value    change for 208 teams:  +153.87%   change for 10400 teams:    -2.62%
red_max                Value    change for 208 teams: +1097.62%   change for 10400 teams:  +261.16%
red_sum                Value    change for 208 teams:  +358.88%   change for 10400 teams:   +21.44%

Patch is 1.39 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/201670.diff

16 Files Affected:

(modified) clang/include/clang/Basic/OpenMPKinds.h (+3)
(modified) clang/lib/CodeGen/CGOpenMPRuntime.cpp (+17-5)
(modified) clang/lib/CodeGen/CGStmtOpenMP.cpp (+119-92)
(modified) clang/test/OpenMP/amdgcn_target_device_vla.cpp (+33-93)
(modified) clang/test/OpenMP/amdgpu_target_with_aligned_attribute.c (+23-83)
(modified) clang/test/OpenMP/metadirective_device_arch_codegen.cpp (+4-7)
(modified) clang/test/OpenMP/nvptx_SPMD_codegen.cpp (+2367-3627)
(modified) clang/test/OpenMP/nvptx_distribute_parallel_generic_mode_codegen.cpp (+59-179)
(modified) clang/test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp (+301-1141)
(modified) clang/test/OpenMP/nvptx_target_teams_distribute_parallel_for_simd_codegen.cpp (+223-543)
(modified) clang/test/OpenMP/nvptx_target_teams_generic_loop_codegen.cpp (+316-1156)
(modified) clang/test/OpenMP/nvptx_target_teams_generic_loop_generic_mode_codegen.cpp (+16-86)
(modified) clang/test/OpenMP/target_teams_generic_loop_codegen.cpp (+48-102)
(modified) clang/test/OpenMP/target_teams_generic_loop_codegen_as_distribute.cpp (+21-56)
(modified) clang/test/OpenMP/target_teams_generic_loop_codegen_as_parallel_for.cpp (+120-360)
(modified) offload/test/offloading/gpupgo/pgo_atomic_teams.c (+4-2)

diff --git a/clang/include/clang/Basic/OpenMPKinds.h b/clang/include/clang/Basic/OpenMPKinds.h
index 4e83bfcd0128b..516219a408edb 100644
--- a/clang/include/clang/Basic/OpenMPKinds.h
+++ b/clang/include/clang/Basic/OpenMPKinds.h
@@ -188,6 +188,9 @@ struct OpenMPScheduleTy final {
   OpenMPScheduleClauseKind Schedule = OMPC_SCHEDULE_unknown;
   OpenMPScheduleClauseModifier M1 = OMPC_SCHEDULE_MODIFIER_unknown;
   OpenMPScheduleClauseModifier M2 = OMPC_SCHEDULE_MODIFIER_unknown;
+  /// Request the fused distr_static_chunk + static_chunkone runtime schedule
+  /// in `for_static_init`. The outer `distribute_static_init` is skipped.
+  bool IsDistChunkedAndChunkOne = false;
 };
 
 /// OpenMP modifiers for 'reduction' clause.
diff --git a/clang/lib/CodeGen/CGOpenMPRuntime.cpp b/clang/lib/CodeGen/CGOpenMPRuntime.cpp
index f3158f48e7944..4462d5b63d677 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntime.cpp
+++ b/clang/lib/CodeGen/CGOpenMPRuntime.cpp
@@ -546,6 +546,12 @@ enum OpenMPSchedType {
   /// dist_schedule types
   OMP_dist_sch_static_chunked = 91,
   OMP_dist_sch_static = 92,
+  /// Fused distribute+for static schedule (entityId = team*nthreads + tid,
+  /// num_entities = nteams*nthreads). One for_static_init call, no
+  /// surrounding distribute_static_init. Matches
+  /// kmp_sched_distr_static_chunk_sched_static_chunkone in the device RTL
+  /// (openmp/device/include/DeviceTypes.h).
+  OMP_dist_sch_static_chunked_sch_static_chunkone = 93,
   /// Support for OpenMP 4.5 monotonic and nonmonotonic schedule modifiers.
   /// Set if the monotonic schedule modifier was present.
   OMP_sch_modifier_monotonic = (1 << 29),
@@ -2630,7 +2636,8 @@ static int addMonoNonMonoModifier(CodeGenModule &CGM, OpenMPSchedType Schedule,
           Schedule == OMP_sch_static_balanced_chunked ||
           Schedule == OMP_ord_static_chunked || Schedule == OMP_ord_static ||
           Schedule == OMP_dist_sch_static_chunked ||
-          Schedule == OMP_dist_sch_static))
+          Schedule == OMP_dist_sch_static ||
+          Schedule == OMP_dist_sch_static_chunked_sch_static_chunkone))
       Modifier = OMP_sch_modifier_nonmonotonic;
   }
   return Schedule | Modifier;
@@ -2692,7 +2699,8 @@ static void emitForStaticInitCall(
          Schedule == OMP_sch_static_balanced_chunked ||
          Schedule == OMP_ord_static || Schedule == OMP_ord_static_chunked ||
          Schedule == OMP_dist_sch_static ||
-         Schedule == OMP_dist_sch_static_chunked);
+         Schedule == OMP_dist_sch_static_chunked ||
+         Schedule == OMP_dist_sch_static_chunked_sch_static_chunkone);
 
   // Call __kmpc_for_static_init(
   //          ident_t *loc, kmp_int32 tid, kmp_int32 schedtype,
@@ -2710,7 +2718,8 @@ static void emitForStaticInitCall(
     assert((Schedule == OMP_sch_static_chunked ||
             Schedule == OMP_sch_static_balanced_chunked ||
             Schedule == OMP_ord_static_chunked ||
-            Schedule == OMP_dist_sch_static_chunked) &&
+            Schedule == OMP_dist_sch_static_chunked ||
+            Schedule == OMP_dist_sch_static_chunked_sch_static_chunkone) &&
            "expected static chunked schedule");
   }
   llvm::Value *Args[] = {
@@ -2733,8 +2742,11 @@ void CGOpenMPRuntime::emitForStaticInit(CodeGenFunction &CGF,
                                         OpenMPDirectiveKind DKind,
                                         const OpenMPScheduleTy &ScheduleKind,
                                         const StaticRTInput &Values) {
-  OpenMPSchedType ScheduleNum = getRuntimeSchedule(
-      ScheduleKind.Schedule, Values.Chunk != nullptr, Values.Ordered);
+  OpenMPSchedType ScheduleNum =
+      ScheduleKind.IsDistChunkedAndChunkOne
+          ? OMP_dist_sch_static_chunked_sch_static_chunkone
+          : getRuntimeSchedule(ScheduleKind.Schedule, Values.Chunk != nullptr,
+                               Values.Ordered);
   assert((isOpenMPWorksharingDirective(DKind) || (DKind == OMPD_loop)) &&
          "Expected loop-based or sections-based directive.");
   llvm::Value *UpdatedLocation = emitUpdateLocation(CGF, Loc,
diff --git a/clang/lib/CodeGen/CGStmtOpenMP.cpp b/clang/lib/CodeGen/CGStmtOpenMP.cpp
index 1eaf8efa142c5..376e9bd1cee4e 100644
--- a/clang/lib/CodeGen/CGStmtOpenMP.cpp
+++ b/clang/lib/CodeGen/CGStmtOpenMP.cpp
@@ -50,6 +50,16 @@ static const VarDecl *getBaseDecl(const Expr *Ref);
 static OpenMPDirectiveKind
 getEffectiveDirectiveKind(const OMPExecutableDirective &S);
 
+static bool canEmitGPUFusedDistSchedule(const CodeGenModule &CGM,
+                                        const OMPLoopDirective &S,
+                                        OpenMPDirectiveKind DKind) {
+  return CGM.getLangOpts().OpenMPIsTargetDevice && CGM.getTriple().isGPU() &&
+         isOpenMPLoopBoundSharingDirective(DKind) &&
+         !S.getSingleClause<OMPDistScheduleClause>() &&
+         !S.getSingleClause<OMPScheduleClause>() &&
+         !S.getSingleClause<OMPOrderedClause>();
+}
+
 namespace {
 /// Lexical scope for OpenMP executable constructs, that handles correct codegen
 /// for captured expressions.
@@ -3879,6 +3889,12 @@ bool CodeGenFunction::EmitOMPWorksharingLoop(
           RT.isStaticChunked(ScheduleKind.Schedule,
                              /* Chunked */ Chunk != nullptr) &&
           HasChunkSizeOne && isOpenMPLoopBoundSharingDirective(EKind);
+      // GPU combined `distribute parallel for`: emit a single
+      // for_static_init with the fused distr_static_chunk + static_chunkone
+      // schedule (enum 93). The surrounding EmitOMPDistributeLoop must skip
+      // its distribute_static_init under the same conditions.
+      if (StaticChunkedOne && canEmitGPUFusedDistSchedule(CGM, S, EKind))
+        ScheduleKind.IsDistChunkedAndChunkOne = true;
       bool IsMonotonic =
           Ordered ||
           (ScheduleKind.Schedule == OMPC_SCHEDULE_static &&
@@ -6275,102 +6291,113 @@ void CodeGenFunction::EmitOMPDistributeLoop(const OMPLoopDirective &S,
       const unsigned IVSize = getContext().getTypeSize(IVExpr->getType());
       const bool IVSigned = IVExpr->getType()->hasSignedIntegerRepresentation();
 
-      // OpenMP [2.10.8, distribute Construct, Description]
-      // If dist_schedule is specified, kind must be static. If specified,
-      // iterations are divided into chunks of size chunk_size, chunks are
-      // assigned to the teams of the league in a round-robin fashion in the
-      // order of the team number. When no chunk_size is specified, the
-      // iteration space is divided into chunks that are approximately equal
-      // in size, and at most one chunk is distributed to each team of the
-      // league. The size of the chunks is unspecified in this case.
-      bool StaticChunked =
-          RT.isStaticChunked(ScheduleKind, /* Chunked */ Chunk != nullptr) &&
-          isOpenMPLoopBoundSharingDirective(S.getDirectiveKind());
-      if (RT.isStaticNonchunked(ScheduleKind,
-                                /* Chunked */ Chunk != nullptr) ||
-          StaticChunked) {
-        CGOpenMPRuntime::StaticRTInput StaticInit(
-            IVSize, IVSigned, /* Ordered = */ false, IL.getAddress(),
-            LB.getAddress(), UB.getAddress(), ST.getAddress(),
-            StaticChunked ? Chunk : nullptr);
-        RT.emitDistributeStaticInit(*this, S.getBeginLoc(), ScheduleKind,
-                                    StaticInit);
+      // GPU fused schedule: omit the outer distribute loop and let the inner
+      // worksharing loop schedule the flattened team/thread iteration space.
+      if (canEmitGPUFusedDistSchedule(CGM, S, S.getDirectiveKind())) {
         JumpDest LoopExit =
             getJumpDestInCurrentScope(createBasicBlock("omp.loop.exit"));
-        // UB = min(UB, GlobalUB);
-        EmitIgnoredExpr(isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
-                            ? S.getCombinedEnsureUpperBound()
-                            : S.getEnsureUpperBound());
-        // IV = LB;
-        EmitIgnoredExpr(isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
-                            ? S.getCombinedInit()
-                            : S.getInit());
-
-        const Expr *Cond =
-            isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
-                ? S.getCombinedCond()
-                : S.getCond();
-
-        if (StaticChunked)
-          Cond = S.getCombinedDistCond();
-
-        // For static unchunked schedules generate:
-        //
-        //  1. For distribute alone, codegen
-        //    while (idx <= UB) {
-        //      BODY;
-        //      ++idx;
-        //    }
-        //
-        //  2. When combined with 'for' (e.g. as in 'distribute parallel for')
-        //    while (idx <= UB) {
-        //      <CodeGen rest of pragma>(LB, UB);
-        //      idx += ST;
-        //    }
-        //
-        // For static chunk one schedule generate:
-        //
-        // while (IV <= GlobalUB) {
-        //   <CodeGen rest of pragma>(LB, UB);
-        //   LB += ST;
-        //   UB += ST;
-        //   UB = min(UB, GlobalUB);
-        //   IV = LB;
-        // }
-        //
-        emitCommonSimdLoop(
-            *this, S,
-            [&S](CodeGenFunction &CGF, PrePostActionTy &) {
-              if (isOpenMPSimdDirective(S.getDirectiveKind()))
-                CGF.EmitOMPSimdInit(S);
-            },
-            [&S, &LoopScope, Cond, IncExpr, LoopExit, &CodeGenLoop,
-             StaticChunked](CodeGenFunction &CGF, PrePostActionTy &) {
-              CGF.EmitOMPInnerLoop(
-                  S, LoopScope.requiresCleanups(), Cond, IncExpr,
-                  [&S, LoopExit, &CodeGenLoop](CodeGenFunction &CGF) {
-                    CodeGenLoop(CGF, S, LoopExit);
-                  },
-                  [&S, StaticChunked](CodeGenFunction &CGF) {
-                    if (StaticChunked) {
-                      CGF.EmitIgnoredExpr(S.getCombinedNextLowerBound());
-                      CGF.EmitIgnoredExpr(S.getCombinedNextUpperBound());
-                      CGF.EmitIgnoredExpr(S.getCombinedEnsureUpperBound());
-                      CGF.EmitIgnoredExpr(S.getCombinedInit());
-                    }
-                  });
-            });
+        CodeGenLoop(*this, S, LoopExit);
         EmitBlock(LoopExit.getBlock());
-        // Tell the runtime we are done.
-        RT.emitForStaticFinish(*this, S.getEndLoc(), OMPD_distribute);
       } else {
-        // Emit the outer loop, which requests its work chunk [LB..UB] from
-        // runtime and runs the inner loop to process it.
-        const OMPLoopArguments LoopArguments = {
-            LB.getAddress(), UB.getAddress(), ST.getAddress(), IL.getAddress(),
-            Chunk};
-        EmitOMPDistributeOuterLoop(ScheduleKind, S, LoopScope, LoopArguments,
-                                   CodeGenLoop);
+        // OpenMP [2.10.8, distribute Construct, Description]
+        // If dist_schedule is specified, kind must be static. If specified,
+        // iterations are divided into chunks of size chunk_size, chunks are
+        // assigned to the teams of the league in a round-robin fashion in the
+        // order of the team number. When no chunk_size is specified, the
+        // iteration space is divided into chunks that are approximately equal
+        // in size, and at most one chunk is distributed to each team of the
+        // league. The size of the chunks is unspecified in this case.
+        bool StaticChunked =
+            RT.isStaticChunked(ScheduleKind, /* Chunked */ Chunk != nullptr) &&
+            isOpenMPLoopBoundSharingDirective(S.getDirectiveKind());
+        if (RT.isStaticNonchunked(ScheduleKind,
+                                  /* Chunked */ Chunk != nullptr) ||
+            StaticChunked) {
+          CGOpenMPRuntime::StaticRTInput StaticInit(
+              IVSize, IVSigned, /* Ordered = */ false, IL.getAddress(),
+              LB.getAddress(), UB.getAddress(), ST.getAddress(),
+              StaticChunked ? Chunk : nullptr);
+          RT.emitDistributeStaticInit(*this, S.getBeginLoc(), ScheduleKind,
+                                      StaticInit);
+          JumpDest LoopExit =
+              getJumpDestInCurrentScope(createBasicBlock("omp.loop.exit"));
+          // UB = min(UB, GlobalUB);
+          EmitIgnoredExpr(
+              isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
+                  ? S.getCombinedEnsureUpperBound()
+                  : S.getEnsureUpperBound());
+          // IV = LB;
+          EmitIgnoredExpr(
+              isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
+                  ? S.getCombinedInit()
+                  : S.getInit());
+
+          const Expr *Cond =
+              isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
+                  ? S.getCombinedCond()
+                  : S.getCond();
+
+          if (StaticChunked)
+            Cond = S.getCombinedDistCond();
+
+          // For static unchunked schedules generate:
+          //
+          //  1. For distribute alone, codegen
+          //    while (idx <= UB) {
+          //      BODY;
+          //      ++idx;
+          //    }
+          //
+          //  2. When combined with 'for' (e.g. as in 'distribute parallel for')
+          //    while (idx <= UB) {
+          //      <CodeGen rest of pragma>(LB, UB);
+          //      idx += ST;
+          //    }
+          //
+          // For static chunk one schedule generate:
+          //
+          // while (IV <= GlobalUB) {
+          //   <CodeGen rest of pragma>(LB, UB);
+          //   LB += ST;
+          //   UB += ST;
+          //   UB = min(UB, GlobalUB);
+          //   IV = LB;
+          // }
+          //
+          emitCommonSimdLoop(
+              *this, S,
+              [&S](CodeGenFunction &CGF, PrePostActionTy &) {
+                if (isOpenMPSimdDirective(S.getDirectiveKind()))
+                  CGF.EmitOMPSimdInit(S);
+              },
+              [&S, &LoopScope, Cond, IncExpr, LoopExit, &CodeGenLoop,
+               StaticChunked](CodeGenFunction &CGF, PrePostActionTy &) {
+                CGF.EmitOMPInnerLoop(
+                    S, LoopScope.requiresCleanups(), Cond, IncExpr,
+                    [&S, LoopExit, &CodeGenLoop](CodeGenFunction &CGF) {
+                      CodeGenLoop(CGF, S, LoopExit);
+                    },
+                    [&S, StaticChunked](CodeGenFunction &CGF) {
+                      if (StaticChunked) {
+                        CGF.EmitIgnoredExpr(S.getCombinedNextLowerBound());
+                        CGF.EmitIgnoredExpr(S.getCombinedNextUpperBound());
+                        CGF.EmitIgnoredExpr(S.getCombinedEnsureUpperBound());
+                        CGF.EmitIgnoredExpr(S.getCombinedInit());
+                      }
+                    });
+              });
+          EmitBlock(LoopExit.getBlock());
+          // Tell the runtime we are done.
+          RT.emitForStaticFinish(*this, S.getEndLoc(), OMPD_distribute);
+        } else {
+          // Emit the outer loop, which requests its work chunk [LB..UB] from
+          // runtime and runs the inner loop to process it.
+          const OMPLoopArguments LoopArguments = {
+              LB.getAddress(), UB.getAddress(), ST.getAddress(),
+              IL.getAddress(), Chunk};
+          EmitOMPDistributeOuterLoop(ScheduleKind, S, LoopScope, LoopArguments,
+                                     CodeGenLoop);
+        }
       }
       if (isOpenMPSimdDirective(S.getDirectiveKind())) {
         EmitOMPSimdFinal(S, [IL, &S](CodeGenFunction &CGF) {
diff --git a/clang/test/OpenMP/amdgcn_target_device_vla.cpp b/clang/test/OpenMP/amdgcn_target_device_vla.cpp
index 5064c114c0863..323725f215f79 100644
--- a/clang/test/OpenMP/amdgcn_target_device_vla.cpp
+++ b/clang/test/OpenMP/amdgcn_target_device_vla.cpp
@@ -276,92 +276,32 @@ int main() {
 // CHECK-NEXT:    store i32 1, ptr [[DOTOMP_STRIDE_ASCAST]], align 4
 // CHECK-NEXT:    store i32 0, ptr [[DOTOMP_IS_LAST_ASCAST]], align 4
 // CHECK-NEXT:    [[NVPTX_NUM_THREADS:%.*]] = call i32 @__kmpc_get_hardware_num_threads_in_block()
-// CHECK-NEXT:    [[TMP6:%.*]] = load ptr, ptr [[DOTGLOBAL_TID__ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP7:%.*]] = load i32, ptr [[TMP6]], align 4
-// CHECK-NEXT:    call void @__kmpc_distribute_static_init_4(ptr addrspacecast (ptr addrspace(1) @[[GLOB2:[0-9]+]] to ptr), i32 [[TMP7]], i32 91, ptr [[DOTOMP_IS_LAST_ASCAST]], ptr [[DOTOMP_COMB_LB_ASCAST]], ptr [[DOTOMP_COMB_UB_ASCAST]], ptr [[DOTOMP_STRIDE_ASCAST]], i32 1, i32 [[NVPTX_NUM_THREADS]])
+// CHECK-NEXT:    [[TMP6:%.*]] = load i32, ptr [[DOTOMP_COMB_LB_ASCAST]], align 4
+// CHECK-NEXT:    [[TMP7:%.*]] = zext i32 [[TMP6]] to i64
 // CHECK-NEXT:    [[TMP8:%.*]] = load i32, ptr [[DOTOMP_COMB_UB_ASCAST]], align 4
-// CHECK-NEXT:    [[TMP9:%.*]] = load i32, ptr [[DOTCAPTURE_EXPR_1_ASCAST]], align 4
-// CHECK-NEXT:    [[CMP4:%.*]] = icmp sgt i32 [[TMP8]], [[TMP9]]
-// CHECK-NEXT:    br i1 [[CMP4]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
-// CHECK:       cond.true:
-// CHECK-NEXT:    [[TMP10:%.*]] = load i32, ptr [[DOTCAPTURE_EXPR_1_ASCAST]], align 4
-// CHECK-NEXT:    br label [[COND_END:%.*]]
-// CHECK:       cond.false:
-// CHECK-NEXT:    [[TMP11:%.*]] = load i32, ptr [[DOTOMP_COMB_UB_ASCAST]], align 4
-// CHECK-NEXT:    br label [[COND_END]]
-// CHECK:       cond.end:
-// CHECK-NEXT:    [[COND:%.*]] = phi i32 [ [[TMP10]], [[COND_TRUE]] ], [ [[TMP11]], [[COND_FALSE]] ]
-// CHECK-NEXT:    store i32 [[COND]], ptr [[DOTOMP_COMB_UB_ASCAST]], align 4
-// CHECK-NEXT:    [[TMP12:%.*]] = load i32, ptr [[DOTOMP_COMB_LB_ASCAST]], align 4
-// CHECK-NEXT:    store i32 [[TMP12]], ptr [[DOTOMP_IV_ASCAST]], align 4
-// CHECK-NEXT:    br label [[OMP_INNER_FOR_COND:%.*]]
-// CHECK:       omp.inner.for.cond:
-// CHECK-NEXT:    [[TMP13:%.*]] = load i32, ptr [[DOTOMP_IV_ASCAST]], align 4
-// CHECK-NEXT:    [[TMP14:%.*]] = load i32, ptr [[DOTCAPTURE_EXPR_1_ASCAST]], align 4
-// CHECK-NEXT:    [[ADD:%.*]] = add nsw i32 [[TMP14]], 1
-// CHECK-NEXT:    [[CMP5:%.*]] = icmp slt i32 [[TMP13]], [[ADD]]
-// CHECK-NEXT:    br i1 [[CMP5]], label [[OMP_INNER_FOR_BODY:%.*]], label [[OMP_INNER_FOR_END:%.*]]
-// CHECK:       omp.inner.for.body:
-// CHECK-NEXT:    [[TMP15:%.*]] = load i32, ptr [[DOTOMP_COMB_LB_ASCAST]], align 4
-// CHECK-NEXT:    [[TMP16:%.*]] = zext i32 [[TMP15]] to i64
-// CHECK-NEXT:    [[TMP17:%.*]] = load i32, ptr [[DOTOMP_COMB_UB_ASCAST]], align 4
-// CHECK-NEXT:    [[TMP18:%.*]] = zext i32 [[TMP17]] to i64
-// CHECK-NEXT:    [[TMP19:%.*]] = load i32, ptr [[M_ADDR_ASCAST]], align 4
-// CHECK-NEXT:    store i32 [[TMP19]], ptr addrspace(5) [[M_CASTED]], align 4
-// CHECK-NEXT:    [[TMP20:%.*]] = load i64, ptr addrspace(5) [[M_CASTED]], align 8
-// CHECK-NEXT:    [[TMP21:%.*]] = getelementptr inbounds [5 x ptr], ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 0, i64 0
-// CHECK-NEXT:    [[TMP22:%.*]] = inttoptr i64 [[TMP16]] to ptr
-// CHECK-NEXT:    store ptr [[TMP22]], ptr [[TMP21]], align 8
-// CHECK-NEXT:    [[TMP23:%.*]] = getelementptr inbounds [5 x ptr], ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 0, i64 1
-// CHECK-NEXT:    [[TMP24:%.*]] = inttoptr i64 [[TMP18]] to ptr
-// CHECK-NEXT:    store ptr [[TMP24]], ptr [[TMP23]], align 8
-// CHECK-NEXT:    [[TMP25:%.*]] = getelementptr inbounds [5 x ptr], ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 0, i64 2
-// CHECK-NEXT:    [[TMP26:%.*]] = inttoptr i64 [[TMP20]] to ptr
-// CHECK-NEXT:    store ptr [[TMP26]], ptr [[TMP25]], align 8
-// CHECK-NEXT:    [[TMP27:%.*]] = getelementptr inbounds [5 x ptr], ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 0, i64 3
-// CHECK-NEXT:    [[TMP28:%.*]] = inttoptr i64 [[TMP0]] to ptr
-// CHECK-NEXT:    store ptr [[TMP28]], ptr [[TMP27]], align 8
-// CHECK-NEXT:    [[TMP29:%.*]] = getelementptr inbounds [5 x ptr], ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 0, i64 4
-// CHECK-NEXT:    store ptr [[TMP1]], ptr [[TMP29]], align 8
-// CHECK-NEXT:    [[TMP30:%.*]] = load ptr, ptr [[DOTGLOBAL_TID__ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP31:%.*]] = load i32, ptr [[TMP30]], align 4
-// CHECK-NEXT:    call void @__kmpc_parallel_60(ptr addrspacecast (ptr addrspace(1) @[[GLOB1]] to ptr), i32 [[TMP31]], i32 1, i32 -1, i32 -1, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo2v_l30_omp_outlined_omp_outlined, ptr null, ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 5, i32 0)
-// CHECK-NEXT:    br label [[OMP_INNER_FOR_INC:%.*]]
-// CHECK:       omp.inn...
[truncated]

jdoerfert · 2026-06-05T23:13:40Z

> red_max Value change for 208 teams: +1097.62% change for 10400 teams: +261.16%

higher is better?

ro-i · 2026-06-06T05:50:35Z

Yes, MB/s percentage change

ro-i · 2026-06-06T07:44:27Z

Btw, it is sometimes a little hard to define what the MB in MB/s should be for a given reduction. But since the same constant is used for both runs, it cancels out of the relative comparison, so it doesn't matter if different people define the MB constant differently.

with m = the MB constant, x = this PR's time (faster, larger MB/s), and y = base time:

((m/x - m/y) / (m/y)) * 100
  = (m/x)/(m/y) * 100 - 100
  = (y/x) * 100 - 100
  = ((y - x) / x) * 100

carlobertolli · 2026-06-06T15:22:36Z

I don't see a condition over the loop having reductions: does it apply to any ttdpf, even without reductions?
If so, do you have performance numbers for those cases? The ones you show have the label "red" in front of them, but I see at least one test in the suite that does not: "evict_device_cache".
FYI: I am expecting this change to apply to non-reduction cases.

carlobertolli · 2026-06-06T15:38:43Z

 static OpenMPDirectiveKind
 getEffectiveDirectiveKind(const OMPExecutableDirective &S);

+static bool canEmitGPUFusedDistSchedule(const CodeGenModule &CGM,


Add description of what FusedDist is here.

carlobertolli · 2026-06-06T15:42:02Z

  OpenMPScheduleClauseModifier M2 = OMPC_SCHEDULE_MODIFIER_unknown;
+  /// Request the fused distr_static_chunk + static_chunkone runtime schedule
+  /// in `for_static_init`. The outer `distribute_static_init` is skipped.
+  bool IsDistChunkedAndChunkOne = false;


Are there more variants? If so, better to use an enum.
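
For illustration only, the enum variant suggested here might look roughly like the sketch below. All names (`FusedDistScheduleKind`, `ScheduleInfoSketch`, `skipsOuterDistributeInit`) are hypothetical and not taken from the patch; the only behaviour carried over is what the doc comment above states, namely that the fused schedule skips the outer `distribute_static_init`.

```cpp
// Hypothetical names throughout; this is not the patch's actual code.
enum class FusedDistScheduleKind {
  None,                   // keep separate distribute_static_init + for_static_init
  DistChunkedAndChunkOne  // fused distr_static_chunk + static_chunkone schedule
};

// Stand-in for the struct that currently carries the bool flag.
struct ScheduleInfoSketch {
  FusedDistScheduleKind Fused = FusedDistScheduleKind::None;
};

// True if the outer distribute_static_init call is skipped for this kind.
constexpr bool skipsOuterDistributeInit(FusedDistScheduleKind K) {
  return K != FusedDistScheduleKind::None;
}
```

With a single fused variant the enum carries the same information as the bool; it would only pay off if further fused schedules were added later.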

ro-i · 2026-06-06T15:47:26Z

> I don't see a condition over the loop having reductions: does it apply to any ttdpf, even without reductions?

Yes

> If so, do you have performance numbers for those cases?

No, not yet. Do you have interesting snippets you'd like to see results for?

> The ones you show have the label "red" in front of them, but I see at least one test in the suite that does not: "evict_device_cache".

That's something else. The current code in my benchmark/test repo contains only cross-team reduction related stuff (and scan, but that's another topic and currently not supported by llvm or rocm).

ro-i · 2026-06-08T09:16:56Z

> No, not yet. Do you have interesting snippets you'd like to see results for?

Hm, I let AI come up with some tests and created some experimental benchmarks for them: https://github.com/ro-i/xteam-test/blob/main/src/xteam_misc.cpp

For these, the changed loop structure does not seem to be beneficial at all. Let me find out why.

ro-i · 2026-06-08T10:34:11Z

One would think that the changed loop structure is basically a no-op for non-reduction loops. But there seem to be some downstream optimizations that behave differently (maybe unrolling of the inner loop).
The easy one-line way out is to just gate my loop change so that it only takes effect for reduction loops. But give me a moment to test something I'm thinking about.
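
To make the "basically a no-op" expectation concrete, here is a host-side sketch that simulates the two loop shapes from the PR description and records which (team, tid) pair executes each iteration. It is a simplified model, not generated code: the function names are hypothetical, and the extra `iv < n` bound in the nested variant is an assumption standing in for the upper-bound clamping done around the runtime calls.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// (team, tid) that executes a given iteration; {-1, -1} means "not executed".
using Owner = std::pair<int, int>;

// "Before" shape: outer distribute loop over team chunks of nthreads
// iterations, inner loop handing one iteration of the chunk to each thread.
std::vector<Owner> ownersNested(std::int64_t n, int nteams, int nthreads) {
  std::vector<Owner> owner(static_cast<std::size_t>(n), Owner{-1, -1});
  const std::int64_t teamStride = static_cast<std::int64_t>(nteams) * nthreads;
  for (int team = 0; team < nteams; ++team)
    for (int tid = 0; tid < nthreads; ++tid)
      for (std::int64_t lb = static_cast<std::int64_t>(team) * nthreads;
           lb < n; lb += teamStride)
        // Assumption: the chunk's upper bound is clamped to n.
        for (std::int64_t iv = lb + tid; iv < lb + nthreads && iv < n;
             iv += nthreads)
          owner[static_cast<std::size_t>(iv)] = Owner{team, tid};
  return owner;
}

// "After" shape: one fused loop with stride nteams * nthreads.
std::vector<Owner> ownersFused(std::int64_t n, int nteams, int nthreads) {
  std::vector<Owner> owner(static_cast<std::size_t>(n), Owner{-1, -1});
  const std::int64_t stride = static_cast<std::int64_t>(nteams) * nthreads;
  for (int team = 0; team < nteams; ++team)
    for (int tid = 0; tid < nthreads; ++tid)
      for (std::int64_t iv = static_cast<std::int64_t>(team) * nthreads + tid;
           iv < n; iv += stride)
        owner[static_cast<std::size_t>(iv)] = Owner{team, tid};
  return owner;
}
```

Comparing `ownersNested(n, nteams, nthreads)` with `ownersFused(n, nteams, nthreads)` shows whether both shapes assign every iteration to the same thread. The model only covers the iteration mapping; it says nothing about how later optimization passes treat the two loop nests, which is where the observed difference is suspected to come from.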

ro-i requested review from CatherineMoore, dhruvachak, jdoerfert, jhuber6 and shiltian June 4, 2026 19:30

llvmorg-github-actions Bot added the backend:AMDGPU, clang:frontend, clang:codegen, clang:openmp, and offload labels Jun 4, 2026

carlobertolli reviewed Jun 6, 2026


[clang][OpenMP] Improve loop structure for distributed loops #201670
ro-i wants to merge 1 commit into main from users/ro-i/xteam-red-codegen


3 participants
