Skip to content

[clang][OpenMP] Improve loop structure for distributed loops#201670

Open
ro-i wants to merge 1 commit into
mainfrom
users/ro-i/xteam-red-codegen
Open

[clang][OpenMP] Improve loop structure for distributed loops#201670
ro-i wants to merge 1 commit into
mainfrom
users/ro-i/xteam-red-codegen

Conversation

@ro-i
Copy link
Copy Markdown
Contributor

@ro-i ro-i commented Jun 4, 2026

This is a part of a series of patches that rework OpenMP cross-team reductions.

This patches wires the existing
kmp_sched_distr_static_chunk_sched_static_chunkone to be used by CodeGen.

Example of the intended change of this patch:

target teams distribute parallel for reduction(+:s)
  for (i = 0; i < N; i++) s += a[i];

Before:

__kmpc_distribute_static_init(91)
for (team_lb = team*nthreads; team_lb < N; team_lb += nteams*nthreads) {
  __kmpc_for_static_init(33)
  for (iv = team_lb + tid; iv < team_lb + nthreads; iv += nthreads) {
    priv += a[iv];
  }
  __kmpc_nvptx_parallel_reduce_nowait_v2
}
__kmpc_nvptx_teams_reduce_nowait_v2

After:

__kmpc_for_static_init(93)
for (iv = team*nthreads + tid;
     iv < N;
     iv += nteams*nthreads) {
    priv += a[iv];
}
__kmpc_nvptx_parallel_reduce_nowait_v2
__kmpc_nvptx_teams_reduce_nowait_v2

Performance:
All performance tests can be reproduced with
https://github.com/ro-i/xteam-test @ commit
6025e5afc14dd6e65ee2658e5001c16e9b9245ff. To reproduce, simply create a local.mk file in the cloned directory with a suitable OFFLOAD_ARCH for your machine and CXX_trunk + CXX_trunk_cg set to the paths of the clang++ binaries for llvm/main and this patch. (llvm/main should best be at the commit that is currently the base for this PR. At the moment, this is 69f7aeb). Then, run make trunk trunk_cg to build the benchmark binaries for 208 and 10400 teams. Run them with ./run_bench.sh -rq -n10 red_trunk_208 red_trunk_cg_208 red_trunk_10400 red_trunk_cg_10400 to get the avg performance numbers over 10 rounds. This tests multiple reduction workloads, including reductions that run in the Generic-SPMD mode, with 208 teams and with 10400 teams, both à 512 threads, and with a reduction array size of 177,777,777. I tested on a gfx942 and found the following numbers showing the performance of this patch relative to the baseline:

red_comb_sep_arr_32    double   change for 208 teams:    +0.01%   change for 10400 teams:    +5.53%
red_sum_arr_32         double   change for 208 teams:  +570.47%   change for 10400 teams:    -2.23%
red_comb               double   change for 208 teams:  +350.30%   change for 10400 teams:    +0.72%
red_comb_sep           double   change for 208 teams:    +4.82%   change for 10400 teams:    +2.18%
red_dot                double   change for 208 teams:  +202.45%   change for 10400 teams:    +3.48%
red_indirect           double   change for 208 teams:  +239.33%   change for 10400 teams:    +4.63%
red_kernel_part        double   change for 208 teams:    +3.30%   change for 10400 teams:    +3.43%
red_max                double   change for 208 teams:  +273.46%   change for 10400 teams:    +5.12%
red_mult               double   change for 208 teams:  +239.50%   change for 10400 teams:    +5.23%
red_sum                double   change for 208 teams:  +239.47%   change for 10400 teams:    +5.15%
red_pi                 double   change for 208 teams:   +90.06%   change for 10400 teams:   +78.67%
red_comb_sep_arr_32    uint     change for 208 teams:    -0.16%   change for 10400 teams:   +26.98%
red_sum_arr_32         uint     change for 208 teams:  +139.64%   change for 10400 teams:   -14.55%
red_dot                uint     change for 208 teams:  +202.92%   change for 10400 teams:    +5.11%
red_max                uint     change for 208 teams:  +221.41%   change for 10400 teams:    +6.54%
red_sum                uint     change for 208 teams:  +220.83%   change for 10400 teams:    +7.80%
red_comb_sep_arr_32    ulong    change for 208 teams:    -0.19%   change for 10400 teams:    +5.80%
red_sum_arr_32         ulong    change for 208 teams:  +523.98%   change for 10400 teams:    -3.17%
red_dot                ulong    change for 208 teams:  +232.14%   change for 10400 teams:    +3.57%
red_max                ulong    change for 208 teams:  +279.87%   change for 10400 teams:    +6.17%
red_sum                ulong    change for 208 teams:  +261.54%   change for 10400 teams:    +5.72%
red_comb_sep_arr_32    Value    change for 208 teams:    +0.22%   change for 10400 teams:    +0.04%
red_sum_arr_32         Value    change for 208 teams:  +423.38%   change for 10400 teams:    +9.08%
red_dot                Value    change for 208 teams:  +153.87%   change for 10400 teams:    -2.62%
red_max                Value    change for 208 teams: +1097.62%   change for 10400 teams:  +261.16%
red_sum                Value    change for 208 teams:  +358.88%   change for 10400 teams:   +21.44%

Claude assisted with this patch.

This is a part of a series of patches that rework OpenMP cross-team
reductions.

This patches wires the existing
`kmp_sched_distr_static_chunk_sched_static_chunkone` to be used by
CodeGen.

Example of the intended change of this patch:
```
target teams distribute parallel for reduction(+:s)
  for (i = 0; i < N; i++) s += a[i];
```

Before:
```
__kmpc_distribute_static_init(91)
for (team_lb = team*nthreads; team_lb < N; team_lb += nteams*nthreads) {
  __kmpc_for_static_init(33)
  for (iv = team_lb + tid; iv < team_lb + nthreads; iv += nthreads) {
    priv += a[iv];
  }
  __kmpc_nvptx_parallel_reduce_nowait_v2
}
__kmpc_nvptx_teams_reduce_nowait_v2
```

After:
```
__kmpc_for_static_init(93)
for (iv = team*nthreads + tid;
     iv < N;
     iv += nteams*nthreads) {
    priv += a[iv];
}
__kmpc_nvptx_parallel_reduce_nowait_v2
__kmpc_nvptx_teams_reduce_nowait_v2
```

Performance:
All performance tests can be reproduced with
https://github.com/ro-i/xteam-test @ commit
6025e5afc14dd6e65ee2658e5001c16e9b9245ff. To reproduce, simply create a
`local.mk` file in the cloned directory with a suitable `OFFLOAD_ARCH`
for your machine and `CXX_trunk` + `CXX_trunk_cg` set to the paths of
the clang++ binaries for llvm/main and this patch. (llvm/main should
best be at the commit that is currently the base for this PR. At the
moment, this is 69f7aeb). Then, run
`make trunk trunk_cg` to build the benchmark binaries for 208 and 10400
teams. Run them with `./run_bench.sh -rq -n10 red_trunk_208
red_trunk_cg_208 red_trunk_10400 red_trunk_cg_10400` to get the avg
performance numbers over 10 rounds. This tests multiple reduction
workloads, including reductions that run in the Generic-SPMD mode, with
208 teams and with 10400 teams, both à 512 threads, and with a reduction
array size of 177,777,777. I tested on a gfx942 and found the following
numbers showing the performance of this patch relative to the baseline:

```
red_comb_sep_arr_32    double   change for 208 teams:    +0.01%   change for 10400 teams:    +5.53%
red_sum_arr_32         double   change for 208 teams:  +570.47%   change for 10400 teams:    -2.23%
red_comb               double   change for 208 teams:  +350.30%   change for 10400 teams:    +0.72%
red_comb_sep           double   change for 208 teams:    +4.82%   change for 10400 teams:    +2.18%
red_dot                double   change for 208 teams:  +202.45%   change for 10400 teams:    +3.48%
red_indirect           double   change for 208 teams:  +239.33%   change for 10400 teams:    +4.63%
red_kernel_part        double   change for 208 teams:    +3.30%   change for 10400 teams:    +3.43%
red_max                double   change for 208 teams:  +273.46%   change for 10400 teams:    +5.12%
red_mult               double   change for 208 teams:  +239.50%   change for 10400 teams:    +5.23%
red_sum                double   change for 208 teams:  +239.47%   change for 10400 teams:    +5.15%
red_pi                 double   change for 208 teams:   +90.06%   change for 10400 teams:   +78.67%
red_comb_sep_arr_32    uint     change for 208 teams:    -0.16%   change for 10400 teams:   +26.98%
red_sum_arr_32         uint     change for 208 teams:  +139.64%   change for 10400 teams:   -14.55%
red_dot                uint     change for 208 teams:  +202.92%   change for 10400 teams:    +5.11%
red_max                uint     change for 208 teams:  +221.41%   change for 10400 teams:    +6.54%
red_sum                uint     change for 208 teams:  +220.83%   change for 10400 teams:    +7.80%
red_comb_sep_arr_32    ulong    change for 208 teams:    -0.19%   change for 10400 teams:    +5.80%
red_sum_arr_32         ulong    change for 208 teams:  +523.98%   change for 10400 teams:    -3.17%
red_dot                ulong    change for 208 teams:  +232.14%   change for 10400 teams:    +3.57%
red_max                ulong    change for 208 teams:  +279.87%   change for 10400 teams:    +6.17%
red_sum                ulong    change for 208 teams:  +261.54%   change for 10400 teams:    +5.72%
red_comb_sep_arr_32    Value    change for 208 teams:    +0.22%   change for 10400 teams:    +0.04%
red_sum_arr_32         Value    change for 208 teams:  +423.38%   change for 10400 teams:    +9.08%
red_dot                Value    change for 208 teams:  +153.87%   change for 10400 teams:    -2.62%
red_max                Value    change for 208 teams: +1097.62%   change for 10400 teams:  +261.16%
red_sum                Value    change for 208 teams:  +358.88%   change for 10400 teams:   +21.44%
```
@llvmorg-github-actions llvmorg-github-actions Bot added backend:AMDGPU clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:codegen IR generation bugs: mangling, exceptions, etc. clang:openmp OpenMP related changes to Clang offload labels Jun 4, 2026
@llvmorg-github-actions
Copy link
Copy Markdown

llvmorg-github-actions Bot commented Jun 4, 2026

@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-offload

@llvm/pr-subscribers-backend-amdgpu

Author: Robert Imschweiler (ro-i)

Changes

This is a part of a series of patches that rework OpenMP cross-team reductions.

This patches wires the existing
kmp_sched_distr_static_chunk_sched_static_chunkone to be used by CodeGen.

Example of the intended change of this patch:

target teams distribute parallel for reduction(+:s)
  for (i = 0; i &lt; N; i++) s += a[i];

Before:

__kmpc_distribute_static_init(91)
for (team_lb = team*nthreads; team_lb &lt; N; team_lb += nteams*nthreads) {
  __kmpc_for_static_init(33)
  for (iv = team_lb + tid; iv &lt; team_lb + nthreads; iv += nthreads) {
    priv += a[iv];
  }
  __kmpc_nvptx_parallel_reduce_nowait_v2
}
__kmpc_nvptx_teams_reduce_nowait_v2

After:

__kmpc_for_static_init(93)
for (iv = team*nthreads + tid;
     iv &lt; N;
     iv += nteams*nthreads) {
    priv += a[iv];
}
__kmpc_nvptx_parallel_reduce_nowait_v2
__kmpc_nvptx_teams_reduce_nowait_v2

Performance:
All performance tests can be reproduced with
https://github.com/ro-i/xteam-test @ commit
6025e5afc14dd6e65ee2658e5001c16e9b9245ff. To reproduce, simply create a local.mk file in the cloned directory with a suitable OFFLOAD_ARCH for your machine and CXX_trunk + CXX_trunk_cg set to the paths of the clang++ binaries for llvm/main and this patch. (llvm/main should best be at the commit that is currently the base for this PR. At the moment, this is 69f7aeb). Then, run make trunk trunk_cg to build the benchmark binaries for 208 and 10400 teams. Run them with ./run_bench.sh -rq -n10 red_trunk_208 red_trunk_cg_208 red_trunk_10400 red_trunk_cg_10400 to get the avg performance numbers over 10 rounds. This tests multiple reduction workloads, including reductions that run in the Generic-SPMD mode, with 208 teams and with 10400 teams, both à 512 threads, and with a reduction array size of 177,777,777. I tested on a gfx942 and found the following numbers showing the performance of this patch relative to the baseline:

red_comb_sep_arr_32    double   change for 208 teams:    +0.01%   change for 10400 teams:    +5.53%
red_sum_arr_32         double   change for 208 teams:  +570.47%   change for 10400 teams:    -2.23%
red_comb               double   change for 208 teams:  +350.30%   change for 10400 teams:    +0.72%
red_comb_sep           double   change for 208 teams:    +4.82%   change for 10400 teams:    +2.18%
red_dot                double   change for 208 teams:  +202.45%   change for 10400 teams:    +3.48%
red_indirect           double   change for 208 teams:  +239.33%   change for 10400 teams:    +4.63%
red_kernel_part        double   change for 208 teams:    +3.30%   change for 10400 teams:    +3.43%
red_max                double   change for 208 teams:  +273.46%   change for 10400 teams:    +5.12%
red_mult               double   change for 208 teams:  +239.50%   change for 10400 teams:    +5.23%
red_sum                double   change for 208 teams:  +239.47%   change for 10400 teams:    +5.15%
red_pi                 double   change for 208 teams:   +90.06%   change for 10400 teams:   +78.67%
red_comb_sep_arr_32    uint     change for 208 teams:    -0.16%   change for 10400 teams:   +26.98%
red_sum_arr_32         uint     change for 208 teams:  +139.64%   change for 10400 teams:   -14.55%
red_dot                uint     change for 208 teams:  +202.92%   change for 10400 teams:    +5.11%
red_max                uint     change for 208 teams:  +221.41%   change for 10400 teams:    +6.54%
red_sum                uint     change for 208 teams:  +220.83%   change for 10400 teams:    +7.80%
red_comb_sep_arr_32    ulong    change for 208 teams:    -0.19%   change for 10400 teams:    +5.80%
red_sum_arr_32         ulong    change for 208 teams:  +523.98%   change for 10400 teams:    -3.17%
red_dot                ulong    change for 208 teams:  +232.14%   change for 10400 teams:    +3.57%
red_max                ulong    change for 208 teams:  +279.87%   change for 10400 teams:    +6.17%
red_sum                ulong    change for 208 teams:  +261.54%   change for 10400 teams:    +5.72%
red_comb_sep_arr_32    Value    change for 208 teams:    +0.22%   change for 10400 teams:    +0.04%
red_sum_arr_32         Value    change for 208 teams:  +423.38%   change for 10400 teams:    +9.08%
red_dot                Value    change for 208 teams:  +153.87%   change for 10400 teams:    -2.62%
red_max                Value    change for 208 teams: +1097.62%   change for 10400 teams:  +261.16%
red_sum                Value    change for 208 teams:  +358.88%   change for 10400 teams:   +21.44%

Patch is 1.39 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/201670.diff

16 Files Affected:

  • (modified) clang/include/clang/Basic/OpenMPKinds.h (+3)
  • (modified) clang/lib/CodeGen/CGOpenMPRuntime.cpp (+17-5)
  • (modified) clang/lib/CodeGen/CGStmtOpenMP.cpp (+119-92)
  • (modified) clang/test/OpenMP/amdgcn_target_device_vla.cpp (+33-93)
  • (modified) clang/test/OpenMP/amdgpu_target_with_aligned_attribute.c (+23-83)
  • (modified) clang/test/OpenMP/metadirective_device_arch_codegen.cpp (+4-7)
  • (modified) clang/test/OpenMP/nvptx_SPMD_codegen.cpp (+2367-3627)
  • (modified) clang/test/OpenMP/nvptx_distribute_parallel_generic_mode_codegen.cpp (+59-179)
  • (modified) clang/test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp (+301-1141)
  • (modified) clang/test/OpenMP/nvptx_target_teams_distribute_parallel_for_simd_codegen.cpp (+223-543)
  • (modified) clang/test/OpenMP/nvptx_target_teams_generic_loop_codegen.cpp (+316-1156)
  • (modified) clang/test/OpenMP/nvptx_target_teams_generic_loop_generic_mode_codegen.cpp (+16-86)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_codegen.cpp (+48-102)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_codegen_as_distribute.cpp (+21-56)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_codegen_as_parallel_for.cpp (+120-360)
  • (modified) offload/test/offloading/gpupgo/pgo_atomic_teams.c (+4-2)
diff --git a/clang/include/clang/Basic/OpenMPKinds.h b/clang/include/clang/Basic/OpenMPKinds.h
index 4e83bfcd0128b..516219a408edb 100644
--- a/clang/include/clang/Basic/OpenMPKinds.h
+++ b/clang/include/clang/Basic/OpenMPKinds.h
@@ -188,6 +188,9 @@ struct OpenMPScheduleTy final {
   OpenMPScheduleClauseKind Schedule = OMPC_SCHEDULE_unknown;
   OpenMPScheduleClauseModifier M1 = OMPC_SCHEDULE_MODIFIER_unknown;
   OpenMPScheduleClauseModifier M2 = OMPC_SCHEDULE_MODIFIER_unknown;
+  /// Request the fused distr_static_chunk + static_chunkone runtime schedule
+  /// in `for_static_init`. The outer `distribute_static_init` is skipped.
+  bool IsDistChunkedAndChunkOne = false;
 };
 
 /// OpenMP modifiers for 'reduction' clause.
diff --git a/clang/lib/CodeGen/CGOpenMPRuntime.cpp b/clang/lib/CodeGen/CGOpenMPRuntime.cpp
index f3158f48e7944..4462d5b63d677 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntime.cpp
+++ b/clang/lib/CodeGen/CGOpenMPRuntime.cpp
@@ -546,6 +546,12 @@ enum OpenMPSchedType {
   /// dist_schedule types
   OMP_dist_sch_static_chunked = 91,
   OMP_dist_sch_static = 92,
+  /// Fused distribute+for static schedule (entityId = team*nthreads + tid,
+  /// num_entities = nteams*nthreads). One for_static_init call, no
+  /// surrounding distribute_static_init. Matches
+  /// kmp_sched_distr_static_chunk_sched_static_chunkone in the device RTL
+  /// (openmp/device/include/DeviceTypes.h).
+  OMP_dist_sch_static_chunked_sch_static_chunkone = 93,
   /// Support for OpenMP 4.5 monotonic and nonmonotonic schedule modifiers.
   /// Set if the monotonic schedule modifier was present.
   OMP_sch_modifier_monotonic = (1 << 29),
@@ -2630,7 +2636,8 @@ static int addMonoNonMonoModifier(CodeGenModule &CGM, OpenMPSchedType Schedule,
           Schedule == OMP_sch_static_balanced_chunked ||
           Schedule == OMP_ord_static_chunked || Schedule == OMP_ord_static ||
           Schedule == OMP_dist_sch_static_chunked ||
-          Schedule == OMP_dist_sch_static))
+          Schedule == OMP_dist_sch_static ||
+          Schedule == OMP_dist_sch_static_chunked_sch_static_chunkone))
       Modifier = OMP_sch_modifier_nonmonotonic;
   }
   return Schedule | Modifier;
@@ -2692,7 +2699,8 @@ static void emitForStaticInitCall(
          Schedule == OMP_sch_static_balanced_chunked ||
          Schedule == OMP_ord_static || Schedule == OMP_ord_static_chunked ||
          Schedule == OMP_dist_sch_static ||
-         Schedule == OMP_dist_sch_static_chunked);
+         Schedule == OMP_dist_sch_static_chunked ||
+         Schedule == OMP_dist_sch_static_chunked_sch_static_chunkone);
 
   // Call __kmpc_for_static_init(
   //          ident_t *loc, kmp_int32 tid, kmp_int32 schedtype,
@@ -2710,7 +2718,8 @@ static void emitForStaticInitCall(
     assert((Schedule == OMP_sch_static_chunked ||
             Schedule == OMP_sch_static_balanced_chunked ||
             Schedule == OMP_ord_static_chunked ||
-            Schedule == OMP_dist_sch_static_chunked) &&
+            Schedule == OMP_dist_sch_static_chunked ||
+            Schedule == OMP_dist_sch_static_chunked_sch_static_chunkone) &&
            "expected static chunked schedule");
   }
   llvm::Value *Args[] = {
@@ -2733,8 +2742,11 @@ void CGOpenMPRuntime::emitForStaticInit(CodeGenFunction &CGF,
                                         OpenMPDirectiveKind DKind,
                                         const OpenMPScheduleTy &ScheduleKind,
                                         const StaticRTInput &Values) {
-  OpenMPSchedType ScheduleNum = getRuntimeSchedule(
-      ScheduleKind.Schedule, Values.Chunk != nullptr, Values.Ordered);
+  OpenMPSchedType ScheduleNum =
+      ScheduleKind.IsDistChunkedAndChunkOne
+          ? OMP_dist_sch_static_chunked_sch_static_chunkone
+          : getRuntimeSchedule(ScheduleKind.Schedule, Values.Chunk != nullptr,
+                               Values.Ordered);
   assert((isOpenMPWorksharingDirective(DKind) || (DKind == OMPD_loop)) &&
          "Expected loop-based or sections-based directive.");
   llvm::Value *UpdatedLocation = emitUpdateLocation(CGF, Loc,
diff --git a/clang/lib/CodeGen/CGStmtOpenMP.cpp b/clang/lib/CodeGen/CGStmtOpenMP.cpp
index 1eaf8efa142c5..376e9bd1cee4e 100644
--- a/clang/lib/CodeGen/CGStmtOpenMP.cpp
+++ b/clang/lib/CodeGen/CGStmtOpenMP.cpp
@@ -50,6 +50,16 @@ static const VarDecl *getBaseDecl(const Expr *Ref);
 static OpenMPDirectiveKind
 getEffectiveDirectiveKind(const OMPExecutableDirective &S);
 
+static bool canEmitGPUFusedDistSchedule(const CodeGenModule &CGM,
+                                        const OMPLoopDirective &S,
+                                        OpenMPDirectiveKind DKind) {
+  return CGM.getLangOpts().OpenMPIsTargetDevice && CGM.getTriple().isGPU() &&
+         isOpenMPLoopBoundSharingDirective(DKind) &&
+         !S.getSingleClause<OMPDistScheduleClause>() &&
+         !S.getSingleClause<OMPScheduleClause>() &&
+         !S.getSingleClause<OMPOrderedClause>();
+}
+
 namespace {
 /// Lexical scope for OpenMP executable constructs, that handles correct codegen
 /// for captured expressions.
@@ -3879,6 +3889,12 @@ bool CodeGenFunction::EmitOMPWorksharingLoop(
           RT.isStaticChunked(ScheduleKind.Schedule,
                              /* Chunked */ Chunk != nullptr) &&
           HasChunkSizeOne && isOpenMPLoopBoundSharingDirective(EKind);
+      // GPU combined `distribute parallel for`: emit a single
+      // for_static_init with the fused distr_static_chunk + static_chunkone
+      // schedule (enum 93). The surrounding EmitOMPDistributeLoop must skip
+      // its distribute_static_init under the same conditions.
+      if (StaticChunkedOne && canEmitGPUFusedDistSchedule(CGM, S, EKind))
+        ScheduleKind.IsDistChunkedAndChunkOne = true;
       bool IsMonotonic =
           Ordered ||
           (ScheduleKind.Schedule == OMPC_SCHEDULE_static &&
@@ -6275,102 +6291,113 @@ void CodeGenFunction::EmitOMPDistributeLoop(const OMPLoopDirective &S,
       const unsigned IVSize = getContext().getTypeSize(IVExpr->getType());
       const bool IVSigned = IVExpr->getType()->hasSignedIntegerRepresentation();
 
-      // OpenMP [2.10.8, distribute Construct, Description]
-      // If dist_schedule is specified, kind must be static. If specified,
-      // iterations are divided into chunks of size chunk_size, chunks are
-      // assigned to the teams of the league in a round-robin fashion in the
-      // order of the team number. When no chunk_size is specified, the
-      // iteration space is divided into chunks that are approximately equal
-      // in size, and at most one chunk is distributed to each team of the
-      // league. The size of the chunks is unspecified in this case.
-      bool StaticChunked =
-          RT.isStaticChunked(ScheduleKind, /* Chunked */ Chunk != nullptr) &&
-          isOpenMPLoopBoundSharingDirective(S.getDirectiveKind());
-      if (RT.isStaticNonchunked(ScheduleKind,
-                                /* Chunked */ Chunk != nullptr) ||
-          StaticChunked) {
-        CGOpenMPRuntime::StaticRTInput StaticInit(
-            IVSize, IVSigned, /* Ordered = */ false, IL.getAddress(),
-            LB.getAddress(), UB.getAddress(), ST.getAddress(),
-            StaticChunked ? Chunk : nullptr);
-        RT.emitDistributeStaticInit(*this, S.getBeginLoc(), ScheduleKind,
-                                    StaticInit);
+      // GPU fused schedule: omit the outer distribute loop and let the inner
+      // worksharing loop schedule the flattened team/thread iteration space.
+      if (canEmitGPUFusedDistSchedule(CGM, S, S.getDirectiveKind())) {
         JumpDest LoopExit =
             getJumpDestInCurrentScope(createBasicBlock("omp.loop.exit"));
-        // UB = min(UB, GlobalUB);
-        EmitIgnoredExpr(isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
-                            ? S.getCombinedEnsureUpperBound()
-                            : S.getEnsureUpperBound());
-        // IV = LB;
-        EmitIgnoredExpr(isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
-                            ? S.getCombinedInit()
-                            : S.getInit());
-
-        const Expr *Cond =
-            isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
-                ? S.getCombinedCond()
-                : S.getCond();
-
-        if (StaticChunked)
-          Cond = S.getCombinedDistCond();
-
-        // For static unchunked schedules generate:
-        //
-        //  1. For distribute alone, codegen
-        //    while (idx <= UB) {
-        //      BODY;
-        //      ++idx;
-        //    }
-        //
-        //  2. When combined with 'for' (e.g. as in 'distribute parallel for')
-        //    while (idx <= UB) {
-        //      <CodeGen rest of pragma>(LB, UB);
-        //      idx += ST;
-        //    }
-        //
-        // For static chunk one schedule generate:
-        //
-        // while (IV <= GlobalUB) {
-        //   <CodeGen rest of pragma>(LB, UB);
-        //   LB += ST;
-        //   UB += ST;
-        //   UB = min(UB, GlobalUB);
-        //   IV = LB;
-        // }
-        //
-        emitCommonSimdLoop(
-            *this, S,
-            [&S](CodeGenFunction &CGF, PrePostActionTy &) {
-              if (isOpenMPSimdDirective(S.getDirectiveKind()))
-                CGF.EmitOMPSimdInit(S);
-            },
-            [&S, &LoopScope, Cond, IncExpr, LoopExit, &CodeGenLoop,
-             StaticChunked](CodeGenFunction &CGF, PrePostActionTy &) {
-              CGF.EmitOMPInnerLoop(
-                  S, LoopScope.requiresCleanups(), Cond, IncExpr,
-                  [&S, LoopExit, &CodeGenLoop](CodeGenFunction &CGF) {
-                    CodeGenLoop(CGF, S, LoopExit);
-                  },
-                  [&S, StaticChunked](CodeGenFunction &CGF) {
-                    if (StaticChunked) {
-                      CGF.EmitIgnoredExpr(S.getCombinedNextLowerBound());
-                      CGF.EmitIgnoredExpr(S.getCombinedNextUpperBound());
-                      CGF.EmitIgnoredExpr(S.getCombinedEnsureUpperBound());
-                      CGF.EmitIgnoredExpr(S.getCombinedInit());
-                    }
-                  });
-            });
+        CodeGenLoop(*this, S, LoopExit);
         EmitBlock(LoopExit.getBlock());
-        // Tell the runtime we are done.
-        RT.emitForStaticFinish(*this, S.getEndLoc(), OMPD_distribute);
       } else {
-        // Emit the outer loop, which requests its work chunk [LB..UB] from
-        // runtime and runs the inner loop to process it.
-        const OMPLoopArguments LoopArguments = {
-            LB.getAddress(), UB.getAddress(), ST.getAddress(), IL.getAddress(),
-            Chunk};
-        EmitOMPDistributeOuterLoop(ScheduleKind, S, LoopScope, LoopArguments,
-                                   CodeGenLoop);
+        // OpenMP [2.10.8, distribute Construct, Description]
+        // If dist_schedule is specified, kind must be static. If specified,
+        // iterations are divided into chunks of size chunk_size, chunks are
+        // assigned to the teams of the league in a round-robin fashion in the
+        // order of the team number. When no chunk_size is specified, the
+        // iteration space is divided into chunks that are approximately equal
+        // in size, and at most one chunk is distributed to each team of the
+        // league. The size of the chunks is unspecified in this case.
+        bool StaticChunked =
+            RT.isStaticChunked(ScheduleKind, /* Chunked */ Chunk != nullptr) &&
+            isOpenMPLoopBoundSharingDirective(S.getDirectiveKind());
+        if (RT.isStaticNonchunked(ScheduleKind,
+                                  /* Chunked */ Chunk != nullptr) ||
+            StaticChunked) {
+          CGOpenMPRuntime::StaticRTInput StaticInit(
+              IVSize, IVSigned, /* Ordered = */ false, IL.getAddress(),
+              LB.getAddress(), UB.getAddress(), ST.getAddress(),
+              StaticChunked ? Chunk : nullptr);
+          RT.emitDistributeStaticInit(*this, S.getBeginLoc(), ScheduleKind,
+                                      StaticInit);
+          JumpDest LoopExit =
+              getJumpDestInCurrentScope(createBasicBlock("omp.loop.exit"));
+          // UB = min(UB, GlobalUB);
+          EmitIgnoredExpr(
+              isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
+                  ? S.getCombinedEnsureUpperBound()
+                  : S.getEnsureUpperBound());
+          // IV = LB;
+          EmitIgnoredExpr(
+              isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
+                  ? S.getCombinedInit()
+                  : S.getInit());
+
+          const Expr *Cond =
+              isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
+                  ? S.getCombinedCond()
+                  : S.getCond();
+
+          if (StaticChunked)
+            Cond = S.getCombinedDistCond();
+
+          // For static unchunked schedules generate:
+          //
+          //  1. For distribute alone, codegen
+          //    while (idx <= UB) {
+          //      BODY;
+          //      ++idx;
+          //    }
+          //
+          //  2. When combined with 'for' (e.g. as in 'distribute parallel for')
+          //    while (idx <= UB) {
+          //      <CodeGen rest of pragma>(LB, UB);
+          //      idx += ST;
+          //    }
+          //
+          // For static chunk one schedule generate:
+          //
+          // while (IV <= GlobalUB) {
+          //   <CodeGen rest of pragma>(LB, UB);
+          //   LB += ST;
+          //   UB += ST;
+          //   UB = min(UB, GlobalUB);
+          //   IV = LB;
+          // }
+          //
+          emitCommonSimdLoop(
+              *this, S,
+              [&S](CodeGenFunction &CGF, PrePostActionTy &) {
+                if (isOpenMPSimdDirective(S.getDirectiveKind()))
+                  CGF.EmitOMPSimdInit(S);
+              },
+              [&S, &LoopScope, Cond, IncExpr, LoopExit, &CodeGenLoop,
+               StaticChunked](CodeGenFunction &CGF, PrePostActionTy &) {
+                CGF.EmitOMPInnerLoop(
+                    S, LoopScope.requiresCleanups(), Cond, IncExpr,
+                    [&S, LoopExit, &CodeGenLoop](CodeGenFunction &CGF) {
+                      CodeGenLoop(CGF, S, LoopExit);
+                    },
+                    [&S, StaticChunked](CodeGenFunction &CGF) {
+                      if (StaticChunked) {
+                        CGF.EmitIgnoredExpr(S.getCombinedNextLowerBound());
+                        CGF.EmitIgnoredExpr(S.getCombinedNextUpperBound());
+                        CGF.EmitIgnoredExpr(S.getCombinedEnsureUpperBound());
+                        CGF.EmitIgnoredExpr(S.getCombinedInit());
+                      }
+                    });
+              });
+          EmitBlock(LoopExit.getBlock());
+          // Tell the runtime we are done.
+          RT.emitForStaticFinish(*this, S.getEndLoc(), OMPD_distribute);
+        } else {
+          // Emit the outer loop, which requests its work chunk [LB..UB] from
+          // runtime and runs the inner loop to process it.
+          const OMPLoopArguments LoopArguments = {
+              LB.getAddress(), UB.getAddress(), ST.getAddress(),
+              IL.getAddress(), Chunk};
+          EmitOMPDistributeOuterLoop(ScheduleKind, S, LoopScope, LoopArguments,
+                                     CodeGenLoop);
+        }
       }
       if (isOpenMPSimdDirective(S.getDirectiveKind())) {
         EmitOMPSimdFinal(S, [IL, &S](CodeGenFunction &CGF) {
diff --git a/clang/test/OpenMP/amdgcn_target_device_vla.cpp b/clang/test/OpenMP/amdgcn_target_device_vla.cpp
index 5064c114c0863..323725f215f79 100644
--- a/clang/test/OpenMP/amdgcn_target_device_vla.cpp
+++ b/clang/test/OpenMP/amdgcn_target_device_vla.cpp
@@ -276,92 +276,32 @@ int main() {
 // CHECK-NEXT:    store i32 1, ptr [[DOTOMP_STRIDE_ASCAST]], align 4
 // CHECK-NEXT:    store i32 0, ptr [[DOTOMP_IS_LAST_ASCAST]], align 4
 // CHECK-NEXT:    [[NVPTX_NUM_THREADS:%.*]] = call i32 @__kmpc_get_hardware_num_threads_in_block()
-// CHECK-NEXT:    [[TMP6:%.*]] = load ptr, ptr [[DOTGLOBAL_TID__ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP7:%.*]] = load i32, ptr [[TMP6]], align 4
-// CHECK-NEXT:    call void @__kmpc_distribute_static_init_4(ptr addrspacecast (ptr addrspace(1) @[[GLOB2:[0-9]+]] to ptr), i32 [[TMP7]], i32 91, ptr [[DOTOMP_IS_LAST_ASCAST]], ptr [[DOTOMP_COMB_LB_ASCAST]], ptr [[DOTOMP_COMB_UB_ASCAST]], ptr [[DOTOMP_STRIDE_ASCAST]], i32 1, i32 [[NVPTX_NUM_THREADS]])
+// CHECK-NEXT:    [[TMP6:%.*]] = load i32, ptr [[DOTOMP_COMB_LB_ASCAST]], align 4
+// CHECK-NEXT:    [[TMP7:%.*]] = zext i32 [[TMP6]] to i64
 // CHECK-NEXT:    [[TMP8:%.*]] = load i32, ptr [[DOTOMP_COMB_UB_ASCAST]], align 4
-// CHECK-NEXT:    [[TMP9:%.*]] = load i32, ptr [[DOTCAPTURE_EXPR_1_ASCAST]], align 4
-// CHECK-NEXT:    [[CMP4:%.*]] = icmp sgt i32 [[TMP8]], [[TMP9]]
-// CHECK-NEXT:    br i1 [[CMP4]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
-// CHECK:       cond.true:
-// CHECK-NEXT:    [[TMP10:%.*]] = load i32, ptr [[DOTCAPTURE_EXPR_1_ASCAST]], align 4
-// CHECK-NEXT:    br label [[COND_END:%.*]]
-// CHECK:       cond.false:
-// CHECK-NEXT:    [[TMP11:%.*]] = load i32, ptr [[DOTOMP_COMB_UB_ASCAST]], align 4
-// CHECK-NEXT:    br label [[COND_END]]
-// CHECK:       cond.end:
-// CHECK-NEXT:    [[COND:%.*]] = phi i32 [ [[TMP10]], [[COND_TRUE]] ], [ [[TMP11]], [[COND_FALSE]] ]
-// CHECK-NEXT:    store i32 [[COND]], ptr [[DOTOMP_COMB_UB_ASCAST]], align 4
-// CHECK-NEXT:    [[TMP12:%.*]] = load i32, ptr [[DOTOMP_COMB_LB_ASCAST]], align 4
-// CHECK-NEXT:    store i32 [[TMP12]], ptr [[DOTOMP_IV_ASCAST]], align 4
-// CHECK-NEXT:    br label [[OMP_INNER_FOR_COND:%.*]]
-// CHECK:       omp.inner.for.cond:
-// CHECK-NEXT:    [[TMP13:%.*]] = load i32, ptr [[DOTOMP_IV_ASCAST]], align 4
-// CHECK-NEXT:    [[TMP14:%.*]] = load i32, ptr [[DOTCAPTURE_EXPR_1_ASCAST]], align 4
-// CHECK-NEXT:    [[ADD:%.*]] = add nsw i32 [[TMP14]], 1
-// CHECK-NEXT:    [[CMP5:%.*]] = icmp slt i32 [[TMP13]], [[ADD]]
-// CHECK-NEXT:    br i1 [[CMP5]], label [[OMP_INNER_FOR_BODY:%.*]], label [[OMP_INNER_FOR_END:%.*]]
-// CHECK:       omp.inner.for.body:
-// CHECK-NEXT:    [[TMP15:%.*]] = load i32, ptr [[DOTOMP_COMB_LB_ASCAST]], align 4
-// CHECK-NEXT:    [[TMP16:%.*]] = zext i32 [[TMP15]] to i64
-// CHECK-NEXT:    [[TMP17:%.*]] = load i32, ptr [[DOTOMP_COMB_UB_ASCAST]], align 4
-// CHECK-NEXT:    [[TMP18:%.*]] = zext i32 [[TMP17]] to i64
-// CHECK-NEXT:    [[TMP19:%.*]] = load i32, ptr [[M_ADDR_ASCAST]], align 4
-// CHECK-NEXT:    store i32 [[TMP19]], ptr addrspace(5) [[M_CASTED]], align 4
-// CHECK-NEXT:    [[TMP20:%.*]] = load i64, ptr addrspace(5) [[M_CASTED]], align 8
-// CHECK-NEXT:    [[TMP21:%.*]] = getelementptr inbounds [5 x ptr], ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 0, i64 0
-// CHECK-NEXT:    [[TMP22:%.*]] = inttoptr i64 [[TMP16]] to ptr
-// CHECK-NEXT:    store ptr [[TMP22]], ptr [[TMP21]], align 8
-// CHECK-NEXT:    [[TMP23:%.*]] = getelementptr inbounds [5 x ptr], ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 0, i64 1
-// CHECK-NEXT:    [[TMP24:%.*]] = inttoptr i64 [[TMP18]] to ptr
-// CHECK-NEXT:    store ptr [[TMP24]], ptr [[TMP23]], align 8
-// CHECK-NEXT:    [[TMP25:%.*]] = getelementptr inbounds [5 x ptr], ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 0, i64 2
-// CHECK-NEXT:    [[TMP26:%.*]] = inttoptr i64 [[TMP20]] to ptr
-// CHECK-NEXT:    store ptr [[TMP26]], ptr [[TMP25]], align 8
-// CHECK-NEXT:    [[TMP27:%.*]] = getelementptr inbounds [5 x ptr], ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 0, i64 3
-// CHECK-NEXT:    [[TMP28:%.*]] = inttoptr i64 [[TMP0]] to ptr
-// CHECK-NEXT:    store ptr [[TMP28]], ptr [[TMP27]], align 8
-// CHECK-NEXT:    [[TMP29:%.*]] = getelementptr inbounds [5 x ptr], ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 0, i64 4
-// CHECK-NEXT:    store ptr [[TMP1]], ptr [[TMP29]], align 8
-// CHECK-NEXT:    [[TMP30:%.*]] = load ptr, ptr [[DOTGLOBAL_TID__ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP31:%.*]] = load i32, ptr [[TMP30]], align 4
-// CHECK-NEXT:    call void @__kmpc_parallel_60(ptr addrspacecast (ptr addrspace(1) @[[GLOB1]] to ptr), i32 [[TMP31]], i32 1, i32 -1, i32 -1, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo2v_l30_omp_outlined_omp_outlined, ptr null, ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 5, i32 0)
-// CHECK-NEXT:    br label [[OMP_INNER_FOR_INC:%.*]]
-// CHECK:       omp.inn...
[truncated]

@jdoerfert
Copy link
Copy Markdown
Member

red_max Value change for 208 teams: +1097.62% change for 10400 teams: +261.16%

higher is better?

@ro-i
Copy link
Copy Markdown
Contributor Author

ro-i commented Jun 6, 2026

Yes, MB/s percentage change

@ro-i
Copy link
Copy Markdown
Contributor Author

ro-i commented Jun 6, 2026

btw, sometimes it's a little hard to define what the MB in MB/s should be for a given reduction. But since the formula is consistent, it doesn't matter for the relative comparison if different people would define the MB constant differently.

with m = the MB constant, x = this PR's time (faster, larger MB/s), and y = base time:

((m/x - m/y) / (m/y)) * 100
  = (m/x)/(m/y) * 100 - 100
  = (y/x) * 100 - 100
  = ((y - x) / x) * 100

@carlobertolli
Copy link
Copy Markdown
Member

I don't see a condition over the loop having reductions: does it apply to any ttdpf, even without reductions?
If so, do you have performance numbers for those cases? The ones you show have the label "red" in front of them, but I see at least one test in the suite that does not: "evict_device_cache".
FYI: I am expecting this change to apply to non reduction cases.

static OpenMPDirectiveKind
getEffectiveDirectiveKind(const OMPExecutableDirective &S);

static bool canEmitGPUFusedDistSchedule(const CodeGenModule &CGM,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add description of what FusedDist is here.

OpenMPScheduleClauseModifier M2 = OMPC_SCHEDULE_MODIFIER_unknown;
/// Request the fused distr_static_chunk + static_chunkone runtime schedule
/// in `for_static_init`. The outer `distribute_static_init` is skipped.
bool IsDistChunkedAndChunkOne = false;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there more variants? If so, better use an enum.

@ro-i
Copy link
Copy Markdown
Contributor Author

ro-i commented Jun 6, 2026

I don't see a condition over the loop having reductions: does it apply to any ttdpf, even without reductions?

Yes

If so, do you have performance numbers for those cases?

No, not yet. Do you have interesting snippets you'd like to see results for?

The ones you show have the label "red" in front of them, but I see at least one test in the suite that does not: "evict_device_cache".

That's something else. The current code in my benchmark/test repo contains only cross-team reduction related stuff (and scan, but that's another topic and currently not supported by llvm or rocm)

@ro-i
Copy link
Copy Markdown
Contributor Author

ro-i commented Jun 8, 2026

No, not yet. Do you have interesting snippets you'd like to see results for?

Hm, I let AI come up with some tests and created some experimental benchmarks for them: https://github.com/ro-i/xteam-test/blob/main/src/xteam_misc.cpp

For these, the changed loop structure seems to be not beneficial at all. Let me find out why

@ro-i
Copy link
Copy Markdown
Contributor Author

ro-i commented Jun 8, 2026

One would think that the changed loop structure is basically a no-op for non-reduction loops. But there seem to be some downstream optimizations (maybe unrolling the inner loop).
The one-line change easy way out is to just gate my loop change so that it takes only effect for reduction loops. But give me a moment to test sth I'm thinking about

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend:AMDGPU clang:codegen IR generation bugs: mangling, exceptions, etc. clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:openmp OpenMP related changes to Clang offload

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants