[clang][OpenMP] Improve loop structure for distributed loops #201670
ro-i wants to merge 1 commit
This is part of a series of patches that rework OpenMP cross-team
reductions.
This patch wires up the existing
`kmp_sched_distr_static_chunk_sched_static_chunkone` schedule so that it is
used by CodeGen.
Example of the intended change of this patch:
```
#pragma omp target teams distribute parallel for reduction(+:s)
for (i = 0; i < N; i++) s += a[i];
```
Before:
```
__kmpc_distribute_static_init(91)
for (team_lb = team*nthreads; team_lb < N; team_lb += nteams*nthreads) {
  __kmpc_for_static_init(33)
  for (iv = team_lb + tid; iv < team_lb + nthreads; iv += nthreads) {
    priv += a[iv];
  }
  __kmpc_nvptx_parallel_reduce_nowait_v2
}
__kmpc_nvptx_teams_reduce_nowait_v2
```
After:
```
__kmpc_for_static_init(93)
for (iv = team*nthreads + tid;
     iv < N;
     iv += nteams*nthreads) {
  priv += a[iv];
}
__kmpc_nvptx_parallel_reduce_nowait_v2
__kmpc_nvptx_teams_reduce_nowait_v2
```
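The two loop shapes above can be compared on the host with a minimal sketch. It assumes, as in the pseudocode, that the per-team chunk in the "before" form is `nthreads` iterations wide, and it replaces the runtime calls and the reductions with plain sequential loops. The names `sum_before`, `sum_after`, and `visit_counts_after` are illustrative only and do not appear in the patch.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Host-side model of the "before" structure: an outer loop over the chunks
// owned by a team, and an inner per-thread loop inside each chunk.
std::int64_t sum_before(const std::vector<int> &a, int nteams, int nthreads) {
  const std::int64_t n = static_cast<std::int64_t>(a.size());
  const std::int64_t stride = static_cast<std::int64_t>(nteams) * nthreads;
  std::int64_t total = 0;
  for (int team = 0; team < nteams; ++team)
    for (std::int64_t team_lb = static_cast<std::int64_t>(team) * nthreads;
         team_lb < n; team_lb += stride)
      for (int tid = 0; tid < nthreads; ++tid) {
        // With a chunk of nthreads iterations, each thread runs at most one
        // iteration per chunk; the bound check models UB = min(UB, GlobalUB).
        const std::int64_t iv = team_lb + tid;
        if (iv < n)
          total += a[iv];
      }
  return total;
}

// Host-side model of the "after" structure: one flat loop per (team, thread)
// pair that strides over the whole iteration space.
std::int64_t sum_after(const std::vector<int> &a, int nteams, int nthreads) {
  const std::int64_t n = static_cast<std::int64_t>(a.size());
  const std::int64_t stride = static_cast<std::int64_t>(nteams) * nthreads;
  std::int64_t total = 0;
  for (int team = 0; team < nteams; ++team)
    for (int tid = 0; tid < nthreads; ++tid)
      for (std::int64_t iv = static_cast<std::int64_t>(team) * nthreads + tid;
           iv < n; iv += stride)
        total += a[iv];
  return total;
}

// Counts how often the flat loop touches each index; every entry should be 1.
std::vector<int> visit_counts_after(std::int64_t n, int nteams, int nthreads) {
  const std::int64_t stride = static_cast<std::int64_t>(nteams) * nthreads;
  std::vector<int> counts(static_cast<std::size_t>(n), 0);
  for (int team = 0; team < nteams; ++team)
    for (int tid = 0; tid < nthreads; ++tid)
      for (std::int64_t iv = static_cast<std::int64_t>(team) * nthreads + tid;
           iv < n; iv += stride)
        ++counts[static_cast<std::size_t>(iv)];
  return counts;
}
```

Both functions visit the same indices, so they return the same sum for any positive `nteams` and `nthreads`. The difference in the real code is that the flat form enters the parallel reduction once per kernel rather than once per chunk.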
Performance:
All performance tests can be reproduced with
https://github.com/ro-i/xteam-test @ commit
6025e5afc14dd6e65ee2658e5001c16e9b9245ff. To reproduce, create a
`local.mk` file in the cloned directory that sets a suitable `OFFLOAD_ARCH`
for your machine and sets `CXX_trunk` and `CXX_trunk_cg` to the paths of
the clang++ binaries for llvm/main and this patch, respectively. (llvm/main
should ideally be at the commit that is currently the base for this PR; at
the moment, this is 69f7aeb.) Then run
`make trunk trunk_cg` to build the benchmark binaries for 208 and 10400
teams. Run them with
`./run_bench.sh -rq -n10 red_trunk_208 red_trunk_cg_208 red_trunk_10400 red_trunk_cg_10400`
to get the average performance numbers over 10 rounds. This tests multiple
reduction workloads, including reductions that run in Generic-SPMD mode,
with 208 teams and with 10400 teams, each with 512 threads, and with a
reduction array size of 177,777,777. I tested on a gfx942 and measured the
following performance changes for this patch relative to the baseline:
```
benchmark            type    change (208 teams)  change (10400 teams)
red_comb_sep_arr_32  double              +0.01%                +5.53%
red_sum_arr_32       double            +570.47%                -2.23%
red_comb             double            +350.30%                +0.72%
red_comb_sep         double              +4.82%                +2.18%
red_dot              double            +202.45%                +3.48%
red_indirect         double            +239.33%                +4.63%
red_kernel_part      double              +3.30%                +3.43%
red_max              double            +273.46%                +5.12%
red_mult             double            +239.50%                +5.23%
red_sum              double            +239.47%                +5.15%
red_pi               double             +90.06%               +78.67%
red_comb_sep_arr_32  uint                -0.16%               +26.98%
red_sum_arr_32       uint              +139.64%               -14.55%
red_dot              uint              +202.92%                +5.11%
red_max              uint              +221.41%                +6.54%
red_sum              uint              +220.83%                +7.80%
red_comb_sep_arr_32  ulong               -0.19%                +5.80%
red_sum_arr_32       ulong             +523.98%                -3.17%
red_dot              ulong             +232.14%                +3.57%
red_max              ulong             +279.87%                +6.17%
red_sum              ulong             +261.54%                +5.72%
red_comb_sep_arr_32  Value               +0.22%                +0.04%
red_sum_arr_32       Value             +423.38%                +9.08%
red_dot              Value             +153.87%                -2.62%
red_max              Value            +1097.62%              +261.16%
red_sum              Value             +358.88%               +21.44%
```
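As a reading aid for the table: the review thread confirms that the numbers are percentage changes in MB/s, so higher is better. The sketch below converts such a percentage into a throughput ratio. The helper names are hypothetical, and the second function assumes both runs process the same amount of data.

```cpp
#include <cmath>

// Convert a percentage change in throughput into the ratio
// new_throughput / old_throughput.
double throughput_ratio(double percent_change) {
  return 1.0 + percent_change / 100.0;
}

// Percentage change in throughput derived from two run times, assuming the
// same amount of data is processed in both runs.
double percent_change_from_times(double base_time, double patch_time) {
  return (base_time / patch_time - 1.0) * 100.0;
}
```

For example, the +570.47% entry for `red_sum_arr_32` (double, 208 teams) corresponds to roughly 6.7 times the baseline throughput, while a -2.23% entry corresponds to about 0.98 times.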
@llvm/pr-subscribers-clang-codegen @llvm/pr-subscribers-backend-amdgpu
Author: Robert Imschweiler (ro-i)
Changes: see the PR description above.
Patch is 1.39 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/201670.diff
16 Files Affected:
diff --git a/clang/include/clang/Basic/OpenMPKinds.h b/clang/include/clang/Basic/OpenMPKinds.h
index 4e83bfcd0128b..516219a408edb 100644
--- a/clang/include/clang/Basic/OpenMPKinds.h
+++ b/clang/include/clang/Basic/OpenMPKinds.h
@@ -188,6 +188,9 @@ struct OpenMPScheduleTy final {
OpenMPScheduleClauseKind Schedule = OMPC_SCHEDULE_unknown;
OpenMPScheduleClauseModifier M1 = OMPC_SCHEDULE_MODIFIER_unknown;
OpenMPScheduleClauseModifier M2 = OMPC_SCHEDULE_MODIFIER_unknown;
+ /// Request the fused distr_static_chunk + static_chunkone runtime schedule
+ /// in `for_static_init`. The outer `distribute_static_init` is skipped.
+ bool IsDistChunkedAndChunkOne = false;
};
/// OpenMP modifiers for 'reduction' clause.
diff --git a/clang/lib/CodeGen/CGOpenMPRuntime.cpp b/clang/lib/CodeGen/CGOpenMPRuntime.cpp
index f3158f48e7944..4462d5b63d677 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntime.cpp
+++ b/clang/lib/CodeGen/CGOpenMPRuntime.cpp
@@ -546,6 +546,12 @@ enum OpenMPSchedType {
/// dist_schedule types
OMP_dist_sch_static_chunked = 91,
OMP_dist_sch_static = 92,
+ /// Fused distribute+for static schedule (entityId = team*nthreads + tid,
+ /// num_entities = nteams*nthreads). One for_static_init call, no
+ /// surrounding distribute_static_init. Matches
+ /// kmp_sched_distr_static_chunk_sched_static_chunkone in the device RTL
+ /// (openmp/device/include/DeviceTypes.h).
+ OMP_dist_sch_static_chunked_sch_static_chunkone = 93,
/// Support for OpenMP 4.5 monotonic and nonmonotonic schedule modifiers.
/// Set if the monotonic schedule modifier was present.
OMP_sch_modifier_monotonic = (1 << 29),
@@ -2630,7 +2636,8 @@ static int addMonoNonMonoModifier(CodeGenModule &CGM, OpenMPSchedType Schedule,
Schedule == OMP_sch_static_balanced_chunked ||
Schedule == OMP_ord_static_chunked || Schedule == OMP_ord_static ||
Schedule == OMP_dist_sch_static_chunked ||
- Schedule == OMP_dist_sch_static))
+ Schedule == OMP_dist_sch_static ||
+ Schedule == OMP_dist_sch_static_chunked_sch_static_chunkone))
Modifier = OMP_sch_modifier_nonmonotonic;
}
return Schedule | Modifier;
@@ -2692,7 +2699,8 @@ static void emitForStaticInitCall(
Schedule == OMP_sch_static_balanced_chunked ||
Schedule == OMP_ord_static || Schedule == OMP_ord_static_chunked ||
Schedule == OMP_dist_sch_static ||
- Schedule == OMP_dist_sch_static_chunked);
+ Schedule == OMP_dist_sch_static_chunked ||
+ Schedule == OMP_dist_sch_static_chunked_sch_static_chunkone);
// Call __kmpc_for_static_init(
// ident_t *loc, kmp_int32 tid, kmp_int32 schedtype,
@@ -2710,7 +2718,8 @@ static void emitForStaticInitCall(
assert((Schedule == OMP_sch_static_chunked ||
Schedule == OMP_sch_static_balanced_chunked ||
Schedule == OMP_ord_static_chunked ||
- Schedule == OMP_dist_sch_static_chunked) &&
+ Schedule == OMP_dist_sch_static_chunked ||
+ Schedule == OMP_dist_sch_static_chunked_sch_static_chunkone) &&
"expected static chunked schedule");
}
llvm::Value *Args[] = {
@@ -2733,8 +2742,11 @@ void CGOpenMPRuntime::emitForStaticInit(CodeGenFunction &CGF,
OpenMPDirectiveKind DKind,
const OpenMPScheduleTy &ScheduleKind,
const StaticRTInput &Values) {
- OpenMPSchedType ScheduleNum = getRuntimeSchedule(
- ScheduleKind.Schedule, Values.Chunk != nullptr, Values.Ordered);
+ OpenMPSchedType ScheduleNum =
+ ScheduleKind.IsDistChunkedAndChunkOne
+ ? OMP_dist_sch_static_chunked_sch_static_chunkone
+ : getRuntimeSchedule(ScheduleKind.Schedule, Values.Chunk != nullptr,
+ Values.Ordered);
assert((isOpenMPWorksharingDirective(DKind) || (DKind == OMPD_loop)) &&
"Expected loop-based or sections-based directive.");
llvm::Value *UpdatedLocation = emitUpdateLocation(CGF, Loc,
diff --git a/clang/lib/CodeGen/CGStmtOpenMP.cpp b/clang/lib/CodeGen/CGStmtOpenMP.cpp
index 1eaf8efa142c5..376e9bd1cee4e 100644
--- a/clang/lib/CodeGen/CGStmtOpenMP.cpp
+++ b/clang/lib/CodeGen/CGStmtOpenMP.cpp
@@ -50,6 +50,16 @@ static const VarDecl *getBaseDecl(const Expr *Ref);
static OpenMPDirectiveKind
getEffectiveDirectiveKind(const OMPExecutableDirective &S);
+static bool canEmitGPUFusedDistSchedule(const CodeGenModule &CGM,
+ const OMPLoopDirective &S,
+ OpenMPDirectiveKind DKind) {
+ return CGM.getLangOpts().OpenMPIsTargetDevice && CGM.getTriple().isGPU() &&
+ isOpenMPLoopBoundSharingDirective(DKind) &&
+ !S.getSingleClause<OMPDistScheduleClause>() &&
+ !S.getSingleClause<OMPScheduleClause>() &&
+ !S.getSingleClause<OMPOrderedClause>();
+}
+
namespace {
/// Lexical scope for OpenMP executable constructs, that handles correct codegen
/// for captured expressions.
@@ -3879,6 +3889,12 @@ bool CodeGenFunction::EmitOMPWorksharingLoop(
RT.isStaticChunked(ScheduleKind.Schedule,
/* Chunked */ Chunk != nullptr) &&
HasChunkSizeOne && isOpenMPLoopBoundSharingDirective(EKind);
+ // GPU combined `distribute parallel for`: emit a single
+ // for_static_init with the fused distr_static_chunk + static_chunkone
+ // schedule (enum 93). The surrounding EmitOMPDistributeLoop must skip
+ // its distribute_static_init under the same conditions.
+ if (StaticChunkedOne && canEmitGPUFusedDistSchedule(CGM, S, EKind))
+ ScheduleKind.IsDistChunkedAndChunkOne = true;
bool IsMonotonic =
Ordered ||
(ScheduleKind.Schedule == OMPC_SCHEDULE_static &&
@@ -6275,102 +6291,113 @@ void CodeGenFunction::EmitOMPDistributeLoop(const OMPLoopDirective &S,
const unsigned IVSize = getContext().getTypeSize(IVExpr->getType());
const bool IVSigned = IVExpr->getType()->hasSignedIntegerRepresentation();
- // OpenMP [2.10.8, distribute Construct, Description]
- // If dist_schedule is specified, kind must be static. If specified,
- // iterations are divided into chunks of size chunk_size, chunks are
- // assigned to the teams of the league in a round-robin fashion in the
- // order of the team number. When no chunk_size is specified, the
- // iteration space is divided into chunks that are approximately equal
- // in size, and at most one chunk is distributed to each team of the
- // league. The size of the chunks is unspecified in this case.
- bool StaticChunked =
- RT.isStaticChunked(ScheduleKind, /* Chunked */ Chunk != nullptr) &&
- isOpenMPLoopBoundSharingDirective(S.getDirectiveKind());
- if (RT.isStaticNonchunked(ScheduleKind,
- /* Chunked */ Chunk != nullptr) ||
- StaticChunked) {
- CGOpenMPRuntime::StaticRTInput StaticInit(
- IVSize, IVSigned, /* Ordered = */ false, IL.getAddress(),
- LB.getAddress(), UB.getAddress(), ST.getAddress(),
- StaticChunked ? Chunk : nullptr);
- RT.emitDistributeStaticInit(*this, S.getBeginLoc(), ScheduleKind,
- StaticInit);
+ // GPU fused schedule: omit the outer distribute loop and let the inner
+ // worksharing loop schedule the flattened team/thread iteration space.
+ if (canEmitGPUFusedDistSchedule(CGM, S, S.getDirectiveKind())) {
JumpDest LoopExit =
getJumpDestInCurrentScope(createBasicBlock("omp.loop.exit"));
- // UB = min(UB, GlobalUB);
- EmitIgnoredExpr(isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
- ? S.getCombinedEnsureUpperBound()
- : S.getEnsureUpperBound());
- // IV = LB;
- EmitIgnoredExpr(isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
- ? S.getCombinedInit()
- : S.getInit());
-
- const Expr *Cond =
- isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
- ? S.getCombinedCond()
- : S.getCond();
-
- if (StaticChunked)
- Cond = S.getCombinedDistCond();
-
- // For static unchunked schedules generate:
- //
- // 1. For distribute alone, codegen
- // while (idx <= UB) {
- // BODY;
- // ++idx;
- // }
- //
- // 2. When combined with 'for' (e.g. as in 'distribute parallel for')
- // while (idx <= UB) {
- // <CodeGen rest of pragma>(LB, UB);
- // idx += ST;
- // }
- //
- // For static chunk one schedule generate:
- //
- // while (IV <= GlobalUB) {
- // <CodeGen rest of pragma>(LB, UB);
- // LB += ST;
- // UB += ST;
- // UB = min(UB, GlobalUB);
- // IV = LB;
- // }
- //
- emitCommonSimdLoop(
- *this, S,
- [&S](CodeGenFunction &CGF, PrePostActionTy &) {
- if (isOpenMPSimdDirective(S.getDirectiveKind()))
- CGF.EmitOMPSimdInit(S);
- },
- [&S, &LoopScope, Cond, IncExpr, LoopExit, &CodeGenLoop,
- StaticChunked](CodeGenFunction &CGF, PrePostActionTy &) {
- CGF.EmitOMPInnerLoop(
- S, LoopScope.requiresCleanups(), Cond, IncExpr,
- [&S, LoopExit, &CodeGenLoop](CodeGenFunction &CGF) {
- CodeGenLoop(CGF, S, LoopExit);
- },
- [&S, StaticChunked](CodeGenFunction &CGF) {
- if (StaticChunked) {
- CGF.EmitIgnoredExpr(S.getCombinedNextLowerBound());
- CGF.EmitIgnoredExpr(S.getCombinedNextUpperBound());
- CGF.EmitIgnoredExpr(S.getCombinedEnsureUpperBound());
- CGF.EmitIgnoredExpr(S.getCombinedInit());
- }
- });
- });
+ CodeGenLoop(*this, S, LoopExit);
EmitBlock(LoopExit.getBlock());
- // Tell the runtime we are done.
- RT.emitForStaticFinish(*this, S.getEndLoc(), OMPD_distribute);
} else {
- // Emit the outer loop, which requests its work chunk [LB..UB] from
- // runtime and runs the inner loop to process it.
- const OMPLoopArguments LoopArguments = {
- LB.getAddress(), UB.getAddress(), ST.getAddress(), IL.getAddress(),
- Chunk};
- EmitOMPDistributeOuterLoop(ScheduleKind, S, LoopScope, LoopArguments,
- CodeGenLoop);
+ // OpenMP [2.10.8, distribute Construct, Description]
+ // If dist_schedule is specified, kind must be static. If specified,
+ // iterations are divided into chunks of size chunk_size, chunks are
+ // assigned to the teams of the league in a round-robin fashion in the
+ // order of the team number. When no chunk_size is specified, the
+ // iteration space is divided into chunks that are approximately equal
+ // in size, and at most one chunk is distributed to each team of the
+ // league. The size of the chunks is unspecified in this case.
+ bool StaticChunked =
+ RT.isStaticChunked(ScheduleKind, /* Chunked */ Chunk != nullptr) &&
+ isOpenMPLoopBoundSharingDirective(S.getDirectiveKind());
+ if (RT.isStaticNonchunked(ScheduleKind,
+ /* Chunked */ Chunk != nullptr) ||
+ StaticChunked) {
+ CGOpenMPRuntime::StaticRTInput StaticInit(
+ IVSize, IVSigned, /* Ordered = */ false, IL.getAddress(),
+ LB.getAddress(), UB.getAddress(), ST.getAddress(),
+ StaticChunked ? Chunk : nullptr);
+ RT.emitDistributeStaticInit(*this, S.getBeginLoc(), ScheduleKind,
+ StaticInit);
+ JumpDest LoopExit =
+ getJumpDestInCurrentScope(createBasicBlock("omp.loop.exit"));
+ // UB = min(UB, GlobalUB);
+ EmitIgnoredExpr(
+ isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
+ ? S.getCombinedEnsureUpperBound()
+ : S.getEnsureUpperBound());
+ // IV = LB;
+ EmitIgnoredExpr(
+ isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
+ ? S.getCombinedInit()
+ : S.getInit());
+
+ const Expr *Cond =
+ isOpenMPLoopBoundSharingDirective(S.getDirectiveKind())
+ ? S.getCombinedCond()
+ : S.getCond();
+
+ if (StaticChunked)
+ Cond = S.getCombinedDistCond();
+
+ // For static unchunked schedules generate:
+ //
+ // 1. For distribute alone, codegen
+ // while (idx <= UB) {
+ // BODY;
+ // ++idx;
+ // }
+ //
+ // 2. When combined with 'for' (e.g. as in 'distribute parallel for')
+ // while (idx <= UB) {
+ // <CodeGen rest of pragma>(LB, UB);
+ // idx += ST;
+ // }
+ //
+ // For static chunk one schedule generate:
+ //
+ // while (IV <= GlobalUB) {
+ // <CodeGen rest of pragma>(LB, UB);
+ // LB += ST;
+ // UB += ST;
+ // UB = min(UB, GlobalUB);
+ // IV = LB;
+ // }
+ //
+ emitCommonSimdLoop(
+ *this, S,
+ [&S](CodeGenFunction &CGF, PrePostActionTy &) {
+ if (isOpenMPSimdDirective(S.getDirectiveKind()))
+ CGF.EmitOMPSimdInit(S);
+ },
+ [&S, &LoopScope, Cond, IncExpr, LoopExit, &CodeGenLoop,
+ StaticChunked](CodeGenFunction &CGF, PrePostActionTy &) {
+ CGF.EmitOMPInnerLoop(
+ S, LoopScope.requiresCleanups(), Cond, IncExpr,
+ [&S, LoopExit, &CodeGenLoop](CodeGenFunction &CGF) {
+ CodeGenLoop(CGF, S, LoopExit);
+ },
+ [&S, StaticChunked](CodeGenFunction &CGF) {
+ if (StaticChunked) {
+ CGF.EmitIgnoredExpr(S.getCombinedNextLowerBound());
+ CGF.EmitIgnoredExpr(S.getCombinedNextUpperBound());
+ CGF.EmitIgnoredExpr(S.getCombinedEnsureUpperBound());
+ CGF.EmitIgnoredExpr(S.getCombinedInit());
+ }
+ });
+ });
+ EmitBlock(LoopExit.getBlock());
+ // Tell the runtime we are done.
+ RT.emitForStaticFinish(*this, S.getEndLoc(), OMPD_distribute);
+ } else {
+ // Emit the outer loop, which requests its work chunk [LB..UB] from
+ // runtime and runs the inner loop to process it.
+ const OMPLoopArguments LoopArguments = {
+ LB.getAddress(), UB.getAddress(), ST.getAddress(),
+ IL.getAddress(), Chunk};
+ EmitOMPDistributeOuterLoop(ScheduleKind, S, LoopScope, LoopArguments,
+ CodeGenLoop);
+ }
}
if (isOpenMPSimdDirective(S.getDirectiveKind())) {
EmitOMPSimdFinal(S, [IL, &S](CodeGenFunction &CGF) {
diff --git a/clang/test/OpenMP/amdgcn_target_device_vla.cpp b/clang/test/OpenMP/amdgcn_target_device_vla.cpp
index 5064c114c0863..323725f215f79 100644
--- a/clang/test/OpenMP/amdgcn_target_device_vla.cpp
+++ b/clang/test/OpenMP/amdgcn_target_device_vla.cpp
@@ -276,92 +276,32 @@ int main() {
// CHECK-NEXT: store i32 1, ptr [[DOTOMP_STRIDE_ASCAST]], align 4
// CHECK-NEXT: store i32 0, ptr [[DOTOMP_IS_LAST_ASCAST]], align 4
// CHECK-NEXT: [[NVPTX_NUM_THREADS:%.*]] = call i32 @__kmpc_get_hardware_num_threads_in_block()
-// CHECK-NEXT: [[TMP6:%.*]] = load ptr, ptr [[DOTGLOBAL_TID__ADDR_ASCAST]], align 8
-// CHECK-NEXT: [[TMP7:%.*]] = load i32, ptr [[TMP6]], align 4
-// CHECK-NEXT: call void @__kmpc_distribute_static_init_4(ptr addrspacecast (ptr addrspace(1) @[[GLOB2:[0-9]+]] to ptr), i32 [[TMP7]], i32 91, ptr [[DOTOMP_IS_LAST_ASCAST]], ptr [[DOTOMP_COMB_LB_ASCAST]], ptr [[DOTOMP_COMB_UB_ASCAST]], ptr [[DOTOMP_STRIDE_ASCAST]], i32 1, i32 [[NVPTX_NUM_THREADS]])
+// CHECK-NEXT: [[TMP6:%.*]] = load i32, ptr [[DOTOMP_COMB_LB_ASCAST]], align 4
+// CHECK-NEXT: [[TMP7:%.*]] = zext i32 [[TMP6]] to i64
// CHECK-NEXT: [[TMP8:%.*]] = load i32, ptr [[DOTOMP_COMB_UB_ASCAST]], align 4
-// CHECK-NEXT: [[TMP9:%.*]] = load i32, ptr [[DOTCAPTURE_EXPR_1_ASCAST]], align 4
-// CHECK-NEXT: [[CMP4:%.*]] = icmp sgt i32 [[TMP8]], [[TMP9]]
-// CHECK-NEXT: br i1 [[CMP4]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
-// CHECK: cond.true:
-// CHECK-NEXT: [[TMP10:%.*]] = load i32, ptr [[DOTCAPTURE_EXPR_1_ASCAST]], align 4
-// CHECK-NEXT: br label [[COND_END:%.*]]
-// CHECK: cond.false:
-// CHECK-NEXT: [[TMP11:%.*]] = load i32, ptr [[DOTOMP_COMB_UB_ASCAST]], align 4
-// CHECK-NEXT: br label [[COND_END]]
-// CHECK: cond.end:
-// CHECK-NEXT: [[COND:%.*]] = phi i32 [ [[TMP10]], [[COND_TRUE]] ], [ [[TMP11]], [[COND_FALSE]] ]
-// CHECK-NEXT: store i32 [[COND]], ptr [[DOTOMP_COMB_UB_ASCAST]], align 4
-// CHECK-NEXT: [[TMP12:%.*]] = load i32, ptr [[DOTOMP_COMB_LB_ASCAST]], align 4
-// CHECK-NEXT: store i32 [[TMP12]], ptr [[DOTOMP_IV_ASCAST]], align 4
-// CHECK-NEXT: br label [[OMP_INNER_FOR_COND:%.*]]
-// CHECK: omp.inner.for.cond:
-// CHECK-NEXT: [[TMP13:%.*]] = load i32, ptr [[DOTOMP_IV_ASCAST]], align 4
-// CHECK-NEXT: [[TMP14:%.*]] = load i32, ptr [[DOTCAPTURE_EXPR_1_ASCAST]], align 4
-// CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP14]], 1
-// CHECK-NEXT: [[CMP5:%.*]] = icmp slt i32 [[TMP13]], [[ADD]]
-// CHECK-NEXT: br i1 [[CMP5]], label [[OMP_INNER_FOR_BODY:%.*]], label [[OMP_INNER_FOR_END:%.*]]
-// CHECK: omp.inner.for.body:
-// CHECK-NEXT: [[TMP15:%.*]] = load i32, ptr [[DOTOMP_COMB_LB_ASCAST]], align 4
-// CHECK-NEXT: [[TMP16:%.*]] = zext i32 [[TMP15]] to i64
-// CHECK-NEXT: [[TMP17:%.*]] = load i32, ptr [[DOTOMP_COMB_UB_ASCAST]], align 4
-// CHECK-NEXT: [[TMP18:%.*]] = zext i32 [[TMP17]] to i64
-// CHECK-NEXT: [[TMP19:%.*]] = load i32, ptr [[M_ADDR_ASCAST]], align 4
-// CHECK-NEXT: store i32 [[TMP19]], ptr addrspace(5) [[M_CASTED]], align 4
-// CHECK-NEXT: [[TMP20:%.*]] = load i64, ptr addrspace(5) [[M_CASTED]], align 8
-// CHECK-NEXT: [[TMP21:%.*]] = getelementptr inbounds [5 x ptr], ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 0, i64 0
-// CHECK-NEXT: [[TMP22:%.*]] = inttoptr i64 [[TMP16]] to ptr
-// CHECK-NEXT: store ptr [[TMP22]], ptr [[TMP21]], align 8
-// CHECK-NEXT: [[TMP23:%.*]] = getelementptr inbounds [5 x ptr], ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 0, i64 1
-// CHECK-NEXT: [[TMP24:%.*]] = inttoptr i64 [[TMP18]] to ptr
-// CHECK-NEXT: store ptr [[TMP24]], ptr [[TMP23]], align 8
-// CHECK-NEXT: [[TMP25:%.*]] = getelementptr inbounds [5 x ptr], ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 0, i64 2
-// CHECK-NEXT: [[TMP26:%.*]] = inttoptr i64 [[TMP20]] to ptr
-// CHECK-NEXT: store ptr [[TMP26]], ptr [[TMP25]], align 8
-// CHECK-NEXT: [[TMP27:%.*]] = getelementptr inbounds [5 x ptr], ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 0, i64 3
-// CHECK-NEXT: [[TMP28:%.*]] = inttoptr i64 [[TMP0]] to ptr
-// CHECK-NEXT: store ptr [[TMP28]], ptr [[TMP27]], align 8
-// CHECK-NEXT: [[TMP29:%.*]] = getelementptr inbounds [5 x ptr], ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 0, i64 4
-// CHECK-NEXT: store ptr [[TMP1]], ptr [[TMP29]], align 8
-// CHECK-NEXT: [[TMP30:%.*]] = load ptr, ptr [[DOTGLOBAL_TID__ADDR_ASCAST]], align 8
-// CHECK-NEXT: [[TMP31:%.*]] = load i32, ptr [[TMP30]], align 4
-// CHECK-NEXT: call void @__kmpc_parallel_60(ptr addrspacecast (ptr addrspace(1) @[[GLOB1]] to ptr), i32 [[TMP31]], i32 1, i32 -1, i32 -1, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo2v_l30_omp_outlined_omp_outlined, ptr null, ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 5, i32 0)
-// CHECK-NEXT: br label [[OMP_INNER_FOR_INC:%.*]]
-// CHECK: omp.inn...
[truncated]
Is higher better?
Yes, these are percentage changes in MB/s.
By the way, it is sometimes a little hard to define what the MB in MB/s should be for a given reduction. But since the formula is consistent, it doesn't matter for the relative comparison if different people define the MB constant differently. With m = the MB constant, x = this PR's time (faster, larger MB/s), and y = base time:
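The formula that followed this comment is not preserved here. A plausible reconstruction from the definitions given (m, x, y), offered as an assumption and not as the author's exact expression, is the relative change of the throughput m/x against the baseline throughput m/y:

```latex
\[
\text{change}
  = \frac{\dfrac{m}{x} - \dfrac{m}{y}}{\dfrac{m}{y}}
  = \frac{y}{x} - 1
\]
```

The constant m cancels, which is consistent with the statement that the choice of the MB constant does not affect the relative comparison.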
I don't see a condition on the loop having reductions: does it apply to any `target teams distribute parallel for` (ttdpf), even without reductions?
On `static bool canEmitGPUFusedDistSchedule(const CodeGenModule &CGM,` in clang/lib/CodeGen/CGStmtOpenMP.cpp:
Add a description of what FusedDist is here.
On `bool IsDistChunkedAndChunkOne = false;` in clang/include/clang/Basic/OpenMPKinds.h:
Are there more variants? If so, better to use an enum.
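The enum suggestion could look roughly like the following sketch. It is not part of the patch: `FusedScheduleKind`, `ScheduleRequest`, and `pickScheduleNumber` are hypothetical names that stand in for the flag in `OpenMPScheduleTy` and the ternary in `emitForStaticInit`. Only the value 93 is taken from the diff.

```cpp
// Hypothetical alternative to the boolean flag: an enum leaves room for
// further fused variants. None of these names exist in clang.
enum class FusedScheduleKind {
  None,                   // regular distribute + worksharing schedules
  DistChunkedAndChunkOne, // fused distr_static_chunk + static_chunkone
};

struct ScheduleRequest {
  FusedScheduleKind Fused = FusedScheduleKind::None;
};

// Selects the runtime schedule number. 93 is the value the patch assigns to
// the fused schedule; RegularSchedule stands for whatever the existing
// lookup would return.
int pickScheduleNumber(const ScheduleRequest &R, int RegularSchedule) {
  switch (R.Fused) {
  case FusedScheduleKind::DistChunkedAndChunkOne:
    return 93;
  case FusedScheduleKind::None:
    return RegularSchedule;
  }
  return RegularSchedule; // not reached for valid enum values
}
```

With a `switch` over the enum, adding a second fused variant later would trigger a compiler warning at every site that does not yet handle it, which a second boolean would not.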
Yes
No, not yet. Do you have interesting snippets you'd like to see results for?
That's something else. The current code in my benchmark/test repo contains only cross-team reduction related code (and scan, but that's another topic and currently not supported by llvm or rocm).
Hm, I let AI come up with some tests and created some experimental benchmarks for them: https://github.com/ro-i/xteam-test/blob/main/src/xteam_misc.cpp For these, the changed loop structure does not seem to be beneficial at all. Let me find out why.
One would think that the changed loop structure is basically a no-op for non-reduction loops. But there seem to be some downstream optimizations involved (maybe unrolling of the inner loop).
Claude assisted with this patch.