[clang][Driver][HIP] Change OffloadingActionBuilder to respect the --no-gpu-bundle-output flag #161849

MixedMatched · 2025-10-03T13:58:09Z

Currently, the command clang -c -emit-llvm --no-gpu-bundle-output --offload-arch=gfx900,gfx1030 -O3 -x hip square.hip will lead to a bundled output:

❯ ../bin/clang -c -emit-llvm --no-gpu-bundle-output --offload-arch=gfx900,gfx1030 -O3 -x hip square.hip
❯ ls
square.hip
square.bc

This doesn't match my expectation of the behavior of --no-gpu-bundle-output, so this adds a check into OffloadingActionBuilder for the flag when replacing the host compile action for a bundling action.

…no-gpu-bundle-output flag

github-actions · 2025-10-03T13:58:28Z

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

llvmbot · 2025-10-03T13:58:58Z

@llvm/pr-subscribers-clang-driver

@llvm/pr-subscribers-clang

Author: alessandra simmons (MixedMatched)

Changes

Currently, the command clang -c -emit-llvm --no-gpu-bundle-output --offload-arch=gfx900,gfx1030 -O3 -x hip square.hip will lead to a bundled output:

❯ ../bin/clang -c -emit-llvm --no-gpu-bundle-output --offload-arch=gfx900,gfx1030 -O3 -x hip square.hip
❯ ls
square.hip
square.bc

This doesn't match my expectation of the behavior of --no-gpu-bundle-output, so this adds a check into OffloadingActionBuilder for the flag when replacing the host compile action for a bundling action.

Full diff: https://github.com/llvm/llvm-project/pull/161849.diff

2 Files Affected:

(modified) clang/lib/Driver/Driver.cpp (+11-5)
(added) clang/test/Driver/no-gpu-bundle-respected.hip (+18)

diff --git a/clang/lib/Driver/Driver.cpp b/clang/lib/Driver/Driver.cpp
index 85a1335785542..245ecf6053202 100644
--- a/clang/lib/Driver/Driver.cpp
+++ b/clang/lib/Driver/Driver.cpp
@@ -3844,6 +3844,9 @@ class OffloadingActionBuilder final {
   /// Flag set to true if all valid builders allow file bundling/unbundling.
   bool CanUseBundler;
 
+  /// Flag set to false if an argument turns off bundling.
+  bool ShouldUseBundler;
+
 public:
   OffloadingActionBuilder(Compilation &C, DerivedArgList &Args,
                           const Driver::InputList &Inputs)
@@ -3878,6 +3881,9 @@ class OffloadingActionBuilder final {
     }
     CanUseBundler =
         ValidBuilders && ValidBuilders == ValidBuildersSupportingBundling;
+
+    ShouldUseBundler = Args.hasFlag(options::OPT_gpu_bundle_output,
+                                    options::OPT_no_gpu_bundle_output, true);
   }
 
   ~OffloadingActionBuilder() {
@@ -4029,11 +4035,11 @@ class OffloadingActionBuilder final {
       SB->appendTopLevelActions(OffloadAL);
     }
 
-    // If we can use the bundler, replace the host action by the bundling one in
-    // the resulting list. Otherwise, just append the device actions. For
-    // device only compilation, HostAction is a null pointer, therefore only do
-    // this when HostAction is not a null pointer.
-    if (CanUseBundler && HostAction &&
+    // If we can and should use the bundler, replace the host action by the
+    // bundling one in the resulting list. Otherwise, just append the device
+    // actions. For device only compilation, HostAction is a null pointer,
+    // therefore only do this when HostAction is not a null pointer.
+    if (CanUseBundler && ShouldUseBundler && HostAction &&
         HostAction->getType() != types::TY_Nothing && !OffloadAL.empty()) {
       // Add the host action to the list in order to create the bundling action.
       OffloadAL.push_back(HostAction);
diff --git a/clang/test/Driver/no-gpu-bundle-respected.hip b/clang/test/Driver/no-gpu-bundle-respected.hip
new file mode 100644
index 0000000000000..1587551f0322d
--- /dev/null
+++ b/clang/test/Driver/no-gpu-bundle-respected.hip
@@ -0,0 +1,18 @@
+// RUN: %clang -ccc-print-phases -c -emit-llvm \
+// RUN:   --offload-arch=gfx900,gfx1030 -O3 -x hip %s \
+// RUN:   2>&1 | FileCheck %s --check-prefix=OFFLOAD
+
+// RUN: %clang -ccc-print-phases -c -emit-llvm \
+// RUN:   --gpu-bundle-output --offload-arch=gfx900,gfx1030 -O3 -x hip %s \
+// RUN:   2>&1 | FileCheck %s --check-prefix=OFFLOAD
+
+// RUN: %clang -ccc-print-phases -c -emit-llvm \
+// RUN:   --no-gpu-bundle-output --offload-arch=gfx900,gfx1030 -O3 -x hip %s \
+// RUN:   2>&1 | FileCheck %s --check-prefix=OFFLOAD2
+
+// OFFLOAD: clang-offload-bundler
+// OFFLOAD2-NOT: clang-offload-bundler
+
+int square(int num) {
+    return num * num;
+}

MixedMatched · 2025-10-03T14:01:30Z

@lamb-j @yxsamliu

…FC. (llvm#161670) Use the `Base` type alias from llvm#158433.

…1711)

…m#92384) Found this problem when investigating llvm#91207

llvm#161761) CallableTraitsHelper identifies the return type and argument types of a callable type and passes those to an implementation class template to operate on. The CallableArgInfo utility uses CallableTraitsHelper to provide typedefs for the return type and argument types (as a tuple) of a callable type. In WrapperFunction.h, the detail::WFCallableTraits utility is rewritten in terms of CallableTraitsHandler (and renamed to WFHandlerTraits).

…vm#161763) Serializers only need to provide two methods now, rather than four. The first method should return an argument serializer / deserializer, the second a result value serializer / deserializer. The interfaces for these are now more uniform (deserialize now returns a tuple, rather than taking its output location(s) by reference). The intent is to simplify Serializer helper code. NFCI.

…161765) This fixes a regression introduced in llvm#147835 When parsing a lambda where the call operator has a late parsed attribute, we would try to build a 'this' type for the lambda, but in a lambda 'this' never refers to the lambda class itself. This late parsed attribute can be added implicitly by the -ftrapping-math flag. This patch patch makes it so CXXThisScopeRAII ignores lambdas. This became observable in llvm#147835 because that made clang lazily create tag types, and it removed a workaround for a lambda dependency bug where any previously created tag type for the lambda is discarded after its dependency is recalculated. But the 'this' scope created above would defeat this laziness and create the lambda type too soon, before its dependency was updated. Since this regression was never released, there are no release notes. Fixes llvm#161657

…vm#161768) This commit aims to reduce boilerplate by adding transparent conversion between Error/Expected types and their SPS-serializable counterparts (SPSSerializableError/SPSSerializableExpected). This allows SPSWrapperFunction calls and handles to be written in terms of Error/Expected directly. This functionality can also be extended to transparently convert between other types. This may be used in the future to provide conversion between ExecutorAddr and native pointer types.

…lvm#161712) We have additional patterns for GISel because we need to make s16 and s32 legal for load/store. GISel does not distinquish integer and FP scalar types in LLT. We only know whether the load should be integer or FP after register bank selection. These patterns should have been updated to use relaxed_load/store when the patterns in RISCVInstrInfoA.td were updated. Without this we will miscompile loads/stores with strong memory ordering when Zalasr is enabled. This patch just fixes the miscompile, Zalasr will now cause a GISel abort in some cases. A follow up patch will add additional GISel patterns for Zalasr.

Fixes the new test introduced in llvm#161765, so that it always uses a triple which supports floating point exceptions. Otherwise, some post-commit bots fail.

The specialized version has a time complexity of `O(n)`.

Adds the name and triple of the graph to LinkGraph::dump output before the rest of the graph content. Calls from JITLinkGeneric.cpp to dump the graph are updated to avoid redundantly naming the graph.

…pe aliases (llvm#161035) We missed calling CheckTemplateArgument when building CTAD deduction guides. That ensures some InitExprs are correctly initialized, as in the test that crashed due to incorrect NTTP initialization. I don't use CheckTemplateArguments because, in CTAD synthesis, template parameter packs don't always appear at the end of the parameter list, unlike user-written ones mandated by the standard. This makes it difficult for CheckTemplateArguments to determine how many arguments a pack in middle should match, leading to unnecessary complexity. On the other hand, since we substitute non-deduced template parameters with deduced ones, we need to fold the packs midway through substitution, where CheckTemplateArgument is more convenient. As a drive-by this also removes some dead code in SemaInit. Fixes llvm#131408

combineBitcastvxi1 is sometimes called pre-legalization, so don't introduce X86ISD::MOVMSK nodes when vector types aren't legal Fixes llvm#161693

… to the biggest legal type (llvm#158070) For ARM, we want to do this up to 32-bits. Otherwise the code ends up bigger and bloated.

Needed for future patch.

…161671) In the standard, constraint satisfaction checking is done on the normalized form of a constraint. Clang instead substitutes on the non-normalized form, which causes us to report substitution failures in template arguments or concept ids, which is non-conforming but unavoidable without a parameter mapping This patch normalizes before satisfaction checking. However, we preserve concept-id nodes in the normalized form, solely for diagnostics purposes. This addresses llvm#61811 and related concepts conformance bugs, ideally to make the remaining implementation of concept template parameters easier Fixes llvm#135190 Fixes llvm#61811 Co-authored-by: Younan Zhang [zyn7109@gmail.com](mailto:zyn7109@gmail.com) --------- Co-authored-by: Younan Zhang <zyn7109@gmail.com>

llvm#161799) …cation. Replaces a call to ObjectFile::makeTriple (still used for ELF and COFF) with a call to MachOObjectFile::getArchTriple. The latter knows how to build correct triples for different MachO CPU subtypes, e.g. arm64 vs arm64e, which is important for selecting the right slice from universal archives.

) This commit adds the GNUPropertyRewriter, which parses features from the .note.gnu.property section. Currently we only read the bit indicating BTI support (GNU_PROPERTY_AARCH64_FEATURE_1_BTI). As BOLT does not add BTI landing pads to targets of indirect branches/calls, we have to emit a warning that the output binary may be corrupted.

Hardware inserts an implicit `S_WAIT_XCNT 0` between alternate SMEM and VMEM instructions, so there are never outstanding address translations for both SMEM and VMEM at the same time.

LongJmp does not support warm blocks. On builds without assertions, this may lead to unexpected crashes. This patch exits with a clear message.

For uniformity with other transforms.

Add Section Header check for getBuildID, fix crash with invalid Program Header. Fixes: llvm#126418 --------- Signed-off-by: Ruoyu Qiu <cabbaken@outlook.com> Signed-off-by: Ruoyu Qiu <qiuruoyu@xiaomi.com> Co-authored-by: Ruoyu Qiu <qiuruoyu@xiaomi.com> Co-authored-by: James Henderson <James.Henderson@sony.com>

…#158641) This cost-model takes into account any type-legalisation that would happen on vectors such as splitting and promotion. This results in wider VFs being chosen for loops that can use partial reductions. The cost-model now also assumes that when SVE is available, the SVE dot instructions for i16 -> i64 dot products can be used for fixed-length vectors. In practice this means that loops with non-scalable VFs are vectorized using partial reductions where they wouldn't before, e.g. ``` int64_t foo2(int8_t *src1, int8_t *src2, int N) { int64_t sum = 0; for (int i=0; i<N; ++i) sum += (int64_t)src1[i] * (int64_t)src2[i]; return sum; } ``` These changes also fix an issue where previously a partial reduction would be used for mixed sign/zero-extends (USDOT), even when +i8mm was not available.

…mask` instead of hard code zero (llvm#161771) There is no test change at this moment because we don't have a target that has this feature by default yet.

This patch introduces support for the jobserver protocol to control parallelism for device offloading tasks. When running a parallel build with a modern build system like `make -jN` or `ninja -jN`, each Clang process might also be configured to use multiple threads for its own tasks (e.g., via `--offload-jobs=4`). This can lead to an explosion of threads (N * 4), causing heavy system load, CPU contention, and ultimately slowing down the entire build. This patch allows Clang to act as a cooperative client of the build system's jobserver. It extends the `--offload-jobs` option to accept the value 'jobserver'. With the recent addition of jobserver support to the Ninja build system, this functionality now benefits users of both Make and Ninja. When `--offload-jobs=jobserver` is specified, Clang's thread pool will: 1. Parse the MAKEFLAGS environment variable to find the jobserver details. 2. Before dispatching a task, acquire a job slot from the jobserver. If none are available, the worker thread will block. 3. Release the job slot once the task is complete. This ensures that the total number of active offload tasks across all Clang processes does not exceed the limit defined by the parent build system, leading to more efficient and controlled parallel builds. Implementation: - A new library, `llvm/Support/Jobserver`, is added to provide a platform-agnostic client for the jobserver protocol, with backends for Unix (FIFO) and Windows (semaphores). - `llvm/Support/ThreadPool` and `llvm/Support/Parallel` are updated with a `jobserver_concurrency` strategy to integrate this logic. - The Clang driver and linker-wrapper are modified to recognize the 'jobserver' argument and enable the new thread pool strategy. - New unit and integration tests are added to validate the feature.

The Ada front end can emit somewhat complicated DWARF expressions for the offset of a field. While working in this area I found that I needed DW_OP_rot (to implement a branch-free computation -- it looked more difficult to add support for branching); and DW_OP_neg and DW_OP_abs (just basic functionality).

When using information from dereferenceable assumptions, we need to make sure that the memory is not freed between the assume and the specified context instruction. Instead of just checking canBeFreed, check if there any calls that may free between the assume and the context instruction. This patch introduces a willNotFreeBetween to check for calls that may free between an assume and a context instructions, to also be used in llvm#161255. PR: llvm#161725

Add support for the standalone OpenMP tile construct: ```f90 !$omp tile sizes(...) DO i = 1, 100 ... ``` This is complementary to llvm#143715 which added support for the tile construct as part of another loop-associated construct such as worksharing-loop, distribute, etc.

Now that llvm#161007 will attempt to fold this back to ADD(x,x) in X86FixupInstTunings, we can more aggressively create X86ISD::VSHLI nodes to avoid missed optimisations due to oneuse limits, avoids unnecessary freezes and allows AVX512 to fold to mi memory folding variants. I've currently limited SSE targets to cases where ADD is the only user of x to prevent extra moves - AVX shift patterns benefit from breaking the ADD+ADD+ADD chains into shifts, but its not so beneficial on SSE with the extra moves.

[clang][Driver][HIP] Change OffloadingActionBuilder to respect the --…

f8c34d1

…no-gpu-bundle-output flag

llvmbot added clang Clang issues not falling into any other category clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' labels Oct 3, 2025

kuhar and others added 24 commits October 3, 2025 10:07

[mlir][vector] Simplify op rewrite pattern inheriting constructors. N…

1ddc8ed

…FC. (llvm#161670) Use the `Base` type alias from llvm#158433.

[NFC][IR2Vec] Moving parseVocabSection() to VocabStorage (llvm#16…

dadcbae

…1711)

[X86][AMX] Combine constant zero vector and AMX cast to tilezero (llv…

b5ce27a

…m#92384) Found this problem when investigating llvm#91207

[clang] fix llvm#161765 test triple dependency (llvm#161769)

cdf6914

Fixes the new test introduced in llvm#161765, so that it always uses a triple which supports floating point exceptions. Otherwise, some post-commit bots fail.

[flang] use specialized scan/verify version for char (llvm#161767)

be2e77b

The specialized version has a time complexity of `O(n)`.

[JITLink] Add LinkGraph name / triple to debugging output. (llvm#161772)

d78d2c8

Adds the name and triple of the graph to LinkGraph::dump output before the rest of the graph content. Calls from JITLinkGeneric.cpp to dump the graph are updated to avoid redundantly naming the graph.

[X86] combineBitcastvxi1 - bail out on soft-float targets (llvm#161704)

f403e83

combineBitcastvxi1 is sometimes called pre-legalization, so don't introduce X86ISD::MOVMSK nodes when vector types aren't legal Fixes llvm#161693

[ARM] shouldFoldMaskToVariableShiftPair should be true for scalars up…

4aebd1c

… to the biggest legal type (llvm#158070) For ARM, we want to do this up to 32-bits. Otherwise the code ends up bigger and bloated.

[AMDGPU] Define VS_128*. NFCI (llvm#161798)

b5519e2

Needed for future patch.

[gn build] Port 441f0c7

c054155

[AMDGPU] Account for implicit XCNT insertion (llvm#160812)

5bb6fca

Hardware inserts an implicit `S_WAIT_XCNT 0` between alternate SMEM and VMEM instructions, so there are never outstanding address translations for both SMEM and VMEM at the same time.

[BOLT][AArch64] Refuse to run CDSplit pass (llvm#159351)

0fafc77

LongJmp does not support warm blocks. On builds without assertions, this may lead to unexpected crashes. This patch exits with a clear message.

[VPlan] Deref VPlanPtr when passing to transform (NFC) (llvm#161369)

32982d7

For uniformity with other transforms.

shiltian and others added 6 commits October 3, 2025 10:07

[AMDGPU][Disassembler] Use target feature for `.amdhsa_reserve_xnack_…

dde8f01

…mask` instead of hard code zero (llvm#161771) There is no test change at this moment because we don't have a target that has this feature by default yet.

MixedMatched closed this Oct 3, 2025

MixedMatched deleted the respect-no-gpu-bundle branch October 3, 2025 14:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[clang][Driver][HIP] Change OffloadingActionBuilder to respect the --no-gpu-bundle-output flag #161849

[clang][Driver][HIP] Change OffloadingActionBuilder to respect the --no-gpu-bundle-output flag #161849

MixedMatched commented Oct 3, 2025

Uh oh!

github-actions bot commented Oct 3, 2025

Uh oh!

llvmbot commented Oct 3, 2025 •

edited

Loading

Uh oh!

MixedMatched commented Oct 3, 2025

Uh oh!

Uh oh!

[clang][Driver][HIP] Change OffloadingActionBuilder to respect the --no-gpu-bundle-output flag #161849

[clang][Driver][HIP] Change OffloadingActionBuilder to respect the --no-gpu-bundle-output flag #161849

Conversation

MixedMatched commented Oct 3, 2025

Uh oh!

github-actions bot commented Oct 3, 2025

Uh oh!

llvmbot commented Oct 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MixedMatched commented Oct 3, 2025

Uh oh!

Uh oh!

llvmbot commented Oct 3, 2025 •

edited

Loading