Skip to content

Conversation

MixedMatched
Copy link

Currently, the command clang -c -emit-llvm --no-gpu-bundle-output --offload-arch=gfx900,gfx1030 -O3 -x hip square.hip will lead to a bundled output:

❯ ../bin/clang -c -emit-llvm --no-gpu-bundle-output --offload-arch=gfx900,gfx1030 -O3 -x hip square.hip
❯ ls
square.hip
square.bc

This doesn't match my expectation of the behavior of --no-gpu-bundle-output, so this adds a check into OffloadingActionBuilder for the flag when replacing the host compile action for a bundling action.

Copy link

github-actions bot commented Oct 3, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' labels Oct 3, 2025
@llvmbot
Copy link
Member

llvmbot commented Oct 3, 2025

@llvm/pr-subscribers-clang-driver

@llvm/pr-subscribers-clang

Author: alessandra simmons (MixedMatched)

Changes

Currently, the command clang -c -emit-llvm --no-gpu-bundle-output --offload-arch=gfx900,gfx1030 -O3 -x hip square.hip will lead to a bundled output:

❯ ../bin/clang -c -emit-llvm --no-gpu-bundle-output --offload-arch=gfx900,gfx1030 -O3 -x hip square.hip
❯ ls
square.hip
square.bc

This doesn't match my expectation of the behavior of --no-gpu-bundle-output, so this adds a check into OffloadingActionBuilder for the flag when replacing the host compile action for a bundling action.


Full diff: https://github.com/llvm/llvm-project/pull/161849.diff

2 Files Affected:

  • (modified) clang/lib/Driver/Driver.cpp (+11-5)
  • (added) clang/test/Driver/no-gpu-bundle-respected.hip (+18)
diff --git a/clang/lib/Driver/Driver.cpp b/clang/lib/Driver/Driver.cpp
index 85a1335785542..245ecf6053202 100644
--- a/clang/lib/Driver/Driver.cpp
+++ b/clang/lib/Driver/Driver.cpp
@@ -3844,6 +3844,9 @@ class OffloadingActionBuilder final {
   /// Flag set to true if all valid builders allow file bundling/unbundling.
   bool CanUseBundler;
 
+  /// Flag set to false if an argument turns off bundling.
+  bool ShouldUseBundler;
+
 public:
   OffloadingActionBuilder(Compilation &C, DerivedArgList &Args,
                           const Driver::InputList &Inputs)
@@ -3878,6 +3881,9 @@ class OffloadingActionBuilder final {
     }
     CanUseBundler =
         ValidBuilders && ValidBuilders == ValidBuildersSupportingBundling;
+
+    ShouldUseBundler = Args.hasFlag(options::OPT_gpu_bundle_output,
+                                    options::OPT_no_gpu_bundle_output, true);
   }
 
   ~OffloadingActionBuilder() {
@@ -4029,11 +4035,11 @@ class OffloadingActionBuilder final {
       SB->appendTopLevelActions(OffloadAL);
     }
 
-    // If we can use the bundler, replace the host action by the bundling one in
-    // the resulting list. Otherwise, just append the device actions. For
-    // device only compilation, HostAction is a null pointer, therefore only do
-    // this when HostAction is not a null pointer.
-    if (CanUseBundler && HostAction &&
+    // If we can and should use the bundler, replace the host action by the
+    // bundling one in the resulting list. Otherwise, just append the device
+    // actions. For device only compilation, HostAction is a null pointer,
+    // therefore only do this when HostAction is not a null pointer.
+    if (CanUseBundler && ShouldUseBundler && HostAction &&
         HostAction->getType() != types::TY_Nothing && !OffloadAL.empty()) {
       // Add the host action to the list in order to create the bundling action.
       OffloadAL.push_back(HostAction);
diff --git a/clang/test/Driver/no-gpu-bundle-respected.hip b/clang/test/Driver/no-gpu-bundle-respected.hip
new file mode 100644
index 0000000000000..1587551f0322d
--- /dev/null
+++ b/clang/test/Driver/no-gpu-bundle-respected.hip
@@ -0,0 +1,18 @@
+// RUN: %clang -ccc-print-phases -c -emit-llvm \
+// RUN:   --offload-arch=gfx900,gfx1030 -O3 -x hip %s \
+// RUN:   2>&1 | FileCheck %s --check-prefix=OFFLOAD
+
+// RUN: %clang -ccc-print-phases -c -emit-llvm \
+// RUN:   --gpu-bundle-output --offload-arch=gfx900,gfx1030 -O3 -x hip %s \
+// RUN:   2>&1 | FileCheck %s --check-prefix=OFFLOAD
+
+// RUN: %clang -ccc-print-phases -c -emit-llvm \
+// RUN:   --no-gpu-bundle-output --offload-arch=gfx900,gfx1030 -O3 -x hip %s \
+// RUN:   2>&1 | FileCheck %s --check-prefix=OFFLOAD2
+
+// OFFLOAD: clang-offload-bundler
+// OFFLOAD2-NOT: clang-offload-bundler
+
+int square(int num) {
+    return num * num;
+}

@MixedMatched
Copy link
Author

@lamb-j @yxsamliu

kuhar and others added 24 commits October 3, 2025 10:07
llvm#161761)

CallableTraitsHelper identifies the return type and argument types of a
callable type and passes those to an implementation class template to
operate on.

The CallableArgInfo utility uses CallableTraitsHelper to provide
typedefs for the return type and argument types (as a tuple) of a
callable type.

In WrapperFunction.h, the detail::WFCallableTraits utility is rewritten
in terms of CallableTraitsHandler (and renamed to WFHandlerTraits).
…vm#161763)

Serializers only need to provide two methods now, rather than four. The
first method should return an argument serializer / deserializer, the
second a result value serializer / deserializer. The interfaces for
these are now more uniform (deserialize now returns a tuple, rather than
taking its output location(s) by reference). The intent is to simplify
Serializer helper code. NFCI.
…161765)

This fixes a regression introduced in llvm#147835

When parsing a lambda where the call operator has a late parsed
attribute, we would try to build a 'this' type for the lambda, but in a
lambda 'this' never refers to the lambda class itself.

This late parsed attribute can be added implicitly by the
-ftrapping-math flag.

This patch patch makes it so CXXThisScopeRAII ignores lambdas.

This became observable in llvm#147835 because that made clang lazily create
tag types, and it removed a workaround for a lambda dependency bug where
any previously created tag type for the lambda is discarded after its
dependency is recalculated.

But the 'this' scope created above would defeat this laziness and create
the lambda type too soon, before its dependency was updated.

Since this regression was never released, there are no release notes.

Fixes llvm#161657
…vm#161768)

This commit aims to reduce boilerplate by adding transparent conversion
between Error/Expected types and their SPS-serializable counterparts
(SPSSerializableError/SPSSerializableExpected). This allows
SPSWrapperFunction calls and handles to be written in terms of
Error/Expected directly.

This functionality can also be extended to transparently convert between
other types. This may be used in the future to provide conversion
between ExecutorAddr and native pointer types.
…lvm#161712)

We have additional patterns for GISel because we need to make s16 and
s32 legal for load/store. GISel does not distinquish integer and FP
scalar types in LLT. We only know whether the load should be integer or
FP after register bank selection.

These patterns should have been updated to use relaxed_load/store when
the patterns in RISCVInstrInfoA.td were updated. Without this we will
miscompile loads/stores with strong memory ordering when Zalasr is
enabled.

This patch just fixes the miscompile, Zalasr will now cause a GISel
abort in some cases. A follow up patch will add additional GISel
patterns for Zalasr.
Fixes the new test introduced in llvm#161765, so that it always uses a
triple which supports floating point exceptions.

Otherwise, some post-commit bots fail.
The specialized version has a time complexity of `O(n)`.
Adds the name and triple of the graph to LinkGraph::dump output before
the rest of the graph content. Calls from JITLinkGeneric.cpp to dump the
graph are updated to avoid redundantly naming the graph.
…pe aliases (llvm#161035)

We missed calling CheckTemplateArgument when building CTAD deduction
guides. That ensures some InitExprs are correctly initialized, as in the
test that crashed due to incorrect NTTP initialization.

I don't use CheckTemplateArguments because, in CTAD synthesis, template
parameter packs don't always appear at the end of the parameter list,
unlike user-written ones mandated by the standard. This makes it
difficult for CheckTemplateArguments to determine how many arguments a
pack in middle should match, leading to unnecessary complexity.

On the other hand, since we substitute non-deduced template parameters
with deduced ones, we need to fold the packs midway through
substitution, where CheckTemplateArgument is more convenient.

As a drive-by this also removes some dead code in SemaInit.

Fixes llvm#131408
combineBitcastvxi1 is sometimes called pre-legalization, so don't
introduce X86ISD::MOVMSK nodes when vector types aren't legal

Fixes llvm#161693
… to the biggest legal type (llvm#158070)

For ARM, we want to do this up to 32-bits. Otherwise the code ends up
bigger and bloated.
…161671)

In the standard, constraint satisfaction checking is done on the
normalized form of a constraint.

Clang instead substitutes on the non-normalized form, which causes us to
report substitution failures in template arguments or concept ids, which
is non-conforming but unavoidable without a parameter mapping

This patch normalizes before satisfaction checking. However, we preserve
concept-id nodes in the normalized form, solely for diagnostics
purposes.

This addresses llvm#61811 and
related concepts conformance bugs, ideally to make the remaining
implementation of concept template parameters easier

Fixes llvm#135190
Fixes llvm#61811

Co-authored-by: Younan Zhang
[zyn7109@gmail.com](mailto:zyn7109@gmail.com)

---------

Co-authored-by: Younan Zhang <zyn7109@gmail.com>
llvm#161799)

…cation.

Replaces a call to ObjectFile::makeTriple (still used for ELF and COFF)
with a call to MachOObjectFile::getArchTriple. The latter knows how to
build correct triples for different MachO CPU subtypes, e.g. arm64 vs
arm64e, which is important for selecting the right slice from universal
archives.
)

This commit adds the GNUPropertyRewriter, which parses features from the
.note.gnu.property section.

Currently we only read the bit indicating BTI support
(GNU_PROPERTY_AARCH64_FEATURE_1_BTI).

As BOLT does not add BTI landing pads to targets of indirect
branches/calls, we have to emit a warning that the output binary may be
corrupted.
Hardware inserts an implicit `S_WAIT_XCNT 0` between 
alternate SMEM and VMEM instructions, so there are 
never outstanding address translations for both SMEM 
and VMEM at the same time.
LongJmp does not support warm blocks.
On builds without assertions, this may lead to unexpected crashes.

This patch exits with a clear message.
Add Section Header check for getBuildID, fix crash with invalid Program
Header.

Fixes: llvm#126418

---------

Signed-off-by: Ruoyu Qiu <cabbaken@outlook.com>
Signed-off-by: Ruoyu Qiu <qiuruoyu@xiaomi.com>
Co-authored-by: Ruoyu Qiu <qiuruoyu@xiaomi.com>
Co-authored-by: James Henderson <James.Henderson@sony.com>
…#158641)

This cost-model takes into account any type-legalisation that would
happen on vectors such as splitting and promotion. This results in wider
VFs being chosen for loops that can use partial reductions.

The cost-model now also assumes that when SVE is available, the SVE dot
instructions for i16 -> i64 dot products can be used for fixed-length
vectors. In practice this means that loops with non-scalable VFs are
vectorized using partial reductions where they wouldn't before, e.g.

```
  int64_t foo2(int8_t *src1, int8_t *src2, int N) {
    int64_t sum = 0;
    for (int i=0; i<N; ++i)
      sum += (int64_t)src1[i] * (int64_t)src2[i];
    return sum;
  }
```

These changes also fix an issue where previously a partial reduction
would be used for mixed sign/zero-extends (USDOT), even when +i8mm was
not available.
shiltian and others added 6 commits October 3, 2025 10:07
…mask` instead of hard code zero (llvm#161771)

There is no test change at this moment because we don't have a target
that has this feature by default yet.
This patch introduces support for the jobserver protocol to control
parallelism for device offloading tasks.

When running a parallel build with a modern build system like `make -jN`
or `ninja -jN`, each Clang process might also be configured to use
multiple threads for its own tasks (e.g., via `--offload-jobs=4`). This
can lead to an explosion of threads (N * 4), causing heavy system load,
CPU contention, and ultimately slowing down the entire build.

This patch allows Clang to act as a cooperative client of the build
system's jobserver. It extends the `--offload-jobs` option to accept the
value 'jobserver'. With the recent addition of jobserver support to the
Ninja build system, this functionality now benefits users of both Make
and Ninja.

When `--offload-jobs=jobserver` is specified, Clang's thread pool will:
1. Parse the MAKEFLAGS environment variable to find the jobserver
details.
2. Before dispatching a task, acquire a job slot from the jobserver. If
none are available, the worker thread will block.
3. Release the job slot once the task is complete.

This ensures that the total number of active offload tasks across all
Clang processes does not exceed the limit defined by the parent build
system, leading to more efficient and controlled parallel builds.

Implementation:
- A new library, `llvm/Support/Jobserver`, is added to provide a
platform-agnostic client for the jobserver protocol, with backends for
Unix (FIFO) and Windows (semaphores).
- `llvm/Support/ThreadPool` and `llvm/Support/Parallel` are updated with
a `jobserver_concurrency` strategy to integrate this logic.
- The Clang driver and linker-wrapper are modified to recognize the
'jobserver' argument and enable the new thread pool strategy.
- New unit and integration tests are added to validate the feature.
The Ada front end can emit somewhat complicated DWARF expressions for
the offset of a field. While working in this area I found that I needed
DW_OP_rot (to implement a branch-free computation -- it looked more
difficult to add support for branching); and DW_OP_neg and DW_OP_abs
(just basic functionality).
When using information from dereferenceable assumptions, we need to make
sure that the memory is not freed between the assume and the specified
context instruction. Instead of just checking canBeFreed, check if there
any calls that may free between the assume and the context instruction.

This patch introduces a willNotFreeBetween to check for calls that may
free between an assume and a context instructions, to also be used in
llvm#161255.

PR: llvm#161725
Add support for the standalone OpenMP tile construct:
```f90
!$omp tile sizes(...)
DO i = 1, 100
  ...
```

This is complementary to llvm#143715 which added support for the tile
construct as part of another loop-associated construct such as
worksharing-loop, distribute, etc.
Now that llvm#161007 will attempt to fold this back to ADD(x,x) in
X86FixupInstTunings, we can more aggressively create X86ISD::VSHLI nodes
to avoid missed optimisations due to oneuse limits, avoids unnecessary
freezes and allows AVX512 to fold to mi memory folding variants.

I've currently limited SSE targets to cases where ADD is the only user
of x to prevent extra moves - AVX shift patterns benefit from breaking
the ADD+ADD+ADD chains into shifts, but its not so beneficial on SSE
with the extra moves.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.