Rebase xtheadvector on llvm/llvm-project:main #1

Merged
merged 3,031 commits into from
Apr 16, 2024

silvanshade

This PR rebases the xtheadvector branch.

fhahn and others added 27 commits April 2, 2024 21:48
Make sure that VPInstructions with OR opcodes are properly registered as
disjoint ops.

Fixes llvm#87378.
…lvm#87296)

We can't just check whether it is a splat constant. We should also
check whether the values match.
This fixes a test broken in 3d469c0.
fast-forwarded.
../clang/test/CodeGenHLSL/builtins/wave_get_lane_index_subcall.hlsl
When a nested parallel region ends, the runtime calls __kmp_join_call().
During this call, the primary thread of the nested parallel region will
reset its tid (retval of omp_get_thread_num()) to what it was in the
outer parallel region. A data race occurs with the current code when
another worker thread from the nested inner parallel region tries to
steal tasks from the primary thread's task deque. The worker thread
reads the tid value directly from the primary thread's data structure
and may read the wrong value.

This change just uses the calculated victim_tid from execute_tasks()
directly in the steal_task() routine rather than reading tid from the
data structure.
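
A minimal single-threaded sketch of the idea, not the OpenMP runtime code: the stealing routine receives the victim's id from its caller and never reads the `tid` field stored in the victim's own record, so a concurrent reset of that field cannot misdirect it. The `Worker` type and `stealTask` function are hypothetical names, and the locking a real task deque needs is omitted.

```c++
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-in for a per-thread record.
struct Worker {
  int tid = 0;            // may be rewritten by its owner when a nested region joins
  std::deque<int> tasks;  // pending task ids
};

// Steals one task from team[victimTid]. The id comes from the caller, who
// already computed it; victim.tid is deliberately never consulted.
bool stealTask(std::vector<Worker>& team, int victimTid, int& taskOut) {
  assert(victimTid >= 0 && static_cast<std::size_t>(victimTid) < team.size());
  Worker& victim = team[victimTid];
  if (victim.tasks.empty())
    return false;
  taskOut = victim.tasks.back();
  victim.tasks.pop_back();
  return true;
}
```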

Fixes: llvm#87307
Summary:
This patch adds a temporary implementation that uses a struct-based
interface in lieu of varargs support. Once varargs support exists we
will move this implementation to the "real" printf implementation.

Conceptually, this patch has the client copy over its format string and
arguments to the server. The server will then scan the format string
searching for any specifiers that are actually strings. For each string
specifier, the server sends the pointer back to the client to tell it to
copy the string over. The copied value then replaces the pointer when the
final formatting is done.
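
The scanning step can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the libc implementation: `stringArgPositions` is a hypothetical helper, and `*` widths, `*` precisions, and positional arguments are not handled.

```c++
#include <cstddef>
#include <string>
#include <vector>

// Returns the zero-based argument positions whose conversion is %s, i.e. the
// arguments whose pointers would need their contents copied over.
std::vector<std::size_t> stringArgPositions(const std::string& fmt) {
  // Flags, width digits, precision dot, and length modifiers to skip over.
  static const std::string kSkippable = "-+ #0123456789.hlLqjzt";
  std::vector<std::size_t> positions;
  std::size_t argIndex = 0;
  for (std::size_t i = 0; i < fmt.size(); ++i) {
    if (fmt[i] != '%')
      continue;
    ++i;
    if (i < fmt.size() && fmt[i] == '%')
      continue;  // "%%" is a literal percent sign and consumes no argument
    while (i < fmt.size() && kSkippable.find(fmt[i]) != std::string::npos)
      ++i;
    if (i >= fmt.size())
      break;  // truncated specifier at the end of the string
    if (fmt[i] == 's')
      positions.push_back(argIndex);
    ++argIndex;
  }
  return positions;
}
```

For the format string `"%d %s %5.2f %%s %-10s"` the helper reports positions 1 and 3, since `%%s` is a literal percent followed by the letter `s`.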

This will require a built-in extension to the varargs support to get
access to the underlying struct. The varargs used on the GPU will simply
be a struct wrapped in a varargs ABI.
…RE_PAUTH` (llvm#85231)

This adds support for `GNU_PROPERTY_AARCH64_FEATURE_PAUTH` feature (as
defined in ARM-software/abi-aa#240) handling in
llvm-readobj and llvm-readelf. The following constants for supported
platforms are also introduced:

- `AARCH64_PAUTH_PLATFORM_INVALID = 0x0`
- `AARCH64_PAUTH_PLATFORM_BAREMETAL = 0x1`
- `AARCH64_PAUTH_PLATFORM_LLVM_LINUX = 0x10000002`

For the llvm_linux platform, output of the tools contains descriptions
of PAuth features which are enabled/disabled depending on the version
value. Version value bits correspond to the following `LangOptions`
defined in llvm#85232:

- bit 0: `PointerAuthIntrinsics`;
- bit 1: `PointerAuthCalls`;
- bit 2: `PointerAuthReturns`;
- bit 3: `PointerAuthAuthTraps`;
- bit 4: `PointerAuthVTPtrAddressDiscrimination`;
- bit 5: `PointerAuthVTPtrTypeDiscrimination`;
- bit 6: `PointerAuthInitFini`.
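
As an illustration of the bit layout listed above, a version value could be decoded like this. The function name `enabledPAuthFeatures` and the 64-bit width of the version value are assumptions made for the sketch; this is not the llvm-readobj code.

```c++
#include <cstdint>
#include <string>
#include <vector>

// Maps each set bit (0 through 6) of the version value to the name of the
// corresponding LangOptions flag from the list above.
std::vector<std::string> enabledPAuthFeatures(std::uint64_t version) {
  static const char* const kNames[] = {
      "PointerAuthIntrinsics",
      "PointerAuthCalls",
      "PointerAuthReturns",
      "PointerAuthAuthTraps",
      "PointerAuthVTPtrAddressDiscrimination",
      "PointerAuthVTPtrTypeDiscrimination",
      "PointerAuthInitFini",
  };
  std::vector<std::string> enabled;
  for (unsigned bit = 0; bit < 7; ++bit)
    if (version & (std::uint64_t{1} << bit))
      enabled.push_back(kNames[bit]);
  return enabled;
}
```

A version value of `0x5` has bits 0 and 2 set, so the sketch reports `PointerAuthIntrinsics` and `PointerAuthReturns`.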

Support for `.note.AARCH64-PAUTH-ABI-tag` is dropped since it's deleted
from the spec in ARM-software/abi-aa#250.
…#85638)

Summary:
This patch adds an implementation of `printf` that's provided by the GPU
C library runtime. This `printf` is currently implemented using the same
wrapper handling that OpenMP sets up. This will be removed once we have
proper varargs support.

This `printf` differs from the one CUDA offers in that it is synchronous
and uses a finite size. Additionally we support pretty much every format
specifier except the `%n` option.

Depends on llvm#85331
Summary:
The RPC server build for the GPU support needs to be built from the
"projects" phase of the LLVM build. That means it is built with the same
compiler that LLVM supports, which currently is GCC 7.4 in most cases.
A previous patch removed the `LIBC_HAS_BUILTIN` indirection we used,
which regressed the case where we used the `libc` source externally. The
files that we need to use here are `converter.cpp` and `writer.cpp`
which currently are compatible with C++17, so there aren't issues with
the code itself. However, older GCC does not have this builtin which
makes the checks fail.

This patch just adds in a simple wrapper that allows it to correctly
ignore everything if using a compiler that doesn't support it.
This patch adds SWIG cmake flags to the stage2 build in Fuchsia
Clang configuration.
…vancing iterator (llvm#84126)

Currently, the bounds check in `std::ranges::advance(it, n, s)` is done
_before_ `n` is checked. This results in one extra, unneeded bounds
check.

Thus, `std::ranges::advance(it, 1, s)` currently is _not_ simply
equivalent to:

```c++
if (it != s) {
    ++it;
}
```

This difference in behavior matters when the check involves some
"expensive" logic. For example, the `==` operator of
`std::istreambuf_iterator` may actually have to read the underlying
`streambuf`.

Swapping around the checks in the `while` results in the expected
behavior.
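
A simplified sketch of the reordered loop, assuming a non-negative `n`. The names `advanceBounded` and `CountingCursor` are hypothetical and this is not the libc++ source; the cursor type only exists to count how many bound comparisons are made.

```c++
#include <cassert>
#include <cstddef>

// Advances `it` by up to `n` steps, stopping early at `last`. Testing `n`
// first means `it != last` is evaluated only while steps remain.
template <class It, class Sent>
std::ptrdiff_t advanceBounded(It& it, std::ptrdiff_t n, const Sent& last) {
  assert(n >= 0);
  while (n > 0 && it != last) {
    ++it;
    --n;
  }
  return n;  // number of steps that could not be taken
}

// Toy cursor that records every comparison against the bound.
struct CountingCursor {
  int pos;
  int* compares;
  CountingCursor& operator++() {
    ++pos;
    return *this;
  }
  bool operator!=(const CountingCursor& other) const {
    ++*compares;
    return pos != other.pos;
  }
};
```

With this order, advancing by one step performs exactly one comparison against the bound; with the operands of `&&` swapped, the loop would compare a second time before noticing that `n` had reached zero.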
… continue iteration of object files (llvm#87344)

This patch introduces a new `IterationMarker` enum (happy to take
alternative name suggestions), which callbacks, like the one in
`SymbolFileDWARFDebugMap::ForEachSymbolFile`, can return in order to
indicate whether the caller should continue iterating or bail.

For now this patch just changes the `ForEachSymbolFile` callback to use
this new enum. In the future we could change the various
`DWARFIndex::GetXXX` callbacks to do the same.

This makes the callbacks easier to read and hopefully reduces the chance
of bugs like llvm#87177.
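
The pattern might look roughly like the sketch below. Only the enum name `IterationMarker` comes from the description above; the enumerator names and the `forEachItem` helper are hypothetical and do not reflect the LLDB code.

```c++
#include <vector>

// Enumerator names are assumptions made for this sketch.
enum class IterationMarker { kContinue, kStop };

// Hypothetical iteration helper: calls `callback` on each item and stops as
// soon as the callback asks for it.
template <class T, class Fn>
void forEachItem(const std::vector<T>& items, Fn&& callback) {
  for (const T& item : items)
    if (callback(item) == IterationMarker::kStop)
      return;
}
```

Compared with a `bool` return value, the enum spells out at each `return` whether iteration continues, which is the readability benefit the description mentions.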
The tosa-infer-shapes pass inserts tensor.cast operations to mediate
refined result types with consumers whose types cannot be refined. This
process interferes with how types are refined in tosa.while_loop body
regions, where types are propagated speculatively (to determine the
types of the tosa.yield terminator) and then reverted.

The new tensor.cast operations result in a crash because they have no
types associated with them for the reversion process.

This change modifies the shape propagation behavior so that the
introduction of tensor.cast operations behaves better with this type
reversion process. The new behavior is to only introduce tensor.cast
operations once we wish to commit the newly computed types to the IR.

This is an example causing the crash:

```mlir
func.func @while_dont_crash(%arg0 : tensor<i32>) -> (tensor<*xi32>) {
  %0 = tosa.add %arg0, %arg0 : (tensor<i32>, tensor<i32>) -> tensor<*xi32>

  %1 = tosa.while_loop (%arg1 = %0) : (tensor<*xi32>) -> tensor<*xi32> {
    %2 = "tosa.const"() <{value = dense<3> : tensor<i32>}> : () -> tensor<i32>
    %3 = tosa.greater_equal %2, %arg1 : (tensor<i32>, tensor<*xi32>) -> tensor<*xi1>
    tosa.yield %3 : tensor<*xi1>
  } do {
  ^bb0(%arg1: tensor<*xi32>):
    // Inferrable operation whose type will refine to tensor<i32>
    %3 = tosa.add %arg1, %arg1 : (tensor<*xi32>, tensor<*xi32>) -> tensor<*xi32>

    // Non-inferrable use site, will require the cast:
    //     tensor.cast %3 : tensor<i32> to tensor<*xi32>
    // 
    // The new cast operation will result in accessing undefined memory through
    // originalTypeMap in the C++ code.
    "use"(%3) : (tensor<*xi32>) -> ()
    tosa.yield %3 : tensor<*xi32>
  }

  return %1 : tensor<*xi32>
}
```

The `tensor.cast` operation inserted in the loop body causes a failure
in the code which resets the types after propagation through the loop
body:

```c++
// The types inferred in the block assume the operand types specified for
// this iteration. We need to restore the original types to ensure that
// future iterations only use the already specified types, not possible
// types from previous iterations.
for (auto &block : bodyRegion) {
  for (auto arg : block.getArguments())
    arg.setType(originalTypeMap[arg]);
  for (auto &op : block)
    for (auto result : op.getResults())
      result.setType(originalTypeMap[result]);  // problematic access
}
```

---------

Co-authored-by: Spenser Bauman <sabauma@fastmail>
llvm#87155)

Both `std::distance` and `ranges::distance` are inefficient for
non-sized ranges. Also, calculating the range using `int` type is
seriously problematic.

This patch avoids using `distance` and calculation of the length of
non-sized ranges.

Fixes llvm#86833.
Summary:
This is more hacky, but I want to get the bot green before we work on a
better solution.
…arse tensors (llvm#87305)

`linalg.generic` ops with sparse tensors do not necessarily bufferize to
element-wise access, because insertions into a sparse tensor may change
the layout of (or reallocate) the underlying sparse data structures.
…lvm#87308)

Use a single insert for the non-mask case instead of a push_back
followed by an insert that may contain 0 registers.
…` instead of `SourceLocation` (llvm#87427)

For pragma diagnostic mappings, we always write/read `SourceLocation`
with offset 0. This is equivalent to just writing a `FileID`, which is
exactly what this patch starts doing.

Originally reviewed here: https://reviews.llvm.org/D137213
Improve the test gnu-ifunc-nonpreemptible.s to check IRELATIVE offsets.
Ensure that IRELATIVE offsets are ordered to improve locality.
…MP (llvm#85638)"

This reverts commit 2cf8118.

Failing tests, revert until I can fix it
- The opcodes of mina.fmt and max.fmt are documented incorrectly: object
  code compiled from the same assembly with LLVM behaves differently from
  code compiled with GCC and Binutils.
- Modify the opcodes to match Binutils. The actual opcodes are as
  follows:

  {5,3} | bits {2,0} of func
        | ... | 100 | 101  | 110 | 111
  ------+-----+-----+------+-----+------
   010  | ... | min | mina | max | maxa
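
Read as a 6-bit `func` value, the row and column bits of the table combine as in the sketch below. The helper `funcField` and the constant names are hypothetical, and where this field sits inside the full instruction word is not stated here, so the sketch covers only the six bits.

```c++
#include <cstdint>

// Combines bits {5,3} (table row) with bits {2,0} (table column).
constexpr std::uint8_t funcField(std::uint8_t hi3, std::uint8_t lo3) {
  return static_cast<std::uint8_t>(((hi3 & 0x7) << 3) | (lo3 & 0x7));
}

constexpr std::uint8_t kFuncMin = funcField(0b010, 0b100);
constexpr std::uint8_t kFuncMina = funcField(0b010, 0b101);
constexpr std::uint8_t kFuncMax = funcField(0b010, 0b110);
constexpr std::uint8_t kFuncMaxa = funcField(0b010, 0b111);

static_assert(kFuncMin == 0b010100, "min");
static_assert(kFuncMina == 0b010101, "mina");
static_assert(kFuncMax == 0b010110, "max");
static_assert(kFuncMaxa == 0b010111, "maxa");
```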
aeubanks and others added 27 commits April 4, 2024 17:05
commit d89914f
  Author: Kazu Hirata <kazu@google.com>
  Date:   Wed Apr 3 21:48:38 2024 -0700

changed RecordWriterTrait to a template class with IndexedVersion as a
template parameter.  This patch changes the class back to a
non-template one while retaining the ability to serialize multiple
versions.

The reason I changed RecordWriterTrait to a template class was
because, even if RecordWriterTrait had IndexedVersion as a member
variable, RecordWriterTrait::EmitKeyDataLength, being a static
function, would not have access to the variable.

Since OnDiskChainedHashTableGenerator calls EmitKeyDataLength as:

  const std::pair<offset_type, offset_type> &Len =
      InfoObj.EmitKeyDataLength(Out, I->Key, I->Data);

we can make EmitKeyDataLength a member function, but we have one
problem.  InstrProfWriter::writeImpl calls:

  void insert(typename Info::key_type_ref Key,
              typename Info::data_type_ref Data) {
    Info InfoObj;
    insert(Key, Data, InfoObj);
  }

which default-constructs RecordWriterTrait without a specific version
number.  This patch fixes the problem by adjusting
InstrProfWriter::writeImpl to call the other form of insert instead:

  void insert(typename Info::key_type_ref Key,
              typename Info::data_type_ref Data, Info &InfoObj)

To prevent an accidental invocation of the default constructor of
RecordWriterTrait, this patch deletes the default constructor.
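
The shape of the change can be sketched with simplified stand-ins. `VersionedWriterTrait` and `TableGenerator` are hypothetical names, and the length formula is invented purely so that the version has a visible effect; none of this is the actual InstrProf code.

```c++
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <type_traits>

// Stand-in for a non-template trait class that carries its version.
class VersionedWriterTrait {
public:
  VersionedWriterTrait() = delete;  // a version must always be supplied
  explicit VersionedWriterTrait(std::uint32_t version) : version_(version) {}

  // A member function (not static), so it can read version_.
  // The formula is made up for illustration.
  std::size_t keyDataLength(const std::string& key) const {
    return key.size() + (version_ >= 2 ? 8 : 4);
  }

private:
  std::uint32_t version_;
};

// Stand-in for a table generator that only offers the insert overload
// taking an existing Info object, so it never default-constructs one.
template <class Info>
class TableGenerator {
public:
  void insert(const std::string& key, Info& info) {
    lengths_[key] = info.keyDataLength(key);
  }
  std::size_t lengthOf(const std::string& key) const { return lengths_.at(key); }

private:
  std::map<std::string, std::size_t> lengths_;
};

static_assert(!std::is_default_constructible<VersionedWriterTrait>::value,
              "the trait cannot be created without a version");
```

With the default constructor deleted, an accidental `Info InfoObj;` fails to compile instead of silently producing a trait with no version.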
…87567)

LLVMgold.so can be used with GNU ar, gold, ld, and nm to process LLVM
bitcode files. Install it in LLVM_INSTALL_TOOLCHAIN_ONLY=on builds like
we install libLTO.so.

Suggested by @emelife

Fix llvm#84271
…motableOpInterface` (llvm#86792)

Add `requiresReplacedValues` and `visitReplacedValues` methods to
`PromotableOpInterface`. These methods allow `PromotableOpInterface` ops
to transform definitions mutated by a `store`.

This change is necessary to correctly handle the promotion of
`LLVM_DbgDeclareOp`.

---------

Co-authored-by: Théo Degioanni <30992420+Moxinilian@users.noreply.github.com>
In `(icmp eq (and x,y), C)` all 1s in `C` must also be set in both
`x`/`y`.

In `(icmp eq (or x,y), C)` all 0s in `C` must also be clear in both
`x`/`y`.

Closes llvm#87143
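
Both implications can be checked exhaustively on 8-bit values. The sketch below is only a brute-force sanity check of the bit-level facts, with a hypothetical function name; it says nothing about how the compiler implements the reasoning.

```c++
// For every 8-bit x and y:
//   if (x & y) == C, every 1 bit of C is also 1 in x and in y;
//   if (x | y) == C, every 0 bit of C is also 0 in x and in y.
bool knownBitsImplicationsHold() {
  for (unsigned x = 0; x < 256; ++x) {
    for (unsigned y = 0; y < 256; ++y) {
      const unsigned cAnd = x & y;
      if ((x & cAnd) != cAnd || (y & cAnd) != cAnd)
        return false;

      const unsigned cOr = x | y;
      const unsigned zerosOfC = ~cOr & 0xFFu;  // positions where C has a 0
      if ((x & zerosOfC) != 0 || (y & zerosOfC) != 0)
        return false;
    }
  }
  return true;
}
```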
There is one notable "regression". This patch replaces the bespoke `or
disjoint` logic with a direct match. This means we fail some
simplifications during `instsimplify`.
All the cases we fail in `instsimplify` we do handle in `instcombine`
as we add `disjoint` flags.

Other than that, just some basic cases.

See proofs: https://alive2.llvm.org/ce/z/_-g7C8

Closes llvm#86083
It matches up with other _attribute_ adding member functions and helps
simplify InterfaceFile assignment for InstallAPI.
Depends on llvm#87545

Emit `GNU_PROPERTY_AARCH64_FEATURE_PAUTH` property in
`.note.gnu.property` section depending on
`aarch64-elf-pauthabi-platform` and `aarch64-elf-pauthabi-version` llvm
module flags.
…#86894)

The existing heuristics were assuming that every core behaves like an
Apple A7, where any extend/shift costs an extra micro-op... but in
reality, nothing else behaves like that.

On some older Cortex designs, shifts by 1 or 4 cost extra, but all other
shifts/extensions are free. On all other cores, as far as I can tell,
all shifts/extensions for integer loads are free (i.e. the same cost as
an unshifted load).

To reflect this, this patch:

- Enables aggressive folding of shifts into loads by default.

- Removes the old AddrLSLFast feature, since it applies to everything
except A7 (and even if you are explicitly targeting A7, we want to
assume extensions are free because the code will almost always run on a
newer core).

- Adds a new feature AddrLSLSlow14 that applies specifically to the
Cortex cores where shifts by 1 or 4 cost extra.

I didn't add support for AddrLSLSlow14 on the GlobalISel side because it
would require a bunch of refactoring to work correctly. Someone can pick
this up as a followup.
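
The cost rule described above can be summarized as a small predicate. This is a sketch of the stated heuristic only: the function name and the boolean parameter are hypothetical, and the real decision in the backend involves more context than a single shift amount.

```c++
// Returns true if folding a left shift by `shiftAmount` into a load's
// addressing mode is assumed to cost the same as an unshifted load.
// `hasAddrLSLSlow14` models the feature for cores where shifts by 1 or 4
// cost extra; on every other core all such shifts are treated as free.
constexpr bool isShiftFoldFree(bool hasAddrLSLSlow14, unsigned shiftAmount) {
  return !(hasAddrLSLSlow14 && (shiftAmount == 1 || shiftAmount == 4));
}

static_assert(isShiftFoldFree(false, 1), "free on cores without the feature");
static_assert(!isShiftFoldFree(true, 4), "shift by 4 costs extra with the feature");
static_assert(isShiftFoldFree(true, 3), "other shifts stay free");
```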
Summary:
This synchronization should be done before we handle the logic relating
to closing the port. This isn't majorly important now but it would break
if we ever decided to run a server on the GPU.
I believe I've got the tests properly configured to only run on Linux
x86(_64), as I don't have a Linux AArch64/Arm device to diagnose what's
going wrong with the tests (I suspect there's some issue with generating
`.note.gnu.build-id` sections...)

The actual code fixes have now been reviewed 3 times:
llvm#79181 (moved shell tests to
API tests), llvm#85693 (Changed
some of the testing infra), and
llvm#86812 (didn't get the tests
configured quite right). The Debuginfod integration for symbol
acquisition in LLDB now works with the `executable` and `debuginfo`
Debuginfod network requests working properly for normal, `objcopy
--only-keep-debug` stripped, split-dwarf, and `objcopy
--only-keep-debug` stripped *plus* split-dwarf symbols/binaries.

The reasons for the multiple attempts have been tests on platforms I
don't have access to (Linux AArch64/Arm + MacOS x86_64). I believe I've
got the tests properly disabled for everything except for Linux x86(_64)
now. I've built & tested on MacOS AArch64 and Linux x86_64.

---------

Co-authored-by: Kevin Frei <freik@meta.com>
Since our latest release, Clang 16 is no longer actively supported.
However, the FreeBSD runner still uses it, so some tests still guard
against Clang 16.
@kata-ark
Owner

Oh thanks very much!! 👍

@kata-ark kata-ark merged commit e059cb3 into kata-ark:main Apr 16, 2024