[libomptarget] Support BE ELF files in plugins-nextgen #83976

Merged
uweigand merged 1 commit into llvm:main from the openmp-plugin-elfbe branch on Mar 6, 2024

uweigand
Member

@uweigand uweigand commented Mar 5, 2024

Code in plugins-nextgen reading ELF files is currently hard-coded to assume a 64-bit little-endian ELF format. Unfortunately, this assumption is even embedded in the interface between GlobalHandler and Utils/ELF routines, which use ELF64LE types.

To fix this, I've refactored the interface to push all ELF specific types into Utils/ELF. Specifically, this patch removes both the getSymbol and getSymbolAddress routines and replaces them with a single findSymbolInImage, which gets a MemoryBufferRef identifying the raw object file image as input, and returns a StringRef covering the data addressed by the symbol (address and size) if found, or an empty StringRef otherwise.

This allows properly templating over multiple ELF format variants inside Utils/ELF; specifically, this patch adds support for 64-bit big-endian ELF files in addition to 64-bit little-endian files.

@llvmbot llvmbot added the openmp:libomptarget OpenMP offload runtime label Mar 5, 2024

github-actions bot commented Mar 5, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

uweigand added a commit to uweigand/llvm-project that referenced this pull request Mar 5, 2024
The plugin was not getting built as the build_generic_elf64
macro assumes the LLVM triple processor name matches the
CMake processor name, which is unfortunately not the case
for SystemZ.

Fix this by providing two separate arguments instead.

Actually building the plugin exposed a number of other issues
causing various test failures.  Specifically, I've had to add
the SystemZ target to
- CompilerInvocation::ParseLangArgs
- linkDevice in ClangLinuxWrapper.cpp
- OMPContext::OMPContext (to set the device_kind_cpu trait)
- LIBOMPTARGET_ALL_TARGETS in libomptarget/CMakeLists.txt
- a check_plugin_target call in libomptarget/src/CMakeLists.txt

Finally, I've had to set a number of test cases to UNSUPPORTED
on s390x-ibm-linux-gnu; all these tests were already marked as
UNSUPPORTED for x86_64-pc-linux-gnu and aarch64-unknown-linux-gnu
and are failing on s390x for what seem to be the same reason.

In addition, this also requires support for BE ELF files in
plugins-nextgen: llvm#83976
Contributor

@jhuber6 jhuber6 left a comment

Overall seems fine, just some nits. Hard coding this to LE ELF was the easy solution because we didn't have any targets that used otherwise.

Comment on lines 76 to 80
// Little-endian 64-bit
if (const ELF64LEObjectFile *ELFObj =
        dyn_cast<ELF64LEObjectFile>(&**ElfOrErr))
  return checkMachineImpl(*ELFObj, EMachine);
// Big-endian 64-bit
Contributor

Nit, comments probably unnecessary, but if you keep them they should end with punctuation.

Member Author

I've just removed those now.

  // Setup the global symbol's address and size.
- ImageGlobal.setPtr(const_cast<void *>(*AddrOrErr));
- ImageGlobal.setSize((*SymOrErr)->st_size);
+ ImageGlobal.setPtr((void *)(SymOrErr->data()));
Contributor

C++ casts please

Member Author

OK. Unfortunately this takes both a static_cast and a const_cast, but I guess this can't be helped here.
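
For readers wondering why two casts are needed, here is a minimal sketch using `std::string_view` as a stand-in for `StringRef`. The helper name `mutableDataPtr` is hypothetical. A single `reinterpret_cast` would not compile here because it cannot cast away constness, so the conversion has to go through `const void *` first.

```cpp
#include <string_view>

// Hypothetical helper: turn a read-only view of symbol data into the
// untyped mutable pointer that a setter taking void * would expect.
void *mutableDataPtr(std::string_view Data) {
  // static_cast converts const char * to const void *;
  // const_cast then removes the const qualifier.
  return const_cast<void *>(static_cast<const void *>(Data.data()));
}
```

Writing through the returned pointer is only valid if the underlying buffer was not originally declared const.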

// If symbol not found, return an empty StringRef.
if (!*SymOrErr)
  return StringRef();
Contributor

Didn't we used to have a separate boolean check for this? I suppose it works if we want to encode that error logic at the call site.

Member Author

Yes, the point is that the check needs to be done at the call site. One caller wants to check whether the symbol exists or not (so non-existence should not be an error here), for the other caller non-existence is an error, so that error is (still) generated at the call site.

/// an empty StringRef; otherwise, returns a StringRef covering the symbol's
/// data in the Obj buffer, based on its address and size
llvm::Expected<llvm::StringRef>
findSymbolInImage(const llvm::MemoryBufferRef Obj, llvm::StringRef Name);
Contributor

All the other functions go off of an llvm::StringRef for the ELF object, can we do the same here?

Member Author

Well, the caller has a MemoryBufferRef available, and the callee needs a MemoryBufferRef (to pass to ObjectFile::createELFObjectFile), so it seemed preferable to just pass it through rather than stripping out the StringRef in the caller and re-creating another MemoryBufferRef in the callee ...

Contributor

Caller can use Buffer.getBuffer() to get the StringRef, and we already construct the memory buffer elsewhere. It's just easier to be consistent.

Member Author

Fair enough, I'll make that change.

  if (!SymOrErr) {
    consumeError(SymOrErr.takeError());
    return false;
  }

- return *SymOrErr;
+ return !SymOrErr->empty();
Contributor

How does this interact with symbols that have no size? I.e. SHT_NOBITS.

Contributor

I'll probably need to double check how that's handled in the ELF as well, I forget exactly how it's presented in the symbol form since it doesn't have a representation in ELF memory.

Member Author

It's true that for symbols with no size, we'd also report an empty StringRef, so we cannot distinguish these two cases (easily). I thought this should be OK as the user here actually wants to copy data to/from the memory object identified by the symbol, so it cannot really do anything with a zero-sized symbol either.

If we do need to be able to make that distinction, we'd have to tweak the interface a bit. Either add an explicit boolean, or else expose a bit more details of the implementation (e.g. we could check for SymOrErr->data() != nullptr).

Contributor

Just make it std::optional if it's not meant to fail.
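
To make the trade-off in this thread concrete, the sketch below models the suggested interface in plain C++17, with `std::string_view` standing in for `StringRef` and a toy symbol table standing in for the ELF symbol lookup. `SymbolEntry`, `SymbolTable`, and `getGlobalData` are hypothetical names for this example. It shows that `std::nullopt` separates "symbol not found" from "symbol found with size zero", and that each caller decides at the call site whether a missing symbol is an error.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>

// Toy symbol table (hypothetical): name -> offset and size within the image.
struct SymbolEntry {
  std::size_t Offset;
  std::size_t Size;
};
using SymbolTable = std::map<std::string, SymbolEntry, std::less<>>;

// Returns std::nullopt if the symbol does not exist; otherwise a view of the
// symbol's data, which may legitimately be empty for a zero-sized symbol.
std::optional<std::string_view> findSymbolInImage(std::string_view Image,
                                                  const SymbolTable &Symbols,
                                                  std::string_view Name) {
  auto It = Symbols.find(Name);
  if (It == Symbols.end())
    return std::nullopt;
  return Image.substr(It->second.Offset, It->second.Size);
}

// Caller 1: only asks whether the symbol exists; absence is not an error.
bool isSymbolInImage(std::string_view Image, const SymbolTable &Symbols,
                     std::string_view Name) {
  return findSymbolInImage(Image, Symbols, Name).has_value();
}

// Caller 2: needs the data, so absence becomes an error at the call site.
std::string_view getGlobalData(std::string_view Image,
                               const SymbolTable &Symbols,
                               std::string_view Name) {
  auto Sym = findSymbolInImage(Image, Symbols, Name);
  if (!Sym)
    throw std::runtime_error("symbol not found: " + std::string(Name));
  return *Sym;
}
```

With an empty-view return value instead of `std::optional`, the existence check in caller 1 would report a zero-sized symbol as missing, which is the ambiguity raised above.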

Contributor

@jhuber6 jhuber6 left a comment

LG, thanks for cleaning this up. Just a small style nit.

static Expected<std::optional<StringRef>>
findSymbolInImageImpl(const object::ELFObjectFile<ELFT> &ELFObj,
                      StringRef Name) {
  // Search for the symbol by name.
Contributor

Nit, a lot of these comments are just restating what is generally observable from the code. I.e. getSymbol(ELFObj, Name) implies we're looking up a symbol by name.

Member Author

Left in a single comment describing the return value, removed all the others.

Contributor

Thanks, looks good.

Code in plugins-nextgen reading ELF files is currently hard-coded
to assume a 64-bit little-endian ELF format.  Unfortunately, this
assumption is even embedded in the interface between GlobalHandler
and Utils/ELF routines, which use ELF64LE types.

To fix this, I've refactored the interface to push all ELF specific
types into Utils/ELF.  Specifically, this patch removes both the
getSymbol and getSymbolAddress routines and replaces them with a
single findSymbolInImage, which gets a StringRef identifying the
raw object file image as input, and returns a StringRef covering
the data addressed by the symbol (address and size) if found, or
std::nullopt otherwise.

This allows properly templating over multiple ELF format variants
inside Utils/ELF; specifically, this patch adds support for 64-bit
big-endian ELF files in addition to 64-bit little-endian files.
@uweigand
Member Author

uweigand commented Mar 6, 2024

Thanks for the review!

@uweigand uweigand merged commit 15b7b31 into llvm:main Mar 6, 2024
4 checks passed
@uweigand uweigand deleted the openmp-plugin-elfbe branch March 6, 2024 19:49
uweigand added a commit that referenced this pull request Mar 6, 2024
uweigand added a commit that referenced this pull request Mar 6, 2024
@uweigand uweigand restored the openmp-plugin-elfbe branch March 6, 2024 20:38
@uweigand
Member Author

uweigand commented Mar 6, 2024

Unfortunately, this seems to have caused regressions in the cuda and amdgpu builders. I was able to restore the builds by this commit: b64482e, but the amdgpu builders still failed due to some GPU memory address faults:
https://lab.llvm.org/buildbot/#/builders/193/builds/47890

Not sure what this is all about, I've reverted all patches again for now. If you have any suggestion what might have caused that problem, I'd appreciate it! I'll see if I'm able to reproduce the problem locally somehow.

@jhuber6
Contributor

jhuber6 commented Mar 6, 2024

Most recent build seems green https://lab.llvm.org/buildbot/#/builders/193/builds/47893. Those bots sometimes just die for no reason, there's a lot of flaky tests unfortunately.

@uweigand
Member Author

uweigand commented Mar 6, 2024

> Most recent build seems green https://lab.llvm.org/buildbot/#/builders/193/builds/47893.

Well, that's exactly the revision of my revert ... It does seem to be related, builds started failing exactly with the revision that checked in this PR, and started passing again with the revert.

@jhuber6
Contributor

jhuber6 commented Mar 6, 2024

> > Most recent build seems green https://lab.llvm.org/buildbot/#/builders/193/builds/47893.
>
> Well, that's exactly the revision of my revert ... It does seem to be related, builds started failing exactly with the revision that checked in this PR, and started passing again with the revert.

Ah, I thought you only landed the one fix, apologies. The most recent messages seem to be a compiler failure and not a test failure. But I wouldn't be surprised if there was some hidden behavior here.

@uweigand
Member Author

uweigand commented Mar 6, 2024

Ok, here's the full sequence of commits:

  1. 15b7b31 [libomptarget] Support BE ELF files in plugins-nextgen (#83976)
    This PR. This actually caused 4 builders to fail with a compiler error, but before I noticed this, I also committed the next PR.

  2. 3ecd38c [libomptarget] Build plugins-nextgen for SystemZ (#83978)
    PR 83978, which I had waited to commit as it depends on this PR. After I had committed this, I started getting the builder failures. I noticed the compiler error, thought this was easy to fix, and checked in the following quick fix.

  3. b64482e [libomptarget] Fix CUDA plugin build regression
    Quick fix intended to fix the compile error, which it actually did. This caused two of the four failing builders to pass again. The two remaining ones also started compiling successfully again, but still failed during testing, now with the GPU memory access fault. At this point, I decided to revert all three patches to get the builders green.

  4. d4f4f80 Revert "[libomptarget] Fix CUDA plugin build regression"
    One builder picked up this intermediate state and again failed with the compiler error.

  5. 70677c8 Revert "[libomptarget] Build plugins-nextgen for SystemZ (#83978)"
    This intermediate state was also picked up, still failing with the compiler error.

  6. fb7cc73 Revert "[libomptarget] Support BE ELF files in plugins-nextgen (#83976)"
    Now all builders are green again.

@jhuber6
Contributor

jhuber6 commented Mar 6, 2024

Do you have a GPU to run tests locally on? I would guess that the CPU targets don't require a lot of the implicit argument handling or kernel argument handling, so there's probably some overlooked behavior.

@uweigand
Member Author

> Do you have a GPU to run tests locally on? I would guess that the CPU targets don't require a lot of the implicit argument handling or kernel argument handling, so there's probably some overlooked behavior.

Unfortunately, I don't have an AMD GPU locally. I've done another thorough review, and noticed a number of unintended changes in this PR:

  • I had overlooked that cuda/src/rtl.cpp directly accesses Handler.getELFObjectFile
  • When using ObjectFile::createELFObjectFile in the new findSymbolInImage, I was passing /*InitContent=*/false (copied from checkMachine). While this is fine for checkMachine, when doing anything more complicated with the ELFObjectFile, this may cause problems
  • There are some corner cases where using the new findSymbolInImage for simply checking symbol existence may return a different result than the original getSymbol - I cannot prove this breaks anything, but I cannot really exclude it either.

As a conservative option, I've now implemented a new approach here: #85246. This keeps the overall structure the same, but just replaces the ELF-specific types with more generic ELF types in the common-code interfaces. This works the same on IBM Z, and I hope it will avoid introducing any breakage elsewhere (which I guess we'll see via build bot results if and when it can get committed).

@uweigand uweigand deleted the openmp-plugin-elfbe branch March 15, 2024 18:03