Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[lld] Support thumb PLTs #86223

Merged
merged 1 commit into from
May 28, 2024
Merged

[lld] Support thumb PLTs #86223

merged 1 commit into from
May 28, 2024

Conversation

PiJoules
Copy link
Contributor

@PiJoules PiJoules commented Mar 22, 2024

We are using PLTs for cortex-m33 which only supports thumb. More specifically, this is for a very restricted use case. There's no MMU so there's no sharing of virtual addresses between two processes, but this is fine. The MCU is used for running chre nanoapps for android. Each nanoapp is a shared library (but effectively acts as an executable containing a test suite) that is loaded and run on the MCU one binary at a time and there's only one process running at a time, so we ensure that the same text segment cannot be shared by two different running executables. GNU LD supports thumb PLTs but we want to migrate to a clang toolchain and use LLD, so thumb PLTs are needed.

Copy link

github-actions bot commented Mar 22, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 73eb9b33147ba5157cbf5d8276ee718629dfbbda a12e0eace68bcee6def5af4868053cdb739554c3 -- lld/ELF/Arch/ARM.cpp lld/ELF/Config.h lld/ELF/InputFiles.cpp
View the diff from clang-format here.
diff --git a/lld/ELF/Arch/ARM.cpp b/lld/ELF/Arch/ARM.cpp
index 3e0efe540e..46c1b29148 100644
--- a/lld/ELF/Arch/ARM.cpp
+++ b/lld/ELF/Arch/ARM.cpp
@@ -244,7 +244,8 @@ void ARM::writePltHeader(uint8_t *buf) const {
     // `pc` in the add instruction and 8 bytes for the `lr` adjustment.
     //
     uint64_t offset = in.gotPlt->getVA() - in.plt->getVA() - 16;
-    assert(llvm::isUInt<32>(offset) && "This should always fit into a 32-bit offset");
+    assert(llvm::isUInt<32>(offset) &&
+           "This should always fit into a 32-bit offset");
     write16(buf + 0, 0xb500);
     // Split into two halves to support endianness correctly.
     write16(buf + 2, 0xf8df);
@@ -255,7 +256,7 @@ void ARM::writePltHeader(uint8_t *buf) const {
     write16(buf + 10, 0xff08);
     write32(buf + 12, offset);
 
-    memcpy(buf + 16, trapInstr.data(), 4);  // Pad to 32-byte boundary
+    memcpy(buf + 16, trapInstr.data(), 4); // Pad to 32-byte boundary
     memcpy(buf + 20, trapInstr.data(), 4);
     memcpy(buf + 24, trapInstr.data(), 4);
     memcpy(buf + 28, trapInstr.data(), 4);
@@ -339,7 +340,8 @@ void ARM::writePlt(uint8_t *buf, const Symbol &sym,
     memcpy(buf + 12, trapInstr.data(), 4); // Pad to 16-byte boundary
   } else {
     uint64_t offset = sym.getGotPltVA() - pltEntryAddr - 12;
-    assert(llvm::isUInt<32>(offset) && "This should always fit into a 32-bit offset");
+    assert(llvm::isUInt<32>(offset) &&
+           "This should always fit into a 32-bit offset");
 
     // A PLT entry will be:
     //
@@ -359,10 +361,10 @@ void ARM::writePlt(uint8_t *buf, const Symbol &sym,
     write16(buf + 6, 0x0c00); // use `ip`
     relocateNoSym(buf + 4, R_ARM_THM_MOVT_ABS, offset);
 
-    write16(buf + 8, 0x44fc);       // add ip, pc
-    write16(buf + 10, 0xf8dc);      // ldr.w   pc, [ip] (bottom half)
-    write16(buf + 12, 0xf000);      // ldr.w   pc, [ip] (upper half)
-    write16(buf + 14, 0xe7fc);      // Branch to previous instruction
+    write16(buf + 8, 0x44fc);  // add ip, pc
+    write16(buf + 10, 0xf8dc); // ldr.w   pc, [ip] (bottom half)
+    write16(buf + 12, 0xf000); // ldr.w   pc, [ip] (upper half)
+    write16(buf + 14, 0xe7fc); // Branch to previous instruction
   }
 }
 
@@ -407,7 +409,8 @@ bool ARM::needsThunk(RelExpr expr, RelType type, const InputFile *file,
   case R_ARM_THM_JUMP24:
     // Source is Thumb, when all PLT entries are ARM interworking is required.
     // Otherwise we need to interwork if STT_FUNC Symbol has bit 0 clear (ARM).
-    if ((expr == R_PLT_PC && !config->armThumbPLTs) || (s.isFunc() && (s.getVA() & 1) == 0))
+    if ((expr == R_PLT_PC && !config->armThumbPLTs) ||
+        (s.isFunc() && (s.getVA() & 1) == 0))
       return true;
     [[fallthrough]];
   case R_ARM_THM_CALL: {

Copy link
Collaborator

@smithp35 smithp35 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is marked WIP so please ignore my comments if you don't want them.

For Cortex-m PLTs presumably for shared-libraries? Aren't a great fit due to the lack of an MMU. Although much more work to implement the fdpic ABI https://github.com/mickael-guene/fdpic_doc/blob/master/abi.txt may be a better fit.

lld/ELF/Arch/ARM.cpp Outdated Show resolved Hide resolved
lld/ELF/Arch/ARM.cpp Show resolved Hide resolved
lld/ELF/Arch/ARM.cpp Outdated Show resolved Hide resolved
write16(buf + 14, 0xe7fc); // Branch to the previous instruction.
//memcpy(buf + 14, trapInstr.data(), 2); // Pad to 16-byte boundary

// The PLT size for ARM is 16 bytes and the above sequence is 14 bytes so we could potentially fit one more instruction.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that aligning to 16-bytes may give better cache performance for A-profile at least. I'm not sure how the cached M-profile would respond though.

lld/ELF/Arch/ARM.cpp Outdated Show resolved Hide resolved
lld/ELF/Arch/ARM.cpp Outdated Show resolved Hide resolved
lld/ELF/Arch/ARM.cpp Show resolved Hide resolved
lld/ELF/InputFiles.cpp Outdated Show resolved Hide resolved
@PiJoules
Copy link
Contributor Author

For Cortex-m PLTs presumably for shared-libraries?

Correct.

Copy link

✅ With the latest revision this PR passed the Python code formatter.

@PiJoules PiJoules changed the title [WIP][lld] Support thumb PLTs for cortex-M [lld] Support thumb PLTs Mar 23, 2024
@PiJoules PiJoules requested a review from smithp35 March 23, 2024 00:07
@PiJoules PiJoules marked this pull request as ready for review March 23, 2024 00:07
@PiJoules PiJoules requested a review from frobtech March 23, 2024 00:08
@llvmbot
Copy link
Collaborator

llvmbot commented Mar 23, 2024

@llvm/pr-subscribers-lld-elf

@llvm/pr-subscribers-lld

Author: None (PiJoules)

Changes

We are using PLTs for cortex-m33 which only supports thumb.


Full diff: https://github.com/llvm/llvm-project/pull/86223.diff

4 Files Affected:

  • (modified) lld/ELF/Arch/ARM.cpp (+148-52)
  • (modified) lld/ELF/Config.h (+1)
  • (modified) lld/ELF/InputFiles.cpp (+12)
  • (added) lld/test/ELF/armv8-thumb-plt-reloc.s (+121)
diff --git a/lld/ELF/Arch/ARM.cpp b/lld/ELF/Arch/ARM.cpp
index 687f9499009d5e..9e6cbc87645aba 100644
--- a/lld/ELF/Arch/ARM.cpp
+++ b/lld/ELF/Arch/ARM.cpp
@@ -231,36 +231,67 @@ static void writePltHeaderLong(uint8_t *buf) {
 // The default PLT header requires the .got.plt to be within 128 Mb of the
 // .plt in the positive direction.
 void ARM::writePltHeader(uint8_t *buf) const {
-  // Use a similar sequence to that in writePlt(), the difference is the calling
-  // conventions mean we use lr instead of ip. The PLT entry is responsible for
-  // saving lr on the stack, the dynamic loader is responsible for reloading
-  // it.
-  const uint32_t pltData[] = {
-      0xe52de004, // L1: str lr, [sp,#-4]!
-      0xe28fe600, //     add lr, pc,  #0x0NN00000 &(.got.plt - L1 - 4)
-      0xe28eea00, //     add lr, lr,  #0x000NN000 &(.got.plt - L1 - 4)
-      0xe5bef000, //     ldr pc, [lr, #0x00000NNN] &(.got.plt -L1 - 4)
-  };
-
-  uint64_t offset = in.gotPlt->getVA() - in.plt->getVA() - 4;
-  if (!llvm::isUInt<27>(offset)) {
-    // We cannot encode the Offset, use the long form.
-    writePltHeaderLong(buf);
-    return;
+  if (!config->armThumbPLTs) {
+    // Use a similar sequence to that in writePlt(), the difference is the calling
+    // conventions mean we use lr instead of ip. The PLT entry is responsible for
+    // saving lr on the stack, the dynamic loader is responsible for reloading
+    // it.
+    const uint32_t pltData[] = {
+        0xe52de004, // L1: str lr, [sp,#-4]!
+        0xe28fe600, //     add lr, pc,  #0x0NN00000 &(.got.plt - L1 - 4)
+        0xe28eea00, //     add lr, lr,  #0x000NN000 &(.got.plt - L1 - 4)
+        0xe5bef000, //     ldr pc, [lr, #0x00000NNN] &(.got.plt -L1 - 4)
+    };
+
+    uint64_t offset = in.gotPlt->getVA() - in.plt->getVA() - 4;
+    if (!llvm::isUInt<27>(offset)) {
+      // We cannot encode the Offset, use the long form.
+      writePltHeaderLong(buf);
+      return;
+    }
+    write32(buf + 0, pltData[0]);
+    write32(buf + 4, pltData[1] | ((offset >> 20) & 0xff));
+    write32(buf + 8, pltData[2] | ((offset >> 12) & 0xff));
+    write32(buf + 12, pltData[3] | (offset & 0xfff));
+    memcpy(buf + 16, trapInstr.data(), 4); // Pad to 32-byte boundary
+    memcpy(buf + 20, trapInstr.data(), 4);
+    memcpy(buf + 24, trapInstr.data(), 4);
+    memcpy(buf + 28, trapInstr.data(), 4);
+  } else {
+    // The instruction sequence for thumb:
+    //
+    // 0: b500          push    {lr}
+    // 2: f8df e008     ldr.w   lr, [pc, #0x8]          @ 0xe <func+0xe>
+    // 6: 44fe          add     lr, pc
+    // 8: f85e ff08     ldr     pc, [lr, #8]!
+    // e:               .word   .got.plt - .plt - 16
+    //
+    // At 0x8, we want to jump to .got.plt, the -16 accounts for 8 bytes from
+    // `pc` in the add instruction and 8 bytes for the `lr` adjustment.
+    //
+    uint64_t offset = in.gotPlt->getVA() - in.plt->getVA() - 16;
+    assert(llvm::isUInt<32>(offset) && "This should always fit into a 32-bit offset");
+    write16(buf + 0, 0xb500);
+    write32(buf + 2, 0xe008f8df);
+    write16(buf + 6, 0x44fe);
+    write32(buf + 8, 0xff08f85e);
+    write32(buf + 12, offset);
+
+    memcpy(buf + 16, trapInstr.data(), 4);  // Pad to 32-byte boundary
+    memcpy(buf + 20, trapInstr.data(), 4);
+    memcpy(buf + 24, trapInstr.data(), 4);
+    memcpy(buf + 28, trapInstr.data(), 4);
   }
-  write32(buf + 0, pltData[0]);
-  write32(buf + 4, pltData[1] | ((offset >> 20) & 0xff));
-  write32(buf + 8, pltData[2] | ((offset >> 12) & 0xff));
-  write32(buf + 12, pltData[3] | (offset & 0xfff));
-  memcpy(buf + 16, trapInstr.data(), 4); // Pad to 32-byte boundary
-  memcpy(buf + 20, trapInstr.data(), 4);
-  memcpy(buf + 24, trapInstr.data(), 4);
-  memcpy(buf + 28, trapInstr.data(), 4);
 }
 
 void ARM::addPltHeaderSymbols(InputSection &isec) const {
-  addSyntheticLocal("$a", STT_NOTYPE, 0, 0, isec);
-  addSyntheticLocal("$d", STT_NOTYPE, 16, 0, isec);
+  if (!config->armThumbPLTs) {
+    addSyntheticLocal("$a", STT_NOTYPE, 0, 0, isec);
+    addSyntheticLocal("$d", STT_NOTYPE, 16, 0, isec);
+  } else {
+    addSyntheticLocal("$t", STT_NOTYPE, 0, 0, isec);
+    addSyntheticLocal("$d", STT_NOTYPE, 12, 0, isec);
+  }
 }
 
 // Long form PLT entries that do not have any restrictions on the displacement
@@ -279,32 +310,97 @@ static void writePltLong(uint8_t *buf, uint64_t gotPltEntryAddr,
 // .plt in the positive direction.
 void ARM::writePlt(uint8_t *buf, const Symbol &sym,
                    uint64_t pltEntryAddr) const {
-  // The PLT entry is similar to the example given in Appendix A of ELF for
-  // the Arm Architecture. Instead of using the Group Relocations to find the
-  // optimal rotation for the 8-bit immediate used in the add instructions we
-  // hard code the most compact rotations for simplicity. This saves a load
-  // instruction over the long plt sequences.
-  const uint32_t pltData[] = {
-      0xe28fc600, // L1: add ip, pc,  #0x0NN00000  Offset(&(.got.plt) - L1 - 8
-      0xe28cca00, //     add ip, ip,  #0x000NN000  Offset(&(.got.plt) - L1 - 8
-      0xe5bcf000, //     ldr pc, [ip, #0x00000NNN] Offset(&(.got.plt) - L1 - 8
-  };
 
-  uint64_t offset = sym.getGotPltVA() - pltEntryAddr - 8;
-  if (!llvm::isUInt<27>(offset)) {
-    // We cannot encode the Offset, use the long form.
-    writePltLong(buf, sym.getGotPltVA(), pltEntryAddr);
-    return;
+  if (!config->armThumbPLTs) {
+    uint64_t offset = sym.getGotPltVA() - pltEntryAddr - 8;
+
+    // The PLT entry is similar to the example given in Appendix A of ELF for
+    // the Arm Architecture. Instead of using the Group Relocations to find the
+    // optimal rotation for the 8-bit immediate used in the add instructions we
+    // hard code the most compact rotations for simplicity. This saves a load
+    // instruction over the long plt sequences.
+    const uint32_t pltData[] = {
+        0xe28fc600, // L1: add ip, pc,  #0x0NN00000  Offset(&(.got.plt) - L1 - 8
+        0xe28cca00, //     add ip, ip,  #0x000NN000  Offset(&(.got.plt) - L1 - 8
+        0xe5bcf000, //     ldr pc, [ip, #0x00000NNN] Offset(&(.got.plt) - L1 - 8
+    };
+    if (!llvm::isUInt<27>(offset)) {
+      // We cannot encode the Offset, use the long form.
+      writePltLong(buf, sym.getGotPltVA(), pltEntryAddr);
+      return;
+    }
+    write32(buf + 0, pltData[0] | ((offset >> 20) & 0xff));
+    write32(buf + 4, pltData[1] | ((offset >> 12) & 0xff));
+    write32(buf + 8, pltData[2] | (offset & 0xfff));
+    memcpy(buf + 12, trapInstr.data(), 4); // Pad to 16-byte boundary
+  } else {
+    uint64_t offset = sym.getGotPltVA() - pltEntryAddr - 12;
+    assert(llvm::isUInt<32>(offset) && "This should always fit into a 32-bit offset");
+
+    // A PLT entry will be:
+    //
+    //       movw ip, #<lower 16 bits>
+    //       movt ip, #<upper 16 bits>
+    //       add ip, pc
+    //   L1: ldr.w pc, [ip]
+    //       b L1
+    //
+    // where ip = r12 = 0xc
+    //
+    constexpr uint32_t pltData[] = {
+        0x0c00f240, //     movw ip, <offset lower 16>
+        0x0c00f2c0, //     movt ip, <offset higher 16>
+    };
+
+    // movw encoding:
+    //
+    //   15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
+    //   1  1  1  1  0  i  1 0 0 1 0 0 imm4
+    //
+    //   15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
+    //   0  imm3     Rd0       imm8
+    //
+    //   imm16 = imm4:i:imm3:imm8, i = bit 11
+    //
+    uint16_t offset_lower = offset & 0xffff;
+    uint32_t movwImm8 = offset_lower & 0xff;
+    uint32_t movwImm3 = (offset_lower >> 8) & 0x7;
+    uint32_t movwI = (offset_lower >> 11) & 0x1;
+    uint32_t movwImm4 = (offset_lower >> 12) & 0xf;
+    uint32_t movwBits = (movwI << 10) | (movwImm4 << 0) | (movwImm3 << 28) | (movwImm8 << 16);
+    write32(buf + 0, pltData[0] | movwBits);
+
+    // movt encoding:
+    //
+    //   15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
+    //   1  1  1  1  0  i  1 0 1 1 0 0 imm4
+    //
+    //   15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
+    //   0  imm3     Rd0       imm8
+    //
+    //   imm16 = imm4:i:imm3:imm8, i = bit 11
+    //
+    uint16_t offset_upper = static_cast<uint16_t>(offset >> 16);
+    uint32_t movtImm8 = offset_upper & 0xff;
+    uint32_t movtImm3 = (offset_upper >> 8) & 0x7;
+    uint32_t movtI = (offset_upper >> 11) & 0x1;
+    uint32_t movtImm4 = (offset_upper >> 12) & 0xf;
+    uint32_t movtBits = (movtI << 10) | (movtImm4 << 0) | (movtImm3 << 28) | (movtImm8 << 16);
+    write32(buf + 4, pltData[1] | movtBits);
+
+    write16(buf + 8, 0x44fc);       // add ip, pc
+    write32(buf + 10, 0xf000f8dc);  // ldr.w   pc, [ip]
+    write16(buf + 14, 0xe7fc);      // Branch to previous instruction
   }
-  write32(buf + 0, pltData[0] | ((offset >> 20) & 0xff));
-  write32(buf + 4, pltData[1] | ((offset >> 12) & 0xff));
-  write32(buf + 8, pltData[2] | (offset & 0xfff));
-  memcpy(buf + 12, trapInstr.data(), 4); // Pad to 16-byte boundary
 }
 
 void ARM::addPltSymbols(InputSection &isec, uint64_t off) const {
-  addSyntheticLocal("$a", STT_NOTYPE, off, 0, isec);
-  addSyntheticLocal("$d", STT_NOTYPE, off + 12, 0, isec);
+  if (!config->armThumbPLTs) {
+    addSyntheticLocal("$a", STT_NOTYPE, off, 0, isec);
+    addSyntheticLocal("$d", STT_NOTYPE, off + 12, 0, isec);
+  } else {
+    addSyntheticLocal("$t", STT_NOTYPE, off, 0, isec);
+  }
 }
 
 bool ARM::needsThunk(RelExpr expr, RelType type, const InputFile *file,
@@ -337,7 +433,7 @@ bool ARM::needsThunk(RelExpr expr, RelType type, const InputFile *file,
   case R_ARM_THM_JUMP24:
     // Source is Thumb, all PLT entries are ARM so interworking is required.
     // Otherwise we need to interwork if STT_FUNC Symbol has bit 0 clear (ARM).
-    if (expr == R_PLT_PC || (s.isFunc() && (s.getVA() & 1) == 0))
+    if ((expr == R_PLT_PC && !config->armThumbPLTs) || (s.isFunc() && (s.getVA() & 1) == 0))
       return true;
     [[fallthrough]];
   case R_ARM_THM_CALL: {
@@ -547,7 +643,6 @@ void ARM::relocate(uint8_t *loc, const Relocation &rel, uint64_t val) const {
     // STT_FUNC we choose whether to write a BL or BLX depending on the
     // value of bit 0 of Val. With bit 0 == 1 denoting Thumb. If the symbol is
     // not of type STT_FUNC then we must preserve the original instruction.
-    // PLT entries are always ARM state so we know we don't need to interwork.
     assert(rel.sym); // R_ARM_CALL is always reached via relocate().
     bool bit0Thumb = val & 1;
     bool isBlx = (read32(loc) & 0xfe000000) == 0xfa000000;
@@ -606,12 +701,13 @@ void ARM::relocate(uint8_t *loc, const Relocation &rel, uint64_t val) const {
     // PLT entries are always ARM state so we know we need to interwork.
     assert(rel.sym); // R_ARM_THM_CALL is always reached via relocate().
     bool bit0Thumb = val & 1;
+    bool useThumb = bit0Thumb || config->armThumbPLTs;
     bool isBlx = (read16(loc + 2) & 0x1000) == 0;
     // lld 10.0 and before always used bit0Thumb when deciding to write a BLX
-    // even when type not STT_FUNC. PLT entries generated by LLD are always ARM.
-    if (!rel.sym->isFunc() && !rel.sym->isInPlt() && isBlx == bit0Thumb)
+    // even when type not STT_FUNC.
+    if (!rel.sym->isFunc() && !rel.sym->isInPlt() && isBlx == useThumb)
       stateChangeWarning(loc, rel.type, *rel.sym);
-    if (rel.sym->isFunc() || rel.sym->isInPlt() ? !bit0Thumb : isBlx) {
+    if ((rel.sym->isFunc() || rel.sym->isInPlt()) ? !useThumb : isBlx) {
       // We are writing a BLX. Ensure BLX destination is 4-byte aligned. As
       // the BLX instruction may only be two byte aligned. This must be done
       // before overflow check.
diff --git a/lld/ELF/Config.h b/lld/ELF/Config.h
index bb3608da80b21f..89121c625e302a 100644
--- a/lld/ELF/Config.h
+++ b/lld/ELF/Config.h
@@ -212,6 +212,7 @@ struct Config {
   bool allowMultipleDefinition;
   bool fatLTOObjects;
   bool androidPackDynRelocs = false;
+  bool armThumbPLTs = false;
   bool armHasBlx = false;
   bool armHasMovtMovw = false;
   bool armJ1J2BranchEncoding = false;
diff --git a/lld/ELF/InputFiles.cpp b/lld/ELF/InputFiles.cpp
index 4c614c865d24a9..9d7690688bfd61 100644
--- a/lld/ELF/InputFiles.cpp
+++ b/lld/ELF/InputFiles.cpp
@@ -194,6 +194,18 @@ static void updateSupportedARMFeatures(const ARMAttributeParser &attributes) {
   if (arch >= ARMBuildAttrs::CPUArch::v8_M_Base &&
       profile == ARMBuildAttrs::MicroControllerProfile)
     config->armCMSESupport = true;
+
+  // The thumb PLT entries require Thumb2 which can be used on multiple archs.
+  // For now, let's limit it to ones where ARM isn't available and we know have
+  // Thumb2.
+  std::optional<unsigned> armISA =
+      attributes.getAttributeValue(ARMBuildAttrs::ARM_ISA_use);
+  std::optional<unsigned> thumb =
+      attributes.getAttributeValue(ARMBuildAttrs::THUMB_ISA_use);
+  bool noArmISA = !armISA || *armISA == ARMBuildAttrs::Not_Allowed;
+  bool hasThumb2 = thumb && *thumb >= ARMBuildAttrs::AllowThumb32;
+  if (noArmISA && hasThumb2)
+    config->armThumbPLTs = true;
 }
 
 InputFile::InputFile(Kind k, MemoryBufferRef m)
diff --git a/lld/test/ELF/armv8-thumb-plt-reloc.s b/lld/test/ELF/armv8-thumb-plt-reloc.s
new file mode 100644
index 00000000000000..608362b658bdd6
--- /dev/null
+++ b/lld/test/ELF/armv8-thumb-plt-reloc.s
@@ -0,0 +1,121 @@
+// REQUIRES: arm
+// RUN: llvm-mc -filetype=obj -arm-add-build-attributes --arch=thumb --mcpu=cortex-m33 %p/Inputs/arm-plt-reloc.s -o %t1
+// RUN: llvm-mc -filetype=obj -arm-add-build-attributes --arch=thumb --mcpu=cortex-m33 %s -o %t2
+// RUN: ld.lld %t1 %t2 -o %t
+// RUN: llvm-objdump --no-print-imm-hex -d %t | FileCheck %s
+// RUN: ld.lld -shared %t1 %t2 -o %t.so
+// RUN: llvm-objdump --no-print-imm-hex -d %t.so | FileCheck --check-prefix=DSO %s
+// RUN: llvm-readobj -S -r %t.so | FileCheck -check-prefix=DSOREL %s
+
+// RUN: llvm-mc -filetype=obj -arm-add-build-attributes --arch=thumb --mcpu=cortex-m33 %p/Inputs/arm-plt-reloc.s -o %t1
+// RUN: llvm-mc -filetype=obj -arm-add-build-attributes --arch=thumb --mcpu=cortex-m33 %s -o %t2
+// RUN: ld.lld %t1 %t2 -o %t
+// RUN: llvm-objdump --no-print-imm-hex -d %t | FileCheck %s
+// RUN: ld.lld -shared %t1 %t2 -o %t.so
+// RUN: llvm-objdump --no-print-imm-hex -d %t.so | FileCheck --check-prefix=DSO %s
+// RUN: llvm-readobj -S -r %t.so | FileCheck -check-prefix=DSOREL %s
+
+// RUN: ld.lld --be8 %t1 %t2 -o %t
+// RUN: llvm-objdump --no-print-imm-hex -d %t | FileCheck %s
+// RUN: ld.lld --be8 -shared %t1 %t2 -o %t.so
+// RUN: llvm-objdump --no-print-imm-hex -d %t.so | FileCheck --check-prefix=DSO %s
+// RUN: llvm-readobj -S -r %t.so | FileCheck -check-prefix=DSOREL %s
+
+// Test PLT entry generation
+ .text
+ .align 2
+ .globl _start
+ .type  _start,%function
+_start:
+// FIXME, interworking is only supported for BL via BLX at the moment, when
+// interworking thunks are available for b.w and b<cond>.w this can be altered
+// to test the different forms of interworking.
+ bl func1
+ bl func2
+ bl func3
+
+// Executable, expect no PLT
+// CHECK: Disassembly of section .text:
+// CHECK-EMPTY:
+// CHECK-NEXT: <func1>:
+// CHECK-NEXT:   200b4: 4770    bx      lr
+// CHECK: <func2>:
+// CHECK-NEXT:   200b6: 4770    bx      lr
+// CHECK: <func3>:
+// CHECK-NEXT:   200b8: 4770    bx      lr
+// CHECK-NEXT:   200ba: d4d4 
+// CHECK: <_start>:
+// CHECK-NEXT:   200bc: f7ff fffa       bl      0x200b4 <func1>
+// CHECK-NEXT:   200c0: f7ff fff9       bl      0x200b6 <func2>
+// CHECK-NEXT:   200c4: f7ff fff8       bl      0x200b8 <func3>
+
+// Expect PLT entries as symbols can be preempted
+// .text is Thumb and .plt is ARM, llvm-objdump can currently only disassemble
+// as ARM or Thumb. Work around by disassembling twice.
+// DSO: Disassembly of section .text:
+// DSO-EMPTY:
+// DSO-NEXT: <func1>:
+// DSO-NEXT:     10214:     4770    bx      lr
+// DSO: <func2>:
+// DSO-NEXT:     10216:     4770    bx      lr
+// DSO: <func3>:
+// DSO-NEXT:     10218:     4770    bx      lr
+// DSO-NEXT:     1021a:     d4d4 
+// DSO: <_start>:
+// 0x10250 = PLT func1
+// DSO-NEXT:     1021c:     f000 f818       bl     0x10250
+// 0x10260 = PLT func2
+// DSO-NEXT:     10220:     f000 f81e       bl     0x10260
+// 0x10270 = PLT func3
+// DSO-NEXT:     10224:     f000 f824       bl     0x10270
+// DSO: Disassembly of section .plt:
+// DSO-EMPTY:
+// DSO-NEXT: <.plt>:
+// DSO-NEXT:     10230: b500          push    {lr}
+// DSO-NEXT:     10232: f8df e008     ldr.w   lr, [pc, #8]
+// DSO-NEXT:     10236: 44fe          add     lr, pc
+// DSO-NEXT:     10238: f85e ff08     ldr     pc, [lr, #8]!
+// 0x20098 = .got.plt (0x302D8) - pc (0x10238 = .plt + 8) - 8
+// DSO-NEXT:     1023c: 98 00 02 00   .word   0x00020098
+// DSO-NEXT:     10240: d4 d4 d4 d4   .word   0xd4d4d4d4
+// DSO-NEXT:     10244: d4 d4 d4 d4   .word   0xd4d4d4d4
+// DSO-NEXT:     10248: d4 d4 d4 d4   .word   0xd4d4d4d4
+// DSO-NEXT:     1024c: d4 d4 d4 d4   .word   0xd4d4d4d4
+
+// 136 + 2 << 16 + 0x1025c = 0x302e4 = got entry 1
+// DSO-NEXT:     10250:       f240 0c88     movw    r12, #136
+// DSO-NEXT:     10254:       f2c0 0c02     movt    r12, #2
+// DSO-NEXT:     10258:       44fc          add     r12, pc
+// DSO-NEXT:     1025a:       f8dc f000     ldr.w   pc, [r12]
+// DSO-NEXT:     1025e:       e7fc          b       0x1025a
+// 124 + 2 << 16 + 0x1026c = 0x302e8 = got entry 2
+// DSO-NEXT:     10260:       f240 0c7c     movw    r12, #124
+// DSO-NEXT:     10264:       f2c0 0c02     movt    r12, #2
+// DSO-NEXT:     10268:       44fc          add     r12, pc
+// DSO-NEXT:     1026a:       f8dc f000     ldr.w   pc, [r12]
+// DSO-NEXT:     1026e:       e7fc          b       0x1026a
+// 112 + 2 << 16 + 0x1027c = 0x302ec = got entry 3
+// DSO-NEXT:     10270:       f240 0c70     movw    r12, #112
+// DSO-NEXT:     10274:       f2c0 0c02     movt    r12, #2
+// DSO-NEXT:     10278:       44fc          add     r12, pc
+// DSO-NEXT:     1027a:       f8dc f000     ldr.w   pc, [r12]
+// DSO-NEXT:     1027e:       e7fc          b       0x1027a
+
+// DSOREL:    Name: .got.plt
+// DSOREL-NEXT:    Type: SHT_PROGBITS
+// DSOREL-NEXT:    Flags [
+// DSOREL-NEXT:      SHF_ALLOC
+// DSOREL-NEXT:      SHF_WRITE
+// DSOREL-NEXT:    ]
+// DSOREL-NEXT:    Address: 0x302D8
+// DSOREL-NEXT:    Offset:
+// DSOREL-NEXT:    Size: 24
+// DSOREL-NEXT:    Link:
+// DSOREL-NEXT:    Info:
+// DSOREL-NEXT:    AddressAlignment: 4
+// DSOREL-NEXT:    EntrySize:
+// DSOREL:  Relocations [
+// DSOREL-NEXT:  Section (5) .rel.plt {
+// DSOREL-NEXT:    0x302E4 R_ARM_JUMP_SLOT func1
+// DSOREL-NEXT:    0x302E8 R_ARM_JUMP_SLOT func2
+// DSOREL-NEXT:    0x302EC R_ARM_JUMP_SLOT func3

@PiJoules
Copy link
Contributor Author

Ok, formally requesting reviews now

@MaskRay
Copy link
Member

MaskRay commented Mar 23, 2024

For Cortex-m PLTs presumably for shared-libraries? Aren't a great fit due to the lack of an MMU. Although much more work to implement the fdpic ABI mickael-guene/fdpic_doc@master/abi.txt may be a better fit.

I am curious about the use case as well. I have a fdpic WIP implementation for my FDPIC study that hard codes the FDPIC register r9, which does not yield the best performance. A better implementation is to make r9 like a hidden argument that is call-clobbered.

(BTW: gcc's arm fdpic does not handle r9 very well. gcc's sh4 port handles r12 in a better way.)

Copy link
Collaborator

@smithp35 smithp35 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update.

I think the current code won't work for big-endian systems. Other than that it looks good to me.

I'll be away on vacation till the 2nd April. I'm happy for MaskRay to approve if he is happy with the changes.

lld/ELF/Arch/ARM.cpp Outdated Show resolved Hide resolved
lld/ELF/Arch/ARM.cpp Outdated Show resolved Hide resolved
lld/ELF/Arch/ARM.cpp Outdated Show resolved Hide resolved
lld/ELF/Arch/ARM.cpp Outdated Show resolved Hide resolved
lld/ELF/Arch/ARM.cpp Show resolved Hide resolved
lld/test/ELF/armv8-thumb-plt-reloc.s Show resolved Hide resolved
Copy link
Collaborator

@smithp35 smithp35 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay. Have just come back from holiday. Some of the comments in the test don't look right, perhaps cut-and-paste from a different PLT test. Other than that, looks good to me.

lld/test/ELF/armv8-thumb-plt-reloc.s Outdated Show resolved Hide resolved
lld/test/ELF/armv8-thumb-plt-reloc.s Outdated Show resolved Hide resolved
@MaskRay
Copy link
Member

MaskRay commented May 27, 2024

This Thumb PLT support is for very restricted use cases.
The lack of a MMU implies no sharing for the text segment used by two processes:

  • Two processes running the same executable must have different data segments. The System V dynamic linking model requires a constant distance between the text segment and the data segment. Corollary: the text segment cannot be shared.
  • Similarly, a shared object's text segment cannot be shared by two processes running different executables.

Perhaps the use case has a stronger requirement: no two executables (nanoapps?) are running at the same time.

It seems useful to update the description to clarify the case. The code generally looks good.

@PiJoules
Copy link
Contributor Author

This Thumb PLT support is for very restricted use cases. The lack of a MMU implies no sharing for the text segment used by two processes:

  • Two processes running the same executable must have different data segments. The System V dynamic linking model requires a constant distance between the text segment and the data segment. Corollary: the text segment cannot be shared.
  • Similarly, a shared object's text segment cannot be shared by two processes running different executables.

Perhaps the use case has a stronger requirement: no two executables (nanoapps?) are running at the same time.

It seems useful to update the description to clarify the case. The code generally looks good.

Updated

Copy link
Member

@MaskRay MaskRay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Realized that I forgot to push the comments)

lld/ELF/Arch/ARM.cpp Outdated Show resolved Hide resolved
lld/ELF/Arch/ARM.cpp Outdated Show resolved Hide resolved
lld/ELF/Arch/ARM.cpp Outdated Show resolved Hide resolved
lld/test/ELF/armv8-thumb-plt-reloc.s Outdated Show resolved Hide resolved
lld/test/ELF/armv8-thumb-plt-reloc.s Outdated Show resolved Hide resolved
lld/test/ELF/armv8-thumb-plt-reloc.s Outdated Show resolved Hide resolved
We are using PLTs for cortex-m33 which only supports thumb.
@PiJoules PiJoules merged commit 760c2aa into llvm:main May 28, 2024
3 of 6 checks passed
@PiJoules PiJoules deleted the thumb-plts branch May 28, 2024 22:37
@joker-eph
Copy link
Collaborator

This broke windows CI, and it was already visible before you most recent force-push: https://buildkite.com/llvm-project/github-pull-requests/builds/67798#018fc158-f8b8-4a4f-aafa-5bd92a0a8d8a

joker-eph added a commit that referenced this pull request May 29, 2024
joker-eph added a commit that referenced this pull request May 29, 2024
Reverts #86223

windows pre-merge is broken.
vg0204 pushed a commit to vg0204/llvm-project that referenced this pull request May 29, 2024
We are using PLTs for cortex-m33 which only supports thumb. More
specifically, this is for a very restricted use case. There's no MMU so
there's no sharing of virtual addresses between two processes, but this
is fine. The MCU is used for running [chre
nanoapps](https://android.googlesource.com/platform/system/chre/+/HEAD/doc/nanoapp_overview.md)
for android. Each nanoapp is a shared library (but effectively acts as
an executable containing a test suite) that is loaded and run on the MCU
one binary at a time and there's only one process running at a time, so
we ensure that the same text segment cannot be shared by two different
running executables. GNU LD supports thumb PLTs but we want to migrate
to a clang toolchain and use LLD, so thumb PLTs are needed.
vg0204 pushed a commit to vg0204/llvm-project that referenced this pull request May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants