Skip to content

AArch64 Zero/SignExtend elision is inadequate #238

@adinn

Description

@adinn

The Problem:

AArch64 includes a couple of Node match rules to merge the Zero/SignExtend operations associated with integral value reads into the read. For example, without these rules a read of a short value that is consumed as a word operand would be modelled as SignExtendNode fed by a ReadNode and this would generate

  ldrh   w1, [x2, #20]
  sxtw w1, w1

The match rule matches a "(SignExtend Read)" pattern and generates a sign-extend short load

  ldrhs w1, x2, #20]

Similarly, a ZeroExtend of a character read can safely omit the ZeroExtend (AArch64 reads guarantee to zero the upper bytes of any target register for the read) allowing

  ldrh   w1, [x2, #20]
  andw w1, w1, 0xffff

to be replaced with

  ldrh   w1, [x2, #20]

These rules avoid unnecessary execuiton and also help to cut down code size.

Unfortunately the rules are not effective in the presence of multiple reads. Consider the following code

    int testSnippet1(char[] chars) {
        int x = 1;
        x += chars[0];
        x -= chars[1];
        x *= chars[2];
        x /= chars[3];
        x &= chars[4];
        x |= chars[5];
        x ^= chars[6];
        x <<= chars[7];
        x >>= (chars[8] - chars[0]);
        x >>>= (chars[9] - chars[0]);
        x += chars[1];
        return x;
    }

The emitted code for this is as follows:

  ...
  0x000003ff7bdfe994: ldr	w0, [x2,#12]
  0x000003ff7bdfe998: cmp	w0, #0x5
  0x000003ff7bdfe99c: b.cc	0x000003ff7bdfea3c
  0x000003ff7bdfe9a0: ldrh	w0, [x2,#22]
  0x000003ff7bdfe9a4: cmp	w0, #0x0
  0x000003ff7bdfe9a8: b.eq	0x000003ff7bdfea58
  0x000003ff7bdfe9ac: ldrh	w1, [x2,#16]
  0x000003ff7bdfe9b0: ldrh	w3, [x2,#18]
  0x000003ff7bdfe9b4: and	w1, w1, #0xffff
  0x000003ff7bdfe9b8: add	w4, w1, #0x1
  0x000003ff7bdfe9bc: and	w3, w3, #0xffff
  0x000003ff7bdfe9c0: sub	w4, w4, w3
  0x000003ff7bdfe9c4: ldrh	w5, [x2,#20]
  0x000003ff7bdfe9c8: mul	w4, w4, w5
  0x000003ff7bdfe9cc: sdiv	w0, w4, w0
  0x000003ff7bdfe9d0: ldrh	w4, [x2,#24]
  0x000003ff7bdfe9d4: ldrh	w5, [x2,#26]
  0x000003ff7bdfe9d8: ldrh	w6, [x2,#28]
  0x000003ff7bdfe9dc: ldrh	w7, [x2,#30]
  0x000003ff7bdfe9e0: ldrh	w10, [x2,#32]
  0x000003ff7bdfe9e4: and	w4, w4, #0xffff
  0x000003ff7bdfe9e8: and	w0, w0, w4
  0x000003ff7bdfe9ec: and	w4, w5, #0xffff
  0x000003ff7bdfe9f0: orr	w0, w0, w4
  0x000003ff7bdfe9f4: and	w4, w6, #0xffff
  0x000003ff7bdfe9f8: eor	w0, w0, w4
  0x000003ff7bdfe9fc: and	w4, w7, #0xffff
  0x000003ff7bdfea00: lsl	w0, w0, w4
  0x000003ff7bdfea04: and	w4, w10, #0xffff
  0x000003ff7bdfea08: sub	w4, w4, w1
  0x000003ff7bdfea0c: asr	w0, w0, w4
  0x000003ff7bdfea10: ldrh	w2, [x2,#34]
  0x000003ff7bdfea14: sub	w1, w2, w1
  0x000003ff7bdfea18: lsr	w0, w0, w1
  0x000003ff7bdfea1c: add	w0, w0, w3
  0x000003ff7bdfea20: ldp	xfp, xlr, [sp,#32]
  0x000003ff7bdfea24: add	sp, sp, #0x30
  0x000003ff7bdfea28: movz	xscratch1, #0x0  ;   {poll_return}
  0x000003ff7bdfea2c: movk	xscratch1, #0x945a, lsl #16
  0x000003ff7bdfea30: movk	xscratch1, #0x3ff, lsl #32
  0x000003ff7bdfea34: ldr	wzr, [xscratch1]  ;   {poll_return}
  0x000003ff7bdfea38: ret
  ...

Notice that the ZeroExtend mask operations are only omitted for the loads of chars[3] (offset #22) and chars[2] (offset 20) and chars[9] (offset 34). The mask operations are retained for the other reads because of an inherent problem in the operation of matching. The graph for the above code prior to matching includes a series of fixed ReadNodes ordered before a series of floating ZeroExtendNodes. Prior optimization stages ensure that later Reads (in source order) intervene between earlier Reads and their associated ZeroExtend.

When a match is made the match processor considers whether the matched node set can be emitted at the point of match. So, for example, in the case of the read for bytes[1], the match would need to remove that Read and its related SignExtendNode from the generated lir instruction sequence and instead generate an lir node to produce the required ldrh at the position of the SignExtendNode. Unfortunately, that is disallowed because there are two other Reads (for chars[2] and chars[3]) between that first Read and its SignExtend -- changing the order could lead to an out of order array index bounds exception. What is needed is to replace the combined Read + SIgnExtend pair at the point where the Read lies. This is not possible because matching can only replace the root node.

The Required Fix:

A solution for this is to add an AArch64-specific phase to the end of the Low Tier phase which replaces Read + Sign/ZeroExtend pairs at the position of the Read with an AArch64-specific replacement for ReadNode that includes information determining whether to emit a signed or unsigned read and, in the latter case what size the generated signed value should be (16, 32 or 64 bits). This replacement strategy is only safe when the ReadNode only has one usage, either a ZeroExtendNode or SignExtendNode. When an AArch64ReadNode is created to replace the Read + Sign/ZeroExtend pair its usages must set to be the usages of the Sign/ZeroExtend, its access must be typed using the access stamp of the original Read and its own stamp must be that of the Sign/ZeroExtend.

With the above example the resulting code should look like this

  ...
  0x000003ff8fdfea14: ldr	w0, [x2,#12]
  0x000003ff8fdfea18: cmp	w0, #0x5
  0x000003ff8fdfea1c: b.cc	0x000003ff8fdfeaa0
  0x000003ff8fdfea20: ldrh	w0, [x2,#22]
  0x000003ff8fdfea24: cmp	w0, #0x0
  0x000003ff8fdfea28: b.eq	0x000003ff8fdfeabc
  0x000003ff8fdfea2c: ldrh	w1, [x2,#16]
  0x000003ff8fdfea30: ldrh	w3, [x2,#18]
  0x000003ff8fdfea34: ldrh	w4, [x2,#20]
  0x000003ff8fdfea38: add	w5, w1, #0x1
  0x000003ff8fdfea3c: sub	w5, w5, w3
  0x000003ff8fdfea40: mul	w4, w5, w4
  0x000003ff8fdfea44: sdiv	w0, w4, w0
  0x000003ff8fdfea48: ldrh	w4, [x2,#24]
  0x000003ff8fdfea4c: ldrh	w5, [x2,#26]
  0x000003ff8fdfea50: ldrh	w6, [x2,#28]
  0x000003ff8fdfea54: ldrh	w7, [x2,#30]
  0x000003ff8fdfea58: ldrh	w10, [x2,#32]
  0x000003ff8fdfea5c: ldrh	w2, [x2,#34]
  0x000003ff8fdfea60: and	w0, w0, w4
  0x000003ff8fdfea64: orr	w0, w0, w5
  0x000003ff8fdfea68: eor	w0, w0, w6
  0x000003ff8fdfea6c: lsl	w0, w0, w7
  0x000003ff8fdfea70: sub	w4, w10, w1
  0x000003ff8fdfea74: asr	w0, w0, w4
  0x000003ff8fdfea78: sub	w1, w2, w1
  0x000003ff8fdfea7c: lsr	w0, w0, w1
  0x000003ff8fdfea80: add	w0, w0, w3
  0x000003ff8fdfea84: ldp	xfp, xlr, [sp,#32]
  0x000003ff8fdfea88: add	sp, sp, #0x30
  0x000003ff8fdfea8c: movz	xscratch1, #0x0  ;   {poll_return}
  0x000003ff8fdfea90: movk	xscratch1, #0xa880, lsl #16
  0x000003ff8fdfea94: movk	xscratch1, #0x3ff, lsl #32
  0x000003ff8fdfea98: ldr	wzr, [xscratch1]  ;   {poll_return}
  0x000003ff8fdfea9c: ret
  ...

Note the reuse of w1 (chars[0]) and w3 (chars[1]) in the final 2 sub instructions and the final add instruction, respectively. The substitution is permitted here because each of the relevant reads only has one usage, a 32-bit ZeroExtend, which feeds: in the case of chars[0] the first int add and the final two subtracts; in the case of chars[1], the first int subtract and the final add. If the variable x were declared to be of type long then the first Read of chars[0] would feed both a 32-bit ZeroExtend (feeding the last two subtracts) and a 64-bit ZeroExtend (feeding the first add) which would inhibit the replacement. That si critical when SignExtends are in play because the result of a 32-bit and 64-bit SignExtend are different so dropping one or the other could produce incorrect results.

Metadata

Metadata

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions