-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
The Problem:
AArch64 includes a couple of Node match rules to merge the Zero/SignExtend operations associated with integral value reads into the read. For example, without these rules a read of a short value that is consumed as a word operand would be modelled as SignExtendNode fed by a ReadNode and this would generate
ldrh w1, [x2, #20]
sxtw w1, w1
The match rule matches a "(SignExtend Read)" pattern and generates a sign-extend short load
ldrhs w1, x2, #20]
Similarly, a ZeroExtend of a character read can safely omit the ZeroExtend (AArch64 reads guarantee to zero the upper bytes of any target register for the read) allowing
ldrh w1, [x2, #20]
andw w1, w1, 0xffff
to be replaced with
ldrh w1, [x2, #20]
These rules avoid unnecessary execuiton and also help to cut down code size.
Unfortunately the rules are not effective in the presence of multiple reads. Consider the following code
int testSnippet1(char[] chars) {
int x = 1;
x += chars[0];
x -= chars[1];
x *= chars[2];
x /= chars[3];
x &= chars[4];
x |= chars[5];
x ^= chars[6];
x <<= chars[7];
x >>= (chars[8] - chars[0]);
x >>>= (chars[9] - chars[0]);
x += chars[1];
return x;
}
The emitted code for this is as follows:
...
0x000003ff7bdfe994: ldr w0, [x2,#12]
0x000003ff7bdfe998: cmp w0, #0x5
0x000003ff7bdfe99c: b.cc 0x000003ff7bdfea3c
0x000003ff7bdfe9a0: ldrh w0, [x2,#22]
0x000003ff7bdfe9a4: cmp w0, #0x0
0x000003ff7bdfe9a8: b.eq 0x000003ff7bdfea58
0x000003ff7bdfe9ac: ldrh w1, [x2,#16]
0x000003ff7bdfe9b0: ldrh w3, [x2,#18]
0x000003ff7bdfe9b4: and w1, w1, #0xffff
0x000003ff7bdfe9b8: add w4, w1, #0x1
0x000003ff7bdfe9bc: and w3, w3, #0xffff
0x000003ff7bdfe9c0: sub w4, w4, w3
0x000003ff7bdfe9c4: ldrh w5, [x2,#20]
0x000003ff7bdfe9c8: mul w4, w4, w5
0x000003ff7bdfe9cc: sdiv w0, w4, w0
0x000003ff7bdfe9d0: ldrh w4, [x2,#24]
0x000003ff7bdfe9d4: ldrh w5, [x2,#26]
0x000003ff7bdfe9d8: ldrh w6, [x2,#28]
0x000003ff7bdfe9dc: ldrh w7, [x2,#30]
0x000003ff7bdfe9e0: ldrh w10, [x2,#32]
0x000003ff7bdfe9e4: and w4, w4, #0xffff
0x000003ff7bdfe9e8: and w0, w0, w4
0x000003ff7bdfe9ec: and w4, w5, #0xffff
0x000003ff7bdfe9f0: orr w0, w0, w4
0x000003ff7bdfe9f4: and w4, w6, #0xffff
0x000003ff7bdfe9f8: eor w0, w0, w4
0x000003ff7bdfe9fc: and w4, w7, #0xffff
0x000003ff7bdfea00: lsl w0, w0, w4
0x000003ff7bdfea04: and w4, w10, #0xffff
0x000003ff7bdfea08: sub w4, w4, w1
0x000003ff7bdfea0c: asr w0, w0, w4
0x000003ff7bdfea10: ldrh w2, [x2,#34]
0x000003ff7bdfea14: sub w1, w2, w1
0x000003ff7bdfea18: lsr w0, w0, w1
0x000003ff7bdfea1c: add w0, w0, w3
0x000003ff7bdfea20: ldp xfp, xlr, [sp,#32]
0x000003ff7bdfea24: add sp, sp, #0x30
0x000003ff7bdfea28: movz xscratch1, #0x0 ; {poll_return}
0x000003ff7bdfea2c: movk xscratch1, #0x945a, lsl #16
0x000003ff7bdfea30: movk xscratch1, #0x3ff, lsl #32
0x000003ff7bdfea34: ldr wzr, [xscratch1] ; {poll_return}
0x000003ff7bdfea38: ret
...
Notice that the ZeroExtend mask operations are only omitted for the loads of chars[3] (offset #22) and chars[2] (offset 20) and chars[9] (offset 34). The mask operations are retained for the other reads because of an inherent problem in the operation of matching. The graph for the above code prior to matching includes a series of fixed ReadNodes ordered before a series of floating ZeroExtendNodes. Prior optimization stages ensure that later Reads (in source order) intervene between earlier Reads and their associated ZeroExtend.
When a match is made the match processor considers whether the matched node set can be emitted at the point of match. So, for example, in the case of the read for bytes[1], the match would need to remove that Read and its related SignExtendNode from the generated lir instruction sequence and instead generate an lir node to produce the required ldrh at the position of the SignExtendNode. Unfortunately, that is disallowed because there are two other Reads (for chars[2] and chars[3]) between that first Read and its SignExtend -- changing the order could lead to an out of order array index bounds exception. What is needed is to replace the combined Read + SIgnExtend pair at the point where the Read lies. This is not possible because matching can only replace the root node.
The Required Fix:
A solution for this is to add an AArch64-specific phase to the end of the Low Tier phase which replaces Read + Sign/ZeroExtend pairs at the position of the Read with an AArch64-specific replacement for ReadNode that includes information determining whether to emit a signed or unsigned read and, in the latter case what size the generated signed value should be (16, 32 or 64 bits). This replacement strategy is only safe when the ReadNode only has one usage, either a ZeroExtendNode or SignExtendNode. When an AArch64ReadNode is created to replace the Read + Sign/ZeroExtend pair its usages must set to be the usages of the Sign/ZeroExtend, its access must be typed using the access stamp of the original Read and its own stamp must be that of the Sign/ZeroExtend.
With the above example the resulting code should look like this
...
0x000003ff8fdfea14: ldr w0, [x2,#12]
0x000003ff8fdfea18: cmp w0, #0x5
0x000003ff8fdfea1c: b.cc 0x000003ff8fdfeaa0
0x000003ff8fdfea20: ldrh w0, [x2,#22]
0x000003ff8fdfea24: cmp w0, #0x0
0x000003ff8fdfea28: b.eq 0x000003ff8fdfeabc
0x000003ff8fdfea2c: ldrh w1, [x2,#16]
0x000003ff8fdfea30: ldrh w3, [x2,#18]
0x000003ff8fdfea34: ldrh w4, [x2,#20]
0x000003ff8fdfea38: add w5, w1, #0x1
0x000003ff8fdfea3c: sub w5, w5, w3
0x000003ff8fdfea40: mul w4, w5, w4
0x000003ff8fdfea44: sdiv w0, w4, w0
0x000003ff8fdfea48: ldrh w4, [x2,#24]
0x000003ff8fdfea4c: ldrh w5, [x2,#26]
0x000003ff8fdfea50: ldrh w6, [x2,#28]
0x000003ff8fdfea54: ldrh w7, [x2,#30]
0x000003ff8fdfea58: ldrh w10, [x2,#32]
0x000003ff8fdfea5c: ldrh w2, [x2,#34]
0x000003ff8fdfea60: and w0, w0, w4
0x000003ff8fdfea64: orr w0, w0, w5
0x000003ff8fdfea68: eor w0, w0, w6
0x000003ff8fdfea6c: lsl w0, w0, w7
0x000003ff8fdfea70: sub w4, w10, w1
0x000003ff8fdfea74: asr w0, w0, w4
0x000003ff8fdfea78: sub w1, w2, w1
0x000003ff8fdfea7c: lsr w0, w0, w1
0x000003ff8fdfea80: add w0, w0, w3
0x000003ff8fdfea84: ldp xfp, xlr, [sp,#32]
0x000003ff8fdfea88: add sp, sp, #0x30
0x000003ff8fdfea8c: movz xscratch1, #0x0 ; {poll_return}
0x000003ff8fdfea90: movk xscratch1, #0xa880, lsl #16
0x000003ff8fdfea94: movk xscratch1, #0x3ff, lsl #32
0x000003ff8fdfea98: ldr wzr, [xscratch1] ; {poll_return}
0x000003ff8fdfea9c: ret
...
Note the reuse of w1 (chars[0]) and w3 (chars[1]) in the final 2 sub instructions and the final add instruction, respectively. The substitution is permitted here because each of the relevant reads only has one usage, a 32-bit ZeroExtend, which feeds: in the case of chars[0] the first int add and the final two subtracts; in the case of chars[1], the first int subtract and the final add. If the variable x were declared to be of type long then the first Read of chars[0] would feed both a 32-bit ZeroExtend (feeding the last two subtracts) and a 64-bit ZeroExtend (feeding the first add) which would inhibit the replacement. That si critical when SignExtends are in play because the result of a 32-bit and 64-bit SignExtend are different so dropping one or the other could produce incorrect results.