@adinn (Collaborator) commented Jun 20, 2017

This patch implements prefetch for AArch64.

Currently, on AArch64 a PrefetchAllocateNode inserted into generic code results in a redundant constant load followed by a nop. This patch ensures that a prefetch (prfm) instruction is generated with a suitable prefetch mode and a displacement-mode address.

Note that the AArch64 port does not currently set a suitable value for PrefetchAllocateInstr in its implementation of class VM_Version. The default value of 0 (PLDL1KEEP) would not be a disaster but is not really the best choice. So, this patch follows the current AArch64 C2 compiler in enforcing use of prefetch mode PSTL1KEEP when lowering a PrefetchAllocateNode, ignoring the config setting derived from PrefetchAllocateInstr.

@adinn (Collaborator, Author) commented Jun 20, 2017

It seems this patch is not quite ready for integration. One of the unit tests has generated a prefetch request with an address that employs a REGISTER_OFFSET addressing mode. I'm not sure how that can happen given that the address ought always to be a base address + 1, 2 or 3 cache lines offset. I am investigating now.

@adinn (Collaborator, Author) commented Jun 20, 2017

Ok, it seems that a specific PrefetchAllocateNode is being created with an OffsetAddress that employs a base register + small displacement, but that the replaceConstantInputs step of the FixReads phase is somehow transforming it into an OffsetAddress with a new base register and an offset too large to encode as a displacement (specifically 100208). As a consequence, the resulting AArch64Address is generated with a REGISTER_OFFSET addressing mode. I guess I will have to modify the Assembler to expect and handle REGISTER_OFFSET addresses.
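
For context, the reason an offset like 100208 cannot be encoded directly comes down to the prfm immediate forms. Here is a minimal sketch of the encodability check (my own illustration, not Graal code; it assumes the AArch64 prfm immediate forms: a 12-bit unsigned offset scaled by 8, or a 9-bit signed unscaled offset):

```java
// Illustration only: classify a displacement by which AArch64 prfm
// addressing form can encode it.
enum Mode { SCALED_DISPLACEMENT, UNSCALED_DISPLACEMENT, REGISTER_OFFSET }

class PrfmDisplacement {
    static Mode modeFor(long disp) {
        // Scaled form: unsigned imm12 scaled by 8 -> 0..32760, multiple of 8.
        if (disp >= 0 && disp <= 4095 * 8 && (disp & 7) == 0) {
            return Mode.SCALED_DISPLACEMENT;
        }
        // Unscaled form: signed simm9 -> -256..255.
        if (disp >= -256 && disp <= 255) {
            return Mode.UNSCALED_DISPLACEMENT;
        }
        // Anything else needs the offset placed in a register first.
        return Mode.REGISTER_OFFSET;
    }
}
```

On this model a cache-line-sized offset such as 320 encodes as a scaled displacement, while a combined offset like 100208 falls outside both immediate ranges and so forces REGISTER_OFFSET.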

@adinn (Collaborator, Author) commented Jun 20, 2017

prfm now accepts REGISTER_OFFSET addresses which fixes the unit test and makes this patch ok to merge.

That said, I think there is something else wrong in the compiler which still needs fixing -- specifically in the way that FixReads is operating. The prefetch node references an input OffsetAddress constructed by adding a constant to the new TLAB top, either 192, 256 or 320. The TLAB top is itself calculated by adding an allocated object's size to the old TLAB top -- the latter pulled out of a thread field. In certain cases the FixReads phase replaces the prefetch node's input OffsetAddress with a new OffsetAddress by combining constants used to derive the object size with the prefetch offset.

In the unit test method which previously failed, System_setOut, the test is creating a byte array of length 100000. As a result the PrefetchNode ends up with an OffsetAddress which is based on the TLAB_TOP and employs offsets in the set {192, 256, 320} + 100016. This leads to the following code:

  0x000003ff993a2794: ldr	x1, [xthread,#96]
  0x000003ff993a2798: ldr	x0, [xthread,#112]
  0x000003ff993a279c: movz	x2, #0x86b0    # n.b. with the movk below this is 100016
  0x000003ff993a27a0: movk	x2, #0x1, lsl #16
  0x000003ff993a27a4: add	x3, x1, x2
  0x000003ff993a27a8: movz	x4, #0x0, lsl #16  ;   {metadata({type array byte})}
  0x000003ff993a27ac: movk	x4, #0x788
  0x000003ff993a27b0: movz	x5, #0x87f0
  0x000003ff993a27b4: movk	x5, #0x1, lsl #16
  0x000003ff993a27b8: movz	x6, #0x87b0
  0x000003ff993a27bc: movk	x6, #0x1, lsl #16
  0x000003ff993a27c0: movz	x7, #0x8770
  0x000003ff993a27c4: movk	x7, #0x1, lsl #16
  0x000003ff993a27c8: movz	x10, #0x86a0
  0x000003ff993a27cc: movk	x10, #0x1, lsl #16
  0x000003ff993a27d0: str	w10, [sp,#300]
  0x000003ff993a27d4: cmp	x3, x0
  0x000003ff993a27d8: b.hi	0x000003ff993a7334
  0x000003ff993a27dc: str	x3, [xthread,#96]
  0x000003ff993a27e0: prfm	pstl1keep, [x1,x7]
  0x000003ff993a27e4: prfm	pstl1keep, [x1,x6]
  0x000003ff993a27e8: prfm	pstl1keep, [x1,x5]

The eager combination of constants by FixReads misses the opportunity for the prefetch instructions to reuse the TLAB top address loaded into x3 with (displacement) offsets 192, 256 and 320. Instead the TLAB top and the 3 prefetch addresses are all derived from the old TLAB top in x1 using (register) offsets 100016, 100208, 100272 and 100336. The latter requires an extra 3 movz/movk pairs to install the required offset values into offset registers.
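
To put rough numbers on that, here is a back-of-envelope cost model (my own sketch, not Graal code): AArch64 materializes a 64-bit constant with one movz/movk per non-zero 16-bit chunk, so folding a large offset into each of the three prefetch addresses costs several instructions that the shared add avoids:

```java
// Illustration only: count the movz/movk instructions needed to
// materialize a constant, one per non-zero 16-bit chunk.
class ConstantMaterialization {
    static int movInsnCount(long c) {
        int insns = 0;
        for (int shift = 0; shift < 64; shift += 16) {
            if (((c >>> shift) & 0xFFFF) != 0) {
                insns++;
            }
        }
        return Math.max(insns, 1); // zero still needs a single movz
    }

    public static void main(String[] args) {
        // Folded form: three distinct large offsets (100208, 100272, 100336),
        // each spanning two 16-bit chunks -> 2 instructions apiece.
        int folded = 3 * movInsnCount(100208);
        // Shared form: one add producing the new top, after which the three
        // prefetches can use small displacements at no extra cost.
        int shared = 1;
        System.out.println(folded + " vs " + shared); // 6 vs 1
    }
}
```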

I'll think more about how FixReads might spot situations like this and inhibit some constant combination so that generated code can reuse the result of a load-effective-address (the add x3, x1, x2 instruction above) rather than keep reloading large constants.

My first thought was to wonder whether it is appropriate for FixReads to be doing constant combination at all. The x86 and AArch64 AddressLowering phases both implement constant offset merging as part of address improvement (I'm not sure whether SPARC also does). If this merging of constants in the address chain were left solely to the AddressLowering phase then I think AArch64 would do the right thing here (although maybe it would not be right in other circumstances), because AArch64 only merges constants into displacements when they produce a result in the embeddable range. Is there a reason why both FixReads and AddressLowering try to apply the same 'improvement'?

@dougxc (Member) commented Jun 21, 2017

Can you please rebase this PR on 317384c so that it gets run through the Travis gate?

@adinn (Collaborator, Author) commented Jun 21, 2017

sure, will do :-)

@adinn (Collaborator, Author) commented Jun 23, 2017

n.b. it has been rebased.

@dougxc (Member) commented Jun 23, 2017

Is there a reason why both FixReads and AddressLowering try to apply the same 'improvement'?

I can't answer this knowledgeably (@davleopo probably can) but did you try disabling the improvement in FixReads and seeing if AArch64 does the right thing as you expect?

@adinn (Collaborator, Author) commented Jun 23, 2017

I worked through this a bit further and it seems that the call to replaceConstantInputs in the FixReads phase performs a more general coalescing of constants than AddressLowering. The generality has two dimensions: AddressLowering only improves constant expressions that are fed as inputs to RawAddress or OffsetAddress objects, and then only insofar as they can be incorporated into displacements or offset register shifts. As a consequence of the latter limitation, AddressLowering only improves a subset of the expressions simplified by FixReadsPhase's pulling up of constants from nested adds (E + C1) and shifts (E << C2). It does not attempt to simplify reducible constant expressions like C1 & C2 or C1 | C2 -- I guess that is because they have already been dealt with by FixReadsPhase.

What I have not yet understood is why in this specific unit test the x86 code ends up putting the large object size into a register and using a small-displacement-plus-register-offset mode address. If the expression is reducible to register + displacement on AArch64 I would also expect it to be reducible to register + displacement on x86, yet in this case part of that displacement gets retained in an offset register. I tweaked the Java code for the test a little and found that in other cases on x86 the constant expression is reduced to a single large displacement. When the test simply creates a large byte[], the prefetch offsets get coalesced with the byte[] size, resulting in 3 prefetch instructions each with a large embedded displacement. So, it's not just an AArch64 issue, and the disparity here is because occasionally x86 gets lucky and reuses an offset register value. In which case, maybe this is just something to live with for now.

I did wonder whether it might be worth inhibiting constant replacement based on usage count > 1 (or maybe > 2?). That should help here because the new TOP is consumed 3 times. This ought to result in computation of the new TOP address once, with each subsequent prefetch using a small displacement relative to that computed address. It's hard to know what the consequences of this change might be for later phases though. Limited coalescing might break some other important invariant.

With C2 I know that the normalization model expected by each phase has some simple requirements for order-independence and convergence, but I am not aware of whether Graal is built with any similar expectations. I might play with inhibiting according to usage count just to see what happens. However, an explanation of whatever theoretical/formal underpinnings the organization of the Graal phases relies on would probably be necessary before deciding to adopt any change identified from such playing around.
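
The inhibition being considered could be as simple as a usage-count guard in replaceConstantInputs. A hypothetical sketch of the heuristic (names invented for illustration; this is not existing Graal code):

```java
// Illustration only: skip folding a constant into a value whose result
// has multiple users, so the computed address (the new TLAB top here)
// can be shared instead of being re-derived with large constants.
class FoldInhibition {
    static final int USAGE_THRESHOLD = 1; // or maybe 2, as discussed above

    static boolean shouldFoldConstant(int usageCount) {
        return usageCount <= USAGE_THRESHOLD;
    }

    public static void main(String[] args) {
        // The new top is consumed 3 times by the prefetches,
        // so folding would be inhibited in that case.
        System.out.println(shouldFoldConstant(3)); // false
        System.out.println(shouldFoldConstant(1)); // true
    }
}
```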

@tkrodriguez (Member) commented:
Heuristics based on usage counts are always kind of flaky. I agree that folding the constant into the address for this pattern on ARM doesn't improve the code. Something more principled might be to perform address lowering before fixing reads. Then you could have helpers on AddressNode that could incorporate the flow-dependent constants into the address if that's profitable. Would that work?

@tkrodriguez (Member) commented:
Also I don't really see the weirdness you were describing with large constants being materialized on amd64, though I'm looking at the high-level graph. Maybe something is going wrong when we produce the assembly?

@adinn (Collaborator, Author) commented Jun 26, 2017

Hi Tom,

I played around with this and agree that relying on usage counts doesn't look very promising.

Placing address lowering before the FixReads phase would only work if the former were rewritten to perform the more complex constant reduction that FixReads currently does, while limiting it by:

i) selectively applying it only to value chains that feed address base/offset inputs

ii) inhibiting reduction in cases where the resulting offset cannot be used as a displacement

This might be worth trying but it would probably be quite complex to implement.

I reproduced the large constant offset on x86 by changing the original test to create and return a byte[] array. I think it also made the test routine do a little bit of extra work -- unfortunately I have now reverted the code and cannot remember exactly what I used. Anyway, I didn't look at the graph itself, just the generated code and the large constant had been folded into a large displacement relative to the old heap top loaded into r15.

@tkrodriguez (Member) commented:
Well, I was imagining we'd add a method to AddressNode like foldConstant(Node value, long newValue) that would be called by FixReads to ask it to incorporate a constant into its displacement. This would let the address reject that transformation if it would require a different address form, or switch to a different form. Would something like that work?

@adinn (Collaborator, Author) commented Jun 27, 2017

I was imagining we'd add a method to AddressNode like foldConstant(Node value, long newValue) that would be called by FixReads to ask it to incorporate a constant into its displacement. This would let the address reject that transformation if it would require a different address form, or switch to a different form. Would something like that work?

If this is intended to be performed after lowering then I'm not clear how what you are suggesting is supposed to work. A lowered AArch64AddressNode may have

  • a Value in both its base and index and a zero displacement (addressingMode = REGISTER_OFFSET)
  • a Value in its base, an IllegalValue in its index and a zero displacement (addressingMode = BASE_REGISTER)
  • a Value in its base, an IllegalValue in its index and a non-zero displacement (addressingMode = SCALED/UNSCALED_DISPLACEMENT and scaleFactor in {1, 2, 4, 8})

It is only in the first case that there will be an old Value to replace with a new Value, and in that case the old Value will likely not be a small constant, because small constants will already have been folded into a SCALED/UNSCALED_DISPLACEMENT mode address.

It is only in the third case that the displacement in the lowered AArch64AddressNode will be non-zero, meaning that an attempt to augment it might make sense. However, there isn't any old Value here to be replaced; the address merely contains an int displacement in one of its fields.

Am I missing your point here? Could you explain a bit more about how you thought this was going to work?

@tkrodriguez (Member) commented:
I may not have thought this through completely but I'll take a stab at fleshing it out. I read your original complaint as being that the extra constant folding done by FixReads may be a bad idea for AArch64, but since it's shared code there's no way to control it. So if we lower the addresses first then we could build some contract between AddressNode and FixReads that would let it participate in pulling in constants if that would be profitable. Replacing constants on an AddressNode would have to work a bit differently since it's unlikely to have a direct input which can be converted into a constant. FixReads would either need some special logic which decomposes the inputs to an AddressNode to look for things like x + y -> x + c and offer those values to the AddressNode, or FixReads could provide a helper like getConstantForNode that wraps up the getBestStamp logic in replaceConstantInputs and passes it to a new AddressNode.foldConstants routine.

I agree for the AArch64 case that it's unlikely that there are any opportunities that are likely to be profitable, but I think that's the point of your complaints. The only case would be redistributing an add of a constant from the base in the third form, but only if the result would still fit in the displacement. SPARC has similar problems to AArch64 with constants, but AMD64 can fold in almost anything. Does that make more sense and/or sound like a good idea?
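
The proposed contract might look something like the following (a sketch under the names suggested above; foldConstant here is hypothetical, not an existing Graal method): the lowered address accepts a constant only when the merged displacement still encodes, and otherwise rejects the fold so the feeding add stays shareable:

```java
// Illustration only: a lowered AArch64-style address that can veto
// constant folding proposed by FixReads.
class LoweredAddressSketch {
    long displacement;

    LoweredAddressSketch(long displacement) {
        this.displacement = displacement;
    }

    /** Accepts the fold and updates the displacement, or rejects it. */
    boolean foldConstant(long constant) {
        long merged = displacement + constant;
        // Require the scaled-displacement form: 0..32760, multiple of 8.
        if (merged >= 0 && merged <= 4095 * 8 && (merged & 7) == 0) {
            displacement = merged;
            return true;
        }
        return false; // keep the add node; its result can be reused
    }
}
```

On this model the cache-line offsets fold in, while an offset in the 100208 range would be rejected and left to address formation.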

@adinn (Collaborator, Author) commented Jun 30, 2017

Thanks for the clarification Tom. I think I now understand what you are suggesting and believe it may be possible (although I am not sure). I'll be happy to research this idea, but perhaps it would be better to submit it as a separate patch to follow on from this one -- it's really an independent issue.

n.b. I rebased the patch against the latest repo master HEAD in the hope that it would bypass whatever issue is stopping the Travis build, but it has still failed to build (I think it cannot find a suitable jdk9). When originally submitted, the PR passed the Travis check. Is there something that can be done to fix Travis?

@dougxc (Member) commented Jul 3, 2017

I apologize for the instability of the Travis gate but we continue to be at the mercy of external factors. Namely, the JDK9 EA binaries appear to be removed as soon as a new EA binary is available. And we still cannot produce a JVMCI JDK8 binary. I'll update Graal's JDK9 dependency once https://bugs.openjdk.java.net/browse/JDK-8182310 lands in an EA binary.

@adinn (Collaborator, Author) commented Jul 3, 2017

Thanks for the explanation Doug.

Can the Travis build be fixed by updating the top-level .travis.yml to specify JDK9_EA_BUILD="176" or is there more required than that?

@dougxc (Member) commented Jul 3, 2017

That might work - feel free to include that change in this PR. We will then update the version used internally (defined in common.hocon) if it passes.

In general we try to keep the JDK9 version used for internal and external testing the same.

@adinn adinn force-pushed the prefetch_patch branch from c91b21a to 1877be6 Compare July 3, 2017 12:11
@dougxc (Member) commented Jul 3, 2017

You're now running into #232. You will need this patch as well:

diff --git a/compiler/src/org.graalvm.compiler.hotspot/src/org/graalvm/compiler/hotspot/JVMCIVersionCheck.java b/compiler/src/org.graalvm.compiler.hotspot/src/org/graalvm/compiler/hotspot/JVMCIVersionCheck.java
index c9f0b26..042e4e7 100644
--- a/compiler/src/org.graalvm.compiler.hotspot/src/org/graalvm/compiler/hotspot/JVMCIVersionCheck.java
+++ b/compiler/src/org.graalvm.compiler.hotspot/src/org/graalvm/compiler/hotspot/JVMCIVersionCheck.java
@@ -121,8 +121,8 @@ class JVMCIVersionCheck {
             }
             // http://openjdk.java.net/jeps/223
             // Only support EA builds until GA is available
-            if (vmVersion.startsWith("9-ea+")) {
-                int start = "9-ea+".length();
+            if (vmVersion.startsWith("9+")) {
+                int start = "9+".length();
                 int end = start;
                 end = start;
                 while (end < vmVersion.length() && Character.isDigit(vmVersion.charAt(end))) {

@adinn (Collaborator, Author) commented Jul 3, 2017

Ok, so this required a little more work than I thought. The download for jdk9+176 does not contain the string "-ea", nor does the reported Java version. So, I had to update the YAML in two places and also fix class JVMCIVersionCheck.

@adinn (Collaborator, Author) commented Jul 3, 2017

Hmm, sorry we appear to have both fixed the same issue. I implemented a slightly different fix for JVMCIVersionCheck which allowed 9+176 to work but also continued to allow EA releases to work. Shall I back that change out?

@dougxc (Member) commented Jul 3, 2017

Yes, I'd prefer my check as we don't need to be backward compatible for JDK9 EA binaries. Let's keep it as simple as possible while we can!

@adinn adinn force-pushed the prefetch_patch branch from d55e19c to dbcc355 Compare July 3, 2017 13:02
@adinn (Collaborator, Author) commented Jul 3, 2017

Ok, I have applied your patch to my PR -- it ought to pass the build now.

@tkrodriguez (Member) commented:
I agree that any change in how FixReads and address lowering work should be in a separate PR.

@dougxc dougxc merged commit 7dacc3b into oracle:master Jul 3, 2017
@adinn adinn deleted the prefetch_patch branch July 4, 2017 09:51
@dougxc dougxc added the accept label Dec 12, 2017