Implement prefetch for AArch64 #228
It seems this patch is not quite ready for integration. One of the unit tests has generated a prefetch request with an address that employs a REGISTER_OFFSET addressing mode. I'm not sure how that can happen given that the address ought always to be a base address + 1, 2 or 3 cache lines offset. I am investigating now.
Ok, it seems that a specific PrefetchAllocateNode is being created with an OffsetAddress that employs a base register + small displacement, but that the FixReads phase's replaceConstantInputs is somehow transforming it into an OffsetAddress with a new base register and an offset too large to encode as a displacement (specifically 100208). As a consequence, the resulting AArch64Address is generated to use a REGISTER_OFFSET addressing mode. I guess I will have to modify the Assembler to expect and handle REGISTER_OFFSET addresses.
prfm now accepts REGISTER_OFFSET addresses, which fixes the unit test and makes this patch ok to merge. That said, I think there is something else wrong in the compiler which still needs fixing -- specifically in the way that FixReads is operating.

The prefetch node references an input OffsetAddress constructed by adding a constant to the new TLAB top, either 192, 256 or 320. The TLAB top is itself calculated by adding an allocated object's size to the old TLAB top -- the latter pulled out of a thread field. In certain cases the FixReads phase replaces the prefetch node's input OffsetAddress with a new OffsetAddress by combining constants used to derive the object size with the prefetch offset. In the unit test method which previously failed, System_setOut, the test creates a byte array of length 100000. As a result the PrefetchNode ends up with an OffsetAddress which is based on the TLAB_TOP and employs offsets in the set {192, 256, 320} + 100016.

This eager combination of constants by FixReads misses the opportunity for the prefetch instructions to reuse the TLAB top address loaded into x3 with (displacement) offsets 192, 256 and 320. Instead the TLAB top and the 3 prefetch addresses are all derived from the old TLAB top in x1 using (register) offsets 100016, 100208 etc. The latter requires an extra 3 movz/movk pairs to install the required offset values into an offset register.

I'll think more about how FixReads might spot situations like this and inhibit some constant combination so that generated code can reuse the result of a load effective address (the add x3, x1, x2 instruction that computes the new TLAB top) rather than keep reloading large constants. My first thought was to wonder whether it is appropriate for FixReads to be doing constant combination at all. The x86 and AArch64 AddressLowering phases both implement constant offset merging as part of address improvement (I'm not sure whether SPARC also does).

If it were left solely to the AddressLowering phase to do this merging of constants in the address chain then I think AArch64 would do the right thing here (although maybe it would not be right in other circumstances) because AArch64 only merges constants into displacements when they produce a result in the embeddable range. Is there a reason why both FixReads and AddressLowering try to apply the same 'improvement'?
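For concreteness, a hedged reconstruction of the generated code being described (the registers x1/x3 and the 100208 offset come from the comment; the remaining instruction choices are guesses):

```asm
// Hypothetical reconstruction from the description above; only x1/x3 and the
// offsets are taken from the text, everything else is a guess.
    add  x3, x1, x2              // new TLAB top = old top (x1) + object size (x2)
    movz x4, #0x8770             // materialize 100208 (0x18770) ...
    movk x4, #0x1, lsl #16       // ... into an offset register
    prfm pstl1keep, [x1, x4]     // prefetch at old top + 100208, REGISTER_OFFSET mode
    // ... two more movz/movk pairs and prfm instructions for the remaining
    // offsets, instead of reusing x3 with small displacements
```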
Can you please rebase this PR on 317384c so that it gets run through the Travis gate.
sure, will do :-)
n.b. it has been rebased
I can't answer this knowledgeably (@davleopo probably can) but did you try disabling the improvement in FixReads and seeing if AArch64 does the right thing as you expect? |
I worked through this a bit further and it seems that the call to replaceConstantInputs under the FixReads phase performs a more general coalescing of constants than AddressLowering. The generality has two dimensions: AddressLowering only improves constant expressions that are fed as inputs to RawAddress or OffsetAddress objects, and then only in so far as they can be incorporated into displacements or offset register shifts. In consequence of that latter limitation, AddressLowering only improves a subset of the expressions simplified by FixReadsPhase pulling up constants from nested adds (E + C1) and shifts (E << C2). It does not attempt to simplify reducible constant expressions like C1 & C2 or C1 | C2 -- I guess that is because they have already been dealt with by FixReadsPhase.

What I have not yet understood is why in this specific unit test the x86 code ends up putting the large object size into a register and using a small displacement plus register offset mode address. If the expression is reducible to a register + displacement on AArch64 I would also expect it to be reducible to register + displacement on x86, yet in this case a part of that displacement gets retained in an offset register. I tweaked the Java code for the test a little and found that in other cases on x86 the constant expression is reduced to a single large displacement. When the test simply creates a large byte[] the prefetch offsets get coalesced with the byte[] size, resulting in 3 prefetch instructions each with a large embedded displacement. So, it's not just an AArch64 issue, and the disparity here is because occasionally x86 gets lucky and reuses an offset register value. In which case, maybe this is just something to live with for now.

I did wonder whether it might be worth inhibiting constant replacement based on usage count > 1 (or maybe > 2?). That should help here because the new TOP is consumed 3 times. This ought to result in computation of the new TOP address once, with each subsequent prefetch using a small displacement relative to that computed address. It's hard to know what the consequences of this change might be for later phases though. Limited coalescing might break some other important invariant. With C2 I know that the normalization model expected by each phase has some simple requirements for order-independence and convergence, but I am not aware of whether Graal is built with any similar expectations. I might play with inhibiting according to usage count just to see what happens. However, an explanation of whatever theoretical/formal underpinnings the organization of the Graal phases relies on would probably be necessary before deciding to adopt any change identified from such playing around.
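Such an inhibition would aim for generated code along these lines (a sketch; the 192/256/320 offsets come from the earlier discussion, the register numbers are illustrative):

```asm
// Hypothetical improved sequence: compute the new TLAB top once and reuse
// it with small, directly encodable displacements.
    add  x3, x1, x2              // new TLAB top, computed once
    prfm pstl1keep, [x3, #192]   // prefetch 3 cache lines ahead
    prfm pstl1keep, [x3, #256]
    prfm pstl1keep, [x3, #320]
```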
Heuristics based on usage counts are always kind of flaky. I agree that folding the constant into the address for this pattern on ARM doesn't improve the code. Something more principled might be to perform address lowering before fixing reads. Then you could have helpers on AddressNode that could incorporate the flow dependent constants into the address if that's profitable. Would that work?
Also I don't really see the weirdness you were describing with large constants being materialized on amd64, though I'm looking at the high level graph. Maybe something is going wrong when we produce the assembly?
Hi Tom, I played around with this and agree that relying on usage counts doesn't look very promising. Placing address lowering before the FixReads phase would only work if the former was rewritten to perform the more complex constant reduction that FixReads currently does, but limiting it by i) selectively applying it only to value chains that feed address base/offset inputs, and ii) inhibiting reduction in cases where the resulting offset cannot be used as a displacement. This might be worth trying but it would probably be quite complex to implement.

I reproduced the large constant offset on x86 by changing the original test to create and return a byte[] array. I think it also made the test routine do a little bit of extra work -- unfortunately I have now reverted the code and cannot remember exactly what I used. Anyway, I didn't look at the graph itself, just the generated code, and the large constant had been folded into a large displacement relative to the old heap top loaded into r15.
Well I was imagining we'd add a method to AddressNode like foldConstant(Node value, long newValue) that would be called by FixReads to ask it to incorporate a constant into its displacement. This would let the address reject that transformation if it would require a different address form, or switch to a different form. Would something like that work?
If this is intended to be performed after lowering then I'm not clear how what you are suggesting is supposed to work. A lowered AArch64AddressNode may have i) a base register plus a register offset, ii) a base register only, or iii) a base register plus a scaled or unscaled displacement. It is only in the first case that there will be an old Value to replace with a new Value, and in that case the old Value will likely not be a small constant, because small constants will already have been folded into a SCALED/UNSCALED_DISPLACEMENT mode address. It is only in the 3rd case that the displacement in the lowered AArch64AddressNode will be non-zero, meaning that an attempt to augment it might make sense. However, there isn't any old Value here to be replaced; the address merely contains an int displacement in one of its fields. Am I missing your point here? Could you explain a bit more about how you thought this was going to work?
I may not have thought this through completely but I'll take a stab at fleshing it out. I read your original complaint as being that the extra constant folding done by FixReads may be a bad idea for AArch64, but since it's shared code there's no way to control it. So if we lower the addresses first then we could build some contract between AddressNode and FixReads that would let it participate in pulling in constants when that would be profitable. Replacing constants on an AddressNode would have to work a bit differently since it's unlikely to have a direct input which can be converted into a constant. FixReads would either need some special logic which decomposes the inputs to an AddressNode to look for things like x + y -> x + c and offer those values to the AddressNode, or FixReads could provide a helper like getConstantForNode that wraps up the getBestStamp logic in replaceConstantInputs and passes it to a new AddressNode.foldConstants routine. I agree that for the AArch64 case there are unlikely to be many profitable opportunities, but I think that's the point of your complaint. The only case would be redistributing an add of a constant from the base in the 3rd form, and only if the result would still fit in the displacement. Sparc has similar problems as AArch64 with constants, but AMD64 can fold in almost anything. Does that make more sense and/or sound like a good idea?
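As a sketch of the proposed contract (hypothetical names and a simplified standalone model, not the real Graal AddressNode API), the accept/reject behaviour might look like:

```java
// Hypothetical sketch: an AddressNode-style helper that only folds a
// constant into its displacement when the result still fits the target's
// encodable range, and rejects the transformation otherwise so the
// separate add node would stay in the graph.
class FoldConstantSketch {
    // AArch64 unscaled (signed 9-bit) immediate range, chosen purely for
    // illustration; real lowering also considers scaled forms.
    static final long MIN_DISP = -256;
    static final long MAX_DISP = 255;

    long displacement;

    FoldConstantSketch(long displacement) {
        this.displacement = displacement;
    }

    /** Try to fold {@code constant} into the displacement; false means "rejected". */
    boolean foldConstant(long constant) {
        long folded = displacement + constant;
        if (folded < MIN_DISP || folded > MAX_DISP) {
            return false; // too large to encode, e.g. an offset like 100208
        }
        displacement = folded;
        return true;
    }
}
```

The caller (FixReads, in this proposal) would keep the original add node whenever foldConstant returns false, instead of unconditionally coalescing constants.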
Thanks for the clarification Tom. I think I now understand what you are suggesting and I believe it may be possible (although I am not sure). I'll be happy to research this idea but perhaps it would be better to submit it as a separate patch to follow on from this one -- it's really an independent issue.

n.b. I rebased the patch against the latest repo master HEAD in the hope that it would bypass whatever issue is stopping the Travis build, but it has still failed to build (I think it cannot find a suitable jdk9). When originally submitted the PR passed the Travis check. Is there something that can be done to fix Travis?
I apologize for the instability of the Travis gate but we continue to be at the mercy of external factors. Namely, the JDK9 EA binaries appear to be removed as soon as a new EA binary is available. And we still cannot produce a JVMCI JDK8 binary. I'll update Graal's JDK9 dependency once https://bugs.openjdk.java.net/browse/JDK-8182310 lands in an EA binary.
Thanks for the explanation Doug. Can the Travis build be fixed by updating the top-level .travis.yml to specify JDK9_EA_BUILD="176" or is there more required than that?
That might work - feel free to include that change in this PR. We will then update the version used internally. In general we try to keep the JDK9 version used for internal and external testing the same.
You're now running into #232. You will need this patch as well:
Ok, so this required a little more work than I thought. The download for jdk9+176 does not contain the string "-ea". Nor does the reported java version. So, I had to update the YAML in two places and also fix class JVMCIVersionCheck.
Hmm, sorry, we appear to have both fixed the same issue. I implemented a slightly different fix for JVMCIVersionCheck which allowed 9+176 to work but also continued to allow EA releases to work. Shall I back that change out?
Yes, I'd prefer my check as we don't need to be backward compatible for JDK9 EA binaries. Let's keep it as simple as possible while we can!
Ok, I have applied your patch to my PR -- it ought to pass the build now.
I agree that any change in how fix reads and address lowering work should be in a separate PR. |
This patch implements prefetch for AArch64.
Currently, on AArch64 a PrefetchAllocateNode inserted into generic code results in a redundant load constant followed by a nop. This patch ensures that a prefetch (prfm) instruction is generated with a suitable prefetch mode and displacement mode address.
Note that the AArch64 port does not currently set a suitable value for PrefetchAllocateInstr in its implementation of class VM_Version. The default value of 0 (PLDL1KEEP) would not be a disaster but is not really the best choice. So, this patch follows the current AArch64 C2 compiler in enforcing use of prefetch mode PSTL1KEEP when lowering a PrefetchAllocateNode, ignoring the config setting derived from PrefetchAllocateInstr.
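A minimal sketch of the instruction the lowering now emits (the operands shown are illustrative, not taken from the patch):

```asm
// Hypothetical output for one PrefetchAllocateNode, assuming x3 holds the
// address being prefetched and a 192-byte displacement.
    prfm pstl1keep, [x3, #192]   // prefetch for store, retain in L1
```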