-
Notifications
You must be signed in to change notification settings - Fork 6.1k
8347901: C2 should remove unused leaf / pure runtime calls #24966
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Welcome back mchevalier! A progress list of the required criteria for merging this PR into |
|
@marc-chevalier This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 758 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
|
@marc-chevalier The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
I agree that we should not do that in these changes. But did you consider to move/clone such call (new macro node) down to "users" in case the result is not used on some paths? They will be executed only where they are needed. And I think it is safe since current control dominates paths where the result is used. |
|
I've considered it, but rather for a follow-up. My thought was to first introduce the node types, removal mechanics and such, but keep it pined by control and not touch that in this change. In the follow-up, I was hoping I would have "just" the control-pinning problem to address. Moving the calls down may be beneficial in case the result is not used in a branch (and then we save the call when executing the branch not using it), but if the usage is in a loop, we rather want the call to stay (or be hoisted) before the loop. The heuristic "out of as many loops as possible, and the later possible" seems to also apply here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work, Marc.
High-level comment: I don't know what are the future plans, but as the patch stands now, it feels like it complicates both the design and the implementation.
Original implementation relies on macro nodes which are later expanded into leaf runtime calls. What you propose introduce new concept of "pure calls" which is: (1) not a CallNode anymore; and (2) relies on subclassing (which makes it hard to mix with other node properties). Moreover, I don't see much benefit in committing to runtime call representation from the very beginning (early in high-level IR).
Going forward, IMO the sweet sport is to support arbitrary nodes to be lowered into leaf runtime calls. You make a big step in that direction by relaxing requirements on PureCall to be just a CFG node (and not a full-blown CallLeaf node). Next step would be to relax CFG node requirement and let compiler pick the right place to insert it. (Existing expensive node support in C2 addresses some similar challenges.)
And, as a complementary options, in some cases it may be just enough to mark individual call nodes as pure, so they can be pruned later if nobody consumes result of their computation anymore.
|
Thanks for the comment. I'll think deeper about it. I've started by trying to make PureCall a subclass of Call (or a property of LeafCall) but that broke a lot of things that were using some invariants on CallNode that weren't holding anymore. After a some time tracking bugs and trying to fix, I thought it would be simpler to have a new kind of node, and it would have less impact on existing code. Another reason I've changed it to a direct sub-class of Node is that I felt it made little sense to be a Call (or sub-class of) since Calls are Safepoint, but pure calls don't need to be (and similar "conceptual" problems). It seemed like a hack to me. About
I don't think I understand what you mean. Overall, I see the weaknesses of my design, but I'm not sure which direction to take instead. |
A leaf runtime call which doesn't depend or change memory state can be inserted at arbitrary points in the graph. So, an arbitrary data node can be lowered into a runtime call once the place to insert it is known/chosen.
I suggest to experiment with untangling The hard part is probably related to picking a point in CFG to insert the call, but the control the node has may be not suitable for that (e.g., if inputs don't dominate control anymore). In that case, updating control input during loop opts may be an option. |
Good point! I still think I don't get everything. Let me try to sum up what I think I should do. For now, I don't want to mess with control, but I should prepare the field. Using general Call nodes for pure calls was pretty difficult: Call nodes have too much opinion, assumptions to easily work with for pure calls. But eventually, I want to change the nodes I'm using into a Call node, and more precisely a CallLeaf (I suspect once I'm done doing all I can do with pure calls, so in macro expansion, it's fine). To be able to do this transformation, I need to know control at this point. My goal is to start with control-less nodes, but find the late control during loop optimization, control-pin them at this point (because that's when the information is available) with both control input and output (needed for the expansion in CallLeaf), and continuing with control-pinned nodes. For now, I'm happy with the control I get from parsing. So, under my nodes, I need 2 outputs: control and data (everywhere now, and at least after control-pinning in the follow-up). I should then make ModFloating/ModD/ModF sub-classes of |
If you combine lowering with pinning, you could replace a data node with a CFG node (CallLeaf in your case) at the point in CFG you choose. A single CFG node is enough to insert a CFG-only node, but you need to ensure the graph stays schedulable after the insertion. If you want to start with pinned node, the simplest way would be to make
Keep in mind that it assumes the node is pinned in CFG from the very beginning. Once the node starts in data-only mode, the control input it gained during parsing may end up too early for node's inputs to be scheduleable. |
|
I think a very simple approach you can take is having |
It's not as simple as it seems. In order to work reliably it requires full control of the code being called, so without extra work it is appropriate for generated stubs only. If you want to call some native code VM doesn't control, then either all caller-saved registers should be preserved across the call (which may be prohibitively expensive) or it should be made explicit there's a call taking place so all ABI effects are taken into account. |
|
@iwanowww I believe jdk/src/hotspot/cpu/arm/arm.ad Line 5962 in adebfa7
And
|
|
Interesting! I wasn't aware ADLC already features such support. Thanks for the pointers. It does look attractive, especially for platform-specific use cases. But there are some pitfalls which makes it hard to use on its own. In particular, data nodes are aggressively commoned and freely flow in the graph. Unless it is taken into account during GVN and code motion, the final schedule may end up far from optimal. (In other words, it's highly beneficial to match only expensive nodes in such a way.) Moreover, some optimizations are highly sensitive to the presence of calls. (Think of the consequences of a call scheduled inside a heavily vectorized loop.) Macro-expansion also suffers from some of those issues, but still IMO an explicit |
|
I like @merykitty's suggestion, but I don't understand how bad are the disadvantages of it. Commoning can be prevented as you mentioned above. As for scheduling, isn't it the same problem for many nodes? If we have something like var x = anOject.aField; // anObject known to be not null
if (flag) { // flag independent of `anObject`
// something with x
} else {
// [...] nothing with x
}I don't think there is any ordering between the if and the definition of |
|
Tbh I don't understand @iwanowww arguments. We have expensive data nodes such as Ideally, what we want to do with expensive data nodes is to common them aggressively like any other data node. Then, during code motion, we can clone them if it is beneficial. |
|
I'm just pointing out that delaying lowering decision till matching phase neither makes scheduling easier nor makes implementation simpler. For loop opts it is important to know when loops contain calls and act accordingly (by trying to hoist relevant nodes out of loops and disabling some optimizations when the calls are still there). The difference between CFG nodes effectively pinned AT some point and non-CFG nodes with control dependency (effectively pushing them UNDER their control input) becomes insignificant once CFG nodes depend solely on control. In other words, once a call node doesn't consume/produce memory and I/O states, it becomes straightforward to move it around in CFG when desired (between it's inputs and users). Speaking of scheduling, would default scheduling heuristics do a good job? The case of expensive nodes exemplifies the need of custom scheduling heuristics for such nodes. Implementation-wise, lowering during matching becomes platform-specific and requires each platform to introduce
The current implementation of expensive nodes can definitely be improved, but the nice property it has is that it only decreases the number of nodes through careful commoning during loop opts. Once cloning is allowed, there's a new problem to care about: the case of too many clones. A simple incremental improvement would be to teach |
|
After patient guidance from @iwanowww, I came to a new version whose implementation has very little to do with this one. I'll close it and open a fresh one. Nevertheless, thanks to everyone who looked at it! |
A first part toward a better support of pure functions.
Pure Functions
Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return
NaNor+/-infinityin problematic cases.Scope
We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are macro nodes later expanded into other, regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well.
Implementation Overview
We created here some new node kind for pure calls that are expanded into regular calls during macro expansion. This also allows the removal of
ModDandModFnodes that have their pure equivalent now. They are surprisingly hard to unify with other floating point functions from an implementation point of view!IR framework and IGV needed a little bit of fixing.
Thanks,
Marc
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24966/head:pull/24966$ git checkout pull/24966Update a local copy of the PR:
$ git checkout pull/24966$ git pull https://git.openjdk.org/jdk.git pull/24966/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 24966View PR using the GUI difftool:
$ git pr show -t 24966Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24966.diff
Using Webrev
Link to Webrev Comment