@marc-chevalier
Member

@marc-chevalier marc-chevalier commented Apr 30, 2025

A first step toward better support for pure functions.

Pure Functions

Pure functions (as considered here) are functions that have no side effects, no effect on control flow (no exceptions or the like), cannot deopt, etc. A pure function is one that you can execute anywhere, with any arguments, with no effect other than wasting time. Integer division is not pure, since dividing by zero throws. But many floating-point functions just return NaN or +/-infinity in problematic cases.
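To make the distinction concrete, here is a minimal Java-level sketch (not C2 code). The class and method names (`PurityDemo`, `floatRem`, `intDivThrows`) are made up for this illustration. It only shows the observable behaviour that the definition relies on: floating-point operations return NaN or an infinity for problematic inputs, while integer division by zero throws.

```java
final class PurityDemo {
    // Floating-point remainder never throws: a zero divisor yields NaN.
    static double floatRem(double a, double b) {
        return a % b;
    }

    // Integer division is not pure in the sense above: a zero divisor throws.
    static boolean intDivThrows(int a, int b) {
        try {
            int unused = a / b;
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(floatRem(1.0, 0.0));  // NaN
        System.out.println(Math.log(0.0));       // -Infinity
        System.out.println(intDivThrows(1, 0));  // true
    }
}
```

Dropping an unused `floatRem` call cannot change program behaviour, whereas dropping an unused `a / b` could hide an exception.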

Scope

We are not going all-powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are deliberately not part of this PR. In particular, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are macro nodes later expanded into other, regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well.
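The removal of unused pure calls can be pictured with a toy graph. This is only an illustrative sketch in Java, not the C2 implementation: the `Node` class, its `pure` flag, and `removeDeadPure` are hypothetical names invented for this example, and real C2 nodes track their users differently.

```java
import java.util.ArrayList;
import java.util.List;

final class PureCallSketch {
    // Hypothetical toy node; this is not C2's Node class.
    static final class Node {
        final String name;
        final boolean pure;
        final List<Node> inputs = new ArrayList<>();
        int uses = 0;

        Node(String name, boolean pure, Node... ins) {
            this.name = name;
            this.pure = pure;
            for (Node in : ins) {
                inputs.add(in);
                in.uses++;
            }
        }
    }

    // Repeatedly drop pure nodes that have no users. Removing one node
    // may leave its inputs without users, so iterate to a fixed point.
    static List<Node> removeDeadPure(List<Node> graph) {
        List<Node> live = new ArrayList<>(graph);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = live.size() - 1; i >= 0; i--) {
                Node n = live.get(i);
                if (n.pure && n.uses == 0) {
                    for (Node in : n.inputs) in.uses--;
                    live.remove(i);
                    changed = true;
                }
            }
        }
        return live;
    }

    static List<String> demo() {
        Node x = new Node("x", false);                       // parameter, always kept
        Node sin = new Node("sin(x)", true, x);              // only used by pow
        Node pow = new Node("pow(sin(x), x)", true, sin, x); // result unused
        Node ret = new Node("return x", false, x);           // not pure, kept
        List<String> names = new ArrayList<>();
        for (Node n : removeDeadPure(List.of(x, sin, pow, ret))) names.add(n.name);
        return names;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [x, return x]
    }
}
```

In this sketch the unused `pow` goes first, which in turn leaves `sin` without users, so both calls disappear while the non-pure nodes stay.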

Implementation Overview

This PR introduces a new node kind for pure calls, which is expanded into regular calls during macro expansion. This also allows removing the ModD and ModF nodes, which now have pure-call equivalents. They are surprisingly hard to unify with other floating-point functions from an implementation point of view!

The IR framework and IGV needed a little bit of fixing.

Thanks,
Marc


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8347901: C2 should remove unused leaf / pure runtime calls (Enhancement - P3)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24966/head:pull/24966
$ git checkout pull/24966

Update a local copy of the PR:
$ git checkout pull/24966
$ git pull https://git.openjdk.org/jdk.git pull/24966/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 24966

View PR using the GUI difftool:
$ git pr show -t 24966

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24966.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Apr 30, 2025

👋 Welcome back mchevalier! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Apr 30, 2025

@marc-chevalier This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8347901: C2 should remove unused leaf / pure runtime calls

Reviewed-by: kvn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 758 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk

openjdk bot commented Apr 30, 2025

@marc-chevalier The following labels will be automatically applied to this pull request:

  • graal
  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added graal graal-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org labels Apr 30, 2025
@marc-chevalier marc-chevalier marked this pull request as ready for review May 2, 2025 08:04
@openjdk openjdk bot added the rfr Pull request is ready for review label May 2, 2025
@mlbridge

mlbridge bot commented May 2, 2025

Webrevs

@vnkozlov
Contributor

vnkozlov commented May 3, 2025

Hi @marc-chevalier

doesn't propose a way to move pure calls around

I agree that we should not do that in these changes.

But did you consider moving/cloning such a call (the new macro node) down to its "users" in case the result is not used on some paths? The calls would then be executed only where they are needed. And I think it is safe, since the current control dominates the paths where the result is used.

@marc-chevalier
Member Author

I've considered it, but rather for a follow-up. My thought was to first introduce the node types, removal mechanics and such, but keep the node pinned by control and not touch that in this change. In the follow-up, I was hoping I would have "just" the control-pinning problem to address.

Moving the calls down may be beneficial when the result is not used in a branch (we then save the call when executing the branch that doesn't use it), but if the use is in a loop, we would rather have the call stay (or be hoisted) before the loop. The heuristic "out of as many loops as possible, and as late as possible" seems to apply here as well.
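The trade-off can be sketched at the Java source level. This is only an illustration of where a call ends up being executed, not of how C2 schedules nodes. `PlacementSketch`, `expensive`, and the `calls` counter are invented for the example; the counter exists purely to count executions (it makes the stand-in technically impure).

```java
final class PlacementSketch {
    static int calls = 0;

    // Stand-in for an expensive pure runtime call.
    static double expensive(double x) {
        calls++;
        return Math.pow(x, 1.5);
    }

    // Call placed before the branch: executed even when the result is unused.
    static double early(double x, boolean flag) {
        double r = expensive(x);
        return flag ? r : 0.0;
    }

    // Call sunk under the branch: skipped on the path that ignores the result.
    static double sunk(double x, boolean flag) {
        return flag ? expensive(x) : 0.0;
    }

    // Call sunk all the way to a use inside a loop: executed once per iteration.
    static double sunkIntoLoop(double x, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) sum += expensive(x);
        return sum;
    }

    // Call kept (or hoisted) before the loop: executed once.
    static double hoisted(double x, int n) {
        double r = expensive(x);
        double sum = 0.0;
        for (int i = 0; i < n; i++) sum += r;
        return sum;
    }
}
```

With `flag == false`, `early` performs one call and `sunk` performs none; with `n == 10`, `sunkIntoLoop` performs ten calls and `hoisted` performs one. This is why "as late as possible" has to be balanced against "out of as many loops as possible".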

Contributor

@vnkozlov vnkozlov left a comment

Nice work.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label May 5, 2025
Contributor

@iwanowww iwanowww left a comment

Good work, Marc.

High-level comment: I don't know what the future plans are, but as the patch stands now, it feels like it complicates both the design and the implementation.

The original implementation relies on macro nodes which are later expanded into leaf runtime calls. What you propose introduces a new concept of "pure calls" which: (1) is not a CallNode anymore; and (2) relies on subclassing (which makes it hard to mix with other node properties). Moreover, I don't see much benefit in committing to a runtime call representation from the very beginning (early in high-level IR).

Going forward, IMO the sweet spot is to support arbitrary nodes to be lowered into leaf runtime calls. You make a big step in that direction by relaxing requirements on PureCall to be just a CFG node (and not a full-blown CallLeaf node). The next step would be to relax the CFG node requirement and let the compiler pick the right place to insert it. (Existing expensive node support in C2 addresses some similar challenges.)

And, as a complementary option, in some cases it may be enough to mark individual call nodes as pure, so they can be pruned later if nobody consumes the result of their computation anymore.

@marc-chevalier
Member Author

Thanks for the comment. I'll think deeper about it.

I started by trying to make PureCall a subclass of Call (or a property of LeafCall), but that broke a lot of things that relied on invariants of CallNode that no longer held. After some time tracking down bugs and trying to fix them, I thought it would be simpler to have a new kind of node, and that it would have less impact on existing code. Another reason I changed it to a direct sub-class of Node is that I felt it made little sense for it to be a Call (or a sub-class of it), since Calls are Safepoints, but pure calls don't need to be (and similar "conceptual" problems). It seemed like a hack to me.

About

support arbitrary nodes to be lowered into leaf runtime calls.

I don't think I understand what you mean. Overall, I see the weaknesses of my design, but I'm not sure which direction to take instead.

@iwanowww
Contributor

iwanowww commented May 6, 2025

support arbitrary nodes to be lowered into leaf runtime calls.

A leaf runtime call which doesn't depend on or change memory state can be inserted at arbitrary points in the graph. So, an arbitrary data node can be lowered into a runtime call once the place to insert it is known/chosen.

Overall, I see the weaknesses of my design, but I'm not sure which direction to take instead.

I suggest experimenting with untangling ModF/ModD from CallLeaf, making them expensive nodes (to avoid commoning during GVN), and still lowering them into CallLeaf.
(It doesn't have to be part of the existing macro expansion. Depending on implementation considerations, earlier or later may be more appropriate. But they should be expanded before RA kicks in.)

The hard part is probably picking a point in the CFG to insert the call, since the control the node has may not be suitable for that (e.g., if inputs don't dominate control anymore). In that case, updating the control input during loop opts may be an option.

@marc-chevalier
Member Author

making them expensive nodes (to avoid commoning during GVN)

Good point!

I still think I don't get everything. Let me try to sum up what I think I should do.

For now, I don't want to mess with control, but I should prepare the ground. Using general Call nodes for pure calls was pretty difficult: Call nodes carry too many opinions and assumptions to be easy to work with for pure calls. But eventually, I want to change the nodes I'm using into a Call node, and more precisely a CallLeaf (I suspect that doing so once I'm done with everything I can do on pure calls, so in macro expansion, is fine). To be able to do this transformation, I need to know control at this point. My goal is to start with control-less nodes, but find the late control during loop optimization, control-pin them at this point (because that's when the information is available) with both control input and output (needed for the expansion in CallLeaf), and continuing with control-pinned nodes. For now, I'm happy with the control I get from parsing.

So, under my nodes, I need 2 outputs: control and data (everywhere now, and at least after control-pinning in the follow-up). I should then make ModFloating/ModD/ModF sub-classes of MultiNode (I guess I can make ModFloating a direct sub-class of MultiNode). And I can introduce new node types for native math calls that would behave similarly with respect to elimination (and pinning in the future), and would also expand into CallLeaf. A weirdness of these nodes is that whether they are CFG nodes would depend on whether they are already pinned, not on their type, but I'm not aware of a fundamental issue with that, as long as the change doesn't happen in the middle of a phase where it's relevant.

@iwanowww
Contributor

My goal is to start with control-less nodes, but find the late control during loop optimization, control-pin them at this point (because that's when the information is available) with both control input and output (needed for the expansion in CallLeaf), and continuing with control-pinned nodes.

If you combine lowering with pinning, you could replace a data node with a CFG node (CallLeaf in your case) at the point in CFG you choose. A single CFG node is enough to insert a CFG-only node, but you need to ensure the graph stays schedulable after the insertion.

If you want to start with a pinned node, the simplest way would be to make CallPure a subclass of CallLeaf, require it to be CFG-only (no memory in/out, no IO, etc.) and populate only control in/out when inserting it into the graph during parsing.

For now, I'm happy with the control I get from parsing.

Keep in mind that it assumes the node is pinned in the CFG from the very beginning. Once the node starts in data-only mode, the control input it gained during parsing may end up too early for the node's inputs to be schedulable.

@merykitty
Member

I think a very simple approach you can take is having CallPureNode as a pure data node. It does not have to have anything to do with CallNode (no lowering into a CallNode, no subclass from CallNode) and it can have its mach implementation like this:

instruct pureCall1F(xmm0 dst, xmm0 src) %{
    match(Set dst (CallPure src));
    effect(CALL);
    format %{ "call /*something*/" %}
    ins_encode %{
        __ call(/*something*/);
    %}
%}

@iwanowww
Contributor

iwanowww commented May 12, 2025

I think a very simple approach you can take is having CallPureNode as a pure data node

It's not as simple as it seems. In order to work reliably, it requires full control of the code being called, so without extra work it is appropriate for generated stubs only. If you want to call some native code the VM doesn't control, then either all caller-saved registers should be preserved across the call (which may be prohibitively expensive) or it should be made explicit that a call is taking place, so all ABI effects are taken into account.

@merykitty
Member

@iwanowww I believe effect(CALL) marks that a call is taking place and the register allocator will know how to save the registers accordingly. Note that on arm, long division is implemented as a call:

instruct divL_reg_reg(R0R1RegL dst, R2R3RegL src1, R0R1RegL src2) %{

And SharedRuntime::ldiv is implemented in C++:

JRT_LEAF(jlong, SharedRuntime::ldiv(jlong y, jlong x))

@iwanowww
Contributor

Interesting! I wasn't aware ADLC already features such support. Thanks for the pointers.

It does look attractive, especially for platform-specific use cases. But there are some pitfalls which make it hard to use on its own. In particular, data nodes are aggressively commoned and freely flow in the graph. Unless it is taken into account during GVN and code motion, the final schedule may end up far from optimal. (In other words, it's highly beneficial to match only expensive nodes in such a way.) Moreover, some optimizations are highly sensitive to the presence of calls. (Think of the consequences of a call scheduled inside a heavily vectorized loop.)

Macro-expansion also suffers from some of those issues, but still IMO an explicit Call node is a more appropriate solution to the problem.

@marc-chevalier
Member Author

I like @merykitty's suggestion, but I don't understand how bad its disadvantages are. Commoning can be prevented as you mentioned above. As for scheduling, isn't it the same problem for many nodes? If we have something like

var x = anObject.aField;  // anObject known to be not null
if (flag) {  // flag independent of `anObject`
  // something with x
} else {
  // [...] nothing with x
}

I don't think there is any ordering between the if and the definition of x, and so we should push the latter under the if. And conversely, if the declaration is already in the branch in the original code, we should not let it float above. Or, in the case of a loop, we should rather put it outside as much as possible. But none of that seems enforced by edges: a memory node is not a CFG node, the nodes in the if (flag) might not use memory (so no memory edges)... The same would be true for an arithmetic node (like AddI, for instance), but we could argue those are cheap (even if, in a loop, cheap becomes expensive), while a memory access is not that cheap.
So, don't the problems we have with @merykitty's pure-call-as-pure-data-node suggestion already exist for other node kinds? And if we had trouble with the scheduling of pure calls, shouldn't we have this kind of issue already?

@merykitty
Member

merykitty commented May 19, 2025

Tbh I don't understand @iwanowww's arguments. We have expensive data nodes such as SqrtD that have control inputs to prevent them from floating too aggressively. Additionally, a CallNode is pinned AT its control input, while a data node is pinned UNDER its control input. That gives the scheduler much more freedom to schedule a data node at a better location compared to a call node.

Ideally, what we want to do with expensive data nodes is to common them aggressively like any other data node. Then, during code motion, we can clone them if it is beneficial.

@iwanowww
Contributor

I'm just pointing out that delaying the lowering decision until the matching phase neither makes scheduling easier nor makes the implementation simpler.

For loop opts it is important to know when loops contain calls and act accordingly (by trying to hoist relevant nodes out of loops and disabling some optimizations when the calls are still there).

The difference between CFG nodes effectively pinned AT some point and non-CFG nodes with a control dependency (effectively pushing them UNDER their control input) becomes insignificant once CFG nodes depend solely on control. In other words, once a call node doesn't consume/produce memory and I/O states, it becomes straightforward to move it around in the CFG when desired (between its inputs and users).

Speaking of scheduling, would default scheduling heuristics do a good job? The case of expensive nodes exemplifies the need of custom scheduling heuristics for such nodes.

Implementation-wise, lowering during matching becomes platform-specific and requires each platform to introduce effect(CALL) AD instructions. Moreover, each call shape (determined by arity and argument kinds) has to be explicitly handled with a dedicated AD instruction. And it doesn't benefit from existing support of call nodes every platform already has.

Ideally, what we want to do with expensive data nodes is to common them aggressively like any other data node. Then, during code motion, we can clone them if it is beneficial.

The current implementation of expensive nodes can definitely be improved, but the nice property it has is that it only decreases the number of nodes through careful commoning during loop opts. Once cloning is allowed, there's a new problem to care about: the case of too many clones.

A simple incremental improvement would be to teach PhaseIdealLoop::process_expensive_nodes() to push expensive nodes closer to their users if they are on less frequent code paths. Then it can be taught (how and when) to clone expensive nodes between multiple users.

@marc-chevalier
Member Author

After patient guidance from @iwanowww, I came to a new version whose implementation has very little to do with this one. I'll close it and open a fresh one. Nevertheless, thanks to everyone who looked at it!

Labels

graal graal-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org ready Pull request is ready to be integrated rfr Pull request is ready for review

4 participants