Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8272138: ZGC: Adopt relaxed ordering for self-healing #5046

Closed
wants to merge 3 commits into from

Conversation

@weixlu
Copy link
Contributor

@weixlu weixlu commented Aug 9, 2021

ZGC utilizes self-healing in load barrier to fix references to old objects. Currently, this fixing (ZBarrier::self_heal) adopts memory_order_conservative to guarantee that (1) the slow path (relocate, mark, etc. where we get healed address) always happens before self healing, and (2) the other thread that accesses the same reference is able to access the content at the healed address.
Let us consider changing the bounded releasing CAS in forwarding table installation to unbounded. In this way, the release (OrderAccess::release()) serves as a membar to block any subsequent store operations (forwarding table installation and self healing). More specifically, we can use release(), relaxed_cas() in the forwarding table and relaxed_cas() in the self healing. In the scenario when a thread reads healed address from forwarding table, relaxed_cas() is also fine due to the load-acquire in forwarding table.
Since release is done only in forwarding table, the majority of self heals, which only remap the pointer, is no longer restricted by the memory order. Performance is visibly improved, demonstrated by benchmark corretto/heapothesys on AArch64. The optimized version decreases both average concurrent mark/relocation time by 15%-20% and 10%-15%, respectively。

baseline
[1000.412s][info   ][gc,stats    ]       Phase: Concurrent Relocate                           0.000 / 0.000       161.031 / 232.546     147.641 / 289.855     147.641 / 289.855     ms
[1100.412s][info   ][gc,stats    ]       Phase: Concurrent Relocate                         220.739 / 220.739     162.373 / 232.546     149.909 / 289.855     149.909 / 289.855     ms
[1200.411s][info   ][gc,stats    ]       Phase: Concurrent Relocate                           0.000 / 0.000       157.900 / 232.546     148.912 / 289.855     148.912 / 289.855     ms
[1000.412s][info   ][gc,stats    ]       Phase: Concurrent Mark                               0.000 / 0.000       816.505 / 1508.676    759.533 / 1567.506    759.533 / 1567.506    ms
[1100.412s][info   ][gc,stats    ]       Phase: Concurrent Mark                            1262.680 / 1262.680    838.750 / 1508.676    772.495 / 1567.506    772.495 / 1567.506    ms
[1200.411s][info   ][gc,stats    ]       Phase: Concurrent Mark                               0.000 / 0.000       813.220 / 1422.825    769.346 / 1567.506    769.346 / 1567.506    ms
optimized
[1000.447s][info   ][gc,stats    ]       Phase: Concurrent Relocate                         248.035 / 248.035     162.719 / 253.498     124.061 / 273.227     124.061 / 273.227     ms
[1100.447s][info   ][gc,stats    ]       Phase: Concurrent Relocate                         217.434 / 217.434     161.209 / 253.498     126.767 / 273.227     126.767 / 273.227     ms
[1200.447s][info   ][gc,stats    ]       Phase: Concurrent Relocate                           0.000 / 0.000       164.023 / 253.498     129.314 / 273.227     129.314 / 273.227     ms
[1000.447s][info   ][gc,stats    ]       Phase: Concurrent Mark                            1224.059 / 1224.059    809.838 / 1518.214    582.867 / 1518.214    582.867 / 1518.214    ms
[1100.447s][info   ][gc,stats    ]       Phase: Concurrent Mark                            1302.235 / 1302.235    793.589 / 1518.214    600.215 / 1518.214    600.215 / 1518.214    ms
[1200.447s][info   ][gc,stats    ]       Phase: Concurrent Mark                               0.000 / 0.000       821.320 / 1518.214    615.187 / 1518.214    615.187 / 1518.214    ms

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8272138: ZGC: Adopt relaxed ordering for self-healing

Reviewers

Contributors

  • Hao Tang <albert.th@alibaba-inc.com>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/5046/head:pull/5046
$ git checkout pull/5046

Update a local copy of the PR:
$ git checkout pull/5046
$ git pull https://git.openjdk.java.net/jdk pull/5046/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 5046

View PR using the GUI difftool:
$ git pr show -t 5046

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/5046.diff

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Aug 9, 2021

👋 Welcome back weixlu! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

Loading

@openjdk openjdk bot added the rfr label Aug 9, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Aug 9, 2021

@weixlu Could not parse Hao Tang albert.th@alibaba-inc.com as a valid contributor.
Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:

  • /contributor add @openjdk-bot
  • /contributor add duke
  • /contributor add J. Duke <duke@openjdk.org>

Loading

@openjdk
Copy link

@openjdk openjdk bot commented Aug 9, 2021

@weixlu The following label will be automatically applied to this pull request:

  • hotspot-gc

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

Loading

@openjdk openjdk bot added the hotspot-gc label Aug 9, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Aug 9, 2021

Loading

@weixlu
Copy link
Contributor Author

@weixlu weixlu commented Aug 9, 2021

/contributor add Hao Tan albert.th@alibaba-inc.com

Loading

@openjdk
Copy link

@openjdk openjdk bot commented Aug 9, 2021

@weixlu
Contributor Hao Tan <albert.th@alibaba-inc.com> successfully added.

Loading

@weixlu
Copy link
Contributor Author

@weixlu weixlu commented Aug 9, 2021

/contributor add Hao Tang albert.th@alibaba-inc.com

Loading

@openjdk
Copy link

@openjdk openjdk bot commented Aug 9, 2021

@weixlu
Contributor Hao Tang <albert.th@alibaba-inc.com> successfully added.

Loading

@weixlu
Copy link
Contributor Author

@weixlu weixlu commented Aug 9, 2021

/contributor remove Hao Tan albert.th@alibaba-inc.com

Loading

@openjdk
Copy link

@openjdk openjdk bot commented Aug 9, 2021

@weixlu
Contributor Hao Tan <albert.th@alibaba-inc.com> successfully removed.

Loading

Copy link
Contributor

@fisk fisk left a comment

The ordering for relocation, not to expose bytes from before a remote thread copy, is already handled appropriately in the forwarding table itself. Marking does not need to happen before self-healing. In other words, the release serves no purpose, as far as I can see. At least not in the jdk-jdk repository. There is a loom dependency in self-heal ordering, but I volunteer to fix that before their next rebase. So I would suggest having relaxed ordering instead, for the CAS. I wonder if that gives you more performance boost.

Loading

@weixlu
Copy link
Contributor Author

@weixlu weixlu commented Aug 9, 2021

@fisk Thanks for your comment. Yes, there is no need for marking to happen before self healing. But when an object needs to be copied to to-space in relocation phase, I guess self healing should not happen before copy. Otherwise the other thread may get a healed reference which hasn't finished copy yet. Since this pull request (#4763) adopts release ordering for forwarding table access, maybe we need the release semantic here in self healing to prevent the reorder.

Loading

@fisk
Copy link
Contributor

@fisk fisk commented Aug 9, 2021

A thread that copies the object and self heals will conceptually do the following, assuming relaxed memory ordering:

copy();
release();
cas_forwarding_table();
cas_self_heal();

The release before casing in the forwading table, acts as a release for both accesses, in the scenario when the copy is being published. So in the scenario you describe, the release in the forwarding table is already enough, to ensure that anyone reading the self healed pointer, is guaranteed to not observe bytes from before the copy. In the scenario when one thread performs the copy that gets published to the forwarding table, and another thread self-heals the pointer with the value acquired from the forwarding table, we will indeed not have a release to publish the pointer, only an acquire used to read from the forwarding table. However, this is fine, as the new MCA ARMv8 architecture does not allow causal consistency violations like WRC (cf. https://dl.acm.org/doi/pdf/10.1145/3158107 section 4). So we no longer need to use acquire/release to guarantee causal consistency across threads. This would naturally not hold for PPC, but there is no PPC port for ZGC yet.

It is interesting though that when loading a self-healed pointer, we do not perform any acquire. That is fine when dereferencing the loaded pointer, as a dependent load, eliding the need for an acquire. And that is indeed fine for the JIT compiled code, because we know it is always a dependent load (or safe in other ways). However, for the C++ code, we can not guarantee that there will be a dependent load in a spec conforming way. That might be something to look into. In practice, there isn't any good reason why reading and oop and then dereferencing it wouldn't yield a dependent load, but the spec doesn't promise anything and could in theory allow compilers to mess this up. However, having an acquire for every oop load in the runtime does sound a bit costly. The memory_order_consume semantics were supposed to solve this, but I'm not sure if the compilers have yet become good at doing something useful with that, other than just having it be equivalent to acquire. Might be something to check out in the disassembly to see what it yields. But that is an exercise for another day, as this isn't an issue you are introducing with this patch.

Hope this helps explain my thoughts in more detail.

Loading

@weixlu
Copy link
Contributor Author

@weixlu weixlu commented Aug 10, 2021

@fisk Thanks a lot for your detailed explanation. But I’m quite confused about the release in the forwarding table, why is it able to act as a release for both accesses? In my view, since the release is “bonded” to forwarding table, it only ensures that copy happens before installing forwardee, why does it have something to do with self healing? From your explanation, I guess release is not “bonded” to forwarding table. Instead, it maybe serves as membar to block all the CASes afterwards. Thanks again for your patient explanation.

Loading

@tanghaoth90
Copy link
Contributor

@tanghaoth90 tanghaoth90 commented Aug 10, 2021

@fisk Hi, Eric. We are wondering if one thread loading a healed pointer can observe the corresponding copy has not finished yet. Assuming relaxed ordering for cas_self_heal, both Thread A and Thread B are loading the same reference.

Thread A: load obj.fld; // will relocate the object referenced by obj.fld
thread A will do the following:

1    copy();
2    cas_forwarding_table(); // release
3    cas_self_heal(); // relaxed

Thread B: load obj.fld; // load the same reference
thread B may obverses the following reordering of thread A:

3    cas_self_heal(); // relaxed
1    copy();
2    cas_forwarding_table(); // release

To our knowledge, release ordering in line 2 does not prevent line 3 to be reordering before line 1, which indicates the release in the forwarding table is not enough. Perhaps we need to add acquire ordering to line 2 or add release ordering to line 3.

In another way, as @weixlu said,

Instead, it maybe serves as membar to block all the CASes afterwards.

relaxed ordering in line 2 along with release ordering in line 3 can indeed ensure thread B always observes the object copy.

Looking forward to your advice.

Loading

@fisk
Copy link
Contributor

@fisk fisk commented Aug 10, 2021

@fisk Thanks a lot for your detailed explanation. But I’m quite confused about the release in the forwarding table, why is it able to act as a release for both accesses? In my view, since the release is “bonded” to forwarding table, it only ensures that copy happens before installing forwardee, why does it have something to do with self healing? From your explanation, I guess release is not “bonded” to forwarding table. Instead, it maybe serves as membar to block all the CASes afterwards. Thanks again for your patient explanation.

What I wrote kind of assumes that we have an explicit OrderAccess::release(); relaxed_cas(); in the forwarding table. Looks like we currently have a bounded releasing cas instead. The unbounded release is guaranteed to provide release semantics for any subsequent store operation. However, a bounded Store-Release, (i.e. stlr) instruction only needs to provide the release semantics to the bound store, as you say. So I suppose what I propose is to use release(); relaxed_cas(); in the forwarding table, and then also relaxed_cas(); in the self healing. Letting the release be done in only one place, and only when actual copying happens. The vast majority of self heals I have found are purely remapping the pointer lazily, which can then dodge any need for release when self healing. Hope this makes sense.

Loading

@fisk
Copy link
Contributor

@fisk fisk commented Aug 10, 2021

@fisk Hi, Eric. We are wondering if one thread loading a healed pointer can observe the corresponding copy has not finished yet. Assuming relaxed ordering for cas_self_heal, both Thread A and Thread B are loading the same reference.

Thread A: load obj.fld; // will relocate the object referenced by obj.fld
thread A will do the following:

1    copy();
2    cas_forwarding_table(); // release
3    cas_self_heal(); // relaxed

Thread B: load obj.fld; // load the same reference
thread B may obverses the following reordering of thread A:

3    cas_self_heal(); // relaxed
1    copy();
2    cas_forwarding_table(); // release

To our knowledge, release ordering in line 2 does not prevent line 3 to be reordering before line 1, which indicates the release in the forwarding table is not enough. Perhaps we need to add acquire ordering to line 2 or add release ordering to line 3.

In another way, as @weixlu said,

Instead, it maybe serves as membar to block all the CASes afterwards.

relaxed ordering in line 2 along with release ordering in line 3 can indeed ensure thread B always observes the object copy.

Looking forward to your advice.

Yeah so I was kind of assuming the forwarding table installation would be an unbound release(); relaxed_cas();
That way, the release serves a dual purpose: releasing for the table and releasing for the self heal. That way the vast majority of self heals (that only do remapping), won't need to release.

Loading

@weixlu weixlu reopened this Aug 10, 2021
@openjdk openjdk bot removed the rfr label Aug 10, 2021
@@ -74,6 +74,7 @@ static uintptr_t relocate_object_inner(ZForwarding* forwarding, uintptr_t from_a
// Copy object
ZUtils::object_copy_disjoint(from_addr, to_addr, size);

OrderAccess::release();
Copy link
Contributor

@pliden pliden Aug 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please replace the calls to OrderAccess::release() with a single call here instead:

+++ b/src/hotspot/share/gc/z/zForwarding.inline.hpp
@@ -136,6 +136,8 @@ inline uintptr_t ZForwarding::insert(uintptr_t from_index, uintptr_t to_offset,
   const ZForwardingEntry new_entry(from_index, to_offset);
   const ZForwardingEntry old_entry; // Empty
 
+  OrderAccess::release();
+
   for (;;) {
     const ZForwardingEntry prev_entry = Atomic::cmpxchg(entries() + *cursor, old_entry, new_entry, memory_order_relaxed);
     if (!prev_entry.populated()) {

While keeping the calls close to the copy code has some value, I think I'd prefer to have it in a single place, and close to the corresponding acquire in at().

Loading

Copy link
Contributor

@pliden pliden Aug 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated my comment above, since it had a typo (used memory_order_release instead of memory_order_relaxed).

Loading

Copy link
Contributor Author

@weixlu weixlu Aug 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it's better to add the fence in one single place.

Loading

Copy link
Contributor Author

@weixlu weixlu Aug 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated my comment above, since it had a typo (used memory_order_release instead of memory_order_relaxed).

I have replaced it. Thanks for your advice.

Loading

@openjdk openjdk bot added the rfr label Aug 10, 2021
@weixlu
Copy link
Contributor Author

@weixlu weixlu commented Aug 10, 2021

@fisk Thanks a lot for your detailed explanation. But I’m quite confused about the release in the forwarding table, why is it able to act as a release for both accesses? In my view, since the release is “bonded” to forwarding table, it only ensures that copy happens before installing forwardee, why does it have something to do with self healing? From your explanation, I guess release is not “bonded” to forwarding table. Instead, it maybe serves as membar to block all the CASes afterwards. Thanks again for your patient explanation.

What I wrote kind of assumes that we have an explicit OrderAccess::release(); relaxed_cas(); in the forwarding table. Looks like we currently have a bounded releasing cas instead. The unbounded release is guaranteed to provide release semantics for any subsequent store operation. However, a bounded Store-Release, (i.e. stlr) instruction only needs to provide the release semantics to the bound store, as you say. So I suppose what I propose is to use release(); relaxed_cas(); in the forwarding table, and then also relaxed_cas(); in the self healing. Letting the release be done in only one place, and only when actual copying happens. The vast majority of self heals I have found are purely remapping the pointer lazily, which can then dodge any need for release when self healing. Hope this makes sense.

Yes, it's better to use OrderAccess::release() with two relaxed_cas in forwarding table and self heal. I have made some changes and will run benchmarks afterwards. I believe your advice helps improve performace.

Loading

@@ -136,8 +136,12 @@ inline uintptr_t ZForwarding::insert(uintptr_t from_index, uintptr_t to_offset,
const ZForwardingEntry new_entry(from_index, to_offset);
const ZForwardingEntry old_entry; // Empty

// Make sure that object copy is finished
// before forwarding table installation
OrderAccess::release();
Copy link
Contributor

@tanghaoth90 tanghaoth90 Aug 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we replace it by OrderAccess::storestore()? I think LoadStore constraint in OrderAccess::release() is not required here.
Furthermore, is it possible to relax Atomic::load_acquire in ZForwarding::at?

Loading

Copy link
Member

@albertnetymk albertnetymk Aug 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe further relaxing is possible, but it's unclear whether there will be observable perf improvement. Could be interesting to see some numbers before drawing any conclusion.

Loading

Copy link
Contributor

@tanghaoth90 tanghaoth90 Aug 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@albertnetymk Thanks for your suggestion.

Can we replace it by OrderAccess::storestore()?

OrderAccess::release() has the same implementation as OrderAccess::storestore() on AArch64. Therefore, replacing OrderAccess::release() does not yield perf improvement.
I found an interesting pull request related to StoreStore: #427 (not integrated yet). I am not aware that this patch can make any improvement.

Furthermore, is it possible to relax Atomic::load_acquire in ZForwarding::at?

My initial experiment suggests >5% concurrent mark time reduction using Atomic::load instead.
I am still working on checking the correctness of the relaxation.

Loading

@weixlu
Copy link
Contributor Author

@weixlu weixlu commented Aug 11, 2021

After changing the bounded release to unbounded and relaxed_cas, concurrent mark/relocate time has decreased by 10%-20% according to corretto/heapothesys benchmark. In contrast, the original optimization only witnessed 5%-10% decrease.

Loading

fisk
fisk approved these changes Aug 11, 2021
Copy link
Contributor

@fisk fisk left a comment

Looks good.

Loading

@openjdk
Copy link

@openjdk openjdk bot commented Aug 11, 2021

@weixlu This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8272138: ZGC: Adopt relaxed ordering for self-healing

Co-authored-by: Hao Tang <albert.th@alibaba-inc.com>
Reviewed-by: eosterlund, pliden

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 8 new commits pushed to the master branch:

  • 5350b99: 8272131: PhaseMacroExpand::generate_slow_arraycopy crash when clone null CallProjections.fallthrough_ioproj
  • 1489352: 8271718: Crash when during color transformation the color profile is replaced
  • 2a9acc3: 8272050: typo in MachSpillCopyNode::implementation after JDK-8131362
  • b62e742: 8213714: AttachingConnector/attach/attach001 failed due to "bind failed: Address already in use"
  • 66d1faa: 8271601: Math.floorMod(int, int) and Math.floorMod(long, long) differ in their logic
  • 57ae9fb: 8140442: Add getOutermostTypeElement to javax.lang.model utility class
  • 67869b4: 8270137: Kerberos Credential Retrieval from Cache not Working in Cross-Realm Setup
  • 35b399a: 8269130: Replace usages of Collection.toArray() with Collection.toArray(T[]) to avoid redundant array copying

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@fisk, @pliden) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

Loading

@openjdk openjdk bot added the ready label Aug 11, 2021
@weixlu weixlu changed the title 8272138: ZGC: Adopt release ordering for self-healing 8272138: ZGC: Adopt relaxed ordering for self-healing Aug 11, 2021
@openjdk openjdk bot added ready and removed ready labels Aug 11, 2021
pliden
pliden approved these changes Aug 11, 2021
Copy link
Contributor

@pliden pliden left a comment

Looks good.

Loading

@weixlu
Copy link
Contributor Author

@weixlu weixlu commented Aug 11, 2021

@fisk @pliden Thanks for your advice and reviewing. Could you please sponsor this change?

Loading

@weixlu
Copy link
Contributor Author

@weixlu weixlu commented Aug 11, 2021

/integrate

Loading

@openjdk openjdk bot added the sponsor label Aug 11, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Aug 11, 2021

@weixlu
Your change (at version 3b8ed5f) is now ready to be sponsored by a Committer.

Loading

@pliden
Copy link
Contributor

@pliden pliden commented Aug 11, 2021

/sponsor

Loading

@openjdk
Copy link

@openjdk openjdk bot commented Aug 11, 2021

Going to push as commit 846cc88.
Since your change was applied there have been 8 commits pushed to the master branch:

  • 5350b99: 8272131: PhaseMacroExpand::generate_slow_arraycopy crash when clone null CallProjections.fallthrough_ioproj
  • 1489352: 8271718: Crash when during color transformation the color profile is replaced
  • 2a9acc3: 8272050: typo in MachSpillCopyNode::implementation after JDK-8131362
  • b62e742: 8213714: AttachingConnector/attach/attach001 failed due to "bind failed: Address already in use"
  • 66d1faa: 8271601: Math.floorMod(int, int) and Math.floorMod(long, long) differ in their logic
  • 57ae9fb: 8140442: Add getOutermostTypeElement to javax.lang.model utility class
  • 67869b4: 8270137: Kerberos Credential Retrieval from Cache not Working in Cross-Realm Setup
  • 35b399a: 8269130: Replace usages of Collection.toArray() with Collection.toArray(T[]) to avoid redundant array copying

Your commit was automatically rebased without conflicts.

Loading

@openjdk
Copy link

@openjdk openjdk bot commented Aug 11, 2021

@pliden @weixlu Pushed as commit 846cc88.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
5 participants