Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement self-forwarding of objects that preserves header bits #10

Closed
wants to merge 4 commits into from

Conversation

rkennke
Copy link
Collaborator

@rkennke rkennke commented Jun 28, 2021

In a few places in GCs we self-forward objects to indicate promotion failures. This is problematic with Lilliput because it irreversably overrides header bits.

I propose to address this problem by using the recently-free'd biased-locking bit to indicate self-forwarding, without actually writing the ptr-to-self in the header.

A few notes:

  • The code in g1FullGCCompactionPoint.cpp keeps degenerating into ugly mess. This certainly warrants some rewriting.
  • We have some naked header-decodings, which get tidied-up. This could also be brought upstream.
  • cas_forward_to() kinda duplicates forward_to_atomic(), and has been replaced by the new forward_to_self_atomic(). It could be unduplicated upstream, too.

An alternative may be to preserve the header of self-forwarded objects in a side-table (like PreservedMarksStack) instead. This may be possible but hairy: we could not access the compressed-klass* in the upper bits until the header gets restored. (This also affects calls to size(), etc).

The ex-biased-locking-bit may still be used in regular operation. It only acts as self-forwarding-indicator when the lower two bits are also set. It requires the usual marks-preservation if we do this.

We might want to have a discussion which project would need header bits, and how to realistically allocate them. #4522 mentions Valhalla as possible taker of the BL header bit. We may be able to free one or two bits when we compress the klass* even more. For example, we currently use 25 bits for i-hash, and 32 bits for nklass*. We usually want 32bits for i-hash instead. This would leave 25bits for nklass, which can address 268MB of Klass*-space (usual compression scheme), or 32million classes (table-lookup), or something in-between if we use fixed-size Klass (seems unrealistic though). Taking away another bit mean halving the addressable space.

(It would be kinda-nice to have the BL-bit for Shenandoah, too, and for a similar purpose: indicate evacuation failure. But we do have a working solution, however that is ugly and affects performance a little.)

Testing:

  • tier1
  • hotspot_gc
  • hotspot_gc_shenandoah

Progress

  • Change must not contain extraneous whitespace
  • Change must be properly reviewed

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/lilliput pull/10/head:pull/10
$ git checkout pull/10

Update a local copy of the PR:
$ git checkout pull/10
$ git pull https://git.openjdk.java.net/lilliput pull/10/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 10

View PR using the GUI difftool:
$ git pr show -t 10

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/lilliput/pull/10.diff

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Jun 28, 2021

👋 Welcome back rkennke! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Jun 28, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Jun 28, 2021

Webrevs

@rose00
Copy link
Collaborator

@rose00 rose00 commented Jun 29, 2021

The Valhalla project has a strong requirement for very fast detection of references to primitive objects (the term was previously inline objects), because the behavior of the very frequent acmp operation differs between references to primitives and to regular identity objects. After much experiment, the prototype has settled on testing the biased-lock bit in the mark word to see if the reference points to a primitive object instead of a regular one. An extra indirection through the klass pointer would be too slow.

Valhalla defines three additional object header bits, for more limited purposes. (H/T to Fred Parain for reminding me of these.) For primitive objects only, there is a "larval" state which sometimes needs to be marked, on an unpublished object which is under construction. This bit is related to the "always locked" bit; it is a stronger condition. It seems likely to me that whatever provision there is for short-circuiting attempted lock operations on primitive objects will easily expand to another provision for marking larval objects, by using related state bits in the header (assuming header locking consumes more than one bit).

For arrays, there are two bits which accelerate slow paths relating to arrays whose elements are restricted to primitive classes (aka inline objects). The two bits encode whether (a) the array elements are stored as null-free references (internally boxed primitives) or (b) the array elements are stored flat as inline primitive objects (the preferred case). These bits accelerate array access operations. They are present only on 64-bit VMs; when bits are scarce they are loaded via the klass pointer. This would seem to be a reasonable tactic for Lilliput also, if bits are scarce. Another tactic for Lilliput might be to pack such array-specific flags into an extra field just for arrays, adjacent to the length field.

The relevant declarations are in markOop.hpp:

https://github.com/openjdk/valhalla/blob/b62a66e379fda00a5ba4622da7661c89b380ecdf/src/hotspot/share/oops/markWord.hpp#L97

(As a completely separate technique, header bits could be migrated to pointer-color bits on a 64-bit VM. If Lilliput explores this territory, three out of the above four bits could be moved to the pointer, since they are immutable object properties. The larval bit is slightly stateful, undergoing a single state transition just before the object is published.)

@rkennke
Copy link
Collaborator Author

@rkennke rkennke commented Jun 30, 2021

The Valhalla project has a strong requirement for very fast detection of references to primitive objects (the term was previously inline objects), because the behavior of the very frequent acmp operation differs between references to primitives and to regular identity objects. After much experiment, the prototype has settled on testing the biased-lock bit in the mark word to see if the reference points to a primitive object instead of a regular one. An extra indirection through the klass pointer would be too slow.

Valhalla defines three additional object header bits, for more limited purposes. (H/T to Fred Parain for reminding me of these.) For primitive objects only, there is a "larval" state which sometimes needs to be marked, on an unpublished object which is under construction. This bit is related to the "always locked" bit; it is a stronger condition. It seems likely to me that whatever provision there is for short-circuiting attempted lock operations on primitive objects will easily expand to another provision for marking larval objects, by using related state bits in the header (assuming header locking consumes more than one bit).

For arrays, there are two bits which accelerate slow paths relating to arrays whose elements are restricted to primitive classes (aka inline objects). The two bits encode whether (a) the array elements are stored as null-free references (internally boxed primitives) or (b) the array elements are stored flat as inline primitive objects (the preferred case). These bits accelerate array access operations. They are present only on 64-bit VMs; when bits are scarce they are loaded via the klass pointer. This would seem to be a reasonable tactic for Lilliput also, if bits are scarce. Another tactic for Lilliput might be to pack such array-specific flags into an extra field just for arrays, adjacent to the length field.

The relevant declarations are in markOop.hpp:

https://github.com/openjdk/valhalla/blob/b62a66e379fda00a5ba4622da7661c89b380ecdf/src/hotspot/share/oops/markWord.hpp#L97

(As a completely separate technique, header bits could be migrated to pointer-color bits on a 64-bit VM. If Lilliput explores this territory, three out of the above four bits could be moved to the pointer, since they are immutable object properties. The larval bit is slightly stateful, undergoing a single state transition just before the object is published.)

Thanks, John, for the detailed explanations!

For this PR, I'd like to take the BL bit. But I believe it should be ok, because it's only relevant during GC and only if the lowest two bits are also set, when doing self-forwarding in case of a promotion failure. In this scenario, the header bits would be overwritten anyway, and GC would have to ensure that the header (including the BL bit) is preserved if there is anything interesting in the lower header bits.

Roman

@rose00
Copy link
Collaborator

@rose00 rose00 commented Jul 1, 2021

For this PR, I'd like to take the BL bit. But I believe it should be ok, because it's only relevant during GC and only if the lowest two bits are also set, when doing self-forwarding in case of a promotion failure. In this scenario, the header bits would be overwritten anyway, and GC would have to ensure that the header (including the BL bit) is preserved if there is anything interesting in the lower header bits.

I think I agree, given that the BL bit will be used only while the header is in the special forwarding state and/or during a safepoint.

Except for the larval marking (which is akin to a locking state), these Valhalla header bits are really hoisted copies of properties that can be reconstituted from the Klass block associated with the object. So, worst case, the GC might blow them away in favor of a forwarding pointer, and then "heal" them by re-fetching header bits from the Klass. The larval marking should probably be healed (if necessary) by whatever mechanism takes care of lock states.

Copy link
Collaborator

@shipilev shipilev left a comment

This looks generally fine for the experimental code (i.e. no obvious bugs), I just have a few questions.

oop forwarded_to = obj->forwardee();
if (obj->is_forwarded() && !_current->is_in(forwarded_to)) {
return obj->size();
if (obj->is_forwarded()) {
Copy link
Collaborator

@shipilev shipilev Jul 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this seems like a nice micro-optimization for upstream. Do you want to take care of it?

Copy link
Collaborator Author

@rkennke rkennke Jul 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naa, this doesn't really do much, unless we extract the header once, and then check is_forwarded() on it, and avoid decoding if it's not forwarded - but even then this seems to be too micro ;-)
Here it is not an optimization but a necessary change, because we cannot blindly decode the forwardee anymore.

src/hotspot/share/oops/markWord.hpp Outdated Show resolved Hide resolved
src/hotspot/share/oops/oop.inline.hpp Outdated Show resolved Hide resolved
src/hotspot/share/oops/oop.hpp Show resolved Hide resolved
@openjdk
Copy link

@openjdk openjdk bot commented Jul 7, 2021

@rkennke This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

Implement self-forwarding of objects that preserves header bits

Reviewed-by: shade

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Jul 7, 2021
@rkennke
Copy link
Collaborator Author

@rkennke rkennke commented Jul 9, 2021

/integrate

@openjdk
Copy link

@openjdk openjdk bot commented Jul 9, 2021

Going to push as commit 0130503.

@openjdk openjdk bot closed this Jul 9, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Jul 9, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Jul 9, 2021

@rkennke Pushed as commit 0130503.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Copy link
Collaborator

@tschatzl tschatzl left a comment

Sorry, I'm late, but I only now got the time to look into this.

@@ -234,7 +234,7 @@ void G1ParCopyClosure<barrier, should_mark>::do_oop_work(T* p) {
oop forwardee;
markWord m = obj->mark();
if (m.is_marked()) {
forwardee = cast_to_oop(m.decode_pointer());
forwardee = obj->forwardee(m);
Copy link
Collaborator

@tschatzl tschatzl Jul 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code has been done that way intentionally: the changed code reloads the value from the header from memory (from what I can tell from generated code), while the original code does not, just reusing the value in the register.

Even if this is a nano-nano-optimization I would prefer to keep it as is. Some of the other changes seem to cause very similar "regressions".

(Yeah, I'm late, sorry).

Copy link
Collaborator Author

@rkennke rkennke Jul 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code has been done that way intentionally: the changed code reloads the value from the header from memory (from what I can tell from generated code), while the original code does not, just reusing the value in the register.

Even if this is a nano-nano-optimization I would prefer to keep it as is. Some of the other changes seem to cause very similar "regressions".

(Yeah, I'm late, sorry).

I think I preserved the original behavior of re-using the already-loaded header. Notice that I added an oopDesc::forwardee(markWord m) to do that.

Copy link
Collaborator

@tschatzl tschatzl Jul 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, then ignore me :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integrated
4 participants