8332455: Improve G1/ParallelGC tasks to not override array lengths #19282

Closed
wants to merge 25 commits

Conversation

@rkennke (Contributor) commented May 17, 2024

In order to not cause excessive traffic on task queues when scanning large object arrays, G1 and ParallelGC use a way of slicing those arrays into smaller pieces. It overrides the from-space array's length field to track the array slices.

I think it would be cleaner if we improved the tasks such that array slices can be fully encoded in the task and do not require overriding the array length.

This PR borrows the principal encoding and slicing algorithm from Shenandoah (originally written by @shipilev). It also unifies the slicing implementations of G1 young GC, G1 concurrent marking, and ParallelGC.

For a description of the encoding and slicing algorithm, see top of arraySlicer.hpp.

On x86 (32-bit) we don't have enough bits in the single-word task to encode the slicing, so there I'm extending the task to 64 bits (a pointer and two 16-bit fields).
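
For illustration, here is a minimal sketch of the kind of single-word task encoding described above, modeled on Shenandoah's ShenandoahMarkTask; the class name, bit widths, and helpers below are assumptions for this sketch, not the actual arraySlicer.hpp code:

class SlicedArrayTask {
  // Illustrative bit budget for a 64-bit word (matching the review excerpt
  // below): 5 bits of "pow" (log2 of the slice size), 10 bits of slice
  // index, and the remaining 49 bits for the oop itself.
  static const int pow_bits   = 5;
  static const int slice_bits = 10;
  static const int oop_bits   = 64 - pow_bits - slice_bits; // 49 bits: 512 TB

  static const uintptr_t oop_mask    = (uintptr_t(1) << oop_bits) - 1;
  static const int       slice_shift = oop_bits;
  static const int       pow_shift   = oop_bits + slice_bits;

  uintptr_t _encoded;

public:
  SlicedArrayTask(oop obj, int slice, int pow) :
    _encoded((uintptr_t(pow)   << pow_shift)   |
             (uintptr_t(slice) << slice_shift) |
             (cast_from_oop<uintptr_t>(obj) & oop_mask)) {}

  // Assumes the heap is mapped below 2^49, checked at startup
  // (as Shenandoah does, see the review discussion below).
  oop obj()   const { return cast_to_oop(_encoded & oop_mask); }
  int slice() const { return (int)((_encoded >> slice_shift) & ((uintptr_t(1) << slice_bits) - 1)); }
  int pow()   const { return (int)(_encoded >> pow_shift); }
};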

I put in some effort to make sure the shared array slicing uses the user-configurable flags ParGCArrayScanChunk and ObjArrayMarkingStride just as before, but TBH, I don't see the point of having those flags as product flags to begin with. I would probably deprecate and remove ParGCArrayScanChunk, and use the develop flag ObjArrayMarkingStride everywhere. YMMV.

Testing:

  • hotspot_gc
  • tier1
  • tier2

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8332455: Improve G1/ParallelGC tasks to not override array lengths (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19282/head:pull/19282
$ git checkout pull/19282

Update a local copy of the PR:
$ git checkout pull/19282
$ git pull https://git.openjdk.org/jdk.git pull/19282/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19282

View PR using the GUI difftool:
$ git pr show -t 19282

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19282.diff

Webrev

Link to Webrev Comment

@bridgekeeper (bot) commented May 17, 2024

👋 Welcome back rkennke! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk (bot) commented May 17, 2024

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk (bot) commented May 17, 2024

@rkennke The following label will be automatically applied to this pull request:

  • hotspot-gc

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-gc hotspot-gc-dev@openjdk.org label May 17, 2024
@openjdk (bot) commented Jun 6, 2024

@rkennke this pull request cannot be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request, you can run the following commands in the local repository for your personal fork:

git checkout JDK-8332455
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Jun 6, 2024
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Jun 14, 2024
@openjdk (bot) commented Jun 25, 2024

⚠️ @rkennke This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request is integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

@rkennke rkennke changed the title 8332455: Improve G1 tasks to not override array lengths 8332455: Improve G1/ParallelGC tasks to not override array lengths Jul 3, 2024
@rkennke rkennke marked this pull request as ready for review July 4, 2024 09:27
@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 4, 2024
@mlbridge (bot) commented Jul 4, 2024

Webrevs

@kimbarrett left a comment

Besides the problems discussed below, I also wondered about the rationale for
this change. The JBS issue (incorrectly, see below) describes what G1
currently does, but doesn't say anything about why it's a problem. I see this
issue has the lilliput label, so maybe it has something to do with that
project? Without that information it's hard to discuss any alternatives.

The bug report says G1 array slicing uses the from-space array's length field
to track the array slices. That isn't true. It uses the to-space array's
length field, since it uses PartialArrayTaskStepper. ParallelGC uses the
from-space length in PSPromotionManager::process_array_chunk, though there is
an RFE to change ParallelGC to also use PartialArrayTaskStepper (JDK-8311163).

So why is that a problem? The JBS issue is silent about that. Some things
that came up in some internal-to-Oracle discussion include

(1) That approach has problems for the concurrent copying collectors.
(2) That approach doesn't work for G1 concurrent marking.
(3) For STW copying collectors, an allocation failure leaves one without a
separate copy, preventing use of that mechanism.

Having a unified mechanism that works for all collectors probably has some
benefits, if it doesn't impose some onerous cost on the STW copying
collectors. The allocation failure case should be rare, but it's still an
annoying special case to have to handle.

So yes, there are reasons to investigate alternatives. But I think what is
being proposed here is busted by invalid assumptions about the address space.

I also wish the proposed change was broken up a little bit. Changing all of
G1 STW, G1 Concurrent, and Parallel at the same time is hard to review, and I
don't think there's a need to do all in one change.

I have a couple of ideas for alternative mechanisms that I might explore.

// Encoding limitations caused by current bitscales mean:
// 10 bits for slice: max 1024 blocks per array
// 5 bits for power: max 2^32 array
// 49 bits for oop: max 512 TB of addressable space


This encoding is incompatible with the Linux Large Virtual Address space:
https://www.kernel.org/doc/html/v5.8/arm64/memory.html
which has a 52-bit address space (2^52 bytes = 4 PiB, well beyond the 512 TB
this encoding can represent). I also don't know of any reason why a future
address space configuration couldn't support the full non-tagged range (so 56
bits). I think that makes this scheme not viable.

@shipilev (Member) commented Jul 8, 2024

I think hardware support is an orthogonal issue. It would have been an issue if we just blindly cast the pointer to oop, and relied on hardware to treat only the lowest bits as the actual address, and/or mmap-ed the Java heap at very high addresses.

But since we are masking out the oop explicitly (see oop_extract_mask) and we map the Java heap at lower addresses (first and foremost, to please compressed oops), we do not run into this for practical heap sizes. What we get as the limit is how much of the Java heap we can represent, given that we encode oops. Shenandoah checks this at startup, for example:

#if SHENANDOAH_OPTIMIZED_MARKTASK
  // The optimized ShenandoahMarkTask takes some bits away from the full object bits.
  // Fail if we ever attempt to address more than we can.
  if ((uintptr_t)heap_rs.end() >= ShenandoahMarkTask::max_addressable()) {
    FormatBuffer<512> buf("Shenandoah reserved [" PTR_FORMAT ", " PTR_FORMAT ") for the heap, \n"
                          "but max object address is " PTR_FORMAT ". Try to reduce heap size, or try other \n"
                          "VM options that allocate heap at lower addresses (HeapBaseMinAddress, AllocateHeapAt, etc).",
                          p2i(heap_rs.base()), p2i(heap_rs.end()), ShenandoahMarkTask::max_addressable());
    vm_exit_during_initialization("Fatal Error", buf);
  }
#endif


What is a practical heap size? 512T should be enough for everyone :)

As a former colleague liked to say, don't try to stuff two bits of information
into one bit of data.


void* _ptr;
uint16_t _slice;
uint16_t _pow;


This implementation imposes significant additional overhead on the normal non-array case, since
an oop task needs to carry around the unused _slice and _pow members, including on insertion into
and removal from the taskqueue and such. That seems pretty problematic for 32-bit platforms that
are targeting the relatively low-end embedded space.

@rkennke (Contributor, Author) commented Jul 5, 2024

Besides the problems discussed below, I also wondered about the rationale for this change. The JBS issue (incorrectly, see below) describes what G1 currently does, but doesn't say anything about why it's a problem. I see this issue has the lilliput label, so maybe it has something to do with that project? Without that information it's hard to discuss any alternatives.

Correct. And sorry about that. Indeed, my motivation to explore alternatives comes out of the Lilliput project. I am currently working on 'Lilliput2', that is, 4-byte object headers (and I already have a version that runs quite well). While the header (aka mark word) is only 4 bytes, many GCs still use the whole first word of the object to store the forwarding pointer. This is normally not a problem: the object has already been copied and we don't care much about anything in the from-space copy. However, arrays would now have their length field in the first word, too. Thus overriding the first word of the from-space copy with the forwarding pointer would destroy any information that the GC keeps there (or the other way around).

I knew that in Shenandoah we don't do that, and instead encode the slicing in the tasks (and I heard that ZGC does something similar, but I haven't looked at that yet), so my goal was to explore whether that is feasible for G1, too. At first I only did G1 young STW GC, but digging deeper I realized that G1 concurrent marking and ParallelGC young GC also use similar approaches (without sharing much). So I also explored whether the encoding would work there, too. This is pretty much how I ended up with this PR.

About the address space concern: yes, I am aware of this. We have been doing the exact same slicing in Shenandoah since forever, and so far we have never had a customer who got to that limit. Our reasoning there was that if one uses this amount of Java heap, using 2x as much native memory for the GC task queues (with the fallback version of the task encoding) would probably not matter all that much.

But I'll be totally happy to explore other approaches. In fact, I think at least G1 concurrent mark should be fine, anyway.

The bug report says G1 array slicing uses the from-space array's length field to track the array slices. That isn't true. It uses the to-space array's length field, since it uses PartialArrayTaskStepper.

Right. I got that wrong. However, it still requires the from-space length to remain intact, which is not the case with Lilliput2.

ParallelGC uses the from-space length in PSPromotionManager::process_array_chunk, though there is an RFE to change ParallelGC to also use PartialArrayTaskStepper (JDK-8311163).

Ok, that is good. If we could come up with an approach that makes PartialArrayTaskStepper not depend on the from-space length, that would solve my problem.

I also wish the proposed change was broken up a little bit. Changing all of G1 STW, G1 Concurrent, and Parallel at the same time is hard to review, and I don't think there's a need to do all in one change.

I could do that. But let's first see if we come up with an approach that makes us all happy ;-)

One idea that I just now had (for my particular problem in Lilliput2) is this: we only need the length field in the from-space copy to keep the actual length of the array, so that we know when to stop, and also to restore the length in the to-space copy. We could copy the length to from-space offset 8 (instead of offset 4, which is where the length field sits with Lilliput2), and use that instead. That should be OK, because we certainly don't need to do the slicing for 0-length arrays (for which writing at offset 8 would stomp over the next object), and for larger arrays we don't care about the from-space elements, just like we don't care about the length.

I have a couple of ideas for alternative mechanisms that I might explore.

Would you mind sharing the ideas?

Thanks,
Roman

@kimbarrett commented

I have a couple of ideas for alternative mechanisms that I might explore.

Would you mind sharing the ideas?

I'll respond to other points separately.

Here's a (very drafty) suggestion.
https://github.com/openjdk/jdk/compare/master...kimbarrett:openjdk-jdk:array-splitting?expand=1

(There are lots of specific details about these changes that I don't like, and
would want to do a bit differently in a real PR, but it demonstrates the
approach, and has passed basic testing. Note that it requires disabling
ParallelGC for now to build, e.g. --disable-jvm-feature-parallelgc. There are
lots of comments that haven't been updated to reflect associated code
changes.)

The idea is to define a C++ object that represents the state of the iteration
over an array: at a minimum, the new array and the next array index to start
processing from. (We might have other members for convenience, or for use by
variants other than the current usage by G1.) The associated ScannerTask
entries in the taskqueue contain a pointer to the state object associated with
an array. We allocate the state object when we decide to parallelize the
scanning of the array.

Use arena allocation for the state objects, so we don't have to search the
taskqueues to reclaim them if we need to discard the contents of the queues
(such as when we sometimes abort G1 concurrent marking).

Put a free-list allocator in front of the arena allocator, so the high water
mark for allocation is based on the number of in-progress arrays rather than
the total number that exist.

The states are reference counted, so we know when to return them to the
free-list. Each referring ScannerTask we insert in a taskqueue increases the
count. When we take one out and have finished processing it, we decrement the
count. Release the state when its count reaches zero. Manipulation of tasks
inside a task queue doesn't need to manipulate the count. While a ScannerTask
might get copied around in the queues, that mechanism ensures the number of
copies ultimately doesn't change over what's inserted and later taken out for
processing.
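
As a rough sketch of that reference-counting scheme (not the draft branch's actual code; the class name, members, and exact Atomic calls are assumptions for illustration):

class PartialArrayState {
  oop          _to_array;    // destination (new) array being processed
  volatile int _next_index;  // next chunk's starting element index
  volatile int _refcount;    // one count per referring ScannerTask in a queue

public:
  PartialArrayState(oop to_array)
    : _to_array(to_array), _next_index(0), _refcount(0) {}

  oop to_array() const { return _to_array; }

  // Claim the next chunk of 'chunk_size' elements; returns its start index.
  int claim(int chunk_size) {
    return Atomic::fetch_then_add(&_next_index, chunk_size);
  }

  // Called when pushing a referring ScannerTask onto a taskqueue.
  void add_ref() { Atomic::inc(&_refcount); }

  // Called after a popped task has been fully processed. Returns true when
  // this was the last reference, so the state can go back to the free-list
  // in front of the arena.
  bool release() { return Atomic::sub(&_refcount, 1) == 0; }
};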

@kimbarrett commented

A different idea is to have segregated queues for oops and partial array
tasks, as is already done in some cases by some collectors. That complicates
processing, stealing, and termination testing somewhat, but avoids the more
complicated allocation scheme for the state. One could even consider
segregated queues for oop* and narrowOop*, avoiding the whole task type
dispatching. I've long wondered about the tradeoffs there.
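
For illustration, a minimal sketch of segregated per-worker queues, assuming HotSpot's OverflowTaskQueue; PartialArrayTask, process_array_chunk, and process_oop are hypothetical names for this sketch:

typedef OverflowTaskQueue<oop, mtGC>              OopTaskQueue;
typedef OverflowTaskQueue<PartialArrayTask, mtGC> ArrayTaskQueue;

void drain_queues(OopTaskQueue* oops, ArrayTaskQueue* arrays) {
  oop obj;
  PartialArrayTask t;
  do {
    // Prefer partial-array tasks so a large array keeps feeding work.
    while (arrays->pop_local(t)) { process_array_chunk(t); }
    while (oops->pop_local(obj)) { process_oop(obj); }
  } while (!arrays->is_empty() || !oops->is_empty());
  // Stealing and termination detection would have to consider both kinds
  // of queues, which is the added complication mentioned above.
}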

@kimbarrett commented

One idea that I just now had (for my particular problem in Lilliput2) is this: we only need the length field in the from-space copy to keep the actual length of the array, so that we know when to stop, and also to restore the length in the to-space copy. We could copy the length to from-space offset 8 (instead of offset 4, which is where the length field sits with Lilliput2), and use that instead. That should be OK, because we certainly don't need to do the slicing for 0-length arrays (for which writing at offset 8 would stomp over the next object), and for larger arrays we don't care about the from-space elements, just like we don't care about the length.

That still doesn't permit parallelizing the processing in the evac-failure case.

@kimbarrett commented

About the address space concern: yes, I am aware of this. We have been doing the exact same slicing in Shenandoah since forever, and so far we have never had a customer who got to that limit. Our reasoning there was that if one uses this amount of Java heap, using 2x as much native memory for the GC task queues (with the fallback version of the task encoding) would probably not matter all that much.

I would guess there aren't tons of users of Linux Large Virtual address space.

For fat tasks, I'm less concerned about the actual space impact than about the
additional memory-transfer cost for the normal and dominant non-array case,
where we end up spending cycles copying around these completely unused words.

I'd much rather have allocated states or segregated queues than fat tasks.

@rkennke (Contributor, Author) commented Jul 10, 2024

Withdrawing this PR. I found a better solution to my problem; there is no need to change the array slicing in such a fundamental way.

@rkennke rkennke closed this Jul 10, 2024
@kimbarrett commented

Withdrawing this PR. I found a better solution to my problem; there is no need to change the array slicing in such a fundamental way.

Care to share?

@rkennke (Contributor, Author) commented Jul 10, 2024

Withdrawing this PR. I found a better solution to my problem; there is no need to change the array slicing in such a fundamental way.

Care to share?

Yeah, it's this commit:
rkennke/lilliput@18574c6

The idea is simply to not use the array length in from-space for array slicing (because it would clash with the forwarding pointer), but instead to use the second word, and to preserve the original array length there. It is kinda Lilliput(2)-specific, so I guess there's no reason to upstream it ahead of time. It's also still kinda quick-and-dirty. ParallelGC would have the same problem, but I've not addressed that yet, because Parallel (Full) GC has more severe problems that I want to solve first. G1 concurrent marking should be fine, afaict (it doesn't use or override the from-space length, nor does it need to deal with forwarding pointers).
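
For illustration, a minimal sketch of that displaced-length trick (not the actual commit; offsets and helper names are assumptions, given the Lilliput2 layout of a 4-byte header followed by a 4-byte length):

// Stash the length in the second word (offset 8) of the from-space copy
// before its first word is overwritten with the forwarding pointer.
static void stash_length(arrayOop from) {
  int len = from->length();
  // 0-length arrays are never sliced; for them offset 8 would already
  // belong to the next object.
  assert(len > 0, "only sliced arrays stash their length");
  *(int*)(cast_from_oop<address>(from) + 8) = len;
}

// Read the preserved length back, e.g. to restore it in the to-space copy
// and to know when the slicing is done.
static int stashed_length(arrayOop from) {
  return *(int*)(cast_from_oop<address>(from) + 8);
}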
