8332455: Improve G1/ParallelGC tasks to not override array lengths #19282
Conversation
👋 Welcome back rkennke! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request.
❗ This change is not yet ready to be integrated.
@rkennke this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout JDK-8332455
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
Besides the problems discussed below, I also wondered about the rationale for
this change. The JBS issue (incorrectly, see below) describes what G1
currently does, but doesn't say anything about why it's a problem. I see this
issue has the lilliput label, so maybe it has something to do with that
project? Without that information it's hard to discuss any alternatives.
The bug report says G1 array slicing uses the from-space array's length field
to track the array slices. That isn't true. It uses the to-space array's
length field, since it uses PartialArrayTaskStepper. ParallelGC uses the
from-space length in PSPromotionManager::process_array_chunk, though there is
an RFE to change ParallelGC to also use PartialArrayTaskStepper (JDK-8311163).
So why is that a problem? The JBS issue is silent about that. Some things
that came up in internal Oracle discussion include:
(1) That approach has problems for the concurrent copying collectors.
(2) That approach doesn't work for G1 concurrent marking.
(3) For STW copying collectors, an allocation failure leaves one without a
separate copy, preventing use of that mechanism.
Having a unified mechanism that works for all collectors probably has some
benefits, if it doesn't impose some onerous cost on the STW copying
collectors. The allocation failure case should be rare, but it's still an
annoying special case to have to handle.
So yes, there are reasons to investigate alternatives. But I think what is
being proposed here is busted by invalid assumptions about the address space.
I also wish the proposed change was broken up a little bit. Changing all of
G1 STW, G1 Concurrent, and Parallel at the same time is hard to review, and I
don't think there's a need to do all in one change.
I have a couple of ideas for alternative mechanisms that I might explore.
// Encoding limitations caused by current bitscales mean:
// 10 bits for slice: max 1024 blocks per array
// 5 bits for power: max 2^32 array
// 49 bits for oop: max 512 TB of addressable space
This encoding is incompatible with Linux Large Virtual Address space:
https://www.kernel.org/doc/html/v5.8/arm64/memory.html
which has a 52 bit address space. I also don't know of any reason why future address space
configuration couldn't support the full non-tagged range (so 56 bits). I think that makes this
scheme not viable.
I think hardware support is an orthogonal issue. It would have been an issue if we just blindly cast the pointer to oop and relied on hardware to only treat the lowest bits as the actual address, and/or mmap-ed the Java heap at very high addresses. But since we are masking out the oop explicitly (see oop_extract_mask) and we map the Java heap at lower addresses (in first order, to please compressed oops), we do not care for practical heap sizes. What we get as the limit is how much of the Java heap we can represent, given that we encode oop-s. Shenandoah checks this on startup, for example:
jdk/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp
Lines 202 to 212 in d8c1c6a

#if SHENANDOAH_OPTIMIZED_MARKTASK
  // The optimized ShenandoahMarkTask takes some bits away from the full object bits.
  // Fail if we ever attempt to address more than we can.
  if ((uintptr_t)heap_rs.end() >= ShenandoahMarkTask::max_addressable()) {
    FormatBuffer<512> buf("Shenandoah reserved [" PTR_FORMAT ", " PTR_FORMAT") for the heap, \n"
                          "but max object address is " PTR_FORMAT ". Try to reduce heap size, or try other \n"
                          "VM options that allocate heap at lower addresses (HeapBaseMinAddress, AllocateHeapAt, etc).",
                          p2i(heap_rs.base()), p2i(heap_rs.end()), ShenandoahMarkTask::max_addressable());
    vm_exit_during_initialization("Fatal Error", buf);
  }
#endif
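To make the discussion above more concrete, here is a minimal sketch of how a packed 64-bit task word could be encoded and decoded under the 10/5/49 bit split quoted earlier. The names, shifts, and the standalone main are illustrative assumptions, not the actual ShenandoahMarkTask or the code in this PR; the mask plays the role of the oop_extract_mask mentioned above, and the assert corresponds to the startup check in the quoted Shenandoah snippet.

// Illustrative only: one possible packing of (pointer, pow, slice) into a
// single 64-bit task word, assuming 49 bits for the pointer, 5 for pow and
// 10 for the slice index. Not the actual HotSpot code.
#include <cassert>
#include <cstdint>

static const int      oop_bits         = 49;
static const int      pow_bits         = 5;
static const int      slice_bits       = 10;
static const int      pow_shift        = oop_bits;             // bits 49..53
static const int      slice_shift      = oop_bits + pow_bits;  // bits 54..63
static const uint64_t oop_extract_mask = (uint64_t(1) << oop_bits) - 1;

static uint64_t encode(void* obj, unsigned pow, unsigned slice) {
  uint64_t p = (uint64_t)(uintptr_t)obj;
  assert(p <= oop_extract_mask);      // heap (and object) must sit below 2^49
  assert(pow < (1u << pow_bits));
  assert(slice < (1u << slice_bits));
  return p | (uint64_t(pow) << pow_shift) | (uint64_t(slice) << slice_shift);
}

static void*    decode_obj(uint64_t task)   { return (void*)(uintptr_t)(task & oop_extract_mask); }
static unsigned decode_pow(uint64_t task)   { return (unsigned)((task >> pow_shift) & ((1u << pow_bits) - 1)); }
static unsigned decode_slice(uint64_t task) { return (unsigned)((task >> slice_shift) & ((1u << slice_bits) - 1)); }

int main() {
  int dummy;                          // stand-in for a heap object
  uint64_t task = encode(&dummy, 3, 17);
  // Round-trips as long as the address fits below 2^49.
  return (decode_obj(task) == &dummy && decode_pow(task) == 3 && decode_slice(task) == 17) ? 0 : 1;
}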
What is a practical heap size? 512T should be enough for everyone :)
As a former colleague liked to say, don't try to stuff two bits of information
into one bit of data.
void* _ptr;
uint16_t _slice;
uint16_t _pow;
This implementation imposes significant additional overhead on the normal non-array case, since an oop task needs to carry around the unused _slice and _pow members, including insertion and removal from the taskqueue and such. That seems pretty problematic for 32-bit platforms that are targeting the relatively low-end embedded space.
Correct. And sorry about that. Indeed my motivation to explore alternatives comes out of the Lilliput project. I am currently working on 'Lilliput2', that is, 4-byte object headers (and I've already got a version that runs quite well). While the header (aka mark word) is only 4 bytes, many GCs still use the whole first word of the object to store the forwarding pointer. This is normally not a problem: the object is already copied and we don't care much about anything in the from-space copy. However, arrays would now have their length field in the first word, too. Thus overriding the first word of the from-space copy with the forwarding pointer would destroy any information that the GC keeps there (or the other way around).

I knew that in Shenandoah we don't do that, and instead encode the slicing in the tasks (and I heard that ZGC does something similar, but I haven't looked at that yet), so my goal was to explore whether that is feasible for G1, too. At first I only did G1 young STW GC, but digging deeper I realized that G1 concurrent GC and Parallel young GC also use similar approaches (without sharing much). So I also explored whether the encoding would work there, too. This is pretty much how I ended up with this PR.

About the address-space concern: yes, I am aware of it. We have been doing the exact same slicing in Shenandoah since forever, and so far never had a customer who got anywhere near that limit. Our reasoning there was that if one uses this amount of Java heap, using 2x as much native memory for the GC task queues (via the fallback version of the task encoding) probably wouldn't matter all that much. But I'll be totally happy to explore other approaches. In fact, I think at least G1 concurrent mark should be fine anyway.
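Purely to illustrate the clash described above (the struct layout, field names, and values are assumptions, not the actual Lilliput2 code): with a 4-byte header, the array length shares the first 64-bit word that a full-word forwarding pointer overwrites.

// Illustration only: assumed 4-byte-header layout, not the actual Lilliput2 code.
#include <cstdint>
#include <cstdio>
#include <cstring>

struct CompactArrayHeader {   // hypothetical from-space array header
  uint32_t mark;              // 4-byte compact mark word
  uint32_t length;            // array length, second half of word 0
};

int main() {
  CompactArrayHeader h = {0x1u, 100u};

  // Installing a full-word forwarding pointer over word 0 ...
  uint64_t forwardee = 0x00007f00deadbeefULL;
  std::memcpy(&h, &forwardee, sizeof(forwardee));

  // ... clobbers the length, which the slicing code would still need.
  std::printf("length after forwarding: %u\n", h.length);
  return 0;
}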
Right. I got that wrong. However, it still requires the from-space length to remain intact, which is not the case with Lilliput2.
Ok, that is good. If we could come up with an approach that makes PartialArrayTaskStepper not depend on the from-space length, that would solve my problem.
I could do that. But let's first see if we come up with an approach that makes us all happy ;-)

One idea that I just now had (for my particular problem in Lilliput2) is this: we only need the length field in the from-space copy to keep the actual length of the array, so that we know when to stop, and also to restore the length in the to-space copy. We could copy the length to from-space offset 8 (instead of 4, which is where the length field is with Lilliput2) and use that instead. That should be ok, because surely we don't need to do the slicing for 0-length arrays (in which case we would stomp over the next object), and for larger arrays we don't care about the from-space elements, just like we don't care about the length.
Would you mind sharing the ideas? Thanks,
I'll respond to other points separately. Here's a (very drafty) suggestion. (There are lots of specific details about these changes that I don't like, and would want to clean up.)

The idea is to define a C++ object that represents the state of the iteration over an array.

Use arena allocation for the state objects, so we don't have to search the queues to reclaim them.

Put a free-list allocator in front of the arena allocator, so the high-water mark stays bounded.

The states are reference counted, so we know when to return them to the free list.
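A minimal sketch of what such a per-array state object might look like, following the shape of the suggestion above; the class name, fields, and the simple claim/release protocol are assumptions for illustration, not an actual HotSpot patch (a real version would sit behind the free-list/arena allocator described above).

// Illustrative sketch only: a reference-counted per-array iteration state.
// Names and fields are assumptions, not the actual HotSpot code.
#include <atomic>
#include <cstddef>

class PartialArrayState {
  void*                _src;        // from-space array
  void*                _dst;        // to-space array
  size_t               _length;     // full array length, kept here instead of in the from-space copy
  std::atomic<size_t>  _next_index; // next slice start to be claimed by a worker
  std::atomic<size_t>  _refcount;   // tasks (and the creator) still referring to this state

public:
  PartialArrayState(void* src, void* dst, size_t length, size_t initial_refs)
    : _src(src), _dst(dst), _length(length), _next_index(0), _refcount(initial_refs) {}

  // A worker claims the next slice of up to 'stride' elements; returns false when done.
  bool claim_slice(size_t stride, size_t& start, size_t& end) {
    size_t s = _next_index.fetch_add(stride);
    if (s >= _length) return false;
    start = s;
    end = (s + stride < _length) ? s + stride : _length;
    return true;
  }

  void add_references(size_t n) { _refcount.fetch_add(n); }

  // Returns true when the last reference is dropped; the caller can then
  // hand the state back to the free list for reuse.
  bool release() { return _refcount.fetch_sub(1) == 1; }
};

int main() {
  int src[100], dst[100];
  PartialArrayState st(src, dst, 100, /*initial_refs=*/1);
  size_t start, end, processed = 0;
  while (st.claim_slice(32, start, end)) {
    processed += end - start;       // a real worker would scan dst[start..end)
  }
  bool last = st.release();         // drop the creator's reference
  return (processed == 100 && last) ? 0 : 1;
}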
A different idea is to have segregated queues for oops and partial array tasks.
That still doesn't permit parallelizing the processing in the evac-failure case.
I would guess there aren't tons of users of Linux Large Virtual address space.

For fat tasks, I'm less concerned about the actual space impact than about the overhead they add to the common non-array case. I'd much rather have allocated states or segregated queues than fat tasks.
Withdrawing this PR. I found a better solution to my problem; no need to change the array slicing in such a fundamental way.
Care to share?
Yeah, it's this commit:

The idea is simply to not use the array length in from-space for array slicing (because it would clash with the forwarding pointer), but instead use the 2nd word, and preserve the original array length there. It is kinda Lilliput(2)-specific, so I guess there's no reason to upstream it ahead of time. It's also still kinda quick-and-dirty.

ParallelGC would have the same problem, but I've not addressed that yet, because Parallel (Full) GC has more severe problems that I want to solve first. G1 concurrent marking should be fine, afaict (it doesn't use or override the from-space length, nor does it need to deal with forwarding pointers).
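A minimal sketch of the approach described here, assuming a layout where the length sits at offset 4 (sharing word 0 with a 4-byte header) and word 1 starts at offset 8; the helper names and offsets are hypothetical, not the linked commit. Note the caveat from the earlier comment: this is only safe because slicing is never needed for 0-length arrays, so word 1 always exists and its original contents (elements) don't matter in from-space.

// Illustrative sketch only (assumed offsets, hypothetical helpers): before the
// first word of a from-space array may be overwritten by a forwarding pointer,
// stash the length in the second word and read it from there while slicing.
#include <cstdint>
#include <cstring>

static const size_t length_offset_in_bytes = 4;   // assumed: length shares word 0 with a 4-byte header
static const size_t spare_offset_in_bytes  = 8;   // assumed: start of word 1 (element area)

// Called once per array, before word 0 may be overwritten by forwarding.
inline void preserve_array_length(uint8_t* from_space_array) {
  uint32_t len;
  std::memcpy(&len, from_space_array + length_offset_in_bytes, sizeof(len));
  std::memcpy(from_space_array + spare_offset_in_bytes, &len, sizeof(len));
}

// Used by the slicing code instead of the (possibly clobbered) length field.
inline uint32_t preserved_array_length(const uint8_t* from_space_array) {
  uint32_t len;
  std::memcpy(&len, from_space_array + spare_offset_in_bytes, sizeof(len));
  return len;
}

int main() {
  // Fake from-space array: word 0 = 4-byte header + 4-byte length, word 1+ = elements.
  uint8_t from_space[32] = {};
  uint32_t len = 100;
  std::memcpy(from_space + length_offset_in_bytes, &len, sizeof(len));

  preserve_array_length(from_space);                // stash the length in word 1
  std::memset(from_space, 0xFF, 8);                 // forwarding pointer clobbers word 0
  return preserved_array_length(from_space) == 100 ? 0 : 1;
}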
In order to not cause excessive traffic on task queues when scanning large object arrays, G1 and ParallelGC slice those arrays into smaller pieces. This slicing overrides the from-space array's length field to track the array slices.
I think it would be cleaner if we improved the tasks such that array slices can be fully encoded in the task and do not require overriding the array length.
This PR borrows the principal encoding and slicing algorithm from Shenandoah (originally written by @shipilev). It also unifies the slicing implementations of G1 young GC, G1 concurrent marking, and ParallelGC.
For a description of the encoding and slicing algorithm, see top of arraySlicer.hpp.
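As a rough illustration of the general slicing idea (not the actual arraySlicer.hpp code; the task layout, names, and driver loop below are assumptions): a task (slice, pow) covers elements [slice << pow, (slice + 1) << pow) of the array, and a worker repeatedly halves its range, deferring right halves to the queue, until the range is no larger than the configured stride.

// Simplified illustration of power-of-two array slicing; assumed names, not
// the actual arraySlicer.hpp code.
#include <cstddef>
#include <cstdio>
#include <vector>

struct SliceTask {
  size_t   slice;   // index of the covered range at granularity 2^pow
  unsigned pow;     // log2 of the covered range size in elements
};

// Initial task covering the whole array: smallest pow with (1 << pow) >= len.
SliceTask whole_array_task(size_t len) {
  unsigned pow = 0;
  while ((size_t(1) << pow) < len) pow++;
  return {0, pow};
}

// Process one slice task: split it until it covers at most 'stride' elements,
// deferring non-empty right halves to the queue, then scan the remaining range.
template <typename ScanFn>
void process_slice(SliceTask task, size_t len, size_t stride,
                   std::vector<SliceTask>& queue, ScanFn scan) {
  while ((size_t(1) << task.pow) > stride && task.pow > 0) {
    task.pow--;                                     // halve the covered range
    SliceTask right = {2 * task.slice + 1, task.pow};
    if ((right.slice << right.pow) < len) {
      queue.push_back(right);                       // defer the right half
    }
    task.slice = 2 * task.slice;                    // keep working on the left half
  }
  size_t start = task.slice << task.pow;
  size_t end   = (task.slice + 1) << task.pow;
  if (end > len) { end = len; }
  for (size_t i = start; i < end; i++) {
    scan(i);                                        // visit element i
  }
}

int main() {
  const size_t len = 10, stride = 4;
  std::vector<SliceTask> queue;
  queue.push_back(whole_array_task(len));
  size_t visited = 0;
  while (!queue.empty()) {
    SliceTask t = queue.back();
    queue.pop_back();
    process_slice(t, len, stride, queue, [&](size_t) { visited++; });
  }
  std::printf("visited %zu of %zu elements\n", visited, len);  // expect 10
  return visited == len ? 0 : 1;
}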
On x86 (32-bit) we don't have enough bits in the single-word task to encode the slicing, so I'm extending the task to 64 bits (pointer and two int32 fields).
I put in some effort to make sure the shared array slicing uses the user-configurable flags ParGCArrayScanChunk and ObjArrayMarkingStride just as before, but TBH I don't see the point of having those flags as product flags to begin with. I would probably deprecate and remove ParGCArrayScanChunk, and use the develop flag ObjArrayMarkingStride everywhere. YMMV.
Testing:
Progress
Issue
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19282/head:pull/19282
$ git checkout pull/19282
Update a local copy of the PR:
$ git checkout pull/19282
$ git pull https://git.openjdk.org/jdk.git pull/19282/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 19282
View PR using the GUI difftool:
$ git pr show -t 19282
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19282.diff
Webrev
Link to Webrev Comment