8342382: Implement JEP 522: G1 GC: Improve Throughput by Reducing Synchronization #23739
👋 Welcome back tschatzl! A progress list of the required criteria for merging this PR into the target branch is provided below.
/contributor add @tstuefe |
@tschatzl This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements; please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be the one shown in this comment. You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time this comment was updated there had been 7 new commits pushed to the target branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message, issue the /integrate command.
@tschatzl

@tschatzl

/contributor add @c-refice
@tschatzl Syntax:
User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.
* gc thread cpu time tracking with gc+cpu=debug logging
* use correct young gen RS length prediction for base time calculation after finalizing young gen
* proper accounting of sweep rt continuation to base time
* merge to actual remset size calculation fixes
* fix compilation on aarch64, Windows
* fix compilation on aarch64
* remove trailing spaces from files
* initial riscv implementation
* cleanup s390/ppc
* fix missing ResizeTLABs event in jfr parallel phases test
* ppc barrier attempt (not even built)
* removed necessary C1 slow path stub generation :(
* fix riscv compilation
* fix RISCV barrier, passes javac HelloWorld and its execution
* re-add too-much-deleted stuff
* consider yield time in dirtying rate calculations
* cleanup
* clean up in refinement heuristics
* remove unused g1_young_card_val
* refactoring, renaming
* when calculating total merged cards from RS, compensate for the merge cache
* cleanup, refactoring, renaming
* refactoring of refinement/redirtying stats recording
* disable some expensive logging
* fix merge card cache compensation
* improve documentation about CardValue's LSB discriminating between clean/non-clean
* additional comments to assembly code
* fix too early clearing of refinement statistics after regular refinement completion where cards_to_cset would always be zero
* improved sizing of work for refinement table merge phase
* refactoring, removing fixmes
* aarch64 improved array post barrier
* fix aarch64 array post barrier assembler version
* fix testgclogmessages.java test after recent changes
* arm32 barrier
* currently yield duration only applies to sweeping
* add missing files after rebase
* regularize new_val_maybe_null
* ppc build fixes
* more ppc build fixes after bad merge
* too many fixes :(
* cleanup
* fix check for enough space to evacuate
* remove code to take expected old gen surviving words into account when determining eden length
* remove some debug code
* track safepoints in recent refinement epoch to calculate card dirtying time
* remove card_table1 member from g1barrierset
* refactoring, cleanup
* fix issues with tracking gc pauses for card dirtying
* epoch timing fixes; little cleanup
* more time accounting fixes
* some refactoring
* cleanup
* remove more debug logs
* remove parts of already pushed stuff before merge
* improve cpu time output
* synchronize accesses for prediction relevant members between refinement and young gen revise thread
* remove UseNewCode in barrier code
* some assert to check that dirtying cards is done at the right time
* comment why the lock when updating redirtying information
* remove dead code
* some cleanup in code generation
* initial version
* add store_addr == new_val check to all platforms
* remove unnecessary stuff
* fix ppc barrier code (from M. Doerr)
* too strong different register assertion due to ppc optimization
* fix passing of new_val_may_be_null for c1 barriers
* factor out x.a = x assignments for the C1 compiler.
* remove FIXMEs
* fix s390 barrier code (from A. Kumar)
* removed empty JMVCI write_barrier_post stub because JVMCI users need more changes than that anyway
* added card table base offset constant for use with JVMCI
* add clean_card_val() for JVMCI
* fix node costs for g1 post barrier
* remove trailing whitespace in files
* update post barrier cost estimate
* copyright updates
* cleanup of G1CardTableClaimTable
* worker threads elapsed CPU time refactoring
* undid _maybe_null from _may_be_null in code generator
* minor dead code removal
* some cleanup
* minor refactorings
* update cpu time gathering to proposed upstream
* remove cpu time logging code - can be retrieved from performance counters as well
* renamings of "primary control thread" to "refinement control thread" everywhere
* simplify refinement control thread main loop
* add refinement control thread perf counters
Force-pushed from c0f9348 to 0372026.
In this PR you wrote one thing, but on https://tschatzl.github.io/2025/02/21/new-write-barriers.html you wrote another; I guess the second one is correct.
@tschatzl I did not contribute the ppc port. Did you mean @TheRealMDoerr or @reinrich?
/contributor remove @tstuefe

@tschatzl

@tschatzl
/contributor add Carlo Refice carlo.refice@oracle.com

@tschatzl

/label add hotspot-gc
@tschatzl |
* fixed comment typo
PPC64 code looks great! Thanks for doing this! Only some comments are no longer correct.
* missing synchronization with card table swapping
* documentation for a few PSS members
* rename some member variables to contain _ct and _rt suffixes in remembered set verification
@tschatzl : Hi, would you mind adding a small cleanup change for riscv?

This is the

Yes, sure! The purpose is to minimize the difference to avoid possible issues in the future.
LGTM!
@RealFYang : going to wait for the response of @theRealAph about the
Still Good!
Aaand, off it goes... Thanks @walulyai @albertnetymk @theRealAph @TheRealMDoerr @robcasloz @RealFYang @offamitkumar @tarsa @tstuefe for your help to complete this change.
/integrate
Going to push as commit 8d5c005.
Your commit was automatically rebased without conflicts.
Congratulations! And thanks for updating it for such a long time! |
Hi all,
Please review this change that implements the (currently draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight, but we would like to have this ready by JDK 25.
Current situation
With this change, G1 reduces the post-write barrier to much more closely resemble Parallel GC's, as described in the JEP. The motivation is that G1 lags behind Parallel/Serial GC in throughput due to its larger barrier.
The main reason for the current barrier is how G1 implements concurrent refinement:
These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code:
Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for the Parallel and Serial GCs.
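The pseudo code referred to above did not survive the page extraction. The following sketch reconstructs the pre-existing G1 post-write barrier from its well-known structure (filter checks, a StoreLoad fence, the card mark, and card enqueueing); it is illustrative, not the exact generated code:

```
// Pre-existing G1 post-write barrier for x.a = y, sketched:
if (region(&x.a) == region(y)) goto done;  // cross-region filter
if (y == null) goto done;                  // null-value filter
if (card(&x.a) == young) goto done;        // young-region filter
StoreLoad_fence();                         // order the store against the card read
if (card(&x.a) == dirty) goto done;        // already-dirty filter
card(&x.a) = dirty;
enqueue(dirty_card_queue, &card(&x.a));    // may call into the runtime when full
done:
```

The StoreLoad fence and the enqueue path are the expensive parts this change removes.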
The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.
There are several papers showing that this barrier alone can decrease throughput by 10-20% (Yang12), which is corroborated by some benchmarks (see links).
The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a second card table ("refinement table"). The second card table also replaces the dirty card queue.
In that scheme the fine-grained synchronization is unnecessary because mutator and refinement threads always write to different memory areas (and no concurrent write where an update can be lost can occur). This removes the necessity for synchronization for every reference write.
Also, no card enqueueing is required any more.
Only the filters and the card mark remain.
How this works
In the beginning, both the card table and the refinement table are completely unmarked (contain "clean" cards). The mutator dirties the card table until G1's heuristics decide that a significant enough number of cards has been dirtied, based on the budget allocated for scanning them during the garbage collection.
At that point, the card table and the refinement table are exchanged "atomically" using handshakes. The mutator keeps dirtying the (previously clean refinement table, which is now the) card table, while the refinement threads look for and refine dirty cards on the refinement table as before.
Refinement of cards is very similar to before: if an interesting reference in a dirty card has been found, G1 records it in the appropriate remembered sets. In this implementation there is an exception for references to the current collection set (typically the young gen): the refinement threads redirty such a card on the card table with a special "to-collection-set" value. This is valid because races with the mutator for that write do not matter: the entire card will eventually be rescanned anyway, regardless of whether it ends up as dirty or to-collection-set. The advantage of marking to-collection-set cards specially is that the next time the card tables are swapped, the refinement threads will not re-refine them, on the assumption that a reference to the collection set will not change. This decreases refinement work substantially.
If refinement gets interrupted by GC, the refinement table will be merged with the card table before card scanning, which works as before.
New barrier pseudo-code for an assignment `x.a = y`:
This is basically the Serial/Parallel GC barrier with additional filters to keep the number of dirty cards as small as possible.
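The new barrier's pseudo code also did not survive extraction; a hedged reconstruction based on the description in this PR (all filters kept, no StoreLoad, no enqueueing) looks roughly like:

```
// New G1 post-write barrier for x.a = y, sketched:
if (region(&x.a) == region(y)) goto done;  // cross-region filter
if (y == null) goto done;                  // null-value filter
if (card(&x.a) != clean) goto done;        // conditional card mark (UseCondCardMark)
card(&x.a) = dirty;
done:
```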
A few more comments about the barrier:
Current G1 marks the cards corresponding to young gen regions all as "young" so that the original barrier could potentially avoid the StoreLoad. This implementation removes this facility (it might be re-introduced later); measurements showed that pre-dirtying the young generation regions' cards as "dirty" (G1 does not need to use an extra "young" value) did not yield any measurable performance difference.
Refinement process
The goal of the refinement (threads) is to make sure that the number of cards to scan in the garbage collection is below a particular threshold.
The prototype changes the refinement threads into a single control thread and a set of (refinement) worker threads. Unlike the previous implementation, the control thread does not do any refinement; it only executes the heuristics to start a calculated number of worker threads and tracks refinement progress.
The refinement trigger is based on the currently known number of pending (i.e. dirty) cards on the card table and a pending-card generation rate, fairly similar to the previous algorithm. After the refinement control thread determines that it is time to do refinement, it starts the following sequence:
This work either consists of refinement of the particular card (old generation regions) or clearing the cards (next collection set/young generation regions).
If the work is interrupted by a non-garbage collection synchronization point, work is suspended temporarily and resumed later using the heap snapshot.
After the refinement process the refinement table is all-clean again and ready to be swapped again.
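The trigger described above (pending cards plus a generation rate against a scan budget) can be sketched as follows. The names and the exact formulas are assumptions for illustration, not the PR's actual policy code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Illustrative refinement-trigger sketch: start refinement early enough
// that the predicted number of dirty cards stays below the scan budget.
struct RefinementTrigger {
  double dirtying_rate_cards_per_ms;  // predicted mutator dirtying rate
  double refine_rate_cards_per_ms;    // predicted per-worker refinement rate
  size_t scan_budget_cards;           // max cards we want to scan in the pause

  bool should_start(size_t pending_cards, double ms_until_gc) const {
    double predicted =
        pending_cards + dirtying_rate_cards_per_ms * ms_until_gc;
    return predicted > (double)scan_budget_cards;
  }

  // How many workers are needed to sweep the pending cards in time.
  unsigned workers_needed(size_t pending_cards, double ms_until_gc) const {
    double per_worker =
        refine_rate_cards_per_ms * std::max(ms_until_gc, 1.0);
    return (unsigned)std::max(1.0, std::ceil(pending_cards / per_worker));
  }
};
```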
Garbage collection pause changes
Since a garbage collection (young or full gc) pause may occur at any point during the refinement process, the garbage collection needs some compensating work for the not yet swept parts of the refinement table.
Note that this situation is very rare, and the heuristics try to avoid that, so in most cases nothing needs to be done as the refinement table is all clean.
If this happens, young collections add a new phase called "Merge Refinement Table" in the garbage collection pause, right before the "Merge Heap Roots" phase, which merges the refinement table back into the card table.
If a full collection interrupts concurrent refinement, the refinement table is simply cleared and all dirty cards are thrown away.
A garbage collection generates new cards (e.g. references from promoted objects into the young generation) on the refinement table. This acts similarly to the extra DCQS used to record these interesting references/cards and redirty the card table using them in the previous implementation. G1 swaps the card tables at the end of the collection to keep the post-condition of the refinement table being all clean (and any to-be-refined cards on the card table) at the end of garbage collection.
Performance metrics
Following is an overview of the changes in behavior. Some numbers are provided in the CR in the first comment.
Native memory usage
The refinement table takes an additional 0.2% of the Java heap size in native memory compared to JDK 21 and later (in JDK 21 we removed one card-table-sized data structure, so this is a non-issue when updating from earlier releases).
Some of that additional memory usage is won back by removing the dirty card queues. More memory is reclaimed by managing the cards containing to-collection-set references on the card table, dropping the explicit remembered sets for the young generation completely, as well as any remembered set entries that would otherwise be duplicated into other regions' remembered sets.
In some applications/benchmarks these gains completely offset the additional card table; most of the time, however, this is not the case, particularly for throughput applications.
It would be possible to allocate the refinement table lazily: since such applications often do not need any concurrent refinement, there would be no overhead at all but actually a net reduction of native memory usage. This is not implemented in this prototype.
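The 0.2% figure follows from the card table layout: one card-table byte covers one 512-byte card (the HotSpot default card size, assumed here), so an extra table costs 1/512 of the heap, about 0.2%. A quick check:

```cpp
// One card-table byte per 512-byte card (assumed HotSpot default):
constexpr double kBytesPerCard = 512.0;

// Overhead of one extra card table, as a percentage of heap size.
constexpr double card_table_overhead_pct() {
  return 100.0 / kBytesPerCard;  // about 0.195%, i.e. ~0.2%
}

// Size of one card table in MiB for a heap of the given size in GiB.
constexpr double card_table_mib(double heap_gib) {
  return heap_gib * 1024.0 / kBytesPerCard;
}
```

For example, a 32 GiB heap pays about 64 MiB for the additional table.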
Latency ("Pause times")
Not affected or slightly better. Pause times decrease due to a shorter "Merge remembered sets" phase, since no work is required for the young generation's remembered sets; they are always already on the card table!
However, merging of the refinement table into the card table is extremely fast, and in my measurements always faster than merging remembered sets for the young gen. Since this work linearly scans some memory, it is embarrassingly parallel too.
The cards created during garbage collection do not need to be redirtied, so that phase has also been removed.
The card table swap is based on predictions for the mutator card dirtying rate and the refinement rate, as before, and the policy is actually fairly similar to the previous one. It is still rather aggressive, but in most cases takes fewer CPU resources than before, mostly because refining takes less CPU time. As before, many applications do not do any refinement at all. More investigation could be done to improve this in the future.
Throughput
This change always increases throughput in my measurements, though depending on the benchmark/application it may not actually show up in scores.
Due to the pre-barrier and the additional filters in the barrier, G1 is still slower than Parallel GC on raw throughput benchmarks, but typically ends up somewhere half-way to Parallel GC or closer.
Code Size
Code size measurements on DaCapo benchmarks by @robcasloz showed that this change decreases code size by around 5%.
Platform support
Since the post-write barrier changed, additional work for some platforms is required to allow this change to proceed. At this time the work for all platforms is done, but needs testing.
Apart from aarch64 and x64, which are fully implemented, none of the above-mentioned platforms implement the barrier method that writes cards for a reference array; they call into the runtime as before. I believe this is now fairly easy to do with the simplified barrier, for some extra performance, but it is not necessary.
Alternatives
The JEP text extensively discusses alternatives.
Reviewing
The change can be roughly divided into these fairly isolated parts:
* the G1ConcurrentRefineThread::run_service method
* merge_refinement_table() in g1RemSet.cpp
* G1Policy::record_dirtying_stats

Further information is available in the JEP draft; there is also a somewhat more extensive discussion of the change on my blog.
Some additional comments:
This change sets UseCondCardMark to true by default. The conditional card mark corresponds to the third filter in the write barrier now, and since I decided to keep all filters for this change, it makes sense to directly use this mechanism.
If there are any questions, feel free to ask.
Testing: tier1-7 (multiple tier1-7, tier1-8 with slightly older versions)
Thanks,
Thomas
Progress
Issues
Reviewers
Contributors
<amitkumar@openjdk.org>
<mdoerr@openjdk.org>
<carlo.refice@oracle.com>
<fyang@openjdk.org>
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739
$ git checkout pull/23739
Update a local copy of the PR:
$ git checkout pull/23739
$ git pull https://git.openjdk.org/jdk.git pull/23739/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 23739
View PR using the GUI difftool:
$ git pr show -t 23739
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23739.diff
Using Webrev
Link to Webrev Comment