8027545: Improve object array chunking test in G1's copy_to_survivor_space #90
Conversation
👋 Welcome back kbarrett! A progress list of the required criteria for merging this PR into
@kimbarrett The primary solved issue for a PR is set through the PR title. Since the current title does not contain an issue reference, it will now be updated.
@kimbarrett The following label will be automatically applied to this pull request: When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing list. If you would like to change these labels, use the
Webrevs
/label add hotspot-gc
@kimbarrett
The title for this review was supposed to be based on JDK-8158045 "Improve large object handling during evacuation", but because of SKARA-634 the last of the three associated issue commands was used, rather than the first. |
    PartialArrayTaskStepper::Step step
      = _partial_array_stepper.start(objArrayOop(from_obj),
                                     to_array,
                                     _partial_objarray_chunk_size);

    // Push any needed partial scan tasks.  Pushed before processing the
    // initial chunk to allow other workers to steal while we're processing.
    for (uint i = 0; i < step._ncreate; ++i) {
      push_on_queue(ScannerTask(PartialArrayScanTask(from_obj)));
    }
I think the intention would be clearer if start returned a pair <int initial_chunk_size, bool has_left_over> instead of reusing the Step struct.
I reused Step intentionally, to not commit at this layer to only creating one partial task. The stepper can change its calculations and this layer won't care.
      return obj;
    }

    if (G1StringDedup::is_enabled() && (klass == SystemDictionary::String_klass())) {
It's unclear to me why there is the additional checking with String_klass().
Faster fast-path, avoiding some of the other calculations before going into G1StringDedup::enqueue_from_evacuation and only there discovering we're not dealing with a String at all. But this doesn't belong in this changeset. I'll remove it.
src/hotspot/share/oops/arrayOop.hpp
@@ -104,12 +104,19 @@ class arrayOopDesc : public oopDesc {

   // Accessors for instance variable which is not a C++ declared nonstatic
   // field.
-  int length() const {
-    return *(int*)(((intptr_t)this) + length_offset_in_bytes());
+  int length() const { return *length_addr(); }
Personally, I find the original code easier to read, since the identical content, *(int*)(((char*)mem) + length_offset_in_bytes()), exists in both length() and set_length(HeapWord* mem, int length).
I should have included the 2-arg set_length in the changes to make everything consistent. I'll revise in the next update.
Overall a very nice improvement. Some comments and ideas below.
@@ -79,6 +80,9 @@ class G1ParScanThreadState : public CHeapObj<mtGC> {
   // Indicates whether in the last generation (old) there is no more space
   // available for allocation.
   bool _old_gen_is_full;
+  // Size (in elements) of a partial objArray task chunk.
+  int _partial_objarray_chunk_size;
I would prefer to skip the "obj"-part here to have more consistent naming or, as mentioned above, include it in the stepper instead.
Some of the naming and factoring I've done is forward looking.
I think we should consider splitting the copying part as well as the scanning, and chunking the copy of typeArrays. For example, JDK-8031565 suggests copying large typeArrays as part of termination waiting; I think splitting them into partial array tasks to use the normal parallelism in the framework is better than some new side channel. The chunk size for that should probably be substantially larger (and depend on the element size) than for objArray.
Also, Project Valhalla is going to add new kinds of arrays that are neither objArray nor typeArray. We'll want to split them too. The same splitting calculations can apply, even though the chunk size may be different (and probably depends on the element klass).
Ok, keeping the "obj"-part sounds reasonable.
@@ -157,6 +162,7 @@ class G1ParScanThreadState : public CHeapObj<mtGC> {

 private:
   inline void do_partial_array(PartialArrayScanTask task);
+  inline void start_partial_objArray(G1HeapRegionAttr dest_dir, oop from, oop to);
Same here, drop "obj" for consistent naming and avoiding the camel-case.
See response above about the name of _partial_objarray_chunk_size. But I should probably be consistent about objarray vs objArray. Since objArray is what it is over in runtime oop-land I'm going to go with that.
Again, we have different taste. We very seldom go with camel-case in members, and especially in the middle it looks strange. Looking through the GC code I find mostly objarray or obj_array.
You are right; in gc code, it seems "objarray" is used almost exclusively. The one exception I found seems to be around the G1CMObjArrayProcessor, where some camel-case has snuck in, such as G1ConcurrentMark::_objArray_processor. So I'll use "objarray".
  static uint compute_task_limit(uint n_workers) {
    // Don't need more than n_workers tasks at a time.  But allowing up to
    // that maximizes available parallelism.
    return n_workers;
  }
In preparation for a more advanced logic for the limit, or why not just use the input at the call-site?
Being noncommittal about whether something more "clever" could or should be done.
👍
  // The from-space object contains the real length.
  int length = from->length();
  assert(start < length, "invariant: start %d, length %d", start, length);
Just so I'm not missing anything. There will never be more tasks than chunks left, so there is no risk that we have a race for the last chunk and thus hit this assertion?
That's correct. Unless there's a bug in the analysis or implementation I haven't found. That's why it's an invariant.
  _partial_objarray_chunk_size(ParGCArrayScanChunk),
  _partial_array_stepper(n_workers),
What do you think about saving the chunk size in the stepper instead? Then we don't need to pass it in to start() and next(). To avoid needing it for the call to oop_iterate_range() we could instead have the Step include both the start and end index.
See above discussion about naming and factoring. The same stepper can support multiple array types if the chunk size is external to the stepper.
It can, but I think I would prefer multiple "steppers" for that case. This is of course just a matter of taste and I'm fine with leaving the chunk size external.
  struct Step {
    int _index;     // Array index for the step.
    uint _ncreate;  // Number of new tasks to create.
  };
I wouldn't mind having getters for those, but it's not a hard request :)
Step is intended to be a trivial data carrier. A (HotSpot) Pair or std::pair (if we used stdlib) would do, except I like named data items.
I prefer them named as well 👍
  assert(_task_limit > 0, "precondition");
  assert(_task_fannout > 0, "precondition");
  uint max_pending = (_task_fannout - 1) * task_num + 1;

  // The actual pending may be less than that.  Bound by remaining_tasks to
  // not overrun.  Also bound by _task_limit to avoid spawning an excessive
  // number of tasks for a large array.  The +1 is to replace the current
  // task with a new task when _task_limit limited.  The pending value may
  // not be what's actually in the queues, because of concurrent task
  // processing.  That's okay; we just need to determine the correct number
  // of tasks to add for this task.
  uint pending = MIN3(max_pending, remaining_tasks, _task_limit);
  uint ncreate = MIN2(_task_fannout, MIN2(remaining_tasks, _task_limit + 1) - pending);
  Step result = { start, ncreate };
  return result;
Similar to my comment above, what do you think about trying to write some tests for this to verify we never get too many "tasks"?
The asserts in the stepper's next() after the increment of to's length verify that we haven't overrun. I haven't thought of a way to verify the algorithm doesn't generate too few tasks though. Well, other than getting crashes because some array elements didn't get processed. But maybe you mean unit tests? I will try to write some; I should have done so already.
Should have been a bit more clear. I'm talking about unit tests, so if you plan writing some that's great.
Mailing list message from Kim Barrett on hotspot-dev:
Well, that discussion is only “above” in the github UI tool. Now that I see what happens when I reply there, maybe I won’t do that so much.
Nice test and thanks for going with "objarray".
Looks good and I'll approve this now, but please fix the indent issue in the test before integrating.
                                   int length,
                                    int* to_length_addr,
                                    uint chunk_size) {
Indent looks to be one-off here.
@kimbarrett This change now passes all automated pre-integration checks. In addition to the automated checks, the change must also fulfill all project specific requirements. After integration, the commit message will be:
Since the source branch of this PR was last updated there have been 49 commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid automatic rebasing, please merge. ➡️ To integrate this PR with the above commit message to the
Mailing list message from Thomas Schatzl on hotspot-dev: Hi, On 11.09.20 11:01, Kim Barrett wrote:
- partialArrayTaskStepper.hpp: s/_task_fannout/_task_fanout (or maybe _task_fan_out)
- since _task_fannout and _task_limit are only assigned in the
- also the asserts in PartialArrayTaskStepper::next_impl could be
Looks good otherwise. Thanks,
Mailing list message from Kim Barrett on hotspot-dev:
Thanks, will do.
Yeah, changed return type from void to int at some point and missed adjusting the indentation.
Mailing list message from Kim Barrett on hotspot-dev:
I thought it looked a little odd at first, but never got around to spell
I'd rather not. I've been down that road and didn't like the results. Just
As you say, the current location shows the requirements for the nearby
Thanks.
/issue add 8158045 8027761 8027545
@kimbarrett Adding additional issue to issue list:
Mailing list message from Thomas Schatzl on hotspot-dev: Hi, On 12.09.20 03:02, Kim Barrett wrote:
Okay. Going to mark this as reviewed assuming the rename will be done. Thanks,
/summary Generate multiple partial array tasks for large objArrays.
@kimbarrett Setting summary to
/integrate
@kimbarrett Since your change was applied there have been 49 commits pushed to the
Your commit was automatically rebased without conflicts. Pushed as commit dafcf10. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
/issue add JDK-8158045
/issue add JDK-8027761
/issue add JDK-8027545
This is rework of initial change from before the transition to git.
The initial RFR email is attached below.
The primary change is to limit the number of partial array tasks in
the queues for any given array. The original change just split up an
array into N tasks that were all enqueued at once. But for a large
array this could be a lot of tasks, leading to significant and
unnecessary queue expansion. Instead we now limit the number of tasks
for a given array based on the number of workers, and only gradually
add tasks up to that limit. This gives other threads an opportunity
to steal such tasks, while still keeping queue growth under control.
Most of the calculation for this is handled by a new helper class, so
this can later be shared with ParallelGC.
The dispatch on array klass type has also been changed. It now
affirmatively breaks Project Valhalla, rather than quietly doing
something that probably isn't what is actually wanted. I'll discuss
with them so there is a plan for dealing with it when they take this
update.
Ran tier1-6 in mach5 and some local stress tests.
Performance testing was unchanged from previous, except I wasn't able
to reproduce the small specjbb2015 critical-jops improvement
previously seen on one platform. My suspicion is that improvement was
a statistical aberration.
--- Initial RFR email ---
RFR: 8158045: Improve large object handling during evacuation
RFR: 8027761: Investigate fast-path for scanning only objects with references during gc
RFR: 8027545: Improve object array chunking test in G1's copy_to_survivor_space
Please review this change to type dispatching and handling in G1's
evacuation copying, in order to improve the hot paths and improve array
handling. This change addresses several closely co-located enhancement
requests; it seemed difficult to split them up in a sensible way.
do_copy_to_survivor_space now gets the klass of the object being copied
once, up front, for use in multiple places. This avoids fetching (including
re-decoding when compressed) the klass multiple times. This addresses part
of JDK-8027545.
Moved check for and handling of string deduplication later, only applying it
after the special array cases have been dealt with, since strings are not
arrays. (They are header objects pointing to an array of character values.)
Special case typeArray, doing nothing other than the copy, since they
contain no oops that need to be processed. This addresses JDK-8027761.
Changed handling of objArray, pushing all of the partial array tasks up
front, rather than processing the current chunk after pushing a single task
for the remaining work. This addresses JDK-8158045.
As part of these, cached some more frequently accessed values in
G1ParScanThreadState member variables. This addresses part of
JDK-8027545.
While both the old and new code will work for Project Valhalla, the handling
of arrays should be updated for that project, which introduces new array
types.
Deleted a lingering reference to G1ParScanPartialArrayClosure that was
deleted long ago (JDK-8035330, JDK 9).
CR:
https://bugs.openjdk.java.net/browse/JDK-8158045
https://bugs.openjdk.java.net/browse/JDK-8027761
https://bugs.openjdk.java.net/browse/JDK-8027545
Webrev:
https://cr.openjdk.java.net/~kbarrett/8158045/open.00/
Testing:
tier1-3
performance testing - seems to be at worst performance neutral, with a
statistically significant 3% improvement in specjbb2015 critical-jops seen
on one platform.
Download
$ git fetch https://git.openjdk.java.net/jdk pull/90/head:pull/90
$ git checkout pull/90