8300915: G1: incomplete SATB because nmethod entry barriers don't get armed #12194

Conversation

@reinrich (Member) commented Jan 25, 2023

The change makes sure all nmethod entry barriers are armed when a G1 concurrent marking cycle starts, by arming them unconditionally in G1ConcurrentMark::pre_concurrent_start(). The barriers are needed to add the oop constants of compiled code to the SATB.

This alone would be a conservative/minimal fix, but I didn't like that CodeCache::is_gc_marking_cycle_active() can return true after a marking cycle start is undone, i.e. G1ConcurrentMarkThread::_state == Idle && CodeCache::is_gc_marking_cycle_active() is possible.
Edit: It felt like an inconsistency, but maybe it's actually ok not to finish the code cache marking cycle when the G1 marking is undone. Please comment...
The changes in G1ConcurrentMark::post_concurrent_undo_start() were made to avoid that state.
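In shape, the change amounts to something like the following sketch (not the literal diff; the exact placement is approximate, and the undo handling was revised during the review below):

// In G1ConcurrentMark::pre_concurrent_start(): start a code cache marking
// cycle and always arm the nmethod entry barriers, so that oop constants
// embedded in compiled code get added to the SATB.
CodeCache::on_gc_marking_cycle_start();
CodeCache::arm_all_nmethods();

// In G1ConcurrentMark::post_concurrent_undo_start(): finish the code cache
// marking cycle again, so that CodeCache::is_gc_marking_cycle_active() does
// not stay true while G1ConcurrentMarkThread::_state == Idle.
CodeCache::on_gc_marking_cycle_finish();
CodeCache::arm_all_nmethods();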

Testing

  • Rebased this PR onto JDK-8297487, replaced related assertions with guarantees, and then executed test/langtools:tier1 in a loop for 12h without seeing the issues reported in JDK-8299956.

  • CI testing at SAP: This includes most JCK and JTREG tiers 1-4, also in Xcomp mode, on the standard platforms and also on ppc64le.

  • GHA: build failures because the download of Apache Ant failed. serviceability/jvmti/vthread/SuspendResumeAll/SuspendResumeAll.java had a timeout on windows-x64; this should be unrelated, as the test succeeded in our CI testing.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8300915: G1: incomplete SATB because nmethod entry barriers don't get armed

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/12194/head:pull/12194
$ git checkout pull/12194

Update a local copy of the PR:
$ git checkout pull/12194
$ git pull https://git.openjdk.org/jdk pull/12194/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 12194

View PR using the GUI difftool:
$ git pr show -t 12194

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/12194.diff

@bridgekeeper bot commented Jan 25, 2023

👋 Welcome back rrich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk bot commented Jan 25, 2023

@reinrich The following label will be automatically applied to this pull request:

  • hotspot-gc

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk bot added the hotspot-gc (hotspot-gc-dev@openjdk.org) label Jan 25, 2023
Comment on lines -3372 to -3377
// This is the normal case when we do not call collect when a
// concurrent mark is ongoing. We then start a new code marking
// cycle. If, on the other hand, a concurrent mark is ongoing, we
// will be conservative and use the last code marking cycle. Code
// caches marked between the two concurrent marks will live a bit
// longer than needed.
reinrich (Member Author):

I removed the comment because I found it a little misleading. If "concurrent marking" means G1 CM, then we never reach here with an ongoing concurrent marking; when we reach here, G1 CM was either finished normally, undone, or aborted. But maybe "concurrent marking" means marking of nmethods?

reinrich marked this pull request as ready for review January 25, 2023 13:47
@openjdk bot added the rfr (Pull request is ready for review) label Jan 25, 2023

@tschatzl (Contributor):

Edit: It felt like an inconsistency but maybe its actually ok not to finish the CC marking cycle when the G1 marking is undone. Please comment...
The changes in G1ConcurrentMark::post_concurrent_undo_start() were made to avoid it.

I think this has merely been an oversight/bug.

src/hotspot/share/gc/g1/g1CollectedHeap.cpp (outdated review thread, resolved)
Comment on lines 795 to 796
CodeCache::on_gc_marking_cycle_start();
CodeCache::arm_all_nmethods();
Contributor:

Maybe factor this out into a method (start_codecache_marking_cycle) and call it above.

reinrich (Member Author):

Like this right?

void G1CollectedHeap::start_codecache_marking_cycle() {
  CodeCache::on_gc_marking_cycle_start();
  CodeCache::arm_all_nmethods();
}

Contributor:

Exactly. Also I meant "... to call that method here and above"; i.e. I asked to factor out because this is done twice.

reinrich (Member Author):

Ok, will do.

Contributor:

Maybe the "finish" stuff could be extracted out similarly, I did not look if the calls are in the same file so it may not be possible.

@fisk (Contributor) left a comment:

Does this change consider the case when we do an initiating mark YC, but there were no old objects, and hence the "finished" part is never executed as CM is skipped? It isn't obvious to me that said case is handled correctly.

reinrich and others added 2 commits January 25, 2023 18:24
Co-authored-by: Thomas Schatzl <59967451+tschatzl@users.noreply.github.com>
@tschatzl (Contributor) commented Jan 25, 2023

Does this change consider the case when we do an initiating mark YC, but there were no old objects, and hence the "finished" part is never executed as CM is skipped? It isn't obvious to me that said case is handled correctly.

Not sure this is what you were looking for, but the code adds the CodeCache::on_gc_marking_cycle_finish() + arm calls at the end of the concurrent start gc (still within the pause) by G1ConcurrentMark::post_concurrent_undo_start.
This is when G1 "skips" the concurrent mark.

There is no possibility for G1 to completely do nothing e.g. in the case when there are no old regions. It always needs to do the "Undo Mark" stuff.

@reinrich (Member Author):

Does this change consider the case when we do an initiating mark YC, but there were no old objects, and hence the "finished" part is never executed as CM is skipped? It isn't obvious to me that said case is handled correctly.

Not sure this is what you were looking for, but the code adds the CodeCache::on_gc_marking_cycle_finish() + arm calls at the end of the concurrent start gc (still within the pause) by G1ConcurrentMark::post_concurrent_undo_start.

Thanks for stepping in @tschatzl. I wasn't quite sure if "... as CM is skipped ..." referred to the 'Undo' of the marking start or something else I wasn't aware of. I agree with your answer.
I'd also point out that with this change the nmethod entry barriers will always be armed when concurrent marking begins.

Also, if CM is skipped, CodeCache::_gc_epoch will be incremented. This has the effect that nmethods appear cooler than before this PR. I don't think this is a problem.

@fisk (Contributor) commented Jan 26, 2023

Does this change consider the case when we do an initiating mark YC, but there were no old objects, and hence the "finished" part is never executed as CM is skipped? It isn't obvious to me that said case is handled correctly.

Not sure this is what you were looking for, but the code adds the CodeCache::on_gc_marking_cycle_finish() + arm calls at the end of the concurrent start gc (still within the pause) by G1ConcurrentMark::post_concurrent_undo_start. This is when G1 "skips" the concurrent mark.

There is no possibility for G1 to completely do nothing e.g. in the case when there are no old regions. It always needs to do the "Undo Mark" stuff.

That's what I meant, thanks.

@fisk (Contributor) commented Jan 26, 2023

Hmm. Some thoughts. The way I look at it, we have essentially 4 different endings to CM when we poke it to "start". 1) successful finish, 2) undone end (no CM needed in the end), 3) aborted by STW full GC, 4) aborted by new CM request.
If I understand this patch correctly, it tries to recognize case 2 as essentially being like 1, in the sense that we have traversed the full heap and we are done. Is that correct?

In that case, I think there is still the same issue for case 4. If a CM is aborted because we are interrupting an old CM in order to start a new CM, then we still have a case where we have an old marking cycle that has finished, but not successfully. We can't call G1CollectedHeap::finish_codecache_marking_cycle() until we have successfully marked through all objects in the heap. If we were to call that when an old CM is aborted, then is_unloading() might start returning true on nmethods that were on-stack in some loom continuation object in the old generation.

This was the main point of only conditionally starting cycles - then the epoch counters being marked in the nmethods would be conservative and potentially keep nmethods around a bit longer than really needed before they could be discarded.

Naturally, what I didn't consider at that time was that not arming all nmethods at the latest initiating marking operation was an SATB violation, when we rely on nmethod entry barriers enforcing SATB constraints. At the time when that code was written, we thought that G1 didn't need that, but then it turned out that it did.
Having said all that, I think what I'm getting at is that this patch fixes the SATB problem, but introduces a new premature nmethod unloading condition for the loom code instead. It's like we want to have the cake and eat it too.
What would of course be super nice is if we didn't abort CMs to start new CMs. I'm not entirely sure why that's a thing in G1. Anyway, if we want to keep on doing that, then I think what we need to do is to explicitly recognize when case 4 happens, and then walk the code cache, mark all nmethods as maybe_on_stack (because the truth is that we don't really know if the GC never got to finish successfully), and then call G1CollectedHeap::finish_codecache_marking_cycle(). Then we can close the gap and solve both the SATB issue and the loom nmethod unloading problems at the same time.
Hope this makes sense, and feel free to ask if something isn't clear in my description. And thanks for looking into this.

@reinrich (Member Author):

Hmm. Some thoughts. The way I look at it, we have essentially 4 different
endings to CM when we poke it to "start". 1) successful finish, 2) undone end
(no CM needed in the end), 3) aborted by STW full GC, 4) aborted by new CM
request. If I understand this patch correctly, it tries to recognize case 2 as
essentiallly being like 1, in the sense that we have traversed the full heap
and we are done. Is that correct?

Well it can be seen like this.

I wasn't aware of 4). Does it really exist? The assertion in CodeCache::on_gc_marking_cycle_start() should have fired here:

CodeCache::on_gc_marking_cycle_start() : void
G1CollectedHeap::start_codecache_marking_cycle() : void
G1ConcurrentMark::pre_concurrent_start(enum GCCause::Cause) : void
G1YoungCollector::pre_evacuate_collection_set(G1EvacInfo *) : void
G1YoungCollector::collect() : void

I haven't seen it so far though.

Maybe you mean the restart of marking in G1ConcurrentMarkThread::phase_mark_loop()?
This is not problematic I think.

In that case, I think there is still the same issue for case 4. If a CM is
aborted because we are interrupting an old CM in order to start a new CM, then
we still have a case where we have an old marking cycle that has finished, but
not successfully. We can't call
G1CollectedHeap::finish_codecache_marking_cycle() until we have successfully
marked through all objects in the heap. If we were to call that when an old CM
is aborted, then is_unloading() might start returning true on nmethods that
were on-stack in some loom continuation object in the old generation. This was
the main point of only conditionally starting cycles - then the epoch counters
being marked in the nmethods would be conservative and potentially keep
nmethods around a bit longer that really could be discarded.

That's a very good point; I didn't see it and didn't get it from the comment I removed.

On the other hand: do we ever call is_unloading() on nmethods without a complete marking cycle?

Naturally, what I didn't consider at that time, was that not arming all
nmethods at the latest initiating marking operation was an SATB violation,
when we rely on nmethod entry barriers enforcing SATB constraints. At the time
when that code was written, we thought that G1 didn't need that, but then it
turned out that it did. Having said all that, I think what I'm getting at is
that this patch fixes the SATB problem, but introduces a new premature nmethod
unloading condition for the loom code instead. It's like we want to have the
cake and eat it too. What would of course be super nice is if we didn't abort
CMs to start new CMs. I'm not entirely sure why that's a thing in G1. Anyway,
if we want to keep on doing that, then I think what we need to do is to
explicitly recognize when case 4 happens, and then walk the code cache, mark
all nmethods as maybe_on_stack (because the truth is that we don't really know
if the GC never got to finish successfully), and then call
G1CollectedHeap::finish_codecache_marking_cycle(). Then we can close the gap
and solve both the SATB issue and the loom nmethod unloading problems at the
same time. Hope this makes sense, and feel free to ask if something isn't
clear in my description. And thanks for looking into this.

I understand that cases 2) and 4) might be problematic because of Loom (nmethod frames on the Java heap). Ad hoc thought though: can't we just undo the code cache cycle by decrementing CodeCache::_gc_epoch?

But maybe it's not really an issue because we don't do unloading without a complete cycle...?

@tschatzl (Contributor):

Hmm. Some thoughts. The way I look at it, we have essentially 4 different
endings to CM when we poke it to "start". 1) successful finish, 2) undone end
(no CM needed in the end), 3) aborted by STW full GC, 4) aborted by new CM
request. If I understand this patch correctly, it tries to recognize case 2 as
essentiallly being like 1, in the sense that we have traversed the full heap
and we are done. Is that correct?

Well it can be seen like this.

I wasn't aware of 4). Does it really exist? The assertion in CodeCache::on_gc_marking_cycle_start() should have fired here:

Case 4 isn't possible. During an existing marking (from start to the Remark pause), the G1CollectorState::initiate_conc_mark_if_possible flag prevents that. The related code to set that is fairly strewn around G1Policy though, I agree.

CodeCache::on_gc_marking_cycle_start() : void
G1CollectedHeap::start_codecache_marking_cycle() : void
G1ConcurrentMark::pre_concurrent_start(enum GCCause::Cause) : void
G1YoungCollector::pre_evacuate_collection_set(G1EvacInfo *) : void
G1YoungCollector::collect() : void

I haven't seen it so far though.

Neither have we.

Maybe you mean the restart of marking in G1ConcurrentMarkThread::phase_mark_loop()? This is not problematic I think.

In that case, I think there is still the same issue for case 4. If a CM is
aborted because we are interrupting an old CM in order to start a new CM, then
we still have a case where we have an old marking cycle that has finished, but
not successfully. We can't call
G1CollectedHeap::finish_codecache_marking_cycle() until we have successfully
marked through all objects in the heap. If we were to call that when an old CM
is aborted, then is_unloading() might start returning true on nmethods that
were on-stack in some loom continuation object in the old generation. This was
the main point of only conditionally starting cycles - then the epoch counters
being marked in the nmethods would be conservative and potentially keep
nmethods around a bit longer that really could be discarded.

That's a very good point I didn't see and didn't get it from the comment I removed.

Maybe we can improve the comment instead of removing it then? :)

Thomas

@fisk (Contributor) commented Jan 26, 2023

I thought case 4 exists (I thought I ran into that case last time I tried to have a stricter start/finish model with asserts), but perhaps I really ran into something else. In that case, this approach looks good.

@openjdk bot commented Jan 26, 2023

@reinrich This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8300915: G1: incomplete SATB because nmethod entry barriers don't get armed

Reviewed-by: tschatzl, eosterlund

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 48 new commits pushed to the master branch:

  • 107e184: 8301179: Replace NULL with nullptr in share/gc/serial/
  • b77abc6: 8301178: Replace NULL with nullptr in share/gc/epsilon/
  • f7da09c: 8301164: Remove unused ResourceStack class
  • 6e4710b: 8300253: Introduce AArch64 nzcv accessors
  • c6b3f2d: 8301129: Link to debuginfo files should only be made after stripping
  • 938b409: 8301133: IGV: NPE occurs when creating a diff graph with a graph in a different folder
  • 0eb1f66: 8298038: [s390] Configure script detects num_cores +1
  • c3ff151: 8301190: [vectorapi] The typeChar of LaneType is incorrect when default locale is tr
  • 7eff578: 8288050: Add support of SHA-512/224 and SHA-512/256 to the PBKDF2 and PBES2 impls in SunJCE provider
  • d6007a3: 8298869: Update ConnectionTest.java for changes to TLS implementation
  • ... and 38 more: https://git.openjdk.org/jdk/compare/b2071f79d854f40df0f3bc2de6828fbcea4d325a...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk bot added the ready (Pull request is ready to be integrated) label Jan 26, 2023
@reinrich (Member Author):

In that case, I think there is still the same issue for case 4. If a CM is
aborted because we are interrupting an old CM in order to start a new CM, then
we still have a case where we have an old marking cycle that has finished, but
not successfully. We can't call
G1CollectedHeap::finish_codecache_marking_cycle() until we have successfully
marked through all objects in the heap. If we were to call that when an old CM
is aborted, then is_unloading() might start returning true on nmethods that
were on-stack in some loom continuation object in the old generation. This was
the main point of only conditionally starting cycles - then the epoch counters
being marked in the nmethods would be conservative and potentially keep
nmethods around a bit longer that really could be discarded.

That's a very good point I didn't see and didn't get it from the comment I removed.

On the other hand: do we ever call is_unloading() on nmethods without a complete marking cycle?

I see now: we do indeed, e.g. in CodeBlobIterator::next_impl().

So I'm beginning to think we should do the conservative/minimal fix mentioned in the PR description above: just arm nmethod entry barriers unconditionally in the initiating YC pause before the concurrent marking.

@fisk (Contributor) commented Jan 26, 2023

[quoting the exchange above: the case-4 concern, that is_unloading() is indeed called without a complete marking cycle (e.g. in CodeBlobIterator::next_impl()), and the suggestion to fall back to the conservative/minimal fix]

Right. Thomas is assuring me though that case 4 no longer exists. If that's the case, then my worries all vanish.

@reinrich (Member Author):

I do think the problem exists with case 2) undone end (no CM needed in the end).

CodeCache::_gc_epoch is incremented before the YC and, with the current version of the PR, again when the CM start is undone after the YC at the same safepoint. This can happen a couple of times, and is_unloading() might start returning true on nmethods in Loom StackChunks. That would not be good, as they do have frames.
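For intuition, a rough self-contained sketch of the kind of bookkeeping involved (hypothetical names, not the actual CodeCache/nmethod fields): an nmethod only counts as "maybe on stack" while the epoch recorded in it is not older than the last completed marking cycle, so undone concurrent starts that both bump the epoch and "finish" the cycle age on-stack nmethods artificially.

#include <cstdint>

struct EpochSketch {                     // illustration only, not HotSpot code
  uint64_t global_epoch         = 1;     // bumped when a marking cycle starts
  uint64_t last_completed_cycle = 0;     // bumped when a marking cycle finishes

  void on_cycle_start()  { global_epoch++; }                     // on-stack nmethods get re-tagged with global_epoch
  void on_cycle_finish() { last_completed_cycle = global_epoch; }

  // An nmethod whose recorded epoch fell behind the last completed cycle is
  // treated as not on stack and may become eligible for unloading, even if a
  // StackChunk in the old generation still holds frames of it.
  bool is_maybe_on_stack(uint64_t nmethod_recorded_epoch) const {
    return nmethod_recorded_epoch >= last_completed_cycle;
  }
};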

@fisk (Contributor) commented Jan 26, 2023

I do think the problem exists with case 2) undone end (no CM needed in the end).

CodeCache::_gc_epoch is incremented before the YC and with the current version of the pr again when the CM start is undone after YC at the same safepoint. This can happen a couple of times and is_unloading() might start returning true on nmethods in loom StackChunks. That would be not good as they do have frames.

Isn't it the case though that the CM is undone because the YC already visited all the objects? Then when the _gc_epoch is incremented at the end where we call the undo hook, the implication is that we have finished the full collection, and hence nmethods that haven't been marked as maybe on-stack, can't be on-stack (virtually or physically). That assumption seems to hold true, since the YC traversed all the stack chunks in the young generation, and there aren't any in the old generation. Or is there a situation where we undo, even though there is more than one object in the old generation that we have not visited?

@tschatzl (Contributor):

I do think the problem exists with case 2) undone end (no CM needed in the end).
CodeCache::_gc_epoch is incremented before the YC and with the current version of the pr again when the CM start is undone after YC at the same safepoint. This can happen a couple of times and is_unloading() might start returning true on nmethods in loom StackChunks. That would be not good as they do have frames.

Isn't it the case though that the CM is undone because the YC already visited all the objects? Then when the _gc_epoch is

The reason why the CM is undone is that the heap occupancy after gc is lower than the threshold to start the initial marking at the end of gc.
The undo has nothing to do with having visited a certain set of objects.

incremented at the end where we call the undo hook, the implication is that we have finished the full collection, and hence nmethods that haven't been marked as maybe on-stack, can't be on-stack (virtually or physically). That assumption seems to hold true, since the YC traversed all the stack chunks in the young generation, and there aren't any in the old generation. Or is there a situation where we undo, even though there is more than one object in the old generation that we have not visited?

The undo does not take (and never took) whether objects have been visited or not or whether there are stack chunks in the old generation into account. It makes that decision purely on heap occupancy.

@tschatzl (Contributor):

incremented at the end where we call the undo hook, the implication is that we have finished the full collection, and hence nmethods that haven't been marked as maybe on-stack, can't be on-stack (virtually or physically). That assumption seems to hold true, since the YC traversed all the stack chunks in the young generation, and there aren't any in the old generation. Or is there a situation where we undo, even though there is more than one object in the old generation that we have not visited?

The undo does not take (and never took) whether objects have been visited or not or whether there are stack chunks in the old generation into account. It makes that decision purely on heap occupancy.

So things are broken :(

(Random 1min idea coming up....):

One option could be decreasing the epoch again, but that would keep alive objects already tagged in this concurrent start.
What about having a separate "unloading-epoch" and "next-epoch"?

When determining whether an object can be unloaded, use the unloading epoch, that can be decremented again at the end of marking, while the next-epoch (that is always incremented) is used next time we start the marking for tagging. If we complete the marking, set the unloading-epoch to next-epoch.

E.g.
mark start:
unloading-epoch = 1
next-epoch = 1

tag all nmethods with next-epoch ( = 1)

case 1:
we decide to abort the marking
decrement unloading-epoch (= 0)
increment next-epoch (= 2)

(since we compare nmethods with unloading-epoch (=0) everything is alive)

[later]

mark start
unloading-epoch = 0
next-epoch = 2

tag all nmethods with next-epoch (= 2)

we finish the marking
set unloading epoch to next-epoch (=2)
increment next-epoch (=3)

and so on. This handles an infinite amount of concurrent undos I think.

As mentioned, a one-minute idea..., not sure it works or is useful at all.
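A small self-contained sketch of that two-counter idea, just to make the sequence above concrete (hypothetical names, not existing CodeCache code):

#include <cstdint>

struct TwoEpochSketch {
  uint64_t unloading_epoch = 1;  // compared against when deciding whether an nmethod may be unloaded
  uint64_t next_epoch      = 1;  // used to tag nmethods at mark start

  // mark start: tag all nmethods with next_epoch (tagging itself elided here)

  void on_marking_undone() {
    unloading_epoch--;            // everything compares as alive again
    next_epoch++;                 // the next marking uses a fresh tag
  }

  void on_marking_finished() {
    unloading_epoch = next_epoch; // a completed marking validates the tags
    next_epoch++;
  }
};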

@fisk (Contributor) commented Jan 26, 2023

I do think the problem exists with case 2) undone end (no CM needed in the end).

CodeCache::_gc_epoch is incremented before the YC and with the current version of the pr again when the CM start is undone after YC at the same safepoint. This can happen a couple of times and is_unloading() might start returning true on nmethods in loom StackChunks. That would be not good as they do have frames.

Isn't it the case though that the CM is undone because the YC already visited all the objects? Then when the _gc_epoch is

The reason why the CM is undone because the heap occupancy after gc is lower than the threshold to start the initial marking at the end of gc.

The undo has nothing to do with having visited a certain set of objects.

incremented at the end where we call the undo hook, the implication is that we have finished the full collection, and hence nmethods that haven't been marked as maybe on-stack, can't be on-stack (virtually or physically). That assumption seems to hold true, since the YC traversed all the stack chunks in the young generation, and there aren't any in the old generation. Or is there a situation where we undo, even though there is more than one object in the old generation that we have not visited?

The undo does not take (and never took) whether objects have been visited or not or whether there are stack chunks in the old generation into account. It makes that decision purely on heap occupancy.

Ah. So essentially case 4 does exist, but it's case 2 that doesn't. Case 4 is case 2. Shoot...

@reinrich (Member Author):

Yet another 1min solution: increment CodeCache::_gc_epoch after the YC when we know we won't undo.
Nmethods that are currently on the native stacks look slightly older but that can be accounted for in nmethod::is_maybe_on_stack()

@fisk (Contributor) commented Jan 26, 2023

[quoting tschatzl's unloading-epoch / next-epoch idea above]

How about just arming the nmethods unconditionally when we mark start? What we have works for nmethod unloading but isn't SATB safe. By arming the nmethods at mark start unconditionally, it seems like we fix the SATB problem as well?

@reinrich (Member Author):

[quoting tschatzl's unloading-epoch / next-epoch idea above]

How about just arming the nmethods unconditionally when we mark start? What we have works for nmethod unloading but isn't SATB safe. By arming the nmethods at mark start unconditionally, it seems like we fix the SATB problem as well?

That's what I described as the conservative/minimal fix. Please read above.

@fisk (Contributor) commented Jan 26, 2023

[quoting the exchange above: the two-epoch idea, the suggestion to arm nmethods unconditionally at mark start, and reinrich's note that this is the conservative/minimal fix already described in the PR]

Ah, I see. It sounds good to me.

@reinrich (Member Author):

I've pushed an "extended" conservative fix that does not increment CodeCache::_gc_epoch when g1 concurrent marking is undone or aborted. But it always arms nmethods before concurrent marking starts (3074787).

I noticed that the code cache cycle was finished in G1ConcurrentMark::remark() even if marking was not finished because of marking stack overflow. Marking is restarted in this case (after root region scan). Upon the second remark() attempt an assertion should fire when calling CodeCache::on_gc_marking_cycle_finish() because it was finished before. In the release build this could make nmethods with frames in continuations look as if they didn't have any frames. I fix this too (da2162f).

I'm testing if this version still fixes https://bugs.openjdk.org/browse/JDK-8299956
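A hedged sketch of the remark() part of that fix (method names taken from the discussion; the exact control flow in the patch is an assumption): only finish the code cache marking cycle when marking actually completed.

void G1ConcurrentMark::remark() {
  // ... remark work ...
  if (has_overflown()) {
    // Marking-stack overflow: marking is restarted (after root region scan),
    // so leave the code cache marking cycle open. Finishing it here and again
    // after the retry would trip the assertion in
    // CodeCache::on_gc_marking_cycle_finish() and, in release builds, could
    // make nmethods with frames in continuation StackChunks look frameless.
    return;
  }
  CodeCache::on_gc_marking_cycle_finish();
  CodeCache::arm_all_nmethods();
}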

CodeCache::on_gc_marking_cycle_start();
}
if (!full_gc) {
Contributor:

Should this be else if?

reinrich (Member Author):

No, !full_gc means we reach here because concurrent marking is about to start, and we want to arm unconditionally then. We never arm for the start of a full gc, on the other hand.

Would it be clearer if I renamed full_gc to concurrent_marking_start?
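A hedged reconstruction of the shape being discussed (the enclosing helper name is an assumption; the parameter is the full_gc flag from the diff excerpt above):

void G1CollectedHeap::start_codecache_marking_cycle(bool full_gc) {
  if (!CodeCache::is_gc_marking_cycle_active()) {
    // Only open a new code cache marking cycle if none is active, e.g. one
    // left open by an undone or aborted concurrent mark.
    CodeCache::on_gc_marking_cycle_start();
  }
  if (!full_gc) {
    // Concurrent start pause: always arm, since a complete SATB needs the
    // nmethod entry barriers. Never arm for the start of a full gc.
    CodeCache::arm_all_nmethods();
  }
}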

Contributor:

Yes, I think so. Otherwise, looks good. Thanks again!

Contributor:

please use concurrent_mark_start (without the "ing"), or in_concurrent_start_mark[_pause].

@tschatzl (Contributor) left a comment:

Some comments/bikeshedding :)
Lgtm otherwise.

//
// (1) was aborted and a full gc is about to begin or
// (2) it was undone because the heap occupancy was below the threshold after
// the initiating young gc (see G1ConcurrentMark::post_concurrent_undo_start())
Contributor:

Suggested change
// the initiating young gc (see G1ConcurrentMark::post_concurrent_undo_start())
// the concurrent start young gc (see G1ConcurrentMark::post_concurrent_undo_start())

Comment on lines 3378 to 3381
// Especially in case (2) it is important that the active codecache cycle was not
// finished when executing the G1 undo operation. If it was (repeatedly) finished
// then nmethods with frames in continuation StackChunks could appear as not
// on stack because they were not marked as on stack in the undone concurrent marking.
Contributor:

Suggested change
// Especially in case (2) it is important that the active codecache cycle was not
// finished when executing the G1 undo operation. If it was (repeatedly) finished
// then nmethods with frames in continuation StackChunks could appear as not
// on stack because they were not marked as on stack in the undone concurrent marking.
// Especially in case (2) it is important to note the active codecache cycle did not
// finish (maybe "complete, i.e. did not mark through all live objects,"?) when executing the G1 concurrent undo operation. If it (repeatedly) did not finish/complete
// then nmethods with frames in continuation StackChunks could appear as not
// on stack because they were not marked as on stack in the undone concurrent markings.

// on stack because they were not marked as on stack in the undone concurrent marking.
//
// Also for (2) it is important to arm nmethod entry barriers even if no new
// code cache cycle is started. They are needed for a complete SATB. A full gc
Contributor:

Suggested change
// code cache cycle is started. They are needed for a complete SATB. A full gc
// code cache cycle is started. They are needed for a complete SATB (snapshot?). A full gc

CodeCache::on_gc_marking_cycle_start();
}
if (!full_gc) {
Contributor:

please use concurrent_mark_start (without the "ing"), or in_concurrent_start_mark[_pause].

@reinrich (Member Author):

Thanks for the feedback @tschatzl and @fisk !

I've changed the commenting again but didn't push. I thought it should be explained right at the definition of CodeCache::on_gc_marking_cycle_finish() why the code cache cycle must not be finished if the heap marking was not finished. I understood it immediately when @fisk reminded me, but I overlooked it when reading the code even though I kinda knew it.

I've tried to incorporate @tschatzl's last comments too.

Last but not least: the current version still fixes the issues after https://bugs.openjdk.org/browse/JDK-8300915

Thanks, Richard.

@fisk (Contributor) left a comment:

Great job @reinrich! This looks good. Thanks for fixing it.

@tschatzl (Contributor) left a comment:

Ship it.

@reinrich (Member Author):

Thanks again! I'll wait for test results and ship it after the weekend.

@reinrich (Member Author):

/integrate

@openjdk bot commented Jan 30, 2023

Going to push as commit 3db558b.
Since your change was applied there have been 77 commits pushed to the master branch:

  • cbefe1f: 8301163: jdk/internal/vm/Continuation/Fuzz.java increase COMPILATION_TIMEOUT for Linux ppc64le
  • c2ebd17: 6187113: DefaultListSelectionModel.removeIndexInterval(0, Integer.MAX_VALUE) fails
  • 4bd3f0a: 8301088: oopDesc::print_on should consistently use a trailing newline
  • 64b25ea: 8291569: Consider removing JNI checks for signals SIGPIPE and SIGXFSZ
  • 1ff4646: 8298612: Refactor archiving of java String objects
  • d4e9f5e: 8238170: BeanContextSupport remove and propertyChange can deadlock
  • 6475501: 8300208: Optimize Adler32 stub for AVX-512 targets.
  • af564e4: 8299844: RISC-V: Implement _onSpinWait intrinsic
  • 5dfc4ec: 8300140: ZipFile.isSignatureRelated returns true for files in META-INF subdirectories
  • 5c59de5: Merge
  • ... and 67 more: https://git.openjdk.org/jdk/compare/b2071f79d854f40df0f3bc2de6828fbcea4d325a...master

Your commit was automatically rebased without conflicts.

@openjdk bot added the integrated (Pull request has been integrated) label Jan 30, 2023
@openjdk bot closed this Jan 30, 2023
@openjdk bot removed the ready and rfr labels Jan 30, 2023
@openjdk bot commented Jan 30, 2023

@reinrich Pushed as commit 3db558b.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

reinrich deleted the 8300915_G1__incomplete_SATB_because_nmethod_entry_barriers_don_t_get_armed branch February 20, 2023 12:15