Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for a suspendOnRestore option in JDWP #591

Open
wants to merge 2 commits into
base: openj9
Choose a base branch
from

Conversation

mikezhang1234567890
Copy link
Contributor

To help debug issues with checkpoint/restore, add an option similar to "suspend=y" which suspends on VM startup except this new option will suspend when the VM is restored from a checkpoint.
To that end, this PR adds support for a VM_Restore JVMTI event to JDWP and JDB.

depends on eclipse-openj9/openj9#17252

@tajila @JasonFengJ9 can you please take a look?

@tajila
Copy link
Member

tajila commented Apr 24, 2023

@JasonFengJ9 Please review

@JasonFengJ9
Copy link
Member

jenkins copyright check

@JasonFengJ9
Copy link
Member

JasonFengJ9 commented Apr 24, 2023

15:10:43 src/jdk.jdi/share/classes/com/sun/jdi/event/VMRestoreEvent.java
15:10:43 IBM Portions Copyright should be used in this file

This is a new file, it can go to closed/src, and the Copyright is

/*
 * ===========================================================================
 * (c) Copyright IBM Corp. 2023, 2023 All Rights Reserved
 * ===========================================================================
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation.
 *
 * IBM designates this particular file as subject to the "Classpath" exception
 * as provided by IBM in the LICENSE file that accompanied this code.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, see <http://www.gnu.org/licenses/>.
 *
 * ===========================================================================
 */

In addition, there is a line ending check error

11:06:26  src/java.se/share/data/jdwp/jdwp.spec:30: trailing whitespace.

@JasonFengJ9
Copy link
Member

src/jdk.jdi/share/classes/com/sun/jdi/event/VMRestoreEvent.java

Only IBM Copyright is required if this is not a file originating from OpenJDK.
Any reason it can't be in the openj9 repo instead?

@mikezhang1234567890
Copy link
Contributor Author

Any reason it can't be in the openj9 repo instead?

Not sure if there's any reason it can't be in the OpenJ9 repo instead, but I put it in the extension repo since the jdk.jdi package needed changes to support the new VM restore event anyways.

I'll see if moving it to closed/src or the OpenJ9 repo makes more sense.

Copy link
Member

@JasonFengJ9 JasonFengJ9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a CRIU feature, the java code can be decorated w/ CRIU_SUPPORT, and the native code w/ J9VM_OPT_CRIU_SUPPORT, just created

src/jdk.jdwp.agent/share/native/libjdwp/debugInit.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/debugInit.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/debugInit.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/debugInit.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/eventHandler.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/eventHelper.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/eventHelper.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/eventHelper.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/transport.c Outdated Show resolved Hide resolved
@mikezhang1234567890
Copy link
Contributor Author

Rebased on top of d1ba4a1 and added ifdefs so that the new code only compiles with CRIU enabled.

Also addressed formatting review comments.

@JasonFengJ9 ready for another look.

Copy link
Member

@JasonFengJ9 JasonFengJ9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/jdk.jdwp.agent/share/native/libjdwp/util.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/util.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/eventHandler.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/debugInit.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/debugInit.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/eventHandler.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/eventHelper.c Outdated Show resolved Hide resolved
@mikezhang1234567890
Copy link
Contributor Author

Rebased on the merged commit and updated to address review comments.

src/java.se/share/data/jdwp/jdwp.spec Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/debugInit.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/eventHelper.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/eventHelper.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/util.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/util.c Outdated Show resolved Hide resolved
src/jdk.jdwp.agent/share/native/libjdwp/util.c Outdated Show resolved Hide resolved
@keithc-ca
Copy link
Member

It's safer to just include j9cfg.h in every file that uses any openj9 configuration option (even if some of those includes are redundant).

@mikezhang1234567890
Copy link
Contributor Author

j9cfg.h has been included in every file that uses the J9VM_OPT_CRIU_SUPPORT and I believe I addressed the review comments. @keithc-ca can you please take another look?

@keithc-ca
Copy link
Member

I'm not thrilled with this approach. Changing JDWP means no other debuggers will be able to use the new event.
Have you explored reusing transport_waitForConnection() at restore time?

@mikezhang1234567890
Copy link
Contributor Author

Changing JDWP means no other debuggers will be able to use the new event.

Another debugger not using JDWP should still be able to add their own callbacks for the VMRestore JVMTI event. The callbacks added to JDWP shouldn't affect that.

Have you explored reusing transport_waitForConnection() at restore time

If a debugger is already attached then transport_waitForConnection() will not pause the application.

@mikezhang1234567890
Copy link
Contributor Author

Do you have any other concerns about this or the openj9 PR @keithc-ca?

@keithc-ca
Copy link
Member

I still think the overall behavior should be as I described in #591 (comment).

Consider the expanded use-cases where more than one checkpoint is taken in the progression towards increasingly customized applications. The distinction of suspending before a checkpoint versus suspending after one ceases to be clear. Between the first and last checkpoints what would a user be expected to say or do?

I think the answer should be simply, "use the agent, with or without suspend=y, to debug the phase of interest".

@tajila
Copy link
Member

tajila commented Jun 29, 2023

Consider the expanded use-cases where more than one checkpoint is taken in the progression towards increasingly customized applications. The distinction of suspending before a checkpoint versus suspending after one ceases to be clear.

I think its important to recognize that while in theory there can be infinite checkpoint and restore phases, there can only be one VM start up phase and suspend is defined within that context.

In practical use cases there will only be one restore because of the limitations imposed by security APIs if multiple restores are permitted.

I think the answer should be simply, "use the agent, with or without suspend=y, to debug the phase of interest".

Im not opposed to that behaviour as a starting point, simply because this feature hasn't been in the field long enough to understand usage patterns. That being said, I can see a need for more flexibility beyond what you suggested. So perhaps something like the following could work: suspendOnRestore defaults to suspend, this provides the behaviour you suggested and the user can be oblivious to the existence of this option if they are happy with the behaviour. For those who need more control, they can set suspendOnRestore to toggle the restore behaviour.

@keithc-ca
Copy link
Member

I have been assuming that debugging settings from one phase are not implicitly carried over into subsequent phases: users must explicitly specify that agent for any phase they want to debug. Is my assumption incorrect or inappropriate?

I can see a need for more flexibility beyond what you suggested.

Perhaps an example that requires suspendOnRestore (i.e. the existing suspend option is inadequate or is not appropriate) would help me understand where you see this going.

@mikezhang1234567890
Copy link
Contributor Author

I have been assuming that debugging settings from one phase are not implicitly carried over into subsequent phases: users must explicitly specify that agent for any phase they want to debug.

The options for the JDWP agent do currently persist through checkpoint and restore.

@tajila
Copy link
Member

tajila commented Jun 29, 2023

I have been assuming that debugging settings from one phase are not implicitly carried over into subsequent phases: users must explicitly specify that agent for any phase they want to debug. Is my assumption incorrect or inappropriate?

JDWP options are only specified at startup. We are unable to dynamically enable JDWP after restore.

Perhaps an example that requires suspendOnRestore (i.e. the existing suspend option is inadequate or is not appropriate) would help me understand where you see this going.

An example would be, you want to suspend on start, but not restore.

@tajila
Copy link
Member

tajila commented Jun 29, 2023

JDWP options are only specified at startup. We are unable to dynamically enable JDWP after restore.

We may have support for this in the future, but this is a med -> long term goal.

@keithc-ca
Copy link
Member

JDWP options are only specified at startup. We are unable to dynamically enable JDWP after restore.

This seems to me the fundamental issue we should be attacking.

@tajila
Copy link
Member

tajila commented Jul 5, 2023

I have been assuming that debugging settings from one phase are not implicitly carried over into subsequent phases: users must explicitly specify that agent for any phase they want to debug. Is my assumption incorrect or inappropriate?

Recall, JVM restore is not a new start, it is a continuation of an existing process. So it is correct that options specified on startup should persist on restore. This is true for all JVM options, not just JDWP options.

We have the ability to specify options on restore. However, if we allowed suspend on restore, we would be redefining the meaning of suspend, because suspend is defined in the context of VM startup. This is the fundamental issue with what you are proposing.

However, if we had suspendOnRestore it is understood that this option is defined in the context of restore. For example:

Phases -> effect
//startup options: suspend=y,suspendOnRestore=y
Startup -> VM suspends
1st Restore -> VM suspends
//restore options: suspendOnRestore=n
2nd Restore -> VM does not suspend
//restore options: suspendOnRestore=y
3rd Restore -> VM suspends

JDWP options are only specified at startup. We are unable to dynamically enable JDWP after restore.

This seems to me the fundamental issue we should be attacking.

When we are able to dynamically enable debug capabilities, we will be able to do:

Phases -> effect
Startup -> VM does not suspend
//restore options: suspendOnRestore=y
1st Restore -> VM suspends
//restore options: suspendOnRestore=n
2nd Restore -> VM does not suspend
//restore options: suspendOnRestore=y
3rd Restore -> VM suspends

So I dont see it as an issue, it is really a point in time limitation.

One could make the arguement, that suspendOnRestore should only be a restore option, that is, you cant specify it on startup but you can on restore. I wouldn't be opposed to that. But I think there needs to be a clear distiction between, suspend on VM start, and suspend on restore.

@keithc-ca
Copy link
Member

I repeat my concerns about this approach.

In order to debug any phase in the evolution of an application through one or more checkpoints, the JDWP agent must be enabled in the very first phase (and based on comments above, every phase). The producer of an application must enable the agent unless they are prepared to accept the consequence that no customers will be able to debug their use of the product.

Enabling the JDWP agent has associated performance implications: all subsequent phases will pay those costs even when debugging is not needed. Using suspendOnRestore=y (or making it the default) would mean that a debugger must be used to attach to the application to instruct it to proceed - not a very user-friendly situation.

It may be that the VM needs to provide suspend and restore events, but to my mind this doesn't represent the way they should be handled.

@mikezhang1234567890
Copy link
Contributor Author

the JDWP agent must be enabled in the very first phase (and based on comments above, every phase)

Currently, even without CRIU, the JDWP agent must be enabled at start up due to the capabilities needed. Regardless of CRIU, using JDWP for debug means you're taking the performance hit for the entire run of an application.

The producer of an application must enable the agent unless they are prepared to accept the consequence that no customers will be able to debug their use of the product.

Enabling JDWP is the user's decision at run time, so consequences are determined by the user each run.

@JasonFengJ9
Copy link
Member

Added a commit CRIU supports Java debugger during checkpoint and restore

  • Added an option suspendOnRestore which suspends when the VM is restored
    from a checkpoint;
  • The existing option "suspend=y" also suspends the VM at CRIU restore.

@JasonFengJ9
Copy link
Member

@tajila @keithc-ca could you please review this PR to enable CRIU supports Java debugger during checkpoint and restore?

Note: I added a commit CRIU supports Java debugger during checkpoint and restore at the top of initial changes.

mikezhang1234567890 and others added 2 commits May 22, 2024 08:34
To help debug issues with checkpoint/restore, add an option similar to
"suspend=y" which suspends on VM startup except this new option will suspend
when the VM is restored from a checkpoint.
To that end, this PR adds support for a VM_Restore JVMTI event to JDWP and JDB.

Signed-off-by: Mike Zhang <mike.h.zhang@ibm.com>
Added an option suspendOnRestore which suspends when the VM
is restored from a checkpoint;
The existing option "suspend=y" also suspends the VM at restore.

Signed-off-by: Jason Feng <fengj@ca.ibm.com>
@JasonFengJ9
Copy link
Member

JDK21 benchmark results with this PR (default settings, no JDWP):

Startup time results for localhost/ol-instanton-test-pingperf-restore-afterappstart:jdk21_20240529_152441 app pingperf restoreOpts None
Bechmark metrics avg min max stdDev maxVar confInt samples
StartupTime 531 503 569 13 13.1% 0.47% 80
CRIUTime 87 87 88 0 1.8% 0.06% 68
Application 153 148 157 2 6.1% 0.26% 80
FirstResponse 632 600 672 14 12.0% 0.42% 80
ContainerFootprint (KiB) 117617 117043 118067 233 0.9% 0.04% 80
RSS (KiB) 149039 148828 149244 102 0.3% 0.01% 80
Peak RSS (KiB) 147954 146064 151600 942 3.8% 0.12% 80
Throughput results for localhost/ol-instanton-test-pingperf-restore-afterappstart:jdk21_20240529_152441 app pingperf restoreOpts None
Bechmark metrics avg min max stdDev maxVar confInt samples
Throughput 23677 23118 24147 337 4.5% 0.82% 10
ContainerFootprint (KiB) 896266 868966 914432 13487 5.2% 0.87% 10
RSS (KiB) 186825 181600 193412 3930 6.5% 1.22% 10
Peak RSS (KiB) 428391 415568 458324 11971 10.3% 1.62% 10
Startup time results for localhost/ol-instanton-test-pingperf-restore-afterappstart:jdk21_20240529_152833 app pingperf restoreOpts None
Bechmark metrics avg min max stdDev maxVar confInt samples
StartupTime 533 503 558 13 10.9% 0.47% 80
CRIUTime 88 88 89 0 1.5% 0.06% 67
Application 152 151 155 1 2.6% 0.10% 80
FirstResponse 634 603 658 14 9.1% 0.40% 80
ContainerFootprint (KiB) 114707 114176 115098 189 0.8% 0.03% 79
RSS (KiB) 146616 146336 147072 132 0.5% 0.02% 80
Peak RSS (KiB) 145651 143568 147788 936 2.9% 0.12% 80
Throughput results for localhost/ol-instanton-test-pingperf-restore-afterappstart:jdk21_20240529_152833 app pingperf restoreOpts None
Bechmark metrics avg min max stdDev maxVar confInt samples
Throughput 23706 22859 24058 362 5.2% 0.89% 10
ContainerFootprint (KiB) 900045 869888 912282 12294 4.9% 0.79% 10
RSS (KiB) 182587 179436 187756 2801 4.6% 0.89% 10
Peak RSS (KiB) 430518 403116 453400 18868 12.5% 2.94% 8

This benchmark result shows that this PR has no noticeable perf impact when the JDWP is not enabled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants