New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8280684: JfrRecorderService failes with guarantee(num_written > 0) when no space left on device. #7227
Conversation
…en no space left on device.
|
@tkiriyama The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
…en no space left on device.
…en no space left on device.
Webrevs
|
/label add hotspot-jfr |
@dholmes-ora |
JFR team need to review this. |
Hi, JFR team Could somebody please review this fix for 8280684? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Takuya, thanks for your contribution.
@@ -100,7 +100,7 @@ class JfrJavaSupport : public AllStatic { | |||
static bool set_handler(jobject clazz, jobject handler, TRAPS); | |||
|
|||
// critical | |||
static void abort(jstring errorMsg, TRAPS); | |||
static void abort(jstring errorMsg, TRAPS, bool dump_core=true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure this is necessary. The existing core dump logic already handles the case where a core file cannot be generated due to disk full.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your review.
Whether or not hotspot generate a core file is determined by the argument of vm_abort(bool dump_core). If the argument is "true", vm_abort(bool dump_core) calls os::abort(bool dump_core) to generate a core file.
See the following code:
jdk/src/hotspot/share/runtime/java.cpp
Line 625 in 3c160ab
void vm_abort(bool dump_core) { |
I think JfrJavaSupport::abort() should pass "false" as an argument to vm_abort(bool dump_core).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. My point was that the os won't be able to create a core file if there is no available space.
But this is indeed more succinct, if we don't want to create a core categorically from this location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an observation but the filesystem that is full, and the filesystem where a core would be written, need not be the same file system. That said, a core dump in this case seems unwarranted.
System.out.printf(" Wrote large file in %d ns (%d ms) %n", t1 - t0, TimeUnit.NANOSECONDS.toMillis(t1 - t0)); | ||
raf.close(); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate the effort, but we can't have a test that intentionally provokes a disk full situation. Instead, the updated error message will have to be manually verified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use @run main/manual
in TestJFRDiskFull.java. I think this label means manually test.
I mannually confirmed this test to pass with jtreg after this fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My apologies, I missed the @run main/manual decoration. I don't think we have any JFR tests that use it.
If you can ensure this test is excluded for automatic runs, then perhaps...but then I don't know who will get to run it, so the value of the test is questionable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manual tests are excluded if the jtreg test run specifies to run automatic tests only (as we do in our CI). So this really only serves as a validation of the fix, with no real expectation that anyone will necessarily every run it again. Even as a locally run test, filling the disk can easily lead to unexpected problems for other processes - including the swap/paging file on Windows - so this is a risky test to run.
/* | ||
* @test | ||
* @bug 8280684 | ||
* @summary JfrRecorderService failes with guarantee(num_written > 0) when no space left on device. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: failes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually "summary" is meant to describe what the test does, not what the original bug was about
Takuya, can I suggest keeping your proposed changes but excluding the test? |
…en no space left on device.
OK. This test is surely risky. I remove this test. |
ThreadInVMfromNative transition(jt); | ||
JfrJavaSupport::abort(JfrJavaSupport::new_string(msg, jt), jt, false); | ||
} | ||
else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The else block can be removed. Just put the guarantee inline with the other code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much. You're right. I removed the else block.
…en no space left on device.
log_error(jfr, system)("%s", msg); | ||
JavaThread* jt = JavaThread::current(); | ||
ThreadInVMfromNative transition(jt); | ||
JfrJavaSupport::abort(JfrJavaSupport::new_string(msg, jt), jt, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi again Takuya, I'm sorry, but I should have noticed this earlier:
I now see that the code needs to allocate a Java string oop to conform to the existing abort function signature, which caters to invocations from Java. Then abort() immediately strips out the c-string from the oop. To be correct, also headers for logging/log.hpp and runtime/thread.inline.hpp should need be included.
I believe we can simplify this by updating the abort() signature so that we don't need to drag in those extra dependencies. Please see my following comment where I suggest a way to do this.
Thanks for your patience
Markus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your valuable comments. I agree with you. I corrected this fix in accordance with your suggestions.
|
…en no space left on device.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thank you.
@tkiriyama This change now passes all automated pre-integration checks. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 335 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@mgronlun) but any other Committer may sponsor as well.
|
I hope this change is integrated. |
/integrate |
@tkiriyama |
/sponsor |
Going to push as commit 9471f24.
Your commit was automatically rebased without conflicts. |
@mgronlun @tkiriyama Pushed as commit 9471f24. |
I think JFR should report an error message and jvm should shut down safely instead of gurantee failure.
For instance, jdk.jfr.internal.Repository#newChunk() reports an appropriate message and stops jvm as below
by using JfrJavaSupport::abort().
[0.673s][error][jfr] Could not create chunk in repository /tmp/2022_01_12_22_32_42_18030, class java.io.IOException: Unable to create JFR repository directory using base location (/tmp)
[0.673s][error][jfr,system] Could not create chunk in repository /tmp/2022_01_12_22_32_42_18030, class java.io.IOException: Unable to create JFR repository directory using base location (/tmp)
[0.673s][error][jfr,system] An irrecoverable error in Jfr. Shutting down VM...
I modified StreamWriterHost not to call guarantee failure but to call JfrJavaSupport::abort().
I added a argument to JfrJavaSupport::abort() which tells os::abort() not to put out core
because there is no space on device.
Could you please review the fix?
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/7227/head:pull/7227
$ git checkout pull/7227
Update a local copy of the PR:
$ git checkout pull/7227
$ git pull https://git.openjdk.java.net/jdk pull/7227/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 7227
View PR using the GUI difftool:
$ git pr show -t 7227
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/7227.diff