-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8324668: JDWP process management needs more efficient file descriptor handling #17588
Conversation
👋 Welcome back jpai! A progress list of the required criteria for merging this PR into |
Webrevs
|
The PR description appears in every email sent for the PR. For PRs with extensive descriptions, perhaps they can be included as a separate comment. |
The JVM can be launched with the jdwp agent, like When that option is set, jdwp during initialization will then launch that command. The implementation of launching the command (on *nix) involves Until recently, on macOS, for the JVM we used to limit the max file descriptors to 10240. The jdwp code above would then iterate over these 10240 file descriptors (I'm leaving out the part where the code skips the standard input/output/error file descriptors) and close them one at a time (even if there weren't that many). This apparently would finish in reasonable amount of time (likely slow, but wasn't noticed to cause any issues). In https://bugs.openjdk.org/browse/JDK-8300088 we stopped setting that 10240 limit on macOS for the JVM. After that change, on macos, the max file descriptors for JVM turns out to be 2147483647. When jdwp then launches the child process from the JVM, it then has to now iterate almost 2147483647 times and close each of the file descriptors in that logic. This now became extremely slow and noticable and caused 3 tests (https://bugs.openjdk.org/browse/JDK-8324578) which were using the In theory, this entire code which deals with launching a child process from the JVM is what the native implementation of Java's The Relatively minor changes have also been introduced in the copied over code - changes like:
I have tested this change with and without the changes done in https://bugs.openjdk.org/browse/JDK-8300088 and the My knowledge of C is elementary, so there might be a few more things that need to be considered in this change. Looking forward to any such changes that might be necessary. For example, I'm not sure if Finally, the libjava code for launching child process has some more specifics, which I didn't want to introduce/copy over here. I decided to focus solely on the file descriptors issue in this PR. I haven't yet reviewed whether jdwp would benefit from doing those additional things that libjava does when launching the child process. |
Hello Roger,
I had forgotten this. I've updated the description to be much shorter now and added a separate comment for the details. |
The failures in Github actions jobs are related to this change:
I will address them soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered using fcntl(d, F_SETFD, 1)
instead of the fancy closeDescriptors()
?
I haven't tested it myself, but per the man close
page:
Most of the descriptors can be rearranged with dup2(2) or deleted with close() before the execve is attempted, but if some of these descriptors will still be needed if the execve fails, it is necessary to arrange for them to be closed if the execve succeeds. For this reason, the call “fcntl(d, F_SETFD, 1)” is provided, which arranges that a descriptor will be closed after a successful execve;
/* find max allowed file descriptors for a process | ||
* and assume all were opened for the parent process and | ||
* copied over to this child process. we close them all */ | ||
const rlim_t max_fd = sysconf(_SC_OPEN_MAX); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use int
, like the original code, instead of rlim_t
- as per man page close()
expects int
, not rlim_t
, ex:
const int max_fd = (int)sysconf(_SC_OPEN_MAX);
JDI_ASSERT(max_fd != -1);
int fd;
/* leave out standard input/output/error file descriptors */
for (fd = STDERR_FILENO + 1; fd < max_fd; fd++) {
(void)close(fd);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello Gerard, my understanding is that the limit value configured may exceed the int range. I wanted to avoid the overflow by casting it to int in such cases. I had noticed close() takes an int, but I couldn't think of any other way of avoiding the overflow at this place.
In the JVM parent process we do however limit it to INT_MAX. So maybe I should assume that it will be an int cast it to int here, like you suggest, and add a code comment explaining this? Does that sound OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is fine, but in that case we should also do:
JDI_ASSERT(max_fd <= INT_MAX);
in case sysconf(_SC_OPEN_MAX)
returns value greater than INT_MAX
, since close()
only accepts int
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I've updated the PR to add this assert. Additionally I've also slightly changed the implementation in the newly introduced closeDescriptors() method to ensure that we don't pass values to close() that exceed INT_MAX.
close(fd); | ||
} | ||
|
||
closedir(dp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be:
(void)close(fd);
and
(void)closedir(dp);
to show that we're ignoring the return value?
Since this area is pretty mysterious to deal with, should we consider checking whether closedir()
actually returns am error, and at least be verbal and print a warning, as opposed to silently ignoring it?
Printing a warning, might help us diagnose any issue in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be:
(void)close(fd);
and
(void)closedir(dp);
to show that we're ignoring the return value?
Done. I've updated the PR to cast these calls to void.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this area is pretty mysterious to deal with, should we consider checking whether closedir() actually returns am error, and at least be verbal and print a warning, as opposed to silently ignoring it?
Printing a warning, might help us diagnose any issue in the future?
I think printing a warning if closedir() fails may not be useful and may be misleading too. But your point about logging a warning in this area sounds reasonable. I think what we could log as a warning is if we fallback to the slow sequential close() approach if/when this optimal closeDescriptor() implementation fails. I've updated the PR with such a log message. Do you think that would be enough?
Looks good, thank you for debugging this and providing a fix! I did have some questions and feedback, nothing major, just making sure we cover all our bases. |
I hadn't considered that. Now that you mentioned it, I had a look at what |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this on @jaikiran and for determining the underlying problem!
Seems a reasonable approach to copy. A few nits and a suggestion.
Thanks.
@@ -44,6 +48,104 @@ static char *skipNonWhitespace(char *p) { | |||
return p; | |||
} | |||
|
|||
int | |||
isAsciiDigit(char c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The standard isdigit
function from ctype.h
does this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
// closes every file descriptor that is listed as a directory | ||
// entry in "/proc/self/fd" (or its equivalent). standard | ||
// input/output/error file descriptors will not be closed | ||
// by this function. this function returns 0 on failure | ||
// and 1 on success |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pre-existing nit: Please start sentences with a capital and end with a period. Thanks. This applies through the copied code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've fixed the code comments to follow this suggestion. I've left single sentence comments as is. If there's a preference to change them too, do let me know and I'll update those too.
snprintf(aix_fd_dir, 32, "/proc/%d/fd", getpid()); | ||
#endif | ||
|
||
if ((dp = opendir(FD_DIR)) == NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to log the failure here, where you can actually report the reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Thank you David for those inputs, I've updated the PR to implement those suggestions. CI tests continue to pass with these latest changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, the changes make the code better already, I just want to make sure we take this opportunity to polish this code really well...
/* find max allowed file descriptors for a process | ||
* and assume all were opened for the parent process and | ||
* copied over to this child process. we close them all */ | ||
const rlim_t max_fd = sysconf(_SC_OPEN_MAX); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is fine, but in that case we should also do:
JDI_ASSERT(max_fd <= INT_MAX);
in case sysconf(_SC_OPEN_MAX)
returns value greater than INT_MAX
, since close()
only accepts int
.
" child process optimally, falling back to closing" | ||
" %d file descriptors sequentially", (max_fd - i + 1))); | ||
for (; i < max_fd; i++) { | ||
(void)close(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here as well, I'd really like to check for close()
for errors here and report any.
And if we find errors here, do we really want to proceed knowing that something went wrong already?
int fd; | ||
if (isdigit(dirp->d_name[0]) && | ||
(fd = strtol(dirp->d_name, NULL, 10)) >= from_fd) { | ||
(void)close(fd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd really, really like to check close()
for errors and report any.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. If we are seeing issues with close(), probably it is best to fail. That way we'll get a bug report and can figure out exactly why it is failing rather than blindly proceeding without having properly closed all fds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to check close for errors then we will have to know for certain we are only trying to close open fd's, and we will have to deal with EINTR. I would think this is a "best effort" to close open FD's and not something we have to care about in detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to check close for errors then we will have to know for certain we are only trying to close open fd's, and we will have to deal with EINTR. I would think this is a "best effort" to close open FD's and not something we have to care about in detail.
Right, EINTR would need to be ignored as it would be wrong to restart. I'm scratching my head as to why this code would even do with EIO or if it even possible here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at man close, EIO gets returned if:
[EIO] A previously-uncommitted write(2) encountered an input/output error.
If I understand this launch=<cmd>
option correctly for the jdwp agent, then the arbitrary command gets launched very early in the lifecycle of the JVM. So in theory, I think there shouldn't be too many file descriptors that the JVM has opened so far and the chances of it having started writing to some of those opened file descriptors (other than standard out/err) perhaps is slim? But that's just a guess. Probably not a strong one too, given that if the JVM was launched with any other agent along with jdwp, then it's unknown if the other agent would have opened any file descriptors for write.
My understanding of these close() calls after fork() in this code, matches David's - I think it's a best effort attempt. So perhaps ignoring such failures, like we do now, is OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignoring errors just doesn't seem like a good strategy (especially in light of so many mysterious timeouts that are plaguing MACH5 tests mostly on macOS)
This fix right here that we are doing doesn't have to deal with making this code more robust, as long as we follow up in another issue. For that then, I filed JDK-8324979, and that frees us up to move on with this fix.
if (closeDescriptors() == 0) { /* failed, close the old way */ | ||
/* Find max allowed file descriptors for a process | ||
* and assume all were opened for the parent process and | ||
* copied over to this child process. We close them all. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm somewhat leaning towards this just being a fatal error. Why would it ever fail? Do we think the version in childproc.c that also does this is doing so for a good reason, or perhaps just like jdwp it used to use the slow version and then kept it around as a backup when the faster version was introduced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are sliding into " the perfect is the enemy of the good" here. If you want to investigate the resilience of closing FD's then I suggest you look at that in a different PR and let Jai integrate this optimization to avoid the timeouts that were seen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello Chris, I had a look at the ProcessImpl implementation to see why that fallback is in place. Looking at the ProcessImpl implementation, it has "launch mechanisms" when launching a child process. It is configurable through a system property.
By default, when not configured, the ProcessImpl implementation uses posix_spawn() instead of fork(). So in its current form on macos (given the context of this PR), when the ProcessBuilder java API gets used to start child processes, the ProcessImpl's implementation ends up using posix_spawn() and thus this entire file descriptor closing in childproc.c never gets called - neither the closeDescriptors() nor the fallback of closing OPEN_MAX number of descriptors. (I won't go into implementation details of what ProcessImpl does when posix_spawn() is used, because after looking at it, I have more questions than answers and I think discussing those here is going to cause distraction).
As for this specific if block, the only time we enter here is if we can't opendir()
the platform specific directory containing the open file descriptors. Whether or not it realistically happens, I'm unsure. I think having this fallback is perhaps a better option than just failing in such cases - this fallback when activated will certainly "slowdown" the launch of the arbitrary command but when no timeouts are in place, the arbritary command would indeed get launched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updates look good to me. Thanks.
I filed https://bugs.openjdk.org/browse/JDK-8324979 as a followup to make sure that we can get this code as robust as possible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fix looks very good, thank you @jaikiran !
LOG_MISC(("warning: failed to close file descriptors of" | ||
" child process optimally, falling back to closing" | ||
" %d file descriptors sequentially", (max_fd - i + 1))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No one is ever going to see this. This logging is off by default, and when turned on goes to the logfile, not stdout or stderr. Perhaps ERROR_MESSAGE() would be better, which goes to stderr and also gives the option of aborting if the debug agent is launched with errorexit=y.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Chris for pointing to ERROR_MESSAGE(). I've updated the PR to use that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks for taking this on.
@jaikiran This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 63 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
Thank you Chris, David and Gerard for the inputs and reviews. I'll integrate this in a day or two once I'm able to have a full CI run done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of minor nits to fix before integrating but otherwise looks good. Thanks again for taking this on.
#include "sys.h" | ||
#include "util.h" | ||
#include "error_messages.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: to maintain include sort order this should have gone where log_messages.h
was removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello David, I had actually first put it in that line you noted, but that then lead to a compilation error:
In file included from /jdk/src/jdk.jdwp.agent/unix/native/libjdwp/exec_md.c:33:
/jdk/src/jdk.jdwp.agent/share/native/libjdwp/error_messages.h:35:20: error: unknown type name 'FILE'
void print_message(FILE *fp, const char *prefix, const char *suffix,
^
/jdk/src/jdk.jdwp.agent/share/native/libjdwp/error_messages.h:41:29: error: a parameter list without types is only allowed in a function definition
const char * jvmtiErrorText(jvmtiError);
^
/jdk/src/jdk.jdwp.agent/share/native/libjdwp/error_messages.h:43:28: error: a parameter list without types is only allowed in a function definition
const char * jdwpErrorText(jdwpError);
^
3 errors generated.
I think the reason for those compilation errors is that error_messages.h
and a few other header files are not explicitly listing their necessary includes and are relying on transitive includes. So I decided to take the easy route out, by adding the error_messages.h
after the util.h
to rely on the transitive includes to get past that compilation error.
I think to properly fix that compile error, this following change (which I tested locally and works) would be needed:
diff --git a/src/jdk.jdwp.agent/share/native/libjdwp/JDWP.h b/src/jdk.jdwp.agent/share/native/libjdwp/JDWP.h
index 92a9f457e8a..2683c6791a1 100644
--- a/src/jdk.jdwp.agent/share/native/libjdwp/JDWP.h
+++ b/src/jdk.jdwp.agent/share/native/libjdwp/JDWP.h
@@ -26,6 +26,8 @@
#ifndef JDWP_JDWP_H
#define JDWP_JDWP_H
+#include "jni_md.h"
+#include "jvmti.h"
#include "JDWPCommands.h"
/*
diff --git a/src/jdk.jdwp.agent/share/native/libjdwp/error_messages.h b/src/jdk.jdwp.agent/share/native/libjdwp/error_messages.h
index 4126b76f226..810ab384f1a 100644
--- a/src/jdk.jdwp.agent/share/native/libjdwp/error_messages.h
+++ b/src/jdk.jdwp.agent/share/native/libjdwp/error_messages.h
@@ -26,6 +26,9 @@
#ifndef JDWP_ERROR_MESSAGES_H
#define JDWP_ERROR_MESSAGES_H
+#include <stdio.h>
+#include "JDWP.h"
+
/* It is assumed that ALL strings are UTF-8 safe on entry */
#define TTY_MESSAGE(args) ( tty_message args )
#define ERROR_MESSAGE(args) ( \
diff --git a/src/jdk.jdwp.agent/unix/native/libjdwp/exec_md.c b/src/jdk.jdwp.agent/unix/native/libjdwp/exec_md.c
index b43d4de80fc..cf0d00db192 100644
--- a/src/jdk.jdwp.agent/unix/native/libjdwp/exec_md.c
+++ b/src/jdk.jdwp.agent/unix/native/libjdwp/exec_md.c
@@ -30,9 +30,9 @@
#include <unistd.h>
#include <string.h>
#include <ctype.h>
+#include "error_messages.h"
#include "sys.h"
#include "util.h"
-#include "error_messages.h"
static char *skipWhitespace(char *p) {
while ((*p != '\0') && isspace(*p)) {
Do you think this should be fixed? I can file a JBS issue for it and address it as a separate PR.
(Edit: a more accurate patch/fix for missing includes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please file a JBS issue, but no need for you to address it. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created https://bugs.openjdk.org/browse/JDK-8325094 to track this.
ERROR_MESSAGE(("failed to open dir %s while determining" | ||
" file descriptors to close for process %d", FD_DIR, getpid())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: indent of second line is now off - please align "
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. PR has been update to fix this.
ERROR_MESSAGE(("failed to close file descriptors of" | ||
" child process optimally, falling back to closing" | ||
" %d file descriptors sequentially", (max_fd - i + 1))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: please realign lines on "
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
tier1, tier2 and tier3 testing of the latest state of this PR has passed both with and without the changes that were done in https://bugs.openjdk.org/browse/JDK-8300088. |
/integrate |
Going to push as commit a663248.
Your commit was automatically rebased without conflicts. |
Can I please get a review of this change which proposes to address https://bugs.openjdk.org/browse/JDK-8324668?
This change proposes to fix the issue in jdwp where when launching a child process (for the
launch=<cmd>
option), it iterates over an extremely large number of file descriptors to close each one of those. Details about the issue and the proposed fixed are added in a subsequent comment in this PR #17588 (comment). tier1, tier2 and tier3 testing is currently in progress with this change.Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/17588/head:pull/17588
$ git checkout pull/17588
Update a local copy of the PR:
$ git checkout pull/17588
$ git pull https://git.openjdk.org/jdk.git pull/17588/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 17588
View PR using the GUI difftool:
$ git pr show -t 17588
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/17588.diff
Webrev
Link to Webrev Comment