New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JENKINS-40470] Jobs didn't finish on Solaris 11 Intel node #2701
Conversation
Maybe @jimklimov can take a look? We briefly talked about this on IRC. |
In (JENKINS-40470)[https://issues.jenkins-ci.org/browse/JENKINS-40470] Minoru Sakamoto says "I tested PR2701 in the same way, there was no problem" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@csiden Thanks for the fix. Looks good to me in general
* try reading before giving up. This avoids having readLine() loop | ||
* over the entire process address space if this class has bugs. | ||
*/ | ||
private static final int LINE_LENGTH_LIMIT = 10000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good practice is to make it a SystemProperty in order to allow managing it from the code. Maybe not required here since the limit is big enough for realistic use-cases
:) go for LTS! ;) |
Needs documenting on https://wiki.jenkins-ci.org/display/JENKINS/Features+controlled+by+system+properties after merge. |
Just to keep you posted here too, due to (presumably) jenkins-infra/jenkins-infra#665 I can not download an intact |
@jimklimov Other than building this PR yourself I can only offer https://dl.dropboxusercontent.com/u/29853/jenkins-PR2701.war as workaround. To confirm it's the same file if you're paranoid: Check that file's MD5 checksum and compare to https://ci.jenkins.io/fingerprint/38fc481c0e00694f20a356a3185433c1/ |
@daniel-beck : Thanks, I'll try. Not exactly paranoid, but the network/jetty hiccups gave me some difficult time yesterday :) |
Ok, got this WAR intact, deployed and fired up a build that worked for me before. |
Ok, so my test on OpenIndiana took a while, and ended in a build failure (got an intermediate commit apparently with some breakage in sources), but the job did end. |
@jimklimov So are you fine with the merge? |
OK, I assume @jimklimov confirmed the fix in the last comment. Merging in order to get it in the next weekly |
Yes, LGTM. Sorry for delay. |
@oleg-nenashev will this be backported to LTS, please? Or we need to start some new bug? (what is the process?) |
@tarzanek See my comment in the ticket. Commonly we do not backport non-urgent fixes if they are not soaked in Weekly releases for more than two weeks. But this one is urgent && the fix is confirmed by users, so maybe it can be ported into the nearest 2.32.2. WDYT @olivergondza ? |
* [JENKINS-40470] Jobs didn't finish on Solaris 11 Intel node * make length limit a system property (cherry picked from commit 4df0c56)
The fix is the addition of the two calls to
to64()
that should have been included in the original change. The rest is code to avoid having other bugs in this area hang jobs in a tightwhile(true)
loop, which is what was happening here.Summary
The call to
getEnvironmentVariables()
is hung in a call toreadLine()
in thewhile(true)
loop, the loop is proceeding and incrementingaddr
, but it never encounters a null byte because the originaladdr
that was passed in was garbage, and so the loop never terminates. It doesn't matter if your job has no steps, we still callgetEnvironmentVariables()
for all processes when the job finishes and we hang on the first call toreadLine()
if any process is a 32-bit process.I can reliably reproduce this issue on Solaris 11 using a 64-bit JVM. I cannot reproduce it at all on Solaris 10 (even with a 64-bit JVM) or with a 32-bit JVM. Logging statements in this code indicate that while all the math seems right, when the actual
pread()
is done we are getting back the wrong data (as compared to other tools reading what should be the same address).The problem is that when the JVM is 64-bit and the process being inspected is 32-bit we need to convert the
envp
entries we iterate over (which we get from the 32-bit process's address space) from 4-byte addresses to Java longs, the original change let Java do this conversion fromint
tolong
implicitly, which means the result was wrong for "negative numbers" (Java thinks these are signed numbers, not memory addresses). The fix uses theto64()
function, which does the conversion correctly.Using logging it looks like Solaris 10 uses a different part of the address space to store the
envp
entries than Solaris 11. By coincidence the Solaris 10 addresses are not negative numbers when treated as signed integers, so they don't see this issue, but the bug theoretically still exists on Solaris 10.Testing
I confirmed that before this change I could reproduce the issue on Solaris 11, but not Solaris 10, with empty Freestyle Jobs, with 64-bit JVMs on the Solaris side (i.e. on Solaris 11 the job never terminates, on Solaris 10 it does). After this change both jobs terminate correctly with both 64-bit and 32-bit JVMs.
I confirmed on both Solaris 10 and 11 (with both 64-bit and 32-bit JVMs) that pipeline jobs with long running shell steps could still be terminated after this change (i.e. I haven't broken the fix to the original issue).
My original testing for #2507 had only been done on Solaris 10.