Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-40470] Jobs didn't finish on Solaris 11 Intel node #2701

Merged
merged 2 commits into from Jan 7, 2017

Conversation

csiden
Copy link
Contributor

@csiden csiden commented Jan 4, 2017

The fix is the addition of the two calls to to64() that should have been included in the original change. The rest is code to avoid having other bugs in this area hang jobs in a tight while(true) loop, which is what was happening here.

Summary

The call to getEnvironmentVariables() is hung in a call to readLine() in the while(true) loop, the loop is proceeding and incrementing addr, but it never encounters a null byte because the original addr that was passed in was garbage, and so the loop never terminates. It doesn't matter if your job has no steps, we still call getEnvironmentVariables() for all processes when the job finishes and we hang on the first call to readLine() if any process is a 32-bit process.

I can reliably reproduce this issue on Solaris 11 using a 64-bit JVM. I cannot reproduce it at all on Solaris 10 (even with a 64-bit JVM) or with a 32-bit JVM. Logging statements in this code indicate that while all the math seems right, when the actual pread() is done we are getting back the wrong data (as compared to other tools reading what should be the same address).

The problem is that when the JVM is 64-bit and the process being inspected is 32-bit we need to convert the envp entries we iterate over (which we get from the 32-bit process's address space) from 4-byte addresses to Java longs, the original change let Java do this conversion from int to long implicitly, which means the result was wrong for "negative numbers" (Java thinks these are signed numbers, not memory addresses). The fix uses the to64() function, which does the conversion correctly.

Using logging it looks like Solaris 10 uses a different part of the address space to store the envp entries than Solaris 11. By coincidence the Solaris 10 addresses are not negative numbers when treated as signed integers, so they don't see this issue, but the bug theoretically still exists on Solaris 10.

Testing

I confirmed that before this change I could reproduce the issue on Solaris 11, but not Solaris 10, with empty Freestyle Jobs, with 64-bit JVMs on the Solaris side (i.e. on Solaris 11 the job never terminates, on Solaris 10 it does). After this change both jobs terminate correctly with both 64-bit and 32-bit JVMs.

I confirmed on both Solaris 10 and 11 (with both 64-bit and 32-bit JVMs) that pipeline jobs with long running shell steps could still be terminated after this change (i.e. I haven't broken the fix to the original issue).

My original testing for #2507 had only been done on Solaris 10.

@daniel-beck
Copy link
Member

Maybe @jimklimov can take a look? We briefly talked about this on IRC.

@oleg-nenashev
Copy link
Member

In (JENKINS-40470)[https://issues.jenkins-ci.org/browse/JENKINS-40470] Minoru Sakamoto says "I tested PR2701 in the same way, there was no problem"

@oleg-nenashev oleg-nenashev added needs-more-reviews Complex change, which would benefit from more eyes regression-fix Pull request that fixes a regression in one of the previous Jenkins releases labels Jan 4, 2017
Copy link
Member

@oleg-nenashev oleg-nenashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@csiden Thanks for the fix. Looks good to me in general

* try reading before giving up. This avoids having readLine() loop
* over the entire process address space if this class has bugs.
*/
private static final int LINE_LENGTH_LIMIT = 10000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good practice is to make it a SystemProperty in order to allow managing it from the code. Maybe not required here since the limit is big enough for realistic use-cases

@tarzanek
Copy link

tarzanek commented Jan 4, 2017

:) go for LTS! ;)

@daniel-beck
Copy link
Member

@jimklimov
Copy link

Just to keep you posted here too, due to (presumably) jenkins-infra/jenkins-infra#665 I can not download an intact jenkins.war to test locally so far.

@daniel-beck
Copy link
Member

@jimklimov Other than building this PR yourself I can only offer https://dl.dropboxusercontent.com/u/29853/jenkins-PR2701.war as workaround.

To confirm it's the same file if you're paranoid: Check that file's MD5 checksum and compare to https://ci.jenkins.io/fingerprint/38fc481c0e00694f20a356a3185433c1/

@jimklimov
Copy link

@daniel-beck : Thanks, I'll try. Not exactly paranoid, but the network/jetty hiccups gave me some difficult time yesterday :)

@jimklimov
Copy link

Ok, got this WAR intact, deployed and fired up a build that worked for me before.
The test is that it should not hang indefinitely, but succeed or fail when expected - right?

@oleg-nenashev oleg-nenashev added ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback and removed needs-more-reviews Complex change, which would benefit from more eyes labels Jan 5, 2017
@jimklimov
Copy link

Ok, so my test on OpenIndiana took a while, and ended in a build failure (got an intermediate commit apparently with some breakage in sources), but the job did end.

@oleg-nenashev
Copy link
Member

@jimklimov So are you fine with the merge?

@oleg-nenashev
Copy link
Member

OK, I assume @jimklimov confirmed the fix in the last comment. Merging in order to get it in the next weekly

@oleg-nenashev oleg-nenashev merged commit 4df0c56 into jenkinsci:master Jan 7, 2017
oleg-nenashev added a commit that referenced this pull request Jan 7, 2017
@jimklimov
Copy link

Yes, LGTM. Sorry for delay.

@tarzanek
Copy link

tarzanek commented Jan 9, 2017

@oleg-nenashev will this be backported to LTS, please? Or we need to start some new bug? (what is the process?)

@oleg-nenashev
Copy link
Member

@tarzanek See my comment in the ticket. Commonly we do not backport non-urgent fixes if they are not soaked in Weekly releases for more than two weeks. But this one is urgent && the fix is confirmed by users, so maybe it can be ported into the nearest 2.32.2. WDYT @olivergondza ?

olivergondza pushed a commit that referenced this pull request Jan 11, 2017
* [JENKINS-40470] Jobs didn't finish on Solaris 11 Intel node

* make length limit a system property

(cherry picked from commit 4df0c56)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback regression-fix Pull request that fixes a regression in one of the previous Jenkins releases
Projects
None yet
5 participants