[JENKINS-52847] Avoid using -p to support basic implementations of ps #77
@@ -140,7 +140,7 @@ public String getScript() {
         FilePath resultFile = c.getResultFile(ws);
         FilePath controlDir = c.controlDir(ws);
         if (capturingOutput) {
-            cmd = String.format("pid=$$; { while ps -o pid -p $pid | grep -q $pid && [ -d '%s' -a \\! -f '%s' ]; do touch '%s'; sleep 3; done } & jsc=%s; %s=$jsc '%s' > '%s' 2> '%s'; echo $? > '%s.tmp'; mv '%s.tmp' '%s'; wait",
+            cmd = String.format("pid=$$; { while ps -o pid | grep -q \"^\\s*$pid$\" && [ -d '%s' -a \\! -f '%s' ]; do touch '%s'; sleep 3; done } & jsc=%s; %s=$jsc '%s' > '%s' 2> '%s'; echo $? > '%s.tmp'; mv '%s.tmp' '%s'; wait",
To get this to work in csh/tcsh we'd need to use \"^\\s*$pid\"\'$\' (which becomes "^\s*$pid"'$' after removing Java's escaping), because in those shells things like echo "$" throw an error.
Do we care about csh/tcsh?
I don't think so - @jglick might have other thoughts.
But is grep guaranteed to be supported?
This is making me twitch a bit. Which docker images did we test with?
No idea about whether it's guaranteed to be supported on all platforms, but it is at least included in BusyBox. I'll look around for a definitive answer.
Which docker images did we test with?
jenkinsci/slave with both the Alpine (3.19-1-alpine, before the procps fix) and Debian (3.20-1) tags.
is grep guaranteed to be supported
Someone could always have a custom Arch distro that specifically excludes grep, but I think it is pretty unlikely. The best info I could find about where grep is included by default is https://unix.stackexchange.com/questions/37064/which-are-the-standard-commands-available-in-every-linux-based-distribution. Based on that, and the fact that our use is compatible with BusyBox as well, I think it is pretty safe to count on grep being present, unless some new popular container OSes leave it out.
Do we care about csh/tcsh?
No, the wrapper script is always run using sh.
POSIX ps is supposed to support -p, so if there are some widely used containers with ps implementations that do not, I am not sure where that leaves us; we can only guess what might work.
I am a little nervous about dropping -p because then we start to rely on
"By default, ps shall select all processes with the same effective user ID as the current user and the same controlling terminal as the invoker."
I suppose both the EUID and the controlling terminal should be identical between the main wrapper script process and the ps process. -p selects the process explicitly without this restriction.
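One consequence of dropping -p is that ps now prints every process it selects by default, so the grep pattern has to match a whole PID line rather than any substring. The following is a minimal sketch of that matching. It reads canned ps-style output from stdin so that it does not depend on the local ps. The function name pid_listed and the sample listing are hypothetical, and [[:space:]] is swapped in for the patch's \s, which is not part of POSIX regular-expression syntax.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if PID $1 appears as a whole line
# (optionally indented) in the ps-style output read from stdin.
pid_listed() {
    grep -q "^[[:space:]]*$1\$"
}

# Made-up output in the shape of `ps -o pid`: a header, then one PID per line.
sample='  PID
   12
  123'

# PID 12 is listed; PID 1 is not, even though "1" is a substring of 12 and 123.
printf '%s\n' "$sample" | pid_listed 12 && echo "12 listed"
printf '%s\n' "$sample" | pid_listed 1 || echo "1 not listed"
```

An unanchored grep -q 1 would succeed on this input, which is why the old pattern was only safe while -p restricted the listing to a single process.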
See also this for example. Extend In the longer term, the current implementation using shell scripting should probably be discarded in favor of a Golang wrapper that could also do a
@jglick I have really mixed feelings about shipping Go artifacts to solve a process-tracking issue. Seems like it adds a lot of complexity without much reason -- why not simply use the Java 8 Process tooling such as 'isAlive'?
@@ -220,7 +232,7 @@ private void runOnDocker(DumbSlave s) throws Exception {
         do {
             Thread.sleep(1000);
             baos = new ByteArrayOutputStream();
-            assertEquals(0, dockerLauncher.launch().cmds("ps", "-e", "-o", "pid,stat,command").stdout(new TeeOutputStream(baos, System.out)).join());
+            assertEquals(0, dockerLauncher.launch().cmds("ps", "-e", "-o", "pid,stat,comm").stdout(new TeeOutputStream(baos, System.out)).join());
comm appears to be the standardized name: http://pubs.opengroup.org/onlinepubs/009695399/utilities/ps.html, and BusyBox doesn't support command.
Off topic here; as mentioned in JIRA, we probably cannot use Java code for that.
BTW things like e015167 do not actually need to be committed, much less pushed. Suffices to locally
    git checkout master -- src/main
develop your test until it fails in the expected way, then
    git checkout HEAD -- src/main
and verify that it passes, then commit the new test. We trust you! Alternately, to be explicit in history:
    git checkout -b tmp master
    # edit src/test/ until you get the expected failure
    # add @Ignore
    git commit -a -m 'Reproduced failure'
    git checkout support-basic-ps
    git merge --no-commit tmp
    # remove @Ignore
    git commit -a -m 'Yup it is fixed'
    git branch -d tmp
or the
Fails I guess the way you intended:
So, time to restore fix?
@jglick Ack, in the past I have been asked to show the failing test on the public CI history, but if everyone here is OK with it then it's certainly easier to just confirm locally. In any case, I will squash-merge into master when this is ready, to get rid of the intermediate commits. The time between the commits was due to me running the tests locally to make sure everything passed.
@@ -203,12 +205,22 @@ public void smokeTest() throws Exception {
         runOnDocker(new DumbSlave("docker", "/home/test", new SSHLauncher(container.ipBound(22), container.port(22), "test", "test", "", "")));
     }

+    @Test public void runOnAlpineDocker() throws Exception {
+        AlpineFixture container = dockerAlpine.get();
+        runOnDocker(new DumbSlave("docker", "/home/test", new SSHLauncher(container.ipBound(22), container.port(22), "test", "test", "", "")), 45);
Any particular reason for the longer sleep?
If it's lower than 30 seconds or so, the exit status is 0 and everything works fine. 45 seconds seemed to be long enough to consistently hit this code which triggers the failure.
I am not sure exactly what the minimum time to hit that case is, maybe 2 * HEARTBEAT_CHECK_INTERVAL + HEARTBEAT_MINIMUM_DELTA? I'll try to understand that code better to see if we can decrease the timeout.
Probably HEARTBEAT_CHECK_INTERVAL * 2 or something. OK.
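The two guesses in this thread differ only by the delta term. As a rough illustration of the arithmetic, here is a sketch with placeholder values. The numbers 15 and 2 are assumptions chosen for illustration, not values read from the plugin source; the thread itself only says that runs shorter than about 30 seconds pass.

```shell
#!/bin/sh
# Placeholder values for illustration only; NOT read from the plugin source.
HEARTBEAT_CHECK_INTERVAL=15   # seconds, assumed
HEARTBEAT_MINIMUM_DELTA=2     # seconds, assumed

# First guess from the thread: 2 * interval + delta
with_delta=$((2 * HEARTBEAT_CHECK_INTERVAL + HEARTBEAT_MINIMUM_DELTA))
# Second guess from the thread: interval * 2
without_delta=$((HEARTBEAT_CHECK_INTERVAL * 2))

echo "with delta: ${with_delta}s, without: ${without_delta}s"
```

Under these assumed values both estimates fall below the 45 seconds used in the test, so either formula would leave some margin.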
@@ -0,0 +1,9 @@
+FROM jenkinsci/slave:3.19-1-alpine
IIUC this is the version before jenkinsci/docker-agent#28.
At least for me it is fine if the PR author adds, say, a line comment to the line of the test where an error is thrown in the expected way with the main patch reverted, and shows the error output. The CI history is actually less useful since the branch project will normally be deleted once the PR is merged, and with it goes any record of the test output. (To be solved at some point by external test storage!)
See JENKINS-52847 and #75 (comment).
Some implementations of ps don't support the -p option, but we can work around that by passing a more complicated regex to grep.
I manually tested the change on OS X and Alpine Linux to make sure it works on both platforms.
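Unpacked from the one-line format string, the patched wrapper has roughly the following shape. This is a sketch under stated assumptions, not the script the plugin generates. The temporary directory, the variable names (control_dir, result_file, log_file, output_file) and the stand-in command echo hello are placeholders for the %s arguments. The jsc cookie variable is omitted, the single test using -a is split into two tests, and [[:space:]] replaces \s.

```shell
#!/bin/sh
# Placeholder paths standing in for the %s arguments of the format string.
control_dir=$(mktemp -d)
result_file="$control_dir/result.txt"
log_file="$control_dir/log.txt"
output_file="$control_dir/output.txt"

pid=$$

# Heartbeat: while this wrapper's PID is still listed by ps, the control
# directory exists, and no result has been written, touch the log file.
{
    while ps -o pid | grep -q "^[[:space:]]*$pid\$" \
        && [ -d "$control_dir" ] && [ ! -f "$result_file" ]
    do
        touch "$log_file"
        sleep 3
    done
} &

# Stand-in for the user's script; stdout and stderr go to separate files.
sh -c 'echo hello' > "$output_file" 2> "$log_file"

# Record the exit status: write a temp file first, then rename it into place.
echo $? > "$result_file.tmp"
mv "$result_file.tmp" "$result_file"

# Wait for the heartbeat loop to notice the result file and exit.
wait

echo "exit status: $(cat "$result_file")"
```

If ps is missing or does not list the wrapper's PID, the loop condition fails on its first check and the heartbeat simply never runs. That is the failure mode the ps portability discussion is about.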