Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-52847] Avoid using -p to support basic implementations of ps #77

Merged
merged 5 commits into from
Aug 7, 2018

Conversation

dwnusbaum
Copy link
Member

@dwnusbaum dwnusbaum commented Aug 1, 2018

See JENKINS-52847 and #75 (comment).

Some implementations of ps don't support the -p option, but we can work around that by passing a more complicated regex to grep.

I manually tested the change on OS X and Alpine Linux to make sure it works on both platforms.

@dwnusbaum dwnusbaum changed the title Avoid using -p to support basic implementations of ps [JENKINS-52847] Avoid using -p to support basic implementations of ps Aug 2, 2018
@@ -140,7 +140,7 @@ public String getScript() {
FilePath resultFile = c.getResultFile(ws);
FilePath controlDir = c.controlDir(ws);
if (capturingOutput) {
cmd = String.format("pid=$$; { while ps -o pid -p $pid | grep -q $pid && [ -d '%s' -a \\! -f '%s' ]; do touch '%s'; sleep 3; done } & jsc=%s; %s=$jsc '%s' > '%s' 2> '%s'; echo $? > '%s.tmp'; mv '%s.tmp' '%s'; wait",
cmd = String.format("pid=$$; { while ps -o pid | grep -q \"^\\s*$pid$\" && [ -d '%s' -a \\! -f '%s' ]; do touch '%s'; sleep 3; done } & jsc=%s; %s=$jsc '%s' > '%s' 2> '%s'; echo $? > '%s.tmp'; mv '%s.tmp' '%s'; wait",
Copy link
Member Author

@dwnusbaum dwnusbaum Aug 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To get this to work in csh/tcsh we'd need to use \"^\\s*$pid\"\'$\' (becomes "^\s*$pid"'$' after removing Java's escaping) because in those shells things like echo "$" throw an error.

Do we care about csh/tcsh?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so - @jglick might have other thoughts.

But is grep guaranteed to be supported?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is making me twitch a bit. Which docker images did we test with?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea about whether it's guaranteed to be supported on all platforms, but it is at least included in BusyBox. I'll look around for a definitive answer.

Copy link
Member Author

@dwnusbaum dwnusbaum Aug 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which docker images did we test with?

jenkinsci/slave with both alpine (3.19-1-alpine, before the procops fix) and Debian (3.20-1) tags.

Copy link
Member Author

@dwnusbaum dwnusbaum Aug 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is grep guaranteed to be supported

Someone could always have a custom Arch distro that specifically excludes grep, but I think it is pretty unlikely. The best info I could find about where grep is included by default is https://unix.stackexchange.com/questions/37064/which-are-the-standard-commands-available-in-every-linux-based-distribution. Based on that, and the fact that our use is compatible with BusyBox as well, I think it is pretty safe to count on grep being present, unless there are some new popular container OS's that leave it out or something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care about csh/tcsh?

No, the wrapper script is always run using sh.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

POSIX ps is supposed to support -p so if there are some widely used containers with ps implementations that do not, I am not sure where that leaves us—we can only guess what might work.

I am a little nervous about dropping -p because then we start to rely on

By default, ps shall select all processes with the same effective user ID as the current user and the same controlling terminal as the invoker.

I suppose both the EUID and the controlling terminal should be identical between the main wrapper script process and the ps process. -p selects the process explicitly without this restriction.

@@ -140,7 +140,7 @@ public String getScript() {
FilePath resultFile = c.getResultFile(ws);
FilePath controlDir = c.controlDir(ws);
if (capturingOutput) {
cmd = String.format("pid=$$; { while ps -o pid -p $pid | grep -q $pid && [ -d '%s' -a \\! -f '%s' ]; do touch '%s'; sleep 3; done } & jsc=%s; %s=$jsc '%s' > '%s' 2> '%s'; echo $? > '%s.tmp'; mv '%s.tmp' '%s'; wait",
cmd = String.format("pid=$$; { while ps -o pid | grep -q \"^\\s*$pid$\" && [ -d '%s' -a \\! -f '%s' ]; do touch '%s'; sleep 3; done } & jsc=%s; %s=$jsc '%s' > '%s' 2> '%s'; echo $? > '%s.tmp'; mv '%s.tmp' '%s'; wait",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care about csh/tcsh?

No, the wrapper script is always run using sh.

@jglick
Copy link
Member

jglick commented Aug 6, 2018

Oh my:

We allow (and ignore) most of the above. FIXME.

See also this for example.

Extend BourneShellScriptTest to use a Docker fixture based on jenkinsci/slave:3.19-1-alpine to reproduce the problem before the fix please.

In the longer term, the current implementation using shell scripting should probably be discarded in favor of a Golang wrapper that could also do a setsid. See discussion in JENKINS-25503. Of course this means shipping binaries for each supported platform, or shipping only an x86_64 binary and falling back to the current implementation on exotic platforms.

@svanoort
Copy link
Member

svanoort commented Aug 6, 2018

@jglick I have really mixed feelings about shipping Go artifacts to solve a process-tracking issue. Seems like it adds a lot of complexity without much reason -- why not simply use the Java 8 Process tooling such as 'isAlive'?

@@ -220,7 +232,7 @@ private void runOnDocker(DumbSlave s) throws Exception {
do {
Thread.sleep(1000);
baos = new ByteArrayOutputStream();
assertEquals(0, dockerLauncher.launch().cmds("ps", "-e", "-o", "pid,stat,command").stdout(new TeeOutputStream(baos, System.out)).join());
assertEquals(0, dockerLauncher.launch().cmds("ps", "-e", "-o", "pid,stat,comm").stdout(new TeeOutputStream(baos, System.out)).join());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comm appears to be the standardized name: http://pubs.opengroup.org/onlinepubs/009695399/utilities/ps.html, and BusyBox doesn't support command.

@jglick
Copy link
Member

jglick commented Aug 7, 2018

why not simply use the Java 8 Process tooling

Off topic here; as mentioned in JIRA, we probably cannot use Java code for that.

@jglick
Copy link
Member

jglick commented Aug 7, 2018

BTW things like e015167 do not actually need to be committed, much less pushed. Suffices to locally

git checkout master -- src/main

develop your test until it fails in the expected way, then

git checkout HEAD -- src/main

and verify that it passes, then commit the new test. We trust you! Alternately, to be explicit in history:

git checkout -b tmp master
# edit src/test/ until you get the expected failure
# add @Ignore
git commit -a -m 'Reproduced failure'
git checkout support-basic-ps
git merge --no-commit tmp
# remove @Ignore
git commit -a -m 'Yup it is fixed'
git branch -d tmp

or the git rebase -i equivalent if you like that sort of thing.

@jglick
Copy link
Member

jglick commented Aug 7, 2018

Fails I guess the way you intended:

java.lang.AssertionError: expected:<0> but was:<-1>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:834)
	at org.junit.Assert.assertEquals(Assert.java:645)
	at org.junit.Assert.assertEquals(Assert.java:631)
	at org.jenkinsci.plugins.durabletask.BourneShellScriptTest.runOnDocker(BourneShellScriptTest.java:233)
	at org.jenkinsci.plugins.durabletask.BourneShellScriptTest.runOnAlpineDocker(BourneShellScriptTest.java:214)
…
[workspace] Running shell script
ps: unrecognized option: p
BusyBox v1.27.2 (2017-12-12 10:41:50 GMT) multi-call binary.

Usage: ps [-o COL1,COL2=HEADER]

Show list of processes

	-o COL1,COL2=HEADER	Select columns for display
wrapper script does not seem to be touching the log file in /home/test/workspace@tmp/durable-e62978d8
…

So, time to restore fix?

@jglick jglick self-requested a review August 7, 2018 19:16
@dwnusbaum
Copy link
Member Author

and verify that it passes, then commit the new test. We trust you!

@jglick Ack, in the past I have been asked to show the failing test on the public CI history, but if everyone here is ok with it then it's certainly easier to just confirm locally. In any case I will squash-merge into master when ready here to get rid of the intermediate commits. The time between the commits was due to me running the tests locally to make sure everything passed.

@@ -203,12 +205,22 @@ public void smokeTest() throws Exception {
runOnDocker(new DumbSlave("docker", "/home/test", new SSHLauncher(container.ipBound(22), container.port(22), "test", "test", "", "")));
}

@Test public void runOnAlpineDocker() throws Exception {
AlpineFixture container = dockerAlpine.get();
runOnDocker(new DumbSlave("docker", "/home/test", new SSHLauncher(container.ipBound(22), container.port(22), "test", "test", "", "")), 45);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason for the longer sleep?

Copy link
Member Author

@dwnusbaum dwnusbaum Aug 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's lower than 30 seconds or so, the exit status is 0 and everything works fine. 45 seconds seemed to be long enough to consistently hit this code which triggers the failure.

I am not sure exactly what the minimum time to hit that case is, maybe 2 * HEARTBEAT_CHECK_INTERVAL + HEARTBEAT_MINIMUM_DELTA? I'll try to understand that code better to see if we can decrease the timeout.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably HEARTBEAT_CHECK_INTERVAL * 2 or something. OK.

@@ -0,0 +1,9 @@
FROM jenkinsci/slave:3.19-1-alpine
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this is the version before jenkinsci/docker-agent#28.

@jglick
Copy link
Member

jglick commented Aug 7, 2018

in the past I have been asked to show the failing test on the public CI history

At least for me it is fine if the PR author adds, say, a line comment to the line of the test where an error is thrown in the expected way with the main patch reverted, and shows the error output.

The CI history is actually less useful since the branch project will normally be deleted once the PR is merged, and with it goes any record of the test output. (To be solved at some point by external test storage!)

@dwnusbaum dwnusbaum merged commit 5ded96f into jenkinsci:master Aug 7, 2018
@dwnusbaum dwnusbaum deleted the support-basic-ps branch August 7, 2018 19:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants