fix lockrun to return child process's exit status #2

Merged
merged 1 commit into from Oct 11, 2011

Conversation

Projects
None yet
2 participants
Contributor

eweathers commented Oct 8, 2011

The previous code was incorrectly exiting with the value returned in
waipid()'s status parameter, which is not simply the exit status of the
child process, but instead an aggregated status variable representing a
few different reasons for waitpid()'s return. This prevented
invocations of lockrun from correctly returning exit status values that
were non-zero. To fix this we just need to parse the exit code out of
the status variable using the WEXITSTATUS macro.

Expected error code:

% ls /doesnotexist
ls: /doesnotexist: No such file or directory
% echo $?
2

Wrong error code from previous lockrun implementation:

% lockrun -V --lockfile=/tmp/lock.1 -- ls /doesnotexist
ls: /doesnotexist: No such file or directory
Waiting for process 14266
pid 14266 exited with status 512 (time=0 sec)
% echo $?
0

Note that the value 512 was being truncated because exit codes are
expected to be 1 byte (with values between 0 and 255).

After this fix, the exit status is properly passed back:

% lockrun -V --lockfile=/tmp/lock.1 -- ls /doesnotexist
ls: /doesnotexist: No such file or directory
Waiting for process 15114
pid 15114 exited with status 512, exit code: 2 (time=0 sec)
ls: /doesnotexist: No such file or directory
% echo $?
2

P.S., interestingly, the previous contributor to this project was also
from Groupon, Mike Cerna.
P.P.S., tabs are evil.

@eweathers eweathers fix lockrun to return child process's exit status
The previous code was incorrectly exiting with the value returned in
waipid()'s status parameter, which is not simply the exit status of the
child process, but instead an aggregated status variable representing a
few different reasons for waitpid()'s return.   This prevented
invocations of lockrun from correctly returning exit status values that
were non-zero.  To fix this we just need to parse the exit code out of
the status variable using the WEXITSTATUS macro.

Expected error code:

  % ls /doesnotexist
  ls: /doesnotexist: No such file or directory
  % echo $?
  2

Wrong error code from previous lockrun implementation:

  % lockrun -V --lockfile=/tmp/lock.1 -- ls /doesnotexist
  ls: /doesnotexist: No such file or directory
  Waiting for process 14266
  pid 14266 exited with status 512 (time=0 sec)
  % echo $?
  0

Note that the value 512 was being truncated because exit codes are
expected to be 1 byte (with values between 0 and 255).

After this fix, the exit status is properly passed back:

  % lockrun -V --lockfile=/tmp/lock.1 -- ls /doesnotexist
  ls: /doesnotexist: No such file or directory
  Waiting for process 15114
  pid 15114 exited with status 512, exit code: 2 (time=0 sec)
  ls: /doesnotexist: No such file or directory
  % echo $?
  2

P.S., interestingly, the previous contributor to this project was also
from Groupon, Mike Cerna.
P.P.S., tabs are evil.
3eb7773

@pushcx pushcx added a commit that referenced this pull request Oct 11, 2011

@pushcx pushcx Merge pull request #2 from eweathers/master
fix lockrun to return child process's exit status
681379d

@pushcx pushcx merged commit 681379d into pushcx:master Oct 11, 2011

Owner

pushcx commented Oct 11, 2011

Thanks for the little fix. If you know Peter Christiansen at Groupon, tell him I say hi.

Contributor

eweathers commented Oct 11, 2011

On Tue, Oct 11, 2011 at 4:08 PM, Peter Harkins <
reply@reply.github.com>wrote:

Thanks for the little fix.

No problem! I'd also love it if I could specify a "kill children after
timeout of X seconds (milliseconds?)". I might look into creating such a
patch later...

If you know Peter Christiansen at Groupon, tell him I say hi.

heh, I work in Palo Alto, but was in Chicago for 1 week last month and my
temporary seat happened to be directly beside him.

  • Erik

Reply to this email directly or view it on GitHub:
#2 (comment)

Owner

pushcx commented Oct 12, 2011

I don't know why you'd want that feature. Could you tell me some more about it?

Contributor

eweathers commented Oct 12, 2011

hey Peter.

The situation is that we have a cronjob that we run under lockrun every
minute. The cmd locked by lockrun is a subshell (sh -c 'foo && bar && baz
...') that first does a "git pull" (over SSH), then it runs a script to
process the updated files. The problem is that the SSH connection made by
the "git pull" sometimes just hangs, preventing the locked process from
completing, and leaving the lockfile around forever (or 1+ hours), which
breaks things for us.

When we notice this I've been going in and manually killing the process tree
and removing the lockfile. I'm going to fix this by either:
(1) modifying lockrun to take a parameter and kill the child process (I'm
assuming I'll need to use a separate pthread for this, though maybe I can
set a timer to signal the main lockrun process).
-OR-
(2) use a separate wrapper script (bash or perl) that does this
timeout-and-kill.

  • Erik

On Wed, Oct 12, 2011 at 7:04 AM, Peter Harkins <
reply@reply.github.com>wrote:

I don't know why you'd want that feature. Could you tell me some more about
it?

Reply to this email directly or view it on GitHub:
#2 (comment)

Owner

pushcx commented Oct 13, 2011

Hm. For now, I'd suggest making a cron job that runs every minute and does a pkill (if the process name is unique) and rm.

I could see a separate option to consider a lockfile stale, eg. --stale=N (in seconds) that would SIGTERM previous lockrun to die with its children (and then SIGKILL it if nothing happens). That's a good fit with lockrun's functionality and I'd merge it.

Contributor

eweathers commented Oct 13, 2011

On Thu, Oct 13, 2011 at 4:02 PM, Peter Harkins <
reply@reply.github.com>wrote:

Hm. For now, I'd suggest making a cron job that runs every minute and does
a pkill (if the process name is unique) and rm.

I could see a separate option to consider a lockfile stale, eg. --stale=N
(in seconds) that would SIGKILL previous lockrun to die with its children
(and then SIGKILL it if nothing happens). That's a good fit with lockrun's
functionality and I'd merge it.

Ah, interesting idea! Yeah, that makes total sense, and is much simpler to
implement. Will let you know when/if I get to this.

Thanks!

  • Erik

Reply to this email directly or view it on GitHub:
#2 (comment)

@allardhoeve allardhoeve added a commit to ByteInternet/lockrun that referenced this pull request Aug 12, 2013

@allardhoeve allardhoeve Move debian packaging to debian branch #2
This reverts commit 649daaf.
a089b26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment