build: clear stalled jobs on POSIX CI hosts #11246

Closed
wants to merge 1 commit into
from

Conversation

@Trott
Member

Trott commented Feb 8, 2017

Sometimes, after a cluster or debug test fails, a fixture hangs around
and holds onto a needed port, causing subsequent CI runs to fail. This
adds a command I've been running manually when this occurs. The command
will clear the stalled jobs before a CI run.

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • commit message follows commit guidelines
Affected core subsystem(s)

build

Makefile
@@ -188,6 +188,9 @@ test/addons/.buildstamp: config.gypi \
# TODO(bnoordhuis) Force rebuild after gyp update.
build-addons: $(NODE_EXE) test/addons/.buildstamp
+clear-stalled:
+ ps awwwx | grep Release/node | grep -v grep | awk '{print $1}' | xargs kill

@mscdex

mscdex Feb 8, 2017

Contributor

What about simply using pkill instead? Isn't that available on all *nix platforms in CI?

@Trott

Trott Feb 8, 2017

Member

It's not on our AIX hosts.
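
For readers unfamiliar with the pipeline in the diff, here is a minimal shell sketch of the PID-selection step, which relies only on `ps`, `grep`, and `awk` rather than `pkill`. The function name `stalled_pids` is hypothetical and not part of this PR; it reads `ps`-style text on stdin and assumes the PID is the first whitespace-separated field.

```shell
# Hypothetical helper (not from the PR): print the PIDs of processes whose
# command line mentions Release/node, given ps-style text on stdin.
stalled_pids() {
  # Keep lines for the node binary, drop the grep process itself,
  # then print the first field (assumed to be the PID).
  grep 'Release/node' | grep -v grep | awk '{print $1}'
}

# Possible usage on a CI host (not executed here):
#   ps awwx | stalled_pids | xargs kill
```

Inside a Makefile recipe the awk field has to be written `$$1` so that make passes a literal `$1` through to the shell.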

@mscdex mscdex added the test label Feb 8, 2017

@jasnell

jasnell approved these changes Feb 8, 2017

@Trott

Trott Feb 8, 2017

Member

CI is green. Removing WIP from the title and removing the in progress label.

/cc @nodejs/build for reviews

@Trott Trott removed the in progress label Feb 8, 2017

@Trott Trott changed the title from WIP: build: clear stalled jobs on POSIX CI hosts to build: clear stalled jobs on POSIX CI hosts Feb 8, 2017

@Trott Trott referenced this pull request Feb 8, 2017

Closed

lib: deprecate node --debug at runtime #10970

3 of 3 tasks complete
@mhdawson

LGTM, thanks for this. We've had to clean the AIX machines several times in the last few days, so this is great to see.

@Trott

Trott Feb 9, 2017

Member

I'm terrible at grokking Makefile stuff. I put this as an order-only prerequisite. If someone knowledgeable could indicate whether that's the right thing to do in this case, that would be appreciated.

@richardlau

richardlau Feb 9, 2017

Member

The clear-stalled target should be phony and added to .PHONY at the end of the Makefile, since it does not refer to a file or directory called clear-stalled.

If the target is declared phony, I'm not sure that it matters if it is order-only or not as a prerequisite.

@richardlau richardlau referenced this pull request in nodejs/build Feb 9, 2017

Open

Job to clean up old processes #591

@gibfahn

gibfahn Feb 9, 2017

Member

Related to nodejs/build#591

As discussed in that issue, I'd rather have something in the test runner that kept track of processes that hadn't been cleaned up at the end of each run, and failed the job if there were any. However I do think this is better than what we currently have (later jobs randomly failing and @nodejs/build members having to clean up machines manually), so I'd be +1 on this for now.

Currently I don't think this is logging what processes it is cleaning up though, which is information I do think we need, so I'd be -1 on this without that.

I'm also not sure under what circumstances processes get left behind. Doesn't tools/test.py clean up once the TIMEOUT happens? I assumed it was for orphaned node subprocesses, but I'm not sure.

Makefile
+ XARGS = xargs -r
+endif
+clear-stalled:
+ ps awwwx | grep Release/node | grep -v grep | awk '{print $$1}' | $(XARGS) kill

@gibfahn

gibfahn Feb 9, 2017

Member

I think you could just add the same line up to the awk above to list the processes that will be killed, i.e.:

ps awwwx | grep Release/node | grep -v grep
ps awwwx | grep Release/node | grep -v grep | awk '{print $$1}' | $(XARGS) kill
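
The list-then-kill idea can also be sketched as a single shell function that logs the matching processes and skips `kill` entirely when nothing matches, without depending on `xargs -r`. This is an illustration only, not the code in the PR: `clear_stalled` and the `KILL_CMD` override are hypothetical names, and the listing is passed in as an argument so the sketch can be dry-run.

```shell
# KILL_CMD is a hypothetical override for dry runs; it defaults to `kill`.
KILL_CMD=${KILL_CMD:-kill}

# Hypothetical helper: $1 is ps-style text with the PID in the first field.
clear_stalled() {
  # Collect matching lines; `|| true` keeps a no-match grep from being
  # treated as an error when the shell runs with `set -e`.
  matches=$(printf '%s\n' "$1" | grep 'Release/node' | grep -v grep || true)
  if [ -z "$matches" ]; then
    return 0  # nothing to kill, so kill is never called with no arguments
  fi
  # Log what is about to be killed so the CI output records it.
  printf '%s\n' "$matches"
  # Unquoted on purpose: each PID becomes a separate argument.
  $KILL_CMD $(printf '%s\n' "$matches" | awk '{print $1}')
}

# Possible usage on a CI host (not executed here):
#   clear_stalled "$(ps awwx)"
```

Setting `KILL_CMD=echo` before calling the function prints the PIDs instead of signalling them, which makes the sketch safe to try on a developer machine.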
@Trott

Trott Feb 9, 2017

Member

> I assumed it was for orphaned node subprocesses, but I'm not sure.

That seems right to me. Always seems to be subprocesses from cluster or debug tests.

@gibfahn

gibfahn approved these changes Feb 9, 2017

The code addition LGTM, no idea about the Makefile syntax

@santigimeno

santigimeno Feb 9, 2017

Member

> I'd rather have something in the test runner that kept track of processes that hadn't been cleaned up at the end of each run, and failed the job if there were any.

I agree on this. If a test is leaving processes behind I think it should fail.

> Currently I don't think this is logging what processes it is cleaning up though, which is information I do think we need, so I'd be -1 on this without that.

I also agree on this.

@gibfahn

gibfahn Feb 9, 2017

Member

@santigimeno So with the latest update this will now list the processes it's killing, so I'd say this is an improvement over what we currently do. Are you -1 on this landing as a step in the right direction?

If we could fix tools/test.py that would be great, but I'd rather have this in the meantime. Once the tools/test.py fix is in place, we could probably change this to make it fail the build if it finds any processes (as that means the tools/test.py cleanup failed).

@santigimeno

santigimeno Feb 9, 2017

Member

> So with the latest update this will now list the processes it's killing, so I'd say this is an improvement over what we currently do. Are you -1 on this landing as a step in the right direction?

Sorry, I had overlooked the cat command. I'd rather have this running at the end of the CI run than before so we know when it really happened, but I agree this is an improvement.

@gibfahn

gibfahn Feb 9, 2017

Member

> I'd rather have this running at the end of the CI run than before so we know when it really happened

Good point. @Trott, does this make sense to you? If we did that we could have it fail the build if any processes were found.
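
A check that fails the build when leftovers exist could look roughly like the sketch below, assuming it runs after the tests finish. The name `check_leftovers` is hypothetical, the listing is passed in as an argument for illustration, and any follow-up change to the Makefile may well be structured differently.

```shell
# Hypothetical helper: $1 is ps-style text. Returns non-zero (and prints the
# offending lines) if any Release/node process is still present.
check_leftovers() {
  # `|| true` keeps a no-match grep from aborting a `set -e` shell.
  leftovers=$(printf '%s\n' "$1" | grep 'Release/node' | grep -v grep || true)
  if [ -n "$leftovers" ]; then
    echo "Leftover processes found:"
    printf '%s\n' "$leftovers"
    return 1  # a non-zero status is what would fail the CI step
  fi
  return 0
}

# Possible usage at the end of a CI run (not executed here):
#   check_leftovers "$(ps awwx)" || exit 1
```

Returning a status instead of killing anything keeps the report separate from the cleanup, so the job that leaked the process is the one that goes red.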

Makefile
+ XARGS = xargs -r
+endif
+clear-stalled:
+ ps awwwx | grep Release/node | grep -v grep | cat

@joshgav

joshgav Feb 9, 2017

Member

Does the third w in the ps options add something? Docs and anecdotal usage indicate more than 2 doesn't make a difference.

@joshgav

joshgav approved these changes Feb 9, 2017

LGTM, subject to minor nit above.

@Trott

Trott Feb 9, 2017

Member

Updated Makefile per nit from @joshgav.

Re-running CI: https://ci.nodejs.org/job/node-test-pull-request/6317/

@Trott

Trott Feb 9, 2017

Member

@santigimeno wrote:

> I'd rather have this running at the end of the CI run than before so we know when it really happened, but I agree this is an improvement.

Queued up for a subsequent PR: Trott@48693fc

@Trott Trott referenced this pull request Feb 9, 2017

Closed

build: fail on CI if leftover processes #11269

2 of 2 tasks complete

Trott added a commit to Trott/io.js that referenced this pull request Feb 10, 2017

build: clear stalled jobs on POSIX CI hosts
Sometimes, after a cluster or debug test fails, a fixture hangs around
and holds onto a needed port, causing subsequent CI runs to fail. This
adds a command I've been running manually when this occurs. The command
will clear the stalled jobs before a CI run.

PR-URL: nodejs#11246
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Josh Gavant <josh.gavant@outlook.com>
@Trott

Trott Feb 10, 2017

Member

Landed in 90ab68b

@Trott Trott closed this Feb 10, 2017

italoacasas added a commit that referenced this pull request Feb 13, 2017

build: clear stalled jobs on POSIX CI hosts

italoacasas added a commit to italoacasas/node that referenced this pull request Feb 14, 2017

build: clear stalled jobs on POSIX CI hosts

KryDos added a commit to KryDos/node that referenced this pull request Feb 25, 2017

build: clear stalled jobs on POSIX CI hosts
@jasnell

jasnell Mar 7, 2017

Member

This would need backport PRs to land on v6 and v4.

@gibfahn

gibfahn Jun 17, 2017

Member

I think it's worth backporting this to v6.x-staging. @Trott would you be willing to backport this?

guide

#12158 is a follow-up to this.

@gibfahn gibfahn referenced this pull request Jun 17, 2017

Merged

build: avoid passing kill empty input in Makefile #12158

2 of 2 tasks complete

Trott added a commit to Trott/io.js that referenced this pull request Jun 18, 2017

build: clear stalled jobs on POSIX CI hosts

@Trott Trott referenced this pull request Jun 18, 2017

Closed

(v6.x backport) build: clear stalled jobs on POSIX CI hosts #13754

0 of 4 tasks complete
@Trott

Member

Trott commented Jun 18, 2017

@gibfahn #13754

gibfahn added a commit that referenced this pull request Jun 18, 2017

build: clear stalled jobs on POSIX CI hosts
Sometimes, after a cluster or debug test fails, a fixture hangs around
and holds onto a needed port, causing subsequent CI runs to fail. This
adds a command I've been running manually when this occurs. The command
will clear the stalled jobs before a CI run.

PR-URL: #11246
Backport-PR-URL: #13754
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Josh Gavant <josh.gavant@outlook.com>

gibfahn added a commit that referenced this pull request Jun 20, 2017

build: clear stalled jobs on POSIX CI hosts

MylesBorins added a commit that referenced this pull request Jul 11, 2017

build: clear stalled jobs on POSIX CI hosts

@MylesBorins MylesBorins referenced this pull request Jul 18, 2017

Merged

v6.11.2 proposal #14356
