Spurious process manager test failures #2297

SeanTAllen · 2017-10-21T21:33:06Z

Some process manager tests (but not all), check the exit code. You can see this here:

https://github.com/ponylang/ponyc/blob/master/packages/process/_test.pony#L460

The problem is that despite the test running to completion in terms of process correct data, sometimes, _kill_child fails. Returning a -1 error which triggers "failed" being called on the notify which fails our test.

It appears that from the perspective of the test, everything worked fine, but for some reason, kill failed. There are two possibilities:

the external process didn't die
it was already shutdown

If its the latter, we want to make _kill_child smarter. If the process is already gone, then there is no need to throw an error. See: https://github.com/ponylang/ponyc/blob/master/packages/process/process_monitor.pony#L454

If the external process didn't die but the test otherwise completed fine, then really, there's nothing we can do on the Pony side and the test should pass.

Someone needs to dig in and determine exactly why these occassionally fail with a "KILL ERROR".

My advice to anyone taking this one, you'll want to run the standard library tests in a loop in order to trigger.

Something like:

while ./stdlib --sequential; do :; done

The text was updated successfully, but these errors were encountered:

mfelsche · 2017-11-18T13:31:15Z

Some more tests failing spuriously:

Example outputs from jenkins job https://travis-ci.org/ponylang/ponyc/jobs/303769678 for commit: 398d361 :

LLVM 3.7.1:

**** FAILED: process/WritevOrdering
	Received from stdout: 11 bytes
	Received so far: 11 bytes
	Expecting: 11 bytes
dispose: child exit code: 0
dispose: stdout: 11 bytes
dispose: stderr: 
dispose: received first data after: 	7425822 ns
dispose: total data process_time: 	8143206 ns
dispose: ProcessNotify lifetime: 	15569028 ns
ProcessError: KillError

LLVM 3.9.1:

**** FAILED: process/STDERR
dispose: child exit code: 1
dispose: stdout: 0 bytes
dispose: stderr: cat: file_does_not_exist: No such file or directory
dispose: total data process_time: 	754961948132 ns
dispose: ProcessNotify lifetime: 	45210553 ns
ProcessError: KillError

jemc · 2017-11-19T07:04:53Z

There are two possibilities:

the external process didn't die

it was already shutdown

Hmm.. according to this man page for kill, there are three possibilities:

On success (at least one signal was sent), zero is returned. On error, -1 is returned, and errno is set appropriately.

EINVAL - An invalid signal was specified.

EPERM - The process does not have permission to send the signal to any of the target processes.

ESRCH - The pid or process group does not exist. Note that an existing process might be a zombie, a process which already committed termination, but has not yet been wait(2)ed for.

Of those, I think ESRCH is probably the only one that makes sense as being a sporadic failure, so it's probably that one.

So it seems we just need to update to swallow the error if errno == ESRCH?

SeanTAllen · 2017-11-19T14:21:44Z

@jemc that seems like a reasonable first attempt.

winksaville · 2017-12-21T22:28:24Z

I'll look into this.

@waitpid

This provides a fix for ponylang#2297. I was able to replicate two spurious failures. The first was KillError thrown by _kill_child, I've changed the code so KillError is not thrown as there is really no reasonable recovery that can be performed. The second spurious error I saw was in _close where @waitpid would return an error because the _child_pid was -1. This occurs in packages/process/_test.pony::TestFileExecCapabilityIsRequired and ::_TestNonExecutablePathResultsInExecveError which are designed to fail and _child_pid is always -1. This causes a failure to be always be sent asynchronously and sometimes is actually gets reported. failure is reported

@waitpid

This provides a fix for ponylang#2297. I was able to replicate two spurious failures. The first was KillError thrown by _kill_child, I've changed the code so KillError is not thrown as there is really no reasonable recovery that can be performed. The second spurious error I saw was in _close where @waitpid would return an error because the _child_pid was -1. This occurs in packages/process/_test.pony::TestFileExecCapabilityIsRequired and ::_TestNonExecutablePathResultsInExecveError which are designed to fail and _child_pid is always -1. This causes a failure to be always be sent asynchronously and sometimes is actually gets reported. failure is reported

@waitpid

This provides a fix for ponylang#2297. I was able to replicate two spurious failures. The first was KillError thrown by _kill_child, I've changed the code so KillError is not thrown as there is really no reasonable recovery that can be performed. The second spurious error I saw was in _close where @waitpid would return an error because the _child_pid was -1. This occurs in packages/process/_test.pony::TestFileExecCapabilityIsRequired and ::_TestNonExecutablePathResultsInExecveError which are designed to fail and _child_pid is always -1. This causes a failure to be always be sent asynchronously and sometimes is actually gets reported. failure is reported

@waitpid

This provides a fix for ponylang#2297. I was able to replicate two spurious failures. The first was KillError thrown by _kill_child, I've changed the code so KillError is not thrown as there is really no reasonable recovery that can be performed. The second spurious error I saw was in _close where @waitpid would return an error because the _child_pid was -1. This occurs in packages/process/_test.pony::TestFileExecCapabilityIsRequired and ::_TestNonExecutablePathResultsInExecveError which are designed to fail and _child_pid is always -1. This causes a failure to be always be sent asynchronously and sometimes is actually gets reported. failure is reported

@waitpid

This provides a fix for ponylang#2297. I was able to replicate two spurious failures. The first was KillError thrown by _kill_child, I've changed the code so KillError is not thrown as there is really no reasonable recovery that can be performed. The second spurious error I saw was in _close where @waitpid would return an error because the _child_pid was -1. This occurs in packages/process/_test.pony::TestFileExecCapabilityIsRequired and ::_TestNonExecutablePathResultsInExecveError which are designed to fail and _child_pid is always -1. This causes a failure to be always be sent asynchronously and sometimes is actually gets reported. failure is reported

SeanTAllen added bug: 1 - needs investigation help wanted Extra attention is needed labels Oct 21, 2017

SeanTAllen mentioned this issue Oct 22, 2017

Standard Library tests occassionally fails to exit #2298

Closed

mfelsche mentioned this issue Nov 18, 2017

Standard Library Tests for process package fail on OSX #2353

Closed

jemc closed this as completed Nov 19, 2017

jemc reopened this Nov 19, 2017

winksaville mentioned this issue Dec 24, 2017

Running stdlib in a loop can hang #2445

Closed

winksaville mentioned this issue Dec 26, 2017

Fix some spurious process manager test failures #2452

Merged

SeanTAllen closed this as completed in 0040c2a Jan 3, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spurious process manager test failures #2297

Spurious process manager test failures #2297

SeanTAllen commented Oct 21, 2017 •

edited

mfelsche commented Nov 18, 2017 •

edited

jemc commented Nov 19, 2017 •

edited

SeanTAllen commented Nov 19, 2017

winksaville commented Dec 21, 2017

Spurious process manager test failures #2297

Spurious process manager test failures #2297

Comments

SeanTAllen commented Oct 21, 2017 • edited

mfelsche commented Nov 18, 2017 • edited

jemc commented Nov 19, 2017 • edited

SeanTAllen commented Nov 19, 2017

winksaville commented Dec 21, 2017

SeanTAllen commented Oct 21, 2017 •

edited

mfelsche commented Nov 18, 2017 •

edited

jemc commented Nov 19, 2017 •

edited