Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore EPIPE and ECONNRESET errors from stdin #35

Closed

Conversation

SystemParadox
Copy link
Contributor

Fixes #34.

Thanks.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.43%) to 99.57% when pulling 991600b on SystemParadox:34-fix-epipe-error into 60545fb on peerigon:master.

@jhnns
Copy link
Member

jhnns commented May 15, 2015

Thx for reporting this.

It seems like this is a Linux problem because it does not occur on OSX. My assumption is that Phantomjs receives the SIGINT signal and handles it by itself. Thus it is not necessary that phridge also tries to shutdown the process.

However, I don't like to just silently ignore errors, because that is usually just a quickfix and will haunt you when you try to debug something without getting an error :)

I think I'll rewrite your PR so that phridge only ignores these errors when a SIGINT is received.

Why do you also ignore ECONNRESET? This error does not occur in my test setup...

@SystemParadox
Copy link
Contributor Author

As discussed in #34, Linux shells tend to group child processes into 'process groups'. When the shell gets a ctrl-c, it doesn't just send it to the current foreground process, it sends it to the entire process group, which means the children receive it as well. If you send the signal with kill -SIGINT this doesn't happen, it's just a shell thing. However, this is an aside- it's just a symptom of a wider problem - which is that phridge throws uncaught EPIPE/ECONNRESET exceptions if the phantomJS process dies.

I agree entirely about not silently ignoring errors :). HOWEVER, in this case I believe that this may quite likely be the correct behaviour. EPIPE/ECONNRESET are fired because the process has died - it's just not been cleaned up because process.on('exit') hasn't fired yet.

This can only be solved in one of the following ways:

  1. Find a way to guarantee that exit fires before EPIPE/ECONNRESET and either clean up so that they never fire or set a flag so that they will be ignored from that point.
  2. Ignore EPIPE/ECONNRESET with the expectation that exit will fire to clean everything up
  3. Treat EPIPE/ECONNRESET/exit all the same. Cleanup, then ignore further events.

Setting to ignore only once a SIGINT is received doesn't handle the case when just phantomJS is killed.

I didn't notice ECONNRESET until I started writing the tests - where I ended up randomly getting EPIPE or ECONNRESET about 50:50. I am not sure whether I have also seen this outside of the mocha test cases or not. That one probably does need more investigation.

Thanks.

@SystemParadox
Copy link
Contributor Author

I should say that if you wish to reproduce the wider test case, start a node process which runs phridge to create a phantomJS, then kill (SIGTERM, SIGINT, SIGKILL, whatever) the phantomJS, and then allow node to run phantom.dispose(). It causes node to die horribly with EPIPE errors from nowhere.

@jhnns
Copy link
Member

jhnns commented May 17, 2015

Yeah, I guess you're right.

Setting to ignore only once a SIGINT is received doesn't handle the case when just phantomJS is killed.

When is this case? I think, this should never be the case under normal circumstances...

@SystemParadox
Copy link
Contributor Author

Probably not under normal circumstances. However, if I do killall phantomjs I would expect the node process to spawn another one, not die with an uncaught exception from nowhere.

If a phantomJS process dies, it makes sense for any tasks it was running to return errors. But the errors should be specific to those requests - they shouldn't blow up the whole server.

Possibly phridge or the phantomjs object should also emit some kind of error, but it needs to be a clear, documented, catchable error so that a server-type process can catch it, print a warning, spawn a new instance and carry on.

@domasx2
Copy link
Contributor

domasx2 commented Jul 18, 2015

👍 I get ECONNRESET if phantom process is killed while evaluating a page, this would fix it

@jhnns
Copy link
Member

jhnns commented Sep 25, 2015

Could you guys check out the current master? I've improved the error handling. Does especially this commit fix your problems?

I don't like to discard errors. That's why I've written an error handler which waits 300ms for the process to shut down. If that's not the case, the error is emitted on the Phantom instance where the developer is able to catch it. If the error is not handled, node throws it.

@SystemParadox
Copy link
Contributor Author

Glad to see stdin/stdout/etc are now using _onUnexpectedError and rejecting deferreds on the Phantom object instead of throwing errors from undocumented internals. That's a big improvement :)

However, I think the timeout is fragile and confusing and just introduces race conditions.

  • What if my process takes longer than 300ms to exit?
  • What if my process caught the SIGINT and decided not to exit at all?

I really think you should just remove this. To my mind, there are only 3 places when we care about errors:

  1. I am waiting for a promise to resolve and an error occurs
  2. I try to initiate a call and an error occurs
  3. I try to initiate a call but that process has already errored/closed/disposed. If the specific error has already been thrown to a previous listener, this should probably just be the instance is already disposed error. But if the error occured when no-one was listening, it should be stored and thrown to the first person who cares about it.
  • Errors should only be thrown synchronously when calls are made or by rejecting promises. There should never be an uncaught error. We should not be using EventEmitter and expecting the user to listen for 'out-of-band' errors.
  • If an error occurs when no-one is listening, it should be stored and thrown to the first person who cares about it
  • If an error occurs when no-one is listening, and no valid call ever gets made for it to be thrown to, then we are clearly shutting down and don't care
  • dispose() should completely ignore any errors that may occur if the process has already been terminated or is in the process of doing so

Thanks.

@jhnns
Copy link
Member

jhnns commented Sep 26, 2015

Thx for reading through the source code and for providing feedback. That's why I didn't just publish a new version 😀

I have to admit, using a timeout seems a bit hacky, but I don't think that it's too fragile. It basically just says: "Ok, I'll wait 300ms for the process to exit (e.g. because a SIGTERM was received and the process is busy shutting down). In that case we don't want to report anything because the error was kind of expected. After that delay we emit the error and pass responsibility back to the user, because we ahve to assume that it was not expected".

To answer your questions:

What if my process takes longer than 300ms to exit?

Than you have to handle the error event in order to prevent the process to crash. It's just an assumption that most applications are able to exit within 300ms.

What if my process caught the SIGINT and decided not to exit at all?

Again, than you have to handle the error event.

Errors should only be thrown synchronously when calls are made or by rejecting promises. There should never be an uncaught error. We should not be using EventEmitter and expecting the user to listen for 'out-of-band' errors.

I don't think that this is true. There are error conditions without explicit calls, like receiving SIGTERM from "the outside". Using an error event for these situations is kind of nodejs standard.

If an error occurs when no-one is listening, it should be stored and thrown to the first person who cares about it

I don't think that this is a good behavior. The error would be completely out of context. Imho it's best practice to report errors as soon as possible – and within the correct context.

dispose() should completely ignore any errors that may occur if the process has already been terminated or is in the process of doing so

Agreed. The current solution does that already.

@SystemParadox
Copy link
Contributor Author

There are error conditions without explicit calls, like receiving SIGTERM from "the outside". Using an error event for these situations is kind of nodejs standard.

A SIGTERM/SIGINT/SIGWHATEVER is an event, not an error. If a process is told to exit by the sysadmin, operating system, etc, I don't consider that to be an error. We could be shutting down. It's only an error if you then try to use this process.

The error would be completely out of context. Imho it's best practice to report errors as soon as possible – and within the correct context.

This is my issue with the timeout approach. It feels like the error is out of context, since it's not being thrown immediately (I realise my "storing the error and throwing it to the next caller" approach has the same issue, but at least it was given to someone who cares about it, and likely within the context of a request which will almost certainly have its own error handler rather than crashing the whole process).

If you are going to throw errors EventEmitter-style, this will need to be very clearly documented as something the user has to think about. To me it seems that throwing a timeout into the mix here just makes things more complicated. The purpose of the timeout is to decrease the number of things the user has to think about, so that the user doesn't need to care about random errors occuring for the simple case of SIGINT/SIGTERM to process groups. But we haven't helped the situation for anyone. All we've done is made the error unreliable - we've introduced a race condition. Everyone has to read the documentation and handle the possible error because it could happen in production even if it works fine in development/testing. If the machine is slow or busy, you might get an unexpected crash.

What we've done is taken a possible uncaught error that everyone has to deal with and made it more complicated. Everyone still has to handle it because it still might happen. Except now it's 300ms late and unreliable.

I agree that it would be nice to handle this process group issue for SIGINT/SIGTERM so that most users don't have to think about it, but I don't think there is any way to distinguish this case. This leaves only two options:

  1. Only throw errors which affect a call (either in progress or one made against an errored process)
  2. Clearly document that random uncaught errors may occur when a phantom process is killed, particularly with reference to the process group case when pressing ctrl-c. Everyone has to handle them, and all of us will probably just ignore them or print them as warnings because we can't tell the difference either.

@jhnns
Copy link
Member

jhnns commented Sep 29, 2015

I agree to all of your arguments. Adding a timeout only makes it easier for the group of users who are lucky to have a process which exits within 300ms 😉.

What about a mixture of both options? We could emit an unexpectedExit-event which is emitted as soon as PhantomJS is unusable. This way we'll give the user the chance to handle this (rather unlikely) situation immediately. However, sending SIGINT/SIGTERM to the process group won't throw an error anymore.

Of course, this needs to be documented in the README.

@SystemParadox
Copy link
Contributor Author

An opt-in error handler sounds like a good idea to me. So by default it won't crash the process. Any pending promises will be rejected, and any new calls will throw the "already exited" error. If the process is exiting when the user didn't expect for some reason, they can investigate by listening to the unexpectedExit event.

Am I right in assuming that with this approach you would be removing the timeout and that error events would never be emitted?

When rejecting promises or throwing the "already exited" error, it would be nice to either throw the error that caused it, or possibly better, attach the original error as a property on the error that gets thrown.

Thanks very much :)

@jhnns
Copy link
Member

jhnns commented Sep 30, 2015

Am I right in assuming that with this approach you would be removing the timeout and that error events would never be emitted?

Error events on childProcess and its streams will still be emitted, but they won't crash the process anymore because phridge attaches a listener for it.

When rejecting promises or throwing the "already exited" error, it would be nice to either throw the error that caused it, or possibly better, attach the original error as a property on the error that gets thrown.

Yes, I'll do that, but I'm afraid they won't be very helpful either, like Error: write EPIPE 😉

jhnns added a commit that referenced this pull request Oct 5, 2015
- Remove setTimeout for error handling
- Change event name "error" to "unexpectedExit"

See discussion at #35
@jhnns
Copy link
Member

jhnns commented Oct 5, 2015

Just published this with v1.2.0

@jhnns jhnns closed this Oct 5, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

phridge.dispose causes EPIPE error after SIGINT
4 participants