
Panic in RPC causes the whole server to crash #441

Closed
linuxerwang opened this issue Nov 10, 2015 · 6 comments


@linuxerwang commented Nov 10, 2015

I noticed that grpc-go behaves differently from the http server in handling panics. In an http server, a panic in a handler stops that goroutine but not the whole server, while a panic in a grpc-go RPC handler crashes the whole server. This is a problem because the panic might come from standard or third-party packages the programmer can't control.

Crashing the server because of a panic in one goroutine is not acceptable for a robust service.
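For reference, here is the behavior I mean as a runnable sketch (port and paths are arbitrary): a panic in one net/http handler is recovered and logged by the server, killing only that connection while the process keeps serving.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/boom", func(w http.ResponseWriter, r *http.Request) {
		panic("boom") // net/http recovers this; only this connection dies
	})
	http.HandleFunc("/ok", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "still serving")
	})
	// After GET /boom panics, GET /ok still succeeds: the server survives.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```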

@dsymonds (Contributor) commented Nov 10, 2015

On the other hand, quietly catching a panic might leave the server in an inconsistent or incorrect state, so while crashing sucks, at least you'll find out and can go fix the panic.

We've come to regret the net/http panic catching after running Go in production for a while. It sounded like a good idea at the time, but it has caused its own issues in practice.

@linuxerwang (Author) commented Nov 10, 2015

Thanks for your input; you have a good point. But consider a mission-critical service that crashes because one request hit a panic: all the other in-flight requests are lost, and innocent users are affected.

I think "quietly catching a panic" is certainly unacceptable. The panic should be well logged, and the rpc caller should get an error.

@dsymonds (Contributor) commented Nov 10, 2015

If it's mission critical, you'll want some sort of frontend (e.g. nginx) that can quietly retry the request if a backend crashes. There's no way to completely rule out crashes in bad code, and catching panics in only the RPC server handler goroutines isn't going to stop them all.

@linuxerwang (Author) commented Nov 10, 2015

Agreed. Thanks, dsymonds.

@iamqizhao (Contributor) commented Nov 11, 2015

As David mentioned, a mission-critical system should have:
i) a retry mechanism on the client side;
ii) cluster management to restart one or more new backends if the old ones die;
iii) name discovery to tell clients the updated set of backends.

This is much better than putting the system in an undefined state.

The thing we do need to guarantee is that gRPC itself must not crash due to errors from its peers.

@iamqizhao closed this Nov 11, 2015

@benhoyt commented Apr 4, 2018

I realize this issue is closed, but this behavior was jarring to me too. It seems to me that most servers (web and gRPC) shouldn't have state shared between requests, so "leaving the server in an inconsistent or incorrect state" doesn't apply for most web servers.

It seems much better to have zero downtime and keep the server running in case of an unexpected panic than to take the whole thing down and rely on server management stuff to hopefully start it up again quickly.

Either way, we'll probably add this to our wrapper package. It'd be nice if there were an option to enable this in grpc-go, though.
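For example, the wrapper could wire it in like this (a sketch; the recovery import path is hypothetical and refers to the interceptor sketched earlier in this thread):

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"

	// Hypothetical import path for the recovery interceptor sketched above.
	"example.com/yourapp/recovery"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	// One server option in the wrapper: a panicking handler now fails
	// only its own RPC instead of taking down the whole process.
	srv := grpc.NewServer(grpc.UnaryInterceptor(recovery.UnaryInterceptor))
	// Register services on srv here, then serve.
	log.Fatal(srv.Serve(lis))
}
```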

lock bot locked as resolved and limited conversation to collaborators Oct 1, 2018
