
Recover from panic in http and grpc handlers. #2059

Merged: 3 commits into grafana:master on May 11, 2020

Conversation

@cyriltovena (Contributor)

I don't see any good reason to crash any component during a bad request.

The stack trace is still printed, multi-line, to stderr. For example:

panic: foo
goroutine 50 [running]:
github.com/grafana/loki/pkg/util/server.onPanic(0x16c7720, 0x18af5e0, 0x4fd01d5aaf8, 0x3e7)
	/Users/ctovena/go/src/github.com/grafana/loki/pkg/util/server/recovery.go:41 +0x85
github.com/grpc-ecosystem/go-grpc-middleware/recovery.WithRecoveryHandler.func1.1(0x18d2140, 0xc000126008, 0x16c7720, 0x18af5e0, 0x1d81ca0, 0x1663693)
	/Users/ctovena/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.1.0/recovery/options.go:33 +0x39
github.com/grpc-ecosystem/go-grpc-middleware/recovery.recoverFrom(0x18d2140, 0xc000126008, 0x16c7720, 0x18af5e0, 0xc0002166c0, 0x1d5ed32, 0xd0)
	/Users/ctovena/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.1.0/recovery/interceptors.go:52 +0x5a
github.com/grpc-ecosystem/go-grpc-middleware/recovery.UnaryServerInterceptor.func1.1(0x18d2140, 0xc000126008, 0xc000204088, 0xc00026af28)
	/Users/ctovena/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.1.0/recovery/interceptors.go:26 +0x77
panic(0x16c7720, 0x18af5e0)
	/usr/local/Cellar/go/1.14.2_1/libexec/src/runtime/panic.go:969 +0x166
github.com/grafana/loki/pkg/util/server.Test_onPanic.func3(0x18d2140, 0xc000126008, 0x0, 0x0, 0xc00026aed8, 0x16642ab, 0x2575318, 0xc000263200)
	/Users/ctovena/go/src/github.com/grafana/loki/pkg/util/server/recovery_test.go:32 +0x39
github.com/grpc-ecosystem/go-grpc-middleware/recovery.UnaryServerInterceptor.func1(0x18d2140, 0xc000126008, 0x0, 0x0, 0x0, 0x18075d8, 0x0, 0x0, 0x0, 0x0)
	/Users/ctovena/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.1.0/recovery/interceptors.go:30 +0xbc
github.com/grafana/loki/pkg/util/server.Test_onPanic(0xc000263200)
	/Users/ctovena/go/src/github.com/grafana/loki/pkg/util/server/recovery_test.go:31 +0x2c9
testing.tRunner(0xc000263200, 0x18075e0)
	/usr/local/Cellar/go/1.14.2_1/libexec/src/testing/testing.go:991 +0xdc
created by testing.(*T).Run
	/usr/local/Cellar/go/1.14.2_1/libexec/src/testing/testing.go:1042 +0x357

Signed-off-by: Cyril Tovena cyril.tovena@gmail.com

@codecov-io

Codecov Report

Merging #2059 into master will increase coverage by 0.17%.
The diff coverage is 42.30%.


@@            Coverage Diff             @@
##           master    #2059      +/-   ##
==========================================
+ Coverage   64.02%   64.19%   +0.17%     
==========================================
  Files         133      136       +3     
  Lines       10222    10240      +18     
==========================================
+ Hits         6545     6574      +29     
+ Misses       3187     3175      -12     
- Partials      490      491       +1     
Impacted Files                  Coverage Δ
pkg/loki/fake_auth.go             0.00% <0.00%> (ø)
pkg/loki/loki.go                  0.00% <0.00%> (ø)
pkg/loki/modules.go              15.31% <0.00%> (-0.48%) ⬇️
pkg/querier/http.go               0.00% <0.00%> (-10.15%) ⬇️
pkg/util/server/error.go        100.00% <100.00%> (ø)
pkg/util/server/middleware.go   100.00% <100.00%> (ø)
pkg/util/server/recovery.go     100.00% <100.00%> (ø)
... and 2 more

@owen-d (Member) left a comment
This looks good, but I'm not sure about putting panic recovery on the write path. I believe we wanted to recover on the read path, and specifically on ingester reads, which is the one area where our reads and writes coincide and thus a place where read panics could cause data loss.

@@ -140,37 +141,33 @@ func New(cfg Config) (*Loki, error) {
}

func (t *Loki) setupAuthMiddleware() {
t.cfg.Server.GRPCMiddleware = []grpc.UnaryServerInterceptor{serverutil.RecoveryGRPCUnaryInterceptor}
I'm not sure if we want recovery middleware everywhere -- what about just on the read path?

@cyriltovena (Contributor, Author)

This looks good, but I'm not sure about putting panic recovery on the write path. I believe we wanted to recover on the read path, and specifically on ingester reads, which is the one area where our reads and writes coincide and thus a place where read panics could cause data loss.

This was intentional; I don't see why an API call, read or write, should cause a crash and data loss. Even crashing a distributor can cause an outage: imagine a situation where a specific stream takes down a distributor. In a very short time, it could bring the whole write path down, for all tenants.

Again, I don't see a situation where we would want to let a component crash on a request.

@slim-bean (Collaborator)

Again, I don't see a situation where we would want to let a component crash on a request.

I would second this: read or write, we never want a panic. Write is actually the much worse scenario, because it would guarantee a round-robin crash of every component, since writes are constant and always happening.

@owen-d (Member) left a comment
lgtm

@cyriltovena cyriltovena merged commit bce4470 into grafana:master May 11, 2020