Panic / crash when validator fails to query sentry for address #2825

Closed
Karmastic opened this issue Apr 6, 2020 · 1 comment · Fixed by #2822
Labels
c:bug Category: bug

@Karmastic

SUMMARY

In testing, I had a validator configured with two sentries; the validator was running but the sentries were not. The validator kept crashing (its k8s pod kept restarting), and the logs show a panic that appears to be unintentional.

ISSUE TYPE
  • Bug Report
COMPONENT NAME

worker/registration

OASIS NODE VERSION
Software version: 20.4.1
Runtime protocol version: 0.11.0
Consensus protocol version: 0.23.0
Committee protocol version: 0.8.0
Tendermint core version: 0.32.8
ABCI library version: 0.16.1
Go toolchain version: 1.13.9
OS / ENVIRONMENT

The binary was built in golang:1.13-alpine (no Rust binaries) and run in alpine:latest. Dockerfile available upon request.

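STEPS TO REPRODUCE
  • Configure a validator with two sentry nodes.
  • Start the validator while neither sentry is running.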
ACTUAL RESULTS

The validator starts but crashes soon after.

level=warn module=grpc caller=clientconn.go:1283 ts=2020-04-06T16:21:15.239590709Z msg="grpc: addrConn.createTransport failed to connect to {100.71.164.162:9009 0  <nil>}. Err :connection error: desc = \"transport
level=info module=grpc caller=pickfirst.go:78 ts=2020-04-06T16:21:15.239632595Z msg="pickfirstBalancer: HandleSubConnStateChange: 0xc03a651780, TRANSIENT_FAILURE"
level=warn module=worker/registration caller=worker.go:585 ts=2020-04-06T16:21:15.239712554Z msg="failed to obtain addressesfrom sentry node" err="rpc error: code = Unavailable desc = all SubConns are in Transient
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x11e793e]
goroutine 141 [running]:
github.com/oasislabs/oasis-core/go/worker/registration.(*Worker).querySentries(0xc00d9886e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    github.com/oasislabs/oasis-core/go@/worker/registration/worker.go:591 +0x7fe
github.com/oasislabs/oasis-core/go/worker/registration.(*Worker).registerNode(0xc00d9886e0, 0x0, 0xc03aab2800, 0x2, 0xc00e406800)
    github.com/oasislabs/oasis-core/go@/worker/registration/worker.go:504 +0xba7
github.com/oasislabs/oasis-core/go/worker/registration.(*Worker).registrationLoop.func1.1(0xc03aab2820, 0xc03aab2820)
    github.com/oasislabs/oasis-core/go@/worker/registration/worker.go:194 +0x137
github.com/cenkalti/backoff/v4.RetryNotifyWithTimer(0xc00d64fbe8, 0x1ac7a80, 0xc03aab2820, 0x0, 0x1ad1100, 0xc03a8aa370, 0x0, 0x0)
    github.com/cenkalti/backoff/v4@v4.0.0/retry.go:52 +0xf6
github.com/cenkalti/backoff/v4.RetryNotify(...)
    github.com/cenkalti/backoff/v4@v4.0.0/retry.go:31
github.com/cenkalti/backoff/v4.Retry(...)
    github.com/cenkalti/backoff/v4@v4.0.0/retry.go:25
github.com/oasislabs/oasis-core/go/worker/registration.(*Worker).registrationLoop.func1(0x0, 0xc03aab2800, 0x1, 0x1, 0x1)
    github.com/oasislabs/oasis-core/go@/worker/registration/worker.go:181 +0x1a6
github.com/oasislabs/oasis-core/go/worker/registration.(*Worker).registrationLoop(0xc00d9886e0)
    github.com/oasislabs/oasis-core/go@/worker/registration/worker.go:252 +0x2b4
github.com/oasislabs/oasis-core/go/worker/registration.(*Worker).doNodeRegistration(0xc00d9886e0)
    github.com/oasislabs/oasis-core/go@/worker/registration/worker.go:279 +0x64e
created by github.com/oasislabs/oasis-core/go/worker/registration.(*Worker).Start
    github.com/oasislabs/oasis-core/go@/worker/registration/worker.go:808 +0x163
EXPECTED RESULTS

I would expect the validator to either recover gracefully or panic intentionally.

kostko added the c:bug (Category: bug) label on Apr 6, 2020.
@kostko
Member

kostko commented Apr 6, 2020

Thanks for your report. It looks like there's a missing continue statement here:
https://github.com/oasislabs/oasis-core/blob/ec97371cafe86695d6af79a5ea2df63ac9d1353a/go/worker/registration/worker.go#L678-L685
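The linked code is not reproduced here. The trace is consistent with that diagnosis: a warning is logged at worker.go:585 ("failed to obtain addressesfrom sentry node") and the nil dereference follows at worker.go:591. A minimal sketch of the pattern follows: when a query fails, the response pointer is nil, so without a continue the loop falls through and dereferences it, which in Go panics with "invalid memory address or nil pointer dereference". The types sentryClient and SentryAddresses and the test doubles are hypothetical, not oasis-core APIs. Returning an error when no sentry answers is also an assumption, not necessarily what the actual fix does.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// SentryAddresses is a hypothetical stand-in for a sentry's response.
type SentryAddresses struct {
	Addresses []string
}

// sentryClient is a hypothetical interface for querying one sentry node.
type sentryClient interface {
	GetAddresses() (*SentryAddresses, error)
}

// querySentries collects addresses from every reachable sentry.
func querySentries(clients []sentryClient) ([]string, error) {
	var addrs []string
	for i, c := range clients {
		resp, err := c.GetAddresses()
		if err != nil {
			log.Printf("failed to obtain addresses from sentry node %d: %v", i, err)
			continue // without this, resp.Addresses below dereferences a nil resp
		}
		addrs = append(addrs, resp.Addresses...)
	}
	if len(addrs) == 0 {
		// Assumption: report total failure as an error so the caller can retry.
		return nil, errors.New("no sentry node returned any addresses")
	}
	return addrs, nil
}

// downSentry always fails, like an unreachable sentry.
type downSentry struct{}

func (downSentry) GetAddresses() (*SentryAddresses, error) {
	return nil, errors.New("rpc error: code = Unavailable")
}

// upSentry always answers with a single address.
type upSentry struct{ addr string }

func (s upSentry) GetAddresses() (*SentryAddresses, error) {
	return &SentryAddresses{Addresses: []string{s.addr}}, nil
}

func main() {
	// Both sentries down, as in the report: an error instead of a panic.
	_, err := querySentries([]sentryClient{downSentry{}, downSentry{}})
	fmt.Println(err) // no sentry node returned any addresses

	// One sentry down, one up: the reachable sentry's address is used.
	addrs, err := querySentries([]sentryClient{downSentry{}, upSentry{"192.0.2.1:26656"}})
	fmt.Println(addrs, err) // [192.0.2.1:26656] <nil>
}
```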
