
Consul 0.6.4 crash #2125

Closed

leonblueconic opened this issue Jun 20, 2016 · 4 comments
@leonblueconic


When filing a bug, please include the following:

consul version for both Client and Server

Client: 0.6.4
Server: 0.6.4

consul info for both Client and Server

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 22
    services = 24
build:
    prerelease =
    revision = 26a0ef8
    version = 0.6.4
consul:
    bootstrap = false
    known_datacenters = 1
    leader = false
    server = true
raft:
    applied_index = 999
    commit_index = 999
    fsm_pending = 0
    last_contact = 20.884238ms
    last_log_index = 999
    last_log_term = 3
    last_snapshot_index = 0
    last_snapshot_term = 0
    num_peers = 12
    state = Follower
    term = 3
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 86
    max_procs = 2
    os = linux
    version = go1.6
serf_lan:
    encrypted = true
    event_queue = 0
    event_time = 3
    failed = 0
    intent_queue = 0
    left = 0
    member_time = 16
    members = 13
    query_queue = 0
    query_time = 1
serf_wan:
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1

Operating system and Environment details

Amazon Linux, up to date with all updates

Description of the Issue (and unexpected/desired result)

Crash of the Consul agent

Reproduction steps

Not able to reproduce

Log Fragments or Link to gist

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x68 pc=0x4f00e0]

goroutine 20 [running]:
panic(0xcb9ec0, 0xc82000e0c0)
    /goroot/src/runtime/panic.go:464 +0x3e6
github.com/hashicorp/consul/command/agent.(*localState).syncCheck(0xc820001cb0, 0xc822dd35c0, 0x5c, 0x0, 0x0)
    /gopath/src/github.com/hashicorp/consul/command/agent/local.go:606 +0x90
github.com/hashicorp/consul/command/agent.(*localState).syncChanges(0xc820001cb0, 0x0, 0x0)
    /gopath/src/github.com/hashicorp/consul/command/agent/local.go:488 +0x446
github.com/hashicorp/consul/command/agent.(*localState).antiEntropy(0xc820001cb0, 0xc82000acc0)
    /gopath/src/github.com/hashicorp/consul/command/agent/local.go:339 +0x208
created by github.com/hashicorp/consul/command/agent.(*Agent).StartSync
    /gopath/src/github.com/hashicorp/consul/command/agent/agent.go:580 +0x4f

@rikibarel

We have the same issue: Consul crashes ~300 times a day on all cluster nodes (we have 3)!
The error message:

2016/06/30 03:44:22 [INFO] snapshot: Creating new snapshot at /opt/consul/data/raft/snapshots/16018-198934171-1467272662110.tmp
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x28 pc=0x9ee7f9]

Please advise.
Thanks,
Riki

@slackpad added the type/bug and type/crash labels on Jun 30, 2016
@slackpad self-assigned this on Jun 30, 2016
@rikibarel

Upgrading to 0.6.4 solved the crash...

@slackpad (Contributor) commented Sep 21, 2016

I think I see how we could get into this situation. A similar situation exists for services as well as checks.

If a check is being deleted, it gets removed from l.checks and its status gets updated with l.checkStatus[checkID] = syncStatus{remoteDelete: true}. But if UpdateCheck() then gets called, it can clobber the check status with a call like l.checkStatus[checkID] = syncStatus{inSync: false}. There are other paths that set the status in a similar way, which can cause the remoteDelete status to get dropped, resulting in hitting the middle clause of syncChanges(), which calls syncCheck() for a check that no longer exists in l.checks; that is the nil pointer dereference in the trace above.
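
To make the interleaving concrete, here is a minimal, self-contained Go sketch of the race. The types and method bodies are simplified stand-ins, not Consul's actual agent code; running it panics with a nil pointer dereference analogous to the trace in this issue.

package main

import "fmt"

type healthCheck struct {
	CheckID string
	Name    string
}

type syncStatus struct {
	inSync       bool
	remoteDelete bool
}

type localState struct {
	checks      map[string]*healthCheck
	checkStatus map[string]syncStatus
}

// RemoveCheck deletes the check locally and queues a remote delete.
func (l *localState) RemoveCheck(id string) {
	delete(l.checks, id)
	l.checkStatus[id] = syncStatus{remoteDelete: true}
}

// UpdateCheck overwrites the status unconditionally, silently dropping
// any pending remoteDelete flag -- the problematic path.
func (l *localState) UpdateCheck(id string) {
	l.checkStatus[id] = syncStatus{inSync: false}
}

func (l *localState) syncCheck(id string) {
	check := l.checks[id]                    // nil: the check was already removed
	fmt.Println("pushing check", check.Name) // nil pointer dereference
}

func (l *localState) syncChanges() {
	for id, status := range l.checkStatus {
		switch {
		case status.remoteDelete:
			fmt.Println("deleting check", id, "from servers")
		case !status.inSync:
			// Middle clause: remoteDelete was clobbered, so we try to
			// push a check that no longer exists locally.
			l.syncCheck(id)
		}
	}
}

func main() {
	l := &localState{
		checks:      map[string]*healthCheck{"web": {CheckID: "web", Name: "web"}},
		checkStatus: map[string]syncStatus{"web": {inSync: true}},
	}
	l.RemoveCheck("web") // queued for remote delete
	l.UpdateCheck("web") // races in, clobbers remoteDelete
	l.syncChanges()      // takes the middle clause and panics in syncCheck
}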

The cleanest way to solve this is probably to remove the remoteDelete attribute completely and check whether the ID is still in the local state maps right inside syncChanges() to decide if it should be deleted (we can just mark it out of sync when it needs to be deleted), as sketched below.
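
Under the same toy model as above, that direction might look roughly like this. This is an illustrative sketch of the proposed approach, not the actual patch: syncStatus drops remoteDelete, every writer just marks the check out of sync, and syncChanges() infers a delete from the check's absence in l.checks, so the same interleaving no longer panics.

package main

import "fmt"

type healthCheck struct {
	CheckID string
	Name    string
}

// syncStatus no longer carries a remoteDelete flag at all.
type syncStatus struct {
	inSync bool
}

type localState struct {
	checks      map[string]*healthCheck
	checkStatus map[string]syncStatus
}

func (l *localState) RemoveCheck(id string) {
	delete(l.checks, id)
	// Just mark it out of sync; syncChanges decides what that means.
	l.checkStatus[id] = syncStatus{inSync: false}
}

func (l *localState) UpdateCheck(id string) {
	// Clobbering is now harmless: both writers store the same thing.
	l.checkStatus[id] = syncStatus{inSync: false}
}

func (l *localState) syncChanges() {
	for id, status := range l.checkStatus {
		if status.inSync {
			continue
		}
		if _, ok := l.checks[id]; !ok {
			// Gone from local state, so it must be a remote delete.
			fmt.Println("deleting check", id, "from servers")
			delete(l.checkStatus, id) // deleting during range is safe in Go
			continue
		}
		fmt.Println("pushing check", id, "to servers")
		l.checkStatus[id] = syncStatus{inSync: true}
	}
}

func main() {
	l := &localState{
		checks:      map[string]*healthCheck{"web": {CheckID: "web", Name: "web"}},
		checkStatus: map[string]syncStatus{"web": {inSync: true}},
	}
	l.RemoveCheck("web")
	l.UpdateCheck("web") // same interleaving as before, no panic now
	l.syncChanges()      // prints: deleting check web from servers
}

Because every writer now stores the same value, a late UpdateCheck() can no longer erase the deletion intent; the delete decision is derived from l.checks at sync time instead of being carried in the status.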

@slackpad (Contributor)

Fix for this is in review.
