Nomad 1.0.1 client segfaults on startup with Consul 1.8.3 #9738

Closed
cb22 opened this issue Jan 6, 2021 · 2 comments · Fixed by #9751
Assignees
shoenig

Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) · theme/consul/connect (Consul Connect integration) · type/bug

Comments

cb22 commented Jan 6, 2021

Nomad version

Server: Nomad v1.0.1 (c9c68aa)
Client: Nomad v1.0.1 (c9c68aa) (previously 0.12.9) + Consul 1.8.3

Operating system and Environment details

Debian 10, official Nomad packages

Issue

After upgrading the server from 0.12 to 1.0 (which went smoothly), I tried to upgrade a client node that was running Consul 1.8.3 and had Consul Connect-enabled jobs scheduled on it.

After running apt install nomad && systemctl restart nomad, the client would panic, presumably while trying to look up which versions of Envoy are supported and failing.

Downgrading the client back to 0.12 worked fine, as did manually specifying the Envoy version to use (I suspect upgrading Consul would have done the trick too):

  meta = {
    connect.sidecar_image = "envoyproxy/envoy:v1.11.2"
  }
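
For reference, a rough sketch of how that override sits as node metadata in the client agent configuration; the file name and the enclosing client block here are assumptions, and the dotted key is quoted:

  # Hypothetical client config fragment, e.g. /etc/nomad.d/nomad.hcl
  client {
    enabled = true

    # Pin the Connect sidecar image so the client does not have to
    # negotiate a supported Envoy version with Consul.
    meta {
      "connect.sidecar_image" = "envoyproxy/envoy:v1.11.2"
    }
  }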

The upgrade guide mentions that Nomad should handle this automatically:

If the version of the Consul agent is older than v1.7.8, v1.8.4, or v1.9.0, Nomad will fallback to the v1.11.2 version of Envoy. As before, if the meta.connect.sidecar_image, meta.connect.gateway_image, or sidecar_task stanza are set, those settings take precedence.
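
For illustration, the per-job sidecar_task override mentioned in that passage might look roughly like this inside a group's service block (a sketch; the service name and port label are placeholders):

  service {
    name = "count-api"   # placeholder service name
    port = "9001"        # placeholder port label

    connect {
      sidecar_service {}

      # Pin the Envoy image explicitly instead of relying on
      # Nomad's automatic Envoy version detection via Consul.
      sidecar_task {
        config {
          image = "envoyproxy/envoy:v1.11.2"
        }
      }
    }
  }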

Nomad Client logs (if appropriate)

Jan 06 12:43:03 hostname systemd[1]: Started Nomad.
Jan 06 12:43:03 hostname nomad[3900]: ==> Loaded configuration from /etc/nomad.d/aws-tags.json, /etc/nomad.d/aws.hcl, /etc/nomad.d/nomad.hcl, /etc/nomad.d/spot.hcl
Jan 06 12:43:03 hostname nomad[3900]: ==> Starting Nomad agent...
Jan 06 12:43:03 hostname nomad[3900]: ==> Nomad agent configuration:
Jan 06 12:43:03 hostname nomad[3900]:        Advertise Addrs: HTTP: 10.0.2.191:4646
Jan 06 12:43:03 hostname nomad[3900]:             Bind Addrs: HTTP: 0.0.0.0:4646
Jan 06 12:43:03 hostname nomad[3900]:                 Client: true
Jan 06 12:43:03 hostname nomad[3900]:              Log Level: INFO
Jan 06 12:43:03 hostname nomad[3900]:                 Region: global (DC: eu-west-2c)
Jan 06 12:43:03 hostname nomad[3900]:                 Server: false
Jan 06 12:43:03 hostname nomad[3900]:                Version: 1.0.1
Jan 06 12:43:03 hostname nomad[3900]: ==> Nomad agent started! Log data will stream in below:
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.352Z [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad/plugins
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.355Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.355Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.355Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.355Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.355Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.355Z [INFO]  agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.362Z [INFO]  client: using state directory: state_dir=/opt/nomad/client
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.362Z [INFO]  client: using alloc directory: alloc_dir=/opt/nomad/alloc
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.389Z [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.391Z [INFO]  client.fingerprint_mgr.consul: consul agent is available
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.418Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=ens5
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.419Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.422Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=ens5
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.426Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=docker0
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.430Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=nomad
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.444Z [INFO]  client.plugin: starting plugin manager: plugin-type=csi
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.444Z [INFO]  client.plugin: starting plugin manager: plugin-type=driver
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.445Z [INFO]  client.plugin: starting plugin manager: plugin-type=device
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.465Z [INFO]  client: node registration complete
Jan 06 12:43:03 hostname nomad[3900]:     2021-01-06T12:43:03.469Z [INFO]  client: started client: node_id=acf6906c-63b4-92b0-e74c-9bd42550d976
Jan 06 12:43:03 hostname nomad[3900]:     2021/01/06 12:43:03.473156 [INFO] (runner) creating new runner (dry: false, once: false)
Jan 06 12:43:03 hostname nomad[3900]:     2021/01/06 12:43:03.473481 [INFO] (runner) creating watcher
Jan 06 12:43:03 hostname nomad[3900]:     2021/01/06 12:43:03.473561 [INFO] (runner) starting
Jan 06 12:43:03 hostname nomad[3900]:     2021/01/06 12:43:03.476325 [INFO] (runner) creating new runner (dry: false, once: false)
Jan 06 12:43:03 hostname nomad[3900]:     2021/01/06 12:43:03.476465 [INFO] (runner) creating watcher
Jan 06 12:43:03 hostname nomad[3900]:     2021/01/06 12:43:03.476561 [INFO] (runner) starting
Jan 06 12:43:03 hostname nomad[3900]: panic: runtime error: invalid memory address or nil pointer dereference
Jan 06 12:43:03 hostname nomad[3900]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x191865d]
Jan 06 12:43:03 hostname nomad[3900]: goroutine 362 [running]:
Jan 06 12:43:03 hostname nomad[3900]: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*envoyVersionHook).Prestart(0xc000b7a030, 0x3a08e80, 0xc000cf6980, 0xc000cf6940, 0xc000e4c140, 0xc000cf6980, 0xc000f9f450 
Jan 06 12:43:03 hostname nomad[3900]:         github.com/hashicorp/nomad/client/allocrunner/taskrunner/envoy_version_hook.go:82 +0x7d
Jan 06 12:43:03 hostname nomad[3900]: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).prestart(0xc0002e1080, 0x0, 0x0)
Jan 06 12:43:03 hostname nomad[3900]:         github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner_hooks.go:231 +0x5bf
Jan 06 12:43:03 hostname nomad[3900]: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc0002e1080)
Jan 06 12:43:03 hostname nomad[3900]:         github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:517 +0x4a5
Jan 06 12:43:03 hostname nomad[3900]: created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
Jan 06 12:43:03 hostname nomad[3900]:         github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:358 +0xa5
Jan 06 12:43:03 hostname systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 06 12:43:03 hostname systemd[1]: nomad.service: Failed with result 'exit-code'.
Jan 06 12:43:05 hostname systemd[1]: nomad.service: Service RestartSec=2s expired, scheduling restart.
Jan 06 12:43:05 hostname systemd[1]: nomad.service: Scheduled restart job, restart counter is at 5.
Jan 06 12:43:05 hostname systemd[1]: Stopped Nomad.
shoenig added the stage/needs-investigation, theme/consul/connect, stage/accepted, and type/bug labels and removed the stage/needs-investigation label on Jan 6, 2021
shoenig commented Jan 7, 2021

Thanks for reporting @cb22. I was able to reproduce this as you described by launching a Nomad v0.12.9 client, creating a Connect job, and restarting Nomad on v1.0.1 (avoiding -dev mode). The version of Consul doesn't matter here.

FWIW, in the upgrade guide we do recommend doing node drains and upgrading to Nomad v1.0.0+ and Consul 1.9.0+ concurrently to ensure a smooth transition. However, we certainly shouldn't panic on the in-place upgrade path either.

shoenig self-assigned this Jan 7, 2021
shoenig added a commit that referenced this issue Jan 7, 2021
When upgrading from Nomad v0.12.x to v1.0.x, Nomad client will panic on
startup if the node is running Connect enabled jobs. This is caused by
a missing piece of plumbing of the Consul Proxies API interface during the
client restore process.

Fixes #9738
shoenig added a commit that referenced this issue Jan 7, 2021
backspace pushed a commit that referenced this issue Jan 22, 2021
github-actions bot commented
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Oct 26, 2022