Azure: consul cluster start fails during auto discovery #3193

Closed
brande opened this issue Jun 26, 2017 · 15 comments

@brande

brande commented Jun 26, 2017

consul version for both Client and Server

Client: v0.8.4
Server: v0.8.4

Operating system and Environment details

ubuntu 16.04

Description of the Issue (and unexpected/desired result)

consul fails during start...
this is how I start it:

./consul agent -server -bind 0.0.0.0 -client 0.0.0.0 -bootstrap-expect 3 -raft-protocol 3 -ui -retry-join-azure-tag-name purpose -retry-join-azure-tag-value consulcluster -config-file /opt/consul/consul.json

this is my config:

{
  "log_level": "TRACE",
  "data_dir": "/opt/consul/data",
  "server": true,
  "retry_join_azure": {
    "tenant_id": "xxx",
    "subscription_id": "xxx",
    "client_id": "xxx",
    "secret_access_key": "xxx"
  }
}

and this is what happens:

==> WARNING: Expect Mode enabled, expecting 3 servers
==> Starting Consul agent...
==> Consul agent running!
           Version: 'v0.8.4'
           Node ID: '5babf506-db72-24ca-5600-d572f3b30c46'
         Node name: 'vm2'
        Datacenter: 'dc1'
            Server: true (bootstrap: false)
       Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 8600)
      Cluster Addr: 10.0.0.5 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2017/06/26 13:33:18 [INFO] raft: Initial configuration (index=0): []
    2017/06/26 13:33:18 [INFO] raft: Node at 10.0.0.5:8300 [Follower] entering Follower state (Leader: "")
    2017/06/26 13:33:18 [INFO] serf: EventMemberJoin: vm2 10.0.0.5
    2017/06/26 13:33:18 [INFO] serf: EventMemberJoin: vm2.dc1 10.0.0.5
    2017/06/26 13:33:18 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
    2017/06/26 13:33:18 [WARN] serf: Failed to re-join any previously known node
    2017/06/26 13:33:18 [INFO] consul: Adding LAN server vm2 (Addr: tcp/10.0.0.5:8300) (DC: dc1)
    2017/06/26 13:33:18 [WARN] serf: Failed to re-join any previously known node
    2017/06/26 13:33:18 [INFO] consul: Handled member-join event for server "vm2.dc1" in area "wan"
    2017/06/26 13:33:18 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
    2017/06/26 13:33:18 [INFO] agent: Started HTTP server on [::]:8500
    2017/06/26 13:33:18 [INFO] agent: Joining cluster...
    2017/06/26 13:33:18 Sending GET https://management.azure.com/subscriptions/b6f483f3-1f96-4d11-9087-18afcd9fd22a/providers/Microsoft.Network/networkInterfaces?api-version=2016-09-01
    2017/06/26 13:33:20 GET https://management.azure.com/subscriptions/b6f483f3-1f96-4d11-9087-18afcd9fd22a/providers/Microsoft.Network/networkInterfaces?api-version=2016-09-01 received 200 OK
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xe822df]

goroutine 71 [running]:
github.com/hashicorp/consul/command/agent.(*Config).discoverAzureHosts(0xc420220900, 0xc420224780, 0x20, 0x0, 0x0, 0x0, 0x2b4cc720004c0358)
	/gopath/src/github.com/hashicorp/consul/command/agent/config_azure.go:44 +0x42f
github.com/hashicorp/consul/command/agent.(*Agent).retryJoin(0xc4201c66c0)
	/gopath/src/github.com/hashicorp/consul/command/agent/retry_join.go:40 +0x748
created by github.com/hashicorp/consul/command/agent.(*Agent).Start
	/gopath/src/github.com/hashicorp/consul/command/agent/agent.go:315 +0x9b2

I'm getting the same error if I compile consul from source btw...

Reproduction steps

Just install Consul on Azure VMs, add the right credentials to the config file, and start it with the command line above.

Thanks,
Brande

@slackpad slackpad added the type/bug (Feature does not function as expected) and type/crash (The issue description contains a golang panic and stack trace) labels Jun 26, 2017
@magiconair
Contributor

I'll have a look.

@magiconair magiconair self-assigned this Jun 26, 2017
@magiconair
Contributor

magiconair commented Jun 26, 2017

@brande I was able to create the Azure VM and get some of the credentials but I could use some help. How do I get the client id and secret access key?

@brande
Author

brande commented Jun 27, 2017

Found out that the tag used by Consul has to be set on the network interface resource, not on the VM. However, it would be great if Consul didn't crash when the tag is unset. It should ignore resources without the tag instead of crashing as it does right now. Thoughts?

@brande
Author

brande commented Jun 27, 2017

@magiconair Thanks for looking into it. Here is a nice explanation of how to get all of these together: https://www.terraform.io/docs/providers/azurerm/

@slackpad slackpad added this to Frank in Consul 0.9.0 Jun 27, 2017
@slackpad slackpad moved this from Frank to Agent Refactoring in Consul 0.9.0 Jun 28, 2017
@magiconair
Contributor

@brande yes, consul shouldn't crash :) I'll have another look

@draggeta

draggeta commented Jul 6, 2017

I'm having the same issue as @brande. This happens on both Windows and Linux servers. I already had the tags set on the network interfaces, and when I tested the account in use, it was able to retrieve them correctly.

However, Consul keeps failing to start. Below you can find the config and the error:

consul agent -config-dir=C:\HashiCorp\Consul\consul.conf
{
    "datacenter": "datacenter",
    "log_level": "INFO",
    "server": true,
    "ui": true,
    "data_dir": "C:\\HashiCorp\\Consul\\consul.data",
    "bind_addr": "0.0.0.0",
    "client_addr": "0.0.0.0",
    "ports": {
        "https": 8501
    },
    "key_file": "C:\\HashiCorp\\Consul\\consul.cert\\server.key",
    "cert_file": "C:\\HashiCorp\\Consul\\consul.cert\\server.crt",
    "ca_file": "C:\\HashiCorp\\Consul\\consul.cert\\ca_bundle.crt",
    "protocol": 3,
    "retry_join_azure": {
        "tag_name": "service",
        "tag_value": "consulCluster",
        "subscription_id": "xxx",
        "tenant_id": "yyy",
        "client_id": "zzz",
        "secret_access_key": "password"
    }
}

The output error:

==> Starting Consul agent...
==> Consul agent running!
           Version: 'v0.8.5'
           Node ID: 'e520793c-9dca-8bd2-6d6a-f086921f248a'
         Node name: 'vm03'
        Datacenter: 'datacenter'
            Server: true (bootstrap: false)
       Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: 8501, DNS: 8600)
      Cluster Addr: 10.8.2.5 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2017/07/06 16:25:18 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:10.8.2.5:8300 Address:10.8.2.5:8300}]
    2017/07/06 16:25:18 [INFO] raft: Node at 10.8.2.5:8300 [Follower] entering Follower state (Leader: "")
    2017/07/06 16:25:18 [INFO] serf: EventMemberJoin: vm03 10.8.2.5
    2017/07/06 16:25:18 [WARN] serf: Failed to re-join any previously known node
    2017/07/06 16:25:18 [INFO] consul: Adding LAN server vm03 (Addr: tcp/10.8.2.5:8300) (DC: datacenter)
    2017/07/06 16:25:18 [INFO] serf: EventMemberJoin: vm03.datacenter 10.8.2.5
    2017/07/06 16:25:18 [WARN] serf: Failed to re-join any previously known node
    2017/07/06 16:25:18 [INFO] consul: Handled member-join event for server "vm03.datacenter" in area "wan"
    2017/07/06 16:25:18 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
    2017/07/06 16:25:18 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
    2017/07/06 16:25:18 [INFO] agent: Started HTTP server on [::]:8500
    2017/07/06 16:25:18 [INFO] agent: Started HTTPS server on [::]:8501
    2017/07/06 16:25:18 [INFO] agent: Joining cluster...
    2017/07/06 16:25:19 Sending GET https://management.azure.com/subscriptions/xxx/providers/Microsoft.Network/networkInterfaces?api-version=2016-09-01
    2017/07/06 16:25:19 GET https://management.azure.com/subscriptions/xxx/providers/Microsoft.Network/networkInterfaces?api-version=2016-09-01 received 200 OK
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0xeb1f56]

goroutine 38 [running]:
github.com/hashicorp/consul/agent.(*Config).discoverAzureHosts(0xc04224e000, 0xc0421d4d70, 0x20, 0x0, 0x0, 0x0, 0x327938d0004d5d0f)
        /gopath/src/github.com/hashicorp/consul/agent/config_azure.go:44 +0x436
github.com/hashicorp/consul/agent.(*Agent).retryJoin(0xc042212a00)
        /gopath/src/github.com/hashicorp/consul/agent/retry_join.go:40 +0x74f
created by github.com/hashicorp/consul/agent.(*Agent).Start
        /gopath/src/github.com/hashicorp/consul/agent/agent.go:328 +0x9fc

@sdluxeon

sdluxeon commented Jul 14, 2017

I'm having the same issue as @brande and @draggeta. The NIC tag is set.

@SayliS

SayliS commented Jul 14, 2017

Same issue

@magiconair
Contributor

I've moved the code to the new https://github.com/hashicorp/go-discover repo, which I'll merge back into Consul once it is tested. There is also a command-line client that you can use. I'm waiting for a colleague to recover from jet lag to help me, but if someone wants to verify that the command-line client does not crash on Azure, that would already help.

@sdluxeon

sdluxeon commented Jul 14, 2017

We found the issue here. It reproduces when you have network interfaces that have tags but don't have the tag configured in Consul. The hotfix on our end was to add the tag to all the NICs in Azure. Consider having different values for different Consul clusters across the same subscription. BTW, are we going to have breaking changes in 0.9.0?

@magiconair
Contributor

@sdluxeon thx for the info. I'll have a look at how to make that more robust.

Re breaking changes: Some smaller ones. Pls keep an eye on the Changelog.

@draggeta

@magiconair: I can confirm @sdluxeon's findings. With both the tool and Consul, the error occurs only when not all NICs have the tag name set. The value doesn't matter in this case.
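
For anyone who wants to see this failure mode in isolation, here is a minimal Go sketch. It assumes each interface's tags behave like a map whose values are string pointers. The `derefTag` helper and the sample tag names are made up for illustration and are not Consul's actual code. Looking up a key that is absent yields a nil pointer, and dereferencing it panics, which would explain why only the tag name matters and not the value.

```go
package main

import "fmt"

// derefTag is a hypothetical helper that mimics an unguarded lookup: it
// dereferences whatever the map returns for name without a nil check.
// The deferred recover turns the resulting panic into a boolean result.
func derefTag(tags map[string]*string, name string) (value string, panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	// A missing key yields a nil *string, so this dereference panics.
	return *tags[name], false
}

func main() {
	value := "consulcluster"
	withTag := map[string]*string{"purpose": &value}
	withoutTag := map[string]*string{"owner": &value} // has tags, but not "purpose"

	for _, tags := range []map[string]*string{withTag, withoutTag} {
		v, panicked := derefTag(tags, "purpose")
		fmt.Printf("value=%q panicked=%v\n", v, panicked)
	}
}
```

In the agent there is no recover around the lookup, so the same dereference would take the whole process down.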

@magiconair
Contributor

Got it. The Azure API data structures are somewhat unusual. You don't see *map[string]*string often used in Go. I've pushed fixes for both go-discover and consul.

@draggeta I'd appreciate it if you could test the go-discover code one more time.
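
As a rough illustration of what a nil-safe lookup over a `*map[string]*string` can look like, here is a short Go sketch. It is not the code from the go-discover or Consul commits. The `nic` type, `tagValue`, and `matchingIPs` are hypothetical names chosen for this example, and the struct only stands in for whatever the Azure SDK returns.

```go
package main

import "fmt"

// nic is a hypothetical stand-in for an Azure network interface record.
type nic struct {
	PrivateIP string
	Tags      *map[string]*string
}

// tagValue returns the value of the named tag. It reports false when the
// map pointer is nil, the key is absent, or the stored pointer is nil.
func tagValue(tags *map[string]*string, name string) (string, bool) {
	if tags == nil {
		return "", false
	}
	v, ok := (*tags)[name]
	if !ok || v == nil {
		return "", false
	}
	return *v, true
}

// matchingIPs collects the addresses of interfaces whose tag matches,
// silently skipping interfaces that lack the tag.
func matchingIPs(nics []nic, name, value string) []string {
	var ips []string
	for _, n := range nics {
		if v, ok := tagValue(n.Tags, name); ok && v == value {
			ips = append(ips, n.PrivateIP)
		}
	}
	return ips
}

func main() {
	want, other := "consulcluster", "ops"
	tagged := map[string]*string{"purpose": &want}
	unrelated := map[string]*string{"owner": &other}

	nics := []nic{
		{PrivateIP: "10.0.0.4", Tags: &tagged},
		{PrivateIP: "10.0.0.5", Tags: &unrelated}, // tags present, but not ours
		{PrivateIP: "10.0.0.6", Tags: nil},        // no tags at all
	}
	fmt.Println(matchingIPs(nics, "purpose", "consulcluster")) // [10.0.0.4]
}
```

Keeping the three nil checks in one helper means the filtering loop never touches a pointer directly.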

magiconair added a commit to hashicorp/go-discover that referenced this issue Jul 15, 2017
magiconair added a commit to hashicorp/go-discover that referenced this issue Jul 15, 2017
@draggeta

@magiconair That indeed fixed it. Thanks for the fix. Now to wait for the next release :)

@magiconair
Contributor

We're planning one next week. So the consul version of the fix should go in there.

slackpad pushed a commit that referenced this issue Jul 16, 2017
@slackpad slackpad moved this from Agent Refactoring / Config Split / Test Stability to Done in Consul 0.9.0 Jul 17, 2017