This repository has been archived by the owner on Sep 7, 2023. It is now read-only.

switch the default from -dev mode to agent mode or no default at all #117

Open
danielmotaleite opened this issue Mar 8, 2019 · 0 comments

Comments

@danielmotaleite

I was deploying Consul in Docker as an agent in our test Rancher environment and forgot to add the agent command line, but loaded it with the agent CONSUL_LOCAL_CONFIG below:

{
	"autopilot": {
		"cleanup_dead_servers": true
	},
	"acl": {
		"enabled": true,
		"default_policy": "allow",
		"down_policy": "allow",
		"tokens": {
			"default": "anonymous",
			"master": "{consul-key}"
		}
	},
	"primary_datacenter": "FRA",
	"datacenter": "FRA",
	"domain": "internal",
	"encrypt": "{encrypt-key}",
	"log_level": "INFO",
	"retry_join": ["consul-a01", "consul-c01", "consul-b01"],
	"protocol": 3,
	"dns_config": {
		"enable_truncate": true
	}
}

The result was that Consul started, bound to localhost, and ran in server mode. The localhost binding made it unable to join the existing cluster correctly. I also suspect that server mode without bootstrap_expect made it elect itself leader, throwing the existing Consul cluster into chaos, as we will see later on.

I quickly noticed the localhost binding and added CONSUL_BIND_INTERFACE=ens3, which fixed the localhost issue, but the container was still in server mode. At this point I noticed that my test Vault was sealed, and trying to unseal it reported that Vault was not initialized. I also finally noticed that the test Consul was reporting as a server, not as an agent, and had joined the existing cluster as a new server. I also noticed that my cluster master was on 1.4.0 while I was loading the Docker image for 1.4.3. I do not know whether the newer version also contributed to the problem.
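
One way to catch this earlier is to check the Type column of consul members, which shows server or client for each node. A minimal sketch, assuming the usual column layout (Node, Address, Status, Type, ...); the helper name member_type is made up for illustration:

```shell
# Hypothetical helper: reads `consul members` output on stdin and prints
# the Type column (server or client) for the node named in $1.
# Assumes the header is on line 1 and Type is the 4th column.
member_type() {
    awk -v node="$1" 'NR > 1 && $1 == node { print $4 }'
}
```

Usage would be something like `consul members | member_type "$(hostname)"`, run wherever the consul CLI can reach the local agent.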

I finally fixed the command line and versions, but was left with a broken Consul cluster, with errors all over the place and no elected leader. I had to shut down 2 of the 3 nodes in the master Consul cluster, build a peers.json, and restart in order to finally remove the bad Docker Consul server, elect a new leader, and restore the cluster.
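
For readers who have not built a peers.json before: with Raft protocol version 3, Consul's outage recovery procedure expects a JSON list with one entry per server that should remain, each with id, address and non_voter. A rough sketch that prints a single-entry file; the function name is hypothetical and both arguments are placeholders, not the values used in this incident:

```shell
# Hypothetical helper: print a one-entry peers.json in the Raft protocol 3
# layout. $1 = node ID of the surviving server, $2 = its address as ip:port
# (8300 is the default server RPC port). Both values are placeholders.
peers_json_entry() {
    printf '[\n  {\n    "id": "%s",\n    "address": "%s",\n    "non_voter": false\n  }\n]\n' "$1" "$2"
}
```

The output goes into raft/peers.json under the server's data directory while the server is stopped. The node ID can be read from the node-id file in that same data directory.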

The end result was that I lost my Consul cluster for several minutes and my KV and Vault data were lost; I had to restore that data from backup. (This makes me think that storing Vault data in Consul is probably not the best solution either, as it was so easy to lose it.)

Checking how everything happened, I finally noticed that the Docker Consul image defaults to agent -dev, which in the end enabled server mode in my cluster. IMHO this is a dangerous default: as you can see, the simple mistake of forgetting to add the command line almost killed an existing Consul cluster, and people without Consul backups would be in trouble. The fact that we need to use both the command line and environment variables makes this mistake easier to make, too.
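
For comparison, a minimal sketch of starting the container with the mode spelled out, so the image default never applies. The image tag consul:1.4.3, the ens3 interface and host networking are assumptions based on the setup described here, and run_consul_client and DRY_RUN are made-up names:

```shell
# Hypothetical wrapper: start a Consul client agent with an explicit command.
# Passing "agent" (without -dev or -server) asks for a plain client agent.
run_consul_client() {
    image="${CONSUL_IMAGE:-consul:1.4.3}"        # assumed image tag
    iface="${CONSUL_BIND_INTERFACE:-ens3}"       # assumed interface name
    config="${CONSUL_LOCAL_CONFIG:-}"
    [ -n "$config" ] || config='{}'
    # With DRY_RUN=1 the command is printed instead of executed.
    ${DRY_RUN:+echo} docker run -d --net=host \
        -e "CONSUL_BIND_INTERFACE=${iface}" \
        -e "CONSUL_LOCAL_CONFIG=${config}" \
        "$image" agent
}
```

Running `DRY_RUN=1 run_consul_client` first shows the exact docker command, which makes a missing agent argument easy to spot before anything joins the cluster.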

A sane default would be to start as an agent, without -dev, since that is the safest mode. Another alternative is to enforce a user-specified mode: if no command line is passed, simply fail, output the help, and exit. No behind-the-scenes magic.
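
The fail-fast alternative could look roughly like the guard below. This is only a sketch of the idea, not the image's actual entrypoint; the function name and the error message are invented for illustration:

```shell
# Hypothetical entrypoint guard: refuse to start when no command is passed,
# instead of silently falling back to a default mode.
require_explicit_command() {
    if [ "$#" -eq 0 ]; then
        echo "error: no command given; pass the mode explicitly, e.g. 'agent' or 'agent -server'" >&2
        return 64   # conventional "usage error" exit status
    fi
    return 0
}
```

An entrypoint script could call `require_explicit_command "$@" || exit $?` as its first step and print the help text before exiting.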

Looking at past issues, this one is another example of why the default is bad. If something is wrong, do not start in -dev mode: hashicorp/consul#3255
