
[ERR] consul: failed to establish leadership: unknown CA provider "" #4954

Closed
BlinkyStitt opened this issue Nov 14, 2018 · 21 comments
Labels
needs-investigation The issue described is detailed and complex. theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies type/bug Feature does not function as expected

Comments

@BlinkyStitt commented Nov 14, 2018

Overview of the Issue

I'm trying to use consul connect and now my cluster is partially broken.

Reproduction Steps

I'm not sure how exactly to reproduce. I've been poking at this cluster too much to have clear steps. I'll try more tomorrow.

I put this in /etc/consul.d/config.hcl:

connect {
  enabled = true
}

When I upgraded consul from 1.2.3 to 1.3.0 and added that config, something went wrong and leader election failed.

I manually recovered by poking peers.json and now the cluster has a leader:

Node      ID                                    Address             State     Voter  RaftProtocol
consul    375e6536-a4d0-5770-3b7d-98dbe4a65686  192.168.0.100:8300  follower  true   3
consul-a  65000f53-2857-4daf-19b8-61e5bb4492c0  192.168.0.101:8300  follower  true   3
consul-b  ef31e65f-4536-14c0-4c8d-fdb1f6725922  192.168.0.102:8300  leader    true   3
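
(For reference, the peers.json used for this kind of manual recovery with Raft protocol version 3 is a JSON array of server IDs and addresses; a rough sketch using the IDs from the table above, not my exact file:)

[
  {
    "id": "375e6536-a4d0-5770-3b7d-98dbe4a65686",
    "address": "192.168.0.100:8300",
    "non_voter": false
  },
  {
    "id": "65000f53-2857-4daf-19b8-61e5bb4492c0",
    "address": "192.168.0.101:8300",
    "non_voter": false
  },
  {
    "id": "ef31e65f-4536-14c0-4c8d-fdb1f6725922",
    "address": "192.168.0.102:8300",
    "non_voter": false
  }
]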

However, something is still wrong with the cluster. The UI is showing stale data (#4923, maybe?) and vault is stuck in standby.

$ vault status
Key                    Value
---                    -----
Seal Type              shamir
Initialized            true
Sealed                 false
Total Shares           1
Threshold              1
Version                0.11.3
Cluster Name           vault-cluster-641f43ea
Cluster ID             e9570865-65db-1ea6-738b-c759d68fbfdd
HA Enabled             true
HA Cluster             n/a
HA Mode                standby
Active Node Address    <none>

$ vault secrets enable pki
Error enabling: Error making API request.

URL: POST http://127.0.0.1:8200/v1/sys/mounts/pki
Code: 500. Errors:

* local node not active but active cluster node not found

I would like to use Vault as the CA for Consul Connect, but was just trying to get it working with the simplest setup first. How do I migrate to Vault from this broken Connect provider?

From my reading of the docs, the CA setup was all supposed to be automatic, so I don't know how to do it manually. I think I need to run consul connect ca set-config -config-file ca.json, but I don't know what to put in ca.json (also, why no HCL support here?).
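
(Based on the documented /v1/connect/ca/configuration payload, I'd guess a minimal ca.json for the built-in provider looks something like this, with the default TTL and rotation values:)

{
  "Provider": "consul",
  "Config": {
    "LeafCertTTL": "72h",
    "RotationPeriod": "2160h"
  }
}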

Consul info for both Client and Server

Client info
agent:
	check_monitors = 0
	check_ttls = 1
	checks = 1
	services = 1
build:
	prerelease = 
	revision = e8757838
	version = 1.3.0
consul:
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 52
	max_procs = 8
	os = linux
	version = go1.11.1
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 168
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1708
	members = 5
	query_queue = 0
	query_time = 1
Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = 
	revision = e8757838
	version = 1.3.0
consul:
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 192.168.0.102:8300
	server = true
raft:
	applied_index = 6311623
	commit_index = 6311623
	fsm_pending = 0
	last_contact = 0
	last_log_index = 6311623
	last_log_term = 897
	last_snapshot_index = 6311248
	last_snapshot_term = 700
	latest_configuration = [{Suffrage:Voter ID:375e6536-a4d0-5770-3b7d-98dbe4a65686 Address:192.168.0.100:8300} {Suffrage:Voter ID:65000f53-2857-4daf-19b8-61e5bb4492c0 Address:192.168.0.101:8300} {Suffrage:Voter ID:ef31e65f-4536-14c0-4c8d-fdb1f6725922 Address:192.168.0.102:8300}]
	latest_configuration_index = 1
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 897
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 94
	max_procs = 8
	os = linux
	version = go1.11.1
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 168
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 2
	member_time = 1708
	members = 7
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 228
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

I'm running everything inside docker containers on a single host.

Log Fragments

consul_b_1  | bootstrap_expect > 0: expecting 3 servers
consul_b_1  |     2018/11/14 04:04:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:05:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:06:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:07:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:08:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:09:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:10:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:11:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
@kyhavlov (Contributor) commented Nov 14, 2018

Hey @wysenynja, thanks for the bug report. That config you gave for connect (enabled = true) should be enough to get things working in both 1.2.3 and 1.3.0, and the empty provider type should be an impossible state to get into since the provider defaults to "consul" in both of those versions.
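
For comparison, on a cluster where the CA bootstrapped correctly, consul connect ca get-config should return the default provider and config rather than empty values; roughly something like this (the index values here are just illustrative):

$ consul connect ca get-config
{
	"Provider": "consul",
	"Config": {
		"LeafCertTTL": "72h",
		"RotationPeriod": "2160h"
	},
	"CreateIndex": 5,
	"ModifyIndex": 5
}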

I tried a few things to reproduce the state you show where it's failing to establish leadership but couldn't get the same result:

  • Upgrading a lone server from 1.2.3 with Connect disabled to 1.3.0 with just the connect = enabled config.
  • A rolling upgrade of a set of 3 servers from 1.2.3 with Connect disabled to 1.3.0 with the above config.
  • Partially upgrading a set of 3 servers so that 2 were running 1.3.0 and the other was on 1.2.3, then getting one of the 1.3.0 servers to become leader and bootstrap the CA before restarting it to cause the 1.2.3 server to become leader again (in case some state was backwards incompatible here).

Any steps you have to help reproduce this are appreciated. It may take a fresh cluster to do so if you've got things back to a working state though, as it sounds like the invalid CA config that was preventing the leader election has been fixed (or wasn't an issue on another server).

@agy (Contributor) commented Nov 27, 2018

@kyhavlov I'm experiencing the same issue (without Vault). After upgrading the cluster to 1.3.0 (and then 1.3.1) and enabling connect I receive the same error:

2018/11/27 17:40:05 [ERR] consul: failed to establish leadership: unknown CA provider ""

After adding some debug statements to initializeCAConfig() I can see that the CA config returned from the FSM state is non-nil but empty, and so the empty config is returned.

2018/11/27 17:40:05 [DEBUG] consul: (agy) initializeCAConfig state.CAConfig: &structs.CAConfiguration{ClusterID:"", Provider:"", Config:map[string]interface {}(nil), RaftIndex:structs.RaftIndex{CreateIndex:0x0, ModifyIndex:0x0}}
2018/11/27 17:40:05 [DEBUG] consul: (agy) initializeCAConfig modIndex: 0x0

I attempted to set the ca_provider to consul as well, but this doesn't seem to make a difference.

$ consul connect ca get-config
{
	"Provider": "",
	"Config": null,
	"CreateIndex": 0,
	"ModifyIndex": 0
}
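
(To be clear, by "set the ca_provider" I mean the documented connect stanza in the server config; a minimal HCL sketch:)

connect {
  enabled     = true
  ca_provider = "consul"
}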

Dumping the leader's agent config shows that connect is enabled and, depending on whether I set the provider, it's either "" or "consul".

$ curl -s localhost:8500/v1/agent/self | jq .DebugConfig.ConnectEnabled
true
$ curl -s localhost:8500/v1/agent/self | jq .DebugConfig.ConnectCAProvider
""

I cannot reproduce this on test clusters with the same scenario. The problematic cluster is moderate in size with ~1500 nodes.

@agy (Contributor) commented Nov 27, 2018

I have rotated all of the consul servers in this cluster with new machines and the issue remains.

@agy (Contributor) commented Nov 27, 2018

Attempting to update the CA config also does not work:

$ cat connect_config.json
{
  "Provider": "consul",
  "Config": {
    "LeafCertTTL": "72h",
    "RotationPeriod": "2160h",
    "PrivateKey": "",
    "RootCert": ""
  }
}
$ curl -v -X PUT -d @connect_config.json http://127.0.0.1:8500/v1/connect/ca/configuration
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 8500 (#0)
> PUT /v1/connect/ca/configuration HTTP/1.1
> User-Agent: curl/7.35.0
> Host: 127.0.0.1:8500
> Accept: */*
> Content-Length: 135
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 135 out of 135 bytes
< HTTP/1.1 500 Internal Server Error
< Vary: Accept-Encoding
< Date: Tue, 27 Nov 2018 22:44:41 GMT
< Content-Length: 57
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host 127.0.0.1 left intact
rpc error making call: internal error: CA provider is nil

@pearkes pearkes added type/bug Feature does not function as expected theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies needs-investigation The issue described is detailed and complex. labels Nov 28, 2018
@agy (Contributor) commented Nov 29, 2018

While the server is in the failed-to-establish-leadership state, leadership is not given up and an election doesn't occur. Reads and writes seem to work, but reads that are forced to be consistent fail.

$ curl -s -X PUT -d "foo" http://localhost:8500/v1/kv/bar; echo
true
$ curl -s -X GET http://localhost:8500/v1/kv/bar | jq .
[
  {
    "LockIndex": 0,
    "Key": "bar",
    "Flags": 0,
    "Value": "Zm9v",
    "CreateIndex": 284603937,
    "ModifyIndex": 284603961
  }
]
$ curl -s -X GET -d "foo" http://localhost:8500/v1/kv/bar?consistent=; echo
rpc error making call: Not ready to serve consistent reads

Killing the leader when in this state does not trigger an election. The follower nodes do report that there is no leader (as expected).

I'm unsure what the expected behaviour should be when in this state.

@pearkes (Contributor) commented Nov 30, 2018

Regarding this comment: #4954 (comment)

And the error message:

2018/11/27 17:40:05 [ERR] consul: failed to establish leadership: unknown CA provider ""

It may be worth investigating if #5016 is showing a symptom of a similar bug. Adding a new server in this case and restoring from a snapshot manually in the case of #5016 could be hitting the same condition. That case includes a repro.

@agy (Contributor) commented Nov 30, 2018

The Docker image referenced in #5016 doesn't seem to include the snapshot (unless I'm missing something obvious).

I can get a test cluster in the same state as my broken one by importing the raft db. But I cannot reproduce any other way.

I should also note that at no point did I manually restore a snapshot.

@PurrBiscuit commented Nov 30, 2018

Re: the snapshot - that's correct; there's no snapshot built into the image currently - we had been mounting the snapshot into the container as a volume from a local directory and then running the restore from there.

The issue we are seeing there is also from a 1.2.1 to 1.2.4 upgrade; snapshot restores were working ok with 1.2.1 but are producing that error with 1.2.4. It's a little different than this case, which is why I wanted to open a separate issue for it (although they do sound related).

@pearkes (Contributor) commented Nov 30, 2018

Thanks @agy and @PurrBiscuit for the information in both cases...we're continuing to look into this.

@pearkes (Contributor) commented Dec 3, 2018

@agy Can you clarify which version you upgraded from to get to 1.3.0? Our current hunch is that you could be seeing a symptom of #4535, which was fixed in 1.2.3. If you were ever on a previous version utilizing Connect CA configuration that wrote to the state store, you'd be seeing this, as the state store would have been corrupted.

If this is the case, we're considering adding something like a -force option to connect ca set-config that would allow you to override the CA configuration without going through the rotation mechanism, which would fail with the invalid configuration (which it seems you have, based on what we've seen) in your state store.

Alternatively (the better option, we think), we could add automatic handling of this corrupted CA configuration, which would allow you to bypass the issue by treating it as a nil configuration.

If we added something like that could you potentially jump to 1.4.1-dev (master)?

@agy (Contributor) commented Dec 3, 2018

@pearkes 0.8.3 -> 1.3.0 -> 1.3.1.

Note: I had tested this upgrade path on a newly provisioned test cluster, and it is the same path I used for all the testing I did earlier.

Since I'm able to reproduce this issue on a test cluster by importing the raft store from the current broken cluster, I can test whatever fixes you propose. I agree that your alternative, "better" solution is preferable.

Upgrading to 1.4.x is problematic because I have not had the opportunity to test the new ACL system.

@agy (Contributor) commented Dec 3, 2018

Unfortunately, since I rotated all the members of the broken cluster I cannot verify if I had enabled connect when the cluster was 1.3.0 or only once it was 1.3.1.

@pearkes (Contributor) commented Dec 3, 2018

@agy Our concern and assumption was that it corrupted the state in 1.2.0 - 1.2.2. If you never ran those versions (regardless of where you're coming from now) that is relatively confusing but doesn't necessarily make the fix different.

@pearkes (Contributor) commented Dec 3, 2018

@agy can you also clarify from this comment:

I can get a test cluster in the same state as my broken one by importing the raft db

What operation did you do here? consul snapshot save/restore or did you copy the actual raft DB (something in the data directory, if so which file(s))?

@agy (Contributor) commented Dec 3, 2018

@pearkes I have done both.

The snapshot save/restore fails on restore with:

Error restoring snapshot: Unexpected response code: 500 (unknown CA provider "")
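
(This is just the standard snapshot commands; backup.snap is a placeholder filename, and it's the restore step that fails:)

$ consul snapshot save backup.snap
$ consul snapshot restore backup.snap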

I have also copied over a tarball of the Consul data_dir from the broken cluster to the test cluster. I would never normally do this, but it was the only way I was able to get a test cluster into the same state as the broken one.

What I do is:

  • Have three test nodes with 1.3.1 installed.
  • Duplicate the configuration from the broken cluster (replacing node names, addresses, etc where appropriate).
  • Remove node-id.
  • Start all three of the nodes.
  • Stop one of the nodes.
  • Create peers.json with the newly created node-ids and IPs.
  • Start the stopped node.

As previously mentioned, this is far from ideal and is only to allow me to attempt some workarounds/fixes.

File listing:

$ tar tf consul.tar
consul/
consul/proxy/
consul/proxy/snapshot.json
consul/serf/
consul/serf/remote.snapshot
consul/serf/local.snapshot
consul/serf/local.keyring
consul/serf/remote.keyring
consul/raft/
consul/raft/snapshots/
consul/raft/snapshots/12554-284583919-1543427019761/
consul/raft/snapshots/12554-284583919-1543427019761/state.bin
consul/raft/snapshots/12554-284583919-1543427019761/meta.json
consul/raft/snapshots/12554-284600350-1543430224806/
consul/raft/snapshots/12554-284600350-1543430224806/state.bin
consul/raft/snapshots/12554-284600350-1543430224806/meta.json
consul/raft/raft.db
consul/raft/peers.info
consul/node-id
consul/checkpoint-signature

@agy (Contributor) commented Dec 3, 2018

@wysenynja you mentioned that you had a similar issue when upgrading 1.2.3 to 1.3.0. Did you have connect enabled pre-1.2.3?

@BlinkyStitt (Author)

I don’t believe so. I’m running 1.4 now without issue. I am able to rebuild this cluster easily and it didn’t happen when I started fresh.

kyhavlov added a commit that referenced this issue Dec 6, 2018
This PR both prevents a blank CA config from being written out to
a snapshot and allows Consul to gracefully recover from a snapshot
with an invalid CA config.

Fixes #4954.
@kyhavlov (Contributor) commented Dec 6, 2018

I opened #5061, which should fix this - any older versions from 1.2.3 forward will be able to cherry-pick this fix using these steps: #5016 (comment)

@thanapolr

I got this issue when enabling connect in 1.4.0, too.

2018/12/06 15:46:41 [ERR] consul: failed to establish leadership: unknown CA provider ""
2018/12/06 15:47:22 [ERR] http: Request GET /v1/kv/vault/core/lock?consistent=, error: Not ready to serve consistent reads from=127.0.0.1:45468
2018/12/06 15:47:31 [ERR] http: Request GET /v1/kv/vault/core/lock?consistent=, error: Not ready to serve consistent reads from=127.0.0.1:45506
2018/12/06 15:47:41 [ERR] http: Request GET /v1/kv/vault/core/lock?consistent=, error: Not ready to serve consistent reads from=127.0.0.1:45442

kyhavlov added a commit that referenced this issue Dec 7, 2018
This PR both prevents a blank CA config from being written out to
a snapshot and allows Consul to gracefully recover from a snapshot
with an invalid CA config.

Fixes #4954.
@thanapolr

#5061 solved my problem.

agy pushed a commit to agy/consul that referenced this issue Dec 10, 2018
Prevent blank CA config from being committed to the snapshot.

hashicorp#4954
@valarauca commented Jun 4, 2019

I was able to reproduce this error with the following server configuration

{
  "addresses": {
    "dns": "0.0.0.0",
    "http": "127.0.0.1",
    "https": "0.0.0.0",
    "grpc": "0.0.0.0"
  },
  "bootstrap_expect": 5,
  "ca_file": "/opt/consul/certs/ca_cert.pem",
  "cert_file": "/opt/consul/certs/local_cert.pem",
  "data_dir": "/opt/consul/data",
  "discard_check_output": null,
  "discovery_max_stale": null,
  "enable_script_checks": false,
  "enable_local_script_checks": false,
  "encrypt": "72Tle7Mf5E72Zpq/cLz9+g==",
  "encrypt_verify_incoming": true,
  "encrypt_verify_outgoing": true,
  "key_file": "/opt/consul/certs/local_key.pem",
  "log_level": "DEBUG",
  "log_file": "/opt/consul/log/consul.log",
  "log_rotate_bytes": 1048576,
  "pid_file": "/opt/consul/pid/consul.pid",
  "ports": {},
  "retry_join": [
    "10.126.0.178",
    "10.126.0.146",
    "10.126.0.150",
    "10.126.0.144",
    "10.126.0.145"
  ],
  "server": true,
  "start_join": [
    "10.126.0.178",
    "10.126.0.146",
    "10.126.0.150",
    "10.126.0.144",
    "10.126.0.145"
  ],
  "verify_incoming": true,
  "verify_incoming_https": true,
  "verify_incoming_rpc": true,
  "verify_outgoing": true,
  "connect": {
    "ca_config": {
      "private_key": "/opt/consul/certs/ca_key.pem",
      "root_cert": "/opt/consul/certs/ca_cert.pem",
      "csr_max_per_second": 100,
      "csr_max_concurrent": 4,
      "leaf_cert_ttl": "4h"
    },
    "ca_provider": "consul",
    "enabled": true
  }
}

This is on consul 1.5.1

My agents report

[Err] consul.watch:  Watch (type: connect_leaf) errored: Unexpected response code: 500 (rpc error making call: internal error: CA provider is nil), retry in 5s
[Err] consul.watch:  Watch (type: connect_leaf) errored: Unexpected response code: 500 (rpc error making call: internal error: CA provider is nil), retry in 20s
[Err] consul.watch:  Watch (type: connect_leaf) errored: Unexpected response code: 500 (rpc error making call: internal error: CA provider is nil), retry in 45s
[Err] consul.watch:  Watch (type: connect_leaf) errored: Unexpected response code: 500 (rpc error making call: internal error: CA provider is nil), retry in 1m20s
[Err] consul.watch:  Watch (type: connect_leaf) errored: Unexpected response code: 500 (rpc error making call: internal error: CA provider is nil), retry in 2m5s

I can remove this configuration section, but when I do, I start getting registration errors saying RPC resources are exhausted, try again. So I assumed I should increase the number of certificate requests per second (and in parallel), as I'm kicking off ~175 sidecars across the whole cluster on deployment.
