
[ERR] consul: failed to establish leadership: unknown CA provider "" #4954

Closed
BlinkyStitt opened this issue Nov 14, 2018 · 21 comments
Labels
needs-investigation The issue described is detailed and complex. theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies type/bug Feature does not function as expected

Comments

@BlinkyStitt commented Nov 14, 2018

Overview of the Issue

I'm trying to use consul connect and now my cluster is partially broken.

Reproduction Steps

I'm not sure how exactly to reproduce. I've been poking at this cluster too much to have clear steps. I'll try more tomorrow.

I put this in /etc/consul.d/config.hcl:

connect {
  enabled = true
}

When I upgraded consul from 1.2.3 to 1.3.0 and added that config, something went wrong and leader election failed.

I manually recovered by poking peers.json and now the cluster has a leader:

Node      ID                                    Address             State     Voter  RaftProtocol
consul    375e6536-a4d0-5770-3b7d-98dbe4a65686  192.168.0.100:8300  follower  true   3
consul-a  65000f53-2857-4daf-19b8-61e5bb4492c0  192.168.0.101:8300  follower  true   3
consul-b  ef31e65f-4536-14c0-4c8d-fdb1f6725922  192.168.0.102:8300  leader    true   3
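
(For reference, the peers.json used for this kind of manual recovery with Raft protocol version 3 is a JSON array of server IDs and addresses; a rough sketch using the IDs from the table above, not my exact file:)

[
  {
    "id": "375e6536-a4d0-5770-3b7d-98dbe4a65686",
    "address": "192.168.0.100:8300",
    "non_voter": false
  },
  {
    "id": "65000f53-2857-4daf-19b8-61e5bb4492c0",
    "address": "192.168.0.101:8300",
    "non_voter": false
  },
  {
    "id": "ef31e65f-4536-14c0-4c8d-fdb1f6725922",
    "address": "192.168.0.102:8300",
    "non_voter": false
  }
]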

However, something is still wrong with the cluster. The UI is showing stale data (#4923, maybe?) and vault is stuck in standby.

$ vault status
Key                    Value
---                    -----
Seal Type              shamir
Initialized            true
Sealed                 false
Total Shares           1
Threshold              1
Version                0.11.3
Cluster Name           vault-cluster-641f43ea
Cluster ID             e9570865-65db-1ea6-738b-c759d68fbfdd
HA Enabled             true
HA Cluster             n/a
HA Mode                standby
Active Node Address    <none>

$ vault secrets enable pki
Error enabling: Error making API request.

URL: POST http://127.0.0.1:8200/v1/sys/mounts/pki
Code: 500. Errors:

* local node not active but active cluster node not found

I would like to use Vault as the CA for Consul Connect, but was just trying to get it working with the simplest setup first. How do I migrate to Vault from this broken Connect provider?

From my reading of the docs, the CA setup was all supposed to be automatic, so I don't know how to do it manually. I think I need to run consul connect ca set-config -config-file ca.json, but I don't know what to put in ca.json (also, why no HCL support here?).
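
(Based on the documented /v1/connect/ca/configuration payload, I'd guess a minimal ca.json for the built-in provider looks something like this, with the default TTL and rotation values:)

{
  "Provider": "consul",
  "Config": {
    "LeafCertTTL": "72h",
    "RotationPeriod": "2160h"
  }
}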

Consul info for both Client and Server

Client info
agent:
	check_monitors = 0
	check_ttls = 1
	checks = 1
	services = 1
build:
	prerelease = 
	revision = e8757838
	version = 1.3.0
consul:
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 52
	max_procs = 8
	os = linux
	version = go1.11.1
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 168
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1708
	members = 5
	query_queue = 0
	query_time = 1
Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = 
	revision = e8757838
	version = 1.3.0
consul:
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 192.168.0.102:8300
	server = true
raft:
	applied_index = 6311623
	commit_index = 6311623
	fsm_pending = 0
	last_contact = 0
	last_log_index = 6311623
	last_log_term = 897
	last_snapshot_index = 6311248
	last_snapshot_term = 700
	latest_configuration = [{Suffrage:Voter ID:375e6536-a4d0-5770-3b7d-98dbe4a65686 Address:192.168.0.100:8300} {Suffrage:Voter ID:65000f53-2857-4daf-19b8-61e5bb4492c0 Address:192.168.0.101:8300} {Suffrage:Voter ID:ef31e65f-4536-14c0-4c8d-fdb1f6725922 Address:192.168.0.102:8300}]
	latest_configuration_index = 1
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 897
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 94
	max_procs = 8
	os = linux
	version = go1.11.1
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 168
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 2
	member_time = 1708
	members = 7
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 228
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

I'm running everything inside docker containers on a single host.

Log Fragments

consul_b_1  | bootstrap_expect > 0: expecting 3 servers
consul_b_1  |     2018/11/14 04:04:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:05:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:06:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:07:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:08:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:09:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:10:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:11:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
@kyhavlov (Contributor) commented Nov 14, 2018

Hey @wysenynja, thanks for the bug report. That config you gave for connect (enabled = true) should be enough to get things working in both 1.2.3 and 1.3.0, and the empty provider type should be an impossible state to get into since the provider defaults to "consul" in both of those versions.
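
For comparison, on a cluster where the CA bootstrapped correctly, consul connect ca get-config should return the default provider and config rather than empty values; roughly something like this (the index values here are just illustrative):

$ consul connect ca get-config
{
	"Provider": "consul",
	"Config": {
		"LeafCertTTL": "72h",
		"RotationPeriod": "2160h"
	},
	"CreateIndex": 5,
	"ModifyIndex": 5
}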

I tried a few things to reproduce the state you show where it's failing to establish leadership but couldn't get the same result:

  • Upgrading a lone server from 1.2.3 with Connect disabled to 1.3.0 with just the connect = enabled config.
  • A rolling upgrade of a set of 3 servers from 1.2.3 with Connect disabled to 1.3.0 with the above config.
  • Partially upgrading a set of 3 servers so that 2 were running 1.3.0 and the other was on 1.2.3, then getting one of the 1.3.0 servers to become leader and bootstrap the CA before restarting it to cause the 1.2.3 server to become leader again (in case some state was backwards incompatible here).

Any steps you have to help reproduce this are appreciated. It may take a fresh cluster to do so if you've got things back to a working state though, as it sounds like the invalid CA config that was preventing the leader election has been fixed (or wasn't an issue on another server).

@agy (Contributor) commented Nov 27, 2018

@kyhavlov I'm experiencing the same issue (without Vault). After upgrading the cluster to 1.3.0 (and then 1.3.1) and enabling connect I receive the same error:

2018/11/27 17:40:05 [ERR] consul: failed to establish leadership: unknown CA provider ""

After adding some debug statements to initializeCAConfig() I can see that the CA config returned from the FSM state is non-nil but empty, and so the empty config is returned.

2018/11/27 17:40:05 [DEBUG] consul: (agy) initializeCAConfig state.CAConfig: &structs.CAConfiguration{ClusterID:"", Provider:"", Config:map[string]interface {}(nil), RaftIndex:structs.RaftIndex{CreateIndex:0x0, ModifyIndex:0x0}}
2018/11/27 17:40:05 [DEBUG] consul: (agy) initializeCAConfig modIndex: 0x0

I attempted to set the ca_provider to consul as well, but this doesn't seem to make a difference.

$ consul connect ca get-config
{
	"Provider": "",
	"Config": null,
	"CreateIndex": 0,
	"ModifyIndex": 0
}
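
(To be clear, by "set the ca_provider" I mean the documented connect stanza in the server config; a minimal HCL sketch:)

connect {
  enabled     = true
  ca_provider = "consul"
}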

Dumping the leader's agent config shows that connect is enabled and, depending on whether I set the provider, it's either "" or "consul".

$ curl -s localhost:8500/v1/agent/self | jq .DebugConfig.ConnectEnabled
true
$ curl -s localhost:8500/v1/agent/self | jq .DebugConfig.ConnectCAProvider
""

I cannot reproduce this on test clusters with the same scenario. The problematic cluster is moderate in size with ~1500 nodes.

@agy (Contributor) commented Nov 27, 2018

I have rotated all of the consul servers in this cluster with new machines and the issue remains.

@agy (Contributor) commented Nov 27, 2018

Attempting to update the CA config also does not work:

$ cat connect_config.json
{
  "Provider": "consul",
  "Config": {
    "LeafCertTTL": "72h",
    "RotationPeriod": "2160h",
    "PrivateKey": "",
    "RootCert": ""
  }
}
$ curl -v -X PUT -d @connect_config.json http://127.0.0.1:8500/v1/connect/ca/configuration
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 8500 (#0)
> PUT /v1/connect/ca/configuration HTTP/1.1
> User-Agent: curl/7.35.0
> Host: 127.0.0.1:8500
> Accept: */*
> Content-Length: 135
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 135 out of 135 bytes
< HTTP/1.1 500 Internal Server Error
< Vary: Accept-Encoding
< Date: Tue, 27 Nov 2018 22:44:41 GMT
< Content-Length: 57
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host 127.0.0.1 left intact
rpc error making call: internal error: CA provider is nil

@pearkes pearkes added type/bug Feature does not function as expected theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies needs-investigation The issue described is detailed and complex. labels Nov 28, 2018
@agy (Contributor) commented Nov 29, 2018

While the server is in the failed-to-establish-leadership state, leadership is not given up and an election doesn't occur. Reads and writes seem to work, but reads that are forced to be consistent fail.

$ curl -s -X PUT -d "foo" http://localhost:8500/v1/kv/bar; echo
true
$ curl -s -X GET http://localhost:8500/v1/kv/bar | jq .
[
  {
    "LockIndex": 0,
    "Key": "bar",
    "Flags": 0,
    "Value": "Zm9v",
    "CreateIndex": 284603937,
    "ModifyIndex": 284603961
  }
]
$ curl -s -X GET -d "foo" http://localhost:8500/v1/kv/bar?consistent=; echo
rpc error making call: Not ready to serve consistent reads

Killing the leader when in this state does not trigger an election. The follower nodes do report that there is no leader (as expected).

I'm unsure what the expected behaviour should be when in this state.

@pearkes (Contributor) commented Nov 30, 2018

Regarding this comment: #4954 (comment)

And the error message:

2018/11/27 17:40:05 [ERR] consul: failed to establish leadership: unknown CA provider ""

It may be worth investigating if #5016 is showing a symptom of a similar bug. Adding a new server in this case and restoring from a snapshot manually in the case of #5016 could be hitting the same condition. That case includes a repro.

@agy (Contributor) commented Nov 30, 2018

The Docker image referenced in #5016 doesn't seem to include the snapshot (unless I'm missing something obvious).

I can get a test cluster in the same state as my broken one by importing the raft db. But I cannot reproduce any other way.

I should also note that at no point did I manually restore a snapshot.

@PurrBiscuit commented Nov 30, 2018

Re: the snapshot - that's correct; there's no snapshot built into the image currently - we had been mounting the snapshot into the container as a volume from a local directory and then running the restore from there.

The issue we are seeing there is also from a 1.2.1 to 1.2.4 upgrade; snapshot restores were working ok with 1.2.1 but are producing that error with 1.2.4. It's a little different than this case, which is why I wanted to open a separate issue for it (although they do sound related).

@pearkes (Contributor) commented Nov 30, 2018

Thanks @agy and @PurrBiscuit for the information in both cases...we're continuing to look into this.

@pearkes (Contributor) commented Dec 3, 2018

@agy Can you clarify which version you upgraded from to get to 1.3.0? Our current hunch is that you could be seeing a symptom of #4535, which was fixed in 1.2.3. If you were ever on a previous version utilizing Connect CA configuration that wrote to the state store, you'd be seeing this, as the state store would have been corrupted.

If this is the case, we're considering adding something like a -force option to connect ca set-config that would allow you to override the CA configuration without going through the rotation mechanism, which would fail with the invalid configuration (which it seems you have, based on what we've seen) in your state store.

Alternatively (the better option, we think), we could add automatic handling of this corrupted CA configuration, which would allow you to bypass the issue by treating it as a nil configuration.

If we added something like that could you potentially jump to 1.4.1-dev (master)?

@agy (Contributor) commented Dec 3, 2018

@pearkes 0.8.3 -> 1.3.0 -> 1.3.1.

Note: I had tested this upgrade path on a newly provisioned test cluster, and it is the same path I used for all the testing I did earlier.

Since I'm able to reproduce this issue on a test cluster by importing the raft store from the current broken cluster, I can test whatever fixes you propose. I agree that your alternative, "better" solution is preferable.

Upgrading to 1.4.x is problematic because I have not had the opportunity to test the new ACL system.

@agy (Contributor) commented Dec 3, 2018

Unfortunately, since I rotated all the members of the broken cluster I cannot verify if I had enabled connect when the cluster was 1.3.0 or only once it was 1.3.1.

@pearkes (Contributor) commented Dec 3, 2018

@agy Our concern and assumption was that it corrupted the state in 1.2.0 - 1.2.2. If you never ran those versions (regardless of where you're coming from now) that is relatively confusing but doesn't necessarily make the fix different.

@pearkes (Contributor) commented Dec 3, 2018

@agy can you also clarify from this comment:

I can get a test cluster in the same state as my broken one by importing the raft db

What operation did you do here? consul snapshot save/restore or did you copy the actual raft DB (something in the data directory, if so which file(s))?

@agy (Contributor) commented Dec 3, 2018

@pearkes I have done both.

The snapshot save/restore fails on restore with:

Error restoring snapshot: Unexpected response code: 500 (unknown CA provider "")
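
(This is just the standard snapshot commands; backup.snap is a placeholder filename, and it's the restore step that fails:)

$ consul snapshot save backup.snap
$ consul snapshot restore backup.snap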

I have also copied over a tarball of the Consul data_dir from the broken cluster to the test cluster. I would never normally do this, but it was the only way I was able to get a test cluster into the same state as the broken one.

What I do is:

  • Have three test nodes with 1.3.1 installed.
  • Duplicate the configuration from the broken cluster (replacing node names, addresses, etc where appropriate).
  • Remove node-id.
  • Start all three of the nodes.
  • Stop one of the nodes.
  • Create peers.json with the newly created node-ids and IPs.
  • Start the stopped node.

As previously mentioned, this is far from ideal and is only to allow me to attempt some workarounds/fixes.

File listing:

$ tar tf consul.tar
consul/
consul/proxy/
consul/proxy/snapshot.json
consul/serf/
consul/serf/remote.snapshot
consul/serf/local.snapshot
consul/serf/local.keyring
consul/serf/remote.keyring
consul/raft/
consul/raft/snapshots/
consul/raft/snapshots/12554-284583919-1543427019761/
consul/raft/snapshots/12554-284583919-1543427019761/state.bin
consul/raft/snapshots/12554-284583919-1543427019761/meta.json
consul/raft/snapshots/12554-284600350-1543430224806/
consul/raft/snapshots/12554-284600350-1543430224806/state.bin
consul/raft/snapshots/12554-284600350-1543430224806/meta.json
consul/raft/raft.db
consul/raft/peers.info
consul/node-id
consul/checkpoint-signature

@agy (Contributor) commented Dec 3, 2018

@wysenynja you mentioned that you had a similar issue when upgrading 1.2.3 to 1.3.0. Did you have connect enabled pre-1.2.3?

@BlinkyStitt (Author)

I don’t believe so. I’m running 1.4 now without issue. I am able to rebuild this cluster easily and it didn’t happen when I started fresh.

kyhavlov added a commit that referenced this issue Dec 6, 2018
This PR both prevents a blank CA config from being written out to
a snapshot and allows Consul to gracefully recover from a snapshot
with an invalid CA config.

Fixes #4954.
@kyhavlov (Contributor) commented Dec 6, 2018

I opened #5061, which should fix this - any older versions from 1.2.3 forward will be able to cherry-pick this fix using these steps: #5016 (comment)

@thanapolr

I got this issue when enabling connect in 1.4.0, too.

2018/12/06 15:46:41 [ERR] consul: failed to establish leadership: unknown CA provider ""
2018/12/06 15:47:22 [ERR] http: Request GET /v1/kv/vault/core/lock?consistent=, error: Not ready to serve consistent reads from=127.0.0.1:45468
2018/12/06 15:47:31 [ERR] http: Request GET /v1/kv/vault/core/lock?consistent=, error: Not ready to serve consistent reads from=127.0.0.1:45506
2018/12/06 15:47:41 [ERR] http: Request GET /v1/kv/vault/core/lock?consistent=, error: Not ready to serve consistent reads from=127.0.0.1:45442

kyhavlov added a commit that referenced this issue Dec 7, 2018
This PR both prevents a blank CA config from being written out to
a snapshot and allows Consul to gracefully recover from a snapshot
with an invalid CA config.

Fixes #4954.
@thanapolr

#5061 solved my problem.

agy pushed a commit to agy/consul that referenced this issue Dec 10, 2018
Prevent blank CA config from being committed to the snapshot.

hashicorp#4954
@valarauca commented Jun 4, 2019

I was able to reproduce this error with the following server configuration

{
  "addresses": {
    "dns": "0.0.0.0",
    "http": "127.0.0.1",
    "https": "0.0.0.0",
    "grpc": "0.0.0.0"
  },
  "bootstrap_expect": 5,
  "ca_file": "/opt/consul/certs/ca_cert.pem",
  "cert_file": "/opt/consul/certs/local_cert.pem",
  "data_dir": "/opt/consul/data",
  "discard_check_output": null,
  "discovery_max_stale": null,
  "enable_script_checks": false,
  "enable_local_script_checks": false,
  "encrypt": "72Tle7Mf5E72Zpq/cLz9+g==",
  "encrypt_verify_incoming": true,
  "encrypt_verify_outgoing": true,
  "key_file": "/opt/consul/certs/local_key.pem",
  "log_level": "DEBUG",
  "log_file": "/opt/consul/log/consul.log",
  "log_rotate_bytes": 1048576,
  "pid_file": "/opt/consul/pid/consul.pid",
  "ports": {},
  "retry_join": [
    "10.126.0.178",
    "10.126.0.146",
    "10.126.0.150",
    "10.126.0.144",
    "10.126.0.145"
  ],
  "server": true,
  "start_join": [
    "10.126.0.178",
    "10.126.0.146",
    "10.126.0.150",
    "10.126.0.144",
    "10.126.0.145"
  ],
  "verify_incoming": true,
  "verify_incoming_https": true,
  "verify_incoming_rpc": true,
  "verify_outgoing": true,
  "connect": {
    "ca_config": {
      "private_key": "/opt/consul/certs/ca_key.pem",
      "root_cert": "/opt/consul/certs/ca_cert.pem",
      "csr_max_per_second": 100,
      "csr_max_concurrent": 4,
      "leaf_cert_ttl": "4h"
    },
    "ca_provider": "consul",
    "enabled": true
  }
}

This is on consul 1.5.1

My agents report

[Err] consul.watch:  Watch (type: connect_leaf) errored: Unexpected response code: 500 (rpc error making call: internal error: CA provider is nil), retry in 5s
[Err] consul.watch:  Watch (type: connect_leaf) errored: Unexpected response code: 500 (rpc error making call: internal error: CA provider is nil), retry in 20s
[Err] consul.watch:  Watch (type: connect_leaf) errored: Unexpected response code: 500 (rpc error making call: internal error: CA provider is nil), retry in 45s
[Err] consul.watch:  Watch (type: connect_leaf) errored: Unexpected response code: 500 (rpc error making call: internal error: CA provider is nil), retry in 1m20s
[Err] consul.watch:  Watch (type: connect_leaf) errored: Unexpected response code: 500 (rpc error making call: internal error: CA provider is nil), retry in 2m5s

I can remove this configuration section, but when I do, I start getting registration errors saying RPC resources are exhausted, try again. So I assumed I should increase the number of certificate requests per second (and in parallel), as I'm kicking off ~175 sidecars across the whole cluster on deployment.
