
RPC errors from Consul clients to Consul servers after upgrading from 1.7.2 to 1.7.4 #8174

Open
timprijn opened this issue Jun 23, 2020 · 0 comments
Labels
theme/certificates Related to creating, distributing, and rotating certificates in Consul

@timprijn

Overview of the Issue

After upgrading from 1.7.2 to 1.7.4, our Consul clients lost communication with our Consul servers. We noticed the following log lines:

  • Consul servers: [WARN] agent.server.rpc: Non-TLS connection attempted with VerifyIncoming set...
  • Consul clients: [ERROR] agent.dns: rpc error: error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
  • Consul clients: [ERROR] agent.http: Request error: method=GET url=/v1/agent/checks from=127.0.0.1:44118 error="ACL not found"

RPC communication between our clients and servers must use TLS, and our servers are configured with verify_incoming_rpc = true. After some investigation I noticed that our clients did not have verify_outgoing = true set. After adding this setting, everything appeared to work again. Note that in previous releases the missing client setting did not seem to cause any problems.
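The failure mode can be modelled in a few lines of Go. This is a minimal sketch assuming, as the server warning suggests, that a server with verify_incoming_rpc = true refuses any RPC connection that is not TLS. The function acceptRPC and its parameters are hypothetical and do not come from the Consul codebase.

```go
package main

import "fmt"

// acceptRPC is a hypothetical model, not Consul's implementation, of how a
// server could treat an incoming RPC connection. With verifyIncomingRPC set,
// a plaintext connection is refused and a warning is returned.
func acceptRPC(verifyIncomingRPC, clientUsesTLS bool) (bool, string) {
	if verifyIncomingRPC && !clientUsesTLS {
		return false, "Non-TLS connection attempted with VerifyIncoming set"
	}
	return true, ""
}

func main() {
	// A client that connects in plaintext, as the clients in this issue
	// appear to do on 1.7.3+ without verify_outgoing = true.
	ok, warn := acceptRPC(true, false)
	fmt.Printf("plaintext client: accepted=%v warning=%q\n", ok, warn)

	// A client that connects over TLS, e.g. with verify_outgoing = true.
	ok, warn = acceptRPC(true, true)
	fmt.Printf("TLS client: accepted=%v warning=%q\n", ok, warn)
}
```

In this model the only way for the client to be accepted by such a server is to connect over TLS, which matches the observation that setting verify_outgoing = true on the clients resolved the problem.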

I looked into the Consul code and found commit 2f7d097 (part of release 1.7.3); these lines in particular seem relevant:

consul/tlsutil/config.go

Lines 549 to 552 in 8b4a3d9

// if CAs are provided or VerifyOutgoing is set, use TLS
if c.base.VerifyOutgoing {
  return false
}

I think the comment no longer matches the code. Before this change, the code was:

// if CAs are provided or VerifyOutgoing is set, use TLS
if c.caPool != nil || c.base.VerifyOutgoing {
  return false
}
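The two quoted versions of the check can be compared side by side. The sketch below considers only the conditions in the quoted lines (anything else in the surrounding function is ignored). The type clientSettings and the functions tlsDisabledBefore and tlsDisabledAfter are hypothetical names, and caProvided stands in for c.caPool != nil, assuming the pool is populated when ca_file is configured.

```go
package main

import "fmt"

// clientSettings holds the two inputs that matter for this comparison.
// The type and field names are hypothetical, chosen for illustration.
type clientSettings struct {
	caProvided     bool // assumed equivalent to c.caPool != nil (ca_file set)
	verifyOutgoing bool // verify_outgoing = true
}

// tlsDisabledBefore models the pre-1.7.3 lines:
// TLS is used when CAs are provided OR VerifyOutgoing is set.
func tlsDisabledBefore(s clientSettings) bool {
	if s.caProvided || s.verifyOutgoing {
		return false
	}
	return true
}

// tlsDisabledAfter models the 1.7.3+ lines:
// only VerifyOutgoing decides whether TLS is used.
func tlsDisabledAfter(s clientSettings) bool {
	if s.verifyOutgoing {
		return false
	}
	return true
}

func main() {
	// The client configuration from this issue: CA provided, verify_outgoing unset.
	s := clientSettings{caProvided: true, verifyOutgoing: false}
	fmt.Println("before 1.7.3, uses TLS:", !tlsDisabledBefore(s))
	fmt.Println("1.7.3 and later, uses TLS:", !tlsDisabledAfter(s))
}
```

For a client that provides a CA but leaves verify_outgoing unset, the old check chooses TLS and the new one does not, which is consistent with the behavior change described in this issue.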

My theory is that our configuration worked with releases before 1.7.3 because of the caPool check: we do provide the CAs to the client, so the CA pool would have been set and TLS was used even without verify_outgoing. I find it hard to understand why this commit was made, so perhaps I'm mistaken...

But if I'm not, shouldn't there be a check in the code, or a remark in the migration guide, stating that client configurations from before 1.7.3 that provide the CAs but do not set verify_outgoing = true will now fail? Or perhaps I missed it.
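A check of the kind suggested here could be as simple as the following sketch. The agentTLSConfig type, its fields, and upgradeWarnings are hypothetical (the field names only mirror the HCL keys used in this issue), and the warning text is illustrative rather than an actual Consul log message.

```go
package main

import "fmt"

// agentTLSConfig is a hypothetical subset of an agent's TLS settings,
// named after the HCL keys used in this issue.
type agentTLSConfig struct {
	CAFile         string // ca_file
	VerifyOutgoing bool   // verify_outgoing
}

// upgradeWarnings is a hypothetical startup check: it flags configurations
// that provide a CA but leave verify_outgoing unset, which (per this issue)
// stopped implying outgoing TLS in 1.7.3.
func upgradeWarnings(c agentTLSConfig) []string {
	var warnings []string
	if c.CAFile != "" && !c.VerifyOutgoing {
		warnings = append(warnings,
			"ca_file is set but verify_outgoing is false: since 1.7.3, outgoing RPC may not use TLS unless verify_outgoing = true")
	}
	return warnings
}

func main() {
	// Placeholder path; any non-empty ca_file triggers the check.
	cfg := agentTLSConfig{CAFile: "consul-ca.crt"}
	for _, w := range upgradeWarnings(cfg) {
		fmt.Println("[WARN]", w)
	}
}
```

Emitting a warning rather than refusing to start would keep existing deployments running while still pointing operators at the missing setting before the servers start rejecting plaintext RPC.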

Reproduction Steps

Steps to reproduce this issue:

  1. Upgrade from 1.7.2 to 1.7.3 (or 1.7.4).
  2. Have a client configuration that requires TLS communication and has the following settings:
       verify_incoming = true
       ca_file = "{{ consul_ssl_dir }}/consul-ca.crt"
       cert_file = "{{ consul_ssl_dir }}/consul.crt"
       key_file = "{{ consul_ssl_dir }}/consul.key"
  3. Private cloud is down... (see the errors and warnings in the logs).

Consul info for both Client and Server

Client info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 12
	services = 11
build:
	prerelease =
	revision = d149d7e9
	version = 1.7.4
consul:
	acl = enabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 58
	max_procs = 4
	os = linux
	version = go1.13.12
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 44
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 407
	members = 23
	query_queue = 0
	query_time = 1
Server info
agent:
	check_monitors = 0
	check_ttls = 1
	checks = 4
	services = 4
build:
	prerelease =
	revision = d149d7e9
	version = 1.7.4
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 10.180.48.11:8300
	server = true
raft:
	applied_index = 11177912
	commit_index = 11177912
	fsm_pending = 0
	last_contact = 13.426708ms
	last_log_index = 11177912
	last_log_term = 335
	last_snapshot_index = 11165807
	last_snapshot_term = 335
	latest_configuration = [{Suffrage:Voter ID:b37debdc-9b7a-f881-630d-da40239a8300 Address:10.180.48.12:8300} {Suffrage:Voter ID:1844c789-ea3a-c5af-8ba6-16a739882925 Address:10.180.48.16:8300} {Suffrage:Voter ID:a58b98d8-c51a-2ccc-d1ef-415e84472d35 Address:10.180.48.11:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 335
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 92
	max_procs = 2
	os = linux
	version = go1.13.12
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 44
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 407
	members = 23
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 21
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic
Kernel: Linux 4.15.0-91-generic
Architecture: x86-64

@jsosulska added the theme/certificates label on Jul 2, 2020