Consul 1.6.3 DDos using consul-template (1.6.2 working fine) #7259

Closed
obourdon opened this issue Feb 11, 2020 · 13 comments

@obourdon

Overview of the Issue

A call to curl -s -D - http://consul-server:8500/v1/agent/members fails with "curl: (56) Recv failure: Connection reset by peer" (even though consul members still seems to work) after a consul-template failure.

Reproduction Steps

Note that under the exact same scenario, consul 1.6.2 does not exhibit the same behaviour as far as the curl call mentioned above is concerned. Looking at the code differences between consul 1.6.3 and 1.6.2 did not ring any bells on my side.

We have a set of uniform versions: 1 or 3 consul servers and 2 consul client machines (all 1.6.2 or all 1.6.3).
We are running 2 sets of nomad jobs using consul-template (0.24.0). One set of jobs is targeted at consul client 1 (using constraints & tags); the second set of nomad jobs is targeted at consul client 2.

In job set 2, we had an error: one of the consul keys which parametrizes one of the jobs is missing from the consul KV store, therefore making consul-template "hang" waiting for this key to be defined (we fixed this on our side by using keyOrDefault instead of key; see the sketch right below).
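
For reference, a minimal sketch of the difference between the two template functions (the key name service/config/port and the default "8080" are illustrative placeholders, not our actual keys):

{{/* blocks until the key exists in the KV store */}}
{{ key "service/config/port" }}
{{/* renders immediately, falling back to "8080" when the key is missing */}}
{{ keyOrDefault "service/config/port" "8080" }}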

However, this error seems to have a dramatic effect on the consul client node where the job causing the error is executing. We reversed the job targeting rules to make sure that this behaviour is not linked to the host itself but really to the failing job being executed on that particular host, and this proved to be the case (anyway, all hosts are running the same OS, software versions, and configurations for the same roles ...).

We are using both consul members and curl -s -D - http://consul-server:8500/v1/agent/members to make sure that everything is working properly on all nodes; both commands succeed on every node when using consul 1.6.2.

In the case of consul 1.6.2, the nomad job waits for consul-template to get the missing key, and once we run consul kv put missing-key some-value everything comes back to normal; the curl and CLI members calls keep working at all times.

However, using consul 1.6.3, once consul-template is hanging, the consul node where it was executed cannot make any successful curl calls to the members endpoint. The CLI consul members command seems to execute properly, but if we run systemctl restart consul-agent then the node seems to have left the cluster. If I remember correctly, restarting the consul server on the server node(s) also shows that the cluster has lost all members.

One thing I would also like to state is that, due to the large set of jobs we run in job set 2, we get the following known warning message at job startup; but, again, this still works properly on 1.6.2 and not on 1.6.3:

[WARN] (runner) watching 161 dependencies - watching this many dependencies could DDos your consul cluster

Consul info for both Client and Server

Client info

CoreOS 2303.3.0

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 2
	services = 2
build:
	prerelease =
	revision = 7f3b5f34
	version = 1.6.3
consul:
	acl = disabled
	known_servers = 1
	server = false
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 72
	max_procs = 2
	os = linux
	version = go1.12.13
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 2
	member_time = 11
	members = 7
	query_queue = 0
	query_time = 1
Server info

CoreOS 2303.3.0

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 1
	services = 1
build:
	prerelease =
	revision = 7f3b5f34
	version = 1.6.3
consul:
	acl = disabled
	bootstrap = true
	known_datacenters = 1
	leader = true
	leader_addr = 10.0.1.107:8300
	server = true
raft:
	applied_index = 781
	commit_index = 781
	fsm_pending = 0
	last_contact = 0
	last_log_index = 781
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:80773a74-78f9-5edd-7e9f-a08facde080f Address:10.0.1.107:8300}]
	latest_configuration_index = 1
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 2
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 431
	max_procs = 2
	os = linux
	version = go1.12.13
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 2
	member_time = 11
	members = 7
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1

Operating system and Environment details

See above

Log Fragments

Currently gathering these

@obourdon
Author

obourdon commented Feb 11, 2020

After further, deeper investigation, it seems like a DDoS is the issue:

  • the issue has nothing to do with the error (unknown key in template)
  • the issue does not occur on a 3-node consul cluster (at least I was not able to DDoS it)
  • the issue is reproducible 100% of the time with a 1-node consul server "cluster"
  • the issue never occurs on a 1-node 1.6.2 consul server "cluster"

@obourdon
Author

Also, looking deeper into the 1.6.2 to 1.6.3 changes, maybe this commit could explain the new behavior.

@obourdon obourdon changed the title Consul 1.6.3 failure after consul-template run with missing key Consul 1.6.3 DDos (1.6.2 working fine) Feb 14, 2020
@obourdon obourdon changed the title Consul 1.6.3 DDos (1.6.2 working fine) Consul 1.6.3 "supposed" DDos (1.6.2 working fine) Feb 14, 2020
@obourdon
Author

According to #7257, it seems like it is also reproducible with a 3-node server cluster.

@chuckyz

chuckyz commented Feb 18, 2020

@obourdon try out the new limits entries.

https://www.consul.io/docs/agent/options.html#limits

Specifically the *_conns_per_client entries.

This may also imply that something is way overusing one of the ways into the server cluster, and not using local agents. For instance, do you have Prometheus running with a very default consul_sd_config?
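
For illustration, a minimal agent configuration sketch with such a limit might look like this (the value 400 is only an example, not a recommendation):

{
  "limits": {
    "http_max_conns_per_client": 400
  }
}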

@obourdon
Author

@chuckyz thanks for the info. I have reproduced the issue in isolation from my complete (encrypted and SSL-secured) environment using the official consul Docker container and get the same behaviour: OK with 1.6.2, KO with 1.6.3 and later.

You can find the code to reproduce it yourself here.

In fact, consul-template is running [on an agent node/in a docker container] and, from the packet capture I have made, it is using the local consul agent to try to get the KVs, reusing the same connection (only 1 SYN seen in the pcap file).

Activating debug on both the consul server docker container and the consul client did not show much (no errors/warnings/... on either side). On the agent side I see the 150+ 'Request GET /v1/kv/...' calls bursting. After the burst, running the consul members command locally on both client and server fails for the client. Note that after a while the agent seems to recover and the command is functional again.

I have also tried to add some more Consul client/server configuration parameters, without any more success.
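
As a rough sketch of that kind of isolated Docker setup (the image tag, container names and flags below are illustrative assumptions, not the actual reproduction code linked above):

docker run -d --name consul-server -p 8500:8500 consul:1.6.3 agent -server -bootstrap-expect=1 -client=0.0.0.0
docker run -d --name consul-client consul:1.6.3 agent -retry-join=<server-ip> -client=0.0.0.0

The single-server "cluster" matches the configuration that reproduced the issue 100% of the time, with consul-template then pointed at the client agent.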

@obourdon
Author

See also consul-template issues and PRs #1279, #1066, #1065, #1107

@obourdon obourdon changed the title Consul 1.6.3 "supposed" DDos (1.6.2 working fine) Consul 1.6.3 DDos using consul-template (1.6.2 working fine) Feb 24, 2020
@obourdon
Author

I think the issue/configuration values to solve can be in both consul AND consul-template, hence the two separate issues: see consul-template issue #1346.

@pierresouchay
Contributor

@obourdon I don't get the exact template you are using, but if you have issues with consul-template DoSing consul, you might try consul-templaterb, which has a few protections included to avoid those kinds of behaviors (that's one of the reasons I developed this tool; it is production-grade and has been in use for years at Criteo on very large clusters).

@eikenb
Contributor

eikenb commented Mar 6, 2020

I asked around and it looks like this has to do with the http_max_conns_per_client setting that was added in 1.6.3. It defaults to 100 in 1.6.x (they bumped it to 200 in 1.7.1).
Given that you mention it warning about 161 watches, it seems likely that this is what you are hitting.

I suggest first trying to adjust that setting. Here are the docs on it:
https://www.consul.io/docs/agent/options.html#http_max_conns_per_client

Hope this helps.
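
If it helps, one way to double-check which limit the agent actually picked up (assuming the setting shows up in the agent's self-reported configuration) is something like:

curl -s http://localhost:8500/v1/agent/self | python -m json.tool | grep -i maxconnsperclient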

@brydoncheyney-slc

This is one we were hit with - running a Nomad client cluster with ~40 scheduled allocations on each node generated almost 1000 established HTTP connections per client... we may look at https://github.com/criteo/consul-templaterb to help diagnose exactly what's going on!

@pierresouchay
Contributor

@brydoncheyney-slc if the limit is 100 (assuming you are using 1.6.3+), it probably won't work either (assuming you are using 100+ endpoints), but if you test, use the -d flag to help diagnose the number of calls and the reason why. Ping me if you need help.

@brydoncheyney-slc

@pierresouchay interesting. I ran a fairly naive ss -tan dport eq 8500 | wc -l to determine the number of connections (to inform limits.http_max_conns_per_client, which now works) but will run with debug enabled to look at the why.

Appreciate the heads up. If anything interesting crops up I will report back...
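
For what it's worth, a slightly finer-grained variant of that naive one-liner, run on the node where the agent's HTTP API listens (a sketch assuming port 8500 and IPv4 peers), groups established connections by peer address:

ss -tan | awk '$1 == "ESTAB" && $4 ~ /:8500$/ {split($5, p, ":"); print p[1]}' | sort | uniq -c | sort -rn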

@obourdon
Author

@eikenb indeed increasing the http_max_conns_per_client configuration parameter seems to fix the issue

I am still confused by the naming of this variable as it seems that the client only makes one http connection to make all requests but ...

I first tried this with all the fixed versions from 1.7.1 up (where the limit was raised to 200), but I was still having the issue. Digging deeper showed that even though I have ~150 keys in consul, my consul-templates are requesting ~300 KVs (this can be seen after launching docker exec -t consul-client /srv/jobs/ddos_entrypoint.sh without the redirection to /dev/null):

...
2020/03/18 04:34:39.088255 [WARN] (runner) watching 300 dependencies - watching this many dependencies could DDoS your consul cluster

However, adding

export CONSUL_AGENT_EXTRA_CFG=',"limits": {"http_max_conns_per_client": 400}'
export CONSUL_SERVER_EXTRA_CFG=',"limits": {"http_max_conns_per_client": 400}'

does the trick
Many thanks for all the help.
