Consul 1.6.3 DDos using consul-template (1.6.2 working fine) #7259

Closed
obourdon opened this issue Feb 11, 2020 · 13 comments

@obourdon

Overview of the Issue

A call to curl -s -D - http://consul-server:8500/v1/agent/members fails with "curl: (56) Recv failure: Connection reset by peer" (even though consul members still seems to work) after a consul-template failure.

Reproduction Steps

Note that under the exact same scenario, consul 1.6.2 does not exhibit the same behaviour as far as the curl call mentioned above is concerned. Looking at the code differences between consul 1.6.3 and 1.6.2 did not ring any bells on my side.

We have a set of uniform versions: 1 or 3 consul servers and 2 consul client machines (all 1.6.2 or all 1.6.3).
We are running 2 sets of nomad jobs using consul-template (0.24.0). One set of jobs is targeted at consul client 1 (using constraints & tags); the second set of nomad jobs is targeted at consul client 2.

In job set 2, we had an error: one of the consul keys which parametrizes one of the jobs is missing from the consul KV store, therefore making consul-template "hang" waiting for this key to be defined (we fixed this on our side by using keyOrDefault instead of key; see the sketch right below).
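
For reference, a minimal sketch of the difference between the two template functions (the key name service/config/port and the default "8080" are illustrative placeholders, not our actual keys):

{{/* blocks until the key exists in the KV store */}}
{{ key "service/config/port" }}
{{/* renders immediately, falling back to "8080" when the key is missing */}}
{{ keyOrDefault "service/config/port" "8080" }}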

However, this error seems to have a dramatic effect on the consul client node where the job causing the error is executing. We reversed the job targeting rules to make sure that this behaviour is not linked to the host itself but really to the failing job being executed on that particular host, and this proved to be the case (anyway, all hosts are running the same OS, software versions, and configurations for the same roles ...).

We are using both consul members and curl -s -D - http://consul-server:8500/v1/agent/members to make sure that everything is working properly on all nodes; both commands succeed on every node when using consul 1.6.2.

In the case of consul 1.6.2, the nomad job waits for consul-template to get the missing key, and once we run consul kv put missing-key some-value everything comes back to normal; the curl and CLI members calls keep working at all times.

However, using consul 1.6.3, once consul-template is hanging, the consul node where it was executed cannot make any successful curl calls to the members endpoint. The CLI consul members command seems to execute properly, but if we run systemctl restart consul-agent then the node seems to have left the cluster. If I remember correctly, restarting the consul server on the server node(s) also shows that the cluster has lost all members.

One thing I would also like to state is that, due to the large set of jobs we run in job set 2, we get the following known warning message at job startup; but, again, this still works properly on 1.6.2 and not on 1.6.3:

[WARN] (runner) watching 161 dependencies - watching this many dependencies could DDos your consul cluster

Consul info for both Client and Server

Client info

CoreOS 2303.3.0

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 2
	services = 2
build:
	prerelease =
	revision = 7f3b5f34
	version = 1.6.3
consul:
	acl = disabled
	known_servers = 1
	server = false
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 72
	max_procs = 2
	os = linux
	version = go1.12.13
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 2
	member_time = 11
	members = 7
	query_queue = 0
	query_time = 1
Server info

CoreOS 2303.3.0

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 1
	services = 1
build:
	prerelease =
	revision = 7f3b5f34
	version = 1.6.3
consul:
	acl = disabled
	bootstrap = true
	known_datacenters = 1
	leader = true
	leader_addr = 10.0.1.107:8300
	server = true
raft:
	applied_index = 781
	commit_index = 781
	fsm_pending = 0
	last_contact = 0
	last_log_index = 781
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:80773a74-78f9-5edd-7e9f-a08facde080f Address:10.0.1.107:8300}]
	latest_configuration_index = 1
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 2
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 431
	max_procs = 2
	os = linux
	version = go1.12.13
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 2
	member_time = 11
	members = 7
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1

Operating system and Environment details

See above

Log Fragments

Currently gathering these

@obourdon
Author

obourdon commented Feb 11, 2020

After further, deeper investigation, it seems like a DDoS is the issue:

  • the issue has nothing to do with the error (unknown key in template)
  • the issue does not occur on a 3-node consul cluster (at least I was not able to DDoS it)
  • the issue is reproducible 100% of the time with a 1-node consul server "cluster"
  • the issue never occurs on a 1-node 1.6.2 consul server "cluster"

@obourdon
Author

Also, looking deeper into the 1.6.2 to 1.6.3 changes, maybe this commit could explain the new behavior.

@obourdon obourdon changed the title Consul 1.6.3 failure after consul-template run with missing key Consul 1.6.3 DDos (1.6.2 working fine) Feb 14, 2020
@obourdon obourdon changed the title Consul 1.6.3 DDos (1.6.2 working fine) Consul 1.6.3 "supposed" DDos (1.6.2 working fine) Feb 14, 2020
@obourdon
Author

According to #7257, it seems like it is also reproducible with a 3-node server cluster.

@chuckyz

chuckyz commented Feb 18, 2020

@obourdon try out the new limits entries.

https://www.consul.io/docs/agent/options.html#limits

Specifically the *_conns_per_client entries.

This may also imply that something is way overusing one of the ways into the server cluster, and not using local agents. For instance, do you have Prometheus running with a very default consul_sd_config?
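
For illustration, a minimal agent configuration sketch with such a limit might look like this (the value 400 is only an example, not a recommendation):

{
  "limits": {
    "http_max_conns_per_client": 400
  }
}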

@obourdon
Author

@chuckyz thanks for the info. I have reproduced the issue in isolation from my complete (encrypted and SSL-secured) environment using the official consul Docker container and get the same behaviour: OK with 1.6.2, KO with 1.6.3 and later.

You can find the code to reproduce it yourself here.

In fact, consul-template is running [on an agent node/in a docker container] and, from the packet capture I have made, it is using the local consul agent to try to get the KVs, reusing the same connection (only 1 SYN seen in the pcap file).

Activating debug on both the consul server docker container and the consul client did not show much (no errors/warnings/... on either side). On the agent side I see the 150+ 'Request GET /v1/kv/...' calls bursting. After the burst, running the consul members command locally on both client and server fails for the client. Note that after a while the agent seems to recover and the command is functional again.

I have also tried to add some more Consul client/server configuration parameters, without any more success.
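
As a rough sketch of that kind of isolated Docker setup (the image tag, container names and flags below are illustrative assumptions, not the actual reproduction code linked above):

docker run -d --name consul-server -p 8500:8500 consul:1.6.3 agent -server -bootstrap-expect=1 -client=0.0.0.0
docker run -d --name consul-client consul:1.6.3 agent -retry-join=<server-ip> -client=0.0.0.0

The single-server "cluster" matches the configuration that reproduced the issue 100% of the time, with consul-template then pointed at the client agent.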

@obourdon
Author

See also consul-template issues and PRs #1279, #1066, #1065, #1107

@obourdon obourdon changed the title Consul 1.6.3 "supposed" DDos (1.6.2 working fine) Consul 1.6.3 DDos using consul-template (1.6.2 working fine) Feb 24, 2020
@obourdon
Author

I think the issue/configuration values to solve can be in both consul AND consul-template, hence the two separate issues: see consul-template issue #1346.

@pierresouchay
Contributor

@obourdon I don't get the exact template you are using, but if you have issues with consul-template DoSing consul, you might try consul-templaterb, which has a few protections included to avoid those kinds of behaviors (that's one of the reasons I developed this tool; it is production-grade and has been in use for years at Criteo on very large clusters).

@eikenb
Contributor

eikenb commented Mar 6, 2020

I asked around and it looks like this has to do with the http_max_conns_per_client setting that was added in 1.6.3. It defaults to 100 in 1.6.x (they bumped it to 200 in 1.7.1).
Given that you mention it warning about 161 watches, it seems likely that this is what you are hitting.

I suggest first trying to adjust that setting. Here are the docs on it:
https://www.consul.io/docs/agent/options.html#http_max_conns_per_client

Hope this helps.
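
If it helps, one way to double-check which limit the agent actually picked up (assuming the setting shows up in the agent's self-reported configuration) is something like:

curl -s http://localhost:8500/v1/agent/self | python -m json.tool | grep -i maxconnsperclient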

@brydoncheyney-slc

This is one we were hit with - running a Nomad client cluster with ~40 scheduled allocations on each node generated almost 1000 established HTTP connections per client... we may look at https://github.com/criteo/consul-templaterb to help diagnose exactly what's going on!

@pierresouchay
Contributor

@brydoncheyney-slc if the limit is 100 (assuming you are using 1.6.3+), it probably won't work either (assuming you are using 100+ endpoints), but if you test, use the -d flag to help diagnose the number of calls and the reason why. Ping me if you need help.

@brydoncheyney-slc

@pierresouchay interesting. I ran a fairly naive ss -tan dport eq 8500 | wc -l to determine the number of connections (to inform limits.http_max_conns_per_client, which now works) but will run with debug enabled to look at the why.

Appreciate the heads up. If anything interesting crops up I will report back...
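
For what it's worth, a slightly finer-grained variant of that naive one-liner, run on the node where the agent's HTTP API listens (a sketch assuming port 8500 and IPv4 peers), groups established connections by peer address:

ss -tan | awk '$1 == "ESTAB" && $4 ~ /:8500$/ {split($5, p, ":"); print p[1]}' | sort | uniq -c | sort -rn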

@obourdon
Author

@eikenb indeed increasing the http_max_conns_per_client configuration parameter seems to fix the issue

I am still confused by the naming of this variable as it seems that the client only makes one http connection to make all requests but ...

I first tried this with all the fixed versions from 1.7.1 up (where the limit was raised to 200), but I was still having the issue. Digging deeper showed that even though I have ~150 keys in consul, my consul-templates are requesting ~300 KVs (this can be seen after launching docker exec -t consul-client /srv/jobs/ddos_entrypoint.sh without the redirection to /dev/null):

...
2020/03/18 04:34:39.088255 [WARN] (runner) watching 300 dependencies - watching this many dependencies could DDoS your consul cluster

However, adding

export CONSUL_AGENT_EXTRA_CFG=',"limits": {"http_max_conns_per_client": 400}'
export CONSUL_SERVER_EXTRA_CFG=',"limits": {"http_max_conns_per_client": 400}'

does the trick
Many thanks for all the help.
