
[WARN] agent.server.serf.lan: serf: Intent queue depth exceeds limit, dropping messages! #14200

Open
orarnon opened this issue Aug 15, 2022 · 6 comments
Labels
theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner

Comments

@orarnon

orarnon commented Aug 15, 2022

Overview of the Issue

All five of our Consul clusters emit these messages constantly.
We're running version 1.13.1.
This is related to this issue.

Consul info for both Client and Server

Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 1
build:
	prerelease =
	revision = c6d0f9ec
	version = 1.13.1
	version_metadata =
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 10.200.4.70:8300
	server = true
raft:
	applied_index = 994304793
	commit_index = 994304793
	fsm_pending = 0
	last_contact = 45.679195ms
	last_log_index = 994304793
	last_log_term = 419529
	last_snapshot_index = 994294563
	last_snapshot_term = 419529
	latest_configuration = [{Suffrage:Voter ID:0cb59283-962f-9c2a-d850-47dab2a502fc Address:10.200.2.72:8300} {Suffrage:Voter ID:1ca80a4c-6910-8487-6617-24608348ec08 Address:10.200.4.70:8300} {Suffrage:Voter ID:523ec5b0-0454-4e7e-6d0d-47781bd233b6 Address:10.200.10.157:8300} {Suffrage:Voter ID:6024a30e-206e-5f6a-215a-6be0615555c4 Address:10.200.3.166:8300} {Suffrage:Voter ID:54cb2ccb-0e44-f06e-f4ad-d0c6aef3ec29 Address:10.200.1.154:8300}]
	latest_configuration_index = 0
	num_peers = 4
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 419529
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 276620
	max_procs = 4
	os = linux
	version = go1.18.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 23315
	failed = 41
	health_score = 0
	intent_queue = 11318
	left = 1395
	member_time = 14481495
	members = 5456
	query_queue = 0
	query_time = 394

Operating system and Environment details

Ubuntu 18.04, running on m5d.2xlarge on AWS

Log Fragments

2022-08-15T14:29:43.955Z [WARN] agent.server.serf.lan: serf: Intent queue depth (11782) exceeds limit (10966), dropping messages!

@huikang
Collaborator

huikang commented Aug 15, 2022

@orarnon, thanks for reporting. To help debug this issue, could you provide any steps to reproduce the behavior?
Is it caused by force-leave as in #8179?

@orarnon
Author

orarnon commented Aug 16, 2022

Hi @huikang,

No, we have thousands of nodes connected to our Consul servers. I just see this all the time now, coupled with failed RPC calls.
From looking at the metrics, it is not related to join/leave requests.
If you have a tip to help me narrow it down, or can point me in the right direction, I can dig deeper.
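
(For reference, here's a minimal sketch of how I could poll the intent queue depth over time via the agent's metrics endpoint. The `/v1/agent/metrics` endpoint and the `consul.serf.queue.Intent` sample name are assumptions based on Consul's standard telemetry, so treat this as illustrative rather than exact.)

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Assumes a local agent reachable at the default HTTP address.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	for {
		// /v1/agent/metrics returns the same data the agent would ship to a
		// configured telemetry sink (statsd, prometheus, etc.).
		info, err := client.Agent().Metrics()
		if err != nil {
			log.Fatal(err)
		}
		for _, s := range info.Samples {
			// Assumed metric name: consul.serf.queue.Intent should track the
			// same counter reported as serf_lan.intent_queue by `consul info`.
			if s.Name == "consul.serf.queue.Intent" {
				fmt.Printf("%s intent queue: max=%.0f samples=%d\n",
					time.Now().Format(time.RFC3339), s.Max, s.Count)
			}
		}
		time.Sleep(10 * time.Second)
	}
}
```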

@huikang
Collaborator

huikang commented Aug 16, 2022

@orarnon, thanks for the info. I noticed that the version affected is v1.13.1.

  • Did you observe this behavior after upgrading to v1.13.1? If so, what was the server agent version prior to upgrade?
  • What is the client agent version?

@orarnon
Author

orarnon commented Aug 17, 2022

Hi,
It happened before the upgrade as well; I think we were running 1.11.x.
So it's not a new log entry, but according to the linked issue it should have been resolved.

@jkirschner-hashicorp
Contributor

Hi @orarnon,

How many agents do you have in the 5 different Consul datacenters? How frequently are agents leaving or joining? Are you consistently seeing this log message, or only occasionally?

The log message you shared can be seen under conditions of high agent churn. Each time an agent leaves or joins, that leave or join message is broadcast through the cluster via gossip. My understanding is that if there are many gossip messages to broadcast, some older messages in the queue may be discarded.
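
For context, here is a simplified sketch of that behavior (paraphrasing serf's internals from memory, so the names and structure below are illustrative rather than exact): the intent queue has a ceiling that scales with cluster size, roughly twice the current member count with a configurable floor, and once the queue grows past it, the oldest queued broadcasts are pruned, which is exactly the warning you're seeing.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// intentQueue is a stand-in for serf's transmit queue of join/leave intents;
// the real implementation lives in hashicorp/serf and memberlist.
type intentQueue struct {
	mu     sync.Mutex
	queued int
}

func (q *intentQueue) NumQueued() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.queued
}

func (q *intentQueue) Prune(max int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.queued > max {
		q.queued = max
	}
}

// checkQueueDepth approximates serf's periodic check: the limit tracks the
// current member count (roughly 2x), and anything beyond it is dropped,
// which is what produces the warning reported in this issue.
func checkQueueDepth(q *intentQueue, numMembers func() int, interval time.Duration, stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		case <-time.After(interval):
			limit := 2 * numMembers() // with ~5,500 members this lands near the 10,966 limit in the log
			if depth := q.NumQueued(); depth > limit {
				fmt.Printf("[WARN] serf: Intent queue depth (%d) exceeds limit (%d), dropping messages!\n",
					depth, limit)
				q.Prune(limit)
			}
		}
	}
}

func main() {
	q := &intentQueue{queued: 11782} // the depth from the log fragment above
	stop := make(chan struct{})
	go checkQueueDepth(q, func() int { return 5456 }, 200*time.Millisecond, stop)
	time.Sleep(time.Second)
	close(stop)
	fmt.Println("intent queue depth after prune:", q.NumQueued())
}
```

Since the limit already grows with the cluster, consistently exceeding it usually means gossip broadcasts (joins, leaves, failures) are being generated faster than they can drain, rather than the limit simply being too small.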

The issue you linked was fixed by a PR addressing a bug that could unnecessarily cause high gossip broadcast traffic. I think what you're seeing is separate from that issue/PR.

@jkirschner-hashicorp added the waiting-reply and theme/operator-usability labels on Sep 29, 2022
@orarnon
Author

orarnon commented Dec 27, 2022

Hi @jkirschner-hashicorp, we do have nodes leaving and joining, but not in the thousands from what I can tell.
The PR above was merged a long time ago and we are running Consul v1.13.1.

@github-actions bot removed the waiting-reply label on Dec 27, 2022