
[WARN] agent.server.serf.lan: serf: Intent queue depth exceeds limit, dropping messages! #14200

Open
orarnon opened this issue Aug 15, 2022 · 6 comments
Labels
theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner

Comments

@orarnon

orarnon commented Aug 15, 2022

Overview of the Issue

All five of our Consul clusters emit these messages constantly.
We're running version 1.13.1.
This is related to this issue.

Consul info for both Client and Server

Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 1
build:
	prerelease =
	revision = c6d0f9ec
	version = 1.13.1
	version_metadata =
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 10.200.4.70:8300
	server = true
raft:
	applied_index = 994304793
	commit_index = 994304793
	fsm_pending = 0
	last_contact = 45.679195ms
	last_log_index = 994304793
	last_log_term = 419529
	last_snapshot_index = 994294563
	last_snapshot_term = 419529
	latest_configuration = [{Suffrage:Voter ID:0cb59283-962f-9c2a-d850-47dab2a502fc Address:10.200.2.72:8300} {Suffrage:Voter ID:1ca80a4c-6910-8487-6617-24608348ec08 Address:10.200.4.70:8300} {Suffrage:Voter ID:523ec5b0-0454-4e7e-6d0d-47781bd233b6 Address:10.200.10.157:8300} {Suffrage:Voter ID:6024a30e-206e-5f6a-215a-6be0615555c4 Address:10.200.3.166:8300} {Suffrage:Voter ID:54cb2ccb-0e44-f06e-f4ad-d0c6aef3ec29 Address:10.200.1.154:8300}]
	latest_configuration_index = 0
	num_peers = 4
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 419529
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 276620
	max_procs = 4
	os = linux
	version = go1.18.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 23315
	failed = 41
	health_score = 0
	intent_queue = 11318
	left = 1395
	member_time = 14481495
	members = 5456
	query_queue = 0
	query_time = 394

Operating system and Environment details

Ubuntu 18.04, running on m5d.2xlarge on AWS

Log Fragments

2022-08-15T14:29:43.955Z [WARN] agent.server.serf.lan: serf: Intent queue depth (11782) exceeds limit (10966), dropping messages!

@huikang
Collaborator

huikang commented Aug 15, 2022

@orarnon, thanks for reporting. To help debug this issue, could you provide any steps to reproduce the behavior?
Is it caused by force-leave as in #8179?

@orarnon
Author

orarnon commented Aug 16, 2022

Hi @huikang,

No, we have thousands of nodes connected to our Consul servers. I just see this all the time now, coupled with failed RPC calls.
From looking at the metrics, it is not related to join/leave requests.
If you have a tip to help me narrow it down, or can point me in the right direction, I can dig deeper.
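
(For reference, here's a minimal sketch of how I could poll the intent queue depth over time via the agent's metrics endpoint. The `/v1/agent/metrics` endpoint and the `consul.serf.queue.Intent` sample name are assumptions based on Consul's standard telemetry, so treat this as illustrative rather than exact.)

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Assumes a local agent reachable at the default HTTP address.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	for {
		// /v1/agent/metrics returns the same data the agent would ship to a
		// configured telemetry sink (statsd, prometheus, etc.).
		info, err := client.Agent().Metrics()
		if err != nil {
			log.Fatal(err)
		}
		for _, s := range info.Samples {
			// Assumed metric name: consul.serf.queue.Intent should track the
			// same counter reported as serf_lan.intent_queue by `consul info`.
			if s.Name == "consul.serf.queue.Intent" {
				fmt.Printf("%s intent queue: max=%.0f samples=%d\n",
					time.Now().Format(time.RFC3339), s.Max, s.Count)
			}
		}
		time.Sleep(10 * time.Second)
	}
}
```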

@huikang
Collaborator

huikang commented Aug 16, 2022

@orarnon, thanks for the info. I noticed that the version affected is v1.13.1.

  • Did you observe this behavior after upgrading to v1.13.1? If so, what was the server agent version prior to upgrade?
  • What is the client agent version?

@orarnon
Author

orarnon commented Aug 17, 2022

Hi,
It happened before the upgrade as well; I think we were running 1.11.x.
So it's not a new log entry, but according to the linked issue it should have been resolved.

@jkirschner-hashicorp
Contributor

Hi @orarnon,

How many agents do you have in the 5 different Consul datacenters? How frequently are agents leaving or joining? Are you consistently seeing this log message, or only occasionally?

The log message you shared can be seen under conditions of high agent churn. Each time an agent leaves or joins, that leave or join message is broadcast through the cluster via gossip. My understanding is that if there are many gossip messages to broadcast, some older messages in the queue may be discarded.
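
For context, here is a simplified sketch of that behavior (paraphrasing serf's internals from memory, so the names and structure below are illustrative rather than exact): the intent queue has a ceiling that scales with cluster size, roughly twice the current member count with a configurable floor, and once the queue grows past it, the oldest queued broadcasts are pruned, which is exactly the warning you're seeing.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// intentQueue is a stand-in for serf's transmit queue of join/leave intents;
// the real implementation lives in hashicorp/serf and memberlist.
type intentQueue struct {
	mu     sync.Mutex
	queued int
}

func (q *intentQueue) NumQueued() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.queued
}

func (q *intentQueue) Prune(max int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.queued > max {
		q.queued = max
	}
}

// checkQueueDepth approximates serf's periodic check: the limit tracks the
// current member count (roughly 2x), and anything beyond it is dropped,
// which is what produces the warning reported in this issue.
func checkQueueDepth(q *intentQueue, numMembers func() int, interval time.Duration, stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		case <-time.After(interval):
			limit := 2 * numMembers() // with ~5,500 members this lands near the 10,966 limit in the log
			if depth := q.NumQueued(); depth > limit {
				fmt.Printf("[WARN] serf: Intent queue depth (%d) exceeds limit (%d), dropping messages!\n",
					depth, limit)
				q.Prune(limit)
			}
		}
	}
}

func main() {
	q := &intentQueue{queued: 11782} // the depth from the log fragment above
	stop := make(chan struct{})
	go checkQueueDepth(q, func() int { return 5456 }, 200*time.Millisecond, stop)
	time.Sleep(time.Second)
	close(stop)
	fmt.Println("intent queue depth after prune:", q.NumQueued())
}
```

Since the limit already grows with the cluster, consistently exceeding it usually means gossip broadcasts (joins, leaves, failures) are being generated faster than they can drain, rather than the limit simply being too small.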

The issue you linked was fixed by a PR addressing a bug that could unnecessarily cause high gossip broadcast traffic. I think what you're seeing is separate from that issue/PR.

@jkirschner-hashicorp added the waiting-reply and theme/operator-usability labels on Sep 29, 2022
@orarnon
Author

orarnon commented Dec 27, 2022

Hi @jkirschner-hashicorp, we do have nodes leaving and joining, but not in the thousands from what I can tell.
The PR above was merged a long time ago and we are running Consul v1.13.1.

@github-actions bot removed the waiting-reply label on Dec 27, 2022