Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUGFIX] Avoid GetDatacenter* methods to flood Consul servers logs #8685

Conversation

pierresouchay
Copy link
Contributor

When calling GetDatacentersByDistance() or GetDatacentersMap(), an
incorrect condition was used to diplay log message, thus flooding
Consul's logs.

Example of message:

  [WARN] agent.router: Non-server in server-only area: non_server=myClientNode area=lan

This message is only valid for WAN areas, filter to avoid creating
hundreds of logs/s on our custers, each time someone is calling this
method.

Our logs were flooded by such messages when migrating our Consul servers
from 1.7.7 to 1.8.4.

This will issue fix #8663

When calling `GetDatacentersByDistance()` or `GetDatacentersMap()`, an
incorrect condition was used to diplay log message, thus flooding
Consul's logs.

Example of message:

```
  [WARN] agent.router: Non-server in server-only area: non_server=myClientNode area=lan
```

This message is only valid for WAN areas, filter to avoid creating
hundreds of logs/s on our clusters, each time someone is calling this
method.

Our logs were flooded by such messages when migrating our Consul servers
from 1.7.7 to 1.8.4.

This will issue fix hashicorp#8663
@pierresouchay pierresouchay force-pushed the do_not_flood_logs_with_Non-server_in_server-only_area branch from 239840f to 191d098 Compare September 15, 2020 10:19
@pierresouchay
Copy link
Contributor Author

Force pushed to pass unstable test TestACLTokenReap_Primary

@pierresouchay
Copy link
Contributor Author

Other unstable test TestLeader_SecondaryCA_IntermediateRenew

@pierresouchay
Copy link
Contributor Author

We applied this patch in production and it is working as expected (I mean, not message flooding with command consul monitor)

@dnephin dnephin added the theme/internals Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics label Sep 15, 2020
Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR! I think this may be related to #8559

This looks like it should fix the warnings, but I wonder if we should be skipping more of the work. In these two functions should we if areaID == types.AreaLAN { continue } as the first step in the loops?

I think @mkeeler had something like this originally, but I thought it was a concern for the API, so suggested me move it. It looks like I may have been wrong about that.

Or maybe these warnings are not correct anymore? Should we remove the warnings entirely if the original assumptions have changed? When would we expect to see these warnings?

@pierresouchay
Copy link
Contributor Author

pierresouchay commented Sep 15, 2020

Hello @dnephin I took the same exact pattern as @mkeeler : https://github.com/hashicorp/consul/pull/8559/files#diff-08b62bd4c7f28dd9659a6289cc92698eR160
I am ok to change it to if areaID == types.AreaLAN { continue } if you prefer, tell me, I'll do it in a minute (but tests might fail as are currently unstable, see #8686 - I had to push force several times)

EDIT: doing if areaID == types.AreaLAN { continue } is much more intrusive:

Next lines, such as:

			if parts.Datacenter == r.localDatacenter {
				// Everything in the local datacenter looks like zero RTT.
				index[parts.Datacenter] = append(existing, 0.0)
			} else {
				...

would be changed... so as a bug fix, I would suggest to keep it like this as it would require more significant changes

EDIT2: I think those warning were added to mimic other code, but author did not think about the case of the local DC were all members might be there (while it is legit for foreign DCs)

@pierresouchay
Copy link
Contributor Author

Note: this bugfix is really important on large clusters. We froze deployment on prod because of it. On our "small" preprod clusters (~700 nodes per DC) , we have several 10s of those messages/sec, we are using those calls in prod (~1qps), so we would probably have ~5k messages like this per second on servers... Because 1 call creates n messages (n being the number of nodes on your cluster, we had up to 7k servers/cluster)

Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix! I agree we should get this merged and worry about a refactor later.

.changelog/8685.txt Outdated Show resolved Hide resolved
@dnephin dnephin merged commit 3995cc3 into hashicorp:master Sep 15, 2020
@hashicorp-ci
Copy link
Contributor

🍒✅ Cherry pick of commit 3995cc3 onto release/1.8.x succeeded!

hashicorp-ci pushed a commit that referenced this pull request Sep 15, 2020
…n-server_in_server-only_area

[BUGFIX] Avoid GetDatacenter* methods to flood Consul servers logs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/internals Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consul Server logs flooded with WARN message
3 participants