[BUGFIX] Avoid GetDatacenter* methods to flood Consul servers logs #8685
When calling `GetDatacentersByDistance()` or `GetDatacentersMap()`, an incorrect condition was used to display a log message, thus flooding Consul's logs.

Example of message:

```
[WARN] agent.router: Non-server in server-only area: non_server=myClientNode area=lan
```

This message is only valid for WAN areas, so filter on the area to avoid creating hundreds of log lines per second on our clusters each time someone calls these methods.

Our logs were flooded by such messages when migrating our Consul servers from 1.7.7 to 1.8.4.

This will fix issue hashicorp#8663
Force pushed to pass unstable test TestACLTokenReap_Primary
Other unstable test TestLeader_SecondaryCA_IntermediateRenew
We applied this patch in production and it is working as expected (I mean, no message flooding with command |
Thank you for the PR! I think this may be related to #8559
This looks like it should fix the warnings, but I wonder if we should be skipping more of the work. In these two functions, should we do `if areaID == types.AreaLAN { continue }` as the first step in the loops?
I think @mkeeler had something like this originally, but I thought it was a concern for the API, so I suggested we move it. It looks like I may have been wrong about that.
Or maybe these warnings are not correct anymore? Should we remove the warnings entirely if the original assumptions have changed? When would we expect to see these warnings?
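To make the trade-off between the two placements concrete, here is a small sketch under stated assumptions. `Area`, `LAN`, `Node`, and `CountServers` are hypothetical names, and the per-area server count stands in for whatever the real functions aggregate. In this simplified model, skipping the LAN area at the top of the loop also drops its servers from the result, while guarding only the warning keeps them.

```go
package main

// Area and LAN are hypothetical stand-ins for Consul's area identifiers.
type Area string

const LAN Area = "lan"

// Node is a simplified, assumed view of a cluster member.
type Node struct {
	Name   string
	Server bool
}

// CountServers returns the number of servers seen per area and the number
// of warnings that would be logged. The warning is always guarded so that
// LAN clients are never reported. When skipLANArea is true, the LAN area
// is additionally skipped as the first step of the loop.
func CountServers(areas map[Area][]Node, skipLANArea bool) (map[Area]int, int) {
	servers := make(map[Area]int)
	warnings := 0
	for id, nodes := range areas {
		if skipLANArea && id == LAN {
			continue // early exit: nothing in this area is processed
		}
		for _, n := range nodes {
			if !n.Server {
				if id != LAN {
					warnings++ // only non-LAN areas are server-only
				}
				continue
			}
			servers[id]++
		}
	}
	return servers, warnings
}
```

Both variants produce the same number of warnings in this model, but only the guarded one still counts the LAN servers. Whether that difference matters for the real functions depends on how they use the LAN area, which this sketch does not attempt to reproduce.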
Hello @dnephin I took the same exact pattern as @mkeeler : https://github.com/hashicorp/consul/pull/8559/files#diff-08b62bd4c7f28dd9659a6289cc92698eR160 EDIT: doing Next lines, such as:
would be changed... so as a bug fix, I would suggest keeping it like this, as the alternative would require more significant changes. EDIT2: I think those warnings were added to mimic other code, but the author did not think about the case of the local DC, where all members might be there (while it is legit for foreign DCs)
Note: this bugfix is really important on large clusters. We froze deployment on prod because of it. On our "small" preprod clusters (~700 nodes per DC), we see several tens of those messages per second. We use those calls in prod (~1 qps), so we would probably have ~5k messages like this per second on servers, because 1 call creates n messages (n being the number of nodes in your cluster; we had up to 7k servers/cluster).
Thank you for the fix! I agree we should get this merged and worry about a refactor later.
🍒✅ Cherry pick of commit 3995cc3 onto |
…n-server_in_server-only_area [BUGFIX] Avoid GetDatacenter* methods to flood Consul servers logs