
telemetry: add Agent TLS Certificate expiration metric #10768

Merged
merged 2 commits into main from dnephin/agent-tls-cert-expiration-metric on Aug 4, 2021

Conversation

@dnephin dnephin commented Aug 4, 2021

This should be the last of the new metrics for #9891

Adds a new metric reporting the number of seconds until the Agent TLS certificate expires, updated hourly.
Also fixes a couple of bugs in the existing metrics and makes one improvement to the error logging.

Tested using socat -d - udp6-listen:8125 and ./consul agent -dev -hcl='telemetry {statsd_address = "localhost:8125"}', along with some agent config to set the CA and certificate for agent TLS.
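
For readers unfamiliar with how such a gauge is computed: the value is just the time between now and the certificate's NotAfter timestamp, expressed in seconds, emitted through the go-metrics SetGauge call that Consul's telemetry builds on. A minimal sketch; the function name and metric key are illustrative, not the PR's actual code:

```go
package telemetry

import (
	"crypto/x509"
	"time"

	"github.com/armon/go-metrics"
)

// emitCertExpiry reports the seconds remaining until the certificate's
// NotAfter timestamp as a gauge. Name and key are illustrative only.
func emitCertExpiry(cert *x509.Certificate) {
	remaining := time.Until(cert.NotAfter)
	metrics.SetGauge([]string{"agent", "tls", "cert", "expiry"}, float32(remaining.Seconds()))
}
```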

@dnephin dnephin added theme/telemetry Anything related to telemetry or observability theme/certificates Related to creating, distributing, and rotating certificates in Consul labels Aug 4, 2021
@github-actions github-actions bot added type/docs Documentation needs to be created/updated/clarified and removed theme/telemetry Anything related to telemetry or observability theme/certificates Related to creating, distributing, and rotating certificates in Consul labels Aug 4, 2021
@hashicorp-ci

🤔 This PR has changes in the website/ directory but does not have a type/docs-cherrypick label. If the changes are for the next version, this can be ignored. If they are updates to current docs, attach the label to auto cherrypick to the stable-website branch after merging.

1. Do not emit the metric if Query fails.
2. Properly check for PrimaryUsersIntermediate; the logic was inverted.

Also improve the logging by including the metric name in the log message.
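
To illustrate the first and third points of this commit message (skip emission when the query fails, and include the metric name in log output), here is a hedged sketch; the identifiers are hypothetical and this is not the actual diff:

```go
package telemetry

import (
	"strings"
	"time"

	"github.com/armon/go-metrics"
	"github.com/hashicorp/go-hclog"
)

// emitIfQueryable emits the gauge only when the expiry lookup succeeds; on
// failure it logs the error together with the metric name and skips emission
// rather than reporting a stale or zero value. All names are hypothetical.
func emitIfQueryable(logger hclog.Logger, key []string, query func() (time.Duration, error)) {
	remaining, err := query()
	if err != nil {
		logger.Warn("failed to emit certificate expiry metric",
			"metric", strings.Join(key, "."), "error", err)
		return
	}
	metrics.SetGauge(key, float32(remaining.Seconds()))
}
```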
@dnephin dnephin force-pushed the dnephin/agent-tls-cert-expiration-metric branch from 2318182 to 9420506 Compare August 4, 2021 17:51
@vercel vercel bot temporarily deployed to Preview – consul August 4, 2021 17:52 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging August 4, 2021 17:52 Inactive
@dnephin dnephin requested review from dhiaayachi and a team August 4, 2021 17:57

@mkeeler mkeeler left a comment

The code here looks good and correct but I have a few meta points of feedback for potential future enhancements.

First is that metrics that are conditionally emitted only by the leader are harder to deal with than metrics emitted by all servers (due to stale values emitted by the previous leader hanging around for a while). It might be a worthwhile change to have all servers emit the connect ca expiry metrics if possible.

Secondly, it could be useful to have certificate updates trigger immediate metrics emission. I am thinking of a scenario where the agent TLS certificate is being monitored and it gets below some threshold. Someone gets paged, fixes the certificate, and reloads Consul. However, it may take another hour until the metric goes back to normal, and during that time the monitor might keep firing. It would be a minor quality-of-life improvement to update the metrics when the cert is updated so that any monitors can auto-resolve.
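
A rough sketch of what that second suggestion could look like, re-emitting as soon as the certificate is reloaded instead of waiting for the next hourly tick; the certUpdated channel and emit callback are assumptions for illustration, not Consul's actual API:

```go
package telemetry

import (
	"context"
	"time"
)

// monitorWithReload emits on every tick and also whenever certUpdated fires
// (e.g. after a config reload installs a new certificate), so that alerting
// monitors can auto-resolve without waiting up to a full interval.
func monitorWithReload(ctx context.Context, interval time.Duration, certUpdated <-chan struct{}, emit func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			emit()
		case <-certUpdated:
			emit()
		}
	}
}
```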

ticker := time.NewTicker(certExpirationMonitorInterval)
// Stop the ticker when the monitor goroutine exits.
defer ticker.Stop()

// Attach the metric name to every log message from this monitor.
logger := m.Logger.With("metric", strings.Join(m.Key, "."))

for {
    select {
    case <-ctx.Done():
        return nil
    case <-ticker.C:
Member

Will this mean that the metric/logs do not get emitted until the monitoring interval has passed (so after an hour)? Or does the ticker fire immediately and then again after each interval?

Contributor Author

It does not fire immediately. I think that would be a good improvement to handle the scenario you mentioned.

Contributor Author

Added that in #10771.

dnephin commented Aug 4, 2021

First is that metrics that are conditionally emitted only by the leader are harder to deal with than metrics emitted by all servers (due to stale values emitted by the previous leader hanging around for a while). It might be a worthwhile change to have all servers emit the connect ca expiry metrics if possible.

I haven't run into this problem with Datadog/statsd. Is it mostly an issue with Prometheus?

Would it be sufficient to have the non-leaders emit NaN for these metrics?
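
For concreteness, the NaN option being floated here might look like the following with go-metrics; the key and leadership check are placeholders, and whether a given sink (statsd, Datadog, Prometheus) handles NaN gracefully is exactly the open question:

```go
package telemetry

import (
	"math"

	"github.com/armon/go-metrics"
)

// emitCAExpiry reports the CA expiry gauge; non-leaders publish NaN instead
// of a real value. The key and parameters are illustrative placeholders.
func emitCAExpiry(isLeader bool, secondsUntilExpiry float64) {
	key := []string{"connect", "ca", "expiry"}
	if !isLeader {
		metrics.SetGauge(key, float32(math.NaN()))
		return
	}
	metrics.SetGauge(key, float32(secondsUntilExpiry))
}
```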

dnephin commented Aug 4, 2021

Going to address follow-ups in #10771, pending discussion about emitting NaN vs. emitting the metric on all servers.

@dnephin dnephin merged commit e940168 into main Aug 4, 2021
@dnephin dnephin deleted the dnephin/agent-tls-cert-expiration-metric branch August 4, 2021 22:42
@hc-github-team-consul-core

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/422411.
