
telemetry: add Agent TLS Certificate expiration metric #10768

Merged
merged 2 commits into main from dnephin/agent-tls-cert-expiration-metric on Aug 4, 2021

Conversation

@dnephin dnephin commented Aug 4, 2021

This should be the last of the new metrics for #9891

Adds a new metric reporting the number of seconds until the Agent TLS certificate expires, updated hourly.
Also fixes a couple of bugs in the existing metrics and makes one improvement to the error logging.

Tested using socat -d - udp6-listen:8125 and ./consul agent -dev -hcl='telemetry {statsd_address = "localhost:8125"}', along with some agent config to set the CA and certificate for agent TLS.
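
For readers unfamiliar with how such a gauge is computed: the value is just the time between now and the certificate's NotAfter timestamp, expressed in seconds, emitted through the go-metrics SetGauge call that Consul's telemetry builds on. A minimal sketch; the function name and metric key are illustrative, not the PR's actual code:

```go
package telemetry

import (
	"crypto/x509"
	"time"

	"github.com/armon/go-metrics"
)

// emitCertExpiry reports the seconds remaining until the certificate's
// NotAfter timestamp as a gauge. Name and key are illustrative only.
func emitCertExpiry(cert *x509.Certificate) {
	remaining := time.Until(cert.NotAfter)
	metrics.SetGauge([]string{"agent", "tls", "cert", "expiry"}, float32(remaining.Seconds()))
}
```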

@dnephin dnephin added theme/telemetry Anything related to telemetry or observability theme/certificates Related to creating, distributing, and rotating certificates in Consul labels Aug 4, 2021
@github-actions github-actions bot added type/docs Documentation needs to be created/updated/clarified and removed theme/telemetry Anything related to telemetry or observability theme/certificates Related to creating, distributing, and rotating certificates in Consul labels Aug 4, 2021
@hashicorp-ci

🤔 This PR has changes in the website/ directory but does not have a type/docs-cherrypick label. If the changes are for the next version, this can be ignored. If they are updates to current docs, attach the label to auto cherrypick to the stable-website branch after merging.

1. Do not emit the metric if Query fails.
2. Properly check for PrimaryUsersIntermediate; the logic was inverted.

Also improve the logging by including the metric name in the log message.
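
To illustrate the first and third points of this commit message (skip emission when the query fails, and include the metric name in log output), here is a hedged sketch; the identifiers are hypothetical and this is not the actual diff:

```go
package telemetry

import (
	"strings"
	"time"

	"github.com/armon/go-metrics"
	"github.com/hashicorp/go-hclog"
)

// emitIfQueryable emits the gauge only when the expiry lookup succeeds; on
// failure it logs the error together with the metric name and skips emission
// rather than reporting a stale or zero value. All names are hypothetical.
func emitIfQueryable(logger hclog.Logger, key []string, query func() (time.Duration, error)) {
	remaining, err := query()
	if err != nil {
		logger.Warn("failed to emit certificate expiry metric",
			"metric", strings.Join(key, "."), "error", err)
		return
	}
	metrics.SetGauge(key, float32(remaining.Seconds()))
}
```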
@dnephin dnephin force-pushed the dnephin/agent-tls-cert-expiration-metric branch from 2318182 to 9420506 Compare August 4, 2021 17:51
@vercel vercel bot temporarily deployed to Preview – consul August 4, 2021 17:52 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging August 4, 2021 17:52 Inactive
@dnephin dnephin requested review from dhiaayachi and a team August 4, 2021 17:57

@mkeeler mkeeler left a comment

The code here looks good and correct but I have a few meta points of feedback for potential future enhancements.

First is that metrics that are conditionally emitted only by the leader are harder to deal with than metrics emitted by all servers (due to stale values emitted by the previous leader hanging around for a while). It might be a worthwhile change to have all servers emit the connect ca expiry metrics if possible.

Secondly, it could be useful to have certificate updates trigger immediate metrics emission. I am thinking of a scenario where the agent TLS certificate is being monitored and it gets below some threshold. Someone gets paged, fixes the certificate, and reloads Consul. However, it may take another hour until the metric goes back to normal, and during that time the monitor might keep firing. It would be a minor quality-of-life improvement to update the metrics when the cert is updated so that any monitors can auto-resolve.
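
A rough sketch of what that second suggestion could look like, re-emitting as soon as the certificate is reloaded instead of waiting for the next hourly tick; the certUpdated channel and emit callback are assumptions for illustration, not Consul's actual API:

```go
package telemetry

import (
	"context"
	"time"
)

// monitorWithReload emits on every tick and also whenever certUpdated fires
// (e.g. after a config reload installs a new certificate), so that alerting
// monitors can auto-resolve without waiting up to a full interval.
func monitorWithReload(ctx context.Context, interval time.Duration, certUpdated <-chan struct{}, emit func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			emit()
		case <-certUpdated:
			emit()
		}
	}
}
```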

ticker := time.NewTicker(certExpirationMonitorInterval)
// Stop the ticker when the monitor goroutine exits.
defer ticker.Stop()

// Attach the metric name to every log message from this monitor.
logger := m.Logger.With("metric", strings.Join(m.Key, "."))

for {
    select {
    case <-ctx.Done():
        return nil
    case <-ticker.C:
Member

Will this mean that the metric/logs do not get emitted until the monitoring interval has passed (so after an hour)? Or does the ticker fire immediately and then again after each interval?

Contributor Author

It does not fire immediately. I think that would be a good improvement to handle the scenario you mentioned.

Contributor Author

Added that in #10771.

dnephin commented Aug 4, 2021

First is that metrics that are conditionally emitted only by the leader are harder to deal with than metrics emitted by all servers (due to stale values emitted by the previous leader hanging around for a while). It might be a worthwhile change to have all servers emit the connect ca expiry metrics if possible.

I haven't run into this problem with Datadog/statsd. Is it mostly an issue with Prometheus?

Would it be sufficient to have the non-leaders emit NaN for these metrics?
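
For concreteness, the NaN option being floated here might look like the following with go-metrics; the key and leadership check are placeholders, and whether a given sink (statsd, Datadog, Prometheus) handles NaN gracefully is exactly the open question:

```go
package telemetry

import (
	"math"

	"github.com/armon/go-metrics"
)

// emitCAExpiry reports the CA expiry gauge; non-leaders publish NaN instead
// of a real value. The key and parameters are illustrative placeholders.
func emitCAExpiry(isLeader bool, secondsUntilExpiry float64) {
	key := []string{"connect", "ca", "expiry"}
	if !isLeader {
		metrics.SetGauge(key, float32(math.NaN()))
		return
	}
	metrics.SetGauge(key, float32(secondsUntilExpiry))
}
```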

dnephin commented Aug 4, 2021

Going to address follow-ups in #10771, pending discussion about emitting NaN vs. emitting the metric on all servers.

@dnephin dnephin merged commit e940168 into main Aug 4, 2021
@dnephin dnephin deleted the dnephin/agent-tls-cert-expiration-metric branch August 4, 2021 22:42
@hc-github-team-consul-core

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/422411.
