-
Notifications
You must be signed in to change notification settings - Fork 38.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metric for remaining lifetime of certificates authenticating requests #50387
Add metric for remaining lifetime of certificates authenticating requests #50387
Conversation
/assign @crassirostris |
@@ -71,6 +85,8 @@ func (a *Authenticator) AuthenticateRequest(req *http.Request) (user.Info, bool, | |||
} | |||
} | |||
|
|||
remaining := req.TLS.PeerCertificates[0].NotAfter.Sub(time.Now()) | |||
clientCertificateExpirationGauge.WithLabelValues(req.TLS.PeerCertificates[0].Subject.CommonName).Set(float64(remaining / time.Second)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the cardinality on this is way too high. a label per client and essentially infinite possible values. see the warning at the bottom of https://prometheus.io/docs/practices/naming/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you consider the number of nodes in the cluster to be too high cardinality? It would be useful to have the information to identify which nodes are about to drop out of the cluster because they are failing to update certificates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you consider the number of nodes in the cluster to be too high cardinality?
Yes. I'm not sure about using metrics as a means of communicating health of specific clients. I'd expect that more on node status or maybe in events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want something that can be set to have alerts for monitoring in production. I'll see what can be done with node status or events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed the metric to a histogram to report counts in certain buckets, no detail about source, so the cardinality is low. I'll follow up with an addition to node status so it is possible to easily query nodes about remaining credential lifetime.
1e68ba4
to
6f9fe1e
Compare
prometheus.HistogramOpts{ | ||
Name: "apiserver_client_certificate_expiration_gauge", | ||
Help: "Gauge of the remaining lifetime on the certificate used to authenticate a request.", | ||
Buckets: []float64{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a bucket for negative (expired) would probably be good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay!
Name: "apiserver_client_certificate_expiration_gauge", | ||
Help: "Gauge of the remaining lifetime on the certificate used to authenticate a request.", | ||
Buckets: []float64{ | ||
float64(6 * time.Hour / time.Second), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(6 * time.Hour).Seconds()
I think looks better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
utilerrors "k8s.io/apimachinery/pkg/util/errors" | ||
"k8s.io/apimachinery/pkg/util/sets" | ||
"k8s.io/apiserver/pkg/authentication/authenticator" | ||
"k8s.io/apiserver/pkg/authentication/user" | ||
) | ||
|
||
var clientCertificateExpirationGauge = prometheus.NewHistogram( | ||
prometheus.HistogramOpts{ | ||
Name: "apiserver_client_certificate_expiration_gauge", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Namespace: "apiserver",
Subsystem: "client",
Name: "certificate_expiration_seconds"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
utilerrors "k8s.io/apimachinery/pkg/util/errors" | ||
"k8s.io/apimachinery/pkg/util/sets" | ||
"k8s.io/apiserver/pkg/authentication/authenticator" | ||
"k8s.io/apiserver/pkg/authentication/user" | ||
) | ||
|
||
var clientCertificateExpirationGauge = prometheus.NewHistogram( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gauge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you mean in the name. Updated.
@@ -71,6 +92,8 @@ func (a *Authenticator) AuthenticateRequest(req *http.Request) (user.Info, bool, | |||
} | |||
} | |||
|
|||
remaining := req.TLS.PeerCertificates[0].NotAfter.Sub(time.Now()) | |||
clientCertificateExpirationGauge.Observe(float64(remaining / time.Second)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remaining.Seconds()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
4003597
to
ac4aa6d
Compare
Name: "certificate_expiration_seconds", | ||
Help: "Gauge of the remaining lifetime on the certificate used to authenticate a request.", | ||
Buckets: []float64{ | ||
math.Inf(-1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the buckets held the max, so a bucket for negative numbers would be set to 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, silly me. Done.
ac4aa6d
to
93baf71
Compare
/lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last nits
Namespace: "apiserver", | ||
Subsystem: "client", | ||
Name: "certificate_expiration_seconds", | ||
Help: "Gauge of the remaining lifetime on the certificate used to authenticate a request.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gauge of the remaining
I'd say "Distribution of the remaining"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Help: "Gauge of the remaining lifetime on the certificate used to authenticate a request.", | ||
Buckets: []float64{ | ||
0, | ||
float64((6 * time.Hour).Seconds()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cast is unnecessary, .Seconds()
returns float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -71,6 +95,8 @@ func (a *Authenticator) AuthenticateRequest(req *http.Request) (user.Info, bool, | |||
} | |||
} | |||
|
|||
remaining := req.TLS.PeerCertificates[0].NotAfter.Sub(time.Now()) | |||
clientCertificateExpirationHistogram.Observe(float64(remaining.Seconds())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
When incoming requests to the API server are authenticated by a certificate, the expiration of the certificate can affect the validity of the authentication. With auto rotation of certificates, which is starting with kubelet certificates, the goal is to use shorter lifetimes and let the kubelet renew the certificate as desired. Monitoring certificates which are approaching expiration and not renewing would be an early warning sign that nodes are about to stop participating in the cluster.
93baf71
to
49a19c6
Compare
/lgtm |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: crassirostris, jcbsmpsn, liggitt No associated issue. Update pull-request body to add a reference to an issue, or get approval with The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
/test pull-kubernetes-federation-e2e-gce |
/retest |
/test all [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue |
@@ -71,6 +95,8 @@ func (a *Authenticator) AuthenticateRequest(req *http.Request) (user.Info, bool, | |||
} | |||
} | |||
|
|||
remaining := req.TLS.PeerCertificates[0].NotAfter.Sub(time.Now()) | |||
clientCertificateExpirationHistogram.Observe(remaining.Seconds()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it make sense to measure this after verifying? If I understand this correctly a client could now influence these metrics by generating certificates with random expiry dates, not that I can see a particularly useful attack here, just a potential for distortion. It would be simple to just move this measurement after the certificate verification 4 lines below 🙂. Please let me know if I'm missing something.
fixes #50778
When incoming requests to the API server are authenticated by a certificate, the expiration of the certificate can affect the validity of the authentication. With auto rotation of certificates, which is starting with kubelet certificates, the goal is to use shorter lifetimes and let the kubelet renew the certificate as desired. Monitoring certificates which are approaching expiration and not renewing would be an early warning sign that nodes are about to stop participating in the cluster.
Release note: