Add metric for remaining lifetime of certificates authenticating requests #50387

Merged

jcbsmpsn
Contributor

@jcbsmpsn jcbsmpsn commented Aug 9, 2017

fixes #50778

When incoming requests to the API server are authenticated by a certificate, the expiration of that certificate affects the validity of the authentication. With automatic certificate rotation, which is starting with kubelet certificates, the goal is to use shorter lifetimes and let the kubelet renew its certificate as needed. Certificates that approach expiration without being renewed are an early warning sign that nodes are about to stop participating in the cluster, so they are worth monitoring.

Release note:

Add new Prometheus metric that monitors the remaining lifetime of certificates used to authenticate requests to the API server.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 9, 2017
@k8s-github-robot k8s-github-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Aug 9, 2017
@jcbsmpsn
Contributor Author

jcbsmpsn commented Aug 9, 2017

/assign @crassirostris

@@ -71,6 +85,8 @@ func (a *Authenticator) AuthenticateRequest(req *http.Request) (user.Info, bool,
}
}

remaining := req.TLS.PeerCertificates[0].NotAfter.Sub(time.Now())
clientCertificateExpirationGauge.WithLabelValues(req.TLS.PeerCertificates[0].Subject.CommonName).Set(float64(remaining / time.Second))
Member

@liggitt liggitt Aug 9, 2017

the cardinality on this is way too high. a label per client and essentially infinite possible values. see the warning at the bottom of https://prometheus.io/docs/practices/naming/

Contributor Author

Would you consider the number of nodes in the cluster to be too high cardinality? It would be useful to have the information to identify which nodes are about to drop out of the cluster because they are failing to update certificates.

Member

Would you consider the number of nodes in the cluster to be too high cardinality?

Yes. I'm not sure about using metrics as a means of communicating health of specific clients. I'd expect that more on node status or maybe in events.

Contributor Author

We want something that can drive alerts for production monitoring. I'll see what can be done with node status or events.

Contributor Author

I've changed the metric to a histogram that reports counts in fixed buckets, with no detail about the source, so the cardinality is low. I'll follow up with an addition to node status so that it is easy to query nodes for their remaining credential lifetime.
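
To make the cardinality argument in this thread concrete, here is a rough sketch that counts how many time series each design would expose. The 5000-node cluster, the node names, and the bucket bounds are made-up values for illustration; the histogram count relies on the fact that an unlabeled Prometheus histogram exposes one series per configured bucket, an implicit `+Inf` bucket, and the `_sum` and `_count` series.

```go
package main

import "fmt"

// distinctLabelValues counts the series a gauge would expose if it carried
// one label value per client common name.
func distinctLabelValues(commonNames []string) int {
	seen := make(map[string]struct{}, len(commonNames))
	for _, cn := range commonNames {
		seen[cn] = struct{}{}
	}
	return len(seen)
}

// histogramSeries counts the series an unlabeled Prometheus histogram
// exposes: one per configured bucket, the implicit +Inf bucket, plus the
// _sum and _count series.
func histogramSeries(buckets []float64) int {
	return len(buckets) + 3
}

func main() {
	// Hypothetical cluster size and node names, chosen only for illustration.
	names := make([]string, 0, 5000)
	for i := 0; i < 5000; i++ {
		names = append(names, fmt.Sprintf("system:node:node-%d", i))
	}

	// Illustrative upper bounds in seconds, not the PR's final bucket list.
	buckets := []float64{0, 21600, 43200}

	fmt.Println(distinctLabelValues(names)) // 5000
	fmt.Println(histogramSeries(buckets))   // 6
}
```

The per-client gauge grows with the number of clients, while the histogram stays at a fixed size no matter how many nodes authenticate.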

@jcbsmpsn jcbsmpsn force-pushed the metric-certificate-expiration branch from 1e68ba4 to 6f9fe1e Compare August 10, 2017 18:58
prometheus.HistogramOpts{
Name: "apiserver_client_certificate_expiration_gauge",
Help: "Gauge of the remaining lifetime on the certificate used to authenticate a request.",
Buckets: []float64{
Member

a bucket for negative (expired) would probably be good

Contributor Author

Done.

@crassirostris crassirostris left a comment

Sorry for the delay!

Name: "apiserver_client_certificate_expiration_gauge",
Help: "Gauge of the remaining lifetime on the certificate used to authenticate a request.",
Buckets: []float64{
float64(6 * time.Hour / time.Second),

(6 * time.Hour).Seconds() I think looks better

Contributor Author

Done.


utilerrors "k8s.io/apimachinery/pkg/util/errors"
"k8s.io/apimachinery/pkg/util/sets"
"k8s.io/apiserver/pkg/authentication/authenticator"
"k8s.io/apiserver/pkg/authentication/user"
)

var clientCertificateExpirationGauge = prometheus.NewHistogram(
prometheus.HistogramOpts{
Name: "apiserver_client_certificate_expiration_gauge",

Namespace: "apiserver",
Subsystem: "client",
Name: "certificate_expiration_seconds",

Contributor Author

Done.


utilerrors "k8s.io/apimachinery/pkg/util/errors"
"k8s.io/apimachinery/pkg/util/sets"
"k8s.io/apiserver/pkg/authentication/authenticator"
"k8s.io/apiserver/pkg/authentication/user"
)

var clientCertificateExpirationGauge = prometheus.NewHistogram(

Gauge?

Contributor Author

I assume you mean in the name. Updated.

@@ -71,6 +92,8 @@ func (a *Authenticator) AuthenticateRequest(req *http.Request) (user.Info, bool,
}
}

remaining := req.TLS.PeerCertificates[0].NotAfter.Sub(time.Now())
clientCertificateExpirationGauge.Observe(float64(remaining / time.Second))

remaining.Seconds()

Contributor Author

Done.

@jcbsmpsn jcbsmpsn force-pushed the metric-certificate-expiration branch 2 times, most recently from 4003597 to ac4aa6d Compare August 10, 2017 23:13
Name: "certificate_expiration_seconds",
Help: "Gauge of the remaining lifetime on the certificate used to authenticate a request.",
Buckets: []float64{
math.Inf(-1),
Member

I thought the buckets held the max, so a bucket for negative numbers would be set to 0?

Contributor Author

@jcbsmpsn jcbsmpsn Aug 11, 2017

Yes, silly me. Done.

@jcbsmpsn jcbsmpsn force-pushed the metric-certificate-expiration branch from ac4aa6d to 93baf71 Compare August 11, 2017 01:15
@liggitt
Member

liggitt commented Aug 11, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 11, 2017
@crassirostris crassirostris left a comment

Last nits

Namespace: "apiserver",
Subsystem: "client",
Name: "certificate_expiration_seconds",
Help: "Gauge of the remaining lifetime on the certificate used to authenticate a request.",

Gauge of the remaining

I'd say "Distribution of the remaining"

Contributor Author

Done.

Help: "Gauge of the remaining lifetime on the certificate used to authenticate a request.",
Buckets: []float64{
0,
float64((6 * time.Hour).Seconds()),

Cast is unnecessary, .Seconds() returns float64

Contributor Author

Done.

@@ -71,6 +95,8 @@ func (a *Authenticator) AuthenticateRequest(req *http.Request) (user.Info, bool,
}
}

remaining := req.TLS.PeerCertificates[0].NotAfter.Sub(time.Now())
clientCertificateExpirationHistogram.Observe(float64(remaining.Seconds()))

Same as above

Contributor Author

Done.

@crassirostris

@jcbsmpsn There should be an associated issue or @liggitt should comment /approve no-issue

@jcbsmpsn jcbsmpsn force-pushed the metric-certificate-expiration branch from 93baf71 to 49a19c6 Compare August 11, 2017 18:19
@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 11, 2017
@crassirostris

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 12, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: crassirostris, jcbsmpsn, liggitt

No associated issue. Update pull-request body to add a reference to an issue, or get approval with /approve no-issue

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@jcbsmpsn
Contributor Author

/test pull-kubernetes-federation-e2e-gce

@crassirostris

/retest

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 16, 2017
@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 6bc0b29 into kubernetes:master Aug 16, 2017
@@ -71,6 +95,8 @@ func (a *Authenticator) AuthenticateRequest(req *http.Request) (user.Info, bool,
}
}

remaining := req.TLS.PeerCertificates[0].NotAfter.Sub(time.Now())
clientCertificateExpirationHistogram.Observe(remaining.Seconds())
Member

@brancz brancz Sep 19, 2017

Wouldn't it make sense to measure this after verifying? If I understand this correctly, a client can currently influence these metrics by presenting certificates with arbitrary expiry dates. I can't see a particularly useful attack here, just a potential for distortion. It would be simple to move this measurement after the certificate verification 4 lines below 🙂. Please let me know if I'm missing something.

jpbetz referenced this pull request Dec 6, 2017
…0387-#56444-release-1.8

Automated cherry pick of #56444 to release-1.8
k8s-github-robot pushed a commit that referenced this pull request Dec 8, 2017
…0387-#56444-release-1.7

Automatic merge from submit-queue.

Automated cherry pick of #50387 #56444 release-1.7

Automated cherry pick of #50387 #56444 release 1.7
Successfully merging this pull request may close these issues.

Add a metric tracking the remaining validity of kubelet certificates
7 participants