Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kubelet: add certificate rotation error metric #84614

Conversation

@rphillips
Copy link
Member

rphillips commented Oct 31, 2019

What type of PR is this?
/priority important-soon
/kind feature
/sig node
@kubernetes/sig-node-pr-reviews
@dashpole @mrunalp @derekwaynecarr

What this PR does / why we need it:
Creates a metric to keep track if the Kubelet is failing to rotate it's certificate. The metric can then be used by monitoring to notify the user to take some action. Presently, the node can go into a failure state and become unschedulable without any noticeable notification.

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

kubelet now exports a "server_expiration_renew_failure" and "client_expiration_renew_failure" metric counter if the certificate rotations cannot be performed.
@rphillips

This comment has been minimized.

Copy link
Member Author

rphillips commented Oct 31, 2019

/cc @sjenning

@k8s-ci-robot k8s-ci-robot requested a review from sjenning Oct 31, 2019
@fedebongio

This comment has been minimized.

Copy link
Contributor

fedebongio commented Oct 31, 2019

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Oct 31, 2019

@fedebongio: GitHub didn't allow me to assign the following users: sambdavidson.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @logicalhan @sambdavidson

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rphillips rphillips force-pushed the rphillips:fixes/add_cert_rotation_failure_metric branch 2 times, most recently from 5df6f6c to cf29d04 Oct 31, 2019
Copy link
Contributor

sambdavidson left a comment

Keep in mind I have a PR #84534 which edits these exact files in similar ways. It is likely that my PR will be submitted first as it is already mostly reviewed. So you might have run into a ton of merge conflicts if you don't rebase on top of mine.

pkg/kubelet/certificate/kubelet.go Outdated Show resolved Hide resolved
pkg/kubelet/certificate/kubelet.go Outdated Show resolved Hide resolved
@rphillips rphillips force-pushed the rphillips:fixes/add_cert_rotation_failure_metric branch from cf29d04 to 4702430 Oct 31, 2019
@rphillips

This comment has been minimized.

Copy link
Member Author

rphillips commented Oct 31, 2019

Fixed to use Counters.

@rphillips

This comment has been minimized.

Copy link
Member Author

rphillips commented Nov 1, 2019

/retest

@rphillips rphillips force-pushed the rphillips:fixes/add_cert_rotation_failure_metric branch from 4702430 to e1533f6 Nov 4, 2019
@sjenning

This comment has been minimized.

Copy link
Contributor

sjenning commented Nov 4, 2019

/lgtm
/cc @derekwaynecarr

@rphillips

This comment has been minimized.

Copy link
Member Author

rphillips commented Nov 4, 2019

rebased... this is ready for review

@rphillips rphillips force-pushed the rphillips:fixes/add_cert_rotation_failure_metric branch from e1533f6 to 4a645c8 Nov 4, 2019
@rphillips rphillips force-pushed the rphillips:fixes/add_cert_rotation_failure_metric branch from 0594d5f to b7e18a8 Nov 5, 2019
@rphillips

This comment has been minimized.

Copy link
Member Author

rphillips commented Nov 5, 2019

/test pull-kubernetes-dependencies

@rphillips

This comment has been minimized.

Copy link
Member Author

rphillips commented Nov 5, 2019

go: golang.org/x/lint@v0.0.0-20190409202823-959b441ac42: unexpected status (https://proxy.golang.org/golang.org/x/lint/@v/v0.0.0-20190409202823-959b441ac42.info): 410 Gone
@@ -40,6 +40,7 @@ import (
"k8s.io/client-go/util/cert"
"k8s.io/client-go/util/certificate/csr"
"k8s.io/client-go/util/keyutil"
"k8s.io/component-base/metrics"

This comment has been minimized.

Copy link
@mikedanese

mikedanese Nov 5, 2019

Member

Might be this, although not sure. Haven't figured out go.mod yet.

@rphillips rphillips force-pushed the rphillips:fixes/add_cert_rotation_failure_metric branch from b7e18a8 to ef8addd Nov 5, 2019
@@ -124,6 +125,9 @@ type Config struct {
// allows one to setup monitoring and alerting of unexpected rotation
// behavior and track trends in rotation frequency.
CertificateRotation Histogram
// CertifcateRenewFailure will record a metric that keeps track of
// certificate renewal failures.
CertificateRenewFailure CounterVec

This comment has been minimized.

Copy link
@mikedanese

mikedanese Nov 5, 2019

Member

Maybe change this to:

CertificateRenewFailure func(...string)

and pass in:

func(a) {
  metric.WithLabels(a...).Inc()
}

From kubelet so client-go doesn't need to depend on "k8s.io/component-base/metrics".

This comment has been minimized.

Copy link
@liggitt

liggitt Nov 5, 2019

Member

yeah, need to prevent this:

tsort: cycle in data
tsort: k8s.io/client-go
tsort: k8s.io/component-base
@rphillips rphillips force-pushed the rphillips:fixes/add_cert_rotation_failure_metric branch from ef8addd to 1f0fe6d Nov 5, 2019
@k8s-ci-robot k8s-ci-robot added size/L and removed size/XXL labels Nov 5, 2019
@rphillips

This comment has been minimized.

Copy link
Member Author

rphillips commented Nov 5, 2019

I think I got it working. My $GOPATH/pkg/mod was owned by root and go list all was failing.

@rphillips

This comment has been minimized.

Copy link
Member Author

rphillips commented Nov 5, 2019

Damn. that is a lot of owners... I'll refactor without the dependency.

@rphillips rphillips force-pushed the rphillips:fixes/add_cert_rotation_failure_metric branch from 1f0fe6d to 8e50c55 Nov 5, 2019
@k8s-ci-robot k8s-ci-robot added size/M and removed size/L labels Nov 5, 2019
@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Nov 5, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, mikedanese, rphillips

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rphillips

This comment has been minimized.

Copy link
Member Author

rphillips commented Nov 5, 2019

I'm going to propose the original change... It seems hacky with the CounterVec manipulation.

@rphillips

This comment has been minimized.

Copy link
Member Author

rphillips commented Nov 6, 2019

/retest

@dims

This comment has been minimized.

Copy link
Member

dims commented Nov 16, 2019

/uncc

@k8s-ci-robot k8s-ci-robot removed the request for review from dims Nov 16, 2019
@rphillips

This comment has been minimized.

Copy link
Member Author

rphillips commented Dec 2, 2019

/hold cancel

@mikedanese could you give this a final review?

@@ -61,6 +61,15 @@ func NewKubeletServerCertificateManager(kubeClient clientset.Interface, kubeCfg
},
)
legacyregistry.MustRegister(certificateExpiration)
var certificateRenewFailure = compbasemetrics.NewCounter(

This comment has been minimized.

Copy link
@mikedanese

mikedanese Dec 3, 2019

Member

This assumes that this function will only be called once which is not obvious... but I guess that's how we do thinks. I'd probably prefer moving both of these to package level, but won't block this change.

@mikedanese

This comment has been minimized.

Copy link
Member

mikedanese commented Dec 3, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Dec 3, 2019
@k8s-ci-robot k8s-ci-robot merged commit 205570e into kubernetes:master Dec 3, 2019
14 of 15 checks passed
14 of 15 checks passed
pull-kubernetes-kubemark-e2e-gce-big Job triggered.
Details
cla/linuxfoundation rphillips authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-kind Job succeeded.
Details
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-node-e2e-containerd Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
tide In merge pool.
Details
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Dec 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.