
Kubelet creates and manages node leases #66257

Merged

k8s-merge-robot merged 1 commit into kubernetes:master on Aug 27, 2018

Conversation

mtaufen (Contributor) commented Jul 16, 2018

This extends the Kubelet to create and periodically update leases in a
new kube-node-lease namespace. Based on KEP-0009,
these leases can be used as a node health signal, and will allow us to
reduce the load caused by over-frequent node status reporting.

  • add NodeLease feature gate
  • add kube-node-lease system namespace for node leases
  • add Kubelet option for lease duration
  • add Kubelet-internal lease controller to create and update lease
  • add e2e test for NodeLease feature

I would like to determine a standard policy for lease renewal frequency, based on the configured lease duration, so that we don't need to expose frequency as an additional knob. The renew interval is currently calculated as 1/3 of the lease duration.
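
A minimal Go sketch of that policy, for illustration only (the names here are hypothetical, not the PR's code):

package main

import (
    "fmt"
    "time"
)

// renewInterval derives the lease renew interval from the configured
// lease duration, using the renew-at-one-third policy described above.
func renewInterval(leaseDurationSeconds int32) time.Duration {
    return time.Duration(leaseDurationSeconds) * time.Second / 3
}

func main() {
    // With the 40s default duration, the lease is renewed roughly every 13.3s.
    fmt.Println(renewInterval(40))
}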

kubelet: Users can now enable the alpha NodeLease feature gate to have the Kubelet create and periodically renew a Lease in the kube-node-lease namespace. The lease duration defaults to 40s, and can be configured via the kubelet.config.k8s.io/v1beta1.KubeletConfiguration's NodeLeaseDurationSeconds field.

/cc @wojtek-t @liggitt

k8s-merge-robot (Contributor) commented Jul 16, 2018

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@mtaufen @yujuhong

Pull Request Labels
  • sig/node: Pull Request will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the pull request owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/feature: New functionality.

mtaufen referenced this pull request Jul 17, 2018

Open

Move frequent Kubelet heartbeats to Lease API #589

2 of 4 tasks complete
lease, err := c.client.Get(c.holderIdentity, metav1.GetOptions{})
if apierrors.IsNotFound(err) {
    // lease does not exist, create it
    lease, err := c.client.Create(c.newLease(nil))

liggitt (Member) commented Jul 17, 2018

is this working today in a test cluster with authz enabled? I expected to see modifications to the Node authorizer and NodeRestriction admission plugin to let kubelets mess with their own leases

mtaufen (Contributor) commented Jul 17, 2018

Good point, I forgot about that. I will add those changes to this PR, and it probably makes sense to ensure the e2e test also runs in the standard e2e suite, instead of just the e2e node suite.

We also don't currently run the node authorizer/node restriction in the e2e node tests. Maybe I should revisit #60172?

lease, err := c.client.Get(c.holderIdentity, metav1.GetOptions{})
if apierrors.IsNotFound(err) {
    // lease does not exist, create it
    lease, err := c.client.Create(c.newLease(nil))

rphillips (Member) commented Jul 17, 2018

Do these client calls include a request timeout?

liggitt (Member) commented Jul 17, 2018

heartbeatClient is explicitly constructed with a timeout to avoid eternal hangs, and a distinct QPS to avoid starvation by other kube API calls, but it is currently fixed to corev1. We need to adapt/widen that for use here.
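
For context, a rough sketch of the pattern described above, assuming client-go's rest package; the function name and field choices are illustrative, not the Kubelet's exact wiring:

import (
    "time"

    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// newHeartbeatClient copies the base config so heartbeats get their own
// request timeout (no eternal hangs) and their own rate limit (no
// starvation by other kube API calls). Illustrative only.
func newHeartbeatClient(base *rest.Config, timeout time.Duration) (kubernetes.Interface, error) {
    cfg := rest.CopyConfig(base)
    cfg.Timeout = timeout
    cfg.QPS = -1 // assumption: exempt heartbeats from the shared client-side rate limiter
    return kubernetes.NewForConfig(cfg)
}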

mtaufen (Contributor) commented Jul 17, 2018

I believe that is configured on the client before it is passed to the controller.

mtaufen (Contributor) commented Jul 17, 2018

If we want the timeout to match the heartbeat frequency, then we need a distinct client, since the Lease-based heartbeat potentially has a different frequency than the node status update, right?

liggitt (Member) commented Jul 17, 2018

Taking the shorter of the two would probably be fine

mtaufen (Contributor) commented Jul 21, 2018

done
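
The resolution ("taking the shorter of the two") amounts to something like the following sketch, with illustrative names:

import "time"

// heartbeatTimeout picks the shorter of the node-status update interval
// and the lease renew interval, so neither heartbeat path gets a request
// timeout longer than its own period. Not the merged code.
func heartbeatTimeout(nodeStatusUpdateFrequency, leaseRenewInterval time.Duration) time.Duration {
    if leaseRenewInterval < nodeStatusUpdateFrequency {
        return leaseRenewInterval
    }
    return nodeStatusUpdateFrequency
}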

(Collapsed outdated review threads on pkg/kubelet/nodelease/controller.go, pkg/kubelet/kubelet.go, and pkg/kubelet/apis/kubeletconfig/v1beta1/types.go.)
// retryUpdateLease attempts to update the lease for maxUpdateRetries,
// call this once you're sure the lease has been created
func (c *controller) retryUpdateLease(base *coordv1beta1.Lease) {

yujuhong (Contributor) commented Jul 18, 2018

I know this is copied from retryUpdateNodeStatus, but just FYI, we've found before that there were multiple layers of retries, and that was probably unnecessary. See kubernetes/node-problem-detector#124 (comment).

mtaufen (Contributor) commented Jul 21, 2018

Noted. Is the client advanced enough now for us to remove onRepeatedHeartbeatFailure too?
@liggitt?

liggitt (Member) commented Jul 21, 2018

No. That is required to force close dead TCP connections.
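
For background, client-go ships a connection-rotating dialer that enables exactly this; a hedged sketch of the mechanism (the wiring is illustrative, not the Kubelet's actual code):

import (
    "net"
    "time"

    "k8s.io/client-go/util/connrotation"
)

// newRotatingDialer wraps a standard dialer so that every connection it
// opens can later be force-closed in one call.
func newRotatingDialer() *connrotation.Dialer {
    base := &net.Dialer{Timeout: 30 * time.Second, KeepAlive: 30 * time.Second}
    return connrotation.NewDialer(base.DialContext)
}

// On repeated heartbeat failures, close every open TCP connection; the
// next request re-dials instead of blocking on a dead connection.
func onRepeatedHeartbeatFailure(d *connrotation.Dialer) {
    d.CloseAll()
}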

yujuhong (Contributor) commented Aug 20, 2018

Maybe we can lower the maxUpdateRetries later in a separate PR.
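
For reference, the retry layer under discussion is a bounded loop; a generic sketch (maxUpdateRetries and the update callback are stand-ins, not the merged code):

import "fmt"

const maxUpdateRetries = 5 // assumption: stands in for the constant under discussion

// updateWithRetries calls update until it succeeds or the retry budget is
// exhausted. Stacking loops like this at multiple layers multiplies the
// total number of API calls, which is the concern raised above.
func updateWithRetries(update func() error) error {
    var err error
    for i := 0; i < maxUpdateRetries; i++ {
        if err = update(); err == nil {
            return nil
        }
    }
    return fmt.Errorf("failed after %d attempts: %v", maxUpdateRetries, err)
}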

lease.Spec.HolderIdentity = pointer.StringPtr(c.holderIdentity)
lease.Spec.LeaseDurationSeconds = pointer.Int32Ptr(c.leaseDurationSeconds)
lease.Spec.RenewTime = &metav1.MicroTime{Time: c.clock.Now()}
return lease

yujuhong (Contributor) commented Jul 18, 2018

Is OwnerReference going to be added in a follow-up PR, or will a different mechanism be used to ensure the leases are cleaned up?
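
One cleanup mechanism that fits the hunk above is an OwnerReference from the Lease back to its Node, so the garbage collector deletes the Lease when the Node is deleted; a sketch under that assumption:

import (
    coordv1beta1 "k8s.io/api/coordination/v1beta1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setLeaseOwner makes the Node the owner of its Lease, so deleting the
// Node garbage-collects the Lease. Illustrative, not the merged code.
func setLeaseOwner(lease *coordv1beta1.Lease, node *corev1.Node) {
    lease.OwnerReferences = []metav1.OwnerReference{{
        APIVersion: corev1.SchemeGroupVersion.String(), // "v1"
        Kind:       "Node",
        Name:       node.Name,
        UID:        node.UID,
    }}
}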

(Collapsed outdated review thread on pkg/kubelet/nodelease/controller.go.)
mtaufen (Contributor) commented Aug 5, 2018

When I put the node_lease_test.go file in test/e2e/common (and change the package name accordingly), I can't get it to run with make test-e2e-node FOCUS="NodeLease" SKIP="" REMOTE=true TEST_ARGS='--feature-gates=NodeLease=true' PARALLELISM=1 CLEANUP=true.
Output is:

SUCCESS! -- 0 Passed | 0 Failed | 0 Pending | 270 Skipped PASS

Similarly, I couldn't get it to run with go run hack/e2e.go -- --build --up --test --test_args="--ginkgo.focus=\[Feature:NodeLease\] --ginkgo.skip=''":

SUCCESS! -- 0 Passed | 0 Failed | 0 Pending | 1022 Skipped PASS

Isn't test/e2e/common supposed to allow a test to run from either suite? Any ideas what's going on here?

I left the file in test/e2e_node for now, since that's the only way I can get it to run :/.
The test probably wouldn't pass in the cluster e2e yet anyway, but it's odd to me that it's not even running...

@BenTheElder

BenTheElder (Member) commented Aug 5, 2018

I was taking a look at this but discovered #66995; I will circle back.

liggitt (Member) commented Aug 23, 2018

/hold cancel
/retest

my comments are addressed, thanks

yujuhong (Contributor) commented Aug 23, 2018

/lgtm

k8s-ci-robot added the lgtm label Aug 23, 2018

k8s-ci-robot (Contributor) commented Aug 23, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mtaufen, wojtek-t, yujuhong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot (Contributor) commented Aug 23, 2018

New changes are detected. LGTM label has been removed.

mtaufen (Contributor) commented Aug 23, 2018

re-add label after rebase

mtaufen added the lgtm label Aug 23, 2018

wojtek-t (Member) commented Aug 24, 2018

@mtaufen - please rebase one more time.

Kubelet creates and manages node leases
This extends the Kubelet to create and periodically update leases in a
new kube-node-lease namespace. Based on [KEP-0009](https://github.com/kubernetes/community/blob/master/keps/sig-node/0009-node-heartbeat.md),
these leases can be used as a node health signal, and will allow us to
reduce the load caused by over-frequent node status reporting.

- add NodeLease feature gate
- add kube-node-lease system namespace for node leases
- add Kubelet option for lease duration
- add Kubelet-internal lease controller to create and update lease
- add e2e test for NodeLease feature
- modify node authorizer and node restriction admission controller
to allow Kubelets access to corresponding leases (see the sketch below)
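
The last bullet refers to the authorization gap liggitt flagged earlier in the review. In spirit, the NodeRestriction rule confines each Kubelet to the one Lease named after its own Node in the kube-node-lease namespace; a hedged sketch of that check (not the plugin's actual code):

import "fmt"

// admitNodeLease enforces that a kubelet identified as nodeName may only
// touch the Lease matching its own Node name, in the node-lease namespace.
func admitNodeLease(nodeName, leaseNamespace, leaseName string) error {
    if leaseNamespace != "kube-node-lease" {
        return fmt.Errorf("node %q may only access leases in the kube-node-lease namespace", nodeName)
    }
    if leaseName != nodeName {
        return fmt.Errorf("node %q may only access its own lease", nodeName)
    }
    return nil
}
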
k8s-ci-robot (Contributor) commented Aug 26, 2018

New changes are detected. LGTM label has been removed.

k8s-ci-robot removed the lgtm label Aug 26, 2018

mtaufen (Contributor) commented Aug 26, 2018

re-add label after rebase

mtaufen (Contributor) commented Aug 27, 2018

/retest

mtaufen removed the needs-rebase label Aug 27, 2018

k8s-merge-robot (Contributor) commented Aug 27, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

k8s-merge-robot (Contributor) commented Aug 27, 2018

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

k8s-merge-robot merged commit aec2702 into kubernetes:master Aug 27, 2018

16 of 18 checks passed

  • Submit Queue: Required Github CI test is not green: pull-kubernetes-kubemark-e2e-gce-big
  • pull-kubernetes-local-e2e-containerized: Job triggered.
  • cla/linuxfoundation: mtaufen authorized
  • pull-kubernetes-bazel-build: Job succeeded.
  • pull-kubernetes-bazel-test: Job succeeded.
  • pull-kubernetes-cross: Skipped
  • pull-kubernetes-e2e-gce: Job succeeded.
  • pull-kubernetes-e2e-gce-100-performance: Job succeeded.
  • pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
  • pull-kubernetes-e2e-gke: Skipped
  • pull-kubernetes-e2e-kops-aws: Job succeeded.
  • pull-kubernetes-e2e-kubeadm-gce: Skipped
  • pull-kubernetes-integration: Job succeeded.
  • pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
  • pull-kubernetes-local-e2e: Skipped
  • pull-kubernetes-node-e2e: Job succeeded.
  • pull-kubernetes-typecheck: Job succeeded.
  • pull-kubernetes-verify: Job succeeded.
wojtek-t (Member) commented Aug 27, 2018

Great to see that merged.

tengqm referenced this pull request Aug 31, 2018

Open

Documentation needed for Node lease management #10160

1 of 2 tasks complete