track/close kubelet->API connections on heartbeat failure #63492

Merged
merged 2 commits into from May 14, 2018

@liggitt
Member

liggitt commented May 7, 2018

xref #48638
xref kubernetes-incubator/kube-aws#598

we're already typically tracking kubelet -> API connections and have the ability to force close them as part of client cert rotation. if we do that tracking unconditionally, we also gain the ability to force close connections on heartbeat failure. it's a big hammer (it means reestablishing pod watches, etc.), but so is having all your pods evicted because you didn't heartbeat.

this intentionally does minimal refactoring/extraction of the cert connection tracking transport in case we want to backport this

  • first commit unconditionally sets up the connection-tracking dialer, and moves all the cert management logic inside an if-block that gets skipped if no certificate manager is provided (view with whitespace ignored to see what actually changed)
  • second commit plumbs the connection-closing function to the heartbeat loop and calls it on repeated failures

follow-ups:

  • consider backporting this to 1.10, 1.9, 1.8
  • refactor the connection managing dialer to not be so tightly bound to the client certificate management

/sig node
/sig api-machinery

kubelet: fix hangs in updating Node status after network interruptions/changes between the kubelet and API server
@liggitt

Member Author

liggitt commented May 7, 2018

@liggitt liggitt force-pushed the liggitt:node-heartbeat-close-connections branch from 60fd863 to 05116f9 May 7, 2018

@liggitt

Member Author

liggitt commented May 7, 2018

@awly
Contributor

awly left a comment

Recommend adding a test to make sure OnHeartbeatFailure triggers.

@@ -541,19 +541,19 @@ func run(s *options.KubeletServer, kubeDeps *kubelet.Dependencies) (err error) {
 		return fmt.Errorf("invalid kubeconfig: %v", err)
 	}

-	var clientCertificateManager certificate.Manager
+	var clientCertificateManager certificate.Manager = nil

@awly

awly May 7, 2018

Contributor

This doesn't seem to do anything useful, remove = nil

@@ -51,80 +51,93 @@ import (
 //
 // stopCh should be used to indicate when the transport is unused and doesn't need
 // to continue checking the manager.
-func UpdateTransport(stopCh <-chan struct{}, clientConfig *restclient.Config, clientCertificateManager certificate.Manager, exitAfter time.Duration) error {
+func UpdateTransport(stopCh <-chan struct{}, clientConfig *restclient.Config, clientCertificateManager certificate.Manager, exitAfter time.Duration) (func(), error) {

@awly

awly May 7, 2018

Contributor

Document the new return value in func comment

@liggitt

liggitt May 7, 2018

Author Member

done

@liggitt liggitt force-pushed the liggitt:node-heartbeat-close-connections branch 2 times, most recently from 8a0a991 to c47a7a9 May 7, 2018

@liggitt

Member Author

liggitt commented May 7, 2018

> Recommend adding a test to make sure OnHeartbeatFailure triggers.

done

@liggitt liggitt changed the title from "WIP - track/close kubelet->API connections on heartbeat failure" to "track/close kubelet->API connections on heartbeat failure" May 7, 2018

@liggitt liggitt force-pushed the liggitt:node-heartbeat-close-connections branch from c47a7a9 to f18a52d May 7, 2018

@@ -350,6 +350,9 @@ func (kl *Kubelet) updateNodeStatus() error {
 	glog.V(5).Infof("Updating node status")
 	for i := 0; i < nodeStatusUpdateRetry; i++ {
 		if err := kl.tryUpdateNodeStatus(i); err != nil {
+			if i > 0 && kl.onRepeatedHeartbeatFailure != nil {
+				kl.onRepeatedHeartbeatFailure()

@dims

dims May 7, 2018

Member

Do we want to set onRepeatedHeartbeatFailure to nil or something? (once we invoke the method)

@liggitt

liggitt May 7, 2018

Author Member

No, that would mean the kubelet would hit the same issue if the network condition was encountered twice during a single process lifetime

@dims

dims May 7, 2018

Member

Ack thanks!

@liggitt liggitt changed the title from "track/close kubelet->API connections on heartbeat failure" to "WIP - track/close kubelet->API connections on heartbeat failure" May 7, 2018

liggitt added some commits May 7, 2018

@liggitt liggitt force-pushed the liggitt:node-heartbeat-close-connections branch from f18a52d to 814b065 May 7, 2018

@liggitt

Member Author

liggitt commented May 7, 2018

/test pull-kubernetes-e2e-gke

@liggitt

Member Author

liggitt commented May 7, 2018

/test pull-kubernetes-local-e2e

@derekwaynecarr

Member

derekwaynecarr commented May 14, 2018

kubelet changes lgtm.

/approve

@k8s-ci-robot

Contributor

k8s-ci-robot commented May 14, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, dims, liggitt, mikedanese

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot

Contributor

k8s-github-robot commented May 14, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Contributor

k8s-github-robot commented May 14, 2018

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 8220171 into kubernetes:master May 14, 2018

15 of 16 checks passed

Submit Queue: Required Github CI test is not green: pull-kubernetes-e2e-gce
cla/linuxfoundation: liggitt authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gke: Skipped
pull-kubernetes-e2e-kops-aws: Job succeeded.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce: Job succeeded.
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-local-e2e-containerized: Job succeeded.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
@liggitt

Member Author

liggitt commented May 14, 2018

I plan to open picks to 1.8, 1.9, and 1.10, but will hold them until this makes it through serial/soak/scale CI tests

k8s-github-robot pushed a commit that referenced this pull request May 17, 2018

Kubernetes Submit Queue
Merge pull request #63831 from liggitt/automated-cherry-pick-of-#6349…
…2-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #63492: Always track kubelet -> API connections

Cherry pick of #63492 on release-1.10.

#63492: Always track kubelet -> API connections

k8s-github-robot pushed a commit that referenced this pull request May 17, 2018

Kubernetes Submit Queue
Merge pull request #63834 from liggitt/automated-cherry-pick-of-#6349…
…2-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #63492: Always track kubelet -> API connections

Cherry pick of #63492 on release-1.8.

#63492: Always track kubelet -> API connections

k8s-github-robot pushed a commit that referenced this pull request Jun 27, 2018

Kubernetes Submit Queue
Merge pull request #63832 from liggitt/automated-cherry-pick-of-#6349…
…2-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #63492: Always track kubelet -> API connections

Cherry pick of #63492 on release-1.9.

#63492: Always track kubelet -> API connections

jackfrancis added a commit to jackfrancis/aks-engine that referenced this pull request Jan 3, 2019

@jackfrancis jackfrancis referenced this pull request Jan 3, 2019

Merged

docs: remove obsolete "known issue" #225

2 of 4 tasks complete

jackfrancis added a commit to Azure/aks-engine that referenced this pull request Jan 3, 2019

juhacket pushed a commit to juhacket/aks-engine that referenced this pull request Mar 14, 2019
