track/close kubelet->API connections on heartbeat failure #63492

Merged

Conversation

Member

@liggitt liggitt commented May 7, 2018

xref #48638
xref kubernetes-retired/kube-aws#598

we're already typically tracking kubelet -> API connections and have the ability to force close them as part of client cert rotation. if we do that tracking unconditionally, we also gain the ability to force close connections on heartbeat failure. it's a big hammer (it means reestablishing pod watches, etc.), but so is having all your pods evicted because you didn't heartbeat.

this intentionally does minimal refactoring/extraction of the cert connection tracking transport in case we want to backport this

  • first commit unconditionally sets up the connection-tracking dialer, and moves all the cert management logic inside an if-block that gets skipped if no certificate manager is provided (view with whitespace ignored to see what actually changed)
  • second commit plumbs the connection-closing function to the heartbeat loop and calls it on repeated failures
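The two commits above amount to a dialer that records every connection it opens, plus a function that force-closes them all. The following is a minimal sketch of that idea, not the code from this PR: `connTracker`, `trackedConn`, `CloseAll`, and `demo` are hypothetical names, and `net.Pipe` stands in for a real TCP dial to the API server.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"sync"
)

// dialFunc has the same shape as (*net.Dialer).DialContext.
type dialFunc func(ctx context.Context, network, address string) (net.Conn, error)

// connTracker (hypothetical name) wraps a dial function and remembers every
// connection that is still open, so all of them can be force-closed at once.
type connTracker struct {
	dial  dialFunc
	mu    sync.Mutex
	conns map[*trackedConn]struct{}
}

func newConnTracker(dial dialFunc) *connTracker {
	return &connTracker{dial: dial, conns: make(map[*trackedConn]struct{})}
}

// DialContext dials through the wrapped function and records the result.
func (t *connTracker) DialContext(ctx context.Context, network, address string) (net.Conn, error) {
	conn, err := t.dial(ctx, network, address)
	if err != nil {
		return nil, err
	}
	tc := &trackedConn{Conn: conn, tracker: t}
	t.mu.Lock()
	t.conns[tc] = struct{}{}
	t.mu.Unlock()
	return tc, nil
}

// CloseAll force-closes every tracked connection and reports how many it closed.
// A heartbeat loop only needs to be handed this as a plain func().
func (t *connTracker) CloseAll() int {
	t.mu.Lock()
	conns := t.conns
	t.conns = make(map[*trackedConn]struct{})
	t.mu.Unlock()
	for c := range conns {
		c.Conn.Close() // already untracked, so close the underlying connection directly
	}
	return len(conns)
}

// trackedConn removes itself from the tracker when it is closed normally.
type trackedConn struct {
	net.Conn
	tracker *connTracker
}

func (c *trackedConn) Close() error {
	c.tracker.mu.Lock()
	delete(c.tracker.conns, c)
	c.tracker.mu.Unlock()
	return c.Conn.Close()
}

// demo opens three in-memory connections, closes one normally, then
// force-closes the rest. It returns the number force-closed and the error
// from a write attempted afterwards.
func demo() (closed int, writeErr error) {
	tracker := newConnTracker(func(ctx context.Context, network, address string) (net.Conn, error) {
		client, _ := net.Pipe() // in-memory stand-in for a real TCP dial
		return client, nil
	})
	ctx := context.Background()
	a, _ := tracker.DialContext(ctx, "tcp", "apiserver.example:6443")
	b, _ := tracker.DialContext(ctx, "tcp", "apiserver.example:6443")
	c, _ := tracker.DialContext(ctx, "tcp", "apiserver.example:6443")
	_ = b
	a.Close() // a normal close untracks the connection
	closed = tracker.CloseAll()
	_, writeErr = c.Write([]byte("ping"))
	return closed, writeErr
}

func main() {
	closed, err := demo()
	fmt.Println("force-closed:", closed, "write after close failed:", err != nil)
}
```

In a real client the tracker's `DialContext` would be installed as the dial function of the HTTP transport, so that closing the tracked connections forces the next request to dial again.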

follow-ups:

  • consider backporting this to 1.10, 1.9, 1.8
  • refactor the connection managing dialer to not be so tightly bound to the client certificate management

/sig node
/sig api-machinery

kubelet: fix hangs in updating Node status after network interruptions/changes between the kubelet and API server
@liggitt liggitt force-pushed the node-heartbeat-close-connections branch from 60fd863 to 05116f9 May 7, 2018
Contributor

@awly awly left a comment

Recommend adding a test to make sure OnHeartbeatFailure triggers.

@@ -541,19 +541,19 @@ func run(s *options.KubeletServer, kubeDeps *kubelet.Dependencies) (err error) {
 		return fmt.Errorf("invalid kubeconfig: %v", err)
 	}
 
-	var clientCertificateManager certificate.Manager
+	var clientCertificateManager certificate.Manager = nil
Contributor

@awly awly May 7, 2018

This doesn't seem to do anything useful; remove the "= nil".

@@ -51,80 +51,93 @@ import (
 //
 // stopCh should be used to indicate when the transport is unused and doesn't need
 // to continue checking the manager.
-func UpdateTransport(stopCh <-chan struct{}, clientConfig *restclient.Config, clientCertificateManager certificate.Manager, exitAfter time.Duration) error {
+func UpdateTransport(stopCh <-chan struct{}, clientConfig *restclient.Config, clientCertificateManager certificate.Manager, exitAfter time.Duration) (func(), error) {
Contributor

@awly awly May 7, 2018

Document the new return value in func comment

Member Author

@liggitt liggitt May 7, 2018

done

@liggitt liggitt force-pushed the node-heartbeat-close-connections branch 2 times, most recently from 8a0a991 to c47a7a9 May 7, 2018
Member Author

@liggitt liggitt commented May 7, 2018

> Recommend adding a test to make sure OnHeartbeatFailure triggers.

done

@liggitt liggitt changed the title from "WIP - track/close kubelet->API connections on heartbeat failure" to "track/close kubelet->API connections on heartbeat failure" May 7, 2018
@liggitt liggitt force-pushed the node-heartbeat-close-connections branch from c47a7a9 to f18a52d May 7, 2018
@@ -350,6 +350,9 @@ func (kl *Kubelet) updateNodeStatus() error {
 	glog.V(5).Infof("Updating node status")
 	for i := 0; i < nodeStatusUpdateRetry; i++ {
 		if err := kl.tryUpdateNodeStatus(i); err != nil {
+			if i > 0 && kl.onRepeatedHeartbeatFailure != nil {
+				kl.onRepeatedHeartbeatFailure()
Member

@dims dims May 7, 2018

Do we want to set onRepeatedHeartbeatFailure to nil or something? (once we invoke the method)

Member Author

@liggitt liggitt May 7, 2018

No, that would mean the kubelet would hit the same issue if the network condition was encountered twice during a single process lifetime

Member

@dims dims May 7, 2018

Ack thanks!
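The exchange above turns on the callback staying set after it fires. Here is a minimal sketch of the retry loop under that design, assuming a retry count of 5 and using hypothetical names (`heartbeater`, `tryUpdate`, `simulate`); only the `i > 0` guard and the `onRepeatedHeartbeatFailure` call come from the diff.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeStatusUpdateRetry mirrors the constant name in the diff; the value 5 is
// an assumption made for this sketch.
const nodeStatusUpdateRetry = 5

// heartbeater is a hypothetical stand-in for the part of the kubelet that
// reports node status.
type heartbeater struct {
	tryUpdate                  func(attempt int) error
	onRepeatedHeartbeatFailure func()
}

// updateNodeStatus retries the update. The callback is skipped on the first
// failure (i == 0) and invoked on every later failure. It is never reset to
// nil, so it fires again if the network breaks a second time.
func (h *heartbeater) updateNodeStatus() error {
	for i := 0; i < nodeStatusUpdateRetry; i++ {
		if err := h.tryUpdate(i); err != nil {
			if i > 0 && h.onRepeatedHeartbeatFailure != nil {
				h.onRepeatedHeartbeatFailure()
			}
			continue
		}
		return nil
	}
	return errors.New("node status update exceeded retry count")
}

// simulate runs `rounds` heartbeats in which the first `failuresPerRound`
// attempts fail. It returns how often the callback fired and how many
// heartbeats failed outright.
func simulate(failuresPerRound, rounds int) (closes, errs int) {
	h := &heartbeater{
		tryUpdate: func(attempt int) error {
			if attempt < failuresPerRound {
				return errors.New("simulated network failure")
			}
			return nil
		},
	}
	h.onRepeatedHeartbeatFailure = func() { closes++ }
	for r := 0; r < rounds; r++ {
		if err := h.updateNodeStatus(); err != nil {
			errs++
		}
	}
	return closes, errs
}

func main() {
	closes, errs := simulate(2, 2)
	fmt.Println("callback invocations:", closes, "failed heartbeats:", errs)
}
```

With two failing attempts per heartbeat over two heartbeats, the callback fires once in each round, which is the behavior that clearing it after the first call would lose.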

@liggitt liggitt changed the title from "track/close kubelet->API connections on heartbeat failure" to "WIP - track/close kubelet->API connections on heartbeat failure" May 7, 2018
@liggitt liggitt force-pushed the node-heartbeat-close-connections branch from f18a52d to 814b065 May 7, 2018
Member Author

@liggitt liggitt commented May 7, 2018

/test pull-kubernetes-e2e-gke

Member Author

@liggitt liggitt commented May 7, 2018

/test pull-kubernetes-local-e2e

Contributor

@k8s-github-robot k8s-github-robot commented May 14, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

Contributor

@k8s-github-robot k8s-github-robot commented May 14, 2018

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 8220171 into kubernetes:master May 14, 2018
15 of 16 checks passed

  • Submit Queue: Required Github CI test is not green: pull-kubernetes-e2e-gce
  • cla/linuxfoundation: liggitt authorized
  • pull-kubernetes-bazel-build: Job succeeded.
  • pull-kubernetes-bazel-test: Job succeeded.
  • pull-kubernetes-cross: Skipped
  • pull-kubernetes-e2e-gce: Job succeeded.
  • pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
  • pull-kubernetes-e2e-gke: Skipped
  • pull-kubernetes-e2e-kops-aws: Job succeeded.
  • pull-kubernetes-integration: Job succeeded.
  • pull-kubernetes-kubemark-e2e-gce: Job succeeded.
  • pull-kubernetes-local-e2e: Skipped
  • pull-kubernetes-local-e2e-containerized: Job succeeded.
  • pull-kubernetes-node-e2e: Job succeeded.
  • pull-kubernetes-typecheck: Job succeeded.
  • pull-kubernetes-verify: Job succeeded.

Member Author

@liggitt liggitt commented May 14, 2018

I plan to open picks to 1.8, 1.9, and 1.10, but will hold them until this makes it through serial/soak/scale CI tests

k8s-github-robot pushed a commit that referenced this issue May 17, 2018
…2-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #63492: Always track kubelet -> API connections

Cherry pick of #63492 on release-1.10.

#63492: Always track kubelet -> API connections
k8s-github-robot pushed a commit that referenced this issue May 17, 2018
…2-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #63492: Always track kubelet -> API connections

Cherry pick of #63492 on release-1.8.

#63492: Always track kubelet -> API connections
k8s-github-robot pushed a commit that referenced this issue Jun 27, 2018
…2-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #63492: Always track kubelet -> API connections

Cherry pick of #63492 on release-1.9.

#63492: Always track kubelet -> API connections
jackfrancis added a commit to jackfrancis/aks-engine that referenced this issue Jan 3, 2019
jackfrancis added a commit to Azure/aks-engine that referenced this issue Jan 3, 2019
juhacket pushed a commit to juhacket/aks-engine that referenced this issue Mar 14, 2019