Eliminate hangs/throttling of node heartbeat #52176

Merged
1 commit merged on Sep 16, 2017

Conversation

@liggitt
Member

liggitt commented Sep 8, 2017

Fixes #48638
Fixes #50304

Stops the kubelet from wedging when updating node status if it is unable to establish a TCP connection.

Note that this only affects the node status loop. The pod sync loop would still hang until the dead TCP connections time out, so more work is needed to keep the sync loop responsive in the face of network issues, but this change lets existing pods coast without the node controller trying to evict them.

Kubelet-to-master communication when doing node status updates now has a timeout to prevent indefinite hangs
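For readers skimming the diff excerpts below, the core of the change is simply giving each status-update request its own deadline rather than letting it block indefinitely. A minimal, self-contained sketch of that pattern (the endpoint, node name, interval, and retry count here are placeholders, not values from this PR):

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	const nodeStatusUpdateFrequency = 10 * time.Second
	client := &http.Client{}

	for i := 0; i < 3; i++ {
		// Each heartbeat attempt may take at most one update interval.
		ctx, cancel := context.WithTimeout(context.Background(), nodeStatusUpdateFrequency)
		// Release the context's resources a little later rather than blocking on the
		// response, mirroring the time.AfterFunc trick discussed in the review below.
		time.AfterFunc(2*nodeStatusUpdateFrequency, cancel)

		req, err := http.NewRequest("GET", "https://apiserver.example/api/v1/nodes/my-node", nil)
		if err != nil {
			panic(err)
		}
		resp, err := client.Do(req.WithContext(ctx))
		if err != nil {
			fmt.Println("heartbeat attempt failed, will retry:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println("heartbeat ok")
	}
}
```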

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 8, 2017
@k8s-github-robot k8s-github-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Sep 8, 2017
@liggitt
Member Author

liggitt commented Sep 8, 2017

cc @kubernetes/sig-node-pr-reviews

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Sep 8, 2017
@@ -395,6 +395,12 @@ func (r *Request) Context(ctx context.Context) *Request {
	return r
}

// Throttle sets a rate limiter for the request.
Contributor

Add comment: It overrides the current throttle from the client, if any
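For context, the setter this doc comment belongs to is roughly shaped like the following sketch; the `throttle` field name and the `flowcontrol.RateLimiter` type are assumptions rather than quotes from the diff, and the doc comment incorporates the reviewer's suggested wording:

```go
// Throttle sets a rate limiter for the request.
// It overrides the current throttle from the client, if any.
func (r *Request) Throttle(limiter flowcontrol.RateLimiter) *Request {
	r.throttle = limiter // assumed field name; see the actual diff for details
	return r
}
```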

@smarterclayton smarterclayton added this to the v1.8 milestone Sep 8, 2017
@smarterclayton
Contributor

This makes sense to me and is a minimal solution.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 8, 2017
	client: restclient,
	decorator: func(req *rest.Request) *rest.Request {
		ctx, cancelFunc := context.WithTimeout(context.Background(), kl.nodeStatusUpdateFrequency)
		cancelFuncs = append(cancelFuncs, cancelFunc)
Contributor

Not sure it matters in the context of the current usage, but the decorated client here isn't thread-safe. Maybe worth a comment, at least?

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 8, 2017
@liggitt
Member Author

liggitt commented Sep 9, 2017

stackdriver flake
/retest

// each request should not take longer than the update frequency
ctx, cancelFunc := context.WithTimeout(context.Background(), kl.nodeStatusUpdateFrequency)
// free the context after twice that length of time
time.AfterFunc(2*kl.nodeStatusUpdateFrequency, cancelFunc)
Contributor

Cool idea

@ironcladlou
Contributor

Latest revision looks really clearly done to me

@derekwaynecarr
Member

/assign @derekwaynecarr

@derekwaynecarr
Member

This is a critical bug fix; labeling appropriately for the 1.8 milestone.

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 11, 2017
@derekwaynecarr
Member

/test pull-kubernetes-e2e-gce-etcd3

@k8s-github-robot k8s-github-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Sep 11, 2017
@RyPeck
Contributor

RyPeck commented Dec 15, 2017

@liggitt @alexef I believe the issue described in #41916 is actually an issue with the underlying TCP connection. The fix in this merge implements a timeout on the HTTP request, and does not close and set up a new TCP connection, right?

This is also discussed in kubernetes-retired/kube-aws#598

@liggitt
Member Author

liggitt commented Dec 17, 2017

This does not force closing the TCP connection, but does establish a new one (as demonstrated by the accompanying test, which hangs the first TCP connection and ensures a separate connection is established)
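Not the test from this PR, but a self-contained illustration of the kind of check being described: hang the first request, let the client's per-request timeout fire, and confirm that the server accepts a second, separate TCP connection for the retry (handler behavior and timeout values are invented for the example):

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

func main() {
	var reqs, conns int64

	ts := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&reqs, 1) == 1 {
			time.Sleep(5 * time.Second) // hang the first request well past the client timeout
			return
		}
		fmt.Fprint(w, "ok")
	}))
	// Count every TCP connection the test server accepts.
	ts.Config.ConnState = func(c net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&conns, 1)
		}
	}
	ts.Start()
	defer ts.Close()

	client := &http.Client{Timeout: 500 * time.Millisecond}

	if _, err := client.Get(ts.URL); err == nil {
		log.Fatal("expected the first (hung) request to time out")
	}
	resp, err := client.Get(ts.URL)
	if err != nil {
		log.Fatalf("second attempt should succeed over a fresh connection: %v", err)
	}
	resp.Body.Close()

	// With the hung first connection abandoned, the server should have seen two connections.
	fmt.Printf("requests=%d connections=%d\n", atomic.LoadInt64(&reqs), atomic.LoadInt64(&conns))
}
```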

@gtie

gtie commented Dec 21, 2017

@liggitt @RyPeck experiment shows that the connection is NOT re-established on a K8s 1.8.3 cluster. A simple way to try it out on a kubelet is to start dropping all packets destined for the API, and then observe how long it takes for the "ESTABLISHED" connection to disappear and new connection attempts to appear.

We've also had a 1.8.3 cluster go crazy for some 18 minutes when the ELB was scaled down and half of the ELB "endpoints" disappeared. Investigation points to the kubelet not bothering to re-establish connections to the live nodes.

What we did see happening was 10-second-spaced retries/failures from the kubelet reporting its status, each resulting in a timeout waiting for the HTTP response. This particular timeout, however, does not result in the connection being re-established; the next attempt simply reuses the same TCP pipe.

If I read the attached test correctly, it ensures that each status update attempt itself is timed out, and that there are multiple attempts happening. This does not prove that the TCP connection is reestablished.

@vboginskey
Contributor

We also encountered the problem referenced in #48638 on 1.7.8. This PR did not resolve the issue.

@liggitt
Member Author

liggitt commented Jan 3, 2018

This does not prove that the TCP connection is reestablished.

The attempt counter is only incremented when a new connection is established. Does that not demonstrate that the timeout causes the hung connection to be abandoned and a new connection created?

We also encountered the problem referenced in #48638 on 1.7.8. This PR did not resolve the issue.

Can you give more details or a reproducer, possibly as an additional unit test in the same vein as the one included in this PR?

@vboginskey
Contributor

Can you give more details or a reproducer, possibly as an additional unit test in the same vein as the one included in this PR?

Sure, let me start with details.

Environment: AWS, Kubernetes 1.7.8, CoreOS 1576.4.0. HA masters behind an ELB.
Event: The master ELB changes IPs, as confirmed by AWS CloudTrail. Immediately, all kubelets start reporting:

E1227 22:45:20.510774    3983 kubelet_node_status.go:357] Error updating node status, will retry: error getting node "ip-10-188-36-106.ec2.internal": Get https://k8s.vpc.gh.internal/api/v1/nodes/ip-10-188-36-106.ec2.internal: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Shortly thereafter, the controller manager reports:

I1227 22:45:37.638923       9 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-188-36-106.ec2.internal", UID:"e1c2c442-e4e0-11e7-927b-124b89f5bfb0", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node ip-10-188-36-106.ec2.internal status is now: NodeNotReady

This happens on each kubelet and there is now a production outage due to #45126.

We restart kubelets on all worker nodes and now this happens:

I1227 22:49:58.239809       9 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-188-36-106.ec2.internal", UID:"e1c2c442-e4e0-11e7-927b-124b89f5bfb0", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'DeletingAllPods' Node ip-10-188-36-106.ec2.internal event: Deleting all Pods from Node ip-10-188-36-106.ec2.internal.

The pods take several minutes to get rescheduled and restarted, resolving the outage.

We have since switched to NLBs, mitigating this particular failure mode. However, we still occasionally see individual nodes become NotReady, with the same logs. In the absence of operator intervention, each event lasts about 15 minutes, which corroborates #41916 (comment). Logging onto an affected node and curling the API endpoint manually returns the response as expected.

The best reproduction steps I can suggest at this time are here: #41916 (comment). I'll look into reproducing the specific ELB condition in a cloud provider independent way as well.

@vboginskey
Contributor

I was able to catch one of these with tcpdump today. Here's the relevant section:

[tcpdump screenshot omitted: traffic between the kubelet (10.188.48.76) and the apiserver (10.188.21.18)]

The kubelet (10.188.48.76) stops communicating with the apiserver (10.188.21.18) at 12:03:24. It reports an error updating node status at 12:03:34 and continues until restarted at 12:06:39.

In the interval between 12:03:24 and 12:06:39, there is no new TCP connection established and the only communication between the kubelet and the apiserver is a backoff of TCP retransmissions.

@cbonte
Contributor

cbonte commented Jan 19, 2018

Hi all, is there any update on this issue?
We encountered these hangs on Kubernetes 1.6.x, where the fix was not backported, but reading the recent feedback, it seems an upgrade to 1.7/1.8 won't guarantee it is fully resolved.

@lokesh-shekar

Any mitigation recommendations? We are seeing this often in our cluster as well.

george-angel added a commit to utilitywarehouse/tf_kube_aws that referenced this pull request Feb 23, 2018
Primarily to mitigate the kubernetes issue with kubelets losing their connection
to the masters (~06:20 in the morning for us) and, after a period, ejecting
pods until that connection is re-established. Some of the related issues
and pull requests:

 - kubernetes/kubernetes#41916
 - kubernetes/kubernetes#48638
 - kubernetes/kubernetes#52176

Using an NLB is a suggested mitigation from:
kubernetes/kubernetes#52176 (comment)

NLBs do not support security groups, so we delete the ELB SG and open
443 to the world on the master security group.
@jfoy
Contributor

jfoy commented Mar 5, 2018

Under 1.9.3 and 1.7.12, we are still seeing this failure cause an outage of a large fraction of production nodes roughly every week or two, across nine clusters in six different AWS regions.

Here's a procedure to reproduce the problem under controlled conditions on Kubernetes 1.9.3. I'm not sure how to turn this into a unit test, yet.

Intent:

In production, we see load balancers in front of the apiserver become unresponsive and stop sending ACKs on established TCP connections. After ~15 minutes, the kubelet socket's TCP retransmission delay is exhausted, the kernel considers the connection failed, and the Go runtime establishes a new TCP connection.

15 minutes is much longer than the usual node-status-update-frequency/node-monitor-grace-period. We need to identify and recover from this failure on the scale of 10-30s. The patch in #52176 attempts to meet this requirement by setting a ~10s HTTP request timeout on the client sending node status updates.
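As an aside on where the ~15 minutes comes from: with the Linux default of net.ipv4.tcp_retries2=15, the kernel documentation quotes a hypothetical timeout of roughly 924.6 seconds. A back-of-the-envelope check, assuming a 200ms base RTO that doubles up to the 120s cap:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	rto := 200 * time.Millisecond // assumed base retransmission timeout
	maxRTO := 120 * time.Second   // kernel cap on the exponential backoff
	total := time.Duration(0)
	for i := 0; i < 16; i++ { // successive retransmission intervals
		total += rto
		rto *= 2
		if rto > maxRTO {
			rto = maxRTO
		}
	}
	fmt.Println(total) // prints 15m24.6s, i.e. the ~15 minutes observed before recovery
}
```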

This procedure demonstrates that the kubelet cannot recover from a stalled TCP connection even though #52176 works as intended. The test simulates the effects of a remote TCP socket becoming unresponsive (not sending ACKs) by interposing a local TCP proxy between kubelet and apiserver, then suspending the proxy child process handling the connection. This doesn't precisely reproduce the production failure: in the test the remote socket continues sending ACKs, and as a consequence, the TCP retransmission timeout is never hit and the local socket never fails.

Even so, we can reliably get the kubelet to fail.

Procedure:

  1. Start with a running cluster node with a healthy kubelet. In one local terminal, observe node health:
    kubectl get $node -w
  2. Start two terminal sessions to the victim node.
  3. In terminal session A, start a socat session to proxy TCP traffic. Make sure to enable fork behavior on the listening socket to handle multiple connections:
    socat -d -d -d TCP4-LISTEN:60443,fork TCP4:$apiserver_ip:443
  4. In terminal session B, update your kubelet configuration to point to 127.0.0.1:60443. You may have to add an entry to the node's /etc/hosts to convince the TLS client that this is a valid address for the apiserver's cert.
  5. In session B, bounce the kubelet so it's going through your local proxy. In session A you should now see socat logging healthy connection information. The node should still be healthy in the apiserver; pause to verify this if needed.
  6. In session B, identify the child process spawned by socat to handle the active TCP connection. I used this command to discover it:
    ps -u"$USER" -opid,ppid,pgrp,cmd
  7. In session B, pause (do not terminate) the socat child process.
    kill -s TSTP $childpid
  8. When the node heartbeat timeout/retries have expired you should see the node become NotReady. The kubelet will log something like this, repeating ad nauseam:
kubelet_node_status.go:383] Error updating node status, will retry: error getting node "ip-10-31-100-133": Get https://controller:60443/api/v1/nodes/ip-10-31-100-133?resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
kubelet_node_status.go:383] Error updating node status, will retry: error getting node "ip-10-31-100-133": Get https://controller:60443/api/v1/nodes/ip-10-31-100-133: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
kubelet_node_status.go:383] Error updating node status, will retry: error getting node "ip-10-31-100-133": Get https://controller:60443/api/v1/nodes/ip-10-31-100-133: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
kubelet_node_status.go:383] Error updating node status, will retry: error getting node "ip-10-31-100-133": Get https://controller:60443/api/v1/nodes/ip-10-31-100-133: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
kubelet_node_status.go:383] Error updating node status, will retry: error getting node "ip-10-31-100-133": Get https://controller:60443/api/v1/nodes/ip-10-31-100-133: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
kubelet_node_status.go:375] Unable to update node status: update node status exceeds retry count
  9. In session B, show that a new TCP session succeeds against the same local proxy port:
    kubectl --kubeconfig /etc/kubernetes/kubeconfig.yaml get nodes
  10. In session B, bounce the kubelet. The new kubelet process will establish a new connection and the node will recover. You can also verify recovery by terminating the stopped proxy process with kill -s 9 $childpid or by reviving the stopped proxy with kill -s CONT $childpid.

Summary:

A stalled TCP connection to the apiserver still brings down an active node, with all the consequences outlined in kubernetes-retired/kube-aws#598, #41916, #48638, #44557, #48670, and Financial-Times/content-k8s-provisioner@d290fea.

I'm not sure where else to put responsibility for maintaining this connection, so my first suggestion would be to have HeartbeatClient force its connection(s) to close on request timeout.
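To make that suggestion concrete, here is one possible shape for it; this is a sketch under stated assumptions, not the eventual upstream fix, and every name below (rebuildingTransport, newTransport) is invented. The idea is to discard the pooled connections whenever a heartbeat request hits its deadline, so the next attempt has to dial fresh:

```go
package main

import (
	"context"
	"net/http"
	"sync"
	"time"
)

// rebuildingTransport is a hypothetical wrapper: when a request hits its deadline it
// discards the underlying transport (and its pooled TCP connections) and builds a new
// one, so the next heartbeat cannot reuse a stalled connection.
type rebuildingTransport struct {
	mu           sync.Mutex
	current      *http.Transport
	newTransport func() *http.Transport // constructor for a fresh transport
}

func (r *rebuildingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	r.mu.Lock()
	t := r.current
	r.mu.Unlock()

	resp, err := t.RoundTrip(req)
	if err != nil && req.Context().Err() == context.DeadlineExceeded {
		// The per-request timeout fired: assume the connection may be wedged and
		// replace the whole transport so the next attempt dials a new connection.
		r.mu.Lock()
		if r.current == t {
			r.current = r.newTransport()
		}
		r.mu.Unlock()
		t.CloseIdleConnections()
	}
	return resp, err
}

func main() {
	rt := &rebuildingTransport{
		current:      &http.Transport{},
		newTransport: func() *http.Transport { return &http.Transport{} },
	}
	// A heartbeat loop would use this client for node status updates, with a
	// per-request deadline as in the change above.
	_ = &http.Client{Transport: rt, Timeout: 10 * time.Second}
}
```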

@bobbytables

@jfoy - Just so I understand, your whole cluster basically busts, right? We're seeing the same thing, where nodes go into NotReady and all of a sudden the entire cluster can't do anything.

@jfoy
Contributor

jfoy commented Mar 5, 2018

@bobbytables Often not the entire cluster, but a big swath of it, yeah. It seems to be random based on which IP address each node's kubelet resolved the ELB name to, the last time it had to start a connection to the apiserver. The nodes drop out for ~15 minutes then suddenly recover, causing much of the load on the cluster to be rescheduled.

@bobbytables

bobbytables commented Mar 5, 2018

@bobbytables Often not the entire cluster, but a big swath of it, yeah. It seems to be random based on which IP address each node's kubelet resolved the ELB name to, the last time it had to start a connection to the apiserver. The nodes drop out for ~15 minutes then suddenly recover, causing much of the load on the cluster to be rescheduled.

Cool, I'm literally sitting in a post-mortem room talking about this. I'm copying your note.

@RyPeck
Contributor

RyPeck commented Mar 6, 2018

@jfoy @bobbytables we noticed that it tended to occur by availability zone.

@jfoy
Contributor

jfoy commented Mar 7, 2018

This appears to describe the upstream bug: kubernetes/client-go#342

@stephbu

stephbu commented Mar 30, 2018

@jfoy we just finished troubleshooting exactly the same issue with AWS NLB in US-East1 with a number of different clusters (mix of 1.7.8 and 1.9.3) #61917

It's a semi-random event that seems to have gotten worse over the last couple of weeks, to the extent that I'm seeing it multiple times per day now. It seems to happen more often at night. I suspect the sheer over-utilization in the region is driving reduced network quality, increasing the chance of socket failure.

@jfoy
Contributor

jfoy commented Mar 30, 2018

@stephbu Thanks, that's really useful to know (and disappointing!)

@zhouhaibing089
Contributor

The unit test here assumes a good TCP connection, which might not be the case in the real world.

@jfoy
Contributor

jfoy commented Mar 30, 2018

@zhouhaibing089 That's right. The problem seems to occur when the kernel thinks the socket is still good and is in a TCP retransmission-retry state. I suspect there's a way to recreate that situation in a unit test via raw sockets, but I haven't had time to develop that idea.
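One way to approximate that in pure Go, short of raw sockets, might be an in-process TCP proxy whose forwarding can be paused, playing the role of the suspended socat child in the reproduction above. All names and addresses below are hypothetical, and as already noted, ACKs keep flowing, so this still would not exercise the kernel's retransmission path:

```go
package main

import (
	"net"
	"sync/atomic"
	"time"
)

// stallableProxy forwards TCP traffic to an upstream address and can be paused,
// simulating an unresponsive middlebox between the kubelet and the apiserver.
type stallableProxy struct {
	listener net.Listener
	upstream string
	stalled  int32 // 1 while forwarding is paused
}

func (p *stallableProxy) Stall()  { atomic.StoreInt32(&p.stalled, 1) }
func (p *stallableProxy) Resume() { atomic.StoreInt32(&p.stalled, 0) }

func (p *stallableProxy) Serve() {
	for {
		c, err := p.listener.Accept()
		if err != nil {
			return
		}
		go p.forward(c)
	}
}

func (p *stallableProxy) forward(client net.Conn) {
	upstream, err := net.Dial("tcp", p.upstream)
	if err != nil {
		client.Close()
		return
	}
	pipe := func(dst, src net.Conn) {
		defer dst.Close()
		defer src.Close()
		buf := make([]byte, 32*1024)
		for {
			for atomic.LoadInt32(&p.stalled) == 1 {
				time.Sleep(50 * time.Millisecond) // hold all traffic while stalled
			}
			n, err := src.Read(buf)
			if n > 0 {
				if _, werr := dst.Write(buf[:n]); werr != nil {
					return
				}
			}
			if err != nil {
				return
			}
		}
	}
	go pipe(upstream, client)
	go pipe(client, upstream)
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:60443") // port mirrors the socat example above
	if err != nil {
		panic(err)
	}
	p := &stallableProxy{listener: ln, upstream: "10.0.0.1:443"} // placeholder apiserver address
	go p.Serve()
	// A test would point the client at 127.0.0.1:60443, call p.Stall(), and assert
	// that a new connection is eventually established once the stall is detected.
	select {}
}
```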

@obeattie

I'm afraid I haven't had any time to look into this further and write a concrete test case, but in case it's helpful: as I mentioned in #48670 (comment) we have been running that patch in production since August and have not seen this issue re-occur.

@henderjm

henderjm commented Apr 9, 2018

We observed the same issue as reported in #48670, tested the PR for our case, and it all works. Is there any possibility of reopening that PR, as it appears others could benefit from it?

@jfoy
Contributor

jfoy commented Jun 5, 2018

We've moved over to NLBs entirely and we're still hitting the issue on a node-by-node basis. I'm going to see what we can do for additional instrumentation to find out who's dropping the ball between the apiserver, the NLB, and the kubelet.

EDIT: moving conversation over to #48638

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/bug Categorizes issue or PR as related to a bug.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
release-note Denotes a PR that will be considered when it comes time to generate release notes.
sig/node Categorizes an issue or PR as relevant to SIG Node.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.