ELB IP changes can bring the cluster down #598

Closed
danielfm opened this issue Apr 26, 2017 · 76 comments

Collaborator

@danielfm danielfm commented Apr 26, 2017

I ran into kubernetes/kubernetes#41916 twice in the last 3 days in my production cluster, with almost 50% of worker nodes transitioning to NotReady state almost simultaneously on both days, causing a brief downtime in critical services due to Kubernetes' default (and aggressive) eviction policy for failing nodes.

I just contacted AWS support to validate the hypothesis of the ELB changing IPs at the time of both incidents, and the answer was yes.

My configuration (multi-node control plane with ELB) matches exactly the one in that issue, and probably most kube-aws users are subject to this.

Has anyone else run into this at some point?

Collaborator

@mumoshu mumoshu commented Apr 27, 2017

@danielfm Thanks for the report.

I don't think I have ever encountered the issue myself in my production cluster, but anyway, I believe this is a BIG issue.

For me, the issue seems to be composed of two parts: one is a long TTL other than the ELB's (for example, the TTL of a CNAME record in kube-aws, which points to the ELB's DNS name), and the other is an issue in ELB and/or kubelet which prevents kubelet from detecting broken connections to apiservers.

Is my assumption correct?

Anyway, once recordSetTTL in cluster.yaml is set considerably lower than 60s (= ELB's default TTL), kubelet should detect any dead long-polling connection to one of the API servers (via an ELB CNAME -> ELB instances), then re-discover ELB instances via DNS and reconnect to one of them.

However, I'm happy to work around the issue in kube-aws.

Possible work-arounds

Perhaps:

  • periodically monitoring the list of IPs in a Route 53 alias record for an ELB, and then
  • restarting kubelet whenever the monitor detects that the list has changed

so that kubelet can reconnect to one of active ELB ips before it marks the k8s node NotReady?

If we go that way, the monitor should be executed periodically with an interval no longer than (<node-monitor-grace-period> - <amount of time required for kubelet to start> - <amount of time required for each monitor run>) / <a heuristic value: N> - <amount of time required to report node status after kubelet started>, so that we give kubelet up to N chances to successfully report its status to a possibly available ELB.
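As a rough illustration of the interval formula above, here is a minimal bash sketch. The function name and every number passed to it are hypothetical values chosen for illustration, not measurements from a real cluster.

```bash
#!/usr/bin/env bash
# Sketch only: computes an upper bound (in whole seconds) for the watcher interval
# from the formula above. All inputs are illustrative assumptions.
max_watch_interval() {
  local grace_period=$1   # node-monitor-grace-period
  local kubelet_start=$2  # time required for kubelet to start
  local monitor_run=$3    # time required for each monitor run
  local chances=$4        # the heuristic value N
  local report_delay=$5   # time to report node status after kubelet started
  echo $(( (grace_period - kubelet_start - monitor_run) / chances - report_delay ))
}

# (40 - 10 - 1) / 3 - 5 with integer division
max_watch_interval 40 10 1 3 5   # prints 4
```

With these assumed inputs the watcher would have to run at least every 4 seconds, which is in the same range as the 3 second default used by the script later in this thread.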

Or we can implement DNS round-robin with a health-checking mechanism for serving k8s API endpoints, as suggested and described in #281 and #373.

Collaborator

@mumoshu mumoshu commented Apr 27, 2017

This hash changes only when the backend IPs of an ELB have changed.

$ dig my.k8s.cluster.example.com +noall +answer +short | sort | sha256sum | awk '{print $1}'
96ff696e9a0845dd56b1764dd3cfaa1426cdfbd7510bab64983e97268f8e0bc4

$ dig my.k8s.cluster.example.com +noall +answer +short | sort
52.192.165.96
52.199.129.34
<elb id>.ap-northeast-1.elb.amazonaws.com.

Collaborator

@mumoshu mumoshu commented Apr 27, 2017

@danielfm I guess setting recordSetTTL in cluster.yaml (which sets the TTL of a CNAME record associated with a controller ELB) to a value considerably lower than both the TTL of the ELB's DNS A records (= 60 sec) and the controller manager's --node-monitor-grace-period, and running the following service, would alleviate the issue.

/etc/systemd/system/elb-ip-change-watcher.service:

[Service]
Type=notify
Restart=on-failure
Environment=API_ENDPOINT_DNS_NAME={{.APIEndpoint.DNSName}}
ExecStart=/opt/bin/elb-ip-change-watcher

/opt/bin/elb-ip-change-watcher:

#!/usr/bin/env bash

set -vxe

current_elb_backends_version() {
  dig ${API_ENDPOINT_DNS_NAME:?Missing required env var API_ENDPOINT_DNS_NAME} +noall +answer +short | \
    # take into account only IPs, even if dig returned a CNAME answer (when API_ENDPOINT_DNS_NAME is a CNAME rather than an A or Route 53 "Alias" record)
    grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' | \
    # sort IPs so that DNS round-robin doesn't unexpectedly trigger a kubelet restart
    sort | \
    sha256sum | \
    # sha256sum returns outputs like "<sha256 hash value> -". We only need the hash value excluding the trailing hyphen
    awk '{print $1}'
}

run_once() {
  local file=$ELB_BACKENDS_VERSION_FILE
  local prev_ver current_ver
  # An empty previous version means this is the first run (no version file yet)
  prev_ver=$(cat "$file" 2>/dev/null || echo)
  current_ver=$(current_elb_backends_version)
  echo "comparing the previous version $prev_ver and the current version $current_ver"
  if [ -z "$prev_ver" ] || [ "$prev_ver" == "$current_ver" ]; then
    echo "the version has not changed. no need to restart kubelet."
    if [ "$KUBELET_RESTART_STRATEGY" == "watchdog" ]; then
      echo "notifying kubelet's watchdog not to trigger a restart of kubelet..."
      local kubelet_pid
      kubelet_pid=$(systemctl show "$KUBELET_SYSTEMD_UNIT_NAME" -p MainPID | cut -d '=' -f 2)
      systemd-notify --pid="$kubelet_pid" WATCHDOG=1
    fi
  else
    echo "the version has changed. need to restart kubelet."
    if [ "$KUBELET_RESTART_STRATEGY" == "systemctl" ]; then
      systemctl restart "$KUBELET_SYSTEMD_UNIT_NAME"
    fi
  fi
  echo "writing $current_ver to $file"
  echo "$current_ver" > "$file"
}

ELB_BACKENDS_VERSION_FILE=${ELB_BACKENDS_VERSION_FILE:-/var/run/coreos/elb-backends-version}
KUBELET_SYSTEMD_UNIT_NAME=${KUBELET_SYSTEMD_UNIT_NAME:-kubelet.service}
KUBELET_RESTART_STRATEGY=${KUBELET_RESTART_STRATEGY:-systemctl}
WATCH_INTERVAL_SEC=${WATCH_INTERVAL_SEC:-3}

systemd-notify --ready
while true; do
  systemd-notify --status "determining if there're changes in elb ips"
  run_once
  systemd-notify --status "sleeping for $WATCH_INTERVAL_SEC seconds"
  sleep $WATCH_INTERVAL_SEC
done

Collaborator

@mumoshu mumoshu commented Apr 27, 2017

Also - I had never realized this, but for controller ELBs we're creating CNAME records, which have their own TTL (= 300 seconds by default in kube-aws) separate from the TTL of the ELB's A records.
https://github.com/kubernetes-incubator/kube-aws/blob/master/core/controlplane/config/templates/stack-template.json#L778
We'd better make them Route53 Alias records so that there will be only one TTL.

Collaborator

@mumoshu mumoshu commented Apr 27, 2017

Updated my assumption on the issue in my first comment #598 (comment)

Collaborator

@mumoshu mumoshu commented Apr 27, 2017

Hmm, the TTL of a CNAME record associated with an ELB's DNS name seems to be capped at 60s, even though kube-aws' default recordSetTTL is 300s.
Then we won't need to care that much about the balance between the two TTLs (kube-aws CNAME and ELB's A).

Collaborator

@mumoshu mumoshu commented Apr 27, 2017

Anyway, forcibly restarting kubelet so that it reconnects to the apiserver(s) when necessary (= ELB IPs changed) would still be important.
The script and the systemd unit I've suggested in #598 (comment) would be useful to achieve that.

Contributor

@redbaron redbaron commented Apr 27, 2017

@mumoshu does this mean that ELB-less mode, with just a CNAME record in Route53 pointing to controller nodes, is considered dangerous now?

Collaborator

@mumoshu mumoshu commented Apr 27, 2017

@redbaron Do you mean a single CNAME record for your k8s API endpoint which points to just one controller node's DNS name?

I've been assuming that, if you go without an ELB, you would have a DNS name associated with one or more A records (rather than a CNAME), each associated with one of the controller nodes' public/private IPs.

Contributor

@redbaron redbaron commented Apr 27, 2017

That doesn't change the fact that the final set of A records returned from a DNS request changes when you trigger a controllers ASG update, right? From what I read here, that matches the case when ELBs change their IPs.

Collaborator

@mumoshu mumoshu commented Apr 27, 2017

@redbaron Ah, so you have a CNAME record which points to a DNS record containing multiple A records, each associated with one of the controller nodes, right?

If so, no, as long as you have low TTLs for your DNS records.

AFAIK, someone in the upstream issue said that it becomes an issue only when ELB is involved. Perhaps ELB doesn't immediately shut down an unnecessary instance and doesn't send FIN? If that's the case, kubelet would be unable to detect broken connections immediately.

When your controllers ASG is updated, old, unnecessary nodes would be terminated before they become non-functional, unlike ELB's instances.

So your ELB-less mode with CNAME+A records would be safe as long as you have health checks to update the route 53 record set to eventually return A records "only for healthy controller nodes", and you have a TTL lower than --node-monitor-grace-period

Contributor

@redbaron redbaron commented Apr 27, 2017

The ELB-less mode I refer to is a recent feature in kube-aws. I didn't check exactly which records it creates; I just wanted to verify that it is still a safe option considering this bug report.

Collaborator

@mumoshu mumoshu commented Apr 27, 2017

@redbaron If you're referring to the DNS round-robin for API endpoints, it isn't implemented yet.

Contributor

@redbaron redbaron commented Apr 27, 2017

I wonder if using ALB can help here. L7 load balancer precisely knows all incoming/outgoing requests and can forcibly yet safely close HTTP connection when ELB scales/changes IP addresses.

Collaborator

@mumoshu mumoshu commented Apr 28, 2017

@redbaron Thanks, I never realized that ALBs may help. I hope they can help, too.
Although I'd like contributors in the upstream issue to clarify whether ALBs help or not,
shall we take this opportunity to add support for ALBs into kube-aws anyway? 😃

Collaborator

@camilb camilb commented Apr 28, 2017

Today this happened to me for the first time, with a 128-day-old cluster.

Collaborator Author

@danielfm danielfm commented Apr 28, 2017

Seems like Amazon is doing some serious maintenance work in the ELB infrastructure these days...

Collaborator

@mumoshu mumoshu commented May 9, 2017

@danielfm @camilb @redbaron For me it seems like there are still no obvious fixes other than:

  • Adding a dirty, periodic elb monitoring and kubelet restarting script of mine
  • Moving to Route 53 DNS round-robin based api endpoints (suggested by @tarvip in kubernetes/kubernetes#41916 (comment) and #281)
    • A major challenge in including this in kube-aws is how to keep Route 53 record sets up-to-date with the controller nodes running at the time. Rolling-updates of controller nodes, sudden termination of controller nodes, unresponsive apiserver containers, etc., should ideally be considered while syncing record sets.

Would you like to proceed with any of them or any other idea(s)?

Contributor

@tarvip tarvip commented May 9, 2017

Rolling-updates of controller nodes, sudden termination of controller nodes, unresponsive apiserver containers, etc

I think sudden termination, unresponsive apiserver, etc. shouldn't cause problems as long as these events are not happening at the same time, but in that case even the ELB setup fails.

But keeping the route53 record up-to-date is a bit more complicated without using Lambda and SNS.
When adding/restarting or replacing (terminating and recreating) controller nodes, we could have a unit in cloud-config that is executed when the host comes up; this script gets all controller nodes that are part of the ASG and modifies the route53 record accordingly. But I don't know how to handle node removal (without adding new node back, although I think this is not happening very often).

Also, I think it shouldn't cause problems if the route53 record is not fully up-to-date when restarting or replacing controller nodes; as long as they are not replaced at the same time, there should be at least one working controller.
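A minimal sketch of such a sync script, assuming the AWS CLI is available on the controller and the instance profile is allowed to call the autoscaling, ec2 and route53 APIs. The function names, the default TTL of 30 and the use of private IPs are illustrative assumptions, not something kube-aws ships.

```bash
#!/usr/bin/env bash
# Sketch only: build a Route 53 UPSERT change batch for an A record set.
# Usage: build_change_batch <record name> <ttl> <ip>...
build_change_batch() {
  local name=$1 ttl=$2
  shift 2
  local records="" ip
  for ip in "$@"; do
    # Separate entries with a comma once the list is non-empty
    records+="${records:+,}{\"Value\":\"$ip\"}"
  done
  printf '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s","Type":"A","TTL":%s,"ResourceRecords":[%s]}}]}\n' \
    "$name" "$ttl" "$records"
}

# Sketch only: point <record name> at all in-service instances of the controller ASG.
# Usage: sync_controller_records <asg name> <hosted zone id> <record name> [ttl]
sync_controller_records() {
  local asg=$1 zone_id=$2 name=$3 ttl=${4:-30}
  local ids ips
  ids=$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$asg" \
    --query 'AutoScalingGroups[0].Instances[?LifecycleState==`InService`].InstanceId' \
    --output text)
  if [ -z "$ids" ]; then
    echo "no in-service controllers found; leaving the record untouched" >&2
    return 1
  fi
  # $ids and $ips are intentionally unquoted so that each ID/IP becomes its own argument
  ips=$(aws ec2 describe-instances --instance-ids $ids \
    --query 'Reservations[].Instances[].PrivateIpAddress' --output text | tr '\t' '\n' | sort)
  aws route53 change-resource-record-sets --hosted-zone-id "$zone_id" \
    --change-batch "$(build_change_batch "$name" "$ttl" $ips)"
}
```

Because the record is rebuilt from the ASG membership on every run, a periodic run would also cover node removal, at the cost of the record lagging behind by up to one interval.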

Collaborator

@mumoshu mumoshu commented May 9, 2017

Thanks @tarvip!

I think sudden termination, unresponsive apiserver, etc. shouldn't cause problems as long as these events are not happening at the same time, but in that case even the ELB setup fails.

Sorry if I wasn't clear enough but the above concerns are not specific to this problem but more general. I just didn't want to introduce a new issue due to missing health checks / constant updates to route 53 record sets.

Collaborator

@mumoshu mumoshu commented May 9, 2017

Also, I think it shouldn't cause problems if the route53 record is not fully up-to-date when restarting or replacing controller nodes; as long as they are not replaced at the same time, there should be at least one working controller.

Probably that's true - then, my question is whether everyone is OK with e.g. a 50% k8s API error rate persisting for several minutes while one of two controller nodes is being replaced?

Collaborator

@mumoshu mumoshu commented May 9, 2017

But I don't know how to handle node removal (without adding new node back, although I think this is not happening very often).

If we go without cloudwatch events + lambda, we probably need a systemd timer which periodically triggers a script to update a route53 record set, so that it eventually reflects controller nodes terminated either expectedly or unexpectedly, right?

Contributor

@redbaron redbaron commented May 9, 2017

@mumoshu , is ALB known not to help here?

Contributor

@tarvip tarvip commented May 9, 2017

I just didn't want to introduce a new issue due to missing health checks / constant updates to route 53 record sets.

I think missing health check is ok, that health check is just TCP check anyway, kubelet and kube-proxy can also detect connection failure and recreate connection to another host.
Regarding constant updates, I guess we can perform update only if current state is different compared to route53.
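One way to skip redundant updates, sketched in bash: compare the IPs currently in the record with the desired ones as sets, and only call Route 53 when they differ. The function name and the way the two lists are passed (whitespace-separated strings) are assumptions made for this sketch.

```bash
#!/usr/bin/env bash
# Sketch only: succeeds (exit 0) when the two whitespace-separated IP lists
# differ as sets, i.e. when the Route 53 record would need an update.
# Usage: needs_update "<current ips>" "<desired ips>"
needs_update() {
  local current desired
  # $1 and $2 are intentionally unquoted so that each IP lands on its own line;
  # sort -u makes the comparison independent of order and duplicates.
  current=$(printf '%s\n' $1 | sort -u)
  desired=$(printf '%s\n' $2 | sort -u)
  [ "$current" != "$desired" ]
}

# Hypothetical usage: only touch Route 53 when the set of controllers has changed
if needs_update "10.0.0.1 10.0.0.2" "10.0.0.2 10.0.0.3"; then
  echo "record set is out of date; update needed"
fi
```

Ignoring order matters here because DNS round-robin can return the same records in a different order on every query.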

Collaborator

@mumoshu mumoshu commented May 9, 2017

Thanks @redbaron - No but it isn't known to help either.

Contributor

@tarvip tarvip commented May 9, 2017

If we go without cloudwatch events + lambda, we probably need a systemd timer which periodically triggers a script to update a route53 record set, so that it eventually reflects controller nodes terminated either expectedly or unexpectedly, right?

Yes, that is one way to solve it.

Collaborator

@mumoshu mumoshu commented May 9, 2017

@tarvip Thanks for the reply!

I think missing health check is ok, that health check is just TCP check anyway, kubelet and kube-proxy can also detect connection failure and recreate connection to another host.

Ah, makes sense to me! At least it seems worth trying to me now - thanks for the clarification.

Contributor

@whereisaaron whereisaaron commented Oct 9, 2017

AWS's NLB replacement for ELBs uses one IP per zone/subnet, and those IPs can be your EIPs. So using this new product you can get an LB with a set of fixed IPs that won't change.

http://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html

AWS's new ALB (for HTTPS) and NLB (for TCP) seem to be AWS's next-gen replacements for the old ELB, which AWS now calls 'Classic Load Balancers'. k8s and kube-aws should probably look to transition to the new products, which also appear to have some advantages, such as fixed IPs - as I see #937 and #945 are doing! 🎉

Collaborator

@mumoshu mumoshu commented Oct 10, 2017

@whereisaaron Thanks for the suggestion! I agree with your point. Anyway, please let me also add that ALB was experimented with in #608 and judged not appropriate as a K8S API load balancer.

@rodawg rodawg commented Nov 6, 2017

Unfortunately NLBs don't support VPC Peering on AWS, so some users (including me) will need to use Classic ELBs in conjunction with NLBs to support kubectl commands.

@stephbu stephbu commented Nov 12, 2017

Yes we see this today in production, and experienced player impact yesterday from this exact issue.
Working with AWS support we reproduced the issue by forcing a scaledown on the API ELB for one of our integration clusters. All worker nodes went stale and workloads were evicted before nodes recovered at the 15min mark after the scaling event.

We confirmed that the DNS was updated almost immediately. We're going with the Kubelet restart tied to DNS change for the time being, but IMHO this is not a good long-term fix.

@javefang javefang commented Nov 15, 2017

Seen this today. Our setup uses Consul DNS for kubelet to discover the apiserver, which means the apiserver DNS name resolves to multiple A records pointing to the exact IP addresses of the apiservers, which change every time an apiserver node is replaced.

In our case the workers come back eventually, but it takes a long while. My feeling is that kubelet is not really respecting DNS TTLs, as all Consul DNS names have TTL set to 0. Can anyone confirm?

Collaborator

@mumoshu mumoshu commented Nov 15, 2017

Thanks everyone.
At this point, would the only possible, universal work-around be the one shared by @roybotnik?
// At least mine won't work with @javefang's case of course.

Collaborator

@mumoshu mumoshu commented Nov 15, 2017

I was under the impression that, since some k8s version, kubelet has implemented a client-side timeout to mitigate this issue, but I can't remember the exact GitHub issue right now.

@javefang javefang commented Nov 15, 2017

I noticed that after the master DNS record changed the underlying IP, all kubelet instances fail for exactly 15min. (Our master DNS TTL is 0). When it fails we get the following error.

Nov 15 13:08:08 dev-kubeworker-gen-0 kubelet[11954]: E1115 13:08:08.638348   11954 kubelet_node_status.go:390] Error updating node status, will retry: error getting node "dev-kubeworker-gen-0": Get https://apiserver-gen.service.dev.winton.consul/api/v1/nodes/dev-kubeworker-gen-0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

It recovered on its own without restarting after 15min (sharp). It feels more like kubelet (or the apiserver client used) is caching the DNS. I'm trying to pinpoint the exact line of code which causes this behaviour. But maybe someone who knows the code base better can confirm this?

@javefang javefang commented Nov 15, 2017

Seeing the following messages right before the worker came back. The last failure was at 13:15:18, then it reported some watch errors (10.106.102.105 was the previous master which got destroyed) and re-resolved the DNS name before the cluster reported the worker as "Ready" again! Maybe this is related to the kubelet watch on the apiserver not being dropped quickly enough when the apiserver endpoint becomes unavailable?

Nov 15 13:15:16 dev-kubeworker-gen-0 kubelet[11954]: I1115 13:15:16.994725   11954 qos_container_manager_linux.go:320] [ContainerManager]: Updated QoS cgroup configuration
Nov 15 13:15:18 dev-kubeworker-gen-0 kubelet[11954]: E1115 13:15:18.670478   11954 kubelet_node_status.go:390] Error updating node status, will retry: error getting node "dev-kubeworker-gen-0": Get https://apiserver-gen.service.dev.test.consul/api/v1/nodes/dev-kubeworker-gen-0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Nov 15 13:15:22 dev-kubeworker-gen-0 kubelet[11954]: E1115 13:15:22.640287   11954 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.106.102.104:49178->10.106.102.105:443: read: no route to host
Nov 15 13:15:22 dev-kubeworker-gen-0 kubelet[11954]: E1115 13:15:22.640410   11954 kubelet_node_status.go:390] Error updating node status, will retry: error getting node "dev-kubeworker-gen-0": Get https://apiserver-gen.service.dev.test.consul/api/v1/nodes/dev-kubeworker-gen-0: read tcp 10.106.102.104:49178->10.106.102.105:443: read: no route to host
Nov 15 13:15:22 dev-kubeworker-gen-0 kubelet[11954]: E1115 13:15:22.640943   11954 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.106.102.104:49178->10.106.102.105:443: read: no route to host
Nov 15 13:15:22 dev-kubeworker-gen-0 kubelet[11954]: E1115 13:15:22.641445   11954 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.106.102.104:49178->10.106.102.105:443: read: no route to host
Nov 15 13:15:22 dev-kubeworker-gen-0 kubelet[11954]: W1115 13:15:22.663747   11954 reflector.go:334] k8s.io/kubernetes/pkg/kubelet/kubelet.go:413: watch of *v1.Service ended with: too old resource version: 1883 (16328)
Nov 15 13:15:22 dev-kubeworker-gen-0 kubelet[11954]: W1115 13:15:22.665010   11954 reflector.go:334] k8s.io/kubernetes/pkg/kubelet/kubelet.go:422: watch of *v1.Node ended with: too old resource version: 16145 (16328)
Nov 15 13:15:22 dev-kubeworker-gen-0 kubelet[11954]: W1115 13:15:22.665806   11954 reflector.go:334] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: watch of *v1.Pod ended with: too old resource version: 5602 (16345)

Found a possible line of code which explains the 15min behaviour

https://github.com/kubernetes/kubernetes/blob/fc8bfe2d8929e11a898c4557f9323c482b5e8842/pkg/kubelet/kubeletconfig/watch.go#L44

Contributor

@whereisaaron whereisaaron commented Nov 15, 2017

It seems like there is a problem. If the controller DNS entry has a 30 second TTL, the kubelet should be able to recover from an IP change within 30s + update period, so about 40s. @javefang you think the kubelet is using this long, up to 15 minute back-off when the old IP goes stale? And so not a DNS caching problem, but rather it just stops trying to update the controller for several minutes?

For AWS at least, an NLB using fixed EIP addresses would mostly obviate the IP address ever changing, I think? Even if you recreate or move the LB, you can reapply the EIP so nothing changes. However, an extra wrinkle is that we would want worker nodes in multi-AZ clusters to use the EIP for the NLB endpoint in the same AZ. NLBs have one EIP per AZ, as I understand it?

We saw a similar issue a couple of times where the workers couldn't contact the controllers for ~2 minutes (no IP address change involved). Even though that is well under the 5 minute eviction time, everything got evicted anyway. Maybe the same back-off issue?

@javefang javefang commented Nov 15, 2017

@whereisaaron yep this is indeed taking 15min for kubelet to recover. I have reproduced it with the following setup:

  • OS: Centos 7.4 (SELinux on)
  • Docker: 1.12.6
  • K8S: 1.8.3
  • Apiserver: 3 instances running on separate VMs
  • Apiserver DNS names: all 3 registered as Consul Services, which does DNS round-robin for them (dig apiserver.service.consul will show 3 IPs, pointing to the VMs running the apiserver)

To reproduce:

  1. Destroy the VM running apiserver 1
  2. Create a new VM to replace (this will get a different IP)
  3. 30% of the worker nodes go into "NotReady" state, and kubelet prints the error message kubelet_node_status.go:390] Error updating node status, will retry: error getting node "dev-kubeworker-gen-0": Get https://apiserver-gen.service.dev.winton.consul/api/v1/nodes/dev-kubeworker-gen-0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  4. Repeat 1-3 for the other 2 apiservers
  5. Now all workers should be in "NotReady" state
  6. Wait 15min, kubelet on workers print the unable to decode an event from the watch stream and read: no route to host message before coming back to "Ready" state

I'm just curious about the mechanism in kubelet that can cause kubelet to be broken for 15min after any apiserver IP changes. We are deploying this on-premise. Tomorrow I'll try to put the 3 apiservers behind a load balancer with fixed IP to see if that fixes the issue.

@javefang javefang commented Nov 17, 2017

UPDATE: putting all apiservers behind a load balancer (round-robin) with a static IP fixed it. Now all workers work fine even if I replace one of the master nodes. So using a fixed-IP load balancer would be my workaround for now. But do you think it's still worth investigating why kubelet doesn't respect the apiserver's DNS TTL?

@RyPeck RyPeck commented Dec 15, 2017

I believe the 15 minute window break many of us are experiencing is described in kubernetes/kubernetes#41916 (comment). Reading through issues and pull requests, I don't see where a TCP Timeout was implemented on the underlying connection. The timeout on the HTTP request definitely was implemented.

liggitt pushed a commit to liggitt/kubernetes that referenced this issue May 15, 2018
Kubernetes Submit Queue
…-connections

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

track/close kubelet->API connections on heartbeat failure

xref kubernetes#48638
xref kubernetes-incubator/kube-aws#598

we're already typically tracking kubelet -> API connections and have the ability to force close them as part of client cert rotation. if we do that tracking unconditionally, we gain the ability to also force close connections on heartbeat failure as well. it's a big hammer (means reestablishing pod watches, etc), but so is having all your pods evicted because you didn't heartbeat.

this intentionally does minimal refactoring/extraction of the cert connection tracking transport in case we want to backport this

* first commit unconditionally sets up the connection-tracking dialer, and moves all the cert management logic inside an if-block that gets skipped if no certificate manager is provided (view with whitespace ignored to see what actually changed)
* second commit plumbs the connection-closing function to the heartbeat loop and calls it on repeated failures

follow-ups:
* consider backporting this to 1.10, 1.9, 1.8
* refactor the connection managing dialer to not be so tightly bound to the client certificate management

/sig node
/sig api-machinery

```release-note
kubelet: fix hangs in updating Node status after network interruptions/changes between the kubelet and API server
```

@frankconrad frankconrad commented Sep 22, 2018

All of these work around the real problem, which is that the connections are kept forever.
If we limited the lifetime of a connection by time, or at least by the number of handled requests, the problem would not happen.
We would also get better load distribution, because new connections allow the load balancer to redistribute the load.

@liggitt liggitt commented Sep 22, 2018

the connections are kept forever. If we limited the lifetime of a connection by time, the problem would not happen.

They don't live forever, they live for the operating system TCP timeout limit (typically 15 minutes by default)
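For reference, the roughly 15 minute figure can be reproduced with a small bash calculation. This sketch assumes the usual Linux values (net.ipv4.tcp_retries2 = 15, a minimum retransmission timeout of 200 ms that doubles on each retry, capped at 120 s); the function name is hypothetical, and real connections with a higher measured RTT will take somewhat longer.

```bash
#!/usr/bin/env bash
# Sketch only: lower bound (in ms) on how long Linux keeps retransmitting on a
# dead established connection before giving up, under the assumptions above.
# Usage: tcp_retrans_timeout_ms [retries]   (retries corresponds to tcp_retries2)
tcp_retrans_timeout_ms() {
  local retries=${1:-15}
  local rto=200 max_rto=120000 total=0 i
  # One wait per retransmission plus the final wait after the last one
  for (( i = 0; i <= retries; i++ )); do
    total=$(( total + rto ))
    rto=$(( rto * 2 ))
    if (( rto > max_rto )); then
      rto=$max_rto
    fi
  done
  echo "$total"
}

tcp_retrans_timeout_ms 15   # prints 924600, i.e. about 15.4 minutes
```

Lowering tcp_retries2 on the nodes would shorten this window for every TCP connection on the host, which is a much broader change than a fix inside kubelet.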

Collaborator Author

@danielfm danielfm commented Sep 22, 2018

I haven't seen this happening anymore in some of the latest versions of Kubernetes 1.8.x (and I suspect the same is true for newer versions as well), so maybe we can close this?

@frankconrad frankconrad commented Sep 22, 2018

Yes, and these 15 min are too long for many cases, like here.
The connections to ELB/ALB nodes go dead when those nodes are terminated after being deprecated for 6 days, meaning no longer visible in DNS.
If we reconnected every hour (or every 10 min) we would not have the problem, and as a side effect we would get better load distribution, while still keeping all the benefits of keepalive.
What has been done here is a workaround for the real problem: no dynamic cloud-based load balancer handles long-lived connections well.
The problem needs to be fixed in HTTP connection pooling too, since at the higher level there is no real influence over connection reuse if you use the pool feature.

@liggitt liggitt commented Sep 22, 2018

The fix merged into the last several releases of kubernetes was to drop/reestablish the apiserver connections from the kubelet if the heartbeat times out twice in a row. Reconnecting every 10 minutes or every hour would still let nodes go unavailable.
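The idea can be sketched in bash as a heartbeat loop that counts consecutive failures. This only illustrates the logic described above and is not the kubelet code; run_heartbeats and the two callback commands passed to it are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch only: run <rounds> heartbeats and call <reset_cmd> whenever the
# heartbeat has failed twice in a row.
# Usage: run_heartbeats <heartbeat_cmd> <reset_cmd> <rounds>
run_heartbeats() {
  local heartbeat_cmd=$1 reset_cmd=$2 rounds=$3
  local failures=0 i
  for (( i = 0; i < rounds; i++ )); do
    if "$heartbeat_cmd"; then
      failures=0
    else
      failures=$(( failures + 1 ))
      if (( failures >= 2 )); then
        # Drop the existing connections so the next attempt dials (and resolves) again
        "$reset_cmd"
        failures=0
      fi
    fi
  done
}

# Hypothetical usage with stand-in commands
failing_heartbeat() { return 1; }
close_connections() { echo "closing all apiserver connections"; }
run_heartbeats failing_heartbeat close_connections 4   # the message is printed twice
```

The point of keying on failures rather than on a timer is that a healthy connection is never torn down, while a dead one is replaced after two missed heartbeats instead of after the TCP timeout.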

@frankconrad frankconrad commented Sep 22, 2018

From what I've seen in other Go projects: if you use pooling and send requests so frequently that the keepalive idle timeout is never reached, you run into this issue. If you disable pooling and make only one request per connection, you don't have that issue, but you get higher latency and overhead, which is why keepalive makes sense.

By the way, the old Apache httpd had not only a keepalive idle timeout but also a keepalive max request count, which helped a lot with many of these problems.

@fejta-bot fejta-bot commented Apr 25, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@fejta-bot fejta-bot commented May 25, 2019

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@fejta-bot fejta-bot commented Jun 24, 2019

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Contributor

@k8s-ci-robot k8s-ci-robot commented Jun 24, 2019

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
