
kops rolling-update doesn't de-register instances from ELB network load balancer gracefully #11256

Closed
amorey opened this issue Apr 18, 2021 · 27 comments


@amorey

amorey commented Apr 18, 2021

1. What kops version are you running? The command kops version will display
this information.

1.20.0

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.20.5

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops rolling-update cluster <cluster-name> --instance-group <node-group-name> --yes --force

5. What happened after the commands executed?

Before running kops rolling-update, I ran a script in a loop to make HTTP requests to an NLB endpoint sitting in front of an echo server. At the point when kops rolling-update reported that it was stopping an instance, the HTTP requests started hanging and then recovered after ~90 seconds.

6. What did you expect to happen?

I expected the HTTP requests to continue to be handled successfully.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

I believe this issue is due to kops rolling-update detaching instances from their Auto Scaling groups without first de-registering them from their NLB target groups. My target group has a de-registration delay of 90 seconds, which might explain the ~90-second recovery time.
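
For illustration, this is roughly the de-registration step I believe is missing (the target group ARN and instance ID below are placeholders):

# Placeholder ARN and instance ID -- substitute real values.
aws elbv2 deregister-targets \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/abc123 \
  --targets Id=i-0123456789abcdef0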

@rifelpet
Member

I think the 90 seconds of dropped requests are due to the normal lifecycle of the pod and aren't specific to kops rolling-update. If you kubectl delete pod the echoserver pod, do you experience the same dropped requests?

If so, you can avoid this by ensuring the echoserver Deployment has replicas > 1, a PodDisruptionBudget matching the deployment's pods with minAvailable: 1, and a preStop lifecycle hook on the echoserver containers that sleeps for 90 seconds. The hook ensures that echoserver won't receive the shutdown signal until after the deregistration delay expires and the target is entirely removed from the target group. This KubeCon talk goes into more detail about the lifecycle hook if you're curious.

There are node labels we can add that will encourage load balancer controllers to remove the node from load balancers, but I'm not sure it is necessary if the workloads running on the node to be replaced are configured properly for zero downtime.

@amorey
Author

amorey commented Apr 20, 2021

Thanks for your quick response!

The echoserver's deployment already had 2 replicas and, as per your suggestion, I added a PodDisruptionBudget with minAvailable: 1 and a preStop sleep of 100 seconds. In addition, I added a terminationGracePeriodSeconds of 110 seconds to ensure the preStop hook would have time to complete.

Here's the script I used to run curl in a loop (it also prints a request timestamp and the response code):

while true;
do
  printf "$(date) "
  curl -I -s -o /dev/null -w "%{http_code}" <URL>
  echo ""
  sleep .1
done

And for the follow-up test I decided to try pinging a CLB endpoint as well as the original NLB endpoint both directly and through a CloudFlare proxy. Here is my echoserver yaml:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: echoserver
  labels:
    app: echoserver
spec:
  replicas: 2
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: echoserver
  template:
    metadata:
      labels:
        app: echoserver
    spec:
      terminationGracePeriodSeconds: 110
      containers:
        - image: k8s.gcr.io/echoserver:1.10
          name: echoserver
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: 4Mi
            limits:
              memory: 20Mi
          imagePullPolicy: Always
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 100"]
---
kind: Service
apiVersion: v1
metadata:
  name: echoserver-nlb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 80
      targetPort: 8080
  selector:
    app: echoserver
---
kind: Service
apiVersion: v1
metadata:
  name: echoserver-clb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "elb"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "90"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"                       
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"                     
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "5"                                 
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 80
      targetPort: 8080
  selector:
    app: echoserver
---
kind: PodDisruptionBudget
apiVersion: policy/v1beta1
metadata:
  name: echoserver
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: echoserver

And here are the configuration settings for the load balancers:

  • CLB
    • Cross-zone load balancing: Enabled
    • Idle timeout: 60 seconds
    • Connection Draining: Enabled, 90 seconds
  • NLB
    • Cross-zone load balancing: Enabled
    • Connection termination on deregistration: Enabled
    • Deregistration delay: 90 seconds
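
These NLB settings correspond to target group attributes that can be checked with the AWS CLI; a quick sketch, with a placeholder ARN:

# Placeholder target group ARN.
aws elbv2 describe-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/abc123
# For this NLB target group it should report deregistration_delay.timeout_seconds = 90
# and deregistration_delay.connection_termination.enabled = true.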

Finally, here were the results of running the kops rolling-update cluster command:

Results:

  • CLB-direct: HTTP 200 OK response codes without any slow response times
  • CLB-proxy: HTTP 520 (Web Server Returned an Unknown Error) response after a 60 sec. request time each time kops stopped an instance
  • NLB-direct: HTTP 200 OK response code after a 77 sec. request time each time kops stopped an instance
  • NLB-proxy: HTTP 524 (A Timeout Occurred) response code after a 100 sec. request time each time kops stopped an instance

For comparison, I set up ASG lifecycle hooks to 1) drain the Kubernetes node and 2) de-register the instance from its NLB target groups when an ASG termination request is received. With that in place, I can take down instances using the AWS CLI command autoscaling terminate-instance-in-auto-scaling-group without any downtime or slow request times (on the NLB).
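
Roughly, that lifecycle-hook setup looks like the following sketch (the ASG name, hook name, instance ID, and timeout are placeholders; the actual drain and de-registration work is done by an out-of-band handler):

# Pause instance termination so an external handler can drain the node and
# de-register it from its target groups before the instance goes away.
aws autoscaling put-lifecycle-hook \
  --auto-scaling-group-name nodes-example \
  --lifecycle-hook-name drain-and-deregister \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 300 \
  --default-result CONTINUE

# Terminate through the ASG; the hook holds the instance in Terminating:Wait
# until complete-lifecycle-action is called or the heartbeat timeout expires.
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 \
  --no-should-decrement-desired-capacity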

Please let me know if I've misconfigured something for rolling-update zero downtime.

@olemarkus
Member

I thought it could be related to kubernetes/kubernetes#100779, but I see you are the author of that one as well.

@amorey
Author

amorey commented Apr 20, 2021

Yeah, I've been testing various failure modes recently and that's another one I found. That issue concerns automatic ELB de-registration when the master node instances restart, so I don't think it's related, though there may be some subtle connection in the Kubernetes ELB controller code that I'm missing. I should add that I originally came across that issue while doing a kops rolling-update, but that's only a coincidence caused by the restart of the master node instance during a full rolling-update.

@johngmyers
Member

Perhaps not enough time between cordoning and termination for nodes that don't have any workload pods on them?

@johngmyers
Member

/area rolling-update

@johngmyers
Member

Perhaps we need to give the preStop and terminationGracePeriodSeconds treatment to kube-proxy.

@rifelpet
Member

The challenge with integrating that in kops is choosing an appropriate duration. It should be longer than the service's longest expected response time plus any delay in updating all relevant load balancers to stop sending new requests to the node.

It would probably be easiest to just expose it in the API. It'd likely be most appropriate in KubeProxyConfig. It would be nice if the setting was per-InstanceGroup rather than per-Cluster, but KubeProxyConfig isn't in InstanceGroupSpec.

@johngmyers
Member

First we should confirm it is the kube-proxy that is the issue.

I believe cordoning is what starts removing the node from the LB. If so, we would need a minimum time between that and termination, regardless of the time it takes to drain. We don't normally include kube-proxy in the drain and Cilium can implement its functionality without a pod, so I don't think terminationGracePeriodSeconds would help us after all.

By default we put all nodes, regardless of IG, into the LB pools. Someone would have to go out of their way to restrict the LB pools. We could put it into the RollingUpdate.

@amorey
Author

amorey commented Apr 20, 2021

In Kubernetes v1.18.17, kubectl cordon starts removing nodes from LBs, but this is not the case in v1.20.5 (there, kubectl delete node does). The behavior for CLBs and NLBs is the same in both versions.

I think these are the steps that need to happen in order to decommission an instance gracefully:

  1. De-register instance from the LBs
  2. Drain the node
  3. Close remaining network connections (either by stopping kube-proxy or letting LB close connections)
  4. Terminate instance

Stopping kube-proxy sounds like a good way to give rolling-update control over the timing of step 3, but the LB de-registration step is still missing.
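
A rough sketch of that sequence using the AWS CLI and kubectl (node name, target group ARN, and instance ID are placeholders):

# 1. De-register the instance from its LB target group(s) and wait out the delay.
aws elbv2 deregister-targets --target-group-arn "$TG_ARN" --targets "Id=$INSTANCE_ID"
aws elbv2 wait target-deregistered --target-group-arn "$TG_ARN" --targets "Id=$INSTANCE_ID"

# 2. Drain the node (DaemonSet pods, including kube-proxy, stay behind).
kubectl drain "$NODE_NAME" --ignore-daemonsets

# 3. Give remaining connections time to close (or stop kube-proxy on the node).
sleep 30

# 4. Terminate the instance via its Auto Scaling group.
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id "$INSTANCE_ID" --no-should-decrement-desired-capacity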

@johngmyers
Member

johngmyers commented Apr 20, 2021

@amorey https://kubernetes.io/docs/concepts/architecture/nodes/ states:

Caution: kubectl cordon marks a node as 'unschedulable', which has the side effect of the service controller removing the node from any LoadBalancer node target lists it was previously eligible for, effectively removing incoming load balancer traffic from the cordoned node(s).

Is the Kubernetes documentation out of date?

@amorey
Author

amorey commented Apr 20, 2021

Either that or maybe a regression. We might be able to figure out when/why the behavior changed by looking at the history here: https://github.com/kubernetes/kubernetes/commits/master/staging/src/k8s.io/legacy-cloud-providers/aws/aws_loadbalancer.go

@johngmyers
Member

Found in Kubernetes 1.19 release notes:

Service load balancers no longer exclude nodes marked unschedulable from the candidate nodes. The service load balancer exclusion label should be used instead.

Users upgrading from 1.18 who have cordoned nodes should set the node.kubernetes.io/exclude-from-external-load-balancers label on the impacted nodes before upgrading if they wish those nodes to remain excluded from service load balancers. (#90823, @smarterclayton) [SIG Apps, Cloud Provider and Network]
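
For reference, applying that label by hand looks something like this (the node name is a placeholder; the service controller keys off the label's presence):

kubectl label node ip-10-0-1-23.ec2.internal \
  node.kubernetes.io/exclude-from-external-load-balancers=true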

@johngmyers
Member

Filed kubernetes/website#27639

@johngmyers
Member

Partial fix in #11273

@amorey
Author

amorey commented Apr 21, 2021

Partial fix in #11273

🚀 Wow! That's the fastest turnaround time I've seen on any open source project. Thanks!

What are you thinking with regard to closing the active connections after the drain? For my use case, it's fine to let the LB close the connections, but I need control over the rolling-update timeout.

@amorey
Author

amorey commented Apr 21, 2021

Here's a kubernetes blog post from today:
https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/

Apparently there's a graceful node shutdown feature that's in beta and enabled by default starting with 1.21. Rolling-update should be able to offload some of its work to it, but I'm not sure whether the feature handles LB de-registration.

@johngmyers
Member

I believe the intent of the 5s post-drain delay is to give the node time to send FIN packets for all connections to/from the terminated pods.

You bring up the point that we need something to give the node time to send FIN packets for connections to/from kube-proxy. There's also the issue that we might need additional time for the load balancers to notice the node has been removed from the pool (especially if there are no pods to drain) and we might want to give additional time for outstanding requests through kube-proxy to complete.

As for Graceful Node Shutdown:

I don't see how to get Graceful Node Shutdown to respect PodDisruptionBudgets. Even if it could, I'd still like to have kops rolling update run the drain of non-DaemonSet pods.

We should make sure that Graceful Node Shutdown is appropriately hooked up in systemd, but that's another ticket.

We might need to tweak the kube-proxy pod to take advantage of Graceful Node Shutdown.

I would like to move the deletion of the node from the Kubernetes API to the node, but that appears to be explicitly disallowed by the Node Authorizer. I don't know the reason they did that.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jul 22, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Aug 21, 2021
@amorey
Author

amorey commented Aug 23, 2021

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Aug 23, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Nov 21, 2021
@johngmyers
Member

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Nov 21, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Feb 19, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Mar 21, 2022
@olemarkus
Member

/remove-lifecycle rotten

I think this one has been solved now with #12902 and #12966

/close

@k8s-ci-robot
Contributor

@olemarkus: Closing this issue.

In response to this:

/remove-lifecycle rotten

I think this one has been solved now with #12902 and #12966

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot removed the lifecycle/rotten label Mar 23, 2022