Termination + start of new master de-registers nodes from AWS ELB (both CLB and NLB) #100779

Closed
amorey opened this issue Apr 2, 2021 · 10 comments
Assignees: kishorj
Labels: area/provider/aws, kind/bug, lifecycle/rotten, sig/cloud-provider, triage/accepted

Comments

@amorey commented Apr 2, 2021:

What happened:

After a master node instance is terminated and a new master node joins the cluster, the worker node instances behind AWS ELBs are de-registered from the load balancers. With CLBs, the instances are de-registered and then registered again; with NLBs, the targets are drained and then re-initialized.
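
The same registration state is visible from the AWS CLI; a minimal sketch for anyone reproducing this, where the CLB name and target group ARN are placeholders for the resources the cloud provider generates for the Services below:

# CLB: list the registered instances and their InService/OutOfService state (name is a placeholder)
aws elb describe-instance-health --load-balancer-name <clb-name>

# NLB: list the registered targets and their state (initial/healthy/draining); the ARN is a placeholder
aws elbv2 describe-target-health --target-group-arn <target-group-arn>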

What you expected to happen:

I expected the new master node to join the cluster gracefully without causing the load balancer to de-register instances.

How to reproduce it (as minimally and precisely as possible):

  1. Set up a 3-node cluster using kops (e.g. kops create cluster --zones=us-east-1a,us-east-1b)
  2. Create a classic load balancer and a network load balancer service:
kind: Deployment
apiVersion: apps/v1
metadata:
  name: echoserver
  labels:
    app: echoserver
spec:
  replicas: 2
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: echoserver
  template:
    metadata:
      labels:
        app: echoserver
    spec:
      containers:
        - image: k8s.gcr.io/echoserver:1.10
          name: echoserver
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: 4Mi
            limits:
              memory: 20Mi
          imagePullPolicy: Always
---
kind: Service
apiVersion: v1
metadata:
  name: echoserver-nlb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 80
      targetPort: 8080
  selector:
    app: echoserver
---
kind: Service
apiVersion: v1
metadata:
  name: echoserver-clb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "elb"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"                
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"                      
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "60"                        
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"                       
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"                     
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "5"                                 
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 80
      targetPort: 8080
  selector:
    app: echoserver
  3. Terminate the master node instance from the AWS console
  4. Watch the CLB/NLB console screens to see the worker node instances get removed/drained when a new master node joins the cluster (a CLI sketch of steps 3–4 follows below)
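
A rough CLI equivalent of steps 3 and 4, assuming the control-plane node carries the node-role.kubernetes.io/master label (adjust the selector if the masters are labelled differently) and with the target group ARN left as a placeholder:

# DNS names of the load balancers that were created for the two Services
kubectl get svc echoserver-clb -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
kubectl get svc echoserver-nlb -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

# EC2 instance id of the master, taken from the node's providerID (aws:///<az>/<instance-id>)
MASTER_ID=$(kubectl get nodes -l node-role.kubernetes.io/master \
  -o jsonpath='{.items[0].spec.providerID}' | awk -F/ '{print $NF}')

# step 3: terminate the master from the CLI instead of the console
aws ec2 terminate-instances --instance-ids "$MASTER_ID"

# step 4: poll the NLB target group every 10s and watch targets go draining, disappear, then return as initial/healthy
watch -n 10 aws elbv2 describe-target-health --target-group-arn <target-group-arn>

The CLB side can be polled the same way with aws elb describe-instance-health.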

Anything else we need to know?:

  • I've only tested it on AWS
  • Restarting the master node from the command line (shutdown -r now) does not trigger a load balancer drain
  • I tried it with a 3-master HA configuration and found the same behavior when all master node instances were terminated simultaneously
  • I saw the same behavior with Kubernetes 1.20.X and 1.21.X (installed with kops-1.20.0-beta.1 and kops-1.21.0-alpha.1 respectively)
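
The restart-vs-terminate comparison from the notes above, as commands; the SSH user and instance id are placeholders, and the API-level reboot is an extra case that was not tested here:

# in-instance restart of the master: did NOT trigger any de-registration
ssh <ssh-user>@<master-public-ip> 'sudo shutdown -r now'

# API-level reboot (untested, included only for comparison)
aws ec2 reboot-instances --instance-ids <master-instance-id>

# termination: triggers the de-registration described in this issue
aws ec2 terminate-instances --instance-ids <master-instance-id>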

Environment:

  • Kubernetes version (use kubectl version): 1.19.9
  • Cloud provider or hardware configuration: AWS
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04.2 LTS (Focal Fossa)
  • Kernel (e.g. uname -a): Linux ip-172-20-38-84 5.4.0-1039-aws #41-Ubuntu SMP Wed Feb 24 23:13:36 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kops
  • Network plugin and version (if this is a network-related bug): kubenet
  • Others:
@amorey added the kind/bug label on Apr 2, 2021
@k8s-ci-robot added the needs-sig and needs-triage labels on Apr 2, 2021
@amorey changed the title from "Termination + start of new master node triggers load balancer drain on AWS (CLB and NLB)" to "Termination + start of new master de-registers nodes from AWS ELB (both CLB and NLB)" on Apr 2, 2021
@neolit123 (Member) commented:

/area provider/aws
/sig cloud-provider

@k8s-ci-robot added the area/provider/aws and sig/cloud-provider labels and removed the needs-sig label on Apr 4, 2021
@amorey (Author) commented Apr 14, 2021:

Are there any updates on this? It would be helpful to know what the triage timeline usually is. As a supplement, here are screenshots of the steps to reproduce the problem (the side-by-side browser windows are the EC2, ELB, and Target Group consoles):

Step 1: Healthy 3-node cluster
[screenshot: step-1]

Step 2: Terminate the master node instance
[screenshot: step-2]

Step 3: A new master node instance is initialized automatically
[screenshot: step-3]

Step 4: When the new master node joins the cluster, instances are de-registered from the CLB and NLB
[screenshot: step-4]

Step 5: Instances are re-initialized in the CLB and NLB
[screenshot: step-5]

Step 6: Healthy 3-node cluster
[screenshot: step-6]
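
The same sequence can be followed from a terminal by watching the node objects and the Service events while steps 2 through 5 play out; a small sketch:

# watch the old master go NotReady/disappear and the replacement register
kubectl get nodes -o wide --watch

# watch the events recorded against the two Services while their load balancers are re-synced
kubectl get events --watch --field-selector involvedObject.name=echoserver-clb
kubectl get events --watch --field-selector involvedObject.name=echoserver-nlb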

@cheftako (Member) commented:

/assign @kishorj
/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Jun 23, 2021
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Sep 21, 2021
@kishorj (Contributor) commented Sep 21, 2021:

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Sep 21, 2021
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 20, 2021
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 19, 2022
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor) commented:

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@amorey (Author) commented Feb 19, 2022:

/remove-lifecycle stale
