Termination + start of new master de-registers nodes from AWS ELB (both CLB and NLB) #100779

Closed
amorey opened this issue Apr 2, 2021 · 10 comments
Assignees: kishorj
Labels: area/provider/aws, kind/bug, lifecycle/rotten, sig/cloud-provider, triage/accepted

Comments

@amorey commented Apr 2, 2021:

What happened:

After a master node instance is terminated and a new master node joins the cluster, the worker node instances behind AWS ELBs are de-registered from the load balancers. With CLBs, the instances are de-registered and then registered again; with NLBs, the targets are drained and then re-initialized.
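
The same registration state is visible from the AWS CLI; a minimal sketch for anyone reproducing this, where the CLB name and target group ARN are placeholders for the resources the cloud provider generates for the Services below:

# CLB: list the registered instances and their InService/OutOfService state (name is a placeholder)
aws elb describe-instance-health --load-balancer-name <clb-name>

# NLB: list the registered targets and their state (initial/healthy/draining); the ARN is a placeholder
aws elbv2 describe-target-health --target-group-arn <target-group-arn>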

What you expected to happen:

I expected the new master node to join the cluster gracefully without causing the load balancer to de-register instances.

How to reproduce it (as minimally and precisely as possible):

  1. Set up a 3-node cluster using kops (e.g. kops create cluster --zones=us-east-1a,us-east-1b)
  2. Create a classic load balancer and a network load balancer service:
kind: Deployment
apiVersion: apps/v1
metadata:
  name: echoserver
  labels:
    app: echoserver
spec:
  replicas: 2
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: echoserver
  template:
    metadata:
      labels:
        app: echoserver
    spec:
      containers:
        - image: k8s.gcr.io/echoserver:1.10
          name: echoserver
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: 4Mi
            limits:
              memory: 20Mi
          imagePullPolicy: Always
---
kind: Service
apiVersion: v1
metadata:
  name: echoserver-nlb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 80
      targetPort: 8080
  selector:
    app: echoserver
---
kind: Service
apiVersion: v1
metadata:
  name: echoserver-clb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "elb"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"                
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"                      
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "60"                        
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"                       
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"                     
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "5"                                 
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 80
      targetPort: 8080
  selector:
    app: echoserver
  3. Terminate the master node instance from the AWS console
  4. Watch the CLB/NLB console screens to see the worker node instances get removed/drained when a new master node joins the cluster (a CLI sketch of steps 3–4 follows below)
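
A rough CLI equivalent of steps 3 and 4, assuming the control-plane node carries the node-role.kubernetes.io/master label (adjust the selector if the masters are labelled differently) and with the target group ARN left as a placeholder:

# DNS names of the load balancers that were created for the two Services
kubectl get svc echoserver-clb -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
kubectl get svc echoserver-nlb -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

# EC2 instance id of the master, taken from the node's providerID (aws:///<az>/<instance-id>)
MASTER_ID=$(kubectl get nodes -l node-role.kubernetes.io/master \
  -o jsonpath='{.items[0].spec.providerID}' | awk -F/ '{print $NF}')

# step 3: terminate the master from the CLI instead of the console
aws ec2 terminate-instances --instance-ids "$MASTER_ID"

# step 4: poll the NLB target group every 10s and watch targets go draining, disappear, then return as initial/healthy
watch -n 10 aws elbv2 describe-target-health --target-group-arn <target-group-arn>

The CLB side can be polled the same way with aws elb describe-instance-health.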

Anything else we need to know?:

  • I've only tested it on AWS
  • Restarting the master node from the command line (shutdown -r now) does not trigger a load balancer drain
  • I tried it with a 3-master HA configuration and found the same behavior when all master node instances were terminated simultaneously
  • I saw the same behavior with Kubernetes 1.20.X and 1.21.X (installed with kops-1.20.0-beta.1 and kops-1.21.0-alpha.1 respectively)
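
The restart-vs-terminate comparison from the notes above, as commands; the SSH user and instance id are placeholders, and the API-level reboot is an extra case that was not tested here:

# in-instance restart of the master: did NOT trigger any de-registration
ssh <ssh-user>@<master-public-ip> 'sudo shutdown -r now'

# API-level reboot (untested, included only for comparison)
aws ec2 reboot-instances --instance-ids <master-instance-id>

# termination: triggers the de-registration described in this issue
aws ec2 terminate-instances --instance-ids <master-instance-id>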

Environment:

  • Kubernetes version (use kubectl version): 1.19.9
  • Cloud provider or hardware configuration: AWS
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04.2 LTS (Focal Fossa)
  • Kernel (e.g. uname -a): Linux ip-172-20-38-84 5.4.0-1039-aws #41-Ubuntu SMP Wed Feb 24 23:13:36 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kops
  • Network plugin and version (if this is a network-related bug): kubenet
  • Others:
@amorey added the kind/bug label on Apr 2, 2021
@k8s-ci-robot added the needs-sig and needs-triage labels on Apr 2, 2021
@amorey changed the title from "Termination + start of new master node triggers load balancer drain on AWS (CLB and NLB)" to "Termination + start of new master de-registers nodes from AWS ELB (both CLB and NLB)" on Apr 2, 2021
@neolit123 (Member) commented:

/area provider/aws
/sig cloud-provider

@k8s-ci-robot added the area/provider/aws and sig/cloud-provider labels and removed the needs-sig label on Apr 4, 2021
@amorey (Author) commented Apr 14, 2021:

Are there any updates on this? It would be helpful to know what the triage timeline usually is. As a supplement, here are screenshots of the steps to reproduce the problem (the side-by-side browser windows are the EC2, ELB, and Target Group consoles):

Step 1: Healthy 3-node cluster
[screenshot: step-1]

Step 2: Terminate the master node instance
[screenshot: step-2]

Step 3: A new master node instance is initialized automatically
[screenshot: step-3]

Step 4: When the new master node joins the cluster, instances are de-registered from the CLB and NLB
[screenshot: step-4]

Step 5: Instances are re-initialized in the CLB and NLB
[screenshot: step-5]

Step 6: Healthy 3-node cluster
[screenshot: step-6]
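
The same sequence can be followed from a terminal by watching the node objects and the Service events while steps 2 through 5 play out; a small sketch:

# watch the old master go NotReady/disappear and the replacement register
kubectl get nodes -o wide --watch

# watch the events recorded against the two Services while their load balancers are re-synced
kubectl get events --watch --field-selector involvedObject.name=echoserver-clb
kubectl get events --watch --field-selector involvedObject.name=echoserver-nlb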

@cheftako (Member) commented:

/assign @kishorj
/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Jun 23, 2021
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Sep 21, 2021
@kishorj (Contributor) commented Sep 21, 2021:

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Sep 21, 2021
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 20, 2021
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 19, 2022
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor) commented:

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@amorey (Author) commented Feb 19, 2022:

/remove-lifecycle stale
