
running "kubectl drain" no longer removes machine instances from ELBs #10774

Closed
amorey opened this issue Feb 9, 2021 · 13 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@amorey

amorey commented Feb 9, 2021

1. What kops version are you running? The command kops version will display this information.

1.19.0

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.19.7

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kubectl drain

5. What happened after the commands executed?

Previously, using kops/k8s 1.18.2/1.18.10 running kubectl drain would remove machine instances from classic ELBs. Now running kubectl drain does not do so.

6. What did you expect to happen?

I expected running kubectl drain to remove the node machine instances from classic ELBs. I was also hoping that running kubectl drain would remove instances from network ELBs but that isn't happening either (with both kops/k8s 1.18.2/1.18.10 and 1.19.0/1.19.7).

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2021-02-01T13:55:38Z"
  name: <REDACTED>
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: <REDACTED>
  containerRuntime: docker
  dnsZone: <REDACTED>
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: <REDACTED>
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: <REDACTED>
      name: a
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    <REDACTED>
  kubernetesApiAccess:
  - <REDACTED>
  kubernetesVersion: 1.19.7
  masterPublicName: <REDACTED>
  networkCIDR: 10.0.0.0/16
  networkID: <REDACTED>
  networking:
    flannel:
      backend: vxlan
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - <REDACTED>
  subnets:
  - cidr: 10.0.32.0/19
    id: <REDACTED>
    name: <REDACTED>
    type: Public
    zone: <REDACTED>
  - cidr: 10.0.64.0/19
    id: <REDACTED>
    name: <REDACTED>
    type: Public
    zone: <REDACTED>
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-02-01T13:55:39Z"
  labels:
    kops.k8s.io/cluster: <REDACTED>
  name: <REDACTED>
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: <REDACTED>
  role: Master
  subnets:
  - <REDACTED>

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-02-01T13:55:39Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: <REDACTED>
  name: <REDACTED>
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/<REDACTED>: "true"
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  machineType: t3.medium
  maxSize: 10
  minSize: 2
  mixedInstancesPolicy:
    instances:
    - t3.medium
    - t3a.medium
    onDemandAboveBase: 0
    onDemandBase: 0
    spotInstancePools: 2
  nodeLabels:
    kops.k8s.io/instancegroup: <REDACTED>
  role: Node
  rootVolumeSize: 30
  subnets:
  - <REDACTED>

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-02-01T13:55:39Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: <REDACTED>
  name: <REDACTED>
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/<REDACTED>: "true"
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  machineType: t3.medium
  maxSize: 10
  minSize: 2
  mixedInstancesPolicy:
    instances:
    - t3.medium
    - t3a.medium
    onDemandAboveBase: 0
    onDemandBase: 0
    spotInstancePools: 2
  nodeLabels:
    kops.k8s.io/instancegroup: <REDACTED>
  role: Node
  rootVolumeSize: 30
  subnets:
  - <REDACTED>

8. Please run the commands with the most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or into a gist and provide the gist link here.

9. Anything else we need to know?

@olemarkus
Member

If I understand you correctly, you mean that an ELB provisioned through a Service does not target drained nodes?
That should not be the case. An ELB should target ALL nodes, and kube-proxy then forwards traffic to the correct node.

Furthermore, kOps has not changed anything that would influence the behavior of commands like kubectl.
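
For reference, the "target ALL nodes" behavior above corresponds to the Service's default externalTrafficPolicy: Cluster. A minimal sketch of such a Service (the name, selector, and ports are illustrative assumptions, not taken from the reporter's cluster):

apiVersion: v1
kind: Service
metadata:
  name: example-web                # hypothetical name
spec:
  type: LoadBalancer               # on AWS with the in-tree cloud provider this provisions a classic ELB by default
  externalTrafficPolicy: Cluster   # default: every node is registered with the ELB; kube-proxy forwards to a backing pod
  selector:
    app: example-web               # hypothetical label
  ports:
  - port: 80
    targetPort: 8080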

@amorey
Author

amorey commented Feb 9, 2021

Thanks for your quick reply. Just to clarify - the behavior I'm seeing is that an ELB provisioned through a Service targets ALL nodes; however, this wasn't the case previously (with kops/k8s 1.18.2/1.18.10). Previously, draining a node would also remove it from the ELB targets.

OK, it makes sense that this change isn't related to kops. Given this behavior, how do you recommend updating instances gracefully with kops? As far as I can tell, running kops rolling-update cluster drains each node and then issues a shutdown command, which kills the connections being proxied by that instance.

@olemarkus
Member

If the problem is that proxied connections are killed, there are two solutions that I am aware of:
a) Use an NLB with the local traffic policy. In this case, only nodes with the backing pods on them will pass the NLB health check (a minimal sketch follows after this list). See e.g. https://aws.amazon.com/blogs/opensource/network-load-balancer-support-in-kubernetes-1-9/
b) Have a look at https://cilium.io/blog/2020/11/10/cilium-19#maglev. This feature is not available in kOps, but I think it could be; I haven't looked much into what that would take.
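
For option (a), a minimal sketch of a Service that asks the AWS cloud provider for an NLB and enables the local traffic policy (the name, selector, and ports are again illustrative assumptions):

apiVersion: v1
kind: Service
metadata:
  name: example-web                                          # hypothetical name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb   # provision an NLB instead of a classic ELB
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local     # only nodes running a backing pod pass the NLB health check and receive traffic
  selector:
    app: example-web               # hypothetical label
  ports:
  - port: 80
    targetPort: 8080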

@amorey
Author

amorey commented Feb 24, 2021

Thanks for your suggestions! I went down a cilium/eBPF/maglev rabbit hole and found it very interesting. Now I think I have a better handle on the issue I described earlier, but while testing out NLBs I ran into a bigger issue. In my setup, when the master node's machine instance restarts, the NLB target instances all get deregistered simultaneously and then re-registered, which causes a long outage. I realize this isn't a kops issue, but would you happen to know if this is expected behavior and, if so, how to prevent it?

@olemarkus
Member

A master instance restart should not cause the entire NLB target group to re-register. Is that something you see happening consistently?

@amorey
Author

amorey commented Feb 24, 2021

Yes, consistently. I tried it about five times using different versions of K8s (1.19, 1.20, and 1.21) and saw it happen each time. I noticed it first when running kops rolling-update cluster and then narrowed it down to an issue with the master node instance coming back online. I haven't tried it with an HA master setup.

@olemarkus
Member

Managing those things is handled by the control plane. I don't think it would actually de-register nodes just because ... but maybe there is a timeout or something that causes this. Do you have the chance to test an HA cluster as well?

@amorey
Author

amorey commented Feb 25, 2021

I just tested it on a 3-master HA cluster and found that terminating 1/3 or 2/3 master instances simultaneously did not de-register the nodes but terminating 3/3 simultaneously did.

@amorey
Author

amorey commented Apr 2, 2021

Kubernetes bug report for reference: kubernetes/kubernetes#100779

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Jul 1, 2021
@k8s-triage-robot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label on Jul 31, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
