
running "kubectl drain" no longer removes machine instances from ELBs #10774

Closed
amorey opened this issue Feb 9, 2021 · 13 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@amorey

amorey commented Feb 9, 2021

1. What kops version are you running? The command kops version will display this information.

1.19.0

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.19.7

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kubectl drain

5. What happened after the commands executed?

Previously, using kops/k8s 1.18.2/1.18.10 running kubectl drain would remove machine instances from classic ELBs. Now running kubectl drain does not do so.

6. What did you expect to happen?

I expected running kubectl drain to remove the node machine instances from classic ELBs. I was also hoping that running kubectl drain would remove instances from network ELBs but that isn't happening either (with both kops/k8s 1.18.2/1.18.10 and 1.19.0/1.19.7).

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2021-02-01T13:55:38Z"
  name: <REDACTED>
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: <REDACTED>
  containerRuntime: docker
  dnsZone: <REDACTED>
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: <REDACTED>
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: <REDACTED>
      name: a
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    <REDACTED>
  kubernetesApiAccess:
  - <REDACTED>
  kubernetesVersion: 1.19.7
  masterPublicName: <REDACTED>
  networkCIDR: 10.0.0.0/16
  networkID: <REDACTED>
  networking:
    flannel:
      backend: vxlan
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - <REDACTED>
  subnets:
  - cidr: 10.0.32.0/19
    id: <REDACTED>
    name: <REDACTED>
    type: Public
    zone: <REDACTED>
  - cidr: 10.0.64.0/19
    id: <REDACTED>
    name: <REDACTED>
    type: Public
    zone: <REDACTED>
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-02-01T13:55:39Z"
  labels:
    kops.k8s.io/cluster: <REDACTED>
  name: <REDACTED>
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: <REDACTED>
  role: Master
  subnets:
  - <REDACTED>

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-02-01T13:55:39Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: <REDACTED>
  name: <REDACTED>
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/<REDACTED>: "true"
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  machineType: t3.medium
  maxSize: 10
  minSize: 2
  mixedInstancesPolicy:
    instances:
    - t3.medium
    - t3a.medium
    onDemandAboveBase: 0
    onDemandBase: 0
    spotInstancePools: 2
  nodeLabels:
    kops.k8s.io/instancegroup: <REDACTED>
  role: Node
  rootVolumeSize: 30
  subnets:
  - <REDACTED>

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-02-01T13:55:39Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: <REDACTED>
  name: <REDACTED>
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/<REDACTED>: "true"
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  machineType: t3.medium
  maxSize: 10
  minSize: 2
  mixedInstancesPolicy:
    instances:
    - t3.medium
    - t3a.medium
    onDemandAboveBase: 0
    onDemandBase: 0
    spotInstancePools: 2
  nodeLabels:
    kops.k8s.io/instancegroup: <REDACTED>
  role: Node
  rootVolumeSize: 30
  subnets:
  - <REDACTED>

8. Please run the commands with the most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or into a gist and provide the gist link here.

9. Anything else we need to know?

@olemarkus
Member

If I understand you correctly, you mean that an ELB provisioned through a Service does not target drained nodes?
That should not be the case. An ELB should target ALL nodes, and kube-proxy then forwards traffic to the correct node.

Furthermore, kOps has not changed anything that would influence the behavior of commands like kubectl.
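
For reference, the "target ALL nodes" behavior above corresponds to the Service's default externalTrafficPolicy: Cluster. A minimal sketch of such a Service (the name, selector, and ports are illustrative assumptions, not taken from the reporter's cluster):

apiVersion: v1
kind: Service
metadata:
  name: example-web                # hypothetical name
spec:
  type: LoadBalancer               # on AWS with the in-tree cloud provider this provisions a classic ELB by default
  externalTrafficPolicy: Cluster   # default: every node is registered with the ELB; kube-proxy forwards to a backing pod
  selector:
    app: example-web               # hypothetical label
  ports:
  - port: 80
    targetPort: 8080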

@amorey
Author

amorey commented Feb 9, 2021

Thanks for your quick reply. Just to clarify - the behavior I'm seeing is that an ELB provisioned through a Service targets ALL nodes; however, this wasn't the case previously (with kops/k8s 1.18.2/1.18.10). Previously, draining a node would also remove it from the ELB targets.

OK, it makes sense that this change isn't related to kops. Given this behavior, how do you recommend updating instances gracefully with kops? As far as I can tell, running kops rolling-update cluster drains each node and then issues a shutdown command, which kills the connections being proxied by that instance.

@olemarkus
Member

If the problem is that proxied connections are killed, there are two solutions that I am aware of:
a) Use an NLB with the local traffic policy. In this case, only nodes with the backing pods on them will pass the NLB health check (a minimal sketch follows after this list). See e.g. https://aws.amazon.com/blogs/opensource/network-load-balancer-support-in-kubernetes-1-9/
b) Have a look at https://cilium.io/blog/2020/11/10/cilium-19#maglev. This feature is not available in kOps, but I think it could be; I haven't looked much into what that would take.
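
For option (a), a minimal sketch of a Service that asks the AWS cloud provider for an NLB and enables the local traffic policy (the name, selector, and ports are again illustrative assumptions):

apiVersion: v1
kind: Service
metadata:
  name: example-web                                          # hypothetical name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb   # provision an NLB instead of a classic ELB
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local     # only nodes running a backing pod pass the NLB health check and receive traffic
  selector:
    app: example-web               # hypothetical label
  ports:
  - port: 80
    targetPort: 8080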

@amorey
Author

amorey commented Feb 24, 2021

Thanks for your suggestions! I went down a cilium/eBPF/maglev rabbit hole and found it very interesting. Now I think I have a better handle on the issue I described earlier, but while testing out NLBs I ran into a bigger issue. In my setup, when the master node's machine instance restarts, the NLB target instances all get deregistered simultaneously and then re-registered, which causes a long outage. I realize this isn't a kops issue, but would you happen to know if this is expected behavior and, if so, how to prevent it?

@olemarkus
Member

A master instance restart should not cause the entire NLB target group to re-register. Is that something you see happening consistently?

@amorey
Author

amorey commented Feb 24, 2021

Yes, consistently. I tried it about five times using different versions of K8s (1.19, 1.20, and 1.21) and saw it happen each time. I noticed it first when running kops rolling-update cluster and then narrowed it down to an issue with the master node instance coming back online. I haven't tried it with an HA master setup.

@olemarkus
Member

Managing those things is handled by the control plane. I don't think it would actually de-register nodes just because ... but maybe there is a timeout or something that causes this. Do you have the chance to test an HA cluster as well?

@amorey
Author

amorey commented Feb 25, 2021

I just tested it on a 3-master HA cluster and found that terminating 1/3 or 2/3 master instances simultaneously did not de-register the nodes but terminating 3/3 simultaneously did.

@amorey
Author

amorey commented Apr 2, 2021

Kubernetes bug report for reference: kubernetes/kubernetes#100779

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Jul 1, 2021
@k8s-triage-robot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label on Jul 31, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
