
Cloud load-balancers should have health checks for nodes #14661

Closed
thockin opened this issue Sep 28, 2015 · 37 comments
Labels
area/cloudprovider · lifecycle/frozen (Indicates that an issue or PR should not be auto-closed due to staleness.) · priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.) · sig/network (Categorizes an issue or PR as relevant to SIG Network.)

Comments

@thockin
Member

thockin commented Sep 28, 2015

Given the state of cloud-LB today, most (GCE, AWS, Openstack) LB implementations target nodes indiscriminately. We should ensure that the cloud load-balancer is only targeting healthy nodes.

We should health-check the nodes' kubelet or kube-proxy, or add a new "do nothing but answer node health" daemon.

@thockin thockin added priority/important-soon and team/cluster labels Sep 28, 2015
@thockin thockin added this to the v1.2-candidate milestone Sep 28, 2015
@bprashanth
Contributor

We should health-check the nodes' kubelet or kube-proxy, or add a new "do nothing but answer node health" daemon.

Nodes will already flip to unhealthy after ~40s of kubelet silence, or immediately when something like a Docker death is observed by the kubelet. You can probably fix it by swapping the node lister: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/service/servicecontroller.go#L77 with the conditional node lister https://github.com/kubernetes/kubernetes/blob/master/pkg/client/cache/listers.go#L119, or by diffing them.
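
For illustration, a minimal sketch of that filtering idea, assuming the current v1.Node API types (this is not the actual service controller code):

```go
package nodefilter

import (
	v1 "k8s.io/api/core/v1"
)

// readyNodes keeps only nodes whose NodeReady condition is True, which is
// roughly what a "conditional" node lister would hand the service controller
// before it syncs the cloud LB's target instance group.
func readyNodes(nodes []*v1.Node) []*v1.Node {
	var out []*v1.Node
	for _, n := range nodes {
		for _, cond := range n.Status.Conditions {
			if cond.Type == v1.NodeReady && cond.Status == v1.ConditionTrue {
				out = append(out, n)
				break
			}
		}
	}
	return out
}
```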

@bprashanth
Contributor

I mean, swap those and remove unhealthy nodes from the target instance group

@thockin
Member Author

thockin commented Sep 28, 2015

That's a lot of propagation. 40s is an eternity.

@bprashanth
Contributor

I mean, your daemon could die too, so you'll need a timeout. I'm avoiding the GCE health checks suggestion because I'd rather do this in kube and have it work cross-platform.

@roberthbailey
Contributor

Health-checking the kube-proxy seems like the best idea to me. Port 10249 is currently used for both healthz and pprof, and we might not want pprof data exposed to the internet. But it wouldn't be difficult to make healthz bind to 0.0.0.0 by default (even if we moved pprof to a different port) and then allow 10249 to be opened to health checkers as necessary.

I don't think we should be doing our own health checking and manually modifying the target instance group. That code would be GCE-specific, and it seems like it would just re-implement what the GCP cloud health checker provides for free, as long as you can expose an HTTP endpoint that is accessible to 130.211.0.0/22.
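
To make that concrete, a rough sketch of the GCE-side setup using the Go compute API client might look like the following. The project, network, and resource names are placeholders, 10249 is the healthz port mentioned above, and this only illustrates what the cloud health checker needs, not what kube itself does:

```go
package main

import (
	"context"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	svc, err := compute.NewService(ctx) // uses Application Default Credentials
	if err != nil {
		log.Fatal(err)
	}

	const project = "my-project" // placeholder

	// HTTP health check against the kube-proxy healthz endpoint on each node.
	hc := &compute.HttpHealthCheck{
		Name:               "k8s-node-healthz", // placeholder name
		Port:               10249,              // kube-proxy healthz port discussed above
		RequestPath:        "/healthz",
		CheckIntervalSec:   2,
		TimeoutSec:         1,
		HealthyThreshold:   1,
		UnhealthyThreshold: 3,
	}
	if _, err := svc.HttpHealthChecks.Insert(project, hc).Do(); err != nil {
		log.Fatal(err)
	}

	// Open the healthz port to GCP's health-checking source range only.
	fw := &compute.Firewall{
		Name:         "allow-gce-healthcheckers", // placeholder name
		Network:      "global/networks/default",  // placeholder network
		SourceRanges: []string{"130.211.0.0/22"},
		Allowed: []*compute.FirewallAllowed{
			{IPProtocol: "tcp", Ports: []string{"10249"}},
		},
	}
	if _, err := svc.Firewalls.Insert(project, fw).Do(); err != nil {
		log.Fatal(err)
	}
}
```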

@bgrant0607
Member

See #8673

@thockin
Member Author

thockin commented Feb 14, 2016

I'm going to close the dup (I dup'ed myself!). This one has more info, so keeping it.

@bprashanth
Contributor

Why health check nodes instead of going straight to the network endpoint? If we do that, isn't the 40s eternity enough for node health checks?

@thockin
Member Author

thockin commented Feb 14, 2016

Today we load-balance to nodes. If a node dies, I want my cloud LB targeting a different node ASAP.

@bprashanth
Contributor

Yes, we can easily implement a short-term solution where the service/ingress controller runs a goroutine that just polls health check daemons on the nodes. "Bouncy" endpoints make that harder because kube-proxy will still think the endpoint is ready. Even if we don't go to the extent of reporting utilization information per request, I think we should come up with something that re-uses an existing health check idiom (either nodecontroller health or liveness/readiness) in the long run.

The risk of having some sort of instantaneous and binary feedback loop is oscillation or flapping. The "right" way to solve this, IMO, is to use backend weights.
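
For what it's worth, the usual way to dampen that flapping short of backend weights is consecutive-observation thresholds (the same knobs cloud health checkers expose as healthy/unhealthy thresholds). A minimal illustrative sketch, with made-up names:

```go
package healthsketch

// healthTracker only flips state after N consecutive observations in the
// other direction, so a single bounce does not eject and re-add a backend.
type healthTracker struct {
	healthy            bool
	consecutiveFails   int
	consecutiveOKs     int
	unhealthyThreshold int // e.g. 3 failed probes before ejecting a backend
	healthyThreshold   int // e.g. 2 good probes before re-adding it
}

func (t *healthTracker) observe(ok bool) {
	if ok {
		t.consecutiveOKs++
		t.consecutiveFails = 0
		if !t.healthy && t.consecutiveOKs >= t.healthyThreshold {
			t.healthy = true
		}
	} else {
		t.consecutiveFails++
		t.consecutiveOKs = 0
		if t.healthy && t.consecutiveFails >= t.unhealthyThreshold {
			t.healthy = false
		}
	}
}
```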

@glerchundi

@bprashanth @thockin what is the status of this issue? I would like to add GCE LB health checking. Did you solve it by making the kubelet port available, or some other way?

@thockin thockin closed this as completed Oct 19, 2016
@thockin
Member Author

thockin commented Oct 19, 2016

We're actually using cloud healthchecks for something different now, so we should close this.

@glerchundi

OK @thockin, can you point me to the documentation/issue/whatever to follow for the new direction? I'd appreciate any hint that helps me dig into this topic.

Thanks

@thockin
Member Author

thockin commented Oct 19, 2016

#29409

@glerchundi

thanks!

@bprashanth
Contributor

This is still an issue, of course: if a node disappears due to a network partition, we will take ~40s to kill its endpoints. If we had a health check endpoint, it would take O(5s). There are different ways to tackle it.

One way is to make kube-proxy health checking much smarter (i.e. use IPVS). We need to confirm that IPVS will auto-failover if it notices that, e.g., a SYN got blackholed once. I wasn't able to confirm this with some preliminary tinkering: #30134 (comment)

Another is to make our own health checking smarter. The problem right now is that the kubelet is responsible for reporting both endpoint health and node status. We could come up with a system that reports endpoint health much more frequently and doesn't actually run on the node: #28442

And yet another is to do what this issue actually proposes and add a secondary health check endpoint to nodes. Or find some clever way to leverage the same "healthcheck-nodeport" logic in a way that works for both "OnlyLocal" and "Global" services.

@thockin
Member Author

thockin commented Oct 25, 2016

One way is to make kube-proxy health checking much smarter (i.e. use IPVS). We need to confirm that IPVS will auto-failover if it notices that, e.g., a SYN got blackholed once. I wasn't able to confirm this with some preliminary tinkering: #30134 (comment)

I convinced myself that it DOES NOT do what we want. It does exactly what we do today.

@thockin
Member Author

thockin commented Jan 10, 2017

Re-opening. I do think we should have a node-level healthcheck for LBs that are not using the OnlyLocal annotation.

@thockin thockin reopened this Jan 10, 2017
@thockin thockin added sig/network and area/platform/gce labels Jan 10, 2017
k8s-github-robot pushed a commit that referenced this issue May 27, 2017
Automatic merge from submit-queue (batch tested with PRs 46252, 45524, 46236, 46277, 46522)

Make GCE load-balancers create health checks for nodes

From #14661. Proposal on kubernetes/community#552. Fixes #46313.

Bullet points:
- Create a nodes health check and firewall rule (for health checking) for non-OnlyLocal services.
- Create a local-traffic health check and firewall rule (for health checking) for OnlyLocal services.
- Version skew:
   - Don't create the nodes health check if any node has a version < 1.7.0.
   - Don't backfill the nodes health check on existing LBs unless users explicitly trigger it.

**Release note**:

```release-note
GCE Cloud Provider: Newly created LoadBalancer-type Services now have health checks for nodes by default.
An existing LoadBalancer will have a health check attached to it when:
- Service.Spec.Type is changed from LoadBalancer to another type and flipped back.
- There is any effective change to Service.Spec.ExternalTrafficPolicy.
```
@MrHohn
Member

MrHohn commented Nov 30, 2017

Unassigning as the GCE part was done.
/unassign
/remove-area platform/gce
/area cloudprovider

@thockin
Member Author

thockin commented Nov 30, 2017 via email

@MrHohn
Member

MrHohn commented Nov 30, 2017

Can you update what you think still needs to be done, and who should be doing it?

Sure, the real work for each cloud provider is to attach (or configure) a health check pointing at node:10256/healthz (kube-proxy's healthz port is 10256 by default) while provisioning load balancers.
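
In other words, the per-node check each provider wires up is just an HTTP GET against kube-proxy's healthz endpoint. A minimal sketch of what the LB's health checker effectively does (the address, timeout, and function name are illustrative):

```go
package probe

import (
	"fmt"
	"net/http"
	"time"
)

// nodeHealthy mimics the cloud LB health check: GET http://<node>:10256/healthz
// and treat a 200 response as "this node may receive load-balanced traffic".
func nodeHealthy(nodeIP string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s:10256/healthz", nodeIP))
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}
```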

I'd expect in-tree cloud provider owners to follow up on this. Looping in the ones I found in OWNERS files (omitting those that don't have a real load balancer implementation).
@justinsb for AWS
@colemickens for Azure
@FengyunPan for Openstack

@MrHohn
Member

MrHohn commented Nov 30, 2017

@ngtuna for CloudStack

@jhorwit2
Contributor

jhorwit2 commented Dec 1, 2017

@MrHohn This requires at least v1.7.2, correct?

@MrHohn
Member

MrHohn commented Dec 1, 2017

MrHohn This requires at least v1.7.2, correct?

@jhorwit2 Thanks for mentioning, that is correct. v1.7.2 is the earliest k8s version in which kube-proxy properly serves the healthz port. We should probably make it a global const somewhere.

@thockin
Member Author

thockin commented Dec 1, 2017 via email

@colemickens
Contributor

I think they're sort of the same thing: the SC would just be calling into the cloud provider and asking for this health check, or it would be part of the normal behavior of the SC calling the CP to create/update a load balancer.

It seems like it would be helpful if Kubernetes were prescriptive here, one way or the other, and that issues per-cloud-provider were opened to track them adding this health check.

Questions:

  1. Are there backward compatibility concerns? Based on how I'd implement this in Azure on a first-pass, it would retroactively add health checks which seems potentially risky.

  2. How is the cloudprovider supposed to know what port kubelet is listening on for /healthz? I don't think it's appropriate to hardcode the default port. This will lead to broken configurations for anyone that chooses to change the default port (maybe for security, not sure what all's exposed on /healthz...)

@MrHohn
Member

MrHohn commented Dec 11, 2017

I think they're sort of the same thing, the SC would just be calling into the cloudprovider and asking for this health check, or it would be part of the normal behavior of the SC calling the CP to create/update a load balancer.

I'd prefer the latter; making "ensure health check" another interface may place more constraints on cloud provider implementations --- given that how a health check is attached to an LB can vary quite a bit between providers and could be coupled with LB management.

  1. Are there backward compatibility concerns? Based on how I'd implement this in Azure on a first-pass, it would retroactively add health checks which seems potentially risky.

Adding health checks retroactively seems risky if attaching a health check is itself service-disruptive. Another concern is mixed node versions --- some nodes respond to the health check while others (on older versions) don't.

  2. How is the cloudprovider supposed to know what port kubelet is listening on for /healthz? I don't think it's appropriate to hardcode the default port. This will lead to broken configurations for anyone that chooses to change the default port (maybe for security, not sure what all's exposed on /healthz...)

To clarify, this "/healthz" is served by kube-proxy but not kubelet. The proxy healthz port is defined in the ports package:

ProxyHealthzPort = 10256

If someone chooses to use a different healthz port for kube-proxy (via flag or config), they will need to configure the same port for the service controller (or cloud controller manager).

@colemickens
Contributor

Adding health checks retroactively seems risky if attaching a health check is itself service-disruptive. Another concern is mixed node versions --- some nodes respond to the health check while others (on older versions) don't.

I don't recall the stance on version skew, but if it has been there for at least a few revisions, then I'm not worried about the presence of /healthz presuming this feature wouldn't be backported. I hadn't considered that adding the health-check would be disruptive. Do we know if this is really the case in any of the supported clouds?

To clarify, this "/healthz" is served by kube-proxy but not kubelet. The proxy healthz port is defined in the ports package:

Got it, thanks.

If someone chooses to use a different healthz port for kube-proxy (via flag or config), they will need to configure the same port for the service controller (or cloud controller manager).

Hm, to my knowledge, most of the cloud-providers don't really even take configuration today (or if they do, it's inferred configuration discovered via their metadata servers).

I guess it wouldn't be the worst thing in the world if it defaulted to the default kube-proxy port and was overridable inside the cloud provider, but there's still going to be the case where:

  1. User is running with non-default healthz port.
  2. User upgrades cluster, switches to CCM.
  3. None of their (new) load-balancers work because they're now health checking a path on the wrong port.

The user doesn't really know why any of this is happening until they find this issue, find the corresponding PR for their cloud provider and find where the new flag/config field is.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Mar 12, 2018
@thockin
Member Author

thockin commented Mar 20, 2018

/lifecycle frozen
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen and removed lifecycle/stale labels Mar 20, 2018
@alok87
Contributor

alok87 commented May 15, 2018

@justinsb Do we have health checks for AWS ELBs? We want to keep adding/removing nodes in the ELB based on health. What is the right way to do it?

We are not using a k8s-provisioned ELB; it is an ELB we created ourselves. We want to keep only healthy Kubernetes nodes under it.

@jhorwit2
Contributor

@alok87 You'd want to emulate what Kubernetes does in that scenario: either use the kube-proxy healthz endpoint (port 10256 by default), or use the health check node port on the service if its traffic policy is Local.
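
A small sketch of that port-selection rule, assuming the v1 Service API fields (illustrative only, not code from any cloud provider):

```go
package lbhealth

import v1 "k8s.io/api/core/v1"

// healthCheckPort picks the node port an external LB should probe: the
// per-service health check node port for Local-traffic services, otherwise
// kube-proxy's cluster-wide healthz port (10256, assuming the default was
// not overridden).
func healthCheckPort(svc *v1.Service) int32 {
	if svc.Spec.ExternalTrafficPolicy == v1.ServiceExternalTrafficPolicyTypeLocal &&
		svc.Spec.HealthCheckNodePort != 0 {
		return svc.Spec.HealthCheckNodePort
	}
	return 10256
}
```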

@alok87
Contributor

alok87 commented May 15, 2018

@jhorwit2 What exactly does Kubernetes do? On what basis does it mark a node healthy or unhealthy? We have two system pods - weave and kube-proxy. We want to check that both are good, and also that node_ip:port is working, before marking a node healthy and attaching it to the LB. Do I need to write a custom controller to do this?

@krmayankk

Curious, what is remaining here? Also, do all health checks (for HTTPS, network, and container-native) go to kube-proxy's port 10256? Why do they not go to the kubelet?

@thockin
Member Author

thockin commented Feb 20, 2019

kube-proxy exposes "healthiness" as "was able to update the heartbeat recently". We may want to go further with "readiness" vs "liveness" (I have an idea brewing), but it's unlikely that it will be totally user-defined. Instead, I am focusing on things like node schedulability. If the node is unschedulable, it should probably not be considered for new LB traffic.

That would allow users to define arbitrary rules for what makes a node unschedulable, which is something we want anyway.
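
As a sketch, that node-eligibility rule might look like this (illustrative only; it combines readiness with schedulability):

```go
package lbhealth

import v1 "k8s.io/api/core/v1"

// eligibleForLB: a node should receive new LB traffic only if it is Ready
// and not cordoned (marked unschedulable).
func eligibleForLB(node *v1.Node) bool {
	if node.Spec.Unschedulable {
		return false
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady {
			return cond.Status == v1.ConditionTrue
		}
	}
	return false
}
```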

Closing this. Thanks for the reminder :)
