
componentstatus fails when components are not running on the same host as apiserver #19570

Closed · stepanstipl opened this issue Jan 12, 2016 · 14 comments

@stepanstipl commented Jan 12, 2016

kubectl get componentstatus fails when the components are not running on the same host as the apiserver, for example when you run multiple API servers according to the HA guide (http://kubernetes.io/v1.1/docs/admin/high-availability.html).

The problem appears to be in kubernetes/pkg/master/master.go, where the code expects both the controller-manager and the scheduler to be reachable on 127.0.0.1:

func (m *Master) getServersToValidate(c *Config) map[string]apiserver.Server {
    serversToValidate := map[string]apiserver.Server{
        "controller-manager": {Addr: "127.0.0.1", Port: ports.ControllerManagerPort, Path: "/healthz"},
        "scheduler":          {Addr: "127.0.0.1", Port: ports.SchedulerPort, Path: "/healthz"},
    }
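
In effect the apiserver turns these entries into plain HTTP health checks against its own loopback interface, i.e. roughly the following (assuming the default ports, 10252 for the controller-manager and 10251 for the scheduler):

$ curl http://127.0.0.1:10252/healthz   # controller-manager
$ curl http://127.0.0.1:10251/healthz   # scheduler

Both of these are refused whenever the components actually run on a different host.
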
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"1", GitVersion:"v1.1.3+6a81b50", GitCommit:"6a81b50c7e97bbe0ade075de55ab4fa34f049dc2", GitTreeState:"not a git tree"}
Server Version: version.Info{Major:"1", Minor:"1", GitVersion:"v1.1.3", GitCommit:"6a81b50c7e97bbe0ade075de55ab4fa34f049dc2", GitTreeState:"clean"}
@lavalamp (Member) commented Jan 13, 2016

@karlkfi is reworking componentstatus; it's not obvious how to fix this assumption in the meantime.

@stepanstipl (Author) commented Jan 13, 2016

Thanks for the info. It's not a big deal; I just spent some time worried that my cluster wasn't working before I discovered this was the cause. I'd like to help, I'm just not sure how with this one.

A bit of info about my setup: I'm running 3 API servers on AWS behind an ELB and using the podmaster service to control which of them runs the scheduler and controller-manager, pretty much what's in the HA guide.

On my setup the check succeeds on one API server and fails on the other two, depending on which one the ELB routes the request to.
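
To make the flapping visible, you can bypass the ELB and point kubectl at each apiserver directly (the addresses below are just placeholders for my three masters); only the one currently co-located with the podmaster-elected scheduler and controller-manager reports them as Healthy:

$ kubectl --server=https://10.0.0.11:443 get componentstatuses   # Healthy
$ kubectl --server=https://10.0.0.12:443 get componentstatuses   # scheduler/controller-manager Unhealthy
$ kubectl --server=https://10.0.0.13:443 get componentstatuses   # scheduler/controller-manager Unhealthy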

@karlkfi (Contributor) commented Jan 13, 2016

for reference:

@gg7 commented Jul 17, 2016

componentstatus will also fail when the components are running on the same host as the apiserver but are not listening on the loopback interface.

The exact error, so this is easier to google:

NAME                 STATUS      MESSAGE                                                                                        ERROR
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: getsockopt: connection refused   
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: getsockopt: connection refused   
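
One way to confirm this case, run on the master node (192.0.2.10 stands in for the node's own address here):

$ ss -lnt | grep -E ':1025[12]'
# e.g. LISTEN ... 192.0.2.10:10251 ... (bound to the node address, not 127.0.0.1)
$ curl http://192.0.2.10:10251/healthz
ok

The components answer fine on the address they are actually bound to, while the apiserver's hardcoded loopback check keeps failing.
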
@kargakis (Member) commented Jun 11, 2017

/sig cluster-lifecycle

@robszumski (Contributor) commented Oct 6, 2017

Any update on this issue? We see this on all Tectonic clusters because they are self-hosted.

@mariusv commented Oct 25, 2017

I get the same issue even though everything works just fine: I can spin up new pods and delete them; it's just that kubectl get cs fails.

Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.7", GitCommit:"095136c3078ccf887b9034b7ce598a0a1faff769", GitTreeState:"clean", BuildDate:"2017-07-05T16:51:56Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.7+coreos.0", GitCommit:"c8c505ee26ac3ab4d1dff506c46bc5538bc66733", GitTreeState:"clean", BuildDate:"2017-07-06T17:38:33Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

Error:

thor:~ marius$ kubectl get cs
NAME                 STATUS      MESSAGE                                                                                                        ERROR
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: getsockopt: connection refused
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: getsockopt: connection refused
etcd-0               Healthy     {"health": "true"}
etcd-2               Healthy     {"health": "true"}
etcd-3               Healthy     {"health": "true"}
@janse180 commented Dec 23, 2017

I can confirm this issue is also present with a self-hosted bootkube install. The cluster is working as expected, but the health checks fail. This also causes Cilium pods to report they are not ready, since Cilium uses this health check to test the Kubernetes API.

 $ kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.5", GitCommit:"cce11c6a185279d037023e02ac5249e14daa22bf", GitTreeState:"clean", BuildDate:"2017-12-07T16:16:03Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.5", GitCommit:"cce11c6a185279d037023e02ac5249e14daa22bf", GitTreeState:"clean", BuildDate:"2017-12-07T16:05:18Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
$ kubectl get componentstatuses
NAME                 STATUS      MESSAGE                                                                                        ERROR
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: getsockopt: connection refused
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: getsockopt: connection refused
etcd-1               Healthy     {"health": "true"}
etcd-2               Healthy     {"health": "true"}
etcd-0               Healthy     {"health": "true"}
$ kubectl -n kube-system get po
NAME                                                 READY     STATUS    RESTARTS   AGE
cilium-6mg4w                                         0/1       Running   161        15h
cilium-9gjq4                                         0/1       Running   145        14h
cilium-rzzst                                         0/1       Running   161        15h
cilium-w4fhh                                         0/1       Running   163        15h
$ kubectl -n kube-system describe po cilium-w4fhh
Events:
  Type     Reason     Age                 From                                  Message
  ----     ------     ----                ----                                  -------
  Warning  Unhealthy  19m (x45 over 12h)  kubelet, kube-master-02 Readiness probe failed: 
KVStore:                Ok       
Etcd: http://10.28.4..0; http://10.28.4.11:2379 - 3.2.0; http://10.28.4.12:2379 - (Leader) 3.2.0
ContainerRuntime:       Ok
Kubernetes:             Failure   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: getsockopt: connection refused
Kubernetes APIs:        ["core/v1::Endpoint", "extensions/v1beta1::Ingress", "core/v1::Node", "CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "exta1::NetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service"]
Cilium:                 Failure   Kubernetes service is not ready
NodeMonitor:            Listening for events on 24 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok
@kevinjqiu commented Jan 2, 2018

Can confirm with a Tectonic-provisioned cluster: even though the controller-manager and scheduler deployments are fine, kubectl get cs shows them as unhealthy:

$ kubectl get cs
NAME                 STATUS      MESSAGE                                                                                        ERROR
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: getsockopt: connection refused   
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: getsockopt: connection refused   
etcd-0               Healthy     {"health": "true"}                                                                             
etcd-2               Healthy     {"health": "true"}                                                                             
etcd-1               Healthy     {"health": "true"}                                                                             
etcd-3               Healthy     {"health": "true"}                                                                             
etcd-4               Healthy     {"health": "true"}

If I port-forward the pods and do curl on them, they're fine:

$ kubectl port-forward kube-controller-manager-7f59cc56c4-bh69l :10252
Forwarding from 127.0.0.1:44102 -> 10252

$ curl localhost:44102/healthz
ok

A bit of digging in the source code showed that the API server hardcodes the addresses for the controller-manager and scheduler: https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/core/rest/storage_core.go#L242-L243

On the Tectonic cluster, the api-server, controller-manager and scheduler run as hyperkube in separate pods, which don't share the same network namespace, so localhost:10252/10251 on the api-server won't reach the controller-manager or scheduler.
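
One way to see the separation (namespace and pod names will vary by install; output omitted) is to list the control-plane pods with their IPs; each of them gets its own pod IP, so 127.0.0.1 inside the apiserver pod never reaches the other two:

$ kubectl -n kube-system get pods -o wide | grep -E 'apiserver|controller-manager|scheduler'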

@rphillips (Member) commented Feb 16, 2018

Would there be any objections to starting the deprecation process for componentstatus, or to writing up an issue stating the reasons why we should deprecate CS?

@fejta-bot commented Jun 14, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@fejta-bot commented Jul 14, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@fejta-bot commented Aug 13, 2018

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@AkihiroSuda (Contributor) commented Jan 12, 2019

What's the correct way to get the health status of the scheduler and controller-manager when they are not running in the kube-apiserver's network namespace?
