
Alerts firing: ControllerManager, Scheduler and TargetDown #1530

Closed
domcar opened this issue Feb 20, 2018 · 22 comments

Comments

@domcar

domcar commented Feb 20, 2018

What did you do?
I installed prometheus-operator and kube-prometheus using helm:

helm install coreos/prometheus-operator --name prometheus-operator
helm install coreos/kube-prometheus --name kube-prometheus --set rbacEnable=true

What did you expect to see?
Everything green in Alert Manager

What did you see instead? Under which circumstances?
Some Alerts are firing:

  • K8s Scheduler
  • K8S Controller
  • NodeDiskRunningFull
  • TargetDown

Environment
GKE

  • Kubernetes version information:

    Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.1", GitCommit:"f38e43b221d08850172a9a4ea785a86a3ffa3b3a", GitTreeState:"clean", BuildDate:"2017-10-11T23:27:35Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.5-gke.0", GitCommit:"2c2a807131fa8708abc92f3513fe167126c8cce5", GitTreeState:"clean", BuildDate:"2017-12-19T20:05:45Z", GoVersion:"go1.8.3b4", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes cluster kind:

    I used terraform to create the cluster on GKE

  • Prometheus Operator Logs:
    No errors or warnings

I guess these targets are somehow not being scraped. Can you help me figure out how to solve this issue? Thanks

@domcar
Author

domcar commented Feb 21, 2018

If it helps, it looks like some services have no endpoints:

kubectl get endpoints --all-namespaces
kube-system   kube-controller-manager                            <none>                                                           19h
kube-system   kube-prometheus-exporter-kube-scheduler            <none>                                                           24m

@sandromello

sandromello commented Feb 21, 2018

I had a similar issue, but I used kubeadm to install the cluster. I fixed those alerts by editing the selectors of those services.

If you run the Kubernetes core components as pods in the kube-system namespace, make sure the label selectors of those services match the labels on the pods.

kubectl get svc kube-prom-exporter-kube-scheduler kube-prom-exporter-kube-controller-manager -n kube-system -o wide
NAME                                         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE       SELECTOR
kube-prom-exporter-kube-scheduler            ClusterIP   None         <none>        10251/TCP   3h        component=kube-scheduler
kube-prom-exporter-kube-controller-manager   ClusterIP   None         <none>        10252/TCP   3h        component=kube-controller-manager
kubectl get po -l component -n kube-system --show-labels
NAME                                                 READY     STATUS    RESTARTS   AGE       LABELS
(...)
kube-apiserver-ip-10-0-41-71.ec2.internal            1/1       Running   0          3h        component=kube-apiserver,tier=control-plane
kube-controller-manager-ip-10-0-41-71.ec2.internal   1/1       Running   0          3h        component=kube-controller-manager,tier=control-plane
kube-scheduler-ip-10-0-41-71.ec2.internal            1/1       Running   0          3h        component=kube-scheduler,tier=control-plane

If any of those components were started bound to 127.0.0.1, you need to change that; please take a look at the kubeadm notes for Prometheus for more information.
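
A quick way to verify whether the selectors actually match any pods (a minimal sketch; the service and label names below are taken from the example output above and may differ in your cluster):

kubectl -n kube-system get pods -l component=kube-scheduler
kubectl -n kube-system get pods -l component=kube-controller-manager
# empty output means the selector matches nothing; adjust the selector with
kubectl -n kube-system edit svc kube-prom-exporter-kube-scheduler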

@hamid2013

hamid2013 commented Feb 22, 2018

I am also facing the same issue, but in my case I used Azure acs-engine to launch the cluster.

I keep getting the Scheduler and Controller alerts.

I can see the pods are running, but there is no corresponding service for them.

@domcar
Author

domcar commented Feb 22, 2018

@sandromello The problem is that I don't have the kube-scheduler or controller-manager pods. I think this is the reason why it doesn't work.

@ScottBrenner
Contributor

This is a known issue with GKE (prometheus-operator/prometheus-operator#355, prometheus-operator/prometheus-operator#845). I ended up just deleting the two alerts.
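
A rough sketch of doing the same; where the rules live depends on your kube-prometheus version (older releases ship them as ConfigMaps, newer ones as PrometheusRule objects), and the namespace and object name below are placeholders:

kubectl -n monitoring get prometheusrules,configmaps
kubectl -n monitoring edit <object containing the KubeScheduler/KubeControllerManager rules>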

@hameno

hameno commented Mar 4, 2018

This also seems to be the case for https://github.com/rancher/rke deployments (at least it is happening on my dev cluster)

@gianrubio
Contributor

@domcar one way to avoid this issue is to have a flag that controls whether some kube-prometheus dependencies are deployed. Look at the alertmanager example for how to skip installing a dependency.

PRs are always welcome :)
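
A hedged sketch of what skipping those components could look like from the Helm side once such flags exist (the value names below are assumptions, not confirmed chart options):

helm upgrade kube-prometheus coreos/kube-prometheus \
  --set deployKubeScheduler=false \
  --set deployKubeControllerManager=false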

@ghost

ghost commented May 11, 2018

I don't have any endpoints for the kube-controller-manager and kube-scheduler, so how can I monitor them using Prometheus and the Prometheus Operator?

Alerts are being triggered from Alertmanager.

@bonovoxly

@ScottBrenner what's the best way to delete an alert using helm? Is it possible to cherry-pick out the alerts, or would I need to recreate them all (minus the non-working alerts for GKE)?

moizjv referenced this issue in moizjv/prometheus-operator Jun 25, 2018
…kube state exporters optional

When running the Prometheus Operator on hosted Kubernetes like GCE, a few of the exporters are optional, so this adds the ability to install them conditionally.

Fixes #1001, prometheus-operator#355, prometheus-operator#845
gianrubio referenced this issue in prometheus-operator/prometheus-operator Jun 27, 2018
…kube state exporters optional (#1525)

* Helm: Improving readme instructions for testing helm chart locally

Adding note about where to run commands from and also breaking up large bash commands into multiple lines for simple copy paste.

* kube-prometheus: Making kubelets, kubescheduler, kube controller and kube state exporters optional

When running the Prometheus Operator on hosted Kubernetes like GCE, a few of the exporters are optional, so this adds the ability to install them conditionally.

Fixes #1001, #355, #845

* Update Chart.yaml

* Update Chart.yaml
@ScottBrenner
Contributor

@bonovoxly I was using kube-prometheus, never touched Helm.

@ne1000

ne1000 commented Aug 29, 2018

@domcar @ScottBrenner I also hit the same issue, but in my case I used binary packages to install the cluster. Can you give me some advice on how to fix the issue?

@phyllisstein

phyllisstein commented Sep 13, 2018

I ran into this issue with a cluster deployed through kops in AWS. The solution that worked for me was sitting in an old version of the repo: I had to deploy the services listed here to kube-system. With that done, the alerts went green.

Edit: N.B. that I think you can also generate the requisite files by adding (import 'kube-prometheus/kube-prometheus-kops.libsonnet') to your JSONnet config:

local kp =
  (import 'kube-prometheus/kube-prometheus.libsonnet') +
  (import 'kube-prometheus/kube-prometheus-kops.libsonnet') +
  {
    _config+:: {
      namespace: 'monitoring',
      /* ...etc. */
    },
  };
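
Assuming the standard kube-prometheus jsonnet workflow, rendering that config into manifests looks roughly like this (a sketch; the jsonnet-bundler package path varies by kube-prometheus version, and example.jsonnet is a placeholder for your config file):

jb install github.com/prometheus-operator/kube-prometheus/jsonnet/kube-prometheus
jsonnet -J vendor -m manifests example.jsonnet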

@ghost

ghost commented Nov 30, 2018

Same issue with AWS EKS.

@vrathore18

I am facing the same issue. I don't have the kube-scheduler or controller-manager pods. @domcar how did you fix the issue?

P.S. I used Helm for installation. Cloud used: AWS

@chris530

chris530 commented Mar 3, 2019

I noticed the labels the service was looking for were not matching any pods. After adding the label k8s-app=kube-controller-manager to the controller manager and k8s-app=kube-scheduler to the scheduler, the alerts cleared up because the service could now find the pods.
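
For reference, a minimal sketch of that label change (the pod name is a placeholder; for static control-plane pods the durable fix is usually adding the label under metadata.labels in the manifest on the node, e.g. under /etc/kubernetes/manifests, rather than via kubectl):

kubectl -n kube-system label pod <kube-controller-manager-pod> k8s-app=kube-controller-manager
kubectl -n kube-system label pod <kube-scheduler-pod> k8s-app=kube-scheduler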

@rpf3

rpf3 commented Feb 25, 2020

@chris530 I had to do something very similar with the service selectors; basically null out the component label and add the k8s-app label to the selector for those two services.
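
In case it helps, a hypothetical patch doing that in one step (service names reuse the earlier example and may differ in your cluster; a null value in a merge patch removes the key):

kubectl -n kube-system patch svc kube-prom-exporter-kube-scheduler --type merge \
  -p '{"spec":{"selector":{"component":null,"k8s-app":"kube-scheduler"}}}'
kubectl -n kube-system patch svc kube-prom-exporter-kube-controller-manager --type merge \
  -p '{"spec":{"selector":{"component":null,"k8s-app":"kube-controller-manager"}}}'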

@flogfy

flogfy commented Apr 23, 2021

@chris530 how were you able to add these labels to the controller manager and the kube scheduler? I don't even have the pods or services associated with either kube-scheduler or kube-controller-manager. My Kubernetes is installed with RKE.

@woody3549

Hello,

I am currently using prometheus-stack version 20.0.1.
The KubeSchedulerDown and KubeControllerManagerDown alerts are currently being raised for no apparent reason.
Is that also a label issue?
How did you solve it?

Thanks for your help.
Regards,

@paulfantom paulfantom transferred this issue from prometheus-operator/prometheus-operator Dec 1, 2021
@ferpizza

ferpizza commented Dec 9, 2021

Hi,

I've been dealing with these false positives on GKE. After investigating a little, I realized that GKE doesn't expose the Kubernetes scheduler or the controller manager to end users.

Since we can't see these services, there is no need to deploy the scheduler or controller manager scrapers, or their respective alerts.

The easiest way of dealing with these false positive alerts is to disable the scraping and alerting for GKE-managed services in the values file of the Helm chart.

kubeControllerManager:
  enabled: false

kubeScheduler:
  enabled: false

This is probably the case for other cloud providers as well, although I'm not sure.
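
For completeness, applying that values change would look roughly like this (the release name and chart reference are assumptions; use whatever you installed with):

helm upgrade <release-name> prometheus-community/kube-prometheus-stack -f values.yaml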

Cheers,

@woody3549

Hi @ferpizza,

Now I no longer receive alerts for KubeScheduler and KubeControllerManager.
Thanks.

However, a new KubeProxyDown alert now appears.
Can you please point out what GKE exposes?
I might have to disable it as well.

Cheers

@ferpizza

ferpizza commented Jan 3, 2022

Hello @woody3549,

I haven't found official documentation that sets apart the Kubernetes components exposed to end users from the ones kept private for Google's management. You can make an assumption based on whether a given component is key to delivering the GKE service.

kube-proxy is one of those components, being a critical piece in the networking of your cluster.

When I wrote my first comment I was on version 18.1.1 of the Kube Prometheus Stack helm chart, and that version did not include the kube-proxy alerts or scraper.

Since then I have updated to version 27.1.0, which includes the kube-proxy alert, and was confronted with the same issue regarding false positives.

We can solve this, and the two prior alerts, by adding the following lines to our Values file.

kubeControllerManager:
  enabled: false

kubeScheduler:
  enabled: false

kubeProxy:
  enabled: false

@woody3549

Hello,

Ok thanks. This makes sense and is very helpful.

Regards
