
Alerts firing: ControllerManager, Scheduler and TargetDown #1530

Closed
domcar opened this issue Feb 20, 2018 · 22 comments

Comments

@domcar

domcar commented Feb 20, 2018

What did you do?
I installed prometheus-operator and kube-prometheus using helm:

helm install coreos/prometheus-operator --name prometheus-operator
helm install coreos/kube-prometheus --name kube-prometheus --set rbacEnable=true

What did you expect to see?
Everything green in Alert Manager

What did you see instead? Under which circumstances?
Some Alerts are firing:

  • K8s Scheduler
  • K8S Controller
  • NodeDiskRunningFull
  • TargetDown

Environment
GKE

  • Kubernetes version information:

    Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.1", GitCommit:"f38e43b221d08850172a9a4ea785a86a3ffa3b3a", GitTreeState:"clean", BuildDate:"2017-10-11T23:27:35Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.5-gke.0", GitCommit:"2c2a807131fa8708abc92f3513fe167126c8cce5", GitTreeState:"clean", BuildDate:"2017-12-19T20:05:45Z", GoVersion:"go1.8.3b4", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes cluster kind:

    I used terraform to create the cluster on GKE

  • Prometheus Operator Logs:
    No errors or warnings

I guess these targets are somehow not being scraped. Can you help me figure out how to solve this issue? Thanks

@domcar
Author

domcar commented Feb 21, 2018

If it helps, it looks like some services have no endpoints:

kubectl get endpoints --all-namespaces
kube-system   kube-controller-manager                            <none>                                                           19h
kube-system   kube-prometheus-exporter-kube-scheduler            <none>                                                           24m

@sandromello

sandromello commented Feb 21, 2018

I had a similar issue, but I used kubeadm to install the cluster. I fixed those alerts by editing the selectors of those services.

If you run the Kubernetes core components as pods in the kube-system namespace, make sure the label selectors of those services match the labels on the pods.

kubectl get svc kube-prom-exporter-kube-scheduler kube-prom-exporter-kube-controller-manager -n kube-system -o wide
NAME                                         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE       SELECTOR
kube-prom-exporter-kube-scheduler            ClusterIP   None         <none>        10251/TCP   3h        component=kube-scheduler
kube-prom-exporter-kube-controller-manager   ClusterIP   None         <none>        10252/TCP   3h        component=kube-controller-manager
kubectl get po -l component -n kube-system --show-labels
NAME                                                 READY     STATUS    RESTARTS   AGE       LABELS
(...)
kube-apiserver-ip-10-0-41-71.ec2.internal            1/1       Running   0          3h        component=kube-apiserver,tier=control-plane
kube-controller-manager-ip-10-0-41-71.ec2.internal   1/1       Running   0          3h        component=kube-controller-manager,tier=control-plane
kube-scheduler-ip-10-0-41-71.ec2.internal            1/1       Running   0          3h        component=kube-scheduler,tier=control-plane

If any of those components were started bound to 127.0.0.1, you need to change that; please take a look at the kubeadm notes for Prometheus for more information.
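
A quick way to verify whether the selectors actually match any pods (a minimal sketch; the service and label names below are taken from the example output above and may differ in your cluster):

kubectl -n kube-system get pods -l component=kube-scheduler
kubectl -n kube-system get pods -l component=kube-controller-manager
# empty output means the selector matches nothing; adjust the selector with
kubectl -n kube-system edit svc kube-prom-exporter-kube-scheduler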

@hamid2013

hamid2013 commented Feb 22, 2018

I am also facing the same issue, but in my case I used Azure acs-engine to launch the cluster.

I keep getting the Scheduler and Controller alerts.

I can see the pods are running, but there is no corresponding service for them.

@domcar
Author

domcar commented Feb 22, 2018

@sandromello The problem is that I don't have the kube-scheduler or controller-manager pods. I think this is the reason why it doesn't work.

@ScottBrenner
Contributor

This is a known issue with GKE (prometheus-operator/prometheus-operator#355, prometheus-operator/prometheus-operator#845). I ended up just deleting the two alerts.
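
A rough sketch of doing the same; where the rules live depends on your kube-prometheus version (older releases ship them as ConfigMaps, newer ones as PrometheusRule objects), and the namespace and object name below are placeholders:

kubectl -n monitoring get prometheusrules,configmaps
kubectl -n monitoring edit <object containing the KubeScheduler/KubeControllerManager rules>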

@hameno

hameno commented Mar 4, 2018

This also seems to be the case for https://github.com/rancher/rke deployments (at least it is happening on my dev cluster)

@gianrubio
Contributor

@domcar one way to avoid this issue is to have a flag that controls whether some kube-prometheus dependencies are deployed. Look at the alertmanager example for how to skip installing a dependency.

PRs are always welcome :)
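
A hedged sketch of what skipping those components could look like from the Helm side once such flags exist (the value names below are assumptions, not confirmed chart options):

helm upgrade kube-prometheus coreos/kube-prometheus \
  --set deployKubeScheduler=false \
  --set deployKubeControllerManager=false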

@ghost

ghost commented May 11, 2018

I don't have any endpoints for the kube-controller-manager and kube-scheduler, so how can I monitor them using Prometheus and the Prometheus Operator?

Alerts are being triggered from Alertmanager.

@bonovoxly

@ScottBrenner what's the best way to delete an alert using helm? Is it possible to cherry-pick out the alerts, or would I need to recreate them all (minus the non-working alerts for GKE)?

moizjv referenced this issue in moizjv/prometheus-operator Jun 25, 2018
…kube state exporters optional

When running the Prometheus Operator on hosted Kubernetes like GCE, a few of the exporters are optional, so this adds the ability to install them conditionally.

Fixes #1001, prometheus-operator#355, prometheus-operator#845
gianrubio referenced this issue in prometheus-operator/prometheus-operator Jun 27, 2018
…kube state exporters optional (#1525)

* Helm: Improving readme instructions for testing helm chart locally

Adding note about where to run commands from and also breaking up large bash commands into multiple lines for simple copy paste.

* kube-prometheus: Making kubelets, kubescheduler, kube controller and kube state exporters optional

When running the Prometheus Operator on hosted Kubernetes like GCE, a few of the exporters are optional, so this adds the ability to install them conditionally.

Fixes #1001, #355, #845

* Update Chart.yaml

* Update Chart.yaml
@ScottBrenner
Contributor

@bonovoxly I was using kube-prometheus, never touched Helm.

@ne1000

ne1000 commented Aug 29, 2018

@domcar @ScottBrenner I also hit the same issue, but in my case I used binary packages to install the cluster. Can you give me some advice on how to fix the issue?

@phyllisstein

phyllisstein commented Sep 13, 2018

I ran into this issue with a cluster deployed through kops in AWS. The solution that worked for me was sitting in an old version of the repo: I had to deploy the services listed here to kube-system. With that done, the alerts went green.

Edit: N.B. that I think you can also generate the requisite files by adding (import 'kube-prometheus/kube-prometheus-kops.libsonnet') to your JSONnet config:

local kp =
  (import 'kube-prometheus/kube-prometheus.libsonnet') +
  (import 'kube-prometheus/kube-prometheus-kops.libsonnet') +
  {
    _config+:: {
      namespace: 'monitoring',
      /* ...etc. */
    },
  };
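
Assuming the standard kube-prometheus jsonnet workflow, rendering that config into manifests looks roughly like this (a sketch; the jsonnet-bundler package path varies by kube-prometheus version, and example.jsonnet is a placeholder for your config file):

jb install github.com/prometheus-operator/kube-prometheus/jsonnet/kube-prometheus
jsonnet -J vendor -m manifests example.jsonnet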

@ghost

ghost commented Nov 30, 2018

Same issue with AWS EKS.

@vrathore18

I am facing the same issue. I don't have the kube-scheduler or controller-manager pods. @domcar how did you fix the issue?

P.S. I used Helm for installation. Cloud used: AWS

@chris530

chris530 commented Mar 3, 2019

I noticed the labels the service was looking for were not matching any pods. After adding the label k8s-app=kube-controller-manager to the controller manager and k8s-app=kube-scheduler to the scheduler, the alerts cleared up because the service could now find the pods.
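
For reference, a minimal sketch of that label change (the pod name is a placeholder; for static control-plane pods the durable fix is usually adding the label under metadata.labels in the manifest on the node, e.g. under /etc/kubernetes/manifests, rather than via kubectl):

kubectl -n kube-system label pod <kube-controller-manager-pod> k8s-app=kube-controller-manager
kubectl -n kube-system label pod <kube-scheduler-pod> k8s-app=kube-scheduler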

@rpf3

rpf3 commented Feb 25, 2020

@chris530 I had to do something very similar with the service selectors; basically null out the component label and add the k8s-app label to the selector for those two services.
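
In case it helps, a hypothetical patch doing that in one step (service names reuse the earlier example and may differ in your cluster; a null value in a merge patch removes the key):

kubectl -n kube-system patch svc kube-prom-exporter-kube-scheduler --type merge \
  -p '{"spec":{"selector":{"component":null,"k8s-app":"kube-scheduler"}}}'
kubectl -n kube-system patch svc kube-prom-exporter-kube-controller-manager --type merge \
  -p '{"spec":{"selector":{"component":null,"k8s-app":"kube-controller-manager"}}}'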

@flogfy

flogfy commented Apr 23, 2021

@chris530 how were you able to add these labels to the controller manager and the kube scheduler? I don't even have the pods or services associated with either kube-scheduler or kube-controller-manager. My Kubernetes is installed with RKE.

@woody3549

Hello,

I am currently using prometheus-stack version 20.0.1.
The KubeSchedulerDown and KubeControllerManagerDown alerts are currently being raised for no apparent reason.
Is that also a label issue?
How did you solve it?

Thanks for your help.
Regards,

@paulfantom paulfantom transferred this issue from prometheus-operator/prometheus-operator Dec 1, 2021
@ferpizza

ferpizza commented Dec 9, 2021

Hi,

I've been dealing with these false positives on GKE. After investigating a little, I realized that GKE doesn't expose the Kubernetes scheduler or the controller manager to end users.

Since we can't see these services, there is no need to deploy the scheduler or controller manager scrapers, or their respective alerts.

The easiest way of dealing with these false positive alerts is to disable the scraping and alerting for GKE-managed services in the values file of the Helm chart.

kubeControllerManager:
  enabled: false

kubeScheduler:
  enabled: false

This is probably the case for other cloud providers as well, although I'm not sure.
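
For completeness, applying that values change would look roughly like this (the release name and chart reference are assumptions; use whatever you installed with):

helm upgrade <release-name> prometheus-community/kube-prometheus-stack -f values.yaml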

Cheers,

@woody3549

Hi @ferpizza,

Now I no longer receive alerts for KubeScheduler and KubeControllerManager.
Thanks.

However, a new KubeProxyDown alert now appears.
Can you please point out what GKE exposes?
I might have to disable it as well.

Cheers

@ferpizza

ferpizza commented Jan 3, 2022

Hello @woody3549,

I haven't found official documentation that sets apart the Kubernetes components exposed to end users from the ones kept private for Google's management. You can make an assumption based on whether a given component is key to delivering the GKE service.

kube-proxy is one of those components, being a critical piece in the networking of your cluster.

When I wrote my first comment I was on version 18.1.1 of the Kube Prometheus Stack helm chart, and that version did not include the kube-proxy alerts or scraper.

Since then I have updated to version 27.1.0, which includes the kube-proxy alert, and was confronted with the same issue regarding false positives.

We can solve this, and the two prior alerts, by adding the following lines to our Values file.

kubeControllerManager:
  enabled: false

kubeScheduler:
  enabled: false

kubeProxy:
  enabled: false

@woody3549

Hello,

Ok thanks. This makes sense and is very helpful.

Regards
