
alerting rules #1

Open
saokar opened this Issue Mar 19, 2018 · 9 comments

@saokar

saokar commented Mar 19, 2018

I need to add an alert for blackbox monitoring. Where do I add this alert rule?

Here's my alert rule:

ALERT ProbeFailing
  IF probe_success < 1
  FOR 15m
  WITH {
    job="blackbox_exporter"
  }
  SUMMARY "probe {{$labels.job}} failing"
  DESCRIPTION "probe {{$labels.job}} failing"
@jcreager


Owner

jcreager commented Mar 23, 2018

@saokar You should create a configMap and put this rule in it under a file called up.rules. So your configMap will look something like this:

kind: ConfigMap
apiVersion: v1
metadata:
  name: prometheus-rulefiles-blackbox
  namespace: default
  labels:
    role: prometheus-rulefiles
    prometheus: blackbox
data:
  recording.rules: |-
  up.rules: |-
    ALERT ProbeFailing
      IF probe_success < 1
      FOR 15m
      WITH {
        job="blackbox_exporter"
      }
      SUMMARY "probe {{$labels.job}} failing"
      DESCRIPTION "probe {{$labels.job}} failing"

Prometheus operator will load this configMap automatically. However, there are special steps you need to take to force the configMap to reload. I include this bash script to force the configMap to reload: https://github.com/jcreager/my-k8s/blob/master/prod/prometheus/blackbox/make_secrets.sh

If you read my article on setting up custom configs, take another look at the bottom third of the article where I describe setting up the configMap, and using the bash script to force prometheus-operator to reload the rules. http://joecreager.com/custom-configurations-with-prometheus-operator/

After you do this, you should see your rule in the prometheus UI under status>rules. Hopefully that helps. Let me know if I can explain anything better.

@saokar


saokar commented Mar 23, 2018

@jcreager That is what I tried, but I still see "No rules defined" in the Prometheus UI under status > rules. I don't see any errors in the prometheus-operator logs.

Not sure what I am doing wrong.

@jcreager


Owner

jcreager commented Mar 23, 2018

@saokar Can you share your configs? It sounds like the configMap is not being bound to the pod by prometheus-operator. Does the rule selector in your Prometheus config match the metadata labels in your configMap?
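For example, a sketch against the prometheus-operator Prometheus CRD of that era (resource names here are illustrative), with a ruleSelector that matches the configMap's labels:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: blackbox
  namespace: monitoring
spec:
  # must match the metadata.labels on the rules configMap
  ruleSelector:
    matchLabels:
      role: prometheus-rulefiles
      prometheus: blackbox
```

If these labels and the configMap's labels don't line up exactly, the operator won't mount the rules file at all.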

A couple of things to check:

  • Exec into your pod (kubectl exec -it <pod-name> -- /bin/sh) and check whether there is a rules file bound to the pod in /etc/prometheus/rules. If you can't find up.rules in your prometheus container, that most likely means prometheus-operator can't find a rules-file configMap to match on.
  • Prometheus operator has a config-reloader sidecar container. kubectl logs -f that container and see if it is reloading the rules.
  • Make sure you apply the configMap (kubectl apply -f) before you apply the secret, so that the config-reloader sees a new checksum for the rules file.
  • If you are using my example shell script to manage the config secret, make sure your configMap is named prometheus-rulefiles-blackbox; otherwise, if you are wedded to another name, update this line to use it:
    CONFIGMAPS_JSON="{\"items\":[{\"key\":\"monitoring/prometheus-rulefiles-blackbox\",\"checksum\":\""$CHECKSUM"\"}]}"

If you can't make progress with that, please share your configs if possible and I'll see what else I can do to assist.

@saokar


saokar commented Mar 23, 2018

@jcreager
Here's my config:

kind: ConfigMap
apiVersion: v1
metadata:
  name: prometheus-rulefiles-blackbox
  namespace: monitoring
  labels:
    role: prometheus-rulefiles
    prometheus: blackbox
data:
  recording.rules: |-
  up.rules: |-
    ALERT ProbeFailing
      IF probe_success < 1
      FOR 15m
      WITH {
        job="blackbox_exporter"
      }
      SUMMARY "probe {{$labels.job}} failing"
      DESCRIPTION "probe {{$labels.job}} failing"

This is the error I am seeing in the Prometheus pod log:

$ kubectl -n monitoring logs -f prometheus-blackbox-0 --container=prometheus
level=info ts=2018-03-23T19:23:59.224010953Z caller=main.go:225 msg="Starting Prometheus" version="(version=2.2.0-rc.0, branch=HEAD, revision=1fe05d40e4b2f4f7479048b1cc3c42865eb73bab)"
level=info ts=2018-03-23T19:23:59.224064154Z caller=main.go:226 build_context="(go=go1.9.2, user=root@f7abb25edc70, date=20180213-11:40:47)"
level=info ts=2018-03-23T19:23:59.224082244Z caller=main.go:227 host_details="(Linux 4.14.19-coreos #1 SMP Wed Feb 14 03:18:05 UTC 2018 x86_64 prometheus-blackbox-0 (none))"
level=info ts=2018-03-23T19:23:59.224096119Z caller=main.go:228 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-03-23T19:23:59.231618947Z caller=main.go:502 msg="Starting TSDB ..."
level=info ts=2018-03-23T19:23:59.231748791Z caller=web.go:383 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-03-23T19:24:04.202595808Z caller=main.go:512 msg="TSDB started"
level=info ts=2018-03-23T19:24:04.206580146Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/config/prometheus.yaml
level=info ts=2018-03-23T19:24:04.210212669Z caller=kubernetes.go:191 component="discovery manager notify" discovery=k8s msg="Using pod service account via in-cluster config"
level=error ts=2018-03-23T19:24:04.21094304Z caller=manager.go:479 component="rule manager" msg="loading groups failed" err="yaml: unmarshal errors:\n line 1: cannot unmarshal !!str ALERT T... into rulefmt.RuleGroups"

@jcreager


Owner

jcreager commented Mar 23, 2018

@saokar Thanks. Can you also post the logs for the prometheus-config-reloader container? I think this may be an issue with RBAC permissions for the ServiceAccount attempting to access the configMap. Most likely the SA needs to be granted access to the configMap. I hadn't considered (or worried about) that for my own purposes because I'm not doing any alerting yet. If this is the cause, there will be something like this in the prometheus-config-reloader logs:

ts=2018-03-23T19:10:20Z caller=main.go:207 component=volume-watcher msg="Updating rule files failed." err="kubernetes api: Failure 403 configmaps \"prometheus-rulefiles-blackbox\" is forbidden: User \"system:serviceaccount:default:default\" cannot get configmaps in the namespace \"default\""

I'm not sure offhand how to address this yet, but I'm sure there is a way; I'll see if I can figure it out over the weekend. If you are able to work it out first, please let me know. I'd like to update my own configs as well.

@saokar


saokar commented Mar 23, 2018

@jcreager
This is the error in the logs for the prometheus-config-reloader container:

ts=2018-03-23T19:39:55Z caller=main.go:214 component=volume-watcher msg="Reloading Prometheus temporarily failed." err="Post http://localhost:9090/-/reload: dial tcp 127.0.0.1:9090: getsockopt: connection refused" next-retry=1m24.286574224s

@jcreager


Owner

jcreager commented Mar 30, 2018

@saokar That doesn't look like an RBAC issue. I think we are running into different problems. Were you able to make any progress?

Regarding the error you are seeing in your logs, it looks like the config-reloader can't reach the prometheus container. If you kubectl exec -it <pod-name> -c prometheus-config-reloader -- /bin/sh and then run wget http://localhost:9090/-/reload, what do you get as a response? For example, I get:

wget: server returned error: HTTP/1.1 405 Method Not Allowed

I think it might help if you can share your configs. What does your Service config look like? Here is mine: https://github.com/jcreager/my-k8s/blob/master/prod/prometheus/blackbox/services.yml
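As a rough sketch of what such a Service might look like (names and selector labels are illustrative; adjust them to match the labels on your operator-managed Prometheus pods):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-blackbox
  namespace: monitoring
spec:
  # should match the labels on the prometheus-blackbox pods
  selector:
    prometheus: blackbox
  ports:
    - name: web
      port: 9090
      targetPort: web
```

With a Service like this in place, the reload endpoint inside the pod is still addressed as localhost:9090 by the sidecar, so the connection-refused error usually points at the prometheus container itself not listening yet.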

@saokar


saokar commented Mar 31, 2018

@jcreager There seems to be an error parsing the rules. Here's the log from the Prometheus pod:

level=error ts=2018-03-31T18:03:57.810873345Z caller=manager.go:479 component="rule manager" msg="loading groups failed" err="yaml: unmarshal errors:\n line 1: cannot unmarshal !!str ALERT P... into rulefmt.RuleGroups"
level=error ts=2018-03-31T18:03:57.810919007Z caller=main.go:607 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
level=error ts=2018-03-31T18:03:57.81095565Z caller=main.go:453 msg="Error reloading config" err="one or more errors occurred while applying the new configuration (--config.file=/etc/prometheus/config/prometheus.yaml)"
level=error ts=2018-03-31T18:03:57.811010692Z caller=main.go:582 err="Error loading config one or more errors occurred while applying the new configuration (--config.file=/etc/prometheus/config/prometheus.yaml)"
level=info ts=2018-03-31T18:03:57.811047146Z caller=main.go:584 msg="See you next time!"

Maybe this is related to the new rule format in Prometheus 2.0?

@jcreager


Owner

jcreager commented Apr 10, 2018

@saokar Try copying your rule into a .rules file and using promtool to inspect it (you need to build promtool from source or download a release). The rule you posted looks right, but there could be a less obvious error. Other than that, I'd say failing to parse the rules file is progress. Sort that out and you should be set. Let me know if you manage to make any progress.
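For reference, the "cannot unmarshal !!str ALERT ... into rulefmt.RuleGroups" error means Prometheus 2.x expects rules as YAML rule groups rather than the 1.x DSL. A rough translation of the rule above into the 2.x format (the group name here is just an example) would look something like:

```yaml
groups:
  - name: blackbox
    rules:
      - alert: ProbeFailing
        expr: probe_success < 1
        for: 15m
        labels:
          job: blackbox_exporter
        annotations:
          summary: "probe {{ $labels.job }} failing"
          description: "probe {{ $labels.job }} failing"
```

You can then validate the file with `promtool check rules up.rules`; if I recall correctly, 2.x promtool also has an `update rules` subcommand to convert 1.x-format files.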
