IBM Cloud Private - Alerting with Prometheus

With the newer versions of IBM Cloud Private (ICP), a new version of Prometheus has been rolled out as well - for ICP 2.1.0.3 this is Prometheus 2.0.0, to be precise.

Amongst other changes and improvements, this brings a new format for defining alerting rules. After some digging around and a lot of trial and error I got a working configuration up and running that I thought I'd share with you.

Remark: the configuration files can be found in this repository (monitoring-prometheus-alertmanager_config.yaml and monitoring-prometheus-alertrules_config.yaml).

Receivers

First I configured two receivers: one for Slack and one for IBM Cloud Event Management.




Slack

This is handled by the slack receiver.

    global:
      slack_api_url: 'https://hooks.slack.com/services/xxxx/yyyy/zzzz'
    route:
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1m
....

        slack_configs:
        - channel: '#ibmcloudprivate'
          text: 'Nodes: {{ range .Alerts }}{{ .Labels.instance }} {{ end }}      ---- Summary: {{ .CommonAnnotations.summary }}      ---- Description: {{ .CommonAnnotations.description }}       ---- https://*yourICPClusterAddress*:8443/prometheus/alerts '

Replace yourICPClusterAddress with the real ingress IP (or host name) of your ICP cluster.
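
If you want to verify that the Slack webhook URL itself works, independently of Alertmanager, you can post a test message to it directly with curl (a minimal sketch; the xxxx/yyyy/zzzz path is just the placeholder for your own webhook):

    # Send a simple test message straight to the Slack incoming webhook.
    curl -X POST \
      -H 'Content-type: application/json' \
      --data '{"text": "Test message from the ICP alerting setup"}' \
      'https://hooks.slack.com/services/xxxx/yyyy/zzzz'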


IBM Cloud Event Management

This is handled by a webhook receiver. The URL can be obtained directly in the web interface under:

Administration / Event sources / Configure an event source / Prometheus

        webhook_configs:
          - send_resolved: true
            url: 'https://cem-normalizer-us-south.opsmgmt.bluemix.net/webhook/prometheus/xxxxx/yyyyy/zzzzz'
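
Alertmanager delivers alerts to webhook receivers as JSON in its standard webhook format. If you want a quick connectivity check before everything is wired up, you can hand-craft a minimal payload and post it with curl. This is only a sketch of that format - the xxxxx/yyyyy/zzzzz path is the placeholder from the CEM UI, and CEM may require a more complete payload before it shows a proper event:

    # Post a minimal Alertmanager-style webhook payload to the CEM event source URL.
    curl -X POST \
      -H 'Content-type: application/json' \
      --data '{
        "version": "4",
        "status": "firing",
        "receiver": "webhook",
        "groupLabels": {"alertname": "test_alert"},
        "commonLabels": {"alertname": "test_alert", "severity": "warning"},
        "commonAnnotations": {"summary": "Test alert sent with curl"},
        "alerts": [{
          "status": "firing",
          "labels": {"alertname": "test_alert", "instance": "test-node", "severity": "warning"},
          "annotations": {"summary": "Test alert sent with curl"},
          "startsAt": "2018-06-01T12:00:00Z"
        }]
      }' \
      'https://cem-normalizer-us-south.opsmgmt.bluemix.net/webhook/prometheus/xxxxx/yyyyy/zzzzz'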

Routes

Then we define the two routes pulling in both receivers:

    routes:
      - receiver: webhook
        continue: true
      - receiver: slack_alerts
        continue: true

You can configure the timing parameters to your liking. For testing purposes you would typically choose much shorter group and repeat intervals.
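
Before pushing a changed Alertmanager configuration into the cluster, it can save a round trip to validate it locally. If you happen to have the amtool binary from an Alertmanager release available (purely optional; depending on the version the subcommand is check-config or check config), something like this catches YAML and schema errors early:

    # Validate a local copy of the Alertmanager configuration before applying it.
    amtool check-config alertmanager.yml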



Rules

Now on to the more challenging part - the new rule format. You can read up on it in the Prometheus documentation on recording and alerting rules, and there are some nice, crisp articles on how to convert old rules into the new format.

So basically we go from this:

ALERT HighErrorRate
  IF job:request_latency_seconds:mean5m{job="myjob"} > 0.5
  FOR 10m
  ANNOTATIONS {
    summary = "High request latency",
  }

to this:

groups:
- name: alert.rules
  rules:
  - alert: HighErrorRate
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    annotations:
      summary: High request latency

Not exactly rocket science, but it takes some time to convert everything and test it out.
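
If you have a larger set of old-style rule files to migrate, the promtool binary shipped with early Prometheus 2.x releases had a converter built in (a sketch; the subcommand was removed again in later versions, so check promtool --help on your version first):

    # Convert a Prometheus 1.x rule file into the new YAML-based format.
    # promtool should write the converted rules to a new .yml file next to the original.
    promtool update rules old_alert.rules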


Example rules

So to give you a starting point and get you off the ground, you might want to start with this set of rules:

    groups:
    - name: alert.rules
      rules:
      - alert: high_cpu_load
        expr: node_load1 > 5
        for: 10s
        labels:
          severity: critical
        annotations:
          description: Docker host is under high load, the avg load 1m is at {{ $value}}.
            Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
          summary: Server under high load
      - alert: high_memory_load
        expr: (sum(node_memory_MemTotal) - sum(node_memory_MemFree + node_memory_Buffers
          + node_memory_Cached)) / sum(node_memory_MemTotal) * 100 > 85
        for: 30s
        labels:
          severity: warning
        annotations:
          description: Docker host memory usage is {{ humanize $value}}%. Reported by
            instance {{ $labels.instance }} of job {{ $labels.job }}.
          summary: Server memory is almost full
      - alert: high_storage_load
        expr: (node_filesystem_size{fstype="aufs"} - node_filesystem_free{fstype="aufs"})
          / node_filesystem_size{fstype="aufs"} * 100 > 15
        for: 30s
        labels:
          severity: warning
        annotations:
          description: Docker host storage usage is {{ humanize $value}}%. Reported by
            instance {{ $labels.instance }} of job {{ $labels.job }}.
          summary: Server storage is almost full

Again, adapt the thresholds to your liking and needs.
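
Whatever thresholds you settle on, it is worth syntax-checking the rule file before putting it into the ConfigMap. A quick sketch with promtool, assuming the group above is saved locally as sample.rules:

    # Syntax-check the new-format rule file before loading it into Prometheus.
    promtool check rules sample.rules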



Putting it all together

So to implement those alerts, basically you'll have to update two ConfigMaps in ICP.

To do this:

  • Go to Configuration / ConfigMaps
  • Filter for "alert"
  • And you should find three ConfigMaps

Now you can either modify them by hand by clicking on the action handle at the very right (the three small vertical dots).

Or you can simply select "Create Resource" in the top menu and paste the following two definitions one after the other.
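
If you prefer the command line over the console, the same update can be done with kubectl against the kube-system namespace (a sketch, assuming you are logged in to the cluster and use the two YAML files from this repository):

    # Apply the two ConfigMap definitions from this repository.
    kubectl -n kube-system apply -f monitoring-prometheus-alertrules_config.yaml
    kubectl -n kube-system apply -f monitoring-prometheus-alertmanager_config.yaml

    # List the monitoring ConfigMaps to confirm they are there.
    kubectl -n kube-system get configmaps | grep alert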


Update Alert Rules

To modify the monitoring-prometheus-alertrules ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: monitoring-prometheus
    component: prometheus
  name: monitoring-prometheus-alertrules
  namespace: kube-system
data:
  sample.rules: |-
    groups:
    - name: alert.rules
      rules:
      - alert: high_cpu_load
        expr: node_load1 > 5
        for: 10s
        labels:
          severity: critical
        annotations:
          description: Docker host is under high load, the avg load 1m is at {{ $value}}.
            Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
          summary: Server under high load
      - alert: high_memory_load
        expr: (sum(node_memory_MemTotal) - sum(node_memory_MemFree + node_memory_Buffers
          + node_memory_Cached)) / sum(node_memory_MemTotal) * 100 > 85
        for: 30s
        labels:
          severity: warning
        annotations:
          description: Docker host memory usage is {{ humanize $value}}%. Reported by
            instance {{ $labels.instance }} of job {{ $labels.job }}.
          summary: Server memory is almost full
      - alert: high_storage_load
        expr: (node_filesystem_size{fstype="aufs"} - node_filesystem_free{fstype="aufs"})
          / node_filesystem_size{fstype="aufs"} * 100 > 15
        for: 30s
        labels:
          severity: warning
        annotations:
          description: Docker host storage usage is {{ humanize $value}}%. Reported by
            instance {{ $labels.instance }} of job {{ $labels.job }}.
          summary: Server storage is almost full

Update Alert Routes

And to modify the monitoring-prometheus-alertmanager ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: monitoring-prometheus
    component: alertmanager
  name: monitoring-prometheus-alertmanager
  namespace: kube-system
data:
  alertmanager.yml: |-
    global:
      resolve_timeout: 20s
      slack_api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
    route:
      receiver: webhook
      group_by: [alertname, instance, severity]
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1m
      routes:
      - receiver: webhook
        continue: true
      - receiver: slack_alerts
        continue: true

    receivers:
    - name: webhook
      webhook_configs:
      - send_resolved: false
        url: 'https://cem-normalizer-us-south.opsmgmt.bluemix.net/webhook/prometheus/xxx/yyy/zzz'
    - name: slack_alerts
      slack_configs:
      - send_resolved: false
        channel: '#ibmcloudprivate'
        text: 'Nodes: {{ range .Alerts }}{{ .Labels.instance }} {{ end }}      ---- Summary: {{ .CommonAnnotations.summary }}      ---- Description: {{ .CommonAnnotations.description }}       ---- https://*yourICPClusterAddress*:8443/prometheus/alerts '
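
Once both definitions are in place, you can double-check with kubectl that the cluster really holds the new content:

    # Print the stored ConfigMaps and compare them with what you pasted.
    kubectl -n kube-system get configmap monitoring-prometheus-alertrules -o yaml
    kubectl -n kube-system get configmap monitoring-prometheus-alertmanager -o yaml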

Test the whole thing

Check Rules

Check if it works by going to

https://yourICPClusterAddress:8443/prometheus/rules

and verifying that the rules have been loaded. It can take a while for the changes to be picked up!
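
If the rules refuse to show up, the Prometheus pod logs are the first place to look. A sketch, assuming the pod and container carry the same names and labels as the ConfigMaps above (they may differ in your ICP release):

    # Find the Prometheus pod in the kube-system namespace.
    kubectl -n kube-system get pods | grep monitoring-prometheus

    # Tail its logs and look for configuration or rule loading errors.
    # The container name "prometheus" is an assumption - check the pod spec if it differs.
    kubectl -n kube-system logs <prometheus-pod-name> -c prometheus --tail=100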


Create Load

If you would like to test that the CPU rule fires, you can do the following:

Start a simple BusyBox shell

docker run -it --rm busybox

and then generate some load by pasting the following several times (how often depends on the number of cores you're running):

yes > /dev/null &
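
If you would rather not paste that repeatedly, a small loop starts one worker per core (a sketch; it assumes /proc/cpuinfo is readable inside the BusyBox container):

    # Start one busy-loop per CPU core reported by the kernel.
    for i in $(seq $(grep -c ^processor /proc/cpuinfo)); do yes > /dev/null & done

    # Exiting the shell stops the load again; --rm removes the container.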

Check Load

To check out the generated load, open the following URL:

https://yourICPClusterAddress:8443/prometheus/graph?g0.range_input=1h&g0.expr=node_load1%20%3E%201.5&g0.tab=0


Check Alerts

And to see if the alerts are firing either go to

https://yourICPClusterAddress:8443/alertmanager/#/alerts

or to

https://yourICPClusterAddress:8443/prometheus/alerts for more details




I hope this short write-up helps some poor soul out there trying to get the Prometheus alerting system working in recent versions of ICP.