mozalert is a monitoring tool written in python which runs as a Kubernetes CustomResourceDefinition (CRD). mozalert maintains intervals for executing predefined container-based checks, and handles escalations if necessary.
git clone https://github.com/mozafrank/mozalert
cd mozalert/install
for f in clusterrole.yaml clusterrolebinding.yaml crd.yaml stateful.yaml; do
kubectl apply -f $f
done
The mozalert CRD provides the "check" operator. Here is an example Check manifest:
kind: Check
metadata:
name: check-test-1
namespace: default
spec:
check_interval: 1m
retry_interval: 3m
notification_interval: 5m
max_attempts: 3
escalations:
- type: email
args:
email: you@email.com
image: afrank/mozlenium
secret_ref: check-test-1-mozlenium-secrets
check_cm: check-test-1-cm
Available parameters:
check_interval
: REQUIRED Define the interval by which to run your check. Supports XXhXXmXXs format. An interval is defined as the amount of time since the last check finished to wait before starting a new check.retry_interval
: OPTIONAL: Define check interval to use while the check is in a failure state. For example, let's say you want your check to fail 3 times before it escalates, but you only want the check to run every 30 minutes. 3 failures would mean it'll take 90 minutes of downtime before the escalation is sent. Instead, you can set yourretry_interval
to, say, 3 minutes, then you'll get alerted much more quickly for a persistent failure.notification_interval
: OPTIONAL: Define the check interval while the check is in an escalated state. This is also the interval for which alerts are sent out. If not specified this is set tocheck_interval
.max_attempts
: REQUIRED: The number of check attempts before a check enters a failed state and escalation begins.escalations
: REQUIRED: A List of escalations defined by a dictionary with keystype
andargs
. Currently only supports email, but in the future this will be used for HTTP-based escalations as well. For an escalation of type email, one arg is required, with key "email".image
: REQUIRED: Specify the image to be used by the checker. Example Imagessecret_ref
: OPTIONAL: The name of a secret resource to use for this check. Secrets defined here show up in the container env (or$secure
in the case of mozlenium).check_cm
: OPTIONAL: The ConfigMap which holds your actual check code. See example below.url
: OPTIONAL: URL to check for url-based checks.template.spec
: OPTIONAL Instead of specifying image, secret_ref and check_cm you can override everything by defining a full pod spec which will get used by the checker. You can see examples of this here and here.timeout
: OPTIONAL Max time for check to run before being killed. Default 5m.
Example secret manifest for a check:
# here is where you put any secrets you want your
# check to have. The secret value is base64-encoded
# For example: echo -n thisisnotsecret | base64 -w0
kind: Secret
type: Opaque
apiVersion: v1
metadata:
name: check-test-1-mozlenium-secrets
namespace: default
data:
SECRETSTUFF: dGhpc2lzbm90c2VjcmV0
Example ConfigMap for a mozlenium check:
# here is the check itself, stored in a configmap
# this block was generated with
# k create configmap check-test-1-cm --from-file=./demo-check.js
kind: ConfigMap
apiVersion: v1
metadata:
name: check-test-1-cm
namespace: default
data:
demo-check.js: |+
//demo check
require('mozlenium')();
var assert = require('assert');
var url = 'https://www.google.com'
console.log("starting check");
$browser.get(url);
console.log($secure.SECRETSTUFF);
console.log("well that went great");
Interacting with the operator:
$ kubectl get checks
NAME STATUS STATE ATTEMPT MAX_ATTEMPTS ESCALATION LAST_CHECK NEXT_CHECK AGE
check-test-1 OK IDLE 0 3 afrank@mozilla.com 2020-06-16 14:07:55 2020-06-16 14:08:55 2d20h
Getting detailed output from a check, including logs of the last run:
$ kubectl get check check-test-1 -oyaml
apiVersion: crd.k8s.afrank.local/v1
kind: Check
metadata:
name: check-test-1
namespace: default
spec:
check_cm: check-test-1-cm
check_interval: 1m
escalations:
- type: email
args:
email: afrank@mozilla.com
image: afrank/mozlenium
max_attempts: 3
notification_interval: 10m
retry_interval: 1m
secret_ref: check-test-1-mozlenium-secrets
status:
attempt: "0"
lastCheckTimestamp: "2020-06-16 14:11:14"
logs: |
//demo check
starting check
thisisnotsecret
well that went great
Check finished in 2 seconds with status code 0
nextCheckTimestamp: "2020-06-16 14:12:14"
state: IDLE
status: OK
Mozalert comes with a few checkers which are easy to use out of the box:
afrank/pinger
: Simply test if a public endpoint is available. The main argument in the spec ischeck_url
. This uses wget to perform the request.afrank/mozlenium
: Using New Relic style nodejs selenium test syntax this runs a synthetics check with Firefox Nightly.afrank/mozlenium-esr
: This is a version of mozlenium which uses Firefox-esr from Debian stable instead of nightly.afrank/mozlenium-chrome
: For those who must, a version of mozlenium with google chrome is also provided.
Using your own checker is easy. Just implement a docker container which performs some check and exits with either a failure status (2) or success status (0). The Pinger is a very simple example of how this works. To pass arguments to your custom checker's entrypoint, place them in order in your check manifest, in the spec.args
list. For example:
apiVersion: crd.k8s.afrank.local/v1
kind: Check
metadata:
name: check-test-1
namespace: default
spec:
check_interval: 1m
image: my/custom-check-image
args:
- arg1
- arg2
- arg3
Mozalert supports sending basic telemetry back from the checker and forwarding to the metrics subsystem. The supported metrics are:
total_time
: Total time of request (ms)latency
: Time (ms) for the response to start.
The mozlenium and pinger checkers included in mozalert support these metrics. If you want to add them to a custom checker, send these lines (for example) to STDOUT of the check:
TELEMETRY: total_time 1234
TELEMETRY: latency 1234
For example, if I wanted to use total_time
in a bash check, I could do:
#!/bin/bash
stime=$(date +%s%3N)
echo doing some checking
my_check()
etime=$(date +%s%3N)
((total_time=etime-stime))
echo Check finished
echo TELEMETRY: total_time $total_time
exit 0
And note these "TELEMETRY" lines are parsed by mozalert and filtered out of your log output, so you won't see them in the status subresource of your check object.
The entire stack is meant to run in Kubernetes but for development can be run locally or via docker.
pip3 install .
mozalert
docker build -t mozalert-controller .