Find file
Fetching contributors…
Cannot retrieve contributors at this time
915 lines (757 sloc) 32.6 KB
NagCat Configuration
The configuration file that defines tests uses the coil format. This
document assumes a working knowledge of coil but should still make sense
to someone new to the format. A basic test may look something like this:
foo-is-ok: {
query: {
type: "http"
path: "/status"
filters: [ "regex:^OK$" ]
documentation: "Check that foo reports its status as OK"
investigation: [ "1. Check foo is up and running."
"2. Check foo can access database."
"3. Stumped? Escalate to next tier." ]
This defines a test named 'foo-is-ok'. This name will be used to
reference this test in the Nagios service definition or in NagCat's
command line arguments if trying things out in stand-alone mode.
The most important section is the 'query' section. It defines the
request to send to the service as well as any processing that needs to
be performed on the data received.
To be friendly to the people who may have to respond to alerts this test
generated documentation can be included inside the reports. The
'documentation' attribute will always be included in the report and
should note what exactly the results of the test mean. The
'investigation' attribute will only be included if the test status is
WARNING, CRITICAL, or UNKNOWN and should include some basic info on how
to diagnose and possibly fix the issue.
When this test runs it will do the following:
1. Send a GET request to "http://<hostname>:<port>/status" where
<hostname> and <port> are the host and port number defined by
either Nagios or command line arguments. If this request fails
abort the test and report CRITICAL.
2. Filter the returned data, passing along only the first match of
the regular expression. If it fails to match the expression abort
the test and report CRITICAL.
3. If nothing has failed so far then we must be good! Report OK.
Compound Queries and Thresholds
Some more complex tests may need to get multiple pieces of data and
compare them, there is no limit to the number of sub-queries in a compound
query. For example checking that a system is using today's data set or
that a client is in sync with its master. This can be accomplished using
the compound query which allows you define two sub-queries within one
query section. Note that only one level is allowed, a sub-query cannot
itself be another compound query. Here is an example:
foo-client-in-sync: {
query: {
type: "compound"
master: {
type: "http"
host: "master-foo"
port: 8080
path: "/status/sequence_number"
filters: [ "regex:\d+" ]
client: {
type: "http"
path: "/status/sequence_number"
filters: [ "regex:\d+" ]
return: "$(master) - $(client)"
warning: ">100"
critical: ">200"
documentation: "Client must not differ by more than 100"
In this example we fetch a sequence number from both the master and
client, presumably representing the position of a data stream that the
master is sending out. The compound query itself has only two special
attributes: 'type' and 'return'. 'type' declares that it is a compound
query and 'return' defines how to combine the results, usually in some
sort of math expression.
The 'warning' and 'critical' attributes define a comparison with the
returned data to check, causing the test to report WARNING or CRITICAL
respectively. If neither comparison is true then it will report OK.
If a test requires a configuration or substitution that has no logical default
then you will want to specify it in nagios, where the remainder of the host,
port and other deployment specific configuration lives. In order to do this
you simply specify a parameter, preceeded by an underscore, that matches the
full coil path to the parameter you wish to override.
So, to replace foo-client-in-sync with a test that has no default port for the
master, a variable URI for the client, and a variable regex filter on the
client return value you might create the following nagcat test:
foo-client-in-sync: {
query: {
type: "compound"
master: {
type: "http"
host: "themasterdc7"
port: ""
path: "/status/sequence_number"
filters: [ "regex:\d+" ]
client: {
type: "http"
path: ""
interestingvalue: ""
filters: [ "regex:\d+" "regex:${interestingvalue}" ]
return: "master - client"
warning: ">100"
critical: ">200"
Then in Nagios you would need to include these parameters in the test
define service {
host_name someclientdc9
# other nagios options...
_test foo-client-in-sync
_port 9019 themasterdc9
_query.master.port 1234
_query.client.path /some/path
_query.client.interestingvalue foo.*bar
Format Details
Below is a full specification of the file format. All top level items in
the file are expected to be test definitions, each with a unique name.
Any structure attribute that is not defined here may be used for
anything which is useful for coil structure inheritance or in string
Test names may be any valid coil identifier. The only attribute that is
strictly required is 'query' but 'documentation' and 'investigation'
should normally be included.
test-name: {
query: {
type: "<query type>"
# All query structures must have a 'type' attribute. The
# interpretation of the rest of the structure depends on the
# type. See the Queries section for the details.
# 'host' is an optional attribute to define the host this test runs
# against. It provides the default for queries. When running under
# Nagios this defaults to Nagios' host_name attribute.
host: "somehost"
# 'addr' is an option attribute to define the host's IP address.
# When running under nagios it defaults to Nagios' address
# attribute. If that isn't available it is set via DNS.
addr: ""
# 'port' is usually required and provides the default port for any
# query types that need it (http, tcp, snmp, etc).
port: 1234
# 'description' usually can be ignored, it is set based on Nagios'
# service_description attribute. Currently it is only used by the
# RRDTool trending setup to determine the file name to use.
description: "Service Alvie"
# 'label' is a optional user friendly name to describe the final
# value. It is used in some errors and rrdtool trend graphs.
label: "% Disk Used"
# 'repeat' defines how frequently to run the test. The value can be
# a number in seconds or a string of the format "1s" or "1 second"
# or "2 seconds", which ever you prefer. In addition to seconds
# possible units are minutes, hours, days, and weeks. For this value
# you probably only need seconds and minutes but other time values
# use the same format. The default value is 1 minute.
repeat: "1m"
# Filters are strings defining some manipulation to the data
# returned, they are run in the order listed.
# Valid filters are:
# "regex:<some regular expression>"
# If the regular expression includes a grouping ie: (foo) the
# result will be the first grouping, otherwise it is the match
# of the entire expression.
# "grep:<some regular expression>"
# This is a variation on the regex filter but instead of
# returning the first match it returns all lines that contain
# a match, just like the tool we know and love.
# "grepv:<some regular expression>"
# Inverse grep, named after the "grep -v" command.
# "lines"
# This filter simply counts the number of lines in the data
# similar to wc -l. Note that unlike wc -l this filter always
# behaves as if there is a terminating newline. Only an empty
# string counts as zero lines. This is because internally
# nagcat generally doesn't bother with adding terminating
# newlines like Unix commands do.
# "bytes"
# This filter simply counts the number of bytes in the data
# like wc -c. Unlike the lines filter this one doesn't bother
# with adding a terminating newline since it isn't needed.
# "date2epoch:<date format>"
# Convert a date using the given format to a time in seconds
# since the epoch, useful for doing math with $(NOW).
# "html"
# Convert an HTML document into a (hopefully) valid XML
# document that can be use with xpath and other XML filters.
# "xpath:<some xpath expression>"
# If you are parsing XML, using XPath expressions are the way
# to go. Some quick examples:
# Get the HTML title: /html/head/title/text()
# Get a div with a specific id: //div[@id="something"]
# More info here:
# "xslt:<path or xml>"
# Transform an XML document using XSLT. The argument here can
# be either an absolute path to the XSLT file or the XSLT XML
# document itself. Remember, use '''style quotes''' for
# multi-line strings.
# More info here:
# "table:<row>,<column>"
# Select a specific cell, entire row, or entire column from a
# structured table of data. The filter will attempt to detect
# if the data is formatted as CSV or similar format. Even a
# Unix /etc/passwd file format works. Row and column are
# specified as integers and start at 0. If row 0 contains
# column headers then the column can be specified as the name
# as it appears in row 0 rather than it's index. If column 0
# contains non-numeric keys that identify rows then the row
# can be specified as the name as it appears in column 0. If
# column is empty the whole row is reproduced, separated by
# the format's delimiter. If the row is empty the whole
# column is reproduced, separated by newlines.
# Examples:
# "table:0,3" "table:1,foo" "table:foo,bar" "table:1," "table:,3"
# "save:<some name>"
# Saved the current result as the given name and will be
# included with that name in the 'Extra Output' section of the
# report.
# "warning:<operator> <value>"
# "critical:<operator> <value>"
# The warning and critical filters simply test the data and
# raise a Warning or Critical alert if the test is true.
# Normally you will define these with the separate warning
# and critical attributes rather than in the filter chain for
# clarity but using them as filters allows more complex tests.
# See the description of warning/critical below for more info.
# "ok:<operator> <value>"
# This is similar to the warning/critical filters except it
# forces the test to the OK state. This is useful if further
# filters and warning/critical checks should not be considered
# for certain values. For example if a query fetched a null
# value that is ok but the null would break the next filter.
# See the description of warning/critical below for syntax.
filters: [ "somefilter:arguments" ]
# 'warning' and 'critical' will compare the data returned by the
# query with a given value. If it evaluates to true the test will
# report WARNING or CRITICAL respectively, with CRITICAL winning out
# over WARNING if both are true.
# Relational operators (< <= >= >) only work with numeric values.
# Equality operators (= == != <>) will compare the values as numbers
# if possible, otherwise it will compare them as strings.
# = is an alias for == and <> is an alias for !=
# Regular expressions can be checked with =~ and !~ and will check
# that the query matches or doesn't match respectively. The regular
# expression is matched in multiline mode and <value> is the raw
# pattern. Currently the match is case sensitive and you cannot
# specifiy flags. For example to go critical unless the exact
# string "ok" is returned you could write: critical: "!~ ^ok$"
warning: "<operator> <value>"
critical: "<operator> <value>"
# 'warning_time_limit' provides a time limit on how long the test
# may remain in a warning state. After the time limit expires the
# test will become critical. This is useful for issues that are only
# a big deal if they persist for a while. The format is the same as
# the 'repeat' value and the default is 0 (disabled).
warning_time_limit: "30 minutes"
# The 'trend' block defines how to record and graph the data over
# time using rrdtool. By the final state is always recorded but if
# the data is numeric this can be used to record it as well.
trend: {
# 'type' defines the type of data. It can be any of the normal
# rrdtool data source types: 'gauge' records a plain value, such
# as temperature 'counter' records the change in a counter,
# resets are auto-detected. 'derive' also records the change in
# the value, but can be negative. 'absolute' is a special
# counter that resets when read. If the 'type' attribute is not
# given then only the final state is recorded.
# Note: for the type 'counter' rrdtool tries to watch for 32bit
# and 64bit overflows. If you are monitoring a counter that does
# not do this (typical in Python applications) then you should
# use the type 'derive' and set min to 0.
type: 'gauge'
# 'min' and 'max' set the minimum and maximum allowable value.
# Anything outside of the expected range is dropped. Both values
# default to None which means there is no limit.
min: 0
max: None
# 'label' will appear in the graph legend to identify this
# data. It will default to the test's label or 'Result'.
label: "Awesome Data"
# 'display' defines how to display the data, can be 'line'
# or 'area'. Use 'area' if you want the various data souces
# to stack on top of each other. Defaults to 'line'.
display: "line"
# 'stack' defines whether to stack this value on top of
# the previous one (in the order they appear in the coil
# config). Useful in compound queries that represent
# different kinds of the same sort of value (ie memory).
# Should be True or False, defaults to False.
stack: False
# 'color' defines a RGB color to use for the line or area.
# A color will be chosen if not given.
color: "#000000"
# 'scale' defines what scale the returned data is. This can
# be used to ensure the scale is correct if a query returns
# a size in KB instead of Bytes. In this case set scale to
# 1024. Defaults to 0.
scale: 1024
# 'base' defines what base the data is. This also helps ensure
# the scale is correct. If a query is getting an amount of
# memory set this to 1024 so one 'K' is 1024, one M is 1024K,
# etc. The default is 1000 so on 'K' is 1000, one M is 1000K.
base: 1024
# 'title' defines a title to place at the top of the graph.
title: "somehost - Pretty Graph"
# axis_min/axis_max: default minimum and maximum scale for the
# y axis. If any data is outside of this range the scale will
# be expanded to include it. axis_min defaults to 0
axis_min: 0
axis_max: 99
# axis_label provides a label for the y axis. This will
# typically be the units of the data such as 'bytes'.
# The default is no label
axis_label: "bytes"
# 'documentation' and 'investigation' are used only for providing
# in-line information in the generated reports. They can be either a
# single string or a list of strings to represent multiple lines.
documentation: "What does this test actually test?"
# 'investigation' is only displayed when the status is not OK
investigation: [ "The problem may be caused by this."
"Or it may be caused by that." ]
# 'url' provides links to additional documentation.
url: "http://something/amazing.html"
# 'priority' is an arbitrary string that can be used to indicate
# how urgent any reported problems are.
priority: "FIXMENOW"
query: {
type: "compound"
# At least one sub-query must be defined and may be any type except
# for compound.
# Sub-queries may have any name other than 'type' or 'return'
sub-query-a: { }
# All compound queries must define a return statement.
# Values of the sub-queries are in the format $(query-name)
# Values operate both as strings and numbers depending on which
# operator is used. The following operators expect a numeric
# value and will cause the alert to go CRITICAL if it is not.
# + - * / // % divmod() pow() ** neg() pos() abs()
# < <= => >
# The == and != operators will do a numeric comparison if the two
# values happen to be numbers, otherwise it will compare them as
# strings.
# By default there is a variable called $(NOW) which is the current
# time in seconds since the epoch. Useful in combination with the
# date2epoch filter in a subquery.
return: "<some expression>"
All other queries have the following in common:
query: {
type: "???"
# Override the host being tested
host: "<host name>"
# Override the port
port: <port number>
# Filters and warning/critical tests can be defined here just as
# they are inside a test but will apply to only this query.
filters: [ "somefilter:arguments" ]
warning: "<operator> <value>"
critical: "<operator> <value>"
# 'trend' is the same as the trend block in the main test section
# but will graph this specific query's data. All the same
# attributes are supported except for axis_min, axis_max, axis_label.
trend: { type: "gauge" }
# 'repeat' may be defined here to override the repeat value for the
# entire test. This is useful for having one sub-query in a compound
# query run less frequently than the rest if it is an expensive
# operation or its value only changes at a specific time of day.
# The default is the parent test's repeat value.
repeat: "1m"
# 'timeout' sets how long the query is allowed to take. If the query
# takes longer than the timeout it will be aborted and raise a
# critical error. The default value is 15 seconds.
timeout: "15s"
query: {
type: "http"
# set type to "https" for SSL connections
# Path to request with either a GET or a POST
path: "/some/url/path"
# Method to use. Defaults to GET unless data: is set, in which case
# the default is POST. Any HTTP method is allowed but GET/POST are
# probably the only ones that will prove to be useful currently.
method: "GET"
# Data to POST.
data: "some arbitrary data"
# Extra request headers to send, must be a structure with
# header-name: "value" pairs.
headers: { Accept: "text/plain" }
# Use HTTP Basic authentication (adds an Authorization header)
username: "user"
password: "pass"
# When using HTTPS you can provide paths to a client key and
# certificate. You can also provide a path to a CA cert to
# verify the other side's certificate against.
ssl_key: "/path/to/foo.key"
ssl_cert: "/path/to/foo.cert"
ssl_cacert: "path/to/ca.cert"
# If a key or cert is in ASN1 format instead of PEM use _type.
# Your files are probably in PEM format so you can leave this out.
ssl_key_type: "ASN1
ssl_cert_type: "ASN1"
ssl_cacert_type: "PEM"
This is an variation on HTTP/HTTPS queries for making XMLRPC calls.
query: {
type: "xmlrpc"
# set type to "xmlrpcs" for SSL connections
# Path to POST to (/RPC2 is the default)
path: "/RPC2"
# Method name to call. (required)
method: "func_name"
# Arguments to the method. May be a single value, a list of values
# for multiple arguments, or a struct if the first argument is a
# structure. Due to current limitations of coil structs it is only
# possible to submit a struct as the first and only parameter. Also,
# key names are limited by coil's requirements. If this doesn't work
# for you submit the raw XML via a http query instead.
params: 1
#params: [1 2]
#params: {this: 1}
# Alias for params
parameters: 1
# May be "xml" or "value" which sets whether the result should be
# the raw XML response or the parsed value. XML may be useful for
# complex data structures which can be processed with xpath and
# similar filters, the parsed value (the default) is best for single
# values such as an integer or string.
result: "value"
# Extra request headers to send, must be a structure with
# header-name: "value" pairs.
headers: { Host: "" }
# Use HTTP Basic authentication (adds an Authorization header)
username: "user"
password: "pass"
# xmlrpcs queries support the same ssl_ options as normal HTTPS,
# see the HTTPS section above for more info.
ssl_key: "/path/to/foo.key"
ssl_cert: "/path/to/foo.cert"
ssl_cacert: "path/to/ca.cert"
#ssl_key_type: "PEM
#ssl_cert_type: "PEM"
#ssl_cacert_type: "PEM"
query: {
type: "tcp"
# set type to "ssl" for SSL connections
# Data to send. Be sure to include any newlines, etc.
data: "some arbitrary data"
# Raw SSL sockets support the same ssl_ options as HTTPS,
# see the HTTPS section above for more info.
ssl_key: "/path/to/foo.key"
ssl_cert: "/path/to/foo.cert"
ssl_cacert: "path/to/ca.cert"
#ssl_key_type: "PEM
#ssl_cert_type: "PEM"
#ssl_cacert_type: "PEM"
Run a command. An exit status of 0 is OK, anything else is Critical.
query: {
type: "subprocess"
# Command to execute, can be a shell pipeline or whatever.
command: "something to run"
# Data to write to stdin
data: "some arbitrary data"
# Optionally you may add various environment variables.
environment: {
PATH: "/special/path:/usr/bin"
A variation of "subprocess" is "nagios_plugin" which interprets the exit
code the same way as Nagios would where 0 is OK, 1 is Warning, 2 is
Critical, and 3 and above is Unknown. Note that unlike real Nagios
plugins nagcat does not define the standard macros as environment
variables. If your plugin needs any set them manually in the
environment block. This may change in the future.
query: {
type: "nagios_plugin"
# These are the exact same as for "subprocess" queries
command: "something"
data: "something"
environment: {}
SNMP: (versions 1 and 2 only currently)
There are two variations on SNMP requests, the first shown here fetches
a single value from a given OID. The second variation is for fetching
values from a table. Table lookups work by providing one oid for the
root of the value you want to get (oid_base) and the oid for the root of
the labels for those values (oid_key). The attribute 'key' defines which
label in the oid_key list you want and the query will return the
corresponding value from oid_base.
query: {
type: "snmp"
# community name, required
community: "public"
# port number, defaults to 161
port: "161"
# tcp or udp, defaults to udp
protocol: "udp"
# SNMP version, 1 or 2c, defaults to 2c
version: "2c"
# OID number, required for single oid requests
# Both the textual name and the raw numbers work.
oid: "." # SNMPv2-MIB::sysDescr.0
# Instead of the above oid attribute use the following for tables:
oid_base: "HOST-RESOURCES-MIB::hrStorageSize"
oid_key: "HOST-RESOURCES-MIB::hrStorageDescr"
# Fetch root's storage size
key: "/"
# With either the oid or oid_base/oid_key you can provide
# oid_scale to fetch the units of the value and scale it
# appropriately. This is needed for hrStorage* because it
# returns sizes in anything from 1024 to 65536 Bytes.
oid_scale: "HOST-RESOURCES-MIB::hrStorageAllocationUnits"
For the purposes of testing system time we can do basic NTP queries.
The result is seconds since the Unix epoch. Unless your test is for
the NTP server itself you will likely want to override the host
attribute with your favorite server. See the system_time test in
example.coil for an example on how to use this with SNMP.
query: {
type: "ntp"
host: ""
port: 123
Oracle SQL:
For running Oracle SQL queries (mainly SELECT). Note that the type is
oracle_sql but the alias oraclesql also works for compatibility with
older versions of Nagcat.
Results are represented as XML for easy processing using XPath filters.
For a table that was created with the columns (a number, b varchar2(10))
the format of a select will look something like this:
<a type="NUMBER">1</a>
<b type="STRING">aaa</b>
<a type="NUMBER">2</a>
<b type="STRING">bbb</b>
query: {
type: "oracle_sql"
user: "username"
password: "password"
# dsn defines the database to connect to, in the simple case it is
# a host name and database name but if tnsnames.ora is in use then
# any identifier in it can be used.
dsn: "dbhost/dbname"
# Although in theory you *could* put statements other than SELECT
# such as UPDATE or INSERT I have no idea why you would want to.
sql: "SELECT :bind AS value FROM dual"
# The above query has a bind value named 'bind', we can specify it
# with the attribute binds as a list or struct:
binds: [ 1 ]
binds: { bind: 1 }
# The attribute 'parameters' is an alias to binds for the sake of
# semi-consistency with the plsql query type.
parameters: [ 1 ]
Oracle PL/SQL:
It is also possible to run PL/SQL procedures. This query can only
execute a single procedure, not arbitrary PL/SQL code.
query: {
type: "oracle_plsql"
user: "username"
password: "password"
dsn: "dbhost/dbname"
# Name of the procedure, which is probably in a package.
procedure: "monitor_package.procedure_name"
# Any number of input and output parameters can be passed to the
# procedure. To be useful the procedure will need to define at least
# one output parameter that gives us data to work with. Each
# parameter is defined with a three item list:
# [ direction name value/type ]
# where direction is 'in' or 'out, name is the parameter name, and
# value/type is the input value for 'in' or the data type to expect.
parameters: [
['in' 'this' 1]
['out' 'that' 'number']
['out' 'list' 'cursor']
The output XML is similar to the oracle_sql output:
<that type="NUMBER">2</that>
<a type="STRING">foo</a>
It is possible to query host and service status directly from nagios.
This can be useful for checking how long a test has been in a particular
state among other things.
query: {
type: "host_status"
# The host can be overridden like any query, the default is whatever
# host the test is configured for.
host: "somehost"
# Pull a specific attribute out of the host status, if this is not
# set then the whole status block is returned as XML.
attribute: "plugin_output"
query: {
type: "service_status"
# The service description from nagios, defaults to the description
# for this particular test.
description: "Some Service"
# Same as with host_status
host: "somehost"
attribute: "plugin_output"
When attribute is not set the XML is simply a direct translation of the
data in Nagios' status file, for example:
<service_description>Some Service</service_description>
<plugin_output>service_alive: OK</plugin_output>
Host is similar, just wrapped in <host> instead.
If the result of a test depends on the previous value (such as an alert
that goes off if a counter is reset such as uptime) that previous value
can be fetched from the RRDTool file. It is also possible for one test
to query the latest status of an entirely different test.
Currently only the last update can be fetched, more query types may be
added in the future.
query: {
type: "rrd_lastupdate"
# The service description from nagios, defaults to the description
# for this particular test so it is only required if you are
# querying the status of something else.
description: "Service Alive"
# The host can also be overridden for querying other test results.
# It defaults to the host this test is configured for.
host: "somehost"
# Return only a single data source rather than the full XML report.
# This is useful for the common case where only one value is needed.
source: "subquery"
# If you want to raise an error if the last update is too old. This
# defaults to 0 meaning don't perform the check.
freshness: "30 minutes"
Without the source attribute the result is a a blob of XML that is a
subset of the rrdtool dump xml format and looks something like this:
The names _state and _result are special names. _state is the final
state of the test. 0 is OK, 1 is WARNING, 2 is CRITICAL, and 3 is
UNKNOWN. _result is the value recorded if a trend block is in the main
test block. Any other name will be the name of a query block that is
trending. If you only want one value specify the source attribute to
skip the full XML. Unknown values are represented as empty strings.