shh ships with a large number of pollers, which will probably get you
pretty close to what you need.
The conntrack poller produces 1 metric, which represents the total
number of open connections as reported by
/proc/sys/net/netfilter/nf_conntrack_count.
The metric is emitted as: <prefix>.conntrack.count
shh's built-in CPU poller is based on the data found in /proc/stat.
The values expressed there are monotonically increasing, resetting
after every reboot. Since this information isn't very useful for
humans, it is converted into percentages by doing the following:
totalDifference = sum(forall current[measure]) - sum(forall last[measure])
this[measure] = (current[measure] - last[measure]) / totalDifference * 100
Where this is the reported measurement, current holds the values read
from /proc/stat now, and last holds the values read on the previous
poll. current replaces last for the next poll.
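The conversion can be sketched in Go roughly as follows. This is a minimal illustration of the formula only, not shh's actual implementation: the function name cpuPercentages and the map-based representation of a /proc/stat sample are assumptions made for the example.

```go
package main

import "fmt"

// cpuPercentages applies the formula above to two samples of the counters
// from a /proc/stat cpu line. The function name and the map layout are
// hypothetical; they are not taken from the shh source.
func cpuPercentages(last, current map[string]uint64) map[string]float64 {
	var lastTotal, currentTotal uint64
	for _, v := range last {
		lastTotal += v
	}
	for _, v := range current {
		currentTotal += v
	}
	// The counters only ever increase, so this difference is non-negative.
	totalDifference := float64(currentTotal - lastTotal)

	result := make(map[string]float64, len(current))
	for measure, v := range current {
		if totalDifference == 0 {
			result[measure] = 0 // no ticks elapsed; avoid dividing by zero
			continue
		}
		result[measure] = float64(v-last[measure]) / totalDifference * 100
	}
	return result
}

func main() {
	last := map[string]uint64{"user": 100, "system": 50, "idle": 850}
	current := map[string]uint64{"user": 130, "system": 60, "idle": 910}
	for measure, pct := range cpuPercentages(last, current) {
		fmt.Printf("cpu.%s %.1f\n", measure, pct)
	}
}
```

In this made-up sample 100 ticks elapsed between polls, 30 of them in user mode, so the user measurement comes out at 30 percent.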
shh emits the following CPU metrics (as percentages between 0-100),
for each CPU:
<prefix>.cpu.user
<prefix>.cpu.nice
<prefix>.cpu.system
<prefix>.cpu.idle
<prefix>.cpu.iowait
<prefix>.cpu.irq
<prefix>.cpu.softirq
<prefix>.cpu.steal
For definitions of these metrics see the section on /proc/stat in
man 5 proc.
The df poller decides which mount points to report disk usage on by
filesystem type. It is controlled by the environment variable
SHH_DF_TYPES, which should be provided as a comma-separated list of
filesystem types (as used in fstab). /proc/mounts is then read, and
mounts that use one of the configured filesystem types will report
the following metrics in bytes:
<prefix>.df.<mntpt>.total_bytes
<prefix>.df.<mntpt>.root.free.bytes
<prefix>.df.<mntpt>.user.free.bytes
<prefix>.df.<mntpt>.used.bytes
The poller will also include metrics about inodes:
<prefix>.df.<mntpt>.total.inodes
<prefix>.df.<mntpt>.free.inodes
And, if the environment variable SHH_PERCENTAGES includes df:
<prefix>.df.<mntpt>.used.perc
which is the percentage used on the filesystem (0-1).
The <mntpt> here is a massaged version of the actual mount point,
which substitutes _ for / to make the metric name more compatible
with graphing and collection tools.
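As a rough sketch of the selection and renaming described above, the following Go program filters text in /proc/mounts format by filesystem type and massages the mount point. The function names selectMounts and metricMountName are hypothetical, and treating the leading / like any other slash is an assumption; shh's own code may differ.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// selectMounts scans text in /proc/mounts format and returns the mount
// points whose filesystem type appears in the comma-separated types list
// (the format expected in SHH_DF_TYPES). Hypothetical helper.
func selectMounts(mounts, types string) []string {
	wanted := make(map[string]bool)
	for _, t := range strings.Split(types, ",") {
		if t = strings.TrimSpace(t); t != "" {
			wanted[t] = true
		}
	}
	var selected []string
	scanner := bufio.NewScanner(strings.NewReader(mounts))
	for scanner.Scan() {
		// Fields: device, mount point, filesystem type, options, dump, pass.
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 3 && wanted[fields[2]] {
			selected = append(selected, fields[1])
		}
	}
	return selected
}

// metricMountName substitutes _ for / in a mount point. Hypothetical helper.
func metricMountName(mountPoint string) string {
	return strings.ReplaceAll(mountPoint, "/", "_")
}

func main() {
	sample := "/dev/sda1 / ext4 rw,relatime 0 0\n" +
		"proc /proc proc rw,nosuid 0 0\n" +
		"/dev/sdb1 /var/log xfs rw 0 0\n"
	for _, m := range selectMounts(sample, "ext4,xfs") {
		fmt.Printf("<prefix>.df.%s.used.bytes\n", metricMountName(m))
	}
}
```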
Information about disk IO is collected by first reading from /proc to
get a list of partitions. Using this list, it gathers the information
needed from /sys to report disk IO metrics:
<prefix>.disk.<device>.read.requests
<prefix>.disk.<device>.read.merges
<prefix>.disk.<device>.read.bytes
<prefix>.disk.<device>.read.ticks
<prefix>.disk.<device>.write.requests
<prefix>.disk.<device>.write.merges
<prefix>.disk.<device>.write.bytes
<prefix>.disk.<device>.write.ticks
<prefix>.disk.<device>.in_flight.requests
<prefix>.disk.<device>.io.ticks
<prefix>.disk.<device>.queue.time
More information can be found in the block stat documentation.
The kernel provides information about the number of allocated file
structs (and given that there's 1 file per struct...), and the max
number it will allocate (which is customizable). This data is
presented to us through /proc/sys/fs/file-nr. The information is
presented in the following metrics:
<prefix>.filenr.alloc
<prefix>.filenr.free
<prefix>.filenr.max
See man 5 proc for details.
System load averages are given in the following metrics:
<prefix>.load.1m
<prefix>.load.5m
<prefix>.load.15m
In addition, /proc/loadavg reports the number of currently runnable
processes/threads, and the total number of processes/threads that are
available to be executed:
<prefix>.load.scheduling.entities.executing
<prefix>.load.scheduling.entities.total
For completeness, shh also exposes the process id of the last process
started by the system:
<prefix>.load.pid.last
The mem poller uses /proc/meminfo, which exposes a variable number of
measurements depending on the kernel version and configuration. By
default, shh will report a subset of these. Adding mem to SHH_FULL
will tell shh to report all of them. shh reports all measurements in
bytes. See man 5 proc for more information on available data.
The template for the emitted metrics is:
<prefix>.mem.<fixup-name>
where <fixup-name> is a lowercased version of the stat with '(' and ')'
replaced by '.', and '.' replaced by '_'.
In addition, if the environment variable SHH_PERCENTAGES includes mem
and/or swap:
<prefix>.memtotal.perc
<prefix>.swaptotal.perc
which represent the percentage of memory / swap in use (between 0-1).
shh can poll the nagios3stats program and report on a range of
metrics reported by that program. nagios3stats must be in the path
and executable, and the nagios config/stats files need to be in the
default locations.
By default this poller handles the following measurements:
<prefix>.nagios3stats.numservices
<prefix>.nagios3stats.numhosts
<prefix>.nagios3stats.avgactsvclat
<prefix>.nagios3stats.avgacthstlat
<prefix>.nagios3stats.numhstactchk5m
<prefix>.nagios3stats.numsvcactchk5m
<prefix>.nagios3stats.numhstactchk1m
<prefix>.nagios3stats.numsvcactchk1m
shh can report network interface status information as reported by
/proc/net/dev. To control which devices should be reported, use the
SHH_NIF_DEVICES environment variable, which should be a
comma-separated list of network interfaces.
<prefix>.nif.<device>.receive.bytes
<prefix>.nif.<device>.receive.packets
<prefix>.nif.<device>.receive.errors
<prefix>.nif.<device>.receive.dropped
<prefix>.nif.<device>.receive.errors.fifo
<prefix>.nif.<device>.receive.errors.frame
<prefix>.nif.<device>.receive.compressed
<prefix>.nif.<device>.receive.multicast
<prefix>.nif.<device>.transmit.bytes
<prefix>.nif.<device>.transmit.packets
<prefix>.nif.<device>.transmit.errors
<prefix>.nif.<device>.transmit.dropped
<prefix>.nif.<device>.transmit.errors.fifo
<prefix>.nif.<device>.transmit.errors.collisions
<prefix>.nif.<device>.transmit.errors.carrier
<prefix>.nif.<device>.transmit.compressed
The ntpdate poller runs the command ntpdate -q -u. It reports:
<prefix>.ntpdate.offset.<server>
<prefix>.ntpdate.delay.<server>
The <server> value is taken from the servers checked against, which
are configured via the SHH_NTPDATE_SERVERS environment variable. Each
server should be given as a fully-qualified domain name.
The processes poller submits measurements of the count of processes in
the various process states. It uses /proc/<pid>/stat to get this
information.
<prefix>.processes.running.count
<prefix>.processes.sleeping.count
<prefix>.processes.waiting.count
<prefix>.processes.zombie.count
<prefix>.processes.stopped.count
<prefix>.processes.paging.count
Additionally, the processes poller will match the process names found
in /proc/<pid>/stat against SHH_PROCESSES_REGEX, and if a name
matches it will report these additional measurements:
<prefix>.processes.<process name>.procs.count
<prefix>.processes.<process name>.threads.count
<prefix>.processes.<process name>.cpu.sys.seconds
<prefix>.processes.<process name>.cpu.user.seconds
<prefix>.processes.<process name>.mem.pagefaults.minor.count
<prefix>.processes.<process name>.mem.pagefaults.major.count
<prefix>.processes.<process name>.mem.rss.bytes
<prefix>.processes.<process name>.mem.stacksize.bytes
<prefix>.processes.<process name>.mem.virtual.bytes
<prefix>.processes.<process name>.io.read.bytes
<prefix>.processes.<process name>.io.write.bytes
<prefix>.processes.<process name>.io.read.ops
<prefix>.processes.<process name>.io.write.ops
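To illustrate the name matching, here is a small Go sketch that pulls the process name out of a /proc/<pid>/stat line and tests it against a regular expression. The function names are hypothetical, and whether shh anchors the expression or matches substrings is not stated here; this sketch simply uses whatever the pattern says.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// commFromStat extracts the process name from one /proc/<pid>/stat line.
// The name is the second field, wrapped in parentheses. It can itself
// contain spaces or parentheses, so we take everything between the first
// "(" and the last ")". Hypothetical helper.
func commFromStat(stat string) (string, bool) {
	start := strings.Index(stat, "(")
	end := strings.LastIndex(stat, ")")
	if start < 0 || end <= start {
		return "", false
	}
	return stat[start+1 : end], true
}

// matchesProcess reports whether the process name in stat matches pattern,
// which plays the role of SHH_PROCESSES_REGEX. Hypothetical helper.
func matchesProcess(pattern, stat string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	name, ok := commFromStat(stat)
	return ok && re.MatchString(name), nil
}

func main() {
	stat := "1234 (nginx) S 1 1234 1234 0 -1 4194560"
	ok, err := matchesProcess("^(nginx|redis-server)$", stat)
	fmt.Println(ok, err)
}
```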
The self poller provides metrics by introspecting itself. The Go programming language makes this rather trivial through the runtime package.
<prefix>.self.memstats.goroutines.num
<prefix>.self.memstats.general.alloc
<prefix>.self.memstats.general.alloc.bytes
<prefix>.self.memstats.heap.alloc.bytes
<prefix>.self.memstats.heap.inuse.bytes
If the environment variable SHH_FULL contains "self", it also
reports the following:
<prefix>.self.measurements.length
<prefix>.self.memstats.general.sys.bytes
<prefix>.self.memstats.general.pointer.lookups
<prefix>.self.memstats.general.mallocs
<prefix>.self.memstats.general.frees
<prefix>.self.memstats.heap.sys.bytes
<prefix>.self.memstats.heap.idle.bytes
<prefix>.self.memstats.heap.released.bytes
<prefix>.self.memstats.heap.objects
<prefix>.self.memstats.stack.inuse
<prefix>.self.memstats.stack.sys
<prefix>.self.memstats.mspan.inuse
<prefix>.self.memstats.mspan.sys
<prefix>.self.memstats.mcache.inuse
<prefix>.self.memstats.mcache.sys
<prefix>.self.memstats.buckhash.sys
<prefix>.self.memstats.gc.next
<prefix>.self.memstats.gc.pause
<prefix>.self.memstats.gc.num
The sockstat poller uses the socket statistics found in
/proc/net/sockstat and /proc/net/sockstat6. The collected metrics
can be controlled by the comma-separated list of protocols specified
in the environment variable SHH_SOCKSTAT_PROTOS.
<prefix>.sockstat.<protocol>.alloc
<prefix>.sockstat.<protocol>.inuse
<prefix>.sockstat.<protocol>.mem
<prefix>.sockstat.<protocol>.orphan
<prefix>.sockstat.<protocol>.tw
Splunk provides an API endpoint that serves an Atom feed describing
the state of its peers, including replication status and whether or
not they're actually up. This poller uses a single environment
variable, SHH_SPLUNK_PEERS_URL, which includes the authentication
information in the URL pointing to the API. This is usually something
like:
https://user:pass@localhost:8089/services/search/distributed/peers?count=-1
If the URL is an HTTPS endpoint, setting SHH_SPLUNK_PEERS_SKIP_VERIFY=true
may be required.
<prefix>.splunksearchpeers.down
<prefix>.splunksearchpeers.up
<prefix>.splunksearchpeers.replication.failure
<prefix>.splunksearchpeers.replication.success
<prefix>.splunksearchpeers.total
The folsom poller uses an API exposed on a running Erlang node to fetch various
Erlang VM metrics. It understands the API provided by either folsom_cowboy
or folsom_webmachine. This poller uses a single environment variable,
SHH_FOLSOM_BASE_URL, which should contain the base URL pointing to the API.
This is usually something like:
https://localhost:5565/
The poller will use the base URL to construct calls to a few known endpoints.
It will also query the special <BASE_URL>/_metrics endpoint to discover and
query any dynamically generated folsom metrics.
<prefix>.folsom.mem.total
<prefix>.folsom.mem.procs.total
<prefix>.folsom.mem.procs.used
<prefix>.folsom.mem.system
<prefix>.folsom.mem.atom.total
<prefix>.folsom.mem.atom.used
<prefix>.folsom.mem.binary
<prefix>.folsom.mem.code
<prefix>.folsom.mem.ets
<prefix>.folsom.stats.context-switches
<prefix>.folsom.stats.gc.num
<prefix>.folsom.stats.gc.reclaimed
<prefix>.folsom.stats.io.input
<prefix>.folsom.stats.io.output
<prefix>.folsom.stats.reductions
<prefix>.folsom.stats.run-queue
<prefix>.folsom.stats.runtime
<prefix>.folsom.stats.wall-clock
<prefix>.folsom.ets.<ets-tab-name>.memory
<prefix>.folsom.ets.<ets-tab-name>.size
<prefix>.folsom.<dynamic-metric-name>
The Redis poller will run the INFO command at every interval and record
stats as specified within SHH_REDIS_INFO, which is specified in the
following way:
section0:key0,key1;section1:key0,key1
By default the setting is:
clients:connected_clients;memory:used_memory,used_memory_rss;stats:instantaneous_ops_per_sec;keyspace:db0.keys
This will then report:
<prefix>.clients.connected-clients
<prefix>.memory.used-memory
<prefix>.memory.used-memory-rss
<prefix>.stats.instantaneous-ops-per-sec
<prefix>.keyspace.db0.keys
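The mapping from the setting to metric names can be sketched as below. This is an illustration under assumptions: the function name redisMetricNames is hypothetical, and the rule that underscores in a key become hyphens is inferred from the default setting and the names listed above, not taken from the shh source.

```go
package main

import (
	"fmt"
	"strings"
)

// redisMetricNames expands a spec in the
// "section0:key0,key1;section1:key0,key1" format into metric name suffixes
// of the form "<section>.<key>". Replacing "_" with "-" in keys is an
// inference from the documented default output. Hypothetical helper.
func redisMetricNames(spec string) []string {
	var names []string
	for _, section := range strings.Split(spec, ";") {
		parts := strings.SplitN(section, ":", 2)
		if len(parts) != 2 {
			continue // skip sections without a "name:keys" shape
		}
		for _, key := range strings.Split(parts[1], ",") {
			if key == "" {
				continue
			}
			names = append(names, parts[0]+"."+strings.ReplaceAll(key, "_", "-"))
		}
	}
	return names
}

func main() {
	spec := "clients:connected_clients;memory:used_memory,used_memory_rss;" +
		"stats:instantaneous_ops_per_sec;keyspace:db0.keys"
	for _, name := range redisMetricNames(spec) {
		fmt.Println("<prefix>." + name)
	}
}
```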
The Cgroup poller retrieves resource usage metrics for a list of cgroups. It fetches information for the processes in each group from two cgroup controllers: cpuacct and memory.
From cpuacct, it fetches user and system CPU time used by the group, expressed as a percentage of a single core. For example, a value of 125 would mean that the processes in the group used CPU time equivalent to one and one quarter CPU cores for the entire polling interval.
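The arithmetic behind that percentage is simple enough to show directly. The sketch below assumes the CPU time consumed during the interval has already been converted to seconds; the function name corePercent is hypothetical and the conversion from the controller's raw counters is left out.

```go
package main

import "fmt"

// corePercent expresses CPU time used during a polling interval as a
// percentage of a single core. Both arguments are in seconds.
// Hypothetical helper, not shh's actual code.
func corePercent(cpuSeconds, intervalSeconds float64) float64 {
	if intervalSeconds <= 0 {
		return 0 // no meaningful interval; avoid dividing by zero
	}
	return cpuSeconds / intervalSeconds * 100
}

func main() {
	// 12.5 seconds of CPU time over a 10 second interval is one and one
	// quarter cores.
	fmt.Println(corePercent(12.5, 10))
}
```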
From memory, it fetches the MAXIMUM memory usage of the group over the entire interval (the high-water mark). It attempts to request that the kernel reset this value after each poll by writing a zero to the memory.max_memory_usage file in the group's directory. This will only work if the SHH process has permissions to the file. You can simply chown/chgrp the file appropriately to grant that access. Failing to do this will result in this poller reporting only the historical maximum memory usage irrespective of the interval.
Three memory high-water marks are reported: "user", "kernel", and "tcp". "user" is standard memory usage (RSS). "Kernel" is kernel buffers and the like required by system calls made by processes in the group. "tcp" is just kernel memory required for TCP sessions. "Kernel" and "tcp" will always report 0 unless a respective memory limit for the cgroup has been set, due to limitations in the memory cgroup controller.
Metrics produced:
<prefix>.cgroup.<cgroup name>.cpu.user
<prefix>.cgroup.<cgroup name>.cpu.system
<prefix>.cgroup.<cgroup name>.mem.user
<prefix>.cgroup.<cgroup name>.mem.kernel
<prefix>.cgroup.<cgroup name>.mem.kernel.tcp
These 5 metrics will be emitted for each cgroup in SHH_CGROUPS. If a given cgroup does not exist in either controller, it will be silently ignored and the metrics for that controller/cgroup will not be emitted.
shh is written in the Go programming language, which doesn't support
dynamic linking. This makes building a plugin system fairly difficult,
but of course simplifies other aspects of the software lifecycle, most
notably deployment.
However, there are 2 mechanisms which can be used to create your
own poller. The first involves using a builtin poller called "listen";
the other involves modifying the shh source code.
shh provides a facility for external processes to emit stats via a
socket. If you include the "listen" poller in SHH_POLLERS, shh will
create a listening socket at the address described by SHH_LISTEN
(defaults to a unix socket called #shh in the CWD). If you set
SHH_LISTEN_TIMEOUT to a duration (defaults to the value of
SHH_INTERVAL), the socket will close if the timeout duration passes
without receiving any data.
Data is then communicated in the following format:
<RFC3339 date stamp> <what> <value>\n
<what> is the metric name, and the interpretation of <value> is
somewhat arbitrary: the poller will create a FloatGauge if the value
parses as a floating point number, and a counter otherwise.
The metrics will be emitted as:
`<prefix>.listen.stats.<what>`
The listen poller also emits metrics about itself:
<prefix>.listen._meta_.connection.count
<prefix>.listen._meta_.parse.error.count
<prefix>.listen._meta_.metric.count
The environment variable SHH_LISTEN is a comma-separated value with 2 fields. The first field is the socket type (e.g. tcp, tcp4, tcp6, unix, unixpacket) and the second field is an appropriate address for that type, as specified by Go's networking libraries.
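A client for the listen poller might look roughly like the following Go sketch. It splits an SHH_LISTEN-style value into a socket type and address, then writes one line in the format described above. The helper names, the tcp address 127.0.0.1:8000, and the metric name queue.depth are all made up for the example; they are not defaults of shh.

```go
package main

import (
	"fmt"
	"net"
	"strings"
	"time"
)

// splitListen splits an SHH_LISTEN-style value such as "unix,#shh" into the
// socket type and the address. Hypothetical helper.
func splitListen(listen string) (network, address string, err error) {
	parts := strings.SplitN(listen, ",", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("expected <type>,<address>, got %q", listen)
	}
	return parts[0], parts[1], nil
}

// formatMeasurement renders one "<RFC3339 date stamp> <what> <value>\n"
// line. The timestamp is passed as Unix seconds and rendered in UTC.
// Hypothetical helper.
func formatMeasurement(unixSeconds int64, what string, value float64) string {
	stamp := time.Unix(unixSeconds, 0).UTC().Format(time.RFC3339)
	return fmt.Sprintf("%s %s %g\n", stamp, what, value)
}

func main() {
	// Example value only; use whatever SHH_LISTEN is set to for shh.
	network, address, err := splitListen("tcp,127.0.0.1:8000")
	if err != nil {
		fmt.Println("bad listen value:", err)
		return
	}
	conn, err := net.Dial(network, address)
	if err != nil {
		fmt.Println("could not connect:", err)
		return
	}
	defer conn.Close()
	fmt.Fprint(conn, formatMeasurement(time.Now().Unix(), "queue.depth", 0.75))
}
```

With the format described above, a line sent this way would be expected to surface as <prefix>.listen.stats.queue.depth.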
Making a first-class poller is fairly simple. In this section we'll develop a poller that, on every tick, outputs a counter value equal to 1.
Pollers should be given a simple but descriptive name. For our
example, we'll just call it one. We'll put the poller in
one_poller.go.
Though it'd be possible to make libraries of pollers and link them in
at compile time, the current strategy is to just put them all in
package main. We import time because the Poll method we'll
implement shortly takes one argument of type time.Time.
package main

import (
	"time"
)
Pollers are just an interface that we need to implement. This block of code implements our simple constant one counter:
type One struct {
	measurements chan<- Measurement
}

func NewOnePoller(measurements chan<- Measurement) One {
	return One{measurements}
}

// Called on every tick from the main loop
func (poller One) Poll(tick time.Time) {
	poller.measurements <- CounterMeasurement{tick, poller.Name(), []string{"one"}, 1}
}

func (poller One) Name() string {
	return "one"
}

// A finalizer of sorts
func (poller One) Exit() {}
Just implementing the Poller, however, doesn't automatically make it
available. Next, we must add our new Poller to pollers.go so that
we can make it available to the MultiPoller, which is how shh
internally collects metrics across many pollers at once.
Somewhere in the switch statement add the following:
case "one":
	mp.RegisterPoller(NewOnePoller(measurements))
Adding "one" to the list of pollers in the environment variable
SHH_POLLERS will create a one poller, and start calling its Poll
method for each tick.