Prometheus exporter that mines /proc to report on selected processes.
Some apps are impractical to instrument directly, either because you don't control the code or they're written in a language that isn't easy to instrument with Prometheus. We must instead resort to mining /proc.
Either grab a package for your OS from the Releases page, or install via docker.
Usage:
process-exporter [options] -config.path filename.yml
or via docker:
docker run -d --rm -p 9256:9256 --privileged -v /proc:/host/proc -v `pwd`:/config ncabatoff/process-exporter --procfs /host/proc -config.path /config/filename.yml
Important options (run process-exporter --help for full list):
-children (default:true) makes any process that doesn't match a group of its own part of the first group found (if any) when walking up the process tree. In other words, resource usage of subprocesses is added to their parent's usage unless the subprocess matches a different group name.
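A rough sketch of the upward walk that -children performs. It assumes two hypothetical in-memory snapshots: each PID's parent PID, and the group of each PID that matched a config entry directly. process-exporter's actual implementation may be organized differently.

```go
package main

import "fmt"

// groupFor returns the group a process is attributed to when -children is
// enabled: its own group if it matched one, otherwise the group of the
// nearest ancestor that matched. ok is false if no ancestor matched.
// parent and matched are hypothetical snapshots keyed by PID.
func groupFor(pid int, parent map[int]int, matched map[int]string) (group string, ok bool) {
	seen := map[int]bool{}
	for pid > 0 && !seen[pid] {
		if g, found := matched[pid]; found {
			return g, true
		}
		seen[pid] = true // guard against cycles in a stale snapshot
		pid = parent[pid] // missing parent yields 0, which ends the walk
	}
	return "", false
}

func main() {
	// 1 (init) -> 100 (bash, matched) -> 200 (grep, unmatched); 300 is unrelated.
	parent := map[int]int{100: 1, 200: 100, 300: 1}
	matched := map[int]string{100: "bash"}
	fmt.Println(groupFor(200, parent, matched)) // bash true
	fmt.Println(groupFor(300, parent, matched)) // no group: ok is false
}
```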
-threads (default:false) means that metrics will be broken down by thread name as well as group name.
-recheck (default:false) means that on each scrape the process names are re-evaluated. This is disabled by default as an optimization, but since processes can choose to change their names, this may result in a process falling into the wrong group if we happen to see it for the first time before it has assumed its proper name.
-procnames is intended as a quick alternative to using a config file. Details in the following section.
To select and group the processes to monitor, either provide command-line arguments or use a YAML configuration file.
The recommended option is to use a config file via -config.path, but for convenience and backwards compatibility the -procnames/-namemapping options exist as an alternative.
The general format of the -config.path YAML file is a top-level process_names section, containing a list of name matchers:
process_names:
  - matcher1
  - matcher2
  ...
  - matcherN
The default config shipped with the deb/rpm packages is:
process_names:
  - name: "{{.Comm}}"
    cmdline:
    - '.+'
A process may only belong to one group: even if multiple items would match, the first one listed in the file wins.
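The first-match-wins rule can be sketched as a linear scan over the entries in file order. The matcher type below is hypothetical and only tests comm; real entries may combine several selectors, all of which must match.

```go
package main

import "fmt"

// matcher is a hypothetical stand-in for one process_names entry:
// a group name plus a predicate over the process's comm.
type matcher struct {
	name  string
	match func(comm string) bool
}

// assign returns the name of the first matcher that accepts comm,
// mirroring the rule that the first entry listed in the file wins.
func assign(matchers []matcher, comm string) (string, bool) {
	for _, m := range matchers {
		if m.match(comm) {
			return m.name, true
		}
	}
	return "", false
}

func main() {
	matchers := []matcher{
		{"shells", func(c string) bool { return c == "bash" || c == "zsh" }},
		{"everything", func(string) bool { return true }}, // catch-all, like '.+'
	}
	fmt.Println(assign(matchers, "bash")) // shells true
	fmt.Println(assign(matchers, "vim"))  // everything true
}
```

Because the scan stops at the first hit, a catch-all entry like the default '.+' config belongs at the end of the list.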
(Side note: to avoid confusion with the cmdline YAML element, we'll refer to the command-line arguments of a process, /proc/<pid>/cmdline, as the array argv[].)
Each item in process_names gives a recipe for identifying and naming processes. The optional name tag defines a template to use to name matching processes; if not specified, name defaults to {{.ExeBase}}.
Template variables available:
{{.Comm}} contains the basename of the original executable, i.e. the 2nd field in /proc/<pid>/stat
{{.ExeBase}} contains the basename of the executable
{{.ExeFull}} contains the fully qualified path of the executable
{{.Username}} contains the username of the effective user
{{.Matches}} map contains all the matches resulting from applying cmdline regexps
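Names are Go templates, so their behaviour can be tried out with the standard text/template package. The procInfo struct below is hypothetical: it simply exposes fields with the documented names, and process-exporter's own type may differ.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// procInfo is a hypothetical struct exposing the documented template variables.
type procInfo struct {
	Comm, ExeBase, ExeFull, Username string
	Matches                          map[string]string
}

// groupName renders a name template such as "{{.Comm}}" against p.
func groupName(tmplText string, p procInfo) (string, error) {
	t, err := template.New("name").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	if err := t.Execute(&b, p); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	p := procInfo{
		Comm:    "process-exporte", // comm is truncated to 15 chars
		ExeBase: "process-exporter",
		ExeFull: "/usr/local/bin/process-exporter",
		Matches: map[string]string{"Cfgfile": "/etc/pe.yml"},
	}
	// Map keys such as Cfgfile can be accessed with field syntax in templates.
	name, err := groupName("{{.ExeFull}}:{{.Matches.Cfgfile}}", p)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(name) // /usr/local/bin/process-exporter:/etc/pe.yml
}
```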
Each item in process_names must contain one or more selectors (comm, exe or cmdline); if more than one selector is present, they must all match. Each selector is a list of strings to match against a process's comm or argv[0], or, in the case of cmdline, a list of regexps to apply to the command line. The cmdline regexps use Go syntax.
For comm and exe, the list of strings is an OR, meaning any process matching any of the strings will be added to the item's group. For cmdline, the list of regexes is an AND, meaning they all must match. Any capturing groups in a regexp must use the ?P<name> option to assign a name to the capture, which is used to populate .Matches.
Performance tip: give an exe or comm clause in addition to any cmdline clause, so you avoid executing the regexp when the executable name doesn't match.
process_names:
  # comm is the second field of /proc/<pid>/stat minus parens.
  # It is the base executable name, truncated at 15 chars.
  # Programs rewrite it less often than argv[0] (exe), though they can (e.g. via prctl(PR_SET_NAME)).
  - comm:
    - bash
  # exe is argv[0]. If no slashes, only basename of argv[0] need match.
  # If exe contains slashes, argv[0] must match exactly.
  - exe:
    - postgres
    - /usr/local/bin/prometheus
  # cmdline is a list of regexps applied to argv.
  # Each must match, and any captures are added to the .Matches map.
  - name: "{{.ExeFull}}:{{.Matches.Cfgfile}}"
    exe:
    - /usr/local/bin/process-exporter
    cmdline:
    - -config.path\s+(?P<Cfgfile>\S+)
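The exe rule from the comments in this example (basename comparison when the pattern has no slash, exact comparison otherwise) can be sketched as a small function. This is an illustration of the stated rule, not process-exporter's code.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// exeMatches applies the documented rule: a pattern without slashes is
// compared with the basename of argv[0]; a pattern containing a slash
// must equal argv[0] exactly.
func exeMatches(pattern, argv0 string) bool {
	if strings.Contains(pattern, "/") {
		return pattern == argv0
	}
	return pattern == path.Base(argv0)
}

func main() {
	fmt.Println(exeMatches("postgres", "/usr/lib/postgresql/bin/postgres"))         // true
	fmt.Println(exeMatches("/usr/local/bin/prometheus", "prometheus"))                // false
	fmt.Println(exeMatches("/usr/local/bin/prometheus", "/usr/local/bin/prometheus")) // true
}
```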
Here's the config I use on my home machine:
process_names:
  - comm:
    - chromium-browse
    - bash
    - prometheus
    - gvim
  - exe:
    - /sbin/upstart
    cmdline:
    - --user
    name: upstart:-user
Every name in the procnames list becomes a process group. The default name of a process is the value found in the second field of /proc/<pid>/stat ("comm"), which is truncated at 15 chars. Usually this is the same as the name of the executable.
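Extracting comm from /proc/<pid>/stat needs a little care, because the field is wrapped in parentheses and may itself contain spaces or ')'. A minimal sketch that takes everything between the first '(' and the last ')':

```go
package main

import (
	"fmt"
	"strings"
)

// parseComm extracts the comm field (field 2) from the contents of
// /proc/<pid>/stat. Splitting on spaces would break on names like
// "tmux: server", so this uses the outermost parentheses instead.
func parseComm(stat string) (string, error) {
	lo := strings.IndexByte(stat, '(')
	hi := strings.LastIndexByte(stat, ')')
	if lo < 0 || hi < lo {
		return "", fmt.Errorf("malformed stat line: %q", stat)
	}
	return stat[lo+1 : hi], nil
}

func main() {
	comm, err := parseComm("4242 (tmux: server) S 1 4242 4242 0 -1")
	fmt.Println(comm, err) // tmux: server <nil>
}
```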
If -namemapping isn't provided, every process with a comm value present in -procnames is assigned to a group based on that name, and any other processes are ignored.
The -namemapping option is a comma-separated list of alternating name,regexp values. It allows assigning a name to a process based on a combination of the process name and command line. For example, using
-namemapping "python2,([^/]+)\.py,java,-jar\s+([^/]+)\.jar"
will make it so that each different python2 and java -jar invocation will be tracked with distinct metrics. Processes whose remapped name is absent from the procnames list will be ignored. On an Ubuntu Xenial machine used as a workstation, here's a good way of tracking resource usage for a few different key user apps:
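A minimal sketch of parsing a -namemapping value into name/regexp pairs and applying one of them. It splits naively on commas, so a regexp containing a comma (such as a{1,3}) would be mis-parsed; how process-exporter then composes the final group name from the capture is not shown here. Note also that [^/]+ cannot match a slash, so the java pattern only matches jar arguments given without a directory.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// parseNameMapping splits a -namemapping value into name -> regexp pairs.
// Naive comma splitting: regexps must not contain commas.
func parseNameMapping(s string) (map[string]*regexp.Regexp, error) {
	parts := strings.Split(s, ",")
	if len(parts)%2 != 0 {
		return nil, fmt.Errorf("odd number of fields in %q", s)
	}
	m := make(map[string]*regexp.Regexp, len(parts)/2)
	for i := 0; i < len(parts); i += 2 {
		re, err := regexp.Compile(parts[i+1])
		if err != nil {
			return nil, fmt.Errorf("bad regexp for %s: %w", parts[i], err)
		}
		m[parts[i]] = re
	}
	return m, nil
}

func main() {
	m, err := parseNameMapping(`python2,([^/]+)\.py,java,-jar\s+([^/]+)\.jar`)
	if err != nil {
		panic(err)
	}
	// The first capture distinguishes invocations of the same executable.
	if sub := m["java"].FindStringSubmatch("java -jar server.jar"); sub != nil {
		fmt.Println(sub[1]) // server
	}
}
```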
process-exporter -namemapping "upstart,(--user)" \
  -procnames chromium-browse,bash,gvim,prometheus,process-exporter,upstart:-user
Since upstart --user is the parent process of the X11 session, this will make all apps started by the user fall into the group named "upstart:-user", unless they're one of the others named explicitly with -procnames, like gvim.
There's no meaningful way to write a naming rule that is guaranteed to only ever match a single process, so process-exporter assumes that every metric will be attached to a group of processes: not a process group in the technical sense, just one or more processes that meet a configuration's specification of what should be monitored and how to name it.
All these metrics start with namedprocess_namegroup_ and have at minimum the label groupname.
Number of processes in this group.
CPU usage based on /proc/[pid]/stat fields utime(14) and stime(15) i.e. user and system time. This is similar to the node_exporter's node_cpu_seconds_total.
Bytes read based on /proc/[pid]/io field read_bytes. The man page says "Attempt to count the number of bytes which this process really did cause to be fetched from the storage layer. This is accurate for block-backed filesystems." but I would take it with a grain of salt.
Bytes written based on /proc/[pid]/io field write_bytes. As with read_bytes, somewhat dubious. May be useful for isolating which processes are doing the most I/O, but probably not for measuring how much I/O is actually happening.
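/proc/[pid]/io is a list of "key: value" lines, so pulling out read_bytes and write_bytes is a short scan. A sketch operating on the file's contents; per proc(5), access to this file is governed by a ptrace access-mode check, so reading other users' processes usually needs extra privileges.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseIO scans the "key: value" lines of /proc/<pid>/io and returns the
// read_bytes and write_bytes counters. Unparseable lines are skipped.
func parseIO(contents string) (readBytes, writeBytes uint64, err error) {
	sc := bufio.NewScanner(strings.NewReader(contents))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue
		}
		n, perr := strconv.ParseUint(strings.TrimSpace(val), 10, 64)
		if perr != nil {
			continue
		}
		switch key {
		case "read_bytes":
			readBytes = n
		case "write_bytes":
			writeBytes = n
		}
	}
	return readBytes, writeBytes, sc.Err()
}

func main() {
	sample := "rchar: 4096\nwchar: 1024\nread_bytes: 8192\nwrite_bytes: 512\ncancelled_write_bytes: 0\n"
	r, w, err := parseIO(sample)
	fmt.Println(r, w, err) // 8192 512 <nil>
}
```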
Number of major page faults based on /proc/[pid]/stat field majflt(12).
Number of minor page faults based on /proc/[pid]/stat field minflt(10).
Number of context switches based on /proc/[pid]/status fields voluntary_ctxt_switches and nonvoluntary_ctxt_switches. The extra label ctxswitchtype can have two values: voluntary and nonvoluntary.
Number of bytes of memory used. The extra label memtype can have three values:
resident: Field rss(24) from /proc/[pid]/stat, whose doc says: "This is just the pages which count toward text, data, or stack space. This does not include pages which have not been demand-loaded in, or which are swapped out."
virtual: Field vsize(23) from /proc/[pid]/stat, virtual memory size.
swapped: Field VmSwap from /proc/[pid]/status, translated from kB to bytes.
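The three memtype sources use different units: per proc(5), vsize is already in bytes, rss is a count of pages, and VmSwap is reported in kB. A sketch of the conversion to bytes, taking the page size as a parameter:

```go
package main

import (
	"fmt"
	"os"
)

// memoryBytes converts the raw values behind the memtype label into bytes:
// vsize(23) is already in bytes, rss(24) is a page count, and VmSwap is in
// kB (1024-byte units).
func memoryBytes(vsize, rssPages, vmSwapKB uint64, pageSize int) (virtual, resident, swapped uint64) {
	return vsize, rssPages * uint64(pageSize), vmSwapKB * 1024
}

func main() {
	// os.Getpagesize returns the page size of the running system (commonly 4096).
	virtual, resident, swapped := memoryBytes(1<<30, 2500, 128, os.Getpagesize())
	fmt.Println("virtual:", virtual, "resident:", resident, "swapped:", swapped)
}
```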
Number of file descriptors, based on counting how many entries are in the directory /proc/[pid]/fd.
Worst ratio of open filedescs to filedesc limit, amongst all the procs in the group. The limit is the fd soft limit based on /proc/[pid]/limits.
Normally Prometheus metrics ought to be as "basic" as possible (i.e. the raw values rather than a derived ratio), but we use a ratio here because nothing else makes sense. Suppose there are 10 procs in a given group, each with a soft limit of 4096, and one of them has 4000 open fds while the others all have 40. Their total fd count is 4360 and their total soft limit is 40960, so the ratio is roughly 1:10 (about 0.11), but in fact one of the procs is about to run out of fds. With worst_fd_ratio we're able to know this: in the above example it would be about 0.98 (4000/4096), rather than the roughly 0.11 you'd see if you computed sum(open_filedesc) / sum(limit_filedesc).
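The arithmetic of this example can be reproduced directly. The sketch below computes both the pooled ratio and the per-process worst ratio for the ten hypothetical procs described above.

```go
package main

import "fmt"

// proc holds the two per-process values the ratio needs.
type proc struct{ openFDs, softLimit float64 }

// worstFDRatio returns the largest openFDs/softLimit among procs,
// following the description of worst_fd_ratio.
func worstFDRatio(procs []proc) float64 {
	worst := 0.0
	for _, p := range procs {
		if p.softLimit > 0 && p.openFDs/p.softLimit > worst {
			worst = p.openFDs / p.softLimit
		}
	}
	return worst
}

// pooledRatio is the misleading alternative: total open fds over total limit.
func pooledRatio(procs []proc) float64 {
	var open, limit float64
	for _, p := range procs {
		open += p.openFDs
		limit += p.softLimit
	}
	if limit == 0 {
		return 0
	}
	return open / limit
}

func main() {
	// One proc with 4000 fds and nine with 40, all with a soft limit of 4096.
	procs := []proc{{4000, 4096}}
	for i := 0; i < 9; i++ {
		procs = append(procs, proc{40, 4096})
	}
	fmt.Printf("pooled %.4f, worst %.4f\n", pooledRatio(procs), worstFDRatio(procs))
	// pooled 0.1064, worst 0.9766
}
```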
Epoch time (seconds since 1970/1/1) at which the oldest process in the group started. This is derived from field starttime(22) from /proc/[pid]/stat, which counts clock ticks since boot; it is converted to seconds and added to the boot time to make it relative to the epoch.
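A sketch of the conversion, assuming a clock tick rate of 100 per second. That is the usual value of sysconf(_SC_CLK_TCK) on Linux, but it is hard-coded here for simplicity; a real implementation should query it. The boot time would come from the btime line of /proc/stat; the value below is hypothetical.

```go
package main

import "fmt"

// clkTck is an assumption: sysconf(_SC_CLK_TCK) is 100 on most Linux systems.
const clkTck = 100

// startTimeSeconds converts starttime(22), in clock ticks since boot, into
// seconds since the Unix epoch, given the boot time in epoch seconds.
func startTimeSeconds(bootTime, starttimeTicks uint64) float64 {
	return float64(bootTime) + float64(starttimeTicks)/clkTck
}

// oldestStart returns the earliest start time among a group's processes,
// or 0 for an empty group.
func oldestStart(starts []float64) float64 {
	if len(starts) == 0 {
		return 0
	}
	earliest := starts[0]
	for _, s := range starts[1:] {
		if s < earliest {
			earliest = s
		}
	}
	return earliest
}

func main() {
	const btime = 1700000000 // hypothetical boot time
	starts := []float64{
		startTimeSeconds(btime, 12345), // started ~123 s after boot
		startTimeSeconds(btime, 500),   // started 5 s after boot
	}
	fmt.Printf("%.2f\n", oldestStart(starts)) // 1700000005.00
}
```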
Sum of the number of threads of all processes in the group. Based on field num_threads(20) from /proc/[pid]/stat.
Number of threads in the group in each of various states, based on the field state(3) from /proc/[pid]/stat. The extra label state can have these values: Running, Sleeping, Waiting, Zombie, Other.
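A sketch of bucketing the one-letter state field into these labels. The letter-to-label mapping is an assumption (R running, S interruptible sleep, D uninterruptible "disk" sleep, Z zombie, anything else Other); process-exporter's exact mapping may differ.

```go
package main

import "fmt"

// stateLabel maps the state(3) letter to a label value. The mapping is an
// assumption made for this sketch.
func stateLabel(state byte) string {
	switch state {
	case 'R':
		return "Running"
	case 'S':
		return "Sleeping"
	case 'D':
		return "Waiting"
	case 'Z':
		return "Zombie"
	default:
		return "Other"
	}
}

func main() {
	counts := map[string]int{}
	for _, s := range []byte("RSSDZT") { // one letter per thread
		counts[stateLabel(s)]++
	}
	fmt.Println(counts) // map[Other:1 Running:1 Sleeping:2 Waiting:1 Zombie:1]
}
```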
Since publishing thread metrics adds a lot of overhead, these metrics are disabled by default. Use the -threads command-line argument to enable them.
All these metrics start with namedprocess_namegroup_ and have at minimum the labels groupname and threadname. threadname is field comm(2) from /proc/[pid]/task/[tid]/stat, i.e. the thread's name. Just as groupname breaks the set of processes down into groups, threadname breaks a given process group down into subgroups.
Number of threads in this thread subgroup.
Same as cpu_user_seconds_total and cpu_system_seconds_total, but broken down per-thread subgroup. Unlike cpu_user_seconds_total/cpu_system_seconds_total, the label cpumode is used to distinguish between user and system time.
Same as read_bytes_total and write_bytes_total, but broken down per-thread subgroup. Unlike read_bytes_total/write_bytes_total, the label iomode is used to distinguish between read and write bytes.
Same as major_page_faults_total, but broken down per-thread subgroup.
Same as minor_page_faults_total, but broken down per-thread subgroup.
Same as context_switches_total, but broken down per-thread subgroup.
process-exporter will consume CPU in proportion to the number of processes in the system and the rate at which new ones are created. The most expensive parts (applying regexps and executing templates) are only applied once per process seen, unless the command-line option -recheck is provided.
If you have mostly long-running processes, process-exporter overhead should be minimal: each time a scrape occurs, it will parse /proc/$pid/stat and /proc/$pid/cmdline for every process being monitored and add a few numbers.
An example Grafana dashboard to view the metrics is available at https://grafana.net/dashboards/249
Install dep, then:
dep ensure
make