Monitor for essential NTP metrics
Python Shell Makefile
Switch branches/tags
Nothing to show
Clone or download
Latest commit abc0387 Aug 10, 2018
Permalink
Failed to load latest commit information.
misc Update statsplot scripts Apr 12, 2018
reactive - mkdir ntpmon's parent (/opt) before rsync, if it doesn't exist Aug 10, 2018
src
testdata Clean up top-level directory Mar 13, 2017
unit_tests Split syncpeer and discard into individual types Nov 1, 2016
.gitignore Making a start at a Nagios check version of ntpmon Mar 17, 2015
COPYING.txt ckp Feb 27, 2010
Makefile Code now mirrored by Launchpad; no need for Makefile target Apr 14, 2017
README.txt
layer.yaml First cut at juju charms reactive layer Mar 13, 2017

README.txt

NTPmon
by Paul Gear <github@libertysys.com.au>
Copyright (c) 2015-2017 Paul D. Gear <http://libertysys.com.au/>

License: GPLv3 - see COPYING.txt for details


Introduction
------------

NTPmon is a program which is designed to report on essential health metrics
for NTP.  It provides a Nagios check which can be used with many alerting
systems, including support for Nagios performance data.  NTPmon can also run
as a daemon for sending metrics to collectd or telegraf.


Prerequisites
-------------

NTPmon is written in python, and requires python 3.3 or later.  It uses
modules from the standard python library, and also requires the psutil
library, which is available from pypi or your operating system repositories.
NTPmon also requires 'ntpq' and 'ntptrace' from the NTP distribution.

On Ubuntu (and probably other Debian-based Linux distributions), you can
install all the prerequisites by running:

    sudo apt-get install ntp python3-psutil


Metrics
-------

NTPmon alerts on the following metrics of the local NTP server:

sync:
    Does NTP have a sync peer?  If not, return CRITICAL, otherwise return OK.

peers:
    Are there more than the minimum number of peers active?  The NTP
    algorithms require a minimum of 3 peers for accurate clock management; to
    allow for failure or maintenance of one peer at all times, NTPmon returns
    OK for 4 or more configured peers, CRITICAL for 1 or 0, and WARNING for
    2-3.

reach:
    Are the configured peers reliably reachable on the network?  Return
    CRITICAL for less than 50% total reachability of all configured peers;
    return OK for greater than 75% total reachability of all configured peers.

offset:
    Is the clock offset from its sync peer (or other peers, if the sync peer
    is not available) acceptable?  Return CRITICAL for 50 milliseconds or more
    average difference, WARNING for 10 ms or more average difference, and OK
    for anything less.

traceloop:
    Is there a sync loop between the local server and the stratum 1 servers?
    If so, return CRITICAL.  Most public NTP servers do not support tracing,
    so for anything other than a loop (including a timeout), return OK.

In addition, NTPmon retrieves the following metrics directly from the local
NTP server (using 'ntpq -nc readvar'):
    - offset (as 'sysoffset', to distinguish it from 'offset')
    - sys_jitter (as 'sysjitter', for grouping with 'sysoffset')
    - frequency
    - stratum
    - rootdelay
    - rootdisp
See the NTP documentation for the meaning of these metrics:
https://www.eecis.udel.edu/~mills/ntp/html/ntpq.html#system


Changes from previous version
-----------------------------

NTPmon has been rewritten from version 1.0.0 of check_ntpmon.  Changes from
check_ntpmon are:

- Now requires python 3.

- Removed dependency on GNU coreutils.

- Added support for detecting ntptrace loops.

- Added support for Nagios performance data:
  https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/perfdata.html

- Added collectd daemon.

- Added telegraf daemon.

- Removed support for changing thresholds; if the one person on the Internet
  who actually uses this really wants it, I might add it back. :-)


Startup delay
-------------

By default, until ntpd has been running for 512 seconds (the minimum time for
8 polls at 64-second intervals), check_ntpmon will return OK (zero return code).
This is to prevent false positives on startup or for short-lived VMs.  To
ignore this safety precaution, use --run-time with a low number (e.g. 1 sec).


Usage
-----

(To do)


To do
-----

- Better/more documentation.
- Expand unit tests.
- Create installer.