Global Watch Dog

The concept of global watchdogs is simple. A watchdog is a timer that must be reset periodically; if it times out, that is considered a fault. You can also fault a watchdog early or retire it. To keep things secure yet simple, you need a token to interact with watchdogs that share a given prefix.


To get the command line tool, go to the releases page.

To get the go library, do

go get

To get the python library, do

pip install gwd


Watchdog names must consist of lowercase letters and numbers; underscores (_) and periods (.) are also allowed. Note that some visualization and alerting tools treat periods as group delimiters. For example, these watchdogs


may be rendered as

 - server1:
   - disk_space
   - internet
 - server2:
   - disk_space
   - internet
   - memory

Alerts may also be delivered at the granularity of "dc.server2", rather than as individual alerts for every watchdog.
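The grouping behaviour can be sketched in a few lines of Python. This is only an illustration: the watchdog names are hypothetical ones chosen to match the rendering above, and the validation pattern is an assumption derived from the naming rule, not the service's actual check.

```python
import re
from collections import defaultdict

# Assumed from the naming rule: lowercase letters, digits, '_' and '.'.
NAME_RE = re.compile(r"[a-z0-9_.]+")


def is_valid_name(name):
    """Return True if name uses only the characters allowed for watchdogs."""
    return NAME_RE.fullmatch(name) is not None


def group_by_parent(names):
    """Group watchdog names by everything before the last period."""
    groups = defaultdict(list)
    for name in names:
        parent, _, leaf = name.rpartition(".")
        groups[parent].append(leaf)
    return dict(groups)


# Hypothetical names, chosen to match the rendering shown above.
names = [
    "dc.server1.disk_space",
    "dc.server1.internet",
    "dc.server2.disk_space",
    "dc.server2.internet",
    "dc.server2.memory",
]
groups = group_by_parent(names)
for parent, leaves in groups.items():
    print(parent, leaves)
```

A tool that alerts per group would then raise one alert for "dc.server2" if any of its three leaves is unhealthy.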


To set up the authentication key, put your authentication token (in hex plaintext) in one of these places. These are in decreasing order of priority:

  • $WD_TOKEN environment variable
  • .wd_token file in your current directory
  • .wd_token file in your $HOME directory
  • /etc/wd/token file
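If you are writing your own client, the lookup order above could be reproduced roughly as follows. This is a sketch, not the code the official tools use; the function name `find_token` and its parameters are made up for illustration.

```python
import os
from pathlib import Path


def find_token(env=None, cwd=None, home=None, etc_path="/etc/wd/token"):
    """Return the first token found, searching in decreasing priority."""
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    home = Path.home() if home is None else Path(home)

    # 1. The environment variable wins over every file.
    token = env.get("WD_TOKEN")
    if token:
        return token.strip()

    # 2-4. Then the files, from most specific to most general.
    for candidate in (cwd / ".wd_token", home / ".wd_token", Path(etc_path)):
        if candidate.is_file():
            return candidate.read_text().strip()
    return None
```

Calling `find_token()` with no arguments uses the real environment, current directory and home directory; the parameters exist only to make the function easy to test.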

To create a new token (e.g. for a specific service or machine), simply run

wd auth

import gwd

import ""


To create a new watchdog, or reset an existing one, you need to kick it. You can also specify how long the timeout should be. If not specified, the default is 300 seconds, or 5 minutes.

wd kick
# or with timeout
wd kick 600

import gwd
# or with timeout
gwd.kick("", 600)

import ""
   wd.Kick("", 600)


Sometimes you know something is wrong immediately, and instead of waiting for the timeout to occur, you want to set the watchdog to a failed state right away. To do so, fault the watchdog:

wd fault "I know something is wrong"

import gwd
gwd.fault("", "I know something is wrong")

import ""
   wd.Fault("", "I know something is wrong")


To remove a watchdog, or a group of watchdogs, that is no longer useful, you can retire all watchdogs that begin with a prefix:

wd retire

import gwd

import ""


You can send changes in watchdog status to a Slack integration by using the monitor command. First get a Slack integration URL, and then do

wd monitor slack my/prefix ""

import ""
   id, _ := wd.Monitor("my/prefix", wd.MonitorSlack, "")

This will print out the ID of your monitor. Save this ID because you will need it if you want to delete the monitor later:

wd delmonitor theid

import ""


To see what's broken, you can get the status of all watchdogs with a given prefix

wd status my.

import gwd
stats = gwd.status("my.")

import ""
   stats,_ := wd.Status("my.")

To explain the data, let's look at the bash output:

  $ wd status m.
  STATE NAME     EXPIRE                           REASON
  KGOOD m.test.2 Thu, 01 Dec 2016 10:57:32 -0800  K
  FAULT m.test.3 Thu, 01 Dec 2016 10:52:46 -0800  FAULT:deliberate
  TMOUT m.test.5 Thu, 01 Dec 2016 10:33:24 -0800  K

There are three watchdogs here. The first is in the KGOOD state (kicked and good). The expiry is in the future.

The second has been deliberately faulted, as indicated by the FAULT state, and a reason starting with FAULT:. The final watchdog has timed out. It does not have a reason because there was no deliberate fault, but we can see that the expiry is in the past.

The Go and Python bindings have the same names for the fields, with the same meaning.

Note that this output is a little out of date. The current output has an additional field, "CUMD", which is the cumulative downtime.
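To make the three states concrete, here is a toy in-memory model in Python. It is not the real service and does not use the gwd library; the class and method names are invented for illustration. Two details are assumptions: that kicking a faulted watchdog clears the fault, and that the reason "K" simply marks a watchdog that was kicked.

```python
import time

DEFAULT_TIMEOUT = 300  # seconds, the default timeout for a kick


class WatchdogModel:
    """Toy in-memory model of the watchdog states; not the real service."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._dogs = {}  # name -> (expire_timestamp, fault_reason or None)

    def kick(self, name, timeout=DEFAULT_TIMEOUT):
        # Creating and resetting are the same operation: push the expiry out.
        # Assumption: a kick also clears any earlier fault.
        self._dogs[name] = (self._clock() + timeout, None)

    def fault(self, name, reason):
        expire, _ = self._dogs.get(name, (self._clock(), None))
        self._dogs[name] = (expire, "FAULT:" + reason)

    def retire(self, prefix):
        for name in [n for n in self._dogs if n.startswith(prefix)]:
            del self._dogs[name]

    def status(self, prefix):
        now = self._clock()
        rows = []
        for name in sorted(self._dogs):
            if not name.startswith(prefix):
                continue
            expire, reason = self._dogs[name]
            if reason is not None:
                state = "FAULT"   # deliberately faulted
            elif expire < now:
                state = "TMOUT"   # expiry is in the past
            else:
                state = "KGOOD"   # kicked and good
            rows.append((state, name, expire, reason or "K"))
        return rows
```

The `clock` parameter is there so the model can be driven with a fake clock instead of waiting for real timeouts.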

Tips and tricks

The command line tool has some additional features to make it easier to use in scripts. If you want to parse the output, it is useful to print the status as tab delimited without the header:

$ wd status m. --tabsep --noheader
TMOUT	m.test.2	Thu, 01 Dec 2016 10:57:32 -0800	K
FAULT	m.test.3	Thu, 01 Dec 2016 10:52:46 -0800	FAULT:deliberate
TMOUT	m.test.5	Thu, 01 Dec 2016 10:33:24 -0800	K

This looks similar, but as it is tab separated we can use tools like cut very easily. Let's get a list of the times when watchdogs timed out:

$ wd status m. --tabsep --noheader | grep "^TMOUT" | cut -f 3
Thu, 01 Dec 2016 10:57:32 -0800
Thu, 01 Dec 2016 10:33:24 -0800
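The same tab-separated output is easy to consume from a script in another language. Below is a small Python sketch that parses it and converts the expiry into a datetime. It assumes the four-column layout shown above; since the position of the newer CUMD field is not documented here, any additional trailing columns are kept unparsed. The function name `parse_status` is made up for this example.

```python
from email.utils import parsedate_to_datetime


def parse_status(text):
    """Parse `wd status --tabsep --noheader` output into dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumed layout: STATE, NAME, EXPIRE, REASON, then anything extra.
        state, name, expire, reason, *extra = line.split("\t")
        rows.append({
            "state": state,
            "name": name,
            "expire": parsedate_to_datetime(expire),
            "reason": reason,
            "extra": extra,
        })
    return rows


sample = (
    "TMOUT\tm.test.2\tThu, 01 Dec 2016 10:57:32 -0800\tK\n"
    "FAULT\tm.test.3\tThu, 01 Dec 2016 10:52:46 -0800\tFAULT:deliberate\n"
    "TMOUT\tm.test.5\tThu, 01 Dec 2016 10:33:24 -0800\tK\n"
)
timed_out = [
    row["expire"].isoformat()
    for row in parse_status(sample)
    if row["state"] == "TMOUT"
]
print(timed_out)
```

In practice the `sample` text would come from running the command, for example via `subprocess.run(..., capture_output=True, text=True)`.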

You can also add some color to the output with --color (failed watchdogs will be red), and you can reorder it so that failed watchdogs are listed first (useful if you have lots of them) with --badfirst.

REST interface

If you need support in languages other than bash, Python and Go, it is very easy to interact with the REST interface. There are three server endpoints, and for availability you should attempt to communicate with all three before giving up. The endpoints are geographically distributed, so they should not all fail at the same time.
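The "try all three before failing" advice might look like this in Python. The host names are hypothetical placeholders, because the real endpoint addresses are not listed here, and the helper names are invented for this sketch.

```python
import urllib.request

# Hypothetical placeholders: substitute the real endpoint hosts.
ENDPOINTS = [
    "https://wd-a.example.invalid",
    "https://wd-b.example.invalid",
    "https://wd-c.example.invalid",
]


def http_get(url, timeout=5.0):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def get_with_failover(path, endpoints=ENDPOINTS, fetch=http_get):
    """Try each endpoint in turn; raise only if every one of them fails."""
    errors = []
    for base in endpoints:
        try:
            return fetch(base + path)
        except OSError as exc:  # URLError and socket timeouts are OSErrors
            errors.append((base, exc))
    raise ConnectionError(f"all {len(errors)} endpoints failed: {errors}")
```

Passing `fetch` as a parameter keeps the retry logic separate from the HTTP client, so it can be exercised without network access.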

The URL patterns are:

GET /auth/{prefix}?hmac={hmac}
GET /kick/{name}?timeout={seconds}&hmac={hmac}
GET /fault/{name}?reason={reason}&hmac={hmac}
GET /status/{prefix}?hmac={hmac}&header={0/1}
GET /retire/{prefix}?hmac={hmac}

The only difficult part is calculating the hmac value. It is the SHA-256 of the authentication token (in binary, 32 bytes) with the prefix or name appended to it.
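As a sketch of that calculation in Python: decode the hex token to its 32 raw bytes, append the name or prefix, and hash the result. Sending the digest hex-encoded and encoding the name as UTF-8 are assumptions, as the wire format is not spelled out here. The helper names are invented for this example.

```python
import hashlib
from urllib.parse import quote, urlencode


def compute_hmac(token_hex, name_or_prefix):
    """SHA-256 of the 32 token bytes followed by the name or prefix."""
    token = bytes.fromhex(token_hex)
    if len(token) != 32:
        raise ValueError("expected a 32-byte token (64 hex characters)")
    digest = hashlib.sha256(token + name_or_prefix.encode("utf-8"))
    # Assumption: the digest is sent hex-encoded, like the token itself.
    return digest.hexdigest()


def kick_path(token_hex, name, timeout=300):
    """Build the path and query string for the /kick URL pattern."""
    query = urlencode({
        "timeout": timeout,
        "hmac": compute_hmac(token_hex, name),
    })
    return f"/kick/{quote(name, safe='')}?{query}"
```

The other URL patterns can be built the same way by swapping the path segment and query parameters.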