Feature/watch (#809)
* Implement pg_autoctl watch command.

The idea is that our users and customers could have an interactive dashboard
without needing to build one themselves from the watch(1) command and other
utilities.

* Add a --watch option to pg_autoctl show state|events.

* Add libncurses to the Dockerfile dependencies.

* Per review, show logs when failing to contact the monitor.

To enable that, we switch the terminal back to "cooked" mode where we can
read the logs on stderr. When the connection to the monitor could be
established again, we switch back to the "raw" mode with the previous
settings and continue displaying our dashboard there.
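
The cooked/raw switch described above can be illustrated with plain POSIX termios calls. This is only a sketch under assumptions: the commit itself builds on ncurses, and the helper names `make_raw_settings` and `apply_settings`, the particular flags cleared, and the half-second read timeout are hypothetical choices made for the example, not code from this change.

```c
#include <termios.h>
#include <unistd.h>

/*
 * Hypothetical helper: derive "raw" terminal settings from the saved
 * "cooked" ones. The caller keeps the cooked copy around so that it can be
 * restored later, for instance to print logs on stderr.
 */
static struct termios
make_raw_settings(const struct termios *cooked)
{
	struct termios raw = *cooked;

	/* no line buffering, no echo, no signal-generating keys */
	raw.c_lflag &= ~(ICANON | ECHO | ISIG);

	/* assumed timeout: read() gives up after 5 tenths of a second */
	raw.c_cc[VMIN] = 0;
	raw.c_cc[VTIME] = 5;

	return raw;
}

/*
 * Hypothetical helper: apply the given settings to a terminal file
 * descriptor. Returns 0 on success and -1 on error, as tcsetattr() does.
 */
static int
apply_settings(int fd, const struct termios *settings)
{
	return tcsetattr(fd, TCSAFLUSH, settings);
}
```

A caller would keep the structure obtained from `tcgetattr()` as the cooked settings, apply the raw copy while the dashboard is displayed, and re-apply the saved structure whenever log lines on stderr need to be readable.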

Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
DimCitus and JelteF committed Oct 7, 2021
1 parent 87acba1 commit 84576fd
Showing 25 changed files with 2,845 additions and 10 deletions.
5 changes: 4 additions & 1 deletion Dockerfile
@@ -20,6 +20,8 @@ RUN apt-get update \
libxml2-dev \
libxslt1-dev \
libselinux1-dev \
libncurses-dev \
libncurses6 \
make \
openssl \
pipenv \
@@ -84,7 +86,8 @@ RUN apt-get update \
make \
sudo \
tmux \
- watch \
+ watch \
libncurses6 \
lsof \
psutils \
dnsutils \
2 changes: 2 additions & 0 deletions docs/architecture-multi-standby.rst
@@ -85,6 +85,8 @@ following three replication settings:
- Replication quorum
- Candidate priority

.. _number_sync_standbys:

Number Sync Standbys
^^^^^^^^^^^^^^^^^^^^

7 changes: 7 additions & 0 deletions docs/conf.py
@@ -417,6 +417,13 @@ def setup(app):
[author],
1,
),
(
"ref/pg_autoctl_watch",
"pg_autoctl watch",
"pg_autoctl watch",
[author],
1,
),
(
"ref/pg_autoctl_stop",
"pg_autoctl stop",
14 changes: 14 additions & 0 deletions docs/faq.rst
@@ -7,6 +7,20 @@ and your question and its answer might make it to this FAQ.

__ https://github.com/citusdata/pg_auto_failover/issues_

I stopped the primary and no failover is happening for 20s to 30s, why?
-----------------------------------------------------------------------

In order to avoid spurious failovers when the network connectivity is not
stable, pg_auto_failover implements a timeout of 20s before acting on a node
that is known to be unavailable. This timeout comes on top of the delay
between health checks and the retry policy.

See the :ref:`configuration` part for more information about how to set up
the different delays and timeouts that are involved in the decision making.

See also :ref:`pg_autoctl_watch` to have a dashboard that helps with
understanding the system and what's going on at the moment.

The secondary is blocked in the CATCHING_UP state, what should I do?
--------------------------------------------------------------------

9 changes: 9 additions & 0 deletions docs/how-to.rst
@@ -96,6 +96,15 @@ formation with the following command::

$ pg_autoctl show state

The ``pg_autoctl show state`` command outputs the current state of the
system only once. Sometimes it would be nice to have an auto-updated display
like the one provided by common tools such as `watch(1)` or `top(1)`. For
that, the following commands are available (see also
:ref:`pg_autoctl_watch`)::

$ pg_autoctl watch
$ pg_autoctl show state --watch

To analyze what's been happening to get to the current state, it is possible
to review the past events generated by the pg_auto_failover monitor with the
following command::
2 changes: 2 additions & 0 deletions docs/ref/configuration.rst
@@ -1,3 +1,5 @@
.. _configuration:

Configuring pg_auto_failover
============================

1 change: 1 addition & 0 deletions docs/ref/manual.rst
@@ -22,6 +22,7 @@ have their own manual page.
pg_autoctl_perform
pg_autoctl_do
pg_autoctl_run
pg_autoctl_watch
pg_autoctl_stop
pg_autoctl_reload
pg_autoctl_status
1 change: 1 addition & 0 deletions docs/ref/pg_autoctl.rst
@@ -20,6 +20,7 @@ pg_autoctl provides the following commands::
+ set Set a pg_auto_failover node, or formation setting
+ perform Perform an action orchestrated by the monitor
run Run the pg_autoctl service (monitor or keeper)
watch Display a dashboard to watch monitor's events and state
stop signal the pg_autoctl service for it to stop
reload signal the pg_autoctl for it to reload its configuration
status Display the current status of the pg_autoctl service
11 changes: 11 additions & 0 deletions docs/ref/pg_autoctl_show_events.rst
@@ -18,6 +18,7 @@ about state changes of the pg_auto_failover nodes managed by the monitor::
--formation formation to query, defaults to 'default'
--group group to query formation, defaults to all
--count how many events to fetch, defaults to 10
--watch display an auto-updating dashboard
--json output data in the JSON format

Options
@@ -46,6 +47,16 @@ Options

By default only the last 10 events are printed.

--watch

Take control of the terminal and display the current state of the system
and the last events from the monitor. The display is updated automatically
every 500 milliseconds (half a second) and reacts properly to window size
change.

Depending on the terminal window size, a different set of columns is
visible in the state part of the output. See :ref:`pg_autoctl_watch`.

--json

Output JSON formatted data instead of a table formatted list.
13 changes: 12 additions & 1 deletion docs/ref/pg_autoctl_show_state.rst
@@ -18,6 +18,7 @@ registered to the pg_auto_failover monitor::
--formation formation to query, defaults to 'default'
--group group to query formation, defaults to all
--local show local data, do not connect to the monitor
--watch display an auto-updating dashboard
--json output data in the JSON format

Options
@@ -51,14 +52,24 @@ Options

Print the local state information without connecting to the monitor.

--watch

Take control of the terminal and display the current state of the system
and the last events from the monitor. The display is updated automatically
every 500 milliseconds (half a second) and reacts properly to window size
change.

Depending on the terminal window size, a different set of columns is
visible in the state part of the output. See :ref:`pg_autoctl_watch`.

--json

Output JSON formatted data instead of a table formatted list.

Description
-----------

- The ``pg_autoctl show state`` outputs includes the following columns:
+ The ``pg_autoctl show state`` output includes the following columns:

- Name

139 changes: 139 additions & 0 deletions docs/ref/pg_autoctl_watch.rst
@@ -0,0 +1,139 @@
.. _pg_autoctl_watch:

pg_autoctl watch
======================

pg_autoctl watch - Display an auto-updating dashboard

Synopsis
--------

This command outputs the events that the pg_auto_failover monitor records
about state changes of the pg_auto_failover nodes managed by the monitor::

usage: pg_autoctl watch [ --pgdata --formation --group ]

--pgdata path to data directory
--monitor show the monitor uri
--formation formation to query, defaults to 'default'
--group group to query formation, defaults to all
--json output data in the JSON format

Options
-------

--pgdata

Location of the Postgres node being managed locally. Defaults to the
environment variable ``PGDATA``. Use ``--monitor`` to connect to a monitor
from anywhere, rather than the monitor URI used by a local Postgres node
managed with ``pg_autoctl``.

--monitor

Postgres URI used to connect to the monitor. Must use the ``autoctl_node``
username and target the ``pg_auto_failover`` database name. It is possible
to show the Postgres URI from the monitor node using the command
:ref:`pg_autoctl_show_uri`.

--formation

List the events recorded for nodes in the given formation. Defaults to
``default``.

--group

Limit output to a single group in the formation. Defaults to including all
groups registered in the target formation.

Description
-----------

The ``pg_autoctl watch`` output is divided into 3 sections.

The first section is a single header line which includes the name of the
currently selected formation, the formation replication setting
:ref:`number_sync_standbys`, and then in the rightmost position the current
time.

The second section displays one line per node, and each line contains a list
of columns that describe the current state for the node. This list can
include the following columns; which columns are part of the output
depends on the terminal window size. This choice is dynamic and changes if
your terminal window size changes:

- Name

Name of the node.

- Node, or Id

Node information. When the formation has a single group (group zero),
then this column only contains the nodeId.

Only Citus formations allow several groups. When using a Citus formation
the Node column contains the groupId and the nodeId, separated by a
colon, such as ``0:1`` for the first coordinator node.

- Reported Lag, or Lag(R)

Time interval between now and the last known time when a node has
reported to the monitor, using the ``node_active`` protocol.

This value is expected to stay under about 2s, and is known to
increase when either the ``pg_autoctl run`` service is not running, or
when there is a network split.

- Health Lag, or Lag(H)

Time interval between now and the last known time when the monitor could
connect to a node's Postgres instance, via its health check mechanism.

This value is expected to stay under about 6s, and is known to
increase when either the Postgres service is not running on the target
node, or when there is a network split.

- Host:Port

Hostname and port number used to connect to the node.

- TLI: LSN

Timeline identifier (TLI) and Postgres Log Sequence Number (LSN).

The LSN is the current position in the Postgres WAL stream. This is a
hexadecimal number. See `pg_lsn`__ for more information.

__ https://www.postgresql.org/docs/current/datatype-pg-lsn.html

The current `timeline`__ is incremented each time a failover happens, or
when doing Point In Time Recovery. A node can only reach the secondary
state when it is on the same timeline as its primary node.

__ https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-TIMELINES

- Connection

This output field contains two bits of information. First, the Postgres
connection type that the node provides, either ``read-write`` or
``read-only``. Then the mark ``!`` is added when the monitor has failed
to connect to this node, and ``?`` when the monitor didn't connect to
the node yet.

- Reported State

The current FSM state as reported to the monitor by the pg_autoctl
process running on the Postgres node.

- Assigned State

The assigned FSM state on the monitor. When the assigned state is not
the same as the reported state, then the pg_autoctl process running on
the Postgres node might not have retrieved the assigned state yet, or
might still be implementing the FSM transition from the current state to
the assigned state.

The third and last section lists the most recent events that the monitor has
registered; the most recent event is found at the bottom of the screen.

To quit the command hit either the ``F1`` key or the ``q`` key.
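
The width-dependent choice of columns described in this documentation could work along the following lines. This is an illustrative sketch and not code from this commit: the `ColumnLayout` type, the `pick_layout` function, the column groupings, and the width thresholds are all made up for the example. In an ncurses program the width would typically be read from `COLS` or `getmaxyx(stdscr, ...)` and read again after a `KEY_RESIZE` event.

```c
#include <stddef.h>

/*
 * Hypothetical description of one dashboard layout: the minimum terminal
 * width it needs, and the columns it would display.
 */
typedef struct ColumnLayout
{
	int minWidth;           /* terminal columns required (made-up values) */
	const char *columns;    /* human readable list of displayed columns */
} ColumnLayout;

/* Layouts ordered from the most detailed to the most compact one. */
static const ColumnLayout layouts[] = {
	{ 120, "Name, Node, Host:Port, TLI: LSN, Connection, "
		   "Reported State, Assigned State" },
	{ 80, "Name, Node, Lag(R), Lag(H), Reported State, Assigned State" },
	{ 0, "Name, Reported State, Assigned State" }
};

/*
 * Return the most detailed layout that fits in the given width, falling
 * back to the most compact layout when nothing else fits.
 */
static const ColumnLayout *
pick_layout(int terminalWidth)
{
	size_t count = sizeof(layouts) / sizeof(layouts[0]);

	for (size_t i = 0; i < count; i++)
	{
		if (terminalWidth >= layouts[i].minWidth)
		{
			return &layouts[i];
		}
	}

	return &layouts[count - 1];
}
```

Calling `pick_layout` again on every refresh, with the width measured at that moment, is one simple way to get the dynamic behavior the documentation describes.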
1 change: 1 addition & 0 deletions src/bin/pg_autoctl/Makefile
@@ -58,6 +58,7 @@ LIBS += -L $(shell $(PG_CONFIG) --libdir)
LIBS += $(shell $(PG_CONFIG) --ldflags)
LIBS += $(shell $(PG_CONFIG) --libs)
LIBS += -lpq
LIBS += -lncurses

all: $(PG_AUTOCTL) ;

3 changes: 3 additions & 0 deletions src/bin/pg_autoctl/cli_common.h
@@ -127,6 +127,9 @@ extern CommandLine show_settings_command;
extern CommandLine show_file_command;
extern CommandLine show_standby_names_command;

/* cli_watch.c */
extern CommandLine watch_command;

/* cli_systemd.c */
extern CommandLine systemd_cat_service_file_command;

2 changes: 1 addition & 1 deletion src/bin/pg_autoctl/cli_do_tmux.c
@@ -800,7 +800,7 @@ prepare_tmux_script(TmuxOptions *options, PQExpBuffer script)
options->root,
"monitor");
tmux_add_send_keys_command(script,
- "watch -n 0.2 %s show state",
+ "%s watch",
options->binpath);

/* add a window for interactive pg_autoctl commands */
2 changes: 2 additions & 0 deletions src/bin/pg_autoctl/cli_root.c
@@ -93,6 +93,7 @@ CommandLine *root_subcommands_with_debug[] = {
&perform_commands,
&do_commands,
&service_run_command,
&watch_command,
&service_stop_command,
&service_reload_command,
&service_status_command,
@@ -119,6 +120,7 @@ CommandLine *root_subcommands[] = {
&set_commands,
&perform_commands,
&service_run_command,
&watch_command,
&service_stop_command,
&service_reload_command,
&service_status_command,